NeuPro-M™ redefines high-performance AI (Artificial Intelligence) and ML (Machine Learning) processing for smart edge devices and edge compute with a heterogeneous and secure architecture.

NeuPro-M is a highly scalable, high-performance AI processor architecture with exceptional power efficiency of up to 24 Tera Operations Per Second per Watt (TOPS/Watt).

NeuPro-M provides a major leap in performance, mainly thanks to its heterogeneous coprocessors, which deliver compound parallel processing: first within each internal engine, and second between the engines themselves.

Ranging from 8 Tera Operations Per Second (TOPS) up to 160 TOPS per core, and fully scalable beyond 1,200 TOPS using multi-core instantiations, NeuPro-M covers a wide scope of edge AI compute needs, fitting a broad range of end markets including automotive, surveillance, mobile, consumer, industrial & IoT, and robotics.

With various orthogonal bandwidth reduction mechanisms and a decentralized architecture of management controllers and memory resources, NeuPro-M ensures full utilization of all its coprocessors while maintaining stable and concurrent data tunneling that eliminates bandwidth-limited performance, data congestion, and processing unit starvation. These mechanisms also reduce dependency on the external memory of the SoC into which it is embedded.

NeuPro-M builds on CEVA's industry-leading position and experience in deep neural network applications. Dozens of customers are already deploying CEVA's vision & AI platforms, along with the full CDNN SDK toolchain, in consumer, surveillance, and ADAS products.

NeuPro-M was designed to meet the most stringent safety and quality compliance standards, such as the automotive ISO 26262 ASIL-B functional safety standard and Automotive SPICE (A-SPICE) quality assurance standards, and comes complete with a comprehensive CDNN software stack including:

  • NeuPro-M system architecture planner tool – enables fast and accurate neural network development on NeuPro-M and ensures final product performance
  • Neural network training optimizer tool – allows a further performance boost and bandwidth reduction while still in the neural network domain, in order to fully utilize every NeuPro-M optimized hardware feature
  • CDNN compiler & runtime – compose the most efficient flow scheme within the processor to ensure maximum utilization at minimum bandwidth per use case

The NeuPro-M architecture supports secure access in the form of root of trust, authentication against IP/identity theft, secure boot, and end-to-end data privacy.


NeuPro-M Diagram


The NeuPro-M AI processor family is designed to reduce the high barriers to entry into the AI space in terms of both architecture and software, enabling an optimized and cost-effective standard AI platform that can be utilized for a multitude of AI-based workloads and applications.

  • A self-contained heterogeneous AI/ML architecture that concurrently processes diverse Deep Neural Network (DNN) workloads using a mixed-precision MAC array, Winograd engine, Sparsity engine, Vector Processing Unit (VPU), and weight and data compression
  • Scalable performance of 8 to 1,200 TOPS in a modular multi-engine/multi-core architecture, at both SoC and chiplet levels, for diverse application needs; up to 20 TOPS for a single-engine NPM core and up to 160 TOPS for an octa-engine NPM core
  • 5-15X higher performance and 6X bandwidth reduction vs. the previous NeuPro generation, with overall utilization of more than 90% and outstanding energy efficiency of up to 24 TOPS/Watt
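As a rough illustration of how these headline figures compose, the sketch below derives the per-core and multi-core peak-TOPS numbers quoted on this page, assuming idealized linear scaling across engines and cores (the helper function names are hypothetical and not part of any CEVA tooling):

```python
# Back-of-the-envelope sketch (not CEVA tooling): how the quoted NeuPro-M
# peak-throughput figures compose, assuming linear scaling across engines
# and cores. Per-engine figure taken from the NPM11 spec (up to 20 TOPS).

TOPS_PER_ENGINE = 20  # NPM11: single-engine NPM core, up to 20 TOPS

def core_tops(engines: int) -> int:
    """Peak TOPS for one NPM core with the given engine count (1-8)."""
    assert 1 <= engines <= 8
    return TOPS_PER_ENGINE * engines

def multicore_tops(engines: int, cores: int) -> int:
    """Peak TOPS for a multi-core instantiation."""
    return core_tops(engines) * cores

print(core_tops(1))          # NPM11: 20 TOPS
print(core_tops(8))          # NPM18: 160 TOPS
print(multicore_tops(8, 8))  # 8 x NPM18: 1280 TOPS, i.e. above 1,200 TOPS
```

Real deployments will land below these idealized peaks; the point is only that the 20, 160, and 1,200+ TOPS figures follow from the same per-engine building block.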

Main Features

  • Supports a wide range of activation & weight data types, from 32-bit floating point down to 2-bit Binary Neural Networks (BNN)
  • Unique mixed-precision neural engine MAC array microarchitecture that supports data type diversity with minimal power consumption
  • Out-of-the-box Winograd transform engine, requiring no retraining, that replaces traditional convolution methods while using 4-bit, 8-bit, 12-bit, or 16-bit weights and activations, increasing efficiency by a factor of 2x with <0.5% precision degradation
  • Unstructured Sparsity engine that avoids operations with zero-value weights or activations in every layer throughout inference. Beyond up to 4x higher performance, sparsity also reduces memory bandwidth and power consumption
  • Simultaneous processing on the Vector Processing Unit (VPU), a fully programmable processor for handling any future neural network architectures
  • Lossless real-time weight and data compression/decompression for reduced external memory bandwidth
  • Scalability through per-use-case memory configurations and an inherent single-core, 1-8 multi-engine architecture for diverse processing performance
  • Secure boot and protection of neural network weights/data against identity theft
  • Parallel processing with two degrees of freedom – within each engine and between engines
  • Memory hierarchy architecture that minimizes the power consumption attributed to data transfers to and from external SDRAM, and optimizes overall bandwidth consumption
  • Decentralized management controller architecture, with a local data controller on each engine, to achieve optimized data tunneling for low bandwidth and maximal utilization, as well as an efficient parallel processing scheme
  • Supports next-generation NN architectures such as fully-connected (FC), FC batch, RNN, transformers (self-attention), 3D convolution, and more
  • The NeuPro-M AI processor architecture includes the following processor options:
    • NPM11 – A single NPM engine, with processing power of up to 20 TOPS
    • NPM18 – An Octa NPM engine, with processing power of up to 160 TOPS
  • Matrix Decomposition for up to 10x enhanced performance during network inference
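To make the unstructured-sparsity idea above concrete, here is a minimal NumPy sketch of the general technique (an illustration only, not CEVA's implementation): multiply-accumulate work is skipped wherever a weight or activation is exactly zero, so the result is unchanged while the MAC count shrinks roughly in proportion to the zero fraction.

```python
import numpy as np

# Illustration of unstructured sparsity (assumption: not CEVA's actual
# hardware algorithm). MACs are skipped wherever either operand is zero.

rng = np.random.default_rng(0)
weights = rng.standard_normal(1000)
weights[rng.random(1000) < 0.75] = 0.0   # ~75% unstructured zero weights
acts = rng.standard_normal(1000)         # dense activations

# Dense path: one MAC per element pair.
dense_macs = weights.size
dense_out = float(np.dot(weights, acts))

# Sparse path: only pairs where both operands are non-zero.
nonzero = (weights != 0) & (acts != 0)
sparse_macs = int(nonzero.sum())
sparse_out = float(np.dot(weights[nonzero], acts[nonzero]))

# Skipping zeros is lossless: both paths give the same dot product.
assert np.isclose(dense_out, sparse_out)

print(f"MACs: {dense_macs} dense vs {sparse_macs} sparse, "
      f"~{dense_macs / sparse_macs:.1f}x fewer")
```

At roughly 75% sparsity the sketch skips about three quarters of the MACs, which is the intuition behind the "up to 4x" figure above; a hardware sparsity engine applies the same idea to activations as well, and at wire speed.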

Block Diagram

NeuPro-M Architecture Block Diagram

Resources and Downloads