Many application developers are still coming to grips with the benefits of machine learning (ML), but one thing is clear: machine learning is here to stay, especially as more processing capability moves to the edge. The lowest-hanging fruit for ML will come from applications that save money, make money, or both. For example, saving money can be accomplished by adding high-performance ML to a vision system that inspects products moving down an assembly line; the faster the inspection, the faster the line can run and the sooner products ship. Making money can be accomplished by adding ML functionality to a product, making it more useful and/or desirable; consider adding face recognition to a doorbell to determine whether friend or foe is at the door. In any case, the best ML solution strikes a balance among performance, energy and price.
NXP's processors span the full range of ML solutions, from MCUs (LPC and i.MX RT) to high-end applications processors (i.MX, Layerscape and S32V for automotive). We recently announced a partnership with Arm® that is expected to take our ML support for MCUs to new levels of performance and energy efficiency. Specifically, this announcement concerned Arm's Ethos-U55, a microNPU (neural processing unit, or ML accelerator) designed to work with Cortex®-M processors, including the Cortex-M33, Cortex-M7 and Cortex-M4.
In this microNPU announcement, NXP was named as a lead partner, although at this time we have not disclosed any MCU implementation details. However, underscoring our commitment to ML acceleration, we recently unveiled the i.MX 8M Plus, our first device with a dedicated NPU. The i.MX 8M Plus contains a dedicated 2.3 TOPS (tera operations per second) VeriSilicon NPU attached to the system bus, whereas the 0.1-0.5 TOPS microNPU is designed as a coprocessor (more on this later). Most of the industry is focused on the highest-performance ML acceleration, going from 2 to 8 to 30 TOPS and beyond, and NXP will follow this path as well. But we also believe it's important to recognize the value of ML acceleration in the low-power domain (sub-1 TOPS), especially as ML functionality is integrated into tiny endpoint sensors and other edge devices.
Common NPU Features to Run a Faster Race
Despite their size and interface differences, the Ethos-U55 and i.MX 8M Plus NPUs have architectural similarities. Both NPUs perform parallel multiply-accumulate (MAC) operations to handle complex matrix math (32-256 MACs/cycle and 1150 MACs/cycle, respectively). Both NPUs also support model compression and weight decompression, which minimizes the use of system memory and reduces the demand on memory bus bandwidth. To further boost performance, both NPUs have DMA engines to read and write data and neural network weights to/from system memory (which could be DRAM, on-chip RAM or flash, depending on the SoC design).
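Why do vendors quote MACs/cycle? Because the inner loop of neural network inference is almost entirely multiply-accumulate work. The short Python sketch below (illustrative only, with made-up sizes) shows how a single convolution output reduces to a chain of MACs:

    import numpy as np

    # Illustrative only: one output pixel of a 3x3 convolution is just a
    # chain of multiply-accumulate (MAC) operations over the kernel weights
    # and the matching input patch.
    def conv_output_pixel(patch, weights):
        acc = 0.0
        for w, x in zip(weights.flat, patch.flat):
            acc += w * x  # one MAC per weight
        return acc

    patch = np.random.rand(3, 3)    # 3x3 input window
    weights = np.random.rand(3, 3)  # 3x3 convolution kernel
    print(conv_output_pixel(patch, weights))  # costs 9 MACs

Scale that up (a 224x224 feature map with 64 such kernels is roughly 29 million MACs for a single layer) and it becomes clear why an engine retiring hundreds or thousands of MACs per cycle, rather than one or two on a CPU, dominates inference time.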
ML software is just as important as the hardware. Through our eIQ® machine learning software development environment, we have enabled the use of TensorFlow Lite across all our devices. Today we even offer TensorFlow Lite support on our i.MX RT devices, including low-level optimizations that significantly increase the performance of some NN models compared to out-of-the-box TensorFlow Lite. But the main point here is the use of a common inferencing approach that makes it easy to port your ML application across devices, whether an i.MX RT crossover MCU or an i.MX 8 applications processor. And this approach continues with the Ethos-U55, using a further slimmed-down version of TensorFlow Lite called TensorFlow Lite for Microcontrollers. This commonality allows users to develop in TensorFlow and then convert to either TensorFlow Lite or TensorFlow Lite Micro format.
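To make this concrete, here is a minimal sketch of that conversion step using TensorFlow's standard converter APIs. The network and the representative dataset are placeholders; the full-integer (int8) quantization settings shown are the kind an integer-only accelerator such as a microNPU typically requires:

    import numpy as np
    import tensorflow as tf

    # Placeholder network; any trained Keras model would take its place.
    model = tf.keras.applications.MobileNetV2(weights=None)

    converter = tf.lite.TFLiteConverter.from_keras_model(model)

    # Representative data calibrates the quantization ranges; random
    # samples stand in for real input data here.
    def representative_data():
        for _ in range(100):
            yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

    # Full-integer post-training quantization.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8

    with open("model_int8.tflite", "wb") as f:
        f.write(converter.convert())

For a TensorFlow Lite Micro target, the resulting .tflite file is then typically embedded in the firmware image as a C array.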
Developers can take their existing TensorFlow Lite models and run them with Arm's modified TensorFlow Lite Micro runtime. The modifications include an offline optimizer that performs automatic graph partitioning, scheduling and optimization. These additions make it easy to run ML on a heterogeneous system, as developers do not have to modify their networks.
As a coprocessor, the Ethos-U55 shares neural network graph processing with the host Cortex-M core. The output of the offline optimizer is a TensorFlow Lite flatbuffer file that is deployed on the target device. The file records which layers of the neural network execute on the Ethos-U55 versus the attached Cortex-M processor. The layers supported by the Ethos-U55 are accelerated on it, and the remaining layers execute on the attached Cortex-M. Layers that run on the Cortex-M processor are accelerated through the CMSIS-NN software library if the corresponding kernel is available; otherwise, the TensorFlow Lite Micro reference kernels are used.
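The fallback hierarchy just described can be summarized in a few lines. This is an illustrative sketch only; the operator sets and the place_layer function are hypothetical stand-ins for the offline optimizer's real bookkeeping:

    # Hypothetical operator sets; the real supported lists come from Arm's
    # offline optimizer and the CMSIS-NN library.
    ETHOS_U55_OPS = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED", "ADD"}
    CMSIS_NN_OPS = {"CONV_2D", "FULLY_CONNECTED", "MAX_POOL_2D", "SOFTMAX"}

    def place_layer(op_name):
        """Decide where one network layer executes."""
        if op_name in ETHOS_U55_OPS:
            return "Ethos-U55 microNPU"       # accelerated on the NPU
        if op_name in CMSIS_NN_OPS:
            return "Cortex-M via CMSIS-NN"    # optimized CPU kernel
        return "Cortex-M via TFLM reference"  # portable fallback kernel

    for op in ("CONV_2D", "SOFTMAX", "TANH"):
        print(op, "->", place_layer(op))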
While this might seem limited, the Ethos-U55 supports the right mix of operators to handle a wide range of popular networks. A side benefit of the coprocessor approach is that it eliminates some redundant circuitry, making the Ethos-U55 small enough to fit into MCU designs [according to Arm, "Ethos-U55 provides up to 90% energy reduction over current Cortex-M CPUs for AI applications in cost-sensitive and energy-constrained devices. Ethos-U55 also consumes an extremely small area, starting at about 0.1mm² in the TSMC 16FFC process."].
The machine learning accelerator in the i.MX 8M Plus and the prospect of Ethos-U55 hardware make NXP a front-runner in the race to both the top and the bottom. Whether it's enabling local voice command processing, natural language processing that recognizes 40,000 words, facial recognition, or running several complex vision algorithms in parallel, you can do these things on many NXP devices today. But the integrated NPUs in NXP processors are expected to deliver the next level of performance, energy and cost benefits to your application, allowing you to win your race to deliver great products.
To explore more of the news, content and training originally planned for Embedded World 2020, please visit NXP's online eXperience.