Kirin 970’s Neural Processing Unit and Cambricon

Huawei has unveiled the Kirin 970, which includes a Neural Processing Unit apparently designed by China’s Cambricon:

  • NPU designed by Cambricon
  • achieves 1.92 TFLOPS at FP16 precision

The Kirin 970 is manufactured on a 10 nm process and integrated by Huawei’s HiSilicon design house.

In the meantime, Cambricon has announced several new “AI processor” IP products:

  • “Cambricon-1H8 focuses on lower power consumption visual application”, up to 2.3x performance-per-watt vs. Cambricon-1A
  • “Cambricon-1H16, has wider application and better performance”
  • “Cambricon-1M is made for intelligent driving”, 10x performance vs. Cambricon-1A
  • “high performance machine learning processor chips Cambricon-MLU100 and Cambricon-MLU200” for servers, plus NeuWare, Cambricon’s AI software development platform

Google Android 8.1 (Oreo) Neural Network API

Google has introduced the Android Neural Networks API (NNAPI) for neural network accelerators, starting with Android version 8.1. The API abstracts the device hardware, allowing app developers to run neural network computations without worrying about the actual underlying hardware implementation.

When the user device (smartphone, tablet, etc.) has no dedicated neural network hardware, the GPU, DSP, or CPU is used as a fallback to carry out the computations.

NNAPI supports inference using pre-trained models.

Running an inference consists of these steps:

  • App code loads a computation graph into the API. The graph precisely specifies the sequence of operations – e.g. convolve layer X with filter Y, apply ReLU activation, and so on
  • App code instructs NNAPI to “compile” the computation graph into lower-level code that runs on the actual underlying hardware
  • App code instructs NNAPI to allocate memory buffers and fills the memory buffers with input data and weights
  • NNAPI runs the computation
  • App code reads out computed output from memory buffers
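The real API is a C interface (NeuralNetworks.h in the Android NDK); the following Python sketch is only a toy analogue of the five steps above, with a made-up two-operation graph (elementwise multiply, then ReLU) standing in for a real model:

```python
# Toy analogue of the NNAPI workflow; graph, kernels, and buffers here
# are illustrative stand-ins, not the real API.

# Step 1: describe the computation graph (multiply by weights, then ReLU).
graph = [
    ("mul_weights",),   # elementwise multiply input by a weight vector
    ("relu",),          # apply ReLU activation
]

# Step 2: "compile" the graph into callables for the underlying hardware.
def compile_graph(graph):
    kernels = {
        "mul_weights": lambda x, w: [xi * wi for xi, wi in zip(x, w)],
        "relu":        lambda x, w: [max(0.0, xi) for xi in x],
    }
    return [kernels[op[0]] for op in graph]

# Step 3: allocate buffers and fill them with input data and weights.
inputs  = [1.0, -2.0, 3.0]
weights = [0.5,  1.0, -1.0]

# Step 4: run the computation, streaming the buffer through each kernel.
buf = inputs
for kernel in compile_graph(graph):
    buf = kernel(buf, weights)

# Step 5: read the computed output back from the buffer.
output = buf
print(output)  # [0.5, 0.0, 0.0]
```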

Apple A11 Bionic’s Neural Engine

Mentioned here and here; practically no architecture details are disclosed.

  • “600 billion operations per second”
  • “Two parallel cores”
  • Used for FaceID, Animoji, ARKit
  • Part of Image Signal Processor
  • Development started 3 years before the A11’s release

A11 Bionic itself has:

  • 64-bit ARM-based SoC, manufactured at TSMC, 10nm FinFET process
  • 6 ARM cores: 2 high-performance + 4 high-efficiency; all six can be used simultaneously
  • On-chip image processor supporting computational photography

Synopsys EV6x Series

A brief Synopsys EV6x Series architecture overview. Marketed as a “Vision Processor”, it combines DSP cores with a 12-bit convolution accelerator:

  • 4 “Vision CPU” cores, each with a 512-bit vector DSP and a 32-bit scalar unit, for a total of 620 GOPS
  • “CNN Engine” – apparently a MAC array, scalable up to 4.5 TMAC/s, with 2 TMAC/s/W power efficiency
  • “100x higher performance on common vision processing tasks” vs. EV5x
  • Included in the Synopsys DesignWare IP library
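Taken together, the two CNN-engine figures imply a peak power draw, assuming both numbers refer to the same operating point (the overview does not say):

```python
# Implied CNN-engine power at peak throughput, derived from the two
# quoted figures; treating them as one operating point is an assumption.
peak_tmacs = 4.5        # TMAC/s, maximum-scaled configuration
efficiency = 2.0        # TMAC/s per watt

implied_watts = peak_tmacs / efficiency
print(implied_watts)    # 2.25 W at full CNN-engine throughput
```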

Tensilica Vision C5 DSP

Cadence (Tensilica) Vision C5 DSP IP core for neural networks; architecture overview here, press release.

A 4-way VLIW, 128-way SIMD processor supporting 8-bit and 16-bit data widths. This is a programmable DSP, not a hard-coded convolution accelerator.

  • 1 TeraMAC (TMAC)/s (apparently 8-bit), <1 mm² silicon area
  • “1024 8-bit MACs or 512 16-bit MACs”
  • “128-way, 8-bit SIMD or 64-way, 16-bit SIMD”
  • 4 cores (4-way)
  • 4x throughput of Vision P6 DSP
  • Separate data banks, instruction RAMs. Instructions are cached.
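As a sanity check, the quoted MAC counts are consistent with the headline 1 TMAC/s at a clock of about 1 GHz; the clock frequency is an assumption, since the overview does not state it:

```python
# Back-of-the-envelope throughput check for the Vision C5 figures.
# The ~1 GHz clock is assumed, not quoted.
macs_per_cycle_8bit  = 1024   # "1024 8-bit MACs"
macs_per_cycle_16bit = 512    # "512 16-bit MACs"
clock_hz = 1.0e9              # assumed ~1 GHz

tmacs_8bit = macs_per_cycle_8bit * clock_hz / 1e12
print(tmacs_8bit)  # 1.024, i.e. ~1 TMAC/s as quoted
```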

Imagination PowerVR 2NX

Imagination Technologies PowerVR 2NX architecture overview

Streaming architecture, apparently 8-bit inference.

  • Weights, activations flow from “DDR” over bus interface to “NN Compute Core/Engine”
  • “NN Compute Core/Engine” looks like a multiplier array
  • Next, multiplication results proceed to an “Accumulation Buffer”
  • Next, summed results pass through Activation/Pool/Normalize/”Element Engine” modules and end up in a “Shared Buffer”
  • Lastly, data from the “Shared Buffer” streams into “Output Formatter” and on via bus interface to DDR
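A toy model of this streaming pipeline, with one function per stage named after the blocks above; the logic is purely illustrative, not the actual hardware behavior:

```python
# Illustrative model of the 2NX-style dataflow: multiply -> accumulate ->
# activation -> output formatting. Stage names follow the diagram.

def nn_compute_core(weights, activations):
    # multiplier array: elementwise products of weights and activations
    return [w * a for w, a in zip(weights, activations)]

def accumulation_buffer(products):
    # sum the partial products into one accumulator
    return sum(products)

def element_engine(acc):
    # activation stage (ReLU chosen as an example)
    return max(0, acc)

def output_formatter(value):
    # clamp to the 8-bit output range before the write back to DDR
    return max(0, min(255, int(value)))

weights     = [2, -1, 3]
activations = [4,  5, 1]
result = output_formatter(element_engine(accumulation_buffer(
    nn_compute_core(weights, activations))))
print(result)  # 6  (2*4 + (-1)*5 + 3*1 = 6)
```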

To save DRAM footprint, bandwidth, and power, the bit width of weights and activations is configurable, up to a maximum of 8 bits.

  • Sub-8-bit values appear to be stored in DDR in packed format to save DDR size and bandwidth.
  • After being fetched from DDR, sub-8-bit weights and activations are padded to the full 8-bit width (e.g. with zeros) before continuing on into the multiplier array.
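To illustrate the packed-storage idea, the sketch below packs two hypothetical 4-bit weights per byte for storage and zero-extends them back to 8 bits after the fetch; the actual 2NX packing format is not documented here:

```python
# Illustrative 4-bit packing scheme (two values per byte); the real
# PowerVR 2NX format is undisclosed.

def pack4(values):
    # pack pairs of 4-bit values (0..15) into single bytes for DDR storage
    out = bytearray()
    for i in range(0, len(values), 2):
        lo = values[i] & 0x0F
        hi = (values[i + 1] & 0x0F) if i + 1 < len(values) else 0
        out.append(lo | (hi << 4))
    return bytes(out)

def unpack4_to_8bit(packed, count):
    # after the fetch, zero-extend each 4-bit value to a full 8-bit byte
    vals = []
    for b in packed:
        vals.append(b & 0x0F)
        vals.append(b >> 4)
    return vals[:count]

w = [1, 15, 7, 2]
packed = pack4(w)                   # 2 bytes in DDR instead of 4
print(unpack4_to_8bit(packed, 4))   # [1, 15, 7, 2]
```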