NVIDIA reports Volta being only 2x faster than P100 on LSTM training, and 1.7x faster on inference. This is because LSTMs spend comparatively little time in the large matrix multiplications that Volta accelerates in FP16.
Incidentally, NVIDIA DGX Station with 4 Volta GPUs runs on this ASUS X99-E-10G-WS motherboard.
With GraphCore in the news after raising additional funding, its architecture remains largely under wraps as of November 2017. Here is what we know so far:
- It’s a “very large chip”, consisting apparently of “thousand(s)” of “IPU” cores, with lots of on-chip RAM, aimed at TSMC 16nm FinFET
- It is for server/cloud use, both for training and inference
- It is a scalable “graph processor”
- “graph” means TensorFlow-style computation graph.
- The graph describes how to compute output data you need. For example, your graph would specify which input tensors to use, their size (width, height, number of maps), size of output and operations to compute the output (convolve A and B, then apply ReLU to B, then compute dot product of B and C, etc).
- A core is apparently called “IPU”, custom-designed by GraphCore, features “complex instruction set(s) to let compilers be simple”
- Supports “low-precision floating-point”, no double-precision, apparently int32, int16
- Holds entire NN model on-chip to avoid accessing off-chip DRAM. On-chip RAM access is “100x” faster vs. off-chip.
- GraphCore’s board is called “IPU-Appliance”; it plugs into a [server] PC’s PCIe slot and consumes 300 watts (on par with NVIDIA GTX Titan’s 250 W)
- GraphCore software stack supports TensorFlow, standard frameworks (no custom framework to be shipped with it).
- Library source code will be open-sourced.
- Supports supervised learning, unsupervised learning, reinforcement learning
- GraphCore will offer cloud-based version of its software stack
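The computation-graph model described above — declaring which tensors to combine and how, then letting the compiler schedule the work — can be sketched in a few lines of Python. This is a toy evaluator loosely following the example in the text (convolve A with B, simplified here to a matrix multiply, apply ReLU, then a dot product with C); the node names and evaluator are illustrative, not GraphCore’s actual API:

```python
import numpy as np

# A computation graph as a dict: node name -> (op, list of input node names).
graph = {
    "A": ("input", []),
    "B": ("input", []),
    "C": ("input", []),
    "AB": ("matmul", ["A", "B"]),    # "convolve" simplified to matmul
    "relu": ("relu", ["AB"]),
    "out": ("dot", ["relu", "C"]),
}

OPS = {
    "matmul": lambda a, b: a @ b,
    "relu": lambda a: np.maximum(a, 0.0),
    "dot": lambda a, b: float(np.sum(a * b)),
}

def evaluate(graph, feeds, node, cache=None):
    """Recursively evaluate `node`, memoizing intermediate results."""
    cache = {} if cache is None else cache
    if node in cache:
        return cache[node]
    op, inputs = graph[node]
    if op == "input":
        value = feeds[node]
    else:
        value = OPS[op](*(evaluate(graph, feeds, i, cache) for i in inputs))
    cache[node] = value
    return value

feeds = {
    "A": np.array([[1.0, -2.0], [0.5, 3.0]]),
    "B": np.array([[2.0, 0.0], [1.0, 1.0]]),
    "C": np.array([[1.0, 1.0], [1.0, 1.0]]),
}
result = evaluate(graph, feeds, "out")  # → 7.0
```

The point of the abstraction is that the same graph, once declared, can be mapped by a compiler onto whatever parallel hardware is underneath — which is presumably what GraphCore’s toolchain does with its IPU cores.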
GraphCore investors so far include Amadeus Capital Partners, Atomico, C4 Ventures, Dell Technologies Capital, Draper Esprit, Foundation Capital, Pitango Venture Capital, Robert Bosch Venture Capital, Samsung Catalyst Fund and Sequoia Capital. GraphCore is headquartered in Bristol UK.
HP Enterprise teaser on analog memristor-array-based dot-product accelerator.
- Called “Dot Product Engine”
- “linear algebra in analog domain … exploiting Ohm’s law on a memristor array”
- an inference vector, matrix dot product/math accelerator
- for high-performance computing applications
- can be used for “DNN, CNN, RNN, … possibly FFT, DCT, convolution”
Seeing is believing.
A nice compilation of AI market stats and predictions, including hardware AI accelerators:
- McKinsey & Co survey: total investments in AI development grew 3x from 2013 to 2016, with tech giants investing $20 – $30 B
- Linley projects data-center-oriented AI accelerator market reaching $12 B by 2022
- Linley estimates 1.7 B machine-learning client devices by 2022
- Tractica estimates AI-driven consumer services growing from $1.9 B in 2016 to $2.7 B by end of 2017.
- Tractica estimates the entire AI market – including hardware, software, services – reaching $42.1 B by 2025
- Transparency Market Research predicts machine-learning-as-a-service growing from $1.07 B in 2016 to $19.9 B by 2025. 73% of ML-as-a-service is currently owned by Amazon, IBM and Microsoft
Google’s Pixel Visual Core (PVC) architecture and its utility for machine learning – a summary of what we know as of today:
- it ships in the Pixel 2 smartphone but is currently disabled, awaiting a firmware upgrade from Android Oreo 8.0 to Oreo 8.1
- PVC consists of 8 Image Processing Unit (IPU) cores
- Each IPU core has 512 ALUs
- IPU cores are custom-designed by Google
- PVC totals 3+ TOps/sec “on a mobile power budget”
- PVC also has MIPI (apparently for image sensor connection), A53 ARM core, PCIe block and LPDDR4 interface
- Will have TensorFlow and Halide software support
<speculation on> Perhaps this is an ALU MAC array of sorts, perhaps with 8 bit multiplies, >8-bit intermediate result adders or accumulators and attention paid to reduce off-chip DRAM access </speculation>
ARM’s Cortex-A75 and A55 feature a few interesting upgrades to the NEON SIMD engine:
- FP16 support – operating on FP16 natively, without first converting to FP32 as older architectures did
- a single instruction computing int8 dot product – potentially 4x faster over Cortex-A53
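The int8 dot-product instruction (SDOT in ARMv8.2) multiplies four pairs of signed 8-bit values and accumulates the sum into one int32 lane in a single step. A NumPy sketch of the semantics (not actual ARM intrinsics):

```python
import numpy as np

def sdot_lane(acc, a, b):
    """One SDOT lane: acc += dot product of four signed int8 pairs.
    Products are widened to int32 before summation, so they cannot
    overflow the way an int8 or int16 intermediate could."""
    a = np.asarray(a, dtype=np.int8)
    b = np.asarray(b, dtype=np.int8)
    return int(acc) + int(np.sum(a.astype(np.int32) * b.astype(np.int32)))

# Extreme int8 values whose individual products exceed the int16 range:
acc = sdot_lane(0, [127, -128, 100, -100], [127, 127, -100, -100])
# → -127  (16129 - 16256 - 10000 + 10000)
```

Doing this in one instruction per lane, rather than widen–multiply–pairwise-add sequences, is where the claimed ~4x speedup over Cortex-A53 would come from.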
Models that train successfully in FP32 may run into trouble converging when trained in FP16. Fixing these convergence problems requires extra care in both model preparation and choice of hardware architecture.
FP16 offers less precision than FP32, and – importantly – FP16 lacks the range to express very small and very large values. As a result, very small FP32 values become zeroes when cast to FP16. This can break model convergence when training in FP16.
To fix the convergence problem, one can:
- Use FP32 hardware for accumulation (but FP16 for multiplies). The use of these two different precision formats is referred to as “mixed precision training”.
- Artificially scale up the loss before back-propagation – so small gradient values don’t become zeroes in FP16 – then scale the gradients back down before the weight update. NVIDIA and Baidu explain this “loss scaling” tweak in more detail.
- Keep a master copy of the weights in full FP32 precision for applying weight updates – yet use a reduced-precision FP16 copy of those weights for forward and back-propagation
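A minimal NumPy sketch of these fixes working together – loss scaling, FP32 master weights, FP32 accumulation – on a single hand-crafted gradient step. The scale factor, learning rate and weight values are illustrative, not from any real recipe:

```python
import numpy as np

LOSS_SCALE = 1024.0  # illustrative power-of-two scale factor

# FP32 master copy of the weights; an FP16 copy is used for fwd/back-prop.
master_w = np.array([0.1, -0.2, 0.3], dtype=np.float32)
w_fp16 = master_w.astype(np.float16)

# A tiny gradient value, too small to represent in FP16 directly
# (FP16's smallest subnormal is ~5.96e-8):
true_grad = np.float32(1e-8)
assert np.float16(true_grad) == 0.0      # underflows to zero in FP16

# With loss scaling, back-prop effectively computes scale * grad,
# which survives the cast to FP16:
scaled_grad_fp16 = np.float16(LOSS_SCALE * true_grad)
assert scaled_grad_fp16 != 0.0

# Un-scale in FP32 and apply the update to the FP32 master weights
# (accumulation in the wider format, as in mixed-precision hardware):
grad_fp32 = scaled_grad_fp16.astype(np.float32) / np.float32(LOSS_SCALE)
lr = np.float32(0.1)
master_w = master_w - lr * grad_fp32
w_fp16 = master_w.astype(np.float16)  # refresh the FP16 working copy
```

Without the scaling step, `grad_fp32` would be exactly zero and the weight would never move, which is precisely the convergence failure described above.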
Speaking of Cambricon’s new products, let’s recall Cambricon’s NPU architecture publications.
“An Instruction Set Architecture for Neural Networks”, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
- A vector processor optimized for neural network computation
- Architecture key points:
- There is no vector register file – instead, there are two scratch pad memories (one for vectors, one for matrices) to handle variable-size vectors and matrices
- The two scratch pad memories (64KB vector, 768KB matrix) are both visible to the compiler
- Optimized for neural networks. The authors argue that almost all NN ops are vector or matrix ops, not scalar ops. Vector and matrix ops can be accelerated with data-parallel hardware – therefore the architecture was designed to be data-parallel.
- Instruction set architecture (ISA)
- Load-store (main memory is accessed only via explicit LOAD and STORE instructions). There are load and store instructions for scalars, vectors (VLOAD, VSTORE) and matrices (MLOAD, MSTORE), the latter two taking the vector or matrix size as an argument.
- Loads and stores support scalars and – importantly – variable-length vectors and matrices
- ISA is called Cambricon
- Instructions include
- Matrix: matrix-vector multiply, vector-matrix multiply (to avoid explicit transpose step), element-wise matrix add, subtract, multiply; matrix outer product (two vectors as inputs, outputs one matrix)
- Vector: vector dot product, element-wise vector multiply, add, subtract, divide, logarithm, exponent; random-vector generation; vector maximum, minimum
- Vector max pooling – Vector-Greater-Than-Merge (VGTM) op
- Logical: vector compare, logical and, or, invert, vector-greater-than-merge
- Scalar: elementary arithmetics and transcendental functions, compare
- Load, store and move for matrices, vectors and scalars.
- Control: jump, conditional branch
- The accelerator pipeline
- Starts with instruction handling hardware: [instruction] Fetch -> Decode -> Issue Queue
- Followed by scalar hardware: Scalar Register File -> Scalar Function Unit + Address Generation Unit, computing scalar, vector and matrix addresses to be accessed in L1 cache (scalars only) and corresponding scratch memories. The scalar register file consists of 64 32-bit General-Purpose Registers to hold scalars and addresses.
- Next, addresses buffer in the Memory Queue, waiting until dependencies are resolved (i.e., until the operations a queued instruction depends on have finished)
- Addresses of scalars from the Memory [Address] Queue are dispatched to the L1 cache; addresses of vectors and matrices are dispatched respectively to the Vector and Matrix Function Units. Scalar L1 cache misses apparently result in DMA access to main memory.
- Vector (VFU) and Matrix Function Units (MFU)
- are where the actual number crunching takes place
- Vector Function Unit contains 32 16-bit adders, 32 16-bit multipliers
- “Matrix Function Unit contains 1024 multipliers and 1024 adders, which has been divided into 32 separate computational blocks to avoid excessive wire congestion and power consumption on long-distance data movements”
- connect respectively to vector and matrix scratch memories, each accessed over DMA.
- have their own small scratchpads – “Each computational block is equipped with a separate 24KB scratchpad. The 32 computational blocks are connected through an h-tree bus that serves to broadcast input values to each block and to collect output values from each block.”
- appear to keep shuffling data back-and-forth between the small 24K and large (64KB vector, 768KB matrix) scratchpads
- otherwise, there is not much detail about MFU or VFU
- Matrix scratchpad memory has 4 ports to handle 4 simultaneous accesses. Under the hood, the scratchpad consists of 4 memory banks behind a shared crossbar. Banks 0, 1, 2, 3 store data at access addresses xxx00, xxx01, xxx10, xxx11 respectively. During an access, the lower 2 address bits are decoded to fetch values from the desired bank(s).
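The bank-selection scheme in the last point can be modeled in a few lines of Python – a toy 4-bank interleaved scratchpad where the low 2 address bits pick the bank (the bank depth here is a made-up number, not from the paper):

```python
# Toy model of a 4-bank interleaved scratchpad: the low 2 address bits
# select the bank, the remaining bits select the word within the bank.
NUM_BANKS = 4
WORDS_PER_BANK = 16  # illustrative depth

banks = [[0] * WORDS_PER_BANK for _ in range(NUM_BANKS)]

def write(addr, value):
    bank = addr & 0b11       # low 2 bits pick bank 0..3 (xxx00..xxx11)
    offset = addr >> 2       # remaining bits pick the word within the bank
    banks[bank][offset] = value

def read(addr):
    return banks[addr & 0b11][addr >> 2]

# Four consecutive addresses land in four different banks, so all four
# ports can be served in the same cycle without a bank conflict:
for addr in range(4):
    write(addr, addr * 10)
conflict_free = len({addr & 0b11 for addr in range(4)}) == 4  # True
```

This is the standard reason for interleaving by the low address bits: sequential (streaming) accesses spread evenly across banks, so the 4 ports rarely contend.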
AI Products Group at Intel has released a preprint of “Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks”.
- replaces traditional FP32 with a 16-bit “Flexpoint” format
- Tensor values are int16 acting as mantissas
- A shared exponent value (5 bit) is specified for the entire tensor
- Since there is only one exponent per tensor, multiplication and addition (of individual elements from a pair of tensors) become fixed-point operations
- On the other hand, shared-per-tensor exponent causes dynamic range of values in the tensor to reduce
- To counteract the reduced dynamic range, the shared exponent is “dynamically adjusted [managed] to minimize overflows and maximize available dynamic range”
- The format is verified with AlexNet, a deep residual network (ResNet) and a generative adversarial network (GAN) with no need to tweak model hyper-parameters
Shared-exponent management algorithm, called Autoflex, assumes that ranges of values in the network change sufficiently slowly, such that exponent ranges change slowly as model training proceeds and “exponents can be predicted with high accuracy based on historical trends”. Autoflex adjusts the common exponent up and down as it detects under-flows and over-flows.
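A NumPy sketch of the Flexpoint idea – int16 mantissas plus one shared exponent per tensor. The exponent-picking rule below simply fits the current maximum magnitude; real Autoflex instead predicts the exponent from historical value statistics, as described above:

```python
import numpy as np

def flex_encode(x):
    """Encode an FP32 tensor as int16 mantissas + one shared exponent.

    Simplification: the exponent is chosen so the largest magnitude in
    the tensor fits in int16. Autoflex would predict it from history
    instead, to avoid overflows as values drift during training."""
    max_abs = float(np.max(np.abs(x)))
    # Pick exp so that max_abs / 2**exp falls in [2**14, 2**15):
    exp = int(np.floor(np.log2(max_abs))) - 14 if max_abs > 0 else 0
    mantissas = np.clip(np.round(x / 2.0**exp), -32768, 32767).astype(np.int16)
    return mantissas, exp

def flex_decode(mantissas, exp):
    return mantissas.astype(np.float32) * np.float32(2.0**exp)

x = np.array([0.5, -1.25, 3.0], dtype=np.float32)
m, e = flex_encode(x)       # int16 mantissas, one shared exponent
x_hat = flex_decode(m, e)   # round-trips exactly for these values
```

Element-wise adds and multiplies on `m` are then plain integer ops – only the shared exponents need to be combined per tensor, which is the source of the efficiency claim.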
NVIDIA has open-sourced their “Deep Learning Accelerator” (NVDLA), available on GitHub. It comes as a complete package:
- Synthesizable RTL
- Synthesis scripts
- Verification testbench
- C-model (to be released)
- Linux drivers
Seems like there are no strings attached licensing-wise and patent-grant-wise – anyone can integrate it in a commercial product, sell the product and owe nothing to NVIDIA.
NVIDIA wants to continue NVDLA development in public, via GitHub community contribution.
Architecture-wise, NVDLA appears to be a convolution accelerator:
- Input data streams from memory, via “Memory interface block” and via “Convolution buffer” (4Kb..32Kb) in to “Convolution core”
- The “Convolution core” is a “wide MAC pipeline”
- Followed by “Activation engine”
- Followed by “Pooling engine”
- Followed by “Local response normalization” block
- Followed by “Reshape” block
- and streaming out back to “Memory interface block”
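The fixed-function pipeline above can be mimicked as composed NumPy functions, one per stage, in the order listed. Shapes and parameters here are illustrative (and the convolution core is simplified to 1x1 kernels); this is a functional sketch, not NVDLA’s actual configuration:

```python
import numpy as np

def conv_core(x, w):
    """'Convolution core', simplified to 1x1 convolution: a matmul over
    the channel dimension. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))   # -> (C_out, H, W)

def activation(x):
    return np.maximum(x, 0.0)                    # ReLU

def pooling(x, k=2):
    """k x k max pooling per channel (H, W assumed divisible by k)."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))

def lrn(x, alpha=1e-4, beta=0.75, k=2.0, n=2):
    """Local response normalization across channels (AlexNet-style)."""
    sq, c = x ** 2, x.shape[0]
    denom = np.empty_like(x)
    for i in range(c):
        lo, hi = max(0, i - n // 2), min(c, i + n // 2 + 1)
        denom[i] = (k + alpha * sq[lo:hi].sum(axis=0)) ** beta
    return x / denom

def reshape(x):
    return x.reshape(-1)   # flatten, e.g. for a following FC layer

x = np.random.rand(3, 8, 8).astype(np.float32)   # C_in=3, 8x8 feature map
w = np.random.rand(4, 3).astype(np.float32)      # C_out=4 1x1 kernels
out = reshape(lrn(pooling(activation(conv_core(x, w)))))  # shape (64,)
```

In the hardware, each of these stages is a dedicated block and data streams between them; the function composition above mirrors that dataflow ordering, with the memory interface at both ends.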
The architecture is configurable using RTL synthesis parameters, supports
- Data type choice of Binary, INT4, INT8, INT16, INT32, FP16, FP32, FP64
- Winograd convolution
- Sparse compression for both weights and feature data to reduce memory storage, bandwidth – especially useful for fully-connected layers
- Second memory interface for on-chip buffering to increase bandwidth, reduce latency vs. DRAM access
- Batching, with batch sizes ranging from 1 to 32 samples