Cambricon Papers

Speaking of Cambricon’s new products, let’s recall the company’s NPU architecture publications.

“Cambricon: An Instruction Set Architecture for Neural Networks”, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)

  • A vector processor optimized for neural network computation
  • Architecture key points:
    • There is no vector register file – instead, there are two scratchpad memories (one for vectors, one for matrices) to handle variable-size vectors and matrices (see the sketch after this list)
    • The two scratchpad memories (64KB vector, 768KB matrix) are both visible to the compiler
    • Optimized for neural networks. The authors argue that almost all NN operations are vector or matrix operations, not scalar ones, and that vector and matrix operations can be implemented and accelerated with data-parallel hardware – therefore the architecture was designed to be data-parallel.
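To make the no-register-file point concrete, here is a minimal Python sketch (my illustration, not the paper’s microarchitecture): instead of naming one of a fixed set of fixed-width vector registers, a Cambricon-style vector instruction names a base address and an element count inside the on-chip scratchpad, so operand length is limited only by scratchpad capacity. The `Scratchpad` class and the `vector_add` helper are assumed names; only VLOAD and the scratchpad sizes come from the paper.

```python
# Minimal sketch of scratchpad-addressed, variable-length vector operands.
# Illustrative assumption: this models the ISA's idea, not the real hardware.

class Scratchpad:
    """On-chip scratchpad modeled as a flat array of 16-bit values."""
    def __init__(self, size_in_elements):
        self.mem = [0] * size_in_elements

    def read(self, addr, length):
        return self.mem[addr:addr + length]

    def write(self, addr, values):
        self.mem[addr:addr + len(values)] = values

# 64KB vector scratchpad of 16-bit elements -> 32K elements (per the paper's sizes)
vec_spad = Scratchpad(32 * 1024)

def vload(dest_addr, length, main_memory, src_addr):
    """VLOAD: copy `length` elements from main memory into the vector scratchpad."""
    vec_spad.write(dest_addr, main_memory[src_addr:src_addr + length])

def vector_add(out_addr, a_addr, b_addr, length):
    """Element-wise vector add: operands are (address, length), not register numbers."""
    a = vec_spad.read(a_addr, length)
    b = vec_spad.read(b_addr, length)
    vec_spad.write(out_addr, [x + y for x, y in zip(a, b)])

# A 1000-element add is a single instruction; no tiling into fixed-width registers.
dram = list(range(2000))
vload(0, 1000, dram, 0)
vload(1000, 1000, dram, 1000)
vector_add(2000, 0, 1000, 1000)
```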
  • Instruction set architecture (ISA)
    • Load-store (main memory is accessed only via explicit LOAD and STORE instructions). There are load and store instructions for scalars, vectors (VLOAD, VSTORE) and matrices (MLOAD, MSTORE), the latter two taking the vector or matrix size as an argument.
    • Loads and stores thus support scalars and, importantly, variable-length vectors and matrices
    • ISA is called Cambricon
    • Instructions include
      • Matrix: matrix-vector multiply, vector-matrix multiply (to avoid explicit transpose step), element-wise matrix add, subtract, multiply; matrix outer product (two vectors as inputs, outputs one matrix)
      • Vector: vector dot product, element-wise vector multiply, add, subtract, divide, logarithm, exponent; random-vector generation; vector maximum, minimum
      • Vector max pooling, via the Vector-Greater-Than-Merge (VGTM) op (see the sketch after this list)
      • Logical: vector compare, logical and, or, invert, vector-greater-than-merge
      • Scalar: elementary arithmetic and transcendental functions, compare
      • Load, store and move for matrices, vectors and scalars.
      • Control: jump, conditional branch
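As an illustration of how max pooling maps onto VGTM: per the op’s name, VGTM merges two vectors by keeping, for each element, whichever input value is greater, so max pooling becomes a fold of VGTM over a pooling window’s positions. The layout below (one vector per spatial position, holding all channel values) and the helper names are my assumptions; this is a sketch of the idea rather than the accelerator’s exact semantics.

```python
# Sketch: max pooling expressed as repeated Vector-Greater-Than-Merge (VGTM).
# Assumed layout: one vector per spatial position, holding all channel values.

def vgtm(a, b):
    """Vector-Greater-Than-Merge: per element, keep the greater of the two inputs."""
    return [x if x > y else y for x, y in zip(a, b)]

def max_pool_window(window_vectors):
    """Fold VGTM over every position in a pooling window."""
    result = window_vectors[0]
    for vec in window_vectors[1:]:
        result = vgtm(result, vec)
    return result  # per-channel maxima over the window

# 2x2 pooling window, 4 channels: each inner list is one spatial position.
window = [
    [1, 9, 2, 3],
    [4, 5, 6, 7],
    [8, 0, 1, 2],
    [3, 2, 9, 1],
]
print(max_pool_window(window))  # -> [8, 9, 9, 7]
```

With this layout, a single VGTM instruction advances the pooling of every channel at once, which is exactly the kind of data-parallel pattern the architecture targets.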
  • The accelerator pipeline
    • Starts with instruction handling hardware: [instruction] Fetch -> Decode -> Issue Queue
    • Followed by scalar hardware: Scalar Register File -> Scalar Function Unit + Address Generation Unit, which computes the scalar, vector and matrix addresses to be accessed in the L1 cache (scalars only) and the corresponding scratchpad memories. The scalar register file consists of 64 32-bit general-purpose registers holding scalars and addresses.
    • Next, addresses are buffered in the Memory Queue, where each queued instruction waits until the operations it depends on have finished (see the sketch after this list)
    • Scalar addresses from the Memory [Address] Queue are dispatched to the L1 cache; vector and matrix addresses are dispatched to the Vector and Matrix Function Units, respectively. Scalar L1 cache misses apparently result in DMA accesses to main memory.
    • Vector (VFU) and Matrix Function Units (MFU)
      • are where the actual number crunching takes place
        • Vector Function Unit contains 32 16-bit adders, 32 16-bit multipliers
      • “Matrix Function Unit contains 1024 multipliers and 1024 adders, which has been divided into 32 separate computational blocks to avoid excessive wire congestion and power consumption on long-distance data movements”
      • connect respectively to the vector and matrix scratchpad memories, each accessed over DMA.
      • have their own small scratchpads – “Each computational block is equipped with a separate 24KB scratchpad. The 32 computational blocks are connected through an h-tree bus that serves to broadcast input values to each block and to collect output values from each block.”
      • appear to keep shuffling data back and forth between the small 24KB and large (64KB vector, 768KB matrix) scratchpads
      • otherwise, there is not much detail about MFU or VFU
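The paper does not spell out how the Memory Queue tracks dependencies, so here is only a hedged sketch of the behavior described above, assuming a minimal in-order scheme: an instruction dispatches once no earlier in-flight instruction writes a scratchpad range overlapping one of its operand ranges. All structure here (the dict layout, the `(start, length)` ranges) is my illustrative assumption.

```python
# Sketch of the Memory Queue's dependency wait (assumed minimal in-order scheme).

def overlaps(range_a, range_b):
    """True if two (start, length) scratchpad address ranges intersect."""
    a0, a_len = range_a
    b0, b_len = range_b
    return a0 < b0 + b_len and b0 < a0 + a_len

def ready_to_dispatch(instr, in_flight):
    """An instruction may dispatch once no in-flight instruction writes a range
    that overlaps any range this instruction reads or writes."""
    for earlier in in_flight:
        for wr in earlier["writes"]:
            if any(overlaps(wr, r) for r in instr["reads"] + instr["writes"]):
                return False
    return True

# Example: a vector add reading [0, 1000) and [1000, 2000) must wait for the
# VLOAD that is still writing [1000, 2000) into the scratchpad.
vload_b = {"reads": [], "writes": [(1000, 1000)]}
vadd = {"reads": [(0, 1000), (1000, 1000)], "writes": [(2000, 1000)]}
print(ready_to_dispatch(vadd, in_flight=[vload_b]))  # -> False
```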
  • The matrix scratchpad memory has 4 ports to handle 4 accesses simultaneously. Under the hood, it consists of 4 memory banks plugged into a shared crossbar: banks 0, 1, 2 and 3 store the data at access addresses ending in 00, 01, 10 and 11, respectively (xxx00, xxx01, xxx10, xxx11). During an access, the lower 2 address bits are decoded to fetch values from the desired bank(s) – see the sketch below.
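The bank-selection scheme itself is straightforward to model: the low two address bits select one of the four banks and the remaining bits index a row within the bank, so up to four accesses per cycle succeed as long as their addresses land in distinct banks. The conflict-handling policy below is my assumption; the paper describes only the bank selection.

```python
# Sketch of the 4-banked matrix scratchpad: low 2 address bits pick the bank.

NUM_BANKS = 4
banks = [dict() for _ in range(NUM_BANKS)]  # bank index -> {row: value}

def bank_and_row(addr):
    """Decode an address: bits [1:0] select the bank, the rest the row in it."""
    return addr & 0b11, addr >> 2

def write(addr, value):
    bank, row = bank_and_row(addr)
    banks[bank][row] = value

def access_cycle(addrs):
    """Service up to 4 simultaneous reads; addresses in distinct banks succeed,
    while a second access to the same bank conflicts (assumed policy: report it)."""
    used_banks = set()
    results = []
    for addr in addrs:
        bank, row = bank_and_row(addr)
        if bank in used_banks:
            results.append((addr, "bank conflict"))
        else:
            used_banks.add(bank)
            results.append((addr, banks[bank].get(row)))
    return results

for a in range(8):
    write(a, a * 10)
# Addresses 0,1,2,3 end in 00,01,10,11 -> four distinct banks, no conflict.
print(access_cycle([0, 1, 2, 3]))
# Addresses 0 and 4 both end in 00 -> same bank, conflict on the second access.
print(access_cycle([0, 4]))
```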

Second paper:
