Jordi Fornt edited this page Oct 1, 2024 · 4 revisions

The SAURIA accelerator is built around an Output Stationary (OS) systolic array. Three independent SRAM memories are used to store the input feature map (ifmap) data, the weights and the partial sums. A novel Data Feeder module provides the OS array with streams of data by performing on-the-fly convolution lowering (im2col algorithm) in order to natively perform convolutions. The Weight Fetcher provides the weight streams in a similar fashion. The Partial Sums Manager (PSM) controls the readout of the results as well as the insertion of initial preload values into the array accumulators. A controller module is used to properly orchestrate all the data movements and processes.
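To make the convolution lowering step concrete, the following is a minimal software sketch of im2col followed by a single matrix multiplication. It is not the Data Feeder's hardware implementation: the function names (`im2col`, `conv2d_via_gemm`, `conv2d_direct`), the `(C, H, W)` tensor layout, and the stride-1, no-padding setting are all assumptions chosen for illustration.

```python
import numpy as np

def im2col(ifmap, kh, kw):
    """Lower a (C, H, W) ifmap into a (C*kh*kw, out_h*out_w) matrix.

    Assumes stride 1 and no padding. Each column holds one receptive field.
    """
    c, h, w = ifmap.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=ifmap.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # Kernel tap (ci, i, j) sees the ifmap as a shifted window.
                cols[row] = ifmap[ci, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols

def conv2d_via_gemm(ifmap, weights):
    """Convolve a (C, H, W) ifmap with (K, C, kh, kw) weights as one GEMM."""
    k, c, kh, kw = weights.shape
    out_h, out_w = ifmap.shape[1] - kh + 1, ifmap.shape[2] - kw + 1
    w_mat = weights.reshape(k, c * kh * kw)   # same (c, i, j) order as im2col rows
    out = w_mat @ im2col(ifmap, kh, kw)       # K x (out_h*out_w)
    return out.reshape(k, out_h, out_w)

def conv2d_direct(ifmap, weights):
    """Reference sliding-window convolution, used only to check the lowering."""
    k, c, kh, kw = weights.shape
    out_h, out_w = ifmap.shape[1] - kh + 1, ifmap.shape[2] - kw + 1
    out = np.zeros((k, out_h, out_w), dtype=np.result_type(ifmap, weights))
    for ko in range(k):
        for y in range(out_h):
            for x in range(out_w):
                out[ko, y, x] = np.sum(ifmap[:, y:y + kh, x:x + kw] * weights[ko])
    return out
```

The lowered matrix duplicates every input pixel that belongs to several receptive fields, which is why generating it on-the-fly avoids storing the expanded copy in memory.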

The ReDMA engine is used to perform data movements from main memory to the internal SRAMs via AXI-4 interfaces. It features a realignment module to correct the memory alignment of the tensor data on-the-fly when transferring it.

A custom control module called the Dataflow Controller manages the DMA transactions and the accelerator computation, using tiling to run convolutions and GEMM operations of arbitrary size. The only restriction is that tile sizes must always be homogeneous (the tensors must be divided into tiles of the same size).
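As a rough software analogy of tiling with homogeneous tiles, the sketch below splits a GEMM into equally sized tiles and rejects shapes that do not divide evenly. The function name `tiled_gemm`, the tile parameters, and the loop order are hypothetical and do not describe the Dataflow Controller's actual scheduling.

```python
import numpy as np

def tiled_gemm(a, b, tile_m, tile_n, tile_k):
    """Compute a @ b one tile at a time, with every tile the same size."""
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError("inner dimensions do not match")
    # Homogeneous tiling (assumed interpretation): each dimension must split
    # into an integer number of equal tiles.
    if m % tile_m or n % tile_n or k % tile_k:
        raise ValueError("tensor dimensions must be multiples of the tile sizes")

    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            # One output tile accumulates while the reduction tiles stream by.
            acc = np.zeros((tile_m, tile_n), dtype=c.dtype)
            for p in range(0, k, tile_k):
                acc += a[i:i + tile_m, p:p + tile_k] @ b[p:p + tile_k, j:j + tile_n]
            c[i:i + tile_m, j:j + tile_n] = acc
    return c
```

In this sketch each `acc` tile plays the role of the partial sums held in the accelerator, and each slice of `a` and `b` stands in for one transfer from main memory.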

Main Idea - OS GEMM-based Systolic Array

Systolic arrays are computing architectures that arrange many simple Processing Elements (PEs) in a regular structure in order to exploit operand reuse. In 2-D systolic arrays, the PEs are distributed in a 2-D mesh, where each PE shares data with its immediate neighbors. Each PE performs a multiply-accumulate (MAC) operation every cycle and passes its data operands to its right and bottom neighbors, so that they are reused throughout their transit inside the array.

In an Output Stationary (OS) dataflow, each PE uses its local memory as an accumulator to aggregate a partial sum at every cycle, and thus the partial MACs (the outputs) are kept stationary. At every cycle, each PE passes the previous activation and weight values to its neighbors.

The equivalent operation to shifting data into this type of array is a general matrix-matrix multiplication (GEMM), in which the size of the output matrix is fixed by the array size, and the two input matrices are arbitrarily wide and tall, respectively. Thus, the OS dataflow maximizes the partial sum reuse by performing as much reduction of the operands as possible.
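The equivalence between the OS dataflow and a GEMM can be checked with a small cycle-level model. The sketch below is a behavioral illustration under stated assumptions, not SAURIA's RTL: activations are assumed to enter from the left edge and move right, weights are assumed to enter from the top edge and move down, and row `i` and column `j` of the inputs are delayed by `i` and `j` cycles so that matching operands meet in each PE. The name `os_systolic_gemm` is hypothetical.

```python
import numpy as np

def os_systolic_gemm(a, b):
    """Simulate an output-stationary array computing a @ b.

    a: (rows, depth) activations, b: (depth, cols) weights.
    The output size (rows, cols) equals the array size; depth is arbitrary.
    Returns the accumulators and the number of simulated cycles.
    """
    rows, depth = a.shape
    depth2, cols = b.shape
    if depth != depth2:
        raise ValueError("inner dimensions do not match")

    dtype = np.result_type(a, b)
    acc = np.zeros((rows, cols), dtype=dtype)    # stationary partial sums
    a_reg = np.zeros((rows, cols), dtype=dtype)  # activation held by each PE
    b_reg = np.zeros((rows, cols), dtype=dtype)  # weight held by each PE

    n_cycles = depth + rows + cols - 2           # stream length plus skew
    for t in range(n_cycles):
        # Every PE takes the operand its left / top neighbor held last cycle.
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Skewed injection at the array edges (zeros outside the stream).
        for i in range(rows):
            k = t - i
            a_reg[i, 0] = a[i, k] if 0 <= k < depth else 0
        for j in range(cols):
            k = t - j
            b_reg[0, j] = b[k, j] if 0 <= k < depth else 0
        # One MAC per PE per cycle; the result never leaves the PE.
        acc += a_reg * b_reg
    return acc, n_cycles
```

With this skew, PE `(i, j)` sees `a[i, k]` and `b[k, j]` together at cycle `k + i + j`, so after the last cycle each accumulator holds one full dot product of length `depth`.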

This GEMM computation primitive can be used as the core engine to implement many Neural Network operations, from Fully Connected (FC) layers to arbitrary convolutions, including dilated (atrous) convolutions.
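For instance, dilation only changes which input elements are gathered during lowering; the multiplication itself is unchanged. The 1-D sketch below illustrates this under simplifying assumptions (single channel, stride 1, no padding), with hypothetical function names.

```python
import numpy as np

def lower_dilated_1d(x, kernel_size, dilation):
    """Build the (kernel_size, out_len) lowered matrix for a 1-D dilated convolution."""
    span = (kernel_size - 1) * dilation + 1   # width of the receptive field
    out_len = len(x) - span + 1
    # Row r holds kernel tap r: the input shifted by r * dilation.
    idx = np.arange(kernel_size)[:, None] * dilation + np.arange(out_len)[None, :]
    return x[idx]

def dilated_conv1d_via_gemm(x, w, dilation):
    """Dilated convolution as a (1 x kernel_size) by (kernel_size x out_len) product."""
    return w @ lower_dilated_1d(x, len(w), dilation)
```

Setting `dilation=1` recovers the ordinary convolution, and an FC layer is the degenerate case in which no lowering is needed at all.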
