
Systolic Array

jfornt edited this page Feb 9, 2024 · 1 revision

The Output Stationary (OS) systolic array that lies at the core of the accelerator is built as a 2D mesh of processing elements (PEs): weight values enter the array from the top and activation values from the left, as depicted in the figure below.
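The dataflow described above can be sketched with a small cycle-level model. This is a behavioral illustration in Python, not the accelerator's RTL: the function name `os_systolic_matmul`, the one-cycle skew between neighbouring rows and columns, and the use of Python floats instead of FP16 are all assumptions made for the example.

```python
def os_systolic_matmul(A, W):
    """Cycle-level sketch of an output-stationary array computing A @ W.

    A is M x K (activations): row i is streamed into array row i from the left.
    W is K x N (weights): column j is streamed into array column j from the top.
    PE (i, j) keeps the output element C[i][j] in its accumulator.
    Assumption: each row/column input is delayed by one cycle per index (skew).
    """
    M, K, N = len(A), len(A[0]), len(W[0])
    acc = [[0.0] * N for _ in range(M)]  # one accumulator per PE
    act = [[0.0] * N for _ in range(M)]  # activation register of each PE
    wgt = [[0.0] * N for _ in range(M)]  # weight register of each PE

    # The last product reaches PE (M-1, N-1) at cycle K + M + N - 3.
    for t in range(K + M + N - 2):
        new_act = [[0.0] * N for _ in range(M)]
        new_wgt = [[0.0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:
                    k = t - i  # skewed activation stream at the left edge
                    a = A[i][k] if 0 <= k < K else 0.0
                else:
                    a = act[i][j - 1]  # value held by the left neighbour
                if i == 0:
                    k = t - j  # skewed weight stream at the top edge
                    w = W[k][j] if 0 <= k < K else 0.0
                else:
                    w = wgt[i - 1][j]  # value held by the upper neighbour
                acc[i][j] += a * w  # the partial sum never leaves the PE
                new_act[i][j] = a
                new_wgt[i][j] = w
        act, wgt = new_act, new_wgt
    return acc
```

With this skew, PE (i, j) sees activation `A[i][k]` and weight `W[k][j]` in the same cycle `t = k + i + j`, so every accumulator ends up holding one element of the matrix product.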

A set of registers chained into a shift register (the shift chain) is used to extract the output values and, optionally, to insert preload values into the PE accumulators. The shift chain runs opposite to the activation data flow: its values enter from the right and exit on the left. This aligns memory accesses to the output SRAM with the data extraction (see [Main Controller](Architecture/Main Controller)).
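As a rough illustration of the shift chain, the sketch below models one array row as a Python list in which index 0 is the leftmost register. The helper names `shift_chain_step` and `drain_and_preload` are hypothetical, and the model assumes one value moves per cycle while the scan signal is high.

```python
def shift_chain_step(regs, scan_in, scan_enable):
    """Advance one row of the shift chain by one cycle.

    regs[0] is the leftmost register (exit side) and regs[-1] the rightmost
    (entry side). Returns (new_regs, scan_out); scan_out is None when idle.
    """
    if not scan_enable:
        return list(regs), None  # chain holds its contents
    # Every value moves one position to the left; scan_in enters on the right.
    return list(regs[1:]) + [scan_in], regs[0]


def drain_and_preload(outputs, preloads):
    """Shift a row of finished outputs out while shifting preload values in."""
    regs = list(outputs)
    drained = []
    for value in preloads:
        regs, out = shift_chain_step(regs, value, True)
        drained.append(out)
    return drained, regs
```

For a row of four registers, `drain_and_preload([10, 20, 30, 40], [1, 2, 3, 4])` extracts the four outputs starting with the leftmost one and leaves the four preload values in the chain, so extraction and insertion share the same cycles.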

A set of 1-bit context switch signals is used to trigger a swap between the contents of the accumulator and the reserve register in the PEs. These signals enter the array from the top and propagate with the weight data.

Two signals are broadcast to all PEs. Pipeline enable stalls the whole array when needed: a low value disables all registers of the array. Scan enable controls the shift chain of output and preload values.

Processing Element

The PE architecture proposed in this work builds upon the basic OS PE circuit and adds several elements to improve performance and energy efficiency.

The core of the processing element is its FP16 multiply-add unit, which performs the MAC operation itself. To implement the FP16 logic, we use a simplified version of the FPnew floating-point unit from the PULP open-source platform (see https://github.com/pulp-platform/fpnew).

We explore different approximate arithmetic structures for the mantissa adders and multipliers in order to improve the energy efficiency of the system (wiki page coming soon...).

The Zero Gating block of the PE checks whether either multiplication operand is zero and controls the PE registers accordingly. In that case the product is zero and the MAC partial sum stays the same, so the PE does not need to spend energy computing a zero and adding it to the current partial sum. Avoiding this switching activity improves the energy efficiency of the array.
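The effect of zero gating can be sketched as a dot product that skips every MAC with a zero operand. The function name `gated_dot` is hypothetical, Python floats stand in for FP16, and the sketch ignores special values such as infinities and NaNs, for which skipping the operation would not match an ungated IEEE 754 multiply-add.

```python
def gated_dot(activations, weights):
    """Accumulate a dot product, skipping MACs that have a zero operand.

    Returns (result, skipped), where skipped counts the gated operations.
    """
    acc = 0.0
    skipped = 0
    for a, w in zip(activations, weights):
        if a == 0.0 or w == 0.0:
            # Gated: the accumulator keeps its value and no MAC is issued.
            skipped += 1
            continue
        acc += a * w
    return acc, skipped
```

The result is the same as for an ungated dot product on finite inputs; `skipped` gives a rough idea of how much switching activity sparse activations or weights would avoid.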

The IO Shifting part of the PE is responsible for managing the extraction of completed results and the insertion of preload data into the accumulator. It features a register called the Reserve Register, which holds the data in transit. This circuitry allows us to manage the extraction and insertion of data in parallel to the computation and perform context switches with zero performance overhead.
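A minimal behavioral sketch of the accumulator and Reserve Register interplay follows. The class `PE`, its method names, and the ordering of operations within `demo` are assumptions made for illustration; the actual cycle-level timing of the context switch is defined by the RTL and is not modeled here.

```python
class PE:
    """Behavioral sketch of one PE: accumulator plus reserve register."""

    def __init__(self):
        self.acc = 0.0      # partial sum of the tile being computed
        self.reserve = 0.0  # value in transit on the shift chain

    def mac(self, a, w):
        """Multiply-accumulate into the accumulator."""
        self.acc += a * w

    def context_switch(self):
        """Swap accumulator and reserve register: the finished result moves
        to the reserve register and the preload value becomes the new sum."""
        self.acc, self.reserve = self.reserve, self.acc

    def scan(self, scan_in):
        """One shift-chain step: emit the reserve value, capture scan_in."""
        scan_out, self.reserve = self.reserve, scan_in
        return scan_out


def demo():
    pe = PE()
    pe.mac(2.0, 3.0)
    pe.mac(1.0, 4.0)        # first tile done: acc holds 10.0
    pe.scan(100.0)          # preload for the next tile arrives in reserve
    pe.context_switch()     # acc holds 100.0, reserve holds 10.0
    pe.mac(1.0, 1.0)        # second tile starts from the preload value
    result = pe.scan(0.0)   # first result leaves while computation continues
    return result, pe.acc
```

In this model the MAC path only touches the accumulator and the shift chain only touches the reserve register, which is why extraction and preloading do not cost compute cycles.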
