
Partial Sums Manager

jfornt edited this page Feb 9, 2024 · 1 revision

The Partial Sums Manager (PSM) is in charge of handling the extraction of results from the array as well as the insertion of preload values. Extracting results implies writing the outputs to the SRAM, while inserting preloads implies reading the preload values from the SRAM.

The PSum Shift Registers module is a block of registers that mirrors the systolic array shift-chain (see [Systolic Array](Architecture/Systolic Array)). The Read Data Manager picks the data from the memory bus and distributes it to the registers, while the Write Data Manager is in charge of packing the outputs from the registers into data words to write into memory. The Control module manages the memory access pattern for read and write operations to the SRAM and orchestrates the whole process.

The main idea behind both the extraction of outputs and the insertion of preload inputs is to swap the contents of the array and the output registers by simply shifting with the shift-chain. This operation is performed in three distinct phases, as sketched in the figure below. First, the preload values are read from the SRAM and loaded into the registers (if no preloads are needed and the accumulation must start at zero instead, this phase can be skipped by simply clearing the registers). Second, the preload values are exchanged with the outputs of the previous computation by shifting. Finally, the output values that now reside in the shift registers are written to the SRAM.

PSum Shift Registers

This block contains a set of shift registers that mirrors how the registers in the systolic array connect to each other. The input of this block is connected to the output of the array shift-chain, and the output of this block is connected to the input of the array shift-chain, creating a circular chain of shift registers. With this structure, the output values in the array registers can be swapped with the preload values in a single full shift.
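The swap performed by the circular chain can be illustrated with a minimal Python sketch. It assumes that both chains have the same length and models each chain as a plain list; the function names `shift_once` and `full_shift` are hypothetical and not taken from the design.

```python
def shift_once(array_chain, psm_chain):
    """Advance the circular chain by one position.

    Assumption: index 0 is the input end and index -1 the output end of each
    chain. The array output feeds the PSM input, and the PSM output feeds
    the array input.
    """
    array_out = array_chain[-1]
    psm_out = psm_chain[-1]
    new_array = [psm_out] + array_chain[:-1]
    new_psm = [array_out] + psm_chain[:-1]
    return new_array, new_psm


def full_shift(array_chain, psm_chain):
    """Shift as many times as the chain is long, which swaps both contents."""
    assert len(array_chain) == len(psm_chain)
    for _ in range(len(array_chain)):
        array_chain, psm_chain = shift_once(array_chain, psm_chain)
    return array_chain, psm_chain


outputs = ["o0", "o1", "o2", "o3"]   # results sitting in the array chain
preloads = ["p0", "p1", "p2", "p3"]  # preloads sitting in the PSM registers
array_after, psm_after = full_shift(outputs, preloads)
print(array_after)  # ['p0', 'p1', 'p2', 'p3']
print(psm_after)    # ['o0', 'o1', 'o2', 'o3']
```

In this model, the order of the values within each chain is preserved by the full shift.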

Data Managers

The main task of the Read and Write Data Managers is to adapt the SRAM data bus to the shift registers width, and vice-versa. The Read Data Manager accumulates input elements until there are enough values to push to the shift registers. This task is very similar to what the activation and weight Data Managers do to push to the FIFOs (see [Feeder Modules](Architecture/Feeder Modules)), so the circuit used is similar, with two important differences. First, the Dilation Pattern shifting is replaced by the mask provided by the Address Counter. Second, the number of feeding registers is set to the maximum possible in order to maximize performance. This is acceptable in this block because there is no replication: the Data Managers will be instantiated only once.
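The accumulate-then-push behaviour of the Read Data Manager can be sketched in Python as follows. This is a behavioural illustration only: the function name `read_data_manager`, the per-element mask format, and the parameter `row_width` (the number of values pushed to the shift registers at once) are assumptions made for the example, not details of the actual circuit.

```python
def read_data_manager(words, masks, row_width):
    """Collect the masked elements of each SRAM word and emit full rows.

    words:     iterable of SRAM data words, each a list of elements
    masks:     one mask per word; a truthy entry marks an element of interest
    row_width: number of values pushed to the shift registers at once
    """
    pending = []  # feeding registers, modeled as a growing list
    for word, mask in zip(words, masks):
        pending.extend(value for value, keep in zip(word, mask) if keep)
        # Push as soon as enough values have been accumulated
        while len(pending) >= row_width:
            yield pending[:row_width]
            pending = pending[row_width:]


words = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
masks = [[1, 1, 1, 0], [1, 1, 1, 1], [0, 1, 1, 1]]
rows = list(read_data_manager(words, masks, row_width=5))
print(rows)  # [[1, 2, 3, 5, 6], [7, 8, 10, 11, 12]]
```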

The Write Data Manager deals with the task of preparing the SRAM data words to be sent to the memory for writing. The shift registers are typically wider than (or at most equal to) the SRAM data width, so this is reduced to a simple selection and interconnection task. A crossbar interconnect block is used for this, and a set of registers is included before the output towards the SRAM for timing reasons.
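As a rough illustration of the selection task, the sketch below slices a wide register row into SRAM-sized words. The function name `write_data_manager` and the zero padding of a final, incomplete word are assumptions made for the example; the source does not specify how a partial word is handled.

```python
def write_data_manager(register_values, word_width):
    """Split a shift-register row into SRAM data words of word_width elements."""
    words = []
    for start in range(0, len(register_values), word_width):
        word = list(register_values[start:start + word_width])
        # Assumption: an incomplete final word is padded with zeros
        word += [0] * (word_width - len(word))
        words.append(word)
    return words


sram_words = write_data_manager([1, 2, 3, 4, 5, 6], word_width=4)
print(sram_words)  # [[1, 2, 3, 4], [5, 6, 0, 0]]
```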

Controller

The index sequence required for read and write operations has been implemented with four cascaded counters, connected analogously to the case of the activation and weight feeders' Global Index Counters. Two counters are used to read and write to the current tile, which points to the region of values being computed in the array. A counter for the y dimension is not needed since the array encodes only the x and k dimensions in its rows and columns, respectively. For tiling, two counters are used, one to navigate the current tile through the x,y axes and another one to switch output channels. The indexing of the x and y axes can be combined in a single counter, since the y direction increment is always unitary.

The following pseudocode illustrates how the indexing sequence is performed by the counters.

# Tiling
for (til_k=0, til_k<TENSOR_XY*TENSOR_K, til_k+=TENSOR_XY*TIL_MOVE_K):
    for (til_xy=0, til_xy<TENSOR_XY, til_xy+=TIL_MOVE_XY):
        # Current Tile
        for (k=0, k<TENSOR_XY*TILE_CHANNELS, k+=TENSOR_XY):
            for (x=0, x<TILE_X, x+=WORD_WIDTH):
            
                Idx = x + k + til_xy + til_k
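
For experimentation, the same loop nest can be written as a runnable Python generator. The function name `psum_indices` and the small parameter values in the usage example are illustrative assumptions; only the loop structure comes from the pseudocode above.

```python
def psum_indices(tensor_xy, tensor_k, til_move_xy, til_move_k,
                 tile_channels, tile_x, word_width):
    """Yield the index sequence in the order produced by the four counters."""
    # Tiling: step through output-channel groups, then through x,y positions
    for til_k in range(0, tensor_xy * tensor_k, tensor_xy * til_move_k):
        for til_xy in range(0, tensor_xy, til_move_xy):
            # Current tile: channels in the outer loop, x positions in the inner
            for k in range(0, tensor_xy * tile_channels, tensor_xy):
                for x in range(0, tile_x, word_width):
                    yield x + k + til_xy + til_k


indices = list(psum_indices(tensor_xy=8, tensor_k=4, til_move_xy=4,
                            til_move_k=2, tile_channels=2, tile_x=4,
                            word_width=2))
print(indices[:8])  # first two tiles: [0, 2, 8, 10, 4, 6, 12, 14]
```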

The counter module is used for addressing reads and writes to the SRAM. These read and write operations are coordinated in such a way that only one transaction is needed at a given time (hence the SRAM can be single-port). This module also generates a mask for the Data Managers indicating the location of the elements of interest in the current data word.

An important detail is that read and write transactions index exactly the same memory positions, only in different contexts (tiles). Hence, the x and k counters can be shared for both operations. The difference between a read operation and a consecutive write will be exclusive to the tiling loop, so only the til_xy and til_k counters need to be replicated. To save some resources, I created a "dual counter" module, which is a generic counter that uses two registers to store different counting contexts, and shares all other resources using multiplexers.
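The idea of the dual counter can be modeled in a few lines of Python. This is a behavioural sketch under assumptions: the class name `DualCounter`, the context numbering (0 for reads, 1 for writes), and the wrap-to-zero behaviour are illustrative and are not taken from the RTL.

```python
class DualCounter:
    """Counter with two stored contexts that share one increment/compare path."""

    def __init__(self, limit, step):
        self.limit = limit
        self.step = step
        self.regs = [0, 0]  # one register per counting context

    def value(self, ctx):
        # In hardware, a multiplexer would select the active context register
        return self.regs[ctx]

    def advance(self, ctx):
        """Step the selected context; return True when it wraps around."""
        nxt = self.regs[ctx] + self.step  # shared adder
        wrapped = nxt >= self.limit       # shared comparator
        self.regs[ctx] = 0 if wrapped else nxt
        return wrapped


READ, WRITE = 0, 1
til_xy = DualCounter(limit=8, step=4)
til_xy.advance(READ)                              # reads run one tile ahead
print(til_xy.value(READ), til_xy.value(WRITE))    # 4 0
```

The wrap flag returned by `advance` would be what triggers the next counter in a cascade.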

The Scan FSM is in charge of coordinating the SRAM read and write operations together with the scanning of elements into and out of the array shift-chain. To explain how these actions are arranged, let us first take a look at the different output-related memory resources present inside the accelerator. Any preload or output value coming into or out of the array must traverse three levels of registers. The figure below illustrates three example registers on this hierarchy where a particular preload/output value would be stored.

The accumulator is the lowest level register, and it is used for computing. The reserve register holds the preload value of the next computation while the PE is working, and temporarily stores the output value before it is extracted. Its main purpose is to allow the accumulator and datapath of the PEs to be always busy, independently of the shift-chain (see [Systolic Array](Architecture/Systolic Array)). The last register level consists of the shift registers of the PSM, discussed above.

Under this organization, an incoming preload value will first be loaded into the PSM register, then shifted through the shift-chain into the reserve register, and finally moved into the accumulator when a context switch happens. Simultaneously, the accumulator output will be swapped and go up to the reserve register, from where it will be shifted out of the array and into the PSM register. From there, the output value will be packed into an SRAM write word and dispatched to the memory. In order to use this framework to effectively provide the array with the preload values it needs while extracting the outputs, the sequence of operations depicted in the figure below was designed.

In the beginning, the array is empty, so we need to read values and shift them in as soon as possible. I call this period the Start Phase. The values of the first context (RD0) go directly to the accumulator in order to start computing, while the second context values (RD1) are loaded into the reserve register. When the values of the third context (RD2) are finally loaded into the PSM shift registers, we have all three levels of registers ready with values.

During the Main Phase, two data streams must be managed: array to SRAM and SRAM to array. When the computation of one context finishes, I start by swapping the accumulator with the reserve register and then scanning out the outputs. As the preloads for the next context were ready in the PSM, this shift simultaneously advances them to the reserve register, where they are ready for the next swap. The outputs residing in the PSM are then written to the SRAM (WR0). Only then does the PSM become free, so that the next preload data (RD3) can be read. This whole process is repeated for all remaining contexts until there is no more preload data to be read.

The results of the last three contexts are extracted during the Final Phase without triggering the read process, since all the preload data has already been fetched. If preload values are not needed and the accumulators should start at zero, all read stages can be omitted, which makes the process faster. This sequence of operations is monitored and controlled by the Scan FSM, including the generation of the necessary control signals towards the rest of the blocks inside the Partial Sums Manager. This FSM is triggered by and reports to the Context FSM of the Main Controller block (see [Main Controller](Architecture/Main Controller)).
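The Start, Main and Final phases described above can be checked with a toy Python model of the three register levels. This is a sketch under several assumptions: each register level is collapsed into a single slot that stands for a whole context, operations are fully serialized (in hardware, computation overlaps with reads and shifts), at least three contexts are processed, and the names `ScanModel` and `run_scan` are hypothetical.

```python
class ScanModel:
    """Toy model of the accumulator, reserve and PSM register levels."""

    def __init__(self):
        self.acc = self.reserve = self.psm = None
        self.log = []      # SRAM transactions, in issue order
        self.written = []  # values that reached the SRAM

    def read(self, i):     # SRAM -> PSM registers
        self.psm = f"P{i}"
        self.log.append(f"RD{i}")

    def shift(self):       # full shift: PSM <-> reserve
        self.reserve, self.psm = self.psm, self.reserve

    def swap(self):        # context switch: reserve <-> accumulator
        self.acc, self.reserve = self.reserve, self.acc

    def compute(self, i):  # the PE turns preload i into output i
        assert self.acc == f"P{i}", "wrong preload in the accumulator"
        self.acc = f"O{i}"

    def write(self, i):    # PSM registers -> SRAM, which frees the PSM
        self.written.append(self.psm)
        self.log.append(f"WR{i}")
        self.psm = None


def run_scan(num_contexts):
    m = ScanModel()
    # Start Phase: fill all three register levels
    m.read(0)
    m.shift()
    m.swap()
    m.read(1)
    m.shift()
    m.read(2)
    for c in range(num_contexts):
        m.compute(c)
        m.swap()
        m.shift()
        m.write(c)
        if c + 3 < num_contexts:  # Main Phase: refill the freed PSM
            m.read(c + 3)
        # Final Phase (last three contexts): no further reads
    return m


model = run_scan(5)
print(model.log)
# ['RD0', 'RD1', 'RD2', 'WR0', 'RD3', 'WR1', 'RD4', 'WR2', 'WR3', 'WR4']
```

In this model, every preload reaches the accumulator exactly when its context starts, and the outputs are written back in context order.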
