
Weight Fetcher


The Weight Fetcher reads the values of the weights stored in the weights SRAM and provides them to the systolic array in the correct order. It is implemented using a simplified version of the [Data Feeder](Architecture/Data Feeder), in particular, the kernel pattern and local word offset are fixed to constant values, since configuration is not necessary in this case.

The weight tensor stored in the weights SRAM has dimensions K×C×H×W, where K and C are the numbers of output and input channels respectively, and H and W are the convolutional kernel dimensions. In the weight memory, the values are flattened in the order [K,W,H,C], with the output-channel dimension (K) contiguous in memory. This is done because different output channels are mapped directly to the array columns, which makes the feeding process more natural. The following figure illustrates this memory layout with some examples.
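As a minimal sketch of this layout (the function and parameter names below are placeholders for illustration, not taken from the actual design), the flat memory index of weight (k, c, h, w) can be written as:

```python
def flat_weight_index(k, w, h, c, K, W, H, C):
    # [K, W, H, C] flattening: K is the fastest-changing (contiguous)
    # dimension, then W, then H; C is the slowest.
    return k + K * (w + W * (h + H * c))

# Consecutive output channels k for a fixed (w, h, c) land on consecutive
# addresses, so one wide SRAM read serves every array column at once.
assert flat_weight_index(1, 0, 0, 0, K=8, W=3, H=3, C=2) == 1
assert flat_weight_index(0, 1, 0, 0, K=8, W=3, H=3, C=2) == 8
```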

Every column of the systolic array needs to be fed all the weights corresponding to the output channel mapped to it, in the order [W,H,C]. Thanks to the memory layout described earlier, if the SRAM data bus is properly dimensioned, every memory read yields weight data elements for all the array columns. Therefore, we can (ideally) feed one element to every array column on every cycle.

The main idea is to use the same flexible filter-and-selection scheme as with the Data Feeder to enforce this. We achieve this by setting the first bit of the Dilation Pattern to 1 and all others to 0 (indicating a single value being read), and setting the local word offset of each column to its own column number. With this configuration plus the global offset working as usual, every array column knows which element it needs to take. The left shift of the dilation pattern is only used in some corner cases, where the data alignment forces the current batch of weights to be read in two contiguous SRAM positions.
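A rough software model of this selection could look as follows (a sketch under assumed names and an assumed SRAM word width; the real logic is combinational hardware inside the feeder):

```python
WORD_ELEMS = 8  # assumed number of weight elements per SRAM word

def column_select(word_lo, word_hi, glob_offset, col):
    # Dilation pattern = 100...0 -> each column takes exactly one element.
    # Local word offset = column number, so column `col` picks the element
    # at (global offset + col) within the fetched SRAM word.
    idx = glob_offset + col
    if idx < WORD_ELEMS:
        return word_lo[idx]  # aligned case: a single SRAM read is enough
    # Misaligned corner case: the left-shifted dilation pattern selects
    # from the next contiguous SRAM word instead.
    return word_hi[idx - WORD_ELEMS]
```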

Similar to the ifmaps case, the SRAM data readout is implemented with three independent counters. A single counter suffices to navigate through all the weights of the current context, since the [W,H,C] dimensions must be read sequentially, just as they are laid out in memory. A tiling counter is used to select new output channels for a new computing context. These two counters perform the readout itself and are depicted in the figure below.

An auxiliary third counter is also included to handle the case in which the current batch of weights to be fed to the different columns is spread across two SRAM positions due to memory misalignment. This is necessary for the system to work with any tensor shape and size, but typically the user will try to enforce memory alignment to improve performance.
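For example, with an assumed SRAM word of 8 elements and 6 output channels, every batch of 6 weights starts 6 elements after the previous one, so some batches inevitably straddle a word boundary; the sketch below flags those cases:

```python
WORD_ELEMS = 8  # assumed SRAM word width (elements per read)
TENSOR_K = 6    # assumed number of output channels fed per batch

for n in range(5):
    start = TENSOR_K * n  # flat offset of the n-th batch of weights
    if start % WORD_ELEMS + TENSOR_K > WORD_ELEMS:
        print(f"batch {n}: spread across two words, auxiliary counter steps")
    else:
        print(f"batch {n}: served by a single SRAM read")
```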

The following pseudocode represents the sequence performed by the counters. The step and overflow values of each counter are set so that their counts can be added directly to form the global index, avoiding the need for multiplications:

```
# Tiling
for (til_k=0, til_k<TENSOR_K, til_k+=TIL_MOVE_K):
    # Current Tile
    for (w=0, w<TENSOR_K*TENSOR_W*TENSOR_H*TENSOR_C, w+=TENSOR_K):
        # Auxiliary
        for (aux=0, aux<AUX_LIM, aux+=AUX_STEP):

            Glob_Idx = aux + w + til_k
```
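The same sequence as runnable Python, with placeholder parameter values (a 3×3 kernel, 2 input channels and 8 output channels tiled 4 at a time are assumptions for illustration):

```python
TENSOR_K, TENSOR_C, TENSOR_H, TENSOR_W = 8, 2, 3, 3
TIL_MOVE_K = 4            # output channels consumed per tile (array width)
AUX_STEP, AUX_LIM = 1, 1  # aligned case: the auxiliary counter takes one step

indices = []
for til_k in range(0, TENSOR_K, TIL_MOVE_K):                     # tiling
    for w in range(0, TENSOR_K*TENSOR_W*TENSOR_H*TENSOR_C,
                   TENSOR_K):                                    # current tile
        for aux in range(0, AUX_LIM, AUX_STEP):                  # auxiliary
            indices.append(aux + w + til_k)                      # Glob_Idx

# Each tile visits every [W,H,C] position once, offset by the index of its
# first output channel; the global index is formed by pure addition.
assert len(indices) == 2 * TENSOR_W * TENSOR_H * TENSOR_C
```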