
Data Feeder


The Data Feeder is in charge of providing streams of data to all rows of the systolic array. It must perform the data readout from the SRAM blocks, filter the data elements to be provided and distribute them to the corresponding rows and columns.

GeMM-based systolic arrays by themselves can perform matrix-matrix multiplications with the contents of the local memories. This operation can be very efficient, but mapping convolutions onto it requires an external convolutional lowering (im2col) transformation. Convolutional lowering is relatively fast in terms of runtime, but it critically inflates the total amount of data to be transferred to the local memory of the accelerator: for 3x3 kernels, the lowered matrices are about 9 times larger than the original tensors.
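As a rough software illustration of this inflation (a reference model only, with assumed tensor dimensions and the hypothetical helper name im2col_size, not part of the hardware flow):

# Illustrative im2col size inflation for a KxK kernel (software reference, not the hardware flow)
def im2col_size(C, H, W, kH=3, kW=3, stride=1, pad=1):
    out_h = (H + 2*pad - kH) // stride + 1
    out_w = (W + 2*pad - kW) // stride + 1
    return (C * kH * kW) * (out_h * out_w)   # rows x columns of the lowered matrix

C, H, W = 64, 56, 56                          # assumed example dimensions
original = C * H * W
lowered  = im2col_size(C, H, W)
print(lowered / original)                     # ~9x for a 3x3 kernel with unit stride and same-padding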

To tackle this issue, we designed the Data Feeder module, which performs on-the-fly on-chip convolutional lowering from the tensors stored in the SRAM memories. This results in significant gains in memory transfer speed as well as power, in exchange for the inclusion of some extra hardware resources.

The Index Counters perform a readout of the SRAM memory positions, in which the interest region of the current computation context (the current set of partial sums in the array) is read sequentially. The Feed Lanes are replicated following the array rows and handle the task of getting the correct data elements from the SRAM bus, in the correct order, to a FIFO memory from where they will be fed to the array. In this context, data elements are defined as individual operands: e.g. when using FP16 arithmetic, a data element would be an array of 16 bits corresponding to a FP16 number.

Global Readout

The feature map tensor residing in memory has dimensions CxHxW, where W is the tensor width, H is the height and C is the number of input channels. The values are flattened in the order [W,H,C], with the width dimension (x-axis) being contiguous in memory. The figure below illustrates this memory layout by pointing to the memory locations of several tensor elements.
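As a reference for this layout, the flattened index of an element can be computed as in the sketch below (the helper name flat_index is only for illustration):

# Flattened index for a [W,H,C]-ordered tensor: x (width) is contiguous in memory
def flat_index(x, y, c, W, H):
    return x + W * y + (W * H) * c

# Example: element (x=2, y=1, c=0) of an 8x8 feature map lives at index 10
assert flat_index(2, 1, 0, W=8, H=8) == 10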

The ifmaps SRAM contains the flattened tensor, which must be read and transformed into a stream of data for each row of the array. Each row needs to be fed all the ifmap values that fit inside the convolutional kernel of size [kW; kH; ch] centered around the target output pixel mapped to that row. Contiguous rows of the array compute regions of the same size, with a relative offset proportional to the stride. Typically, for non-unit kernel sizes, there will be an overlap between the regions of contiguous positions.
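A minimal sketch of this per-row mapping (assuming unit dilation and the hypothetical helper name row_window_x) shows how contiguous rows get stride-shifted, overlapping windows:

# x-range of ifmap columns needed by each array row (one output pixel per row)
def row_window_x(row, kW, stride):
    start = row * stride              # windows of contiguous rows are shifted by the stride
    return list(range(start, start + kW))

# Example: 3-wide kernel, stride 1 -> rows 0 and 1 share columns 1 and 2
print(row_window_x(0, kW=3, stride=1))   # [0, 1, 2]
print(row_window_x(1, kW=3, stride=1))   # [1, 2, 3]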

When an SRAM address is read, the data word returned by the memory contains bits that must be read by several rows, some of them overlapping. The main idea is to perform a sequential readout of all addresses and, for every row of the array, filter and select the positions of interest within the current data word using the Lane's Data Manager.

The Index Counters read all the memory positions containing the whole area of interest (the union of the kernel positions of all rows, or the colored part in the figure above), and the data words are broadcast to all Feed Lanes. The Index Counters count the ifmap input elements that have been read, rather than the SRAM addresses. This is because an element index is more specific than an address: it also contains information about the word offset (i.e. the position of the current element inside the current data word).

# Global Index Decomposition - Example assuming 16 elements/data word
SRAM_address = Glob_Idx >> 4
global_word_offset = Glob_Idx & 0xF

The Feed Lanes need further information in order to locate the position of their first element of interest inside the data word. In particular, they need to encode their relative spatial position: contiguous rows encode contiguous output pixels along the x-axis. We denote this relative offset value as the Local Offset, which is simple to generate: the kernel locations of contiguous rows are separated by the convolution stride factor. Therefore the local offset for every row is computed as:

# Local Word Offset
for (y=0, y<Y, y++):
    local_offset[y] = y*stride

These values (Y values of 8 bits each, by default) are provided by the user by writing them to the configuration registers, and they encode the stride factor of the convolution.

The readout of the ifmap data from its SRAM is performed at two indexing levels. The lower level corresponds to the positions relevant to the current computing context (the current tile). The upper level shifts the current tile to a new location within the overall tensor, in order to switch to a new computing context (a new output pixel location). This way of splitting a tensor into smaller sub-tensors for computation is called tiling.

In order to index elements from tensors and tiles of arbitrary shapes, the Global Index Counter is constructed using 5 independent counters. Three navigate through the current tile (x, y and channel dimensions) and two "move" the tile through the bigger tensor. All feature channels are reduced for any output pixel, so tiling is not necessary in the ch direction. The following pseudocode represents the sequence performed by the counters. The step and overflow values of each counter are set so that their counts can be added directly to form the global index, in order to avoid the need for multiplications:

# Tiling
for (til_y=0, til_y<TENSOR_X*TENSOR_Y, til_y+=TENSOR_X*TIL_MOVE_Y):
    for (til_x=0, til_x<TENSOR_X, til_x+=TIL_MOVE_X):
        # Current Tile
        for (ch=0, ch<TENSOR_X*TENSOR_Y*IN_CHANNELS, ch+=TENSOR_X*TENSOR_Y):
            for (y=0, y<TENSOR_X*KH_EFFECTIVE, y+=TENSOR_X):
                for (x=0, x<BW_EFFECTIVE, x+=WORD_WIDTH):
                    Glob_Idx = x + y + ch + til_x + til_y
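As a plain-software check of this multiplication-free indexing (toy dimensions and single-element x steps assumed, not taken from the actual design), adding the counter values reproduces the usual flattened index:

# Toy check: pre-scaled counter steps make Glob_Idx a pure sum (illustration only)
TENSOR_X, TENSOR_Y, IN_CHANNELS = 16, 16, 4

def ref_index(x, yy, c):
    return x + TENSOR_X * yy + TENSOR_X * TENSOR_Y * c    # reference with multiplications

glob = []
for c_cnt in range(0, TENSOR_X * TENSOR_Y * IN_CHANNELS, TENSOR_X * TENSOR_Y):  # channel counter steps by W*H
    for y_cnt in range(0, TENSOR_X * 3, TENSOR_X):                              # y counter steps by W (3 kernel rows)
        for x_cnt in range(0, 8, 1):                                            # x counter (8 elements)
            glob.append(x_cnt + y_cnt + c_cnt)            # pure addition of counter values

ref = [ref_index(x, yy, c) for c in range(IN_CHANNELS) for yy in range(3) for x in range(8)]
assert glob == ref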

Data Management

The selection task to be handled by the Feed Lanes must overcome several difficulties:

  • The data elements to read can start and end anywhere in the SRAM data bus (as long as element bit alignment is respected).
  • The Lane may need to read any number of elements present in the data bus, from zero to all.
  • The data elements to read can be in non-contiguous positions in memory (e.g. in dilated convolutions).

To fully handle this variability while exploiting the regularity of convolution operations, we define the Kernel Pattern: a bit vector where each position represents one data element of the readout sequence. A high bit in the pattern means that the corresponding data position needs to be read, and the first bit of the pattern (MSB) is always required to be high. By shifting the pattern by a set amount of bits, we can align it with the data stream so that the actual elements of interest coincide with the high bits. The figure below illustrates this idea with an example in which 13 elements must be read.
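As a rough software analogy (toy 16-element data words and hypothetical helper names, not the RTL behavior), a 13-element pattern aligned to the word offset marks which elements of each successive data word belong to the readout:

# Software analogy: a shifted all-ones pattern marks 13 elements of interest
PATTERN_LEN = 13                      # 13 contiguous elements to read
WORD_ELEMS  = 16                      # toy data-word width, in elements

def elements_in_word(word_idx, word_offset):
    # global element positions covered by this data word
    lo, hi = word_idx * WORD_ELEMS, (word_idx + 1) * WORD_ELEMS
    # positions covered by the shifted pattern
    first, last = word_offset, word_offset + PATTERN_LEN
    return [e for e in range(lo, hi) if first <= e < last]

print(elements_in_word(0, word_offset=6))   # elements 6..15 come from word 0
print(elements_in_word(1, word_offset=6))   # elements 16..18 spill into word 1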

These 13 elements could correspond to one row of ifmap data for a 13x13 convolutional kernel. After these are read, we would proceed with the following kernel row by advancing the Global Index Counters (generating a potentially new word offset, see the blue arrow in the figure) and resetting the completion counter (see the purple arrow in the figure). With this, as long as the Kernel Pattern and the Index Counters are well configured, any repeating read pattern can be implemented. This scheme is well suited to implementing dilated convolutions, as in the figure below.
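For illustration, a kernel-row pattern for a dilated convolution could be generated and applied as in the sketch below (hypothetical helpers make_kernel_pattern and select_elements, assumed 16-element words; a behavioral sketch, not the hardware implementation):

# Build a kernel-row pattern for a given kernel width and dilation factor
def make_kernel_pattern(kW, dilation=1):
    bits = [0] * ((kW - 1) * dilation + 1)
    for k in range(kW):
        bits[k * dilation] = 1            # first bit (MSB) is always high
    return bits

# Align the pattern with a data word and keep only the marked elements
def select_elements(word_elems, pattern_bits, word_offset):
    out = []
    for i, elem in enumerate(word_elems):
        p = i - word_offset               # position of this element inside the shifted pattern
        if 0 <= p < len(pattern_bits) and pattern_bits[p]:
            out.append(elem)
    return out

word = list(range(16))                    # one toy data word of 16 elements
print(make_kernel_pattern(3, dilation=2))                                # [1, 0, 1, 0, 1]
print(select_elements(word, make_kernel_pattern(3, 2), word_offset=6))   # [6, 8, 10]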

The Kernel Pattern encodes the relative position of the elements to be read as well as their number, but it is independent of the row being fed, so a single pattern value can be provided for all rows. The spatial information corresponding to each Lane's row or column position is accounted for in the word offset itself (more details below).

Once the elements to be read have been identified after shifting the pattern, they are moved to a set of intermediate registers and, when these are full, the values are pushed into the Lane's FIFO.

The intermediate registers hold the data until there are enough values to be pushed in parallel to the FIFO. If the number of elements to be read from the current data word is greater than the number of free registers, the Lane's Data Manager can stall the readout pipeline so that it keeps reading the remaining operands on the next cycle.
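A behavioral sketch of this buffering, under assumed parameters (4 intermediate registers) and hypothetical names such as LaneBuffer; it models only the push and stall decisions, not the RTL:

# Behavioral model of the intermediate registers in front of the Lane's FIFO
class LaneBuffer:
    def __init__(self, n_regs=4):
        self.n_regs = n_regs
        self.regs = []
        self.fifo = []

    def accept(self, elems):
        """Try to store the selected elements; return True if the readout must stall."""
        free = self.n_regs - len(self.regs)
        self.regs.extend(elems[:free])
        if len(self.regs) == self.n_regs:          # registers full: push one entry to the FIFO
            self.fifo.append(tuple(self.regs))
            self.regs = []
        return len(elems) > free                   # leftover elements -> stall and retry next cycle

buf = LaneBuffer()
stall = buf.accept([6, 7, 8])      # 3 of 4 registers filled, no stall
stall = buf.accept([9, 10, 11])    # 1 element fits (FIFO push), 2 left over -> stall
print(buf.fifo, stall)             # [(6, 7, 8, 9)] True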
