Konkuk University (Kyle Jonghyuk Park, Ko Ryeowook)
This project analyzes SIMBA, NVIDIA's neural network (NN) accelerator, and its Network-on-Chip (NoC) structure to identify its strengths and weaknesses. To address the need for more capable simulators, we developed a NoC simulator that supports both multicast and unicast data transmission, which fills a gap in existing tools. The work is intended to support research on neural network accelerators and efficient data movement for AI and deep learning workloads.
- Developed a 2D Mesh NoC simulator in Verilog to verify different tile structures and efficient dataflow of a neural network accelerator
- Flit-based flow control: wormhole
- Virtual Channel
- Lookahead routing pipeline (4-cycle)
- Credit-based buffer backpressure
- Implemented the Base Routing Conformed Path (BRCP) model to support both unicast and multicast
- "Forward & Absorb" multicast mechanism
- Avoids multicast-unicast routing deadlock because multicast and unicast packets share the same network paths
- Unicast vs. multicast contention: multicast has priority
- Applied an advanced multicast algorithm: Advanced Hierarchical Leader-Based (HL) scheme (NoC Simulator (HL))
- Takes fewer cycles than the original HL scheme
- Parameterized the simulator's options to improve usability (define.v, parameters.v)
- Routing Algorithm: XY / YX DOR
- Arbiter Configurations: Fixed priority / Round robin priority
- Priority Configuration
- Flit Configuration
- Virtual channel used (Logical data path)
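The XY / YX DOR option listed above can be illustrated with a short behavioral model. This is a minimal Python sketch, not the simulator's Verilog: the names `Port`, `dor_next_port`, and `dor_path` are hypothetical, and the coordinate convention (x grows eastward, y grows northward) is an assumption.

```python
from enum import Enum


class Port(Enum):
    LOCAL = "local"
    EAST = "east"
    WEST = "west"
    NORTH = "north"
    SOUTH = "south"


# Assumed coordinate convention: x grows to the east, y grows to the north.
STEP = {
    Port.EAST: (1, 0),
    Port.WEST: (-1, 0),
    Port.NORTH: (0, 1),
    Port.SOUTH: (0, -1),
}


def dor_next_port(cur, dst, order="XY"):
    """Pick the output port for one hop of dimension-order routing."""
    if order not in ("XY", "YX"):
        raise ValueError("order must be 'XY' or 'YX'")
    (cx, cy), (dx, dy) = cur, dst
    x_port = Port.EAST if dx > cx else Port.WEST if dx < cx else None
    y_port = Port.NORTH if dy > cy else Port.SOUTH if dy < cy else None
    # XY finishes the x dimension before touching y; YX does the opposite.
    preferred = (x_port, y_port) if order == "XY" else (y_port, x_port)
    for port in preferred:
        if port is not None:
            return port
    return Port.LOCAL  # already at the destination: eject to the local port


def dor_path(src, dst, order="XY"):
    """List every node visited from src to dst, inclusive."""
    path = [src]
    cur = src
    while True:
        port = dor_next_port(cur, dst, order)
        if port is Port.LOCAL:
            return path
        step_x, step_y = STEP[port]
        cur = (cur[0] + step_x, cur[1] + step_y)
        path.append(cur)
```

For example, `dor_path((0, 0), (2, 1), "XY")` travels along x first and then y, while the `"YX"` order reaches the same destination by moving along y first.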
Forward & Absorb
- The routing computation logic determines the status of each input packet and sends it to the mux controller in the crossbar
- 3 statuses: Unicast / Multicast & Forward / Multicast & Absorb
- The crossbar behaves differently depending on the status of the packet
- Contention between a Multicast & Absorb packet and other packets is handled by the multab_ct signal
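The three packet statuses can be sketched as a small behavioral model. The Python below is an illustration under stated assumptions, not the simulator's Verilog. It assumes that "Multicast & Absorb" means the current router is one of the packet's destinations, that such a router copies the flit to its local port and keeps forwarding while destinations remain, and that the last destination only absorbs. The names `Status`, `classify`, `crossbar_outputs`, and `trace_multicast` are hypothetical, and the `multab_ct` contention logic is not modeled.

```python
from enum import Enum


class Status(Enum):
    UNICAST = "unicast"
    MULTICAST_FORWARD = "multicast & forward"
    MULTICAST_ABSORB = "multicast & absorb"


def classify(is_multicast, here, destinations):
    """Status the routing computation stage would assign at router `here`."""
    if not is_multicast:
        return Status.UNICAST
    if here in destinations:
        return Status.MULTICAST_ABSORB
    return Status.MULTICAST_FORWARD


def crossbar_outputs(status, next_port, more_destinations_ahead):
    """Set of output ports the crossbar drives for a given status.

    Assumption: an absorbing router also forwards while destinations remain.
    """
    if status is Status.MULTICAST_ABSORB:
        if more_destinations_ahead:
            return {"local", next_port}
        return {"local"}
    # Unicast and Multicast & Forward both use the single routed port.
    return {next_port}


def trace_multicast(path, destinations, port_toward_next="east"):
    """Walk one multicast packet along a fixed path and record each decision."""
    remaining = set(destinations)
    trace = []
    for node in path:
        status = classify(True, node, remaining)
        remaining.discard(node)
        outputs = crossbar_outputs(status, port_toward_next, bool(remaining))
        trace.append((node, status, outputs))
        if not remaining:
            break
    return trace
```

Calling `trace_multicast([(1, 0), (2, 0), (3, 0)], {(1, 0), (3, 0)})` walks a packet eastward along one row: the first router absorbs and forwards, the middle router only forwards, and the last router only absorbs.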
Left: PE architecture proposed in SIMBA, Right: PE cycle module
- PE cycle module is only used for cycle simulation
- 1 MAC: 8 cycles; 128 MACs in total: 1024 cycles (128 × 8)
- See the SIMBA paper: ResNet-50 (res4a_branch1)
- If you don't need a cycle simulation for PE computation, you can delete this module. The NoC Simulator will still work.
- Multicast algorithm for sending data from one source to multiple destinations
- The HL scheme was proposed in the paper that introduced the BRCP model ("Multidestination message passing in wormhole k-ary n-cube networks with base routing conformed paths")
- Divide the mesh NoC into four quadrants and determine the L1 and L2 leaders with the algorithm specified for the quadrant where the source is located.
- Using the U-mesh algorithm, send the data to one L2 leader first among the multiple L2 leaders
- Perform the multicast in the column and row directions specified by the quadrant where the source is located.
- Contention between row-by-row multicast and the partial-sum (PSUM) unicast transmissions sent after the PE-internal MAC operations in the SIMBA dataflow.
Row 2: IA Multicast & PE MAC cycle
Row 3: IA Multicast & PSUM contention
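When an IA multicast and a PSUM unicast request the same output port, the feature list states that multicast has priority. The Python sketch below shows one way such an arbiter could behave. It is an assumption about how multicast priority might combine with the round-robin option, not a description of the simulator's actual arbiter, and the `arbitrate` function and its arguments are hypothetical.

```python
def arbitrate(requests, last_grant, num_ports):
    """Grant one input port per cycle for a single output port.

    requests:   dict mapping input port index -> "multicast" or "unicast"
    last_grant: input port index granted most recently
    num_ports:  number of input ports competing for this output

    Assumed policy: any multicast request beats every unicast request;
    within the same class, ports are searched round-robin starting from
    the port after `last_grant`.
    """
    for kind in ("multicast", "unicast"):
        for offset in range(1, num_ports + 1):
            port = (last_grant + offset) % num_ports
            if requests.get(port) == kind:
                return port
    return None  # no input is requesting this output
```

With a PSUM unicast on input 0 and an IA multicast on input 2, `arbitrate({0: "unicast", 2: "multicast"}, last_grant=2, num_ports=5)` grants input 2 even though it was served most recently, so the unicast waits until the multicast has passed.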
- Simulation logs & Simulator Configuration options (define.v, parameters.v)
- Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture
- D. K. Panda, S. Singal and R. Kesavan, "Multidestination message passing in wormhole k-ary n-cube networks with base routing conformed paths," in IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 1, pp. 76-96, Jan. 1999, doi: 10.1109/71.744844.
- NoCGEN: "An open-source on-chip router model originally developed for [Matsutani_HPCA09]"
- On-Chip Networks, Second Edition (Natalie Enright Jerger, Tushar Krishna, Li-Shiuan Peh)
- Principles and Practices of Interconnection Networks (William James Dally, Brian Towles)