In-Datacenter Performance Analysis of a Tensor Processing Unit (TPU) #3

meton-robean opened this issue Oct 27, 2019 · 4 comments

Google TPU

meton-robean commented Oct 29, 2019

[Figure: TPU block diagram]

The TPU instructions are sent from the host over the PCIe Gen3 x16 bus into an instruction buffer. The internal blocks are typically connected together by 256-byte-wide paths.

The weight FIFO is four tiles deep. The intermediate results are held in the 24 MiB on-chip Unified Buffer, which can serve as inputs to the Matrix Unit. A programmable DMA controller transfers data to or from CPU Host memory and the Unified Buffer.

The Matrix Multiply Unit is the heart of the TPU. It contains 256x256 MACs that can perform 8-bit multiply-and-adds on signed or unsigned integers. The 16-bit products are collected in the 4 MiB of 32-bit Accumulators below the matrix unit. The 4 MiB holds 4096, 256-element, 32-bit accumulators. The matrix unit produces one 256-element partial sum per clock cycle.
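
A quick back-of-the-envelope check of these figures (the 700 MHz clock rate is taken from the paper, not from the excerpt above):

```python
macs = 256 * 256                    # 65,536 8-bit MAC units in the Matrix Multiply Unit
clock_hz = 700e6                    # TPU clock rate, per the paper
peak_ops = macs * 2 * clock_hz      # one multiply and one add per MAC per cycle
print(f"peak throughput ~ {peak_ops / 1e12:.1f} TOPS")       # ~91.8; the paper rounds to 92

acc_bytes = 4096 * 256 * 4          # 4096 accumulators x 256 elements x 4 bytes (32-bit)
print(f"accumulator storage = {acc_bytes / 2**20:.0f} MiB")   # 4 MiB, matching the text
```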

The philosophy of the TPU microarchitecture is to keep the matrix unit busy. It uses a 4-stage pipeline for these CISC instructions, where each instruction executes in a separate stage. The plan was to hide the execution of the other instructions by overlapping their execution with the MatrixMultiply instruction. Toward that end, the Read_Weights instruction follows the decoupled-access/execute philosophy [54].

The other blocks in the diagram exist mainly to keep this matrix compute array as busy as possible; the design notes below follow from that (drawn from a reference blog post).
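
A minimal Python sketch of this decoupled-access/execute overlap, assuming hypothetical `read_weights` and `matrix_multiply` callables (illustrative stand-ins, not the TPU instruction set): the weight load for tile i+1 is issued before the multiply on tile i, so the load can proceed while the matrix unit is busy.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tiles(tiles, read_weights, matrix_multiply):
    """Software-pipelined schedule: fetch the next tile's weights while computing."""
    if not tiles:
        return
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(read_weights, tiles[0])              # fill the weight FIFO
        for i, tile in enumerate(tiles):
            weights = pending.result()                               # wait only if the load isn't done
            if i + 1 < len(tiles):
                pending = loader.submit(read_weights, tiles[i + 1])  # start the next load early
            matrix_multiply(tile, weights)                           # compute proceeds during the load
```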

meton-robean commented Oct 29, 2019

The TPU uses a systolic array.

Notes from a reference blog post:

1. To realize the efficiency of the systolic array ("keep the matrix unit busy"), the weights and activations have to go through quite a few layout transformations. Judging from the TPU paper, this work seems to be done by the "User Space driver" in the software stack.

2. While the systolic array is computing, the multiply-accumulate along a column proceeds serially: each product flows downward and becomes the addend for the next PE. Meanwhile data keeps streaming in along the rows to fill the PEs, so the multiply-accumulate results for successive vectors are produced in a pipelined fashion, as the sketch below illustrates.
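
To make point 2 concrete, here is a small cycle-level simulation (a sketch, not TPU RTL) of a weight-stationary systolic array computing X @ W: weights sit still, one per PE; activations stream in from the left with a one-cycle skew per row; partial sums flow down each column; once the pipeline fills, one finished output row drains from the bottom every cycle.

```python
import numpy as np

def systolic_matmul(X, W):
    """Simulate an n x n weight-stationary systolic array computing X @ W."""
    m, n = X.shape
    assert W.shape == (n, n)
    a = np.zeros((n, n))              # activation input at each PE this cycle
    p = np.zeros((n, n))              # partial-sum input at each PE this cycle
    out = np.zeros((m, n))
    for t in range(m + 2 * n):        # enough cycles to fill and drain the pipeline
        # Inject at column 0: element r of input vector k enters row r at cycle k + r.
        for r in range(n):
            k = t - r
            a[r, 0] = X[k, r] if 0 <= k < m else 0.0
        p_out = p + a * W             # every PE does one multiply-add per cycle
        # The bottom row's outputs are completed dot products, one column per cycle.
        for c in range(n):
            k = t - (n - 1) - c
            if 0 <= k < m:
                out[k, c] = p_out[n - 1, c]
        # Shift for the next cycle: activations move right, partial sums move down.
        a[:, 1:] = a[:, :-1].copy()
        p[1:, :] = p_out[:-1, :]
        p[0, :] = 0.0
    return out

X, W = np.random.randn(5, 4), np.random.randn(4, 4)
assert np.allclose(systolic_matmul(X, W), X @ W)
```

The final `assert` checks the streamed result against a plain matrix multiply; the latency to fill and drain is roughly 2n cycles, while steady-state throughput is one output row per cycle, which is exactly the pipelining described in point 2.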

More references:

Should we embrace the "systolic array"? Thoughts on the scalability of the Google TPU
Heterogeneous acceleration techniques for deep learning (part 2): doing big things in a small space, a summary of several schools of accelerator design

meton-robean commented Oct 30, 2019

TPU design philosophy

1. Simple and regular design: simplicity and regularity are arguably the most important principles of a systolic array, and this design choice is driven mainly by cost considerations.

2. Balancing computation with I/O: this is arguably the most important design goal of a systolic array. Keep data inside the compute units for as long as possible, and keep the compute units busy instead of stalling while waiting for memory.

    The biggest problem with a conventional processor-plus-memory system is that data can be fetched and stored far more slowly than it can be processed, so the system's overall throughput (MOPS, operations completed per second) is largely limited by memory access. This has been a central topic of computer architecture research for many years and a major driving force behind processor and memory design. The systolic architecture takes a very simple approach: let data flow through the processing elements for as long as possible, as the small roofline-style sketch below makes concrete.
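
A roofline-style calculation illustrating this point (the ~92 TOPS peak matches the figure above; the bandwidth value is only an example, not an exact TPU specification): throughput stays memory-bound until the number of operations performed per byte fetched is raised, which is precisely what keeping data circulating inside the array accomplishes.

```python
def attainable_tops(peak_ops, bytes_per_s, ops_per_byte):
    """Roofline: attainable throughput = min(compute peak, bandwidth x arithmetic intensity)."""
    return min(peak_ops, bytes_per_s * ops_per_byte) / 1e12

peak = 92e12       # matrix-unit peak, ~92 TOPS
bw = 34e9          # example off-chip bandwidth in bytes/s
for reuse in (1, 10, 100, 1000, 2700):     # operations performed per byte fetched
    tops = attainable_tops(peak, bw, reuse)
    print(f"{reuse:>5} ops/byte -> {tops:6.2f} TOPS ({100 * tops * 1e12 / peak:5.1f}% of peak)")
```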

Drawbacks:

1. The systolic architecture is a very specialized design: structurally simple and cheap to implement, but not very flexible, suited only to particular kinds of computation. The author regards convolution as an ideal application for demonstrating the systolic architecture's strengths.

2. Systolic arrays are hard to scale, because memory bandwidth has to grow in proportion to sustain the desired speedup, which runs against Moore's law and against the trend of memory speed lagging logic speed. Scaling also makes latency worse, which runs against the user-demand trend noted in the paper, and the matrix multiply unit in the TPU is only one of several possible implementations of systolic matrix multiplication. Whether a scalable, energy-efficient alternative can be found that performs the required propagation and accumulation without sacrificing latency is an interesting question worth thinking about.

3. Data re-layout. A systolic array mainly implements vector/matrix multiplication. Taking CNN computation as an example, data entering the systolic array has to be arranged into the right form and then fed in strictly according to the clock beat and spatial order. This extra re-layout work adds complexity and is presumably handled by the software driver; a minimal example appears after this list.

4. Matching the array size to the workload. Results only appear after the data has flowed through the entire array. When the vectors being computed have few elements and the systolic array is large, it is hard to keep every cell busy, and the latency of loading and draining data grows with the array size, lowering efficiency. So when sizing a systolic array, one has to weigh not only area, power, and peak compute, but also the efficiency achieved on typical workloads.
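
A minimal example of the re-layout mentioned in point 3, under the common assumption of an im2col-style transform (the paper does not spell out the exact scheme its driver uses): convolution windows are unrolled into rows of a matrix, so the convolution becomes an ordinary matrix multiply that a systolic array can consume.

```python
import numpy as np

def im2col(x, kh, kw):
    """x: (H, W, C) input -> (out_h*out_w, kh*kw*C) patch matrix (stride 1, no padding)."""
    H, W, C = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw * C), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + kh, j:j + kw, :].ravel()
    return cols

x = np.random.randn(8, 8, 3)                 # a single input feature map
w = np.random.randn(3, 3, 3, 16)             # 3x3 kernels, 3 input channels, 16 filters
y = im2col(x, 3, 3) @ w.reshape(-1, 16)      # convolution expressed as one matrix multiply
print(y.shape)                               # (36, 16): 6x6 output positions x 16 filters
```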

TPU 2.0
