PaddlePaddle 2.1.0 Release Note
重要更新
飞桨框架2.1.0 版本有如下重要更新:
- 环境适配:增加了对 Python 3.9、CUDA 11.2 的支持;提供了对 ROCm 平台的支持(experimental);提供了对昇腾AI处理器的支持(experimental);增加了可在百度昆仑芯片上运行的模型数量;详情请见:开始使用。
- 分布式训练:在已有静态图的多维混合并行的基础上,新增动态图实现。
- 框架功能:完成了多项功能增强和性能优化,特别地,新增了以下重要功能:
- 自定义算子:提供了在框架外部自定义算子的新方案,简化了自定义算子写法与训练推理部署流程,详情请见:自定义外部算子。
- 新增inplace操作:新增可降低显存占用与提升性能的inplace操作,包括View策略,与12个inplace API。
- 高层API相关:新增支持混合精度训练的高层API;新增通过 paddle.hub 来查看、共享、加载模型。
- 自动混合精度训练优化:优化了混合精度训练中 slice、where、range 等多个 op 的计算性能,提升了在 MaskRCNN、ERNIE 等模型上的加速效果。
- oneDNN下BF16训练:新增支持了AMP(AutoMixedPrecision) pure_BF16模式; 新增支持了BF16类型的SGD和initializers初始值设定并减小了内存;新增支持了大部分word2vec BF16训练需要的前向和反向op。
飞桨的官方模型库和套件的最新更新请参见:Paddle projects notes along with PaddlePaddle2.1。
不兼容升级
- 飞桨框架2.1放弃了对python2和python3.5的支持,建议您升级python到3.8版本来使用飞桨。飞桨框架2.1不再提供支持CUDA9的预编译包,建议您升级CUDA版本来使用飞桨。
- 对API可见性的优化,会导致无法使用
from deeply_nested_namespace import *
的方式导入被认为是实现细节的位于最底层的命名空间中的私有API。建议您通过查看飞桨官网的API文档说明来使用飞桨。具体的,以下行为在飞桨框架2.1版本中不再被允许。
# will import nothing from the deeply nested namespaces
from paddle.nn.layer.loss import *
from paddle.nn.layer.conv import *
- Tensor.grad 不兼容升级,返回值的类型由 numpy 变为 Tensor。(#32142)
- paddle.jit.TraceLayer.save_inference_model 接口不兼容升级:将原先的第一个参数 dirname 改为 path,名字表意更通用,并且与 paddle.save、paddle.load 等接口统一,表示由用户指定保存模型路径的前缀。(#31989)
- paddle.io.DataLoader 当 Dataset 只包含一个字段时,DataLoader 返回格式不兼容升级:当用户自定义数据集只包含一个字段并通过如 return image 或 yield image 返回数据时,2.0 版本返回的数据格式是 [image_tensor],而 2.1 版本返回的数据格式为 image_tensor,保持输入输出数据结构的一致性。
训练框架
功能优化(含分布式)
基础API
- 新增
paddle.dtype
以及paddle.float32
等数据类型,作为 paddle 内的数据类型。 (#32012) - 新增
paddle.nn.functional.glu
。 (#32096) - 新增
paddle.nn.utils.spectral_norm
。(#32633) - 新增
paddle.Tensor.register_hook
API,用于在动态图场景中为前向Tensor对应的梯度Tensor注册hook函数。(#31775) - 新增
Tensor.__array__
函数,支持numpy.array(Tensor)
和numpy.asarray(Tensor)
将paddle.Tensor
类型转换成numpy.ndarray
类型 。(#32300) - 新增Tensor API:
Tensor.item(*args)
,可将Tensor中指定位置的元素转化为Python的scalar值并返回。(#32634) - 新增
paddle.nn.LayerList
对负数索引的支持。(#31750) - 新增12个动态图inplace API:
clip_
、scale_
、add_
、subtract_
、ceil_
、floor_
、exp_
、reciprocal_
、round_
、sqrt_
、rsqrt_
、flatten_
。这些inplace API不能通过paddle.api_
的形式调用,应该使用Tensor.api_
来调用。(#32699) - 新增
paddle.autograd.backward
API, 用于自定义起始梯度。(#31540) - 新增
paddle.nn.LayerDict
类。(#31951) - 新增
layer.to
API。(#32040) - 新增
paddle.autograd.PyLayer
API,用于支持动态图在Python端自定义反向计算。(#32130) - 新增支持
paddle.optimizer
在动态图中指定非参数的Tensor作为parameters进行优化。(#32362) - 在
paddle.static.nn
添加了若干sequence*
系列功能,在paddle.nn.functional
添加了sequence_mask
。 (#32089) - 在
paddle.nn.CTCLoss
中添加norm_by_times
参数。(#32490) paddle.fill_constant
支持uint8_t
。(#31911)paddle.clip
支持int32
和int64
。(#32373)- 支持
paddle.nn.functional.interpolate
在 Nearest neighbor 模式下,输入数据类型为int。(#32270) - API中所有支持传入list或tuple的参数,全部升级为支持传入list和tuple。(#32344, #32528 #32360)
- 优化
softmax
算子性能。(#31821) - 优化
paddle.norm
文档说明,澄清paddle.norm
与numpy.linalg.norm
API 存在功能差异。(#32530) - 优化Tensor 的数据类型(
datatype
)的打印形式,例如,float32
类型的Tensor的dtype
从VarType.FP32
变为paddle.float32
。(#30682) - oneDNN功能优化:
- 升级 oneDNN 至 v2.2.1。(#31067, #31473, #30295, #32227)
- 增加了更加准确的,基于数据类型的 oneDNN kernel 选择策略。(#29840)
- 融合oneDNN
layer_norm
子图为完整的单个layer_norm
op。(#32162, #30891, #30962) - 减少oneDNN
elementwise_mul
创建中不必要的内存分配。(#30203) - 改进了缓存每个线程使用的内存消耗。(#30358)
- 增加了LSTM oneDNN fp32 and int8 kernel支持。(#30719 #31894)
- 增加了 OneDNN hardswish 支持。(#30211)
- 增加了
bilinear_interp_v2
和nearest_interp_v2
的oneDNN支持。(#32312)
- 升级 Xbyak 数学库 至 v5.81。(#30809)
- 修复
paddle.io.DataLoader
支持数据集包含list,dict和string等嵌套的复杂数据格式,修复迭代中途程序退出偶现的报错、资源未释放等问题。(#31481) - 修复 paddle 中修改 logging 库的 root logger 导致的问题。(#32706)
- 修复
L1Decay
动态图模式下backward
报错的问题。(#32718) - 修复
paddle.nn.functional.cross_entropy
中设置ignore_index
和reduction='mean'
下出Nan的问题。(#32545) - 修复bool tensor和float tensor相加输出的类型为bool的问题。(#32272)
- 修复比较类API在broadcast的计算错误。(#32470)
- 修复加减乘除在右侧输入是大shape下的broadcast下梯度计算错误。(#30818)
- 修复segment mean OP在处理大shape tensor输入时,计算结果不正确的问题。(#32610)
- 修复优化器变量的数据类型与模型参数的数据类型不一致的问题。(#29917)
- 修复
paddle.io.DataLoader
预处理中包含paddle的操作时,num_workers > 0
时报错。(#31177) - 修复打印空tensor时的报错。(#32501)
- 调整静态图参数初始化顺序,调整后与动态图保持一致,以便于相同模型设置相同随机种子在动态图和静态图中初始化得到相同参数。(#32177)
- 修复
paddle.to_tensor
不支持接受dtype=Tensor.dtype
的bug。(#31931) - 修复
paddle.dist
在2个输入相等时,梯度为nan的问题。(#32448) paddle.nn.functional.temporal_shift
API增加data_format
属性,支持设置为NCHW或者NHWC。(#31642)- 修复
adaptive_avg_pool2d
在输入数据类型为float16时计算结果不正确的问题。(#31887) paddle.nn.Layer.sublayers
和paddle.nn.Layer.named_sublayers
:将原本paddle.nn.Layer.sublayers
的include_sublayers = True
参数修改为include_self = False
, 从而修复从前include_sublayers = False
时返回空的问题。现在不填写任何参数时默认行为和之前一致,即返回不包含自己的所有递归子层,当include_self = True
时同字面意思,返回包含自己在内的所有递归子层。而paddle.nn.Layer.named_sublayers
中include_sublayers
的参数则直接删除了 其他行为不变。(#31824 )
高层API
- 新增
paddle.hub
功能,提供help
、list
和load
函数用于查看和加载第三方模型,支持加载远程和本地repository。(#31873) - 支持混合精度训练,提供O0, O1, O2三种模式,分别对应FP32训练、自动混合精度训练、纯FP16训练。目前纯FP16训练仅支持静态图。(#31417)
- 支持
paddle.Tensor
类型的图像变换,包括normalize, to_grayscale, vflip, hflip, crop, center_crop, pad, rotate, resize
等算子 。(#32705)
动态图转静态图
修复了动态图转静态图的bug:
- 静态图
arange、range
API返回的shape与动态图不一致。 paddle.to_tensor
在动转静中支持输入为int,float,bool
基础类型。- for循环中支持解析dict推导式语法。(#32159)
- 修复部分场景下嵌套控制流语句中存在变量未声明报错的问题。(#32153)
- 修复了
expand
op缺少float16类型的bug。(#32238) - 修复了
expand_v2、tile、expand、expand_as、expand_as_v2、meshgrid
等6个OP反向梯度求解,当shape维度为6时,返回梯度信息为None的bug。(#32004) - 修复了
paddle.jit.TraceLayer.save_inference_model
接口中因未同时保存网络结构和参数导致与paddle.static.load_inference_model
搭配使用不一致的问题。(#31989 )
混合精度训练
- 动态图混合精度接口 auto_cast 中自动将不支持fp16 kernel的op保持为fp32计算。(#32543)
- 修复静态图混合精度训练中因不支持FP16计算的Op列表(
unsupported_fp16_list
)统计不完整导致的意外报错问题,当前不支持FP16计算的Op列表可根据运行时环境自动生成。(#32102) - 优化
update_loss_scaling
for循环起多个相同cuda kernel问题,融合为一个cuda kernel。(#32554) - 优化
slice
多维情况下性能较慢问题。(#32266) - 优化
elementwise_add_grad
输入输出相同时的冗余拷贝问题。(#32051) - 优化
check_finite_and_unscale
for循环起多个相同cuda kernel问题,融合为一个cuda kernel。(#31954) - 优化
range
参数冗余拷贝问题。(#30811) - 优化
top_k_v2
在input_width <= 1024
时性能较慢问题。(#30403) - 移植
where_index
CPU计算流程到GPU上完成。(#30601)
BF16训练
- 增加了初级 BF16 AMP 集成, 通过在前向网络中添加
cast op
来修改图使一些 operator 使用 BF16 kernel 。(#31093) - 增加了 BF16
pure_mode
模式, 在此模式下,默认开启使用 BF16 数据类型的模型参数,BF16的operator,对于optimizer的BF16 decorator。(#32281, #32681) - 增加了对于CPU flags的检查以确认是否支持oneDNN BF16性能提升。(#30551)
- 对BF16支持进行过程统一。(#31034)
- 增加了对于constant initializer的BF16数据类型的支持。(#31935)
- 增加了BF16 uniform initializer支持。(#32468)
- 增加了将startup_program initializer转化为BF16的机制。(#32720)
- 增加了 sgd operator 的 BF16 数据类型支持。(#32162)
- 增加了lookup_table op BF16 数据类型的支持。(#31558)
- 增加了 sum kernel 和 SelectedRows 的 BF16的支持。(#32755, #32631)
- 增加了conv_transpose的BF16数据类型支持。(#30877)
- 增加了elementwise_add grad BF16数据类型的支持。(#30925)
- 增加了reshape grad BF16 数据类型的支持。(#31035)
- 增加了elementwise_add grad op 对于 broadcasting 的支持(FP32/BF16)。(#31385)
- 增加了elementwise_mul grad op 对于fp32/bf16数据类型的支持。(#31647)
- 增加了 LSTM BF16 支持,并修复GRU BF16的一些问题。(#31234)
- 增加了 oneDNN reduce_op fp32 和 bf16支持。(#31816)
- 增加了oneDNN reduce_op grad 对于 fp32 和 bf16 的支持。(#32280 #32592)
分布式训练优化
- 加入图检索引擎,支持万亿边规模的分布式图神经网络存储、采样、训练(#31226)。
- 加入基于索引的数据采样类,支持图、树深度匹配等模型的采样(#31696)。
- 新增
paddle.distributed.send, paddle.distributed.recv,paddle.distributed.new_group,paddle.distributed.wait
,完善分布式通信API。(#32504, #31682) - 动态图分布式初始化支持
sync_parameters_buffer
,解决动态图buffer未全局初始化的问题。(#31625) - 流水线并行支持1F1B调度方式,优化显存占用量,理论上显存占用量为常量。(#31786)
- [混合并行] 优化Sharding 策略:Gradient Merge支持、减少参数通信量等,提升训练速度;支持与其他并行策略的灵活组合。(#31884 #32486 #32485 #31996 #31939 #31796)
- [混合并行] Sharding策略中添加optimize offload支持,降低训练显存占用。(#32134)
- [混合并行] 持久化广播通信ID的socket服务,减少混合并行端口冲突问题。(#31589)
- [参数服务器] 优化日志输出和LOG打印,去除无效日志。
- [参数服务器] 优化稀疏参数存储结构,维度较小(低于64)的情况下内存有较大降幅 。
- [参数服务器] 修复在分布式预测时,准入策略生效的BUG。
- [参数服务器] HeterPs支持多机GPU训练(#31102)。
动态图混合并行
动态图分布式支持混合并行功能,支持数据并行,模型并行以及流水线并行三种并行方式的任意组合。同时支持混合并行基础上添加AMP混合精度策略,ReCompute策略。
- Fleet支持动态图混合并行,支持数据并行(DataParallel)/模型并行(ModelParallel)/流水线并行(PipelineParallel)三种并行的互相组合。(#32248)
- 动态图分布式DataParallel添加
find_unused_parameters
参数,用于支持控制流组网。(#31625) - Fleet添加
VocabParallelEmbedding
,ColumnParallelLinear
,RowParallelLinear
API用于模型并行组网。添加model_parallel_random_seed
/get_rng_state_tracker
用于ModelParallel的随机性控制。(#32248) - Fleet添加
distributed_scaler
接口,用于混合并行AMP策略下的loss scaler。(#32354) - Fleet添加
PipelineLayer
用于流水线并行组网切图,添加LayerDesc
用于动态图Layer描述以减少显存初始化。(#32449) - 动态图新增 Recompute 策略。(#32516)
自定义OP
- 新增支持Mac平台上使用自定义OP功能。(#31976)。
- Mac平台下支持C++11头文件目录的自动搜索功能,兼容本地可能存在多版本clang的情况。
- 新增支持Op前反向函数Attribute参数以及inferShape, InferDtype函数输入参数使用const &类型。(#31588)
- 新增支持在自定义Op实现时使用三种框架内部数据类型
paddle::complex64, paddle::complex128, paddle::float16
。(#31602, #31657, #31669, #31725) - 新增支持在自定义Op中使用
std::vector<paddle::Tensor>
类型参数作为前反向函数的输入。(#31535) - 新增支持InferShape函数使用Attribute参数作为输入。(#31713)
- 优化自动生成的Python API在动态图下的调用栈,提升执行效率。(#32209)
- 降低Windows上检查编译器cl.exe时的报错条件,增强Windows环境自检的鲁棒性。(#32769)
- 修复Windows上安装多个CUDA环境时编译器选择时的bug。(#31694)
- 修复Windows安装中文版本VS时出现的Python编码问题的bug。(#31493)
- 移除对单独动态库文件的依赖,仅链接框架核心动态库文件。(#32404、#32769)
- 移除之前的旧自定义OP方案,并对whl包中多余的库文件与头文件进行了清理,降低了whl包大小约11M。(#31813), (#32463)
模型保存与载入
paddle.save, paddle.load
支持Tensor的保存加载。(#31756)paddle.save, paddle.load
支持list[Tensor]、dict[Tensor]、tuple[Tensor]
以及list、tuple、dict
嵌套的包含Tensor的结构的保存加载。(#32446)paddle.save, paddle.load
支持Layer的保存加载。(#32446)paddle.save, paddle.load
支持Program的保存加载。(#32336)paddle.save, paddle.load
支持C++二进制格式单个Tensor的保存加载。(#32211)paddle.jit.save, paddle.jit.load
支持无参数的Function的保存加载。(#32430)
性能优化(含分布式)
- 优化重点算子,提升多个模型单GPU训练性能,Deeplabv3+单卡FP32和AMP性能分别提升11%、72%,TSM单卡AMP性能提升44.5%,HRNet单卡FP32、AMP分别提升46%、51%。
- 增加
index_sample
CUDA实现。(#30380) - 实现
relu, leaky_relu
算子的CUDA Kernel,代替原Eigen实现,正反向共提升5% ~ 20%。(#31869, #31841) temporal_shift
性能提升20%~40%。(#31642)- 优化
depthwise_conv2d
,NHWC format下性能提升30%~50%。(#31667) - 优化
interp_bilinear_grad
算子NCHW性能,提升19%~303%。(#30950) - 优化
adaptive_avg_pool2d
算子NCHW、output_size = 1情况下的性能,提升80%~90% 。(#31197) - conv op当dtype为float16时,forward和backward支持开启
exhaustive_search
。(#30959) momentum
的weight_decay
参数设置为float类型时,实现momentum
和L2Decay
的融合。(#30881)- 实现
log_softmax
算子axis
为最后一维、维度<=1024时的CUDA Kernel,相比原Eigen实现,正反向算子性能提升4.55x ~ 26.45x。(#31630, #32180)
推理部署
模型量化
- 新增支持将FP32模型保存为FP16模型。(#32112)
- 重构动态图量化训练中统计输出量化信息模块,支持多Block和多分支的模型,增强通用性。(#31680 #31710 #31784 #31861)
- 动态图量化训练功能支持跳过量化OP,并且和预测端形成打通。(#31704)
Paddle Inference
功能升级
- 发布C API (experimental), 功能与C++ API基本对齐。(#32225)
- 重构Tensor 底层代码,与旧有 ZeroCopyTensor 数据结构解耦。此升级不涉及用户 API 改动,对用户透明。(#31402)
- 预测框架python接口接入训练自定义算子。用户在训练过程中加载自定义算子后,即可像框架原生算子那样,通过 PaddlePredictor 直接执行包含此自定义算子的预测模型部署。(#32533)
- 支持从内存加载模型时TensorRT序列化和反序列化功能。(#31342)
性能优化
- 支持ERNIE量化模型在NV GPU上混合精度推理,其中MatMul以Int8精度计算,其他部分以FP16精度计算。相比纯FP16推理,在T4上batch size=40时,标准ERNIE模型在XNLI数据集上推理性能由1898 seq/s提升至2310 seq/s,提升17.8%。(#32232)
易用性优化
- 用户开启TensorRT变长输入,输入shape超出限定范围时增加报错信息。(#32155)
- 增加运行时TensorRT版本检查,若运行和编译时TensorRT大版本号不一致会以warning提示。(#32443)
- 增加TensorRT VERBOSE级别log开关,用户可通过
export GLOG_v=3
开启TensorRT VERBOSE日志,打印更多调试信息。(#32459)
BugFix
- 修复预测结束后可能出现非指定使用显卡显存不足的错误。(#32655)
- 修复动态图下原生推理非正规值引起的CPU推理性能问题。(#32350)
- 修复在使用PaddleSlim量化模型开启TensorRT推理时,若从内存读入模型,仍会要求设置校准表路径的问题。(#32676)
- 升级TensorRT量化校准表接口,修复在DLA上不支持TensorRT离线量化的问题。(#31060)
- 修复当使用变长方式进行ERNIE/BERT模型推理时(EnableTensorRtOSS),不支持裁剪Attention的header数量的问题。(#31497)
- 修复2.0之后训练的BERT模型QK输入顺序不稳定带来的结果偶现diff问题。(#32659)
- 修复ERNIE模型开启TensorRT varlen加速时因输入变量名顺序错误导致报错或结果错误问题。(#32482)
- 修复TensorRT的plugin ElementwisePluginDynamic序列化失败的问题。(#31587)
- 修复TensorRT动态shape下FC layer维度补1带来的后续OP维度报错的问题。(#32458, #31803)
- 修复FC使用Padding时
repeated_fc_relu_fuse_pass.cc
错误的问题。(#32648) - 修复conv2d_transpose op使用TensorRT推理时结果错误的问题。(#32593)
- 修复NAN的错误比较导致的 OCR INT8 模型 oneDNN 预测报错的问题。(#32227)
- 修复部署多个模型在多executor上多线程进行oneDNN 预测时出现数据争用的问题。(#32499, #32136 #32664)
环境适配
编译安装
- 新增支持CUDA11.2编译,支持3070/3080/3090显卡架构的编译。(#31529)
- 新增支持Windows Visual Studio 2017编译,并将发版、CI/CE、编译文档等各项配套设施,由VS2015全面升级至VS2017。(#311652)
- 新增对cuda11.2镜像的支持。(#32531)
- cuda10.1镜像支持gcc 5.4。(#32531)
- 镜像中新增对python 3.9的支持。(#32385)
- 修复
run_check
接口的bug,并在run_check
接口里新增了对动态图的检查:现在run_check
检测paddle安装的逻辑里,首先检测用户机器上是否有GPU,没有则报warning,未考虑安装cpu包的用户。(#32428) - 修复Windows系统上缺乏 symlink 方法的问题。(#31006)
新硬件训练支持
- 新增支持海光芯片:飞桨基于 ROCM 4.0.1 版本可以在海光CPU与DCU上进行模型训练与推理。已经验证支持图像分类、目标检测、图像分割、自然语言处理、推荐系统、视频分类与语音合成共计7个分类的36个模型。(#29342, #30758, #30639, #31009, #31077)
- 新增支持昇腾芯片:支持在昇腾NPU上进行单机多卡训练。(#31957, #32381, #32197, ...)
- 昆仑硬件训练支持
Thanks to our Contributors
This release contains contributions from:
123malin, Adam Osewski, alncat, arlesniak, AshburnLee, Aurelius84, Bai Yifan, Baibaifan, Bin Lu, cc, ceci3, chajchaj, chalsliu, channings, Chen Long, Chen Weihang, chen zhiyu, Chengmo, chentianyu03, cnn, CtfGo, cucuzg, danleifeng, denglin-github, Double_V, fangshuixun007, Feiyu Chan, fluffyrita, FlyingQianMM, FNRE, furnace, GaoWei8, GeminiCarrie, gongweibao, Gradie, GT-Zhang, Guanghua Yu, Guo Sheng, guofei, hong, houj04, huangjun12, huangxu96, Huihuang Zheng, hutuxian, iducn, Jacek Czaja, Jack Zhou, jakpiase, JamesLim, Jiabin Yang, jiangcheng, Jiaqi Liu, Jiawei Wang, joanna.wozna.intel, joejiong, JZ-LIANG, Kaipeng Deng, Kqnonrime, kuizhiqing, Lei.C, Leo Chen, lidanqing, LielinJiang, lijianshe02, lilong12, limingshu, littletomatodonkey, liu zhengxi, LiuChiachi, liuyuhui, liym27, LoveAn, LutaoChu, minghaoBD, mls1999725, niuliling123, Ouyang Chao, pangyoki, parap1uie-s, Pei Yang, procr, Qi Li, qingqing01, QingshuChen, Ren Wei (任卫), ronnywang, ruri, seemingwang, Shang Zhizhou, shanliang1992, ShenLiang, Shibo Tao, Steffy-zxf, syyxsxx, taixiurong, tangwei12, Tao Luo, Thomas Young, Thunderbrook, tianshuo78520a, TTerror, wangchaochaohu, wangguanzhong, wanghuancoder, wangna11BD, WangXi, wangxinxin08, wawltor, Wei Shengyu, weihaoji, WeiXin, wenbin, Wenyu, whs, Wilber, winter-wang, Wojciech Uss, wuhuanzhou, wuyefeilin, XGZhang, XiangGao, XiaoguangHu, xiaoting, xiegegege, xiemoyuan, xingfeng01, Yang Zhang, yaoxuefeng, yiak, yingshengBD, yinhaofeng, Yiqun Liu, ykkk2333, yongqiangma, Yuang Liu, yukavio, YUNSHEN XIE, Y_Xuan, Zhang Jun, Zhang Ting, zhang wenhui, Zhang Zheng, zhangchunle, Zhen Wang, zhiboniu, Zhong Hui, Zhou Wei, zhulei, zhupengyang, zlsh80826, 卖鱼的哲学, 石晓伟
PaddlePaddle 2.1.0 Release Note
Highlights
The PaddlePaddle Framework V2.1.0 has the following important updates:
- Environment adaptation: Add support for Python 3.9 and CUDA 11.2; provide support for the ROCm platform (experimental) and for the Ascend AI processor (experimental); increase the number of models that can run on the Baidu Kunlun chip. For details, please see: Getting Started.
- Distributed training: Besides multidimensional hybrid parallelism in static graph mode, a dynamic graph implementation is added.
- Framework functions: Complete a number of enhancements and performance optimizations; in particular, the following important functions are added:
- Customized OP: Provide a new solution for customizing operators outside the framework, simplifying the process of writing custom operators and deploying training inference. For details see: Customizing External Operators.
- Inplace Operation: Add the inplace operation to reduce the memory consumption and improve performance, including View strategy, and 12 inplace APIs.
- High-level API related: Add the high-level APIs to support mixed precision training; add
paddle.hub
to view, share, and load models. - Automatic mixed precision training optimization: Optimized the computational performance of multiple OPs in mixed precision training such as slice, where, range, etc., and improved the acceleration effect on MaskRCNN, ERNIE and other models.
- oneDNN BF16 training: Enabled AMP (AutoMixedPrecision) pure_BF16 mode. Enabled BF16 SGD and initializers for less memory consumption. Enabled most of FWD & BWD BF16 ops for BF16 word2vec training.
For the latest updates to the official model libraries and suites of PaddlePaddle, please see: Paddle projects notes along with PaddlePaddle2.1.
Backwards Incompatible Changes
- The PaddlePaddle Framework 2.1 drops support for Python 2 and Python 3.5. It is recommended that you upgrade to Python 3.8 before using PaddlePaddle. PaddlePaddle Framework 2.1 no longer provides pre-built packages for CUDA 9; it is recommended that you upgrade your CUDA version before using PaddlePaddle.
- The optimization of API visibility makes it impossible to import private APIs located in the deeply nested namespaces that are considered as implementation details by using
from deeply_nested_namespace import *
. It is recommended that you use the PaddlePaddle by following the instructions in the API Documentation on the PaddlePaddle website. Specifically, the following actions are no longer allowed in the PaddlePaddle Framework 2.1.
# will import nothing from the deeply nested namespaces
from paddle.nn.layer.loss import *
from paddle.nn.layer.conv import *
- Tensor.grad incompatible upgrade: the type of the return value is changed from numpy to Tensor. (#32142)
- paddle.jit.TraceLayer.save_inference_model interface incompatible upgrade: the original first parameter dirname is renamed to path. The new name is more generic and consistent with interfaces such as paddle.save and paddle.load, and denotes the user-specified prefix of the saved model path. (#31989)
- paddle.io.DataLoader return-format incompatible upgrade when a user-defined dataset contains only a single field. If the dataset returns data with code like return image or yield image, the returned data format in Paddle 2.0 is [image_tensor], while in Paddle 2.1 it is image_tensor, keeping the output data structure consistent with the input.
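The behavioral change can be illustrated with a plain-Python sketch of the batch-collation step. The function names here are hypothetical stand-ins for illustration, not Paddle's actual implementation:

```python
# Hypothetical sketch of the single-field collate change (not Paddle's real code).

def collate_2_0(sample_fields):
    # Paddle 2.0: even a single-field sample is wrapped in a list.
    return list(sample_fields)

def collate_2_1(sample_fields):
    # Paddle 2.1: a single field is returned as-is, matching the input structure.
    if len(sample_fields) == 1:
        return sample_fields[0]
    return list(sample_fields)

# A dataset whose __getitem__ does `return image` yields one field per sample:
sample = ("image_tensor",)
print(collate_2_0(sample))  # ['image_tensor']
print(collate_2_1(sample))  # 'image_tensor'
```

Multi-field samples (e.g. `return image, label`) are unaffected by this upgrade.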
Training Framework
Functional optimization (including distributed)
Basic API
- Add data types such as
paddle.dtype
andpaddle.float32
as data types within the Paddle. (#32012) - Add
paddle.nn.functional.glu
. (#32096) - Add
paddle.nn.utils.spectral_norm
. (#32633) - Add
paddle.Tensor.register_hook
API for registering the hook function for the gradient Tensor corresponding to the forward Tensor in dynamic graph scenes. (#31775) - Add the
Tensor.__array__
function to support numpy.array(Tensor) and numpy.asarray(Tensor) to convert the paddle.Tensor type to the numpy.ndarray type. (#32300) - Add the Tensor API:
Tensor.item(*args)
. It can convert the element at the specified position in Tensor to Python scalar value and return it. (#32634) - Add the
paddle.nn.LayerList
support for negative indexing. (#31750) - Add 12 dynamic graph inplace APIs:
clip_
,scale_
,add_
,subtract_
,ceil_
,floor_
,exp_
,reciprocal_
,round_
,sqrt_
,rsqrt_
, andflatten_
. These inplace APIs cannot be called by usingpaddle.api_
and should be called by usingTensor.api_
. (#32699) - Add
paddle.autograd.backward
API for customizing the starting gradient. (#31540) - Add
paddle.nn.LayerDict
class. (#31951) - Add
layer.to
API. (#32040) - Add
paddle.autograd.PyLayer
API for supporting custom backward calculation of dynamic graphs on Python side. (#32130) - Add the support for
paddle.optimizer
to specify non-parametric Tensor as parameters for optimization in dynamic graphs. (#32362) - Add several
sequence*
functions in paddle.static.nn, and add sequence_mask in paddle.nn.functional. (#32089) - Add the norm_by_times parameter to paddle.nn.CTCLoss. (#32490) paddle.fill_constant
supports uint8_t. (#31911) paddle.clip
supports int32 and int64. (#32373) - Support the input data type to be int in Nearest neighbor mode in
paddle.nn.functional.interpolate
. (#32270) - All parameters in API that support passing in list or tuple are upgraded to support passing in list and tuple. (#32344, #32528 #32360)
- Optimize
softmax
operator performance. (#31821) - Optimize
paddle.norm
documentation description to clarify the functional differences betweenpaddle.norm
andnumpy.linalg.norm
API. (#32530) - Optimize the printing form of data type (
datatype
) of Tensor; for example, the dtype of a float32 Tensor is changed from VarType.FP32 to paddle.float32
. (#30682) - OneDNN Functional optimization
- Upgraded oneDNN to v2.2.1. (#31067, #31473, #30295, #32227)
- Added a more precise, data-type-based oneDNN kernel selection strategy in GetExpectedKernelType. (#29840)
- Fused
layer_norm
subgraphs to singlelayer_norm
op. (#32162, #30891, #30962) - Reduced unnecessary memory allocation during creation of
elementwise_mul
operator (#30203) - Improved memory consumption used in cache per thread (#30358)
- Added oneDNN FP32 and INT8 support for vanilla LSTM (#30719 #31894)
- Added OneDNN
hardswish
support (#30211) - Added
bilinear_interp_v2
andnearest_interp_v2
oneDNN FP32 kernels (#32312)
- Updated Xbyak to v5.81 (#30809)
- Fix
paddle.io.DataLoader
to support data sets containing nested complex data formats such as list, dict and string, and fix the occasional error report and unreleased resources when the program exits during the iteration. (#31481) - Fix the problem caused by modifying the root logger of logging library in paddle. (#32706)
- Fix the problem of
L1Decay
error report in backward under dynamic graph mode. (#32718) - Fix the problem that NaN appears when setting
ignore_index
and reduction='mean' in paddle.nn.functional.cross_entropy
. (#32545) - Fix the problem that the output type is bool during the summing of bool tensor and float tensor. (#32272)
- Fix the calculation error of comparison class API in broadcast. (#32470)
- Fix the gradient calculation error under broadcast where right input is large shape in addition, subtraction, multiplication and division. (#30818)
- Fix the problem of the calculation result of segment mean OP being incorrect when processing the large shape tensor input. (#32610)
- Fix the problem of the data type of optimizer variables not matching with the data type of model parameters. (#29917)
- Fix the error report in
num_workers > 0
when thepaddle.io.DataLoader
pre-processing includes the paddle operation. (#31177) - Fix the error report when printing empty tensor. (#32501)
- Adjust the initialization order of static graph parameters, and keep consistency with dynamic graphs after adjustment, so that the same model is set with the same random seed to get the same parameters initialized in dynamic graphs and static graphs. (#32177)
- Fix the bug that
paddle.to_tensor
does not support accepting dtype=Tensor.dtype
. (#31931) - Fix the bug that the gradient is nan when 2 inputs are equal in
paddle.dist
. (#32448) paddle.nn.functional.temporal_shift
addeddata_format
property to support to set to NCHW or NHWC. (#31642)- Fix the problem of the calculation result being incorrect in
adaptive_avg_pool2d
when the input data type is float16. (#31887) paddle.nn.Layer.sublayers
andpaddle.nn.Layer.named_sublayers
: Modify theinclude_sublayers = True
parameter of originalpaddle.nn.Layer.sublayers
toinclude_self = False
, thus fixing the problem of returning null of the formerinclude_sublayers = False
. Now the default behavior is the same as that when no parameter is filled in, that is, return all recursive sublevels that don't contain themselves. Wheninclude_self = True
is the same as the literal meaning, return all recursive sublevels that contain themselves. Theinclude_sublayers
parameter inpaddle.nn.Layer.named_sublayers
is directly removed. Other behaviors remain unchanged. (#31824 )
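The Tensor.__array__ addition above relies on NumPy's standard conversion protocol; any object that implements `__array__` can be passed to `numpy.array` / `numpy.asarray`. A minimal sketch with a stand-in class (not paddle.Tensor itself):

```python
import numpy as np

class FakeTensor:
    """Stand-in for paddle.Tensor, illustrating the __array__ protocol only."""
    def __init__(self, data):
        self._data = data

    def __array__(self, dtype=None):
        # numpy.array()/numpy.asarray() call this hook to obtain an ndarray.
        arr = np.array(self._data)
        return arr.astype(dtype) if dtype is not None else arr

t = FakeTensor([1.0, 2.0, 3.0])
a = np.asarray(t)   # works because FakeTensor implements __array__
print(a.sum())      # 6.0
```

paddle.Tensor implements the same hook, which is why `numpy.asarray(tensor)` works after #32300.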
High-level API
- Add the
paddle.hub
function. Providehelp
,list
andload
functions for viewing and loading third-party models, and support the loading of remote and local repository. (#31873) - Support the mixed precision training. Provide O0, O1, O2 three modes, which correspond to FP32 training, automatic mixed precision training, pure FP16 training respectively. At present, pure FP16 training only supports static graphs. (#31417)
- Support the image transformation of the
paddle.Tensor
type, including operators such asnormalize, to_grayscale, vflip, hflip, crop, center_crop, pad, rotate, resize
. (#32705)
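The semantics of a few of these Tensor-based image transforms can be sketched in NumPy, assuming an (H, W, C) layout. This is a conceptual illustration of what the operators compute, not Paddle's implementation:

```python
import numpy as np

# Illustrative NumPy sketches of transform semantics (H, W, C layout assumed).
def hflip(img):
    return img[:, ::-1, :]        # flip along the width axis

def vflip(img):
    return img[::-1, :, :]        # flip along the height axis

def normalize(img, mean, std):
    return (img - np.asarray(mean)) / np.asarray(std)

img = np.arange(12, dtype=np.float32).reshape(2, 2, 3)
assert np.array_equal(hflip(hflip(img)), img)   # flipping twice is identity
out = normalize(img, mean=[0.5] * 3, std=[0.5] * 3)
```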
Dynamic Graphs to Static Graphs
Fixed bugs in the conversion of dynamic graphs to static graphs:
- Fix the shape returned by the static graph arange, range APIs being inconsistent with the dynamic graph. paddle.to_tensor
supports the input as int, float, bool
basic type in dynamic to static.- Support the parsing of the dict derivative syntax in the for loop. (#32159)
- Fix the problem of undeclared variables errors in the nested control flow statements in some scenarios. (#32153)
- Fix the bug that the float16 type is missed in
expand
op. (#32238) - Fix the bug of returning the gradient information as None when the shape dimension is 6 in the
expand_v2, tile, expand, expand_as, expand_as_v2, meshgrid
6 OP backward gradient solution. (#32004) - Fix the problem that the
paddle.jit.TraceLayer.save_inference_model
interface is inconsistent withpaddle.static.load_inference_model
because the network structure and parameters are not saved at the same time. (#31989)
Mixed Precision Training
- The op that does not support fp16 kernel is automatically kept as fp32 calculation in the dynamic graph mixed precision interface auto_cast. (#32543)
- Fix the unexpected error in the static graph mixed precision training caused by the incomplete statistics of the Op list (
unsupported_fp16_list
) which does not support FP16 calculation. The list of Ops that currently do not support FP16 calculation can be generated automatically according to the runtime environment. (#32102) - Optimize the update_loss_scaling for loop that launched multiple identical CUDA kernels by fusing them into one CUDA kernel. (#32554) - Optimize the slow performance in
slice
multi-dimensional cases. (#32266) - Optimize the redundant copy problem when
elementwise_add_grad
inputs and outputs are the same. (#32051) - Optimize the check_finite_and_unscale for loop that launched multiple identical CUDA kernels by fusing them into one CUDA kernel. (#31954) - Optimize the
range
parameter redundant copy problem. (#30811) - Optimize the slow performance problem of
top_k_v2
ininput_width <= 1024
. (#30403) - Migrate
where_index
CPU calculation process to GPU for completion. (#30601)
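The kernel-fusion idea behind the update_loss_scaling / check_finite_and_unscale optimizations can be pictured in NumPy: instead of issuing one scaling operation per gradient tensor (one kernel launch each on GPU), all gradients are processed in a single fused pass. A conceptual sketch, not the CUDA implementation:

```python
import numpy as np

def unscale_per_tensor(grads, scale):
    # One operation (one kernel launch on GPU) per gradient tensor.
    return [g / scale for g in grads]

def unscale_fused(grads, scale):
    # Flatten all gradients, divide once, then split back: one launch total.
    sizes = [g.size for g in grads]
    flat = np.concatenate([g.ravel() for g in grads]) / scale
    out, offset = [], 0
    for g, n in zip(grads, sizes):
        out.append(flat[offset:offset + n].reshape(g.shape))
        offset += n
    return out

grads = [np.ones((2, 2)) * 4.0, np.ones(3) * 8.0]
a = unscale_per_tensor(grads, 2.0)
b = unscale_fused(grads, 2.0)
assert all(np.array_equal(x, y) for x, y in zip(a, b))
```

Both paths produce identical results; the fused path simply amortizes the per-launch overhead.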
BF16 Training
- Added initial bf16 amp integration that modify models by adding cast ops to BF16 enabled ops in the forward pass. #31093
- Added BF16 pure_mode, which means adding support for BF16 training based on BF16-enabled ops list and enable BF16 parameters, BF16 operators, BF16 decorator for optimizer during training. #32281 #32681
- Added CPU core flags verification for BF16 fast performance support. #30551
- Unification of BF16 enablement process #31034
- Added BF16 Constant Initializer and for other initializers, add cast op to convert other initializer output to be BF16 datatype. #31935
- Added BF16 uniform random initializer #32468
- Added mechanism that converts startup_program initializers to BF16 #32720
- Added BF16 support for sgd operator CPU kernel. #32162
- Added BF16 support for lookup_table operator. #31558
- Added Sum kernel for CPU supporting BF16 and SelectedRows #32755 #32631
- Added Conv Transpose BF16 support #30877
- Added elementwise_add bf16 grad #30925
- Added reshape op BWD grad bf16 #31035
- Added broadcasting support in elementwise_add grad bf16/fp32 #31385
- Added Elementwise Mul grad fp32/bf16 #31647
- Added LSTM BF16 and fixed GRU BF16 #31234
- Added oneDNN reduce_op fp32 and bf16 kernels #31816
- Added oneDNN reduce_op GRAD fp32 and bf16 kernels #32280 #32592
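The CPU-flags verification mentioned above (#30551) amounts to checking whether the processor advertises BF16-capable instructions. A hedged, illustrative stand-in that reads Linux's /proc/cpuinfo (Paddle's actual check lives in C++ and queries oneDNN's ISA detection):

```python
import os

def cpu_supports_bf16(cpuinfo_path="/proc/cpuinfo"):
    """Return True if the CPU advertises the avx512_bf16 flag (Linux only)."""
    if not os.path.exists(cpuinfo_path):
        return False  # non-Linux platforms: conservatively report unsupported
    with open(cpuinfo_path) as f:
        return "avx512_bf16" in f.read()

print("oneDNN BF16 fast path available:", cpu_supports_bf16())
```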
Distributed Training Optimization
- New graph-based retrieval engine for training distributed graph neural network over trillion edges(#31226).
- Added index-based data sampling class to support sampling from graph and TDM/OTM tree(#31696).
- Added
paddle.distributed.send, paddle.distributed.recv, paddle.distributed.new_group, paddle.distributed.wait
to improve the distributed communication API. (#32504, #31682) - Support to initialize the
sync_parameters_buffer
in the distributed dynamic graph, which solved the issue that the buffer of the dynamic graph is not globally initialized. (#31625) - Pipeline Parallelism supports 1F1B scheduling method to optimize the memory usage of GPU. Theoretically, it is constant(#31786).
- [Hybrid Parallel] Sharding strategy optimization: support Gradients aggregation, reducing the amount of parameter communication, and improving the speed of training; Could be used flexibly with other parallelism strategies. (#31884 #32486 #32485 #31996 #31939 #31796)
- [Hybrid Parallel] Added optimizer state offload in the Sharding strategy, to reduce the memory usage of GPU. (#32134)
- [Hybrid Parallel] Support the persistence of the broadcast ID’s socket service, reduced the conflicts of ports in the hybrid parallelism. (#31589)
- [Parameter Server] Optimize the output and printing of LOG, and remove invalid logs.
- [Parameter Server] Optimize the sparse parameter storage structure, with large memory reduction for small dimensions (below 64).
- [Parameter Server] Fix the bug of access policy taking effect in the distributed prediction.
- [Parameter Server] HeterPS supports multi-node GPU training. (#31102)
Hybrid Parallelism with dynamic Graph
Support hybrid parallelism in the distributed dynamic graph mode, combining data parallelism, model parallelism, and pipeline parallelism in any configuration; in addition, AMP mixed precision and the new ReCompute strategy can be applied on top of hybrid parallelism for better efficiency.
- Support hybrid parallelism with the Fleet dynamic graph API, and any arbitrary combination of data/model/pipeline parallelism. (#32248)
- Added parameter
find_unused_parameters
in the data parallelism of the distributed dynamic graph to support control flow in the network. (#31625) - Added
VocabParallelEmbedding
,ColumnParallelLinear
,RowParallelLinear
Fleet API for model parallelism. Addedmodel_parallel_random_seed
/get_rng_state_tracker
for the random control used in model parallelism. (#32248) - Added
distributed_scaler
interface for loss scaler of AMP combined with the hybrid parallelism strategy. (#32354) - Added
PipelineLayer for partitioning the graph in pipeline parallelism, and added LayerDesc for describing a dynamic graph Layer to reduce memory used in initialization. (#32449) - Add the Recompute strategy for dynamic graphs. (#32516)
Custom OP
- Add support for using custom OP function on Mac platform. (#31976)
- Support automatic search of the C++11 header file directory on the Mac platform, compatible with the situation where multiple versions of clang exist locally.
- Add support for Op forward/backward function Attribute parameter, inferShape, and InferDtype function input parameter using the
const &
type. (#31588) - Add support for using three framework internal data types
paddle::complex64, paddle::complex128, paddle::float16
in the implementation of custom Op. (#31602, #31657, #31669, #31725) - Add support for using
std::vector<paddle::Tensor>
type parameters as input of forward/backward functions in custom Op. (#31535) - Add support for the InferShape function using Attribute parameter as input. (#31713)
- Optimize the call stack of auto-generated Python API under dynamic graph to improve the execution efficiency. (#32209)
- Reduce the error reporting condition when checking the compiler cl.exe on Windows, and enhance the self-test robustness in Windows environment. (#32769)
- Fix a bug in compiler selection when installing multiple CUDA environments on Windows. (#31694)
- Fix a bug in Python encoding issue when installing Chinese version of VS on Windows. (#31493)
- Remove the dependency on separate dynamic library files and link only the framework core dynamic library files. (#32404、#32769)
- Remove the previous old custom OP scheme and clean up the redundant library files and header files in the whl package, reducing the whl package size by about 11M. (#31813), (#32463)
Model saving and loading
paddle.save, paddle.load
supports saving and loading of Tensor. (#31756)paddle.save, paddle.load
supports saving and loading of list[Tensor], dict[Tensor], tuple[Tensor] and list, tuple, dict
nested structures containing Tensor. (#32446)paddle.save, paddle.load
supports saving and loading of Layer. (#32446)paddle.save, paddle.load
supports saving and loading of Program. (#32336)paddle.save, paddle.load
supports saving and loading of single Tensor in C++ binary format. (#32211)paddle.jit.save, paddle.jit.load
supports saving and loading of Fucntion without parameters. (#32430)
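As a rough illustration of the nested structures that `paddle.save`/`paddle.load` now round-trip, here is a minimal stand-in sketch. It uses `pickle` and NumPy arrays in place of the real Paddle APIs and Tensors (those substitutions are assumptions for illustration only); the point is the shape of the saved object, not the serialization backend.

```python
import pickle
import numpy as np

# Hypothetical stand-in state dict; real code would hold paddle.Tensor values
# and call paddle.save(state, path) / paddle.load(path) instead of pickle.
state = {
    "weights": [np.ones((2, 3)), np.zeros((3,))],              # list[Tensor]
    "stats": {"mean": np.array([0.5]), "step": np.array(10)},  # dict[Tensor]
    "pair": (np.arange(4), np.arange(4) * 2),                  # tuple[Tensor]
}

blob = pickle.dumps(state)      # stand-in for paddle.save(state, "ckpt.pdparams")
restored = pickle.loads(blob)   # stand-in for paddle.load("ckpt.pdparams")

assert np.array_equal(restored["weights"][0], state["weights"][0])
assert restored["stats"]["step"] == 10
```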
Performance optimization (including distributed)
- Optimize key operators to improve single-GPU training performance of multiple models: Deeplabv3+ single-card FP32 and AMP performance improve by 11% and 72% respectively; TSM single-card AMP performance improves by 44.5%; HRNet single-card FP32 and AMP improve by 46% and 51% respectively.
- Add `index_sample` CUDA implementation. (#30380)
- Implement CUDA kernels for the `relu` and `leaky_relu` operators, replacing the original Eigen implementation, with an overall 5% - 20% improvement in both forward and backward passes. (#31869, #31841)
- `temporal_shift` performance improves by 20% to 40%. (#31642)
- Optimize `depthwise_conv2d`: performance improves by 30% to 50% under the NHWC format. (#31667)
- Optimize `interp_bilinear_grad` operator NCHW performance, with an improvement of 19% - 303%. (#30950)
- Optimize the NCHW performance of the `adaptive_avg_pool2d` operator: in the output_size = 1 case, it improves by 80%~90%. (#31197)
- In the conv op, when dtype is float16, forward and backward support enabling `exhaustive_search`. (#30959)
- When the `weight_decay` parameter of `momentum` is set to float type, `momentum` and `L2Decay` are fused. (#30881)
- Implement a CUDA kernel for the `log_softmax` operator when `axis` is the last dimension and its size is no larger than 1024. Compared to the original Eigen implementation, forward and backward performance improves by 4.55x ~ 26.45x. (#31630, #32180)
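For reference, the computation that the new last-dimension `log_softmax` kernel accelerates can be written in a few lines of NumPy. This is only a numerically stable reference formulation of log-softmax, not the CUDA kernel itself.

```python
import numpy as np

def log_softmax_lastdim(x: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis:
    log_softmax(x) = (x - max(x)) - log(sum(exp(x - max(x))))."""
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

x = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1000.0, 1000.0]])  # naive exp(x) would overflow here
out = log_softmax_lastdim(x)

# Each row of exp(out) sums to 1, even for large inputs.
assert np.allclose(np.exp(out).sum(axis=-1), 1.0)
```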
Inference Deployment
Model Quantization
- Add the support for saving FP32 model as FP16 model. (#32112)
- Refactor the module of statistical output quantization information in dynamic graph quantization training to support multi-Block and multi-branch models to enhance generality. (#31680 #31710 #31784 #31861)
- Dynamic graph quantization training supports skipping quantization for specified OPs, and the resulting models connect correctly to the inference side. (#31704)
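Conceptually, saving an FP32 model as FP16 amounts to downcasting each floating-point weight array to half precision. The sketch below illustrates that idea in NumPy; the helper and parameter names are illustrative assumptions, not Paddle's actual implementation.

```python
import numpy as np

def cast_params_to_fp16(params: dict) -> dict:
    """Downcast every float32 parameter to float16; leave other dtypes intact."""
    return {name: w.astype(np.float16) if w.dtype == np.float32 else w
            for name, w in params.items()}

params = {
    "fc.w": np.linspace(-1.0, 1.0, 16, dtype=np.float32).reshape(4, 4),
    "step": np.array(3, dtype=np.int64),  # non-float tensors stay untouched
}
fp16 = cast_params_to_fp16(params)

assert fp16["fc.w"].dtype == np.float16
assert fp16["step"].dtype == np.int64
# The cast is lossy: float16 keeps roughly 3 decimal digits of precision.
assert np.allclose(fp16["fc.w"].astype(np.float32), params["fc.w"], atol=1e-3)
```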
Paddle Inference
Function Upgrade
- Release the C API (experimental). The new C API is functionally on par with the C++ API. (#32225)
- The inference framework's Python interface supports custom operators from training. After loading a custom operator during training, users can deploy and run inference models containing this custom operator directly through PaddlePredictor, just like the framework's native operators. (#32533)
- The underlying implementation of Tensor has been refactored internally to decouple from the old ZeroCopyTensor data structure. This upgrade does not involve user API changes and is transparent to users. (#31402)
- Support TensorRT serialization and deserialization when loading models from memory. (#31342)
Performance Optimization
- Support quantized ERNIE models inferred with mixed precision using TensorRT, where Matmul is computed in Int8 precision and the other parts in FP16 precision. Compared with pure FP16 inference, the inference performance of the standard ERNIE model on the XNLI dataset improves from 1898 seq/s to 2310 seq/s at batch size=40 on T4, a 17.8% improvement. (#32232)
Ease-of-use optimization
- Add error messages for the case where the user enables TensorRT variable-length input settings but provides an input shape of the wrong size. (#32155)
- Add a runtime TensorRT version check: if the major version of TensorRT at runtime differs from the one used at compile time, a warning is generated. (#32443)
- Add a TensorRT VERBOSE-level log switch. Users can enable TensorRT VERBOSE logs via `export GLOG_v=3` to print more debugging information. (#32459)
BugFix
- Fix the error of an unspecified GPU or insufficient GPU memory being reported at the end of prediction. (#32655)
- Fix the CPU performance issue caused by denormal floating-point values produced by native inference in dynamic graphs. (#32350)
- Fix the problem that a calibration table path had to be set when reading a PaddleSlim quantized model from memory with TensorRT inference enabled. (#32676)
- Upgrade the TensorRT quantization calibration table interface, fixing the problem that TensorRT offline quantization is not supported on DLA. (#31060)
- Fix the problem that cropping the number of Attention heads is not supported when using the variable-length method (EnableTensorRtOSS) for ERNIE/BERT model inference. (#31497)
- Fix the occasional diff problem caused by the unstable order of QK inputs in BERT models trained after version 2.0. (#32659)
- Fix the problem that the ERNIE model reports an error or produces incorrect results due to the wrong order of input variable names when TensorRT varlen acceleration is enabled. (#32482)
- Fix the bug that the TensorRT plugin ElementwisePluginDynamic fails to serialize. (#31587)
- Fix the problem of subsequent OP dimension errors caused by the FC layer padding dimensions with 1 under TensorRT dynamic shape. (#32458, #31803)
- Fix the `repeated_fc_relu_fuse_pass.cc` error when FC uses Padding. (#32648)
- Fix the problem of the conv2d_transpose op producing wrong results when using TensorRT inference. (#32593)
- Fix OCR INT8 model oneDNN prediction errors caused by an incorrect NAN comparison. (#32227)
- Fix the data contention problem when deploying multiple models for oneDNN prediction with multiple threads on multiple executors. (#32499, #32136, #32664)
Environment Adaptation
Compile and install
- Add support for CUDA 11.2 compilation, including compilation for the 3070/3080/3090 GPU architectures. (#31529)
- Add support for compilation with Windows Visual Studio 2017, comprehensively upgrading all supporting facilities such as release, CI/CE, and compilation documentation from VS2015 to VS2017. (#311652)
- Add support for a CUDA 11.2 Docker image. (#32531)
- The CUDA 10.1 image supports gcc 5.4. (#32531)
- Add support for Python 3.9 in Docker images. (#32385)
- Fix bugs in the `run_check` interface and add a dynamic graph check to it: the `run_check` logic for detecting the Paddle installation now first checks whether the user's machine has a GPU, and if not, reports a warning instead of an error, so users who install the CPU package are not affected. (#32428)
- Fix the problem of the missing symlink method on Windows systems. (#31006)
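The amended `run_check` detection logic described above can be paraphrased in plain Python. The function signature, the `gpu_available` flag, and the messages below are hypothetical stand-ins for illustration, not Paddle's actual implementation.

```python
def run_check(gpu_available: bool, package: str) -> str:
    """Sketch of the installation check: detect a GPU first, and only warn
    (rather than fail) when none is found, so CPU-package users pass."""
    if not gpu_available:
        return ("WARNING: no GPU detected; verified the {} package "
                "on CPU only".format(package))
    return "PaddlePaddle {} package is installed successfully".format(package)

assert run_check(False, "cpu").startswith("WARNING")
assert "successfully" in run_check(True, "gpu")
```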
New hardware training support
- Add support for Hygon chips: PaddlePaddle, based on ROCm version 4.0.1, can train and infer models on Hygon CPUs and DCUs. A total of 36 models across 7 categories (image classification, object detection, image segmentation, natural language processing, recommendation systems, video classification, and speech synthesis) have been validated. (#29342, #30758, #30639, #31009, #31077, and more)
- Add support for Ascend chips: support single-host, multi-accelerator training on Ascend NPUs. (#31957, #32381, #32197, and more)
- Kunlun hardware training support
- Kunlun XPU supports dynamic graph distributed training. (#30455, #30671)
- Kunlun XPU supports fleet distributed training. (#30858)
- Kunlun XPU supports spawn to start multi-card training and optimize XPU dynamic graph multi-card performance. (#31130)
- Kunlun XPU static graph multi-card supports the optimization of fuse allreduce and gradient merge. (#31104)
- Support Kunlun XPU in the exposed all_reduce/reduce collective communication APIs. (#32303)
- Fix the bug of the random hang of Kunlun XPU dynamic graph multi-card. (#32662)
Thanks to our Contributors
This release contains contributions from:
123malin, Adam Osewski, alncat, arlesniak, AshburnLee, Aurelius84, Bai Yifan, Baibaifan, Bin Lu, cc, ceci3, chajchaj, chalsliu, channings, Chen Long, Chen Weihang, chen zhiyu, Chengmo, chentianyu03, cnn, CtfGo, cucuzg, danleifeng, denglin-github, Double_V, fangshuixun007, Feiyu Chan, fluffyrita, FlyingQianMM, FNRE, furnace, GaoWei8, GeminiCarrie, gongweibao, Gradie, GT-Zhang, Guanghua Yu, Guo Sheng, guofei, hong, houj04, huangjun12, huangxu96, Huihuang Zheng, hutuxian, iducn, Jacek Czaja, Jack Zhou, jakpiase, JamesLim, Jiabin Yang, jiangcheng, Jiaqi Liu, Jiawei Wang, joanna.wozna.intel, joejiong, JZ-LIANG, Kaipeng Deng, Kqnonrime, kuizhiqing, Lei.C, Leo Chen, lidanqing, LielinJiang, lijianshe02, lilong12, limingshu, littletomatodonkey, liu zhengxi, LiuChiachi, liuyuhui, liym27, LoveAn, LutaoChu, minghaoBD, mls1999725, niuliling123, Ouyang Chao, pangyoki, parap1uie-s, Pei Yang, procr, Qi Li, qingqing01, QingshuChen, Ren Wei (任卫), ronnywang, ruri, seemingwang, Shang Zhizhou, shanliang1992, ShenLiang, Shibo Tao, Steffy-zxf, syyxsxx, taixiurong, tangwei12, Tao Luo, Thomas Young, Thunderbrook, tianshuo78520a, TTerror, wangchaochaohu, wangguanzhong, wanghuancoder, wangna11BD, WangXi, wangxinxin08, wawltor, Wei Shengyu, weihaoji, WeiXin, wenbin, Wenyu, whs, Wilber, winter-wang, Wojciech Uss, wuhuanzhou, wuyefeilin, XGZhang, XiangGao, XiaoguangHu, xiaoting, xiegegege, xiemoyuan, xingfeng01, Yang Zhang, yaoxuefeng, yiak, yingshengBD, yinhaofeng, Yiqun Liu, ykkk2333, yongqiangma, Yuang Liu, yukavio, YUNSHEN XIE, Y_Xuan, Zhang Jun, Zhang Ting, zhang wenhui, Zhang Zheng, zhangchunle, Zhen Wang, zhiboniu, Zhong Hui, Zhou Wei, zhulei, zhupengyang, zlsh80826, 卖鱼的哲学, 石晓伟