PaddlePaddle 2.0.0-rc1
Release Note
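Important updates
Paddle framework 2.0-RC1 includes the following important updates:
- Installation environment: officially released installation packages supporting CUDA 11 (experimental) and installation packages supporting the Baidu Kunlun chip (experimental)
- API functions: paddle.Tensor supports numpy-compatible indexing and slicing (basic indexing); the axis parameter is removed from some APIs in favor of numpy-compatible broadcast semantics; some APIs are added, some are improved, and some API bugs are fixed
- Dynamic-to-static conversion: more Python syntax is supported when converting dynamic graphs to static graphs, and paddle.jit.not_to_static can mark functions that should not be converted
- Framework functions: paddle.Tensor.backward() can be called multiple times to accumulate gradients, which is equivalent to the gradient computed with a larger batch size; the C++ error stack is hidden by default and the error format is optimized; distributed training supports heterbox heterogeneous training
- Framework performance: mixed precision training supports a pure FP16 mode, with ResNet50 reaching 1400+ samples/sec when training on a single V100 card; distributed training performance is optimized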
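Forward-looking preview
- The Paddle framework plans to drop support for python2 and python3.5 in a future version. We recommend upgrading python to 3.8 to use Paddle
- The Paddle framework plans to drop support for CUDA 9.0 in a future version. We recommend upgrading your CUDA version to use Paddle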
Training framework
Basic API (including distributed)
New APIs
- Add paddle.log2
- Add paddle.log10
- Add paddle.nn.initializer.set_global_initializer
- Add paddle.median
- Add paddle.broadcast_shape, which computes the shape that results from broadcasting two tensor shapes
- Add paddle.vision.ops.deform_conv2d and paddle.vision.ops.DeformConv2d
- Add paddle.subtract
- Add paddle.optimizer.lamb
- Add Tensor-related APIs: Tensor.cpu, Tensor.cuda(idx), Tensor.pin_memory, Tensor.is_leaf, Tensor.clone
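Fixed and improved APIs
- paddle.multiply: remove the axis parameter
- paddle.pow: remove type promotion
- paddle.add, paddle.subtract, paddle.multiply, paddle.divide, paddle.matmul, paddle.reshape, paddle.transpose, paddle.kron, paddle.trace, and paddle.sum support the complex64 and complex128 data types
- Remove the axis parameter from paddle.maximum and paddle.minimum
- multiplex supports dynamic graphs
- CrossEntropyLoss: add soft_label and axis, modify the shape, and improve performance
- The size parameter of paddle.nn.functional.interpolate accepts Tensor input
- paddle.nn.functional.pad: add padding for the N and C dimensions in constant mode
- paddle.optimizer.momentum supports resuming training
- Fix the error raised when a BatchNorm whose weight_param name was specified before conversion is converted to SyncBatchNorm with paddle.nn.SyncBatchNorm.convert_sync_batchnorm
- paddle.to_tensor accepts the place of another Tensor directly when selecting a device
- Optimize the performance of Tensor.detach: the result shares memory with the original Tensor, saving one memory copy, and is not kept in the original computational graph
- In static graph mode, add support for getting the learning rate through paddle.optimizer.get_lr()
- Fix the abnormal error raised by paddle.Embedding on GPU when out-of-range IDs are used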
Removed APIs (including aliases)
- Remove the APIs under the complex module: paddle.complex.matmul, paddle.complex.reshape, paddle.complex.transpose, paddle.complex.kron, paddle.complex.trace, paddle.complex.sum, paddle.complex.elementwise_add, paddle.complex.elementwise_sub, paddle.complex.elementwise_mul, paddle.complex.elementwise_div
- Remove sigmoid_cross_entropy_with_logits from paddle.nn.functional
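High-level API
- Add the API paddle.callbacks.ReduceLROnPlateau
- Add the API paddle.callbacks.LRScheduler
- Add the API paddle.vision.datasets.FashionMnist
- The places parameter of paddle.io.DataLoader becomes optional. When it is left at its default value None, paddle.CPUPlace() or paddle.CUDAPlace(0) is selected automatically. The places parameter will be removed in a later version
- paddle.io.DataLoader supports disabling automatic batching by setting batch_size=None
- Add the API paddle.io.ComposeDataset, which concatenates multiple datasets into one dataset by field
- Add the API paddle.io.ChainDataset, which merges multiple datasets into one dataset by sample
- Add the API paddle.io.WeightedRandomSampler, which samples randomly according to specified weights
- Add the APIs paddle.vision.ops.yolo_loss and paddle.vision.ops.yolo_box
- Add the API paddle.flops
- Add the API paddle.callbacks.EarlyStopping
- Update the API model.save so that the saved file format is consistent with the low-level API
- Fix the error raised when saving an inference model if the dynamic graph input dtype is not float32 and inputs are not provided when initializing Model
- paddle.metric.Accuracy accepts multi-dimensional Tensor input and supports both rank-1 labels and one-hot labels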
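Function optimization (including distributed)
Dynamic graph basic functions
- Support correct type promotion when Tensors and Scalars are combined with operators
- Fix the interference between train/eval mode switches of multiple models. Dynamic graph Layer.eval() is decoupled from no_grad: before the change, Tracer did not record the backward pass after Layer.eval() was called; after the change, it still records it. If the backward pass is not needed, use paddle.no_grad
- Support modifying Tensor data by index or slice
- Add an inplace backward check module that detects whether a forward inplace operation affects the correctness of the gradient calculation
- With Tensor.backward() automatic differentiation, the gradient is now accumulated onto the previous gradient, which in effect enlarges the batch_size
- Support SE-ResNext oneDNN dynamic graph training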
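Dynamic graph to static graph
New syntax
- Support the isinstance syntax inside dynamic-to-static loops
- Support assigning a shape to a tuple in dynamic-to-static conversion, such as a, b, c, d = tensor.shape
- Python's and/or expressions evaluate their operands from left to right, and the right operand is not executed if the left operand already determines the logical value. logical_and/logical_or in dynamic-to-static graphs previously handled this case incorrectly; short-circuit evaluation is now supported
- Support function signatures that contain **kwargs
- Support decorating a function with jit.not_to_static so that it is not converted during dynamic-to-static conversion
- Support the Python dictionary syntax dict.pop()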
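Bug fixes
- Fix model saving failure caused by an uninitialized variable representing drop_state when saving the lstm interface in dynamic-to-static conversion
- Fix variable analysis for nested loops
- Fix return handling in some special cases
- Fix the handling of list comprehensions and variable analysis in if-else
- Fix iteration variables in some special cases
- Fix the inconsistent behavior of the transpose API between dynamic and static graphs so that it supports dynamic-to-static conversion
- Fix the inconsistent behavior of the concat API between dynamic and static graphs so that it supports dynamic-to-static conversion
- Optimize some dynamic-to-static error messages so that the reported error location is more accurate
- Fix convert_call being called recursively more than once in special cases
- Fix the dynamic-to-static problem caused by different out.dtype checks in the 2.0 API
- Fix the problem that x.shape == y.shape compares lists and returns True/False in a dynamic graph but is overloaded as an elementwise comparison in a static graph; after conversion to a static graph the elementwise result is now reduced
- Fix param_guard not covering hooks
- Fix the problem that some parameter variables created when init runs in dynamic graph mode cannot be assigned in a static graph because they are not static graph variables
- Fix the problem that non-parameter variables defined by users in __init__ cannot be modified and updated correctly
- Fix the incorrect conversion of the third-party library logging during dynamic-to-static conversion
- Fix the incorrect AST transformation of the for-enumerate syntax
- Fix some warning messages being displayed multiple times in a loop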
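Mixed precision training
- Support a more aggressive FP16 training mode (pure FP16 training). To ensure model convergence, the Momentum optimizer gains the multi_precision and rescale_grad attributes; multi_precision indicates that the optimizer needs to maintain a copy of master weights
- With pure FP16 training, ResNet50 reaches 1400+ samples/sec on a single V100 card with 16GB of video memory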
Model quantization
- Dynamic graph quantization supports skipping specified Layers
- Dynamic graph quantization supports the 2.0 API Conv and Linear
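Distributed training optimization
- Support launching low-level distributed APIs such as all_gather through the paddle.distributed.spawn interface
- Support heterbox heterogeneous training
- Pipeline parallelism supports the Executor.run interface, improving usability
- The Launch interface is upgraded to support specifying the number of processes on a single node
- Sharding supports multi-card training of models with tens of billions of parameters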
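Model saving and loading
- A Layer with multiple methods converted by paddle.jit.to_static can still be loaded with paddle.jit.load after being saved with paddle.jit.save, and all of the converted methods remain usable
- A Layer loaded with paddle.jit.load can still be saved correctly with paddle.jit.save after being fine-tuned or used as a sub-Layer of another Layer
- Extend paddle.jit.save to support saving paddle.DataParallel models
- Improve the paddle.static.load_program_state experience: when var_list is not specified, stray files in the load directory now produce a warning instead of an error
- paddle.jit.save supports InputSpec of dict type
- paddle.onnx.export supports exporting a dynamic graph model to the ONNX file format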
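Performance optimization (including distributed)
- Improve the CPU performance of RNN-type OPs (LSTM, GRU, SimpleRNN). Compared with version 2.0-rc, the forward and backward performance of LSTM, GRU, and SimpleRNN improves significantly
- Optimize FastThreadedSSAGraphExecutor scheduling and fix the case where communication and computation do not overlap in the communication synchronization scenario; resnet50 on 4 machines with 32 cards improves by about 0.3%
- Optimize paddle.fleet amp distributed performance and fix the case where the last communication does not overlap with computation; fp16 performance on 4 machines with 32 cards improves by about 0.5%
- Optimize the performance of the distributed communication component Communicator. In GEO-400 mode, W2V throughput and Simnet-Bow performance improve significantly. In Async mode, compared with Paddle framework 1.8, W2V throughput improves by 11% and CTR-DNN performance improves by 14%
- Optimize parameter server mode when the Worker is a GPU device by reducing the copy time of Embedding table lookups; training throughput of the CTR-DNN model improves significantly
- Distributed GPU dynamic graph training overlaps computation and communication and lets users configure options such as the group size for gradient fusion at a fine granularity. Multi-node performance improves by more than 5% on ResNet152 and Bert, and by more than 3% on ResNet50
- Improve the performance of cumsum on GPU
- Improve the performance of Resnet50 oneDNN dynamic graph training. Resnet50 oneDNN dygraph training is currently 6.4 times faster than CPU training
- Add cudnn support for GRU and SimpleRNN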
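Debugging and analysis
- Align the exception types raised on the Paddle Python side with native Python exception types
- Hide the C++ error stack by default, optimize the error format shown after hiding it, and remove the Error Message Summary separator so that the format matches native Python errors
- Improve the error messages of some static module APIs when they are used outside static graph mode, covering 9 APIs: static.append_backward, static.gradients, static.scope_guard, static.Print, static.nn.embedding, static.nn.data_norm, static.nn.multi_box_head, static.nn.nce, and static.nn.py_func
- Improve the error message shown when a Tensor passed in dynamic graph mode is None
- Further optimize the format of printed tensors in dynamic graph mode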
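Compilation and installation
New support
- (experimental) Release installation packages supporting cuda11
- Upgrade NCCL to version 2.7.8 in the Paddle images for cuda10.1 and later and in the CI system images
- Release installation packages supporting xpu
- Release installation packages supporting jetpack and the C++ inference library supporting nv_jetson
Experience optimization
- Fix the joint-build strategy and release the gpu package containing tensorrt separately, so that users installing other GPU packages no longer get errors about missing tensorrt
- Remove the installation dependencies scipy, rarfile, prettytable, and pathlib
- Optimize the installation documentation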
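Bug fixes
- Fix GPU card 0 using more video memory than the other cards during multi-card training
- Fix incorrect shape inference in the tile op
- Fix the large number of invalid escape sequence warnings emitted when using paddle
- Fix the bug in paddle.full when setting INF, NAN, NINF, and similar values
- Fix multi-nccl comm settings in paddle.fleet not taking effect, and add a warning that multi-nccl comm communication does not overlap in synchronous mode
- Fix paddle.framework.seed not behaving as expected with TruncatedNormal initialization
- Fix the inconsistent behavior of the exclusive parameter of AvgPool-related APIs in dynamic-to-static conversion; fix ceil_mode argument passing in MaxPool-related APIs
- Fix incorrect paddle.topk results on GPU
- Fix the missing overwrite option in the fluid.layers.nn.gather dynamic graph API
- Fix Windows terminals not recognizing an empty CUDA_VISIBLE_DEVICES; setting it to an empty string makes the framework run in CPU mode
- Fix optimizer.state_dict/set_dict failing to save and load recursively when LinearLrWarmup recursively contains a Learning Rate Scheduler
- Fix the single-machine training performance drop of the ptb lm model
- Fix the gradient calculation bug when softmax_with_cross_entropy uses ignore_index
- Fix the list of parameters to decay being empty when fetched again after the first execution of AdamW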
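Inference
Paddle Inference
Function upgrade
- Paddle 2.0 adds or upgrades some operators. Starting from this version, versioning rules for forward operators are defined together with compatibility constraints. Aligning operator versions between frameworks ensures that the same operator version has a consistent definition and behavior in different frameworks, which improves the overall robustness of the framework
- Add the TryShrinkMemory interface, which reduces the GPU/CPU memory used by an application by releasing temporary tensors. For a demo, see Paddle-Inference-Demo
- Paddle-TRT supports the clip op and supports running the classification model GhostNet under Paddle-TRT
- Paddle-TRT int8 inference supports models containing mul ops with channelwise quantization, and supports running PaddleSlim-quantized PaddleOCR detection and recognition models under Paddle-TRT int8
- The load_inference_model and save_inference_model APIs are migrated to paddle.static, improving ease of use while remaining compatible with the old interfaces
- Add six APIs, serialize_program, deserialize_program, serialize_persistables, deserialize_persistables, save_to_file, and load_from_file, for serializing/deserializing a program, serializing/deserializing params, and saving models/parameters to a file or loading them from a file
- Support BF16 inference for some models. Currently resnet50, googlenet, mobilenetv1, and mobilenetv2 are supported
- Add version compatibility support for some oneDNN operators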
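Performance optimization
- With TensorRT enabled, the ERNIE model adds support for variable-length input, improving performance by 147%. With cuda10.1, cudnn 7.6, tensorrt 6.0, OSS 7.2.1, the ernie-base-2.0 model, the QNLI dataset, and input BatchSize = 32, performance on an Nvidia Tesla T4 rises from 905 sentences/s to 2237 sentences/s. Example code: Paddle-Inference-Demo/c++
- Improve oneDNN INT8 GRU performance. The GRU INT8 model runs inference 1.65 times as fast as the original Paddle NativeConfig float32 model (threads = 1, batch_size = 50)
- Add oneDNN batchnorm + activation fuse support, which improves pvanet_ocr model performance by 2.8%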
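Bug fixes
- Fix models with avg pooling or global pooling producing wrong results, exiting with errors, or hanging on jetson devices
- Fix the shape of a TensorRT subgraph output Tensor being truncated incorrectly when it ends in x1 during TensorRT dynamic shape inference
- Fix config.pass_builder()->DeletePass() not taking effect when TensorRT inference is used
- Fix the performance of some models depending on the values of matmul operator weights
- Fix slow inference when CPU oneDNN loads multiple models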
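Model upgrade
PaddleDetection
- Upgrade dynamic graph models:
- Faster RCNN, Faster FPN, Mask RCNN, Mask FPN, Cascade RCNN, Cascade Mask, and YOLOv3 match the accuracy of their static graph versions
- Support dynamic-to-static conversion and integration with Paddle Inference, matching static graph accuracy and speed
- Release the real-time instance segmentation model SOLOv2. Compared with competing models, accuracy improves by 2.4 points, inference speed improves by 31.2%, and training is 2.4 times as fast
- Add Android mobile detection demos, including SSD and YOLO series models
- Add the new PACT quantization strategy. YOLOv3-Mobilenetv3 on the COCO dataset improves by 0.7% over ordinary quantization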
PaddleSlim
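- Support dynamic graph compression
- Add dynamic graph pruning and quantization-aware training
- Add channel count alignment to pruning, so that the resulting model is easier for the inference library to accelerate
- PACT quantization training becomes a built-in method that users can call directly
- Add the OFA model compression technique. TinyERNIE runs 40% faster after compression with no loss of accuracy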
PaddleSeg
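- Release the new 2.0-rc version, fully upgraded to dynamic graph, supporting 15+ segmentation models, 4 backbone networks, 3 datasets, and 4 losses:
- Segmentation models: ANN, BiSeNetV2, DANet, DeeplabV3, DeeplabV3+, FCN, FastSCNN, Gated-scnn, GCNet, HarDNet, OCRNet, PSPNet, UNet, UNet++, U^2Net, Attention UNet
- Backbone networks: ResNet, HRNet, MobileNetV3, Xception
- Datasets: Cityscapes, ADE20K, Pascal VOC
- Loss: CrossEntropy Loss, BootstrappedCrossEntropy Loss, Dice Loss, BCE Loss
- Provide 40+ high-quality pre-trained models based on the Cityscapes and Pascal Voc datasets
- Support multi-card GPU parallel evaluation with efficient metric computation. Support multiple evaluation methods such as multi-scale evaluation, flip evaluation, and sliding window evaluation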
PaddleClas
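- Release the new 2.0-rc1, fully upgraded to dynamic graph, supporting 23 series of classification network structures and 135 pre-trained image classification models. These include 14 practical SSLD distillation models, which generally improve on the baseline models by more than 3%. Three new model series are added: ResNeSt, RegNet, and GhostNet
- Based on dynamic graph, provide mixed precision training and DALI-based training
- Based on dynamic graph, provide three deployment modes: offline inference deployment, service deployment, and on-device deployment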
PaddleOCR
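- Release the new 2.0-rc1, with the PP-OCR series models upgraded to dynamic graph. Provide an 8.1M ultra-lightweight Chinese and English OCR model, a general Chinese and English OCR model, and more accurate multilingual recognition models (English with digits, French, German, Japanese, Korean). Support two deployment modes: offline inference deployment and service deployment
- Release Style-Text, a general-purpose text data synthesis tool
- Release PPOCRLabel, a text data annotation tool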
PaddleRec
- Release models: gru4rec, deepfm, mmoe, dnn, and LR, all supporting dynamic graph
PaddleGAN
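- Release models: Pixel2Pixel, CycleGAN, PSGAN, UGATIT, ESRGAN, CGAN, DCGAN
- Provide 10 pre-trained models for style transfer, makeup transfer, colorization, super-resolution, anime-style conversion of people and scenes, and more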
PaddleNLP
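- Release version 2.0-beta with full support for dynamic graph mode. Provide the PaddleNLP core library, deeply integrated with the high-level API and installable with pip, giving developers best practices for the text domain in Paddle 2.0
- Add the text graph learning model ERNIESage, the generative pre-trained model ERNIE-Gen, the open-domain dialogue generation model PLATO-2, the semantic matching model SentenceTransformer, the time series prediction model TCN, and more
- Further enrich the pre-trained language models, with 22 pre-trained models in total including ERNIE, BERT, RoBERTa, and ELECTRA, of which 11 are Chinese pre-trained models
- Add 8 common evaluation metrics for text tasks, such as Perplexity, BLEU, and Rouge-L, adapted to the Paddle 2.0 Metrics API for ease of use
- Add 25 datasets for text classification, sequence labeling, machine translation, reading comprehension, and more, adapted to the Paddle 2.0 Dataset API for quick one-step loading
- Add the Embedding API, which includes 38 Chinese word embeddings and supports fast loading and word-level semantic distance calculation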
Parakeet
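- Release version 2.0-alpha, providing the Parakeet core library, improved Chinese documentation, and pip installation
- Fully upgrade the speech synthesis model framework and unify the interface of the text front end. All models are upgraded to the Paddle 2.0 API, including TransformerTTS, Waveflow, and Wavenet, and the Tacotron2 model is added
- Provide more reusable network building blocks for flexible model construction. Optimize data processing and loading to speed up training
- Add the experiment module, which standardizes the experiment workflow to ease experiment management and further development, and provides sample experiment code for the existing models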
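Tools and components
PaddleHub
- Release version 2.0-rc, fully migrated to the dynamic graph programming model, making model development and debugging more convenient and the finetune interface more flexible and easier to use
- Fully upgrade transfer learning for vision tasks, supporting image classification, image colorization, style transfer, and other tasks
- Upgrade Transformer models such as BERT, ERNIE, and RoBERTa to dynamic graph, supporting Fine-Tune for text classification
- Optimize Serving for service deployment, supporting multi-card inference and automatic load balancing, with greatly improved performance
- Add Auto Augment, an automatic data augmentation capability that efficiently searches for the combination of augmentation policies suited to a dataset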
X2Paddle
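- Release version 1.0.0-rc0 with full support for the PaddlePaddle dynamic graph API
- Add PyTorch model conversion, supporting both Tracing and Scripting
- Add conversion from Caffe/ONNX/Tensorflow to Paddle 2.0 dynamic graph
- Add the Optimizer module, which mainly provides op fusion and op elimination, improving the readability of the converted model code and the inference performance of the model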
Kunlun hardware
Models adapted to Kunlun hardware
- The Resnet50, mobilenetv3, deeplabv3, bertbase, and DQN static graph models are adapted to Kunlun hardware
Release Note
Release note
The Paddle framework 2.0-RC1 version has the following updates:
- Installation environment: official release of the binary package supporting CUDA 11 (experimental); official release of the binary package supporting the Baidu Kunlun chip (experimental)
- API function: paddle.Tensor supports numpy-compatible indexing and slicing operations (basic indexing); the axis parameter is removed from some APIs in favor of numpy-compatible broadcast semantics; some new APIs are added, some APIs are improved, and some API bugs are fixed
- Dynamic-to-static conversion: support more Python syntax when converting dynamic graphs to static graphs, and support marking functions that should not be converted with paddle.jit.not_to_static
- Framework function: support calling paddle.Tensor.backward() multiple times to accumulate the gradient, which is equivalent to the gradient calculated after increasing the batch size. By default, the C++ error stack is hidden and the error reporting format is optimized. Distributed training supports heterbox heterogeneous training
- Framework performance: mixed precision training supports pure FP16 mode. ResNet50 training on a single V100 card reaches 1400+ samples/sec. The performance of distributed training is optimized
Forward-looking preview
- The Paddle Framework plans to drop support for python2 and python3.5 in a future version. It is recommended that you upgrade python to 3.8 to use Paddle
- The Paddle Framework plans to drop support for CUDA 9.0 in a future version. It is recommended that you upgrade your CUDA version to use Paddle
Training framework
Basic API (including the distributed)
New APIs
- Add paddle.log2
- Add paddle.log10
- Add paddle.nn.initializer.set_global_initializer
- Add paddle.median
- Add paddle.broadcast_shape, which calculates the shape that results from broadcasting two tensor shapes
- Add paddle.vision.ops.deform_conv2d and paddle.vision.ops.DeformConv2d
- Add paddle.subtract
- Add paddle.optimizer.lamb
- Add the Tensor-related APIs Tensor.cpu, Tensor.cuda(idx), Tensor.pin_memory, Tensor.is_leaf, and Tensor.clone
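As a rough illustration of what a shape-broadcast calculation involves, the sketch below applies the numpy-style rule in plain Python. It does not call Paddle; the function name sketch_broadcast_shape is hypothetical and only mirrors the idea behind paddle.broadcast_shape, whose exact signature and error behavior should be checked in the Paddle documentation.

```python
from itertools import zip_longest


def sketch_broadcast_shape(x_shape, y_shape):
    """Return the shape produced by broadcasting two shapes, numpy style."""
    result = []
    # Align the shapes from the trailing dimension; a missing dimension counts as 1.
    for a, b in zip_longest(reversed(x_shape), reversed(y_shape), fillvalue=1):
        if a == b or b == 1:
            result.append(a)
        elif a == 1:
            result.append(b)
        else:
            raise ValueError(f"shapes {x_shape} and {y_shape} cannot be broadcast")
    return list(reversed(result))


out = sketch_broadcast_shape([2, 1, 3], [1, 3, 1])
print(out)
```

Dimensions are compared right to left, and a dimension of size 1 stretches to match the other operand.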
Fix and improve APIs
- paddle.multiply: remove the axis parameter
- paddle.pow: remove type promotion
- paddle.add, paddle.subtract, paddle.multiply, paddle.divide, paddle.matmul, paddle.reshape, paddle.transpose, paddle.kron, paddle.trace, and paddle.sum support the complex64 and complex128 data types
- Remove the axis parameter from paddle.maximum and paddle.minimum
- multiplex supports dynamic graphs
- CrossEntropyLoss: add soft_label and axis, modify the shape, and improve performance
- The size parameter of paddle.nn.functional.interpolate accepts Tensor input
- paddle.nn.functional.pad: add padding for the N and C dimensions in constant mode
- paddle.optimizer.momentum supports resuming training
- Fix the error raised when a BatchNorm whose weight_param name was specified before conversion is converted to SyncBatchNorm with paddle.nn.SyncBatchNorm.convert_sync_batchnorm
- paddle.to_tensor accepts the place of another Tensor directly when selecting a device
- Optimize the performance of Tensor.detach: the result shares memory with the original Tensor, saving one memory copy, and is not kept in the original computational graph
- In static graph mode, add support for getting the learning rate through paddle.optimizer.get_lr()
- Fix the abnormal error raised by paddle.Embedding on GPU when out-of-range IDs are used
Remove API (including aliases)
- Remove the api under complex module: paddle.complex.matmul, paddle.complex.reshape, paddle.complex.transpose, paddle.complex.kron, paddle.complex.trace, paddle.complex.sum, paddle.complex.elementwise_add, paddle.complex.elementwise_sub, paddle.complex.elementwise_mul, paddle.complex.elementwise_div
- Remove the sigmoid_cross_entropy_with_logits in the paddle.nn.functional
High-level API
- Add api paddle.callbacks.ReduceLROnPlateau
- Add api paddle.callbacks.LRScheduler
- Add api paddle.vision.datasets.FashionMnist
- In paddle.io.DataLoader, the places parameter becomes optional. When it is left at its default value None, paddle.CPUPlace() or paddle.CUDAPlace(0) is selected automatically. The places parameter will be removed in a later version
- paddle.io.DataLoader supports disabling automatic batching by setting batch_size=None
- Add the api paddle.io.ComposeDataset for concatenating multiple datasets into one dataset by field
- Add the api paddle.io.ChainDataset for merging multiple datasets into one dataset by sample
- Add the api paddle.io.WeightedRandomSampler for random sampling with the specified weights
- Add the api paddle.vision.ops.yolo_loss and paddle.vision.ops.yolo_box
- Add the api paddle.flops
- Add the api paddle.callbacks.EarlyStopping
- Update the api model.save. The saved file format is consistent with the low-level API
- Fix the error raised when saving an inference model if the dynamic graph input dtype is not float32 and inputs are not provided when initializing Model
- paddle.metric.Accuracy accepts multi-dimensional Tensor input and supports both rank-1 labels and one-hot labels
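The difference between combining datasets "by field" and "by sample" can be sketched with plain Python lists. This is only an analogy under the assumption that the two dataset classes behave as the descriptions above say; it does not use paddle.io, and the sample values are made up.

```python
from itertools import chain

# Hypothetical toy data standing in for three datasets.
images = ["img0", "img1", "img2"]
labels = [0, 1, 0]
extra_images = ["img3", "img4"]

# "By field": sample i of the result joins the fields of sample i from
# every source, so the sources must have the same length.
composed = [(img, lab) for img, lab in zip(images, labels)]

# "By sample": the result yields every sample of the first source,
# then every sample of the next one.
chained = list(chain(images, extra_images))

print(composed[1])
print(len(chained))
```

Composing widens each sample while keeping the sample count; chaining keeps each sample unchanged while growing the sample count.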
Function optimization (including distributed)
Dynamic graph basic functions
- Support correct type promotion when Tensors and Scalars are combined with operators
- Fix the interference between train/eval mode switches of multiple models. Dynamic graph Layer.eval() is decoupled from no_grad: before the change, Tracer did not record the backward pass after Layer.eval() was called; after the change, it still records it. If the backward pass is not needed, use paddle.no_grad
- Support modifying Tensor data by index or slice
- Add an inplace backward check module to detect whether a forward inplace operation affects the correctness of the gradient calculation
- With Tensor.backward() automatic differentiation, the gradient is now accumulated onto the previous gradient, which in effect enlarges the batch_size
- Enable SE-ResNext oneDNN dygraph training
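Why accumulated gradients behave like a larger batch can be checked with a small worked example. The sketch below is plain Python with a hand-written gradient for a one-parameter model; the model, data, and the 1/accum_steps loss scaling are illustrative assumptions, not Paddle code.

```python
def mean_loss_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) over the batch with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 6.0)]

# Gradient computed once over the full batch of 4 samples.
full_grad = mean_loss_grad(w, data)

# Same data split into 2 equally sized micro-batches, gradients summed.
accum_steps = 2
micro_batches = [data[:2], data[2:]]
accumulated = 0.0
for micro_batch in micro_batches:
    # Scaling each micro-batch loss by 1 / accum_steps makes the summed
    # gradient equal the gradient of the mean loss over the full batch.
    accumulated += mean_loss_grad(w, micro_batch) / accum_steps

print(full_grad, accumulated)
```

Without the scaling, the accumulated value would be accum_steps times the full-batch gradient, so a training loop that relies on accumulation usually divides the loss or adjusts the learning rate accordingly.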
Dynamic graph to static graph
New syntax
- Add support for the isinstance syntax inside dynamic-to-static loops
- Add support for assigning a shape to a tuple in dynamic-to-static conversion, such as a, b, c, d = tensor.shape
- Python's and/or expressions evaluate their operands from left to right, and the right operand is not executed if the left operand already determines the logical value. logical_and/logical_or in dynamic-to-static graphs previously handled this case incorrectly; short-circuit evaluation is now supported
- Add support for function signatures that contain **kwargs
- Support decorating a function with jit.not_to_static so that it is not converted during dynamic-to-static conversion
- Support the Python dictionary syntax dict.pop()
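The short-circuit behavior that the and/or item refers to is standard Python semantics, shown here in plain Python. The helper names left and right are hypothetical; the point is that a converted static graph has to preserve the fact that the right operand may never run.

```python
calls = []


def left():
    calls.append("left")
    return False


def right():
    calls.append("right")
    return True


# `and` stops as soon as the left operand is falsy, so right() never runs.
result = left() and right()

print(result, calls)
```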
Bug fixing
- Fix model saving failure caused by an uninitialized variable representing drop_state when saving the lstm interface in dynamic-to-static conversion
- Fix variable analysis for nested loops
- Fix return handling in some special cases
- Fix the handling of list comprehensions and variable analysis in if-else
- Fix iteration variables in some special cases
- Fix the inconsistent behavior of the transpose API between dynamic and static graphs so that it supports dynamic-to-static conversion
- Fix the inconsistent behavior of the concat API between dynamic and static graphs so that it supports dynamic-to-static conversion
- Optimize some dynamic-to-static error messages so that the reported error location is more accurate
- Fix convert_call being called recursively more than once in special cases
- Fix the dynamic-to-static bug caused by different out.dtype checks in the 2.0 API
- Fix the bug that x.shape == y.shape compares lists and returns True/False in a dynamic graph but is overloaded as an elementwise comparison in a static graph; after conversion to a static graph the elementwise result is now reduced
- Fix the bug that param_guard does not cover hooks
- Fix the bug that some parameter variables created when init runs in dynamic graph mode cannot be assigned in a static graph because they are not static graph variables
- Fix the bug that non-parameter variables defined by users in the __init__ function cannot be modified and updated correctly
- Fix the incorrect conversion of the third-party library logging during dynamic-to-static conversion
- Fix the incorrect AST transformation of the for-enumerate syntax
- Fix the bug that some warning messages are displayed multiple times in a loop
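The x.shape == y.shape item can be pictured with plain Python lists: a list comparison gives one bool, while an elementwise comparison gives one bool per dimension and needs a reduction. This is a sketch of the idea only, assuming both shapes have the same rank; it is not the conversion code Paddle uses.

```python
x_shape = [2, 3, 4]
y_shape = [2, 3, 5]

# Dynamic graph: comparing two Python lists yields a single bool.
eager_equal = x_shape == y_shape

# Static graph: the comparison is elementwise, one bool per dimension...
elementwise = [a == b for a, b in zip(x_shape, y_shape)]
# ...so the result is reduced to recover a single truth value.
reduced_equal = all(elementwise)

print(eager_equal, elementwise, reduced_equal)
```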
Mixed precision training
- Support a more aggressive FP16 training mode (i.e., pure FP16 training). To ensure convergence of the model, the Momentum optimizer adds the new multi_precision and rescale_grad attributes. multi_precision mainly indicates that the optimizer needs to maintain a copy of master weights
- With pure FP16 training, the ResNet50 model can reach 1400+ samples/sec on a single V100 card with 16GB of video memory
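A small numeric experiment shows why a higher-precision master copy of the weights helps convergence in pure FP16 training. The sketch uses only the standard struct module to round values to half and single precision; the update size and step count are made-up values, and the loop is an illustration of the general master-weights idea, not the Momentum optimizer's implementation.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


update = 1e-4  # hypothetical per-step weight update (learning rate * gradient)
steps = 100

# Pure FP16 weight with no master copy: every update is rounded away,
# because adjacent FP16 values near 1.0 are about 0.001 apart.
w_fp16 = to_fp16(1.0)
for _ in range(steps):
    w_fp16 = to_fp16(w_fp16 + update)

# FP32 master weight: updates accumulate, and the FP16 copy used for
# computation is refreshed from the master.
master = to_fp32(1.0)
for _ in range(steps):
    master = to_fp32(master + update)
w_from_master = to_fp16(master)

print(w_fp16, master, w_from_master)
```

The FP16-only weight never moves from 1.0, while the weight derived from the master copy reflects the accumulated updates.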
Model quantization
- Dynamic graph quantization supports skipping specified Layers
- Dynamic graph quantization supports the 2.0 API Conv and Linear
Distributed training optimization
- Support launching low-level distributed APIs such as all_gather through the paddle.distributed.spawn interface
- Support heterbox heterogeneous training
- Pipeline parallelism supports the Executor.run interface to improve usability
- The Launch interface is upgraded to support specifying the number of processes on a single node
- Sharding supports multi-card training of models with tens of billions of parameters
Model saving and loading
- A Layer with multiple methods converted by paddle.jit.to_static can still be loaded with paddle.jit.load after being saved with paddle.jit.save, and all of the converted methods remain usable
- A Layer loaded with paddle.jit.load can still be saved correctly with paddle.jit.save after being fine-tuned or used as a sub-Layer of another Layer
- Extend paddle.jit.save to support saving paddle.DataParallel models
- Improve the paddle.static.load_program_state experience: when var_list is not specified, stray files in the load directory now produce a warning instead of an error
- paddle.jit.save supports InputSpec of dict type
- paddle.onnx.export supports exporting a dynamic graph model to the ONNX file format
Performance optimization (including the distributed)
- Improve the CPU performance of RNN-type OPs (LSTM, GRU, SimpleRNN). Compared with version 2.0-rc, the forward and backward performance of LSTM, GRU, and SimpleRNN improves significantly
- Optimize FastThreadedSSAGraphExecutor scheduling and fix the case where communication and computation do not overlap in the communication synchronization scenario; resnet50 on 4 machines with 32 cards improves by about 0.3%
- Optimize paddle.fleet amp distributed performance and fix the case where the last communication does not overlap with computation; fp16 performance on 4 machines with 32 cards improves by about 0.5%
- Optimize the performance of the distributed communication component Communicator. In GEO-400 mode, W2V throughput and Simnet-Bow performance improve significantly. In Async mode, compared with Paddle Framework 1.8, W2V throughput improves by 11% and CTR-DNN performance improves by 14%
- Optimize parameter server mode when the Worker is a GPU device by reducing the copy time of Embedding table lookups; training throughput of the CTR-DNN model improves significantly
- Distributed GPU dynamic graph training overlaps computation and communication and lets users configure options such as the group size for gradient fusion at a fine granularity. Multi-node performance improves by more than 5% on ResNet152 and Bert, and by more than 3% on ResNet50
- Improve the performance of cumsum on GPU
- Improve the performance of Resnet50 oneDNN dygraph training. Resnet50 oneDNN dygraph training is currently 6.4X faster than native CPU training
- Add cudnn support for GRU and SimpleRNN
Debug analysis
- Align the exception types raised on the Paddle Python side with native Python exception types
- Hide the C++ error stack by default, optimize the error format shown after hiding it, and remove the Error Message Summary separator so that the format matches native Python errors
- Improve the error messages of some static module APIs when they are used outside static graph mode, covering 9 APIs: static.append_backward, static.gradients, static.scope_guard, static.Print, static.nn.embedding, static.nn.data_norm, static.nn.multi_box_head, static.nn.nce, and static.nn.py_func
- Improve the error message shown when a Tensor passed in dynamic graph mode is None
- Further optimize the format of printed tensors in dynamic graph mode
Compile and install
New support
- (experimental) Release the binary package supporting cuda11
- Upgrade NCCL to version 2.7.8 in the Paddle images for cuda10.1 and later and in the CI system images
- Release the binary package supporting xpu
- Release the binary package supporting jetpack and the C++ inference library supporting nv_jetson
Experience optimization
- Fix the build strategy, separately release the gpu package containing tensorrt, to avoid the error of no tensorrt when users install other GPU versions of the package
- Remove installation dependencies: scipy, rarfile, prettytable, pathlib
- Installation documentation optimization
Bug fixing
- Fix the bug that GPU card 0 occupies more video memory than other cards during multi-card training
- Fix incorrect shape inference in the tile op
- Fix the large number of invalid escape sequence warnings emitted when using paddle
- Fix the bug in paddle.full when setting INF, NAN, NINF, and similar values
- Fix the bug that multi-nccl comm settings in paddle.fleet do not take effect, and add a warning that multi-nccl comm communication does not overlap in synchronous mode
- Fix the bug that paddle.framework.seed does not behave as expected with TruncatedNormal initialization
- Fix the inconsistent behavior of the exclusive parameter of AvgPool-related APIs in dynamic-to-static conversion; fix ceil_mode argument passing in MaxPool-related APIs
- Fix the bug that the paddle.topk result is incorrect on GPU
- Fix the bug that the fluid.layers.nn.gather dynamic graph API lacks the overwrite option
- Fix the bug that Windows terminals do not recognize an empty CUDA_VISIBLE_DEVICES; setting it to an empty string makes the framework run in CPU mode
- Fix the bug that optimizer.state_dict/set_dict fails to save and load recursively when LinearLrWarmup recursively contains a Learning Rate Scheduler
- Fix the single-machine training performance drop of the ptb lm model
- Fix the bug of gradient calculation when softmax_with_cross_entropy uses ignore_index
- Fix the bug that the list of parameters to decay is empty when fetched again after the first execution of AdamW
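For the CUDA_VISIBLE_DEVICES item, one way to request CPU execution from a script is to set the variable to an empty string in the process environment. The sketch below only sets and reads the variable; that it must happen before the framework is imported is an assumption about CUDA initialization order, so treat it as a starting point to verify on your own setup.

```python
import os

# An empty string hides every GPU from CUDA. This is assumed to be needed
# before the framework initializes CUDA, i.e. before importing paddle.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

visible = os.environ["CUDA_VISIBLE_DEVICES"]
print(repr(visible))
```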
Inference
Paddle Inference
Function upgrade
- Paddle 2.0 adds or upgrades some operators. Starting from this version, versioning rules for forward operators are defined together with compatibility constraints. Aligning operator versions between frameworks ensures that the same operator version has a consistent definition and behavior in different frameworks, which improves the overall robustness of the framework
- Add the TryShrinkMemory interface to reduce the GPU/CPU memory used by an application by releasing temporary tensors. For a demo, refer to Paddle-Inference-Demo
- Paddle-TRT supports the clip op and supports running the classification model GhostNet under Paddle-TRT
- Paddle-TRT int8 inference supports models containing mul ops with channelwise quantization, and supports running PaddleSlim-quantized PaddleOCR detection and recognition models under Paddle-TRT int8
- The load_inference_model and save_inference_model APIs are migrated to paddle.static to improve ease of use, while remaining compatible with the old interfaces
- Add six APIs, serialize_program, deserialize_program, serialize_persistables, deserialize_persistables, save_to_file, and load_from_file, for serializing/deserializing a program, serializing/deserializing params, and saving models/parameters to a file or loading them from a file
- Enable BF16 inference for the models resnet50, googlenet, mobilenetv1, and mobilenetv2
- Add version compatibility support for some oneDNN operators
Performance optimization
- When TensorRT is enabled, ERNIE models add support for variable-length inputs, improving performance by 147%. With software versions cuda10.1, cudnn 7.6, tensorrt 6.0, OSS 7.2.1, model ernie-base-2.0, dataset QNLI, and input BatchSize = 32, performance on an Nvidia Tesla T4 improves from 905 sentences/s to 2237 sentences/s. Example code: Paddle-Inference-Demo/c++
- Improve oneDNN INT8 GRU performance. The GRU INT8 model has a 1.65X speed-up compared with NativeConfig float32 inference (with thread=1, batch_size=50)
- Add oneDNN batchnorm + activation fuse, which improves pvanet_ocr model performance by 2.8%
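The 147% figure for ERNIE can be reproduced from the two throughput numbers quoted above. The short calculation below uses only those two values; the variable names are illustrative.

```python
before = 905   # sentences/s without variable-length input support
after = 2237   # sentences/s with variable-length input support

# Relative improvement over the baseline, in percent.
improvement_pct = (after - before) / before * 100
# The same change expressed as a speed-up factor.
speedup = after / before

print(round(improvement_pct), round(speedup, 2))
```

A 147% improvement therefore corresponds to running roughly 2.47 times as fast as the baseline.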
Bug fixing
- Fix the bug that models with avg pooling or global pooling produce wrong computation results, exit with errors, or hang on jetson devices
- Fix the bug that the shape of a TensorRT subgraph output Tensor is truncated incorrectly when it ends in x1 during TensorRT dynamic shape inference
- Fix the bug that config.pass_builder()->DeletePass() does not take effect when TensorRT inference is used
- Fix the issue that the performance of some models depends on the values of the matmul ops' weights
- Fix the issue that CPU oneDNN inference with many models reports errors or causes performance regression
Model upgrade
PaddleDetection
- Upgrade dynamic graph models:
- Faster RCNN, Faster FPN, Mask RCNN, Mask FPN, Cascade RCNN, Cascade Mask, and YOLOv3 match the accuracy of their static graph versions
- Support dynamic-to-static conversion and integration with Paddle Inference, matching static graph accuracy and speed
- Release SOLOv2, a real-time instance segmentation model. Compared with competing models, accuracy improves by 2.4 points, inference speed improves by 31.2%, and training is 2.4 times as fast
- Add Android mobile detection demos, including SSD and YOLO series models
- Add the new PACT quantization strategy. YOLOv3-Mobilenetv3 on the COCO dataset improves by 0.7% over ordinary quantization
PaddleSlim
- Support dynamic graph compression
- Add dynamic graph pruning and quantization-aware training
- Add channel count alignment to pruning, so that the resulting model is easier for the inference library to accelerate
- PACT quantization training becomes a built-in method that users can call directly
- Add the OFA model compression technique. TinyERNIE runs 40% faster after compression, with no loss of accuracy
PaddleSeg
- Release the new 1.0-rc version, fully upgraded to dynamic graph. It supports 13 segmentation models, 4 backbone networks, 3 datasets, and 4 losses:
- Segmentation models: ANN, BiSeNetV2, DANet, DeeplabV3, DeeplabV3+, FCN, FastSCNN, Gated-scnn, GCNet, OCRNet, PSPNet, UNet, and U^2Net
- Backbone networks: ResNet, HRNet, MobileNetV3, and Xception
- Datasets: Cityscapes, ADE20K, and Pascal VOC
- Loss: CrossEntropy Loss, BootstrappedCrossEntropy Loss, Dice Loss, and BCE Loss
- Provide 40+ high-quality pre-trained models based on the Cityscapes and Pascal Voc datasets
- Support multi-card GPU parallel evaluation with efficient metric computation. Support multiple evaluation methods such as multi-scale evaluation, flip evaluation, and sliding window evaluation
PaddleClas
- Newly released 2.0-rc1, fully upgraded to dynamic graph. It supports 23 series of classification network structures and 135 image classification pre-training models. Among them, 14 practical SSLD distillation models are included, and the effect is generally improved by more than 3% compared with the benchmark model. Three new series of ResNeSt, RegNet and GhostNet models are added
- Based on dynamic graph, provide the mixed precision training method and DALI-based training method
- Provide offline inference deployment, serving deployment, and on-device deployment based on dynamic graph
PaddleOCR
- Release the new 2.0-rc1 version. The PP-OCR series models are upgraded to dynamic graph. Provide 8.1M ultra-lightweight Chinese and English OCR models, general-purpose Chinese and English OCR models, and better multilingual recognition models (English letters and digits only, French, German, Japanese, Korean). Support offline inference deployment and serving deployment
- Release the Style-Text universal text data synthesis tool
- Release the PPOCRLabel text data annotation tool
PaddleRec
- Release models supporting dynamic graph: gru4rec, deepfm, mmoe, dnn, and LR
PaddleGAN
- Release models: Pixel2Pixel, CycleGAN, PSGAN, UGATIT, ESRGAN, CGAN, DCGAN
- Provide 10 pre-trained models for style transfer, makeup transfer, colorization, super-resolution, character and scene animation, etc.
PaddleNLP
- Release 2.0-beta version: fully support dynamic graph models; provide the PaddleNLP core library, deeply integrated with the high-level APIs; support pip installation; provide developers with best practices for the text domain in PaddlePaddle 2.0.
- Add the text graph learning model ERNIESage, generative pre-training model ERNIE-Gen, open domain dialogue generation model PLATO-2, semantic matching model SentenceTransformer, time series prediction model TCN, and so on.
- Further enrich the pre-trained language models, with a total of 22 pre-trained models such as ERNIE, BERT, RoBERTa, and ELECTRA (including 11 Chinese pre-trained models).
- Add 8 common text task evaluation metrics such as Perplexity, BLEU, Rouge-L, and so on, adapted to the PaddlePaddle 2.0 Metrics API system to improve ease of use.
- Add 25 new datasets for text classification, sequence labeling, machine translation, reading comprehension, and so on, adapted to the PaddlePaddle 2.0 Dataset API system, with quick one-step loading.
- Add the Embedding API function, including 38 Chinese word vectors, supporting fast loading and word-level semantic distance calculation.
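The word-level semantic distance mentioned above is commonly measured with cosine similarity between word vectors. The sketch below shows that calculation in plain Python. It does not use the PaddleNLP Embedding API, and the vectors in `toy_vectors` are made-up values for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional word vectors; real pre-trained vectors have far more dimensions.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
    print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

A value close to 1 means the two vectors point in nearly the same direction, which is usually read as the words being semantically close.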
Parakeet
- Release 2.0-alpha version: provide Parakeet core library; improve Chinese documentation; support pip installation.
- Upgrade the text-to-speech model framework and unify the text front-end interface. The models are fully upgraded to the Paddle 2.0 API, including TransformerTTS, WaveFlow, WaveNet, and the new Tacotron2 model.
- Provide more reusable networking modules, which makes it easier to combine models flexibly. Optimize the data processing and loading process, which improves the training speed.
- Add the experiment module to standardize the experiment process. This facilitates the experiment management and secondary development. The sample codes for experiments are provided for existing models.
Utility Component
PaddleHub
- Release 2.0-rc version: fully migrate to the dynamic graph programming mode, which makes model development and debugging more convenient. The finetune interface is more flexible and easier to use.
- Fully upgrade the transfer learning capability for vision tasks, supporting a variety of tasks such as image classification, image colorization, and style transfer.
- Upgrade Transformer class models such as BERT, ERNIE and RoBERTa to dynamic graph. Support the Fine-Tune capability for text classification.
- Optimize the Serving capability for service-oriented deployment, supporting multi-card prediction and automatic load balancing. The performance is improved greatly.
- Add Auto Augment (automatic data augmentation). This allows an efficient search for a suitable combination of data augmentation policies for a dataset.
X2Paddle
- Release version 1.0.0-rc0: it fully supports the PaddlePaddle dynamic graph API.
- Add PyTorch model conversion: supports conversion through both Tracing and Scripting.
- Add support for conversion from Caffe/ONNX/TensorFlow to Paddle 2.0 dynamic graph.
- Add the Optimizer module, mainly including op fusions and op elimination functions, to improve the readability of the converted model code and the prediction performance of the model.
Kunlun hardware
Models adapted to Kunlun hardware
- The static graph models ResNet50, MobileNetV3, DeepLabV3, BERT-base, and DQN are adapted to Kunlun hardware