+++
author = "SWHL"
title = "ONNXRuntime Inference Tuning Guide"
date = "2023-05-07"
description = "ONNXRuntime Inference Tuning Guide"
tags = ["llm"]
+++


<details open>
<summary>Table of Contents</summary>

- [Introduction](#introduction)
- [Recommended Settings](#recommended-settings)
- [`enable_cpu_mem_arena`](#enable_cpu_mem_arena)
- [`enable_profiling`](#enable_profiling)
- [`execution_mode`](#execution_mode)
- [`inter_op_num_threads`](#inter_op_num_threads)
- [`intra_op_num_threads`](#intra_op_num_threads)
- [`graph_optimization_level`](#graph_optimization_level)
- [FAQ](#faq)
  - [Why is my model slower on GPU than on CPU?](#why-is-my-model-slower-on-gpu-than-on-cpu)
- [References](#references)
</details>

#### Introduction
- ONNXRuntime is the inference engine I use most often, and configuring it properly has a decisive impact on inference performance. However, information about ONNXRuntime parameter settings is scattered across many places and does not add up to coherent guidance.
- This article therefore collects the relevant settings in one place.
- All of the parameters below come from `SessionOptions`.
- The related test code is available on [AI Studio](https://aistudio.baidu.com/aistudio/projectdetail/6109918?sUid=57084&shared=1&ts=1683438418669).
- Additions and corrections are welcome.

#### Recommended Settings
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.log_severity_level = 4
sess_options.enable_cpu_mem_arena = False

# For the remaining parameters, the defaults are fine
```
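
The options object takes effect when the session is created. A minimal usage sketch (the `model.onnx` path is a placeholder for illustration):
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.log_severity_level = 4
sess_options.enable_cpu_mem_arena = False

# "model.onnx" is a placeholder path; the options are bound at session creation
sess = rt.InferenceSession("model.onnx", sess_options=sess_options)
```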

#### [`enable_cpu_mem_arena`](https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.SessionOptions.enable_cpu_mem_arena)
- Purpose: enables the **memory arena** on the CPU. The arena may pre-allocate a large amount of memory for future use. If you don't want this behavior, set `enable_cpu_mem_arena=False`; the default is `True`.
- Conclusion: disabling it is recommended.
  - With it enabled, memory usage increases dramatically (a 5618.3 MiB increment vs. 5.3 MiB) and is held without being released, while inference time improves by only about 13%.

- Test environment:
  - Python: 3.7.13
  - ONNXRuntime: 1.14.1
- Test code (from [issue 11627](https://github.com/microsoft/onnxruntime/issues/11627), [enable_cpu_memory_area_example.zip](https://github.com/microsoft/onnxruntime/files/8772315/enable_cpu_memory_area_example.zip)):
```python
# pip install onnxruntime==1.14.1
# pip install memory_profiler

import numpy as np
import onnxruntime as ort
from memory_profiler import profile


@profile
def onnx_prediction(model_path, input_data):
ort_sess = ort.InferenceSession(model_path, sess_options=sess_options)
preds = ort_sess.run(output_names=["predictions"],
input_feed={"input_1": input_data})[0]
return preds


sess_options = ort.SessionOptions()
sess_options.enable_cpu_mem_arena = False  # toggle True/False to compare
print(f'enable_cpu_mem_arena: {sess_options.enable_cpu_mem_arena}')

input_data = np.load('enable_cpu_memory_area_example/input.npy')
print(f'input_data shape: {input_data.shape}')
model_path = 'enable_cpu_memory_area_example/model.onnx'

onnx_prediction(model_path, input_data)
```
- Test results are roughly the same on Windows, macOS, and Linux:
<details>

- `enable_cpu_mem_arena=True`
```bash
(demo) PS G:> python .\test_enable_cpu_mem_arena.py
enable_cpu_mem_arena: True
input_data shape: (32, 200, 200, 1)
Filename: .\test_enable_cpu_mem_arena.py

Line # Mem usage Increment Occurrences Line Contents
=============================================================
7 69.1 MiB 69.1 MiB 1 @profile
8 def onnx_prediction(model_path, input_data):
9 77.2 MiB 8.1 MiB 1 ort_sess = ort.InferenceSession(model_path, sess_options=sess_options)
10 77.2 MiB 0.0 MiB 1 preds = ort_sess.run(output_names=["predictions"],
11 5695.5 MiB 5618.3 MiB 1 input_feed={"input_1": input_data})[0]
12 5695.5 MiB 0.0 MiB 1 return preds
```
- `enable_cpu_mem_arena=False`
```bash
(demo) PS G:> python .\test_enable_cpu_mem_arena.py
enable_cpu_mem_arena: False
input_data shape: (32, 200, 200, 1)
Filename: .\test_enable_cpu_mem_arena.py

Line # Mem usage Increment Occurrences Line Contents
=============================================================
7 69.1 MiB 69.1 MiB 1 @profile
8 def onnx_prediction(model_path, input_data):
9 76.9 MiB 7.8 MiB 1 ort_sess = ort.InferenceSession(model_path, sess_options=sess_options)
10 76.9 MiB 0.0 MiB 1 preds = ort_sess.run(output_names=["predictions"],
11 82.1 MiB 5.3 MiB 1 input_feed={"input_1": input_data})[0]
12 82.1 MiB 0.0 MiB 1 return preds
```

</details>

#### `enable_profiling`
- When this parameter is enabled, inference produces a log file named like `onnxruntime_profile__2023-05-07_09-02-15.json`, containing detailed performance data (threads, per-operator latency, and so on).
- Recommendation: enable it.
- Example code:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True
```
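
To retrieve the profile file programmatically, `InferenceSession.end_profiling()` stops profiling and returns the name of the generated JSON file, which can be opened in Chrome's `chrome://tracing` viewer. A minimal sketch (the `model.onnx` path is a placeholder):
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True

# "model.onnx" is a placeholder path for illustration
sess = rt.InferenceSession("model.onnx", sess_options=sess_options)
# ... run inference here so the profile captures real data ...
profile_file = sess.end_profiling()  # stops profiling, returns the JSON file name
print(profile_file)  # open this file in chrome://tracing to inspect per-op latency
```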

#### `execution_mode`
- Sets the mode in which the model runs: `rt.ExecutionMode.ORT_SEQUENTIAL` and `rt.ExecutionMode.ORT_PARALLEL`, i.e. sequential or parallel execution. The default is sequential.
- **In general, when a model contains many branches, setting this parameter to `ORT_PARALLEL` can yield better performance.**
- When `sess_options.execution_mode = rt.ExecutionMode.ORT_PARALLEL` is set, you can use `sess_options.inter_op_num_threads` to control the number of threads used to parallelize execution across the graph's nodes, as shown in the sketch below.
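
A minimal sketch of the parallel setup (the thread count of 4 is an arbitrary example value, not a recommendation):
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.execution_mode = rt.ExecutionMode.ORT_PARALLEL
# Under ORT_PARALLEL, inter_op_num_threads controls how many threads run
# independent branches of the graph concurrently; 4 is an arbitrary example
sess_options.inter_op_num_threads = 4
```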

#### `inter_op_num_threads`
- Sets the number of threads used to parallelize graph execution across nodes. The default is 0, which lets onnxruntime decide on its own.
- Example code:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.inter_op_num_threads = 2
```

#### `intra_op_num_threads`
- Sets the number of threads used to parallelize execution within a node. The default is 0, which lets onnxruntime decide on its own; it will generally use all cores on the device.
- ⚠️ Larger is not always better; see the ablation experiments on [AI Studio](https://aistudio.baidu.com/aistudio/projectdetail/6109918?sUid=57084&shared=1&ts=1683438418669) for details.
- Example code:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.intra_op_num_threads = 2
```

#### [`graph_optimization_level`](https://github.com/microsoft/onnxruntime-openenclave/blob/openenclave-public/docs/ONNX_Runtime_Graph_Optimizations.md)
- The level of operator optimization applied to the graph when it is run. The default enables all optimizations, and the default is recommended.
- The available enum values are: `ORT_DISABLE_ALL | ORT_ENABLE_BASIC | ORT_ENABLE_EXTENDED | ORT_ENABLE_ALL`
- Example code:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
```
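
A related option worth knowing about: `SessionOptions.optimized_model_filepath` serializes the optimized graph to disk, so the optimization cost is paid only once and later sessions can load the optimized model directly. A minimal sketch (both file paths are placeholders):
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
# Save the optimized graph so later sessions can skip re-optimization;
# both paths below are placeholders for illustration
sess_options.optimized_model_filepath = "model_optimized.onnx"
sess = rt.InferenceSession("model.onnx", sess_options=sess_options)
```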

#### FAQ
##### Why is my model slower on GPU than on CPU?
- Depending on the execution provider you use, it may not fully support every operator in the model. Falling back to CPU for those operators can degrade performance. Moreover, even when an operator is implemented by the CUDA execution provider, ORT will not necessarily assign/place it on the CUDA EP, for performance reasons. To see the placement decisions ORT made, turn on verbose logging and inspect the console output, as in the sketch below.
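
A minimal sketch of turning on verbose logging to inspect node placement (the `model.onnx` path is a placeholder; severity level 0 corresponds to VERBOSE):
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.log_severity_level = 0  # 0 = VERBOSE; logs node-to-EP assignment

# "model.onnx" is a placeholder path for illustration
sess = rt.InferenceSession(
    "model.onnx",
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # the execution providers actually in use
```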


#### References
- [ONNX Runtime Performance Tuning](https://github.com/microsoft/onnxruntime-openenclave/blob/openenclave-public/docs/ONNX_Runtime_Perf_Tuning.md)
- [Python API](https://onnxruntime.ai/docs/api/python/api_summary.html)
