# Quantization

Quantization is a widely used model compression technique that reduces model size while also improving inference and training latency.
Full-precision data is converted to a low-precision representation with little degradation in model accuracy, and the quantized model gains performance by saving memory bandwidth and by accelerating computation with low-precision instructions. Intel provides several lower-precision instruction sets (for example, 8-bit or 16-bit multipliers), and both training and inference can benefit from them. Refer to the Intel article on lower numerical precision inference and training in deep learning.
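To make the float-to-int8 conversion concrete, here is a minimal sketch of the affine (scale and zero-point) mapping that underlies the schemes described below. It is an illustration only, not code from this repository; the function names `quantize_int8` and `dequantize` and the use of NumPy are assumptions for the example.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Illustrative affine quantization of a float32 tensor to int8."""
    qmin, qmax = -128, 127
    # Scale maps the observed float range onto the int8 range
    # (assumes x is not constant, so the range is non-zero).
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximate float32 tensor from its int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale
```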

## Quantization Support Matrix

Quantization methods include the following three types:

| Types | Quantization | Dataset Requirements | Framework | Backend |
|---|---|---|---|---|
| Post-Training Static Quantization (PTQ) | weights and activations | calibration | PyTorch | PyTorch Eager/PyTorch FX/IPEX |
| | | | TensorFlow | TensorFlow/Intel TensorFlow |
| | | | ONNX Runtime | QLinearops/QDQ |
| Post-Training Dynamic Quantization | weights | none | PyTorch | PyTorch eager mode/PyTorch fx mode/IPEX |
| | | | ONNX Runtime | QIntegerops |
| Quantization-aware Training (QAT) | weights and activations | fine-tuning | PyTorch | PyTorch eager mode/PyTorch fx mode/IPEX |
| | | | TensorFlow | TensorFlow/Intel TensorFlow |


Post-Training Static Quantization (PTQ) performs quantization on an already trained model. It requires an additional pass over a representative dataset, and calibration is needed only for the activations.

*Figure: PTQ*
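As a hedged illustration of how static PTQ is typically driven in PyTorch eager mode (one of the backends listed in the matrix above), the sketch below inserts observers, calibrates on sample data, and converts the model to int8. The toy `TinyNet` model, the random calibration data, and the `fbgemm` qconfig are assumptions for the example, not part of this document.

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    """Toy float model wrapped with quant/dequant stubs for eager-mode PTQ."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)        # float -> quantized at the model boundary
        x = self.fc(x)
        return self.dequant(x)   # quantized -> float

model = TinyNet().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")

# Insert observers that record activation ranges.
prepared = torch.ao.quantization.prepare(model)

# Calibration: run representative data through the prepared model.
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(32, 16))

# Swap observed modules for their int8 counterparts.
quantized_model = torch.ao.quantization.convert(prepared)
```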


Post-Training Dynamic Quantization multiplies input values by a scaling factor and rounds the result to the nearest integer; the scale factor for activations is determined dynamically from the data range observed at runtime. Weights are quantized ahead of time, while activations are quantized dynamically during inference.

*Figure: Dynamic Quantization*
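Below is a minimal sketch of how dynamic quantization is commonly invoked in PyTorch (one of the backends in the matrix above); the toy model and the choice of quantizing only `nn.Linear` layers are illustrative assumptions.

```python
import torch
from torch import nn

# Toy float model; in practice this would be a trained network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()

# Weights are converted to int8 ahead of time; activation scales are
# computed on the fly from the data range seen at inference time.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized_model(torch.randn(1, 16)))
```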


Quantization-aware Training (QAT) quantizes the model during training and typically provides higher accuracy than post-training quantization, but QAT may require additional hyper-parameter tuning and may take more time to deploy.

*Figure: QAT*
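The following is a hedged sketch of a typical eager-mode QAT loop in PyTorch: fake-quantization modules simulate int8 behavior while the model is fine-tuned, and the model is converted to real int8 afterward. The tiny model, random data, and short training loop are assumptions for illustration, not a recipe from this document.

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    """Toy model with quant/dequant stubs so fake-quant ops can be inserted."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet().train()
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")

# Insert fake-quantization modules that simulate int8 during training.
prepared = torch.ao.quantization.prepare_qat(model)

optimizer = torch.optim.SGD(prepared.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(10):  # short fine-tuning loop on random data
    x, y = torch.randn(32, 16), torch.randn(32, 4)
    optimizer.zero_grad()
    loss_fn(prepared(x), y).backward()
    optimizer.step()

# After fine-tuning, convert to a real int8 model for inference.
quantized_model = torch.ao.quantization.convert(prepared.eval())
```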

## Examples of Quantization

For quantization-related examples, please refer to the Quantization examples.