
How to make 4bit pytorch_quantization model export to .engine model? #4262

Open
StarryAzure opened this issue Nov 26, 2024 · 6 comments
Labels: triaged (Issue has been triaged by maintainers)

Comments

@StarryAzure

pytorch_quantization supports 4-bit and ONNX supports 4-bit, but torch.onnx.export does not support 4-bit. How can I export a 4-bit pytorch_quantization .pt model to a .engine model?
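A minimal sketch (not from this issue) of the setup described above, assuming pytorch_quantization's QuantDescriptor/quant_modules API, a placeholder torchvision model, and calibration reduced to a single dummy batch:

```python
import torch
import torchvision
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from pytorch_quantization.tensor_quant import QuantDescriptor

# Request 4-bit per-channel weight quantization (inputs keep the 8-bit default,
# since INT4 in TensorRT is weight-only anyway).
w4 = QuantDescriptor(num_bits=4, axis=0)
quant_nn.QuantConv2d.set_default_quant_desc_weight(w4)
quant_nn.QuantLinear.set_default_quant_desc_weight(w4)
quant_modules.initialize()                    # swap nn.Conv2d/Linear for quantized versions

model = torchvision.models.resnet18().eval()  # placeholder model
dummy = torch.randn(1, 3, 224, 224)

# One-batch "calibration" so every TensorQuantizer has an amax; a real flow uses a dataset.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.disable_quant()
        m.enable_calib()
with torch.no_grad():
    model(dummy)
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.enable_quant()
        m.disable_calib()
        m.load_calib_amax()

# Export with fake-quant ops so Q/DQ nodes land in the ONNX graph. Per this issue,
# this is the step that does not go through when num_bits=4.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, dummy, "model_w4_qdq.onnx", opset_version=17)
```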

@lix19937

Currently the latest trtexec only supports --fp16, --bf16, --int8, --fp8, --noTF32, and --best; maybe you can use TRT-LLM.

https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec

@StarryAzure (Author)

> Currently the latest trtexec only supports --fp16, --bf16, --int8, --fp8, --noTF32, and --best; maybe you can use TRT-LLM.
>
> https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec

In TensorRT 10, I can pass --int4 and it doesn't report any errors, but the resulting engine is still FP32.

@lix19937 commented Dec 2, 2024

Can you upload the build log from trtexec --verbose?
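For reference, a hedged sketch (file names are placeholders) of capturing that log by wrapping trtexec from Python, using only flags already mentioned in this thread (--int4, --verbose) plus --onnx/--saveEngine from the linked documentation:

```python
import subprocess

# Write the full builder log to a file so it can be attached to the issue.
with open("trtexec_int4_build.log", "w") as log:
    subprocess.run(
        [
            "trtexec",
            "--onnx=model_w4_qdq.onnx",      # ONNX with 4-bit Q/DQ (placeholder name)
            "--saveEngine=model_w4.engine",  # serialized engine output (placeholder name)
            "--int4",                        # the TensorRT 10 flag discussed above
            "--verbose",                     # full build log
        ],
        stdout=log,
        stderr=subprocess.STDOUT,
        check=True,
    )
```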

@lix19937 commented Dec 3, 2024

> Currently the latest trtexec only supports --fp16, --bf16, --int8, --fp8, --noTF32, and --best; maybe you can use TRT-LLM.
>
> https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec

Sorry, I was wrong: per https://github.com/NVIDIA/TensorRT/blob/release/10.6/samples/common/sampleOptions.cpp#L1231, v10.6 already supports INT4, which needs to meet these requirements:

INT4: low-precision integer type for weight compression
INT4 is used for weight-only quantization. Requires dequantization before computing is performed.
Conversion to and from INT4 type requires an explicit Q/DQ layer.
INT4 weights are expected to be serialized by packing two elements per byte. For additional information, refer to the Quantized Weights section.
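To illustrate those requirements, a sketch of building an engine from an ONNX file that already carries explicit INT4 Q/DQ with packed weights, through the TensorRT Python API rather than trtexec; it assumes TensorRT 10.x (where trt.BuilderFlag.INT4 is available) and placeholder file names:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(0)          # explicit batch is the default in TensorRT 10
parser = trt.OnnxParser(network, logger)

with open("model_w4_qdq.onnx", "rb") as f:   # placeholder: ONNX with explicit INT4 Q/DQ
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT4)        # weight-only INT4; the graph still needs explicit Q/DQ
config.set_flag(trt.BuilderFlag.FP16)        # activations/compute stay in higher precision

serialized = builder.build_serialized_network(network, config)
if serialized is None:
    raise SystemExit("engine build failed")
with open("model_w4.engine", "wb") as f:
    f.write(serialized)
```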

@StarryAzure (Author)

> Currently the latest trtexec only supports --fp16, --bf16, --int8, --fp8, --noTF32, and --best; maybe you can use TRT-LLM.
> https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec
>
> Sorry, I was wrong: per https://github.com/NVIDIA/TensorRT/blob/release/10.6/samples/common/sampleOptions.cpp#L1231, v10.6 already supports INT4, which needs to meet these requirements:
>
> INT4: low-precision integer type for weight compression
> INT4 is used for weight-only quantization. Requires dequantization before computing is performed.
> Conversion to and from INT4 type requires an explicit Q/DQ layer.
> INT4 weights are expected to be serialized by packing two elements per byte. For additional information, refer to the Quantized Weights section.

But I used pytorch_quantization to add 4-bit Q/DQ layers, and I can't use torch.onnx.export to export the model to ONNX. Is there any way to do that?

@lix19937

see https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_choosing_quant_methods.html

INT4 Weights-only AWQ (W4A16)
  • 4-bit integer group-wise/block-wise weight-only quantization with AWQ calibration.
  • Compresses an FP16/BF16 model to 25% of its original size.
  • Calibration time: tens of minutes.
  • Deploy via TensorRT-LLM. Supported GPUs: Ampere and later.

INT4-FP8 AWQ (W4A8)
  • 4-bit integer group-wise/block-wise weight quantization, FP8 per-tensor activation quantization, and AWQ calibration.
  • Compresses an FP16/BF16 model to 25% of its original size.
  • Calibration time: tens of minutes.
  • Deploy via TensorRT-LLM. Supported GPUs: Ada, Hopper, and later.
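As a concrete illustration of the W4A16 entry above, a minimal sketch using TensorRT-Model-Optimizer (modelopt); the model name and calibration texts are placeholders rather than anything from this thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq

name = "facebook/opt-125m"  # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(name)

def forward_loop(m):
    # AWQ calibration pass: run a few representative batches through the model.
    for text in ["hello world", "int4 awq calibration sample"]:
        ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
        m(ids)

# Group-wise INT4 weight-only quantization with AWQ calibration (W4A16).
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
```

The quantized model is then exported to a TensorRT-LLM checkpoint and built into an engine, per the two steps below.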

  • convert_checkpoint to get the TRT-LLM checkpoint
  • trtllm-build to build the engine (it can also dump the network as ONNX for visualization)
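A hedged sketch of the second step, wrapping trtllm-build from Python; the paths are placeholders and the exact flags vary across TensorRT-LLM versions, so treat this as an outline rather than exact usage:

```python
import subprocess

subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", "ckpt_int4_awq",  # TRT-LLM checkpoint from the previous step
        "--output_dir", "engine_int4_awq",    # where the engine and its config are written
        "--gemm_plugin", "auto",              # common setting; adjust per model and version
    ],
    check=True,
)
```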
