
How to make 4bit pytorch_quantization model export to .engine model? #4262

Open
StarryAzure opened this issue Nov 26, 2024 · 6 comments
Labels: triaged (Issue has been triaged by maintainers)

Comments

@StarryAzure

pytorch_quantization supports 4-bit and ONNX supports 4-bit, but torch.onnx.export does not support 4-bit. How can I export a 4-bit pytorch_quantization .pt model to a .engine model?
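A minimal sketch (not from this issue) of the setup described above, assuming pytorch_quantization's QuantDescriptor/quant_modules API, a placeholder torchvision model, and calibration reduced to a single dummy batch:

```python
import torch
import torchvision
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from pytorch_quantization.tensor_quant import QuantDescriptor

# Request 4-bit per-channel weight quantization (inputs keep the 8-bit default,
# since INT4 in TensorRT is weight-only anyway).
w4 = QuantDescriptor(num_bits=4, axis=0)
quant_nn.QuantConv2d.set_default_quant_desc_weight(w4)
quant_nn.QuantLinear.set_default_quant_desc_weight(w4)
quant_modules.initialize()                    # swap nn.Conv2d/Linear for quantized versions

model = torchvision.models.resnet18().eval()  # placeholder model
dummy = torch.randn(1, 3, 224, 224)

# One-batch "calibration" so every TensorQuantizer has an amax; a real flow uses a dataset.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.disable_quant()
        m.enable_calib()
with torch.no_grad():
    model(dummy)
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.enable_quant()
        m.disable_calib()
        m.load_calib_amax()

# Export with fake-quant ops so Q/DQ nodes land in the ONNX graph. Per this issue,
# this is the step that does not go through when num_bits=4.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model, dummy, "model_w4_qdq.onnx", opset_version=17)
```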

@lix19937

Currently the latest trtexec only supports --fp16, --bf16, --int8, --fp8, --noTF32, and --best; maybe you can use TRT-LLM.

https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec

@StarryAzure (Author)

> Currently the latest trtexec only supports --fp16, --bf16, --int8, --fp8, --noTF32, and --best; maybe you can use TRT-LLM.
>
> https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec

In TensorRT 10, I can pass --int4 and it doesn't report any errors, but the resulting engine is still FP32.

@lix19937 commented Dec 2, 2024

Can you upload the build log from trtexec --verbose?
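For reference, a hedged sketch (file names are placeholders) of capturing that log by wrapping trtexec from Python, using only flags already mentioned in this thread (--int4, --verbose) plus --onnx/--saveEngine from the linked documentation:

```python
import subprocess

# Write the full builder log to a file so it can be attached to the issue.
with open("trtexec_int4_build.log", "w") as log:
    subprocess.run(
        [
            "trtexec",
            "--onnx=model_w4_qdq.onnx",      # ONNX with 4-bit Q/DQ (placeholder name)
            "--saveEngine=model_w4.engine",  # serialized engine output (placeholder name)
            "--int4",                        # the TensorRT 10 flag discussed above
            "--verbose",                     # full build log
        ],
        stdout=log,
        stderr=subprocess.STDOUT,
        check=True,
    )
```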

@lix19937 commented Dec 3, 2024

> Currently the latest trtexec only supports --fp16, --bf16, --int8, --fp8, --noTF32, and --best; maybe you can use TRT-LLM.
>
> https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec

Sorry, I was wrong: per https://github.com/NVIDIA/TensorRT/blob/release/10.6/samples/common/sampleOptions.cpp#L1231, v10.6 already supports INT4, which needs to meet these requirements:

INT4: low-precision integer type for weight compression
INT4 is used for weight-only quantization. Requires dequantization before computing is performed.
Conversion to and from INT4 type requires an explicit Q/DQ layer.
INT4 weights are expected to be serialized by packing two elements per byte. For additional information, refer to the Quantized Weights section.
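To illustrate those requirements, a sketch of building an engine from an ONNX file that already carries explicit INT4 Q/DQ with packed weights, through the TensorRT Python API rather than trtexec; it assumes TensorRT 10.x (where trt.BuilderFlag.INT4 is available) and placeholder file names:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(0)          # explicit batch is the default in TensorRT 10
parser = trt.OnnxParser(network, logger)

with open("model_w4_qdq.onnx", "rb") as f:   # placeholder: ONNX with explicit INT4 Q/DQ
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT4)        # weight-only INT4; the graph still needs explicit Q/DQ
config.set_flag(trt.BuilderFlag.FP16)        # activations/compute stay in higher precision

serialized = builder.build_serialized_network(network, config)
if serialized is None:
    raise SystemExit("engine build failed")
with open("model_w4.engine", "wb") as f:
    f.write(serialized)
```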

@StarryAzure (Author)

> Currently the latest trtexec only supports --fp16, --bf16, --int8, --fp8, --noTF32, and --best; maybe you can use TRT-LLM.
> https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec
>
> Sorry, I was wrong: per https://github.com/NVIDIA/TensorRT/blob/release/10.6/samples/common/sampleOptions.cpp#L1231, v10.6 already supports INT4, which needs to meet these requirements:
>
> INT4: low-precision integer type for weight compression
> INT4 is used for weight-only quantization. Requires dequantization before computing is performed.
> Conversion to and from INT4 type requires an explicit Q/DQ layer.
> INT4 weights are expected to be serialized by packing two elements per byte. For additional information, refer to the Quantized Weights section.

But I used pytorch_quantization to add 4-bit Q/DQ layers, and I can't use torch.onnx.export to export the model to ONNX. Is there any way to do that?

@lix19937

see https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_choosing_quant_methods.html

INT4 Weights-only AWQ (W4A16)
  • 4-bit integer group-wise/block-wise weight-only quantization with AWQ calibration.
  • Compresses an FP16/BF16 model to 25% of its original size.
  • Calibration time: tens of minutes.
  • Deploy via TensorRT-LLM. Supported GPUs: Ampere and later.

INT4-FP8 AWQ (W4A8)
  • 4-bit integer group-wise/block-wise weight quantization, FP8 per-tensor activation quantization, and AWQ calibration.
  • Compresses an FP16/BF16 model to 25% of its original size.
  • Calibration time: tens of minutes.
  • Deploy via TensorRT-LLM. Supported GPUs: Ada, Hopper, and later.
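As a concrete illustration of the W4A16 entry above, a minimal sketch using TensorRT-Model-Optimizer (modelopt); the model name and calibration texts are placeholders rather than anything from this thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq

name = "facebook/opt-125m"  # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(name)

def forward_loop(m):
    # AWQ calibration pass: run a few representative batches through the model.
    for text in ["hello world", "int4 awq calibration sample"]:
        ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
        m(ids)

# Group-wise INT4 weight-only quantization with AWQ calibration (W4A16).
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
```

The quantized model is then exported to a TensorRT-LLM checkpoint and built into an engine, per the two steps below.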

  • convert_checkpoint to get the TRT-LLM checkpoint
  • trtllm-build to build the engine (it can also dump the network as ONNX for visualization)
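A hedged sketch of the second step, wrapping trtllm-build from Python; the paths are placeholders and the exact flags vary across TensorRT-LLM versions, so treat this as an outline rather than exact usage:

```python
import subprocess

subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", "ckpt_int4_awq",  # TRT-LLM checkpoint from the previous step
        "--output_dir", "engine_int4_awq",    # where the engine and its config are written
        "--gemm_plugin", "auto",              # common setting; adjust per model and version
    ],
    check=True,
)
```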
