Is W4A(FP)8 quant not supported with bf16 datatype? #1843

Closed
wxsms opened this issue Jun 26, 2024 · 6 comments
Labels: bug (Something isn't working), question (Further information is requested)

Comments


wxsms commented Jun 26, 2024

System Info

Ubuntu, with Ada GPUs. TensorRT-LLM version: 0.11.0.dev2024061800

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Use example/quantization/quantize.py to quantize a model as follows (I am using Llama):

python3 ./quantization/quantize.py \
        --model_dir /mnt/models/source \
        --dtype bfloat16 \
        --qformat w4a8_awq \
        --output_dir /tmp/checkpoint \
        --calib_tp_size 4 \
        --tp_size 1

Expected behavior

The quantization should work.

Actual behavior

Not working; quantization fails with the error: FP8 is unsupported on with BF16 scales and zero-points!

Additional notes

I notice that in tensorrt_llm/cpp/tensorrt_llm/plugins/weightOnlyGroupwiseQuantMatmulPlugin/weightOnlyGroupwiseQuantMatmulPlugin.cpp there is a snippet of code like this:

#if defined(ENABLE_BF16)
    else if (mType == nvinfer1::DataType::kBF16)
    {
        if (quant_algo & FP8_ALPHA)
        {
            // FP8 requires at least sm89 devices
            if (mArch < 89)
            {
                TLLM_THROW("W4A(fp)8 kernel is unsupported on pre-Ada (sm<89) architectures!");
            }
            TLLM_THROW("FP8 is unsupported on with BF16 scales and zero-points!");
        }
        else
        {
            if (quant_algo & ZERO)
            {
                // has zeros
                m_weightOnlyGroupwiseGemmRunner
                    = std::make_shared<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunner<__nv_bfloat16,
                        cutlass::uint4b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS>>();
            }
            else
            {
                // no zeros
                m_weightOnlyGroupwiseGemmRunner
                    = std::make_shared<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunner<__nv_bfloat16,
                        cutlass::uint4b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_ONLY>>();
            }
        }
        mCudaKernelEnabled = tensorrt_llm::kernels::weight_only::is_supported(
            mArch, tensorrt_llm::kernels::weight_only::KernelType::BF16Int4Groupwise);
        mCudaKernelType = tensorrt_llm::kernels::weight_only::KernelType::BF16Int4Groupwise;
    }
#endif

I am not very sure, but is this a mistake? The error message mentions zero-points, yet the exception is thrown without checking the zero condition (which is in the next block, I think?).
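
To illustrate the point, the BF16 branch above reduces to roughly the following sketch (my own paraphrase of the quoted code, not a proposed patch):

// Paraphrase of the quoted BF16 branch (illustration only).
if (mType == nvinfer1::DataType::kBF16)
{
    if (quant_algo & FP8_ALPHA)
    {
        // Thrown for every BF16 + FP8-activation combination,
        // before the ZERO flag is ever inspected.
        TLLM_THROW("FP8 is unsupported on with BF16 scales and zero-points!");
    }
    else
    {
        // The zero-point check only happens in this non-FP8 path.
        if (quant_algo & ZERO) { /* FINEGRAINED_SCALE_AND_ZEROS runner */ }
        else                   { /* FINEGRAINED_SCALE_ONLY runner */ }
    }
}

So the throw fires before the zero-point condition is ever evaluated, which is why the wording of the message confused me.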

@wxsms added the bug label Jun 26, 2024
@nv-guomingz added the question label Jun 26, 2024
@Barry-Delaney
Collaborator

@wxsms thanks for the feedback. w4a8_awq with the BF16 data type is not supported yet; we will add it in a future update.
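
In the meantime, a possible workaround (untested here, and assuming the model tolerates an FP16 conversion) would be to run the same quantization with the float16 data type instead:

python3 ./quantization/quantize.py \
        --model_dir /mnt/models/source \
        --dtype float16 \
        --qformat w4a8_awq \
        --output_dir /tmp/checkpoint \
        --calib_tp_size 4 \
        --tp_size 1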

@nv-guomingz
Collaborator

Hi @wxsms, could we close this ticket now?


wxsms commented Jul 3, 2024

Hi @wxsms, could we close this ticket now?

It's okay. We can also close this issue once the feature is fully supported. You may close it at your discretion. Thanks.

@nv-guomingz
Collaborator

Thanks @wxsms. Please feel free to reopen it if needed.

@youki-sada

@Barry-Delaney Do you have any updates on this? It seems it is still not supported on v0.15.0. We have been waiting for this feature since bfloat16 is crucial for Gemma 2 27B.

@youki-sada

It looks like it's now supported on v0.16.0. Thank you.
https://github.com/NVIDIA/TensorRT-LLM/releases

Added W4A8 quantization support to BF16 models on Ada (SM89).
