System Info

Ubuntu, with Ada GPUs. TensorRT-LLM version: 0.11.0.dev2024061800

Who can help?

@Tracin

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction

Use example/quantization/quantize.py to quantize a model like this (I am using Llama):

Expected behavior

The quantization should work.

actual behavior

Not working; it fails with the error: FP8 is unsupported on with BF16 scales and zero-points!

additional notes

I noticed that in tensorrt_llm/cpp/tensorrt_llm/plugins/weightOnlyGroupwiseQuantMatmulPlugin/weightOnlyGroupwiseQuantMatmulPlugin.cpp there is a snippet of code like this:
#if defined(ENABLE_BF16)
    else if (mType == nvinfer1::DataType::kBF16)
    {
        if (quant_algo & FP8_ALPHA)
        {
            // FP8 requires at least sm89 devices
            if (mArch < 89)
            {
                TLLM_THROW("W4A(fp)8 kernel is unsupported on pre-Ada (sm<89) architectures!");
            }
            TLLM_THROW("FP8 is unsupported on with BF16 scales and zero-points!");
        }
        else
        {
            if (quant_algo & ZERO)
            {
                // has zeros
                m_weightOnlyGroupwiseGemmRunner
                    = std::make_shared<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunner<__nv_bfloat16,
                        cutlass::uint4b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_AND_ZEROS>>();
            }
            else
            {
                // no zeros
                m_weightOnlyGroupwiseGemmRunner
                    = std::make_shared<tensorrt_llm::kernels::cutlass_kernels::CutlassFpAIntBGemmRunner<__nv_bfloat16,
                        cutlass::uint4b_t, cutlass::WeightOnlyQuantOp::FINEGRAINED_SCALE_ONLY>>();
            }
        }
        mCudaKernelEnabled = tensorrt_llm::kernels::weight_only::is_supported(
            mArch, tensorrt_llm::kernels::weight_only::KernelType::BF16Int4Groupwise);
        mCudaKernelType = tensorrt_llm::kernels::weight_only::KernelType::BF16Int4Groupwise;
    }
#endif
I am not very sure, but is this a mistake? The error message mentions zero-points, yet the code throws without checking the zero condition (which is in the next block, I think?).
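For reference, this is roughly the shape I would have expected the check to take if the intent were only to reject FP8 activations together with zero-points. This is just a sketch of my reading, not a proposed patch, and I do not know whether a scale-only BF16 W4A8 groupwise runner actually exists:

    if (quant_algo & FP8_ALPHA)
    {
        // FP8 requires at least sm89 devices
        if (mArch < 89)
        {
            TLLM_THROW("W4A(fp)8 kernel is unsupported on pre-Ada (sm<89) architectures!");
        }
        if (quant_algo & ZERO)
        {
            // Only reject the combination the message actually describes:
            // FP8 activations with BF16 scales *and* zero-points.
            TLLM_THROW("FP8 is unsupported with BF16 scales and zero-points!");
        }
        // A scale-only FP8 path would have to be handled here, assuming a
        // BF16 W4A8 groupwise runner is available (which may not be the case).
    }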
@Barry-Delaney Do you have any updates on this? It seems this is still not supported on v0.15.0. We have been waiting for this feature, since bfloat16 is crucial for Gemma 2 27B.