
What does "weights_scaling_factor_2" mean in safetensor results of awq_w4a8 #2561

Closed
gujiewen opened this issue Dec 11, 2024 · 3 comments


gujiewen commented Dec 11, 2024

I followed this step to quantize a Qwen2 model, and the resulting safetensors checkpoint looks like this:

[image: list of safetensors keys and shapes]

What do `prequant_scaling_factor`, `activation_scaling_factor`, `weights_scaling_factor`, and `weights_scaling_factor_2` mean, and how are they used in the w4a8 GEMM?

Barry-Delaney (Collaborator) commented:

For a linear layer with GEMM shape [M, N, K], we need these components in the TRT-LLM layer:

| Name | Dtype | Shape | Layout |
| --- | --- | --- | --- |
| `{LAYER_NAME}.weight` | float16 | [K, N / 4] | Interleaved and packed INT4 |
| `{LAYER_NAME}.weight_scaling_factor` | float16 | [K / group_size, N] | Row-major |
| `{LAYER_NAME}.activation_scaling_factor` | float16 | [K] | Row-major |
| `{LAYER_NAME}.alpha` | float32 | [1] | - |

The calculation process is:
output = FP16(FP8(act * activation_scaling_factor) * FP8(weight * weight_scaling_factor) * alpha)

However, the checkpoint will contain more parameters; here is how they are converted when building the engine.
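For intuition, here is a minimal NumPy sketch of that computation. It is only a toy emulation under assumptions: FP8 is faked by clipping to the E4M3 range and rounding through float16, and the INT4 weight is shown already dequantized to a [K, N] matrix; the real TRT-LLM kernel fuses all of this.

```python
import numpy as np

def fake_fp8(x):
    # Crude stand-in for FP8 (E4M3): clip to the representable range and
    # round through float16 to mimic precision loss.
    return np.clip(x, -448.0, 448.0).astype(np.float16).astype(np.float32)

M, N, K, group_size = 4, 8, 16, 8

act = np.random.randn(M, K).astype(np.float32)                # FP16 activations (toy)
weight = np.random.randn(K, N).astype(np.float32)             # dequantized INT4 weight (toy)
act_sf = np.random.rand(K).astype(np.float32)                 # activation_scaling_factor, [K]
w_sf = np.random.rand(K // group_size, N).astype(np.float32)  # weight_scaling_factor, [K / group_size, N]
alpha = np.float32(0.05)                                      # per-tensor de-quantization scale

# Broadcast the per-group weight scale up to the full [K, N] weight shape.
w_sf_full = np.repeat(w_sf, group_size, axis=0)

# output = FP16(FP8(act * activation_scaling_factor) @ FP8(weight * weight_scaling_factor) * alpha)
output = (fake_fp8(act * act_sf) @ fake_fp8(weight * w_sf_full) * alpha).astype(np.float16)
print(output.shape)  # (M, N)
```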


gujiewen commented Dec 18, 2024

> For a linear layer with GEMM shape [M, N, K], we need these components in the TRT-LLM layer:
>
> | Name | Dtype | Shape | Layout |
> | --- | --- | --- | --- |
> | `{LAYER_NAME}.weight` | float16 | [K, N / 4] | Interleaved and packed INT4 |
> | `{LAYER_NAME}.weight_scaling_factor` | float16 | [K / group_size, N] | Row-major |
> | `{LAYER_NAME}.activation_scaling_factor` | float16 | [K] | Row-major |
> | `{LAYER_NAME}.alpha` | float32 | [1] | - |
>
> The calculation process is: output = FP16(FP8(act * activation_scaling_factor) * FP8(weight * weight_scaling_factor) * alpha)
>
> However, the checkpoint will contain more parameters; here is how they are converted when building the engine.

Thanks for your reply. However, in w4a8_awq I found that prequant_scaling_factor has shape [K]. According to the source code in modeling_utils.py:

```python
if quant_algo == QuantAlgo.W4A8_AWQ:
    for name in list(weights):
        if name.endswith('weights_scaling_factor'):
            # Pop the two per-tensor FP8 scales from the checkpoint.
            activation_scaling_factor = weights.pop(
                name.replace('weights_scaling_factor',
                             'activation_scaling_factor'))
            weights_scaling_factor_2 = weights.pop(
                name.replace('weights_scaling_factor',
                             'weights_scaling_factor_2'))
            # Fold weights_scaling_factor_2 into the per-group weight scale.
            weights[name] /= weights_scaling_factor_2
            weights[name] = weights[name].to(torch.float16).view(
                str_dtype_to_torch(model_config.dtype))
            # Fold activation_scaling_factor into the pre-quant (smoothing) scale.
            weights[name.replace(
                'weights_scaling_factor',
                'prequant_scaling_factor')] /= activation_scaling_factor
            # alpha de-quantizes the FP8 GEMM output back to FP16.
            weights[name.replace(
                'weights_scaling_factor', 'alpha'
            )] = activation_scaling_factor * weights_scaling_factor_2
```

so alpha seems to be computed as activation_scaling_factor * weights_scaling_factor_2.

So the calculation process of w4a8 is:

output = FP16(FP8(act * prequant_scaling_factor / activation_scaling_factor) * FP8(weight * weights_scaling_factor / weights_scaling_factor_2) * activation_scaling_factor * weights_scaling_factor_2)

If we set

activation_scaling_factor' = prequant_scaling_factor / activation_scaling_factor

and

weights_scaling_factor' = weights_scaling_factor / weights_scaling_factor_2

the formula becomes

output = FP16(FP8(act * activation_scaling_factor') * FP8(weight * weights_scaling_factor') * alpha)

which is your form. Am I right?
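A quick toy check of this equivalence (hypothetical shapes, with weights_scaling_factor simplified to per-output-channel and a best-effort FP8 cast; not TRT-LLM code):

```python
import torch

torch.manual_seed(0)
M, N, K = 4, 8, 16

def fp8(x):
    # Round-trip through FP8 E4M3 if this torch build has it; otherwise pass through.
    if hasattr(torch, "float8_e4m3fn"):
        return x.to(torch.float8_e4m3fn).to(torch.float32)
    return x

act = torch.randn(M, K)
weight = torch.randn(K, N)                    # conceptually the dequantized INT4 weight
prequant_scaling_factor = torch.rand(K)       # AWQ smoothing scale, [K]
activation_scaling_factor = torch.rand(())    # per-tensor FP8 scale
weights_scaling_factor = torch.rand(N)        # simplified to per-channel for this check
weights_scaling_factor_2 = torch.rand(())     # per-tensor FP8 scale

# Explicit form, using the raw checkpoint scales.
out_explicit = (
    fp8(act * prequant_scaling_factor / activation_scaling_factor)
    @ fp8(weight * weights_scaling_factor / weights_scaling_factor_2)
) * (activation_scaling_factor * weights_scaling_factor_2)

# Folded form, as produced by the conversion in modeling_utils.py.
prequant_folded = prequant_scaling_factor / activation_scaling_factor
weight_sf_folded = weights_scaling_factor / weights_scaling_factor_2
alpha = activation_scaling_factor * weights_scaling_factor_2

out_folded = (fp8(act * prequant_folded) @ fp8(weight * weight_sf_folded)) * alpha

print(torch.allclose(out_explicit, out_folded))  # expected: True (up to float rounding)
```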

Barry-Delaney (Collaborator) commented:

Exactly.
For a clearer understanding, you can think of W4A8_AWQ as W4A16_AWQ + FP8.
In addition to the components of W4A16_AWQ, i.e. prequant_scaling_factor and weight_scaling_factor, FP8 provides 2 more per-tensor scaling factors, activation_scaling_factor and weights_scaling_factor_2. In order to have them combined in one GEMM, we have:

  • Folded the per-tensor activation_scaling_factor into prequant_scaling_factor
  • Folded the per-tensor weights_scaling_factor_2 into weight_scaling_factor
  • Exposed alpha (= activation_scaling_factor * weights_scaling_factor_2) as a layer parameter for de-quantization
