-
@ftian1 Thanks for the RFC.
-
Do these packed dims imply the layout of the weights? For example, if the compression dim is along the output channel, does it require the weights to be contiguous along the output dim, i.e., the weight is supposed to be KxN? Also, what is the benefit of allowing different compression directions?
Not sure if I understand correctly, but can we shuffle the scales/zps to make sure g_idx is always sorted, so that we don't need an additional g_idx field?
-
Per offline discussion, we decided to take Solution 2, i.e., IPEX will be directly compatible with the HF format, and INC will also generate an HF-compatible format in `export_compressed_model()`.
-
Hi @ftian1. We have a question about … Given …

And the same goes for other output channels.

```python
import torch

# `scales` refers to the example scale tensor discussed above; this shuffles
# the scales of output channel 0.
scales_shuffled = torch.empty_like(scales)
scales_shuffled[0][0] = scales[0][1]
scales_shuffled[0][1] = scales[0][0]
scales_shuffled[0][2] = scales[0][3]
scales_shuffled[0][3] = scales[0][2]
```

With the shuffled scales and zero points, …
-
@Xia-Weiwen Thanks for raising that.
-
Hi @xin3he, thanks a lot for the explanation. Did you mean that all input channels are shuffled within an output channel? How do we get the correct scale and zero point?
-
I think we may need to reshuffle the channels based on g_idx before performing dequantization, then shuffle them back before performing the matmul. This design from GPTQ aims to improve accuracy.
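A minimal sketch of that idea, assuming already-unpacked integer weights and illustrative shapes (not the actual INC/IPEX kernels):

```python
import torch

# Assumed shapes (illustrative only):
#   qweight : [out_channels, in_channels]  unpacked integer weight
#   scales  : [out_channels, n_groups]
#   zeros   : [out_channels, n_groups]
#   g_idx   : [in_channels], g_idx[i] = group id of input channel i
def dequantize_with_g_idx(qweight, scales, zeros, g_idx, group_size):
    # 1. Reorder input channels so channels of the same group are contiguous.
    perm = torch.argsort(g_idx)
    w = qweight[:, perm].float()
    # 2. Dequantize group by group with the ordinary per-group scales/zeros.
    w = w.reshape(w.shape[0], -1, group_size)
    w = (w - zeros.unsqueeze(-1)) * scales.unsqueeze(-1)
    w = w.reshape(w.shape[0], -1)
    # 3. Shuffle the input channels back to the original order before the matmul.
    inv_perm = torch.argsort(perm)
    return w[:, inv_perm]
```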
-
@xin3he So g_idx has the same shape as the weight? Would the overhead of loading g_idx be too big here? Suppose we have a group size of 64; each value in g_idx has to be 4-bit, making g_idx the same size as the weight? Or did I miss anything?
-
Oh, sorry, my mistake @jgong5. The shape of g_idx is `[in_channels]` (one group index per input channel), so it is much smaller than the weight itself.
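For illustration (sizes are assumed, not taken from this thread):

```python
import torch

# Toy example: 8 input channels, group_size = 4.
# Without act-order, g_idx[i] is simply i // group_size:
g_idx = torch.arange(8) // 4
print(g_idx)  # tensor([0, 0, 0, 0, 1, 1, 1, 1])
# With GPTQ act-order (desc_act=True) the entries are permuted, e.g.
# tensor([1, 0, 0, 1, 0, 1, 1, 0]), but the length stays in_channels.
```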
-
@xin3he Sorry, I don't understand. Why can't we shuffle the input channels back to their original order after GPTQ and discard g_idx?
-
The quantization flow of weights using g_idx is shown in the attached figure.
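Roughly, in (assumed) code form, with the per-group quantizer itself elided:

```python
import torch

# Sketch of the flow described above (names and shapes are assumptions):
# GPTQ with act-order permutes input channels by activation importance,
# quantizes them group by group in that order, then stores the weight back
# in the ORIGINAL channel order; g_idx records the group of each channel.
def gptq_act_order_groups(weight, perm, group_size):
    in_ch = weight.shape[1]
    w_perm = weight[:, perm]                      # act-order permutation
    group_of_position = torch.arange(in_ch) // group_size
    # ... per-group quantization of w_perm would happen here ...
    inv_perm = torch.argsort(perm)
    g_idx = group_of_position[inv_perm]           # group id per ORIGINAL input channel
    w_original_order = w_perm[:, inv_perm]        # back to original channel order
    return w_original_order, g_idx
```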
-
@xin3he Thanks for the figure. It is much clearer now. So, we cannot get rid of g_idx. The final weights have input channels in the original order, but they belong to different groups now. Correct?
-
Yes, exactly.
-
This RFC proposes a Hugging Face-compatible yet flexible Weight Only Quantization (WOQ) format in INC, so that a model quantized by INC can be loaded by IPEX for further inference optimization.
Feature, Motivation and Pitch
As we know, WOQ is getting more attention from the industry. There are already a lot of quantized WOQ models, like Llama-2-7B-Chat-GPTQ, whose format has become the de facto standard for WOQ storage. Therefore, we propose a Hugging Face-compatible yet flexible WOQ format definition. With this, we can leverage the community effort behind those WOQ models and can also easily extend to new WOQ algorithms in the future, which may keep improving the accuracy of LLMs.
Design
A WOQ quantized model is usually saved on the Hugging Face model hub with a layout like the one below:
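For illustration only (the exact files vary per repo; the names below follow a typical GPTQ upload such as Llama-2-7B-Chat-GPTQ):

```
Llama-2-7B-Chat-GPTQ/
├── config.json               # includes the "quantization_config" section
├── model.safetensors         # per-layer qweight / qzeros / scales / g_idx
├── generation_config.json
├── tokenizer.json
├── tokenizer_config.json
└── special_tokens_map.json
```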
Users need a `quantization_config` to know which `group_size`, `desc_act`, and `sym` were used when generating such a WOQ model; however, these fields can also be derived from the WOQ checkpoint's contents (an illustrative `quantization_config` is sketched below).
So the WOQ checkpoint format is the key factor to consider. It mainly consists of two parts: the quantized tensors stored in the checkpoint (packed weight, scales, zero points, and g_idx when act-order is used), and the layout conventions, i.e., which dimension the weight is compressed along and which dimension the zero points are stored along.
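A hedged example of such a `quantization_config`, as it typically appears in a GPTQ checkpoint's `config.json` (field names follow the common HF/AutoGPTQ convention; the values are illustrative):

```python
# Illustrative only -- mirrors the JSON "quantization_config" section.
quantization_config = {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "desc_act": True,   # act-order was used, so g_idx is stored
    "sym": True,        # symmetric quantization
}
```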
NOTE: The fields marked with bold font are the ones missing in the current IPEX code.
In the industry, the common practice is to save the first part into the model checkpoint. For the second part, using the output channel as the compression dimension and the input channel as the zero-point dimension is the default behavior. INC extends the second part to also support the input channel as the compression dimension and the output channel as the zero-point dimension; this extended layout can be converted back to the default dimensions.
Solutions
Solution 1 (Recommended)
Enhance INC to export a converted model format that can be recognized by the current IPEX implementation.
This approach has minimal impact on IPEX's current WOQ implementation. However, to support GPTQ-like models, IPEX lacks `g_idx` support when `group_size != -1`, as well as the corresponding kernel; this is the existing feature gap in IPEX. In INC, `compression_dim` and `zp_dim` will be internally converted to the default format IPEX supports, i.e., compressing the weight along `input_channel` and storing the `zero point` along `output_channel`.
Solution 2
Enhance IPEX to be directly compatible with the latest and most popular WOQ model format.
In this solution, IPEX needs to be updated to work with the latest/popular WOQ formats in the industry.