-
@ftian1 Thanks for the RFC.
-
Do these packed dims imply the layout of the weights? For example, if the compression dim is along the output channel, does it require the weights to be contiguous along the output dim, i.e., the weight is supposed to be KxN? Also, what is the benefit of allowing different compression directions?
Not sure if I understand correctly, but can we shuffle the scales/zps to make sure g_idx is always sorted, so that we don't need an additional g_idx field?
-
Per offline discussion, we decided to take Solution 2, i.e., IPEX will be directly compatible with the HF format, and INC will also generate an HF-compatible format in `export_compressed_model()`.
-
Hi @ftian1. We have a question about … Given …

And the same goes for other output channels.

```python
import torch

# `scales` refers to the example scale tensor discussed above; this shuffles
# the scales of output channel 0.
scales_shuffled = torch.empty_like(scales)
scales_shuffled[0][0] = scales[0][1]
scales_shuffled[0][1] = scales[0][0]
scales_shuffled[0][2] = scales[0][3]
scales_shuffled[0][3] = scales[0][2]
```

With the shuffled scales and zero points, …
-
@Xia-Weiwen Thanks for raising that.
-
Hi @xin3he, thanks a lot for the explanation. Did you mean that all input channels are shuffled within an output channel? How do we get the correct scale and zero point?
-
I think we may need to reshuffle the channels based on g_idx before performing dequantization, then shuffle them back before performing the matmul. This design from GPTQ aims to improve accuracy.
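A minimal sketch of that idea, assuming already-unpacked integer weights and illustrative shapes (not the actual INC/IPEX kernels):

```python
import torch

# Assumed shapes (illustrative only):
#   qweight : [out_channels, in_channels]  unpacked integer weight
#   scales  : [out_channels, n_groups]
#   zeros   : [out_channels, n_groups]
#   g_idx   : [in_channels], g_idx[i] = group id of input channel i
def dequantize_with_g_idx(qweight, scales, zeros, g_idx, group_size):
    # 1. Reorder input channels so channels of the same group are contiguous.
    perm = torch.argsort(g_idx)
    w = qweight[:, perm].float()
    # 2. Dequantize group by group with the ordinary per-group scales/zeros.
    w = w.reshape(w.shape[0], -1, group_size)
    w = (w - zeros.unsqueeze(-1)) * scales.unsqueeze(-1)
    w = w.reshape(w.shape[0], -1)
    # 3. Shuffle the input channels back to the original order before the matmul.
    inv_perm = torch.argsort(perm)
    return w[:, inv_perm]
```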
-
@xin3he So g_idx has the same shape as the weight? Would the overhead of loading g_idx be too big here? Suppose we have a group size of 64; each value in g_idx has to be 4-bit, making g_idx the same size as the weight? Or did I miss anything?
-
Oh, sorry, my mistake @jgong5. The shape of g_idx is `[in_channels]` (one group index per input channel), so it is much smaller than the weight itself.
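For illustration (sizes are assumed, not taken from this thread):

```python
import torch

# Toy example: 8 input channels, group_size = 4.
# Without act-order, g_idx[i] is simply i // group_size:
g_idx = torch.arange(8) // 4
print(g_idx)  # tensor([0, 0, 0, 0, 1, 1, 1, 1])
# With GPTQ act-order (desc_act=True) the entries are permuted, e.g.
# tensor([1, 0, 0, 1, 0, 1, 1, 0]), but the length stays in_channels.
```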
-
@xin3he Sorry, I don't understand. Why can't we shuffle the input channels back to their original order after GPTQ and discard g_idx?
-
The quantization flow of weights using g_idx is shown in the attached figure.
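Roughly, in (assumed) code form, with the per-group quantizer itself elided:

```python
import torch

# Sketch of the flow described above (names and shapes are assumptions):
# GPTQ with act-order permutes input channels by activation importance,
# quantizes them group by group in that order, then stores the weight back
# in the ORIGINAL channel order; g_idx records the group of each channel.
def gptq_act_order_groups(weight, perm, group_size):
    in_ch = weight.shape[1]
    w_perm = weight[:, perm]                      # act-order permutation
    group_of_position = torch.arange(in_ch) // group_size
    # ... per-group quantization of w_perm would happen here ...
    inv_perm = torch.argsort(perm)
    g_idx = group_of_position[inv_perm]           # group id per ORIGINAL input channel
    w_original_order = w_perm[:, inv_perm]        # back to original channel order
    return w_original_order, g_idx
```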
-
@xin3he Thanks for the figure. It is much clearer now. So, we cannot get rid of g_idx. The final weights have input channels in the original order, but they belong to different groups now. Correct?
-
Yes, exactly.
-
This RFC proposes a Hugging Face-compatible yet flexible Weight Only Quantization (WOQ) format in INC, so that a model quantized by INC can be loaded by IPEX for further inference optimization.
Feature, Motivation and Pitch
As we know, WOQ is getting more attention from the industry. There are already a lot of quantized WOQ models, like Llama-2-7B-Chat-GPTQ, whose format has become the de facto standard for WOQ storage. Therefore, we propose a Hugging Face-compatible yet flexible WOQ format definition. With this, we can leverage the community effort behind those WOQ models and can also easily extend to new WOQ algorithms in the future, which may keep improving the accuracy of LLMs.
Design
A WOQ quantized model is usually saved on the Hugging Face model hub with a layout like the one below:
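For illustration only (the exact files vary per repo; the names below follow a typical GPTQ upload such as Llama-2-7B-Chat-GPTQ):

```
Llama-2-7B-Chat-GPTQ/
├── config.json               # includes the "quantization_config" section
├── model.safetensors         # per-layer qweight / qzeros / scales / g_idx
├── generation_config.json
├── tokenizer.json
├── tokenizer_config.json
└── special_tokens_map.json
```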
Users need a `quantization_config` to know which `group_size`, `desc_act`, and `sym` were used when generating such a WOQ model; however, these fields can also be derived from the WOQ checkpoint's contents (an illustrative `quantization_config` is sketched below).
So the WOQ checkpoint format is the key factor to consider. It mainly consists of two parts: the quantized tensors stored in the checkpoint (packed weight, scales, zero points, and g_idx when act-order is used), and the layout conventions, i.e., which dimension the weight is compressed along and which dimension the zero points are stored along.
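A hedged example of such a `quantization_config`, as it typically appears in a GPTQ checkpoint's `config.json` (field names follow the common HF/AutoGPTQ convention; the values are illustrative):

```python
# Illustrative only -- mirrors the JSON "quantization_config" section.
quantization_config = {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128,
    "desc_act": True,   # act-order was used, so g_idx is stored
    "sym": True,        # symmetric quantization
}
```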
NOTE: The fields marked with bold font are the ones missing in the current IPEX code.
In the industry, the common practice is to save the first part into the model checkpoint. For the second part, using the output channel as the compression dimension and the input channel as the zero-point dimension is the default behavior. INC extends the second part to also support the input channel as the compression dimension and the output channel as the zero-point dimension; this extended layout can be converted back to the default dimensions.
Solutions
Solution 1 (Recommended)
Enhance INC to export a converted model format that can be recognized by the current IPEX implementation.
This approach has minimal impact on IPEX's current WOQ implementation. However, to support GPTQ-like models, IPEX lacks `g_idx` support when `group_size != -1`, as well as the corresponding kernel; this is the existing feature gap in IPEX. In INC, `compression_dim` and `zp_dim` will be internally converted to the default format IPEX supports, i.e., compressing the weight along `input_channel` and storing the `zero point` along `output_channel`.
Solution 2
Enhance IPEX to be directly compatible with the latest and most popular WOQ model format.
In this solution, IPEX needs to be updated to work with the latest/popular WOQ formats in the industry.