Enable gptqmodel #35012
base: main
Conversation
@SunMarc GPTQModel is intended to replace AutoGPTQ entirely due to the lack of progress in that repo for many reasons, but for the sake of compatibility they can co-exist in parallel until this integration is merged and everything is stable/tested. Maybe later we can initiate a deprecation plan for AutoGPTQ, which is no longer actively developed and/or maintained.
Hey @jiqing-feng, thanks for adding
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Thanks for this PR. Left a couple of comments. Note that we also need to modify the dockerfile for our quantization tests if we decide to deprecate auto-gptq. It would also be nice to include a new version of a colab notebook that works with gptqmodel.
```diff
 gptq_supports_cpu = (
     is_auto_gptq_available()
     and version.parse(importlib.metadata.version("auto-gptq")) > version.parse("0.4.2")
 ) or is_gptqmodel_available()
 if not gptq_supports_cpu and not torch.cuda.is_available():
     raise RuntimeError("GPU is required to quantize or run quantize model.")
-elif not (is_optimum_available() and is_auto_gptq_available()):
+elif not (is_optimum_available() and (is_auto_gptq_available() or is_gptqmodel_available())):
     raise ImportError(
-        "Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq library (`pip install auto-gptq`)"
+        "Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq or gptqmodel library (`pip install auto-gptq` or `pip install gptqmodel`)"
     )
-elif version.parse(importlib.metadata.version("auto_gptq")) < version.parse("0.4.2"):
+elif is_auto_gptq_available() and version.parse(importlib.metadata.version("auto_gptq")) < version.parse(
+    "0.4.2"
+):
     raise ImportError(
-        "You need a version of auto_gptq >= 0.4.2 to use GPTQ: `pip install --upgrade auto-gptq`"
+        "You need a version of auto_gptq >= 0.4.2 to use GPTQ: `pip install --upgrade auto-gptq` or use gptqmodel by `pip install gptqmodel`"
     )
```
Can you add a message mentioning that AutoGPTQ will be deprecated? I think we can do it two versions of transformers from now. For optimum, maybe we can deprecate a bit later than transformers to make sure that we can still revert if there is a big issue.
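For illustration, a minimal sketch of what such a deprecation notice could look like; the wording, logger choice, and removal timeline here are placeholders, not the actual implementation in the PR:

```python
import logging

logger = logging.getLogger(__name__)

# Placeholder notice: the exact wording and removal schedule are decided by the maintainers.
logger.warning(
    "auto-gptq support for loading GPTQ models is deprecated and will be removed in a future "
    "release of transformers. Please migrate to gptqmodel: `pip install gptqmodel`."
)
```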
done.
Don't forget that users need to use the latest version of optimum with gptqmodel.
I have limited the optimum and gptqmodel versions. The version constraints can be updated after gptqmodel and optimum publish their releases.
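For illustration, a hedged sketch of how such a version gate might look. The minimum version numbers below are placeholders, not the values chosen in this PR, and the helper name is hypothetical:

```python
import importlib.metadata

from packaging import version

# Placeholder minimums for illustration only; the real pins are set in the PR and only become
# meaningful once the corresponding optimum and gptqmodel releases are out.
MIN_OPTIMUM_VERSION = "1.24.0"
MIN_GPTQMODEL_VERSION = "1.4.2"


def check_gptqmodel_requirements() -> None:
    """Raise if the installed optimum/gptqmodel are older than the assumed minimum versions."""
    if version.parse(importlib.metadata.version("optimum")) < version.parse(MIN_OPTIMUM_VERSION):
        raise ImportError(f"The gptqmodel integration requires optimum >= {MIN_OPTIMUM_VERSION}.")
    if version.parse(importlib.metadata.version("gptqmodel")) < version.parse(MIN_GPTQMODEL_VERSION):
        raise ImportError(f"The gptqmodel integration requires gptqmodel >= {MIN_GPTQMODEL_VERSION}.")
```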
@SunMarc The PR in its current state is not passing our internal tests. @jiqing-feng will merge some of our changes that pass both inference and quantization tests. Please delay your review until then, since there are substantial changes relative to the current code.
* gptqmodel need use checkpoint_format * fix quantize * Update quantization_config.py * Update quantization_config.py * Update quantization_config.py --------- Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai> Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
* revert quantizer_gptq.py change * pass **kwargs
Testing changes contain a refactor so that CPU tests do not need @require_torch_gpu.
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* revert quantizer_gptq.py change
* pass **kwargs
* add meta info
* cleanup
* cleanup
* Update quantization_config.py
* hf_select_quant_linear pass checkpoint_format and meta
* fix GPTQTestCUDA
* Update test_gptq.py
* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2
* cleanup
* add backend
* cleanup
* cleanup
* no need check exllama version
* Update quantization_config.py
* lower checkpoint_format and backend
* check none
* cleanup
* Update quantization_config.py
* fix self.use_exllama == False
* spell
* fix unittest
* fix unittest

Co-authored-by: LRL <lrl@lbx.dev> Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
@SunMarc Review can start. There may be some testing code tweaks, but I do not foresee any major changes from this point forward other than passing flaky tests and/or fixing some testing bugs.
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Yes, I have updated the test file to make sure all previous tests are still covered. But we found an error when testing device_map; it should be fixed by a tiny change in optimum.
Right. Fixed in my new changes.
Based on our own testing, non-deterministic variability of the output is amplified in
Thanks for the fix @jiqing-feng, @Qubitium, all tests are passing with both packages now.
LGTM! Thanks for iterating! Just a few nits.
docs/source/en/quantization/gptq.md
Outdated
````diff
 @@ -92,9 +118,14 @@ from transformers import AutoModelForCausalLM
  model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")
  ```
+
+## Marlin
+
+[Marlin](https://github.com/IST-DASLab/marlin) is a CUDA gptq kernel, 4-bit only, that is highly optimized for the Nvidia A100 GPU (Ampere) architecture where the the loading, dequantization, and execution of post-dequantized weights are highly parallelized offering a substantial inference improvement versus the original CUDA gptq kernel. Marlin is only available for quantized inference and does support model quantization.
````
Also, can we add a snippet to show the user how to use it? Generally, it will help the user a lot if we explain a bit how the backend attribute in GPTQConfig works.
```diff
-[Marlin](https://github.com/IST-DASLab/marlin) is a CUDA gptq kernel, 4-bit only, that is highly optimized for the Nvidia A100 GPU (Ampere) architecture where the the loading, dequantization, and execution of post-dequantized weights are highly parallelized offering a substantial inference improvement versus the original CUDA gptq kernel. Marlin is only available for quantized inference and does support model quantization.
+[Marlin](https://github.com/IST-DASLab/marlin) is a CUDA gptq kernel, 4-bit only, that is highly optimized for the Nvidia A100 GPU (Ampere) architecture where the the loading, dequantization, and execution of post-dequantized weights are highly parallelized offering a substantial inference improvement versus the original CUDA gptq kernel. Marlin is only available for quantized inference and does not support model quantization.
```
@SunMarc Good idea. Example usage of selecting Marlin via `backend` added.
```diff
 <TIP>
 
 **<sup>6</sup>** [See GGUF section](../gguf.md)
 
-</Tip>
\ No newline at end of file
+</TIP>
```
For GGUF, can you revert the changes you did?
@SunMarc Done. Please check. But I also made the following clean-ups to the table. The table is horizontally overflowing on my 14.5-inch 16:10 screen even with the edits; something needs to be done to break the table into a single-view, digestible version.

- Fixed the footnotes to use subscript (instead of superscript) notation.
- Renamed some of the column headers for brevity so we can fit more columns without overflow:
  - `Just in time Quantization` => `Runtime Quantization`
  - `RoCm GPU (AMD)` => `ROCm GPU`
  - `Fine-tuning (via PEFT)` => `PEFT Fine Tuning`
Our next PR will also address the issue of `GPTQConfig.__init__` forcing the caller to pass `bits`. It is actually not required and hinders API brevity when launching inference-only mode with `backend="marlin"`. Adding this here so I won't forget.
```python
from transformers import AutoModelForCausalLM, GPTQConfig

AutoModelForCausalLM.from_pretrained(
    "ModelCloud/Opt-125-GPTQ-4bit-10-25-2024", device_map="auto",
    quantization_config=GPTQConfig(bits=4, backend="marlin"))
```
Can be optimized to:
```python
AutoModelForCausalLM.from_pretrained(
    "ModelCloud/Opt-125-GPTQ-4bit-10-25-2024", device_map="auto",
    quantization_config=GPTQConfig(backend="marlin"))
```
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* review: update docs * fix typo
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* update overview.md * cleanup * Update overview.md * Update overview.md * Update overview.md * update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md --------- Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
docs/source/en/quantization/gptq.md
Outdated
* Model support: GPTQModel continues to support all of the latest released LLM models.
* Multi-modal support: GPTQModel supports accurate quantization of Qwen 2-VL and Ovis 1.6-VL image-to-text models.
* Platform support: Validated macOS Apple Silicon and Windows 11 support.
* Hardware support: Apple Silicon M1+, Intel/AMD CPU, and Intel Data Center Max + Arc GPUs.
* IPEX kernel for Intel/AMD accelerated CPU and Intel GPU (Data Center Max + Arc) support.
* Updated Marlin kernel from Neural Magic that is highly optimized for A100.
* Updated kernels with auto-padding for legacy model support and models with non-uniform in/out-features.
* Faster quantization, lower memory usage, and more accurate default quantization via GPTQModel quantization APIs.
* User and developer friendly APIs.
Thanks for adding 😊
checkpoint_format (`str`, *optional*, defaults to `"gptq"`):
    GPTQ weight format. `gptq` (v1) is supported by both gptqmodel and auto-gptq. `gptq_v2` is gptqmodel only.
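For illustration, a hedged sketch of how this parameter might be passed when quantizing, assuming `checkpoint_format` is exposed on `GPTQConfig` exactly as the docstring above describes; the model ID and calibration dataset are just examples:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # example model; any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_id)

# checkpoint_format="gptq_v2" assumes gptqmodel is installed, since auto-gptq only reads the v1 ("gptq") format.
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer, checkpoint_format="gptq_v2")

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quant_config)
```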
I was wondering what the difference between the two formats is; can you add a link to a doc? 😊
@MekkCyber Good question. I pushed a doc change relating to this: 0aef2df
- Asymmetric support: Asymmetric quantization can potentially introduce lower quantization errors compared to symmetric quantization. However, it is not backward compatible with AutoGPTQ, and not all kernels, such as Marlin, support asymmetric quantization.
GPTQModel operates internally in the gptq_v2 format. AutoGPTQ is gptq_v1 only with sym=True; AutoGPTQ's asymmetric sym=False support is broken. GPTQModel's gptq_v1 is actually slightly different (zero-point offset) from AutoGPTQ's, but backward compatible with AutoGPTQ and historic GPTQ kernels when sym=True.

Flow graph for `sym`, `gptq_v1`, and `gptq_v2` support:
If AutoGPTQ:

- sym=True: gptq_v1
- sym=False: quantization works but generation is broken (ppl > 2K, for example). GPTQ is capable of sym=False, but AutoGPTQ never merged the bug fix (that I was later involved with).

If GPTQModel:

- sym=True: gptq_v1 + gptq_v2 formats. gptq_v1 is backward compatible with all AutoGPTQ + Marlin (vLLM/SGLang) kernels.
- sym=False: gptq_v1 + gptq_v2 formats. Here gptq_v1 is incompatible with AutoGPTQ. Marlin never supported sym=False to begin with, so sym=False is only compatible with (some) GPTQModel kernels.
In summary:

- gptq_v2 is a gptqmodel-only format, recognized by gptqmodel loading code.
- gptq_v1 is backward compatible with AutoGPTQ/vLLM kernels (Marlin) only if sym=True. gptqmodel, however, can run sym=True/False for both formats.
At runtime, GPTQModel upconverts every gptq_v1 checkpoint to the internal gptq_v2 format during load (very fast). GPTQModel will reject loading a sym=False + gptq_v1 combo based on the meta.quantizer info (who quantized this gptq_v1 checkpoint, and with which version), since only sym=False + gptq_v1 quantized by GPTQModel is valid. This is one big reason we added the meta.quantizer info to quants produced by gptqmodel, and now to the optimum code.

Hopefully that makes the above less confusing. To be frank, we have not yet written proper docs/notes about this difference in GPTQModel, but it is on our todo list.
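For readers who prefer code to prose, here is a rough, unofficial sketch restating the compatibility rules described above; the function and its argument names are illustrative only and not part of any library API:

```python
def gptq_checkpoint_loadable(checkpoint_format: str, sym: bool, loader: str) -> bool:
    """Restate the format/sym compatibility rules from the discussion above.

    loader: "gptqmodel", "autogptq", or "marlin" (the vLLM/SGLang Marlin kernels).
    checkpoint_format: "gptq" (v1) or "gptq_v2".
    """
    if loader == "gptqmodel":
        # GPTQModel reads both formats with sym=True or sym=False; gptq (v1) checkpoints
        # are upconverted to the internal gptq_v2 format at load time.
        return checkpoint_format in ("gptq", "gptq_v2")
    if loader in ("autogptq", "marlin"):
        # AutoGPTQ and Marlin only understand the v1 format, and only with symmetric quantization.
        return checkpoint_format == "gptq" and sym
    return False


# Example: a sym=False v1 checkpoint is only loadable with gptqmodel.
assert gptq_checkpoint_loadable("gptq", sym=False, loader="gptqmodel")
assert not gptq_checkpoint_loadable("gptq", sym=False, loader="autogptq")
```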
Thanks a lot @Qubitium for the clarifications
We are going to replace `auto_gptq` with `gptqmodel`. We start with the quantizer check, and the optimum side also needs to change: huggingface/optimum#2064. We intended to deprecate AutoGPTQ in this PR, but considering users' behavior, we would like to keep auto_gptq support for the next few versions and emit a deprecation warning.