
Enable gptqmodel #35012

Open · wants to merge 47 commits into main

Conversation

jiqing-feng (Contributor) commented Nov 29, 2024

We are going to replace auto_gptq with gptqmodel. We start with the quantizer check; a corresponding change is also needed in optimum: huggingface/optimum#2064.

We intended to deprecate AutoGPTQ in this PR, but considering user behavior, we would like to keep auto_gptq support for the next few releases and emit a deprecation warning.
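For orientation, a minimal sketch of the user-facing flow this PR keeps unchanged, loading an existing GPTQ checkpoint with whichever backend (gptqmodel or, for now, auto-gptq) is installed; the model id is just an example:

```python
# Sketch only: load a pre-quantized GPTQ checkpoint; the backend (gptqmodel or auto-gptq)
# is picked up automatically when installed. The model id is an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # example pre-quantized GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```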

Rocketknight1 (Member)

cc @SunMarc @MekkCyber

Qubitium (Contributor)

@SunMarc GPTQModel is intended to replace AutoGPTQ entirely, given the lack of progress in that repo for many reasons. For the sake of compatibility, the two can co-exist in parallel until this integration is merged and everything is stable and tested; later we can initiate a deprecation plan for AutoGPTQ, which is no longer actively developed or maintained.

MekkCyber (Contributor)

Hey @jiqing-feng, thanks for adding gptqmodel, LGTM! Could you update the PR description and title to make them clearer? Thanks 😊

MekkCyber requested a review from SunMarc (November 29, 2024 14:48)
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
jiqing-feng marked this pull request as ready for review (December 2, 2024 05:14)
jiqing-feng changed the title from "gptqmodel" to "Enable gptqmodel" (Dec 2, 2024)
jiqing-feng marked this pull request as draft (December 2, 2024 09:11)
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
SunMarc (Member) left a comment

Thanks for this PR. Left a couple of comments. Note that we also need to modify the Dockerfile for our quantization tests if we decide to deprecate auto-gptq. It would also be nice to include a new version of a Colab notebook that works with gptqmodel.

Comment on lines 52 to 67

  gptq_supports_cpu = (
      is_auto_gptq_available()
      and version.parse(importlib.metadata.version("auto-gptq")) > version.parse("0.4.2")
  ) or is_gptqmodel_available()
  if not gptq_supports_cpu and not torch.cuda.is_available():
      raise RuntimeError("GPU is required to quantize or run quantize model.")
- elif not (is_optimum_available() and is_auto_gptq_available()):
+ elif not (is_optimum_available() and (is_auto_gptq_available() or is_gptqmodel_available())):
      raise ImportError(
-         "Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq library (`pip install auto-gptq`)"
+         "Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq or gptqmodel library (`pip install auto-gptq` or `pip install gptqmodel`)"
      )
- elif version.parse(importlib.metadata.version("auto_gptq")) < version.parse("0.4.2"):
+ elif is_auto_gptq_available() and version.parse(importlib.metadata.version("auto_gptq")) < version.parse(
+     "0.4.2"
+ ):
      raise ImportError(
-         "You need a version of auto_gptq >= 0.4.2 to use GPTQ: `pip install --upgrade auto-gptq`"
+         "You need a version of auto_gptq >= 0.4.2 to use GPTQ: `pip install --upgrade auto-gptq` or use gptqmodel by `pip install gptqmodel`"
      )
SunMarc (Member):

Can you add a message mentioning that autogptq will be deprecated? I think we can do it two versions of transformers from now. For optimum, maybe we can deprecate a bit later than in transformers, to make sure we can still revert if there is a big issue.

jiqing-feng (Contributor Author):

done.
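To make this thread concrete, here is a minimal sketch of an availability helper in the spirit of `is_gptqmodel_available()` together with the kind of deprecation notice requested above; the names, wording, and placement are assumptions rather than the merged implementation:

```python
# Sketch only: gptqmodel/auto-gptq availability checks plus an auto-gptq deprecation
# warning of the kind discussed in this thread. Names and wording are assumptions.
import importlib.metadata
import importlib.util
import logging

logger = logging.getLogger(__name__)


def _package_available(module_name: str, dist_name: str) -> bool:
    """Return True if the module can be imported and its distribution metadata exists."""
    if importlib.util.find_spec(module_name) is None:
        return False
    try:
        importlib.metadata.version(dist_name)
        return True
    except importlib.metadata.PackageNotFoundError:
        return False


def is_gptqmodel_available() -> bool:
    return _package_available("gptqmodel", "gptqmodel")


def is_auto_gptq_available() -> bool:
    return _package_available("auto_gptq", "auto-gptq")


if is_auto_gptq_available() and not is_gptqmodel_available():
    logger.warning(
        "auto-gptq support is deprecated and will be removed in a future release. "
        "Please migrate to gptqmodel: `pip install gptqmodel`."
    )
```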

Comment on lines 52 to 67 (same hunk as shown above)
SunMarc (Member):

Don't forget that users need to use the latest version of optimum with gptqmodel.

jiqing-feng (Contributor Author):

I have constrained the optimum and gptqmodel versions. The version constraints can be relaxed once new gptqmodel and optimum releases are out.
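For illustration, the kind of minimum-version gate being described might look like the sketch below; the version string and error text are placeholders, not the values actually pinned in this PR:

```python
# Sketch only: gate gptqmodel usage on a minimum optimum release. The version number
# below is a placeholder, not the constraint pinned in the merged PR.
import importlib.metadata
import importlib.util

from packaging import version

MIN_OPTIMUM_VERSION = "1.24.0"  # placeholder value

if importlib.util.find_spec("gptqmodel") is not None and importlib.util.find_spec("optimum") is not None:
    if version.parse(importlib.metadata.version("optimum")) < version.parse(MIN_OPTIMUM_VERSION):
        raise ImportError(
            f"gptqmodel requires optimum >= {MIN_OPTIMUM_VERSION}: `pip install --upgrade optimum`"
        )
```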

Qubitium (Contributor) commented Dec 2, 2024

@SunMarc The PR in its current state is not passing our internal tests. @jiqing-feng will merge in some of our changes so that both the inference and quantization tests pass. Please delay your review until then, since there are substantial changes relative to the current code/PR.

* gptqmodel need use checkpoint_format

* fix quantize

* Update quantization_config.py

* Update quantization_config.py

* Update quantization_config.py

---------

Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
LRL-ModelCloud and others added 2 commits December 4, 2024 08:55
* revert quantizer_gptq.py change

* pass **kwargs
jiqing-feng (Contributor Author)

The testing changes include:

Refactor: CPU tests no longer need @require_torch_gpu.
GPTQ lib: @require_gptq means these tests can run with either gptqmodel or auto-gptq (see the sketch below).
Default model: tests now default to llama instead of bloom because it is more common.
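A rough sketch of the @require_gptq idea mentioned above, assuming the availability helpers live in transformers.utils; the actual decorator in the test utilities may be named and implemented differently:

```python
# Sketch only: a test decorator that skips unless gptqmodel or auto-gptq is installed.
# The helper names/locations below are assumptions, not necessarily the merged code.
import unittest

from transformers.utils import is_auto_gptq_available, is_gptqmodel_available


def require_gptq(test_case):
    """Skip the decorated test unless either gptqmodel or auto-gptq is available."""
    return unittest.skipUnless(
        is_gptqmodel_available() or is_auto_gptq_available(),
        "test requires gptqmodel or auto-gptq",
    )(test_case)
```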

jiqing-feng and others added 8 commits December 4, 2024 10:32
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* revert quantizer_gptq.py change

* pass **kwargs

* add meta info

* cleanup

* cleanup

* Update quantization_config.py

* hf_select_quant_linear pass checkpoint_format and meta

* fix GPTQTestCUDA

* Update test_gptq.py

* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2

* cleanup

* add backend

* cleanup

* cleanup

* no need check exllama version

* Update quantization_config.py

* lower checkpoint_format and backend

* check none

* cleanup

* Update quantization_config.py

* fix self.use_exllama == False

* spell

* fix unittest

* fix unittest

---------

Co-authored-by: LRL <lrl@lbx.dev>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
Qubitium (Contributor) commented Dec 5, 2024

@SunMarc Review can start with optimum first. I will write up a detailed explainer on the optimum PR about some of the obvious small/large changes we pushed.

There may be some testing-code tweaks, but I do not foresee any major changes from this point forward other than fixing flaky tests and/or some testing bugs. Since optimum has the largest diffs and contains most of the GPTQ quantization logic, we will first concentrate on getting optimum review-cleared, then peft/transformers, in that order.

jiqing-feng and others added 6 commits December 23, 2024 17:50
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
jiqing-feng (Contributor Author) commented Dec 24, 2024

> I think there is no need to test CPU when only autogptq is available, because the model will be forced to move to CUDA. I'd like to remove the CPU tests that run only when autogptq is available, if you agree.

> I prefer to keep the CPU test for autogptq. Otherwise, there is no way to check that the per-layer quantization works with auto-gptq.

Yes, I have updated the test file to make sure all previous tests are still covered. But we found an error when testing device_map; it should be fixed by a tiny change in optimum.
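For reference, a rough sketch of the kind of CPU device_map load being discussed; the checkpoint id is the example one used elsewhere in this conversation, and the real test is structured differently:

```python
# Sketch only: load a GPTQ checkpoint on CPU via device_map, the scenario discussed above.
# The checkpoint id is just an example taken from this conversation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ModelCloud/Opt-125-GPTQ-4bit-10-25-2024"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu")

inputs = tokenizer("Hello my name is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```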

jiqing-feng (Contributor Author)

> Please run `pip install intel_extension_for_pytorch` and make sure your torch version is 2.5.0 or 2.5.1, or run `pip install -v --no-build-isolation gptqmodel[ipex]`.

> We shouldn't have to install intel_extension_for_pytorch to make the tests pass, no?

Right. Fixed in my new changes.

Qubitium (Contributor)

> Hey @jiqing-feng thanks for the fix! I am seeing some small failures related to `AssertionError: 'Hello my name is Katie. I am a 22 year' not found in {'Hello my name is Katie, I am a 22 year', 'Hello my name is Katie. I am a 20…'}`, but these should be easy to fix, maybe even afterwards.

Based on our own testing, the non-deterministic variability of the output is amplified in CPU testing. It looks like the CPU implementation and execution of fp16 varies much more than on GPU. Future test code on CPU needs to take this into consideration.
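A minimal sketch of the tolerant check implied here: accept any of several known-good generations rather than one exact string, so CPU fp16 run-to-run variation does not flip the test; the expected strings are the illustrative ones from the quoted failure:

```python
# Sketch only: compare a generation against a set of acceptable outputs instead of a
# single exact string, to absorb CPU fp16 non-determinism. Strings are illustrative.
EXPECTED_OUTPUTS = {
    "Hello my name is Katie. I am a 22 year",
    "Hello my name is Katie, I am a 22 year",
}


def check_generation(generated_text: str) -> None:
    assert generated_text in EXPECTED_OUTPUTS, f"unexpected generation: {generated_text!r}"
```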

MekkCyber (Contributor) commented Dec 24, 2024

Thanks for the fix @jiqing-feng, @Qubitium, all tests are passing with both packages now

SunMarc (Member) left a comment

LGTM! Thanks for iterating! Just a few nits.

@@ -92,9 +118,14 @@ from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")
```

## Marlin

[Marlin](https://github.com/IST-DASLab/marlin) is a CUDA gptq kernel, 4-bit only, that is highly optimized for the Nvidia A100 GPU (Ampere) architecture where the loading, dequantization, and execution of post-dequantized weights are highly parallelized offering a substantial inference improvement versus the original CUDA gptq kernel. Marlin is only available for quantized inference and does support model quantization.
SunMarc (Member):

Also, can we add a snippet to show the user how to use it? Generally, it will help users a lot if we explain a bit how the backend attribute in GPTQConfig works.

Suggested change
[Marlin](https://github.com/IST-DASLab/marlin) is a CUDA gptq kernel, 4-bit only, that is highly optimized for the Nvidia A100 GPU (Ampere) architecture where the loading, dequantization, and execution of post-dequantized weights are highly parallelized offering a substantial inference improvement versus the original CUDA gptq kernel. Marlin is only available for quantized inference and does support model quantization.
[Marlin](https://github.com/IST-DASLab/marlin) is a CUDA gptq kernel, 4-bit only, that is highly optimized for the Nvidia A100 GPU (Ampere) architecture where the loading, dequantization, and execution of post-dequantized weights are highly parallelized offering a substantial inference improvement versus the original CUDA gptq kernel. Marlin is only available for quantized inference and does not support model quantization.

Contributor:

@SunMarc Good idea. Example usage of selecting Marlin via backend has been added.

Comment on lines 95 to 100
<TIP>

**<sup>6</sup>** [See GGUF section](../gguf.md)

</Tip> No newline at end of file
</TIP>
SunMarc (Member):

For GGUF, can you revert the changes you made?

Contributor:

@SunMarc Done, please check. But I also made the following clean-ups to the table. The table is horizontally overflowing on my 14.5-inch 16:10 screen even with the edits; something needs to be done to break the table into a single-view, digestible version.

  • Fixed the footnotes to use subscript (instead of superscript) notation.
  • Renamed some of the column headers for brevity so we can fit more columns without overflow. See the following:
  • Just in time Quantization => Runtime Quantization
  • RoCm GPU (AMD) => ROCm GPU
  • Fine-tuning (via PEFT) => PEFT Fine-Tuning

Contributor:

@SunMarc

Our next PR will also address the issue of GPTQConfig __init__ requiring bits to be passed. It is not actually required and hurts API brevity when launching inference-only mode using backend="marlin". Adding this here so I won't forget.

AutoModelForCausalLM.from_pretrained(
    "ModelCloud/Opt-125-GPTQ-4bit-10-25-2024", device_map="auto", 
     quantization_config=GPTQConfig(bits=4, backend="marlin")) 

Can be optimized to:

AutoModelForCausalLM.from_pretrained(
    "ModelCloud/Opt-125-GPTQ-4bit-10-25-2024", device_map="auto", 
     quantization_config=GPTQConfig(backend="marlin")) 

SunMarc requested a review from ArthurZucker (December 24, 2024 11:07)
jiqing-feng and others added 8 commits December 24, 2024 06:53
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* review: update docs

* fix typo
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* update overview.md

* cleanup

* Update overview.md

* Update overview.md

* Update overview.md

* update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

---------

Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
Comment on lines 29 to 37
* Model support: GPTQModel continues to support all of the latest released LLM models.
* Multi-modal support: GPTQModel supports accurate quantization of Qwen 2-VL and Ovis 1.6-VL image-to-text models.
* Platform support: validated macOS Apple Silicon and Windows 11 support.
* Hardware support: Apple Silicon M1+, Intel/AMD CPU, and Intel Datacenter Max + Arc GPUs.
* IPEX kernel for Intel/AMD accelerated CPU and Intel GPU (Datacenter Max + Arc) support.
* Updated Marlin kernel from Neural Magic that is highly optimized for A100.
* Updated kernels with auto-padding for legacy model support and models with non-uniform in/out-features.
* Faster quantization, lower memory usage, and more accurate default quantization via the GPTQModel quantization APIs.
* User- and developer-friendly APIs.
Contributor:

Thanks for adding 😊

Comment on lines +588 to +589
checkpoint_format (`str`, *optional*, defaults to `"gptq"`):
GPTQ weight format. `gptq`(v1) is supported by both gptqmodel and auto-gptq. `gptq_v2` is gptqmodel only.
MekkCyber (Contributor):

I was wondering what the difference between the two formats is; could you add a link to a doc? 😊

Qubitium (Contributor), Dec 24, 2024:

@MekkCyber Good question. I pushed a doc change relating to this: 0aef2df

  • Asymmetric support: Asymmetric quantization can potentially introduce lower quantization errors compared to symmetric quantization. However, it is not backward compatible with AutoGPTQ, and not all kernels, such as Marlin, support asymmetric quantization.

GPTQModel operates internally in the gptq_v2 format. AutoGPTQ uses gptq_v1, but only with sym=True; AutoGPTQ's asymmetric (sym=False) support is broken. GPTQModel's gptq_v1 actually differs slightly (zero-point offset) from AutoGPTQ's, but it is backward compatible with AutoGPTQ and historic GPTQ kernels when sym=True.

Flow graph for sym and gptq_v1 and gptq_v2 support.

If AutoGPTQ:

  • sym=True: gptq_v1
  • sym=False: quantization works but generation is broken (ppl > 2K, for example). GPTQ is capable of sym=False, but AutoGPTQ never merged the bug fix (which I was later involved with).

If GPTQModel:

  • sym=True: gptq_v1 + gptq_v2 format. gptq_v1 is backward compatible with all autogptq and Marlin (vllm/sglang) kernels.
  • sym=False: gptq_v1 + gptq_v2 format. Here gptq_v1 is incompatible with autogptq, and Marlin never supported sym=False to begin with, so sym=False is only compatible with (some) GPTQModel kernels.

In summary:

  • gptq_v2 is a gptqmodel-only format, recognized only by gptqmodel loading code.
  • gptq_v1 is backward compatible with autogptq/vllm kernels (Marlin) only if sym=True. gptqmodel itself can run sym=True/False with both formats.

During runtime, GPTQModel upconverts all gptq_v1 checkpoints to the internal gptq_v2 format at load time (very fast). GPTQModel will reject loading any sym=False + gptq_v1 combination based on the meta.quantizer info (who quantized this gptq_v1 and with which version), since only a GPTQModel-quantized sym=False + gptq_v1 checkpoint is valid. This is one big reason we added the meta.quantizer info to quants produced by gptqmodel, and now to the optimum code.

Hopefully I made the above less confusing. To be frank, we have not yet written proper docs/notes about this difference in GPTQModel, but it is on our todo list.
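To tie the format discussion back to the transformers API, a sketch of an asymmetric (sym=False) quantization run, which takes the gptqmodel-only path described above; the model id, dataset, and output path are just examples, and whether checkpoint_format needs to be set explicitly may differ in practice:

```python
# Sketch only: quantize with sym=False, which is only supported on the gptqmodel path.
# Model id, dataset, and save path are placeholders for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer, sym=False)

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=quant_config
)
model.save_pretrained("opt-125m-gptq-asym")  # resulting quant runs only on gptqmodel kernels
```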

Contributor:

Thanks a lot @Qubitium for the clarifications
