Enable gptqmodel #35012
base: main
Conversation
@SunMarc GPTQModel is intended to replace AutoGPTQ entirely due to the lack of progress in that repo for many reasons, but for the sake of compatibility they can co-exist in parallel until this integration is merged and everything is stable/tested. Maybe later we can initiate a deprecation plan for AutoGPTQ, which is no longer actively developed and/or maintained.
Hey @jiqing-feng, thanks for adding
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Thanks for this PR. Left a couple of comments. Note that we also need to modify the dockerfile for our quantization tests if we decide to deprecate auto-gptq. It would also be nice to include a new version of a colab notebook that works with gptqmodel.
```diff
 gptq_supports_cpu = (
     is_auto_gptq_available()
     and version.parse(importlib.metadata.version("auto-gptq")) > version.parse("0.4.2")
 ) or is_gptqmodel_available()
 if not gptq_supports_cpu and not torch.cuda.is_available():
     raise RuntimeError("GPU is required to quantize or run quantize model.")
-elif not (is_optimum_available() and is_auto_gptq_available()):
+elif not (is_optimum_available() and (is_auto_gptq_available() or is_gptqmodel_available())):
     raise ImportError(
-        "Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq library (`pip install auto-gptq`)"
+        "Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq or gptqmodel library (`pip install auto-gptq` or `pip install gptqmodel`)"
     )
-elif version.parse(importlib.metadata.version("auto_gptq")) < version.parse("0.4.2"):
+elif is_auto_gptq_available() and version.parse(importlib.metadata.version("auto_gptq")) < version.parse(
+    "0.4.2"
+):
     raise ImportError(
-        "You need a version of auto_gptq >= 0.4.2 to use GPTQ: `pip install --upgrade auto-gptq`"
+        "You need a version of auto_gptq >= 0.4.2 to use GPTQ: `pip install --upgrade auto-gptq` or use gptqmodel by `pip install gptqmodel`"
     )
```
Can you add a message mentioning that AutoGPTQ will be deprecated? I think we can do it two versions of transformers from now. For optimum, maybe we can deprecate a bit later than transformers to make sure that we can still revert if there is a big issue.
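For illustration, a minimal sketch of what such a deprecation notice could look like; the wording, logger choice, and removal timeline here are placeholders, not the actual implementation in the PR:

```python
import logging

logger = logging.getLogger(__name__)

# Placeholder notice: the exact wording and removal schedule are decided by the maintainers.
logger.warning(
    "auto-gptq support for loading GPTQ models is deprecated and will be removed in a future "
    "release of transformers. Please migrate to gptqmodel: `pip install gptqmodel`."
)
```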
done.
Don't forget that users need to use the latest version of optimum with gptqmodel.
I have limited the optimum and gptqmodel versions. The version constraints can be updated after gptqmodel and optimum publish their releases.
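For illustration, a hedged sketch of how such a version gate might look. The minimum version numbers below are placeholders, not the values chosen in this PR, and the helper name is hypothetical:

```python
import importlib.metadata

from packaging import version

# Placeholder minimums for illustration only; the real pins are set in the PR and only become
# meaningful once the corresponding optimum and gptqmodel releases are out.
MIN_OPTIMUM_VERSION = "1.24.0"
MIN_GPTQMODEL_VERSION = "1.4.2"


def check_gptqmodel_requirements() -> None:
    """Raise if the installed optimum/gptqmodel are older than the assumed minimum versions."""
    if version.parse(importlib.metadata.version("optimum")) < version.parse(MIN_OPTIMUM_VERSION):
        raise ImportError(f"The gptqmodel integration requires optimum >= {MIN_OPTIMUM_VERSION}.")
    if version.parse(importlib.metadata.version("gptqmodel")) < version.parse(MIN_GPTQMODEL_VERSION):
        raise ImportError(f"The gptqmodel integration requires gptqmodel >= {MIN_GPTQMODEL_VERSION}.")
```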
@SunMarc The PR in its current state is not passing our internal tests. @jiqing-feng will merge some of our changes that pass both inference and quantization tests. Please delay your review until then, since there are substantial changes relative to the current code.
* gptqmodel need use checkpoint_format * fix quantize * Update quantization_config.py * Update quantization_config.py * Update quantization_config.py --------- Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai> Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
* revert quantizer_gptq.py change * pass **kwargs
Testing changes contain a refactor so that CPU tests do not need @require_torch_gpu.
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* revert quantizer_gptq.py change
* pass **kwargs
* add meta info
* cleanup
* cleanup
* Update quantization_config.py
* hf_select_quant_linear pass checkpoint_format and meta
* fix GPTQTestCUDA
* Update test_gptq.py
* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2
* cleanup
* add backend
* cleanup
* cleanup
* no need check exllama version
* Update quantization_config.py
* lower checkpoint_format and backend
* check none
* cleanup
* Update quantization_config.py
* fix self.use_exllama == False
* spell
* fix unittest
* fix unittest

Co-authored-by: LRL <lrl@lbx.dev> Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
@SunMarc Review can start. There may be some testing code tweaks, but I do not foresee any major changes from this point forward other than passing flaky tests and/or fixing some testing bugs.
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Yes, I have updated the test file to make sure all previous tests are still covered. But we found an error when testing device_map; it should be fixed by a tiny change in optimum.
Right. Fixed in my new changes.
Based on our own testing, non-deterministic variability of the output is amplified in
Thanks for the fix @jiqing-feng, @Qubitium, all tests are passing with both packages now.
LGTM! Thanks for iterating! Just a few nits.
docs/source/en/quantization/gptq.md
Outdated
````diff
 @@ -92,9 +118,14 @@ from transformers import AutoModelForCausalLM
  model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")
  ```
+
+## Marlin
+
+[Marlin](https://github.com/IST-DASLab/marlin) is a CUDA gptq kernel, 4-bit only, that is highly optimized for the Nvidia A100 GPU (Ampere) architecture where the the loading, dequantization, and execution of post-dequantized weights are highly parallelized offering a substantial inference improvement versus the original CUDA gptq kernel. Marlin is only available for quantized inference and does support model quantization.
````
Also, can we add a snippet to show the user how to use it? Generally, it will help the user a lot if we explain a bit how the backend attribute in GPTQConfig works.
```diff
-[Marlin](https://github.com/IST-DASLab/marlin) is a CUDA gptq kernel, 4-bit only, that is highly optimized for the Nvidia A100 GPU (Ampere) architecture where the the loading, dequantization, and execution of post-dequantized weights are highly parallelized offering a substantial inference improvement versus the original CUDA gptq kernel. Marlin is only available for quantized inference and does support model quantization.
+[Marlin](https://github.com/IST-DASLab/marlin) is a CUDA gptq kernel, 4-bit only, that is highly optimized for the Nvidia A100 GPU (Ampere) architecture where the the loading, dequantization, and execution of post-dequantized weights are highly parallelized offering a substantial inference improvement versus the original CUDA gptq kernel. Marlin is only available for quantized inference and does not support model quantization.
```
@SunMarc Good idea. Example usage of selecting Marlin via `backend` added.
```diff
 <TIP>
 
 **<sup>6</sup>** [See GGUF section](../gguf.md)
 
-</Tip>
\ No newline at end of file
+</TIP>
```
For GGUF, can you revert the changes you did?
@SunMarc Done. Please check. But I also made the following clean-ups to the table. The table is horizontally overflowing on my 14.5-inch 16:10 screen even with the edits; something needs to be done to break the table into a single-view, digestible version.

- Fixed the footnotes to use subscript (instead of superscript) notation.
- Renamed some of the column headers for brevity so we can fit more columns without overflow:
  - `Just in time Quantization` => `Runtime Quantization`
  - `RoCm GPU (AMD)` => `ROCm GPU`
  - `Fine-tuning (via PEFT)` => `PEFT Fine Tuning`
Our next PR will also address the issue of `GPTQConfig.__init__` forcing the caller to pass `bits`. It is actually not required and hinders API brevity when launching inference-only mode with `backend="marlin"`. Adding this here so I won't forget.
```python
from transformers import AutoModelForCausalLM, GPTQConfig

AutoModelForCausalLM.from_pretrained(
    "ModelCloud/Opt-125-GPTQ-4bit-10-25-2024", device_map="auto",
    quantization_config=GPTQConfig(bits=4, backend="marlin"))
```
Can be optimized to:
```python
AutoModelForCausalLM.from_pretrained(
    "ModelCloud/Opt-125-GPTQ-4bit-10-25-2024", device_map="auto",
    quantization_config=GPTQConfig(backend="marlin"))
```
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* review: update docs * fix typo
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* update overview.md * cleanup * Update overview.md * Update overview.md * Update overview.md * update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md * Update gptq.md --------- Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
docs/source/en/quantization/gptq.md
Outdated
* Model support: GPTQModel continues to support all of the latest released LLM models.
* Multi-modal support: GPTQModel supports accurate quantization of Qwen 2-VL and Ovis 1.6-VL image-to-text models.
* Platform support: Validated macOS Apple Silicon and Windows 11 support.
* Hardware support: Apple Silicon M1+, Intel/AMD CPU, and Intel Data Center Max + Arc GPUs.
* IPEX kernel for Intel/AMD accelerated CPU and Intel GPU (Data Center Max + Arc) support.
* Updated Marlin kernel from Neural Magic that is highly optimized for A100.
* Updated kernels with auto-padding for legacy model support and models with non-uniform in/out-features.
* Faster quantization, lower memory usage, and more accurate default quantization via GPTQModel quantization APIs.
* User and developer friendly APIs.
Thanks for adding 😊
checkpoint_format (`str`, *optional*, defaults to `"gptq"`):
    GPTQ weight format. `gptq` (v1) is supported by both gptqmodel and auto-gptq. `gptq_v2` is gptqmodel only.
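For illustration, a hedged sketch of how this parameter might be passed when quantizing, assuming `checkpoint_format` is exposed on `GPTQConfig` exactly as the docstring above describes; the model ID and calibration dataset are just examples:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # example model; any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_id)

# checkpoint_format="gptq_v2" assumes gptqmodel is installed, since auto-gptq only reads the v1 ("gptq") format.
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer, checkpoint_format="gptq_v2")

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quant_config)
```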
I was wondering what the difference between the two formats is; can you add a link to a doc? 😊
@MekkCyber Good question. I pushed a doc change relating to this: 0aef2df
- Asymmetric support: Asymmetric quantization can potentially introduce lower quantization errors compared to symmetric quantization. However, it is not backward compatible with AutoGPTQ, and not all kernels, such as Marlin, support asymmetric quantization.
GPTQModel operates internally in the gptq_v2 format. AutoGPTQ is gptq_v1 only with sym=True; AutoGPTQ's asymmetric sym=False support is broken. GPTQModel's gptq_v1 is actually slightly different (zero-point offset) from AutoGPTQ's, but backward compatible with AutoGPTQ and historic GPTQ kernels when sym=True.

Flow graph for `sym`, `gptq_v1`, and `gptq_v2` support:
If AutoGPTQ:

- sym=True: gptq_v1
- sym=False: quantization works but generation is broken (ppl > 2K, for example). GPTQ is capable of sym=False, but AutoGPTQ never merged the bug fix (that I was later involved with).

If GPTQModel:

- sym=True: gptq_v1 + gptq_v2 formats. gptq_v1 is backward compatible with all AutoGPTQ + Marlin (vLLM/SGLang) kernels.
- sym=False: gptq_v1 + gptq_v2 formats. Here gptq_v1 is incompatible with AutoGPTQ. Marlin never supported sym=False to begin with, so sym=False is only compatible with (some) GPTQModel kernels.
In summary:

- gptq_v2 is a gptqmodel-only format, recognized by gptqmodel loading code.
- gptq_v1 is backward compatible with AutoGPTQ/vLLM kernels (Marlin) only if sym=True. gptqmodel, however, can run sym=True/False for both formats.
At runtime, GPTQModel upconverts every gptq_v1 checkpoint to the internal gptq_v2 format during load (very fast). GPTQModel will reject loading a sym=False + gptq_v1 combo based on the meta.quantizer info (who quantized this gptq_v1 checkpoint, and with which version), since only sym=False + gptq_v1 quantized by GPTQModel is valid. This is one big reason we added the meta.quantizer info to quants produced by gptqmodel, and now to the optimum code.

Hopefully that makes the above less confusing. To be frank, we have not yet written proper docs/notes about this difference in GPTQModel, but it is on our todo list.
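For readers who prefer code to prose, here is a rough, unofficial sketch restating the compatibility rules described above; the function and its argument names are illustrative only and not part of any library API:

```python
def gptq_checkpoint_loadable(checkpoint_format: str, sym: bool, loader: str) -> bool:
    """Restate the format/sym compatibility rules from the discussion above.

    loader: "gptqmodel", "autogptq", or "marlin" (the vLLM/SGLang Marlin kernels).
    checkpoint_format: "gptq" (v1) or "gptq_v2".
    """
    if loader == "gptqmodel":
        # GPTQModel reads both formats with sym=True or sym=False; gptq (v1) checkpoints
        # are upconverted to the internal gptq_v2 format at load time.
        return checkpoint_format in ("gptq", "gptq_v2")
    if loader in ("autogptq", "marlin"):
        # AutoGPTQ and Marlin only understand the v1 format, and only with symmetric quantization.
        return checkpoint_format == "gptq" and sym
    return False


# Example: a sym=False v1 checkpoint is only loadable with gptqmodel.
assert gptq_checkpoint_loadable("gptq", sym=False, loader="gptqmodel")
assert not gptq_checkpoint_loadable("gptq", sym=False, loader="autogptq")
```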
Thanks a lot @Qubitium for the clarifications
We are going to replace `auto_gptq` with `gptqmodel`. We start with the quantizer check, and the optimum side also needs to change: huggingface/optimum#2064. We intended to deprecate AutoGPTQ in this PR, but considering users' behavior, we would like to keep auto_gptq support for the next few versions and emit a deprecation warning.