
AVX IQ Quants #7845
Merged: 16 commits from netrunnereve:avx_iq into ggerganov:master on Jun 21, 2024

Conversation

netrunnereve (Collaborator) commented on Jun 10, 2024:

I finally had the time to work on original AVX versions of the IQ quants' ggml_vec_dot for Sandy Bridge and Ivy Bridge users.

Master:

iq3_xxs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :    121.59
      avg cycles/32 vals   :    122.17
      float32 throughput   :      2.68 GB/s
      quantized throughput :      0.26 GB/s

iq4_nl
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     67.44
      avg cycles/32 vals   :     67.99
      float32 throughput   :      4.92 GB/s
      quantized throughput :      0.69 GB/s

iq3_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :    111.94
      avg cycles/32 vals   :    112.66
      float32 throughput   :      2.93 GB/s
      quantized throughput :      0.32 GB/s

iq2_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     89.03
      avg cycles/32 vals   :     89.54
      float32 throughput   :      3.72 GB/s
      quantized throughput :      0.30 GB/s

iq4_xs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     87.30
      avg cycles/32 vals   :     88.05
      float32 throughput   :      3.72 GB/s
      quantized throughput :      0.49 GB/s

PR:

iq3_xxs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     24.12
      avg cycles/32 vals   :     24.41
      float32 throughput   :     13.87 GB/s
      quantized throughput :      1.33 GB/s

iq4_nl
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     10.12
      avg cycles/32 vals   :     10.33
      float32 throughput   :     38.15 GB/s
      quantized throughput :      5.36 GB/s

iq3_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     25.13
      avg cycles/32 vals   :     25.22
      float32 throughput   :     12.72 GB/s
      quantized throughput :      1.37 GB/s

iq2_s
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     14.95
      avg cycles/32 vals   :     15.16
      float32 throughput   :     21.80 GB/s
      quantized throughput :      1.75 GB/s

iq4_xs
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :     13.88
      avg cycles/32 vals   :     13.94
      float32 throughput   :     25.43 GB/s
      quantized throughput :      3.38 GB/s

Some example benchmarks:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B IQ4_XS - 4.25 bpw (Master) | 4.13 GiB | 8.03 B | CPU | 8 | pp512 | 1.78 ± 0.08 |
| llama 8B IQ4_XS - 4.25 bpw (Master) | 4.13 GiB | 8.03 B | CPU | 8 | tg128 | 1.60 ± 0.05 |
| llama 8B IQ4_XS - 4.25 bpw (PR) | 4.13 GiB | 8.03 B | CPU | 8 | pp512 | 10.95 ± 0.04 |
| llama 8B IQ4_XS - 4.25 bpw (PR) | 4.13 GiB | 8.03 B | CPU | 8 | tg128 | 7.72 ± 0.01 |
| llama 8B IQ2_XS - 2.3125 bpw (Master) | 2.42 GiB | 8.03 B | CPU | 8 | pp512 | 1.09 ± 0.00 |
| llama 8B IQ2_XS - 2.3125 bpw (Master) | 2.42 GiB | 8.03 B | CPU | 8 | tg128 | 1.02 ± 0.00 |
| llama 8B IQ2_XS - 2.3125 bpw (PR) | 2.42 GiB | 8.03 B | CPU | 8 | pp512 | 6.65 ± 0.00 |
| llama 8B IQ2_XS - 2.3125 bpw (PR) | 2.42 GiB | 8.03 B | CPU | 8 | tg128 | 5.47 ± 0.14 |

The scalar IQ code is really slow on my computer, even with an 8B model; pretty much any K quant of equivalent size can beat it even on a 30B model! I mostly followed the original AVX2 implementation, splitting the newer 256-bit instructions into two 128-bit ones where required.
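
To make the 256-to-128-bit conversion concrete, here is a minimal illustrative sketch of the general technique (not code lifted from this PR; the helper name is made up) showing how a single AVX2 integer intrinsic maps onto two SSE operations over the register halves:

```c
#include <immintrin.h>

// AVX2 path: one instruction covers 32 bytes at once:
//   __m256i p = _mm256_maddubs_epi16(a, b);
//
// Sandy/Ivy Bridge have 256-bit float ops but no 256-bit integer ops,
// so an AVX-only path does the same work on the two 128-bit halves
// using the SSSE3 form of the instruction:
static inline void maddubs_2x128(__m128i a_lo, __m128i a_hi,
                                 __m128i b_lo, __m128i b_hi,
                                 __m128i * p_lo, __m128i * p_hi) {
    *p_lo = _mm_maddubs_epi16(a_lo, b_lo); // u8 x i8 -> pairwise-summed i16
    *p_hi = _mm_maddubs_epi16(a_hi, b_hi);
}
```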

github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Jun 10, 2024.
mofosyne added the "Review Complexity : High" label (generally requires in-depth knowledge of LLMs or GPUs) on Jun 12, 2024.
netrunnereve marked this pull request as ready for review on Jun 16, 2024.
ggerganov (Owner) left a review comment:

Make sure the test-backend-ops passes

netrunnereve (Collaborator, Author) replied:
> Make sure the test-backend-ops passes

Yeah, it runs and passes with -b CPU; interestingly enough, the test is skipped by default if no backend is set. I actually used test-quantize-fns when writing this PR.

./tests/test-backend-ops -b CPU
Testing 1 backends

Backend 1/1 (CPU)
  Backend name: CPU
  ABS(type=f32,ne_a=[128,10,10,10],v=0): OK
...
  FLASH_ATTN_EXT(hs=256,nh=32,kv=1024,nb=8,mask=0,max_bias=0.000000,type_KV=q4_0): OK
  1270/1270 tests passed
  Backend CPU: OK

1/1 backends passed
OK

Considering that it only takes a minute to run, I think it's worth adding the CPU version of test-backend-ops to the CI.

slaren (Collaborator) commented on Jun 16, 2024:

The goal of test-backend-ops is to compare GPU backends with the CPU backend. When used with the CPU backend, it only compares the CPU backend with itself. It can still detect some issues such as NaN and Inf values, but generally I don't think it adds much value.

ggerganov (Owner) commented:

I compared the AVX CPU results against the GPU results on my Linux box and the tests pass. Should be good to merge.

netrunnereve mentioned this pull request on Jun 21, 2024.
ggerganov merged commit 7d5e877 into ggerganov:master on Jun 21, 2024 (65 checks passed).
netrunnereve deleted the avx_iq branch on Jun 21, 2024.
HoiV added a commit to HoiV/llama.cpp that referenced this pull request Jun 24, 2024
Update hv/matmul up to:
commit 557b653 (HEAD -> master, origin/master, origin/HEAD)
Author: k.h.lai <adrian.k.h.lai@outlook.com>
Date:   Fri Jun 21 16:28:20 2024 +0800

    vulkan: detect multiple devices by deviceUUID instead of deviceID (ggerganov#8022)

commit 7d5e877
Author: Eve <139727413+netrunnereve@users.noreply.github.com>
Date:   Fri Jun 21 05:57:36 2024 +0000

    ggml : AVX IQ quants (ggerganov#7845)

...
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
* initial iq4_xs

* fix ci

* iq4_nl

* iq1_m

* iq1_s

* iq2_xxs

* iq3_xxs

* iq2_s

* iq2_xs

* iq3_s before sllv

* iq3_s

* iq3_s small fix

* iq3_s sllv can be safely replaced with sse multiply (see the sketch after this list)
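
The last commit message refers to a handy trick: AVX2's per-lane variable shift _mm256_sllv_epi32 has no AVX/SSE counterpart, but since x << n equals x * (1 << n), it can be emulated with a 32-bit multiply. Here is a minimal sketch of that idea (the helper name is hypothetical, not the PR's exact code):

```c
#include <immintrin.h>

// Hypothetical helper, not the PR's exact code: emulate a per-lane
// variable left shift with an SSE4.1 multiply. Each lane of pow2 must
// hold (1 << n) for that lane's shift amount n, e.g.
// _mm_set_epi32(8, 4, 2, 1) for shifts of 3, 2, 1, 0.
static inline __m128i sllv_epi32_sse(__m128i x, __m128i pow2) {
    // x << n == x * (1 << n) for 32-bit lanes (low 32 bits are kept,
    // matching the wrap-around behavior of a left shift)
    return _mm_mullo_epi32(x, pow2);
}
```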