Releases: ngxson/llama.cpp

b4122

18 Nov 11:38
9b75f03
Vulkan: Fix device info output format specifiers (#10366)

* Vulkan: Fix device info output format specifiers

* Vulkan: Use the `%zu` printf specifier for `size_t` instead of `%ld`

b4120

17 Nov 23:32
76e9e58
CUDA: fix MMV kernel being used for FP16 src1 (#10357)

b4118

17 Nov 12:36
be5cacc
llama : only use default buffer types for the KV cache (#10358)

b4114

17 Nov 09:35
c3ea58a
CUDA: remove DMMV, consolidate F16 mult mat vec (#10318)

b4112

17 Nov 07:33
eda7e1d
ggml : fix possible buffer use after free in sched reserve (#9930)

b4103

16 Nov 23:44
4e54be0
llama/ex: remove --logdir argument (#10339)

b4102

16 Nov 20:39
llamafile : fix include path (#0)

ggml-ci

b4100

16 Nov 14:30
bcdb7a2
server: (web UI) Add samplers sequence customization (#10255)

* Samplers sequence: simplified and added an input field.

* Removed unused function

* Modify and use `settings-modal-short-input`

* rename "name" --> "label"

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

b4098

16 Nov 07:50
772703c
vulkan: Optimize some mat-vec mul quant shaders (#10296)

Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses
the B loads across the rows and also reuses some addressing calculations.
This required manually partially unrolling the loop, since the compiler
is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at
least partly broken before.

Optimize the Q4_K shader to vectorize most loads and reduce the number of
bit twiddling instructions.
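
A hypothetical scalar sketch of the reuse idea described above (illustrative C, not the actual GLSL shader; all names are made up): computing two output rows per pass lets each load of B serve both dot products, and the manually unrolled loop ends with a bounds-checked tail for the last iteration.

```c
#include <stddef.h>

// Illustrative only: two result rows computed together so each load of
// b[k] is reused for both dot products, roughly halving B traffic.
static void matvec_two_rows(const float *a0, const float *a1,
                            const float *b, size_t n,
                            float *out0, float *out1) {
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t k = 0;
    // Manually unrolled main loop: b[k], b[k+1] loaded once, used twice.
    for (; k + 1 < n; k += 2) {
        float b0 = b[k], b1 = b[k + 1];
        acc0 += a0[k] * b0 + a0[k + 1] * b1;
        acc1 += a1[k] * b0 + a1[k + 1] * b1;
    }
    // Bounds-checked tail for odd n (the "last iteration" case).
    if (k < n) {
        acc0 += a0[k] * b[k];
        acc1 += a1[k] * b[k];
    }
    *out0 = acc0;
    *out1 = acc1;
}
```

In the real shader the same structure applies per workgroup, with the shared addressing calculations hoisted out of the unrolled body.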

b4096

16 Nov 02:45
1e58ee1
ggml : optimize Q4_0 into Q4_0_X_Y repack (#10324)