on linux, link against Openblas Parallel (e.g. for fedora 40) #1077

Open
wants to merge 2 commits into
base: concedo


@AndLLA AndLLA commented Aug 19, 2024

On recent Linux distros (e.g. Fedora 40),
the parallel version of OpenBLAS has a "p" suffix ("-lopenblasp"),
so linking against "-lopenblas" always uses the serial version.

In addition,
the patch prints the exact flavour of OpenBLAS in use at runtime:

ggml_backend_blas_init: openblas_get_parallel 1 
ggml_backend_blas_init: openblas_get_config OpenBLAS 0.3.26 DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=128 
ggml_backend_blas_init: GGML_USE_OPENMP n/a 
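
The same two OpenBLAS calls can also be made outside the patch to check which flavour a system provides. The following is only a rough sketch, not part of the PR: it assumes a shared OpenBLAS library is discoverable through `ctypes.util.find_library`, and the helper name `query_openblas` is made up for illustration. `openblas_get_parallel` and `openblas_get_config` are the functions exported by OpenBLAS itself.

```python
import ctypes
import ctypes.util

# Values returned by openblas_get_parallel(), as defined in OpenBLAS's cblas.h.
PARALLEL_MODES = {0: "sequential", 1: "pthreads", 2: "OpenMP"}


def query_openblas(names=("openblasp", "openblas")):
    """Load the first OpenBLAS flavour found and report how it was built.

    Hypothetical helper: returns None when no candidate can be located or loaded.
    """
    for name in names:
        path = ctypes.util.find_library(name)
        if path is None:
            continue
        try:
            lib = ctypes.CDLL(path)
        except OSError:
            continue
        lib.openblas_get_parallel.restype = ctypes.c_int
        lib.openblas_get_config.restype = ctypes.c_char_p
        mode = lib.openblas_get_parallel()
        return {
            "library": path,
            "parallel": mode,
            "mode": PARALLEL_MODES.get(mode, "unknown"),
            "config": lib.openblas_get_config().decode(),
        }
    return None


if __name__ == "__main__":
    print(query_openblas() or "no OpenBLAS shared library found")
```

Trying "openblasp" before "openblas" mirrors the intent of the patch: prefer the threaded build when the distro ships it under a separate name.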

on recent linux distros (e.g. fedora 40) the multi threaded openblas library has a "p" suffix
@AndLLA AndLLA changed the title on linux, ink against Openblas Parallel (e.g. for fedora 40) on linux, link against Openblas Parallel (e.g. for fedora 40) Aug 19, 2024
@LostRuins
Owner

Is it actually faster?

@AndLLA
Author

AndLLA commented Aug 19, 2024

I ran several tests, and the BLAS stage seems to scale with NumCores/2.
For example, if a BLAS stage takes 3 min with "openblas+serial",
it takes 1 min with "openblas+parallel",
using 6 cores/threads and a batch size of 512,
with everything running on the CPU.
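
As a quick sanity check of the NumCores/2 observation, the figures above can be plugged into a speedup calculation. This is just illustrative arithmetic using the 3 min / 1 min / 6 threads example; the function name is made up.

```python
def speedup_and_efficiency(serial_seconds, parallel_seconds, threads):
    """Return (speedup, per-thread efficiency) for a timed stage."""
    speedup = serial_seconds / parallel_seconds
    return speedup, speedup / threads


# 3 min serial vs 1 min parallel on 6 threads, as in the example above.
speedup, efficiency = speedup_and_efficiency(180, 60, 6)
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
# prints "speedup 3.0x, efficiency 50%"
```

A 3x speedup on 6 threads is 50% efficiency, which matches the NumCores/2 scaling.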

On the other hand,
the speed increase for the inference stage is less noticeable (about 5%-10% faster).

Compared to "koboldcpp_default",
the BLAS stage using "openblas+parallel" is 10%-20% faster.

P.S.
openblas_set_num_threads is completely ignored by the "serial" OpenBLAS,
which always uses one thread.

@henk717

henk717 commented Sep 10, 2024

I looked at conda, since we'd be implementing this in the conda-based CI. The latest libopenblas package ships openblasp, and apparently the regular openblas .so is symlinked to it. Are you sure Fedora isn't doing the same thing? Our prebuilt binaries probably already use the parallel version.

It does mean we can probably drop-in replace the Windows .dll.

@AndLLA
Author

AndLLA commented Sep 10, 2024

Hello,
part of the patch introduces a log line
that reports the flavour of OpenBLAS.

For example, if the runtime uses the parallel flavour,
the output will be something like this:

ggml_backend_blas_init: openblas_get_parallel 1
ggml_backend_blas_init: openblas_get_config OpenBLAS 0.3.26 DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=128

If the runtime instead uses the non-parallel flavour,
the output will be something like this:

ggml_backend_blas_init: openblas_get_parallel 0
ggml_backend_blas_init: openblas_get_config OpenBLAS 0.3.26 DYNAMIC_ARCH NO_AFFINITY Haswell SINGLE_THREADED
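
If a CI job needed to assert on these lines, something like the following could classify the flavour from captured output. It is only a sketch: it assumes the log format shown above, and the function name is hypothetical.

```python
import re


def openblas_flavour_from_log(log_text):
    """Return "serial", "parallel", or None if the log line is absent."""
    match = re.search(r"openblas_get_parallel\s+(\d+)", log_text)
    if match is None:
        return None
    # 0 means sequential; any other value is a threaded build.
    return "serial" if int(match.group(1)) == 0 else "parallel"


log = "ggml_backend_blas_init: openblas_get_parallel 0"
print(openblas_flavour_from_log(log))  # prints "serial"
```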

On fc40 there is no symlink pointing to the parallel OpenBLAS by default.
Here is what I see on the file system (I re-installed the latest rpm to make sure):

-rwxr-xr-x. 1 root root 40779408 Feb  9  2024 libopenblasp-r0.3.26.so
lrwxrwxrwx. 1 root root       23 Feb  9  2024 libopenblasp.so -> libopenblasp-r0.3.26.so
lrwxrwxrwx. 1 root root       23 Feb  9  2024 libopenblasp.so.0 -> libopenblasp-r0.3.26.so

-rwxr-xr-x. 1 root root 39286328 Feb  9  2024 libopenblas-r0.3.26.so
lrwxrwxrwx. 1 root root       22 Feb  9  2024 libopenblas.so -> libopenblas-r0.3.26.so
lrwxrwxrwx. 1 root root       22 Feb  9  2024 libopenblas.so.0 -> libopenblas-r0.3.26.so
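
The symlink question can also be checked programmatically on any distro or conda environment. This is a minimal sketch; the helper name is made up, and the library directory passed in (for example /usr/lib64) is an assumption that varies between systems.

```python
import os


def resolves_to_parallel(libdir, link_name="libopenblas.so"):
    """True if link_name in libdir ultimately points at a libopenblasp file."""
    target = os.path.realpath(os.path.join(libdir, link_name))
    return os.path.basename(target).startswith("libopenblasp")


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever the distro installs OpenBLAS.
    print(resolves_to_parallel("/usr/lib64"))
```

With the listing above, this would report False for libopenblas.so and True for libopenblasp.so.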

I don't know about Windows, but on Linux they are "drop-in" replaceable :)

Thanks


3 participants