partr thread support for openblas #43984
-
We now have algorithms in DifferentialEquations.jl which use simultaneous implicit methods to enhance the parallelizability of small stiff ODEs and DAEs (i.e. <= 20 ODEs). For now we'll just document that the user should probably set the BLAS threads to 1, but once this PR is in, this algorithm can serve as a very good test case / showcase of why PARTR mixed into BLAS is useful.
-
This is a fairly straightforward project for someone who doesn't mind diving in and seeing how it was done in FFTW. I will certainly try it out if nobody gives it a shot in a few weeks.
-
In the long run, it would be good if partr had a documented C API for spawn/wait, which would give us a lot more flexibility in integrating it with external libraries like this.
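No such API exists in julia.h today, so purely as a hypothetical sketch, it might look something like the following (every name here, `jl_parallel_task_t`, `jl_partr_spawn`, `jl_partr_sync`, is invented for illustration):

```c
/* Hypothetical partr C API -- all names are invented for illustration;
   nothing like this currently exists in julia.h. */
typedef struct jl_parallel_task jl_parallel_task_t;

/* Enqueue fn(arg) as a task on Julia's partr scheduler and return a
   handle that can be waited on. */
jl_parallel_task_t *jl_partr_spawn(void (*fn)(void *arg), void *arg);

/* Block (cooperatively, yielding to the scheduler) until the given
   task has completed, then release its handle. */
void jl_partr_sync(jl_parallel_task_t *task);
```

With something of that shape, an external library could parallelize a loop by spawning one task per work item and syncing them all, without knowing anything else about Julia's runtime.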
-
Do you think this is something that will require changes to OpenBLAS upstream and/or compiling OpenBLAS with specific options? Just checking from a packager's perspective.
-
Yes, we will probably have to work with OpenBLAS upstream.
-
I'm also implementing the FFTW strategy of a pluggable threading backend for Blosc (Blosc/c-blosc2#81). I think we can make a strong argument to upstream developers that their libraries should use this kind of strategy where possible, because it allows easy composability not only with Julia's partr, but also with Intel's TBB and other threading schedulers. It also seems possible to do this with minimal patches in cases where they have already implemented their own threading.
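For concreteness, this is roughly what the FFTW hook looks like in use. The callback signature below follows the FFTW 3.3.9+ manual (worth double-checking against your fftw3.h); the serial backend is just an illustrative stand-in for a real partr or TBB backend:

```c
#include <fftw3.h>
#include <stddef.h>

/* A deliberately trivial "threading backend": runs every job serially.
   A real backend (partr, TBB, ...) would run the jobs as parallel tasks. */
static void serial_parallel_loop(void *(*work)(char *), char *jobdata,
                                 size_t elsize, int njobs, void *data)
{
    (void)data; /* unused closure pointer */
    for (int i = 0; i < njobs; ++i)
        work(jobdata + (size_t)i * elsize); /* job i's data record */
}

/* Route all of FFTW's internal parallelism through our callback
   instead of letting it spawn its own threads. */
void install_serial_backend(void)
{
    fftw_init_threads();
    fftw_threads_set_callback(serial_parallel_loop, NULL);
}
```

The appeal of this design is that the library stays oblivious to the host's scheduler: it only describes the jobs, and whoever registered the callback decides how to run them.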
-
I think it's attractive to implement this as a runtime option, in addition to the existing threading options rather than instead of them, as I did for FFTW and Blosc. That is, we add a single `threads_callback` hook that `exec_blas` checks before doing anything else:

```c
exec_blas(num, queue) {
    if (threads_callback) {
        // pass work to the callback function
        return;
    }
    // parallelize normally
}
```

This has three advantages:
-
Regarding the "other work" suggestion: I'm not sure why that work can't simply be added to the queue of parallel tasks, letting the runtime worry about load-balancing.
-
I posted a very early draft of the requisite changes at OpenMathLib/OpenBLAS#2255.
-
Actually, I thought of an even easier way to implement this.
-
Removing the milestone since this certainly wasn't release-blocking for 1.3, and it won't be for 1.4 or 1.x either.
-
I'm confused. I thought that now that we've switched to a time-based release schedule with 1.x releases, nothing is release-blocking, so shouldn't all the remaining issues be removed from the 1.4 milestone as well?
-
Friendly bump on this one. New AMD processors have a ton of threads, but I can't take much advantage of PARTR until it works nicely with OpenBLAS, since my loops all have various LAPACK calls in them (and I also have standalone LAPACK calls outside of loops that ought to still use all threads).
-
Increasingly, a lot of libraries in Yggdrasil (BinaryBuilder) are using OpenMP, and many of them call BLAS. I suspect we are increasingly going to see multi-threading clashes between Julia threads, pthreaded libraries (OpenBLAS), and OpenMP. The fewer of these we can use, the better! I also learned that if MKL enters the picture, it brings yet another threading library, TBB.
-
cc @kpamnany |
-
It was described to me that this thread pool is actually only relevant for a small number of LAPACK functions, so we could probably reimplement them in Julia better and faster than trying to integrate with the existing threading system in BLAS. @Keno, is that accurate?
-
I have a multi-threaded LU factorization and linear solve here: https://github.com/ViralBShah/HPL.jl/blob/master/src/hpl_shared.jl. Its performance is reasonable, and this may be the better way to do multi-threading.
-
https://github.com/YingboMa/RecursiveFactorization.jl is multi-threaded and already outperforms BLAS, both OpenBLAS and MKL. SciML has defaulted to it for over a year with great success. However, achieving that performance relied on using Polyester.jl, and thus opting out of composable multithreading. Adding a non-Polyester threads option to it and calling it a day would be a fitting end to the story, at least for the LU case.
-
OpenMathLib/OpenBLAS#4577 is a new PR to allow pluggable threading backends in OpenBLAS (TBB is the backend they have tried so far), which hopefully will make it easy to add partr support. It would be good for some Julia folks to take a look.
-
Here are some notes from digging into the openblas codebase (with @stevengj) to enable partr threading support.

- `exec_blas` is called by all the routines. The code pattern followed is setting up the work queue and calling `exec_blas` to do all the work; in the openmp backend this happens through an openmp pragma.
- Some routines also use the `exec_blas_async` functions.
- The easiest way may be to modify the openmp threading backend, which seems amenable to something like the fftw partr backend. To start with, we should ignore lapack threading.
- We could probably just implement an `exec_blas_async` fallback that calls `exec_blas` (and make `exec_blas_async_wait` a no-op); a sketch follows below.
- All of this should work on windows too, although going through the openmp build route may need some work on the makefiles.
- The patch to FFTW should be indicative of something similar to be done for the openblas build.
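A minimal sketch of that fallback, assuming the signatures and the `next`-linked `blas_queue_t` from OpenBLAS's driver/others/blas_server.c (assumptions to verify against the actual sources):

```c
/* Sketch only: the signatures and the queue's `next` field are
   assumptions based on a reading of blas_server.c. */
int exec_blas_async(BLASLONG pos, blas_queue_t *queue)
{
    (void)pos; /* starting position is irrelevant when running inline */

    /* Count the queued work items, then run them synchronously. */
    BLASLONG num = 0;
    for (blas_queue_t *q = queue; q != NULL; q = q->next)
        num++;
    return exec_blas(num, queue);
}

int exec_blas_async_wait(BLASLONG num, blas_queue_t *queue)
{
    /* No-op: exec_blas_async above already completed all the work. */
    (void)num;
    (void)queue;
    return 0;
}
```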