Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread callback for OpenMP backend #4770

Open
jeremiedbb opened this issue Jun 28, 2024 · 9 comments · Fixed by #4775
Open

Thread callback for OpenMP backend #4770

jeremiedbb opened this issue Jun 28, 2024 · 9 comments · Fixed by #4775
Milestone

Comments

@jeremiedbb
Copy link

Hi,

I'm trying to leverage #4577 in a project (scikit-learn) that has a mix of OpenMP and OpenBLAS built with the pthreads threading layer to make OpenBLAS use the OpenMP threadpool. Ideally we'd use an OpenBLAS built with the OpenMP threading layer but it's outside of our control because it comes from a dependency.

I naively tried the example callback presented in #4577 (comment), but I can't make it work. Here's a simple reproducer, just a gemm:

test.c

# include <stdio.h>
# include <stdlib.h>
# include <stddef.h>
# include <omp.h>
# include <cblas.h>


void omp_cb (int sync, openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void *jobdata, int dojob_data)
{
    #pragma omp parallel for
    for(int i = 0; i < numjobs; i++)
    {
        printf("thread: %d, i: %d\n", omp_get_thread_num(), i);
        void *element_adrr = (void *) (((char *)jobdata) + ((unsigned) i)*jobdata_elsize);
        dojob(i, element_adrr, dojob_data);
    }
    return;
}


void test()
{
    int n = 100;

    double *A = (double *)malloc(n * n * sizeof(double));
    double *B = (double *)malloc(n * n * sizeof(double));
    double *C = (double *)malloc(n * n * sizeof(double));

    for(int i = 0; i < n * n; i++)
    {
        A[i] = 1.0;
        B[i] = 1.0;
        C[i] = 0.0;
    }

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    free(A);
    free(B);
    free(C);
}


int main()
{
    openblas_set_threads_callback_function(omp_cb);
    test();
}

Compile command:

gcc -o test test.c -fopenmp -I/home/jeremie/R/installs/OpenBLAS/include -Wl,-rpath,/home/jeremie/R/installs/OpenBLAS/lib -L/home/jeremie/R/installs/OpenBLAS/lib -lopenblas

It just results in a segfault at the first step of the loop. Note that it still segfaults if I remove the omp pragma and just use a sequential loop in the callback.

Any help would be greatly appreciated.

@martin-frbg
Copy link
Collaborator

This is hopefully fixed by the one-liner in #4775 (though I still need to track down the source of the spurious queue entry)

@martin-frbg martin-frbg added this to the 0.3.28 milestone Jun 30, 2024
@jeremiedbb
Copy link
Author

Thanks. I just tested the fix but now the program hangs at the first step of the callback loop.

Here's the backtrace:

#0  0x00007ffff739c996 in inner_thread () from /home/jeremie/R/installs/OpenBLAS/lib/libopenblas.so.0
#1  0x00007ffff74ce11b in exec_threads () from /home/jeremie/R/installs/OpenBLAS/lib/libopenblas.so.0
#2  0x000055555555523b in omp_cb ()
#3  0x00007ffff74ce87f in exec_blas () from /home/jeremie/R/installs/OpenBLAS/lib/libopenblas.so.0
#4  0x00007ffff739cfc3 in gemm_driver.isra () from /home/jeremie/R/installs/OpenBLAS/lib/libopenblas.so.0
#5  0x00007ffff739d11f in dgemm_thread_nn () from /home/jeremie/R/installs/OpenBLAS/lib/libopenblas.so.0
#6  0x00007ffff72aaa97 in cblas_dgemm () from /home/jeremie/R/installs/OpenBLAS/lib/libopenblas.so.0
#7  0x0000555555555363 in test ()
#8  0x00005555555553af in main ()

And some additional info

(gdb) info thread
  Id   Target Id                                  Frame 
* 1    Thread 0x7ffff709b740 (LWP 1296888) "test" 0x00007ffff739c996 in inner_thread () from /home/jeremie/R/installs/OpenBLAS/lib/libopenblas.so.0
  2    Thread 0x7fff769ff640 (LWP 1296891) "test" 0x00007ffff74ce1ff in blas_thread_server () from /home/jeremie/R/installs/OpenBLAS/lib/libopenblas.so.0
  3    Thread 0x7fff761fe640 (LWP 1296892) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  4    Thread 0x7fff759fd640 (LWP 1296893) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  5    Thread 0x7fff751fc640 (LWP 1296894) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  6    Thread 0x7fff749fb640 (LWP 1296895) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  7    Thread 0x7fff741fa640 (LWP 1296896) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  8    Thread 0x7fff739f9640 (LWP 1296897) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  9    Thread 0x7fff731f8640 (LWP 1296898) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  10   Thread 0x7fff729f7640 (LWP 1296899) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  11   Thread 0x7fff721f6640 (LWP 1296900) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  12   Thread 0x7fff719f5640 (LWP 1296901) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  13   Thread 0x7fff711f4640 (LWP 1296902) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  14   Thread 0x7fff709f3640 (LWP 1296903) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  15   Thread 0x7fff701f2640 (LWP 1296904) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
  16   Thread 0x7fff6f9f1640 (LWP 1296905) "test" 0x00007ffff6e91117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

(Side question: Is it expected that even when setting a sequential callback, OpenBLAS still creates that many threads ?)

@martin-frbg
Copy link
Collaborator

Strange, I ran several tests with and without valgrind and did not encounter any hang yesterday. Is this still with the unmodified testcase that you posted above ? I guess there is a small chance that this may be a "generic" lockup from combining OpenMP and non-OpenMP threading which don't know of each other's existence. (And my understanding is that in this iteration of the user-defined threading callback, the OpenBLAS mechanisms for obtaining the initial thread count are unchanged. @shivammonaka can you comment on this issue ?

@jeremiedbb
Copy link
Author

Is this still with the unmodified testcase that you posted above ?

Yes with the exact same reproducer, with a sequential loop or with an OpenMP based parallel loop. In addition, it works if I set OPENBLAS_NUM_THREADS=1 but hangs for any value > 1.

@martin-frbg
Copy link
Collaborator

I cannot reproduce the problem (on x86_64 hardware of various sizes up to 64 cores and with multiple compiler versions) with my PR in place. Did you do a full rebuild (i.e. make clean before themake) after applying the patch ? (And are you sure your testcase is picking up the "fixed" version of the library ?)

@jeremiedbb
Copy link
Author

Sorry, there was a slight difference with the original reproducer. I explicitely set the number of openmp threads for the for loop:

void omp_cb (int sync, openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void *jobdata, int dojob_data)
{
    #pragma omp parallel for num_threads(4)   // <-- here
    for(int i = 0; i < numjobs; i++)
    {
        printf("thread: %d, i: %d\n", omp_get_thread_num(), i);
        void *element_adrr = (void *) (((char *)jobdata) + ((unsigned) i)*jobdata_elsize);
        dojob(i, element_adrr, dojob_data);
    }
    return;
}

Then if I set OPENBLAS_NUM_THREADS to any value less or equal to the number of openmp threads it works. If I set it to any value larger, it hangs.

@shivammonaka
Copy link
Contributor

shivammonaka commented Jul 2, 2024

Hi @martin-frbg, @jeremiedbb , Sorry for the late reply. I was preoccupied with some other stuff. Let me explain my understanding of this error

Question : Why does this happens : -I set OPENBLAS_NUM_THREADS to any value less or equal to the number of openmp threads it works. If I set it to any value larger, it hangs.

Answer :

  • Currently its because of how OpenBLAS internally handles threading incase of any backend.
  • Every threading backend internally calls inner_threads(thread_id) function call with thread_id as a argument. So, The number of threads required by each BLAS call is pre-decided even before threading was initiated (Inside gemm_driver() in level3_thread.c)
  • Why we cannot use less threads than what OpenBLAS has decided? Its because the current algorithm heavily relies on the work of others sibling threads before moving to next part of the algorithm. So, if you use less number of threads then it gets stuck in deadlock.

I am currently working on fixing this thing only. I want to architecturally change the execution of the algorithm so that

  1. We can remove the dependency of NUM_PARALLEL and have a truly parallel and composable solution inside OpenBLAS that allows maximum BLAS calls to be executed parallelly.
  2. Allow maximum core utilization when using lower level matrix multiplication
  3. Allow execution of a BLAS with any number of threads that the user may want or core availability

I have a solution ready for OpenMP backend but currently its in testing phase. I will try to get it out ASAP.

@martin-frbg
Copy link
Collaborator

Seems to be related to requesting more OMP threads than there are iterations in the for loop, possibly the two independent threading models each having an idle thread and both waiting for each other to make the first move...

@shivammonaka
Copy link
Contributor

shivammonaka commented Jul 2, 2024

It actually requires Number of OMP Threads on caller side >= Number of iteration in the for loop. BLAS wants each iteration of the loop to be executed by a different thread otherwise it goes into infinite waiting. Sadly, That's how inner_thread() currently works. I am sorry but I am not sure what the second line meant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants