
OpenMP is not working #41

Open
wittscien opened this issue Nov 18, 2022 · 8 comments

Comments

@wittscien

I posted an earlier version of this problem on the QUDA page: a lot of time was wasted when calculating the propagators, and the run time barely changed no matter how I set OMP_NUM_THREADS. This is because OpenMP is not actually working at some stage. @SaltyChiang pointed out on the QUDA page that the CMakeLists.txt in the devel branch of qdpxx does not actually set QDP_USE_OMP_THREADS. I think this could be fixed in a later version.

I am using the latest versions, QMP 2-5-4, QDP++ 1-46-0, QUDA 1.1.0, and Chroma 3-44-0, each checked out to its development branch and all built with CMake, with cc=mpicc, the -fopenmp flag, and -DQDP_USE_OPENMP=ON, -DQUDA_OPENMP=ON, -DChroma_ENABLE_OPENMP=ON. The log of a typical propagator calculation shows a very low invertQuda / initQuda-endQuda time ratio. When I watch the process with top, I can clearly see that the Chroma program uses only one thread.

However, after I modified the CMakeLists.txt of qdpxx, the Chroma program now prints "QDP use OpenMP threading. We have x threads" as expected (it did not before the change), yet the program still uses only one thread. What else could be going wrong here? I have confirmed with a simple C++ program that OpenMP works on the cluster.
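A simple C++ check of the kind mentioned above could look like the sketch below. It is only an illustration, not the program actually used in this thread, and the function names compiled_with_openmp, threads_in_parallel_region and report_openmp are hypothetical. It relies on the standard _OPENMP macro, which the compiler defines only when a file is compiled with OpenMP enabled (e.g. with -fopenmp), so it separates "OpenMP works on the cluster" from "this particular build was compiled without OpenMP", the kind of build problem @SaltyChiang pointed out for qdpxx.

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// True only if this translation unit was compiled with OpenMP enabled
// (the compiler defines _OPENMP when given e.g. -fopenmp).
bool compiled_with_openmp() {
#ifdef _OPENMP
  return true;
#else
  return false;
#endif
}

// Number of threads a parallel region actually runs with.
// Without OpenMP the region does not exist and the answer is always 1.
int threads_in_parallel_region() {
  int n = 1;  // shared by default inside the parallel region
#ifdef _OPENMP
#pragma omp parallel
  {
#pragma omp single
    n = omp_get_num_threads();  // one thread records the team size
  }
#endif
  return n;
}

// Print a one-line report; intended to be called from main().
void report_openmp() {
  std::printf("compiled with OpenMP: %s, threads in parallel region: %d\n",
              compiled_with_openmp() ? "yes" : "no",
              threads_in_parallel_region());
}
```

Calling report_openmp() from a two-line main(), building with g++ -fopenmp and running with OMP_NUM_THREADS=4 should report "yes" and 4 threads; building the same file without -fopenmp reports "no" and 1.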

@fwinter
Contributor

fwinter commented Nov 18, 2022

Hi Haobo. Sorry to hear you're experiencing performance problems. I can't comment much on threading in qdpxx, but I know most users have switched to qdp-jit, the more performant QDP++ implementation (https://github.com/JeffersonLab/qdp-jit, devel branch). Is there a reason you're using qdpxx instead of qdp-jit? I understand you're computing propagators with QUDA; this should work fine with qdp-jit. In case you're interested, I have a simple build package, and it should include a version for building qdp-jit/chroma/quda.

https://github.com/fwinter/package

For a CUDA/QUDA build, for instance, you could run:
./download_sources.sh
./download_sources_quda.sh
cd cuda-jit-quda/
./build_all.sh

EDIT: Before building everything, check that build_quda.sh sets the correct 'sm_xx' version for your GPU.
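The 'sm_xx' value corresponds to the GPU's CUDA compute capability written without the dot, e.g. compute capability 8.0 gives sm_80. The sketch below shows only that mapping; the helper name sm_arch is hypothetical, the major/minor numbers are assumed to come from the CUDA runtime's cudaGetDeviceProperties (fields cudaDeviceProp::major and cudaDeviceProp::minor) or from NVIDIA's compute-capability table, and the exact form build_quda.sh expects is not shown here.

```cpp
#include <string>

// Hypothetical helper: map a CUDA compute capability (major.minor) to the
// "sm_XY" architecture string, e.g. 8.0 -> "sm_80", 7.5 -> "sm_75".
// In a real program, major/minor would be read from cudaGetDeviceProperties()
// (cudaDeviceProp::major / cudaDeviceProp::minor); they are plain ints here
// so the mapping can be shown without a GPU.
std::string sm_arch(int major, int minor) {
  return "sm_" + std::to_string(major) + std::to_string(minor);
}
```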

@wittscien
Author

Thank you, Frank. I don't have a specific preference. Yes, I use QUDA to calculate propagators, and I also use Chroma for some contractions. I will try building with qdp-jit tomorrow. Will this solve the OpenMP problem? Thanks for sharing the build scripts! Also, why do they set -DQUDA_OPENMP=OFF (and not turn OpenMP on for Chroma)?

@fwinter
Contributor

fwinter commented Nov 18, 2022

I believe so: qdp-jit doesn't use CPU multithreading; it uses the GPU for parallelization.
Good point. I suppose this switch could be turned on.

@wittscien
Author

Thank you! I'll let you know once I've tried the jit version.

@wittscien
Author

Sorry, it took me some time to build the required packages. I tried to build qdp-jit with LLVM 13.0.0 and GCC 12.2.0 (using the C++20 standard) and hit the following error, which repeats many times:
In file included from /.../gcc/12.2/include/c++/12.2.0/bits/unique_ptr.h:36,
                 from /.../gcc/12.2/include/c++/12.2.0/memory:76,
                 from /.../source/qdp-jit/lib/../include/qdp.h:77:
/.../gcc/12.2/include/c++/12.2.0/tuple:1595:45: note: declaration of ‘struct std::array<QDP::PScalarJIT<QDP::RScalarJIT<QDP::WordJIT<float> > >, 1>’
 1595 |   template<typename _Tp, size_t _Nm> struct array;
      |                                             ^~~~~
/.../source/qdp-jit/lib/../include/qdp_basejit.h: In instantiation of ‘class QDP::BaseJIT<QDP::RScalarJIT<QDP::WordJIT<float> >, 1>’:
/.../source/qdp-jit/lib/../include/qdp_primscalarjit.h:24:27:   required from ‘class QDP::PScalarJIT<QDP::RScalarJIT<QDP::WordJIT<float> > >’
/.../source/qdp-jit/lib/../include/qdp_basejit.h:18:76:   required from ‘class QDP::BaseJIT<QDP::PScalarJIT<QDP::RScalarJIT<QDP::WordJIT<float> > >, 1>’
/.../source/qdp-jit/lib/../include/qdp_primscalarjit.h:24:27:   required from ‘class QDP::PScalarJIT<QDP::PScalarJIT<QDP::RScalarJIT<QDP::WordJIT<float> > > >’
/.../source/qdp-jit/lib/../include/qdp_viewleaf.h:35:28:   required from here
/.../source/qdp-jit/lib/../include/qdp_basejit.h:9:21: error: ‘QDP::BaseJIT<T, N>::F’ has incomplete type
    9 |     std::array<T,N> F;
What did I do wrong? Thanks in advance!
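The error says std::array is only declared, not defined, at the point where qdp_basejit.h uses it as a data member: the note points at a forward declaration inside libstdc++'s tuple header. A frequent cause of this pattern with newer GCC releases is a header that relies on <array> being pulled in transitively by another standard header instead of including it directly. The sketch below illustrates that idea with a hypothetical FixedStore template standing in for QDP::BaseJIT; it is an assumption about the cause, not the actual change committed to qdp-jit.

```cpp
// Include <array> directly: a forward declaration of std::array (such as the
// one in libstdc++'s <tuple>) is not enough for a data member of that type.
#include <array>
#include <cstddef>

// Hypothetical stand-in for a class like QDP::BaseJIT<T, N>: it stores a
// std::array<T, N> by value, which requires the complete definition of
// std::array at the point where the class template is instantiated.
template <typename T, std::size_t N>
struct FixedStore {
  std::array<T, N> F{};  // value-initialized, so arithmetic T starts at zero

  std::size_t size() const { return F.size(); }
};
```

Whether some other standard header happens to make std::array complete is an implementation detail that can change between GCC versions, which would be consistent with the gcc11 versus gcc12 difference reported in this thread.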

@fwinter
Contributor

fwinter commented Nov 24, 2022

I could reproduce this with gcc12; before, I had been using gcc11. I'll fix this and let you know.
EDIT: On another note, I found that CUDA is somewhat restrictive about which GCC versions it supports. To build QUDA with gcc11, I had to install the very latest CUDA (v11.8). If you have gcc11 at hand, it might be worth a shot. But I will certainly fix the gcc12 issue.

@wittscien
Author

Thanks for the information. I only have CUDA 11.1 and gcc 12.2 available; I can ask the admin to install other versions later and try building again. (The gcc I built from source in my own directory does not work.)

@fwinter
Contributor

fwinter commented Nov 25, 2022

I committed changes to qdp-jit for gcc12. Your CUDA version (11.1) might cause trouble when building QUDA; my guess is that you need the latest version.
