
I was recently (26.06.2013) granted access to a Xeon Phi machine at CSC in Finland. This page documents my efforts to get the code running in the various modes, together with some benchmarks.

General Remarks

Compilation

One first needs to adjust configure.in to allow 64-byte alignment to be specified. Also, I was unable to make the zheev test work, so I simply disabled it.

diff --git a/configure.in b/configure.in                                                                                                                     
index 3129fdb..0cd2a19 100644
--- a/configure.in
+++ b/configure.in
@@ -318,11 +318,11 @@ fi

 dnl Checks for lapack and defines proper name mangling scheme for
 dnl linking with f77 code
-AC_F77_FUNC(zheev)
-if test "$zheev" = "zheev"; then
-  AC_DEFINE(NOF77_,1,Fortran has no extra _)
-fi
-AC_SEARCH_LIBS([$zheev],[lapack], [], [AC_MSG_ERROR([Cannot find lapack])])
+#AC_F77_FUNC(zheev)
+#if test "$zheev" = "zheev"; then
+#  AC_DEFINE(NOF77_,1,Fortran has no extra _)
+#fi
+#AC_SEARCH_LIBS([$zheev],[lapack], [], [AC_MSG_ERROR([Cannot find lapack])])

 dnl Checks for header files.
 AC_HEADER_STDC
@@ -379,6 +379,10 @@ elif test $withalign = 32; then
   AC_MSG_RESULT(32 bytes)
   AC_DEFINE(ALIGN_BASE, 0x1F, [Align base])
   AC_DEFINE(ALIGN, [__attribute__ ((aligned (32)))])
+elif test $withalign = 64; then
+  AC_MSG_RESULT(64 bytes)
+  AC_DEFINE(ALIGN_BASE, 0x3F, [Align base])
+  AC_DEFINE(ALIGN, [__attribute__ ((aligned (64)))])
 elif test $withalign = auto; then
   withautoalign=1
   AC_MSG_RESULT(auto)
@@ -386,7 +390,7 @@ elif test $withalign = auto; then
   AC_DEFINE(ALIGN, [], [])
 else
   AC_MSG_RESULT(Unusable value for array alignment)
-  AC_MSG_ERROR([Allowed values are: auto, none, 16, 32])
+  AC_MSG_ERROR([Allowed values are: auto, none, 16, 32, 64])
 fi

 dnl in the following we check for extra options

Random Number Generator

The fact that RANLUX is a serial code can really be felt on the Xeon Phi. This will only get worse with increasing thread counts, and we need to implement (or adopt) and test a thread-safe RNG at some point.
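
As an illustration only, one thread-safe pattern is to give every OpenMP thread its own, independently seeded generator state, so that no locking is needed. The sketch below uses a simple xorshift64* generator and invented names (thread_rng_init, thread_rng_double); it illustrates the threading pattern, not a replacement of RANLUX quality.

/* Sketch: one RNG state per thread, no locking required.
 * The state is padded to a full cache line to avoid false sharing. */
#include <omp.h>
#include <stdint.h>

typedef struct { uint64_t s; char pad[56]; } thread_rng_t;

#define MAX_THREADS 240
static thread_rng_t rng_state[MAX_THREADS];

void thread_rng_init(uint64_t seed) {
  /* the xorshift state must never become zero */
  for (int i = 0; i < MAX_THREADS; ++i)
    rng_state[i].s = seed ^ (0x9E3779B97F4A7C15ULL * (uint64_t)(i + 1));
}

/* uniform double in [0,1) from the calling thread's own state (xorshift64*) */
double thread_rng_double(void) {
  thread_rng_t *r = &rng_state[omp_get_thread_num()];
  uint64_t x = r->s;
  x ^= x >> 12; x ^= x << 25; x ^= x >> 27;
  r->s = x;
  return (double)((x * 0x2545F4914F6CDD1DULL) >> 11) / 9007199254740992.0;
}

One would still have to think about reproducibility independent of the thread count, which the serial RANLUX stream gives us for free.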

Pure OpenMP native mode

Compilation

In Finland, this seems to build a native MIC application with OpenMP:

../configure --enable-omp --disable-mpi --enable-gaugecopy --disable-halfspinor \
  --enable-alignment=64 --with-limedir=/home/users/bartek/local_mic_native \
  --with-lapack="-L/share/apps/intel/mkl/lib/mic -lmkl_lapack95_lp64 -lmkl_blas95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lifcore" \
  F77=ifort LIBS="-L/share/apps/intel/mkl/lib/mic -lifcore" \
  CC=/share/apps/intel/composer_xe_2013.3.163/bin/intel64_mic/icc CFLAGS=-mmic LDFLAGS="-mmic"

Note that the build system adds further CFLAGS for OpenMP and so on.

Running

In interactive mode, this seems to run a job:

PHI_KMP_AFFINITY="verbose,compact" PHI_KMP_PLACE_THREADS=60c,3t PHI_OMP_NUM_THREADS=180  MIC_PPN=1 MIC_OMP_NUM_THREADS=180 srun -v -N 1 /home/users/bartek/code/tmLQCD/build_mic_native_openmp/benchmark

The maximum number of threads is 240, and for some applications fewer threads (e.g., 180) are recommended. In our case the maximum number of threads gives the highest performance, at least for the hopping matrix.

Performance

Performance is absolutely abysmal:

# Instructing OpenMP to use 240 threads.
# The code was compiled with -D_GAUGE_COPY

# The number of processes is 1 
# The lattice size is 48 x 24 x 24 x 24
# The local lattice size is 48 x 24 x 24 x 24
# benchmarking the even/odd preconditioned Dirac operator
# The lattice is correctly mapped by the index arrays

# The following result is just to make sure that the calculation is not optimized away: 8.533125e+02
# Total compute time 5.241630e-02 sec, variance of the time -nan sec. (2048 iterations).
# Communication switched on:
# (30677 Mflops [64 bit arithmetic])
# Mflops per OpenMP thread ~ 127

but this is probably due to a lack of autovectorization...

A kernel written with the 512-bit vector intrinsics would probably make this substantially faster (note that fused multiply-add is available!). One also has to keep in mind that the SIMD unit requires 64-byte alignment! Decreasing the number of threads decreases total performance, so running at the maximum of 240 seems to be optimal, at least for this memory load. The theoretical maximum when the working set does not fit into L2 cache (which is essentially impossible in pure OpenMP mode, since the L2 totals only 30 MB, i.e. 512 KB per core) should be less than 200 GFlop/s.
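
To make the point about intrinsics and alignment concrete, here is a minimal sketch of a vectorized fragment using the _mm512_* intrinsics that icc exposes for the MIC vector unit. This is just an axpy-style loop, not the hopping matrix; the function name axpy_mic is invented, n is assumed to be a multiple of 8, and x and y are assumed to be 64-byte aligned.

/* Sketch: y += a*x with 8 doubles per 512-bit vector.
 * Built with icc -mmic; the aligned loads/stores require 64-byte alignment. */
#include <immintrin.h>

void axpy_mic(double a, const double *x, double *y, long n) {
  __m512d va = _mm512_set1_pd(a);
  for (long i = 0; i < n; i += 8) {
    __m512d vx = _mm512_load_pd(&x[i]);   /* aligned 64-byte load  */
    __m512d vy = _mm512_load_pd(&y[i]);
    vy = _mm512_fmadd_pd(va, vx, vy);     /* fused multiply-add    */
    _mm512_store_pd(&y[i], vy);           /* aligned 64-byte store */
  }
}

With 8 doubles per register and a fused multiply-add per iteration, this is the kind of structure the hopping matrix kernel would have to be rewritten into.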

When I reduce the lattice size to 12^4 so that the working set fits into L2, performance does not go up, indicating either that we have massive overhead somewhere, or that we are simply not vectorizing and this is the performance without the possible 8x speedup from SIMD.

Hybrid OpenMP+MPI

Compilation

$ module load intel mic-impi

../configure --enable-omp --enable-mpi --with-mpidimension=1 --enable-gaugecopy --enable-halfspinor \
  --enable-alignment=64 --with-lemondir=/home/users/bartek/local_mic_native \
  --with-limedir=/home/users/bartek/local_mic_native \
  --with-lapack="-L/share/apps/intel/mkl/lib/mic -lmkl_lapack95_lp64 -lmkl_blas95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lifcore" \
  F77=ifort LIBS="-L/share/apps/intel/mkl/lib/mic -lifcore" CC="mpiicc" CFLAGS="-mmic -openmp" \
  LDFLAGS="-mmic"

Running

export PHI_KMP_AFFINITY=compact
export PHI_KMP_PLACE_THREADS=60c,4t
export PHI_OMP_NUM_THREADS=240
export MIC_PPN=1 
export MIC_OMP_NUM_THREADS=${PHI_OMP_NUM_THREADS}

After defining these environment variables, the following is used to run a job with two processes on two MIC-equipped nodes:

$ module load intel mic-impi
$ srun -N 2 mpirun-mic -m ./benchmark

Performance

This is from the MPI_thread_overlap_ITC branch of git://github.com/kostrzewa/tmLQCD, which implements threaded overlapping of communication and computation. With a large enough body (interior volume), and given the slowness of the unvectorized code, communication is masked very effectively.
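
For orientation, the pattern is roughly the following sketch (with invented names and placeholder types, not the code from the branch): the master thread completes the non-blocking halo exchange while the other threads work on the body, and the boundary is only touched once the halo has arrived.

/* Sketch of the threaded overlap pattern. Requires at least
 * MPI_THREAD_FUNNELED, since only the master thread calls MPI
 * from inside the parallel region. */
#include <mpi.h>
#include <omp.h>

typedef struct { double c[24]; } spinor;                /* placeholder type */

extern void start_halo_exchange(const spinor *in, MPI_Request req[4]);
extern void site_kernel(spinor *out, const spinor *in, int ix);
extern int n_interior, n_boundary;
extern int *interior_idx, *boundary_idx;

void apply_D_overlapped(spinor *out, const spinor *in) {
  MPI_Request req[4];
  start_halo_exchange(in, req);                         /* MPI_Isend / MPI_Irecv */

  #pragma omp parallel
  {
    /* master waits for the halo while the other threads start on the body */
    #pragma omp master
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    #pragma omp for schedule(static) nowait             /* interior needs no halo */
    for (int ix = 0; ix < n_interior; ++ix)
      site_kernel(out, in, interior_idx[ix]);

    #pragma omp barrier  /* halo has arrived once the master reaches this point */

    #pragma omp for schedule(static)                    /* now do the boundary */
    for (int ix = 0; ix < n_boundary; ++ix)
      site_kernel(out, in, boundary_idx[ix]);
  }
}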

[bartek@master:~/code/tmLQCD/build_mic_native_hybrid_hs ] srun -N 2 mpirun-mic -m ./benchmark
# Instructing OpenMP to use 240 threads.
# Creating the following cartesian grid for a 1 dimensional parallelisation:
# 2 x 1 x 1 x 1
# The code was compiled with -D_GAUGE_COPY
# The code was compiled with -D_USE_HALFSPINOR
# The code was compiled for non-blocking MPI calls (spinor and gauge)

# The number of processes is 2 
# The lattice size is 48 x 24 x 24 x 24
# The local lattice size is 24 x 24 x 24 x 24
# benchmarking the even/odd preconditioned Dirac operator
# Initialising rectangular gauge action stuff
# The lattice is correctly mapped by the index arrays

# The following result is just to make sure that the calculation is not optimized away: 8.727883e+02
# Total compute time 5.908793e-02 sec, variance of the time 4.525540e-07 sec. (4096 iterations).
# Communication switched on:
# (27213 Mflops [64 bit arithmetic])
# Mflops per OpenMP thread ~ 113

# The following result is printed just to make sure that the calculation is not optimized away: 8.084319e+02
# Communication switched off: 
# (27932 Mflops [64 bit arithmetic])
# Mflops per OpenMP thread ~ 116

# The size of the package is 1327104 bytes.
# The bandwidth is 5015.29 + 5015.29 MB/sec
# Performing parallel IO test ...
# Constructing LEMON writer for file conf.test for append = 0
# Time spent writing 382 Mb was 15.3 s.
# Writing speed: 25.0 Mb/s (12.5 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.test:
#   Calculated            : A = 0x95f6fdad B = 0x9b58ac11.
# done ...