
System Configuration

As far as I understand, the thin nodes run the Sandy Bridge-EP Intel Xeon E5-2680 8C processor, which supports AVX but not the fused multiply-add (FMA) instructions. The fat nodes don't support AVX, only SSE4. The login nodes also seem to be E5-2680, and indeed I get illegal instruction errors when using FMA instructions.
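
To check at run time which of these instruction sets a given node actually supports, one can query CPUID directly. Below is a minimal sketch using GCC's cpuid.h (the bit_AVX and bit_FMA masks come with that header); compile it on the node in question and run it there.

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1 reports the AVX and FMA feature bits in ecx */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;

    printf("AVX: %s\n", (ecx & bit_AVX) ? "yes" : "no");
    printf("FMA: %s\n", (ecx & bit_FMA) ? "yes" : "no");
    return 0;
}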

Configuring the tmLQCD code

I configured the latest bgq_omp branch of my tmLQCD fork with the following options, after module load mkl, to build a mixed MPI + OpenMP executable. I obtain the best results when using the Intel MPI library

module unload mpi.ibm
module load mpi.intel/4.0

and then configure with

configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" --disable-sse2  --with-alignment=32  --disable-sse3 --with-lapack="$MKL_LIB" --disable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -xAVX -openmp" F77="ifort"

Sample Jobfile

Here is an example LoadLeveler job file for IBM MPI:

#!/bin/bash
#
#@ job_type = parallel
#@ class = large
#@ node = 8
#@ total_tasks= 128
#@ island_count=1
### other example
##@ tasks_per_node = 16
##@ island_count = 1,18
#@ wall_clock_limit = 0:15:00
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/cu/head/testrun/
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=you@there.com
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=2
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 128 ./benchmark

For the Intel MPI library use the following (note the different number of tasks per node and hence the different number of OpenMP threads!):

#!/bin/bash
#@ job_type=MPICH
#@ network.MPI=sn_all,not_shared,us
#@ class = large
#@ job_name=mytest
#@ output =     $(job_name).$(jobid).out
#@ error =      $(job_name).$(jobid).err
#@ node_topology = island
#@ wall_clock_limit = 0:15:00
#@ node=16
#@ tasks_per_node=2
#@ initialdir = $(HOME)/head/testrun/
#@ notification=always
#@ notify_user=me@whereever.com
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
module load prace
module load mkl
module unload mpi.ibm
module load mpi.intel/4.0

### OpenMP variables
export OMP_NUM_THREADS=8
export OMP_DYNAMIC=FALSE
export SPINLOOPTIME=5000
export YIELDLOOPTIME=5000
export OMP_SCHEDULE=STATIC

## Intel MPI variables
export I_MPI_FABRICS=shm:ofa
export I_MPI_FALLBACK=0
export I_MPI_JOB_FAST_STARTUP=enable

mpiexec -n 32 /home/hpc/pr86ve/pr3da094/head/testrun/benchmark

Theoretical expectations

It is expected that pure MPI will perform best on this machine, but it is worth trying a 2x8 hybrid approach with 2 processes per node and 8 threads per process, as each node has two processors with 8 cores each. It might even be beneficial to run 16 threads per process, because the E5-2680 supports two hardware threads per core.
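
To illustrate, here is a minimal hybrid MPI + OpenMP skeleton for such a 2x8 layout (2 MPI tasks per node, 8 threads each, taken from OMP_NUM_THREADS); it is a generic sketch, not tmLQCD code.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED suffices as long as only the master thread calls MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

#pragma omp parallel
    {
#pragma omp master
        printf("rank %d running %d OpenMP threads\n",
               rank, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}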

From the Zeuthen cluster, which has older CPUs at similar clock rates, it is expected that this machine should provide around 3.5 GFlops per core with communication and around 5 GFlops per core without. It is also expected that the halfspinor version of the code will underperform node-locally but outperform with many MPI processes and non-node-local communication; in a pure MPI approach the halfspinor version should always be faster. (These values are for problems fitting into the cache.)

These CPUs have a large cache of 20 MB, so a node-local volume of up to 24x12^3 should not be a problem at all, giving OpenMP a lot of time to manage threads, as the loops are quite large in such a configuration.
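
As a rough sanity check on such cache estimates, the following sketch computes the per-process memory footprint of the gauge and spinor fields for a given local volume. It assumes double precision, 18 doubles per SU(3) link and 24 doubles per Dirac spinor site; these are the standard field sizes, not numbers read off tmLQCD's allocator.

#include <stdio.h>

int main(void)
{
    const long t = 24, x = 12, y = 12, z = 12;  /* local lattice extents */
    const long sites = t * x * y * z;

    /* 4 links of 18 doubles per site; one spinor of 24 doubles per site */
    const double gauge_mb  = sites * 4 * 18 * 8 / 1048576.0;
    const double spinor_mb = sites * 24 * 8 / 1048576.0;

    printf("local sites: %ld\n", sites);
    printf("gauge field: %.1f MB, one spinor field: %.1f MB\n",
           gauge_mb, spinor_mb);
    return 0;
}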

Performance Measures

The performance I got from a benchmark run with 128 tasks with 2 threads each on a 24^3x48 lattice (local lattice size 24x6x6x6) is given below, after a few comments:

  1. A comment on the local lattice size: the CPU has a 20 MB L3 cache and you're running 8 processes per CPU if I understand correctly. Therefore even your gauge field won't fit in the cache (at 24x6^3 = 5184 local sites the gauge field is 4 links x 18 doubles x 8 bytes ≈ 576 bytes per site, i.e. about 3 MB per process, or about 24 MB across the 8 processes sharing one cache). Better to decrease the local lattice size by a factor of 2.
  2. Also, it is worth testing whether running 8 threads per task and two tasks per node wouldn't be faster.
  3. Finally, the OpenMP overhead might be so large on Intel that it makes more sense to simply run two processes per core!
  • 1193 Mflops per core with communication
  • 2443 Mflops per core without communication

with 256 tasks

  • 1294 Mflops per core with communication
  • 2473 Mflops per core without communication

Running 256 tasks with 4 threads each gives only 15 Mflops, so it is better to stick with 2 threads per core.

Using --with-alignment=32 and -axAVX, performance is better (again with 256 tasks):

  • 1455 Mflops per core with communication
  • 3029 Mflops per core without communication

Using the halfspinor code again gives better performance:

  • 1813 Mflops per core with communication
  • 2926 Mflops per core without communication

I don't know what this is as a percentage of peak performance right now.

Newer performance data (7.12.2012) with halfspinor, OMP_NUM_THREADS=2, 24^3x48, 256 tasks on 16 nodes:

  • 2274 Mflops per core with communication
  • 3547 Mflops per core without communication

I tested -axSSE4.2, which makes the code slower. I also tested different affinity options, but MP_TASK_AFFINITY=core:$OMP_NUM_THREADS appears to be by far the fastest; performance drops to 30 Mflops otherwise. Also, using gcc instead of icc currently leads to 12 Mflops per core...

Even newer (10.12.2012) with halfspinor, OMP_NUM_THREADS=8, 24^3x48, 16 nodes, 2 tasks per node with the Intel MPI library (local volume 12^4, 32 MPI processes):

  • 16245 Mflops per MPI task, 2030 Mflops per thread (w/ comm)
  • 27581 Mflops per MPI task, 3447 Mflops per thread (w/o comm)

and with halfspinor, OMP_NUM_THREADS=8, 16^3x32, 16 nodes, 2 tasks per node with the Intel MPI library (local volume 8^4, 32 MPI processes):

  • 13413 Mflops per MPI task, 1676 Mflops per thread (w/ comm)
  • 31101 Mflops per MPI task, 3887 Mflops per thread (w/o comm)

Interleaving communication and computation should help a lot; however, it seems not to be working well. For the same parameters as before (24^3x48), but with interleaving, I get

  • 13866 Mflops per MPI task, 1733 Mflops per thread (w/ comm)
  • 24206 Mflops per MPI task, 3025 Mflops per thread (w/o comm)

So there is large potential if we get the MPI library to do a proper non-blocking send/recv.
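
For reference, the interleaving pattern in question looks roughly like the sketch below: post non-blocking halo exchanges, compute on the interior while they are in flight, then wait and finish the boundary. The buffer layout and the compute_* functions are illustrative placeholders, not tmLQCD's actual hopping-matrix code.

#include <mpi.h>

static void compute_interior(void) { /* sites not touching the halo */ }
static void compute_boundary(void) { /* sites needing received halo data */ }

void halo_exchange_overlap(double *sendbuf, double *recvbuf,
                           int count, int up, int down, MPI_Comm comm)
{
    MPI_Request req[4];

    /* start the halo exchange without blocking */
    MPI_Irecv(recvbuf,         count, MPI_DOUBLE, down, 0, comm, &req[0]);
    MPI_Irecv(recvbuf + count, count, MPI_DOUBLE, up,   1, comm, &req[1]);
    MPI_Isend(sendbuf,         count, MPI_DOUBLE, up,   0, comm, &req[2]);
    MPI_Isend(sendbuf + count, count, MPI_DOUBLE, down, 1, comm, &req[3]);

    compute_interior();   /* overlaps with the communication */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    compute_boundary();   /* needs the halo, so it runs after the wait */
}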

Using 1 MPI process per node, i.e. OMP_NUM_THREADS=16, 16 nodes, 1 task per node, for 24^3x48:

  • 18685 Mflops per MPI task, 1167 Mflops per thread (w/ comm)
  • 23697 Mflops per MPI task, 1481 Mflops per thread (w/o comm)

Comments

The E5-2680 is a design with 2 hardware SMT threads per core, so this is not surprising.

Does --with-alignment=32 help? (We still need to make all the alignments independent of SSE2/SSE3 being defined.)
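
If it does, one way to get 32-byte alignment independently of the SSE2/SSE3 defines is posix_memalign. A minimal sketch (alloc_aligned is an illustrative name, not tmLQCD's API):

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

#define ALIGN_BYTES 32   /* AVX loads/stores want 32-byte alignment */

static void *alloc_aligned(size_t bytes)
{
    void *p = NULL;
    /* posix_memalign returns 0 on success and guarantees the alignment */
    if (posix_memalign(&p, ALIGN_BYTES, bytes) != 0)
        return NULL;
    return p;
}

int main(void)
{
    double *spinor = alloc_aligned(24 * 5184 * sizeof(double));
    if (spinor == NULL)
        return 1;
    printf("allocated at %p\n", (void *)spinor);
    free(spinor);
    return 0;
}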

On the machines in Zeuthen, which are similar in clock speed (2.67 GHz, but: 4 cores, 2 SMT, no AVX), I get over 5000 Mflops (nocomm) per core using the full-spinor code and over 4800 Mflops per core with half-spinor. I cannot look at MPI scaling, but the local volume was the same during this test. The fact that your new half-spinor version is so fast is truly remarkable!

It was only a first shot, so I should also try alignment=32|64, and maybe the other optimisation options too, like no AVX and with AVXXX...

Also, you might have much better luck with 2 tasks per node and 8 or even 16 threads per task (since the nodes have, AFAIK, 2 sockets, and each CPU has 8 cores with dual SMT).

Some more comments relating to poor performance with OpenMP: on systems with more than one socket it is extremely important to pin threads to the CPUs; otherwise cross-socket memory accesses occur, and these are slow (by up to a factor of 2). I don't know whether this is done automatically on SuperMUC or whether one has to proceed as on other machines, with a wrapper script that mpirun calls before the application runs and that defines some variables. In particular, KMP_AFFINITY needs to be set appropriately for each process separately. This should be discussed with the people responsible or looked up in the documentation.
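
A quick way to verify the pinning is to have each OpenMP thread report which core it runs on; if the numbers stay stable across runs, the pinning is in effect. A minimal, Linux-specific sketch (sched_getcpu needs _GNU_SOURCE and a reasonably recent glibc):

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel
    {
        /* with proper pinning each thread should report
           the same core number on every run */
        printf("thread %d on core %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}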

Performance - Krzysztof

../../tmLQCD/configure --prefix=/home/hpc/pr63po/lu64qov2/build/hmc_supermuc_mpi/ --enable-mpi --with-mpidimension=4 --enable-gaugecopy --enable-halfspinor --with-alignment=32 --disable-sse2 --enable-sse3 --with-limedir=/home/hlrb2/pr63po/lu64qov2/build/lime_supermuc/install CC=mpicc CFLAGS="-O3 -axAVX" --with-lapack="$MKL_LIB" F77=ifort

24^3x48 lattice, 8x8x2x2 parallelization, OMP_NUM_THREADS=2

--enable-omp --disable-sse3 CFLAGS="-axAVX": comm: 1502 Mflops, nocomm: 2733 Mflops

--enable-omp --enable-sse3 CFLAGS="-axAVX": comm: 1562 Mflops, nocomm: 2784 Mflops

--enable-omp --enable-sse3: comm: 1435 Mflops, nocomm: 2277 Mflops

--disable-omp --enable-sse3 CFLAGS="-axAVX": comm: 1575 Mflops, nocomm: 2824 Mflops