
System Configuration

As far as I understand, the thin nodes run the Sandy Bridge-EP Intel Xeon E5-2680 8C processor, which supports AVX but not fused multiply-add (FMA) instructions. The fat nodes don't support AVX, only SSE4. The login nodes also seem to be E5-2680, and indeed I get illegal-instruction errors when using FMA instructions.
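
A quick way to verify what a given node actually supports is to look at the flags line of /proc/cpuinfo (this just filters out the relevant SIMD flags; run it on a login node or inside a job):

grep -m1 flags /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse4_1|sse4_2|avx|fma)$'
grep -m1 'model name' /proc/cpuinfo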

Configuring the tmLQCD code

I configured the latest bgq_omp branch of my tmLQCD fork with the following options (after module load mkl) to build a mixed MPI + OpenMP executable:

configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" --disable-sse2  --with-alignment=32  --disable-sse3 --with-lapack="$MKL_LIB" --disable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -xAVX -openmp" F77="ifort"
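
Since configure picks up the LAPACK libraries via $MKL_LIB, it is worth checking that the mkl module is actually loaded and the variable is set before configuring, e.g.:

module load mkl
echo "$MKL_LIB"   # should print the MKL library path used by --with-lapack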

Sample Jobfile

Here is an example of a LoadLeveler job file:

#!/bin/bash
#
#@ job_type = parallel
#@ class = large
#@ node = 8
#@ total_tasks= 128
#@ island_count=1
### other example
##@ tasks_per_node = 16
##@ island_count = 1,18
#@ wall_clock_limit = 0:15:00
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/cu/head/testrun/
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=you@there.com
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=2
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 128 ./benchmark
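
Assuming the job file is saved as, say, job.ll (the file name is just an example), it is submitted and monitored with the standard LoadLeveler commands:

llsubmit job.ll
llq -u $USER      # show the status of your jobs in the queue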

Theoretical expectations

It is expected that pure MPI will perform best on this machine, but it is worth trying a 2x8 hybrid approach with 2 processes per node and 8 threads per process (as each node has two processors with 8 cores each); a sketch of the corresponding job-file changes is given below. It might even be beneficial to run 16 threads per process, because the E5-2680 supports two hardware threads per core.
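
A minimal sketch of the job-file changes for the 2x8 layout (only the task and thread counts differ from the sample job file above; treat the exact numbers as an assumption to be benchmarked):

#@ node = 8
#@ tasks_per_node = 2
# (all other #@ keywords as in the sample job file)
export OMP_NUM_THREADS=8
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 16 ./benchmark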

From experience on the Zeuthen cluster - which has older CPUs at similar clock rates - this machine should provide around 3.5 GFlops per core with communication and around 5 GFlops per core without (values for problems fitting into the cache). It is also expected that the halfspinor version of the code will underperform node-locally but overperform with many MPI processes and non-node-local communication; in a pure MPI approach the halfspinor version should always be faster.

These CPUs have a large L3 cache of 20MB, so a node-local volume of up to 24x12^3 should not be a problem at all; in such a configuration the loops are quite large, which gives OpenMP plenty of work per thread to amortize its management overhead.

Performance Measures

The performance I got from a benchmark run with 128 tasks of 2 threads each on a 24^3x48 lattice (local lattice size 24x6x6x6) is

  • 1193 Mflops per core with communication
  • 2443 Mflops per core without communication

Some comments on this setup:

  1. A comment on the local lattice size: the CPU has a 20MB L3 cache and you're running 8 processes per CPU if I understand correctly. Therefore even your gauge field won't fit in the cache (see the estimate after this list); better to decrease the local lattice size by a factor of 2.
  2. Also, it is worth testing whether running 8 threads per task, two tasks per node, wouldn't be faster.
  3. Finally, the OpenMP overhead might be so large on Intel that it makes more sense to simply run two processes per core!
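
For reference, a rough back-of-the-envelope estimate of the per-task field sizes for the 24x6x6x6 local lattice (assuming double precision, 4x3 complex components per spinor site and 4 SU(3) links per site):

V=$((24*6*6*6))                     # local lattice sites per MPI task
GAUGE=$((V*4*9*16))                 # 4 links/site, 9 complex entries/link, 16 bytes each
SPINOR=$((V*4*3*16))                # 4 spin x 3 colour complex numbers per site
echo "gauge field:  $((GAUGE/1024)) KiB per task"    # ~2916 KiB
echo "spinor field: $((SPINOR/1024)) KiB per task"   # ~972 KiB
# with 8 tasks per socket the gauge fields alone take ~23 MiB, more than the 20MB L3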

with 256 tasks

  • 1294 Mflops per core with communication
  • 2473 Mflops per core without communication

With 256 tasks and 4 threads each I get only 15 Mflops, so it is better to use 2 threads per core.

Using --with-alignment=32 and -axAVX, performance is better (again on 256 tasks):

  • 1455 Mflops per core with communication
  • 3029 Mflops per core without communication

Using halfspinor again gives better performance with communication (though slightly worse without):

  • 1813 Mflops per core with communication
  • 2926 Mflops per core without communication

I don't know what this is in % of peak performance right now.
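
As a rough yardstick only (assuming the 2.7 GHz base clock of the E5-2680 and 8 double-precision flops per cycle with AVX, i.e. no FMA), the per-core peak would be about 21.6 GFlops, which would put the 1813 Mflops figure at roughly 8% of peak:

PEAK=$((2700*8))   # Mflops per core: 2.7 GHz x 8 DP flops/cycle (4-wide add + 4-wide mul)
awk -v m=1813 -v p=$PEAK 'BEGIN { printf "%.1f%% of peak\n", 100*m/p }'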

Comments

The E5-2680 has 2 hardware threads (SMT) per core, so this is not surprising.

Does --with-alignment=32 help? (we still need to make all the alignments independent of SSE2/SSE3 being defined)

On the machines in Zeuthen, which are similar in clock speed (2.67GHz, but: 4 cores, 2 SMT, no AVX), I get over 5000 Mflops (nocomm) per core using the full-spinor code and over 4800 Mflops per core with half-spinor. I cannot look at MPI scaling there, but the local volume was the same during this test. The fact that your new half-spinor version is so fast is truly remarkable!

It was only a first shot, so I should also try alignment=32|64, and maybe also the other optimisation options, like no AVX and with AVXXX...

Also, you might have much better luck with 2 tasks per node and 8 or even 16 threads per task (since the nodes have - AFAIK - 2 sockets and each CPU has 8 cores with dual SMT).

Performance - Krzysztof

../../tmLQCD/configure --prefix=/home/hpc/pr63po/lu64qov2/build/hmc_supermuc_mpi/ --enable-mpi --with-mpidimension=4 --enable-gaugecopy --enable-halfspinor --with-alignment=32 --disable-sse2 --enable-sse3 --with-limedir=/home/hlrb2/pr63po/lu64qov2/build/lime_supermuc/install CC=mpicc CFLAGS="-O3 -axAVX" --with-lapack="$MKL_LIB" F77=ifort

24^3x48 lattice, 8x8x2x2 parallelization, OMP_NUM_THREADS=2

  • --enable-omp --disable-sse3, with -axAVX: comm 1502 Mflops, nocomm 2733 Mflops
  • --enable-omp --enable-sse3, with -axAVX: comm 1562 Mflops, nocomm 2784 Mflops
  • --enable-omp --enable-sse3, without -axAVX: comm 1435 Mflops, nocomm 2277 Mflops
  • --disable-omp --enable-sse3, with -axAVX: comm 1575 Mflops, nocomm 2824 Mflops