System Configuration

As far as I understand, the thin nodes run the Sandy Bridge-EP Intel Xeon E5-2680 8C processor, which supports AVX but not the fused multiply-add (FMA) instructions. The fat nodes do not support AVX, only SSE4. The login nodes also seem to be E5-2680, and indeed I get illegal instructions when using FMA instructions.
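To check which instruction sets a node actually advertises, one can inspect the CPU flags directly (a minimal sketch; avx, fma and sse4_2 are the standard /proc/cpuinfo flag names, nothing SuperMUC-specific):

# print the CPU model and any of the avx/fma/sse4_2 flags it reports
grep -m1 "model name" /proc/cpuinfo
grep -m1 -o -w -E "avx|fma|sse4_2" /proc/cpuinfo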

Configuring the tmLQCD code

I configured the latest bgq_omp branch of my tmLQCD fork with the following options (after module load mkl) for a mixed MPI + OpenMP executable:

configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" --disable-sse2  --with-alignment=32  --disable-sse3 --with-lapack="$MKL_LIB" --disable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -xAVX -openmp" F77="ifort"
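For completeness, the surrounding build steps are the standard autotools ones (just a sketch; $MKL_LIB is presumably set by the mkl module, since that is what --with-lapack points at above):

module load mkl
echo "$MKL_LIB"     # should expand to the MKL link line used by --with-lapack
./configure ...     # the configure call from above
make -j 8           # build with 8 parallel jobs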

Sample Jobfile

Here is an example of a LoadLeveler job file:

#!/bin/bash
#
#@ job_type = parallel
#@ class = large
#@ node = 8
#@ total_tasks = 128
#@ island_count = 1
### other example
##@ tasks_per_node = 16
##@ island_count = 1,18
#@ wall_clock_limit = 0:15:00
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/cu/head/testrun/
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=you@there.com
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=2
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 128 ./benchmark
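The job file is then handled with the usual LoadLeveler commands (a sketch; job.ll is just a placeholder name for the file above):

llsubmit job.ll     # submit the job command file
llq -u $USER        # check its state in the queue
llcancel <jobid>    # remove it again if necessary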

Performance Measures

The performance I got from a benchmark run with 128 tasks with 2 threads each on a 24^3x48 lattice (local lattice size 24x6x6x6) is:

  • 1193 Mflops per core with communication

  • 2443 Mflops per core without communication

with 256 tasks:

  • 1294 Mflops per core with communication

  • 2473 Mflops per core without communication

Using 256 tasks with 4 threads each gives only 15 Mflops, so it is better to use 2 threads per core.

Using --with-alignment=32 and -axAVX, performance is better (on 256 tasks again):

  • 1455 Mflops per core with communication

  • 3029 Mflops per core without communication

Using halfspinor again gives better performance:

  • 1813 Mflops per core with communication

  • 2926 Mflops per core without communication

I don't know what this is in % of peak performance right now.
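A rough estimate (my own back-of-the-envelope numbers, not anything measured here): assuming the E5-2680's 2.7 GHz base clock and 8 double-precision flops per cycle with AVX (4 adds + 4 multiplies, no FMA), the peak is 21.6 Gflops per core, so the 1813 Mflops halfspinor figure would be around 8% of peak:

# assumed peak: 2.7e9 cycles/s * 8 DP flops/cycle = 21.6 Gflops per core
awk 'BEGIN { peak = 2.7e9 * 8; printf "%.1f%% of peak\n", 1813e6 / peak * 100 }'
# prints: 8.4% of peak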

Comments

The E5-2680 has 2 hardware SMT threads per core, so this is not surprising.

Does --with-alignment=32 help? (we still need to make all the alignments independent of SSE2/SSE3 being defined)

On the machines in Zeuthen, which are similar in clock speed (2.67 GHz, but: 4 cores, 2 SMT, no AVX), I get over 5000 Mflops (nocomm) per core using the full-spinor code and over 4800 Mflops per core with half-spinor. I cannot look at MPI scaling, but the local volume was the same during this test. The fact that your new half-spinor version is so fast is truly remarkable!

This was only a first shot, so I should also try alignment=32|64, and maybe also the other optimisation options, like no AVX and with AVXXX…

Also, you might have much better luck with 2 tasks per node and 8 or even 16 threads per task (since the nodes have, AFAIK, 2 sockets and each CPU has 8 cores with dual SMT); see the sketch below.
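A hedged sketch of how that layout would look in the job file above (only the task/thread keywords change; these values follow the suggestion and are untested):

#@ node = 8
#@ tasks_per_node = 2
export OMP_NUM_THREADS=8
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 16 ./benchmark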

Performance - Krzysztof

./../tmLQCD/configure --prefix=/home/hpc/pr63po/lu64qov2/build/hmc_supermuc_mpi/ --enable-mpi --with-mpidimension=4 --enable-gaugecopy --enable-halfspinor --with-alignment=32 --disable-sse2 --enable-sse3 --with-limedir=/home/hlrb2/pr63po/lu64qov2/build/lime_supermuc/install CC=mpicc CFLAGS="-O3 -axAVX" --with-lapack="$MKL_LIB" F77=ifort

24^3x48 lattice, 8x8x2x2 parallelization, OMP_NUM_THREADS=2

  • --enable-omp --disable-sse3 CFLAGS="-axAVX": comm: 1502 Mflops, nocomm: 2733 Mflops

  • --enable-omp --enable-sse3 CFLAGS="-axAVX": comm: 1562 Mflops, nocomm: 2784 Mflops

  • --enable-omp --enable-sse3: comm: 1435 Mflops, nocomm: 2277 Mflops

  • --disable-omp --enable-sse3 CFLAGS="-axAVX": comm: 1575 Mflops, nocomm: 2824 Mflops