SuperMUC docu

After running module load mkl, I configured the latest bgq_omp branch of my tmLQCD fork with the following options to build a hybrid MPI + OpenMP executable:

configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" --disable-sse2 --disable-sse3 --with-lapack="$MKL_LIB" --disable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -axCORE-AVX2 -openmp" F77="ifort"

Here is an example LoadLeveler job file:

#!/bin/bash
#
#@ job_type = parallel
#@ class = large
#@ node = 8
#@ total_tasks= 128
#@ island_count=1
### other example
##@ tasks_per_node = 16
##@ island_count = 1,18
#@ wall_clock_limit = 0:15:00
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/cu/head/testrun/
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=you@there.com
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=2
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 128 ./benchmark
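
As a quick sanity check of the job geometry above, the following sketch computes how many software threads land on each node and compares that with the available hardware threads. The node values (16 cores per node, 2 hardware threads per core) are assumptions about the machine, not taken from the job file, and the helper names threads_per_node and fits_node are hypothetical.

```shell
#!/bin/sh
# Sketch: check that tasks x OpenMP threads fit onto one node.
# cores_per_node=16 and smt_per_core=2 are ASSUMPTIONS about the node
# (two 8-core sockets with 2 hardware threads per core); adjust as needed.

# threads_per_node NODES TOTAL_TASKS OMP_THREADS
# Prints the number of software threads placed on each node.
threads_per_node() {
    echo $(( $2 / $1 * $3 ))
}

# fits_node THREADS CORES_PER_NODE SMT_PER_CORE
# Succeeds if THREADS does not exceed the hardware threads of a node.
fits_node() {
    [ "$1" -le $(( $2 * $3 )) ]
}

cores_per_node=16
smt_per_core=2

# Values from the job file: node = 8, total_tasks = 128, OMP_NUM_THREADS=2
t=$(threads_per_node 8 128 2)
echo "threads per node: $t"
if fits_node "$t" "$cores_per_node" "$smt_per_core"; then
    echo "fits"
else
    echo "oversubscribed"
fi
```

Under these assumptions the job file places 32 threads on each node, which exactly fills the 32 hardware threads. Doubling OMP_NUM_THREADS to 4 at 16 tasks per node would ask for 64 threads and fail the check.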

The performance I got from a benchmark run with 128 tasks and 2 threads each on a 24^3x48 lattice (local lattice size 24x6x6x6) is:

1193 Mflops per core with communication
2443 Mflops per core without communication
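
The local lattice size quoted for this run follows from dividing the global extents by the MPI process grid. The sketch below reproduces that arithmetic. The process grid 2x4x4x4 (T, X, Y, Z) is an assumption inferred from the quoted local size and the 128 tasks; it is not stated on this page, and the helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: derive the local lattice from the global lattice and a process grid.
# The grid 2x4x4x4 (T,X,Y,Z) is an ASSUMPTION inferred from the quoted
# local size 24x6x6x6 and 128 MPI tasks.

# local_extent GLOBAL NPROCS
# Prints GLOBAL / NPROCS; fails if the extent is not evenly divisible.
local_extent() {
    if [ $(( $1 % $2 )) -ne 0 ]; then
        echo "extent $1 is not divisible by $2" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

# local_lattice T L NT NX NY NZ
# Prints the local lattice as TxXxYxZ for a global L^3 x T lattice.
local_lattice() {
    lt=$(local_extent "$1" "$3") || return 1
    lx=$(local_extent "$2" "$4") || return 1
    ly=$(local_extent "$2" "$5") || return 1
    lz=$(local_extent "$2" "$6") || return 1
    echo "${lt}x${lx}x${ly}x${lz}"
}

local_lattice 48 24 2 4 4 4    # prints 24x6x6x6
echo $(( 2 * 4 * 4 * 4 ))      # prints 128, the number of MPI tasks
```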

With 256 tasks:

1294 Mflops per core with communication
2473 Mflops per core without communication

Running 256 tasks with 4 threads each gives only 15 Mflops, so it is better to use 2 threads per core.

I don't yet know what this is as a percentage of peak performance.
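
A rough estimate can be sketched as follows. It rests on assumptions that are not from this page: a nominal clock of 2.7 GHz for the E5-2680, 8 double-precision flops per cycle per core with AVX (so 21600 Mflops peak per core), and a benchmark that runs in double precision. The helper name pct_of_peak is hypothetical.

```shell
#!/bin/sh
# Sketch: express a measured rate as a percentage of nominal per-core peak.
# ASSUMPTIONS: 2700 MHz nominal clock, 8 double-precision flops per cycle
# per core (AVX), double-precision benchmark. Turbo clocks are ignored.

peak_mflops=21600   # 2700 MHz x 8 flops/cycle

# pct_of_peak MFLOPS
# Prints the percentage of per-core peak, rounded to one decimal place.
pct_of_peak() {
    LC_ALL=C awk -v m="$1" -v p="$peak_mflops" \
        'BEGIN { printf "%.1f\n", 100 * m / p }'
}

pct_of_peak 1193   # 128 tasks, with communication
pct_of_peak 2443   # 128 tasks, without communication
pct_of_peak 1294   # 256 tasks, with communication
pct_of_peak 2473   # 256 tasks, without communication
```

Under these assumptions the runs reach roughly 5.5% and 11.3% of peak with 128 tasks, and 6.0% and 11.4% with 256 tasks. If the benchmark were single precision, the peak would double and the percentages would halve.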

Comments

The E5-2680 has 2 hardware threads (SMT) per core, so this is not surprising.

Does --with-alignment=32 help? (We still need to make all the alignments independent of whether SSE2/SSE3 is defined.)

On the machines in Zeuthen, which are similar in clock speed (2.67 GHz, but 4 cores, 2 SMT, no AVX), I get over 5000 Mflops (nocomm) per core using the full-spinor code and over 4800 Mflops per core with half-spinor. I cannot look at MPI scaling, but the local volume was the same during this test. The fact that your new half-spinor version is so fast is truly remarkable!

It was only a first shot, so I should also try alignment=32|64. Maybe also the other optimisation options, like no AVX and with AVXXX...
