System Configuration

As far as I understand, the thin nodes run the Sandy Bridge-EP Intel Xeon E5-2680 8C processor, which supports AVX but not the fused multiply-add (FMA) instructions. The fat nodes do not support AVX, only SSE4. The login nodes also seem to be E5-2680, and indeed I get illegal instructions when using FMA instructions.
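To check which instruction sets a node actually advertises, one can inspect the CPU flags directly (a minimal sketch; avx, fma and sse4_2 are the standard /proc/cpuinfo flag names, nothing SuperMUC-specific):

# print the CPU model and any of the avx/fma/sse4_2 flags it reports
grep -m1 "model name" /proc/cpuinfo
grep -m1 -o -w -E "avx|fma|sse4_2" /proc/cpuinfo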

Configuring the tmLQCD code

I configured the latest bgq_omp branch of my tmLQCD fork with the following options (after module load mkl) for a mixed MPI + OpenMP executable:

configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" --disable-sse2  --with-alignment=32  --disable-sse3 --with-lapack="$MKL_LIB" --disable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -xAVX -openmp" F77="ifort"
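For completeness, the surrounding build steps are the standard autotools ones (just a sketch; $MKL_LIB is presumably set by the mkl module, since that is what --with-lapack points at above):

module load mkl
echo "$MKL_LIB"     # should expand to the MKL link line used by --with-lapack
./configure ...     # the configure call from above
make -j 8           # build with 8 parallel jobs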

Sample Jobfile

Here is an example of a LoadLeveler job file:

#!/bin/bash
#
#@ job_type = parallel
#@ class = large
#@ node = 8
#@ total_tasks = 128
#@ island_count = 1
### other example
##@ tasks_per_node = 16
##@ island_count = 1,18
#@ wall_clock_limit = 0:15:00
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/cu/head/testrun/
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=you@there.com
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=2
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 128 ./benchmark
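The job file is then handled with the usual LoadLeveler commands (a sketch; job.ll is just a placeholder name for the file above):

llsubmit job.ll     # submit the job command file
llq -u $USER        # check its state in the queue
llcancel <jobid>    # remove it again if necessary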

Performance Measures

The performance I got from a benchmark run with 128 tasks with 2 threads each on a 24^3x48 lattice (local lattice size 24x6x6x6) is:

  • 1193 Mflops per core with communication

  • 2443 Mflops per core without communication

with 256 tasks:

  • 1294 Mflops per core with communication

  • 2473 Mflops per core without communication

Using 256 tasks with 4 threads each gives only 15 Mflops, so it is better to use 2 threads per core.

Using --with-alignment=32 and -axAVX, performance is better (on 256 tasks again):

  • 1455 Mflops per core with communication

  • 3029 Mflops per core without communication

Using halfspinor again gives better performance:

  • 1813 Mflops per core with communication

  • 2926 Mflops per core without communication

I don't know what this is in % of peak performance right now.
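A rough estimate (my own back-of-the-envelope numbers, not anything measured here): assuming the E5-2680's 2.7 GHz base clock and 8 double-precision flops per cycle with AVX (4 adds + 4 multiplies, no FMA), the peak is 21.6 Gflops per core, so the 1813 Mflops halfspinor figure would be around 8% of peak:

# assumed peak: 2.7e9 cycles/s * 8 DP flops/cycle = 21.6 Gflops per core
awk 'BEGIN { peak = 2.7e9 * 8; printf "%.1f%% of peak\n", 1813e6 / peak * 100 }'
# prints: 8.4% of peak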

Comments

The E5-2680 has 2 hardware SMT threads per core, so this is not surprising.

Does --with-alignment=32 help? (we still need to make all the alignments independent of SSE2/SSE3 being defined)

On the machines in Zeuthen, which are similar in clock speed (2.67 GHz, but: 4 cores, 2 SMT, no AVX), I get over 5000 Mflops (nocomm) per core using the full-spinor code and over 4800 Mflops per core with half-spinor. I cannot look at MPI scaling, but the local volume was the same during this test. The fact that your new half-spinor version is so fast is truly remarkable!

This was only a first shot, so I should also try alignment=32|64, and maybe also the other optimisation options, like no AVX and with AVXXX…

Also, you might have much better luck with 2 tasks per node and 8 or even 16 threads per task (since the nodes have, AFAIK, 2 sockets and each CPU has 8 cores with dual SMT); see the sketch below.
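A hedged sketch of how that layout would look in the job file above (only the task/thread keywords change; these values follow the suggestion and are untested):

#@ node = 8
#@ tasks_per_node = 2
export OMP_NUM_THREADS=8
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 16 ./benchmark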

Performance - Krzysztof

./../tmLQCD/configure --prefix=/home/hpc/pr63po/lu64qov2/build/hmc_supermuc_mpi/ --enable-mpi --with-mpidimension=4 --enable-gaugecopy --enable-halfspinor --with-alignment=32 --disable-sse2 --enable-sse3 --with-limedir=/home/hlrb2/pr63po/lu64qov2/build/lime_supermuc/install CC=mpicc CFLAGS="-O3 -axAVX" --with-lapack="$MKL_LIB" F77=ifort

24^3x48 lattice, 8x8x2x2 parallelization, OMP_NUM_THREADS=2

  • --enable-omp --disable-sse3 CFLAGS="-axAVX": comm: 1502 Mflops, nocomm: 2733 Mflops

  • --enable-omp --enable-sse3 CFLAGS="-axAVX": comm: 1562 Mflops, nocomm: 2784 Mflops

  • --enable-omp --enable-sse3: comm: 1435 Mflops, nocomm: 2277 Mflops

  • --disable-omp --enable-sse3 CFLAGS="-axAVX": comm: 1575 Mflops, nocomm: 2824 Mflops