urbach edited this page Aug 31, 2012 · 32 revisions

I configured the latest bgq_omp branch of my tmLQCD fork for a mixed MPI + OpenMP executable with the following options, after running module load mkl:

configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" --disable-sse2 --disable-sse3 --with-lapack="$MKL_LIB" --disable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -axCORE-AVX2 -openmp" F77="ifort"

Here is an example LoadLeveler job file:

#!/bin/bash
#
#@ job_type = parallel
#@ class = large
#@ node = 8
#@ total_tasks= 128
#@ island_count=1
### other example
##@ tasks_per_node = 16
##@ island_count = 1,18
#@ wall_clock_limit = 0:15:00
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/cu/head/testrun/
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=you@there.com
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=2
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 128 ./benchmark

The performance I got from a benchmark run with 128 tasks with 2 threads each on a 24^3x48 lattice (local lattice size 24x6x6x6) is:

  • 1193 Mflops per core with communication
  • 2443 Mflops per core without communication
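As a rough cross-check of the local lattice size quoted above, the sketch below divides the global 48x24x24x24 lattice over a 4-dimensional process grid. The grid 2x4x4x4 (T, X, Y, Z) is an assumption inferred from the quoted local size 24x6x6x6 and the 128 tasks; it is not stated explicitly here, and the function names are purely illustrative.

```python
# Global lattice from the text: T x L^3 = 48 x 24 x 24 x 24.
GLOBAL_LATTICE = (48, 24, 24, 24)

# ASSUMPTION: process grid (T, X, Y, Z) inferred from the quoted local
# lattice 24x6x6x6 and 128 tasks; it is not given explicitly in the text.
PROCESS_GRID = (2, 4, 4, 4)


def local_lattice(global_dims, grid):
    """Return the per-task lattice extents for an even block decomposition."""
    if len(global_dims) != len(grid):
        raise ValueError("lattice and process grid must have the same rank")
    local = []
    for extent, nproc in zip(global_dims, grid):
        if extent % nproc != 0:
            raise ValueError(f"extent {extent} is not divisible by {nproc}")
        local.append(extent // nproc)
    return tuple(local)


def n_tasks(grid):
    """Total number of MPI tasks implied by the process grid."""
    total = 1
    for nproc in grid:
        total *= nproc
    return total


if __name__ == "__main__":
    # Prints the task count and the local lattice extents.
    print(n_tasks(PROCESS_GRID), local_lattice(GLOBAL_LATTICE, PROCESS_GRID))
```

With the assumed grid this reproduces 128 tasks and a 24x6x6x6 local lattice, so each task holds 5184 sites.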

With 256 tasks:

  • 1294 Mflops per core with communication
  • 2473 Mflops per core without communication
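To put the two sets of numbers side by side, one could look at the ratio of the rate with communication to the rate without. This is only a small sketch using the four values quoted above; the helper name comm_efficiency is made up for illustration and is not part of tmLQCD.

```python
# Mflops per core quoted in the text, keyed by number of MPI tasks:
# (with communication, without communication).
RESULTS = {
    128: (1193.0, 2443.0),
    256: (1294.0, 2473.0),
}


def comm_efficiency(with_comm, without_comm):
    """Fraction of the communication-free rate that remains with communication."""
    if without_comm <= 0:
        raise ValueError("rate without communication must be positive")
    return with_comm / without_comm


if __name__ == "__main__":
    for tasks, (with_comm, without_comm) in sorted(RESULTS.items()):
        ratio = comm_efficiency(with_comm, without_comm)
        print(f"{tasks} tasks: {100.0 * ratio:.1f}% of the nocomm rate")
```

In both runs roughly half of the communication-free rate is reached, so communication costs about a factor of two at this local volume.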

Running 256 tasks with 4 threads each gives only 15 Mflops, so it is better to use 2 threads per core.

I don't know yet what this is as a percentage of peak performance.
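A rough estimate of the peak fraction could be sketched as follows. The inputs are assumptions and not measurements from this page: a nominal clock of 2.7 GHz for the E5-2680 (turbo ignored), 8 double-precision flops per cycle per core with AVX (one 4-wide add plus one 4-wide multiply), and a benchmark that counts double-precision flops. The function names are illustrative only.

```python
# ASSUMPTIONS (not taken from the benchmark output):
#   - nominal E5-2680 clock of 2.7 GHz, turbo ignored
#   - 8 double-precision flops per cycle per core with AVX
#     (one 4-wide add plus one 4-wide multiply per cycle)
CLOCK_GHZ = 2.7
FLOPS_PER_CYCLE = 8


def peak_mflops_per_core(clock_ghz=CLOCK_GHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in Mflops per core: clock (MHz) times flops per cycle."""
    return clock_ghz * 1000.0 * flops_per_cycle


def percent_of_peak(measured_mflops, peak_mflops=None):
    """Measured rate as a percentage of the assumed theoretical peak."""
    if peak_mflops is None:
        peak_mflops = peak_mflops_per_core()
    return 100.0 * measured_mflops / peak_mflops


if __name__ == "__main__":
    # Mflops per core quoted in the text above.
    for label, rate in [
        ("128 tasks, comm", 1193.0),
        ("128 tasks, nocomm", 2443.0),
        ("256 tasks, comm", 1294.0),
        ("256 tasks, nocomm", 2473.0),
    ]:
        print(f"{label}: {percent_of_peak(rate):.1f}% of assumed peak")
```

Under these assumptions the peak is 21600 Mflops per core, which would put the runs at roughly 5.5 to 6% of peak with communication and about 11% without. If the clock or flops-per-cycle assumption is off, the percentages scale accordingly.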

Comments

The E5-2680 has 2 hardware threads (SMT) per core, so this is not surprising.

Does --with-alignment=32 help? (we still need to make all the alignments independent of SSE2/SSE3 being defined)

On the machines in Zeuthen, which are similar in clock speed (2.67 GHz, but with 4 cores, 2 SMT, no AVX), I get over 5000 Mflops (nocomm) per core using the full-spinor code and over 4800 Mflops per core with half-spinor. I cannot look at MPI scaling, but the local volume was the same during this test. The fact that your new half-spinor version is so fast is truly remarkable!

It was only a first shot, so I should also try alignment=32|64. Maybe also the other optimisation options, like no AVX and with AVXXX...