
I configured the latest bgq_omp branch of my tmLQCD fork with the following options, after module load mkl:

configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" --disable-sse2    --disable-sse3 --with-lapack="$MKL_LIB" --disable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -axCORE-AVX2 -openmp" F77="ifort"
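As a rough sketch, the full build sequence then looks as follows (this assumes the configure script has already been generated and that the usual make run produces the benchmark executable used below; adjust paths to your own setup):

module load mkl
./configure --enable-mpi --with-mpidimension=4 \
    --with-limedir="$HOME/cu/head/lime-1.3.2/" \
    --disable-sse2 --disable-sse3 --with-lapack="$MKL_LIB" \
    --disable-halfspinor --disable-shmem \
    CC=mpicc CFLAGS="-O3 -axCORE-AVX2 -openmp" F77="ifort"
make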

Here is an example LoadLeveler job file:

#!/bin/bash
#
#@ job_type = parallel
#@ class = large
#@ node = 8
#@ total_tasks= 128
#@ island_count=1
### alternative: specify tasks per node and an island range instead
##@ tasks_per_node = 16
##@ island_count = 1,18
#@ wall_clock_limit = 0:15:00
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/cu/head/testrun/
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=you@there.com
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=2
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 128 ./benchmark
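Submission and monitoring work with the standard LoadLeveler commands; the file name job.ll below is just a placeholder for wherever you saved the script above:

llsubmit job.ll        # submit the job script
llq -u $USER           # list your queued and running jobs
llcancel <job id>      # remove a job if needed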

The performance I got from a benchmark run with 128 tasks with 2 threads each on a 24^3x48 lattice (local lattice size 24x6x6x6) is:

  • 1193 Mflops per core with communication
  • 2443 Mflops per core without communication

with 256 tasks:

  • 1294 Mflops per core with communication
  • 2473 Mflops per core without communication

I don't know what this is in % of peak performance right now.
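For a back-of-the-envelope estimate one can divide the measured rate by the per-core peak of the machine; the peak value below is only an assumed example (2.7 GHz x 8 double-precision flops per cycle = 21600 Mflops per core), not a figure taken from this system:

PEAK_MFLOPS=21600   # assumed per-core peak, replace with the real value
MEASURED=1294       # Mflops per core with communication, 256-task run above
echo "scale=1; 100 * $MEASURED / $PEAK_MFLOPS" | bc   # prints 5.9 for the assumed peak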