
I configured the latest bgq_omp branch of my tmLQCD fork with the following options, after module load mkl:

configure --enable-mpi --with-mpidimension=4 --with-limedir="$HOME/cu/head/lime-1.3.2/" --disable-sse2    --disable-sse3 --with-lapack="$MKL_LIB" --disable-halfspinor --disable-shmem CC=mpicc CFLAGS="-O3 -axCORE-AVX2 -openmp" F77="ifort"
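As a rough sketch, the full build sequence then looks as follows (this assumes the configure script has already been generated and that the usual make run produces the benchmark executable used below; adjust paths to your own setup):

module load mkl
./configure --enable-mpi --with-mpidimension=4 \
    --with-limedir="$HOME/cu/head/lime-1.3.2/" \
    --disable-sse2 --disable-sse3 --with-lapack="$MKL_LIB" \
    --disable-halfspinor --disable-shmem \
    CC=mpicc CFLAGS="-O3 -axCORE-AVX2 -openmp" F77="ifort"
make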

Here is an example LoadLeveler job file:

#!/bin/bash
#
#@ job_type = parallel
#@ class = large
#@ node = 8
#@ total_tasks= 128
#@ island_count=1
### alternative: specify tasks per node and an island range instead
##@ tasks_per_node = 16
##@ island_count = 1,18
#@ wall_clock_limit = 0:15:00
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/cu/head/testrun/
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=you@there.com
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=2
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 128 ./benchmark
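Submission and monitoring work with the standard LoadLeveler commands; the file name job.ll below is just a placeholder for wherever you saved the script above:

llsubmit job.ll        # submit the job script
llq -u $USER           # list your queued and running jobs
llcancel <job id>      # remove a job if needed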

The performance I got from a benchmark run with 128 tasks with 2 threads each on a 24^3x48 lattice (local lattice size 24x6x6x6) is:

  • 1193 Mflops per core with communication
  • 2443 Mflops per core without communication

with 256 tasks:

  • 1294 Mflops per core with communication
  • 2473 Mflops per core without communication

I don't know what this is in % of peak performance right now.
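For a back-of-the-envelope estimate one can divide the measured rate by the per-core peak of the machine; the peak value below is only an assumed example (2.7 GHz x 8 double-precision flops per cycle = 21600 Mflops per core), not a figure taken from this system:

PEAK_MFLOPS=21600   # assumed per-core peak, replace with the real value
MEASURED=1294       # Mflops per core with communication, 256-task run above
echo "scale=1; 100 * $MEASURED / $PEAK_MFLOPS" | bc   # prints 5.9 for the assumed peak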