
Blue Gene Q Running and Performance


Running on BG/Q

I obtain the best performance when using SPI communication combined with overlapping of communication and computation. Please make sure you use

runjob  --envs "MUSPI_NUMINJFIFOS=8" --envs "MUSPI_NUMRECFIFOS=8" --envs "MUSPI_NUMBATIDS=2" --ranks-per-node 1

to have enough resources (injection and reception FIFOs and base address table IDs) for SPI and MPI together. Moreover, set OMP_NUM_THREADS=64.
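
For reference, here is a minimal sketch of how this might look inside a LoadLeveler job script; the LoadLeveler keywords, partition size, executable name and input file are illustrative placeholders, not prescribed by tmLQCD:

# Hypothetical LoadLeveler job sketch (keywords and names are placeholders)
# @ job_type = bluegene
# @ bg_size  = 512
# @ queue
runjob --envs "MUSPI_NUMINJFIFOS=8" --envs "MUSPI_NUMRECFIFOS=8" \
       --envs "MUSPI_NUMBATIDS=2" --envs "OMP_NUM_THREADS=64" \
       --ranks-per-node 1 --np 512 \
       : ./hmc_tm -f hmc.input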

I see a strong dependence on the mapping option. For a 512-node partition with a 48^3x96 global volume it works best if I choose --mapping EABCDT, because that orders the torus as ExAxBxCxD = 2x4x4x4x4, i.e. E and A combine to give the 8x4x4x4 MPI process grid.

Getting the Mapping right

There is an environment variable called LOADL_BG_SHAPE which gives the partition dimensions in the form AxBxCxDxE; from this variable one can deduce the correct mapping for runjob. The E dimension is always 2, and a midplane has AxBxCxD = 4x4x4x4. For any combination of midplanes one needs to properly match the input parameters NrXProcs, NrYProcs and NrZProcs, and the resulting NrTProcs, to the 5-dimensional torus of the machine.
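
As a sanity check for the 512-node example above, the bookkeeping can be written out in a few lines of shell; the process grid and global volume are hard-coded here purely for illustration:

# A midplane: AxBxCxD = 4x4x4x4 torus with E = 2, i.e. 512 nodes in total
NODES=$((4*4*4*4*2))
# MPI process grid NrTProcs x NrXProcs x NrYProcs x NrZProcs = 8x4x4x4
NT=8; NX=4; NY=4; NZ=4
echo "nodes: ${NODES}, MPI ranks: $((NT*NX*NY*NZ))"   # must agree for 1 rank per node
# resulting local lattice for the 48^3x96 global volume
echo "local lattice: $((96/NT)) x $((48/NX)) x $((48/NY)) x $((48/NZ))"   # 12^4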

(See page 112 of the BG/Q redbook for converting a core dump into a proper file.)

It seems that from one midplane (512 nodes) upwards, LOADL_BG_SHAPE gives the topology in units of midplanes, so for instance for a 1024-node partition something like 1x2x1x1, which translates into 4x8x4x4x2 in the AxBxCxDxE format. Here is an example bash script that sets the mapping according to the value of LOADL_BG_SHAPE for a 1024-node partition and a 48^3x96 lattice with MPI process grid 16x4x4x4:

echo loadl shape is $LOADL_BG_SHAPE
# default mapping; each case below puts E and the doubled torus direction
# first, so that 2x8 = 16 nodes map onto the 16 MPI processes in T
export MP=EABCDT
case ${LOADL_BG_SHAPE} in
  2x1x1x1 )
    MP=EABCDT 
  ;;
  1x2x1x1 )
    MP=EBACDT
  ;;
  1x1x2x1 )
    MP=ECABDT
  ;;
  1x1x1x2 )
    MP=EDABCT
  ;;
esac
echo mapping is ${MP}
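
The chosen mapping would then be passed on to runjob together with the settings from the section above; the executable and input file are again placeholders:

runjob --mapping ${MP} --ranks-per-node 1 --np 1024 \
       --envs "MUSPI_NUMINJFIFOS=8" --envs "MUSPI_NUMRECFIFOS=8" \
       --envs "MUSPI_NUMBATIDS=2" --envs "OMP_NUM_THREADS=64" \
       : ./hmc_tm -f hmc.input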

Latest Results for tmLQCD BG/Q performance

The latest performance numbers are summarised in the following plot:

[Plot: performance as a function of the local lattice extent for the code variants listed below]

The performance is shown as a function of the local lattice extent, which is identical in all directions.

  • plain C is the original C version of the code
  • QPX is the QPX version with communication switched off
  • QPX+MPI is the QPX version with MPI communication
  • QPX+SPI is the QPX version with SPI communication
  • QPX+SPI+EABCDT is the QPX version with SPI communication and --mapping EABCDT

The best performance, 24% of peak, is reached with a 12^4 local lattice. All performance numbers are in double precision.

Scaling in the solver

Interestingly, at least with the clover term, the solver scales better than ideally when going from one midplane to one rack. (The quoted flop numbers are spurious because they do not include the computation of the clover term, i.e. the real performance is higher.)

midplane

CG: iter: 534 eps_sq: 1.0000e-22 t/s: 3.2464e+00
CG: flopcount (for e/o tmWilson only): t/s: 3.2464e+00 mflops_local: 11276.7 mflops: 5773675.5
Time for cloverdetratio1 monomial heatbath: 3.645461e+00 s

rack

CG: iter: 534 eps_sq: 1.0000e-22 t/s: 1.3440e+00
CG: flopcount (for e/o tmWilson only): t/s: 1.3440e+00 mflops_local: 13619.4 mflops: 13946272.4
Time for cloverdetratio1 monomial heatbath: 1.544599e+00 s
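
For reference, the CG wall-clock time drops from 3.2464 s on the midplane to 1.3440 s on the rack, a speed-up of about 2.4 from doubling the number of nodes, and the cloverdetratio1 heatbath time shows the same better-than-ideal behaviour (3.65 s versus 1.54 s).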