
This page records a few measurements I have been conducting with Scalasca on 512 nodes of BG/Q. The measurements refer to 5 volume-source inversions on a 48^3x96 configuration. Note, however, that this is for a very heavy quark mass, as the CG reaches a residue of O(1e-19) in only 808 iterations.

Scalasca measurements

Ignoring I/O and source preparation (which together amount to more than 25% of the total time; 12% alone is spent writing propagators!), we normalize the total time to the 74.33% spent in cg_her. Of this, 56% is spent applying Qtm_pm_psi, with the usual overheads of the hopping matrix. The remaining 44% is spent doing linear algebra.

Starting situation

  • 56% Qtm_pm_psi
  • 11.8% scalar_prod_r
  • 9.8% assign_add_mul_r
  • 12.2% assign_mul_add_r_and_square
  • 9.8% assign_add_mul_r (second call)
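
This breakdown is what one would expect from the structure of a CG iteration: one application of Qtm_pm_psi plus a handful of global linear-algebra kernels per iteration. As a point of reference, below is a minimal, self-contained CG sketch in plain C; the comments indicate which profiled routine each step presumably corresponds to. The helper apply_op and all interfaces here are illustrative stand-ins, not tmLQCD code.

```c
/* Hedged sketch of a CG iteration, only to illustrate how the cg_her
 * profile decomposes.  Not tmLQCD code; the mapping in the comments to
 * the profiled routines is my reading of the numbers above. */
#include <stdlib.h>

/* trivial stand-in for the operator applied by Qtm_pm_psi */
static void apply_op(double *out, const double *in, int N)
{
  for (int i = 0; i < N; ++i) out[i] = (2.0 + (double)(i % 7)) * in[i];
}

int cg_sketch(double *x, const double *b, int N, int max_iter, double eps_sq)
{
  double *r  = malloc(N * sizeof(double));
  double *p  = malloc(N * sizeof(double));
  double *ap = malloc(N * sizeof(double));
  double rr = 0.0;

  /* r = b - A x, p = r */
  apply_op(ap, x, N);
  for (int i = 0; i < N; ++i) { r[i] = b[i] - ap[i]; p[i] = r[i]; rr += r[i] * r[i]; }

  for (int it = 0; it < max_iter; ++it) {
    apply_op(ap, p, N);                                 /* Qtm_pm_psi: ~56% of cg_her        */

    double pro = 0.0;
    for (int i = 0; i < N; ++i) pro += p[i] * ap[i];    /* scalar_prod_r: global <p, A p>,   */
    double alpha = rr / pro;                            /* requires an MPI_Allreduce         */

    for (int i = 0; i < N; ++i) x[i] += alpha * p[i];   /* assign_add_mul_r (first call)     */
    for (int i = 0; i < N; ++i) r[i] -= alpha * ap[i];  /* assign_add_mul_r (second call)    */

    double rr_new = 0.0;
    for (int i = 0; i < N; ++i) rr_new += r[i] * r[i];  /* residual norm and direction       */
    if (rr_new < eps_sq) { free(r); free(p); free(ap); return it + 1; }
    double beta = rr_new / rr;                          /* update, presumably fused in       */
    for (int i = 0; i < N; ++i) p[i] = beta * p[i] + r[i]; /* assign_mul_add_r_and_square    */
    rr = rr_new;
  }

  free(r); free(p); free(ap);
  return -1;  /* not converged */
}
```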

In the linear algebra routines involving collectives, the time breaks down roughly as follows:

  • 20% waiting for MPI_Allreduce
  • 50% inside the parallel section doing the actual linear algebra
  • 30% outside the parallel section, which, as far as I understand, is usually a good measure of OpenMP overhead

The routines without collectives look similar, with slightly shifted percentages due to the absence of MPI_Allreduce.
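
To see where these three components come from, here is a minimal sketch, not tmLQCD's actual implementation, of a global real scalar product in the style of scalar_prod_r: the OpenMP reduction is the parallel section, the MPI_Allreduce is the collective, and the fork/join around the parallel region accounts for the time measured outside it.

```c
/* Hedged sketch of a global real scalar product in the style of
 * scalar_prod_r.  Names and interfaces are illustrative, not tmLQCD's. */
#include <mpi.h>

double global_scalar_prod(const double *s, const double *r, int N)
{
  double local = 0.0, global = 0.0;

  /* (1) local reduction over the node-local lattice: the parallel section */
#pragma omp parallel for reduction(+ : local)
  for (int i = 0; i < N; ++i) {
    local += s[i] * r[i];
  }

  /* (3) everything outside the parallel region (thread fork/join above,
   *     the collective below on the master thread): OpenMP overhead */

  /* (2) global sum over all MPI ranks */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  return global;
}
```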

A comparison with a pure-MPI run will be helpful in elucidating these points. In the following tests, aspects of the linear algebra routines will be modified and their effect quantified here.