
This page records a few measurements I have been conducting with Scalasca on 512 nodes of BG/Q. The measurements refer to 5 volume-source inversions on a 48^3x96 configuration. Note, however, that this is for a very heavy quark mass, as the CG reaches a residue of O(1e-19) in only 808 iterations.

Scalasca measurements

Ignoring I/O and source preparation (which together amount to more than 25% of the total time; 12% alone is spent writing propagators!), we normalize the total time to the 74.33% spent in cg_her. Of this, 56% is spent applying Qtm_pm_psi, with the usual overheads of the hopping matrix. The remaining 44% is spent doing linear algebra.

Starting situation

  • 56% Qtm_pm_psi
  • 11.8% scalar_prod_r
  • 9.8% assign_add_mul_r
  • 12.2% assign_mul_add_r_and_square
  • 9.8% assign_add_mul_r (second call)
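
This breakdown is what one would expect from the structure of a CG iteration: one application of Qtm_pm_psi plus a handful of global linear-algebra kernels per iteration. As a point of reference, below is a minimal, self-contained CG sketch in plain C; the comments indicate which profiled routine each step presumably corresponds to. The helper apply_op and all interfaces here are illustrative stand-ins, not tmLQCD code.

```c
/* Hedged sketch of a CG iteration, only to illustrate how the cg_her
 * profile decomposes.  Not tmLQCD code; the mapping in the comments to
 * the profiled routines is my reading of the numbers above. */
#include <stdlib.h>

/* trivial stand-in for the operator applied by Qtm_pm_psi */
static void apply_op(double *out, const double *in, int N)
{
  for (int i = 0; i < N; ++i) out[i] = (2.0 + (double)(i % 7)) * in[i];
}

int cg_sketch(double *x, const double *b, int N, int max_iter, double eps_sq)
{
  double *r  = malloc(N * sizeof(double));
  double *p  = malloc(N * sizeof(double));
  double *ap = malloc(N * sizeof(double));
  double rr = 0.0;

  /* r = b - A x, p = r */
  apply_op(ap, x, N);
  for (int i = 0; i < N; ++i) { r[i] = b[i] - ap[i]; p[i] = r[i]; rr += r[i] * r[i]; }

  for (int it = 0; it < max_iter; ++it) {
    apply_op(ap, p, N);                                 /* Qtm_pm_psi: ~56% of cg_her        */

    double pro = 0.0;
    for (int i = 0; i < N; ++i) pro += p[i] * ap[i];    /* scalar_prod_r: global <p, A p>,   */
    double alpha = rr / pro;                            /* requires an MPI_Allreduce         */

    for (int i = 0; i < N; ++i) x[i] += alpha * p[i];   /* assign_add_mul_r (first call)     */
    for (int i = 0; i < N; ++i) r[i] -= alpha * ap[i];  /* assign_add_mul_r (second call)    */

    double rr_new = 0.0;
    for (int i = 0; i < N; ++i) rr_new += r[i] * r[i];  /* residual norm and direction       */
    if (rr_new < eps_sq) { free(r); free(p); free(ap); return it + 1; }
    double beta = rr_new / rr;                          /* update, presumably fused in       */
    for (int i = 0; i < N; ++i) p[i] = beta * p[i] + r[i]; /* assign_mul_add_r_and_square    */
    rr = rr_new;
  }

  free(r); free(p); free(ap);
  return -1;  /* not converged */
}
```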

In the linear algebra routines involving collectives, the time breaks down roughly as follows:

  • 20% waiting for MPI_Allreduce
  • 50% inside the parallel section doing the actual linear algebra
  • 30% outside the parallel section, which, as far as I understand, is usually a good measure of OpenMP overhead

The routines without collectives look similar, with slightly shifted percentages due to the absence of MPI_Allreduce.
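
To see where these three components come from, here is a minimal sketch, not tmLQCD's actual implementation, of a global real scalar product in the style of scalar_prod_r: the OpenMP reduction is the parallel section, the MPI_Allreduce is the collective, and the fork/join around the parallel region accounts for the time measured outside it.

```c
/* Hedged sketch of a global real scalar product in the style of
 * scalar_prod_r.  Names and interfaces are illustrative, not tmLQCD's. */
#include <mpi.h>

double global_scalar_prod(const double *s, const double *r, int N)
{
  double local = 0.0, global = 0.0;

  /* (1) local reduction over the node-local lattice: the parallel section */
#pragma omp parallel for reduction(+ : local)
  for (int i = 0; i < N; ++i) {
    local += s[i] * r[i];
  }

  /* (3) everything outside the parallel region (thread fork/join above,
   *     the collective below on the master thread): OpenMP overhead */

  /* (2) global sum over all MPI ranks */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  return global;
}
```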

A comparison with a pure-MPI run will be helpful in elucidating these points. In the following tests, aspects of the linear algebra routines will be modified and their effect quantified here.