
Webex affinity discussions 2019 09

Jeff Squyres edited this page Sep 21, 2019 · 2 revisions

Webex to discuss affinity options

September 19, 2019.

Attendees:

  • Jeff Squyres
  • Ralph Castain
  • George Bosilca
  • Howard Pritchard
  • Geoffray Vallee
  • Brice Goglin
  • Josh Hursey
  • Geoffrey Paulsen
  • Mark Allen
  • ...I'm sure there were others there; please fill yourselves in...

Premise of the discussion

Let's discuss mpirun's mapping and binding options.

Per https://github.com/open-mpi/ompi/issues/6966 and https://github.com/open-mpi/ompi/pull/6755, it's not entirely clear to me that we understand what our CLI options are for mapping and binding, and, more importantly, what we want them to do. I think it would be good to have a conversation about:

  1. What are the CLI options that we have right now?
    • What do we think these CLI options do?
  2. What functionality do we want moving forward (perhaps >=v5.0)?
    • How do we expose this functionality through CLI options / MCA params / config files / etc.?

I think the basics are covered well and work well (e.g., --map-by X and --bind-to Y). But we have lots of other options that a) aren't well documented, b) perhaps aren't tested well, and c) may no longer do what we think they should do (e.g., perhaps they have bit-rotted).

Regardless, I think we really need a mapping/binding section of the FAQ.

Discussion

We generally agreed that the context of the discussion is for >=v5.0.

We made a list of all of mpirun's affinity options (i.e., we ran mpirun --help all and manually picked out the affinity options). We then talked through each one and decided things like:

  • Is this option necessary / do we want this option going forward?
  • Is this option legacy?
  • Is this option just a synonym for another option?

This generated a lot of discussion.

Random meeting notes

I did not take good notes of the overall discussion, but some of the highlights / notable secondary items that came out were:

  1. PRRTE will not be using OPAL. There hasn't been any movement on the git submodules work for months, and PRRTE needs to move ahead. It was really only using a small portion of OPAL, anyway, so copying/re-implementing/whatever the necessary bits wasn't a huge deal.
  2. Ralph has been actively working towards PRRTE, and anticipates eventually rm -rf ...'ing much of ORTE (as per many prior discussions).
    • Specifically: we're looking at PRRTE for Open MPI v5.0.x
  3. For https://github.com/open-mpi/ompi/issues/6966 ("--host, binding and cpuset does not seem to work"), we should probably fix these.
    • This issue was reported by a user.
  4. When discussing affinity options in Open MPI:
    • "Overloading" refers to binding
    • "Oversubscribing" refers to mapping
  5. For https://github.com/open-mpi/ompi/pull/6755 ("restoring more hwloc --cpu-set behavior from OMPI 3.x"), this may or may not be worth it.
    • This issue was found by IBM testing. We're not sure if any user has run into this.
    • This is really two issues:
      • Make --report-bindings show "whole system"-like behavior. E.g., don't just show the software envelope of PEs -- show all PEs for the machine. We do agree, however, that showing . (a dot) for each unused PE would be confusing to the user. The current PR shows ~ for PEs outside the software envelope. It might be nice to distinguish which PEs are outside the OMPI software envelope and show PEs that are on the system but are unavailable to OMPI (e.g., disabled and/or outside the OS cgroup).
        • Mark will work on this.
      • ...I forget the 2nd issue.
    • Specifically:
      • It may be worth fixing on the v4.0.x branch.
      • But it may not be worth fixing on master, because master's mpirun is going to be replaced with PRRTE, potentially within the next few months.

Conclusions

For mpirun, we came to the following conclusions:

These are the primary 4 mpirun options we care about / want to have going forward. There are two sets:

  1. --map-by, --bind-to, and --rank-by
  2. --rankfile

You can use one set or the other. It is an error to specify options from more than one of those two sets.
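As a rough illustration of the "one set or the other" rule, here is a minimal Python sketch. It is not PRRTE code: the function name check_option_sets and the idea of scanning a flat argument list are assumptions made purely for this example.

```python
# Hypothetical sketch of the "one set or the other" rule from these notes.
# The option names come from the notes; everything else is illustrative.
SET_1 = {"--map-by", "--bind-to", "--rank-by"}
SET_2 = {"--rankfile"}


def check_option_sets(argv):
    """Return 1 or 2 for the option set in use, or None if neither is used.

    Raises ValueError if options from both sets appear together.
    """
    used = set(argv)
    uses_set_1 = bool(used & SET_1)
    uses_set_2 = bool(used & SET_2)
    if uses_set_1 and uses_set_2:
        raise ValueError(
            "--rankfile cannot be combined with --map-by/--bind-to/--rank-by"
        )
    if uses_set_1:
        return 1
    if uses_set_2:
        return 2
    return None
```

For example, check_option_sets(["--map-by", "core", "--bind-to", "core"]) returns 1, while mixing --rankfile with --bind-to raises an error.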

Between these two sets, users can do whatever they want. Specifically: we have a lot of pre-defined patterns built into the 1st set that cover many common scenarios. But if a user wants to do something else, they can use a rankfile and precisely specify exactly what they want.

Note: most other options in Open MPI master (as of 2019-09) / v4.0.x are synonyms of --map-by, --bind-to, and --rank-by (there are a very, very small number of options that are not exact synonyms to some combination of these three options -- e.g., --). They aren't currently well documented, but by design (and implementation), the vast majority of them are simply synonyms to some value of mapping / binding / ranking. We propose to drop most of these legacy options in PRRTE / Open MPI v5.0, and just have users use the main two sets of options listed above.

A small number of these synonyms may remain if they have

Set 1: --map-by, --bind-to, --rank-by

Let's NOT support the "single dash" versions of these options. Let's please only support the "double dash" versions (i.e., don't support -foo -- only support --foo).

  1. --map-by <arg0>
    • <arg0> can be:
      • slot
      • hwthread
      • core
      • socket
      • numa
      • board
      • node
      • ppr:X:RESOURCE
    • Default value:
      • ...it's complicated. Need to check and see what the default is (it might be different in different situations...?).
    • Modifiers:
      • pe=X: map each process to X PEs
      • NEW/PRRTE pe-list=LIST: list of available PEs
        • This is intended to replace --cpu-list and --cpu-set
      • span: treat all available PEs as a single giant node
      • oversubscribe
      • nooversubscribe
  2. --bind-to <arg0>
    • <arg0> can be:
      • none
      • hwthread
      • core
      • l1cache
      • l2cache
      • l3cache
      • socket
      • numa
      • board
      • cpu-list
    • Default value:
      • "none" is the default when oversubscribed
      • "core" is the default when np<=2,
      • "socket" is the default when np>2
    • Modifiers:
      • DEPRECATE/PRRTE overload-allowed (i.e., let's not bring this forward to PRRTE / Open MPI v5.0)
      • NEW/PRRTE overload
      • NEW/PRRTE nooverload
      • if-supported
      • DELETE/PRRTE ordered: delete this because the option is recognized, but does not seem to do anything in master/Open MPI v4.0.x
  3. --rank-by <arg0>
    • <arg0> can be:
      • slot
      • hwthread
      • core
      • socket
      • numa
      • board
      • node
    • Default value:
      • slot
    • Modifiers:
      • span: treat all available PEs as a single giant node
      • fill: ...? Need to look up what this does
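The --bind-to default listed above can be written down as a tiny function. This is only a sketch of the rule as recorded in these notes, not the actual mpirun logic; the function name default_bind_to and the oversubscribed flag are hypothetical.

```python
def default_bind_to(np, oversubscribed=False):
    """Default --bind-to value for np processes, per the list above.

    `oversubscribed` is an illustrative flag; how the launcher actually
    detects oversubscription is outside the scope of this sketch.
    """
    if oversubscribed:
        return "none"
    if np <= 2:
        return "core"
    return "socket"
```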

If only one of --map-by|bind-to|rank-by is specified, the other two will default to the same value (e.g., --map-by core == --bind-to core == --rank-by core == --map-by core --bind-to core --rank-by core).
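Here is a minimal sketch of that defaulting rule. The function name resolve_policies is made up for illustration. The notes do not say what happens when two of the three options are given, or when the single value is not valid for the other options (e.g., l2cache is only listed for --bind-to), so the sketch leaves those cases unresolved.

```python
def resolve_policies(map_by=None, bind_to=None, rank_by=None):
    """Apply the "one given, other two follow" rule from the notes.

    Returns a dict with keys map_by, bind_to, rank_by. Entries stay None
    when the notes do not say how they should be filled in.
    """
    given = [v for v in (map_by, bind_to, rank_by) if v is not None]
    if len(given) == 1:
        # Exactly one option was specified: the other two take its value.
        value = given[0]
        return {"map_by": value, "bind_to": value, "rank_by": value}
    # Zero, two, or three options specified: pass through unchanged.
    return {"map_by": map_by, "bind_to": bind_to, "rank_by": rank_by}
```

So resolve_policies(map_by="core") yields core for all three, matching the --map-by core example above.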

Set 2: --rankfile

NOTE: This is --rankfile, not -rankfile (don't support a "single dash" version!).

This option just takes a single argument: a filename.

NEW for PRRTE: The rankfile must specify every single process. It is an error if you run with N processes and only specify M processes in the rankfile (where M < N).

This new stipulation (where every single process must be specified in the rankfile) will significantly simplify the rankfile code.
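A completeness check along these lines could look like the sketch below. It assumes rankfile lines of the form "rank N=host slot=..." as in existing Open MPI rankfiles; whether PRRTE keeps that exact syntax is an assumption, and the function name missing_ranks is hypothetical.

```python
import re

# Assumed line format: "rank <N>=<host> slot=<...>". Only the rank number
# is inspected here; everything after "=" is ignored by this sketch.
RANK_LINE = re.compile(r"^\s*rank\s+(\d+)\s*=")


def missing_ranks(rankfile_text, np):
    """Return the sorted list of ranks in [0, np) that the rankfile omits."""
    specified = set()
    for line in rankfile_text.splitlines():
        match = RANK_LINE.match(line)
        if match:
            specified.add(int(match.group(1)))
    return sorted(set(range(np)) - specified)
```

Under the new rule, a non-empty result would be treated as an error rather than being filled in by some default placement.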

Deprecated options that (likely) will not be supported in PRRTE / Open MPI v5.0

These options are mostly synonyms of --map-by|bind-to|rank-by and will likely not be supported in PRRTE / Open MPI v5.0. We might keep some of them if there's a good reason to (e.g., if an option is a synonym for multiple --map-by|bind-to|rank-by options and would be cumbersome to type on the command line).

NOTE: If any of these options are preserved for PRRTE / Open MPI v5.0, let's NOT support the "single dash" versions of these options. Let's please only support the "double dash" versions (i.e., don't support -foo -- only support --foo).

  • -use-hwthread-cpus|--use-hwthread-cpus
    • According to the code, it might be a synonym for --bind-to hwthread. But we're not sure of this -- it feels like it is supposed to have wider implications.
    • More investigation is needed for this option:
      • What does it currently do?
      • What do we think it should do?
      • Do we want to carry this option forward in PRRTE / Open MPI v5.0?
  • -cpu-set|--cpu-set and -cpu-list|--cpu-list
    • Currently, these two options (in v3.0, v3.1, master, and v4.0) seem to do exactly the same thing. But we're pretty sure they're not supposed to. :frown:
    • Perhaps we should ditch these two and replace them with a new PRRTE pe-list modifier for --map-by
  • -bind-to-core|--bind-to-core
  • -bind-to-socket|--bind-to-socket
    • Synonym for --bind-to X
  • -bycore|--bycore
  • -bynode|--bynode
  • -byslot|--byslot
    • Synonym for --map-by X --rank-by X
  • -oversubscribe|--oversubscribe
  • -nooversubscribe|--nooversubscribe
    • Synonym for --map-by modifier oversubscribe / nooversubscribe
  • -cpus-per-proc|--cpus-per-proc
  • -cpus-per-rank|--cpus-per-rank
    • Synonym for --map-by pe option
  • -npernode|--npernode
  • -npersocket|--npersocket
  • -pernode|--pernode
  • --ppr
    • Synonym for --map-by ppr option
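
For the argument-free options in this list, the synonym relationships can be captured in a small lookup table. This is only a sketch of how a deprecation shim might be organized: the table layout, the key names (map_by, bind_to, rank_by, map_by_modifiers), and the function translate_legacy are all hypothetical, and options that take a value (e.g., --npernode) are left out.

```python
# Legacy option -> equivalent settings, per the synonym notes above.
LEGACY_SYNONYMS = {
    "--bind-to-core": {"bind_to": "core"},
    "--bind-to-socket": {"bind_to": "socket"},
    "--bycore": {"map_by": "core", "rank_by": "core"},
    "--bynode": {"map_by": "node", "rank_by": "node"},
    "--byslot": {"map_by": "slot", "rank_by": "slot"},
    "--oversubscribe": {"map_by_modifiers": ["oversubscribe"]},
    "--nooversubscribe": {"map_by_modifiers": ["nooversubscribe"]},
}


def translate_legacy(option):
    """Return the equivalent settings for a legacy option, or None.

    Single-dash spellings (e.g., "-bycore") are normalized to the
    double-dash form before the lookup.
    """
    if option.startswith("-") and not option.startswith("--"):
        option = "-" + option
    return LEGACY_SYNONYMS.get(option)
```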