Webex affinity discussions 2019 09
September 19, 2019.
Attendees:
- Jeff Squyres
- Ralph Castain
- George Bosilca
- Howard Pritchard
- Geoffray Vallee
- Brice Goglin
- Josh Hursey
- Geoffrey Paulsen
- Mark Allen
- ...I'm sure there were others there; please fill yourselves in...
Let's discuss mpirun's mapping and binding options.
Per https://github.com/open-mpi/ompi/issues/6966 and https://github.com/open-mpi/ompi/pull/6755, it's not entirely clear to me that we understand what our CLI options are for mapping and binding, and, more importantly, what we want them to do. I think it would be good to have a conversation about:
- What are the CLI options that we have right now?
- What do we think these CLI options do?
- What functionality do we want moving forward (perhaps >=v5.0)?
- How do we expose this functionality through CLI options / MCA params / config files / etc.?
I think the basics seem to be covered well/work well (e.g., --map-by X and --bind-to Y). But we have lots of other options that a) aren't well documented, b) perhaps aren't tested well, and c) may no longer do what we think they should do (e.g., perhaps they have bit-rotted).
Regardless, I think we really need a mapping/binding section of the FAQ.
We generally agreed that the context of the discussion is for >=v5.0.
We made a list of all of mpirun's affinity options (i.e., we ran mpirun --help all and manually picked out the affinity options). We then talked through each one and decided things like:
- Is this option necessary / do we want this option going forward?
- Is this option legacy?
- Is this option just a synonym for another option?
This generated a lot of discussion.
I did not take good notes of the overall discussion, but some of the highlights / notable secondary items that came out were:
- PRRTE will not be using OPAL. There hasn't been any movement on the git submodules work for months, and PRRTE needs to move ahead. It was really only using a small portion of OPAL, anyway, so copying/re-implementing/whatever the necessary bits wasn't a huge deal.
- Ralph has been actively working towards PRRTE, and anticipates eventually rm -rf ...'ing much of ORTE (as per many prior discussions).
  - Specifically: we're looking at PRRTE for Open MPI v5.0.x
- For https://github.com/open-mpi/ompi/issues/6966 ("--host, binding and cpuset does not seem to work"), we should probably fix these.
  - This issue was reported by a user.
- When discussing affinity options in Open MPI:
  - "Overloading" refers to binding
  - "Oversubscribing" refers to mapping
- For https://github.com/open-mpi/ompi/pull/6755 ("restoring more hwloc --cpu-set behavior from OMPI 3.x"), this may or may not be worth it.
  - This issue was found by IBM testing. We're not sure if any user has run into this.
  - This is really two issues:
    - Make --report-bindings show "whole system"-like behavior. E.g., don't just show the software envelope of PEs -- show all PEs for the machine. We do agree, however, that showing "." (a dot) for each unused PE would be confusing to the user. The current PR shows "~" for PEs outside the software envelope. It might be nice to distinguish which PEs are outside the OMPI software envelope and show PEs that are on the system but are unavailable to OMPI (e.g., disabled and/or outside the OS cgroup).
      - Mark will work on this.
    - ...I forget the 2nd issue.
  - Specifically:
    - It may be worth fixing on the v4.0.x branch.
    - But it may not be worth fixing on master, because master's mpirun is going to be replaced with PRRTE, potentially within the next few months.
For mpirun, we came to the following conclusions:
These are the primary 4 mpirun options we care about / want to have going forward. There are two sets:
- --map-by, --bind-to, and --rank-by
- --rankfile
You can use one set or the other. It is an error to specify options from more than one of those two sets.
Between these two sets, users can do whatever they want. Specifically: we have a lot of pre-defined patterns built into the 1st set that cover many common scenarios. But if a user wants to do something else, they can use a rankfile and precisely specify exactly what they want.
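The "one set or the other" rule could be checked along the following lines. This is only a sketch: check_option_sets and the two set names are hypothetical and do not come from the Open MPI or PRRTE code base, and the sketch looks only at option names, not at their arguments.

```python
# Hypothetical sketch; none of these names exist in Open MPI / PRRTE.
PATTERN_OPTIONS = {"--map-by", "--bind-to", "--rank-by"}
RANKFILE_OPTIONS = {"--rankfile"}


def check_option_sets(argv):
    """Return which option set is in use; raise if both sets are mixed."""
    # Only double-dash tokens are treated as option names here.
    used = {arg for arg in argv if arg.startswith("--")}
    uses_patterns = bool(used & PATTERN_OPTIONS)
    uses_rankfile = bool(used & RANKFILE_OPTIONS)
    if uses_patterns and uses_rankfile:
        raise ValueError(
            "cannot combine --rankfile with --map-by/--bind-to/--rank-by"
        )
    if uses_rankfile:
        return "rankfile"
    if uses_patterns:
        return "patterns"
    return "defaults"
```

For example, check_option_sets(["--map-by", "core", "--rankfile", "myfile"]) raises ValueError, while passing only --rankfile returns "rankfile".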
Note: most other options in Open MPI master (as of 2019-09) / v4.0.x are synonyms of --map-by, --bind-to, and --rank-by (there are a very, very small number of options that are not exact synonyms to some combination of these three options -- e.g., --). They aren't currently well documented, but by design (and implementation), the vast majority of them are simply synonyms to some value of mapping / binding / ranking. We propose to drop most of these legacy options in PRRTE / Open MPI v5.0, and just have users use the main two sets of options listed above.
A small number of these synonyms may remain if they have
Let's NOT support the "single dash" versions of these options. Let's please only support the "double dash" versions (i.e., don't support -foo -- only support --foo).
- --map-by <arg0>
  - <arg0> can be:
    - slot
    - hwthread
    - core
    - socket
    - numa
    - board
    - node
    - ppr:X:RESOURCE
  - Default value:
    - ...it's complicated. Need to check and see what the default is (it might be different in different situations...?).
  - Modifiers:
    - pe=X: map each process to X PEs
    - NEW/PRRTE pe-list=LIST: list of available PEs
      - This is intended to replace --cpu-list and --cpu-set
    - span: treat all available PEs as a single giant node
    - oversubscribe
    - nooversubscribe
- --bind-to <arg0>
  - <arg0> can be:
    - none
    - hwthread
    - core
    - l1cache
    - l2cache
    - l3cache
    - socket
    - numa
    - board
    - cpu-list
  - Default value:
    - "none" is the default when oversubscribed
    - "core" is the default when np<=2
    - "socket" is the default when np>2
  - Modifiers:
    - DEPRECATE/PRRTE overload-allowed (i.e., let's not bring this forward to PRRTE / Open MPI v5.0)
    - NEW/PRRTE overload
    - NEW/PRRTE nooverload
    - if-supported
    - DELETE/PRRTE ordered: delete this because the option is recognized, but does not seem to do anything in master/Open MPI v4.0.x
- --rank-by <arg0>
  - <arg0> can be:
    - slot
    - hwthread
    - core
    - socket
    - numa
    - board
    - node
  - Default value:
    - slot
  - Modifiers:
    - span: treat all available PEs as a single giant node
    - fill: ...? Need to look up what this does
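As a reading aid, the --bind-to default rule listed above could be written as a small function. The name default_bind_to and its parameters are hypothetical (they are not Open MPI or PRRTE identifiers), and how "oversubscribed" gets determined is deliberately left to the caller.

```python
def default_bind_to(np, oversubscribed):
    """Pick the default --bind-to value per the rule in these notes.

    np: number of processes being launched.
    oversubscribed: True if the job is oversubscribed (decided elsewhere).
    """
    if oversubscribed:
        return "none"
    # Not oversubscribed: small jobs bind to core, larger ones to socket.
    return "core" if np <= 2 else "socket"
```

The oversubscribed check comes first on the assumption that it takes precedence over the np-based rule; the notes do not state the order explicitly.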
If only one of --map-by, --bind-to, or --rank-by is specified, the other two will default to the same value (e.g., --map-by core == --bind-to core == --rank-by core == --map-by core --bind-to core --rank-by core).
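A minimal sketch of this "fill in the other two" rule, assuming the three options have already been parsed into optional strings. The function fill_defaults is hypothetical. The notes only cover the case where exactly one option is given, so in every other case the sketch returns the inputs unchanged. It also does not check that the value is valid for all three options (for example, the --bind-to value list has no slot or node entry).

```python
def fill_defaults(map_by=None, bind_to=None, rank_by=None):
    """If exactly one of the three options is set, copy it to the other two."""
    options = {"map_by": map_by, "bind_to": bind_to, "rank_by": rank_by}
    given = [value for value in options.values() if value is not None]
    if len(given) == 1:
        # Exactly one option was specified: the other two follow it.
        return {name: given[0] for name in options}
    # Zero, two, or three options given: not covered by the notes, left as-is.
    return options
```

With this sketch, fill_defaults(map_by="core") and fill_defaults(bind_to="core") both yield core for all three options, matching the equivalence stated above.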
NOTE: The rankfile option is spelled --rankfile, not -rankfile (don't support a "single dash" version!).
This option just takes a single argument: a filename.
NEW for PRRTE: The rankfile must specify every single process. It is an error if you run with N processes and only specify M processes in the rankfile (where M < N).
This new stipulation (where every single process must be specified in the rankfile) will significantly simplify the rankfile code.
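The completeness requirement could be enforced with a check like the one below. This is a sketch under assumptions: check_rankfile_coverage is a hypothetical name, the rankfile is assumed to have already been parsed into a collection of rank numbers (no file format is parsed here), and ranks are assumed to be numbered 0 through N-1. Extra entries beyond N-1 are ignored, since the notes do not say how to treat them.

```python
def check_rankfile_coverage(num_procs, ranks_in_file):
    """Raise ValueError unless every rank 0..num_procs-1 is listed."""
    expected = set(range(num_procs))
    missing = sorted(expected - set(ranks_in_file))
    if missing:
        raise ValueError(
            f"rankfile covers {num_procs - len(missing)} of {num_procs} "
            f"processes; missing ranks: {missing}"
        )
```

Requiring full coverage means the launcher never has to decide where to place processes the rankfile left out, which is presumably where the code simplification comes from.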
The legacy options listed below are mostly synonyms of --map-by / --bind-to / --rank-by and will likely not be supported in PRRTE / Open MPI v5.0. We might keep some of them if there's a good reason to (e.g., if an option is a synonym for multiple --map-by / --bind-to / --rank-by options and would be cumbersome to type on the command line).
NOTE: If any of these options are preserved for PRRTE / Open MPI v5.0, let's NOT support the "single dash" versions of these options. Let's please only support the "double dash" versions (i.e., don't support -foo -- only support --foo).
- -use-hwthread-cpus|--use-hwthread-cpus
  - According to the code, it might be a synonym for --bind-to hwthread. But we're not sure of this -- it feels like it is supposed to have wider implications.
  - More investigation is needed for this option:
    - What does it currently do?
    - What do we think it should do?
    - Do we want to carry this option forward in PRRTE / Open MPI v5.0?
- -cpu-set|--cpu-set and -cpu-list|--cpu-list
  - Currently, these two options (in v3.0, v3.1, master, and v4.0) seem to do exactly the same thing. But we're pretty sure they're not supposed to. :frown:
  - Perhaps we should ditch these two and replace them with a new PRRTE pe-list modifier for --map-by
- -bind-to-core|--bind-to-core
- -bind-to-socket|--bind-to-socket
  - Synonym for --bind-to X
- -bycore|--bycore
- -bynode|--bynode
- -byslot|--byslot
  - Synonym for --map-by X --rank-by X
- -oversubscribe|--oversubscribe
- -nooversubscribe|--nooversubscribe
  - Synonym for --map-by modifier oversubscribe / nooversubscribe
- -cpus-per-proc|--cpus-per-proc
- -cpus-per-rank|--cpus-per-rank
  - Synonym for the --map-by pe option
- -npernode|--npernode
- -npersocket|--npersocket
- -pernode|--pernode
- --ppr
  - Synonym for the --map-by ppr option
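To make the "legacy options are just synonyms" point concrete, here is a sketch of a translation table for the synonyms that the list above spells out in full (the --bind-to X and --map-by X --rank-by X cases). The table LEGACY_SYNONYMS and the function translate_legacy are hypothetical and are not part of mpirun. Options that take an argument or map onto a modifier (oversubscribe, pe, ppr) are left out because the notes do not give their exact replacement syntax.

```python
# Hypothetical migration table; only fully specified synonyms are included.
LEGACY_SYNONYMS = {
    "--bind-to-core": ["--bind-to", "core"],
    "--bind-to-socket": ["--bind-to", "socket"],
    "--bycore": ["--map-by", "core", "--rank-by", "core"],
    "--bynode": ["--map-by", "node", "--rank-by", "node"],
    "--byslot": ["--map-by", "slot", "--rank-by", "slot"],
}


def translate_legacy(option):
    """Return the replacement arguments for a legacy option, or None."""
    # Accept the legacy single-dash spelling as input (e.g., -bycore).
    key = option if option.startswith("--") else "-" + option
    replacement = LEGACY_SYNONYMS.get(key)
    # Return a copy so callers cannot mutate the table.
    return list(replacement) if replacement is not None else None
```

For example, translate_legacy("-bycore") gives ["--map-by", "core", "--rank-by", "core"], and an option not in the table returns None.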