WeeklyTelcon_20230425
- Dialup Info: (Do not post to public mailing list or public wiki)
- Jeff Squyres (Cisco)
- Brian Barrett (Amazon)
- David Bernholdt (ORNL)
- Donny Kruse
- Edgar Gabriel (AMD)
- Howard Pritchard (LANL)
- Joseph Schuchart (UTK)
- Ralph Castain
- Thomas Huber
- Thomas Naughton (ORNL)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
Current issue: https://github.com/openpmix/prrte/issues/1731
Summary of the existing problems below.
can't figure out which project they're intended for. Currently, they're converted to --pmixmca mca_* value.
Clarification of existing behavior: variables that are unknown are passed to all 3.
Proposal: have PMIx notice when variables are named mca_* and just pass them to PMIx, PRTE, and whatever the current schizo is (implementation TBD).
The existing --pmixmca ..., --prtemca ..., and --omca ... mechanisms still exist and work. This is just new handling for mca_* variables.
We are not translating the variable names in there. E.g., if they pass oob_tcp_blah in that file, we won't translate it to the new corresponding PRTE equivalent -- so they're being ignored. This is different than if someone passes oob_tcp_blah on the command line -- that is translated to the PRTE equivalent.
- Need to check --tuned -- but pretty sure those files are not translated, either.
Similarly, if users setenv MCA variables, those are not translated either.
Ralph proposes the following to handle the above 2 problems:
- Have the OMPI / OPAL layer read in the param/tuned files and shove everything into the environment. Then, later, when PMIx is initialized, we'll have to pass an attribute that says "this is an OMPI process" so that PMIx can see/react to the OMPI_MCA env vars and translate them to the appropriate PMIx and PRTE MCA params.
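A rough sketch of the first half of that proposal (read a param file, push its contents into the environment) might look like the following. This is illustrative only and is not OMPI / OPAL code. It assumes a simple "name = value" file format with # comments, assumes the OMPI_MCA_ prefix for the resulting env vars, and assumes that values already set in the environment should win over the file. The parameter names in the sample are placeholders, and the later PMIx-side translation is not shown.

```python
import os

# Assumption: MCA params are exported to the environment under this prefix.
ENV_PREFIX = "OMPI_MCA_"


def parse_param_file(text):
    """Parse 'name = value' lines, skipping blank lines and '#' comments."""
    params = {}
    for raw in text.splitlines():
        # Assumption: '#' starts a comment anywhere on the line.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


def export_to_env(params, env):
    """Copy file params into env; values already present in env are kept."""
    added = []
    for name, value in params.items():
        key = ENV_PREFIX + name
        if key not in env:
            env[key] = value
            added.append(key)
    return added


if __name__ == "__main__":
    # Hypothetical file contents; the names are placeholders only.
    sample = """
    # example param file
    oob_tcp_blah = 1
    btl = self,tcp
    """
    env = dict(os.environ)  # work on a copy, not the real environment
    for key in export_to_env(parse_param_file(sample), env):
        print(key, "=", env[key])
```

Note that names are exported untranslated here: under the proposal, the translation to PMIx / PRTE equivalents would happen later, once PMIx knows it is running in an OMPI process.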
In the old OPAL system, the command line parser was integrated into the MCA param system, thereby allowing the MCA param system to track the source of where the MCA param was set.
That linkage is now broken -- everything is now (translated into) an env var. We've now lost the source of where a variable was set.
Ralph needs to think about this -- don't have an immediate solution.
William brings up https://github.com/open-mpi/ompi/issues/7737
He's going to try to verify that this is an actual issue. If so, we'll discuss how to fix.
Edgar brings up https://github.com/open-mpi/ompi/pull/11529
NOTE: Think of coll/cuda as really coll/accelerator -- i.e., it's just a dispatch back to the accelerator framework, not a direct dispatch back to CUDA. coll/cuda is a legacy name.
Edgar's PR wants to make coll/cuda always compile/build (i.e., not depend on libcuda.so) because the current configure.m4 logic is incorrect for ROCM. Since coll/cuda really just dispatches off to the accelerator framework, it should always exist and rely on the accelerator framework to dispatch (or not) off to the correct back-end component (i.e., ROCM or CUDA).
However, coll/cuda has a very high priority, and if it's always built, we're penalizing environments where accelerator support was built but is not being used (e.g., no accelerator hardware is present).
Notes:
- the coll/cuda component will disqualify itself if there are no accelerator components available.
- the ROCM accelerator component disqualifies itself if no ROCM hardware is present. The CUDA accelerator component does not disqualify itself if no CUDA-capable hardware is present.
- the coll/cuda component currently only does reductions.
It seems like the right solution is to make the CUDA accelerator component disqualify itself if there is no CUDA-capable hardware available.
William/AWS will implement this change. He'll put it on Edgar's existing PR.
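To illustrate why the CUDA accelerator component's behavior matters for coll/cuda, here is a toy model of the selection logic discussed above. It is a simplification, not the actual MCA code: it picks a single highest-priority winner, whereas the real coll framework is more involved. The priority numbers, the use of coll/tuned as the fallback, and the cuda_checks_hw switch are all assumptions made up for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Component:
    name: str
    priority: int
    # Returns True if usable here; False means the component disqualifies itself.
    query: Callable[[], bool]


def select(components: List[Component]) -> Optional[Component]:
    """Return the highest-priority component whose query succeeds, or None."""
    usable = [c for c in components if c.query()]
    return max(usable, key=lambda c: c.priority, default=None)


def build_coll_components(cuda_hw: bool, rocm_hw: bool,
                          cuda_checks_hw: bool) -> List[Component]:
    """Build a toy coll component list; all priorities are hypothetical."""
    accel = [
        # ROCM accelerator: disqualifies itself when no ROCM hardware is present.
        Component("accelerator/rocm", 10, lambda: rocm_hw),
        # CUDA accelerator: only checks for hardware if cuda_checks_hw is set.
        Component("accelerator/cuda", 10,
                  lambda: cuda_hw or not cuda_checks_hw),
    ]
    accel_available = select(accel) is not None
    return [
        # coll/cuda: very high priority; disqualifies itself only when no
        # accelerator component is available.
        Component("coll/cuda", 100, lambda: accel_available),
        Component("coll/tuned", 30, lambda: True),
    ]


if __name__ == "__main__":
    for checks in (False, True):
        winner = select(build_coll_components(False, False, checks))
        print("cuda_checks_hw =", checks, "->", winner.name)
```

In this model, on a machine with no accelerator hardware, coll/cuda still wins while the CUDA accelerator component does not check for hardware (cuda_checks_hw=False). Once it does check (cuda_checks_hw=True), no accelerator component is available, coll/cuda disqualifies itself, and the lower-priority component is selected.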