WeeklyTelcon_20220215
Geoffrey Paulsen edited this page Mar 4, 2022 · 1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
Failed to capture.
- Geoffrey Paulsen (IBM)
- Austen Lauria (IBM)
- Jeff Squyres (Cisco)
- Brendan Cunningham (Cornelis Networks)
- Brian Barrett (AWS)
- Christoph Niethammer (HLRS)
- David Bernhold (ORNL)
- Hessam Mirsadeghi (UCX/nVidia)
- Howard Pritchard (LANL)
- Josh Hursey (IBM)
- Thomas Naughton (ORNL)
- Todd Kordenbrock (Sandia)
- Tomislav Janjusic (nVidia)
- William Zhang (AWS)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Edgar Gabriel (UoH)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Harumi Kuno (HPE)
- Joseph Schuchart
- Joshua Ladd (nVidia)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja (AWS)
- Ralph Castain (Intel)
- Sam Gutierrez (LLNL)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Xin Zhao (nVidia)
- Two HWLOC issues
- PRRTE/PMIx hwloc issue: https://github.com/openpmix/prrte/pull/1185 and https://github.com/openpmix/openpmix/pull/2445
- When hwloc is built with CUDA support, it hard-links against the CUDA libraries.
- This doesn't work in the common case where CUDA isn't installed on login nodes.
- hwloc v2.5 through v2.7.0 puts variables that live in read-only memory into `environ`, but PRRTE tries to modify them and segfaults.
- PMIx and PRRTE have block-listed hwloc versions 2.5 through 2.7.0.
- putstr(env) is segv-ing.
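The crash pattern described above (an `environ` entry that points into read-only memory being modified in place) can be avoided by never writing through pointers obtained from the environment. The following is a minimal sketch of that idea, not the fix that PMIx or PRRTE actually adopted; `env_append` is a hypothetical helper name. It relies only on POSIX `getenv` and `setenv`, where `setenv` stores its own copy of the value.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): append `suffix` to the value of
 * environment variable `name` without writing to the existing entry.
 * The current value is only read, so it does not matter whether the string
 * behind it lives in read-only memory.
 * Returns 0 on success, -1 on failure.
 */
int env_append(const char *name, const char *suffix)
{
    assert(name != NULL && suffix != NULL);

    const char *old = getenv(name);          /* may point into read-only memory */
    size_t old_len = (old != NULL) ? strlen(old) : 0;

    char *buf = malloc(old_len + strlen(suffix) + 1);
    if (buf == NULL) {
        return -1;
    }
    if (old != NULL) {
        memcpy(buf, old, old_len);           /* read the old value, never write it */
    }
    strcpy(buf + old_len, suffix);

    int rc = setenv(name, buf, 1);           /* setenv copies buf into the environment */
    free(buf);
    return rc;
}
```

A caller would use `env_append("SOME_VAR", ":extra")` instead of editing the string returned by `getenv` or an element of `environ` directly.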
- Discussions about minimizing mpirun/mpicc to only link against subset of opal.
- Makes things slightly better, but not really. We still have CUDA on some nodes and not on others.
- Projected solution is to use hwloc plugins (dlopen cuda libs)
- A while back, hwloc changed its default to NOT load components as plugins.
- He did this for Open MPI (some cyclic dependencies).
- This is no longer an issue for us.
- Now hwloc has reasonable defaults for some things built as plugins (dlopened at runtime).
- Usually customers install in local filesystems.
- This gets us around the dependencies.
- So whenever this is actually fixed, Jeff will write docs, and we can touch on points.
- Regarding Josh's hwloc PR: if there are any other suggestions or modifications, please put them on the hwloc PR.
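The plugin approach discussed above depends on loading the CUDA libraries at runtime instead of linking against them, so that a binary still starts on nodes where CUDA is not installed. Below is a minimal sketch of that pattern using POSIX `dlopen`. It is not hwloc's plugin code; `load_optional_lib` and `optional_lib_has_symbol` are hypothetical helper names.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/*
 * Hypothetical helper (illustration only): try to load an optional shared
 * library at runtime. Returns a handle, or NULL if the library is not
 * installed on this node. A NULL result is an expected outcome, not an error.
 */
void *load_optional_lib(const char *soname)
{
    assert(soname != NULL);
    return dlopen(soname, RTLD_NOW | RTLD_LOCAL);
}

/*
 * Hypothetical helper: report whether `symbol` can be resolved in a library
 * loaded with load_optional_lib(). Returns 1 if found, 0 otherwise.
 */
int optional_lib_has_symbol(void *handle, const char *symbol)
{
    if (handle == NULL || symbol == NULL) {
        return 0;
    }
    dlerror();                    /* clear any stale error state */
    (void)dlsym(handle, symbol);
    return dlerror() == NULL;     /* no error means the lookup succeeded */
}
```

A caller might try `load_optional_lib("libcuda.so.1")` and fall back to a CPU-only path when it returns NULL. On older glibc versions this needs `-ldl` at link time.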
- Resuming MTT development - send email
- Like to have a monthly call.
- Christoph Niethammer is interested.
- Might need a new cleanup mechanism when rolling out lots of versions.
- Find out who is using the Python client, and what problems they have.
- IU database plugin (what ends up getting data into MTT viewer) has a number of issues.
- Schedule: No schedule for v4.0.8 yet
- bugfixes case-by-case basis
- Winding down v4.0.x; it will stop after v5.0.x.
- Really only want small changes reported by users.
- Otherwise, point users to v4.1.x release.
- Howard and Geoff will meet Jan 28th
- Schedule: Shooting for v4.1.3 end of March/Q1.
- RC in 2 weeks or so.
- No other update.
- CI is back.
- Need a full ROMIO update [Geoff to file issue]
- Open an issue to track this.
- https://github.com/openpmix/prrte/pull/1176
- Sessions - https://github.com/open-mpi/ompi/pull/9097
- Howard will rebase (again)
- PRRTE has long had a schizo component that tries to provide an interface based on which implementation the user is using. The CLI was still centralized, and this was leading to difficulties. Example: disagreement about how ranks should be placed with the `-N` option. So some of these decisions were moved down into a framework that has an OMPI component.
- Some questions about whether we should bring this into v5.0 for OMPI. There is a PRRTE PR up with some early work.
- This would be backported to the PRRTE release branch for our OMPI v5
- Blocker v5.0 items are in the Project/2
- Schedule is Q1
- Thinking about an RC before and after Sessions.
- Well as far as tracking, we have nightly tarballs, and it'll be clear in git
- Docs rework
- We made a lot of progress on revamping the docs with reStructuredText.
- Might actually be able to get this done by v5.0.x
- Don't go review yet, but lots of good progress.
- Will definitely have these docs for v5.0.0, but maybe not 100% complete.
- But we do want a "this is what's different in the mpirun command line" section, etc.
- PR 9996 - bug with current cuda common code.
- Ported this code to util to try to fix the bug, but there has been an ask to do a bit more:
- An accelerator framework.
- Need to figure out how we move forward here. Moving it into util is not the right place.
- Don't need more things with gnarly dependencies in util.
- this makes the mpicc problem worse.
- William will take a stab at it, if it's not a lot of work.
- four to six functions that datatype engine calls.
- Is accelerator?
- data movement functions.
- need to figure out memory hooks stuff.
- libfabric has this abstraction, so we could
- No new code, just moving things around.
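To make the accelerator framework idea above more concrete, here is a rough sketch of what a small interface for the datatype engine could look like: a buffer check plus a couple of data movement functions, with a host-only fallback component. Every name here (`accel_module_t`, `accel_null_module`, `pack_bytes`) is hypothetical and chosen for illustration; this is not Open MPI's actual interface, and the memory hooks question is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical module interface (illustration only, not Open MPI's API). */
typedef struct accel_module {
    const char *name;
    /* "Is accelerator?": does this address refer to device memory? */
    bool (*is_accelerator_buffer)(const void *addr);
    /* Data movement functions the datatype engine would call. */
    int (*mem_copy)(void *dst, const void *src, size_t len);
    int (*mem_move)(void *dst, const void *src, size_t len);
} accel_module_t;

/* Host-only fallback component: nothing is ever device memory. */
static bool null_is_accel(const void *addr) { (void)addr; return false; }

static int null_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return 0;
}

static int null_move(void *dst, const void *src, size_t len)
{
    memmove(dst, src, len);
    return 0;
}

static const accel_module_t accel_null_module = {
    .name = "null",
    .is_accelerator_buffer = null_is_accel,
    .mem_copy = null_copy,
    .mem_move = null_move,
};

/*
 * Sketch of a caller: route the copy through the selected module when either
 * buffer is device memory, otherwise use a plain host memcpy.
 */
int pack_bytes(const accel_module_t *accel, void *dst, const void *src, size_t len)
{
    assert(accel != NULL);
    if (accel->is_accelerator_buffer(dst) || accel->is_accelerator_buffer(src)) {
        return accel->mem_copy(dst, src, len);
    }
    memcpy(dst, src, len);
    return 0;
}
```

With this shape, code outside the framework never references CUDA symbols directly. A CUDA component would supply its own function pointers, which is one way to keep those dependencies out of util.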
- No new Gnus
- A fix pending to workaround the IBM XL MTT build failure (compiler abort)
- Issue 9919 - Thinks this common component should still be built.
- Commons get built when it's likely there is a dependency.
- Commons self-select if they should be built or not.