WeeklyTelcon_20190917
Geoffrey Paulsen edited this page Oct 1, 2019 · 1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Akshay Venkatesh (NVIDIA)
- Brendan Cunningham (Intel)
- Dan Topa (LANL)
- David Bernhold (ORNL)
- Edgar Gabriel (UH)
- Erik Zeiske
- Geoffrey Paulsen (IBM)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Intel)
- Noah Evans (Sandia)
- Ralph Castain (Intel)
- Thomas Naughton (ORNL)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- Artem Polyakov (Mellanox)
- Brandon Yates (Intel)
- Brian Barrett (AWS)
- George Bosilca (UTK)
- Joshua Ladd (Mellanox)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Nathan Hjelm (Google)
- Tom Naughton
- Xin Zhao (Mellanox)
- mohan (AWS)
- Erik opened this issue, but no one has replied yet.
- Everyone please read this issue, and comment if you have thoughts.
- PR6844 - Want to test if this affects containers.
- Worth the question, don't see any reason not to take this.
- Jeff will review and add comments.
- Howard tested, and the workaround fixes it.
- No update (Brian on vacation)
- Merged --recurse-submodules update into ompi-scripts Jenkins script as first step. Let's see if that works.
Blockers All Open Blockers
Review v3.0.x Milestones v3.0.4
- Start drawing up a list of fixes that won't be backported to v3.0.x
- Datatype bug won't be backported, because it snowballed too big.
- With the new 3.0.x and 3.1.x releases, will put out a list (in either NEWS or README) of issues fixed in v4.0.x that are NOT being backported, along with a recommendation to upgrade.
Review v3.1.x Milestones v3.1.4
- PR6556 and PR 6621 should go to the v3.x release branches.
- Start drawing up a list of fixes that won't be backported to v3.0.x
- Datatype bug won't be backported, because it snowballed too big.
- With the new 3.0.x and 3.1.x releases, will put out a list (in either NEWS or README) of issues fixed in v4.0.x that are NOT being backported, along with a recommendation to upgrade.
Review v4.0.x Milestones v4.0.2
- Howard tested the CMA workaround PR
- Issue 6976 - Thinks this is a PSM issue, not a v4.0.x
- Confirmed that this issue exists.
- Should this be a blocker for v4.0.2? Think this is an OFI-layer issue.
- Silent data issue. Would really like a fatal error at the Open MPI layer.
- Not a regression.
- Not-default path (OFI MTL (non-default) BTL
- IS a default path if built with libfabric
- Will work on issues.
- Intel will look at what it might take to add a fatal error check for v4.0.2
- ABI changes: https://github.com/open-mpi/ompi/issues/6949
- Linkers are a bit smarter now and we should define our ABI better.
- Help it work with the tool.
- Looks like in this
- We have Open MPI the package, then we have Open MPI and Open SHMEM libraries.
- Our versioning is on the larger package, not really on library level.
- Compatibility guarantees are confusing
- We're letting OpenSHMEM add new functions, though not Open MPI.
- this is confusing for folks.
- Tearing this apart will be challenging.
- Let's take this particular issue seriously.
- It would be cool to have CI - Geoff signs up to find out more information about tools.
- This is probably okay for v4.0.2.
- We should
- Geoffroy Vallee has a system setup to run cross-compatibility, and can report out which versions are failing. Ralph will forward info to devel-core.
- Still have some issues; we expect to still have to do an rc2, e.g., https://github.com/open-mpi/ompi/issues/6932.
- Discuss Issue 6568 - large messages overwhelm put
- PR 6961 went into master - Nathan said it might help.
- George commented it's a partial solution.
- See if this fixes 6568, and if it does consider for v4.0.2
- Hold off on pulling into v4.0.x until after rc2, for easier regression testing.
- The other interfaces don't have as tight of constraints, and might not hit this.
- This SHOULD stay as a blocker, since it ends in hang.
- We need to look for a workaround.
- Could disable put completely.
- Could use an opal_unlikely check of message-size, and only then kick it back if the message size is too large.
- OB1 tries put / get, and if these don't work, does it fall back to send/recv?
- Possibly a flaw in put itself.
- Jeff will ask George what would be a viable workaround, and identify it.
- Not signing up to implement.
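The message-size workaround floated above (an opal_unlikely check that only kicks oversized messages back to another path) could look roughly like the sketch below. This is only an illustration of the idea, not code from Open MPI or from PR 6961: the `sketch_` names, the size threshold, and the two-way path enum are all assumptions made up for the example, and `sketch_unlikely` is a local stand-in for `opal_unlikely` so the snippet compiles on its own.

```c
#include <stddef.h>

/* Local stand-in for Open MPI's opal_unlikely(); assumed to be a
 * branch-prediction hint only, with no effect on the result. */
#if defined(__GNUC__)
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define sketch_unlikely(x) (x)
#endif

/* Hypothetical cutoff above which put is not attempted.
 * The real limit, if any, was not stated in the discussion. */
#define SKETCH_MAX_PUT_SIZE ((size_t)1 << 30)

/* Hypothetical transfer paths, named for illustration only. */
typedef enum {
    SKETCH_PATH_PUT,
    SKETCH_PATH_SEND_RECV
} sketch_path_t;

/* Choose the transfer path for a message of msg_size bytes.
 * The common case (small enough for put) stays on the fast path;
 * only oversized messages are kicked back to send/recv. */
static sketch_path_t sketch_choose_path(size_t msg_size)
{
    if (sketch_unlikely(msg_size > SKETCH_MAX_PUT_SIZE)) {
        return SKETCH_PATH_SEND_RECV;
    }
    return SKETCH_PATH_PUT;
}
```

The point of marking the check as unlikely is that the extra comparison should cost close to nothing for normal-sized messages. Whether send/recv is actually a safe fallback here is the open question Jeff is taking to George.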
- PR6942 - ready to merge.
- MTT failures in Generic Simple unpack on v4.0.x - segfaults, assertions.
- DDT-unpack assertion on v4.0.x
- NERSC - running the ibm suite will always fail because srun won't pass connect-accept.
Review Master Master Pull Requests
- Howard will test master to see if PR 6961 fixes Issue 6568 (large messages overwhelm put)
- If it goes well, we can
- PR 6844 - If Jeff gives the okay, Howard says we should merge this.
- This does fix what container folks were seeing (having to disable CMA)
- Trying to talk to each other through vader, will talk to each other (bypassing CMA)
- XPmem doesn't care about memspaces, just the key to access virtual address space.
- This is a good PR.
- Is this for v4.0.x or just master?
- Need to investigate if it changes datastructures that are exchanged.
- PMIx did a thing in v3.1.4 to extend the modex at some point, since it just added to the existing one.
- So this does it similarly, so shouldn't be an issue.
- IBM's PGI test has NEVER worked. Is it a real issue or local to IBM?
- NVIDIA bought PGI, perhaps someone there could take a look?
- Akshay said he'd talk to a PGI person at NVIDIA to see.
- Edgar mentioned that Mark Allen should rebase PR6756 and get that in to resolve an issue another customer is seeing.
- Cray running into problems again. :frown:
- Back on track.
- No discussion this week.
- See older weekday notes for prior items.
- No discussion this week.
- See older weekday notes for prior items.
- No discussion this week.
- See older weekday notes for prior items.
- IBM has to triage some failures on master and v4.0.x and some test build issues. Josh Hursey thought they might be accidentally mixing XLC and PGI compilers. Will investigate.
- Cisco has a build failure to investigate.