WeeklyTelcon_20210727
- Geoffrey Paulsen (IBM)
- Ralph Castain (Intel)
- Jeff Squyres (Cisco)
- Nathan Hjelm (Google)
- William Zhang (AWS)
- Aurelien Bouteiller (UTK)
- David Bernholdt (ORNL)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Sam Gutierrez (LANL)
- Austen Lauria (IBM)
- Brendan Cunningham (Cornelis Networks)
- Brian Barrett (AWS)
- Hessam Mirsadeghi (NVIDIA)
- Joseph Schuchart (HLRS)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Cornelis Networks)
- Thomas Naughton III (ORNL)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (NVIDIA)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Christoph Niethammer (HLRS)
- Edgar Gabriel (UH)
- Erik Zeiske (HPE)
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Josh Hursey (IBM)
- Joshua Ladd (NVIDIA)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Noah Evans (Sandia)
- Raghu Raja
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Todd Kordenbrock (Sandia)
- Tomislav Janjusic (NVIDIA)
- Xin Zhao (NVIDIA)
- Schedule: Planning on late August (no particular reason for August) for accumulated bugfixes.
- Went over the Google Doc of current blocking v5.0.x issues.
- Red is blocking
- Yellow may be blocking
- White is tracking, but not blocking
- Removing items from the list as fixes are merged.
- Issue 8850 - Jeff will re-review.
- Issue 6966 - Austen will retest.
- Issue 9032 has a PR open.
- Issue 9128 just needs a cherry-pick to v5.0.x.
- Do we want https://github.com/open-mpi/ompi/pull/9154 in v5.0.x?
- This new OMPI framework's name is `smsc`.
- When you add a new framework, you need to go into the PRRTE schizo framework and add it to the list of generic OMPI framework names.
- That way PRRTE knows to put the `ompi_` prefix before the MCA parameter; otherwise you need to prefix with `omca`, or with `prte_mca` / `pmix_mca` if you want it to be a PRRTE or PMIx parameter. All frameworks have unique names, but users can just say `mca_schizo_base_verbose` and it is looked up in the table.
- Would it be possible to print a warning ("I don't know what framework this MCA parameter belongs to")? Been burned before. (See the sketch after this list.)
- More complicated than it sounds.
- In this case it might be doable: since no prefix was specified on the command line, we could warn if none of the 3 projects understands it.
- Ralph will take a look.
- Can we document how to add a framework somewhere?
- Could add a few sentences about it on the OMPI wiki.
- For environment variables you need to manually set the prefixes.
- Nathan tested all three: cma, knem, and ...
- Will have coll_sm updates.
- Should be fine to put into v5.0.x.
- Just code shuffling, because all of the code was in vader.
- Anyone can call the new framework (BTLs or PMLs).
- used by BTL sm
- Will this be needed for refactoring for sessions?
- Probably not.
- BTL_sm doesn't call to itself
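A minimal sketch of the prefix-lookup / warning idea discussed above. This is not actual PRRTE schizo code: the framework lists are abbreviated, illustrative subsets, and the `OMPI_MCA_` / `PRTE_MCA_` / `PMIX_MCA_` environment-variable prefixes are the per-project prefixes assumed here.

```c
/*
 * Sketch only (NOT PRRTE code): map an un-prefixed MCA parameter such as
 * "smsc_base_verbose" to a project-specific environment variable, and warn
 * when no project claims the framework -- the warning proposed above.
 */
#include <stdio.h>
#include <string.h>

static const char *ompi_frameworks[] = { "smsc", "btl", "coll", NULL };  /* illustrative subset */
static const char *prte_frameworks[] = { "schizo", "plm", NULL };        /* illustrative subset */
static const char *pmix_frameworks[] = { "gds", "ptl", NULL };           /* illustrative subset */

static int in_list(const char *name, const char **list)
{
    for (int i = 0; NULL != list[i]; ++i) {
        if (0 == strcmp(name, list[i])) {
            return 1;
        }
    }
    return 0;
}

/* Print the env var a parameter would map to, e.g. OMPI_MCA_smsc_base_verbose. */
static void print_env_var(const char *param)
{
    char framework[64] = { 0 };
    const char *underscore = strchr(param, '_');
    size_t len = underscore ? (size_t)(underscore - param) : strlen(param);

    strncpy(framework, param, len < sizeof(framework) - 1 ? len : sizeof(framework) - 1);

    if (in_list(framework, ompi_frameworks)) {
        printf("OMPI_MCA_%s\n", param);
    } else if (in_list(framework, prte_frameworks)) {
        printf("PRTE_MCA_%s\n", param);
    } else if (in_list(framework, pmix_frameworks)) {
        printf("PMIX_MCA_%s\n", param);
    } else {
        /* The warning discussed above: none of the 3 projects knows this framework. */
        fprintf(stderr, "WARNING: unknown MCA framework in \"%s\"\n", param);
    }
}

int main(void)
{
    print_env_var("smsc_base_verbose");   /* -> OMPI_MCA_smsc_base_verbose */
    print_env_var("bogus_base_verbose");  /* -> warning                    */
    return 0;
}
```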
- Documentation
- Issue 7668 - lots of things need to change here.
- Could use help.
- Jeff is done with the first pass of the docs, and is slowly folding docs in.
- Still stuff that needs to be revamped or written.
- Still all one set of docs.
- Harumi - Even if others can't write well
- Docs that should go into PRRTE
- Some infrastructure with Sphinx - can be started as well.
- Decent handle on infrastructure.
- Docs could also start in PMIx/PRRTE so we can slurp them in.
- PMIx / PRRTE plan to release in the next few weeks.
- Need to do a v5.0 rc as soon as PRRTE v2 ships.
- Need feedback if we've missed an important one.
- PMIx tools support is still not functional. Opened tickets in PRRTE.
- Not a common case for most users.
- This also impacts the MPIR shim.
- PRRTE v2 will probably ship with broken tool support.
- Is the driving force for PRRTE v2.0 OMPI?
- So we'd be indirectly/directly responsible for PRRTE shipping with broken tool support?
- Ralph would like to retire, and really wants to finish PRRTE v2.0 before he retires.
- Or just fix it in PRRTE v2.0?
- Is broken tool support a blocker for PRRTE v2.0?
- Don't ship OMPI v5.0 with broken Tools support.
- Are there any objections to delaying?
- Either we resource this
- https://github.com/openpmix/pmix-tests/issues/88#issuecomment-861006665
- Current state of PMIx tool support.
- We'd like to get Tool support in CI, but need it to be working to enable the CI.
- https://github.com/openpmix/prrte/issues/978#issuecomment-856205950
- Blocking issue for Open-MPI
- Brian
- PR 9014 - new blocker.
- fix should just be a couple of lines of code... hard to decide what we want.
- Ralph, Jeff and Brian started talking.
- Simplest solution was to have our own
- Need people working on v5.0 stuff.
- Need some configury changes in before we RC.
- Issues 8850, 8990, and more.
- Brian will file 3-ish issues.
- One is configure pmix
- Dynamic windows fix is in for UCX. (See the sketch below.)
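For context on the bullet above, a hedged sketch of the MPI dynamic-windows feature that the UCX fix concerns; this is generic MPI-3 usage, not the fix itself.

```c
/*
 * Dynamic windows: a window created with MPI_Win_create_dynamic, with
 * memory attached at run time and addressed via MPI_Get_address.
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Each rank attaches a local buffer and publishes its address. */
    int buf = rank;
    MPI_Win_attach(win, &buf, sizeof(buf));

    MPI_Aint my_addr;
    MPI_Aint *addrs = malloc(size * sizeof(MPI_Aint));
    MPI_Get_address(&buf, &my_addr);
    MPI_Allgather(&my_addr, 1, MPI_AINT, addrs, 1, MPI_AINT, MPI_COMM_WORLD);

    /* Rank 0 reads rank 1's value through the dynamic window. */
    if (0 == rank && size > 1) {
        int val = -1;
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Get(&val, 1, MPI_INT, 1, addrs[1], 1, MPI_INT, win);
        MPI_Win_unlock(1, win);
        printf("rank 0 read %d from rank 1\n", val);
    }

    /* Make sure all RMA is done before anyone detaches. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_detach(win, &buf);
    MPI_Win_free(&win);
    free(addrs);

    MPI_Finalize();
    return 0;
}
```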
- Any update on debugger support?
- Need some documentation that Open MPI v5.0 supports PMIx-based debuggers, and that if ...
- UCC coll component is being updated to be the default when UCX is selected. PR 8969.
- Intent is that this will eventually replace hcoll.
- Quality
- Solid progress happening on Read the Docs.
- These docs would be on the readthedocs.io site, or on our site?
- Haven't thought about it either way yet.
- No strong opinion yet.
- Geoff is going to help
- Issue 8884 - ROMIO detects CUDA differently.
- Gilles proposed a quick fix for now.
- https://github.com/open-mpi/ompi/wiki/Meeting-2021-07
- Find the link to the WebEx HERE.
- July 22nd (2pm Central)
- July 29th (10-12 Central)
- Now released.
- Virtual face-to-face.
- Persistent collectives
- Would be so nice to get the MPIX_ rename into v5.0. (See the sketch after this list.)
- Don't think this was planned for v5.0.
- Don't know if anyone asked them this. Might not matter to them.
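A hedged sketch of what the MPIX_ rename is about: the MPI 4.0 persistent collectives, shown here with their standard MPI_ names. Builds that still carry the experimental extension spell the same calls MPIX_Allreduce_init, etc.

```c
/*
 * MPI 4.0 persistent collective: set up once, start/complete many times.
 * Requires an MPI library with persistent-collective support.
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int in = 1, out = 0;
    MPI_Request req;

    /* Set up the persistent operation once... */
    MPI_Allreduce_init(&in, &out, 1, MPI_INT, MPI_SUM,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    /* ...then start and complete it as many times as needed. */
    for (int i = 0; i < 3; ++i) {
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Request_free(&req);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (0 == rank) {
        printf("sum = %d\n", out);
    }

    MPI_Finalize();
    return 0;
}
```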
- Virtual face to face
- A bunch of stuff in the pipeline. Then details.
- Plan to open the Sessions pull request. (See the sketch after this list.)
- Big, almost all in OMPI.
- Some of it is more impacted by the clang-format changes.
- New functions.
- Considerably more functions can be called before MPI_Init/Finalize
- Don't want to do sessions in v5.0
- Hessam Mirsadeghi is interested in trying MPI Sessions.
- Interested in a timeline of a release that will contain MPI_Sessions.
- The Sessions working group meets every Monday at noon Central time.
- https://github.com/mpiwg-sessions/sessions-issues/wiki
- Several of the tools tests are busted on master.
- Sessions branch fixes some of these.
- Initialize tools after finalizing MPI.
- Update:
- Did some cleanup of refactoring.
- Topology might NOT change with Sessions relative to what's currently in master.
- Extra topology work that wasn't accepted by the MPI v4.0 standard.
- Question on how we do MCA versioning.
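A hedged sketch of the MPI 4.0 Sessions model the pull request targets, using the standard MPI 4.0 calls; the API in the upcoming Open MPI pull request may differ in detail.

```c
/*
 * MPI 4.0 Sessions: initialize a session (no MPI_Init), derive a group
 * from the standard "world" process set, and build a communicator from it.
 */
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    MPI_Session session;
    MPI_Group group;
    MPI_Comm comm;

    /* A session is initialized independently of the world model. */
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);

    /* Build a communicator from the standard "mpi://WORLD" process set. */
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    MPI_Comm_create_from_group(group, "example.tag", MPI_INFO_NULL,
                               MPI_ERRORS_RETURN, &comm);

    int rank;
    MPI_Comm_rank(comm, &rank);
    printf("hello from rank %d (sessions model)\n", rank);

    MPI_Group_free(&group);
    MPI_Comm_free(&comm);
    MPI_Session_finalize(&session);
    return 0;
}
```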
- We don't KNOW that OMPI v6.0 won't be an ABI break.
- Would be NICE to get the MPIX symbols into a separate library. (See the sketch after this list.)
- What's left in MPIX after persistent collectives?
- Short float
- Pcollreq - persistent collectives
- Affinity
- If they're NOT built by default, it's not too high of a priority.
- Should just be some code-shuffling.
- On the surface shouldn't be too much.
- If they use wrapper compilers, or official mechanism
- Top level library, since app -> MPI and app -> MPIX lib.
- libmpi_x library can then be versioned differently.
- Should just be some code-shuffling.
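A hedged sketch of how an application consumes an Open MPI MPIX extension today, via `<mpi-ext.h>`. If the MPIX symbols move into a separate top-level library as discussed above, this is the code that would end up linking against it (the libmpi_x name is the hypothetical one from the discussion). MPIX_Query_cuda_support() is used here only as a readily available example of an MPIX symbol.

```c
/*
 * Using an Open MPI MPIX extension: the extension declarations live in
 * <mpi-ext.h>, guarded by per-extension macros.
 */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>   /* declares the MPIX extension symbols */
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    /* An MPIX symbol the application links against. */
    printf("MPIX extension present; CUDA-aware at run time: %d\n",
           MPIX_Query_cuda_support());
#else
    printf("This build does not expose the CUDA MPIX extension.\n");
#endif

    MPI_Finalize();
    return 0;
}
```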
- Don't change to build MPIX by default.
- Open an issue to track all of our MPI 4.0 items.
- The MPI Forum will want this, certainly before Supercomputing.
- Do we want an MPI 4.0 design meeting in place of a Tuesday meeting?
- In person meeting is off the table for many of us. We might want an out of sequence meeting.
- Let's Doodle something a couple of weeks out.
- Doodle and send it out.
- Trivial wiki page in the style of the other in-person meeting wikis.
- Two days of 2-hour blocks - wiki
- Who owns our open-SQL?
- No one?
- What value is the viewer using to generate the ORG data?
- Looking for the field in the Perl client.
- It's just the username. It's nothing simple.
- Something about how the CherryPy server is stuffing data into the database.
- Thought it was in the INI file, but it isn't.
- Concerned that we don't have an owner.
- Back in the day, we used MTT because there was nothing else.
- But perhaps there's something else now?
- A lot of segfaults in UCX one-sided at IBM.
- Howard Pritchard: Does someone at NVIDIA have a good set of tests for GPUs?
- Can ask around.
- The only tests are the OSU MPI benchmarks, which have CUDA and ROCm support.
- Good enough for sanity.
- No support for Intel low level stuff now.
- PyTorch - machine learning framework - resembles an actual application.
- Has different backends: the NCCL collectives/reduction library, but also a CUDA backend for single/multiple nodes.
- ECP - worried we're going to get so far behind MPICH, because all 3 major exascale systems are using essentially the same technology and their vendors use MPICH. They're racing ahead with integrating GPU-offloaded code with MPICH. Just a heads up.
- A thread on the GPU can trigger something to happen in MPI.
- CUDA_Async Not sure of
- Jeff will send out the committer list to remove people from the list.
- Trivial to re-add someone, so err on the side of kicking folks out.
- No discussion
- No update
- No discussion.