WeeklyTelcon_20210112
- Dialup Info: (Do not post to public mailing list or public wiki)
- Akshay Venkatesh (NVIDIA)
- Aurelien Bouteiller (UTK)
- Brendan Cunningham (Cornelis Networks)
- Christoph Niethammer (HLRS)
- Edgar Gabriel (UH)
- Geoffrey Paulsen (IBM)
- George Bosilca (UTK)
- Harumi Kuno (HPE)
- Hessam Mirsadeghi (UCX/nVidia)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Joseph Schuchart
- Matthew Dosanjh (Sandia)
- Michael Heinz (Cornelis Networks)
- Naughton III, Thomas (ORNL)
- Raghu Raja (AWS)
- Ralph Castain (Intel)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- Artem Polyakov (nVidia/Mellanox)
- Austen Lauria (IBM)
- Barrett, Brian (AWS)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- David Bernhold (ORNL)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- Josh Hursey (IBM)
- Joshua Ladd (nVidia/Mellanox)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Tomislav Janjusic
- Xin Zhao (nVidia/Mellanox)
- mohan (AWS)
- The link has changed for 2021. Please see the email from Jeff Squyres to devel-core@lists.open-mpi.org on 12/15/2020 for the new link.
- v4.0.6rc1 - built, please test.
- Discussed https://github.com/open-mpi/ompi/issues/8299 - srun issue in v4.0.x, mpirun works.
- SRUN might not give us enough info, so might need a fix.
- Curious what version of hwloc their slurm is built with.
- Discussed https://github.com/open-mpi/ompi/issues/8321
- UCX in VM possible silent error.
- Added blocker label.
- in v4.0.x and master, though might be down in UCX.
- SLURM_WHOLE issue, want to stay in sync with OMPI v4.1.x.
- Howard wants to get Lustre testing before v4.0.6rc2.
- Geoff pinged Mark to post his branch of ROMIO fixes for Lustre.
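A minimal sketch related to the hwloc question above: a program like the following (illustrative only, not part of Open-MPI or Slurm) prints both the hwloc API version it was compiled against and the version it is actually running with.

```c
/* check_hwloc_version.c - illustrative sketch only.
 * Prints the hwloc API version this binary was compiled against and the
 * version of the hwloc library it is running with. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    /* Version of the hwloc library loaded at run time. */
    unsigned runtime = hwloc_get_api_version();

    printf("compiled against hwloc API 0x%x, running with hwloc API 0x%x\n",
           HWLOC_API_VERSION, runtime);
    return 0;
}
```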
- Merged a number of PRs yesterday.
- Issue 8334 - a performance regression with AVX. Still digging into it.
- AVX perf issue.
- Raghu tested; AVX-512 seems to make it slower.
- Papers show that anything beyond AVX2 throttles core frequencies down and has this effect.
- Need to look into the root cause (a rough micro-benchmark sketch follows this list).
- Probably not ready to be the default.
- Many apps just run one rank per node, which might WANT AVX on, but fully subscribed runs may want AVX off.
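A rough micro-benchmark sketch for the AVX investigation above: time a large reduction and compare runs with the AVX op component enabled and excluded (e.g. via something like `--mca op ^avx`; the component name and option are assumptions here, not something settled on the call).

```c
/* reduce_bench.c - illustrative sketch only.
 * Times MPI_Reduce on a large buffer so runs with and without the AVX op
 * component can be compared. */
#include <stdio.h>
#include <mpi.h>

#define N     (1 << 20)   /* 1M doubles per buffer */
#define ITERS 100

static double in[N], out[N];

int main(int argc, char **argv)
{
    int rank, i, it;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++)
        in[i] = (double)i;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (it = 0; it < ITERS; it++)
        MPI_Reduce(in, out, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average MPI_Reduce time: %g s\n", (t1 - t0) / ITERS);

    MPI_Finalize();
    return 0;
}
```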
- Issue 8335 - Trying to run with external PMIx.
- Resolved.
- Michael Heinz is looking at a new PSM2(?) issue from yesterday. Possibly for v4.1.1.
- A fix has been PRed for the CQ entry data size field.
- Josh Hursey is working on Issue 8304 (verified in v4.1, v4.0, and v3.1).
- Resolved.
- Does the community want this ULFM PR 7740 for OMPI v5.0? If so, we need a PRRTE v3.0.
- Aurelien will rebase.
- Works with the PRRTE referred to by the ompi master submodule pointer.
- Currently used in a bunch of places.
- Run normal regression tests. Should not see any performance regressions.
- When this works, can provide other tests.
- It is a configure flag. The default is to configure it in, but it is disabled at runtime.
- A number of things must be set to enable it.
- Aurelien is working to get that down to a single parameter (see the ULFM sketch after this list).
- Let's get some code reviews done.
- Look at the intersections with the core, and ensure that the non-ULFM paths are "clean".
- There is also a downstream effect on PMIx and PRRTE.
- Let's put a deadline on reviews; say in 4 weeks we'll push the merge button.
- Jan 26th: we'll merge if no issues.
- Modified ABI - removed one callback/member function from some components (BTLs/PMLs) used for the FT event.
- This affects all the structures for these components.
- Pending this discussion.
- Going to version the frameworks that are affected.
- It's not this simple in practice, because usually we just return a pointer to a static object.
- But this isn't possible anymore.
- We don't support multiple versions.
- Do we think we should allow Open-MPI v5.0 to run with MCAs from past versions?
- Maybe good to protect against it?
- Unless we know of someone we need to support like this, we shouldn't bend over for this.
- Josh thinks the Container community is experimenting with this.
- Josh has advised that Open-MPI doesn't guarantee this.
- v5.0 is advertised as an ABI break.
- In this case, the framework doesn't exist anymore.
- George will do a check to ensure we're not loading MCAs from an earlier version. *
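For reference on the ULFM item above, a minimal sketch of the user-facing fault-tolerance extensions carried by PR 7740, assuming a build with the FT extension enabled; illustrative only, not a complete recovery scheme.

```c
/* ulfm_sketch.c - illustrative sketch only (requires the ULFM/FT extension).
 * Detect a process failure, revoke the communicator so every survivor sees
 * it, then shrink to the surviving processes and continue. */
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ fault-tolerance extensions */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(MPI_COMM_WORLD);
    int eclass = MPI_SUCCESS;
    if (rc != MPI_SUCCESS)
        MPI_Error_class(rc, &eclass);

    if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
        MPI_Comm survivors;
        MPIX_Comm_revoke(MPI_COMM_WORLD);             /* propagate the failure */
        MPIX_Comm_shrink(MPI_COMM_WORLD, &survivors); /* drop the dead ranks */
        MPI_Barrier(survivors);                       /* carry on with survivors */
        MPI_Comm_free(&survivors);
    }

    MPI_Finalize();
    return 0;
}
```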
- Still need to coordinate on this. He'd like this done this week.
- PMIx v4.0 working on Tools, hopefully done soon.
- Going through the PMIx Python bindings.
- A new shmem component to replace the existing one.
- Still working on it.
- Dave Wooten pushed up some PRRTE patches, and is making some progress there.
- Slow but steady progress.
- Once the tool work is more stabilized on PMIx v4.0, will add some tool tests to CI.
- Probably won't start until the first of the year.
- How are the submodule reference updates on Open-MPI master going?
- Probably will be switching OMPI master to PMIx master in the next few weeks.
- PR 8319 - this failed. Should it be closed and a new one created?
- Josh was still looking into adding some cross-checking CI.
- When making a PRTE PR, one could add a comment to the PR and it'll trigger Open-MPI CI with that PR.
- v4.0 PMIx and PRRTE master.
- When PRRTE branches a v2.0 branch, we can switch to that then.
- Two different drivers:
- OFI MTL
- HFI support
- Interest in getting PRRTE into a release, and a few other things that are already in v4.1.x.
- HAN and ADAPT as default.
- Amazon is helping with testing and other resources.
- Amazon is also investing, contracting Ralph to help get PRRTE up to speed.
- Other features in PMIx:
- Can set GPU affinities, can query GPU info.
- New WebEx for January.
- Took the latest ROMIO and it failed on both.
- But then he took last week's 3.4 beta ROMIO and it passed. But it's a little too new.
- ROMIO modernization (don't use MPI-1-based things)
- ROMIO integration items.
- We're hesitant to put this into 4.1.0 because it's not yet released from MPICH.
- Hesitant to even update ROMIO in v4.0.6 since it's a big change.
- If we delay and pick up a newer ROMIO in the next minor release, would there be backwards-compatibility issues?
- Need to ask about compatibility between ROMIO 3.2.2 and 3.4.
- If fully compatible, then we only need one ROMIO.
- We could ship multiple ROMIOs, but that has a lot of problems.
- He gave a bit more info about the stuff he integrates, and the stuff he moves forward.
- Just got resources to test, and root-caused the issue in OMPIO.
- So, given some more time, Edgar will get a fix, and OMPIO can be the default (a small MPI-IO sketch follows this list).
- What do we want to do about ROMIO in general?
- OMPIO is the default everywhere.
- Gilles is saying the changes we made are integration changes.
- There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
- We may be able to work with upstream to make a clear API between the two.
- As a 3rd-party package, should we move it up to the 3rd-party packaging area, to be clear that we shouldn't make changes to this area?
- Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
- Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
- PR 8329 - convert README, HACKING, and possibly man pages to reStructuredText.
- Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
- Has a build from this PR, so we can see what it looks like.
- Have a look. It's a different approach to have one document that's the whole thing.
- FAQ, README, HACKING.
- Do people even use manpages anymore? Do we need/want them in our tarballs?
- https://github.com/openpmix/prrte/pull/711
- please review and give opinon.
- Will commit next week if no opinion
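Returning to the OMPIO/ROMIO item above, a small MPI-IO sketch of the kind of test that exercises the io framework; it could be run once per io component (names such as `ompio` and `romio321` are taken from the v4.0.x tree and are assumptions here). The filename is arbitrary.

```c
/* io_smoke.c - illustrative sketch only.
 * Each rank writes its own block of a shared file through MPI-IO, which
 * exercises whichever io component (OMPIO or ROMIO) is selected. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i, buf[256];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 256; i++)
        buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "io_smoke.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: each rank lands at a rank-based offset. */
    MPI_Offset offset = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at_all(fh, offset, buf, 256, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```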
- How's the state of https://github.com/open-mpi/ompi-tests-public/?
- Putting new tests there.
- Very little there so far, but working on adding some more.
- Should have some new Sessions tests.
- What's going to be the state of the SM Cuda BTL and CUDA support in v5.0?
- What's the general state? Any known issues?
- AWS would like to get.
- Josh Ladd - Will take internally to see what they have to say.
- From nVidia/Mellanox, CUDA support is through UCX; SM Cuda isn't tested that much.
- Hessam Mirsadeghi - All CUDA awareness is through UCX.
- May ask George Bosilca about this.
- Don't want to remove a BTL if someone is interested in it.
- UCX also supports CUDA over TCP.
- PRRTE CLI on v5.0 will have some GPU functionality that Ralph is working on
- Update 11/17/2020:
- UTK is interested in this BTL, and maybe others.
- Still gap in the MTL use-case.
- nVidia is not maintaining SMCuda anymore. All CUDA support will be through UCX
- What's the state of the shared memory in the BTL?
- This is the really old generation Shared Memory. Older than Vader.
- Was told after a certain point, no more development in SM Cuda.
- One option might be to
- Another option might be to bring that SM code in SMCuda into Vader (now SM).
- ReStructuredText tech doc (more features than Markdown, including cross-references)
- Jeff had a first stab at this, but take a look. Sent it out to devel-list.
- All work for master / v5.0
- Might just be useful to do README for v4.1.? (don't block v4.1.0 for this)
- Sphinx is the tool that generates docs from reStructuredText.
- It can handle the current Markdown manpages together with the new docs.
- readthedocs.io encourages the reStructuredText format over Markdown.
- They also support a hybrid for projects that have both.
- Thomas Naughton has done the reStructuredText work.
- LICENSE question - what license would the docs be available under? The Open-MPI BSD license, or something else?
- Ralph tried Instant-On at scale:
- 10,000 nodes x 32PPN
- Ralph verified Open-MPI could do all of that in < 5 seconds with Instant-On.
- Through MPI_Init() (if using Instant-On; see the PMIx client sketch after this list).
- TCP and Slingshot (the OFI provider is private for now).
- PRRTE with PMIx v4.0 support
- SLURM has some of the integration, but hasn't taken this patch yet.
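For reference on the Instant-On numbers above, a minimal PMIx client sketch of the idea: job-level information is available locally at init time, with no wire-up exchange. Illustrative only; this assumes a PMIx v4-style client and is not the code Ralph ran.

```c
/* instant_on_sketch.c - illustrative sketch only.
 * A PMIx client asks its local server for job-level info (here the job
 * size); no collective or modex across nodes is needed. */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t me, wildcard;
    pmix_value_t *val = NULL;

    if (PMIx_Init(&me, NULL, 0) != PMIX_SUCCESS)
        return 1;

    /* Job-level keys are looked up against the wildcard rank. */
    PMIX_PROC_CONSTRUCT(&wildcard);
    PMIX_LOAD_PROCID(&wildcard, me.nspace, PMIX_RANK_WILDCARD);

    if (PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &val) == PMIX_SUCCESS) {
        printf("rank %u of %u\n", me.rank, val->data.uint32);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```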
- Discussion on:
- Draft request: Make default static - https://github.com/open-mpi/ompi/pull/8132
- One con is that many providers hard-link against libraries, which would then make libmpi dependent on them.
- Non-homogeneous clusters (GPUs on some nodes, and no GPUs on others).
- New meetings that George and Jeff are leading.
- One for Open-MPI and one for PMIx.
- In a month and a half or so. George will send the date to Jeff.