WeeklyTelcon_20210316
Geoffrey Paulsen edited this page Mar 16, 2021
- Dialup Info: (Do not post to public mailing list or public wiki)
- Austen Lauria (IBM)
- Brian Barrett (AWS)
- Christoph Niethammer (HLRS)
- Geoffrey Paulsen (IBM)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Joseph Schuchart
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Cornelis Networks)
- Ralph Castain (Intel)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia/Mellanox)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Brendan Cunningham (Cornelis Networks)
- Charles Shereda (LLNL)
- David Bernholdt (ORNL)
- Edgar Gabriel (UH)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Harumi Kuno (HPE)
- Hessam Mirsadeghi (UCX/nVidia)
- Howard Pritchard
- Joshua Ladd (nVidia/Mellanox)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Nathan Hjelm (Google)
- Naughton III, Thomas (ORNL)
- Noah Evans (Sandia)
- Raghu Raja (AWS)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Tomislav Janjusic
- Xin Zhao (nVidia/Mellanox)
- PSM2 transport in OFI mtl
- Summary: Passing huge messages (> 2 GB) results in the messages being silently truncated.
- Dispute before, folks wanted libofi to check for max size.
- If Open-MPI thinks it should be fixed in the providers, we should just fix it in ofi ompi code.
- checking for max message size is trivial
- Gluing together messages is too complex.
- Some other providers have a 2 or 4 GB limit.
- MPI semantics are pretty clear that messages can be > 4 GB.
- If providers don't support, in hardware, then it's not very pro
- Is this something that can be JUST solved in OFI providers, or does it need something in the higher level.
- It can be solved down in the provider level.
- Open MPI only cares about the provider that Open MPI uses.
- Let's solve this issue for the providers we have control of.
- And other providers would have a bug.
- Can Open MPI detect the provider we're using if we have this issue?
- PSM2 exposed lower level max to Open MPI and returned an error for larger messages.
- Well, what is the highest number?
- Add a check in the OFI MTL, so that with providers that report a very small max message size, an oversized send returns an MPI error at runtime.
- OFI provider should be able to handle it, not MPI or OFI-super layer.
- Will ensure this works for OFI providers that we care about.
- ACTION: Brian will update the readme for v5.0.x
- This is what we expect, this is who to talk to.
- It's not necessarily a badly designed MPI app that tries these.
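The max-message-size check discussed above can be sketched in C roughly as follows. This is a minimal illustration, not Open MPI's actual OFI MTL code: the function name `sketch_check_msg_len` and the `SKETCH_*` return codes are hypothetical, and `max_msg_size` is assumed to be the limit a libfabric provider advertises (libfabric reports one in `fi_info->ep_attr->max_msg_size`).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical return codes for this sketch; not Open MPI's real constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_TRUNCATE = -1 };

/*
 * Return SKETCH_SUCCESS if a message of `len` bytes fits within the
 * provider's advertised limit, SKETCH_ERR_TRUNCATE otherwise.
 *
 * `max_msg_size` stands in for the value a provider reports; lengths are
 * carried as 64-bit values so that sizes above 2 GB or 4 GB are compared
 * without being narrowed first.
 */
int sketch_check_msg_len(uint64_t len, uint64_t max_msg_size)
{
    assert(max_msg_size > 0); /* a zero limit would reject every message */
    return (len <= max_msg_size) ? SKETCH_SUCCESS : SKETCH_ERR_TRUNCATE;
}
```

A caller would run this check before posting the send and surface the failure as an MPI error, instead of handing the provider a length it would truncate.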
- Please update your CI to run MTT on v5.0.x PRs, and on v5.0.x based PRs
- Please Cherry-pick your bugfix/v5.0.x PRs there after your PR is accepted to master
- Needs a squash, missing signed off commit.
- Austen will ping Nathan.
- want in v5.0.x also
- This is working just fine at the moment, except for ROMIO.
- ROMIO is throwing tons of warnings. But okay.
- Would need to fix it upstream.
- PMIx/PRRTE is updated.
- Perhaps now for 3rdParties, configure with --silence-obsolencense flag.
- Does someone want to ping Rob about it?
- Jeff will
- https://github.com/open-mpi/ompi/issues/8566
- Using an actual 32bit gcc - Compile fail
- Nathan thinks he might be able to write a compare-and-swap
- v5.0 - good time to drop 32bit.
- Jeff will send note to packaging, and see if they will care.
- Debian is okay, they will just use MPICH
- OSC/RDMA assumed everything was 64bit, but once we changed
- On 32bit, if we could use C11 atomics with locks, it might be allowed.
- So perhaps this would be a path.
- Is C11 available on older 32bit systems?
- gcc 6.0+ it should work fine.
- Nobody has a strong opinion.
- Pride issue, but it's also time and money
- Right now the only thing breaking it is Nathan's one-sided code.
- Let's ask Nathan what he thinks, and if he has time to fix it.
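As a rough illustration of the C11-atomics path mentioned above, a 64-bit compare-and-swap can be written with `<stdatomic.h>`. The function names here are hypothetical and this is not Nathan's OSC/RDMA code; it only shows the standard-library call. On targets without native 64-bit atomics the compiler may implement the operation with locks (with GCC this typically means linking against libatomic), which `atomic_is_lock_free` lets you detect at run time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * 64-bit compare-and-swap via C11 atomics.
 * Returns true if *target held `expected` and was replaced with `desired`;
 * returns false and leaves *target unchanged otherwise.
 */
bool sketch_cas64(_Atomic uint64_t *target, uint64_t expected, uint64_t desired)
{
    assert(target != NULL);
    /* `expected` is a local copy, so the caller's value is not overwritten. */
    return atomic_compare_exchange_strong(target, &expected, desired);
}

/*
 * Report whether the 64-bit atomic is lock-free on this platform.
 * A false result means the implementation falls back to locks.
 */
bool sketch_cas64_is_lock_free(_Atomic uint64_t *target)
{
    assert(target != NULL);
    return atomic_is_lock_free(target);
}
```

Whether a lock-based fallback is acceptable for one-sided communication is exactly the open question in the discussion above; the sketch only shows that the code itself stays the same on both paths.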
- blocking on UCX issues (see New topics above)
- Jeff pinged George.
- George off at conference next 2 days, will get to it soon.
- Too many Open Issues (50)
- Geoff and Howard will go over v4.0.x issues, and try to close or address some of them.
- Need to label some as wont_fix, let sit for a while, and then close
- blocking on UCX issues (see New topics above)
- Jeff pinged George.
- George off at conference next 2 days, will get to after that.
- Scrubbed a bunch of issues yesterday.
- Marked a few Issues as critical
- Intercomm Merge Issue
- William just started looking at it. Can't reproduce yet.
- AWS MTT never ran on multiple devices but hit this issue.
- Separate issue, this issue
- William will try with multiple devices
- BTL Vader perf regression. Issue #8603
- PR v5.0.x #8622 (Comment about penalizing for all versions of GCC), but could put check for GCC version.
- Title implies GCC < 6, but change is for all compilers.
- Looks solved, but some performance issues
- Gilles linked some atomic PRs.
- Nathan / George approved.
- Merge into all release branches.
- What do we do with the mpirun Manpage?
- Didn't want OMPI requiring Sphinx, but if PRRTE and PMIx in same tar
- Ralph almost has singleton comm spawn working
- Single node without the mpirun process
- Static MCA components default still on track for v5.0.x
- ECP Community days ( March 30-April 1st )
- David Bernholdt and/or George Bosilca
- Each day 90 minute time slots.
- Get proposal in by this Friday.
- Tuesday March 30th from 1-2:30pm (US Eastern)
- Invited some people to speak. They will be our main community speakers.
- Anyone on OMPI community can send slides to Jeff and George
- Due Friday March 26th
- PMIx Wed 31st 11 - 12:30 (US Eastern)
- PR 8329 - convert README, HACKING, and possibly Manpages to reStructuredText.
- Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
- The intent is for this to land in v5.0
- mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
- Ralph has asked about this for PMIx/PRRTE since this is turning out to work
- No update - 3/16
- Could be independent of PMIx and PRRTE.
- PMIx and PRRTE want to follow suit, and not require both pandoc and Sphinx.
- OLD
- What do we want to do about ROMIO in general.
- OMPIO is the default everywhere.
- Gilles is saying the changes we made are integration changes.
- There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
- We may be able to work with upstream to make a clear API between the two.
- As a 3rd party package, should we move it up to the 3rd party packaging area, to be clear that we shouldn't make changes to this area?
- Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
- Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
- How's the state of https://github.com/open-mpi/ompi-tests-public/ ?
- Putting new tests there
- Very little there so far, but working on adding some more.
- Should have some new Sessions tests
- Intercomm Merge is getting inconsistent ordering of procs.
- What is the priority of this?
- Many of the ibm tests start off by doing some intercomm manipulation.
- Won't get
- Mellanox MTT had been failing. Boris set some debug, and they unplugged it.
- They plan to re-enable it tomorrow.