
WeeklyTelcon_20210316

Geoffrey Paulsen edited this page Mar 16, 2021 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Austen Lauria (IBM)
  • Brian Barrett (AWS)
  • Christoph Niethammer (HLRS)
  • Geoffrey Paulsen (IBM)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Ralph Castain (Intel)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Brendan Cunningham (Cornelis Networks)
  • Charles Shereda (LLNL)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UH)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Howard Pritchard
  • Joshua Ladd (nVidia/Mellanox)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Naughton III, Thomas (ORNL)
  • Noah Evans (Sandia)
  • Raghu Raja (AWS)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Tomislav Janjusic
  • Xin Zhao (nVidia/Mellanox)

New Items

Issue 7058 - Issue got stuck

  • PSM2 transport in OFI mtl
    • Summary: Passing huge messages (> 2 GB) results in messages being silently truncated.
    • This was disputed before; folks wanted libofi to check for the max size.
  • If Open MPI thinks it should be fixed in the providers, we should just fix it in the OMPI OFI code.
    • checking for max message size is trivial
    • Gluing together messages is too complex.
    • Some other providers have a 2 or 4 GB limit.
  • MPI semantics are pretty clear that messages can be > 4 GB.
    • If providers don't support, in hardware, then it's not very pro
  • Is this something that can be solved JUST in the OFI providers, or does it need something at a higher level?
    • It can be solved down in the provider level.
  • Open MPI only cares about the provider that Open MPI uses.
    • Let's solve this issue for the providers we have control of
    • And other providers would have a bug.
  • Can Open MPI detect the provider we're using if we have this issue?
    • PSM2 exposed lower level max to Open MPI and returned an error for larger messages.
  • Well, what is the highest number?
  • Check in the OFI MTL, so that with a provider that reports a very small max message size, an app gets an MPI error at runtime if it tries to send a larger message.
  • OFI provider should be able to handle it, not MPI or OFI-super layer.
  • Will ensure this works for OFI providers that we care about.
  • ACTION: Brian will update the readme for v5.0.x
    • This is what we expect, this is who to talk to.
    • It is not necessarily a badly designed MPI app that tries these.
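
The runtime check discussed above could look roughly like the following. This is a minimal sketch, not the actual OFI MTL code: the enum and function names are hypothetical, and `provider_max_msg_size` stands in for whatever limit the provider reports (libfabric exposes one as `max_msg_size` in `struct fi_ep_attr`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; a real MTL would map
 * the failure case to an MPI error class. */
enum send_check { SEND_OK = 0, SEND_TOO_LARGE = 1 };

/*
 * Refuse a send whose byte length exceeds the provider's advertised
 * limit, so an oversized message fails loudly instead of being
 * silently truncated.
 */
static enum send_check check_send_length(size_t count, size_t type_size,
                                         size_t provider_max_msg_size)
{
    /* Assumption: a provider always reports a non-zero limit. */
    assert(provider_max_msg_size > 0);

    /* Guard the count * type_size multiplication against overflow. */
    if (type_size != 0 && count > SIZE_MAX / type_size) {
        return SEND_TOO_LARGE;
    }
    return (count * type_size > provider_max_msg_size) ? SEND_TOO_LARGE
                                                       : SEND_OK;
}
```

For example, with a 2 GB limit, `check_send_length(600000000, 4, 2147483647)` reports `SEND_TOO_LARGE`, while a 4 KB send passes.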

v5.0.x branch now building nightly tarballs.

  • Please update your CI to run MTT on v5.0.x PRs, and on v5.0.x based PRs
  • Please cherry-pick your bugfix/v5.0.x PRs there after your PR is accepted to master

PR 8551 - New coding style enforced via clang-format

  • Needs a squash; a commit is missing its sign-off.
    • Austen will ping Nathan.
    • want in v5.0.x also

Autoconf 2.70

  • This is working just fine at the moment, except for ROMIO.
    • ROMIO is throwing tons of warnings. But okay.
    • Would need to fix it upstream.
  • PMIx/PRRTE is updated.
  • Perhaps now for 3rdParties, configure with --silence-obsolencense flag.
  • Does someone want to ping Rob about it?
    • Jeff will

32bit? Do we want to continue to support this?

  • https://github.com/open-mpi/ompi/issues/8566
  • Using an actual 32bit gcc, the compile fails.
  • Nathan thinks he might be able to write a compare-and-swap
  • v5.0 - good time to drop 32bit.
    • Jeff will send note to packaging, and see if they will care.
    • Debian is okay, they will just use MPICH
    • OSC/RDMA assumed everything was 64bit, but once we changed
  • On 32bit, if we could use C11 atomics with locks, it might be allowed.
    • So perhaps this would be a path.
    • Is C11 available on older 32bit systems?
    • gcc 6.0+ it should work fine.
  • Nobody has a strong opinion.
    • Pride issue, but it's also time and money
    • Right now the only thing breaking it is Nathan's one-sided code.
    • Let's ask Nathan what he thinks, and if he has time to fix it.
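
The "C11 atomics with locks" path mentioned above can be illustrated with a 64-bit compare-and-swap written against `<stdatomic.h>`. This is a minimal sketch under stated assumptions, not the OSC/RDMA code: the function names are hypothetical. On a target without a native 64-bit CAS, the C implementation is allowed to fall back to an internal lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 8, "sketch assumes a 64-bit operand");

/*
 * 64-bit compare-and-swap via C11 atomics. Returns true if *addr held
 * `expected` and was replaced by `desired`. On failure, the local copy
 * of `expected` is overwritten with the current value and discarded.
 */
static bool cas_u64(_Atomic uint64_t *addr, uint64_t expected,
                    uint64_t desired)
{
    return atomic_compare_exchange_strong(addr, &expected, desired);
}

/*
 * Compile-time hint: ATOMIC_LLONG_LOCK_FREE is 2 when long long atomics
 * are always lock-free, 1 when sometimes, 0 when never (lock-based).
 */
static bool cas_u64_always_lock_free(void)
{
    return ATOMIC_LLONG_LOCK_FREE == 2;
}
```

With GCC on some 32-bit targets the lock-based fallback lives in libatomic, so linking with `-latomic` may be needed.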

v4.0.x

  • blocking on UCX issues (see New topics above)
    • Jeff pinged George.
    • George off at conference next 2 days, will get to it soon.
  • Too many Open Issues (50)
    • Geoff and Howard will go over v4.0.x issues, and try to close or address some of them.
    • Need to label some as wont_fix, let sit for a while, and then close

v4.1

  • blocking on UCX issues (see New topics above)
    • Jeff pinged George.
    • George off at conference next 2 days, will get to after that.
  • Scrubbed a bunch of issues yesterday.
    • Marked a few Issues as critical
  • Intercomm Merge Issue
    • William just started looking at it. Can't reproduce yet.
    • AWS MTT never ran on multiple devices but hit this issue.
      • Separate issue, this issue
    • William will try with multiple devices
  • BTL Vader perf regression. Issue #8603
    • PR v5.0.x #8622 (Comment about penalizing for all versions of GCC), but could put check for GCC version.
      • Title implies GCC < 6, but change is for all compilers.
    • Looks solved, but some performance issues
    • Gilles linked some atomic PRs.
      • Nathan / George approved.
    • Merge into all release branches.

Open-MPI v5.0

  • What do we do with the mpirun Manpage?
    • Didn't want OMPI requiring Sphinx, but if PRRTE and PMIx in same tar
  • Ralph almost has singleton comm spawn working
    • Single node without the mpirun process
  • Static MCA components default still on track for v5.0.x

Video Presentation

  • ECP Community days ( March 30-April 1st )
    • David Bernholdt and/or George Bosilca
    • Each day 90 minute time slots.
    • Get proposal in by this Friday.
    • Tuesday March 30th from 1-2:30pm (US Eastern)
      • Invited some people to speak. They will be our main community speakers.
      • Anyone on OMPI community can send slides to Jeff and George
      • Due Friday March 26th
    • PMIx Wed 31st 11 - 12:30 (US Eastern)

Doc update

  • PR 8329 - convert README, HACKING, and possibly Manpages to reStructuredText.
    • Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
    • Intent is for this to go into v5.0
      • mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
    • Ralph has asked about this for PMIx/PRRTE since this is turning out to work
  • No update - 3/16
    • Could be independent of PMIx and PRRTE.
    • PMIx and PRRTE want to follow suit, and not require both pandoc and Sphinx.

Longer Term discussions

ROMIO Long Term (12/8)

  • OLD
  • What do we want to do about ROMIO in general.
    • OMPIO is the default everywhere.
    • Gilles is saying the changes we made are integration changes.
      • There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
      • We may be able to work with upstream to make a clear API between the two.
    • As a 3rd party package, should we move it up to the 3rd party packaging area, to be clear that we shouldn't make changes to this area?
  • Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
  • Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
  • Putting new tests there
  • Very little there so far, but working on adding some more.
  • Should have some new Sessions tests

MTT

  • Intercomm Merge is getting inconsistent ordering of procs.
    • What is the priority of this?
    • Many of the ibm tests start off by doing some intercomm manipulation.
      • Won't get
  • Mellanox MTT had been failing. Boris set some debug, and they unplugged it.
    • They plan to re-enable it tomorrow.