
WeeklyTelcon_20210202

Geoffrey Paulsen edited this page Feb 5, 2021 · 1 revision

Open MPI Weekly Telecon ---

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Jeff Squyres (Cisco)
  • Howard Pritchard (LANL)
  • Ralph Castain (Intel)
  • Geoffrey Paulsen (IBM)
  • Austen Lauria (IBM)
  • Joseph Schuchart
  • Hessam Mirsadeghi (UCX/nVidia)
  • Edgar Gabriel (UH)
  • Brendan Cunningham (Cornelis Networks)
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Naughton III, Thomas (ORNL)
  • Raghu Raja (AWS)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)
  • George Bosilca (UTK)
  • Aurelien Bouteiller (UTK)
  • Christoph Niethammer (HLRS)
  • Harumi Kuno (HPE)

not there today (I keep this for easy cut-n-paste for future notes)

  • Brian Barrett (AWS)
  • David Bernholdt (ORNL)
  • Joshua Ladd (nVidia/Mellanox)
  • Michael Heinz (Cornelis Networks)
  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Tomislav Janjusic
  • Xin Zhao (nVidia/Mellanox)

Chronologically, we started with New Topics this week (before release branches)

4.0.x

  • v4.0 release: would like to take this ROMIO one-off fix instead of a full ROMIO update.
    • https://github.com/open-mpi/ompi/pull/8370 - Fixes HDF5 on LUSTRE
    • Proposing to take this one-off for v4.0.6, as a whole new ROMIO is a big change.
    • Waiting on v4.0.6rc2 until we get an answer.
    • Everyone seems okay with taking this into release branch, and waiting for ROMIO update on master.
    • Merged
  • Schedule - If we could get something for Issue 8321, we can do an RC soon.

v4.1

  • Jeff pushed a few commits to PR 8376

    • Pro - if using Intel compiler, it'll . nice runtime option along with configure option.
      • Nice
      • Also added supported read-only, but what you use may be different.
      • Also used Enum flags to see/set what is there.
      • George is against completely disabling AVX, since that disables things too aggressively.
        • Consistency is good, and George wants to be consistent, so he is against these commits.
        • This commit prevents all users from
        • problem is only seen on certain processors. Can't reproduce in other family of processors.
        • And compiler versions seem to matter.
      • Hoping then people can white-list certain processors as
      • v4.1 is the first series this went into, so perhaps we need more time to bake.
      • Historically if it's broken in one case, we generally just put a warning out in that case.
      • v4.1.0 is out there already, and could put in a bigger hammer option in v4.2.0
      • Jeff will amend the 2 commits he added, and then remove one restriction.
  • PR 8435 - Moved a feature from Tuned to base, and use it in libnbc.

    • George will write up a how to use this, and Jeff will get into doc.
  • Will do a v4.1.1 RC

  • Issue 8334 - a performance regression with AVX512 on Skylake. Still digging into.

  • Issue 8410 - Build Failure on Apple Silicon.

    • Do we just need a new updated string, or is that just one of the issues?
    • Code changes we need in v4.1.
    • Will have exact same problem in PMIx and PRRTE
    • Performance with Atomic FIFO is another issue, might not need to backport to v4.1
      • Closed
  • Issue 8367 - will take to UCX community

    • Not yet brought up to UCX community. Josh will take up
  • Issue 8379 - UCT appears to be default and not UCX

    • Jeff repinged for request
      • Does UCT BTL even get built?
      • Still in discussion in Issue 8102.
        • Common misconception that people can install over an existing install.
  • Might be an older mca component from

  • We had a PR to have a Unique signature for each build.

    • If we had this, we could use this signature in the modules themselves, but then we'd avoid this issue at runtime, and only open mca if from same build.
      • We currently have something for mca VERSION, but we never update the mca version.
      • So maybe we want to add OMPI version into this mca version check.
        • But this might not be enough, as recompiles might have different configure.
        • We need something to identify the configure itself.
  • 8431 - git commit checks as action.

  • hwloc are we tracking the usage of the hwloc topology loads?

    • George wants to take a stab at it. Using it in HAN and Treematch
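
The per-build signature idea discussed above (Open MPI version plus something that identifies the configure) could be sketched roughly as follows. This is only an illustration, not what the PR implements: `build_signature`, `may_open`, the 16-character SHA-256 prefix, and the sample version and configure arguments are all assumptions made for this example.

```python
import hashlib
from typing import Sequence


def build_signature(version: str, configure_args: Sequence[str]) -> str:
    """Hypothetical signature: hash of the version plus the configure arguments."""
    # Sorting makes the same options in a different order hash identically
    # (an assumption of this sketch; a real scheme might hash more inputs).
    payload = "\n".join([version, *sorted(configure_args)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def may_open(component_signature: str, library_signature: str) -> bool:
    """Only open a component that was stamped by the same build."""
    return component_signature == library_signature


# Sample values, chosen only for illustration.
library = build_signature("4.1.0", ["--with-ucx", "--prefix=/opt/ompi"])
same_build = build_signature("4.1.0", ["--prefix=/opt/ompi", "--with-ucx"])
other_configure = build_signature("4.1.0", ["--prefix=/opt/ompi"])

print(may_open(same_build, library))       # True
print(may_open(other_configure, library))  # False
```

With a check like this, a stale component left behind by installing over an existing install would be skipped at runtime even when the version string matches.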

Open-MPI v5.0

Comment that we can't branch until PRRTE is ready.

  • MTT is showing that the master branch is pretty good. We don't need to wait for PRRTE to be complete to branch v5.0.x in OMPI
  • Raghu added an entry for libfabric.
  • One-sided tests are still busted. Do we keep running these if they're failing?
    • Nathan is actively working on, so hopeful we'll get this.

What's the state of ULFM (PR 7740) for v5.0?

  • Adding ULFM tests to new public repo
  • Are we Feature Complete?
    • PRRTE should be ready end of Q1.
    • Based on v5.0 tracker, there is a bunch of stuff not in.
    • GPU Direct support for OFI MTL
      • AWS working on now. Need to rebase, and upstream.
    • OFI BTL changes need to get upstreamed.
    • Weeks for MTL
  • Edgar atomicity issue for OMPIO. Not sure if it's a full feature, but need to have on radar.
    • ETA: a few days after Edgar finds time. 2-3 weeks.
  • Any other big features?
  • Branch Date will discuss next week.

New Topics

  • How to implement so that ./configure --help presents all configure options to users?
  • Brian sent a good summary to devel mailing list. Presented a summary and 3 different options
    1. Document (possibly at the end of the default --help output, definitely in the README) that you need to run --help=recursive to see the options for PMIx and PRRTE
      • Nice, but very complex. Also GIANT output and not super friendly.
        • This won't show any hwloc help, probably because it's not git submodules.
        • This would need to be fixed, and especially if we go this direction.
      • This way will STILL warn at high level that it doesn't understand an argument, and then it'll be picked up at a lower level.
    2. Add “dummy” help options for the parameters from PMIx and PRRTE we think are worth exporting.  This is likely prime for bit rot
      • This is frowned upon... We've been hit by bit-rot quite a bit.
    3. Josh’s script to create a dummy help option for each argument in PMIx and PRRTE not in the top level configure.
      • email states incorrect PR. Correct PR is: PR 8409
    • Sometimes there are options we want to pass to one subcomponent but not the other.
      • We have --with-feature-X, but by keeping these separate they won't be "mixed", as they might mean different things.
      • In this way, configure options that ompi configure doesn't recognize (3rd party args), won't warn.
  • Probably don't want to give users TOO much control of subcomponents.
  • If users want more control they can use External. This allows us to not need to get too complex in configure.
  • gcc also has similar problem, and they're okay without prefixing.
  • Don't want to sever connectivity to embedded packages.
  • Arguments against #3.
    • You would not have visibility of which subcomponent a particular option belongs to.
      • But users shouldn't have to worry about where a configure flag is implemented.
  • Process returns wrong result unless pml is ^ucx.
  • Should we release a v4.0.6 with a PR that would disable building ucx against older than UCX 1.9 (current UCX)
    • This is blocking v4.0.6
    • This would be a drastic change to deny all UCX before current 1.9
      • Hessam / Yossi are looking into this.
  • Please get back this week.
  • We'd like to ship v4.0.6 soon, and getting a more specific fix would be better than the big hammer.
  • Would be good to do both configure time and runtime
  • Assume this affects v4.1 as well.
  • Should be straightforward to chase down, and
    • Possibly an issue with collective and UCX in this runmode.
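
For the third `./configure --help` option discussed above (a script that creates a dummy help entry for each PMIx/PRRTE argument missing from the top-level configure), the first step is finding which options are missing. The sketch below is not the script from PR 8409; the regular expression, the function names, and the sample help text (including every option name in it) are assumptions for illustration.

```python
import re

# Matches option names at the start of a help line, e.g. "  --with-foo=DIR".
OPTION_RE = re.compile(
    r"^\s*(--(?:with|without|enable|disable)-[A-Za-z0-9_-]+)", re.MULTILINE
)


def option_names(help_text):
    """Return the set of option names listed in one configure --help output."""
    return set(OPTION_RE.findall(help_text))


def missing_from_top_level(top_help, sub_helps):
    """Map each subproject to the options the top-level help does not list."""
    known = option_names(top_help)
    return {
        name: sorted(option_names(text) - known)
        for name, text in sub_helps.items()
    }


# Made-up help excerpts; real output would come from each configure script.
TOP_HELP = """
  --enable-debug          enable debugging
  --with-example-a=DIR    hypothetical top-level option
"""
PMIX_HELP = """
  --enable-debug          enable debugging
  --with-feature-x=DIR    hypothetical PMIx option
"""
PRRTE_HELP = """
  --with-feature-x[=DIR]  hypothetical PRRTE option
  --disable-example-b     hypothetical PRRTE option
"""

missing = missing_from_top_level(TOP_HELP, {"pmix": PMIX_HELP, "prrte": PRRTE_HELP})
for project, options in missing.items():
    print(project, options)
```

Note that `--with-feature-x` shows up under both subprojects here, which is exactly the case raised above where one name might mean different things to different subcomponents.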

Should we accept PR 8406 or drop it?

  • PR 8406 - Technically not needed.
  • This PR is redundant with prior fix already in.
  • Already in v4.1

Setup Github Teams

  • Jeff can set this up so we have a single point of contact in GitHub that many members of an organization can watch
    • Don't go crazy to start, just set up a few

Doc update

  • PR 8329 - convert README, HACKING, and possibly manpages to reStructuredText.
    • Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
    • Has a build from this PR, so we can see what it looks like.
    • Have a look. It's a different approach to have one document that's the whole thing.
      • FAQ, README, HACKING.
  • Do people even use manpages anymore? Do we need/want them in our tarballs?
    • Useful for tools, offline deployments.
      • not really for APIs.
  • 2/2 Update Going well.
  • It's slow going through the FAQ, validating and refreshing the content.
    • Aimed at v5.0
  • Probably will rearrange this. We no longer need a FAQ as such; now that we're going to have docs, the content will be rearranged into sections.
  • May want to archive existing content.

Longer Term discussions

ROMIO Long Term (12/8)

  • What do we want to do about ROMIO in general.
    • OMPIO is the default everywhere.
    • Gilles is saying the changes we made are integration changes.
      • There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
      • We may be able to work with upstream to make a clear API between the two.
    • As a 3rd party package, should we move it up to the 3rd party packaging area, to be clear that we shouldn't make changes to this area?
  • Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
  • Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
  • Putting new tests there
  • Very little there so far, but working on adding some more.
  • Should have some new Sessions tests
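
The CI bot idea mentioned above (watch a set of files and flag PRs that modify imported third-party code) amounts to a path-prefix check on the list of changed files. A minimal sketch, assuming the bot already has that list; the function name, the protected prefixes, and the file paths are hypothetical and do not describe the real repository layout.

```python
def flag_protected_changes(changed_files, protected_prefixes):
    """Return the changed paths that fall under a protected directory."""
    prefixes = tuple(protected_prefixes)
    return sorted(path for path in changed_files if path.startswith(prefixes))


# Hypothetical directories holding imported upstream packages.
PROTECTED_PREFIXES = ["3rd-party/", "vendor/treematch/"]

# Hypothetical list of files touched by a pull request.
changed = [
    "ompi/mpi/c/send.c",
    "3rd-party/romio/adio/common/ad_open.c",
    "vendor/treematch/tm_tree.c",
]

flagged = flag_protected_changes(changed, PROTECTED_PREFIXES)
if flagged:
    print("PR touches imported third-party code:")
    for path in flagged:
        print("  " + path)
```

A bot built on this would only raise a flag for reviewers; whether a flagged change is acceptable (for example, an integration fix) remains a human decision.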

What's going to be the state of the SM Cuda BTL and CUDA support in v5.0?

  • What's the general state? Any known issues?

  • AWS would like to get.

  • Josh Ladd - Will take internally to see what they have to say.

  • From nVidia/Mellanox, Cuda Support is through UCX, SM Cuda isn't tested that much.

  • Hessam Mirsadeghi - All Cuda awareness through UCX

  • May ask George Bosilca about this.

  • Don't want to remove a BTL if someone is interested in it.

  • UCX also supports TCP via CUDA

  • PRRTE CLI on v5.0 will have some GPU functionality that Ralph is working on

  • Update 11/17/2020

    • UTK is interested in this BTL, and maybe others.
    • Still gap in the MTL use-case.
    • nVidia is not maintaining SMCuda anymore. All CUDA support will be through UCX
    • What's the state of the shared memory in the BTL?
      • This is the really old generation Shared Memory. Older than Vader.
    • Was told after a certain point, no more development in SM Cuda.
    • One option might be to
    • Another option might be to bring that SM in SMCuda to Vader (now SM)
  • Discussion on:

    • Didn't get to this week. :(
    • Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
    • One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
    • Non-Homogenous clusters (GPUs on some nodes, and non-GPUs on some other)

Video Presentation

  • ECP Community days ( March 30-April 1st )
    • David Bernholdt and/or George Bosilca
    • Each day 90 minute time slots.
    • Get proposal in by this Friday.