
WeeklyTelcon_20190806

Geoffrey Paulsen edited this page Aug 6, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoff Paulsen (IBM)
  • Jeff Squyres (Cisco)
  • Akshay Venkatesh (nVidia)
  • Artem Polyakov (Mellanox)
  • Brendan Cunningham (Intel)
  • Brian Barrett (Amazon)
  • Howard Pritchard (LANL)
  • Josh Hursey (IBM)
  • Michael Heinz (Intel)
  • Ralph Castain (Intel)
  • Thomas Naughton
  • Todd Kordenbrock

not there today (I keep this for easy cut-n-paste for future notes)

  • Edgar Gabriel (UH)
  • Harumi Kuno (HPE)
  • Joshua Ladd (Mellanox)
  • Noah Evans (Sandia)
  • Aravind Gopalakrishnan (Intel)
  • Arm (UTK)
  • Brandon Yates (Intel)
  • Dan Topa (LANL)
  • David Bernhold
  • Geoffroy Vallee
  • George Bosilca (UTK)
  • Jake Hemstad
  • Mark Allen (IBM)
  • Matias Cabral
  • Matthew Dosanjh (Sandia)
  • Nathan Hjelm
  • Peter Gottesman (Cisco)
  • Xin Zhao (Mellanox)
  • mohan

Agenda/New Business

  • Git submodules

    • This PR is in progress. Requires CI owners to add --recursive to their Jenkins git clone commands.
    • As a first step, Jeff created:
      • PR 6821 "hwloc201 use a submodule"
    • Brian will not have cycles for a few weeks.
    • Jenkins has an issue that Brian.
  • What to do with OFI BTL and OFI MTL

    • Harumi Kuno (HPE) - Discussion about OMPI's component philosophy
    • mail archive: https://www.mail-archive.com/devel@lists.open-mpi.org/msg20736.html
    • ofi/BTL and MTL components can step on each other.
    • PSM2: when a user of PSM2 calls PSM2_Finalize while there is still a PSM2 provider, the refcount is not honored. PSM2 refcounting is only observed when initializing, not when finalizing, so the first finalize was finalizing PSM2 for the entire job.
    • No progress. Brendan is looking at this on the PSM2 side.
    • What are the plans for PSM2 and the MTL, etc?
    • Still fully supporting PSM2. The PSM1 adapters reach end-of-life in March 2020. Will probably remove PSM1 code from v5.0 and master. (Michael Heinz)
  • Status of Scale testing

    • No update. Blocking on Amazon time, lower priority.
    • Issue 6786 "OMPI 4.0.1 TCP connection errors beyond 86 nodes"
    • Issue 6198 "SSH launch fails when host file has more than 64 hosts"
    • IBM is also working on something like this as well (for ssh launch)
      • Prefer this every night, instead of each PR.
  • Issue 6799 "UFM buffers failing in cuIpcGetMemHandle ?"

    • No update
  • Issue 6831

    • https://engineering.mongodb.com/post/succeeding-with-clangformat-part-1-pitfalls-and-planning
    • Should get this cleaned up. Need one big PR fix.
    • Whitespace vs Tab cleanup.
    • Good conversation on PR.
    • Should we have CI for this?
    • MongoDB did something similar; the post linked above covers branches, issues, and why they went with clang-format.
    • After folks write the scripts, then adding to CI is no problem.
    • Want it to be EASY to add local githooks so CI isn't first line for these.
    • Giant clean up commits should be done on each
    • Implementation details:
      • It might be easy to use clang for the CI / formatting.
      • clang enforces a set of things, but it may require more than
      • We have a requirement in Open MPI that says you write 'if (NULL == var)'
        • Very hard to enforce this in Perl, and gcc can't give us an AST to check at that level.
      • Run clang far enough to get an AST, then use that to do the formatting.
        • you can now run clang_format.py reformat-branch T R (using T and R from the algorithm above) to easily bring a stranded topic branch forward after a reformat commit.
      • Most of us don't use clang, so adding yet another dependency (like clang) would be painful.
    • Whitespace is how this started, so perhaps just fix the whitespace issues, and use both githooks and CI to enforce them.
    • Scripts are mentioned in the PR.
    • Most of these scripts UPDATE the git commit; for CI we want them just to check.
    • Command line example on how to add to git hooks.
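
The distinction noted above, that hook scripts rewrite files while CI should only check, can be sketched as one routine with two modes. This is a minimal Python sketch, not one of the scripts referenced in the PR. The function names, the `--fix` flag, the four-space tab width, and the choice of rules (tabs in indentation and trailing whitespace only) are all assumptions made for illustration.

```python
def find_problems(text):
    """Check mode: return (line_number, description) pairs, modify nothing."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append((number, "tab in indentation"))
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
    return problems


def fix(text, tab_width=4):
    """Fix mode: expand tabs in indentation and strip trailing whitespace."""
    fixed = []
    for line in text.splitlines():
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        fixed.append((indent.expandtabs(tab_width) + stripped).rstrip())
    return "\n".join(fixed) + ("\n" if text.endswith("\n") else "")


def main(argv):
    """Hypothetical CLI: [--fix] FILE...  Returns 1 if check mode finds problems."""
    fix_mode = "--fix" in argv
    paths = [arg for arg in argv if arg != "--fix"]
    status = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
        if fix_mode:
            # Local git hook behaviour: rewrite the file in place.
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(fix(text))
        else:
            # CI behaviour: report only, and signal failure via the status.
            for number, what in find_problems(text):
                print(f"{path}:{number}: {what}")
                status = 1
    return status
```

A local hook could call `main` with `--fix`, while CI would call it without the flag and pass the returned status to `sys.exit`, so the same rules apply in both places and only the hook changes files.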

Infrastructure

Transition website, and email to AWS

  • Complete

Process enforcement bots

  • No update

Submodule prototype

  • Suggest just doing hwloc (stable and not too much development) first
  • No update

Release Branches

Review v3.0.x Milestones v3.0.4

Review v3.1.x Milestones v3.1.4

  • v3.0.x MPIR_Breakpoint issue: need a bit more data on why -O3
  • Tested new PMIx
    • Exposed a few new test suite issues in "ibm", but fixed

Review v4.0.x Milestones v4.0.2

  • Akshay will test new datatypes with CUDA.

    • Will test on master maybe v4.0.x too.
  • PR against v4.0.x to pull in latest PMIx

  • Many bugfixes waiting for 4.0.1, we should try to get 4.0.2 out the door.

  • OB1 get protocol problem (Issue 6568) - Nice to have, but not a blocker since everything but Mac has CMA

  • George is back from vacation, want two things before rc1

    • Datatype work, master PR for datatypes
    • Also ob1 get/put path problem
    • Edgar just reported a bug
  • Howard is verifying 6613 MPIR Disappearing queue on re-attach.

  • PR6806 - Want to wait until CI is back. Do we have any tests to test this?

    • Howard will reproduce and add to ibm suite
  • 2nd Put issue, Issue 6568 (Vader deadlocking with 4MB transfers)

    • waiting on George to return (end of the month)
  • New Datatype work https://github.com/open-mpi/ompi/pull/6695 (master)

    • Want for v4.0.2
    • Now approved for master.
    • waiting on George to return (end of the month). We could merge to master, but if any issues, we'd need George to fix.
  • https://github.com/open-mpi/ompi/issues/6568 - put protocol has lost its pipelining.

    • Combination of both ob1 and vader.
    • Right now only shows in vader, because all others prefer get protocol.
    • Vader generates a bunch of 32K frags, so a 4MB transfer overwhelms vader.
    • Does NOT occur with single copy like CMA or KNEM.
    • Marked as a blocker, but won't block RCs, just
    • Is this a regression? Not sure if it was ever implemented.
    • Used to be some pipelining, used to work. Not sure why it's showing up.
    • Everything George knows is in the ticket.
    • Need a throttle for large messages.
  • Issue 6789 - OMPI crashes when configured with ucx version

    • Issue with PML UCX conflicting with btl_uct - memory hooks
    • New this week: Howard not convinced it's memory hooks.
    • Howard can't reproduce. Asking user to
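
The Issue 6568 notes above say that vader splits a 4MB put into 32K fragments, that this overwhelms vader, and that a throttle for large messages is needed. The Python sketch below only illustrates the idea of capping the number of outstanding fragments. It is not the ob1 or vader code. The function name, the window sizes, and the acknowledgement model (the oldest fragment completes as soon as the window is full) are assumptions.

```python
from collections import deque

FRAG_SIZE = 32 * 1024            # fragment size mentioned in the notes
MESSAGE_SIZE = 4 * 1024 * 1024   # transfer size mentioned in the notes


def send_throttled(message_size, frag_size, max_in_flight):
    """Simulate a pipelined send with at most max_in_flight fragments
    outstanding. Returns (fragments_sent, peak_in_flight)."""
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    total_frags = -(-message_size // frag_size)  # ceiling division
    in_flight = deque()
    sent = 0
    peak = 0
    while sent < total_frags:
        if len(in_flight) == max_in_flight:
            # Window is full: wait for the oldest fragment to complete
            # (modelled here as an immediate acknowledgement).
            in_flight.popleft()
        in_flight.append(sent)
        sent += 1
        peak = max(peak, len(in_flight))
    return sent, peak
```

A 4MB message at 32K per fragment is 128 fragments. With a window at least that large, all 128 are outstanding at once, which is the unthrottled case described in the notes; with a window of 8, the peak stays at 8.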

Review Master Pull Requests

  • PR6556 and 6621 should go to the release branches.
    • no update
  • Good reminder that we now need to be careful about OPAL's ABI.

CI should be working now.

  • Not a great way to test CI before

v5.0.0

  • When do we get rid of 32bit?
  • Still don't have any release manager.
    • Need to identify someone in next few months.

Dependencies

PMIx Update

  • a bunch of stuff going on, but nothing necessarily impacting OMPI.
  • Made a change for Nathan - allow you to get locality of other processes on node.
    • Allows you to hook up with shared memory
  • The master version of PMIx can support network coordinates of any NIC and, depending on the type of network, can map them for each process.
    • "network coordinates" - map to MPI network topology definition.
    • Fujitsu and Cray are implementing.
  • In PMIx, when doing instant-on, the scheduler queries the ___ plugin to get a payload of the info you want. If the process is bound to a certain socket, the payload says which NIC it should use and which others are available. Then you assign the endpoint to that NIC.
    • Requires Instant-On? - simple to do without instant-on if you want to.

ORTE/PRRTE

  • Aug 7th - web-ex meeting.
  • Gilles' PRRTE work was done differently than what we're now proposing. New proposal uses submodules, etc.

Next face to face

MTT

  • IBM has to triage some failures on master and v4.0.x

