
WeeklyTelcon_20180605

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Geoff Paulsen
  • Brian
  • Howard Pritchard
  • Josh Hursey
  • Thomas Naughton
  • Nathan Hjelm
  • Joshua Ladd
  • Edgar Gabriel
  • Akvenkatesh
  • Todd Kordenbrock
  • Ralph
  • Xin Zhao

not there today (I keep this for easy cut-n-paste for future notes)

  • Geoffroy Vallee
  • David Bernholdt
  • Howard Pritchard
  • Matthew Dosanjh
  • Dan Topa (LANL)

Agenda/New Business

Minutes

Review v2.x Milestones v2.1.4

  • v2.1.4 - Targeting Oct 15th.
  • No compelling reason, but might pull in the date to assist with more testing on v4.0
  • lower priority to v3.0 and v3.1
  • PR 5217 changes in OSHMEM logic MPI_Initialized/Finalized

Review v3.0.x Milestones v3.0.2

  • Schedule:
    • Still has not shipped
  • v3.0.2 has been tagged and built, and just needs 30 minutes to release.
  • v3.0.3 - targeting Sept 1st (3 months out)
    • Do we want AR64 stuff in v3.0.3? - Up to Nathan. Sounds good.
    • Helps IBM too.

Review v3.1.x Milestones v3.1.0

  • No progress yet.

v4.0.0

  • Schedule: mid-July branch, mid-Sept release.
  • Still working through iWARP issues; LANL waiting for Chelsio RNICs.
  • No further / substantive update since last week (4 day weekend prevented a bunch of work this past weekend).
  • Favor external over internal components: hwloc, PMIx, and libevent.
  • PMIx v3.0 updates to ORTE
  • Xin - OSHMEM PRs going in today for review
  • Edgar - Did you want to make something default?

PMIx

  • Ralph will put a PMIXv3.0
    • An update as well to ORTE code
    • Update to PMIX v3.0 component (PMIX branched for v3.0)
    • On PRTE side of things, quite a few bugfixes that haven't been implemented in the orte code.
    • Not using any PMIx v3.0 features in Open MPI yet, but Ralph is interested in pre-setting endpoints (option)
    • Preliminary Debugger connection stuff.
    • Not going to touch MPIR.
    • New feature coming later this summer: look at network topology and decide how to run collectives based on that.
  • OMPI has an MCA framework for ORTE and a STICKY component. We'd added a PMIx component, where you could run MPI just on PMIx.
    • With OMPI PMIX RTE - right now it's a static framework, so you either build PMIX RTE or ORTE. There is no great reason for that yet; we just need to put things behind function pointers.
      • Or perhaps this might be a bit moot.
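
To illustrate the "put things behind function pointers" remark above, here is a minimal sketch of choosing a runtime backend at run time instead of at build time. Everything in it is an assumption made for illustration: the `rte_module_t` struct, the `demo_*` functions, and the backend names "orte" and "pmix" are hypothetical and are not Open MPI's actual MCA or RTE interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical runtime interface: a table of function pointers.
 * These names are illustrative only, not Open MPI's real API. */
typedef struct rte_module {
    const char *name;
    int (*init)(void);
    int (*spawn)(const char *cmd);
    int (*finalize)(void);
} rte_module_t;

/* Stub "ORTE-like" backend (hypothetical). */
static int demo_orte_init(void) { return 0; }
static int demo_orte_spawn(const char *cmd) { return cmd != NULL ? 0 : -1; }
static int demo_orte_finalize(void) { return 0; }

/* Stub "PMIx-only" backend (hypothetical). */
static int demo_pmix_init(void) { return 0; }
static int demo_pmix_spawn(const char *cmd) { return cmd != NULL ? 0 : -1; }
static int demo_pmix_finalize(void) { return 0; }

/* Both backends are compiled in; the choice is deferred to run time. */
static const rte_module_t demo_modules[] = {
    { "orte", demo_orte_init, demo_orte_spawn, demo_orte_finalize },
    { "pmix", demo_pmix_init, demo_pmix_spawn, demo_pmix_finalize },
};

/* Return the backend with the given name, or NULL if none matches. */
const rte_module_t *demo_rte_select(const char *name)
{
    assert(name != NULL);
    size_t n = sizeof(demo_modules) / sizeof(demo_modules[0]);
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(demo_modules[i].name, name) == 0) {
            return &demo_modules[i];
        }
    }
    return NULL;
}
```

A caller would look up a backend once, for example `const rte_module_t *rte = demo_rte_select("pmix");`, and then call `rte->init()`, `rte->spawn("./a.out")`, and `rte->finalize()` without knowing which backend sits underneath.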

New topics

  • Overall Runtime Discussion (talking v5.0 timeframe, 2019)
    • What is it that we want? It's changed a bit since last Face to Face.
    • Getting confused about the Goal - Regardless of who and when, let's discuss what.
  • What? Two Options:
    1. Keep going on our current path, and taking updates to ORTE, etc.
    2. Shuffle our code a bit (new ompi_rte framework merged with orte_pmix framework, moved down and renamed)
      • Opal used to be single process abstraction, but not as true anymore.
      • API of foo, looks pretty much like PMIx API.
        • Still have PMIx v2.0, PMI2 or other components (all retooled for new framework to use PMIx)
      • To call, just call opal_foo.spawn(), etc., then you get whatever component is underneath.
      • what about mpirun? Well, PRTE comes in, it's the server side of the PMIx stuff.
      • Could use their prun and wrap in a new mpirun wrapper
      • PRTE doesn't just replace ORTE. PRTE and OMPI layer don't really interact with each other, they both call the same OPAL layer (which contains PMIx, and other OPAL stuff).
        • prun has a lam-boot looking approach.
      • Build system about opal, etc. Code shuffling, retooling of components.
      • We want to leverage the work the PMIx community is doing correctly.
  • If we do this, we still need people to do runtime work over in PRTE.
    • In some ways it might be harder to get resources from management for yet another project.
    • Nice to have a componentized interface, without moving runtime to a 3rd party project.
    • Need to think about it.
  • Concerns about the work of adding ORTE PMIx integration.
  • Want to know the state of SLURM PMIx Plugin with PMIx v3.x
    • It should build, and work with v3. They only implemented about 5 interfaces, and they haven't changed.
  • A few related to OMPIx project, talking about how much to contribute to this effort.
    • How to factor in requirements of OSHMEM (who use our runtimes), and already doing things to adapt.
    • Would be nice to support both groups with a straightforward component to handle both of these.
  • Thinking about how much effort this will be, and how to manage these tasks in a timely manner.
  • Testing, will need to discuss how to best test all of this.
  • ACTION: Let's go off and reflect and discuss at next week's Web-Ex.
    • We aren't going to do this before v4.0 branches in mid-July.
    • Need to be thinking about the Schedule, action items, and owners.

Review Master Pull Requests

  • Decided to file https://github.com/open-mpi/ompi/pull/5200 to begin the long process of deleting osc/pt2pt (by enabling all relevant RDMA BTLs so that every transport will use osc/rdma).
  • Nightly Coverity runs are failing -- Brian to investigate why.
    • Brian fixed. - Accidentally broke Coverity when we removed the SPMLs.
    • Coverity Build now builds against both libfabric and UCX
      • Build Only.
      • x86 linux ubuntu 16.04
      • If you want Brian to add your component to Coverity build, contact him.
  • Anything Jeff can help with Absoft and NAG licenses?
  • Hope to have better Cisco MTT in a week or two

  • aarch64 and cray master is failing to build with AlltoAllw INTER something.

    • Might be an MPI1 casualty. Several MPI1 cleanup PRs.
    • Giles caught some stuff.
  • Next Face to Face?

    • When? Late summer, early fall?
    • Where? San Jose - Cisco, Albuquerque - Sandia
    • Supercomputing is in Dallas this year in Nov.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018
