
WeeklyTelcon_20211012

Geoffrey Paulsen edited this page Nov 2, 2021 · 1 revision

Open MPI Weekly Telecon

Attendees (on Web-ex)

  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • David Bernholdt (ORNL)
  • Geoffrey Paulsen (IBM)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (NVIDIA)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart (HLRS)
  • Matthew Dosanjh (Sandia)
  • Sam Gutierrez (LANL)
  • Todd Kordenbrock (Sandia)
  • Tomislav Janjusic (NVIDIA)
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (NVIDIA)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Brian Barrett (AWS) - Welcome Back!
  • Charles Shereda (LLNL)
  • Christoph Niethammer (HLRS)
  • Edgar Gabriel (UH)
  • Erik Zeiske (HPE)
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Josh Hursey (IBM)
  • Joshua Ladd (NVIDIA)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Raghu Raja
  • Ralph Castain (Intel)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Sriraj Paul (Intel)
  • Thomas Naughton (ORNL)
  • Xin Zhao (NVIDIA)

New Topics For Today

  • Discuss the relative submodule path issue

    • Only master and v5.0.x.
    • Request from distributor to change from https:// to relative path.
    • Works in git client in RHEL7 (git in RHEL6 was too old)
    • Some issues:
      • One client was using a mirror, but accidentally using https for submodules. This exposed that.
    • If this is a problem, please let us know.
  • Do the Fortran fixes affect the API? (i.e. are they needed for v5.0.0?)
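
To make the relative submodule path discussion above concrete, here is a minimal sketch of how a relative URL in .gitmodules resolves against the URL the superproject was cloned from. It is a simplified model, not git's actual implementation: it only handles scheme://host/path style URLs, and the function name and the mirror URL are hypothetical illustrations.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def resolve_submodule_url(parent_url: str, submodule_url: str) -> str:
    """Resolve a .gitmodules URL against the superproject's remote URL.

    Simplified model (assumption): only scheme://host/path parent URLs
    are handled; scp-style remotes such as git@host:path are not.
    """
    # Absolute submodule URLs (e.g. https://...) are used unchanged.
    if not submodule_url.startswith(("./", "../")):
        return submodule_url

    parts = urlsplit(parent_url)
    # The parent repository path is treated like a directory, so each
    # "../" strips one trailing path component from it.
    path = posixpath.normpath(posixpath.join(parts.path, submodule_url))
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    # Cloned from the upstream repository.
    print(resolve_submodule_url(
        "https://github.com/open-mpi/ompi.git", "../../openpmix/prrte.git"))
    # Cloned from a hypothetical mirror: the submodule now follows the mirror.
    print(resolve_submodule_url(
        "https://mirror.example.com/open-mpi/ompi.git", "../../openpmix/prrte.git"))
```

With relative paths, submodules follow whatever host the superproject was cloned from, so a mirror has to provide the submodule repositories at the matching sibling paths. With absolute https:// URLs, submodules always come from the original host, which is how a mirror user could be fetching them over https without noticing.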

v4.0.x

  • Schedule: Pushed to October for 4.0.7
  • --cpu-set - Geoff working on PR for nice warning/docs
  • Fortran PR 9259, 9367 probably affect v4.0.x branch as well.
    • Geoff will follow up.
  • Geoff and Howard

v4.1.x

  • Schedule:
    • Made a v4.1.2rc1 - Please TEST.
  • OFI and Memchecker
    • One more is pending on v4.1.x; Jenkins had some issues that Brian is looking at.
  • Common OFI memory registration
    • Brian is backporting to v4.1 and v5.0.
  • Issue #9462: Debian saw a new segfault in the 32-bit build of vader.
    • Not a v4.1 issue
    • Details on Debian tracker.
    • Jeff asked for a reproducer.
      • Sent a complicated reproducer.
    • Not obviously an Open MPI bug with existing information.

v5.0.x

  • Schedule: aiming for rc1 on Sept 23rd.
  • George was able to verify that the BTL+OSC RDMA failures are not specific to IBM.
  • Some activity on relocation problem
    • A bunch of back and forth and confusion about the problem.
    • Some clarity over night. Moving in the right direction.
    • Some PR on master, not on v5.0 yet.
  • Blocker: the v4.1.x Common/OFI blocker is also in v5.0.x.
  • PR #9495 TCP Onesided for master.
  • Tommy's still pushing on UCX Onesided.
  • Think there are other issues than just one sided.
    • 5 in issues, only 2 are one-sided.
    • One is static linking, Austen will reverify
  • Talk about gcc v4.4.7 and RHEL6
    • PMIx and PRRTE just don't compile on RHEL6 (specifically gcc v4.4.7). Given that, do we even care about RHEL6?
    • RHEL7 (gcc v4.8.5) works fine.
    • Pulled the portable platform from GASNet; the old copy didn't even know what Clang was.
      • Fujitsu put in a pull request for another fixup.
      • Brian did a portable platform update from downstream.
    • Now errors out if you try to use a gcc that is too old.
  • Documentation
    • Got a needed change into the Sphinx tools. Not sure if there's a release yet.
      • This fixes outputting issues in manpages.
    • Process to update FAQ is to talk to Jeff or Harumi.
    • Any changes in README or FAQ let them know to make changes in NEW docs.
      • For now, make changes in ompi-www and README as usual and let them know.
  • v5.0.x requires pandoc. If a user builds from a downloaded tarball, they do NOT need pandoc installed.
    • If user runs make dist or make dist-check they WILL need pandoc.
      • This is a strange quirk, but seems fine.
  • Problem with OFI and Open MPI
    • No discussion
  • GitHub Project of critical v5.0.x issues: https://github.com/open-mpi/ompi/projects/3
    • Issue #8983 If we partially disable OSC/TCP BTL - Not breaking MPI compliance, just breaking One-sided performance badly.
    • Described the approach for rc1 on Sept 23: disable any functionality that is a blocker so that the rc can go out.
      • Worried that blockers might not be fixed in time, so will put in code to issue an error at runtime to prevent getting into those paths, and document it heavily.
  • MPI_Alltoallw needs to go in. Is a PR from Giles George

Supercomputing (SC) BoF

  • Was accepted for Open MPI
    • Our hybrid BoF will be mostly virtual.
      • George may be there in person for a tutorial (though other tutorials will be fully virtual).
    • The Birds of a Feather session will be virtual.
    • George sent out an email to Amazon, Cisco, IBM, NVIDIA.

Master

Documentation

  • No update
  • Don't do the old system, use this new system for v5.0.0

MPI 4.0 API

  • No discussion. Open MPI 4.0 API Compliance GitHub Project: https://github.com/open-mpi/ompi/projects/2
  • Jeff's going to review PR 9246
  • Howard will review 7985
  • Need to decide what to do with 8057
  • Sessions branch, don't want to merge into master until possibly v5.0.1 gets out.
    • It will complicate things in finalize/initialize code.

MTT

  • Looking okay.
  • Looks like something was wrong with MTT.
    • That machine just got upgraded.
    • Install fail is kinda weird.

Longer Term discussions

  • No discussion.