WeeklyTelcon_20170530

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Edgar Gabriel
  • Artem Polyakov
  • Jeff Squyres (Cisco)
  • Howard
  • Josh Hursey
  • Joshua Ladd
  • Murali (LLNL)
  • Todd Kordenbrock
  • David Bernholdt
  • Nathan Hjelm
  • Ralph
  • Brian (Amazon)

Agenda

1.10.8 - no plans.

  • 1.10.x branch is closed, don't bother filing PRs.

2.0.3

  • Howard is doing a few more items on the Checklist, with plans to release June 1st.
  • Ralph closed some really old Issues that were not updated in a long time.
  • This week: there are a lot of issues on the v2.0.x milestone. We'd like to move these to either v2.1.x or v3.x. Do we need to fix them in v2.0.x, or later?
    • EVERYONE please review open v2.0.x Issues.
    • Close them if they are already addressed.
    • Move them if needed.

Review v2.x

  • No update here. No reason to update, and no schedule for the next release.
  • Take this offline to talk at face to face:
    • Issue 3442 - 32bit builds are busted, probably affects v2.1.x also.
    • Could be exotic architecture issue, or possibly just our CMA glue isn't right. CMA seems to be masking the issue?
  • Ralph created an unofficial RC1 of PMIx 2.0
    • updated master with this.
    • Rolled this into giant orted PR.
  • Some discussion last week about Checkpoint restart -
    • Think we decided that they'd take CR 3554 (remove various subcomponents).
    • We still need a PR to remove CR from v3.x (leave it in master).
  • Brian is in driver seat for RCs on this one.
    • Howard hasn't been able to talk to Brian in 2 weeks. He will reach out.
  • The v3.x to v3.0.x branch rename didn't happen.
    • Would like to do this after hours (Pacific time) tonight.
    • Open PRs will have to be re-created.
    • Brian will send out email to devel.
  • When we did v2.x, we pulled Checkpoint Restart out of master; need to remove it from v3.x/v3.0.x also.
    • Brian will do this after the rename.
  • Schedule for v3.0.0
    • branch rename tonight.
    • pull in PMIx orted changes, and PMIx v2.0

  • Still seeing some 'make check' errors, which is disturbing.
    • Jeff hasn't been able to focus on that.
    • Some kind of compile error, but not seeing the compile error.
      • Seen on 32-bit; 64-bit is fine.
  • Should clean up compiler warnings.
  • MPI_Sendrecv_replace - seems to fail consistently.
  • Simply large send, managed CUDA.
  • Timeouts are all CUDA related - nvidia.
  • Issue: using Red Hat's stock autoconf (rather than building our own)
  • Someone added an autogen requirement for the "correct" 1.15 version.
    • The update broke Travis, but Travis always breaks (bad).
    • The website specifies the versions of automake / autoconf we require.
      • There is a bug in 1.14, so everyone jumped to 1.15 (though 1.12 is reported to work).
    • We should not merge things to master if the PR checker breaks.
    • https://github.com/open-mpi/ompi/pull/3602 - make autoconf track posted requirements.
    • PMIx requires 1.15 - got dinged that they weren't checking for version of autoconf that website says we require.
      • Came in on Thursday, and started failing when we recurse down there.
    • Came up on the mailing list: we do sometimes get people reporting the 1.14 bug, because there is no requirement check.
    • ACTION - Brian will update PRchecker / CI to use correct version of autogen.

MTT Dev status:


Exceptional topics

  • Face2Face Meeting-2017-07
    • Date: July 11-13 (9am Tuesday - noon on Thursday).
    • Cisco has booked space in Chicago.
      • Cisco has reserved some space right next to O'Hare (can get shuttle to hotel).
        • we have met there before.
      • Jeff will come in Monday evening.
  • Ralph's goal is to get all PMIx runtime bug fixes into v3.0, to make it as clean as possible.
    • Scalability? All the scaling fixes are in there, with the exception of 3 mappers that could be updated if someone wanted to (to improve scalability with the new PMIx 2.0 way of doing things):
      • Non-updated (anyone interested?): sequential mapper, rankfile, and min-dist mappers

Status Updates:

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM, Fujitsu
