WeeklyTelcon_20210427
- Dialup Info: (Do not post to public mailing list or public wiki)
- Austen Lauria (IBM)
- Brendan Cunningham (Cornelis Networks)
- Geoffrey Paulsen (IBM)
- Harumi Kuno (HPE)
- Hessam Mirsadeghi (UCX/nVidia)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Joseph Schuchart
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Naughton III, Thomas (ORNL)
- Sam Gutierrez (LANL)
- Todd Kordenbrock (Sandia)
- Tomislav Janjusic
- William Zhang (AWS)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia/Mellanox)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Brian Barrett (AWS)
- Charles Shereda (LLNL)
- Christoph Niethammer (HLRS)
- David Bernhold (ORNL)
- Edgar Gabriel (UH)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Joshua Ladd (nVidia/Mellanox)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja (AWS)
- Ralph Castain (Intel)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Xin Zhao (nVidia/Mellanox)
- Tommy is taking over for Josh Ladd for the short term.
- Please send Mellanox items to him.
- He will also help with v5 RM work.
- Howard tried to build the most recent OSU benchmarks; they don't build out of the box against master and v5.
- Howard didn't have mpicxx or mpicpp
- If this is an actual issue, assign this to Jeff.
- Also, Joseph set CC but not CXX in the environment, and the C++ wrapper wasn't being built.
- Didn't dig in...
- This could be correct behavior, even if it's unexpected.
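
One quick way to tell whether the C++ wrapper was installed at all is to look for it on `PATH`. The sketch below is only an illustration, not part of Open MPI: the helper names `find_wrappers` and `missing_wrappers` are hypothetical, and it assumes the wrappers would be installed under the usual names `mpicc` and `mpicxx` in a directory on `PATH`.

```python
import shutil


def find_wrappers(names=("mpicc", "mpicxx")):
    """Map each wrapper name to its full path on PATH, or None if it is absent."""
    return {name: shutil.which(name) for name in names}


def missing_wrappers(names=("mpicc", "mpicxx")):
    """Return the wrapper names that could not be found on PATH."""
    return [name for name, path in find_wrappers(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_wrappers().items():
        print(f"{name}: {path or 'not found'}")
```

If `mpicxx` is reported as not found while `mpicc` is present, that would be consistent with the C++ wrapper not having been built, though it does not say why.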
- Issue 8850 static linking blocker for v5
- Need to talk to Brian
- 8860 is related - Howard
- Issue 8925: MPI apps hang if the runtime decides to kill the job.
- The PMIx event is not processed properly, so the job isn't torn down.
- Need to talk this through with the Fault Tolerance folks.
- Blocker for v5
- We're still waiting on Datatype issues now reported in v4.1.1
- Issue 8856
- Howard took the DT fix and created a PR
- Need an explanation for PR 8810
- Hessam contacted Artem, who said that it's a work in progress.
- Follow up 8818 on datatypes
- Is this also blocker?
- No.
- Raghu has left AWS.
- Brian is stepping up for v4.1.x RM work
- v4.1.1
- Released over the weekend. Got George's datatype fix.
- Brian and Jeff did a bunch of testing, and were happy with it.
- Unfortunately two different folks reported partial roundoff error #8856
- George spent a lot of time trying
- Holding off on merging v4.1.x PRs until we get a better understanding of #8856
- Still haven't done the alpha; holding off until we get cherry-picks from master.
- Austen, Tommy, and Geoff will cherry-pick the "easier" ones.
- Issue #8652 RDMA performance problem.
- This is more of an enhancement than a severity: blocker
- Not a blocker, just an issue with the way the user ran.
- If there's a mode that we know has bad performance, it's useful to call that out in the UCX section of the docs.
- Issue 8776 - libevent confusion if running with external 3rd party tools
- PR 8792 - Need to move this over to v5.0.x
- Need to check with Brian if this is relevant on v4.0 or v4.1
- Compile with --disable-dlopen, or slurp in all of the plugins.
- 3-line change, should be small work.
- Not a linker error; the job just hangs and fails, so we really might want this on v4.0 and v4.1.
- PR 8799 - should probably be PRed to v5.0
- Howard is concerned about how these package-specific config lookups feed into the way that mpicc links (for example, on Cray).
- mpicc --show - shows some long dependencies.
- Just let him know on the ticket.
- Howard will update the ticket.
- Docs - Man pages will be included in this effort.
- Likely include nroff and HTML in the tarball (so users don't need Sphinx, and don't need internet access)
- If this doesn't make v5.0.0, it can go into a later release.
- Packagers need some advice, and need a README; a few more weeks at minimum.
- 8808 - same memory backing file.
- what is the failure profile for this?
- Rare. If two users share a node and a failed job leaves backing files behind, another user's attempt to create the backing file can conflict. So we add the user ID to give a little more protection against conflicts.
- Does mean that there's a cleanup issue for shared memory files.
- The only reason this is needed is that we moved the backing file out of /dev/shm.
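
The per-user naming idea discussed for 8808 can be sketched as follows. This is a minimal illustration in Python, not the Open MPI implementation: the function names, the file-name layout, and the `job_id` and `rank` parameters are all hypothetical. The only point is that embedding the user ID keeps a stale file left behind by one user from colliding with the file another user tries to create.

```python
import os


def backing_file_path(base_dir, job_id, rank, uid=None):
    """Build a backing-file path that embeds the owning user's ID.

    Hypothetical layout for illustration only; not Open MPI's actual naming.
    """
    if uid is None:
        uid = os.getuid()  # POSIX only
    return os.path.join(base_dir, f"shmem_backing.{uid}.{job_id}.{rank}")


def create_backing_file(path, size):
    """Create the file exclusively, so a leftover file is reported rather than reused."""
    # O_EXCL makes creation fail with FileExistsError if the path already exists.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, size)  # reserve the requested size
    finally:
        os.close(fd)
    return path
```

With this layout, two users running a job with the same hypothetical `job_id` and `rank` get different paths, so a stale file from one user does not block the other. A stale file left by the same user still collides, which is the cleanup issue noted above.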
- PR 8816
- Would like Nathan to rebase and merge to master.
- Certain blocks we don't want to format (specifically some in datatype)
- Joseph saw that some copyright headers in OPAL code got scrambled.
- clang format trips over
- Something going on in PMIx v4.x branch around tools interface
- Relabel v4.x as v4.1 and then create a new v4.x without some tools interface.
- Shouldn't
- No update
- Also some changes with libcurl, especially since this breaks the OMPI build.
- PMIx can interface with REST interfaces (used by libcurl)
- JSON
- Build system issue in PMIx when we changed to static DSOs.
- Think this has been resolved
- Jeff and Ralph and Yosi had a good conversation
- Lengthy discussion; the summary is that it's a work in progress.
- Ralph is working on this.
- Need to look at the public tests repo for merging in both ULFM and Sessions tests.
- Howard and Geoff will look at this this week.
- Converting docs to Readthedocs.io
- https://github.com/open-mpi/ompi/pull/8329
- PRRTE issue https://github.com/openpmix/prrte/issues/931 on how to document personalities