WeeklyTelcon_20190820
- Dialup Info: (Do not post to public mailing list or public wiki)
- Akshay Venkatesh
- Brendan Cunningham (Intel)
- Brian Barrett (Amazon)
- Dan Topa (LANL)
- Geoff Paulsen (IBM)
- Harumi Kuno
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Intel)
- Ralph Castain (Intel)
- Todd Kordenbrock
- Akshay Venkatesh (nVidia)
- Aravind Gopalakrishnan (Intel)
- Arm (UTK)
- Artem Polyakov (Mellanox)
- Brandon Yates (Intel)
- David Bernhold
- Edgar Gabriel (UH)
- Geoffroy Vallee
- George Bosilca (UTK)
- Harumi Kuno (HPE)
- Jake Hemstad
- Josh Hursey (IBM)
- Joshua Ladd (Mellanox)
- Mark Allen (IBM)
- Matias Cabral
- Nathan Hjelm
- Noah Evans (Sandia)
- Peter Gottesman (Cisco)
- Thomas Naughton
- Xin Zhao (Mellanox)
- mohan
-
Git submodules
- This PR is in progress. Requires CI owners to add --recursive to their Jenkins git clone commands.
- As a first step, Jeff created:
- PR 6821 "hwloc201 use a submodule"
- Brian will not have cycles for a few weeks.
- Jenkins has an issue that Brian could fix.
-
What to do with OFI BTL and OFI MTL
- Harumi Kuno (HPE) - Discussion about OMPI's component philosophy
- mail archive: https://www.mail-archive.com/devel@lists.open-mpi.org/msg20736.html
- ofi/BTL and MTL components can step on each other.
- PSM2 - when a user of PSM2 calls PSM2_Finalize while there is also a PSM2 provider, PSM2's refcounting is only observed in initialization, not in finalization, meaning the first finalize was finalizing the entire job.
- No progress. Brendan is looking at this on the PSM2 side.
- What are the plans for PSM2 and the MTL, etc.?
- Still fully supporting PSM2. The PSM1 adapters reach end-of-life in March of 2020. Will probably remove PSM1 code from v5.0 and master. (Michael Heinz)
- Update Harumi Kuno - Jeff raised some issues with OFI common PR to return to master (older issue 2519), build issue. Think we
- Intel is discussing if they will claim ownership of common OFI.
- If Intel won't, HPE will.
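The refcounting problem described above (the first finalize tearing down a library that another component still uses) can be illustrated with a minimal sketch. This is not the PSM2 API; the class and method names are hypothetical, and it only shows the behavior one would expect when the count is honored on both init and finalize.

```python
class RefCountedLibrary:
    """Hypothetical stand-in for a library shared by two components."""

    def __init__(self):
        self.refcount = 0
        self.initialized = False

    def init(self):
        # The first caller performs the real initialization;
        # later callers only bump the count.
        if self.refcount == 0:
            self.initialized = True
        self.refcount += 1

    def finalize(self):
        if self.refcount == 0:
            raise RuntimeError("finalize called without a matching init")
        self.refcount -= 1
        # Only the last caller actually tears the library down.
        if self.refcount == 0:
            self.initialized = False


lib = RefCountedLibrary()
lib.init()      # first user, e.g. an MTL-like component
lib.init()      # second user, e.g. a BTL-like component
lib.finalize()  # first finalize: the library must stay up
still_up_after_first_finalize = lib.initialized
lib.finalize()  # last finalize: now it is torn down
up_after_last_finalize = lib.initialized
```

The bug discussed in the notes corresponds to a finalize that skips the count check and tears down on the first call.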
-
Status of Scale testing
- Still no update. Blocking on Amazon time, lower priority.
- Issue 6786 "OMPI 4.0.1 TCP connection errors beyond 86 nodes"
- Issue 6198 "SSH launch fails when host file has more than 64 hosts"
- IBM is also working on something like this (for ssh launch)
- Prefer this every night, instead of each PR.
-
- https://engineering.mongodb.com/post/succeeding-with-clangformat-part-1-pitfalls-and-planning
- Should get this cleaned up. Need one big PR fix.
- Whitespace vs Tab cleanup.
- Good conversation on PR.
- Should we have CI for this?
- MongoDB did something similar; their post covers branches, issues, and why they went with clang.
- After folks write the scripts, then adding to CI is no problem.
- Want it to be EASY to add local githooks so CI isn't first line for these.
- Giant clean up commits should be done on each
- Implementation details:
- It might be easy to use clang for the CI / formatting.
- clang enforces a set of things, but it may require more than
- We have a requirement in Open MPI that says you write 'if (NULL == var)'
- Very hard to enforce this in Perl, and gcc can't give us an AST to work at that level.
- run clang far enough to get AST, to do formatting.
- you can now run clang_format.py reformat-branch T R (using T and R from the algorithm above) to easily bring a stranded topic branch forward after a reformat commit.
- If we have to add yet another dependency (like clang): most of us don't use clang, so adding it would be painful.
- White space is how this started, and perhaps just fix white space stuff. And both githooks and CI to enforce.
- Scripts are mentioned in the PR.
- Most of these scripts UPDATE the git commit, and so for CI we want them just to check.
- Command line example on how to add to git hooks.
- Brian thinks he owns next steps - basic style checking in CI.
- Pull Request into ompi-scripts - wanted something to drop into Jenkins.
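The "check only, don't rewrite" behavior wanted for CI can be sketched as follows. This is a minimal illustration, not one of the scripts from the PR; the function name and the two rules (tabs and trailing whitespace) are assumptions, and a real check would need exceptions for files such as Makefiles, where tabs are required.

```python
def check_whitespace(path_text_pairs):
    """Report whitespace problems without modifying any file content.

    path_text_pairs: iterable of (path, file_text) tuples.
    Returns a list of (path, line_number, problem) tuples.
    """
    problems = []
    for path, text in path_text_pairs:
        for lineno, line in enumerate(text.splitlines(), start=1):
            if "\t" in line:
                problems.append((path, lineno, "tab character"))
            if line != line.rstrip():
                problems.append((path, lineno, "trailing whitespace"))
    return problems


# Hypothetical file content: line 2 has a tab, line 3 has trailing spaces.
sample = [("example.c", "int x;\n\tint y;\nint z;  \n")]
report = check_whitespace(sample)
for path, lineno, problem in report:
    print(f"{path}:{lineno}: {problem}")
```

A git hook or a CI job could call the same function and fail when the report is non-empty, so both enforce identical rules.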
- Complete
- No update
- Suggest just doing hwloc (stable and not too much development) first
- No update
Blockers All Open Blockers
Review v3.0.x Milestones v3.0.4
Review v3.1.x Milestones v3.1.4
- Merged in PMIX releases
- Waiting on MPIR fix.
- Vader audit?
- OB1 pipelining
- v3.0.x MPIR_Breakpoint issue need a bit more data why -O3
- Tested new PMIx
- Exposed a few new test suite issues in "ibm", but fixed
Review v4.0.x Milestones v4.0.2
-
Howard is out this week. Once Datatype PR is merged, will spin RC1 to begin testing.
- Still waiting on Giles for review.
- Geoff will email and ping on github to request review TODAY
-
Akshay will test new datatypes with CUDA.
- Will test on master maybe v4.0.x too.
- No update 8/13
- Posted on SLACK
-
PR against v4.0.x to pull in latest PMIx release merged.
-
Many bugfixes waiting for 4.0.1, we should try to get 4.0.2 out the door.
-
OB1 get protocol problem Issues 6568 - Nice, but not a blocker since everything but MCA has CMA
-
George is back from vacation, want two things before rc1
- Datatype work, master PR for datatypes
- Also ob1 get/put path problem
- Edgar just reported a bug
-
Howard is verifying 6613 MPIR Disappearing queue on re-attach.
-
PR6806 - Want to wait until CI is back. Do we have any tests to test this?
- Howard will reproduce and add to ibm suite
-
2nd Put issue PR 6568 (Vader deadlocking with 4MB transfers)
- waiting on George to return (end of the month)
-
New Datatype work https://github.com/open-mpi/ompi/pull/6695 (master)
- Want for v4.0.2
- Now approved for master.
- waiting on George to return (end of the month). We could merge to master, but if any issues, we'd need George to fix.
-
https://github.com/open-mpi/ompi/issues/6568 - put protocol has lost its pipelining.
- Combination of both ob1 and vader.
- Right now only shows in vader, because all others prefer get protocol.
- Vader generates a bunch of 32K frags, so a 4MB transfer overwhelms vader.
- Does NOT occur with single copy like CMA or KNEM.
- Marked as a blocker, but won't block RCs, just
- Is this a regression? Not sure if it was ever implemented.
- Used to be some pipelining, used to work. Not sure why it's showing up.
- Everything George knows is in the ticket.
- Need a throttle for large messages.
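To make the numbers above concrete: assuming "32K" means 32 KiB and "4MB" means 4 MiB, one transfer becomes 128 fragments. The sketch below is a toy simulation, not ob1 or vader code. The function names and the window size of 8 are hypothetical, and it only shows how a cap on outstanding fragments bounds the peak number in flight.

```python
from collections import deque


def fragment_count(message_bytes, frag_bytes):
    """Number of fragments needed for a message (ceiling division)."""
    return -(-message_bytes // frag_bytes)


def peak_in_flight(num_frags, max_in_flight):
    """Simulate posting fragments with at most max_in_flight outstanding.

    Assumes max_in_flight >= 1. Returns the peak number outstanding.
    """
    in_flight = deque()
    peak = 0
    completed = 0
    next_frag = 0
    while completed < num_frags:
        # Post new fragments only while the window has room.
        while next_frag < num_frags and len(in_flight) < max_in_flight:
            in_flight.append(next_frag)
            next_frag += 1
        peak = max(peak, len(in_flight))
        # Simulate completion of the oldest outstanding fragment.
        in_flight.popleft()
        completed += 1
    return peak


frags = fragment_count(4 * 1024 * 1024, 32 * 1024)
peak_unthrottled = peak_in_flight(frags, max_in_flight=frags)
peak_throttled = peak_in_flight(frags, max_in_flight=8)
```

Without a cap, all 128 fragments are posted at once; with the hypothetical window of 8, no more than 8 are ever outstanding.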
-
Issue 6789 - OMPI crashes when configured with ucx version
- Issue with PML UCX conflicting with btl_uct - memory hooks
- New this week: Howard not convinced it's memory hooks.
- Howard can't reproduce. Asking user to
Review Master Master Pull Requests
- IBM's PGI test has NEVER worked. Is it a real issue or local to IBM?
- nVidia bought PGI, perhaps someone there could take a look?
- Akshay said he'd talk to a PGI person at nVidia to see.
- PR6556 and 6621 should go to the release branches.
- no update
- Not a great way to test CI before
- When do we get rid of 32bit?
- Good reminder that we now need to be careful about OPAL's ABI.
- Still don't have any release manager.
- Ralph is willing to help with v5.0.0
- Need to identify someone in next few months.
- Put notes in v5.0 milestone wiki page.
- We put the MPI-1 compatibility configure flag back on master.
- And a note in the 6.0 wiki to re-evaluate.
- 3.1.4 is out
- 2.2.3 is in RC.
- 4.0 just rough schedule now. Trying to get standard RFCs out this month.
- Branching for PMIx v4.0 might be September.
- a bunch of stuff going on, but nothing necessarily impacting OMPI.
- Made a change for Nathan - allow you to get locality of other processes on node.
- Allows you to hook up with shared memory
- PMIx master can support network coordinates of any NIC and, depending on the type of network, can map them for each process.
- "network coordinates" - map to MPI network topology definition.
- Fujitsu and Cray are implementing.
- In PMIx when do instant-on, the scheduler queries the ___ plugin to get a payload of info you want. If the process is bound to a certain socket, this is the NIC they should use, and these others are available. Then you assign the endpoint to that NIC.
- Requires Instant-On? - simple to do without instant-on if you want to.
- Howard has someone coming onboard in LANL next month.
- Tom filed a PRTE PR recently, so making some progress.
- Open-MPI would like the mpirun launch versus the lam-boot 2 command approach
- Aug 7th - web-ex meeting.
- Talked about what needed to happen, and confirmed want to go down this path
- laid out a few steps of what needs to happen.
- Some hinges on submodule automation.
- OLD - Giles's PRRTE work was done differently than what we're now proposing. New proposal uses submodules, etc.
- PR6339 - he's closed, and re-opened a new branch to look at.
- Howard reviewed PR6339, and likes everything that Giles did.
- IBM has to triage some failures on master and v4.0.x