WeeklyTelcon_20180619
Geoffrey Paulsen edited this page Jan 15, 2019
·
1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Jeff Squyres
- Geoff Paulsen
- Joshua Ladd
- Howard Pritchard
- akvenkatesh
- Dan Topa (LANL)
- Edgar Gabriel
- Geoffroy Vallee
- Nathan Hjelm
- Peter Gottesman (Cisco)
- Ralph
- Todd Kordenbrock
- Xin Zhao
- David Bernholdt
- Josh Hursey
- Brian
- Thomas Naughton
- Akvenkatesh
- Matthew Dosanjh
- Dan Topa (LANL)
- Next week (June 26): Discuss schedule of v2 and v3 releases.
- GitHub suggestion on email filtering
Review All Open Blockers
Review v2.x Milestones v2.1.4
- v2.1.4 - Targeting late August (Aug 31)
- lower priority to v3.0 and v3.1
- PR5217 changes in OSHMEM logic MPI_Initialized/Finalized
- IN
Review v3.0.x Milestones v3.0.3
- Schedule:
- v3.0.2 has been shipped.
- v3.0.3 - targeting Sept 1st
- Do we want AR64 stuff in v3.0.3? - Up to Nathan. Sounds good.
- Helps IBM too.
- Cisco is seeing some weirdness in v3.0 and v3.1
- Haven't nailed it down, and haven't reported it yet. PMIx / runtime.
Review v3.1.x Milestones v3.1.0
- v3.1.0 - targeting Sept 1st
- Last week Brian posted an
- Brian just merged in a bunch of stuff.
- OMPIO Issue 5263 - Symbols issue in v3.x and master.
- resolved
- Schedule: mid-July branch, mid-Sept release.
- Still working through iWARP issues; LANL waiting for Chelsio RNICs.
- installed, maybe do some tests in a week or two.
- Mostly smaller issues. Wants to test before doing anything more drastic (like relying only on RDMACM).
- Howard will ping Broadcom to apprise them of the situation. Either they do smoke testing with UCX, or we keep openib only for Broadcom.
- No further / substantive update since last week (4 day weekend prevented a bunch of work this past weekend).
- PMIx v3.0 RC went into master, but will need another sync before release.
- PR4618 - Persistent Collectives, Fujitsu is planning to merge into master
- MPI 4.0 standard is targeting 2020, so these will go in as MPIX until then.
- OSHMEM: pushed last big OSHMEM v1.4
- Xin: still need to fix up mxm and __
- Still planning to implement one other portion of the API.
- Favor external vs internal components: hwloc, PMIx, and libevent (Jeff).
- Edgar - Did you want to make something default?
- component is there
- Lustre - waiting on George; not sure if OMPIO is going to become the default for Lustre.
- Ralph merged in some PMIx v3.0
- PR5258 to master - DONE
- ARM [Pasha] CI was having lots of problems. Upgraded the Atomics in PMIx (some changes).
- Skeptical that it isn't just a race condition somewhere.
- Changed both PMIx code and code in Open MPI.
- Why are we testing ARM? Ralph asked them for help, but no help for a week.
- Nathan is helping support ARM, but had a few bad weeks.
- Jeff sent Pasha an email asking if they can better support.
- Forwarded GitHub suggestion on email filtering
- Overall Runtime Discussion (talking v5.0 timeframe, 2019)
- DELAYED - Geoff Paulsen will send out a separate email to discuss in approximately two weeks.
- From last week:
- What is it that we want? It's changed a bit since last Face to Face.
- Getting confused about the goal. Regardless of who and when, let's discuss what.
- Set up a prep-call
- What? Two Options:
- Keep going on our current path, and taking updates to ORTE, etc.
- Shuffle our code a bit (new ompi_rte framework merged with orte_pmix framework, moved down and renamed)
- Opal used to be single process abstraction, but not as true anymore.
- API of foo, looks pretty much like PMIx API.
- Still have PMIx v2.0, PMI2 or other components (all retooled for new framework to use PMIx)
- To use it, just call opal_foo.spawn(), etc., and you get whatever component is underneath.
- what about mpirun? Well, PRTE comes in, it's the server side of the PMIx stuff.
- Could use their prun and wrap in a new mpirun wrapper
- PRTE doesn't just replace ORTE. PRTE and OMPI layer don't really interact with each other, they both call the same OPAL layer (which contains PMIx, and other OPAL stuff).
- prun has a lam-boot looking approach.
- Build system about opal, etc. Code shuffling, retooling of components.
- We want to leverage the work the PMIx community is doing correctly.
- If we do this, we still need people to do runtime work over in PRTE.
- In some ways it might be harder to get resources from management for yet another project.
- Nice to have a componentized interface, without moving runtime to a 3rd party project.
- Need to think about it.
- Concerns with the work of adding ORTE PMIx integration.
- Want to know the state of SLURM PMIx Plugin with PMIx v3.x
- It should build, and work with v3. They only implemented about 5 interfaces, and they haven't changed.
- A few related to OMPIx project, talking about how much to contribute to this effort.
- How to factor in requirements of OSHMEM (who use our runtimes), and already doing things to adapt.
- Would be nice to support both groups with a straightforward component to handle both of these.
- Thinking about how much effort this will be, and how to manage these tasks in a timely manner.
- Testing, will need to discuss how to best test all of this.
- ACTION: Let's go off and reflect and discuss at next week's Web-Ex.
- We aren't going to do this before v4.0 branches in mid-July.
- Need to be thinking about the Schedule, action items, and owners.
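
The `opal_foo.spawn()` idea from the runtime discussion above (callers use one interface and get whatever component is underneath) can be sketched in C as a table of function pointers plus a selection step. This is only an illustrative sketch under assumptions, not Open MPI's actual framework code: `foo_module_t`, `foo_select`, `foo_spawn`, the stub backends, and the priority values are all hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical module interface: one table of function pointers per backend. */
typedef struct {
    const char *name;               /* component name, e.g. "pmix" */
    int priority;                   /* higher value wins default selection */
    int (*spawn)(const char *cmd);  /* returns 0 on success */
} foo_module_t;

/* Stub backends; a real component would call into its runtime library here. */
static int pmix_spawn(const char *cmd) { (void)cmd; return 0; }
static int pmi2_spawn(const char *cmd) { (void)cmd; return 0; }

/* Hypothetical component list with made-up priorities. */
static const foo_module_t foo_components[] = {
    { "pmix", 50, pmix_spawn },
    { "pmi2", 20, pmi2_spawn },
};

/* Select a component by name, or the highest-priority one if name is NULL.
 * Returns NULL when a requested name is not available. */
const foo_module_t *foo_select(const char *requested)
{
    const foo_module_t *best = NULL;
    size_t n = sizeof(foo_components) / sizeof(foo_components[0]);

    for (size_t i = 0; i < n; i++) {
        const foo_module_t *m = &foo_components[i];
        if (requested != NULL) {
            if (strcmp(requested, m->name) == 0) {
                return m;
            }
            continue;
        }
        if (best == NULL || m->priority > best->priority) {
            best = m;
        }
    }
    return best;
}

/* Callers only see this entry point; the backend underneath is interchangeable. */
int foo_spawn(const char *requested, const char *cmd)
{
    assert(cmd != NULL);
    const foo_module_t *m = foo_select(requested);
    return (m != NULL) ? m->spawn(cmd) : -1;
}
```

With this shape, the upper layer would call `foo_spawn(NULL, "a.out")` and never name a specific backend, which is the property the notes describe for both the OMPI layer and PRTE sitting on a shared OPAL layer.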
Review Master Pull Requests
- Decided to file PR5200 to begin the long process of deleting osc/pt2pt (by enabling all relevant RDMA BTLs so that every transport will use osc/rdma).
- Anything Jeff can help with Absoft and NAG licenses?
- waiting.
Review Master MTT testing
- Hope to have better Cisco MTT in a week or two
- Peter is going through, and he found a few failures, some of which have been posted.
- one-sided - Nathan's looking at.
- some more coming.
- osc_pt2pt will exclude itself in an MT run.
- One of Cisco's MTT runs uses an env setting to turn all MPI_Init calls into MPI_Init_thread (even though it is a single-threaded run).
- Now that osc_pt2pt is ineligible, many tests fail.
- On master, this will fix itself 'soon'
- BLOCKER for v4.0 for this work so we'll have vader and something for osc_pt2pt.
- Probably an issue on v3.x also.
- OSHMEM v1.4 - cleanup work and refactoring.
- Edgar has some issues running on Omni-Path - Not able to open HFI correctly.
- Not sure if it's OFI components.
- Mathias just updated his PR5004 and asked Jeff to review.
- libfabric related, but probably not Edgar's issue.
- Might be missing coverage here. Results show LLNL and Cray stuff, not sure what these are.
- aarch64 and Cray: master is failing to build with AlltoAllw INTER something.
- Might be an MPI-1 casualty. Several MPI-1 cleanup PRs.
- Gilles caught some stuff
- Leave on here for one more week.
- Next Face to Face?
- When? Late summer, early fall?
- Where? San Jose - Cisco, Albuquerque - Sandia
- Supercomputing is in Dallas this year in November.
- Mellanox, Sandia, Intel
- LANL, Houston, IBM, Fujitsu
- Amazon,
- Cisco, ORNL, UTK, NVIDIA