WeeklyTelcon_20160823
Jeff Squyres edited this page Nov 18, 2016
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Artem Polyakov
- Jeff Squyres
- Brian
- Edgar Gabriel
- George
- Geoffroy Vallee
- Howard
- Josh Hursey
- Joshua Ladd
- Ralph
- Sylvain Jeaugey
- Todd Kordenbrock
- Milestones
- 1.10.4
- A few PRs to pull in. Want folks to focus on 2.0; once 2.0.1 is out, work on 1.10.4 might begin.
- Ralph may need a 1.10.4.
- Cisco 1.10.4 has a bunch of failures on MTT, Jeff needs to know if there is an issue.
- Driver: Didn't have the sync component in the collectives. This is causing problems for several customers. Trying to bump them to 2.x, but might not be possible.
- What problem does the sync component in collectives solve? If an application calls non-blocking collectives in a tight loop and one process starts to fall behind (typically because it has extra work), unexpected messages pile up on that process. We don't have flow control for that.
- Nathan has an idea: if the unexpected message queue (per rank) gets big, send a message to the other side telling it to use a sync send for its next message.
- This can deadlock if messages come in in a bad order (in non-blocking send).
- We do have an ACK protocol for long and sync messages. The ACK is piggybacked for rendezvous.
- Portals MTL has something, since it has a small unexpected msg.
- MPI standard is not clear what to do on sender side for non-blocking sends if running out of messages.
- Hard to do scalably, reliably, and fast.
- Should take this offline to wiki or email or something.
- George will describe deadlock path.
- coll_sync is the temporary solution?
- George, either we force it all the time for everybody, or we ask people to activate by hand.
- Or they could change the size of the eager limit and get almost the same effect.
- Can't set eager below match size.
- coll_sync is a good bandaid.
- coll_sync was in up to 1.6 series, but it disappeared, and they want / need it.
- Is coll_sync on FAQ? - yes, think so.
- May need coll_sync in 2.0.2 also.
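The flow-control problem discussed above, and why a periodic synchronization point acts as a band-aid, can be sketched with a toy model. This is only a single-process simulation under stated assumptions, not Open MPI code and not how coll_sync is implemented: the function name `max_unexpected_depth` and its parameters are hypothetical, the "barrier" is modeled as the sender waiting until the receiver has drained its queue, and each step stands for one eager message.

```python
from collections import deque


def max_unexpected_depth(num_ops, recv_every, barrier_every=None):
    """Toy model: one fast sender, one slow receiver.

    The sender posts one eager message per step. The receiver drains one
    message only every `recv_every` steps. If `barrier_every` is set, the
    sender pauses after every `barrier_every` messages until the receiver's
    queue is empty (a stand-in for a barrier between collectives).
    Returns the largest unexpected-queue depth observed.
    """
    queue = deque()
    max_depth = 0
    sent = 0
    step = 0
    while sent < num_ops:
        step += 1
        at_barrier = (
            barrier_every is not None
            and sent > 0
            and sent % barrier_every == 0
            and len(queue) > 0
        )
        if not at_barrier:
            queue.append(sent)  # eager send lands in the unexpected queue
            sent += 1
        max_depth = max(max_depth, len(queue))
        if step % recv_every == 0 and queue:
            queue.popleft()  # slow receiver finally matches one message
    return max_depth


if __name__ == "__main__":
    print("no sync:     ", max_unexpected_depth(1000, recv_every=4))
    print("sync every 10:", max_unexpected_depth(1000, recv_every=4, barrier_every=10))
```

Without the periodic synchronization the queue depth grows with the number of operations; with it, the depth in this model never exceeds `barrier_every`, at the cost of stalling the fast process.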
- Paul Hargrove uncovered 3 things.
- Need to update PMIx anyway, due to solaris issue.
- Can drop OSX v10.6 - 10.10 is list of systems tested. 10.6 can't even be run in VM.
- Should change test list to OSX v10.8 - 10.11. (10.12 still in beta)
- dlopen crash, possibly specific to XLC in Patcher.
- Nathan may not have got the XL piece correct.
- Don't actually refer to the translation table.
- Oracle Studio lightly tested.
- PR1333 - hcoll datatype fixes.
- Check AUTHORS file - NOW auto-generated from Spreadsheet.
- git .mailmap - names and emails are filtered through the .mailmap file.
- Edgar had a commit from his wife's local MacBook, so this was put into .mailmap; when git sees that address, it changes it to his actual email.
- dist directory has a make AUTHORS script to run before release, to regenerate AUTHORS.
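As an illustration of the mapping described above, here is a minimal sketch of how a mailmap-style file can canonicalize commit authors before an author list is generated. It is not the script in the dist directory: the helper names (`load_mailmap`, `canonical_author`, `authors_list`) and the sample names and addresses are hypothetical, and only the `Proper Name <proper@email> <commit@email>` entry form is handled.

```python
import re

# Handles only the "Proper Name <proper@email> <commit@email>" entry form.
_ENTRY = re.compile(
    r"^(?P<name>[^<]+?)\s*<(?P<proper>[^>]+)>\s*<(?P<commit>[^>]+)>$"
)


def load_mailmap(text):
    """Build {commit_email: (proper_name, proper_email)} from mailmap text."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _ENTRY.match(line)
        if match:
            key = match.group("commit").lower()
            mapping[key] = (match.group("name"), match.group("proper"))
    return mapping


def canonical_author(mapping, name, email):
    """Return the mapped (name, email), or the input pair if unmapped."""
    return mapping.get(email.lower(), (name, email))


def authors_list(mapping, commits):
    """Deduplicate (name, email) pairs after canonicalization."""
    return sorted({canonical_author(mapping, n, e) for n, e in commits})


if __name__ == "__main__":
    sample = "Jane Doe <jane@example.org> <jane@borrowed-laptop.local>\n"
    mapping = load_mailmap(sample)
    commits = [
        ("Jane Doe", "jane@example.org"),
        ("jane", "jane@borrowed-laptop.local"),
    ]
    print(authors_list(mapping, commits))
```

A commit made from a borrowed machine then collapses onto the contributor's real identity instead of appearing as a second author.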
- coll_sync - 2.0.1 or 2.0.2?
- 2.0.2 - already in PR list, Ralph will set milestone.
- Mellanox needs PMIx 2.0 in 2.1.0
- PMIx will release a 2.0 that just has shared memory data as an addition,
- but doesn't have everything else they were targeting for 2.0.0.
- This should come out in early September.
- This is the piece that Mellanox and IBM are interested in.
- Put items requested on the wiki (e.g., PMIx direct modex, OpenSHMEM, stability improvements)
- What do people want to see for 2.1.0?
- Finalize the list in Dallas meeting
- Hopefully target a Sept./Oct. release, not a Supercomputing (SC) goal.
Review Master MTT testing (https://mtt.open-mpi.org/)
- Howard looks close to talking to the reporter.
- Looks like Django CherryPy is not running during HTTP_PUT. Josh will check.
- There is a separate, different path; send email to Josh, and Josh will check.
- Getting closer.
- Josh started moving MTT server to Amazon cloud server.
- Probably have a transition time for database transfer, not this week.
- Most of it's migrated now, other than MTT database.
- Statistics for download numbers?
- At the moment these are gone.
- When did we actually flip the bits to move to HostGator?
- 3-4 weeks ago.
- Get numbers up until then to Edgar.
- Google analytics only has permissions to certain directories.
- So can't track number of downloads.
- If we're eventually going to move downloads to S3, then we get that for free.
- Date of another face to face. January or February? Think about, and discuss next week.
- Non-Profit
- Ralph sent email out to list, please comment either pro/con.
- LANL, Houston, IBM
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel