WeeklyTelcon_20200915

Open MPI Weekly Telecon

Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

NOT-YET-UPDATED

4.0.x

No driver for 4.0.6 right now

v4.1

Waiting for Adapt and Han.
- Bunch of testing and fixes on Adapt.
- Adapt is in good shape now (as good as non-Adapt).
- George has one more perf improvement before squashing and PRing to master.
Han uses Adapt.
- Main issue on Han is already known (infrastructure issue, discussed last week)
  - Non-commutative MPI_Ops: fall back in those cases.
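A small sketch of why non-commutative operations need a fallback. This is illustrative Python, not Open MPI code; the rank-to-node mapping (ranks 0 and 2 on one node, ranks 1 and 3 on the other) and the use of string concatenation as the operation are assumptions made for the demonstration. A hierarchical reduction that combines per-node partial results changes the operand order, which is harmless for a commutative operation such as a sum but not for a non-commutative one.

```python
from functools import reduce
from operator import add

# One contribution per rank (hypothetical values).
numbers = [1, 2, 3, 4]              # integer sum: commutative
labels = ["r0", "r1", "r2", "r3"]   # string concatenation: not commutative


def flat(values):
    """Combine strictly in rank order 0, 1, 2, 3."""
    return reduce(add, values)


def hierarchical(values):
    """Combine per node first, then across nodes.

    Assumes ranks 0 and 2 share node A and ranks 1 and 3 share node B.
    """
    node_a = add(values[0], values[2])
    node_b = add(values[1], values[3])
    return add(node_a, node_b)


print(flat(numbers), hierarchical(numbers))
print(flat(labels), hierarchical(labels))
```

The sum is identical either way, while the concatenation differs, so a collective that reorders operands has to detect the non-commutative case and use an order-preserving path instead.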
Fortran fix Jeff will bring in.
Getting close for another RC (updated News, etc).
Schedule:
- Do RC2 this week (without Han and Adapt)

v5.0

Been in a holding pattern
- Josh Ladd is ready and willing for RM work, has just been busy with nVidia/Mellanox transition.
Schedule: PMIx v4.0 Standard is in good shape.
- libpmix in September
- PRRTE in October
ULFM review
- What are our Internal ABI guarantees?
  - Example: the ULFM pull request changes sizeof(ompi_proc_t).
    - The size changes depending on whether ULFM is configured in or not.
    - ompi_proc_t is used by SHMEM, and they're using the extension space, so WE can't use that.
    - ompi_proc_t is something that leaks into MPI API space... :(
  - Brian will look at this.
- Still open - Aurelien was not sure what the ABI requirements are.
  - Aurelien will update PR8007, and Brian will look to see what ABI changes are being discussed.
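The ABI concern can be illustrated with a sketch. The two structures below are hypothetical stand-ins, not the real ompi_proc_t layout; the field names are invented. The point is only that a field which exists in one build configuration and not in another changes both the structure size and the offsets of later fields, so code compiled against one layout misreads objects created with the other.

```python
import ctypes


class ProcWithoutFT(ctypes.Structure):
    """Hypothetical proc structure as built WITHOUT the optional feature."""
    _fields_ = [
        ("refcount", ctypes.c_int32),
        ("flags", ctypes.c_int32),
        ("endpoints", ctypes.c_void_p * 4),
    ]


class ProcWithFT(ctypes.Structure):
    """Same hypothetical structure as built WITH the optional feature."""
    _fields_ = [
        ("refcount", ctypes.c_int32),
        ("flags", ctypes.c_int32),
        ("ft_state", ctypes.c_int64),        # only present when configured in
        ("endpoints", ctypes.c_void_p * 4),
    ]


size_without = ctypes.sizeof(ProcWithoutFT)
size_with = ctypes.sizeof(ProcWithFT)
offset_without = ProcWithoutFT.endpoints.offset
offset_with = ProcWithFT.endpoints.offset

print("sizeof:", size_without, "vs", size_with)
print("offset of endpoints:", offset_without, "vs", offset_with)
```

Appending the optional field at the end would keep earlier offsets stable but still change the size, which matters wherever the structure is embedded or allocated by other code.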
Questions about users doing their own PMIx implementation.
- Is OMPI v5.0 going to #if 0 all of the PMIx APIs not needed by MPI?
  - Consensus
- If they implement their own pmix, they want to implement the bare minimum.
- OMPI v5 will require PMIx v3
- We should point out that we already have an existing way to interface with older PMIx, and they should use that.
- Wanting to support OMPI v5 in FLUX is the issue.

master / new topics

Are any Organizations doing avx512 testing?
- On by default, but no one is explicitly calling out avx512 (Intel Skylake)
- Some inline assembler macros that are picked based on what cpu can do.
- Using older gcc (< 6.x) is creating compiler issues.
  - Need to create an issue.
Branch date for v5.0.x?
- RMs could go off and look at features on wiki and create a proposal?
- What's the plan to track things, frustrating people.

PRRTE

Been doing some work from Tools side.
A lot of new work needed to stabilize it.
Not too many bug reports lately, but maybe some more as use picks up.
Some ULFM and scale testing.
Open MPI master submodule update is manual process.

PMIx v4.0

Release candidate of document for PMIx v4.0 standard.
Bulk of standard changes was pushed yesterday.
Which PMIx should Open-MPI master track?
- End goal would be to track PMIx releases.
- First week of October is target for Open PMIx v4.0 release

Supercomputing

Open-MPI got rejected.
MPI Forum got a BOF, and they're encouraged to include Open-MPI and MPICH.
We'll have a virtual talk in November-ish.
George and Jeff are
PMIx BOF was also rejected, and they're also doing a virtual one around Nov.

New

HWLOC initialization thing. (Issue #7937)

trivial to fix in master.
Once Brian gets his configure stuff in.
May need someone else to finish.
Should be able to call PMIx Init, and ____ init, don't need opal init at beginning of MPI_Init.
- This won't work going back into releases.
- buried in mca system.
- need
What to do about fixing release branches.
Can't give local topology without ___
Don't run it at scale.
The portable way to get it, is hwloc.

revert libevent https://github.com/open-mpi/ompi/pull/7940

Summary: We committed some code
- Race condition we always win (because it happens at finalize and haven't cared), but now in ULFM (and possibly Sessions)
- We switched the configury logic so we always prefer external libevent (above a certain level of external libevent).
  - Most OSes are above that level, so almost always prefer external libevent.
  - If we get the fix into our internal libevent,
    - Concern is that unless we or users explicitly request internal libevent, we'll almost never get this fix.
  - One solution would be
- Can't think of another solution.
- Packagers don't like to use our internal component
- Only thing we can think of is if you want ULFM, you can't use external libevent.
Progress of getting PR accepted upstream?
- Yes, prepared an upstream libevent PR.
  - They want a non-open-mpi reproducer.
  - Have ideas on how to create this reproducer, but not sure if it's very easy.
  - Original code writer added some protection, but has since retired. This PR removes this protection.
    - Actually "we" added this race condition protection in libevent. It delays removal of file descriptor until too late.
      - The fix validates the FD before handling. Sounds right to all.
- Not started yet. Creating
- May be a way to code around this on ULFM, but not really sure, because things get into a bad state, and only way might be to ruin our performance.
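The idea of validating a file descriptor before handling it can be sketched as follows. This is not the libevent patch discussed above; it is a generic illustration in Python, and the names fd_is_open and dispatch_ready are hypothetical. It uses fcntl with F_GETFD, which fails with EBADF when the descriptor is no longer open.

```python
import errno
import fcntl
import os


def fd_is_open(fd):
    """Return True if fd currently refers to an open file descriptor."""
    try:
        fcntl.fcntl(fd, fcntl.F_GETFD)
        return True
    except OSError as exc:
        if exc.errno == errno.EBADF:
            return False
        raise


def dispatch_ready(ready_fds, handler):
    """Call handler only for descriptors that are still open.

    Returns the descriptors that were skipped because they were closed.
    """
    skipped = []
    for fd in ready_fds:
        if fd_is_open(fd):
            handler(fd)
        else:
            skipped.append(fd)
    return skipped


# Demonstration: one end of a pipe is closed before dispatch.
r, w = os.pipe()
os.close(w)
handled = []
skipped = dispatch_ready([r, w], handled.append)
os.close(r)
print("handled:", handled, "skipped:", skipped)
```

In threaded code a check like this narrows the race window rather than closing it, since the descriptor can still be closed between the check and the handler.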
If we protect this with configure (when building ULFM and have to use internal libevent).
- It means we move to submodules for libevent, we'd have to "mirror" libevent ourselves
Only master / v5.0
- If we have TCP it could happen, but we disable errors in Finalize so don't hit this issue.
libevent patch to this OLD internal libevent 2022
- It's possible that the problem goes away in newer libevent. But updating libevent was a major hassle.
- George checked whether the code is gone or has been modified in libevent.
  - Code is still there in latest libevent (so still need fix).
- updating libevent would be a much better solution.
If upgrading to new libevent is answer.

Annual review of OMPI

Jeff will send out Once a year, make sure those who have commit access should
- Have not reviewed yet:
  - Amazon, Fujitsu, Google, HPE, Los Alamos, nVidia/Mellanox, IBM
- Need to update the spreadsheet saying "looked at".

Face to face

August 10th, 11th, Monday and Tuesday that week.
List of Topics to discuss, and presenters.
- On the wiki, start filling in.
Need to figure out snacks.

Open MPI Weekly Telecon

Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

Call in user - Thomas

not there today (I keep this for easy cut-n-paste for future notes)

Jeff Squyres (Cisco)
Artem Polyakov (nVidia/Mellanox)
Aurelien Bouteiller (UTK)
Austen Lauria (IBM)
Barrett, Brian (AWS)
Christoph Niethammer (HLRS)
Edgar Gabriel (UH)
Geoffrey Paulsen (IBM)
George Bosilca (UTK)
Howard Pritchard (LANL)
Joseph Schuchart
Josh Hursey (IBM)
Joshua Ladd (nVidia/Mellanox)
Matthew Dosanjh (Sandia)
Noah Evans (Sandia)
Ralph Castain (Intel)
Naughton III, Thomas (ORNL)
Todd Kordenbrock (Sandia)
Tomislav Janjusic
William Zhang (AWS)
Akshay Venkatesh (NVIDIA)
Brandon Yates (Intel)
Charles Shereda (LLNL)
David Bernholdt (ORNL)
Erik Zeiske
Geoffroy Vallee (ARM)
Harumi Kuno (HPE)
Mark Allen (IBM)
Matias Cabral (Intel)
Michael Heinz (Intel)
Nathan Hjelm (Google)
Scott Breyer (Sandia?)
Shintaro Iwasaki
William Zhang (AWS)
Xin Zhao (nVidia/Mellanox)
mohan (AWS)

New

Obtaining cache line size from hwloc topo info.
- trivial to fix in master.
- Once Brian gets his configure stuff in.
- May need someone else to finish.
- Should be able to call PMIx Init, and ____ init, don't need opal init at beginning of MPI_Init.
  - This won't work going back into releases.
  - buried in mca system.
  - need
- What to do about fixing release branches.
- Can't give local topology without ___
- Don't run it at scale.
- The portable way to get it, is hwloc.

revert libevent https://github.com/open-mpi/ompi/pull/7940

Summary: We committed some code
- Race condition we always win (because it happens at finalize and haven't cared), but now in ULFM (and possibly Sessions)
- Aurelien Bouteiller posted a nice summary of the situation and we discussed mitigation
  - Doesn't really affect Linux, just Mac-OS
- Would like a user visible message if we know we can run, rather than crash.
George isn't here today.
- Picking it up too late
- PMIX may or may not have this ordering issue.
  - PMIx doesn't depend on hwloc (and not using that)
Should we upgrade our internal libevent to latest 2.1.12?
- Reasons for or against?
- Maybe hold off until we get configure code to change it to a submodule.
If we make libevent a submodule pointer, then we wouldn't be able to fix problems even if we have bigger problems than this.
- For OMPI v5.0, the earliest version of libevent we're going to support out of the box is 2.0.21 (RHEL7)
  - Issue 7666
- Logic: if the version installed on the system is older, we'll use our bundled one.
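The selection logic described above can be sketched like this. It is a Python illustration of the decision only, not the actual configure code; the function names are hypothetical, and it assumes plain dotted numeric version strings such as "2.1.12". The 2.0.21 minimum is the value given in these notes.

```python
# Minimum external libevent version, as stated in the notes above.
MIN_EXTERNAL_LIBEVENT = (2, 0, 21)


def parse_version(text):
    """Turn a dotted numeric version string like '2.1.12' into a tuple."""
    return tuple(int(part) for part in text.split("."))


def choose_libevent(installed_version):
    """Return 'external' or 'bundled'.

    installed_version is a version string, or None when no system
    libevent was found.
    """
    if installed_version is None:
        return "bundled"
    if parse_version(installed_version) >= MIN_EXTERNAL_LIBEVENT:
        return "external"
    return "bundled"


for candidate in ("2.1.12", "2.0.21", "2.0.20", None):
    print(candidate, "->", choose_libevent(candidate))
```

Comparing tuples of integers avoids the string-comparison trap where "2.0.9" would sort after "2.0.21".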
There is a hypothetical risk that we can't ship patches, and we
- MAC configury work
ULFM configury work is independent of libevent configury work.
Do we still merge 7940 to revert it since submodule will replace it completely?
- Might be nice for git

EFA

AWS backend uses verbs interface in OFI.
- If OFI BTL is there, it initializes first.
- If an EFA device is there, initializing the OFI BTL before the openib BTL won't cause issues.
  - If EFA device isn't there, then openib BTL
  - But this means mucking around with base initialization code.
- Calling ibv_fork_safe() by default.

Face to face

August 10th, 11th, Monday and Tuesday that week.
List of Topics to discuss, and presenters.
- On the wiki, start filling in.
Many companies are not allowing face-to-face travel until 2021 due to COVID19.
- Instead, let's do a series of virtual face-to-face meetings?
Yes this summer to discuss for v5.0
- Maybe we can do it by topic?
- Maybe not 4 or 8 hour things.
Different topics on different days.
Do a doodle poll of least-worse days in late July/August.
- August 10th-14th - 3 hour block of time 8-11 Pacific time.
- Jeff will do another doodle for days of the week (vote for 2)
Start a list of topics.

MPI Forum was last week.

Sessions is now in.
Partitioned communication voted in.

Thread local storage issue

OpalTSDCreate - takes a thread storage local key that would be tracked locally in opal.
- But when we go to delete, it's not being deleted.
- But want flexibility to destroy on our own or explicitly
- George thinks the mode we have today, since tracking all keys to be released by main thread.
- George thinks Artem's approach is the correct approach.
Would have to change the way that keys are USED, and different components are using it in a different way.
Something similar should be done in different places.
If you do it just for UCX, then others can see how you did it and check for their code.
So we think current PR is good, but it leaves old API and new API.
- But it might be better to remove OLD way and make broken components do SOMETHING to update their code.
- Should be easy for components to add explicit cleanup calls
Master branch only.
Opened a new pull request yesterday that addresses the problem as discussed last week.
Tracking of TLS in common code.
- Have a low level thread specific keys (very simple based on thread implementation)
- Tracked key, probably what you want to use if you want to ensure all TLS is accounted and released at destruction of key.
- Tommy changed all of the places in OMPI where those keys are used. Just use tracked key instead of regular key.
- Changed set_specific and get_specific to just set and get.
- Please review and give suggestions.
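As background for the "tracked key" idea: POSIX pthread_key_delete does not run destructors for values still stored under the key, so something has to remember what each thread stored if everything is to be released when the key goes away. The sketch below shows that bookkeeping in Python. It is not the OPAL API or the code in the pull request; the class TrackedKey and its destroy method are hypothetical, and only the short set/get naming follows the notes above.

```python
import threading


class TrackedKey:
    """Hypothetical tracked key: remembers every thread's value so that
    all of them can be released when the key is destroyed."""

    def __init__(self, destructor):
        self._destructor = destructor
        self._values = {}              # Thread object -> stored value
        self._lock = threading.Lock()

    def set(self, value):
        with self._lock:
            self._values[threading.current_thread()] = value

    def get(self):
        with self._lock:
            return self._values.get(threading.current_thread())

    def destroy(self):
        """Release the values of all threads, including exited ones."""
        with self._lock:
            values, self._values = list(self._values.values()), {}
        for value in values:
            self._destructor(value)


released = []
key = TrackedKey(released.append)


def worker(n):
    key.set(f"buffer-{n}")


threads = [threading.Thread(target=worker, args=(n,)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

key.set("buffer-main")
key.destroy()
print(sorted(released))
```

Keying the table by thread object rather than by numeric thread id avoids losing an entry when an id is reused after a thread exits.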
Does it even make sense to do TLS in OPAL at all?
- May indicate that we have an abstraction wrong somewhere.
- If MPI depends on this in OPAL, then it depends on them in PMIx and other layers?
- Not sure if there is a problem, but at a high level, sounds problematic.
Baking in pthread assumptions in general is not a good idea.
- That's what this PR does: abstract pthread semantics.
May be some confusion, no problem with porting this API anywhere.
- Issue raised before is that if you're relying on a certain type of thread in MPI layer.
- But we don't, because there's a framework.
- But Application is linked against PMIx and libevent and to use other threading models is dangerous.
  - To make this work, you have to make changes to event polling, etc.
Not saying we shouldn't take these patches, these make things better.
- But we do have a problem that other thread components just aren't going to "just work", because PMIx and libevent, which use pthreads, conflict with other threading models.
  - argobots actually uses pthreads, not sure about qthreads.
  - Working on a way to configure libevent to make this combo work.

C11 atomic usage is a mess

Last week:
- George needs some input on PR
- We don't need _atomic_ in most cases just need volatile
- patch linked to the issue PR7914
- We're not breaking things, we just get a lot of valid complaints from the Intel compiler.
  - STDOUT of make is ~16 MB due to all the Intel compiler warnings without this fix
There is a PR pending

Open Source Parent organization

Since Open-MPI is a registered non-profit.
If we log volunteer time we can
- Software in the Public Interest (Parent non-profit)
A week or two

Release Branches

Blockers All Open Blockers

Review v4.0.x Milestones v4.0.5

Blocked on a PR from George Issue 7937
7968 is marked as a blocker, but this is more of a UCX issue, than OMPI issue.

Review v4.1.x Milestones v4.1.0

A couple of pending issues:
- OFI issue Amazon is working on.
Need Review on PR7991
Need some more cycles on HAN and Adapt in master, before pulling them into v4.1
- AWS will run tests before next week.
- Waiting on George's patch to HAN and Adapt.
Schedule: Want to release end-of-July
- A minimum of a week, need changes from George on collective components
- Posted a v4.1.0 rc1 to go through mechanisms to ensure we can release.
A number of PRs for v4.1 have not yet gone into master.
PRs against v4.1.x need reviews (and need corresponding PRs to go into master)
- A UCX init PR out for 4 weeks, still need a review
Release Engineers: Brian (AWS) Jeff Squyres (Cisco)
The fact that we removed pt2pt in OSC, is causing One-sided.
- Nathan agreed to take a look.
George found an SM BTL issue at Init on master. Jeff filed Issue 7937
- Summary: Cacheline size is set very late after modex, everything that uses cacheline before modex.
- This is a correctness issue (not optimization) - George on today's call
  - At one point we switched to only pulling in topology when we had to, that probably introduced this issue.
    - Affects all the way back to v2.x
- Backing file for shared memory is allocated by one proc, and used by many. Creates before MODEX (uses 128), but later when users try to use, they use a different 128 cacheline.
- Looking at the code, we do this other places as well, but not as dramatic.
- May need an alternative for master (after Brian finishes configury, to bring hwloc out of frameworks).
- How do we fix this?
  - Can we just get the cacheline size before we get the rest of topology information? Brice said no.
  - Only solution we can see is creating an opal function to do this.
    - Would work, but might be slightly less optimal. If every rank does this in parallel, does this cause issues?
    - George can look for it, but can't do it before end of week.
    - Probably easiest to look in dstore in PMIx that's pretty clean (mentioned in ticket)
- Who can do this work?
  - Showing itself in CUDA issue.
    - Tomislav Janjusic (nVidia) will ask some of his colleagues.
- Because we align some structs based on that, but
  - It would be associated with getting the topology (but not retrieved until after the modex)
  - Only cuda btl calls the function directly, everyone else extracts from PMIx.
    - What we ought to do, no harm in getting topology earlier, just need to ensure PMIx is initialized.
    - On v4.1, we don't get the topology before someone requests it much later.
      - Must also affect v4.0.x
  - George put a fix into master, but making a better change to load it as soon as PMIx is initialized, would be much better.
    - Con is that if we're not in a PMIx environment to share this pointer, then every process will go do this discovery, even if they don't need it later.
    - Problem is that the process that creates the backing file, creates it very early.
- Someone should review all the branches to look to see if we got topology before someone uses the cacheline size.
- George saw it in SM BTL structures. Deadlock.
- This isn't tested by our CI infrastructure.
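Why a cacheline-size mismatch is a correctness problem and not just a performance one can be shown with a little arithmetic. The sketch below is illustrative Python, not the SM BTL layout code; the header and slot sizes are invented, and the 64-byte value for the "real" cacheline size is an assumption (the notes only say the creator uses 128). If the process that creates the backing file aligns structures to one cacheline size and its peers compute offsets with another, they look for the same structures at different addresses.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def slot_offsets(header_size, slot_size, count, cacheline):
    """Byte offsets of `count` slots placed after a header, each slot
    starting on a cacheline boundary."""
    offsets = []
    cursor = align_up(header_size, cacheline)
    for _ in range(count):
        offsets.append(cursor)
        cursor = align_up(cursor + slot_size, cacheline)
    return offsets


# Creator lays out the file early, with the default value of 128.
creator_view = slot_offsets(header_size=40, slot_size=24, count=3, cacheline=128)

# Peers compute offsets later, after the topology reports 64 (assumed value).
peer_view = slot_offsets(header_size=40, slot_size=24, count=3, cacheline=64)

print("creator:", creator_view)
print("peer:   ", peer_view)
```

Every slot offset differs between the two views, which is consistent with the failure showing up as a deadlock in shared structures rather than as a slowdown.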
Still want:
- George's Collectives
  - George is still working on master version of coll
  - Next thing he's working on today.
- Will probably need to do something to CI to enable these for testing.
  - CI not really executing
  - IBM will do some testing of this.
  - Will need some docs on how users select this.
- Tunings for tuned coll
  - Nothing to discuss today.
  - https://github.com/open-mpi/ompi/pull/7952
- AVX
  - Went in this morning.
- UCX PRs awaiting review.
Past: We've come to consensus for a v4.1.0 release
- Need include/exclude selection, worried about consistent selection.
- A lot of PRs outstanding, but can't merge until
  - Patch for OFI stuff messed up v4.1.x branch.
  - Howard has a fix PR, Jeff is looking at.
- Howard changed new OFI BTL parameters to be consistent with MTL
- Not breaking ABI or backwards compatibility.
- v4.1.x branch, branched from v4.0.4 tag.
- NOT touching runtime!!!
- Not going to be pulling in a new PMIx version.
All MTT is online on v4.1.x branch
Not compiling under SLURM EFA test. (OFI BTL issue)

Review v5.0.0 Milestones v5.0.0

No update this week other than master discussion.
Need to put OSC pt2pt
- OSC RDMA requires a single BTL that can contact every single process.
  - This didn't use to be the case. (Comment in the code)
We can't use the OSC pt2pt.
- It is not thread safe. Doesn't conform to MPI4 standard. Not safe.
- This is just a testing fallacy. Could add tests to show this, but we're still in the same boat.
- Either product A or B is broken and we need to fix it.
RDMA Onesided should fall back to "my atomics" because TCP will never have rdma atomics.
- The idea was to put the atomics into the BTL base, which could do all of the one-sided atomics under the covers.
Jeff will close the PR, and
Jeff will Nathan will fetching, get, compare and swap.
Two new PRs for MPI4.0 Error handling - new PRs from Aurelien Bouteiller.
Does UCX support iWarp?
- Does libFabric support iWarp via verbs provider?
- https://github.com/openucx/ucx/issues/2507 suggests it doesn't.
- Brian thinks that libFabric
- OFI can support iWarp, just need to specify the provider in the include list.
- This person who's asking is a partner not a customer
PMIX
- Working on PMIx v4.0.0 which is what Open MPI v5.0 will use.
- Sessions needs something from PMIx v4
- ULFM - not sure if it needs PMIx, think it needs PRRTE changes.
- PPN scaling issue - simple algorithmic issue in this function
  - PMIX talked about it. Artem might know someone who might be interested in working on it.
  - Algorithm behind one of the interfaces doesn't scale well.
  - Not a regression. Above ~ 4K nodes, becomes quadratic.
PRRTE
- Nothing's happening there.

master

Mostly discussed above.

ompi-tests-public

We now have a new publicly visible test repo, for new tests
- Haven't tried to do two checkouts (of both public and private test repos) in one MTT run yet.
- Should probably update instructions on how to setup mtt
- Can add new PR based tests if we want. We'll need to add new infrastructure.

Super Computing Birds-of-a-feather

George and Jeff will help plan and come to community.
- Done / Submitted.
- Probably won't hear back until Sept.
Probably after super computing.

Infrastructure

scale-testing, PRs have to opt-into it.

Review Master Master Pull Requests

CI status

Dependencies

PMIx Update

ORTE/PRRTE

MTT


Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

Call in user - Thomas

not there today (I keep this for easy cut-n-paste for future notes)

Jeff Squyres (Cisco)
Artem Polyakov (nVidia/Mellanox)
Aurelien Bouteiller (UTK)
Austen Lauria (IBM)
Barrett, Brian (AWS)
Brendan Cunningham (Intel)
Christoph Niethammer (HLRS)
Edgar Gabriel (UH)
Geoffrey Paulsen (IBM)
George Bosilca (UTK)
Howard Pritchard (LANL)
Joseph Schuchart
Josh Hursey (IBM)
Joshua Ladd (nVidia/Mellanox)
Matthew Dosanjh (Sandia)
Noah Evans (Sandia)
Ralph Castain (Intel)
Naughton III, Thomas (ORNL)
Todd Kordenbrock (Sandia)
Tomislav Janjusic
William Zhang (AWS)
Akshay Venkatesh (NVIDIA)
Brandon Yates (Intel)
Charles Shereda (LLNL)
David Bernholdt (ORNL)
Erik Zeiske
Geoffroy Vallee (ARM)
Harumi Kuno (HPE)
Mark Allen (IBM)
Matias Cabral (Intel)
Michael Heinz (Intel)
Nathan Hjelm (Google)
Scott Breyer (Sandia?)
Shintaro Iwasaki
William Zhang (AWS)
Xin Zhao (nVidia/Mellanox)
mohan (AWS)

New

HWLOC initialization thing.

trivial to fix in master.
Once Brian gets his configure stuff in.
May need someone else to finish.
Should be able to call PMIx Init, and ____ init, don't need opal init at beginning of MPI_Init.
- This won't work going back into releases.
- buried in mca system.
- need
What to do about fixing release branches.
Can't give local topology without ___
Don't run it at scale.
The portable way to get it, is hwloc.

revert libevent https://github.com/open-mpi/ompi/pull/7940

Summary: We committed some code
- Race condition we always win (because it happens at finalize and haven't cared), but now in ULFM (and possibly Sessions)
- We switched the configury logic so we always prefer external libevent (above a certain level of external libevent).
  - Most OSes are above that level, so almost always prefer external libevent.
  - If we get the fix into our internal libevent,
    - Concern is that unless we or users explicitly request internal libevent, we'll almost never get this fix.
  - One solution would be
- Can't think of another solution.
- Packagers don't like to use our internal component
- Only thing we can think of is if you want ULFM, you can't use external libevent.
Progress of getting PR accepted upstream?
- Yes, prepared an upstream libevent PR.
  - They want a non-open-mpi reproducer.
  - Have ideas on how to create this reproducer, but not sure if it's very easy.
  - Original code writer added some protection, but has since retired. This PR removes this protection.
    - Actually "we" added this race condition protection in libevent. It delays removal of file descriptor until too late.
      - The fix validates the FD before handling. Sounds right to all.
- Not started yet. Creating
- May be a way to code around this on ULFM, but not really sure, because things get into a bad state, and only way might be to ruin our performance.
If we protect this with configure (when building ULFM and have to use internal libevent).
- It means we move to submodules for libevent, we'd have to "mirror" libevent ourselves
Only master / v5.0
- If we have TCP it could happen, but we disable errors in Finalize so don't hit this issue.
libevent patch to this OLD internal libevent 2022
- It's possible that the problem goes away in newer libevent. But updating libevent was a major hassle.
- George checked whether the code is gone or has been modified in libevent.
  - Code is still there in latest libevent (so still need fix).
- updating libevent would be a much better solution.
If upgrading to new libevent is answer.

Annual review of OMPI

Jeff will send out Once a year, make sure those who have commit access should
- Have not reviewed yet:
  - Amazon, Bull, Google, Los Alamos, nVidia/Mellanox
- Need to update the spreadsheet saying "looked at".

Face to face

August 10th, 11th, Monday and Tuesday that week.
Put stuff on the agenda wiki (URL HERE)
List of Topics to discuss, and presenters.
- On the wiki, start filling in.

Super Computing Birds-of-a-feather

George and Jeff will help plan and come to community.
- Done / Submitted.
May not have Super Computing conference at ALL this year.
Many other projects are doing a virtual state of the union type meeting to try to cover what they'd usually do in a Birds of a feather meeting.
Then this works pretty well, and do this a couple of times a year.
Not constrained to Super Computing
Almost certain that it will be virtual
- Not sure the cost.
- Ralph and Jeff have been doing ABCs of Open MPI - SO many people. Done 2 of 3 sessions (each went 1.5 hours, lots of questions)
  - Slides and Youtube are on website, and will send link to userlist.
  - Part 3 is August 5th
- Also want an in-depth walk through of PMIx initialization / wireup

MPI Forum was last week.

Sessions is now in.
Partitioned communication voted in.

Thread local storage issue

OpalTSDCreate - takes a thread storage local key that would be tracked locally in opal.
- But when we go to delete, it's not being deleted.
- But want flexibility to destroy on our own or explicitly
- George thinks the mode we have today, since tracking all keys to be released by main thread.
- George thinks Artem's approach is the correct approach.
Would have to change the way that keys are USED, and different components are using it in a different way.
Something similar should be done in different places.
If you do it just for UCX, then others can see how you did it and check for their code.
So we think current PR is good, but it leaves old API and new API.
- But it might be better to remove OLD way and make broken components do SOMETHING to update their code.
- Should be easy for components to add explicit cleanup calls
Master branch only.
Opened a new pull request yesterday that addresses the problem as discussed last week.
Tracking of TLS in common code.
- Have a low level thread specific keys (very simple based on thread implementation)
- Tracked key, probably what you want to use if you want to ensure all TLS is accounted and released at destruction of key.
- Tommy changed all of the places in OMPI where those keys are used. Just use tracked key instead of regular key.
- Changed set_specific and get_specific to just set and get.
- Please review and give suggestions.
Does it even make sense to do TLS in OPAL at all?
- May indicate that we have an abstraction wrong somewhere.
- If MPI depends on this in OPAL, then it depends on them in PMIx and other layers?
- Not sure if there is a problem, but at a high level, sounds problematic.
Baking in pthread assumptions in general is not a good idea.
- That's what this PR does: abstract pthread semantics.
May be some confusion, no problem with porting this API anywhere.
- Issue raised before is that if you're relying on a certain type of thread in MPI layer.
- But we don't, because there's a framework.
- But Application is linked against PMIx and libevent and to use other threading models is dangerous.
  - To make this work, you have to make changes to event polling, etc.
Not saying we shouldn't take these patches, these make things better.
- But we do have a problem that other thread components just aren't going to "just work", because PMIx and libevent, which use pthreads, conflict with other threading models.
  - argobots actually uses pthreads, not sure about qthreads.
  - Working on a way to configure libevent to make this combo work.

C11 atomic usage is a mess

Last week:
- George needs some input on PR
- We don't need _atomic_ in most cases just need volatile
- patch linked to the issue PR7914
- We're not breaking things, we just get a lot of valid complaints from the Intel compiler.
  - STDOUT of make is ~16 MB due to all the Intel compiler warnings without this fix
There is a PR pending

Discuss Open-MPI binding when direct-launched

Schizo SLURM binding detection - Might not need a solution on v4.0.x
PRs have gone into v4.0.x and v4.1.x

Open Source Parent organization

Since Open-MPI is a registered non-profit.
If we log volunteer time we can
- Software in the Public Interest (Parent non-profit)
A week or two

Release Branches

Blockers All Open Blockers

Review v4.0.x Milestones v4.0.5

Discussing CUDA init in UCX PML PR 7898
- Looks like a bugfix, so should be okay to put into a release branch.
- Is there a better place to initialize the CUDA hooks?
- If we request a BTL or PML to be loaded, if configured with cuda
- CUDA library is loaded by BTL that requires it.
- Some questions about possibly making it more generic for all PMLs that use CUDA.
  - Don't want to load CUDA if only using TCP or shared memory.
- We'll take this PR once it passes CI and is reviewed.
v4.0.5 schedule: End of July
- Will create RC1 today after PR7898 goes in.
- Two potential drivers for a quick v4.0.5 turn-around.
  - OSC RDMA Bug - May drive a v4.0.5 release.
  - Program Aborts on detach.

Review v4.1.x Milestones v4.1.0

Schedule: Want to release end-of-July
- A minimum of a week, need changes from George on collective components
Posted a v4.1.0 rc1 to go through mechanisms to ensure we can release.
Release Engineers: Brian (AWS) Jeff Squyres (Cisco)
Jeff is reviewing Collective components
- Yoseph also reviewing.
George found an SM BTL issue at Init on master. Jeff filed Issue 7937
- Summary: Cacheline size is set very late after modex, everything that uses cacheline before modex.
- This is a correctness issue (not optimization) - George on today's call
  - At one point we switched to only pulling in topology when we had to, that probably introduced this issue.
    - Affects all the way back to v2.x
- Backing file for shared memory is allocated by one proc, and used by many. Creates before MODEX (uses 128), but later when users try to use, they use a different 128 cacheline.
- Looking at the code, we do this other places as well, but not as dramatic.
- May need an alternative for master (after Brian finishes configury, to bring hwloc out of frameworks).
- How do we fix this?
  - Can we just get the cacheline size before we get the rest of topology information? Brice said no.
  - Only solution we can see is creating an opal function to do this.
    - Would work, but might be slightly less optimal. If every rank does this in parallel, does this cause issues?
    - George can look for it, but can't do it before end of week.
    - Probably easiest to look in dstore in PMIx that's pretty clean (mentioned in ticket)
- Who can do this work?
  - Showing itself in CUDA issue.
    - Tomislav Janjusic (nVidia) will ask some of his colleagues.
- Because we align some structs based on that, but
  - It would be associated with getting the topology (but not retrieved until after the modex)
  - Only cuda btl calls the function directly, everyone else extracts from PMIx.
    - What we ought to do, no harm in getting topology earlier, just need to ensure PMIx is initialized.
    - On v4.1, we don't get the topology before someone requests it much later.
      - Must also affect v4.0.x
  - George put a fix into master, but making a better change to load it as soon as PMIx is initialized, would be much better.
    - Con is that if we're not in a PMIx environment to share this pointer, then every process will go do this discovery, even if they don't need it later.
    - Problem is that the process that creates the backing file, creates it very early.
- Someone should review all the branches to look to see if we got topology before someone uses the cacheline size.
- George saw it in SM BTL structures. Deadlock.
- This isn't tested by our CI infrastructure.
Still want:
- George's Collectives
  - George is still working on master version of coll
  - Next thing he's working on today.
- Will probably need to do something to CI to enable these for testing.
  - CI not really executing
  - IBM will do some testing of this.
  - Will need some docs on how users select this.
- Tunings for tuned coll
  - Nothing to discuss today.
  - https://github.com/open-mpi/ompi/pull/7952
- AVX
  - Went in this morning.
- UCX PRs awaiting review.
Past: We've come to consensus for a v4.1.0 release
- Need include/exclude selection, worried about consistent selection.
- A lot of PRs outstanding, but can't merge until
  - Patch for OFI stuff messed up v4.1.x branch.
  - Howard has a fix PR, Jeff is looking at.
- Howard changed new OFI BTL parameters to be consistent with MTL
- Not breaking ABI or backwards compatibility.
- v4.1.x branch, branched from v4.0.4 tag.
- NOT touching runtime!!!
- Not going to be pulling in a new PMIx version.
All MTT is online on v4.1.x branch
Not compiling under SLURM EFA test. (OFI BTL issue)

Review v5.0.0 Milestones v5.0.0

No update this week other than master discussion.
Need to put OSC pt2pt
- OSC RDMA requires a single BTL that can contact every single process.
  - This didn't use to be the case. (Comment in the code)
We can't use the OSC pt2pt.
- It is not thread safe. Doesn't conform to MPI4 standard. Not safe.
- This is just a testing fallacy. Could add tests to show this, but we'd still be in the same boat.
- Either product A or B is broken and we need to fix it.
RDMA one-sided should fall back to "my atomics" because TCP will never have RDMA atomics.
- The idea was to put the atomics into the BTL base, which could do all of the one-sided atomics under the covers.
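
A rough sketch of the "fall back under the covers" idea, using compare-and-swap as the example. Everything here is hypothetical (`transport_t`, `base_cswap`, `emulated_cswap` are not Open MPI names), and the emulation is purely local; a real fallback would have to run the operation at the target process.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transport descriptor: cswap is NULL when the transport
 * (for example TCP) has no native atomic support. */
typedef struct transport {
    int64_t (*cswap)(int64_t *target, int64_t compare, int64_t value);
} transport_t;

static pthread_mutex_t emulation_lock = PTHREAD_MUTEX_INITIALIZER;

/* Software emulation: serialize through a lock and return the old value. */
static int64_t emulated_cswap(int64_t *target, int64_t compare, int64_t value)
{
    pthread_mutex_lock(&emulation_lock);
    int64_t old = *target;
    if (old == compare) {
        *target = value;
    }
    pthread_mutex_unlock(&emulation_lock);
    return old;
}

/* Base-level entry point: callers never check for native support themselves. */
static int64_t base_cswap(const transport_t *t, int64_t *target,
                          int64_t compare, int64_t value)
{
    assert(t != NULL && target != NULL);
    if (t->cswap != NULL) {
        return t->cswap(target, compare, value);
    }
    return emulated_cswap(target, compare, value);
}
```

The point of the design is that the one-sided layer always calls the base entry point, and only the base knows whether the transport provided the operation.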
Jeff will close the PR, and
Jeff will Nathan will fetching, get, compare and swap.
Two new PRs for MPI4.0 Error handling - new PRs from Aurelien Bouteiller.
Does UCX support iWARP?
- Does libfabric support iWARP via the verbs provider?
- https://github.com/openucx/ucx/issues/2507 suggests it doesn't.
- Brian thinks that libfabric
- OFI can support iWARP, just need to specify the provider in the include list.
- The person asking is a partner, not a customer.
PMIx
- Working on PMIx v4.0.0 which is what Open MPI v5.0 will use.
- Sessions needs something from PMIx v4
- ULFM - not sure if it needs PMIx, think it needs PRRTE changes.
- PPN scaling issue - simple algorithmic issue in this function
  - PMIx talked about it. Artem might know someone who might be interested in working on it.
  - Algorithm behind one of the interfaces doesn't scale well.
  - Not a regression. Above ~ 4K nodes, becomes quadratic.
PRRTE
- Nothing's happening there.

master

Mostly discussed above.

Infrastructure

scale-testing, PRs have to opt-into it.

Review Master Master Pull Requests

CI status

Dependencies

PMIx Update

ORTE/PRRTE

MTT

Many companies are not allowing a face to face travel until 2021 due to COVID19.
- Instead, let's do a series of virtual face-to-face meetings?
Yes, this summer, to discuss v5.0.
- Maybe we can do it by topic?
- Maybe not 4 or 8 hour things.
Different topics on different days.
Do a doodle poll of least-worst days in late July/August.
- August 10th-14th - 3 hour block of time 8-11 Pacific time.
- Jeff will do another doodle for days of the week (vote for 2)
Start a list of topics.

MPI Forum was last week.

Sessions is now in.
Partitioned communication voted in.

Thread local storage issue

OpalTSDCreate - takes a thread-local storage key that would be tracked locally in opal.
- But when we go to delete, it's not being deleted.
- But want flexibility to destroy on our own or explicitly
- George thinks the mode we have today, since tracking all keys to be released by main thread.
- George thinks Artem's approach is the correct approach.
Would have to change the way that keys are USED, and different components are using it in a different way.
Something similar should be done in different places.
If you do it just for UCX, then others can see how you did it and check for their code.
So we think current PR is good, but it leaves old API and new API.
- But it might be better to remove OLD way and make broken components do SOMETHING to update their code.
- Should be easy for components to add explicit cleanup calls
Master branch only.
Opened a new pull request yesterday that addresses the problem as discussed last week.
Tracking of TLS in common code.
- Have low-level thread-specific keys (very simple, based on the thread implementation)
- Tracked key, probably what you want to use if you want to ensure all TLS is accounted for and released at destruction of the key.
- Tommy changed all of the places in OMPI where those keys are used. Just use tracked key instead of regular key.
- Changed set_specific and get_specific to just set and get.
- Please review and give suggestions.
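
A minimal sketch of the tracked-key idea on top of pthreads. The `tracked_key_*` names are hypothetical and are not the API from the pull request; the sketch also assumes each thread sets its value only once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* One node per value stored through the key, by any thread. */
typedef struct tracked_value {
    void *value;
    struct tracked_value *next;
} tracked_value_t;

typedef struct tracked_key {
    pthread_key_t key;
    void (*destructor)(void *);
    pthread_mutex_t lock;
    tracked_value_t *values;
} tracked_key_t;

static int tracked_key_create(tracked_key_t *tk, void (*destructor)(void *))
{
    tk->destructor = destructor;
    tk->values = NULL;
    pthread_mutex_init(&tk->lock, NULL);
    /* No pthread destructor: values are released in tracked_key_destroy. */
    return pthread_key_create(&tk->key, NULL);
}

static int tracked_key_set(tracked_key_t *tk, void *value)
{
    tracked_value_t *node = malloc(sizeof(*node));
    if (node == NULL) {
        return -1;
    }
    node->value = value;
    pthread_mutex_lock(&tk->lock);
    node->next = tk->values;   /* remember the value for later release */
    tk->values = node;
    pthread_mutex_unlock(&tk->lock);
    return pthread_setspecific(tk->key, value);
}

static void *tracked_key_get(tracked_key_t *tk)
{
    return pthread_getspecific(tk->key);
}

/* Release every value ever set through the key, then delete the key.
 * Returns the number of values released. */
static int tracked_key_destroy(tracked_key_t *tk)
{
    int released = 0;
    pthread_mutex_lock(&tk->lock);
    while (tk->values != NULL) {
        tracked_value_t *node = tk->values;
        tk->values = node->next;
        if (tk->destructor != NULL && node->value != NULL) {
            tk->destructor(node->value);
        }
        free(node);
        released++;
    }
    pthread_mutex_unlock(&tk->lock);
    pthread_mutex_destroy(&tk->lock);
    pthread_key_delete(tk->key);
    return released;
}
```

With a plain key, `pthread_key_delete` does not run destructors, so values set by other threads are never released; tracking them in the key makes the cleanup explicit.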
Does it even make sense to do TLS in OPAL at all?
- May indicate that we have an abstraction wrong somewhere.
- If MPI depends on this in OPAL, then it depends on them in PMIx and other layers?
- Not sure if there is a problem, but at a high level, sounds problematic.
Baking in pthread assumptions in general is not a good idea.
- That's what this PR does: abstract pthread semantics.
May be some confusion, no problem with porting this API anywhere.
- Issue raised before is that if you're relying on a certain type of thread in MPI layer.
- But we don't, because there's a framework.
- But Application is linked against PMIx and libevent and to use other threading models is dangerous.
  - To make this work, you have to make changes to event polling, etc.
Not saying we shouldn't take these patches, these make things better.
- But we do have a problem that other thread components aren't going to "just work", because PMIx and libevent use pthreads, which conflicts with other threading models.
  - argobots actually uses pthreads, not sure about qthreads.
  - Working on a way to configure libevent to make this combo work.

C11 atomic usage is a mess

Last week:
- George needs some input on PR
- We don't need _atomic_ in most cases just need volatile
- patch linked to the issue PR7914
- We're not breaking things, we just get a lot of valid complaints from the Intel compiler.
  - STDOUT of make is ~16 MB due to all the Intel compiler warnings without this fix
There is a PR pending

Discuss Open-MPI binding when direct-launched

Schizo SLURM binding detection - Might not need a solution on v4.0.x
PRs have gone into v4.0.x and v4.1.x

Open Source Parent organization

Since Open-MPI is a registered non-profit.
If we log volunteer time we can
- Software in the Public Interest (Parent non-profit)
A week or two

Release Branches

Blockers All Open Blockers

Review v4.0.x Milestones v4.0.5

Discussing CUDA init in UCX PML PR 7898
- Looks like a bugfix, so should be okay to put into a release branch.
- Is there a better place to initialize the CUDA hooks?
- If we request a BTL or PML to be loaded, if configured with cuda
- CUDA library is loaded by BTL that requires it.
- Some questions about possibly making it more generic for all PMLs that use CUDA.
  - Don't want to load CUDA if only using TCP or shared memory
- We'll take this PR once it passes CI and is reviewed.
v4.0.5 schedule: End of July
- Will create RC1 today after PR7898 goes in.
- Two potential drivers for a quick v4.0.5 turn-around.
  - OSC RDMA Bug - May drive a v4.0.5 release.
  - Program Aborts on detach.

Review v4.1.x Milestones v4.1.0

Schedule: Want to release end-of-July
- A minimum of a week, need changes from George on collective components
Posted a v4.1.0 rc1 to go through mechanisms to ensure we can release.
Release Engineers: Brian (AWS), Jeff Squyres (Cisco)
Jeff is reviewing Collective components
- Yoseph also reviewing.
George found an SM BTL issue at Init on master. Jeff filed Issue 7937
- Summary: Cacheline size is set very late, after the modex, which affects everything that uses the cacheline size before the modex.
- This is a correctness issue (not optimization) - George on today's call
  - At one point we switched to only pulling in topology when we had to, that probably introduced this issue.
    - Affects all the way back to v2.x
- Backing file for shared memory is allocated by one proc and used by many. It is created before the modex (using a cacheline size of 128), but later, when other procs use it, they may see a different cacheline size.
- Looking at the code, we do this in other places as well, but not as dramatically.
- May need an alternative for master (after Brian finishes configury, to bring hwloc out of frameworks).
- How do we fix this?
  - Can we just get the cacheline size before we get the rest of topology information? Brice said no.
  - Only solution we can see is creating an opal function to do this.
    - Would work, but might be slightly less optimal. If every rank does this in parallel, does this cause issues?
    - George can look for it, but can't do it before end of week.

Super Computing Birds-of-a-feather

George and Jeff will help plan and come to community.
- Done / Submitted.
May not have Super Computing conference at ALL this year.
Many other projects are doing a virtual state of the union type meeting to try to cover what they'd usually do in a Birds of a feather meeting.
If this works pretty well, we could do this a couple of times a year.
Not constrained to Super Computing
Almost certain that it will be virtual
- Not sure the cost.
- Ralph and Jeff have been doing ABCs of Open MPI - SO many people. Done 2 of 3 sessions (each went 1.5 hours, lots of questions)
  - Slides and YouTube videos are on the website, and a link will be sent to the users list.
  - Part 3 is August 5th
- Also want an in-depth walk through of PMIx initialization / wireup