
[MNT] handle mps backend for lower versions of pytorch and fix mps failure on macOS-latest runner #1648

Merged

merged 14 commits on Sep 13, 2024

Conversation

fnhirwa
Member

@fnhirwa fnhirwa commented Sep 4, 2024

Description

This PR handles the failure that can occur when setting the device to mps on a torch version that doesn't support the mps backend.

Depends on #1633

I used pytest.MonkeyPatch() to disable discovery of the mps accelerator, so the tests run on CPU for macOS-latest; a minimal sketch of the idea is shown below.

fixes #1596
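
A minimal sketch of the idea (illustrative test name; it assumes the installed torch exposes torch.backends.mps and the private torch._C._mps_is_available binding that the public check delegates to):

import pytest
import torch


def test_device_resolution_without_mps():
    """Illustrative test: hide the mps accelerator so everything runs on cpu."""
    with pytest.MonkeyPatch.context() as mp:
        # torch.backends.mps.is_available() delegates to this private binding,
        # so patching it makes torch report mps as unavailable for the
        # duration of the context.
        mp.setattr("torch._C._mps_is_available", lambda: False)
        assert not torch.backends.mps.is_available()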

@codecov-commenter

codecov-commenter commented Sep 4, 2024


Codecov Report

Attention: Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.

Project coverage is 90.03%. Comparing base (e2bf3fd) to head (f3c64ab).

Files with missing lines Patch % Lines
pytorch_forecasting/utils/_utils.py 0.00% 7 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1648      +/-   ##
==========================================
- Coverage   90.14%   90.03%   -0.12%     
==========================================
  Files          32       32              
  Lines        4780     4786       +6     
==========================================
  Hits         4309     4309              
- Misses        471      477       +6     
Flag Coverage Δ
cpu 90.03% <0.00%> (-0.12%) ⬇️
pytest 90.03% <0.00%> (-0.12%) ⬇️



@fnhirwa fnhirwa mentioned this pull request Sep 6, 2024
@fnhirwa fnhirwa marked this pull request as ready for review September 6, 2024 16:44
Collaborator

@fkiraly fkiraly left a comment

ok, this seems to be working, but I do not understand entirely what you are doing here and why...

Can you kindly explain?

@fkiraly fkiraly added the maintenance Continuous integration, unit testing & package distribution label Sep 6, 2024
@XinyuWuu
Member

XinyuWuu commented Sep 7, 2024

Great, I thought we would have to patch the binary file to overwrite torch._C._mps_is_available, but MonkeyPatch is a much better solution.

@XinyuWuu
Member

XinyuWuu commented Sep 7, 2024

Can we put the patch in a global fixture? Then we don't have to request it manually for every test.

# put it in conftest.py
import pytest


@pytest.fixture(autouse=True)
def no_mps(monkeypatch):
    """replace torch._C._mps_is_available for all tests."""
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)

@XinyuWuu
Member

XinyuWuu commented Sep 7, 2024

ok, this seems to be working, but I do not understand entirely what you are doing here and why...

Can you kindly explain?

I think it's replacing the torch._C._mps_is_available function to disable MPS. torch._C is a binary library.

@fnhirwa
Member Author

fnhirwa commented Sep 7, 2024

ok, this seems to be working, but I do not understand entirely what you are doing here and why...

Can you kindly explain?

Well, the mps backend check uses the torch._C._mps_is_available function; patching it lets us force torch.backends.mps.is_available to return False, since it generally calls _mps_is_available from the binary torch._C.

For macOS, with these changes we now run the tests on cpu only and the mps accelerator is disabled.

@fnhirwa
Member Author

fnhirwa commented Sep 7, 2024

Can we put the patch in a global fixture? Then we don't have to request it manually for every test.

# put it in conftest.py
import pytest


@pytest.fixture(autouse=True)
def no_mps(monkeypatch):
    """replace torch._C._mps_is_available for all tests."""
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)

This is a great solution; making it a global fixture makes more sense.

@fnhirwa fnhirwa changed the title [MNT] handle mps backend for lower versions of pytorch [MNT] handle mps backend for lower versions of pytorch and fix mps failure on macOS-latest runner Sep 7, 2024
@fkiraly
Collaborator

fkiraly commented Sep 7, 2024

Question: does the user on macos have the option to force the backend at runtime?

That is, do the tests faithfully reflect a situation that the user can establish on mac?

Because if the user cannot set the flag or something equivalent to it, then our tests will pass, but the user cannot make use of the package. This might sound "trivial" but I just want to emphasize this again: each test should "certify" a run case on the user's computer.

More specific question: let's take an actual mac similar to the setup of macos-latest. Can you give me a very concrete condition or preparation code that shows:

  • how the user would run the code on conditions prior to this PR and encounter the error
  • how the user would run the code on the conditions after this PR and not encounter the error

If this is something the user has to configure, furthermore, we need to explain it in the documentation as a workaround.

@fnhirwa
Member Author

fnhirwa commented Sep 7, 2024

Question: does the user on macos have the option to force the backend at runtime?

That is, do the tests faithfully reflect a situation that the user can establish on mac?

Because if the user cannot set the flag or something equivalent to it, then our tests will pass, but the user cannot make use of the package. This might sound "trivial" but I just want to emphasize this again: each test should "certify" a run case on the user's computer.

More specific question: let's take an actual mac similar to the setup of macos-latest. Can you give me a very concrete condition or preparation code that shows:

  • how the user would run the code on conditions prior to this PR and encounter the error
  • how the user would run the code on the conditions after this PR and not encounter the error

If this is something the user has to configure, furthermore, we need to explain it in the documentation as a workaround.

Running the code doesn't fail locally (I am using macOS myself); the main issue is with the GitHub macOS runner.

The tests are failing upstream due to the lack of nested virtualization and Metal Performance Shaders (MPS) on the runners, as noted here: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners

I would say we can test with the MPS accelerator again once this is solved.

@fkiraly
Collaborator

fkiraly commented Sep 7, 2024

What I mean is, we are patching only the tests, so it looks to me like it would still fail for the user on such a system?

@fnhirwa
Member Author

fnhirwa commented Sep 7, 2024

What I mean is, we are patching only the tests, so it looks to me like it would still fail for the user on such a system?

Yeah, we are only patching the tests; this doesn't make any global changes. Any modification lives only for the lifetime of the test being run; once it finishes, the setup rolls back to normal.
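
A small sketch of that scoping behavior (assuming the installed torch exposes the private binding; test names are illustrative and pytest runs them in file order):

import pytest
import torch

# Capture the real binding once, purely to illustrate that pytest restores it.
_real_mps_check = torch._C._mps_is_available


def test_patch_is_active(monkeypatch):
    # monkeypatch.setattr applies only for the duration of this test ...
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)
    assert not torch.backends.mps.is_available()


def test_patch_was_rolled_back():
    # ... and pytest undoes it automatically afterwards, so the original
    # binding is back in place for every subsequent test.
    assert torch._C._mps_is_available is _real_mps_check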

@fkiraly
Collaborator

fkiraly commented Sep 7, 2024

May I revert to my original question - could you kindly be precise about which usage condition we are now testing, and what user-side setup the following two correspond to:

  1. pre-PR condition under macos-latest
  2. post-PR condition under macos-13 or macos-latest

@fnhirwa
Member Author

fnhirwa commented Sep 7, 2024

May I revert to my original question - could you kindly be precise about which usage condition we are now testing, and what user-side setup the following two correspond to:

  1. pre-PR condition under macos-latest
  2. post-PR condition under macos-13 or macos-latest

On macOS:

  • pre-PR: under macOS-latest the tests were failing on CI due to the lack of nested virtualization and Metal Performance Shaders (MPS); locally the tests pass when we set the PYTORCH_ENABLE_MPS_FALLBACK=1 env variable, and the user can use both cpu and mps with that variable set.

  • post-PR: on both macOS-13 and macOS-latest the mps accelerator is disabled during testing, but the user can still use the mps accelerator when training. When mps-based testing is needed, we have to comment out the introduced fixture and set the PYTORCH_ENABLE_MPS_FALLBACK=1 env variable to enable the torch fallback from mps to cpu for operations not implemented on mps. With that setup, you can run the tests locally with mps as the accelerator (see the sketch below).
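
For reference, a sketch of setting the fallback variable when running locally with mps (assumption: the variable has to be set before torch is imported; exporting it in the shell before launching pytest works just as well):

import os

# Assumption: must be set before torch is imported for the cpu fallback of
# unimplemented mps ops to take effect.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(8, 3, device=device)
print(x.device)  # mps on a supported Mac; ops without an mps kernel fall back to cpu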

@fkiraly
Collaborator

fkiraly commented Sep 7, 2024

What's the way a user would enable mps or disable mps? Is this hard-coded, or can this be set somewhere?

Depending on whether this is possible or not, we should make sure we give clear pointers in documentation, or set reasonable defaults, or raise informative exceptions.

@fnhirwa
Member Author

fnhirwa commented Sep 7, 2024

What's the way a user would enable mps or disable mps? Is this hard-coded, or can this be set somewhere?

Depending on whether this is possible or not, we should make sure we give clear pointers in documentation, or set reasonable defaults, or raise informative exceptions.

A user currently does not disable mps; it is enabled by default whenever it is present. What I am thinking of now is adding to the documentation a guide for using the mps accelerator and the need for the PYTORCH_ENABLE_MPS_FALLBACK=1 env variable when using mps.

Giving users control over whether to enable or disable mps is possible: we could add a helper function that takes a parameter for enabling or disabling it and, when disabled, mocks torch._C._mps_is_available to return False (a sketch follows below). With this, we would need to document the usage of and need for that function.
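
A rough, hypothetical sketch of such a helper (neither the name set_mps_enabled nor the function itself is part of this PR or of torch's public API; it relies on the same private binding discussed above):

import torch

# Keep a handle on torch's own detection so it can be restored later
# (None if the installed torch version does not expose the binding at all).
_ORIGINAL_MPS_CHECK = getattr(torch._C, "_mps_is_available", None)


def set_mps_enabled(enabled: bool) -> None:
    """Hypothetical helper: force torch to report mps as unavailable.

    When ``enabled`` is False, the private binding behind
    ``torch.backends.mps.is_available()`` is replaced with a stub returning
    False; when True, torch's original detection is restored.
    """
    if _ORIGINAL_MPS_CHECK is None:
        return  # nothing to toggle on this torch version
    torch._C._mps_is_available = _ORIGINAL_MPS_CHECK if enabled else (lambda: False)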

@fkiraly
Collaborator

fkiraly commented Sep 7, 2024

A user currently does not disable mps; it is enabled by default whenever it is present.

  • which layer does this - torch or pytorch-forecasting?
  • if the default depends on availability, why is it failing on macos-latest?

Giving users control over whether to enable or disable mps is possible

Is this only possible with the patch, or is there a setting in one of the layers? If not, it sounds risky, as it is not using a public API.

@fnhirwa
Member Author

fnhirwa commented Sep 7, 2024

A user currently does not disable mps; it is enabled by default whenever it is present.

  • which layer does this - torch or pytorch-forecasting?
  • if the default depends on availability, why is it failing on macos-latest?

This is on the torch layer. The reason it is failing on macos-latest is that when torch is installed on macOS, the build ships with the mps backend (there is no cpu-only install), while the runner itself does not support MPS, as specified in the GitHub Actions docs: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners

Giving users control over whether to enable or disable mps is possible

Is this only possible with the patch, or is there a setting in one of the layers? If not, it sounds risky, as it is not using a public API.

Yeah, I agree with this; there is no such setting in any of the layers. Better would be to explain the possible challenges that may arise when using mps, including failures for functions that aren't yet implemented in torch for mps, and why using PYTORCH_ENABLE_MPS_FALLBACK=1 is advisable in such cases.

@fkiraly
Collaborator

fkiraly commented Sep 7, 2024

Thanks for your explanations, I'm still not getting a good grip of the situation yet, unfortunately.

May I kindly ask you to take some time to explain:

  • the precise condition on the VM or local machine that may or may not be satisfied here in terms of mps. What makes a machine or VM exactly "supporting mps", and how can I determine whether it does? Optimally, with a link to docs (see also the sketch after this list).
  • the precise condition that determines in torch whether mps is used or not, optimally with a link to docs.
    • whether a user can select this, e.g., to force use of cpu, on the torch layer, and if yes, how.
  • the precise mechanism or location by which this is determined in pytorch-forecasting
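
For the torch-layer part of the first two bullets, these are the checks torch itself exposes (a sketch: torch.backends.mps.is_built() reports whether the installed torch build was compiled with mps support, torch.backends.mps.is_available() whether mps can actually be used on the current machine, and older torch versions have no torch.backends.mps at all):

import torch

if hasattr(torch.backends, "mps"):
    print("mps compiled into this torch build:", torch.backends.mps.is_built())
    print("mps usable on this machine:", torch.backends.mps.is_available())
else:
    print("this torch version predates the mps backend")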

@XinyuWuu
Member

XinyuWuu commented Sep 8, 2024

I do not think a regular user will have the error.

It's caused by the lack of nested virtualization on the macOS runners. To my understanding, nested virtualization means having one layer of virtualization that provides virtual hardware and another layer of virtualization that provides hardware- and kernel-isolated software, such as a Hyper-V isolated container or VM. So if a user has a container on a local macOS machine, there should be no problem. If a user has a regular container with shared hardware and kernel inside a hypervisor-isolated VM, there should be no problem either. If a user has a hypervisor-isolated VM or container inside another hypervisor-isolated VM or container, there will be a problem. For GitHub runners, I guess Microsoft has a first layer of hypervisor to build the cloud (Linux KVM or Microsoft Hyper-V based VMs), and then another layer of hypervisor to provide hardware- and kernel-isolated macOS VMs.

Reference for the GitHub runner problem: #1596 (comment)

Some links about nested virtualization:
https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/user-guide/nested-virtualization
https://cloud.google.com/compute/docs/instances/nested-virtualization/overview

@fnhirwa
Member Author

fnhirwa commented Sep 9, 2024

Adding to what @XinyuWuu said above, there is a tracker issue for torch operations that aren't yet covered on the mps accelerator, pytorch/pytorch#77764, which forces users to use the fallback env variable when their models require ops that are not yet implemented for mps.

@fkiraly
Collaborator

fkiraly commented Sep 9, 2024

Adding to what @XinyuWuu said above, there is a tracker issue for torch operations that aren't yet covered on the mps accelerator, pytorch/pytorch#77764

The issue with the most participants I've seen so far.

@fkiraly
Collaborator

fkiraly commented Sep 9, 2024

I do not think a regular user will have the error.

If this is the case, then my questions are:

  • in what precise respect does the VM differ from the "regular user" condition
  • what is the "regular user condition", more generally
  • how and why can we be sure that the tests cover the regular user condition, post PR?

@fnhirwa
Member Author

fnhirwa commented Sep 10, 2024

I do not think a regular user will have the error.

If this is the case, then my questions are:

  • in what precise respect does the VM differ from the "regular user" condition

In general, the difference between the VM and a regular user's machine is that mps support isn't available on the VM, as mentioned in the limitations in the GitHub Actions docs here: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners.

This implies that anything running on these runners will run on cpu.

  • what is the "regular user condition", more generally

For general users, arm64 macOS has support for mps, and they can use both cpu and mps as accelerators.

  • how and why can we be sure that the tests cover the regular user condition, post PR?

I believe that post-PR the tests will be forced to run on cpu. We can document this as a limitation for arm64 macOS and state that we don't thoroughly test on mps; with the current setup we don't generally test on any accelerator other than cpu on the other runners either.

@fkiraly
Collaborator

fkiraly commented Sep 10, 2024

Oh, I see. May I repeat back to check whether I understood, it would be appreciated if you could check my understanding, @fnhirwa:

  • regular systems of Mac users will have "MPS" as their standard backend, used by default. They would have to manually deactivate or override it to use the cpu backend. They will also have no problems running torch on the default.
  • the GH VMs do not provide MPS support. Yet torch tries to use it by default on Mac systems and fails, as it is not available.
  • we can patch torch to force using cpu even in a case where it thinks mps should be present. That is what this PR does.
  • Therefore, we will be testing the condition where the OS is Mac, but no MPS backend is found available.

Is this correct?

If it is, I have further questions:

  • what exactly differs between macos-13 and macos-latest (= macos-14) runners here? Why does macos-13 not crash without the patch, while macos-14 does? Nothing in the runners page seems to indicate a clear difference to me.
  • is it correct that, whether we patch or not, and whether we use macos-13 or macos-14, we will always be testing on the cpu backend?

@XinyuWuu
Member

XinyuWuu commented Sep 11, 2024

  • what exactly differs between macos-13 and macos-latest (= macos-14) runners here? Why does macos-13 not crash without the patch, while macos-14 does? Nothing in the runners page seems to indicate a clear difference to me.

macos-13 runners run on Intel chips. torch._C._mps_is_available will return False on macos-13 even without the patch.

  • is it correct that, whether we patch or not, and whether we use macos-13 or macos-14, we will always be testing on the cpu backend?

Yes.

@fnhirwa
Member Author

fnhirwa commented Sep 11, 2024

Oh, I see. May I repeat back to check whether I understood, it would be appreciated if you could check my understanding, @fnhirwa:

  • regular systems of Mac users will have "MPS" as their standard backend, used by default. They would have to manually deactivate or override it to use the cpu backend. They will also have no problems running torch on the default.
  • the GH VMs do not provide MPS support. Yet torch tries to use it by default on Mac systems and fails, as it is not available.
  • we can patch torch to force using cpu even in a case where it thinks mps should be present. That is what this PR does.
  • Therefore, we will be testing the condition where the OS is Mac, but no MPS backend is found available.

Is this correct?

Yes, this is correct.

If it is, I have further questions:

  • what exactly differs between macos-13 and macos-latest (= macos-14) runners here? Why does macos-13 not crash without the patch, while macos-14 does? Nothing in the runners page seems to indicate a clear difference to me.
  • is it correct that, whether we patch or not, and whether we use macos-13 or macos-14, we will always be testing on the cpu backend?

As @XinyuWuu specified, we will always be testing on cpu.

            device = torch.device(device)
        else:
            device = torch.device("cpu")
    else:
Collaborator

is this perhaps too aggressive? Are we moving devices to cpu that could work?

Member Author

@fnhirwa fnhirwa Sep 11, 2024

Yeah, I agree we can simplify and remove the explicit setting to CPU, since it is already the default device. But we still need to force mps to fall back to cpu when the backend is not built, because torch requires the mps backend to be built before it can be used.

Collaborator

I mean, should we change the device only if (a) it is mps, and (b) mps is not available?

Member Author

should we add the try-except block at the end of the first conditional check?

Collaborator

I think the current branching makes sense - not sure if you changed it or if I now just understood it, but I think it is exhaustive and the step-out covers only the MPS-not-present case.

if device == "mps":
    if hasattr(torch.backends, device):
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            device = torch.device(device)
Collaborator

for clarity, I would just substitute "mps"
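
Taken together, the branching discussed in this thread is roughly the following (a sketch of the logic only, with an illustrative function name, not a verbatim excerpt of pytorch_forecasting/utils/_utils.py):

import torch


def _resolve_device(device):
    """Illustrative sketch of the device-resolution branching discussed above."""
    if device == "mps":
        if hasattr(torch.backends, "mps"):
            if torch.backends.mps.is_available() and torch.backends.mps.is_built():
                device = torch.device("mps")
            else:
                # mps requested but not usable on this machine
                device = torch.device("cpu")
        else:
            # this torch version predates the mps backend entirely
            device = torch.device("cpu")
    else:
        device = torch.device(device)
    return device

This also reflects the suggestion above to use the literal "mps" inside the branch rather than the device variable.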

fkiraly
fkiraly previously approved these changes Sep 11, 2024
Collaborator

@fkiraly fkiraly left a comment

I think I have understood now what conditions this covers (we test only cpu), and the logic of move_to_device makes sense

Labels
maintenance Continuous integration, unit testing & package distribution
Development

Successfully merging this pull request may close these issues.

[MNT] MPS backend test failures on MacOS
4 participants