[MNT] handle mps backend for lower versions of pytorch and fix mps failure on macOS-latest runner #1648
Conversation
Codecov Report
Attention: Patch coverage is
@@            Coverage Diff             @@
##           master    #1648      +/-   ##
==========================================
- Coverage   90.14%   90.03%   -0.12%
==========================================
  Files          32       32
  Lines        4780     4786       +6
==========================================
  Hits         4309     4309
- Misses        471      477       +6
ok, this seems to be working, but I do not understand entirely what you are doing here and why...
Can you kindly explain?
Great, I thought we would have to patch the binary file to overwrite
Can we put the patch in a global fixture? Then we don't have to request it manually for every test.

# put it in conftest.py
import pytest


@pytest.fixture(autouse=True)
def no_mps(monkeypatch):
    """Replace torch._C._mps_is_available for all tests."""
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)
I think it's replacing
well, the For
This is a great solution; making it a global fixture makes more sense.
Question: does the user on macOS have the option to force the backend at runtime? That is, do the tests faithfully reflect a situation that the user can establish on a Mac? Because if the user cannot set the flag or something equivalent to it, then our tests will pass, but the user cannot make use of the package. This might sound "trivial", but I just want to emphasize this again: each test should "certify" for a run case on the user's computer. More specific question: let's take an actual Mac similar to the setup of
If this is something to configure for the user, furthermore, we need to explain it in the documentation as a workaround.
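For reference, a hedged sketch of what "forcing the backend at runtime" could look like on the user's side, assuming training goes through the Lightning Trainer (the lightning.pytorch import path is an assumption; the accelerator strings are standard Lightning options):

# illustrative only: a macOS user can typically pick the accelerator explicitly
from lightning.pytorch import Trainer

trainer_cpu = Trainer(accelerator="cpu")             # force training on CPU
trainer_mps = Trainer(accelerator="mps", devices=1)  # request the Apple MPS backend (requires mps to be available and built)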
Running the code doesn't fail locally, as I am using
The tests are failing upstream due to the lack of nested virtualization and Metal Performance Shaders (MPS), as noted here: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners
I would say we can test again with the MPS accelerator once this is solved.
What I mean is: we are patching only the tests, so it looks to me like it would still fail for the user on such a system?
Yeah, we are only patching the tests; this code doesn't make any global changes. Any modifications last only for the lifetime of the test being run; once it is done, the system rolls back to the normal setup.
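For illustration, a minimal sketch of that scoping behaviour, assuming a torch version where torch.backends.mps.is_available delegates to torch._C._mps_is_available (the test name is hypothetical):

import torch

def test_mps_reported_unavailable(monkeypatch):
    # the private hook is patched only for the duration of this test
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)
    assert not torch.backends.mps.is_available()
    # when the test returns, pytest's monkeypatch undoes the setattr automatically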
May I return to my original question: could you kindly be precise about which usage condition we are now testing, and what user-side setup the following two correspond to:
On macOS:
What's the way a user would enable mps? Depending on whether this is possible or not, we should make sure we give clear pointers in the documentation, or set reasonable defaults, or raise informative exceptions.
A user currently is not disabling mps. Giving users the device control over whether to enable or disable it
Is this only possible with the patch, or is there not a setting in one of the layers? If not, it sounds risky, as it is not using a public API.
This is on the
Yeah, I agree with this; there is no setting in one of the layers for this. Better to explain the possible challenges that may arise when using mps.
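For the documentation, a hedged sketch of what such an explanation could show users (PYTORCH_ENABLE_MPS_FALLBACK is the commonly cited torch environment variable for falling back to CPU on MPS ops that are not implemented; exact behaviour depends on the torch version):

import os

# commonly suggested workaround: allow CPU fallback for ops not implemented on MPS
# (usually needs to be set before torch is imported; behaviour varies by torch version)
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch

# only request mps when it is both available and built; otherwise stay on cpu
if (
    hasattr(torch.backends, "mps")
    and torch.backends.mps.is_available()
    and torch.backends.mps.is_built()
):
    device = torch.device("mps")
else:
    device = torch.device("cpu")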
Thanks for your explanations. I'm still not getting a good grip on the situation yet, unfortunately. May I kindly ask you to take some time to explain:
I do not think a regular user will have the error. It's caused by the lack of nested virtualization on Apple macOS.

To my understanding, nested virtualization means you have one layer of virtualization to provide virtual hardware, and then another layer of virtualization to provide hardware- and kernel-isolated software, such as a Hyper-V isolated container or VM.

So if a user has a container on a local macOS machine, there should be no problem. If a user has a regular container with shared hardware and kernel inside a hypervisor-isolated VM, there should be no problem either. If a user has a hypervisor-isolated VM or container inside another hypervisor-isolated VM or container, there will be a problem.

For GitHub runners, I guess Microsoft has a first layer of hypervisor to build the cloud (Linux KVM or Microsoft Hyper-V based VMs), then another layer of hypervisor to provide hardware- and kernel-isolated macOS VMs.

Reference for the GitHub runner problem: #1596 (comment)

Some links about nested virtualization:
Adding to what @XinyuWuu said above, there is a tracker issue for torch layer operations that aren't yet covered on mps.
The issue with the most participants I've seen so far.
If this is the case, then my questions are:
In general, for the VM, the difference from a regular user is mps support, which isn't available on the VM, as mentioned in the limitations in the GitHub Actions docs here: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners. This implies that anything running on these runners will be run on cpu.
For general users, arm64 macOS has support for mps, and they can use both the cpu and mps backends.
I believe that post-PR the tests will be forced to run on cpu.
Oh, I see. May I repeat this back to check whether I understood it; it would be appreciated if you could check my understanding, @fnhirwa:
Is this correct? If it is, I have further questions:
Yes.
Yes, this is correct.
As @XinyuWuu specified, we will always be testing on cpu.
pytorch_forecasting/utils/_utils.py
Outdated
            device = torch.device(device)
        else:
            device = torch.device("cpu")
    else:
Is this perhaps too aggressive? Are we moving devices to cpu that could work?
Yeah, I agree we can simplify and remove the setting to cpu, as it is always the default device. But we should enforce the fallback from mps to cpu when mps is not built, as torch needs it to be built in the backend.
I mean, shouldn't we change the device only if (a) it is mps, and (b) mps is not available?
Should we add the try-except block at the end of the first conditional check?
I think the current branching makes sense; not sure if you changed it or if I now just understood it, but I think it is exhaustive and the step-out covers only the MPS-not-present case.
pytorch_forecasting/utils/_utils.py
Outdated
if device == "mps":
    if hasattr(torch.backends, device):
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            device = torch.device(device)
For clarity, I would just substitute "mps".
I think I have understood now what conditions this covers (we test only cpu), and the logic of move_to_device makes sense.
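To summarise the branching discussed above, a hedged sketch (the helper name is hypothetical and this is not the exact merged code):

import torch

def _resolve_device(device):
    """Fall back to cpu only when mps is requested but unavailable or not built."""
    if device == "mps":
        if (
            hasattr(torch.backends, "mps")
            and torch.backends.mps.is_available()
            and torch.backends.mps.is_built()
        ):
            return torch.device("mps")
        # mps requested but not usable on this system: step out to cpu
        return torch.device("cpu")
    return torch.device(device)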
Description
This PR handles the issue that may result when setting the device to mps if the torch version doesn't support the mps backend.
Depends on #1633
I used pytest.MonkeyPatch() to disable the discovery of the mps accelerator. The tests run on CPU for macOS-latest.
fixes #1596
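A hedged sketch of the pytest.MonkeyPatch() usage described above (the patch target mirrors the fixture discussed in the review thread; fixture name, test name, and placement in the test suite are assumptions, and the merged code may differ):

import pytest
import torch

@pytest.fixture(autouse=True)
def no_mps(monkeypatch):
    """Make torch report mps as unavailable so tests run on CPU on macOS runners."""
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)

def test_mps_disabled():
    # with the autouse fixture active, discovery of the mps accelerator is disabled
    assert not torch.backends.mps.is_available()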