[MNT] handle mps backend for lower versions of pytorch and fix mps failure on macOS-latest runner #1648
Conversation
Codecov Report
Attention: Patch coverage is
@@            Coverage Diff             @@
##           master    #1648      +/-   ##
==========================================
- Coverage   90.14%   90.03%   -0.12%
==========================================
  Files          32       32
  Lines        4780     4786       +6
==========================================
  Hits         4309     4309
- Misses        471      477       +6
ok, this seems to be working, but I do not understand entirely what you are doing here and why...
Can you kindly explain?
Great, I thought we would have to patch the binary file to overwrite
Can we put the patch in a global fixture? Then we don't have to request it manually for every test.

# put it in conftest.py
import pytest


@pytest.fixture(autouse=True)
def no_mps(monkeypatch):
    """Replace torch._C._mps_is_available for all tests."""
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)
I think it's replacing
well, the For
This is a great solution; making it a global fixture makes more sense.
Question: does the user on macOS have the option to force the backend at runtime? That is, do the tests faithfully reflect a situation that the user can establish on a Mac? Because if the user cannot set the flag or something equivalent to it, then our tests will pass, but the user cannot make use of the package. This might sound "trivial", but I just want to emphasize this again: each test should "certify" for a run case on the user's computer. More specific question: let's take an actual Mac similar to the setup of
If this is something to configure for the user, furthermore, we need to explain it in the documentation as a workaround.
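For reference, a hedged sketch of what "forcing the backend at runtime" could look like on the user's side, assuming training goes through the Lightning Trainer (the lightning.pytorch import path is an assumption; the accelerator strings are standard Lightning options):

# illustrative only: a macOS user can typically pick the accelerator explicitly
from lightning.pytorch import Trainer

trainer_cpu = Trainer(accelerator="cpu")             # force training on CPU
trainer_mps = Trainer(accelerator="mps", devices=1)  # request the Apple MPS backend (requires mps to be available and built)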
Running the code doesn't fail locally, as I am using
The tests are failing upstream due to the lack of nested virtualization and Metal Performance Shaders (MPS), as noted here: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners
I would say we can test again with the MPS accelerator once this is solved.
What I mean is: we are patching only the tests, so it looks to me like it would still fail for the user on such a system?
Yeah, we are only patching the tests; this code doesn't make any global changes. Any modifications last only for the lifetime of the test being run; once it is done, the system rolls back to the normal setup.
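For illustration, a minimal sketch of that scoping behaviour, assuming a torch version where torch.backends.mps.is_available delegates to torch._C._mps_is_available (the test name is hypothetical):

import torch

def test_mps_reported_unavailable(monkeypatch):
    # the private hook is patched only for the duration of this test
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)
    assert not torch.backends.mps.is_available()
    # when the test returns, pytest's monkeypatch undoes the setattr automatically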
May I return to my original question: could you kindly be precise about which usage condition we are now testing, and what user-side setup the following two correspond to:
On macOS:
What's the way a user would enable mps? Depending on whether this is possible or not, we should make sure we give clear pointers in the documentation, or set reasonable defaults, or raise informative exceptions.
A user currently is not disabling mps. Giving users the device control over whether to enable or disable it
Is this only possible with the patch, or is there not a setting in one of the layers? If not, it sounds risky, as it is not using a public API.
This is on the
Yeah, I agree with this; there is no setting in one of the layers for this. Better to explain the possible challenges that may arise when using mps.
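For the documentation, a hedged sketch of what such an explanation could show users (PYTORCH_ENABLE_MPS_FALLBACK is the commonly cited torch environment variable for falling back to CPU on MPS ops that are not implemented; exact behaviour depends on the torch version):

import os

# commonly suggested workaround: allow CPU fallback for ops not implemented on MPS
# (usually needs to be set before torch is imported; behaviour varies by torch version)
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch

# only request mps when it is both available and built; otherwise stay on cpu
if (
    hasattr(torch.backends, "mps")
    and torch.backends.mps.is_available()
    and torch.backends.mps.is_built()
):
    device = torch.device("mps")
else:
    device = torch.device("cpu")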
Thanks for your explanations. I'm still not getting a good grip on the situation yet, unfortunately. May I kindly ask you to take some time to explain:
I do not think a regular user will have the error. It's caused by the lack of nested virtualization on Apple macOS.

To my understanding, nested virtualization means you have one layer of virtualization to provide virtual hardware, and then another layer of virtualization to provide hardware- and kernel-isolated software, such as a Hyper-V isolated container or VM.

So if a user has a container on a local macOS machine, there should be no problem. If a user has a regular container with shared hardware and kernel inside a hypervisor-isolated VM, there should be no problem either. If a user has a hypervisor-isolated VM or container inside another hypervisor-isolated VM or container, there will be a problem.

For GitHub runners, I guess Microsoft has a first layer of hypervisor to build the cloud (Linux KVM or Microsoft Hyper-V based VMs), then another layer of hypervisor to provide hardware- and kernel-isolated macOS VMs.

Reference for the GitHub runner problem: #1596 (comment)

Some links about nested virtualization:
Adding to what @XinyuWuu said above, there is a tracker issue for torch layer operations that aren't yet covered on mps.
The issue with the most participants I've seen so far.
If this is the case, then my questions are:
In general, for the VM, the difference from a regular user is mps support, which isn't available on the VM, as mentioned in the limitations in the GitHub Actions docs here: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners. This implies that anything running on these runners will be run on cpu.
For general users, arm64 macOS has support for mps, and they can use both the cpu and mps backends.
I believe that post-PR the tests will be forced to run on cpu.
Oh, I see. May I repeat this back to check whether I understood it; it would be appreciated if you could check my understanding, @fnhirwa:
Is this correct? If it is, I have further questions:
Yes.
Yes, this is correct.
As @XinyuWuu specified, we will always be testing on cpu.
pytorch_forecasting/utils/_utils.py
Outdated
            device = torch.device(device)
        else:
            device = torch.device("cpu")
    else:
Is this perhaps too aggressive? Are we moving devices to cpu that could work?
Yeah, I agree we can simplify and remove the setting to cpu, as it is always the default device. But we should enforce the fallback from mps to cpu when mps is not built, as torch needs it to be built in the backend.
I mean, shouldn't we change the device only if (a) it is mps, and (b) mps is not available?
Should we add the try-except block at the end of the first conditional check?
I think the current branching makes sense; not sure if you changed it or if I now just understood it, but I think it is exhaustive and the step-out covers only the MPS-not-present case.
pytorch_forecasting/utils/_utils.py
Outdated
if device == "mps":
    if hasattr(torch.backends, device):
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            device = torch.device(device)
For clarity, I would just substitute "mps".
I think I have understood now what conditions this covers (we test only cpu), and the logic of move_to_device makes sense.
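To summarise the branching discussed above, a hedged sketch (the helper name is hypothetical and this is not the exact merged code):

import torch

def _resolve_device(device):
    """Fall back to cpu only when mps is requested but unavailable or not built."""
    if device == "mps":
        if (
            hasattr(torch.backends, "mps")
            and torch.backends.mps.is_available()
            and torch.backends.mps.is_built()
        ):
            return torch.device("mps")
        # mps requested but not usable on this system: step out to cpu
        return torch.device("cpu")
    return torch.device(device)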
Description
This PR handles the issue that may result when setting the device to mps if the torch version doesn't support the mps backend.
Depends on #1633
I used pytest.MonkeyPatch() to disable the discovery of the mps accelerator. The tests run on CPU for macOS-latest.
fixes #1596
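A hedged sketch of the pytest.MonkeyPatch() usage described above (the patch target mirrors the fixture discussed in the review thread; fixture name, test name, and placement in the test suite are assumptions, and the merged code may differ):

import pytest
import torch

@pytest.fixture(autouse=True)
def no_mps(monkeypatch):
    """Make torch report mps as unavailable so tests run on CPU on macOS runners."""
    monkeypatch.setattr("torch._C._mps_is_available", lambda: False)

def test_mps_disabled():
    # with the autouse fixture active, discovery of the mps accelerator is disabled
    assert not torch.backends.mps.is_available()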