-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#4450: Ensure Modin successfully initializes when Ray cluster has no resources #4451
base: master
Are you sure you want to change the base?
Conversation
…y cluster has no resources. Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
While this patch does fix the issue, I think it would be helpful to have a discussion around how we want Modin to behave in such scenarios (i.e. whether it should just use some number, say 16, for the number of partitions until the user specifies how many partitions they want, whether it should ask the user to set the number of partitions manually, and not set num partitions, or whether it should launch ray tasks to cause the autoscaler to spin up workers, and then get the number of CPUs). |
This pull request has been mentioned on Modin Discuss. There might be relevant details there: |
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Codecov Report
@@ Coverage Diff @@
## master #4451 +/- ##
==========================================
+ Coverage 86.65% 88.04% +1.38%
==========================================
Files 226 226
Lines 18311 18311
==========================================
+ Hits 15868 16122 +254
+ Misses 2443 2189 -254
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
+ " 4. To update, run the following python code:\n\tfrom modin.config import " | ||
+ "NPartitions\n\tNPartitions.put(desired_num_cpus)" | ||
) | ||
num_cpus = 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add an inline comment why we set num_cpus=4
by default. I remember there was a question why we did that and whether we should have changed it. Also, it would be good to reflect this in the docs somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, is such kind of a scenario applicable to Dask?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 we need documentation.
I think this might also be applicable to Dask, but we'd need to test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to add a comment here! Before that: Is 4 a good heuristic? I picked it bc most personal laptops have 4 cores/it seemed like a good number, but I'd prefer to have a better reason behind how we treat this case, so I'd appreciate y'all's insight into how we can better handle this case!
@@ -217,7 +217,17 @@ def initialize_ray( | |||
_move_stdlib_ahead_of_site_packages | |||
) | |||
ray.worker.global_worker.run_function_on_all_workers(_import_pandas) | |||
num_cpus = int(ray.cluster_resources()["CPU"]) | |||
num_cpus = ray.cluster_resources().get("CPU", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_cpus = ray.cluster_resources().get("CPU", None) | |
num_cpus = ray.cluster_resources().get("CPU", None) or 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I did it like this is because I wanted to warn users that we were unable to determine how many CPU
s there were and were relying on an internal heuristic - should we keep it this way, or change it to the more streamlined version you propose?
+ " 4. To update, run the following python code:\n\tfrom modin.config import " | ||
+ "NPartitions\n\tNPartitions.put(desired_num_cpus)" | ||
) | ||
num_cpus = 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_cpus = 4 |
+ "NPartitions\n\tNPartitions.put(desired_num_cpus)" | ||
) | ||
num_cpus = 4 | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else: |
) | ||
num_cpus = 4 | ||
else: | ||
num_cpus = int(num_cpus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_cpus = int(num_cpus) | |
num_cpus = int(num_cpus) |
+ " 4. To update, run the following python code:\n\tfrom modin.config import " | ||
+ "NPartitions\n\tNPartitions.put(desired_num_cpus)" | ||
) | ||
num_cpus = 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 we need documentation.
I think this might also be applicable to Dask, but we'd need to test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure setting the value to 4 is actually right here either. Should we throw an exception asking the user to specify the NPartitions
value?
Signed-off-by: Rehan Durrani <rehan@ponder.io>
I double checked dask, and it uses |
Signed-off-by: Rehan Durrani <rehan@ponder.io>
…into rehan/scale_from_0 Signed-off-by: Rehan Durrani <rehan@ponder.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
warnings.warn( | ||
"The current Ray cluster does not have any CPU Resources.\nModin uses the number of " | ||
+ "CPUs to determine how many partitions to create.\nNumber of partitions defaulting to" | ||
+ " number of CPUs on head node. To update, run the following python code:\n\tfrom " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you instead instantiate num_cpus
above this warning, and add the actual value of num_cpus
to the warning message? such as, with an f-string.
Is there another (more complicated but better) solution, which is to wait until ray tasks launch, then get the number of CPUs, update the number of partitions, and propagate the change? I think you mentioned in the Discuss forum some challenges with this approach since we don't currently repartition when In the nearer term, I think the current approach of showing a warning and defaulting to 4 (or some other number) is best for this edge case. |
Another option, in contrast to surfacing a warning and setting a default, is to raise an Error with an error message that gives instructions to bypass the error with a config variable to ignore the error, or to set the default The reason this might be better is that users in batch mode won't see a warning, but they will see an error (because the script will fail), and we don't want users to have mysterious behavior with Modin without knowing about it. |
Signed-off-by: Rehan Durrani rehan@ponder.io
What do these changes do?
This patch ensures that Modin can initialize successfully even when the ray cluster has no resources (because it is autoscaling from 0).
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date