[Core] ray.init() stuck at "Started a local Ray instance." #37373
Comments
cc @mattip can you follow up and triage? |
What CPU and how many cores does it have? How much memory? Does running in regular mode, without |
It is a server with an Intel Xeon Gold 6230 CPU @ 2.1 GHz (80 cores as seen in Task Manager, maybe showing more because of multithreading) with 384 GB of RAM. I tried doing a ray.init() and got the following
|
Maybe connected to the large number of cores taking too long to spin up all the processes on windows. In order to test that hypothesis, could you try |
Okay so I did a
|
@mattip Could you pls provide an update on this issue? Thanks. |
If ray 2.7 moves us to a world without python grpcio, we can revisit the issues that have
in the logs |
There are quite a few similar errors; I think they are all using a non-compliant version of grpcio |
@AvisP does this still happen on ray 2.8? |
@mattip For the Windows server with lots of cores I got this when I did a
2023-11-22 16:47:09,371 INFO worker.py:1673 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.11', ray_version='2.8.0', ray_commit='105355bd253d6538ed34d331f6a4bdf0e38ace3a', protocol_version=None)
>>> (raylet) [2023-11-22 16:47:12,284 E 68664 94064] (raylet.exe) agent_manager.cc:70: The raylet exited immediately because one Ray agent failed, agent_name = runtime_env_agent.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
(raylet) *** SIGTERM received at time=1700689632 ***
(raylet) @ 00007FF6A41785C6 (unknown) (unknown)
(raylet) @ 00007FF6A418EE86 (unknown) (unknown)
(raylet) @ 00007FF6A418E5BE (unknown) (unknown)
(raylet) @ 00007FF8C242268A (unknown) o_exp
(raylet) @ 00007FF8C2797AC4 (unknown) BaseThreadInitThunk
(raylet) @ 00007FF8C551A351 (unknown) RtlUserThreadStart
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: *** SIGTERM received at time=1700689632 ***
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF6A41785C6 (unknown) (unknown)
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF6A418EE86 (unknown) (unknown)
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF6A418E5BE (unknown) (unknown)
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF8C242268A (unknown) o_exp
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF8C2797AC4 (unknown) BaseThreadInitThunk
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF8C551A351 (unknown) RtlUserThreadStart
2023-11-22 16:47:26,277 WARNING worker.py:2074 -- The node with node id: 4fa93e7451f6a84361e9c2a042e02e8cfe57ae10e06bc9cf428a4ddf and address: 127.0.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
Then I interrupted using the keyboard and did a
2023-11-22 16:48:22,439 INFO worker.py:1673 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.11', ray_version='2.8.0', ray_commit='105355bd253d6538ed34d331f6a4bdf0e38ace3a', protocol_version=None)
Although it works fine on a Windows laptop with a smaller number of cores. For python |
I am intrigued by the |
I think I just did |
Any update of how you installed the software? |
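For anyone answering the installation question above, a minimal sketch of collecting the relevant details in one go may help. It uses only the Python standard library and also covers the `grpcio` version check that the raylet log suggests doing with `pip freeze | grep grpcio` (`grep` is usually not available in a stock Windows command prompt). The helper names `installed_version` and `environment_report` are hypothetical, not part of Ray.

```python
import platform
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("ray", "grpcio")):
    """Collect interpreter and package details worth pasting into a bug report."""
    report = {
        "python": platform.python_version(),
        "executable": sys.executable,  # shows which install/venv is in use
        "platform": platform.platform(),
    }
    for name in packages:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

The `executable` path shows which interpreter or virtual environment is actually running, which is often the quickest way to tell how Python was installed.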
Sorry for late response. So I created a new virtual environment with python
I exited the python environment and tried it again and it seemed to work with the following message
However if I do only
|
Where did you get python, |
I downloaded the installer for |
No, that build should work well, but the first time you run it you will have to allow firewall exceptions via a GUI pop-up box. Did you see that? |
The last log snippet on your last attempt looks like it is all working: |
I got a similar problem:
2 cores works
This hangs forever. |
@albert-ying for completeness: what machine are you running this on, how many CPUs does it have? |
@mattip Arch Linux x86_64, 112 cores |
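The pattern reported above (a small core count works, the default hangs on a many-core machine) can be narrowed down with a rough sketch like the following. It assumes the hang depends on the `num_cpus` value passed to `ray.init()`, which this thread suggests but does not confirm. Each attempt runs in a child process with a timeout so that a hang does not block the caller. The helper names and the 120-second timeout are hypothetical choices, not part of Ray.

```python
import os
import subprocess
import sys


def candidate_cpu_counts(max_cpus):
    """Return 1, 2, 4, ... below max_cpus, always ending with max_cpus itself."""
    counts = []
    n = 1
    while n < max_cpus:
        counts.append(n)
        n *= 2
    counts.append(max_cpus)
    return counts


def init_succeeds(num_cpus, timeout_s=120):
    """Run ray.init(num_cpus=...) in a child process; False on hang or error."""
    code = f"import ray; ray.init(num_cpus={int(num_cpus)}); ray.shutdown()"
    try:
        result = subprocess.run([sys.executable, "-c", code], timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child did not finish in time; treat this as the reported hang.
        return False
    return result.returncode == 0


def first_failing_cpu_count(max_cpus=None):
    """Return the first tried num_cpus that fails or hangs, or None if all pass."""
    max_cpus = max_cpus or os.cpu_count() or 1
    for n in candidate_cpu_counts(max_cpus):
        if not init_succeeds(n):
            return n
    return None
```

Calling `first_failing_cpu_count()` on the affected machine gives a concrete number to include in a report. Note that a timed-out child may leave Ray helper processes behind, so it is worth checking for leftover processes between runs.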
I have a similar problem, and this is useful in my issue. But I think it's not friendly enough that it only shows "INFO worker.py:1794 -- Started a local Ray instance" when |
Either |
Closing this. Please reopen or open a new issue if the problem reappears |
I also encountered this hang issue. I tested @albert-ying's method in the Python interpreter shell; when increasing num_cpus, I don't get any error. Does |
@cometta This issue is closed, please open a new one with
|
What happened + What you expected to happen
When I am trying to do a
ray.init(logging_level='debug')
on a Windows server, it gets stuck at "Started a local Ray instance."
Versions / Dependencies
Ray version[RLLIB]:
2.5.1
OS:
Windows 10
Python:
3.8.5
Reproduction script
I am just doing
import ray
ray.init(logging_level='debug')
Debug message I am getting is
Issue Severity
High: It blocks me from completing my task.