[Core] ray.init() stuck at "Started a local Ray instance." #37373

Closed
AvisP opened this issue Jul 12, 2023 · 27 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
@external-author-action-required: Alternate tag for PRs where the author doesn't have labeling permission.
P2: Important issue, but not time-critical
QS: Quantsight triage label
windows

Comments

AvisP commented Jul 12, 2023

What happened + What you expected to happen

When I try to do a ray.init(logging_level='debug') on a Windows server, it gets stuck at "Started a local Ray instance."

Versions / Dependencies

Ray version [RLlib]: 2.5.1
OS: Windows 10
Python: 3.8.5

Reproduction script

I am just doing:

import ray
ray.init(logging_level='debug')

The debug messages I am getting are:

2023-07-12 17:07:30,736 DEBUG worker.py:1412 -- Could not import resource module (on Windows)
2023-07-12 17:07:37,350 DEBUG node.py:1149 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\3\ray\session_2023-07-12_17-07-37_304650_56384\logs.
2023-07-12 17:07:37,473 DEBUG node.py:605 -- Failed to send request to gcs, reconnecting. Error failed to connect to all addresses
2023-07-12 17:07:41,530 DEBUG node.py:1187 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\3\ray\session_2023-07-12_17-07-37_304650_56384\logs.
2023-07-12 17:07:41,820 DEBUG services.py:1894 -- Determine to start the Plasma object store with 72.38 GB memory using C:\Users\paula\AppData\Local\Temp\3.
2023-07-12 17:07:42,004 INFO worker.py:1636 -- Started a local Ray instance.

Issue Severity

High: It blocks me from completing my task.

@AvisP AvisP added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 12, 2023
@ArturNiederfahrenhorst ArturNiederfahrenhorst changed the title [RLlib] ray.init() stuck at "Started a local Ray instance." [Core] ray.init() stuck at "Started a local Ray instance." Jul 21, 2023
@ArturNiederfahrenhorst ArturNiederfahrenhorst added the core Issues that should be addressed in Ray Core label Jul 21, 2023
rkooo567 (Contributor) commented Jul 24, 2023

cc @mattip can you follow up and triage?

@rkooo567 rkooo567 added P2 Important issue, but not time-critical triage Needs triage (eg: priority, bug/not-bug, and owning component) and removed core Issues that should be addressed in Ray Core triage Needs triage (eg: priority, bug/not-bug, and owning component) P2 Important issue, but not time-critical labels Jul 24, 2023
mattip (Contributor) commented Jul 25, 2023

What CPU and how many cores does it have? How much memory? Does running in regular mode, without debug, work?

AvisP (Author) commented Jul 25, 2023

It is a server with an Intel Xeon Gold 6230 CPU @ 2.1 GHz (80 cores as seen in Task Manager; maybe it is showing more because of multithreading) with 384 GB of RAM. I tried doing a ray.init() and got the following:

Traceback (most recent call last):
  File "F:\........\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\node.py", line 293, in __init__
    ray._private.services.wait_for_node(
  File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\services.py", line 459, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 60 seconds while waiting for node to startup. Did not find socket name tcp://127.0.0.1:58449 in the list of object store socket names.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\worker.py", line 1534, in init
    _global_node = ray._private.node.Node(
  File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\node.py", line 298, in __init__
    raise Exception(
Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.

mattip (Contributor) commented Jul 26, 2023

80 cores as seen on Task manager

Maybe connected to the large number of cores taking too long to spin up all the processes on windows. In order to test that hypothesis, could you try ray.init(num_cpus=N) and figure out how large N can be before the function fails? I do not have access to such a large machine.
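A minimal sketch of that probe (an editor's illustration, not from the thread): it walks through increasing num_cpus values and reports the first failure. The candidate values are arbitrary, and on the affected machines ray.init() may hang rather than raise, so trying each value interactively is safer.

import ray

# Hypothetical probe: try progressively larger num_cpus values and report the
# first one that fails to start. Note that on the affected machines ray.init()
# may hang instead of raising, so run each value interactively if it stalls.
for n in (2, 4, 8, 16, 32, 64, 80):
    try:
        ray.init(num_cpus=n)
        print(f"num_cpus={n}: started OK")
    except Exception as exc:
        print(f"num_cpus={n}: failed with {exc!r}")
        break
    finally:
        ray.shutdown()  # safe no-op if init never completed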

AvisP (Author) commented Jul 26, 2023

Okay, so I did a ray.init(num_cpus=2), got the following message, and the terminal froze. The grpcio version I have is 1.56.0.

2023-07-26 10:48:42,392 INFO worker.py:1636 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.8.5', ray_version='2.5.1', ray_commit='a03efd9931128d387649dd48b0e4864b43d3bfb4', address_info={'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': None, 'object_store_address': 'tcp://127.0.0.1:65389', 'raylet_socket_name': 'tcp://127.0.0.1:65442', 'webui_url': '', 'session_dir': 'C:\\Users\\......\\AppData\\Local\\Temp\\6\\ray\\session_2023-07-26_10-48-36_478614_41080', 'metrics_export_port': 65526, 'gcs_address': '127.0.0.1:65386', 'address': '127.0.0.1:65386', 'dashboard_agent_listen_port': 52365, 'node_id': 'd1b0c0601ebf26445176b6bb12a84bf4bc1a49f1c3414177d2b0e103'})
>>> (raylet) [2023-07-26 10:48:45,736 E 51140 31988] (raylet.exe) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed.
The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
(raylet) *** SIGTERM received at time=1690382925 ***
(raylet)     @   00007FF61E2B0F6C  (unknown)  (unknown)
(raylet)     @   00007FF61E2B03DE  (unknown)  (unknown)
(raylet)     @   00007FF8E27B268A  (unknown)  o_exp
(raylet)     @   00007FF8E5FA7AC4  (unknown)  BaseThreadInitThunk
(raylet)     @   00007FF8E627A351  (unknown)  RtlUserThreadStart
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361: *** SIGTERM received at time=1690382925 ***
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361:     @   00007FF61E2B0F6C  (unknown)  (unknown)
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361:     @   00007FF61E2B03DE  (unknown)  (unknown)
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361:     @   00007FF8E27B268A  (unknown)  o_exp
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361:     @   00007FF8E5FA7AC4  (unknown)  BaseThreadInitThunk
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361:     @   00007FF8E627A351  (unknown)  RtlUserThreadStart
2023-07-26 10:48:59,384 WARNING worker.py:2019 -- The node with node id: d1b0c0601ebf26445176b6bb12a84bf4bc1a49f1c3414177d2b0e103 and address: 127.0.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a  (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
        (2) raylet has lagging heartbeats due to slow network or busy workload.

@jjyao jjyao added core Issues that should be addressed in Ray Core and removed core Issues that should be addressed in Ray Core labels Aug 7, 2023
@sven1977 sven1977 added the core Issues that should be addressed in Ray Core label Aug 9, 2023
xieus commented Aug 9, 2023

@mattip Could you please provide an update on this issue? Thanks.

mattip (Contributor) commented Aug 9, 2023

If Ray 2.7 moves us to a world without Python grpcio, we can revisit the issues that have "The version of grpcio doesn't follow Ray's requirement" in the logs.

mattip (Contributor) commented Aug 9, 2023

There are quite a few similar errors; I think they are all using a non-compliant version of grpcio.

@jjyao jjyao added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 14, 2023
@anyscalesam anyscalesam added the QS Quantsight triage label label Oct 25, 2023
mattip (Contributor) commented Nov 7, 2023

@AvisP does this still happen on ray 2.8?

@peytondmurray peytondmurray added the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Nov 7, 2023
AvisP (Author) commented Nov 22, 2023

@mattip For the Windows server with lots of cores, I got this when I did a ray.init(num_cpus=2) on Python 3.10.11:

2023-11-22 16:47:09,371 INFO worker.py:1673 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.11', ray_version='2.8.0', ray_commit='105355bd253d6538ed34d331f6a4bdf0e38ace3a', protocol_version=None)
>>> (raylet) [2023-11-22 16:47:12,284 E 68664 94064] (raylet.exe) agent_manager.cc:70: The raylet exited immediately because one Ray agent failed, agent_name = runtime_env_agent.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
(raylet) *** SIGTERM received at time=1700689632 ***
(raylet)     @   00007FF6A41785C6  (unknown)  (unknown)
(raylet)     @   00007FF6A418EE86  (unknown)  (unknown)
(raylet)     @   00007FF6A418E5BE  (unknown)  (unknown)
(raylet)     @   00007FF8C242268A  (unknown)  o_exp
(raylet)     @   00007FF8C2797AC4  (unknown)  BaseThreadInitThunk
(raylet)     @   00007FF8C551A351  (unknown)  RtlUserThreadStart
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: *** SIGTERM received at time=1700689632 ***
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF6A41785C6  (unknown)  (unknown)
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF6A418EE86  (unknown)  (unknown)
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF6A418E5BE  (unknown)  (unknown)
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF8C242268A  (unknown)  o_exp
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF8C2797AC4  (unknown)  BaseThreadInitThunk
(raylet) [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361:     @   00007FF8C551A351  (unknown)  RtlUserThreadStart
2023-11-22 16:47:26,277 WARNING worker.py:2074 -- The node with node id: 4fa93e7451f6a84361e9c2a042e02e8cfe57ae10e06bc9cf428a4ddf and address: 127.0.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a        (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
        (2) raylet has lagging heartbeats due to slow network or busy workload.

Then I interrupted with the keyboard, did a ray.shutdown(), tried initializing again, and got:

2023-11-22 16:48:22,439 INFO worker.py:1673 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.11', ray_version='2.8.0', ray_commit='105355bd253d6538ed34d331f6a4bdf0e38ace3a', protocol_version=None)

It works fine, though, on a Windows laptop with a smaller number of cores.

For Python 3.11 I can't even install ray in a Windows-based environment due to these issues: #39727 #38300

@anyscalesam anyscalesam self-assigned this Dec 6, 2023
mattip (Contributor) commented Dec 20, 2023

I am intrigued by the "The version of grpcio doesn't follow Ray's requirement" line in the log. How exactly are you installing the software stack? What is pulling in a non-compliant version of grpcio? Perhaps you are seeing a stall due to some other package, like NumPy, that tries to open worker threads on every core?
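One quick, illustrative check (not part of the thread) for what ended up in the environment is to print the installed grpcio version and compare it against the range pinned in the requirements of the installed Ray release:

import sys
from importlib.metadata import PackageNotFoundError, version

# Report the grpcio version visible to the current interpreter; compare it
# against the version range Ray pins for the release you installed.
try:
    print("grpcio", version("grpcio"), "in", sys.prefix)
except PackageNotFoundError:
    print("grpcio is not installed in this environment")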

@anyscalesam anyscalesam assigned AvisP and unassigned mattip, rynewang and anyscalesam Jan 3, 2024
anyscalesam (Contributor) commented
@AvisP can you provide more details in response to @mattip's question above?

AvisP (Author) commented Jan 9, 2024

I think I just did pip install ray. I wanted to confirm by testing on the multi-core Windows server, but I can only get access at the end of next week; I will provide more info then on what I see during installation.

mattip (Contributor) commented Feb 15, 2024

Any update on how you installed the software?

AvisP (Author) commented Feb 21, 2024

Sorry for the late response. I created a new virtual environment with Python 3.10.11 and did a pip install ray, which installed version 2.9.2. Then I did an import ray followed by ray.init(num_cpus=4), and it gave the following error message:

>>> ray.init(num_cpus=4)
Traceback (most recent call last):
  File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\node.py", line 312, in __init__
    ray._private.services.wait_for_node(
  File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\services.py", line 463, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 60 seconds while waiting for node to startup. Did not find socket name tcp://127.0.0.1:63251 in the list of object store socket names.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\worker.py", line 1618, in init
    _global_node = ray._private.node.Node(
  File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\node.py", line 317, in __init__
    raise Exception(
Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.

I exited the Python environment, tried it again, and it seemed to work, with the following message:

2024-02-21 15:29:34,099 INFO worker.py:1724 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.11', ray_version='2.9.2', ray_commit='fce7a361807580953364e2da964f9498f3123bf9', protocol_version=None)

However, a plain ray.init() was left at 2024-02-21 15:33:16,903 INFO worker.py:1724 -- Started a local Ray instance. indefinitely. So I did a ray.init(logging_level='debug') and this is what I got:

2024-02-21 15:33:11,105 DEBUG worker.py:1483 -- Could not import resource module (on Windows)
2024-02-21 15:33:13,029 DEBUG node.py:1303 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\ray\session_2024-02-21_15-33-12_966592_144860\logs.
2024-02-21 15:33:16,404 DEBUG node.py:1332 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\ray\session_2024-02-21_15-33-12_966592_144860\logs.
2024-02-21 15:33:16,410 DEBUG npu.py:58 -- Could not import AscendCL: No module named 'acl'
2024-02-21 15:33:16,411 DEBUG tpu.py:119 -- Failed to detect number of TPUs: [WinError 3] The system cannot find the path specified: '/dev/vfio'
2024-02-21 15:33:16,624 DEBUG services.py:2064 -- Determine to start the Plasma object store with 92.53 GB memory using C:\Users\paula\AppData\Local\Temp.
2024-02-21 15:33:16,903 INFO worker.py:1724 -- Started a local Ray instance.

mattip (Contributor) commented Mar 6, 2024

Where did you get Python, and how are you installing ray? We have seen some problems with the Python from the Windows app store.

AvisP (Author) commented Mar 13, 2024

I downloaded the installer for 3.10.11 from here. If there is another version that you know has no issues, I can try that.

mattip (Contributor) commented Mar 17, 2024

No, that build should work well, but the first time you run it you will have to allow firewall exceptions via a GUI pop-up box. Did you see that?

mattip (Contributor) commented Mar 20, 2024

The last log snippet from your latest attempt looks like it is all working: "Started a local Ray instance." Could you summarize for me where this all stands? I am confused as to what works and what doesn't.

albert-ying commented
I ran into a similar problem:

Python 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.24.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import ray

In [2]: ray.init(num_cpus=2, logging_level="debug")
2024-05-15 16:15:24,958 DEBUG node.py:293 -- Setting node ID to 2e3620292cacbc0db97e26bbb3f3fbd3e3733955eb3b7f3aa12b71fb
2024-05-15 16:15:24,959 DEBUG node.py:1344 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-24_957954_661582/logs.
2024-05-15 16:15:26,363 DEBUG node.py:1373 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-24_957954_661582/logs.
2024-05-15 16:15:29,428 DEBUG npu.py:60 -- Could not import AscendCL: No module named 'acl'
2024-05-15 16:15:29,430 DEBUG services.py:2077 -- Determine to start the Plasma object store with 78.2 GB memory using /dev/shm.
2024-05-15 16:15:30,588 INFO worker.py:1749 -- Started a local Ray instance.
Out[2]: RayContext(dashboard_url='', python_version='3.11.9', ray_version='2.22.0', ray_commit='a8ab7b87a59b833242fe3564c61d7b39265cd838')

In [3]: ray.shutdown()

2 cores works

In [4]: ray.init(num_cpus=5, logging_level="debug")
2024-05-15 16:15:50,060 DEBUG node.py:293 -- Setting node ID to 97a9a7a0d6b57dbb9c182dd042fa56b7aae8c9f04dd61fa1ac1b0a7c
2024-05-15 16:15:50,061 DEBUG node.py:1344 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-50_060016_661582/logs.
2024-05-15 16:15:51,436 DEBUG node.py:1373 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-50_060016_661582/logs.
2024-05-15 16:15:54,531 DEBUG npu.py:60 -- Could not import AscendCL: No module named 'acl'
2024-05-15 16:15:54,532 DEBUG services.py:2077 -- Determine to start the Plasma object store with 78.21 GB memory using /dev/shm.
2024-05-15 16:15:55,588 INFO worker.py:1749 -- Started a local Ray instance.

This hangs forever.

mattip (Contributor) commented May 16, 2024

@albert-ying for completeness: what machine are you running this on, and how many CPUs does it have?

albert-ying commented
@mattip Arch Linux x86_64, 112 cores

zPz-002 commented Jun 19, 2024

80 cores as seen on Task manager

Maybe connected to the large number of cores taking too long to spin up all the processes on windows. In order to test that hypothesis, could you try ray.init(num_cpus=N) and figure out how large N can be before the function fails? I do not have access to such a large machine.

I have a similar problem, and this suggestion is useful in my case. But I think it's not friendly that ray.init() only shows "INFO worker.py:1794 -- Started a local Ray instance" while it is executing. How can I get more information to make sure the program is running normally?

mattip (Contributor) commented Jun 19, 2024

How can I get more information

Either run ray status or open the dashboard. But note #45578, which is meant to fix the dashboard reporting on Windows.
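Beyond ray status and the dashboard, a small programmatic check (an editor's sketch using public Ray APIs, not something from this thread) can confirm that the instance behind "Started a local Ray instance." is actually serving requests:

import ray

ray.init(num_cpus=2)
# cluster_resources() returns the resources the raylet has registered, and each
# entry from ray.nodes() should report Alive=True on a healthy local instance.
print(ray.cluster_resources())
print([node["Alive"] for node in ray.nodes()])
ray.shutdown()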

mattip (Contributor) commented Jul 10, 2024

Closing this. Please reopen or open a new issue if the problem reappears.

@mattip mattip closed this as completed Jul 10, 2024
cometta commented Jul 17, 2024

I also encountered this hang issue. I tested @albert-ying's method in the Python interpreter shell by increasing num_cpus, and I don't get any error. Does "Started a local Ray instance" need to download anything from the internet? Could this be related to a blocked URL?

mattip (Contributor) commented Jul 17, 2024

@cometta This issue is closed; please open a new one with:

  • details about how you installed ray and which version
  • version of python and platform
  • copies of the relevant logs.
