Version
24.10
Which installation method(s) does this occur on?
Conda
Describe the bug.
A handful of the MG Link Prediction tests are hanging and timing out on single-node 2-GPU and 8-GPU configurations in our nightlies.
The tests are:
test_cosine_mg
test_jaccard_mg
test_overlap_mg
test_sorensen_mg
The first few tests pass, but the remaining tests in the file then hang until the job is killed by the timeout.
Minimum reproducible example
pytest -v --import-mode=append test_jaccard_mg.py
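For reference, a rough standalone sketch of what the hanging tests exercise, outside of pytest, is below. This is not taken from the test code: the cugraph.dask jaccard entry point, the column names, and the karate.csv layout are assumptions on my part, and the cluster setup/teardown only mirrors what the stop_dask_client() call in the traceback implies.

# Hypothetical repro sketch; entry points and CSV layout are assumptions, not
# copied from test_jaccard_mg.py.
import cugraph
import cugraph.dask as dask_cugraph
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cugraph.dask.comms import comms as Comms

if __name__ == "__main__":
    cluster = LocalCUDACluster()      # one Dask worker per visible GPU
    client = Client(cluster)
    Comms.initialize(p2p=True)        # raft/NCCL comms used by the MG algorithms

    # Assumed layout of /datasets/karate.csv: space-delimited src dst wgt
    ddf = dask_cudf.read_csv(
        "/datasets/karate.csv",
        delimiter=" ",
        names=["src", "dst", "wgt"],
        dtype=["int32", "int32", "float32"],
    )
    G = cugraph.Graph(directed=False)
    G.from_dask_cudf_edgelist(ddf, source="src", destination="dst")

    # No explicit vertex pairs, matching the has_vertex_pair:False cases below
    print(dask_cugraph.jaccard(G).head())

    # Teardown in the order the traceback shows for stop_dask_client()
    Comms.destroy()
    client.close()
    cluster.close()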
Relevant log output
10/06/24-09:44:46.865339912_UTC>>>> NODE 0: ******** STARTING TESTS FROM: tests/link_prediction/test_jaccard_mg.py, using 2 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.3.3, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=1 min_time=0.000005 max_time=0 calibration_precision=10 warmup=False warmup_iterations=100000)
rapids_pytest_benchmark: 0.0.15
rootdir: /root/cugraph/python/cugraph
configfile: pytest.ini
plugins: benchmark-4.0.0, cov-5.0.0, rapids-pytest-benchmark-0.0.15
collecting ... collected 96 items
tests/link_prediction/test_jaccard_mg.py::test_dask_mg_jaccard[graph_file:/datasets/karate.csv-directed:False-has_vertex_pair:False-has_vertices:False-has_topk:False-is_weighted:False]
Dask client/cluster created using LocalCUDACluster
PASSED
tests/link_prediction/test_jaccard_mg.py::test_dask_mg_all_pairs_jaccard[graph_file:/datasets/karate.csv-directed:False-has_vertex_pair:False-has_vertices:False-has_topk:False-is_weighted:False] 2024-10-06 02:54:51,390 - distributed.scheduler - WARNING - Worker failed to heartbeat for 589s; attempting restart: <WorkerState 'tcp://127.0.0.1:34653', name: 0, status: running, memory: 1, processing: 1>
2024-10-06 02:54:51,391 - distributed.scheduler - WARNING - Worker failed to heartbeat for 589s; attempting restart: <WorkerState 'tcp://127.0.0.1:45585', name: 1, status: running, memory: 1, processing: 1>
2024-10-06 02:54:55,395 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-10-06 02:54:55,395 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-10-06 02:54:55,805 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:45585' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-5bafbba2-f450-4ef0-9a27-30cbbeec400a'} (stimulus_id='handle-worker-cleanup-1728208495.8048048')
2024-10-06 02:54:55,844 - distributed.nanny - WARNING - Restarting worker
vertices None
2024-10-06 02:54:55,882 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:34653' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-7526511b-3437-4809-b535-b91dd8cd9a4c'} (stimulus_id='handle-worker-cleanup-1728208495.8818748')
2024-10-06 02:54:55,929 - distributed.nanny - WARNING - Restarting worker
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
2024-10-06 02:59:48,917 - distributed.nanny - ERROR - Worker process died unexpectedly
2024-10-06 02:59:48,918 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
target(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1022, in _run
asyncio_run(run(), loop_factory=get_loop_factory())
File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
event_list = self._selector.select(timeout)
File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
target(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1022, in _run
asyncio_run(run(), loop_factory=get_loop_factory())
File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
event_list = self._selector.select(timeout)
File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
2024-10-06 03:00:17,190 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:45585 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:45585 after 30 s
2024-10-06 03:00:17,191 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:34653 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:34653 after 30 s
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/opt/conda/lib/python3.10/threading.py:324: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 546, in connect
stream = await self.client.connect(
File "/opt/conda/lib/python3.10/site-packages/tornado/tcpclient.py", line 279, in connect
af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 342, in connect
comm = await wait_for(
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 1926, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/bin/pytest", line 10, in <module>
sys.exit(console_main())
File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 201, in console_main
code = main()
File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 175, in main
ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
raise exception.with_traceback(exception.__traceback__)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
res = hook_impl.function(*args)
File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 330, in pytest_cmdline_main
return wrap_session(config, _main)
File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 318, in wrap_session
config.hook.pytest_sessionfinish(
File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 182, in _multicall
return outcome.get_result()
File "/opt/conda/lib/python3.10/site-packages/pluggy/_result.py", line 100, in get_result
raise exc.with_traceback(exc.__traceback__)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
teardown.throw(outcome._exception)
File "/opt/conda/lib/python3.10/site-packages/_pytest/logging.py", line 868, in pytest_sessionfinish
return (yield)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
teardown.throw(outcome._exception)
File "/opt/conda/lib/python3.10/site-packages/_pytest/terminal.py", line 893, in pytest_sessionfinish
result = yield
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
teardown.throw(outcome._exception)
File "/opt/conda/lib/python3.10/site-packages/_pytest/warnings.py", line 141, in pytest_sessionfinish
return (yield)
File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
res = hook_impl.function(*args)
File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 107, in pytest_sessionfinish
session._setupstate.teardown_exact(None)
File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 557, in teardown_exact
raise exceptions[0]
File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 546, in teardown_exact
fin()
File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1032, in finish
raise exceptions[0]
File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1021, in finish
fin()
File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 907, in _teardown_yield_fixture
next(it)
File "/root/cugraph/python/cugraph/cugraph/tests/conftest.py", line 52, in dask_client
stop_dask_client(dask_client, dask_cluster)
File "/opt/conda/lib/python3.10/site-packages/cugraph/testing/mg_utils.py", line 182, in stop_dask_client
Comms.destroy()
File "/opt/conda/lib/python3.10/site-packages/cugraph/dask/comms/comms.py", line 214, in destroy
__instance.destroy()
File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
self.client.run(
File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3183, in run
return self.sync(
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 364, in sync
return sync(
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 440, in sync
raise error
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 414, in f
result = yield future
File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
value = future.result()
File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3088, in _run
raise exc
File "/opt/conda/lib/python3.10/site-packages/distributed/scheduler.py", line 6642, in send_message
comm = await self.rpc.connect(addr)
File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1535, in connect
return connect_attempt.result()
File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1425, in _connect
comm = await connect(
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 368, in connect
raise OSError(
OSError: Timed out trying to connect to tcp://127.0.0.1:34653 after 30 s
Exception ignored in: <function Comms.__del__ at 0x15234c2544c0>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 135, in __del__
File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3183, in run
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 364, in sync
File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 431, in sync
File "/opt/conda/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 227, in add_callback
AttributeError: 'NoneType' object has no attribute 'get_running_loop'
10/06/24-10:00:18.401427570_UTC>>>> ERROR: command timed out after 900 seconds
10/06/24-10:00:18.402550166_UTC>>>> NODE 0: pytest exited with code: 124, run-py-tests.sh overall exit code is: 124
10/06/24-10:00:18.480734601_UTC>>>> NODE 0: remaining python processes: [ 2532726 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
10/06/24-10:00:18.505469746_UTC>>>> NODE 0: remaining dask processes: [ ]
Environment details
Running inside the nightly MNMG test containers.
Other/Misc.
No response
Code of Conduct
I agree to follow cuGraph's Code of Conduct
I have searched the open bugs and have found no duplicates for this bug report