[EXP] Enable cuGraph workflow with Ray GPU Cluster #4611

Open

VibhuJawa opened this issue Aug 14, 2024 · 0 comments
Labels: feature request (New feature or request)

VibhuJawa (Member) commented Aug 14, 2024

Enable cuGraph workflow with Ray GPU Cluster

Currently, we support bootstrapping RAFT comms with PyTorch DDP as well as with Dask.

See example: https://github.com/rapidsai/cugraph-gnn/blob/e6000e53f7b1a6bb0834d69e8d54a5af16583289/python/cugraph-pyg/cugraph_pyg/tests/loader/test_neighbor_loader_mg.py#L34-L58

We should similarly explore enabling this with a Ray GPU cluster.

This involves using `cugraph_nccl_comms` in a Ray setting instead of PyTorch DDP / Dask:

```python
def cugraph_comms_init(rank, world_size, uid, device=0):
    global __nccl_comms, __raft_handle
    if __nccl_comms is not None or __raft_handle is not None:
        raise RuntimeError("cuGraph has already been initialized!")

    # TODO add options for rmm initialization
    global __old_device
    __old_device = getDevice()
    setDevice(device)

    nccl_comms = nccl_init(rank, world_size, uid)

    # FIXME should we use n_streams_per_handle=1 here?
    raft_handle = make_raft_handle(rank, world_size, nccl_comms, verbose=True)

    pcols, _ = __get_2D_div(world_size)
    init_subcomms(raft_handle, pcols)

    __nccl_comms = nccl_comms
    __raft_handle = raft_handle
```
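
For reference, here is a minimal, untested sketch of what the Ray-side bootstrap might look like. The `CommsActor` class and the `world_size` value are hypothetical illustrations; the `cugraph_comms_*` helpers from `cugraph.gnn` are the real entry points, and this assumes the NCCL uid serializes cleanly through Ray's object store.

```python
# Sketch (assumption, not a tested recipe): bootstrap cugraph_nccl_comms from
# Ray actors instead of torch.distributed / Dask. CommsActor and world_size
# are hypothetical; the cugraph_comms_* helpers come from cugraph.gnn.
import ray
from cugraph.gnn import (
    cugraph_comms_create_unique_id,
    cugraph_comms_init,
    cugraph_comms_shutdown,
)


@ray.remote(num_gpus=1)
class CommsActor:
    def setup(self, rank, world_size, uid):
        # Ray restricts CUDA_VISIBLE_DEVICES per actor, so device 0 is
        # this actor's assigned GPU.
        cugraph_comms_init(rank, world_size, uid, device=0)

    def teardown(self):
        cugraph_comms_shutdown()


ray.init()
world_size = 2  # e.g., two GPUs in the cluster (illustrative)
uid = cugraph_comms_create_unique_id()  # created once on the driver
actors = [CommsActor.remote() for _ in range(world_size)]
ray.get(
    [a.setup.remote(rank, world_size, uid) for rank, a in enumerate(actors)]
)
```

The key difference from the DDP path is that the NCCL uid is created once on the driver and shipped to each actor through Ray, rather than broadcast through `torch.distributed`.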

Workflow to Test:

  1. Set up a Ray GPU cluster (@ayushdg to share scripts for this).
  2. Set up comms using the Ray cluster (see the sketch above).
  3. Create a pylibcugraph.MGGraph, similar to:
     https://github.com/rapidsai/cugraph-gnn/blob/e6000e53f7b1a6bb0834d69e8d54a5af16583289/python/cugraph-pyg/cugraph_pyg/data/graph_store.py#L141-L166
  4. Call Connected Components, similar to the existing Dask path:

     ```python
     result = [
         client.submit(
             _call_plc_wcc,
             Comms.get_session_id(),
             input_graph._plc_graph[w],
             do_expensive_check,
             workers=[w],
             allow_other_workers=False,
         )
         for w in Comms.get_workers()
     ]
     wait(result)
     cudf_result = [client.submit(convert_to_cudf, cp_arrays) for cp_arrays in result]
     wait(cudf_result)
     ddf = dask_cudf.from_delayed(cudf_result).persist()
     wait(ddf)
     # Wait until the inactive futures are released
     wait([(r.release(), c_r.release()) for r, c_r in zip(result, cudf_result)])
     ```

  5. Write the results to a Parquet file (a rough sketch of steps 3-5 follows this list).
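
For steps 3-5, a rough per-rank sketch follows. This is an assumption, not a tested workflow: the `run_wcc` helper is hypothetical, and the exact `MGGraph` constructor arguments vary by release, so the linked `graph_store.py` should be treated as the authoritative reference for graph construction.

```python
# Rough per-rank sketch of steps 3-5, meant to run inside a Ray actor after
# cugraph_comms_init. MGGraph kwargs here are illustrative and may differ by
# release; see the graph_store.py link above for the authoritative call.
import cudf
import cupy
import pylibcugraph
from cugraph.gnn import cugraph_comms_get_raft_handle


def run_wcc(rank: int, src: cupy.ndarray, dst: cupy.ndarray) -> None:
    # Wrap the RAFT handle created by cugraph_comms_init
    handle = pylibcugraph.ResourceHandle(
        cugraph_comms_get_raft_handle().getHandle()
    )

    # Step 3: build the MG graph from this rank's edge partition
    graph = pylibcugraph.MGGraph(
        resource_handle=handle,
        graph_properties=pylibcugraph.GraphProperties(
            is_symmetric=False, is_multigraph=False
        ),
        src_array=[src],
        dst_array=[dst],
        weight_array=None,
        edge_id_array=None,
        edge_type_array=None,
        num_arrays=1,
        store_transposed=False,
        do_expensive_check=False,
    )

    # Step 4: call weakly connected components
    vertices, labels = pylibcugraph.weakly_connected_components(
        resource_handle=handle,
        graph=graph,
        offsets=None,
        indices=None,
        weights=None,
        labels=None,
        do_expensive_check=False,
    )

    # Step 5: each rank writes its partition of the labeling to Parquet
    cudf.DataFrame({"vertex": vertices, "labels": labels}).to_parquet(
        f"wcc_results_rank_{rank}.parquet"
    )
```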

CC: @BradReesWork, @quasiben, @ayushdg, @randerzander

@alexbarghi-nv alexbarghi-nv self-assigned this Aug 19, 2024
@alexbarghi-nv alexbarghi-nv added this to the 24.10 milestone Aug 19, 2024
@alexbarghi-nv alexbarghi-nv added the feature request New feature or request label Aug 28, 2024