Using RDMA capable nodes #34
I also noticed that NCCL_IB_DISABLE (an env variable) is set to 1 by the pretraining AML environment (or possibly by the Docker image): https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html. I wonder if the authors hit any blocking issues using InfiniBand/RDMA. @aashna
When I tried the pretraining on ND24rs (RDMA/InfiniBand) nodes, I got the following error:
I think NCCL_IB_DISABLE should be set to 0 (or unset), but I haven't tried that yet.
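A minimal sketch of what that suggestion looks like in practice (my own illustration, not code from this repo): override the inherited value before NCCL initializes, since NCCL_IB_DISABLE=1 turns the InfiniBand transport off and "0" (or unset) keeps it on.

```python
import os

def enable_nccl_ib() -> None:
    """Override NCCL_IB_DISABLE, which the base image or AML
    environment may have set to "1", before NCCL initializes."""
    os.environ["NCCL_IB_DISABLE"] = "0"

enable_nccl_ib()
print(os.environ["NCCL_IB_DISABLE"])  # → 0
```

This must run before the first collective/process-group initialization in the training script, because NCCL reads the variable at setup time.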
After checking with the Azure ML folks, it turned out I have to use Intel MPI as the backend when using nodes without SR-IOV support.
Accelerating Distributed Training in Azure Machine Learning service using SR-IOV: If you have access to NCv3 or NDv2, then you can take advantage of the faster GPU interconnect. SR-IOV support should come to NCv2 and NDv1 later in 2020. Without SR-IOV, for NCCL, we need to set "NCCL_IB_DISABLE": "1" to disable InfiniBand on RDMA capable VMs (e.g., ND24rs).
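To make the guidance above concrete, here is a small illustrative helper (my own sketch, not from Azure docs; the SKU names in the set are assumptions drawn from the comment, not an authoritative Azure list) that picks a backend and NCCL setting per VM size:

```python
# SKUs with SR-IOV per the comment above: NCv3 and NDv2 families (assumed examples).
SRIOV_SKUS = {"Standard_NC24rs_v3", "Standard_ND40rs_v2"}

def distributed_settings(vm_size: str) -> dict:
    """Return a backend choice and NCCL env vars for a given VM SKU."""
    if vm_size in SRIOV_SKUS:
        # SR-IOV available: NCCL can use InfiniBand directly.
        return {"backend": "NCCL", "env": {"NCCL_IB_DISABLE": "0"}}
    # No SR-IOV (e.g. ND24rs): use Intel MPI and disable NCCL's IB transport.
    return {"backend": "IntelMPI", "env": {"NCCL_IB_DISABLE": "1"}}

print(distributed_settings("Standard_ND24rs")["backend"])  # → IntelMPI
```

The point is just that the backend and the NCCL_IB_DISABLE value move together: IB on with NCCL where SR-IOV exists, IB off with Intel MPI where it does not.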
Is there a reason for using Standard_NC24s_v3 rather than the RDMA-capable Standard_NC24rs_v3?