
Overview

CUDA environment support enables the use of NVIDIA’s GPU memory in UCX and HCOLL communication libraries for point-to-point and collective routines, respectively.

Supported Architectures

CPU architectures: x86, PowerPC

NVIDIA GPU architectures: Tesla, Kepler, Pascal, Volta

System Requirements

  • GPUDirect RDMA kernel module (nv_peer_mem)

Once the NVIDIA software components are installed, it is important to verify that the GPUDirect RDMA kernel module is properly loaded on each of the computing systems where you plan to run jobs that require GPUDirect RDMA.

To check whether the GPUDirect RDMA module is loaded, run:

service nv_peer_mem status

On other Linux flavors, run:

lsmod | grep nv_peer_mem

  • GDR COPY plugin module

GDR COPY is a fast copy library from NVIDIA, used to transfer data between host and GPU memory. For information on how to install GDR COPY, please refer to its GitHub page. Once GDR COPY is installed, it is important to verify that the gdrcopy kernel module is properly loaded on each of the computing systems where you plan to run jobs that require GDR COPY.

To check whether the GDR COPY module is loaded, run:

lsmod | grep gdrdrv

Configuring CUDA Support

Configuration flags:

--with-cuda=<cuda/runtime/install/path>

--with-gdrcopy=<gdr_copy/install/path>
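
For example, a build from the UCX source tree with both CUDA and GDR COPY support might be configured as follows; the install paths are illustrative and depend on your system:

./autogen.sh
./contrib/configure-release --prefix=$PWD/install \
    --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local/gdrcopy
make -j8 install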

List of CUDA transports (TLs):

  • cuda_copy : CUDA copy transport used by staging protocols
  • gdr_copy : GDR COPY based transport for faster copies from host to GPU for small buffers
  • cuda_ipc : intra-node peer-to-peer (P2P) device transfers
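
As an illustration, the CUDA transports can be requested explicitly together with a network transport through the UCX_TLS variable, and the transports built into your installation can be inspected with ucx_info; the transport selection below is an example, not a recommendation:

# check that the CUDA transports were compiled in
ucx_info -d | grep -i cuda

# example: restrict UCX to RC verbs plus the CUDA transports
export UCX_TLS=rc,cuda_copy,cuda_ipc,gdr_copy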

Tuning

  • In GPUDirect RDMA optimized system configurations, where the GPU and the HCA are connected to the same PCIe switch fabric and the MPI processes are bound to the HCA and GPU under the same PCIe switch, use the following rendezvous protocol for optimal GPUDirect RDMA performance:

    UCX_RNDV_SCHEME=get_zcopy
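
    For example, with Open MPI the setting can be exported to all ranks on the mpirun command line; the process count and application name here are illustrative:

    mpirun -np 2 -x UCX_RNDV_SCHEME=get_zcopy ./my_cuda_app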

Known Issues

  • For versions older than v1.12: CUDA runtime memory hooks do not work if the application statically links against libcudart_static.a.

    UCX intercepts CUDA memory allocation calls (cudaMalloc/cudaFree) to populate a memory type cache, which saves the cost (around 0.2 us) of pointer type checking with the cudaPointerGetAttributes() CUDA API. When the application links the CUDA runtime statically, these calls cannot be intercepted.

    Workaround: Disable the CUDA memory type cache with -x UCX_MEMTYPE_CACHE=n (an example invocation follows this list).

  • For versions older than v1.7: Segfault when CUDA memory that was allocated before MPI_Init/ucp_init is used in communication APIs.

    UCX memory hooks are installed during UCX initialization. CUDA memory allocations made before that point are not intercepted and therefore never populated into the memory type cache, so those buffers are treated as host memory.

    Workaround: Disable the memory type cache with -x UCX_MEMTYPE_CACHE=n.
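
Either workaround above can be applied through the MPI launcher or in the environment; the process count and application name are illustrative:

# Open MPI: export the variable to all ranks
mpirun -np 2 -x UCX_MEMTYPE_CACHE=n ./my_cuda_app

# or set it in the environment before launching
export UCX_MEMTYPE_CACHE=n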

FAQ

  • GPUDirect RDMA/P2P might fail or show unexpectedly low performance

    Enabling ACS (Access Control Services) on PLX PCIe switches might be a reason for this. Please refer to this for more details.
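
    One way to check whether ACS is active on the PCIe switches, assuming lspci is available and run with root privileges, is to look at the ACSCtl capability; control bits reported with a '+' (for example SrcValid+) indicate that ACS features are enabled:

    sudo lspci -vvv | grep -i acsctl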
