diff --git a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst index 15f3958735..99cd2f6364 100644 --- a/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst +++ b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst @@ -18,7 +18,7 @@ System Architectures (HSA) and Unified Memory (UM) help avoid these limitations and promise increased efficiency and innovation. Unified memory -============== +================================================================================ Unified Memory is a single memory address space accessible from any processor within a system. This setup simplifies memory management processes and enables @@ -27,10 +27,37 @@ either CPUs or GPUs. The Unified memory model is shown in the following figure. .. figure:: ../../../data/how-to/hip_runtime_api/memory_management/unified_memory/um.svg +Traditionally, host memory is accessible on the device side, or device memory +on the host side, through the zero-copy feature. Zero-copy accesses happen over the +Infinity Fabric (IF) interconnect or PCIe lanes on discrete GPUs. + +Unified memory introduces page-fault based, on-demand migration, which lets +kernels access data resident on the host. The steps of page migration are: + +1. The GPU tries to access memory addresses that are resident on the host. +2. A page-fault event is triggered. +3. The page request is sent to the host over PCIe or the interconnect. +4. The page is unmapped from host memory and sent to the device over PCIe or the interconnect. +5. The page is mapped to device memory. +6. The multiprocessor accesses the requested memory addresses. + +If the GPU has already reached its memory capacity, page migration has two +extra steps after the page-fault event is triggered. The steps of page migration +when GPU memory is full are: + +1. The GPU tries to access memory addresses that are resident on the host. +2. A page-fault event is triggered. +3. 
**The page is unmapped from device memory and sent to the host over PCIe or IF.** +4. **The page is mapped to host memory.** +5. The page request is sent to the host over PCIe or the interconnect. +6. The page is unmapped from host memory and sent to the device over PCIe or IF. +7. The page is mapped to device memory. +8. The multiprocessor accesses the requested memory addresses. + .. _unified memory system requirements: System requirements -=================== +================================================================================ Unified memory is supported on Linux by all modern AMD GPUs from the Vega series onward. Unified memory management can be achieved with managed memory @@ -45,7 +72,7 @@ the next section. :align: center * - Architecture - - ``hipMallocManaged()`` + - :cpp:func:`hipMallocManaged()` - ``__managed__`` - ``malloc()`` * - MI200, MI300 Series @@ -76,7 +103,7 @@ page-fault. For more details, visit .. _unified memory programming models: Unified memory programming models -================================= +================================================================================ Showcasing various unified memory programming models, the model availability depends on your architecture. For more information, see :ref:`unified memory system requirements` and :ref:`checking unified memory management support`. - **HIP managed memory allocation API**: - The ``hipMallocManaged()`` is a dynamic memory allocator available on + The :cpp:func:`hipMallocManaged()` is a dynamic memory allocator available on all GPUs with unified memory support. For more details, visit :ref:`unified_memory_reference`. @@ -101,10 +128,72 @@ system requirements` and :ref:`checking unified memory management support`. offers an easy transition from a CPU written C++ code to a HIP code as the same system allocation API is used. 
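+
+As a minimal, hedged sketch of the managed-allocation model (the kernel name
+and sizes below are illustrative only, not part of the HIP API), a managed
+array can be written on the host, processed on the device, and read back
+without any explicit copies:
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+   #include <iostream>
+
+   // Illustrative kernel: doubles each element of the managed array in place.
+   __global__ void doubleElements(int* data, size_t n) {
+       const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
+       if (i < n) data[i] *= 2;
+   }
+
+   int main() {
+       constexpr size_t n = 256;
+       int* data = nullptr;
+       // One pointer, valid on both host and device; no hipMemcpy() needed.
+       if (hipMallocManaged(&data, n * sizeof(*data)) != hipSuccess) return 1;
+       for (size_t i = 0; i < n; ++i) data[i] = static_cast<int>(i);
+       doubleElements<<<1, n>>>(data, n);
+       hipDeviceSynchronize(); // make device writes visible to the host
+       std::cout << "data[3] = " << data[3] << std::endl;
+       hipFree(data);
+       return 0;
+   }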
+To ensure the proper functioning of unified memory features on graphics cards +that support Heterogeneous Memory Management (HMM), it is essential to set +the environment variable ``HSA_XNACK=1``. Without this configuration, the behavior +will be similar to that of systems without HMM support. For more details, visit +`GPU memory `_. + +The tables below illustrate the expected behavior of managed and unified memory +functions in ROCm and CUDA environments, both with and without HMM support: + +.. tab-set:: + .. tab-item:: ROCm + :sync: original-block + + .. list-table:: Comparison of expected behavior of managed and unified memory functions in ROCm + :widths: 35, 35, 30 + :header-rows: 1 + + * - call + - Without HMM or with ``XNACK=0`` + - With HMM and with ``XNACK=1`` + * - ``malloc()``, ``new``, system allocator + - host (not accessible on device) + - host, page-fault migration + * - :cpp:func:`hipMalloc()` + - device, zero copy + - device, zero copy + * - :cpp:func:`hipMallocManaged()`, ``__managed__`` + - host, pinned, zero copy + - host, page-fault migration + * - :cpp:func:`hipHostRegister()` + - undefined behavior + - host, page-fault migration + * - :cpp:func:`hipHostMalloc()` + - host, pinned, zero copy + - host, pinned, zero copy + + .. tab-item:: CUDA + :sync: cooperative-groups + + .. list-table:: Comparison of expected behavior of managed and unified memory functions in CUDA + :widths: 40, 25, 25 + :header-rows: 1 + + * - call + - Without HMM + - With HMM + * - ``malloc()``, ``new``, system allocator + - host (not accessible on device) + - first touch, page-fault migration + * - ``cudaMalloc()`` + - device (not accessible on host) + - device, page-fault migration + * - ``cudaMallocManaged()``, ``__managed__`` + - host, page-fault migration + - first touch, page-fault migration + * - ``cudaHostRegister()`` + - host, page-fault migration + - host, page-fault migration + * - ``cudaMallocHost()`` + - host, pinned, zero copy + - host, pinned, zero copy + .. 
_checking unified memory management support: Checking unified memory management support ------------------------------------------- +-------------------------------------------------------------------------------- Some device attributes can offer information about which :ref:`unified memory programming models` are supported. The attribute value is 1 if the @@ -144,7 +233,7 @@ The following examples show how to use device attributes: } Example for unified memory management ------------------------------------- +-------------------------------------------------------------------------------- The following example shows how to use unified memory management with the ``hipMallocManaged()`` function, with the ``__managed__`` attribute for static @@ -323,7 +412,7 @@ Memory Management example is presented in the last tab. .. _using unified memory management: Using unified memory management (UMM) -===================================== +================================================================================ Unified memory management (UMM) is a feature that can simplify the complexities of memory management in GPU computing. It is particularly useful in @@ -361,7 +450,7 @@ case. .. _unified memory runtime hints: Unified memory HIP runtime hints for the better performance -=========================================================== +================================================================================ Unified memory HIP runtime hints can help improve the performance of your code if you know your code's ability and infrastructure. Some hint techniques are @@ -378,7 +467,7 @@ For the best performance, profile your application to optimize the utilization of HIP runtime hints. 
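+
+As a hedged sketch of how these hints are typically combined (``data``,
+``size``, and ``stream`` are assumed to come from the surrounding application,
+and device ``0`` is illustrative), advice and prefetches can be issued around
+the kernel launches:
+
+.. code-block:: cpp
+
+   // Assumes data was allocated with hipMallocManaged() and stream is a valid hipStream_t.
+   hipMemAdvise(data, size, hipMemAdviseSetPreferredLocation, 0); // prefer pages on GPU 0
+   hipMemPrefetchAsync(data, size, 0, stream); // migrate pages before the kernels run
+   // ... launch kernels on stream that access data ...
+   hipMemPrefetchAsync(data, size, hipCpuDeviceId, stream); // bring results back to the host
+   hipStreamSynchronize(stream);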
Data prefetching ----------------- +-------------------------------------------------------------------------------- Data prefetching is a technique used to improve the performance of your application by moving data closer to the processing unit before it's actually @@ -438,7 +527,7 @@ Remember to check the return status of ``hipMemPrefetchAsync()`` to ensure that the prefetch operations are completed successfully. Memory advice ------------- +-------------------------------------------------------------------------------- The effectiveness of ``hipMemAdvise()`` comes from its ability to inform the runtime system of the developer's intentions regarding memory usage. When the @@ -506,7 +595,7 @@ Here is the updated version of the example above with memory advice. Memory range attributes ------------------------ +-------------------------------------------------------------------------------- Memory Range attributes allow you to query attributes of a given memory range. @@ -568,8 +657,12 @@ For more details, visit the } Asynchronously attach memory to a stream ----------------------------------------- +-------------------------------------------------------------------------------- -The ``hipStreamAttachMemAsync`` function would be able to asynchronously attach memory to a stream, which can help concurrent execution when using streams. +The ``hipStreamAttachMemAsync`` function asynchronously attaches memory to a +stream, which can help concurrent execution when using streams. -Currently, this function is a no-operation (NOP) function on AMD GPUs. It simply returns success after the runtime memory validation passed. This function is necessary on Microsoft Windows, and UMM is not supported on this operating system with AMD GPUs at the moment. +Currently, this function is a no-operation (NOP) function on AMD GPUs. It simply +returns success after the runtime memory validation passes. 
This function is +necessary on Microsoft Windows, as UMM is not currently supported on this +operating system with AMD GPUs. diff --git a/docs/how-to/hip_runtime_api/memory_management/virtual_memory.rst b/docs/how-to/hip_runtime_api/memory_management/virtual_memory.rst index 7f6880a639..597b54040f 100644 --- a/docs/how-to/hip_runtime_api/memory_management/virtual_memory.rst +++ b/docs/how-to/hip_runtime_api/memory_management/virtual_memory.rst @@ -5,9 +5,9 @@ .. _virtual_memory: -***************************** +******************************************************************************** Virtual memory management -***************************** +******************************************************************************** Memory management is important when creating high-performance applications in the HIP ecosystem. Both allocating and copying memory can result in bottlenecks, @@ -28,14 +28,30 @@ reduce memory usage and unnecessary ``memcpy`` calls. .. _memory_allocation_virtual_memory: Memory allocation -================= +================================================================================ -Standard memory allocation uses the ``hipMalloc`` function to allocate a block of memory on the device. However, when using virtual memory, this process is separated into multiple steps using the ``hipMemCreate``, ``hipMemAddressReserve``, ``hipMemMap``, and ``hipMemSetAccess`` functions. This guide explains what these functions do and how you can use them for virtual memory management. +Standard memory allocation uses the :cpp:func:`hipMalloc` function to allocate a +block of memory on the device. However, when using virtual memory, this process +is separated into multiple steps using the :cpp:func:`hipMemCreate`, +:cpp:func:`hipMemAddressReserve`, :cpp:func:`hipMemMap`, and +:cpp:func:`hipMemSetAccess` functions. This guide explains what these functions +do and how you can use them for virtual memory management. 
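+
+The individual steps are described in the following sections; as a hedged,
+end-to-end sketch (device ``0`` and a ``size`` variable are assumed, and error
+checking is omitted for brevity), the calls line up as follows:
+
+.. code-block:: cpp
+
+   hipMemAllocationProp prop = {};
+   prop.type = hipMemAllocationTypePinned;
+   prop.location.type = hipMemLocationTypeDevice;
+   prop.location.id = 0; // physical pages on device 0
+
+   size_t granularity = 0;
+   hipMemGetAllocationGranularity(&granularity, &prop,
+                                  hipMemAllocationGranularityMinimum);
+   size_t padded_size = ((size + granularity - 1) / granularity) * granularity;
+
+   hipMemGenericAllocationHandle_t allocHandle;
+   hipMemCreate(&allocHandle, padded_size, &prop, 0);      // physical allocation
+
+   void* ptr = nullptr;
+   hipMemAddressReserve(&ptr, padded_size, 0, nullptr, 0); // virtual address range
+   hipMemMap(ptr, padded_size, 0, allocHandle, 0);         // map physical to virtual
+
+   hipMemAccessDesc accessDesc = {};
+   accessDesc.location.type = hipMemLocationTypeDevice;
+   accessDesc.location.id = 0;
+   accessDesc.flags = HIP_MEM_ACCESS_FLAGS_PROT_READWRITE;
+   hipMemSetAccess(ptr, padded_size, &accessDesc, 1);      // enable access from device 0
+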
Allocate physical memory ------------------------- - -The first step is to allocate the physical memory itself with the ``hipMemCreate`` function. This function accepts the size of the buffer, an ``unsigned long long`` variable for the flags, and a ``hipMemAllocationProp`` variable. ``hipMemAllocationProp`` contains the properties of the memory to be allocated, such as where the memory is physically located and what kind of shareable handles are available. If the allocation is successful, the function returns a value of ``hipSuccess``, with ``hipMemGenericAllocationHandle_t`` representing a valid physical memory allocation. The allocated memory size must be aligned with the granularity appropriate for the properties of the allocation. You can use the ``hipMemGetAllocationGranularity`` function to determine the correct granularity. +-------------------------------------------------------------------------------- + +The first step is to allocate the physical memory itself with the +:cpp:func:`hipMemCreate` function. This function accepts the size of the buffer, +an ``unsigned long long`` variable for the flags, and a +:cpp:struct:`hipMemAllocationProp` variable. :cpp:struct:`hipMemAllocationProp` +contains the properties of the memory to be allocated, such as where the memory +is physically located and what kind of shareable handles are available. If the +allocation is successful, the function returns a value of +:cpp:enumerator:`hipSuccess`, with :cpp:type:`hipMemGenericAllocationHandle_t` +representing a valid physical memory allocation. The allocated memory size must +be aligned with the granularity appropriate for the properties of the +allocation. You can use the :cpp:func:`hipMemGetAllocationGranularity` function +to determine the correct granularity. .. 
code-block:: cpp @@ -50,9 +66,16 @@ The first step is to allocate the physical memory itself with the ``hipMemCreate hipMemCreate(&allocHandle, padded_size, &prop, 0); Reserve virtual address range ------------------------------ +-------------------------------------------------------------------------------- -After you have acquired an allocation of physical memory, you must map it before you can use it. To do so, you need a virtual address to map it to. Mapping means the physical memory allocation is available from the virtual address range it is mapped to. To reserve a virtual memory range, use the ``hipMemAddressReserve`` function. The size of the virtual memory must match the amount of physical memory previously allocated. You can then map the physical memory allocation to the newly-acquired virtual memory address range using the ``hipMemMap`` function. +After you have acquired an allocation of physical memory, you must map it before +you can use it. To do so, you need a virtual address to map it to. Mapping +means the physical memory allocation is available from the virtual address range +it is mapped to. To reserve a virtual memory range, use the +:cpp:func:`hipMemAddressReserve` function. The size of the virtual memory must +match the amount of physical memory previously allocated. You can then map the +physical memory allocation to the newly-acquired virtual memory address range +using the :cpp:func:`hipMemMap` function. .. code-block:: cpp @@ -60,9 +83,22 @@ After you have acquired an allocation of physical memory, you must map it before hipMemMap(ptr, padded_size, 0, allocHandle, 0); Set memory access ------------------ - -Finally, use the ``hipMemSetAccess`` function to enable memory access. It accepts the pointer to the virtual memory, the size, and a ``hipMemAccessDesc`` descriptor as parameters. In a multi-GPU environment, you can map the device memory of one GPU to another. 
This feature also works with the traditional memory management system, but isn't as scalable as with virtual memory. When memory is allocated with ``hipMalloc``, ``hipDeviceEnablePeerAccess`` is used to enable peer access. This function enables access between two devices, but it means that every call to ``hipMalloc`` takes more time to perform the checks and the mapping between the devices. When using virtual memory management, peer access is enabled by ``hipMemSetAccess``, which provides a finer level of control over what is shared. This has no performance impact on memory allocation and gives you more control over what memory buffers are shared with which devices. +-------------------------------------------------------------------------------- + +Finally, use the :cpp:func:`hipMemSetAccess` function to enable memory access. +It accepts the pointer to the virtual memory, the size, and a +:cpp:struct:`hipMemAccessDesc` descriptor as parameters. In a multi-GPU +environment, you can map the device memory of one GPU to another. This feature +also works with the traditional memory management system, but isn't as scalable +as with virtual memory. When memory is allocated with :cpp:func:`hipMalloc`, +:cpp:func:`hipDeviceEnablePeerAccess` is used to enable peer access. This +function enables access between two devices, but it means that every call to +:cpp:func:`hipMalloc` takes more time to perform the checks and the mapping +between the devices. When using virtual memory management, peer access is +enabled by :cpp:func:`hipMemSetAccess`, which provides a finer level of +control over what is shared. This has no performance impact on memory allocation +and gives you more control over what memory buffers are shared with which +devices. .. code-block:: cpp @@ -72,12 +108,19 @@ Finally, use the ``hipMemSetAccess`` function to enable memory access. 
It accept accessDesc.flags = HIP_MEM_ACCESS_FLAGS_PROT_READWRITE; hipMemSetAccess(ptr, padded_size, &accessDesc, 1); -At this point the memory is allocated, mapped, and ready for use. You can read and write to it, just like you would a C style memory allocation. +At this point the memory is allocated, mapped, and ready for use. You can read +and write to it, just like you would with a C-style memory allocation. Free virtual memory -------------------- +-------------------------------------------------------------------------------- -To free the memory allocated in this manner, use the corresponding free functions. To unmap the memory, use ``hipMemUnmap``. To release the virtual address range, use ``hipMemAddressFree``. Finally, to release the physical memory, use ``hipMemRelease``. A side effect of these functions is the lack of synchronization when memory is released. If you call ``hipFree`` when you have multiple streams running in parallel, it synchronizes the device. This causes worse resource usage and performance. +To free the memory allocated in this manner, use the corresponding free +functions. To unmap the memory, use :cpp:func:`hipMemUnmap`. To release the +virtual address range, use :cpp:func:`hipMemAddressFree`. Finally, to release +the physical memory, use :cpp:func:`hipMemRelease`. A side effect of these +functions is the lack of synchronization when memory is released. If you call +:cpp:func:`hipFree` when you have multiple streams running in parallel, it +synchronizes the device. This causes worse resource usage and performance. .. code-block:: cpp @@ -88,12 +131,16 @@ To free the memory allocated in this manner, use the corresponding free function .. 
_usage_virtual_memory: Memory usage -============ +================================================================================ Dynamically increase allocation size ------------------------------------- +-------------------------------------------------------------------------------- -The ``hipMemAddressReserve`` function allows you to increase the amount of pre-allocated memory. This function accepts a parameter representing the requested starting address of the virtual memory. This allows you to have a continuous virtual address space without worrying about the underlying physical allocation. +The :cpp:func:`hipMemAddressReserve` function allows you to increase the amount +of pre-allocated memory. This function accepts a parameter representing the +requested starting address of the virtual memory. This allows you to have a +contiguous virtual address space without worrying about the underlying physical +allocation. .. code-block:: cpp @@ -101,4 +148,7 @@ The ``hipMemAddressReserve`` function allows you to increase the amount of pre-a hipMemMap(new_ptr, (new_size - padded_size), 0, newAllocHandle, 0); hipMemSetAccess(new_ptr, (new_size - padded_size), &accessDesc, 1); -The code sample above assumes that ``hipMemAddressReserve`` was able to reserve the memory address at the specified location. However, this isn't guaranteed to be true, so you should validate that ``new_ptr`` points to a specific virtual address before using it. +The code sample above assumes that :cpp:func:`hipMemAddressReserve` was able to +reserve the memory address at the specified location. However, this isn't +guaranteed to be true, so you should validate that ``new_ptr`` points to the +requested virtual address before using it. 
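+
+A minimal validation sketch (``ptr``, ``padded_size``, and ``new_size`` are
+assumed from the earlier snippets, and requesting the address directly after
+the existing range is only one possible strategy):
+
+.. code-block:: cpp
+
+   void* requested = static_cast<char*>(ptr) + padded_size;
+   void* new_ptr = nullptr;
+   hipError_t status = hipMemAddressReserve(&new_ptr, new_size - padded_size, 0, requested, 0);
+   if (status != hipSuccess || new_ptr != requested) {
+       // The runtime could not grow the range in place: fall back to a fresh
+       // reservation and remap instead of assuming a contiguous address space.
+   }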
diff --git a/include/hip/hip_runtime_api.h b/include/hip/hip_runtime_api.h index 6636f6431e..368621f277 100644 --- a/include/hip/hip_runtime_api.h +++ b/include/hip/hip_runtime_api.h @@ -3523,6 +3523,8 @@ hipError_t hipExtHostAlloc(void** ptr, size_t size, unsigned int flags); * * The API returns the allocation pointer, managed by HMM, can be used further to execute kernels * on device and fetch data between the host and device as needed. + * + * If HMM is not supported, the function behaves the same as @p hipMallocHost . * * @note It is recommend to do the capability check before call this API. *