PR feedbacks and clean up

neon60 committed Oct 14, 2024
1 parent a43de9e commit c60069a

Showing 3 changed files with 88 additions and 73 deletions.
37 changes: 24 additions & 13 deletions docs/how-to/hip_runtime_api/memory_management/coherence_control.rst
Coherence control
*******************************************************************************

Memory coherence describes how memory of a specific part of the system is
visible to the other parts of the system. For example, how GPU memory is visible
to the CPU and vice versa. In HIP, host and device memory can be allocated with
two different types of coherence:

* **Coarse-grained coherence:** The memory is considered up-to-date only after
synchronization performed using :cpp:func:`hipDeviceSynchronize`,
:cpp:func:`hipStreamSynchronize`, or any blocking operation that acts on the
null stream such as :cpp:func:`hipMemcpy`. To avoid the cache from being
accessed by a part of the system while simultaneously being written by
another, the memory is made visible only after the caches have been flushed.
* **Fine-grained coherence:** The memory is coherent even while being modified
by a part of the system. Fine-grained coherence ensures that up-to-date data
is visible to others regardless of kernel boundaries. This can be useful if
both host and device operate on the same data.

.. note::

To achieve fine-grained coherence, many AMD GPUs use a limited cache policy,
such as leaving these allocations uncached by the GPU or making them read-only.

.. TODO: Is this still valid? What about Mi300?
The MI200 accelerator's hardware-based floating point instructions work on
coarse-grained memory regions. Coarse-grained coherence is typically useful in
reducing host-device interconnect communication.

To check the availability of fine- and coarse-grained memory pools, use
``rocminfo``:

.. code-block:: sh

     Segment: GLOBAL; FLAGS: COARSE GRAINED
   ...
The APIs, flags and respective memory coherence control are listed in the
following table:

.. list-table:: Memory coherence control
:widths: 25, 35, 20, 20
43 changes: 28 additions & 15 deletions docs/how-to/hip_runtime_api/memory_management/host_memory.rst
Host memory
********************************************************************************

Host memory is the "normal" memory residing in the host RAM and allocated by C
or C++. Host memory can be allocated in two different ways:

* Pageable memory
* Pinned memory

The following figure explains how data is transferred in pageable and pinned
memory.

.. figure:: ../../../data/how-to/hip_runtime_api/memory_management/pageable_pinned.svg

Pageable and pinned memory allow developers to exercise direct control over
memory operations, which is known as explicit memory management. When using
unified memory, developers get a simplified memory model with less control over
low-level memory operations.

The differences in memory transfers between explicit and unified memory
management are highlighted in the following figure:

.. figure:: ../../../data/how-to/hip_runtime_api/memory_management/unified_memory/um.svg

For more details on unified memory management, see :doc:`/how-to/hip_runtime_api
Pageable memory
================================================================================

Pageable memory exists on memory blocks known as "pages" that can be migrated to
other memory storage. For example, migrating memory between CPU sockets on a
motherboard or in a system whose RAM runs out of space and starts dumping pages
into the swap partition of the hard drive.

Pageable memory is usually allocated with a call to ``malloc`` or ``new`` in a
C++ application.

**Example:** Using pageable host memory in HIP:
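
A minimal sketch of the pattern (names are illustrative; error checking
omitted): the host buffer comes from plain ``malloc``, so it is pageable, and
``hipMemcpy`` stages the transfer through an internal pinned buffer.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   #include <cstdlib>

   int main() {
       const size_t count = 1 << 20;

       // Pageable host memory: ordinary C/C++ allocation, no HIP call involved.
       double *h_data =
           static_cast<double *>(std::malloc(count * sizeof(double)));
       for (size_t i = 0; i < count; ++i)
           h_data[i] = static_cast<double>(i);

       double *d_data = nullptr;
       hipMalloc(reinterpret_cast<void **>(&d_data), count * sizeof(double));

       // Transfers to and from pageable memory are synchronous with respect
       // to the host.
       hipMemcpy(d_data, h_data, count * sizeof(double),
                 hipMemcpyHostToDevice);
       hipMemcpy(h_data, d_data, count * sizeof(double),
                 hipMemcpyDeviceToHost);

       hipFree(d_data);
       std::free(h_data);
       return 0;
   }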

The memory allocation for pinned memory can be controlled using
``hipHostMalloc`` flags:
* ``hipHostMallocCoherent``: Fine-grained memory is allocated. Overrides ``HIP_HOST_COHERENT`` environment variable for specific allocation. For details, see :ref:`coherence_control`.
* ``hipHostMallocNonCoherent``: Coarse-grained memory is allocated. Overrides ``HIP_HOST_COHERENT`` environment variable for specific allocation. For details, see :ref:`coherence_control`.

All allocation flags are independent and can be set in any combination. The only
exception is setting ``hipHostMallocCoherent`` and ``hipHostMallocNonCoherent``
together, which leads to an illegal state. An example of a valid flag
combination is calling :cpp:func:`hipHostMalloc` with both
``hipHostMallocPortable`` and ``hipHostMallocMapped`` flags set. Both the flags
use the same model and differentiate only between how the surrounding code uses
the host memory.
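
For instance, a sketch of a valid combination (illustrative; error checking
omitted): the allocation is portable across all devices and mapped into the
device address space.

.. code-block:: cpp

   double *ptr = nullptr;
   // Pinned allocation visible to all devices and mapped for device access.
   hipHostMalloc(reinterpret_cast<void **>(&ptr), 1024 * sizeof(double),
                 hipHostMallocPortable | hipHostMallocMapped);
   // ... use ptr from host code or pass it to kernels ...
   hipHostFree(ptr);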

.. note::

By default, each GPU selects a NUMA CPU node with the least NUMA distance
between them. This implies that the host memory is automatically allocated on
the closest memory pool of the current GPU device's NUMA node. Using the
:cpp:func:`hipSetDevice` API to set a different GPU increases the NUMA
distance but still allows you to access the host allocation.

The NUMA policy is implemented on Linux and is under development on Microsoft
Windows.
81 changes: 36 additions & 45 deletions docs/understand/compilers.rst
HIP compilers
********************************************************************************

ROCm provides the compiler driver ``hipcc``, which can be used on AMD ROCm and
NVIDIA CUDA platforms.

On ROCm, ``hipcc`` takes care of the following:

- Setting the default library and include paths for HIP
HIP compilation workflow
================================================================================
Offline compilation
--------------------------------------------------------------------------------

HIP code compilation is performed in two stages: a host-code and a device-code
compilation stage.

- Device-code compilation stage: The compiled device code is embedded into the
host object file. Depending on the platform, the device code can be compiled
into assembly or binary. ``nvcc`` and ``amdclang++`` target different
architectures and use different code object formats. ``nvcc`` uses the binary
``cubin`` or the assembly PTX files, while the ``amdclang++`` path is the
binary ``hsaco`` format. On CUDA platforms, the driver compiles the PTX files
to executable code during runtime.

- Host-code compilation stage: On the host side, ``hipcc`` or ``amdclang++`` can
compile the host code in one step without other C++ compilers. On the other
hand, ``nvcc`` only replaces the ``<<<...>>>`` kernel launch syntax with the
appropriate CUDA runtime function call and the modified host code is passed to
the default host compiler.
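
As an illustrative sketch (the source file name is hypothetical), a single
``hipcc`` invocation performs both stages:

.. code-block:: shell

   # Device code is compiled and embedded into the host object; host code is
   # compiled in the same step.
   hipcc saxpy.cpp -o saxpy

   # Optionally restrict the AMD device targets:
   hipcc --offload-arch=gfx90a saxpy.cpp -o saxpy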

For an example on how to compile HIP from the command line, see the
:ref:`SAXPY tutorial<compiling_on_the_command_line>`.

Runtime compilation
--------------------------------------------------------------------------------
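
As a sketch, runtime compilation with the hipRTC API follows a
create-compile-fetch pattern (the kernel source and names are illustrative;
error checking omitted):

.. code-block:: cpp

   #include <hip/hiprtc.h>

   #include <vector>

   static const char *kernel_src = R"(
   extern "C" __global__ void fill_ones(float *out) {
       out[threadIdx.x] = 1.0f;
   })";

   int main() {
       hiprtcProgram prog;
       hiprtcCreateProgram(&prog, kernel_src, "fill_ones.cu",
                           0, nullptr, nullptr);  // no extra headers
       hiprtcCompileProgram(prog, 0, nullptr);    // no extra options

       // Fetch the compiled code object; it can later be loaded with
       // hipModuleLoadData and launched with hipModuleLaunchKernel.
       size_t code_size = 0;
       hiprtcGetCodeSize(prog, &code_size);
       std::vector<char> code(code_size);
       hiprtcGetCode(prog, code.data());

       hiprtcDestroyProgram(&prog);
       return 0;
   }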
Static libraries
================================================================================

``hipcc`` supports generating two types of static libraries.

- The first type of static library only exports and launches host functions
  within the same library and not the device functions. This library type
  offers the ability to link with a non-hipcc compiler such as ``gcc``.
  Additionally, this library type contains host objects with device code
  embedded as fat binaries. This library type is generated using the flag
  ``--emit-static-lib``:

  .. code-block:: shell

     hipcc hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
     gcc test.cpp -L. -lhipOptLibrary -L/path/to/hip/lib -lamdhip64 -o test.out

- The second type of static library exports device functions to be linked by
  other code objects by using ``hipcc`` as the linker. This library type
  contains relocatable device objects and is generated using ``ar``:

  .. code-block:: shell

     hipcc hipDevice.cpp -c -fgpu-rdc -o hipDevice.o
     ar rcsD libHipDevice.a hipDevice.o
     hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out
For more information, see `HIP samples host functions <https://github.com/ROCm/hip-tests/tree/develop/samples/2_Cookbook/15_static_library/host_functions>`_
and `device functions <https://github.com/ROCm/hip-tests/tree/develop/samples/2_Cookbook/15_static_library/device_functions>`_.
