PR feedbacks and clean up

neon60 committed Oct 14, 2024
1 parent a43de9e commit c60069a

Showing 3 changed files with 88 additions and 73 deletions.
37 changes: 24 additions & 13 deletions docs/how-to/hip_runtime_api/memory_management/coherence_control.rst
Coherence control
*******************************************************************************

Memory coherence describes how memory of a specific part of the system is
visible to the other parts of the system. For example, how GPU memory is visible
to the CPU and vice versa. In HIP, host and device memory can be allocated with
two different types of coherence:

* **Coarse-grained coherence:** The memory is considered up-to-date only after
synchronization performed using :cpp:func:`hipDeviceSynchronize`,
:cpp:func:`hipStreamSynchronize`, or any blocking operation that acts on the
null stream such as :cpp:func:`hipMemcpy`. To avoid the cache from being
accessed by a part of the system while simultaneously being written by
another, the memory is made visible only after the caches have been flushed.
* **Fine-grained coherence:** The memory is coherent even while being modified
by a part of the system. Fine-grained coherence ensures that up-to-date data
is visible to others regardless of kernel boundaries. This can be useful if
both host and device operate on the same data.

.. note::

To achieve fine-grained coherence, many AMD GPUs use a limited cache policy,
such as leaving these allocations uncached by the GPU or making them read-only.

.. TODO: Is this still valid? What about Mi300?
The MI200 accelerator's hardware-based floating point instructions work on
coarse-grained memory regions. Coarse-grained coherence is typically useful in
reducing host-device interconnect communication.

To check the availability of fine- and coarse-grained memory pools, use
``rocminfo``:

.. code-block:: sh

     Segment: GLOBAL; FLAGS: COARSE GRAINED
   ...
The APIs, flags and respective memory coherence control are listed in the
following table:

.. list-table:: Memory coherence control
:widths: 25, 35, 20, 20
43 changes: 28 additions & 15 deletions docs/how-to/hip_runtime_api/memory_management/host_memory.rst
Host memory
********************************************************************************

Host memory is the "normal" memory residing in the host RAM and allocated by C
or C++. Host memory can be allocated in two different ways:

* Pageable memory
* Pinned memory

The following figure explains how data is transferred in pageable and pinned
memory.

.. figure:: ../../../data/how-to/hip_runtime_api/memory_management/pageable_pinned.svg

Pageable and pinned memory allow developers to exercise direct control over
memory operations, which is known as explicit memory management. When using
unified memory, developers get a simplified memory model with less control over
low-level memory operations.

The differences in memory transfers between explicit and unified memory
management are highlighted in the following figure:

.. figure:: ../../../data/how-to/hip_runtime_api/memory_management/unified_memory/um.svg

For more details on unified memory management, see :doc:`/how-to/hip_runtime_api
Pageable memory
================================================================================

Pageable memory exists on memory blocks known as "pages" that can be migrated to
other memory storage. For example, migrating memory between CPU sockets on a
motherboard or in a system whose RAM runs out of space and starts dumping pages
into the swap partition of the hard drive.

Pageable memory is usually allocated with a call to ``malloc`` or ``new`` in a
C++ application.

**Example:** Using pageable host memory in HIP:
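
A minimal sketch of the pattern (names are illustrative; error checking
omitted): the host buffer comes from plain ``malloc``, so it is pageable, and
``hipMemcpy`` stages the transfer through an internal pinned buffer.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   #include <cstdlib>

   int main() {
       const size_t count = 1 << 20;

       // Pageable host memory: ordinary C/C++ allocation, no HIP call involved.
       double *h_data =
           static_cast<double *>(std::malloc(count * sizeof(double)));
       for (size_t i = 0; i < count; ++i)
           h_data[i] = static_cast<double>(i);

       double *d_data = nullptr;
       hipMalloc(reinterpret_cast<void **>(&d_data), count * sizeof(double));

       // Transfers to and from pageable memory are synchronous with respect
       // to the host.
       hipMemcpy(d_data, h_data, count * sizeof(double),
                 hipMemcpyHostToDevice);
       hipMemcpy(h_data, d_data, count * sizeof(double),
                 hipMemcpyDeviceToHost);

       hipFree(d_data);
       std::free(h_data);
       return 0;
   }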

The memory allocation for pinned memory can be controlled using
``hipHostMalloc`` flags:
* ``hipHostMallocCoherent``: Fine-grained memory is allocated. Overrides ``HIP_HOST_COHERENT`` environment variable for specific allocation. For details, see :ref:`coherence_control`.
* ``hipHostMallocNonCoherent``: Coarse-grained memory is allocated. Overrides ``HIP_HOST_COHERENT`` environment variable for specific allocation. For details, see :ref:`coherence_control`.

All allocation flags are independent and can be set in any combination. The only
exception is setting ``hipHostMallocCoherent`` and ``hipHostMallocNonCoherent``
together, which leads to an illegal state. An example of a valid flag
combination is calling :cpp:func:`hipHostMalloc` with both
``hipHostMallocPortable`` and ``hipHostMallocMapped`` flags set. Both the flags
use the same model and differentiate only between how the surrounding code uses
the host memory.
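
For instance, a sketch of a valid combination (illustrative; error checking
omitted): the allocation is portable across all devices and mapped into the
device address space.

.. code-block:: cpp

   double *ptr = nullptr;
   // Pinned allocation visible to all devices and mapped for device access.
   hipHostMalloc(reinterpret_cast<void **>(&ptr), 1024 * sizeof(double),
                 hipHostMallocPortable | hipHostMallocMapped);
   // ... use ptr from host code or pass it to kernels ...
   hipHostFree(ptr);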

.. note::

By default, each GPU selects a NUMA CPU node with the least NUMA distance
between them. This implies that the host memory is automatically allocated on
the closest memory pool of the current GPU device's NUMA node. Using the
:cpp:func:`hipSetDevice` API to set a different GPU increases the NUMA
distance but still allows you to access the host allocation.

The NUMA policy is implemented on Linux and is under development on Microsoft
Windows.
81 changes: 36 additions & 45 deletions docs/understand/compilers.rst
HIP compilers
********************************************************************************

ROCm provides the compiler driver ``hipcc``, which can be used on AMD ROCm and
NVIDIA CUDA platforms.

On ROCm, ``hipcc`` takes care of the following:

- Setting the default library and include paths for HIP
HIP compilation workflow
================================================================================
Offline compilation
--------------------------------------------------------------------------------

HIP code compilation is performed in two stages: a host-code and a device-code
compilation stage.

- Device-code compilation stage: The compiled device code is embedded into the
host object file. Depending on the platform, the device code can be compiled
into assembly or binary. ``nvcc`` and ``amdclang++`` target different
architectures and use different code object formats. ``nvcc`` uses the binary
``cubin`` or the assembly PTX files, while the ``amdclang++`` path is the
binary ``hsaco`` format. On CUDA platforms, the driver compiles the PTX files
to executable code during runtime.

- Host-code compilation stage: On the host side, ``hipcc`` or ``amdclang++`` can
compile the host code in one step without other C++ compilers. On the other
hand, ``nvcc`` only replaces the ``<<<...>>>`` kernel launch syntax with the
appropriate CUDA runtime function call and the modified host code is passed to
the default host compiler.
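
As an illustrative sketch (the source file name is hypothetical), a single
``hipcc`` invocation performs both stages:

.. code-block:: shell

   # Device code is compiled and embedded into the host object; host code is
   # compiled in the same step.
   hipcc saxpy.cpp -o saxpy

   # Optionally restrict the AMD device targets:
   hipcc --offload-arch=gfx90a saxpy.cpp -o saxpy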

For an example on how to compile HIP from the command line, see the
:ref:`SAXPY tutorial<compiling_on_the_command_line>`.

Runtime compilation
--------------------------------------------------------------------------------
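
As a sketch, runtime compilation with the hipRTC API follows a
create-compile-fetch pattern (the kernel source and names are illustrative;
error checking omitted):

.. code-block:: cpp

   #include <hip/hiprtc.h>

   #include <vector>

   static const char *kernel_src = R"(
   extern "C" __global__ void fill_ones(float *out) {
       out[threadIdx.x] = 1.0f;
   })";

   int main() {
       hiprtcProgram prog;
       hiprtcCreateProgram(&prog, kernel_src, "fill_ones.cu",
                           0, nullptr, nullptr);  // no extra headers
       hiprtcCompileProgram(prog, 0, nullptr);    // no extra options

       // Fetch the compiled code object; it can later be loaded with
       // hipModuleLoadData and launched with hipModuleLaunchKernel.
       size_t code_size = 0;
       hiprtcGetCodeSize(prog, &code_size);
       std::vector<char> code(code_size);
       hiprtcGetCode(prog, code.data());

       hiprtcDestroyProgram(&prog);
       return 0;
   }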
Static libraries
================================================================================

``hipcc`` supports generating two types of static libraries.

- The first type of static library only exports and launches host functions
  within the same library and not the device functions. This library type
  offers the ability to link with a non-hipcc compiler such as ``gcc``.
  Additionally, this library type contains host objects with device code
  embedded as fat binaries. This library type is generated using the flag
  ``--emit-static-lib``:

  .. code-block:: shell

     hipcc hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a
     gcc test.cpp -L. -lhipOptLibrary -L/path/to/hip/lib -lamdhip64 -o test.out

- The second type of static library exports device functions to be linked by
  other code objects by using ``hipcc`` as the linker. This library type
  contains relocatable device objects and is generated using ``ar``:

  .. code-block:: shell

     hipcc hipDevice.cpp -c -fgpu-rdc -o hipDevice.o
     ar rcsD libHipDevice.a hipDevice.o
     hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out
For more information, see `HIP samples host functions <https://github.com/ROCm/hip-tests/tree/develop/samples/2_Cookbook/15_static_library/host_functions>`_
and `device functions <https://github.com/ROCm/hip-tests/tree/develop/samples/2_Cookbook/15_static_library/device_functions>`_.
