diff --git a/docs/how-to/hip_porting_driver_api.rst b/docs/how-to/hip_porting_driver_api.rst index ca8178b038..ae5bb7226f 100644 --- a/docs/how-to/hip_porting_driver_api.rst +++ b/docs/how-to/hip_porting_driver_api.rst @@ -18,14 +18,28 @@ NVIDIA provides separate CUDA driver and runtime APIs. The two APIs have signifi The driver API offers two additional functionalities not provided by the runtime API: ``cuModule`` and ``cuCtx`` APIs. cuModule API -============ - -The Module section of the driver API provides additional control over how and when accelerator code objects are loaded. For example, the driver API enables code objects to load from files or memory pointers. Symbols for kernels or global data are extracted from the loaded code objects. In contrast, the runtime API loads automatically and, if necessary, compiles all the kernels from an executable binary when it runs. In this mode, kernel code must be compiled using NVCC so that automatic loading can function correctly. - -The Module features are useful in an environment that generates the code objects directly, such as a new accelerator language front end. NVCC is not used here. Instead, the environment might have a different kernel language or compilation flow. Other environments have many kernels and don't want all of them to be loaded automatically. The Module functions load the generated code objects and launch kernels. Similar to the cuModule API, HIP defines a hipModule API that provides similar explicit control over code object management. +================================================================================ + +The Module section of the driver API provides additional control over how and +when accelerator code objects are loaded. For example, the driver API enables +code objects to load from files or memory pointers. Symbols for kernels or +global data are extracted from the loaded code objects. In contrast, the runtime +API loads automatically and, if necessary, compiles all the kernels from an +executable binary when it runs. In this mode, kernel code must be compiled using +NVCC so that automatic loading can function correctly. + +The Module features are useful in an environment that generates the code objects +directly, such as a new accelerator language front end. NVCC is not used here. +Instead, the environment might have a different kernel language or compilation +flow. Other environments have many kernels and don't want all of them to be +loaded automatically. The Module functions load the generated code objects and +launch kernels. Similar to the cuModule API, HIP defines a hipModule API that +provides similar explicit control over code object management. + +.. _context_driver_api: cuCtx API -========= +================================================================================ The driver API defines "Context" and "Devices" as separate entities. Contexts contain a single device, and a device can theoretically have multiple contexts. @@ -41,17 +55,25 @@ In HIP, the ``Ctx`` functions largely provide an alternate syntax for changing t Most new applications preferentially use ``hipSetDevice`` or the stream APIs. Therefore, HIP has marked the ``hipCtx`` APIs as **deprecated**. Support for these APIs might not be available in future releases. For more details on deprecated APIs, see :doc:`../reference/deprecated_api_list`. 
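+
+The following minimal sketch shows the device and stream runtime calls that
+replace the deprecated ``hipCtx`` pattern in most applications; the device
+index ``0`` and the omission of error checking are illustrative assumptions
+only.
+
+.. code-block:: cpp
+
+    #include <hip/hip_runtime.h>
+
+    int main() {
+        // Select the device explicitly instead of pushing a context.
+        hipSetDevice(0);
+
+        // A stream provides the per-queue control formerly obtained through
+        // the hipCtx APIs.
+        hipStream_t stream;
+        hipStreamCreate(&stream);
+
+        // ... enqueue kernels or asynchronous copies on the stream ...
+
+        hipStreamSynchronize(stream);
+        hipStreamDestroy(stream);
+        return 0;
+    }
+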
HIP module and Ctx APIs -======================= +================================================================================ -Rather than present two separate APIs, HIP extends the HIP API with new APIs for modules and ``Ctx`` control. +Rather than present two separate APIs, HIP extends the HIP API with new APIs for +modules and ``Ctx`` control. hipModule API -------------- - -Like the CUDA driver API, the Module API provides additional control over how code is loaded, including options to load code from files or from in-memory pointers. -NVCC and HIP-Clang target different architectures and use different code object formats. NVCC supports ``cubin`` or ``ptx`` files, while the HIP-Clang path uses the ``hsaco`` format. -The external compilers which generate these code objects are responsible for generating and loading the correct code object for each platform. -Notably, there is no fat binary format that can contain code for both NVCC and HIP-Clang platforms. The following table summarizes the formats used on each platform: +-------------------------------------------------------------------------------- + +Like the CUDA driver API, the Module API provides additional control over how +code is loaded, including options to load code from files or from in-memory +pointers. +NVCC and HIP-Clang target different architectures and use different code object +formats. NVCC supports ``cubin`` or ``ptx`` files, while the HIP-Clang path uses +the ``hsaco`` format. +The external compilers which generate these code objects are responsible for +generating and loading the correct code object for each platform. +Notably, there is no fat binary format that can contain code for both NVCC and +HIP-Clang platforms. The following table summarizes the formats used on each +platform: .. list-table:: Module formats :header-rows: 1 @@ -76,7 +98,7 @@ HIP-Clang enables both of these capabilities to be used together. Of course, it For module API reference, visit :ref:`module_management_reference`. hipCtx API ----------- +-------------------------------------------------------------------------------- HIP provides a ``Ctx`` API as a thin layer over the existing device functions. The ``Ctx`` API can be used to set the current context or to query properties of the device associated with the context. The current context is implicitly used by other APIs, such as ``hipStreamCreate``. @@ -84,7 +106,7 @@ The current context is implicitly used by other APIs, such as ``hipStreamCreate` For context reference, visit :ref:`context_management_reference`. HIPIFY translation of CUDA driver API -===================================== +================================================================================ The HIPIFY tools convert CUDA driver APIs for streams, events, modules, devices, memory management, context, and the profiler to the equivalent HIP calls. For example, ``cuEventCreate`` is translated to ``hipEventCreate``. HIPIFY tools also convert error codes from the driver namespace and coding conventions to the equivalent HIP error code. HIP unifies the APIs for these common functions. @@ -98,13 +120,13 @@ HIP defines a single error space and uses camel case for all errors (i.e. ``hipE For further information, visit the :doc:`hipify:index`. Address spaces --------------- +-------------------------------------------------------------------------------- HIP-Clang defines a process-wide address space where the CPU and all devices allocate addresses from a single unified pool. 
This means addresses can be shared between contexts. Unlike the original CUDA implementation, a new context does not create a new address space for the device. Using hipModuleLaunchKernel ---------------------------- +-------------------------------------------------------------------------------- Both CUDA driver and runtime APIs define a function for launching kernels, called ``cuLaunchKernel`` or ``cudaLaunchKernel``. The equivalent API in HIP is ``hipModuleLaunchKernel``. The kernel arguments and the execution configuration (grid dimensions, group dimensions, dynamic shared memory, and stream) are passed as arguments to the launch function. @@ -112,26 +134,45 @@ The runtime API additionally provides the ``<<< >>>`` syntax for launching kerne However, this syntax is not standard C++ and is available only when NVCC is used to compile the host code. Additional information ----------------------- +-------------------------------------------------------------------------------- -HIP-Clang creates a primary context when the HIP API is called. So, in pure driver API code, HIP-Clang creates a primary context while HIP/NVCC has an empty context stack. HIP-Clang pushes the primary context to the context stack when it is empty. This can lead to subtle differences in applications which mix the runtime and driver APIs. +HIP-Clang creates a primary context when the HIP API is called. So, in pure +driver API code, HIP-Clang creates a primary context while HIP/NVCC has an empty +context stack. HIP-Clang pushes the primary context to the context stack when it +is empty. This can lead to subtle differences in applications which mix the +runtime and driver APIs. HIP-Clang implementation notes -============================== +================================================================================ .hip_fatbin ------------ +-------------------------------------------------------------------------------- -HIP-Clang links device code from different translation units together. For each device target, it generates a code object. ``clang-offload-bundler`` bundles code objects for different device targets into one fat binary, which is embedded as the global symbol ``__hip_fatbin`` in the ``.hip_fatbin`` section of the ELF file of the executable or shared object. +HIP-Clang links device code from different translation units together. For each +device target, it generates a code object. ``clang-offload-bundler`` bundles +code objects for different device targets into one fat binary, which is embedded +as the global symbol ``__hip_fatbin`` in the ``.hip_fatbin`` section of the ELF +file of the executable or shared object. Initialization and termination functions ------------------------------------------ - -HIP-Clang generates initialization and termination functions for each translation unit for host code compilation. The initialization functions call ``__hipRegisterFatBinary`` to register the fat binary embedded in the ELF file. They also call ``__hipRegisterFunction`` and ``__hipRegisterVar`` to register kernel functions and device-side global variables. The termination functions call ``__hipUnregisterFatBinary``. -HIP-Clang emits a global variable ``__hip_gpubin_handle`` of type ``void**`` with ``linkonce`` linkage and an initial value of 0 for each host translation unit. Each initialization function checks ``__hip_gpubin_handle`` and registers the fat binary only if ``__hip_gpubin_handle`` is 0. It saves the return value of ``__hip_gpubin_handle`` to ``__hip_gpubin_handle``. 
This ensures that the fat binary is registered once. A similar check is performed in the termination functions. +-------------------------------------------------------------------------------- + +HIP-Clang generates initialization and termination functions for each +translation unit for host code compilation. The initialization functions call +``__hipRegisterFatBinary`` to register the fat binary embedded in the ELF file. +They also call ``__hipRegisterFunction`` and ``__hipRegisterVar`` to register +kernel functions and device-side global variables. The termination functions +call ``__hipUnregisterFatBinary``. +HIP-Clang emits a global variable ``__hip_gpubin_handle`` of type ``void**`` +with ``linkonce`` linkage and an initial value of 0 for each host translation +unit. Each initialization function checks ``__hip_gpubin_handle`` and registers +the fat binary only if ``__hip_gpubin_handle`` is 0. It saves the return value +of ``__hipRegisterFatBinary`` to ``__hip_gpubin_handle``. This ensures that the +fat binary is registered once. A similar check is performed in the termination +functions. Kernel launching ----------------- +-------------------------------------------------------------------------------- HIP-Clang supports kernel launching using either the CUDA ``<<<>>>`` syntax, ``hipLaunchKernel``, or ``hipLaunchKernelGGL``. The last option is a macro which expands to the CUDA ``<<<>>>`` syntax by default. It can also be turned into a template by defining ``HIP_TEMPLATE_KERNEL_LAUNCH``. @@ -141,10 +182,10 @@ HIP-Clang implements two sets of APIs for launching kernels. By default, when HIP-Clang encounters the ``<<<>>>`` statement in the host code, it first calls ``hipConfigureCall`` to set up the threads and grids. It then calls the stub function with the given arguments. The stub function calls ``hipSetupArgument`` for each kernel argument, then calls ``hipLaunchByPtr`` with a function pointer to the stub function. In ``hipLaunchByPtr``, the actual kernel associated with the stub function is launched. NVCC implementation notes -========================= +================================================================================ Interoperation between HIP and CUDA driver ------------------------------------------- +-------------------------------------------------------------------------------- CUDA applications might want to mix CUDA driver code with HIP code (see the example below). This table shows the equivalence between CUDA and HIP types required to implement this interaction. @@ -177,7 +218,7 @@ CUDA applications might want to mix CUDA driver code with HIP code (see the exam - ``cudaArray`` Compilation options -------------------- +-------------------------------------------------------------------------------- The ``hipModule_t`` interface does not support the ``cuModuleLoadDataEx`` function, which is used to control PTX compilation options. HIP-Clang does not use PTX, so it does not support these compilation options. @@ -302,9 +343,11 @@ The sample below shows how to use ``hipModuleGetFunction``. } HIP module and texture Driver API -================================= +================================================================================ -HIP supports texture driver APIs. However, texture references must be declared within the host scope. The following code demonstrates the use of texture references for the ``__HIP_PLATFORM_AMD__`` platform. +HIP supports texture driver APIs. However, texture references must be declared +within the host scope. 
The following code demonstrates the use of texture +references for the ``__HIP_PLATFORM_AMD__`` platform. .. code-block:: cpp @@ -343,9 +386,12 @@ HIP supports texture driver APIs. However, texture references must be declared w } Driver entry point access -========================= +================================================================================ -Starting from HIP version 6.2.0, support for Driver Entry Point Access is available when using CUDA 12.0 or newer. This feature allows developers to directly interact with the CUDA driver API, providing more control over GPU operations. +Starting from HIP version 6.2.0, support for Driver Entry Point Access is +available when using CUDA 12.0 or newer. This feature allows developers to +directly interact with the CUDA driver API, providing more control over GPU +operations. Driver Entry Point Access provides several features: @@ -356,9 +402,10 @@ Driver Entry Point Access provides several features: For driver entry point access reference, visit :cpp:func:`hipGetProcAddress`. Address retrieval ------------------ +-------------------------------------------------------------------------------- -The ``hipGetProcAddress`` function can be used to obtain the address of a runtime function. This is demonstrated in the following example: +The :cpp:func:`hipGetProcAddress` function can be used to obtain the address of +a runtime function. This is demonstrated in the following example: .. code-block:: cpp @@ -401,9 +448,13 @@ The ``hipGetProcAddress`` function can be used to obtain the address of a runtim } Per-thread default stream version request ------------------------------------------ +================================================================================ -HIP offers functionality similar to CUDA for managing streams on a per-thread basis. By using ``hipStreamPerThread``, each thread can independently manage its default stream, simplifying operations. The following example demonstrates how this feature enhances performance by reducing contention and improving efficiency. +HIP offers functionality similar to CUDA for managing streams on a per-thread +basis. By using ``hipStreamPerThread``, each thread can independently manage its +default stream, simplifying operations. The following example demonstrates how +this feature enhances performance by reducing contention and improving +efficiency. .. code-block:: cpp @@ -456,9 +507,16 @@ HIP offers functionality similar to CUDA for managing streams on a per-thread ba } Accessing new HIP features with a newer driver ----------------------------------------------- - -HIP is designed to be forward compatible, allowing newer features to be utilized with older toolkits, provided a compatible driver is present. Feature support can be verified through runtime API functions and version checks. This approach ensures that applications can benefit from new features and improvements in the HIP runtime without needing to be recompiled with a newer toolkit. The function ``hipGetProcAddress`` enables dynamic querying and the use of newer functions offered by the HIP runtime, even if the application was built with an older toolkit. +================================================================================ + +HIP is designed to be forward compatible, allowing newer features to be utilized +with older toolkits, provided a compatible driver is present. Feature support +can be verified through runtime API functions and version checks. 
This approach +ensures that applications can benefit from new features and improvements in the +HIP runtime without needing to be recompiled with a newer toolkit. The function +:cpp:func:`hipGetProcAddress` enables dynamic querying and the use of newer +functions offered by the HIP runtime, even if the application was built with an +older toolkit. An example is provided for a hypothetical ``foo()`` function. diff --git a/docs/how-to/hipgraph.rst b/docs/how-to/hipgraph.rst index 83d6fa0f61..958784a71f 100644 --- a/docs/how-to/hipgraph.rst +++ b/docs/how-to/hipgraph.rst @@ -31,7 +31,7 @@ The nodes can be one of the following: - signalling or waiting on external semaphores .. note:: - The available node types are specified by :cpp:enumerator:`hipGraphNodeType`. + The available node types are specified by :cpp:enum:`hipGraphNodeType`. The following figure visualizes the concept of graphs, compared to using streams. @@ -64,10 +64,8 @@ that are only possible when knowing the dependencies between the operations. when launched using HIP stream versus HIP graph. This does not include the time needed to set up the graph. - -******************************************************************************** Using HIP graphs -******************************************************************************** +================================================================================ There are two different ways of creating graphs: Capturing kernel launches from a stream, or explicitly creating graphs. The difference between the two @@ -102,7 +100,7 @@ that scheduling the operations within the graph encompasses less overhead and can enable some optimizations, but they still need to be associated with a stream for execution. Memory management ------------------ +-------------------------------------------------------------------------------- Memory that is used by operations in graphs can either be pre-allocated or managed within the graph. Graphs can contain nodes that take care of allocating @@ -189,6 +187,7 @@ The following code is an example of how to use the HIP graph API to capture a graph from a stream. .. code-block:: cpp + #include #include #include diff --git a/docs/how-to/performance_guidelines.rst b/docs/how-to/performance_guidelines.rst index e119931865..9ebd210106 100644 --- a/docs/how-to/performance_guidelines.rst +++ b/docs/how-to/performance_guidelines.rst @@ -9,7 +9,8 @@ Performance guidelines The AMD HIP performance guidelines are a set of best practices designed to help you optimize the application performance on AMDGPUs. The guidelines discuss -established parallelization and optimization techniques to improve the application performance on HIP-capable GPU architectures. +established parallelization and optimization techniques to improve the +application performance on HIP-capable GPU architectures. Here are the four main cornerstones to help you exploit HIP's performance optimization potential: @@ -22,58 +23,75 @@ optimization potential: This document discusses the usage and benefits of these cornerstones in detail. .. _parallel execution: + Parallel execution -==================== +================================================================================ -For optimal use and to keep all system components busy, the application must reveal and efficiently provide as much parallelism as possible. -The parallelism can be performed at the application level, device level, and multiprocessor level. 
+For optimal use and to keep all system components busy, the application must +reveal and efficiently provide as much parallelism as possible. The parallelism +can be performed at the application level, device level, and multiprocessor +level. Application level -------------------- +-------------------------------------------------------------------------------- -To enable parallel execution of the application across the host and devices, use asynchronous calls and streams. Assign workloads based on efficiency: serial to the host or parallel to the devices. +To enable parallel execution of the application across the host and devices, use +asynchronous calls and streams. Assign workloads based on efficiency: serial to +the host or parallel to the devices. -For parallel workloads, when threads belonging to the same block need to synchronize to share data, -use ``__syncthreads()`` (see: +For parallel workloads, when threads belonging to the same block need to +synchronize to share data, use :cpp:func:`__syncthreads()` (see: :ref:`synchronization functions`) within the same kernel invocation. For threads belonging to different blocks, use global memory with two separate -kernel invocations. It is recommended to avoid the latter approach as it adds overhead. +kernel invocations. It is recommended to avoid the latter approach as it adds +overhead. Device level --------------- +-------------------------------------------------------------------------------- Device level optimization primarily involves maximizing parallel execution -across the multiprocessors on the device. You can achieve device level optimization by executing -multiple kernels concurrently on a device. To enhance performance, the management of these kernels is -facilitated by streams, which allows overlapping of computation and data transfers. This approach aims at keeping all multiprocessors busy by executing enough kernels concurrently. However, launching too many kernels can lead to resource contention, hence a balance must be found for optimal performance. The device level optimization helps in achieving maximum utilization of the device resources. +across the multiprocessors on the device. You can achieve device level +optimization by executing multiple kernels concurrently on a device. To enhance +performance, the management of these kernels is facilitated by streams, which +allows overlapping of computation and data transfers. This approach aims at +keeping all multiprocessors busy by executing enough kernels concurrently. +However, launching too many kernels can lead to resource contention, hence a +balance must be found for optimal performance. The device level optimization +helps in achieving maximum utilization of the device resources. Multiprocessor level ----------------------- +-------------------------------------------------------------------------------- Multiprocessor level optimization involves maximizing parallel execution within each multiprocessor on a device. The key to multiprocessor level optimization is to efficiently utilize the various functional units within a multiprocessor. For example, ensuring a sufficient number of resident warps, so that every clock -cycle has an instruction from a warp is ready for execution. This instruction could -either be another independent instruction of the same warp, which exploits -:ref:`instruction level optimization `, or more +cycle has an instruction from a warp is ready for execution. 
This instruction +could either be another independent instruction of the same warp, which exploits +:ref:`instruction level optimization `, or more commonly an instruction of another warp, which exploits thread-level parallelism. On the other hand, device level optimization focuses on the device as a whole, aiming at keeping all multiprocessors busy by executing enough kernels -concurrently. Both multiprocessor and device levels of optimization are crucial for achieving maximum -performance. They work together to ensure efficient utilization of the -GPU resources, ranging from individual multiprocessors to the device as a -whole. +concurrently. Both multiprocessor and device levels of optimization are crucial +for achieving maximum performance. They work together to ensure efficient +utilization of the GPU resources, ranging from individual multiprocessors to the +device as a whole. .. _memory optimization: + Memory throughput optimization -=============================== +================================================================================ The first step in maximizing memory throughput is to minimize low-bandwidth data transfers between the host and the device. -Additionally, maximize the use of on-chip memory, that is, shared memory and caches, and minimize transfers with global memory. Shared memory acts as a user-managed cache explicitly allocated and accessed by the application. A common programming pattern is to stage data from device memory into shared memory. The staging of data from the device to shared memory involves the following steps: +Additionally, maximize the use of on-chip memory, that is, shared memory and +caches, and minimize transfers with global memory. Shared memory acts as a +user-managed cache explicitly allocated and accessed by the application. A +common programming pattern is to stage data from device memory into shared +memory. The staging of data from the device to shared memory involves the +following steps: 1. Each thread of a block loading data from device memory to shared memory. 2. Synchronizing with all other threads of the block. @@ -84,20 +102,26 @@ Additionally, maximize the use of on-chip memory, that is, shared memory and cac For some applications, a traditional hardware-managed cache is more appropriate for exploiting data locality. -In conclusion, the throughput of memory accesses by a kernel can vary significantly -depending on the access pattern. Therefore, the next step in maximizing memory -throughput is to organize memory accesses as optimally as possible. This is -especially important for global memory accesses, as global memory bandwidth is -low compared to available on-chip bandwidths and arithmetic instruction -throughput. Thus, non-optimal global memory accesses generally have a high -impact on performance. -The memory throughput optimization techniques are further discussed in detail in the following sections. +In conclusion, the throughput of memory accesses by a kernel can vary +significantly depending on the access pattern. Therefore, the next step in +maximizing memory throughput is to organize memory accesses as optimally as +possible. This is especially important for global memory accesses, as global +memory bandwidth is low compared to available on-chip bandwidths and arithmetic +instruction throughput. Thus, non-optimal global memory accesses generally have +a high impact on performance. +The memory throughput optimization techniques are further discussed in detail in +the following sections. .. 
_data transfer: + Data transfer ---------------- +-------------------------------------------------------------------------------- -To minimize data transfers between the host and the device, applications should move more computations from the host to the device, even at the cost of running kernels that don't fully utilize parallelism for the device. Intermediate data structures should be created, used, and discarded in device memory without being mapped or copied to host memory. +To minimize data transfers between the host and the device, applications should +move more computations from the host to the device, even at the cost of running +kernels that don't fully utilize parallelism for the device. Intermediate data +structures should be created, used, and discarded in device memory without being +mapped or copied to host memory. Batching small transfers into a single large transfer can improve performance due to the overhead associated with each transfer. On systems with a front-side @@ -107,11 +131,17 @@ When using mapped page-locked memory, there is no need to allocate device memory or explicitly copy data between device and host memory. Data transfers occur implicitly each time the kernel accesses the mapped memory. For optimal performance, these memory accesses should be coalesced, similar to global -memory accesses. The process where threads in a warp access sequential memory locations is known as coalesced memory access, which can enhance memory data transfer efficiency. +memory accesses. The process where threads in a warp access sequential memory +locations is known as coalesced memory access, which can enhance memory data +transfer efficiency. -On integrated systems where device and host memory are physically the same, no copy operation between host and device memory is required and hence mapped page-locked memory should be used instead. To check if the device is integrated, applications can query the integrated device property. +On integrated systems where device and host memory are physically the same, no +copy operation between host and device memory is required and hence mapped +page-locked memory should be used instead. To check if the device is integrated, +applications can query the integrated device property. .. _device memory access: + Device memory access --------------------- @@ -129,7 +159,10 @@ Maximizing memory throughput involves: - Using properly sized and aligned data types. - Padding data when necessary. -Global memory instructions support reading or writing data of specific sizes (1, 2, 4, 8, or 16 bytes) that are naturally aligned. Not meeting the size and alignment requirements leads to multiple instructions, which reduces performance. Therefore, for correct results and optimal performance: +Global memory instructions support reading or writing data of specific sizes (1, +2, 4, 8, or 16 bytes) that are naturally aligned. Not meeting the size and +alignment requirements leads to multiple instructions, which reduces +performance. Therefore, for correct results and optimal performance: - Use data types that meet these requirements - Ensure alignment for structures @@ -139,27 +172,46 @@ Threads often access 2D arrays at an address calculated as ``BaseAddress + xIndex + width * yIndex``. For efficient memory access, the array and thread block widths should be multiples of the warp size. 
If the array width is not a multiple of the warp size, it is usually more efficient to -allocate the array with a width rounded up to the nearest multiple and pad the rows -accordingly. - -Local memory is used for certain automatic variables, such as arrays with non-constant indices, large structures of arrays, and any variable where the kernel uses more registers than available. Local memory resides in device memory, which leads to high latency and low bandwidth, similar to global memory accesses. However, the local memory is organized for consecutive 32-bit words to be accessed by consecutive thread IDs, which allows full coalescing when all threads in a warp access the same relative address. - -Shared memory is located on-chip and provides higher bandwidth and lower latency than local or global memory. It is divided into banks that can be simultaneously accessed, which boosts bandwidth. However, bank conflicts, where two addresses fall in the same bank, lead to serialized access and decreased throughput. Therefore, understanding how memory addresses map to banks and scheduling requests to minimize conflicts is crucial for optimal performance. - -Constant memory is in the device memory and cached in the constant cache. Requests are split based on different memory addresses and are -serviced based either on the throughput of the constant cache for cache hits or on the throughput of the device memory otherwise. This splitting of requests affects throughput. - -Texture and surface memory are stored in the device memory and cached in the texture cache. This setup optimizes 2D spatial locality, which leads to better performance for threads reading close 2D addresses. -Reading device memory through texture or surface fetching provides the following advantages: +allocate the array with a width rounded up to the nearest multiple and pad the +rows accordingly. + +Local memory is used for certain automatic variables, such as arrays with +non-constant indices, large structures of arrays, and any variable where the +kernel uses more registers than available. Local memory resides in device +memory, which leads to high latency and low bandwidth, similar to global memory +accesses. However, the local memory is organized for consecutive 32-bit words to +be accessed by consecutive thread IDs, which allows full coalescing when all +threads in a warp access the same relative address. + +Shared memory is located on-chip and provides higher bandwidth and lower latency +than local or global memory. It is divided into banks that can be simultaneously +accessed, which boosts bandwidth. However, bank conflicts, where two addresses +fall in the same bank, lead to serialized access and decreased throughput. +Therefore, understanding how memory addresses map to banks and scheduling +requests to minimize conflicts is crucial for optimal performance. + +Constant memory is in the device memory and cached in the constant cache. +Requests are split based on different memory addresses and are serviced based +either on the throughput of the constant cache for cache hits or on the +throughput of the device memory otherwise. This splitting of requests affects +throughput. + +Texture and surface memory are stored in the device memory and cached in the +texture cache. This setup optimizes 2D spatial locality, which leads to better +performance for threads reading close 2D addresses. 
+Reading device memory through texture or surface fetching provides the following +advantages: - Higher bandwidth for local texture fetches or surface reads. - Offloading addressing calculation. - Data broadcasting. -- Optional conversion of 8-bit and 16-bit integer input data to 32-bit floating-point values on the fly. +- Optional conversion of 8-bit and 16-bit integer input data to 32-bit + floating-point values on the fly. .. _instruction optimization: + Optimization for maximum instruction throughput -================================================= +================================================================================ To maximize instruction throughput: @@ -170,47 +222,94 @@ To maximize instruction throughput: These techniques are discussed in detail in the following sections. Arithmetic instructions -------------------------- +-------------------------------------------------------------------------------- The type and complexity of arithmetic operations can significantly impact the performance of your application. We are highlighting some hints how to maximize it. -Use efficient operations: Some arithmetic operations are costlier than others. For example, multiplication is typically faster than division, and integer operations are usually faster than floating-point operations, especially with double precision. +Use efficient operations: Some arithmetic operations are costlier than others. +For example, multiplication is typically faster than division, and integer +operations are usually faster than floating-point operations, especially with +double precision. -Minimize low-throughput instructions: This might involve trading precision for speed when it does not affect the final result. For instance, consider using single-precision arithmetic instead of double-precision. +Minimize low-throughput instructions: This might involve trading precision for +speed when it does not affect the final result. For instance, consider using +single-precision arithmetic instead of double-precision. -Leverage intrinsic functions: Intrinsic functions are predefined functions available in HIP that can often be executed faster than equivalent arithmetic operations (subject to some input or accuracy restrictions). They can help optimize performance by replacing more complex arithmetic operations. +Leverage intrinsic functions: Intrinsic functions are predefined functions +available in HIP that can often be executed faster than equivalent arithmetic +operations (subject to some input or accuracy restrictions). They can help +optimize performance by replacing more complex arithmetic operations. -Optimize memory access: The memory access efficiency can impact the speed of arithmetic operations. See: :ref:`device memory access`. +Optimize memory access: The memory access efficiency can impact the speed of +arithmetic operations. See: :ref:`device memory access`. .. _control flow instructions: -Control flow instructions ---------------------------- -Control flow instructions (``if``, ``else``, ``for``, ``do``, ``while``, ``break``, ``continue``, ``switch``) can impact instruction throughput by causing threads within a warp to diverge and follow different execution paths. To optimize performance, write control conditions to minimize divergent warps. For example, when the control condition depends on ``threadIdx`` or ``warpSize``, warp doesn't diverge. The compiler might optimize loops, short ifs, or switch blocks using branch predication, which prevents warp divergence. 
With branch predication, instructions associated with a false predicate are scheduled but not executed, which avoids unnecessary operations. +Control flow instructions +-------------------------------------------------------------------------------- + +Control flow instructions (``if``, ``else``, ``for``, ``do``, ``while``, +``break``, ``continue``, ``switch``) can impact instruction throughput by +causing threads within a warp to diverge and follow different execution paths. +To optimize performance, write control conditions to minimize divergent warps. +For example, when the control condition depends on ``threadIdx`` or +``warpSize``, the warp doesn't diverge. The compiler might optimize loops, short +ifs, or switch blocks using branch predication, which prevents warp divergence. +With branch predication, instructions associated with a false predicate are +scheduled but not executed, which avoids unnecessary operations. Avoiding divergent warps -.......................................................... +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Warps diverge when threads within the same warp follow different execution paths. This is caused by conditional statements that lead to different arithmetic operations being performed by different threads. Divergent warps can significantly reduce instruction throughput, so it is advisable to structure your code to minimize divergence. +Warps diverge when threads within the same warp follow different execution paths. +This is caused by conditional statements that lead to different arithmetic +operations being performed by different threads. Divergent warps can +significantly reduce instruction throughput, so it is advisable to structure +your code to minimize divergence. Synchronization ----------------- +-------------------------------------------------------------------------------- -Synchronization ensures that all threads within a block complete their computations and memory accesses before moving forward, which is critical when threads depend on other thread results. However, synchronization can also cause performance overhead, as it needs the threads to wait, which might lead to idle GPU resources. +Synchronization ensures that all threads within a block complete their +computations and memory accesses before moving forward, which is critical when +threads depend on other thread results. However, synchronization can also cause +performance overhead, as it requires the threads to wait, which might lead to +idle GPU resources. -To synchronize all threads in a block, use ``__syncthreads()``. ``__syncthreads()`` ensures -that, all threads reach the same point in the code and can access shared memory after reaching that point. +To synchronize all threads in a block, use :cpp:func:`__syncthreads()`. +:cpp:func:`__syncthreads()` ensures that all threads reach the same point in +the code and can access shared memory after reaching that point. -An alternative way to synchronize is to use streams. Different streams can execute commands either without following a specific order or concurrently. This is why streams allow more fine-grained control over the execution order of commands, which can be beneficial in certain scenarios. +An alternative way to synchronize is to use streams. Different streams can +execute commands either without following a specific order or concurrently. This +is why streams allow more fine-grained control over the execution order of +commands, which can be beneficial in certain scenarios. 
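+
+The following minimal sketch illustrates block-level synchronization with
+:cpp:func:`__syncthreads()`. The hypothetical ``reverse_block`` kernel assumes
+a launch configuration of 256 threads per block and an input size that is a
+multiple of the block size.
+
+.. code-block:: cpp
+
+    __global__ void reverse_block(const float* in, float* out) {
+        // Stage one element per thread into shared memory.
+        __shared__ float tile[256];
+        const int idx = blockIdx.x * blockDim.x + threadIdx.x;
+        tile[threadIdx.x] = in[idx];
+
+        // Every thread must have written its element before any thread reads
+        // an element written by another thread.
+        __syncthreads();
+
+        // Each thread reads the element staged by the mirrored thread.
+        out[idx] = tile[blockDim.x - 1 - threadIdx.x];
+    }
+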
Minimizing memory thrashing -============================ - -Applications frequently allocating and freeing memory might experience slower allocation calls over time as memory is released back to the operating system. To optimize performance in such scenarios, follow these guidelines: - -- Avoid allocating all available memory with ``hipMalloc`` or ``hipHostMalloc``, as this immediately reserves memory and might prevent other applications from using it. This behavior could strain the operating system schedulers or prevent other applications from running on the same GPU. -- Try to allocate memory in suitably sized blocks early in the application's lifecycle and deallocate only when the application no longer needs it. Minimize the number of ``hipMalloc`` and ``hipFree`` calls in your application, particularly in performance-critical areas. -- Consider resorting to other memory types such as ``hipHostMalloc`` or ``hipMallocManaged``, if an application can't allocate sufficient device memory. While the other memory types might not offer similar performance, they allow the application to continue running. -- For supported platforms, use ``hipMallocManaged``, as it allows oversubscription. With the right policies, ``hipMallocManaged`` can maintain most, if not all, ``hipMalloc`` performance. ``hipMallocManaged`` doesn't require an allocation to be resident until it is needed or prefetched, which eases the load on the operating system's schedulers and facilitates multitenant scenarios. +================================================================================ + +Applications frequently allocating and freeing memory might experience slower +allocation calls over time as memory is released back to the operating system. +To optimize performance in such scenarios, follow these guidelines: + +- Avoid allocating all available memory with :cpp:func:`hipMalloc` or + :cpp:func:`hipHostMalloc`, as this immediately reserves memory and might + prevent other applications from using it. This behavior could strain the + operating system schedulers or prevent other applications from running on the + same GPU. +- Try to allocate memory in suitably sized blocks early in the application's + lifecycle and deallocate only when the application no longer needs it. + Minimize the number of :cpp:func:`hipMalloc` and :cpp:func:`hipFree` calls in + your application, particularly in performance-critical areas. +- Consider resorting to other memory types such as :cpp:func:`hipHostMalloc` or + :cpp:func:`hipMallocManaged`, if an application can't allocate sufficient + device memory. While the other memory types might not offer similar + performance, they allow the application to continue running. +- For supported platforms, use :cpp:func:`hipMallocManaged`, as it allows + oversubscription. With the right policies, :cpp:func:`hipMallocManaged` can + maintain most, if not all, :cpp:func:`hipMalloc` performance. + :cpp:func:`hipMallocManaged` doesn't require an allocation to be resident + until it is needed or prefetched, which eases the load on the operating + system's schedulers and facilitates multitenant scenarios. diff --git a/docs/how-to/stream_ordered_allocator.rst b/docs/how-to/stream_ordered_allocator.rst index 3279fa639e..0d130a540d 100644 --- a/docs/how-to/stream_ordered_allocator.rst +++ b/docs/how-to/stream_ordered_allocator.rst @@ -220,6 +220,7 @@ Trim pools The memory allocator allows you to allocate and free memory in stream order. 
To control memory usage, set the release threshold attribute using ``hipMemPoolAttrReleaseThreshold``. This threshold specifies the amount of reserved memory in bytes to hold onto. .. code-block:: cpp + uint64_t threshold = UINT64_MAX; hipMemPoolSetAttribute(memPool, hipMemPoolAttrReleaseThreshold, &threshold); diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 410448434d..a4aa41fff7 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -225,7 +225,7 @@ better than the defaults defined by the hardware. The implicit groups defined by kernel launch parameters are still available when working with cooperative groups. -For further information, see :doc:`Cooperative groups `. +For further information, see :doc:`Cooperative groups `. Memory model ============