
Host Performance


API Latency

When controlling multiple devices, there is at least the overhead of thread-to-thread communication, plus the small array allocations used for the bookkeeping of load balancing or dynamic load balancing.
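For example, per-run bookkeeping allocations can be amortized by reusing storage across runs. The sketch below is a generic illustration under assumed names (LoadBalanceState, planNaive and planReused are hypothetical, not this library's actual internals):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical bookkeeping for an N-device load balancer.
struct LoadBalanceState {
    std::vector<std::size_t> offsets; // first element of each device's slice
    std::vector<std::size_t> sizes;   // slice length per device
};

// Naive: two small heap allocations on every run.
LoadBalanceState planNaive(std::size_t elements, std::size_t devices) {
    LoadBalanceState s;
    s.offsets.resize(devices);
    s.sizes.resize(devices);
    for (std::size_t i = 0; i < devices; ++i) {
        s.offsets[i] = elements * i / devices;
        s.sizes[i] = elements * (i + 1) / devices - s.offsets[i];
    }
    return s;
}

// Reused: the caller keeps one state object alive across all runs, so the
// hot loop only refills existing storage instead of allocating.
void planReused(std::size_t elements, std::size_t devices, LoadBalanceState& s) {
    s.offsets.resize(devices); // no-op after the first run
    s.sizes.resize(devices);
    for (std::size_t i = 0; i < devices; ++i) {
        s.offsets[i] = elements * i / devices;
        s.sizes[i] = elements * (i + 1) / devices - s.offsets[i];
    }
}
```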

System:

  • Ryzen 7900 at 5.3 GHz (has an integrated GPU)
  • 64 GB of DDR5-4800 CL40 memory
  • GT 1030 graphics card
  • MSVC 2022, OpenCL binaries installed via vcpkg

OpenCL devices:

  • 2 cloned iGPUs
  • 2 cloned discrete GPUs on PCIe 3.0 x4 lanes
  • CPU with the direct-RAM-access feature enabled

Kernel:

  • adds 1 to each element (see the sketch below)
  • 65536 integer elements
  • the read/compute/write operation is repeated 1000 times
  • normal load-balanced run
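
A minimal sketch of such a kernel, embedded as a C++ string literal the way OpenCL sources typically are (the kernel name add1 and the identifier kAdd1Source are illustrative assumptions, not this benchmark's actual code):

```cpp
// OpenCL C source for a benchmark-style kernel: add 1 to each element.
// One work-item per element; the host enqueues 65536 work-items.
const char* kAdd1Source = R"CLC(
kernel void add1(global int* data) {
    const int i = get_global_id(0);
    data[i] += 1;
}
)CLC";
```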

Timings:

240 microseconds per run
2.1 GB/s processing speed (the kernel reads and writes 65536 × 4 bytes, i.e. 512 KiB per run, and 512 KiB every 240 µs is roughly 2.1 GB/s)

The bottleneck is caused by multiple issues:

  • OpenCL API (kernel launch, finish, etc) / driver overhead
  • Mutex locking contention (increases with number of devices)
  • Allocation of local arrays in methods (may be optimized by a good compiler)
  • Looking up kernel names and parameter names in std::map objects (see the sketch below)
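
The map lookups in particular can be hoisted out of the hot loop by resolving the kernel handle once. A generic OpenCL C API sketch (not this library's internals; add1 is the hypothetical kernel name from above, and per-run buffer transfers are omitted for brevity):

```cpp
#include <CL/cl.h>

// Assumes program, queue and buf were created earlier. Resolving the kernel
// by name once, instead of on every launch, removes one lookup from each of
// the 1000 iterations.
void runMany(cl_program program, cl_command_queue queue, cl_mem buf,
             size_t n, int iterations) {
    cl_int err = 0;
    cl_kernel add1 = clCreateKernel(program, "add1", &err); // once, not per run
    clSetKernelArg(add1, 0, sizeof(cl_mem), &buf);          // once, not per run
    for (int i = 0; i < iterations; ++i) {
        clEnqueueNDRangeKernel(queue, add1, 1, nullptr, &n, nullptr,
                               0, nullptr, nullptr);
        clFinish(queue); // per-run synchronization, as in the benchmark
    }
    clReleaseKernel(add1);
}
```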

When only one device is selected (the CPU, with the direct RAM access feature enabled), the timings improve:

44 microseconds per run
11.6 GB/s processing speed

This is much closer to OpenCL's own API overhead.
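
Per-run figures like these can be reproduced with a simple host-side timer around the repeated launches. A minimal sketch, assuming some runOnce callable that performs one read/compute/write cycle:

```cpp
#include <chrono>

// Average per-run latency (in microseconds) of a callable over `iterations`
// repetitions, mirroring the 1000-repeat methodology above.
template <typename F>
double averageMicroseconds(F&& runOnce, int iterations) {
    const auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) runOnce();
    const auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count()
           / iterations;
}
```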


RAM Bandwidth

If the latency test above is repeated with a bigger array of 64M elements (CPU only), the bandwidth approaches the theoretical peak value:

14.8 milliseconds per run
36 GB/s processing speed (72 GB/s both directions)

If the dataset fits in the L3 cache, bandwidth increases; for example, with 4 million elements (32 MB total for input + output):

320 microseconds per run
102 GB/s processing speed (204 GB/s total bandwidth)
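
The cache effect can also be observed outside OpenCL with a plain CPU microbenchmark that adds 1 to every element and derives bandwidth from bytes moved over elapsed time. A host-only sketch (the volatile qualifier stops the compiler from collapsing the repeated passes, at the cost of disabling vectorization, so absolute numbers will be conservative):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // Compare a cache-friendly working set with a RAM-bound one.
    for (size_t n : {size_t(4) << 20, size_t(64) << 20}) {
        std::vector<int> data(n, 0);
        volatile int* p = data.data(); // volatile keeps each pass in memory
        const int repeats = 10;
        const auto t0 = std::chrono::high_resolution_clock::now();
        for (int r = 0; r < repeats; ++r)
            for (size_t i = 0; i < n; ++i)
                p[i] = p[i] + 1;       // one read + one write per element
        const auto t1 = std::chrono::high_resolution_clock::now();
        const double seconds =
            std::chrono::duration<double>(t1 - t0).count() / repeats;
        // Each pass reads and writes every 4-byte element once.
        const double gbPerSec = 2.0 * sizeof(int) * double(n) / seconds / 1e9;
        std::printf("%zu elements: %.1f GB/s\n", n, gbPerSec);
    }
    return 0;
}
```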