
Host Performance


API Latency

When controlling multiple devices, there is at least the overhead of thread-to-thread communication, plus the small array allocations used for the bookkeeping of load balancing or dynamic load balancing.
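For example, per-run bookkeeping allocations can be amortized by reusing storage across runs. The sketch below is a generic illustration under assumed names (LoadBalanceState, planNaive and planReused are hypothetical, not this library's actual internals):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical bookkeeping for an N-device load balancer.
struct LoadBalanceState {
    std::vector<std::size_t> offsets; // first element of each device's slice
    std::vector<std::size_t> sizes;   // slice length per device
};

// Naive: two small heap allocations on every run.
LoadBalanceState planNaive(std::size_t elements, std::size_t devices) {
    LoadBalanceState s;
    s.offsets.resize(devices);
    s.sizes.resize(devices);
    for (std::size_t i = 0; i < devices; ++i) {
        s.offsets[i] = elements * i / devices;
        s.sizes[i] = elements * (i + 1) / devices - s.offsets[i];
    }
    return s;
}

// Reused: the caller keeps one state object alive across all runs, so the
// hot loop only refills existing storage instead of allocating.
void planReused(std::size_t elements, std::size_t devices, LoadBalanceState& s) {
    s.offsets.resize(devices); // no-op after the first run
    s.sizes.resize(devices);
    for (std::size_t i = 0; i < devices; ++i) {
        s.offsets[i] = elements * i / devices;
        s.sizes[i] = elements * (i + 1) / devices - s.offsets[i];
    }
}
```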

System:

  • Ryzen 7900 at 5.3 GHz (has an integrated GPU)
  • 64 GB of DDR5-4800 CL40 memory
  • GT 1030 graphics card
  • MSVC 2022, OpenCL binaries installed via vcpkg

OpenCL devices:

  • 2 cloned iGPUs
  • 2 cloned discrete GPUs on PCIe 3.0 x4 lanes
  • CPU with the direct-RAM-access feature enabled

Kernel:

  • adds 1 to each element (see the sketch below)
  • 65536 integer elements
  • the read/compute/write operation is repeated 1000 times
  • normal load-balanced run
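
A minimal sketch of such a kernel, embedded as a C++ string literal the way OpenCL sources typically are (the kernel name add1 and the identifier kAdd1Source are illustrative assumptions, not this benchmark's actual code):

```cpp
// OpenCL C source for a benchmark-style kernel: add 1 to each element.
// One work-item per element; the host enqueues 65536 work-items.
const char* kAdd1Source = R"CLC(
kernel void add1(global int* data) {
    const int i = get_global_id(0);
    data[i] += 1;
}
)CLC";
```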

Timings:

240 microseconds per run
2.1 GB/s processing speed (the kernel reads and writes 65536 × 4 bytes, i.e. 512 KiB per run, and 512 KiB every 240 µs is roughly 2.1 GB/s)

The bottleneck is caused by multiple issues:

  • OpenCL API (kernel launch, finish, etc) / driver overhead
  • Mutex locking contention (increases with number of devices)
  • Allocation of local arrays in methods (may be optimized by a good compiler)
  • Looking up kernel names and parameter names in std::map objects (see the sketch below)
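
The map lookups in particular can be hoisted out of the hot loop by resolving the kernel handle once. A generic OpenCL C API sketch (not this library's internals; add1 is the hypothetical kernel name from above, and per-run buffer transfers are omitted for brevity):

```cpp
#include <CL/cl.h>

// Assumes program, queue and buf were created earlier. Resolving the kernel
// by name once, instead of on every launch, removes one lookup from each of
// the 1000 iterations.
void runMany(cl_program program, cl_command_queue queue, cl_mem buf,
             size_t n, int iterations) {
    cl_int err = 0;
    cl_kernel add1 = clCreateKernel(program, "add1", &err); // once, not per run
    clSetKernelArg(add1, 0, sizeof(cl_mem), &buf);          // once, not per run
    for (int i = 0; i < iterations; ++i) {
        clEnqueueNDRangeKernel(queue, add1, 1, nullptr, &n, nullptr,
                               0, nullptr, nullptr);
        clFinish(queue); // per-run synchronization, as in the benchmark
    }
    clReleaseKernel(add1);
}
```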

When only one device is selected (the CPU, with the direct RAM access feature enabled), the timings improve:

44 microseconds per run
11.6 GB/s processing speed

This is much closer to OpenCL's own API overhead.
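
Per-run figures like these can be reproduced with a simple host-side timer around the repeated launches. A minimal sketch, assuming some runOnce callable that performs one read/compute/write cycle:

```cpp
#include <chrono>

// Average per-run latency (in microseconds) of a callable over `iterations`
// repetitions, mirroring the 1000-repeat methodology above.
template <typename F>
double averageMicroseconds(F&& runOnce, int iterations) {
    const auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) runOnce();
    const auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count()
           / iterations;
}
```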


RAM Bandwidth

If the latency test above is repeated with a bigger array of 64M elements (CPU only), the bandwidth approaches the theoretical peak value:

14.8 milliseconds per run
36 GB/s processing speed (72 GB/s both directions)

If the dataset fits in the L3 cache, bandwidth increases; for example, with 4 million elements (32 MB total for input + output):

320 microseconds per run
102 GB/s processing speed (204 GB/s total bandwidth)
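
The cache effect can also be observed outside OpenCL with a plain CPU microbenchmark that adds 1 to every element and derives bandwidth from bytes moved over elapsed time. A host-only sketch (the volatile qualifier stops the compiler from collapsing the repeated passes, at the cost of disabling vectorization, so absolute numbers will be conservative):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // Compare a cache-friendly working set with a RAM-bound one.
    for (size_t n : {size_t(4) << 20, size_t(64) << 20}) {
        std::vector<int> data(n, 0);
        volatile int* p = data.data(); // volatile keeps each pass in memory
        const int repeats = 10;
        const auto t0 = std::chrono::high_resolution_clock::now();
        for (int r = 0; r < repeats; ++r)
            for (size_t i = 0; i < n; ++i)
                p[i] = p[i] + 1;       // one read + one write per element
        const auto t1 = std::chrono::high_resolution_clock::now();
        const double seconds =
            std::chrono::duration<double>(t1 - t0).count() / repeats;
        // Each pass reads and writes every 4-byte element once.
        const double gbPerSec = 2.0 * sizeof(int) * double(n) / seconds / 1e9;
        std::printf("%zu elements: %.1f GB/s\n", n, gbPerSec);
    }
    return 0;
}
```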