Matrix Multiply Stress Test

 ______                    __ _______
|      |.-----.----.---.-.|  |     __|.-----.--------.--------.
|   ---||  _  |   _|  _  ||  |    |  ||  -__|        |        |
|______||_____|__| |___._||__|_______||_____|__|__|__|__|__|__|

Matrix Multiply Stress Test

Prerequisites

Building

git clone git@github.com:AMD-HPC/CoralGemm.git
cd CoralGemm
mkdir build
cd build
cmake ..
make -j

Need be, set CMAKE_MODULE_PATH and CMAKE_PREFIX_PATH, e.g.:

export CMAKE_MODULE_PATH=/opt/rocm/hip/cmake:${CMAKE_MODULE_PATH}
export CMAKE_PREFIX_PATH=/opt/rocm/lib/cmake:${CMAKE_PREFIX_PATH}

By default CoralGemm is built for AMD GPUs using ROCm.
However, it can also be built for NVIDIA GPUs using CUDA.
To do so, set USE_HIP=OFF, USE_CUDA=ON, and set CMAKE_CUDA_ARCHITECTURES, e.g.:

cmake -DUSE_HIP=OFF -DUSE_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90 ..

Common Cases

DGEMM

16 GB devices (Radeon VII): ./gemm R_64F R_64F R_64F R_64F OP_N OP_T 8640 8640 8640 8640 8640 8640 9 300
32 GB devices (MI60, MI100): ./gemm R_64F R_64F R_64F R_64F OP_N OP_T 8640 8640 8640 8640 8640 8640 18 300
64 GB devices (MI200 series): ./gemm R_64F R_64F R_64F R_64F OP_N OP_T 8640 8640 8640 8640 8640 8640 36 300

SGEMM

16 GB devices (Radeon VII): ./gemm R_32F R_32F R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 18 300
32 GB devices (MI60, MI100): ./gemm R_32F R_32F R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 36 300
64 GB devices (MI200 series): ./gemm R_32F R_32F R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 72 300

Mixed-Precision

Mixed-precision GEMMs are provided by the Ex API.
Supply the "ex" command line option to use the Ex API.

To run half-precision (FP16) GEMM with accumulation to FP32 on the MI200 series devices call, e.g.:
./gemm R_16F R_16F R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 50 300 ex

To run bfloat16 (BF16) GEMM with accumulation to FP32 on the MI200 series devices call, e.g.:
./gemm R_16B R_16B R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 50 300 ex

Command-Line Details

    ./gemm PRECISION_A
           PRECISION_B
           PRECISION_C
           COMPUTE_PRECISION
           OP_A
           OP_B
           M
           N
           K
           LDA
           LDB
           LDC
           BATCH_COUNT
           TIME_SPAN    runtime duration in seconds
           [batched]    run batched GEMM
           [strided]    run strided batched GEMM
           [ex]         use the Ex API
           [hostA]      A in host memory
           [hostB]      B in host memory
           [hostC]      C in host memory
           [coherentA]  if in host memory, A is coherent (not cached)
           [coherentB]  if in host memory, B is coherent (not cached)
           [coherentC]  if in host memory, C is coherent (not cached)
           [sharedA]    one A for all devices
           [sharedB]    one B for all devices
           [zeroBeta]   set beta to zero
           [testing]    perform a basic sanity check
           [times]      print time in microseconds in addition to GFLOPS
           [hostname]   print the hostname
           [threaded]   launch to each device from a different thread

When TIME_SPAN is set to 0, one warmup run is done, followed by one timing run, and printing of column labels is disabled.

Supported Precisions:

R_16B: BF16
R_16F: FP16
R_32F: float
R_64F: double
C_32F: float complex
C_64F: float double
R_8I: 8-bit int
R_32I: 32-bit int

Supported Ops:

OP_N: non-transposed
OP_T: transposed
OP_C: conjugate-transposed

Details

benchmarks hipblas?gemm[Batched|StridedBatched][Ex]
allocates BATCH_SIZE number of matrices A, B, and C
initializes with hipRAND (random uniform, 0.0 to 1.0)
calls hipBLAS and collects execution times using std::chrono
sets alpha to 2.71828 and beta to 3.14159
for hipblas?gemm[Ex] launches a sequence of calls and takes the median time
for hipblas?gemm[Strided]Batched[Ex] launches one call and takes the overall time
reports the corresponding GFLOPS
repeats until TIME_SPAN exceeded
executes simulteneously on all devices

If testing is set, a primitive sanity test is ran. Entries of A, B, and C are set to 1, and so are the factors alpha and beta. Then, after GEMM is ran, all entries of C are checked to contain k+1. Note that performance is usually much higher when using integer initialization then when using random data.

Help

Jakub Kurzak (jakurzak@amd.com)

Name		Name	Last commit message	Last commit date
Latest commit History 84 Commits
doc/doxygen		doc/doxygen
src		src
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE.md		LICENSE.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Matrix Multiply Stress Test

Prerequisites

Building

Common Cases

DGEMM

SGEMM

Mixed-Precision

Command-Line Details

Supported Precisions:

Supported Ops:

Details

Help

About

Releases

Packages

Contributors 4

Languages

License

AMD-HPC/CoralGemm

Folders and files

Latest commit

History

Repository files navigation

Matrix Multiply Stress Test

Prerequisites

Building

Common Cases

DGEMM

SGEMM

Mixed-Precision

Command-Line Details

Supported Precisions:

Supported Ops:

Details

Help

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages