______ __ _______
| |.-----.----.---.-.| | __|.-----.--------.--------.
| ---|| _ | _| _ || | | || -__| | |
|______||_____|__| |___._||__|_______||_____|__|__|__|__|__|__|
git clone git@github.com:AMD-HPC/CoralGemm.git
cd CoralGemm
mkdir build
cd build
cmake ..
make -j
Need be, set CMAKE_MODULE_PATH
and CMAKE_PREFIX_PATH
, e.g.:
export CMAKE_MODULE_PATH=/opt/rocm/hip/cmake:${CMAKE_MODULE_PATH}
export CMAKE_PREFIX_PATH=/opt/rocm/lib/cmake:${CMAKE_PREFIX_PATH}
By default CoralGemm is built for AMD GPUs using ROCm.
However, it can also be built for NVIDIA GPUs using CUDA.
To do so, set USE_HIP=OFF
, USE_CUDA=ON
, and set CMAKE_CUDA_ARCHITECTURES
, e.g.:
cmake -DUSE_HIP=OFF -DUSE_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90 ..
- 16 GB devices (Radeon VII):
./gemm R_64F R_64F R_64F R_64F OP_N OP_T 8640 8640 8640 8640 8640 8640 9 300
- 32 GB devices (MI60, MI100):
./gemm R_64F R_64F R_64F R_64F OP_N OP_T 8640 8640 8640 8640 8640 8640 18 300
- 64 GB devices (MI200 series):
./gemm R_64F R_64F R_64F R_64F OP_N OP_T 8640 8640 8640 8640 8640 8640 36 300
- 16 GB devices (Radeon VII):
./gemm R_32F R_32F R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 18 300
- 32 GB devices (MI60, MI100):
./gemm R_32F R_32F R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 36 300
- 64 GB devices (MI200 series):
./gemm R_32F R_32F R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 72 300
Mixed-precision GEMMs are provided by the Ex API.
Supply the "ex" command line option to use the Ex API.
To run half-precision (FP16) GEMM with accumulation to FP32 on the MI200 series devices call, e.g.:
./gemm R_16F R_16F R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 50 300 ex
To run bfloat16 (BF16) GEMM with accumulation to FP32 on the MI200 series devices call, e.g.:
./gemm R_16B R_16B R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 50 300 ex
./gemm PRECISION_A
PRECISION_B
PRECISION_C
COMPUTE_PRECISION
OP_A
OP_B
M
N
K
LDA
LDB
LDC
BATCH_COUNT
TIME_SPAN runtime duration in seconds
[batched] run batched GEMM
[strided] run strided batched GEMM
[ex] use the Ex API
[hostA] A in host memory
[hostB] B in host memory
[hostC] C in host memory
[coherentA] if in host memory, A is coherent (not cached)
[coherentB] if in host memory, B is coherent (not cached)
[coherentC] if in host memory, C is coherent (not cached)
[sharedA] one A for all devices
[sharedB] one B for all devices
[zeroBeta] set beta to zero
[testing] perform a basic sanity check
[times] print time in microseconds in addition to GFLOPS
[hostname] print the hostname
[threaded] launch to each device from a different thread
When TIME_SPAN
is set to 0, one warmup run is done, followed by one timing run, and printing of column labels is disabled.
R_16B
: BF16R_16F
: FP16R_32F
: floatR_64F
: doubleC_32F
: float complexC_64F
: float doubleR_8I
: 8-bit intR_32I
: 32-bit int
OP_N
: non-transposedOP_T
: transposedOP_C
: conjugate-transposed
- benchmarks
hipblas?gemm[Batched|StridedBatched][Ex]
- allocates
BATCH_SIZE
number of matrices A, B, and C - initializes with hipRAND (random uniform, 0.0 to 1.0)
- calls hipBLAS and collects execution times using
std::chrono
- sets
alpha
to 2.71828 andbeta
to 3.14159 - for
hipblas?gemm[Ex]
launches a sequence of calls and takes the median time - for
hipblas?gemm[Strided]Batched[Ex]
launches one call and takes the overall time - reports the corresponding GFLOPS
- repeats until
TIME_SPAN
exceeded - executes simulteneously on all devices
If testing
is set, a primitive sanity test is ran.
Entries of A, B, and C are set to 1, and so are the factors alpha
and beta
.
Then, after GEMM is ran, all entries of C are checked to contain k+1.
Note that performance is usually much higher when using integer initialization
then when using random data.
Jakub Kurzak (jakurzak@amd.com)