Skip to content
/ GEMM Public

Fast Matrix Multiplication Implementation in C programming language. This matrix multiplication algorithm is similar to what Numpy uses to compute dot products.

License

Notifications You must be signed in to change notification settings

iVishalr/GEMM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

General Matrix Multiplication (GEMM)

matrix-multiplication-image

This implementation provides fast matrix multiplication for multiplying two square matrices. This method is known as the general matrix multiplication (GEMM). Many scientific computing libraries like Numpy, BLAS, etc., use a slight modified version of this algorithm. This implementation works only on square matrices. This is done to avoid making the algorithm too complicated to handle rectangular matrices.

This implementation deals with multiplying two sqaured matrices ordered in column major order. In Linear Algebra you order everything according to columns. Hence we will also order it column wise. The two square matrices must have dimensions that is divisible by 4. This is because we are using a 4x4 kernel to compute the dot products of sub matrices.

This implementation achieves good performance using only a single thread. Majority of the performance comes from storing the 4x4 kernel in vector registers in the CPU and perform the dot products. Vector registers use instructions provided by SSE3(Streaming SIMD Extension 3). We slice and dice the original large matrices in such a way that the slices fit in L1 and L2 caches which means that CPU doesn't need to wait for data to be loaded to cache before performing the dot products. This gives us a decent performance boost when we compare the naive approach to compute the matrix multiplication.

Features

  1. Fast Matrix Multiplication of Square Matrices
  2. Register, Cache blocking
  3. No dependencies

Building the Project

Clone this repository by typing the following commands in terminal,

$ git clone https://github.com/iVishalr/GEMM.git
$ cd GEMM

Next, we need to compile the code. Typing the following command will compile the code.

$ make

To execute GEMM, type the following in terminal,

$ ./gemm

Additionally, there are other intermediate optimization files included. Compiling and executing them shows the performance improvements with slight modifications to the implementation of the algorithm. To run this kind of a benchmark, type the following command in terminal,

$ make optim

This will run all the files present in src/optimization_steps/ folder and you can see the difference in performance as it exxecutes.

To clean the object files, type the following command in terminal,

$ make clean

Limitations

  1. Works only on Square Matrices whose dimensions are divisible by 4. GEMM uses a 4x4 kernel to compute the dot product of submatrices. Official implementations of GEMM use multiple kernels optimized to different CPU architectures. Too complicated to implement :P
  2. This is a single threaded implementation of GEMM. Getting multi-threaded implementation of GEMM requires coding in much lower level and different kernel specific optimizations. Numpy, BLAS will be much faster than what this implementation provides.

License

MIT

About

Fast Matrix Multiplication Implementation in C programming language. This matrix multiplication algorithm is similar to what Numpy uses to compute dot products.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published