Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference.
Topics: gpu, cuda, inference, nvidia, mha, multi-head-attention, llm, large-language-model, flash-attention, cuda-core, decoding-attention, flashinfer
Updated Nov 5, 2024 - C++
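To make the description concrete, below is a minimal CPU reference sketch of what decode-stage MHA computes: a single query token per head attends over the cached keys and values of all previous tokens. The function name, tensor layouts, and parameters are illustrative assumptions, not the project's actual API; the library itself implements this computation in CUDA kernels running on CUDA cores.

```cpp
// CPU reference sketch of decode-stage multi-head attention (MHA).
// Hypothetical layouts (not the library's API):
//   q:       [num_heads, head_dim]            query for the current token
//   k_cache: [seq_len, num_heads, head_dim]   cached keys
//   v_cache: [seq_len, num_heads, head_dim]   cached values
//   out:     [num_heads, head_dim]            attention output
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

void decode_mha_reference(const std::vector<float>& q,
                          const std::vector<float>& k_cache,
                          const std::vector<float>& v_cache,
                          std::vector<float>& out,
                          std::size_t seq_len,
                          std::size_t num_heads,
                          std::size_t head_dim) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    out.assign(num_heads * head_dim, 0.0f);

    for (std::size_t h = 0; h < num_heads; ++h) {
        // 1. Scaled dot-product scores of the single query against every cached key.
        std::vector<float> scores(seq_len);
        float max_score = -INFINITY;
        for (std::size_t t = 0; t < seq_len; ++t) {
            float dot = 0.0f;
            for (std::size_t d = 0; d < head_dim; ++d) {
                dot += q[h * head_dim + d] *
                       k_cache[(t * num_heads + h) * head_dim + d];
            }
            scores[t] = dot * scale;
            max_score = std::max(max_score, scores[t]);
        }
        // 2. Numerically stable softmax over the sequence dimension.
        float denom = 0.0f;
        for (std::size_t t = 0; t < seq_len; ++t) {
            scores[t] = std::exp(scores[t] - max_score);
            denom += scores[t];
        }
        // 3. Softmax-weighted sum of the cached values.
        for (std::size_t t = 0; t < seq_len; ++t) {
            const float w = scores[t] / denom;
            for (std::size_t d = 0; d < head_dim; ++d) {
                out[h * head_dim + d] +=
                    w * v_cache[(t * num_heads + h) * head_dim + d];
            }
        }
    }
}
```

Because the query length is 1 at decode time, the per-head work is essentially a matrix-vector product followed by a softmax-weighted reduction, a memory-bandwidth-bound pattern, which is one reason CUDA-core kernels can be effective for this stage.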