
WIP: Multi query attention #3

Open
wants to merge 9 commits into base: load-iter
Conversation

@RaymondLi0 (Collaborator) commented Aug 9, 2022

#1
TODO:

  • inference speed benchmark to compare multi-query with multi-head
  • Test with 3D parallelism configuration

@RaymondLi0 self-assigned this Aug 9, 2022
@RaymondLi0 (Collaborator, Author) commented:

Ran an inference benchmark on an A100 GPU to compare multi-query (MQ) with multi-head (MH) attention. I used:

"--num-layers", "8", "--hidden-size", "1024", "--num-attention-heads", "16", "--seq-length", "1024", "--max-position-embeddings", "1024"


BATCH_SIZE = 512
TOKENS_TO_GENERATE = 128
PROMPT_LENGTH = 128
NUM_BATCHES = 8
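
For reference, here is a minimal PyTorch sketch of the difference being benchmarked at these dimensions (hidden size 1024, 16 heads). This is only an illustration, not the Megatron-LM implementation in this PR: in multi-query attention the key/value projections produce a single head that is shared by all query heads, so the K/V projections and the K/V cache used during generation are 16x smaller here.

```python
import torch
import torch.nn.functional as F

hidden, n_heads = 1024, 16          # matches --hidden-size / --num-attention-heads above
head_dim = hidden // n_heads        # 64

# Multi-head (MH): K and V get one projection per head, same as Q, e.g.
#   k_proj = torch.nn.Linear(hidden, n_heads * head_dim)
# Multi-query (MQ): Q keeps per-head projections, but K and V produce a single
# head that is broadcast across all query heads.
q_proj = torch.nn.Linear(hidden, n_heads * head_dim)
k_proj = torch.nn.Linear(hidden, head_dim)
v_proj = torch.nn.Linear(hidden, head_dim)

def multi_query_attention(x):
    b, s, _ = x.shape                                            # x: [b, s, hidden]
    q = q_proj(x).view(b, s, n_heads, head_dim).transpose(1, 2)  # [b, heads, s, d]
    k = k_proj(x).view(b, s, 1, head_dim).transpose(1, 2)        # [b, 1, s, d]
    v = v_proj(x).view(b, s, 1, head_dim).transpose(1, 2)        # [b, 1, s, d]
    scores = q @ k.transpose(-1, -2) / head_dim ** 0.5           # K broadcasts over heads
    out = F.softmax(scores, dim=-1) @ v                          # V broadcasts over heads
    return out.transpose(1, 2).reshape(b, s, hidden)
```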

Some findings:

Adding timers slows down inference, significantly more for the MH model than for the MQ model.

Times are in ms.
With timers in each layer:
MH: generate: 46721.59 | Transformer forward: 18433.44 | attention forward: 13127.98 | MLP forward: 2462.11
MQ: generate: 39200.74 | Transformer forward: 12065.38 | attention forward: 6762.22 | MLP forward: 2474.14

Only a timer for the whole model:
MH: generate: 40845.17 | Transformer forward: 13884.54
MQ: generate: 37670.10 | Transformer forward: 10263.32

The difference of about 3 seconds (in favour of MQ) with a single timer on the whole model jumps to about 6 seconds when timers are placed inside each layer. The timers call torch.cuda.synchronize(), which is probably the reason for the slowdown, though it is unclear why the slowdown is larger for the MH model.
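
For context, a synchronizing timer looks roughly like the sketch below (an illustration, not the exact timer code in this repo). Each start/stop calls torch.cuda.synchronize(), which blocks the CPU until all queued kernels have finished, so per-layer timers add two synchronization points per layer per step instead of two per step for a single model-level timer.

```python
import time
import torch

class SyncTimer:
    """Wall-clock timer that synchronizes the GPU around the timed region.

    The synchronize() calls force the CPU to wait for all outstanding kernels,
    which is what makes fine-grained (per-layer) timing intrusive.
    """

    def __init__(self, name):
        self.name = name
        self.elapsed_ms = 0.0

    def __enter__(self):
        torch.cuda.synchronize()
        self._start = time.time()
        return self

    def __exit__(self, *exc):
        torch.cuda.synchronize()
        self.elapsed_ms += (time.time() - self._start) * 1000.0

# Hypothetical usage inside a transformer layer:
# with SyncTimer("attention forward") as t:
#     out = attention(x)
```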

We end up with a 26% reduction on the transformer-forward step

when comparing 13884.54 ms (MH) against 10263.32 ms (MQ): 1 - 10263.32 / 13884.54 ≈ 0.26.
However, most of the inference time is not spent on model computation but elsewhere, so profiling would help to find the remaining bottlenecks.
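
A minimal torch.profiler sketch for that; the model and inputs below are placeholders, not this repo's generation loop:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input standing in for the actual generation loop.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(512, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(8):
        y = model(x)

# Sorting by CUDA time separates GPU kernel time from host-side overhead
# (sampling, tokenization, synchronization, data movement, ...).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```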
