WIP: Multi query attention #3
base: load-iter
Conversation
Ran some inference benchmarks on an A100 GPU to compare multi-query (MQ) with multi-head (MH) attention. I used:
Some findings (times are in ms):
- Some timers slow down inference, significantly more for the MH model than for the MQ model.
- With only a timer on the whole model, the difference is 3 seconds (in favour of MQ); it jumps to 6 seconds when using timers within each layer, for some reason. The timers use
- We end up with a reduction of 26% on the transformer-forward step when comparing 13884.54 against 10263.32.
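For the record, the 26% figure follows directly from the two forward-step timings above:

```python
mh_ms, mq_ms = 13884.54, 10263.32  # MH vs MQ transformer-forward times from the table above
reduction = (mh_ms - mq_ms) / mh_ms
print(f"{reduction:.1%}")  # → 26.1%
```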
#1
TODO: