Why the TVM implementation is memory efficient #16

jlidw opened this issue Oct 14, 2022 · 1 comment
jlidw commented Oct 14, 2022

Thanks for your excellent work!

I'd like to discuss the memory reduction. It seems that the TVM implementation does not store fewer matrices (such as the Query, Key, and Value matrices). The number of Q-K pairs is smaller than in full attention, which explains the faster computation, but why does the memory reduction follow a similar trend to the time reduction? The TVM kernel does not seem to use any technique to save memory, and the padded zero values are also int32, yet the TVM implementation is memory efficient...

Looking forward to your reply.

Zhazhan commented Oct 15, 2022

Hello, thanks for your interest in our work.

In fact, the number of Q-K pairs determines not only the computational complexity but also the memory consumption. In full attention, the memory occupied by the attention score matrix $S = QK^\top$ grows quadratically with the sequence length $L$. Therefore, reducing the number of attention scores that need to be stored directly reduces memory consumption. The TVM implementation stores at most $A+C+1$ attention scores per query, so the score buffer grows only linearly with $L$, which reduces memory consumption.
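
A minimal NumPy sketch (not the repository's TVM kernel) of this difference: the values of `A` and `C` and the index pattern below are illustrative assumptions, but the buffer shapes show why a budget of $A+C+1$ scores per query makes the stored scores linear rather than quadratic in $L$.

```python
import numpy as np

L = 4096          # sequence length
D = 64            # head dimension
A, C = 32, 4      # hypothetical per-query budget: A neighboring + C other keys

Q = np.random.randn(L, D).astype(np.float32)
K = np.random.randn(L, D).astype(np.float32)

# Full attention: the score matrix materializes L * L floats.
S_full = Q @ K.T
print(f"full scores:   {S_full.nbytes / 2**20:.1f} MiB")   # 64.0 MiB for L = 4096

# Sparse attention: each query keeps at most A + C + 1 scores, so only an
# (L, A + C + 1) score buffer plus an integer index map is needed.
W = A + C + 1
key_idx = np.minimum(np.arange(L)[:, None] + np.arange(W)[None, :], L - 1)  # toy index pattern
S_sparse = np.einsum("ld,lwd->lw", Q, K[key_idx])
print(f"sparse scores: {S_sparse.nbytes / 2**20:.2f} MiB")  # ~0.58 MiB, linear in L
```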
