Efficient Memory Management for Large Language Model #27280
anusonawane started this conversation in General
The main consumers of GPU memory during LLM inference are the model parameters (weights), the key-value (KV) cache, activations, and temporary buffers plus other overheads.
Model Parameters (Weights):
The memory required to store model weights depends on the number of parameters and the precision format (e.g., FP16 uses 2 bytes per parameter).
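As a rough illustration, here is a minimal sketch of that calculation; the 7B parameter count used below is an assumed example, not a figure from this post:

```python
# Rough estimate of GPU memory needed just to hold the model weights.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """bytes_per_param=2 corresponds to FP16/BF16 storage."""
    return num_params * bytes_per_param / 1024**3

# Assumed example: a 7B-parameter model stored in FP16.
print(f"{weight_memory_gb(7e9):.1f} GB")   # ~13.0 GB
```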
KV Cache Memory:
The KV cache stores the key and value vectors for every token processed so far during text generation. Its size grows with the number of layers, hidden size, sequence length (token count), batch size, and precision.
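A minimal sketch of the usual sizing formula; the 32-layer, 4096-hidden-size configuration below is an assumed LLaMA-7B-like example, not taken from the post:

```python
# KV cache bytes = 2 (keys + values) * layers * hidden_size * bytes_per_value
#                  * tokens * batch_size
def kv_cache_gb(num_layers: int, hidden_size: int, num_tokens: int,
                batch_size: int = 1, bytes_per_value: int = 2) -> float:
    return (2 * num_layers * hidden_size * bytes_per_value
            * num_tokens * batch_size) / 1024**3

# Assumed config: 32 layers, hidden size 4096, FP16, one sequence of 2048 tokens.
print(f"{kv_cache_gb(32, 4096, num_tokens=2048):.2f} GB")   # ~1.00 GB
```

This is why long sequences and large batches quickly become memory-bound: the cache scales linearly with both token count and batch size.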
Activations and Buffers:
Activations are the temporary outputs of intermediate layers, typically consuming 5-10% of total GPU memory; on a 40 GB GPU that is roughly 2-4 GB.
Memory Overheads (Fragmentation):
Fragmentation leaves allocated but unusable gaps in GPU memory, reducing the memory actually available for computation. If 20% of a 40 GB GPU is lost to fragmentation, 8 GB is wasted, leaving only 32 GB for computation.
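Putting the pieces together, here is a hedged back-of-the-envelope budget for the 40 GB GPU used in the examples above; the 20% fragmentation and 5-10% activation figures come from this post, while the 7B FP16 model size is the assumption from the earlier sketch:

```python
# Back-of-the-envelope memory budget for a 40 GB GPU.
total_gb = 40.0
fragmentation_loss = 0.20 * total_gb    # 20% lost to fragmentation -> 8 GB
activations = 0.075 * total_gb          # midpoint of the 5-10% estimate -> ~3 GB
weights = 13.0                          # assumed 7B model in FP16 (see above)

left_for_kv_cache = total_gb - fragmentation_loss - activations - weights
print(f"Left for KV cache: {left_for_kv_cache:.1f} GB")   # ~16 GB
```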