OOM when building engine for meta-llama/Llama-3.1-405B-FP8 on 8 x A100 80G
#2586
Labels: Memory
According to https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-31-405b-model, TensorRT-LLM should be able to handle Llama-3.1-405B-FP8:

> For the FP8 model, we can fit it on a single 8xH100 node

Yet when I try to do it via `trtllm-serve`, it crashes with OOM. I tried reducing `max_seq_len`, but it didn't help. Is `trtllm-serve` not the optimal way to build the engine, or is there something I am missing? A rough sketch of my invocation is below.