
OOM when building engine for meta-llama/Llama-3.1-405B-FP8 on 8 x A100 80G #2586

Open
HeyangQin opened this issue Dec 17, 2024 · 4 comments
Labels: Memory (Issue about memory, like memory leak)

Comments

@HeyangQin

According to https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-31-405b-model, TensorRT-LLM should be able to handle Llama-3.1-405B-FP8: "For the FP8 model, we can fit it on a single 8xH100 node." Yet when I try to do this with trtllm-serve, it crashes with OOM. I tried reducing max_seq_len, but it didn't help. Is trtllm-serve not the right way to build the engine, or is there something I am missing?

$ trtllm-serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tp_size 8
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
Loading Model: [1/3] Downloading HF model
Downloaded model to /home/deepspeed/.cache/huggingface/hub/models--meta-llama--Llama-3.1-405B-Instruct-FP8/snapshots/64a54b704768dfd589a3e4ac05d546052f67f4fd
Time: 0.411s
Loading Model: [2/3] Loading HF model to memory
Time: 285.374s
Loading Model: [3/3] Building TRT-LLM engine

884it [04:44,  3.75it/s]
888it [04:45,  3.12it/s]
[12/17/2024-18:36:47] [TRT] [E] [resizingAllocator.cpp::allocate::78] Error Code 1: Cuda Runtime (out of memory)
[12/17/2024-18:36:52] [TRT] [E] [myelinResourceManager.cpp::allocate::196] Error Code 2: OutOfMemory (Requested size was 101470601472 bytes.)
Traceback (most recent call last):
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/utils.py", line 31, in wrapper
    return func(*args, **kwargs)
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/llm_utils.py", line 1519, in _node_build_task
    model_loader(engine_dir=engine_dir)
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/llm_utils.py", line 1060, in __call__
    pipeline()
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/llm_utils.py", line 1000, in __call__
    self.step_forward()
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/llm_utils.py", line 1029, in step_forward
    self.step_handlers[self.counter]()
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/llm_utils.py", line 1257, in _build_engine
    self._engine = build(self.model, copied_build_config)
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1234, in build
    engine = None if build_config.dry_run else builder.build_engine(
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/_common.py", line 204, in decorated
    return f(*args, **kwargs)
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 426, in build_engine
    assert engine is not None, 'Engine building failed, please check the error log.'
AssertionError: Engine building failed, please check the error log.

@nv-guomingz
Collaborator

A100 doesn't support FP8, so running Llama-3.1-405B-FP8 on a non-H100 node is not recommended.

nv-guomingz added the Memory (Issue about memory, like memory leak) label on Dec 18, 2024
@HeyangQin
Author

A100 doesn't support FP8, so running Llama-3.1-405B-FP8 on a non-H100 node is not recommended.

@nv-guomingz Yes, but even with suboptimal performance, it shouldn't crash with OOM.

@JoJoLev

JoJoLev commented Dec 18, 2024

Could you share the code used for the build? I recently resolved a similar error on a smaller model; I had to modify some parameters like max_seq_len, etc.
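Roughly, what I mean is converting the checkpoint and building the engine explicitly with lowered limits instead of letting the server pick the defaults. A minimal sketch of that kind of build, with placeholder paths and values (check the exact flag names and the FP8 checkpoint conversion step against your TensorRT-LLM version and the examples/llama README):

# after converting the HF checkpoint per the examples/llama README (placeholder paths)
$ trtllm-build --checkpoint_dir ./llama-405b-fp8-ckpt \
    --output_dir ./llama-405b-fp8-engine \
    --max_batch_size 4 \
    --max_seq_len 2048 \
    --max_num_tokens 2048

Lowering max_batch_size and max_num_tokens shrinks the activation/workspace memory TensorRT reserves while building, which is usually what a build-time OOM is about.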

@HeyangQin
Author

Hi @JoJoLev, I am using trtllm-serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tp_size 8, as shown in the log. I have also tried trtllm-serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tp_size 8 --max_seq_len 2048, but it still crashes with OOM.
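In case it helps, the build in my log goes through the LLM API (tensorrt_llm/llmapi in the traceback), so another thing I can try is building through that API directly with an explicitly reduced build config and then serving the saved engine. A rough, unverified sketch of what I have in mind (parameter names are my assumption from the LLM API; values are placeholders):

# Rough sketch, not verified on this setup: build via the LLM API with
# explicitly reduced build limits, then reuse the saved engine with trtllm-serve.
from tensorrt_llm import LLM, BuildConfig

build_config = BuildConfig(
    max_batch_size=4,     # placeholder: much smaller than the default
    max_seq_len=2048,     # same limit I already pass to trtllm-serve
    max_num_tokens=2048,  # caps the activation/workspace size at build time
)

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,
    build_config=build_config,
)
llm.save("./llama-405b-fp8-engine")  # point trtllm-serve at this directory afterwards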
