
OOM when building engine for meta-llama/Llama-3.1-405B-FP8 on 8 x A100 80G #2586

Open
HeyangQin opened this issue Dec 17, 2024 · 4 comments
Labels: Memory (Issue about memory, like memory leak)

Comments

@HeyangQin

According to https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-31-405b-model, TensorRT-LLM should be able to handle Llama-3.1-405B-FP8: "For the FP8 model, we can fit it on a single 8xH100 node." Yet when I try to do this with trtllm-serve, it crashes with OOM. I tried reducing max_seq_len, but it didn't help. Is trtllm-serve not the right way to build the engine, or is there something I am missing?

$ trtllm-serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tp_size 8
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
[TensorRT-LLM] TensorRT-LLM version: 0.15.0
Loading Model: [1/3] Downloading HF model
Downloaded model to /home/deepspeed/.cache/huggingface/hub/models--meta-llama--Llama-3.1-405B-Instruct-FP8/snapshots/64a54b704768dfd589a3e4ac05d546052f67f4fd
Time: 0.411s
Loading Model: [2/3] Loading HF model to memory
Time: 285.374s
Loading Model: [3/3] Building TRT-LLM engine

884it [04:44,  3.75it/s]
888it [04:45,  3.12it/s]
[12/17/2024-18:36:47] [TRT] [E] [resizingAllocator.cpp::allocate::78] Error Code 1: Cuda Runtime (out of memory)
[12/17/2024-18:36:52] [TRT] [E] [myelinResourceManager.cpp::allocate::196] Error Code 2: OutOfMemory (Requested size was 101470601472 bytes.)
Traceback (most recent call last):
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/utils.py", line 31, in wrapper
    return func(*args, **kwargs)
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/llm_utils.py", line 1519, in _node_build_task
    model_loader(engine_dir=engine_dir)
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/llm_utils.py", line 1060, in __call__
    pipeline()
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/llm_utils.py", line 1000, in __call__
    self.step_forward()
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/llm_utils.py", line 1029, in step_forward
    self.step_handlers[self.counter]()
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/llmapi/llm_utils.py", line 1257, in _build_engine
    self._engine = build(self.model, copied_build_config)
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1234, in build
    engine = None if build_config.dry_run else builder.build_engine(
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/_common.py", line 204, in decorated
    return f(*args, **kwargs)
  File "/mnt/heyangqin/envs/heyangqin/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 426, in build_engine
    assert engine is not None, 'Engine building failed, please check the error log.'
AssertionError: Engine building failed, please check the error log.

@nv-guomingz
Collaborator

A100 doesn't support FP8, so running Llama-3.1-405B-FP8 on a non-H100 node is not recommended.

nv-guomingz added the Memory (Issue about memory, like memory leak) label on Dec 18, 2024
@HeyangQin
Author

A100 doesn't support FP8, so running Llama-3.1-405B-FP8 on a non-H100 node is not recommended.

@nv-guomingz Yes, but even with suboptimal performance, it shouldn't crash with OOM.

@JoJoLev

JoJoLev commented Dec 18, 2024

Could you share the code used for the build? I recently resolved a similar error on a smaller model; I had to modify some parameters like max_seq_len, etc.
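Roughly, what I mean is converting the checkpoint and building the engine explicitly with lowered limits instead of letting the server pick the defaults. A minimal sketch of that kind of build, with placeholder paths and values (check the exact flag names and the FP8 checkpoint conversion step against your TensorRT-LLM version and the examples/llama README):

# after converting the HF checkpoint per the examples/llama README (placeholder paths)
$ trtllm-build --checkpoint_dir ./llama-405b-fp8-ckpt \
    --output_dir ./llama-405b-fp8-engine \
    --max_batch_size 4 \
    --max_seq_len 2048 \
    --max_num_tokens 2048

Lowering max_batch_size and max_num_tokens shrinks the activation/workspace memory TensorRT reserves while building, which is usually what a build-time OOM is about.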

@HeyangQin
Author

Hi @JoJoLev, I am using trtllm-serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tp_size 8, as shown in the log. I have also tried trtllm-serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tp_size 8 --max_seq_len 2048, but it still crashes with OOM.
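In case it helps, the build in my log goes through the LLM API (tensorrt_llm/llmapi in the traceback), so another thing I can try is building through that API directly with an explicitly reduced build config and then serving the saved engine. A rough, unverified sketch of what I have in mind (parameter names are my assumption from the LLM API; values are placeholders):

# Rough sketch, not verified on this setup: build via the LLM API with
# explicitly reduced build limits, then reuse the saved engine with trtllm-serve.
from tensorrt_llm import LLM, BuildConfig

build_config = BuildConfig(
    max_batch_size=4,     # placeholder: much smaller than the default
    max_seq_len=2048,     # same limit I already pass to trtllm-serve
    max_num_tokens=2048,  # caps the activation/workspace size at build time
)

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,
    build_config=build_config,
)
llm.save("./llama-405b-fp8-engine")  # point trtllm-serve at this directory afterwards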
