Out of memory when I run on 8x 4090 #1201
-
My batch_size is set to 1 to test multi-GPU training.
-
The model is SD3 Medium, and it can train on a single 4090. When I use the accelerate config above, it runs out of memory.
-
Which doc did you follow to set this up?
-
I have one host machine with eight NVIDIA 4090 graphics cards. When I run on a single card with batch_size set to 2, there is no problem. However, when I run on all eight cards simultaneously, the error above occurs. I'm wondering whether the project is failing to assign a GPU to each process; logically, accelerate should allocate them automatically.
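When launched with num_processes: 8 and gpu_ids: all, accelerate normally pins one process to one GPU via the LOCAL_RANK it sets for each process, so the assignment itself should be automatic. A minimal diagnostic sketch to confirm the mapping and how much memory each rank actually sees (the script name is illustrative; run it with accelerate launch):

```python
# check_devices.py (name is illustrative); run with: accelerate launch check_devices.py
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Each launched process should report a distinct local index and a distinct CUDA device.
free, total = torch.cuda.mem_get_info(accelerator.device)
print(
    f"rank {accelerator.process_index}/{accelerator.num_processes} "
    f"-> {accelerator.device} "
    f"({free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB)"
)
```

If all eight ranks print distinct devices, the OOM is a genuine memory budget problem rather than a mapping problem.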
-
Training stable-diffusion-3-medium-diffusers on 8x 4090.
My config is as follows:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
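For context on why 24 GiB can fill up under this config: mixed_precision: 'no' means everything trainable runs in fp32, and with plain DDP every one of the eight processes keeps its own full copy of the weights, gradients, and optimizer state, plus gradient buckets that a single-GPU run does not allocate. A rough back-of-the-envelope sketch, assuming full fine-tuning with Adam/AdamW in fp32 (activations, the frozen text encoders, and the VAE come on top of this):

```python
# Rough per-GPU memory estimate for fp32 training with Adam/AdamW under DDP.
# Activations, frozen text encoders/VAE, and the CUDA context are NOT included.
import torch


def estimate_fp32_training_gib(model: torch.nn.Module) -> float:
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    # fp32 weights (4 B) + fp32 grads (4 B) + Adam exp_avg/exp_avg_sq (8 B) = 16 B per param
    weights_grads_optimizer = trainable * 16
    # DDP's flat gradient buckets are roughly one more copy of the gradients.
    ddp_buckets = trainable * 4
    return (weights_grads_optimizer + ddp_buckets) / 2**30


# Example: 2e9 trainable parameters -> about 2e9 * 20 / 2**30 ≈ 37 GiB before
# activations; with only LoRA-sized parameters trainable the total is far smaller.
```

The point of the sketch is only to show which terms scale with the trainable parameter count; whatever is frozen or offloaded drops out of it.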
[rank0]:[W1211 15:03:01.691333399 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank3]:[W1211 15:03:01.693289016 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank7]:[W1211 15:03:01.868181424 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank1]:[W1211 15:03:02.445215995 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank4]:[W1211 15:03:02.649411725 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank2]:[W1211 15:03:02.667586240 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank5]:[W1211 15:03:02.988290365 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank6]:[W1211 15:03:04.304754537 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
Traceback (most recent call last):
File "/home/ubuntu/SimpleTuner/train.py", line 53, in
trainer.train()
File "/home/ubuntu/SimpleTuner/helpers/training/trainer.py", line 2705, in train
self.accelerator.backward(loss)
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2248, in backward
loss.backward(**kwargs)
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
torch.autograd.backward(
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
_engine_run_backward(
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 4 has a total capacity of 23.55 GiB of which 69.75 MiB is free. Including non-PyTorch memory, this process has 23.47 GiB memory in use. Of the allocated memory 22.54 GiB is allocated by PyTorch, and 348.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
CUDA out of memory. Tried to allocate 96.00 MiB. GPU 2 has a total capacity of 23.55 GiB of which 69.75 MiB is free. Including non-PyTorch memory, this process has 23.47 GiB memory in use. Of the allocated memory 22.54 GiB is allocated by PyTorch, and 348.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/home/ubuntu/SimpleTuner/train.py", line 53, in
trainer.train()
File "/home/ubuntu/SimpleTuner/helpers/training/trainer.py", line 2705, in train
self.accelerator.backward(loss)
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2248, in backward
loss.backward(**kwargs)
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
torch.autograd.backward(
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
_engine_run_backward(
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 2 has a total capacity of 23.55 GiB of which 69.75 MiB is free. Including non-PyTorch memory, this process has 23.47 GiB memory in use. Of the allocated memory 22.54 GiB is allocated by PyTorch, and 348.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
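Two hints are already in the log above: the OOM message suggests trying PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation, and the NCCL warning says the rank-to-GPU mapping was still unknown when the first barrier ran. A sketch of both, for illustration only (this is not SimpleTuner's code; LOCAL_RANK is the environment variable the accelerate launcher sets per process):

```python
# Illustrative only, not SimpleTuner's code.
import os

# 1) The OOM message itself suggests the expandable-segments allocator to reduce
#    fragmentation. It has to be in the environment before any CUDA memory is
#    allocated; the simplest route is exporting it in the shell before `accelerate launch`.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

# 2) The NCCL warning means the barrier ran before each rank was pinned to its GPU.
#    Pinning from LOCAL_RANK (set by the launcher) makes the rank -> GPU mapping explicit.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
print(f"rank {os.environ.get('RANK', '0')} pinned to cuda:{local_rank}")
```

Watching nvidia-smi during the first training step also shows quickly whether all eight processes landed on distinct GPUs or piled up on one of them.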