Out of memory when I run on 8x 4090 #1201
-
My batch_size is set to 1 to test multi-GPU training.
-
The model is SD3 Medium, and it can train on a single 4090. When I use the accelerate config above, it runs out of memory.
-
Which doc did you follow to set this up?
-
I have one host machine with eight NVIDIA 4090 graphics cards. When I run on a single card with batch_size set to 2, there is no problem. However, when I run on all eight cards simultaneously, the error above occurs. I'm wondering whether the project is failing to assign a GPU to each process; logically, accelerate should allocate them automatically.
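When launched with num_processes: 8 and gpu_ids: all, accelerate normally pins one process to one GPU via the LOCAL_RANK it sets for each process, so the assignment itself should be automatic. A minimal diagnostic sketch to confirm the mapping and how much memory each rank actually sees (the script name is illustrative; run it with accelerate launch):

```python
# check_devices.py (name is illustrative); run with: accelerate launch check_devices.py
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Each launched process should report a distinct local index and a distinct CUDA device.
free, total = torch.cuda.mem_get_info(accelerator.device)
print(
    f"rank {accelerator.process_index}/{accelerator.num_processes} "
    f"-> {accelerator.device} "
    f"({free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB)"
)
```

If all eight ranks print distinct devices, the OOM is a genuine memory budget problem rather than a mapping problem.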
-
Training stable-diffusion-3-medium-diffusers on 8x 4090.
My config is as follows:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
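For context on why 24 GiB can fill up under this config: mixed_precision: 'no' means everything trainable runs in fp32, and with plain DDP every one of the eight processes keeps its own full copy of the weights, gradients, and optimizer state, plus gradient buckets that a single-GPU run does not allocate. A rough back-of-the-envelope sketch, assuming full fine-tuning with Adam/AdamW in fp32 (activations, the frozen text encoders, and the VAE come on top of this):

```python
# Rough per-GPU memory estimate for fp32 training with Adam/AdamW under DDP.
# Activations, frozen text encoders/VAE, and the CUDA context are NOT included.
import torch


def estimate_fp32_training_gib(model: torch.nn.Module) -> float:
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    # fp32 weights (4 B) + fp32 grads (4 B) + Adam exp_avg/exp_avg_sq (8 B) = 16 B per param
    weights_grads_optimizer = trainable * 16
    # DDP's flat gradient buckets are roughly one more copy of the gradients.
    ddp_buckets = trainable * 4
    return (weights_grads_optimizer + ddp_buckets) / 2**30


# Example: 2e9 trainable parameters -> about 2e9 * 20 / 2**30 ≈ 37 GiB before
# activations; with only LoRA-sized parameters trainable the total is far smaller.
```

The point of the sketch is only to show which terms scale with the trainable parameter count; whatever is frozen or offloaded drops out of it.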
[rank0]:[W1211 15:03:01.691333399 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank3]:[W1211 15:03:01.693289016 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank7]:[W1211 15:03:01.868181424 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank1]:[W1211 15:03:02.445215995 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank4]:[W1211 15:03:02.649411725 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank2]:[W1211 15:03:02.667586240 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank5]:[W1211 15:03:02.988290365 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank6]:[W1211 15:03:04.304754537 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
Traceback (most recent call last):
File "/home/ubuntu/SimpleTuner/train.py", line 53, in
trainer.train()
File "/home/ubuntu/SimpleTuner/helpers/training/trainer.py", line 2705, in train
self.accelerator.backward(loss)
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2248, in backward
loss.backward(**kwargs)
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
torch.autograd.backward(
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
_engine_run_backward(
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 4 has a total capacity of 23.55 GiB of which 69.75 MiB is free. Including non-PyTorch memory, this process has 23.47 GiB memory in use. Of the allocated memory 22.54 GiB is allocated by PyTorch, and 348.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
CUDA out of memory. Tried to allocate 96.00 MiB. GPU 2 has a total capacity of 23.55 GiB of which 69.75 MiB is free. Including non-PyTorch memory, this process has 23.47 GiB memory in use. Of the allocated memory 22.54 GiB is allocated by PyTorch, and 348.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/home/ubuntu/SimpleTuner/train.py", line 53, in
trainer.train()
File "/home/ubuntu/SimpleTuner/helpers/training/trainer.py", line 2705, in train
self.accelerator.backward(loss)
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2248, in backward
loss.backward(**kwargs)
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
torch.autograd.backward(
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
_engine_run_backward(
File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 2 has a total capacity of 23.55 GiB of which 69.75 MiB is free. Including non-PyTorch memory, this process has 23.47 GiB memory in use. Of the allocated memory 22.54 GiB is allocated by PyTorch, and 348.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
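Two hints are already in the log above: the OOM message suggests trying PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation, and the NCCL warning says the rank-to-GPU mapping was still unknown when the first barrier ran. A sketch of both, for illustration only (this is not SimpleTuner's code; LOCAL_RANK is the environment variable the accelerate launcher sets per process):

```python
# Illustrative only, not SimpleTuner's code.
import os

# 1) The OOM message itself suggests the expandable-segments allocator to reduce
#    fragmentation. It has to be in the environment before any CUDA memory is
#    allocated; the simplest route is exporting it in the shell before `accelerate launch`.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

# 2) The NCCL warning means the barrier ran before each rank was pinned to its GPU.
#    Pinning from LOCAL_RANK (set by the launcher) makes the rank -> GPU mapping explicit.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
print(f"rank {os.environ.get('RANK', '0')} pinned to cuda:{local_rank}")
```

Watching nvidia-smi during the first training step also shows quickly whether all eight processes landed on distinct GPUs or piled up on one of them.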