Hi, refer to PR 1675.
There is a conflict bug when using DDP.
For me, it makes the training process get stuck at the first epoch, and memory usage keeps increasing.
Usually, if the code calls mp.set_start_method("spawn"), then one should not launch it with torch.distributed.launch xxx train.py;
a plain python train.py is enough. So dist_train.sh is not suitable for this code.
After I commented out the set_start_method call and used dist_train.sh, the whole training process started fine and there was no more huge memory cost.
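For reference, a minimal sketch contrasting the two launch styles that clash here; the file name train.py, the worker() function, and the port are illustrative assumptions, not this repo's actual code:

```python
# Minimal sketch of the two mutually exclusive DDP launch styles.
# "train.py", worker(), and the port number are illustrative assumptions.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each process sets up its own process group and trains on one GPU.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    # ... build the model, wrap it in DistributedDataParallel, run the loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    # Style A: the script spawns its own workers.
    # Start it with a plain `python train.py`. Combining this with
    # `torch.distributed.launch` means every launched process tries to
    # spawn a full set of workers again, which can hang and blow up memory.
    world_size = torch.cuda.device_count()
    mp.set_start_method("spawn", force=True)
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

# Style B (what dist_train.sh typically does): the launcher creates the
# processes, e.g.
#   python -m torch.distributed.launch --nproc_per_node=${NUM_GPUS} train.py
# In that case train.py should only call dist.init_process_group(...) once,
# taking its rank from the launcher, and must not call
# mp.set_start_method("spawn") / mp.spawn itself.
```

The point is to pick one style: either the script owns process creation (style A) or the launcher does (style B), but not both at once.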