Hi, refer to PR 1675.
There is a conflict bug when using DDP.
For me, it makes the training process get stuck at the first epoch, and memory usage keeps increasing.
Usually, if the code calls mp.set_start_method("spawn"), then one should not launch it with torch.distributed.launch xxx train.py;
a plain python train.py is enough. So dist_train.sh is not suitable for this code.
After I commented out the set_start_method call and used dist_train.sh, the whole training process started fine and there was no more huge memory cost.
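For reference, a minimal sketch contrasting the two launch styles that clash here; the file name train.py, the worker() function, and the port are illustrative assumptions, not this repo's actual code:

```python
# Minimal sketch of the two mutually exclusive DDP launch styles.
# "train.py", worker(), and the port number are illustrative assumptions.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each process sets up its own process group and trains on one GPU.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    # ... build the model, wrap it in DistributedDataParallel, run the loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    # Style A: the script spawns its own workers.
    # Start it with a plain `python train.py`. Combining this with
    # `torch.distributed.launch` means every launched process tries to
    # spawn a full set of workers again, which can hang and blow up memory.
    world_size = torch.cuda.device_count()
    mp.set_start_method("spawn", force=True)
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

# Style B (what dist_train.sh typically does): the launcher creates the
# processes, e.g.
#   python -m torch.distributed.launch --nproc_per_node=${NUM_GPUS} train.py
# In that case train.py should only call dist.init_process_group(...) once,
# taking its rank from the launcher, and must not call
# mp.set_start_method("spawn") / mp.spawn itself.
```

The point is to pick one style: either the script owns process creation (style A) or the launcher does (style B), but not both at once.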