
DDP/DP training - multigpu #12

Open
helen1c opened this issue Jan 28, 2023 · 7 comments

helen1c commented Jan 28, 2023

Hi @chrockey, great work!

Can you guide me on how to set up multi-GPU training? I only have 20 GB GPUs available, and with a batch size of 2 I get poor performance (~6% lower mIoU and mAcc, probably due to batch norm with such a small batch size).

If I add multi-GPU support (DDP) following the example from the ME repository, training stalls, i.e., it never starts.

Any help would be appreciated. You commented "multi-GPU training is currently not supported" in the code. Have you had similar issues to the ones I mentioned?

Thanks!

chrockey (Collaborator) commented Jan 29, 2023

Hi @helen1c,

> Have you had similar issues to the ones I mentioned?

No, I haven't. I was able to use DDP with PyTorch Lightning and ME together. However, I found a strange issue: the model's performance gets slightly worse (~1%). That's why I don't use multi-GPU training in this repo. Anyway, here is a code snippet to enable DDP training:

You need to convert the BN modules into synchronized BN before this line:

pl_module = get_lightning_module(lightning_module_name)(model=model, max_steps=max_step)

like so:

if gpus > 1:
    # Convert MinkowskiEngine BN layers to their synchronized variant.
    model = ME.MinkowskiSyncBatchNorm.convert_sync_batchnorm(model)
    # Convert any remaining plain torch BN layers as well.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

Then, set the DDP-related keyword arguments where the trainer kwargs are built (the "if gpus > 1:" branch):

if gpus > 1:
    kwargs["replace_sampler_ddp"] = True   # let Lightning shard the dataset per rank
    kwargs["sync_batchnorm"] = False       # BN is already converted manually above
    kwargs["strategy"] = "ddp_find_unused_parameters_false"

I hope this helps your experiments.

helen1c (Author) commented Jan 29, 2023

@chrockey Unfortunately, this doesn't help; I hit the same problem again.

Can you tell me which versions of PyTorch, CUDA, and PyTorch Lightning you are using?

Thanks for the quick reply though! :)

chrockey (Collaborator) commented Feb 1, 2023

Sorry for the late reply.
Here are the versions:

  • CUDA: 11.3
  • PyTorch: 1.12.1
  • PyTorch Lightning: 1.8.2
  • TorchMetrics: 0.11.0

FYI, I've just uploaded the environment.yaml file to the master branch, which you can refer to.
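
As a quick sanity check, here is a small sketch that prints your installed versions so you can compare them against the list above:

# Print installed versions to compare against the ones listed above.
import torch
import pytorch_lightning as pl
import torchmetrics

print("CUDA (torch build):", torch.version.cuda)   # expected: 11.3
print("PyTorch:", torch.__version__)               # expected: 1.12.1
print("PyTorch Lightning:", pl.__version__)        # expected: 1.8.2
print("TorchMetrics:", torchmetrics.__version__)   # expected: 0.11.0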

chrockey closed this as completed Feb 7, 2023

chrockey (Collaborator) commented Feb 7, 2023

If you have further questions, please feel free to re-open this issue.

lishuai-97 commented:

Hi @chrockey,

I ran into the same problem: training works well with a single GPU, but with multi-GPU training set up as you suggested, the process stalls at epoch 0 and no errors are reported. The GPU memory is occupied, but GPU utilization stays at 0.
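
When a DDP run stalls silently like this, a common first diagnostic step (generic PyTorch advice, not specific to this repo) is to enable verbose distributed logging before training starts:

# Set these before torch.distributed / NCCL are initialized,
# e.g. at the very top of the training script.
import os
os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL communication logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra c10d diagnostics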

chrockey reopened this Feb 8, 2023
Charlie839242 commented:

> Hi @chrockey,
>
> I ran into the same problem: training works well with a single GPU, but with multi-GPU training set up as you suggested, the process stalls at epoch 0 and no errors are reported. The GPU memory is occupied, but GPU utilization stays at 0.

Hi @lishuai-97, I met the same problem as you described. Could you please give me some suggestions on how you solved it? Thanks a lot!

lishuai-97 commented:

> Hi @lishuai-97, I met the same problem as you described. Could you please give me some suggestions on how you solved it? Thanks a lot!

Hi @Charlie839242, sorry for the late reply. Unfortunately, I still haven't solved the problem; I suspect it is an issue with the PyTorch Lightning setup. I have since moved to a new point cloud processing repository, https://github.com/Pointcept/Pointcept, which is also amazing work and includes many SOTA methods.
