
remove multi processing spawn process to avoid conflict with torch.dist.launch #1675

Open · wants to merge 1 commit into master
Conversation

@dmund95 commented Oct 29, 2024

TLDR: we should use either torch.multiprocessing.spawn or torch.distributed.launch to start one process per GPU, not both.
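The two launch patterns are mutually exclusive, as in the sketch below. This is illustrative rather than the repository's actual code; the `train_worker` function and its arguments are hypothetical.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train_worker(local_rank, world_size):
    # Hypothetical per-GPU entry point; a real one would build the
    # dataloader/model and run the training loop.
    dist.init_process_group("nccl", rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

if __name__ == "__main__":
    if "LOCAL_RANK" in os.environ:
        # Started via torch.distributed.launch / torchrun: the launcher has
        # already created one process per GPU, so do NOT spawn again here.
        train_worker(int(os.environ["LOCAL_RANK"]), int(os.environ["WORLD_SIZE"]))
    else:
        # Started as a plain script: spawn one worker process per GPU ourselves.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        world_size = torch.cuda.device_count()
        mp.spawn(train_worker, args=(world_size,), nprocs=world_size)
```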

This PR helps fix the following symptoms:

  1. During dataloader initialization, memory usage on the main GPU spikes. With 40 workers I saw more than 16 GB being used for essentially no reason. After the fix, GPU:0 only takes ~300 MB extra, which I believe is for some communication-related setup and does not depend on num_workers.
  2. Training time per iteration is very slow. This can be seen in the screenshot below.

[Screenshot: training time per iteration]

Testing:

  1. I tested the code by using cached data and feeding the same samples to each GPU. This verified that all GPUs now use the same amount of memory (GPU:0 still takes ~300 MB more, being the primary GPU).
  2. Dataloader initialization is also very fast now, with no spike in memory usage on GPU:0.
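For this kind of check, a small per-rank logging helper could be used (illustrative only, not part of the PR). Note that nvidia-smi additionally shows the CUDA context overhead, likely the ~300 MB mentioned above, which the allocator counters omit.

```python
import torch
import torch.distributed as dist

def log_gpu_memory(tag=""):
    # Illustrative helper: print this rank's view of its GPU memory.
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"[rank {rank}] {tag}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")
```

Calling this on every rank right after dataloader construction makes the kind of imbalance described in symptom 1 easy to spot.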

@wcyjerry

@dmund95 Hi,
I ran into an issue where training froze at the first epoch and memory usage blew up (possibly a leak, in host memory rather than on the GPU), and I had no clue why until I saw this PR.
I think this is a bug in the framework: if the start method is set to 'spawn', the code should not also be launched with torch.distributed.launch.
After making this change, the freeze is gone and training starts normally.
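A simple guard along these lines (a sketch, not the framework's actual code) avoids the conflict by skipping any multiprocessing setup when a launcher is already managing the processes:

```python
import os
import torch.multiprocessing as mp

# Sketch of a guard: torch.distributed.launch / torchrun export LOCAL_RANK for
# each process they create. Only set the 'spawn' start method (and spawn
# workers ourselves) when that variable is absent, i.e. no launcher is in charge.
launched_externally = "LOCAL_RANK" in os.environ
if not launched_externally:
    mp.set_start_method("spawn", force=True)
```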
