
feat: Deepspeed Trainer #10100

Merged: 3 commits merged into searcher-context-removal from deepspeed_trainer_4 on Oct 25, 2024
Conversation

@MikhailKardash (Contributor) commented Oct 22, 2024

Ticket

https://hpe-aiatscale.atlassian.net/browse/MD-501

Description

Add the DeepSpeed Trainer API (a usage sketch follows this list)
Refactor some PyTorch training classes
Reinstate DCGANTrial as an example of the DeepSpeed Trainer
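
For context, a minimal sketch of driving a trial through the new Trainer API. This is illustrative only: MyDeepSpeedTrial is a placeholder for a user-defined DeepSpeedTrial subclass, and the det_ds.init()/Trainer/fit() entry points are assumed to mirror the existing PyTorch Trainer API; pytorch.Batch is the unit mentioned in the review discussion below.

from determined import pytorch
from determined.pytorch import deepspeed as det_ds


def main():
    # Assumed entry point: a context manager that sets up the core/distributed
    # context, mirroring the existing PyTorch Trainer's init().
    with det_ds.init() as train_context:
        trial = MyDeepSpeedTrial(train_context)  # placeholder DeepSpeedTrial subclass
        trainer = det_ds.Trainer(trial, train_context)
        # Train for a fixed number of batches; a managed run can later be
        # continued for more batches, as exercised in the test plan below.
        trainer.fit(max_length=pytorch.Batch(100))


if __name__ == "__main__":
    main()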

Test Plan

  1. Run the DCGAN trainer example. Instructions are in the README included with tests/deepspeed/dcgan. After it finishes, continue the run for 100 more batches.
  2. Run the GPT-NeoX example. After it finishes, continue the run for 100 more batches.
  3. DeepSpeed unit tests pass.

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@cla-bot cla-bot bot added the cla-signed label Oct 22, 2024
@determined-ci determined-ci requested a review from a team October 22, 2024 19:50
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Oct 22, 2024
@azhou-determined azhou-determined self-assigned this Oct 22, 2024
harness/determined/pytorch/deepspeed/_trainer.py (outdated review thread, resolved)
harness/determined/pytorch/__init__.py (outdated review thread, resolved)
Contributor:

i feel like maybe this should just stay where it is, in pytorch_trial. deepspeed is a subpackage of pytorch anyway, and lots of this functionality is very specific to the training loop implementation of pytorch trial, which happens to be shared by deepspeed trial. also, users import these as pytorch.Batch etc.

@MikhailKardash (Contributor, Author) replied Oct 22, 2024

Well, user imports wouldn't change regardless of whether this is in pytorch_trial.py versus _util.py.
What about renaming it to _training_utils.py? I'd rather keep things that are shareable out of _pytorch_trial.py rather than in it.

@azhou-determined (Contributor) replied Oct 24, 2024

> Well, user imports wouldn't change regardless of whether this is in pytorch_trial.py versus _util.py.

yeah, that's what i'm saying: user imports are the same anyway, so moving it is a purely internal concern. i was hoping that deepspeed trial could subclass pytorch trial at some point, so i guess that's how i was thinking about it.

i'm fine either way on this. i do kinda like _training_utils.py better... what about _trainer_utils.py? feels like that's slightly better than _util.

harness/determined/pytorch/deepspeed/_trainer.py (outdated review thread, resolved)
info = det.get_cluster_info()

if torch.cuda.is_available():
    deepspeed.init_distributed()
Contributor:

existing code has:

# We use an environment variable to allow users to enable custom initialization routine for
# distributed training since the pre_execute_hook runs before trial initialization.
manual_dist_init = os.environ.get("DET_MANUAL_INIT_DISTRIBUTED")
if not manual_dist_init:
    # DeepSpeed's init_distributed handles situations in which only 1 gpu is used and
    # also handles multiple calls to init in one process.
    deepspeed.init_distributed(auto_mpi_discovery=False)

which we probably need

Contributor:

wait, but i think the whole point of that manual flag is to not do this if DET_MANUAL_INIT_DISTRIBUTED is false.

Contributor:

this whole block would be easier to read as:

if distributed_backend.use_deepspeed():
    manual_dist_init = os.environ.get("DET_MANUAL_INIT_DISTRIBUTED")
    if not manual_dist_init:
        deepspeed.init_distributed(auto_mpi_discovery=False)
    return core.DistributedContext.from_deepspeed()

@MikhailKardash (Contributor, Author):

I believe this would break my example, which can be run on a GPU without explicitly calling the launcher, via python trainer.py.

Contributor:

this should only get called if local training != true tho
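
To illustrate the guard being discussed, a rough sketch. The local_training flag and the helper name maybe_init_distributed are assumptions; distributed_backend.use_deepspeed(), DET_MANUAL_INIT_DISTRIBUTED, and core.DistributedContext.from_deepspeed() are taken from the snippets above.

import os

import deepspeed
from determined import core


def maybe_init_distributed(distributed_backend, local_training: bool):
    # Local runs started with `python trainer.py` skip distributed init entirely.
    if local_training:
        return None
    if distributed_backend.use_deepspeed():
        # Honor the escape hatch for users who run their own init routine.
        if not os.environ.get("DET_MANUAL_INIT_DISTRIBUTED"):
            # init_distributed handles the single-GPU case and repeated calls.
            deepspeed.init_distributed(auto_mpi_discovery=False)
        return core.DistributedContext.from_deepspeed()
    return None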

@MikhailKardash (Contributor, Author):

updated

examples/deepspeed/trainer/model_def.py (2 outdated review threads, resolved)
examples/deepspeed/trainer/trainer.py (2 outdated review threads, resolved)
examples/deepspeed/trainer/mnist.yaml (outdated review thread, resolved)

if self.is_chief:
    # We assume these stateful objects should be the same across slots and only have
    # the chief save them.
    util.write_user_code(path, self.env.on_cluster)

    if self.wlsq is not None:
Contributor:

        # We assume these stateful objects should be the same across slots and only have
        # the chief save them.

not a big deal, but previously we were saving the training state only on the chief (now TrialState)

@MikhailKardash (Contributor, Author):

DeepSpeedTrial previously called _save() on every rank, so each GPU had its own state. The only thing done once was/is saving user code.

Contributor:

but the saving of the training state is also gated by the chief worker?

        if self.is_chief:
            # We assume these stateful objects should be the same across slots and only have
            # the chief save them.
            util.write_user_code(path, self.env.on_cluster)

            if self.wlsq is not None:
                with path.joinpath("workload_sequencer.pkl").open("wb") as f:
                    pickle.dump(self.wlsq.get_state(), f)

@MikhailKardash (Contributor, Author):

added state to the top self.is_chief portion of this method
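
For reference, a sketch of what the chief-only save might look like after that change. Only the gating on self.is_chief and the existing write_user_code/wlsq lines come from the snippets above; the trial_state.pkl filename and the self.state attribute are hypothetical.

if self.is_chief:
    # We assume these stateful objects should be the same across slots and only have
    # the chief save them.
    util.write_user_code(path, self.env.on_cluster)

    # Hypothetical: persist the trial's training state alongside the user code.
    with path.joinpath("trial_state.pkl").open("wb") as f:
        pickle.dump(self.state, f)

    if self.wlsq is not None:
        with path.joinpath("workload_sequencer.pkl").open("wb") as f:
            pickle.dump(self.wlsq.get_state(), f)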

@rb-determined-ai rb-determined-ai force-pushed the searcher-context-removal branch 2 times, most recently from 6ad58a5 to 7212d0e on October 23, 2024 16:24
@MikhailKardash MikhailKardash force-pushed the deepspeed_trainer_4 branch 3 times, most recently from 6a3de3d to d623e28 on October 23, 2024 16:58
examples/deepspeed/trainer/requirements.txt (outdated review thread, resolved)
examples/deepspeed/trainer/gan_model.py (outdated review thread, resolved)
examples/deepspeed/trainer/mnist_trainer.yaml (outdated review thread, resolved)
examples/deepspeed/trainer/model_def.py (2 outdated review threads, resolved)
examples/deepspeed/trainer/trainer.py (3 outdated review threads, resolved)
harness/determined/pytorch/deepspeed/_deepspeed_context.py (2 outdated review threads, resolved)
@MikhailKardash MikhailKardash marked this pull request as ready for review October 24, 2024 00:10
@MikhailKardash MikhailKardash requested a review from a team as a code owner October 24, 2024 00:10
@MikhailKardash MikhailKardash requested review from mackrorysd and removed request for a team October 24, 2024 00:10

codecov bot commented Oct 24, 2024

Codecov Report

Attention: Patch coverage is 68.31923% with 262 lines in your changes missing coverage. Please review.

Project coverage is 54.10%. Comparing base (60bde17) to head (acf5089).
Report is 1 commit behind head on searcher-context-removal.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...s/determined/pytorch/deepspeed/_deepspeed_trial.py | 71.12% | 108 Missing ⚠️ |
| harness/determined/pytorch/deepspeed/_trainer.py | 49.54% | 56 Missing ⚠️ |
| harness/determined/pytorch/_trainer_utils.py | 57.50% | 34 Missing ⚠️ |
| ...ts/experiment/integrations/test_deepspeed_trial.py | 77.09% | 30 Missing ⚠️ |
| ...determined/pytorch/deepspeed/_deepspeed_context.py | 53.22% | 29 Missing ⚠️ |
| harness/determined/pytorch/_pytorch_trial.py | 78.26% | 5 Missing ⚠️ |
Additional details and impacted files
@@                     Coverage Diff                      @@
##           searcher-context-removal   #10100      +/-   ##
============================================================
+ Coverage                     53.72%   54.10%   +0.38%     
============================================================
  Files                          1256     1258       +2     
  Lines                        156491   156839     +348     
  Branches                       3616     3616              
============================================================
+ Hits                          84074    84864     +790     
+ Misses                        72284    71842     -442     
  Partials                        133      133              
| Flag | Coverage Δ |
| --- | --- |
| backend | 45.83% <ø> (-0.02%) ⬇️ |
| harness | 71.14% <68.31%> (+2.06%) ⬆️ |
| web | 54.01% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| harness/determined/pytorch/__init__.py | 100.00% <100.00%> (ø) |
| harness/determined/pytorch/deepspeed/__init__.py | 100.00% <100.00%> (ø) |
| ...ests/experiment/fixtures/deepspeed_linear_model.py | 99.46% <100.00%> (+73.46%) ⬆️ |
| harness/determined/pytorch/_pytorch_trial.py | 82.93% <78.26%> (+2.82%) ⬆️ |
| ...determined/pytorch/deepspeed/_deepspeed_context.py | 63.02% <53.22%> (+36.60%) ⬆️ |
| ...ts/experiment/integrations/test_deepspeed_trial.py | 85.84% <77.09%> (+61.98%) ⬆️ |
| harness/determined/pytorch/_trainer_utils.py | 57.50% <57.50%> (ø) |
| harness/determined/pytorch/deepspeed/_trainer.py | 49.54% <49.54%> (ø) |
| ...s/determined/pytorch/deepspeed/_deepspeed_trial.py | 75.61% <71.12%> (+61.01%) ⬆️ |

... and 5 files with indirect coverage changes

@determined-ci (Collaborator) commented:

Docsite preview being generated for this PR.
You can (eventually) find the generated docsite here.

@determined-ci determined-ci removed the documentation Improvements or additions to documentation label Oct 24, 2024
Contributor:

feel like num_micro_batches_per_slot and train_micro_batch_size_per_gpu should be get_* methods, following the convention of the rest of the methods here.

@MikhailKardash (Contributor, Author):

replaced @property with get_
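
As an illustration of that rename, a minimal before/after sketch. The class name and backing attribute names are assumptions; only the two accessor names come from the comment above.

class DeepSpeedTrialContext:
    # Before: property-style access, e.g. context.num_micro_batches_per_slot
    #         and context.train_micro_batch_size_per_gpu.

    # After: get_* methods, matching the convention of the other context methods.
    def get_num_micro_batches_per_slot(self) -> int:
        return self._num_micro_batches_per_slot

    def get_train_micro_batch_size_per_gpu(self) -> int:
        return self._train_micro_batch_size_per_gpu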

@azhou-determined (Contributor) commented:

the gpt_neox example still uses the legacy harness trial. we should either get rid of the example if it's not useful, or port it to trainer

@dannysauer (Member) left a comment

Infra part is fine

@determined-ci determined-ci added the documentation Improvements or additions to documentation label Oct 25, 2024
@determined-ci determined-ci removed the documentation Improvements or additions to documentation label Oct 25, 2024
@tara-det-ai (Member) left a comment

LGTM

refactors max_length dependency and searcher context out of DeepSpeedTrial
@MikhailKardash MikhailKardash merged commit 46c2de8 into searcher-context-removal Oct 25, 2024
66 of 90 checks passed
@MikhailKardash MikhailKardash deleted the deepspeed_trainer_4 branch October 25, 2024 19:05
rb-determined-ai pushed a commit that referenced this pull request Oct 25, 2024
Co-authored-by: Anda Zhou <83614683+azhou-determined@users.noreply.github.com>
azhou-determined added a commit that referenced this pull request Oct 25, 2024
Co-authored-by: Anda Zhou <83614683+azhou-determined@users.noreply.github.com>
7 participants