
feat: Deepspeed Trainer #10100

Merged: 3 commits merged into searcher-context-removal from deepspeed_trainer_4 on Oct 25, 2024
Conversation

@MikhailKardash (Contributor) commented Oct 22, 2024

Ticket

https://hpe-aiatscale.atlassian.net/browse/MD-501

Description

Add the DeepSpeed Trainer API (a usage sketch follows this list)
Refactor some PyTorch training classes
Reinstate DCGANTrial as an example of the DeepSpeed Trainer
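
For context, a minimal sketch of driving a trial through the new Trainer API. This is illustrative only: MyDeepSpeedTrial is a placeholder for a user-defined DeepSpeedTrial subclass, and the det_ds.init()/Trainer/fit() entry points are assumed to mirror the existing PyTorch Trainer API; pytorch.Batch is the unit mentioned in the review discussion below.

from determined import pytorch
from determined.pytorch import deepspeed as det_ds


def main():
    # Assumed entry point: a context manager that sets up the core/distributed
    # context, mirroring the existing PyTorch Trainer's init().
    with det_ds.init() as train_context:
        trial = MyDeepSpeedTrial(train_context)  # placeholder DeepSpeedTrial subclass
        trainer = det_ds.Trainer(trial, train_context)
        # Train for a fixed number of batches; a managed run can later be
        # continued for more batches, as exercised in the test plan below.
        trainer.fit(max_length=pytorch.Batch(100))


if __name__ == "__main__":
    main()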

Test Plan

  1. Run the DCGAN trainer example. Instructions are in the README included with tests/deepspeed/dcgan. After it finishes, continue the run for 100 more batches.
  2. Run the GPT-NeoX example. After it finishes, continue the run for 100 more batches.
  3. DeepSpeed unit tests pass.

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@cla-bot cla-bot bot added the cla-signed label Oct 22, 2024
@determined-ci determined-ci requested a review from a team October 22, 2024 19:50
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Oct 22, 2024
@azhou-determined azhou-determined self-assigned this Oct 22, 2024
harness/determined/pytorch/deepspeed/_trainer.py (outdated review thread, resolved)
harness/determined/pytorch/__init__.py (outdated review thread, resolved)
Contributor:

i feel like maybe this should just stay where it is, in pytorch_trial. deepspeed is a subpackage of pytorch anyway, and lots of this functionality is very specific to the training loop implementation of pytorch trial, which happens to be shared by deepspeed trial. also, users import these as pytorch.Batch etc.

@MikhailKardash (Contributor, Author) replied Oct 22, 2024

Well, user imports wouldn't change regardless of whether this is in pytorch_trial.py versus _util.py.
What about renaming it to _training_utils.py? I'd rather keep things that are shareable out of _pytorch_trial.py rather than in it.

@azhou-determined (Contributor) replied Oct 24, 2024

> Well, user imports wouldn't change regardless of whether this is in pytorch_trial.py versus _util.py.

yeah, that's what i'm saying: user imports are the same anyway, so moving it is a purely internal concern. i was hoping that deepspeed trial could subclass pytorch trial at some point, so i guess that's how i was thinking about it.

i'm fine either way on this. i do kinda like _training_utils.py better... what about _trainer_utils.py? feels like that's slightly better than _util.

harness/determined/pytorch/deepspeed/_trainer.py (outdated review thread, resolved)
info = det.get_cluster_info()

if torch.cuda.is_available():
    deepspeed.init_distributed()
Contributor:

existing code has:

# We use an environment variable to allow users to enable custom initialization routine for
# distributed training since the pre_execute_hook runs before trial initialization.
manual_dist_init = os.environ.get("DET_MANUAL_INIT_DISTRIBUTED")
if not manual_dist_init:
    # DeepSpeed's init_distributed handles situations in which only 1 gpu is used and
    # also handles multiple calls to init in one process.
    deepspeed.init_distributed(auto_mpi_discovery=False)

which we probably need

Contributor:

wait, but i think the whole point of that manual flag is to not do this if DET_MANUAL_INIT_DISTRIBUTED is false.

Contributor:

this whole block would be easier to read as:

if distributed_backend.use_deepspeed():
    manual_dist_init = os.environ.get("DET_MANUAL_INIT_DISTRIBUTED")
    if not manual_dist_init:
        deepspeed.init_distributed(auto_mpi_discovery=False)
    return core.DistributedContext.from_deepspeed()

@MikhailKardash (Contributor, Author):

I believe this would break my example, which can be run on a GPU without explicitly calling the launcher, via python trainer.py.

Contributor:

this should only get called if local training != true tho
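
To illustrate the guard being discussed, a rough sketch. The local_training flag and the helper name maybe_init_distributed are assumptions; distributed_backend.use_deepspeed(), DET_MANUAL_INIT_DISTRIBUTED, and core.DistributedContext.from_deepspeed() are taken from the snippets above.

import os

import deepspeed
from determined import core


def maybe_init_distributed(distributed_backend, local_training: bool):
    # Local runs started with `python trainer.py` skip distributed init entirely.
    if local_training:
        return None
    if distributed_backend.use_deepspeed():
        # Honor the escape hatch for users who run their own init routine.
        if not os.environ.get("DET_MANUAL_INIT_DISTRIBUTED"):
            # init_distributed handles the single-GPU case and repeated calls.
            deepspeed.init_distributed(auto_mpi_discovery=False)
        return core.DistributedContext.from_deepspeed()
    return None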

@MikhailKardash (Contributor, Author):

updated

examples/deepspeed/trainer/model_def.py (2 outdated review threads, resolved)
examples/deepspeed/trainer/trainer.py (2 outdated review threads, resolved)
examples/deepspeed/trainer/mnist.yaml (outdated review thread, resolved)

if self.is_chief:
    # We assume these stateful objects should be the same across slots and only have
    # the chief save them.
    util.write_user_code(path, self.env.on_cluster)

    if self.wlsq is not None:
Contributor:

        # We assume these stateful objects should be the same across slots and only have
        # the chief save them.

not a big deal, but previously we were saving the training state only on the chief (now TrialState)

@MikhailKardash (Contributor, Author):

DeepSpeedTrial previously called _save() on every rank, so each GPU had its own state. The only thing done once was/is saving user code.

Contributor:

but the saving of the training state is also gated by the chief worker?

        if self.is_chief:
            # We assume these stateful objects should be the same across slots and only have
            # the chief save them.
            util.write_user_code(path, self.env.on_cluster)

            if self.wlsq is not None:
                with path.joinpath("workload_sequencer.pkl").open("wb") as f:
                    pickle.dump(self.wlsq.get_state(), f)

@MikhailKardash (Contributor, Author):

added state to the top self.is_chief portion of this method
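
For reference, a sketch of what the chief-only save might look like after that change. Only the gating on self.is_chief and the existing write_user_code/wlsq lines come from the snippets above; the trial_state.pkl filename and the self.state attribute are hypothetical.

if self.is_chief:
    # We assume these stateful objects should be the same across slots and only have
    # the chief save them.
    util.write_user_code(path, self.env.on_cluster)

    # Hypothetical: persist the trial's training state alongside the user code.
    with path.joinpath("trial_state.pkl").open("wb") as f:
        pickle.dump(self.state, f)

    if self.wlsq is not None:
        with path.joinpath("workload_sequencer.pkl").open("wb") as f:
            pickle.dump(self.wlsq.get_state(), f)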

@rb-determined-ai rb-determined-ai force-pushed the searcher-context-removal branch 2 times, most recently from 6ad58a5 to 7212d0e on October 23, 2024 16:24
@MikhailKardash MikhailKardash force-pushed the deepspeed_trainer_4 branch 3 times, most recently from 6a3de3d to d623e28 on October 23, 2024 16:58
examples/deepspeed/trainer/requirements.txt (outdated review thread, resolved)
examples/deepspeed/trainer/gan_model.py (outdated review thread, resolved)
examples/deepspeed/trainer/mnist_trainer.yaml (outdated review thread, resolved)
examples/deepspeed/trainer/model_def.py (2 outdated review threads, resolved)
examples/deepspeed/trainer/trainer.py (3 outdated review threads, resolved)
harness/determined/pytorch/deepspeed/_deepspeed_context.py (2 outdated review threads, resolved)
@MikhailKardash MikhailKardash marked this pull request as ready for review October 24, 2024 00:10
@MikhailKardash MikhailKardash requested a review from a team as a code owner October 24, 2024 00:10
@MikhailKardash MikhailKardash requested review from mackrorysd and removed request for a team October 24, 2024 00:10

codecov bot commented Oct 24, 2024

Codecov Report

Attention: Patch coverage is 68.31923% with 262 lines in your changes missing coverage. Please review.

Project coverage is 54.10%. Comparing base (60bde17) to head (acf5089).
Report is 1 commit behind head on searcher-context-removal.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...s/determined/pytorch/deepspeed/_deepspeed_trial.py | 71.12% | 108 Missing ⚠️ |
| harness/determined/pytorch/deepspeed/_trainer.py | 49.54% | 56 Missing ⚠️ |
| harness/determined/pytorch/_trainer_utils.py | 57.50% | 34 Missing ⚠️ |
| ...ts/experiment/integrations/test_deepspeed_trial.py | 77.09% | 30 Missing ⚠️ |
| ...determined/pytorch/deepspeed/_deepspeed_context.py | 53.22% | 29 Missing ⚠️ |
| harness/determined/pytorch/_pytorch_trial.py | 78.26% | 5 Missing ⚠️ |
Additional details and impacted files
@@                     Coverage Diff                      @@
##           searcher-context-removal   #10100      +/-   ##
============================================================
+ Coverage                     53.72%   54.10%   +0.38%     
============================================================
  Files                          1256     1258       +2     
  Lines                        156491   156839     +348     
  Branches                       3616     3616              
============================================================
+ Hits                          84074    84864     +790     
+ Misses                        72284    71842     -442     
  Partials                        133      133              
| Flag | Coverage Δ |
| --- | --- |
| backend | 45.83% <ø> (-0.02%) ⬇️ |
| harness | 71.14% <68.31%> (+2.06%) ⬆️ |
| web | 54.01% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| harness/determined/pytorch/__init__.py | 100.00% <100.00%> (ø) |
| harness/determined/pytorch/deepspeed/__init__.py | 100.00% <100.00%> (ø) |
| ...ests/experiment/fixtures/deepspeed_linear_model.py | 99.46% <100.00%> (+73.46%) ⬆️ |
| harness/determined/pytorch/_pytorch_trial.py | 82.93% <78.26%> (+2.82%) ⬆️ |
| ...determined/pytorch/deepspeed/_deepspeed_context.py | 63.02% <53.22%> (+36.60%) ⬆️ |
| ...ts/experiment/integrations/test_deepspeed_trial.py | 85.84% <77.09%> (+61.98%) ⬆️ |
| harness/determined/pytorch/_trainer_utils.py | 57.50% <57.50%> (ø) |
| harness/determined/pytorch/deepspeed/_trainer.py | 49.54% <49.54%> (ø) |
| ...s/determined/pytorch/deepspeed/_deepspeed_trial.py | 75.61% <71.12%> (+61.01%) ⬆️ |

... and 5 files with indirect coverage changes

@determined-ci (Collaborator) commented:

Docsite preview being generated for this PR.
You can (eventually) find the generated docsite here.

@determined-ci determined-ci removed the documentation Improvements or additions to documentation label Oct 24, 2024
Contributor:

feel like num_micro_batches_per_slot and train_micro_batch_size_per_gpu should be get_* methods, following the convention of the rest of the methods here.

@MikhailKardash (Contributor, Author):

replaced @property with get_
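
As an illustration of that rename, a minimal before/after sketch. The class name and backing attribute names are assumptions; only the two accessor names come from the comment above.

class DeepSpeedTrialContext:
    # Before: property-style access, e.g. context.num_micro_batches_per_slot
    #         and context.train_micro_batch_size_per_gpu.

    # After: get_* methods, matching the convention of the other context methods.
    def get_num_micro_batches_per_slot(self) -> int:
        return self._num_micro_batches_per_slot

    def get_train_micro_batch_size_per_gpu(self) -> int:
        return self._train_micro_batch_size_per_gpu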

@azhou-determined (Contributor) commented:

the gpt_neox example still uses the legacy harness trial. we should either get rid of the example if it's not useful, or port it to trainer

@dannysauer (Member) left a comment

Infra part is fine

@determined-ci determined-ci added the documentation Improvements or additions to documentation label Oct 25, 2024
@determined-ci determined-ci removed the documentation Improvements or additions to documentation label Oct 25, 2024
@tara-det-ai (Member) left a comment

LGTM

refactors max_length dependency and searcher context out of DeepSpeedTrial
@MikhailKardash MikhailKardash merged commit 46c2de8 into searcher-context-removal Oct 25, 2024
66 of 90 checks passed
@MikhailKardash MikhailKardash deleted the deepspeed_trainer_4 branch October 25, 2024 19:05
rb-determined-ai pushed a commit that referenced this pull request Oct 25, 2024
Co-authored-by: Anda Zhou <83614683+azhou-determined@users.noreply.github.com>
azhou-determined added a commit that referenced this pull request Oct 25, 2024
Co-authored-by: Anda Zhou <83614683+azhou-determined@users.noreply.github.com>
7 participants