Add meta_init, enable it as default init process #84

Merged: 12 commits into pytorch:main on Mar 5, 2024

Conversation

@lessw2020 (Contributor) commented on Feb 25, 2024

This PR enables meta_init functionality to avoid OOM'ing on CPU for larger models.
The core functionality is in meta_init.py, with a few changes in the parallelization code and train.py.
Key items:
1 - this is largely the same as my earlier meta_init PR, but I opened a new one because that was faster than reworking it with all the interim changes.
2 - to address feedback on the previous PR:
a - why do we need meta_init.py; can't we just do:

with torch.device("meta"):
    model = Model.from_args(...)

Unfortunately this does not work because the rope embeddings are treated differently (as a buffer), so the simple lambda passed as param_init_fn in FSDP (lambda module: module.to_device('cuda')) will not invoke or move the rope embeddings, and the model will fail on the first forward pass.
This issue relates to the nn.embeddings not being moved, and to the device being referenced in the forward pass of the current rope class. I have opened #110 to track and investigate this without holding up the working meta init from landing. (A rough sketch of this kind of buffer-aware init appears after this description.)

b - per earlier feedback, meta init is now not optional but simply the default. This should ensure all models leverage it and that we aren't missing anything for future meta_init work.

3 - misc change - I switched model_params to count all parameters normally instead of 'unique params', because the latter does not match what people perceive model size to be.

Testing:
Tested both the debugmodel and the 26B model with and without meta init to confirm the same loss curves.
Note for future reference: if you get a bad init (meta init failure), you will simply not train (the loss is the same every iteration).
If you fail to call reset_parameters after FSDP, you will still train (because we default to torch.randn_like), but your starting loss will be 5x+ higher, telling you that you have not properly initialized the model.
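
As a rough illustrative sketch only (this is not the PR's meta_init.py, and the function name is hypothetical), the kind of FSDP param_init_fn that handles this has to materialize the buffers a module owns directly, not just its parameters, and then re-run that module's own initialization:

```python
import torch
import torch.nn as nn

def materializing_param_init_fn(module: nn.Module) -> None:
    # Hypothetical sketch: FSDP calls param_init_fn per wrapped module, so only
    # touch tensors owned directly by this module (recurse=False).
    owned = list(module.parameters(recurse=False)) + list(module.buffers(recurse=False))
    if any(t.is_meta for t in owned):
        # Allocate real (uninitialized) storage on GPU for params *and* buffers.
        module.to_empty(device=torch.device("cuda"), recurse=False)
        # Re-run this module's own init so the values are meaningful.
        if callable(getattr(module, "reset_parameters", None)):
            module.reset_parameters()
```

The point of the sketch is only that buffers need the same materialize-then-initialize treatment as parameters; the actual workaround lives in meta_init.py, and the rope-buffer follow-up is tracked in #110.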

@facebook-github-bot added the CLA Signed label on Feb 25, 2024.
@wanchaol (Contributor) commented:

> thus the simple lambda call from param_init_fn in FSDP (lambda module: module.to_device('cuda') ) will not invoke or move the rope embeddings

@lessw2020 curious: is this because the buffers are created on the meta device but module.to_device('cuda') won't move the buffers to GPU? IIRC model.to("cuda") would move both params and buffers to cuda: https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/module.py#L848

Review threads resolved on torchtrain/models/llama/__init__.py and train.py (outdated).
@tianyu-l (Contributor) left a comment:

> if you get a bad init (meta init failure) you will simply not train (loss is same every iter)

Would it be possible to just error out if meta init fails?

Review thread resolved on torchtrain/parallelisms/parallelize_llama.py.
@wconstab (Contributor) commented:

I actually kind of like the separation of device changes from parallelism application.

That leads me to think it's nice to keep .cuda(), .to(), and .reset_parameters() out of the parallelize_llama() helper and explicitly in the train loop, but I could be convinced otherwise.

@tianyu-l linked an issue on Feb 28, 2024 that may be closed by this pull request.
@lessw2020 (Contributor, Author) commented:

> > thus the simple lambda call from param_init_fn in FSDP (lambda module: module.to_device('cuda') ) will not invoke or move the rope embeddings
>
> @lessw2020 curious: is this because the buffers are created on the meta device but module.to_device('cuda') won't move the buffers to GPU? IIRC model.to("cuda") would move both params and buffers to cuda: https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/module.py#L848

It's deeper than that - I've opened an issue to track this, resolve it, and thus simplify the whole meta init: #110.
For now, I would like to land a working meta_init and then refine it with the fix for the rope embeddings.

@lessw2020 (Contributor, Author) commented:

Re: meta init simplification - this issue relates to the nn.embeddings not being moved, and to the device being referenced in the forward pass of the current rope class. I have opened #110 to track and investigate this without holding up the working meta init from landing.

@tianyu-l (Contributor) left a comment:

LGTM. Thanks for enabling meta init! Great to see it's working well. Had one inline comment.

Let's follow up on #110 to see if we can get a simpler solution.

Review thread resolved on torchtrain/models/llama/model.py.
@lessw2020 merged commit c5a4308 into pytorch:main on Mar 5, 2024 (4 checks passed).
model.cuda()
# we have now moved from meta to device,
# reset parameters for proper initialization
model.reset_parameters()
A Contributor left a comment:

I apologize that FSDP meta-device init is confusing, but I think this might not be fully correct.

  1. The contract from PyTorch core is supposed to be that module.reset_parameters() only resets/initializes the parameters immediately owned by module, not those of its children/submodules. Here, model: Transformer is initializing all of its submodules' parameters. This contract is the only way for reset_parameters() to always work compositionally; otherwise (as in our case) we must assume Transformer is always the root module.
  2. When we call model.reset_parameters(), the parameters have already been flattened and sharded by FSDP. This means that any initialization method that depends on the tensor shape would be incorrect. That is why we would normally want users to do the correct initialization in the param_init_fn.

For this Llama case, it looks like perhaps the reset_parameters() has been written to not depend on the tensor shape directly?

@lessw2020 (Contributor, Author) replied:

Correct, the init does not depend on tensor shape directly.
I see your point about resetting only a module's own params and not its children's, but let's meet to discuss, as I also have questions on how meta_init should work for FSDP2, and we can get an updated implementation.
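
To make the reset_parameters() contract discussed above concrete, here is a minimal illustrative sketch (hypothetical module, not this repo's code): each module initializes only the tensors it owns directly, and a separate driver walks the tree, which keeps initialization compositional no matter which module ends up as the root.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim))

    def reset_parameters(self) -> None:
        # Initialize only the tensors this module owns directly.
        nn.init.ones_(self.weight)

def reset_all(model: nn.Module) -> None:
    # Compositional driver: ask each submodule to reset its own parameters,
    # instead of having the root module reach into its children.
    for module in model.modules():
        if callable(getattr(module, "reset_parameters", None)):
            module.reset_parameters()
```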

lessw2020 added a commit that referenced this pull request Apr 18, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Sep 8, 2024
* Load missing keys default from argparse (#111)

```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
[rank0]:2024-03-04 17:01:28,834 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-03-04 17:01:28,857 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Starting job: debug training
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Building llama
[rank0]:2024-03-04 17:01:29,719 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-04 17:01:29,719 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-03-04 17:01:31,187 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-03-04 17:01:31,188 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-04 17:01:31,347 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-04 17:01:31,347 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-03-04 17:01:32,502 - root - INFO - Applied FSDP to the model...
[rank0]:2024-03-04 17:01:32,503 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-03-04 17:01:32,504 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240304-1701.
[rank0]:2024-03-04 17:01:32,901 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-03-04 17:01:34,806 - root - INFO - step:  1  loss: 10.8424  iter:  1.8688  data: 0.0316  lr: 0.00026667
[rank0]:2024-03-04 17:01:34,891 - root - INFO - step:  2  loss: 10.7581  iter:  0.0476  data: 0.0357  lr: 0.00053333
[rank0]:2024-03-04 17:01:34,970 - root - INFO - step:  3  loss: 10.6239  iter:   0.045  data: 0.0333  lr: 0.0008
[rank0]:2024-03-04 17:01:35,048 - root - INFO - step:  4  loss: 10.4163  iter:  0.0455  data: 0.0323  lr: 0.0007
[rank0]:2024-03-04 17:01:35,127 - root - INFO - step:  5  loss: 10.1529  iter:  0.0459  data: 0.032  lr: 0.0006
[rank0]:2024-03-04 17:01:35,206 - root - INFO - step:  6  loss:  9.8899  iter:  0.0468  data: 0.0311  lr: 0.0005
[rank0]:2024-03-04 17:01:35,284 - root - INFO - step:  7  loss:  9.7204  iter:  0.0461  data: 0.0312  lr: 0.0004
[rank0]:2024-03-04 17:01:35,425 - root - INFO - step:  8  loss:  9.3757  iter:  0.0457  data: 0.0319  lr: 0.0003
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-03-04 17:01:35,537 - root - INFO - step:  9  loss:  9.1883  iter:  0.0762  data: 0.0318  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-03-04 17:01:35,958 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-03-04 17:01:35,971 - root - INFO - step: 10  loss:  9.1212  iter:  0.0808  data: 0.0319  lr: 0.0001
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average iter time: 0.0553 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average data load time: 0.0317 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

* Add meta_init, enable it as default init process (#84)


* Fix feedback from PR 111 (#113)

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

* fix SP minor issues

ghstack-source-id: 5133a8d97ad209b569e0fc528e58daafdd31d80d
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/114

* enable loss parallel in SP

ghstack-source-id: a0c8b4454f75ad1cd9824ac89a1df0182f6a7d8c
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/112

* Float8_experimental option for training (#102)

* add miniPile dataset for pretraining, 1M entries (solves the 'out of data' at 40 iters issue) (#88)

This PR adds the minipile (1M entries, 6GB) dataset as an option for pretraining with torchtrain.
It resolves the issue where we run out of data after 40 iterations with
the default alpaca dataset.
Per @tianyu-l's excellent suggestion, I have refactored to a single hf_datasets.py file that supports both minipile and alpaca, since it turned out there was no need for a different tokenizer, etc.
Also cleaned up the datasets package so that create_tokenizer is exposed
directly, and thus all public apis can be used directly from
torchtrain.datasets.
Lastly, added a warning if/when a dataset is being re-looped, so users don't get burned by overfitting:
<img width="1294" alt="Screenshot 2024-03-06 at 5 11 09 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/82480b6f-c677-4794-80c5-5c10b037732a">


Adds a color highlight to showcase what dataloader was built:
<img width="1360" alt="Screenshot 2024-03-05 at 9 19 10 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/4717ec6a-14bb-4283-a3ae-fa40c27deee0">
and
<img width="1360" alt="Screenshot 2024-03-05 at 9 22 01 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/dbf32d51-2dd4-4526-8855-9b33b627559e">


Usage:
just add "minipile" or "alpaca" as the dataset in the training config
toml file.
<img width="439" alt="Screenshot 2024-02-25 at 12 35 26 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/1afbaed1-07f8-4e37-b8cc-80190db7fb27">

Testing:
Verified training loss is improving and ran for 100 iters to verify the out-of-data issue no longer occurs with minipile.
Reran with alpaca and saw the expected out-of-data at 40 iters without the infinite loop option; it runs to 100 iters with infinite enabled.

Notes:
I did not make this a default dataset since for debugmodel, mostly
running 10 iters is fine and there's 6GB to pull down.
<img width="869" alt="Screenshot 2024-02-25 at 12 30 29 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/1070a80a-ad20-4f0f-a860-e13caa3120a0">

* add data loading option to load from local file system

ghstack-source-id: 3c930054d3b04faf3866048740a2ef887d066dd6
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/117

* add llama 13B configs

ghstack-source-id: 733bf85716cda3a5b9af780eba79c9b5dd66abad
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/121

* add llama 70B toml

ghstack-source-id: d7cd26d84aa2442ac45223992e1766397e52c8d8
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/122

* set betas and weight decay for optimizers

according to suggestions in https://github.com/pytorch/torchtrain/issues/118#issuecomment-1986470746

ghstack-source-id: 357f0872cd1c9bad2c4c256d47adbd3f716a7651
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/123

* Add c4 dataset (177M, streaming), update multi-node support for latest job configs (#124)

This PR:
1 - adds the English-language portion of the c4 dataset, which has 177M entries (https://huggingface.co/datasets/allenai/c4).

Due to the size, streaming is enabled as the default.  
This is the allen-ai/c4, as apparently the original c4 is being
deprecated and HF advises to use allen-ai now.

For comparison per @tianyu-l request - 40 iterations avg time:
alpaca cached loading: Average data load time: 0.0279 seconds
c4 streaming loading: Average data load time: 0.0290 seconds

There is a longer initial delay during the 'preparing c4' vs alpaca
(i.e. 45 seconds vs 10 seconds), but after that speed is similar.

Dataset sample (not displayed in training, just an excerpt I pulled to
double check the data flow):
<img width="1233" alt="Screenshot 2024-03-08 at 5 31 06 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/94915f80-da70-48d1-8c43-43f874fef121">

2 - I also updated the multi-node slurm file to account for the new job
config.

Test:
verified no looping with 100 iterations, 
sampled data streamed to verify.
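
For reference, a minimal sketch of streaming C4 with the standard Hugging Face datasets API (this is not torchtrain's hf_datasets.py wrapper); streaming avoids downloading the full 177M-row split up front:

```python
from datasets import load_dataset

# Stream the English split of allenai/c4 instead of materializing it locally.
ds = load_dataset("allenai/c4", name="en", split="train", streaming=True)

for i, sample in enumerate(ds):
    print(sample["text"][:80])  # peek at a few documents
    if i == 2:
        break
```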

* Add openwebtext dataset for larger scale training without shuffling (#130)

This PR adds the openwebtext 1M dataset.
This is a homogeneous dataset, so we are able to train successfully while not having any shuffle in our dataset loader.

1 - adds the dataset to hf_datasets
2 - makes openwebtext the default dataset for 13B and 70B, since that is the preferred choice for larger-scale training.

Testing - ran 5K iters (9 nodes) to verify no spiking issues:

<img width="787" alt="Screenshot 2024-03-12 at 9 50 57 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/420fa1fc-50f8-47bc-9b07-02c8fa132e7c">

* [TorchTrain][Checkpoint] Fix TrainState state_dict to unblock loading (#131)

This fix would temporarily unblock loading. So we won't run into the
issue of:

```
[rank0]:[rank0]:     train_state.losses.append(train_state.current_loss)
[rank0]:[rank0]: AttributeError: 'float' object has no attribute 'append'
```

However, current_loss and losses are still not correct, since with the current setup, losses and current_loss would differ across ranks. Also, we don't know the size of losses because it depends on the number of steps. So loading still works, but the values of current_loss and losses are not being loaded correctly.

I will follow up with further fixes.

* improve logging

ghstack-source-id: de61ec093b43a2ccbf1156c76ba81ecd698a6a8a
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/132

* use SequenceParallel style in tp/sp (#133)

simplify things given we already have SequenceParallel style landed in
main
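
For context, a rough sketch (hypothetical layer names, not the repo's parallelize_llama plan) of applying the built-in parallel styles to one transformer block with torch.distributed.tensor.parallel:

```python
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    SequenceParallel,
    parallelize_module,
)

def apply_tp_sp(block, tp_mesh):
    # Hypothetical FQNs: the norm uses the SequenceParallel style, while the
    # attention projections are column-/row-wise sharded across the tp mesh.
    plan = {
        "attention_norm": SequenceParallel(),
        "attention.wq": ColwiseParallel(),
        "attention.wo": RowwiseParallel(),
    }
    return parallelize_module(block, tp_mesh, plan)
```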

* support TP-only parallelism

ghstack-source-id: c13ebb8de8e8e9203624b5dd710a046d17311b0f
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/137

* disable verbose print from profiling

ghstack-source-id: ca6eb8f42bf3c2a59d8e6389e7fe94ed85103099
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/136

* add Selective layer  activation checkpointing, single control for turning AC on or off. (#125)

This PR:
1 - adds selective layer checkpointing - this lets the user checkpoint every x-th layer:
i.e. 2 = every other layer is checkpointed.

The spec for the config was updated by Wanchao, so we now have this layout for AC, which is hopefully self-explanatory (it covers None, full, selective op, or selective layer, and the layer filtering policy):
<img width="941" alt="Screenshot 2024-03-13 at 6 09 52 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/4b992286-1fbd-4a14-957a-4325f81a9ab4">


Thus, it lets the user toggle from traditional 'all layers' checkpointing to more and more fine-grained checkpointing.
Note that I implemented this for IBM last summer, and in their llama testing every 2nd layer was the best bang/buck, so I have made that the default.

2 - the config file has been updated to make a new
[activation_checkpointing] section and make it easier to modify vs being
dumped into the training section.

Testing and results:
I tested all the AC options to ensure all options are working, and that
we fail if both types are set to true in config:
<img width="608" alt="Screenshot 2024-03-09 at 3 43 52 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/e3c20fbf-73e2-492d-9fb9-f32e772e239e">

* remove per-iter synchronize

ghstack-source-id: 581c9115e89d3de57e558175b527c12c06a6808c
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/134

* Shorten nccl comm  timeout and enable flight recorder dumping (#103)

Timeout
-------

Whether during iterative debugging or long-running training, it's convenient to find out ASAP about a failure. The default timeout is way too long and leads to wasted cluster time or developer frustration.
  
Timeout can be adjusted via cmdline or in .toml if it needs to be larger
for a particular model.

Another useful pattern can be to set a large timeout for initialization
and then tighten it after iteration 1. We can add this later if desired.

Ideally we could pass the timeout to the device mesh ctor, but it's not
ready yet. Also, we can change timeouts of the existing PGs after
creating them, but that's more LOC and not necessary unless we want to
change the timeouts at runtime.

Dumps
-----

Dumping on timeout should be a safe default for everyone. It has the
side-effect of requiring a dump path which defaults to ~/pgnccl_dump but
can be overridden via DUMP_PATH env.

The raw content of the dump is a pickle that is intended to be consumed
through scripts/tools which are under development, so it may not be easy
to know how to use these for now. As the tooling matures, we should
provide reference docs and probably print out pointers in the logs when
we perform the dump.


Test plan:
tested locally by adding a rank0 sleep for 10sec inside the training
loop, validating all 8 ranks dumped a trace.
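
The timeout half of this is just the standard torch.distributed knob; a minimal sketch (the flight-recorder dump wiring from the PR is not shown here):

```python
from datetime import timedelta

import torch.distributed as dist

# Fail fast on a hang: shorten the collective timeout from the default
# (tens of minutes) to something tight once training is known to be stable.
dist.init_process_group(backend="nccl", timeout=timedelta(seconds=100))
```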

* fix up gpu memory monitoring and logging

ghstack-source-id: 2f79d081c7724dbc34f357913671e8aefdf437b1
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/147

* Separate timeout during init and training (#149)

Allow a tighter timeout during training than during init.

Init includes the first train step, as well as any loading and setup. It
can be slower and less predictable due to various factors including lazy
initialization or jit compilation.

After the first train step, we expect more predictable runtime and
benefit from a tighter timeout to give quick feedback on a hang.

Tested by pasting this code in 2 places
```
if dp_mesh.get_local_rank() == 0 and train_state.step == 1:
   import time
   time.sleep(10)
```

(a) before calling set_pg_timeout, which did not cause a timeout (b)
after calling set_pg_timeout, which timed out

* Update activation check with updates to config manager (#152)

* Refactor to clean up parallelisms/__init__.py

(second attempt, didn't land correctly)

ghstack-source-id: 3dfec3ed134105cc5a951f8db160c8c2a9b3349b
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/154

* enable gc control scheduling to help avoid stragglers (#148)

This PR adds control over Python garbage collection to help avoid
stragglers during large scale training.
updates - this feature is now exposed as a controllable option
gc_schedule, with a default of 50.
0 = not enabled.
int = schedules gc at every int iters during training loop. 
<img width="1078" alt="Screenshot 2024-03-15 at 12 39 26 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/1ee387c5-f0a6-4366-936c-a1e281dad88f">

Effectively we disable the gc, run one collection to ensure a good
starting point, and then at the start of each gc_schedule iter, we call
gc to free up things.

By enforcing a fixed schedule for collection, it helps all ranks stay more in sync.
Point of reference: on 512-GPU FSDP, adding this (gc_schedule=1) gave a perf boost of ~1.5% per iter just by virtue of better sync.

(this was originally developed during dist compiler to resolve
stragglers, I believe @fegin came up with this solution).
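
A minimal sketch of the schedule-driven GC control described above (assumed shape, not the repo's exact utility): disable automatic collection, collect once up front, then collect every gc_schedule steps so all ranks pause at the same iterations:

```python
import gc

class GCControl:
    def __init__(self, gc_schedule: int = 50):
        # gc_schedule = 0 disables the feature entirely.
        self.gc_schedule = gc_schedule
        if gc_schedule > 0:
            gc.disable()
            gc.collect(1)  # start the training loop from a clean state

    def maybe_collect(self, step: int) -> None:
        if self.gc_schedule > 0 and step > 0 and step % self.gc_schedule == 0:
            gc.collect(1)
```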

* Add float8 specific parallel strategies (#153)

* add MFU to metrics

ghstack-source-id: 995efd6f460f3fe83ecf8d72c2178554f325485b
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/151

* disable buffer reuse for compile for now (#156)

disable buffer reuse for compile to have close numerics to eager mode,
as suggested by @Chillee

This is probably only a temporary change until the buffer-reuse fix lands in inductor.

* refactor config manager and support cmd overrides (#157)

This PR supports explicit cmd overrides, to allow infra layers to
override certain options (the most important one is dump_folder)

* Add support for generating debug traces on failure

* rename sequence_parallel to tensor_parallel (#162)

This PR renames sequence_parallel to tensor_parallel, as sequence parallel is only applied to the rmsnorm layers; the broader name should be tensor_parallel, possibly with a sequence_parallel option enabled.

ghstack broken :( so using direct branch push instead

* add basic AC configs for 13B and 70B (#169)

As titled: currently 13B uses selective op and 70B uses selective layer; we can do some more experiments and adjust the configs later.

* [TorchTrain][Checkpoint] Update train state to include global_avg_losses and global_max_losses (#167)

Based on discussion with @tianyu-l, we decided to only checkpoint
`global_avg_losses` and `global_max_losses` per log frequency iteration
to avoid all_reduce and device sync every iteration.
`TrainState.current_loss` and `TrainState.losses` are removed from
TrainState `state_dict()` and `load_state_dict()` call.


Tested saving/loading at 30 steps with log_frequency = 10, and loading at 40 steps to resume training. The numerics in the global_avg_losses and global_max_losses lists align with expectations.

```
Step 30 save:
[rank0]:before save: 
self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21])


Step 30 load:
[rank0]:after load:
self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21])


Step 40 load and resume training:
[rank0]:before save: 
self.states['train_state']=TrainState(step=40, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945, 5.596909999847412], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555, 5.6796345710754395], log_steps=[1, 11, 21, 31])
```
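
For reference, a sketch of the reduction being checkpointed (assumed helper, standard torch.distributed calls): the local average/max loss is all-reduced across data-parallel ranks only at logging steps, so there is no device sync every iteration:

```python
import torch
import torch.distributed as dist

def global_avg_and_max_loss(local_avg: float, local_max: float, device) -> tuple[float, float]:
    # Pack the two local statistics and reduce them across ranks.
    t = torch.tensor([local_avg, local_max], device=device)
    avg, mx = t[0].clone(), t[1].clone()
    dist.all_reduce(avg, op=dist.ReduceOp.SUM)
    dist.all_reduce(mx, op=dist.ReduceOp.MAX)
    return (avg / dist.get_world_size()).item(), mx.item()
```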

* Basic integration test infra (#170)

Summary:
PR adds an option `use_for_integration_test`. When set to `True`, this adds the config to the integration test suite. A test runner picks up all the configs marked for integration test and runs them.

Test Plan:
```
=====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757]
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
[rank0]:2024-03-27 09:46:32,214 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-03-27 09:46:32,372 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-03-27 09:46:32,375 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-03-27 09:46:32,377 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-27 09:46:32,384 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-03-27 09:46:32,384 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-03-27 09:46:34,015 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-27 09:46:34,024 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-27 09:46:34,025 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied FSDP to the model
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Model fully initialized via reset_parameters
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Gradient scaling not enabled
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-0946
[rank0]:2024-03-27 09:46:34,809 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
[rank0]:  warnings.warn(
[rank0]:2024-03-27 09:46:35,627 - root - INFO - step:  1  loss: 10.9486  memory:  9.42GiB(9.91%)  wps: 20,066  mfu: 0.25%
[rank0]:2024-03-27 09:46:35,627 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-03-27 09:46:35,705 - root - INFO - step:  2  loss: 10.8786  memory: 11.38GiB(11.97%)  wps: 212,046  mfu: 2.60%
[rank0]:2024-03-27 09:46:35,786 - root - INFO - step:  3  loss: 10.7362  memory: 11.38GiB(11.97%)  wps: 204,441  mfu: 2.50%
[rank0]:2024-03-27 09:46:35,863 - root - INFO - step:  4  loss: 10.5094  memory: 11.38GiB(11.97%)  wps: 216,800  mfu: 2.66%
[rank0]:2024-03-27 09:46:35,939 - root - INFO - step:  5  loss: 10.2755  memory: 11.38GiB(11.97%)  wps: 216,527  mfu: 2.65%
[rank0]:2024-03-27 09:46:36,016 - root - INFO - step:  6  loss: 10.0318  memory: 11.38GiB(11.97%)  wps: 214,117  mfu: 2.62%
[rank0]:2024-03-27 09:46:36,093 - root - INFO - step:  7  loss:  9.7929  memory: 11.38GiB(11.97%)  wps: 216,509  mfu: 2.65%
[rank0]:2024-03-27 09:46:36,192 - root - INFO - step:  8  loss:  9.5539  memory: 11.38GiB(11.97%)  wps: 166,639  mfu: 2.04%
[rank0]:2024-03-27 09:46:36,329 - root - INFO - step:  9  loss:  9.3909  memory: 11.38GiB(11.97%)  wps: 120,381  mfu: 1.47%
[rank0]:[rank0]:[W327 09:46:36.744143018 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-03-27 09:46:36,409 - root - INFO - step: 10  loss:  9.2749  memory: 11.38GiB(11.97%)  wps: 207,613  mfu: 2.54%
[rank0]:NCCL version 2.20.5+cuda12.0

```

Reviewers:

Subscribers:

Tasks:

Tags:

---------

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

* Add 2D integration test (FSDP + TP) (#171)

Summary:
Add a 2D test to integration test suite

Test Plan:

```

=====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757]
W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] *****************************************
W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] *****************************************
[rank0]:2024-03-27 14:29:49,466 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-03-27 14:29:49,615 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-03-27 14:29:49,621 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-03-27 14:29:49,623 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-27 14:29:49,630 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-03-27 14:29:49,630 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-03-27 14:29:51,114 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-27 14:29:51,124 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-27 14:29:51,124 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied FSDP to the model
[rank0]:2024-03-27 14:29:51,284 - root - INFO - Model fully initialized via reset_parameters
[rank0]:2024-03-27 14:29:51,284 - root - INFO - Gradient scaling not enabled
[rank0]:2024-03-27 14:29:51,285 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1429
[rank0]:2024-03-27 14:29:52,056 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
[rank0]:  warnings.warn(
[rank0]:2024-03-27 14:29:52,825 - root - INFO - step:  1  loss: 10.7425  memory:  9.42GiB(9.91%)  wps: 21,337  mfu: 0.26%
[rank0]:2024-03-27 14:29:52,825 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-03-27 14:29:52,905 - root - INFO - step:  2  loss: 10.6722  memory: 11.38GiB(11.97%)  wps: 208,060  mfu: 2.55%
[rank0]:2024-03-27 14:29:52,982 - root - INFO - step:  3  loss: 10.5435  memory: 11.38GiB(11.97%)  wps: 213,622  mfu: 2.62%
[rank0]:2024-03-27 14:29:53,060 - root - INFO - step:  4  loss: 10.3359  memory: 11.38GiB(11.97%)  wps: 212,856  mfu: 2.61%
[rank0]:2024-03-27 14:29:53,139 - root - INFO - step:  5  loss: 10.0965  memory: 11.38GiB(11.97%)  wps: 209,326  mfu: 2.56%
[rank0]:2024-03-27 14:29:53,215 - root - INFO - step:  6  loss:  9.8806  memory: 11.38GiB(11.97%)  wps: 216,808  mfu: 2.66%
[rank0]:2024-03-27 14:29:53,292 - root - INFO - step:  7  loss:  9.6442  memory: 11.38GiB(11.97%)  wps: 214,874  mfu: 2.63%
[rank0]:2024-03-27 14:29:53,367 - root - INFO - step:  8  loss:  9.4349  memory: 11.38GiB(11.97%)  wps: 220,877  mfu: 2.70%
[rank0]:2024-03-27 14:29:53,500 - root - INFO - step:  9  loss:  9.2674  memory: 11.38GiB(11.97%)  wps: 123,924  mfu: 1.52%
[rank0]:[rank0]:[W327 14:29:53.248291822 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-03-27 14:29:53,577 - root - INFO - step: 10  loss:  9.1404  memory: 11.38GiB(11.97%)  wps: 214,910  mfu: 2.63%
[rank0]:NCCL version 2.20.5+cuda12.0

=====Integration test: CONFIG_FILE=./train_configs/debug_model_2d.toml NGPU=4 ./run_llama_train.sh=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model_2d.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_2d.toml
W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757]
W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] *****************************************
W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] *****************************************
[rank0]:2024-03-27 14:30:00,872 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-03-27 14:30:01,177 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-03-27 14:30:01,182 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2]
[rank0]:2024-03-27 14:30:01,185 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-27 14:30:01,194 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-03-27 14:30:01,195 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-03-27 14:30:02,807 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-27 14:30:02,818 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-27 14:30:02,819 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-03-27 14:30:02,830 - root - INFO - Applied Sequence Parallelism to the model
[rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied FSDP to the model
[rank0]:2024-03-27 14:30:03,004 - root - INFO - Model fully initialized via reset_parameters
[rank0]:2024-03-27 14:30:03,004 - root - INFO - Gradient scaling not enabled
[rank0]:2024-03-27 14:30:03,005 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1430
[rank0]:2024-03-27 14:30:03,642 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
[rank0]:  warnings.warn(
[rank0]:2024-03-27 14:30:04,528 - root - INFO - step:  1  loss: 10.8502  memory:  5.71GiB(6.01%)  wps: 9,259  mfu: 0.11%
[rank0]:2024-03-27 14:30:04,528 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-03-27 14:30:04,679 - root - INFO - step:  2  loss: 10.7671  memory:  6.69GiB(7.04%)  wps: 54,430  mfu: 0.67%
[rank0]:2024-03-27 14:30:04,773 - root - INFO - step:  3  loss: 10.6390  memory:  6.69GiB(7.04%)  wps: 88,457  mfu: 1.08%
[rank0]:2024-03-27 14:30:04,864 - root - INFO - step:  4  loss: 10.4210  memory:  6.69GiB(7.04%)  wps: 90,384  mfu: 1.11%
[rank0]:2024-03-27 14:30:04,954 - root - INFO - step:  5  loss: 10.1648  memory:  6.69GiB(7.04%)  wps: 93,058  mfu: 1.14%
[rank0]:2024-03-27 14:30:05,067 - root - INFO - step:  6  loss:  9.9451  memory:  6.69GiB(7.04%)  wps: 72,642  mfu: 0.89%
[rank0]:2024-03-27 14:30:05,165 - root - INFO - step:  7  loss:  9.7004  memory:  6.69GiB(7.04%)  wps: 85,096  mfu: 1.04%
[rank0]:2024-03-27 14:30:05,251 - root - INFO - step:  8  loss:  9.4422  memory:  6.69GiB(7.04%)  wps: 95,860  mfu: 1.17%
[rank0]:2024-03-27 14:30:05,399 - root - INFO - step:  9  loss:  9.2144  memory:  6.69GiB(7.04%)  wps: 55,837  mfu: 0.68%
[rank0]:[rank0]:[W327 14:30:05.148473462 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-03-27 14:30:05,496 - root - INFO - step: 10  loss:  9.1710  memory:  6.69GiB(7.04%)  wps: 86,136  mfu: 1.05%
[rank0]:NCCL version 2.20.5+cuda12.0
```

Reviewers:

Subscribers:

Tasks:

Tags:

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

* Used per-parameter FSDP (#165)

**Numeric Parity**
1D FSDP
- Eager: 1k steps of minipile on 8 H100 GPUs, local batch size 8,
sequence length 2048, AC/SAC, bf16 mixed precision, fp32 reduce-scatter
- FSDP1 (AC): 24.81% peak active, 33.82% peak reserved, 6100-6200 WPS
- FSDP1 (SAC): 52.98% peak active, 67.23% peak reserved, 6500-6700 WPS
- FSDP2 (AC): 23.92% peak active, 32.64% peak reserved, 6100-6300 WPS
- FSDP2 (SAC): 52.13% peak active, 62.51% peak reserved, 6600-6800 WPS
    - Loss curves match between FSDP1 and FSDP2
- Memory numbers reported as percentage since that is how they are
logged; can convert against 95.0396 GiB GPU memory
- Compile: same setup as eager
- FSDP2 (AC), buffer reuse disabled: 28.72 GiB (30.22%) peak reserved,
7200-7500 WPS, 33% MFU
- FSDP2 (AC), buffer reuse enabled: 28.90 GiB (30.40%) peak reserved,
7200-7500 WPS, 33% MFU
- FSDP2 (SAC), buffer reuse enabled: 53.83 GiB (56.64%) peak reserved,
8100-8400 WPS, 36% MFU
    - Loss curves slightly better than eager
    - For fun -- how much can we push MFU?
- If we use FSDP2 (SAC) with 16 local batch size (doubled), we get 88.23
GiB (92.84%) peak reserved, 8600 WPS, 38% MFU.
- If we use FSDP2 (no AC) with 8 local batch size, we get 90.28 GiB
(94.99%) peak reserved, 9100-9300 WPS, 40% MFU.
- Why is FSDP2 faster? (1) fp32 reduce-scatter only uses one div kernel
instead of two and (2), `reshard_after_forward=False` for the last
transformer block

2D FSDP
- Eager (2-way SP, 4-way FSDP): 1k steps of minipile on 8 H100 GPUs,
local batch size 16 (to preserve global batch size), sequence length
2048, bf16 mixed precision, fp32 reduce-scatter
- FSDP2 (AC): 50.12% peak active, 60.97% peak reserved, 5800-5900 WPS
- FSDP2 (SAC): 76.49% peak active, 90.14% peak reserved, 6100-6300 WPS
- Loss curves match 8-way FSDP
- FSDP1 + SP has incorrect numerics due to the `FSDP.clip_grad_norm_`
not all-reducing over TP mesh dimension

<details>
<summary> Loss curves </summary>

<img width="732" alt="Screenshot 2024-03-26 at 3 31 19 PM"
src="https://github.com/pytorch/torchtrain/assets/31054793/59ec71cc-ad0a-4dd1-b5c6-a8cbf9ab5e85">

</details>


**Meta-Device Initialization**
- The PyTorch Core guideline is for `module.reset_parameters()` to only
initialize parameters/buffers immediately owned by `module` (i.e.
`module.parameters(recurse=False)` and `module.buffers(recurse=False)`).
- This makes it challenging to specify custom initializations for core
modules like `nn.Linear` and `nn.Embedding`. For example, in
@lessw2020's depth-wise truncated normal initialization, the
`trunc_normal_` standard deviation depends on the layer ID, which is a
property of the `TransformerBlock` but affects the child `nn.Linear`s.
- To disambiguate, I suggest avoiding the name `reset_parameters()` in
the case that we violate the PyTorch Core guideline and instead use a
different name (e.g. `init_weights`).
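
An illustrative sketch of the suggested pattern (hypothetical block and assumed depth-scaling formula, not the torchtrain code): giving the block an `init_weights` that sets a depth-dependent std for its child linears, under a name that deliberately signals it does not follow the strict reset_parameters() contract:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim: int, layer_id: int):
        super().__init__()
        self.layer_id = layer_id
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

    def init_weights(self) -> None:
        # Depth-dependent std (assumed formula): a property of the block that
        # affects its child nn.Linear modules, which is exactly what a strict
        # per-module reset_parameters() cannot express.
        std = 0.02 / (2 * (self.layer_id + 1)) ** 0.5
        for linear in (self.wq, self.wo):
            nn.init.trunc_normal_(linear.weight, mean=0.0, std=std)
```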

**DCP & Save/Load**
- Tested 1D and 2D by specifying `checkpoint_folder =
"/tmp/checkpoint_andgu` in the `.toml`, training until saving a
checkpoint, terminating the run, and restarting the training to load the
checkpoint -- the loss after loading looks reasonable

* plot losses in loaded TrainState to TensorBoard

ghstack-source-id: f13612ce1f739219c31aa2b9222259f9f586126b
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/173

* Removed setting global flag for `swap_tensors` since not needed anymore

ghstack-source-id: 484237b30ba8bf8bb9e7a9cf2c97180d9fb21295
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/178

* Add integration test with compile enabled (#183)

Summary:
same as title

Test Plan:
```

+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model_compile.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_compile.toml
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757]
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
[rank0]:2024-04-01 17:54:35,779 - root - INFO - Starting job: LLaMA debug training
[rank1]:2024-04-01 17:54:35,797 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-04-01 17:54:36,063 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-04-01 17:54:36,069 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-04-01 17:54:36,071 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-04-01 17:54:36,078 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-04-01 17:54:36,078 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank1]:2024-04-01 17:54:36,449 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank1]:2024-04-01 17:54:36,454 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank1]:2024-04-01 17:54:36,456 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-04-01 17:54:36,463 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank1]:2024-04-01 17:54:36,463 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-04-01 17:54:37,631 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-04-01 17:54:37,643 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-04-01 17:54:37,644 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied FSDP to the model
[rank1]:2024-04-01 17:54:38,310 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank1]:2024-04-01 17:54:38,324 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank1]:2024-04-01 17:54:38,325 - root - INFO - GPU capacity: NVIDIA H100 (1) with 95.04GiB memory
[rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied selective activation checkpointing to the model
[rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied FSDP to the model
[rank1]:2024-04-01 17:54:38,699 - root - INFO - Gradient scaling not enabled
[rank1]:2024-04-01 17:54:38,699 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
[rank1]:2024-04-01 17:54:38,701 - root - INFO - Compiling model with torch.compile
[rank0]:2024-04-01 17:54:38,692 - root - INFO - Gradient scaling not enabled
[rank0]:2024-04-01 17:54:38,693 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
[rank0]:2024-04-01 17:54:38,694 - root - INFO - Compiling model with torch.compile
[rank0]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank1]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank1]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank1]:  warnings.warn(
[rank0]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank0]:  warnings.warn(
[rank1]:2024-04-01 17:54:40,498 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:40,493 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:41,992 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:41,985 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:42,180 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:42,187 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,947 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,963 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,971 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,920 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,951 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,974 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:44,029 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:44,033 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:45,907 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:45,933 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:47,561 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:47,667 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:47,649 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:47,706 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,084 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,108 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,110 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,086 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,114 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,131 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:50,546 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:50,638 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:51,901 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:52,025 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:52,734 - root - INFO - step:  1  loss: 10.9746  memory:  9.53GiB(10.03%)  wps: 1,228  mfu: 0.02%
[rank1]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank1]:2024-04-01 17:54:52,813 - root - INFO - step:  2  loss: 10.9091  memory:  9.54GiB(10.03%)  wps: 208,739  mfu: 2.56%
[rank0]:2024-04-01 17:54:52,734 - root - INFO - step:  1  loss: 10.9746  memory:  9.53GiB(10.03%)  wps: 1,228  mfu: 0.02%
[rank0]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-04-01 17:54:52,813 - root - INFO - step:  2  loss: 10.9091  memory:  9.54GiB(10.03%)  wps: 208,501  mfu: 2.55%
[rank1]:2024-04-01 17:54:52,889 - root - INFO - step:  3  loss: 10.7722  memory:  9.54GiB(10.03%)  wps: 219,416  mfu: 2.69%
[rank0]:2024-04-01 17:54:52,889 - root - INFO - step:  3  loss: 10.7722  memory:  9.54GiB(10.03%)  wps: 219,182  mfu: 2.68%
[rank1]:2024-04-01 17:54:52,965 - root - INFO - step:  4  loss: 10.5428  memory:  9.54GiB(10.03%)  wps: 218,226  mfu: 2.67%
[rank0]:2024-04-01 17:54:52,965 - root - INFO - step:  4  loss: 10.5428  memory:  9.54GiB(10.03%)  wps: 218,015  mfu: 2.67%
[rank1]:2024-04-01 17:54:53,045 - root - INFO - step:  5  loss: 10.3063  memory:  9.54GiB(10.03%)  wps: 207,094  mfu: 2.54%
[rank0]:2024-04-01 17:54:53,045 - root - INFO - step:  5  loss: 10.3063  memory:  9.54GiB(10.03%)  wps: 207,220  mfu: 2.54%
[rank1]:2024-04-01 17:54:53,123 - root - INFO - step:  6  loss: 10.0707  memory:  9.54GiB(10.03%)  wps: 210,814  mfu: 2.58%
[rank1]:2024-04-01 17:54:53,202 - root - INFO - step:  7  loss:  9.8302  memory:  9.54GiB(10.03%)  wps: 209,649  mfu: 2.57%
[rank0]:2024-04-01 17:54:53,123 - root - INFO - step:  6  loss: 10.0707  memory:  9.54GiB(10.03%)  wps: 210,849  mfu: 2.58%
[rank0]:2024-04-01 17:54:53,202 - root - INFO - step:  7  loss:  9.8302  memory:  9.54GiB(10.03%)  wps: 209,542  mfu: 2.57%
[rank0]:2024-04-01 17:54:53,281 - root - INFO - step:  8  loss:  9.5918  memory:  9.54GiB(10.03%)  wps: 211,690  mfu: 2.59%
[rank1]:2024-04-01 17:54:53,281 - root - INFO - step:  8  loss:  9.5918  memory:  9.54GiB(10.03%)  wps: 211,786  mfu: 2.59%
[rank1]:2024-04-01 17:54:53,412 - root - INFO - step:  9  loss:  9.4299  memory:  9.54GiB(10.03%)  wps: 125,833  mfu: 1.54%
[rank1]:[rank1]:[W401 17:54:53.242673953 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-04-01 17:54:53,412 - root - INFO - step:  9  loss:  9.4299  memory:  9.54GiB(10.03%)  wps: 125,765  mfu: 1.54%
[rank0]:[rank0]:[W401 17:54:53.240925776 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:2024-04-01 17:54:53,492 - root - INFO - step: 10  loss:  9.2955  memory:  9.54GiB(10.03%)  wps: 207,661  mfu: 2.54%
[rank0]:2024-04-01 17:54:53,492 - root - INFO - step: 10  loss:  9.2955  memory:  9.54GiB(10.03%)  wps: 207,426  mfu: 2.54%
[rank0]:NCCL version 2.20.5+cuda12.0
```


---------

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

* remove folding and unfolding of sequence dim in model.py

ghstack-source-id: 5d299adcd766baad6a36e63be4acc01fb2fd36db
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/190

* bump comm.train_timeout_seconds (#189)

This PR bumps this default config to a larger value: profiling is a
fairly heavy step, so the default of 5 seconds would likely trigger the
watchdog unintentionally.

* fix checkpoint parser

ghstack-source-id: 47ee7b5e2228705e5215195ac9ff13e1b168f93e
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/197

* support sequence of tests and add checkpoint test

address comments

ghstack-source-id: 7d6c51a5ef68dea06ba7d64741a554165c79f1d3
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/198

* Make freqs_cis a persistent buffer for pp init

Currently, the plan is to use a 'seed checkpoint' to initialize the
pipeline parallel model chunks after moving them from meta device to
cuda/empty.

Non-persistent buffers are incompatible with this approach, as they are
missing from the checkpoint and thus require manual init.

An alternative is to manually run the initializer for just the
non-persistent buffers after loading a seed checkpoint, but making the
buffer persistent is nearly equivalent and requires fewer code changes.
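
For reference, a minimal sketch of this kind of change, with illustrative names (the module and rope math below are stand-ins, not the exact code in model.py): the relevant bit is registering the buffer with persistent=True so it lands in the state dict a seed checkpoint can restore.

```python
import torch
import torch.nn as nn


class RotaryEmbedding(nn.Module):
    """Illustrative stand-in module, not the repo's implementation."""

    def __init__(self, dim: int, max_seq_len: int):
        super().__init__()
        freqs_cis = self._precompute_freqs_cis(dim, max_seq_len)
        # persistent=True puts freqs_cis into state_dict(), so a seed checkpoint
        # can restore it after the model is created on meta device and
        # materialized as empty tensors on cuda.
        self.register_buffer("freqs_cis", freqs_cis, persistent=True)

    @staticmethod
    def _precompute_freqs_cis(dim: int, max_seq_len: int, theta: float = 10000.0) -> torch.Tensor:
        # stand-in for the usual rope precomputation
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        angles = torch.outer(torch.arange(max_seq_len).float(), inv_freq)
        return torch.polar(torch.ones_like(angles), angles)
```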

ghstack-source-id: b48228488d4c3924fffef4237f4106383c14a934
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/201

* Delete grad scaler, which is unsupported/unused

grad scaler currently doesn't work with FSDP2, and isn't enabled anyway
because bf16 training is the norm and doesn't require it.

Remove it for simplicity. It will be easier to enable pipeline
parallelism with a simpler loss function setup; if desired, it is
still possible to support pipeline parallelism with the scaler added
back in.

ghstack-source-id: 82b0e4324eac88ee62723a6d832182d4e6c76e0f
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/202

* Factor out loss_fn to share code with pipeline par

PP requires feeding a loss_fn into the schedule's step so that loss can
be computed per microbatch as part of the forward/backward scheduling.

As such, it is nice to define the loss once and use it both in the non-PP
code that manually calls forward/loss/backward and in the PP step() (see the sketch below).
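
A rough sketch of the factoring, with illustrative names (the pipeline-schedule call is shown only as a comment, since the exact PP API is not part of this commit message):

```python
import torch
import torch.nn.functional as F


def cross_entropy_loss(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # flatten (batch, seq, vocab) logits and (batch, seq) labels for F.cross_entropy
    return F.cross_entropy(pred.flatten(0, 1).float(), labels.flatten(0, 1))


def train_step(model, optimizer, input_ids, labels):
    # non-PP path: manually run forward / loss / backward with the shared loss fn
    optimizer.zero_grad()
    pred = model(input_ids)
    loss = cross_entropy_loss(pred, labels)
    loss.backward()
    optimizer.step()
    return loss


# PP path (sketch): the schedule runs forward/backward per microbatch itself and
# only needs the same callable, e.g. something like:
#   schedule = Schedule(stage, n_microbatches, loss_fn=cross_entropy_loss)
#   schedule.step(input_ids, target=labels)
```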

ghstack-source-id: 9bedd5103e23627d5e268c287d49f0759442ba12
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/203

* [TorchTrain] Minor fix for #197 (#204)

The changes made in the GitHub editor didn't go in when doing ghstack land.

* Add FusedRMSNorm (Triton kernel, +15% eager), Add NPLayerNorm, Enable config selectable Norm Type (#181)

This PR has multiple aspects:
1 - Adds a new Triton-based fused RMSNorm I wrote. I've verified its
numerical accuracy on both forward and backward with a unit test.
It improves MFU by +15% with FSDP2 on the 7B model in eager mode, and slightly (+1.2%) when compiled:
<img width="545" alt="Screenshot 2024-03-29 at 5 18 14 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/8f16fae9-947b-4720-a370-b954779c33a7">

2 - Adds norms.py to house all 4 norm types, standardized to
[layernorm / np_layernorm / rmsnorm / fused_rmsnorm]. norms.py has a
create_norms function that creates the appropriate norm (a rough sketch of this factory appears after this list).

3 - Adds np_layernorm, which is layernorm with no affine transformation.

4 - Updates model.py to now support plug and play of any supported norm.

Thus instead of this type of if/then logic in the model class:
<img width="928" alt="Screenshot 2024-03-30 at 1 52 07 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/ba7cb976-580f-4471-a79b-a584f7d20693">

We simply have this:
<img width="1129" alt="Screenshot 2024-03-30 at 1 52 23 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/aba48b4d-1620-4059-840d-e620468f00f2">

This then allows for easy plug and play of any norm type with no
fiddling around in the model code.

5 - updates run_llama_train.sh to randomly select a port instead of the
previous fixed port number. (thanks @yifuwang for this tip!)


6 - Now users can quickly select the norm of their choice via the config
file:
<img width="774" alt="Screenshot 2024-03-30 at 3 01 43 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/3238b375-dc21-4ee2-a5fa-f6571da79edb">

7 - adds a NotImplementedError if users try to run TP + fused_rmsnorm, to avoid
any confusion (per @tianyu-l feedback):
~~~
NotImplementedError: fused_rmsnorm not yet compatible with TP. Please
use rmsnorm.
~~~
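
As referenced in item 2, here is a rough sketch of the norm factory (simplified: the fused Triton RMSNorm is elided, and names/signatures approximate the PR rather than reproduce it):

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Plain (non-fused) RMSNorm, for illustration."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x_normed * self.weight


def create_norms(norm_type: str, dim: int, eps: float = 1e-6) -> nn.Module:
    """Map a config string to a norm module; fused_rmsnorm omitted in this sketch."""
    norm_type = norm_type.lower()
    if norm_type == "layernorm":
        return nn.LayerNorm(dim, eps=eps)
    if norm_type == "np_layernorm":
        # layernorm with no affine (learnable) transformation
        return nn.LayerNorm(dim, eps=eps, elementwise_affine=False)
    if norm_type == "rmsnorm":
        return RMSNorm(dim, eps=eps)
    raise NotImplementedError(f"Unknown norm_type: '{norm_type}'")
```

With this in place, the model just calls the factory with the configured norm type instead of branching on it inline.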

* remove .item() per iter

ghstack-source-id: ab29c214604fd76cefdfe70149ecf07a2e03103e
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/206

* Removed cache_k and cache_v comments

ghstack-source-id: 8bc66c683a801189b152b0ef4301579ec1ec17e7
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/213

* Some more cleanups

ghstack-source-id: a53cbbecc35eac2a62d8ebc241462ac418666336
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/212

* avoid record streams and make color printing a config

ghstack-source-id: 1c7cb2710330ec3fb2384793b5ad77c65b107cbc
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/195

* fix SAC to use the correct reduce_scatter op (#215)

As titled: we migrated to the native functional collectives, so the SAC
policy should capture the new reduce_scatter op instead of the old one.
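
Hedged sketch of the kind of change involved (the real policy lives in the SAC setup and is more involved): SAC keeps a set of ops whose outputs are always saved rather than recomputed, and that set should now reference the native functional-collective reduce_scatter op.

```python
import torch

# illustrative subset of the ops whose outputs SAC always saves
_save_list = {
    torch.ops.aten.mm.default,
    # new: native functional collective emitted for the reduce-scatter
    torch.ops._c10d_functional.reduce_scatter_tensor.default,
    # old op that is no longer emitted:
    # torch.ops.c10d_functional.reduce_scatter_tensor.default,
}
```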

* Test runner  raises exception on failures (#216)

Summary: The test runner should raise an exception on failures.
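
A minimal sketch of that behavior with illustrative names (not the repo's actual test runner): run each integration-test command and raise if it exits non-zero, instead of only logging the failure.

```python
import subprocess


def run_integration_test(flavor: str, cmd: str) -> None:
    print(f"=====Integration test, flavor : {flavor}, command : {cmd}=====")
    result = subprocess.run(cmd, shell=True)
    if result.returncode != 0:
        # raise instead of just logging, so CI marks the run as failed
        raise RuntimeError(
            f"Integration test '{flavor}' failed with exit code {result.returncode}"
        )
```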

Test Plan: 

```
=====Integration test, flavor : , command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh  =====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ overrides=
+ '[' 0 -ne 0 ']'

=====Integration test, flavor : 1D compile, command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh --training.compile=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=--training.compile
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ overrides=
+ '[' 1 -ne 0 ']'
+ overrides=--training.compile
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.compile
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757]
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] *****************************************
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] *****************************************
[rank0]:2024-04-10 13:32:45,243 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-04-10 13:32:45,676 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-04-10 13:32:46,028 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-04-10 13:32:46,030 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-04-10 13:32:46,038 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-04-10 13:32:46,038 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-04-10 13:32:47,813 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='fused_rmsnorm')
[rank0]:2024-04-10 13:32:47,826 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-04-10 13:32:47,826 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied FSDP to the model
[rank0]:2024-04-10 13:32:48,582 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%)
[rank0]:2024-04-10 13:32:48,582 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1332
[rank0]:2024-04-10 13:32:48,584 - root - INFO - Compiling model with torch.compile
[rank0]:2024-04-10 13:32:49,384 - root - INFO - Training starts at step 1
[rank0]:2024-04-10 13:32:49,385 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:[rank0]:W0410 13:32:49.487000 139672077292544 torch/_logging/_internal.py:1016] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:[rank0]: Traceback (most recent call last):
[rank0]:[rank0]:   File "/data/users/gnadathur/a/torchtitan/train.py", line 394, in <module>
[rank0]:[rank0]:     main(config)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
[rank0]:[rank0]:     return f(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main
[rank0]:[rank0]:     pred = model(input_ids)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:[rank0]:     return forward_call(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/eval_frame.py", line 410, in _fn
[rank0]:[rank0]:     return fn(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1582, in _call_impl
[rank0]:[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 966, in catch_errors
[rank0]:[rank0]:     return callback(frame, cache_entry, hooks, frame_state, skip=1)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 809, in _convert_frame
[rank0]:[rank0]:     result = inner_convert(
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 404, in _convert_frame_assert
[rank0]:[rank0]:     return _compile(
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_utils_internal.py", line 70, in wrapper_function
[rank0]:[rank0]:     return function(*args, **kwargs)
[rank0]:[rank0]:   File "/home/gnadathur/local/a/pytorch-env/lib/python3.10/contextlib.py", line 79, in inner
[rank0]:[rank0]:     return func(*args, **kwds)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 691, in _compile
[rank0]:[rank0]:     guarded_code = compile_inner(code, one_graph, hooks, transform)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
[rank0]:[rank0]:     r = func(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 546, in compile_inner
[rank0]:[rank0]:     out_code = transform_code_object(code, transform)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1103, in transform_code_object
[rank0]:[rank0]:     transformations(instructions, code_options)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 168, in _fn
[rank0]:[rank0]:     return fn(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 508, in transform
[rank0]:[rank0]:     tracer.run()
[rank0]:[rank0]:   File "/data/u…