Refactor Recipe State Dict Code #1964
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1964
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 5a18094 with merge base 24d3579. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This looks really awesome overall.
My main concern is whether this could cause any differences in memory usage and/or speed. Can we confirm that it doesn't?
"Please call get_full_model_state_dict(..., device=self._device)," | ||
" so DTensor can communicate over NCCL." | ||
for param_name, sharded_param in sharded_sd.items(): | ||
if sharded_param.is_cpu: |
What's happening here?
I assume this is for CPU offload or something?
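For context, a rough sketch of what that CPU-offload path looks like, pieced together from the quoted snippets (`device`, `is_rank_zero`, and `cpu_state_dict` are assumed to be the surrounding variables, so treat this as an illustration rather than the exact PR code):

```python
# Sketch: with CPU offload enabled, each FSDP shard lives on CPU, but NCCL can
# only all-gather CUDA tensors. Move the shard to `device` first, materialize
# the full tensor, then copy the result back to CPU on rank zero only.
for param_name, sharded_param in sharded_sd.items():
    if sharded_param.is_cpu:
        sharded_param = sharded_param.to(device)  # hence the error message above
    full_param = sharded_param.full_tensor()      # DTensor all-gather
    if is_rank_zero:
        cpu_state_dict[param_name] = full_param.cpu()
    else:
        del full_param
```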
    # skip non-trainable params when trainable_only is True
    continue
if isinstance(sharded_param._local_tensor, NF4Tensor):
    # NF4Tensor does not support all_gather from DTensor
Lol why does it not support all_gather? Can't we ask AO to support that?
Yeah let's open an issue there
We want AO to support it, but it would take too long to get on stable so we have to do it ourselves in the meantime.
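Until then, the interim branch presumably has roughly this shape (`manually_gather_nf4` is a hypothetical helper name used only for illustration; the actual reassembly appears to gather each NF4 inner tensor, as in the dim-0 sketch further down):

```python
# Sketch of the workaround: DTensor's full_tensor() cannot all-gather an
# NF4Tensor, so that case is handled manually instead.
if isinstance(sharded_param._local_tensor, NF4Tensor):
    full_param = manually_gather_nf4(sharded_param)  # hypothetical helper
else:
    full_param = sharded_param.full_tensor()
```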
    cpu_state_dict[param_name] = full_param.cpu()
else:
    del full_param
torch.distributed.barrier()
yay
@@ -38,30 +39,30 @@
 class DummyAdapterModule(nn.Module, AdapterModule):
     def __init__(self, in_dim, out_dim):
         super().__init__()
-        self.adapter = nn.Linear(in_dim, out_dim, bias=False)
+        self.lora = nn.Linear(in_dim, out_dim, bias=False)
So as of now, what is the actual value of AdapterModule? Is it just for setting trainable params?
I think AdapterModule is the right way to go, but since we are already ignoring it for ckpt merging, I'm not really changing anything by not using it for get_adapter_state_dict. I think we should move to using AdapterModule for all of these functions but that doesn't need to be solved in this PR.
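For reference, a minimal sketch of what an AdapterModule-based version might look like (this is not what the PR does; the function name and key construction here are assumptions, and state_dict hooks could still rename keys, which is exactly the problem this PR works around):

```python
# Sketch only: derive adapter keys from the AdapterModule protocol instead of
# matching "lora"/"magnitude" substrings in the key names.
def get_adapter_state_dict_via_protocol(model, state_dict):
    adapter_keys = set()
    for module_name, module in model.named_modules():
        if hasattr(module, "adapter_params"):  # AdapterModule protocol
            for param_name in module.adapter_params():
                key = f"{module_name}.{param_name}" if module_name else param_name
                adapter_keys.add(key)
    return {k: v.cpu() for k, v in state_dict.items() if k in adapter_keys}
```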
torchtune/modules/peft/_utils.py (Outdated)
""" | ||
adapter_key_filter = lambda x: "lora" in x or "magnitude" in x | ||
return {k: v.cpu() for k, v in state_dict.items() if adapter_key_filter(k)} |
Do we want to make the move to CPU optional here?
If it's already on CPU this is a no-op.
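If the CPU move ever did need to be configurable, a small `device` argument would cover it; the following is just a sketch, not part of this PR:

```python
def get_adapter_state_dict(state_dict, device="cpu"):
    # Same filter as the quoted snippet; device=None leaves tensors where they are.
    adapter_key_filter = lambda x: "lora" in x or "magnitude" in x
    return {
        k: (v if device is None else v.to(device))
        for k, v in state_dict.items()
        if adapter_key_filter(k)
    }
```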
d0, *dn = quant_param.shape
shape = (d0 * mesh.get_group().size(), *dn)
So this means sharding is always along the first dimension? If so, it might be worth leaving a comment to that effect.
Wei said FSDP always shards on dim 0
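That matches the quoted lines: the full shape is recovered by scaling dim 0 of the local shard by the process-group size. A generic sketch of that gather, assuming even dim-0 sharding (`gather_dim0_sharded` is an assumed helper name):

```python
import torch
import torch.distributed as dist

def gather_dim0_sharded(local: torch.Tensor, group) -> torch.Tensor:
    # Each rank holds an equally sized dim-0 slice, so the full dim 0 is
    # local_dim0 * world_size; all_gather_into_tensor concatenates along dim 0.
    d0, *dn = local.shape
    out = torch.empty((d0 * group.size(), *dn), dtype=local.dtype, device=local.device)
    dist.all_gather_into_tensor(out, local.contiguous(), group=group)
    return out
```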
Thank you for wading through this minefield and fixing this! And thanks for adding the new test case as well.
Context
What is the purpose of this PR?
In our recipes we have four different ways of collecting model/adapter state_dicts.
This causes issues when models use state_dict hooks, which change the expected parameter names. It came up before with activation checkpointing, where it was sidestepped; with fusion models it is an issue again, since they rely on state dict hooks to operate. To address these known issues and make future support easier, this PR consolidates all of our checkpoint code on the state_dict API, so we always get a consistent set of names. The PR takes two primary approaches.
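To make the naming problem concrete, here is a minimal, self-contained illustration (not torchtune code) of how a state_dict hook makes `state_dict()` keys diverge from `named_parameters()` keys, which is why filtering on parameter names breaks:

```python
import torch.nn as nn

class Wrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.inner = nn.Linear(4, 4)
        # Rename keys on the way out, the way fusion models remap their modules.
        self._register_state_dict_hook(self._rename_hook)

    @staticmethod
    def _rename_hook(module, state_dict, prefix, local_metadata):
        for key in list(state_dict.keys()):
            state_dict[key.replace("inner.", "layer.")] = state_dict.pop(key)
        return state_dict

m = Wrapper()
print([name for name, _ in m.named_parameters()])  # ['inner.weight', 'inner.bias']
print(list(m.state_dict().keys()))                 # ['layer.weight', 'layer.bias']
```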
Changelog
What are the changes made in this PR?
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- `pre-commit install`
- `pytest tests`
- `pytest tests -m integration_test`
I will update here with an overview of all the updated recipes, showing that memory and save time don't change with the checkpoint update.
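A sketch of how that comparison could be collected per recipe (the `checkpointer.save_checkpoint(...)` call is a placeholder for whatever the recipe actually invokes):

```python
import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
checkpointer.save_checkpoint(state_dict, epoch=0)  # placeholder for the recipe's save call
elapsed = time.perf_counter() - start
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"save time: {elapsed:.1f}s, peak CUDA memory: {peak_gib:.2f} GiB")
```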