[DP] change device mesh dim naming convention to make it more consistent #720

XilunWu · 2024-12-05T00:47:10Z

Stack from ghstack (oldest at bottom):

Summary
This PR improves the design of the DeviceMesh hierarchy in torchtitan. All device meshes other than world_mesh now fall into 2 categories:

1. Basic mesh: a mesh whose degree is set by the user in the job .toml file. These are pp (pipeline_parallel_degree), dp_replicate (data_parallel_replicate_degree), dp_shard (data_parallel_shard_degree), tp (tensor_parallel_degree), and cp (context_parallel_degree).
2. Synthesized mesh (also called "derived mesh"): a mesh synthesized from basic meshes by _flatten(). If a mesh is synthesized from a single basic mesh, it is equivalent to an alias of that mesh. So far we use 2 synthesized meshes: dp and dp_shard_cp. The dp mesh is used for data loading and the dp_shard_cp mesh is used for sharding model parameters.
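To make the two categories concrete, here is a minimal pure-Python sketch of the naming scheme. It does not call torch or `_flatten()`; it only models which basic dims feed each synthesized mesh and what size results. Several things here are assumptions and not taken from this PR's diff: the `ParallelDegrees` class, the `BASIC_DIMS` ordering, the rule that a dim with degree 1 is skipped, and the composition of `dp` (from `dp_replicate` and `dp_shard`) and `dp_shard_cp` (from `dp_shard` and `cp`), which is inferred from the mesh names.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class ParallelDegrees:
    """Hypothetical container for the degrees a user sets in the job .toml file."""
    pp: int = 1
    dp_replicate: int = 1
    dp_shard: int = 1
    cp: int = 1
    tp: int = 1


# ASSUMPTION: the ordering of basic dims is illustrative only.
BASIC_DIMS = ("pp", "dp_replicate", "dp_shard", "cp", "tp")

# ASSUMPTION: source dims are inferred from the synthesized mesh names.
SYNTHESIZED = {
    "dp": ("dp_replicate", "dp_shard"),   # used for data loading
    "dp_shard_cp": ("dp_shard", "cp"),    # used for sharding model parameters
}


def basic_mesh(degrees):
    """Return (dim_names, sizes) for basic dims, skipping dims of degree 1 (assumed)."""
    names = [d for d in BASIC_DIMS if getattr(degrees, d) > 1]
    sizes = [getattr(degrees, d) for d in names]
    return names, sizes


def synthesized_meshes(degrees):
    """Map each synthesized mesh name to (source dims present, flattened size)."""
    enabled, _ = basic_mesh(degrees)
    result = {}
    for name, sources in SYNTHESIZED.items():
        present = [s for s in sources if s in enabled]
        if present:  # a single source dim means the mesh is just an alias
            result[name] = (present, prod(getattr(degrees, s) for s in present))
    return result


if __name__ == "__main__":
    degrees = ParallelDegrees(dp_replicate=2, dp_shard=4, cp=2)
    print(basic_mesh(degrees))
    print(synthesized_meshes(degrees))
```

With only `dp_shard` enabled, both `dp` and `dp_shard_cp` come out as single-source meshes in this sketch, which corresponds to the aliasing case described above.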

Test
CI


tianyu-l

lgtm!

[DP] change device mesh dim naming convention to make it more consistent

1cef052

This was referenced Dec 5, 2024

[cp] apply fsdp to model when CP is enabled without DP for correct loss and lower mem usage #685

Merged

[cp] add option to choose kv shards rotation method #684

Closed

facebook-github-bot added the CLA Signed label Dec 5, 2024

XilunWu requested review from tianyu-l and fegin December 5, 2024 00:49

XilunWu added 2 commits December 4, 2024 16:51

Update on "[DP] change device mesh dim naming convention to make it m…

ff59d4a

tianyu-l approved these changes Dec 5, 2024

fegin approved these changes Dec 5, 2024

XilunWu added 2 commits December 11, 2024 14:16

XilunWu changed the base branch from gh/XilunWu/13/base to main December 11, 2024 22:42

XilunWu merged commit cb633e3 into main Dec 11, 2024
5 checks passed
