[torchtitan][debug] integrated CommDebugMode into TorchTitan #480
ghstack-source-id: fbbc6f0257396b21eea0e40939c832a7afa3490f Pull Request resolved: #480
Hi @sinhaanshul! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid and will need to be resubmitted. Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations, and the pull request will be tagged afterwards. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Summary
This PR gives TorchTitan developers the option to use CommDebugMode when debugging code that uses DTensors. It is enabled with a command-line argument, and additional arguments set the console dump file name, the JSON dump file name, and the noise level. Currently, the debugger fails when compiled_rmsnorm is used. The temporary fix is to increase torch._dynamo.config.cache_size_limit before using CommDebugMode.
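For readers who want a picture of how these options could fit together, here is a minimal sketch, not the code in this PR. Only `--comm_debug.enable_comm_debug_mode` comes from the test plan; the other three flag names, their defaults, the helper names, and the `cache_size_limit` value of 64 are assumptions made for illustration. The `CommDebugMode` calls follow the PyTorch documentation, and the import path differs between PyTorch versions, so both are tried.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the comm_debug options (illustrative only)."""
    parser = argparse.ArgumentParser(description="CommDebugMode options (sketch)")
    # This flag appears in the PR's test plan.
    parser.add_argument("--comm_debug.enable_comm_debug_mode", action="store_true")
    # Hypothetical flag names and defaults; the PR text only says such options exist.
    parser.add_argument("--comm_debug.dump_file", type=str, default="comm_debug_log.txt")
    parser.add_argument("--comm_debug.dump_json", type=str, default="comm_debug_log.json")
    parser.add_argument("--comm_debug.noise_level", type=int, default=1)
    return parser


def parse_comm_debug(argv):
    """Return the comm_debug options as a plain dict without the section prefix."""
    args = vars(build_parser().parse_args(argv))
    prefix = "comm_debug."
    return {k[len(prefix):]: v for k, v in args.items() if k.startswith(prefix)}


def run_with_comm_debug(model, inputs, cfg, cache_size_limit=64):
    """Run one forward pass under CommDebugMode and write both dumps.

    torch is imported lazily so the option parsing above works without it.
    """
    import torch._dynamo

    try:  # newer PyTorch releases
        from torch.distributed.tensor.debug import CommDebugMode
    except ImportError:  # older releases keep it under the private namespace
        from torch.distributed._tensor.debug import CommDebugMode

    # Workaround described in the summary: raise the Dynamo cache limit first.
    # The value 64 is an arbitrary placeholder, not taken from the PR.
    torch._dynamo.config.cache_size_limit = cache_size_limit

    comm_mode = CommDebugMode()
    with comm_mode:
        output = model(inputs)

    comm_mode.log_comm_debug_tracing_table_to_file(
        file_name=cfg["dump_file"], noise_level=cfg["noise_level"]
    )
    comm_mode.generate_json_dump(
        file_name=cfg["dump_json"], noise_level=cfg["noise_level"]
    )
    return output
```

In a training script, `parse_comm_debug(sys.argv[1:])` would be checked once at startup, and `run_with_comm_debug` would be called only when `enable_comm_debug_mode` is set, so normal runs pay no tracing cost.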
Test Plan
CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 LOG_RANK=0,1,2,3 ./run_llama_train.sh --comm_debug.enable_comm_debug_mode --model.norm_type="rmsnorm"