2d5-7b: The LoRA checkpoint saved with multiple GPUs is incorrect #426
The fine-tuning code works fine in our environment (multi-GPU + checkpoint saving). Can you provide more details of your environment (e.g., the package versions of transformers and PEFT)? We use transformers==4.33.2 and peft==0.8.2.
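For reference, a minimal way to collect the versions being asked about (a snippet of mine, not from the repo):

```python
# Hypothetical diagnostic snippet (not part of the repo): print the installed
# versions of the packages the maintainers asked about, so they can be pasted
# into the issue.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("transformers", "peft", "torch", "deepspeed"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```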
@yuhangzang My Python script is here:
Why did you choose that launch command instead of `torchrun`?
Is there a big difference between them? Let me try torchrun.
@yuhangzang torchrun does not change anything. The initial loss is still around 6.0.
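To double-check that both launch methods really start the same 2-process job, one can log the rank and world size from inside the training script. This is a diagnostic sketch of mine, assuming `finetune.py` initializes `torch.distributed`:

```python
# Diagnostic sketch (an assumption, not the repo's code): place after the
# distributed initialization in the training script to confirm that both the
# "python" and "torchrun" launches actually run two ranks.
import os
import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    print(f"[rank {dist.get_rank()} / world {dist.get_world_size()}] "
          f"LOCAL_RANK={os.environ.get('LOCAL_RANK')}")
else:
    print("torch.distributed is not initialized: single-process run")
```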
All scripts are in this zip file:
- Steps to reproduce (multiple GPUs): error, the train → save → reload cycle is incorrect (a diagnostic sketch follows this list).

  ```bash
  export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
  bash merge.sh output/finetune_lora
  ```

  Observation: after loading `output/finetune_lora`, the loss goes back up to around 6.0, which is unexpected.
- Steps to reproduce (single GPU): no error.

  Observation: after re-loading from the previous training, the loss starts from a good point, which is expected.
- Control experiment: save with 1-GPU training, then load and resume training with 2 GPUs. No error; this behaves normally.
- Other clues: the single-GPU and multi-GPU runs use the same conda environment.
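The diagnostic sketch mentioned above: PEFT initializes the LoRA `B` matrices to zero, so inspecting the saved adapter shows whether the trained weights ever reached disk. The output path and the `adapter_model.bin` file name are my assumptions (newer PEFT releases may write `adapter_model.safetensors` instead):

```python
# Diagnostic sketch (path and file name are assumptions, not from this thread):
# PEFT initializes lora_B to zeros, so if the adapter saved by the 2-GPU run
# still has all-zero lora_B tensors, the trained weights were never written,
# which would match the loss jumping back to ~6.0 after reloading.
import torch

state = torch.load("output/finetune_lora/adapter_model.bin", map_location="cpu")
for name, tensor in state.items():
    if "lora_B" in name:
        print(f"{name}: all-zero={bool((tensor == 0).all())}, "
              f"abs-mean={tensor.float().abs().mean().item():.3e}")
```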
@YerongLi Have you tried full fine-tuning on multiple GPUs?
If you want to continue training with the LoRA, you may face errors when you continue training with an older transformers version. We are working on providing new fine-tuning code that is compatible with the new transformers version.
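For what it is worth, this is how I would expect resuming from a saved adapter to look, assuming the checkpoint is a standard PEFT adapter directory (a sketch, not the repo's official resume path):

```python
# Sketch only (assumes a standard PEFT adapter directory; not the repo's
# official resume procedure): attach the saved LoRA adapter to the base model
# and keep it trainable before handing the model back to the trainer.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "/home/yerong2/models/internlm-xcomposer2d5-7b",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "output/finetune_lora", is_trainable=True)
model.print_trainable_parameters()  # should list only the LoRA parameters
```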
@khyati2396 I am not able to do full fine-tuning; I only have multiple 48 GB GPUs.
I mean, are you able to reproduce this error? Can you resume from a "correct" loss after saving the model? For me, in the same conda environment, this feels like a deep bug in the interaction between DeepSpeed and transformers... Let me try a newer transformers at the same time.
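One interaction point I would check (my hypothesis only, not a confirmed diagnosis): under DeepSpeed ZeRO-3 the parameters are sharded across ranks, so a naive `state_dict()` on one rank can contain placeholder tensors. A sketch of gathering the LoRA weights before saving, to rule that out:

```python
# Hypothetical workaround sketch (assumes ZeRO-3 sharding is the culprit, which
# is not confirmed in this thread): gather the sharded LoRA parameters onto
# rank 0 before writing them, instead of saving a possibly-empty local shard.
import deepspeed
import torch
import torch.distributed as dist

def save_gathered_lora(model, path="lora_gathered.bin"):
    lora_params = [p for n, p in model.named_parameters() if "lora_" in n]
    # Temporarily materialize the full (un-sharded) parameters on rank 0.
    with deepspeed.zero.GatheredParameters(lora_params, modifier_rank=0):
        if dist.get_rank() == 0:
            state = {n: p.detach().cpu().clone()
                     for n, p in model.named_parameters() if "lora_" in n}
            torch.save(state, path)
```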
@yuhangzang I tested with a newer transformers and got the same bug; the only difference is that the initial loss starts from around 2.0 instead of 6.0.
Tested with both CUDA 11.7 and CUDA 12; neither works, the issue persists.

Tested with a smaller learning rate; that is not the issue.

I tested on a machine with 2×L40S and a machine with 2×A100; neither of them works. Here is the output on the 2×A100 machine:
I found that with 2d5-7b, the checkpoint saved from LoRA tuning (`finetune.py`) with one GPU is correct, while with multiple GPUs the saved model is incorrect.
Has anyone met a similar problem?
For example, I saved the LoRA checkpoint `multi` with 2-GPU training and saved `single` with 1-GPU training.

With multiple GPUs:

With a single GPU:
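To make "incorrect" concrete, I would also diff the two adapters directly. The directory and file names below are placeholders for wherever `multi` and `single` were saved:

```python
# Comparison sketch (directory/file names are placeholders): diff the LoRA
# tensors saved by the 2-GPU run ("multi") against the 1-GPU run ("single").
import torch

single = torch.load("single/adapter_model.bin", map_location="cpu")
multi = torch.load("multi/adapter_model.bin", map_location="cpu")

for name, w_s in single.items():
    w_m = multi.get(name)
    if w_m is None:
        print(f"{name}: missing from the multi-GPU checkpoint")
        continue
    max_diff = (w_s.float() - w_m.float()).abs().max().item()
    print(f"{name}: max |single - multi| = {max_diff:.3e}")
```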