deepspeed LoRA checkpoint problem, did not resume checkpoint #423

Closed
YerongLi opened this issue Aug 18, 2024 · 2 comments
  • Step 1: Run LoRA fine-tuning on two GPUs
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
# export MODEL="merged/checkpoint-10"
# export DATA="path of data"
export DATA="data.txt"
ds_master_port=$((29000 + RANDOM % 1000))
GPUS_PER_NODE=2
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
# deepspeed --num_gpus 2 finetune.py \
deepspeed --master_port $ds_master_port --include localhost:1,2 finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --given_num True \
    --bf16 True \
    --fix_vit True \
    --fix_sampler True \
    --use_lora True \
    --hd_num 18 \
    --output_dir output/finetune_lora \
    --num_train_epochs 10 \
    --batch_size 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --max_length 1024 \
    --deepspeed ds_config_zero2.json \
    --gradient_checkpointing True

The loss declines from 6.23 to about 0.4:

{'loss': 6.2368, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                      
 10%|████████▊                                                                               | 1/10 [00:13<02:01, 13.45s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.46, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                        
{'loss': 0.7711, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                      
{'loss': 1.9513, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                     
{'loss': 0.1499, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                     
 50%|████████████████████████████████████████████                                            | 5/10 [00:27<00:22,  4.44s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/peft/utils/save_and_load.py:148: UserWarning: Could not find a config file in /home/yerong2/models/internlm-xcomposer2d5-7b - will assume that the vocabulary was not modified.
  warnings.warn(
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
{'loss': 1.9427, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                     
{'loss': 1.33, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                       
{'loss': 0.1506, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                      
{'loss': 2.8562, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                     
{'loss': 0.5747, 'learning_rate': 0.0, 'epoch': 6.4}                                                                        
{'train_runtime': 192.5393, 'train_samples_per_second': 0.519, 'train_steps_per_second': 0.052, 'train_loss': 1.6423154383897782, 'epoch': 6.4}
100%|███████████████████████████████████████████████████████████████████████████████████████| 10/10 [03:12<00:00, 19.25s/it]
[2024-08-18 07:50:06,882] [INFO] [launch.py:347:main] Process 3472257 exits successfully.
  • Step 2: Merge the LoRA adapter into the base model with merge_peft_adapter.py and place the result at merged/checkpoint-10 (a rough sketch of what such a merge typically looks like is shown right after this list).
  • Step 3: Start training again from merged/checkpoint-10 and watch the loss. The loss restarts from about 6.0!
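For reference, here is a minimal sketch of what a merge step along the lines of merge_peft_adapter.py typically does with PEFT; the actual script may differ, and the paths below are simply the ones used in the commands in this issue:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/home/yerong2/models/internlm-xcomposer2d5-7b"
adapter_path = "output/finetune_lora"   # LoRA adapter saved in Step 1 (or a checkpoint-* subfolder)
merged_path = "merged/checkpoint-10"    # used as MODEL in Step 3

# The model ships custom modeling code, hence trust_remote_code=True.
base = AutoModelForCausalLM.from_pretrained(
    base_path, torch_dtype=torch.bfloat16, trust_remote_code=True)

# Attach the trained LoRA adapter, fold its deltas into the base weights,
# and save a plain, adapter-free checkpoint.
model = PeftModel.from_pretrained(base, adapter_path)
merged = model.merge_and_unload()
merged.save_pretrained(merged_path)

tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
tokenizer.save_pretrained(merged_path)

The Step 3 script below then points MODEL at merged/checkpoint-10 and launches the same fine-tuning run: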
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

# export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
export MODEL="merged/checkpoint-10"
# export DATA="path of data"
export DATA="data.txt"
ds_master_port=$((29000 + RANDOM % 1000))
GPUS_PER_NODE=2
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
# deepspeed --num_gpus 2 finetune.py \
deepspeed --master_port $ds_master_port --include localhost:1,2 finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --given_num True \
    --bf16 True \
    --fix_vit True \
    --fix_sampler True \
    --use_lora True \
    --hd_num 18 \
    --output_dir output/finetune_lora \
    --num_train_epochs 10 \
    --batch_size 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --max_length 1024 \
    --deepspeed ds_config_zero2.json \
    --gradient_checkpointing True

The loss restarts from about 6.0:

Could not estimate the number of tokens of the input, floating-point operations will not be computed
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 5.7047, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                      
 10%|████████▊                                                                               | 1/10 [00:12<01:56, 13.00s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4522, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                      
{'loss': 0.5286, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                      
{'loss': 1.2067, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                     
{'loss': 0.1496, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                     
 50%|████████████████████████████████████████████                                            | 5/10 [00:27<00:21,  4.39s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
{'loss': 1.3043, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                     
{'loss': 0.731, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                      
{'loss': 0.1483, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                      
{'loss': 1.6292, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                     
{'loss': 0.3236, 'learning_rate': 0.0, 'epoch': 6.4}                                                                        
{'train_runtime': 191.0236, 'train_samples_per_second': 0.523, 'train_steps_per_second': 0.052, 'train_loss': 1.2178180634975433, 'epoch': 6.4}
100%|███████████████████████████████████████████████████████████████████████████████████████| 10/10 [03:11<00:00, 19.10s/it]
[2024-08-18 08:01:03,083] [INFO] [launch.py:347:main] Process 3485551 exits successfully.
[2024-08-18 08:01:04,084] [INFO] [launch.py:347:main] Process 3485552 exits successfully.
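As a quick sanity check (my own suggestion, not something the repo provides), the merged checkpoint can be compared tensor-by-tensor against the original base model; if the merge actually applied the LoRA deltas, the adapter's target weights should differ between the two:

import torch
from transformers import AutoModelForCausalLM

base_path = "/home/yerong2/models/internlm-xcomposer2d5-7b"
merged_path = "merged/checkpoint-10"

base = AutoModelForCausalLM.from_pretrained(
    base_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
merged = AutoModelForCausalLM.from_pretrained(
    merged_path, torch_dtype=torch.bfloat16, trust_remote_code=True)

base_sd, merged_sd = base.state_dict(), merged.state_dict()
changed = [k for k in base_sd
           if k in merged_sd and not torch.equal(base_sd[k], merged_sd[k])]
print(f"{len(changed)} of {len(base_sd)} tensors differ")
print(changed[:10])  # an empty list means the merged model is effectively the base model

If no tensors differ, Step 3 is effectively fine-tuning the untouched base model again, which would be consistent with the loss restarting near 6.0.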

YerongLi commented Aug 19, 2024

With `internlm-xcomposer2-vl-7b`, resuming from an intermediate checkpoint works without any issue.

YerongLi commented:

This issue coexists with #426
