deepspeed LoRA checkpoint problem, did not resume checkpoint #423

Closed
YerongLi opened this issue Aug 18, 2024 · 2 comments
  • Step 1: Run LoRA fine-tuning on two GPUs
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
# export MODEL="merged/checkpoint-10"
# export DATA="path of data"
export DATA="data.txt"
ds_master_port=$((29000 + RANDOM % 1000))
GPUS_PER_NODE=2
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
# deepspeed --num_gpus 2 finetune.py \
deepspeed --master_port $ds_master_port --include localhost:1,2 finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --given_num True \
    --bf16 True \
    --fix_vit True \
    --fix_sampler True \
    --use_lora True \
    --hd_num 18 \
    --output_dir output/finetune_lora \
    --num_train_epochs 10 \
    --batch_size 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --max_length 1024 \
    --deepspeed ds_config_zero2.json \
    --gradient_checkpointing True

The loss declines from 6.23 to about 0.4:

{'loss': 6.2368, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                      
 10%|████████▊                                                                               | 1/10 [00:13<02:01, 13.45s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.46, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                        
{'loss': 0.7711, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                      
{'loss': 1.9513, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                     
{'loss': 0.1499, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                     
 50%|████████████████████████████████████████████                                            | 5/10 [00:27<00:22,  4.44s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/peft/utils/save_and_load.py:148: UserWarning: Could not find a config file in /home/yerong2/models/internlm-xcomposer2d5-7b - will assume that the vocabulary was not modified.
  warnings.warn(
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
{'loss': 1.9427, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                     
{'loss': 1.33, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                       
{'loss': 0.1506, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                      
{'loss': 2.8562, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                     
{'loss': 0.5747, 'learning_rate': 0.0, 'epoch': 6.4}                                                                        
{'train_runtime': 192.5393, 'train_samples_per_second': 0.519, 'train_steps_per_second': 0.052, 'train_loss': 1.6423154383897782, 'epoch': 6.4}
100%|███████████████████████████████████████████████████████████████████████████████████████| 10/10 [03:12<00:00, 19.25s/it]
[2024-08-18 07:50:06,882] [INFO] [launch.py:347:main] Process 3472257 exits successfully.
  • Step 2: Merge the LoRA adapter into the base model with merge_peft_adapter.py and place the result at merged/checkpoint-10 (a rough sketch of what such a merge typically looks like is shown right after this list).
  • Step 3: Start training again from merged/checkpoint-10 and watch the loss. The loss restarts from about 6.0!
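For reference, here is a minimal sketch of what a merge step along the lines of merge_peft_adapter.py typically does with PEFT; the actual script may differ, and the paths below are simply the ones used in the commands in this issue:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/home/yerong2/models/internlm-xcomposer2d5-7b"
adapter_path = "output/finetune_lora"   # LoRA adapter saved in Step 1 (or a checkpoint-* subfolder)
merged_path = "merged/checkpoint-10"    # used as MODEL in Step 3

# The model ships custom modeling code, hence trust_remote_code=True.
base = AutoModelForCausalLM.from_pretrained(
    base_path, torch_dtype=torch.bfloat16, trust_remote_code=True)

# Attach the trained LoRA adapter, fold its deltas into the base weights,
# and save a plain, adapter-free checkpoint.
model = PeftModel.from_pretrained(base, adapter_path)
merged = model.merge_and_unload()
merged.save_pretrained(merged_path)

tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
tokenizer.save_pretrained(merged_path)

The Step 3 script below then points MODEL at merged/checkpoint-10 and launches the same fine-tuning run: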
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

# export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
export MODEL="merged/checkpoint-10"
# export DATA="path of data"
export DATA="data.txt"
ds_master_port=$((29000 + RANDOM % 1000))
GPUS_PER_NODE=2
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
# deepspeed --num_gpus 2 finetune.py \
deepspeed --master_port $ds_master_port --include localhost:1,2 finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --given_num True \
    --bf16 True \
    --fix_vit True \
    --fix_sampler True \
    --use_lora True \
    --hd_num 18 \
    --output_dir output/finetune_lora \
    --num_train_epochs 10 \
    --batch_size 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --max_length 1024 \
    --deepspeed ds_config_zero2.json \
    --gradient_checkpointing True

The loss restarts from about 6.0:

Could not estimate the number of tokens of the input, floating-point operations will not be computed
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 5.7047, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                      
 10%|████████▊                                                                               | 1/10 [00:12<01:56, 13.00s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4522, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                      
{'loss': 0.5286, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                      
{'loss': 1.2067, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                     
{'loss': 0.1496, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                     
 50%|████████████████████████████████████████████                                            | 5/10 [00:27<00:21,  4.39s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
{'loss': 1.3043, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                     
{'loss': 0.731, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                      
{'loss': 0.1483, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                      
{'loss': 1.6292, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                     
{'loss': 0.3236, 'learning_rate': 0.0, 'epoch': 6.4}                                                                        
{'train_runtime': 191.0236, 'train_samples_per_second': 0.523, 'train_steps_per_second': 0.052, 'train_loss': 1.2178180634975433, 'epoch': 6.4}
100%|███████████████████████████████████████████████████████████████████████████████████████| 10/10 [03:11<00:00, 19.10s/it]
[2024-08-18 08:01:03,083] [INFO] [launch.py:347:main] Process 3485551 exits successfully.
[2024-08-18 08:01:04,084] [INFO] [launch.py:347:main] Process 3485552 exits successfully.
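As a quick sanity check (my own suggestion, not something the repo provides), the merged checkpoint can be compared tensor-by-tensor against the original base model; if the merge actually applied the LoRA deltas, the adapter's target weights should differ between the two:

import torch
from transformers import AutoModelForCausalLM

base_path = "/home/yerong2/models/internlm-xcomposer2d5-7b"
merged_path = "merged/checkpoint-10"

base = AutoModelForCausalLM.from_pretrained(
    base_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
merged = AutoModelForCausalLM.from_pretrained(
    merged_path, torch_dtype=torch.bfloat16, trust_remote_code=True)

base_sd, merged_sd = base.state_dict(), merged.state_dict()
changed = [k for k in base_sd
           if k in merged_sd and not torch.equal(base_sd[k], merged_sd[k])]
print(f"{len(changed)} of {len(base_sd)} tensors differ")
print(changed[:10])  # an empty list means the merged model is effectively the base model

If no tensors differ, Step 3 is effectively fine-tuning the untouched base model again, which would be consistent with the loss restarting near 6.0.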

YerongLi commented Aug 19, 2024

With `internlm-xcomposer2-vl-7b`, resuming from an intermediate checkpoint works without any issue.

YerongLi commented:

This issue coexists with #426
