2d5-7b : I found the LoRA-checkpoint saved with multiple gpu is incorrect #426

Open
YerongLi opened this issue Aug 19, 2024 · 16 comments

YerongLi commented Aug 19, 2024

I found that with internlm-xcomposer2d5-7b, the LoRA checkpoint saved by finetune.py with one GPU is correct, while the checkpoint saved with multiple GPUs is incorrect.

Has anyone met a similar problem?

For example, I saved the LoRA checkpoint ./multi from 2-GPU training and ./single from 1-GPU training.
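For reference, the merge-and-reload step behind the "Model merged successfully" lines below is roughly the following (a minimal PEFT sketch, not the exact merge.sh used here; the paths and trust_remote_code usage are assumptions):

```python
# Minimal sketch (not the repo's merge.sh): merge a saved LoRA adapter into the
# base model with PEFT, then write the merged weights out for reloading.
# Paths are placeholders; trust_remote_code is needed for the InternLM-XComposer code.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_path = "/home/yerong2/models/internlm-xcomposer2d5-7b"
adapter_path = "./multi"              # the LoRA checkpoint to inspect
merged_path = "merged/finetune_lora"  # where the merged model goes

base = AutoModelForCausalLM.from_pretrained(
    base_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
model = PeftModel.from_pretrained(base, adapter_path)  # attach the saved adapter
model = model.merge_and_unload()                       # fold the LoRA deltas into the base weights
model.save_pretrained(merged_path)

tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
tokenizer.save_pretrained(merged_path)
```
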
With multiple GPUs

 ==== Model merged successfully from checkpoint: ./multi
 ==== Model merged successfully from checkpoint: ./multi

trainable params: 151,003,136 || all params: 11,246,729,216 || trainable%: 1.3426
init mix data at rank 1
load 20 data
load 10 data
load 10 data
10samples is loaded
True
trainable params: 151,003,136 || all params: 11,246,729,216 || trainable%: 1.3426
Loading data...
Load 20 samples from ['data/only_text_example.json', '0.02']
Load 10 samples from ['data/single_turn_single_image_example.json', '0.01']
Load 10 samples from ['data/multi_turn_multi_images_example.json', '0.01']
init mix data at rank 0
load 20 data
load 10 data
load 10 data
10samples is loaded
True
[2024-08-19 04:49:00,958] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter stage3_gather_fp16_weights_on_model_save is deprecated use gather_16bit_weights_on_model_save instead
[2024-08-19 04:49:00,960] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter stage3_gather_fp16_weights_on_model_save is deprecated use gather_16bit_weights_on_model_save instead
  0%|                                                                                                                           | 0/10 [00:00<?, ?it/s]Set seed 0 for rank 0

Could not estimate the number of tokens of the input, floating-point operations will not be computed
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 5.7072, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                                                 
 10%|███████████▌                                                                                                       | 1/10 [00:13<02:03, 
{'loss': 0.4416, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                 
{'loss': 0.5018, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                 
{'loss': 1.2302, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                
{'loss': 0.1435, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                
{'loss': 1.3144, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                
{'loss': 0.7652, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                
{'loss': 0.1507, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                 
{'loss': 1.6507, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                
{'loss': 0.3251, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                   
{'train_runtime': 54.9046, 'train_samples_per_second': 1.821, 'train_steps_per_second': 0.182, 'train_loss': 1.2230536431074142, 'epoch': 6.4}         
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:54<00:00,  5.49s/it]
[2024-08-19 04:50:23,365] [INFO] [launch.py:347:main] Process 757444 exits successfully.
[2024-08-19 04:50:25,366] [INFO] [launch.py:347:main] Process 757443 exits successfully.

With single GPU

 ==== Model merged successfully from checkpoint: ./single
trainable params: 151,003,136 || all params: 11,246,729,216 || trainable%: 1.3426
Loading data...
Load 20 samples from ['data/only_text_example.json', '0.02']
Load 10 samples from ['data/single_turn_single_image_example.json', '0.01']
Load 10 samples from ['data/multi_turn_multi_images_example.json', '0.01']
init mix data at rank 0
load 20 data
load 10 data
load 10 data
10samples is loaded
True
[2024-08-19 05:01:54,481] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter stage3_gather_fp16_weights_on_model_save is deprecated use gather_16bit_weights_on_model_save instead
  0%|                                                                                                                           | 0/10 [00:00<?, ?it/s]Set seed 8 for rank 0
{'loss': 1.7768, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                 
{'loss': 1.6696, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                 
{'loss': 2.0381, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                 
{'loss': 3.3811, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                
{'loss': 2.8074, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                
{'loss': 3.2228, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                
{'loss': 0.8844, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                
{'loss': 1.4385, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}                                                                                 
{'loss': 0.8122, 'learning_rate': 1.5076844803522922e-06, 'epoch': 7.2}                                                                                
{'loss': 0.9553, 'learning_rate': 0.0, 'epoch': 8.0}                                                                                                   
{'train_runtime': 116.9895, 'train_samples_per_second': 0.855, 'train_steps_per_second': 0.085, 'train_loss': 1.8986241340637207, 'epoch': 8.0}        
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:56<00:00, 11.70s/it]
[2024-08-19 05:04:05,081] [INFO] [launch.py:347:main] Process 773274 exits successfully.
YerongLi changed the title from "I found the LoRA-checkpoint saved with multiple gpu is incorrect" to "2d5-7b : I found the LoRA-checkpoint saved with multiple gpu is incorrect" on Aug 19, 2024

YerongLi commented Aug 19, 2024

With internlm-xcomposer2-vl-7b, saving with multiple GPUs has no problem at all.

yuhangzang (Collaborator) commented Aug 21, 2024

The fine-tuning code works fine in our environment (multi-GPU + checkpoint saving). Can you provide more details of your environment (e.g., the package versions of transformers and PEFT)? We use transformers==4.33.2 and peft==0.8.2.
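For comparison, something like the following prints the installed versions of the relevant packages (a small convenience sketch, not part of the repo):

```python
# Convenience sketch: print the versions of the packages relevant to this issue.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("transformers", "peft", "accelerate", "deepspeed", "torch"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```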

YerongLi commented Aug 21, 2024

@yuhangzang
My environment YAML is attached as mllm.txt; it matches your PEFT and transformers versions.

My Python script is here:
finetune.py.txt

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
# export MODEL="merged/finetune_lora"
# export DATA="path of data"
export DATA="data.txt"
ds_master_port=$((29000 + RANDOM % 1000))
GPUS_PER_NODE=2
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
# deepspeed --num_gpus 2 finetune.py \
deepspeed --master_port $ds_master_port --include localhost:1 finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --given_num True \
    --bf16 True \
    --fix_vit True \
    --fix_sampler True \
    --use_lora True \
    --hd_num 18 \
    --output_dir output/finetune_lora \
    --num_train_epochs 10 \
    --batch_size 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --save_steps 5 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --max_length 1024 \
    --deepspeed ds_config_zero2.json \
    --gradient_checkpointing True

yuhangzang (Collaborator) commented:

Why did you choose deepspeed instead of torchrun?

YerongLi commented:

What is the big difference between them? Let me try torchrun.


YerongLi commented Aug 21, 2024

@yuhangzang torchrun does not change anything

Could not estimate the number of tokens of the input, floating-point operations will not be computed
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 5.5722, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                          
 10%|█████████▏                                                                                  | 1/10 [00:12<01:54, 12.72s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4475, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                          
{'loss': 0.5362, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                          
 30%|███████████████████████████▌                                                                | 3/10 [00:16<00:32,  4.64s/it]

The initial loss is still around 6.0.
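A rough way to check whether the multi-GPU save captured any training at all: PEFT initializes the lora_B matrices to zero, so if they are still all zeros in the saved adapter, the checkpoint is effectively untrained. A diagnostic sketch (the file names are the usual PEFT defaults and may differ here):

```python
# Diagnostic sketch (assumes the usual PEFT adapter layout): if every lora_B tensor
# in the saved adapter is still zero, the checkpoint holds an effectively untrained LoRA.
import os
import torch
from safetensors.torch import load_file

adapter_dir = "output/finetune_lora"  # placeholder: the saved LoRA directory
st = os.path.join(adapter_dir, "adapter_model.safetensors")
bn = os.path.join(adapter_dir, "adapter_model.bin")
state = load_file(st) if os.path.exists(st) else torch.load(bn, map_location="cpu")

lora_b = {k: v for k, v in state.items() if "lora_B" in k}
all_zero = [k for k, v in lora_b.items() if torch.count_nonzero(v) == 0]
print(f"{len(all_zero)}/{len(lora_b)} lora_B tensors are all zeros")
```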

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

# export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
export MODEL="merged/finetune_lora"
# export DATA="path of data"
export DATA="data.txt"

ds_master_port=$((29000 + RANDOM % 1000))

CUDA_VISIBLE_DEVICES=2,3  # note: without `export`, this value is not passed to the torchrun process below
GPUS_PER_NODE=$(echo $CUDA_VISIBLE_DEVICES | tr ',' '\n' | wc -l)
echo "GPUS_PER_NODE=$GPUS_PER_NODE"
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001


DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --given_num True \
    --bf16 True \
    --fix_vit True \
    --fix_sampler True \
    --use_lora True \
    --hd_num 18 \
    --output_dir output/finetune_lora \
    --num_train_epochs 10 \
    --batch_size 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --save_steps 5 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --max_length 1024 \
    --deepspeed ds_config_zero2.json \
    --gradient_checkpointing True


YerongLi commented Aug 21, 2024

  • 1 GPU saving checkpoint
==== NUMBER OF GPUS ==== GPUS_PER_NODE=1
{'loss': 6.2112, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                                                
{'loss': 5.2995, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 4.7574, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                                                
{'loss': 8.7194, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                                               
{'loss': 4.196, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                                                
{'loss': 4.1853, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                                               
{'loss': 1.243, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                                                
{'loss': 1.8546, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}                                                                                                                
{'loss': 1.1676, 'learning_rate': 1.5076844803522922e-06, 'epoch': 7.2}                                                                                                               
{'loss': 1.5011, 'learning_rate': 0.0, 'epoch': 8.0}  
  • 1 GPU saving - 2 GPU loading
=== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 1.5588, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                                                                                
 10%|██████████████▌                                                                                                                                   | 1/10 [00:13<01:57, 13.01s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4286, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 0.3325, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                
{'loss': 0.6463, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1381, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.0558, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 0.4959, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.1399, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                
 80%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                             | 8/10 [00:40<00:07,  3.82s/it]
{'loss': 1.1972, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.2461, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                                                  
{'train_runtime': 52.5835, 'train_samples_per_second': 1.902, 'train_steps_per_second': 0.19, 'train_loss': 0.6239138901233673, 'epoch': 6.4}                                         
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
  • 2 GPU training checkpoints
=== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 6.3766, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                                                                                
 10%|██████████████▌                                                                                                                                   | 1/10 [00:13<01:59, 13.23s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.46, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                  
{'loss': 0.7684, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                
{'loss': 1.9358, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1545, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.9196, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 1.3209, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.15, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                  
{'loss': 2.8148, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.5776, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                                                  
{'train_runtime': 52.7429, 'train_samples_per_second': 1.896, 'train_steps_per_second': 0.19, 'train_loss': 1.6478209435939788, 'epoch': 6.4} 
  • 2 GPU saving - 2 GPU loading
==== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 6.4358, 'learning_rate': 5e-05, 'epoch': 1.0}
 10%|██████████████▌                                                                                                                                   | 1/10 [00:12<01:54, 12.76s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4603, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 0.7867, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                
{'loss': 1.9566, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1517, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.9381, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 1.3546, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.1513, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                
{'loss': 3.0085, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.5992, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                                                  
{'train_runtime': 52.3773, 'train_samples_per_second': 1.909, 'train_steps_per_second': 0.191, 'train_loss': 1.6842805743217468, 'epoch': 6.4}                                        
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:52<00:00,  5.24s/it]
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/peft/utils/save_and_load.py:148: UserWarning: Could not find a config file in /home/yerong2/models/internlm-xcomposer2d5-7b - will assume that the vocabulary was not modified.
  • 2 GPU saving - 1 GPU loading
==== NUMBER OF GPUS ==== GPUS_PER_NODE=1
{'loss': 6.1474, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                                                
{'loss': 5.2065, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 4.749, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                                                 
{'loss': 8.4081, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                                               
{'loss': 4.0869, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                                               
{'loss': 4.1438, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                                               
{'loss': 1.2339, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                                               
{'loss': 1.9014, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}   


YerongLi commented Aug 21, 2024

  1. Steps to reproduce: Multiple GPU error, training-save-reload is incorrect
  2. Steps to reproduce: Single GPU, no error
  3. Control experiment: Save with 1 GPU training, load and resume training with 2 GPUs, no error
  4. Other clues

All scripts are in this zip file:
finetune25.zip

  • I used deepspeed because torchrun cannot specify GPU indices

- Steps to reproduce: Multiple GPU error, training-save-reload is incorrect

  1. Run multi_finetune_lora.sh for the first time, keeping export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
  2. Merge the LoRA: bash merge.sh output/finetune_lora
  3. Run load_multi_finetune_lora.sh to load the checkpoint output/finetune_lora, keeping export MODEL="merged/finetune_lora"

Observation:

## FIRST TIME TRAINING
==== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 6.4358, 'learning_rate': 5e-05, 'epoch': 1.0}
 10%|██████████████▌                                                                                                                                   | 1/10 [00:12<01:54, 12.76s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4603, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 0.7867, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                
{'loss': 1.9566, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1517, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.9381, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 1.3546, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.1513, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                
{'loss': 3.0085, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.5992, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                                                  
{'train_runtime': 52.3773, 'train_samples_per_second': 1.909, 'train_steps_per_second': 0.191, 'train_loss': 1.6842805743217468, 'epoch': 6.4}                                        
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:52<00:00,  5.24s/it]
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-package

## SECOND TIME: LOAD AND TRAINING
==== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 6.1474, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                                                
{'loss': 5.2065, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 4.749, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                                                 
{'loss': 8.4081, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                                               
{'loss': 4.0869, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                                               
{'loss': 4.1438, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                                               
{'loss': 1.2339, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                                               
{'loss': 1.9014, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}

With loading output/finetune_lora, the loss starts again from around 6.0, which is unexpected.
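
A related check (a hedged sketch, not one of the scripts in the zip): compare the merged model against the base model; if the 2-GPU-saved adapter carried no training signal, the merged weights should be essentially identical to the base weights.

```python
# Sketch: compare merged vs. base weights to see whether merging the 2-GPU-saved
# adapter changed anything at all. Paths are placeholders for this setup; both
# models are loaded on CPU, so this needs enough host RAM.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "/home/yerong2/models/internlm-xcomposer2d5-7b",
    torch_dtype=torch.bfloat16, trust_remote_code=True)
merged = AutoModelForCausalLM.from_pretrained(
    "merged/finetune_lora",
    torch_dtype=torch.bfloat16, trust_remote_code=True)

base_sd, merged_sd = base.state_dict(), merged.state_dict()
max_diff = max((base_sd[k].float() - merged_sd[k].float()).abs().max().item()
               for k in base_sd if k in merged_sd)
print(f"max |base - merged| over shared parameters: {max_diff:.6f}")
# A value of (almost) 0 would mean the merge was a no-op, i.e. the saved LoRA was untrained.
```
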

- Steps to reproduce: Single GPU, no error

  1. Run single_finetune_lora.sh for the first time, keeping export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
  2. Merge the LoRA: bash merge.sh output/finetune_lora
  3. Run single_finetune_lora.sh for the second time, keeping export MODEL="merged/finetune_lora"

Observation:

## FIRST TIME TRAINING
==== NUMBER OF GPUS ==== GPUS_PER_NODE=1
{'loss': 6.2112, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                                                
{'loss': 5.2995, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 4.7574, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                                                
{'loss': 8.7194, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                                               
{'loss': 4.196, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                                                
{'loss': 4.1853, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                                               
{'loss': 1.243, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                                                
{'loss': 0.8546, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}                                                                                                                
{'loss': 1.1676, 'learning_rate': 1.5076844803522922e-06, 'epoch': 7.2}                                                                                                               
{'loss': 0.5011, 'learning_rate': 0.0, 'epoch': 8.0}  

## SECOND TIME: LOAD AND TRAINING
{'loss': 0.46, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                  
{'loss': 0.759, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                 
{'loss': 1.9288, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1534, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.9273, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 1.3308, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.1523, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                
{'loss': 2.9644, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.5831, 'learning_rate': 0.0, 'epoch': 6.4}                 

After reloading from the previous training, the loss starts from a good point, which is expected.

- Control experiment: Save with 1 GPU training, load and resume training with 2 GPUs, no error

==== NUMBER OF GPUS ==== GPUS_PER_NODE=1
{'loss': 6.2112, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                                                
{'loss': 5.2995, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 4.7574, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                                                
{'loss': 8.7194, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                                               
{'loss': 4.196, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                                                
{'loss': 4.1853, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                                               
{'loss': 1.243, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                                                
{'loss': 1.8546, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}                                                                                                                
{'loss': 1.1676, 'learning_rate': 1.5076844803522922e-06, 'epoch': 7.2}                                                                                                               
{'loss': 1.5011, 'learning_rate': 0.0, 'epoch': 8.0}  
=== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 1.5588, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                                                                                
 10%|██████████████▌                                                                                                                                   | 1/10 [00:13<01:57, 13.01s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4286, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 0.3325, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                
{'loss': 0.6463, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1381, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.0558, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 0.4959, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.1399, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                
 80%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                             | 8/10 [00:40<00:07,  3.82s/it]
{'loss': 1.1972, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.2461, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                                                  
{'train_runtime': 52.5835, 'train_samples_per_second': 1.902, 'train_steps_per_second': 0.19, 'train_loss': 0.6239138901233673, 'epoch': 6.4}                                         
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████

This behaves normally.

- Other clues:

In the same conda environment, internlm/internlm-xcomposer2-vl-7b saves and loads checkpoints correctly.

khyati2396 commented:

@YerongLi Have you tried full fine-tuning on multiple GPUs?

yuhangzang (Collaborator) commented Aug 30, 2024

If you want to continue training with the LoRA, you may face errors when you continue training with an older transformers version. We are working on providing new fine-tuning code that is compatible with the new transformers version.


YerongLi commented Aug 30, 2024

@YerongLi Have you tried full fine-tuning on multiple GPUs?

@khyati2396 I am not able to do full fine-tuning; I only have multiple 48 GB GPUs.
There is a known similar issue that seems to be a problem with deepspeed; I am not sure whether we can do multi-GPU training without deepspeed (i.e., whether we can dispatch the model over multiple GPUs).
huggingface/transformers#25340
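(For reference, the model weights can be dispatched over several GPUs without deepspeed at load time via accelerate's device_map, as in the sketch below, but that is sharded inference-style placement rather than a replacement for the DDP/ZeRO training path in finetune.py.)

```python
# Hedged sketch: dispatch the model over the visible GPUs without deepspeed by letting
# accelerate shard it at load time. This places the weights for forward passes; it is
# not a drop-in replacement for the DDP/ZeRO training setup in finetune.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/yerong2/models/internlm-xcomposer2d5-7b"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # accelerate splits the layers across the available GPUs
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
print(model.hf_device_map)      # shows which module ended up on which device
```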


YerongLi commented Aug 30, 2024

If you want to continue training with the LoRA, you may face errors when you continue training with an older transformers version. We are working on providing new fine-tuning code that is compatible with the new transformers version.

I mean, are you able to reproduce this error? Can you resume from a "correct" loss after saving the model? For me, in the same conda environment, internlm-xcomposer2-vl-7b works fine with .merge_and_unload on multiple GPUs, but internlm-xcomposer2d5-7b breaks with resume_from_checkpoint=True.

I feel this is a deep bug in the interaction between deepspeed and transformers...
Which transformers version are you using? 4.33.2 is suggested. With newer versions, we need gradient_checkpointing_enable({"use_reentrant": True}). https://huggingface.co/internlm/internlm-xcomposer2d5-7b/discussions/20

Let me try newer transformers at the same time.
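For the newer-transformers path, the reentrant flag can be passed roughly like this (a sketch of the standard Trainer API, not the exact wiring in this repo's finetune.py):

```python
# Sketch: the standard way to pass use_reentrant on newer transformers (>= 4.35 or so).
# Not the exact wiring in this repo's finetune.py.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output/finetune_lora",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": True},  # or False for the non-reentrant variant
)

# Equivalently, directly on the model before handing it to the Trainer:
# model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": True})
```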


YerongLi commented Sep 2, 2024

If you want to continue training with the LoRA, you may face errors when you continue training with an older transformers version. We are working on providing new fine-tuning code that is compatible with the new transformers version.

@yuhangzang I tested with newer transformers and got the same bug; the only difference is that the initial loss starts from around 2.0 instead of 6.0.

$ pip show transformers; pip show peft; pip show accelerate; pip show deepspeed
DEPRECATION: Loading egg at /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/flash_attn-2.6.3-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: transformers
Version: 4.44.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft
DEPRECATION: Loading egg at /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/flash_attn-2.6.3-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: peft
Version: 0.8.2
Summary: Parameter-Efficient Fine-Tuning (PEFT)
Home-page: https://github.com/huggingface/peft
Author: The HuggingFace team
Author-email: sourab@huggingface.co
License: Apache
Location: /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages
Requires: accelerate, huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch, tqdm, transformers
Required-by: 
DEPRECATION: Loading egg at /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/flash_attn-2.6.3-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: accelerate
Version: 0.33.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: zach.mueller@huggingface.co
License: Apache
Location: /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: peft
DEPRECATION: Loading egg at /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/flash_attn-2.6.3-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: deepspeed
Version: 0.12.3
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-info@microsoft.com
License: Apache Software License 2.0
Location: /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages
Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, pynvml, torch, tqdm
Required-by: 
 ==== Model merged successfully from checkpoint: output/finetune_lora
{'loss': 1.9818, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                                              
 10%|███████████▏                                                                                                    | 1/10 [00:13<01:58, 13.16s/it]/home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4471, 'grad_norm': 30.07342800176501, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                              
{'loss': 0.5259, 'grad_norm': 30.07342800176501, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                              
{'loss': 2.0256, 'grad_norm': 30.07342800176501, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                             
{'loss': 0.1477, 'grad_norm': 78.24120420313231, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                             
{'loss': 1.2312, 'grad_norm': 78.24120420313231, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                             
{'loss': 0.7337, 'grad_norm': 33.6438670494202, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                              
{'loss': 0.1426, 'grad_norm': 33.6438670494202, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                               
{'loss': 2.1172, 'grad_norm': 33.6438670494202, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                              
{'loss': 0.3183, 'grad_norm': 25.38373103524246, 'learning_rate': 0.0, 'epoch': 6.4}                                                                
{'train_runtime': 52.4924, 'train_samples_per_second': 1.905, 'train_steps_per_second': 0.191, 'train_loss': 0.9671015128493309, 'epoch': 6.4}  

YerongLi commented:

Tested with both CUDA 11.7 and CUDA 12; neither works, and the issue persists.

YerongLi commented:

Tested with a smaller learning rate; that is not the issue.


YerongLi commented Sep 13, 2024

I tested on a machine with 2x L40S and a machine with 2x A100; neither of them works.

Here is the output on the 2x A100 machine:

Loading data...
 ==== Model merged successfully from checkpoint: output/finetune_lora
LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='CAUSAL_LM', inference_mode=False, r=64, target_modules={'attention.wo', 'attention.wqkv', 'feed_forward.w3', 'feed_forward.w2', 'feed_forward.w1'}, lora_alpha=64, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})
Load 20 samples from ['data/only_text_example.json', '0.02']
Load 10 samples from ['data/single_turn_single_image_example.json', '0.01']
Load 10 samples from ['data/multi_turn_multi_images_example.json', '0.01']
init mix data at rank 0
load 20 data
load 10 data
load 10 data
10samples is loaded
True
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/accelerate/accelerator.py:447: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
  warnings.warn(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
trainable params: 99,098,624 || all params: 11,194,824,704 || trainable%: 0.8852181844758255
init mix data at rank 1
load 20 data
load 10 data
load 10 data
10samples is loaded
True
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/accelerate/accelerator.py:447: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
  warnings.warn(
[2024-09-13 04:39:10,210] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter stage3_gather_fp16_weights_on_model_save is deprecated use gather_16bit_weights_on_model_save instead
[2024-09-13 04:39:10,210] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter stage3_gather_fp16_weights_on_model_save is deprecated use gather_16bit_weights_on_model_save instead
  0%|                                                                                    | 0/12 [00:00<?, ?it/s]Set seed 0 for rank 0
Set seed 3 for rank 1
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Could not estimate the number of tokens of the input, floating-point operations will not be computed
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 6.1314, 'learning_rate': 5e-05, 'epoch': 1.0}                                                          
  8%|██████▎                                                                     | 1/12 [01:21<14:50, 80.98s/it]/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4381, 'learning_rate': 4.898732434036244e-05, 'epoch': 1.6}                                          
{'loss': 0.7703, 'learning_rate': 4.6031338320779534e-05, 'epoch': 2.0}                                         
{'loss': 1.8976, 'learning_rate': 4.137151834863213e-05, 'epoch': 3.0}                                          
{'loss': 0.1462, 'learning_rate': 3.5385375325047166e-05, 'epoch': 3.2}                                         
{'loss': 1.8919, 'learning_rate': 2.8557870956832132e-05, 'epoch': 4.0}                                         
{'loss': 1.3274, 'learning_rate': 2.1442129043167874e-05, 'epoch': 4.8}                                         
{'loss': 0.1458, 'learning_rate': 1.4614624674952842e-05, 'epoch': 5.0}                                         
{'loss': 3.2888, 'learning_rate': 8.628481651367876e-06, 'epoch': 6.0}                                          
{'loss': 0.6449, 'learning_rate': 3.968661679220468e-06, 'epoch': 6.4}                                          
{'loss': 0.431, 'learning_rate': 1.0126756596375686e-06, 'epoch': 7.0}                                          
{'loss': 3.4945, 'learning_rate': 0.0, 'epoch': 8.0}                                                            
{'train_runtime': 149.6291, 'train_samples_per_second': 0.802, 'train_steps_per_second': 0.08, 'train_loss': 1.7173307587703068, 'epoch': 8.0}
100%|███████████████████████████████████████████████████████████████████████████| 12/12 [02:29<00:00, 12.46s/it]
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/peft/utils/save_and_load.py:148: UserWarning: Could not find a config file in /scratch/bbrz/yirenl2/models//internlm-xcomposer2d5-7b - will assume that the vocabulary was not modified.
  warnings.warn(
[2024-09-13 04:43:29,332] [INFO] [launch.py:347:main] Process 3334842 exits successfully.
[2024-09-13 04:43:29,332] [INFO] [launch.py:347:main] Process 3334843 exits successfully.
