2d5-7b : I found the LoRA-checkpoint saved with multiple gpu is incorrect #426

Open
YerongLi opened this issue Aug 19, 2024 · 16 comments

YerongLi commented Aug 19, 2024

I found that with internlm-xcomposer2d5-7b, the LoRA checkpoint saved by finetune.py with one GPU is correct, while the checkpoint saved with multiple GPUs is incorrect.

Has anyone met a similar problem?

For example, I saved the LoRA checkpoint ./multi from 2-GPU training and ./single from 1-GPU training.
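For reference, the merge-and-reload step behind the "Model merged successfully" lines below is roughly the following (a minimal PEFT sketch, not the exact merge.sh used here; the paths and trust_remote_code usage are assumptions):

```python
# Minimal sketch (not the repo's merge.sh): merge a saved LoRA adapter into the
# base model with PEFT, then write the merged weights out for reloading.
# Paths are placeholders; trust_remote_code is needed for the InternLM-XComposer code.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_path = "/home/yerong2/models/internlm-xcomposer2d5-7b"
adapter_path = "./multi"              # the LoRA checkpoint to inspect
merged_path = "merged/finetune_lora"  # where the merged model goes

base = AutoModelForCausalLM.from_pretrained(
    base_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
model = PeftModel.from_pretrained(base, adapter_path)  # attach the saved adapter
model = model.merge_and_unload()                       # fold the LoRA deltas into the base weights
model.save_pretrained(merged_path)

tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
tokenizer.save_pretrained(merged_path)
```
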
With multiple GPUs

 ==== Model merged successfully from checkpoint: ./multi
 ==== Model merged successfully from checkpoint: ./multi

trainable params: 151,003,136 || all params: 11,246,729,216 || trainable%: 1.3426
init mix data at rank 1
load 20 data
load 10 data
load 10 data
10samples is loaded
True
trainable params: 151,003,136 || all params: 11,246,729,216 || trainable%: 1.3426
Loading data...
Load 20 samples from ['data/only_text_example.json', '0.02']
Load 10 samples from ['data/single_turn_single_image_example.json', '0.01']
Load 10 samples from ['data/multi_turn_multi_images_example.json', '0.01']
init mix data at rank 0
load 20 data
load 10 data
load 10 data
10samples is loaded
True
[2024-08-19 04:49:00,958] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter stage3_gather_fp16_weights_on_model_save is deprecated use gather_16bit_weights_on_model_save instead
[2024-08-19 04:49:00,960] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter stage3_gather_fp16_weights_on_model_save is deprecated use gather_16bit_weights_on_model_save instead
  0%|                                                                                                                           | 0/10 [00:00<?, ?it/s]Set seed 0 for rank 0

Could not estimate the number of tokens of the input, floating-point operations will not be computed
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 5.7072, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                                                 
 10%|███████████▌                                                                                                       | 1/10 [00:13<02:03, 
{'loss': 0.4416, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                 
{'loss': 0.5018, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                 
{'loss': 1.2302, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                
{'loss': 0.1435, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                
{'loss': 1.3144, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                
{'loss': 0.7652, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                
{'loss': 0.1507, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                 
{'loss': 1.6507, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                
{'loss': 0.3251, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                   
{'train_runtime': 54.9046, 'train_samples_per_second': 1.821, 'train_steps_per_second': 0.182, 'train_loss': 1.2230536431074142, 'epoch': 6.4}         
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:54<00:00,  5.49s/it]
[2024-08-19 04:50:23,365] [INFO] [launch.py:347:main] Process 757444 exits successfully.
[2024-08-19 04:50:25,366] [INFO] [launch.py:347:main] Process 757443 exits successfully.

With single GPU

 ==== Model merged successfully from checkpoint: ./single
trainable params: 151,003,136 || all params: 11,246,729,216 || trainable%: 1.3426
Loading data...
Load 20 samples from ['data/only_text_example.json', '0.02']
Load 10 samples from ['data/single_turn_single_image_example.json', '0.01']
Load 10 samples from ['data/multi_turn_multi_images_example.json', '0.01']
init mix data at rank 0
load 20 data
load 10 data
load 10 data
10samples is loaded
True
[2024-08-19 05:01:54,481] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter stage3_gather_fp16_weights_on_model_save is deprecated use gather_16bit_weights_on_model_save instead
  0%|                                                                                                                           | 0/10 [00:00<?, ?it/s]Set seed 8 for rank 0
{'loss': 1.7768, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                 
{'loss': 1.6696, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                 
{'loss': 2.0381, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                 
{'loss': 3.3811, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                
{'loss': 2.8074, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                
{'loss': 3.2228, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                
{'loss': 0.8844, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                
{'loss': 1.4385, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}                                                                                 
{'loss': 0.8122, 'learning_rate': 1.5076844803522922e-06, 'epoch': 7.2}                                                                                
{'loss': 0.9553, 'learning_rate': 0.0, 'epoch': 8.0}                                                                                                   
{'train_runtime': 116.9895, 'train_samples_per_second': 0.855, 'train_steps_per_second': 0.085, 'train_loss': 1.8986241340637207, 'epoch': 8.0}        
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:56<00:00, 11.70s/it]
[2024-08-19 05:04:05,081] [INFO] [launch.py:347:main] Process 773274 exits successfully.
YerongLi changed the title from "I found the LoRA-checkpoint saved with multiple gpu is incorrect" to "2d5-7b : I found the LoRA-checkpoint saved with multiple gpu is incorrect" on Aug 19, 2024

YerongLi commented Aug 19, 2024

With internlm-xcomposer2-vl-7b, saving with multiple GPUs has no problem at all.

yuhangzang (Collaborator) commented Aug 21, 2024

The fine-tuning code works fine in our environment (multi-GPU + checkpoint saving). Can you provide more details of your environment (e.g., the package versions of transformers and PEFT)? We use transformers==4.33.2 and peft==0.8.2.
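For comparison, something like the following prints the installed versions of the relevant packages (a small convenience sketch, not part of the repo):

```python
# Convenience sketch: print the versions of the packages relevant to this issue.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("transformers", "peft", "accelerate", "deepspeed", "torch"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```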

YerongLi commented Aug 21, 2024

@yuhangzang
My environment YAML is attached as mllm.txt; it matches your PEFT and transformers versions.

My Python script is here:
finetune.py.txt

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
# export MODEL="merged/finetune_lora"
# export DATA="path of data"
export DATA="data.txt"
ds_master_port=$((29000 + RANDOM % 1000))
GPUS_PER_NODE=2
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
# deepspeed --num_gpus 2 finetune.py \
deepspeed --master_port $ds_master_port --include localhost:1 finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --given_num True \
    --bf16 True \
    --fix_vit True \
    --fix_sampler True \
    --use_lora True \
    --hd_num 18 \
    --output_dir output/finetune_lora \
    --num_train_epochs 10 \
    --batch_size 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --save_steps 5 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --max_length 1024 \
    --deepspeed ds_config_zero2.json \
    --gradient_checkpointing True

yuhangzang (Collaborator) commented:

Why did you choose deepspeed instead of torchrun?

YerongLi commented:

What is the big difference between them? Let me try torchrun.


YerongLi commented Aug 21, 2024

@yuhangzang torchrun does not change anything

Could not estimate the number of tokens of the input, floating-point operations will not be computed
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 5.5722, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                          
 10%|█████████▏                                                                                  | 1/10 [00:12<01:54, 12.72s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4475, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                          
{'loss': 0.5362, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                          
 30%|███████████████████████████▌                                                                | 3/10 [00:16<00:32,  4.64s/it]

The initial loss is still around 6.0.
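A rough way to check whether the multi-GPU save captured any training at all: PEFT initializes the lora_B matrices to zero, so if they are still all zeros in the saved adapter, the checkpoint is effectively untrained. A diagnostic sketch (the file names are the usual PEFT defaults and may differ here):

```python
# Diagnostic sketch (assumes the usual PEFT adapter layout): if every lora_B tensor
# in the saved adapter is still zero, the checkpoint holds an effectively untrained LoRA.
import os
import torch
from safetensors.torch import load_file

adapter_dir = "output/finetune_lora"  # placeholder: the saved LoRA directory
st = os.path.join(adapter_dir, "adapter_model.safetensors")
bn = os.path.join(adapter_dir, "adapter_model.bin")
state = load_file(st) if os.path.exists(st) else torch.load(bn, map_location="cpu")

lora_b = {k: v for k, v in state.items() if "lora_B" in k}
all_zero = [k for k, v in lora_b.items() if torch.count_nonzero(v) == 0]
print(f"{len(all_zero)}/{len(lora_b)} lora_B tensors are all zeros")
```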

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

# export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
export MODEL="merged/finetune_lora"
# export DATA="path of data"
export DATA="data.txt"

ds_master_port=$((29000 + RANDOM % 1000))

CUDA_VISIBLE_DEVICES=2,3  # note: without `export`, this value is not passed to the torchrun process below
GPUS_PER_NODE=$(echo $CUDA_VISIBLE_DEVICES | tr ',' '\n' | wc -l)
echo "GPUS_PER_NODE=$GPUS_PER_NODE"
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001


DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --given_num True \
    --bf16 True \
    --fix_vit True \
    --fix_sampler True \
    --use_lora True \
    --hd_num 18 \
    --output_dir output/finetune_lora \
    --num_train_epochs 10 \
    --batch_size 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --save_steps 5 \
    --save_total_limit 1 \
    --overwrite_output_dir \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --max_length 1024 \
    --deepspeed ds_config_zero2.json \
    --gradient_checkpointing True


YerongLi commented Aug 21, 2024

  • 1 GPU saving checkpoint
==== NUMBER OF GPUS ==== GPUS_PER_NODE=1
{'loss': 6.2112, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                                                
{'loss': 5.2995, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 4.7574, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                                                
{'loss': 8.7194, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                                               
{'loss': 4.196, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                                                
{'loss': 4.1853, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                                               
{'loss': 1.243, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                                                
{'loss': 1.8546, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}                                                                                                                
{'loss': 1.1676, 'learning_rate': 1.5076844803522922e-06, 'epoch': 7.2}                                                                                                               
{'loss': 1.5011, 'learning_rate': 0.0, 'epoch': 8.0}  
  • 1 GPU saving - 2 GPU loading
=== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 1.5588, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                                                                                
 10%|██████████████▌                                                                                                                                   | 1/10 [00:13<01:57, 13.01s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4286, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 0.3325, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                
{'loss': 0.6463, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1381, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.0558, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 0.4959, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.1399, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                
 80%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                             | 8/10 [00:40<00:07,  3.82s/it]
{'loss': 1.1972, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.2461, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                                                  
{'train_runtime': 52.5835, 'train_samples_per_second': 1.902, 'train_steps_per_second': 0.19, 'train_loss': 0.6239138901233673, 'epoch': 6.4}                                         
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
  • 2 GPU training checkpoints
=== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 6.3766, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                                                                                
 10%|██████████████▌                                                                                                                                   | 1/10 [00:13<01:59, 13.23s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.46, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                  
{'loss': 0.7684, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                
{'loss': 1.9358, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1545, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.9196, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 1.3209, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.15, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                  
{'loss': 2.8148, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.5776, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                                                  
{'train_runtime': 52.7429, 'train_samples_per_second': 1.896, 'train_steps_per_second': 0.19, 'train_loss': 1.6478209435939788, 'epoch': 6.4} 
  • 2 GPU saving - 2 GPU loading
==== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 6.4358, 'learning_rate': 5e-05, 'epoch': 1.0}
 10%|██████████████▌                                                                                                                                   | 1/10 [00:12<01:54, 12.76s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4603, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 0.7867, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                
{'loss': 1.9566, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1517, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.9381, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 1.3546, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.1513, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                
{'loss': 3.0085, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.5992, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                                                  
{'train_runtime': 52.3773, 'train_samples_per_second': 1.909, 'train_steps_per_second': 0.191, 'train_loss': 1.6842805743217468, 'epoch': 6.4}                                        
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:52<00:00,  5.24s/it]
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/peft/utils/save_and_load.py:148: UserWarning: Could not find a config file in /home/yerong2/models/internlm-xcomposer2d5-7b - will assume that the vocabulary was not modified.
  • 2 GPU saving - 1 GPU loading
==== NUMBER OF GPUS ==== GPUS_PER_NODE=1
{'loss': 6.1474, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                                                
{'loss': 5.2065, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 4.749, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                                                 
{'loss': 8.4081, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                                               
{'loss': 4.0869, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                                               
{'loss': 4.1438, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                                               
{'loss': 1.2339, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                                               
{'loss': 1.9014, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}   


YerongLi commented Aug 21, 2024

  1. Steps to reproduce: Multiple GPU error, training-save-reload is incorrect
  2. Steps to reproduce: Single GPU, no error
  3. Control experiment: Save with 1 GPU training, load and resume training with 2 GPUs, no error
  4. Other clues

All scripts are in this zip file:
finetune25.zip

  • I used deepspeed because torchrun cannot specify GPU indices

- Steps to reproduce: Multiple GPU error, training-save-reload is incorrect

  1. Run multi_finetune_lora.sh for the first time, keeping export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
  2. Merge the LoRA: bash merge.sh output/finetune_lora
  3. Run load_multi_finetune_lora.sh to load the checkpoint output/finetune_lora, keeping export MODEL="merged/finetune_lora"

Observation:

## FIRST TIME TRAINING
==== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 6.4358, 'learning_rate': 5e-05, 'epoch': 1.0}
 10%|██████████████▌                                                                                                                                   | 1/10 [00:12<01:54, 12.76s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4603, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 0.7867, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                
{'loss': 1.9566, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1517, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.9381, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 1.3546, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.1513, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                
{'loss': 3.0085, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.5992, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                                                  
{'train_runtime': 52.3773, 'train_samples_per_second': 1.909, 'train_steps_per_second': 0.191, 'train_loss': 1.6842805743217468, 'epoch': 6.4}                                        
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:52<00:00,  5.24s/it]
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-package

## SECOND TIME: LOAD AND TRAINING
==== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 6.1474, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                                                
{'loss': 5.2065, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 4.749, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                                                 
{'loss': 8.4081, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                                               
{'loss': 4.0869, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                                               
{'loss': 4.1438, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                                               
{'loss': 1.2339, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                                               
{'loss': 1.9014, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}

With loading output/finetune_lora, the loss starts again from around 6.0, which is unexpected.
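
A related check (a hedged sketch, not one of the scripts in the zip): compare the merged model against the base model; if the 2-GPU-saved adapter carried no training signal, the merged weights should be essentially identical to the base weights.

```python
# Sketch: compare merged vs. base weights to see whether merging the 2-GPU-saved
# adapter changed anything at all. Paths are placeholders for this setup; both
# models are loaded on CPU, so this needs enough host RAM.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "/home/yerong2/models/internlm-xcomposer2d5-7b",
    torch_dtype=torch.bfloat16, trust_remote_code=True)
merged = AutoModelForCausalLM.from_pretrained(
    "merged/finetune_lora",
    torch_dtype=torch.bfloat16, trust_remote_code=True)

base_sd, merged_sd = base.state_dict(), merged.state_dict()
max_diff = max((base_sd[k].float() - merged_sd[k].float()).abs().max().item()
               for k in base_sd if k in merged_sd)
print(f"max |base - merged| over shared parameters: {max_diff:.6f}")
# A value of (almost) 0 would mean the merge was a no-op, i.e. the saved LoRA was untrained.
```
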

- Steps to reproduce: Single GPU, no error

  1. Run single_finetune_lora.sh for the first time, keeping export MODEL="/home/yerong2/models/internlm-xcomposer2d5-7b"
  2. Merge the LoRA: bash merge.sh output/finetune_lora
  3. Run single_finetune_lora.sh for the second time, keeping export MODEL="merged/finetune_lora"

Observation:

## FIRST TIME TRAINING
==== NUMBER OF GPUS ==== GPUS_PER_NODE=1
{'loss': 6.2112, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                                                
{'loss': 5.2995, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 4.7574, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                                                
{'loss': 8.7194, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                                               
{'loss': 4.196, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                                                
{'loss': 4.1853, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                                               
{'loss': 1.243, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                                                
{'loss': 0.8546, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}                                                                                                                
{'loss': 1.1676, 'learning_rate': 1.5076844803522922e-06, 'epoch': 7.2}                                                                                                               
{'loss': 0.5011, 'learning_rate': 0.0, 'epoch': 8.0}  

## SECOND TIME: LOAD AND TRAINING
{'loss': 0.46, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                  
{'loss': 0.759, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                 
{'loss': 1.9288, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1534, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.9273, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 1.3308, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.1523, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                
{'loss': 2.9644, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.5831, 'learning_rate': 0.0, 'epoch': 6.4}                 

After reloading from the previous training, the loss starts from a good point, which is expected.

- Control experiment: Save with 1 GPU training, load and resume training with 2 GPUs, no error

==== NUMBER OF GPUS ==== GPUS_PER_NODE=1
{'loss': 6.2112, 'learning_rate': 5e-05, 'epoch': 0.8}                                                                                                                                
{'loss': 5.2995, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 4.7574, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.4}                                                                                                                
{'loss': 8.7194, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.2}                                                                                                               
{'loss': 4.196, 'learning_rate': 2.9341204441673266e-05, 'epoch': 4.0}                                                                                                                
{'loss': 4.1853, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.8}                                                                                                               
{'loss': 1.243, 'learning_rate': 1.2500000000000006e-05, 'epoch': 5.6}                                                                                                                
{'loss': 1.8546, 'learning_rate': 5.848888922025553e-06, 'epoch': 6.4}                                                                                                                
{'loss': 1.1676, 'learning_rate': 1.5076844803522922e-06, 'epoch': 7.2}                                                                                                               
{'loss': 1.5011, 'learning_rate': 0.0, 'epoch': 8.0}  
=== NUMBER OF GPUS ==== GPUS_PER_NODE=2
{'loss': 1.5588, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                                                                                
 10%|██████████████▌                                                                                                                                   | 1/10 [00:13<01:57, 13.01s/it]/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4286, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                                                                                                
{'loss': 0.3325, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                                                                                                
{'loss': 0.6463, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                                                                                               
{'loss': 0.1381, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                                                                                               
{'loss': 1.0558, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                                                                                               
{'loss': 0.4959, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                                                                                               
{'loss': 0.1399, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                                                                                                
 80%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                             | 8/10 [00:40<00:07,  3.82s/it]
{'loss': 1.1972, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                                                                                               
{'loss': 0.2461, 'learning_rate': 0.0, 'epoch': 6.4}                                                                                                                                  
{'train_runtime': 52.5835, 'train_samples_per_second': 1.902, 'train_steps_per_second': 0.19, 'train_loss': 0.6239138901233673, 'epoch': 6.4}                                         
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████

This behaves normally.

- Other clues:

In the same conda environment, internlm/internlm-xcomposer2-vl-7b saves and loads checkpoints correctly.

khyati2396 commented:

@YerongLi Have you tried full fine-tuning on multiple GPUs?

yuhangzang (Collaborator) commented Aug 30, 2024

If you want to continue training with the LoRA, you may face errors when you continue training with an older transformers version. We are working on providing new fine-tuning code that is compatible with the new transformers version.


YerongLi commented Aug 30, 2024

@YerongLi Have you tried full fine-tuning on multiple GPUs?

@khyati2396 I am not able to do full fine-tuning; I only have multiple 48 GB GPUs.
There is a known similar issue that seems to be a problem with deepspeed; I am not sure whether we can do multi-GPU training without deepspeed (i.e., whether we can dispatch the model over multiple GPUs).
huggingface/transformers#25340
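(For reference, the model weights can be dispatched over several GPUs without deepspeed at load time via accelerate's device_map, as in the sketch below, but that is sharded inference-style placement rather than a replacement for the DDP/ZeRO training path in finetune.py.)

```python
# Hedged sketch: dispatch the model over the visible GPUs without deepspeed by letting
# accelerate shard it at load time. This places the weights for forward passes; it is
# not a drop-in replacement for the DDP/ZeRO training setup in finetune.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/yerong2/models/internlm-xcomposer2d5-7b"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # accelerate splits the layers across the available GPUs
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
print(model.hf_device_map)      # shows which module ended up on which device
```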


YerongLi commented Aug 30, 2024

If you want to continue training with the LoRA, you may face errors when you continue training with an older transformers version. We are working on providing new fine-tuning code that is compatible with the new transformers version.

I mean, are you able to reproduce this error? Can you resume from a "correct" loss after saving the model? For me, in the same conda environment, internlm-xcomposer2-vl-7b works fine with .merge_and_unload on multiple GPUs, but internlm-xcomposer2d5-7b breaks with resume_from_checkpoint=True.

I feel this is a deep bug in the interaction between deepspeed and transformers...
Which transformers version are you using? 4.33.2 is suggested. With newer versions, we need gradient_checkpointing_enable({"use_reentrant": True}). https://huggingface.co/internlm/internlm-xcomposer2d5-7b/discussions/20

Let me try newer transformers at the same time.
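For the newer-transformers path, the reentrant flag can be passed roughly like this (a sketch of the standard Trainer API, not the exact wiring in this repo's finetune.py):

```python
# Sketch: the standard way to pass use_reentrant on newer transformers (>= 4.35 or so).
# Not the exact wiring in this repo's finetune.py.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output/finetune_lora",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": True},  # or False for the non-reentrant variant
)

# Equivalently, directly on the model before handing it to the Trainer:
# model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": True})
```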


YerongLi commented Sep 2, 2024

If you want to continue training with the LoRA, you may face errors when you continue training with an older transformers version. We are working on providing new fine-tuning code that is compatible with the new transformers version.

@yuhangzang I tested with newer transformers and got the same bug; the only difference is that the initial loss starts from around 2.0 instead of 6.0.

$ pip show transformers; pip show peft; pip show accelerate; pip show deepspeed
DEPRECATION: Loading egg at /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/flash_attn-2.6.3-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: transformers
Version: 4.44.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft
DEPRECATION: Loading egg at /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/flash_attn-2.6.3-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: peft
Version: 0.8.2
Summary: Parameter-Efficient Fine-Tuning (PEFT)
Home-page: https://github.com/huggingface/peft
Author: The HuggingFace team
Author-email: sourab@huggingface.co
License: Apache
Location: /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages
Requires: accelerate, huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch, tqdm, transformers
Required-by: 
DEPRECATION: Loading egg at /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/flash_attn-2.6.3-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: accelerate
Version: 0.33.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: zach.mueller@huggingface.co
License: Apache
Location: /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: peft
DEPRECATION: Loading egg at /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/flash_attn-2.6.3-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: deepspeed
Version: 0.12.3
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-info@microsoft.com
License: Apache Software License 2.0
Location: /home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages
Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, pynvml, torch, tqdm
Required-by: 
 ==== Model merged successfully from checkpoint: output/finetune_lora
{'loss': 1.9818, 'learning_rate': 5e-05, 'epoch': 1.0}                                                                                              
 10%|███████████▏                                                                                                    | 1/10 [00:13<01:58, 13.16s/it]/home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/home/yerong2/local/miniconda3/envs/mllm2/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1586: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4471, 'grad_norm': 30.07342800176501, 'learning_rate': 4.849231551964771e-05, 'epoch': 1.6}                                              
{'loss': 0.5259, 'grad_norm': 30.07342800176501, 'learning_rate': 4.415111107797445e-05, 'epoch': 2.0}                                              
{'loss': 2.0256, 'grad_norm': 30.07342800176501, 'learning_rate': 3.7500000000000003e-05, 'epoch': 3.0}                                             
{'loss': 0.1477, 'grad_norm': 78.24120420313231, 'learning_rate': 2.9341204441673266e-05, 'epoch': 3.2}                                             
{'loss': 1.2312, 'grad_norm': 78.24120420313231, 'learning_rate': 2.0658795558326743e-05, 'epoch': 4.0}                                             
{'loss': 0.7337, 'grad_norm': 33.6438670494202, 'learning_rate': 1.2500000000000006e-05, 'epoch': 4.8}                                              
{'loss': 0.1426, 'grad_norm': 33.6438670494202, 'learning_rate': 5.848888922025553e-06, 'epoch': 5.0}                                               
{'loss': 2.1172, 'grad_norm': 33.6438670494202, 'learning_rate': 1.5076844803522922e-06, 'epoch': 6.0}                                              
{'loss': 0.3183, 'grad_norm': 25.38373103524246, 'learning_rate': 0.0, 'epoch': 6.4}                                                                
{'train_runtime': 52.4924, 'train_samples_per_second': 1.905, 'train_steps_per_second': 0.191, 'train_loss': 0.9671015128493309, 'epoch': 6.4}  

YerongLi commented:

Tested with both CUDA 11.7 and CUDA 12; neither works, and the issue persists.

YerongLi commented:

Tested with a smaller learning rate; that is not the issue.


YerongLi commented Sep 13, 2024

I tested on a machine with 2x L40S and a machine with 2x A100; neither of them works.

Here is the output on the 2x A100 machine:

Loading data...
 ==== Model merged successfully from checkpoint: output/finetune_lora
LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='CAUSAL_LM', inference_mode=False, r=64, target_modules={'attention.wo', 'attention.wqkv', 'feed_forward.w3', 'feed_forward.w2', 'feed_forward.w1'}, lora_alpha=64, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})
Load 20 samples from ['data/only_text_example.json', '0.02']
Load 10 samples from ['data/single_turn_single_image_example.json', '0.01']
Load 10 samples from ['data/multi_turn_multi_images_example.json', '0.01']
init mix data at rank 0
load 20 data
load 10 data
load 10 data
10samples is loaded
True
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/accelerate/accelerator.py:447: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
  warnings.warn(
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
trainable params: 99,098,624 || all params: 11,194,824,704 || trainable%: 0.8852181844758255
init mix data at rank 1
load 20 data
load 10 data
load 10 data
10samples is loaded
True
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/accelerate/accelerator.py:447: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
  warnings.warn(
[2024-09-13 04:39:10,210] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter stage3_gather_fp16_weights_on_model_save is deprecated use gather_16bit_weights_on_model_save instead
[2024-09-13 04:39:10,210] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter stage3_gather_fp16_weights_on_model_save is deprecated use gather_16bit_weights_on_model_save instead
  0%|                                                                                    | 0/12 [00:00<?, ?it/s]Set seed 0 for rank 0
Set seed 3 for rank 1
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Could not estimate the number of tokens of the input, floating-point operations will not be computed
Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 6.1314, 'learning_rate': 5e-05, 'epoch': 1.0}                                                          
  8%|██████▎                                                                     | 1/12 [01:21<14:50, 80.98s/it]/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
{'loss': 0.4381, 'learning_rate': 4.898732434036244e-05, 'epoch': 1.6}                                          
{'loss': 0.7703, 'learning_rate': 4.6031338320779534e-05, 'epoch': 2.0}                                         
{'loss': 1.8976, 'learning_rate': 4.137151834863213e-05, 'epoch': 3.0}                                          
{'loss': 0.1462, 'learning_rate': 3.5385375325047166e-05, 'epoch': 3.2}                                         
{'loss': 1.8919, 'learning_rate': 2.8557870956832132e-05, 'epoch': 4.0}                                         
{'loss': 1.3274, 'learning_rate': 2.1442129043167874e-05, 'epoch': 4.8}                                         
{'loss': 0.1458, 'learning_rate': 1.4614624674952842e-05, 'epoch': 5.0}                                         
{'loss': 3.2888, 'learning_rate': 8.628481651367876e-06, 'epoch': 6.0}                                          
{'loss': 0.6449, 'learning_rate': 3.968661679220468e-06, 'epoch': 6.4}                                          
{'loss': 0.431, 'learning_rate': 1.0126756596375686e-06, 'epoch': 7.0}                                          
{'loss': 3.4945, 'learning_rate': 0.0, 'epoch': 8.0}                                                            
{'train_runtime': 149.6291, 'train_samples_per_second': 0.802, 'train_steps_per_second': 0.08, 'train_loss': 1.7173307587703068, 'epoch': 8.0}
100%|███████████████████████████████████████████████████████████████████████████| 12/12 [02:29<00:00, 12.46s/it]
/scratch/bbrz/local/conda/envs/mllm/lib/python3.11/site-packages/peft/utils/save_and_load.py:148: UserWarning: Could not find a config file in /scratch/bbrz/yirenl2/models//internlm-xcomposer2d5-7b - will assume that the vocabulary was not modified.
  warnings.warn(
[2024-09-13 04:43:29,332] [INFO] [launch.py:347:main] Process 3334842 exits successfully.
[2024-09-13 04:43:29,332] [INFO] [launch.py:347:main] Process 3334843 exits successfully.
