Error encountered while training qwen-2.5-3b model using Qwen2.5-Coder/finetuning/sft/train.py #171
Please provide us with a minimal example to reproduce the error: training data (a small set is ok), the binarize script, and the training script.
Training data:

```json
{
"messages": [
{"role": "user", "content": "CREATE TABLE \"stadium\" (\n\"Stadium_ID\" int,\n\"Location\" text,\n\"Name\" text,\n\"Capacity\" int,\n\"Highest\" int,\n\"Lowest\" int,\n\"Average\" int,\nPRIMARY KEY (\"Stadium_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM stadium LIMIT 3;\nStadium_ID Location Name Capacity Highest Lowest Average\n1 Raith Rovers Stark's Park 10104 4812 1294 2106\n2 Ayr United Somerset Park 11998 2363 1057 1477\n3 East Fife Bayview Stadium 2000 1980 533 864\n*/\n\nCREATE TABLE \"singer\" (\n\"Singer_ID\" int,\n\"Name\" text,\n\"Country\" text,\n\"Song_Name\" text,\n\"Song_release_year\" text,\n\"Age\" int,\n\"Is_male\" bool,\nPRIMARY KEY (\"Singer_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM singer LIMIT 3;\nSinger_ID Name Country Song_Name Song_release_year Age Is_male\n1 Joe Sharp Netherlands You 1992 52 F\n2 Timbaland United States Dangerous 2008 32 T\n3 Justin Brown France Hey Oh 2013 29 T\n*/\n\nCREATE TABLE \"concert\" (\n\"concert_ID\" int,\n\"concert_Name\" text,\n\"Theme\" text,\n\"Stadium_ID\" text,\n\"Year\" text,\nPRIMARY KEY (\"concert_ID\"),\nFOREIGN KEY (\"Stadium_ID\") REFERENCES \"stadium\"(\"Stadium_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM concert LIMIT 3;\nconcert_ID concert_Name Theme Stadium_ID Year\n1 Auditions Free choice 1 2014\n2 Super bootcamp Free choice 2 2 2014\n3 Home Visits Bleeding Love 2 2015\n*/\n\nCREATE TABLE \"singer_in_concert\" (\n\"concert_ID\" int,\n\"Singer_ID\" text,\nPRIMARY KEY (\"concert_ID\",\"Singer_ID\"),\nFOREIGN KEY (\"concert_ID\") REFERENCES \"concert\"(\"concert_ID\"),\nFOREIGN KEY (\"Singer_ID\") REFERENCES \"singer\"(\"Singer_ID\")\n);\n/*\n3 example rows:\nSELECT * FROM singer_in_concert LIMIT 3;\nconcert_ID Singer_ID\n1 2\n1 3\n1 5\n*/\n\n-- Using valid SQLite, answer the following questions for the tables provided above.\nQuestion: How many singers do we have?\n"},
{"role": "assistant", "content": "SELECT count(*) FROM singer"}
],
"format": "chatml"
}
```

Here is my binarize script:

```bash
export PATH=/path/to/miniconda3/envs/qwen/bin:$PATH
# cd ./finetuning/sft/;
INPUT_PATH=${1}
OUTPUT_PATH=${2}
TOKENIZER_PATH=${3}
INPUT_PATH=${INPUT_PATH:-"./raw_data/sft.jsonl"}
OUTPUT_PATH=${OUTPUT_PATH:-"./processed/sft.jsonl"}
TOKENIZER_PATH=${TOKENIZER_PATH:-"/home/yhw/text_to_SQL/model/Qwen2_5_coder_3B"}
python binarize_data.py -input_path ${INPUT_PATH} -output_path ${OUTPUT_PATH} -workers 64 -tokenizer_path ${TOKENIZER_PATH}
```

and my training script:

```bash
export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=enp0s31f6
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
export NCCL_IB_TIMEOUT=22
export NCCL_IB_QPS_PER_CONNECTION=8
export CUDA_VISIBLE_DEVICES=0
export CUDA_LAUNCH_BLOCKING=1
export NCCL_NET_PLUGIN=none
export TORCHELASTIC_ERROR_FILE=error.json
export PATH=/home/yhw/miniconda3/envs/sft_env/bin:$PATH;
DATA_PATH=${1}
PRETRAINED_MODEL=${2}
OUTPUT_DIR=${3}
DATA_PATH=${DATA_PATH:-"./processed/sft.jsonl"}
PRETRAINED_MODEL=${PRETRAINED_MODEL:-"/home/yhw/text_to_SQL/model/Qwen2_5_coder_3B"}
# Hyperparameters must be defined before OUTPUT_DIR, otherwise the
# interpolated values below expand to empty strings.
BATCH_SIZE=1024
MICRO_BATCH_SIZE=4
LR=5e-5
MIN_LR=5e-6
WARMUP_STEPS=100
WEIGHT_DECAY=0.0
MAX_LENGTH=1280
OUTPUT_DIR=${OUTPUT_DIR:-"./checkpoints/lr${LR}-wr${WARMUP_STEPS}-wd${WEIGHT_DECAY}-bsz${BATCH_SIZE}-maxlen${MAX_LENGTH}/"}
GPUS_PER_NODE=$(python -c "import torch; print(torch.cuda.device_count());")
MASTER_ADDR=${MASTER_ADDR:-localhost}
NNODES=${WORLD_SIZE:-1}
NODE_RANK=${RANK:-0}
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
MASTER_PORT=${MASTER_PORT:-6105}
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
DEEPSPEED_CONFIG="./configs/default_offload_opt_param.json"
GRAD_ACCU=$(($BATCH_SIZE / $WORLD_SIZE / $MICRO_BATCH_SIZE))
echo $OUTPUT_DIR
echo "Pretrained Model" ${PRETRAINED_MODEL}
echo "WORLD_SIZE" $WORLD_SIZE "MICRO BATCH SIZE" $MICRO_BATCH_SIZE "GRAD_ACCU" $GRAD_ACCU
echo $DISTRIBUTED_ARGS
# cd ROOT_PATH="/path/to/sft/";
torchrun ${DISTRIBUTED_ARGS} train.py \
--model_name_or_path ${PRETRAINED_MODEL} \
--data_path $DATA_PATH \
--model_max_length ${MAX_LENGTH} \
--output_dir ${OUTPUT_DIR} \
--num_train_epochs 3 \
--per_device_train_batch_size ${MICRO_BATCH_SIZE} \
--gradient_accumulation_steps ${GRAD_ACCU} \
--per_device_eval_batch_size 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 100 \
--save_total_limit 100 \
--learning_rate ${LR} \
--weight_decay ${WEIGHT_DECAY} \
--warmup_steps ${WARMUP_STEPS} \
--lr_scheduler_type "cosine" \
--logging_strategy "steps" \
--logging_steps 1 \
--deepspeed ${DEEPSPEED_CONFIG} \
--report_to "tensorboard" \
--bf16 True \
--tf32 True \
    --truncate_source False
```

I made some modifications to ensure that both scripts can run on my server.
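For reference, with `CUDA_VISIBLE_DEVICES=0` the launcher sees a single GPU, so `WORLD_SIZE=1` and the accumulation arithmetic in the script resolves as follows (a minimal sketch that just mirrors the shell arithmetic above, not part of the original scripts):

```python
# Mirrors GRAD_ACCU=$(($BATCH_SIZE / $WORLD_SIZE / $MICRO_BATCH_SIZE)).
batch_size = 1024       # BATCH_SIZE: global batch per optimizer step
world_size = 1 * 1      # GPUS_PER_NODE * NNODES with one visible GPU
micro_batch_size = 4    # MICRO_BATCH_SIZE: per-device batch

grad_accu = batch_size // world_size // micro_batch_size
assert micro_batch_size * world_size * grad_accu == batch_size
print(grad_accu)  # 256 accumulation steps per optimizer update
```

So each optimizer step still sees 1024 samples on the single GPU; on 8 GPUs the same script would compute `GRAD_ACCU=32`.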
Does the same problem also occur with other model sizes (e.g. Qwen2.5-Coder-1.5B)?
I'm facing the same issue. Have you managed to resolve it? @Yhw109 @CSJianYang
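One mismatch worth ruling out for this class of error is the tokenizer length versus the model's embedding table. A minimal check, assuming the local model path from the scripts above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/home/yhw/text_to_SQL/model/Qwen2_5_coder_3B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

embed_rows = model.get_input_embeddings().weight.shape[0]
print("tokenizer length:", len(tokenizer))
print("embedding rows:  ", embed_rows)

# Ids at or above embed_rows index past the embedding table and typically
# surface as CUDA device-side assert / out-of-range errors during training.
if len(tokenizer) > embed_rows:
    model.resize_token_embeddings(len(tokenizer))
```

For Qwen2.5 checkpoints the embedding table is usually padded beyond the tokenizer length, so a mismatch here would point at a modified tokenizer or checkpoint rather than the training code.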
Hello,
When I run the code and scripts in Qwen2.5-Coder/finetuning/sft/train.py to train the qwen-2.5-3b model, I encounter an error; it seems to be a token-range issue. Could you please advise me on how to resolve this problem? My library versions match the requirements.
Thank you!
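Since the failure looks like an out-of-range token id, scanning the binarized file can confirm it before launching training. A minimal sketch, assuming `binarize_data.py` writes JSONL with an `input_ids` field per record (the field name is an assumption; adjust it to whatever the script actually emits):

```python
import json

from transformers import AutoConfig

MODEL_PATH = "/home/yhw/text_to_SQL/model/Qwen2_5_coder_3B"
DATA_PATH = "./processed/sft.jsonl"

# The model rejects any id outside [0, vocab_size).
vocab_size = AutoConfig.from_pretrained(MODEL_PATH).vocab_size

with open(DATA_PATH) as f:
    for lineno, line in enumerate(f, 1):
        ids = json.loads(line).get("input_ids", [])
        bad = [t for t in ids if not 0 <= t < vocab_size]
        if bad:
            print(f"line {lineno}: out-of-range ids {bad[:5]} (vocab_size={vocab_size})")
```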