Reminder
System Info
Training a Qwen model with DPO on Ascend machines reports out-of-memory on the device. With certain workarounds the training can be made to run, but they feel fragile, so I am looking for a better solution and would appreciate some help.
Settings:
8 Ascend nodes, 64 NPUs, 64 GB device memory per card
seq_length 4096
micro_batch_size_per_gpu 1
train_batch_size 64
gradient_accumulation_step 1
With this configuration, training runs out of device memory partway through.
After reducing seq_length to 2048, memory is sufficient and training completes.
I originally wanted to set train_batch_size=32 to reduce memory usage, but that raises the error: train_batch_size != micro_batch_size_per_gpu * gradient_accumulation_step * world_size.
My question: is there a way to run normally with train_batch_size set to 32, for example by configuring parameters such as PP or TP, so that the batch size can be set more flexibly?
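For reference, here is a minimal sketch of the batch-size arithmetic behind the error, assuming a Megatron-style setup where the data-parallel size equals world_size divided by (TP × PP). The function names below are illustrative, and the exact relation between TP/PP and data parallelism in this framework is an assumption, not taken from its documentation:

```python
# Minimal sketch of the global-batch-size constraint (assumed Megatron-style):
#   train_batch_size = micro_batch_size_per_gpu * gradient_accumulation_step * dp_size
#   dp_size          = world_size / (tensor_parallel * pipeline_parallel)
# Names here are illustrative, not the framework's exact option names.

def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    assert world_size % (tp * pp) == 0, "world_size must be divisible by tp * pp"
    return world_size // (tp * pp)

def global_batch_size(micro_batch: int, grad_accum: int, dp_size: int) -> int:
    return micro_batch * grad_accum * dp_size

world_size = 64  # 8 nodes x 8 NPUs

# Current setup: pure data parallelism (tp=1, pp=1)
dp = data_parallel_size(world_size, tp=1, pp=1)   # 64
print(global_batch_size(1, 1, dp))                # 64 -> train_batch_size=32 is rejected

# With tensor parallel 2 (or pipeline parallel 2), dp shrinks to 32
dp = data_parallel_size(world_size, tp=2, pp=1)   # 32
print(global_batch_size(1, 1, dp))                # 32 -> constraint satisfied
```

Under that assumption, setting TP=2 (or PP=2) would shrink the data-parallel size to 32, so train_batch_size=32 could satisfy the check without changing micro_batch_size_per_gpu or gradient_accumulation_step.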
Reproduction
Launch command: torchrun --nnodes=$NNODES --node_rank=$NODE_RANK --nproc_per_node=$NGPUS_PER_NODE --master_addr $MASTER_ADDR --master_port $MASTER_PORT -m train example/train_full/llama_8B_full_train.yaml
Expected behavior
Not applicable
Others
Not applicable