GPU selection has no effect during multi-GPU SFT of chatglm4-9b #4678

Synco9 · 2024-07-04T07:45:46Z


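While using llamafactory recently, I found that starting training with the command CUDA_VISIBLE_DEVICES=0 FORCE_TORCHRUN=1 NNODES=2 RANK=0 llamafactory-cli train examples/train_lora/glm4-9b-chat-sft.yaml uses every GPU. The CUDA_VISIBLE_DEVICES setting has no effect: even after a specific GPU is selected, training still runs on the other GPUs, which leads to OOM.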
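To narrow this down, it may help to print what each node's shell actually exports before launching. The following is a minimal, standard-library-only sketch; `parse_visible_devices` and `describe_visibility` are hypothetical helper names, not part of llamafactory. It only reports the environment variable and does not query the driver.

```python
import os
import socket


def parse_visible_devices(value):
    """Return the device entries listed in a CUDA_VISIBLE_DEVICES value.

    None means the variable is unset, so the process may use every GPU.
    Entries are kept as strings because they can be indices or GPU UUIDs.
    """
    if value is None:
        return None
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def describe_visibility(env=None):
    """Build a one-line report for the current host (hypothetical helper)."""
    env = os.environ if env is None else env
    devices = parse_visible_devices(env.get("CUDA_VISIBLE_DEVICES"))
    host = socket.gethostname()
    if devices is None:
        return f"{host}: CUDA_VISIBLE_DEVICES is unset, all GPUs are visible"
    return f"{host}: restricted to {len(devices)} device(s): {devices}"


if __name__ == "__main__":
    print(describe_visibility())
```

Running it on both machines with the same environment prefix as the training command shows whether the variable is set on the node where the extra GPUs are being used.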

Hardware

3080Ti * 4 and 3090 * 4

method

model_name_or_path: /wspace/aigc/weights/THUDM/chatglm4-9b
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json # the model is loaded on multiple GPUs whether I use z0, z2, z3, or no deepspeed at all

dataset

dataset: webqa,webnovel
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

output

output_dir: saves/glm4-9b-chat/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

train

per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

eval

val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

hiyouga · 2024-07-04T09:03:17Z

Maintainer

Can you select specific GPUs when training on a single machine with multiple GPUs?


Synco9 Jul 4, 2024
Author

Yes, that works. The problem appears with multi-node multi-GPU training.
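One possible explanation, not confirmed in this thread, is that environment variables apply only to the machine where they are exported. In that case each node would need its own CUDA_VISIBLE_DEVICES and its own RANK. The sketch below only renders the per-node launch lines and does not start training. `node_command` is a hypothetical helper, the one-GPU-per-node layout is an assumption, and rendezvous settings are left out because the original post does not show them.

```python
import shlex

CONFIG = "examples/train_lora/glm4-9b-chat-sft.yaml"  # path from the original post


def node_command(rank, gpus, nnodes=2, config=CONFIG):
    """Render the launch line for one node (hypothetical helper).

    gpus: GPU indices to expose on that node, e.g. [0].
    """
    env = {
        "CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus),
        "FORCE_TORCHRUN": "1",
        "NNODES": str(nnodes),
        "RANK": str(rank),
    }
    prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
    return f"{prefix} llamafactory-cli train {shlex.quote(config)}"


if __name__ == "__main__":
    # Assumed layout: expose only GPU 0 on each of the two nodes.
    for rank in range(2):
        print(f"# run on node {rank}")
        print(node_command(rank, gpus=[0]))
```

For rank 0 this reproduces the command from the first post. The rank 1 line differs only in RANK and has to be run on the second machine itself.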
