max_steps: 256, streaming: true, buffer_size: 128 raises an error #6302

Closed
1 task done
StarDewXXX opened this issue Dec 10, 2024 · 3 comments
Labels: solved (This problem has been already solved)

StarDewXXX commented Dec 10, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

[2024-12-10 12:01:29,169] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible

  • llamafactory version: 0.8.4.dev0
  • Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
  • Python version: 3.11.9
  • PyTorch version: 2.3.1+cu118 (GPU)
  • Transformers version: 4.43.1
  • Datasets version: 2.21.0
  • Accelerate version: 0.33.0
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A100-SXM4-40GB
  • DeepSpeed version: 0.14.4

Reproduction

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 HF_ENDPOINT=https://hf-mirror.com llamafactory-cli train examples/curriculum/phi3_dpo_curriculum.yaml

Expected behavior

The data is expected to be loaded incrementally (streamed) during training.

Others

Config file:

### model

model_name_or_path: LordNoah/phi3-sft

### method

stage: dpo
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset

dataset: ultrafeedback_curriculum
template: phi
cutoff_len: 1024
max_steps: 256
overwrite_cache: true
preprocessing_num_workers: 1
streaming: true
buffer_size: 128

### output

output_dir: saves/phi3-curriculum/curriculum
logging_steps: 5
save_strategy: "no"
plot_loss: true
overwrite_output_dir: true
save_only_model: true

### train

per_device_train_batch_size: 2 #4
gradient_accumulation_steps: 2 #4
learning_rate: 2.0e-7
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.05 #0.1q
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: False

### eval

val_size: 64
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 5000

### Error message
[INFO|trainer.py:2134] 2024-12-10 11:57:08,144 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-12-10 11:57:08,144 >> Num examples = 8,192
[INFO|trainer.py:2136] 2024-12-10 11:57:08,144 >> Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:2137] 2024-12-10 11:57:08,144 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2140] 2024-12-10 11:57:08,144 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2141] 2024-12-10 11:57:08,144 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2142] 2024-12-10 11:57:08,144 >> Total optimization steps = 256
[INFO|trainer.py:2143] 2024-12-10 11:57:08,145 >> Number of trainable parameters = 3,821,079,552
0%| | 0/256 [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/data_loader.py", line 633, in _fetch_batches
[rank0]: batch = concatenate(batches, dim=0)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/utils/operations.py", line 620, in concatenate
[rank0]: return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/utils/operations.py", line 620, in
[rank0]: return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/utils/operations.py", line 623, in concatenate
[rank0]: return torch.cat(data, dim=dim)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1024 but got size 576 for tensor number 1 in the list.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]: File "/root/projects/SP-STO/src/llamafactory/launcher.py", line 23, in
[rank0]: launch()
[rank0]: File "/root/projects/SP-STO/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/root/projects/SP-STO/src/llamafactory/train/tuner.py", line 56, in run_exp
[rank0]: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]: File "/root/projects/SP-STO/src/llamafactory/train/dpo/workflow.py", line 88, in run_dpo
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/transformers/trainer.py", line 2236, in _inner_training_loop
[rank0]: for step, inputs in enumerate(epoch_iterator):
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/data_loader.py", line 677, in iter
[rank0]: next_batch, next_batch_info = self._fetch_batches(main_iterator)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/data_loader.py", line 635, in _fetch_batches
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: You can't use batches of different size with dispatch_batches=True or when using an IterableDataset.either pass dispatch_batches=False and have each process fetch its own batch or pass split_batches=True. By doing so, the main process will fetch a full batch and slice it into num_processes batches for each process.
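
For context: with streaming: true the dataset becomes an IterableDataset, and Accelerate's default dispatch_batches path has the main process concatenate the per-process batches before slicing them back out. Because each streamed batch is padded to its own longest sequence, that concatenation fails. A minimal sketch of the failure (the 1024 vs 576 sizes come from the traceback above; the rest is illustrative):

```python
import torch

# Two batches padded independently, e.g. one padded to 1024 tokens and the
# other to 576, as reported in the RuntimeError above.
batch_a = {"input_ids": torch.zeros(2, 1024, dtype=torch.long)}
batch_b = {"input_ids": torch.zeros(2, 576, dtype=torch.long)}

try:
    # This mirrors accelerate/utils/operations.py::concatenate, which calls
    # torch.cat(..., dim=0) on each key of the batch dict.
    torch.cat([batch_a["input_ids"], batch_b["input_ids"]], dim=0)
except RuntimeError as err:
    print(err)  # Sizes of tensors must match except in dimension 0 ...
```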

github-actions bot added the pending label (This problem is yet to be addressed) on Dec 10, 2024
hiyouga (Owner) commented Dec 10, 2024

buffer_size: 128
preprocessing_batch_size: 128
streaming: true
accelerator_config:
  dispatch_batches: false
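
The accelerator_config block maps onto the accelerator_config argument of transformers.TrainingArguments (available since v4.38), so the same fix applies when driving the HF Trainer directly instead of llamafactory-cli. A minimal sketch, reusing illustrative values from the config above:

```python
from transformers import TrainingArguments

# Equivalent of the YAML fix when constructing TrainingArguments in Python.
# With dispatch_batches=False each process fetches its own batches, so the
# main process never has to concatenate differently padded streamed batches.
args = TrainingArguments(
    output_dir="saves/phi3-curriculum/curriculum",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    max_steps=256,
    bf16=True,
    accelerator_config={"dispatch_batches": False},
)
```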

hiyouga closed this as completed on Dec 10, 2024
hiyouga added the solved label (This problem has been already solved) and removed the pending label on Dec 10, 2024
StarDewXXX (Author) commented Dec 10, 2024

buffer_size: 128
preprocessing_batch_size: 128
streaming: true
accelerator_config:
  dispatch_batches: false

Thanks, it runs now. But is there a way to set the number of epochs? Even though the config file sets num_train_epochs=3, training currently stops after a single epoch.

hiyouga (Owner) commented Dec 10, 2024

It should be possible.
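
A likely explanation, not confirmed in this thread: with streaming: true the dataset is an IterableDataset of unknown length, so the Trainer ignores num_train_epochs (hence the "Num Epochs = 9,223,372,036,854,775,807" sentinel in the log above) and stops at max_steps. If the streamed dataset is made to repeat, roughly N passes can be approximated by scaling max_steps; a back-of-envelope estimate using the numbers reported in the training log:

```python
# Rough estimate of max_steps for ~3 passes over the streamed data,
# using the figures from the "Running training" log above.
num_examples = 8192          # "Num examples" from the log
effective_batch = 2 * 2 * 8  # per_device_batch * grad_accum * 8 GPUs = 32
steps_per_pass = num_examples // effective_batch   # 256, matching the log
print(steps_per_pass * 3)    # ~768 steps for roughly three passes
```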
