Reminder
I have read the README and searched the existing issues.
System Info
[2024-12-10 12:01:29,169] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
llamafactory version: 0.8.4.dev0

Reproduction
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 HF_ENDPOINT=https://hf-mirror.com llamafactory-cli train examples/curriculum/phi3_dpo_curriculum.yaml
Expected behavior
The data should be loaded progressively (streaming), step by step.
Others
Config file:
### model
model_name_or_path: LordNoah/phi3-sft
### method
stage: dpo
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json
### dataset
dataset: ultrafeedback_curriculum
template: phi
cutoff_len: 1024
max_steps: 256
overwrite_cache: true
preprocessing_num_workers: 1
streaming: true
buffer_size: 128
### output
output_dir: saves/phi3-curriculum/curriculum
logging_steps: 5
save_strategy: "no"
plot_loss: true
overwrite_output_dir: true
save_only_model: true
### train
per_device_train_batch_size: 2 #4
gradient_accumulation_steps: 2 #4
learning_rate: 2.0e-7
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.05 #0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: False
### eval
val_size: 64
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 5000
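For context on the streaming settings above: with streaming: true the dataset is loaded lazily as an IterableDataset (buffer_size controls the shuffle buffer), which is exactly the case the RuntimeError below refers to, and also why the trainer reports the huge "Num Epochs" value. A minimal sketch of the idea, not LLaMA-Factory's actual loading code, with a hypothetical data file path:

```python
# Sketch only: roughly what `streaming: true` implies for the dataset object.
# The file path below is hypothetical, not the real ultrafeedback_curriculum source.
from datasets import load_dataset

ds = load_dataset("json", data_files="data/ultrafeedback_curriculum.json", streaming=True)
train_ds = ds["train"]
print(type(train_ds).__name__)  # IterableDataset: no __len__, so the Trainer runs by max_steps
```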
### Error message
[INFO|trainer.py:2134] 2024-12-10 11:57:08,144 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-12-10 11:57:08,144 >> Num examples = 8,192
[INFO|trainer.py:2136] 2024-12-10 11:57:08,144 >> Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:2137] 2024-12-10 11:57:08,144 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2140] 2024-12-10 11:57:08,144 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2141] 2024-12-10 11:57:08,144 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2142] 2024-12-10 11:57:08,144 >> Total optimization steps = 256
[INFO|trainer.py:2143] 2024-12-10 11:57:08,145 >> Number of trainable parameters = 3,821,079,552
0%| | 0/256 [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/data_loader.py", line 633, in _fetch_batches
[rank0]: batch = concatenate(batches, dim=0)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/utils/operations.py", line 620, in concatenate
[rank0]: return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/utils/operations.py", line 620, in <dictcomp>
[rank0]: return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/utils/operations.py", line 623, in concatenate
[rank0]: return torch.cat(data, dim=dim)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1024 but got size 576 for tensor number 1 in the list.
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/projects/SP-STO/src/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/root/projects/SP-STO/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/root/projects/SP-STO/src/llamafactory/train/tuner.py", line 56, in run_exp
[rank0]: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]: File "/root/projects/SP-STO/src/llamafactory/train/dpo/workflow.py", line 88, in run_dpo
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/transformers/trainer.py", line 2236, in _inner_training_loop
[rank0]: for step, inputs in enumerate(epoch_iterator):
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/data_loader.py", line 677, in __iter__
[rank0]: next_batch, next_batch_info = self._fetch_batches(main_iterator)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/data_loader.py", line 635, in _fetch_batches
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: You can't use batches of different size with dispatch_batches=True or when using an IterableDataset. Either pass dispatch_batches=False and have each process fetch its own batch, or pass split_batches=True. By doing so, the main process will fetch a full batch and slice it into num_processes batches for each process.
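The underlying failure is straightforward to reproduce in isolation: with batch dispatching enabled, the main process concatenates the per-process batches with torch.cat, and two sub-batches that were apparently padded to different sequence lengths (1024 vs. 576 here, i.e. padded per batch rather than always to cutoff_len) cannot be concatenated along dim 0. A minimal illustration of the same error:

```python
import torch

# Two sub-batches padded to different sequence lengths, as in the traceback above
batch_a = torch.zeros(2, 1024, dtype=torch.long)  # padded to 1024 tokens
batch_b = torch.zeros(2, 576, dtype=torch.long)   # padded to 576 tokens

# torch.cat only allows the concatenation dimension (dim=0) to differ
torch.cat([batch_a, batch_b], dim=0)
# RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1024 but got size 576 ...
```

The two workarounds named in the error message are Accelerate dataloader options. The sketch below is not a LLaMA-Factory-specific fix, only an illustration of what those options mean at the Accelerate level, assuming an accelerate version that exposes DataLoaderConfiguration:

```python
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Option 1: each process fetches its own batch instead of the main process dispatching them
dl_config = DataLoaderConfiguration(dispatch_batches=False)

# Option 2 (alternative): keep dispatching, but fetch one full batch on the main process
# and slice it into num_processes pieces
# dl_config = DataLoaderConfiguration(split_batches=True)

accelerator = Accelerator(dataloader_config=dl_config)
```

With the transformers Trainer, recent versions expose the same switches through the accelerator_config training argument; another possible workaround would be padding every example to a fixed length (cutoff_len) so all sub-batches share the same shape.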