Reminder
I have read the README and searched the existing issues.
System Info
[2024-12-10 12:01:29,169] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
llamafactory version: 0.8.4.dev0

Reproduction
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 HF_ENDPOINT=https://hf-mirror.com llamafactory-cli train examples/curriculum/phi3_dpo_curriculum.yaml
Expected behavior
The data should be loaded progressively (streaming), step by step.
Others
Config file:
### model
model_name_or_path: LordNoah/phi3-sft
### method
stage: dpo
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json
### dataset
dataset: ultrafeedback_curriculum
template: phi
cutoff_len: 1024
max_steps: 256
overwrite_cache: true
preprocessing_num_workers: 1
streaming: true
buffer_size: 128
### output
output_dir: saves/phi3-curriculum/curriculum
logging_steps: 5
save_strategy: "no"
plot_loss: true
overwrite_output_dir: true
save_only_model: true
### train
per_device_train_batch_size: 2 #4
gradient_accumulation_steps: 2 #4
learning_rate: 2.0e-7
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.05 #0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: False
### eval
val_size: 64
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 5000
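For context on the streaming settings above: with streaming: true the dataset is loaded lazily as an IterableDataset (buffer_size controls the shuffle buffer), which is exactly the case the RuntimeError below refers to, and also why the trainer reports the huge "Num Epochs" value. A minimal sketch of the idea, not LLaMA-Factory's actual loading code, with a hypothetical data file path:

```python
# Sketch only: roughly what `streaming: true` implies for the dataset object.
# The file path below is hypothetical, not the real ultrafeedback_curriculum source.
from datasets import load_dataset

ds = load_dataset("json", data_files="data/ultrafeedback_curriculum.json", streaming=True)
train_ds = ds["train"]
print(type(train_ds).__name__)  # IterableDataset: no __len__, so the Trainer runs by max_steps
```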
### Error message
[INFO|trainer.py:2134] 2024-12-10 11:57:08,144 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-12-10 11:57:08,144 >> Num examples = 8,192
[INFO|trainer.py:2136] 2024-12-10 11:57:08,144 >> Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:2137] 2024-12-10 11:57:08,144 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2140] 2024-12-10 11:57:08,144 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2141] 2024-12-10 11:57:08,144 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2142] 2024-12-10 11:57:08,144 >> Total optimization steps = 256
[INFO|trainer.py:2143] 2024-12-10 11:57:08,145 >> Number of trainable parameters = 3,821,079,552
0%| | 0/256 [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/data_loader.py", line 633, in _fetch_batches
[rank0]: batch = concatenate(batches, dim=0)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/utils/operations.py", line 620, in concatenate
[rank0]: return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/utils/operations.py", line 620, in <dictcomp>
[rank0]: return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/utils/operations.py", line 623, in concatenate
[rank0]: return torch.cat(data, dim=dim)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1024 but got size 576 for tensor number 1 in the list.
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/projects/SP-STO/src/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/root/projects/SP-STO/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/root/projects/SP-STO/src/llamafactory/train/tuner.py", line 56, in run_exp
[rank0]: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]: File "/root/projects/SP-STO/src/llamafactory/train/dpo/workflow.py", line 88, in run_dpo
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/transformers/trainer.py", line 2236, in _inner_training_loop
[rank0]: for step, inputs in enumerate(epoch_iterator):
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/data_loader.py", line 677, in __iter__
[rank0]: next_batch, next_batch_info = self._fetch_batches(main_iterator)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/sto/lib/python3.11/site-packages/accelerate/data_loader.py", line 635, in _fetch_batches
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: You can't use batches of different size with dispatch_batches=True or when using an IterableDataset. Either pass dispatch_batches=False and have each process fetch its own batch, or pass split_batches=True. By doing so, the main process will fetch a full batch and slice it into num_processes batches for each process.
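The underlying failure is straightforward to reproduce in isolation: with batch dispatching enabled, the main process concatenates the per-process batches with torch.cat, and two sub-batches that were apparently padded to different sequence lengths (1024 vs. 576 here, i.e. padded per batch rather than always to cutoff_len) cannot be concatenated along dim 0. A minimal illustration of the same error:

```python
import torch

# Two sub-batches padded to different sequence lengths, as in the traceback above
batch_a = torch.zeros(2, 1024, dtype=torch.long)  # padded to 1024 tokens
batch_b = torch.zeros(2, 576, dtype=torch.long)   # padded to 576 tokens

# torch.cat only allows the concatenation dimension (dim=0) to differ
torch.cat([batch_a, batch_b], dim=0)
# RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1024 but got size 576 ...
```

The two workarounds named in the error message are Accelerate dataloader options. The sketch below is not a LLaMA-Factory-specific fix, only an illustration of what those options mean at the Accelerate level, assuming an accelerate version that exposes DataLoaderConfiguration:

```python
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Option 1: each process fetches its own batch instead of the main process dispatching them
dl_config = DataLoaderConfiguration(dispatch_batches=False)

# Option 2 (alternative): keep dispatching, but fetch one full batch on the main process
# and slice it into num_processes pieces
# dl_config = DataLoaderConfiguration(split_batches=True)

accelerator = Accelerator(dataloader_config=dl_config)
```

With the transformers Trainer, recent versions expose the same switches through the accelerator_config training argument; another possible workaround would be padding every example to a fixed length (cutoff_len) so all sub-batches share the same shape.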