Multimodal PaliGemma training raises ValueError: Decompressed Data Too Large #6300

Closed
1 task done
CLL112 opened this issue Dec 10, 2024 · 4 comments
Labels
solved This problem has been already solved

Comments


CLL112 commented Dec 10, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-6.8.0-47-generic-x86_64-with-glibc2.35
  • Python version: 3.10.15
  • PyTorch version: 2.5.1+cu124 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 2.19.2
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 3090
  • DeepSpeed version: 0.15.3
  • Bitsandbytes version: 0.44.1

Reproduction

### model
model_name_or_path: google/paligemma-3b-mix-224

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json
freeze_vision_tower: true  # choices: [true, false]
# train_mm_proj_only: true  # choices: [true, false]
flash_attn: fa2   
enable_liger_kernel: true
use_unsloth_gc: true
eval_on_start: true

### dataset 
dataset: llava_150k_en,llava_150k_zh

template: paligemma
cutoff_len: 1024
# max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /data/llm/paligemma/sfttest-all
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 3
gradient_accumulation_steps: 1
learning_rate: 0.000003
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
use_cache: False

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

Error:

[rank1]:[W1210 16:02:10.456129427 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[INFO|2024-12-10 16:02:10] llamafactory.data.loader:157 >> Loading dataset BUAADreamer/llava-en-zh-300k...
[rank3]:[W1210 16:02:11.702959374 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank4]:[W1210 16:02:11.361638490 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4]  using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank2]:[W1210 16:02:11.540846184 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank7]:[W1210 16:02:11.546711878 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7]  using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank5]:[W1210 16:02:12.707562135 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5]  using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank6]:[W1210 16:02:12.266404866 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6]  using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 59/59 [00:02<00:00, 28.81it/s]
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 59/59 [00:00<00:00, 101669.65it/s]
Loading dataset shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:00<00:00, 4355.30it/s]
Converting format of dataset (num_proc=16):  93%|█████████████████████████████████████████████████████████████████████████████████████████████▎      | 147094/157712 [23:47<01:43, 103.07 examples/s]
[rank0]:[W1210 16:26:09.601578241 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]: multiprocess.pool.RemoteTraceback: 
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
[rank0]:     result = (True, func(*args, **kwds))
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank0]:     for i, result in enumerate(func(**kwargs)):
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3517, in _map_single
[rank0]:     example = apply_function_on_filtered_inputs(example, i, offset=offset)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
[rank0]:     processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]:   File "/home/longchen/llm/LLaMA-Factory/src/llamafactory/data/aligner.py", line 224, in convert_sharegpt
[rank0]:     "_images": convert_images(example[dataset_attr.images]) if dataset_attr.images else None,
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 273, in __getitem__
[rank0]:     value = self.format(key)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 371, in format
[rank0]:     return self.formatter.format_column(self.pa_table.select([key]))[0]
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 443, in format_column
[rank0]:     column = self.python_features_decoder.decode_column(column, pa_table.column_names[0])
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 219, in decode_column
[rank0]:     return self.features.decode_column(column, column_name) if self.features else column
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/features.py", line 1997, in decode_column
[rank0]:     [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/features.py", line 1997, in <listcomp>
[rank0]:     [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/features.py", line 1336, in decode_nested_example
[rank0]:     return decode_nested_example([schema.feature], obj)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/features.py", line 1328, in decode_nested_example
[rank0]:     if decode_nested_example(sub_schema, first_elmt) != first_elmt:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/features.py", line 1341, in decode_nested_example
[rank0]:     return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/image.py", line 187, in decode_example
[rank0]:     image = PIL.Image.open(BytesIO(bytes_))
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/Image.py", line 3477, in open
[rank0]:     im = _open_core(fp, filename, prefix, formats)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/Image.py", line 3465, in _open_core
[rank0]:     im = factory(fp, filename)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/ImageFile.py", line 138, in __init__
[rank0]:     self._open()
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 749, in _open
[rank0]:     s = self.png.call(cid, pos, length)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 198, in call
[rank0]:     return getattr(self, f"chunk_{cid.decode('ascii')}")(pos, length)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 424, in chunk_iCCP
[rank0]:     icc_profile = _safe_zlib_decompress(s[i + 2 :])
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 143, in _safe_zlib_decompress
[rank0]:     raise ValueError(msg)
[rank0]: ValueError: Decompressed Data Too Large
[rank0]: """

Expected behavior

  1. Training runs normally with the datasets llava_1k_en,llava_1k_zh, but switching to llava_150k_en,llava_150k_zh fails with: ValueError: Decompressed Data Too Large (see the scan sketch after this list).
  2. In addition, when a plain-text dataset is mixed in (dataset: identity,llava_1k_en,llava_1k_zh), training hangs and never starts.
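
Both failures point at the same code path shown in the traceback above: while opening a PNG, Pillow's PngImagePlugin.chunk_iCCP refuses to inflate an embedded ICC profile that exceeds its decompression-bomb limit, so a single oversized image anywhere in llava_150k_en/llava_150k_zh aborts preprocessing. Below is a minimal scan sketch (not part of LLaMA-Factory) for locating such images; the split name, the images column name, and its Sequence-of-Image layout are assumptions inferred from the traceback.

```python
# Hypothetical scan for PNGs that trip Pillow's iCCP decompression guard.
# The "train" split, the "images" column name, and the Sequence(Image())
# layout are assumptions, not confirmed details of this dataset.
from io import BytesIO

from datasets import Image as ImageFeature
from datasets import Sequence, load_dataset
from PIL import Image

ds = load_dataset("BUAADreamer/llava-en-zh-300k", split="train", streaming=True)
# Disable automatic decoding so one bad image cannot abort the iteration itself.
ds = ds.cast_column("images", Sequence(ImageFeature(decode=False)))

for idx, example in enumerate(ds):
    for img in example["images"]:
        if img.get("bytes") is None:
            continue  # image stored as a path rather than inline bytes
        try:
            Image.open(BytesIO(img["bytes"])).load()
        except ValueError as err:
            print(f"sample {idx}: {err}")
```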

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Dec 10, 2024

CLL112 commented Dec 11, 2024

After updating the code, training works.

@CLL112 CLL112 closed this as completed Dec 11, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Dec 11, 2024
@CLL112 CLL112 reopened this Dec 11, 2024

CLL112 commented Dec 11, 2024

There is still a problem: with max_samples: 10000 training works, but without it (using the full dataset) it still fails with ValueError: Decompressed Data Too Large. Is it because the dataset is too large?
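
Setting max_samples: 10000 most likely succeeds only because preprocessing stops before the problematic sample is reached, so the dataset size itself is probably not the cause. A workaround that is commonly suggested for this Pillow error is to raise the PNG text-chunk decompression cap before any image is decoded, e.g. near the top of the launcher script so that forked preprocessing workers inherit it. This is a hedged sketch, not an official LLaMA-Factory option:

```python
# Commonly cited workaround for "ValueError: Decompressed Data Too Large".
# Pillow's _safe_zlib_decompress inflates iCCP/zTXt/iTXt chunks with
# MAX_TEXT_CHUNK (roughly 1 MB by default) as the limit; raising it trades
# away the decompression-bomb protection, so only do this for trusted data.
from PIL import PngImagePlugin

PngImagePlugin.MAX_TEXT_CHUNK = 100 * 1024 * 1024  # 100 MB
```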


CLL112 commented Dec 11, 2024

With streaming enabled, training starts, but after a while it fails with:

{'loss': 3.0184, 'grad_norm': 4.040715386848615, 'learning_rate': 2.9999527971349995e-06, 'epoch': 0.01}                                                                                                  
{'loss': 3.0188, 'grad_norm': 4.090397107456916, 'learning_rate': 2.999948945403066e-06, 'epoch': 0.01}                                                                                                   
{'loss': 2.9131, 'grad_norm': 3.5998091314843355, 'learning_rate': 2.999944942626314e-06, 'epoch': 0.01}                                                                                                  
{'loss': 3.0172, 'grad_norm': 4.186458330567469, 'learning_rate': 2.9999407888051476e-06, 'epoch': 0.01}                                                                                                  
{'loss': 2.9712, 'grad_norm': 4.287678401200029, 'learning_rate': 2.9999364839399852e-06, 'epoch': 0.01}                                                                                                  
  1%|██                                                                                                                                                            | 129/10000 [18:29<19:15:11,  7.02s/it]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/longchen/llm/llamafact2/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/home/longchen/llm/llamafact2/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/home/longchen/llm/llamafact2/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/home/longchen/llm/llamafact2/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 100, in run_sft
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/transformers/trainer.py", line 2426, in _inner_training_loop
[rank0]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/transformers/trainer.py", line 5038, in get_batch_samples
[rank0]:     batch_samples += [next(epoch_iterator)]
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/accelerate/data_loader.py", line 561, in __iter__
[rank0]:     next_batch = next(dataloader_iter)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank0]:     data = self._next_data()
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 757, in _next_data
[rank0]:     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 33, in fetch
[rank0]:     data.append(next(self.dataset_iter))
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/accelerate/data_loader.py", line 336, in __iter__
[rank0]:     for element in self.dataset:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 2093, in __iter__
[rank0]:     for key, example in ex_iterable:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1511, in __iter__
[rank0]:     yield from islice(self.ex_iterable, ex_iterable_idx_start, None)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1454, in __iter__
[rank0]:     for x in self.ex_iterable:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1000, in __iter__
[rank0]:     yield from self._iter()
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1024, in _iter
[rank0]:     for key, example in iterator:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1683, in __iter__
[rank0]:     for key, example in self.ex_iterable:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 744, in __iter__
[rank0]:     yield from ex_iterable
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1000, in __iter__
[rank0]:     yield from self._iter()
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1080, in _iter
[rank0]:     for key, example in iterator:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1686, in __iter__
[rank0]:     _apply_feature_types_on_example(example, self.features, token_per_repo_id=self.token_per_repo_id),
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1635, in _apply_feature_types_on_example
[rank0]:     decoded_example = features.decode_example(encoded_example, token_per_repo_id=token_per_repo_id)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/features.py", line 2044, in decode_example
[rank0]:     return {
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/features.py", line 2045, in <dictcomp>
[rank0]:     column_name: decode_nested_example(feature, value, token_per_repo_id=token_per_repo_id)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/features.py", line 1400, in decode_nested_example
[rank0]:     return decode_nested_example([schema.feature], obj)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/features.py", line 1380, in decode_nested_example
[rank0]:     if decode_nested_example(sub_schema, first_elmt) != first_elmt:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/features.py", line 1405, in decode_nested_example
[rank0]:     return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/image.py", line 187, in decode_example
[rank0]:     image = PIL.Image.open(BytesIO(bytes_))
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/Image.py", line 3477, in open
[rank0]:     im = _open_core(fp, filename, prefix, formats)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/Image.py", line 3465, in _open_core
[rank0]:     im = factory(fp, filename)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/ImageFile.py", line 138, in __init__
[rank0]:     self._open()
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 749, in _open
[rank0]:     s = self.png.call(cid, pos, length)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 198, in call
[rank0]:     return getattr(self, f"chunk_{cid.decode('ascii')}")(pos, length)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 424, in chunk_iCCP
[rank0]:     icc_profile = _safe_zlib_decompress(s[i + 2 :])
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 143, in _safe_zlib_decompress
[rank0]:     raise ValueError(msg)
[rank0]: ValueError: Decompressed Data Too Large
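
The streaming run hits the same chunk_iCCP guard, just later, when the bad sample is finally streamed. An alternative hedged workaround (assuming a recent Pillow, where chunk_iCCP falls back to icc_profile = None instead of raising) is to enable truncated-image loading before the dataloader starts decoding:

```python
# Hedged alternative: tolerate the oversized iCCP chunk instead of aborting.
# The embedded ICC profile of the offending image is discarded, which is
# usually harmless for training data.
from PIL import ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = True
```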

@hiyouga hiyouga closed this as completed Dec 11, 2024