Multimodal PaliGemma training raises ValueError: Decompressed Data Too Large #6300

Closed
1 task done
CLL112 opened this issue Dec 10, 2024 · 4 comments
Labels
solved This problem has been already solved

Comments


CLL112 commented Dec 10, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-6.8.0-47-generic-x86_64-with-glibc2.35
  • Python version: 3.10.15
  • PyTorch version: 2.5.1+cu124 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 2.19.2
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 3090
  • DeepSpeed version: 0.15.3
  • Bitsandbytes version: 0.44.1

Reproduction

### model
model_name_or_path: google/paligemma-3b-mix-224

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json
freeze_vision_tower: true  # choices: [true, false]
# train_mm_proj_only: true  # choices: [true, false]
flash_attn: fa2   
enable_liger_kernel: true
use_unsloth_gc: true
eval_on_start: true

### dataset 
dataset: llava_150k_en,llava_150k_zh

template: paligemma
cutoff_len: 1024
# max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /data/llm/paligemma/sfttest-all
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 3
gradient_accumulation_steps: 1
learning_rate: 0.000003
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
use_cache: False

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

Error:

[rank1]:[W1210 16:02:10.456129427 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[INFO|2024-12-10 16:02:10] llamafactory.data.loader:157 >> Loading dataset BUAADreamer/llava-en-zh-300k...
[rank3]:[W1210 16:02:11.702959374 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank4]:[W1210 16:02:11.361638490 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4]  using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank2]:[W1210 16:02:11.540846184 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank7]:[W1210 16:02:11.546711878 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7]  using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank5]:[W1210 16:02:12.707562135 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5]  using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank6]:[W1210 16:02:12.266404866 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6]  using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 59/59 [00:02<00:00, 28.81it/s]
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 59/59 [00:00<00:00, 101669.65it/s]
Loading dataset shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:00<00:00, 4355.30it/s]
Converting format of dataset (num_proc=16):  93%|█████████████████████████████████████████████████████████████████████████████████████████████▎      | 147094/157712 [23:47<01:43, 103.07 examples/s]
[rank0]:[W1210 16:26:09.601578241 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]: multiprocess.pool.RemoteTraceback: 
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
[rank0]:     result = (True, func(*args, **kwds))
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank0]:     for i, result in enumerate(func(**kwargs)):
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3517, in _map_single
[rank0]:     example = apply_function_on_filtered_inputs(example, i, offset=offset)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
[rank0]:     processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]:   File "/home/longchen/llm/LLaMA-Factory/src/llamafactory/data/aligner.py", line 224, in convert_sharegpt
[rank0]:     "_images": convert_images(example[dataset_attr.images]) if dataset_attr.images else None,
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 273, in __getitem__
[rank0]:     value = self.format(key)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 371, in format
[rank0]:     return self.formatter.format_column(self.pa_table.select([key]))[0]
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 443, in format_column
[rank0]:     column = self.python_features_decoder.decode_column(column, pa_table.column_names[0])
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 219, in decode_column
[rank0]:     return self.features.decode_column(column, column_name) if self.features else column
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/features.py", line 1997, in decode_column
[rank0]:     [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/features.py", line 1997, in <listcomp>
[rank0]:     [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/features.py", line 1336, in decode_nested_example
[rank0]:     return decode_nested_example([schema.feature], obj)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/features.py", line 1328, in decode_nested_example
[rank0]:     if decode_nested_example(sub_schema, first_elmt) != first_elmt:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/features.py", line 1341, in decode_nested_example
[rank0]:     return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/datasets/features/image.py", line 187, in decode_example
[rank0]:     image = PIL.Image.open(BytesIO(bytes_))
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/Image.py", line 3477, in open
[rank0]:     im = _open_core(fp, filename, prefix, formats)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/Image.py", line 3465, in _open_core
[rank0]:     im = factory(fp, filename)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/ImageFile.py", line 138, in __init__
[rank0]:     self._open()
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 749, in _open
[rank0]:     s = self.png.call(cid, pos, length)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 198, in call
[rank0]:     return getattr(self, f"chunk_{cid.decode('ascii')}")(pos, length)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 424, in chunk_iCCP
[rank0]:     icc_profile = _safe_zlib_decompress(s[i + 2 :])
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 143, in _safe_zlib_decompress
[rank0]:     raise ValueError(msg)
[rank0]: ValueError: Decompressed Data Too Large
[rank0]: """

Expected behavior

  1. Training runs normally with the datasets llava_1k_en,llava_1k_zh, but switching to llava_150k_en,llava_150k_zh fails with: ValueError: Decompressed Data Too Large (see the scan sketch after this list).
  2. In addition, when a plain-text dataset is mixed in (dataset: identity,llava_1k_en,llava_1k_zh), training hangs and never starts.
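
Both failures point at the same code path shown in the traceback above: while opening a PNG, Pillow's PngImagePlugin.chunk_iCCP refuses to inflate an embedded ICC profile that exceeds its decompression-bomb limit, so a single oversized image anywhere in llava_150k_en/llava_150k_zh aborts preprocessing. Below is a minimal scan sketch (not part of LLaMA-Factory) for locating such images; the split name, the images column name, and its Sequence-of-Image layout are assumptions inferred from the traceback.

```python
# Hypothetical scan for PNGs that trip Pillow's iCCP decompression guard.
# The "train" split, the "images" column name, and the Sequence(Image())
# layout are assumptions, not confirmed details of this dataset.
from io import BytesIO

from datasets import Image as ImageFeature
from datasets import Sequence, load_dataset
from PIL import Image

ds = load_dataset("BUAADreamer/llava-en-zh-300k", split="train", streaming=True)
# Disable automatic decoding so one bad image cannot abort the iteration itself.
ds = ds.cast_column("images", Sequence(ImageFeature(decode=False)))

for idx, example in enumerate(ds):
    for img in example["images"]:
        if img.get("bytes") is None:
            continue  # image stored as a path rather than inline bytes
        try:
            Image.open(BytesIO(img["bytes"])).load()
        except ValueError as err:
            print(f"sample {idx}: {err}")
```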

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Dec 10, 2024

CLL112 commented Dec 11, 2024

After updating the code, training works.

@CLL112 CLL112 closed this as completed Dec 11, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Dec 11, 2024
@CLL112 CLL112 reopened this Dec 11, 2024

CLL112 commented Dec 11, 2024

There is still a problem: with max_samples: 10000 training works, but without it (using the full dataset) it still fails with ValueError: Decompressed Data Too Large. Is it because the dataset is too large?
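
Setting max_samples: 10000 most likely succeeds only because preprocessing stops before the problematic sample is reached, so the dataset size itself is probably not the cause. A workaround that is commonly suggested for this Pillow error is to raise the PNG text-chunk decompression cap before any image is decoded, e.g. near the top of the launcher script so that forked preprocessing workers inherit it. This is a hedged sketch, not an official LLaMA-Factory option:

```python
# Commonly cited workaround for "ValueError: Decompressed Data Too Large".
# Pillow's _safe_zlib_decompress inflates iCCP/zTXt/iTXt chunks with
# MAX_TEXT_CHUNK (roughly 1 MB by default) as the limit; raising it trades
# away the decompression-bomb protection, so only do this for trusted data.
from PIL import PngImagePlugin

PngImagePlugin.MAX_TEXT_CHUNK = 100 * 1024 * 1024  # 100 MB
```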


CLL112 commented Dec 11, 2024

With streaming enabled, training starts, but after a while it fails with:

{'loss': 3.0184, 'grad_norm': 4.040715386848615, 'learning_rate': 2.9999527971349995e-06, 'epoch': 0.01}                                                                                                  
{'loss': 3.0188, 'grad_norm': 4.090397107456916, 'learning_rate': 2.999948945403066e-06, 'epoch': 0.01}                                                                                                   
{'loss': 2.9131, 'grad_norm': 3.5998091314843355, 'learning_rate': 2.999944942626314e-06, 'epoch': 0.01}                                                                                                  
{'loss': 3.0172, 'grad_norm': 4.186458330567469, 'learning_rate': 2.9999407888051476e-06, 'epoch': 0.01}                                                                                                  
{'loss': 2.9712, 'grad_norm': 4.287678401200029, 'learning_rate': 2.9999364839399852e-06, 'epoch': 0.01}                                                                                                  
  1%|██                                                                                                                                                            | 129/10000 [18:29<19:15:11,  7.02s/it]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/longchen/llm/llamafact2/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/home/longchen/llm/llamafact2/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/home/longchen/llm/llamafact2/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/home/longchen/llm/llamafact2/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 100, in run_sft
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/transformers/trainer.py", line 2426, in _inner_training_loop
[rank0]:     batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/transformers/trainer.py", line 5038, in get_batch_samples
[rank0]:     batch_samples += [next(epoch_iterator)]
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/accelerate/data_loader.py", line 561, in __iter__
[rank0]:     next_batch = next(dataloader_iter)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank0]:     data = self._next_data()
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 757, in _next_data
[rank0]:     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 33, in fetch
[rank0]:     data.append(next(self.dataset_iter))
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/accelerate/data_loader.py", line 336, in __iter__
[rank0]:     for element in self.dataset:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 2093, in __iter__
[rank0]:     for key, example in ex_iterable:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1511, in __iter__
[rank0]:     yield from islice(self.ex_iterable, ex_iterable_idx_start, None)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1454, in __iter__
[rank0]:     for x in self.ex_iterable:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1000, in __iter__
[rank0]:     yield from self._iter()
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1024, in _iter
[rank0]:     for key, example in iterator:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1683, in __iter__
[rank0]:     for key, example in self.ex_iterable:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 744, in __iter__
[rank0]:     yield from ex_iterable
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1000, in __iter__
[rank0]:     yield from self._iter()
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1080, in _iter
[rank0]:     for key, example in iterator:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1686, in __iter__
[rank0]:     _apply_feature_types_on_example(example, self.features, token_per_repo_id=self.token_per_repo_id),
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 1635, in _apply_feature_types_on_example
[rank0]:     decoded_example = features.decode_example(encoded_example, token_per_repo_id=token_per_repo_id)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/features.py", line 2044, in decode_example
[rank0]:     return {
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/features.py", line 2045, in <dictcomp>
[rank0]:     column_name: decode_nested_example(feature, value, token_per_repo_id=token_per_repo_id)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/features.py", line 1400, in decode_nested_example
[rank0]:     return decode_nested_example([schema.feature], obj)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/features.py", line 1380, in decode_nested_example
[rank0]:     if decode_nested_example(sub_schema, first_elmt) != first_elmt:
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/features.py", line 1405, in decode_nested_example
[rank0]:     return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/datasets/features/image.py", line 187, in decode_example
[rank0]:     image = PIL.Image.open(BytesIO(bytes_))
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/Image.py", line 3477, in open
[rank0]:     im = _open_core(fp, filename, prefix, formats)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/Image.py", line 3465, in _open_core
[rank0]:     im = factory(fp, filename)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/ImageFile.py", line 138, in __init__
[rank0]:     self._open()
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 749, in _open
[rank0]:     s = self.png.call(cid, pos, length)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 198, in call
[rank0]:     return getattr(self, f"chunk_{cid.decode('ascii')}")(pos, length)
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 424, in chunk_iCCP
[rank0]:     icc_profile = _safe_zlib_decompress(s[i + 2 :])
[rank0]:   File "/home/longchen/miniconda3/envs/llamafact2/lib/python3.10/site-packages/PIL/PngImagePlugin.py", line 143, in _safe_zlib_decompress
[rank0]:     raise ValueError(msg)
[rank0]: ValueError: Decompressed Data Too Large
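
The streaming run hits the same chunk_iCCP guard, just later, when the bad sample is finally streamed. An alternative hedged workaround (assuming a recent Pillow, where chunk_iCCP falls back to icc_profile = None instead of raising) is to enable truncated-image loading before the dataloader starts decoding:

```python
# Hedged alternative: tolerate the oversized iCCP chunk instead of aborting.
# The embedded ICC profile of the offending image is discarded, which is
# usually harmless for training data.
from PIL import ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = True
```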

@hiyouga hiyouga closed this as completed Dec 11, 2024