Hi,

I'm trying to figure out how to prepare the data and fine-tune this T5 base model (https://huggingface.co/unicamp-dl/ptt5-base-t5-vocab) on this SQuAD dataset (https://huggingface.co/datasets/squad_v1_pt).

I downloaded the data from Hugging Face to a local folder.
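The download step was roughly the following (a minimal sketch using huggingface_hub; the exact commands weren't captured in this report, so treat this as illustrative):

# Sketch: mirror the squad_v1_pt dataset repo into data/squad_v1_pt.
# Assumes huggingface_hub is installed; not the verbatim step I ran.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="squad_v1_pt",
    repo_type="dataset",
    local_dir="data/squad_v1_pt",
)

Then I ran the following command: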
(qagenerator) Apptainer> python prepare_data.py \
--task e2e_qg \
--model_type t5 \
--dataset_path data/squad_v1_pt \
--qg_format highlight_qg_format \
--max_source_length 512 \
--max_target_length 32 \
--train_file_name train_data_e2e_qg_t5_ptbr.pt \
--valid_file_name valid_data_e2e_qg_t5_ptbr.pt

But I got this error:
/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5.py:163: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
warnings.warn(
07/03/2023 08:55:55 - INFO - nlp.load - Checking data/squad_v1_pt/squad_v1_pt.py for additional imports.
07/03/2023 08:55:55 - INFO - nlp.load - Found main folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt
07/03/2023 08:55:55 - INFO - nlp.load - Found specific version folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4
07/03/2023 08:55:55 - INFO - nlp.load - Found script file from data/squad_v1_pt/squad_v1_pt.py to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/squad_v1_pt.py
07/03/2023 08:55:55 - INFO - nlp.load - Found dataset infos file from data/squad_v1_pt/dataset_infos.json to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/dataset_infos.json
07/03/2023 08:55:55 - INFO - nlp.load - Found metadata file for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/squad_v1_pt.json
Traceback (most recent call last):
  File "/projetos/u4vn/question_generation/prepare_data.py", line 204, in <module>
    main()
  File "/projetos/u4vn/question_generation/prepare_data.py", line 155, in main
    train_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.TRAIN)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/load.py", line 536, in load_dataset
    builder_instance: DatasetBuilder = builder_cls(
TypeError: 'NoneType' object is not callable
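As an aside, the tokenizer FutureWarning above is unrelated to this crash; per the message itself, it can be silenced by passing an explicit model_max_length wherever the tokenizer is instantiated. A minimal sketch (prepare_data.py would need the equivalent change):

# Hypothetical fix for the FutureWarning only; unrelated to the TypeError.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base", model_max_length=512)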
I also tried to copy the data/squad_multitask directory and modify these lines with my URLs, as sketched below.
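The change looked roughly like this (placeholder URLs for illustration, not the exact ones I used):

# In the copied script data/squad_v1_pt/squad_v1_pt.py -- placeholder values.
_URL = "https://huggingface.co/datasets/squad_v1_pt/resolve/main/"
_URLS = {
    "train": _URL + "train-v1.1-pt.json",
    "dev": _URL + "dev-v1.1-pt.json",
}

With that change, the same command gets further but fails with a different error: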
(qagenerator) Apptainer> python prepare_data.py --task e2e_qg --model_type t5 --dataset_path data/squad_v1_pt --qg_format highlight_qg_format --max_source_length 512 --max_target_length 32 --train_file_name train_data_e2e_qg_t5_ptbr.pt --valid_file_name valid_data_e2e_qg_t5_ptbr.pt
/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5.py:163: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
warnings.warn(
07/03/2023 09:07:01 - INFO - nlp.load - Checking data/squad_v1_pt/squad_v1_pt.py for additional imports.
07/03/2023 09:07:02 - INFO - nlp.load - Found main folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt
07/03/2023 09:07:02 - INFO - nlp.load - Found specific version folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec
07/03/2023 09:07:02 - INFO - nlp.load - Found script file from data/squad_v1_pt/squad_v1_pt.py to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py
07/03/2023 09:07:02 - INFO - nlp.load - Found dataset infos file from data/squad_v1_pt/dataset_infos.json to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/dataset_infos.json
07/03/2023 09:07:02 - INFO - nlp.load - Found metadata file for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.json
[nltk_data] Downloading package punkt to /home/U4VN/nltk_data...
[nltk_data] Package punkt is already up-to-date!
07/03/2023 09:07:02 - INFO - nlp.info - Loading Dataset Infos from /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec
07/03/2023 09:07:02 - INFO - nlp.builder - Generating dataset squad_multitask (/tmp/u4vn/huggingface/datasets/squad_multitask/highlight_qg_format/1.0.0/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec)
Downloading and preparing dataset squad_multitask/highlight_qg_format (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /tmp/u4vn/huggingface/datasets/squad_multitask/highlight_qg_format/1.0.0/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec...
07/03/2023 09:07:02 - INFO - nlp.builder - Dataset not on Hf google storage. Downloading and preparing it from source
07/03/2023 09:07:04 - INFO - nlp.utils.info_utils - Unable to verify checksums.
07/03/2023 09:07:04 - INFO - nlp.builder - Generating split train
0 examples [00:00, ? examples/s]
07/03/2023 09:07:04 - INFO - root - generating examples from = /tmp/u4vn/huggingface/datasets/downloads/6bf2e2bfc0769ed6e47c7935079d8584fb3201dd7915b637bbcf0fe3409710a0.4d4fd5bfbda09cd172db9f6f025e9bbf6d4d7d20cd53cef625822e1f2a34dd1f
Traceback (most recent call last):
  File "/projetos/u4vn/question_generation/prepare_data.py", line 204, in <module>
    main()
  File "/projetos/u4vn/question_generation/prepare_data.py", line 155, in main
    train_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.TRAIN)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/load.py", line 548, in load_dataset
    builder_instance.download_and_prepare(
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 462, in download_and_prepare
    self._download_and_prepare(
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 537, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 810, in _prepare_split
    for key, record in utils.tqdm(generator, unit=" examples", total=split_info.num_examples, leave=False):
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 239, in _generate_examples
    yield count, self.process_qg_text(context, question, qa["answers"][0])
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 144, in process_qg_text
    start_pos, end_pos = self._get_correct_alignement(context, answer)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 131, in _get_correct_alignement
    raise ValueError()
ValueError
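For reference, the check that raises here in the copied squad_multitask script looks roughly like this (paraphrased from the question_generation repo, not my exact copy):

def _get_correct_alignement(self, context, answer):
    # Some SQuAD answer_start indices are off by one or two characters;
    # try the gold position and two nearby offsets before giving up.
    gold_text = answer["text"]
    start_idx = answer["answer_start"]
    end_idx = start_idx + len(gold_text)
    if context[start_idx:end_idx] == gold_text:
        return start_idx, end_idx
    elif context[start_idx - 1:end_idx - 1] == gold_text:
        return start_idx - 1, end_idx - 1
    elif context[start_idx - 2:end_idx - 2] == gold_text:
        return start_idx - 2, end_idx - 2
    else:
        raise ValueError()

So the bare ValueError appears to mean that at least one answer_start offset in squad_v1_pt no longer lines up with its translated context, even after the off-by-one/two correction.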