Is it possible to take data from the dataset sequentially during training? #1871

pitkad · 2023-12-16T10:32:52Z

Hello!
For my task I have a large, automatically generated dataset. It contains a number of incorrect pairs in which the "output" is not related to the "input" in any way.
I came up with the idea of filtering this dataset by looking at the loss of each individual element, using the following command:

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path krshtfr01 \
    --do_train \
    --dataset test-reseach \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir test-reseach4 \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 100000 \
    --learning_rate 1e-59 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

It turned out that the incorrect elements are clearly identifiable by their high loss. However, the elements are not drawn from the dataset sequentially, so it is impossible to match the logged loss values to the dataset entries that produced them.

In this regard, I have a question:
Is it possible to ensure that, during training, the elements are taken strictly sequentially from the beginning to the end of the file? Or, at least, can “trainer_log.jsonl” include the index of each element in the dataset?
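
For context, the filtering step I have in mind could look roughly like the sketch below. It assumes the losses are already available in dataset order, which is exactly the correspondence I cannot establish yet. The function name `flag_high_loss`, the median-based threshold, and the sample numbers are illustrative choices only and are not part of the training script.

```python
from statistics import median


def flag_high_loss(losses, k=5.0):
    """Return indices of examples whose loss is unusually high.

    Assumption: losses[i] is the loss of the i-th dataset element.
    The threshold (median + k * MAD) is a hypothetical choice.
    """
    med = median(losses)
    # Median absolute deviation: a spread estimate that the
    # outliers themselves do not inflate.
    mad = median(abs(x - med) for x in losses)
    threshold = med + k * mad
    return [i for i, x in enumerate(losses) if x > threshold]


# Made-up losses for illustration; index 3 plays the unrelated pair.
example_losses = [1.2, 0.9, 1.1, 7.8, 1.0, 1.3]
suspect = flag_high_loss(example_losses)
print(suspect)
```

The returned indices would then be used to drop the corresponding entries from the dataset file. The value of `k` would need tuning on real data, and the threshold collapses to the median if more than half of the losses are identical.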

Replies: 0 comments