Hello!
For my task there is a large dataset generated automatically. This dataset includes a number of incorrect pairs, where "output" is not related to "input" in any way.
I came up with the idea of filtering this dataset by looking at the loss of each individual element, using the following command:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path krshtfr01 \
    --do_train \
    --dataset test-reseach \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir test-reseach4 \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 100000 \
    --learning_rate 1e-59 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
It turned out that incorrect elements are clearly visible by their high loss. However, the elements are not drawn from the dataset sequentially, so it is impossible to match the logged loss values to the dataset entries.
In this regard, I have a question:
Is it possible to ensure that during training the elements are taken strictly sequentially, from the beginning to the end of the file? Or, at least, can "trainer_log.jsonl" include the number of each element in the dataset?
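To make the intended filtering step concrete, here is a rough sketch of what I would like to do once the losses can be matched to rows. It assumes a list of loss values with exactly one value per dataset element, in file order, which is precisely the part I cannot get right now. The function name `flag_high_loss`, the threshold rule (median plus `k` times the median absolute deviation) and the sample numbers are all made up for illustration and are not taken from the training script.

```python
from statistics import median


def flag_high_loss(losses, k=5.0):
    """Return the indices of losses above median + k * MAD.

    Assumes losses[i] is the loss of dataset element i (file order).
    If the MAD is 0, every value above the median gets flagged.
    """
    med = median(losses)
    # Median absolute deviation: a spread estimate that the outliers
    # themselves do not inflate much.
    mad = median(abs(x - med) for x in losses)
    threshold = med + k * mad
    return [i for i, x in enumerate(losses) if x > threshold]


# Hypothetical per-element losses, one per dataset row, in file order.
example_losses = [1.2, 1.1, 1.3, 6.8, 1.0, 1.2, 7.5, 1.1]
suspect_rows = flag_high_loss(example_losses)
print(suspect_rows)
```

The returned indices would then be the rows to inspect or drop from the dataset file. The value of `k` is a guess and would need tuning on real data.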