Difference between SentenceLabelDataset and GroupByLabelBatchSampler? #2920

Comments
Hello! You're very right in your analysis:

sentence-transformers/sentence_transformers/sampler.py, lines 40 to 43 in 0a32ec8

This sampler is meant for the

Yes, and no. You can override the Trainer's

sentence-transformers/sentence_transformers/trainer.py, lines 459 to 466 in 0a32ec8

and replace it with a function that immediately returns a custom batch sampler with your desired behaviour. So yes: you can use the older logic, but no: you'd have to write it yourself. Hope this helps a bit.
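As a rough illustration of such a self-written sampler, here is a plain-Python sketch of round-robin, at-most-N-per-label batching. The class name and logic are hypothetical, not sentence-transformers code; a real implementation would subclass torch's BatchSampler and be returned from the overridden Trainer hook.

```python
from collections import defaultdict

class RoundRobinLabelBatchSampler:
    """Hypothetical sketch: yield batches of `batch_size` indices with at
    most `samples_per_label` items per label in each batch, cycling
    through the labels round-robin (mimicking the behaviour described
    for the old SentenceLabelDataset, without replacement)."""

    def __init__(self, labels, batch_size=4, samples_per_label=2):
        self.batch_size = batch_size
        self.samples_per_label = samples_per_label
        groups = defaultdict(list)
        for idx, label in enumerate(labels):
            groups[label].append(idx)
        # Labels with fewer than samples_per_label items can never
        # contribute a full draw, so they are skipped entirely.
        self.groups = {l: idxs for l, idxs in groups.items()
                       if len(idxs) >= samples_per_label}

    def __iter__(self):
        pools = {l: list(idxs) for l, idxs in self.groups.items()}
        batch = []
        while pools:
            for label in list(pools):
                # Draw samples_per_label indices from this label's pool.
                batch.extend(pools[label][:self.samples_per_label])
                pools[label] = pools[label][self.samples_per_label:]
                if len(pools[label]) < self.samples_per_label:
                    del pools[label]  # exhausted for full draws
                if len(batch) >= self.batch_size:
                    yield batch[:self.batch_size]
                    batch = batch[self.batch_size:]
        # Any trailing partial batch is dropped.
```

With the example data discussed in this issue, `[1, 1, 1, 1, 2, 2, 2, 3, 3, 4]`, this sketch yields `[0, 1, 4, 5]` and then `[7, 8, 2, 3]`: every batch mixes two labels rather than being fully homogeneous.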
Hi @tomaarsen

First of all, kudos to you for maintaining such an awesome and pragmatic library.

I am facing some difficulty using the GROUP_BY_LABEL batch sampler in v3.0 and want to highlight the issues to check if there is any way to mitigate them. I went through the issues and found this: #2698 (comment)

You mentioned there that the idea is to replace SentenceLabelDataset with GroupByLabelBatchSampler, but I think there are very drastic differences between the two, and the functionality of SentenceLabelDataset has not been retained while opting for GroupByLabelBatchSampler as a replacement. Below is a detailed example to explain the differences:
Let's take a simple example with a list of integers representing classes, and we'll use it to illustrate how the two approaches handle homogeneity in batch construction.
Example data: [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
GroupByLabelBatchSampler Behavior:

Step 1: Initialization

Grouping sample indices by label:
- Class 1: [0, 1, 2, 3]
- Class 2: [4, 5, 6]
- Class 3: [7, 8]
- Class 4: [9]

Truncate each group to an even number of samples:
- Class 1: [0, 1, 2, 3] (all 4 samples)
- Class 2: [4, 5] (only 2 samples; 1 sample is discarded)
- Class 3: [7, 8] (both samples)
- Class 4: removed because it doesn't have at least 2 samples

Resulting groups:
- [0, 1, 2, 3]
- [4, 5]
- [7, 8]
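The initialization steps above can be sketched in plain Python (an illustration of the described behaviour, not the library's actual implementation):

```python
from collections import defaultdict

labels = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

# Step 1a: group sample indices by label.
groups = defaultdict(list)
for idx, label in enumerate(labels):
    groups[label].append(idx)

# Step 1b: truncate each group to an even number of indices and drop
# labels left with fewer than 2 samples.
truncated = {}
for label, idxs in groups.items():
    idxs = idxs[:len(idxs) - len(idxs) % 2]
    if len(idxs) >= 2:
        truncated[label] = idxs

print(truncated)  # {1: [0, 1, 2, 3], 2: [4, 5], 3: [7, 8]}
```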
Step 2: Batch Construction (batch size 4 in this example)

Batch 1:
- Take group [0, 1, 2, 3]
- Batch: [0, 1, 2, 3] (4 samples from Class 1, fully homogeneous)

Batch 2:
- Take groups [4, 5] and [7, 8]
- Batch: [4, 5, 7, 8] (2 samples from Class 2, 2 samples from Class 3)

SentenceLabelDataset Behavior:

Step 1: Initialization
- Unique labels: [1, 2, 3, 4]
Step 2: Batch Construction
Batch 1:
- From Class 1: [0, 1]
- From Class 2: [4, 5]
- Batch: [0, 1, 4, 5] (2 samples from Class 1, 2 samples from Class 2)

Batch 2:
- From Class 1: [2, 3]
- From Class 3: [7, 8]
- Batch: [2, 3, 7, 8] (2 samples from Class 1, 2 samples from Class 3)

Batch 3:
- Depends on whether with_replacement is True or False.
Comparison of Homogeneity:
- GroupByLabelBatchSampler: Batch 1 = [0, 1, 2, 3] (fully homogeneous); Batch 2 = [4, 5, 7, 8] (mixes Classes 2 and 3).
- SentenceLabelDataset: Batch 1 = [0, 1, 4, 5] (mixes Classes 1 and 2); Batch 2 = [2, 3, 7, 8] (mixes Classes 1 and 3).
TL;DR:

I am trying to fine-tune Sentence Transformer models on a dataset with this label distribution:

In the new GroupByLabelBatchSampler, the batching logic yields most of the batches as homogeneous, and there is not much improvement observed after fine-tuning.

IMO this type of data could have been handled easily by SentenceLabelDataset, as it ensures there are at most N samples from each label in a batch. Intuitively, ST models should benefit from in-batch negatives and more heterogeneous batches.

Can you help me verify whether my understanding is correct and, if so, is there any way to opt for the older logic?
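For what it's worth, the homogeneity difference in the walkthrough above can be quantified by counting the distinct labels in each batch (batch index lists taken from the example, not computed by the library):

```python
labels = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

# Batches as derived in the walkthrough above.
group_by_label_batches = [[0, 1, 2, 3], [4, 5, 7, 8]]  # GroupByLabelBatchSampler
sentence_label_batches = [[0, 1, 4, 5], [2, 3, 7, 8]]  # SentenceLabelDataset

def distinct_labels(batches):
    """Number of distinct classes present in each batch."""
    return [len({labels[i] for i in batch}) for batch in batches]

print(distinct_labels(group_by_label_batches))  # [1, 2]: one batch is fully homogeneous
print(distinct_labels(sentence_label_batches))  # [2, 2]: every batch mixes two classes
```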