[feat] Update mine_hard_negatives to use a full corpus and multiple positives #2848
Conversation
```python
if range_max > 2048 and use_faiss:
    # FAISS on GPU can only retrieve up to 2048 documents per query
    range_max = 2048
    if verbose:
        print("Using FAISS, we can only retrieve up to 2048 documents per query. Setting range_max to 2048.")
```
Nice comment, I didn't realise this
```diff
 query_embeddings = query_embeddings.cpu().numpy()
 corpus_embeddings = corpus_embeddings.cpu().numpy()
-index = faiss.IndexFlatIP(len(corpus_embeddings[0]))
+index = faiss.IndexFlatIP(model.get_sentence_embedding_dimension())
```
Nice!
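For context, here is a minimal, self-contained sketch (not from the PR; model and texts are placeholders) of the pattern the changed line enables: the FAISS index is sized from the model's embedding dimensionality rather than from the first corpus embedding.

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
corpus = ["Paris is the capital of France.", "Mitochondria are the powerhouse of the cell."]
queries = ["What is the capital of France?"]

# Normalized embeddings + inner product = cosine similarity search
corpus_embeddings = model.encode(corpus, normalize_embeddings=True, convert_to_numpy=True)
query_embeddings = model.encode(queries, normalize_embeddings=True, convert_to_numpy=True)

# Size the index from the model rather than from corpus_embeddings[0]:
# no need to peek at the first embedding, and it works before anything is added.
index = faiss.IndexFlatIP(model.get_sentence_embedding_dimension())
index.add(corpus_embeddings)

scores, indices = index.search(query_embeddings, k=1)
print(corpus[indices[0][0]], scores[0][0])
```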
Hello! Apologies for the delay, I've been recovering from surgery since July.

I've tested this up to datasets of 3M samples (gooaq, amazon-qa), with/without FAISS and with/without a CrossEncoder. Here's a script to test it:

```python
from pprint import pprint

from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer, CrossEncoder
from datasets import load_dataset

# Load a Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/amazon-qa", split="train").select(range(50000))
print(dataset)

corpus = load_dataset("sentence-transformers/gooaq", split="train[:50000]")["answer"]

# cross_encoder_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
# cross_encoder = CrossEncoder(cross_encoder_name)

# Mine hard negatives
dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    corpus=corpus,
    # cross_encoder=cross_encoder,
    range_min=0,
    range_max=10,
    max_score=0.8,
    margin=0,
    num_negatives=5,
    sampling_strategy="random",
    batch_size=512,
    use_faiss=True,
)
print(dataset)
pprint(dataset[0])
# dataset.push_to_hub("natural-questions-cnn-hard-negatives", "triplet", private=True)
breakpoint()
```

Would love to hear what you think @ArthurCamara @ChrisGeishauser. Again, sorry for the radio silence.
I'm going to move ahead with merging this, as I'd like to include this in the upcoming release. If you find a moment, feel free to experiment with this function and report any issues.
[feat] Update mine_hard_negatives to use a full corpus and multiple positives (#2848)

* updated mine_hard_negatives method to include a separate corpus for mining hard negatives
* Run 'make check'
* Update "corpus" to just a list of strings
* Prevent duplicate embeddings if no separate corpus
* Deduplicate corpus; add a positive to corpus indices mapping, useful to get non-deduplicated positives and to filter away positives taken from the corpus
* Skip rescoring positive pairs via pos_to_corpus_indices instead
* Add a mine_hard_negatives_from_corpus util
* Speed up pos_to_corpus_indices for large corpora
* Fix range_max by number of max_positives in dataset
* Encode in chunks, ensure at least one positive per query always
* Hard negative mining with corpus and multiple positives is possible
* Docstring
* Fix for random sampling
* Fix for return_triplets=False
* Fix typo on list
* Fix bug with multiple positives; more efficient creation of some tensors
* Fix offset of positives scoring with multiple chunks
* Fix PyTorch copy warning
* Only embed each text once; no need for chunking if convert_to_numpy=True
* Undo unintended changes
* Fix mismatch in anchor/positive and negatives if multiple positives per query
* Don't repeat positive_scores as it inflates the positive score counts
* Remove the "Count" for Difference as it's rather confusing

Co-authored-by: Christian Geishauser <christiangeishauser@Christians-Laptop-455.local>
Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
Nice! Thanks for your help, @tomaarsen! I actually used this code to create a set of "NanoBEIR" datasets that I just made public here: https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6
Awesome! Will you have some writing on how well it correlates to BEIR itself? Because as we all know, BEIR takes forever to run, and a faster option is definitely interesting 😅
I've found the post already 👀 https://www.zeta-alpha.com/post/fine-tuning-an-llm-for-state-of-the-art-retrieval-zeta-alpha-s-top-10-submission-to-the-the-mteb-be

Looks quite solid! Nice collection of datasets, I'm glad some of my datasets came in useful too. And your eventual training recipe seems to be about equivalent to CachedMultipleNegativesRankingLoss, which is GradCache + InfoNCE (except this doesn't update the temperature, called `scale` in Sentence Transformers).

Also, in my opinion the Sentence Transformers evaluators are very useful, especially InformationRetrievalEvaluator, but preparing this data is always rather difficult. Perhaps NanoBEIR is a nice moment to e.g. package these datasets such that people can easily import them for use with any Sentence Transformer model? That said, a lot of modern ST models require specific prompts for queries vs documents, which is not implemented in this evaluator (or the hard negatives mining) (yet?).
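For readers who haven't used that loss, here is a minimal sketch of it in Sentence Transformers; the model name and hyperparameter values are placeholders, not taken from the post:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

# GradCache-style InfoNCE: embed in small mini-batches without gradients, compute the
# loss over the full batch, then replay gradients per mini-batch, so the effective
# batch size is decoupled from GPU memory.
loss = CachedMultipleNegativesRankingLoss(
    model,
    scale=20.0,          # the fixed temperature mentioned above
    mini_batch_size=32,  # memory/speed trade-off; does not change the loss value
)
```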
Yup, that's exactly it. I began working with the new ST trainer, but ended up building our own custom trainer, mainly because it was not clear/straightforward how to use multiple negatives per query or to switch between using (or not using) in-batch negatives, especially with multiple GPUs (i.e., gathering negatives across devices). Giving it another shot is on my (rather long, I have to admit) to-do list. I'm starting to work on other large models now, so maybe that's a good moment to get working on it again.
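As a side note (not something this PR changes, and it does not address the multi-GPU gathering question), the built-in losses can already consume multiple hard negatives per query by giving the training dataset one column per negative; a rough sketch with made-up data:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

# Columns beyond (anchor, positive) are treated as extra hard negatives,
# on top of the usual in-batch negatives.
train_dataset = Dataset.from_dict({
    "anchor": ["how do solar panels work?"],
    "positive": ["Solar panels convert sunlight into electricity via the photovoltaic effect."],
    "negative_1": ["Wind turbines convert kinetic energy from wind into electricity."],
    "negative_2": ["A solar eclipse occurs when the Moon passes between the Sun and Earth."],
})
loss = MultipleNegativesRankingLoss(model)
```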
Seems like it's quite similar to the hard negative mining use case. Perhaps I can try adapting the code (or NanoBEIR) to work directly with it. The NanoBEIR format is mainly the way it is now because that's what the MTEB datasets look like; no hard attachment there.
You mean things like E5's query/passage prefixes?
I don't blame you at all, I like building things myself to really understand the core mechanics as well.
I think it's totally fine to keep that format as-is (it's pretty normal). My thought was to create a new, separate package for loading them.
I'm also considering packaging it into Sentence Transformers directly, but that might be a bit overkill.
Yeah. It's not hard, it's just still missing, e.g. adding "query_prompt", "query_prompt_name", "passage_prompt", "passage_prompt_name" kwargs to pass on to the underlying encode calls.
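To make that concrete, here is a sketch of the encode-side calls such kwargs would eventually forward to. The prompts dict and texts are illustrative, and the query_prompt*/passage_prompt* kwargs themselves do not exist yet; only `prompt_name` on `encode` is existing API.

```python
from sentence_transformers import SentenceTransformer

# Illustrative model configured with named prompts, e.g. E5-style prefixes.
# Whether a given model ships such prompts is an assumption here.
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    prompts={"query": "query: ", "passage": "passage: "},
)

queries = ["how do solar panels work?"]
corpus = ["Solar panels convert sunlight into electricity via the photovoltaic effect."]

# encode() already accepts prompt / prompt_name; the missing piece discussed above is
# exposing equivalent kwargs on mine_hard_negatives and the evaluators so these calls
# can be configured from the outside.
query_embeddings = model.encode(queries, prompt_name="query", convert_to_numpy=True)
corpus_embeddings = model.encode(corpus, prompt_name="passage", convert_to_numpy=True)
```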
I've read #2831 and I actually agree with you there. It is easier to just not sync across devices and simply increase the batch size accordingly. My only follow-up issue (and one that I'm still trying to understand how to solve) is how to handle multiple hard negatives per query when using in-batch negatives: from what I understood, the anchor and its documents end up repeated across rows.

I guess the "proper" way around it would be to disentangle queries and documents in the sampler/collator, and then have each anchor carry a list of integers indicating where its negatives sit in the list of all documents. Or can you see another way around it? (I should probably open another issue to discuss this.)
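A minimal sketch of that disentangled layout in plain Python (all names are made up; no Sentence Transformers specifics): documents are deduplicated per batch and each anchor only keeps indices into the shared document list.

```python
from dataclasses import dataclass, field

@dataclass
class Batch:
    queries: list[str]
    documents: list[str]     # deduplicated: each text would be embedded once
    positive_idx: list[int]  # positive_idx[i] -> index into documents for query i
    negative_idx: list[list[int]] = field(default_factory=list)  # per-query negative indices

def collate(rows: list[dict]) -> Batch:
    """Each row: {"query": str, "positive": str, "negatives": list[str]}."""
    documents: list[str] = []
    doc_to_idx: dict[str, int] = {}

    def add(doc: str) -> int:
        if doc not in doc_to_idx:
            doc_to_idx[doc] = len(documents)
            documents.append(doc)
        return doc_to_idx[doc]

    queries, positive_idx, negative_idx = [], [], []
    for row in rows:
        queries.append(row["query"])
        positive_idx.append(add(row["positive"]))
        negative_idx.append([add(neg) for neg in row["negatives"]])
    return Batch(queries, documents, positive_idx, negative_idx)

# Example: two queries share a negative document; it is stored (and would be
# embedded) only once, and the loss can look it up by index for both queries.
batch = collate([
    {"query": "q1", "positive": "d1", "negatives": ["d2", "d3"]},
    {"query": "q2", "positive": "d4", "negatives": ["d2"]},
])
print(batch.documents)     # ['d1', 'd2', 'd3', 'd4']
print(batch.negative_idx)  # [[1, 2], [1]]
```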
Maybe to start with, adding support for these prompt kwargs to InformationRetrievalEvaluator?
I've opened a PR here: #2951, which adds this (and an option to deal with instruction masking in left-padded tokens).
Edit: I think I found another solution for this. I've opened #2954 to discuss it.
I'm glad I'm not the only one 😄
I replied with more detail in #2954 (comment), but something to note is that the
Indeed.
Perhaps a new file in
Awesome! I'll try and review this in the coming workdays.
Following from #2818, this PR updates the `mine_hard_negatives` method to allow a corpus to be passed (thanks @ChrisGeishauser) and to allow a single query to have multiple positives (as is the case in the TREC-Covid dataset). The way it handles multiple positives is to check for duplicated queries in the input dataset: if the same query appears multiple times, every occurrence is considered another positive for that query. The method then only uses each query once when searching, and keeps track of the positives retrieved.

One thing to consider is that, if the dataset has too many positives and `use_triplets=True`, the method will "explode" the dataset, returning `n_positives * n_negatives` rows per query. If `use_triplets=False`, only `n_positives` rows are returned per query. An alternative would be to return a nested dataset, with a "positives" and a "negatives" column.
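As a toy illustration of those row counts (the values are arbitrary, chosen only for the example):

```python
# Row counts as described above, for one query.
n_positives = 3    # the same query appears 3 times in the input dataset
num_negatives = 5  # negatives requested per query

rows_if_triplets = n_positives * num_negatives  # one (query, positive, negative) row each
rows_otherwise = n_positives                    # one row per positive, negatives kept together

print(rows_if_triplets, rows_otherwise)  # 15 3
```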