
Finetune for "clustering" when we don't have exact positive/negative pairs #2936

Open
HenningDinero opened this issue Sep 13, 2024 · 1 comment


HenningDinero commented Sep 13, 2024

When using the triplet loss, we try to minimize the distance between each anchor/positive pair (a_1, p_1) while maximizing the distance between the pairs (a_1, p_j), j != 1.
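(For reference, the standard triplet objective is max(0, d(a_1, p_1) - d(a_1, n) + margin); in the in-batch setup described here, the other positives p_j, j != 1, take the role of the negative n.)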

I'm trying to solve the following: for given sets of texts t1 = ["text about banking", "text about finance", "text about money laundry"] and t2 = ["text about sport", "text about injuries", "text about running shoes"], create embeddings such that the embeddings within t1 are closer to each other than to anything in t2, i.e. create embeddings which are clustered.

As far as I can see that is not "directly supported", but is there a way around this? I could take each text in t2 as a hard negative for each text in t1 (a sketch of this is below), but I can't figure out whether there is a better approach, because we would still get an anchor/negative pair between texts within t1: if I set a_1 = "text about banking" and p_1 = "text about finance", then "text about money laundry" would become a negative for "text about banking", which it shouldn't be.
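Roughly, the workaround would look like this (just a sketch, using sentence-transformers' InputExample triplet format; note that with plain TripletLoss only the explicit negatives are pushed apart, while losses that use in-batch negatives would additionally treat the other t1 positives in a batch as negatives, which is exactly the problem above):

```python
# Sketch of the workaround: every t2 text becomes a hard negative for every
# anchor/positive pair drawn from within t1.
from itertools import permutations

from sentence_transformers import InputExample

t1 = ["text about banking", "text about finance", "text about money laundry"]
t2 = ["text about sport", "text about injuries", "text about running shoes"]

triplets = [
    InputExample(texts=[anchor, positive, negative])
    for anchor, positive in permutations(t1, 2)  # ordered pairs within t1
    for negative in t2                           # each t2 text as a hard negative
]
print(len(triplets))  # 6 anchor/positive pairs x 3 negatives = 18 triplets
```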

Note: there is this example https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py which shows how to apply a model to create clusters, but I want to fine-tune the model based on "clusters".

ir2718 (Contributor) commented Sep 13, 2024

To me this sounds a lot like hierarchical classification, where hyperbolic embeddings are often used. Have a look at this. You can partially automate the process of creating labels using some existing sentence transformer model and hierarchical agglomerative clustering (and possibly manually relabel the mistakes); a sketch of that pipeline is below. Since it seems you're working on some kind of topic modeling, check out BERTopic, as it does something similar but also includes dimensionality reduction. Does this fit your use case?
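A minimal sketch of that labelling pipeline (the model name and the distance threshold are placeholders you'd need to tune; `metric="cosine"` requires scikit-learn >= 1.2, older versions call it `affinity`):

```python
# Sketch: embed texts with an off-the-shelf model, cluster the embeddings with
# hierarchical agglomerative clustering, and use the cluster ids as
# provisional labels to be manually corrected.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

texts = [
    "text about banking", "text about finance", "text about money laundry",
    "text about sport", "text about injuries", "text about running shoes",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
embeddings = model.encode(texts, normalize_embeddings=True)

clustering = AgglomerativeClustering(
    n_clusters=None,          # let the distance threshold decide the cluster count
    distance_threshold=0.5,   # made-up starting point, tune per dataset
    metric="cosine",
    linkage="average",        # ward linkage would require euclidean distances
)
labels = clustering.fit_predict(embeddings)

for label, text in zip(labels, texts):
    print(label, text)  # inspect and manually relabel the mistakes
```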
