
Finetune for "clustering" when we don't have exact positive/negative pairs #2936

Open
HenningDinero opened this issue Sep 13, 2024 · 1 comment


HenningDinero commented Sep 13, 2024

When using the triplet loss, we try to minimize the distance between each anchor/positive pair (a_1, p_1) while maximizing the distance between the pairs (a_1, p_j), j != 1.
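(For reference, the standard triplet objective is max(0, d(a_1, p_1) - d(a_1, n) + margin); in the in-batch setup described here, the other positives p_j, j != 1, take the role of the negative n.)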

I'm trying to solve the following: for given sets of texts t1 = ["text about banking", "text about finance", "text about money laundry"] and t2 = ["text about sport", "text about injuries", "text about running shoes"], create embeddings such that the embeddings within t1 are closer to each other than to anything in t2, i.e. create embeddings which are clustered.

As far as I can see that is not "directly supported", but is there a way around this? I could take each text in t2 as a hard negative for each text in t1 (a sketch of this is below), but I can't figure out whether there is a better approach, because we would still get an anchor/negative pair between texts within t1: if I set a_1 = "text about banking" and p_1 = "text about finance", then "text about money laundry" would become a negative for "text about banking", which it shouldn't be.
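Roughly, the workaround would look like this (just a sketch, using sentence-transformers' InputExample triplet format; note that with plain TripletLoss only the explicit negatives are pushed apart, while losses that use in-batch negatives would additionally treat the other t1 positives in a batch as negatives, which is exactly the problem above):

```python
# Sketch of the workaround: every t2 text becomes a hard negative for every
# anchor/positive pair drawn from within t1.
from itertools import permutations

from sentence_transformers import InputExample

t1 = ["text about banking", "text about finance", "text about money laundry"]
t2 = ["text about sport", "text about injuries", "text about running shoes"]

triplets = [
    InputExample(texts=[anchor, positive, negative])
    for anchor, positive in permutations(t1, 2)  # ordered pairs within t1
    for negative in t2                           # each t2 text as a hard negative
]
print(len(triplets))  # 6 anchor/positive pairs x 3 negatives = 18 triplets
```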

Note: there is this example https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py which shows how to apply a model to create clusters, but I want to fine-tune the model based on "clusters".

ir2718 (Contributor) commented Sep 13, 2024

To me this sounds a lot like hierarchical classification, where hyperbolic embeddings are often used. Have a look at this. You can partially automate the process of creating labels using some existing sentence transformer model and hierarchical agglomerative clustering (and possibly manually relabel the mistakes); a sketch of that pipeline is below. Since it seems you're working on some kind of topic modeling, check out BERTopic, as it does something similar but also includes dimensionality reduction. Does this fit your use case?
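A minimal sketch of that labelling pipeline (the model name and the distance threshold are placeholders you'd need to tune; `metric="cosine"` requires scikit-learn >= 1.2, older versions call it `affinity`):

```python
# Sketch: embed texts with an off-the-shelf model, cluster the embeddings with
# hierarchical agglomerative clustering, and use the cluster ids as
# provisional labels to be manually corrected.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

texts = [
    "text about banking", "text about finance", "text about money laundry",
    "text about sport", "text about injuries", "text about running shoes",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
embeddings = model.encode(texts, normalize_embeddings=True)

clustering = AgglomerativeClustering(
    n_clusters=None,          # let the distance threshold decide the cluster count
    distance_threshold=0.5,   # made-up starting point, tune per dataset
    metric="cosine",
    linkage="average",        # ward linkage would require euclidean distances
)
labels = clustering.fit_predict(embeddings)

for label, text in zip(labels, texts):
    print(label, text)  # inspect and manually relabel the mistakes
```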
