When using the triplet loss, we try to minimize the distance between each pair `(a_i, p_i)` while maximizing the distance between `(a_i, p_j)` for `j != i`.
I'm trying to solve the following: for a given set of texts `t1 = ["text about banking", "text about finance", "text about money laundering"]` and `t2 = ["text about sport", "text about injuries", "text about running shoes"]`, create embeddings such that the embeddings for `t1` are closer to each other than to any in `t2`, i.e. create embeddings which are clustered.
As far as I can see that is not "directly supported", but is there a way around this? I could take each text in `t2` as a hard negative for each text in `t1`, but I can't figure out if there is a better approach, because we would still get an anchor/negative pair for each text in `t1`, i.e. if I set `a_1 = "text about banking"` and `p_1 = "text about finance"`, then `"text about money laundering"` would be a negative for `"text about banking"`, which it shouldn't be.

Note, there is this example https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py which shows how to apply a model to create clusters; I want to fine-tune the model based on such "clusters".
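For reference, one way this can be expressed is with the batch-hard triplet losses, which take a single integer class label per text and build triplets within each batch, so same-cluster texts are never treated as negatives of each other. A minimal sketch, assuming the `InputExample`/`model.fit` API; the model name, batch size, and epoch count are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

t1 = ["text about banking", "text about finance", "text about money laundering"]
t2 = ["text about sport", "text about injuries", "text about running shoes"]

# One integer label per cluster: inside a batch, same-label pairs become
# positives and different-label pairs become negatives.
train_examples = [InputExample(texts=[t], label=0) for t in t1] + \
                 [InputExample(texts=[t], label=1) for t in t2]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=6)

# BatchAllTripletLoss forms all valid (anchor, positive, negative) triplets
# from the labels, so no manual triplet mining is needed.
train_loss = losses.BatchAllTripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```

With this setup, "text about money laundering" is never a negative for "text about banking", since both carry label 0.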
To me this sounds a lot like hierarchical classification, where hyperbolic embeddings are often used. Have a look at this. You can partially automate the process of creating labels using an existing sentence transformer model and hierarchical agglomerative clustering (and possibly manually relabel the mistakes). Since it seems you're working on some kind of topic modeling, check out BERTopic, as it does something similar but includes dimensionality reduction. Does this fit your use case?
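A rough sketch of that pseudo-labeling step, assuming an off-the-shelf model and scikit-learn's agglomerative clustering; the model name and `distance_threshold` are assumptions you would need to tune:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

texts = ["text about banking", "text about finance",
         "text about money laundering", "text about sport",
         "text about injuries", "text about running shoes"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
embeddings = model.encode(texts, normalize_embeddings=True)

# Hierarchical agglomerative clustering; with n_clusters=None, the
# distance_threshold decides how many clusters emerge.
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0)
labels = clustering.fit_predict(embeddings)

# These labels (after manually relabeling mistakes) can serve as the
# class labels for the batch triplet loss sketched above.
for text, label in zip(texts, labels):
    print(label, text)
```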