-
Hi, when I fit my model on training data and look at the probs for a particular document, they mostly add up to 1 (the probability is highest for the predicted topic). But when I transform unseen data with the same model, the probs array has a value above 0.8 for nearly every topic per description, and I have 347 topics. How can the predicted probability values be so high? Here is my sample code where I am transforming unseen data,
this is how the predicted probs look:
If you look at each document or description, the values are so high that it looks like every description is related to all the topics, but it is not. Why is that? Am I doing something wrong? The output topics are fine (it guesses somewhat correctly), but I am more concerned about the probs because I want to use them for further analysis, such as computing the 25th percentile. PS: Super library, really helpful, awesome work! Best,
-
You are not doing anything wrong; that is just the nature of the underlying embedding model, which tends to produce relatively high similarity scores. Its distribution of similarity scores can be skewed toward higher values, so applying something like a softmax would help you get a more fine-grained perspective.
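To illustrate the softmax idea, here is a minimal sketch with NumPy. The `scores` array is hypothetical (standing in for one row of the probs array, which would have 347 columns in your case), and the temperature value is just an assumption to show how sharpening works:

```python
import numpy as np

# Hypothetical similarity scores for one document across 5 topics.
# All values are high, as described in the question, and they do
# not need to sum to 1 because they are similarities, not a
# probability distribution.
scores = np.array([0.91, 0.88, 0.85, 0.95, 0.87])

def softmax(x, temperature=1.0):
    """Turn raw similarity scores into a distribution that sums to 1.
    A temperature below 1 sharpens the contrast between topics."""
    z = (x - x.max()) / temperature  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(scores, temperature=0.05)
print(probs)           # now sums to 1, with the top topic dominating
print(probs.argmax())  # index 3, the highest-scoring topic
```

With a temperature around 1 the resulting distribution stays fairly flat; lowering it exaggerates the small gaps between similarity scores, which is useful when, as here, all raw scores cluster near the top of the range.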
-
I had one more question. I am running BERTopic on 260,000 documents with a large, high-dimensional embedding model, so I ended up using the GPU-accelerated versions of UMAP and HDBSCAN, which are phenomenally fast. However, even though I set random_state in UMAP, I still get different clusters on every run with the same parameter configuration, whereas the regular (CPU) UMAP gives the same clusters every run. Why do I get different clusters with cuML UMAP? Is there a workaround so that I can reproduce those results as well?