-
Hi, when I fit my model on training data and look at the probs for a particular document, they mostly add up to 1 (the probability is highest for the predicted topic). But when I transform unseen data with the same model, the probs array has a value above 0.8 for nearly every topic per description, and I have 347 topics. How can the predicted probability values be so high? Here is my sample code where I am transforming unseen data,
this is how the predicted probs look:
If you look at each document or description, the values are so high that it looks like every description is related to all the topics, but it is not. Why is that? Am I doing something wrong? The output topics are fine (it guesses somewhat correctly), but I am more concerned about the probs because I want to use them for further analysis, such as computing the 25th percentile. PS: Super library, really helpful, awesome work! Best,
-
You are not doing anything wrong; that is just the nature of the underlying embedding model, which tends to produce relatively high similarity scores. Its distribution of similarity scores can be skewed toward higher values, so applying something like a softmax would help you get a more fine-grained perspective.
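To illustrate the softmax idea, here is a minimal sketch with NumPy. The `scores` array is hypothetical (standing in for one row of the probs array, which would have 347 columns in your case), and the temperature value is just an assumption to show how sharpening works:

```python
import numpy as np

# Hypothetical similarity scores for one document across 5 topics.
# All values are high, as described in the question, and they do
# not need to sum to 1 because they are similarities, not a
# probability distribution.
scores = np.array([0.91, 0.88, 0.85, 0.95, 0.87])

def softmax(x, temperature=1.0):
    """Turn raw similarity scores into a distribution that sums to 1.
    A temperature below 1 sharpens the contrast between topics."""
    z = (x - x.max()) / temperature  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(scores, temperature=0.05)
print(probs)           # now sums to 1, with the top topic dominating
print(probs.argmax())  # index 3, the highest-scoring topic
```

With a temperature around 1 the resulting distribution stays fairly flat; lowering it exaggerates the small gaps between similarity scores, which is useful when, as here, all raw scores cluster near the top of the range.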
-
I had one more question. I am running BERTopic on 260,000 documents with a large, high-dimensional embedding model, so I ended up using the GPU-accelerated versions of UMAP and HDBSCAN, which are phenomenally fast. However, even though I set random_state in UMAP, I still get different clusters on every run with the same parameter configuration, whereas the regular (CPU) UMAP gives the same clusters every run. Why do I get different clusters with cuML UMAP? Is there a workaround so that I can reproduce those results as well?