-
I get quite a good clustering just using https://huggingface.co/WhereIsAI/UAE-Large-V1, reduced by PCA to 128 and then clustered with HDBSCAN, but when I apply the same embeddings and clustering algorithm in BERTopic I get a mess. It seems like c-TF-IDF with the default counter messes things up. Is that possible? Does it still make sense to use c-TF-IDF when plain embeddings already give good results? I am using this for news clustering and was hoping to get better results by maybe adding entity recognition on headlines, but it seems like plain embeddings are better, unless I am doing something wrong (which is why I'm posting this discussion). I used it like this:

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("WhereIsAI/UAE-Large-V1")
topic_model = BERTopic(
    n_gram_range=(1, 2),
    embedding_model=embedder,
    hdbscan_model=HDBSCAN(
        min_cluster_size=2,
        metric="euclidean",
        max_cluster_size=100,
    ),
    umap_model=PCA(n_components=64),  # PCA passed in place of the default UMAP
    calculate_probabilities=False,
    verbose=True,
)
```
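For reference, the standalone pipeline described above would look roughly like this (a sketch; `docs` stands in for the list of news texts, which is not shown in the post):

```python
# Minimal sketch of the standalone pipeline described above.
# `docs` (the list of news texts) is assumed; it is not shown in the post.
from hdbscan import HDBSCAN
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("WhereIsAI/UAE-Large-V1")
embeddings = embedder.encode(docs, show_progress_bar=True)

# "reduced by PCA to 128 then clustered by HDBSCAN"
reduced = PCA(n_components=128).fit_transform(embeddings)
labels = HDBSCAN(min_cluster_size=2, metric="euclidean",
                 max_cluster_size=100).fit_predict(reduced)
```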
-
That shouldn't be possible since c-TF-IDF does not affect the clustering process at all. You should be getting similar results if you are using the same underlying algorithms. Could you share your code? Also, which version of BERTopic are you using?
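One way to sanity-check this (a sketch, reusing `docs`, `labels`, and `topic_model` from the snippets above; note the PCA dimensionality must match on both sides for a fair comparison):

```python
# Sketch: compare the standalone HDBSCAN labels with BERTopic's topics.
from sklearn.metrics import adjusted_rand_score

topics, _ = topic_model.fit_transform(docs)

# ARI is invariant to label renaming, so BERTopic's topic-ID reordering
# is irrelevant; a score near 1.0 means the two clusterings agree.
print(adjusted_rand_score(labels, topics))
```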
-
Thanks @MaartenGr for confirming my hunch that I had messed something up. I did the removal of -1 twice, and that broke my code :/ (a sketch of the pitfall is below). As an apology, here is a Colab proving that it indeed works fine: BERTTopic. Though it seems to still adjust the clusters after the initial clustering, right? If I can ask a couple more practical questions:

Sorry for the wall of text, and thanks for any further assistance
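For anyone hitting the same issue, a hypothetical reconstruction of the pitfall: documents and topics must be filtered for the -1 outlier label together, exactly once, or they fall out of alignment.

```python
# Hypothetical reconstruction of the pitfall described above.
# HDBSCAN/BERTopic assign -1 to outlier documents; docs and topics
# must be filtered together, exactly once, to stay aligned.
topics, _ = topic_model.fit_transform(docs)

kept = [(doc, topic) for doc, topic in zip(docs, topics) if topic != -1]
clean_docs, clean_topics = map(list, zip(*kept))
```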
No problem! Glad you found the issue.
It doesn't change the clusters themselves but merely their IDs, to make sure that topic 0 is a larger topic than topic 1, etc.
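You can see that ordering directly (a quick usage sketch):

```python
# Topics come back numbered by size: -1 (outliers, if any) listed first,
# then topic 0 as the largest proper topic, topic 1 the next largest, etc.
print(topic_model.get_topic_info().head())
```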
The default CountVectorizer from sklearn, which you can indeed change.
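For example (a sketch; the exact CountVectorizer settings below are illustrative, not from the thread):

```python
# Sketch: passing a custom CountVectorizer via BERTopic's
# `vectorizer_model` parameter. The settings here are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
topic_model = BERTopic(vectorizer_model=vectorizer)
```

You can also swap the vectorizer after fitting with `topic_model.update_topics(docs, vectorizer_model=vectorizer)`, which recomputes the topic representations without re-clustering.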
BERTopic doesn't use TF-IDF but a variant, c-TF-IDF…
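For reference, the class-based weighting as given in the BERTopic paper (Grootendorst, 2022) is:

$$W_{x,c} = \mathrm{tf}_{x,c} \cdot \log\!\left(1 + \frac{A}{f_x}\right)$$

where $\mathrm{tf}_{x,c}$ is the frequency of word $x$ in class (topic) $c$, $f_x$ is the frequency of word $x$ across all classes, and $A$ is the average number of words per class.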