As shown in the visualization, it calculates the similarity between a certain token set and the topic representation. For example, this can be the cosine similarity between the c-TF-IDF representations of a token set and the topic representation. These similarities are then aggregated and normalized (see BERTopic/bertopic/_bertopic.py, line 1382 in bf1fedd). I would advise checking out the source code, as each step is documented there (see BERTopic/bertopic/_bertopic.py, line 1182 in bf1fedd).
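To make the aggregate-and-normalize step concrete, here is a minimal sketch of that idea using plain NumPy. The function name, the toy vectors, and the clipping step are illustrative assumptions, not BERTopic's actual implementation:

```python
import numpy as np

def approximate_topic_distribution(window_vectors, topic_vectors):
    """Illustrative sketch: cosine similarity of each token-window
    vector against every topic vector, summed over windows and
    normalized into a probability distribution."""
    W = np.asarray(window_vectors, dtype=float)   # (n_windows, vocab_size)
    T = np.asarray(topic_vectors, dtype=float)    # (n_topics, vocab_size)

    # Cosine similarity matrix of shape (n_windows, n_topics)
    sims = (W @ T.T) / (
        np.linalg.norm(W, axis=1, keepdims=True)
        * np.linalg.norm(T, axis=1, keepdims=True).T
    )

    agg = sims.sum(axis=0)       # aggregate similarities over windows
    agg = np.clip(agg, 0, None)  # keep only non-negative mass
    return agg / agg.sum()       # normalize so the row sums to 1

# Toy example: two token windows, two topics, a 3-word vocabulary
windows = [[1, 0, 1], [0, 1, 1]]
topics = [[1, 0, 0], [0, 1, 1]]
dist = approximate_topic_distribution(windows, topics)
```

Here `dist` sums to 1 and assigns more mass to the second topic, since both windows overlap with it.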
I believe that might be a result of the code at BERTopic/bertopic/_bertopic.py, line 1288 in bf1fedd. In your use case, replacing it with …
Hello!
I'd like to understand the result we get for the topic-word distributions from topic_model.approximate_distribution(docs, calculate_tokens=True).
Referring to https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html:
How do we calculate topic-word distributions? In simpler terms, do we just calculate the similarity between the topic representation and the token representation and then apply some function to turn it into a probability?
I understand that in the default pipeline, the document is tokenized by the tokenizer defined in CountVectorizer before the approximated probabilities above are calculated. I found that even if we set hyperparameters like stop_words="english", min_df=3, max_df=0.9, we still get some repeated tokens with different probabilities. Moreover, we can get two tokens that differ only in capitalization or verb form (such as the _s or _ed suffixes). How should I fix this?
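One common way to collapse such variants is to give CountVectorizer a tokenizer callable that lowercases and normalizes word forms before counting. The sketch below uses crude suffix rules purely for illustration (a real setup would use a proper lemmatizer such as NLTK's WordNetLemmatizer or spaCy), and the name normalizing_tokenizer is hypothetical:

```python
import re

def normalizing_tokenizer(text):
    """Illustrative tokenizer: lowercase the text and strip a couple of
    common English suffixes so 'Walked', 'walks', and 'walk' all collapse
    to the single token 'walk'. Replace the suffix rules with a real
    lemmatizer for production use."""
    tokens = re.findall(r"[a-z]+", text.lower())
    normalized = []
    for tok in tokens:
        for suffix in ("ed", "s"):
            # Only strip when a reasonably long stem remains
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        normalized.append(tok)
    return normalized

# It could then be plugged into the vectorizer passed to BERTopic, e.g.:
# from sklearn.feature_extraction.text import CountVectorizer
# vectorizer_model = CountVectorizer(tokenizer=normalizing_tokenizer,
#                                    stop_words="english",
#                                    min_df=3, max_df=0.9)
```

With such a tokenizer, capitalization and simple inflection variants map to one vocabulary entry, so they no longer show up as separate tokens with separate probabilities.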
I really appreciate your work. Thanks in advance!