As shown in the visualization, it calculates the similarity between a certain token set and the topic representation. For example, this can be the cosine similarity between the c-TF-IDF representations of a token set and the topic representation. These similarities are then aggregated and normalized (see BERTopic/bertopic/_bertopic.py, line 1382 in bf1fedd). I would advise checking out the source code, as each step is documented there (see BERTopic/bertopic/_bertopic.py, line 1182 in bf1fedd).
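To make the aggregate-and-normalize step concrete, here is a minimal sketch of that idea using plain NumPy. The function name, the toy vectors, and the clipping step are illustrative assumptions, not BERTopic's actual implementation:

```python
import numpy as np

def approximate_topic_distribution(window_vectors, topic_vectors):
    """Illustrative sketch: cosine similarity of each token-window
    vector against every topic vector, summed over windows and
    normalized into a probability distribution."""
    W = np.asarray(window_vectors, dtype=float)   # (n_windows, vocab_size)
    T = np.asarray(topic_vectors, dtype=float)    # (n_topics, vocab_size)

    # Cosine similarity matrix of shape (n_windows, n_topics)
    sims = (W @ T.T) / (
        np.linalg.norm(W, axis=1, keepdims=True)
        * np.linalg.norm(T, axis=1, keepdims=True).T
    )

    agg = sims.sum(axis=0)       # aggregate similarities over windows
    agg = np.clip(agg, 0, None)  # keep only non-negative mass
    return agg / agg.sum()       # normalize so the row sums to 1

# Toy example: two token windows, two topics, a 3-word vocabulary
windows = [[1, 0, 1], [0, 1, 1]]
topics = [[1, 0, 0], [0, 1, 1]]
dist = approximate_topic_distribution(windows, topics)
```

Here `dist` sums to 1 and assigns more mass to the second topic, since both windows overlap with it.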
I believe that might be a result of the code at BERTopic/bertopic/_bertopic.py, line 1288 in bf1fedd. In your use case, replacing it with …
Hello!
I'd like to understand the result we get for the topic-word distributions from topic_model.approximate_distribution(docs, calculate_tokens=True).
Referring to https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html:
How do we calculate topic-word distributions? In simpler terms, do we just calculate the similarity between the topic representation and the token representation and then apply some function to turn it into a probability?
I understand that in the default pipeline, the document is tokenized by the tokenizer defined in CountVectorizer before the approximated probabilities above are calculated. I found that even if we set hyperparameters like stop_words="english", min_df=3, max_df=0.9, we still get some repeated tokens with different probabilities. Moreover, we can get two tokens that differ only in capitalization or verb form (such as the _s or _ed suffixes). How should I fix this?
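One common way to collapse such variants is to give CountVectorizer a tokenizer callable that lowercases and normalizes word forms before counting. The sketch below uses crude suffix rules purely for illustration (a real setup would use a proper lemmatizer such as NLTK's WordNetLemmatizer or spaCy), and the name normalizing_tokenizer is hypothetical:

```python
import re

def normalizing_tokenizer(text):
    """Illustrative tokenizer: lowercase the text and strip a couple of
    common English suffixes so 'Walked', 'walks', and 'walk' all collapse
    to the single token 'walk'. Replace the suffix rules with a real
    lemmatizer for production use."""
    tokens = re.findall(r"[a-z]+", text.lower())
    normalized = []
    for tok in tokens:
        for suffix in ("ed", "s"):
            # Only strip when a reasonably long stem remains
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        normalized.append(tok)
    return normalized

# It could then be plugged into the vectorizer passed to BERTopic, e.g.:
# from sklearn.feature_extraction.text import CountVectorizer
# vectorizer_model = CountVectorizer(tokenizer=normalizing_tokenizer,
#                                    stop_words="english",
#                                    min_df=3, max_df=0.9)
```

With such a tokenizer, capitalization and simple inflection variants map to one vocabulary entry, so they no longer show up as separate tokens with separate probabilities.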
I really appreciate your work. Thanks in advance!