-
Found out there is
results:
But we still have a -1 cluster. So I decided to reproduce the problem using pure HDBSCAN and got perfect results (see below). So my question is still relevant:
Result:
Also I noticed that you can set
-
Thank you for the link to the dummy dim_model. I've tried it and got this error. Finally I switched to PCA and found out it works well with n_components less than 60 and returns the same error for n_components >= 61 (this threshold depends on the number of examples in the dataset). I'll try to go forward with PCA, thank you.
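For what it's worth, one likely reason the threshold moves with dataset size: scikit-learn's `PCA` requires `n_components <= min(n_samples, n_features)`. A minimal sketch with hypothetical embeddings (60 documents, 384 dimensions, as a sentence-transformers model might produce):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical setup: 60 documents embedded into 384-dim vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 384))

# n_components may not exceed min(n_samples, n_features) = 60 here,
# so 60 fits but 61 raises a ValueError.
PCA(n_components=60).fit(embeddings)
try:
    PCA(n_components=61).fit(embeddings)
except ValueError as e:
    print("PCA error:", e)
```

If the error you saw matches this one, it is a hard constraint of PCA rather than a BERTopic bug, and the limit will indeed shift with the number of documents.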
-
I have two problems; I'm not sure if they are bugs or my lacking skills.
I searched for similar discussions, found them, and learned a lot, but not enough to solve the problem.
There were as many as a dozen topics for the same string.
The code:
Check params as topic_model.get_params() output:
Several consecutive results are below. Please notice how many rows are marked as -1. Also notice that the last run has very good results, so it is not a hyperparameter issue:
I have also tried to train BERTopic on a dataset with different phrases and then applied it to a dataset of the same single phrase. Same result.
My questions:
a) Why does the clustering algorithm perform so poorly on identical phrases? Is it a clustering algorithm issue, or maybe other parts of BERTopic?
b) How can I choose hyperparameters if I can't reproduce results?
P.s.
a)
bertopic.__version__ # '0.16.4'
b) distance_function = lambda x: 1 - np.clip(cosine_similarity(x), -1, 1) was added to fix a known issue for some particular values of n_components. It does not change the experiment results.