-
I get quite a good clustering just using https://huggingface.co/WhereIsAI/UAE-Large-V1, reduced by PCA to 128 and then clustered with HDBSCAN, but when I apply the same embeddings and clustering algorithm in BERTopic I get a mess. It seems like c-TF-IDF with the default counter messes things up. Is that possible? Does it still make sense to use c-TF-IDF when plain embeddings already give good results? I am using this for news clustering and was hoping to get better results by maybe adding entity recognition on headlines, but it seems like plain embeddings are better, unless I am doing something wrong (which is why I'm posting this discussion). I used it like this:

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("WhereIsAI/UAE-Large-V1")
topic_model = BERTopic(
    n_gram_range=(1, 2),
    embedding_model=embedder,
    hdbscan_model=HDBSCAN(
        min_cluster_size=2,
        metric="euclidean",
        max_cluster_size=100,
    ),
    umap_model=PCA(n_components=64),  # PCA passed in place of the default UMAP
    calculate_probabilities=False,
    verbose=True,
)
```
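For reference, the standalone pipeline described above would look roughly like this (a sketch; `docs` stands in for the list of news texts, which is not shown in the post):

```python
# Minimal sketch of the standalone pipeline described above.
# `docs` (the list of news texts) is assumed; it is not shown in the post.
from hdbscan import HDBSCAN
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("WhereIsAI/UAE-Large-V1")
embeddings = embedder.encode(docs, show_progress_bar=True)

# "reduced by PCA to 128 then clustered by HDBSCAN"
reduced = PCA(n_components=128).fit_transform(embeddings)
labels = HDBSCAN(min_cluster_size=2, metric="euclidean",
                 max_cluster_size=100).fit_predict(reduced)
```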
-
That shouldn't be possible since c-TF-IDF does not affect the clustering process at all. You should be getting similar results if you are using the same underlying algorithms. Could you share your code? Also, which version of BERTopic are you using?
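One way to sanity-check this (a sketch, reusing `docs`, `labels`, and `topic_model` from the snippets above; note the PCA dimensionality must match on both sides for a fair comparison):

```python
# Sketch: compare the standalone HDBSCAN labels with BERTopic's topics.
from sklearn.metrics import adjusted_rand_score

topics, _ = topic_model.fit_transform(docs)

# ARI is invariant to label renaming, so BERTopic's topic-ID reordering
# is irrelevant; a score near 1.0 means the two clusterings agree.
print(adjusted_rand_score(labels, topics))
```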
-
Thanks @MaartenGr for confirming my hunch that I had messed something up. I did the removal of -1 twice, and that broke my code :/ (a sketch of the pitfall is below). As an apology, here is a Colab proving that it indeed works fine: BERTTopic. Though it seems to still adjust the clusters after the initial clustering, right? If I can ask a couple more practical questions:

Sorry for the wall of text, and thanks for any further assistance
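For anyone hitting the same issue, a hypothetical reconstruction of the pitfall: documents and topics must be filtered for the -1 outlier label together, exactly once, or they fall out of alignment.

```python
# Hypothetical reconstruction of the pitfall described above.
# HDBSCAN/BERTopic assign -1 to outlier documents; docs and topics
# must be filtered together, exactly once, to stay aligned.
topics, _ = topic_model.fit_transform(docs)

kept = [(doc, topic) for doc, topic in zip(docs, topics) if topic != -1]
clean_docs, clean_topics = map(list, zip(*kept))
```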
No problem! Glad you found the issue.
It doesn't change the clusters themselves but merely their IDs, to make sure that topic 0 is a larger topic than topic 1, etc.
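You can see that ordering directly (a quick usage sketch):

```python
# Topics come back numbered by size: -1 (outliers, if any) listed first,
# then topic 0 as the largest proper topic, topic 1 the next largest, etc.
print(topic_model.get_topic_info().head())
```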
The default CountVectorizer from sklearn, which you can indeed change.
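For example (a sketch; the exact CountVectorizer settings below are illustrative, not from the thread):

```python
# Sketch: passing a custom CountVectorizer via BERTopic's
# `vectorizer_model` parameter. The settings here are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=2)
topic_model = BERTopic(vectorizer_model=vectorizer)
```

You can also swap the vectorizer after fitting with `topic_model.update_topics(docs, vectorizer_model=vectorizer)`, which recomputes the topic representations without re-clustering.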
BERTopic doesn't use TF-IDF but a variant, c-TF-IDF…
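For reference, the class-based weighting as given in the BERTopic paper (Grootendorst, 2022) is:

$$W_{x,c} = \mathrm{tf}_{x,c} \cdot \log\!\left(1 + \frac{A}{f_x}\right)$$

where $\mathrm{tf}_{x,c}$ is the frequency of word $x$ in class (topic) $c$, $f_x$ is the frequency of word $x$ across all classes, and $A$ is the average number of words per class.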