Commit

v0.12 (#668)
* Online/incremental topic modeling with .partial_fit
* Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
* Expose attributes for easier access to internal data
* Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
* Added an example of combining BERTopic with KeyBERT
* Added many tests with the intention of making development a bit more stable
* Fix #632, #648, #673, #682, #667, #664
MaartenGr authored Sep 11, 2022
1 parent 62a3ecb commit 09c1732
Showing 92 changed files with 2,873 additions and 1,170 deletions.
2 changes: 1 addition & 1 deletion .gitattributes
Original file line number Diff line number Diff line change
@@ -1 +1 @@
*.ipynb linguist-documentation
*.ipynb linguist-documentation
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2020, Maarten P. Grootendorst
Copyright (c) 2022, Maarten P. Grootendorst

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
2 changes: 1 addition & 1 deletion Makefile
@@ -6,7 +6,7 @@ install:

install-test:
python -m pip install -e ".[test]"
python -m pip install -e ".[all]"
python -m pip install -e "."

pypi:
python setup.py sdist
92 changes: 37 additions & 55 deletions README.md
@@ -17,7 +17,8 @@ BERTopic supports
[**guided**](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html),
(semi-) [**supervised**](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html),
[**hierarchical**](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html),
and [**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html) topic modeling. It even supports visualizations similar to LDAvis!
[**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html), and
[**online**](https://maartengr.github.io/BERTopic/getting_started/online/online.html) topic modeling. It even supports visualizations similar to LDAvis!

Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99)
and [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794).
@@ -42,7 +43,7 @@ pip install bertopic[use]

## Getting Started
For an in-depth overview of the features of BERTopic
you can check the full documentation [here](https://maartengr.github.io/BERTopic/) or you can follow along
you can check the [**full documentation**](https://maartengr.github.io/BERTopic/) or you can follow along
with one of the examples below:

| Name | Link |
@@ -130,6 +131,7 @@ Find all possible visualizations with interactive examples in the documentation
## Embedding Models
BERTopic supports many embedding models that can be used to embed the documents and words:
* Sentence-Transformers
* 🤗 Transformers
* Flair
* Spacy
* Gensim
@@ -143,65 +145,24 @@ meant for semantic similarity. Simply select any from their documentation
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
```

[**Flair**](https://github.com/flairNLP/flair) allows you to choose almost any 🤗 transformers model. Simply
select any from [here](https://huggingface.co/models) and pass it to BERTopic:
Similarly, you can choose any [**🤗 Transformers**](https://huggingface.co/models) model and pass it to BERTopic:

```python
from flair.embeddings import TransformerDocumentEmbeddings
from transformers.pipelines import pipeline

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)
embedding_model = pipeline("feature-extraction", model="distilbert-base-cased")
topic_model = BERTopic(embedding_model=embedding_model)
```

Click [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html)
for a full overview of all supported embedding models.

## Dynamic Topic Modeling
Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics
over time. These methods allow you to understand how a topic is represented over time.
Here, we will be using all of Donald Trump's tweet to see how he talked over certain topics over time:

```python
import re
import pandas as pd

trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()
```

Then, we need to extract the global topic representations by simply creating and training a BERTopic model:

```python
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(tweets)
```

From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this
by simply calling `topics_over_time` and pass in his tweets, the corresponding timestamps, and the related topics:

```python
topics_over_time = topic_model.topics_over_time(tweets, topics, timestamps, nr_bins=20)
```

Finally, we can visualize the topics by simply calling `visualize_topics_over_time()`:

```python
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=6)
```

<img src="images/dtm.gif" width="80%" height="80%" align="center" />

## Overview
BERTopic has quite a number of functions that quickly can become overwhelming. To alleviate this issue, you will find an overview
of all methods and a short description of its purpose.

### Common
For quick access to common functions, here is an overview of BERTopic's main methods:
Below, you will find an overview of common functions in BERTopic.

| Method | Code |
|-----------------------|---|
@@ -213,26 +174,46 @@ For quick access to common functions, here is an overview of BERTopic's main met
| Get topic freq | `.get_topic_freq()` |
| Get all topic information| `.get_topic_info()` |
| Get representative docs per topic | `.get_representative_docs()` |
| Update topic representation | `.update_topics(docs, topics, n_gram_range=(1, 3))` |
| Update topic representation | `.update_topics(docs, n_gram_range=(1, 3))` |
| Generate topic labels | `.generate_topic_labels()` |
| Set topic labels | `.set_topic_labels(my_custom_labels)` |
| Merge topics | `.merge_topics(docs, topics, topics_to_merge)` |
| Reduce nr of topics | `.reduce_topics(docs, topics, nr_topics=30)` |
| Merge topics | `.merge_topics(docs, topics_to_merge)` |
| Reduce nr of topics | `.reduce_topics(docs, nr_topics=30)` |
| Find topics | `.find_topics("vehicle")` |
| Save model | `.save("my_model")` |
| Load model | `BERTopic.load("my_model")` |
| Get parameters | `.get_params()` |
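
The updated rows above drop the `topics` argument from `.update_topics`, `.merge_topics`, and `.reduce_topics`. A plausible reading, given that this commit also exposes a `topics_` attribute, is that the fitted model now keeps the document-topic assignments itself. The sketch below illustrates that pattern with a toy class; `ToyTopicModel` and its `topic_sizes` method are hypothetical stand-ins and are not part of BERTopic.

```python
class ToyTopicModel:
    """Hypothetical stand-in, not BERTopic: assigns a topic from the first letter."""

    def fit(self, docs):
        # The trailing underscore marks state that only exists after fitting.
        self.topics_ = [ord(doc[0].lower()) % 2 for doc in docs]
        return self

    def topic_sizes(self):
        # No `topics` argument is needed: the assignments live on the model.
        sizes = {}
        for topic in self.topics_:
            sizes[topic] = sizes.get(topic, 0) + 1
        return sizes


model = ToyTopicModel().fit(["apple pie", "banana bread", "cherry tart"])
sizes = model.topic_sizes()
```

Because the assignments are stored at fit time, later calls only need the documents (or nothing at all), which matches the shorter signatures in the table.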


### Attributes
After having trained your BERTopic model, a number of attributes are saved within your model. These attributes, in part,
refer to how model information is stored on an estimator during fitting. The attributes that you see below all end in `_` and are
public attributes that can be used to access model information.

| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_ | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_ | The size of each topic |
| topic_mapper_ | A class for tracking topics and their mappings anytime they are merged/reduced. |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values. |
| c_tf_idf_ | The topic-term matrix as calculated through c-TF-IDF. |
| topic_labels_ | The default labels for each topic. |
| custom_labels_ | Custom labels for each topic as generated through `.set_topic_labels`. |
| topic_embeddings_ | The embeddings for each topic if `embedding_model` was used. |
| representative_docs_ | The representative documents for each topic if HDBSCAN is used. |
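
To make the `c_tf_idf_` row more concrete, here is a minimal pure-Python sketch of class-based TF-IDF. It assumes the weighting described in the BERTopic paper, roughly `tf(term, topic) * log(1 + A / f(term))`, where `A` is the average number of words per topic and `f(term)` is the term's frequency across all topics. The function name `c_tf_idf`, the whitespace tokenizer, and the dictionary output are illustrative assumptions; the library's `ClassTfidfTransformer` works on sparse matrices and may differ in normalization details.

```python
import math
from collections import Counter


def c_tf_idf(docs, topics):
    """Return {topic: {term: weight}} using a class-based TF-IDF weighting."""
    # Merge all documents of a topic into one bag of words per topic.
    term_counts = {}
    for doc, topic in zip(docs, topics):
        term_counts.setdefault(topic, Counter()).update(doc.lower().split())

    # f(term): frequency of each term across all topics.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)

    # A: average number of words per topic.
    avg_words = sum(total_counts.values()) / len(term_counts)

    weights = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        weights[topic] = {
            # Normalized term frequency within the topic times the IDF-like factor.
            term: (count / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, count in counts.items()
        }
    return weights


docs = ["cats purr and cats nap", "dogs bark and dogs fetch", "cats chase mice"]
topics = [0, 1, 0]
weights = c_tf_idf(docs, topics)
top_terms = {topic: max(w, key=w.get) for topic, w in weights.items()}
```

In this toy corpus, a word shared by both topics such as "and" receives a lower weight than a word that occurs only once in a single topic, which is the behaviour that lets the top terms describe each topic.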


### Variations
There are many different use cases in which topic modeling can be used. As such, a number of
variations of BERTopic have been developed such that one package can be used across across many use cases:
variations of BERTopic have been developed such that one package can be used across across many use cases.

| Method | Code |
|-----------------------|---|
| (semi-) Supervised Topic Modeling | `.fit(docs, y=y)` |
| Topic Modeling per Class | `.topics_per_class(docs, topics, classes)` |
| Dynamic Topic Modeling | `.topics_over_time(docs, topics, timestamps)` |
| Hierarchical Topic Modeling | `.hierarchical_topics(docs, topics)` |
| Topic Modeling per Class | `.topics_per_class(docs, classes)` |
| Dynamic Topic Modeling | `.topics_over_time(docs, timestamps)` |
| Hierarchical Topic Modeling | `.hierarchical_topics(docs)` |
| Guided Topic Modeling | `BERTopic(seed_topic_list=seed_topic_list)` |

### Visualizations
@@ -254,6 +235,7 @@ to tweak the model to your liking.
| Visualize Topics over Time | `.visualize_topics_over_time(topics_over_time)` |
| Visualize Topics per Class | `.visualize_topics_per_class(topics_per_class)` |


## Citation
To cite the [BERTopic paper](https://arxiv.org/abs/2203.05794), please use the following bibtex reference:

2 changes: 1 addition & 1 deletion bertopic/__init__.py
@@ -1,6 +1,6 @@
from bertopic._bertopic import BERTopic

__version__ = "0.11.0"
__version__ = "0.12.0"

__all__ = [
"BERTopic",