v0.13 (#840)
* Calculate topic distributions with .approximate_distribution regardless of the cluster model used
* Fully supervised topic modeling with BERTopic
* Manual topic modeling with BERTopic
* Reduce outliers with 4 different strategies using .reduce_outliers
* Install BERTopic without SentenceTransformers for a lightweight package
* Get metadata of trained documents such as topics and probabilities using .get_document_info(docs)
* Added more support for cuML's HDBSCAN
* Added more images to the documentation, along with many changes/updates/clarifications
* Get representative documents for non-HDBSCAN models by comparing document and topic c-TF-IDF representations
* Sklearn Pipeline Embedder
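
A brief sketch of how a few of the features listed above might fit together; the 20 newsgroups dataset and the chosen outlier strategy are purely illustrative:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Topic distributions per document, regardless of the cluster model used
topic_distr, _ = topic_model.approximate_distribution(docs)

# Reassign outlier documents (topic -1) with one of the four strategies, e.g. c-TF-IDF
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")

# Document-level metadata such as assigned topic and probability
document_info = topic_model.get_document_info(docs)
```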
MaartenGr authored Jan 4, 2023
1 parent 3edfdb4 commit 06dcd47
Showing 88 changed files with 4,249 additions and 714 deletions.
2 changes: 1 addition & 1 deletion .flake8
@@ -1,2 +1,2 @@
[flake8]
[flake8]
max-line-length = 160
2 changes: 1 addition & 1 deletion .github/workflows/testing.yml
@@ -25,7 +25,7 @@ jobs:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install --upgrade pip
pip install -e ".[test]"
- name: Run Checking Mechanisms
run: make check
1 change: 1 addition & 0 deletions .gitignore
@@ -73,6 +73,7 @@ ENV/
env.bak/
venv.bak/

# Artifacts
.idea
.idea/
.vscode
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2022, Maarten P. Grootendorst
Copyright (c) 2023, Maarten P. Grootendorst

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
119 changes: 55 additions & 64 deletions README.md
@@ -15,13 +15,16 @@ allowing for easily interpretable topics whilst keeping important words in the t

BERTopic supports
[**guided**](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html),
(semi-) [**supervised**](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html),
[**supervised**](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html),
[**semi-supervised**](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html),
[**manual**](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html),
[**long-document**](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html),
[**hierarchical**](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html),
[**class-based**](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html),
[**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html), and
[**online**](https://maartengr.github.io/BERTopic/getting_started/online/online.html) topic modeling. It even supports visualizations similar to LDAvis!

Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99)
and [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794).
Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99), [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4) and [here](https://towardsdatascience.com/using-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf?sk=b1e0fd46f70cb15e8422b4794a81161d). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794) or see a [brief overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).

## Installation

@@ -31,8 +34,7 @@ Installation, with sentence-transformers, can be done using [pypi](https://pypi.
pip install bertopic
```

You may want to install more depending on the transformers and language backends that you will be using.
The possible installations are:
If you want to install BERTopic with other embedding models, you can choose one of the following:

```bash
pip install bertopic[flair]
@@ -82,8 +84,8 @@ Topic Count Name
3 381 22_key_encryption_keys_encrypted
```

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most
frequent topic that was generated, topic 0:
The `-1` topic refers to all outlier documents and is typically ignored. Next, let's take a look at the most
frequent topic that was generated:

```python
>>> topic_model.get_topic(0)
@@ -100,7 +102,22 @@ frequent topic that was generated, topic 0:
('pc', 0.003047105930670237)]
```

**NOTE**: Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
Using `.get_document_info`, we can also extract document-level information, such as each document's topic, probability, whether it is a representative document for its topic, and more:

```python
>>> topic_model.get_document_info(docs)

Document Topic Name Top_n_words Probability ...
I am sure some bashers of Pens... 0 0_game_team_games_season game - team - games... 0.200010 ...
My brother is in the market for... -1 -1_can_your_will_any can - your - will... 0.420668 ...
Finally you said what you dream... -1 -1_can_your_will_any can - your - will... 0.807259 ...
Think! It's the SCSI card doing... 49 49_windows_drive_dos_file windows - drive - docs... 0.071746 ...
1) I have an old Jasmine drive... 49 49_windows_drive_dos_file windows - drive - docs... 0.038983 ...
```

> **Note**
>
> Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.

## Visualize Topics
After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good
@@ -114,51 +131,19 @@ topic_model.visualize_topics()

<img src="images/topic_visualization.gif" width="60%" height="60%" align="center" />

We can create an overview of the most frequent topics in a way that they are easily interpretable.
Horizontal barcharts typically convey information rather well and allow for an intuitive representation
of the topics:

```python
topic_model.visualize_barchart()
```

<img src="images/topics.png" width="70%" height="70%" align="center" />


Find all possible visualizations with interactive examples in the documentation
[here](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html).

## Embedding Models
BERTopic supports many embedding models that can be used to embed the documents and words:
* Sentence-Transformers
* 🤗 Transformers
* Flair
* Spacy
* Gensim
* USE

[**Sentence-Transformers**](https://github.com/UKPLab/sentence-transformers) is typically used as it has shown great results embedding documents
meant for semantic similarity. Simply select any from their documentation
[here](https://www.sbert.net/docs/pretrained_models.html) and pass it to BERTopic:
## Modularity
By default, the main steps for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, BERTopic assumes some independence between these steps, which makes it quite modular. In other words, BERTopic not only allows you to build your own topic model but also to explore several topic modeling techniques on top of your customized topic model:

```python
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
```
https://user-images.githubusercontent.com/25746895/205490350-cd9833e7-9cd5-44fa-8752-407d748de633.mp4

Similarly, you can choose any [**🤗 Transformers**](https://huggingface.co/models) model and pass it to BERTopic:

```python
from transformers.pipelines import pipeline
You can swap out any of these models or even remove them entirely. Starting with the embedding step, you can find out how to do this [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html) and more about the underlying algorithm and assumptions [here](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).

embedding_model = pipeline("feature-extraction", model="distilbert-base-cased")
topic_model = BERTopic(embedding_model=embedding_model)
```

Click [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html)
for a full overview of all supported embedding models.
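
To make the modularity concrete, here is a hedged sketch of plugging custom sub-models into BERTopic; the specific embedding, dimensionality-reduction, clustering, and vectorizer choices below are only one possible configuration:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# Each step of the pipeline can be swapped for a model of your choosing
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=15, prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
)
```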

## Overview
BERTopic has quite a number of functions that quickly can become overwhelming. To alleviate this issue, you will find an overview
## Functionality
BERTopic has many functions that can quickly become overwhelming. To alleviate this issue, you will find an overview
of all methods and a short description of their purpose.

### Common
@@ -173,48 +158,54 @@ Below, you will find an overview of common functions in BERTopic.
| Access all topics | `.get_topics()` |
| Get topic freq | `.get_topic_freq()` |
| Get all topic information| `.get_topic_info()` |
| Get all document information| `.get_document_info(docs)` |
| Get representative docs per topic | `.get_representative_docs()` |
| Update topic representation | `.update_topics(docs, n_gram_range=(1, 3))` |
| Generate topic labels | `.generate_topic_labels()` |
| Set topic labels | `.set_topic_labels(my_custom_labels)` |
| Merge topics | `.merge_topics(docs, topics_to_merge)` |
| Reduce nr of topics | `.reduce_topics(docs, nr_topics=30)` |
| Reduce outliers | `.reduce_outliers(docs, topics)` |
| Find topics | `.find_topics("vehicle")` |
| Save model | `.save("my_model")` |
| Load model | `BERTopic.load("my_model")` |
| Get parameters | `.get_params()` |
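
For illustration, a hedged sketch combining a few of the calls above on an already fitted `topic_model`; the label settings, topic counts, and file name are arbitrary choices:

```python
# Generate short default labels and set them as custom labels
topic_labels = topic_model.generate_topic_labels(nr_words=3, topic_prefix=False, separator=", ")
topic_model.set_topic_labels(topic_labels)

# Reduce the number of topics and search for topics related to a query term
topic_model.reduce_topics(docs, nr_topics=30)
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5)

# Persist and reload the model
topic_model.save("my_model")
loaded_model = BERTopic.load("my_model")
```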


### Attributes
After having trained your BERTopic model, a number of attributes are saved within your model. These attributes, in part,
After having trained your BERTopic model, several attributes are saved within your model. These attributes, in part,
refer to how model information is stored on an estimator during fitting. The attributes that you see below all end in `_` and are
public attributes that can be used to access model information.

| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_ | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_ | The size of each topic |
| topic_mapper_ | A class for tracking topics and their mappings anytime they are merged/reduced. |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values. |
| c_tf_idf_ | The topic-term matrix as calculated through c-TF-IDF. |
| topic_labels_ | The default labels for each topic. |
| custom_labels_ | Custom labels for each topic as generated through `.set_topic_labels`. |
| topic_embeddings_ | The embeddings for each topic if `embedding_model` was used. |
| representative_docs_ | The representative documents for each topic if HDBSCAN is used. |
| `.topics_` | The topics that are generated for each document after training or updating the topic model. |
| `.probabilities_` | The probabilities that are generated for each document if HDBSCAN is used. |
| `.topic_sizes_` | The size of each topic |
| `.topic_mapper_` | A class for tracking topics and their mappings anytime they are merged/reduced. |
| `.topic_representations_` | The top *n* terms per topic and their respective c-TF-IDF values. |
| `.c_tf_idf_` | The topic-term matrix as calculated through c-TF-IDF. |
| `.topic_labels_` | The default labels for each topic. |
| `.custom_labels_` | Custom labels for each topic as generated through `.set_topic_labels`. |
| `.topic_embeddings_` | The embeddings for each topic if `embedding_model` was used. |
| `.representative_docs_` | The representative documents for each topic if HDBSCAN is used. |
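
For example, these public attributes can be inspected directly; a small sketch assuming `topic_model` was trained with HDBSCAN:

```python
print(topic_model.topics_[:10])             # topic assigned to each of the first 10 documents
print(topic_model.topic_sizes_)             # mapping of topic id to number of documents
print(topic_model.topic_labels_)            # default label per topic
print(topic_model.c_tf_idf_.shape)          # topic-term matrix (n_topics x n_terms)
print(topic_model.representative_docs_[0])  # representative documents for topic 0
```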


### Variations
There are many different use cases in which topic modeling can be used. As such, a number of
variations of BERTopic have been developed such that one package can be used across across many use cases.
There are many different use cases in which topic modeling can be used. As such, several variations of BERTopic have been developed such that one package can be used across many use cases.

| Method | Code |
|-----------------------|---|
| (semi-) Supervised Topic Modeling | `.fit(docs, y=y)` |
| Topic Modeling per Class | `.topics_per_class(docs, classes)` |
| Dynamic Topic Modeling | `.topics_over_time(docs, timestamps)` |
| Hierarchical Topic Modeling | `.hierarchical_topics(docs)` |
| Guided Topic Modeling | `BERTopic(seed_topic_list=seed_topic_list)` |
| [Topic Distribution Approximation](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html) | `.approximate_distribution(docs)` |
| [Online Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/online/online.html) | `.partial_fit(doc)` |
| [Semi-supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/semisupervised/semisupervised.html) | `.fit(docs, y=y)` |
| [Supervised Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html) | `.fit(docs, y=y)` |
| [Manual Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/manual/manual.html) | `.fit(docs, y=y)` |
| [Topic Modeling per Class](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html) | `.topics_per_class(docs, classes)` |
| [Dynamic Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html) | `.topics_over_time(docs, timestamps)` |
| [Hierarchical Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html) | `.hierarchical_topics(docs)` |
| [Guided Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html) | `BERTopic(seed_topic_list=seed_topic_list)` |
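
As an example, dynamic and (semi-)supervised topic modeling from the table above could look roughly as follows, assuming `timestamps` and `y` are lists aligned with `docs`:

```python
# Dynamic topic modeling: track how topics evolve over time on a fitted model
topics_over_time = topic_model.topics_over_time(docs, timestamps)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)

# (Semi-)supervised topic modeling: pass (partial) labels at fit time
supervised_model = BERTopic().fit(docs, y=y)
```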


### Visualizations
Evaluating topic models can be rather difficult due to the somewhat subjective nature of evaluation.
2 changes: 1 addition & 1 deletion bertopic/__init__.py
@@ -1,6 +1,6 @@
from bertopic._bertopic import BERTopic

__version__ = "0.12.0"
__version__ = "0.13.0"

__all__ = [
"BERTopic",