Support Hugging Face Tokenizers in Python bindings (#19)
benbrandt authored Jun 12, 2023
1 parent 9c5773f · commit fc3709a
Showing 11 changed files with 31,579 additions and 70 deletions.
9 changes: 6 additions & 3 deletions .github/workflows/python.yml
@@ -16,6 +16,9 @@ on:
       - "bindings/python/**"
       - ".github/workflows/python.yml"
   pull_request:
+    paths:
+      - "bindings/python/**"
+      - ".github/workflows/python.yml"
   workflow_dispatch:

 concurrency:
@@ -78,7 +81,7 @@ jobs:
         run: |
           set -e
           pip install semantic-text-splitter --find-links dist --force-reinstall
-          pip install pytest
+          pip install pytest tokenizers
           pytest

   windows:
@@ -114,7 +117,7 @@ jobs:
         run: |
           set -e
           pip install semantic-text-splitter --find-links dist --force-reinstall
-          pip install pytest
+          pip install pytest tokenizers
           pytest

   macos:
@@ -149,7 +152,7 @@ jobs:
         run: |
           set -e
           pip install semantic-text-splitter --find-links dist --force-reinstall
-          pip install pytest
+          pip install pytest tokenizers
           pytest

   sdist:
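Since the test jobs now install `tokenizers` alongside `pytest`, the CI can exercise the new tokenizer-based splitter end to end. A minimal sketch of such a test, assuming the v0.2.0 API shown in the changelog below (the test name and sample text are illustrative, not taken from this commit):

```python
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer


def test_chunks_with_huggingface_tokenizer():
    # Any tokenizer loadable from the Hugging Face Hub works here.
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
    # With trim_chunks=False, whitespace is preserved, so the chunks
    # should reassemble into the original text.
    splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)
    text = "Sed ut perspiciatis, unde omnis iste natus error sit voluptatem."
    chunks = splitter.chunks(text, 10)
    assert chunks
    assert "".join(chunks) == text
```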
2 changes: 1 addition & 1 deletion .readthedocs.yaml
@@ -13,4 +13,4 @@ build:
   commands:
     - pip install pdoc
     - cd ./bindings/python && pip install .
-    - pdoc semantic_text_splitter -o $READTHEDOCS_OUTPUT/html/
+    - pdoc semantic_text_splitter -d google -o $READTHEDOCS_OUTPUT/html/
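The new `-d google` flag sets pdoc's docstring format (`--docformat`) to Google style, so `Args:` and `Returns:` sections render as structured documentation rather than plain text. For illustration, a Google-style docstring looks like this (the signature is hypothetical, not taken from this commit):

```python
def chunks(self, text: str, max_tokens: int) -> list[str]:
    """Split the text into chunks that fit within a token budget.

    Args:
        text: The text to split.
        max_tokens: Maximum number of tokens allowed per chunk.

    Returns:
        A list of chunks, each within the requested capacity.
    """
```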
23 changes: 23 additions & 0 deletions bindings/python/CHANGELOG.md
@@ -1,5 +1,28 @@
 # Changelog

+## v0.2.0
+
+### What's New
+
+- New `HuggingFaceTextSplitter`, which lets you use Hugging Face's `tokenizers` package to size chunks by token count with a tokenizer of your choice.
+
+```python
+from semantic_text_splitter import HuggingFaceTextSplitter
+from tokenizers import Tokenizer
+
+# Maximum number of tokens in a chunk
+max_tokens = 1000
+tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
+# Optionally, trim_chunks=False tells the splitter not to trim whitespace for you
+splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)
+
+chunks = splitter.chunks("your document text", max_tokens)
+```
+
+### Breaking Changes
+
+- `trim_chunks` now defaults to `True` instead of `False`. For most use cases this is the desired behavior, especially with chunk ranges.
+
 ## v0.1.4

 Fifth time is the charm?
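If you relied on the old behavior, the whitespace-preserving mode is still available by passing `trim_chunks=False` explicitly. A minimal sketch, assuming the character-based splitter in these bindings keeps the same constructor shape as the example above:

```python
from semantic_text_splitter import CharacterTextSplitter

# trim_chunks now defaults to True; pass False explicitly to keep
# the pre-v0.2.0 behavior of leaving surrounding whitespace intact.
splitter = CharacterTextSplitter(trim_chunks=False)
chunks = splitter.chunks("  your document text  ", 1000)
```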