Support Hugging Face Tokenizers in Python bindings (#19)
benbrandt authored Jun 12, 2023
1 parent 9c5773f · commit fc3709a
Showing 11 changed files with 31,579 additions and 70 deletions.
9 changes: 6 additions & 3 deletions .github/workflows/python.yml
@@ -16,6 +16,9 @@ on:
       - "bindings/python/**"
       - ".github/workflows/python.yml"
   pull_request:
+    paths:
+      - "bindings/python/**"
+      - ".github/workflows/python.yml"
   workflow_dispatch:

 concurrency:
@@ -78,7 +81,7 @@ jobs:
         run: |
           set -e
           pip install semantic-text-splitter --find-links dist --force-reinstall
-          pip install pytest
+          pip install pytest tokenizers
           pytest

   windows:
@@ -114,7 +117,7 @@ jobs:
         run: |
           set -e
           pip install semantic-text-splitter --find-links dist --force-reinstall
-          pip install pytest
+          pip install pytest tokenizers
           pytest

   macos:
@@ -149,7 +152,7 @@ jobs:
         run: |
           set -e
           pip install semantic-text-splitter --find-links dist --force-reinstall
-          pip install pytest
+          pip install pytest tokenizers
           pytest

   sdist:
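Since the test jobs now install `tokenizers` alongside `pytest`, the CI can exercise the new tokenizer-based splitter end to end. A minimal sketch of such a test, assuming the v0.2.0 API shown in the changelog below (the test name and sample text are illustrative, not taken from this commit):

```python
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer


def test_chunks_with_huggingface_tokenizer():
    # Any tokenizer loadable from the Hugging Face Hub works here.
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
    # With trim_chunks=False, whitespace is preserved, so the chunks
    # should reassemble into the original text.
    splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)
    text = "Sed ut perspiciatis, unde omnis iste natus error sit voluptatem."
    chunks = splitter.chunks(text, 10)
    assert chunks
    assert "".join(chunks) == text
```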
2 changes: 1 addition & 1 deletion .readthedocs.yaml
@@ -13,4 +13,4 @@ build:
   commands:
     - pip install pdoc
     - cd ./bindings/python && pip install .
-    - pdoc semantic_text_splitter -o $READTHEDOCS_OUTPUT/html/
+    - pdoc semantic_text_splitter -d google -o $READTHEDOCS_OUTPUT/html/
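The new `-d google` flag sets pdoc's docstring format (`--docformat`) to Google style, so `Args:` and `Returns:` sections render as structured documentation rather than plain text. For illustration, a Google-style docstring looks like this (the signature is hypothetical, not taken from this commit):

```python
def chunks(self, text: str, max_tokens: int) -> list[str]:
    """Split the text into chunks that fit within a token budget.

    Args:
        text: The text to split.
        max_tokens: Maximum number of tokens allowed per chunk.

    Returns:
        A list of chunks, each within the requested capacity.
    """
```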
23 changes: 23 additions & 0 deletions bindings/python/CHANGELOG.md
@@ -1,5 +1,28 @@
 # Changelog

+## v0.2.0
+
+### What's New
+
+- New `HuggingFaceTextSplitter`, which lets you use Hugging Face's `tokenizers` package to size chunks by token count with a tokenizer of your choice.
+
+```python
+from semantic_text_splitter import HuggingFaceTextSplitter
+from tokenizers import Tokenizer
+
+# Maximum number of tokens in a chunk
+max_tokens = 1000
+tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
+# Optionally, trim_chunks=False tells the splitter not to trim whitespace for you
+splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)
+
+chunks = splitter.chunks("your document text", max_tokens)
+```
+
+### Breaking Changes
+
+- `trim_chunks` now defaults to `True` instead of `False`. For most use cases this is the desired behavior, especially with chunk ranges.
+
 ## v0.1.4

 Fifth time is the charm?
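If you relied on the old behavior, the whitespace-preserving mode is still available by passing `trim_chunks=False` explicitly. A minimal sketch, assuming the character-based splitter in these bindings keeps the same constructor shape as the example above:

```python
from semantic_text_splitter import CharacterTextSplitter

# trim_chunks now defaults to True; pass False explicitly to keep
# the pre-v0.2.0 behavior of leaving surrounding whitespace intact.
splitter = CharacterTextSplitter(trim_chunks=False)
chunks = splitter.chunks("  your document text  ", 1000)
```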