feat: Add chunk overlap setting (#160)
* feat: Add chunk overlap setting

Allows for overlapping chunks. Will still use the semantic range sections to determine a good splitting point for the overlap as well.

* Make sure there aren't chunks emitted whose entire content was already emitted

* Consolidate snapshot test code

* Add overlap snapshot tests

* Update changelog and python bindings
benbrandt authored Apr 28, 2024
1 parent 7c5d994 commit c6e599e
Showing 180 changed files with 152,081 additions and 611 deletions.
27 changes: 27 additions & 0 deletions .github/workflows/python.yml
@@ -26,6 +26,33 @@ permissions:
  contents: read

jobs:
  min_supported_version:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.8"
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
        with:
          working-directory: bindings/python
      - name: Build wheels
        uses: PyO3/maturin-action@v1
        with:
          target: x86_64
          args: --release --out dist
          sccache: "true"
          manylinux: auto
          working-directory: bindings/python
      - name: pytest
        shell: bash
        run: |
          set -e
          pip install --no-index --find-links dist --force-reinstall semantic-text-splitter
          pip install pytest tokenizers
          pytest
  linux:
    runs-on: ubuntu-latest
    strategy:
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,31 @@
# Changelog

## v0.12.2

### What's New

**Support for chunk overlapping:** Several of you have been waiting on this for a while now, and I am happy to say that chunk overlapping is now available in a way that still stays true to the spirit of finding good semantic break points.

When a new chunk is emitted, if chunk overlapping is enabled, the splitter will look back at the semantic sections of the current level and pull in as many as possible that fit within the overlap window. **This does mean that sometimes none can be taken**, which is often the case when close to a higher semantic level boundary.

An overlap will almost always be produced when the current semantic level couldn't fit into a single chunk. In that case the splitter provides overlapping sections, since we may not have found a good break point in the middle of the section. This seems to be the main motivation for using chunk overlapping in the first place.
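
The look-back described above can be sketched roughly as follows. This is only an illustration in Python, not the crate's actual implementation: `overlap_start`, `section_sizes`, and `chunk_end` are hypothetical names, and sizes are assumed to be plain integer lengths (characters or tokens).

```python
def overlap_start(section_sizes, chunk_end, overlap):
    """Return the index of the first section to repeat in the next chunk.

    Hypothetical helper for illustration only, not part of the library API.

    section_sizes: sizes of the semantic sections at the current level, in order.
    chunk_end: index one past the last section of the chunk just emitted.
    overlap: maximum total size that may be carried over as overlap.
    """
    start = chunk_end
    used = 0
    # Walk backwards from the end of the emitted chunk, pulling in whole
    # sections for as long as they still fit inside the overlap window.
    while start > 0 and used + section_sizes[start - 1] <= overlap:
        used += section_sizes[start - 1]
        start -= 1
    # If start == chunk_end, no section fit and there is no overlap.
    return start


sizes = [30, 20, 25, 40]
with_overlap = overlap_start(sizes, chunk_end=3, overlap=50)
without_overlap = overlap_start(sizes, chunk_end=3, overlap=10)
```

With an overlap of 50, the sections of size 25 and 20 both fit, so the next chunk would start at index 1. With an overlap of 10, even the nearest section is too large, so nothing is carried over, which matches the "none can be taken" case.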

#### Rust Usage

```rust
let chunk_config = ChunkConfig::new(256)
    // .with_sizer(sizer) // Optional tokenizer or other chunk sizer impl
    .with_overlap(64)
    .expect("Overlap must be less than desired chunk capacity");
let splitter = TextSplitter::new(chunk_config); // Or MarkdownSplitter
```

#### Python Usage

```python
splitter = TextSplitter(256, overlap=64) # or any of the class methods to use a tokenizer
```

## v0.12.1

### What's New
4 changes: 2 additions & 2 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -2,7 +2,7 @@
members = ["bindings/*"]

[workspace.package]
-version = "0.12.1"
+version = "0.12.2"
authors = ["Ben Brandt <benjamin.j.brandt@gmail.com>"]
edition = "2021"
description = "Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python."
94 changes: 73 additions & 21 deletions bindings/python/semantic_text_splitter.pyi

Large diffs are not rendered by default.
