Releases · benbrandt/text-splitter

Support tree-sitter@v0.24 for CodeSplitters.
Due to a small change in the backing Unicode segmentation implementation, there are also some slight shifts in behavior for CodeSplitters (in my tests, mostly that semicolons are grouped more logically with the preceding content).

Full Changelog: v0.16.1...v0.17.0


07 Sep 11:27

benbrandt

v0.16.1

e53d5e2

v0.16.1

What's New

Updates pulldown-cmark to v0.12.1 to address an issue with high CPU usage for certain Markdown elements.

Full Changelog: v0.16.0...v0.16.1


02 Sep 21:32

benbrandt

v0.16.0

79a8137

v0.16.0

Breaking Changes

Update to v0.23.0 of tree-sitter for CodeSplitter. There was a breaking change for language definitions, so this is also a breaking change for us, especially on the Python side, since we support passing the language in.
Minimum Python version for the Python bindings is now 3.9 since 3.8 will be EOL next month.

Python

Make sure to upgrade to the latest version of your tree-sitter language package.

Rust

Make sure to upgrade to the latest version of your tree-sitter language package crate. These now have a LANGUAGE constant rather than a language() function.

// Before
tree_sitter_rust::language()
// After
tree_sitter_rust::LANGUAGE

What's New

MarkdownSplitter can better parse the Commonmark HS extension for Definition Lists.

Full Changelog: v0.15.0...v0.16.0


11 Aug 05:21

benbrandt

v0.15.0

67b20aa

v0.15.0

What's New

Support version 0.20.0 of the tokenizers crate.

Python

No longer cause a segmentation fault when using the wrong type for tree-sitter languages. Fixes #265

Full Changelog: v0.14.1...v0.15.0


06 Jul 05:38

benbrandt

v0.14.1

304e55f

v0.14.1

What's New

Small performance improvements: checking the size of a chunk is now skipped when we already know it is too small or when the check isn't needed. #261
Loosen dependency ranges for Rust crates to allow for more flexibility in the versions you can use.

Full Changelog: v0.14.0...v0.14.1


21 Jun 20:54

benbrandt

v0.14.0

7c3cbbd

v0.14.0

What's New

Performance fixes for large documents. The worst-case performance for certain documents was abysmal, leading to documents that seemingly took forever to split. This release makes sure that in the worst case, the splitter won't be binary searching over the entire document, which it was before. That was prohibitively expensive, especially for the tokenizer implementations, and the search space should now always have a safe upper bound.

For the "happy path", this new approach also led to big speed gains in the CodeSplitter (50%+ speed increase in some cases), marginal regressions in the MarkdownSplitter, and not much difference in the TextSplitter. But overall, the performance should be more consistent across documents, since it wasn't uncommon for a document with certain formatting to hit the worst-case scenario previously.
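To illustrate what a bounded search space can look like, here is a minimal sketch of a galloping (exponential) search followed by a binary search. This is not the crate's actual implementation: `fitting_prefix_len`, its parameters, and the assumption that the measured size never shrinks as more pieces are added are all hypothetical and chosen only for illustration.

```rust
/// Returns how many leading `pieces` can be joined while staying within
/// `capacity`, as measured by `size`.
///
/// Assumption (hypothetical, for this sketch only): `size` never decreases
/// when more pieces are appended, and the empty string always fits.
pub fn fitting_prefix_len<F>(pieces: &[&str], capacity: usize, size: F) -> usize
where
    F: Fn(&str) -> usize,
{
    // Measure the concatenation of the first `n` pieces.
    let measure = |n: usize| size(pieces[..n].concat().as_str());

    // Phase 1: double the window until it stops fitting or runs out of input.
    // The largest prefix measured here is at most twice the final answer,
    // so the sizer is not run on the whole document unless most of it fits.
    let mut lo = 0; // largest count known to fit
    let mut hi = 1; // next count to try
    while hi <= pieces.len() && measure(hi) <= capacity {
        lo = hi;
        hi *= 2;
    }

    // Exclusive upper bound: a count that does not fit, or one past the end.
    let mut too_big = hi.min(pieces.len() + 1);

    // Phase 2: ordinary binary search inside the bounded window.
    while too_big - lo > 1 {
        let mid = lo + (too_big - lo) / 2;
        if measure(mid) <= capacity {
            lo = mid;
        } else {
            too_big = mid;
        }
    }
    lo
}
```

With eight two-byte pieces and a capacity of 5 bytes, this sketch measures prefixes of 1, 2, 4, and 3 pieces and never touches the full input, which is the property the release notes describe: the cost is tied to the chunk size rather than the document size.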

Breaking Changes

Chunk output may be slightly different because of the changes to the search optimizations. The previous optimization occasionally caused the splitter to stop too soon. For most cases, you may see no difference. It was most pronounced in the MarkdownSplitter at very small sizes, and any splitter using RustTokenizers because of its offset behavior.

Rust

ChunkSize has been removed. This was a holdover from a previous internal optimization, which turned out to not be very accurate anyway.
This makes implementing a custom ChunkSizer much easier, as you now only need to return the size of the chunk as a usize. The old return type also often required tokenization implementations to do extra work when calculating the size, which is no longer necessary.

Before

pub trait ChunkSizer {
    // Required method
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize;
}

After

pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}
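As a rough illustration of how little a custom sizer needs under the new signature, here is a minimal sketch. It declares a local copy of the trait so that it compiles without the crate; `CharCount`, `WordCount`, and `fits` are hypothetical names made up for this example and are not part of text-splitter.

```rust
/// Local stand-in mirroring the trait shown above, so this sketch
/// compiles on its own. In real code you would import the crate's trait.
pub trait ChunkSizer {
    fn size(&self, chunk: &str) -> usize;
}

/// Hypothetical sizer: counts Unicode scalar values (not bytes).
pub struct CharCount;

impl ChunkSizer for CharCount {
    fn size(&self, chunk: &str) -> usize {
        chunk.chars().count()
    }
}

/// Hypothetical sizer: counts whitespace-separated words.
pub struct WordCount;

impl ChunkSizer for WordCount {
    fn size(&self, chunk: &str) -> usize {
        chunk.split_whitespace().count()
    }
}

/// Hypothetical helper: does `chunk` fit within `capacity` for this sizer?
pub fn fits<S: ChunkSizer>(sizer: &S, chunk: &str, capacity: usize) -> bool {
    sizer.size(chunk) <= capacity
}
```

Each implementation is a single expression returning a usize; there is no capacity argument to inspect and no wrapper type to build, which is the simplification this release describes.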

Optimization for SemanticSplitRange searching by @benbrandt in #219
Performance Optimization: Expanding binary search window by @benbrandt in #231

Full Changelog: v0.13.3...v0.14.0

Contributors

benbrandt


02 Jun 21:10

benbrandt

v0.13.3

a3900eb

v0.13.3

What's Changed

Fixes broken PyPI publish because of a bad dev dependency specification

Full Changelog: v0.13.2...v0.13.3
