Releases: benbrandt/text-splitter

v0.9.0

04 Apr 05:05

What's New

More robust handling of Hugging Face tokenizers as chunk sizers.

  • Tokenizers with padding enabled no longer count padding tokens when generating chunks. Counting padding tokens caused some unexpected behavior, especially if the chunk capacity didn't perfectly line up with the padding size(s). Now, the tokenizer's padding token is ignored when counting the number of tokens generated in a chunk.
  • In the process, it also became clear there were some false assumptions about how the byte offset ranges were calculated for each token. This has been fixed, and the byte offset ranges should now be more accurate when determining the boundaries of each token. This only affects some optimizations in chunk sizing, and should not affect the actual chunk output.

Breaking Changes

Chunk output should only change for those of you using a Hugging Face tokenizer with padding enabled. Because padding tokens are no longer counted, the chunks will likely be larger than before, and closer to the desired behavior.

Note: This will mean the generated chunks may also be larger than the chunk capacity when tokenized, because padding tokens will be added when you tokenize the chunk. The chunk capacity for these tokenizers reflects the number of tokens used in the text, not necessarily the number of tokens that the tokenizer will generate in total.
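
As a rough illustration of the new behavior, here is a sketch assuming a TextSplitter constructed directly from the tokenizer (the constructor usage and model name are stand-ins, not taken from these notes):

use text_splitter::TextSplitter;
use tokenizers::{PaddingParams, PaddingStrategy, Tokenizer};

// Stand-in model; any Hugging Face tokenizer applies.
let mut tokenizer = Tokenizer::from_pretrained("bert-base-uncased", None).unwrap();
// Enable fixed-size padding. As of v0.9.0 the padding tokens this adds no
// longer count toward the chunk size.
tokenizer.with_padding(Some(PaddingParams {
    strategy: PaddingStrategy::Fixed(128),
    ..PaddingParams::default()
}));

let splitter = TextSplitter::new(tokenizer);
// Chunk sizes reflect the tokens produced by the text itself; re-tokenizing
// a chunk afterwards may yield more tokens once padding is applied.
let chunks = splitter.chunks("your document text", 1000);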

Full Changelog: v0.8.1...v0.9.0

v0.8.1

26 Mar 21:31

What's New

  • Updates to documentation and examples.
  • Updated pyo3 to 0.21.0 in the Python package, which should bring some performance improvements. #125

Full Changelog: v0.8.0...v0.8.1

v0.8.0 - Performance Improvements

26 Mar 06:15

What's New

Significantly fewer allocations are necessary when generating chunks. This should result in a performance improvement for most use cases. This was achieved by both reusing pre-allocated collections and memoizing chunk size calculations, since chunk sizing is often the bottleneck, and tokenizer libraries tend to be very allocation-heavy!
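
The memoization half of that, sketched in isolation (illustrative only, not the crate's internal code):

use std::cell::RefCell;
use std::collections::HashMap;
use std::ops::Range;

// Illustrative only: wrap any `&str -> usize` sizing function and cache the
// result per byte range, since the same candidate ranges are measured
// repeatedly while searching for the optimal chunk boundary.
struct MemoizedSizer<F: Fn(&str) -> usize> {
    size_fn: F,
    cache: RefCell<HashMap<(usize, usize), usize>>,
}

impl<F: Fn(&str) -> usize> MemoizedSizer<F> {
    fn new(size_fn: F) -> Self {
        Self {
            size_fn,
            cache: RefCell::new(HashMap::new()),
        }
    }

    fn size(&self, text: &str, range: Range<usize>) -> usize {
        let key = (range.start, range.end);
        *self
            .cache
            .borrow_mut()
            .entry(key)
            .or_insert_with(|| (self.size_fn)(&text[range]))
    }
}

Keying the cache by byte offsets rather than by the substring itself avoids allocating strings just to do lookups.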

Benchmarks show:

  • 20-40% fewer allocations caused by the core algorithm.
  • Up to 20% fewer allocations when using tokenizers to calculate chunk sizes.
  • In some cases, especially with Markdown, these improvements can also result in up to 20% faster chunk generation.

Breaking Changes

  • Fixed a bug in the MarkdownSplitter logic that caused some strange split points.
  • The Text semantic level in MarkdownSplitter has been merged with inline elements to also find better split points inside content.
  • Fixed a bug that could cause the algorithm to use a lower semantic level than necessary on occasion. This mostly impacted the MarkdownSplitter, but there were some cases of different behavior in the TextSplitter as well if chunks are not trimmed.

All of the above can cause different chunks to be output than before, depending on the text. So even though these are bug fixes that restore the intended behavior, they are being treated as a major version bump.

Full Changelog: v0.7.0...v0.8.0

v0.7.0 - Markdown Support

09 Mar 21:21
999d567

What's New

Markdown Support! Both the Rust crate and Python package have a new MarkdownSplitter you can use to split markdown text. It leverages the great work of the pulldown-cmark crate to parse markdown according to the CommonMark spec, and allows for very fine-grained control over how to split the text.

In terms of use, the API is identical to the TextSplitter, so you should be able to drop it in whenever you have Markdown available instead of plain text.

Rust

use text_splitter::MarkdownSplitter;

// Default implementation uses character count for chunk size.
// Can also use all of the same tokenizer implementations as `TextSplitter`.
let splitter = MarkdownSplitter::default()
    // Optionally can also have the splitter trim whitespace for you. It
    // will preserve indentation if multiple lines are covered in a chunk.
    .with_trim_chunks(true);

let chunks = splitter.chunks("# Header\n\nyour document text", 1000);

Python

from semantic_text_splitter import MarkdownSplitter

# Default implementation uses character count for chunk size.
# Can also use all of the same tokenizer implementations as `TextSplitter`.
# By default it will also trim whitespace for you.
# It will preserve indentation if multiple lines are covered in a chunk.
splitter = MarkdownSplitter()
chunks = splitter.chunks("# Header\n\nyour document text", 1000)

Breaking Changes

Rust

MSRV is now 1.75.0 since the ability to use impl Trait in trait methods allowed for much simpler internal APIs to enable the MarkdownSplitter.
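
For context, a tiny illustration of the 1.75 feature (a hypothetical trait, not the crate's actual internals):

// Rust 1.75+: impl Trait is allowed in trait method return position, so
// iterator-returning methods no longer need boxing or named return types.
trait SemanticSplit {
    fn chunks<'text>(&self, text: &'text str) -> impl Iterator<Item = &'text str>;
}

struct Lines;

impl SemanticSplit for Lines {
    fn chunks<'text>(&self, text: &'text str) -> impl Iterator<Item = &'text str> {
        text.lines()
    }
}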

Python

The CharacterTextSplitter, HuggingFaceTextSplitter, TiktokenTextSplitter, and CustomTextSplitter classes have all been consolidated into a single TextSplitter class. All of the previous use cases are still supported; you just need to instantiate the class with the appropriate class method.

Below are the changes you need to make to your code to upgrade to v0.7.0:

CharacterTextSplitter

# Before
from semantic_text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter()

# After
from semantic_text_splitter import TextSplitter
splitter = TextSplitter()

HuggingFaceTextSplitter

# Before
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = HuggingFaceTextSplitter(tokenizer)

# After
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer)

TiktokenTextSplitter

# Before
from semantic_text_splitter import TiktokenTextSplitter

splitter = TiktokenTextSplitter("gpt-3.5-turbo")

# After
from semantic_text_splitter import TextSplitter

splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo")

CustomTextSplitter

# Before
from semantic_text_splitter import CustomTextSplitter

splitter = CustomTextSplitter(lambda text: len(text))

# After
from semantic_text_splitter import TextSplitter

splitter = TextSplitter.from_callback(lambda text: len(text))

Full Changelog: v0.6.3...v0.7.0

v0.6.3

20 Jan 20:50

Re-release because overly aggressive exclusion of benchmark files from the Rust package caused the release to fail.

Full Changelog: v0.6.2...v0.6.3

v0.6.2

20 Jan 20:39

Re-release of v0.6.1 because of a wrong version tag in the Python package.

Full Changelog: v0.6.1...v0.6.2

v0.6.1

20 Jan 19:56
fb21920

Fixes

  • Fixed an error in section filtering that meant the chunk behavior regression from v0.5.0 wasn't fully resolved for very small chunk capacities. For most commonly used chunk sizes (i.e. >=10 tokens), this shouldn't have been an issue. @benbrandt in #84

Full Changelog: v0.6.0...v0.6.1

v0.6.0

14 Jan 07:19

Breaking Changes

  • Chunk behavior should now be the same as prior to v0.5.0. Once binary search finds the optimal chunk, we now check the next few sections as long as the chunk size doesn't change. This should result in the same behavior as before, but with the performance improvements of binary search (a rough sketch of the idea follows). @benbrandt in #81
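
A rough sketch of the idea (illustrative only; section_ends and size_of are stand-ins for the crate's internals):

// Illustrative sketch, not the crate's code. `section_ends` holds the byte
// offset where each successive semantic section ends, and `size_of` stands
// in for the chunk sizer. Assumes chunk size is non-decreasing in text
// length, which is what makes the binary search valid.
fn find_chunk_end(
    text: &str,
    section_ends: &[usize],
    capacity: usize,
    size_of: impl Fn(&str) -> usize,
) -> usize {
    // Binary search: how many leading sections fit within the capacity?
    let fits = section_ends.partition_point(|&end| size_of(&text[..end]) <= capacity);
    if fits == 0 {
        return 0; // not even the first section fits
    }
    let mut best = fits;
    let best_size = size_of(&text[..section_ends[best - 1]]);
    // The v0.6.0 fix: keep taking sections past the binary-search result as
    // long as the measured chunk size doesn't change (e.g. trailing
    // whitespace that adds no tokens), matching the pre-v0.5.0 behavior.
    while best < section_ends.len() && size_of(&text[..section_ends[best]]) == best_size {
        best += 1;
    }
    section_ends[best - 1]
}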

Full Changelog: v0.5.1...v0.6.0

v0.5.1

13 Jan 14:23
53cc041

What's New

  • Python bindings and Rust crate now have the same version number.

Rust

  • Constructors for ChunkSize are now public, so you can more easily create ChunkSize structs for your own custom ChunkSizer implementations.

Python

  • New CustomTextSplitter that accepts a custom callback with the signature of (str) -> int. Allows for custom chunk sizing on the Python side. @benbrandt in #80

Full Changelog: v0.5.0...v0.5.1

v0.5.0

27 Dec 19:26
e716aa9

What's New

  • Significant performance improvements for generating chunks with the tokenizers or tiktoken-rs crates by applying binary search when attempting to find the next matching chunk size. @benbrandt and @bradfier in #71

Breaking Changes

  • Minimum required version of tokenizers is now 0.15.0
  • Minimum required version of tiktoken-rs is now 0.5.6
  • Due to using binary search, there are some slight differences at the edges of chunks where the algorithm was a little greedier before. If two candidates would tokenize to the same number of tokens that fit within the capacity, it will now choose the shorter text. Due to the nature of tokenizers, this happens more often with whitespace at the end of a chunk, and rarely affects users who have set with_trim_chunks(true). It is a tradeoff, but keeping the exact same behavior would have made the binary search code much more complicated.
  • The chunk_size method on ChunkSizer now needs to accept a ChunkCapacity argument, and return a ChunkSize struct instead of a usize. This was to help support the new binary search method in chunking, and should only affect users who implemented custom ChunkSizers and weren't using one of the provided ones (a sketch of a migrated implementation follows this list).
    • New signature: fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize;
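
For custom implementations, the migration looks roughly like this (a sketch; ChunkSize::from_size is an assumed constructor name, and the v0.5.1 notes above mention the ChunkSize constructors becoming public):

use text_splitter::{ChunkCapacity, ChunkSize, ChunkSizer};

// Sketch of a character-count sizer migrated to the new signature.
struct Characters;

impl ChunkSizer for Characters {
    fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize {
        // `ChunkSize::from_size` is an assumed constructor name here; it
        // packages the raw size together with the capacity it was measured
        // against, which is what the binary search needs.
        ChunkSize::from_size(chunk.chars().count(), capacity)
    }
}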

Full Changelog: v0.4.5...v0.5.0