Releases: benbrandt/text-splitter
Python: v0.2.0 - Hugging Face Tokenizer support
What's New
- New `HuggingFaceTextSplitter`, which allows for using Hugging Face's `tokenizers` package to count chunks by tokens with a tokenizer of your choice.
```python
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Optionally can also have the splitter not trim whitespace for you
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)
chunks = splitter.chunks("your document text", max_tokens)
```
Breaking Changes
- `trim_chunks` now defaults to `True` instead of `False`. For most use cases, this is the desired behavior, especially with chunk ranges.
Full Changelog: python-v0.1.4...python-v0.2.0
v0.4.1 - Remove unneeded `tokenizers` features
What's Changed
- Remove unnecessary tokenizer features by @benbrandt in #20
Full Changelog: v0.4.0...v0.4.1
Python: v0.1.4 - Fifth time is the charm?
Python: v0.1.3 - New package name
Had to adjust the package name so that it could upload to PyPI.
```python
from semantic_text_splitter import CharacterTextSplitter

# Maximum number of characters in a chunk
max_characters = 1000

# Optionally can also have the splitter trim whitespace for you
splitter = CharacterTextSplitter(trim_chunks=True)
chunks = splitter.chunks("your document text", max_characters)
```
Full Changelog: python-v0.1.2...python-v0.1.3
Python: v0.1.2 - Fix bad release
Apologies...first time publishing a Python package...
Full Changelog: python-v0.1.1...python-v0.1.2
Python: v0.1.1 - Fix bad release
Full Changelog: python-v0.1.0...python-v0.1.1
Python: v0.1.0 - Initial Python Binding Release
What's Changed
- Initial Python Bindings by @benbrandt in #13
- Currently only includes a `CharacterTextSplitter` to test the release process.
```python
from text_splitter import CharacterTextSplitter

# Maximum number of characters in a chunk
max_characters = 1000

# Optionally can also have the splitter trim whitespace for you
splitter = CharacterTextSplitter(trim_chunks=True)
chunks = splitter.chunks("your document text", max_characters)
```
v0.4.0 - New Chunk Capacity
What's New
New Chunk Capacity (can now size chunks with Ranges)
New `ChunkCapacity` trait. When calling `splitter.chunks()` or `splitter.chunk_indices()`, the `chunk_size` argument has been replaced with `chunk_capacity`, which can be anything that implements the `ChunkCapacity` trait. This means that the following can now all be passed in:
- `usize`
- `Range<usize>`
- `RangeFrom<usize>`
- `RangeFull`
- `RangeInclusive<usize>`
- `RangeTo<usize>`
- `RangeToInclusive<usize>`
This is helpful for cases where you have a maximum chunk size but don't necessarily want to fill it up all the way every time. This comes up in embedding use cases, where you have some maximum context size but don't want to muddy the embeddings with lots of neighboring semantic elements. You can now express this with a range, and chunks will stop filling up once they reach a size within the range.
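For illustration, a minimal sketch of passing each kind of capacity (the `TextSplitter::default().with_trim_chunks(true)` construction is the crate's character-based splitter; the specific sizes are arbitrary):

```rust
use text_splitter::TextSplitter;

// Default implementation uses character count for chunk size
let splitter = TextSplitter::default()
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

// A plain usize still works as a fixed maximum size...
let chunks = splitter.chunks("your document text", 1000);

// ...or pass a range: chunks stop filling once they reach at least
// 500 characters, and will never exceed 1000.
let chunks = splitter.chunks("your document text", 500..1000);
```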
Simplified Chunk Sizing traits
Simplified `ChunkSizer` trait that allows for various calculations of chunk size. No longer requires full validation logic, since that now happens within the `TextSplitter` itself.
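For illustration, a minimal sketch of a custom sizer under this simplified trait, where only the size calculation is required (the word-counting sizer is hypothetical, not part of the crate):

```rust
use text_splitter::ChunkSizer;

// Hypothetical sizer that measures chunks by word count instead of
// characters. Only the size calculation is needed; validation against
// the capacity now happens inside TextSplitter itself.
struct WordCount;

impl ChunkSizer for WordCount {
    fn chunk_size(&self, chunk: &str) -> usize {
        chunk.split_whitespace().count()
    }
}
```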
Breaking Changes
- `ChunkValidator` trait removed. `impl ChunkSizer` instead, which just requires calculating the chunk size, not the full validation logic.
- `TokenCount` trait removed. You can just use `ChunkSizer` directly instead.
- Internal `TextChunks` iterator is no longer `pub`.
v0.3.1
What's Changed
- Handle more semantic levels of line breaks by @benbrandt in #9
Full Changelog: v0.3.0...v0.3.1
v0.3.0 - Feature renaming + Optimized splitting algorithm
What's Changed
Breaking Changes
- Match feature names for tokenizer crates to prevent conflicts in the future (see the snippet after this list):
  - `huggingface` -> `tokenizers`
  - `tiktoken` -> `tiktoken-rs`
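As a sketch, enabling the renamed feature in `Cargo.toml` would look something like this (the version requirement is illustrative):

```toml
[dependencies]
# Formerly: features = ["huggingface"]
text-splitter = { version = "0.3.0", features = ["tokenizers"] }
```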
Features
- Moved from recursive approach to iterative approach to avoid stack overflow issues by @benbrandt in #7
- Relax MSRV to 1.60.0
Full Changelog: v0.2.2...v0.3.0