Releases: benbrandt/text-splitter
Python: v0.2.0 - Hugging Face Tokenizer support
What's New
- New `HuggingFaceTextSplitter`, which allows for using Hugging Face's `tokenizers` package to count chunks by tokens with a tokenizer of your choice.
```python
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Optionally can also have the splitter not trim whitespace for you
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)
chunks = splitter.chunks("your document text", max_tokens)
```
Breaking Changes
- `trim_chunks` now defaults to `True` instead of `False`. For most use cases, this is the desired behavior, especially with chunk ranges.
Full Changelog: python-v0.1.4...python-v0.2.0
v0.4.1 - Remove unneeded `tokenizers` features
What's Changed
- Remove unnecessary tokenizer features by @benbrandt in #20
Full Changelog: v0.4.0...v0.4.1
Python: v0.1.4 - Fifth time is the charm?
Python: v0.1.3 - New package name
Had to adjust the package name so that it could upload to PyPI.
```python
from semantic_text_splitter import CharacterTextSplitter

# Maximum number of characters in a chunk
max_characters = 1000

# Optionally can also have the splitter trim whitespace for you
splitter = CharacterTextSplitter(trim_chunks=True)
chunks = splitter.chunks("your document text", max_characters)
```
Full Changelog: python-v0.1.2...python-v0.1.3
Python: v0.1.2 - Fix bad release
Apologies...first time publishing a Python package...
Full Changelog: python-v0.1.1...python-v0.1.2
Python: v0.1.1 - Fix bad release
Full Changelog: python-v0.1.0...python-v0.1.1
Python: v0.1.0 - Initial Python Binding Release
What's Changed
- Initial Python Bindings by @benbrandt in #13
- Currently only includes a `CharacterTextSplitter` to test the release process.
```python
from text_splitter import CharacterTextSplitter

# Maximum number of characters in a chunk
max_characters = 1000

# Optionally can also have the splitter trim whitespace for you
splitter = CharacterTextSplitter(trim_chunks=True)
chunks = splitter.chunks("your document text", max_characters)
```
v0.4.0 - New Chunk Capacity
What's New
New Chunk Capacity (can now size chunks with Ranges)
New `ChunkCapacity` trait. When calling `splitter.chunks()` or `splitter.chunk_indices()`, the `chunk_size` argument has been replaced with `chunk_capacity`, which can be anything that implements the `ChunkCapacity` trait. This means that the following can now all be passed in:
- `usize`
- `Range<usize>`
- `RangeFrom<usize>`
- `RangeFull`
- `RangeInclusive<usize>`
- `RangeTo<usize>`
- `RangeToInclusive<usize>`
This is helpful for cases where you have a maximum chunk size but don't necessarily want to fill it up all the way every time. This comes up in embedding use cases, where you have some maximum context size but don't want to muddy the embeddings with lots of neighboring semantic elements. You can now express this with a range, and chunks will stop filling up once they reach a size within the range.
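For illustration, a minimal sketch of passing each kind of capacity (the `TextSplitter::default().with_trim_chunks(true)` construction is the crate's character-based splitter; the specific sizes are arbitrary):

```rust
use text_splitter::TextSplitter;

// Default implementation uses character count for chunk size
let splitter = TextSplitter::default()
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

// A plain usize still works as a fixed maximum size...
let chunks = splitter.chunks("your document text", 1000);

// ...or pass a range: chunks stop filling once they reach at least
// 500 characters, and will never exceed 1000.
let chunks = splitter.chunks("your document text", 500..1000);
```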
Simplified Chunk Sizing traits
Simplified `ChunkSizer` trait that allows for various calculations of chunk size. No longer requires full validation logic, since that now happens within the `TextSplitter` itself.
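For illustration, a minimal sketch of a custom sizer under this simplified trait, where only the size calculation is required (the word-counting sizer is hypothetical, not part of the crate):

```rust
use text_splitter::ChunkSizer;

// Hypothetical sizer that measures chunks by word count instead of
// characters. Only the size calculation is needed; validation against
// the capacity now happens inside TextSplitter itself.
struct WordCount;

impl ChunkSizer for WordCount {
    fn chunk_size(&self, chunk: &str) -> usize {
        chunk.split_whitespace().count()
    }
}
```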
Breaking Changes
- `ChunkValidator` trait removed. `impl ChunkSizer` instead, which just requires calculating the chunk size, not the full validation logic.
- `TokenCount` trait removed. You can just use `ChunkSizer` directly instead.
- Internal `TextChunks` iterator is no longer `pub`.
v0.3.1
What's Changed
- Handle more semantic levels of line breaks by @benbrandt in #9
Full Changelog: v0.3.0...v0.3.1
v0.3.0 - Feature renaming + Optimized splitting algorithm
What's Changed
Breaking Changes
- Match feature names for tokenizer crates to prevent conflicts in the future (see the snippet after this list):
  - `huggingface` -> `tokenizers`
  - `tiktoken` -> `tiktoken-rs`
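As a sketch, enabling the renamed feature in `Cargo.toml` would look something like this (the version requirement is illustrative):

```toml
[dependencies]
# Formerly: features = ["huggingface"]
text-splitter = { version = "0.3.0", features = ["tokenizers"] }
```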
Features
- Moved from recursive approach to iterative approach to avoid stack overflow issues by @benbrandt in #7
- Relax MSRV to 1.60.0
Full Changelog: v0.2.2...v0.3.0