From 78caed78fc8e353b50b3278dc4eb30cc7ec4837d Mon Sep 17 00:00:00 2001
From: "promptless[bot]" <179508745+promptless[bot]@users.noreply.github.com>
Date: Mon, 25 Nov 2024 15:34:46 +0000
Subject: [PATCH 1/3] Docs update (f50cc65)

---
 docs/docs/concepts/text_splitters.mdx | 146 ++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)
 create mode 100644 docs/docs/concepts/text_splitters.mdx

diff --git a/docs/docs/concepts/text_splitters.mdx b/docs/docs/concepts/text_splitters.mdx
new file mode 100644
index 0000000000000..2ee85588a27aa
--- /dev/null
+++ b/docs/docs/concepts/text_splitters.mdx
@@ -0,0 +1,146 @@
+# Text splitters

:::info[Prerequisites]

* [Documents](/docs/concepts/retrievers/#interface)
* [Tokenization](/docs/concepts/tokens)
:::

## Overview

Document splitting is often a crucial preprocessing step for many applications.
It involves breaking down large texts into smaller, manageable chunks.
This process offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of models, and improving the quality of text representations used in retrieval systems.
There are several strategies for splitting documents, each with its own advantages.

## Key concepts

![Conceptual Overview](/img/text_splitters.png)

Text splitters split documents into smaller chunks for use in downstream applications.

## Why split documents?

There are several reasons to split documents:

- **Handling non-uniform document lengths**: Real-world document collections often contain texts of varying sizes. Splitting ensures consistent processing across all documents.
- **Overcoming model limitations**: Many embedding models and language models have maximum input size constraints. Splitting allows us to process documents that would otherwise exceed these limits.
- **Improving representation quality**: For longer documents, the quality of embeddings or other representations may degrade as they try to capture too much information. Splitting can lead to more focused and accurate representations of each section.
- **Enhancing retrieval precision**: In information retrieval systems, splitting can improve the granularity of search results, allowing for more precise matching of queries to relevant document sections.
- **Optimizing computational resources**: Working with smaller chunks of text can be more memory-efficient and allow for better parallelization of processing tasks.

Now, the next question is *how* to split the documents into chunks! There are several strategies, each with its own advantages.

:::info[Further reading]
* See Greg Kamradt's [chunkviz](https://chunkviz.up.railway.app/) to visualize different splitting strategies discussed below.
:::

## Approaches

### Length-based

The most intuitive strategy is to split documents based on their length. This simple yet effective approach ensures that each chunk doesn't exceed a specified size limit.
Key benefits of length-based splitting:
- Straightforward implementation
- Consistent chunk sizes
- Easily adaptable to different model requirements

Types of length-based splitting:
- **Token-based**: Splits text based on the number of tokens, which is useful when working with language models.
- **Character-based**: Splits text based on the number of characters, which can be more consistent across different types of text.
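
For the character-based variant, here is a minimal sketch that splits on paragraph breaks; the separator and sizes are illustrative choices, and `document` stands in for your input text:

```python
from langchain_text_splitters import CharacterTextSplitter

# Split on blank lines (paragraph boundaries), then merge the pieces
# back into chunks of at most ~100 characters, with no overlap.
text_splitter = CharacterTextSplitter(
    separator="\n\n", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(document)
```

Note that `CharacterTextSplitter` merges the separator-delimited pieces up to the chunk size, so `chunk_size` acts as a target rather than a strict cap when a single piece is longer.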

Example implementation using LangChain's `CharacterTextSplitter` with token-based splitting:

```python
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(document)
```

:::info[Further reading]

* See the how-to guide for [token-based](/docs/how_to/split_by_token/) splitting.
* See the how-to guide for [character-based](/docs/how_to/character_text_splitter/) splitting.

:::

### Text-structured based

Text is naturally organized into hierarchical units such as paragraphs, sentences, and words.
We can leverage this inherent structure to inform our splitting strategy, creating splits that maintain natural language flow, preserve semantic coherence within each split, and adapt to varying levels of text granularity.
LangChain's [`RecursiveCharacterTextSplitter`](/docs/how_to/recursive_text_splitter/) implements this concept:
- The `RecursiveCharacterTextSplitter` attempts to keep larger units (e.g., paragraphs) intact.
- If a unit exceeds the chunk size, it moves to the next level (e.g., sentences).
- This process continues down to the word level if necessary.

Here is example usage:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(document)
```

:::info[Further reading]

* See the how-to guide for [recursive text splitting](/docs/how_to/recursive_text_splitter/).

:::

### Document-structured based

Some documents have an inherent structure, such as HTML, Markdown, or JSON files.
In these cases, it's beneficial to split the document based on its structure, as it often naturally groups semantically related text.
Key benefits of structure-based splitting:
- Preserves the logical organization of the document
- Maintains context within each chunk
- Can be more effective for downstream tasks like retrieval or summarization

Examples of structure-based splitting:
- **Markdown**: Split based on headers (e.g., #, ##, ###)
- **HTML**: Split using tags
- **JSON**: Split by object or array elements
- **Code**: Split by functions, classes, or logical blocks

Example implementation using LangChain's `ExperimentalMarkdownSyntaxTextSplitter`:

```python
from langchain_text_splitters import ExperimentalMarkdownSyntaxTextSplitter
text_splitter = ExperimentalMarkdownSyntaxTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")],
    return_each_line=False
)
texts = text_splitter.split_text(document)
```

:::info[Further reading]

* See the how-to guide for [Markdown splitting](/docs/how_to/markdown_header_metadata_splitter/).
* See the how-to guide for [Recursive JSON splitting](/docs/how_to/recursive_json_splitter/).
* See the how-to guide for [Code splitting](/docs/how_to/code_splitter/).
* See the how-to guide for [HTML splitting](/docs/how_to/HTML_header_metadata_splitter/).

:::

### Semantic meaning based

Unlike the previous methods, semantic-based splitting actually considers the *content* of the text.
While other approaches use document or text structure as proxies for semantic meaning, this method directly analyzes the text's semantics.
There are several ways to implement this, but conceptually the approach is to split text when there are significant changes in text *meaning*.
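
A minimal sketch of this idea is shown below, and the step-by-step description after it spells out the same procedure. The `semantic_split` helper is hypothetical; the embedding model, window size, and distance threshold are illustrative assumptions rather than a prescribed implementation:

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings  # any embedding model would do

def semantic_split(sentences: list[str], window: int = 3, threshold: float = 0.3) -> list[str]:
    """Embed a sliding window of sentences and start a new chunk wherever
    consecutive window embeddings diverge sharply."""
    if len(sentences) <= window:
        return [" ".join(sentences)]
    windows = [" ".join(sentences[i : i + window]) for i in range(len(sentences) - window + 1)]
    vecs = np.array(OpenAIEmbeddings().embed_documents(windows))
    # Cosine similarity between each pair of consecutive windows.
    norms = np.linalg.norm(vecs, axis=1)
    sims = np.sum(vecs[:-1] * vecs[1:], axis=1) / (norms[:-1] * norms[1:])
    # Low similarity between adjacent windows marks a semantic "break point".
    breaks = [i + window for i, sim in enumerate(sims) if 1 - sim > threshold]
    chunks, start = [], 0
    for b in breaks:
        chunks.append(" ".join(sentences[start:b]))
        start = b
    chunks.append(" ".join(sentences[start:]))
    return chunks
```

In practice you would tune the threshold, or derive it from a percentile of the observed distances as LangChain's `SemanticChunker` does, rather than hard-coding it.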

As an example, we can use a sliding window approach to generate embeddings, and compare the embeddings to find significant differences:

- Start with the first few sentences and generate an embedding.
- Move to the next group of sentences and generate another embedding (e.g., using a sliding window approach).
- Compare the embeddings to find significant differences, which indicate potential "break points" between semantic sections.

This technique helps create chunks that are more semantically coherent, potentially improving the quality of downstream tasks like retrieval or summarization.

:::info[Further reading]

* See the how-to guide for [splitting text based on semantic meaning](/docs/how_to/semantic-chunker/).
* See Greg Kamradt's [notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) showcasing semantic splitting.

:::

From 66028f24e38560f3099cbe736d71fc560c468ce7 Mon Sep 17 00:00:00 2001
From: Prithvi Ramakrishnan
Date: Mon, 25 Nov 2024 07:37:59 -0800
Subject: [PATCH 2/3] Docs update

---
 docs/docs/concepts/text_splitters.mdx | 146 --------------------------
 1 file changed, 146 deletions(-)
 delete mode 100644 docs/docs/concepts/text_splitters.mdx

diff --git a/docs/docs/concepts/text_splitters.mdx b/docs/docs/concepts/text_splitters.mdx
deleted file mode 100644
index 2ee85588a27aa..0000000000000
--- a/docs/docs/concepts/text_splitters.mdx
+++ /dev/null
@@ -1,146 +0,0 @@
-# Text splitters

:::info[Prerequisites]

* [Documents](/docs/concepts/retrievers/#interface)
* [Tokenization](/docs/concepts/tokens)
:::

## Overview

Document splitting is often a crucial preprocessing step for many applications.
It involves breaking down large texts into smaller, manageable chunks.
This process offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of models, and improving the quality of text representations used in retrieval systems.
There are several strategies for splitting documents, each with its own advantages.

## Key concepts

![Conceptual Overview](/img/text_splitters.png)

Text splitters split documents into smaller chunks for use in downstream applications.

## Why split documents?

There are several reasons to split documents:

- **Handling non-uniform document lengths**: Real-world document collections often contain texts of varying sizes. Splitting ensures consistent processing across all documents.
- **Overcoming model limitations**: Many embedding models and language models have maximum input size constraints. Splitting allows us to process documents that would otherwise exceed these limits.
- **Improving representation quality**: For longer documents, the quality of embeddings or other representations may degrade as they try to capture too much information. Splitting can lead to more focused and accurate representations of each section.
- **Enhancing retrieval precision**: In information retrieval systems, splitting can improve the granularity of search results, allowing for more precise matching of queries to relevant document sections.
- **Optimizing computational resources**: Working with smaller chunks of text can be more memory-efficient and allow for better parallelization of processing tasks.

Now, the next question is *how* to split the documents into chunks! There are several strategies, each with its own advantages.

:::info[Further reading]
* See Greg Kamradt's [chunkviz](https://chunkviz.up.railway.app/) to visualize different splitting strategies discussed below.
:::

## Approaches

### Length-based

The most intuitive strategy is to split documents based on their length. This simple yet effective approach ensures that each chunk doesn't exceed a specified size limit.
Key benefits of length-based splitting:
- Straightforward implementation
- Consistent chunk sizes
- Easily adaptable to different model requirements

Types of length-based splitting:
- **Token-based**: Splits text based on the number of tokens, which is useful when working with language models.
- **Character-based**: Splits text based on the number of characters, which can be more consistent across different types of text.

Example implementation using LangChain's `CharacterTextSplitter` with token-based splitting:

```python
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(document)
```

:::info[Further reading]

* See the how-to guide for [token-based](/docs/how_to/split_by_token/) splitting.
* See the how-to guide for [character-based](/docs/how_to/character_text_splitter/) splitting.

:::

### Text-structured based

Text is naturally organized into hierarchical units such as paragraphs, sentences, and words.
We can leverage this inherent structure to inform our splitting strategy, creating splits that maintain natural language flow, preserve semantic coherence within each split, and adapt to varying levels of text granularity.
LangChain's [`RecursiveCharacterTextSplitter`](/docs/how_to/recursive_text_splitter/) implements this concept:
- The `RecursiveCharacterTextSplitter` attempts to keep larger units (e.g., paragraphs) intact.
- If a unit exceeds the chunk size, it moves to the next level (e.g., sentences).
- This process continues down to the word level if necessary.

Here is example usage:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(document)
```

:::info[Further reading]

* See the how-to guide for [recursive text splitting](/docs/how_to/recursive_text_splitter/).

:::

### Document-structured based

Some documents have an inherent structure, such as HTML, Markdown, or JSON files.
In these cases, it's beneficial to split the document based on its structure, as it often naturally groups semantically related text.
Key benefits of structure-based splitting:
- Preserves the logical organization of the document
- Maintains context within each chunk
- Can be more effective for downstream tasks like retrieval or summarization

Examples of structure-based splitting:
- **Markdown**: Split based on headers (e.g., #, ##, ###)
- **HTML**: Split using tags
- **JSON**: Split by object or array elements
- **Code**: Split by functions, classes, or logical blocks

Example implementation using LangChain's `ExperimentalMarkdownSyntaxTextSplitter`:

```python
from langchain_text_splitters import ExperimentalMarkdownSyntaxTextSplitter
text_splitter = ExperimentalMarkdownSyntaxTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")],
    return_each_line=False
)
texts = text_splitter.split_text(document)
```

:::info[Further reading]

* See the how-to guide for [Markdown splitting](/docs/how_to/markdown_header_metadata_splitter/).
* See the how-to guide for [Recursive JSON splitting](/docs/how_to/recursive_json_splitter/).
* See the how-to guide for [Code splitting](/docs/how_to/code_splitter/).
* See the how-to guide for [HTML splitting](/docs/how_to/HTML_header_metadata_splitter/).

:::

### Semantic meaning based

Unlike the previous methods, semantic-based splitting actually considers the *content* of the text.
While other approaches use document or text structure as proxies for semantic meaning, this method directly analyzes the text's semantics.
There are several ways to implement this, but conceptually the approach is to split text when there are significant changes in text *meaning*.
As an example, we can use a sliding window approach to generate embeddings, and compare the embeddings to find significant differences:

- Start with the first few sentences and generate an embedding.
- Move to the next group of sentences and generate another embedding (e.g., using a sliding window approach).
- Compare the embeddings to find significant differences, which indicate potential "break points" between semantic sections.

This technique helps create chunks that are more semantically coherent, potentially improving the quality of downstream tasks like retrieval or summarization.

:::info[Further reading]

* See the how-to guide for [splitting text based on semantic meaning](/docs/how_to/semantic-chunker/).
* See Greg Kamradt's [notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) showcasing semantic splitting.

- -::: From 0a406a1df13148b47441f560507e38d9ca641fb5 Mon Sep 17 00:00:00 2001 From: Prithvi Ramakrishnan Date: Mon, 25 Nov 2024 07:59:43 -0800 Subject: [PATCH 3/3] Docs update --- .../markdown_header_metadata_splitter.ipynb | 46 ++++++++++++++++++- 1 file changed, 45 insertions(+), 1 deletion(-) diff --git a/docs/docs/how_to/markdown_header_metadata_splitter.ipynb b/docs/docs/how_to/markdown_header_metadata_splitter.ipynb index 24ae5e421f6d1..2f4abafd00340 100644 --- a/docs/docs/how_to/markdown_header_metadata_splitter.ipynb +++ b/docs/docs/how_to/markdown_header_metadata_splitter.ipynb @@ -261,6 +261,50 @@ "splits = text_splitter.split_documents(md_header_splits)\n", "splits" ] + }, + { + "cell_type": "markdown", + "id": "b7b557f7", + "metadata": {}, + "source": [ + "### How to preserve whitespace from the original document:\n", + "\n", + "By default, `MarkdownHeaderTextSplitter` strips whitespace and newlines from the resulting documents, which can sometimes can cause issues with markdown sections like code blocks or nested lists. Use the `ExperimentalMarkdownSyntaxTextSplitter` to preserve whitespace in these instances." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "ba48193b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={'Header 1': 'Foo'}, page_content='# Foo \\nThis is Jim'),\n", + " Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}, page_content='## Bar \\n* Bullet 1\\n* Sub-bullet a')]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_text_splitters import ExperimentalMarkdownSyntaxTextSplitter\n", + "\n", + "markdown_document = \"# Foo\\n\\n This is Jim \\n\\n## Bar\\n\\n* Bullet 1\\n * Sub-bullet a\"\n", + "\n", + "headers_to_split_on = [\n", + " (\"#\", \"Header 1\"),\n", + " (\"##\", \"Header 2\"),\n", + " (\"###\", \"Header 3\"),\n", + "]\n", + "\n", + "markdown_splitter = ExperimentalMarkdownSyntaxTextSplitter(headers_to_split_on, strip_headers=False)\n", + "md_header_splits = markdown_splitter.split_text(markdown_document)\n", + "md_header_splits" + ] } ], "metadata": { @@ -279,7 +323,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.4" + "version": "3.11.6" } }, "nbformat": 4,