ExperimentalMarkdownSyntaxTextSplitter mixes text between split_text calls #26440

SuryaThiru · 2024-09-13T18:57:37Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_text_splitters.markdown import ExperimentalMarkdownSyntaxTextSplitter, MarkdownHeaderTextSplitter
import os


splitter = ExperimentalMarkdownSyntaxTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
    ],
    strip_headers=False,
    return_each_line=False,
)



for file in sorted(os.listdir("testdata")):
    print(file)
    with open(f"testdata/{file}", "r") as f:
        text = f.read()

    splits = splitter.split_text(text)

    for split in splits:
        print(split.metadata)
        print(split.page_content)
        print('-'*80)

    print('='*80)
    print()

Files

Files.zip

Error Message and Stack Trace (if applicable)

Output

sample1.md
{'Header 1': 'Header 1 from file 1'}
# Header 1 from file 1

Content 1 from file 1


--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 1', 'Header 2': 'Header 2 from file 1'}
## Header 2 from file 1

Content 2 file file 1


More stuff in file 1

* list1.1
    * list 2.1
    * list 2.2
* list1.2
--------------------------------------------------------------------------------
================================================================================

sample2.md
{'Header 1': 'Header 1 from file 1'}
# Header 1 from file 1

Content 1 from file 1


--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 1', 'Header 2': 'Header 2 from file 1'}
## Header 2 from file 1

Content 2 file file 1


More stuff in file 1

* list1.1
    * list 2.1
    * list 2.2
* list1.2
--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 2'}
# Header 1 from file 2

Content 1 from file 2


--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 2', 'Header 2': 'Header 2 from file 2'}
## Header 2 from file 2

Content 2 file file 2


More stuff in file 2

1. list1.1
    1. list 2.1
    2. list 2.2
1. list1.2
--------------------------------------------------------------------------------
================================================================================

Description

I was testing the ExperimentalMarkdownSyntaxTextSplitter class because of whitespace-handling issues in MarkdownHeaderTextSplitter. I noticed that the class mixes text between successive split_text calls on the same instance: in the output above, the splits for sample2.md begin with the sections from sample1.md.

I do not believe this is intended. Please find the attached zip to reproduce the issue. Happy to help fix the issue.

In the meantime, let me know if there are stable alternatives for splitting by markdown headers.
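One possible stopgap, assuming the leaked text is held on the splitter instance (the thread suggests this but does not confirm it), is to build a new splitter for every document instead of reusing one. The helper below is a minimal sketch; `split_independently` and `make_splitter` are hypothetical names, not part of LangChain.

```python
from typing import Any, Callable, Iterable, List


def split_independently(
    make_splitter: Callable[[], Any], texts: Iterable[str]
) -> List[list]:
    """Split each text with a freshly constructed splitter.

    `make_splitter` is any zero-argument callable that returns an object
    with a `split_text(text)` method. Because a new splitter is created
    per text, state stored on one instance cannot reach the next call.
    """
    return [make_splitter().split_text(text) for text in texts]
```

With the example code above, `make_splitter` could be a lambda that constructs `ExperimentalMarkdownSyntaxTextSplitter` with the same `headers_to_split_on`, `strip_headers`, and `return_each_line` arguments.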

System Info

python -m langchain_core.sys_info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:46 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6031
Python Version: 3.12.4 (main, Jun 6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]

Package Information

langchain_core: 0.2.38
langchain: 0.2.16
langchain_community: 0.2.16
langsmith: 0.1.114
langchain_text_splitters: 0.2.4

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.5
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.7
packaging: 24.1
pydantic: 2.8.2
PyYAML: 6.0.2
requests: 2.32.3
SQLAlchemy: 2.0.34
tenacity: 8.5.0
typing-extensions: 4.12.2


chkaty · 2024-10-13T22:58:16Z

@SuryaThiru Thank you for highlighting this intriguing issue. We are students from the University of Toronto and would be delighted to look into it further.

chkaty · 2024-10-15T23:17:19Z

@SuryaThiru We’d like to propose modifying the split_text method to reset relevant attributes at the start of each invocation. This change will ensure that each call processes input independently without carrying over any previous state.

We would appreciate any feedback from the community on this approach. We are looking forward to your thoughts!
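The proposed reset can be illustrated with a toy splitter. This is only a sketch of the idea: `ToyHeaderSplitter` and its attributes `chunks` and `current` are hypothetical and do not mirror the actual attributes of ExperimentalMarkdownSyntaxTextSplitter.

```python
from typing import List


class ToyHeaderSplitter:
    """Hypothetical stand-in for a stateful splitter (not LangChain code)."""

    def __init__(self) -> None:
        self.chunks: List[str] = []
        self.current: List[str] = []

    def _flush(self) -> None:
        # Close the chunk being built, if it has any lines.
        if self.current:
            self.chunks.append("\n".join(self.current))
            self.current = []

    def split_text(self, text: str) -> List[str]:
        # Reset per-call state first so nothing carries over from an
        # earlier call on the same instance.
        self.chunks = []
        self.current = []
        for line in text.splitlines():
            if line.startswith("#"):
                self._flush()  # a header line starts a new chunk
            self.current.append(line)
        self._flush()
        return self.chunks
```

Without the two reset lines at the top of `split_text`, a second call would return the first document's chunks followed by its own, which matches the behaviour reported in this issue.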

SuryaThiru · 2024-10-20T16:07:04Z

Yes, might also be worth adding more unit tests covering multi-document runs.
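A multi-document regression check could compare a reused splitter against a fresh one for each input. The sketch below is an assumption about how such a test might look; `assert_calls_are_independent` is a hypothetical helper, and it relies on the returned splits supporting `==`.

```python
from typing import Any, Callable, Sequence


def assert_calls_are_independent(
    make_splitter: Callable[[], Any], documents: Sequence[str]
) -> None:
    """Fail if reusing one splitter changes the result for any document."""
    shared = make_splitter()
    for index, doc in enumerate(documents):
        # Reference result from a splitter that has seen nothing else.
        expected = make_splitter().split_text(doc)
        # Result from the instance that already processed earlier documents.
        actual = shared.split_text(doc)
        assert actual == expected, (
            f"document {index}: output differs when the splitter is reused"
        )
```

In a test suite this could be called with the contents of the two attached sample files, so that the second call is checked for leftovers from the first.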

chkaty · 2024-10-24T00:37:57Z

Thank you for your prompt feedback! We agree that adding unit tests for multi-document runs would be essential in validating the solution and preventing future issues. :)

dosubot bot added the 🤖:bug label Sep 13, 2024

promptless bot mentioned this issue Nov 25, 2024

Docs update for PR #14 on langchain-test Promptless/langchain-test#15

Open

chkaty linked a pull request Nov 27, 2024 that will close this issue: text-splitters: fix state persistence issue in ExperimentalMarkdownSyntaxTextSplitter #28373 (open, 3 tasks)
