ExperimentalMarkdownSyntaxTextSplitter mixes text between split_text calls #26440

Open
5 tasks done
SuryaThiru opened this issue Sep 13, 2024 · 4 comments · May be fixed by #28373
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@SuryaThiru

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import os

from langchain_text_splitters.markdown import ExperimentalMarkdownSyntaxTextSplitter


# A single splitter instance is created here and reused for every file below.
splitter = ExperimentalMarkdownSyntaxTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
    ],
    strip_headers=False,
    return_each_line=False,
)



for file in sorted(os.listdir("testdata")):
    print(file)
    with open(f"testdata/{file}", "r") as f:
        text = f.read()

    splits = splitter.split_text(text)

    for split in splits:
        print(split.metadata)
        print(split.page_content)
        print('-'*80)

    print('='*80)
    print()

Files

Files.zip

Error Message and Stack Trace (if applicable)

Output

sample1.md
{'Header 1': 'Header 1 from file 1'}
# Header 1 from file 1

Content 1 from file 1


--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 1', 'Header 2': 'Header 2 from file 1'}
## Header 2 from file 1

Content 2 file file 1


More stuff in file 1

* list1.1
    * list 2.1
    * list 2.2
* list1.2
--------------------------------------------------------------------------------
================================================================================

sample2.md
{'Header 1': 'Header 1 from file 1'}
# Header 1 from file 1

Content 1 from file 1


--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 1', 'Header 2': 'Header 2 from file 1'}
## Header 2 from file 1

Content 2 file file 1


More stuff in file 1

* list1.1
    * list 2.1
    * list 2.2
* list1.2
--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 2'}
# Header 1 from file 2

Content 1 from file 2


--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 2', 'Header 2': 'Header 2 from file 2'}
## Header 2 from file 2

Content 2 file file 2


More stuff in file 2

1. list1.1
    1. list 2.1
    2. list 2.2
1. list1.2
--------------------------------------------------------------------------------
================================================================================

Description

I was testing the ExperimentalMarkdownSyntaxTextSplitter class because of whitespace-handling issues in MarkdownHeaderTextSplitter. I noticed that the class mixes text between successive split_text calls on the same instance: in the output above, the splits printed for sample2.md begin with the chunks from sample1.md.

I do not believe this is intended. Please find the attached zip to reproduce the issue. Happy to help fix the issue.

Let me know if there are stable alternatives to achieve splitting by markdown headers in the meantime.
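One possible stopgap, assuming the carried-over state lives on the splitter instance (which the output suggests but this report does not confirm), is to build a new splitter for every document. The sketch below is self-contained: `LeakySplitter` and `split_independently` are hypothetical names, and `LeakySplitter` is only a toy stand-in that models the reported behavior, not the LangChain class.

```python
from typing import Callable, List


class LeakySplitter:
    """Hypothetical stand-in that models the reported behavior:
    chunks accumulate on the instance and are never cleared."""

    def __init__(self) -> None:
        self._chunks: List[str] = []

    def split_text(self, text: str) -> List[str]:
        # Split on blank lines and append to the instance-level list.
        self._chunks.extend(p.strip() for p in text.split("\n\n") if p.strip())
        return list(self._chunks)


def split_independently(
    make_splitter: Callable[[], LeakySplitter], texts: List[str]
) -> List[List[str]]:
    """Create a fresh splitter per document so no state can carry over."""
    return [make_splitter().split_text(text) for text in texts]


docs = ["# File 1\n\nContent 1", "# File 2\n\nContent 2"]

# Reusing one instance: the second result also contains the file 1 chunks.
shared = LeakySplitter()
reused = [shared.split_text(d) for d in docs]

# Fresh instance per document: each result contains only its own chunks.
fresh = split_independently(LeakySplitter, docs)
```

With the real library, the factory would presumably construct ExperimentalMarkdownSyntaxTextSplitter with the same arguments as in the example code; that substitution is untested here.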

System Info

python -m langchain_core.sys_info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:46 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6031
Python Version: 3.12.4 (main, Jun 6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]

Package Information

langchain_core: 0.2.38
langchain: 0.2.16
langchain_community: 0.2.16
langsmith: 0.1.114
langchain_text_splitters: 0.2.4

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.5
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.7
packaging: 24.1
pydantic: 2.8.2
PyYAML: 6.0.2
requests: 2.32.3
SQLAlchemy: 2.0.34
tenacity: 8.5.0
typing-extensions: 4.12.2

@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Sep 13, 2024
@chkaty

chkaty commented Oct 13, 2024

@SuryaThiru Thank you for highlighting this intriguing issue. We are students from the University of Toronto and would be delighted to look into it further.

@chkaty

chkaty commented Oct 15, 2024

@SuryaThiru We’d like to propose modifying the split_text method to reset relevant attributes at the start of each invocation. This change will ensure that each call processes input independently without carrying over any previous state.

We would appreciate any feedback from the community on this approach. We are looking forward to your thoughts!
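The reset-on-entry idea proposed above could look roughly like the following. This is a minimal sketch on a toy class, not the LangChain implementation: `ResettingHeaderSplitter`, `_reset`, `_flush`, and the attribute names are all hypothetical, and the toy only recognizes level-1 headers.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[Dict[str, str], str]  # (metadata, page content)


class ResettingHeaderSplitter:
    """Hypothetical toy splitter that keeps per-document state on the instance."""

    def __init__(self) -> None:
        self._reset()

    def _reset(self) -> None:
        # All per-document state is created (or recreated) in one place.
        self.chunks: List[Chunk] = []
        self.current_header: Dict[str, str] = {}
        self.current_lines: List[str] = []

    def _flush(self) -> None:
        # Close the chunk collected so far, if any.
        if self.current_lines:
            content = "\n".join(self.current_lines)
            self.chunks.append((dict(self.current_header), content))
            self.current_lines = []

    def split_text(self, text: str) -> List[Chunk]:
        self._reset()  # the proposed change: every call starts from clean state
        for line in text.splitlines():
            if line.startswith("# "):
                self._flush()
                self.current_header = {"Header 1": line[2:].strip()}
            self.current_lines.append(line)
        self._flush()
        return self.chunks
```

Because `_reset` rebinds `self.chunks` to a new list rather than clearing the old one in place, a list returned by an earlier call is not mutated by later calls.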

@SuryaThiru
Author

Yes, might also be worth adding more unit tests covering multi-document runs.
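A multi-document regression test could take roughly this shape: compare what a reused splitter returns for each document against what a fresh instance returns for the same document. The sketch uses only the standard library; `ToySplitter`, `outputs_are_independent`, and the test class are hypothetical names, and the toy splitter stands in for the real class.

```python
import unittest
from typing import Callable, List


class ToySplitter:
    """Hypothetical splitter used only to exercise the test."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    def split_text(self, text: str) -> List[str]:
        self.chunks = []  # state is reset on every call
        self.chunks.extend(p for p in text.split("\n\n") if p)
        return self.chunks


def outputs_are_independent(
    make_splitter: Callable[[], ToySplitter], docs: List[str]
) -> bool:
    """Return True if a reused splitter matches a fresh one for every document."""
    shared = make_splitter()
    for doc in docs:
        reused = list(shared.split_text(doc))
        fresh = list(make_splitter().split_text(doc))
        if reused != fresh:
            return False
    return True


class MultiDocumentTest(unittest.TestCase):
    def test_second_call_does_not_see_first_document(self) -> None:
        docs = [
            "# Header 1 from file 1\n\nContent 1 from file 1",
            "# Header 1 from file 2\n\nContent 1 from file 2",
        ]
        self.assertTrue(outputs_are_independent(ToySplitter, docs))
```

The same helper could be pointed at a factory for the real splitter; comparing against a fresh instance avoids hard-coding expected chunks for each test file.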

@chkaty

chkaty commented Oct 24, 2024

Thank you for your prompt feedback! We agree that adding unit tests for multi-document runs would be essential in validating the solution and preventing future issues. :)
