sample1.md
{'Header 1': 'Header 1 from file 1'}
# Header 1 from file 1
Content 1 from file 1
--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 1', 'Header 2': 'Header 2 from file 1'}
## Header 2 from file 1
Content 2 file file 1
More stuff in file 1
* list1.1
* list 2.1
* list 2.2
* list1.2
--------------------------------------------------------------------------------
================================================================================
sample2.md
{'Header 1': 'Header 1 from file 1'}
# Header 1 from file 1
Content 1 from file 1
--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 1', 'Header 2': 'Header 2 from file 1'}
## Header 2 from file 1
Content 2 file file 1
More stuff in file 1
* list1.1
* list 2.1
* list 2.2
* list1.2
--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 2'}
# Header 1 from file 2
Content 1 from file 2
--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 2', 'Header 2': 'Header 2 from file 2'}
## Header 2 from file 2
Content 2 file file 2
More stuff in file 2
1. list1.1
1. list 2.1
2. list 2.2
1. list1.2
--------------------------------------------------------------------------------
================================================================================
Description
I was testing the ExperimentalMarkdownSyntaxTextSplitter class because of whitespace-handling issues in MarkdownHeaderTextSplitter. I noticed that the class mixes up text between successive split_text calls: in the output above, the chunks returned for sample2.md also contain the chunks from sample1.md.
I do not believe this is intended. Please find the attached zip to reproduce the issue. Happy to help fix the issue.
Let me know if there are stable alternatives for splitting by markdown headers in the meantime.
System Info
python -m langchain_core.sys_info
System Information
OS: Darwin
OS Version: Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:46 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6031
Python Version: 3.12.4 (main, Jun 6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]
@SuryaThiru Thank you for highlighting this intriguing issue. We are students from the University of Toronto and would be delighted to look into it further.
@SuryaThiru We’d like to propose modifying the split_text method to reset relevant attributes at the start of each invocation. This change will ensure that each call processes input independently without carrying over any previous state.
We would appreciate any feedback from the community on this approach. We are looking forward to your thoughts!
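For illustration, the proposal could look roughly like the sketch below. ToyHeaderSplitter and its reset_on_call flag are hypothetical names made up for this example; this is not the actual ExperimentalMarkdownSyntaxTextSplitter code, and it assumes the leak comes from a chunk list stored on the instance.

```python
from __future__ import annotations


class ToyHeaderSplitter:
    """Hypothetical stand-in for a stateful splitter (not the LangChain class)."""

    def __init__(self, reset_on_call: bool = True) -> None:
        # reset_on_call=False mimics the reported behaviour, where chunks
        # from earlier calls stay on the instance.
        self.reset_on_call = reset_on_call
        self.chunks: list[str] = []

    def split_text(self, text: str) -> list[str]:
        if self.reset_on_call:
            # Proposed change: clear per-call state before processing input.
            self.chunks = []

        current: list[str] = []
        for line in text.splitlines():
            # A header line closes the chunk collected so far.
            if line.startswith("#") and current:
                self.chunks.append("\n".join(current))
                current = []
            current.append(line)
        if current:
            self.chunks.append("\n".join(current))

        # Return a copy so callers cannot mutate internal state.
        return list(self.chunks)
```

With reset_on_call=False, a second split_text call returns the first document's chunks followed by the second's, which matches the shape of the output reported above. With the reset in place, each call returns only the chunks of its own input.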
Thank you for your prompt feedback! We agree that unit tests for multi-document runs are essential for validating the fix and preventing regressions. :)
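A multi-document regression check might be sketched as follows. The helper name calls_are_independent and the make_splitter factory argument are assumptions for this example, not part of LangChain. The check only relies on the splitter exposing split_text and on its return values supporting equality comparison.

```python
from __future__ import annotations

from typing import Any, Callable, Sequence


def calls_are_independent(
    make_splitter: Callable[[], Any],
    documents: Sequence[str],
) -> bool:
    """Return True if reusing one splitter gives the same result as fresh ones.

    make_splitter is a hypothetical zero-argument factory that builds a new
    splitter. Pass at least two documents: the first call on a new instance
    always matches, so leaks only show up from the second call onwards.
    """
    shared = make_splitter()
    for text in documents:
        # Baseline: a brand-new instance has no state from earlier calls.
        expected = make_splitter().split_text(text)
        # Under test: the same instance reused across all documents.
        actual = shared.split_text(text)
        if actual != expected:
            return False
    return True
```

In a test suite this could be wrapped in a single assertion over the two sample files from the attached zip, so that a reintroduced state leak fails the test immediately.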
Attached files: Files.zip