Original content of the Markdown document is something like:
# ABCDEFG
- abc:
  - abc123:
    - abc1234:
      - abc12345:
        - a.
        - b.
      - abcd1234:
        - abcd12345:
          - a.
          - b.
- def:
  - def1234:
    - def12345。
- ghijkl
Here's the conversion process:
$ docling --from md --to md -vv /data/doc/test2.md
DEBUG:docling.backend.md_backend:MD INIT!!!
DEBUG:docling.backend.md_backend:# ABCDEFG
- abc:
- abc123:
- abc1234:
- abc12345:
- a.
- b.
- abcd1234:
- abcd12345:
- a.
- b.
- def:
- def1234:
- def12345.
- ghijkl
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document test2.md
DEBUG:docling.backend.md_backend:converting Markdown...
DEBUG:docling.backend.md_backend:Some other element: <Document children=[<Heading children=[<RawText children='ABCDEFG'>]>,
<BlankLine children=[]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc123:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc1234:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc12345:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='abcd1234:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abcd12345:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>]>]>]>]>]>]>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='def:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='def1234:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='def12345.'>]>]>]>]>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='ghijkl'>]>]>]>]>
DEBUG:docling.backend.md_backend: - Heading level 1, content: ABCDEFG
DEBUG:docling.backend.md_backend:Some other element: <BlankLine children=[]>
DEBUG:docling.backend.md_backend: - List unordered
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
INFO:docling.document_converter:Finished converting document test2.md in 2.19 sec.
INFO:docling.cli.main:writing Markdown output to test2.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 2.19 seconds.
And here's the final result I got:
$ cat test2.md
# ABCDEFG
- abc:
- def:
- ghijkl
I also tried converting this document with the Python library, but I got the same output.
In the final result, a lot of content is missing. Did I do anything wrong?
PS: I know that inputting and outputting Markdown might be unnecessary, but in my application scenario, I'm not sure in what format users will provide their content. I need to be able to convert various content formats into Markdown.
A similar issue happens with inline code delimited by backticks (`).
Converting the following Markdown file and exporting it back to Markdown using DocumentConverter().convert("file.md").document.export_to_markdown() results in docling cutting off each list item's text at the first backtick.
Input:
# Contributing
1. Pull the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
Exported Markdown:
# Contributing
- Pull the repository
- Create your feature branch (
- Commit your changes (
- Push to the branch (
- Open a Pull Request
I did a little debugging, and this seems to stem from md_backend. The handling for marko.block.ListItem only considers the first child, ignoring all other children of the ListItem (including nested List elements).
In my example above, element.children[0] is a Paragraph containing multiple RawText and CodeSpan children. Reading element.children[0].children[0] picks up only the first RawText child and ignores the rest of the Paragraph.
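To make the difference concrete, here is a minimal sketch of the two traversal strategies. It uses a hypothetical stand-in `Node` class rather than marko's real element classes or docling's actual backend code, and the helper names (`first_child_only`, `inline_text`, `walk_list`) are my own. It only assumes what the debug dump shows: a ListItem holds a Paragraph plus, optionally, a nested List, and a Paragraph holds several inline children.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple, Union


@dataclass
class Node:
    """Hypothetical stand-in for a parsed Markdown element (not marko's class)."""
    kind: str
    children: Union[str, List["Node"]] = field(default_factory=list)


def first_child_only(lst: Node) -> List[str]:
    """Mimics the reported behaviour: read children[0].children[0] per item."""
    return [item.children[0].children[0].children for item in lst.children]


def inline_text(paragraph: Node) -> str:
    """Join every inline child; code spans are re-wrapped in backticks."""
    parts = []
    for child in paragraph.children:
        text = child.children
        parts.append(f"`{text}`" if child.kind == "CodeSpan" else text)
    return "".join(parts)


def walk_list(lst: Node, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Visit every child of every ListItem, descending into nested lists."""
    for item in lst.children:
        for child in item.children:
            if child.kind == "Paragraph":
                yield depth, inline_text(child)
            elif child.kind == "List":
                yield from walk_list(child, depth + 1)


# A small tree combining both symptoms: inline code and a nested list.
tree = Node("List", [
    Node("ListItem", [Node("Paragraph", [
        Node("RawText", "Create your feature branch ("),
        Node("CodeSpan", "git checkout -b feature/AmazingFeature"),
        Node("RawText", ")"),
    ])]),
    Node("ListItem", [
        Node("Paragraph", [Node("RawText", "abc:")]),
        Node("List", [
            Node("ListItem", [Node("Paragraph", [Node("RawText", "abc123:")])]),
        ]),
    ]),
])

truncated = first_child_only(tree)
complete = list(walk_list(tree))

for depth, text in complete:
    print("  " * depth + "- " + text)
```

With the first strategy, the code span and the nested item are both dropped, which matches the truncated output in both reports. The recursive walk keeps them, so a fix along these lines would presumably need to handle all children of a ListItem and all inline children of its Paragraph.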
Bug
Convert Markdown document error.
...