Convert Markdown document incorrect #623

Open
kime541200 opened this issue Dec 18, 2024 · 3 comments
@kime541200

Bug

Converting a Markdown document produces incorrect output.
...

Steps to reproduce

The original content of the Markdown document looks something like this:

# ABCDEFG
- abc:
	- abc123:
		- abc1234:
			- abc12345:
				- a.
				- b.
		- abcd1234:
			- abcd12345:
				- a.
				- b.
- def:
	- def1234:
		- def12345。
- ghijkl

Here's the conversion process:

$ docling --from md --to md -vv /data/doc/test2.md
DEBUG:docling.backend.md_backend:MD INIT!!!
DEBUG:docling.backend.md_backend:# ABCDEFG

- abc:
  - abc123:
    - abc1234:
      - abc12345:
        - a.
        - b.
      - abcd1234:
        - abcd12345:
          - a.
          - b.
- def:
  - def1234:
    - def12345.
- ghijkl
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document test2.md
DEBUG:docling.backend.md_backend:converting Markdown...
DEBUG:docling.backend.md_backend:Some other element: <Document children=[<Heading children=[<RawText children='ABCDEFG'>]>,
 <BlankLine children=[]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc123:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc12345:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='abcd1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abcd12345:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>]>]>]>]>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='def:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='def1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='def12345.'>]>]>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='ghijkl'>]>]>]>]>
DEBUG:docling.backend.md_backend: - Heading level 1, content: ABCDEFG
DEBUG:docling.backend.md_backend:Some other element: <BlankLine children=[]>
DEBUG:docling.backend.md_backend: - List unordered
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
INFO:docling.document_converter:Finished converting document test2.md in 2.19 sec.
INFO:docling.cli.main:writing Markdown output to test2.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 2.19 seconds.

And here's the final result I got:

$ cat test2.md
# ABCDEFG

- abc:
- def:
- ghijkl

I also tried using the Python library to convert this document, but I got the same output.

In the final result, a lot of content is missing from the output. Did I do anything wrong?

PS: I know that inputting and outputting Markdown might be unnecessary, but in my application scenario, I'm not sure in what format users will provide their content. I need to be able to convert various content formats into Markdown.
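
For reference, the debug log above shows that the parser did produce the full nested tree, yet only three "List item" lines were logged, which suggests the nested List children are never visited. Below is a minimal sketch of a recursive walk that would keep them. The `Item` class and `render` function are hypothetical stand-ins for illustration only; they are not docling's or marko's actual types.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-in for a parsed list item: its own text plus an
# optional nested list of further items.
@dataclass
class Item:
    text: str
    sublist: List["Item"] = field(default_factory=list)


def render(items: List[Item], depth: int = 0) -> List[str]:
    """Emit one Markdown bullet per item, recursing into nested lists."""
    lines = []
    for item in items:
        lines.append("  " * depth + "- " + item.text)
        # Descend into the nested list instead of stopping at the top level.
        lines.extend(render(item.sublist, depth + 1))
    return lines


tree = [
    Item("def:", [Item("def1234:", [Item("def12345.")])]),
    Item("ghijkl"),
]
rendered = "\n".join(render(tree))
print(rendered)
```

Dropping the recursive call reproduces the symptom in the final result: only the top-level bullets are emitted.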

Docling version

$ docling --version
Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

$ python --version
Python 3.11.10
@kime541200 kime541200 added the bug Something isn't working label Dec 18, 2024
@Heremeus

A similar issue occurs with inline code delimited by backticks (`).

Converting the following Markdown file and exporting it back to Markdown using DocumentConverter().convert("file.md").document.export_to_markdown() results in docling cutting off each list item's text at the first `

Input:

# Contributing

1. Pull the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

Exported Markdown:

 # Contributing

- Pull the repository
- Create your feature branch (
- Commit your changes (
- Push to the branch (
- Open a Pull Request

@Heremeus

Heremeus commented Dec 19, 2024

I did a little debugging, and it seems this stems from the md_backend. The handling for marko.block.ListItem only considers the first child, ignoring any other children of the ListItem.

snippet_text = str(element.children[0].children[0].children)

In my example above, element.children[0] is a Paragraph containing multiple RawText and CodeSpan children. element.children[0].children[0] only uses the first RawText child and ignores the rest of the Paragraph.
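
To illustrate the difference, here is a minimal sketch that joins the text of every inline child instead of reading only the first. The classes below are hypothetical stand-ins that mirror the shape seen in the debug output (inline nodes hold a string in `.children`, block nodes hold a list); they are not marko's real classes, and `inline_text` is an illustrative helper, not the actual fix in docling.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical stand-ins mirroring the parse tree shape from the debug log.
@dataclass
class RawText:
    children: str


@dataclass
class CodeSpan:
    children: str


@dataclass
class Paragraph:
    children: List[Union[RawText, CodeSpan]]


@dataclass
class ListItem:
    children: List[Paragraph]


def inline_text(node) -> str:
    """Concatenate the text of every inline child, not just the first one."""
    if isinstance(node.children, str):
        text = node.children
        # Re-wrap code spans in backticks so they survive a Markdown round trip.
        return f"`{text}`" if type(node).__name__ == "CodeSpan" else text
    return "".join(inline_text(child) for child in node.children)


item = ListItem([Paragraph([
    RawText("Create your feature branch ("),
    CodeSpan("git checkout -b feature/AmazingFeature"),
    RawText(")"),
])])

# What the quoted line reads: only the first RawText of the first Paragraph.
first_only = str(item.children[0].children[0].children)
# Reading all inline children of the Paragraph keeps the code span and the ")".
full_text = inline_text(item.children[0])
print(first_only)
print(full_text)
```

With the stand-in tree, `first_only` stops at the opening parenthesis, matching the truncated export shown earlier, while `full_text` keeps the whole line.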

@maxmnemonic
Contributor

maxmnemonic commented Dec 19, 2024

@kime541200 @Heremeus Thanks for these findings, I will look into this issue!
