Processing of TOC objects in Word Documents DOCX fails #627

w0o · 2024-12-19T03:36:46Z

Bug

We are seeing odd behavior: processing a table of contents (TOC) in a Word document fails silently, with no errors, and the resulting document is missing the content that was originally in the TOC.
What we have tried:

Using OCR does not help: the TOC content is still omitted.
Exporting to PDF (using Word) and then using docling to convert to markdown works as expected with no content omissions.

Steps to reproduce

Use the attached minimal example DOCX file and run:

docling sample.docx

resulting in the attached Markdown file which has the TOC content missing.
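
To see which entries are affected, it may help to list the TOC text stored in the DOCX and compare it by hand against the Markdown output. The sketch below uses only the Python standard library and rests on an assumption this report does not confirm: that the TOC was inserted with Word's built-in feature and is stored as a content control (`w:sdt`) whose `w:docPartGallery` is "Table of Contents". The function name `toc_paragraphs` is hypothetical and is not part of docling.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def toc_paragraphs(docx_file):
    """Return the visible text of each paragraph inside a TOC content control.

    Assumption: the TOC lives in a w:sdt block tagged with the
    "Table of Contents" gallery. A TOC built another way is not detected.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    entries = []
    for sdt in root.iter(f"{{{W}}}sdt"):
        gallery = sdt.find(".//w:docPartGallery", NS)
        if gallery is None or gallery.get(f"{{{W}}}val") != "Table of Contents":
            continue  # some other content control, not a TOC
        for para in sdt.iter(f"{{{W}}}p"):
            parts = []
            # Walk runs only, so tab-stop definitions in w:pPr are ignored.
            for run in para.iter(f"{{{W}}}r"):
                for child in run:
                    if child.tag == f"{{{W}}}t":
                        parts.append(child.text or "")
                    elif child.tag == f"{{{W}}}tab":
                        parts.append(" ")  # tab between title and page number
            text = "".join(parts).strip()
            if text:
                entries.append(text)
    return entries


if __name__ == "__main__":
    # sample.docx is the file attached to this issue.
    for entry in toc_paragraphs("sample.docx"):
        print(entry)
```

If the function returns entries that do not appear in `sample.md`, that would be consistent with the content control being skipped during conversion. An empty result only means the TOC is not stored in the form assumed here.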

Docling version

Docling version: 2.13.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

Python 3.11.11

Note: all shared samples are publicly available documents.
sample.md
sample.docx


w0o added the bug Something isn't working label Dec 19, 2024
