Docling 2.10.0: Performance Degradation When Reading Large PDF Files #568

langzichai · 2024-12-11T09:53:13Z

Describe the problem
With the latest version of Docling (2.10.0), I observed the following PDF read performance:

For small files (e.g. 3M), the improvement is significant: reading is about 30% faster.
For large files (e.g. more than 60M), read speed has not improved; processing is as slow as in the previous version (2.9.0).

GPU: V100 32 GB

A solution or guidance on how to tune performance for large file scenarios would be appreciated!

< v2.10.0
3M, default OCR: 66 s
63M, default OCR: 1059.538206 s

v2.10.0
3M, default OCR: 43.967382 s
63M, default OCR: 1053 s

from datetime import datetime
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def convert(filepath):
    start_time = datetime.now()
    print(f"start_time : {start_time}")
    input_doc = Path(filepath)

    # OCR enabled (GPU requested); table structure recognition disabled.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options.use_gpu = True
    pipeline_options.do_table_structure = False
    pipeline_options.table_structure_options.do_cell_matching = False

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )

    doc = converter.convert(input_doc).document
    md = doc.export_to_text()

    end_time = datetime.now()
    print(f"end_time : {end_time}")
    print(f"Total processing time: {end_time - start_time}")
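
To put the reported timings side by side, here is a minimal sketch that computes the relative reduction in processing time for each file size. The helper name `percent_reduction` is hypothetical; the input numbers are the ones reported in this post.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Return how much shorter `after_s` is than `before_s`, in percent."""
    return (before_s - after_s) / before_s * 100.0


# Timings reported in this issue, in seconds (before 2.10.0 vs 2.10.0).
small = percent_reduction(66.0, 43.967382)
large = percent_reduction(1059.538206, 1053.0)

print(f"3M file:  {small:.1f}% less time")
print(f"63M file: {large:.1f}% less time")
```

On these numbers the 3M file takes roughly a third less time, while the 63M file changes by well under 1%, which is within what run-to-run noise could plausibly explain.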


dolfim-ibm · 2024-12-11T11:38:29Z

Please note that proper GPU support is not yet released, but it is coming very soon.

To understand your timings, let us propose another experiment.

Please add the following in your python code.

from docling.datamodel.settings import settings
settings.debug.profile_pipeline_timings = True

Then you will get internal profiling output in

result = converter.convert(input_doc)
doc = result.document
print(result.timings)
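
Once the timings are available, it can help to rank stages by total time to see where a large file spends most of its run. The sketch below works on a plain mapping from stage name to a list of per-call durations in seconds. How to extract such lists from `result.timings` is an assumption (for example, a per-entry list of durations); the thread does not show the structure. The stage names and numbers in `example` are hypothetical.

```python
from statistics import mean
from typing import Mapping, Sequence


def summarize_timings(
    timings: Mapping[str, Sequence[float]],
) -> list[tuple[str, float, int, float]]:
    """Return (stage, total_s, calls, mean_s) rows, slowest stage first."""
    rows = []
    for name, samples in timings.items():
        if not samples:
            continue  # skip stages that recorded no calls
        rows.append((name, sum(samples), len(samples), mean(samples)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical stage names and durations, for illustration only.
example = {
    "layout": [0.5, 0.7],
    "ocr": [2.0, 3.0, 1.0],
    "page_parse": [0.1],
}

for name, total, calls, avg in summarize_timings(example):
    print(f"{name:<12} total={total:7.2f}s  calls={calls:4d}  mean={avg:6.3f}s")
```

Sorting by total rather than mean time is deliberate: a stage that is cheap per page can still dominate a document with many pages.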

PeterStaar-IBM · 2024-12-11T13:53:08Z

@langzichai Is there any chance you could share the bigger file?

langzichai · 2024-12-12T00:59:16Z

Thanks for the reply. I will test it.

langzichai · 2024-12-12T01:04:53Z

> @langzichai Is there any chance you could share the bigger file?

Sorry, the file can't be shared at the moment.

langzichai · 2024-12-12T02:22:58Z

> Please note that proper GPU support is not yet released, but it is coming very soon.
>
> To understand your timings, let us propose another experiment.
>
> Please add the following in your python code.
> from docling.datamodel.settings import settings
> settings.debug.profile_pipeline_timings=True
> Then you will get internal profiling output in
> result = converter.convert(input_doc)
> doc = result.document
> print(result.timings)

@dolfim-ibm
Attached is the timings output: timings.txt

cau-git · 2024-12-18T10:35:41Z

@langzichai Several improvements, especially for GPU acceleration and layout processing, have been released since your last report. Would you mind checking again with docling==2.14.0?

langzichai · 2024-12-18T10:57:50Z

Ok, I will test again, thanks for the help!

langzichai added the "bug" (Something isn't working) label on Dec 11, 2024
