Docling 2.10.0: Performance Degradation When Reading Large PDF Files #568

Open
langzichai opened this issue Dec 11, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@langzichai
langzichai commented Dec 11, 2024

Describe the problem
With the latest version of Docling (2.10.0), I observed the following when measuring PDF read speed:

For small files (e.g. 3 MB), the improvement is significant: read speed increases by about 30%.
For large files (e.g. over 60 MB), read speed has not improved. Processing time is the same as in the previous version (2.9.0).

GPU: V100 32 GB

A solution or guidance on how to tune performance for large file scenarios would be appreciated!

< v2.10.0
3 MB file, default OCR: 66 s
63 MB file, default OCR: 1059.538206 s

v2.10.0
3 MB file, default OCR: 43.967382 s
63 MB file, default OCR: 1053 s
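
To put the timings above in relative terms, here is a minimal sketch that computes the percentage reduction in processing time for each file. The helper name `percent_reduction` is made up for illustration, and the unlabeled 43.967382 value is assumed to be in seconds like the others.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Return the relative reduction in processing time, in percent."""
    return (before_s - after_s) / before_s * 100.0


# Timings reported in this issue, in seconds
# (seconds are assumed for the value that was posted without a unit).
small_before, small_after = 66.0, 43.967382
large_before, large_after = 1059.538206, 1053.0

small_gain = percent_reduction(small_before, small_after)
large_gain = percent_reduction(large_before, large_after)

print(f"3 MB file:  {small_gain:.1f}% less time")
print(f"63 MB file: {large_gain:.1f}% less time")
```

With these numbers the small file comes out at roughly a third less time, while the large file changes by under 1%, which is within what run-to-run variation could plausibly explain.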

from datetime import datetime
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def convert(filepath):
    start_time = datetime.now()
    print(f"start_time : {start_time}")
    input_doc = Path(filepath)

    # OCR on (GPU requested), table structure recognition off
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options.use_gpu = True
    pipeline_options.do_table_structure = False
    pipeline_options.table_structure_options.do_cell_matching = False

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )
    doc = converter.convert(input_doc).document
    md = doc.export_to_text()

    end_time = datetime.now()
    print(f"end_time : {end_time}")
    print(f"Total processing time: {end_time - start_time}")

@langzichai langzichai added the bug Something isn't working label Dec 11, 2024
@dolfim-ibm
Contributor

Please note that proper GPU support is not yet released, but it is coming very soon.

To understand your timings, let us propose another experiment.

Please add the following to your Python code.

from docling.datamodel.settings import settings
settings.debug.profile_pipeline_timings=True

The internal profiling output is then available on the conversion result:

result = converter.convert(input_doc)
doc = result.document
print(result.timings)
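
Printing the raw timings can be hard to read for a large document, so a small summary may help. The sketch below ranks stages by total time. It assumes `result.timings` behaves like a mapping from a stage name to an object with a `times` list of per-call durations in seconds; that shape is an assumption here, so adjust the attribute access to whatever your Docling version actually returns. The function name `summarize_timings` and the stage names in the stand-in data are hypothetical.

```python
from statistics import mean
from types import SimpleNamespace


def summarize_timings(timings):
    """Return (name, total_s, calls, mean_s) rows, slowest stage first.

    Assumes each value exposes a `times` list of durations in seconds.
    Stages with no recorded calls are skipped.
    """
    rows = []
    for name, item in timings.items():
        times = list(item.times)
        if not times:
            continue
        rows.append((name, sum(times), len(times), mean(times)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Stand-in data with made-up stage names and numbers, for illustration only.
# With a real conversion you would pass result.timings instead.
fake_timings = {
    "layout": SimpleNamespace(times=[0.5, 0.5]),
    "ocr": SimpleNamespace(times=[2.0, 3.0, 4.0]),
    "page_parse": SimpleNamespace(times=[]),
}

for name, total, calls, avg in summarize_timings(fake_timings):
    print(f"{name:12s} total={total:8.2f}s calls={calls:5d} mean={avg:6.2f}s")
```

Sorting by total time rather than mean time puts the stage that dominates a long run at the top, which is usually the first thing to look at for a 60 MB file.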

@PeterStaar-IBM
Contributor

@langzichai Is there any chance you could share the bigger file?

@langzichai
Author

Thanks for the reply. I will test it.

@langzichai
Author

> @langzichai Is there any chance you could share the bigger file?

Sorry, the file can't be shared at the moment

@langzichai
Author

langzichai commented Dec 12, 2024

@dolfim-ibm Attached is the profiling output from result.timings:
timings.txt

@cau-git
Contributor

cau-git commented Dec 18, 2024

@langzichai Several improvements, especially for GPU acceleration and layout processing, have been released since your last report. Would you mind checking again with docling==2.14.0?
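
Before re-running the benchmark, it can be worth confirming which version is actually installed in the environment. This is a minimal sketch using the standard library's `importlib.metadata`; the helper names are made up for illustration, and the version parser only handles plain numeric releases such as `2.14.0`, not pre-release suffixes.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.14.0' into (2, 14, 0). Handles plain numeric releases only."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, required="2.14.0"):
    """Compare versions numerically, so that 2.9.0 sorts below 2.10.0."""
    return parse_version(installed) >= parse_version(required)


def installed_docling_version():
    """Return the installed docling version string, or None if absent."""
    try:
        return metadata.version("docling")
    except metadata.PackageNotFoundError:
        return None


version = installed_docling_version()
if version is None:
    print("docling is not installed in this environment")
else:
    print(f"docling {version}: at least 2.14.0 = {meets_minimum(version)}")
```

Comparing integer tuples instead of raw strings avoids the pitfall where "2.9.0" would sort after "2.10.0" lexicographically.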

@langzichai
Author

Ok, I will test again, thanks for the help!
