Docling 2.10.0: Performance Degradation When Reading Large PDF Files #568
Comments
Please note that proper GPU support is not yet released, but it is coming very soon. To understand your timings, let us propose another experiment. Please add the following to your Python code:
from docling.datamodel.settings import settings
settings.debug.profile_pipeline_timings = True
Then you will get internal profiling output in the conversion result:
result = converter.convert(input_doc)
doc = result.document
print(result.timings)
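Once profiling output is available, a per-stage summary makes it easier to see where a large file spends its time. The sketch below is a generic helper and not part of Docling: it assumes the timings have first been copied into a plain dict mapping a stage name to a list of durations in seconds. The stage names and numbers in the example are made up for illustration.

```python
from statistics import mean


def summarize_timings(timings):
    """Return (stage, total_s, mean_s, calls) rows, slowest stage first.

    `timings` is assumed to map a stage name to a list of durations in
    seconds; this layout is an assumption made for the sketch.
    """
    rows = []
    for stage, durations in timings.items():
        if not durations:
            continue  # skip stages that never ran
        rows.append((stage, sum(durations), mean(durations), len(durations)))
    # Sort by total time so the dominant stage is listed first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Made-up numbers, only to show the output shape.
example = {
    "ocr": [2.0, 4.0, 3.0],
    "layout": [0.5, 0.5, 0.5],
    "page_parse": [0.25, 0.25],
}
summary = summarize_timings(example)
for stage, total, avg, calls in summary:
    print(f"{stage:<12} total={total:7.2f}s  mean={avg:6.2f}s  calls={calls}")
```

Sorting by total time rather than mean time highlights stages that are cheap per page but run on every page, which matters most for large documents.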
@langzichai Is there any chance you could share the bigger file?
Thanks for the reply, I will test it.
Sorry, the file can't be shared at the moment.
@dolfim-ibm
@langzichai Several improvements, especially for GPU acceleration and layout processing, have been released since your report. Would you mind checking again with docling==2.14.0?
OK, I will test again. Thanks for the help!
Describe the problem
With the latest version of Docling (2.10.0), the following was observed regarding PDF read speed:
For small files (e.g. 3 MB), the improvement is significant: read speed increases by about 30%.
For large files (e.g. more than 60 MB), however, read speed has not improved; processing time is essentially the same as in the previous version (2.9.0).
GPU: V100 32 GB
A solution, or guidance on how to tune performance for large files, would be appreciated!
Version      File size   OCR setting   Time
< v2.10.0    3 MB        default ocr   66 s
< v2.10.0    63 MB       default ocr   1059.538206 s
v2.10.0      3 MB        default ocr   43.967382 s
v2.10.0      63 MB       default ocr   1053 s
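To make the comparison concrete, the reported timings can be turned into a percentage change and a per-megabyte cost. This is a minimal sketch using only the numbers from the report; it assumes that "3M" and "63M" mean 3 MB and 63 MB, and the helper names are illustrative.

```python
# Timings copied from the report, in seconds.
# Assumption: "3M" and "63M" mean 3 MB and 63 MB.
reported = {
    3: {"before": 66.0, "after": 43.967382},
    63: {"before": 1059.538206, "after": 1053.0},
}


def percent_faster(before_s, after_s):
    """Reduction in wall-clock time, as a percentage of the old time."""
    return (before_s - after_s) / before_s * 100.0


def seconds_per_mb(size_mb, seconds):
    """Processing cost normalized by file size."""
    return seconds / size_mb


for size_mb, t in reported.items():
    gain = percent_faster(t["before"], t["after"])
    cost = seconds_per_mb(size_mb, t["after"])
    print(f"{size_mb} MB: {gain:.1f}% faster, {cost:.1f} s/MB on v2.10.0")
```

On these figures the small file is about a third faster, while the large file changes by less than 1%. File size is only a rough proxy for page count, so the per-megabyte cost should be read as indicative.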
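from datetime import datetime
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def convert(filepath):
    start_time = datetime.now()
    print(f"start_time : {start_time}")
    input_doc = Path(filepath)

    # Enable OCR on the GPU; table structure recognition is disabled.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options.use_gpu = True
    pipeline_options.do_table_structure = False
    pipeline_options.table_structure_options.do_cell_matching = False

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )

    # The measured interval covers converter setup, conversion, and export.
    doc = converter.convert(input_doc).document
    md = doc.export_to_text()
    end_time = datetime.now()
    print(f"end_time : {end_time}")
    print(f"Total processing time: {end_time - start_time}")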
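The function above measures a single interval that covers both building the converter and running the conversion. Whether setup contributes meaningfully here is not stated in the report; the sketch below only shows one way to time the two steps separately with a monotonic clock. `build_converter` and `run_conversion` are hypothetical stand-ins so that the example runs on its own, and they would need to be replaced by the real calls.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the real steps, so the sketch is self-contained.
def build_converter():
    time.sleep(0.01)  # stands in for constructing the converter
    return "converter"


def run_conversion(converter, filepath):
    time.sleep(0.02)  # stands in for converting the document
    return f"{converter} processed {filepath}"


converter, setup_s = timed(build_converter)
text, convert_s = timed(run_conversion, converter, "example.pdf")
print(f"setup: {setup_s:.3f}s, conversion: {convert_s:.3f}s")
```

`time.perf_counter()` is used instead of `datetime.now()` because it is monotonic and intended for measuring intervals.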