Hi there, please could someone help me optimise Docling for a low-resource machine? At the moment, whilst very accurate, PDF parsing takes 5 minutes with all the default settings. How can I speed this up? Thank you!
Answered by
cau-git
Nov 5, 2024
@timif2 Good to see this question coming up 😃. There are several things you can do to improve performance, depending on your use case. The pipeline features, ordered from most expensive to cheapest, are: OCR, table structure recognition, PDF parsing. My recommendations are:
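Full API code sample:

from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

input_doc_path = "document.pdf"  # placeholder path, not from the original post

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # pick what you need
pipeline_options.do_table_structure = False  # pick what you need

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options, backend=DoclingParseV2DocumentBackend)  # switch to beta PDF backend
    }
)
conv_result = doc_converter.convert(input_doc_path)
print(conv_result.document.export_to_markdown())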
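To check which of these settings actually matters on a given machine, it can help to time the same conversion under different configurations. Below is a minimal sketch using only the Python standard library; `time_call` and `compare_timings` are hypothetical helper names introduced here for illustration, not part of Docling.

```python
import time
from typing import Any, Callable, Dict, Tuple


def time_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def compare_timings(candidates: Dict[str, Callable[..., Any]], *args: Any) -> Dict[str, float]:
    """Time each named callable on the same arguments; return name -> seconds."""
    return {name: time_call(fn, *args)[1] for name, fn in candidates.items()}
```

For instance, one could build two `DocumentConverter` instances with different `PdfPipelineOptions` and pass their `convert` methods to `compare_timings` together with the same input path. The first call may include one-time setup cost, so repeating the measurement is advisable before drawing conclusions.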
Answer selected by
timif2
Recommendations, as far as they survive in the truncated page preview:
- --no-ocr
- DoclingParseV2DocumentBackend (beta), which speeds up PDF loading by ~10x, with good impact o…