Speed for low resource machine #245

timif2 · 2024-11-05T11:46:02Z

Hi there,

Please could someone help me optimise Docling for a low-resource machine? At the moment, whilst very accurate, PDF parsing takes 5 minutes with all the default settings. How can I speed this up?

Thank you!

Answered by cau-git

cau-git · 2024-11-05T15:47:19Z

Maintainer

@timif2 Good to see this question coming up 😃 .

There are several things you can do to improve the performance, depending on the use case you have. The pipeline features, ordered from most expensive to cheapest: OCR, table structure recognition, PDF parsing. My recommendations are:

- Turn off OCR if you don't need it for your data (e.g. you bring digital-only PDFs).
  - CLI option: --no-ocr
- Turn off table structure recognition if you don't need table structure (e.g. your PDFs have no tables or you don't need the tables' content).
  - Only possible in Python API code, see below.
- Switch the PDF backend to DoclingParseV2DocumentBackend (beta), which speeds up PDF loading by ~10x, with good impact on the overall pipeline speed.
  - CLI option: --pdf-backend=dlparse_v2
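The two CLI options mentioned above can be combined in one call. Here is a minimal Python sketch that assembles such a command line. `build_docling_command` is a hypothetical helper name, `paper.pdf` is a placeholder, and passing the PDF path as the final positional argument is an assumption about the `docling` CLI.

```python
import shlex


def build_docling_command(pdf_path, use_ocr=False, pdf_backend="dlparse_v2"):
    """Assemble a docling CLI call as an argument list (hypothetical helper)."""
    cmd = ["docling"]
    if not use_ocr:
        cmd.append("--no-ocr")  # skip OCR, e.g. for digital-only PDFs
    if pdf_backend:
        cmd.append(f"--pdf-backend={pdf_backend}")  # e.g. the beta dlparse_v2 backend
    cmd.append(str(pdf_path))  # assumption: the source PDF is a positional argument
    return cmd


cmd = build_docling_command("paper.pdf")
print(shlex.join(cmd))  # docling --no-ocr --pdf-backend=dlparse_v2 paper.pdf
```

To actually run the conversion, the list could be handed to `subprocess.run(cmd, check=True)`. Table structure recognition is not covered here because, as noted above, it can only be switched off from the Python API.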

Full API code sample:

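from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

input_doc_path = "your_document.pdf"  # placeholder: path to your PDF

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # pick what you need
pipeline_options.do_table_structure = False  # pick what you need

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV2DocumentBackend,  # switch to beta PDF backend
        )
    }
)
conv_result = doc_converter.convert(input_doc_path)

print(conv_result.document.export_to_markdown())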
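To see which of these settings actually helps on a given machine, it can be useful to time each configuration separately. The following is a minimal sketch using only the standard library. `time_configs` and the stub workloads are hypothetical stand-ins, not part of Docling. In practice each entry would wrap a real call such as `lambda: doc_converter.convert(input_doc_path)`.

```python
import time


def time_configs(configs, repeats=1):
    """Run each named callable and return its best wall-clock time in seconds."""
    timings = {}
    for name, run in configs.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run()
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


# Stand-in workloads; replace with real conversions, one per pipeline configuration.
configs = {
    "fast_stub": lambda: sum(range(1_000)),
    "slow_stub": lambda: time.sleep(0.05),
}
timings = time_configs(configs)
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.3f}s")
```

With `repeats` greater than 1 the helper reports the best run. That may help separate steady-state conversion time from one-off start-up costs, if the first call includes any.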

5 replies

AdityaMannu1709 Nov 6, 2024

Is there a quality difference between the default PDF backend and DoclingParseV2DocumentBackend? In other words, does the better speed come with any trade-off in output quality?

cau-git Nov 6, 2024
Maintainer

The current default PDF backend is DoclingParseV1DocumentBackend, while DoclingParseV2DocumentBackend is in beta and available as an option. We are currently ironing out the last few stability issues before we make it the new default. There is no trade-off in quality; it is just much, much faster while delivering the same or better quality.
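One way to check this on your own documents is to convert the same PDF with both backends and compare the Markdown exports. Below is a minimal sketch using the standard library's `difflib`. The two strings are made-up stand-ins for the output of `conv_result.document.export_to_markdown()` from each backend, and both helper names are hypothetical.

```python
import difflib


def markdown_similarity(md_a, md_b):
    """Return a 0..1 similarity ratio between two Markdown exports, compared line by line."""
    return difflib.SequenceMatcher(None, md_a.splitlines(), md_b.splitlines()).ratio()


def changed_lines(md_a, md_b):
    """List the lines that differ, prefixed with '- ' (only in a) or '+ ' (only in b)."""
    diff = difflib.ndiff(md_a.splitlines(), md_b.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Made-up stand-ins for the exports produced by the V1 and V2 backends.
v1_md = "# Title\n\nSome text.\n| a | b |"
v2_md = "# Title\n\nSome text.\n| a | b | c |"

ratio = markdown_similarity(v1_md, v2_md)
changes = changed_lines(v1_md, v2_md)
print(f"similarity: {ratio:.2f}")
for line in changes:
    print(line)
```

A ratio close to 1.0 with only a few changed lines would be consistent with the two backends producing equivalent output. Larger differences are worth reading by hand, since a diff alone cannot say which version is better.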

AdityaMannu1709 Nov 6, 2024

okay....got it

simjak Dec 3, 2024

ocr_options should then be optional.
Currently, it is required:

class PdfPipelineOptions(PipelineOptions):
    ...
    ocr_options: OcrOptions = Field(EasyOcrOptions())

simjak Dec 4, 2024

It takes the same time with and without the OCR option :)
Without OCR, I get 39s for https://arxiv.org/pdf/2305.19435
With Jina, 3s -> https://r.jina.ai/https://arxiv.org/pdf/2305.19435
