Replies: 2 comments
-
I've been wondering the same thing. My guess is that the OCR engine is tightly coupled to the partition library because each engine outputs data in its own specific format. My biggest issue is that Tesseract doesn't support GPUs, which slows down extraction, for example with high-res processing, and particularly for large documents. There are some newer approaches like https://github.com/VikParuchuri/surya that are promising, support parallel processing, and can run on multiple GPUs if needed. It would be interesting to see if something like that could be supported.
-
This is possible now, just not documented. See here
-
According to the documentation, the "ocr_only" partition strategy runs the document through Tesseract for OCR and then runs the raw text through partition_text.
I want to know if there is any option to integrate another OCR engine.
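To make the question concrete, here is a minimal sketch of the two-step flow described above (OCR first, then partition the raw text) with the OCR step left pluggable. It is only an illustration under stated assumptions, not how the library wires things internally: `OcrEngine`, `Partitioner`, `split_paragraphs`, `ocr_then_partition`, and `fake_ocr` are all hypothetical names, and the paragraph splitter is a stand-in for partition_text.

```python
from typing import Callable, List

# Hypothetical type aliases for this sketch (not part of any library's API).
OcrEngine = Callable[[bytes], str]        # image bytes -> raw text
Partitioner = Callable[[str], List[str]]  # raw text -> list of text chunks


def split_paragraphs(text: str) -> List[str]:
    """Stand-in partitioner: split raw text on blank lines, normalizing whitespace."""
    return [" ".join(block.split()) for block in text.split("\n\n") if block.strip()]


def ocr_then_partition(
    image: bytes,
    ocr: OcrEngine,
    partition: Partitioner = split_paragraphs,
) -> List[str]:
    """Run any OCR engine, then partition its raw text output."""
    raw_text = ocr(image)
    return partition(raw_text)


def fake_ocr(image: bytes) -> str:
    """Placeholder engine returning canned text; a real OCR call would go here."""
    return "Invoice 42\n\nTotal due:\n100 USD"


elements = ocr_then_partition(b"", fake_ocr)
print(elements)  # ['Invoice 42', 'Total due: 100 USD']
```

With the library installed, something like `lambda text: partition_text(text=text)` could presumably be passed as the `partition` argument in place of the stand-in, and any engine that turns image bytes into a string could replace `fake_ocr`.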