Running OCR on embedded images of PDF using Poppler pdfimages or ImageMapping instead of whole pdf pages converted to png? #84

T-Dane · 2024-10-30T13:47:46Z

Requesting a version of PDF OCR that runs Tesseract OCR only on the images embedded in a PDF, instead of converting each whole page to an image and running OCR on that.

A lot of my professors use PowerPoint slides converted to PDF. The text is already real text, while the screen grabs they embed have no text layer and could benefit from OCR.

I believe this could save time for others as well, since not all PDF documents are purely images; many are a combination of text and images.
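
To make the request concrete, here is a minimal sketch of what an "images only" pass could look like. It is not how PDF OCR works today. It assumes the `pdfimages` tool from poppler-utils and the `tesseract` CLI are both on the PATH, and the function names (`pdfimages_cmd`, `tesseract_cmd`, `ocr_embedded_images`) are made up for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def pdfimages_cmd(pdf_path, out_prefix):
    """Build the command that extracts every embedded image as PNG."""
    return ["pdfimages", "-png", str(pdf_path), str(out_prefix)]


def tesseract_cmd(image_path, lang="eng"):
    """Build the command that OCRs one image and prints plain text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_embedded_images(pdf_path, lang="eng"):
    """Return {extracted image file name: recognized text} for one PDF.

    Hypothetical helper: only the embedded images are OCRed, the existing
    text layer of the PDF is never touched.
    """
    for tool in ("pdfimages", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "img"
        subprocess.run(pdfimages_cmd(pdf_path, prefix), check=True)
        # pdfimages names its output <prefix>-NNN.png
        for image in sorted(Path(tmp).glob("img-*.png")):
            done = subprocess.run(
                tesseract_cmd(image, lang), capture_output=True, text=True
            )
            if done.returncode == 0:  # skip images Tesseract cannot read
                results[image.name] = done.stdout
    return results
```

Calling `ocr_embedded_images("slides.pdf")` would give the text found in each screen grab, but not where that text belongs on the page, so this only covers the extraction half of the idea.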

aborel · 2024-11-01T04:21:49Z

Interesting idea, but inserting the OCRed text back into the existing text layer for hybrid pages might be challenging.
I'm not familiar with ImageMapping, can you provide a link?

T-Dane · 2024-11-05T11:18:29Z

I completely trust it would be challenging, but it would make for an AMAZING feature!
This:
https://poppler.freedesktop.org/api/glib/poppler-Poppler-Page.html#PopplerImageMapping-struct

Or maybe this:
https://world.pages.gitlab.gnome.org/Rust/poppler-rs/stable/0.24/docs/poppler/struct.ImageMapping.html
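
As far as I understand it, an ImageMapping pairs an image ID with the rectangle (`area`) that the image occupies on the page. That rectangle is what would be needed to put OCRed words back in the right place on a hybrid page. Below is a rough sketch of the coordinate mapping only. It assumes the image is placed without rotation or skew and that image pixels and page coordinates both use a top-left origin; `Rect` and `image_box_to_page` are hypothetical names, not Poppler API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x1: float
    y1: float
    x2: float
    y2: float


def image_box_to_page(box: Rect, image_size: tuple[int, int], area: Rect) -> Rect:
    """Map a word box from image pixel coordinates to page coordinates.

    box        -- word bounding box reported by OCR, in image pixels
    image_size -- (width, height) of the embedded image in pixels
    area       -- rectangle the image occupies on the page
    """
    width, height = image_size
    if width <= 0 or height <= 0:
        raise ValueError("image_size must be positive")
    # Scale factors from pixels to page units along each axis
    sx = (area.x2 - area.x1) / width
    sy = (area.y2 - area.y1) / height
    return Rect(
        area.x1 + box.x1 * sx,
        area.y1 + box.y1 * sy,
        area.x1 + box.x2 * sx,
        area.y1 + box.y2 * sy,
    )
```

For example, a 400x200 pixel image placed at `Rect(100, 50, 300, 150)` is scaled by 0.5 on both axes, so a word box at `Rect(40, 20, 120, 60)` in the image lands at `Rect(120, 60, 160, 80)` on the page.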

aborel · 2024-11-08T05:22:45Z

Thanks. We're currently looking into reducing the dependencies on external programs, so I'm not sure we'll use your suggestion, but we'll keep this in mind.

T-Dane changed the title ~~Running OCR on embedded images of PDF using Poppler pdfimages or ImageMapping instead of whole pdf?~~ Running OCR on embedded images of PDF using Poppler pdfimages or ImageMapping instead of whole pdf pages converted to png? Oct 30, 2024

stweil added the enhancement (New feature or request) label Oct 30, 2024
