Running OCR on embedded images of PDF using Poppler pdfimages or ImageMapping instead of whole pdf pages converted to png? #84

Open
T-Dane opened this issue Oct 30, 2024 · 3 comments
Labels
enhancement New feature or request

Comments


T-Dane commented Oct 30, 2024

Requesting a mode of PDF OCR that runs Tesseract OCR only on the images embedded in a PDF, instead of rendering each whole page to an image and running OCR on that.

A lot of my professors use PowerPoint slides converted to PDF. The text in those is already real text, while the screen grabs they include are not and could benefit from OCR.

I believe this could save time for others as well, since many PDF documents are not purely images but a combination of text and images.
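
For illustration, here is a minimal sketch of what this could look like, assuming Poppler's `pdfimages` and the `tesseract` command-line tools are installed and on the PATH. The function names are hypothetical and this is not how PDF OCR currently works. It only collects the recognized text per extracted image and does not write anything back into the PDF.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_command(pdf_path, out_prefix):
    # "pdfimages -png <pdf> <prefix>" writes each embedded image
    # as <prefix>-NNN.png.
    return ["pdfimages", "-png", str(pdf_path), str(out_prefix)]


def tesseract_command(image_path):
    # Using "stdout" as the output base makes tesseract print the
    # recognized text instead of writing a .txt file.
    return ["tesseract", str(image_path), "stdout"]


def ocr_embedded_images(pdf_path, run=subprocess.run):
    """Return {image file name: recognized text} for the embedded images.

    `run` is injectable (hypothetical design choice) so the flow can be
    exercised without the external tools installed.
    """
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "img"
        run(pdfimages_command(pdf_path, prefix), check=True)
        for image in sorted(Path(tmp).glob("img-*.png")):
            done = run(
                tesseract_command(image),
                check=True,
                capture_output=True,
                text=True,
            )
            results[image.name] = done.stdout
    return results
```

Calling `ocr_embedded_images("slides.pdf")` would then OCR only the extracted screen grabs and leave the existing text of the slides untouched.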

@T-Dane T-Dane changed the title Running OCR on embedded images of PDF using Poppler pdfimages or ImageMapping instead of whole pdf? Running OCR on embedded images of PDF using Poppler pdfimages or ImageMapping instead of whole pdf pages converted to png? Oct 30, 2024
@stweil stweil added the enhancement New feature or request label Oct 30, 2024
Collaborator

aborel commented Nov 1, 2024

Interesting idea, but inserting the OCRed text back into the existing text layer for hybrid pages might be challenging.
I'm not familiar with ImageMapping. Can you provide a link?

Author

T-Dane commented Nov 5, 2024

Collaborator

aborel commented Nov 8, 2024

Thanks. We're currently looking into reducing the dependencies on external programs, so I'm not sure we'll use your suggestion, but we'll keep this in mind.
