handwritten-table-extraction-ocr

DESCRIPTION:

This project processes scanned images of handwritten tables to detect the table structure and recognize its content automatically. It uses pretrained OCR models to read the handwritten entries and writes them to an Excel sheet that mirrors the original layout. This reduces manual data entry effort when handling handwritten documents.

(Note: This is a preliminary version; its accuracy can be improved further.)

TIMELINE:

DATA FLOW DIAGRAM:

SAMPLE OUTPUT:

ACCURACY, LIMITATIONS AND FUTURE SCOPE:

The current accuracy of the program is roughly 75-80%. It could be improved by replacing the pretrained model with one trained on the IAM dataset.
The current implementation runs efficiently on the GPU machines available in Google Colab but takes longer to execute in a local Jupyter notebook. As this is a preliminary version, performance could be improved by optimizing and testing the code on a PC with a capable GPU.
One known anomaly still needs work: when certain images are not cropped correctly, the table detection fails to identify a cell and returns a NULL value, which raises an error. (UPDATE: this error is handled temporarily in TROCR_v2.ipynb by replacing the unrecognized text with a blank “ ” so that the program does not stop abruptly and can run to completion.)
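
The temporary fix for unrecognized cells can be sketched as follows. This is a minimal illustration, not the code from TROCR_v2.ipynb: `safe_text`, `read_table` and the `recognize_cell` callback are hypothetical names standing in for whatever the notebook actually uses, and the sketch assumes that an undetected cell arrives as `None`.

```python
from typing import Callable, List, Optional

BLANK = " "  # placeholder written when a cell cannot be read


def safe_text(value: Optional[str]) -> str:
    """Return the recognized text, or a blank if recognition produced nothing."""
    return BLANK if value is None else value


def read_table(
    cells: List[List[Optional[object]]],
    recognize_cell: Callable[[object], Optional[str]],
) -> List[List[str]]:
    """Run OCR over a grid of cells without letting a missing cell stop the run.

    `recognize_cell` is a hypothetical callback that wraps the OCR model.
    """
    rows = []
    for row in cells:
        out_row = []
        for cell in row:
            # An undetected cell is assumed to be None; skip OCR for it.
            text = recognize_cell(cell) if cell is not None else None
            out_row.append(safe_text(text))
        rows.append(out_row)
    return rows
```

Because every cell yields a string, the output grid keeps the same shape as the detected table, so later rows and columns stay aligned when written to the sheet.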

UPDATES: The final version ready for use is the TROCR_v4.ipynb file.

This program can now handle PDF files as well as image (.jpg, .png, .jpeg) files. It detects the file type automatically and runs the corresponding procedure.
Images extracted from a PDF file are also auto-cropped to improve clarity and accuracy.
The images obtained are now also preprocessed to improve accuracy further.
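
The file-type detection and auto-cropping steps might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in TROCR_v4.ipynb: `detect_kind` and `autocrop` are hypothetical names, the type is assumed to be detected from the file extension, and cropping is assumed to mean trimming the white border around the content of a grayscale page.

```python
from pathlib import Path
from typing import List

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # image types listed above
PDF_EXTENSIONS = {".pdf"}


def detect_kind(path: str) -> str:
    """Classify an input file as 'pdf' or 'image' from its extension."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")


def autocrop(pixels: List[List[int]], white_threshold: int = 250) -> List[List[int]]:
    """Crop a grayscale image (rows of 0-255 values) to its non-white content.

    The threshold of 250 is an assumed value, not one taken from the project.
    """
    rows = [r for r, row in enumerate(pixels)
            if any(p < white_threshold for p in row)]
    if not rows:
        return pixels  # blank page: nothing to crop
    cols = [c for c in range(len(pixels[0]))
            if any(row[c] < white_threshold for row in pixels)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]
```

A caller would branch on the result of `detect_kind`, converting PDF pages to images first and then passing every image through the same cropping and recognition steps.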

CREDIT:

Pretrained model used: TrOCR
Table detection model used: img2table

natashasalvi2003/handwritten-table-extraction-ocr
