Keywords: OCR, Page Dewarping, Deep Learning, Tesseract-OCR, DAS2018, ICDAR2017
Conference Paper: ...
This repository is part of the Master Thesis "Camera-based Document Analysis based on Deep Learning and OCR".
Capturing document images with a smartphone provides a convenient way to digitize physical documents and facilitates the automation of document processing and information retrieval. In contrast to flatbed scans, camera-captured documents require a more sophisticated preprocessing pipeline because of perspective distortions, suboptimal lighting, and physically deformed documents. The main goals of this work were to:
- build an end-to-end OCR pipeline (input: document image, output: full text transcription) based on the best open-source solution currently available.
- analyze Deep Learning techniques for one of the major challenges in camera-based document analysis discussed at the DAS2018 workshop: Page Dewarping (in particular perspective distortions and folded/curved documents).
A high-level overview is illustrated in the following figure:
This repo contains a demo application for the OCR pipeline (Tesseract 4.0) extended by the page dewarping component.
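The OCR stage of such a pipeline (document image in, text out) can be sketched by calling the `tesseract` command-line tool directly. This is only an illustration under stated assumptions: the function names `build_tesseract_command` and `transcribe` are hypothetical, and the sketch shells out to the CLI rather than using this repo's own Python wrapper in `textline_recognition/`.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Return the argument list for a Tesseract run that prints text to stdout.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Passing "stdout" as the output base makes tesseract write the
    # transcription to standard output instead of a .txt file.
    # "--oem 1" selects the LSTM engine available since Tesseract 4.
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--oem", "1"]


def transcribe(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

Calling `transcribe("page.jpg")` on an image of your own would return the full text transcription; the dewarping component would run on the image before this step.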
Methodologically, different neural network architectures were investigated on a large-scale synthetic dataset to estimate the document's corner points from a single input image, without prior assumptions. The distorted image is then mapped to its canonical position using the 4-point homography parameterization. The best result is achieved by a modified Xception network, with a mean displacement error of 3.38px. Finally, the correction component is integrated into Tesseract 4.0 and evaluated on the SmartDoc 2015 challenge 2 test set. Experiments show that the correction component improves character accuracy by more than 15 percentage points, from 77.27% with Tesseract alone to 93.11%.
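To make the 4-point parameterization concrete: once four corner points are known, the 3x3 homography that maps them onto an upright rectangle is fully determined by an 8x8 linear system. The following is a minimal NumPy sketch of that step, not the repository's implementation; the function names, the example corner coordinates, and the target size are assumptions chosen for illustration.

```python
import numpy as np


def homography_from_corners(src, dst):
    """Solve for the 3x3 homography H (with H[2, 2] = 1) mapping src to dst.

    src, dst: four (x, y) points each, in matching order.
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1), same for v.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def apply_homography(H, points):
    """Map an (N, 2) array of points through H using homogeneous coordinates."""
    pts = np.asarray(points, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homogeneous @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


# Hypothetical predicted corners (top-left, top-right, bottom-right, bottom-left)
corners = [(30, 20), (410, 45), (390, 590), (15, 560)]
# Canonical target rectangle; the size is an arbitrary choice for this example
target = [(0, 0), (383, 0), (383, 511), (0, 511)]

H = homography_from_corners(corners, target)
```

Applying `apply_homography(H, corners)` returns the target rectangle's corners up to floating-point error. With OpenCV available, `cv2.getPerspectiveTransform` and `cv2.warpPerspective` cover the same step and also resample the image pixels.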
Page Dewarping results:
Text lines recognized by Tesseract before and after dewarping:
- Install Tesseract OCR; at the time of writing, tesseract 4.0.0-beta.1 was used as the OCR engine.
- Download homography_model into /res/homographyModel/
- Install dependencies (using conda virtualenv)
conda env create -f environment.yml
# note: for GPU support, replace tensorflow with tensorflow-gpu in environment.yml
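After the steps above, a quick sanity check can confirm that the two external prerequisites are in place. This is a hypothetical helper, not part of the repository: the name `check_setup` is made up, and it assumes the model directory is `res/homographyModel` relative to the repository root.

```python
import os
import shutil


def check_setup(repo_root="."):
    """Report whether Tesseract and the homography model directory are present.

    Hypothetical helper; assumes the layout res/homographyModel under repo_root.
    """
    model_dir = os.path.join(repo_root, "res", "homographyModel")
    model_dir_exists = os.path.isdir(model_dir)
    return {
        # True if a `tesseract` executable can be found on PATH
        "tesseract_on_path": shutil.which("tesseract") is not None,
        "model_dir_exists": model_dir_exists,
        # True only if the directory exists and contains at least one entry
        "model_dir_not_empty": model_dir_exists and bool(os.listdir(model_dir)),
    }
```

Running `check_setup()` from the repository root returns a dictionary of booleans; any `False` entry points to the installation step that still needs attention.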
To test different pipeline modes, use ocrMaster.py; to test the page dewarping performance, use test_homographyDL.py.
doc
├── ... # README resources
res
├── homographyModel/ # directory for the trained homography model
├── smartDocSamples/ # smartDoc challenge 2 test set samples
├── smartDocSamplesOutput/ # mobileOCR results are stored here
src
├── pipeline/ # mobileOCR pipeline
├── dl_homograhpy_tf/ # deep learning: homography; dewarping
├── modes/ # different OCR pipeline modes
├── textline_recognition/ # Tesseract 4.0 python wrapper
├── ocrMaster.py # interface to OCR pipeline
environment.yml # dependencies (easy setup)