tesschar

Simple character checking for Tesseract OCR output. The parameters are:

usage: tesschar.py [-h] [-f FILE] [-b BORDER] [-l LANG] [-o OUTPUT] [-t TEXT]

optional arguments:
  -h, --help            show this help message and exit

named arguments:
  -f FILE, --file FILE  input image, for example: imgs/my_image.tif
  -b BORDER, --border BORDER
                        adjust border value for extracted regions
  -l LANG, --lang LANG  language for OCR
  -o OUTPUT, --output OUTPUT
                        file for output
  -t TEXT, --text TEXT  text to reprocess
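
The help text above is consistent with a standard argparse setup. The following is only a sketch of how such an interface might be declared, not the script's actual source: the defaults for the border and language, the integer type for the border, and the `split_targets` helper (which assumes `-t` takes a comma-separated list, as the `O,B` example suggests) are all assumptions.

```python
import argparse


def build_parser():
    """Declare options matching the usage text (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="tesschar.py")
    named = parser.add_argument_group("named arguments")
    named.add_argument("-f", "--file",
                       help="input image, for example: imgs/my_image.tif")
    # Assumption: border is a pixel count; 10 is a placeholder default.
    named.add_argument("-b", "--border", type=int, default=10,
                       help="adjust border value for extracted regions")
    # Assumption: "eng" as the default language.
    named.add_argument("-l", "--lang", default="eng",
                       help="language for OCR")
    named.add_argument("-o", "--output", help="file for output")
    named.add_argument("-t", "--text", help="text to reprocess")
    return parser


def split_targets(text):
    """Hypothetical helper: split a value such as 'O,B' into ['O', 'B']."""
    return [item for item in text.split(",") if item]
```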

For example:

tesseract sample.jpg sample -c hocr_char_boxes=1 hocr
tesschar.py -f sample.jpg -t O,B

By default, the output is written to a file named after the base of the input filename (sample.txt in this case). A border is added around each extracted character to help improve recognition. If an hOCR file is not found, pytesseract is used to create an in-memory version. The single-character recognition step is also done in memory with pytesseract. This could be done more efficiently with the Tesseract API, but because the process adds considerable overhead, it would first need to be tested on a large enough sample to confirm that the approach is worth pursuing.
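
As a rough illustration of the steps described above, the sketch below finds the boxes of the requested characters in hOCR text, widens each box by a border, and derives the default output name. It is not taken from tesschar.py. The `ocrx_cinfo` span layout in the regular expression is an assumption about what Tesseract emits with `hocr_char_boxes=1` and should be checked against a real hOCR file. The function names are hypothetical, and the border is interpreted here as extra pixels of context around the box, whereas the script may instead pad the crop with blank pixels.

```python
import html
import os
import re

# Assumed shape of the per-character spans written when hocr_char_boxes=1
# is set; verify against your own hOCR output before relying on it.
CINFO_RE = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([\d.]+)'>(.*?)</span>"
)


def find_char_boxes(hocr_text, targets):
    """Return (char, (left, top, right, bottom), conf) for target characters."""
    found = []
    for match in CINFO_RE.finditer(hocr_text):
        char = html.unescape(match.group(6))
        if char in targets:
            box = tuple(int(match.group(i)) for i in range(1, 5))
            found.append((char, box, float(match.group(5))))
    return found


def pad_box(box, border, width, height):
    """Grow a box by `border` pixels per side, clamped to the image size."""
    left, top, right, bottom = box
    return (max(left - border, 0), max(top - border, 0),
            min(right + border, width), min(bottom + border, height))


def default_output_name(image_path):
    """Base of the input filename plus .txt, e.g. sample.jpg -> sample.txt."""
    return os.path.splitext(image_path)[0] + ".txt"
```

Each padded box could then be cropped from the image and passed back to pytesseract in a single-character page segmentation mode, which is the in-memory recognition step that accounts for most of the overhead.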

This has had minimal testing, YMMV, etc...

OurDigitalWorld/tesschar