speller-ocr-eval

Evaluate OCR correctness by identifying the language of each text and then running a spell checker.

Usage

npm init
put the list of files to analyse into files.txt
run `node index.js`
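
As a rough sketch of the second step: assuming files.txt holds one path per line (the format is not specified here), the list could be read as shown below. The helper names `parseFileList` and `readFileList` are hypothetical and are not taken from index.js.

```javascript
const fs = require("fs");

// Hypothetical helper: split the contents of files.txt into a list of paths.
// Assumes one path per line; blank lines and surrounding whitespace are ignored.
function parseFileList(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Read the list from disk, returning an empty list if the file is missing.
function readFileList(path = "files.txt") {
  if (!fs.existsSync(path)) return [];
  return parseFileList(fs.readFileSync(path, "utf8"));
}
```

Calling `readFileList()` in the project directory would then return the paths to analyse, in the order they appear in files.txt.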

Dependencies

langid.py (optional, as an alternative language identifier)
elasticsearch / kibana for pushing the data
jq for formatting output from elasticsearch

Source Data

Download https://data.bnl.lu/open-data/digitization/newspapers/export01-newspapers1841-1878.zip from the eluxemburgensia open data set as source data.

Identifying the language

Language identification uses CLD2 (https://github.com/CLD2Owners/cld2/). As an alternative, use langid.py (https://pypi.org/project/langid/).
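
Once a language has been identified, the result has to select a spelling dictionary. The sketch below illustrates one way this could work. The shape of the detection result (`language`, `confidence`), the threshold, and the dictionary names are all assumptions made for illustration; they are not taken from CLD2, langid.py, or config.json.

```javascript
// Hypothetical mapping from a detected language code to a hunspell dictionary name.
// Codes and dictionary names are illustrative assumptions.
const DICTIONARIES = new Map([
  ["de", "de_DE"],
  ["fr", "fr_FR"],
  ["lb", "lb_LU"],
]);

// Return the dictionary name for a detection result, or null when the
// language is unknown or the confidence is below the threshold.
// `detection` is assumed to look like { language: "fr", confidence: 0.93 }.
function pickDictionary(detection, minConfidence = 0.5) {
  if (!detection || detection.confidence < minConfidence) return null;
  return DICTIONARIES.get(detection.language) || null;
}
```

Returning null for low-confidence or unsupported languages lets the caller skip spell checking for that text instead of scoring it against the wrong dictionary.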

Spelling

Spell checking uses hunspell with dictionaries from LibreOffice and spellchecker.lu.
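
One plausible way to turn spell checking into an OCR quality measure is the fraction of words the dictionary accepts. The sketch below assumes that measure; the actual metric computed by index.js may differ. `isCorrect` is a hypothetical predicate that a hunspell binding would provide, and a toy word list stands in for a real dictionary here.

```javascript
// Split text into word tokens using Unicode letters (keeps accented characters).
function tokenize(text) {
  return text.match(/\p{L}+/gu) || [];
}

// Fraction of tokens accepted by the spell checker; null for text with no words.
// `isCorrect` is a hypothetical predicate, e.g. backed by a hunspell binding.
function spellingScore(text, isCorrect) {
  const words = tokenize(text);
  if (words.length === 0) return null;
  const accepted = words.filter((word) => isCorrect(word)).length;
  return accepted / words.length;
}

// Usage with a toy word list standing in for a real dictionary.
const toyDictionary = new Set(["die", "zeitung", "von", "heute"]);
const score = spellingScore("Die Zeitung von hcute", (word) =>
  toyDictionary.has(word.toLowerCase())
);
console.log(score); // 0.75: three of the four words are accepted
```

Note that this simple tokenizer splits on apostrophes and hyphens, so a French form such as "l'homme" is checked as two separate words.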

