Stefan Weil edited this page Sep 7, 2024 · 3 revisions

Welcome to the reichsanzeiger-gt wiki!

List of words (dictionary)

A list of all words can be extracted from the PAGE XML transcriptions:

grep "                   <Unicode>..*</Unicode>" *.xml | \
sed 's/.*<Unicode>//' | sed 's/<.Unicode>//' | sed 's/[ »«"„“,;:(){}]/\n/g' > words
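
To see what the sed stages do to a single matched line, here is a minimal sketch. The sample line and the `demo_dir` variable are made up for illustration and are not taken from the transcriptions. It assumes GNU sed, which interprets `\n` in the replacement as a line break.

```shell
# Hypothetical sample line; not taken from the reichsanzeiger-gt data.
line='                   <Unicode>Berlin, den 7. „Januar“ (1880)</Unicode>'

# Temporary directory so that no real "words" file gets overwritten.
demo_dir=$(mktemp -d)

# Strip the markup, then turn separator characters into line breaks.
# Note: "\n" in the replacement relies on GNU sed.
printf '%s\n' "$line" | \
sed 's/.*<Unicode>//' | sed 's/<.Unicode>//' | sed 's/[ »«"„“,;:(){}]/\n/g' > "$demo_dir/words"

cat "$demo_dir/words"
```

Tokens such as `7.` still carry trailing punctuation at this stage, and adjacent separators (for example a comma followed by a space) leave empty lines in the output.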

From the list of all words, a sorted list of unique words (dictionary) can be produced. The pipeline strips leading and trailing special characters and drops entries that start with a digit as well as single-character entries:

sed 's/[»"“(]//g' words | sed 's/^[„]//' | sed 's/[.,;:)]*$//' | \
sed 's/^[=\[(]*//' | sed 's/[).,;]*$//' | grep -v '^[0-9]' | grep -v '^.$' | sort -u > dictionary
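
The effect of the filtering stages can be checked on a small word list. This is only a sketch: the sample words and the `dict_demo` variable are hypothetical, and the stages are chained into a single pipeline that reads the sample file once.

```shell
# Temporary directory so that no real files get overwritten.
dict_demo=$(mktemp -d)

# Hypothetical raw tokens, one per line.
printf '%s\n' '(Berlin' 'Januar).' '1880' 'a' '=Nr.' 'Berlin' 'den' '"Reichs' '7.' \
    > "$dict_demo/words"

# Strip special characters, drop entries that start with a digit,
# drop single-character entries, then sort and deduplicate.
sed 's/[»"“(]//g' "$dict_demo/words" | sed 's/^[„]//' | sed 's/[.,;:)]*$//' | \
sed 's/^[=\[(]*//' | sed 's/[).,;]*$//' | grep -v '^[0-9]' | grep -v '^.$' | \
sort -u > "$dict_demo/dictionary"

cat "$dict_demo/dictionary"
```

Five entries remain: Berlin, Januar, Nr, Reichs and den. Their order depends on the collation rules of the current locale.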

The resulting dictionary can be integrated into Tesseract OCR models to improve the recognition rate.
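
One possible way to do this is sketched below. It is not necessarily the procedure used for this project. The function name `add_dictionary_to_model` is hypothetical, while `combine_tessdata` and `wordlist2dawg` are part of the Tesseract training tools. The sketch assumes an LSTM model and overwrites the word list component of the given model file in place, so it should be run on a copy.

```shell
# Hypothetical helper: replace the LSTM word list of a Tesseract model.
#   $1 = path to a .traineddata file (modified in place, use a copy)
#   $2 = dictionary file with one word per line
add_dictionary_to_model() {
    model=$1
    wordlist=$2
    lang=$(basename "$model" .traineddata)

    # Both tools come with the Tesseract training tools.
    if ! command -v combine_tessdata >/dev/null 2>&1 ||
       ! command -v wordlist2dawg >/dev/null 2>&1; then
        echo "Tesseract training tools not found" >&2
        return 1
    fi

    workdir=$(mktemp -d) || return 1

    # Unpack the model to get its LSTM unicharset.
    combine_tessdata -u "$model" "$workdir/$lang." &&
    # Compile the word list into a DAWG based on that unicharset.
    wordlist2dawg "$wordlist" "$workdir/$lang.lstm-word-dawg" \
        "$workdir/$lang.lstm-unicharset" &&
    # Write the new word DAWG back into the model file.
    combine_tessdata -o "$model" "$workdir/$lang.lstm-word-dawg"
}

# Example call (the model name is a placeholder):
#   add_dictionary_to_model ./mymodel.traineddata dictionary
```

Words containing characters that are missing from the model's unicharset cannot be recognized by that model, so it may be worth checking the output of `wordlist2dawg` for such messages.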
