Home

Welcome to the reichsanzeiger-gt wiki!

List of words (dictionary)

A list of all words can be extracted from the PAGE XML transcriptions:

grep "                   <Unicode>..*</Unicode>" *.xml | \
sed 's/.*<Unicode>//' | sed 's/<.Unicode>//' | sed 's/[ »«"„“,;:(){}]/\n/g' > words
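To see what this pipeline produces, it can be run on a single made-up transcription line. This is only a sketch: the file names `sample.xml` and `sample-words` and the sample text are invented for illustration, and it assumes GNU sed, which turns `\n` in the replacement into a line break.

```shell
# Create one sample line, indented deeply enough for the grep pattern to match.
printf '%24s<Unicode>%s</Unicode>\n' '' 'Berlin, den 3. Januar: (Nr. 12)' > sample.xml

# Same pipeline as above, applied to the sample file only.
grep "                   <Unicode>..*</Unicode>" sample.xml | \
sed 's/.*<Unicode>//' | sed 's/<.Unicode>//' | sed 's/[ »«"„“,;:(){}]/\n/g' > sample-words

# Adjacent separators leave empty lines behind, so show only the non-empty tokens.
grep -v '^$' sample-words
```

The tokens still carry trailing punctuation such as the dot in `3.` and `Nr.`, and pure numbers are still present. Both are dealt with when the dictionary is built.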

From the list of all words, a sorted list of unique words (the dictionary) can be produced. It omits leading and trailing special characters, entries that begin with a digit, and single-character words:

sed 's/[»"“(]//g' words | sed 's/^[„]//' | sed 's/[.,;:)]*$//' | \
sed 's/^[=\[(]*//' | sed 's/[).,;]*$//' | grep -v '^[0-9]' | grep -v '^.$' | sort -u > dictionary
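The effect of the individual filters is easiest to check on a few hand-picked tokens. The sketch below wraps the filter chain in a helper that reads from standard input. The name `clean_words` and the sample tokens are hypothetical and not part of the repository.

```shell
# Hypothetical helper: reads raw tokens on stdin, prints cleaned unique words.
clean_words() {
    sed 's/[»"“(]//g' | sed 's/^[„]//' | sed 's/[.,;:)]*$//' |
    sed 's/^[=\[(]*//' | sed 's/[).,;]*$//' |
    grep -v '^[0-9]' | grep -v '^.$' | sort -u
}

# Sample tokens: bracketed, with trailing punctuation, numeric, single letter, quoted.
printf '%s\n' '(Berlin)' 'Berlin,' '1879' 'a' '"Januar"' 'Nr.' | clean_words
```

For these six tokens the helper prints `Berlin`, `Januar` and `Nr`, one per line: both spellings of `Berlin` collapse into one entry, while `1879` and `a` are dropped.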

The resulting dictionary can be integrated into Tesseract OCR models to improve the recognition rate.
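
One possible way to do this, sketched here under assumptions, is to replace the word list of an existing LSTM model with Tesseract's training tools `combine_tessdata` and `wordlist2dawg`. The function name `add_dictionary` and its arguments are placeholders. The sketch assumes the training tools are installed and that the model contains an `lstm-unicharset` component. It overwrites the model file, so it is best tried on a copy.

```shell
# Sketch: rebuild the LSTM word list of a traineddata file from a dictionary.
# Usage (placeholder names): add_dictionary MODEL.traineddata dictionary
add_dictionary() {
    model=$1
    wordlist=$2
    prefix=${model%.traineddata}.

    # Unpack all model components next to the model file.
    combine_tessdata -u "$model" "$prefix" &&
    # Compile the word list into a DAWG using the model's own character set.
    wordlist2dawg "$wordlist" "${prefix}lstm-word-dawg" "${prefix}lstm-unicharset" &&
    # Write the new word DAWG back into the traineddata file.
    combine_tessdata -o "$model" "${prefix}lstm-word-dawg"
}
```

The function is only defined here, not called. Whether the new word list actually improves recognition for a given model has to be checked on real pages.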
