Home
Stefan Weil edited this page Sep 7, 2024 · 3 revisions
Welcome to the reichsanzeiger-gt wiki!
A list of all words can be extracted from the PAGE XML transcriptions:
grep " <Unicode>..*</Unicode>" *.xml | \
sed 's/.*<Unicode>//' | sed 's/<.Unicode>//' | sed 's/[ »«"„“,;:(){}]/\n/g' > words
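As a small worked example, the same extraction can be tried on a single made-up PAGE XML fragment. The scratch directory, the file name `sample.xml` and the sample text are assumptions for illustration only; the `\n` in the `sed` replacement relies on GNU sed.

```shell
# Scratch directory with one made-up PAGE XML fragment (hypothetical sample data).
demo_dir=$(mktemp -d)
cat > "$demo_dir/sample.xml" <<'EOF'
<TextLine id="l1">
  <TextEquiv>
    <Unicode>Berlin, den 7. Januar (Amtlich)</Unicode>
  </TextEquiv>
</TextLine>
EOF

# Same extraction steps as above, applied to the sample file.
grep " <Unicode>..*</Unicode>" "$demo_dir"/*.xml | \
  sed 's/.*<Unicode>//' | sed 's/<.Unicode>//' | \
  sed 's/[ »«"„“,;:(){}]/\n/g' > "$demo_dir/words"

# Show the non-empty tokens, one per line.
grep . "$demo_dir/words"
```

The `words` file also contains empty lines wherever two separators were adjacent (for example a comma followed by a space), and tokens such as `7.` still carry trailing punctuation at this stage.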
From this word list, a sorted list of unique words (a dictionary) can be produced. Leading and trailing special characters are stripped, and entries that start with a digit or consist of a single character are dropped:
sed 's/[»"“(]//g' words | sed 's/^[„]//' | sed 's/[.,;:)]*$//' | \
sed 's/^[=\[(]*//' | sed 's/[).,;]*$//' | grep -v '^[0-9]' | grep -v '^.$' | sort -u > dictionary
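To see the effect of the cleanup, here is a small worked example on a made-up word list. It chains the filters above into a single pipeline and uses `sort -u` as a shorthand for `sort | uniq`. It also adds one step that is not in the original commands, `grep .`, to drop empty lines. The sample words and the scratch directory are assumptions for illustration.

```shell
# Made-up word list (hypothetical sample data), including an empty line.
work_dir=$(mktemp -d)
printf '%s\n' 'Berlin' 'Berlin,' '(Amtlich)' '7.' '1875' 'a' '=Anzeiger' 'den' '' \
  > "$work_dir/words"

# Strip special characters, then drop numbers, single-character entries
# and empty lines (the last filter is an addition in this sketch).
sed 's/[»"“(]//g' "$work_dir/words" | sed 's/^[„]//' | sed 's/[.,;:)]*$//' | \
  sed 's/^[=\[(]*//' | sed 's/[).,;]*$//' | \
  grep -v '^[0-9]' | grep -v '^.$' | grep . | \
  sort -u > "$work_dir/dictionary"

cat "$work_dir/dictionary"
```

For this input the dictionary contains the four entries `Amtlich`, `Anzeiger`, `Berlin` and `den`: the two spellings of `Berlin` collapse into one, and `7.`, `1875` and `a` are filtered out.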
The resulting dictionary can be integrated into Tesseract OCR models to improve the recognition rate.
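One possible way to do this for an LSTM model is to unpack the model, rebuild its word list component from the dictionary and repack it. The sketch below assumes the Tesseract training tools `combine_tessdata` and `wordlist2dawg` are installed. The function name `add_wordlist` and the model file name in the usage comment are hypothetical, and this page does not say which model or procedure was actually used.

```shell
# add_wordlist MODEL WORDLIST
# Sketch: replace the LSTM word list of MODEL (a .traineddata file) with WORDLIST.
# MODEL is modified in place, so work on a copy.
add_wordlist() {
  aw_model=$1
  aw_wordlist=$2

  # Both input files must exist.
  [ -f "$aw_model" ] && [ -f "$aw_wordlist" ] || {
    echo "usage: add_wordlist MODEL.traineddata WORDLIST" >&2
    return 1
  }

  # The Tesseract training tools are assumed to be on the PATH.
  command -v combine_tessdata >/dev/null && command -v wordlist2dawg >/dev/null || {
    echo "combine_tessdata and wordlist2dawg are required" >&2
    return 1
  }

  # Component files are unpacked next to the model, e.g. "model." + "lstm-unicharset".
  aw_prefix="${aw_model%.traineddata}."

  combine_tessdata -u "$aw_model" "$aw_prefix" &&
    wordlist2dawg "$aw_wordlist" "${aw_prefix}lstm-word-dawg" "${aw_prefix}lstm-unicharset" &&
    combine_tessdata -o "$aw_model" "${aw_prefix}lstm-word-dawg"
}

# Usage (hypothetical file name):
#   add_wordlist ./model.traineddata ./dictionary
```

Note that this replaces the model's existing word list instead of merging with it. To keep the old entries, they could first be exported with `dawg2wordlist` and combined with the new dictionary.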