Calculates the OCR confidence score per page in ALTO files.
The method is simple:
- find all String elements
- get the value of the "WC" (word confidence) attribute for each String
- calculate the sum of all "WC" values on the page
- divide the sum by the number of String elements (words) on that page
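The steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name page_confidence and the SAMPLE document are hypothetical, tags are matched by local name so that any ALTO namespace version works, and String elements without a "WC" attribute are skipped (an assumption, since the description does not say how they are handled).

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical two-word ALTO fragment, used only for illustration.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hello" WC="0.5"/>
    <String CONTENT="world" WC="1.0"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def page_confidence(alto_xml: str) -> Optional[float]:
    """Return the mean WC of all String elements, or None if there are none."""
    root = ET.fromstring(alto_xml)
    values = []
    for elem in root.iter():
        # Skip non-element nodes, then compare the tag without its namespace.
        if not isinstance(elem.tag, str):
            continue
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        wc = elem.get("WC")
        if wc is not None:  # assumption: words without WC are ignored
            values.append(float(wc))
    if not values:
        return None  # avoid dividing by zero on pages without words
    return sum(values) / len(values)


if __name__ == "__main__":
    print(page_confidence(SAMPLE))
```

Returning None for a page without any String elements is one possible choice; it keeps empty pages distinguishable from pages with a genuinely low score.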
Usage:
python alto_ocr_confidence.py <inputdir>
Example output:
File: alto\AZ_1926_04_25_0001.xml, Confidence: 54.13
Note that OCR confidence is a value reported by the OCR engine itself and is NOT the same as actual OCR accuracy, which can only be determined by evaluation against Ground Truth.
Read more about OCR evaluation here.