Tools to simulate post-OCR noisy texts.
Features:
- Sliding window;
- Sliding window with hyphenation;
- Simulate text errors;
- Simulate text annotations.
pip install noisocr
import noisocr
text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50
windows = noisocr.sliding_window(text, max_window_size)
# Output:
# [
# 'Lorem Ipsum is simply dummy text of the printing',
# ...
# 'type and scrambled it to make a type specimen',
# 'book.'
# ]
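To make the windowing behaviour concrete, here is a minimal standard-library sketch of one way such a window could be built: whole words are packed greedily until the next word would exceed the limit. The function name `greedy_windows` is hypothetical and this is not necessarily how noisocr implements `sliding_window`.

```python
from typing import List


def greedy_windows(text: str, max_window_size: int) -> List[str]:
    """Hypothetical helper (not part of noisocr): pack whole words into
    windows of at most max_window_size characters."""
    windows: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_window_size or not current:
            # The word fits, or it is an over-long word starting a window.
            current = candidate
        else:
            # The word does not fit: close the window and start a new one.
            windows.append(current)
            current = word
    if current:
        windows.append(current)
    return windows


sample = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
windows = greedy_windows(sample, 50)
# windows[0] is 'Lorem Ipsum is simply dummy text of the printing'
```

In this sketch, a single word longer than the limit is kept whole in its own window rather than being cut.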
- See the PyHyphen package (https://pypi.org/project/PyHyphen) for the list of supported languages.
import noisocr
text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50
windows = noisocr.sliding_window_with_hyphenation(text, max_window_size, 'en_US')
# Output:
# [
# 'Lorem Ipsum is simply dummy text of the printing ',
# 'typesetting industry. Lorem Ipsum has been the in-',
# ...
# 'scrambled it to make a type specimen book.'
# ]
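The hyphenated variant can be sketched the same way: when a word does not fit, the longest hyphenation prefix that still fits (plus a trailing hyphen) closes the window, and the remainder starts the next one. The sketch below is an illustration under assumptions, not noisocr's implementation. `windows_with_hyphenation`, `toy_pairs`, and the toy dictionary are hypothetical; in practice the split points would come from a hyphenation library such as PyHyphen.

```python
from typing import Callable, List, Tuple

Pairs = Callable[[str], List[Tuple[str, str]]]


def windows_with_hyphenation(text: str, max_window_size: int, pairs: Pairs) -> List[str]:
    """Hypothetical helper (not part of noisocr): greedy word packing that
    splits a word at a hyphenation point when it does not fit."""
    windows: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_window_size:
            current = candidate
            continue
        prefix = f"{current} " if current else ""
        room = max_window_size - len(prefix) - 1  # reserve one char for '-'
        fitting = [(head, tail) for head, tail in pairs(word) if len(head) <= room]
        if fitting:
            # Use the longest head that fits, then carry the tail over.
            head, tail = max(fitting, key=lambda p: len(p[0]))
            windows.append(f"{prefix}{head}-")
            current = tail
        else:
            # No usable split point: move the whole word to the next window.
            if current:
                windows.append(current)
            current = word
    if current:
        windows.append(current)
    return windows


# Toy split points standing in for a real hyphenation dictionary (assumption).
TOY_SPLITS = {"industry": [("in", "dustry"), ("indus", "try")]}


def toy_pairs(word: str) -> List[Tuple[str, str]]:
    return TOY_SPLITS.get(word, [])


result = windows_with_hyphenation("has been the industry standard", 16, toy_pairs)
# result == ['has been the in-', 'dustry standard']
```

A tail that is itself longer than the window is not split again here, which keeps the sketch short.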
- See the typo package (https://pypi.org/project/typo) for the list of possible error types.
import noisocr
text = "Hello, world!"
text_with_errors = noisocr.simulate_errors(text, interactions=1)
# Output: Hello, wotrld!
text_with_errors = noisocr.simulate_errors(text, 2)
# Output: Hsllo,wlorld!
text_with_errors = noisocr.simulate_errors(text, 5)
# Output: fllo,w0rlr!
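The effect of the second argument can be illustrated with a small standard-library sketch that applies one random character-level error per round. The function `inject_errors` and its three error types (swap, drop, repeat) are assumptions chosen for illustration. noisocr relies on the typo package, which offers a wider set of errors.

```python
import random
from typing import Optional


def inject_errors(text: str, interactions: int, seed: Optional[int] = None) -> str:
    """Hypothetical helper (not part of noisocr): apply `interactions`
    random character-level errors to `text`."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(interactions):
        if len(chars) < 2:
            break  # nothing sensible left to corrupt
        i = rng.randrange(len(chars) - 1)
        op = rng.choice(("swap", "drop", "repeat"))
        if op == "swap":
            # Transpose two neighbouring characters.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "drop":
            # Remove one character.
            del chars[i]
        else:
            # Duplicate one character.
            chars.insert(i, chars[i])
    return "".join(chars)


noisy = inject_errors("Hello, world!", 3, seed=0)
print(noisy)  # the exact string depends on the seed
```

Each round changes the length by at most one character, so more rounds generally produce text that drifts further from the original.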
- By default, the annotation types found in the BRESSAY dataset are used, but you can define which types of annotations to simulate. For annotations that contain internal text, use the pattern ##--text--##.
import noisocr
text = "Hello, world!"
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, $$--xxx--$$
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, ##--world!--##
text_with_annotation = noisocr.simulate_annotation(text, 0.01)
# Output: Hello, world!
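As a rough illustration of how the probability and the ##--text--## pattern could interact, the sketch below decides word by word whether to apply an annotation. Patterns containing `--text--` wrap the word, and any other pattern replaces it. The function `annotate`, its `annotations` parameter, and the word-level granularity are assumptions, not noisocr's actual behaviour.

```python
import random
from typing import Optional, Sequence


def annotate(
    text: str,
    probability: float,
    annotations: Sequence[str] = ("##--text--##", "$$--xxx--$$"),
    seed: Optional[int] = None,
) -> str:
    """Hypothetical helper (not part of noisocr): annotate each word
    with the given probability."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < probability:
            pattern = rng.choice(list(annotations))
            if "--text--" in pattern:
                # Pattern with internal text: keep the word inside the markers.
                out.append(pattern.replace("--text--", f"--{word}--"))
            else:
                # Pattern without internal text: the word is replaced.
                out.append(pattern)
        else:
            out.append(word)
    return " ".join(out)


wrapped = annotate("Hello, world!", 1.0, annotations=("##--text--##",))
# wrapped == '##--Hello,--## ##--world!--##'
```

With a probability of 0 the text is returned unchanged, which mirrors the low-probability example above where no annotation was applied.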