NoisOCR

Tools to simulate post-OCR noisy texts.

Features:

Sliding window;
Sliding window with hyphenation;
Simulate text errors;
Simulate text annotations.

Install

pip install noisocr

Sliding window:

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window(text, max_window_size)

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing', 
#   ...
#   'type and scrambled it to make a type specimen', 
#   'book.'
# ]
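
To make the windowing idea concrete, here is a minimal pure-Python sketch of a greedy word-based splitter. It is an illustration only, not NoisOCR's implementation: the name `greedy_windows` is hypothetical, and it assumes windows are built from whole words separated by single spaces.

```python
from typing import List


def greedy_windows(text: str, max_window_size: int) -> List[str]:
    """Pack whole words into consecutive windows of at most max_window_size characters.

    Hypothetical helper for illustration; not part of the noisocr API.
    """
    windows: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_window_size or not current:
            # The word fits, or it is a single oversized word that gets its own window.
            current = candidate
        else:
            # The word does not fit: close the current window and start a new one.
            windows.append(current)
            current = word
    if current:
        windows.append(current)
    return windows


sample = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
print(greedy_windows(sample, 50))
# ['Lorem Ipsum is simply dummy text of the printing', 'and typesetting industry.']
```

A word longer than `max_window_size` is kept whole in this sketch, so that one window may exceed the limit.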

Sliding window with hyphenation:

See the PyHyphen package (https://pypi.org/project/PyHyphen) for the list of supported languages.

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window_with_hyphenation(text, max_window_size, 'en_US')

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing ',        
#   'typesetting industry. Lorem Ipsum has been the in-', 
#   ...
#   'scrambled it to make a type specimen book.'
# ]
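
The sketch below illustrates how hyphenation can fill a window more tightly: when the next word does not fit, its longest leading piece that still fits (plus a hyphen) closes the window, and the remainder opens the next one. This is an assumption-laden illustration, not NoisOCR's code. The names `split_to_fit`, `windows_with_hyphenation` and `toy_syllables` are hypothetical, and the toy syllable table stands in for a real hyphenation dictionary such as the one PyHyphen provides.

```python
from typing import Callable, List, Optional, Tuple


def split_to_fit(
    word: str, room: int, syllables: Callable[[str], List[str]]
) -> Optional[Tuple[str, str]]:
    """Return (head ending in '-', tail) for the longest head that fits in `room`."""
    parts = syllables(word)
    for i in range(len(parts) - 1, 0, -1):
        head = "".join(parts[:i]) + "-"
        if len(head) <= room:
            return head, "".join(parts[i:])
    return None  # no hyphenation point fits


def windows_with_hyphenation(
    text: str, max_window_size: int, syllables: Callable[[str], List[str]]
) -> List[str]:
    """Greedy word windows; a word that does not fit is split at a syllable boundary."""
    windows: List[str] = []
    current = ""
    for word in text.split():
        prefix = f"{current} " if current else ""
        if len(prefix) + len(word) <= max_window_size:
            current = prefix + word
            continue
        split = split_to_fit(word, max_window_size - len(prefix), syllables)
        if split:
            head, word = split
            windows.append(prefix + head)
        elif current:
            windows.append(current)
        # Simplification: the remainder (or whole word) is not split a second time.
        current = word
    if current:
        windows.append(current)
    return windows


# Toy stand-in for a hyphenation dictionary (assumption, for illustration only).
TOY_TABLE = {"industry's": ["in", "dus", "try's"]}


def toy_syllables(word: str) -> List[str]:
    return TOY_TABLE.get(word, [word])


print(windows_with_hyphenation(
    "Lorem Ipsum has been the industry's standard", 28, toy_syllables))
# ['Lorem Ipsum has been the in-', "dustry's standard"]
```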

Simulate text errors:

See the typo package (https://pypi.org/project/typo) for the list of possible error types.

import noisocr

text = "Hello, world!"
# Errors are random, so the outputs below are only examples.
# Higher `interactions` values produce noisier text.
text_with_errors = noisocr.simulate_errors(text, interactions=1)
# Output: Hello, wotrld!
text_with_errors = noisocr.simulate_errors(text, 2)
# Output: Hsllo,wlorld!
text_with_errors = noisocr.simulate_errors(text, 5)
# Output: fllo,w0rlr!
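
For readers who want to see the general idea without installing anything, the following sketch applies a fixed number of random character-level edits (swap, delete, replace, insert). It is not how NoisOCR or the typo package work internally. The function `add_char_errors`, its parameters, and the choice of edit operations are all assumptions made for illustration.

```python
import random
import string
from typing import Optional


def add_char_errors(text: str, n_errors: int, seed: Optional[int] = None) -> str:
    """Apply n_errors random character edits to text (hypothetical helper)."""
    rng = random.Random(seed)  # a seed makes the noise reproducible
    chars = list(text)
    for _ in range(n_errors):
        if not chars:
            break
        i = rng.randrange(len(chars))
        op = rng.choice(["swap", "delete", "replace", "insert"])
        if op == "swap" and i + 1 < len(chars):
            # Transpose two neighbouring characters.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "delete":
            del chars[i]
        elif op == "replace":
            chars[i] = rng.choice(string.ascii_lowercase)
        else:
            # "insert", or a swap requested at the last position.
            chars.insert(i, rng.choice(string.ascii_lowercase))
    return "".join(chars)


noisy = add_char_errors("Hello, world!", n_errors=2, seed=42)
print(noisy)  # the exact result depends on the seed
```

Each edit changes the length by at most one character, so the result differs in length from the input by no more than `n_errors`.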

Simulate text annotations:

By default, the annotation types found in the BRESSAY dataset are used, but you can define which annotation types to simulate. For annotations with internal text, use the pattern ##--text--##.

import noisocr

text = "Hello, world!"
# Annotations are inserted at random according to `probability`,
# so the outputs below are only examples.
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, $$--xxx--$$
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, ##--world!--##
text_with_annotation = noisocr.simulate_annotation(text, 0.01)
# Output: Hello, world!
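
The pattern convention described above can be sketched in a few lines: each word is annotated with a given probability, and a pattern containing the placeholder `text` wraps the word while any other pattern replaces it. This is an illustrative guess at the behaviour, not NoisOCR's implementation. The function `annotate_words` and its `patterns` and `seed` parameters are hypothetical, and the two patterns are taken from the example outputs above.

```python
import random
from typing import Optional, Sequence


def annotate_words(
    text: str,
    probability: float,
    patterns: Sequence[str] = ("##--text--##", "$$--xxx--$$"),
    seed: Optional[int] = None,
) -> str:
    """Annotate each word with the given probability (hypothetical helper)."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < probability:
            pattern = rng.choice(list(patterns))
            # "text" is the placeholder for the original word; patterns
            # without it (e.g. "$$--xxx--$$") replace the word entirely.
            out.append(pattern.replace("text", word))
        else:
            out.append(word)
    return " ".join(out)


print(annotate_words("Hello, world!", 1.0, patterns=("##--text--##",)))
# ##--Hello,--## ##--world!--##
```

With `probability=0.0` the text is returned unchanged, apart from whitespace being normalized to single spaces.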
