pagexml-tools

Utility functions for reading PageXML files

Installing

using poetry

poetry add pagexml-tools

using pip

pip install pagexml-tools
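
To check that the install succeeded, one option is to ask Python's packaging metadata for the installed version. This is a small sketch using only the standard library. The helper name `installed_version` is made up for illustration, and the distribution name is the one used in the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="pagexml-tools"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string when the package is installed, otherwise None.
print(installed_version())
```

Returning `None` instead of raising keeps the check usable in setup scripts that only need a yes/no answer.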

Using

PageXML-tools contains functions for parsing PageXML files and for a range of analysis tasks.

Parsing PageXML files and the Physical Document model

There is a tutorial that demonstrates the physical document model API.

PageXML-tools contains basic functionality for parsing a PageXML file into a document model that represents the content of the file. The HTR/OCR process that generates PageXML recognises text in an image of a physical document.

from pagexml.parser import parse_pagexml_file

pagexml_file = "path/to/pagexml_file.xml"

page_doc = parse_pagexml_file(pagexml_file)

# a page document has an ID
print(page_doc.id)

# print descriptive statistics
print(page_doc.stats)

# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
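
Because a text region can contain nested text regions as well as lines, collecting all lines of a page takes a recursive walk. The sketch below is one way to do that. It is not part of PageXML-tools: the helpers `iter_lines` and `page_text` are hypothetical, and the code assumes that nested regions are exposed through a `text_regions` attribute, the same name the page document uses above. The demo uses plain stand-in objects, so it runs without a PageXML file.

```python
from types import SimpleNamespace


def iter_lines(region):
    """Yield every line under a region, descending into nested text regions.

    Assumption: nested regions live in `text_regions` and lines in `lines`;
    a missing or empty attribute is treated as an empty list.
    """
    for line in getattr(region, "lines", None) or []:
        yield line
    for sub_region in getattr(region, "text_regions", None) or []:
        yield from iter_lines(sub_region)


def page_text(page_doc, sep="\n"):
    """Join the text of all lines that have text, in traversal order."""
    return sep.join(line.text for line in iter_lines(page_doc) if line.text)


# Stand-in objects that mimic the attributes used in the example above.
inner = SimpleNamespace(
    text_regions=[],
    lines=[SimpleNamespace(id="l2", text="second line")],
)
outer = SimpleNamespace(
    text_regions=[inner],
    lines=[
        SimpleNamespace(id="l1", text="first line"),
        SimpleNamespace(id="l3", text=None),  # a line without recognised text
    ],
)
demo_page = SimpleNamespace(id="page-1", text_regions=[outer])

print(page_text(demo_page))
```

The walk visits a region's own lines before its nested regions, which is a traversal order and not necessarily the reading order of the page.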

In addition to the basic parsing and handling of PageXML output, there is functionality to support a range of tasks:

reading sets of PageXML files from an archive (tar, zip) file (tutorial),
searching in text (keyword in context, keywords or fuzzy search),
classifying physical document types in a large set of PageXML documents (tutorial),
checking the quality of the HTR/OCR process (tutorial),
comparing subsets (tutorial),
identifying document sections in sequences of PageXML documents (tutorial),
turning text lines into running text (tutorial),
supporting different reading orders (tutorial),
reinterpreting and restructuring text regions and lines (tutorial),
turning physical structure into logical structure.
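
To give an idea of what a keyword-in-context search over line text involves, here is a minimal sketch in plain Python. It is not the search functionality that PageXML-tools ships. The function `keyword_in_context` and its `width` parameter are hypothetical, and the input is just a list of strings, such as the `line.text` values of a parsed page.

```python
import re


def keyword_in_context(texts, keyword, width=20):
    """Yield (index, left, match, right) for each case-insensitive hit.

    `texts` is any sequence of strings; `width` is the maximum number of
    context characters kept on each side of the match.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    for index, text in enumerate(texts):
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            yield index, left, match.group(0), right


lines = [
    "The ship arrived in Amsterdam on Monday.",
    "Cargo was unloaded at the Amsterdam harbour.",
]
for index, left, hit, right in keyword_in_context(lines, "amsterdam", width=10):
    print(f"{index}: {left:>10}[{hit}]{right}")
```

Exact matching like this misses the spelling variation and recognition errors common in HTR/OCR output, which is where fuzzy search becomes useful.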

USAGE | CONTRIBUTING | LICENSE
