Index full page text into a new SOLR field using existing ALTO files #2256

Open
eporter23 opened this issue Feb 29, 2024 · 1 comment
Labels
Content Dissemination | Export Software Engineering Flag work for software engineering team

Comments

eporter23 commented Feb 29, 2024

As noted in the epic and in planning discussions, we will check each work's FileSets to see if they contain an ALTO XML file (which has the file use "Extracted"). That page-level XML file contains text content as well as page coordinates.

If the "Extracted" file is a .pos file rather than ALTO XML, we do not want to use it.

If there is no "Extracted" file attached to the FileSet, we can instead look for a .txt file ("Transcript File"). These contain text data but no word coordinates. This should still provide some IIIF search-within capability.
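The selection rules above can be sketched roughly as follows. This is only an illustration in Python, not the application's code: `AttachedFile` and `pick_full_text_source` are hypothetical names, and detecting ALTO by a `.xml` extension is an assumption. The issue does not say what to do when the only "Extracted" file is a .pos file, so the sketch returns nothing in that case.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AttachedFile:
    """Hypothetical stand-in for one file attached to a FileSet."""
    filename: str
    file_use: str  # e.g. "Extracted" or "Transcript File"


def pick_full_text_source(files: Sequence[AttachedFile]) -> Optional[AttachedFile]:
    """Return the file to index for full text, or None if nothing is usable."""
    extracted = [f for f in files if f.file_use == "Extracted"]

    # Prefer an "Extracted" XML file (assumed here to be ALTO).
    # An "Extracted" .pos file is never selected.
    for f in extracted:
        if f.filename.lower().endswith(".xml"):
            return f

    # Fall back to a plain-text transcript only when the FileSet
    # has no "Extracted" file at all (literal reading of the issue).
    if not extracted:
        for f in files:
            if f.file_use == "Transcript File" and f.filename.lower().endswith(".txt"):
                return f

    return None
```

Whether a FileSet with only a .pos "Extracted" file should also fall back to its transcript is an open question that this sketch leaves to the implementers.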

Examples of works with page-level ALTO files:
This work contains page-level ALTO and has already been indexed for full text search for the entire work.
https://curate-test.library.emory.edu/concern/curate_generic_works/453wstqk05-cor?locale=en&page=2
This work also contains ALTO, but has not been indexed for full text yet.
https://curate-test.library.emory.edu/concern/parent/7203xsj44s-cor/file_sets/501pg4f51w-cor

Examples of works without ALTO files:
https://curate-test.library.emory.edu/concern/parent/28380gb5xh-cor/curate_generic_works/4300zpc8g9-cor
https://curate-test.library.emory.edu/concern/curate_generic_works/846d2547r6-cor?locale=en

@eporter23

Current status: we are looking at using a Solr plugin to assist with highlighting behavior. The indexing work is also underway, and the Solr fields are in place. If the Solr field contains the text of the XML, the plugin automatically converts it into hOCR, which the UV needs for the highlighting behavior.
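For context on what the ALTO XML carries, here is a minimal Python sketch that reads the words and their coordinates from a page-level ALTO file. It relies only on the standard ALTO `String` attributes (`CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`); the function names are hypothetical, and this does not represent how the Solr plugin itself processes the field.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# (text, hpos, vpos, width, height)
Word = Tuple[str, float, float, float, float]


def alto_words(alto_xml: str) -> List[Word]:
    """Collect every ALTO String element with its bounding box."""
    root = ET.fromstring(alto_xml)
    words: List[Word] = []
    for el in root.iter():
        # Compare the local name so any ALTO namespace version matches.
        if el.tag.rsplit("}", 1)[-1] != "String":
            continue
        words.append((
            el.get("CONTENT", ""),
            float(el.get("HPOS", 0)),
            float(el.get("VPOS", 0)),
            float(el.get("WIDTH", 0)),
            float(el.get("HEIGHT", 0)),
        ))
    return words


def alto_plain_text(alto_xml: str) -> str:
    """Join the word contents into plain text, dropping coordinates."""
    return " ".join(word[0] for word in alto_words(alto_xml))
```

The plain-text form is roughly what a .txt transcript fallback would provide, while the coordinate tuples are the extra information that makes word-level highlighting possible.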
