IiifPrint is a gem (Rails "engine") for Hyrax-based digital repository applications to support displaying parent/child works in the same viewer (Universal Viewer) and the ability to search OCR from the parent work to the child work(s).
IiifPrint is not a stand-alone application. It is designed to be integrated into a new or existing Hyku (v4.0-v5.0) application. Future development will include integrating it into a Hyrax-based application without Hyku and support for IIIF Presentation Manifest version 3 along with AllinsonFlex metadata profiles.
IiifPrint supports:
- OCR and ALTO creation
- full-text search
- OCR keyword match highlighting
- viewer with page navigation and deep zooming
- splitting of PDFs to LZW compressed TIFFs for viewing
- adding metadata fields to the manifest with faceted search links and external links
- excluding specified work types from catalog search results
- external IIIF image URLs that work with services such as serverless-iiif or Cantaloupe
A complete list of features can be found here.
Documentation to help you learn more about and deploy IiifPrint can be found on the Project Wiki.
IiifPrint was developed against Hyku v4.0-v5.0. If your application uses Bulkrax, please ensure that its version is 5.0.1 or greater.
- Ruby >=2.4
- Rails ~>5.0
- Bundler
- Hyrax v2.5-v3.5.0
- ...and various Samvera dependencies that entails.
- A Hyrax-based Rails application
- FITS
- Tesseract-ocr
- LibreOffice
- ghostscript
- poppler-utils
- ImageMagick
- ImageMagick policy XML may need to be more permissive in both resources and source media types allowed. See template policy.xml.
- libcurl3
- libgbm1
IiifPrint easily integrates with your Hyrax 2.x applications.
- Add gem 'iiif_print' to your Gemfile.
- Run bundle install
- Run rails generate iiif_print:install
- Set config options as indicated below...
The installer makes the following changes:
- In app/assets/javascripts/application.js, it adds //= require iiif_print
- Adds app/assets/stylesheets/iiif_print.scss
- In app/controllers/catalog_controller.rb, it adds include BlacklightIiifSearch::Controller
- In app/controllers/catalog_controller.rb, it adds add_index_field and iiif_search config in the configure_blacklight block
- Adds app/models/iiif_search_build.rb
- In config/routes.rb, it adds concern :iiif_search, BlacklightIiifSearch::Routes.new
- In config/routes.rb, it adds concerns :iiif_search in the resources :solr_documents block
- Adds config/initializers/iiif_print.rb
- Adds three migrations: CreateIiifPrintDerivativeAttachments, CreateIiifPrintIngestFileRelations, and CreateIiifPrintPendingRelationships
(It may be helpful to run git diff after installation to see all the changes made by the installer.)
To enable a feature where the UV automatically picks up the search from the catalog, do the following:
- Add highlight: urlDataProvider.get('q'), into your uv.html in the <script> section.
uv = createUV('#uv', {
  root: '.',
  iiifResourceUri: urlDataProvider.get('manifest'),
  configUri: 'uv-config.json',
  collectionIndex: Number(urlDataProvider.get('c', 0)),
  manifestIndex: Number(urlDataProvider.get('m', 0)),
  sequenceIndex: Number(urlDataProvider.get('s', 0)),
  canvasIndex: Number(urlDataProvider.get('cv', 0)),
  rangeId: urlDataProvider.get('rid', 0),
  rotation: Number(urlDataProvider.get('r', 0)),
  xywh: urlDataProvider.get('xywh', ''),
  embedded: true,
  highlight: urlDataProvider.get('q'), // <-- here's a good spot
  locales: formattedLocales
}, urlDataProvider);
- Make sure to remove your application's app/helpers/hyrax/iiif_helper.rb and app/views/hyrax/base/iiif_viewers/_universal_viewer.html.erb (if they exist).
NOTE: WorkTypes and models are used synonymously here.
We created IiifPrint with an assumption of ActiveFedora. However, as Hyrax now supports Valkyrie, we need an alternate approach. We introduced IiifPrint::Configuration#persistence_layer as a configuration option. By default it will use ActiveFedora methods; but you can switch adapters to use Valkyrie instead. (See IiifPrint::PersistentLayer for more details).
If you set EXTERNAL_IIIF_URL in your environment, then IiifPrint will use that URL as the root for your IIIF URLs. It will also switch from using the file set ID to using the SHA1 of the file as the identifier. This enables using serverless_iiif or Cantaloupe (referred to as the service) by pointing the service to the same S3 bucket that FCREPO writes the uploaded files to. With that setup, the service does not need to connect to FCREPO or Hyrax at all; both serverless_iiif and Cantaloupe natively support reading their data from an S3 bucket.
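As a rough illustration of the identifier switch described above, the sketch below builds an IIIF Image API URL from a base URL and the SHA1 of a file's contents. The helper name external_iiif_image_url, the example host, and the full/max/0/default.jpg parameters are assumptions made for illustration; this is not IiifPrint's actual implementation.

```ruby
require "digest/sha1"

# Hypothetical helper (not part of IiifPrint): builds an IIIF Image API URL
# that uses the SHA1 of the file contents as the image identifier.
def external_iiif_image_url(base_url:, file_contents:, size: "max")
  identifier = Digest::SHA1.hexdigest(file_contents)
  # IIIF Image API pattern:
  #   {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
  "#{base_url.chomp('/')}/#{identifier}/full/#{size}/0/default.jpg"
end

# Assumed fallback host; a real deployment would set EXTERNAL_IIIF_URL.
base = ENV.fetch("EXTERNAL_IIIF_URL", "https://iiif.example.org/iiif/2")
puts external_iiif_image_url(base_url: base, file_contents: "example bytes")
```

Because the identifier depends only on the file's bytes, the image service can resolve it against the shared bucket without asking Hyrax which file set the image belongs to.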
In app/models/{work_type}.rb add include IiifPrint.model_configuration to any work types which require IiifPrint processing features (such as PDF splitting or OCR derivatives). See lib/iiif_print.rb for details on configuration options.
# Example model Book which splits PDFs into child works of
# model Page, and runs only one derivative service (TIFFs)
class Book < ActiveFedora::Base
  include IiifPrint.model_configuration(
    pdf_split_child_model: Page,
    derivative_service_plugins: [
      IiifPrint::TIFFDerivativeService
    ]
  )
end
In config/initializers/iiif_print.rb specify application level configuration options.
IiifPrint.config do |config|
  # Add models to be excluded from search so the user would not see them in the search results.
  # By default, use the human readable versions like:
  config.excluded_model_name_solr_field_values = ['Generic Work', 'Image']
  # Add a configurable solr field key for searching; the default key is 'human_readable_type_sim'.
  # If another key is used, make sure to adjust config.excluded_model_name_solr_field_values to match.
  config.excluded_model_name_solr_field_key = 'some_solr_field_key'
end
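To make the effect of these two options more concrete, here is a minimal sketch of how an exclusion could be expressed as a Solr filter query. The helper excluded_models_filter_query is hypothetical; IiifPrint's own search builder may construct its query differently.

```ruby
# Hypothetical helper (not IiifPrint's API): turns a Solr field key and a list
# of model names into a negative filter query string.
def excluded_models_filter_query(field_key, values)
  return nil if values.empty?

  # Quote each value so names containing spaces (e.g. "Generic Work") stay intact.
  quoted = values.map { |value| %("#{value}") }
  "-#{field_key}:(#{quoted.join(' OR ')})"
end

puts excluded_models_filter_query("human_readable_type_sim", ["Generic Work", "Image"])
# prints: -human_readable_type_sim:("Generic Work" OR "Image")
```

The sketch also shows why the values must match the chosen field key: if the key holds class names rather than human readable names, the quoted values would need to be class names as well.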
TO ENABLE OCR Search (from the UV and catalog search)
- In the CatalogController, find the add_search_field config block for 'all_fields'. Add advanced_parse: false as seen in the following example:
config.add_search_field('all_fields', label: 'All Fields', include_in_advanced_search: false, advanced_parse: false) do |field|
  all_names = config.show_fields.values.map(&:field).join(" ")
  title_name = 'title_tesim'
  field.solr_parameters = {
    qf: "#{all_names} file_format_tesim all_text_timv",
    pf: title_name.to_s
  }
end
- Set config.search_builder_class = IiifPrint::CatalogSearchBuilder to remove works from the catalog search results if is_child_bsi: true
- Ensure that all text search is configured in default_solr_params config block:
config.default_solr_params = {
  qt: "search",
  rows: 10,
  qf: "title_tesim description_tesim creator_tesim keyword_tesim all_text_timv"
}
To remove child works from recent works on homepage
- In the HomepageController, change the search_builder_class to remove works from recent_documents if is_child_bsi: true
require "iiif_print/homepage_search_builder"
def search_builder_class
  IiifPrint::HomepageSearchBuilder
end
By default, when a work is configured for splitting PDFs, we will split all PDFs. However, in some cases you may want to skip splitting for PDFs whose file names end with a particular suffix. In that case, configure the initializer as follows:
IiifPrint.config do |config|
  config.skip_splitting_pdf_files_that_end_with_these_texts = ['.reader.pdf']
end
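The check implied by this option can be sketched in a few lines. The helper skip_pdf_splitting? below is hypothetical and assumes a plain, case-sensitive comparison against the end of the file name; the gem's own matching rules may differ.

```ruby
# Hypothetical helper (not IiifPrint's API): returns true when the file name
# ends with any of the configured texts. With an empty list nothing is skipped.
def skip_pdf_splitting?(filename, suffixes)
  filename.end_with?(*suffixes)
end

suffixes = [".reader.pdf"]
puts skip_pdf_splitting?("annual-report.reader.pdf", suffixes) # prints: true
puts skip_pdf_splitting?("annual-report.pdf", suffixes)        # prints: false
```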
The Derivative Rodeo is used in two ways:
- Configuring the Hyrax::DerivativeService by adding IiifPrint::DerivativeRodeoService
- Enabling the Derivative Rodeo PDF splitting service via IiifPrint.model_configuration
In the application initializer:
Hyrax::DerivativeService.services = [
  IiifPrint::DerivativeRodeoService,
  Hyrax::FileSetDerivativesService
]
The IiifPrint.model_configuration method allows for specifying the pdf_splitter_service as below:
class Book < ActiveFedora::Base
  include IiifPrint.model_configuration(
    pdf_splitter_service: IiifPrint::SplitPdfs::DerivativeRodeoSplitter
  )
end
The DerivativeRodeo allows for specifying a location where you've done pre-processing (e.g. you ran splitting and derivative generation in AWS's Lambda).
By default the preprocess location is S3, as that is where SoftServ has been running pre-processing. However that default may not be adequate for local development.
The IiifPrint::DerivativeRodeoService (./app/services/iiif_print/derivative_rodeo_service.rb) provides a means of specifying the derivatives to generate via two configuration points:
IiifPrint::DerivativeRodeoService.named_derivatives_and_generators_by_type
IiifPrint::DerivativeRodeoService.named_derivatives_and_generators_filter
In the case of named_derivatives_and_generators_by_type, we're saying all mime categories will generate these derivatives.
In the case of named_derivatives_and_generators_filter, we're providing a point where we can specify for each file_set and filename the specific derivatives to accept/reject/append to the named derivative generation.
See their examples for further configuration guidance.
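The idea behind the filter can be sketched without the gem installed. Everything in the block below is an assumption made for illustration: the hash shape (derivative name mapped to a generator name), the generator names, and the keyword arguments of the lambda. Check the service's source for the real signature before adapting it.

```ruby
# Assumed shape: derivative name => generator name. Plain strings are used so
# the sketch runs without DerivativeRodeo; the generator names are made up.
DEFAULT_DERIVATIVES = {
  thumbnail: "ThumbnailGenerator",
  txt: "PlainTextGenerator"
}.freeze

# Hypothetical filter: receives the file set, the file name, and the default
# derivatives, and returns the derivatives to generate for that file.
DERIVATIVE_FILTER = lambda do |file_set:, filename:, named_derivatives_and_generators:|
  if filename.end_with?(".reader.pdf")
    # Reject: skip the text derivative for reader copies.
    named_derivatives_and_generators.reject { |name, _generator| name == :txt }
  else
    # Accept: keep the defaults unchanged.
    named_derivatives_and_generators
  end
end

result = DERIVATIVE_FILTER.call(
  file_set: nil, # unused in this sketch
  filename: "annual-report.reader.pdf",
  named_derivatives_and_generators: DEFAULT_DERIVATIVES
)
puts result.keys.inspect
```

Returning a new hash rather than mutating the defaults keeps the by-type configuration reusable across files.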
IiifPrint supports a range of different ingest workflows:
- single-item ingest via the UI
- batch ingest of works from local files or remote files via Bulkrax
The ingest process is configurable at the model level, granting the option to:
- split a PDF into TIFFs and create child works
- create a full complement of derivatives, including TIFF, JP2, PDF, OCR text, and word-coordinate JSON
We develop the IIIF Print gem using Docker and Docker Compose. You'll want to clone this repository and run the following commands:
$ docker compose build
$ docker compose up
$ docker compose exec web bash
You'll now be inside the web container:
$ bundle exec rake
The above will build the test application (if it doesn't already exist). During the build you might get a notice about conflicting files and be asked whether to overwrite them. We recommend that you select the "accept all" option (e.g. by typing a).
To rebuild the test application, delete the .internal_test_app directory.
If you're working on a PR for this project, create a feature branch off of main.
This repository follows the Samvera Community Code of Conduct and language recommendations. Please do not create a branch called master for this repository or as part of your pull request; the branch will either need to be removed or renamed before it can be considered for inclusion in the code base and history of this repository.
We encourage anyone who is interested in newspapers and Samvera to contribute to this project. How can I contribute?
IIIF Print is a gem that was forked off Newspaper Works, a powerful and versatile library for working with digitized newspapers. We would like to thank the team and maintainers of Newspaper Works for creating such a useful and well-designed gem. Our work on IIIF Print would not have been possible without their hard work and dedication.
In particular, we would like to express our gratitude to brianmcbride, seanupton, ebenenglish, and JacobR for their pioneering efforts on Newspaper Works. Their foundation and expertise were invaluable in the development of this gem.
Thank you to the entire Newspaper Works team for creating and maintaining such a valuable resource for the Samvera community.