This work uses an ETL (extract-transform-load) approach and machine learning techniques to support image retrieval in multi-collection digital libraries.
Specs are:
- Identify and extract iconography wherever it may be found, in image collections but also in printed materials (newspapers, magazines, books);
- Transform, harmonize and enrich the image descriptive metadata (in particular with machine learning classification tools: IBM Watson for visual recognition, Google TensorFlow Inception-V3 for image classification);
- Load all the metadata into a web app dedicated to image retrieval.
A proof of concept has been implemented on the World War 1 theme. All the content has been harvested from the BnF (Bibliothèque nationale de France) digital collections (gallica.bnf.fr) of heritage materials (photos, drawings, engravings, maps, posters, etc.).
"Image Retrieval in Digital Libraries" (EN article, FR article, presentation), IFLA News Media section 2017 (Dresden, August 2017).
The datasets are available as metadata files (one XML file per document). Images can be extracted from the metadata files using the IIIF Image API:
- Complete dataset (300k illustrations)
- Person dataset
- Gender dataset
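As an illustration of how an image can be fetched through the IIIF Image API, here is a minimal Python sketch that builds a request URL following the standard `{identifier}/{region}/{size}/{rotation}/{quality}.{format}` pattern. The base URL, the `f{page}` page suffix and the `native` quality keyword are assumptions about the Gallica endpoint, and the ARK used in the usage line is a placeholder, not a real document.

```python
# Assumed base URL of the IIIF Image endpoint; change it for another server.
IIIF_BASE = "https://gallica.bnf.fr/iiif"


def iiif_image_url(ark, page, region="full", size="full", rotation=0,
                   quality="native", fmt="jpg", base=IIIF_BASE):
    """Build an IIIF Image API URL for one page of a document.

    region may be "full" or an (x, y, width, height) tuple in pixels.
    "native" is the IIIF Image API 1.x quality keyword (assumed here);
    servers implementing version 2 or later expect "default" instead.
    """
    if isinstance(region, tuple):
        region = ",".join(str(int(v)) for v in region)
    return f"{base}/{ark}/f{page}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Placeholder ARK: crop one illustration and scale it to 50%.
print(iiif_image_url("ark:/12148/EXAMPLE", 3,
                     region=(10, 20, 300, 400), size="pct:50"))
```

Passing the illustration's position and dimensions as the region would crop the illustration out of the page image, provided they are expressed in pixels of the full-size image.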
This dataset has been used for the image genres classification training:
- Image genres classification dataset (10k images)
Note: All the scripts have been written by an amateur coder. They have been designed for the Gallica BnF digital documents and digital repositories, but they can easily be adapted to other contexts.
The metadata are stored using an in-house XML schema (IR_schema.xsd).
Sample documents are generally stored in the "DOCS" folder. Output samples are stored in OUT folders.
We've used Perl scripts. The extract step can be performed from OAI-PMH, SRU or OCR sources.
The OAI-PMH Gallica repository (endpoint) can be harvested for sets or documents. The Perl script extractMD_OAI.pl can handle two methods:
- harvesting a complete OAI Set
- harvesting a document from its ark (or a list of documents).
Usage:
perl extractMD_OAI.pl gallica:corpus:1418 OUT xml
where:
- "gallica:corpus:1418" is the OAI set
- "OUT" is the output folder
- "xml" is the (only) output format
This script also performs (using the available metadata):
- topic classification (considering the WW1 theme)
- image genres classification (photo/drawing/map...)
It outputs one XML metadata file per document, describing each page (and included illustrations) of the document.
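For readers unfamiliar with OAI-PMH, harvesting a set boils down to a loop of `ListRecords` requests chained by resumption tokens. The Python sketch below shows that mechanism only; it is not a port of extractMD_OAI.pl. The function names and the endpoint in the usage comment are hypothetical, and `oai_dc` is assumed as the metadata prefix.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(endpoint, oai_set=None, token=None, prefix="oai_dc"):
    """Build a ListRecords request URL.

    Per the OAI-PMH protocol, resumptionToken is an exclusive argument:
    when it is present, set and metadataPrefix must be omitted.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": prefix}
        if oai_set:
            params["set"] = oai_set
    return endpoint + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one response."""
    root = ET.fromstring(xml_text)
    identifiers = [e.text for e in root.iter(OAI_NS + "identifier")]
    tok = root.find(f".//{OAI_NS}resumptionToken")
    token = tok.text.strip() if tok is not None and tok.text else None
    return identifiers, token


# First request for a set (hypothetical endpoint):
# list_records_url("https://example.org/oai", oai_set="gallica:corpus:1418")
```

A harvester would fetch the first URL, call `parse_page` on the response, and repeat with the returned token until it comes back as `None`.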
The Gallica digital library can be queried through SRU with extractARKs_SRU.pl. The SRU request must be copied and pasted directly into the script.
It outputs a text file (one ark ID per line). This output can then be used as the input of the OAI-PMH script.
Usage:
perl extractARKs_SRU.pl OUT.txt
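The same idea can be sketched in Python: build a standard SRU `searchRetrieve` request, then pull the ARK identifiers out of the response text. This is only an illustration under assumptions. The endpoint URL, the CQL query in the usage comment and the ARK pattern are not taken from extractARKs_SRU.pl.

```python
import re
from urllib.parse import urlencode

# Assumed SRU endpoint; replace it with the one actually used.
SRU_ENDPOINT = "https://gallica.bnf.fr/SRU"

# Assumed shape of an ARK: "ark:/", a numeric authority, then a lowercase name.
ARK_RE = re.compile(r"ark:/\d+/[0-9a-z]+")


def sru_url(query, start=1, maximum=50, endpoint=SRU_ENDPOINT):
    """Build an SRU 1.2 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return endpoint + "?" + urlencode(params)


def extract_arks(response_text):
    """Return the distinct ARK IDs found in a response, in order of appearance."""
    return list(dict.fromkeys(ARK_RE.findall(response_text)))


# Illustrative CQL query (hypothetical): sru_url('dc.title all "guerre"')
```

Writing `"\n".join(extract_arks(text))` to a file would give the one-ARK-per-line format described above.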
OCRed documents can be analysed using the extractMD.pl script. This script is the most BnF-centric one in this repository and will be difficult to adapt to other contexts. It can handle various types of digital documents (books, newspapers) produced by the BnF or during the Europeana Newspapers project. For newspapers, the script can handle raw OCR production or OLR production (article recognition with METS/ALTO).
Usage:
perl extractMD.pl [-LI] mode title IN OUT format
where:
- -L: extraction of illustrations is performed (dimensions, caption...)
- -I: BnF ark IDs are computed
- mode: type of BnF document (olren, ocren, olrbnf, ocrbnf)
- title: some newspapers need to be identified by their title
- IN: input folder
- OUT: output folder
- format: XML only
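To give an idea of what illustration extraction from OCR output involves, the Python sketch below reads the position and dimensions of `Illustration` blocks from an ALTO page. It is a simplified illustration, not the logic of extractMD.pl: the function name is hypothetical, only the standard `ID`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes are read, and captions are not handled.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def illustrations(alto_xml):
    """Return position and size of every Illustration block in an ALTO page."""
    root = ET.fromstring(alto_xml)
    found = []
    for el in root.iter():
        if local_name(el.tag) == "Illustration":
            found.append({
                "id": el.get("ID"),
                "x": float(el.get("HPOS", 0)),
                "y": float(el.get("VPOS", 0)),
                "width": float(el.get("WIDTH", 0)),
                "height": float(el.get("HEIGHT", 0)),
            })
    return found
```

The values are returned in the measurement unit declared by the ALTO file, so they may need converting to pixels before being used to crop an image.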
We've used IBM Watson Visual Recognition API. The script calls the API to perform visual recognition of content or human faces.
The Inception-v3 model (Google's convolutional neural network) has been retrained on a 12-class (photos, drawings, maps, music scores, comics...) ground truth dataset (10k images). Three Python scripts are used:
- split.py: the GT dataset is split into a training set (2/3) and an evaluation set (1/3)
- retrain.py: the training set is used to train the last layer of the Inception-v3 model
- label.py: the evaluation set is labeled by the model
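The 2/3 and 1/3 split described above can be sketched as follows. This is a minimal illustration, not the actual split.py: it assumes the ground truth is available as (path, label) pairs and splits each class separately so that both sets keep the same class proportions, which the original script may or may not do.

```python
import random
from collections import defaultdict


def split_dataset(samples, train_fraction=2 / 3, seed=0):
    """Split (path, label) pairs into training and evaluation sets, per class."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, evaluation = [], []
    for label in sorted(by_label):
        paths = sorted(by_label[label])
        rng.shuffle(paths)
        cut = round(len(paths) * train_fraction)
        train += [(p, label) for p in paths[:cut]]
        evaluation += [(p, label) for p in paths[cut:]]
    return train, evaluation
```

Keeping the evaluation images out of the retraining step is what makes the labels produced on them a fair estimate of the model's accuracy.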
This script performs basic operations on the documents metadata files:
- deletion
- renumbering
- ...
An XML database (BaseX.org) is used. Querying the metadata is done with XQuery (see https://github.com/altomator/EN-data_mining for details). The web app uses the IIIF Image API and the Masonry grid layout JavaScript library for image display.