eprints2bags

A program for downloading records from an EPrints server and creating BagIt packages out of them.

Authors: Michael Hucka, Betsy Coles
Repository: https://github.com/caltechlibrary/eprints2bags
License: BSD/MIT derivative – see the LICENSE file for more information

☀ Introduction

Materials in EPrints must be extracted before they can be moved to a long-term preservation system or dark archive. Eprints2bags is a self-contained program that encapsulates the processes needed to download records and documents from EPrints, bundle up individual records in BagIt packages, and create single-file archives (e.g., in ZIP format) of each bag. The program is written in Python 3 and works over a network using an EPrints server's REST API.

✺ Installation instructions

The instructions below assume you have a Python interpreter installed on your computer; if that's not the case, please first install Python and familiarize yourself with running Python programs on your system.

On Linux, macOS, and Windows operating systems, you should be able to install eprints2bags with pip. If you don't have the pip package or are uncertain if you do, first run the following command in a terminal command line interpreter:

sudo python3 -m ensurepip

Then, to install eprints2bags from the Python package repository, run the following command:

python3 -m pip install eprints2bags --user --upgrade

As an alternative to getting it from PyPI, you can instruct pip to install eprints2bags directly from the GitHub repository:

python3 -m pip install git+https://github.com/caltechlibrary/eprints2bags.git --user --upgrade

On Linux and macOS systems, assuming that the installation proceeds normally, you should end up with a program called eprints2bags in a location normally searched by your terminal shell for commands.

▶︎ Using Eprints2bags

For help with usage at any time, run eprints2bags with the option -h (or /h on Windows).

eprints2bags contacts an EPrints REST server whose network API is accessible at the URL given by the command-line option -a (or /a on Windows). A typical EPrints server URL has the form https://somename.yourinstitution.edu/rest. This program will automatically add /eprint to the URL path given, so omit that part of the URL in the value given to -a. The -a (or /a) option is required; the program cannot infer the server address on its own.

Specifying which records to get

The EPrints records to be written will be limited to the list of EPrints numbers found in the file given by the option -i (or /i on Windows). If no -i option is given, this program will download all the contents available at the given EPrints server. The value of -i can also be one or more integers separated by commas (e.g., -i 54602,54604), or a range of numbers separated by a dash (e.g., -i 1-100, which is interpreted as the list of numbers 1, 2, ..., 100 inclusive), or some combination thereof. In those cases, the records written will be limited to those numbered.
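
The number syntax described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated (comma-separated integers, dash-separated inclusive ranges, or a mix), not the code eprints2bags itself uses; the function name parse_id_list is hypothetical, and the case where -i names a file is not handled.

```python
def parse_id_list(value):
    """Expand a string such as "1-3,54602" into a list of record numbers."""
    numbers = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            # A range such as "1-100" is inclusive at both ends.
            numbers.extend(range(int(start), int(end) + 1))
        else:
            numbers.append(int(part))
    return numbers


print(parse_id_list("54602,54604"))
print(parse_id_list("1-5,9"))
```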

If the -l option (or /l on Windows) is given, the records will be additionally filtered to return only those whose last-modified date/time stamp is no older than the given date/time description. Valid descriptors are those accepted by the Python dateparser library. Make sure to enclose descriptions within single or double quotes. Examples:

eprints2bags -l "2 weeks ago" -a ....
eprints2bags -l "2014-08-29"  -a ....
eprints2bags -l "12 Dec 2014" -a ....
eprints2bags -l "July 4, 2013" -a ....
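
To make the "no older than" rule concrete, here is a minimal sketch of the comparison using only the Python standard library. It assumes the cutoff has already been turned into a datetime (the job dateparser does for strings such as "2 weeks ago") and assumes a last-modified stamp of the form "YYYY-MM-DD HH:MM:SS"; both the function name and the stamp format are illustrative assumptions, not taken from eprints2bags.

```python
from datetime import datetime


def modified_since(lastmod, cutoff):
    """Return True if the last-modified stamp is no older than the cutoff."""
    # Assumed stamp layout; a real server may format the value differently.
    stamp = datetime.strptime(lastmod, "%Y-%m-%d %H:%M:%S")
    return stamp >= cutoff


cutoff = datetime(2014, 8, 29)  # what -l "2014-08-29" would correspond to
print(modified_since("2018-12-13 20:05:11", cutoff))
print(modified_since("2013-07-04 09:30:00", cutoff))
```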

If the -s option (or /s on Windows) is given, the records will also be filtered to include only those whose <eprint_status> element value is one of the listed status codes. Comparisons are done in a case-insensitive manner. Putting a caret character (^) in front of the status (or status list) negates the sense, so that eprints2bags will only keep those records whose <eprint_status> value is not among those given. Examples:

eprints2bags -s archive -a ...
eprints2bags -s ^inbox,buffer,deletion -a ...
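
The matching rule for -s (case-insensitive comparison, with a leading caret negating the whole list) can be sketched as follows. The function name keep_record is hypothetical and the code is an illustration of the described behavior rather than the program's own implementation.

```python
def keep_record(status, spec):
    """Return True if a record with this status passes the -s filter spec."""
    negate = spec.startswith("^")
    wanted = {s.strip().lower() for s in spec.lstrip("^").split(",") if s.strip()}
    matches = status.strip().lower() in wanted
    # A leading caret inverts the test for the entire list.
    return not matches if negate else matches


print(keep_record("Archive", "archive"))
print(keep_record("inbox", "^inbox,buffer,deletion"))
print(keep_record("archive", "^inbox,buffer,deletion"))
```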

Both lastmod and status filtering are done after the -i argument is processed.

By default, eprints2bags stops execution if an error occurs when requesting a record from the EPrints server. Common causes of errors include missing records implied by the arguments to -i, missing files associated with a given record, and files inaccessible due to permissions errors. If the option -k (or /k on Windows) is given, eprints2bags will attempt to keep going upon encountering missing records, missing files within records, or similar errors. Option -k is particularly useful when giving a range of numbers with the -i option, as it is common for EPrints records to be updated or deleted and gaps to be left in the numbering. (Running without -i will skip over gaps in the numbering because the available record numbers are obtained directly from the server, unlike a user-provided list of record numbers that may or may not exist on the server. However, even without -i, errors may still result from permissions problems or other causes.)

Specifying what to do with the records

This program writes its output in subdirectories under the directory given by the command-line option -o (or /o on Windows). If the directory does not exist, this program will create it. If no -o is given, the current directory where eprints2bags is running is used. Whatever the destination is, eprints2bags will create subdirectories in the destination, with each subdirectory named according to the EPrints record number (e.g., /path/to/output/43, /path/to/output/44, /path/to/output/45, ...). If the -n option (/n on Windows) is given, the subdirectory names are changed to have the form NAME-NUMBER, where NAME is the text string provided to the -n option and NUMBER is the EPrints number for a given entry (e.g., /path/to/output/NAME-43, /path/to/output/NAME-44, /path/to/output/NAME-45, ...).
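
As a small illustration of the naming scheme, the following sketch computes the subdirectory for a record with and without a -n prefix. The function name record_dir is hypothetical; it shows the naming rule only and does not create anything on disk.

```python
from pathlib import Path


def record_dir(output_dir, number, name_base=None):
    """Return the per-record subdirectory under the output directory."""
    name = f"{name_base}-{number}" if name_base else str(number)
    return Path(output_dir) / name


print(record_dir("/path/to/output", 43).as_posix())
print(record_dir("/path/to/output", 43, name_base="NAME").as_posix())
```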

Each directory will contain an EPrints XML file and additional document file(s) associated with the EPrints record in question. Documents associated with each record will be fetched over the network. The list of documents for each record is determined from the XML file, in the <documents> element. Certain EPrints internal documents such as indexcodes.txt and preview images are ignored.
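
The idea of reading the document list from the <documents> element can be sketched with the Python standard library. The sample XML below is a simplified, hypothetical layout (the namespace, the nesting of <document>, <files>, <file> and <url>, and the example.org addresses are assumptions made for illustration), and only indexcodes.txt is filtered out because the text above does not say how preview images are recognized. eprints2bags itself uses lxml and may work differently.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical EPrints XML; real records carry many more fields.
SAMPLE = """<eprints xmlns="http://eprints.org/ep2/data/2.0">
  <eprint>
    <eprintid>355</eprintid>
    <documents>
      <document>
        <files>
          <file><url>https://example.org/355/1/paper.pdf</url></file>
        </files>
      </document>
      <document>
        <files>
          <file><url>https://example.org/355/2/indexcodes.txt</url></file>
        </files>
      </document>
    </documents>
  </eprint>
</eprints>"""

IGNORED_NAMES = {"indexcodes.txt"}


def local_name(tag):
    """Strip any "{namespace}" prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def document_urls(xml_text):
    """Return the file URLs listed under <documents>, minus ignored files."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        if local_name(element.tag) != "documents":
            continue
        for child in element.iter():
            if local_name(child.tag) == "url" and child.text:
                url = child.text.strip()
                if url.rsplit("/", 1)[-1] not in IGNORED_NAMES:
                    urls.append(url)
    return urls


print(document_urls(SAMPLE))
```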

By default, each record and its associated files downloaded from EPrints will be placed in a directory structure that follows the BagIt specification, and this bag will then be put into its own single-file archive. The default archive file format is ZIP with compression turned off (see next paragraph). Option -b (/b on Windows) can be used to change this behavior. This option takes a keyword value; possible values are none, bag and bag-and-archive, with the last being the default. Value none will cause eprints2bags to leave the downloaded record content in individual directories without bagging or archiving, and value bag will cause eprints2bags to create BagIt bags but not single-file archives from the results. Everything will be left in the output directory (the location given by the -o or /o option). Note that creating bags is a destructive operation: it replaces the individual directories of each record with a restructured directory corresponding to the BagIt format.

The type of archive made when bag-and-archive mode is used for the -b option can be changed using the option -t (or /t on Windows). The possible values are: compressed-zip, uncompressed-zip, compressed-tar, and uncompressed-tar. As mentioned above, the default is uncompressed-zip (used if no -t option is given). ZIP is the default because it is more widely recognized and supported than tar format, and uncompressed ZIP is used because file corruption is generally more damaging to a compressed archive than an uncompressed one. Since the main use case for eprints2bags is to archive contents for long-term storage, avoiding compression seems safer.

The ZIP archive file will be written with a text comment describing the contents of the archive. This comment can be viewed by ZIP utilities (e.g., using zipinfo -z on Unix/Linux and macOS). The following is an example of a comment and the information it contains:

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
About this archive file:

This is an archive of a file directory organized in BagIt v1.0 format.
The bag contains the content from the EPrints record located at
http://resolver.caltech.edu/CaltechAUTHORS:SHIjfm98

The software used to create this archive file was:
eprints2bags version 1.3.1 <https://github.com/caltechlibrary/eprints2bags>

The following is the metadata contained in bag-info.txt:
Bag-Software-Agent: bagit.py v1.7.0 <https://github.com/LibraryOfCongress/bagit-python>
Bagging-Date: 2018-12-13
External-Description: Archive of EPrints record and document files
External-Identifier: http://resolver.caltech.edu/CaltechAUTHORS:SHIjfm98
Internal-Sender-Identifier: https://authors.library.caltech.edu/id/eprint/355
Payload-Oxum: 4646541.2
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Archive comments are a feature of the ZIP file format and not available with tar.
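
For readers who would rather inspect the comment from Python than with zipinfo, the standard zipfile module exposes it as the comment attribute of a ZipFile object. The sketch below builds a small uncompressed (ZIP_STORED) archive in memory with a comment and reads it back; the file name and comment text are placeholders, not output from eprints2bags.

```python
import io
import zipfile

buffer = io.BytesIO()

# Write an uncompressed archive and attach a text comment to it.
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("355/bagit.txt", "BagIt-Version: 1.0\n")
    archive.comment = b"About this archive file:\nplaceholder description"

# Read the comment back, as a ZIP utility would.
with zipfile.ZipFile(buffer) as archive:
    comment = archive.comment.decode("utf-8")
    stored = all(i.compress_type == zipfile.ZIP_STORED for i in archive.infolist())

print(comment)
print("uncompressed:", stored)
```

To read the comment of an archive on disk, pass its path to zipfile.ZipFile instead of the in-memory buffer.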

Finally, the overall collection of EPrints records (whether the records are bagged and archived, or just bagged, or left as-is) can optionally be itself put into a bag and/or put in a ZIP archive. This behavior can be changed with the option -e (/e on Windows). Like -b, this option takes the possible values none, bag, and bag-and-archive. The default is none. If the value bag is used, a top-level bag containing the individual EPrints bags is created out of the output directory (the location given by the -o option); if the value bag-and-archive is used, the bag is also put into a single-file archive. (In other words, the result will be a ZIP archive of a bag whose data directory contains other ZIP archives of bags.) For safety, eprints2bags will refuse to do bag or bag-and-archive unless a separate output directory is given via the -o option; otherwise, this would restructure the current directory where eprints2bags is running – with potentially unexpected or even catastrophic results. (Imagine if the current directory were the user's home directory!)

Generating checksum values can be a time-consuming operation for large bags. By default, during the bagging step, eprints2bags will use a number of processes equal to one-half of the available CPUs on the computer. The number of processes can be changed using the option -c (or /c on Windows).
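
The default described above amounts to a one-line calculation. The sketch below uses os.cpu_count() from the standard library and a hypothetical function name; how eprints2bags actually counts CPUs is not stated here, so treat this as an approximation of the rule.

```python
import os


def default_process_count():
    """Return half the available CPUs, but never fewer than one process."""
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, cpus // 2)


print(default_process_count())
```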

The use of separate options for the different stages provides some flexibility in choosing the final output. For example,

eprints2bags --bag-action none --end-action bag-and-archive

will create a ZIP archive containing a single bag directory whose data/ subdirectory contains the set of (unbagged) EPrints records retrieved by eprints2bags from the server.

Server credentials

Downloading documents usually requires supplying a user login and password to the EPrints server. By default, this program uses the operating system's keyring/keychain functionality to get a user name and password. If the information does not exist from a previous run of eprints2bags, it will query the user interactively for the user name and password, and unless the -K argument (/K on Windows) is given, store them in the user's keyring/keychain so that it does not have to ask again in the future. It is also possible to supply the information directly on the command line using the -u and -p options (or /u and /p on Windows), but this is discouraged because it is insecure on multiuser computer systems.

If a given EPrints server does not require a user name and password, do not use -u or -p and leave the name and password blank when prompted for them by eprints2bags. Empty user name and password are allowed values.

To reset the user name and password (e.g., if a mistake was made the last time and the wrong credentials were stored in the keyring/keychain system), add the -R (or /R on Windows) command-line argument to a command. When eprints2bags is run with this option, it will query for the user name and password again even if an entry already exists in the keyring or keychain.

Other options

eprints2bags produces color-coded diagnostic output as it runs, by default. However, some terminals or terminal configurations may make it hard to read the text with colors, so eprints2bags offers the -C option (/C on Windows) to turn off colored output.

If given the -@ argument (/@ on Windows), this program will output a detailed trace of what it is doing, and will also drop into a debugger upon the occurrence of any errors. The debug trace will be written to the given destination, which can be a dash character (-) to indicate console output, or a file path.

If given the -V option (/V on Windows), this program will print the version and other information, and exit without doing anything else.

Basic usage examples

Running eprints2bags then consists of invoking the program like any other program on your system. The following is a simple example showing how to get a single record (#85447) from Caltech's CODA EPrints server (with user name and password blanked out here for security reasons):

# eprints2bags -o /tmp/eprints -i 85447 -a https://authors.library.caltech.edu/rest -u XXXXX -p XXXXX

Beginning to process 1 EPrints entry.
Output will be written under directory "/tmp/eprints"
======================================================================
Getting record with id 85447
Creating /tmp/eprints/85447
Downloading https://authors.library.caltech.edu/85447/1/1-s2.0-S0164121218300517-main.pdf
Making bag out of /tmp/eprints/85447
Creating tarball /tmp/eprints/85447.tgz
======================================================================
Done. Wrote 1 EPrints record to /tmp/eprints/.

The following screencast gives a sense of what it is like to run eprints2bags. Click on the following image:

Screencast of simple eprints2bags demo

Summary of command-line options

The following table summarizes all the command line options available. (Note: on Windows computers, / must be used as the prefix character instead of -):

| Short | Long form opt | Meaning | Default | Notes |
|-------|---------------|---------|---------|-------|
| -a A | --api-url A | Use A as the server's REST API URL | | ⚑ |
| -b B | --bag-action B | Do B with each record directory | Bag and archive | ✦ |
| -c C | --processes C | No. of processes during bag creation | ½ the number of CPUs | |
| -e E | --end-action E | Do E with the entire set of records | Nothing | ✦ |
| -h | --help | Print help info and exit | | |
| -i I | --id-list I | Records to get (can be a file name) | Fetch all records from the server | |
| -k | --keep-going | Don't count missing records as an error | Stop if a missing record is encountered | |
| -l L | --lastmod L | Filter by last-modified date/time | Don't filter by date/time | |
| -n N | --name-base N | Prefix directory names with N | Use record number only | |
| -o O | --output-dir O | Write outputs in the directory O | Write in the current directory | |
| -q | --quiet | Don't print info messages while working | Be chatty while working | |
| -s S | --status S | Filter by status(es) in S | Don't filter by status | |
| -u U | --user U | User name for EPrints server login | | |
| -p P | --password P | Password for EPrints server login | | |
| -t T | --arch-type T | Use archive type T | Uncompressed ZIP | ♢ |
| -C | --no-color | Don't color-code the output | Use colors in the terminal output | |
| -K | --no-keyring | Don't use a keyring/keychain | Store login info in keyring | |
| -R | --reset | Reset user login & password used | Reuse previous credentials | |
| -V | --version | Print program version info and exit | Do other actions instead | |
| -@ OUT | --debug OUT | Debugging mode; write trace to OUT | Normal mode | ⚐ |

⚑   Required argument.
✦   Possible values: none, bag, bag-and-archive.
♢   Possible values: uncompressed-zip, compressed-zip, uncompressed-tar, compressed-tar.
⚐   To write to the console, use the character - as the value of OUT; otherwise, OUT must be the name of a file where the output should be written.

Additional notes and considerations

Beware that some file systems have limitations on the number of subdirectories that can be created, which directly impacts how many record subdirectories can be created by this program. eprints2bags attempts to guess the type of file system where the output is being written and warn the user if the number of records exceeds known maximums (e.g., 31,998 subdirectories for the ext2 and ext3 file systems in Linux), but its internal table does not include all possible file systems and it may not be able to warn users in all cases. If you encounter file system limitations on the number of subdirectories that can be created, a simple solution is to manually create an intermediate level of subdirectories under the destination given to -o, then run eprints2bags multiple times, each time indicating a different subrange of records to the -i option and a different subdirectory to -o, such that the number of records written to each destination is below the file system's limit on total number of directories.
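
The workaround above can be planned with a short script. The following sketch splits a span of record numbers into subranges and prints one eprints2bags command per batch, each with its own -i range and its own subdirectory under the destination. The batch-NNN directory names, the example numbers, and the output path are hypothetical, and the -a value is left elided as in the examples earlier in this document.

```python
def plan_batches(first, last, limit):
    """Yield (subdirectory, id range) pairs holding at most `limit` records."""
    batch = 1
    start = first
    while start <= last:
        end = min(start + limit - 1, last)
        yield f"batch-{batch:03d}", f"{start}-{end}"
        start = end + 1
        batch += 1


# Example: 70,000 record numbers, kept under a 31,998-subdirectory limit.
for subdir, id_range in plan_batches(1, 70000, 30000):
    print(f"eprints2bags -k -o /path/to/output/{subdir} -i {id_range} -a ...")
```

Adding -k, as shown, lets each run continue past gaps in the numbering within a subrange.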

For maximum performance, the debug logging code that implements option -@ can be skipped completely at run-time by running Python with optimization turned on. One way to do this is to run eprints2bags using an invocation such as the following:

python -O -m eprints2bags ...other arguments...
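
The general Python mechanism behind this is the built-in constant __debug__, which is True normally and False when the interpreter is started with -O; code guarded by it is dropped at compile time. The sketch below shows the pattern with a hypothetical log function; it illustrates the language feature and is not copied from eprints2bags.

```python
def log(message):
    """Print a trace message unless Python was started with -O."""
    # With "python -O", __debug__ is False and this block is compiled away.
    if __debug__:
        print("debug:", message)


log("fetching record 85447")
print("optimization is", "off" if __debug__ else "on")
```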

⁇ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

★ Do you like it?

If you like this software, don't forget to give this repo a star on GitHub to show your support!

♬ Contributing — info for developers

We would be happy to receive your help and participation with enhancing eprints2bags! Please visit the guidelines for contributing for some tips on getting started.

❡ History

In 2018, Betsy Coles wrote a set of Perl scripts and described a workflow for bagging contents from Caltech's EPrints-based Caltech Collection of Open Digital Archives (CODA) server. The original code is still available in this repository in the historical subdirectory. In late 2018, Mike Hucka sought to expand the functionality of the original tools and generalize them in anticipation of having to stop using DPN because on 2018-12-04, DPN announced they were shutting down. Thus was born Eprints2bags.

☺︎ Acknowledgments

The vector artwork of a bag used as a logo for this repository was created by StoneHub from the Noun Project. It is licensed under the Creative Commons CC-BY 3.0 license.

We thank the following people for suggestions and ideas that led to improvements in eprints2bags: Robert Doiel, Tom Morrell, Tommy Keswick.

eprints2bags makes use of numerous open-source packages, without which it would have been effectively impossible to develop eprints2bags with the resources we had. We want to acknowledge this debt. In alphabetical order, the packages are:

  • bagit – Python library for working with BagIt style packages
  • colorama – makes ANSI escape character sequences work under MS Windows terminals
  • dateparser – parse dates in almost any string format
  • humanize – helps write large numbers in a more human-readable form
  • ipdb – the IPython debugger
  • keyring – access the system keyring service from Python
  • lxml – an XML parsing library for Python
  • plac – a command line argument parser
  • psutil – process and system utilities
  • requests – an HTTP library for Python
  • setuptools – library for setup.py
  • termcolor – ANSI color formatting for output in terminal
  • twine – Twine is a utility for publishing Python packages on PyPI
  • urllib3 – HTTP client library for Python
  • validators – data validation package for Python

☮︎ Copyright and license

Copyright (C) 2019–2023, Caltech. This software is freely distributed under a BSD/MIT type license. Please see the LICENSE file for more information.