This code allows for:
- generating artificial datasets for reference matching
- applying reference matching approaches
- evaluating reference matching approaches
Before running the scripts, add the code directory path to the environment variable PYTHONPATH.
Most scripts call the Crossref REST API under the hood. To use the polite pool or the Plus Service when calling the API, simply copy the file .crapi_key_example to ~/.crapi_key and fill in the settings in the file.
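As a sketch of what the credentials are used for: the public Crossref REST API routes requests to the polite pool when a mailto address is supplied, and authenticates Plus subscribers via the Crossref-Plus-API-Token header. The helper function below is hypothetical and only illustrates those conventions; it is not part of this repository.

```python
# Sketch: how polite-pool / Plus credentials typically reach the Crossref
# REST API. Header and parameter names follow the public Crossref API
# conventions; the function itself is a hypothetical illustration.

def build_crossref_request(mailto=None, plus_token=None):
    """Return (headers, params) to attach to a Crossref REST API call."""
    headers, params = {}, {}
    if mailto:
        # A mailto address routes the request to the "polite" pool.
        params["mailto"] = mailto
    if plus_token:
        # Plus Service subscribers authenticate with a bearer token.
        headers["Crossref-Plus-API-Token"] = "Bearer " + plus_token
    return headers, params

headers, params = build_crossref_request(mailto="jane@example.org")
```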
A dataset is generated in several steps:
Drawing the sample of documents from the system.
dataset/draw_sample.py [-h] [-v] -s SIZE [-f FILTER] [-q QUERY] -o OUTPUT
Arguments:
- -h print help
- -v verbose output
- -s sample size
- -f filter to apply before drawing the sample
- -q query to apply before drawing the sample
- -o output file path
Extending the sample by adding similar documents.
dataset/extend_sample.py [-h] [-v] -s SAMPLE -e EXTEND -o OUTPUT
Arguments:
- -h print help
- -v verbose output
- -s sample file path
- -e number of similar documents to add for each sample document
- -o output file path
Generating reference strings and the final dataset.
dataset/generate_dataset.py [-h] [-v] [-l STYLES] [-a ATTRIBUTES] [-n] -s SAMPLE -o OUTPUT
Arguments:
- -h print help
- -v verbose output
- -l comma-separated list of citation styles
- -a comma-separated list of document attributes to keep (useful for splitting the evaluation results by the values of those attributes)
- -n set the target DOIs to null
- -s sample file path
- -o output file path
Match the references from the dataset to target DOIs.
matching/match_references.py [-h] [-v] -d DATASET -o OUTPUT
Arguments:
- -h print help
- -v verbose output
- -d dataset file path
- -o output file path
The matching algorithm is set by specifying the matcher object in matching/match_config.py. The matcher object should implement two methods:
- description(): returns a string describing the matcher
- match(reference_string): returns a tuple (matched DOI, matching score)
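A minimal matcher satisfying this interface might look like the following. The lookup table here is a toy stand-in for a real search against the Crossref API, and the class name and example DOI are invented for illustration:

```python
# Minimal sketch of a matcher object implementing the two-method
# interface described above. The lookup table is a stand-in for a real
# search against the Crossref REST API.

class ExactTitleMatcher:
    """Toy matcher: matches a reference string against known titles."""

    def __init__(self, known):
        # known: dict mapping a title substring to its DOI
        self.known = known

    def description(self):
        return "exact title matcher (toy example)"

    def match(self, reference_string):
        for title, doi in self.known.items():
            if title.lower() in reference_string.lower():
                # Score 1.0 for an exact substring hit
                return doi, 1.0
        # No match: null DOI, zero score
        return None, 0.0
```

In matching/match_config.py, the matcher object would then be set to an instance of such a class.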
Evaluate the reference matching. The results are printed to stdout.
evaluation/evaluate.py [-h] [-v] -d DATASET [-s SPLITATTR] [-o OUTPUTDOC] [-p OUTPUTSPLIT]
Arguments:
- -h print help
- -v verbose output
- -d dataset file path (matched)
- -s attribute along which the results should be split
- -o output file path for document-level results
- -p output file path for split attribute results
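At its core, such an evaluation compares the matched DOIs against the dataset's target DOIs. The sketch below uses standard link-level precision/recall/F1 definitions; the (target DOI, matched DOI) pair format is an assumption for illustration, not the script's actual record layout:

```python
def link_metrics(records):
    """Compute link-level (precision, recall, F1).

    records: iterable of (target_doi, matched_doi) pairs, where either
    DOI may be None (no true link / no match returned). The pair format
    is assumed for illustration.
    """
    true_pos = sum(1 for t, m in records if m is not None and m == t)
    returned = sum(1 for t, m in records if m is not None)  # links output
    existing = sum(1 for t, m in records if t is not None)  # true links
    precision = true_pos / returned if returned else 0.0
    recall = true_pos / existing if existing else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```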
Merge two or more sample or reference datasets into a single file.
dataset/merge_datasets.py [-h] [-v] -d DATASETS -o OUTPUT
Arguments:
- -h print help
- -v verbose output
- -d comma-separated list of dataset file paths
- -o output file path
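Assuming each dataset file holds a JSON list of records (an assumption for illustration; the script's actual on-disk format may differ), merging amounts to concatenating the lists:

```python
import json

def merge_dataset_files(paths, output_path):
    """Concatenate JSON-list dataset files into one output file.

    The JSON-list layout is an assumption for illustration, not
    necessarily this repository's exact format. Returns the number of
    merged records.
    """
    merged = []
    for path in paths:
        with open(path) as f:
            merged.extend(json.load(f))
    with open(output_path, "w") as f:
        json.dump(merged, f)
    return len(merged)
```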
Export the information about a sample of existing and missing links in the current system.
dataset/export_existing_and_new_links.py [-h] [-v] -s SAMPLE [-n REFERENCES] [-p] -o OUTPUT
Arguments:
- -h print help
- -v verbose output
- -s sample file path
- -n max number of output references
- -p export parsed references instead of strings
- -o output file path