Juan A. Rodríguez, David Vázquez, Issam Laradji, Marco Pedersoli, Pau Rodríguez
Computer Vision Center, Autonomous University of Barcelona
ServiceNow Research, Montréal, Canada
ÉTS Montreal, University of Québec
This project is devoted to constructing a dataset of figure–caption pairs extracted from research papers, which we call Paper2Fig. The current version of the dataset is Paper2Fig100k, presented in this paper. The following are some samples from the dataset.
Here we present the pipeline designed to construct the Paper2Fig dataset from public research papers, mainly from Computer Science fields such as Machine Learning and Computer Vision. The papers are downloaded from arXiv.org using the arXiv Kaggle dataset and the scipdf repository, a pdf parser based on GROBID.
Check out our work OCR-VQGAN, which presents an image encoder specialized in the image domains of figures and diagrams, trained using the Paper2Fig100k dataset.
- Filter papers from arxiv-metadata.json
First, download the latest version of the metadata file available at https://www.kaggle.com/datasets/Cornell-University/arxiv. Then run the dataset_pipeline/filter_papers.py script using the following command:
python dataset_pipeline/filter_papers.py -s <List of arxiv subjects> -y <year> -m <month> -p <path to arxiv metadata>
The available subjects can be found at https://arxiv.org/category_taxonomy. For instance, we can filter papers corresponding to the subjects of Computer Vision (cs.CV) and Machine Learning (cs.LG), published after 01/2010:
python dataset_pipeline/filter_papers.py -s cs.CV cs.LG -y 10 -m 01 -p path/to/arxiv-metadata-oai-snapshot.json
This process will generate a .txt file with the list of extracted papers (arXiv ids). Optionally, you can generate an xlsx file by passing --export_xlsx
to explore the papers' metadata.
- Download the filtered papers
Files can be downloaded using gsutil, as explained in arXiv Dataset - Bulk access. We use the Google Cloud Storage API to programmatically download the filtered papers. Follow the steps in the link to install google-cloud-storage, set up API authentication, and create a new project.
We can use the dataset_pipeline/download_papers.py
script as follows:
python dataset_pipeline/download_papers.py --paper_ids <path to txt file> --project <Google Cloud project> --out_path <path store pdfs>
where we pass the generated .txt file with paper ids using --paper_ids
, the Google Cloud project name with --project
, and the output path for pdf storage with --out_path
.
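Internally, each arXiv id must be mapped to a blob in the public arXiv bucket. The layout assumed below (`gs://arxiv-dataset` with prefix `arxiv/arxiv/pdf/<yymm>/<id><version>.pdf`) is our reading of the bulk-access docs, and the version suffix is resolved by listing the bucket; treat this as a sketch, not the exact logic of download_papers.py:

```python
def arxiv_blob_prefix(paper_id: str) -> str:
    """Map a new-style arXiv id (e.g. "2103.00020") to its assumed blob
    prefix in the public arxiv-dataset bucket. Version suffixes such as
    "v1" are appended to the filename and found by listing the bucket."""
    yymm = paper_id.split(".")[0]
    return f"arxiv/arxiv/pdf/{yymm}/{paper_id}"
```

With google-cloud-storage, the download would then look roughly like `client.list_blobs("arxiv-dataset", prefix=arxiv_blob_prefix(pid))` followed by `blob.download_to_filename(...)` on the latest version.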
In this process, we need to parse and organize the texts and images contained in each pdf. We use GROBID and scipdf to parse images and texts.
- Install and run GROBID service via Docker
GROBID is an open-source project for parsing pdf files, based on CRF and deep learning models. The easiest way to use GROBID is via Docker, using the latest version available on Docker Hub.
You can run the GROBID service locally using Docker desktop and:
docker pull lfoppiano/grobid:0.7.1
docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.7.1
See the official GROBID documentation to explore the different installation alternatives.
- Parse pdfs
Once the GROBID service is running, it can be accessed through its API using the scipdf library, after installing it in your environment.
Then, use the dataset_pipeline/parse_pdf.py
script as follows:
python dataset_pipeline/parse_pdf.py -p [dataset dir]
where -p
defines the directory where your dataset is located. Note that the directory you specify must contain a subdirectory named pdf
, where the downloaded pdfs are located.
At the end, you should have the following structure:
├── dataset
│ ├── pdf
│ └── parsed
└── ...
In this step we apply heuristic rules to filter out results figures (plots, charts, etc.) and to obtain the text captions of the remaining figures. This step uses multiprocessing and can be executed with:
python dataset_pipeline/apply_heuristics.py -p <dataset dir>
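The exact rules live in apply_heuristics.py; as an illustration only (the keyword list below is hypothetical, not the one used in the script), a caption-based filter might look like:

```python
# Hypothetical keyword list suggesting a results plot rather than a diagram.
RESULT_KEYWORDS = ("plot", "curve", "accuracy", "loss", "chart", "histogram")

def looks_like_results_figure(caption: str) -> bool:
    """Heuristic: flag captions that suggest a results figure, so it can
    be excluded from the dataset. Keyword list is illustrative only."""
    caption = caption.lower()
    return any(k in caption for k in RESULT_KEYWORDS)
```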
In this step we process the final set of figures with the EasyOCR text recognizer:
pip install easyocr
python dataset_pipeline/apply_ocr.py -p <dataset dir>
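EasyOCR's `readtext` returns `(bbox, text, confidence)` tuples per detection. Converting them into the per-figure JSON format shown further below is straightforward; this is a sketch, and the actual apply_ocr.py may differ in details such as rounding:

```python
def to_ocr_result(detections):
    """Convert EasyOCR-style (bbox, text, confidence) tuples into the
    dict format stored per figure in Paper2Fig (bbox kept as a string)."""
    return [
        {"text": text, "bbox": str(bbox), "confidence": round(float(conf), 2)}
        for bbox, text, conf in detections
    ]
```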
Here we assign class tags to the figures:
python dataset_pipeline/assign_class_tags.py -p <dataset dir>
Now it's time to put it all together and generate the final dataset as a JSON structure. The split of the dataset into train and test sets is also performed with:
python dataset_pipeline/construct_dataset.py -p <dataset dir>
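The train/test split itself can be as simple as a seeded shuffle. This is an illustrative sketch; the split ratio and seed actually used by construct_dataset.py are assumptions:

```python
import random

def train_test_split(figure_ids, test_fraction=0.1, seed=42):
    """Deterministically split figure ids into train and test lists.
    Sorting before shuffling makes the split reproducible across runs."""
    ids = sorted(figure_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]
```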
In Paper2Fig, each figure has an associated JSON object that contains the following information:
{
"figure_id": "...",
"captions": ["...", "..."],
"captions_norm": ["...", "..."],
"ocr_result": [{
"text": "...",
"bbox": "[[71, 18], [134, 18], [134, 44], [71, 44]]",
"confidence": 0.99
}],
"aspect": 4.7962466487935655
}
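Consuming these records then reduces to standard json handling. For example, to keep only wide figures with confident OCR (the field names match the sample above; the thresholds are illustrative):

```python
def select_figures(records, min_aspect=1.0, min_conf=0.9):
    """Return figure_ids whose aspect ratio is at least `min_aspect`
    and whose OCR detections all meet `min_conf`."""
    kept = []
    for rec in records:
        if rec["aspect"] < min_aspect:
            continue
        if any(d["confidence"] < min_conf for d in rec["ocr_result"]):
            continue
        kept.append(rec["figure_id"])
    return kept
```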