This repository contains the code that is necessary to run all the steps of our pipeline, along with base meta-data files with URLs to meta-data and annotations.
We provide two ways to run our pipeline:
- Makefile/Docker Interface: A Makefile is provided that defines all pipeline steps and their dependencies. Under the hood, we use a Docker container to execute commands.
- Python CLI: This requires the installation of the provided package on the local machine. Also, commands must be executed manually. This is more involved but allows for more flexibility.
The project is organized as follows, starting from the root-level directory:
assets
: Contains base meta-data and annotation file for the CMMD dataset. These are necessary in the early stage of our pipeline to download raw-data and assign semantic labels to images.

breastclf
: Root directory of our Python project. Contains routines to build dataset and produce our ML models and results from scratch.

report
: Contains source files to generate our report, including images and vector graphics.

best-results
: Raw predictions obtained with our best model, meta-data with train/val/test splits, and aggregate performance metrics.
We make available a Docker image on DockerHub, along with a Makefile, to facilitate reproducibility and abstract away the granular pipeline steps described in the next section.
Prior to running targets, you may customize the following environment variables.
BREASTCLF_RUN_DIR
: Path where checkpoints and results are stored.

BREASTCLF_DATA_DIR
: Path where raw-data, preprocessed images, and meta-data are stored.

BREASTCLF_USE_CUDA
: Flag to use NVIDIA GPU acceleration (recommended). You must first install the NVIDIA Container Toolkit to use this feature.
Read the makefile target definitions with:
make help
Run the full pipeline to produce our best predictor:
make best-model
Run the full pipeline to generate all tested models:
make all
- Install Poetry in user-space.
- Move to the project’s root and install the project with:
poetry install
This will look for an existing, activated virtual environment and create one in the current directory if none is found.
We make available a series of CLI endpoints to execute the different phases of our pipeline. The root endpoint is documented at:
python breastclf/main.py --help
We provide a base meta-data file that contains references to the original CMMD dataset at assets/meta.csv.
Remove rows from this file prior to running the next commands to discard patients.
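Discarding patients can be scripted rather than done by hand. The sketch below is a minimal illustration that assumes the meta-data file has a patient-identifier column; the column name `patient_id`, the IDs, and the in-memory sample are all hypothetical, so check the header of the actual file before adapting it.

```python
import csv
import io


def drop_patients(rows, excluded_ids, id_column="patient_id"):
    """Return the rows whose patient identifier is not in excluded_ids.

    `id_column` is a hypothetical column name; adapt it to the real header.
    """
    excluded = set(excluded_ids)
    return [row for row in rows if row[id_column] not in excluded]


# Tiny in-memory stand-in for assets/meta.csv (made-up content).
sample = io.StringIO("patient_id,url\nP1,u1\nP2,u2\nP3,u3\n")
reader = csv.DictReader(sample)
kept = drop_patients(list(reader), excluded_ids={"P2"})

# Write the filtered rows back out with the original header.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
writer.writerows(kept)
print(out.getvalue())
```

To work on the real file, replace the `io.StringIO` objects with files opened via `open(..., newline="")`.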
First, we merge our base meta-data file with the annotations:
python breastclf/main.py cmmd merge-meta-and-annotations assets/meta.csv assets/annotations.csv <data-dir>/meta-annotated.csv
The raw DICOM files are publicly available on cancerimagingarchive.net.
We provide a routine to download these. We recommend setting <n-threads> above 16 to speed up the download.
python breastclf/main.py cmmd fetch-raw-data -w <n-threads> <data-dir>/meta-annotated.csv <data-dir>/dicom
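Downloads are network-bound, which is why a larger worker count tends to help. The general idea can be sketched with a thread pool, as below. This is an illustration only, not the routine implemented in breastclf: `fetch_one` is a hypothetical placeholder that does not touch the network, and the URLs are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_one(url):
    """Hypothetical placeholder for a single download; returns a fake file name."""
    return url.rsplit("/", 1)[-1] + ".dcm"


def fetch_all(urls, n_threads=32):
    """Run fetch_one over urls with n_threads workers, preserving input order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(fetch_one, urls))


urls = [f"https://example.org/series/{i}" for i in range(4)]  # made-up URLs
files = fetch_all(urls, n_threads=2)
print(files)
```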
Next, we parse each DICOM file to obtain some relevant meta-data:
python breastclf/main.py cmmd build-per-image-meta <data-dir>/meta-annotated.csv <data-dir>/meta-images.csv
We then convert each image to an 8-bit PNG:
python breastclf/main.py cmmd dicom-to-png <data-dir>/meta-images.csv <data-dir>/dicom <data-dir>/png
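DICOM pixel data is often stored with more than 8 bits per pixel, so a conversion to 8-bit has to rescale intensities. One common approach is min-max scaling, sketched below in pure Python to stay dependency-free. This is an assumption made for illustration; the `dicom-to-png` command may use a different mapping (windowing, for instance), and the sample intensities are made up.

```python
def to_uint8(pixels):
    """Linearly rescale integer intensities to the 0..255 range (min-max scaling)."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # Constant image: avoid division by zero and map everything to 0.
        return [0] * len(pixels)
    scale = 255 / (hi - lo)
    return [round((p - lo) * scale) for p in pixels]


raw = [0, 1024, 2048, 4095]  # made-up 12-bit intensities
print(to_uint8(raw))
```

In practice the same arithmetic would be applied to a whole pixel array at once, for example with NumPy.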
Last, we construct training, validation, and testing splits with:
python breastclf/main.py ml split <data-dir>/meta-images.csv <data-dir>/meta-images-split.csv 0.2 0.2
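The two trailing numbers look like split fractions. The sketch below assumes that `0.2 0.2` means 20% validation and 20% test, which is not spelled out here, and that the split is done per patient so that no patient ends up in two splits. The `patient_id` key and the sample rows are hypothetical, and the actual `ml split` command may proceed differently.

```python
import random


def split_by_patient(rows, val_frac=0.2, test_frac=0.2, seed=0, key="patient_id"):
    """Assign each row a 'train', 'val', or 'test' label, grouped by patient."""
    patients = sorted({row[key] for row in rows})
    random.Random(seed).shuffle(patients)  # seeded for reproducibility
    n_val = round(len(patients) * val_frac)
    n_test = round(len(patients) * test_frac)
    val = set(patients[:n_val])
    test = set(patients[n_val:n_val + n_test])
    labels = []
    for row in rows:
        if row[key] in val:
            labels.append("val")
        elif row[key] in test:
            labels.append("test")
        else:
            labels.append("train")
    return labels


# Ten made-up patients with two images each.
rows = [{"patient_id": f"P{i}", "image": f"img{i}_{j}.png"}
        for i in range(10) for j in range(2)]
labels = split_by_patient(rows)
print({name: labels.count(name) for name in ("train", "val", "test")})
```

Grouping by patient keeps both images of a patient in the same split, which avoids leaking near-duplicate views between training and evaluation.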
All our models can be trained using commands of the following form:
python breastclf/main.py ml train --cuda --fusion <fusion-mode> --lfabnorm <lfa> --lftype <lft> <data-dir>/meta-images-split.csv <data-dir>/png <run-dir> <experiment-name>
Where:
- <fusion-mode> sets the fusion strategy.
- <lfa> is the loss factor applied to the multi-label abnormality classification objective.
- <lft> is the loss factor applied to the tumor type classification objective.
- <run-dir> is the root path where checkpoints, logs, and validation data will be stored.
- <experiment-name> sets the name of the directory created in <run-dir>.
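To make the role of the two loss factors concrete, one plausible reading is that the training objective is a weighted sum of the two classification losses. The sketch below assumes exactly that form; the function and argument names are illustrative and not taken from breastclf.

```python
def combined_loss(loss_abnorm, loss_type, lfa=1.0, lft=1.0):
    """Weighted sum of the abnormality and tumor-type losses (assumed form)."""
    return lfa * loss_abnorm + lft * loss_type


# Under this assumption, a factor of 0 switches the corresponding objective off.
print(combined_loss(0.5, 2.0, lfa=1.0, lft=0.0))
print(combined_loss(0.5, 2.0, lfa=1.0, lft=0.5))
```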