lejeunel/breast-cancer-cmmd

Breast Cancer Screening from Mammograms

Introduction

This repository contains the code needed to run every step of our pipeline, along with base meta-data files that hold URLs to meta-data and annotations.

We provide two ways to run our pipeline:

  1. Makefile/Docker Interface: A Makefile defines all pipeline steps and their dependencies. Under the hood, commands are executed in a Docker container.
  2. Python CLI: This requires installing the provided package on the local machine and executing each command manually. It is more involved but allows more flexibility.

Project Structure

The project is organized as follows, starting from the root-level directory:

  • assets: Contains the base meta-data and annotation files for the CMMD dataset. These are needed in the early stages of our pipeline to download the raw data and assign semantic labels to images.
  • breastclf: Root directory of our Python project. Contains routines to build the dataset and to produce our ML models and results from scratch.
  • report: Contains source files to generate our report, including images and vector graphics.
  • best-results: Raw predictions obtained with our best model, meta-data with train/val/test splits, and aggregate performance metrics.

Usage

Makefile/Docker Interface

We provide a Docker image on DockerHub, along with a Makefile, to facilitate reproducibility and abstract away the granular pipeline steps described in the Pipeline Steps section.

Prior to running targets, you may customize the following environment variables.

  • BREASTCLF_RUN_DIR: Path where checkpoints and results are stored.
  • BREASTCLF_DATA_DIR: Path where raw-data, preprocessed images, and meta-data are stored.
  • BREASTCLF_USE_CUDA: Flag to use NVIDIA GPU acceleration (recommended). You must first install the NVIDIA Container Toolkit to use this feature.

List the Makefile targets and their descriptions with:

make help

Run the full pipeline to produce our best predictor:

make best-model

Run the full pipeline to generate all tested models:

make all

Python CLI

  • Install Poetry in user-space.
  • Move to the project’s root and install the project with:

poetry install

This uses the currently activated virtual environment if there is one; otherwise, it creates one in the current directory.

We provide a series of CLI endpoints to execute the different phases of our pipeline. The root endpoint is documented with:

python breastclf/main.py --help

Pipeline Steps

Selecting Series and Merging Annotations

We provide a base meta-data file, assets/meta.csv, that contains references to the original CMMD dataset. To discard patients, remove their rows from this file before running the commands below.

First, we merge the base meta-data file with the annotations:

python breastclf/main.py cmmd merge-meta-and-annotations assets/meta.csv assets/annotations.csv <data-dir>/meta-annotated.csv
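To illustrate what such a merge amounts to, here is a minimal sketch of a key-based join between two CSV tables using only the standard library. It is not the repository's implementation: the column names (`patient_id`, `series_url`, `classification`) and the values are hypothetical, and the real files may use a different key and layout.

```python
import csv
import io


def merge_meta_and_annotations(meta_rows, annotation_rows, key="patient_id"):
    """Left-join annotation columns onto meta rows using a shared key column."""
    annotations_by_key = {row[key]: row for row in annotation_rows}
    merged = []
    for row in meta_rows:
        extra = annotations_by_key.get(row[key], {})
        # Copy every annotation column except the key, which is already present.
        merged.append({**row, **{k: v for k, v in extra.items() if k != key}})
    return merged


# Tiny in-memory CSVs with hypothetical columns and values.
meta_csv = "patient_id,series_url\nP1,url-a\nP2,url-b\n"
annotations_csv = "patient_id,classification\nP1,Benign\nP2,Malignant\n"

meta_rows = list(csv.DictReader(io.StringIO(meta_csv)))
annotation_rows = list(csv.DictReader(io.StringIO(annotations_csv)))
merged_rows = merge_meta_and_annotations(meta_rows, annotation_rows)
```

Because the join in this sketch is driven by the meta rows, a patient whose row is removed from the meta table does not appear in the merged output.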

Downloading Raw-data

The raw DICOM files are publicly available on cancerimagingarchive.net. We provide a routine to download them. We recommend setting n-threads to more than 16 to reduce download time.

python breastclf/main.py cmmd fetch-raw-data -w <n-threads> <data-dir>/meta-annotated.csv <data-dir>/dicom
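Downloads are dominated by network waiting, which is why a larger thread pool tends to help. The sketch below shows the general pattern with `concurrent.futures.ThreadPoolExecutor`; it is only an illustration, and `fetch_series` is a hypothetical stand-in that performs no network access, not the routine shipped in this repository.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_series(url):
    """Hypothetical stand-in for one download; a real version would do network I/O."""
    return f"downloaded:{url}"


def fetch_all(urls, n_threads=16):
    """Run fetch_series over all URLs with a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map returns results in the same order as the input URLs.
        return list(pool.map(fetch_series, urls))


results = fetch_all([f"series-{i}" for i in range(4)], n_threads=2)
```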

Parsing DICOM Series

Next, we parse each DICOM file to extract relevant meta-data:

python breastclf/main.py cmmd build-per-image-meta <data-dir>/meta-annotated.csv <data-dir>/meta-images.csv

Convert images

We convert each image to an 8-bit PNG:

python breastclf/main.py cmmd dicom-to-png <data-dir>/meta-images.csv <data-dir>/dicom <data-dir>/png
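Reducing an image to 8 bits means mapping its pixel values onto the range 0 to 255. The README does not say which mapping the converter uses, so the following is only a sketch of one common choice, min-max rescaling, written with plain lists to stay dependency-free. The function name `to_uint8` and the sample values are illustrative.

```python
def to_uint8(pixels):
    """Min-max rescale a 2D list of integer pixel values to the range 0..255."""
    flat = [v for row in pixels for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Constant image: map everything to 0 to avoid dividing by zero.
        return [[0 for _ in row] for row in pixels]
    scale = 255.0 / (hi - lo)
    return [[int(round((v - lo) * scale)) for v in row] for row in pixels]


# A tiny 2x2 "image" with 12-bit values, as often found in DICOM pixel data.
image_raw = [[0, 1024], [2048, 4095]]
image_8bit = to_uint8(image_raw)
```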

Making train/val/test Splits

Last, we construct training, validation, and testing splits with:

python breastclf/main.py ml split <data-dir>/meta-images.csv <data-dir>/meta-images-split.csv 0.2 0.2
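The sketch below shows one way such a split can be built. It rests on two assumptions that the README does not confirm: that the two trailing values (0.2 0.2) are the validation and test fractions, and that splitting is done per patient so that images of the same patient never land in different sets. The function and variable names are hypothetical.

```python
import random


def split_patients(patient_ids, val_frac=0.2, test_frac=0.2, seed=0):
    """Assign each unique patient to exactly one of train/val/test."""
    unique_ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_ids)
    n_val = int(round(len(unique_ids) * val_frac))
    n_test = int(round(len(unique_ids) * test_frac))
    assignment = {}
    for i, pid in enumerate(unique_ids):
        if i < n_val:
            assignment[pid] = "val"
        elif i < n_val + n_test:
            assignment[pid] = "test"
        else:
            assignment[pid] = "train"
    return assignment


# Ten hypothetical patients with two images each.
image_patient_ids = [f"P{i}" for i in range(10) for _ in range(2)]
assignment = split_patients(image_patient_ids)
image_splits = [assignment[pid] for pid in image_patient_ids]
```

Splitting on patient identifiers rather than on individual images avoids leaking information about a patient from the training set into the validation or test set.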

Training

All our models can be trained using commands of the following form:

python breastclf/main.py ml train --cuda --fusion <fusion-mode> --lfabnorm <lfa> --lftype <lft> <data-dir>/meta-images-split.csv <data-dir>/png <run-dir> <experiment-name>

Where:

  • <fusion-mode> sets the fusion strategy.
  • <lfa> (passed to --lfabnorm) is the loss factor applied to the multi-label abnormality classification objective.
  • <lft> (passed to --lftype) is the loss factor applied to the tumor type classification objective.
  • <run-dir> is the root path where checkpoints, logs, and validation data will be stored.
  • <experiment-name> sets the name of the directory created in <run-dir>.
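A loss factor is usually a scalar weight on one term of a multi-task objective. As a sketch only, and assuming the two auxiliary objectives are added to a main classification loss (the README does not spell out how the terms are combined), the total could look like this; the function name and example values are hypothetical.

```python
def total_loss(main_loss, abnorm_loss, type_loss, lf_abnorm, lf_type):
    """Weighted sum of a main loss and two auxiliary losses (assumed form)."""
    return main_loss + lf_abnorm * abnorm_loss + lf_type * type_loss


# With lf_type set to 0, the tumor type term does not contribute.
example = total_loss(
    main_loss=0.5, abnorm_loss=0.25, type_loss=0.125, lf_abnorm=2.0, lf_type=0.0
)
```

Under this assumed form, setting a factor to 0 switches the corresponding auxiliary objective off, and larger values give it more influence relative to the main loss.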
