VLSA: Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology
[Preprint] | [VLSA Walkthrough] | [Awesome Papers of Pathology VLMs] | [Zhihu (中文)] | [WSI Preprocessing] | [Acknowledgements] | [Citation]
Abstract: Histopathology Whole-Slide Images (WSIs) provide an important tool for assessing cancer prognosis in computational pathology (CPATH). While existing survival analysis (SA) approaches have made exciting progress, they generally rely on highly-expressive architectures and only coarse-grained patient-level labels to learn prognostic visual representations from gigapixel WSIs. Such a learning paradigm suffers from critical performance bottlenecks when facing the scarce training data and the standard multi-instance learning (MIL) framework present in CPATH. To overcome this, this paper proposes, for the first time, a new Vision-Language-based SA (VLSA) paradigm. Concretely, (1) VLSA is driven by pathology VL foundation models. It no longer relies on high-capability networks and shows the advantage of data efficiency. (2) At the vision end, VLSA encodes prognostic language priors and then employs them as auxiliary signals to guide the aggregation of prognostic visual features at the instance level, thereby compensating for the weak supervision in MIL. Moreover, given the characteristics of SA, we propose i) ordinal survival prompt learning to transform continuous survival labels into textual prompts, and ii) the ordinal incidence function as the prediction target to make SA compatible with VL-based prediction. Notably, VLSA's predictions can be interpreted intuitively with our Shapley values-based method. Extensive experiments on five datasets confirm the effectiveness of our scheme. VLSA could pave a new way for SA in CPATH by offering weakly-supervised MIL an effective means of learning valuable prognostic clues from gigapixel WSIs.
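For readers less familiar with discrete-time survival modeling, the sketch below summarizes what an incidence-function target and a CLIP-style readout look like. The notation here (v for the WSI embedding, u_k for the k-th survival prompt embedding, τ for a temperature, sim for cosine similarity) is ours for illustration and may differ from the exact formulation in the paper. With time bins t_1 < ... < t_K, the incidence function f_k = P(T = t_k) is the per-bin event probability; a CLIP-style readout scores each bin by prompt-image similarity, and the discrete survival function follows by accumulation:

```math
f_k = \frac{\exp\big(\operatorname{sim}(\mathbf{v}, \mathbf{u}_k)/\tau\big)}{\sum_{j=1}^{K} \exp\big(\operatorname{sim}(\mathbf{v}, \mathbf{u}_j)/\tau\big)},
\qquad
S_k = P(T > t_k) = 1 - \sum_{j \le k} f_j .
```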
📚 Recent updates:
- 24/10/07: added the Notebook - VLSA Walkthrough
- 24/09/24: code & paper are live
- 24/09/10: released VLSA
This repo is still being updated. Stay tuned.
Please refer to our Notebook - VLSA Walkthrough. It provides the details of
- individual incidence function prediction in VLSA models (a minimal conversion sketch is given after this list);
- and prediction interpretation using our Shapley values-based method.
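As a quick reference for what the notebook computes, here is a minimal PyTorch sketch of the standard conversion from per-bin incidence probabilities to a discrete survival curve. The function name and tensor shapes are ours for illustration and are not the repo's API; for the Shapley values-based interpretation, please see the notebook itself.

```python
import torch

def incidence_to_survival(incidence: torch.Tensor) -> torch.Tensor:
    """Convert per-bin incidence probabilities into a discrete survival curve.

    incidence: (..., K) probabilities that the event falls into each of K time
               bins (e.g., a softmax output that sums to 1 over the last dim).
    returns:   (..., K) survival probabilities S(t_k) = 1 - sum_{j<=k} f(t_j).
    """
    cdf = torch.cumsum(incidence, dim=-1)   # cumulative incidence up to bin k
    return (1.0 - cdf).clamp(min=0.0)       # numerical safety for the last bin

# Example: a patient whose predicted risk concentrates in later time bins.
f = torch.tensor([0.05, 0.10, 0.25, 0.60])
print(incidence_to_survival(f))             # tensor([0.9500, 0.8500, 0.6000, 0.0000])
```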
All experiments are run on a machine with
- two NVIDIA GeForce RTX 3090 GPUs
- Python 3.8 and `pytorch==1.11.0+cu113`
Detailed package requirements:
- For `pip` or `conda` users, the full requirements are provided in requirements.txt.
- For Docker users, you could pull our base image via `docker pull yuukilp/deepath:py38-torch1.11.0-cuda11.3-cudnn8-devel` and then install the additional essential Python packages (see requirements.txt) in the container.
Use the following command to load an experiment configuration and train the VLSA model (5-fold cross-validation):
python3 main.py --config config/IFMLE/tcga_blca/cfg_vlsa_conch.yaml --handler VLSA --multi_run
All important arguments are explained in config/IFMLE/tcga_blca/cfg_vlsa_conch.yaml.
For traditional SA models that use only visual features, use this command:
python3 main.py --config config/IFMLE/tcga_blca/cfg_sa_base_conch.yaml --handler SA --multi_run
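If you want to see what these commands do before running them, the sketch below illustrates how such a YAML-driven entry point is typically wired. All names here are hypothetical; the actual argument parsing and handler classes live in main.py and may differ.

```python
# Hypothetical sketch of a YAML-driven entry point; the real main.py may differ.
import argparse
import yaml

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="path to a YAML experiment config")
    parser.add_argument("--handler", choices=["VLSA", "SA"], default="VLSA")
    parser.add_argument("--multi_run", action="store_true",
                        help="run all folds of the 5-fold cross-validation")
    args = parser.parse_args()

    with open(args.config, "r") as f:
        cfg = yaml.safe_load(f)  # dict of hyper-parameters, data paths, etc.

    folds = range(5) if args.multi_run else [0]
    for fold in folds:
        # In the real code, a handler class would build the model, data loaders,
        # and training loop for this fold from `cfg`.
        print(f"[{args.handler}] training fold {fold} with config {args.config}")

if __name__ == "__main__":
    main()
```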
We advocate open-source research. Our full training logs for VLSA models can be accessed on Google Drive.
Foundational VLMs for computational pathology:
| Model | Architecture | Paper | Code | Data |
|---|---|---|---|---|
| PLIP (NatMed'23) | CLIP | A visual language foundation model for pathology image analysis using medical Twitter | GitHub | 208,414 pathology images paired with natural language descriptions from Twitter |
| Quilt-Net (NeurIPS'23) | CLIP | Quilt-1M: One million image-text pairs for histopathology | GitHub | 802,148 image and text pairs from YouTube |
| CONCH (NatMed'24) | CoCa | A Vision-Language Foundation Model for Computational Pathology | GitHub | over 1.17 million image-caption pairs |
| CPLIP (CVPR'24) | CLIP | CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment | GitHub | many-to-many VL alignment on the ARCH dataset |
| PathAlign (arXiv'24) | BLIP-2 | PathAlign: A vision-language model for whole slide images in histopathology | - | over 350,000 WSIs and diagnostic text pairs |
| TITAN (arXiv'24) | CoCa | Multimodal Whole Slide Foundation Model for Pathology | GitHub | slide-level vision-language alignment |
VLM-driven computational pathology tasks:
NOTE: please open a new PR if you want to add your work to this table.
Following CONCH, we first divide each WSI into 448 × 448 patches at 20× magnification. Then we adopt the image encoder of CONCH to extract patch features (see the sketch below).
Our complete WSI preprocessing procedure follows Pipeline-Processing-TCGA-Slides-for-MIL; please refer to it for a detailed tutorial.
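To make the patching step concrete, here is a minimal sketch of tiling a WSI into 448 × 448 patches with OpenSlide and encoding them with a generic patch encoder. It ignores tissue masking and magnification matching, which the linked pipeline covers, and `patch_encoder` is only a placeholder for the CONCH image encoder and its own preprocessing.

```python
# Minimal tiling sketch (no tissue masking / magnification matching); see the
# linked preprocessing pipeline for the full procedure we actually follow.
import numpy as np
import openslide
import torch

PATCH = 448  # patch size in pixels, following CONCH

def extract_patch_features(slide_path: str, patch_encoder, device="cuda"):
    slide = openslide.OpenSlide(slide_path)
    W, H = slide.dimensions                      # level-0 width and height
    feats = []
    for y in range(0, H - PATCH + 1, PATCH):
        for x in range(0, W - PATCH + 1, PATCH):
            region = slide.read_region((x, y), 0, (PATCH, PATCH)).convert("RGB")
            img = torch.from_numpy(np.array(region)).permute(2, 0, 1).float() / 255.0
            with torch.no_grad():
                # `patch_encoder` stands in for the CONCH image encoder plus its
                # own preprocessing (resizing, normalization), omitted here.
                feats.append(patch_encoder(img.unsqueeze(0).to(device)).cpu())
    return torch.cat(feats, dim=0)               # (num_patches, feature_dim)
```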
Some parts of the code in this repo are adapted from the following amazing works. We thank the authors and developers for their selfless contributions.
- CONCH: our VLSA is driven by this great pathology VLM.
- OrdinalCLIP: adapted for survival prompt learning.
- SurvivalEVAL: used for performance evaluation (D-cal and MAE computation).
- Patch-GCN: we follow all of its data splits for 5-fold cross-validation.
ⓒ UESTC. The models and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of the VLSA model and its derivatives is prohibited and requires prior approval. If you are a commercial entity, please contact the corresponding author.
If you find this work helpful for your research, please consider citing our paper:
@misc{liu2024interpretablevisionlanguagesurvivalanalysis,
title={Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology},
author={Pei Liu and Luping Ji and Jiaxiang Gou and Bo Fu and Mao Ye},
year={2024},
eprint={2409.09369},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2409.09369},
}