liugangcode/InfoAlign

Learning Molecular Representation in a Cell

InfoAlign learns molecular representations from bottleneck information derived from molecular structures, cell morphology, and gene expression. See the paper for details: https://arxiv.org/abs/2406.12056v1.

Requirements

This code was developed and tested with Python 3.11.7, PyTorch 2.1.0+cu118, and torch-geometric 2.5.2. All dependencies are specified in the requirements.txt file.
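
As a quick sanity check after installing from requirements.txt, the installed versions can be compared against the ones listed above. The following is a minimal sketch, not part of the repository: it assumes the pip distribution names are `torch` and `torch-geometric`, and the helper names `base_version` and `find_mismatches` are made up for illustration.

```python
from importlib import metadata

# Versions stated in this README. Local build tags such as "+cu118" are ignored.
# Assumption: the pip distribution names are "torch" and "torch-geometric".
EXPECTED = {"torch": "2.1.0", "torch-geometric": "2.5.2"}


def base_version(version: str) -> str:
    """Drop a local build suffix, e.g. '2.1.0+cu118' becomes '2.1.0'."""
    return version.split("+", 1)[0]


def find_mismatches(expected: dict, lookup=metadata.version) -> dict:
    """Return {package: installed version or 'missing'} for packages that differ."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = base_version(lookup(name))
        except metadata.PackageNotFoundError:
            installed = "missing"
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, found in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {EXPECTED[name]}, found {found}")
```

A different version does not necessarily mean the code will fail; the check only reports deviations from the tested setup.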

Usage

Fine-tuning

We provide a pretrained checkpoint that can be downloaded from Hugging Face. Place the model weights (pretrain.pt) in the ckpt folder, together with the YAML file that holds the model configuration.
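
Before fine-tuning, it can help to confirm that the ckpt folder has the expected layout. The sketch below is an illustration only: the function name `missing_checkpoint_files` is hypothetical, and because the name of the YAML file is not given here, it accepts any `.yaml` or `.yml` file in the folder.

```python
from pathlib import Path


def missing_checkpoint_files(ckpt_dir: str = "ckpt") -> list:
    """List what is missing from the checkpoint folder (empty list means ready)."""
    folder = Path(ckpt_dir)
    missing = []
    if not (folder / "pretrain.pt").is_file():
        missing.append("pretrain.pt")
    # Assumption: the YAML file name is not specified, so any YAML file counts.
    if not any(folder.glob("*.yaml")) and not any(folder.glob("*.yml")):
        missing.append("YAML config")
    return missing


if __name__ == "__main__":
    problems = missing_checkpoint_files()
    print("ckpt folder looks ready" if not problems else f"missing: {problems}")
```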

To fine-tune the model and run inference on a downstream dataset, run the matching command:

python main.py --model-path ckpt/pretrain.pt --dataset finetune-chembl2k

python main.py --model-path ckpt/pretrain.pt --dataset finetune-broad6k

python main.py --model-path ckpt/pretrain.pt --dataset finetune-biogenadme

python main.py --model-path ckpt/pretrain.pt --dataset finetune-moltoxcast
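
The four commands above differ only in the dataset name, so they can also be launched one after another from a small driver script. This is a sketch rather than a script shipped with the repository: `finetune_command` and `run_all` are hypothetical helpers, and only the `--model-path` and `--dataset` flags shown above are used.

```python
import subprocess
import sys

# Dataset names taken from the commands above.
DATASETS = [
    "finetune-chembl2k",
    "finetune-broad6k",
    "finetune-biogenadme",
    "finetune-moltoxcast",
]


def finetune_command(dataset: str, model_path: str = "ckpt/pretrain.pt") -> list:
    """Build the argument list for one fine-tuning run."""
    return [sys.executable, "main.py", "--model-path", model_path, "--dataset", dataset]


def run_all(datasets=DATASETS) -> None:
    """Run fine-tuning on each dataset in turn; stop at the first failing run."""
    for dataset in datasets:
        subprocess.run(finetune_command(dataset), check=True)
```

Calling `run_all()` from the repository root runs the datasets sequentially; `check=True` makes a failed run raise `subprocess.CalledProcessError` instead of silently continuing.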

Note: Please visit Hugging Face for the cell morphology and gene expression features in the ChEMBL2k and Broad6K datasets.

Pretraining

To pretrain the model from scratch, first download the pretraining dataset from Hugging Face and place all pretraining data files under the raw_data/pretrain/raw folder. Then run:

python main.py --model-path "ckpt/pretrain.pt" --lr 1e-4 --wdecay 1e-8 --batch-size 3072

The pretrained model is saved in the ckpt folder as pretrain.pt.
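
Since pretraining expects its data under raw_data/pretrain/raw, a short preflight check can catch a misplaced download before a long run starts. The sketch below is only an illustration: `list_pretrain_files` is a hypothetical helper, and it checks only that the folder exists and contains files, because the expected file names are not listed here.

```python
from pathlib import Path


def list_pretrain_files(raw_dir: str = "raw_data/pretrain/raw") -> list:
    """Return the sorted file names in the pretraining data folder.

    Raises FileNotFoundError if the folder is absent or holds no files.
    """
    folder = Path(raw_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"{raw_dir} does not exist")
    names = sorted(p.name for p in folder.iterdir() if p.is_file())
    if not names:
        raise FileNotFoundError(f"{raw_dir} contains no files")
    return names
```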

Data source

For readers interested in data collection, here are the sources:

  1. Cell Morphology Data

    • JUMP dataset: The data are from "JUMP Cell Painting dataset: morphological impact of 136,000 chemical and genetic perturbations" and can be downloaded here. The dataset provides cell morphology features for chemical and genetic perturbations.
    • Bray's dataset: The data are from "A dataset of images and morphological profiles of 30,000 small-molecule treatments using the Cell Painting assay" and can be downloaded from GigaDB. A processed version is available on Zenodo.
  2. Gene Expression Data

    • LINCS L1000 gene expression data from the paper "Drug-induced adverse events prediction with the LINCS L1000 data": Data.
  3. Relationships

    • Gene-gene, gene-compound relationships from Hetionet: Data.

Citation

If you find this repository useful, please cite our paper:

@article{liu2024learning,
  title={Learning Molecular Representation in a Cell},
  author={Liu, Gang and Seal, Srijit and Arevalo, John and Liang, Zhenwen and Carpenter, Anne E and Jiang, Meng and Singh, Shantanu},
  journal={arXiv preprint arXiv:2406.12056},
  year={2024}
}
