This repository includes code and a pre-trained model of scHiGex for single-cell gene expression prediction.
The code was tested on Python 3.10.4. The conda environment is shared via env/environment.yml, and for dnabert2, the environment is shared via env/environment_dnabert2.yml.
The dataset used for training is from the HiRES experiment. The dataset is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223917.
Files to be placed in the assets directory are as follows:
- assets/
  - gencode.vM23.annotation.gtf
  - mm10.fa
  - mm10_100_segments_browser.bed
  - rna_umicount.tsv (embryo)
  - metadata.xlsx (embryo)
  - pairs/ (only for training)
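Before running anything, it can help to confirm that the files listed above are in place. The following is a minimal sketch, not part of the repository: the function name `missing_assets` is hypothetical, and it assumes the files are stored under exactly the names shown in the list (for example, that the embryo UMI counts are saved as `rna_umicount.tsv`).

```python
from pathlib import Path

# File names copied from the list above (assumed to be the exact names on disk).
REQUIRED_FILES = [
    "gencode.vM23.annotation.gtf",
    "mm10.fa",
    "mm10_100_segments_browser.bed",
    "rna_umicount.tsv",
    "metadata.xlsx",
]


def missing_assets(assets_dir="assets", training=False):
    """Return the names of expected entries that are absent from assets_dir."""
    root = Path(assets_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # The pairs/ directory is only needed when training from scratch.
    if training and not (root / "pairs").is_dir():
        missing.append("pairs/")
    return missing


if __name__ == "__main__":
    absent = missing_assets(training=True)
    print("All assets found." if not absent else f"Missing: {absent}")
```

Calling `missing_assets()` with the default `training=False` skips the check for `pairs/`, which matches the prediction-only setup.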
To train the scHiGex model from scratch for mm10:
- Download and place the required files in the assets directory.
- Run the Python scripts inside the scripts directory in the order of the numbers prefixed to the file names. These scripts will generate the required data files for training the model.
- Run ./train.sh to train the model.
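Running the numbered scripts in order can be automated. The sketch below is an illustration rather than a tool shipped with the repository: the helper names `prefix_key`, `ordered_scripts`, and `run_all` are hypothetical, and it assumes that every preprocessing script starts with a dotted numeric prefix (such as `1.1_` or `1.2_`) and is meant to be launched from inside the scripts directory.

```python
import re
import subprocess
import sys
from pathlib import Path


def prefix_key(path):
    """Turn a name such as '1.2_generate_metadata.py' into (1, 2) for sorting."""
    match = re.match(r"\d+(?:\.\d+)*", Path(path).name)
    if match is None:
        return (float("inf"),)  # names without a numeric prefix sort last
    return tuple(int(part) for part in match.group().split("."))


def ordered_scripts(scripts_dir="scripts"):
    """List the numbered Python scripts in scripts_dir in execution order."""
    scripts = [p for p in Path(scripts_dir).glob("*.py") if p.name[0].isdigit()]
    return sorted(scripts, key=prefix_key)


def run_all(scripts_dir="scripts"):
    """Run each script from inside scripts_dir, stopping at the first failure."""
    for script in ordered_scripts(scripts_dir):
        print(f"Running {script.name}")
        subprocess.run([sys.executable, script.name], cwd=scripts_dir, check=True)
```

Sorting on the parsed tuple rather than on the raw file name keeps a prefix like `10` after `2`, which plain string sorting would get wrong. Calling `run_all()` from the repository root would then execute the scripts one by one.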
To predict gene expression levels using the trained model for mm10:
- Download and place the required files in the assets directory (apart from the pairs files, since no training is involved).
- Run the following Python scripts inside the scripts directory (the goal is to create the chromosome definitions inside the scripts directory):
  - 1.1_run_gtfparse.py
  - 1.2_generate_metadata.py
- Place the .pairs files in the predict directory:
  - Put the group of Hi-C .pairs files whose gene expression you want to predict inside the directory predict/pairs/. At least 20 pairs files per cell type are required to create the meta-cell.
  - Example:
    - predict/pairs/
      - cell_type_1/
        - cell_type_1_1.pairs
        - cell_type_1_2.pairs
        - ...
      - cell_type_2/
        - cell_type_2_1.pairs
        - cell_type_2_2.pairs
        - ...
      - ...
- Run python 1.data_prep.py to generate the required data files for prediction.
- Run python 2.predict.py to predict gene expression levels.
- The predicted gene expression levels will be saved in the predict directory under the file name predictions.csv.
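Since each cell type needs at least 20 pairs files to build a meta-cell, it may be worth checking the input layout before starting data preparation. This is a minimal sketch under stated assumptions: the helper names `count_pairs_files` and `undersized_cell_types` are hypothetical, each cell type is assumed to be a direct subdirectory of predict/pairs/, and only files ending in exactly `.pairs` are counted (compressed files would be ignored).

```python
from pathlib import Path

MIN_PAIRS_PER_CELL_TYPE = 20  # threshold stated in the steps above


def count_pairs_files(pairs_root="predict/pairs"):
    """Map each cell-type subdirectory to its number of .pairs files."""
    root = Path(pairs_root)
    return {
        sub.name: sum(1 for _ in sub.glob("*.pairs"))
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


def undersized_cell_types(pairs_root="predict/pairs", minimum=MIN_PAIRS_PER_CELL_TYPE):
    """Return the cell types that have fewer than `minimum` .pairs files."""
    counts = count_pairs_files(pairs_root)
    return {name: n for name, n in counts.items() if n < minimum}
```

An empty dictionary from `undersized_cell_types()` would indicate that every cell-type directory meets the 20-file minimum.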
To use your own model trained with the scHiGex architecture, you need to point to the correct model file and node_embeddings.
The scripts were designed to be compatible with the HiRES data used in the experiment. The code can be modified to suit other data or purposes.
Please cite the following paper:
@article{scHiGex,
title={scHiGex: predicting single-cell gene expression based on single-cell Hi-C data},
author={Bishal Shrestha and Andrew Jordan Siciliano and Hao Zhu and Tong Liu and Zheng Wang},
journal={},
year={2024}
}