scHiGex: predicting single-cell gene expression based on single-cell Hi-C data

This repository includes code and a pre-trained model of scHiGex for single-cell gene expression prediction.

Instructions

Python Environment

The code was tested on Python 3.10.4. The conda environment is defined in env/environment.yml; a separate environment for dnabert2 is defined in env/environment_dnabert2.yml.

Dataset

The training dataset comes from the HiRES experiment and is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223917.

Place the following files in the assets directory:

assets/
- gencode.vM23.annotation.gtf
- mm10.fa
- mm10_100_segments_browser.bed
- rna_umicount.tsv - embryo
- metadata.xlsx - embryo
- pairs/ (Only for training)
  - Hi-C pairs files
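
Before running anything, it can help to confirm that the assets directory is complete. The following is a minimal sketch, not part of the repository: the function name `missing_assets` is hypothetical, and it assumes the "- embryo" suffix in the list above is a note about the file contents rather than part of the file name.

```python
from pathlib import Path

# File names copied from the list above (assumption: "- embryo" is a note,
# not part of the name). pairs/ is only needed for training.
REQUIRED_ASSETS = [
    "gencode.vM23.annotation.gtf",
    "mm10.fa",
    "mm10_100_segments_browser.bed",
    "rna_umicount.tsv",
    "metadata.xlsx",
]


def missing_assets(assets_dir="assets", training=False):
    """Return the names of required entries not found under assets_dir."""
    root = Path(assets_dir)
    missing = [name for name in REQUIRED_ASSETS if not (root / name).is_file()]
    # The pairs/ directory is checked only when preparing for training.
    if training and not (root / "pairs").is_dir():
        missing.append("pairs/")
    return missing


if __name__ == "__main__":
    missing = missing_assets(training=True)
    print("Missing:", missing if missing else "none")
```

Pass `training=False` when only prediction is planned, since the pairs directory inside assets is then not needed.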

Training

To train the scHiGex model from scratch for mm10:

Download and place the required files in the assets directory.
Run the Python scripts inside the scripts directory in the order given by the numeric prefixes of their file names. These scripts generate the data files required for training the model.
Run ./train.sh to train the model.
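
Running the numbered scripts in order can be automated. The sketch below is an illustration under assumptions rather than the repository's own tooling: the helper names (`numeric_prefix`, `ordered_scripts`, `run_all`) are hypothetical, prefixes are assumed to look like `1.1_` or `2.`, and each script is assumed to be runnable from inside the scripts directory without arguments. Sorting on the parsed numbers avoids the plain string ordering that would place a prefix of 10 before 2.

```python
import re
import subprocess
import sys
from pathlib import Path


def numeric_prefix(filename):
    """Turn a leading prefix such as '1.2' into (1, 2); None if absent."""
    match = re.match(r"\d+(?:\.\d+)*", filename)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def ordered_scripts(filenames):
    """Keep only .py names that have a numeric prefix, sorted by that prefix."""
    numbered = [
        name for name in filenames
        if name.endswith(".py") and numeric_prefix(name) is not None
    ]
    return sorted(numbered, key=numeric_prefix)


def run_all(scripts_dir="scripts"):
    """Run each numbered script in order, stopping at the first failure."""
    names = ordered_scripts(p.name for p in Path(scripts_dir).iterdir())
    for name in names:
        print("Running", name)
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run([sys.executable, name], cwd=scripts_dir, check=True)
```

Calling `run_all()` from the repository root would then execute the scripts one after another; inspect the list returned by `ordered_scripts` first to confirm the order matches what you expect.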

Prediction

To predict gene expression levels using the trained model for mm10:

Download and place the required files in the assets directory (apart from the pairs files, since no training is involved).
Run the following Python scripts inside the scripts directory (the goal is to create the chromosome definitions inside the scripts directory):
- 1.1_run_gtfparse.py
- 1.2_generate_metadata.py
Place the .pairs files in the predict directory:
- Put the group of Hi-C .pairs files whose gene expression you want to predict inside the directory predict/pairs/. At least 20 pairs files per cell type are required to create the meta-cell.
- Example:
  - predict/pairs/
    - cell_type_1/
      - cell_type_1_1.pairs
      - cell_type_1_2.pairs
      - ...
    - cell_type_2/
      - cell_type_2_1.pairs
      - cell_type_2_2.pairs
      - ...
    - ...
Run python 1.data_prep.py to generate the required data files for prediction.
Run python 2.predict.py to predict gene expression levels.
The predicted gene expression levels will be saved in the predict directory under the file name predictions.csv.
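
Because each cell type needs at least 20 pairs files, it may be worth checking the layout of predict/pairs/ before running the data preparation step. This is a minimal sketch with hypothetical helper names (`count_pairs`, `too_small`); it assumes one subdirectory per cell type, as in the example layout above, and that file names end in exactly `.pairs` (compressed files would not be counted).

```python
from pathlib import Path

MIN_PAIRS_PER_CELL_TYPE = 20  # threshold stated in the instructions above


def count_pairs(pairs_root="predict/pairs"):
    """Map each cell-type subdirectory name to its number of .pairs files."""
    root = Path(pairs_root)
    return {
        sub.name: sum(
            1 for f in sub.iterdir() if f.is_file() and f.name.endswith(".pairs")
        )
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


def too_small(counts, minimum=MIN_PAIRS_PER_CELL_TYPE):
    """Return the cell types that have fewer than `minimum` .pairs files."""
    return sorted(name for name, n in counts.items() if n < minimum)


if __name__ == "__main__":
    if Path("predict/pairs").is_dir():
        counts = count_pairs()
        for name in too_small(counts):
            print(f"{name}: only {counts[name]} .pairs files")
```

Any cell type reported here would need more cells added before a meta-cell can be built for it.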

If you want to use your own model trained with the scHiGex architecture, you need to point to the right model file and node_embeddings.

The scripts were designed to be compatible with the HiRES data used in the experiment. The code can be modified to suit the user's purpose.

Citation

Please cite the following paper:

@article{scHiGex,
  title={scHiGex: predicting single-cell gene expression based on single-cell Hi-C data},
  author={Bishal Shrestha and Andrew Jordan Siciliano and Hao Zhu and Tong Liu and Zheng Wang},
  journal={},
  year={2024}
}
