An easy-to-use random forest-based tool for risk gene, cell type and drug ranking for complex traits using GWAS-derived genetic evidence.

Warning! Drug ranking produced by the pipeline is for scientific purposes only and should not be used as a guidance for clinical practice.

Description

The pipeline is designed to use the data on known risk genes (which can be obtained from GWAS fine-mapping) and expression profiles characterizing cell types/tissues to construct a random forest classifier to distinguish risk genes.
A modification of feature importance analysis with SHAP values is used to rank the cell types/tissues based on their importance for risk class assignment.
The information on drug gene-targets can then be used to rank the drugs by maximal risk class assignment probability of any of their targets.
The scores for small molecule receptors from Cellinker database with the corresponding ligand information can be obtained in a separate file.

Pipeline performance checking

The pipleline was tested on IBD, schizophrenia, Alzheimer's disease.
For schizophrenia, it displays ROC-AUC above 0.9 on test gene set in a risk gene discrimination task. In the other case studies, it shows ROC-AUC above 0.8.
The risk genes identified with GCDPipe for schizophrenia link dopaminergic synapse, retrograde endocannabinoid signaling, nicotine addiction ad other dysregulations at molecular level; for Alzheimer’s disease it shows diuretics as a leading enriched drug target category, and for IBD it prioritizes TH1/17 and other CD4+ T cells.
In the case studies on IBD and schizophrenia, the obtained gene ranking is significantly enriched with drug targets for the corresponding diseases and drug ranking - in the corresponding disease drugs.

For further details on the pipeline, see the GCDPipe publication.

General pipeline scheme

Back To The Top

Demo

Back To The Top

Installation

# A virtual environment can be created and activated with:
python3 -m venv GCDpipe_env
source GCDpipe_env/bin/activate
# Downloading the code:
git clone https://github.com/ACDBio/GCDPipe.git
# It might be required to upgrade pip with: 
python -m pip install --upgrade pip
# Downloading the required packages: 
pip install -r ./GCDPipe/requirements.txt
# Launching the GCDpipe Dash App:
python ./GCDPipe/GCDPipe.py
# To use GCDPipe interface, open up the link depicted after the phrase 'Dash is running on' in the console.

Back To The Top

Input

For gene classification, only first two fields need to be filled. The files to the other fields are uploaded in cases when drug prioritization and its initial quality assessment are required.

Field 1: Gene Data (a data on risk and non-risk genes used for classifier training and testing)

Two types of .csv files can be uploaded in this field:

A file with gene identifiers in the first column and their attribution to risk (1) or non-risk (0) class.

pipe_genesymbol	is_True
ACP5	0
GRIN2A	1

A file, specifying locations of the significant loci in GRCh38 coordinates and the corresponding risk genes for each locus. In this case, all genes lying within 500 kbase window around the locus center (which is expected to be a genome-wide significant variant (a 'leading variant') from GWAS), except for the risk ones are considered as 'false', and a gene set for classifier training and testing is constructed in an automated manner. If not variant ID is unknown, an rsid field can be left blank.

pipe_genesymbol	rsid	location	chromosome
NOD2	rs2066844	50712015	16

Field 2: Feature data (expression profiles)

Here, a .csv file with expression profiles characterizing cell types/tissues of interest can be uploaded. The data can be obtained from a range of publicly available expression atlases and other sources (such as Allen Brain Mep, DropViz, DICE Immune Cell Atlas, GTEx etc.). These profiles are used as features to build a risk gene classifier. It requires pipe_genesymbol column with gene identifiers and other columns with custom names (for example, names of cell types).

pipe_genesymbol	Exc.L5.6.FEZF2.ANKRD20A1	Exc.L5.6.THEMIS.TMEM233	Inh.L1.LAMP5.NDNF
A2M	4.15	3.83	0

Field 3: Drug-target interaction data

For drug ranking, the pipeline needs a .csv file with drug identifiers and their gene targets. We assembled an example of such dataset from DGIdb and DrugCentral, which can be found here. The field with drug IDs is expected to be named 'DRUGBANK_ID' and the field with genes - 'pipe_genesymbol'.

DRUGBANK_ID	pipe_genesymbol
DB05969	CDK7
DB01054	ADORA3
DB01054	TTR

Field 4: Target drug category Drugbank IDs

In cases when there exists a list of the drugs with desired functions (drugs for treatment of a disease of interest), it can be uploaded here for initial quality assessment of gene and drug ranking. The .csv file is expected with one column - DRUGBANK_ID.

DRUGBANK_ID
DB00321
DB01238
DB14185

Checkbox option: 'Rank the Cellinker small molecule ligands based on the receptor gene score'

Mark this checkbox if a separate file with Cellinker small molecule ligand receptor scores and corresponsing ligand information is needed.

Back To The Top

Output

The pipeline gives the ROC curve, ROC-AUC and a range of classifier performance metrics for the obtained classifier (the metrics are calculated from testing on a test set of genes automatically generated within the pipeline).
In addition, it provides a range of output files:

Output file 1: Gene Classification Results

A .csv file can be downloaded with information on gene attribution to a risk class by the obtained classifier with the probability threshold corresponding to maximal difference between tpr and fpr on the ROC ('is_risk_class' field), probabilities of the genes to be assigned to this class ('score' field) and information on whether the gene was considered as the risk one in the original training-testing set ('is_input_risk_gene' field).

pipe_genesymbol	score	is_risk_class	is_input_risk_gene
PPP2R2B	0.939833072	TRUE	0
CALM2	0.939833072	TRUE	0
CACNA1C	0.936927967	TRUE	1

Output file 2: Expression profile (cell type/tissue) ranking

The pipeline returns ranking of features (expression profiles) used for classifier training (characterizing cell types/tissues) based on correlation between SHAP values for the risk class and values of these features (expression intensity in these profiles).

expression_profile	importance_based_score
Inh.L4.5.PVALB.TRIM67	0.971200485

Output file 3: Cellinker receptor gene scores with corresponding ligand information

Output file 4: Drug ranking

If drug ranking is performed, the pipeline gives the file, in which drugs are scored by maximal probability of any of their gene-target to be assigned to the risk class ('score' field). The drugs, which have at least one gene-target attributed to the risk class at default probability threshold (corresponding to maximal difference between tpr and fpr) are marked in the field 'has_risk_class_targets'.

DRUGBANK_ID	score	has_risk_class_targets
DB06288	0.939833072	TRUE

In addition, if a drug set of interest is provided, GCDPipe can run two Mann-Whitney tests: comparing the gene scores of all targets of the given drugs with those of other genes and comparing their drug scores with scores of all other drugs. Then it shows the corresponding p-values, U statistics and displays boxplots and density plots for the compared distributions.

Back To The Top

Settings

If the file for the 'Gene Data' field specifies the loci, click the checkbox 'Locus file is provided: generate the training-testing set from GWAS genetic fine-mapping results' to automaticlly select 'false' genes.
If gene nomenclature differs across files, unification to HUGO nomenclature can be performed. To use this option, click the checkbox 'Unify gene nomenculature with HUGO'.
To perform drug analysis, click the checkbox 'Drug-risk gene interaction assessment' - the required input field and other options will open up.

The pipeline interface also allows to define a range of settings for classifier generation:

Number of estimators: decision tree count in a random forest can be changed.
Testing gene set/training gene set ratio: a fraction of the genes from the training-testing set to be used for testing.
Max tree depth - a specific maximal depth can be set.
Number of samples per leaf: up to 10 different values can be set for hyperparameter search by sliding the points from the default value.
Min number of samples required to split an internal node: up to 10 different values can be set for hyperparameter search by sliding the points from the default value.
Min impurity decrease: up to 10 different values can be set for hyperparameter search by sliding the points from the default value.

Back To The Top

Citation

https://www.biorxiv.org/content/10.1101/2022.07.27.501775v1

Back To The Top

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Warning! Drug ranking produced by the pipeline is for scientific purposes only and should not be used as a guidance for clinical practice.

Table of Contents

Description

Pipeline performance checking

General pipeline scheme

Demo

Installation

Input

Field 1: Gene Data (a data on risk and non-risk genes used for classifier training and testing)

Field 2: Feature data (expression profiles)

Field 3: Drug-target interaction data

Field 4: Target drug category Drugbank IDs

Checkbox option: 'Rank the Cellinker small molecule ligands based on the receptor gene score'

Output

Output file 1: Gene Classification Results

Output file 2: Expression profile (cell type/tissue) ranking

Output file 3: Cellinker receptor gene scores with corresponding ligand information

Output file 4: Drug ranking

Settings

Citation

Files

README.md

Latest commit

History

README.md

File metadata and controls

Warning! Drug ranking produced by the pipeline is for scientific purposes only and should not be used as a guidance for clinical practice.

Table of Contents

Description

Pipeline performance checking

General pipeline scheme

Demo

Installation

Input

Field 1: Gene Data (a data on risk and non-risk genes used for classifier training and testing)

Field 2: Feature data (expression profiles)

Field 3: Drug-target interaction data

Field 4: Target drug category Drugbank IDs

Checkbox option: 'Rank the Cellinker small molecule ligands based on the receptor gene score'

Output

Output file 1: Gene Classification Results

Output file 2: Expression profile (cell type/tissue) ranking

Output file 3: Cellinker receptor gene scores with corresponding ligand information

Output file 4: Drug ranking

Settings

Citation