Introduction

The genedise project aims to find druggable genes for a specific disease based on previously tried targets. Whether these targets were successful is not the primary concern; the fact that there was enough evidence to try them is enough for us. In this way, we aim to mimic the time-consuming task of proposing new, reasonable targets.

To suggest new disease genes, the project uses OpenTargets data as seed gene lists and the STRING protein-protein interaction network to infer new candidates.

The project is almost entirely coded in R. Some MATLAB code was needed to include state-of-the-art approaches.

Structure

The files and directories of this project are prefixed with a number that indicates the chronological order of their execution. Scripts are stored in Rmd files. Their outputs are saved in folders sharing their prefix. The most relevant prefixes are:

2X_: analysis on the STRING network
4X_: analysis on the OmniPath network
6X_: plots and models combining both networks (depends on the execution of the 2X and 4X scripts)

Reproducibility

Metadata files

The output of sessionInfo() is always stored in the directory 00_metadata to keep track of the package versions.
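As a minimal sketch of how a script could record its session, the snippet below writes the output of sessionInfo() to a text file. The helper name save_session_info, the file naming scheme and the use of a temporary directory are assumptions made for illustration, not the project's actual code.

```r
# Hypothetical helper: write sessionInfo() to a text file in a metadata directory
save_session_info <- function(dir_out = "00_metadata", prefix = "sessionInfo") {
  dir.create(dir_out, showWarnings = FALSE, recursive = TRUE)
  file_out <- file.path(dir_out, paste0(prefix, ".txt"))
  # capture.output() turns the printed session summary into character lines
  writeLines(capture.output(sessionInfo()), file_out)
  invisible(file_out)
}

# Example call; a temporary directory is used so that no project file is touched
out_file <- save_session_info(
  dir_out = file.path(tempdir(), "00_metadata"),
  prefix = "01_preprocessing"
)
```

Calling the helper at the end of each script would leave one file per script, which makes it easier to see which package versions produced which outputs.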

Configuration files

There are configuration files, such as 03_config.R, that contain a comprehensive set of parameters, paths and file names. Generally, the scripts source these parameters instead of hardcoding them.
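The sourcing pattern can be sketched as follows. The configuration contents (dir_data, file_scores, n_folds, n_reps) are hypothetical stand-ins, and sourcing into a dedicated environment is a choice made for this example; the project's scripts may simply call source() at the top level.

```r
# Write a toy config file (hypothetical contents standing in for a real config)
config_file <- file.path(tempdir(), "toy_config.R")
writeLines(c(
  'dir_data    <- "01_data"',
  'file_scores <- file.path(dir_data, "scores.rds")',
  'n_folds     <- 3',
  'n_reps      <- 25'
), config_file)

# Source into a dedicated environment so parameters do not leak into the workspace
config <- new.env()
source(config_file, local = config)

# Parameters are then read from the environment instead of being hardcoded
config$n_folds
config$file_scores
```

Keeping the parameters in one environment makes it explicit which values come from the configuration and which are defined by the script itself.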

Package management

The project pins package versions with packrat to ease portability between machines.

External files

Almost all the files in the project are included in the git repository at the moment. Exceptions:

STRING database files
Network kernel(s)

The paths to these files (on Sergi's machines) can be found in the config files.

Other

There are several set.seed calls throughout the code. Intermediate results are saved when the space required is not prohibitive.
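A minimal sketch of combining a fixed seed with cached intermediate results is shown below. The helper name cached, the seed value and the file name are assumptions for illustration only.

```r
# Hypothetical helper: compute a result once, cache it as .rds, reload on later runs
cached <- function(file, compute, seed = 1) {
  if (file.exists(file)) {
    return(readRDS(file))
  }
  set.seed(seed)       # fix the RNG state before any stochastic step
  result <- compute()
  saveRDS(result, file)
  result
}

cache_file <- file.path(tempdir(), "toy_intermediate.rds")
unlink(cache_file)     # start from a clean state for this example

first  <- cached(cache_file, function() rnorm(5))
second <- cached(cache_file, function() rnorm(5))  # read from disk, not recomputed
identical(first, second)
```

Setting the seed inside the helper means the cached file can be deleted and regenerated with the same values.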

Workflow

Data preprocessing

Check OpenTargets data sanity
Choose network: compromise between coverage and size
Compute and store graph kernel on chosen network
Save cleaned data, mapped to the network of choice
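To illustrate the kernel step, the sketch below computes a graph kernel on a toy network with base R and stores it on disk. The regularised Laplacian kernel, the value of sigma2 and the four-gene path graph are assumptions chosen for the example; the kernel actually used by the project is defined in its scripts.

```r
# Toy undirected network: path graph a - b - c - d (hypothetical data)
genes <- c("a", "b", "c", "d")
adj <- matrix(0, 4, 4, dimnames = list(genes, genes))
adj[cbind(1:3, 2:4)] <- 1
adj <- adj + t(adj)

# Graph Laplacian L = D - A, where D is the diagonal degree matrix
lap <- diag(rowSums(adj)) - adj

# Regularised Laplacian kernel K = (I + sigma2 * L)^-1 (assumed kernel choice)
sigma2 <- 1
kernel <- solve(diag(nrow(lap)) + sigma2 * lap)
dimnames(kernel) <- dimnames(adj)

# Store the kernel so that later scripts can load it instead of recomputing it
kernel_file <- file.path(tempdir(), "toy_kernel.rds")
saveRDS(kernel, kernel_file)
```

Inverting a dense matrix scales poorly with the number of nodes, which is one reason to compute the kernel once and reuse the stored file.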

Topology analysis

Characterisation of disease genes in terms of network properties
Within-disease study
Between-disease study

Performance

Load configuration files
Load dataset
Load network data
Build CV folds
Define functions for prediction
Define performance metrics
For each (disease, input_type, fold) combination:
    1. Define train and validation
    2. Predict for every method using train
    3. Compute performance metrics
    4. Write to disk
Plot metrics
Build statistical models for comparing methods
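The loop over diseases and folds can be sketched as below. Everything here is a toy assumption: the gene names, the two diseases, the random placeholder predictor, the stratified fold assignment and the AUROC metric. The input_type dimension and the repeated CV scheme are left out to keep the example short.

```r
set.seed(1)

# Toy inputs (all hypothetical): 20 genes, two diseases with known positive genes
genes <- paste0("g", 1:20)
positives <- list(dis1 = genes[1:6], dis2 = genes[10:15])
n_folds <- 3

# Assign each gene to a fold, separately for positives and negatives (stratified)
build_folds <- function(labels, k) {
  folds <- integer(length(labels))
  for (cl in unique(labels)) {
    idx <- which(labels == cl)
    folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  folds
}

# Placeholder predictor: random scores; a real method would use the network
predict_random <- function(train_labels, genes) {
  setNames(runif(length(genes)), genes)
}

# AUROC computed from the rank-sum (Mann-Whitney) statistic
auroc <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

results <- list()
for (disease in names(positives)) {
  labels <- as.integer(genes %in% positives[[disease]])
  folds <- build_folds(labels, n_folds)
  for (fold in seq_len(n_folds)) {
    is_val <- folds == fold
    # 1. Define train and validation: validation labels are hidden as NA
    train_labels <- ifelse(is_val, NA, labels)
    # 2. Predict using the training labels only
    scores <- predict_random(train_labels, genes)
    # 3. Compute the performance metric on the validation genes
    auc <- auroc(scores[is_val], labels[is_val])
    results[[length(results) + 1]] <- data.frame(
      disease = disease, fold = fold, method = "random", auroc = auc
    )
  }
}
results <- do.call(rbind, results)

# 4. Write to disk
out_file <- file.path(tempdir(), "toy_performance.csv")
write.csv(results, out_file, row.names = FALSE)
```

Storing one row per disease, fold and method keeps the results in a long format that is convenient for plotting and for fitting models that compare methods.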

System requirements

Hardware

The runs have been executed on the following hardware from the UPC:

eko:
- 12 threads (Intel(R) Xeon(R) CPU E7310@1.60GHz)
- 32GB RAM
sun:
- 32 threads (Intel(R) Xeon(R) CPU E5-2450@2.10GHz)
- 32GB RAM

Code profiling

Running the script with 16GB of RAM is barely possible. We recommend 32GB to avoid swapping during memory spikes.

For reference, running all the diseases under a single repeated CV scheme (25 repetitions, 3 folds per repetition) takes one week on eko; sun is twice as fast. The code mixes serial and parallel execution because not all the methods run in parallel.

The computationally intensive code was also run on a Torque-based cluster, but the parallel R package (part of base R) was unable to clean up its child processes. This led to memory exhaustion, which made the cluster runs infeasible. Alternatives that tackle this while keeping reproducibility might be added in the future.
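For reference, one common pattern for making worker cleanup explicit with the parallel package is sketched below. It is an illustration only: the wrapper name run_parallel and the number of workers are hypothetical, it has not been tried on the cluster described above, and whether it would avoid the memory exhaustion there is unknown.

```r
library(parallel)

# Hypothetical wrapper: run fun over x on a PSOCK cluster and always stop the workers
run_parallel <- function(x, fun, n_workers = 2) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)  # runs even if parLapply() fails
  parLapply(cl, x, fun)
}

squares <- run_parallel(1:4, function(i) i^2)
```

Registering stopCluster() with on.exit() ties the lifetime of the workers to the function call, so they are shut down on both normal return and error.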

