scbirlab/nf-ggi is a Nextflow pipeline to screen gene-gene interactions from a specified organism (or set of organisms).
Table of contents
- Processing steps
- Requirements
- Quick start
- Inputs
- Outputs
- Credit
- Issues, problems, suggestions
- Further help
## Processing steps

- Download Rhea DB in preparation for searching.
For each organism ID provided:
- Download its STRING database and tidy the data.
- Download FASTA sequences of its proteins from UniProt.
  - Where possible, reference proteomes are used.
- Find reactions in Rhea DB and connect products with reactants between enzymes in the proteome.
For each FASTA sequence in each organism:
- Generate a multiple sequence alignment with `hhblits`.
Then within each organism:
- Generate all unique pairs of proteins.
Then for each protein pair:
- Calculate the co-evolutionary signal with DCA.
- Predict the interface contact map with `yunta rf2t` (RoseTTAFold-2track).
- Predict the protein-protein complex structure with `yunta af2` (AlphaFold2).
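The pairing step above can be sketched in shell. This is an illustrative sketch, not pipeline code; the protein IDs are made up:

```shell
# Illustrative sketch (not pipeline code): emit every unordered pair of
# proteins exactly once, as the unique-pairing step does.
printf '%s\n' P47259 P47260 P47261 > ids.txt
awk '{ a[NR] = $0 }
     END { for (i = 1; i < NR; i++)
             for (j = i + 1; j <= NR; j++)
               print a[i] "," a[j] }' ids.txt > pairs.csv
cat pairs.csv
```

Three proteins give 3 pairs; in general n proteins give n(n-1)/2, which is why downstream jobs are grouped with `batch_size`.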
## Requirements

To generate multiple-sequence alignments (MSAs) for co-evolutionary analysis, `hhblits` databases of pre-clustered sequences are required. Unfortunately, these are extremely large, so they cannot be downloaded as part of the pipeline. You should download the UniClust and BFD databases, then set the `--uniclust` and `--bfd` parameters of the pipeline (see below).

If you're at the Crick, these databases already reside on NEMO.
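Once downloaded, you can point the pipeline at the databases in your `nextflow.config`. The paths below are placeholders for wherever you unpacked the databases on your system:

```groovy
params {
    uniclust = "/path/to/uniclust"
    bfd = "/path/to/bfd"
}
```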
You need to have Nextflow and `conda` installed on your system.
If you're at the Crick or your shared cluster has it already installed, try:

```bash
module load Nextflow
```
Otherwise, if it's your first time using Nextflow on your system, you can install it using `conda`:

```bash
conda install -c bioconda nextflow
```
You may need to set the `NXF_HOME` environment variable. For example:

```bash
mkdir -p ~/.nextflow
export NXF_HOME=~/.nextflow
```
To make this a permanent change, you can do something like the following:

```bash
mkdir -p ~/.nextflow
echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile
source ~/.bash_profile
```
## Quick start

Make a sample sheet (see below) and, optionally, a `nextflow.config` file in the directory where you want the pipeline to run. Then run Nextflow.

```bash
nextflow run scbirlab/nf-ggi
```
Each time you run the pipeline after the first time, Nextflow will use a locally cached version which will not be automatically updated. If you want to ensure that you're using the very latest version of the pipeline, use the `-latest` flag.

```bash
nextflow run scbirlab/nf-ggi -latest
```
If you want to run a particular tagged version of the pipeline, such as `v0.0.2`, you can do so using

```bash
nextflow run scbirlab/nf-ggi -r v0.0.2
```
For help, use `nextflow run scbirlab/nf-ggi --help`.
The first time you run the pipeline on your system, the software dependencies in `environment.yml` will be installed. This may take several minutes.
## Inputs

The following parameters are required:

- `sample_sheet`: path to a CSV with information about the organisms to be processed
- `uniclust`: path to the `hhblits` UniClust database. This is very large, so you need to have it already downloaded on your system.
- `bfd`: path to the `hhblits` BFD database. This is very large, so you need to have it already downloaded on your system.
The following parameters have default values which can be overridden if necessary.

- `rhea_url = "https://ftp.expasy.org/databases/rhea"`: URL to download the Rhea reaction database
- `outputs = "outputs"`: output folder
- `batch_size = 100`: how many protein-protein interactions to group into one job at a time
- `test = false`: whether to run in test mode. If so, only 3 proteins per organism will be analyzed.
- `non_self = false`: whether to run in non-self mode, where a whole proteome is run against a single bait protein (rather than all pairwise combinations from the proteome)
The parameters can be provided either in the `nextflow.config` file or on the `nextflow run` command line.
Here is an example of the `nextflow.config` file:

```groovy
params {
    sample_sheet = "/path/to/sample-sheet.csv"
}
```
Alternatively, you can provide the parameters on the command line:
```bash
nextflow run scbirlab/nf-ggi --sample_sheet /path/to/sample-sheet.csv
```
The sample sheet is a CSV file indicating which organisms you want to analyze.
The file must have a header with the column names below, and one line per organism to be processed.
- `organism_id`: the NCBI Taxonomy ID for your organism. This can be found at NCBI Taxonomy.
- `proteome_name`: this can be anything, but should be an informative description.
Here is an example of the sample sheet:
| organism_id | proteome_name |
|---|---|
| 243273 | "Mycoplasma genitalium" |
If running with `--non_self`, to do a pulldown against a single bait protein, add another column with the bait UniProt ID.
| organism_id | proteome_name | bait |
|---|---|---|
| 243273 | "Mycoplasma genitalium" | P47259 |
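On disk, the sample sheet is a plain CSV file. As a sketch, the non-self example above could be written like so (the filename is a placeholder for whatever you pass to `--sample_sheet`):

```shell
# Write the example sample sheet as plain CSV.
cat > sample-sheet.csv <<'EOF'
organism_id,proteome_name,bait
243273,"Mycoplasma genitalium",P47259
EOF
cat sample-sheet.csv
```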
## Outputs

Outputs are saved in the same directory as `sample_sheet`. They are organised under five directories:

- `coexpression`: STRING co-expression values
- `metabolites`: reconstructed metabolic network
- `msa`: all MSA files
- `ppi`: all protein-protein interaction data
- `sequences`: each organism's proteome sequence
## Credit

The idea of using DCA, RoseTTAFold-2track, and AlphaFold2 in a cascade of increasingly expensive and specific PPI detection methods has been explored in a series of papers from David Baker's lab:
- Cong et al., Protein interaction networks revealed by proteome coevolution. Science, 2019
- Humphreys et al., Computed structures of core eukaryotic protein complexes. Science, 2021
- Humphreys et al., Protein interactions in human pathogens revealed through deep learning. Nature Microbiology, 2024
`scbirlab/nf-ggi` applies these algorithms in a Nextflow pipeline to allow easy scaling. It also reconstructs metabolic networks, and pulls known interactions from the STRING database.
## Issues, problems, suggestions

Add to the issue tracker.
## Further help

Here are the pages of the software and databases used by this pipeline.
Databases:
- STRING for co-expression
- Rhea for enzyme reactions
- UniProt for protein sequences
- NCBI Taxonomy for taxonomy IDs
Software: