scbirlab/nf-ggi is a Nextflow pipeline to screen gene-gene interactions from a specified organism (or set of organisms).
Table of contents
- Processing steps
- Requirements
- Quick start
- Inputs
- Outputs
- Credit
- Issues, problems, suggestions
- Further help
## Processing steps

- Download Rhea DB in preparation for searching.
For each organism ID provided:
- Download its STRING database and tidy the data.
- Download FASTA sequences of its proteins from UniProt.
  - Where possible, reference proteomes are used.
- Find reactions in Rhea DB and connect products with reactants between enzymes in the proteome.
For each FASTA sequence in each organism:
- Generate a multiple sequence alignment with `hhblits`.
Then within each organism:
- Generate all unique pairs of proteins.
Then for each protein pair:
- Calculate the co-evolutionary signal with DCA.
- Predict the interface contact map with `yunta rf2t` (RoseTTAFold-2track).
- Predict the protein-protein complex structure with `yunta af2` (AlphaFold2).
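The pairing step above can be sketched in shell. This is an illustrative sketch, not pipeline code; the protein IDs are made up:

```shell
# Illustrative sketch (not pipeline code): emit every unordered pair of
# proteins exactly once, as the unique-pairing step does.
printf '%s\n' P47259 P47260 P47261 > ids.txt
awk '{ a[NR] = $0 }
     END { for (i = 1; i < NR; i++)
             for (j = i + 1; j <= NR; j++)
               print a[i] "," a[j] }' ids.txt > pairs.csv
cat pairs.csv
```

Three proteins give 3 pairs; in general n proteins give n(n-1)/2, which is why downstream jobs are grouped with `batch_size`.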
## Requirements

To generate multiple-sequence alignments (MSAs) for co-evolutionary analysis, `hhblits` databases of pre-clustered sequences are required. Unfortunately, these are extremely large, so they cannot be downloaded as part of the pipeline. You should download the UniClust and BFD databases, then set the `--uniclust` and `--bfd` parameters of the pipeline (see below).

If you're at the Crick, these databases already reside on NEMO.
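Once downloaded, you can point the pipeline at the databases in your `nextflow.config`. The paths below are placeholders for wherever you unpacked the databases on your system:

```groovy
params {
    uniclust = "/path/to/uniclust"
    bfd = "/path/to/bfd"
}
```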
You need to have Nextflow and `conda` installed on your system.
If you're at the Crick or your shared cluster has it already installed, try:

```bash
module load Nextflow
```
Otherwise, if it's your first time using Nextflow on your system, you can install it using `conda`:

```bash
conda install -c bioconda nextflow
```
You may need to set the `NXF_HOME` environment variable. For example:

```bash
mkdir -p ~/.nextflow
export NXF_HOME=~/.nextflow
```
To make this a permanent change, you can do something like the following:

```bash
mkdir -p ~/.nextflow
echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile
source ~/.bash_profile
```
## Quick start

Make a sample sheet (see below) and, optionally, a `nextflow.config` file in the directory where you want the pipeline to run. Then run Nextflow.

```bash
nextflow run scbirlab/nf-ggi
```
Each time you run the pipeline after the first time, Nextflow will use a locally cached version which will not be automatically updated. If you want to ensure that you're using the very latest version of the pipeline, use the `-latest` flag.

```bash
nextflow run scbirlab/nf-ggi -latest
```
If you want to run a particular tagged version of the pipeline, such as `v0.0.2`, you can do so using

```bash
nextflow run scbirlab/nf-ggi -r v0.0.2
```
For help, use `nextflow run scbirlab/nf-ggi --help`.
The first time you run the pipeline on your system, the software dependencies in `environment.yml` will be installed. This may take several minutes.
## Inputs

The following parameters are required:

- `sample_sheet`: path to a CSV with information about the organisms to be processed
- `uniclust`: path to the `hhblits` UniClust database. This is very large, so you need to have it already downloaded on your system.
- `bfd`: path to the `hhblits` BFD database. This is very large, so you need to have it already downloaded on your system.
The following parameters have default values which can be overridden if necessary.

- `rhea_url = "https://ftp.expasy.org/databases/rhea"`: URL to download the Rhea reaction database
- `outputs = "outputs"`: output folder
- `batch_size = 100`: how many protein-protein interactions to group into one job at a time
- `test = false`: whether to run in test mode. If so, only 3 proteins per organism will be analyzed.
- `non_self = false`: whether to run in non-self mode, where a whole proteome is run against a single bait protein (rather than all pairwise combinations from the proteome)
The parameters can be provided either in the `nextflow.config` file or on the `nextflow run` command line.
Here is an example of the `nextflow.config` file:

```groovy
params {
    sample_sheet = "/path/to/sample-sheet.csv"
}
```
Alternatively, you can provide the parameters on the command line:
```bash
nextflow run scbirlab/nf-ggi --sample_sheet /path/to/sample-sheet.csv
```
The sample sheet is a CSV file indicating which organisms you want to analyze.
The file must have a header with the column names below, and one line per organism to be processed.
- `organism_id`: the NCBI Taxonomy ID for your organism. This can be found at NCBI Taxonomy.
- `proteome_name`: this can be anything, but should be an informative description.
Here is an example of the sample sheet:
| organism_id | proteome_name |
|---|---|
| 243273 | "Mycoplasma genitalium" |
If running with `--non_self`, to do a pulldown against a single bait protein, add another column with the bait UniProt ID.
| organism_id | proteome_name | bait |
|---|---|---|
| 243273 | "Mycoplasma genitalium" | P47259 |
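On disk, the sample sheet is a plain CSV file. As a sketch, the non-self example above could be written like so (the filename is a placeholder for whatever you pass to `--sample_sheet`):

```shell
# Write the example sample sheet as plain CSV.
cat > sample-sheet.csv <<'EOF'
organism_id,proteome_name,bait
243273,"Mycoplasma genitalium",P47259
EOF
cat sample-sheet.csv
```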
## Outputs

Outputs are saved in the same directory as `sample_sheet`. They are organised under five directories:

- `coexpression`: STRING co-expression values
- `metabolites`: reconstructed metabolic network
- `msa`: all MSA files
- `ppi`: all protein-protein interaction data
- `sequences`: each organism's proteome sequence
## Credit

The idea of using DCA, RoseTTAFold-2track, and AlphaFold2 in a cascade of increasingly expensive and specific PPI detection methods has been explored in a series of papers from David Baker's lab:
- Cong et al., Protein interaction networks revealed by proteome coevolution. Science, 2019
- Humphreys et al., Computed structures of core eukaryotic protein complexes. Science, 2021
- Humphreys et al., Protein interactions in human pathogens revealed through deep learning. Nature Microbiology, 2024
`scbirlab/nf-ggi` applies these algorithms in a Nextflow pipeline to allow easy scaling. It also reconstructs metabolic networks, and pulls known interactions from the STRING database.
## Issues, problems, suggestions

Add to the issue tracker.
## Further help

Here are the pages of the software and databases used by this pipeline.
Databases:
- STRING for co-expression
- Rhea for enzyme reactions
- UniProt for protein sequences
- NCBI Taxonomy for taxonomy IDs
Software: