From 88f7c57cffdae0aea54e3c1b21bb0b85f3ad9e15 Mon Sep 17 00:00:00 2001 From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com> Date: Fri, 1 Sep 2023 15:04:15 +1200 Subject: [PATCH 01/12] Aesthetics ex10 --- docs/day3/ex10_viruses.md | 145 +++++++++++++++++++++----------------- 1 file changed, 81 insertions(+), 64 deletions(-) diff --git a/docs/day3/ex10_viruses.md b/docs/day3/ex10_viruses.md index eaa8de06..7793aba9 100644 --- a/docs/day3/ex10_viruses.md +++ b/docs/day3/ex10_viruses.md @@ -2,22 +2,22 @@ !!! info "Objectives" - * [Identifying viral contigs using *VirSorter2*](#identifying-viral-contigs-using-VirSorter2) - * [Check quality and estimate completeness of the viral contigs via *CheckV*](#check-quality-and-estimate-completeness-of-the-viral-contigs-via-checkv) - * [Introduction to *vContact2* for predicting taxonomy of viral contigs](#introduction-to-vcontact2-for-predicting-taxonomy-of-viral-contigs) - * [OPTIONAL: Visualise the vcontact2 gene-sharing network in *Cytoscape*](#optional-visualise-the-vcontact2-gene-sharing-network-in-cytoscape) + * [Identifying viral contigs using `VirSorter2`](#identifying-viral-contigs-using-VirSorter2) + * [Check quality and estimate completeness of the viral contigs via `CheckV`](#check-quality-and-estimate-completeness-of-the-viral-contigs-via-checkv) + * [Introduction to `vConTACT2` for predicting taxonomy of viral contigs](#introduction-to-vcontact2-for-predicting-taxonomy-of-viral-contigs) + * [OPTIONAL: Visualise the `vConTACT2` gene-sharing network in `Cytoscape`](#optional-visualise-the-vcontact2-gene-sharing-network-in-cytoscape) --- -### Identifying viral contigs +## Identifying viral contigs Viral metagenomics is a rapidly progressing field, and new software are constantly being developed and released each year that aim to better identify and characterise viral genomic sequences from assembled metagenomic sequence reads. -Currently, the most commonly used methods are [VirSorter2](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00990-y), [VIBRANT](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00867-0), and [VirFinder](https://link.springer.com/epdf/10.1186/s40168-017-0283-5?author_access_token=YQgkTWibFIFPtRICkTjZF2_BpE1tBhCbnbw3BuzI2RMCpVMGldKV8DA9scozc7Z-db3ufPFz9-pswHsYVHyEsCrziBuECllLPOgZ6ANHsMeKF5KejrdDKdeASyDkxB5wfFDq523QSd01cnqxCLqCiQ%3D%3D) (or the machine learning implementation of this, [DeepVirFinder](https://github.com/jessieren/DeepVirFinder)). A number of recent studies use one of these tools or a combination of several at once. +Currently, the most commonly used methods are [`VirSorter2`](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00990-y), [`VIBRANT`](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00867-0), and [`VirFinder`](https://link.springer.com/epdf/10.1186/s40168-017-0283-5?author_access_token=YQgkTWibFIFPtRICkTjZF2_BpE1tBhCbnbw3BuzI2RMCpVMGldKV8DA9scozc7Z-db3ufPFz9-pswHsYVHyEsCrziBuECllLPOgZ6ANHsMeKF5KejrdDKdeASyDkxB5wfFDq523QSd01cnqxCLqCiQ%3D%3D) (or the machine learning implementation of this, [`DeepVirFinder`](https://github.com/jessieren/DeepVirFinder)). A number of recent studies use one of these tools or a combination of several at once. !!! info "" - === "VirSorter2" + === "`VirSorter2`" Uses a predicted protein homology reference database-based approach, together with searching for a number of pre-defined metrics based on known viral genomic features. 
`VirSorter2` includes dsDNAphage, ssDNA, and RNA viruses, and the viral groups Nucleocytoviricota and lavidaviridae.* @@ -26,7 +26,7 @@ Currently, the most commonly used methods are [VirSorter2](https://microbiomejou - [VirSorter2 GitHub](https://github.com/jiarong/VirSorter2) - [Guo *et al.* (2021) VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00990-y) - === "VIBRANT" + === "`VIBRANT`" Uses a machine learning approach based on protein similarity (non-reference-based similarity searches with multiple HMM sets), and is in principle applicable to bacterial and archaeal DNA and RNA viruses, integrated proviruses (which are excised from contigs by `VIBRANT`), and eukaryotic viruses. @@ -35,7 +35,7 @@ Currently, the most commonly used methods are [VirSorter2](https://microbiomejou - [VIBRANT GitHub](https://github.com/AnantharamanLab/VIBRANT) - [Kieft, Zhou, and Anantharaman (2020) VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00867-0) - === "DeepVirFinder" + === "`DeepVirFinder`" Uses a machine learning based approach based on *k*-mer frequencies. Having developed a database of the differences in *k*-mer frequencies between prokaryote and viral genomes, *VirFinder* examines assembled contigs and identifies whether their *k*-mer frequencies are comparable to known viruses in the database, using this to predict viral genomic sequence. This method has some limitations based on the viruses that were included when building the database (bacterial DNA viruses, but very few archaeal viruses, and, at least in some versions of the software, no eukaryotic viruses). However, tools are also provided to build your own database should you wish to develop an expanded one. Due to its distinctive *k*-mer frequency-based approach, *VirFinder* may also have the capability of identifying some novel viruses overlooked by tools such as *VIBRANT* or *VirSorter*. @@ -46,41 +46,36 @@ Currently, the most commonly used methods are [VirSorter2](https://microbiomejou --- -### Identifying viral contigs using *VirSorter2* +## Identifying viral contigs using `VirSorter2` -For this exercise, we will use *VirSorter2* to identify viral contigs from our assembled contigs. We can also use *VirSorter2* to prepare files for later use with the gene annotation tool *DRAM-v*, which we'll run later in the day. +For this exercise, we will use `VirSorter2` to identify viral contigs from our assembled contigs. We can also use `VirSorter2` to prepare files for later use with the gene annotation tool `DRAM-v`, which we'll run later in the day. -### Check quality and estimate completeness of the viral contigs via *CheckV* +## Check quality and estimate completeness of the viral contigs via `CheckV` -[*CheckV*](https://www.biorxiv.org/content/10.1101/2020.05.06.081778v1.abstract) was developed as an analogue to *CheckM*. *CheckV* first performs a 'contaminating sequence' trim, removing any retained (prokaryote) host sequence on the end of contigs with integrated prophage, and then assesses the quality and completeness of the assembled viral contigs. 
The quality of the contigs are also categoriesed based on the recently developed [Minimum Information about an Unclutivated Virus Genome](https://www.nature.com/articles/nbt.4306) (MIUViG) standards for reporting sequences of unclutivated virus geneomes (such as those recovered from metagenomic sequencing data). The MIUViG were developed as an extension of the [Minimum Information about any (x) Sequence](https://www.nature.com/articles/nbt.1823) ([MIxS](https://gensc.org/mixs/)) standards, which include, among others, standards for Metagenome-Assembled Genomes (MIMAG).
+[`CheckV`](https://www.biorxiv.org/content/10.1101/2020.05.06.081778v1.abstract) was developed as an analogue to `CheckM`. `CheckV` first performs a 'contaminating sequence' trim, removing any retained (prokaryote) host sequence on the end of contigs with integrated prophage, and then assesses the quality and completeness of the assembled viral contigs. The quality of the contigs is also categorised based on the recently developed [Minimum Information about an Uncultivated Virus Genome](https://www.nature.com/articles/nbt.4306) (MIUViG) standards for reporting sequences of uncultivated virus genomes (such as those recovered from metagenomic sequencing data). The MIUViG were developed as an extension of the [Minimum Information about any (x) Sequence](https://www.nature.com/articles/nbt.1823) ([MIxS](https://gensc.org/mixs/)) standards, which include, among others, standards for Metagenome-Assembled Genomes (MIMAG).

-#### Run *VirSorter2* and *CheckV*
+### Run `VirSorter2` and `CheckV`

These exercises will take place in the `7.viruses/` folder.

-```bash
-cd /nesi/nobackup/nesi02659/MGSS_U//7.viruses/
-```
+!!! terminal-2 "Navigate to working directory"

-For *VirSorter2*, we will input the assembled contigs from the *SPAdes* assembly we performed earlier. These assembly files have been copied to `7.viruses/spades_assembly/` for this exercise.
-
-We will then run *CheckV* in the same script, providing the *fasta* file of viral contigs output by *VirSorter2* as input (`final-viral-combined.fa`).
-
-*NOTE: For the `VirSorter/2.2.3-gimkl-2020a-Python-3.8.2` NeSI module to work properly, **we must also include `module unload XALT`** in the script below*

+    ```bash
+    cd /nesi/nobackup/nesi02659/MGSS_U//7.viruses/
+    ```

-*NOTE: The key parameters you may want to consider altering for your own work are `--min-score` and `--include-groups`. For today's excersice we will include all available groups (`--include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae`), and will set the min-score to 0.7. You can expiriment with this value for your own data (see the [Virsorter2 github page](https://github.com/jiarong/VirSorter2) for more information).*

+For `VirSorter2`, we will input the assembled contigs from the `SPAdes` assembly we performed earlier. These assembly files have been copied to `7.viruses/spades_assembly/` for this exercise.

-*NOTE: The required databases for *VirSorter2* are not loaded with the NeSI module. For your own work, you will need to first download these databases and provide the path to the `-d` flag below. For today's workshop this is already set up.*

+We will then run `CheckV` in the same script, providing the FASTA file of viral contigs output by `VirSorter2` as input (`final-viral-combined.fa`).

-Create a new script
+!!! terminal-2 "Create a script named `VirSorter2_and_checkv.sl`"

-```bash
-nano VirSorter2_and_checkv.sl
-```
+    ```bash
+    nano VirSorter2_and_checkv.sl
+    ```

-!!! warning "Warning"
warning "Warning" + ```bash + nano VirSorter2_and_checkv.sl + ``` - Paste in the following (updating ``) +!!! warning "Remember to update `` to your own folder." !!! terminal "code" @@ -129,25 +124,37 @@ nano VirSorter2_and_checkv.sl checkv end_to_end VirSorter2/mgss-final-viral-combined.fa checkv_out/ -t $SLURM_CPUS_PER_TASK ``` +!!! warning "`module unload XALT`" + + For the `VirSorter/2.2.3-gimkl-2020a-Python-3.8.2` NeSI module to work properly, **we must also include `module unload XALT`** in the script above. + +!!! tip "`VirSorter2` parameters" + + The key parameters you may want to consider altering for your own work are `--min-score` and `--include-groups`. For today's excersice we will include all available groups (`--include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae`), and will set the min-score to 0.7. You can expiriment with this value for your own data (see the [Virsorter2 github page](https://github.com/jiarong/VirSorter2) for more information). + !!! terminal-2 "Submit the script as a slurm job" ```bash sbatch VirSorter2_and_checkv.sl ``` -#### Outputs of *VirSorter2* and *CheckV* +!!! note "`VirSorter2` for your own work" -Key outputs from *VirSorter2* include: + The required databases for `VirSorter2` are not loaded with the NeSI module. For your own work, you will need to first download these databases and provide the path to the `-d` flag below. For today's workshop this is already set up. + +### Outputs of `VirSorter2` and `CheckV` + +Key outputs from `VirSorter2` include: - `mgss-final-viral-combined.fa`: FASTA file of identified viral sequences - `mgss-final-viral-score.tsv`: table with score of each viral sequences across groups and a few more key features, which can also be used for further filtering -- `mgss-for-dramv/`: files to be used as input to *DRAM-v* for gene prediction and annotation (we will be running *DRAM-v* later today during the gene annotation session) +- `mgss-for-dramv/`: files to be used as input to `DRAM-v` for gene prediction and annotation (we will be running `DRAM-v` later today during the gene annotation session) -*CheckV* provides summary outputs for contamination, completeness, repeats, and an overall quality summary. Later today we will have a brief look at some examples of the information you can draw from these *CheckV* outputs. +`CheckV` provides summary outputs for contamination, completeness, repeats, and an overall quality summary. Later today we will have a brief look at some examples of the information you can draw from these `CheckV` outputs. -### Exercise: Examine viral output files from *VirSorter2* and *CheckV* +## Exercise: Examine viral output files from `VirSorter2` and `CheckV` -*VirSorter2* and *CheckV* provide several of different output files that are important for identifying and understanding the viruses present in your data. Explore through the following files: +`VirSorter2` and `CheckV` provide several of different output files that are important for identifying and understanding the viruses present in your data. Explore through the following files: - `7.viruses/VirSorter2/mgss-final-viral-score.tsv` - `7.viruses/checkv_out/quality_summary.tsv` @@ -156,35 +163,39 @@ When viewing these files, see if you can find the following information: !!! quote "" - * How many viral contigs did *VirSorter2* identify? + * How many viral contigs did `VirSorter2` identify? * How many viral contigs meet the "High-quality" (MIUViG) standard? 
- * How many might we consider "complete" genomes based on *CheckV*'s completeness estimation? - * Are any of the identified viral contigs complete *cirular* genomes (based on identifying direct terminal repeat regions on both ends of the genome)? If not, think about why this might be the case for this dataset (hint: the workshop materials are a manufactured "metagenome" data set based on compiling several individual genomes) - * Are there any suspicious contigs that you might want to flag for closer examination (and/or careful consideration in downstream analyses)? (Note that standard practice would be to use these *CheckV* results as one basis for filtering to remove potential false positives) + * How many might we consider "complete" genomes based on `CheckV`'s completeness estimation? + * Are any of the identified viral contigs complete *circular* genomes (based on identifying direct terminal repeat regions on both ends of the genome)? If not, think about why this might be the case for this dataset (hint: the workshop materials are a manufactured "metagenome" data set based on compiling several individual genomes) + * Are there any suspicious contigs that you might want to flag for closer examination (and/or careful consideration in downstream analyses)? (Note that standard practice would be to use these `CheckV` results as one basis for filtering to remove potential false positives) --- -### Introduction to *vContact2* for predicting taxonomy of viral contigs +## Introduction to `vConTACT2` for predicting taxonomy of viral contigs Even more so than prokaryote taxonomy, establishing a coherent system for viral taxonomy is complex and continues to evolve. In 2020, the International Committee on Taxonomy of Viruses ([ICTV](https://talk.ictvonline.org/)) overhauled the classification code into [15 hierarchical ranks](https://www.nature.com/articles/s41564-020-0709-x). Furthermore, the knowledge gap in databases of known and taxonomically assigned viruses remains substantial, and so identifying the putative taxonomy of viral contigs from environmental metagenomics data remains challenging. -There are a number of approaches that can be used to attempt to predict the taxonomy of the set of putative viral contigs output by programs such as *VIBRANT*, *VirSorter*, and *VirFinder*. [*vContact2*](https://www.nature.com/articles/s41587-019-0100-8) is one such method that uses 'guilt-by-contig-association' to predict the potential taxonomy of viral genomic sequence data based on relatedness to known viruses within a reference database (such as viral RefSeq). The principle is that, to the extent that the 'unknown' viral contigs cluster closely with known viral genomes, we can then expect that they are closely related enough to be able to predict a shared taxonomic rank. +There are a number of approaches that can be used to attempt to predict the taxonomy of the set of putative viral contigs output by programs such as `VIBRANT`, `VirSorter`, and `VirFinder`. [`vConTACT2`](https://www.nature.com/articles/s41587-019-0100-8) is one such method that uses 'guilt-by-contig-association' to predict the potential taxonomy of viral genomic sequence data based on relatedness to known viruses within a reference database (such as viral RefSeq). The principle is that, to the extent that the 'unknown' viral contigs cluster closely with known viral genomes, we can then expect that they are closely related enough to be able to predict a shared taxonomic rank. !!! 
note "Note" - Anecdotally, however, in my own experience with this process I have unfortunately been unable to directly predict the taxonomy of the vast majority of the viral contigs ouput by *VIBRANT*, *VirSorter*, or *VirFinder* from an environmental metagenomic data set (due to not clustering closely enough with known viruses in the reference database). You can, however, visualise the gene-sharing network generated to infer the *likely* taxonomy of each of your viruses at higher taxonomic ranks due to the relatedness to known reference viral genomes. + Anecdotally, however, in my own experience with this process I have unfortunately been unable to directly predict the taxonomy of the vast majority of the viral contigs output by `VIBRANT`, `VirSorter`, or `VirFinder` from an environmental metagenomic data set (due to not clustering closely enough with known viruses in the reference database). You can, however, visualise the gene-sharing network generated to infer the *likely* taxonomy of each of your viruses at higher taxonomic ranks due to the relatedness to known reference viral genomes. -Running *vContact2* can require a considerable amount of computational resources, and so we won't be running this in the workshop today. The required process is outlined for reference in an [Appendix for this exercise](../resources/4_APPENDIX_ex11_viral_taxonomy_prediction_via_vContact2.md), should you wish to experiment with this on your own data in the future. +Running `vConTACT2` can require a considerable amount of computational resources, and so we won't be running this in the workshop today. The required process is outlined for reference in an [Appendix for this exercise](../resources/4_APPENDIX_ex11_viral_taxonomy_prediction_via_vContact2.md), should you wish to experiment with this on your own data in the future. For today, we have provided some of the output files from this process when applied to our mock metagenome data. A selection of these can be viewed in the folder `7.viruses/vConTACT2_Results/` via `head` or `less`. -```bash -less vConTACT2_Results/genome_by_genome_overview.csv -``` +!!! terminal "code" + + ```bash + less vConTACT2_Results/genome_by_genome_overview.csv + ``` -```bash -less vConTACT2_Results/tax_predict_table.tsv -``` +!!! terminal "code" + + ```bash + less vConTACT2_Results/tax_predict_table.tsv + ``` A few notes to consider: @@ -204,29 +215,33 @@ A few notes to consider: --- -### OPTIONAL: Visualise the *vContact2* gene-sharing network in *cytoscape* +## *(Optional)* Visualise the `vConTACT2` gene-sharing network in `Cytoscape` -We can visualise the gene-sharing network generated by *vContact2* (`c1.ntw `) using the software *Cytoscape*. *Cytoscape* runs as a GUI (graphical user interface), so we will need to either download and install this software or open *Cytoscape* using NeSI's Virtual Desktop. To open in the Virtual Desktop, click *New Laucher* in Jupyterhub and lauch *Virtual Desktop*. This will open a new tab. In the new desktop, open a terminal, and then load and run the *Cytoscape* module as per below. +We can visualise the gene-sharing network generated by `vConTACT2` (`c1.ntw `) using the software `Cytoscape`. `Cytoscape` runs as a GUI (graphical user interface), so we will need to either download and install this software or open `Cytoscape` using NeSI's Virtual Desktop (instructions can be found [here](../day2/ex9_refining_bins.md#initiate-vizbin-within-the-virtual-desktop-environment)). 
With our Virtual Desktop open, `Cytoscape` can then be loaded as follows. !!! warning "Copy/paste in the Virtual Desktop" You will not be able to copy text from outside the Virtual Desktop and paste into the Virtual Desktop, in which case you will need to manually type these commands. -```bash -# Load the module -module purge -module load Cytoscape/3.9.1 -# Run cytoscape -Cytoscape -``` +!!! note "Open a terminal in Virtual Desktop prior to running code below" -!!! warning "Do not update Cytoscape!" +!!! terminal "code" + + ```bash + # Load the module + module purge + module load Cytoscape/3.9.1 + # Run cytoscape + Cytoscape + ``` + +!!! warning "Do not update `Cytoscape`!" - A dialog box will appear telling you about a new version of Cytoscape. **Click "close"**, as we will not be installing any new versions today! + A dialog box will appear telling you about a new version of `Cytoscape`. **Click "close"**, as we will not be installing any new versions today! -#### Load the network +### Load the network -1. With Cytoscape open, click on *File* $\rightarrow$ *Import* $\rightarrow$ *Network from file* +1. With `Cytoscape` open, click on *File* $\rightarrow$ *Import* $\rightarrow$ *Network from file* 2. Open the `c1.ntw` file by (a) typing in the absolute path in the *File Name* box: `/nesi/nobackup/nesi02659/MGSS_U//7.viruses/vConTACT2_Results/c1.ntw` or (b) navigate to the file using the GUI. 3. In the *Import Network From Table* pop-up box: 1. Click on *Advanced Options* @@ -244,11 +259,13 @@ Cytoscape It will now ask if you want to create a view for your large networks now. Click *OK*. This may take a minute to generate the network visualisation. +### Annotate the network + There are many ways to modify and inspect this visualisation. One basic addition that will help us interpret the results here is to colour code the viral genomes based on reference (RefSeq) genomes and the viral contigs recovered from our dataset. We can do this by loading the `genome_by_genome_overview.csv` file. !!! note "`Dataset` column" - For the purposes of this workshop, we have added the additional column `Dataset` to the `genome_by_genome_overview.csv` file stating whether each viral sequence originated from either the reference database (`RefSeq`) or our own data (`Our_data`). This column is not generated by vConTACT2, but you can open the file in Excel to add any additional columns you would like to colour code the nodes by. + For the purposes of this workshop, we have added the additional column `Dataset` to the `genome_by_genome_overview.csv` file stating whether each viral sequence originated from either the reference database (`RefSeq`) or our own data (`Our_data`). This column is not generated by `vConTACT2`, but you can open the file in Excel to add any additional columns you would like to colour code the nodes by. Click `File/Import/Table from file` and select the `genome_by_genome_overview.csv` file to open. In the pop-up box, leave the settings as default and click *OK*. This will add a bunch of metadata to your network table. We can now use this to colour code the nodes in the network. 
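If you would like to check the `Dataset` column from the command line before styling the network, a quick tally of its values is one way to confirm the table loaded as expected. The snippet below is a minimal sketch: it assumes the added `Dataset` column is the final comma-separated field of `genome_by_genome_overview.csv` and that no fields contain embedded commas, so adjust the field index to match your own copy of the file.

!!! terminal "code"

    ```bash
    # Tally how many nodes come from the reference database (RefSeq)
    # versus our own contigs (Our_data); skip the header line first
    # (assumes 'Dataset' is the last field and no field contains a comma)
    tail -n +2 vConTACT2_Results/genome_by_genome_overview.csv \
        | awk -F ',' '{count[$NF]++} END {for (v in count) print v, count[v]}'
    ```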
From 28b5ae52f03d5ffa323ec4dfeabe1bc132aa6e79 Mon Sep 17 00:00:00 2001
From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com>
Date: Fri, 1 Sep 2023 15:08:46 +1200
Subject: [PATCH 02/12] Aesthetics ex11

---
 docs/day3/ex11_coverage_and_taxonomy.md | 37 +++++++++++++------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/docs/day3/ex11_coverage_and_taxonomy.md b/docs/day3/ex11_coverage_and_taxonomy.md
index 58b22bf3..6f790ad6 100644
--- a/docs/day3/ex11_coverage_and_taxonomy.md
+++ b/docs/day3/ex11_coverage_and_taxonomy.md
@@ -6,7 +6,7 @@

---

-### Assign taxonomy to the refined bins
+## Assign taxonomy to the refined bins

It is always valuable to know the taxonomy of our binned MAGs, so that we can link them to the wider scientific literature. In order to do this, there are a few different options available to us:

@@ -33,12 +33,13 @@

For the following exercises, we will be working in `8.prokaryotic_taxonomy/`.

Create a new script

-```bash
-nano gtdbtk.sl
-```
-!!! warning "Warning"

-    Paste in the script (replacing ``)

+!!! terminal-2 "Create script named `gtdbtk.sl`"

+    ```bash
+    nano gtdbtk.sl
+    ```

+!!! warning "Remember to update `` to your own folder"

!!! terminal "code"

@@ -67,26 +68,26 @@ nano gtdbtk.sl
         --out_dir gtdbtk_out/
     ```

-Submit the script
+!!! terminal-2 "Submit the script"

-```bash
-sbatch gtdbtk.sl
-```
+    ```bash
+    sbatch gtdbtk.sl
+    ```

As usual, let's look at the parameters here

-|Parameter|Function|
+|Parameter|Description|
|:---|:---|
-|**classify_wf**|Specifies the sub-workflow from `GTDB-TK` that we wish to use|
-|**-x ...**|Specify the file extension for MAGs within our input directory. <br> Default is *.fna*, but it's always good practice to specify it anyway|
-|**--cpus ...**|Number of threads/CPUs to use when finding marker genes, and performing tree insertion operations|
-|**--keep_intermediates**|Keep intermediate outputs|
-|**--genome_dir ...**|Input directory containing MAGs as individual *fastA* files|
-|**--out_dir ...**|Output directory to write the final set of files|
+|`classify_wf`|Specifies the sub-workflow from `GTDB-TK` that we wish to use|
+|`-x`|Specify the file extension for MAGs within our input directory. <br> Default is *.fna*, but it's always good practice to specify it anyway|
+|`--cpus`|Number of threads/CPUs to use when finding marker genes, and performing tree insertion operations|
+|`--keep_intermediates`|Keep intermediate outputs|
+|`--genome_dir`|Input directory containing MAGs as individual FASTA files|
+|`--out_dir`|Output directory to write the final set of files|

Before submitting your job, think carefully about which set of MAGs you want to classify. You could either use the raw `DAS_Tool` outputs in the `../6.bin_refinement/dastool_out/_DASTool_bins/` folder, the renamed set of bins in the `../6.bin_refinement/example_data_unchopped/` folder, the set of curated bins in the `filtered_bins/` folder, or your own set of refined bins. Whichever set you choose, make sure you select the correct input folder and extension setting as it may differ from the example here.

When the task completes, you will have a number of output files provided. The main ones to look for are `gtdbtk.bac120.summary.tsv` and `gtdbtk.arch122.summary.tsv` which report the taxonomies for your MAGs, split at the domain level. These files are only written if MAGs that fall into the domain were found in your data set, so for this exercise we do not expect to see the `gtdbtk.arch122.summary.tsv` file.

If you are interested in performing more detailed phylogenetic analysis of the data, the filtered multiple sequence alignments (MSA) for the data are provided in the `gtdbtk.bac120.msa.fasta` and `gtdbtk.arch122.msa.fasta` files.

From d4cb2be5105464d1ee547f344dec0136249023f7 Mon Sep 17 00:00:00 2001
From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com>
Date: Fri, 1 Sep 2023 15:09:33 +1200
Subject: [PATCH 03/12] Aesthetics ex11.1

---
 docs/day3/ex11.1_phylogenomics.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/day3/ex11.1_phylogenomics.md b/docs/day3/ex11.1_phylogenomics.md
index 88ddaeae..6a47095b 100644
--- a/docs/day3/ex11.1_phylogenomics.md
+++ b/docs/day3/ex11.1_phylogenomics.md
@@ -158,4 +158,5 @@ We can edit and add annotations by clicking within the iTol website environment.

!!! book-atlas "References"

    [Kapli, P., Yang, Z. and Telford M.J. (2020) Phylogenetic tree building in the genomic age. Nat Rev Genet 21: 428-444](https://doi.org/10.1038/s41576-020-0233-0)
+
    [Yang, Z. and Rannala, B. (2012) Molecular phylogenetics: principles and practice. Nat Rev Genet 13: 303-314.](https://doi-org.ezproxy.auckland.ac.nz/10.1038/nrg3186)
From cdff8a06b3394771c75141fb9eb34cb118ddc1 Mon Sep 17 00:00:00 2001
From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com>
Date: Fri, 1 Sep 2023 15:36:51 +1200
Subject: [PATCH 04/12] Aesthetics ex12

---
 docs/day3/ex12_gene_prediction.md | 231 ++++++++++++++++--------
 1 file changed, 127 insertions(+), 104 deletions(-)

diff --git a/docs/day3/ex12_gene_prediction.md b/docs/day3/ex12_gene_prediction.md
index 2fa3e4fd..b082cbae 100644
--- a/docs/day3/ex12_gene_prediction.md
+++ b/docs/day3/ex12_gene_prediction.md
@@ -8,15 +8,15 @@

---

-### Overview/refresher of *prodigal*
+## Overview/refresher of `prodigal`

At this stage we have recovered a number of high quality genomes or population genomes. While there are interesting biological questions we can ask of the genomes at the DNA/organisational level, it is more likely that we are interested in the genes present in the organism. How we predict genes in the metagenomic data varies depending on what features we are trying to detect. Most often, we are interested in putatively protein coding regions and open reading frames. For features that are functional but not translated, such as ribosomal RNA and tRNA sequences, we need to use alternative tools. When considering protein coding sequences, we avoid the use of the term 'open reading frame' (ORF). The nature of a fragmented assembly is that you may encounter a partial gene on the start or end of a contig that is a functional gene, but lacks the start or stop codon due to issues with assembly or sequencing depth.

-There are many software tools to predict gene sequences and in this workshop we will start with the tool `prodigal` (PROkaryotic Dynamic Programming Genefinding ALgorithm). `prodigal` has gone on to become one of the most popular microbial gene prediction algorithms as in incorporates modeling algorithms to profile the coding sequences within your genome and better identify the more cryptic (or partial) genes.
+There are many software tools to predict gene sequences and in this workshop we will start with the tool `prodigal` (PROkaryotic Dynamic Programming Genefinding ALgorithm). `prodigal` has gone on to become one of the most popular microbial gene prediction algorithms as it incorporates modelling algorithms to profile the coding sequences within your genome and better identify the more cryptic (or partial) genes.

-`prodigal` is execellent for the following use cases:
+`prodigal` is excellent for the following use cases:

!!! quote ""
    1. Predicting protein-coding genes in draft genomes and metagenomes

@@ -35,79 +35,85 @@ It is also not advised to use `prodigal` when making predictions through your un

---

-### Predicting protein coding sequences in MAGs
+## Predict protein coding sequences in MAGs

To get started, move into the exercise directory.

-```bash
-cd /nesi/nobackup/nesi02659/MGSS_U//9.gene_prediction/
-```
+!!! terminal-2 "Navigate to working directory"

+    ```bash
+    cd /nesi/nobackup/nesi02659/MGSS_U//9.gene_prediction/
+    ```

-#### Examining the *prodigal* parameters
+### Examine the `prodigal` parameters

Before we start running `prodigal`, we will take a quick look at the parameters.

!!! terminal "code"

    ```bash
    module purge
    module load prodigal/2.6.3-GCC-11.3.0

    prodigal -h
    ```

!!! circle-check "Terminal output"

    ```
    Usage: prodigal [-a trans_file] [-c] [-d nuc_file] [-f output_type]
    [-g tr_table] [-h] [-i input_file] [-m] [-n] [-o output_file]
    [-p mode] [-q] [-s start_file] [-t training_file] [-v]

    -a: Write protein translations to the selected file.
    -c: Closed ends. Do not allow genes to run off edges.
-# -d: Write nucleotide sequences of genes to the selected file. -# -f: Select output format (gbk, gff, or sco). Default is gbk. -# -g: Specify a translation table to use (default 11). -# -h: Print help menu and exit. -# -i: Specify FASTA/Genbank input file (default reads from stdin). -# -m: Treat runs of N as masked sequence; don't build genes across them. -# -n: Bypass Shine-Dalgarno trainer and force a full motif scan. -# -o: Specify output file (default writes to stdout). -# -p: Select procedure (single or meta). Default is single. -# -q: Run quietly (suppress normal stderr output). -# -s: Write all potential genes (with scores) to the selected file. -# -t: Write a training file (if none exists); otherwise, read and use -# the specified training file. -# -v: Print version number and exit. -``` +!!! terminal-2 "Navigate to working directory" + + ```bash + cd /nesi/nobackup/nesi02659/MGSS_U//9.gene_prediction/ + ``` + +### Examine the `prodigal` parameters + +Before we start running `prodigal`, we will take a quick look at the parameters. + +!!! terminal "code" + + ```bash + module purge + module load prodigal/2.6.3-GCC-11.3.0 + + prodigal -h + ``` + +!!! circle-check "Terminal output" + + ``` + Usage: prodigal [-a trans_file] [-c] [-d nuc_file] [-f output_type] + [-g tr_table] [-h] [-i input_file] [-m] [-n] [-o output_file] + [-p mode] [-q] [-s start_file] [-t training_file] [-v] + + -a: Write protein translations to the selected file. + -c: Closed ends. Do not allow genes to run off edges. + -d: Write nucleotide sequences of genes to the selected file. + -f: Select output format (gbk, gff, or sco). Default is gbk. + -g: Specify a translation table to use (default 11). + -h: Print help menu and exit. + -i: Specify FASTA/Genbank input file (default reads from stdin). + -m: Treat runs of N as masked sequence; don't build genes across them. + -n: Bypass Shine-Dalgarno trainer and force a full motif scan. + -o: Specify output file (default writes to stdout). + -p: Select procedure (single or meta). Default is single. + -q: Run quietly (suppress normal stderr output). + -s: Write all potential genes (with scores) to the selected file. + -t: Write a training file (if none exists); otherwise, read and use + the specified training file. + -v: Print version number and exit. + ``` There are a few parameters that are worth considering in advance. -##### Output files +**Output files** When running `prodigal` the default behaviour is to create a *gbk* file (this is a Genbank-like feature table, see [here](https://github.com/hyattpd/prodigal/wiki/understanding-the-prodigal-output) for more information) from your genome and write it to the stdout of your interface. This can either be captured using a redirect, or the output can instead be placed into a file using the `-o` flag. You can also change the format of the file using the `-f` flag. Since we often want to go straight from gene prediction to annotation, `prodigal` also has the option to create *fastA* files of the gene prediction (`-d`) and protein translation (`-a`) at the same time. This is an extremely helpful feature, and it is worth running all three outputs at the same time. **_Generally_** speaking, you will probably find that the amino acid sequence for your genes is all you need for most practical purposes, but having the corresponding nucleotide sequence can sometimes be useful if we want to mine other data sets. 
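To make the output options above concrete, a single `prodigal` call can emit all three files at once. This is only an illustrative sketch (the input and output names are placeholders); the exact command we run over our MAGs is in the slurm script later in this exercise.

!!! terminal "code"

    ```bash
    # Write the feature table (-o), gene nucleotide sequences (-d),
    # and protein translations (-a) in a single pass
    # (file names are placeholders for illustration)
    prodigal -i bin_0.fna \
             -o bin_0.genes.gbk \
             -d bin_0.genes.fna \
             -a bin_0.genes.faa
    ```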
-##### Modes of gene prediction +**Modes of gene prediction** As mentioned in the introduction to this exercise, `prodigal` uses the profiles of genes it detects in your data set to better tune its prediction models and improve coding sequence recovery. It has three algorithms for how the training is performed which you must determine in advance: |Parameter|Mode|Description| |:---|:---|:---| -|Normal mode|**-p single**|Take the sequence(s) you provide and profiles the sequence(s) properties. Gene predictions are then made based upon those properties.
Normal mode should be used on finished genomes, reasonable quality draft genomes, and big viruses.| -|Anonymous mode|**-p meta**|Apply pre-calculated training files to the provided input sequences.
Anonymous mode should be used on metagenomes, low quality draft genomes, small viruses, and small plasmids.| -|Training mode|**-p train**|Works like normal mode, but `prodigal` saves a training file for future use. | +|Normal mode|`-p single`|Take the sequence(s) you provide and profiles the sequence(s) properties. Gene predictions are then made based upon those properties.
Normal mode should be used on finished genomes, reasonable quality draft genomes, and big viruses.| +|Anonymous mode|`-p meta`|Apply pre-calculated training files to the provided input sequences.
Anonymous mode should be used on metagenomes, low quality draft genomes, small viruses, and small plasmids.| +|Training mode|`-p train`|Works like normal mode, but `prodigal` saves a training file for future use. | -Anecdotally, when applied to a MAG or genome, anonymous mode (`-p meta`) will identify slightly fewer genes than normal mode (`-p single`). However, single mode can miss laterally transfered elements. There is not necesarily a best choice for which version to use and this is at the users discretion. +Anecdotally, when applied to a MAG or genome, anonymous mode (`-p meta`) will identify slightly fewer genes than normal mode (`-p single`). However, single mode can miss laterally transferred elements. There is not necessarily a best choice for which version to use and this is at the users discretion. -#### Executing *prodigal* +### Execute `prodigal` We will now run `prodigal` over the 10 bins in *anonymous* mode using an array. -Create a new script - -```bash -nano prodigal.sl -``` +!!! terminal-2 "Create script named `prodigal.sl`" -!!! warning "Warning" + ```bash + nano prodigal.sl + ``` - Paste in the script (modifying ``) +!!! warning "Remember to update to your own folder" !!! terminal "code" @@ -152,84 +158,102 @@ nano prodigal.sl Once `prodigal` has completed, let's check one of the output files: -```bash -head -n 5 predictions/bin_0.filtered.genes.faa +!!! terminal "code" + + ```bash + head -n 5 predictions/bin_0.filtered.genes.faa + ``` + +!!! circle-check "Terminal output" + + ``` + >bin_0_NODE_6_length_607162_cov_1.000759_1 # 2 # 667 # -1 # ID=1_1;partial=10;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.321 + MIKSNGILFTGKKFVMGAVVSALLATSGIAADYTLKFSHVVSPNTPKGKAADFFAKRLEE + LSGGKIDVQVYPSSQLYNDSAVLKALRLDSVQMAAPSFSKFGKIVPQLALFNLPFLFKDI + DQLHRVQDGPVGEKLKSLVTAKGFVALNFWDNGFKQLSSSKEPLLMPKDAEGQKFRIMSS + KVLEAQFKAVGANPQMMPFSEVYSGLQQGVIDAAENPFSNIY + ``` + +There are a few thing to unpack here. First lets look at the first line (the header) of the FASTA file: -# >bin_0_NODE_6_length_607162_cov_1.000759_1 # 2 # 667 # -1 # ID=1_1;partial=10;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.321 -# MIKSNGILFTGKKFVMGAVVSALLATSGIAADYTLKFSHVVSPNTPKGKAADFFAKRLEE -# LSGGKIDVQVYPSSQLYNDSAVLKALRLDSVQMAAPSFSKFGKIVPQLALFNLPFLFKDI -# DQLHRVQDGPVGEKLKSLVTAKGFVALNFWDNGFKQLSSSKEPLLMPKDAEGQKFRIMSS -# KVLEAQFKAVGANPQMMPFSEVYSGLQQGVIDAAENPFSNIY -``` +!!! terminal-2 "Sequence header" -There are a few thing to unpack here. 
First let's look at the first line of the *fastA* file:

```bash
>bin_0_NODE_6_length_607162_cov_1.000759_1 # 2 # 667 # -1 # ID=1_1;partial=10;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.321
 | | | | | | |
 | | | | | | Are the gene boundaries complete or not
 | | | | | Unique gene ID
 | | | | Orientation
 | | | Stop position
 | | Start position
 | Gene suffix
 Contig name
```

!!! terminal-2 "Sequence header"

    ```
    >bin_0_NODE_6_length_607162_cov_1.000759_1 # 2 # 667 # -1 # ID=1_1;partial=10;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.321
    |                                        |   |     |     |    |      |
    A                                        B   C     D     E    F      G
    ```

| Section | Example text | Description |
| :--- | :--- | :--- |
| A | `>bin_0_NODE_6_length_607162_cov_1.000759` | Contig ID |
| B | `_1` | Gene suffix for the contig |
| C | `2` | Start position relative to contig |
| D | `667` | Stop position relative to contig |
| E | `-1` | Orientation |
| F | `ID=1_1` | Unique gene ID |
| G | `partial=10` | Completeness of gene boundaries (0 = complete; 1 = partial) |

Here are the first few pieces of information in the FASTA header identified by what they mean. `prodigal` names genes using the contig name followed by an underscore then the number of the gene along the contig. The next two pieces of information are the start and stop coordinates of the gene. Next, a 1 is reported for a gene that is in the forward orientation (relative to the start of the contig) and a -1 for genes that are in reverse orientation.

There is also a unique gene ID provided although this may not be necessary. As long as your contig names are unique, then all gene names generated from them will also be unique. The last option that is important to check is the *partial* parameter. This is reported as two digits which correspond to the start and end of the gene and report whether or not the gene has the expected amino acids for the start (M) and end of a gene (* in the protein file). A 0 indicates a complete gene edge, and 1 means partial. In this case, we have '10' which indicates the gene runs off the left edge of the contig. Alternate outcomes for this field are '00', '01', or '11'.

### Strip metadata

While this header information can be very informative, its presence in the FASTA files can lead to some downstream issues.
The *fastA* file format specifies that the sequence name for each entry runs from the '>' character to the first space, and everything after the space is metadata. Some bioinformatic programs are aware of this convention and will strip the metadata when producing their outputs, but some tools do not do this. It's really easy to end up in situtations where your gene names are failing to match between analyses because of this inconsistency, so we recommend creating new *fastA* files with the metadata removed to preempt this problem. +!!! terminal "code" + + ```bash + for pred_file in predictions/*.fna; + do + file_base=$(basename ${pred_file} .fna) + + cut -f1 -d ' ' predictions/${file_base}.fna > predictions/${file_base}.no_metadata.fna + cut -f1 -d ' ' predictions/${file_base}.faa > predictions/${file_base}.no_metadata.faa + done + ``` -```bash -for pred_file in predictions/*.fna; -do - file_base=$(basename ${pred_file} .fna) - - cut -f1 -d ' ' predictions/${file_base}.fna > predictions/${file_base}.no_metadata.fna - cut -f1 -d ' ' predictions/${file_base}.faa > predictions/${file_base}.no_metadata.faa -done -``` !!! magnifying-glass "`cut` flags" * `-f` argument is how we specify which columns to keep. It can be used to specify a range as well * `-d-` **delimiter** : `cut` uses tab as a default field delimiter but can also work with other delimiter by using `-d` option -```bash -head -n 5 predictions/bin_0.filtered.genes.no_metadata.faa +!!! terminal "code" + + ```bash + head -n 5 predictions/bin_0.filtered.genes.no_metadata.faa + ``` -# >bin_0_NODE_6_length_607162_cov_1.000759_1 -# MIKSNGILFTGKKFVMGAVVSALLATSGIAADYTLKFSHVVSPNTPKGKAADFFAKRLEE -# LSGGKIDVQVYPSSQLYNDSAVLKALRLDSVQMAAPSFSKFGKIVPQLALFNLPFLFKDI -# DQLHRVQDGPVGEKLKSLVTAKGFVALNFWDNGFKQLSSSKEPLLMPKDAEGQKFRIMSS -# KVLEAQFKAVGANPQMMPFSEVYSGLQQGVIDAAENPFSNIY -``` +!!! circle-check "Terminal output" + + ``` + >bin_0_NODE_6_length_607162_cov_1.000759_1 + MIKSNGILFTGKKFVMGAVVSALLATSGIAADYTLKFSHVVSPNTPKGKAADFFAKRLEE + LSGGKIDVQVYPSSQLYNDSAVLKALRLDSVQMAAPSFSKFGKIVPQLALFNLPFLFKDI + DQLHRVQDGPVGEKLKSLVTAKGFVALNFWDNGFKQLSSSKEPLLMPKDAEGQKFRIMSS + KVLEAQFKAVGANPQMMPFSEVYSGLQQGVIDAAENPFSNIY + ``` --- -### Predicting RNA features and non-coding regions +## Predict RNA features and non-coding regions -#### Predicting rRNA sequences +### Predict rRNA sequences While they will not be covered in great detail here, there are a few other prediction tools that are useful when working with metagenomic data. The first of these is `MeTaxa2`, which can be used to predict ribosomal RNA sequences in a genome. Detection of these is a handy way to link your MAGs to the scientific literature and taxonomy, although recovery of ribosomal sequences like the 16S rRNA subunit is often not successful. To attempt to find the small (16S, SSU) and large (28S, LSU) ribosomal subunits in our data, use the following commands. -Create a new script - -```bash -nano metaxa2.sl -``` +!!! terminal-2 "Create script named `metaxa2.sl`" -!!! warning "Warning" + ```bash + nano metaxa2.sl + ``` - Paste in the script (replacing ``) +!!! warning "Remember to update to your own folder" !!! terminal "code" @@ -284,11 +308,10 @@ The only other parameter that can be helpful is the `-t` flag to indicate taxa t It is usually worth letting it search for all options as detecting multiple rRNAs from different lineages can be a good sign of binning contamination. However, if you want to restrict the search or provide a custom training set this can be set with the `-t` flag. 
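For reference, restricting the search with the `-t` flag looks something like the sketch below. Treat this as an assumption-laden example rather than the workshop command: the input and output names are placeholders, and you should confirm the flag values against the `MeTaxa2` documentation for your installed version.

!!! terminal "code"

    ```bash
    # Illustrative only: search a single MAG for SSU rRNA, restricting
    # the scan to bacteria (b) and archaea (a); paths are placeholders
    metaxa2 -i bin_0.fna -o metaxa2_out/bin_0 -g ssu -t b,a --cpu 2
    ```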
-#### Predicting tRNA and tmRNA sequences
+### Predict tRNA and tmRNA sequences

The [MIMAG](https://www.ncbi.nlm.nih.gov/pubmed/28787424) standard specifies that in order to reach particular quality criteria, a MAG must contain a certain number of tRNA sequences. We can search a MAG or genome for these using `Aragorn` ([link here](http://130.235.244.92/ARAGORN/)).

-![](https://github.com/GenomicsAotearoa/metagenomics_summer_school/blob/master/materials/figures/ex11_aragorn.PNG)
 ![image](../figures/ex11_aragorn.PNG)

---

From 63b8577f3b9c7a41e1467de783513910b8274fd5 Mon Sep 17 00:00:00 2001
From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com>
Date: Fri, 1 Sep 2023 16:11:03 +1200
Subject: [PATCH 05/12] Aesthetics ex13

---
 docs/day3/ex13_gene_annotation_part1.md | 192 +++++++++++++-----------
 1 file changed, 105 insertions(+), 87 deletions(-)

diff --git a/docs/day3/ex13_gene_annotation_part1.md b/docs/day3/ex13_gene_annotation_part1.md
index dd75a524..7d3490c1 100644
--- a/docs/day3/ex13_gene_annotation_part1.md
+++ b/docs/day3/ex13_gene_annotation_part1.md
@@ -7,93 +7,108 @@

    - [*BLAST*-like gene annotations and domain annotations](#blast-like-gene-annotations-and-domain-annotations)
      - [BLAST-like annotation](#blast-like-annotation)
      - [HMM-profiling of domains](#hmm-profiling-of-domains)
      - [Annotating MAGs against the *UniProt* database with `DIAMOND`](#annotating-mags-against-the-uniprot-database-with-diamond)
      - [Annotating MAGs against the *Pfam* database with `HMMER`](#annotating-mags-against-the-pfam-database-with-hmmer)
      - [Evaluating the quality of gene assignment](#evaluating-the-quality-of-gene-assignment)
      - [Differences in taxonomies](#differences-in-taxonomies)
      - [Examples of various genome-level and protein taxonomies](#examples-of-various-genome-level-and-protein-taxonomies)

---

## Annotation methods

Broadly speaking, there are two ways we perform gene annotations with protein sequences. Both compare our sequences of interest against a curated set of protein sequences for which function is known, or is strongly suspected. In each case, there are particular strengths to the approach and for particular research questions, one option may be favoured over another.

### `BLAST`-like annotation

The first of these is the `BLAST` algorithm for sequence alignment. This approach performs pairwise alignment between the gene of interest (query sequence) and the sequences in the database (target sequence). `BLAST` searches each potential target sequence for *k*-mers identified in the query sequence. Where these *k*-mers are found in targets, the ends are extended out to try to create longer regions of highly similar sequence spans. Across this span, the tool identifies the longest span of characters (nucleotide or amino acid) that match within a scoring framework to return the length of the region (coverage) and the sequence identity over the span (identity).

The original tool for performing this kind of analysis was the `BLAST` tool.
While `BLAST` and its variants are still excellent tools for performing this kind of sequence annotation, they suffer from a slow runtime speed due to the need to test each query sequence against every target sequence in the database. For this reason, several tools have been published which take the basic approach of `BLAST`, but augment it with methods to reduce the number of pairwise comparisons needed to identify targets with high sequence similarity to the query. Two popular pieces of software are the tools `usearch` [here](http://www.drive5.com/usearch/) and `diamond` [here](https://github.com/bbuchfink/diamond). +The original tool for performing this kind of analysis was the `BLAST` tool. While `BLAST` and its variants are still excellent tools for performing this kind of sequence annotation, they suffer from a slow runtime speed due to the need to test each query sequence against every target sequence in the database. For this reason, several tools have been published which take the basic approach of `BLAST`, but augment it with methods to reduce the number of pairwise comparisons needed to identify targets with high sequence similarity to the query. Two popular pieces of software are the tools [`USEARCH`](http://www.drive5.com/usearch/) and [`DIAMOND`](https://github.com/bbuchfink/diamond). -#### HMM-profiling of domains +### HMM-profiling of domains -An alternate method for attributing function to query sequences is to consider them as a collection of independently functioning protein folding domains. This is the approach used in the [HMMer](http://hmmer.org/) software, and the *Pfam*, *TIGRfam*, and *PANTHER* databases. In these analyses, the database consists not of individual sequences, but of Hidden Markov models built from a collection of proteins that share a common domain. These profiles build out a statistical map of the amino acid transitions (from position to position), variations (differences at a position), and insertions/deletions between positions in the domain across the different observations in the training database and apply these maps to the query data. +An alternate method for attributing function to query sequences is to consider them as a collection of independently functioning protein folding domains. This is the approach used in the [HMMER](http://hmmer.org/) software, and the *Pfam*, *TIGRfam*, and *PANTHER* databases. In these analyses, the database consists not of individual sequences, but of Hidden Markov models built from a collection of proteins that share a common domain. These profiles build out a statistical map of the amino acid transitions (from position to position), variations (differences at a position), and insertions/deletions between positions in the domain across the different observations in the training database and apply these maps to the query data. These exercises will take place in the `10.gene_annotation_and_coverage/` folder. +!!! terminal-2 "Navigate to working directory" + + ``` + cd /nesi/nobackup/nesi02659/MGSS_U//10.gene_annotation_and_coverage + ``` + --- -### Annotating MAGs against the *UniProt* database with *diamond* +## Annotate MAGs against the *UniProt* database with `DIAMOND` -For this exercise we are going to use `diamond` for performing our annotation. We have chosen to use this tool because it is faster than BLAST, and `usearch` comes with licensing restrictions that make it hard to work with in a shared computing environment like NeSI. +For this exercise we are going to use `DIAMOND` for performing our annotation. 
We have chosen to use this tool because it is faster than `BLAST`, and `USEARCH` comes with licensing restrictions that make it hard to work with in a shared computing environment like NeSI.

For this exercise we have created a `DIAMOND`-compatible database from the 2018 release of the *UniProt* database.

For input files, the `predictions/` results from the previous gene prediction exercise have been copied over to `10.gene_annotation_and_coverage/predictions/`.

In general, `diamond` takes a pair of input files - the protein coding sequences we wish to annotate and the database we will use for this purpose. There are a few parameters that need to be tweaked for obtaining a useful output file, however.

!!! terminal "code"

    ```bash
    module purge
    module load DIAMOND/2.0.15-GCC-11.3.0

    diamond help
    ```

!!! circle-check "Terminal output"

    ```
    diamond v2.0.15.153 (C) Max Planck Society for the Advancement of Science
    Documentation, support and updates available at http://www.diamondsearch.org
    Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

    Syntax: diamond COMMAND [OPTIONS]

    Commands:
    ...
    blastp Align amino acid query sequences against a protein reference database
    ...

    General options:
    --threads (-p) number of CPU threads
    --db (-d) database file
    --out (-o) output file
    --outfmt (-f) output format
    ...
    ```

There are two output formats we can choose from which are useful for our analysis. We will obtain our output in the BLAST tabular format, which provides the annotation information in a simple-to-parse text file that can be viewed in any text or spreadsheet viewing tool. This will allow us to investigate and evaluate the quality of our annotations.

Awkwardly, `DIAMOND` does not provide the headers for what the columns in the output table mean. [This table](http://www.metagenomics.wiki/tools/blast/blastn-output-format-6) is a handy reference for how to interpret the output.

From here we can view important statistics for each query/target pairing such as the number of identical residues between sequences and the aligned length between query and target.

Before we begin, we need to create a directory for outputs.

!!! terminal "code"

    ```bash
    mkdir -p gene_annotations
    ```

Now, let's set up a slurm job to annotate each of our MAGs.
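Before scripting this across every MAG, it can be worth sanity-checking the core `diamond blastp` call on a single protein file. The database path below is a placeholder for wherever your formatted *UniProt* database lives, so treat this as a sketch rather than the workshop command; the full slurm script follows.

!!! terminal "code"

    ```bash
    # Illustrative single-bin test of the annotation call
    # (the .dmnd database path is a placeholder)
    diamond blastp --threads 4 \
        --db /path/to/uniprot.dmnd \
        --query predictions/bin_0.filtered.genes.no_metadata.faa \
        --outfmt 6 \
        --out bin_0.test.uniprot.txt

    # Hits should appear as 12-column, tab-delimited BLAST tabular rows
    head -n 5 bin_0.test.uniprot.txt
    ```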
Create a new script -```bash -nano annotate_uniprot.sl -``` +!!! terminal-2 "Create script named `annotate_uniprot.sl`" -!!! warning "Warning" + ```bash + nano annotate_uniprot.sl + ``` - Paste in the script (update ``) +!!! warning "Remember to update to your own folder" !!! terminal "code" - ```bash + ```bash linenums="1" #!/bin/bash -e #SBATCH --account nesi02659 @@ -123,65 +138,68 @@ nano annotate_uniprot.sl --out gene_annotations/${out_file}.uniprot.txt ``` -Submit the script +!!! terminal-2 "Submit the script" -```bash -sbatch annotate_uniprot.sl -``` + ```bash + sbatch annotate_uniprot.sl + ``` --- -### Annotating MAGs against the *Pfam* database with *hmmer* +## Annotate MAGs against the *Pfam* database with `HMMER` -The standard software for performing HMM-profiling annotation is [hmmer](http://hmmer.org/). Compared to `BLAST`, `FASTA`, and other sequence alignment and database search tools based on older scoring methodology, `HMMER` aims to be significantly more accurate and more able to detect remote homologs because of the strength of its underlying mathematical models. In the past, this strength came at significant computational expense, but in the new `HMMER3` project, `HMMER` is now essentially as fast as `BLAST`. +The standard software for performing HMM-profiling annotation is [HMMER](http://hmmer.org/). Compared to `BLAST`, `FASTA`, and other sequence alignment and database search tools based on older scoring methodology, `HMMER` aims to be significantly more accurate and more able to detect remote homologs because of the strength of its underlying mathematical models. In the past, this strength came at significant computational expense, but in the new `HMMER3` project, `HMMER` is now essentially as fast as `BLAST`. `HMMER` will search one or more profiles against a sequence database for sequence hommologs, and for making sequence alignments, implementing profile hidden Markov models. In this exercise, we will perform a search using `hmmsearch`. For each profile in *hmmfile*, `HMMER` uses that query profile to search the target database of sequences indicated in *seqdb*, and output ranked lists of the sequences with the most significant matches to the profile. `hmmsearch` accepts any *fastA* file as target database input. It also accepts EMBL/UniProtKB text format, and Genbank format. It will automatically determine what format your file is in so you don’t have to specify it. As we did with `diamond`, we will also have to modify some parameters to get the desired ouotput. -```bash -module load HMMER/3.3.2-GCC-11.3.0 -``` +!!! terminal "code" -??? abstract "`hmmsearch -h`" ```bash - # hmmsearch :: search profile(s) against a sequence database - # HMMER 3.3.2 (Nov 2020); http://hmmer.org/ - # Copyright (C) 2020 Howard Hughes Medical Institute. - # Freely distributed under the BSD open source license. - # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - # Usage: hmmsearch [options] - - # Basic options: - # -h : show brief help on version and usage - - # Options directing output: - # ... - # --tblout : save parseable table of per-sequence hits to file - # .... - - # Options controlling reporting thresholds: - # ... - # -E : report sequences <= this E-value threshold in output [10.0] (x>0) - # ... - - # Other expert options: - # ... - # --cpu : number of parallel CPU workers to use for multithreads - # ... + module load HMMER/3.3.2-GCC-11.3.0 + + hmmsearch -h ``` +??? 
abstract "`hmmsearch -h`" + + ``` + hmmsearch :: search profile(s) against a sequence database + HMMER 3.3.2 (Nov 2020); http://hmmer.org/ + Copyright (C) 2020 Howard Hughes Medical Institute. + Freely distributed under the BSD open source license. + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + Usage: hmmsearch [options] + + Basic options: + -h : show brief help on version and usage + + Options directing output: + ... + --tblout : save parseable table of per-sequence hits to file + .... + + Options controlling reporting thresholds: + ... + -E : report sequences <= this E-value threshold in output [10.0] (x>0) + ... + + Other expert options: + ... + --cpu : number of parallel CPU workers to use for multithreads + ... + ``` We are now going to submit another slurm job to annotate our MAGs using the [Pfam database](https://www.ebi.ac.uk/interpro/download/Pfam/). Pfam used to have a standalone website, but it has recently been integrated into InterPro maintained by the European Bioinformatics Institute (see [announcement](https://xfam.wordpress.com/2022/08/04/pfam-website-decommission/)). Matching sequences to a `Pfam` entry allows us to transfer the functional information from an experimentally characterised sequence to uncharacterised sequences in the same entry. `Pfam` then provides comprehensive annotation for each entry. -Create a new script +!!! terminal-2 "Create script named `annotate_pfam.sl`" -```bash -nano annotate_pfam.sl -``` -!!! warning "Warning" + ```bash + nano annotate_pfam.sl + ``` - Paste in the script (update ``) +!!! warning "Remember to update to your own folder" !!! terminal "code" @@ -223,7 +241,7 @@ nano annotate_pfam.sl --- -### Evaluating the quality of gene assignment +## Evaluate the quality of gene assignment Determining how trustworthy a gene annotation is can be a very tricky process. How similar do protein sequences need to be to perform the same function? The answer is surprisingly low. A bioinformatic analysis performed in 1999 identified that proteins with as little as 20 - 35% sequence identity can still share the same function ([Rost, 1999](https://doi.org/10.1093/protein/12.2.85)), but this is not a universal occurrence. When evaluating annotations, consider the following questions: @@ -240,23 +258,23 @@ We must also remain aware of the potential for incorrectly annotated genes in th --- -### Differences in taxonomies +## Differences in taxonomies -Another way to determine if an annotation 'belongs' in the MAG of interest is to consider the predicted taxonomy of the query gene with that of the MAG itself. For example, if you detect a *Desulfovibrio*-like *dsrA* seuqence in a bin that has been classified as belonging to the genus *Desulfovibrio* then it is probably a safe bet that the annotation is correct. +Another way to determine if an annotation 'belongs' in the MAG of interest is to consider the predicted taxonomy of the query gene with that of the MAG itself. For example, if you detect a *Desulfovibrio*-like *dsrA* sequence in a bin that has been classified as belonging to the genus *Desulfovibrio* then it is probably a safe bet that the annotation is correct. However, when comparing taxonomic assignments, it is important to be aware of the differing taxonomic schemas that are circulating in the microbiological and bioinformatic literature and to know how to reconcile their differences. 
Similar to how the 16S rRNA gene taxonomies provided by *SILVA*, *Greengenes*, and *RDP* taxonomies all differ in some aspects, there are multiple competing taxonomies in protein databases. -#### Examples of various genome-level and protein taxonomies +### Examples of various genome-level and protein taxonomies * [NCBI](https://www.ncbi.nlm.nih.gov/taxonomy) * [Genome Taxonomy Database](https://gtdb.ecogenomic.org/) -This problem exists despite the existance of a formal [Code](https://www.microbiologyresearch.org/content/journal/ijsem/10.1099/ijsem.0.000778) for the naming of bacteria and archaea, because +This problem exists despite the existence of a formal [Code](https://www.microbiologyresearch.org/content/journal/ijsem/10.1099/ijsem.0.000778) for the naming of bacteria and archaea, because !!! quote "" 1. There are no rules governing how we define the grouping of these names together, other than for type species - 1. Defunct synonyms and basonyms are not correctly purged from taxonomy lists (this is quite noticable with the NCBI taxonomy) - 1. Valid names cannot be assigned for uncultivate organisms, meaning there are many informal placeholder names in the literature. For example, clades like WPS-2, SAR324, and SAUL are widely cited in the literature despite having no official standing + 1. Defunct synonyms and basonyms are not correctly purged from taxonomy lists (this is quite noticeable with the NCBI taxonomy) + 1. Valid names cannot be assigned for uncultivated organisms, meaning there are many informal placeholder names in the literature. For example, clades like WPS-2, SAR324, and SAUL are widely cited in the literature despite having no official standing It is therefore important to periodically sanity check your taxonomic annotations in order to avoid splitting taxa based on spelling differences or the use of historic names that have since been reclassified. From bfc7a144d0642d8f92979154b37177bebc6a1d2e Mon Sep 17 00:00:00 2001 From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com> Date: Fri, 1 Sep 2023 16:35:47 +1200 Subject: [PATCH 06/12] Aesthetics ex14 --- docs/day3/ex14_gene_annotation_part2.md | 274 +++++++++++++----------- 1 file changed, 148 insertions(+), 126 deletions(-) diff --git a/docs/day3/ex14_gene_annotation_part2.md b/docs/day3/ex14_gene_annotation_part2.md index ccc15102..d780c09f 100644 --- a/docs/day3/ex14_gene_annotation_part2.md +++ b/docs/day3/ex14_gene_annotation_part2.md @@ -11,9 +11,9 @@ --- -### Gene prediction and annotation with *DRAM* (Distilled and Refined Annotation of Metabolism) +## Gene prediction and annotation with `DRAM` (Distilled and Refined Annotation of Metabolism) -[DRAM](http://dx.doi.org/10.1093/nar/gkaa621) is a tool designed to profile microbial (meta)genomes for metabolisms known to impact ecosystem functions across biomes. `DRAM` annotates MAGs and viral contigs using KEGG (if provided by user), UniRef90, PFAM, CAZy, dbCAN, RefSeq viral, VOGDB (Virus Orthologous Groups), and the MEROPS peptidase database. It is also highly customizable to other custom user databases. +[`DRAM`](http://dx.doi.org/10.1093/nar/gkaa621) is a tool designed to profile microbial (meta)genomes for metabolisms known to impact ecosystem functions across biomes. `DRAM` annotates MAGs and viral contigs using KEGG (if provided by user), UniRef90, PFAM, CAZy, dbCAN, RefSeq viral, VOGDB (Virus Orthologous Groups), and the MEROPS peptidase database. It is also highly customizable to other custom user databases. 
`DRAM` only uses assembly-derived *fastA* files input by the user. These input files may come from unbinned data (metagenome contig or scaffold files) or genome-resolved data from one or many organisms (isolate genomes, single-amplified genome (SAGs), MAGs). @@ -21,86 +21,95 @@ ![image](../figures/DRAM_overview.jpeg) -#### Annotation +### 1. Annotation The first step in `DRAM` is to annotate genes by assigning database identifiers to genes. Short contigs (default < 2,500 bp) are initially removed. Then, `Prodigal` is used to detect open reading frames (ORFs) and to predict their amino acid sequences. Next, `DRAM` searches all amino acid sequences against multiple databases, providing a single *Raw* output. When gene annotation is complete, all results are merged in a single tab-delimited annotation table, including the best hit for each database for user comparison. -#### Distillation +### 2. Distillation -After genome annotation, a distill step follows with the aim to curate these annotations into useful functional categories, creating genome statistics and metabolism summary files, which are stored in the *Distillate* output. The genome statistics provides most genome quality information required for [MIMAG](https://www.nature.com/articles/nbt.3893) standards, including `GTDB-tk` and `checkM` information if provided by the user. The summarised metabolism table includes the number of genes with specific metabolic function identifiers (KO, CAZY ID, etc) for each genome, with information obtained from multiple databases. The *Distillate* output is then further distilled into the *Product*, an html file displaying a heatmap, as well as the corresponding data table. We will investigate all these files later on. +After genome annotation, a distill step follows with the aim to curate these annotations into useful functional categories, creating genome statistics and metabolism summary files, which are stored in the *Distillate* output. The genome statistics provides most genome quality information required for [MIMAG](https://www.nature.com/articles/nbt.3893) standards, including `GTDB-tk` and `CheckM` information if provided by the user. The summarised metabolism table includes the number of genes with specific metabolic function identifiers (KO, CAZY ID, etc) for each genome, with information obtained from multiple databases. The *Distillate* output is then further distilled into the *Product*, an html file displaying a heatmap, as well as the corresponding data table. We will investigate all these files later on. --- -### Annotation of the MAGs with *DRAM* +## Annotate MAGs with `DRAM` Beyond annotation, `DRAM` aims to be a data compiler. For that reason, output files from both `CheckM` and `GTDB_tk` steps can be input to `DRAM` to provide both taxonomy and genome quality information of the MAGs. -#### *DRAM* input files +### `DRAM` input files For these exercises, we have copied the relevant input files into the folder `10.gene_annotation_and_coverage/DRAM_input_files/`. `gtdbtk.bac120.summary.tsv` was taken from the earlier `8.prokaryotic_taxonomy/gtdbtk_out/` outputs, and `filtered_bins_checkm.txt` from the result of re-running `CheckM` on the final refined filtered bins in `6.bin_refinement/filtered_bins`. -Navigate to the `10.gene_annotation_and_coverage/` folder +!!! 
terminal-2 "Navigate to working directory" -```bash -cd /nesi/nobackup/nesi02659/MGSS_U//10.gene_annotation_and_coverage/ -``` + ```bash + cd /nesi/nobackup/nesi02659/MGSS_U//10.gene_annotation_and_coverage/ + ``` Along with our filtered bins, the `CheckM` output file (`checkm.txt`) and `GTDB-Tk` summary output `gtdbtk.bac120.summary.tsv` are used as inputs as is. -#### *DRAM* annotation +### `DRAM` annotation In default annotation mode, `DRAM` only requires as input the directory containing all the bins we would like to annotate in *fastA* format (either .fa or .fna). There are few parameters that can be modified if not using the default mode. Once the annotation step is complete, the mode `distill` is used to summarise the obtained results. -*NOTE: due to the increased memory requirements, UniRef90 database is not default and the flag `–use_uniref` should be specified in order to search amino acid sequences against UniRef90. In this exercise, due to memory and time constraints, we won't be using the UniRef90 database.* +!!! note "UniRef and RAM requirements" + + Due to the increased memory requirements, annotations against the UniRef90 database is not set by default and the flag `–use_uniref` should be specified in order to search amino acid sequences against UniRef90. In this exercise, due to memory and time constraints, we won't be using the UniRef90 database. We will start by glancing at some of the options for `DRAM`. -```sh -module purge -module load DRAM/1.3.5-Miniconda3 - -DRAM.py --help - -# usage: DRAM.py [-h] {annotate,annotate_genes,distill,strainer,neighborhoods,merge_annotations} ... -# -# positional arguments: -# {annotate,annotate_genes,distill,strainer,neighborhoods,merge_annotations} -# annotate Annotate genomes/contigs/bins/MAGs -# annotate_genes Annotate already called genes, limited functionality compared to annotate -# distill Summarize metabolic content of annotated genomes -# strainer Strain annotations down to genes of interest -# neighborhoods Find neighborhoods around genes of interest -# merge_annotations Merge multiple annotations to one larger set -# -# options: -# -h, --help show this help message and exit -``` +!!! terminal "code" + + ```bash + module purge + module load DRAM/1.3.5-Miniconda3 + + DRAM.py --help + ``` + +!!! circle-check "Terminal output" + + ``` + usage: DRAM.py [-h] {annotate,annotate_genes,distill,strainer,neighborhoods,merge_annotations} ... + + positional arguments: + {annotate,annotate_genes,distill,strainer,neighborhoods,merge_annotations} + annotate Annotate genomes/contigs/bins/MAGs + annotate_genes Annotate already called genes, limited functionality compared to annotate + distill Summarize metabolic content of annotated genomes + strainer Strain annotations down to genes of interest + neighborhoods Find neighborhoods around genes of interest + merge_annotations Merge multiple annotations to one larger set + + options: + -h, --help show this help message and exit + ``` To look at some of the arguments in each command, type the following: -```sh -# DRAM.py --help +!!! terminal "code" + + ```bash + # DRAM.py --help -# For example: -DRAM.py annotate --help -``` + # For example: + DRAM.py annotate --help + ``` -#### Submitting *DRAM* annotation as a slurm job +### Submitting `DRAM` annotation as a slurm job To run this exercise we first need to set up a slurm job. We will use the results for tomorrow's distillation step. -Create a new script +!!! 
terminal-2 "Create script named `annotate_dram.sl`" -```bash -nano annotate_dram.sl -``` -!!! warning "Warning" + ```bash + nano annotate_dram.sl + ``` - Paste in the script (update all of the cases of ``) +!!! warning "Remember to update `` to your own folder" !!! terminal "code" - ```bash linenums="!" + ```bash linenums="1" #!/bin/bash -e #SBATCH --account nesi02659 @@ -125,49 +134,51 @@ nano annotate_dram.sl -o dram_annotations --threads $SLURM_CPUS_PER_TASK ``` -Submit the job +!!! terminal-2 "Submit the job" -```bash -sbatch annotate_dram.sl -``` + ```bash + sbatch annotate_dram.sl + ``` The program will take 4-4.5 hours to run, so we will submit the jobs and inspect the results tomorrow morning. --- -### Annotation of the viral contigs with *DRAM-v* +## Annotate viral contigs with `DRAM-v` -*DRAM* also has an equivalent program (*DRAM-v*) developed for predicting and annotating genes of viral genomes. A number of the options are similar to the standard *DRAM* run above, although the selection of databases differs slightly, and the subseqent *distill* step is focussed on identifying and classifying auxilliary metabolic genes (AMGs) rather than the metabolic pathways output by *DRAM*'s standard *distill* step. +`DRAM` also has an equivalent program (`DRAM-v`) developed for predicting and annotating genes of viral genomes. A number of the options are similar to the standard `DRAM` run above, although the selection of databases differs slightly, and the subsequent *distill* step is focussed on identifying and classifying auxilliary metabolic genes (AMGs) rather than the metabolic pathways output by `DRAM`'s standard *distill* step. -To see more details on options for *DRAM-v* we can run the same `--help` command as above: +To see more details on options for `DRAM-v` we can run the same `--help` command as above: -```sh -# DRAM-v.py --help +!!! terminal "code" + + ```bash + # DRAM-v.py --help -# For example: -DRAM-v.py annotate --help -DRAM-v.py distill --help -``` + # For example: + DRAM-v.py annotate --help + DRAM-v.py distill --help + ``` -#### Submitting *DRAM-v* annotation as a slurm job +### Submit `DRAM-v` annotation as a slurm job To run this exercise we first need to set up a slurm job. We will use the results for tomorrow's distillation step. !!! note "Note" - *DRAM-v* requires the `mgss-for-dramv/` files `final-viral-combined-for-dramv.fa` and `viral-affi-contigs-for-dramv.tab` that were generated by *VirSorter2*. These have been copied into `10.gene_annotation_and_coverage/` for this exercise. -Create a new script + `DRAM-v` requires the `mgss-for-dramv/` files `final-viral-combined-for-dramv.fa` and `viral-affi-contigs-for-dramv.tab` that were generated by `VirSorter2`. These have been copied into `10.gene_annotation_and_coverage/` for this exercise. -```bash -nano annotate_dramv.sl -``` -!!! warning "Warning" +!!! terminal-2 "Create script named `annotate_dramv.sl`" + + ```bash + nano annotate_dramv.sl + ``` - Paste in the script (update all of the cases of ``) +!!! warning "Remember to update `` to your own folder" !!! terminal "code" - ```bash + ```bash linenums="1" #!/bin/bash -e #SBATCH --account nesi02659 @@ -193,16 +204,17 @@ nano annotate_dramv.sl -o dramv_annotations ``` -Submit the job +!!! terminal-2 "Submit the job" -```bash -sbatch annotate_dramv.sl -``` + ```bash + sbatch annotate_dramv.sl + ``` We will submit this job now and inspect the results tomorrow morning. 
---

-### Calculate per-sample coverage stats of the filtered prokaryote bins
+## Calculate per-sample coverage stats of the filtered prokaryote bins

One of the first questions we often ask when studying the ecology of a system is: What are the patterns of abundance and distribution of taxa across the different samples? With bins of metagenome-assembled genome (MAG) data, we can investigate this by mapping the quality-filtered unassembled reads back to the refined bins to then generate coverage profiles. Genomes in higher abundance in a sample will contribute more genomic sequence to the metagenome, and so the average depth of sequencing coverage for each of the different genomes provides a proxy for abundance in each sample.

These exercises will take place in the `10.gene_annotation_and_coverage/` folder.

First, concatenate the bin data into a single file to then use to generate an index for the read mapper.

-```bash
-cd /nesi/nobackup/nesi02659/MGSS_U//10.gene_annotation_and_coverage/
+!!! terminal-2 "Concatenate bin nucleotide FASTA files"

-cat filtered_bins/*.fna > filtered_bins.fna
-```
+    ```bash
+    cat filtered_bins/*.fna > filtered_bins.fna
+    ```

Now build the index for `Bowtie2` using the concatenated bin data. We will also make a new directory `bin_coverage/` to store the index and read mapping output into.

-```bash
-mkdir -p bin_coverage/
+!!! terminal-2 "Build `Bowtie2` index"

-# Load Bowtie2
-module purge
-module load Bowtie2/2.4.5-GCC-11.3.0
+    ```bash
+    mkdir -p bin_coverage/

-# Build Bowtie2 index
-bowtie2-build filtered_bins.fna bin_coverage/bw_bins
-```
+    # Load Bowtie2
+    module purge
+    module load Bowtie2/2.4.5-GCC-11.3.0
+
+    # Build Bowtie2 index
+    bowtie2-build filtered_bins.fna bin_coverage/bw_bins
+    ```

Map the quality-filtered reads (from `../3.assembly/`) to the index using `Bowtie2`, and sort and convert to `.bam` format via `samtools`.

-Create a new script
+!!! terminal-2 "Create script named `mapping_filtered_bins.sl`"

-```bash
-nano mapping_filtered_bins.sl
-```
+    ```bash
+    nano mapping_filtered_bins.sl
+    ```

-!!! warning "Warning"
+!!! warning "Remember to update `` to your own folder"

-    Paste in the script (replacing ``)

!!! terminal "code"

@@ -281,27 +294,30 @@

Finally, generate the per-sample coverage table for each contig in each bin via `MetaBAT`'s `jgi_summarize_bam_contig_depths`.

-```bash
-# Load MetaBAT
-module load MetaBAT/2.15-GCC-11.3.0
+!!! terminal-2 "Obtain coverage values"
+
+    ```bash
+    # Load MetaBAT
+    module load MetaBAT/2.15-GCC-11.3.0

-# calculate coverage table
-jgi_summarize_bam_contig_depths --outputDepth bins_cov_table.txt bin_coverage/sample*.bam
-```
+    # Calculate coverage table
+    jgi_summarize_bam_contig_depths --outputDepth bins_cov_table.txt bin_coverage/sample*.bam
+    ```

The coverage table will be generated as `bins_cov_table.txt`. As before, the key columns of interest are the `contigName`, and each `sample[1-n].bam` column.

!!! note "Note"
+
    Here we are generating a per-sample table of coverage values for **each contig** within each bin. To get per-sample coverage of **each bin** as a whole, we will need to generate average coverage values based on all contigs contained within each bin. 
We will do this in `R` during our data visualisation exercises on day 4 of the workshop, leveraging the fact that we added bin IDs to the sequence headers.

-### Calculate per-sample coverage stats of viral contigs
+## Calculate per-sample coverage stats of viral contigs

-Here we can follow the same steps as outlined above for the bin data, but with a concatenated *fastA* file of viral contigs.
+Here we can follow the same steps as outlined above for the bin data, but with a concatenated FASTA file of viral contigs.

-To quickly recap:
+!!! backward "A quick recap"

-* In previous exercises, we first used `VirSorter2` to identify viral contigs from the assembled reads, generating a new fasta file of viral contigs: `final-viral-combined.fa`
-* We then processed this file using `CheckV` to generate quality information for each contig, and to further trim any retained (prokaryote) sequence on the ends of prophage contigs.
+    * In previous exercises, we first used `VirSorter2` to identify viral contigs from the assembled reads, generating a new fasta file of viral contigs: `final-viral-combined.fa`
+    * We then processed this file using `CheckV` to generate quality information for each contig, and to further trim any retained (prokaryote) sequence on the ends of prophage contigs.

The resultant *fasta* files generated by `CheckV` (`proviruses.fna` and `viruses.fna`) have been copied to the `10.gene_annotation_and_coverage/checkv` folder for use in this exercise.

@@ -310,32 +326,35 @@

We will first need to concatenate these files together.

-```bash
-cat checkv/proviruses.fna checkv/viruses.fna > checkv_combined.fna
-```
+!!! terminal "code"
+
+    ```bash
+    cat checkv/proviruses.fna checkv/viruses.fna > checkv_combined.fna
+    ```

Now build the index for `Bowtie2` using the concatenated viral contig data. We will also make a new directory `viruses_coverage/` to store the index and read mapping output into.

-```bash
-mkdir -p viruses_coverage/
+!!! terminal "code"

-# Load Bowtie2
-module load Bowtie2/2.4.5-GCC-11.3.0
+    ```bash
+    mkdir -p viruses_coverage/

-# Build Bowtie2 index
-bowtie2-build checkv_combined.fna viruses_coverage/bw_viruses
-```
+    # Load Bowtie2
+    module load Bowtie2/2.4.5-GCC-11.3.0
+
+    # Build Bowtie2 index
+    bowtie2-build checkv_combined.fna viruses_coverage/bw_viruses
+    ```

Map the quality-filtered reads (from `../3.assembly/`) to the index using `Bowtie2`, and sort and convert to `.bam` format via `samtools`.

-Create a new script
+!!! terminal-2 "Create script named `mapping_viruses.sl`"

-```bash
-nano mapping_viruses.sl
-```
+    ```bash
+    nano mapping_viruses.sl
+    ```

-!!! warning "Warning"
+!!! warning "Remember to update `` to your own folder"

-    Paste in the script (replacing ``)

!!! terminal "code"

@@ -376,20 +395,23 @@

Finally, generate the per-sample coverage table for each viral contig via `MetaBAT`'s `jgi_summarize_bam_contig_depths`.

-```bash
-# Load MetaBAT
-module load MetaBAT/2.15-GCC-11.3.0
+!!! terminal "code"

-# calculate coverage table
-jgi_summarize_bam_contig_depths --outputDepth viruses_cov_table.txt viruses_coverage/sample*.bam
-```
+    ```bash
+    # Load MetaBAT
+    module load MetaBAT/2.15-GCC-11.3.0
+
+    # Calculate coverage table
+    jgi_summarize_bam_contig_depths --outputDepth viruses_cov_table.txt viruses_coverage/sample*.bam
+    ```

The coverage table will be generated as `viruses_cov_table.txt`. 
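If you would like to sanity-check the table before moving on, a quick peek at the first rows and columns works well. This is just a sketch; the exact column layout is set by `jgi_summarize_bam_contig_depths`.

```bash
# View the first few rows and first four columns of the tab-delimited coverage table
head -n 5 viruses_cov_table.txt | cut -f 1-4 | column -t
```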
As before, the key columns of interest are the `contigName`, and each `sample[1-n].bam` column. !!! note "Note" + Unlike the prokaryote data, we have not used a binning process on the viral contigs (since many of the binning tools use hallmark characteristics of prokaryotes in the binning process). Here, `viruses_cov_table.txt` is the final coverage table. This can be combined with `CheckV` quality and completeness metrics to, for example, examine the coverage profiles of only those viral contigs considered to be "High-quality" or "Complete".* -#### Normalising coverage values +### Normalise coverage values Having generated per-sample coverage values, it is usually necessary to also normalise these values across samples of differing sequencing depth. In this case, the mock metagenome data we have been working with are already of equal depth, and so this is an unnecessary step for the purposes of this workshop. @@ -397,7 +419,7 @@ For an example of one way in which the `cov_table.txt` output generated by `jgi_ --- -### Select initial goal +## Select initial goal It is now time to select the goals to investigate the genomes you have been working with. We ask you to select one of the following goals: From e1e7ce4a2c3c645630d9c04f13ec8c7c618c087b Mon Sep 17 00:00:00 2001 From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com> Date: Fri, 1 Sep 2023 16:44:50 +1200 Subject: [PATCH 07/12] Aesthetics ex15 --- docs/day4/ex15_gene_annotation_part3.md | 102 +++++++++++++----------- 1 file changed, 56 insertions(+), 46 deletions(-) diff --git a/docs/day4/ex15_gene_annotation_part3.md b/docs/day4/ex15_gene_annotation_part3.md index 197c4219..70dff231 100644 --- a/docs/day4/ex15_gene_annotation_part3.md +++ b/docs/day4/ex15_gene_annotation_part3.md @@ -3,23 +3,23 @@ !!! info "Objectives" * [Overview of `DRAM.py annotate` and `DRAM-v.py annotate` output](#overview-of-drampy-annotate-and-dramvpy-annotate-output) - * [*DRAM* and *DRAM-v* distillation step and visualization of results](#dram-and-dramv-distillation-of-the-results) + * [`DRAM` and `DRAM-v` distillation step and visualization of results](#dram-and-dramv-distillation-of-the-results) * [Tie findings to your initial goal](#tie-findings-to-your-initial-goal) --- -### Overview of *DRAM.py annotate* output +## Overview of *DRAM.py annotate* output The submitted jobs from the previous session should now be completed. 
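If you want to confirm this before proceeding, listing the two output directories is a quick check (if `squeue -u $USER` still shows the jobs, they have not finished yet).

```bash
# Both directories should now be populated with annotation outputs
ls dram_annotations/ dramv_annotations/
```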
If we examine the output directory `10.gene_annotation_and_coverage/dram_annotations/` we will see the following files: |File name | Description| |:--- | :--- | -|**genes.faa** and **genes.fna** | Fasta files with all the genes called by `Prodigal`, with additional header information gained from the annotation as nucleotide and amino acid records, respectively | -|**genes.gff** | GFF3 file with the same annotation information as well as gene locations | -|**scaffolds.fna** | A collection of all scaffolds/contigs given as input to `DRAM.py annotate` with added bin information | -|**annotations.tsv** | This file includes all annotation information about every gene from all MAGs | -|**trnas.tsv** | Summary of the tRNAs found in each MAG | -|**rrnas.tsv** | Summary of the rRNAs found in each MAG | +|`genes.faa` and `genes.fna` | FASTA files with all the genes called by `Prodigal`, with additional header information gained from the annotation as nucleotide and amino acid records, respectively | +|`genes.gff` | GFF3 file with the same annotation information as well as gene locations | +|`scaffolds.fna` | A collection of all scaffolds/contigs given as input to `DRAM.py annotate` with added bin information | +|`annotations.tsv` | This file includes all annotation information about every gene from all MAGs | +|`trnas.tsv` | Summary of the tRNAs found in each MAG | +|`rrnas.tsv` | Summary of the rRNAs found in each MAG | If we inspect the head of the annotation file we will see the following @@ -29,30 +29,33 @@ If we inspect the head of the annotation file we will see the following cd /nesi/nobackup/nesi02659/MGSS_U//10.gene_annotation_and_coverage/ head -n 5 dram_annotations/annotations.tsv - - # fasta scaffold gene_position start_position end_position strandedness rank ko_id kegg_hit peptidase_id peptidase_family peptidase_hit peptidase_RBH peptidase_identity peptidase_bitScore peptidase_eVal pfam_hits cazy_id cazy_hits heme_regulatory_motif_count bin_taxonomy bin_completeness bin_contamination - # bin_0.filtered_bin_0_NODE_11_length_361162_cov_0.994424_1 bin_0.filtered bin_0_NODE_11_length_361162_cov_0.994424 1 2 277 -1 D Fumarate hydratase (Fumerase) [PF05681.17] 0 d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Arcobacteraceae;g__Arcobacter;s__Arcobacter nitrofigilis 99.59 3.79 - # bin_0.filtered_bin_0_NODE_11_length_361162_cov_0.994424_2 bin_0.filtered bin_0_NODE_11_length_361162_cov_0.994424 2 474 764 1 E 0d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Arcobacteraceae;g__Arcobacter;s__Arcobacter nitrofigilis 99.59 3.79 - # bin_0.filtered_bin_0_NODE_11_length_361162_cov_0.994424_3 bin_0.filtered bin_0_NODE_11_length_361162_cov_0.994424 3 775 1452 1 D Response regulator receiver domain [PF00072.27]; Transcriptional regulatory protein, C terminal [PF00486.31] 0 d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Arcobacteraceae;g__Arcobacter;s__Arcobacter nitrofigilis 99.59 3.79 - # bin_0.filtered_bin_0_NODE_11_length_361162_cov_0.994424_4 bin_0.filtered bin_0_NODE_11_length_361162_cov_0.994424 4 1449 2696 1 D GHKL domain [PF14501.9] 0 d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Arcobacteraceae;g__Arcobacter;s__Arcobacter nitrofigilis 99.59 3.79 + ``` +!!! 
circle-check "Terminal output" + + ``` + fasta scaffold gene_position start_position end_position strandedness rank ko_id kegg_hit peptidase_id peptidase_family peptidase_hit peptidase_RBH peptidase_identity peptidase_bitScore peptidase_eVal pfam_hits cazy_id cazy_hits heme_regulatory_motif_count bin_taxonomy bin_completeness bin_contamination + bin_0.filtered_bin_0_NODE_11_length_361162_cov_0.994424_1 bin_0.filtered bin_0_NODE_11_length_361162_cov_0.994424 1 2 277 -1 D Fumarate hydratase (Fumerase) [PF05681.17] 0 d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Arcobacteraceae;g__Arcobacter;s__Arcobacter nitrofigilis 99.59 3.79 + bin_0.filtered_bin_0_NODE_11_length_361162_cov_0.994424_2 bin_0.filtered bin_0_NODE_11_length_361162_cov_0.994424 2 474 764 1 E 0d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Arcobacteraceae;g__Arcobacter;s__Arcobacter nitrofigilis 99.59 3.79 + bin_0.filtered_bin_0_NODE_11_length_361162_cov_0.994424_3 bin_0.filtered bin_0_NODE_11_length_361162_cov_0.994424 3 775 1452 1 D Response regulator receiver domain [PF00072.27]; Transcriptional regulatory protein, C terminal [PF00486.31] 0 d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Arcobacteraceae;g__Arcobacter;s__Arcobacter nitrofigilis 99.59 3.79 + bin_0.filtered_bin_0_NODE_11_length_361162_cov_0.994424_4 bin_0.filtered bin_0_NODE_11_length_361162_cov_0.994424 4 1449 2696 1 D GHKL domain [PF14501.9] 0 d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Arcobacteraceae;g__Arcobacter;s__Arcobacter nitrofigilis 99.59 3.79 ``` For each gene annotated, `DRAM` provides a summary rank (from A to E), representing the confidence of the annotation based on reciprocal best hits (RBH). The following figure briefly explains how this summary rank is calculated:
![image](../figures/ex14_DRAM_annotation_rank.png){width="500"}
-### Overview of *DRAM-v.py annotate* output +## Overview of `DRAM-v.py annotate` output -*DRAM-v* generates the same output files as *DRAM*, but this time for the viral contigs. These files can be viewed in the output directory `10.gene_annotation_and_coverage/dramv_annotations/`. In this case, `annotations.tsv` also includes some viral-specific columns, including viral gene database matches (`vogdb`), and categories that are used by *DRAM-v.py distill* to identify putative auxiliary metabolic genes (AMGs) (`virsorter_category`, `auxiliary_score`, `is_transposon`, `amg_flags`) +`DRAM-v` generates the same output files as `DRAM`, but this time for the viral contigs. These files can be viewed in the output directory `10.gene_annotation_and_coverage/dramv_annotations/`. In this case, `annotations.tsv` also includes some viral-specific columns, including viral gene database matches (`vogdb`), and categories that are used by `DRAM-v.py distill` to identify putative auxiliary metabolic genes (AMGs) (`virsorter_category`, `auxiliary_score`, `is_transposon`, `amg_flags`) --- -### *DRAM* and *DRAM-v* distillation of the results +## `DRAM` and `DRAM-v` distillation of the results After the annotation is finished, we will summarise and visualise these annotations with the so-called *distillation* step. We do so by running the following commands directly in the terminal. This will generate the distillate and liquor files for each dataset. -For the viral annotations, we will also include the parameters `--remove_transposons` ("Do not consider genes on scaffolds with transposons as potential AMGs") and `--remove_fs` ("Do not consider genes near ends of scaffolds as potential AMGs") to filter out some potential false positives for auxilliary metabolic gene identification. +For the viral annotations, we will also include the parameters `--remove_transposons` ("Do not consider genes on scaffolds with transposons as potential AMGs") and `--remove_fs` ("Do not consider genes near ends of scaffolds as potential AMGs") to filter out some potential false positives for auxiliary metabolic gene identification. !!! 
terminal "code" @@ -71,19 +74,18 @@ For the viral annotations, we will also include the parameters `--remove_transpo -o dramv_distillation ``` -### *DRAM.py distill* output files +## `DRAM.py distill` output files -The *DRAM* distillation step generates the following files that can be found within the ```dram_distillation``` directory : +The `DRAM` distillation step generates the following files that can be found within the ```dram_distillation``` directory : |File name | Description| |:--- | :--- | -|**genome_stats.tsv**| Genome quality information required for [MIMAG](https://www.nature.com/articles/nbt.3893) | -|**metabolism_summary.xlsx**| Summarised metabolism table containing number of genes with specific metabolic function identifiers | -|**product.html**| HTML file displaying a heatmap summarising pathway coverage, electron transport chain component completion, and presence/absence of specific functions | -|**product.tsv**| Data table visualised in product.html | - +|`genome_stats.tsv`| Genome quality information required for [MIMAG](https://www.nature.com/articles/nbt.3893) | +|`metabolism_summary.xlsx`| Summarised metabolism table containing number of genes with specific metabolic function identifiers | +|`product.html`| HTML file displaying a heatmap summarising pathway coverage, electron transport chain component completion, and presence/absence of specific functions | +|`product.tsv`| Data table visualised in `product.html` | -First, let's have a look at the ```genome_stats.tsv``` file to check the assembly quality of our bins by double-clicking the file within the `Jupyter` environment, viewing from the terminal via `less` or `cat`, or downloading the files from [here](../resources/dram_distillation.zip) and opening locally (e.g. via excel). +First, let's have a look at the `genome_stats.tsv` file to check the assembly quality of our bins by double-clicking the file within the `Jupyter` environment, viewing from the terminal via `less` or `cat`, or downloading the files from [here](../resources/dram_distillation.zip) and opening locally (e.g. via Excel). ??? magnifying-glass "Content of `genome_stats.tsv`" @@ -103,33 +105,41 @@ First, let's have a look at the ```genome_stats.tsv``` file to check the assembl To finish, we visualize the *Product*, an .HTML file produced in the distillation step, by double-clicking on it in our *Jupyter* lab notebook or downloading from [here](../resources/dram_distillation.zip). The *Product* has three primary parts: -1. **Modules.** Central metabolism pathways coverage. Completion of pathways is based on the structure of KEGG modules, with the pathway coverage calculated as the percent of steps with at least one gene present. +!!! circle-check "*Product* visualisation" -
-![image](../figures/ex15_fig1_DRAM_product_met_2022.png) -
+ === "Modules" -2. **ETC Complexes.** Electron Transport Chain component completion + Central metabolism pathways coverage. Completion of pathways is based on the structure of KEGG modules, with the pathway coverage calculated as the percent of steps with at least one gene present. -
-![image](../figures/ex15_fig2_DRAM_product_ETC_2022.png) -
+
+ ![image](../figures/ex15_fig1_DRAM_product_met_2022.png) +
-3. Presence of specific functions, including CAZy, Nitrogen metabolism, Sulfur metabolism and Photosynthesis. Note that the taxonomic classification of each of the bins is also shown in the first figure + === "ETC Complexes" -
-![image](../figures/ex15_fig3_DRAM_product_2022.png) -
+ Electron Transport Chain component completion -### *DRAM-v.py distill* output files +
+ ![image](../figures/ex15_fig2_DRAM_product_ETC_2022.png) +
-The *DRAM-v* distillation step for the viral contigs generates the following files that can be found within the ```dramv_distillation/``` directory : + === "Other functions" -|File name | Description| + Presence of specific functions, including CAZy, Nitrogen metabolism, Sulfur metabolism and Photosynthesis. Note that the taxonomic classification of each of the bins is also shown in the first figure + +
+ ![image](../figures/ex15_fig3_DRAM_product_2022.png) +
+ +## `DRAM-v.py distill` output files + +The `DRAM-v` distillation step for the viral contigs generates the following files that can be found within the ```dramv_distillation/``` directory : + +|
File name
| Description| |:--- | :--- | -|**vMAG_stats.tsv**| "Genome" (in this case viral contigs of varying completeness) information including: total gene counts, viral vs host gene counts, and counts of genes related to viral replication, structure, and those with presumed viral or host benefits | -|**amg_summary.tsv**| Genes identified as putative auxiliary metabolic genes (AMGs) and various columns for metabolic characterisation of each gene | -|**product.html**| HTML file displaying a heatmap summarising AMG counts and presence/absence for different broad metabolic categories for each viral contig | +|`vMAG_stats.tsv`| "Genome" (in this case viral contigs of varying completeness) information including: total gene counts, viral vs host gene counts, and counts of genes related to viral replication, structure, and those with presumed viral or host benefits | +|`amg_summary.tsv`| Genes identified as putative auxiliary metabolic genes (AMGs) and various columns for metabolic characterisation of each gene | +|`product.html`| HTML file displaying a heatmap summarising AMG counts and presence/absence for different broad metabolic categories for each viral contig | When viewing these files, see if you can find the following information: @@ -137,13 +147,13 @@ When viewing these files, see if you can find the following information: * What are some annotations of interest within the output annotations file? * *NOTE: the *VirSorter2* annotations file includes multiple columns for both **prokaryote** and **viral** protein predictions. Be careful as to which column you are looking at (as well as its associated confidence score) when assessing viral annotations vs. AMGs*. - * Among these annotations, how many were flagged as AMGs by *DRAM-v*? + * Among these annotations, how many were flagged as AMGs by `DRAM-v`? * What broad metabolic categories did the AMGs fall into? * Discussion point: How might we investigate whether identified putative AMGs are actually *within* the viral genomes, rather than residual contaminating host genomic sequence attached to the end of integrated prophage (but incompletely trimmed off in the excision process)? --- -### Tie findings to your initial goal +## Tie findings to your initial goal It is now time to explore the genomes and try to address your original goal! From 6776ec3e14f98efe7da150e8672dec88bb3ef29d Mon Sep 17 00:00:00 2001 From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com> Date: Fri, 1 Sep 2023 16:50:42 +1200 Subject: [PATCH 08/12] Aesthetics ex16b --- docs/day4/ex16b_data_presentation_Coverage.md | 117 ++++++++++-------- 1 file changed, 65 insertions(+), 52 deletions(-) diff --git a/docs/day4/ex16b_data_presentation_Coverage.md b/docs/day4/ex16b_data_presentation_Coverage.md index 152c0783..54aaeadf 100644 --- a/docs/day4/ex16b_data_presentation_Coverage.md +++ b/docs/day4/ex16b_data_presentation_Coverage.md @@ -7,7 +7,7 @@ --- -### Build a heatmap of average coverage per sample using *R* +## Build a heatmap of average coverage per sample using *R* !!! quote "" @@ -29,11 +29,11 @@ --- -### Part 1 - Building a heatmap of MAG coverage per sample +## Part 1 - Building a heatmap of MAG coverage per sample To get started, if you're not already, log back in to NeSI's [Jupyter hub](https://jupyter.nesi.org.nz/hub/login) and make sure you are working within RStudio with the required packages installed (see the [data presentation intro](../day4/ex16a_data_presentation_Intro.md) for more information). 
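If you are unsure whether everything is installed, a quick check from the R console will report anything missing. This is a sketch only; the package list here is an assumption based on what these exercises load, so adjust it to match the intro.

```R
# Report any required packages that are not yet installed
required <- c("tidyverse", "vegan", "scales")
setdiff(required, rownames(installed.packages()))
```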
-#### 1.1 Prepare environment +### 1.1 Prepare environment First, set the working directory and load the required libraries. @@ -74,7 +74,7 @@ Import all relevant data as follows: metadata <- read_tsv("mapping_file.txt") # Metadata/mapping file of environmental parameters ``` -#### 1.2 Wrangle data +### 1.2 Wrangle data After importing the data tables, we need to subset the tables to only relevant columns. As noted during the [coverage](../day3/ex14_gene_annotation_part2.md) exercises, it is important to remember that we currently have a table of coverage values for all *contigs* contained within each MAG. Since we're aiming to present coverage for each *MAG*, we need to reduce these contig coverages into a single mean coverage value per MAG per sample. @@ -160,39 +160,45 @@ After obtaining a tidy taxonomy table, we append the species annotations to the For example, you may only be interested in the top 10 MAGs based on coverage: - ```R - # Create a vector of top 10 MAGs - top10 <- sort(rowSums(MAG_cov), decreasing = TRUE)[1:10] + !!! r-project "code" + + ```R + # Create a vector of top 10 MAGs + top10 <- sort(rowSums(MAG_cov), decreasing = TRUE)[1:10] - # Retain only top 10 MAGs - MAG_cov_top10 <- MAG_cov[rownames(MAG_cov) %in% names(top10), ] - ``` + # Retain only top 10 MAGs + MAG_cov_top10 <- MAG_cov[rownames(MAG_cov) %in% names(top10), ] + ``` Perhaps you want to filter MAGs based on their prevalence across your sampling regime? Assuming you would like to retain MAGs that are present in at least 50% of your samples: - ```R - # Create a presence/absence matrix - MAG_prevalence <- ifelse(MAG_cov > 0, 1, 0) + !!! r-project "code" - # Create a logical vector that fulfill criteria - top_prev <- rowSums(MAG_prevalence) > (0.5 * ncol(MAG_prevalence)) + ```R + # Create a presence/absence matrix + MAG_prevalence <- ifelse(MAG_cov > 0, 1, 0) - # Retain only MAGs in coverage table that fulfill criteria - MAG_cov_top_prev <- MAG_cov[top_prev, ] - ``` + # Create a logical vector that fulfill criteria + top_prev <- rowSums(MAG_prevalence) > (0.5 * ncol(MAG_prevalence)) + + # Retain only MAGs in coverage table that fulfill criteria + MAG_cov_top_prev <- MAG_cov[top_prev, ] + ``` -Very often, coverage data generated using metagenomic methods can be sparse and/or have values that differ by large orders of magnitude. This can hamper effective visualisation of the data. To remedy this, we can perform numeric transformations that can enhance visualisation of relative differences between values. Here, we will use a logarithm (base 2) transform with 1 added as a pseudocount (because $\log 0$ is undefined). +Very often, coverage data generated using metagenomic methods can be sparse and/or have values that differ by large orders of magnitude. This can hamper effective visualisation of the data. To remedy this, we can perform numeric transformations that can enhance visualisation of relative differences between values. Here, we will use a logarithm (base 2) transform with 1 added as a pseudo-count (because $\log 0$ is undefined). !!! note "On data transformation" - Transformations applied to enhance data visualisation are not necessarily suitable for downstream statistical analyses. Depending on the attributes of your data (shape, sparseness, potential biases, etc.) and choice of analyses, it is recommended that you become familiar with field and tool-specifc assumptions, norms and requirements for data transformation. 
+ Transformations applied to enhance data visualisation are not necessarily suitable for downstream statistical analyses. Depending on the attributes of your data (shape, sparsity, potential biases, etc.) and choice of analyses, it is recommended that you become familiar with field and tool-specific assumptions, norms and requirements for data transformation. + +!!! r-project "code" -```R -## 4. Transform coverage data ---- -MAG_cov_log2 <- log2(MAG_cov + 1) -``` + ```R + ## 4. Transform coverage data ---- + MAG_cov_log2 <- log2(MAG_cov + 1) + ``` -#### 1.3 Order the heatmap +### 1.3 Order the heatmap We then need to prepare some form of ordering for our heatmap. This is usually presented in the form of a dendrogram based on results from hierarchical clustering. To do this, we first need to generate two dissimilarity/distance matrices based on transformed coverage values: @@ -247,7 +253,7 @@ Here, we calculate [Bray-Curtis dissimilarity](https://en.wikipedia.org/wiki/Bra dev.off() ``` -#### 1.4 Set colour palettes +### 1.4 Set colour palettes The penultimate step before building our heatmap is to set the colours that will be used to represent annotations and the cell/tile values. In this case, annotations are sample groups as designated in the metadata (a.k.a. mapping) file and MAG taxonomy. In the code below, we will set 3 palettes: @@ -261,16 +267,19 @@ The penultimate step before building our heatmap is to set the colours that will You can also quickly check what these colour palettes are: - ```R - # What colour palettes come pre-installed? - palette.pals() + !!! r-project "code" + + ```R + # What colour palettes come pre-installed? + palette.pals() + + # Plotting the "Okabe-Ito" palette + # (Requires the "scales" package) + scales::show_col( + palette.colors(palette = "Okabe-Ito") + ) + ``` - # Plotting the "Okabe-Ito" palette - # (Requires the "scales" package) - scales::show_col( - palette.colors(palette = "Okabe-Ito") - ) - ``` ![image](../figures/day4_coverage.03.show_col.Okabe-Ito.png) !!! r-project "code" @@ -293,7 +302,7 @@ The penultimate step before building our heatmap is to set the colours that will sample_colour <- left_join(metadata, group_colour) ## Prepare MAG colours based on taxonomic class - ### Remember that the new labels (bin ID) are 'binID_species' + ## Remember that the new labels (bin ID) are 'binID_species' bin_phylum <- bin_taxonomy %>% select(user_genome, species, phylum) %>% unite(col = "binID", user_genome, species, sep = "_") %>% @@ -314,7 +323,7 @@ The penultimate step before building our heatmap is to set the colours that will cell_colour <- colorRampPalette(c("white", "black"), space = "Lab")(2^8) ``` -#### 1.5 Build heatmap +### 1.5 Build heatmap !!! r-project "code" @@ -379,15 +388,15 @@ Here, we arrange our columns based Bray-Curtis dissimilarities between samples. --- -### Part 2 - Building a heatmap of viral contigs per sample +## Part 2 - Building a heatmap of viral contigs per sample We can run through the same process for the viral contigs. Many of the steps are as outlined above, so we will work through these a bit quicker and with less commentary along the way. However, we will highlight a handful of differences compared to the commands for the MAGs above, for example: steps for selecting and/or formatting the taxonomy; importing the quality output from `CheckV`; and the (optional) addition of filtering out low quality contigs. -#### 2.1 Prepare environment +### 2.1 Prepare environment This has already been done above. 
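If you have restarted R or cleared your environment since Part 1, you can quickly confirm that the imported objects are still available before continuing. A minimal sketch, assuming the object names used in section 1.1:

```R
# These objects were created in Part 1 and are reused below
exists("virus_cov") && exists("metadata")
```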
We will immediately begin wrangling the data. -#### 2.2 Wrangle data +### 2.2 Wrangle data To start, we need to identify viral contigs of at least medium quality. THis will be used as a filtering vector. @@ -444,7 +453,7 @@ We then obtain the coverage matrix and transform the values to enhance visualisa virus_cov_matrix_log2 <- log2(virus_cov_matrix + 1) ``` -#### 2.3 Order the heatmap +### 2.3 Order the heatmap !!! r-project "code" @@ -481,21 +490,25 @@ We then obtain the coverage matrix and transform the values to enhance visualisa ![image](../figures/day4_coverage.05.cov_hmp.hclust.vir_sample.png) ![image](../figures/day4_coverage.06.cov_hmp.hclust.vir.png) -#### 2.4 Set colour palettes +### 2.4 Set colour palettes For colours based on sample group, we will retain the colours set as above. Here, we will only set a new palette (Tableau 10) for viruses based on the predicted taxonomy. -```R -# Prepare grouping variables and colour palettes ---- -# Check palette colours -scales::show_col( - palette.colors( - palette = "Tableau 10" - ) -) -``` +!!! r-project "code" + ```R + # Prepare grouping variables and colour palettes ---- + # Check palette colours + scales::show_col( + palette.colors( + palette = "Tableau 10" + ) + ) + ``` + +
![image](../figures/day4_coverage.07.show_col.Tableau10.png) +
!!! r-project "code" @@ -515,7 +528,7 @@ scales::show_col( left_join(virus_order_colour, by = c("Order_VC_predicted" = "virus_order")) ``` -#### 2.5 Build heatmap +### 2.5 Build heatmap !!! r-project "code" From 3fc103f1de83b7d16567ce6bc286d79fe6153992 Mon Sep 17 00:00:00 2001 From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com> Date: Fri, 1 Sep 2023 16:54:35 +1200 Subject: [PATCH 09/12] Aesthetics ex16c --- ...c_OPTIONAL_data_presentation_Ordination.md | 72 +++++++++++-------- 1 file changed, 44 insertions(+), 28 deletions(-) diff --git a/docs/day4/ex16c_OPTIONAL_data_presentation_Ordination.md b/docs/day4/ex16c_OPTIONAL_data_presentation_Ordination.md index 1b3b7812..e6354166 100644 --- a/docs/day4/ex16c_OPTIONAL_data_presentation_Ordination.md +++ b/docs/day4/ex16c_OPTIONAL_data_presentation_Ordination.md @@ -28,6 +28,7 @@ In addition to this, a simple mapping file has also been created (`11.data_prese To get started, open `RStudio` and start a new document. !!! note "Note" + You will recognise that the first few steps will follow the same process as the previous exercise on [generating coverage heatmaps](../day4/ex16b_data_presentation_Coverage.md). In practice, these two workflows can be combined to reduce the repetitive aspects. #### 1.1 Prepare environment @@ -56,16 +57,18 @@ First, set the working directory and load the required libraries. Import coverage tables and mapping file. -```R -# Read files ---- -contig_cov <- read_tsv("bins_cov_table.txt") # Bin contig coverage table -virus_cov <- read_tsv("viruses_cov_table.txt") # Viral contig coverage table -metadata <- read_tsv("mapping_file.txt") # Metadata/mapping file of environmental parameters -``` +!!! r-project "code" + + ```R + # Read files ---- + contig_cov <- read_tsv("bins_cov_table.txt") # Bin contig coverage table + virus_cov <- read_tsv("viruses_cov_table.txt") # Viral contig coverage table + metadata <- read_tsv("mapping_file.txt") # Metadata/mapping file of environmental parameters + ``` #### 1.2 Wrangle data -As before in [coverage exercise](../day4/ex16b_data_presentation_Coverage.md), we need to obtain per MAG and sample average coverage values. We begin by selecting relavent columns and renaming them. +As before in [coverage exercise](../day4/ex16b_data_presentation_Coverage.md), we need to obtain per MAG and sample average coverage values. We begin by selecting relevant columns and renaming them. !!! r-project "" @@ -105,7 +108,8 @@ It is often useful to examine ordinations based on both weighted and unweighted Here we will use the functions `vegdist()` and `metaMDS()` from the `R` package `vegan` to generate weighted and unweighted Bray-Curtis dissimilarity matrices and nMDS solutions for the microbial bin data and viral contigs data. -!!! note "Note" +!!! tip "Setting seed" + You may also wish to make use of the `set.seed()` function before each calculation to ensure that you obtain consistent results if the same commands are re-run at a later date. !!! r-project "code" @@ -141,32 +145,38 @@ Here we will use the functions `vegdist()` and `metaMDS()` from the `R` package From here on out, we will process the data using the same functions/commands. We can make our code less redundant by compiling all necessary inputs as a list, then processing them together. This is achieved by using the `map(...)` family of functions from the `purrr` package. 
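If `map()` is new to you, the idea is simply "apply this function to every element of a list and return the results as a list". A tiny standalone illustration with hypothetical values (nothing from our dataset), before we apply the same pattern to the dissimilarity matrices below:

```R
# map() passes each list element to the function in turn
purrr::map(list(a = 1, b = 2, c = 3), function(x) x * 10)
# Returns a named list: a = 10, b = 20, c = 30
```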
-```R -# Collect dissimilarities into a list for collective processing ---- -bray_list <- list( - "MAG_binary_bray" = MAG_binary_bray, - "MAG_bray" = MAG_bray, - "virus_binary_bray" = virus_binary_bray, - "virus_bray" = virus_bray -) -``` +!!! r-project "code" + + ```R + # Collect dissimilarities into a list for collective processing ---- + bray_list <- list( + "MAG_binary_bray" = MAG_binary_bray, + "MAG_bray" = MAG_bray, + "virus_binary_bray" = virus_binary_bray, + "virus_bray" = virus_bray + ) + ``` ### 3. Generate ordination We now generate and visualise all ordinations in 4 panels them using native plotting methods. -```R -# Perform non-metric multidimensional scaling (nMDS) ---- -nmds_list <- map(bray_list, function(bray) metaMDS(bray, trymax = 999)) +!!! r-project "code" + + ```R + # Perform non-metric multidimensional scaling (nMDS) ---- + nmds_list <- map(bray_list, function(bray) metaMDS(bray, trymax = 999)) -## Check nMDS plot natively -par(mfrow = c(2, 2)) # Sets panels -map2(nmds_list, names(nmds_list), function(nmds, lbl) { - ordiplot(nmds, main = lbl, type = "t", display = "sites") -}) -``` + ## Check nMDS plot natively + par(mfrow = c(2, 2)) # Sets panels + map2(nmds_list, names(nmds_list), function(nmds, lbl) { + ordiplot(nmds, main = lbl, type = "t", display = "sites") + }) + ``` +
![image](../figures/day4_coverage.09.nmds_native.png) +
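Alongside the visual check above, it can also be worth inspecting the stress value of each nMDS solution; as a common rule of thumb, solutions with stress below roughly 0.2 are usually considered interpretable. A minimal sketch, assuming the `nmds_list` object created above:

!!! r-project "code"

    ```R
    # Extract the stress value of each nMDS solution ----
    map_dbl(nmds_list, function(nmds) nmds$stress)
    ```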
Plotting via this method is a quick and easy way to inspect your ordination. However, this method is often tedious to modify to achieve publication-quality figures. In the following section, we will use `ggplot2` to generate a polished and panelled figure.

@@ -280,8 +290,14 @@ You should try and modify any of the arguments above to see what changes: Change

### Follow-up analyses

-It is often valuable to follow these visualisations up with tests for beta-dispersion (whether or not sample groups have a comparable *spread* to one another, i.e. is one group of communities more heterogeneous than another?) and, provided that beta-dispersion is not significantly different between groups, PERMANOVA tests (the extent to which the variation in the data can be explained by a given variable (such as sample groups or other environmental factors, based on differences between the *centroids* of each group).
+It is often valuable to follow these visualisations up with tests for $\beta$-dispersion (whether or not sample groups have a comparable *spread* to one another, i.e. is one group of communities more heterogeneous than another?) and, provided that $\beta$-dispersion is not significantly different between groups, PERMANOVA tests (the extent to which the variation in the data can be explained by a given variable, such as sample groups or other environmental factors, based on differences between the *centroids* of each group).

-Beta-dispersion can be calculated using the `betadisper()` function from the `vegan` package (passing the `bray.dist` data and the `map$Group` variable to group by), followed by `anova()`, `permutest()`, or `TukeyHSD()` tests of differences between the groups (by inputting the generated `betadisper` output). PERMANOVA tests can be conducted via the `adonis()` function from the `vegan` package (for example, via: `adonis(bray.dist ~ Group, data=map, permutations=999, method="bray")`.
+$\beta$-dispersion can be calculated using the `betadisper()` function from the `vegan` package (passing the `bray.dist` data and the `map$Group` variable to group by), followed by `anova()`, `permutest()`, or `TukeyHSD()` tests of differences between the groups (by inputting the generated `betadisper` output). PERMANOVA tests can be conducted via the `adonis()` function from the `vegan` package. For example:
+
+!!! r-project "code"
+
+    ```r
+    adonis(bray.dist ~ Group, data=map, permutations=999, method="bray")
+    ```

---

From baf56299bd1a78c24d29bb47f512f954eaec9fa1 Mon Sep 17 00:00:00 2001
From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com>
Date: Fri, 1 Sep 2023 17:01:30 +1200
Subject: [PATCH 10/12] Aesthetics ex16d

---
 .../ex16d_data_presentation_KEGG_pathways.md  | 179 +++++++++++------
 1 file changed, 107 insertions(+), 72 deletions(-)

diff --git a/docs/day4/ex16d_data_presentation_KEGG_pathways.md b/docs/day4/ex16d_data_presentation_KEGG_pathways.md
index ba188e80..8bf4dbfc 100644
--- a/docs/day4/ex16d_data_presentation_KEGG_pathways.md
+++ b/docs/day4/ex16d_data_presentation_KEGG_pathways.md
@@ -10,7 +10,7 @@

---

-### Build a KEGG pathway map using *R*
+## Build a KEGG pathway map using *R*

In this exercise, we will generate a KEGG pathway map from genome annotations to visualise potential pathways present in our assembled, binned genome sequences.
@@ -18,7 +18,7 @@ The key package used here is [pathview](https://academic.oup.com/bioinformatics/

To get started, if you're not already, log back in to NeSI's [Jupyter hub](https://jupyter.nesi.org.nz/hub/login) and open `RStudio`.

-#### 1. Prepare environment
+### 1. Prepare environment

Set the working directory, load the required packages, and import data.

@@ -47,65 +47,81 @@ Set the working directory, load the required packages, and import data.

Load your files into R. Here, we are loading them into a list. Given that there are ten files, it is easier to load, clean, and analyse the data using list methods available in `tidyverse`.

-```R
-filenames <- list.files(pattern = ".*.aa")
-bin_ids <- str_extract(filenames, "bin_\\d+")
-annotations <- map(filenames, function(file) read_tsv(file)) %>%
-  setNames(., bin_ids)
-```
+!!! r-project "code"
+
+    ```R
+    filenames <- list.files(pattern = ".*.aa")
+    bin_ids <- str_extract(filenames, "bin_\\d+")
+    annotations <- map(filenames, function(file) read_tsv(file)) %>%
+      setNames(., bin_ids)
+    ```

!!! note "Note"
+
    R will print some messages about column types and renaming columns. However, we do not need all the columns for this analysis and it is not necessary to reformat them.

-#### 2. Subset the data
+### 2. Subset the data

For this exercise, we really only require the KEGG orthology IDs (KO numbers) for each bin. There are many columns in the table with annotations pulled from many databases. Let's begin by finding out which columns correspond to each predicted gene and relevant KEGG annotations in the unified annotations table. We will use annotations for `bin_0` as an example.

-```R
-# Find out what columns are in the table.
-names(annotations$bin_0)
-
-# names(annotations$bin_0)
-# [1] "Query Gene" "GC%" "Contig name" "Start position"
-# [5] "Stop position" "Orientation" "Query sequence" "Signalling"
-# [9] "Signal confidence" "Target gene (UniProt)" "Identity...11" "Coverage...12"
-# [13] "E-value...13" "Description...14" "Taxonomy...15" "Target gene (UniRef100)"
-# [17] "Identity...17" "Coverage...18" "E-value...19" "Description...20"
-# [21] "Taxonomy...21" "Target gene (KEGG)" "Identity...23" "Coverage...24"
-# [25] "E-value...25" "Description...26" "Taxonomy...27" "Target gene (Pfam)"
-# [29] "E-value...29" "Description...30" "Target gene (TIGRfam)" "E-value...32"
-# [33] "Description...33"
-```
+!!! r-project "code"
+
+    ```R
+    # Find out what columns are in the table.
+    names(annotations$bin_0)
+    ```
+
+!!! circle-check "Console output"
+
+    ```
+    [1] "Query Gene" "GC%" "Contig name" "Start position"
+    [5] "Stop position" "Orientation" "Query sequence" "Signalling"
+    [9] "Signal confidence" "Target gene (UniProt)" "Identity...11" "Coverage...12"
+    [13] "E-value...13" "Description...14" "Taxonomy...15" "Target gene (UniRef100)"
+    [17] "Identity...17" "Coverage...18" "E-value...19" "Description...20"
+    [21] "Taxonomy...21" "Target gene (KEGG)" "Identity...23" "Coverage...24"
+    [25] "E-value...25" "Description...26" "Taxonomy...27" "Target gene (Pfam)"
+    [29] "E-value...29" "Description...30" "Target gene (TIGRfam)" "E-value...32"
+    [33] "Description...33"
+    ```

We can see that the headers related to KEGG annotations are in columns 22 to 27, with column headers `Target gene (KEGG)`, `Identity...23`, `Coverage...24`, `E-value...25`, `Description...26`, and `Taxonomy...27`. However, let's be surgical and only get columns that match the pattern of a KO number. All KO numbers start with K and are followed by 5 digits.
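As a quick, hypothetical sanity check of that regular expression (the strings below are made up for illustration), `"K\\d{5}"` requires a literal `K` followed by exactly five digits:

!!! r-project "code"

    ```R
    # Only the first (made-up) string satisfies the KO pattern
    str_detect(c("K02048", "K123", "CheY-domain"), "K\\d{5}")
    # [1]  TRUE FALSE FALSE
    ```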
We can use this pattern to find the required column.

!!! r-project "code"

    ```R
    map(annotations$bin_0, function(column) {
      any(str_detect(column, "K\\d{5}"))
    }) %>%
      keep(isTRUE)
    ```

!!! circle-check "Console output"

    ```
    $Description...26
    [1] TRUE
    ```

The code above goes through each column in the annotation table of bin 0, then detects if any string in the column contains the KO pattern. We then follow through by keeping only those columns that report `TRUE` in detecting the requisite pattern.

We can move on by selecting only the gene ID (here, `Query Gene`) and the column where the KO numbers are. The code below also creates a separate column `KO` that contains only strings that match the KO pattern.

!!! r-project "code"

    ```R
    KEGG_annotations <- map(annotations, function(data) {
      data %>%
        # Select columns for gene ID and KEGG annotations
        select(`Query Gene`, `Description...26`) %>%
        # Create a column for KO numbers
        mutate(
          KO = str_extract_all(`Description...26`, "K\\d{5}")
        )
    })
    ```

-#### 3. Summarise KO per bin
+### 3. Summarise KO per bin

Here, we are interested in the available KO in each bin. Thus, we can summarise the data by bin to generate a list of KO per bin. Note that some annotations do not have KO numbers attached to them. In these cases, we will filter these data out.

@@ -135,56 +151,73 @@
     arrange(bin_id, KO)
     ```

-#### Identifying pathway maps of interest
+### Identify pathway maps of interest

Before moving on, we must first identify the pathway map ID of our pathway of interest. Let's say, for this exercise, we are interested in the TCA cycle. Here, we will use `KEGGREST` to access the KEGG database and query it with a search term. `KEGGREST` can also help you identify other information stored in the KEGG database. For more information, the `KEGGREST` vignette can be viewed using the `vignette()` function in `R`: `vignette("KEGGREST-vignette")`.

!!! r-project "code"

    ```R
    keggFind(database = "pathway", query = "TCA cycle")
    ```

!!! circle-check "Console output"

    ```
    path:map00020
    "Citrate cycle (TCA cycle)"
    ```

!!! r-project "code"

    ```r
    # We find the map ID is 00020 and assign it to an object.
    tca_map_id <- "00020"
    ```

-#### Reshaping the data for input into pathview
+### Reshaping the data for input into pathview

`pathview` needs the data as a numeric matrix with IDs as row names and samples/experiments/bins as column names. Here, we will reshape the data into a matrix of counts per KO number in each bin.

!!! 
r-project "code"

    ```R
    KO_matrix <- pivot_wider(
      KO_bins,
      names_from = "bin_id",
      values_from = "hits",
      values_fill = NA
    ) %>%
      column_to_rownames("KO") %>%
      as.matrix()
    ```

If you click on `KO_matrix` in the Environment pane, you can see that it is now a matrix of counts per KO per bin. Bins that do not possess a particular KO number are given `NA`. Do not worry about that, as `pathview` can deal with it.

### Creating pathway map of genes related to TCA cycle

Now we can generate images of the KEGG pathway maps using the matrix we just made. For this section, we will try to find genes involved in the TCA cycle.

!!! r-project "code"

    ```R
    pv_bin_0 <- pathview(
      gene.data = KO_matrix[, "bin_0"],
      pathway.id = tca_map_id,
      species = "ko",
      out.suffix = "pv_bin_0"
    )
    ```

There is no plot output for this command as it automatically writes the results into the current working directory. By default, it names the file as `<pathway ID>.<out.suffix>.png` (here, `ko00020.pv_bin_0.png`). If this is the first time this is run, it will also write the pathway map's original image file (`ko00020.png`) and the XML file (`ko00020.xml`) with information about how the pathway is connected. Let's take a look at our first output.
![image](../figures/day4_keggmap.ko00020.pv_bin_0.png) +
Boxes highlighted in red mean that our MAG has this gene. However, the colour scale is a little strange given that there cannot be negative gene annotation hits (it's either `NA` or larger than 0). Also, we know that there are definitely bins with more than 1 of some KO, but the colour highlights do not show that. Let's tweak the code further and perhaps pick better colours. For the latter, we will use the `viridis` colour package, which is good for showing a gradient.

@@ -225,7 +258,9 @@ Boxes highlighted in red means that our MAG has this gene. However, the colour s
     )
     ```
![image](../figures/day4_keggmap.ko00020.pv_bin_0.2.png) +
This plot looks much better. We can see that some genes do have more hits than others. Now, let's propagate this using `map(...)` based on our bin IDs.

From 362782f2af081ac9cfc36f11743262d6dc7505ec Mon Sep 17 00:00:00 2001
From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com>
Date: Fri, 1 Sep 2023 17:05:11 +1200
Subject: [PATCH 11/12] Aesthetics ex16e

---
 .../ex16e_data_presentation_Gene_synteny.md   | 109 +++++++++---------
 1 file changed, 57 insertions(+), 52 deletions(-)

diff --git a/docs/day4/ex16e_data_presentation_Gene_synteny.md b/docs/day4/ex16e_data_presentation_Gene_synteny.md
index fe308f40..e7d3f12d 100644
--- a/docs/day4/ex16e_data_presentation_Gene_synteny.md
+++ b/docs/day4/ex16e_data_presentation_Gene_synteny.md
@@ -9,7 +9,7 @@

---

-### Build a sulfur assimilation gene alignment figure to investigate gene synteny using `R`
+## Build a sulfur assimilation gene alignment figure to investigate gene synteny using `R`

When investigating the evolution of genomes, we sometimes want to consider not only the presence/absence of genes in a genome, but also how they are arranged in an operon. For this exercise, we are going to visualise several sulfur assimilation genes from our MAGs and compare their arrangements.

@@ -21,7 +21,7 @@ For this exercise, navigate to the directory `11.data_presentation/gene_synteny/

Refer to [this appendix](../resources/5_APPENDIX_ex15_Prepare_gene_synteny_inputs.md) for detailed information on how to generate input data.

-#### Part 1 - Parsing files in *bash*
+### Part 1 - Parsing files in *bash*

We will be performing this exercise in two stages. Firstly, in `bash`, we will use `cut` and `tail` to pull out the genes of interest listed in the `*cys.txt` files from the `prodigal` files. The gene names will then be used to create a table of gene coordinates from the `prodigal` output using `grep`, `cut`, and `sed`.

@@ -76,18 +76,18 @@ Check the content of the `.coords` files now. You should see something like the

If you recall from the previous exercise on gene prediction, we have taken the first four entries from each line of the `prodigal` output, which consist of:

!!! quote ""
-    1. The gene name, written as [CONTIG]\_[GENE]
-    2. The start position of the gene
-    3. The stop position of the gene
-    4. The orienation of the gene
+    1. Gene name, written as [CONTIG]\_[GENE]
+    2. Start position of the gene
+    3. Stop position of the gene
+    4. Orientation of the gene

We will now use these tables, together with the annotation tables, to create the gene synteny view in `R`.

-#### Part 2 - Producing the figure in *R*
+### Part 2 - Producing the figure in `R`

First, move back to the [Jupyter hub](https://jupyter.nesi.org.nz/hub/login) pane and start an `RStudio` session.

-##### 2.1 Prepare environment
+#### 2.1 Prepare environment

Similar to previous sessions, we will begin by preparing the environment by setting our working directory and loading the required libraries.

@@ -109,7 +109,7 @@ Similar to previous sessions, we will begin by preparing the environment by sett
     library(genoPlotR)
     ```

-##### 2.2 Import files
+#### 2.2 Import files

Using `genoPlotR` requires files with 3 different types of information:

@@ -155,7 +155,7 @@ Lets continue to import the other files

Notice that the file-reading function for BLAST files is different (`read_comparison_from_blast()` versus `read_tsv()`). This is a function specific to `genoPlotR` for parsing BLAST outputs.
If you have some custom comparison pipeline, `genoPlotR` can also read tab-delimited files via `read_comparison_from_tab()`.

-##### 2.3 Wrangle data
+#### 2.3 Wrangle data

We now need to do the following:

@@ -186,44 +186,49 @@ Now we tidy our annotations table and create a joint coordinate-annotation table

    ```R
    annot_list
    ```

!!! 
circle-check "Console output"

    ```
    $bin_3_annot
    # A tibble: 7 × 3
      `Query gene`                                  KO                                       Annotation
      <chr>                                         <chr>                                    <chr>
    1 bin_3_NODE_53_length_158395_cov_1.135272_128 K02048                                   sbp1; Prokaryotic sulfate-/thiosulfate-bin…
    2 bin_3_NODE_53_length_158395_cov_1.135272_129 possible predicted diverged CheY-domain  possible predicted diverged CheY-domain
    3 bin_3_NODE_53_length_158395_cov_1.135272_130 Domain of unknown function 2             Domain of unknown function 2
    4 bin_3_NODE_53_length_158395_cov_1.135272_131 Domain of unknown function 2             Domain of unknown function 2
    5 bin_3_NODE_53_length_158395_cov_1.135272_132 K02046                                   cysU; cysU; sulfate transport ABC transpor…
    6 bin_3_NODE_53_length_158395_cov_1.135272_133 K02047                                   cysW; cysW; sulfate transport ABC transpor…
    7 bin_3_NODE_53_length_158395_cov_1.135272_134 K02045                                   cysA; cysA; sulfate transport ATP-binding …

    $bin_5_annot
    # A tibble: 4 × 3
      `Query gene`                               KO     Annotation
      <chr>                                      <chr>  <chr>
    1 bin_5_NODE_95_length_91726_cov_0.379302_67 K02048 sbp; sulfate-binding protein
    2 bin_5_NODE_95_length_91726_cov_0.379302_68 K02046 cysT; sulfate transporter CysT
    3 bin_5_NODE_95_length_91726_cov_0.379302_69 K02047 cysW; sulfate transporter CysW
    4 bin_5_NODE_95_length_91726_cov_0.379302_70 K02045 cysA; sulfate.thiosulfate ABC transporter ATP-binding protein CysA

    $bin_8_annot
    # A tibble: 4 × 3
      `Query gene`                                KO     Annotation
      <chr>                                       <chr>  <chr>
    1 bin_8_NODE_60_length_149231_cov_0.651774_55 K02048 Thiosulphate-binding protein
    2 bin_8_NODE_60_length_149231_cov_0.651774_56 K02046 sulfate ABC transporter permease subunit CysT
    3 bin_8_NODE_60_length_149231_cov_0.651774_57 K02047 Sulfate ABC transporter, permease protein CysW
    4 bin_8_NODE_60_length_149231_cov_0.651774_58 K02045 sulphate transport system permease protein 1

    $bin_9_annot
    # A tibble: 4 × 3
      `Query gene`                                 KO     Annotation
      <chr>                                        <chr>  <chr>
    1 bin_9_NODE_12_length_355673_cov_0.516990_147 K02048 thiosulfate ABC transporter substrate-binding protein
    2 bin_9_NODE_12_length_355673_cov_0.516990_148 K02046 sulfate ABC transporter permease
    3 bin_9_NODE_12_length_355673_cov_0.516990_149 K02047 sulfate ABC transporter permease
    4 bin_9_NODE_12_length_355673_cov_0.516990_150 K02045 sulfate ABC transporter ATP-binding protein
    ```

Immediately, we can see that there are some inconsistencies in the annotations.

@@ -292,7 +297,7 @@ We also need to convert the tidy annotation table into something that `genoPlotR
     # 4 74470 75459 cysA black 0
     ```

-##### 2.4 Plot data
+#### 2.4 Plot data

Let's plot our data to see what it looks like and if it needs tweaking.

@@ -351,11 +356,11 @@ We have a plot! However, it is rather cluttered and could be further improved:

This looks much cleaner!

??? 
success "Results at a glance"

    From the last plot, we see that:

    * The arrangement of sulfur assimilation genes is well conserved
-    * Gene sequences of sbp1, cysW, and cysA are very conserved between bins 3, 5, and 8
+    * Gene sequences of *sbp1*, *cysW*, and *cysA* are very conserved between bins 3, 5, and 8
    * Bin 3 has 3 unknown domains between its sulfur binding protein and the other sulfur assimilation genes
-    * cysW in bin 9 shows some sequence homology with cysW from bin 8
+    * *cysW* in bin 9 shows some sequence homology with *cysU* from bin 8

From 35e7b5279d760bf3e3bd049b78857f38995dd6f0 Mon Sep 17 00:00:00 2001
From: Jian Sheng Boey <95262076+JSBoey@users.noreply.github.com>
Date: Fri, 1 Sep 2023 17:11:41 +1200
Subject: [PATCH 12/12] Aesthetics ex16f

---
 ...ONAL_data_presentation_CAZy_annotations.md | 167 +++++++++++-------
 1 file changed, 100 insertions(+), 67 deletions(-)

diff --git a/docs/day4/ex16f_OPTIONAL_data_presentation_CAZy_annotations.md b/docs/day4/ex16f_OPTIONAL_data_presentation_CAZy_annotations.md
index d55b53b5..1785ceb8 100644
--- a/docs/day4/ex16f_OPTIONAL_data_presentation_CAZy_annotations.md
+++ b/docs/day4/ex16f_OPTIONAL_data_presentation_CAZy_annotations.md
@@ -9,7 +9,7 @@

---

-### Build a basic heatmap from annotation data using *R*
+## Build a basic heatmap from annotation data using *R*

To get started, if you're not already, log back in to NeSI's [Jupyter hub](https://jupyter.nesi.org.nz/hub/login) and open `RStudio`.

@@ -18,7 +18,7 @@ For this exercise, set `11.data_presentation/cazy_heatmap/` as the working direc
 1. Annotating each `prodigal` output against the **dbCAN** database using `hmmer`
 1. Converting the raw `hmmer` output into a table using the `hmmscan-parser.py` script bundled with **dbCAN**

-#### Import the data into an R data.frame
+### Import the data into an R data.frame

The first thing we need to do is import these annotations into `R`. We will do this using the following workflow:

@@ -32,82 +32,109 @@ The first thing we need to do is import these annotations into `R`. We will do t

First, we import our `R` libraries with the `library()` command. For this workflow, we need three libraries from the `tidyverse` package:

!!! 
r-project "code" + + ```R + cazy_files <- list.files('.') -# For each file, import it, drop unneeded columns, and add a column recording the bin name -cazy_df = data.frame() + # For each file, import it, drop unneeded columns, and add a column recording the bin name + cazy_df <- data.frame() -for( cazy_file in cazy_files ) { - df = read.table(cazy_file, sep='\t', stringsAsFactors=F, header=F) %>% - mutate('Bin' = cazy_file) %>% - select(CAZy=V1, Bin) - cazy_df = rbind(cazy_df, df) -} -``` + for( cazy_file in cazy_files ) { + df <- read.table(cazy_file, sep='\t', stringsAsFactors=F, header=F) %>% + mutate('Bin' = cazy_file) %>% + select(CAZy=V1, Bin) + cazy_df <- rbind(cazy_df, df) + } + ``` We can inspect the final data.frame using the `head` command: -```R -head(cazy_df) -# CAZy Bin -# 1 AA3.hmm bin_0_parsed.domtbl -# 2 AA4.hmm bin_0_parsed.domtbl -# 3 GT2_Glycos_transf_2.hmm bin_0_parsed.domtbl -# 4 CE1.hmm bin_0_parsed.domtbl -# 5 GH23.hmm bin_0_parsed.domtbl -# 6 GT8.hmm bin_0_parsed.domtbl -``` +!!! r-project "code" + + ```R + head(cazy_df) + ``` + +!!! circle-check "Console output" + + ``` + CAZy Bin + 1 AA3.hmm bin_0_parsed.domtbl + 2 AA4.hmm bin_0_parsed.domtbl + 3 GT2_Glycos_transf_2.hmm bin_0_parsed.domtbl + 4 CE1.hmm bin_0_parsed.domtbl + 5 GH23.hmm bin_0_parsed.domtbl + 6 GT8.hmm bin_0_parsed.domtbl + ``` We can also confirm that we have imported all of the text files by looking at the unique entries in the `Bin` column: -```R -unique(cazy_df$Bin) -# [1] "bin_0_parsed.domtbl" "bin_1_parsed.domtbl" "bin_2_parsed.domtbl" "bin_3_parsed.domtbl" -# [5] "bin_4_parsed.domtbl" "bin_5_parsed.domtbl" "bin_6_parsed.domtbl" "bin_7_parsed.domtbl" -# [9] "bin_8_parsed.domtbl" "bin_9_parsed.domtbl" -``` +!!! r-project "code" + + ```R + unique(cazy_df$Bin) + ``` + +!!! circle-check "Console output" + + ``` + [1] "bin_0_parsed.domtbl" "bin_1_parsed.domtbl" "bin_2_parsed.domtbl" "bin_3_parsed.domtbl" + [5] "bin_4_parsed.domtbl" "bin_5_parsed.domtbl" "bin_6_parsed.domtbl" "bin_7_parsed.domtbl" + [9] "bin_8_parsed.domtbl" "bin_9_parsed.domtbl" + ``` We will now perform a summarising step, aggregating instances of multiple genes with the same annotation into a single count for each genome. We do this by -- For each bin in the data.frame - - For each annotation in the bin - - Count the number of times the annotation is observed +- For each bin in the data frame + - For each annotation in the bin + - Count the number of times the annotation is observed For the majority of cases this will probably be one, but there will be a few cases where multiple annotations have been seen. This process is done using the `group_by` and `tally` functions from `dplyr`, again using pipes to pass the data between functions. -```R -cazy_counts = cazy_df %>% group_by(Bin, CAZy) %>% tally() - -cazy_counts -# A tibble: 402 × 3 -# Groups: Bin [10] -# Bin CAZy n -# -# 1 bin_0_parsed.domtbl AA3.hmm 1 -# 2 bin_0_parsed.domtbl AA4.hmm 1 -# 3 bin_0_parsed.domtbl AA6.hmm 2 -# 4 bin_0_parsed.domtbl CBM12.hmm 1 -# 5 bin_0_parsed.domtbl CBM50.hmm 2 -# 6 bin_0_parsed.domtbl CBM78.hmm 2 -# 7 bin_0_parsed.domtbl CE1.hmm 3 -# 8 bin_0_parsed.domtbl CE11.hmm 1 -# 9 bin_0_parsed.domtbl CE3.hmm 1 -# 10 bin_0_parsed.domtbl CE4.hmm 1 -# … with 392 more rows -``` +!!! r-project "code" + + ```R + cazy_counts <- cazy_df %>% + group_by(Bin, CAZy) %>% + tally() + + cazy_counts + ``` + +!!! 
circle-check "Console output"

    ```
    A tibble: 402 × 3
    Groups: Bin [10]
       Bin                 CAZy          n
       <chr>               <chr>     <int>
     1 bin_0_parsed.domtbl AA3.hmm       1
     2 bin_0_parsed.domtbl AA4.hmm       1
     3 bin_0_parsed.domtbl AA6.hmm       2
     4 bin_0_parsed.domtbl CBM12.hmm     1
     5 bin_0_parsed.domtbl CBM50.hmm     2
     6 bin_0_parsed.domtbl CBM78.hmm     2
     7 bin_0_parsed.domtbl CE1.hmm       3
     8 bin_0_parsed.domtbl CE11.hmm      1
     9 bin_0_parsed.domtbl CE3.hmm       1
    10 bin_0_parsed.domtbl CE4.hmm       1
    … with 392 more rows
    ```

We now have a data.frame-like object (a [tibble](https://tibble.tidyverse.org/)) with three columns. We can convert this into a gene matrix using the `pivot_wider` function from the `tidyr` library to create a genome x gene matrix in the following form:

| Bin | Gene 1 | Gene 2 | Gene 3 | ... |
|:---|:---:|:---:|:---:|:---:|
| bin_0 | N. of genes | ... | ... | ... |
| ... | ... | ... | ... | ... |
| bin_9 | N. of genes | ... | ... | ... |

!!! r-project "code"

    ```R
    cazy_matrix <- cazy_counts %>%
      pivot_wider(id_cols=Bin, names_from=CAZy, values_from=n, values_fill=list(n = 0))
    ```

### Build the plot in *R*

Finally, we create the actual plot by passing this matrix into the `pheatmap` library. Before doing this, we need to take the text column `Bin` from the matrix and move it into the rownames, as this is how `pheatmap` infers the names of our samples. Also, if we left text data in the numeric input for a heatmap, the function would crash. We can quickly transfer the `Bin` column to the rownames using the `column_to_rownames` function from the `tibble` library.

!!! r-project "code"

    ```R
    library(pheatmap)

    colours <- colorRampPalette(c("#fff9e7","#920000"), space="Lab")(100)

    cazy_matrix %>% column_to_rownames('Bin') %>% as.matrix(.) %>% pheatmap(., col = colours)
    ```
![image](../figures/ex15_CAZy_heatmap.png){width="700"}
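If you would like to keep a copy of the figure, `pheatmap()` can also write its output directly to file via its `filename` argument. A minimal sketch, assuming the `cazy_matrix` and `colours` objects created above (the output file name here is arbitrary):

!!! r-project "code"

    ```R
    # Write the same heatmap straight to file (format is inferred from the extension)
    cazy_matrix %>%
      column_to_rownames('Bin') %>%
      as.matrix(.) %>%
      pheatmap(., col = colours, filename = "cazy_heatmap.png")
    ```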