Commit

Merge pull request #45 from GenomicsAotearoa/corrections_2023
Update based on Mike's comments
JSBoey authored Aug 30, 2023
2 parents d9c3407 + d2acc36 commit bd1caf2
Showing 7 changed files with 28 additions and 30 deletions.
3 changes: 2 additions & 1 deletion docs/day1/ex5_evaluating_assemblies.md
@@ -182,7 +182,7 @@ For more genome-informed evaluation of the assembly, we can use the `MetaQUAST`
1. [MeTaxa2](https://microbiology.se/software/metaxa2/) or [SingleM](https://github.com/wwood/singlem) (DNA based, 16S rRNA recovery and classification)
1. [MetaPhlAn2](http://huttenhower.sph.harvard.edu/metaphlan2) (DNA based, clade-specific marker gene classification)

A good summary and comparison of these tools (and more) was recently published by [Ye *et al.*](https://www.ncbi.nlm.nih.gov/pubmed/31398336).
A good summary and comparison of these tools (and more) was published by [Ye *et al.*](https://www.ncbi.nlm.nih.gov/pubmed/31398336).

However, since we **_do_** know the composition of the original communities used to build this mock metagenome, `MetaQUAST` will work very well for us today. In your `4.evaluation/` directory you will find a file called `ref_genomes.txt`. This file contains the names of the genomes used to build these mock metagenomes. We will provide these as the reference input for `MetaQUAST`.
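
Before submitting the job, it can help to confirm how many reference genomes are listed. The sketch below assumes `ref_genomes.txt` holds one genome name per line (the file layout is not shown here, so treat that as an assumption); `count_entries` is a hypothetical helper name, not part of `MetaQUAST`.

```bash
# Hypothetical helper: count the non-empty lines in a text file.
# Assumption: each non-empty line of ref_genomes.txt names one reference genome.
count_entries() {
    grep -c -v '^[[:space:]]*$' "$1"
}

# Only run the check if the file exists in the current directory.
if [ -f ref_genomes.txt ]; then
    echo "Reference genomes listed: $(count_entries ref_genomes.txt)"
fi
```

If the count does not match the number of organisms you expect in the mock community, inspect the file with `less` before running `MetaQUAST`.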

@@ -198,6 +198,7 @@ However, since we **_do_** know the composition of the original communities used
#SBATCH --cpus-per-task 10
#SBATCH --error %x_%j.err
#SBATCH --output %x_%j.out
#SBATCH --partition milan

# Load module
module purge
32 changes: 13 additions & 19 deletions docs/day2/ex8_bin_dereplication.md
@@ -121,36 +121,35 @@ Both `MetaBAT` and `MaxBin` have the option to output unbinned contigs after bin
### Bin dereplication using *DAS_Tool* - Running the tool
We are now ready to run `DAS_Tool`. This can be done from the command line, as it does not take a particularly long time to run for this data set. Start by loading `DAS_Tool`.
```bash
module load DAS_Tool/1.1.5-gimkl-2022a-R-4.2.1
```
Depending on whether or not your session has been continued from previous exercises, you may encounter an error performing this module load. This is because some of the tools we have used in previous exercises have dependencies which conflict with the dependencies in `DAS_Tool`. If this is the case for you, you can unload all previous module loads with the following:
We are now ready to run `DAS_Tool`. This can be done from the command line, as it does not take a particularly long time to run for this data set.
```bash
# Remove modules to ensure a clean environment
module purge
# Load DAS Tool
module load DAS_Tool/1.1.5-gimkl-2022a-R-4.2.1
module load DIAMOND/2.0.15-GCC-11.3.0
module load USEARCH/11.0.667-i86linux32
```
`DAS_Tool` should now load without issue. With 2 threads, `DAS_Tool` should take 10 - 15 minutes to complete.
```bash
# Create DAS_Tool output directory
mkdir -p dastool_out/
# Run DAS_Tool
DAS_Tool -i metabat_associations.txt,maxbin_associations.txt \
-l MetaBAT,MaxBin \
-t 2 --write_bins --search_engine blastp \
-t 2 --write_bins --search_engine diamond \
-c spades_assembly/spades_assembly.m1000.fna \
-o dastool_out/
```
```
DAS Tool 1.1.5
Analyzing assembly
Predicting genes
Annotating single copy genes using diamond
Dereplicating, aggregating, and scoring bins
Writing bins
```
As usual, we will break down the parameters:
|Parameter|Function|
@@ -163,11 +162,6 @@ As usual, we will break down the parameters:
|**-c ...**|Path to the assembly used in binning|
|**-o ..**|Output directory for all files|
This is not a problem - `DAS_Tool` can use either `BLAST`, `diamond`, or `usearch` for performing its alignment operations. Regardless of which one you specify, it will search to see which ones are available. In this case, it is telling us that `diamond` and `usearch` cannot be found, which doesn't really matter because we have specified `BLAST` as our search engine.
When `DAS_Tool` has completed, we will have a final set of bins located in the folder path `dastool_out/_DASTool_bins`. Have a look at the output and see which bins made it to the final selection. Did a single binning tool pick the best bins, or are the results a split between `MetaBAT` and `MaxBin`?
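
One way to answer that question is to tally the final bins by the tool that produced them. The sketch below is a rough illustration only: it assumes each bin file name begins with the name of its binning tool, followed by a `.` or `_` (for example, a hypothetical `maxbin.001.fa`). Check the actual file names with `ls` first, since the naming depends on how the input bins were labelled. `tally_bins_by_prefix` is a hypothetical helper name.

```bash
# Hypothetical helper: count the files in a directory, grouped by the part of
# the file name that precedes the first '.' or '_'.
# Assumption: that leading token identifies the binning tool.
tally_bins_by_prefix() {
    local bin_dir="$1" f name
    for f in "$bin_dir"/*; do
        [ -f "$f" ] || continue                 # skip anything that is not a regular file
        name=$(basename "$f")
        printf '%s\n' "${name%%[._]*}"          # keep the text before the first '.' or '_'
    done | sort | uniq -c | awk '{print $2 "\t" $1}'
}

# Only run the tally if the DAS_Tool output folder exists.
if [ -d dastool_out/_DASTool_bins ]; then
    tally_bins_by_prefix dastool_out/_DASTool_bins
fi
```

Each output line gives a prefix and the number of bins carrying it, separated by a tab.
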
---
5 changes: 4 additions & 1 deletion docs/day2/ex9_refining_bins.md
@@ -31,9 +31,12 @@

For future reference, and for working with your own data, a step-by-step process for generating these files from the curated bins generated by `DAS_Tool` has been provided as an [Appendix](../resources/2_APPENDIX_ex9_Generating_input_files_for_VizBin.md).

Let's first have a quick look at the annotation file.
For this section, we will be working within `6.bin_refinement/`. Let's first have a quick look at the annotation file.

```bash
# Navigate to correct directory
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/6.bin_refinement

head -n 5 all_bins.sample1.vizbin.ann

# coverage,label,length
6 changes: 3 additions & 3 deletions docs/day3/ex10_viruses.md
@@ -182,7 +182,7 @@ less vConTACT2_Results/genome_by_genome_overview.csv
```

```bash
less vConTACT2_Results/tax_predict_table.txt
less vConTACT2_Results/tax_predict_table.tsv
```

A few notes to consider:
@@ -194,7 +194,7 @@ A few notes to consider:
* Note, however, that these lines will *not* contain taxonomy information.
* See the notes in the [Appendix](https://github.com/GenomicsAotearoa/metagenomics_summer_school/blob/master/materials/resources/APPENDIX_ex11_viral_taxonomy_prediction_via_vContact2.md) for further information about why this might be.
* As per the notes in the [Appendix](https://github.com/GenomicsAotearoa/metagenomics_summer_school/blob/master/materials/resources/APPENDIX_ex11_viral_taxonomy_prediction_via_vContact2.md), the `tax_predict_table.txt` file contains *predictions* of potential taxonomy (and/or *taxonomies*) of the input viral contigs for order, family, and genus, based on whether they clustered with any viruses in the reference database.
* As per the notes in the [Appendix](https://github.com/GenomicsAotearoa/metagenomics_summer_school/blob/master/materials/resources/APPENDIX_ex11_viral_taxonomy_prediction_via_vContact2.md), the `tax_predict_table.tsv` file contains *predictions* of potential taxonomy (and/or *taxonomies*) of the input viral contigs for order, family, and genus, based on whether they clustered with any viruses in the reference database.
* Note that these may be lists of *multiple* potential taxonomies, in the cases where viral contigs clustered with multiple reference viruses representing more than one taxonomy at the given rank.

!!! note ""
@@ -221,7 +221,7 @@ Cytoscape

!!! warning "Do not update Cytoscape!"

A dialog box will appear telling you about a new version of Cytoscape. **Click "discard"**, as we will not be installing any new versions today!
A dialog box will appear telling you about a new version of Cytoscape. **Click "close"**, as we will not be installing any new versions today!

In *Cytoscape*, we can load the gene-sharing network by clicking `File/Import/Network from file`, and then opening the `c1.ntw` file. You may need to click the `Home` button and then navigate to the directory where `c1.ntw` is located (i.e. `/nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/7.viruses/vConTACT2_Results/`).

2 changes: 1 addition & 1 deletion docs/day3/ex13_gene_annotation_part1.md
@@ -17,7 +17,7 @@

### *BLAST*-like gene annotations and domain annotations

Broadly speaking, there are two ways we perform gene annotations with protein sequences. Both compare our sequences of interest against a curated set of protein sequences for which function is known, or is strongly suspected. In each case, there are particular strenths to the approach and for particular research questions, one option may be favoured over another.
Broadly speaking, there are two ways we perform gene annotations with protein sequences. Both compare our sequences of interest against a curated set of protein sequences for which function is known, or is strongly suspected. In each case, there are particular strengths to the approach and for particular research questions, one option may be favoured over another.

#### BLAST-like annotation

6 changes: 3 additions & 3 deletions docs/day3/ex14_gene_annotation_part2.md
@@ -173,7 +173,7 @@ nano annotate_dramv.sl
#SBATCH --account nesi02659
#SBATCH --job-name annotate_DRAMv
#SBATCH --time 02:00:00
#SBATCH --mem 4Gb
#SBATCH --mem 10Gb
#SBATCH --cpus-per-task 12
#SBATCH --error %x_%A.err
#SBATCH --output %x_%A.out
@@ -416,8 +416,8 @@ It is now time to select the goals to investigate the genomes you have been work

Depending on what you are looking for, you will either be trying to find gene(s) of relevance to a particular functional pathway, or the omission of genes that might be critical in function. In either case, make sure to use the taxonomy of each MAG to determine whether it is likely to be a worthwhile candidate for exploration, as some of these traits are quite restricted in terms of which organisms carry them.

To conduct this exercise, you should use the information generated with ```DRAM``` as well as the annotation files we created previously that will be available in the directory ```10.gene_annotation_and_coverage/gene_annotations```.
To conduct this exercise, you should use the information generated with `DRAM` as well as the annotation files we created previously that will be available in the directory `10.gene_annotation_and_coverage/gene_annotations`.

Please note that we have also provided further annotation files within the directory ```10.gene_annotation_and_coverage/example_annotation_tables``` that contain information obtained after annotating the MAGs against additional databases (UniProt, UniRef100, KEGG, PFAM and TIGRfam). These example files can also be downloaded from [here](../resources/example_annotation_tables.zip). These files were created using an in-house Python script designed to aggregate different annotations as part of the environmental metagenomics workflow followed in Handley's lab. The script, along with information on how to use it, is available [here](https://github.com/GenomicsAotearoa/environmental_metagenomics/blob/master/metagenomic_annotation/3.aggregation.md).
Please note that we have also provided further annotation files within the directory `10.gene_annotation_and_coverage/example_annotation_tables` that contain information obtained after annotating the MAGs against additional databases (UniProt, UniRef100, KEGG, PFAM and TIGRfam). These example files can also be downloaded from [here](../resources/example_annotation_tables.zip). These files were created using an in-house Python script designed to aggregate different annotations as part of the environmental metagenomics workflow followed in Handley's lab. The script, along with information on how to use it, is available [here](https://github.com/GenomicsAotearoa/environmental_metagenomics/blob/master/metagenomic_annotation/3.aggregation.md).
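
As a starting point for finding genes of interest, a keyword search across the annotation files can show which files mention a given function. The sketch below assumes the annotation files are plain text (which `grep` can read directly); `find_annotation_hits` is a hypothetical helper name and `nitrogenase` is only an illustrative search term.

```bash
# Hypothetical helper: report how many lines in each file under a directory
# mention a keyword (case-insensitive), skipping files with no matches.
# Assumption: the annotation files are plain text.
find_annotation_hits() {
    local keyword="$1" annot_dir="$2"
    # -r: recurse, -i: ignore case, -c: count matching lines, -H: always show the file name
    grep -r -i -c -H -- "$keyword" "$annot_dir" | grep -v ':0$'
}

# Only run the search if the annotation directory exists.
if [ -d 10.gene_annotation_and_coverage/gene_annotations ]; then
    find_annotation_hits "nitrogenase" 10.gene_annotation_and_coverage/gene_annotations
fi
```

Each output line has the form `path:count`. A keyword match only flags a candidate, so check the matching lines themselves (for example with `grep -i` on the individual file) before drawing conclusions about a MAG.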

---
4 changes: 2 additions & 2 deletions docs/day4/ex16b_data_presentation_Coverage.md
@@ -31,7 +31,7 @@

### Part 1 - Building a heatmap of MAG coverage per sample

To get started, if you're not already, log back in to NeSI's [Jupyter hub](https://jupyter.nesi.org.nz/hub/login) and open a `Notebook` running the `R 4.0.1` module as the kernel (or, outside the context of this workshop, open `RStudio` with the required packages installed (see the [data presentation intro](../day4/ex16a_data_presentation_Intro.md) docs for more information)).
To get started, if you're not already, log back in to NeSI's [Jupyter hub](https://jupyter.nesi.org.nz/hub/login) and make sure you are working within RStudio with the required packages installed (see the [data presentation intro](../day4/ex16a_data_presentation_Intro.md) for more information).

#### 1.1 Prepare environment

@@ -470,7 +470,7 @@ We then obtain the coverage matrix and transform the values to enhance visualisa
plot(
virus_hclust,
main = "Bray-Curtis dissimilarities between viruses",
xlab = "MAGs",
xlab = "Viral contigs",
ylab = "Height",
sub = "Method: average linkage",
hang = -1,