Merge pull request #45 from GenomicsAotearoa/corrections_2023

Update based on Mike's comments
GenomicsAotearoa · Aug 30, 2023 · bd1caf2
2 parents d9c3407 + d2acc36
commit bd1caf2

Showing 7 changed files with 28 additions and 30 deletions.
diff --git a/docs/day1/ex5_evaluating_assemblies.md b/docs/day1/ex5_evaluating_assemblies.md
@@ -182,7 +182,7 @@ For more genome-informed evaluation of the assembly, we can use the `MetaQUAST`
     1. [MeTaxa2](https://microbiology.se/software/metaxa2/) or [SingleM](https://github.com/wwood/singlem) (DNA based, 16S rRNA recovery and classification)
     1. [MetaPhlAn2](http://huttenhower.sph.harvard.edu/metaphlan2) (DNA based, clade-specific marker gene classification)
 
-A good summary and comparison of these tools (and more) was recently published by [Ye *et al.*](https://www.ncbi.nlm.nih.gov/pubmed/31398336).
+A good summary and comparison of these tools (and more) was published by [Ye *et al.*](https://www.ncbi.nlm.nih.gov/pubmed/31398336).
 
 However, since we **_do_** know the composition of the original communities used to build this mock metagenome, `MetaQUAST` will work very well for us today. In your `4.evaluation/` directory you will find a file called `ref_genomes.txt`. This file contains the names of the genomes used to build these mock metagenomes. We will provide these as the reference input for `MetaQUAST`.
 
@@ -198,6 +198,7 @@ However, since we **_do_** know the composition of the original communities used
     #SBATCH --cpus-per-task 10
     #SBATCH --error         %x_%j.err
     #SBATCH --output        %x_%j.out
+    #SBATCH --partition     milan
 
     # Load module
     module purge

diff --git a/docs/day2/ex8_bin_dereplication.md b/docs/day2/ex8_bin_dereplication.md
@@ -121,36 +121,35 @@ Both `MetaBAT` and `MaxBin` have the option to output unbinned contigs after bin
 
 ### Bin dereplication using *DAS_Tool* - Running the tool
 
-We are now ready to run `DAS_Tool`. This can be done from the command line, as it does not take a particularly long time to run for this data set. Start by loading `DAS_Tool`.
-
-```bash
-module load DAS_Tool/1.1.5-gimkl-2022a-R-4.2.1
-```
-
-Depending on whether or not your session has been continued from previous exercises, you may encounter an error performing this module load. This is because some of the tools we have used in previous exercises have dependencies which conflict with the dependencies in `DAS_Tool`. If this is the case for you, you can unload all previous module loads with the following:
+We are now ready to run `DAS_Tool`. This can be done from the command line, as it does not take a particularly long time to run for this data set. 
 
 ```bash
+# Remove modules to ensure a clean environment
 module purge
 
+# Load DAS Tool
 module load DAS_Tool/1.1.5-gimkl-2022a-R-4.2.1
-module load DIAMOND/2.0.15-GCC-11.3.0
-module load USEARCH/11.0.667-i86linux32
-```
 
-`DAS_Tool` should now load without issue. With 2 threads, `DAS_Tool` should take 10 - 15 minutes to complete.
-
-```bash
 # Create DAS_Tool output directory
 mkdir -p dastool_out/
 
 # Run DAS_Tool
 DAS_Tool -i metabat_associations.txt,maxbin_associations.txt \
          -l MetaBAT,MaxBin \
-         -t 2 --write_bins --search_engine blastp \
+         -t 2 --write_bins --search_engine diamond \
          -c spades_assembly/spades_assembly.m1000.fna \
          -o dastool_out/
 ```
 
+```
+DAS Tool 1.1.5 
+Analyzing assembly 
+Predicting genes 
+Annotating single copy genes using diamond 
+Dereplicating, aggregating, and scoring bins 
+Writing bins
+```
+
 As usual, we will break down the parameters:
 
 |Parameter|Function|
@@ -163,11 +162,6 @@ As usual, we will break down the parameters:
 |**-c ...**|Path to the assembly used in binning|
 |**-o ..**|Output directory for all files|
 
-
-
-
-This is not a problem - `DAS_Tool` can use either `BLAST`, `diamond`, or `usearch` for performing its alignment operations. Regardless of which one you specify, it will search to see which ones are available. In this case, it is telling us that `diamond` and `usearch` cannot be found, which doesn't really matter because we have specified `BLAST` as our search engine.
-
 When `DAS_Tool` has completed, we will have a final set of bins located in the folder path `dastool_out/_DASTool_bins`. Have a look at the output and see which bins made it to the final selection. Did a single binning tool pick the best bins, or are the results a split between `MetaBAT` and `MaxBin`?
 
 ---
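
A minimal sketch for answering the question above, assuming the bin file names in `dastool_out/_DASTool_bins` contain the name of the binner that produced them. That naming is an assumption, not something the exercise confirms, so check with `ls` first and adjust the patterns if needed. The helper name `count_bins_by_tool` is hypothetical and introduced only for illustration.

```bash
# Count the final bins by the binner they came from.
# ASSUMPTION: each bin file name contains "metabat" or "maxbin" (case-insensitive).
# Usage: count_bins_by_tool dastool_out/_DASTool_bins
count_bins_by_tool() {
    local bin_dir="$1"
    local n_metabat n_maxbin

    # Only look at files directly inside the bin directory
    n_metabat=$(find "$bin_dir" -maxdepth 1 -type f -iname '*metabat*' | wc -l)
    n_maxbin=$(find "$bin_dir" -maxdepth 1 -type f -iname '*maxbin*' | wc -l)

    # Print a small tab-separated summary
    printf 'MetaBAT\t%d\nMaxBin\t%d\n' "$n_metabat" "$n_maxbin"
}
```

Running `count_bins_by_tool dastool_out/_DASTool_bins` after `DAS_Tool` has finished prints one line per binner, which makes it easy to see whether the final selection is dominated by one tool or split between the two.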

diff --git a/docs/day2/ex9_refining_bins.md b/docs/day2/ex9_refining_bins.md
@@ -31,9 +31,12 @@
 
 For future reference, and for working with your own data, a step-by-step process for generating these files from the curated bins generated by `DAS_Tool` has been provided as an [Appendix](../resources/2_APPENDIX_ex9_Generating_input_files_for_VizBin.md).
 
-Let's first have a quick look at the annotation file. 
+For this section, we will be working within `6.bin_refinement/`. Let's first have a quick look at the annotation file. 
 
 ```bash
+# Navigate to correct directory
+cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/6.bin_refinement
+
 head -n 5 all_bins.sample1.vizbin.ann
 
 # coverage,label,length

diff --git a/docs/day3/ex10_viruses.md b/docs/day3/ex10_viruses.md
@@ -182,7 +182,7 @@ less vConTACT2_Results/genome_by_genome_overview.csv
 ```
 
 ```bash
-less vConTACT2_Results/tax_predict_table.txt
+less vConTACT2_Results/tax_predict_table.tsv
 ```
 
 A few notes to consider: 
@@ -194,7 +194,7 @@ A few notes to consider:
         * Note also that these lines however will *not* contain taxonomy information. 
         * See the notes in the [Appendix](https://github.com/GenomicsAotearoa/metagenomics_summer_school/blob/master/materials/resources/APPENDIX_ex11_viral_taxonomy_prediction_via_vContact2.md) for further information about why this might be.
         
-    * As per the notes in the [Appendix](https://github.com/GenomicsAotearoa/metagenomics_summer_school/blob/master/materials/resources/APPENDIX_ex11_viral_taxonomy_prediction_via_vContact2.md), the `tax_predict_table.txt` file contains *predictions* of potential taxonomy (and or *taxonomies*) of the input viral contigs for order, family, and genus, based on whether they clustered with any viruses in the reference database.
+    * As per the notes in the [Appendix](https://github.com/GenomicsAotearoa/metagenomics_summer_school/blob/master/materials/resources/APPENDIX_ex11_viral_taxonomy_prediction_via_vContact2.md), the `tax_predict_table.tsv` file contains *predictions* of potential taxonomy (and or *taxonomies*) of the input viral contigs for order, family, and genus, based on whether they clustered with any viruses in the reference database.
         * Note that these may be lists of *multiple* potential taxonomies, in the cases where viral contigs clustered with multiple reference viruses representing more than one taxonomy at the given rank.
 
         !!! note "" 
@@ -221,7 +221,7 @@ Cytoscape
 
 !!! warning "Do not update Cytoscape!" 
 
-    A dialog box will appear telling you about a new version of Cytoscape. **Click "discard"**, as we will not be installing any new versions today!
+    A dialog box will appear telling you about a new version of Cytoscape. **Click "close"**, as we will not be installing any new versions today!
 
 In *Cytoscape*, we can load the gene-sharing network by clicking `File/Import/Network from file`, and then opening the `c1.ntw` file (You may need to click the `Home` button and then navigate to the relevant directory where `c1.ntw` is located (i.e. in `/nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/7.viruses/vConTACT2_Results/`)). 
 

diff --git a/docs/day3/ex13_gene_annotation_part1.md b/docs/day3/ex13_gene_annotation_part1.md
@@ -17,7 +17,7 @@
 
 ### *BLAST*-like gene annotations and domain annotations
 
-Broadly speaking, there are two ways we perform gene annotations with protein sequences. Both compare our sequences of interest against a curated set of protein sequences for which function is known, or is strongly suspected. In each case, there are particular strenths to the approach and for particular research questions, one option may be favoured over another.
+Broadly speaking, there are two ways we perform gene annotations with protein sequences. Both compare our sequences of interest against a curated set of protein sequences for which function is known, or is strongly suspected. In each case, there are particular strengths to the approach and for particular research questions, one option may be favoured over another.
 
 #### BLAST-like annotation
 

diff --git a/docs/day3/ex14_gene_annotation_part2.md b/docs/day3/ex14_gene_annotation_part2.md
@@ -173,7 +173,7 @@ nano annotate_dramv.sl
     #SBATCH --account       nesi02659
     #SBATCH --job-name      annotate_DRAMv
     #SBATCH --time          02:00:00
-    #SBATCH --mem           4Gb
+    #SBATCH --mem           10Gb
     #SBATCH --cpus-per-task 12
     #SBATCH --error         %x_%A.err
     #SBATCH --output        %x_%A.out
@@ -416,8 +416,8 @@ It is now time to select the goals to investigate the genomes you have been work
 
 Depending on what you are looking for, you will either be trying to find gene(s) of relevance to a particular functional pathway, or the omission of genes that might be critical in function. In either case, make sure to use the taxonomy of each MAG to determine whether it is likely to be a worthwhile candidate for exploration, as some of these traits are quite restricted in terms of which organisms carry them.
 
-To conduct this exersise, you should use the information generated with ```DRAM``` as well as the annotation files we created previously that will be available in the directory ```10.gene_annotation_and_coverage/gene_annotations```. 
+To conduct this exercise, you should use the information generated with `DRAM` as well as the annotation files we created previously that will be available in the directory `10.gene_annotation_and_coverage/gene_annotations`. 
 
-Please note that we have also provided further annotation files within the directory ```10.gene_annotation_and_coverage/example_annotation_tables``` that contain information obtained after annotating the MAGs against additional databases (UniProt, UniRef100, KEGG, PFAM and TIGRfam). These example files can also be downloaded from [here](../resources/example_annotation_tables.zip). These files were created by using an in-house python script designed to aggregate different annotations and as part of the environmental metagenomics worflow followed in Handley's lab. Information about using this script as well as the script is available [here](https://github.com/GenomicsAotearoa/environmental_metagenomics/blob/master/metagenomic_annotation/3.aggregation.md)  
+Please note that we have also provided further annotation files within the directory `10.gene_annotation_and_coverage/example_annotation_tables` that contain information obtained after annotating the MAGs against additional databases (UniProt, UniRef100, KEGG, PFAM and TIGRfam). These example files can also be downloaded from [here](../resources/example_annotation_tables.zip). These files were created by using an in-house Python script designed to aggregate different annotations as part of the environmental metagenomics workflow followed in Handley's lab. Information about using this script as well as the script is available [here](https://github.com/GenomicsAotearoa/environmental_metagenomics/blob/master/metagenomic_annotation/3.aggregation.md)  
 
 ---
diff --git a/docs/day4/ex16b_data_presentation_Coverage.md b/docs/day4/ex16b_data_presentation_Coverage.md
@@ -31,7 +31,7 @@
 
 ### Part 1 - Building a heatmap of MAG coverage per sample
 
-To get started, if you're not already, log back in to NeSI's [Jupyter hub](https://jupyter.nesi.org.nz/hub/login) and open a `Notebook` running the `R 4.0.1` module as the kernel (or, outside the context of this workshop, open `RStudio` with the required packages installed (see the [data presentation intro](../day4/ex16a_data_presentation_Intro.md) docs for more information)).
+To get started, if you're not already, log back in to NeSI's [Jupyter hub](https://jupyter.nesi.org.nz/hub/login) and make sure you are working within RStudio with the required packages installed (see the [data presentation intro](../day4/ex16a_data_presentation_Intro.md) for more information).
 
 #### 1.1 Prepare environment
 
@@ -470,7 +470,7 @@ We then obtain the coverage matrix and transform the values to enhance visualisa
     plot(
       virus_hclust,
       main = "Bray-Curtis dissimilarities between viruses",
-      xlab = "MAGs",
+      xlab = "Viral contigs",
       ylab = "Height",
       sub = "Method: average linkage",
       hang = -1,