Merge pull request #51 from GenomicsAotearoa/corrections_2023
Final aesthetics and fixes
JSBoey authored Sep 2, 2023
2 parents 451c497 + d4d35f2 commit 1d27010
Showing 31 changed files with 465 additions and 434 deletions.
2 changes: 1 addition & 1 deletion docs/day1/ex1_bash_and_scheduler.md
@@ -1,4 +1,4 @@
# Introduction to shell
# Introduction I: Shell

!!! warning ""
    This lesson will be covered/referred to during pre-Summer School sessions. We will start **Day 1** with [Introduction to HPC & HPC job scheduler](https://genomicsaotearoa.github.io/metagenomics_summer_school/day1/ex2_1_intro_to_scheduler/)
2 changes: 1 addition & 1 deletion docs/day1/ex2_1_intro_to_scheduler.md
@@ -1,4 +1,4 @@
# Introduction to HPC & HPC job scheduler
# Introduction II: HPC and job scheduler

<center>![image](../theme_images/scaling.png){width="300"}</center>

32 changes: 7 additions & 25 deletions docs/day1/ex2_quality_filtering.md
@@ -1,11 +1,11 @@
# Quality filtering raw reads
# Filter raw reads by quality

!!! info "Objectives"

* [Visualising raw reads with `FastQC`](#visualising-raw-reads)
* [Read trimming and adapter removal with `trimmomatic`](#read-trimming-and-adapter-removal-with-trimmomatic)
* [Diagnosing poor libraries](#diagnosing-poor-libraries)
* [Understand common issues and best practices](#understand-common-issues-and-best-practices)
* [Understand common issues and best practices](#understanding-common-issues-and-best-practices)
* [*Optional*: Filtering out host DNA with `BBMap`](#optional-filtering-out-host-dna)

<center>
@@ -224,7 +224,7 @@ There is always some subjectivity in how sensitive you want your adapter (and ba

---

### Diagnosing poor libraries
## Diagnosing poor libraries

Whether a library is 'poor' quality or not can be a bit subjective. These are some aspects of the library that you should be looking for when evaluating `FastQC`:

@@ -242,7 +242,7 @@ Whether a library is 'poor' quality or not can be a bit subjective. These are so

---

### Understand common issues and best practices
## Understanding common issues and best practices

!!! success ""

@@ -315,12 +315,6 @@ We will cover more about read mapping in [later exercises](https://genomicsaotea

Build the reference index via `BBMap`. We will do this by submitting the job via slurm.
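At its core, the indexing script runs a single `bbmap.sh` call; a minimal sketch (the masked reference filename is an assumption, and the memory/thread values simply mirror the slurm headers below) looks like:

```bash
# Build a BBMap index from the masked host reference; BBMap writes the index to ./ref/
module load BBMap
bbmap.sh -Xmx32g t=$SLURM_CPUS_PER_TASK ref=host_genome_masked.fna
```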

<!--
!!! note "Note"
See [Preparing an assembly job for slurm](https://genomicsaotearoa.github.io/metagenomics_summer_school/day1/ex3_assembly/) for more information about how to submit a job via slurm.
-->

Create a new script named `host_filt_bbmap_index.sl` using `nano`:

!!! terminal "code"
@@ -329,9 +323,7 @@
nano host_filt_bbmap_index.sl
```

!!! warning "Warning"

Paste or type in the following. Remember to update `<YOUR FOLDER>` to your own directory.
!!! warning "Remember to update `<YOUR FOLDER>` to your own folder"

!!! terminal "code"

@@ -340,6 +332,7 @@ Create a new script named `host_filt_bbmap_index.sl` using `nano`:

#SBATCH --account nesi02659
#SBATCH --job-name host_filt_bbmap_index
#SBATCH --partition milan
#SBATCH --time 00:20:00
#SBATCH --mem 32GB
#SBATCH --cpus-per-task 12
@@ -384,6 +377,7 @@ Again, we will create a script using `nano`:

#SBATCH --account nesi02659
#SBATCH --job-name host_filt_bbmap_map
#SBATCH --partition milan
#SBATCH --time 01:00:00
#SBATCH --mem 27GB
#SBATCH --array 1-4
@@ -420,18 +414,6 @@ Again, we will create a script using `nano`:
| `path` | The parent directory where `ref/` (our indexed and masked reference) exists |
| `outu1` / `outu2` | Reads that *were not mapped* to our masked reference, written to `host_filtered_reads/` |
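Putting these options together, the core of the mapping step is one `bbmap.sh` call per sample; a minimal sketch (sample naming, the `path=` directory, and any extra sensitivity flags are assumptions) looks like:

```bash
# Map quality-filtered reads to the masked host reference and keep only read
# pairs that did NOT map (i.e. the non-host reads)
bbmap.sh -Xmx27g t=$SLURM_CPUS_PER_TASK \
    path=host_ref/ \
    in1=../3.assembly/sample1_R1.fastq.gz in2=../3.assembly/sample1_R2.fastq.gz \
    outu1=host_filtered_reads/sample1_R1.fastq.gz \
    outu2=host_filtered_reads/sample1_R2.fastq.gz
```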

<!--
Breaking down this command a little:
!!! quote ""
- We pass the path to the `ref` directory (the reference we just built) to `path=...`.
- Provide quality-filtered reads as input (i.e. output of the `trimmomatic` process above). In this case, we will provide the FASTQ files located in `../3.assembly/` which have been processed via `trimmomatic` in the same manner as the exercise above. These are four sets of paired reads (representing metagenome data from four "samples") that the remainder of the workshop exercises will be working with.
- The flags `-Xmx27g` and `-t=$SLURM_CPUS_PER_TASK` set the maximum memory and thread (AKA CPUs) allocations, and must match the `--mem` and `--cpus_per_task` allocations in the slurm headers at the top of the script.
- The rest of the settings in the `BBMap` call here are as per the recommendations found within [this thread](http://seqanswers.com/forums/showthread.php?t=42552) about processing data to remove host reads.
- Finally, the filtered output FASTQ files for downstream use are written to the `host_filtered_reads/` directory (taken from the outputs `outu1=` and `outu2=`, which include only those reads that did not map to the host reference genome).
-->

We'll submit the mapping script:

!!! terminal "code"
6 changes: 4 additions & 2 deletions docs/day1/ex3_assembly.md
@@ -1,4 +1,4 @@
# Assembly
# Assembly I: Assembling contigs

!!! info "Objectives"

@@ -292,6 +292,7 @@ Into this file, either write or copy/paste the following commands:

#SBATCH --account nesi02659
#SBATCH --job-name spades_assembly
#SBATCH --partition milan
#SBATCH --time 00:30:00
#SBATCH --mem 10GB
#SBATCH --cpus-per-task 12
@@ -389,7 +390,7 @@ We can see here that the job has not yet begun, as NeSI is waiting for resources

This allows us to track how far into our run we are and see the remaining time for the job. The `START_TIME` column now reports the time the job actually began.

## Submitting an `IDBA-UD` job to NeSI using slurm
### Submitting an `IDBA-UD` job to NeSI using slurm

!!! terminal-2 "Create a new slurm script using `nano` to run an equivalent assembly with `IDBA-UD`"

@@ -406,6 +407,7 @@

#SBATCH --account nesi02659
#SBATCH --job-name idbaud_assembly
#SBATCH --partition milan
#SBATCH --time 00:20:00
#SBATCH --mem 4GB
#SBATCH --cpus-per-task 12
6 changes: 3 additions & 3 deletions docs/day1/ex4_assembly.md
@@ -1,14 +1,14 @@
# Assembly (part 2)
# Assembly II: Varying the parameters

!!! info "Objectives"

* [Examine the effect of changing parameters for assembly](#examine-the-effect-of-changing-assembly-parameters)
* [Examining the effect of changing parameters for assembly](#examining-the-effect-of-changing-assembly-parameters)

All work for this exercise will occur in the `3.assembly/` directory.

---

## Examine the effect of changing assembly parameters
## Examining the effect of changing assembly parameters

For this exercise, there is no real structure. Make a few copies of your initial slurm scripts and tweak a few of the assembly parameters. You will have a chance to compare the effects of these changes tomorrow.
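As a concrete illustration, one tweak is to re-run `SPAdes` with a different k-mer series; the copy-and-edit pattern below is only a sketch (the script name and the exact `spades.py` flags in your original script are assumptions):

```bash
# Copy the original assembly script, then edit the copy
cp spades_assembly.sl spades_assembly_k21-55.sl
nano spades_assembly_k21-55.sl

# e.g. inside the copy, change the k-mer series passed to SPAdes:
#   spades.py --meta -k 21,33,55 ...
# Remember to also change --job-name and the output directory so runs do not collide
```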

17 changes: 9 additions & 8 deletions docs/day1/ex5_evaluating_assemblies.md
@@ -1,18 +1,18 @@
# Evaluating the assemblies
# Assembly evaluation

!!! info "Objectives"

* [Evaluate the resource consumption of various assemblies](#evaluate-the-resource-consumption-of-various-assemblies)
* [Evaluate the assemblies](#evaluate-the-assemblies)
* Future considerations
* [Evaluating the resource consumption of various assemblies](#evaluating-the-resource-consumption-of-various-assemblies)
* [Evaluating the assemblies using `BBMap`](#evaluating-the-assemblies-using-bbmap)
* [*(Optional)* Evaluating assemblies using `MetaQUAST`](#optional-evaluating-assemblies-using-metaquast)

---

<center>
![image](../theme_images/eval_assembly.png){width="450"}
</center>

## Evaluate the resource consumption of various assemblies
## Evaluating the resource consumption of various assemblies

Check to see if your assembly jobs have completed. If you have multiple jobs running or queued, the easiest way to check this is to simply run the `squeue` command.
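If it has been a while since you submitted the jobs, the two commands below are handy (a generic Slurm sketch; the job ID is a placeholder and output columns vary by site):

```bash
# List only your own queued/running jobs
squeue -u $USER

# Once a job has finished, query its elapsed time and memory use
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State
```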

@@ -84,7 +84,7 @@ CPU efficiency is harder to interpret as it can be impacted by the behaviour of

---

## Evaluate the assemblies using `BBMap`
## Evaluating the assemblies using `BBMap`

Evaluating the quality of a raw metagenomic assembly is quite a tricky process. Since, by definition, our community is a mixture of different organisms, the genomes from some of these organisms assemble better than those of others. It is possible to have an assembly that looks 'bad' by traditional metrics that still yields high-quality genomes from individual species, and the converse is also true.

@@ -195,7 +195,8 @@ This gives quite a verbose output:
1 MB 1 2 1,221,431 1,221,421 100.00%
```

!!! danger "N50 and L50 in BBMap"
!!! danger "N50 and L50 in `BBMap`"

Unfortunately, the N50 and L50 values generated by `stats.sh` are switched. N50 should be a length and L50 should be a count. The results table below shows the corrected values based on `stats.sh` outputs.

But what we can highlight here is that the statistics for the `SPAdes` assembly, with short contigs removed, yielded an N50 of 72.5 kbp at the contig level. We will now compute those same statistics from the other assembly options.
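To generate those statistics for another assembly we simply point `stats.sh` at a different contig file; a sketch (the `IDBA-UD` output path is an assumption):

```bash
# Summarise a second assembly with BBMap's stats.sh
module load BBMap
stats.sh in=idbaud_assembly/contig.fa
```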
Expand Down Expand Up @@ -241,12 +242,12 @@ However, since we **_do_** know the composition of the original communities used

#SBATCH --account nesi02659
#SBATCH --job-name metaquast
#SBATCH --partition milan
#SBATCH --time 00:15:00
#SBATCH --mem 4GB
#SBATCH --cpus-per-task 10
#SBATCH --error %x_%j.err
#SBATCH --output %x_%j.out
#SBATCH --partition milan

# Load module
module purge
6 changes: 4 additions & 2 deletions docs/day2/ex6_initial_binning.md
@@ -82,6 +82,7 @@ Open a new script using nano:

#SBATCH --account nesi02659
#SBATCH --job-name spades_mapping
#SBATCH --partition milan
#SBATCH --time 00:05:00
#SBATCH --mem 1GB
#SBATCH --cpus-per-task 10
@@ -136,7 +137,7 @@ For large sets of files, it can be beneficial to use a slurm *array* to send the
|`-1` / `-2`|The forward and reverse read pairs to map to the assembly|
|`-S`|Name of the output file, to be written in *sam* format|

### Step 3 - Sorting and compressing results
### Step 3 - Sort and compress results

The default output format for most mapping tools is the Sequence Alignment/Map (*sam*) format. This is a compact text representation of where each short read sits in the contigs. You can view this file using any text viewer, although owing to the file size `less` is a good idea.
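A typical way to compress and sort the mapping output is with `samtools` (a sketch; the sample name, thread count, and module call are assumptions rather than the workshop's exact commands):

```bash
# Convert the sam file to a coordinate-sorted bam, then index it
module load SAMtools
samtools sort -@ 4 -o sample1.sorted.bam sample1.sam
samtools index sample1.sorted.bam
```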

@@ -192,7 +193,7 @@ Sorting the mapping information is an important prerequisite for performing cert

---

## *(Optional)* Map reads using an array
## *(Optional)* Read mapping using an array

If you have a large number of files to process, it might be worth using a slurm array to distribute your individual mapping jobs across many separate nodes. An example script for how to perform this is given below. We do not need to use an array for read mapping in this workshop, but we will revisit array jobs in further lessons.
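The essence of the array approach is that each task uses `$SLURM_ARRAY_TASK_ID` to pick its own sample; a minimal sketch of the per-task mapping step (the index name, sample naming, and array range are assumptions) looks like:

```bash
# With #SBATCH --array 0-3, each task maps one of sample1..sample4
sample="sample$((SLURM_ARRAY_TASK_ID + 1))"

bowtie2 -p $SLURM_CPUS_PER_TASK -x bw_spades \
    -1 ${sample}_R1.fastq.gz -2 ${sample}_R2.fastq.gz \
    -S ${sample}.sam
```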

@@ -213,6 +214,7 @@ Open a new script using nano:

#SBATCH --account nesi02659
#SBATCH --job-name spades_mapping_array
#SBATCH --partition milan
#SBATCH --time 00:20:00
#SBATCH --mem 20GB
#SBATCH --array 0-3
17 changes: 11 additions & 6 deletions docs/day2/ex7_initial_binning.md
@@ -1,4 +1,4 @@
# Binning (continued)
# Binning with multiple tools

!!! info "Objectives"

@@ -14,26 +14,31 @@ With the mapping information computed in the last exercise, we can now perform b

In our own workflow, we use the tools `MetaBAT`, `MaxBin`, and `CONCOCT` for binning, but there are many alternatives that are equally viable. In the interests of time, we are only going to demonstrate the first two tools. However, we recommend that you experiment with some of the following tools when conducting your own research.

1. [GroopM](http://ecogenomics.github.io/GroopM/)
1. [Tetra-ESOM](https://github.com/tetramerFreqs/Binning)
1. [VAMB](https://github.com/RasmussenLab/vamb)

---

## *MetaBAT*
## `MetaBAT`

`MetaBAT` binning occurs in two steps. First, the *bam* files from the last exercise are parsed into a tab-delimited table of the average coverage depth and variance per sample mapped. Binning is then performed using this table.

The *.bam* files can be passed in either in a user-defined order or via wildcards.

!!! warning "Remember to update `<YOUR FOLDER>` to your own folder"

!!! terminal-2 "Navigate to working directory"

```bash
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/5.binning/
```

!!! terminal "code"

```bash
module purge
module load MetaBAT/2.15-GCC-11.3.0

cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/5.binning/

# Manual specification of files
jgi_summarize_bam_contig_depths --outputDepth metabat.txt sample1.bam sample2.bam sample3.bam sample4.bam
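# (Sketch) Step 2: bin contigs using the depth table. The assembly filename and
# output prefix below are placeholders, not the workshop's exact command.
metabat2 -i <assembly_contigs.fna> -a metabat.txt -o metabat/metabat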

@@ -78,7 +83,7 @@ The problem with this is that on Linux systems, prefixing a file or folder name

---

## *MaxBin*
## `MaxBin`

Like `MetaBAT`, `MaxBin` requires a text representation of the coverage information for binning. Luckily, we can be sneaky here and just reformat the `metabat.txt` file into the format expected by `MaxBin`. We use `cut` to select only the columns of interest, which are the *contigName* and coverage columns, but not the *contigLen*, *totalAvgDepth*, or variance columns.
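In practice that reformatting is a single `cut` call; a sketch, assuming four samples and the column layout described above (contigName, contigLen, totalAvgDepth, then alternating depth/variance columns per sample; the output filename is arbitrary):

```bash
# Keep the contig names (column 1) and the four per-sample depth columns
cut -f1,4,6,8,10 metabat.txt > maxbin.txt
```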

