Merge pull request #51 from GenomicsAotearoa/corrections_2023
Final aesthetics and fixes
JSBoey authored Sep 2, 2023
2 parents 451c497 + d4d35f2 commit 1d27010
Showing 31 changed files with 465 additions and 434 deletions.
2 changes: 1 addition & 1 deletion docs/day1/ex1_bash_and_scheduler.md
@@ -1,4 +1,4 @@
# Introduction to shell
# Introduction I: Shell

!!! warning ""
    This lesson will be covered/referred to during pre-Summer School sessions. We will start **Day 1** with [Introduction to HPC & HPC job scheduler](https://genomicsaotearoa.github.io/metagenomics_summer_school/day1/ex2_1_intro_to_scheduler/)
2 changes: 1 addition & 1 deletion docs/day1/ex2_1_intro_to_scheduler.md
@@ -1,4 +1,4 @@
# Introduction to HPC & HPC job scheduler
# Introduction II: HPC and job scheduler

<center>![image](../theme_images/scaling.png){width="300"}</center>

32 changes: 7 additions & 25 deletions docs/day1/ex2_quality_filtering.md
@@ -1,11 +1,11 @@
# Quality filtering raw reads
# Filter raw reads by quality

!!! info "Objectives"

* [Visualising raw reads with `FastQC`](#visualising-raw-reads)
* [Read trimming and adapter removal with `trimmomatic`](#read-trimming-and-adapter-removal-with-trimmomatic)
* [Diagnosing poor libraries](#diagnosing-poor-libraries)
* [Understand common issues and best practices](#understand-common-issues-and-best-practices)
* [Understand common issues and best practices](#understanding-common-issues-and-best-practices)
* [*Optional*: Filtering out host DNA with `BBMap`](#optional-filtering-out-host-dna)

<center>
@@ -224,7 +224,7 @@ There is always some subjectivity in how sensitive you want your adapter (and ba

---

### Diagnosing poor libraries
## Diagnosing poor libraries

Whether a library is 'poor' quality or not can be a bit subjective. These are some aspects of the library that you should be looking for when evaluating `FastQC`:

@@ -242,7 +242,7 @@ Whether a library is 'poor' quality or not can be a bit subjective. These are so

---

### Understand common issues and best practices
## Understanding common issues and best practices

!!! success ""

@@ -315,12 +315,6 @@ We will cover more about read mapping in [later exercises](https://genomicsaotea

Build the reference index via `BBMap`. We will do this by submitting the job via slurm.
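At its core, the indexing script runs a single `bbmap.sh` call; a minimal sketch (the masked reference filename is an assumption, and the memory/thread values simply mirror the slurm headers below) looks like:

```bash
# Build a BBMap index from the masked host reference; BBMap writes the index to ./ref/
module load BBMap
bbmap.sh -Xmx32g t=$SLURM_CPUS_PER_TASK ref=host_genome_masked.fna
```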

<!--
!!! note "Note"
See [Preparing an assembly job for slurm](https://genomicsaotearoa.github.io/metagenomics_summer_school/day1/ex3_assembly/) for more information about how to submit a job via slurm.
-->

Create a new script named `host_filt_bbmap_index.sl` using `nano`:

!!! terminal "code"
@@ -329,9 +323,7 @@
nano host_filt_bbmap_index.sl
```

!!! warning "Warning"

Paste or type in the following. Remember to update `<YOUR FOLDER>` to your own directory.
!!! warning "Remember to update `<YOUR FOLDER>` to your own folder"

!!! terminal "code"

@@ -340,6 +332,7 @@ Create a new script named `host_filt_bbmap_index.sl` using `nano`:

#SBATCH --account nesi02659
#SBATCH --job-name host_filt_bbmap_index
#SBATCH --partition milan
#SBATCH --time 00:20:00
#SBATCH --mem 32GB
#SBATCH --cpus-per-task 12
@@ -384,6 +377,7 @@ Again, we will create a script using `nano`:

#SBATCH --account nesi02659
#SBATCH --job-name host_filt_bbmap_map
#SBATCH --partition milan
#SBATCH --time 01:00:00
#SBATCH --mem 27GB
#SBATCH --array 1-4
@@ -420,18 +414,6 @@ Again, we will create a script using `nano`:
| `path` | The parent directory where `ref/` (our indexed and masked reference) exists |
| `outu1` / `outu2` | Reads that *were not mapped* to our masked reference, written to `host_filtered_reads/` |
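Putting these options together, the core of the mapping step is one `bbmap.sh` call per sample; a minimal sketch (sample naming, the `path=` directory, and any extra sensitivity flags are assumptions) looks like:

```bash
# Map quality-filtered reads to the masked host reference and keep only read
# pairs that did NOT map (i.e. the non-host reads)
bbmap.sh -Xmx27g t=$SLURM_CPUS_PER_TASK \
    path=host_ref/ \
    in1=../3.assembly/sample1_R1.fastq.gz in2=../3.assembly/sample1_R2.fastq.gz \
    outu1=host_filtered_reads/sample1_R1.fastq.gz \
    outu2=host_filtered_reads/sample1_R2.fastq.gz
```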

<!--
Breaking down this command a little:
!!! quote ""
- We pass the path to the `ref` directory (the reference we just built) to `path=...`.
- Provide quality-filtered reads as input (i.e. output of the `trimmomatic` process above). In this case, we will provide the FASTQ files located in `../3.assembly/` which have been processed via `trimmomatic` in the same manner as the exercise above. These are four sets of paired reads (representing metagenome data from four "samples") that the remainder of the workshop exercises will be working with.
- The flags `-Xmx27g` and `-t=$SLURM_CPUS_PER_TASK` set the maximum memory and thread (AKA CPUs) allocations, and must match the `--mem` and `--cpus_per_task` allocations in the slurm headers at the top of the script.
- The rest of the settings in the `BBMap` call here are as per the recommendations found within [this thread](http://seqanswers.com/forums/showthread.php?t=42552) about processing data to remove host reads.
- Finally, the filtered output FASTQ files for downstream use are written to the `host_filtered_reads/` directory (taken from the outputs `outu1=` and `outu2=`, which include only those reads that did not map to the host reference genome).
-->

We'll submit the mapping script:

!!! terminal "code"
6 changes: 4 additions & 2 deletions docs/day1/ex3_assembly.md
@@ -1,4 +1,4 @@
# Assembly
# Assembly I: Assembling contigs

!!! info "Objectives"

@@ -292,6 +292,7 @@ Into this file, either write or copy/paste the following commands:

#SBATCH --account nesi02659
#SBATCH --job-name spades_assembly
#SBATCH --partition milan
#SBATCH --time 00:30:00
#SBATCH --mem 10GB
#SBATCH --cpus-per-task 12
@@ -389,7 +390,7 @@ We can see here that the job has not yet begun, as NeSI is waiting for resources

This allows us to track how far into our run we are and see the remaining time for the job. The `START_TIME` column now reports the time the job actually began.

## Submitting an `IDBA-UD` job to NeSI using slurm
### Submitting an `IDBA-UD` job to NeSI using slurm

!!! terminal-2 "Create a new slurm script using `nano` to run an equivalent assembly with `IDBA-UD`"

@@ -406,6 +407,7 @@

#SBATCH --account nesi02659
#SBATCH --job-name idbaud_assembly
#SBATCH --partition milan
#SBATCH --time 00:20:00
#SBATCH --mem 4GB
#SBATCH --cpus-per-task 12
6 changes: 3 additions & 3 deletions docs/day1/ex4_assembly.md
@@ -1,14 +1,14 @@
# Assembly (part 2)
# Assembly II: Varying the parameters

!!! info "Objectives"

* [Examine the effect of changing parameters for assembly](#examine-the-effect-of-changing-assembly-parameters)
* [Examining the effect of changing parameters for assembly](#examining-the-effect-of-changing-assembly-parameters)

All work for this exercise will occur in the `3.assembly/` directory.

---

## Examine the effect of changing assembly parameters
## Examining the effect of changing assembly parameters

For this exercise, there is no real structure. Make a few copies of your initial slurm scripts and tweak a few of the assembly parameters. You will have a chance to compare the effects of these changes tomorrow.
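As a concrete illustration, one tweak is to re-run `SPAdes` with a different k-mer series; the copy-and-edit pattern below is only a sketch (the script name and the exact `spades.py` flags in your original script are assumptions):

```bash
# Copy the original assembly script, then edit the copy
cp spades_assembly.sl spades_assembly_k21-55.sl
nano spades_assembly_k21-55.sl

# e.g. inside the copy, change the k-mer series passed to SPAdes:
#   spades.py --meta -k 21,33,55 ...
# Remember to also change --job-name and the output directory so runs do not collide
```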

17 changes: 9 additions & 8 deletions docs/day1/ex5_evaluating_assemblies.md
@@ -1,18 +1,18 @@
# Evaluating the assemblies
# Assembly evaluation

!!! info "Objectives"

* [Evaluate the resource consumption of various assemblies](#evaluate-the-resource-consumption-of-various-assemblies)
* [Evaluate the assemblies](#evaluate-the-assemblies)
* Future considerations
* [Evaluating the resource consumption of various assemblies](#evaluating-the-resource-consumption-of-various-assemblies)
* [Evaluating the assemblies using `BBMap`](#evaluating-the-assemblies-using-bbmap)
* [*(Optional)* Evaluating assemblies using `MetaQUAST`](#optional-evaluating-assemblies-using-metaquast)

---

<center>
![image](../theme_images/eval_assembly.png){width="450"}
</center>

## Evaluate the resource consumption of various assemblies
## Evaluating the resource consumption of various assemblies

Check to see if your assembly jobs have completed. If you have multiple jobs running or queued, the easiest way to check this is to simply run the `squeue` command.
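If it has been a while since you submitted the jobs, the two commands below are handy (a generic Slurm sketch; the job ID is a placeholder and output columns vary by site):

```bash
# List only your own queued/running jobs
squeue -u $USER

# Once a job has finished, query its elapsed time and memory use
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State
```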

@@ -84,7 +84,7 @@ CPU efficiency is harder to interpret as it can be impacted by the behaviour of

---

## Evaluate the assemblies using `BBMap`
## Evaluating the assemblies using `BBMap`

Evaluating the quality of a raw metagenomic assembly is quite a tricky process. Since, by definition, our community is a mixture of different organisms, the genomes from some of these organisms assemble better than those of others. It is possible to have an assembly that looks 'bad' by traditional metrics that still yields high-quality genomes from individual species, and the converse is also true.

@@ -195,7 +195,8 @@ This gives quite a verbose output:
1 MB 1 2 1,221,431 1,221,421 100.00%
```

!!! danger "N50 and L50 in BBMap"
!!! danger "N50 and L50 in `BBMap`"

Unfortunately, the N50 and L50 values generated by `stats.sh` are switched. N50 should be a length and L50 should be a count. The results table below shows the corrected values based on `stats.sh` outputs.

But what we can highlight here is that the statistics for the `SPAdes` assembly, with short contigs removed, yielded an N50 of 72.5 kbp at the contig level. We will now compute those same statistics from the other assembly options.
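To generate those statistics for another assembly we simply point `stats.sh` at a different contig file; a sketch (the `IDBA-UD` output path is an assumption):

```bash
# Summarise a second assembly with BBMap's stats.sh
module load BBMap
stats.sh in=idbaud_assembly/contig.fa
```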
Expand Down Expand Up @@ -241,12 +242,12 @@ However, since we **_do_** know the composition of the original communities used

#SBATCH --account nesi02659
#SBATCH --job-name metaquast
#SBATCH --partition milan
#SBATCH --time 00:15:00
#SBATCH --mem 4GB
#SBATCH --cpus-per-task 10
#SBATCH --error %x_%j.err
#SBATCH --output %x_%j.out
#SBATCH --partition milan

# Load module
module purge
6 changes: 4 additions & 2 deletions docs/day2/ex6_initial_binning.md
@@ -82,6 +82,7 @@ Open a new script using nano:

#SBATCH --account nesi02659
#SBATCH --job-name spades_mapping
#SBATCH --partition milan
#SBATCH --time 00:05:00
#SBATCH --mem 1GB
#SBATCH --cpus-per-task 10
@@ -136,7 +137,7 @@ For large sets of files, it can be beneficial to use a slurm *array* to send the
|`-1` / `-2`|The forward and reverse read pairs to map to the assembly|
|`-S`|Name of the output file, to be written in *sam* format|

### Step 3 - Sorting and compressing results
### Step 3 - Sort and compress results

The default output format for most mapping tools is the Sequence Alignment/Map (*sam*) format. This is a compact text representation of where each short read sits in the contigs. You can view this file using any text viewer, although owing to the file size `less` is a good idea.
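A typical way to compress and sort the mapping output is with `samtools` (a sketch; the sample name, thread count, and module call are assumptions rather than the workshop's exact commands):

```bash
# Convert the sam file to a coordinate-sorted bam, then index it
module load SAMtools
samtools sort -@ 4 -o sample1.sorted.bam sample1.sam
samtools index sample1.sorted.bam
```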

@@ -192,7 +193,7 @@ Sorting the mapping information is an important prerequisite for performing cert

---

## *(Optional)* Map reads using an array
## *(Optional)* Read mapping using an array

If you have a large number of files to process, it might be worth using a slurm array to distribute your individual mapping jobs across many separate nodes. An example script for how to perform this is given below. We do not need to use an array for read mapping in this workshop, but we will revisit array jobs in further lessons.
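The essence of the array approach is that each task uses `$SLURM_ARRAY_TASK_ID` to pick its own sample; a minimal sketch of the per-task mapping step (the index name, sample naming, and array range are assumptions) looks like:

```bash
# With #SBATCH --array 0-3, each task maps one of sample1..sample4
sample="sample$((SLURM_ARRAY_TASK_ID + 1))"

bowtie2 -p $SLURM_CPUS_PER_TASK -x bw_spades \
    -1 ${sample}_R1.fastq.gz -2 ${sample}_R2.fastq.gz \
    -S ${sample}.sam
```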

@@ -213,6 +214,7 @@ Open a new script using nano:

#SBATCH --account nesi02659
#SBATCH --job-name spades_mapping_array
#SBATCH --partition milan
#SBATCH --time 00:20:00
#SBATCH --mem 20GB
#SBATCH --array 0-3
17 changes: 11 additions & 6 deletions docs/day2/ex7_initial_binning.md
@@ -1,4 +1,4 @@
# Binning (continued)
# Binning with multiple tools

!!! info "Objectives"

@@ -14,26 +14,31 @@ With the mapping information computed in the last exercise, we can now perform b

In our own workflow, we use the tools `MetaBAT`, `MaxBin`, and `CONCOCT` for binning, but there are many alternatives that are equally viable. In the interests of time, we are only going to demonstrate the first two tools. However, we recommend that you experiment with some of the following tools when conducting your own research.

1. [GroopM](http://ecogenomics.github.io/GroopM/)
1. [Tetra-ESOM](https://github.com/tetramerFreqs/Binning)
1. [VAMB](https://github.com/RasmussenLab/vamb)

---

## *MetaBAT*
## `MetaBAT`

`MetaBAT` binning occurs in two steps. First, the *bam* files from the last exercise are parsed into a tab-delimited table of the average coverage depth and variance per sample mapped. Binning is then performed using this table.

The *.bam* files can be passed in either in a user-defined order or via wildcards.

!!! warning "Remember to update `<YOUR FOLDER>` to your own folder"

!!! terminal-2 "Navigate to working directory"

```bash
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/5.binning/
```

!!! terminal "code"

```bash
module purge
module load MetaBAT/2.15-GCC-11.3.0

cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/5.binning/

# Manual specification of files
jgi_summarize_bam_contig_depths --outputDepth metabat.txt sample1.bam sample2.bam sample3.bam sample4.bam
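# (Sketch) Step 2: bin contigs using the depth table. The assembly filename and
# output prefix below are placeholders, not the workshop's exact command.
metabat2 -i <assembly_contigs.fna> -a metabat.txt -o metabat/metabat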

@@ -78,7 +83,7 @@ The problem with this is that on Linux systems, prefixing a file or folder name

---

## *MaxBin*
## `MaxBin`

Like `MetaBAT`, `MaxBin` requires a text representation of the coverage information for binning. Luckily, we can be sneaky here and just reformat the `metabat.txt` file into the format expected by `MaxBin`. We use `cut` to select only the columns of interest, which are the *contigName* and coverage columns, but not the *contigLen*, *totalAvgDepth*, or variance columns.
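In practice that reformatting is a single `cut` call; a sketch, assuming four samples and the column layout described above (contigName, contigLen, totalAvgDepth, then alternating depth/variance columns per sample; the output filename is arbitrary):

```bash
# Keep the contig names (column 1) and the four per-sample depth columns
cut -f1,4,6,8,10 metabat.txt > maxbin.txt
```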

