Merge pull request #27 from Sydney-Informatics-Hub/part-2
Part 2 dry run changes
fredjaya authored Sep 23, 2024
2 parents 1b1bc9f + ffea3d6 commit 58a592e
Showing 6 changed files with 230 additions and 198 deletions.
14 changes: 4 additions & 10 deletions docs/part2/00_intro.md
You decide to use Nextflow.
2. Inspect the scripts (open in a VSCode tab, or text editor in the terminal).

Each script runs a single data processing step, and the scripts are run in the order of their numbered prefixes.

!!! quote "Poll"

    What are some limitations of these scripts in terms of running them in a
    pipeline and monitoring it?

    ??? note "Solution"

        * **No parallelism**: processes run iteratively, increasing overall runtime and limiting scalability.
        * **No error handling**: if a step fails, it may propagate errors or incomplete results into subsequent steps.
        * **No caching**: if a step fails, you either need to re-run from the beginning or edit the script to exclude the files that have already been processed.
        * **No resource management**: inefficient resource usage, with no guarantee processes can access the CPU, RAM, or disk space they need.
        * **No software management**: assumes the same environment is available every time it is run.

## 2.0.3 Our workflow: RNAseq data processing

45 changes: 31 additions & 14 deletions docs/part2/01_salmon_idx.md
Nextflow also handles whether the directory already exists or if it
should be created. In the `00_index.sh` script you had to manually make a
results directory with `mkdir -p "results"`.


More information and other modes can be found on
[publishDir](https://www.nextflow.io/docs/latest/process.html#publishdir).
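For instance, a process can publish copies of its outputs instead of the default symlinks by setting a `mode`. This is a minimal sketch, not part of the workshop files; the process name and output file are illustrative:

```groovy
process EXAMPLE {
    // Publish outputs to results/, copying files rather than symlinking
    publishDir "results", mode: 'copy'

    output:
    path "hello.txt"

    script:
    """
    echo hello > hello.txt
    """
}
```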

You now have a complete process!

## 2.1.3 Using containers

Nextflow recommends using containers to ensure reproducibility and portability
pre-installed on your Virtual Machine.
We can configure Nextflow to run containers with Docker by using the
`nextflow.config` file.

Create a `nextflow.config` file in the same directory as `main.nf`.

!!! note

You can create the file via the VSCode Explorer (left sidebar) or in the
terminal with a text editor.

If you are using the Explorer, right click on `part2` in the sidebar and
select **"New file"**.

Add the following line to your config file:

```groovy linenums="1" title="nextflow.config"
docker.enabled = true
```

You have now configured Nextflow to use Docker.

!!! tip

Remember to save your files after editing them!

## 2.1.4 Adding `params` and the workflow scope

If we need to run our pipeline with a different transcriptome
file, we can overwrite this default in our execution command with
the `--transcriptome` double-hyphen flag.
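For example, a run could be pointed at a different file like this (the file path is illustrative):

```bash
nextflow run main.nf --transcriptome data/ggal/transcriptome.fa
```

Note the hyphens: double-hyphen flags set pipeline `params`, while single-hyphen flags (like `-resume`) belong to Nextflow itself.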

Next, add the workflow scope at the bottom of your `main.nf` after the process:

```groovy title="main.nf"
// Define the workflow
workflow {
    INDEX(params.transcriptome_file)
}
```
This will tell Nextflow to run the `INDEX` process with
`params.transcriptome_file` as input.

!!! tip "Tip: Your own comments"

    As a developer you can choose how and where to comment your code!
    Feel free to modify or add to the provided comments to record useful
    information about the code you are writing.

We are now ready to run our workflow!

arguments inside a process.
Instead of trying to infer how the variable is being defined and applied to
the process, let's use the hidden command files saved for this task in the `work/` directory.

!!! question "Exercise"

    1. Navigate to the `work/` directory and open the `.command.sh` file.
    2. Compare the `.command.sh` file with `00_index.sh`.

!!! quote "Poll"

    Why do we no longer see hardcoded file paths like `results/salmon_index` and `data/ggal/transcriptome.fa` in `.command.sh`?


!!! abstract "Summary"

62 changes: 28 additions & 34 deletions docs/part2/02_fastqc.md
It contains:
* The empty `output:` block for us to define the output data for the process.
* The `script:` block prefilled with the command that will be executed.

!!! info "Dynamic naming"

    Recall that curly brackets are used to pass variables as part of a file name.
    For example, `${sample_id}` expands to the value of `sample_id` (such as `gut`).

### 2. Define the process `output`

Unlike `salmon` from the previous process, `fastqc` requires that the output
or metadata that needs to be processed together.

Working with samplesheets is particularly useful when you have a combination of files and metadata that need to be assigned to a sample in a flexible manner. Typically, samplesheets are written in comma-separated (`.csv`) or tab-separated (`.tsv`) formats.

We recommend using comma-separated files as they are less error prone and easier to read and write.

Let's inspect `data/samplesheet.csv` with VSCode.

```console title="Output"
sample,fastq_1,fastq_2
```

for the process we just added.

!!! info "Using samplesheets with Nextflow can be tricky business"
There are currently no Nextflow operators specifically designed to handle samplesheets. As such, we Nextflow workflow developers have to write custom parsing logic to read and split the data. This adds complexity to our workflow development, especially when trying to handle tasks like parallel processing of samples or filtering data by sample type.

Add the following to your workflow scope below where `INDEX` is called:

```groovy title="main.nf" hl_lines="7-12"
// Define the workflow
workflow {

    INDEX(params.transcriptome_file)

    // Define the fastqc input channel
    Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }
        .view()
}
```

We won't explore the logic of constructing our samplesheet input channel in depth in this lesson. The key takeaway is that using samplesheets is best practice for reading grouped files and metadata into Nextflow, and that a combination of Nextflow operators and Groovy needs to be chained together to get these into the correct format.

Our samplesheet input channel has used common [Nextflow operators](https://www.nextflow.io/docs/latest/operator.html):

* `.fromPath` creates a channel from one or more files matching a given path or pattern (to our `.csv` file, provided with the `--reads` parameter).
* `splitCsv` splits the input file into rows, treating it as a CSV (Comma-Separated Values) file. The `header: true` option means that the first row of the CSV contains column headers, which will be used to access the values by name.
* `map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }` uses some Groovy syntax to transform each row of the CSV file into a list, extracting the sample value, `fastq_1` and `fastq_2` file paths from the row.
* `.view()` is a debugging step that outputs the transformed data to the console so we can see how the channel is structured. It's a great tool to use when building your channels.

??? tip "Tip: using the `view()` operator for testing"

The [`view()`](https://www.nextflow.io/docs/latest/operator.html#view) operator is a useful tool for debugging Nextflow workflows. It allows you to inspect the data structure of a channel at any point in the workflow, helping you to understand how the data is being processed and transformed.
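    For example, a throwaway channel (the values here are illustrative) can be inspected like this:

    ```groovy
    // Print each element the channel emits, one per line
    Channel
        .of('gut_1.fq', 'gut_2.fq')
        .view()
    ```

    Remove the `view()` call once you are happy with the channel's structure.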

Run the workflow with the `-resume` flag:

```bash
definition of `tuple val(sample_id), path(reads_1), path(reads_2)`:

```console title="Output"
[gut, /home/setup2/hello-nextflow/part2/data/ggal/gut_1.fq, /home/setup2/hello-nextflow/part2/data/ggal/gut_2.fq]
```

!!! quote "Checkpoint"

    Zoom react Y/N

Next, we need to assign the channel we create to a variable so it can be passed to the `FASTQC`
process. Assign it to a variable called `reads_in`, and remove the `.view()`
operator as we now know what the output looks like.

```groovy title="main.nf" hl_lines="8-11"
// Define the workflow
workflow {

    INDEX(params.transcriptome_file)

    // Define the fastqc input channel
    reads_in = Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }

    // Run the fastqc step with the reads_in channel
    FASTQC(reads_in)
}
```

for each of the `.fastq` files.

!!! abstract "Summary"

    In this lesson you have learned:

    1. How to implement a process with a tuple input
    2. How to construct an input channel using operators and Groovy
    3. How to use the `.view()` operator to inspect the structure of a channel
    4. How to use the `-resume` flag to skip successful tasks
    5. How to use a samplesheet to read in grouped samples and metadata
54 changes: 29 additions & 25 deletions docs/part2/03_quant.md
!!! note "Learning objectives"

    1. Implement a process with multiple input channels.
    2. Understand the importance of creating channels from process outputs.
    3. Implement chained Nextflow processes with channels.

In this lesson we will transform the bash script `02_quant.sh` into a process called `QUANTIFICATION`. This step focuses on the next phase of RNAseq data processing: quantifying the expression of transcripts relative to the reference transcriptome.

It contains:
* Prefilled process directives `container` and `publishDir`.
* The empty `input:` block for us to define the input data for the process.
* The empty `output:` block for us to define the output data for the process.
* The empty `script:` block for us to define the script for the process.


### 2. Define the process `script`
The `--libType=U` is a required argument and can be left as is.
The `output` is a directory of `$sample_id`. In this case, it will be a
directory called `gut/`. Replace `< process outputs >` with the following:

```
path "$sample_id"
```

```groovy title="main.nf" hl_lines="9"
process QUANTIFICATION {
container "quay.io/biocontainers/salmon:1.10.1--h7e5ed60_0"
}
```

!!! info "Matching process inputs"

    Recall that the number of inputs in the process `input:` block and in the workflow call must match!

    If you have multiple inputs, they need to be listed on separate lines in the `input:` block, and separated by commas inside the parentheses in the workflow call.
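As a sketch, a process with two inputs is called with two arguments. The process and channel names below are illustrative, not from the workshop files:

```groovy
process EXAMPLE {
    input:
    path index                                           // first input channel
    tuple val(sample_id), path(reads_1), path(reads_2)   // second input channel

    script:
    """
    echo $sample_id
    """
}

workflow {
    // index_ch and reads_ch are assumed to be defined elsewhere
    EXAMPLE(index_ch, reads_ch)   // two arguments, one per declared input
}
```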

You have just defined a process with multiple inputs!

### 5. Call the process in the `workflow` scope
Recall that the inputs for the `QUANTIFICATION` process are emitted by the
is ready to be called by the `QUANTIFICATION` process. Similarly, we need to
prepare a channel for the index files output by the `INDEX` process.

Add the following channel to your `main.nf` file, after the `FASTQC` process:

```groovy
transcriptome_index_in = INDEX.out[0]
```

```groovy title="main.nf" hl_lines="15-16"
// Define the workflow
workflow {

    INDEX(params.transcriptome_file)

    // Define the fastqc input channel
    reads_in = Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }

    // Run the fastqc step with the reads_in channel
    FASTQC(reads_in)

    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]
}
```

Call the `QUANTIFICATION` process in the workflow scope and add the inputs by adding the following line to your `main.nf` file after your `transcriptome_index_in` channel definition:

```groovy
QUANTIFICATION(transcriptome_index_in, reads_in)
```
```groovy title="main.nf" hl_lines="18-19"
// Define the workflow
workflow {

    INDEX(params.transcriptome_file)

    // Define the fastqc input channel
    reads_in = Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }

    // Run the fastqc step with the reads_in channel
    FASTQC(reads_in)

    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]

    // Run the quantification step with the index and reads_in channels
    QUANTIFICATION(transcriptome_index_in, reads_in)
}
```

By doing this, we have passed two arguments to the `QUANTIFICATION` process as there are two inputs in the `process` definition.

!!! info "Tuples count as a single input"

    Although a tuple bundles multiple values (`sample_id`, `reads_1`, `reads_2`), it is passed through a single channel and counts as one input to the process.
Run the workflow:

```bash
nextflow run main.nf -resume
```

Your output should look like:

```console title="Output"
Launching `main.nf` [shrivelled_cuvier] DSL2 - revision: 4781bf6c41

executor > local (1)

```

A new `QUANTIFICATION` task has been successfully run and you now have a
`results/gut` folder, with an assortment of files and directories.

!!! abstract "Summary"

    In this lesson you have learned:

    1. How to define a process with multiple input channels
    2. How to access a process output with `.out`
    3. How to create a channel from a process output
    4. How to chain Nextflow processes with channels