Merge pull request #27 from Sydney-Informatics-Hub/part-2
Part 2 dry run changes
fredjaya authored Sep 23, 2024
2 parents 1b1bc9f + ffea3d6 commit 58a592e
Showing 6 changed files with 230 additions and 198 deletions.
14 changes: 4 additions & 10 deletions docs/part2/00_intro.md
You decide to use Nextflow.
2. Inspect the scripts (open in a VSCode tab, or text editor in the terminal).

Each script runs a single data processing step, and the scripts are run in the order of their numbered prefixes.

!!! quote "Poll"

    What are some limitations of these scripts in terms of running them in a
    pipeline and monitoring it?

    ??? note "Solution"

        * **No parallelism**: processes run iteratively, increasing overall runtime and limiting scalability.
        * **No error handling**: if a step fails, it may propagate errors or incomplete results into subsequent steps.
        * **No caching**: if a step fails, you either need to re-run from the beginning or edit the script to exclude the files that have already been processed.
        * **No resource management**: inefficient resource usage, with no guarantee processes can access the CPU, RAM, or disk space they need.
        * **No software management**: assumes the same environment is available every time it is run.

## 2.0.3 Our workflow: RNAseq data processing

45 changes: 31 additions & 14 deletions docs/part2/01_salmon_idx.md
Nextflow also handles whether the directory already exists or if it
should be created. In the `00_index.sh` script you had to manually make a
results directory with `mkdir -p "results"`.


More information and other modes can be found on
[publishDir](https://www.nextflow.io/docs/latest/process.html#publishdir).
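For instance, a process can publish copies of its outputs instead of the default symlinks by setting a `mode`. This is a minimal sketch, not part of the workshop files; the process name and output file are illustrative:

```groovy
process EXAMPLE {
    // Publish outputs to results/, copying files rather than symlinking
    publishDir "results", mode: 'copy'

    output:
    path "hello.txt"

    script:
    """
    echo hello > hello.txt
    """
}
```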

You now have a complete process!

## 2.1.3 Using containers

Nextflow recommends using containers to ensure reproducibility and portability
pre-installed on your Virtual Machine.
We can configure Nextflow to run containers with Docker by using the
`nextflow.config` file.

Create a `nextflow.config` file in the same directory as `main.nf`.

!!! note

You can create the file via the VSCode Explorer (left sidebar) or in the
terminal with a text editor.

If you are using the Explorer, right click on `part2` in the sidebar and
select **"New file"**.

Add the following line to your config file:

```groovy linenums="1" title="nextflow.config"
docker.enabled = true
```

You have now configured Nextflow to use Docker.

!!! tip

Remember to save your files after editing them!

## 2.1.4 Adding `params` and the workflow scope

If we need to run our pipeline with a different transcriptome
file, we can overwrite this default in our execution command with
the `--transcriptome` double-hyphen flag.
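For example, a run could be pointed at a different file like this (the file path is illustrative):

```bash
nextflow run main.nf --transcriptome data/ggal/transcriptome.fa
```

Note the hyphens: double-hyphen flags set pipeline `params`, while single-hyphen flags (like `-resume`) belong to Nextflow itself.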

Next, add the workflow scope at the bottom of your `main.nf` after the process:

```groovy title="main.nf"
// Define the workflow
workflow {
    INDEX(params.transcriptome_file)
}
```
This will tell Nextflow to run the `INDEX` process with
`params.transcriptome_file` as input.

!!! tip "Tip: Your own comments"

    As a developer you can choose how and where to comment your code!
    Feel free to modify or add to the provided comments to record useful
    information about the code you are writing.

We are now ready to run our workflow!

arguments inside a process.
Instead of trying to infer how the variable is being defined and applied to
the process, let's use the hidden command files saved for this task in the `work/` directory.

!!! question "Exercise"

    1. Navigate to the `work/` directory and open the `.command.sh` file.
    2. Compare the `.command.sh` file with `00_index.sh`.

!!! quote "Poll"

    Why do we no longer see hardcoded file paths like `results/salmon_index` and `data/ggal/transcriptome.fa` in `.command.sh`?


!!! abstract "Summary"

62 changes: 28 additions & 34 deletions docs/part2/02_fastqc.md
It contains:
* The empty `output:` block for us to define the output data for the process.
* The `script:` block prefilled with the command that will be executed.

!!! info "Dynamic naming"

    Recall that curly brackets are used to pass variables as part of a file name.
    For example, `${sample_id}` expands to the value of `sample_id` (such as `gut`).

### 2. Define the process `output`

Unlike `salmon` from the previous process, `fastqc` requires that the output
or metadata that needs to be processed together.

Working with samplesheets is particularly useful when you have a combination of files and metadata that need to be assigned to a sample in a flexible manner. Typically, samplesheets are written in comma-separated (`.csv`) or tab-separated (`.tsv`) formats.

We recommend using comma-separated files as they are less error prone and easier to read and write.

Let's inspect `data/samplesheet.csv` with VSCode.

```console title="Output"
sample,fastq_1,fastq_2
```

for the process we just added.

!!! info "Using samplesheets with Nextflow can be tricky business"
There are currently no Nextflow operators specifically designed to handle samplesheets. As such, we Nextflow workflow developers have to write custom parsing logic to read and split the data. This adds complexity to our workflow development, especially when trying to handle tasks like parallel processing of samples or filtering data by sample type.

Add the following to your workflow scope below where `INDEX` is called:

```groovy title="main.nf" hl_lines="7-12"
// Define the workflow
workflow {

    INDEX(params.transcriptome_file)

    // Define the fastqc input channel
    Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }
        .view()
}
```

We won't explore the logic of constructing our samplesheet input channel in depth in this lesson. The key takeaway is that using samplesheets is best practice for reading grouped files and metadata into Nextflow, and that a combination of Nextflow operators and Groovy needs to be chained together to get these into the correct format.

Our samplesheet input channel has used common [Nextflow operators](https://www.nextflow.io/docs/latest/operator.html):

* `.fromPath` creates a channel from one or more files matching a given path or pattern (to our `.csv` file, provided with the `--reads` parameter).
* `splitCsv` splits the input file into rows, treating it as a CSV (Comma-Separated Values) file. The `header: true` option means that the first row of the CSV contains column headers, which will be used to access the values by name.
* `map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }` uses some Groovy syntax to transform each row of the CSV file into a list, extracting the sample value, `fastq_1` and `fastq_2` file paths from the row.
* `.view()` is a debugging step that outputs the transformed data to the console so we can see how the channel is structured. It's a great tool to use when building your channels.

??? tip "Tip: using the `view()` operator for testing"

The [`view()`](https://www.nextflow.io/docs/latest/operator.html#view) operator is a useful tool for debugging Nextflow workflows. It allows you to inspect the data structure of a channel at any point in the workflow, helping you to understand how the data is being processed and transformed.
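    For example, a throwaway channel (the values here are illustrative) can be inspected like this:

    ```groovy
    // Print each element the channel emits, one per line
    Channel
        .of('gut_1.fq', 'gut_2.fq')
        .view()
    ```

    Remove the `view()` call once you are happy with the channel's structure.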

Run the workflow with the `-resume` flag:

```bash
definition of `tuple val(sample_id), path(reads_1), path(reads_2)`:

```console title="Output"
[gut, /home/setup2/hello-nextflow/part2/data/ggal/gut_1.fq, /home/setup2/hello-nextflow/part2/data/ggal/gut_2.fq]
```

!!! quote "Checkpoint"

    Zoom react Y/N

Next, we need to assign the channel we create to a variable so it can be passed to the `FASTQC`
process. Assign it to a variable called `reads_in`, and remove the `.view()`
operator as we now know what the output looks like.

```groovy title="main.nf" hl_lines="8-11"
// Define the workflow
workflow {

    INDEX(params.transcriptome_file)

    // Define the fastqc input channel
    reads_in = Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }

    // Run the fastqc step with the reads_in channel
    FASTQC(reads_in)
}
```

for each of the `.fastq` files.

!!! abstract "Summary"

    In this lesson you have learned:

    1. How to implement a process with a tuple input
    2. How to construct an input channel using operators and Groovy
    3. How to use the `.view()` operator to inspect the structure of a channel
    4. How to use the `-resume` flag to skip successful tasks
    5. How to use a samplesheet to read in grouped samples and metadata
54 changes: 29 additions & 25 deletions docs/part2/03_quant.md
!!! note "Learning objectives"

    1. Implement a process with multiple input channels.
    2. Understand the importance of creating channels from process outputs.
    3. Implement chained Nextflow processes with channels.

In this lesson we will transform the bash script `02_quant.sh` into a process called `QUANTIFICATION`. This step focuses on the next phase of RNAseq data processing: quantifying the expression of transcripts relative to the reference transcriptome.

It contains:
* Prefilled process directives `container` and `publishDir`.
* The empty `input:` block for us to define the input data for the process.
* The empty `output:` block for us to define the output data for the process.
* The empty `script:` block for us to define the script for the process.


### 2. Define the process `script`
The `--libType=U` is a required argument and can be left as is.
The `output` is a directory of `$sample_id`. In this case, it will be a
directory called `gut/`. Replace `< process outputs >` with the following:

```
path "$sample_id"
```

```groovy title="main.nf" hl_lines="9"
process QUANTIFICATION {
container "quay.io/biocontainers/salmon:1.10.1--h7e5ed60_0"
}
```

!!! info "Matching process inputs"

    Recall that the number of inputs in the process `input:` block and in the workflow call must match!

    If you have multiple inputs, they need to be listed on separate lines in the `input:` block, and separated by commas inside the parentheses in the workflow call.
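As a sketch, a process with two inputs is called with two arguments. The process and channel names below are illustrative, not from the workshop files:

```groovy
process EXAMPLE {
    input:
    path index                                           // first input channel
    tuple val(sample_id), path(reads_1), path(reads_2)   // second input channel

    script:
    """
    echo $sample_id
    """
}

workflow {
    // index_ch and reads_ch are assumed to be defined elsewhere
    EXAMPLE(index_ch, reads_ch)   // two arguments, one per declared input
}
```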

You have just defined a process with multiple inputs!

### 5. Call the process in the `workflow` scope
Recall that the inputs for the `QUANTIFICATION` process are emitted by the
is ready to be called by the `QUANTIFICATION` process. Similarly, we need to
prepare a channel for the index files output by the `INDEX` process.

Add the following channel to your `main.nf` file, after the `FASTQC` process:

```groovy
transcriptome_index_in = INDEX.out[0]
```

```groovy title="main.nf" hl_lines="15-16"
// Define the workflow
workflow {

    INDEX(params.transcriptome_file)

    // Define the fastqc input channel
    reads_in = Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }

    // Run the fastqc step with the reads_in channel
    FASTQC(reads_in)

    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]
}
```

Call the `QUANTIFICATION` process in the workflow scope and add the inputs by adding the following line to your `main.nf` file after your `transcriptome_index_in` channel definition:

```groovy
QUANTIFICATION(transcriptome_index_in, reads_in)
```
```groovy title="main.nf" hl_lines="18-19"
// Define the workflow
workflow {

    INDEX(params.transcriptome_file)

    // Define the fastqc input channel
    reads_in = Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }

    // Run the fastqc step with the reads_in channel
    FASTQC(reads_in)

    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]

    // Run the quantification step with the index and reads_in channels
    QUANTIFICATION(transcriptome_index_in, reads_in)
}
```

By doing this, we have passed two arguments to the `QUANTIFICATION` process as there are two inputs in the `process` definition.

!!! info "Tuples count as a single input"

    Although a tuple bundles multiple values (`sample_id`, `reads_1`, `reads_2`), it is passed through a single channel and counts as one input to the process.
Run the workflow:

```bash
nextflow run main.nf -resume
```

Your output should look like:

```console title="Output"
Launching `main.nf` [shrivelled_cuvier] DSL2 - revision: 4781bf6c41

executor > local (1)

```

A new `QUANTIFICATION` task has been successfully run and you now have a
`results/gut` folder, with an assortment of files and directories.

!!! abstract "Summary"

    In this lesson you have learned:

    1. How to define a process with multiple input channels
    2. How to access a process output with `.out`
    3. How to create a channel from a process output
    4. How to chain Nextflow processes with channels