---
title: "RNA-Seq Secondary Analysis"
author: "Austyn Trull, Bharat Mishra, Ph.D., Lara Ianov, Ph.D."
---
<!-- Add some CSS for horizontal scrolling in samplesheet blocks -->
```{=html}
<style>
.scrollable-horizontal {
overflow-x: auto;
white-space: nowrap;
}
</style>
```
# Introduction to Secondary Analysis
![<https://www.rna-seqblog.com/learn-more-about-available-bioinformatics-tools-and-pipelines-to-analyze-rna-sequencing-data/>](imgs/primary_secondary_tertiary.jpg)
An experiment can typically be broken down into 3 different parts:
- **Primary Analysis** - There is some disagreement in the community about what constitutes a primary analysis; some consider trimming and initial QC to be part of it. For this workshop, we will consider primary analysis to be the steps performed by a sequencing core, with demultiplexed FASTQ files as the end result. That is very much an oversimplification of what happens at this step: there is a lot of complexity, and entire workshops could be (and are) given solely on this step.
- **Secondary Analysis** - Secondary analysis covers the steps in a workflow where FASTQ files are quality-controlled (QC), trimmed, aligned, and quantified. The type of quantification produced differs depending on the kind of analysis you are performing - e.g.: a single-cell analysis would need a feature-barcode matrix, while a whole genome analysis would work with variant call files. For bulk RNA-seq analyses, the quantification represents expression count files.
- **Tertiary Analysis** - Tertiary analysis takes the output from secondary analysis for additional analysis, visualization and data interpretation. The additional steps may include more QC, data filtering, annotation and experiment-specific analysis (driven by study design). A typical bulk RNA-Seq analysis includes identifying differentially expressed and co-regulated genes and inferring their biological meaning.
**The purpose of this workshop will be to provide a more in-depth look at secondary and tertiary analysis from data derived from short read sequencing.**
## Introduction to FASTQ Files
As mentioned previously, secondary analysis starts with a FASTQ file. A FASTQ file is a standard file format for storing the raw sequences and their qualities obtained from primary analysis.
Taking a quick look at the input files provided with the workshop, we can tell it's a FASTQ because it has the `.fq` file extension, but you’ll also see `.fastq` as an alternative extension. The other file extension we see on the files is `.gz`. `gz` indicates that the file is zipped using the gzip tool. Gzip is both the name of the tool and the name for the file compression algorithm. File compression is typically performed to minimize the storage space the file takes up on a file system. This is especially important for FASTQs because they can be very large files. Furthermore, many standard tools for secondary analysis of RNA-Seq accept compressed FASTQs (in the gzip format) as input, and thus, the user does not need to de-compress the files prior to starting the analysis.
Let's go ahead and peek at the FASTQ file itself. We'll use `gunzip -c` to print the file contents to stdout (display output), and we'll just look at the first 10 lines using the `head` command.
```
gunzip -c <fastq.name> | head
```
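If you do not have a FASTQ at hand, a single record looks roughly like the following. This is an illustrative example commonly used in FASTQ documentation, not a read taken from the workshop data:
```
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```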
We can see that there is a discernible pattern. Each set of four lines is considered a single entry. The four lines are:
1. The **sequence identifier**. The exact contents of this vary depending on many different factors, but are always unique for each entry.
2. The **sequence**. A string of A, T, C, G, and N. (N indicates the base could not be identified)
3. A **separator**. Starts with a `+` sign. Sometimes this will also have a quality score identifier that matches the sequence identifier, but it's also not uncommon for this to only contain the `+` sign.
4. The **sequence qualities**. These indicate the quality of the base at the same position in the sequence. They are encoded as single characters, and there are ways to convert them back into numerical values; see the table below for a quick guide to the character-score translations. The reason these are stored as characters is again to reduce the amount of space the FASTQ takes up on a file system.
| Q-Score | ASCII Code | Symbol | Q-Score | ASCII Code | Symbol | Q-Score | ASCII Code | Symbol |
|-----------|--------------|----------|-----------|--------------|----------|-----------|--------------|----------|
| 0 | 33 | ! | 14 | 47 | / | 28 | 61 | = |
| 1 | 34 | " | 15 | 48 | 0 | 29 | 62 | > |
| 2 | 35 | # | 16 | 49 | 1 | 30 | 63 | ? |
| 3 | 36 | $ | 17 | 50 | 2 | 31 | 64 | @ |
| 4 | 37 | % | 18 | 51 | 3 | 32 | 65 | A |
| 5 | 38 | & | 19 | 52 | 4 | 33 | 66 | B |
| 6 | 39 | ' | 20 | 53 | 5 | 34 | 67 | C |
| 7 | 40 | ( | 21 | 54 | 6 | 35 | 68 | D |
| 8 | 41 | ) | 22 | 55 | 7 | 36 | 69 | E |
| 9 | 42 | * | 23 | 56 | 8 | 37 | 70 | F |
| 10 | 43 | + | 24 | 57 | 9 | 38 | 71 | G |
| 11 | 44 | , | 25 | 58 | : | 39 | 72 | H |
| 12 | 45 | - | 26 | 59 | ; | 40 | 73 | I |
| 13 | 46 | . | 27 | 60 | < | 41 | 74 | J |
<!-- ![<https://www.omixon.com/bioinformatics-for-beginners-file-formats-part-2-short-reads/>](imgs/q_scores.png) -->
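As a quick sanity check of the table above, the numeric Q-score can be recovered from a quality character on the command line. This is a minimal sketch using standard shell `printf` and arithmetic (the character `I` is just an example):
```
# Get the ASCII code of the quality character, then subtract the Phred+33 offset
char="I"
ascii=$(printf '%d' "'$char")
echo "Q-score: $((ascii - 33))"   # I -> ASCII 73 -> Q40
```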
### QC-ing FASTQ Files
Before starting an analysis, there are a couple of sanity checks that can be run on a FASTQ file to ensure it is intact (for example, not corrupted during a download).
#### Checking line count
The first sanity check is a simple line count on the file.
```
gunzip -c <fastq_file> | wc -l
```
> **Discussion Question**
>
> What is the purpose of this check?
>
> If you recall, each entry in a FASTQ file spans four lines, so a very quick sanity check is to make sure the number of lines in the FASTQ is divisible by 4.
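Putting that check into a small script, a minimal sketch might look like this (replace the `<fastq_file>` placeholder with an actual file, and on a cluster run it inside a job or interactive session, as noted below):
```
lines=$(gunzip -c <fastq_file> | wc -l)
if [ $((lines % 4)) -eq 0 ]; then
    echo "OK: ${lines} lines (divisible by 4)"
else
    echo "WARNING: ${lines} lines is not divisible by 4 (the file may be truncated)"
fi
```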
> **NOTE**
>
> If you are running the above command on a computing cluster or high performance computer, please ensure that you do not run it on the login or head node. With large files, such as FASTQ files, this command can cause high memory usage and create issues for other users of the system. Instead, either submit the command as a job to the cluster or run it in an interactive session. Example commands for how to do this will be demonstrated later in the lesson.
#### MD5 Checks
The second sanity check is a little more complex: it is known as an MD5 check. An MD5 checksum is a string of characters and numbers computed from a file's contents. The important point is that the MD5 is effectively unique for every file, so two files having the same MD5 hash indicates a high probability the files are identical (we say probability because it is technically possible for two different files to have the same MD5; however, because of how it is calculated, the chance is extremely low, and for the purposes of this course you can consider MD5s to be unique to a file).
The reason to perform this check is that FASTQs are often uploaded to and downloaded from other sources, so the MD5 can be computed to detect potential corruption during either the upload or download process. If the source (e.g.: sequencing core, collaborators, etc.) does not provide any MD5s, it is worth reaching out to the source with this request. Once these are generated, you are able to check the MD5s of the FASTQs after they have been downloaded against the MD5s that the source generated. Ensuring these values are the same is a good way to confirm no corruption happened during the download process.
The way to generate an MD5 for a local file is with the following `md5sum` command:
```
md5sum <fastq_file>
```
The output of the above can be checked against the MD5 received from a sequencing core to ensure they match perfectly. Additionally, if you receive a text file containing the MD5 checksums for all samples (named `md5sum.txt` in the example below), they can be efficiently checked with the following command:
```
# given you are in a directory that contains all fastq files
# and that you have the md5sum txt file from a sequencing facility:
md5sum -c md5sum.txt
```
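For reference, `md5sum.txt` is a plain-text file with one checksum and file name per line. The layout looks like the following (the hashes shown here are placeholders, not the real checksums for the workshop data):
```
33f8a76eb0c9e7f1b2d4a5c6e7f80912  SRX4328049_SRR7457560.fastq.gz
9b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e  SRX4328050_SRR7457559.fastq.gz
```
`md5sum -c` recomputes each checksum, prints one `OK` (or `FAILED`) line per file, and exits with a non-zero status if any check fails.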
> **Discussion Question**
>
> Given that you can generate md5 yourself for a given file, why is it critical to have the md5 from the data generators/sequencing facility?
## Intro to RNA-Seq
> **Discussion Question**
>
> Has anyone performed any RNA-Seq analysis? What tools have you used?
The FASTQ file on its own does not tell us much. In order to move forward with our project, these raw base calls need to be converted into raw expression data.
![<https://nf-co.re/rnaseq/3.14.0>](imgs/nf-core_rnaseq.png)
The rnaseq pipeline developed and published by nf-core is the pipeline this lesson will use to analyze the FASTQs.
This pipeline includes a lot of tools, and a user may choose to run it in default mode (only the tools that are executed by default) or configure it to run with more or fewer tools/steps.
------------------------------------------------------------------------
**Note:** Importantly, this pipeline is designed for short-read sequencing (e.g.: Illumina). Long-read bulk RNA-Seq requires distinct analytical steps which are linked to the nature of long-read vs short-read sequencing. While bulk long-read sequencing is beyond the scope of this workshop, nf-core has pipelines for long-read sequencing (e.g.: for more information see the `nf-core/nanoseq` pipeline).
------------------------------------------------------------------------
RNA-Seq workflows generally follow the pattern below:
1. Quality-control
2. Read trimming/filtering
3. Alignment or quasi-mapping
4. Counting / quantification
The pipeline contains several quality control steps to help identify any step where an error may have happened and to assess the viability of the samples as they progress through the pipeline.
Additionally, the pipeline provides support for experiment-specific approaches which we will not cover in today's workshop. Examples include removal of genome contaminants in expected model types (e.g.: mouse PDX / xenograft) and support for Unique Molecular Identifier (UMI) tagging for PCR deduplication (an optional molecular preparation step).
### Trimming and Filtering
An initial step in RNA-Seq workflows is often read trimming and filtering. A common type of trimming is adapter trimming. Adapters are artificial sequences attached to the DNA fragments to be sequenced so that the fragments attach to the sequencing flow cell. Adapters are normally also sequenced with the fragment, meaning some of the reads may contain bases that are not biological and can impact the accuracy of mapping and other downstream analyses. To alleviate this concern, there are dedicated published tools that will remove adapters (also referred to as "adapter trimming"). Some examples of tools that do this are `TrimGalore`, `BBMap`, or `Trimmomatic`.
The `nf-core/rnaseq` pipeline additionally supports the removal of contamination with `BBSplit` (e.g.: xenograft experiments, or unexpected genomic contamination), and removal of ribosomal RNA using `SortMeRNA`.
### Alignment
Following trimming is alignment/mapping. With RNA-Seq, there is more than one approach to alignment, and the nf-core/rnaseq pipeline supports various methods.
Alignment is the process by which a read is mapped to a reference genome. `STAR` is considered the standard aligner for RNA-Seq. The reason it is used over traditional mappers such as `bwa` is that `STAR` is a splice-aware aligner, where exon-exon junctions are taken into account at the mapping stage. `STAR` alignment outputs (BAM files) can be critical for use-cases where genome alignment is required, such as gene fusion identification or variant-calling/genome-wide/exome data integration.
The second type of alignment/mapping supported is "quasi-mapping"/"pseudo-alignment".
`Salmon` is a quasi-mapper which implements k-mer based indexing of the transcriptome (as opposed to the genome) to perform lightweight mapping to the reference. Similarly, `kallisto` also implements k-mer indexing of the transcriptome, but it processes the reads without mapping, and instead identifies k-mers within the reads which are compatible and shared to the index. Thus, a key difference between the tools is that pseudo-alignment relies on shared k-mers between the reference and the reads, while quasi-mapping implements efficient lightweight mapping using the k-mer indexed reference.
Lastly, `Salmon` and `kallisto` are computationally efficient, leading to rapid mapping with low memory use.
### Counting
Feature (gene or transcript) quantification is typically the last analytical step of an RNA-Seq pipeline. Importantly, the outputs of `Salmon` and `kallisto` are quantified counts, as opposed to `STAR`, whose output is mapped data (BAM files) that requires an additional quantification tool. In the nf-core/rnaseq pipeline, `Salmon` is implemented as the read quantifier from a `STAR` alignment.
The count output format typically contains a list of features (gene names or IDs) as rows and their relative abundance as columns, one column per sample.
------------------------------------------------------------------------
As discussed earlier, the pipeline we will cover also contains several additional QC steps and optional steps dependent on design or analytical needs. The primary steps covered in the sections above are the required analytical steps across all bulk RNA-Seq analyses.
------------------------------------------------------------------------
## Executing the nf-core/rnaseq Pipeline
### Intro to Nextflow / nf-core
Due to the number of steps this analysis uses, it is written using a workflow management system (or workflow manager) known as Nextflow. Workflow management systems allow tools to be chained together and inherently support some of the common checks a developer would otherwise have to program in, such as checking whether a file exists, moving output files to a central location, and logging information. Workflow management systems are also useful because they natively support parallelization, portability, scalability and reproducibility. There are other workflow languages such as Snakemake or WDL.
Nextflow is built on top of a programming language known as Groovy. The specifics of Groovy are not necessarily important for the lesson, but if anyone wishes to develop pipelines using Nextflow it will be relevant.
Nf-core is a community-led effort to curate and develop a set of analysis pipelines using Nextflow. All pipelines published as part of nf-core are open source, and there is a wide range of pipelines currently published supporting analyses such as single-cell or ATAC-seq. The community itself is not specifically limited to bioinformatics, but most pipelines published so far have been within that field. One point to note is that the nf-core team has published a set of standards and tooling that form a style guideline, and any Nextflow pipeline is able to use these standards and tools even if it is not published as part of nf-core. To view a list of all pipelines (released and in development), please visit <https://nf-co.re/pipelines>.
#### Workflows and Reproducibility
As mentioned previously, one of the important benefits of using workflow management systems, such as Nextflow, is reproducibility. By ensuring that analyses are reproducible, you or any other person is able to run a workflow and receive the exact same results. One of the ways that Nextflow ensures reproducibility is via native support for tools that utilize containers, such as Docker or Singularity.
A container is a packaged up snapshot of code (or software) and all of its dependencies. A container allows the application it encapsulates to run on any machine afterwards. Containers help with the automation of installing all the packages and dependencies and minimizes the chance for error and deviation during the installation process.
It is highly recommended that any code or analysis that is meant to be shared should be encapsulated in a container as that will help in increasing the reproducibility of your work.
### Testing the Pipeline
Importantly, when implementing a novel pipeline or tool, it is critical to always test the pipeline to ensure that it is installed and working properly. Nf-core requires all pipelines to have downsampled test data to be able to run a test profile.
We will be using UAB's supercomputer (called `Cheaha`) to run the nf-core/rnaseq pipeline, since most common use-cases require resources that are not readily available on local computers. More specifically, we'll be using the OnDemand application provided by UAB's Research Computing team, which provides an online interface for interacting with Cheaha. Thus, as a first step, we will need to navigate to the OnDemand website and log in using your UAB login. The link can be found below:
`https://rc.uab.edu/pun/sys/dashboard/`
OnDemand provides an interactive file system that can be accessed by clicking on `Files` in the upper left of the OnDemand homepage, and then clicking `/scratch/<blazer_id>` in the dropdown (NOTE: The `<blazer_id>` in the path should be replaced with your blazer id).
We'll navigate to the directory where we have moved the files downloaded from Globus, which can be done by double clicking on directories. Following the install instructions will have placed the files in the directory below (NOTE: The `<blazer_id>` in the path should be replaced with your blazer id):
```
/scratch/<blazer_id>/rnaseq_workshop
```
> **NOTE**
>
> For this lesson, we'll be using the OnDemand Editor provided by UAB Research Computing in order to edit and create files. However, if you are more comfortable using a text editor such as `vim`, `nano`, or `Visual Studio Code`, that is also acceptable.
In order to run the pipeline with test data, we’ll want to create a script to submit and run the Nextflow command, thus we will start by creating a file called `run_test_rnaseq.sh`. In order to create a file, click the `New File` button in the upper right of interactive file system, and in the dialog box type `run_test_rnaseq.sh`.
> **NOTE**
>
> It is considered bad practice to execute any script or command that requires large amounts of resources on the login node of a supercomputer. Not only does it take resources from other users on the login node, but it can cause errors and system failures for the supercomputer itself. Therefore, **ALWAYS** ensure you either submit code to the cluster OR execute code in an interactive session.
Once the file is created, make sure it is highlighted in the file system window and click the `Edit` button located in the upper left of the screen, below the file path. In the new window, we'll start by entering the below text:
```
#!/usr/bin/env bash
```
This is known as a `shebang`. It is standard practice to put this line at the top of any script you make, as it is used by the computer to determine which interpreter the script should be run with. In this case we are letting the computer know that this is a `bash` script. Because this script needs to be submitted to Cheaha, it is especially important to include this line: when the job is dispatched to its own machine, without it the machine may not know how to execute the script and will error immediately.
The next lines we will add are known as Slurm directives. Slurm directives define how many resources are needed to execute the job, and where to place the stdout and stderr output from the command.
```
#SBATCH --job-name=rnaseq_test
#SBATCH --output=rnaseq_test.out
#SBATCH --error=rnaseq_test.err
#SBATCH --time=2:00:00
#SBATCH --partition=express
#SBATCH --mem-per-cpu=5000
#SBATCH --cpus-per-task=1
```
To break down each line:
- `--job-name` : Submitting code on the cluster is called a job. The `job-name` flag assigns a user-chosen name to the job to allow for easier tracking. Job names should be short and descriptive in order to make the job easy to identify. In this case we will name our job `rnaseq_test`.
- `--output` and `--error` : These define the files in which to place the `stdout` and `stderr` output of the command being submitted. Because these jobs are not interactive, if these flags are not defined the output will be lost. Instead of providing a literal file name, Slurm also supports file name patterns. For instance, using `%j_%x.out` and `%j_%x.err` will generate outputs named with the job id and job name.
- `--time` : This defines the amount of time you expect the job to run. If a job runs longer than this limit, it will be killed by Cheaha.
- `--partition` : There are a number of partitions available on Cheaha. The `express` partition is for jobs you expect to run in less than 2 hours. There are also `short`, `medium`, and `long` queues, which have their own maximum run time that can be found in the Cheaha documentation. It is recommended to choose the partition that best fits the needs of your pipeline, as choosing the wrong one can result in your scripts being killed too soon or causing very long pending times.
- `--mem-per-cpu` : This defines how much memory the job may use, multiplied by the number of CPUs requested. There is no good way to determine the memory a job will take ahead of time, so it is very common for this to take some iteration, as the job will be killed if it goes over the memory limit specified here.
- `--cpus-per-task` : This defines how many CPUs you expect a job to take.
We are specifying pretty low memory and time requirements for this job because this is downsampled data; it should run very quickly and require very few resources. It's also important to note that submitting a job like this will always require low memory, since we are submitting what is known as the Nextflow runner job. The Nextflow runner's job is just to submit the steps in the workflow that do the actual work for the pipeline.
Next, we need to load the conda environment created during the installation instructions to ensure that the required software is available. This can be done with the following:
```
module load Anaconda3
conda activate $USER_SCRATCH/conda_envs/rnaseq_workshop
```
With all that completed, we can now place the Nextflow command in the file:
```
nextflow run nf-core/rnaseq -profile test,cheaha -r 3.14.0 --outdir output
```
Breaking the Nextflow command down:
- `nextflow run` is how any Nextflow pipeline is run. `nf-core/rnaseq` tells Nextflow which nf-core pipeline to run; if we were running a single-cell pipeline, we'd use `nf-core/scrnaseq`.
- `-profile test,cheaha` : Profiles can be thought of as presets for a pipeline. The pipeline comes with a lot of defaults, but profiles provide additional defaults on top of them. The `test` profile tells the pipeline to use its test data. The `cheaha` profile is what is called an institutional profile, and `cheaha` is the one for use at UAB; there are also profiles for other institutions, and you are able to create your own if one doesn't already exist. One point to note is that if you pass in multiple profiles that set the same pipeline parameters, the last one takes precedence.
- The `-r` specifies a version of the pipeline to use. This is an important parameter for reproducibility as we can force rnaseq to run at a specific version. If you do not use this parameter, the pipeline will default to the latest version.
- `--outdir output` With this flag we are noting where the pipeline needs to place all the output files, in this case it will be a directory called `output`.
In order to save the file, we click the `Save` button located in the upper right of the editor window.
The final step is to submit our script to Cheaha so it can be executed. We will accomplish that via the command line available from OnDemand. To access the command line, click the `Open in Terminal` button located in the upper right of the interactive file system.
We'll need to start by navigating to our `rnaseq_workshop` directory (NOTE: The `<blazer_id>` in the path should be replaced with your blazer id):
```
# Note: /scratch/<blazer_id>/ can be replaced by $USER_SCRATCH variable
cd /scratch/<blazer_id>/rnaseq_workshop
```
To submit the script, run the command below:
```
sbatch run_test_rnaseq.sh
```
Once the command is run, you will see output similar to below:
```
Submitted batch job 28085735
```
The number will be different, but this is giving you the job id, which can be used to query the job for additional information. You can monitor the job by running the below command (NOTE: The `<job_id>` in the command should be replaced with the job id obtained from the previous command):
```
seff <job_id>
```
```
Job ID: 28085735
Cluster: slurm_cluster
User/Group: atrull/atrull
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:01 core-walltime
Job Wall-clock time: 00:00:01
Memory Utilized: 1.62 MB
Memory Efficiency: 0.16% of 1.00 GB
```
Running that command will provide some basic information about the job being queried: you can see how much memory the job used, its walltime, and other information. For this workshop, we mainly want to check the `State` of the job and determine when it has changed to `COMPLETED`, as in the example, since this means the job has finished successfully.
You may also check the running status of the Nextflow runner job (and the jobs it submits) with the following command:
```
squeue -u $USER
```
Once everything is completed, move on to the next section, as we can conclude the pipeline has been downloaded and runs successfully.
> **NOTE**
>
> This section briefly introduced the concept of the institutional profile. The institutional profile is very useful as it allows users to configure nf-core workflows to better utilize the resources at their institution. However, because of how specific an institutional profile is, it is not recommended to use an institutional profile that does not match the institution where you are running the analysis, as the resources may not be the same, and this can lead to errors or take resources from other users.
> **NOTE**
>
> The `cheaha` profile used within this lesson has a number of configurations that allow nf-core workflows to make better use of the Cheaha supercomputer. It will direct jobs to the correct queue to ensure that jobs have minimal time between being submitted and run. It will also use the module system and native Nextflow configuration to ensure that Singularity is used to run containers.
### Understanding the Dataset
Let's start moving to actual data. Before running it, it's important to understand the dataset a little better.
The data comes from the following paper: <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6096346/>
The data we will be analyzing is mouse transplant data. The original paper contains 12 samples in total - 4 naive controls, 4 collected 2 hours post-transplant, and 4 collected 24 hours post-transplant. For this workshop, we will only be concerned with the controls and the 24-hour samples. For the tertiary portion, we will be using the results from a pipeline run that contained only these samples, and we will review those reports later in this portion of the lesson. In addition, we are going to run downsampled versions of the data for teaching purposes, but the steps we will follow are largely applicable to the full dataset.
Another point to note is that the paper we are obtaining the data from is a beginner's guide for RNA-Seq published about 10 years ago. Given the timeline, you will note that there are some differences between the analysis that was done for the paper and the analysis done today. The reason is that a lot of the tools used by the paper are now considered outdated or have been phased out of the field. Bioinformatics and computational biology are fast-paced domains, so it is important to ensure that all analyses use correct and modern approaches. In summary, always ensure that the tools and pipelines you apply to your analysis are up to date with what is considered best practice in the RNA-Seq domain.
### Creating the Samplesheet
So let's work on running the pipeline on actual data. The rnaseq pipeline (like most nf-core pipelines) receives input via a samplesheet, the format of which can be seen here:
<https://nf-co.re/rnaseq/3.14.0/docs/usage>
We will again want to open the interactive file system from UAB Research Computing's OnDemand application, and navigate to the below directory (NOTE: The `<blazer_id>` in the path should be replaced with your blazer id):
```
/scratch/<blazer_id>/rnaseq_workshop
```
We will want to start by creating a file called `samplesheet.csv`. In order to create a file, click the `New File` button in the upper right of interactive file system, and in the dialog box type `samplesheet.csv`.
Once the file is created, make sure it is highlighted and click the `Edit` button. In the new window, enter the text below.
```
sample,fastq_1,fastq_2,strandedness
```
We've put the header into the samplesheet, so let's save it; next we need to add the FASTQs.
The next step in creating the samplesheet is to populate it with information. We will do that via the command line available from OnDemand. To access the command line, click the `Open in Terminal` button located in the upper right of the interactive file system.
In order to get the list of fastqs, we'll run the below command.
```
for file in input/downsample_fastqs/*.fastq.gz; do echo $file; done
```
That will output all the FASTQs we will be working with. Let's redirect the output to the end of the samplesheet:
```
for file in input/downsample_fastqs/*.fastq.gz; do echo $file; done >> samplesheet.csv
```
Open up the samplesheet again. You will see the header and a list of FASTQs. To start modifying it, a sample name needs to be added to each line, along with the additional commas for the remaining columns.
**Note**: the paths below in the example samplesheet are relative to our own Globus endpoint. You should use the paths from the for loop output shown above.
::: scrollable-horizontal
```{=html}
<pre>
sample,fastq_1,fastq_2,strandedness
N02_AM_Naive,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/downsample_fastqs/SRX4328049_SRR7457560.fastq.gz,,auto
N01_AM_Naive,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/downsample_fastqs/SRX4328050_SRR7457559.fastq.gz,,auto
N04_AM_Naive,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/downsample_fastqs/SRX4328051_SRR7457558.fastq.gz,,auto
N03_AM_Naive,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/downsample_fastqs/SRX4328052_SRR7457557.fastq.gz,,auto
R08_AM_Allo24h,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/downsample_fastqs/SRX4328047_SRR7457562.fastq.gz,,auto
R07_AM_Allo24h,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/downsample_fastqs/SRX4328048_SRR7457561.fastq.gz,,auto
R06_AM_Allo24h,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/downsample_fastqs/SRX4328057_SRR7457552.fastq.gz,,auto
R05_AM_Allo24h,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/downsample_fastqs/SRX4328058_SRR7457551.fastq.gz,,auto
</pre>
```
:::
The sample name is completely up to the user, so pick something that will help with identification if a sample name does not already exist. We have set strandedness to `auto` for the purposes of the workshop; if you know whether the data is forward or reverse stranded, by all means use that instead. For larger datasets, the rows can also be pre-filled programmatically, as shown in the sketch below.
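This is a minimal, hypothetical sketch of pre-filling the samplesheet rows; it simply derives the sample name from the file name, which you would normally replace with meaningful sample names afterwards:
```
# Regenerate the samplesheet with one row per FASTQ (single-end, auto strandedness)
echo "sample,fastq_1,fastq_2,strandedness" > samplesheet.csv
for file in input/downsample_fastqs/*.fastq.gz; do
    sample=$(basename "$file" .fastq.gz)
    echo "${sample},${file},,auto" >> samplesheet.csv
done
```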
> **Discussion Question**
>
> You'll note that we have two commas next to each other in the samplesheet, which means we have left a column empty. The data for this workshop is single-end, meaning there is just a single FASTQ per sample for us to analyze. It's also common for data to be paired-end, which would result in a pair of FASTQs for each sample, known as read 1 and read 2.
>
> Can anyone tell me what we would change with this samplesheet if the data was paired-end instead of single-end?
>
> <details>
>
> <summary>**Click here for solution**</summary>
>
> We would need to fill in the column that is left blank. The samplesheet headers help guide us in what information needs to be placed in each column. In this case, `fastq_1` would correspond to the read 1 fastq and `fastq_2` would correspond to the read 2 fastq.
>
> </details>
### Adding Pipeline Parameters
One point of customization is the pipeline parameters. The amount of configurability the parameters provide depends heavily on the pipeline and its authors, so we won't be able to customize every aspect of the pipeline, but parameters are often based on common requests by users of the pipeline.
The parameter options for an nf-core pipeline can be found on the pipeline website; for rnaseq they are available here: <https://nf-co.re/rnaseq/3.14.0/parameters>
While there are a lot of options, it's very rare that you will need all of them, as the pipeline authors have set sensible defaults. The parameters cover everything from required inputs to which steps to run in the pipeline. It is **heavily encouraged** to read through this page when doing your own analysis to evaluate whether there are any flags that may apply, as well as to note the default values, as these can have a notable impact on analyses.
Parameters can be passed as flags on the command line or via a yaml file. For the workshop, we will use the yaml option in order to keep the pipeline execution command more straightforward.
Inside the `/scratch/<blazer_id>/rnaseq_workshop` directory, we want to create a new file called `params.yml`. Once created, let's open the file up for editing.
```
# names/email
email: "atrull@uab.edu"
multiqc_title: "rnaseq_workshop"
```
The first parameters to add are our email and the MultiQC report title. With the email set, we'll get an email when the pipeline fails or succeeds.
```
# input samplesheet
input: "./samplesheet.csv"
# Genome references
fasta: "/path/to/fasta"
gtf: "/path/to/gtf"
gencode: true
```
Next we’ll provide the required parameters. `input` is the path to the samplesheet. `fasta` and `gtf` are the paths to the FASTA and GTF files to be used for the analysis. Note that these can be obtained from the GENCODE website at <https://www.gencodegenes.org/>. For genomes not available you can also find reference files at the Ensembl website <https://useast.ensembl.org/index.html>. If you use GENCODE reference files, be sure to set the `gencode` parameter to `true`. For this workshop we have provided the GENCODE reference (version M32) in the data downloaded from Globus (`reference` folder).
```
# Read Trimming Options
trimmer: "trimgalore"
extra_trimgalore_args: "--illumina"
min_trimmed_reads: 1
```
We are going to use `trimgalore` as the trimmer. The pipeline also provides `fastp` as an option; neither one is necessarily better than the other. `TrimGalore` is a wrapper around an older tool called `cutadapt` together with `FastQC`. TrimGalore will try to guess the adapters it needs to cut, but you can override this by passing specific parameters. This dataset uses the Illumina default adapters, so the `--illumina` flag needs to be passed to the tool.
We will additionally use the `min_trimmed_reads` parameter. This parameter sets the minimum number of reads a sample needs to have after trimming. The value of this parameter, and whether it needs to be used at all, depends on the experiment and the data being analyzed. We are using it here because we are working with downsampled files, which will not meet the default value for this parameter.
```
# Alignment Options
aligner: "star_salmon"
pseudo_aligner: "salmon"
```
The aligner we use is `star_salmon`. For rnaseq, this means `STAR` will be the aligner and `Salmon` will be used for quantification. We also instruct the pipeline to run quasi-mapping with `Salmon`. The choice of which to use for your analysis depends on factors including experimental design and resource limitations. As previously discussed, if you also have genome data being processed, you may want to use `STAR` (followed by `Salmon` quantification), since it outputs genomic mapping data. We typically execute the pipeline so that both mapping strategies are run (as demonstrated above); this way, the situation never arises where the pipeline needs to be rerun just to generate the files from the other approach.
```
# Quality Control
deseq2_vst: true
```
We'll also add this parameter for the run.
- `deseq2_vst` will use the VST transformation over DESeq2's rlog. This is ideal as VST is faster to run and more suitable for larger datasets. We will see more about this in the tertiary analysis section. The complete `params.yml` assembled from the snippets above is shown below.
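Putting the pieces together, the full `params.yml` for this run would look like the following. The email is specific to this example (substitute your own), and the `fasta`/`gtf` values are placeholders that should point at the actual GENCODE M32 files from the `reference` folder:
```
# names/email
email: "atrull@uab.edu"
multiqc_title: "rnaseq_workshop"

# input samplesheet
input: "./samplesheet.csv"

# Genome references
fasta: "/path/to/fasta"
gtf: "/path/to/gtf"
gencode: true

# Read Trimming Options
trimmer: "trimgalore"
extra_trimgalore_args: "--illumina"
min_trimmed_reads: 1

# Alignment Options
aligner: "star_salmon"
pseudo_aligner: "salmon"

# Quality Control
deseq2_vst: true
```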
### Additional Customization
Pipeline parameters are very useful, but they only go so far. An additional route of customization is to use the custom configuration ("configs") files implemented within Nextflow. Custom configs allow customization of everything from resource allocations to adding parameters to specific tools in the pipeline.
To demonstrate this, let's start by creating a file called `custom.conf`, and adding the below lines.
```
process {
    executor = 'slurm'
    queue = 'amd-hdr100'
}
```
Custom configuration works off the idea of scopes: users can group settings so they apply to specific pieces of the pipeline.
The above lines of code set the executor to be slurm, and set the queue so that all jobs will be submitted to the `amd-hdr100` partition on cheaha.
Another item that can be configured via this method is adding parameters to specific tools in the pipeline. In order to do that, you will need to find the corresponding module. For this example, let's modify the Salmon step. There is a way to do this via the pipeline parameters (with the `extra_salmon_quant_args` parameter); however, this is a good opportunity to show custom configuration of pipeline tools.
The first step to do that is to find the name of the process. For this step, it's important to understand the nf-core directory structure a little better. Let’s start by changing into the rnaseq directory that was downloaded when the test command was executed.
Open up the OnDemand Terminal and run the below command.
```
cd ~/.nextflow/assets/nf-core/rnaseq/
```
The directory above contains the copy of the pipeline source code matching the version which we specified in the run. This can also be found in GitHub noting that the following links to the same pipeline version used here: <https://github.com/nf-core/rnaseq/tree/3.14.0>.
For the purpose of this workshop, we will explore some of the source code from our local copy.
There are a lot of files and directories in here; breaking down the directories one at a time:
- `assets` - Contains images and tool-specific files
- `bin` - Contains scripts used by the pipeline
- `conf` - Contains pipeline configuration files
- `docs` - Contains documentation
- `lib` - Contains groovy libraries for the pipeline to use
- `modules` - Contains the nextflow process files for individual tools. It is further split into ‘local’ modules which are unique to the pipeline and ‘nf-core’ modules which are files that have been obtained from a central repo.
- `subworkflows` and `workflows` - Contains the actual workflow files themselves. `subworkflows` are parts of a workflow that are repeated multiple times so are abstracted out of the main workflow to alleviate the maintenance burden.
There are additional files in the pipeline directory, but most of these are used for either documentation or for the pipeline template. For the purposes of this workshop these are largely unimportant, but they become very important should someone choose to write an nf-core-style pipeline.
To find Salmon, we need to search down the `modules` directory. There are two bash commands that can be executed to find the Salmon process file:
```
find modules -name "*salmon*"
```
OR
```
grep -rl "salmon" modules/
```
The `find` command assumes that there is a file or directory named after the tool. This isn't always the case, so the `grep` can help in cases where `find` does not return anything.
We can see that the salmon module is located at `modules/nf-core/salmon/quant`. Let's take a look at the process file:
```
cat modules/nf-core/salmon/quant/main.nf
```
The name of the process is at the very top of the file, which is what we need. Feel free to explore this, but a lot of this is Nextflow syntax. Special attention should be paid to the command for the tool which is located towards the bottom of the file.
With the name of the process, let’s go back and open up the custom configuration file using the OnDemand Editor and add the below lines.
```
process {
    withName: '.*:SALMON_QUANT' {
        ext.args = '--gcBias --seqBias'
    }
}
```
With this customization, Nextflow will add the two additional flags. The flags themselves tell `Salmon` to correct for GC and sequence bias. We use `withName` to limit the additions to only the Salmon quantification process, as opposed to all other tools. Since `withName` accepts regular expressions, we use `.*` to act as a catch-all for all Salmon quantification jobs.
There are additional ways you can use this config: controlling whether Nextflow processes run with Singularity or Docker, or further configuring processes, such as controlling memory or CPUs (a sketch is shown below). This makes custom configuration a very powerful tool for modifying a pipeline without changing the source code (this is known as modularity).
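As an illustration of the resource side, here is a minimal, hypothetical sketch of raising the resources for a single process. The `cpus`, `memory`, and `time` directives are standard Nextflow process directives; the process name and values here are placeholders that you would confirm the same way we located `SALMON_QUANT` above:
```
process {
    // Hypothetical example: give the STAR alignment step more resources
    withName: '.*:STAR_ALIGN' {
        cpus   = 8
        memory = '64.GB'
        time   = '12.h'
    }
}
```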
### Submitting the pipeline
With all the files created, we can submit the pipeline. We'll create a `run_rnaseq.sh` file and put the lines below in it. Note that while we're increasing the time limit, we aren't increasing any other resources for the job, because we are only submitting the Nextflow runner, which requires very minimal resources. In general, when submitting the Nextflow runner, the main resource that may need increasing is the `--time` flag, as that determines how much time you are allocating for the pipeline to run.
```
#!/usr/bin/env bash
#SBATCH --job-name=rnaseq_subsample
#SBATCH --output=rnaseq_subsample.out
#SBATCH --error=rnaseq_subsample.err
#SBATCH --time=12:00:00
#SBATCH --partition=short
#SBATCH --mem-per-cpu=5000
#SBATCH --cpus-per-task=1
# load environment
module load Anaconda3
conda activate $USER_SCRATCH/conda_envs/rnaseq_workshop
# run pipeline
nextflow run nf-core/rnaseq \
--outdir ./subsample_results \
-profile cheaha \
-r 3.14.0 \
-params-file ./params.yml \
-c custom.conf
```
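As with the test run, the script is submitted from the `rnaseq_workshop` directory and can be monitored in the same way:
```
sbatch run_rnaseq.sh

# monitor the Nextflow runner and the jobs it submits
squeue -u $USER
```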
> **Discussion Question**
>
> So we have currently submitted the downsampled data. Can anyone tell me what we would need to change in order to run the full dataset as if we were doing a complete analysis?
>
> <details>
>
> <summary>**Click here for solution**</summary>
>
> The files that need to be updated are provided below with a description of what needs to be changed.
>
> `samplesheet.csv` needs to have the file paths in the `fastq_1` column to point to the full dataset.
>
> ::: scrollable-horizontal
> ```{=html}
> <pre>
> sample,fastq_1,fastq_2,strandedness
> N02_AM_Naive,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/fastqs/SRX4328049_SRR7457560.fastq.gz,,auto
> N01_AM_Naive,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/fastqs/SRX4328050_SRR7457559.fastq.gz,,auto
> N04_AM_Naive,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/fastqs/SRX4328051_SRR7457558.fastq.gz,,auto
> N03_AM_Naive,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/fastqs/SRX4328052_SRR7457557.fastq.gz,,auto
> R08_AM_Allo24h,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/fastqs/SRX4328047_SRR7457562.fastq.gz,,auto
> R07_AM_Allo24h,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/fastqs/SRX4328048_SRR7457561.fastq.gz,,auto
> R06_AM_Allo24h,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/fastqs/SRX4328057_SRR7457552.fastq.gz,,auto
> R05_AM_Allo24h,/data/project/U_BDS/Globus_endpoints/rnaseq_workshop/input/fastqs/SRX4328058_SRR7457551.fastq.gz,,auto
> </pre>
> ```
> :::
>
> `custom.conf` would remove the section we added that routes all jobs to the `amd-hdr100` queue. We want the jobs to go to the 'correct' queue, since we would be processing the full dataset, which may require different resources better suited to a different queue.
>
> ``` custom.conf
> process {
>     withName: '.*:SALMON_QUANT' {
>         ext.args = '--gcBias --seqBias'
>     }
> }
> ```
>
> `run_rnaseq.sh` maintains the same time limit as for our downsampled data. For most RNA-Seq analyses this is sufficient, but if your pipeline fails due to the time limit, simply increase it further.
>
> ``` run_rnaseq.sh
> #!/usr/bin/env bash
>
> #SBATCH --job-name=rnaseq_full
> #SBATCH --output=rnaseq_full.out
> #SBATCH --error=rnaseq_full.err
> #SBATCH --time=12:00:00
> #SBATCH --partition=short
> #SBATCH --mem-per-cpu=5000
> #SBATCH --cpus-per-task=1
>
> # load environment
> module load Anaconda3
> conda activate $USER_SCRATCH/conda_envs/rnaseq_workshop
>
> # run pipeline
> nextflow run nf-core/rnaseq \
>     --outdir ./full_results \
>     -profile cheaha \
>     -r 3.14.0 \
>     -params-file ./params.yml \
>     -c custom.conf
> ```
>
> </details>
> **Discussion Question**
>
> In the runs above, we wrote a custom configuration for the Salmon flags as a demonstration of how to pass tool-specific parameters to nf-core pipelines.
> This was a good example of how to configure your pipeline in a more advanced manner (e.g.: when no alternative approach exists).
>
> However, the pipeline authors have also provided explicit parameters in this pipeline which enable users to accomplish the same task in a more straightforward manner.
> From looking at the documentation, what is the alternative approach for providing the `--gcBias --seqBias` parameters to Salmon?
>
> <details>
> <summary>**Click here for solution**</summary>
> You can add the following to your `params.yml` with the pipeline flag `extra_salmon_quant_args`
>
>```
> extra_salmon_quant_args: "--seqBias --gcBias"
>```
>
> **NOTE**: this option only became available with version >= 3.10. This is a good example of the advantages of always checking the pipeline release notes for updates/improvements and bug fixes.
>
> </details>
------------------------------------------------------------------------
## Reviewing Pipeline Results
### Pipeline Metrics
Inside the `subsample_results` directory, there is a `pipeline_info` directory that contains the metrics and statistics for the jobs the pipeline submitted to the cluster. We are going to review the report and timeline files.
![](imgs/report.png)
The report file provides details on the resource usage and timings for each process that was executed as part of the pipeline. This is useful for narrowing down jobs with high memory usage for optimization purposes.
![](imgs/timeline.png)
The timeline file provides a detailed timeline noting when each process was executed. This is especially useful in being able to estimate how long a pipeline may take and to identify long running tools.
### Secondary Outputs
The rnaseq pipeline produces a lot of output files. Many of them are for QC and general review by the pipeline user. Due to the number of files, we will only go over some of the more commonly used files in this workshop, but note that this pipeline (along with other nf-core pipelines) has dedicated documentation describing the various output files. For nf-core/rnaseq, the description of all the output files can be found [here](https://nf-co.re/rnaseq/3.14.0/docs/output).
#### MultiQC Report
The first point of review for the pipeline is the MultiQC report. MultiQC is a very useful tool that aggregates the reports from a number of different tools into a single report, providing a quick way to review the quality of the data and determine if there are any potential concerns.
The first table in the MultiQC report is the "General Statistics Table". This provides a high-level overview of various metrics, from read counts in the BAM to read counts in the FASTQ. It's a lot of metrics, but it is useful for assessing whether a large amount of data filtration occurred. This is also a good place to check if the samples have sufficient sequencing depth. Standard mRNA-Seq typically requires an average of \~25M reads; this requirement increases for whole-transcriptome sequencing (where other RNA species are also quantified). For more reading about sequencing depth considerations, please visit [this Illumina resource](https://knowledge.illumina.com/library-preparation/rna-library-prep/library-preparation-rna-library-prep-reference_material-list/000001243).
![](imgs/multiqc_1.png)
The rnaseq pipeline also performs the very initial stages of tertiary analysis. It is definitely recommended to do these on your own rather than rely on the results from rnaseq, as these are very generic. These results help indicate whether there are any outliers within your dataset or potential sample swaps that may be a concern, since we expect similar samples to cluster together.
![](imgs/multiqc_2.png)
![](imgs/multiqc_3.png)
The next three QC reports are very useful for RNA-Seq. We expect these reads to come from exonic, protein-coding regions. These QCs act as good sanity checks that our data matches our assumptions.
![](imgs/multiqc_6.png)
![](imgs/multiqc_7.png)
![](imgs/multiqc_8.png)
`RSeQC` is a tool with a lot of quality-control modules for RNA-Seq data. In addition to the above, it also lets us check duplication levels. The image below is indicative of a good, 'normal' plot for this metric.
![](imgs/multiqc_9.png)
There are a couple of variations of BAM metrics. `SAMtools` is a common tool with a number of subtools that provide useful metrics for your BAM. The key points to look for are mapping rates and unique mappings (and properly paired reads in the case of paired FASTQs), as those are expected to be high regardless of dataset, though the degree varies based on species.
![](imgs/multiqc_12.png)
![](imgs/multiqc_13.png)
![](imgs/multiqc_14.png)
![](imgs/multiqc_15.png)
The mappings per chromosome can vary, but in general the smaller the chromosome, the fewer mappings will occur, so this graph is pretty normal.
![](imgs/multiqc_16.png)
In the case of experiments involving multiple sexes of organisms, the XY count chart can be very useful for ensuring no sample swap has occurred.
![](imgs/multiqc_17.png)
In addition to `SAMtools`, `STAR` provides its own version of these metrics. We are again looking for high unique mapping rates. Multi-mapped reads are okay, though; we just need to make sure they are not the overwhelming majority.
![](imgs/multiqc_18_true.png)
`FastQC` is a very common tool used to provide metrics for FASTQ files. A per-sample `FastQC` report is also generated, but as a result it does not aggregate the results across samples like MultiQC does.
We also note that the `FastQC` plot below also QCs duplicate rates from the raw data. Bear in mind that unless the RNA-Seq library preparation protocol implements UMI tagging, duplicates cannot be removed from RNA-Seq data (as it is not possible to differentiate technical/artifact duplicates from biological duplicates derived from the high abundance of RNA molecules / expression).
![](imgs/multiqc_19.png)
For short-read data, we expect the mean quality score to remain well within the green section. It's important to note that there are quality drops at the start and end of the read; this is a result of the technology, and any low-quality reads (based on trimming parameters) are removed at the trimming step.
![](imgs/multiqc_20.png)
![](imgs/multiqc_21.png)
The Per Base Sequence Content is very useful for determining if there is any potential issue with the FASTQ. You'll again notice there are issues at the start and end of the reads, and that this is universal across samples. Clicking each row provides a better view of which base is the strongest. This is again expected for the reasons mentioned earlier, but always worth investigating.
![](imgs/multiqc_22.png)
![](imgs/multiqc_23.png)
This is pretty typical GC content; a distribution that deviates strongly from this could indicate species contamination.
![](imgs/multiqc_24.png)
Per Base N Content indicates how many 'N's were detected in the dataset. An N means the base could not be determined, and high levels could indicate a potentially significant sequencer error.
![](imgs/multiqc_25.png)
![](imgs/multiqc_26.png)
![](imgs/multiqc_27.png)
We expect a level of adapters in our data; however, note that this is a pre-trimming figure, and that the adapters are removed at the trimming step.
![](imgs/multiqc_28.png)
Finally, there is the pipeline information. Just below the citations are the tool versions. This is useful for manuscript writing and for reporting tool-specific versions (along with the rnaseq pipeline version). It is also critical to cite the pipeline, the nf-core community, and the tools' developers at the writing stage (many of these tools have published papers available for citation).
All nf-core pipelines contain a `CITATIONS.md` document listing all tools and their proper citations. The `README.md` file describes how to cite the pipeline itself.
![](imgs/multiqc_30.png)
The workflow summary indicates what parameters and options were used during the analysis.
![](imgs/multiqc_31.png)
#### Files for Tertiary Analysis
Due to the way we configured the rnaseq pipeline, there are two sets of files we can use for tertiary analysis. While we only need to choose one set, you can also run analyses on both to determine which results you prefer, but note that the two sets should produce very similar results.
##### STAR + Salmon Results
The `star_salmon` results are produced by aligning reads using STAR and quantifying using Salmon.
##### Salmon Results
The `salmon` results are produced by performing quasi-mapping using `Salmon`. This method will NOT produce any alignment files as discussed earlier.
### Troubleshooting Nextflow Pipelines
Another useful skill when running Nextflow pipelines is the ability to troubleshoot pipelines.
We've reviewed the results, but we are still unsure whether the arguments we passed in with the custom config were applied to the pipeline. In our case, we want to make sure that the process named `SALMON_QUANT` had the `--gcBias --seqBias` flags added to its command.
All this information is held within the `work` directory, so let's take a peek into it.
```
ls work
```
If we take a look at the `work` directory, we'll see a number of directories whose purpose isn't obvious. Nextflow names its process working directories with hashes for bookkeeping, which makes it much harder for us to locate information on individual processes.
In order to find our command, we'll need to use the Nextflow log command:
```
nextflow log
```
With no arguments, this command will produce a history of Nextflow pipelines. It produces some basic metrics, such as start date, duration, and even the command that was used to execute the pipeline.
It's also important to note that the output is ordered such that newer runs are at the bottom of the list and older runs are at the top. Our test run should be the last item in the list (or the only item, in the case that it's the only entry).
> Discussion Question
>
> What happens if you run this command in a different directory?
>
> You may get output that looks like the below:
>
> `It looks like no pipeline was executed in this folder (or execution history is empty)`
>
> The Nextflow log command uses a hidden folder called `.nextflow`. This directory contains the information for all Nextflow runs that have been executed within that directory. As a result, `nextflow log` is only useful in locations where Nextflow runs have been executed.
In order to continue, we'll need to obtain the run name for our Nextflow run. Run names are in the third column of the Nextflow log output, and for our run it will be in the last row in the output of Nextflow log. The run name is a randomly generated phrase of two words separated by an underscore. For the sake of simplicity, I'll be using the name of my run (`thirsty_heisenberg`) in the examples below, but just note that your run name is expected to be different and just use your run name in place of mine when running the below commands.
```
nextflow log thirsty_heisenberg
```
This time when we run the Nextflow log command, we receive a list of file paths. These paths are the work directories used by the pipeline, but on their own they don't tell us which process ran in which directory.
```
nextflow log thirsty_heisenberg -l
```
We need to configure the columns produced by the Nextflow log command: running the same command with the `-l` flag lists all the column names the tool can produce. We want to find the `SALMON_QUANT` process and its command, so we'll want the `name` and `script` columns.
```
nextflow log thirsty_heisenberg -f name,script
```
This produces a lot of output; we can see that it has changed to contain the process name and a series of bash commands. However, it is a lot to go through when we only care about a single process, so let's filter down the output:
```
nextflow log thirsty_heisenberg -f name,script -filter 'name =~ ".*SALMON_QUANT.*"'
```
As can be seen in the previous command's output, the process names are very long and there is not an easy way to find the one we want, so we've added the `-filter` flag to reduce the results. The value we're assigning to it is `'name =~ ".*SALMON_QUANT.*"'`. This is a regular expression, which just means we are trying to find a string whose structure we know, but not the exact string. Let's break down the regex a little:
- The `=~` operator is used when you are performing a regex match.
- `.*SALMON_QUANT.*` is our regex to match against
- `.` means any character.
- `*` means that the previous character can repeat 0 or more times
Putting it together, the regex means that we are looking for the string `SALMON_QUANT` and it can appear anywhere in the `name` string.
Running this yields a number of results, but we are able to now validate that we do see the `--gcBias --seqBias` being added to all commands.
Lastly, the various `work` directories are also useful for troubleshooting the pipeline, as they contain other "hidden" files (files starting with a `.`) and logs that are useful for the troubleshooting process.
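As a minimal sketch, a single task's work directory can be inspected like this (the hash-based path below is a placeholder; substitute one of the paths returned by `nextflow log`):
```
# Hypothetical work directory; replace with a real path from `nextflow log`
cd work/ab/cdef1234567890
ls -a
# Files typically found here include:
#   .command.sh  - the exact script Nextflow executed for this task
#   .command.log - the captured output of the task
#   .command.err - the captured stderr of the task
#   .exitcode    - the task's exit status
```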