- De novo assembly of the durian genome, with structural and functional annotation.
- Transcriptome assembly and differential expression analysis.
- Biological interpretation of the results.
- Further analyses.
Check data description
Read quality control and read preprocessing are skipped, since we are working with PacBio reads.
Analysis | Type | Software | Installed/codes | ERT (estimated run time) | Input/note |
---|---|---|---|---|---|
1. DNA assembly | Assembly | Canu | UPPMAX/1_gene_assembly.sh | ~17h (4 cores) | PacBio reads |
2. Mapping Illumina reads against assembly | Mapping/Aligner | BWA (BWA-MEM) | UPPMAX/4_MappingIllumina.sh | ~1h | Illumina reads mapped to the PacBio assembly (BAM file as Pilon input) |
3. DNA assembly improvement | Assembly improvement | Pilon | UPPMAX/7_Pilon.sh | ~30min | PacBio assembly; BAM file |
4. Assembly evaluation (required after 3 (Pilon), optional after 1 (Canu)) | Fasta sequences | QUAST | Locally/3_PacBio_AssemblyQC.sh | ~ | Assembly |
RNA raw data quality check, trimming, and re-check after trimming | | FastQC, Trimmomatic | UPPMAX/2_RNA_rawData_QC.sh | ~ | |
5. Mapping (Aligner) | Eukaryotic RNA | Tophat | UPPMAX/5_tophatFolder/ | ~5h (2 cores) | Downloaded DNA sequence, different pairs of RNA reads. Part for 6, all for 10. |
6. RNA assembly | Illumina RNA | Trinity(need map../no map) | UPPMAX/9_trinity_withBAM.sh | ~5.5h (4 cores) | Merged BAM file from Tophat. |
7. Functional annotation | Eukaryotes | eggNOG-mapper | Online/Submitted online output | | maker.protein.fasta from Maker2 |
8. Find related proteins | FASTA | Not provided | Download online: (arabidopsis) AND "Arabidopsis thaliana" in: NCBI protein database | - | Reduce the number of species to run faster (used Arabidopsis) |
9. Structural annotation, two iterations (structural annotation first, then functional) | Eukaryotes | Maker2 | UPPMAX/11_Maker2/ | Two iterations: 6,12h (long) (4 cores) | Assembly, Trinity output, and related proteins |
10. Read counting | Count features | HTSeq | UPPMAX/12_forEachPair_HTSeq/ | ~ | BAM files from Tophat, gff from Maker2 |
11. Differential expression | Comparison | DESeq2 (R library; input: HTSeq) | Locally, UPPMAX/15_DEseq.R | Variable | Counts from HTSeq |
12. Visualization of the genome | Reads and genomic annotation | IGV/Artemis | Locally | Variable | Genome, .gff, BAM (one of each, then add more) |
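
The numbering in the table does not always match execution order: functional annotation (7) takes maker.protein.fasta from Maker2 (9), and read counting (10) needs both the Tophat BAM files and the Maker2 gff. Below is a minimal Python sketch of how a valid run order could be derived from those inputs, using the standard-library `graphlib` module. The step names and the dependency map are hypothetical. They are one reading of the Input/note column (for example, that Maker2 uses the Pilon-corrected assembly) and are not taken from the actual scripts.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps whose output it consumes.
# Assumption: this map is inferred from the table's Input/note column.
DEPENDS_ON = {
    "canu": set(),
    "bwa_mem": {"canu"},
    "pilon": {"canu", "bwa_mem"},
    "quast": {"pilon"},
    "tophat": set(),  # the table lists a downloaded DNA sequence as reference
    "trinity": {"tophat"},
    "related_proteins": set(),
    "maker2": {"pilon", "trinity", "related_proteins"},
    "eggnog_mapper": {"maker2"},
    "htseq": {"tophat", "maker2"},
    "deseq2": {"htseq"},
}


def run_order(depends_on):
    """Return one valid execution order in which every step follows its inputs."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(" -> ".join(run_order(DEPENDS_ON)))
```

Steps with no dependency between them (for example Canu and Tophat) can appear in either order, which also shows which jobs could be queued in parallel.
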
● Genome assembly of PacBio reads.
● Correct the assembly with Illumina reads.
● Assembly quality assessment.
● Structural and functional annotation.
● Transcriptome assembly.
● Differential expression analyses.
● Biological interpretation of the results.
● Assembly with different parameters.
● Assembly evaluation with more than one method.
● Deeper differential expression analyses, e.g. different comparisons.
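
Running several differential expression comparisons is easier from a single gene-by-sample count table. The following is a minimal Python sketch, assuming each sample has its own htseq-count output file with tab-separated lines and the count in the last column. The summary rows that htseq-count appends (names starting with `__`, such as `__no_feature`) are skipped. The function names, sample names, and paths are hypothetical.

```python
import csv


def read_htseq_counts(path):
    """Read one htseq-count output file into {gene_id: count}, skipping '__' summary rows."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2 or row[0].startswith("__"):
                continue
            # Assumption: one BAM per file, so the count is the last column.
            counts[row[0]] = int(row[-1])
    return counts


def merge_counts(sample_files):
    """Combine per-sample counts into (gene_ids, {sample: [counts in gene_ids order]})."""
    per_sample = {name: read_htseq_counts(path) for name, path in sample_files.items()}
    gene_ids = sorted(set().union(*per_sample.values()))
    matrix = {
        name: [counts.get(gene, 0) for gene in gene_ids]
        for name, counts in per_sample.items()
    }
    return gene_ids, matrix


def write_matrix(gene_ids, matrix, out_path):
    """Write a tab-separated matrix with genes as rows and samples as columns."""
    samples = list(matrix)
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene_id", *samples])
        for i, gene in enumerate(gene_ids):
            writer.writerow([gene, *(matrix[s][i] for s in samples)])
```

A subset of the sample columns can then be selected for each comparison before handing the table to DESeq2.
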
The biggest bottleneck will be the DNA assembly of the PacBio reads (~17 h). RNA assembly, annotation, and mapping will each take about 5 h or more.
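
A quick back-of-the-envelope check of these numbers in Python. The hours are the ERT values from the plan table; steps without an estimate are left out. Reading Maker2's "6,12h" as one 6 h iteration and one 12 h iteration is an assumption, and the step labels are only illustrative.

```python
# Estimated run times in hours, copied from the ERT column of the plan.
# Assumption: Maker2's "6,12h" means one 6 h and one 12 h iteration.
ERT_HOURS = {
    "Canu assembly": 17.0,
    "BWA-MEM mapping": 1.0,
    "Pilon": 0.5,
    "Tophat mapping": 5.0,
    "Trinity assembly": 5.5,
    "Maker2 iteration 1": 6.0,
    "Maker2 iteration 2": 12.0,
}


def longest_step(ert):
    """Return (name, hours) of the single longest job."""
    name = max(ert, key=ert.get)
    return name, ert[name]


def serial_total(ert):
    """Total hours if every job ran one after another."""
    return sum(ert.values())


if __name__ == "__main__":
    name, hours = longest_step(ERT_HOURS)
    print(f"Longest single job: {name} ({hours:g} h)")
    print(f"Serial total: {serial_total(ERT_HOURS):g} h")
```

Under this reading, the two Maker2 iterations together are comparable in length to the Canu run, so both are worth queueing early.
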
Check work log
I have three types of data: raw RNA reads, trimmed RNA reads, and trimmed WGS reads. They are all FASTQ files.
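
Since all inputs are FASTQ, a small sanity check on each file can catch truncated downloads before any long job is submitted. This is a minimal Python sketch, assuming the common four-lines-per-record FASTQ layout (header starting with `@`, sequence, `+` line, quality string) and optional gzip compression indicated by a `.gz` extension. The function names are hypothetical.

```python
import gzip


def open_maybe_gzip(path):
    """Open plain or gzip-compressed text, chosen by file extension."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def fastq_stats(path):
    """Return (read_count, mean_read_length) for a four-line-per-record FASTQ file."""
    n_reads = 0
    total_bases = 0
    with open_maybe_gzip(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between or after records
            seq = handle.readline().strip()
            plus = handle.readline()
            qual = handle.readline().strip()
            if (
                not header.startswith("@")
                or not plus.startswith("+")
                or len(seq) != len(qual)
            ):
                raise ValueError(
                    f"Malformed FASTQ record near read {n_reads + 1} in {path}"
                )
            n_reads += 1
            total_bases += len(seq)
    mean_len = total_bases / n_reads if n_reads else 0.0
    return n_reads, mean_len
```

For paired-end data, comparing the read counts of the two files of a pair is a cheap way to confirm that they still match after trimming.
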