Skip to content

This script is used for the analysis of unmapped reads from next generation sequencing (NGS) data.

Notifications You must be signed in to change notification settings

jalwillcox/Analyze-Unmapped-Reads

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

Analyze Unmapped Reads

This script is used for the analysis of unmapped reads from next generation sequencing (NGS) data.

Because this script uses scaffolds, it is most effective when the target DNA exists at a high enough depth to ensure sequence continuity. For example, it has been successfully used to identify:

  • Integrated HHV6 full sequences (~140kb) in 30x WGS data
  • Parvovirus B19 (5596bp) in samples with mean viral depth >1
  • Rickettsia Rickettsii (2.1Mb) fragments in a sample with mean viral depth <1

Overview

Broadly, this script works in three steps:

  1. Collect Unmapped Reads Samtools and Bedtools are used to extract all unmapped reads and their mates from bam/cram input
  2. Generate Scaffolds SPAdes is used to generate scaffolds from the unmapped reads
  3. Blast Scaffolds *Blast is used to compare scaffolds to known sequences and identify likely sources of the

Required Programs

This script uses the following programs; alternate versions may also work:

Flags

Flag Argument Default Description
-b filename (required unless using -q) - BAM/CRAM input file
-o string (required) - the basename for the output file
-a integer (optional) 5 the number of blast alignments
-c integer (optional) 8 the number of threads to allow for parallel processing
-d integer (optional) 20 dustmasker level
-f no argument (optional) - stop script after retreiving fastq files for unaligned reads
-k no argument (optional) - keep unmapped fastq files
-l blast db (optional) /n/shared_db/blastdb/202008 full path to blast database
-m integer (optional) 500 minimum scaffold length to keep
-n integer (optional) 10 max percentage of Ns in order to keep a scaffold
-q string (optional) - use zipped pair1 fastq file as input (fmt: pair1/2 or read1/2)
-r no argument (optional) - use RNA-Seq data
-s no argument (optional) - skip blast step
-t temp directory (optional) . a temporary directory to store intermediate files
-h no argument (optional) - print usage

Example Usage

To use defaults, but keep unmapped read fastq output:

./analyze-unmapped-reads.sh -b sample1.cram -o sample1 -k

To only return fastq files for unaligned reads:

./analyze-unmapped-reads.sh -b sample1.cram -o sample1 -f

To use RNA-Seq data and limit scaffold lengths to 1kb or longer:

./analyze-unmapped-reads.sh -b sample1.cram -o sample1 -r -m 1000

Output

File Description
*fasta A fasta file of the final set of scaffolds
*fastq.gz Paired fastq files for the samples' unaligned reads
*blast[nx].txt a text file with blast results

About

This script is used for the analysis of unmapped reads from next generation sequencing (NGS) data.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages