cluster_identifier/
- This directory contains the part of the application responsible for identifying soft-clipped clusters. For build instructions, see the build section.
cluster_analysis/
- Contains code related to the analysis and interpretation of soft-clipped clusters.
validation/
- Sample BAM and outputs for testing SCRAMble.
README
- Repository description/notes.
Install dependencies (Ubuntu 20.04):
apt-get update
apt-get install -y \
autoconf \
autogen \
build-essential \
curl \
libbz2-dev \
libcurl4-openssl-dev \
libhts-dev \
liblzma-dev \
libncurses5-dev \
libnss-sss \
libssl-dev \
libxml2-dev \
ncbi-blast+ \
r-base \
r-bioc-biostrings \
r-bioc-rsamtools \
r-cran-biocmanager \
r-cran-devtools \
r-cran-stringr \
r-cran-optparse \
zlib1g-dev
Install R package dependencies:
Rscript -e "library(devtools); install_github('mhahsler/rBLAST')"
To build the cluster_identifier (estimated install time <5 minutes):
$ cd cluster_identifier/src
$ make
That should be it. It will create an executable named build/cluster_identifier.
SCRAMble runs as a two-step process. First, cluster_identifier is used to generate soft-clipped read cluster consensus sequences. Second, SCRAMble-MEIs.R analyzes the cluster file for likely MEIs. Running SCRAMble on the test bam in the validation directory should take <1 minute for each step.
To run SCRAMble cluster_identifier:
$ /path/to/scramble/cluster_identifier/src/build/cluster_identifier \
/path/to/install_dir/scramble/validation/test.bam > /path/to/output/test.clusters.txt
To run SCRAMble-MEIs and SCRAMble-dels (with default settings):
$ Rscript --vanilla /path/to/scramble/cluster_analysis/bin/SCRAMble.R \
--out-name /path/to/output/test \
--cluster-file /path/to/output/test.clusters.txt \
--install-dir /path/to/scramble/cluster_analysis/bin \
--mei-refs /path/to/scramble/cluster_analysis/resources/MEI_consensus_seqs.fa \
--ref /path/to/scramble/validation/test.fa \
--eval-meis \
--eval-dels
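The two steps above can be wrapped in a small helper when the same commands are run for several samples. The sketch below only prints the two command lines for review; it does not execute them. The function name `scramble_commands` and its argument order are hypothetical, and it assumes paths contain no whitespace. The flags are the ones shown in the commands above.

```shell
# Hypothetical dry-run helper (not part of SCRAMble): print the two commands
# for one sample so they can be inspected before running.
# Arguments: SCRAMBLE_DIR BAM REF_FASTA OUT_PREFIX
scramble_commands() {
  local scramble_dir="$1" bam="$2" ref="$3" out_prefix="$4"
  local clusters="${out_prefix}.clusters.txt"

  # Step 1: cluster_identifier writes cluster consensus sequences to stdout.
  printf '%s %s > %s\n' \
    "${scramble_dir}/cluster_identifier/src/build/cluster_identifier" \
    "$bam" "$clusters"

  # Step 2: SCRAMble.R evaluates the cluster file for MEIs and deletions.
  printf 'Rscript --vanilla %s --out-name %s --cluster-file %s --install-dir %s --mei-refs %s --ref %s --eval-meis --eval-dels\n' \
    "${scramble_dir}/cluster_analysis/bin/SCRAMble.R" \
    "$out_prefix" "$clusters" \
    "${scramble_dir}/cluster_analysis/bin" \
    "${scramble_dir}/cluster_analysis/resources/MEI_consensus_seqs.fa" \
    "$ref"
}

scramble_commands /path/to/scramble \
  /path/to/scramble/validation/test.bam \
  /path/to/scramble/validation/test.fa \
  /path/to/output/test
```

Printing rather than executing keeps the helper safe to try on a machine where SCRAMble is not yet built.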
SCRAMble is also distributed with a Dockerfile. To run SCRAMble using Docker (estimated install time <10 minutes):
$ git clone https://github.com/GeneDx/scramble.git
$ cd scramble
$ docker build -t scramble:latest .
$ docker run -it --rm scramble:latest bash
# cluster_identifier \
/app/validation/test.bam > /app/validation/test.clusters.txt
# Rscript --vanilla /app/cluster_analysis/bin/SCRAMble.R \
--out-name ${PWD}/test \
--cluster-file /app/validation/test.clusters.txt \
--install-dir /app/cluster_analysis/bin \
--mei-refs /app/cluster_analysis/resources/MEI_consensus_seqs.fa \
--ref /app/validation/test.fa \
--eval-dels \
--eval-meis
The output of cluster_identifier is a tab delimited text file with clipped cluster consensus sequences. The columns are as follows:
1. | Coordinate |
2. | Side of read where soft-clipping occurred |
3. | Number of reads in cluster |
4. | Clipped read consensus |
5. | Anchored read consensus |
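Since the third column holds the number of reads in each cluster, the cluster file can be thinned with a one-line `awk` filter before inspection. This is a minimal sketch, not part of SCRAMble: the function name `filter_clusters` is hypothetical, the demo rows are made up (including the coordinate format), and it assumes the file has no header line.

```shell
# Hypothetical helper: keep clusters supported by at least MIN_READS reads.
# Column 3 of the cluster file is the number of reads in the cluster.
filter_clusters() {
  local min_reads="$1" cluster_file="$2"
  awk -F'\t' -v min="$min_reads" '$3 + 0 >= min + 0' "$cluster_file"
}

# Made-up rows for illustration only; real coordinates and sequences differ.
demo="$(mktemp)"
printf 'chr1:1000\tleft\t12\tACGTACGT\tTTGACCA\n' >  "$demo"
printf 'chr1:2000\tright\t3\tGGCATT\tCCATGA\n'    >> "$demo"

# Prints only the rows with 5 or more supporting reads.
filter_clusters 5 "$demo"
rm -f "$demo"
```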
Calling SCRAMble.R with --eval-meis produces a tab-delimited file. If a reference .fa file is provided, then a VCF is produced as well. The <out-name>_MEIs.txt output is a tab-delimited text file with MEI calls. If no MEIs are present, an output file will still be produced with only the header.
The columns are as follows:
1. | Insertion | Coordinate where MEI insertion occurs (zero-based) |
2. | MEI_Family | The consensus sequence to which the clipped sequence aligned best |
3. | Insertion_Direction | Whether MEI is on fwd or rev strand relative to bam reference |
4. | Clipped_Reads_In_Cluster | Number of supporting reads in cluster |
5. | Alignment_Score | Pairwise alignment score of clipped read consensus to MEI reference sequence |
6. | Alignment_Percent_Length | Percent of clipped read consensus sequence involved in alignment to MEI reference sequence |
7. | Alignment_Percent_Identity | Percent identity of alignment of clipped read consensus sequence with MEI reference sequence |
8. | Clipped_Sequence | Clipped cluster consensus sequences |
9. | Clipped_Side | Left or right, side of read where soft-clipping occurred |
10. | Start_In_MEI | Left-most position of alignment to MEI reference sequence |
11. | Stop_In_MEI | Right-most position of alignment to MEI reference sequence |
12. | polyA_Position | Position of polyA clipped read cluster if found |
13. | polyA_Seq | Clipped cluster consensus sequences of polyA clipped read cluster if found |
14. | polyA_SupportingReads | Number of supporting reads in polyA clipped read cluster if found |
15. | TSD | Target site duplication sequence if polyA clipped read cluster found |
16. | TSD_length | Length of target site duplication if polyA clipped read cluster found |
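Because the MEI output always carries a header, rows can be filtered by column name rather than by position. The sketch below keeps calls whose named numeric column meets a threshold. The function name `filter_by_column` is hypothetical, and the demo table is a reduced, made-up stand-in for a real `<out-name>_MEIs.txt` (only three of the sixteen columns, with placeholder values).

```shell
# Hypothetical helper: print the header plus every row whose named column
# is numerically >= MIN. Fails with status 1 if the column is missing.
filter_by_column() {
  local file="$1" column="$2" min="$3"
  awk -F'\t' -v col="$column" -v min="$min" '
    NR == 1 {
      # Locate the requested column in the header line.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      print
      next
    }
    $idx + 0 >= min + 0
  ' "$file"
}

# Reduced, made-up table for illustration only.
demo="$(mktemp)"
printf 'Insertion\tMEI_Family\tClipped_Reads_In_Cluster\n' >  "$demo"
printf 'chr1:1000\tL1\t4\n'                                >> "$demo"
printf 'chr2:5000\tSVA\t25\n'                              >> "$demo"

filter_by_column "$demo" Clipped_Reads_In_Cluster 10
rm -f "$demo"
```

Looking the column up by name keeps the filter valid if the column order ever changes, and a header-only file passes through unchanged.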
Calling SCRAMble.R with --eval-dels produces a VCF and a tab-delimited file. The <out-name>_PredictedDeletions.txt output is a tab-delimited text file with deletion calls. If no deletions are present, an output file will still be produced with only the header.
The columns are as follows:
1. | CONTIG | Chromosome |
2. | DEL.START | Deletion start coordinate (0-based) |
3. | DEL.END | Deletion end coordinate (0-based) |
4. | REF.ANCHOR.BASE | Reference base at deletion start |
5. | DEL.LENGTH | Deletion length |
6. | RIGHT.CLUSTER | Name of right cluster |
7. | RIGHT.CLUSTER.COUNTS | Number of supporting reads in right cluster |
8. | LEFT.CLUSTER | Name of left cluster |
9. | LEFT.CLUSTER.COUNTS | Number of supporting reads in left cluster |
10. | LEN.RIGHT.ALIGNMENT | Length of right-clipped consensus sequence involved in alignment |
11. | SCORE.RIGHT.ALIGNMENT | BLAST alignment bitscore for right-clipped consensus |
12. | PCT.COV.RIGHT.ALIGNMENT | Percent length of right-clipped consensus involved in alignment |
13. | PCT.IDENTITY.RIGHT.ALIGNMENT | Percent identity of right-clipped consensus in alignment |
14. | LEN.LEFT.ALIGNMENT | Length of left-clipped consensus sequence involved in alignment |
15. | SCORE.LEFT.ALIGNMENT | BLAST alignment bitscore for left-clipped consensus |
16. | PCT.COV.LEFT.ALIGNMENT | Percent length of left-clipped consensus involved in alignment |
17. | PCT.IDENTITY.LEFT.ALIGNMENT | Percent identity of left-clipped consensus in alignment |
18. | INS.SIZE | Length of insert within deleted sequence (for two-end deletions only) |
19. | INS.SEQ | Inserted sequence (for two-end deletions only) |
20. | RIGHT.CLIPPED.SEQ | Clipped consensus sequence for right-clipped cluster |
21. | LEFT.CLIPPED.SEQ | Clipped consensus sequence for left-clipped cluster |
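With twenty-one columns, the deletion table is easier to scan after pulling out just the columns of interest. The following is a minimal sketch that selects columns by header name. The function name `select_columns` is hypothetical, and the demo file is a reduced, made-up stand-in for a real `<out-name>_PredictedDeletions.txt` with placeholder values.

```shell
# Hypothetical helper: print the named columns (header included), in the
# order requested. Fails with status 1 if a column is missing.
# Usage: select_columns FILE COLUMN [COLUMN...]
select_columns() {
  local file="$1"
  shift
  local wanted
  wanted="$(IFS=,; printf '%s' "$*")"   # join the column names with commas
  awk -F'\t' -v OFS='\t' -v wanted="$wanted" '
    NR == 1 {
      n = split(wanted, names, ",")
      for (i = 1; i <= NF; i++) pos[$i] = i
      for (j = 1; j <= n; j++) {
        if (!(names[j] in pos)) {
          print "column not found: " names[j] > "/dev/stderr"
          exit 1
        }
        idx[j] = pos[names[j]]
      }
    }
    {
      line = $(idx[1])
      for (j = 2; j <= n; j++) line = line OFS $(idx[j])
      print line
    }
  ' "$file"
}

# Reduced, made-up table for illustration only.
demo="$(mktemp)"
printf 'CONTIG\tDEL.START\tDEL.END\tREF.ANCHOR.BASE\tDEL.LENGTH\n' >  "$demo"
printf 'chr1\t1000\t1500\tA\t500\n'                                >> "$demo"

select_columns "$demo" CONTIG DEL.START DEL.END DEL.LENGTH
rm -f "$demo"
```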
In theory, SCRAMble should work well on any MEI reference fasta sequences; however, it has only been tested on the sequences provided in /path/to/scramble/cluster_analysis/resources/MEI_consensus_seqs.fa.