TE-Aid is a shell
+R
program aimed to help the manual curation of transposable elements (TE). It inputs a TE consensus sequence (fasta format) and requires a reference genome (in fasta as well). Using R
and the NCBI blast+ suite
, TE-Aid produces 4 figures reporting:
- (top left) the genomic hits with divergence to consensus
- (top right) the genomic coverage of the consensus
- (bottom left) a self dot-plot
- (bottom right) a structure analysis including: TIR and LTR suggestions, open reading frames (ORFs) and TE protein hit annotation.
🗞️ TE-Aid is presented in "A beginner’s guide to manual curation of transposable elements" by Clement Goubert, Rory J. Craig, Agustin F. Bilat, Valentina Peona, Aaron A. Vogan & Anna V. Protasio, published in Mobile DNA (2022)
Pipeline overview:
- The TE (ideally, candidate consensus sequence) is searched against the provided reference genome with
blastn
- Fig 1: genomic hits (horizontal lines) are represented relative to the query (TE consensus), the y axis represent the
blastn
divergence - Fig 2: pileup of the genomic hits relative to position along the query (TE consensus)
- Fig 1: genomic hits (horizontal lines) are represented relative to the query (TE consensus), the y axis represent the
- The query is then blasted against itself in order to detect micro repeats and inversions (putative TIRs, LTRs)
- Fig 3: self dot-plot and Fig 4 (top): TIR and LTR are suggested (colored arrows)
- Bonus: a self dot-plot with
emboss dotmatcher
is also produced in an extra file
- Putative ORFs are searched with
emboss getorf
and the peptides queried against a TE protein database (distributed withRepeatMasker
)- Fig 4: ORFs (black rectangles: + orientation; red rectangles: - orientation), TE protein hits
The consensus size, number of fragments (hits) and full length copies (according to user-defined threshold) are automatically printed on the graph.
If any ORFs and protein hits are found, their locations relative to the consensus are printed in the stdout
TE-Aid has been tested on MacOSX (shell, sh, zsh) and Linux (shell, sh) support: click the "issues" tab on github or email me
TE-Aid comes from consensus2genome
that is now deprecated
TE+Aid is a fully open software and is being integrated in a growing number of projects (thank you! ❤️). In order to track project-specific modifications of the base code, I have created specific branches based on the pull requests of developpers. Do not hesitate to check them out!
The main branch may not includes all these modifications, but I am happy to consider any request to modify the main branch. If you think your changes should make it to the main branch but are only available in a parallel branch, please let me know, and when time allows, I'll be happy to review and merge!
- R (Rscript)
- Biostrings
- Rcpp (when using -r option)
- NCBI Blast+ suite
- EMBOSS
getorf
TE-Aid calls NCBI blast and R from the command line with blastn
, blastp
, makeblastdb
and Rscript
commands. All these executables must be accessible in the user path (usually the case following the default install). You can also set up a conda environment specifically for TE-Aid (see below).
If not, you need to locate the executables' location and add them to your local path before using TE-Aid.
For instance:
export PATH="/path/to/blast/bins/folder/:$PATH"`
export PATH="/path/to/R/bins/folder/:$PATH"`
These lines can be added to the user ~/.bashrc
(Linux) or ~/.zshrc
(macOS) to add these programs permanently to $PATH
.
git clone https://github.com/clemgoub/TE-Aid.git
You can set a conda environment for running TE-Aid after you cloned the repository with this command (use mamba instead of conda because it's way faster):
cd TE-Aid
mamba env create -f TE_AID.yml
After that, you'll have all the dependencies ready once you activate the environment:
mamba activate TE_AID
<user-path>/TE-Aid [-q|--query <query.TE.fa>] [-g|--genome <genome.fa>] [options]
Note. replace
<user-path>
with the path of the downloadedTE-Aid
folder.
-q, --query TE consensus (fasta file)
-g, --genome Reference genome (fasta file)
-h, --help show this help message and exit
-o, --output output folder (default "./")
-t, --tables write features coordinates in tables (self dot-plot, ORFs and protein hits coordinates)
-T, --all-Tables same as -t plus write the genomic blastn table.
Warning: can be very large if your TE is highly repetitive!
-r, --remove-redundant remove redundant hits from genomic blastn table and a title of the first plot
-e, --e-value genome blastn: e-value threshold to keep hit (default: 10e-8)
-f, --full-length-threshold genome blastn: min. proportion (hit_size)/(consensus_size) to be considered "full length" (0-1; default: 0.9)
-m, --min-orf getorf: minimum ORF size (in bp)
-R, --no-reverse-orfs getorf: don't use ORFs in ther reverse complement of your sequence
-a, --alpha graphical: transparency value for blastn hit (0-1; default 0.3)
-F, --full-length-alpha graphical: transparency value for full-length blastn hits (0-1; default 1)
-y, --auto-y graphical: manual override for y lims (default: TRUE; otherwise: -y NUM)
-D | --emboss-dotmatcher Produce a dotplot with EMBOSS dotmatcher
In this example we are going to analyze some transposable elements of Drosophila melanogaster. The consensus sequences for this tutorial are located in the Example/
folder, and you will need to download the D. melanogaster reference genome (dm6). Let's go!
curl -o Example/dm6.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz
gunzip Example/dm6.fa.gz
A couple of D. melanogaster TE consensus sequences are present in the folder Examples
Let's start with Jockey, a recent LINE element in the D. melanogaster genome
./TE-Aid -q Example/Jockey_DM.fasta -g Example/dm6.fa -o ../dm6example
Next is Gypsy-2, from the LTR lineage
./TE-Aid -q Example/Gypsy2_DM.fasta -g Example/dm6.fa -o ../dm6example