genomes_feature_table.pl is a script to create a feature table for genomes in EMBL and GENBANK format.
- Synopsis
- Description
- Usage
- Options
- Output
- Run environment
- Dependencies
- Author - contact
- Citation, installation, and license
- Changelog
perl genomes_feature_table.pl path/to/genome_dir > feature_table.tsv
A genome feature table lists basic stats/info (e.g. genome size, GC content, coding percentage, accession number(s)) and the numbers of annotated primary features (e.g. CDS, genes, RNAs) of genomes. It can be used to get an overview of these features across different genomes, e.g. in comparative genomics publications.
genomes_feature_table.pl is designed to extract (or calculate)
these basic stats and all annotated primary features from RichSeq
files (EMBL or GENBANK format) in a specified directory (with the
correct file extension, see option -e). The default directory
is the current working directory. The primary features are
counted and the results for each genome printed in tab-separated
format. It is a requirement that each file contains only one
genome (complete or draft, with or without plasmids).
The most important features are listed first: genome description, genome size, GC content, coding percentage (calculated based on non-pseudo CDS annotation), CDS and gene numbers, accession number(s) (first..last in the sequence file), RNAs (rRNA, tRNA, tmRNA, ncRNA), and unresolved bases (IUPAC code 'N'). If plasmids are annotated in a sequence file, the number of plasmids is counted and listed as well (this needs a /plasmid="plasmid_name" tag in the source primary tag, see e.g. GenBank accession number CP009167). Use option -p to list plasmids as separate entries (lines) in the feature table.
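As a rough illustration of what a GC content value means, the following sketch computes the percentage of G and C bases in a plain sequence string. The helper name gc_percent is hypothetical and not part of genomes_feature_table.pl, and counting unresolved 'N' bases in the denominator is an assumption; the script itself may calculate the value differently.

```shell
# Hypothetical helper, not part of genomes_feature_table.pl.
# Prints the GC percentage of a sequence string with two decimals.
# Assumption: every character (including 'N') counts towards the total length.
gc_percent() {
  printf '%s\n' "$1" | awk '{
    total = length($0)          # sequence length before any substitution
    gc = gsub(/[GCgc]/, "")     # gsub returns the number of G/C characters replaced
    if (total > 0) printf "%.2f\n", 100 * gc / total
  }'
}

# Example call:
#   gc_percent "ATGCGC"
```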
For draft genomes the number of contigs/scaffolds is counted. All contigs/scaffolds of draft genomes should be marked with the WGS keyword (see e.g. draft NCBI GenBank entry JSAY00000000). If this is not the case for your file(s), you can add the keyword to each sequence entry with the following Perl one-liners (they edit the files in place). For files in GENBANK format if 'KEYWORDS .' is present
perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file
or if 'KEYWORDS' isn't present at all
perl -i -ne 'if(/^ACCESSION/){ print; print "KEYWORDS WGS.\n";} else{ print;}' file
For files in EMBL format if 'KW .' is present
perl -i -pe 's/^KW(\s+)\./KW$1WGS\./' file
or if 'KW' isn't present at all
perl -i -ne 'if(/^DE/){ $dw=1; print;} elsif(/^XX/ && $dw){ print; $dw=0; print "KW WGS.\n";} else{ print;}' file
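Before editing files in place, it can help to check how many sequence entries in a file lack the WGS keyword. The following is a minimal sketch with a hypothetical helper, count_unmarked, which is not part of genomes_feature_table.pl. It assumes that entries start with a LOCUS (GENBANK) or ID (EMBL) line and that WGS, if present, appears once per entry on a KEYWORDS or KW line; keywords wrapped onto continuation lines would be missed.

```shell
# Hypothetical helper, not part of genomes_feature_table.pl.
# Prints the number of sequence entries in a file without a WGS keyword line.
# Assumptions: one LOCUS/ID line per entry, and WGS sits on the
# KEYWORDS/KW line itself (not on a wrapped continuation line).
count_unmarked() {
  awk '
    /^LOCUS[[:space:]]/ || /^ID[[:space:]]/ { records++ }
    (/^KEYWORDS[[:space:]]/ || /^KW[[:space:]]/) && /WGS/ { marked++ }
    END { print records - marked }
  ' "$1"
}

# Example: report the count for every .gbk file in a directory
#   for f in path/to/genome_dir/*.gbk; do
#     printf '%s\t%s\n' "$f" "$(count_unmarked "$f")"
#   done
```

A result of 0 suggests that every entry is already marked; any other value is the number of entries the one-liners above would still need to fix.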
perl genomes_feature_table.pl -p -e gb,gbk > feature_table_plasmids.tsv
perl genomes_feature_table.pl path/to/genome_dir/ -e gbf -e embl > feature_table.tsv
- -h, -help: Help (perldoc POD)
- -e, -extensions: File extensions to include in the analysis (EMBL or GENBANK format), either as a comma-separated list or as multiple occurrences of the option [default = ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank]
- -p, -plasmids: Optionally list plasmids as extra entries in the feature table, if they are annotated with a /plasmid="plasmid_name" tag in the source primary tag
- -v, -version: Print version number to STDERR
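To preview which files in a directory carry one of the extensions given to option -e, a sketch like the following can be used. The function preview_inputs is hypothetical and only mimics the selection by file extension in POSIX sh or bash; it assumes a case-sensitive match on the final extension, which may differ from what the script does internally.

```shell
# Hypothetical helper, not part of genomes_feature_table.pl.
# Lists files in a directory whose extension is in a comma-separated list.
# Defaults mirror the documented defaults of option -e.
# Assumption: case-sensitive match on the last extension only.
preview_inputs() {
  dir=${1:-.}
  exts=${2:-ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank}
  old_ifs=$IFS
  IFS=,                               # split the extension list on commas
  for ext in $exts; do
    for f in "$dir"/*."$ext"; do
      [ -e "$f" ] && printf '%s\n' "$f"   # skip patterns that matched nothing
    done
  done
  IFS=$old_ifs
}

# Example calls:
#   preview_inputs path/to/genome_dir
#   preview_inputs path/to/genome_dir gb,gbk
```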
STDOUT
The resulting feature table is printed to STDOUT. Redirect or pipe into another tool as needed (e.g. cut, grep, or head).
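When selecting columns from the table with cut, it helps to know the column numbers first. The sketch below numbers the fields of the first line of a tab-separated file. The helper list_columns is hypothetical, and it assumes that the first line of the feature table is a header line; no particular column names are assumed.

```shell
# Hypothetical helper, not part of genomes_feature_table.pl.
# Prints "index: name" for each tab-separated field of the first line.
# Assumption: the first line of the table is a header line.
list_columns() {
  awk -F '\t' 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i; exit }' "$1"
}

# Example: inspect the columns, then keep only the first and third
#   list_columns feature_table.tsv
#   cut -f 1,3 feature_table.tsv | head
```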
The Perl script runs under Windows and UNIX flavors.
- BioPerl (tested version 1.006923)
Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
For citation, installation, and license information please see the repository main README.md.
- v0.5 (14.09.2015)
- changed script name to genomes_feature_table.pl
- included a POD
- options with Getopt::Long
- included pod2usage with Pod::Usage
- major code overhaul with restructuring (removing code redundancy, print out without temp file etc.) and Perl syntax changes
- changed input options to get folder path from STDIN
- as a consequence new option -e|-extensions
- accession numbers not essential anymore, changed hash key to filename; but requires now only one genome per file
- draft genomes should include 'WGS' keyword (warning if not)
- option -p|-plasmids works now correctly with complete and draft genomes
- count plasmids without option -p
- v0.4 (11.08.2013)
- included 'use autodie;' pragma
- included version switch
- v0.3 (05.11.2012)
- new option p to report plasmid features in multi-sequence draft files separately
- v0.2 (19.09.2012)
- v0.1 (25.11.2011)
- original script name: get_genome_features.pl