Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
genomes_feature_table.pl	genomes_feature_table.pl

genomes_feature_table

genomes_feature_table.pl is a script to create a feature table for genomes in EMBL and GENBANK format.

Synopsis
Description
Usage
Options
Output
Run environment
Dependencies
Author - contact
Citation, installation, and license
Changelog

Synopsis

perl genomes_feature_table.pl path/to/genome_dir > feature_table.tsv

Description

A genome feature table lists basic stats/info (e.g. genome size, GC content, coding percentage, accession number(s)) and the numbers of annotated primary features (e.g. CDS, genes, RNAs) of genomes. It can be used to have an overview of these features in different genomes, e.g. in comparative genomics publications.

genomes_feature_table.pl is designed to extract (or calculate) these basic stats and all annotated primary features from RichSeq files (EMBL or GENBANK format) in a specified directory (with the correct file extension, see option -e). The default directory is the current working directory. The primary features are counted and the results for each genome printed in tab-separated format. It is a requirement that each file contains only one genome (complete or draft, with or without plasmids).

The most important features will be listed first, like genome description, genome size, GC content, coding percentage (calculated based on non-pseudo CDS annotation), CDS and gene numbers, accession number(s) (first..last in the sequence file), RNAs (rRNA, tRNA, tmRNA, ncRNA), and unresolved bases (IUPAC code 'N'). If plasmids are annotated in a sequence file, the number of plasmids are counted and listed as well (needs a /plasmid="plasmid_name" tag in the source primary tag, see e.g. Genbank accession number CP009167). Use option -p to list plasmids as separate entries (lines) in the feature table.

For draft genomes the number of contigs/scaffolds are counted. All contigs/scaffolds of draft genomes should be marked with the WGS keyword (see e.g. draft NCBI Genbank entry JSAY00000000). If this is not the case for your file(s) you can add those keywords to each sequence entry with the following Perl one-liners (will edit files in place). For files in GENBANK format if 'KEYWORDS .' is present

perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file

or if 'KEYWORDS' isn't present at all

perl -i -ne 'if(/^ACCESSION/){ print; print "KEYWORDS    WGS.\n";} else{ print;}' file

For files in EMBL format if 'KW .' is present

perl -i -pe 's/^KW(\s+)\./KW$1WGS\./' file

or if 'KW' isn't present at all

perl -i -ne 'if(/^DE/){ $dw=1; print;} elsif(/^XX/ && $dw){ print; $dw=0; print "KW   WGS.\n";} else{ print;}' file

Usage

perl genomes_feature_table.pl -p -e gb,gbk > feature_table_plasmids.tsv

perl genomes_feature_table.pl path/to/genome_dir/ -e gbf -e embl > feature_table.tsv

Options

-h, -help

Help (perldoc POD)
-e, -extensions

File extensions to include in the analysis (EMBL or GENBANK format), either comma-separated list or multiple occurences of the option [default = ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank]
-p, -plasmids

Optionally list plasmids as extra entries in the feature table, if they are annotated with a /plasmid="plasmid_name" tag in the source primary tag
-v, -version

Print version number to STDERR

Output

STDOUT

The resulting feature table is printed to STDOUT. Redirect or pipe into another tool as needed (e.g. cut, grep, or head).

Run environment

The Perl script runs under Windows and UNIX flavors.

Dependencies

BioPerl (tested version 1.006923)

Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

Citation, installation, and license

For citation, installation, and license information please see the repository main README.md.

Changelog

v0.5 (14.09.2015)
- changed script name to genomes_feature_table.pl
- included a POD
- options with Getopt::Long
- included pod2usage with Pod::Usage
- major code overhaul with restructuring (removing code redundancy, print out without temp file etc.) and Perl syntax changes
- changed input options to get folder path from STDIN
- as a consequence new option -e|-extensions
- accession numbers not essential anymore, changed hash key to filename; but requires now only one genome per file
- draft genomes should include 'WGS' keyword (warning if not)
- option -p|-plasmids works now correctly with complete and draft genomes
- count plasmids without option -p
v0.4 (11.08.2013)
- included 'use autodie;' pragma
- included version switch
v0.3 (05.11.2012)
- new option p to report plasmid features in multi-sequence draft files separately
v0.2 (19.09.2012)
v0.1 (25.11.2011)
- original script name: get_genome_features.pl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

genomes_feature_table

genomes_feature_table

README.md

genomes_feature_table

Synopsis

Description

Usage

Options

Output

Run environment

Dependencies

Author - contact

Citation, installation, and license

Changelog

Files

genomes_feature_table

Directory actions

More options

Directory actions

More options

Latest commit

History

genomes_feature_table

Folders and files

parent directory

README.md

genomes_feature_table

Synopsis

Description

Usage

Options

Output

Run environment

Dependencies

Author - contact

Citation, installation, and license

Changelog