prot_finder.pl
is a script to search for query protein homologs in annotated bacterial genomes with BLASTP. A companion bash shell script pipeline is available, prot_finder_pipe.sh
.
Included is also the script prot_binary_matrix.pl
to create a binary presence/absence matrix (e.g. for iTOL) from the prot_finder.pl
output. Additionally, two downstream scripts are provided to wrangle these binary presence/absence matrix: transpose_matrix.pl
to transpose a delimited TEXT matrix and binary_group_stats.pl
to get overall presence/absence statistics for groups of columns in a delimited binary TEXT matrix (in the style of po2group_stats.pl
).
- Synopsis
- Description
- Usage
- Options
- Output
- Dependencies
- Run environment
- Author - contact
- Citation, installation, and license
- Acknowledgements
- Changelog
./prot_finder_pipe.sh -q query.faa (-s subject.faa|-f (embl|gbk)) > blast_hits.tsv
or
perl prot_finder.pl -r report.blastp -s subject.faa > blast_hits.tsv
perl prot_binary_matrix.pl blast_hits.tsv > binary_matrix.tsv
or
perl prot_finder.pl -r report.blastp -s subject.faa | perl prot_binary_matrix.pl > binary_matrix.tsv
perl transpose_matrix.pl input_matrix.tsv > input_matrix_transposed.tsv
or
perl prot_binary_matrix.pl blast_hits.tsv | perl transpose_matrix.pl > binary_matrix_transposed.tsv
perl binary_group_stats.pl -i binary_matrix.tsv -g group_file.tsv -p > overall_stats.tsv
The script prot_finder.pl
is intended to search for homologous proteins in annotated bacterial genomes. For this purpose, a previous BLASTP, either legacy or plus, needs to be run with query protein sequences against a BLASTP database of subject proteins (e.g. all proteins from several Escherichia coli genomes).
The script cds_extractor.pl
(with options -p -f) can be used to create multi-FASTA protein files of all non-pseudo CDS from RichSeq genome files to create the needed subject BLASTP database. Present locus tags will be used as FASTA IDs, but see cds_extractor.pl
for a description of the format. Query protein sequences for the BLASTP need a unique FASTA ID.
The BLASTP report file (option -r), the subject protein multi-FASTA file (option -s), and optionally the query protein (multi-)FASTA (option -q) file are then given to prot_finder.pl
. Significant BLASTP subject hits are filtered according to the given cutoffs (options -i, -cov_q, and -cov_s) and the result is printed as an informative tab-separated result table to STDOUT. To apply global identity/coverage cutoffs to subject hits high-scoring pairs (HSPs) are tiled (see http://www.bioperl.org/wiki/HOWTO:Tiling and http://search.cpan.org/dist/BioPerl/Bio/Search/Hit/GenericHit.pm). Additionally, the subject protein sequences with significant query hits are written to result multi-FASTA files, named according to the respective query FASTA IDs (optionally including the query sequence with option -q).
Optionally, Clustal Omega can be called (option -a with optional -p) to create multiple alignments (FASTA format) for each of the resulting multi-FASTA files. These alignments can be used to calculate phylogenies e.g. with RAxML (http://sco.h-its.org/exelixis/software.html) or MEGA (http://www.megasoftware.net/).
Run the script cds_extractor.pl
(with options -p -f) and the BLASTP manually or use the bash shell wrapper script prot_finder_pipe.sh
(see below 'prot_finder_pipe bash script pipeline') to execute the whole pipeline including prot_finder.pl
(with optional option -q). For additional options of the pipeline shell script see below 'prot_finder_pipe.sh options'. Be aware that some options in prot_finder_pipe.sh
corresponding to options in prot_finder.pl
have different names (-c instead of -cov_q, -k instead of -cov_s, and -o instead of -p; also -f has a different meaning). If cds_extractor.pl
is used in the pipeline (option -f of the shell script) the working folder has to contain the annotated bacterial genome subject files (in RichSeq format, e.g. EMBL or GENBANK format). Also, the Perl scripts cds_extractor.pl
(only for prot_finder_pipe.sh
option -f) and prot_finder.pl
have to be either contained in the current working directory or installed in the global PATH. BLASTP (legacy and/or plus) and Clustal Omega binaries have to be installed in global PATH, or for Clustal Omega you can give the path to the binary with option -o. In the pipeline BLASTP is run with disabled query filtering, locally optimal Smith-Waterman alignments, and increasing the number of database sequences to show alignments to 500 for BioPerl parsing (legacy: -F F -s T -b 500, plus: -seg no -use_sw_tback -num_alignments 500). The pipeline script ends with the STDERR message 'Pipeline finished!', if this is not the case have a look at the log files in the result directory for errors.
The resulting tab-separated table with significant BLASTP hits (from prot_finder.pl
or prot_finder_pipe.sh
) can be given to the script prot_binary_matrix.pl
, either as STDIN or as a file, to create a presence/absence matrix of the results. See below 'prot_binary_matrix.pl options' for the prot_binary_matrix.pl
options. By default a tab-delimited binary presence/absence matrix for query hits per subject organism will be printed to STDOUT. Use option -t to count all query hits per subject organism, not just the binary presence/absence. This presence/absence matrix can be given to the script transpose_matrix.pl
, either as STDIN or as a file, to transpose the matrix, i.e. rows will become columns and columns rows. Actually, transpose_matrix.pl
can be used to transpose any delimited TEXT matrix (see below 'transpose_matrix.pl options').
The presence/absence matri(x|ces) can e.g. be loaded into the Interactive Tree Of Life website (iTOL) to associate the data with a phylogenetic tree. iTOL likes individual comma-separated input files, thus use prot_binary_matrix.pl
options -s -c for this purpose. However, the organism names have to have identical names to the leaves of the phylogenetic tree, thus manual adaptation, e.g. in a spreadsheet software (like LibreOffice Calc), might be needed. Careful, subject organisms without a significant BLASTP hit won't be included in the prot_finder.pl
result table and hence can't be included by prot_binary_matrix.pl
. If needed add them manually to the result matri(x|ces).
At last, script binary_group_stats.pl
can be used to categorize columns of the binary presence/absence matrix from prot_binary_matrix.pl
according to group affiliations. binary_group_stats.pl
is based upon po2group_stats.pl
, which does the same thing for genomes in an ortholog/paralog output matrix from a Proteinortho5 calculation. Actually, binary_group_stats.pl
can work with any delimited TEXT binary matrix (option -i). But, all fields of the binary matrix need to be filled with either a 0 indicating absence or a 1 indicating presence, i.e. all rows need to have the same number of columns. Use option -d to set the delimiter of the input matrix, default is set to tab-delimited/separated matrices.
Also, column headers in the first row and row headers in the first column are mandatory for the input binary matrix. Only alphanumeric (a-z, A-Z, 0-9), underscore (_), dash (-), and period (.) characters are allowed for the column headers and group names in the group file (option -g) to avoid downstream problems with the operating/file system. As a consequence, also no whitespaces are allowed in these! Additionally, column headers, row headers, and group names need to be unique.
The group affiliations of the columns are intended to get overall presence/absence statistics for groups of columns and not simply single columns of the matrix. Percentage inclusion (option -cut_i) and exclusion (option -cut_e) cutoffs can be set to define how strict the presence/absence of column groups within a row are defined. Of course groups can also hold only single column headers to get single column statistics. Group affiliations are defined in a mandatory tab-delimited group input file (option -g), including the column headers from the input binary matrix, with minimal two and maximal four groups.
See the README.md of po2group_stats.pl
for an explanation of the logic behind the categorization and the resulting group binary matrix and venn diagram of binary_group_stats.pl
(and of course its options and output).
for file in *.(gbk|embl); do perl cds_extractor.pl -i "$file" -p -f; done
cat *.faa > subject.faa
rm !(subject).faa
formatdb -p T -i subject.faa -n prot_finder_db
blastall -p blastp -d prot_finder_db -i query.faa -o prot_finder.blastp -e 1e-10 -F F -s T -b 500
or
makeblastdb -dbtype prot -in subject.faa -out prot_finder_db
blastp -db prot_finder_db -query query.faa -out prot_finder.blastp -evalue 1e-10 -seg no -use_sw_tback -num_alignments 500
perl prot_finder.pl -r prot_finder.blastp -s subject.faa -cov_s 80 > blast_hits.tsv
or
perl prot_finder.pl -r prot_finder.blastp -s subject.faa -d result_dir -f -q query.faa -i 50 -cov_q 50 -b -a -p ~/bin/clustalo -t 6 > result_dir/blast_hits.tsv
./prot_finder_pipe.sh -q query.faa -s subject.faa > blast_hits.tsv
or
./prot_finder_pipe.sh -q query.faa -f embl -d result_dir -p legacy -e 0 -t 12 -i 50 -c 50 -k 30 -b -a -o ~/bin/clustalo -m > result_dir/blast_hits.tsv
perl prot_binary_matrix.pl -s -d result_dir -t blast_hits.tsv
or
perl prot_finder.pl -r report.blastp -s subject.faa | perl prot_binary_matrix.pl -l -c > binary_matrix.csv
or
mkdir result_dir && ./prot_finder_pipe.sh -q query.faa -s subject.faa -d result_dir -m | tee result_dir/blast_hits.tsv | perl prot_binary_matrix.pl > binary_matrix.tsv
perl transpose_matrix.pl -d ' ' -e NA input_matrix_space-delimit.txt > input_matrix_space-delimit_transposed.txt
or
perl prot_finder.pl -r report.blastp -s subject.faa | perl prot_binary_matrix.pl -l -c | perl transpose_matrix.pl -d , > binary_matrix_transposed.csv
or
mkdir result_dir && ./prot_finder_pipe.sh -q query.faa -s subject.faa -d result_dir -m | tee result_dir/blast_hits.tsv | perl prot_binary_matrix.pl | tee result_dir/binary_matrix.tsv | perl transpose_matrix.pl > result_dir/binary_matrix_transposed.tsv
Transpose all matrices in a folder:
for matrix in *.tsv; do perl transpose_matrix.pl "$matrix" > "${matrix%.*}_transposed.tsv"; done
perl binary_group_stats.pl -i binary_matrix_transposed.csv -g group_file.tsv -d , -r result_dir -cut_i 0.7 -cut_e 0.2 -b -p -co -s -u -a > overall_stats.tsv
-
-r=str, -report=str
Path to BLASTP report/output
-
-s=str, -subject=str
Path to subject multi-FASTA protein sequence file (*.faa) created with
cds_extractor.pl
(and its options -p -f), which was used to create the BLASTP database
-
-h, -help
Help (perldoc POD)
-
-d=str, -dir_result=str
Path to result folder [default = query identity and coverage cutoffs, './results_i#_cq#']
-
-f, -force_dir
Force output to an existing result folder, otherwise ask user to remove content of existing folder. Careful, files from a previous analysis might not be overwritten if different to current analysis.
-
-q=str, -query=str
Path to query (multi-)FASTA protein sequence file (*.faa) with unique FASTA IDs, which was used as query in the BLASTP. Will include each query protein sequence in the respective multi-FASTA 'query-ID_hits.faa' result file.
-
-b, -best_hit
Give only the best hit (i.e. highest identity) for each subject sequence if a subject has several hits with different queries
-
-i=int, -ident_cutoff=int
Query identity cutoff for significant hits (not including gaps), has to be an integer number >= 0 and <= 100 [default = 70]
-
-cov_q=int, -cov_query_cutoff=int
Query coverage cutoff, has to be an integer number >= 0 and <= 100 [default = 70]
-
-cov_s=int, -cov_subject_cutoff=int
Subject/hit coverage cutoff, has to be an integer >= 0 and <= 100 [default = 0]
-
-a, -align_clustalo
Call Clustal Omega for multiple alignment of each 'query-ID_hits.faa' result file
-
-p=str, -path_clustalo=str
Path to executable Clustal Omega binary if not present in global PATH variable; requires option -a
-
-t=int, -threads_clustalo=int
Number of threads for Clustal Omega to use; requires option -a [default = all processors on system]
-
-v, -version
Print version number to STDERR
-
-q=str
Path to query protein (multi-)FASTA file (*.faa) with unique FASTA IDs
-
-f=str
File extension for files in the current working directory to use for
cds_extractor.pl
(e.g. 'embl' or 'gbk'); excludes shell script option -s
or
-
-s=str
Path to subject protein multi-FASTA file (*.faa) already created with
cds_extractor.pl
(and its options -p -f), will not runcds_extractor.pl
; excludes shell script option -f
-
-h
Print usage
-
-d=str
Path to result folder [default = query identity and coverage cutoffs,'./results_i#_cq#']
-
-p=(legacy|plus)
BLASTP suite to use [default = plus]
-
-e=real
E-value for BLASTP [default = 1e-10]
-
-t=int
Number of threads to be used for BLASTP and Clustal Omega [default = all processors on system]
-
-i=int
Query identity cutoff for significant hits [default = 70]
-
-c=int
Query coverage cutoff (corresponds to
prot_finder.pl
option -cov_q) [default = 70] -
-k=int
Subject coverage cutoff (corresponds to
prot_finder.pl
option -cov_s) [default = 0] -
-b
Give only the best hit (i.e. highest identity) for each subject sequence
-
-a
Multiple alignment of each multi-FASTA result file with Clustal Omega
-
-o=str
Path to executable Clustal Omega binary if not in global PATH; requires shell script option -a (corresponds to
prot_finder.pl
option -p) -
-m
Clean up all non-essential files, see below '
prot_finder_pipe.sh
output'
-
-h, -help
Help (perldoc POD)
-
-s, -separate
Separate presence/absence files for each query protein printed to the result directory [default without -s = STDOUT matrix for all query proteins combined]
-
-d=str, -dir_result=str
Path to result folder, requires option -s [default = './binary_matrix_results']
-
-t, -total
Count total occurrences of query proteins, not just presence/absence binary
-
-c, -csv
Output matri(x|ces) in comma-separated format (*.csv) instead of tab-delimited format (*.tsv)
-
-l, -locus_tag
Use the locus_tag prefixes in the subject_ID column of the
prot_finder.pl
output (instead of the subject_organism column) as organism IDs to associate query hits to organisms. The subject_ID column will include locus_tags if they're annotated for a genome (see thecds_extractor.pl
format description). Useful if thecds_extractor.pl
output doesn't include strain names for 'o=' in the FASTA IDs, because the prefix of a locus_tag should be unique for a genome (see http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation). -
-v, -version
Print version number to STDERR
-
-h, -help
Help (perldoc POD)
-
-d=str, -delimiter=str
Set delimiter of input and output matrix (e.g. comma ',', single space ' ' etc.) [default = tab-delimited/separated]
-
-e=str, -empty=str
Fill empty cells of the input matrix with a value in the transposed matrix (e.g. 'NA', '0' etc.)
-
-v, -version
Print version number to STDERR
-
-i=str, -input=str
Input delimited TEXT binary matrix (e.g. *.tsv, *.csv, or *.txt), see option -d
-
-g=str, -groups_file=str
Tab-delimited file with group affiliation for the columns from the input binary matrix with minimal two and maximal four groups (easiest to create in a spreadsheet software and save in tab-separated format). All column headers from the input binary matrix need to be included. Column headers and group names can only include alphanumeric (a-z, A-Z, 0-9), underscore (_), dash (-), and period (.) characters (no whitespaces allowed either). Example format with two column headers in group A, three in group B and D, and one in group C:
group_A group_B group_C group_D
column_header1 column_header9 column_header3 column_header8
column_header7 column_header6 column_header5
column_header4 column_header2
-
-h, -help
Help (perldoc POD)
-
-d=str, -delimiter=str
Set delimiter of input binary matrix (e.g. comma ',', single space ' ' etc.) [default = tab-delimited/separated]
-
-r=str, -result_dir=str
Path to result folder [default = inclusion and exclusion percentage cutoff, './results_i#_e#']
-
-cut_i=float, -cut_inclusion=float
Percentage inclusion cutoff for column presence counts in a group per row, has to be > 0 and <= 1. Cutoff will be rounded according to the column header number in each group and has to be > the rounded exclusion cutoff in this group. [default = 0.9]
-
-cut_e=float, -cut_exclusion=float
Percentage exclusion cutoff, has to be >= 0 and < 1. Rounded cutoff has to be < rounded inclusion cutoff. [default = 0.1]
-
-b, -binary_group_matrix
Print a group binary matrix with the presence/absence column group results according to the cutoffs (excluding 'unspecific' category rows)
-
-p, -plot_venn
Plot venn diagram from the group binary matrix (except 'unspecific' and 'underrepresented' categories) with function
venn
from R package gplots, requires option -b -
-co, -core_strict
Include 'strict core' category in output for rows where all columns have a '1'
-
-s, -singletons
Include singleton/column-specific rows for each column header in the output, activates also overall column header presence ('1') counts in final stats matrix for columns with singletons
-
-u, -unspecific
Include 'unspecific' category in output
-
-a, -all_column_presence_overall
Report overall presence counts for all column headers (appended to the final stats matrix), also those without singletons; will include all overall column header presence counts without option -s
-
-v, -version
Print version number to STDERR
-
*.faa
Multi-FASTA file(s) of subject CDS protein sequences; will be removed with
prot_finder_pipe.sh
option -m
For optional error files from cds_extractor.pl
see its documentation.
-
STDOUT
The resulting tab-delimited output table with the significant subject BLASTP hits is printed to STDOUT. Redirect (e.g. to a file in the result directory, options -d -f) or pipe into another tool as needed (e.g.
prot_binary_matrix.pl
). -
./results_i#_cq#
All output files are stored in a result folder
-
./results_i#_cq#/query-ID_hits.faa
Multi-FASTA protein files of significant subject hits for each query protein (named after the respective query FASTA ID), optionally includes the respective query protein sequence (with option -q)
-
subject.faa.idx
Index file of the subject protein file for fast sequence retrieval (can be deleted if no further BLASTPs are needed with these subject sequences); will be removed with
prot_finder_pipe.sh
option -m -
(./results_i#_cq#/queries_no_blastp-hits.txt)
Lists all query sequence IDs without significant subject hits; with option -b includes also queries with significant hits but without a best blast hit for a subject
-
(./results_i#_cq#/clustal_omega.log)
Optional log file of verbose Clustal Omega STDOUT/STDERR messages; will be removed with
prot_finder_pipe.sh
option -m -
(./results_i#_cq#/query-ID_aln.fasta)
Optional Clustal Omega multiple alignment of each 'query-ID_hits.faa' result file in FASTA alignment format
-
(./results_i#_cq#/query-ID_tree.nwk)
Optional, Clustal Omega NJ-guide tree in Newick format
In addition to the cds_extractor.pl
output and the prot_finder.pl
output the pipeline also creates the following non-essential output files, which will be removed with prot_finder_pipe.sh
option -m:
-
./results_i#_cq#/cds_extractor.log
Log file of
cds_extractor.pl
STDOUT/STDERR messages (withprot_finder_pipe.sh
option -f) -
./results_i#_cq#/prot_finder.faa
Concatenated
cds_extractor.pl
output files to create the subject BLASTP database -
prot_finder_db.phr, prot_finder_db.pin, prot_finder_db.psq
BLASTP database files from the concatenated subject sequences ('prot_finder.faa')
-
formatdb.log, error.log or ./results_i#_cq#/makeblastdb.log
Legacy BLASTP or BLASTP+ log files
-
./results_i#_cq#/prot_finder.blastp
BLASTP report/output
-
./results_i#_cq#/prot_finder.log
Log file of
prot_finder.pl
STDOUT/STDERR messages
-
STDOUT
The resulting presence/absence matrix is printed to STDOUT without option -s. Redirect or pipe into another tool as needed.
-
(./binary_matrix_results)
Separate query presence/absence files are stored in a result folder with option -s
-
(./binary_matrix_results/query-ID_binary_matrix.(tsv|csv))
Separate query presence/absence files with option -s
-
STDOUT
The transposed matrix is printed to STDOUT. Redirect or pipe into another tool as needed.
-
STDOUT
The tab-delimited final stats matrix is printed to STDOUT. Redirect or pipe into another tool as needed.
-
./results_i#_e#
All output files are stored in a results folder
-
./results_i#_e#/[*_specific|*_absent|cutoff_core|underrepresented]_rows.txt
Files including the row headers for rows in non-optional categories
-
(./results_i#_e#/[*_singletons|strict_core|unspecific]_rows.txt)
Optional category output files with the respective row headers
-
(./results_i#_e#/binary_matrix.tsv)
Tab-delimited binary matrix of group presence/absence results according to cutoffs (excluding 'unspecific' rows)
-
(./results_i#_e#/venn_diagram.pdf)
Venn diagram for non-optional categories (except 'unspecific' and 'underrepresented' categories)
- BioPerl (tested version 1.006923)
cds_extractor
(tested version 0.7.1)- BLASTP
- legacy BLASTP (tested version 2.2.26)
- BLASTP+ (tested version 2.2.28+)
- Clustal Omega (tested version 1.2.1)
-
Statistical computing language R
Rscript
is needed to plot the venn diagram with option -p, tested with version 3.2.2 -
gplots (https://cran.r-project.org/web/packages/gplots/index.html)
Package needed for R to plot the venn diagram, includes function
venn
. Tested with gplots version 2.17.0.
The scripts run under UNIX flavors.
Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
For citation, installation, and license information please see the repository main README.md.
The Perl implementation for transposing a matrix on Stack Overflow
was very useful for transpose_matrix.pl
: https://stackoverflow.com/questions/1729824/transpose-a-file-in-bash
- v0.7.1 (05.04.2016)
- bug fix: significant but non-best blast hits with option -b now listed in queries_no_blastp-hits.txt
- v0.7 (23.11.2015)
- changed script name from
blast_prot_finder.pl
toprot_finder.pl
- fixed bug introduced in v0.6 with a
seek
, because option -query didn't pull query sequences anymore - included version switch
- included 'use autodie;' pragma
- included
pod2usage
with Pod::Usage andpod2usage
-die for Getopt::Long call - tab-delimited output table with filtered BLASTP hits now printed to STDOUT instead of file and can be used as STDIN for
prot_binary_matrix.pl
- result files now created in result directory to unclutter working dir (new options -dir_result and -force_dir) and replaced subroutine 'file_exist' with 'empty_dir' (also no STDOUT message which files were created anymore)
- new output file queries_no_blastp-hits.txt to list all queries without BLASTP hits
- new output file clustal_omega.log to log Clustal Omega verbose output
- additional new options, -path_clustalo and -threads_clustalo
- major code changes/restructuring and additions:
- POD syntax changes and additions for new code
- Perl syntax changes and removing code redundancies
- including script run status messages
- changes to subroutine 'split_fasta_header' to work more robustly with
split
instead of a regex and adapted tocds_extractor.pl
v0.7+ output format
- included several additional option and input checks
- replaced simple
blast_prot_finder_legacy.sh
with more elaborate bash script pipelineprot_finder_pipe.sh
- changed some output column names and moved the subject_ID column before the subject_gene column
- changed script name from
- v0.6 (10.06.2013)
- included BioPerls 'frac*' methods, which include hsp tiling, to correct bug in query coverage and identity calculations for whole hits
- corrected bug to use whole hit e-value and not just first hsp e-value of a significant hit
- option -cov_subject for subject/hit coverage cutoff (not only query coverage cutoff)
- more info in POD
- v0.5 (14.05.2013)
- optionally include query proteins in result subject multi-FASTA files
- new option -best_hit to include only best BLASTP hit (highest identity) for each subject locus_tag, if a subject protein has hits to several query proteins
- print queries with no subject BLASTP hits to STDOUT
- v0.4 (24.01.2013)
- corrected bug in hash structure to store BLASTP hits, query acc/IDs (keys) and array reference of subject locus_tags (values), as several queries can have the same subject locus_tag as hit
- include 'subject protein_function' in 'blast_hits.txt' output
- v0.3 (12.09.2012)
- original script name
blast_prot_finder.pl
- BLASTP hits are stored in a hash with subject locus_tags (keys) and query accessions/IDs (values)
- original script name
- v0.6 (23.11.2015)
- adapted to
prot_finder.pl
v0.7 output - included a POD,
pod2usage
with Pod::Usage andpod2usage
-die for Getopt::Long call - Perl syntax changes and some simpler loop structures
- Removed option -input and instead accept
prot_finder.pl
output as STDIN or file as argument - Removed option -c|-collective (actually repurposed, see below), option -separate to indicate separate output is enough
- Without option -separate result matrix is now printed to STDOUT
- With option -separate result matrices now printed to result directory (name optionally given with new option -dir_result)
- New option -locus_tag to use the locus_tag prefix in column subject_ID of the
prot_finder.pl
output as ID, which is in most cases the locus_tag, instead of the subject_organism; controls if the locus_tags follow NCBI standards - Output now default tab-separated format (tsv), optionally with new option -csv comma-separated (csv)
- Removed subroutine 'result' and statement which result files have been created
- adapted to
- v0.5 (05.03.2014)
- option -total to count all occurences of query proteins within one organism (paralogs), instead of just binary presence/absence
- enforce mandatory options
- changed script name to 'prot_binary_matrix.pl'
- version switch
- included 'use autodie'
- options with Getopt::Long
- changed usage to HERE document
- v0.4 (29.04.2013)
- adapted to new prot_finders 'blast_hits.txt' layout, which includes an additional column for 'subject protein_function'
- changed script name to 'blast_binary_matrix.pl'
- v0.3 (15.02.2013)
- status of created result files in STDOUT for option -s
- v0.2 (21.12.2012)
- prot_finder output file 'blast_hits.txt' doesn't have to be ordered by query protein accessions/IDs anymore
- v0.1 (25.10.2012)
- original script name 'blastp_iTOL_binary.pl'
- v0.1 (12.04.2016)
- v0.1 (06.06.2016)