GitHub - rdpstaff/classifier: RDP extensible sequence classifier for fungal lsu, bacterial and archaeal 16s

rdpstaff / classifier Public

Notifications You must be signed in to change notification settings
Fork 32
Star 56

RDP extensible sequence classifier for fungal lsu, bacterial and archaeal 16s

56 stars 32 forks Branches Tags Activity

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
nbproject		nbproject
samplefiles		samplefiles
src/edu/msu/cme/rdp		src/edu/msu/cme/rdp
test		test
.gitignore		.gitignore
LICENSE		LICENSE
README		README
build.xml		build.xml
ivy.xml		ivy.xml
ivysettings.xml		ivysettings.xml
manifest.mf		manifest.mf

Repository files navigation

INTRO

The RDP Classifier is a naive Bayesian classifier which was developed to provide rapid taxonomic placement based on rRNA sequence data. The RDP Classifier can rapidly and accurately classify bacterial and archaeal 16s rRNA sequences, and Fungal LSU sequences. It provides taxonomic assignments from domain to genus, with confidence estimates for each assignment. The RDP Classifier likely can be adapted to additional phylogenetically coherent bacterial taxonomies. The online version of RDP Classifier can be found at http://rdp.cme.msu.edu/classifier/classifier.jsp.

How to cite Classifier? Wang, Q, G. M. Garrity, J. M. Tiedje, and J. R. Cole. 2007.
Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol. 73(16):5261-7.

Gene Copy Number Adjustment
Classifier provides gene copy number adjustment for 16S gene sequences (see http://rdp.cme.msu.edu/classifier/class_help.jsp#copynumber).
The precompiled Classifier was trained with the 16S gene copy number data provided by rrnDB website. The Classifier can be trained with user-provided gene copy number data. See How to Train the Classifier below. Classifier outputs both copy number adjusted and unadjusted assignment counts in the hierarchical output files.

BIOM Format

The Classifier can take an input minimal (or rich) dense BIOM file as input with an optional Metadata file, and produces a rich dense BIOM file. If an input cluster BIOM file ( version 1.0) is provided, along with the representative sequences input (output from Clustering rep-seqs subcommand with "-c" option to make sure rep seq Ids match the cluster Ids in the BIOM file, see http://rdp.cme.msu.edu/tutorials/stats/RDPtutorial_statistics.html), the classification result of each sequence will replace the taxonomy of the corresponding cluster. If a metadata file is provided, the information will replace the metadata of the corresponding sample. The resulting rich dense BIOM file can be used by thirdparty tools such as phyloseq or QIIME etc.

In order to use BIOM files as input, the format must be specified in the command line with "-f biom". Then, the biom file is specified with the command "-m /path/to/biom_file.biom". Including a Metadata file is optional and can be included by using the command "-d /path/to/metadata.txt".

QUICKSTART

ant jar
Some commands in this tutorial depend on RDP Clustering. See RDPTools (https://github.com/rdpstaff/RDPTools) to install.

USAGE

There are many subcommands offered by the Classifier package. The default subcommand is classify.

java -Xmx1g -jar /path/to/classifier.jar
USAGE: ClassifierMain <subcommand> <subcommand args ...>
classify - classify one or multiple samples
crossvalidate - cross validate accuracy testing
libcompare - compare two samples
loot - leave one (sequence or taxon) out accuracy testing
merge-detail - merge classification detail result files to create a taxon assignment counts file
merge-count - merge multiple taxon assignment count files to into one count file
random-sample - random select a subset or subregion of sequences
rm-dupseq - remove identical or any sequence contained by another sequence
rm-partialseq - remove partial sequences
taxa-sim - calculate and plot the similarities within taxa
train - retrain classifier

1. Classify one or more samples
usage: [options] <samplefile>[,idmappingfile] ...
-c,--conf <arg> assignment confidence cutoff used to determine the
assignment count for each taxon. Range [0-1],
Default is 0.8.
-f,--format <arg> tab-delimited output format:
[allrank|fixrank|biom|filterbyconf|db]. Default is
allRank.
allrank: outputs the results for all ranks applied
for each sequence: seqname, orientation, taxon
name, rank, conf, ...
fixrank: only outputs the results for fixed ranks
in order: domain, phylum, class, order, family,
genus
biom: outputs rich dense biom format if OTU or
metadata provided
filterbyconf: only outputs the results for major
ranks as in fixrank, results below the confidence
cutoff were bin to a higher rank unclassified_node
db: outputs the seqname, trainset_no, tax_id, conf.
-g,--gene <arg> 16srrna, fungallsu, fungalits_warcup,
fungalits_unite. Default is 16srrna. This option
can be overwritten by -t option
-h,--hier_outfile <arg> tab-delimited output file containing the assignment
count for each taxon in the hierarchical format.
Default is null.
-o,--outputFile <arg> tab-delimited text output file for classification
assignment.
-q,--queryFile legacy option, no longer needed
-t,--train_propfile <arg> property file containing the mapping of the
training files if not using the default. Note: the
training files and the property file should be in
the same directory.
-w,--minWords <arg> minimum number of words for each bootstrap trial.
Default(maximum) is 1/8 of the words of each
sequence. Minimum is 5

[Example command to classify sequences ]:
java -Xmx1g -jar /path/to/classifier.jar classify -c 0.5 -o usga_classified.txt -h soil_hier.txt samplefiles/USGA_2_4_B_trimmed.fasta

To speedup classification when large number of duplicate sequences exist in the inputs, you can dereplicate the input files first and use both the unique sequence fasta and idmapping file as input. The classification assignment output only contains the results of the unique sequences, the assignment counts in the hier_out_file are expanded to reflect the full sets. The hier_outfile can be imported to Excel to make plots, or loaded into R program as a data matrix.

[Example command to classify sequences with idmapping]:
java -jar /path/to/Clustering.jar derep -u -o Native_1_4_A_derep.fasta Native_1_4_A.ids Native_1_4_A.sample samplefiles/Native_1_4_A_trimmed.fasta
java -Xmx1g -jar /path/to/classifier.jar classify -c 0.5 -o native_classified.txt -h soil_hier.txt Native_1_4_A_derep.fasta,Native_1_4_A.ids
java -Xmx1g -jar /path/to/classifier.jar classify -c 0.5 -f fixrank -o soil_classified.txt -h soil_hier.txt Native_1_4_A_derep.fasta,Native_1_4_A.ids samplefiles/USGA_2_4_B_trimmed.fasta

Notes:
The bootstrap assignment strategy has been changed to avoid over-predication problem when multiple genera are tied for highest score occurred during bootstrap trials. This happens when every sequence in multiple genera (say N) contains the same partial sequence. One of the genera will be randomly chosen from the list of N genera with the highest tie score. If the tie score occurred during the genus assignment deterministic step, the first genus will be chosen. In this way, the genus assignment will remain deterministic but the bootstrap score will be close to 1/N .

By default, the Classifier output the results for all ranks applied for each sequence. Some users found the format "fixrank" useful to load into third party analysis tools. When "fixrank" is specified, the Classifier outputs the results in a fixed rank order as described above. In case of missing ranks in the lineage, the bootstrap value and the taxon name from the immediate lower rank will be reported. This eliminates the gaps in the lineage, but also introduces non-existing taxon name and rank. User should interpret the "fixrank" results with caution.

By default the Classifier chooses a subset of 1/8 of all the possible overlapping words from the query sequence for each bootstrap trial. The Classifier uses the minWords if the minWords is larger than 1/8 of words. Choosing more words helps gaining higher bootstrap values for short query sequence. Using larger "minWords" will increase the run time since the run time is proportional to the number and the length of the query sequences.

2. Compare two samples
This command combines classification with a statistical test to flag taxa differing significantly between libraries.

[Example command from a terminal]:
java -Xmx1g -jar /path/to/classifier.jar libcompare -q1 samplefiles/Native_1_4_A_trimmed.fasta -q2 samplefiles/USGA_2_4_B_trimmed.fasta -c 0.5 -o libcompare.txt

3. Merge classification results
If you have classified samples at different time using the same training set, you can use this command to merge the classification results and reproduce the hier_outfile with the assignment counts from all the samples from the input. Each input classification result is treated as one sample. Note taxon and rank filter options only affect the assignment output, not the hier_outfile.

[Example command to merge two classification results]:
java -Xmx1g -jar /path/to/classifier.jar merge-detail -h soil_hier.txt -o merged_classified.txt native_classified.txt,Native_1_4_A.ids usga_classified.txt

[Example command to merge two classification results, filter the classification output by confidence ]:
java -Xmx1g -jar /path/to/classifier.jar merge-detail -h soil_hier.txt -o merged_classified.txt -f filterbyconf -c 0.5 native_classified.txt,Native_1_4_A.ids usga_classified.txt

[Example command to merge two classification results, only output classification results assigned to Alphaproteobacteria and confidence at family >= 0.5 ]:
create a file taxonFilter.txt containing Alphaproteobacteria.
java -Xmx1g -jar /path/to/classifier.jar merge-detail -h soil_hier.txt -o merged_classified.txt -n taxonFilter.txt -c 0.5 -r family native_classified.txt,Native_1_4_A.ids usga_classified.txt

4. Merge assignment count results
If you have classified samples at different time using the same training set, you can merger multiple assignment count results into one assignment count file, keeping one column for each unique sample. If same sample occurred more than once, the taxon counts for this sample will be combined.

[Example command to merge three assignment count results]:
java -Xmx1g -jar /path/to/classifier.jar merge-count merged_hier.txt sampleset1_hier.txt sampleset2_hier.txt sampleset3_hier.txt

5. How to Train the Classifier
a. Follow these steps when there is a need to retrain Classifier, such as novel lineages, newly named type organisms, taxonomic rearrangements, better training set covering specific taxa, or alternative taxonomy. Two files, a taxonomy file and a training sequence file with lineage are required. Prefer high quality, full length sequences, or at least covering the entire region of gene of interest. See samplefiles for example data files. The 16S rRNA training data and Fungal LSU training data can be download from http://sourceforge.net/projects/rdp-classifier/?source=directory.
Based on our experience, trimming the sequences to a specific region does not improve accuracy. The ranks are not required to be uniform neither, which means you can define any number of ranks as necessary. The speed of the Classifier is proportional to the number of genera, not the number of training sequences.

b. Clean-up partial or duplicate training sequences to avoid inflated results in classification performance testing.
Use subcommand "rm-dupseq" to remove identical sequences or any sequence contained by another sequence. Use subcommand "rm-partialseq" to remove partial sequences based on pairwise alignment to near-full length reference sequences.

c. Plot intra taxon Similarity by fraction of matching 8-mer
Use subcommand "taxa-sim" to calculate and plot intra taxon Similarity by fraction of matching 8-mer (see example plots using fungal ITS training sets on RDP's poster http://rdp.cme.msu.edu/download/posters/MSA2014_RDP.pdf). To run taxa-sim in Headless mode without GUI display, use the following options:
java -Djava.awt.headless=true -jar classifier.jar taxa-sim

d. Estimate the accuracy of your own training data using leave-one-out testing
The program will output a tab-delimited test result file which can be loaded to Excel and plot the accuracy rates. It also contains the list of misclassified sequences and the rank when misclassified seqs group by taxon. Examine the result careful to spot errors in the taxonomy.

Leave-one-sequence-out testing: each iteration one sequence from the training set was chosen as a test sequence. That sequence was removed from training set. The assignment of the sequence produced by the Classifier was compared to the original taxonomy label to measure the accuracy of the Classifier.
[Example command with length of 400]:
java -Xmx1g -jar /path/to/classifier.jar loot -q samplefiles/Armatimonadetes.fasta -s samplefiles/new_trainset.fasta -t samplefiles/new_trainset_db_taxid.txt -l 400 -o Armatimonadetes_400_loso_test.txt

Leave-one-taxon-out testing: similar to the leave-one-sequence-out testing except for each test sequence, the lowest taxon that sequence assigned to (either species or genus node) was removed from the training set. This is intended to test if the species or genus is no present in the training set, how likely the Classifier can assign the sequence to the correct genus or higher taxa.
[Example command ]:
java -Xmx1g -jar /path/to/classifier.jar loot -h -q samplefiles/Armatimonadetes.fasta -s samplefiles/new_trainset.fasta -t samplefiles/new_trainset_db_taxid.txt -o Armatimonadetes_loto_test.txt

e. Cross validate testing
This comand performs a random sub-sampling validation. It calculate the values of (1-specificity) and sensitivity for each rank at each bootstrap cutoff.

[Example command to do cross validation testing with length of 400]:
java -Xmx1g -jar /path/to/classifier.jar crossvalidate -o crossvalidate_400.txt -s samplefiles/new_trainset.fasta -t samplefiles/new_trainset_db_taxid.txt -l 400

f. Train classifer
If you are satisfied with the testing results, go ahead to train the classifier.

[Example command to train classifier]:
java -Xmx1g -jar /path/to/classifier.jar train -o mytrained -s samplefiles/new_trainset.fasta -t samplefiles/new_trainset_db_taxid.txt -c gene_copynumber.txt
cp samplefiles/rRNAClassifier.properties mytrained/

"-c" specify the gene copy number file. It should at least three columns: name, rank and mean for the lowest rank taxon to be trained. See example file samplefiles/gene_copynumber.txt

g. Classify sequences using the new model

[Example command to classify with the new model using "-t" option]:
java -Xmx1g -jar /path/to/classifier.jar classify -t mytrained/rRNAClassifier.properties -o Armatimonadetes_classified.txt samplefiles/Armatimonadetes.fasta