TreeMer

A simple tool to generate hierarchical clustering trees from nucleotide sequences using distances between kmer spectra. Included is a small test set of SARSCOV2 genomes downloaded from https://www.nlm.nih.gov/news/coronavirus_genbank.html.

Overview

This tool calculates the distances between a set of nucleotide sequences in FASTA format by digesting them into kmer count vectors (effectively kmer spectra). The pairwise distances between all vectors are calculated and clustered to build a hierarchical clustering tree. A number of distance metrics and clustering methods are supported (see Distance and Clustering).
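The two steps described above (pairwise distances, then clustering) can be sketched as follows. This is a minimal illustration and not TreeMer's own code: it assumes SciPy's pdist and linkage functions, which the supported metric and method names suggest, and the count vectors are invented toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Toy kmer count vectors: one row per sequence, one column per kmer.
# The values are made up for illustration.
spectra = np.array([
    [4, 0, 2, 1],
    [4, 1, 2, 0],
    [0, 5, 0, 3],
], dtype=float)

# Condensed vector of pairwise distances between all rows.
condensed = pdist(spectra, metric="euclidean")

# Square, symmetric distance matrix (the kind of data a heatmap shows).
dist_matrix = squareform(condensed)

# Agglomerative clustering; each row of `tree` records one merge.
tree = linkage(condensed, method="ward")

print(dist_matrix)
print(tree)
```

Each row of the linkage matrix holds the two clusters merged, the merge distance, and the size of the new cluster.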

Installation

Installation is straightforward. Simply run

git clone git@github.com:ArthurVM/TreeMer.git
cd TreeMer
python3 -m pip install -r dependencies.txt

and you are good to go!

Input

TreeMer takes a set of nucleotide sequences in FASTA format and generates kmer count files, structured as:

kmer0 count
kmer1 count
...
kmern count

in tab-separated format (denoting the kmer spectrum of the sequence). These kmer spectra are used to build distance vectors, from which a hierarchical clustering tree is generated.
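As a rough illustration of what a kmer count file contains, the sketch below counts overlapping kmers in a sequence and renders them in the two-column layout shown above. The function names count_kmers and format_spectrum are hypothetical, and the sketch assumes plain overlapping windows on one strand; TreeMer's own counting (for example its handling of reverse complements or ambiguous bases) may differ.

```python
from collections import Counter

def count_kmers(sequence, k=7):
    """Count every overlapping kmer in a nucleotide sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def format_spectrum(counts):
    """Render counts as tab-separated 'kmer<TAB>count' lines, sorted by kmer."""
    return "\n".join(f"{kmer}\t{count}" for kmer, count in sorted(counts.items()))

counts = count_kmers("ACGTACGTAC", k=3)
print(format_spectrum(counts))
```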

Output

TreeMer outputs the following files:

HC_dendro.png     - The hierarchical clustering dendrogram in .png format.
HC_tree.nwk       - A text file containing the hierarchical clustering tree in Newick format. 
heatmap.png       - The heatmap of sequence distances in .png format.
heatmap.{D}.tsv   - A heatmap file in .tsv format. {D} is the distance metric used.
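For readers unfamiliar with Newick, the sketch below shows one way a hierarchical clustering result could be serialised to that format. It is an assumption-laden example and not necessarily how HC_tree.nwk is produced: it uses SciPy's to_tree, a hypothetical helper named to_newick, and invented labels and count vectors.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def to_newick(node, labels, parent_dist):
    """Recursively render a SciPy ClusterNode as a Newick subtree."""
    # Branch length is the height difference between parent and child.
    branch = parent_dist - node.dist
    if node.is_leaf():
        return f"{labels[node.id]}:{branch:.4f}"
    left = to_newick(node.get_left(), labels, node.dist)
    right = to_newick(node.get_right(), labels, node.dist)
    return f"({left},{right}):{branch:.4f}"

# Hypothetical sequence names and toy count vectors.
labels = ["seqA", "seqB", "seqC"]
spectra = np.array([[4, 0, 2, 1], [4, 1, 2, 0], [0, 5, 0, 3]], dtype=float)

root = to_tree(linkage(spectra, method="ward", metric="euclidean"))
newick = to_newick(root, labels, root.dist) + ";"
print(newick)
```

Leaves sit at height zero, so each branch length is simply the difference in merge height between a node and its parent.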

Usage

usage: TreeMer.py [-h] [-i I I] [-k K] [-m M] [-s]
                  [-d {distance metric}]
                  [-c {clustering method}]
                  [-g G]
                  [fa_files [fa_files ...]]

positional arguments:
  fa_files              An arbitrary number of sequence files in FASTA format.

optional arguments:
  -h, --help            show this help message and exit
  -i I I                Lower and upper bound percentiles to construct the
                        tree. E.g. 25 75 will generate a tree from kmers from
                        the 25th to the 75th percentiles in the total set of
                        kmers ordered by count.
  -k K                  Kmer size to use in constructing genome comparison.
                        Default=7.
  -m M                  The maximum count to return a kmer, e.g. return only
                        kmers with count <=10 if m=10. Default=return ALL.
  -s                    Suppress the generation of kmer-spectra from sequence
                        files. This assumes that all positional arguments
                        provided to this tool are already kmer-spectra files
                        generated by genKmerCount. Default=False.
  -d {euclidean,minkowski,cityblock,sqeuclidean,hamming,jaccard,chebyshev,canberra,braycurtis,yule}
                        Metric used in calculating distance between kmer
                        spectra. Default=euclidean.
  -c {ward,single,complete,average,weighted,centroid,median}
                        Clustering method utilised to build the tree.
                        Default=ward.
  -g G                  A tab separated text file containing geographic
                        locations for each sequence, with the sequence ID in
                        col0 and geolocation in col1. Default=False.
  -v                    Verbose output mode. Default=False.
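The -i and -m options both filter kmers by count before the tree is built. The sketch below shows one plausible reading of them, assuming -i trims by rank after ordering kmers by count and -m applies a simple upper threshold. The function names and the toy counts are hypothetical, and TreeMer's actual filtering may be implemented differently.

```python
def filter_by_percentile(counts, lower, upper):
    """Keep kmers whose rank (ordered by count) lies between two percentiles."""
    # Sort by count, breaking ties by kmer so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    start = int(len(ordered) * lower / 100)
    stop = int(len(ordered) * upper / 100)
    return dict(ordered[start:stop])

def filter_by_max_count(counts, max_count):
    """Keep only kmers with count <= max_count."""
    return {kmer: n for kmer, n in counts.items() if n <= max_count}

# Invented counts for ten 3-mers.
counts = {"AAA": 1, "AAC": 2, "AAG": 3, "AAT": 4, "ACA": 5,
          "ACC": 6, "ACG": 7, "ACT": 8, "AGA": 9, "AGC": 50}

trimmed = filter_by_percentile(counts, 10, 90)  # analogous to -i 10 90
capped = filter_by_max_count(counts, 5)         # analogous to -m 5
print(sorted(trimmed))
print(sorted(capped))
```

With ten kmers, trimming to the 10th through 90th percentiles drops the single rarest and the single most frequent kmer.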

Example Using SARSCOV2 Dataset

A dataset of complete SARSCOV2 genomes is provided with this tool, in the /TreeMer/SARSCOV2/SARSCOV2_WGS directory. The geolocations of each isolate are included in /TreeMer/SARSCOV2/geolocs.tsv.
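The geolocation file follows the two-column layout described for the -g option (sequence ID in col0, geolocation in col1). A minimal sketch of reading such a file is given below. The function name read_geolocations is hypothetical, the example rows are invented, and the sketch assumes the file has no header row.

```python
import csv
import io

def read_geolocations(handle):
    """Map sequence ID (col0) to geolocation (col1) from a tab-separated file."""
    locations = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank or malformed lines that lack two columns.
        if len(row) >= 2:
            locations[row[0]] = row[1]
    return locations

# In-memory stand-in for a real file; IDs and places are invented.
example = io.StringIO("seq1\tPlaceA\nseq2\tPlaceB\n")
geo = read_geolocations(example)
print(geo)
```

For a file on disk, the same function can be called on a handle opened with open(path, newline="").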

The entire pipeline can be run using a single command from the TreeMer root directory:

python3 TreeMer.py SARSCOV2/SARSCOV2_WGS/* -k 7 -i 10 90 -d euclidean -c ward -g SARSCOV2/geolocs.tsv

In this instance, we calculate the Euclidean distance between 7mer frequency vectors, strip out the 10% least frequent and 10% most frequent kmers, and cluster using Ward's method. The resulting tree is written to HC_dendro.png and HC_tree.nwk.

Distance and Clustering

A number of distance metrics and clustering methods are supported by this tool.

Distance Metrics

Euclidean
Minkowski
Cityblock
Sqeuclidean
Hamming
Jaccard
Chebyshev
Canberra
Braycurtis
Yule
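To give a feel for how the choice of metric changes the result, the sketch below computes several of the listed distances between the same pair of toy count vectors. It assumes the metric names map onto SciPy's pdist, which uses identical names; the vectors are invented.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Two made-up kmer count vectors.
u = [4, 0, 2, 1]
v = [0, 5, 0, 3]
vectors = np.array([u, v], dtype=float)

metrics = ["euclidean", "sqeuclidean", "cityblock",
           "chebyshev", "canberra", "braycurtis"]

# pdist returns one distance per pair of rows; there is a single pair here.
distances = {m: float(pdist(vectors, metric=m)[0]) for m in metrics}

for name, value in distances.items():
    print(f"{name}\t{value:.4f}")
```

Metrics such as Euclidean and Cityblock grow with raw count differences, whereas Canberra and Braycurtis normalise by the magnitude of the counts, so they can rank the same pairs of sequences differently.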

Clustering Methods

Ward
Single
Complete
Average
Weighted
Centroid
Median

Dependencies

python3
argparse
scipy
numpy
matplotlib
seaborn
