Alignment to the AllTheBacteria (ATB) v0.2 dataset, which can be done on a single laptop or server, or on a cluster. Phylign uses phylogenetically compressed assemblies and their k-mer indexes to align batches of queries to them with Minimap 2.
The central idea behind Phylign is to:
- have a highly compressed set of assemblies that you want to map to. This is done losslessly using phylogenetic compression (paper). We batch the assemblies by species and compress each batch; species with very many genomes are split across several batches.
- have a set of k-mer indexes, one per batch, and use them to decide which batches contain likely hits for a query. We use a k-mer index called COBS (https://github.com/iqbal-lab-org/cobs).
- decompress the candidate genomes and then align to them using Minimap 2.
In short, to do this you need to clone this repo and place the assembly batches and the COBS (k-mer) indexes in the right place. You then put your queries in the right place and run `make`, and Snakemake will execute the search, either locally (on the laptop/server you are using) or on a cluster. In our tests, searching the 2 million genomes locally took between 30 minutes and 2 hours (depending on the number of hits) on a 48-core machine, or about 30 minutes on a compute cluster.
Phylign requires a standard desktop or laptop computer with an *nix system, and it can also run on a cluster.
WARNING: Phylign does not currently work on systems with ARM processors.
Get the assemblies using this command. Get the indexes using this command
Phylign is implemented as a Snakemake pipeline, using the Conda system to manage non-standard dependencies. Ensure you have Conda installed with the following packages:
- GNU Time (on Linux present by default; on OS X, install with `brew install gnu-time`).
Additionally, Phylign uses standard Unix tools like GNU Make, cURL, XZ Utils, and GNU Gzip. These tools are typically included in standard *nix installations. However, in minimal setups (e.g., virtualization, continuous integration), you might need to install them using the corresponding package managers.
Make sure you have Conda and GNU Time installed. On Linux:
sudo apt-get install conda
On OS X (using Homebrew):
brew install conda
brew install gnu-time
Clone the Phylign repository from GitHub and navigate into the directory:
git clone https://github.com/AllTheBacteria/Phylign
cd Phylign
conda env create -f environment.yaml && conda activate phylign
**Default usage is to run locally using 8 CPUs. This is going to be very slow if you run it on a laptop. A 48-core machine brings it down to an hour or so to query a single gene.** How to run on a cluster is described in the section on cluster usage below.
Copy or symlink the miniphy-compressed batches of assemblies that you want to map to into `asms/`. The assemblies can be found at https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/.
Copy or symlink the miniphy-cobs compressed batches of search indices that you want to query into `cobs/`. Each batch in `asms/` should have a matching index in `cobs/`. The search indices can be found at https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/indexes/phylign/.
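Before starting a search, it can help to confirm that every assembly batch really has a matching index. The following is a minimal sketch, not part of Phylign: the helper name `missing_indexes` is hypothetical, and it assumes the file naming used elsewhere in this README (`<batch>.asm.tar.xz` in `asms/` and `<batch>.cobs_classic.xz` in `cobs/`).

```shell
# Print every batch that has an assembly archive but no COBS index.
# Hypothetical helper; assumes <batch>.asm.tar.xz and <batch>.cobs_classic.xz.
missing_indexes() {
    asms_dir="${1:-asms}"
    cobs_dir="${2:-cobs}"
    for asm in "$asms_dir"/*.asm.tar.xz; do
        # Skip the unexpanded pattern (no archives) and broken symlinks.
        [ -e "$asm" ] || continue
        batch="$(basename "$asm" .asm.tar.xz)"
        [ -e "$cobs_dir/$batch.cobs_classic.xz" ] || echo "$batch"
    done
}
```

Running `missing_indexes asms cobs` from the repository root prints one batch name per line for each missing index, and prints nothing when the two directories agree.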
Remove the default test files or your old files in the `input/` directory and copy or symlink (recommended) your query files. The supported input formats are FASTA and FASTQ, possibly gzipped.
Notes:
- All query names have to be unique among all query files.
- Queries should not contain non-ACGT characters. All non-`ACGT` characters in your query sequences will be translated to `A`.
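Both notes can be checked before a run. The sketch below is an illustration only and is not shipped with Phylign: the function names are hypothetical, it handles uncompressed FASTA files only (gzipped or FASTQ input would need to be decompressed or converted first), it takes a query name to be the header text up to the first whitespace, and it assumes lower-case `acgt` is acceptable.

```shell
# Print query names that occur more than once across the given FASTA files.
# Hypothetical helper; a name is the header text up to the first whitespace.
duplicate_query_names() {
    awk '/^>/ { name = substr($1, 2); if (++seen[name] == 2) print name }' "$@"
}

# Print the number of sequence characters other than A, C, G, T (either case).
# Assumption: lower-case bases are fine; tighten the class if they are not.
count_non_acgt() {
    awk '!/^>/ { n += gsub(/[^ACGTacgt]/, "") } END { print n + 0 }' "$@"
}
```

For example, `duplicate_query_names input/*.fa` lists each repeated name once, and `count_non_acgt input/*.fa` reports how many characters would be translated to `A`.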
Edit the `config.yaml` file for your desired search. All available options are documented directly there.
Run `make clean` to clean intermediate files from the previous runs. This includes COBS matching files, alignment files, and various reports.
Simply run `make`, which will execute Snakemake with the corresponding parameters. If you want to run the pipeline step by step, run `make match` followed by `make map`.
Check the output files in `output/` (for more info about formats, see 5c) File formats).
If the results do not correspond to what you expected and you need to readjust your search parameters, go to Step 2. If only the mapping part is affected by the changes, you can proceed more rapidly by manually removing the files in `intermediate/05_map` and `output/` and then running `make map` directly.
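The mapping-only rerun can be sketched as a small helper. This is an illustration under stated assumptions rather than a Phylign command: the function name `clear_map_results` is hypothetical, and it assumes it is given the path to the repository root (defaulting to the current directory).

```shell
# Remove previous mapping results so that `make map` regenerates them.
# Hypothetical helper; pass the repository root, or run it from there.
clear_map_results() {
    root="${1:-.}"
    # The * glob skips dotfiles, so files such as .gitignore are kept.
    rm -rf "$root"/intermediate/05_map/* "$root"/output/*
}
```

From the repository root, `clear_map_results . && make map` then repeats only the mapping step and leaves the COBS matching results in place.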
For additional info see the additional info file.
It is possible to run Phylign on a subset of the AllTheBacteria assemblies, e.g. if you only want to query a certain species or your resources are limited. To do so, download the desired assemblies and COBS indices and follow the steps described in Usage. You then need to modify `data/batches_2m.txt` to include only the batches you have assemblies and compressed COBS indices for. E.g. to search only `asms/salmonella_enterica__81.asm.tar.xz` using the compressed index `cobs/salmonella_enterica__81.cobs_classic.xz`, you must modify the file to include only `salmonella_enterica__81`. Alternatively, you can create a new `.txt` file with one batch per line and set the `batches` variable in `config.yaml` to the path of this new file.
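A batch list of this kind can be derived from whatever is present in `asms/`. The following is a minimal sketch, assuming archives are named `<batch>.asm.tar.xz` as in the example above; the function name `list_local_batches` and the output file name used below are hypothetical.

```shell
# Print one batch name per line for every assembly archive in a directory.
# Hypothetical helper; assumes archives are named <batch>.asm.tar.xz.
list_local_batches() {
    asms_dir="${1:-asms}"
    for asm in "$asms_dir"/*.asm.tar.xz; do
        # Skip the unexpanded pattern (no archives) and broken symlinks.
        [ -e "$asm" ] || continue
        basename "$asm" .asm.tar.xz
    done
}
```

For example, `list_local_batches asms > data/batches_local.txt` writes the list to a new file, and the `batches` variable in `config.yaml` can then be pointed at that path.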
Running on a cluster is much faster as the jobs produced by this pipeline are quite light and usually start running as soon as they are scheduled.
For LSF clusters:
- Set up the Snakemake LSF profile described here.
- Configure your queries and run the full pipeline: `make cluster_lsf`.
For SLURM clusters:
- Set up the Snakemake SLURM profile described here.
- Configure your queries and run the full pipeline: `make cluster_slurm`.
K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression. bioRxiv 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996
- If you want to know about AllTheBacteria, contact Zamin Iqbal (zi245@bath.ac.uk) or John Lees (jlees@ebi.ac.uk).
- If you want to know about Phylign on AllTheBacteria, contact Daniel Anderson (dander@ebi.ac.uk) or Wei Shen (shenwei356@gmail.com).
- If you want to know about Phylign generally, contact Karel Brinda (karel.brinda@inria.fr).