
Commit

Merge pull request #44 from GenomicsAotearoa/update_valter
Add and format Valter's phylogenetics episode
JSBoey authored Aug 30, 2023
2 parents aac8f15 + 939af4b commit d9c3407
Showing 7 changed files with 183 additions and 4 deletions.
180 changes: 180 additions & 0 deletions docs/day3/ex11.1_phylogenomics.md
@@ -0,0 +1,180 @@
# Inferring phylogenetic relationships

GTDB-Tk is very helpful in inferring the taxonomy of our MAGs. It also generates several additional files that we can use to visualise the phylogenetic relationships between the MAGs (either in the larger context of the entire GTDB tree, or among just our MAGs).

## Available software and models

There are several software options for inferring phylogenetic relationships, some more commonly used than others. Popular options include BEAST (Bayesian Evolutionary Analysis Sampling Trees), FastTree 2, Geneious, IQ-TREE 2, MEGA (Molecular Evolutionary Genetics Analysis), and RAxML (Randomized Axelerated Maximum Likelihood). All of the software listed requires a multiple sequence alignment (MSA) as input. For this workshop, GTDB-Tk has conveniently output this file for us.

??? tip "Software for MSAs"

GTDB-Tk generates an MSA based on a concatenated set of core single-copy genes identified within each of our MAGs. However, if you are studying other genes, you will need to generate (and curate) your own MSA. Popular software options include MAFFT, ClustalW, Clustal Omega, and MUSCLE.

**Phylogenetic inference methods**

To start, we need a set of aligned (and ideally, trimmed) orthologous genes. These are genes in different species that, by speciation, have evolved from a common ancestral gene and tend to retain the same function over the course of evolution. From here, there are two ways to generate an input for phylogenetic inference:

!!! note ""

=== "Super-tree approach"

Alignment of each gene is analysed individually to estimate a tree. Then, these distinct trees are integrated to estimate the species tree.

=== "Supermatrix approach (most common)"

Aligned orthologous genes are concatenated (this is the supermatrix) and then used to estimate a tree. This is the most commonly used method.
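
As a rough illustration of the supermatrix idea, the sketch below concatenates per-gene alignments taxon by taxon. It is a toy example, not how GTDB-Tk builds its MSA: the function name `concatenate_alignments`, the bin names, and the sequences are all made up, and it assumes that a taxon missing from a gene should be padded with gap characters.

```python
def concatenate_alignments(gene_alignments):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping taxon name -> aligned sequence.
    Taxa missing from a gene are padded with gaps (an assumption of this
    sketch) so that every row of the supermatrix has the same length.
    """
    # Collect every taxon seen in any gene, in a stable order
    taxa = sorted({taxon for gene in gene_alignments for taxon in gene})
    supermatrix = {taxon: "" for taxon in taxa}
    for gene in gene_alignments:
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene alignment must have equal length")
        gene_length = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon] += gene.get(taxon, "-" * gene_length)
    return supermatrix


# Toy data: two hypothetical genes across three hypothetical bins
gene_a = {"bin_0": "MKV-L", "bin_1": "MKI-L", "bin_2": "MRVAL"}
gene_b = {"bin_0": "GGT", "bin_2": "GAT"}  # bin_1 lacks this gene
for taxon, row in concatenate_alignments([gene_a, gene_b]).items():
    print(taxon, row)
```

Every row of the result has the same length, which is what tree-building software expects from an MSA.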

From here, we then choose one of two ways to reconstruct the phylogeny:

!!! note ""

=== "Distance-based"

This involves the calculation of a pairwise genetic distance matrix and then using it to iteratively construct a tree. A common implementation of this method is the Neighbour-Joining algorithm. While computationally efficient, it tends not to perform well for distantly related organisms.

=== "Character-based"

Infers phylogeny based on the characters (nucleic/amino acid alphabet) in the MSA via one of three methods:

* **Maximum parsimony** This method tries to find the tree topology that requires the fewest character changes to explain the data (i.e., the most parsimonious tree). While computationally efficient, this method is the least commonly applied due to unrealistic assumptions about evolution.
* **Maximum likelihood (ML)** Estimates the parameters of a statistical model such that the probability of observing the data (i.e., the likelihood) is maximal. The general idea behind this method is that if the estimated parameter values of a model make it more likely to observe the data, they are assumed to be closer to the truth. For each candidate topology, the substitution parameters and branch lengths are optimised, and the topology that achieves the highest likelihood is the ML tree. This framework is employed in software such as RAxML, IQ-TREE 2, and FastTree 2.
* **Bayesian methods** Like ML, this approach relies on an explicitly stated model and on the likelihood function. The difference is that parameters have uncertainties that are described using statistical distributions. First, what is assumed about the unknown parameters (including tree topologies) before looking at the data is described via a distribution (the prior). Next, the likelihood of the observed data is evaluated and combined with the prior to give the posterior distribution. <!--needs work-->

<!--
Given a set of aligned and trimmed orthologous genes (genes in different species that evolved from a common ancestral gene by speciation, and in general orthologs retain the same function during the course of evolution), there are different methods for obtaining a species tree:
1. Each gene alignment can be individually analysed to generate a tree estimate, and subsequently, these distinct trees can be combined to form an estimation of the species tree. This method is referred to as the super-tree approach.
2. The aligned genes can be concatenated into a supermatrix, which is then examined to generate an overall approximation of the species tree (supermatrix method).
3. Distance methods involve calculating a genetic distance between every pair of species (based on a comparison of their aligned sequences) and using the resulting distance matrix iteratively to construct a tree. Most commonly used: The neighbour joining (NJ) algorithm.
Character-based phylogenetic inference methods:
1 - Maximum parsimony method: Calculates the minimum number of nucleotide or amino acid changes that are required to explain the data using each possible tree topology. The tree arrangement with the fewest changes is referred to as the most parsimonious tree.
2 - Maximum likelihood (ML): A parameter value that makes the observed data seem very likely is expected to be closer to the truth than a value that makes the data seem almost impossible. For each tree topology, the substitution parameters and branch lengths are optimized to maximize the likelihood and the tree topology that achieves the highest likelihood is the ML tree.
3 - Bayesian method: Relies on an explicitly stated model and on the likelihood function. It differs from ML in that it uses statistical distributions to quantify uncertainties in the parameters.
-->
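
To make the distance-based approach described above more concrete, here is a minimal sketch of the first step: computing a pairwise distance matrix from an MSA. It uses the uncorrected p-distance (the proportion of differing sites), which is the simplest possible choice. Real software usually applies a substitution-model correction instead. The function names and the toy sequences are invented for illustration.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(alignment):
    """Return {(taxon_1, taxon_2): distance} for every unordered pair of taxa."""
    return {
        (t1, t2): p_distance(alignment[t1], alignment[t2])
        for t1, t2 in combinations(sorted(alignment), 2)
    }


# Toy alignment with hypothetical bins
toy_msa = {"bin_0": "MKVAL", "bin_1": "MKIAL", "bin_2": "MR-AL"}
for pair, dist in distance_matrix(toy_msa).items():
    print(pair, round(dist, 3))
```

A matrix like this is the input that an algorithm such as Neighbour-Joining then uses to build the tree iteratively.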

**Confidence in tree inference**

How would we know if the placement of branches is correct in our inferred trees? For ML- and parsimony-based tree reconstruction, a common way to test this is via bootstrapping. This is the generation of pseudo-data of the same size as the observed data by resampling the observed data (here, the alignment columns) with replacement. New trees are then inferred from the resampled data, and we count the number of times the same tree or branch placements (clades) are recovered from the pseudo-data. This is analogous to the resampling used in non-parametric tests. For Bayesian inference, confidence is estimated based on posterior probabilities (a.k.a. the posterior support), which are proportional to the product of the prior probability and the likelihood.
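
The resampling step of bootstrapping can be sketched in a few lines. The example below draws alignment columns with replacement to create pseudo-replicates of the same size as the original alignment. The function name `bootstrap_alignment` and the toy alignment are hypothetical, and the tree inference that would follow each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement to make one pseudo-replicate."""
    taxa = list(alignment)
    n_sites = len(alignment[taxa[0]])
    # Draw n_sites column indices with replacement
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(alignment[taxon][i] for i in columns) for taxon in taxa}


# Toy alignment with hypothetical bins
toy_msa = {"bin_0": "MKVAL", "bin_1": "MKIAL", "bin_2": "MRVAL"}
rng = random.Random(42)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(toy_msa, rng) for _ in range(3)]
for replicate in replicates:
    print(replicate)
```

A tree would be inferred from each replicate, and the support for a clade is the fraction of replicate trees in which that clade appears.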

<!--
Measure of confidence: The most commonly used method for this purpose is bootstrapping. The bootstrap is often used to attach support values for the clades. The bootstrap is applied to assess confidence in estimated trees for the distance, parsimony and ML methods. In analyses of phylogenomic datasets, a common observation is that bootstrap and posterior support values are very high (near 100%) whether the relationships are correct or not.
- Tree file formats
Example of phylogenetic tree file formats: Newick, Nexus, PhyloXML, NeXML and Phylip.
## Inspect the MSA (optional)
-->
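
For posterior support, the relationship between prior, likelihood, and posterior can be shown with a tiny worked example. The sketch below applies Bayes' rule to three candidate topologies. All numbers are made up for illustration, and the function name `posterior_probabilities` is hypothetical. Real Bayesian software cannot enumerate every tree like this and typically approximates the posterior by sampling (e.g., with MCMC).

```python
def posterior_probabilities(priors, likelihoods):
    """Apply Bayes' rule over a finite set of candidate trees."""
    unnormalised = {tree: priors[tree] * likelihoods[tree] for tree in priors}
    evidence = sum(unnormalised.values())  # normalising constant
    return {tree: value / evidence for tree, value in unnormalised.items()}


# Hypothetical numbers for the three possible rooted topologies of taxa A, B, C
priors = {"((A,B),C)": 1 / 3, "((A,C),B)": 1 / 3, "((B,C),A)": 1 / 3}
likelihoods = {"((A,B),C)": 0.008, "((A,C),B)": 0.001, "((B,C),A)": 0.001}
for tree, prob in posterior_probabilities(priors, likelihoods).items():
    print(tree, round(prob, 2))
```

With equal priors, the posterior simply reflects the relative likelihoods. An informative prior would shift the result.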

## Building a phylogenetic tree

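The exercises for this section are performed in the `8.prokaryotic_taxonomy/` directory. Here, we will build an ML tree with branch support values using FastTree 2 and the MSA of our MAGs provided by GTDB-Tk. Note that FastTree infers an approximately-maximum-likelihood tree and, by default, reports SH-like local support values (between 0 and 1) rather than classical bootstrap proportions.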

We begin by loading the required modules:

!!! terminal "code"

```bash
module purge
module load FastTree/2.1.11-GCCcore-9.2.0
```

Then, we prepare the input file based on our GTDB-Tk outputs:

!!! terminal "code"

```bash
# Copy file from previous exercise
cp gtdbtk_out/align/gtdbtk.bac120.user_msa.fasta.gz .

# Decompress file
gzip -d gtdbtk.bac120.user_msa.fasta.gz
```
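
Before building the tree, it can be worth checking that the decompressed file really is an alignment, i.e., that every sequence has the same length. The following is an optional sketch, assuming a Python 3 interpreter is available in your session. The helper names `read_fasta` and `summarise_msa` are made up, and the demonstration runs on a toy in-memory FASTA rather than the workshop file.

```python
import io


def read_fasta(handle):
    """Parse FASTA lines into {header: sequence}."""
    records, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def summarise_msa(records):
    """Return (number of sequences, set of sequence lengths, gap fraction per sequence)."""
    lengths = {len(seq) for seq in records.values()}
    gap_fraction = {
        name: (seq.count("-") / len(seq) if seq else 0.0)
        for name, seq in records.items()
    }
    return len(records), lengths, gap_fraction


# Toy FASTA with two hypothetical bins, sequences wrapped over two lines
toy = io.StringIO(">bin_0\nMK-AL\nGG\n>bin_1\nM--AL\nGA\n")
n_seqs, lengths, gaps = summarise_msa(read_fasta(toy))
print(n_seqs, lengths)
```

To apply this to the workshop data, pass an open file handle for `gtdbtk.bac120.user_msa.fasta` to `read_fasta()` instead of the toy string. A single value in `lengths` indicates that all sequences are aligned to the same width.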

Finally, we can build the tree using the following code:

!!! terminal "code"

```bash
# Infer the tree; FastTree treats the input as a protein alignment unless -nt is given
FastTree gtdbtk.bac120.user_msa.fasta > bin.tree
```

We can inspect what the output looks like:

!!! terminal "code"

```bash
cat bin.tree
```

??? success "Output"

```
((bin_0.filtered:0.39270,bin_1.filtered:0.27312)1.000:0.53352,(bin_7.filtered:0.73278,(bin_2.filtered:0.51665,bin_6.filtered:0.66105)0.930:0.06144)1.000:0.07298,(bin_4.filtered:0.62639,(bin_8.filtered:0.57815,(bin_3.filtered:0.36294,(bin_5.filtered:0.24898,bin_9.filtered:0.28774)1.000:0.11846)1.000:0.19589)1.000:0.07212)0.970:0.05113);
```
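
The output is in Newick format: parentheses group tips into clades, each `label:number` pair is a tip and its branch length, and the number immediately after a closing parenthesis is the support value for that clade. As a rough sketch, the tip labels and support values can be pulled out with regular expressions. This only works for simple Newick strings like the one above (no quoted labels or comments), so a dedicated tree-parsing library is a better choice for real analyses. The function names are hypothetical.

```python
import re

# The tree from the output above, copied verbatim
newick = (
    "((bin_0.filtered:0.39270,bin_1.filtered:0.27312)1.000:0.53352,"
    "(bin_7.filtered:0.73278,(bin_2.filtered:0.51665,bin_6.filtered:0.66105)"
    "0.930:0.06144)1.000:0.07298,(bin_4.filtered:0.62639,(bin_8.filtered:0.57815,"
    "(bin_3.filtered:0.36294,(bin_5.filtered:0.24898,bin_9.filtered:0.28774)"
    "1.000:0.11846)1.000:0.19589)1.000:0.07212)0.970:0.05113);"
)


def tip_labels(tree):
    """Tip labels follow '(' or ',' and are terminated by ':'."""
    return re.findall(r"[(,]([^(),:;]+):", tree)


def support_values(tree):
    """Support values sit between ')' and the ':' of the clade's branch length."""
    return [float(value) for value in re.findall(r"\)([0-9.]+):", tree)]


print(len(tip_labels(newick)), "tips")
print("lowest support:", min(support_values(newick)))
```

Clades with low support values are the ones whose placement deserves the most caution when interpreting the tree.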

## Visualise the tree

We will use [iTOL](https://itol.embl.de/) to visualise the tree we made. iTOL (Interactive Tree Of Life) is a powerful online, browser-based tool for the display, annotation and management of phylogenetic and other trees.

!!! tip "iTOL subscription"

For the purposes of today's workshop, the free version of iTOL will suffice. However, if you think you will need to perform more phylogenetic analyses, you should consider creating an account and subscribing to the web service. This allows you to save your trees online for future use.

**Navigate to iTOL**

In your web browser, navigate to [iTOL](https://itol.embl.de/).

![itol main page](../figures/day3_iToLMainPage.PNG)

Click on the Upload button on the top left.

![itol upload page](../figures/day3_iToLUpload.PNG)

**Add tree information**

As we are working with a small tree, we can go back to the Jupyter terminal to copy the contents of `bin.tree` into the 'Tree text' area.

!!! terminal "code"

```bash
cat bin.tree
```

??? success "Content of `bin.tree`"

```
((bin_0.filtered:0.39270,bin_1.filtered:0.27312)1.000:0.53352,(bin_7.filtered:0.73278,(bin_2.filtered:0.51665,bin_6.filtered:0.66105)0.930:0.06144)1.000:0.07298,(bin_4.filtered:0.62639,(bin_8.filtered:0.57815,(bin_3.filtered:0.36294,(bin_5.filtered:0.24898,bin_9.filtered:0.28774)1.000:0.11846)1.000:0.19589)1.000:0.07212)0.970:0.05113);
```

We can also name our tree.

Once we are done, click Upload.

![itol tree info](../figures/day3_iToLTreeInfo.PNG)

**Annotate tree**

We can also add information to highlight some or all of the tips. iTOL requires annotation files in a specific format. We have provided some example files to use in this workshop:

- [Relabel the bin IDs to their taxon](../resources/Relabelling.txt)
- [Add colour to highlight relevant taxa](../resources/Color.txt)
- [Add shapes to highlight risk-associated taxa](../resources/Risks.txt)

After downloading the files, simply drag and drop them into the iTOL page and it will annotate the tips accordingly.
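
If you want to write your own relabelling file rather than use the one provided, the sketch below generates one from a Python dictionary. It assumes the layout of iTOL's `LABELS` template (a `LABELS` line, a `SEPARATOR` line, then `DATA` followed by one `tip<TAB>new label` row per tip), so check it against the current iTOL template before relying on it. The function name and the taxon names are placeholders, not results from this workshop.

```python
def make_itol_labels(mapping):
    """Build the text of an iTOL LABELS annotation file (tab-separated)."""
    lines = ["LABELS", "SEPARATOR TAB", "DATA"]
    lines += [f"{tip}\t{label}" for tip, label in mapping.items()]
    return "\n".join(lines) + "\n"


# Hypothetical mapping: the taxon names below are placeholders, not real results
relabel = {"bin_0.filtered": "Taxon A", "bin_1.filtered": "Taxon B"}
print(make_itol_labels(relabel), end="")
```

The tip names on the left must match the labels in `bin.tree` exactly, otherwise the tips cannot be matched to their new labels.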

![itol tree annotated](../figures/day3_iToLTreeAnnotation.PNG)

**Play around with the tree**

We can edit and add annotations by clicking within the iTOL website environment. However, annotation files are a great way to maintain annotations for phylogenetic trees and to automatically display the information you need to highlight on the tree.

!!! question "**Exercise** Alternative display and additional information"

See if you can add or change the following on your tree:

* Show bootstrap support values in blue
* Show branch lengths in red
* Change display format of your tree

Binary file added docs/figures/day3_iToLMainPage.PNG
Binary file added docs/figures/day3_iToLTreeAnnotation.PNG
Binary file added docs/figures/day3_iToLTreeInfo.PNG
Binary file added docs/figures/day3_iToLUpload.PNG
6 changes: 3 additions & 3 deletions docs/resources/Color.txt
@@ -11,7 +11,7 @@ SEPARATOR TAB
#SEPARATOR COMMA

#label is used in the legend table (can be changed later)
DATASET_LABEL example style
DATASET_LABEL Host

#dataset color (can be changed later)
COLOR #ffff00
@@ -27,15 +27,15 @@ COLOR #ffff00
#Shape should be a number between 1 and 6, or any protein domain shape definition.
#1: square
#2: circle
3: star
#3: star
#4: right pointing triangle
#5: left pointing triangle
#6: checkmark

LEGEND_TITLE Host
#LEGEND_POSITION_X,100
#LEGEND_POSITION_Y,100
#LEGEND_SHAPES 3 3 3 3 3 3 3 3 3
LEGEND_SHAPES 3
#LEGEND_COLORS
#LEGEND_LABELS Co-assembled (group)
#LEGEND_SHAPE_SCALES,1,1,0.5
1 change: 0 additions & 1 deletion mkdocs.yml
@@ -104,7 +104,6 @@ markdown_extensions:
custom_checkbox: true
- pymdownx.tilde
- pymdownx.snippets
- markdown_grid_tables

extra_javascript:
- javascripts/mathjax.js