Merge pull request #44 from GenomicsAotearoa/update_valter

Add and format Valter's phylogenetics episode
GenomicsAotearoa · Aug 30, 2023 · d9c3407 · d9c3407
2 parents aac8f15 + 939af4b
commit d9c3407
Show file tree

Hide file tree

Showing 7 changed files with 183 additions and 4 deletions.
diff --git a/docs/day3/ex11.1_phylogenomics.md b/docs/day3/ex11.1_phylogenomics.md
@@ -0,0 +1,180 @@
+# Inferring phylogenetic relationships
+
+GTDB-Tk is very helpful in inferring the taxonomy of our MAGs. It also has a bunch of additional files that is generated so we can visualise the phylogenetic relationships between the MAGs (both in the larger context of the entire GTDB tree, or just our MAGs). 
+
+## Available software and models
+
+There are several software options for inferring phylogenetic relationships, some more commonly used than others. Popular options include BEAST (Bayesian Evolutionary Analysis Sampling Trees), FastTree 2, Geneious, IQ-TREE 2, MEGA (Molecular Evolutionary Genetics Analysis), and RAxML (Randomized Axelerated Maximum Likelihood). A requirement common to all software listed is that they require a multiple sequence alignment (MSA). For this workshop, GTDB-Tk has conveniently output this file for us.
+
+??? tip "Software for MSAs"
+
+    GTDB-Tk generates an MSA based on a concatenated set of core single-copy genes identified within each of our MAGs. However, if you are studying other genes, you will need to generate (and curate) your own MSA. Popular software include MAFFT, ClustalW, Clustal Omega, and MUSCLE. 
+
+**Phylogenetic inference methods** 
+
+To start, we start need a set of aligned (and ideally, trimmed) orthologous genes. These are genes in different species that, by speciation, have evolved from a common ancestor and tend to retain the same function over the course of evolution. From here, there are two ways to generate an input for phylogenetic inference:
+
+!!! note ""
+
+    === "Super-tree approach"
+
+        Alignment of each gene is analysed individually to estimate a tree. Then, these distinct trees are integrated to estimate the species tree.
+
+    === "Supermatrix approach (most common)"
+
+        Aligned orthologous genes are concatenated (this is the supermatrix) and then used to estimate a tree. This is the most commonly used method.
+
+From here, we then choose one of two ways to reconstruct the phylogeny:
+
+!!! note ""
+
+    === "Distance-based"
+
+        This involves the calculation of a pairwise genetic distance matrix and then using it to iteratively construct a tree. A common implementation of this method is the Neighbour-Joining algorithm. While computationally efficient, it tends not to perform well for distantly related organisms.
+
+    === "Character-based"
+
+        Infers phylogeny based on the characters (nucleic/amino acid alphabet) in the MSA via one of three methods:
+
+        * **Maximum parsimony** This method tries to find a tree topology with the least the number of character changes required to explain the data (i.e., the most parsimonious tree). While computationally efficient, this method is the least commonly applied due to unrealistic assumptions about evolution.
+        * **Maximum likelihood (ML)** Estimates the parameters of a statistical model such that the probability of observing the data (i.e., the likelihood) is maximal. The general idea behind this method is that if the estimated parameter values of a model makes it more likely to observe the data, it is assumed to approximate the true topology of the tree. This framework is employed in software such as RAxML, IQ-TREE2, and FastTree2.
+        * **Bayesian methods** This method uses a similar approach to ML in that it relies on a model and maximises the likelihood of observing the data. The difference is in its implementation, where trees are inferred via Bayesian models. Here, parameters have uncertainties that are described using statistical distributions. Firstly, it collects/estimates and analyses data relevant to generating tree topologies and describes it via a distribution (the prior; i.e., the uncertainty of the unknown parameters). Next, it estimates the likelihood (data is observed at this stage) and then <!--needs work--> 
+
+<!--
+Given a set of aligned and trimmed orthologous genes (genes in different species that evolved from a common ancestral gene by speciation, and in general orthologs retain the same function during the course of evolution), there are different methods for obtaining a species tree:
+
+1. Each gene alignment can be individually analysed to generate a tree estimate, and subsequently, these distinct trees can be combined to form an estimation of the species tree. This method is referred to as the super-tree approach. 
+2. The aligned genes can be concatenated into a supermatrix, which is then examined to generate an overall approximation of the species tree (supermatrix method).
+3. Distance methods involve calculating a genetic distance between every pair of species (based on a comparison of their aligned sequences) and using the resulting distance matrix iteratively to construct a tree. Most commonly used: The neighbour joining (NJ) algorithm.
+
+Character-based phylogenetic inference methods:
+1 - Maximum parsimony method: Calculates the minimum number of nucleotide or amino acid changes that are required to explain the data using each possible tree topology. The tree arrangement with the fewest changes is referred to as the most parsimonious tree.
+2 - Maximum likelihood (ML): A parameter value that makes the observed data seem very likely is expected to be closer to the truth than a value that makes the data seem almost impossible. For each tree topol- ogy, the substitution parameters and branch lengths are optimized to maximize the likelihood and the tree topology that achieves the highest likelihood is the ML tree.
+3 - Bayesian method: Relies on an explicitly stated model and on the likelihood function. It differs from ML in that it uses statistical distributions to quan- tify uncertainties in the parameters.
+-->
+
+**Confidence in tree inference**
+
+How would we know if the placement of branches are correct in our inferred trees? For ML- and parsimony-based tree reconstruction, a common way to test that is via bootstrapping. This is the generation of pseudo-data of similar size to the observed data via iterative resampling of the observed data. New trees are then inferred from the resampled data, and the number of times the same tree or branch placements (clades) are observed based on the pseudo-data. This is analogous to permutations in non-parametric tests. For Bayesian method-based inference, confidence is estimated based on posterior probabilities (a.k.a. the posterior support; i.e., the product of prior probabilities and maximal likelihood).
+
+<!--
+Measure of confidence: The most commonly used method for this purpose is bootstrapping. The bootstrap is often used to attach support values for the clades. The bootstrap is applied to assess confidence in estimated trees for the distance, parsimony and ML methods. In analyses of phylogenomic datasets, a common observation is that bootstrap and posterior support values are very high (near 100%) whether the relationships are correct or not.
+
+- Tree file formats
+
+Example of phylogenetic tree file formats: Newick, Nexus, PhyloXML, NeXML and Phylip.
+
+## Inspect the MSA (optional) 
+-->
+
+## Building a phylogenetic tree
+
+The exercises for this section is performed in the `8.prokaryotic_taxonomy/` directory. Here, we will build a bootstrapped ML tree using FastTree2 with an MSA provided by GTDB-Tk of our MAGs.
+
+We begin by loading the required modules:
+
+!!! terminal "code"
+
+    ```bash
+    module purge
+    module load FastTree/2.1.11-GCCcore-9.2.0
+    ```
+
+Then, we prepare the input file based on our GTDB-Tk outputs:
+
+!!! terminal "code"
+
+    ```bash
+    # Copy file from previous exercise
+    cp gtdbtk_out/align/gtdbtk.bac120.user_msa.fasta.gz .
+
+    # Decompress file
+    gzip -d gtdbtk.bac120.user_msa.fasta.gz
+    ```
+
+Finally, we can build the tree using the following code:
+
+!!! terminal "code"
+
+    ```bash
+    FastTree gtdbtk.bac120.user_msa.fasta > bin.tree
+    ```
+
+We can inspect what the output looks like:
+
+!!! terminal "code"
+
+    ```bash
+    cat bin.tree
+    ```
+
+    ??? success "Output"
+
+        ```
+        ((bin_0.filtered:0.39270,bin_1.filtered:0.27312)1.000:0.53352,(bin_7.filtered:0.73278,(bin_2.filtered:0.51665,bin_6.filtered:0.66105)0.930:0.06144)1.000:0.07298,(bin_4.filtered:0.62639,(bin_8.filtered:0.57815,(bin_3.filtered:0.36294,(bin_5.filtered:0.24898,bin_9.filtered:0.28774)1.000:0.11846)1.000:0.19589)1.000:0.07212)0.970:0.05113);
+        ```
+
+## Visualise the tree
+
+We will use [iTOL](https://itol.embl.de/) to visualise the tree we made. iToL (Interactive Tree Of Life) is a powerful online, browser-based tool for the display, annotation and management of phylogenetic and trees. 
+
+!!! tip "iToL subscription"
+
+    For the purposes of today's workshop, the free version of iToL will suffice. However, if you think you will need to perform more phylogenetic analyses, you should consider creating an account and subscribing to the web service. This allows you to save your trees online for future use.
+
+**Navigate to iTOL**
+
+On your web browser, navigate to [iTOL](https://itol.embl.de/).
+
+![itol main page](../figures/day3_iToLMainPage.PNG)
+
+Click on the Upload button on the top left.
+
+![itol upload page](../figures/day3_iToLUpload.PNG)
+
+**Add tree information**
+
+As we are working with a small tree, we can go back to the Jupyter terminal to copy the contents of `bin.tree` into the 'Tree text' area.
+
+!!! terminal "code"
+
+    ```bash
+    cat bin.tree
+    ```
+
+    ??? success "Content of `bin.tree`"
+
+        ```
+        ((bin_0.filtered:0.39270,bin_1.filtered:0.27312)1.000:0.53352,(bin_7.filtered:0.73278,(bin_2.filtered:0.51665,bin_6.filtered:0.66105)0.930:0.06144)1.000:0.07298,(bin_4.filtered:0.62639,(bin_8.filtered:0.57815,(bin_3.filtered:0.36294,(bin_5.filtered:0.24898,bin_9.filtered:0.28774)1.000:0.11846)1.000:0.19589)1.000:0.07212)0.970:0.05113);
+        ```
+
+We can also name our tree.
+
+Once we are done, click Upload.
+
+![itol tree info](../figures/day3_iToLTreeInfo.PNG)
+
+**Annotate tree**
+
+We can also add additional information to highlight some or all of the tips. iTOL requires annotation files in a specific format. We have provided you with some example ones to use in this workshop:
+
+- [Relabel the bin IDs to their taxon](../resources/Relabelling.txt)
+- [Add colour to highlight relevant taxa](../resources/Color.txt)
+- [Add shapes to highlight risk-associated taxa](../resources/Risks.txt)
+
+After downloading the files, simply drag and drop them into the iTOL page and it will annotate the tips accordingly.
+
+![itol tree annotated](../figures/day3_iToLTreeAnnotation.PNG)
+
+**Play around with the tree**
+
+We can edit and add annotations by clicking within the iTol website environment. However, the annotation files are a great way to maintain annotations for the phylogenetic trees and automatically display information you need to highlight on the tree.
+
+!!! question "**Exercise** Alternative display and additional information"
+
+    See if you can add/change the following from your tree:
+
+    * Show bootstrap support values in blue
+    * Show branch lengths in red
+    * Change display format of your tree
+
diff --git a/docs/figures/day3_iToLMainPage.PNG b/docs/figures/day3_iToLMainPage.PNG
diff --git a/docs/figures/day3_iToLTreeAnnotation.PNG b/docs/figures/day3_iToLTreeAnnotation.PNG
diff --git a/docs/figures/day3_iToLTreeInfo.PNG b/docs/figures/day3_iToLTreeInfo.PNG
diff --git a/docs/figures/day3_iToLUpload.PNG b/docs/figures/day3_iToLUpload.PNG
diff --git a/docs/resources/Color.txt b/docs/resources/Color.txt
@@ -11,7 +11,7 @@ SEPARATOR TAB
 #SEPARATOR COMMA
 
 #label is used in the legend table (can be changed later)
-DATASET_LABEL	example style
+DATASET_LABEL	Host
 
 #dataset color (can be changed later)
 COLOR	#ffff00
@@ -27,15 +27,15 @@ COLOR	#ffff00
 #Shape should be a number between 1 and 6, or any protein domain shape definition.
 #1: square
 #2: circle
-3: star
+#3: star
 #4: right pointing triangle
 #5: left pointing triangle
 #6: checkmark
 
 LEGEND_TITLE	Host
 #LEGEND_POSITION_X,100
 #LEGEND_POSITION_Y,100
-#LEGEND_SHAPES	3	3	3	3	3	3	3	3	3
+LEGEND_SHAPES	3
 #LEGEND_COLORS	
 #LEGEND_LABELS	Co-assembled (group)
 #LEGEND_SHAPE_SCALES,1,1,0.5

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -104,7 +104,6 @@ markdown_extensions:
       custom_checkbox: true
   - pymdownx.tilde
   - pymdownx.snippets
-  - markdown_grid_tables
 
 extra_javascript:
   - javascripts/mathjax.js