Linking fossils and bioinformatics

tkwsm edited this page Sep 16, 2015 · 19 revisions

Takeshi Kawashima and Rutger Vos

(Part of the genomics track - repo)

The Problem

Fossils are commonly used to calibrate nodes on phylogenetic trees so that molecular evolutionary events and processes can be projected on a timescale. However, placing fossils on trees frequently yields strange results. A trivially simple case is the Newick tree ((A,B)fossil1,C)fossil2; where fossil1 is estimated to be older than fossil2. The tree topology and the age estimates then contradict each other and can only be reconciled by assigning a negative length to the branch that connects clade (A,B) with the root: time has to go "in reverse" for a bit, which is of course impossible. In practice, however, the symptoms are usually less obvious.
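The contradiction in this toy tree can be checked mechanically. The following is a minimal sketch, assuming hypothetical ages (in millions of years) for the two fossils; the dictionaries and the function name are illustrative and not part of any tool mentioned on this page.

```python
# Hypothetical node ages (millions of years before present).
# fossil1 is deliberately older than fossil2, as in the example above.
node_age = {"fossil2": 50.0, "fossil1": 60.0, "A": 0.0, "B": 0.0, "C": 0.0}

# Parent of each node in the tree ((A,B)fossil1,C)fossil2;
parent = {"A": "fossil1", "B": "fossil1", "fossil1": "fossil2", "C": "fossil2"}


def branch_durations(node_age, parent):
    """Return the time span of each branch: parent age minus child age."""
    return {child: node_age[par] - node_age[child] for child, par in parent.items()}


durations = branch_durations(node_age, parent)

# Branches on which time would have to run "in reverse".
negative = sorted(child for child, span in durations.items() if span < 0)
print(negative)
```

With these ages, the only branch with a negative duration is the one leading to the node labelled fossil1, that is, the branch connecting clade (A,B) with the root.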

In most approaches to tree calibration, fossils are not used to fix the ages of nodes exactly but rather to define limits on the age ranges of groups of species. For example, "crown" fossils impose a minimum age constraint on the most recent common ancestor of a taxonomic group. A crown fossil is one diagnosed as having been deposited after the extant members of a taxonomic group started radiating, so its age suggests that the most recent common ancestor is at least as old as the fossil. Conversely, "stem" fossils impose a maximum age on the most recent common ancestor, because they are diagnosed as having been deposited before the extant taxonomic group started radiating. Outside of isotope dating methods, fossils are often dated in relation to the presence of "index" fossils: well-characterized, easy-to-identify, broadly distributed fossils that are diagnostic for a particular geologic era, thereby allowing the sediment layer (and any other fossils in it) to be dated.
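The crown/stem distinction described above maps naturally onto a pair of age bounds per node. The sketch below is one possible representation, not the data model of any existing tool; the class name, the function names and the example ages are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgeConstraint:
    """Age bounds on the most recent common ancestor of one taxonomic group."""
    node: str
    min_age: Optional[float] = None  # tightest bound from crown fossils
    max_age: Optional[float] = None  # tightest bound from stem fossils


def add_fossil(constraint, kind, age):
    """Tighten the bounds with one fossil; kind is 'crown' or 'stem'."""
    if kind == "crown":
        # A crown fossil means the ancestor is at least this old.
        if constraint.min_age is None or age > constraint.min_age:
            constraint.min_age = age
    elif kind == "stem":
        # A stem fossil means the ancestor is at most this old.
        if constraint.max_age is None or age < constraint.max_age:
            constraint.max_age = age
    else:
        raise ValueError(f"unknown fossil kind: {kind!r}")
    return constraint


def is_satisfiable(constraint):
    """True unless the minimum age exceeds the maximum age."""
    if constraint.min_age is None or constraint.max_age is None:
        return True
    return constraint.min_age <= constraint.max_age


mrca = AgeConstraint("example_clade")
add_fossil(mrca, "crown", 55.0)
add_fossil(mrca, "stem", 70.0)
ok_before = is_satisfiable(mrca)

# A second, older crown fossil makes the bounds contradictory.
add_fossil(mrca, "crown", 80.0)
ok_after = is_satisfiable(mrca)
print(ok_before, ok_after)
```

An empty age interval of this kind is the simplest form of fossil-fossil conflict and can be detected before any sequence data are involved.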

Given a set of age limits, tree calibration is done either as part of tree inference - for example in a Bayesian framework, such as in BEAST - or after tree inference, where popular approaches are based on non-parametric rate smoothing or penalized likelihood. When fossils and trees are incompatible, the problem usually manifests during these analyses, for example as poor mixing of chains during Bayesian analyses, or as problems with numerical optimization in penalized likelihood (e.g. pathological likelihood surfaces). Sometimes negative branch lengths result, but more often the result is a set of wildly improbable implied shifts in rates of molecular evolution (compressed paths along part of a tree). Any of these issues can of course only manifest when more than one fossil is used. With a single fossil, the entire tree is simply scaled to a depth proportional to the age of the fossil.
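The single-fossil case can be sketched in a few lines. This assumes an ultrametric tree (in effect a strict clock) whose node heights are measured in substitutions per site from the tips, and it treats the fossil age as a fixed point rather than a bound; the function name and the numbers are hypothetical.

```python
def scale_to_single_calibration(node_heights, calibrated_node, fossil_age):
    """Rescale all node heights so the calibrated node gets the fossil's age.

    node_heights: dict mapping node name to height in substitutions per site.
    Returns a dict mapping node name to age in the fossil's time units.
    """
    factor = fossil_age / node_heights[calibrated_node]
    return {node: height * factor for node, height in node_heights.items()}


# Hypothetical heights for the tree ((A,B)AB,C)root;
heights = {"root": 0.5, "AB": 0.125, "A": 0.0, "B": 0.0, "C": 0.0}

# One fossil places node AB at 25 million years.
ages = scale_to_single_calibration(heights, "AB", 25.0)
print(ages["root"])
```

Because every height is multiplied by the same factor, no conflict can arise: a second calibration is needed before the scaled ages can disagree with anything.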

How could this be?

As always, a big cause of these problems is bad data. In this particular case, the bad data can originate in either of two disciplines: palaeontology or bioinformatics. Bad data in palaeontology can result when fossils are misidentified - for example, when a crown fossil is diagnosed as a stem fossil, or vice versa, or even belongs to a completely different taxonomic group - or when the fossils are dated incorrectly because of ambiguous stratigraphy or problems with isotope dating methods. Bad data in bioinformatics can result from a variety of methodological issues, such as incorrect orthology prediction, misalignment, unfit substitution model selection, and suboptimal tree topologies.

Biologically interesting phenomena can also cause apparent incompatibilities between fossils and molecules. Our assumptions about physical processes usually trace back to Ockham's Razor: we prefer more constant rates over more variable ones, and less sudden morphological change over hopeful monsters. Such phenomena aren't problems; they are scholarly works in waiting, and they don't require technological "solutions".

What is to be done?

Since the recent launch of fossilcalibrations.org (and its web service API: docs.fcdb.apiary.io), carefully vetted fossil data have become available and machine readable for the express purpose of calibrating trees. Likewise, automated approaches for orthology prediction, multiple sequence alignment and tree inference have resulted in a number of databases (e.g. see questfororthologs.org/orthology_databases) that provide large amounts of re-usable, machine readable data intended to address some of the bad data problems in bioinformatics described above. This has created the opportunity to survey the extent of incompatibilities between fossils and molecules, but also the need to consider how to formally express such incompatibility: which do we believe, the fossil or the molecule? Why? Can we quantify this in a comparable way?

Proposal:

  1. Select an orthology database (e.g. TreeFam, OMA, EnsEMBL COMPARA). Ideally this is a database with a web service that hosts either an inferred (species) tree for each set of orthologous sequences, covering enough different taxa that at least some fossils fit on each tree, or a gene tree whose node annotations clearly distinguish speciation events from gene duplications. As a proof of principle, we use a dump from TreeFam, which provides annotated gene trees as "New Hampshire Extended" (nhx), a syntax for parenthetical tree description with semantic comments. An alternative would be PhyloXML (also developed by Christian Zmasek), which is easier to parse but makes the data dump much larger. Ideally, SPARQL queries on ortholog databases would return trees whenever the underlying database uses a tree-based method for orthology prediction. The orthology ontology includes terms from the CDAO, which was originally developed to represent phylogenetic data, so trees could be expressed in it.

  2. Fetch or iterate over all sets of orthologous sequences. A genome-wide scan would be nice. For now we iterate over a subsample from TreeFam.

  3. For each set of orthologous sequences, fetch or infer a tree with molecular branch lengths. For now we use the nhx trees provided by TreeFam.

  4. Fetch fossils that fit on the tree. For now we do this on the basis of internal node labels (names of higher taxa) in the nhx trees. It would be better to discover fossils by ingroup sets, but this requires a mapping between sequence identifiers in the gene trees and taxon names. TreeFam sequence identifiers are a messy mix of EnsEMBL identifiers (easy) and things like genome assemblies (hard).

  5. Calibrate the tree, e.g. with treePL (penalized likelihood). We use r8s, which emits ratograms (trees whose branch lengths represent the inferred rates) directly. It is less efficient than treePL, however, so it might be a problem for large clusters of orthologs.

  6. Invent a way to detect incompatibility. For example, given the "ratogram", we might say that there is a problem if the highest and the lowest rate differ by more than two standard deviations. Negative branch lengths are another obvious symptom.

  7. Invent a way to assign "blame" for the incompatibility. For example, if the same fossil participates in a lot of problematic ratograms, it's probably that fossil's fault. If some particular gene (or category of genes) keeps producing problematic ratograms regardless of the fossils used, it's probably the fault of the gene. Among fossils, we should probably trust index fossils more than others.

  8. Deposit the results of the survey somewhere publicly, e.g. for the benefit of fossilcalibrations.org, to refine their quality score for fossils.
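Steps 6 and 7 can be prototyped without any phylogenetics library once the rates have been read from the ratograms. The sketch below is only one possible reading of the detection rule. The range of a sample is never smaller than two population standard deviations, so a literal range-versus-standard-deviation test would fire for almost any tree. The code therefore swaps in a related criterion: a tree is flagged when some branch rate lies more than k standard deviations from the mean rate. The function names, the gene categories, the fossil identifiers and the threshold k are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def is_incompatible(rates, branch_durations, k=2.0):
    """Flag a calibrated tree as problematic.

    rates: per-branch substitution rates read from a ratogram.
    branch_durations: per-branch time spans after calibration.
    """
    # Negative durations are an unambiguous symptom.
    if any(span < 0 for span in branch_durations):
        return True
    sd = pstdev(rates) if len(rates) > 1 else 0.0
    if sd == 0:
        return False
    centre = mean(rates)
    # Assumed criterion: some rate is an outlier relative to the mean.
    return any(abs(rate - centre) > k * sd for rate in rates)


def blame_counts(results):
    """Tally how often each fossil and gene category occurs in flagged trees.

    results: iterable of (gene_category, fossils_used, flagged) tuples.
    """
    fossil_hits, category_hits = Counter(), Counter()
    for category, fossils, flagged in results:
        if flagged:
            category_hits[category] += 1
            fossil_hits.update(fossils)
    return fossil_hits, category_hits


steady = [0.010, 0.011, 0.009, 0.010, 0.011, 0.009, 0.010, 0.010]
shifted = [0.010] * 7 + [0.050]  # one branch with a sharply elevated rate
spans = [5.0] * 8

results = [
    ("kinase", ["fossilX", "fossilY"], is_incompatible(shifted, spans)),
    ("kinase", ["fossilX"], is_incompatible(steady, [5.0, -1.0])),
    ("histone", ["fossilY"], is_incompatible(steady, spans)),
]
fossil_hits, category_hits = blame_counts(results)
print(fossil_hits.most_common(), category_hits.most_common())
```

Raw counts like these would still need to be normalized by how often each fossil or gene category is used at all, since a fossil that appears on every tree will also appear on every problematic one.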

As a scientifically interesting application of the pipeline described above (which is under development here), we could do the following. TreeFam gives us gene family trees. By calibrating them and extracting "ratograms", we can look for rate shifts following gene duplications (for example as a result of selective pressure imposed during neo- or subfunctionalization). We might be able to detect higher rates immediately following duplications, which might subsequently taper off, such that "the figure" would show substitution rate as a function of distance (in time) from the duplication event.
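A first step towards that figure is locating the duplication nodes in the nhx trees. The sketch below pulls the `[&&NHX:...]` comments out of a tree string with a regular expression and does not parse the topology. It assumes that duplications are flagged with a `D` tag whose value is `Y` (or `T`), which should be verified against the actual dump; the example tree and the function names are made up for illustration.

```python
import re

# Matches one NHX comment, e.g. [&&NHX:S=Homo_sapiens:D=N]
NHX_COMMENT = re.compile(r"\[&&NHX:([^\]]*)\]")


def nhx_tags(newick):
    """Return one dict of NHX tags per annotated node, in order of appearance."""
    nodes = []
    for body in NHX_COMMENT.findall(newick):
        fields = (field.split("=", 1) for field in body.split(":") if "=" in field)
        nodes.append({key: value for key, value in fields})
    return nodes


def is_duplication(tags):
    """Assumed convention: D=Y or D=T marks a gene duplication node."""
    return tags.get("D") in {"Y", "T"}


# Hypothetical gene tree with species (S), duplication (D) and support (B) tags.
tree = (
    "((a:0.1[&&NHX:S=Homo_sapiens],b:0.2[&&NHX:S=Homo_sapiens])"
    ":0.05[&&NHX:D=Y:B=90],c:0.3[&&NHX:S=Mus_musculus])[&&NHX:D=N];"
)

nodes = nhx_tags(tree)
duplications = [tags for tags in nodes if is_duplication(tags)]
print(len(nodes), len(duplications))
```

Relating each duplication node to the rates on its descendant branches requires the tree topology as well, so a full analysis would need a proper Newick/NHX parser rather than this tag scan.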

Use case:

The sea urchin is a potentially good test case for the above. All modern urchins descend from an ancestor that survived the great extinction at the P-T (Permian-Triassic) boundary. There is a series of urchin fossils after the Mesozoic. Their basic shape changed several times, and the descendants still retain the shapes of their ancestors. Genomic data can be obtained from modern urchins, which allows us to compare the phylogenetic data with the fossils.

Figure: Example phylogeny of urchins and fossils