theislab/moslin

Mapping lineage-traced cells across time points with moslin

Mapping cells across time points recovers differentiation trajectories.

moslin is an algorithm to map lineage-traced single cells across time points. Our algorithm combines gene expression with lineage information at all time points to reconstruct precise differentiation trajectories in complex biological systems. See the tutorial or read the preprint to learn more.

moslin's key applications

  • Probabilistically map cells across time points in lineage-traced single-cell RNA-sequencing (scRNA-seq) studies.
  • Infer ancestors and descendants of rare or transient cell types or states.
  • Combine with CellRank to compute putative driver genes, expression trends, activation cascades, and much more.

Manuscript

Our manuscript is available as a preprint on bioRxiv.

The moslin algorithm

High-diversity lineage relationships can be recorded using evolving barcoding systems (Wagner and Klein, Nat Rev Genet 2020); when applied in vivo, these record independent lineage relationships in each individual. To infer the molecular identity of putative ancestor states, samples need to be related from early to late time points.

Independent clonal evolution.

Mapping independent clonal evolution: evolving lineage recording systems, based on, e.g., Cas9-induced genetic scars (Alemany et al., Nature 2018, Raj et al., Nature Biotech 2018, Spanjaard et al., Nature Biotech 2018), record independent clonal evolution in each individual.

In our setting, each individual corresponds to a different time point, and we wish to relate cells across time to infer precise differentiation trajectories (Forrow and Schiebinger, Nature Comms 2021). While gene expression is directly comparable across time points, lineage information is not: individual lineage trees may be reconstructed at each time point (Alemany et al., Nature 2018, Raj et al., Nature Biotech 2018, Spanjaard et al., Nature Biotech 2018, Jones et al., Genome Biology 2020), but these do not uncover the molecular identity of putative ancestors or descendants.

moslin combines gene expression with lineage information in a Fused Gromov-Wasserstein objective function.

The moslin algorithm: the grey outline represents a simplified state manifold, dots and triangles illustrate early and late cells, respectively, and colors indicate cell states.

Critically, moslin uses two sources of information to map cells across time in an optimal transport (OT) formulation (Peyré and Cuturi, arXiv 2019):

  • gene expression: directly comparable across time points, included in a Wasserstein (W)-term (Schiebinger et al., Cell 2019). The W-term compares individual early and late cells and seeks to minimize the distance cells travel in phenotypic space.
  • lineage information: not directly comparable across time points, included in a Gromov-Wasserstein (GW)-term (Nitzan et al., Nature 2019, Peyré et al., PMLR 2016). The GW-term compares pairs of early cells with pairs of late cells and seeks to maximize lineage concordance.
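
The two terms above can be made concrete on toy data. The following is a minimal sketch, not moslin's implementation: it assumes a squared-difference loss for the GW-term, and all names (`w_term`, `gw_term`, the toy cost matrices and couplings) are hypothetical and chosen for illustration only.

```python
def w_term(coupling, expr_cost):
    """Linear (Wasserstein) cost: sum_ij C[i][j] * P[i][j]."""
    return sum(
        expr_cost[i][j] * coupling[i][j]
        for i in range(len(coupling))
        for j in range(len(coupling[0]))
    )


def gw_term(coupling, lin_early, lin_late):
    """Quadratic (Gromov-Wasserstein) cost, assuming a squared loss:
    sum_ijkl (D_early[i][k] - D_late[j][l])**2 * P[i][j] * P[k][l]."""
    n, m = len(coupling), len(coupling[0])
    total = 0.0
    for i in range(n):
        for j in range(m):
            for k in range(n):
                for l in range(m):
                    diff = lin_early[i][k] - lin_late[j][l]
                    total += diff * diff * coupling[i][j] * coupling[k][l]
    return total


# Toy data (hypothetical): two early cells, two late cells.
expr_cost = [[0.0, 1.0], [1.0, 0.0]]  # expression distance, early x late
lin_early = [[0.0, 2.0], [2.0, 0.0]]  # lineage distance among early cells
lin_late = [[0.0, 2.0], [2.0, 0.0]]   # lineage distance among late cells

matched = [[0.5, 0.0], [0.0, 0.5]]      # each early cell maps to one late cell
uniform = [[0.25, 0.25], [0.25, 0.25]]  # mass spread over all pairs

print(w_term(matched, expr_cost), gw_term(matched, lin_early, lin_late))
print(w_term(uniform, expr_cost), gw_term(uniform, lin_early, lin_late))
```

In this toy case the matched coupling has zero cost in both terms, while the uniform coupling is penalized by both: it moves cells further in expression space and it pairs cells whose within-time-point lineage distances disagree.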

We combine both sources of information in a Fused Gromov-Wasserstein (FGW) problem (Vayer et al., Algorithms 2020), a type of OT-problem. Additionally, we use entropic regularization (Cuturi 2013) to speed up computations and to improve the statistical properties of the solution (Peyré and Cuturi, arXiv 2019).
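
To illustrate what entropic regularization does, here is a minimal sketch of Sinkhorn iterations for the linear W-term alone. It is not the solver moslin uses: the full FGW problem also contains the quadratic GW-term and is solved through moscot. The function name `sinkhorn`, the toy cost matrix, and the parameter values are assumptions made for this example.

```python
import math


def sinkhorn(cost, a, b, epsilon=0.5, n_iter=500):
    """Entropy-regularized OT between marginals a and b via Sinkhorn scaling."""
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-C / epsilon); larger epsilon gives a flatter kernel.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows to match a and columns to match b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Coupling P = diag(u) K diag(v)
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem (hypothetical): 2 early cells, 3 late cells, uniform marginals.
cost = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
a = [0.5, 0.5]
b = [1.0 / 3, 1.0 / 3, 1.0 / 3]

sharp = sinkhorn(cost, a, b, epsilon=0.5)
diffuse = sinkhorn(cost, a, b, epsilon=5.0)
print(sharp[0][2], diffuse[0][2])  # mass sent along the most expensive pair
```

A larger `epsilon` spreads mass more evenly across cell pairs, including expensive ones, while a smaller value concentrates it on low-cost pairs. Both couplings respect the prescribed marginals.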

Code, tutorials and data

Under the hood, moslin relies on moscot to solve the optimal transport problem of mapping lineage-traced cells across time points. Specifically, we implement moslin via the LineageClass, demonstrate a use case in our tutorial, and showcase how to work with tree distances in an example. Downstream analysis, such as visualizing the inferred cell-cell transitions, is available via moscot's API.
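
As a rough idea of what "tree distances" means here, the sketch below computes pairwise path lengths between the leaves of a small lineage tree. Such a matrix is the kind of within-time-point lineage cost that can enter the GW-term. This is an illustrative sketch only: it assumes the tree is stored as a child-to-parent dictionary with unit edge lengths, and the helper names are hypothetical. moscot's own tree handling may differ.

```python
def path_to_root(node, parent):
    """Return the nodes from `node` up to the root (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(a, b, parent):
    """Number of edges on the path between a and b in a rooted tree."""
    depth_from_a = {n: d for d, n in enumerate(path_to_root(a, parent))}
    for depth_from_b, n in enumerate(path_to_root(b, parent)):
        if n in depth_from_a:  # first shared ancestor is the lowest common one
            return depth_from_a[n] + depth_from_b
    raise ValueError("nodes are not in the same tree")


def leaf_distance_matrix(leaves, parent):
    """Pairwise tree distances between the given leaves (observed cells)."""
    return [[tree_distance(a, b, parent) for b in leaves] for a in leaves]


# Toy lineage tree (hypothetical): root -> n1 -> {c1, c2}, root -> n2 -> c3
parent = {"n1": "root", "n2": "root", "c1": "n1", "c2": "n1", "c3": "n2"}
leaves = ["c1", "c2", "c3"]

print(leaf_distance_matrix(leaves, parent))
# → [[0, 2, 4], [2, 0, 4], [4, 4, 0]]
```

Sister cells `c1` and `c2` are closer to each other than either is to `c3`, which branched off at the root.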

Raw published data is available from the Gene Expression Omnibus (GEO) under accession codes:

Additionally, we simulated data using LineageOT and TedSim. Processed data is available on figshare. To ease reproducibility, our data examples can also be accessed through moscot's dataset interface.

Utility functions required for the C. elegans analysis are available in the moslin_utils mini-package in this repo.

Reproducibility

To ease reproducibility of our results, we've organized this repository along the categories below. Each folder contains the notebooks and scripts necessary to reproduce the corresponding analysis. We read data from the data folder and write figures to the figures folder. Please open an issue should you experience difficulties reproducing any result.

Results

Application                           | Folder path
Simulated data (Fig. 2)               | analysis/simulations/
C. elegans embryogenesis (Fig. 3)     | analysis/packer_c_elegans/
Zebrafish heart regeneration (Fig. 4) | analysis/hu_zebrafish_linnaeus/

The concept figures in this README have been created with BioRender.