RNA sequencing (RNA-seq) measures gene expression and has driven medical innovation and discovery in recent years. The approach is often applied to bulk tissue containing tens of thousands to millions of cells, which prevents direct assessment of the individual cells within the tissue. As a result, sequencing is now also applied to individual cells, a technique called single-cell RNA sequencing (scRNA-seq). The opportunities arising from scRNA-seq are enormous, but its data pose new and unique challenges:
- Sparsity and dropout: scRNA-seq data are sparse, with a high percentage of observed zeros. These zeros come from two sources: biological zeros, where a gene is simply not expressed (the count is truly zero), and technical zeros, where a gene is expressed but not detected in the experiment, called "dropout."
- Batch effect: Operational constraints mean that data for large-scale studies are generated in separate batches (at different times or in different laboratories). The resulting differences between batches are not biological; they are technical artifacts of processing cells from one batch separately from cells in another.
- Unlabeled samples: Cell types are not known in advance and must be inferred (labeled) during the analysis.
- Curse of dimensionality: In a typical single-cell experiment, each cell (sample) could have more than 20,000 genes (features).
- Count data: The count nature of the data has to be considered in the analyses; i.e., the measurements are integers and not floating point numbers.
- Scalability: Analyzing the ever-growing number of cells (samples), ranging from thousands to millions in large projects like the Human Cell Atlas, is a major challenge in scRNA-seq data analysis. Many statistical and machine learning methods developed for scRNA-seq fail to scale to more than 30,000 samples.
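The sparsity and count-data challenges above can be made concrete with a toy simulation. The sketch below (all parameter values are illustrative, not taken from any real dataset) draws negative binomial counts and then zeroes out a random subset to mimic technical dropout, then reports the resulting fraction of zeros:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 500, 2000

# Toy scRNA-seq count matrix: negative binomial counts with extra
# "dropout" zeros. mu/theta/dropout values here are illustrative.
mu, theta, dropout = 2.0, 0.5, 0.3
counts = rng.negative_binomial(n=theta, p=theta / (theta + mu),
                               size=(n_cells, n_genes))
counts[rng.random((n_cells, n_genes)) < dropout] = 0  # technical zeros

print(f"fraction of zeros: {np.mean(counts == 0):.2f}")
```

Even with a modest dropout rate, well over half of the entries end up being zeros, which is why models of scRNA-seq data must treat zeros explicitly rather than as ordinary small counts.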
Applying the appropriate computational and statistical techniques is essential for efficient use and correct interpretations of the data. Consequently, many statistical and machine learning models have been developed to analyze scRNA-seq data.
Here, I have re-engineered the implementation of the scVI tool, developed at the University of California, Berkeley, using Pyro, PyTorch Lightning, and PyTorch. The result is fewer lines of code, greater efficiency, and improved reliability. scVI is a general-purpose dimensionality-reduction and data-imputation tool that can be trained efficiently on large datasets.
scVI is a variational autoencoder whose latent variables follow a Normal distribution; it estimates the parameters of a Zero-Inflated Negative Binomial (ZINB) distribution through a nonlinear transformation of the data. Lopez et al. showed that scVI captures meaningful biological information through this nonlinear transformation and scales from tens of thousands of cells to a million.
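To make the decoder side of this architecture concrete, here is a minimal PyTorch sketch of a network that maps a latent vector to the three ZINB parameters. The layer sizes, activations, and class name are illustrative assumptions, not scVI's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZINBDecoderSketch(nn.Module):
    """Minimal sketch: map a latent z to ZINB parameters (mu, theta, pi).

    Layer sizes and activations are illustrative, not scVI's exact choices.
    """
    def __init__(self, n_latent=10, n_genes=2000, n_hidden=128):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_latent, n_hidden), nn.ReLU())
        self.mu_head = nn.Linear(n_hidden, n_genes)     # NB mean
        self.theta_head = nn.Linear(n_hidden, n_genes)  # inverse dispersion
        self.pi_head = nn.Linear(n_hidden, n_genes)     # dropout logit

    def forward(self, z):
        h = self.hidden(z)
        mu = F.softplus(self.mu_head(h))        # constrained > 0
        theta = F.softplus(self.theta_head(h))  # constrained > 0
        pi = torch.sigmoid(self.pi_head(h))     # constrained to (0, 1)
        return mu, theta, pi

z = torch.randn(32, 10)  # a batch of 32 latent vectors
mu, theta, pi = ZINBDecoderSketch()(z)
print(mu.shape, theta.shape, pi.shape)  # each is (32, 2000)
```

The softplus and sigmoid activations keep each parameter in its valid range, which is the standard trick for parameterizing count distributions with neural networks.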
In scVI, gene expression counts are modeled with the ZINB distribution:
$$f_{\mathrm{ZINB}}(y;\mu ,\theta ,\pi ) = \pi \,\delta_0(y) + (1 - \pi )\, f_{\mathrm{NB}}(y;\mu ,\theta ),\quad \forall y \in {\mathbb N}$$
where $\delta_0$ is a point mass at zero, $f_{\mathrm{NB}}(y;\mu ,\theta )$ is the negative binomial density with mean $\mu$ and inverse dispersion $\theta$, and $\pi$ is the dropout probability. The point mass captures the technical zeros, while the negative binomial component models the counts (including biological zeros). scVI uses a Bayesian approach in which the probability of observing the data is maximized; because the marginal likelihood is intractable, it is approximated by maximizing the evidence lower bound (ELBO) with variational inference.
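The ZINB density in the equation above can be checked numerically. This sketch implements it with SciPy's negative binomial, using the standard mapping from the mean/dispersion parameterization $(\mu, \theta)$ to SciPy's $(n, p)$; the parameter values are illustrative:

```python
import numpy as np
from scipy.stats import nbinom

def zinb_pmf(y, mu, theta, pi):
    """ZINB pmf with mean mu, inverse dispersion theta, dropout prob. pi.

    scipy's nbinom(n, p) maps to (mu, theta) via n = theta and
    p = theta / (theta + mu).
    """
    nb = nbinom.pmf(y, theta, theta / (theta + mu))
    return pi * (y == 0) + (1.0 - pi) * nb

y = np.arange(10_000)
p = zinb_pmf(y, mu=5.0, theta=2.0, pi=0.1)
print(p[0], p.sum())  # inflated mass at zero; probabilities sum to ~1
```

Note how the zero-inflation term only contributes at $y = 0$, so the mixture remains a valid probability mass function for any $\pi \in [0, 1]$.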
Lopez et al. used stochastic gradient descent to optimize the objective function and fit the model parameters. Each update therefore touches only a minibatch rather than the entire dataset, which also enables GPU acceleration for parameter estimation.
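The minibatch idea is the same as in any stochastic-gradient training loop. This toy PyTorch sketch (the model and loss are placeholders, not scVI's ELBO) shows why memory use stays independent of dataset size:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Each optimization step sees only one minibatch, so memory use does not
# grow with the dataset, and the same loop runs on a GPU by moving the
# model and each batch with .to(device).
X = torch.randn(10_000, 50)  # toy data standing in for expression counts
loader = DataLoader(TensorDataset(X), batch_size=128, shuffle=True)

model = torch.nn.Linear(50, 10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for (batch,) in loader:
    opt.zero_grad()
    loss = model(batch).pow(2).mean()  # placeholder loss, not the ELBO
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```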
scVI addresses each of the challenges listed above:
- Dropout and the count nature of the data: by using the ZINB distribution described above.
- Batch effect: by conditioning the model on the batch variable $s_i$.
- Unlabeled samples: by running K-means on the latent variable $z_i$ for clustering.
- Curse of dimensionality: by learning a representation of the data through $z_i$, a low-dimensional latent variable following a Gaussian distribution.
- Scalability: by the stochastic gradient-descent optimization procedure.
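The clustering step on the latent variable can be sketched in a few lines. Here the "latent" points are synthetic Gaussian blobs standing in for the encoder output $z_i$; in practice $z$ would come from the trained model:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the latent variable z_i: three well-separated Gaussian
# blobs in a 10-dimensional latent space (illustrative, not real data).
rng = np.random.default_rng(0)
z = np.concatenate(
    [rng.normal(loc=c, scale=0.3, size=(100, 10)) for c in (-3, 0, 3)]
)

# K-means assigns each "cell" a cluster label, playing the role of a
# cell-type label for the unlabeled samples.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(z)
print(np.bincount(labels))  # roughly 100 cells per cluster
```

Because clustering happens in the low-dimensional latent space rather than the 20,000-dimensional gene space, it is both cheaper and less affected by the curse of dimensionality.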
The Colab notebook contains a short report of my implementation, interpretations, and results.