Dimension reduction is a common and critical first step in the analysis of high-throughput single-cell RNA sequencing data. tGPLVM is a nonparametric, generative model for nonlinear manifold learning: a flexible, nearly assumption-free model that does not require setting parameters a priori (e.g. number of dimensions, perplexity) and provides uncertainty estimates for sample mappings. tGPLVM can be used to visualize high-dimensional data or as part of a pipeline for cell type identification or pseudotime reconstruction.
We provide a script for fitting the model with Black Box Variational Inference for speed and scalability. A batch learning implementation is also provided for larger datasets that need to be fit under memory restrictions.
tGPLVM is implemented in Python 2.7 with the following packages:
- numpy 1.14.5
- pandas 0.23.3
- h5py 2.8.0
- tensorflow 1.6.0
- edward 1.3.5
- sklearn 0.19.2
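As a quick sanity check (a minimal sketch, assuming the packages above are already installed in a Python 2.7 environment), the installed versions can be verified before running the scripts:

```python
# Minimal sketch: confirm the pinned dependencies listed above are importable
# and report their installed versions (Python 2.7).
from __future__ import print_function

import numpy, pandas, h5py, sklearn
import tensorflow as tf
import edward as ed

for name, mod in [("numpy", numpy), ("pandas", pandas), ("h5py", h5py),
                  ("sklearn", sklearn), ("tensorflow", tf), ("edward", ed)]:
    print(name, mod.__version__)
```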
Input: a numpy array or sparse CSR/CSC matrix of scRNA-seq counts (or other data types), with N cells (samples) as rows and p genes (features) as columns, loaded into y_train directly in the code.
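For illustration only (the file name and the dense/sparse choice below are assumptions, not files shipped with this repository), the input might be prepared like this before being assigned to y_train in the script:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical cells-by-genes count table: N cells as rows, p genes as columns.
counts = pd.read_csv("counts.csv", index_col=0)

# Dense input: an N x p numpy array, assigned to y_train in the script.
y_train = counts.values.astype(np.float64)

# Alternatively, a sparse CSR input (pair this with --sparse True):
# y_train = csr_matrix(counts.values)

print(y_train.shape)  # (N cells, p genes)
```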
Options: the following parameters can be adjusted in the script to control inference:
- Degrees of freedom (--df) - default: 4
- Use t-distribution error model (otherwise normal error) (--T) - default: True
- Initial number of dimensions (--Q) - default: 3
- Kernel functions:
  - Matern 1/2, 3/2, 5/2 (--m12, --m32, --m52) - default: True
  - Periodic (--per_bool) - default: False
- Number of inducing points (--m) - default: 30
- Batch size (--M) - default: 250
- Max iterations (--iterations) - default: 5000
- Save frequency (--save_freq) - default: 250
- Sparse data type (CSC or CSR) (--sparse) - default: False
- PCA initialization (otherwise random initialization) (--pca_init) - default: True
- Output directory (--out) - default: ./test
Output: an HDF5 file containing:
- Latent mapping posterior (mean and variance)
- Gene-specific noise
- Kernel hyperparameters (variance, lengthscale)
- Inducing points in latent and high-dimensional space
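As a hedged sketch of inspecting the results (the output file name and dataset names below are assumptions; list the keys to see what the script actually saved under --out):

```python
from __future__ import print_function
import h5py

# Hypothetical path: the HDF5 file written to the --out directory.
with h5py.File("./test/tGPLVM_fit.hdf5", "r") as f:
    # List everything the script saved (latent posterior, gene-specific noise,
    # kernel hyperparameters, inducing points) under its actual dataset names.
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape)
    f.visititems(show)

    # For example, if the latent posterior mean is stored under "X_mean":
    # X_mean = f["X_mean"][:]
```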
Example:
With Test_3_Pollen.h5 as input, the following command runs 250 iterations on the full dataset:
python tGPLVM-batch.py --Q 2 --M 249 --p 6982 --m12 True --m32 True --m52 True --iterations 250 --out ./test
We provide input code for two other datasets:
- tapio_tcell_tpm.txt - data from the Lönnberg et al. GPfates study, available at https://github.com/Teichlab/GPfates
- 1M_neurons_filtered_gene_bc_matrices_h5.h5 - 1 million 10x mouse brain cells, available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons. Make sure to set --sparse True for this data; see the sketch below for one way to load it.
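For the 10x HDF5 input, a sparse matrix can be assembled along these lines (a minimal sketch: the genome group name "mm10" and the genes-by-cells orientation follow the 10x CellRanger HDF5 convention and are assumptions here, not specified in this repository):

```python
import h5py
import scipy.sparse as sp

# Read the 10x gene-barcode matrix (CSC layout: genes x cells) from the HDF5 file.
with h5py.File("1M_neurons_filtered_gene_bc_matrices_h5.h5", "r") as f:
    g = f["mm10"]  # genome group name; an assumption for this dataset
    mat = sp.csc_matrix((g["data"][:], g["indices"][:], g["indptr"][:]),
                        shape=tuple(g["shape"][:]))

# Transpose to N cells x p genes and convert to CSR for use with --sparse True.
y_train = mat.T.tocsr()
print(y_train.shape)
```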
The final data from the paper is available here: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/cd34