-
Notifications
You must be signed in to change notification settings - Fork 3
/
phemd.Rmd
246 lines (191 loc) · 9.98 KB
/
phemd.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
---
title: "Tutorial on using PhEMD to analyze multi-sample single-cell experiments"
author: "William Chen"
date: "October 20, 2018"
output: BiocStyle::html_document
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{PhEMD vignette}
\usepackage[UTF-8]{inputenc}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Overview
PhEMD is a package for comparing multiple heterogeneous single-cell samples to
one another. It currently does so by first defining cell subtypes and relating
them to one another using Monocle 2. It then computes Earth Mover's Distance
(EMD) between each pair of samples, incorporating information about intrinsic
cell subtype dissimilarity (i.e. manifold distance between cell subtypes) and
differences between samples with respect to relative abundance of each cell
subtype. PhEMD uses these pairwise distances to construct a network of
single-cell samples that may be visually related in 2D or 3D using a diffusion
map and partitioned to identify groups of similar samples.
## 1. Installation
PhEMD requires R version >= 3.4.0 (recommended 3.5.0+), Bioconductor
version >= 3.5 (recommended 3.7+), and Monocle 2 version >= 2.4.0
(recommended 2.8.0)
###Install from Bioconductor
```{r echo=T, results = 'hide',message=F, warning=F, eval=F}
BiocManager::install("phemd")
```
###Install from Github (though direct installation from Bioconductor is preferred)
```{r echo=T, results = 'hide',message=F, warning=F, eval=F}
library(devtools)
install_github("wschen/phemd")
```
###Load library after installation
```{r echo=T, results = 'hide', message=F, warning=F}
library('phemd')
library('monocle')
```
## 2. Preparing data for cell state definition and embedding
PhEMD expects single-cell data to be represented as an R list of samples. Each
sample (i.e. list element) is expected to be a matrix of dimension *num_cells*
x *num_markers*, where markers may represent genes or cytometry protein markers.
For this vignette, we will be demonstrating our analysis pipeline on a melanoma
dataset consisting of tumor-infiltrating immune cell scRNA-seq data (selected
genes) that were log-transformed following TPM-normalization (first published
by Tirosh et al., 2016).
We first start by creating a PhEMD data object, specifying the multi-sample
expression data (R list), marker names (i.e. column names of the data matrices
in the list of expression data), and sample names (in the order they appear in
the list of expression data).
```{r echo=T, results = 'hide', message=F, warning=F}
load('melanomaData.RData')
myobj <- createDataObj(all_expn_data, all_genes, as.character(snames))
```
We can optionally remove samples in the PhEMD data object that have fewer than
min_sz number of cells as follows:
```{r echo=T}
myobj <- removeTinySamples(myobj, min_sz = 20)
```
Note that samples that don't meet the meet the minimum cell yield criteria are
removed from rawExpn(myobj) and from the list of sample names in
sampleNames(myobj).
Next, aggregate data from all samples into a matrix that is stored in the PhEMD
data object (in slot 'data_aggregate'). This aggregated data will then be used
for initial cell subtype definition and embedding. If there are more cells
collectively in all samples than max_cells, an equal number of cells from each
sample will be subsampled and stored in pooledCells(myobj).
```{r echo=T, results = 'hide'}
myobj <- aggregateSamples(myobj, max_cells=12000)
```
## 3. Generate Monocle 2 cell embedding with cell state definitions
Now that we have aggregated single-cell data from all samples, we are ready to
perform cell subtype definition and dimensionality reduction to visually and
spatially relate cells and cell subtypes. For this, we use Monocle 2. Before we
begin, we first perform feature selection by selecting 44 important genes.
Suggestions on how to choose important genes can be found here: http://cole-trapnell-lab.github.io/monocle-release/docs/#trajectory-step-1-choose-genes-that-define-a-cell-s-progress
```{r echo=T, results = 'hide'}
myobj <- selectFeatures(myobj, selected_genes)
```
We are now ready to generate a Monocle 2 embedding. Our *embedCells()* function
is a wrapper function for the *reduceDimension()* function in Monocle 2. For our
example dataset, we specify the expression distribution model as *'gaussianff'*
as is recommended in the Monocle 2 tutorial for log-transformed scRNA-seq TPM
values (http://cole-trapnell-lab.github.io/monocle-release/docs/#choosing-a-distribution-for-your-data-required).
*'negbinomial_sz'* is the recommended data type for most unnormalized scRNA-seq
data (raw read counts) and *'gaussianff'* is recommended for log-transformed
data or arcsin-transformed mass cytometry data. See above link for more details.
Additional parameters may be passed to Monocle 2 *reduceDimension()* as optional
named parameters in *embed_cells()*. We found that Monocle 2 is robust to a
range of parameters. Sigma can be thought of as a "noise" parameter and we
empirically found that sigma in the range of [0.01, 0.1] often works well for
log-transformed scRNA-seq data or normalized CyTOF data. Greater values of sigma
generally result in fewer total number of clusters. See Monocle 2 publication
(Qiu et al., 2017) for additional details on parameter selection.
```{r echo=T, results = 'hide', message=F, warning=F}
# generate 2D cell embedding and cell subtype assignments
myobj <- embedCells(myobj, data_model = 'gaussianff', pseudo_expr=0, sigma=0.02)
# generate pseudotime ordering of cells
myobj <- orderCellsMonocle(myobj)
```
The result of the code above is a Monocle 2 object stored in
*monocleInfo(myobj)*. This object contains cell subtype and pseudotime
assignments for each cell in the aggregated data matrix (stored in
*pooledCells(myobj)*). A 2D embedding of these cells has also been generated.
We can visualize the embedding by writing them to file in this way:
```{r echo=T, results = 'hide', fig.width=8, fig.height=4}
cmap <- plotEmbeddings(myobj, cell_model='monocle2')
```
To visualize the expression profiles of the cell subtypes, we can plot a heatmap
and save to file as such:
```{r echo=T, results = 'hide', fig.width=8, fig.height=6}
plotHeatmaps(myobj, cell_model='monocle2', selected_genes=heatmap_genes)
```
## 4. Deconvolute single-cell samples and compare using Earth Mover's Distance
Now that we have identified a comprehensive set of cell subtypes across all
single-cell samples and related them in a low-dimensional embedding by
aggregating cells from all samples, we want to perform deconvolution to
determine the abundance of each cell subtype on a per sample basis. To do so,
we call this function:
```{r echo=T, results = 'hide'}
# Determine cell subtype breakdown of each sample
myobj <- clusterIndividualSamples(myobj)
```
The results of this process are stored in *celltypeFreqs(myobj)*. Row
*i* column *j* represents the fraction of all cells in sample *i* assigned to
cell subtype *j*.
To compare single-cell samples, we use Earth Mover's Distance, which is a metric
that takes into account both the difference in relative frequencies of matching
cell subtypes (e.g. % of all cells in each sample that are CD8+ T-cells) and the
dissimilarity of the cell subtypes themselves (e.g. intrinsic dissimilarity
between CD8+ and CD4+ T-cells). To compute the intrinsic dissimilarity between
cell subtypes, we call the following function:
```{r echo=T, results = 'hide'}
# Determine (dis)similarity of different cell subtypes
myobj <- generateGDM(myobj)
```
*generateGDM()* stores the pairwise dissimilarity (i.e. "ground-distance" or
"tree-distance") between cell subtypes in *GDM(myobj)*.
We are now ready to compare single-cell samples using EMD. To do so, we simply
call the function *compareSamples()*:
```{r echo=T, results = 'hide'}
# Perform inter-sample comparisons using EMD
my_distmat <- compareSamples(myobj)
```
*compareSamples()* returns a distance matrix representing the pairwise EMD
between single-cell samples; *my_distmat[i,j]* represents the dissimilarity
between samples *i* and *j* (i.e. samples represented by rows *i* and *j* in
celltypeFreqs(myobj)). We can use this distance matrix to identify similar
groups of samples as such:
```{r echo=T, results = 'hide'}
## Identify similar groups of inhibitors
group_assignments <- groupSamples(my_distmat, distfun = 'hclust', ncluster=5)
```
## 5. Visualize single-cell samples based on PhEMD-based similarity
We can also use the PhEMD-based distance matrix to generate an embedding of
single-cell samples, colored by group assignments.
```{r echo=T, results = 'hide', fig.width=8, fig.height=7, fig.keep='none'}
dmap_obj <- plotGroupedSamplesDmap(my_distmat, group_assignments, pt_sz = 1.5)
```
```{r echo=F, results = 'hide', fig.keep='none'}
plotGroupedSamplesDmap(my_distmat, group_assignments, pt_sz = 1.5)
```
```{r echo=F, results = 'hide', fig.width=8, fig.height=7}
plotGroupedSamplesDmap(my_distmat, group_assignments, pt_sz = 1.5)
```
To retrieve the cell subtype distribution on a per-sample basis, use the
following function. Histograms can be subsequently plotted for a given sample
as desired.
```{r echo=T, results = 'hide'}
# Plot cell subtype distribution (histogram) for each sample
sample.cellfreqs <- getSampleHistsByCluster(myobj, group_assignments, cell_model='monocle2')
```
To plot cell subtype histograms summarizing groups of similar samples (bin-wise
mean of each cell subtype across all samples assigned to a particular group),
use the following plotting function:
```{r echo=T, results = 'hide', fig.width=10, fig.height=2.5}
# Plot representative cell subtype distribution for each group of samples
plotSummaryHistograms(myobj, group_assignments, cell_model='monocle2', cmap,
ncol.plot = 5, ax.lab.sz=1.3, title.sz=2)
```
To plot cell yield of each samples as a barplot, use the following function:
```{r echo=T, results='hide', fig.width=8, fig.height=5.5}
# Plot cell yield of each experimental condition
plotCellYield(myobj, group_assignments, font_sz = 0.7, w=8, h=5)
```
```{r echo=T}
sessionInfo()
```