Robin Browaeys 2020-10-08
In this vignette, you can learn how to use synthvisium to generate synthetic Visium-like spot data. First we will demonstrate it by using a toy scRNAseq dataset as input (internal Seurat object of downsampled brain cortex data). Later on, we will demonstrate it on a real scRNAseq dataset (brain cortex) that can be downloaded from Zenodo. Using this dataset as input for your own benchmark is recommended.
library(Seurat)
library(synthvisium)
library(dplyr)
First we will explore the toy scRNAseq dataset (seurat_obj
) and show
where the cell type information is stored in this object (subclass
column of the meta data)
# scrnaseq data
DimPlot(seurat_obj, group.by = "subclass", label = T)
We provide one function to generate a dataset and its meta data. As
example here, we will generate a dataset consisting of 5 different
regions with each between 5 or 20 spots (and on average 30000 counts in
total per spot). The dataset type that will be created here is called
artificial_diverse_distinct
, and is one of the 13 different
implemented options. As can be read in the documentation of the function
(?generate_synthetic_visium
), here each artifical prior region has a
distinct set of cell types, and cell type frequencies are randomly
different for each cell type in each region.
synthetic_visium_data = generate_synthetic_visium(seurat_obj = seurat_obj, dataset_type = "artificial_diverse_distinct", clust_var = "subclass", n_regions = 5, n_spots_min = 5, n_spots_max = 20, visium_mean = 30000, visium_sd = 8000)
Now we will go over the different components of this object:
First, we have the gene expression counts for each spot in the counts
sublist (genes in rows, spots in columns):
synthetic_visium_data$counts %>% as.matrix() %>% .[1:5,1:5]
## priorregion1_spot_1 priorregion1_spot_2 priorregion1_spot_3 priorregion1_spot_4 priorregion1_spot_5
## Vip 5287 11155 6689 4996 4294
## Sst 0 0 0 6 0
## Npy 2 0 0 0 1
## Tac2 0 1822 815 0 154
## Crh 11 920 1300 83 163
Then, you can also see the cell type composition (absolute numbers of
cells and relative proportions) of each generated spot in the
spot_composition
relative_spot_composition
sublists (names and prior
regions of the spot in the name
and region
columns):
synthetic_visium_data$spot_composition %>% .[1:10,]
## L2.3.IT L4 L5.IT L5.PT L6.CT L6.IT L6b Lamp5 NP Pvalb Sst Vip name region
## 1 0 0 0 0 0 0 0 0 7 0 0 1 priorregion1_spot_1 priorregion1
## 2 0 0 0 0 0 0 0 0 1 0 0 1 priorregion1_spot_2 priorregion1
## 3 0 0 0 0 0 0 0 0 7 0 0 1 priorregion1_spot_3 priorregion1
## 4 0 0 0 0 0 0 0 0 3 0 0 1 priorregion1_spot_4 priorregion1
## 5 0 0 0 0 0 0 0 0 6 0 0 3 priorregion1_spot_5 priorregion1
## 6 0 0 0 0 0 0 0 0 2 0 0 0 priorregion1_spot_6 priorregion1
## 7 0 0 0 0 0 9 0 0 0 0 0 0 priorregion2_spot_1 priorregion2
## 8 0 0 0 0 0 8 0 0 0 0 0 0 priorregion2_spot_2 priorregion2
## 9 0 0 0 0 0 8 0 0 0 0 0 0 priorregion2_spot_3 priorregion2
## 10 0 0 0 0 0 8 0 0 0 0 0 0 priorregion2_spot_4 priorregion2
synthetic_visium_data$relative_spot_composition %>% .[1:10,]
## L2.3.IT L4 L5.IT L5.PT L6.CT L6.IT L6b Lamp5 NP Pvalb Sst Vip name region
## 1 0 0 0 0 0 0 0 0 0.8750000 0 0 0.1250000 priorregion1_spot_1 priorregion1
## 2 0 0 0 0 0 0 0 0 0.5000000 0 0 0.5000000 priorregion1_spot_2 priorregion1
## 3 0 0 0 0 0 0 0 0 0.8750000 0 0 0.1250000 priorregion1_spot_3 priorregion1
## 4 0 0 0 0 0 0 0 0 0.7500000 0 0 0.2500000 priorregion1_spot_4 priorregion1
## 5 0 0 0 0 0 0 0 0 0.6666667 0 0 0.3333333 priorregion1_spot_5 priorregion1
## 6 0 0 0 0 0 0 0 0 1.0000000 0 0 0.0000000 priorregion1_spot_6 priorregion1
## 7 0 0 0 0 0 1 0 0 0.0000000 0 0 0.0000000 priorregion2_spot_1 priorregion2
## 8 0 0 0 0 0 1 0 0 0.0000000 0 0 0.0000000 priorregion2_spot_2 priorregion2
## 9 0 0 0 0 0 1 0 0 0.0000000 0 0 0.0000000 priorregion2_spot_3 priorregion2
## 10 0 0 0 0 0 1 0 0 0.0000000 0 0 0.0000000 priorregion2_spot_4 priorregion2
If you want to see which cell type is present in which region, and with
which prior frequency, you can look at the gold_standard_priorregion
sublist.
synthetic_visium_data$gold_standard_priorregion %>% head()
## # A tibble: 6 x 4
## prior_region celltype freq present
## <chr> <chr> <dbl> <lgl>
## 1 priorregion1 Vip 0.143 TRUE
## 2 priorregion1 NP 0.857 TRUE
## 3 priorregion1 Sst 0 FALSE
## 4 priorregion1 L4 0 FALSE
## 5 priorregion1 L5 IT 0 FALSE
## 6 priorregion1 L6 IT 0 FALSE
We also provide a data frame with properties of this specific dataset. This might be useful in a benchmark to check the influence of these properties on performance.
synthetic_visium_data$dataset_properties
## # A tibble: 1 x 10
## dataset_id dataset_type real_artificial uniform_diverse distinct_overlap dominant_celltype missing_celltype rare_celltype missing_celltype_sc real_region_var
## <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 artificial_diverse_distinct1 artificial_diverse_distinct artificial diverse distinct FALSE FALSE FALSE NA NA
Because it could be possible that a cell type will belong to all regions, we will make the synthetic data in such a way that we have a ‘mock region’ as well - no cell type should be predicted to belong to this region. Adding a mock region is thus recommended when the synthetic data will be used to evaluate spatial/region annotation of cells, not when evaluating deconvolution tools. The mock region is generated similar as the real regions, but the input cell type frequencies are the same for each cell type, and after generation of the counts, the gene names are shuffled such that cellular identities in this mock region are lost.
synthetic_visium_data = generate_synthetic_visium(seurat_obj = seurat_obj, dataset_type = "artificial_diverse_distinct", clust_var = "subclass", n_regions = 5, n_spots_min = 5, n_spots_max = 20, visium_mean = 30000, visium_sd = 8000, add_mock_region = TRUE)
Generating Visium-like spot data from toy input scRNAseq data that contains information of the anatomical regions of cells.
Just pooling cells from different cell types as done in the previous example could have the (theoretical) disadvantage that spatial information inherently encoded in the scRNAseq data is lost. This is because cells extracted from a similar anatomical region could be influenced by the same microenvironmental signal, resulting in a shared spatial gene expression pattern for cells in the same region, which could be reflected in the input scRNAseq data. In case you know the anatomical origin from cells in the input scRNAseq data, it could be useful to generate synthetic Visium-spot data by only sampling cells in one spot if those cells belong to the same anatomical region.
In the brain cortex dataset, we have this information stored in the meta
data (brain_subregion
column)
# scrnaseq data
DimPlot(seurat_obj, group.by = "brain_subregion")
As the second example, we will now generate a dataset consisting of
artificial “L1 - L2/3 - L4 - L5 - L6” regions with each between 5 or 20
spots (and on average 30000 counts in total per spot). The dataset type
that will be created here is called real
, and is one of the 13
different implemented options. As can be read in the documentation of
the function (?generate_synthetic_visium
), this dataset type will
exploit the anatomical information and cell type frequencies per region
will be based on the number of cells in the corresponding region in the
input scRNAseq data.
synthetic_visium_data = generate_synthetic_visium(seurat_obj = seurat_obj, dataset_type = "real", clust_var = "subclass", region_var = "brain_subregion" , n_regions = NULL,
n_spots_min = 5, n_spots_max = 20, visium_mean = 20000, visium_sd = 5000)
The real input scRNAseq (Allen Brain Atlas brain cortex) can be downloaded from Zenodo (https://zenodo.org/record/4072975).
# seurat_obj = readRDS(url("https://zenodo.org/record/4072975/files/seurat_obj_scrnaseq_cortex_filtered.rds")) # direct read from URL - will take a long time
# seurat_obj = readRDS(path/seurat_obj_scrnaseq_cortex_filtered.rds")) # after downloading and saving locally
Now, we will create a synthetic dataset based on this real dataset and
with a more realistic number of spots. This will take some time to run.
As another example dataset type, we pick here real_top2_overlap
.
synthetic_visium_data = generate_synthetic_visium(seurat_obj = seurat_obj, dataset_type = "real_top2_overlap", clust_var = "subclass", region_var = "brain_subregion" , n_regions = NULL,
n_spots_min = 50, n_spots_max = 200, visium_mean = 20000, visium_sd = 5000)
Using synthetic Visium-like spot data to evaluate performance of Visium-scRNAseq integration workflows
Check the following vignette to find a short demonstration of a basic
analysis you can do on this synthetic Visium data. We also show how to
do spatial annotation of cells through Seurat label transfer and compare
gold standard labels with predicted labels. Analyzing synthetic
Visium-like spot data for evaluating label transfer
methods:vignette("analyze_syn_data", package="synthvisium")
.
Currently, there are 17 possible dataset types implemented:
possible_dataset_types = c("real", "real_top1","real_top1_uniform","real_top2_overlap","real_top2_overlap_uniform", "real_missing_celltypes_visium", "artificial_uniform_distinct", "artificial_diverse_distinct", "artificial_uniform_overlap", "artificial_diverse_overlap", "artificial_dominant_celltype_diverse", "artificial_partially_dominant_celltype_diverse", "artificial_missing_celltypes_visium", "artificial_dominant_rare_celltype_diverse","artificial_regional_rare_celltype_diverse", "artificial_diverse_distinct_missing_celltype_sc","artificial_diverse_overlap_missing_celltype_sc")
The dataset types that are possible for every input scRNAseq dataset and that we recommend to use are:
recommended_dataset_types = c("artificial_uniform_distinct", "artificial_diverse_distinct", "artificial_uniform_overlap", "artificial_diverse_overlap", "artificial_dominant_celltype_diverse", "artificial_partially_dominant_celltype_diverse", "artificial_dominant_rare_celltype_diverse","artificial_regional_rare_celltype_diverse")
#recommended_dataset_types = c("artificial_uniform_distinct", "artificial_diverse_distinct", "artificial_uniform_overlap", "artificial_diverse_overlap", "artificial_dominant_celltype_diverse", "artificial_partially_dominant_celltype_diverse", "artificial_dominant_rare_celltype_diverse","artificial_regional_rare_celltype_diverse", "artificial_diverse_distinct_missing_celltype_sc","artificial_diverse_overlap_missing_celltype_sc")
Explanation of each of the types:
- real: for each region (defined in scRNAseq data - region_var - and thus real ), cell type frequencies are based on the number of cells in the corresponding region in the input scRNAseq data
- real_top1: for each region (defined in scRNAseq data and thus real ), we will only consider cell types that are most frequently present in that 1 region ( top 1 ). Cell type frequencies are based on the number of cells in the corresponding region in the input scRNAseq data
- real_top1_uniform: for each region (defined in scRNAseq data - region_var - and thus real ), we will only consider cell types that are most frequently present in that 1 region ( top 1 ). Cell type frequencies are equal for each cell type assigned to a region ( uniform )
- real_top2_overlap: for each region (defined in scRNAseq data - region_var - and thus real ), we will only consider cell types that if this particular region is among the 2 regions in which they are most abundant ( top 2 ). As a consequence, there will be some overlap in cell type presence between two regions. Cell type frequencies are based on the number of cells in the corresponding region in the input scRNAseq data
- real_top2_overlap_uniform: for each region (defined in scRNAseq data - region_var - and thus real ), we will only consider cell types that if this particular region is among the 2 regions in which they are most abundant ( top 2 ). As a consequence, there will be some overlap in cell type presence between two regions. Cell type frequencies are equal for each cell type assigned to a region ( uniform )
- real_missing_celltypes_visium: the same as the real dataset type, but now some random input scRNAseq cell types will be left out, and thus not used to generate synthetic data.
- artificial_uniform_distinct: celltype-prior region assignments are entirely artificial and thus not based on possibly present regional information in the input scRNAseq data. Each artifical prior region has a distinct set of cell types here, and cell type frequencies are the same for each cell type in each region.
- artificial_diverse_distinct: celltype-prior region assignments are entirely artificial and thus not based on possibly present regional information in the input scRNAseq data. Each artifical prior region has a distinct set of cell types here, and cell type frequencies are different for each cell type in each region.
- artificial_uniform_overlap: celltype-prior region assignments are entirely artificial and thus not based on possibly present regional information in the input scRNAseq data. Each artifical prior region could have some overlap in cell type contribution with other regions, and cell type frequencies are the same for each cell type in each region.
- artificial_diverse_overlap: celltype-prior region assignments are entirely artificial and thus not based on possibly present regional information in the input scRNAseq data. Each artifical prior region could have some overlap in cell type contribution with other regions, and cell type frequencies are different for each cell type in each region.
- artificial_dominant_celltype_diverse: celltype-prior region assignments are entirely artificial and thus not based on possibly present regional information in the input scRNAseq data. Each artifical prior region could have some overlap in cell type contribution with other regions, and cell type frequencies are different for each cell type in each region. There is one dominant celltype that is present in each region, and is much more abundant than other cell types (5-15x more).
- artificial_partially_dominant_celltype_diverse: celltype-prior region assignments are entirely artificial and thus not based on possibly present regional information in the input scRNAseq data. Each artifical prior region could have some overlap in cell type contribution with other regions, and cell type frequencies are different for each cell type in each region. There is one dominant celltype that is present in all but one regions (partially dominant), and is much more abundant than other cell types, except for one region where it is equally abundant.
- artificial_missing_celltypes_visium: the same as the ‘artificial_diverse_overlap’ dataset type, but now some random input scRNAseq cell types will be left out, and thus not used to generate synthetic data.
- artificial_dominant_rare_celltype_diverse: the same as ‘artificial_dominant_celltype_diverse’, except that the dominant cell type that is present in all regions is much less abundant than other cell types.
- artificial_regional_rare_celltype_diverse: the same as ‘artificial_dominant_rare_celltype_diverse’, except that the rare cell type is now only present in one region instead of all regions.
- artificial_diverse_distinct_missing_celltype_sc: the same
as ‘artificial_diverse_distinct’ except that we here include
information in the
properties_df
about a random cell type that could be removed from the reference scRNAseq dataset later on during integration and evaluation. This to simulate a case where integration is done with an incomplete reference. Note that you should adapt your integration/evaluation code such that this missing_celltype can be read by your integration/evaluation function, otherwise this will be the same as ‘artificial_diverse_distinct’ - artificial_diverse_overlap_missing_celltype_sc: the same as ‘artificial_diverse_distinct_missing_celltype_sc’ but now based on the ‘artificial_diverse_overlap’ basic type.
Some important notes on these data types:
The data types indicated with the ‘real_’ prefix are only possible to use for input scRNAseq datasets that have a regional indication of each cell. They could be extra difficult if these regional indicators are not very distinct (and/or accurate). They also use the real cell type frequency (= the number of cells of cell type of interest compared to all cells in scRNAseq data)
The data types real_missing_celltypes_visium
and
artificial_missing_celltypes_visium
could be considered less realistic
since they assume that your visium data will lack cell types that are
present in your reference of the scRNAseq data. If the visium sample and
scRNAseq sample are taken from the same tissue, this won’t occur in
practice.
The data types artificial_diverse_overlap_missing_celltype_sc
and
artificial_diverse_distinct_missing_celltype_sc
can be used to
simulate the use case with an incomplete reference scRNAseq dataset, but
only if the integration and evaluation functions are adapted such that
they can exploit the additional information offered in the
properties_df
output for these types. Note that you can simulate the
use case with an incomplete reference scRNAseq dataset as well in other
ways.