Robin Browaeys 2020-10-08
In this vignette, you can learn how to use synthvisium to generate synthetic Visium-like spot data and see a possible workflow to analyze these data. As a specific case, we will show how you could use these datasets to evaluate Seurat’s label transfer.
library(Seurat)
library(synthvisium)
library(tidyverse)
The real input scRNAseq (Allen Brain Atlas brain cortex) can be downloaded from Zenodo and should preferably be download locally and then loaded here.
For demonstration purposes we will use the internal toy data here
# seurat_obj_scrnaseq = readRDS(url("https://zenodo.org/record/3260758/files/seurat_obj_scrnaseq_cortex_filtered.rds"))
seurat_obj_scrnaseq = seurat_obj # the internal object
seurat_obj_scrnaseq@meta.data$celltype = seurat_obj_scrnaseq@meta.data$subclass
seurat_obj_scrnaseq = seurat_obj_scrnaseq %>% SetIdent(value = "celltype")
DimPlot(seurat_obj_scrnaseq, reduction = "umap",pt.size = 0.5, label = T)
Now, we will create a synthetic dataset based on this real dataset and
with a more realistic number of spots. This will take some time to run.
As example dataset type, we pick here real_top2_overlap
.
Because it could be possible that a cell type will belong to all regions, we will make the synthetic data in such a way that we have a ‘mock region’ as well - no cell type should be predicted to belong to this region. Adding a mock region is thus recommended when the synthetic data will be used to evaluate spatial/region annotation of cells, not when evaluating deconvolution tools. The mock region is generated similar as the real regions, but the input cell type frequencies are the same for each cell type, and after generation of the counts, the gene names are shuffled such that cellular identities in this mock region are lost.
synthetic_visium_data = generate_synthetic_visium(seurat_obj = seurat_obj_scrnaseq, dataset_type = "real_top2_overlap", clust_var = "subclass", region_var = "brain_subregion" , n_regions = NULL,
n_spots_min = 50, n_spots_max = 200, visium_mean = 20000, visium_sd = 5000, add_mock_region = TRUE)
seurat_obj_visium = CreateSeuratObject(counts = synthetic_visium_data$counts, min.cells = 2, min.features = 200, assay = "Spatial")
seurat_obj_visium = SCTransform(seurat_obj_visium, assay = "Spatial", verbose = FALSE)
seurat_obj_visium = RunPCA(seurat_obj_visium, assay = "SCT", verbose = FALSE)
seurat_obj_visium = RunTSNE(seurat_obj_visium, reduction = "pca", dims = 1:30)
seurat_obj_visium = RunUMAP(seurat_obj_visium, reduction = "pca", dims = 1:30)
This will group the spots in clusters/regions based on gene expression
seurat_obj_visium = FindNeighbors(seurat_obj_visium, reduction = "pca", dims = 1:30)
seurat_obj_visium = FindClusters(seurat_obj_visium, verbose = FALSE,resolution = 0.1)
Compare the prior regions with the clusters found based on gene expression
p_priorregion = DimPlot(seurat_obj_visium, reduction = "umap", label = TRUE, group.by = "orig.ident") # a priori defined regions
p_exprs_clusters = DimPlot(seurat_obj_visium, reduction = "umap", label = TRUE) # a priori defined regions
patchwork::wrap_plots(list(p_priorregion, p_exprs_clusters), nrow = 1)
Here, they map very nicely onto eachother, probably because this dataset type can be considered as ‘easy’ and different regions have strongly distinct cell type composition.
Here we will show how to predict for each cell (scRNAseq data - spatial label unknown) the probabilities to belong to certain regions (spatial label from ‘synthetic Visium’ data).
Here we go for the prior regions
seurat_obj_visium@meta.data$region = seurat_obj_visium@meta.data$orig.ident # if you want to transfer the label of priorregion clusters
seurat_obj_visium = seurat_obj_visium %>% SetIdent(value = "region")
Perform the label transfer for spatial annotation of cells and add the region predictions as new Assay to the Seurat Object
if(DefaultAssay(seurat_obj_visium) != "SCT"){
DefaultAssay(seurat_obj_visium) = "SCT"
}
if(DefaultAssay(seurat_obj_scrnaseq) != "SCT"){
DefaultAssay(seurat_obj_scrnaseq) = "SCT"
}
# Define parameters used for the label transfer
parameter_input = list(
npcs = 30,
k.anchor = 50,
k.filter = 200,
k.score = 50
)
# Define anchors (step 1 of label transfer)
anchors = FindTransferAnchors(reference = seurat_obj_visium, query = seurat_obj_scrnaseq, normalization.method = "SCT", verbose = FALSE,
npcs = parameter_input$npcs, k.anchor = parameter_input$k.anchor, k.filter = parameter_input$k.filter, k.score = parameter_input$k.score) # check other parameters of this function
# Use anchors to transfer the labels (label is here: region metadata column of visium) (step 2 of the label transfer)
# this outcome is an 'seurat assay' that we will call 'regions': each cell has a score for each region in this matrix
predictions.assay = TransferData(anchorset = anchors, refdata = seurat_obj_visium$region, weight.reduction = seurat_obj_scrnaseq[["pca"]], prediction.assay = TRUE, verbose = FALSE) # check parameters of this function
seurat_obj_scrnaseq[["regions"]] = predictions.assay
seurat_obj_scrnaseq[["regions"]]@data = seurat_obj_scrnaseq[["regions"]]@data[seurat_obj_visium$region %>% unique(), ]
Just as we added the region predictions as new Assay to the scRNAseq object, we will now also add ground truth / gold standard region
gold_standard_priorregion = synthetic_visium_data$gold_standard_priorregion
metadata = seurat_obj_scrnaseq@meta.data %>% rownames_to_column("cell_id") %>% as_tibble()
cell_priorregion = metadata %>% inner_join(gold_standard_priorregion) %>% select(cell_id, prior_region, present) %>% distinct()
cell_priorregion_spread = cell_priorregion %>% spread(cell_id, present)
cell_priorregion_matrix = cell_priorregion_spread %>% select(-prior_region) %>% as.matrix() %>% magrittr::set_rownames(cell_priorregion_spread$prior_region)
cell_priorregion_matrix[cell_priorregion_matrix == TRUE] = 1
cell_priorregion_matrix[cell_priorregion_matrix == FALSE] = 0
seurat_obj_scrnaseq[["GS"]] = CreateAssayObject(data = cell_priorregion_matrix)
Visualize the gold standard
if(DefaultAssay(seurat_obj_scrnaseq) != "GS"){
DefaultAssay(seurat_obj_scrnaseq) = "GS"
}
# add the names of the regions here (change if necessary)
basic_plot = FeaturePlot(seurat_obj_scrnaseq, features = c("L1","L2/3","L4", "L5" ,"L6","mockregion"), ncol = 3, combine = FALSE)
custom_scale_fill = scale_color_gradientn(colours = RColorBrewer::brewer.pal(n = 4, name = "Oranges"),values = c(0,0.2, 0.4, 0.6, 1), limits = c(0, 1))
gs_plots = lapply(basic_plot, function (x) x + custom_scale_fill + theme(legend.position = "none")) %>% patchwork::wrap_plots(ncol = 5)
gs_plots
Visualize the predictions
# Visualize the 'region prediction scores' of each cell - regions assay
if(DefaultAssay(seurat_obj_scrnaseq) != "regions"){
DefaultAssay(seurat_obj_scrnaseq) = "regions"
}
# add the names of the regions here (change if necessary) (seurat_obj_scrnaseq %>% rownames())
basic_plot = FeaturePlot(seurat_obj_scrnaseq, features = c("L1","L2/3","L4", "L5" ,"L6", "mockregion"), ncol = 3, combine = FALSE)
custom_scale_fill = scale_color_gradientn(colours = RColorBrewer::brewer.pal(n = 4, name = "RdBu") %>% rev(),values = c(0,0.2, 0.4, 0.6, 1), limits = c(0, 1))
pred_plots = lapply(basic_plot, function (x) x + custom_scale_fill + theme(legend.position = "none")) %>% patchwork::wrap_plots(ncol = 5)
pred_plots
Put the visualizations next to each other
patchwork::wrap_plots(list(gs_plots, pred_plots), nrow = 2)
You see that predictions don’t cover all regions they should cover!
A very short intro in how to quantitatively evaluate the predictions: example for the region L5
prediction_vector_L5 = seurat_obj_scrnaseq[["regions"]]@data["L5",]
response_vector_L5 = seurat_obj_scrnaseq[["GS"]]@data["L5",]
ROC_object = pROC::roc(response_vector_L5, prediction_vector_L5, plot = TRUE)
ROC_object$auc
## Area under the curve: 0.8798
To avoid potential bias, we recommend to split your input scRNAseq in two equivalent sub-datasets (eg through stratified sampling of half of the cells for one sub-dataset, and the other half of the cells for the second sub-dataset). One sub-dataset can then be used to generate the synthetic visium dataset, the second sub-dataset can be used to integrate with the synthetic visium dataset for evaluation. This way, the generation and evaluation will be performed based on different single cells.
Instead of splitting one dataset into two, it is also possible to use different datasets (but containing the same cell types and cell type labels) for generation and integration/evaluation. This way you could check the potential effect of batch effects or cross-platform differences (eg use single-nucleus dataset to generate the synthetic data, and single-cell to integratie/evaluate and compare this to the situation where both generation and integration/evaluation datasets came from the same platform).