This repository contains installation instructions and tutorials to run Single Cell Annotation of TumOur Microenvironments In pan-Cancer settings
For more information please check out our manuscript at https://www.nature.com/articles/s41467-023-37353-8
For direct questions please email Ido Nofech-Mozes at inofechmozes@oicr.on.ca
If you are using scATOMIC in your research please cite the corresponding manuscript: Nofech-Mozes, et al., Nature Communications
This tutorial has been tested on MacOS (Montery), Linux (Ubuntu 18.04.6 LTS), and Windows. Note: Windows does not natively perform parallel functions and thus takes longer to run. Note: Mac users require Xcode Command Line Tools for the installation.
To check if Xcode Command Line Tools are installed run the following in terminal:
xcode-select -p
## /Library/Developer/CommandLineTools
Install these by running the following in terminal if CommandlLineTools are not found:
xcode-select --install
To install scATOMIC you will need to use the devtools package. We recommend you install other dependencies before installing scATOMIC. Note that packages DLM and RMagic were removed from CRAN in October 2022 so we install the archived package. Note: In v2 we switch to using a fork of the package cutoff where the package name is changed to avoid conflicting names with the unrelated CRAN cutoff package. Installation of scATOMIC may take several minutes, as the package contains the pre-trained random forest models.
devtools::install_version("dlm", version = "1.1.5", repos = "http://cran.us.r-project.org")
devtools::install_version("Rmagic", version = "2.0.3", repos = "http://cran.us.r-project.org")
if(!require(devtools)) install.packages("devtools")
if(!require(cutoff.scATOMIC)) devtools::install_github("inofechm/cutoff.scATOMIC", force = T)
if(!require(scATOMIC)) devtools::install_github("abelson-lab/scATOMIC")
if installation has the error due to the large size of the package: Error in utils::download.file(url, path, method = method, quiet = quiet, : download from ‘https://api.github.com/repos/abelson-lab/scATOMIC/tarball/HEAD’ failed
Set timeout to longer by running:
options(timeout=9999999)
devtools::install_github("abelson-lab/scATOMIC")
In scATOMIC we use the Rmagic package to impute values.
To use MAGIC, you will need to install both the R and Python packages.
If python or pip are not installed, you will need to install them. We recommend Miniconda3 to install Python and pip together, or otherwise you can install pip from https://pip.pypa.io/en/stable/installing/.
Installation from CRAN In R, run this command to install MAGIC and all dependencies:
if(!require(Rmagic)) devtools::install_version("Rmagic", version = "2.0.3", repos = "http://cran.us.r-project.org")
In a terminal, run the following command in command line to install the Python repository:
pip install --user magic-impute
Often the python magic module is not loading properly in R and the following in R is producing FALSE:
Rmagic::pymagic_is_available()
To resolve this issue run:
library(reticulate)
library(Rmagic)
install.magic()
pymagic_is_available()
For more information visit https://rdrr.io/cran/Rmagic/f/README.md#installation
-
We have updated scATOMIC to classify additional subclasses of cells. Specifically, subtypes of macrophages, monocytes, cDCs and cancer associated fibroblasts are now included by default.
-
We have added the normal_tissue parameter in the create_summary_matrix when inputted samples represent non-cancer, in this case any cancer cell label is converted to normal tissue cell. See this new vignette
-
We have added the known_cancer_type parameter in the create_summary_matrix when the cancer type is already known. In this case any cancer cell label is converted to the known cancer type. See this new vignette
-
We have added a low resolution mode to output broader subtypes of non malignant cells.
-
We have switched to using a fork of the cutoff package called cutoff.scATOMIC to avoid conflicting package names with the CRAN cutoff package.
Inter-patient cancer cell specific effects can interfere with scATOMICs cancer signature scoring module and confidence in classifications. We strongly recommend running scATOMIC only on individual patient/biopsy samples. For most datasets the scATOMIC should run in ~5-10 minutes on a standard 16GB RAM laptop. For larger datasets (> 15,000 cells) we recommend using a machine with greater RAM. The default scATOMIC model assumes the presence of at least some cancer cells in a biopsy and should not be used in normal tissue applications. In version 2 we have added the normal_tissue parameter for normal tissue samples.
scATOMIC is designed to classify cells within the following pan cancer TME hierarchy:
We have validated scATOMIC in external datasets of bladder, brain, breast, colorectal, kidney, liver, lung, neuroblastoma, ovarian, pancreatic, prostate, sarcoma, and skin cancers. Additionally, one can apply scATOMIC to bile duct, bone, endometrial, esophageal, gallbladder, and gastric cancer, however, we have not validated these classes in external datasets and they may be misclassified as other related cancers. For a full list of subtypes included in scATOMIC’s reference see Table S2 in the manuscript.
First we load all dependency packages
library(scATOMIC)
library(plyr)
library(dplyr)
library(data.table)
library(randomForest)
library(caret)
library(parallel)
library(reticulate)
library(Rmagic)
library(Matrix)
library(Seurat)
library(agrmt)
library(cutoff.scATOMIC)
library(copykat)
library(ggplot2)
As of version 2.0.3 required libraries and/or their functions can all be imported when simply running:
library(scATOMIC)
scATOMIC is run directly on gene by cell count matrices. We recommend running scATOMIC on sparse matrices to speed up performance and save memory. To convert a regular matrix or dataframe to a sparse matrix run:
#where input_matrix is any non sparse count matrix
sparse_matrix <- as(as.matrix(input_matrix), "sparseMatrix")
To extract raw count sparse matrices directly from Seurat objects run:
#where seurat_object is any scRNA-seq Seurat object
sparse_matrix <- seurat_object@assays$RNA@counts
We included a demo dataset of a primary lung cancer sample within our package under the object name ‘demo_lung_data’. We store this as a sparse matrix.
lung_cancer_demo_data <- demo_lung_data
We strongly recommend the user filter low quality cells from their dataset. We recommend to remove any cells with more than 25% of reads mapping to mitochondrial genes or those with fewer than 500 unique features expressed. The user can change the pct mt and nFeatureRNA numbers to suit their own filtering parameters.
pct_mt <- colSums(lung_cancer_demo_data[grep("^MT-", row.names(lung_cancer_demo_data)),])/colSums(lung_cancer_demo_data) * 100
nFeatureRNA <- colSums(lung_cancer_demo_data > 0)
lung_cancer_demo_data <- lung_cancer_demo_data[, names(which(pct_mt < 25))]
lung_cancer_demo_data <- lung_cancer_demo_data[, intersect(names(which(nFeatureRNA > 500)), colnames(lung_cancer_demo_data))]
To run scATOMIC we use the run_scATOMIC() function to generate a detailed annotation object. To run with default settings simply use:
cell_predictions <- run_scATOMIC(lung_cancer_demo_data)
Ignore warnings regarding unexpressed genes. Ignore warnings ‘In xtfrm.data.frame(x) : cannot xtfrm data frames’ This returns a list object with predictions for each cell at each layer of the hierarchy.
Other relevant parameters of run_scATOMIC are whether to use imputation (default = TRUE), how many cores to use, and the threshold for classifying cells.
After running scATOMIC we generate a summary of the intermediate results with create_summary_matrix(). Here we need to input our prediction list and the raw count data. By default we set use_CNV to FALSE to avoid CNV corrections. We set modify_results to TRUE to allow for the assumption that only one cancer type is in the sample. See ?create_summary_matrix for information on the other parameters.
create_summary_matrix() returns a matrix that provides the final classification for each cell as well as the classification at each layer
results_lung <- create_summary_matrix(prediction_list = cell_predictions, use_CNVs = F, modify_results = T, mc.cores = 1, raw_counts = lung_cancer_demo_data, min_prop = 0.5 )
table(results_lung$scATOMIC_pred)
##
## Any Cell B Cell
## 10 158
## Blood Cell CAFadi
## 10 13
## CAFinfla CAFmyo
## 51 93
## Cancer Associated Fibroblasts CD4 or CD8 T cell
## 25 21
## CD4+ T cell CD8 T or NK cell
## 95 9
## CD8+ T cell cDC
## 41 3
## cDC1 cDC2
## 9 56
## Dendritic Cell Effector/Memory CD4+ T cells
## 1 175
## Effector/Memory CD8+ T cells Endothelial Cells
## 235 110
## Exhausted CD8+ T cells LAMP3 cDC
## 30 3
## Lung Cancer Cell Macrophage
## 598 11
## Macrophage or Dendritic Cell Mast cell
## 35 80
## Naive CD4+ T cells Natural killer cell
## 6 101
## Non Blood Cell Normal Tissue Cell
## 95 41
## pDC Phagocytosis Macrophage
## 18 319
## Plasmablast Pro-angiogenesis Macrophage
## 91 252
## Pro-inflammatory Macrophage T or NK Cell
## 28 2
## T regulatory cells Tfh/Th1 helper CD4+ T cells
## 144 28
head(results_lung)
## orig.ident nCount_RNA nFeature_RNA cell_names
## AAACCTGAGACCGGAT-1 SeuratProject 7088 2072 AAACCTGAGACCGGAT-1
## AAACCTGCAGTCACTA-1 SeuratProject 4985 1269 AAACCTGCAGTCACTA-1
## AAACCTGGTAAGTAGT-1 SeuratProject 10007 2543 AAACCTGGTAAGTAGT-1
## AAACCTGTCGAATGCT-1 SeuratProject 18045 4597 AAACCTGTCGAATGCT-1
## AAACCTGTCTGAGGGA-1 SeuratProject 4937 1492 AAACCTGTCTGAGGGA-1
## AAACGGGAGTAGATGT-1 SeuratProject 12569 3856 AAACGGGAGTAGATGT-1
## layer_1 layer_2
## AAACCTGAGACCGGAT-1 Blood_Cell macrophage_or_dendritic_cell
## AAACCTGCAGTCACTA-1 Blood_Cell T_or_NK_lymphocyte
## AAACCTGGTAAGTAGT-1 Blood_Cell macrophage_or_dendritic_cell
## AAACCTGTCGAATGCT-1 Tissue_Cell_Normal_or_Cancer Non Stromal Cell
## AAACCTGTCTGAGGGA-1 Blood_Cell Mast cell
## AAACGGGAGTAGATGT-1 Tissue_Cell_Normal_or_Cancer Non Stromal Cell
## layer_3 layer_4
## AAACCTGAGACCGGAT-1 Macrophage or Monocyte Macrophage
## AAACCTGCAGTCACTA-1 CD4 or CD8 T cell CD4+ T cell
## AAACCTGGTAAGTAGT-1 Macrophage or Monocyte Macrophage
## AAACCTGTCGAATGCT-1 Non GI Epithelial Cell Breast/Lung/Prostate Cell
## AAACCTGTCTGAGGGA-1 Mast cell Mast cell
## AAACGGGAGTAGATGT-1 Non GI Epithelial Cell Breast/Lung/Prostate Cell
## layer_5 layer_6
## AAACCTGAGACCGGAT-1 Phagocytosis Macrophage Phagocytosis Macrophage
## AAACCTGCAGTCACTA-1 CD4+ T cell CD4+ T cell
## AAACCTGGTAAGTAGT-1 Phagocytosis Macrophage Phagocytosis Macrophage
## AAACCTGTCGAATGCT-1 Lung Cancer Cell Lung Cancer Cell
## AAACCTGTCTGAGGGA-1 Mast cell Mast cell
## AAACGGGAGTAGATGT-1 Lung Cancer Cell Lung Cancer Cell
## median_score_class_layer_1 median_score_class_layer_2
## AAACCTGAGACCGGAT-1 0.926 0.998
## AAACCTGCAGTCACTA-1 0.926 0.858
## AAACCTGGTAAGTAGT-1 0.926 0.998
## AAACCTGTCGAATGCT-1 0.980 0.874
## AAACCTGTCTGAGGGA-1 0.926 0.970
## AAACGGGAGTAGATGT-1 0.980 0.874
## median_score_class_layer_3 median_score_class_layer_4
## AAACCTGAGACCGGAT-1 0.984 1.000
## AAACCTGCAGTCACTA-1 0.996 0.977
## AAACCTGGTAAGTAGT-1 0.984 1.000
## AAACCTGTCGAATGCT-1 0.574 0.810
## AAACCTGTCTGAGGGA-1 0.970 0.970
## AAACGGGAGTAGATGT-1 0.574 0.810
## median_score_class_layer_5 median_score_class_layer_6
## AAACCTGAGACCGGAT-1 1.000 1.000
## AAACCTGCAGTCACTA-1 0.977 0.977
## AAACCTGGTAAGTAGT-1 1.000 1.000
## AAACCTGTCGAATGCT-1 0.874 0.874
## AAACCTGTCTGAGGGA-1 0.970 0.970
## AAACGGGAGTAGATGT-1 0.874 0.874
## scATOMIC_pred S.Score G2M.Score Phase
## AAACCTGAGACCGGAT-1 Phagocytosis Macrophage -0.006630921 -0.047453336 G1
## AAACCTGCAGTCACTA-1 CD4+ T cell -0.046677709 -0.006673286 G1
## AAACCTGGTAAGTAGT-1 Phagocytosis Macrophage 0.015462403 0.063293498 G2M
## AAACCTGTCGAATGCT-1 Lung Cancer Cell -0.049247934 -0.064310971 G1
## AAACCTGTCTGAGGGA-1 Mast cell -0.025618571 -0.023923499 G1
## AAACGGGAGTAGATGT-1 Lung Cancer Cell -0.038660836 -0.129262025 G1
## old.ident RNA_snn_res.0.2 seurat_clusters
## AAACCTGAGACCGGAT-1 SeuratProject 2 2
## AAACCTGCAGTCACTA-1 SeuratProject 0 0
## AAACCTGGTAAGTAGT-1 SeuratProject 2 2
## AAACCTGTCGAATGCT-1 SeuratProject 1 1
## AAACCTGTCTGAGGGA-1 SeuratProject 9 9
## AAACGGGAGTAGATGT-1 SeuratProject 1 1
## pan_cancer_cluster classification_confidence
## AAACCTGAGACCGGAT-1 Normal confident
## AAACCTGCAGTCACTA-1 Normal confident
## AAACCTGGTAAGTAGT-1 Normal confident
## AAACCTGTCGAATGCT-1 Cancer confident
## AAACCTGTCTGAGGGA-1 Normal confident
## AAACGGGAGTAGATGT-1 Cancer confident
To calculate inferred CNV status we offer a built in argument to run CopyKAT CNV inference. In this mode we run the create_summary_matrix() function with use_CNVs = TRUE. Note: CNV correction takes significantly longer than the regular mode of scATOMIC and can be sped up with parallel computing. We recommend using parallel multi-core processing by setting the mc.cores argument to the number of cores that can be used.
results_lung_CNV <- create_summary_matrix(prediction_list = cell_predictions, use_CNVs = T, modify_results = T, mc.cores = 6, raw_counts = lung_cancer_demo_data, min_prop = 0.5 )
To visualize the results of scATOMIC we provide the scATOMICTree() function
tree_results_lung <- scATOMICTree(predictions_list = cell_predictions, summary_matrix = results_lung,
interactive_mode = T, collapsed = T, save_results = F,height = 700, width = 1000)
This produces an interactive tree for the results. Click on nodes to
expand and reveal subgroups. Hover your mouse over dots to get metrics
(in RStudio Viewer or can save the tree to HTML via the save_results
argument)
Github .md files do not allow the interactive html file to be embedded
but it can be visualized in the R console or by opening the saved file
in a web browser. This gif demonstrates the interactive mode:
By setting the collapsed argument to FALSE we can generate an interactive tree that is fully open
tree_results_lung_open <- scATOMICTree(predictions_list = cell_predictions, summary_matrix = results_lung,
interactive_mode = T, collapsed = F, save_results = F,height = 700, width = 1000)
Github .md files do not allow the interactive mode to be seen but can be visualized in the R console or by opening the saved file in a web browser.
By setting the interactive mode to FALSE, we can generate a non interactive tree
tree_results_non_interactive <- scATOMICTree(predictions_list = cell_predictions, summary_matrix = results_lung,
interactive_mode = F, save_results = F)
tree_results_non_interactive
If one wants to visualize the results in a Seurat object we can easily add the annotations and visualize the results using the DimPlot() function.
First we create a seurat object with our count matrix We can add our annotations to the seurat object when we create it
#create seurat object
lung_seurat <- CreateSeuratObject(lung_cancer_demo_data, meta.data = results_lung)
#run seurat pipeline
lung_seurat <- NormalizeData(lung_seurat)
lung_seurat <- FindVariableFeatures(lung_seurat)
lung_seurat <- ScaleData(lung_seurat)
lung_seurat <- RunPCA(lung_seurat, features = VariableFeatures(object = lung_seurat))
lung_seurat <- RunUMAP(lung_seurat, dims = 1:50)
lung_seurat <- FindNeighbors(lung_seurat)
lung_seurat <- FindClusters(lung_seurat)
If you already have a seurat object you can add the results using the AddMetaData() function
lung_seurat <- AddMetaData(lung_seurat, results_lung)
We can plot our results via:
DimPlot(lung_seurat, group.by = "scATOMIC_pred") + ggtitle("Lung Demo Dataset") + labs(fill="scATOMIC Annotations")
Given scATOMIC’s modular design, we allow users to add new subclassification layers for any additional cell type including blood, cancer, and stromal cells. We provide functions to both train new layers and to run these subclassifications. As an example we will add a breast cancer subclassification layer.
Here we use Wu et al as a training dataset, and a sample from Pal et al as a testing dataset.
Download training data at GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz
We unzip these files, and rename count_matrix_barcode.tsv, count_matrix_genes.tsv, and count_matrix_sparse.mtx to barcodes.tsv, genes.tsv, and matrix.mtx respectively for the Read10X function to work.
#test new layer for breast subtyping
#change "~/Downloads" to path containing "Wu_etal_2021_BRCA_scRNASeq/"
Wu_et_al_breast_count_mat <- Seurat::Read10X("~/Downloads/Wu_etal_2021_BRCA_scRNASeq/", gene.column = 1)
#load metadata
Wu_et_al_breast_metadata <- read.csv("~/Downloads/Wu_etal_2021_BRCA_scRNASeq/metadata.csv", row.names = 1)
table(Wu_et_al_breast_metadata$celltype_major)
##
## B-cells CAFs Cancer Epithelial Endothelial
## 3206 6573 24489 7605
## Myeloid Normal Epithelial Plasmablasts PVL
## 9675 4355 3524 5423
## T-cells
## 35214
#filter to only keep cancer cells
cancer_cell_index <- row.names(Wu_et_al_breast_metadata)[which(Wu_et_al_breast_metadata$celltype_major == "Cancer Epithelial")]
Wu_et_al_breast_count_mat <- Wu_et_al_breast_count_mat[,cancer_cell_index]
Wu_et_al_breast_metadata <- Wu_et_al_breast_metadata[cancer_cell_index,]
#remove low quality cells
Wu_et_al_breast_metadata <- Wu_et_al_breast_metadata[which(Wu_et_al_breast_metadata$percent.mito < 25),]
Wu_et_al_breast_metadata <- Wu_et_al_breast_metadata[which(Wu_et_al_breast_metadata$nFeature_RNA > 500),]
Wu_et_al_breast_count_mat <- Wu_et_al_breast_count_mat[,row.names(Wu_et_al_breast_metadata)]
table(Wu_et_al_breast_metadata$subtype)
##
## ER+ HER2+ TNBC
## 11878 1775 10836
We recommend using a high performance compute cluster to train new models as they can often include large references requiring large memory usage. For this demo we are taking a sample of the data to speed up training and run it on a laptop:
cancer_types <- levels(as.factor(Wu_et_al_breast_metadata$subtype))
#sample 3000 cells from each type if size is greater than 3000
index_subset <- c()
for(i in 1:length(cancer_types)){
index_cancer <- row.names(Wu_et_al_breast_metadata)[which(Wu_et_al_breast_metadata$subtype == cancer_types[i])]
if(length(index_cancer) > 3000){
set.seed(123)
index_subset <- c(index_subset, index_cancer[sample(1:length(index_cancer), size = 3000)])
} else{
index_subset <- c(index_subset, index_cancer)
}
}
Wu_et_al_breast_count_mat <- Wu_et_al_breast_count_mat[,index_subset]
Wu_et_al_breast_metadata <- Wu_et_al_breast_metadata[index_subset,]
To train a new layer we must create a directory to save the model to and we run the get_new_scATOMIC_layer() function To speed up the demo we use only 100 trees here
#change "~/Downloads" to path containing "Wu_etal_2021_BRCA_scRNASeq/"
dir.create("~/Downloads/Wu_etal_2021_BRCA_scRNASeq/breast_subclassification_layer")
breast_cancer_subclassification <- get_new_scATOMIC_layer(training_data = Wu_et_al_breast_count_mat,cell_type_metadata = Wu_et_al_breast_metadata$subtype, output_dir = "~/Downloads/Wu_etal_2021_BRCA_scRNASeq/breast_subclassification_layer/",
layer_name = "breast_subclassification", n_cells_replicate = 5000, n_trees = 100)
To show how to use the new subclassification layer, we apply it to an example ER+ tumour, 0125 from Pal et al. Download example matrix at GSM4909297. Additionally download the features list and barcode file. Move these files into a folder named Pal_et_al_ER_0125 and name them matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz respectively.
We read and filter this dataset.
#change "~/Downloads" to path containing "Wu_etal_2021_BRCA_scRNASeq/"
Pal_0125 <- Read10X("~/Downloads/Pal_et_al_ER_0125/")
pct_mt <- colSums(Pal_0125[grep("^MT-", row.names(Pal_0125)),])/colSums(Pal_0125) * 100
nFeatureRNA <- apply(Pal_0125, 2,function(x){length(which(x != 0))})
Pal_0125 <- Pal_0125[, names(which(pct_mt < 25))]
Pal_0125 <- Pal_0125[, intersect(names(which(nFeatureRNA > 500)), colnames(Pal_0125))]
First we use the base scATOMIC model to classify the cells in the data.
predictions_Pal_0125 <- run_scATOMIC(Pal_0125)
results_Pal_0125 <- create_summary_matrix(prediction_list = predictions_Pal_0125, raw_counts = Pal_0125)
table(results_Pal_0125$scATOMIC_pred)
We now want to use our new subclassification layer to further subtype the breast cancer cells identified. We do this with the classify_new_scATOMIC_layer() function.
#define breast cancer cells
cells_to_subclassify <- row.names(results_Pal_0125)[which(
results_Pal_0125$scATOMIC_pred == "Breast Cancer Cell")]
breast_subclassifications <- classify_new_scATOMIC_layer(rf_model = breast_cancer_subclassification[["rf_classifier"]], selected_features = breast_cancer_subclassification[["selected_features"]],
rna_counts = Pal_0125, cell_names = cells_to_subclassify, layer_name = "Breast Cancer Cells", mc.cores = 1, imputation = T, confidence_cutoff = T )
table(breast_subclassifications$predicted_tissue_with_cutoff)
#add new classifications to results matrix
results_Pal_0125[row.names(breast_subclassifications), "scATOMIC_pred"] <- breast_subclassifications$predicted_tissue_with_cutoff
table(results_Pal_0125$scATOMIC_pred)
##
## Any Cell
## 8
## Blood Cell
## 30
## CAFadi
## 23
## CAFendmt
## 1
## CAFinfla
## 65
## CAFmyo
## 25
## Cancer Associated Fibroblasts
## 29
## CD4 or CD8 T cell
## 5
## cDC1
## 2
## cDC2
## 34
## Effector/Memory CD4+ T cells
## 6
## Effector/Memory CD8+ T cells
## 29
## Endothelial Cells
## 16
## ER+
## 3677
## Macrophage or Dendritic Cell
## 2
## Naive CD4+ T cells
## 13
## Natural killer cell
## 3
## Non Blood Cell
## 38
## Normal Fibroblast
## 60
## Normal Tissue Cell
## 123
## Phagocytosis Macrophage
## 34
## Pro-angiogenesis Macrophage
## 2
## T regulatory cells
## 7
## Tfh/Th1 helper CD4+ T cells
## 1
## Unclassified_Cell_from_Breast Cancer Cells
## 4
The breast cancer cells are now classified as ER+ cells.
Although scATOMIC is designed for cancer samples, one can use it to annotate non-cancerous samples. This can be particularly useful in the context of comparing tumour microenvironments to normal tissue environments where both data would be annotated in the same way.
In this example we use a matched normal lung sample downloaded from Kim et al For the convenience of this tutorial, the count matrix in h5 format can be downloaded from this Google Drive link
Run the standard pipeline for scATOMIC
normal_lungdata <- Read10X_h5("~/Downloads/LUNG_N01_sparse_count_mat.h5")
pct_mt <- colSums(normal_lungdata[grep("^MT-", row.names(normal_lungdata)),])/colSums(normal_lungdata) * 100
nFeatureRNA <- colSums(normal_lungdata > 0)
normal_lungdata <- normal_lungdata[, names(which(pct_mt < 25))]
normal_lungdata <- normal_lungdata[, intersect(names(which(nFeatureRNA > 500)), colnames(normal_lungdata))]
cell_predictions_normal_lung <- run_scATOMIC(normal_lungdata)
results_normal_lung <- create_summary_matrix(prediction_list = cell_predictions_normal_lung, raw_counts = normal_lungdata, normal_tissue = T)
As you can see there are no cancer cells labelled.
table(results_normal_lung$scATOMIC_pred)
##
## Any Cell B Cell
## 5 43
## Blood Cell CD4 or CD8 T cell
## 68 23
## CD4+ T cell CD8 T or NK cell
## 17 30
## CD8+ T cell cDC
## 17 5
## cDC1 cDC2
## 10 112
## Effector/Memory CD4+ T cells Effector/Memory CD8+ T cells
## 148 617
## Endothelial Cells LAMP3 cDC
## 30 13
## Macrophage Macrophage or Dendritic Cell
## 170 28
## MAIT cells Mast cell
## 1 220
## Naive CD4+ T cells Natural killer cell
## 2 394
## Normal Fibroblast Normal Tissue Cell
## 44 167
## pDC Phagocytosis Macrophage
## 31 64
## Pro-angiogenesis Macrophage Pro-inflammatory Macrophage
## 662 247
## Stromal Cell T or NK Cell
## 1 1
## T regulatory cells
## 7
All cells that received a classification that was referring to a cancer cell type in layer 6 was converted to normal tissue cell in scATOMIC’s final prediction
table(results_normal_lung[which(results_normal_lung$scATOMIC_pred == "Normal Tissue Cell"), "layer_6"])
##
## Epithelial Cell Lung Cancer Cell Non Blood Cell
## 5 156 6
Although scATOMIC can predict cancer type well without any prior knowledge, not all cancer types are captured in the training model and it can occasionally make misclassifications. As such we now allows users to input a ground truth for the cancer type if it is known using the known_cancer_type parameter in the create_summary_matrix function.
We showcase this functionality in the demo_lung_cancer dataset. In this example we provide the function knowledge that the sample is “LUAD”
lung_cancer_demo_data <- demo_lung_data
pct_mt <- colSums(lung_cancer_demo_data[grep("^MT-", row.names(lung_cancer_demo_data)),])/colSums(lung_cancer_demo_data) * 100
nFeatureRNA <- colSums(lung_cancer_demo_data > 0)
lung_cancer_demo_data <- lung_cancer_demo_data[, names(which(pct_mt < 25))]
lung_cancer_demo_data <- lung_cancer_demo_data[, intersect(names(which(nFeatureRNA > 500)), colnames(lung_cancer_demo_data))]
cell_predictions <- run_scATOMIC(lung_cancer_demo_data)
results_lung <- create_summary_matrix(prediction_list = cell_predictions, raw_counts = lung_cancer_demo_data, known_cancer_type = "LUAD cell" )
table(results_lung$scATOMIC_pred)
##
## Any Cell B Cell
## 10 158
## Blood Cell CAFadi
## 10 13
## CAFinfla CAFmyo
## 51 94
## Cancer Associated Fibroblasts CD4 or CD8 T cell
## 24 21
## CD4+ T cell CD8 T or NK cell
## 95 10
## CD8+ T cell cDC
## 46 3
## cDC1 cDC2
## 9 56
## Dendritic Cell Effector/Memory CD4+ T cells
## 1 175
## Effector/Memory CD8+ T cells Endothelial Cells
## 229 110
## Exhausted CD8+ T cells LAMP3 cDC
## 30 3
## LUAD cell Macrophage
## 599 11
## Macrophage or Dendritic Cell Mast cell
## 35 80
## Naive CD4+ T cells Natural killer cell
## 6 101
## Non Blood Cell Normal Tissue Cell
## 66 69
## pDC Phagocytosis Macrophage
## 18 319
## Plasmablast Pro-angiogenesis Macrophage
## 91 252
## Pro-inflammatory Macrophage T or NK Cell
## 28 2
## T regulatory cells Tfh/Th1 helper CD4+ T cells
## 144 28
To bypass any classification score cutoffs and force a terminal specialized classification for each cell use run_scATOMIC() and create_summary_matrix() with the confidence_cutoff argument as FALSE.
We included a native breast cancer subclassification module. To run breast subclassification, set the breast_mode argument to TRUE in both run_scATOMIC() and create_summary_matrix().
If one would like to add additional layers not within the existing hierarchy please contact us directly as it likely requires retraining the base model.
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.6.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Toronto
## tzcode source: internal
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] cutoff.scATOMIC_0.1.0 agrmt_1.42.12 Seurat_5.0.1
## [4] SeuratObject_5.0.1 sp_2.1-2 Rmagic_2.0.3
## [7] Matrix_1.6-4 caret_6.0-94 lattice_0.22-5
## [10] ggplot2_3.4.4 data.table_1.14.10 dplyr_1.1.4
## [13] plyr_1.8.9 scATOMIC_2.0.3 copykat_1.1.0
## [16] reticulate_1.34.0 randomForest_4.7-1.1
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.21 splines_4.3.2 later_1.3.2
## [4] R.oo_1.25.0 tibble_3.2.1 polyclip_1.10-6
## [7] hardhat_1.3.0 pROC_1.18.5 rpart_4.1.23
## [10] fastDummies_1.7.3 lifecycle_1.0.4 rprojroot_2.0.4
## [13] hdf5r_1.3.8 globals_0.16.2 MASS_7.3-60
## [16] tree_1.0-43 magrittr_2.0.3 plotly_4.10.3
## [19] rmarkdown_2.25 yaml_2.3.8 httpuv_1.6.13
## [22] sctransform_0.4.1 spam_2.10-0 spatstat.sparse_3.0-3
## [25] cowplot_1.1.2 pbapply_1.7-2 RColorBrewer_1.1-3
## [28] lubridate_1.9.3 abind_1.4-5 Rtsne_0.17
## [31] R.utils_2.12.3 purrr_1.0.2 nnet_7.3-19
## [34] ipred_0.9-14 lava_1.7.3 data.tree_1.1.0
## [37] ggrepel_0.9.4 irlba_2.3.5.1 listenv_0.9.0
## [40] spatstat.utils_3.0-4 goftest_1.2-3 RSpectra_0.16-1
## [43] spatstat.random_3.2-2 fitdistrplus_1.1-11 parallelly_1.36.0
## [46] leiden_0.4.3.1 codetools_0.2-19 tidyselect_1.2.0
## [49] farver_2.1.1 matrixStats_1.2.0 stats4_4.3.2
## [52] spatstat.explore_3.2-5 jsonlite_1.8.8 ellipsis_0.3.2
## [55] progressr_0.14.0 ggridges_0.5.5 survival_3.5-7
## [58] iterators_1.0.14 bbmle_1.0.25.1 foreach_1.5.2
## [61] tools_4.3.2 ica_1.0-3 Rcpp_1.0.11
## [64] glue_1.6.2 prodlim_2023.08.28 gridExtra_2.3
## [67] dlm_1.1-6 xfun_0.41 here_1.0.1
## [70] amap_0.8-19 withr_2.5.2 numDeriv_2016.8-1.1
## [73] fastmap_1.1.1 fansi_1.0.6 digest_0.6.33
## [76] parallelDist_0.2.6 timechange_0.2.0 R6_2.5.1
## [79] mime_0.12 colorspace_2.1-0 scattermore_1.2
## [82] tensor_1.5 spatstat.data_3.0-3 R.methodsS3_1.8.2
## [85] DiagrammeR_1.0.10 utf8_1.2.4 tidyr_1.3.0
## [88] generics_0.1.3 recipes_1.0.9 class_7.3-22
## [91] httr_1.4.7 htmlwidgets_1.6.4 uwot_0.1.16
## [94] ModelMetrics_1.2.2.2 pkgconfig_2.0.3 gtable_0.3.4
## [97] timeDate_4032.109 lmtest_0.9-40 htmltools_0.5.7
## [100] dotCall64_1.1-1 scales_1.3.0 png_0.1-8
## [103] gower_1.0.1 knitr_1.45 rstudioapi_0.15.0
## [106] reshape2_1.4.4 visNetwork_2.1.2 nlme_3.1-164
## [109] bdsmatrix_1.3-6 zoo_1.8-12 stringr_1.5.1
## [112] KernSmooth_2.23-22 miniUI_0.1.1.1 pillar_1.9.0
## [115] grid_4.3.2 vctrs_0.6.5 RANN_2.6.1
## [118] promises_1.2.1 xtable_1.8-4 cluster_2.1.6
## [121] evaluate_0.23 mvtnorm_1.2-4 cli_3.6.2
## [124] compiler_4.3.2 rlang_1.1.2 future.apply_1.11.1
## [127] labeling_0.4.3 stringi_1.8.3 viridisLite_0.4.2
## [130] deldir_2.0-2 munsell_0.5.0 lazyeval_0.2.2
## [133] spatstat.geom_3.2-7 RcppHNSW_0.5.0 patchwork_1.1.3
## [136] bit64_4.0.5 future_1.33.1 shiny_1.8.0
## [139] highr_0.10 ROCR_1.0-11 igraph_1.6.0
## [142] RcppParallel_5.1.7 collapsibleTree_0.1.8 bit_4.0.5