├── .Rbuildignore ├── .gitignore ├── CIPR.Rproj ├── DESCRIPTION ├── NAMESPACE ├── R └── CIPR.R ├── README.md ├── data ├── blueprint_expr.rda ├── blueprint_samples.rda ├── dice_expr.rda ├── dice_samples.rda ├── example_avgexp_data.rda ├── example_log_fc_data.rda ├── hema_expr.rda ├── hema_samples.rda ├── hpca_expr.rda ├── hpca_samples.rda ├── hsrnaseq_expr.rda ├── hsrnaseq_samples.rda ├── immgen_expr.rda ├── immgen_samples.rda ├── mmrnaseq_expr.rda └── mmrnaseq_samples.rda ├── doc ├── CIPR_hex_mid.png ├── cipr_human_pbmc.R ├── cipr_human_pbmc.Rmd ├── cipr_human_pbmc.html ├── sample_ind_output.png └── sample_top_output.png ├── man └── CIPR.Rd └── vignettes └── cipr_human_pbmc.Rmd /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | ^doc$ 4 | ^Meta$ 5 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | Meta 6 | to_improve.txt 7 | -------------------------------------------------------------------------------- /CIPR.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 4 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: CIPR 2 | Type: Package 3 | Title: Cluster Identity Predictor 4 | Version: 0.1.0 5 | Author: Atakan Ekiz 6 | Maintainer: Atakan Ekiz 7 | Description: CIPR (Cluster Identity PRedictor) package is developed to help annotating unknown cell clusters in single cell RNAseq (scRNAseq) experiments. Due to overlapping gene expression profiles, annotating the biological origins of single cell clusters can be challenging. The current software solutions to help annotating single cell clusters generally lack a user-friendly GUI, can take long times to compute, or may not have suitable reference data frames for the experimental data at hand. Alternatively, researchers can manually examine a few known marker genes in the experimental data to approximate the identity of single cell clusters which can be low throughput, time consuming, and subjective. We introduce CIPR (Cluster Identity PRedictor) package (and its Shiny implementation) to address these shortcomings. CIPR scores the gene expression profiles of unknown single cell clusters against a collection of known reference cell types and calculates an identity score for the cluster-reference cell pairs. Multiple reference RNAseq/microarray datasets obtained from sorted human and mouse cells are available in CIPR and the computational pipeline is agnostic to species differences allowing this program to be used with both human and mouse input data. Users can limit the analysis to only certain reference cell types and to genes with high overall variance. Furthrmore, CIPR can also work with custom reference data frames provided by the users to be used in specialized experimental contexts. Through this R-package implementation, CIPR can be run directly in R and be implemented easily in different analytical contexts. In summary, CIPR is a software solution to facilitate the analysis of scRNAseq data. 8 | License: GPL-3 9 | Encoding: UTF-8 10 | LazyData: true 11 | Depends: R (>= 3.5.0) 12 | Imports: ggpubr, gtools, tibble, dplyr, rlang 13 | Suggests: knitr, rmarkdown, Seurat 14 | VignetteBuilder: knitr 15 | URL: https://github.com/atakanekiz/CIPR-Package 16 | BugReports: https://github.com/atakanekiz/CIPR-Package/issues 17 | RoxygenNote: 7.0.2 18 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | exportPattern("^[[:alpha:]]+") 2 | importFrom(magrittr,"%>%") 3 | -------------------------------------------------------------------------------- /R/CIPR.R: -------------------------------------------------------------------------------- 1 | #' Cluster Identity Predictor 2 | #' 3 | #' @description CIPR (Cluster Identity PRedictor) is a pipeline that helps annotating 4 | #' unknown single cell clusters in single cell RNA sequencing (scRNAseq experiments). 5 | #' This function scores unknown cluster gene expression signatures against known reference 6 | #' datasets using user-selected analytical approaches to facilitate scRNAseq analysis. 7 | #' 8 | #' @param input_dat Data frame containing normalized log2-transformed gene 9 | #' expression values per cluster OR a table of differentially expressed genes 10 | #' per cluster 11 | #' 12 | #' @param comp_method Method to use for identity score calculations. It accepts 13 | #' one of the following: "logfc_dot_product" (default), "logfc_spearman", 14 | #' "logfc_pearson", "all_genes_spearman", "all_genes_pearson" 15 | #' 16 | #' @param reference Reference data frame containing gene expression data from 17 | #' known cell types. It accepts one of the following: "immgen" (default), 18 | #' "mmrnaseq", "blueprint", "hpca", "dice", "hema", "hsrnaseq", "custom" 19 | #' 20 | #' @param select_ref_subsets The names of cell subsets to be included in the 21 | #' analysis. For using the entire reference dataset use "all", or 22 | #' provide a character vector of cell types of interest. Defaults to "all" 23 | #' 24 | #' @param custom_reference A data frame containing custom reference. There must 25 | #' be a column named 'gene' and other columns contain normalized gene expression 26 | #' data from known samples. Defaults to NULL 27 | #' 28 | #' @param custom_ref_annot A data frame containing custom reference metadata. 29 | #' This is optional to get more informative results from CIPR. The data 30 | #' frame must contain columns named 'short_name' (must match column names in 31 | #' custom reference), 'long_name' (human readable names for reference samples), 32 | #' 'description' (details such as positive and negative sorting markers), 33 | #' 'reference_cell_type' (e.g. T cell, B cell, NK) 34 | #' 35 | #' @param keep_top_var Top n percent of highly variant reference genes to 36 | #' include in the analysis. It accepts a numeric value smaller than or equal 37 | #' to 100 (default). The value of 100 results in keeping all the genes in the 38 | #' reference dataset) 39 | #' 40 | #' @param plot_ind Logical value. Set it to TRUE to plot identity scores for 41 | #' each cluster. Defaults to FALSE 42 | #' 43 | #' @param plot_top Logical value. set it to TRUE to plot top scoring reference 44 | #' cell types for each cluster. Defaults ot TRUE. 45 | #' 46 | #' @param top_num A numeric value determining how many top scoring reference 47 | #' cell types will be plotted for each cluster. Defaults to 5. 48 | #' 49 | #' @param save_png Logical value. Set it to TRUE if you would like to export png 50 | #' images of the results. Defaults to FALSE 51 | #' 52 | #' @param global_plot_obj Logical value. Set it to TRUE if you would like to 53 | #' keep the plots as an object in the global environment. This can be useful 54 | #' for accessing and manipulating the graphs. Defaults to TRUE. 55 | #' 56 | #' @param global_results_obj Logical value. Set it to TRUE if you would like to 57 | #' keep the analysis results as a global object. Defaults to TRUE. 58 | #' 59 | #' @param ... arguments to pass to ggplot2::theme() (for graph manipulation) 60 | #' 61 | #' @return Graphical outputs and/or data frames of identity scores calculated 62 | #' for each cluster in the input data. 63 | #' 64 | #' @examples 65 | #' 66 | #' # Example of using CIPR in conjunction with Seurat 67 | #' library(Seurat) 68 | #' allmarkers <- FindAllMarkers(seurat_object) 69 | #' avgexp <- AverageExpression(seurat_object) 70 | #' 71 | #' # Using built-in immgen as reference and logfc dot product method 72 | #' CIPR(input_dat = allmarkers, 73 | #' comp_method = "logfc_dot_product", 74 | #' reference="immgen", 75 | #' keep_top_var = 100, 76 | #' global_results_obj = T, 77 | #' plot_top = T) 78 | #' 79 | #' #' # Using built-in immgen as reference and all genes spearman method 80 | #' 81 | #' CIPR(input_dat = avgexp, 82 | #' comp_method = "all_genes_spearman", 83 | #' reference="immgen", 84 | #' keep_top_var = 100, 85 | #' global_results_obj = T, 86 | #' plot_top = T) 87 | #' 88 | #' 89 | #' # Using built-in dice reference and logFC spearman method 90 | #' # Variance threshold of top 50% 91 | #' 92 | #' CIPR(input_dat = allmarkers, 93 | #' comp_method = "logfc_spearman", 94 | #' reference="dice", 95 | #' keep_top_var = 50, 96 | #' global_results_obj = T, 97 | #' plot_top = T) 98 | #' 99 | #' 100 | #' # Using a custom reference 101 | #' 102 | #' CIPR(input_dat = allmarkers, 103 | #' comp_method = "logfc_dot_product", 104 | #' reference="custom", 105 | #' custom_reference = custom_ref_df, 106 | #' custom_ref_annot = custom_annot_df, 107 | #' keep_top_var = 100, 108 | #' global_results_obj = T, 109 | #' plot_top = T) 110 | #' 111 | #' 112 | #' # Using a blueprint-encode reference and limiting the analysis 113 | #' # to "Pericytes", "Skeletal muscle", "Smooth muscle" 114 | #' 115 | #' CIPR(input_dat = allmarkers, 116 | #' comp_method = "logfc_dot_product", 117 | #' reference="blueprint-encode", 118 | #' select_ref_subsets = c("Pericytes", "Skeletal muscle", "Smooth muscle") 119 | #' keep_top_var = 100, 120 | #' global_results_obj = T, 121 | #' plot_top = T) 122 | #' 123 | #' # Using built in example data (logFC signatures per cluster) 124 | #' CIPR(input_dat = example_logfc_data, 125 | #' comp_method = "logfc_dot_product", 126 | #' reference="blueprint-encode", 127 | #' select_ref_subsets = c("Pericytes", "Skeletal muscle", "Smooth muscle") 128 | #' keep_top_var = 100, 129 | #' global_results_obj = T, 130 | #' plot_top = T) 131 | #' 132 | #' 133 | #' # Using built in example data (average expression) 134 | #' CIPR(input_dat = example_avgexp_data, 135 | #' comp_method = "all_spearman", 136 | #' reference="immgen", 137 | #' select_ref_subsets = "all", 138 | #' keep_top_var = 100, 139 | #' global_results_obj = T, 140 | #' plot_top = T) 141 | #' 142 | 143 | 144 | 145 | CIPR <- function(input_dat, 146 | comp_method = "logfc_dot_product", 147 | reference = NULL, 148 | select_ref_subsets = "all", 149 | custom_reference = NULL, 150 | custom_ref_annot = NULL, 151 | keep_top_var = 100, 152 | plot_ind = F, 153 | plot_top = T, 154 | top_num = 5, 155 | save_png = F, 156 | global_plot_obj = T, 157 | global_results_obj = T, 158 | update_ref = T, 159 | ... 160 | ){ 161 | 162 | 163 | 164 | # suppressMessages({ 165 | # require(ggpubr) 166 | # require(gtools) 167 | # require(tibble) 168 | # require(dplyr) 169 | # }) 170 | 171 | 172 | 173 | ######################### Prepare input_dat ######################### 174 | 175 | message("Preparing input data") 176 | 177 | if(!comp_method %in% c("logfc_dot_product", 178 | "logfc_spearman", 179 | "logfc_pearson", 180 | "all_genes_spearman", 181 | "all_genes_pearson")) stop("Check comp_method. Allowed strings are: 'logfc_dot_product', 'logfc_spearman', 'logfc_pearson', 'all_genes_spearman', 'all_genes_pearson'") 182 | 183 | if(grepl("logfc", comp_method)){ 184 | 185 | # Define column names to allow flexibility in case and close matches in column names 186 | gene_column <- grep("gene", colnames(input_dat), ignore.case = T, value = T) 187 | logFC_column <- grep("logfc|log2fc", colnames(input_dat), ignore.case = T, value = T) 188 | cluster_column <- grep("cluster", colnames(input_dat), ignore.case = T, value = T) 189 | 190 | 191 | if(length(c(gene_column, logFC_column, cluster_column)) != 3) stop("Check column names of the input data. Data frame must have columns named as 'gene', 'logfc', and 'cluster'") 192 | 193 | # Convert gene symbols to lower case letters to allow mouse-vs-human comparisons 194 | input_dat[,gene_column] <- tolower(input_dat[,gene_column]) 195 | 196 | # input_dat <- input_dat[!duplicated(input_dat[,gene_column]),] 197 | 198 | 199 | 200 | } else { 201 | 202 | gene_column <- grep("gene", colnames(input_dat), ignore.case = T, value = T) 203 | 204 | input_dat[,gene_column] <- tolower(input_dat[,gene_column]) 205 | 206 | input_dat <- input_dat[!duplicated(input_dat[,gene_column]),] 207 | 208 | if(length(gene_column) != 1) stop("Check column names of the input data. Data frame must have a column named as 'gene'") 209 | 210 | } 211 | 212 | 213 | 214 | 215 | ######################### Prepare ref_dat ######################### 216 | 217 | # if(update_ref == T){ # WORK ON CACHING TO SPEED UP THE PIPELINE 218 | 219 | message("Preparing reference data") 220 | 221 | if(is.null(reference)){ 222 | 223 | stop('Please specify the reference dataset. The following are allowed: "mmrnaseq", "blueprint", "hpca", "dice", "hema", "hsrnaseq", "custom"') 224 | 225 | } else if(reference == "immgen"){ 226 | 227 | message("Reading ImmGen reference data") 228 | 229 | # # Read reference dataset 230 | # load("data/immgen_expr.rda") 231 | # ref_dat <- get("immgen_expr") 232 | # rm(immgen_expr) 233 | # 234 | # # Read immgen annotation file for explanations of cell types 235 | # load("data/immgen_samples.rda") 236 | # ref_annot <- get("immgen_samples") 237 | # rm(immgen_samples) 238 | 239 | ref_dat <- immgen_expr 240 | ref_annot <- immgen_samples 241 | 242 | 243 | } else if(reference == "mmrnaseq"){ 244 | 245 | message("Reading MmRNAseq reference data") 246 | 247 | ref_dat <- mmrnaseq_expr 248 | ref_annot <- mmrnaseq_samples 249 | 250 | } else if(reference == "blueprint"){ 251 | 252 | message("Reading Blueprint-ENCODE reference data") 253 | 254 | ref_dat <- blueprint_expr 255 | ref_annot <- blueprint_samples 256 | 257 | } else if(reference == "hpca"){ 258 | 259 | message("Reading HCPA reference data") 260 | 261 | ref_dat <- hpca_expr 262 | ref_annot <- hpca_samples 263 | 264 | } else if(reference == "dice"){ 265 | 266 | message("Reading DICE reference data") 267 | 268 | ref_dat <- dice_expr 269 | ref_annot <- dice_samples 270 | 271 | } else if(reference == "hema"){ 272 | 273 | message("Reading hema reference data") 274 | 275 | ref_dat <- hema_expr 276 | ref_annot <- hema_samples 277 | 278 | } else if(reference == "hsrnaseq"){ 279 | 280 | message("Reading HsRNAseq reference data") 281 | 282 | ref_dat <- hsrnaseq_expr 283 | ref_annot <- hsrnaseq_samples 284 | 285 | } else if(reference == "custom"){ 286 | 287 | message("Reading custom reference data") 288 | 289 | # Read reference dataset 290 | ref_dat <- custom_reference 291 | 292 | # Read immgen annotation file for explanations of cell types 293 | ref_annot <- custom_ref_annot 294 | 295 | 296 | } else { 297 | 298 | stop(message('"reference" argument accepts of the following: "mmrnaseq", "blueprint", "hpca", "dice", "hema", "hsrnaseq", "custom"')) 299 | 300 | } 301 | 302 | 303 | ref_gene_column <- grep("^gene", colnames(ref_dat), ignore.case = T, value = T) 304 | 305 | if(length(ref_gene_column) != 1) stop("Check column names of the input data. Data frame must have a column named as 'gene'") 306 | 307 | ref_dat[, ref_gene_column] <- tolower(ref_dat[, ref_gene_column]) 308 | 309 | 310 | 311 | 312 | 313 | # Select relevant subsets from the reference 314 | 315 | if(select_ref_subsets != "all"){ 316 | 317 | message("Subsetting reference data") 318 | 319 | sel_positions <- which(ref_annot[, "reference_cell_type"] %in% select_ref_subsets) 320 | 321 | if(length(sel_positions) == 0) stop("Selected subset is not present in the reference dataset. Please check spelling, as the string input must exactly match to reference cell type.") 322 | 323 | not_found <- !select_ref_subsets %in% ref_annot[, "reference_cell_type"] 324 | 325 | if(sum(not_found != 0 )) { 326 | 327 | message("Following subsets were not found in reference (Please double check your entry for spelling errors, as the string input must exactly match to reference cell type):") 328 | 329 | print(select_ref_subsets[not_found]) 330 | 331 | 332 | } 333 | 334 | 335 | select_ref_subsets <- as.character(ref_annot[sel_positions, "short_name"]) 336 | 337 | ref_dat <- ref_dat[, c(ref_gene_column, select_ref_subsets)] 338 | 339 | } 340 | 341 | 342 | 343 | 344 | 345 | 346 | # Apply quantile filtering 347 | 348 | if(keep_top_var != 100){ 349 | 350 | 351 | message("Applying variance filtering") 352 | 353 | var_vec <- apply(ref_dat[, ! colnames(ref_dat) %in% ref_gene_column], 1, var, na.rm=T) 354 | 355 | keep_var <- quantile(var_vec, probs = 1-keep_top_var/100, na.rm = T) 356 | 357 | keep_genes <- var_vec >= keep_var 358 | 359 | ref_dat <- ref_dat[keep_genes, ] 360 | 361 | } 362 | 363 | 364 | if(grepl("logfc", comp_method)){ 365 | 366 | 367 | # Calculate row means for each gene (mean expression across the reference cell types) 368 | gene_avg <- rowMeans(ref_dat[, !colnames(ref_dat) %in% ref_gene_column]) 369 | 370 | # Log scale data 371 | reference_ratio <- sweep(ref_dat[,!colnames(ref_dat) %in% ref_gene_column], 1, FUN="-", gene_avg) 372 | 373 | # Combine gene names and the log fold change in one data frame 374 | ref_dat <- cbind(tolower(ref_dat[, ref_gene_column]), reference_ratio) 375 | 376 | colnames(ref_dat)[1] <- ref_gene_column 377 | 378 | } 379 | 380 | # } else if(exists("ref_dat") & exists("ref_annot") & update_ref == F){ 381 | # 382 | # ref_dat <- get("ref_dat", envir = .GlobalEnv) 383 | # ref_annot <- get("ref_annot", envir = .GlobalEnv) 384 | # 385 | # } 386 | 387 | 388 | 389 | ######################## Define clusters ############################### 390 | 391 | if(grepl("logfc", comp_method)){ 392 | 393 | clusters <- gtools::mixedsort( 394 | levels( 395 | as.factor( 396 | dplyr::pull(input_dat, grep("cluster", x = colnames(input_dat), 397 | ignore.case = T, value = T) 398 | ) 399 | ) 400 | ) 401 | ) 402 | 403 | } else { 404 | 405 | clusters <- gtools::mixedsort( 406 | levels( 407 | as.factor( 408 | colnames(input_dat)[!grepl("gene", colnames(input_dat), 409 | ignore.case = T)] 410 | ) 411 | ) 412 | )} 413 | 414 | 415 | 416 | 417 | ######################### Compare input_dat against ref_dat ############################# 418 | 419 | message("Analyzing cluster signatures") 420 | 421 | 422 | if(comp_method == "logfc_dot_product"){ #################################################################################### 423 | 424 | # Initiate a master data frame to store the results 425 | master_df <- data.frame() 426 | 427 | # Iterate over clusters to calculate a distinct identity score for each reference cell type 428 | for (i in clusters) { 429 | 430 | # Increment the progress bar, and update the detail text. 431 | # message(paste("Analyzing cluster", i)) 432 | 433 | # Subset on the cluster in iteration 434 | sel_clst <- input_dat %>% 435 | dplyr::filter(!!rlang::sym(cluster_column) == i) %>% 436 | dplyr::select(c(!!rlang::sym(gene_column), !!rlang::sym(logFC_column))) 437 | 438 | 439 | # Merge SCseq cluster log FC value with immgen log FC for shared genes 440 | merged <- merge(sel_clst, ref_dat, by.x = gene_column, by.y = ref_gene_column) 441 | 442 | if(dim(merged)[1] < 2) next 443 | 444 | # Calculate a scoring matrix by multiplying log changes of clusters and immgen cells 445 | reference_scoring <- data.frame(apply(merged[,3:dim(merged)[2]],2,function(x){x*merged[,2]}), check.names = FALSE) 446 | 447 | # Calculate the aggregate score of each immgen cell type by adding 448 | score_sum <- colSums(reference_scoring) 449 | 450 | # Store identity scores in a data frame 451 | df <- data.frame(identity_score = score_sum) 452 | 453 | df <- tibble::rownames_to_column(df, var="reference_id") 454 | 455 | 456 | df <- dplyr::left_join(df, ref_annot, by=c("reference_id" = "short_name")) 457 | 458 | 459 | # Store cluster information in a column 460 | df$cluster <- i 461 | 462 | # Add confidence-of-prediction calculations here and append to the df 463 | # Calculate the mean and standard deviation of the aggregate scores per reference cell type 464 | mean_score_sum <- mean(df$identity_score) 465 | score_sum_sd <- sd(df$identity_score) 466 | 467 | # Calculate the distance of the identity score from population mean (how many std devs apart?) 468 | df$z_score <- (df$identity_score - mean_score_sum)/score_sum_sd 469 | 470 | # Calculate the proportion of the genes changing in the same direction between unknown cluster and reference cell type 471 | df$percent_pos_correlation <- { 472 | 473 | ngenes <- dim(reference_scoring)[1] 474 | 475 | pos_corr_vector <- numeric() 476 | 477 | for(i in 1:dim(reference_scoring)[2]){ 478 | 479 | # Calculate number of genes positively correlated (upregulated or downregulated in both unk cluster and reference) 480 | pos_cor <- ( sum(reference_scoring[, i] > 0) / ngenes ) * 100 481 | 482 | pos_corr_vector <- c(pos_corr_vector, pos_cor) 483 | 484 | } #close for loop 485 | 486 | pos_corr_vector 487 | 488 | } # close expression 489 | 490 | 491 | # Add calculation results under the master data frame to have a composite results file 492 | master_df <- rbind(master_df,df) 493 | 494 | 495 | 496 | } # close for loop that iterates over clusters1 497 | 498 | } else if(comp_method == "logfc_spearman" | comp_method == "logfc_pearson"){ ######################################################## 499 | 500 | # Initiate master data frame to store results 501 | master_df <- data.frame() 502 | 503 | 504 | # Iterate analysis for each cluster. The loop below will calculate a distinct correlation 505 | # coefficient for each cluster-reference cell pairs 506 | for (i in clusters) { 507 | 508 | 509 | trim_dat <- input_dat %>% 510 | dplyr::filter(!!rlang::sym(cluster_column) == i) 511 | 512 | dat_genes <- trim_dat[gene_column] %>% dplyr::pull() %>% as.character 513 | ref_genes <- ref_dat[ref_gene_column] %>% dplyr::pull() %>% as.character 514 | 515 | common_genes <- intersect(dat_genes, ref_genes) 516 | 517 | 518 | trim_dat <- trim_dat %>% 519 | dplyr::filter(!!rlang::sym(gene_column) %in% common_genes) %>% 520 | dplyr::arrange(!!rlang::sym(gene_column)) %>% 521 | dplyr::select(- !!rlang::sym(gene_column)) 522 | 523 | 524 | trim_ref <- ref_dat %>% 525 | dplyr::filter(!!rlang::sym(ref_gene_column) %in% common_genes) %>% 526 | dplyr::arrange(!!rlang::sym(ref_gene_column)) %>% 527 | dplyr::select(- !!rlang::sym(ref_gene_column)) 528 | 529 | 530 | # Calculate correlation between the the cluster (single column in trimmed input data) and each of the 531 | # reference cell subsets (columns of the trimmed reference data) 532 | cor_df <- cor(trim_dat[logFC_column], trim_ref, method = gsub("logfc_", "", comp_method)) 533 | 534 | # Store results in a data frame 535 | df <- data.frame(identity_score = cor_df[1,]) 536 | 537 | df <- tibble::rownames_to_column(df, var="reference_id") 538 | 539 | # Combine results with reference annotations 540 | if(reference != "custom"){ 541 | 542 | df <- dplyr::left_join(df, ref_annot, by=c("reference_id" = "short_name")) 543 | 544 | 545 | 546 | 547 | 548 | } else if (reference == "custom" & !is.null(custom_ref_annot)){ 549 | 550 | df <- dplyr::left_join(df, ref_annot, by=c("reference_id" = "short_name")) 551 | 552 | 553 | } else if(reference == "custom" & is.null(custom_ref_annot)){ 554 | 555 | # Fill in with reminder if annotation file is not updated 556 | df$reference_cell_type <- rep("Upload annotation file", dim(ref_dat)[2]-1) 557 | df$short_name <- colnames(ref_dat)[!colnames(ref_dat) %in% ref_gene_column] 558 | df$long_name <- rep("Upload annotation file", dim(ref_dat)[2]-1) 559 | df$description <- rep("Upload annotation file", dim(ref_dat)[2]-1) 560 | 561 | } 562 | 563 | 564 | 565 | # Store cluster information in a column 566 | df$cluster <- i 567 | 568 | # Add confidence-of-prediction calculations here and append to the df 569 | # Calculate the mean and standard deviation of the aggregate scores per reference cell type 570 | mean_cor_coeff <- mean(df$identity_score) 571 | cor_coeff_sd <- sd(df$identity_score) 572 | 573 | # Calculate the distance of the identity score from population mean (how many std devs apart?) 574 | df$z_score <- (df$identity_score - mean_cor_coeff)/cor_coeff_sd 575 | 576 | # Add all the results to the master data frame 577 | master_df <- rbind(master_df, df) 578 | 579 | 580 | } # close for loop that iterates over clusters 581 | 582 | } else if(comp_method == "all_genes_spearman" | comp_method == "all_genes_pearson"){ ################################################ 583 | 584 | 585 | 586 | dat_genes <- input_dat[gene_column] %>% dplyr::pull() %>% as.character 587 | ref_genes <- ref_dat[ref_gene_column] %>% dplyr::pull() %>% as.character 588 | 589 | common_genes <- intersect(dat_genes, ref_genes) 590 | 591 | trim_dat <- input_dat %>% 592 | dplyr::filter(!!rlang::sym(gene_column) %in% common_genes) %>% 593 | dplyr::arrange(!!rlang::sym(gene_column)) %>% 594 | dplyr::select_(.dots= paste0("-", gene_column)) 595 | 596 | trim_ref <- ref_dat %>% 597 | dplyr::filter(!!rlang::sym(ref_gene_column) %in% common_genes) %>% 598 | dplyr::arrange(!!rlang::sym(ref_gene_column)) %>% 599 | dplyr::select_(.dots=paste0("-", ref_gene_column)) 600 | 601 | clusters <- colnames(trim_dat) 602 | 603 | 604 | master_df <- data.frame() 605 | 606 | comp_method <- gsub("all_genes_", "", comp_method) 607 | 608 | for (i in clusters) { 609 | 610 | 611 | cor_df <- cor(trim_dat[i], trim_ref, method = comp_method) 612 | 613 | 614 | df <- data.frame(identity_score = cor_df[1,]) 615 | 616 | df <- tibble::rownames_to_column(df, var="reference_id") 617 | 618 | 619 | 620 | if(reference != "custom"){ 621 | 622 | df <- dplyr::left_join(df, ref_annot, by=c("reference_id" = "short_name")) 623 | 624 | 625 | 626 | 627 | 628 | } else if (reference == "custom" & !is.null(custom_ref_annot)){ 629 | 630 | df <- dplyr::left_join(df, ref_annot, by=c("reference_id" = "short_name")) 631 | 632 | 633 | } else if(reference == "custom" & is.null(custom_ref_annot)){ 634 | 635 | df$reference_cell_type <- rep("Upload annotation file", dim(ref_dat)[2]-1) 636 | df$short_name <- colnames(ref_dat)[!colnames(ref_dat) %in% ref_gene_column] 637 | df$long_name <- rep("Upload annotation file", dim(ref_dat)[2]-1) 638 | df$description <- rep("Upload annotation file", dim(ref_dat)[2]-1) 639 | 640 | } 641 | 642 | 643 | 644 | 645 | df$cluster <- i 646 | 647 | # Add confidence-of-prediction calculations here and append to the df 648 | # Calculate the mean and standard deviation of the aggregate scores per reference cell type 649 | mean_cor_coeff <- mean(df$identity_score) 650 | cor_coeff_sd <- sd(df$identity_score) 651 | 652 | # Calculate the distance of the identity score from population mean (how many std devs apart?) 653 | df$z_score <- (df$identity_score - mean_cor_coeff)/cor_coeff_sd 654 | 655 | 656 | master_df <- rbind(master_df,df) 657 | 658 | } 659 | } 660 | 661 | 662 | 663 | 664 | 665 | 666 | 667 | 668 | if(global_results_obj == T) CIPR_all_results <<- master_df 669 | 670 | 671 | #prep individual plots 672 | if(plot_ind == T){ 673 | 674 | 675 | colnum <- length(levels(as.factor(ref_annot$reference_cell_type))) 676 | 677 | 678 | cols <- colorspace::rainbow_hcl(colnum, c = 90, l = 75, start = 0, end = 330, 679 | gamma = NULL, fixup = TRUE, alpha = 1) 680 | 681 | set.seed(5) 682 | 683 | cols <- sample(cols, length(cols), replace = F) 684 | 685 | 686 | ind_clu_plots <- list() 687 | 688 | for (i in clusters) { 689 | 690 | 691 | # Extract results calculated for individual clusters 692 | df_plot <- master_df %>% 693 | dplyr::filter(cluster == i) 694 | 695 | # Calculate mean and sd deviation for adding confidence bands to graphs 696 | score_mean <- mean(df_plot$identity_score) 697 | score_sd <- sd(df_plot$identity_score) 698 | 699 | 700 | plotname <- paste("cluster", i, sep="") 701 | 702 | # Plot identity scores per cluster per reference cell type and add confidence bands 703 | ind_clu_plots[[plotname]] <- ggpubr::ggdotplot(df_plot, x = "reference_id", y="identity_score", 704 | fill = "reference_cell_type", xlab=F, ylab="Reference identity score", palette=cols, 705 | font.y = c(14, "bold", "black"), size=1, x.text.angle=90, 706 | title = paste("Cluster:",i), font.title = c(15, "bold.italic"), 707 | font.legend = c(15, "plain", "black"))+ 708 | ggplot2::theme(axis.text.x = ggplot2::element_text(size=10, vjust=0.5, hjust=1))+ 709 | ggplot2::geom_hline(yintercept=score_mean)+ 710 | ggplot2::annotate("rect", xmin = 1, xmax = length(df_plot$reference_id), 711 | ymin = score_mean-score_sd, ymax = score_mean+score_sd, 712 | fill = "gray50", alpha = .1)+ 713 | ggplot2::annotate("rect", xmin = 1, xmax = length(df_plot$reference_id), 714 | ymin = score_mean-2*score_sd, ymax = score_mean+2*score_sd, 715 | fill = "gray50", alpha = .1)+ 716 | ggplot2::theme(...) 717 | 718 | 719 | } 720 | 721 | if(global_plot_obj == T) ind_clu_plots <<-ind_clu_plots 722 | 723 | 724 | if(save_png == T) { 725 | ggpubr::ggexport(filename = "CIPR_individual_clusters.png", plotlist = ind_clu_plots, ncol = 1, width = 1800, height = 360 * length(clusters)) 726 | } 727 | else { 728 | print(ggpubr::ggarrange(plotlist = ind_clu_plots, ncol = 1, common.legend = T)) 729 | } 730 | 731 | } 732 | 733 | ################################################################################################################################ 734 | # Prepare top5 summary plots 735 | # This plot will show the 5 highest scoring reference cell types for each cluster. 736 | 737 | # if(plot_top == T){ 738 | 739 | message("Preparing top plots") 740 | 741 | # Extract top5 hits from the reuslts 742 | top_df <- master_df %>% 743 | dplyr::group_by(cluster) %>% #cluster 744 | dplyr::top_n(top_num, wt = identity_score) %>% 745 | dplyr::arrange(cluster, desc(identity_score)) 746 | 747 | # Index variable helps keeping the results for clusters separate and helps ordered outputs 748 | top_df$index <- 1:nrow(top_df) 749 | 750 | 751 | # Order clusters levels for ordered plotting 752 | ordered_cluster_levels <- gtools::mixedsort(levels(as.factor(top_df$cluster))) 753 | 754 | 755 | top_df$cluster <- factor(top_df$cluster, levels = ordered_cluster_levels) 756 | 757 | 758 | 759 | # Extract relevant columns 760 | top_df <- dplyr::select(top_df, cluster, 761 | reference_cell_type, 762 | reference_id, 763 | long_name, 764 | description, 765 | identity_score, 766 | index, dplyr::everything()) 767 | 768 | if(global_results_obj == T) CIPR_top_results <<- top_df 769 | 770 | p <- ggpubr::ggdotplot(top_df, x="index", y="identity_score", 771 | fill = "cluster", size=1, x.text.angle=90, 772 | font.legend = c(15, "plain", "black")) + 773 | ggplot2::scale_x_discrete(labels=top_df$reference_id)+ 774 | ggplot2::theme(axis.text.x = ggplot2::element_text(vjust=0.5, hjust=1))+ 775 | ggplot2::theme(...) 776 | 777 | if(global_plot_obj == T) top_plots <<- p 778 | 779 | if(save_png == T) { 780 | 781 | ggpubr::ggexport(p, filename = "CIPR_top_hits.png", ncol = 1, width = 150 * length(clusters), height = 300) 782 | 783 | } else { 784 | 785 | if(plot_top == T) print(p) 786 | 787 | } 788 | 789 | # } 790 | 791 | } # close function 792 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CIPR-Package 2 | 3 |
4 | 5 | 6 | 7 | ## Cluster Identity Predictor 8 | 9 |
10 | 11 | During the analysis of single cell RNA sequencing (scRNAseq) data, annotating the biological identity of cell clusters is an important step before downstream analyses and it remains technically challenging. The current solutions for annotating single cell clusters generally lack a graphical user interface, can be computationally intensive or have a limited scope. On the other hand, manually annotating single cell clusters by examining the expression of marker genes can be subjective and labor-intensive. 12 | 13 | To improve the quality and efficiency of annotating cell clusters in scRNAseq data, we present a web-based R/Shiny app and R package, __Cluster Identity PRedictor (CIPR)__, which provides a graphical user interface to quickly score gene expression profiles of unknown cell clusters against mouse or human references, or a custom dataset provided by the user. CIPR can be easily integrated into the current pipelines to facilitate scRNAseq data analysis. 14 | 15 | CIPR performs analyses at individual cluster level and generates informative graphical outputs to help the users assess the quality of algorithmic predictions (see the example outputs below). 16 | 17 | 18 | This repository contains the source code for the R package implementation of CIPR pipeline. For CIPR-Shiny, please check out [CIPR-Shiny repository](https://github.com/atakanekiz/CIPR-Shiny). 19 | 20 | --- 21 | 22 | ## Installation and Usage 23 | 24 | ```{r} 25 | 26 | if (!requireNamespace("devtools", quietly = TRUE)) 27 | install.packages("devtools") 28 | 29 | devtools::install_github("atakanekiz/CIPR-Package", build_vignettes = TRUE) 30 | 31 | # # For faster installation without vignette 32 | # devtools::install_github("atakanekiz/CIPR-Package", build_vignettes = FALSE) 33 | ``` 34 | 35 | #### Example use case in conjunction with Seurat pipeline 36 | 37 | ```{r} 38 | 39 | 40 | library(Seurat) 41 | 42 | allmarkers <- FindAllMarkers(seurat_object) 43 | avgexp <- AverageExpression(seurat_object) 44 | 45 | 46 | # Plot summarizing top scoring references per cluster (logFC comparison) 47 | CIPR(input_dat = allmarkers, 48 | comp_method = "logfc_dot_product", 49 | reference = "immgen", 50 | plot_ind = F, 51 | plot_top = T) 52 | 53 | # Plot summarizing top scoring references per cluster (all-genes correlation) 54 | CIPR(input_dat = allmarkers, 55 | comp_method = "logfc_dot_product", 56 | reference = "immgen", 57 | plot_ind = F, 58 | plot_top = T) 59 | 60 | 61 | # Plots for individual clusters 62 | CIPR(input_dat = allmarkers, 63 | comp_method = "logfc_dot_product", 64 | reference = "immgen", 65 | plot_ind = T, 66 | plot_top = F) 67 | 68 | # Limiting the analysis to certain reference subsets 69 | CIPR(input_dat = allmarkers, 70 | comp_method = "logfc_dot_product", 71 | reference = "immgen", 72 | plot_ind = F, 73 | plot_top = T, 74 | select_ref_subsets = c("T cell", "B cell", "NK cell")) 75 | 76 | 77 | 78 | 79 | ``` 80 | 81 | 82 | ## Reference datasets available in CIPR 83 | 84 | * [Immunological Genome Project (ImmGen)](https://www.immgen.org) microarray data from sorted mouse immune cells. This dataset is prepared by using both V1 and V2 ImmGen releases and it contains 296 samples from 20 different cell types (253 subtypes). 85 | 86 | * Mouse RNAseq data from sorted cells reported in [Benayoun et al. (2019)](http://www.genome.org/cgi/doi/10.1101/gr.240093.118). This dataset contains 358 sorted immune and nonimmune samples from 18 different lineages (28 subtypes). 87 | 88 | * [Blueprint](https://doi.org/10.3324/haematol.2013.094243)/[Encode](https://doi.org/10.1038/nature11247) RNAseq data that contains 259 sorted human immune and nonimmune samples from 24 different lineages (43 subtypes). 89 | 90 | * [Human Primary Cell Atlas](https://doi.org/10.1186/1471-2164-14-632) that contains microarray data from 713 sorted immune and nonimmune cells (37 main cell types and 157 subtypes). 91 | 92 | * [DICE (Database for Immune Cell Expression(/eQTLs/Epigenomics)](https://doi.org/10.1016/j.cell.2018.10.022) that contains 1561 human immune samples from 5 main cell types (15 subtypes). To reduce object sizes, mean TPM values per cell type is used. 93 | 94 | * Human microarray data from sorted hematopoietic cells reported in [Novershtern et al. (2011)](https://doi.org/10.1016/j.cell.2011.01.004). This dataset contains data from 211 samples and 17 main cell types (38 subtypes) 95 | 96 | * Human RNAseq data from sorted cells reported in [Monaco et al. (2019)](https://doi.org/10.1016/j.celrep.2019.01.041). This dataset contains 114 samples originating from 11 main cell types (29 subtypes) 97 | 98 | * ___A custom reference dataset provided by the user.___ This dataset can be obtained from a number of high througput methods including microarray and bulk RNAseq. For details about how to prepare custom reference, please see the How-to tab on the [Shiny website](https://aekiz.shinyapps.io/CIPR). 99 | 100 | --- 101 | 102 | ## Analytical approach 103 | 104 | CIPR calculates pairwise identity scores between individual unknown clusters and the reference samples and generates a vector of identity scores per each cluster in the experiment. While doing this CIPR utilizes two main approaches: 105 | 106 | * ___Comparison of differentially expressed genes.___ In this method users provide an input data frame that contains the log fold-change (logFC) values of differentially expressed genes in each cluster. The algorithm first calculates differential expression within the reference data frame for each gene by taking the ratio of the expression value of individual subsets to the average expression in the entire data frame. Then the CIPR pipeline compares these reference logFC values to the logFC from the experimental clusters. The users can select one of three methods for these comparisons: 107 | 108 | * __LogFC dot product:__ LogFC values of the matching genes are mutliplied and added together to yield an aggregate identity score. 109 | * __LogFC Spearman's correlation:__ Rank correlation is calculated between the logFC values of the experimental and reference data. 110 | * __LogFC Pearson's correlation__: Linear correlation is calculated between the logFC values of the expermental and reference data. 111 | 112 |
113 | 114 | * ___Comparison of all genes.___ In this method, users provide an input data frame that contains average gene expression per cluster. The algorithm compares the expression profiles of individual cluster to that from reference dataset. In this method, all the common genes between experimenal and reference data are used in the analysis regardless of their expression values and differential expression status. Users can use one of the two methods in this approach: 115 | 116 | * __Spearman's correlation:__ Rank correlation between the experimental clusters and reference cell subsets 117 | * __Pearson's correlation:__ Linear correlation (which could be beneficial especially when using custom references where the reference and the experimental data is obtained using similar methodologies.) 118 | 119 | 120 | --- 121 | 122 | 123 | 124 | ## Flexible options 125 | 126 | To be adaptable to various experimental contexts, CIPR enables users to: 127 | 128 | * Select only interesting reference subsets from the provided reference datasets 129 | 130 | * Limit the analysis to the genes whose expression variance (in the reference dataset) is above a certain quantile determined by the user. 131 | 132 | 133 | --- 134 | 135 | ## Sample outputs 136 | 137 | ### Results per cluster 138 | 139 | In the plot below x-axis signifies the individual samples within the reference data frame (ImmGen in this example). Reference cell types are marked by different colors. Each data point indicates the identity score calculated for Cluster 1 in the input data. Shaded regions demarcate 1 and 2 standard deviations around the average identity score across the reference dataset. In this analysis logFC dot product method was used. 140 | 141 | 142 | 143 | 144 | 145 | 146 | ### Summary of top hits per cluster 147 | 148 | It is often easier to examine the top predictions in one graph. This plot shows the top 5 scoring reference samples for each cluster (shown in different colors). 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | -------------------------------------------------------------------------------- /data/blueprint_expr.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/blueprint_expr.rda -------------------------------------------------------------------------------- /data/blueprint_samples.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/blueprint_samples.rda -------------------------------------------------------------------------------- /data/dice_expr.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/dice_expr.rda -------------------------------------------------------------------------------- /data/dice_samples.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/dice_samples.rda -------------------------------------------------------------------------------- /data/example_avgexp_data.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/example_avgexp_data.rda -------------------------------------------------------------------------------- /data/example_log_fc_data.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/example_log_fc_data.rda -------------------------------------------------------------------------------- /data/hema_expr.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hema_expr.rda -------------------------------------------------------------------------------- /data/hema_samples.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hema_samples.rda -------------------------------------------------------------------------------- /data/hpca_expr.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hpca_expr.rda -------------------------------------------------------------------------------- /data/hpca_samples.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hpca_samples.rda -------------------------------------------------------------------------------- /data/hsrnaseq_expr.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hsrnaseq_expr.rda -------------------------------------------------------------------------------- /data/hsrnaseq_samples.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hsrnaseq_samples.rda -------------------------------------------------------------------------------- /data/immgen_expr.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/immgen_expr.rda -------------------------------------------------------------------------------- /data/immgen_samples.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/immgen_samples.rda -------------------------------------------------------------------------------- /data/mmrnaseq_expr.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/mmrnaseq_expr.rda -------------------------------------------------------------------------------- /data/mmrnaseq_samples.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/mmrnaseq_samples.rda -------------------------------------------------------------------------------- /doc/CIPR_hex_mid.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/doc/CIPR_hex_mid.png -------------------------------------------------------------------------------- /doc/cipr_human_pbmc.R: -------------------------------------------------------------------------------- 1 | ## ---- include=FALSE----------------------------------------------------------- 2 | 3 | knitr::opts_chunk$set(eval=T) 4 | 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | 7 | 8 | 9 | ## ---- eval=F------------------------------------------------------------------ 10 | # 11 | # if (!requireNamespace("devtools", quietly = TRUE)) 12 | # install.packages("devtools") 13 | # 14 | # 15 | # # Use this option if you want to build vignettes during installation 16 | # # This can take a long time due to the installation of suggested packages. 17 | # devtools::install_github(..., build = TRUE, build_opts = c("--no-resave-data", "--no-manual") 18 | # 19 | # # Use this if you would like to install the package without vignettes 20 | # # devtools::install_github("atakanekiz/CIPR-Package") 21 | # 22 | 23 | ## ---- eval=T------------------------------------------------------------------ 24 | 25 | library(dplyr) 26 | library(Seurat) 27 | library(CIPR) 28 | 29 | 30 | ## ----------------------------------------------------------------------------- 31 | 32 | # Download data 33 | 34 | temp <- tempfile() 35 | tempd <- tempdir() 36 | 37 | download.file("https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz", destfile = temp) 38 | 39 | untar(temp, exdir = tempd) 40 | 41 | unlink(temp) 42 | 43 | 44 | # Load the PBMC dataset 45 | pbmc.data <- Read10X(data.dir = paste0(tempd, "\\filtered_gene_bc_matrices\\hg19")) 46 | # Initialize the Seurat object with the raw (non-normalized data). 47 | pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) 48 | pbmc 49 | 50 | 51 | 52 | ## ----------------------------------------------------------------------------- 53 | 54 | # Calculate mitochondrial gene representation (indicative of low quality cells) 55 | pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") 56 | 57 | # Filter out genes with feature counts outside of 200-2500 range, and >5% mt genes 58 | pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) 59 | 60 | 61 | ## ---- results="hide", message=F----------------------------------------------- 62 | 63 | pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000) 64 | pbmc <- NormalizeData(pbmc) 65 | 66 | 67 | ## ---- results="hide", message=F----------------------------------------------- 68 | 69 | pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) 70 | 71 | all.genes <- rownames(pbmc) 72 | pbmc <- ScaleData(pbmc, features = all.genes) 73 | 74 | 75 | ## ---- results="hide", message=F----------------------------------------------- 76 | 77 | pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)) 78 | 79 | 80 | ## ---- eval=T------------------------------------------------------------------ 81 | 82 | ElbowPlot(pbmc) 83 | 84 | 85 | ## ---- results="hide", message=F----------------------------------------------- 86 | 87 | pbmc <- FindNeighbors(pbmc, dims = 1:10) 88 | pbmc <- FindClusters(pbmc, resolution = 0.5) 89 | 90 | 91 | ## ---- results="hide", message=F----------------------------------------------- 92 | 93 | pbmc <- RunUMAP(pbmc, dims = 1:10) 94 | 95 | pbmc$unnamed_clusters <- Idents(pbmc) 96 | 97 | 98 | ## ----------------------------------------------------------------------------- 99 | 100 | # saveRDS(pbmc, "pbmc.rds") 101 | 102 | 103 | ## ---- echo=F, results="hide"-------------------------------------------------- 104 | 105 | allmarkers <- FindAllMarkers(pbmc) 106 | 107 | 108 | ## ---- include=F--------------------------------------------------------------- 109 | 110 | # saveRDS(allmarkers, "allmarkers.rds") 111 | 112 | 113 | ## ---- results="hide"---------------------------------------------------------- 114 | 115 | avgexp <- AverageExpression(pbmc) 116 | 117 | avgexp <- avgexp$RNA 118 | 119 | avgexp$gene <- rownames(avgexp) 120 | 121 | 122 | ## ---- include=F--------------------------------------------------------------- 123 | 124 | # saveRDS(avgexp, "avgexp.rds") 125 | 126 | 127 | ## ---- include=F--------------------------------------------------------------- 128 | 129 | # pbmc <- readRDS("pbmc.rds") 130 | 131 | 132 | ## ----------------------------------------------------------------------------- 133 | 134 | DimPlot(pbmc) 135 | 136 | 137 | ## ---- eval=T, include=F------------------------------------------------------- 138 | 139 | # allmarkers <- readRDS("allmarkers.rds") 140 | 141 | 142 | ## ---- eval=T, fig.width=16, fig.height=32, message=F-------------------------- 143 | 144 | CIPR(input_dat = allmarkers, 145 | comp_method = "logfc_dot_product", 146 | reference = "hsrnaseq", 147 | plot_ind = T, 148 | plot_top = F, 149 | global_results_obj = T, 150 | global_plot_obj = T, 151 | # axis.text.x=element_text(color="red") # arguments to pass to ggplot2::theme() to change plotting parameters 152 | ) 153 | 154 | 155 | 156 | 157 | ## ---- eval=T, fig.width=16, fig.height=5, message=F--------------------------- 158 | 159 | library(ggplot2) 160 | 161 | ind_clu_plots$cluster6 + 162 | theme(axis.text.y = element_text(color="red"), 163 | axis.text.x = element_text(color="blue")) + 164 | labs(fill="Reference")+ 165 | ggtitle("Figure S4a. Automated cluster annotation results are shown for cluster 6") + 166 | annotate("text", label="2 sd range", x=10, y= 500, size=8, color = "steelblue")+ 167 | annotate("text", label= "1 sd range", x=10, y=175, size=8, color ="orange2")+ 168 | geom_rect(aes(xmin=94, xmax=99, ymin=550, ymax=900), fill=NA, size=3, color="red") 169 | 170 | 171 | 172 | 173 | ## ---- eval=T, fig.width=8, fig.height=4.5, message=F-------------------------- 174 | 175 | CIPR(input_dat = allmarkers, 176 | comp_method = "logfc_dot_product", 177 | reference = "hsrnaseq", 178 | plot_ind = F, 179 | plot_top = T, 180 | global_results_obj = T, 181 | global_plot_obj = T) 182 | 183 | 184 | 185 | ## ---- eval=T------------------------------------------------------------------ 186 | 187 | DT::datatable(CIPR_top_results) 188 | 189 | 190 | DT::datatable(head(CIPR_all_results)) 191 | 192 | 193 | ## ---- eval=T, fig.width=16, fig.height=32, message=F-------------------------- 194 | 195 | CIPR(input_dat = avgexp, 196 | comp_method = "all_genes_spearman", 197 | reference = "hsrnaseq", 198 | plot_ind = T, 199 | plot_top = F, 200 | global_results_obj = T, 201 | global_plot_obj = T) 202 | 203 | 204 | 205 | 206 | ## ---- eval=T, fig.width=8, fig.height=4.5, message=F-------------------------- 207 | 208 | CIPR(input_dat = avgexp, 209 | comp_method = "all_genes_spearman", 210 | reference = "hsrnaseq", 211 | plot_ind = F, 212 | plot_top = T, 213 | global_results_obj = T, 214 | global_plot_obj = T) 215 | 216 | 217 | 218 | ## ---- eval=T------------------------------------------------------------------ 219 | 220 | DT::datatable(CIPR_top_results) 221 | 222 | 223 | DT::datatable(head(CIPR_all_results)) 224 | 225 | 226 | ## ---- eval=T, fig.width=16, fig.height=32, message=F-------------------------- 227 | 228 | CIPR(input_dat = allmarkers, 229 | comp_method = "logfc_dot_product", 230 | reference = "hsrnaseq", 231 | plot_ind = T, 232 | plot_top = F, 233 | global_results_obj = T, 234 | global_plot_obj = T, 235 | select_ref_subsets = c("CD4+ T cell", "CD8+ T cell", "Monocyte", "NK cell")) 236 | 237 | 238 | 239 | ## ---- eval=T, fig.width=16, fig.height=32, message=F-------------------------- 240 | 241 | CIPR(input_dat = avgexp, 242 | comp_method = "all_genes_spearman", 243 | reference = "hsrnaseq", 244 | plot_ind = T, 245 | plot_top = F, 246 | global_results_obj = T, 247 | global_plot_obj = T, 248 | keep_top_var = 10) 249 | 250 | 251 | 252 | 253 | ## ----------------------------------------------------------------------------- 254 | 255 | sessionInfo() 256 | 257 | 258 | 259 | -------------------------------------------------------------------------------- /doc/cipr_human_pbmc.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Using CIPR with human PBMC data" 3 | date: "`r format(Sys.time(), '%d %B, %Y')`" 4 | author: "Atakan Ekiz" 5 | output: 6 | rmarkdown::html_vignette: 7 | toc: true 8 | toc_depth: 3 9 | vignette: > 10 | %\VignetteIndexEntry{Using CIPR with human PBMC data} 11 | %\VignetteEngine{knitr::rmarkdown} 12 | \usepackage[utf8]{inputenc} 13 | --- 14 | 15 | ```{r, include=FALSE} 16 | 17 | knitr::opts_chunk$set(eval=T) 18 | 19 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 20 | 21 | 22 | ``` 23 | 24 | 25 | 26 | # Summary 27 | 28 | This vignette describes how to use CIPR package with 3k PBMC data freely available from 10X genomics. Here, we recycle the code described in [Seurat's guided clustering tutorial](https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html) to help users perform analyses from scratch. Using this dataset we will demonstrate the capabilities of CIPR to annotate single cell clusters in single cell RNAseq (scRNAseq) experiments. For further information about other clustering methods, please see Seurat's comprehensive [website](https://satijalab.org/seurat/) 29 | 30 | 31 | # Install CIPR 32 | 33 | ```{r, eval=F} 34 | 35 | if (!requireNamespace("devtools", quietly = TRUE)) 36 | install.packages("devtools") 37 | 38 | 39 | # Use this option if you want to build vignettes during installation 40 | # This can take a long time due to the installation of suggested packages. 41 | devtools::install_github(..., build = TRUE, build_opts = c("--no-resave-data", "--no-manual") 42 | 43 | # Use this if you would like to install the package without vignettes 44 | # devtools::install_github("atakanekiz/CIPR-Package") 45 | 46 | ``` 47 | 48 | 49 | # Seurat pipeline 50 | 51 | ## Setup Seurat object 52 | 53 | ```{r, eval=T} 54 | 55 | library(dplyr) 56 | library(Seurat) 57 | library(CIPR) 58 | 59 | ``` 60 | 61 | ```{r} 62 | 63 | # Download data 64 | 65 | temp <- tempfile() 66 | tempd <- tempdir() 67 | 68 | download.file("https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz", destfile = temp) 69 | 70 | untar(temp, exdir = tempd) 71 | 72 | unlink(temp) 73 | 74 | 75 | # Load the PBMC dataset 76 | pbmc.data <- Read10X(data.dir = paste0(tempd, "\\filtered_gene_bc_matrices\\hg19")) 77 | # Initialize the Seurat object with the raw (non-normalized data). 78 | pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) 79 | pbmc 80 | 81 | 82 | ``` 83 | 84 | 85 | ## Pre-processing 86 | 87 | The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features. 88 | 89 | ```{r} 90 | 91 | # Calculate mitochondrial gene representation (indicative of low quality cells) 92 | pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") 93 | 94 | # Filter out genes with feature counts outside of 200-2500 range, and >5% mt genes 95 | pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) 96 | 97 | ``` 98 | 99 | 100 | ## Normalizing data 101 | 102 | ```{r, results="hide", message=F} 103 | 104 | pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000) 105 | pbmc <- NormalizeData(pbmc) 106 | 107 | ``` 108 | 109 | ## Variable gene detection and scaling 110 | 111 | ```{r, results="hide", message=F} 112 | 113 | pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) 114 | 115 | all.genes <- rownames(pbmc) 116 | pbmc <- ScaleData(pbmc, features = all.genes) 117 | 118 | ``` 119 | 120 | ## Perform PCA 121 | 122 | ```{r, results="hide", message=F} 123 | 124 | pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)) 125 | 126 | ``` 127 | 128 | 129 | ```{r, eval=T} 130 | 131 | ElbowPlot(pbmc) 132 | 133 | ``` 134 | 135 | ## Cluster cells 136 | 137 | ```{r, results="hide", message=F} 138 | 139 | pbmc <- FindNeighbors(pbmc, dims = 1:10) 140 | pbmc <- FindClusters(pbmc, resolution = 0.5) 141 | 142 | ``` 143 | 144 | ## Run non-linear dimensionality reduction (UMAP) 145 | 146 | ```{r, results="hide", message=F} 147 | 148 | pbmc <- RunUMAP(pbmc, dims = 1:10) 149 | 150 | pbmc$unnamed_clusters <- Idents(pbmc) 151 | 152 | ``` 153 | 154 | ```{r} 155 | 156 | # saveRDS(pbmc, "pbmc.rds") 157 | 158 | ``` 159 | 160 | 161 | 162 | ## Find differentially expressed genes 163 | 164 | __This is the step where we generate the input for CIPR's log fold change (logFC) comparison methods.__ 165 | 166 | ```{r, echo=F, results="hide"} 167 | 168 | allmarkers <- FindAllMarkers(pbmc) 169 | 170 | ``` 171 | 172 | ```{r, include=F} 173 | 174 | # saveRDS(allmarkers, "allmarkers.rds") 175 | 176 | ``` 177 | 178 | ## Calculate average gene expression per cluster 179 | 180 | __This is the step where we generate the input for CIPR's all-genes correlation methods.__ 181 | 182 | ```{r, results="hide"} 183 | 184 | avgexp <- AverageExpression(pbmc) 185 | 186 | avgexp <- avgexp$RNA 187 | 188 | avgexp <- as.data.frame(avgexp) 189 | 190 | avgexp$gene <- rownames(avgexp) 191 | 192 | ``` 193 | 194 | ```{r, include=F} 195 | 196 | # saveRDS(avgexp, "avgexp.rds") 197 | 198 | ``` 199 | 200 | ## Visualize Seurat pbject 201 | 202 | ```{r, include=F} 203 | 204 | # pbmc <- readRDS("pbmc.rds") 205 | 206 | ``` 207 | 208 | ```{r} 209 | 210 | DimPlot(pbmc) 211 | 212 | ``` 213 | 214 | # CIPR analysis 215 | 216 | The user can select one of the 7 provided reference data sets: 217 | 218 | | Reference | `reference` argument | 219 | |-------------------------------------------|----------------------| 220 | | Immunological Genome Project (ImmGen) | "immgen" | 221 | | Presorted cell RNAseq (various tissues) | "mmrnaseq" | 222 | | Blueprint/ENCODE | "blueprint" | 223 | | Human Primary Cell Atlas | "hpca" | 224 | | Database of Immune Cell Expression (DICE) | "dice" | 225 | | Hematopoietic differentiation | "hema" | 226 | | Presorted cell RNAseq (PBMC) | "hsrnaseq" | 227 | | User-provided custom reference | "custom" | 228 | 229 | ## Standard logFC comparison method 230 | 231 | In this method CIPR accepts `allmarkers` data frame created above and performs the following analytical steps: 232 | 233 | - It calculates a vector of logFC values for each reference sample (i.e. individual columns of the reference data frame) by comparing log-normalized expression value of a gene (i.e. rows of the reference data frame) to the average gene expression across the entire reference dataset. 234 | - It then scores unknown cluster logFC differential gene expression data against this reference logFC values to create a vector of identity scores 235 | - User can select one of three methods: 236 | - LogFC dot product (sum of all logFC x logFC values among matching genes). This is the recommended method in CIPR. 237 | - LogFC Spearman's correlation (rank correlation of logFC values) 238 | - LogFC Pearson's correlation (linear correlation of logFC values) 239 | 240 | 241 | 242 | ```{r, eval=T, include=F} 243 | 244 | # allmarkers <- readRDS("allmarkers.rds") 245 | 246 | ``` 247 | 248 | 249 | ### Plot all identity scores per cluster-reference cell pairs 250 | 251 | The code below performs analysis using sorted human PBMC RNAseq data as reference, and plots 252 | 253 | CIPR results can be summarized for each cluster in scatter plots. 254 | 255 | ```{r, eval=T, fig.width=16, fig.height=32, message=F} 256 | 257 | CIPR(input_dat = allmarkers, 258 | comp_method = "logfc_dot_product", 259 | reference = "hsrnaseq", 260 | plot_ind = T, 261 | plot_top = F, 262 | global_results_obj = T, 263 | global_plot_obj = T, 264 | # axis.text.x=element_text(color="red") # arguments to pass to ggplot2::theme() to change plotting parameters 265 | ) 266 | 267 | 268 | 269 | ``` 270 | 271 | ### Plot identity scores for a select cluster 272 | 273 | `ind_clu_plots` object is created in the global environment to help users can visualize results for a desired cluster and manipulate graphing parameters. ggplot2 functions can be iteratively added to individual plots to create annotations etc. 274 | 275 | ```{r, eval=T, fig.width=16, fig.height=5, message=F} 276 | 277 | library(ggplot2) 278 | 279 | ind_clu_plots$cluster6 + 280 | theme(axis.text.y = element_text(color="red"), 281 | axis.text.x = element_text(color="blue")) + 282 | labs(fill="Reference")+ 283 | ggtitle("Figure S4a. Automated cluster annotation results are shown for cluster 6") + 284 | annotate("text", label="2 sd range", x=10, y= 500, size=8, color = "steelblue")+ 285 | annotate("text", label= "1 sd range", x=10, y=175, size=8, color ="orange2")+ 286 | geom_rect(aes(xmin=94, xmax=99, ymin=550, ymax=900), fill=NA, size=3, color="red") 287 | 288 | 289 | 290 | ``` 291 | 292 | 293 | 294 | ### Plot top scoring refernce subsets for each cluster 295 | 296 | ```{r, eval=T, fig.width=8, fig.height=4.5, message=F} 297 | 298 | CIPR(input_dat = allmarkers, 299 | comp_method = "logfc_dot_product", 300 | reference = "hsrnaseq", 301 | plot_ind = F, 302 | plot_top = T, 303 | global_results_obj = T, 304 | global_plot_obj = T) 305 | 306 | 307 | ``` 308 | 309 | ### Tabulate CIPR results 310 | 311 | CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (`CIPR_top_results` and `CIPR_all_results` respectively) to allow users to explore the outputs and generate specific plots and tables. 312 | 313 | ```{r, eval=T} 314 | 315 | DT::datatable(CIPR_top_results) 316 | 317 | 318 | DT::datatable(head(CIPR_all_results)) 319 | 320 | ``` 321 | 322 | 323 | ## Standard all-genes correlation method 324 | 325 | CIPR also implements a simple correlation approach in which overall correlation in gene expression is calculated for the pairs of unknown clusters and the reference samples (regardless of the differential expression status of the gene). This approach is conceptually similiar to some other automated identity prediction pipelines such as [SingleR](https://www.ncbi.nlm.nih.gov/pubmed/30643263) and [scMCA](https://www.ncbi.nlm.nih.gov/pubmed/30758821). 326 | 327 | Users can select one of the following methods: 328 | 329 | - __Spearman's correlation:__ It calculates correlation based on ranked gene expression. It can be suitable for comparing experimental and reference data which were obtained using different technologies. 330 | - __Pearson's correlation:__ It calculates linear correlations. This can be useful when the user would like to provide a custom reference dataset to CIPR. 331 | 332 | 333 | 334 | ### Plot all identity scores per cluster-reference cell pairs 335 | 336 | The code below performs analysis using sorted human PBMC RNAseq data as reference, and plots 337 | 338 | CIPR results can be summarized for each cluster in scatter plots. 339 | 340 | ```{r, eval=T, fig.width=16, fig.height=32, message=F} 341 | 342 | CIPR(input_dat = avgexp, 343 | comp_method = "all_genes_spearman", 344 | reference = "hsrnaseq", 345 | plot_ind = T, 346 | plot_top = F, 347 | global_results_obj = T, 348 | global_plot_obj = T) 349 | 350 | 351 | 352 | ``` 353 | 354 | 355 | ### Plot top scoring refernce subsets for each cluster 356 | 357 | ```{r, eval=T, fig.width=8, fig.height=4.5, message=F} 358 | 359 | CIPR(input_dat = avgexp, 360 | comp_method = "all_genes_spearman", 361 | reference = "hsrnaseq", 362 | plot_ind = F, 363 | plot_top = T, 364 | global_results_obj = T, 365 | global_plot_obj = T) 366 | 367 | 368 | ``` 369 | 370 | ### Tabulate CIPR results 371 | 372 | CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (`CIPR_top_results` and `CIPR_all_results` respectively) to allow users to explore the outputs and generate specific plots and tables. 373 | 374 | ```{r, eval=T} 375 | 376 | DT::datatable(CIPR_top_results) 377 | 378 | 379 | DT::datatable(head(CIPR_all_results)) 380 | 381 | ``` 382 | 383 | 384 | ## Limiting analysis to the select subsets of reference data 385 | 386 | Sometimes excluding irrelevant reference cell types from the analysis can be helpful. Especially when the logFC comparison methods are utilized, removing irrelevant subsets may improve discrimination of closely related subsets, since the reference logFC values will be calculated after subsetting the data frame. Filtering out reference subsets should not impact results of the all-genes correlation methods, but it can make the graphical outputs easier to look at 387 | 388 | 3k PBMC dataset may not be the best example to demonstrate benefits of reference dataset subsetting, but the code below serves as an example for this functionality. 389 | 390 | ```{r, eval=T, fig.width=16, fig.height=32, message=F} 391 | 392 | CIPR(input_dat = allmarkers, 393 | comp_method = "logfc_dot_product", 394 | reference = "hsrnaseq", 395 | plot_ind = T, 396 | plot_top = F, 397 | global_results_obj = T, 398 | global_plot_obj = T, 399 | select_ref_subsets = c("CD4+ T cell", "CD8+ T cell", "Monocyte", "NK cell")) 400 | 401 | 402 | ``` 403 | 404 | 405 | 406 | ## Filtering out lowly variable genes 407 | 408 | Genes that have a low expression variance across the reference data frame has weaker discriminatory potential. Thus, excluding these genes from the analysis can reduce the noise and improve the prediction scores, especially when using all-genes correlation based methods. 409 | 410 | We implemented a variance filtering parameter, `keep_top_var`, which allows users to keep top Nth% variable reference genes in the analysis. For instance, by setting this argument to 10, CIPR can be instructed to use only the top 10% highly variable genes in identity score calculations. In our experience _(Ekiz HA, BMC Bioinformatics, in revision)_ limiting the analysis to highly variable genes does not significantly impact the identity scores of the top-scoring reference cell subsets, but it reduces the identity scores of intermediate/low-scoring reference cells leading to an improvement of z-scores. The "best" value for this parameter remains to be determined by the user in individual studies. 411 | 412 | ```{r, eval=T, fig.width=16, fig.height=32, message=F} 413 | 414 | CIPR(input_dat = avgexp, 415 | comp_method = "all_genes_spearman", 416 | reference = "hsrnaseq", 417 | plot_ind = T, 418 | plot_top = F, 419 | global_results_obj = T, 420 | global_plot_obj = T, 421 | keep_top_var = 10) 422 | 423 | 424 | 425 | ``` 426 | 427 | 428 | 429 | 430 | ```{r} 431 | 432 | sessionInfo() 433 | 434 | 435 | ``` 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | 460 | -------------------------------------------------------------------------------- /doc/sample_ind_output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/doc/sample_ind_output.png -------------------------------------------------------------------------------- /doc/sample_top_output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/doc/sample_top_output.png -------------------------------------------------------------------------------- /man/CIPR.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/CIPR.R 3 | \name{CIPR} 4 | \alias{CIPR} 5 | \title{Cluster Identity Predictor} 6 | \usage{ 7 | CIPR( 8 | input_dat, 9 | comp_method = "logfc_dot_product", 10 | reference = NULL, 11 | select_ref_subsets = "all", 12 | custom_reference = NULL, 13 | custom_ref_annot = NULL, 14 | keep_top_var = 100, 15 | plot_ind = F, 16 | plot_top = T, 17 | top_num = 5, 18 | save_png = F, 19 | global_plot_obj = T, 20 | global_results_obj = T, 21 | update_ref = T, 22 | ... 23 | ) 24 | } 25 | \arguments{ 26 | \item{input_dat}{Data frame containing normalized log2-transformed gene 27 | expression values per cluster OR a table of differentially expressed genes 28 | per cluster} 29 | 30 | \item{comp_method}{Method to use for identity score calculations. It accepts 31 | one of the following: "logfc_dot_product" (default), "logfc_spearman", 32 | "logfc_pearson", "all_genes_spearman", "all_genes_pearson"} 33 | 34 | \item{reference}{Reference data frame containing gene expression data from 35 | known cell types. It accepts one of the following: "immgen" (default), 36 | "mmrnaseq", "blueprint", "hpca", "dice", "hema", "hsrnaseq", "custom"} 37 | 38 | \item{select_ref_subsets}{The names of cell subsets to be included in the 39 | analysis. For using the entire reference dataset use "all", or 40 | provide a character vector of cell types of interest. Defaults to "all"} 41 | 42 | \item{custom_reference}{A data frame containing custom reference. There must 43 | be a column named 'gene' and other columns contain normalized gene expression 44 | data from known samples. Defaults to NULL} 45 | 46 | \item{custom_ref_annot}{A data frame containing custom reference metadata. 47 | This is optional to get more informative results from CIPR. The data 48 | frame must contain columns named 'short_name' (must match column names in 49 | custom reference), 'long_name' (human readable names for reference samples), 50 | 'description' (details such as positive and negative sorting markers), 51 | 'reference_cell_type' (e.g. T cell, B cell, NK)} 52 | 53 | \item{keep_top_var}{Top n percent of highly variant reference genes to 54 | include in the analysis. It accepts a numeric value smaller than or equal 55 | to 100 (default). The value of 100 results in keeping all the genes in the 56 | reference dataset)} 57 | 58 | \item{plot_ind}{Logical value. Set it to TRUE to plot identity scores for 59 | each cluster. Defaults to FALSE} 60 | 61 | \item{plot_top}{Logical value. set it to TRUE to plot top scoring reference 62 | cell types for each cluster. Defaults ot TRUE.} 63 | 64 | \item{top_num}{A numeric value determining how many top scoring reference 65 | cell types will be plotted for each cluster. Defaults to 5.} 66 | 67 | \item{save_png}{Logical value. Set it to TRUE if you would like to export png 68 | images of the results. Defaults to FALSE} 69 | 70 | \item{global_plot_obj}{Logical value. Set it to TRUE if you would like to 71 | keep the plots as an object in the global environment. This can be useful 72 | for accessing and manipulating the graphs. Defaults to TRUE.} 73 | 74 | \item{global_results_obj}{Logical value. Set it to TRUE if you would like to 75 | keep the analysis results as a global object. Defaults to TRUE.} 76 | 77 | \item{...}{arguments to pass to theme() (for graph manipulation)} 78 | } 79 | \value{ 80 | Graphical outputs and/or data frames of identity scores calculated 81 | for each cluster in the input data. 82 | } 83 | \description{ 84 | CIPR (Cluster Identity PRedictor) is a pipeline that helps annotating 85 | unknown single cell clusters in single cell RNA sequencing (scRNAseq experiments). 86 | This function scores unknown cluster gene expression signatures against known reference 87 | datasets using user-selected analytical approaches to facilitate scRNAseq analysis. 88 | } 89 | \examples{ 90 | 91 | # Example of using CIPR in conjunction with Seurat 92 | library(Seurat) 93 | allmarkers <- FindAllMarkers(seurat_object) 94 | avgexp <- AverageExpression(seurat_object) 95 | 96 | # Using built-in immgen as reference and logfc dot product method 97 | CIPR(input_dat = allmarkers, 98 | comp_method = "logfc_dot_product", 99 | reference="immgen", 100 | keep_top_var = 100, 101 | global_results_obj = T, 102 | plot_top = T) 103 | 104 | #' # Using built-in immgen as reference and all genes spearman method 105 | 106 | CIPR(input_dat = avgexp, 107 | comp_method = "all_genes_spearman", 108 | reference="immgen", 109 | keep_top_var = 100, 110 | global_results_obj = T, 111 | plot_top = T) 112 | 113 | 114 | # Using built-in dice reference and logFC spearman method 115 | # Variance threshold of top 50\% 116 | 117 | CIPR(input_dat = allmarkers, 118 | comp_method = "logfc_spearman", 119 | reference="dice", 120 | keep_top_var = 50, 121 | global_results_obj = T, 122 | plot_top = T) 123 | 124 | 125 | # Using a custom reference 126 | 127 | CIPR(input_dat = allmarkers, 128 | comp_method = "logfc_dot_product", 129 | reference="custom", 130 | custom_ref_dat_path = custom_ref_df, 131 | custom_ref_annot_path = custom_annot_df, 132 | keep_top_var = 100, 133 | global_results_obj = T, 134 | plot_top = T) 135 | 136 | 137 | # Using a blueprint-encode reference and limiting the analysis 138 | # to "Pericytes", "Skeletal muscle", "Smooth muscle" 139 | 140 | CIPR(input_dat = allmarkers, 141 | comp_method = "logfc_dot_product", 142 | reference="blueprint-encode", 143 | select_ref_subsets = c("Pericytes", "Skeletal muscle", "Smooth muscle") 144 | keep_top_var = 100, 145 | global_results_obj = T, 146 | plot_top = T) 147 | 148 | # Using built in example data (logFC signatures per cluster) 149 | CIPR(input_dat = example_logfc_data, 150 | comp_method = "logfc_dot_product", 151 | reference="blueprint-encode", 152 | select_ref_subsets = c("Pericytes", "Skeletal muscle", "Smooth muscle") 153 | keep_top_var = 100, 154 | global_results_obj = T, 155 | plot_top = T) 156 | 157 | 158 | # Using built in example data (average expression) 159 | CIPR(input_dat = example_avgexp_data, 160 | comp_method = "all_spearman", 161 | reference="immgen", 162 | select_ref_subsets = "all", 163 | keep_top_var = 100, 164 | global_results_obj = T, 165 | plot_top = T) 166 | 167 | } 168 | -------------------------------------------------------------------------------- /vignettes/cipr_human_pbmc.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Using CIPR with human PBMC data" 3 | date: "`r format(Sys.time(), '%d %B, %Y')`" 4 | author: "Atakan Ekiz" 5 | output: 6 | rmarkdown::html_vignette: 7 | toc: true 8 | toc_depth: 3 9 | vignette: > 10 | %\VignetteIndexEntry{Using CIPR with human PBMC data} 11 | %\VignetteEngine{knitr::rmarkdown} 12 | \usepackage[utf8]{inputenc} 13 | --- 14 | 15 | ```{r, include=FALSE} 16 | 17 | knitr::opts_chunk$set(eval=T) 18 | 19 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 20 | 21 | 22 | ``` 23 | 24 | 25 | 26 | # Summary 27 | 28 | This vignette describes how to use CIPR package with 3k PBMC data freely available from 10X genomics. Here, we recycle the code described in [Seurat's guided clustering tutorial](https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html) to help users perform analyses from scratch. Using this dataset we will demonstrate the capabilities of CIPR to annotate single cell clusters in single cell RNAseq (scRNAseq) experiments. For further information about other clustering methods, please see Seurat's comprehensive [website](https://satijalab.org/seurat/) 29 | 30 | 31 | # Install CIPR 32 | 33 | ```{r, eval=F} 34 | 35 | if (!requireNamespace("devtools", quietly = TRUE)) 36 | install.packages("devtools") 37 | 38 | 39 | # Use this option if you want to build vignettes during installation 40 | # This can take a long time due to the installation of suggested packages. 41 | devtools::install_github("atakanekiz/CIPR-Package", build_vignettes = TRUE) 42 | 43 | # Use this if you would like to install the package without vignettes 44 | # devtools::install_github("atakanekiz/CIPR-Package") 45 | 46 | ``` 47 | 48 | 49 | # Seurat pipeline 50 | 51 | ## Setup Seurat object 52 | 53 | ```{r, eval=T} 54 | 55 | library(dplyr) 56 | library(Seurat) 57 | library(CIPR) 58 | 59 | ``` 60 | 61 | ```{r} 62 | 63 | # Download data 64 | 65 | temp <- tempfile() 66 | tempd <- tempdir() 67 | 68 | download.file("https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz", destfile = temp) 69 | 70 | untar(temp, exdir = tempd) 71 | 72 | unlink(temp) 73 | 74 | 75 | # Load the PBMC dataset 76 | pbmc.data <- Read10X(data.dir = paste0(tempd, "/filtered_gene_bc_matrices/hg19")) 77 | # Initialize the Seurat object with the raw (non-normalized data). 78 | pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) 79 | pbmc 80 | 81 | 82 | ``` 83 | 84 | 85 | ## Pre-processing 86 | 87 | The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features. 88 | 89 | ```{r} 90 | 91 | # Calculate mitochondrial gene representation (indicative of low quality cells) 92 | pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") 93 | 94 | # Filter out genes with feature counts outside of 200-2500 range, and >5% mt genes 95 | pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) 96 | 97 | ``` 98 | 99 | 100 | ## Normalizing data 101 | 102 | ```{r, results="hide", message=F} 103 | 104 | pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000) 105 | pbmc <- NormalizeData(pbmc) 106 | 107 | ``` 108 | 109 | ## Variable gene detection and scaling 110 | 111 | ```{r, results="hide", message=F} 112 | 113 | pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) 114 | 115 | all.genes <- rownames(pbmc) 116 | pbmc <- ScaleData(pbmc, features = all.genes) 117 | 118 | ``` 119 | 120 | ## Perform PCA 121 | 122 | ```{r, results="hide", message=F} 123 | 124 | pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)) 125 | 126 | ``` 127 | 128 | 129 | ```{r, eval=T} 130 | 131 | ElbowPlot(pbmc) 132 | 133 | ``` 134 | 135 | ## Cluster cells 136 | 137 | ```{r, results="hide", message=F} 138 | 139 | pbmc <- FindNeighbors(pbmc, dims = 1:10) 140 | pbmc <- FindClusters(pbmc, resolution = 0.5) 141 | 142 | ``` 143 | 144 | ## Run non-linear dimensionality reduction (tSNE) 145 | 146 | ```{r, results="hide", message=F} 147 | 148 | pbmc <- RunTSNE(pbmc, dims = 1:10) 149 | 150 | pbmc$unnamed_clusters <- Idents(pbmc) 151 | 152 | ``` 153 | 154 | ```{r} 155 | 156 | # saveRDS(pbmc, "pbmc.rds") 157 | 158 | ``` 159 | 160 | 161 | 162 | ## Find differentially expressed genes 163 | 164 | __This is the step where we generate the input for CIPR's log fold change (logFC) comparison methods.__ 165 | 166 | ```{r, echo=F, results="hide"} 167 | 168 | allmarkers <- FindAllMarkers(pbmc) 169 | 170 | ``` 171 | 172 | ```{r, include=F} 173 | 174 | # saveRDS(allmarkers, "allmarkers.rds") 175 | 176 | ``` 177 | 178 | ## Calculate average gene expression per cluster 179 | 180 | __This is the step where we generate the input for CIPR's all-genes correlation methods.__ 181 | 182 | ```{r, results="hide"} 183 | 184 | avgexp <- AverageExpression(pbmc) 185 | 186 | avgexp <- as.data.frame(avgexp$RNA) 187 | 188 | avgexp$gene <- rownames(avgexp) 189 | 190 | ``` 191 | 192 | ```{r, include=F} 193 | 194 | # saveRDS(avgexp, "avgexp.rds") 195 | 196 | ``` 197 | 198 | ## Visualize Seurat pbject 199 | 200 | ```{r, include=F} 201 | 202 | # pbmc <- readRDS("pbmc.rds") 203 | 204 | ``` 205 | 206 | ```{r} 207 | 208 | DimPlot(pbmc) 209 | 210 | ``` 211 | 212 | # CIPR analysis 213 | 214 | The user can select one of the 7 provided reference data sets: 215 | 216 | | Reference | `reference` argument | 217 | |-------------------------------------------|----------------------| 218 | | Immunological Genome Project (ImmGen) | "immgen" | 219 | | Presorted cell RNAseq (various tissues) | "mmrnaseq" | 220 | | Blueprint/ENCODE | "blueprint" | 221 | | Human Primary Cell Atlas | "hpca" | 222 | | Database of Immune Cell Expression (DICE) | "dice" | 223 | | Hematopoietic differentiation | "hema" | 224 | | Presorted cell RNAseq (PBMC) | "hsrnaseq" | 225 | | User-provided custom reference | "custom" | 226 | 227 | ## Standard logFC comparison method 228 | 229 | In this method CIPR accepts `allmarkers` data frame created above and performs the following analytical steps: 230 | 231 | - It calculates a vector of logFC values for each reference sample (i.e. individual columns of the reference data frame) by comparing log-normalized expression value of a gene (i.e. rows of the reference data frame) to the average gene expression across the entire reference dataset. 232 | - It then scores unknown cluster logFC differential gene expression data against this reference logFC values to create a vector of identity scores 233 | - User can select one of three methods: 234 | - LogFC dot product (sum of all logFC x logFC values among matching genes). This is the recommended method in CIPR. 235 | - LogFC Spearman's correlation (rank correlation of logFC values) 236 | - LogFC Pearson's correlation (linear correlation of logFC values) 237 | 238 | 239 | 240 | ```{r, eval=T, include=F} 241 | 242 | # allmarkers <- readRDS("allmarkers.rds") 243 | 244 | ``` 245 | 246 | 247 | ### Plot all identity scores per cluster-reference cell pairs 248 | 249 | The code below performs analysis using sorted human PBMC RNAseq data as reference, and plots 250 | 251 | CIPR results can be summarized for each cluster in scatter plots. 252 | 253 | ```{r, eval=T, fig.width=16, fig.height=32, message=F} 254 | 255 | CIPR(input_dat = allmarkers, 256 | comp_method = "logfc_dot_product", 257 | reference = "hsrnaseq", 258 | plot_ind = T, 259 | plot_top = F, 260 | global_results_obj = T, 261 | global_plot_obj = T, 262 | # axis.text.x=element_text(color="red") # arguments to pass to ggplot2::theme() to change plotting parameters 263 | ) 264 | 265 | 266 | 267 | ``` 268 | 269 | ### Plot identity scores for a select cluster 270 | 271 | `ind_clu_plots` object is created in the global environment to help users can visualize results for a desired cluster and manipulate graphing parameters. ggplot2 functions can be iteratively added to individual plots to create annotations etc. 272 | 273 | ```{r, eval=T, fig.width=16, fig.height=5, message=F} 274 | 275 | library(ggplot2) 276 | 277 | ind_clu_plots$cluster6 + 278 | theme(axis.text.y = element_text(color="red"), 279 | axis.text.x = element_text(color="blue")) + 280 | labs(fill="Reference")+ 281 | ggtitle("Figure S4a. Automated cluster annotation results are shown for cluster 6") + 282 | annotate("text", label="2 sd range", x=10, y= 500, size=8, color = "steelblue")+ 283 | annotate("text", label= "1 sd range", x=10, y=175, size=8, color ="orange2")+ 284 | geom_rect(aes(xmin=94, xmax=99, ymin=550, ymax=900), fill=NA, size=3, color="red") 285 | 286 | 287 | 288 | ``` 289 | 290 | 291 | 292 | ### Plot top scoring refernce subsets for each cluster 293 | 294 | ```{r, eval=T, fig.width=8, fig.height=4.5, message=F} 295 | 296 | CIPR(input_dat = allmarkers, 297 | comp_method = "logfc_dot_product", 298 | reference = "hsrnaseq", 299 | plot_ind = F, 300 | plot_top = T, 301 | global_results_obj = T, 302 | global_plot_obj = T) 303 | 304 | 305 | ``` 306 | 307 | ### Tabulate CIPR results 308 | 309 | CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (`CIPR_top_results` and `CIPR_all_results` respectively) to allow users to explore the outputs and generate specific plots and tables. 310 | 311 | ```{r, eval=T} 312 | 313 | DT::datatable(CIPR_top_results) 314 | 315 | 316 | DT::datatable(head(CIPR_all_results)) 317 | 318 | ``` 319 | 320 | 321 | ## Standard all-genes correlation method 322 | 323 | CIPR also implements a simple correlation approach in which overall correlation in gene expression is calculated for the pairs of unknown clusters and the reference samples (regardless of the differential expression status of the gene). This approach is conceptually similiar to some other automated identity prediction pipelines such as [SingleR](https://www.ncbi.nlm.nih.gov/pubmed/30643263) and [scMCA](https://www.ncbi.nlm.nih.gov/pubmed/30758821). 324 | 325 | Users can select one of the following methods: 326 | 327 | - __Spearman's correlation:__ It calculates correlation based on ranked gene expression. It can be suitable for comparing experimental and reference data which were obtained using different technologies. 328 | - __Pearson's correlation:__ It calculates linear correlations. This can be useful when the user would like to provide a custom reference dataset to CIPR. 329 | 330 | 331 | 332 | ### Plot all identity scores per cluster-reference cell pairs 333 | 334 | The code below performs analysis using sorted human PBMC RNAseq data as reference, and plots 335 | 336 | CIPR results can be summarized for each cluster in scatter plots. 337 | 338 | ```{r, eval=T, fig.width=16, fig.height=32, message=F} 339 | 340 | CIPR(input_dat = avgexp, 341 | comp_method = "all_genes_spearman", 342 | reference = "hsrnaseq", 343 | plot_ind = T, 344 | plot_top = F, 345 | global_results_obj = T, 346 | global_plot_obj = T) 347 | 348 | 349 | 350 | ``` 351 | 352 | 353 | ### Plot top scoring refernce subsets for each cluster 354 | 355 | ```{r, eval=T, fig.width=8, fig.height=4.5, message=F} 356 | 357 | CIPR(input_dat = avgexp, 358 | comp_method = "all_genes_spearman", 359 | reference = "hsrnaseq", 360 | plot_ind = F, 361 | plot_top = T, 362 | global_results_obj = T, 363 | global_plot_obj = T) 364 | 365 | 366 | ``` 367 | 368 | ### Tabulate CIPR results 369 | 370 | CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (`CIPR_top_results` and `CIPR_all_results` respectively) to allow users to explore the outputs and generate specific plots and tables. 371 | 372 | ```{r, eval=T} 373 | 374 | DT::datatable(CIPR_top_results) 375 | 376 | 377 | DT::datatable(head(CIPR_all_results)) 378 | 379 | ``` 380 | 381 | 382 | ## Limiting analysis to the select subsets of reference data 383 | 384 | Sometimes excluding irrelevant reference cell types from the analysis can be helpful. Especially when the logFC comparison methods are utilized, removing irrelevant subsets may improve discrimination of closely related subsets, since the reference logFC values will be calculated after subsetting the data frame. Filtering out reference subsets should not impact results of the all-genes correlation methods, but it can make the graphical outputs easier to look at 385 | 386 | 3k PBMC dataset may not be the best example to demonstrate benefits of reference dataset subsetting, but the code below serves as an example for this functionality. 387 | 388 | ```{r, eval=T, fig.width=16, fig.height=32, message=F} 389 | 390 | CIPR(input_dat = allmarkers, 391 | comp_method = "logfc_dot_product", 392 | reference = "hsrnaseq", 393 | plot_ind = T, 394 | plot_top = F, 395 | global_results_obj = T, 396 | global_plot_obj = T, 397 | select_ref_subsets = c("CD4+ T cell", "CD8+ T cell", "Monocyte", "NK cell")) 398 | 399 | 400 | ``` 401 | 402 | 403 | 404 | ## Filtering out lowly variable genes 405 | 406 | Genes that have a low expression variance across the reference data frame has weaker discriminatory potential. Thus, excluding these genes from the analysis can reduce the noise and improve the prediction scores, especially when using all-genes correlation based methods. 407 | 408 | We implemented a variance filtering parameter, `keep_top_var`, which allows users to keep top Nth% variable reference genes in the analysis. For instance, by setting this argument to 10, CIPR can be instructed to use only the top 10% highly variable genes in identity score calculations. In our experience _(Ekiz HA, BMC Bioinformatics, in revision)_ limiting the analysis to highly variable genes does not significantly impact the identity scores of the top-scoring reference cell subsets, but it reduces the identity scores of intermediate/low-scoring reference cells leading to an improvement of z-scores. The "best" value for this parameter remains to be determined by the user in individual studies. 409 | 410 | ```{r, eval=T, fig.width=16, fig.height=32, message=F} 411 | 412 | CIPR(input_dat = avgexp, 413 | comp_method = "all_genes_spearman", 414 | reference = "hsrnaseq", 415 | plot_ind = T, 416 | plot_top = F, 417 | global_results_obj = T, 418 | global_plot_obj = T, 419 | keep_top_var = 10) 420 | 421 | 422 | 423 | ``` 424 | 425 | 426 | 427 | 428 | ```{r} 429 | 430 | sessionInfo() 431 | 432 | 433 | ``` 434 | 435 | 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | --------------------------------------------------------------------------------