├── .Rbuildignore
├── .gitignore
├── CIPR.Rproj
├── DESCRIPTION
├── NAMESPACE
├── R
    └── CIPR.R
├── README.md
├── data
    ├── blueprint_expr.rda
    ├── blueprint_samples.rda
    ├── dice_expr.rda
    ├── dice_samples.rda
    ├── example_avgexp_data.rda
    ├── example_log_fc_data.rda
    ├── hema_expr.rda
    ├── hema_samples.rda
    ├── hpca_expr.rda
    ├── hpca_samples.rda
    ├── hsrnaseq_expr.rda
    ├── hsrnaseq_samples.rda
    ├── immgen_expr.rda
    ├── immgen_samples.rda
    ├── mmrnaseq_expr.rda
    └── mmrnaseq_samples.rda
├── doc
    ├── CIPR_hex_mid.png
    ├── cipr_human_pbmc.R
    ├── cipr_human_pbmc.Rmd
    ├── cipr_human_pbmc.html
    ├── sample_ind_output.png
    └── sample_top_output.png
├── man
    └── CIPR.Rd
└── vignettes
    └── cipr_human_pbmc.Rmd


/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^.*\.Rproj$
2 | ^\.Rproj\.user$
3 | ^doc$
4 | ^Meta$
5 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 | Meta
6 | to_improve.txt
7 | 


--------------------------------------------------------------------------------
/CIPR.Rproj:
--------------------------------------------------------------------------------
 1 | Version: 1.0
 2 | 
 3 | RestoreWorkspace: Default
 4 | SaveWorkspace: Default
 5 | AlwaysSaveHistory: Default
 6 | 
 7 | EnableCodeIndexing: Yes
 8 | UseSpacesForTab: Yes
 9 | NumSpacesForTab: 4
10 | Encoding: UTF-8
11 | 
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 | 
15 | AutoAppendNewline: Yes
16 | StripTrailingWhitespace: Yes
17 | 
18 | BuildType: Package
19 | PackageUseDevtools: Yes
20 | PackageInstallArgs: --no-multiarch --with-keep.source
21 | 


--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: CIPR
 2 | Type: Package
 3 | Title: Cluster Identity Predictor
 4 | Version: 0.1.0
 5 | Author: Atakan Ekiz
 6 | Maintainer: Atakan Ekiz <atakanekiz@gmail.com>
 7 | Description: CIPR (Cluster Identity PRedictor) package is developed to help annotating unknown cell clusters in single cell RNAseq (scRNAseq) experiments. Due to overlapping gene expression profiles, annotating the biological origins of single cell clusters can be challenging. The current software solutions to help annotating single cell clusters generally lack a user-friendly GUI, can take long times to compute, or may not have suitable reference data frames for the experimental data at hand. Alternatively, researchers can manually examine a few known marker genes in the experimental data to approximate the identity of single cell clusters which can be low throughput, time consuming, and subjective. We introduce CIPR (Cluster Identity PRedictor) package (and its Shiny implementation) to address these shortcomings. CIPR scores the gene expression profiles of unknown single cell clusters against a collection of known reference cell types and calculates an identity score for the cluster-reference cell pairs. Multiple reference RNAseq/microarray datasets obtained from sorted human and mouse cells are available in CIPR and the computational pipeline is agnostic to species differences allowing this program to be used with both human and mouse input data. Users can limit the analysis to only certain reference cell types and to genes with high overall variance. Furthrmore, CIPR can also work with custom reference data frames provided by the users to be used in specialized experimental contexts. Through this R-package implementation, CIPR can be run directly in R and be implemented easily in different analytical contexts. In summary, CIPR is a software solution to facilitate the analysis of scRNAseq data.
 8 | License: GPL-3
 9 | Encoding: UTF-8
10 | LazyData: true
11 | Depends: R (>= 3.5.0)
12 | Imports: ggpubr, gtools, tibble, dplyr, rlang
13 | Suggests: knitr, rmarkdown, Seurat
14 | VignetteBuilder: knitr
15 | URL: https://github.com/atakanekiz/CIPR-Package
16 | BugReports: https://github.com/atakanekiz/CIPR-Package/issues
17 | RoxygenNote: 7.0.2
18 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | exportPattern("^[[:alpha:]]+")
2 | importFrom(magrittr,"%>%")
3 | 


--------------------------------------------------------------------------------
/R/CIPR.R:
--------------------------------------------------------------------------------
  1 | #' Cluster Identity Predictor
  2 | #'
  3 | #' @description CIPR (Cluster Identity PRedictor) is a pipeline that helps annotating
  4 | #' unknown single cell clusters in single cell RNA sequencing (scRNAseq experiments).
  5 | #' This function scores unknown cluster gene expression signatures against known reference
  6 | #' datasets using user-selected analytical approaches to facilitate scRNAseq analysis.
  7 | #'
  8 | #' @param input_dat Data frame containing normalized log2-transformed gene
  9 | #' expression values per cluster OR a table of differentially expressed genes
 10 | #' per cluster
 11 | #'
 12 | #' @param comp_method Method to use for identity score calculations. It accepts
 13 | #' one of the following: "logfc_dot_product" (default), "logfc_spearman",
 14 | #' "logfc_pearson", "all_genes_spearman", "all_genes_pearson"
 15 | #'
 16 | #' @param reference Reference data frame containing gene expression data from
 17 | #' known cell types. It accepts one of the following: "immgen" (default),
 18 | #' "mmrnaseq", "blueprint", "hpca", "dice", "hema", "hsrnaseq", "custom"
 19 | #'
 20 | #' @param select_ref_subsets The names of cell subsets to be included in the
 21 | #' analysis. For using the entire reference dataset use "all", or
 22 | #' provide a character vector of cell types of interest. Defaults to "all"
 23 | #'
 24 | #' @param custom_reference A data frame containing custom reference. There must
 25 | #' be a column named 'gene' and other columns contain normalized gene expression
 26 | #' data from known samples. Defaults to NULL
 27 | #'
 28 | #' @param custom_ref_annot A data frame containing custom reference metadata.
 29 | #' This is optional to get more informative results from CIPR. The data
 30 | #' frame must contain columns named 'short_name' (must match column names in
 31 | #' custom reference), 'long_name' (human readable names for reference samples),
 32 | #' 'description' (details such as positive and negative sorting markers),
 33 | #' 'reference_cell_type' (e.g. T cell, B cell, NK)
 34 | #'
 35 | #' @param keep_top_var Top n percent of highly variant reference genes to
 36 | #' include in the analysis. It accepts a numeric value smaller than or equal
 37 | #' to 100 (default). The value of 100 results in keeping all the genes in the
 38 | #' reference dataset)
 39 | #'
 40 | #' @param plot_ind Logical value. Set it to TRUE to plot identity scores for
 41 | #' each cluster. Defaults to FALSE
 42 | #'
 43 | #' @param plot_top Logical value. set it to TRUE to plot top scoring reference
 44 | #' cell types for each cluster. Defaults ot TRUE.
 45 | #'
 46 | #' @param top_num A numeric value determining how many top scoring reference
 47 | #' cell types will be plotted for each cluster. Defaults to 5.
 48 | #'
 49 | #' @param save_png Logical value. Set it to TRUE if you would like to export png
 50 | #' images of the results. Defaults to FALSE
 51 | #'
 52 | #' @param global_plot_obj Logical value. Set it to TRUE if you would like to
 53 | #' keep the plots as an object in the global environment. This can be useful
 54 | #' for accessing and manipulating the graphs. Defaults to TRUE.
 55 | #'
 56 | #' @param global_results_obj Logical value. Set it to TRUE if you would like to
 57 | #' keep the analysis results as a global object. Defaults to TRUE.
 58 | #'
 59 | #' @param ... arguments to pass to ggplot2::theme() (for graph manipulation)
 60 | #'
 61 | #' @return Graphical outputs and/or data frames of identity scores calculated
 62 | #' for each cluster in the input data.
 63 | #'
 64 | #' @examples
 65 | #'
 66 | #' # Example of using CIPR in conjunction with Seurat
 67 | #' library(Seurat)
 68 | #' allmarkers <- FindAllMarkers(seurat_object)
 69 | #' avgexp <- AverageExpression(seurat_object)
 70 | #'
 71 | #' # Using built-in immgen as reference and logfc dot product method
 72 | #' CIPR(input_dat = allmarkers,
 73 | #' comp_method = "logfc_dot_product",
 74 | #' reference="immgen",
 75 | #' keep_top_var = 100,
 76 | #' global_results_obj = T,
 77 | #' plot_top = T)
 78 | #'
 79 | #' #' # Using built-in immgen as reference and all genes spearman method
 80 | #'
 81 | #' CIPR(input_dat = avgexp,
 82 | #' comp_method = "all_genes_spearman",
 83 | #' reference="immgen",
 84 | #' keep_top_var = 100,
 85 | #' global_results_obj = T,
 86 | #' plot_top = T)
 87 | #'
 88 | #'
 89 | #' # Using built-in dice reference and logFC spearman method
 90 | #' # Variance threshold of top 50%
 91 | #'
 92 | #' CIPR(input_dat = allmarkers,
 93 | #' comp_method = "logfc_spearman",
 94 | #' reference="dice",
 95 | #' keep_top_var = 50,
 96 | #' global_results_obj = T,
 97 | #' plot_top = T)
 98 | #'
 99 | #'
100 | #' # Using a custom reference
101 | #'
102 | #' CIPR(input_dat = allmarkers,
103 | #' comp_method = "logfc_dot_product",
104 | #' reference="custom",
105 | #' custom_reference = custom_ref_df,
106 | #' custom_ref_annot = custom_annot_df,
107 | #' keep_top_var = 100,
108 | #' global_results_obj = T,
109 | #' plot_top = T)
110 | #'
111 | #'
112 | #' # Using a blueprint-encode reference and limiting the analysis
113 | #' # to "Pericytes", "Skeletal muscle", "Smooth muscle"
114 | #'
115 | #' CIPR(input_dat = allmarkers,
116 | #' comp_method = "logfc_dot_product",
117 | #' reference="blueprint-encode",
118 | #' select_ref_subsets = c("Pericytes", "Skeletal muscle", "Smooth muscle")
119 | #' keep_top_var = 100,
120 | #' global_results_obj = T,
121 | #' plot_top = T)
122 | #'
123 | #' # Using built in example data (logFC signatures per cluster)
124 | #' CIPR(input_dat = example_logfc_data,
125 | #' comp_method = "logfc_dot_product",
126 | #' reference="blueprint-encode",
127 | #' select_ref_subsets = c("Pericytes", "Skeletal muscle", "Smooth muscle")
128 | #' keep_top_var = 100,
129 | #' global_results_obj = T,
130 | #' plot_top = T)
131 | #'
132 | #'
133 | #' # Using built in example data (average expression)
134 | #' CIPR(input_dat = example_avgexp_data,
135 | #' comp_method = "all_spearman",
136 | #' reference="immgen",
137 | #' select_ref_subsets = "all",
138 | #' keep_top_var = 100,
139 | #' global_results_obj = T,
140 | #' plot_top = T)
141 | #'
142 | 
143 | 
144 | 
145 | CIPR <- function(input_dat,
146 |                  comp_method = "logfc_dot_product",
147 |                  reference = NULL,
148 |                  select_ref_subsets = "all",
149 |                  custom_reference = NULL,
150 |                  custom_ref_annot = NULL,
151 |                  keep_top_var = 100,
152 |                  plot_ind = F,
153 |                  plot_top = T,
154 |                  top_num = 5,
155 |                  save_png = F,
156 |                  global_plot_obj = T,
157 |                  global_results_obj = T,
158 |                  update_ref = T,
159 |                  ...
160 | ){
161 | 
162 | 
163 | 
164 |   # suppressMessages({
165 |   #   require(ggpubr)
166 |   #   require(gtools)
167 |   #   require(tibble)
168 |   #   require(dplyr)
169 |   # })
170 | 
171 | 
172 | 
173 |   ######################### Prepare input_dat   #########################
174 | 
175 |   message("Preparing input data")
176 | 
177 |   if(!comp_method %in% c("logfc_dot_product",
178 |                          "logfc_spearman",
179 |                          "logfc_pearson",
180 |                          "all_genes_spearman",
181 |                          "all_genes_pearson")) stop("Check comp_method. Allowed strings are: 'logfc_dot_product', 'logfc_spearman', 'logfc_pearson', 'all_genes_spearman', 'all_genes_pearson'")
182 | 
183 |   if(grepl("logfc", comp_method)){
184 | 
185 |     # Define column names to allow flexibility in case and close matches in column names
186 |     gene_column <- grep("gene", colnames(input_dat), ignore.case = T, value = T)
187 |     logFC_column <- grep("logfc|log2fc", colnames(input_dat), ignore.case = T, value = T)
188 |     cluster_column <- grep("cluster", colnames(input_dat), ignore.case = T, value = T)
189 | 
190 | 
191 |     if(length(c(gene_column, logFC_column, cluster_column)) != 3) stop("Check column names of the input data. Data frame must have columns named as 'gene', 'logfc', and 'cluster'")
192 | 
193 |     # Convert gene symbols to lower case letters to allow mouse-vs-human comparisons
194 |     input_dat[,gene_column] <- tolower(input_dat[,gene_column])
195 | 
196 |     # input_dat <- input_dat[!duplicated(input_dat[,gene_column]),]
197 | 
198 | 
199 | 
200 |   } else {
201 | 
202 |     gene_column <- grep("gene", colnames(input_dat), ignore.case = T, value = T)
203 | 
204 |     input_dat[,gene_column] <- tolower(input_dat[,gene_column])
205 | 
206 |     input_dat <- input_dat[!duplicated(input_dat[,gene_column]),]
207 | 
208 |     if(length(gene_column) != 1) stop("Check column names of the input data. Data frame must have a column named as 'gene'")
209 | 
210 |   }
211 | 
212 | 
213 | 
214 | 
215 |   ######################### Prepare ref_dat   #########################
216 | 
217 |   # if(update_ref == T){  # WORK ON CACHING TO SPEED UP THE PIPELINE
218 | 
219 |   message("Preparing reference data")
220 | 
221 |     if(is.null(reference)){
222 | 
223 |       stop('Please specify the reference dataset. The following are allowed: "mmrnaseq",  "blueprint", "hpca", "dice", "hema", "hsrnaseq", "custom"')
224 | 
225 |     } else if(reference == "immgen"){
226 | 
227 |     message("Reading ImmGen reference data")
228 | 
229 |     # # Read reference dataset
230 |     # load("data/immgen_expr.rda")
231 |     # ref_dat <- get("immgen_expr")
232 |     # rm(immgen_expr)
233 |     #
234 |     # # Read immgen annotation file for explanations of cell types
235 |     # load("data/immgen_samples.rda")
236 |     # ref_annot <- get("immgen_samples")
237 |     # rm(immgen_samples)
238 | 
239 |       ref_dat <- immgen_expr
240 |       ref_annot <- immgen_samples
241 | 
242 | 
243 |   } else if(reference == "mmrnaseq"){
244 | 
245 |     message("Reading MmRNAseq reference data")
246 | 
247 |     ref_dat <- mmrnaseq_expr
248 |     ref_annot <- mmrnaseq_samples
249 | 
250 |   } else if(reference == "blueprint"){
251 | 
252 |     message("Reading Blueprint-ENCODE reference data")
253 | 
254 |     ref_dat <- blueprint_expr
255 |     ref_annot <- blueprint_samples
256 | 
257 |   } else if(reference == "hpca"){
258 | 
259 |     message("Reading HCPA reference data")
260 | 
261 |     ref_dat <- hpca_expr
262 |     ref_annot <- hpca_samples
263 | 
264 |   } else if(reference == "dice"){
265 | 
266 |     message("Reading DICE reference data")
267 | 
268 |     ref_dat <- dice_expr
269 |     ref_annot <- dice_samples
270 | 
271 |   } else if(reference == "hema"){
272 | 
273 |     message("Reading hema reference data")
274 | 
275 |     ref_dat <- hema_expr
276 |     ref_annot <- hema_samples
277 | 
278 |   } else if(reference == "hsrnaseq"){
279 | 
280 |     message("Reading HsRNAseq reference data")
281 | 
282 |     ref_dat <- hsrnaseq_expr
283 |     ref_annot <- hsrnaseq_samples
284 | 
285 |   } else if(reference == "custom"){
286 | 
287 |     message("Reading custom reference data")
288 | 
289 |     # Read reference dataset
290 |     ref_dat <- custom_reference
291 | 
292 |     # Read immgen annotation file for explanations of cell types
293 |     ref_annot <- custom_ref_annot
294 | 
295 | 
296 |   } else {
297 | 
298 |   stop(message('"reference" argument accepts of the following: "mmrnaseq",  "blueprint", "hpca", "dice", "hema", "hsrnaseq", "custom"'))
299 | 
300 |   }
301 | 
302 | 
303 |   ref_gene_column <- grep("^gene", colnames(ref_dat), ignore.case = T, value = T)
304 | 
305 |   if(length(ref_gene_column) != 1) stop("Check column names of the input data. Data frame must have a column named as 'gene'")
306 | 
307 |   ref_dat[, ref_gene_column] <- tolower(ref_dat[, ref_gene_column])
308 | 
309 | 
310 | 
311 | 
312 | 
313 |   # Select relevant subsets from the reference
314 | 
315 |   if(select_ref_subsets != "all"){
316 | 
317 |     message("Subsetting reference data")
318 | 
319 |     sel_positions <- which(ref_annot[, "reference_cell_type"] %in% select_ref_subsets)
320 | 
321 |     if(length(sel_positions) == 0) stop("Selected subset is not present in the reference dataset. Please check spelling, as the string input must exactly match to reference cell type.")
322 | 
323 |     not_found <- !select_ref_subsets %in% ref_annot[, "reference_cell_type"]
324 | 
325 |     if(sum(not_found != 0 )) {
326 | 
327 |         message("Following subsets were not found in reference (Please double check your entry for spelling errors, as the string input must exactly match to reference cell type):")
328 | 
329 |         print(select_ref_subsets[not_found])
330 | 
331 | 
332 |     }
333 | 
334 | 
335 |     select_ref_subsets <- as.character(ref_annot[sel_positions, "short_name"])
336 | 
337 |     ref_dat <- ref_dat[, c(ref_gene_column, select_ref_subsets)]
338 | 
339 |   }
340 | 
341 | 
342 | 
343 | 
344 | 
345 | 
346 |   # Apply quantile filtering
347 | 
348 |   if(keep_top_var != 100){
349 | 
350 | 
351 |   message("Applying variance filtering")
352 | 
353 |   var_vec <- apply(ref_dat[, ! colnames(ref_dat) %in% ref_gene_column], 1, var, na.rm=T)
354 | 
355 |   keep_var <- quantile(var_vec, probs = 1-keep_top_var/100, na.rm = T)
356 | 
357 |   keep_genes <- var_vec >= keep_var
358 | 
359 |   ref_dat <- ref_dat[keep_genes, ]
360 | 
361 |   }
362 | 
363 | 
364 |   if(grepl("logfc", comp_method)){
365 | 
366 | 
367 |     # Calculate row means for each gene (mean expression across the reference cell types)
368 |     gene_avg <- rowMeans(ref_dat[, !colnames(ref_dat) %in% ref_gene_column])
369 | 
370 |     # Log scale data
371 |     reference_ratio <- sweep(ref_dat[,!colnames(ref_dat) %in% ref_gene_column], 1, FUN="-", gene_avg)
372 | 
373 |     # Combine gene names and the log fold change in one data frame
374 |     ref_dat <- cbind(tolower(ref_dat[, ref_gene_column]), reference_ratio)
375 | 
376 |     colnames(ref_dat)[1] <- ref_gene_column
377 | 
378 |   }
379 | 
380 |   # } else if(exists("ref_dat") & exists("ref_annot") & update_ref == F){
381 |   #
382 |   #   ref_dat <- get("ref_dat", envir = .GlobalEnv)
383 |   #   ref_annot <- get("ref_annot", envir = .GlobalEnv)
384 |   #
385 |   # }
386 | 
387 | 
388 | 
389 |     ######################## Define clusters ###############################
390 | 
391 |     if(grepl("logfc", comp_method)){
392 | 
393 |       clusters <- gtools::mixedsort(
394 |         levels(
395 |           as.factor(
396 |             dplyr::pull(input_dat, grep("cluster", x = colnames(input_dat),
397 |                                  ignore.case = T, value = T)
398 |             )
399 |           )
400 |         )
401 |       )
402 | 
403 |     } else {
404 | 
405 |       clusters <- gtools::mixedsort(
406 |         levels(
407 |           as.factor(
408 |             colnames(input_dat)[!grepl("gene", colnames(input_dat),
409 |                                        ignore.case = T)]
410 |           )
411 |         )
412 |       )}
413 | 
414 | 
415 | 
416 | 
417 |     ######################### Compare input_dat against ref_dat #############################
418 | 
419 |     message("Analyzing cluster signatures")
420 | 
421 | 
422 |     if(comp_method == "logfc_dot_product"){    ####################################################################################
423 | 
424 |       # Initiate a master data frame to store the results
425 |       master_df <- data.frame()
426 | 
427 |       # Iterate over clusters to calculate a distinct identity score for each reference cell type
428 |       for (i in clusters) {
429 | 
430 |         # Increment the progress bar, and update the detail text.
431 |         # message(paste("Analyzing cluster", i))
432 | 
433 |         # Subset on the cluster in iteration
434 |         sel_clst <- input_dat %>%
435 |           dplyr::filter(!!rlang::sym(cluster_column) == i) %>%
436 |           dplyr::select(c(!!rlang::sym(gene_column), !!rlang::sym(logFC_column)))
437 | 
438 | 
439 |         # Merge SCseq cluster log FC value with immgen log FC for shared genes
440 |         merged <- merge(sel_clst, ref_dat, by.x = gene_column, by.y = ref_gene_column)
441 | 
442 |         if(dim(merged)[1] < 2) next
443 | 
444 |         # Calculate a scoring matrix by multiplying log changes of clusters and immgen cells
445 |         reference_scoring <- data.frame(apply(merged[,3:dim(merged)[2]],2,function(x){x*merged[,2]}), check.names = FALSE)
446 | 
447 |         # Calculate the aggregate score of each immgen cell type by adding
448 |         score_sum <- colSums(reference_scoring)
449 | 
450 |         # Store identity scores in a data frame
451 |         df <- data.frame(identity_score = score_sum)
452 | 
453 |         df <- tibble::rownames_to_column(df, var="reference_id")
454 | 
455 | 
456 |         df <- dplyr::left_join(df, ref_annot, by=c("reference_id" = "short_name"))
457 | 
458 | 
459 |         # Store cluster information in a column
460 |         df$cluster <- i
461 | 
462 |         # Add confidence-of-prediction calculations here and append to the df
463 |         # Calculate the mean and standard deviation of the aggregate scores per reference cell type
464 |         mean_score_sum <- mean(df$identity_score)
465 |         score_sum_sd <- sd(df$identity_score)
466 | 
467 |         # Calculate the distance of the identity score from population mean (how many std devs apart?)
468 |         df$z_score <- (df$identity_score - mean_score_sum)/score_sum_sd
469 | 
470 |         # Calculate the proportion of the genes changing in the same direction between unknown cluster and reference cell type
471 |         df$percent_pos_correlation <- {
472 | 
473 |           ngenes <- dim(reference_scoring)[1]
474 | 
475 |           pos_corr_vector <- numeric()
476 | 
477 |           for(i in 1:dim(reference_scoring)[2]){
478 | 
479 |             # Calculate number of genes positively correlated (upregulated or downregulated in both unk cluster and reference)
480 |             pos_cor <- ( sum(reference_scoring[, i] > 0) / ngenes ) * 100
481 | 
482 |             pos_corr_vector <- c(pos_corr_vector, pos_cor)
483 | 
484 |           } #close for loop
485 | 
486 |           pos_corr_vector
487 | 
488 |         } # close expression
489 | 
490 | 
491 |         # Add calculation results under the master data frame to have a composite results file
492 |         master_df <- rbind(master_df,df)
493 | 
494 | 
495 | 
496 |       } # close for loop that iterates over clusters1
497 | 
498 |     } else if(comp_method == "logfc_spearman" | comp_method == "logfc_pearson"){  ########################################################
499 | 
500 |       # Initiate master data frame to store results
501 |       master_df <- data.frame()
502 | 
503 | 
504 |       # Iterate analysis for each cluster. The loop below will calculate a distinct correlation
505 |       # coefficient for each cluster-reference cell pairs
506 |       for (i in clusters) {
507 | 
508 | 
509 |         trim_dat <- input_dat %>%
510 |           dplyr::filter(!!rlang::sym(cluster_column) == i)
511 | 
512 |         dat_genes <- trim_dat[gene_column] %>% dplyr::pull() %>% as.character
513 |         ref_genes <- ref_dat[ref_gene_column] %>% dplyr::pull() %>% as.character
514 | 
515 |         common_genes <- intersect(dat_genes, ref_genes)
516 | 
517 | 
518 |         trim_dat <- trim_dat %>%
519 |           dplyr::filter(!!rlang::sym(gene_column) %in% common_genes) %>%
520 |           dplyr::arrange(!!rlang::sym(gene_column)) %>%
521 |           dplyr::select(- !!rlang::sym(gene_column))
522 | 
523 | 
524 |         trim_ref <- ref_dat %>%
525 |           dplyr::filter(!!rlang::sym(ref_gene_column) %in% common_genes) %>%
526 |           dplyr::arrange(!!rlang::sym(ref_gene_column)) %>%
527 |           dplyr::select(- !!rlang::sym(ref_gene_column))
528 | 
529 | 
530 |         # Calculate correlation between the the cluster (single column in trimmed input data) and each of the
531 |         # reference cell subsets (columns of the trimmed reference data)
532 |         cor_df <- cor(trim_dat[logFC_column], trim_ref, method = gsub("logfc_", "", comp_method))
533 | 
534 |         # Store results in a data frame
535 |         df <- data.frame(identity_score = cor_df[1,])
536 | 
537 |         df <- tibble::rownames_to_column(df, var="reference_id")
538 | 
539 |         # Combine results with reference annotations
540 |         if(reference != "custom"){
541 | 
542 |           df <- dplyr::left_join(df, ref_annot, by=c("reference_id" = "short_name"))
543 | 
544 | 
545 | 
546 | 
547 | 
548 |         } else if (reference == "custom" & !is.null(custom_ref_annot)){
549 | 
550 |           df <- dplyr::left_join(df, ref_annot, by=c("reference_id" = "short_name"))
551 | 
552 | 
553 |         } else if(reference == "custom" & is.null(custom_ref_annot)){
554 | 
555 |           # Fill in with reminder if annotation file is not updated
556 |           df$reference_cell_type <- rep("Upload annotation file", dim(ref_dat)[2]-1)
557 |           df$short_name <- colnames(ref_dat)[!colnames(ref_dat) %in% ref_gene_column]
558 |           df$long_name <- rep("Upload annotation file", dim(ref_dat)[2]-1)
559 |           df$description <- rep("Upload annotation file", dim(ref_dat)[2]-1)
560 | 
561 |         }
562 | 
563 | 
564 | 
565 |         # Store cluster information in a column
566 |         df$cluster <- i
567 | 
568 |         # Add confidence-of-prediction calculations here and append to the df
569 |         # Calculate the mean and standard deviation of the aggregate scores per reference cell type
570 |         mean_cor_coeff <- mean(df$identity_score)
571 |         cor_coeff_sd <- sd(df$identity_score)
572 | 
573 |         # Calculate the distance of the identity score from population mean (how many std devs apart?)
574 |         df$z_score <- (df$identity_score - mean_cor_coeff)/cor_coeff_sd
575 | 
576 |         # Add all the results to the master data frame
577 |         master_df <- rbind(master_df, df)
578 | 
579 | 
580 |       } # close for loop that iterates over clusters
581 | 
582 |     } else if(comp_method == "all_genes_spearman" | comp_method == "all_genes_pearson"){  ################################################
583 | 
584 | 
585 | 
586 |       dat_genes <- input_dat[gene_column] %>% dplyr::pull() %>% as.character
587 |       ref_genes <- ref_dat[ref_gene_column] %>% dplyr::pull() %>% as.character
588 | 
589 |       common_genes <- intersect(dat_genes, ref_genes)
590 | 
591 |       trim_dat <- input_dat %>%
592 |         dplyr::filter(!!rlang::sym(gene_column) %in% common_genes) %>%
593 |         dplyr::arrange(!!rlang::sym(gene_column)) %>%
594 |         dplyr::select_(.dots= paste0("-", gene_column))
595 | 
596 |       trim_ref <- ref_dat %>%
597 |         dplyr::filter(!!rlang::sym(ref_gene_column) %in% common_genes) %>%
598 |         dplyr::arrange(!!rlang::sym(ref_gene_column)) %>%
599 |         dplyr::select_(.dots=paste0("-", ref_gene_column))
600 | 
601 |       clusters <- colnames(trim_dat)
602 | 
603 | 
604 |       master_df <- data.frame()
605 | 
606 |       comp_method <- gsub("all_genes_", "", comp_method)
607 | 
608 |       for (i in clusters) {
609 | 
610 | 
611 |         cor_df <- cor(trim_dat[i], trim_ref, method = comp_method)
612 | 
613 | 
614 |         df <- data.frame(identity_score = cor_df[1,])
615 | 
616 |         df <- tibble::rownames_to_column(df, var="reference_id")
617 | 
618 | 
619 | 
620 |         if(reference != "custom"){
621 | 
622 |           df <- dplyr::left_join(df, ref_annot, by=c("reference_id" = "short_name"))
623 | 
624 | 
625 | 
626 | 
627 | 
628 |         } else if (reference == "custom" & !is.null(custom_ref_annot)){
629 | 
630 |           df <- dplyr::left_join(df, ref_annot, by=c("reference_id" = "short_name"))
631 | 
632 | 
633 |         } else if(reference == "custom" & is.null(custom_ref_annot)){
634 | 
635 |           df$reference_cell_type <- rep("Upload annotation file", dim(ref_dat)[2]-1)
636 |           df$short_name <- colnames(ref_dat)[!colnames(ref_dat) %in% ref_gene_column]
637 |           df$long_name <- rep("Upload annotation file", dim(ref_dat)[2]-1)
638 |           df$description <- rep("Upload annotation file", dim(ref_dat)[2]-1)
639 | 
640 |         }
641 | 
642 | 
643 | 
644 | 
645 |         df$cluster <- i
646 | 
647 |         # Add confidence-of-prediction calculations here and append to the df
648 |         # Calculate the mean and standard deviation of the aggregate scores per reference cell type
649 |         mean_cor_coeff <- mean(df$identity_score)
650 |         cor_coeff_sd <- sd(df$identity_score)
651 | 
652 |         # Calculate the distance of the identity score from population mean (how many std devs apart?)
653 |         df$z_score <- (df$identity_score - mean_cor_coeff)/cor_coeff_sd
654 | 
655 | 
656 |         master_df <- rbind(master_df,df)
657 | 
658 |       }
659 |     }
660 | 
661 | 
662 | 
663 | 
664 | 
665 | 
666 | 
667 | 
668 |     if(global_results_obj == T) CIPR_all_results <<- master_df
669 | 
670 | 
671 |     #prep individual plots
672 |     if(plot_ind == T){
673 | 
674 | 
675 |       colnum <- length(levels(as.factor(ref_annot$reference_cell_type)))
676 | 
677 | 
678 |       cols <- colorspace::rainbow_hcl(colnum, c = 90, l = 75, start = 0, end = 330,
679 |                                       gamma = NULL, fixup = TRUE, alpha = 1)
680 | 
681 |       set.seed(5)
682 | 
683 |       cols <- sample(cols, length(cols), replace = F)
684 | 
685 | 
686 |       ind_clu_plots <- list()
687 | 
688 |       for (i in clusters) {
689 | 
690 | 
691 |         # Extract results calculated for individual clusters
692 |         df_plot <- master_df %>%
693 |           dplyr::filter(cluster == i)
694 | 
695 |         # Calculate mean and sd deviation for adding confidence bands to graphs
696 |         score_mean <- mean(df_plot$identity_score)
697 |         score_sd <- sd(df_plot$identity_score)
698 | 
699 | 
700 |         plotname <- paste("cluster", i, sep="")
701 | 
702 |         # Plot identity scores per cluster per reference cell type and add confidence bands
703 |         ind_clu_plots[[plotname]] <- ggpubr::ggdotplot(df_plot, x = "reference_id", y="identity_score",
704 |                                                fill = "reference_cell_type", xlab=F, ylab="Reference identity score", palette=cols,
705 |                                                font.y = c(14, "bold", "black"), size=1, x.text.angle=90,
706 |                                                title = paste("Cluster:",i), font.title = c(15, "bold.italic"),
707 |                                                font.legend = c(15, "plain", "black"))+
708 |           ggplot2::theme(axis.text.x = ggplot2::element_text(size=10, vjust=0.5, hjust=1))+
709 |           ggplot2::geom_hline(yintercept=score_mean)+
710 |           ggplot2::annotate("rect", xmin = 1, xmax = length(df_plot$reference_id),
711 |                    ymin = score_mean-score_sd, ymax = score_mean+score_sd,
712 |                    fill = "gray50", alpha = .1)+
713 |           ggplot2::annotate("rect", xmin = 1, xmax = length(df_plot$reference_id),
714 |                    ymin = score_mean-2*score_sd, ymax = score_mean+2*score_sd,
715 |                    fill = "gray50", alpha = .1)+
716 |           ggplot2::theme(...)
717 | 
718 | 
719 |       }
720 | 
721 |       if(global_plot_obj == T) ind_clu_plots <<-ind_clu_plots
722 | 
723 | 
724 |       if(save_png == T) {
725 |         ggpubr::ggexport(filename = "CIPR_individual_clusters.png", plotlist = ind_clu_plots, ncol = 1, width = 1800, height = 360 * length(clusters))
726 |       }
727 |       else {
728 |         print(ggpubr::ggarrange(plotlist = ind_clu_plots, ncol = 1, common.legend = T))
729 |       }
730 | 
731 |     }
732 | 
733 |     ################################################################################################################################
734 |     # Prepare top5 summary plots
735 |     # This plot will show the 5 highest scoring reference cell types for each cluster.
736 | 
737 |     # if(plot_top == T){
738 | 
739 |   message("Preparing top plots")
740 | 
741 |       # Extract top5 hits from the reuslts
742 |       top_df <- master_df %>%
743 |         dplyr::group_by(cluster) %>%    #cluster
744 |         dplyr::top_n(top_num, wt = identity_score) %>%
745 |         dplyr::arrange(cluster, desc(identity_score))
746 | 
747 |       # Index variable helps keeping the results for clusters separate and helps ordered outputs
748 |       top_df$index <- 1:nrow(top_df)
749 | 
750 | 
751 |       # Order clusters levels for ordered plotting
752 |       ordered_cluster_levels <- gtools::mixedsort(levels(as.factor(top_df$cluster)))
753 | 
754 | 
755 |       top_df$cluster <- factor(top_df$cluster, levels = ordered_cluster_levels)
756 | 
757 | 
758 | 
759 |       # Extract relevant columns
760 |       top_df <- dplyr::select(top_df, cluster,
761 |                        reference_cell_type,
762 |                        reference_id,
763 |                        long_name,
764 |                        description,
765 |                        identity_score,
766 |                        index, dplyr::everything())
767 | 
768 |       if(global_results_obj == T) CIPR_top_results <<- top_df
769 | 
770 |       p <- ggpubr::ggdotplot(top_df, x="index", y="identity_score",
771 |                      fill = "cluster", size=1, x.text.angle=90,
772 |                      font.legend = c(15, "plain", "black")) +
773 |         ggplot2::scale_x_discrete(labels=top_df$reference_id)+
774 |         ggplot2::theme(axis.text.x = ggplot2::element_text(vjust=0.5, hjust=1))+
775 |         ggplot2::theme(...)
776 | 
777 |       if(global_plot_obj == T) top_plots <<- p
778 | 
779 |       if(save_png == T) {
780 | 
781 |         ggpubr::ggexport(p, filename = "CIPR_top_hits.png", ncol = 1, width = 150 * length(clusters), height = 300)
782 | 
783 |       } else {
784 | 
785 |         if(plot_top == T) print(p)
786 | 
787 |       }
788 | 
789 |     # }
790 | 
791 |   } # close function
792 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # CIPR-Package
  2 | 
  3 | <br>
  4 | 
  5 | <img align="right" src="https://github.com/atakanekiz/CIPR-Package/raw/master/doc/CIPR_hex_mid.png" width=300> 
  6 | 
  7 | ## Cluster Identity Predictor 
  8 | 
  9 | <br>
 10 | 
 11 | During the analysis of single cell RNA sequencing (scRNAseq) data, annotating the biological identity of cell clusters is an important step before downstream analyses and it remains technically challenging. The current solutions for annotating single cell clusters generally lack a graphical user interface, can be computationally intensive or have a limited scope. On the other hand, manually annotating single cell clusters by examining the expression of marker genes can be subjective and labor-intensive.
 12 | 
 13 | To improve the quality and efficiency of annotating cell clusters in scRNAseq data, we present a web-based R/Shiny app and R package, __Cluster Identity PRedictor (CIPR)__, which provides a graphical user interface to quickly score gene expression profiles of unknown cell clusters against mouse or human references, or a custom dataset provided by the user. CIPR can be easily integrated into the current pipelines to facilitate scRNAseq data analysis.
 14 | 
 15 | CIPR performs analyses at individual cluster level and generates informative graphical outputs to help the users assess the quality of algorithmic predictions (see the example outputs below).  
 16 | 
 17 | 
 18 | This repository contains the source code for the R package implementation of CIPR pipeline. For CIPR-Shiny, please check out [CIPR-Shiny repository](https://github.com/atakanekiz/CIPR-Shiny).
 19 | 
 20 | ---
 21 | 
 22 | ## Installation and Usage
 23 | 
 24 | ```{r}
 25 | 
 26 | if (!requireNamespace("devtools", quietly = TRUE))
 27 |     install.packages("devtools")
 28 | 
 29 | devtools::install_github("atakanekiz/CIPR-Package", build_vignettes = TRUE)
 30 | 
 31 | # # For faster installation without vignette
 32 | # devtools::install_github("atakanekiz/CIPR-Package", build_vignettes = FALSE)
 33 | ```
 34 | 
 35 | #### Example use case in conjunction with Seurat pipeline
 36 | 
 37 | ```{r}
 38 | 
 39 | 
 40 | library(Seurat)
 41 | 
 42 | allmarkers <- FindAllMarkers(seurat_object)
 43 | avgexp <- AverageExpression(seurat_object)
 44 | 
 45 | 
 46 | # Plot summarizing top scoring references per cluster (logFC comparison)
 47 | CIPR(input_dat = allmarkers,
 48 |      comp_method = "logfc_dot_product", 
 49 |      reference = "immgen", 
 50 |      plot_ind = F,
 51 |      plot_top = T)
 52 |      
 53 | # Plot summarizing top scoring references per cluster (all-genes correlation)
 54 | CIPR(input_dat = allmarkers,
 55 |      comp_method = "logfc_dot_product", 
 56 |      reference = "immgen", 
 57 |      plot_ind = F,
 58 |      plot_top = T)
 59 |      
 60 |      
 61 | # Plots for individual clusters
 62 | CIPR(input_dat = allmarkers,
 63 |      comp_method = "logfc_dot_product", 
 64 |      reference = "immgen", 
 65 |      plot_ind = T,
 66 |      plot_top = F)
 67 | 
 68 | # Limiting the analysis to certain reference subsets
 69 | CIPR(input_dat = allmarkers,
 70 |      comp_method = "logfc_dot_product", 
 71 |      reference = "immgen", 
 72 |      plot_ind = F,
 73 |      plot_top = T, 
 74 |      select_ref_subsets = c("T cell", "B cell", "NK cell"))
 75 | 
 76 | 
 77 | 
 78 | 
 79 | ```
 80 | 
 81 | 
 82 | ## Reference datasets available in CIPR
 83 | 
 84 | * [Immunological Genome Project (ImmGen)](https://www.immgen.org) microarray data from sorted mouse immune cells. This dataset is prepared by using both V1 and V2 ImmGen releases and it contains 296 samples from 20 different cell types (253 subtypes).
 85 | 
 86 | * Mouse RNAseq data from sorted cells reported in [Benayoun et al. (2019)](http://www.genome.org/cgi/doi/10.1101/gr.240093.118). This dataset contains 358 sorted immune and nonimmune samples from 18 different lineages (28 subtypes).
 87 | 
 88 | * [Blueprint](https://doi.org/10.3324/haematol.2013.094243)/[Encode](https://doi.org/10.1038/nature11247) RNAseq data that contains 259 sorted human immune and nonimmune samples from 24 different lineages (43 subtypes).
 89 | 
 90 | * [Human Primary Cell Atlas](https://doi.org/10.1186/1471-2164-14-632) that contains microarray data from 713 sorted immune and nonimmune cells (37 main cell types and 157 subtypes).
 91 | 
 92 | * [DICE (Database for Immune Cell Expression(/eQTLs/Epigenomics)](https://doi.org/10.1016/j.cell.2018.10.022) that contains 1561 human immune samples from 5 main cell types (15 subtypes). To reduce object sizes, mean TPM values per cell type is used.
 93 | 
 94 | * Human microarray data from sorted hematopoietic cells reported in [Novershtern et al. (2011)](https://doi.org/10.1016/j.cell.2011.01.004). This dataset contains data from 211 samples and 17 main cell types (38 subtypes)
 95 | 
 96 | * Human RNAseq data from sorted cells reported in [Monaco et al. (2019)](https://doi.org/10.1016/j.celrep.2019.01.041). This dataset contains 114 samples originating from 11 main cell types (29 subtypes)
 97 | 
 98 | * ___A custom reference dataset provided by the user.___ This dataset can be obtained from a number of high througput methods including microarray and bulk RNAseq. For details about how to prepare custom reference, please see the How-to tab on the [Shiny website](https://aekiz.shinyapps.io/CIPR).
 99 | 
100 | ---
101 | 
102 | ## Analytical approach
103 | 
104 | CIPR calculates pairwise identity scores between individual unknown clusters and the reference samples and generates a vector of identity scores per each cluster in the experiment. While doing this CIPR utilizes two main approaches:
105 | 
106 | * ___Comparison of differentially expressed genes.___ In this method users provide an input data frame that contains the log fold-change (logFC) values of differentially expressed genes in each cluster. The algorithm first calculates differential expression within the reference data frame for each gene by taking the ratio of the expression value of individual subsets to the average expression in the entire data frame. Then the CIPR pipeline compares these reference logFC values to the logFC from the experimental clusters. The users can select one of three methods for these comparisons:
107 | 
108 |    * __LogFC dot product:__ LogFC values of the matching genes are mutliplied and added together to yield an aggregate identity score.
109 |    * __LogFC Spearman's correlation:__ Rank correlation is calculated between the logFC values of the experimental and reference data. 
110 |    * __LogFC Pearson's correlation__: Linear correlation is calculated between the logFC values of the expermental and reference data.
111 |    
112 |    <br>
113 |    
114 | * ___Comparison of all genes.___ In this method, users provide an input data frame that contains average gene expression per cluster. The algorithm compares the expression profiles of individual cluster to that from reference dataset. In this method, all the common genes between experimenal and reference data are used in the analysis regardless of their expression values and differential expression status. Users can use one of the two methods in this approach:
115 | 
116 |    * __Spearman's correlation:__ Rank correlation between the experimental clusters and reference cell subsets
117 |    * __Pearson's correlation:__ Linear correlation (which could be beneficial especially when using custom references where the reference and the experimental data is obtained using similar methodologies.)
118 | 
119 | 
120 | ---
121 | 
122 | 
123 | 
124 | ## Flexible options
125 | 
126 | To be adaptable to various experimental contexts, CIPR enables users to:
127 | 
128 | * Select only interesting reference subsets from the provided reference datasets
129 | 
130 | * Limit the analysis to the genes whose expression variance (in the reference dataset) is above a certain quantile determined by the user.
131 | 
132 | 
133 | ---
134 | 
135 | ## Sample outputs
136 | 
137 | ### Results per cluster
138 | 
139 | In the plot below x-axis signifies the individual samples within the reference data frame (ImmGen in this example). Reference cell types are marked by different colors. Each data point indicates the identity score calculated for Cluster 1 in the input data. Shaded regions demarcate 1 and 2 standard deviations around the average identity score across the reference dataset. In this analysis logFC dot product method was used.
140 | 
141 | 
142 | <kbd>
143 | <img src=https://github.com/atakanekiz/CIPR-Package/raw/master/doc/sample_ind_output.png>
144 | </kbd>
145 | 
146 | ### Summary of top hits per cluster
147 | 
148 | It is often easier to examine the top predictions in one graph. This plot shows the top 5 scoring reference samples for each cluster (shown in different colors).
149 | 
150 | <kbd>
151 | <img src=https://github.com/atakanekiz/CIPR-Package/raw/master/doc/sample_top_output.png>
152 | </kbd>
153 | 
154 | 
155 | 
156 | 


--------------------------------------------------------------------------------
/data/blueprint_expr.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/blueprint_expr.rda


--------------------------------------------------------------------------------
/data/blueprint_samples.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/blueprint_samples.rda


--------------------------------------------------------------------------------
/data/dice_expr.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/dice_expr.rda


--------------------------------------------------------------------------------
/data/dice_samples.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/dice_samples.rda


--------------------------------------------------------------------------------
/data/example_avgexp_data.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/example_avgexp_data.rda


--------------------------------------------------------------------------------
/data/example_log_fc_data.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/example_log_fc_data.rda


--------------------------------------------------------------------------------
/data/hema_expr.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hema_expr.rda


--------------------------------------------------------------------------------
/data/hema_samples.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hema_samples.rda


--------------------------------------------------------------------------------
/data/hpca_expr.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hpca_expr.rda


--------------------------------------------------------------------------------
/data/hpca_samples.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hpca_samples.rda


--------------------------------------------------------------------------------
/data/hsrnaseq_expr.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hsrnaseq_expr.rda


--------------------------------------------------------------------------------
/data/hsrnaseq_samples.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/hsrnaseq_samples.rda


--------------------------------------------------------------------------------
/data/immgen_expr.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/immgen_expr.rda


--------------------------------------------------------------------------------
/data/immgen_samples.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/immgen_samples.rda


--------------------------------------------------------------------------------
/data/mmrnaseq_expr.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/mmrnaseq_expr.rda


--------------------------------------------------------------------------------
/data/mmrnaseq_samples.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/data/mmrnaseq_samples.rda


--------------------------------------------------------------------------------
/doc/CIPR_hex_mid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/doc/CIPR_hex_mid.png


--------------------------------------------------------------------------------
/doc/cipr_human_pbmc.R:
--------------------------------------------------------------------------------
  1 | ## ---- include=FALSE-----------------------------------------------------------
  2 | 
  3 | knitr::opts_chunk$set(eval=T)
  4 | 
  5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
  6 | 
  7 | 
  8 | 
  9 | ## ---- eval=F------------------------------------------------------------------
 10 | #  
 11 | #  if (!requireNamespace("devtools", quietly = TRUE))
 12 | #      install.packages("devtools")
 13 | #  
 14 | #  
 15 | #  # Use this option if you want to build vignettes during installation
 16 | #  # This can take a long time due to the installation of suggested packages.
 17 | #  devtools::install_github(..., build = TRUE, build_opts = c("--no-resave-data", "--no-manual")
 18 | #  
 19 | #  # Use this if you would like to install the package without vignettes
 20 | #  # devtools::install_github("atakanekiz/CIPR-Package")
 21 | #  
 22 | 
 23 | ## ---- eval=T------------------------------------------------------------------
 24 | 
 25 | library(dplyr)
 26 | library(Seurat)
 27 | library(CIPR)
 28 | 
 29 | 
 30 | ## -----------------------------------------------------------------------------
 31 | 
 32 | # Download data
 33 | 
 34 | temp <- tempfile()
 35 | tempd <- tempdir()
 36 | 
 37 | download.file("https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz", destfile = temp)
 38 | 
 39 | untar(temp, exdir = tempd)
 40 | 
 41 | unlink(temp)
 42 | 
 43 | 
 44 | # Load the PBMC dataset
 45 | pbmc.data <- Read10X(data.dir = paste0(tempd, "\\filtered_gene_bc_matrices\\hg19"))
 46 | # Initialize the Seurat object with the raw (non-normalized data).
 47 | pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
 48 | pbmc
 49 | 
 50 | 
 51 | 
 52 | ## -----------------------------------------------------------------------------
 53 | 
 54 | # Calculate mitochondrial gene representation (indicative of low quality cells)
 55 | pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
 56 | 
 57 | # Filter out genes with feature counts outside of 200-2500 range, and >5% mt genes 
 58 | pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
 59 | 
 60 | 
 61 | ## ---- results="hide", message=F-----------------------------------------------
 62 | 
 63 | pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
 64 | pbmc <- NormalizeData(pbmc)
 65 | 
 66 | 
 67 | ## ---- results="hide", message=F-----------------------------------------------
 68 | 
 69 | pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
 70 | 
 71 | all.genes <- rownames(pbmc)
 72 | pbmc <- ScaleData(pbmc, features = all.genes)
 73 | 
 74 | 
 75 | ## ---- results="hide", message=F-----------------------------------------------
 76 | 
 77 | pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
 78 | 
 79 | 
 80 | ## ---- eval=T------------------------------------------------------------------
 81 | 
 82 | ElbowPlot(pbmc)
 83 | 
 84 | 
 85 | ## ---- results="hide", message=F-----------------------------------------------
 86 | 
 87 | pbmc <- FindNeighbors(pbmc, dims = 1:10)
 88 | pbmc <- FindClusters(pbmc, resolution = 0.5)
 89 | 
 90 | 
 91 | ## ---- results="hide", message=F-----------------------------------------------
 92 | 
 93 | pbmc <- RunUMAP(pbmc, dims = 1:10)
 94 | 
 95 | pbmc$unnamed_clusters <- Idents(pbmc)
 96 | 
 97 | 
 98 | ## -----------------------------------------------------------------------------
 99 | 
100 | # saveRDS(pbmc, "pbmc.rds")
101 | 
102 | 
103 | ## ---- echo=F, results="hide"--------------------------------------------------
104 | 
105 | allmarkers <- FindAllMarkers(pbmc)
106 | 
107 | 
108 | ## ---- include=F---------------------------------------------------------------
109 | 
110 | # saveRDS(allmarkers, "allmarkers.rds")
111 | 
112 | 
113 | ## ---- results="hide"----------------------------------------------------------
114 | 
115 | avgexp <- AverageExpression(pbmc)
116 | 
117 | avgexp <- avgexp$RNA
118 | 
119 | avgexp$gene <- rownames(avgexp)
120 | 
121 | 
122 | ## ---- include=F---------------------------------------------------------------
123 | 
124 | # saveRDS(avgexp, "avgexp.rds")
125 | 
126 | 
127 | ## ---- include=F---------------------------------------------------------------
128 | 
129 | # pbmc <- readRDS("pbmc.rds")
130 | 
131 | 
132 | ## -----------------------------------------------------------------------------
133 | 
134 | DimPlot(pbmc)
135 | 
136 | 
137 | ## ---- eval=T, include=F-------------------------------------------------------
138 | 
139 | # allmarkers <- readRDS("allmarkers.rds")
140 | 
141 | 
142 | ## ---- eval=T, fig.width=16, fig.height=32, message=F--------------------------
143 | 
144 | CIPR(input_dat = allmarkers,
145 |      comp_method = "logfc_dot_product", 
146 |      reference = "hsrnaseq", 
147 |      plot_ind = T,
148 |      plot_top = F, 
149 |      global_results_obj = T, 
150 |      global_plot_obj = T,
151 |      # axis.text.x=element_text(color="red") # arguments to pass to ggplot2::theme() to change plotting parameters
152 |      )
153 | 
154 | 
155 | 
156 | 
157 | ## ---- eval=T, fig.width=16, fig.height=5, message=F---------------------------
158 | 
159 | library(ggplot2)
160 | 
161 | ind_clu_plots$cluster6 +
162 |     theme(axis.text.y = element_text(color="red"),
163 |           axis.text.x = element_text(color="blue")) +
164 |     labs(fill="Reference")+
165 |     ggtitle("Figure S4a. Automated cluster annotation results are shown for cluster 6") +
166 |     annotate("text", label="2 sd range", x=10, y= 500, size=8, color = "steelblue")+
167 |     annotate("text", label= "1 sd range", x=10, y=175, size=8, color ="orange2")+
168 |   geom_rect(aes(xmin=94, xmax=99, ymin=550, ymax=900), fill=NA, size=3, color="red")
169 |     
170 |     
171 | 
172 | 
173 | ## ---- eval=T, fig.width=8, fig.height=4.5, message=F--------------------------
174 | 
175 | CIPR(input_dat = allmarkers,
176 |      comp_method = "logfc_dot_product", 
177 |      reference = "hsrnaseq", 
178 |      plot_ind = F,
179 |      plot_top = T, 
180 |      global_results_obj = T, 
181 |      global_plot_obj = T)
182 | 
183 | 
184 | 
185 | ## ---- eval=T------------------------------------------------------------------
186 | 
187 | DT::datatable(CIPR_top_results)
188 | 
189 | 
190 | DT::datatable(head(CIPR_all_results))
191 | 
192 | 
193 | ## ---- eval=T, fig.width=16, fig.height=32, message=F--------------------------
194 | 
195 | CIPR(input_dat = avgexp,
196 |      comp_method = "all_genes_spearman", 
197 |      reference = "hsrnaseq", 
198 |      plot_ind = T,
199 |      plot_top = F, 
200 |      global_results_obj = T, 
201 |      global_plot_obj = T)
202 | 
203 | 
204 | 
205 | 
206 | ## ---- eval=T, fig.width=8, fig.height=4.5, message=F--------------------------
207 | 
208 | CIPR(input_dat = avgexp,
209 |      comp_method = "all_genes_spearman", 
210 |      reference = "hsrnaseq", 
211 |      plot_ind = F,
212 |      plot_top = T, 
213 |      global_results_obj = T, 
214 |      global_plot_obj = T)
215 | 
216 | 
217 | 
218 | ## ---- eval=T------------------------------------------------------------------
219 | 
220 | DT::datatable(CIPR_top_results)
221 | 
222 | 
223 | DT::datatable(head(CIPR_all_results))
224 | 
225 | 
226 | ## ---- eval=T, fig.width=16, fig.height=32, message=F--------------------------
227 | 
228 | CIPR(input_dat = allmarkers,
229 |      comp_method = "logfc_dot_product", 
230 |      reference = "hsrnaseq", 
231 |      plot_ind = T,
232 |      plot_top = F, 
233 |      global_results_obj = T, 
234 |      global_plot_obj = T,
235 |      select_ref_subsets = c("CD4+ T cell", "CD8+ T cell", "Monocyte", "NK cell"))
236 | 
237 | 
238 | 
239 | ## ---- eval=T, fig.width=16, fig.height=32, message=F--------------------------
240 | 
241 | CIPR(input_dat = avgexp,
242 |      comp_method = "all_genes_spearman", 
243 |      reference = "hsrnaseq", 
244 |      plot_ind = T,
245 |      plot_top = F, 
246 |      global_results_obj = T, 
247 |      global_plot_obj = T,
248 |      keep_top_var = 10)
249 | 
250 | 
251 | 
252 | 
253 | ## -----------------------------------------------------------------------------
254 | 
255 | sessionInfo()
256 | 
257 | 
258 | 
259 | 


--------------------------------------------------------------------------------
/doc/cipr_human_pbmc.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Using CIPR with human PBMC data"
  3 | date: "`r format(Sys.time(), '%d %B, %Y')`"
  4 | author: "Atakan Ekiz"
  5 | output:
  6 |   rmarkdown::html_vignette:
  7 |     toc: true
  8 |     toc_depth: 3
  9 | vignette: >
 10 |   %\VignetteIndexEntry{Using CIPR with human PBMC data}
 11 |   %\VignetteEngine{knitr::rmarkdown}
 12 |   \usepackage[utf8]{inputenc}
 13 | ---
 14 | 
 15 | ```{r, include=FALSE}
 16 | 
 17 | knitr::opts_chunk$set(eval=T)
 18 | 
 19 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
 20 | 
 21 | 
 22 | ```
 23 | 
 24 | 
 25 | 
 26 | # Summary 
 27 | 
 28 | This vignette describes how to use CIPR package with 3k PBMC data freely available from 10X genomics. Here, we recycle the code described in [Seurat's guided clustering tutorial](https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html) to help users perform analyses from scratch. Using this dataset we will demonstrate the capabilities of CIPR to annotate single cell clusters in single cell RNAseq (scRNAseq) experiments. For further information about other clustering methods, please see Seurat's comprehensive [website](https://satijalab.org/seurat/)
 29 | 
 30 | 
 31 | # Install CIPR
 32 | 
 33 | ```{r, eval=F}
 34 | 
 35 | if (!requireNamespace("devtools", quietly = TRUE))
 36 |     install.packages("devtools")
 37 | 
 38 | 
 39 | # Use this option if you want to build vignettes during installation
 40 | # This can take a long time due to the installation of suggested packages.
 41 | devtools::install_github(..., build = TRUE, build_opts = c("--no-resave-data", "--no-manual")
 42 | 
 43 | # Use this if you would like to install the package without vignettes
 44 | # devtools::install_github("atakanekiz/CIPR-Package")
 45 | 
 46 | ```
 47 | 
 48 | 
 49 | # Seurat pipeline
 50 | 
 51 | ## Setup Seurat object
 52 | 
 53 | ```{r, eval=T}
 54 | 
 55 | library(dplyr)
 56 | library(Seurat)
 57 | library(CIPR)
 58 | 
 59 | ```
 60 | 
 61 | ```{r}
 62 | 
 63 | # Download data
 64 | 
 65 | temp <- tempfile()
 66 | tempd <- tempdir()
 67 | 
 68 | download.file("https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz", destfile = temp)
 69 | 
 70 | untar(temp, exdir = tempd)
 71 | 
 72 | unlink(temp)
 73 | 
 74 | 
 75 | # Load the PBMC dataset
 76 | pbmc.data <- Read10X(data.dir = paste0(tempd, "\\filtered_gene_bc_matrices\\hg19"))
 77 | # Initialize the Seurat object with the raw (non-normalized data).
 78 | pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
 79 | pbmc
 80 | 
 81 | 
 82 | ```
 83 | 
 84 | 
 85 | ## Pre-processing
 86 | 
 87 | The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.
 88 | 
 89 | ```{r}
 90 | 
 91 | # Calculate mitochondrial gene representation (indicative of low quality cells)
 92 | pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
 93 | 
 94 | # Filter out genes with feature counts outside of 200-2500 range, and >5% mt genes 
 95 | pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
 96 | 
 97 | ```
 98 | 
 99 | 
100 | ## Normalizing data
101 | 
102 | ```{r, results="hide", message=F}
103 | 
104 | pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
105 | pbmc <- NormalizeData(pbmc)
106 | 
107 | ```
108 | 
109 | ## Variable gene detection and scaling
110 | 
111 | ```{r, results="hide", message=F}
112 | 
113 | pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
114 | 
115 | all.genes <- rownames(pbmc)
116 | pbmc <- ScaleData(pbmc, features = all.genes)
117 | 
118 | ```
119 | 
120 | ## Perform PCA
121 | 
122 | ```{r, results="hide", message=F}
123 | 
124 | pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
125 | 
126 | ```
127 | 
128 | 
129 | ```{r, eval=T}
130 | 
131 | ElbowPlot(pbmc)
132 | 
133 | ```
134 | 
135 | ## Cluster cells
136 | 
137 | ```{r, results="hide", message=F}
138 | 
139 | pbmc <- FindNeighbors(pbmc, dims = 1:10)
140 | pbmc <- FindClusters(pbmc, resolution = 0.5)
141 | 
142 | ```
143 | 
144 | ## Run non-linear dimensionality reduction (UMAP)
145 | 
146 | ```{r, results="hide", message=F}
147 | 
148 | pbmc <- RunUMAP(pbmc, dims = 1:10)
149 | 
150 | pbmc$unnamed_clusters <- Idents(pbmc)
151 | 
152 | ```
153 | 
154 | ```{r}
155 | 
156 | # saveRDS(pbmc, "pbmc.rds")
157 | 
158 | ```
159 | 
160 | 
161 | 
162 | ## Find differentially expressed genes
163 | 
164 | __This is the step where we generate the input for CIPR's log fold change (logFC) comparison methods.__
165 | 
166 | ```{r, echo=F, results="hide"}
167 | 
168 | allmarkers <- FindAllMarkers(pbmc)
169 | 
170 | ```
171 | 
172 | ```{r, include=F}
173 | 
174 | # saveRDS(allmarkers, "allmarkers.rds")
175 | 
176 | ```
177 | 
178 | ## Calculate average gene expression per cluster
179 | 
180 | __This is the step where we generate the input for CIPR's all-genes correlation methods.__
181 | 
182 | ```{r, results="hide"}
183 | 
184 | avgexp <- AverageExpression(pbmc)
185 | 
186 | avgexp <- avgexp$RNA
187 | 
188 | avgexp <- as.data.frame(avgexp)
189 | 
190 | avgexp$gene <- rownames(avgexp)
191 | 
192 | ```
193 | 
194 | ```{r, include=F}
195 | 
196 | # saveRDS(avgexp, "avgexp.rds")
197 | 
198 | ```
199 | 
200 | ## Visualize Seurat pbject
201 | 
202 | ```{r, include=F}
203 | 
204 | # pbmc <- readRDS("pbmc.rds")
205 | 
206 | ```
207 | 
208 | ```{r}
209 | 
210 | DimPlot(pbmc)
211 | 
212 | ```
213 | 
214 | # CIPR analysis
215 | 
216 | The user can select one of the 7 provided reference data sets:
217 | 
218 | | Reference                                 | `reference` argument |
219 | |-------------------------------------------|----------------------|
220 | | Immunological Genome Project (ImmGen)     | "immgen"             |
221 | | Presorted cell RNAseq (various tissues)   | "mmrnaseq"           |
222 | | Blueprint/ENCODE                          | "blueprint"          |
223 | | Human Primary Cell Atlas                  | "hpca"               |
224 | | Database of Immune Cell Expression (DICE) | "dice"               |
225 | | Hematopoietic differentiation             | "hema"               |
226 | | Presorted cell RNAseq (PBMC)              | "hsrnaseq"           |
227 | | User-provided custom reference            | "custom"             |
228 | 
229 | ## Standard logFC comparison method
230 | 
231 | In this method CIPR accepts `allmarkers` data frame created above and performs the following analytical steps:
232 | 
233 | - It calculates a vector of logFC values for each reference sample (i.e. individual columns of the reference data frame) by comparing log-normalized expression value of a gene (i.e. rows of the reference data frame) to the average gene expression across the entire reference dataset.
234 | - It then scores unknown cluster logFC differential gene expression data against this reference logFC values to create a vector of identity scores
235 | - User can select one of three methods:
236 |     - LogFC dot product (sum of all logFC x logFC values among matching genes). This is the recommended method in CIPR.
237 |     - LogFC Spearman's correlation (rank correlation of logFC values)
238 |     - LogFC Pearson's correlation (linear correlation of logFC values)
239 | 
240 | 
241 | 
242 | ```{r, eval=T, include=F}
243 | 
244 | # allmarkers <- readRDS("allmarkers.rds")
245 | 
246 | ```
247 | 
248 | 
249 | ### Plot all identity scores per cluster-reference cell pairs
250 | 
251 | The code below performs analysis using sorted human PBMC RNAseq data as reference, and plots 
252 | 
253 | CIPR results can be summarized for each cluster in scatter plots.
254 | 
255 | ```{r, eval=T, fig.width=16, fig.height=32, message=F}
256 | 
257 | CIPR(input_dat = allmarkers,
258 |      comp_method = "logfc_dot_product", 
259 |      reference = "hsrnaseq", 
260 |      plot_ind = T,
261 |      plot_top = F, 
262 |      global_results_obj = T, 
263 |      global_plot_obj = T,
264 |      # axis.text.x=element_text(color="red") # arguments to pass to ggplot2::theme() to change plotting parameters
265 |      )
266 | 
267 | 
268 | 
269 | ```
270 | 
271 | ### Plot identity scores for a select cluster
272 | 
273 | `ind_clu_plots` object is created in the global environment to help users can visualize results for a desired cluster and manipulate graphing parameters. ggplot2 functions can be iteratively added to individual plots to create annotations etc.
274 | 
275 | ```{r, eval=T, fig.width=16, fig.height=5, message=F}
276 | 
277 | library(ggplot2)
278 | 
279 | ind_clu_plots$cluster6 +
280 |     theme(axis.text.y = element_text(color="red"),
281 |           axis.text.x = element_text(color="blue")) +
282 |     labs(fill="Reference")+
283 |     ggtitle("Figure S4a. Automated cluster annotation results are shown for cluster 6") +
284 |     annotate("text", label="2 sd range", x=10, y= 500, size=8, color = "steelblue")+
285 |     annotate("text", label= "1 sd range", x=10, y=175, size=8, color ="orange2")+
286 |   geom_rect(aes(xmin=94, xmax=99, ymin=550, ymax=900), fill=NA, size=3, color="red")
287 |     
288 |     
289 | 
290 | ```
291 | 
292 | 
293 | 
294 | ### Plot top scoring refernce subsets for each cluster
295 | 
296 | ```{r, eval=T, fig.width=8, fig.height=4.5, message=F}
297 | 
298 | CIPR(input_dat = allmarkers,
299 |      comp_method = "logfc_dot_product", 
300 |      reference = "hsrnaseq", 
301 |      plot_ind = F,
302 |      plot_top = T, 
303 |      global_results_obj = T, 
304 |      global_plot_obj = T)
305 | 
306 | 
307 | ```
308 | 
309 | ### Tabulate CIPR results
310 | 
311 | CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (`CIPR_top_results` and `CIPR_all_results` respectively) to allow users to explore the outputs and generate specific plots and tables.
312 | 
313 | ```{r, eval=T}
314 | 
315 | DT::datatable(CIPR_top_results)
316 | 
317 | 
318 | DT::datatable(head(CIPR_all_results))
319 | 
320 | ```
321 | 
322 | 
323 | ## Standard all-genes correlation method
324 | 
325 | CIPR also implements a simple correlation approach in which overall correlation in gene expression is calculated for the pairs of unknown clusters and the reference samples (regardless of the differential expression status of the gene). This approach is conceptually similiar to some other automated identity prediction pipelines such as [SingleR](https://www.ncbi.nlm.nih.gov/pubmed/30643263) and [scMCA](https://www.ncbi.nlm.nih.gov/pubmed/30758821).
326 | 
327 | Users can select one of the following methods:
328 | 
329 | - __Spearman's correlation:__ It calculates correlation based on ranked gene expression. It can be suitable for comparing experimental and reference data which were obtained using different technologies.
330 | - __Pearson's correlation:__ It calculates linear correlations. This can be useful when the user would like to provide a custom reference dataset to CIPR.
331 | 
332 | 
333 | 
334 | ### Plot all identity scores per cluster-reference cell pairs
335 | 
336 | The code below performs analysis using sorted human PBMC RNAseq data as reference, and plots 
337 | 
338 | CIPR results can be summarized for each cluster in scatter plots.
339 | 
340 | ```{r, eval=T, fig.width=16, fig.height=32, message=F}
341 | 
342 | CIPR(input_dat = avgexp,
343 |      comp_method = "all_genes_spearman", 
344 |      reference = "hsrnaseq", 
345 |      plot_ind = T,
346 |      plot_top = F, 
347 |      global_results_obj = T, 
348 |      global_plot_obj = T)
349 | 
350 | 
351 | 
352 | ```
353 | 
354 | 
355 | ### Plot top scoring refernce subsets for each cluster
356 | 
357 | ```{r, eval=T, fig.width=8, fig.height=4.5, message=F}
358 | 
359 | CIPR(input_dat = avgexp,
360 |      comp_method = "all_genes_spearman", 
361 |      reference = "hsrnaseq", 
362 |      plot_ind = F,
363 |      plot_top = T, 
364 |      global_results_obj = T, 
365 |      global_plot_obj = T)
366 | 
367 | 
368 | ```
369 | 
370 | ### Tabulate CIPR results
371 | 
372 | CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (`CIPR_top_results` and `CIPR_all_results` respectively) to allow users to explore the outputs and generate specific plots and tables.
373 | 
374 | ```{r, eval=T}
375 | 
376 | DT::datatable(CIPR_top_results)
377 | 
378 | 
379 | DT::datatable(head(CIPR_all_results))
380 | 
381 | ```
382 | 
383 | 
384 | ## Limiting analysis to the select subsets of reference data
385 | 
386 | Sometimes excluding irrelevant reference cell types from the analysis can be helpful. Especially when the logFC comparison methods are utilized, removing irrelevant subsets may improve discrimination of closely related subsets, since the reference logFC values will be calculated after subsetting the data frame. Filtering out reference subsets should not impact results of the all-genes correlation methods, but it can make the graphical outputs easier to look at
387 | 
388 | 3k PBMC dataset may not be the best example to demonstrate benefits of reference dataset subsetting, but the code below serves as an example for this functionality.
389 | 
390 | ```{r, eval=T, fig.width=16, fig.height=32, message=F}
391 | 
392 | CIPR(input_dat = allmarkers,
393 |      comp_method = "logfc_dot_product", 
394 |      reference = "hsrnaseq", 
395 |      plot_ind = T,
396 |      plot_top = F, 
397 |      global_results_obj = T, 
398 |      global_plot_obj = T,
399 |      select_ref_subsets = c("CD4+ T cell", "CD8+ T cell", "Monocyte", "NK cell"))
400 | 
401 | 
402 | ```
403 | 
404 | 
405 | 
406 | ## Filtering out lowly variable genes
407 | 
408 | Genes that have a low expression variance across the reference data frame has weaker discriminatory potential. Thus, excluding these genes from the analysis can reduce the noise and improve the prediction scores, especially when using all-genes correlation based methods.
409 | 
410 | We implemented a variance filtering parameter, `keep_top_var`, which allows users to keep top Nth% variable reference genes in the analysis. For instance, by setting this argument to 10, CIPR can be instructed to use only the top 10% highly variable genes in identity score calculations. In our experience _(Ekiz HA, BMC Bioinformatics, in revision)_ limiting the analysis to highly variable genes does not significantly impact the identity scores of the top-scoring reference cell subsets, but it reduces the identity scores of intermediate/low-scoring reference cells leading to an improvement of z-scores. The "best" value for this parameter remains to be determined by the user in individual studies.
411 | 
412 | ```{r, eval=T, fig.width=16, fig.height=32, message=F}
413 | 
414 | CIPR(input_dat = avgexp,
415 |      comp_method = "all_genes_spearman", 
416 |      reference = "hsrnaseq", 
417 |      plot_ind = T,
418 |      plot_top = F, 
419 |      global_results_obj = T, 
420 |      global_plot_obj = T,
421 |      keep_top_var = 10)
422 | 
423 | 
424 | 
425 | ```
426 | 
427 | 
428 | 
429 | 
430 | ```{r}
431 | 
432 | sessionInfo()
433 | 
434 | 
435 | ```
436 | 
437 | 
438 | 
439 | 
440 | 
441 | 
442 | 
443 | 
444 | 
445 | 
446 | 
447 | 
448 | 
449 | 
450 | 
451 | 
452 | 
453 | 
454 | 
455 | 
456 | 
457 | 
458 | 
459 | 
460 | 


--------------------------------------------------------------------------------
/doc/sample_ind_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/doc/sample_ind_output.png


--------------------------------------------------------------------------------
/doc/sample_top_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/atakanekiz/CIPR-Package/8729f6bda9f9a178f77844cab9467904fb9ecdae/doc/sample_top_output.png


--------------------------------------------------------------------------------
/man/CIPR.Rd:
--------------------------------------------------------------------------------
  1 | % Generated by roxygen2: do not edit by hand
  2 | % Please edit documentation in R/CIPR.R
  3 | \name{CIPR}
  4 | \alias{CIPR}
  5 | \title{Cluster Identity Predictor}
  6 | \usage{
  7 | CIPR(
  8 |   input_dat,
  9 |   comp_method = "logfc_dot_product",
 10 |   reference = NULL,
 11 |   select_ref_subsets = "all",
 12 |   custom_reference = NULL,
 13 |   custom_ref_annot = NULL,
 14 |   keep_top_var = 100,
 15 |   plot_ind = F,
 16 |   plot_top = T,
 17 |   top_num = 5,
 18 |   save_png = F,
 19 |   global_plot_obj = T,
 20 |   global_results_obj = T,
 21 |   update_ref = T,
 22 |   ...
 23 | )
 24 | }
 25 | \arguments{
 26 | \item{input_dat}{Data frame containing normalized log2-transformed gene
 27 | expression values per cluster OR a table of differentially expressed genes
 28 | per cluster}
 29 | 
 30 | \item{comp_method}{Method to use for identity score calculations. It accepts
 31 | one of the following: "logfc_dot_product" (default), "logfc_spearman",
 32 | "logfc_pearson", "all_genes_spearman", "all_genes_pearson"}
 33 | 
 34 | \item{reference}{Reference data frame containing gene expression data from
 35 | known cell types. It accepts one of the following: "immgen" (default),
 36 | "mmrnaseq", "blueprint", "hpca", "dice", "hema", "hsrnaseq", "custom"}
 37 | 
 38 | \item{select_ref_subsets}{The names of cell subsets to be included in the
 39 | analysis. For using the entire reference dataset use "all", or
 40 | provide a character vector of cell types of interest. Defaults to "all"}
 41 | 
 42 | \item{custom_reference}{A data frame containing custom reference. There must
 43 | be a column named 'gene' and other columns contain normalized gene expression
 44 | data from known samples. Defaults to NULL}
 45 | 
 46 | \item{custom_ref_annot}{A data frame containing custom reference metadata.
 47 | This is optional to get more informative results from CIPR. The data
 48 | frame must contain columns named 'short_name' (must match column names in
 49 | custom reference), 'long_name' (human readable names for reference samples),
 50 | 'description' (details such as positive and negative sorting markers),
 51 | 'reference_cell_type' (e.g. T cell, B cell, NK)}
 52 | 
 53 | \item{keep_top_var}{Top n percent of highly variant reference genes to
 54 | include in the analysis. It accepts a numeric value smaller than or equal
 55 | to 100 (default). The value of 100 results in keeping all the genes in the
 56 | reference dataset)}
 57 | 
 58 | \item{plot_ind}{Logical value. Set it to TRUE to plot identity scores for
 59 | each cluster. Defaults to FALSE}
 60 | 
 61 | \item{plot_top}{Logical value. set it to TRUE to plot top scoring reference
 62 | cell types for each cluster. Defaults ot TRUE.}
 63 | 
 64 | \item{top_num}{A numeric value determining how many top scoring reference
 65 | cell types will be plotted for each cluster. Defaults to 5.}
 66 | 
 67 | \item{save_png}{Logical value. Set it to TRUE if you would like to export png
 68 | images of the results. Defaults to FALSE}
 69 | 
 70 | \item{global_plot_obj}{Logical value. Set it to TRUE if you would like to
 71 | keep the plots as an object in the global environment. This can be useful
 72 | for accessing and manipulating the graphs. Defaults to TRUE.}
 73 | 
 74 | \item{global_results_obj}{Logical value. Set it to TRUE if you would like to
 75 | keep the analysis results as a global object. Defaults to TRUE.}
 76 | 
 77 | \item{...}{arguments to pass to theme() (for graph manipulation)}
 78 | }
 79 | \value{
 80 | Graphical outputs and/or data frames of identity scores calculated
 81 | for each cluster in the input data.
 82 | }
 83 | \description{
 84 | CIPR (Cluster Identity PRedictor) is a pipeline that helps annotating
 85 | unknown single cell clusters in single cell RNA sequencing (scRNAseq experiments).
 86 | This function scores unknown cluster gene expression signatures against known reference
 87 | datasets using user-selected analytical approaches to facilitate scRNAseq analysis.
 88 | }
 89 | \examples{
 90 | 
 91 | # Example of using CIPR in conjunction with Seurat
 92 | library(Seurat)
 93 | allmarkers <- FindAllMarkers(seurat_object)
 94 | avgexp <- AverageExpression(seurat_object)
 95 | 
 96 | # Using built-in immgen as reference and logfc dot product method
 97 | CIPR(input_dat = allmarkers,
 98 | comp_method = "logfc_dot_product",
 99 | reference="immgen",
100 | keep_top_var = 100,
101 | global_results_obj = T,
102 | plot_top = T)
103 | 
104 | #' # Using built-in immgen as reference and all genes spearman method
105 | 
106 | CIPR(input_dat = avgexp,
107 | comp_method = "all_genes_spearman",
108 | reference="immgen",
109 | keep_top_var = 100,
110 | global_results_obj = T,
111 | plot_top = T)
112 | 
113 | 
114 | # Using built-in dice reference and logFC spearman method
115 | # Variance threshold of top 50\%
116 | 
117 | CIPR(input_dat = allmarkers,
118 | comp_method = "logfc_spearman",
119 | reference="dice",
120 | keep_top_var = 50,
121 | global_results_obj = T,
122 | plot_top = T)
123 | 
124 | 
125 | # Using a custom reference
126 | 
127 | CIPR(input_dat = allmarkers,
128 | comp_method = "logfc_dot_product",
129 | reference="custom",
130 | custom_ref_dat_path = custom_ref_df,
131 | custom_ref_annot_path = custom_annot_df,
132 | keep_top_var = 100,
133 | global_results_obj = T,
134 | plot_top = T)
135 | 
136 | 
137 | # Using a blueprint-encode reference and limiting the analysis
138 | # to "Pericytes", "Skeletal muscle", "Smooth muscle"
139 | 
140 | CIPR(input_dat = allmarkers,
141 | comp_method = "logfc_dot_product",
142 | reference="blueprint-encode",
143 | select_ref_subsets = c("Pericytes", "Skeletal muscle", "Smooth muscle")
144 | keep_top_var = 100,
145 | global_results_obj = T,
146 | plot_top = T)
147 | 
148 | # Using built in example data (logFC signatures per cluster)
149 | CIPR(input_dat = example_logfc_data,
150 | comp_method = "logfc_dot_product",
151 | reference="blueprint-encode",
152 | select_ref_subsets = c("Pericytes", "Skeletal muscle", "Smooth muscle")
153 | keep_top_var = 100,
154 | global_results_obj = T,
155 | plot_top = T)
156 | 
157 | 
158 | # Using built in example data (average expression)
159 | CIPR(input_dat = example_avgexp_data,
160 | comp_method = "all_spearman",
161 | reference="immgen",
162 | select_ref_subsets = "all",
163 | keep_top_var = 100,
164 | global_results_obj = T,
165 | plot_top = T)
166 | 
167 | }
168 | 


--------------------------------------------------------------------------------
/vignettes/cipr_human_pbmc.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Using CIPR with human PBMC data"
  3 | date: "`r format(Sys.time(), '%d %B, %Y')`"
  4 | author: "Atakan Ekiz"
  5 | output:
  6 |   rmarkdown::html_vignette:
  7 |     toc: true
  8 |     toc_depth: 3
  9 | vignette: >
 10 |   %\VignetteIndexEntry{Using CIPR with human PBMC data}
 11 |   %\VignetteEngine{knitr::rmarkdown}
 12 |   \usepackage[utf8]{inputenc}
 13 | ---
 14 | 
 15 | ```{r, include=FALSE}
 16 | 
 17 | knitr::opts_chunk$set(eval=T)
 18 | 
 19 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
 20 | 
 21 | 
 22 | ```
 23 | 
 24 | 
 25 | 
 26 | # Summary 
 27 | 
 28 | This vignette describes how to use CIPR package with 3k PBMC data freely available from 10X genomics. Here, we recycle the code described in [Seurat's guided clustering tutorial](https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html) to help users perform analyses from scratch. Using this dataset we will demonstrate the capabilities of CIPR to annotate single cell clusters in single cell RNAseq (scRNAseq) experiments. For further information about other clustering methods, please see Seurat's comprehensive [website](https://satijalab.org/seurat/)
 29 | 
 30 | 
 31 | # Install CIPR
 32 | 
 33 | ```{r, eval=F}
 34 | 
 35 | if (!requireNamespace("devtools", quietly = TRUE))
 36 |     install.packages("devtools")
 37 | 
 38 | 
 39 | # Use this option if you want to build vignettes during installation
 40 | # This can take a long time due to the installation of suggested packages.
 41 | devtools::install_github("atakanekiz/CIPR-Package", build_vignettes = TRUE)
 42 | 
 43 | # Use this if you would like to install the package without vignettes
 44 | # devtools::install_github("atakanekiz/CIPR-Package")
 45 | 
 46 | ```
 47 | 
 48 | 
 49 | # Seurat pipeline
 50 | 
 51 | ## Setup Seurat object
 52 | 
 53 | ```{r, eval=T}
 54 | 
 55 | library(dplyr)
 56 | library(Seurat)
 57 | library(CIPR)
 58 | 
 59 | ```
 60 | 
 61 | ```{r}
 62 | 
 63 | # Download data
 64 | 
 65 | temp <- tempfile()
 66 | tempd <- tempdir()
 67 | 
 68 | download.file("https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz", destfile = temp)
 69 | 
 70 | untar(temp, exdir = tempd)
 71 | 
 72 | unlink(temp)
 73 | 
 74 | 
 75 | # Load the PBMC dataset
 76 | pbmc.data <- Read10X(data.dir = paste0(tempd, "/filtered_gene_bc_matrices/hg19"))
 77 | # Initialize the Seurat object with the raw (non-normalized data).
 78 | pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
 79 | pbmc
 80 | 
 81 | 
 82 | ```
 83 | 
 84 | 
 85 | ## Pre-processing
 86 | 
 87 | The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.
 88 | 
 89 | ```{r}
 90 | 
 91 | # Calculate mitochondrial gene representation (indicative of low quality cells)
 92 | pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
 93 | 
 94 | # Filter out genes with feature counts outside of 200-2500 range, and >5% mt genes 
 95 | pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
 96 | 
 97 | ```
 98 | 
 99 | 
100 | ## Normalizing data
101 | 
102 | ```{r, results="hide", message=F}
103 | 
104 | pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
105 | pbmc <- NormalizeData(pbmc)
106 | 
107 | ```
108 | 
109 | ## Variable gene detection and scaling
110 | 
111 | ```{r, results="hide", message=F}
112 | 
113 | pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
114 | 
115 | all.genes <- rownames(pbmc)
116 | pbmc <- ScaleData(pbmc, features = all.genes)
117 | 
118 | ```
119 | 
120 | ## Perform PCA
121 | 
122 | ```{r, results="hide", message=F}
123 | 
124 | pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
125 | 
126 | ```
127 | 
128 | 
129 | ```{r, eval=T}
130 | 
131 | ElbowPlot(pbmc)
132 | 
133 | ```
134 | 
135 | ## Cluster cells
136 | 
137 | ```{r, results="hide", message=F}
138 | 
139 | pbmc <- FindNeighbors(pbmc, dims = 1:10)
140 | pbmc <- FindClusters(pbmc, resolution = 0.5)
141 | 
142 | ```
143 | 
144 | ## Run non-linear dimensionality reduction (tSNE)
145 | 
146 | ```{r, results="hide", message=F}
147 | 
148 | pbmc <- RunTSNE(pbmc, dims = 1:10)
149 | 
150 | pbmc$unnamed_clusters <- Idents(pbmc)
151 | 
152 | ```
153 | 
154 | ```{r}
155 | 
156 | # saveRDS(pbmc, "pbmc.rds")
157 | 
158 | ```
159 | 
160 | 
161 | 
162 | ## Find differentially expressed genes
163 | 
164 | __This is the step where we generate the input for CIPR's log fold change (logFC) comparison methods.__
165 | 
166 | ```{r, echo=F, results="hide"}
167 | 
168 | allmarkers <- FindAllMarkers(pbmc)
169 | 
170 | ```
171 | 
172 | ```{r, include=F}
173 | 
174 | # saveRDS(allmarkers, "allmarkers.rds")
175 | 
176 | ```
177 | 
178 | ## Calculate average gene expression per cluster
179 | 
180 | __This is the step where we generate the input for CIPR's all-genes correlation methods.__
181 | 
182 | ```{r, results="hide"}
183 | 
184 | avgexp <- AverageExpression(pbmc)
185 | 
186 | avgexp <- as.data.frame(avgexp$RNA)
187 | 
188 | avgexp$gene <- rownames(avgexp)
189 | 
190 | ```
191 | 
192 | ```{r, include=F}
193 | 
194 | # saveRDS(avgexp, "avgexp.rds")
195 | 
196 | ```
197 | 
198 | ## Visualize Seurat pbject
199 | 
200 | ```{r, include=F}
201 | 
202 | # pbmc <- readRDS("pbmc.rds")
203 | 
204 | ```
205 | 
206 | ```{r}
207 | 
208 | DimPlot(pbmc)
209 | 
210 | ```
211 | 
212 | # CIPR analysis
213 | 
214 | The user can select one of the 7 provided reference data sets:
215 | 
216 | | Reference                                 | `reference` argument |
217 | |-------------------------------------------|----------------------|
218 | | Immunological Genome Project (ImmGen)     | "immgen"             |
219 | | Presorted cell RNAseq (various tissues)   | "mmrnaseq"           |
220 | | Blueprint/ENCODE                          | "blueprint"          |
221 | | Human Primary Cell Atlas                  | "hpca"               |
222 | | Database of Immune Cell Expression (DICE) | "dice"               |
223 | | Hematopoietic differentiation             | "hema"               |
224 | | Presorted cell RNAseq (PBMC)              | "hsrnaseq"           |
225 | | User-provided custom reference            | "custom"             |
226 | 
227 | ## Standard logFC comparison method
228 | 
229 | In this method CIPR accepts `allmarkers` data frame created above and performs the following analytical steps:
230 | 
231 | - It calculates a vector of logFC values for each reference sample (i.e. individual columns of the reference data frame) by comparing log-normalized expression value of a gene (i.e. rows of the reference data frame) to the average gene expression across the entire reference dataset.
232 | - It then scores unknown cluster logFC differential gene expression data against this reference logFC values to create a vector of identity scores
233 | - User can select one of three methods:
234 |     - LogFC dot product (sum of all logFC x logFC values among matching genes). This is the recommended method in CIPR.
235 |     - LogFC Spearman's correlation (rank correlation of logFC values)
236 |     - LogFC Pearson's correlation (linear correlation of logFC values)
237 | 
238 | 
239 | 
240 | ```{r, eval=T, include=F}
241 | 
242 | # allmarkers <- readRDS("allmarkers.rds")
243 | 
244 | ```
245 | 
246 | 
247 | ### Plot all identity scores per cluster-reference cell pairs
248 | 
249 | The code below performs analysis using sorted human PBMC RNAseq data as reference, and plots 
250 | 
251 | CIPR results can be summarized for each cluster in scatter plots.
252 | 
253 | ```{r, eval=T, fig.width=16, fig.height=32, message=F}
254 | 
255 | CIPR(input_dat = allmarkers,
256 |      comp_method = "logfc_dot_product", 
257 |      reference = "hsrnaseq", 
258 |      plot_ind = T,
259 |      plot_top = F, 
260 |      global_results_obj = T, 
261 |      global_plot_obj = T,
262 |      # axis.text.x=element_text(color="red") # arguments to pass to ggplot2::theme() to change plotting parameters
263 |      )
264 | 
265 | 
266 | 
267 | ```
268 | 
269 | ### Plot identity scores for a select cluster
270 | 
271 | `ind_clu_plots` object is created in the global environment to help users can visualize results for a desired cluster and manipulate graphing parameters. ggplot2 functions can be iteratively added to individual plots to create annotations etc.
272 | 
273 | ```{r, eval=T, fig.width=16, fig.height=5, message=F}
274 | 
275 | library(ggplot2)
276 | 
277 | ind_clu_plots$cluster6 +
278 |     theme(axis.text.y = element_text(color="red"),
279 |           axis.text.x = element_text(color="blue")) +
280 |     labs(fill="Reference")+
281 |     ggtitle("Figure S4a. Automated cluster annotation results are shown for cluster 6") +
282 |     annotate("text", label="2 sd range", x=10, y= 500, size=8, color = "steelblue")+
283 |     annotate("text", label= "1 sd range", x=10, y=175, size=8, color ="orange2")+
284 |   geom_rect(aes(xmin=94, xmax=99, ymin=550, ymax=900), fill=NA, size=3, color="red")
285 |     
286 |     
287 | 
288 | ```
289 | 
290 | 
291 | 
292 | ### Plot top scoring refernce subsets for each cluster
293 | 
294 | ```{r, eval=T, fig.width=8, fig.height=4.5, message=F}
295 | 
296 | CIPR(input_dat = allmarkers,
297 |      comp_method = "logfc_dot_product", 
298 |      reference = "hsrnaseq", 
299 |      plot_ind = F,
300 |      plot_top = T, 
301 |      global_results_obj = T, 
302 |      global_plot_obj = T)
303 | 
304 | 
305 | ```
306 | 
307 | ### Tabulate CIPR results
308 | 
309 | CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (`CIPR_top_results` and `CIPR_all_results` respectively) to allow users to explore the outputs and generate specific plots and tables.
310 | 
311 | ```{r, eval=T}
312 | 
313 | DT::datatable(CIPR_top_results)
314 | 
315 | 
316 | DT::datatable(head(CIPR_all_results))
317 | 
318 | ```
319 | 
320 | 
321 | ## Standard all-genes correlation method
322 | 
323 | CIPR also implements a simple correlation approach in which overall correlation in gene expression is calculated for the pairs of unknown clusters and the reference samples (regardless of the differential expression status of the gene). This approach is conceptually similiar to some other automated identity prediction pipelines such as [SingleR](https://www.ncbi.nlm.nih.gov/pubmed/30643263) and [scMCA](https://www.ncbi.nlm.nih.gov/pubmed/30758821).
324 | 
325 | Users can select one of the following methods:
326 | 
327 | - __Spearman's correlation:__ It calculates correlation based on ranked gene expression. It can be suitable for comparing experimental and reference data which were obtained using different technologies.
328 | - __Pearson's correlation:__ It calculates linear correlations. This can be useful when the user would like to provide a custom reference dataset to CIPR.
329 | 
330 | 
331 | 
332 | ### Plot all identity scores per cluster-reference cell pairs
333 | 
334 | The code below performs analysis using sorted human PBMC RNAseq data as reference, and plots 
335 | 
336 | CIPR results can be summarized for each cluster in scatter plots.
337 | 
338 | ```{r, eval=T, fig.width=16, fig.height=32, message=F}
339 | 
340 | CIPR(input_dat = avgexp,
341 |      comp_method = "all_genes_spearman", 
342 |      reference = "hsrnaseq", 
343 |      plot_ind = T,
344 |      plot_top = F, 
345 |      global_results_obj = T, 
346 |      global_plot_obj = T)
347 | 
348 | 
349 | 
350 | ```
351 | 
352 | 
353 | ### Plot top scoring refernce subsets for each cluster
354 | 
355 | ```{r, eval=T, fig.width=8, fig.height=4.5, message=F}
356 | 
357 | CIPR(input_dat = avgexp,
358 |      comp_method = "all_genes_spearman", 
359 |      reference = "hsrnaseq", 
360 |      plot_ind = F,
361 |      plot_top = T, 
362 |      global_results_obj = T, 
363 |      global_plot_obj = T)
364 | 
365 | 
366 | ```
367 | 
368 | ### Tabulate CIPR results
369 | 
370 | CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (`CIPR_top_results` and `CIPR_all_results` respectively) to allow users to explore the outputs and generate specific plots and tables.
371 | 
372 | ```{r, eval=T}
373 | 
374 | DT::datatable(CIPR_top_results)
375 | 
376 | 
377 | DT::datatable(head(CIPR_all_results))
378 | 
379 | ```
380 | 
381 | 
382 | ## Limiting analysis to the select subsets of reference data
383 | 
384 | Sometimes excluding irrelevant reference cell types from the analysis can be helpful. Especially when the logFC comparison methods are utilized, removing irrelevant subsets may improve discrimination of closely related subsets, since the reference logFC values will be calculated after subsetting the data frame. Filtering out reference subsets should not impact results of the all-genes correlation methods, but it can make the graphical outputs easier to look at
385 | 
386 | 3k PBMC dataset may not be the best example to demonstrate benefits of reference dataset subsetting, but the code below serves as an example for this functionality.
387 | 
388 | ```{r, eval=T, fig.width=16, fig.height=32, message=F}
389 | 
390 | CIPR(input_dat = allmarkers,
391 |      comp_method = "logfc_dot_product", 
392 |      reference = "hsrnaseq", 
393 |      plot_ind = T,
394 |      plot_top = F, 
395 |      global_results_obj = T, 
396 |      global_plot_obj = T,
397 |      select_ref_subsets = c("CD4+ T cell", "CD8+ T cell", "Monocyte", "NK cell"))
398 | 
399 | 
400 | ```
401 | 
402 | 
403 | 
404 | ## Filtering out lowly variable genes
405 | 
406 | Genes that have a low expression variance across the reference data frame has weaker discriminatory potential. Thus, excluding these genes from the analysis can reduce the noise and improve the prediction scores, especially when using all-genes correlation based methods.
407 | 
408 | We implemented a variance filtering parameter, `keep_top_var`, which allows users to keep top Nth% variable reference genes in the analysis. For instance, by setting this argument to 10, CIPR can be instructed to use only the top 10% highly variable genes in identity score calculations. In our experience _(Ekiz HA, BMC Bioinformatics, in revision)_ limiting the analysis to highly variable genes does not significantly impact the identity scores of the top-scoring reference cell subsets, but it reduces the identity scores of intermediate/low-scoring reference cells leading to an improvement of z-scores. The "best" value for this parameter remains to be determined by the user in individual studies.
409 | 
410 | ```{r, eval=T, fig.width=16, fig.height=32, message=F}
411 | 
412 | CIPR(input_dat = avgexp,
413 |      comp_method = "all_genes_spearman", 
414 |      reference = "hsrnaseq", 
415 |      plot_ind = T,
416 |      plot_top = F, 
417 |      global_results_obj = T, 
418 |      global_plot_obj = T,
419 |      keep_top_var = 10)
420 | 
421 | 
422 | 
423 | ```
424 | 
425 | 
426 | 
427 | 
428 | ```{r}
429 | 
430 | sessionInfo()
431 | 
432 | 
433 | ```
434 | 
435 | 
436 | 
437 | 
438 | 
439 | 
440 | 
441 | 
442 | 
443 | 
444 | 
445 | 
446 | 
447 | 
448 | 
449 | 
450 | 
451 | 
452 | 
453 | 
454 | 
455 | 
456 | 
457 | 
458 | 


--------------------------------------------------------------------------------