├── .Rbuildignore ├── .gitignore ├── CONTRIBUTION ├── DESCRIPTION ├── LICENSE ├── NAMESPACE ├── R ├── markerGenesAndMapping.r ├── mfishMapping.r └── update.r ├── README.md ├── data ├── fishData.rda └── metadata.rda ├── man ├── buildMappingBasedMarkerPanel.Rd ├── buildPanel_oneCluster.Rd ├── buildQualityTable.Rd ├── buildTreeFromGenePanel.Rd ├── cellToClusterMapping_byCor.Rd ├── cellToClusterMapping_byRank.Rd ├── corTreeMapping.Rd ├── corTreeMapping_withFilter.Rd ├── distTreeMapping.Rd ├── filterByClass.Rd ├── filterCells.Rd ├── filterPanelGenes.Rd ├── fishScaleAndMap.Rd ├── fractionCorrectPerNode.Rd ├── fractionCorrectWithGenes.Rd ├── generateMultipleCellReferenceSet.Rd ├── getBetaScore.Rd ├── getBranchList.Rd ├── getConfusionMatrix.Rd ├── getDend.Rd ├── getNodeHeight.Rd ├── getTopMatch.Rd ├── get_subtree_label.Rd ├── labelDend.Rd ├── layerFraction.Rd ├── layerScale.Rd ├── lca.Rd ├── leafToNodeMedians.Rd ├── makeLCAtable.Rd ├── map_dend.Rd ├── mergeFish.Rd ├── mfishtools.Rd ├── outputTopConfused.Rd ├── plotConfusionVsConfidence.Rd ├── plotCorrectWithGenes.Rd ├── plotDistributions.Rd ├── plotHeatmap.Rd ├── plotNodes.Rd ├── plotTsne.Rd ├── possibleClustersByPriors.Rd ├── quantileTruncate.Rd ├── resolve_cl.Rd ├── rfTreeMapping.Rd ├── rotateXY.Rd ├── smartLayerAllocation.Rd ├── subsampleCells.Rd ├── summarizeMatrix.Rd └── update_mfishtools.Rd ├── mfishtools.Rproj ├── mfishtools_0.0.2.pdf └── vignettes ├── .gitignore ├── inhibitory_marker_mapping.Rmd ├── inhibitory_marker_mapping.html ├── inhibitory_marker_selection.Rmd └── inhibitory_marker_selection.html /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^Meta$ 2 | ^doc$ 3 | ^.*\.Rproj$ 4 | ^\.Rproj\.user$ 5 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | Meta 2 | doc 3 | .Rproj.user 4 | .Rhistory 5 | .RData 6 | .Ruserdata 7 | -------------------------------------------------------------------------------- /CONTRIBUTION: -------------------------------------------------------------------------------- 1 | Allen Institute Contribution Agreement 2 | 3 | This document describes the terms under which you may make “Contributions” — which may include without limitation, software additions, revisions, bug fixes, configuration changes, documentation, or any other materials — to any of the projects owned or managed by the Allen Institute. If you have questions about these terms, please contact us at terms@alleninstitute.org. 4 | 5 | You certify that: 6 | 7 | • Your Contributions are either: 8 | 1. Created in whole or in part by you and you have the right to submit them under the designated license (described below); or 9 | 2. Based upon previous work that, to the best of your knowledge, is covered under an appropriate open source license and you have the right under that license to submit that work with modifications, whether created in whole or in part by you, under the designated license; or 10 | 3. Provided directly to you by some other person who certified (1) or (2) and you have not modified them. 11 | 12 | • You are granting your Contributions to the Allen Institute under the terms of the 2-Clause BSD license (the “designated license”). 13 | 14 | • You understand and agree that the Allen Institute projects and your Contributions are public and that a record of the Contributions (including all metadata and personal information you submit with them) is maintained indefinitely and may be redistributed consistent with the Allen Institute’s mission and the 2-Clause BSD license. 15 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: mfishtools 2 | Type: Package 3 | Title: Building Gene Sets and Mapping mFISH Data 4 | Version: 0.0.2 5 | Author: Jeremy Miller 6 | Maintainer: Jeremy Miller 7 | Description: This repository includes code for gene selection for spatial transcriptomics methods and for mapping of spatial transcriptomics (or RNA-Seq data) onto a RNA-Seq reference. Specific topics include: 8 | 1) Correlation-based mapping of cells to reference cell types 9 | 2) Iterative building of gene panels a greedy algorithm with pre-defined constraints 10 | 3) Visualizations related to gene mapping a gene panel selection 11 | License: What license is it under? 12 | Encoding: UTF-8 13 | LazyData: true 14 | Imports: 15 | dendextend (>= 1.7.0), 16 | gplots (>= 3.0.1), 17 | pdist (>= 1.2), 18 | matrixStats (>= 0.53.1), 19 | Rtsne (>= 0.13), 20 | ggplot2 (>= 2.2.1), 21 | scrattch.vis (>= 0.0), 22 | dplyr (>= 0.3.4), 23 | WGCNA (>= 1.0) 24 | RoxygenNote: 7.1.1 25 | Suggests: 26 | knitr, 27 | rmarkdown 28 | VignetteBuilder: knitr 29 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The 2-Clause BSD License 2 | Note: This license has also been called the "Simplified BSD License" and the "FreeBSD License". See also the 3-clause BSD License. 3 | 4 | Copyright (c) 2018. Allen Institute. 5 | 6 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 7 | 8 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 9 | 10 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 11 | 12 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 13 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(buildMappingBasedMarkerPanel) 4 | export(buildPanel_oneCluster) 5 | export(buildQualityTable) 6 | export(buildTreeFromGenePanel) 7 | export(cellToClusterMapping_byCor) 8 | export(cellToClusterMapping_byRank) 9 | export(corTreeMapping) 10 | export(corTreeMapping_withFilter) 11 | export(distTreeMapping) 12 | export(filterByClass) 13 | export(filterCells) 14 | export(filterPanelGenes) 15 | export(fishScaleAndMap) 16 | export(fractionCorrectPerNode) 17 | export(fractionCorrectWithGenes) 18 | export(generateMultipleCellReferenceSet) 19 | export(getBetaScore) 20 | export(getBranchList) 21 | export(getConfusionMatrix) 22 | export(getDend) 23 | export(getNodeHeight) 24 | export(getTopMatch) 25 | export(get_subtree_label) 26 | export(labelDend) 27 | export(layerFraction) 28 | export(layerScale) 29 | export(lca) 30 | export(leafToNodeMedians) 31 | export(makeLCAtable) 32 | export(map_dend) 33 | export(mergeFish) 34 | export(outputTopConfused) 35 | export(plotConfusionVsConfidence) 36 | export(plotCorrectWithGenes) 37 | export(plotDistributions) 38 | export(plotHeatmap) 39 | export(plotNodes) 40 | export(plotTsne) 41 | export(possibleClustersByPriors) 42 | export(quantileTruncate) 43 | export(resolve_cl) 44 | export(rfTreeMapping) 45 | export(rotateXY) 46 | export(smartLayerAllocation) 47 | export(subsampleCells) 48 | export(summarizeMatrix) 49 | export(update_mfishtools) 50 | -------------------------------------------------------------------------------- /R/markerGenesAndMapping.r: -------------------------------------------------------------------------------- 1 | #' mfishtools: Building Gene Sets and Mapping mFISH Data. 2 | #' 3 | #' This repository includes code for gene selection for spatial transcriptomics methods and for 4 | #' mapping of spatial transcriptomics (or RNA-Seq data) onto a RNA-Seq reference. Specific topics include: 5 | #' 1) Correlation-based mapping of cells to reference cell types 6 | #' 2) Iterative building of gene panels a greedy algorithm with pre-defined constraints 7 | #' 3) Visualizations related to gene mapping a gene panel selection 8 | #' 9 | #' @docType package 10 | #' @name mfishtools 11 | NULL 12 | 13 | 14 | #' Table of confused clusters 15 | #' 16 | #' This function returns a table of the top confused clusters (assigned clusters 17 | #' incorrectly mapped) 18 | #' 19 | #' @param confusionProp confusion matrix (e.g., output from getConfusionMatrix). 20 | #' @param count number of top confusions to show 21 | #' 22 | #' @return a 3 x count matrix of the top confused pairs of clusters with the three 23 | #' columns corresponding to mapped cluster, assigned cluster, and fraction of 24 | #' cells incorrectly mapped, respectively. 25 | #' 26 | #' @export 27 | outputTopConfused <- function(confusionProp, 28 | count = 10) { 29 | topConfused <- NULL 30 | mfp <- confusionProp 31 | diag(mfp) <- 0 32 | rn <- rownames(mfp) 33 | for (i in 1:count) { 34 | wt <- which(mfp == max(mfp), arr.ind = TRUE) 35 | wt <- t(wt[1, ]) 36 | topConfused <- rbind(topConfused, c(rn[wt], max(mfp))) 37 | mfp[wt] <- 0 38 | } 39 | colnames(topConfused) <- c("foundCluster", "realCluster", "proportionOff") 40 | rownames(topConfused) <- NULL 41 | topConfused <- as.data.frame(topConfused) 42 | topConfused[, 3] <- as.numeric(topConfused[, 3]) 43 | return(topConfused) 44 | } 45 | 46 | #' Confusion plot vs. confidence 47 | #' 48 | #' Produces line plots showing the percent of correctly mapped cells above a certain confidence value (or score). 49 | #' This is a wrapper for plot. 50 | #' 51 | #' @param foundClusterAndScore matrix where first column is found cluster and second column is 52 | #' confidence score (e.g., output from getTopMatch) 53 | #' @param realCluster character vector of assigned clusters 54 | #' @param ... additional parameters for the plot function 55 | #' 56 | #' @export 57 | plotConfusionVsConfidence <- function(foundClusterAndScore, 58 | realCluster, 59 | RI = (31:100) / 100, 60 | main = "% mapping (blue) / correct (orange)", 61 | ylab = "Percent", 62 | xlab = "Fraction correctly mapped to leaf", 63 | type = "l", 64 | xlim = range(RI), 65 | ...) { 66 | fracMap <- fracRight <- NULL 67 | for (r in RI) { 68 | isMap <- foundClusterAndScore[, 2] >= r 69 | fracMap <- c(fracMap, round(1000 * mean(isMap)) / 10) 70 | isRight <- (realCluster == foundClusterAndScore[, 1])[isMap] 71 | fracRight <- c(fracRight, round(1000 * mean(isRight)) / 10) 72 | } 73 | plot(RI, fracMap, 74 | xlim = xlim, col = "blue", main = main, 75 | ylab = ylab, xlab = xlab, type = "l", ... 76 | ) 77 | lines(RI, fracRight, col = "orange") 78 | abline(h = 5 * (0:20), col = "grey", lty = "dotted") 79 | } 80 | 81 | 82 | #' Confusion matrix 83 | #' 84 | #' Returns a confusion matrix of the found (mapped) vs. real (assigned) clusters. 85 | #' 86 | #' @param realCluster character vector of assigned clusters 87 | #' @param foundCluster character vector of mapped clusters 88 | #' @param proportions FALSE if the counts are to be returned and TRUE if the proportions are to be returned 89 | #' 90 | #' @export 91 | getConfusionMatrix <- function(realCluster, 92 | foundCluster, 93 | proportions = TRUE) { 94 | realCluster <- as.character(realCluster) 95 | foundCluster <- as.character(foundCluster) 96 | lev <- sort(unique(c(realCluster, foundCluster))) 97 | realCluster <- factor(realCluster, levels = lev) 98 | foundCluster <- factor(foundCluster, levels = lev) 99 | confusion <- table(foundCluster, realCluster) 100 | if (proportions) { 101 | cs <- colSums(confusion) 102 | for (i in 1:dim(confusion)[1]) 103 | confusion[i, ] <- confusion[i, ] / pmax(cs, 1e-08) 104 | } 105 | confusion 106 | } 107 | 108 | 109 | 110 | 111 | #' Filter genes for spatial transcriptomics panel 112 | #' 113 | #' Returns a set of genes for inclusion in a spatial transcriptomics panel based on a series of 114 | #' hard-coded and user-defined constraints 115 | #' 116 | #' @param summaryExpr Matrix of summarized expression levels for a given cluster. Typically the median 117 | #' or mean should be used. Rows are genes and columns are samples. ROW NAMES MUST BE GENE SYMBOLS! 118 | #' @param propExpr Proportion of cells expressed in each cluster for use with binary score calculation 119 | #' (default = summaryExpr, which is not recommended) 120 | #' @param onClusters Vector indicating which clusters should be included in the gene panel (default 121 | #' is all clusters. Can be logical or numeric, or a character string of cluster names) 122 | #' @param offClusters Vector indidicating from which clusters expression should be avoided 123 | #' @param geneLength Optional vector of gene lengths in same order as summaryExpr. Default is NULL 124 | #' @param numBinaryGenes Number of genes to include in the final panel. Genes are sorted by binary 125 | #' score using `getBetaScore` and this number of genes are chosen (default = 500) 126 | #' @param minOn Minimum summary expression level in most highly expressed "on" cluster (default = 10) 127 | #' @param maxOn Maximum summary expression level in most highly expressed "on" cluster (default = 250) 128 | #' @param maxOff Maximum summary expression level in most highly expressed "off" cluster (default = 50) 129 | #' @param minLength Minimum gene length for marker gene selection. Ignored if geneLength is not 130 | #' provided (default = 960) 131 | #' @param fractionOnClusters What is the maximum fraction of clusters in which a gene can be expressed 132 | #' (as defined by propExpr>onThreshold; default = 0.5). This prevents nearly ubiquitous genes from selection 133 | #' @param onThreshold What fraction of cells need to have expression for a gene to be defined as expressed 134 | #' (default = 0.5) 135 | #' @param excludeGenes Which genes should be excluded from the analysis (default is none) 136 | #' @param excludeFamilies Which gene classes or families should be excluded from the analysis? More 137 | #' specifically, any gene that contain these strings of characters anywhere in the symbol will be 138 | #' excluded (default is "LOC","LINC","FAM","ORF","KIAA","FLJ","DKFZ","RIK","RPS","RPL","\\-"). 139 | #' 140 | #' @return A character vector of genes meeting all constraints 141 | #' 142 | #' @export 143 | filterPanelGenes <- function(summaryExpr, 144 | propExpr = summaryExpr, 145 | onClusters = 1:dim(summaryExpr)[2], 146 | offClusters = NULL, 147 | geneLengths = NULL, 148 | startingGenes = c("GAD1","SLC17A7"), 149 | numBinaryGenes = 500, 150 | minOn = 10, 151 | maxOn = 250, 152 | maxOff = 50, 153 | minLength = 960, 154 | fractionOnClusters = 0.5, 155 | onThreshold = 0.5, 156 | excludeGenes = NULL, 157 | excludeFamilies = c("LOC","LINC","FAM","ORF","KIAA","FLJ","DKFZ","RIK","RPS","RPL","\\-")) { 158 | ## Variable check 159 | if(!is.element("matrix",class(summaryExpr))) { 160 | summaryExpr <- as.matrix(summaryExpr) 161 | } 162 | if(!is.numeric(summaryExpr[1,1])) { 163 | stop("summaryExpr must be a matrix of numeric values.") 164 | } 165 | if(is.null(rownames(summaryExpr))) { 166 | stop("Please provide summaryExpr with genes as row names.") 167 | } 168 | if(!is.numeric(fractionOnClusters)){ 169 | stop("fractionOnClusters needs to be numeric.") 170 | } 171 | fractionOnClusters = fractionOnClusters[1] 172 | if(fractionOnClusters>1) fractionOnClusters = fractionOnClusters/100 # Assume if it is greater than 1 then a percentage was given 173 | genes <- rownames(summaryExpr) 174 | genesU <- toupper(genes) 175 | 176 | ## Define excluded genes 177 | excludeFamilies <- toupper(excludeFamilies) 178 | excludeGenes <- is.element(genes,excludeGenes) 179 | if(length(excludeFamilies)>0){ 180 | for(i in 1:length(excludeFamilies)){ 181 | excludeGenes <- excludeGenes|grepl(excludeFamilies[i],genesU) 182 | } 183 | } 184 | 185 | ## Determine max expression levels in on and off clusters 186 | if(is.character(onClusters)){ 187 | onClusters <- is.element(colnames(summaryExpr),onClusters) 188 | } 189 | if(is.numeric(onClusters)){ 190 | onClusters <- is.element(1:dim(summaryExpr)[2],onClusters) 191 | } 192 | if(sum(onClusters)<2){ 193 | stop("Please provide at least two onClusters. If cluster names were provided, check colnames(summaryExpr).") 194 | } 195 | if(is.character(offClusters)){ 196 | offClusters <- is.element(colnames(summaryExpr),offClusters) 197 | } 198 | if(is.numeric(offClusters)){ 199 | offClusters <- is.element(1:dim(summaryExpr)[2],offClusters) 200 | } 201 | 202 | ## Now add the actual constraints from above 203 | maxExprOn <- apply(summaryExpr[,onClusters],1,max) 204 | if(sum(offClusters)>1){ 205 | maxExprOff <- apply(summaryExpr[,offClusters],1,max) 206 | } else if(sum(offClusters)==1){ 207 | maxExprOff <- summaryExpr[,offClusters] 208 | } else { 209 | maxExprOff <- pmax(maxExprOn,0.0000001) * -Inf # Essentially this is saying there is no off constraint 210 | } 211 | 212 | ## Set the gene lengths, if needed 213 | if(!is.null(geneLengths)) if(length(geneLengths)!=length(offClusters)){ 214 | stop("geneLengths must be in the same order and of same length as rows of summaryExpr.") 215 | } 216 | if(!is.null(geneLengths)) if(!is.numeric(geneLengths)){ 217 | stop("geneLengths must be numeric.") 218 | } 219 | if(is.null(geneLengths)){ 220 | geneLengths <- maxExprOn+Inf # Essentially this is saying there is no gene length constraint 221 | } 222 | 223 | ## Determine the acceptable genes 224 | keepGenes <- (!excludeGenes) & (maxExprOn > minOn) & (maxExprOn <= maxOn) & 225 | (maxExprOff <= maxOff) & (geneLengths >= minLength) & 226 | (rowMeans(propExpr[, onClusters] > onThreshold) <= fractionOnClusters) & 227 | (rowMeans(propExpr[, onClusters] > onThreshold) > 0) # Genes expressed in more than 0 clusters! 228 | keepGenes[is.na(keepGenes)] <- FALSE 229 | 230 | ## Find the top binary genes (if needed) and return gene list 231 | message(paste(sum(keepGenes),"total genes pass constraints prior to binary score calculation.")) 232 | if(sum(keepGenes)<=numBinaryGenes){ 233 | warning("Fewer genes pass constraints than numBinaryGenes, so binary score was not calculated.") 234 | return(sort(union(rownames(propExpr)[keepGenes], startingGenes))) 235 | } 236 | 237 | topBeta <- getBetaScore(propExpr[keepGenes,onClusters],FALSE) 238 | runGenes <- names(topBeta)[topBeta<=numBinaryGenes] 239 | runGenes <- sort(union(runGenes,startingGenes)) 240 | runGenes 241 | } 242 | 243 | 244 | 245 | 246 | 247 | 248 | #' Branch list 249 | #' 250 | #' Returns branches of a dendrogram in a specific format 251 | #' 252 | #' @param dend dendrogram for mapping. Ignored if medianDat is passed 253 | #' @param branches do not change from default 254 | #' @param allTips do not change from default 255 | #' 256 | #' @return a list of branch information for use with leafToNodeMedians 257 | #' 258 | #' @export 259 | getBranchList <- function(dend, 260 | branches = list(), 261 | allTips = as.character(dend %>% labels())) { 262 | library(dendextend) 263 | numBranch <- dend %>% nnodes() 264 | if (numBranch > 1) { 265 | lb <- attr(dend, "label") 266 | lab1 <- paste("BranchInTree___", lb, sep = "") 267 | branches[[lab1]] <- list() 268 | cn <- as.character(dend %>% labels()) 269 | for (i in 1:length(dend)) { 270 | nm <- as.character(dend[[i]] %>% labels()) 271 | branches[[lab1]][[attr(dend[[i]], "label")]] <- list(nm, setdiff(cn, nm)) 272 | if ((length(nm) > 1) & (length(nm) < (length(allTips) - 1))) { 273 | lab2 <- paste("BranchVsAll___", attr(dend[[i]], "label"), sep = "") 274 | branches[[lab2]] <- list() 275 | branches[[lab2]][["branch"]] <- list(nm, setdiff(allTips, nm)) 276 | } 277 | branches <- getBranchList(dend[[i]], branches, allTips) 278 | } 279 | } else { 280 | leaf <- dend %>% labels() 281 | lab <- paste("LeafOnly___", leaf, sep = "") 282 | branches[[lab]][["leaf"]] <- list(leaf, setdiff(allTips, leaf)) 283 | } 284 | branches <- branches[order(substr(names(branches), 1, 8))] 285 | branches 286 | } 287 | 288 | #' Return mean node expression 289 | #' 290 | #' Define expression at a node as the MEAN expression for each leaf as default (using the 291 | #' median removes all specific marker genes!) 292 | #' 293 | #' @param dend dendrogram for mapping. Ignored if medianDat is passed 294 | #' @param medianDat median expression data at each node 295 | #' @param branches a particular format of branch information from the dendrogram structure 296 | #' @param fnIn function to use to wrap up to the node level (default = mean) 297 | #' 298 | #' @return a matrix of mean node expression (rows=genes, columns=nodes) 299 | #' 300 | #' @export 301 | leafToNodeMedians <- function(dend, 302 | medianDat, 303 | branches = getBranchList(dend), 304 | fnIn = mean) { 305 | library(dendextend) 306 | if (is.null(rownames(medianDat))) { 307 | rownames(medianDat) <- 1:dim(medianDat)[1] 308 | } 309 | allGenes <- rownames(medianDat) 310 | brNames <- names(branches) 311 | brNames <- brNames[grep("BranchInTree", brNames)] 312 | medianNode <- matrix(0, nrow = length(allGenes), ncol = length(brNames)) 313 | rownames(medianNode) <- allGenes 314 | colnames(medianNode) <- brNames 315 | for (n in brNames) medianNode[, n] <- 316 | apply(medianDat[, c(branches[[n]][[1]][[1]], branches[[n]][[1]][[2]])], 1, fnIn) 317 | medianNode <- cbind(medianDat, medianNode) 318 | nameOrd <- 319 | dend %>% 320 | get_nodes_attr("label", id = 1:(dend %>% nnodes())) 321 | nameOrd[substr(nameOrd, 1, 1) == "n"] <- 322 | paste0("BranchInTree___", nameOrd[substr(nameOrd, 1, 1) == "n"]) 323 | medianNode <- medianNode[, nameOrd] 324 | colnames(medianNode) <- gsub("BranchInTree___", "", colnames(medianNode)) 325 | medianNode 326 | } 327 | 328 | 329 | #' Greedy algorithm for building marker gene panel 330 | #' 331 | #' This is the primary function that iteratively builds a marker gene panel, one gene at a 332 | #' time by iteratively adding the most informative gene to the existing gene panel. 333 | #' 334 | #' @param mapDat normalized data of the mapping (=reference) data set. 335 | #' @param medianDat representative value for each leaf. If not entered, it is calculated 336 | #' @param clustersF cluster calls for each cell. 337 | #' @param panelSize number of genes to include in the marker gene panel 338 | #' @param subSamp number of random nuclei to select from each cluster (to increase speed); 339 | #' set as NA to not subsample 340 | #' @param maxFcGene maximum number of genes to consider at each iteration (to increase speed) 341 | #' @param qMin minimum quantile for fold change comparison (between 0 and 1, higher = more 342 | #' specific marker genes are included) 343 | #' @param seed for reproducibility 344 | #' @param currentPanel starting panel. Default is NULL. 345 | #' @param panelMin if there are fewer genes than this, the top number of these genes by fc 346 | #' rank are set as the starting panel. Cannot be less than 2. 347 | #' @param writeText should gene names and marker scores be output (default TRUE) 348 | #' @param corMapping if TRUE (default) map by correlation; otherwise, map by Euclidean 349 | #' distance (not recommended) 350 | #' @param optimize if 'FractionCorrect' (default) will seek to maximize the fraction of 351 | #' cells correctly mapping to final clusters 352 | #' if 'CorrelationDistance' will seek to minimize the total distance between actual 353 | #' cluster calls and mapped clusters 354 | #' if 'DendrogramHeight' will seek to minimize the total dendrogram height between 355 | #' actual cluster calls and mapped clusters 356 | #' @param clusterDistance only used if optimize='CorrelationDistance'; a matrix (or 357 | #' vector) of cluster distances. Will be calculated if NULL and if clusterGenes 358 | #' provided. (NOTE: order must be the same as medianDat and/or have column and row 359 | #' names corresponding to clusters in clustersF) 360 | #' @param clusterGenes a vector of genes used to calculate the cluster distance. 361 | #' Only used if optimize='CorrelationDistance' and clusterDistance=NULL. 362 | #' @param dend only used if optimize='DendrogramHeight' dendrogram; will error out of not provided 363 | #' @param percentSubset for each iteration the function can subset the set of possible 364 | #' genes to speed up the calculation. 365 | #' 366 | #' @return an ordered character vector corresponding to the marker gene panel 367 | #' 368 | #' @export 369 | buildMappingBasedMarkerPanel <- function(mapDat, 370 | medianDat = NA, 371 | clustersF = NA, 372 | panelSize = 50, 373 | subSamp = 20, 374 | maxFcGene = 1000, 375 | qMin = 0.75, 376 | seed = 10, 377 | currentPanel = NULL, 378 | panelMin = 5, 379 | writeText = TRUE, 380 | corMapping = TRUE, 381 | optimize = "FractionCorrect", 382 | clusterDistance = NULL, 383 | clusterGenes = NULL, 384 | dend = NULL, 385 | percentSubset = 100) { 386 | 387 | # Return an error if optimize='DendrogramHeight' 388 | # and a dendrogram is not provided 389 | if ((optimize == "DendrogramHeight") & is.null(dend)) { 390 | return("Error: dendrogram not provided") 391 | } 392 | 393 | # CALCULATE THE MEDIAN 394 | if (is.na(medianDat[1])) { 395 | names(clustersF) <- colnames(mapDat) 396 | medianDat <- do.call("cbind", tapply( 397 | names(clustersF), clustersF, function(x) rowMedians(mapDat[, x]) 398 | )) 399 | rownames(medianDat) <- rownames(mapDat) 400 | } 401 | if (is.null(rownames(medianDat))) { 402 | rownames(medianDat) <- rownames(mapDat) 403 | } 404 | 405 | # Convert the dendrogram height into a correlation 406 | # distance if dendrogram height is entered as the 407 | # option 408 | if (optimize == "FractionCorrect") { 409 | clusterDistance <- NULL 410 | } 411 | if (optimize == "CorrelationDistance") { 412 | if (is.null(clusterDistance)) { 413 | corDist <- function(x) return(as.dist(1 - WGCNA::cor(x))) 414 | if(is.null(clusterGenes)) clusterGenes = rownames(medianDat) 415 | clusterGenes <- intersect(clusterGenes, rownames(medianDat)) 416 | clusterDistance <- as.matrix(corDist(medianDat[clusterGenes, ])) 417 | } 418 | if (is.matrix(clusterDistance)) { 419 | if (!is.null(rownames(clusterDistance))) { 420 | clusterDistance <- clusterDistance[colnames(medianDat), colnames(medianDat)] 421 | } 422 | clusterDistance <- as.vector(clusterDistance) 423 | } 424 | } 425 | 426 | if (optimize == "DendrogramHeight") { 427 | lcaTable <- makeLCAtable(dend) 428 | clusterDistance <- 1 - getNodeHeight(dend)[lcaTable] 429 | optimize <- "clusterDistance" 430 | } 431 | 432 | # TAKE THE TOP DEX GENES Use fold change (rather 433 | # than beta) because this function only receives 434 | # median as input 435 | fcDiff <- rank(apply(medianDat, 1, function(x) return(diff(quantile(x, c(1, qMin)))))) 436 | if (dim(medianDat)[1] > maxFcGene) { 437 | kpGene <- names(fcDiff)[fcDiff <= maxFcGene] 438 | mapDat <- mapDat[kpGene, ] 439 | medianDat <- medianDat[kpGene, ] 440 | } 441 | 442 | panelMin <- max(2, panelMin) 443 | if (length(currentPanel) < panelMin) { 444 | panelMin <- max(2, panelMin - length(currentPanel)) 445 | currentPanel <- unique(c(currentPanel, names(sort(fcDiff))[1:panelMin])) 446 | if (writeText) { 447 | print(paste("Setting starting panel as:", paste(currentPanel, sep = ", ", collapse = ", "))) 448 | } 449 | } 450 | 451 | # FIND THE NEXT GENE IN THE PANEL, IF THE DESIRED 452 | # PANEL SIZE IS NOT REACHED 453 | if (length(currentPanel) < panelSize) { 454 | 455 | # SUBSAMPLE 456 | if (!is.na(subSamp)) { 457 | kpSamp <- subsampleCells(clustersF, subSamp, seed) 458 | mapDat <- mapDat[, kpSamp] 459 | clustersF <- clustersF[kpSamp] 460 | subSamp <- NA 461 | } 462 | 463 | # CORRELATION MAPPING FOR EACH POSSIBLE ADDITION OF 464 | # ONE GENE 465 | otherGenes <- setdiff(rownames(mapDat), currentPanel) 466 | if (percentSubset < 100) { 467 | # Only look at a subset of genes if desired 468 | set.seed(seed + length(currentPanel)) 469 | otherGenes <- otherGenes[sort(sample( 470 | 1:length(otherGenes), 471 | ceiling(length(otherGenes) * percentSubset / 100) 472 | ))] 473 | } 474 | matchCount <- rep(0, length(otherGenes)) 475 | clustIndex <- match(clustersF, colnames(medianDat)) 476 | for (i in 1:length(otherGenes)) { 477 | ggnn <- c(currentPanel, otherGenes[i]) 478 | if (corMapping) { 479 | corMapTmp <- corTreeMapping(mapDat = mapDat, medianDat = medianDat, genesToMap = ggnn) 480 | } 481 | if (!corMapping) { 482 | corMapTmp <- distTreeMapping(mapDat = mapDat, medianDat = medianDat, genesToMap = ggnn) 483 | } 484 | corMapTmp[is.na(corMapTmp)] <- -1 485 | topLeafTmp <- getTopMatch(corMapTmp) 486 | if (is.null(clusterDistance)) { 487 | matchCount[i] <- mean(clustersF == topLeafTmp[, 1]) 488 | } else { 489 | tmpVal <- dim(medianDat)[2] * (match(topLeafTmp[, 1], colnames(medianDat)) - 1) + clustIndex # NEED TO CHECK THIS!!!!!!! 490 | matchCount[i] <- -mean(clusterDistance[tmpVal]) 491 | } 492 | } 493 | wm <- which.max(matchCount) 494 | addGene <- as.character(otherGenes)[wm] 495 | if (writeText) { 496 | if (optimize == "FractionCorrect") { 497 | print(paste( 498 | "Added", addGene, "with", signif(matchCount[wm], 3), 499 | "now matching [", length(currentPanel), "]." 500 | )) 501 | } else { 502 | print(paste( 503 | "Added", addGene, "with average cluster distance", 504 | -signif(matchCount[wm], 3), "[", length(currentPanel), "]." 505 | )) 506 | } 507 | } 508 | currentPanel <- c(currentPanel, addGene) 509 | currentPanel <- buildMappingBasedMarkerPanel( 510 | mapDat = mapDat, medianDat = medianDat, clustersF = clustersF, 511 | panelSize = panelSize, subSamp = subSamp, maxFcGene = maxFcGene, 512 | qMin = qMin, seed = seed, currentPanel = currentPanel, panelMin = panelMin, 513 | writeText = writeText, corMapping = corMapping, optimize = optimize, 514 | clusterDistance = clusterDistance, clusterGenes = clusterGenes, dend = dend, 515 | percentSubset = percentSubset 516 | ) 517 | } 518 | currentPanel 519 | } 520 | 521 | 522 | #' Correlation between nodes and leafs (deprecated) 523 | #' 524 | #' Returns the correlation between expression of each cell and representative 525 | #' value for each node and leaf. NOTE: this function is unstable and will 526 | #' eventually be merged with corTreeMapping. 527 | #' 528 | #' @param dend dendrogram for mapping. Ignored if medianDat is passed 529 | #' @param refDat normalized data of the REFERENCE data set. Ignored if medianExpr 530 | #' and propExpr are passed 531 | #' @param mapDat normalized data of the MAPPING data set. Default is to map the 532 | #' data onto itself. 533 | #' @param medianExpr representative value for each leaf. If not entered, it is 534 | #' calculated 535 | #' @param propExpr proportion of cells in each type expressing a given gene. If not 536 | #' entered, it is calculated 537 | #' @param filterMatrix a matrix of TRUE/FALSE values to indicate whether a given 538 | #' cluster is possible 539 | #' @param clusters cluster calls for each cell. Ignored if medianExpr and propExpr 540 | #' are passed 541 | #' @param numberOfGenes how many variables genes 542 | #' @param outerLimitGenes choose different numberOfGenes per cell from the top overall 543 | #' outerLimitGenes (to speed up function) 544 | #' @param genesToMap which genes to include in the correlation mapping 545 | #' @param use,... additional parameters for cor 546 | #' 547 | #' @return a matrix of correlation values with rows as mapped cells and columns as clusters 548 | #' 549 | #' @export 550 | corTreeMapping_withFilter <- function(dend = NA, 551 | refDat = NA, 552 | mapDat = refDat, 553 | medianExpr = NA, 554 | propExpr = NA, 555 | filterMatrix = NA, 556 | clusters = NA, 557 | numberOfGenes = 1200, 558 | outerLimitGenes = 7200, 559 | rankGeneFunction = function(x) getBetaScore(x, returnScore = FALSE), 560 | use = "p", 561 | ...) { 562 | 563 | # -- prepare the data 564 | if (is.na(medianExpr[1])) { 565 | names(clusters) <- colnames(refDat) 566 | medianExpr <- do.call("cbind", tapply( 567 | names(clusters), clusters, function(x) rowMedians(refDat[, x]) 568 | )) 569 | rownames(medianExpr) <- rownames(refDat) 570 | } 571 | if (is.na(propExpr[1])) { 572 | names(clusters) <- colnames(refDat) 573 | medianExpr <- do.call("cbind", tapply( 574 | names(clusters), clusters, function(x) rowMeans(refDat[, x] > 1) 575 | )) 576 | rownames(propExpr) <- rownames(refDat) 577 | } 578 | filterMatrix <- filterMatrix[, colnames(medianExpr)] 579 | 580 | # -- take the top outerLimitGenes for the 581 | # proportion and median 582 | rankGn <- rankGeneFunction(propExpr) 583 | kpGn <- rankGn <= outerLimitGenes 584 | medianDat <- medianExpr[kpGn, ] 585 | propDat <- propExpr[kpGn, ] 586 | 587 | # -- find all possible filters 588 | filterVec <- apply(filterMatrix, 1, function(x) paste(x, collapse = "|", sep = "|")) 589 | vecs <- unique(filterVec) 590 | 591 | # -- find all gene lists based on these filters 592 | geneLists <- list() 593 | for (v in vecs) { 594 | kp <- filterMatrix[which(filterVec == v)[1], ] 595 | if (sum(kp) > 1) { 596 | geneLists[[v]] <- rownames(propDat)[rankGeneFunction(propDat[, kp]) <= numberOfGenes] 597 | } else { 598 | geneLists[[v]] <- rownames(propDat)[1:2] 599 | } 600 | } 601 | 602 | ## -- find the correlations 603 | kpVar <- intersect( 604 | names(rankGn)[rankGn <= numberOfGenes], 605 | intersect(rownames(mapDat), rownames(medianDat)) 606 | ) 607 | corrVar <- WGCNA::cor(mapDat[kpVar, ], medianDat[kpVar, ], use = use, ...) 608 | for (v in vecs) { 609 | kpRow <- filterVec == v 610 | kpCol <- filterMatrix[which(filterVec == v)[1], ] 611 | if (sum(kpCol) > 1) { 612 | kpVar <- intersect(geneLists[[v]], intersect(rownames(mapDat), rownames(medianDat))) 613 | corrVar[kpRow, kpCol] <- WGCNA::cor(mapDat[kpVar, kpRow], medianDat[kpVar, kpCol], use = use) # ,...) 614 | } 615 | } 616 | corrVar <- corrVar * filterMatrix[, colnames(corrVar)] 617 | corrVar 618 | } 619 | 620 | 621 | #' (Euclidean) distance mapping 622 | #' 623 | #' Returns the distance between expression of each cell and representative value for each node and 624 | #' leaf (default is based on euclidean distance). In our hands this is does not work very well. 625 | #' 626 | #' @param dend dendrogram for mapping. Ignored if medianDat is passed 627 | #' @param refDat normalized data of the REFERENCE data set. Ignored if medianDat is passed 628 | #' @param mapDat normalized data of the MAPPING data set. Default is to map the data onto itself. 629 | #' @param medianDat representative value for each leaf and node. If not entered, it is calculated 630 | #' @param clusters cluster calls for each cell. Ignored if medianDat is passed 631 | #' @param genesToMap which genes to include in the correlation mapping 632 | #' @param returnSimilarity FALSE to return distance, TRUE to return something like a similarity 633 | #' @param use,... additional parameters for dist (for back-compatiblity; doesn't work) 634 | #' 635 | #' @return matrix of Euclidean distances between cells (rows) and clusters (columns) 636 | #' 637 | #' @export 638 | distTreeMapping <- function(dend = NA, 639 | refDat = NA, 640 | mapDat = refDat, 641 | medianDat = NA, 642 | clusters = NA, 643 | genesToMap = rownames(mapDat), 644 | returnSimilarity = TRUE, 645 | use = "p", 646 | ...) { 647 | library(pdist) 648 | 649 | if (is.na(medianDat[1])) { 650 | names(clusters) <- colnames(refDat) 651 | medianDat <- do.call("cbind", tapply( 652 | names(clusters), clusters, function(x) rowMedians(refDat[, x]) 653 | )) 654 | rownames(medianDat) <- rownames(refDat) 655 | medianDat <- leafToNodeMedians(dend, medianDat) 656 | } 657 | kpVar <- intersect(genesToMap, intersect(rownames(mapDat), rownames(medianDat))) 658 | if (length(kpVar) == 1) { 659 | kpVar <- c(kpVar, kpVar) 660 | } 661 | eucDist <- as.matrix(pdist(t(mapDat[kpVar, ]), t(medianDat[kpVar, ]), ...)) 662 | rownames(eucDist) <- colnames(mapDat) 663 | colnames(eucDist) <- colnames(medianDat) 664 | if (!returnSimilarity) { 665 | return(eucDist) 666 | } 667 | eucDist <- sqrt(eucDist / max(eucDist)) 668 | 1 - eucDist 669 | } 670 | 671 | 672 | #' Tree-based mapping 673 | #' 674 | #' Returns the mapping membership of each cell to each node and leaf using a 675 | #' tree-based method. This is a wrapper function for map_dend. 676 | #' 677 | #' @param dend dendrogram for mapping 678 | #' @param refDat normalized data of the REFERENCE data set 679 | #' @param clustersF factor indicating which cluster each cell type is actually assigned to 680 | #' in the reference data set 681 | #' @param mapDat normalized data of the MAPPING data set. Default is to map the data onto itself. 682 | #' @param p proportion of marker genes to include in each iteration of the mapping algorithm. 683 | #' @param low.th the minimum difference in Pearson correlation required to decide on which branch 684 | #' to map to. otherwise, a random branch is chosen. 685 | #' @param seed added for reproducibility 686 | #' 687 | #' @return a matrix of confidence scores (from 0 to 100) with rows as cells and columns 688 | #' as tree node/leafs. Values indicate the fraction of permutations in which the cell 689 | #' mapped to that node/leaf using the subset of cells/genes in map_dend 690 | #' 691 | #' @export 692 | rfTreeMapping <- function(dend, 693 | refDat, 694 | clustersF, 695 | mapDat = refDat, 696 | p = 0.7, 697 | low.th = 0.15, 698 | seed = 1) { 699 | 700 | genes <- intersect(rownames(refDat),rownames(mapDat)) 701 | refDat <- as.matrix(refDat)[genes,] # need common genes and matrix format 702 | mapDat <- as.matrix(mapDat)[genes,] # need common genes and matrix format 703 | pseq.cells <- colnames(mapDat) 704 | # isMarker <- ifelse(is.na(adjustMarkers[1]), FALSE, TRUE) # Doesn't do anything 705 | 706 | pseq.mem <- sapply(1:100, function(i) { 707 | j <- i 708 | if (i %% 25 == 0) print(i) 709 | go <- TRUE 710 | while (go) { 711 | # Allow for failures 712 | j <- j + 1000 713 | set.seed(j + seed) # Added for reproducibility 714 | tmp <- try(map_dend(dend, clustersF, refDat, 715 | mapDat, pseq.cells, 716 | p = p, low.th = low.th 717 | )) 718 | if (length(tmp) > 1) go <- FALSE 719 | } 720 | tmp 721 | }, simplify = F) 722 | 723 | memb <- unlist(pseq.mem) 724 | memb <- data.frame(cell = names(memb), cl = memb) 725 | memb$cl <- factor(memb$cl, levels = get_nodes_attr(dend, "label")) 726 | memb <- table(memb$cell, memb$cl) 727 | memb <- memb / 100 728 | return(memb) 729 | } 730 | 731 | 732 | #' Tree-based mapping 733 | #' 734 | #' Returns the mapping membership of each cell to each node and leaf using a 735 | #' tree-based method. This is a wrapper function for map_dend. 736 | #' 737 | #' @param dend dendrogram for mapping 738 | #' @param cl factor indicating which cluster each cell type is actually assigned to 739 | #' in the reference data set 740 | #' @param dat normalized data of the REFERENCE data set 741 | #' @param map.dat normalized data of the MAPPING data set. Default is to map the 742 | #' data onto itself. 743 | #' @param p proportion of marker genes to include in each iteration of the mapping 744 | #' algorithm. 745 | #' @param low.th the minimum difference in Pearson correlation required to decide 746 | #' on which branch to map to. otherwise, a random branch is chosen. 747 | #' @param default.markers not used 748 | #' 749 | #' @return a matrix of confidence scores (from 0 to 100) with rows as cells and columns 750 | #' as tree node/leafs. Values indicate the fraction of permutations in which the cell 751 | #' mapped to that node/leaf using the subset of cells/genes in map_dend 752 | #' 753 | #' @export 754 | map_dend <- function(dend, 755 | cl, 756 | dat, 757 | map.dat, 758 | select.cells, 759 | p = 0.8, 760 | low.th = 0.2, 761 | default.markers = NULL) { 762 | final.cl <- c(setNames(rep(attr(dend, "label"), length(select.cells)), select.cells)) 763 | if (length(dend) <= 1) { 764 | return(final.cl) 765 | } 766 | markers <- attr(dend, "markers") 767 | markers <- markers[names(markers) %in% row.names(map.dat)] 768 | cl.g <- sapply(dend, labels, simplify = F) 769 | names(cl.g) <- 1:length(cl.g) 770 | select.cl <- cl[cl %in% unlist(cl.g)] 771 | ### Sampling the cells from the reference cluster 772 | cells <- unlist(tapply(names(select.cl), select.cl, function(x) sample(x, round(length(x) * p)))) 773 | genes <- names(markers) 774 | genes <- union(genes, default.markers) 775 | ### Compute reference cluster median based on 776 | ### subsampled cells 777 | cl.med <- do.call("cbind", tapply( 778 | cells, droplevels(cl[cells]), 779 | function(x) rowMedians(dat[genes, x, drop = F]) 780 | )) 781 | row.names(cl.med) <- genes 782 | ### determine which branch to take. 783 | mapped.cl <- resolve_cl(cl.g, cl.med, markers, dat, 784 | map.dat, select.cells, 785 | p = p, low.th = low.th 786 | ) 787 | if (length(mapped.cl) > 0) { 788 | for (i in unique(mapped.cl)) { 789 | select.cells <- names(mapped.cl)[mapped.cl == i] 790 | if (length(select.cells) > 0) { 791 | final.cl <- c(final.cl, map_dend(dend[[as.integer(i)]], 792 | cl, dat, map.dat, select.cells, 793 | p = p, low.th = low.th 794 | )) 795 | } 796 | } 797 | } 798 | return(cl = final.cl) 799 | } 800 | 801 | 802 | #' Tree-based mapping (internal) 803 | #' 804 | #' Returns the mapped cluster call of each cell to each leaf. This function is called by map_dend 805 | #' 806 | #' @param cl.g all clusters 807 | #' @param cl.med cluster medians 808 | #' @param markers gene markers 809 | #' @param dat normalized data of the REFERENCE data set 810 | #' @param map.dat normalized data of the MAPPING data set. Default is to map the data onto itself. 811 | #' @param select.cells which cells to use? 812 | #' @param p proportion of marker genes to include in each iteration of the mapping algorithm. 813 | #' @param low.th the minimum difference in Pearson correlation required to decide on which branch 814 | #' to map to. otherwise, a random branch is chosen. 815 | #' 816 | #' @return a vector of the mapped cluster 817 | #' 818 | #' @export 819 | resolve_cl <- function(cl.g, 820 | cl.med, 821 | markers, 822 | dat, 823 | map.dat, 824 | select.cells, 825 | p = 0.7, 826 | low.th = 0.2) { 827 | library(matrixStats) 828 | ## 829 | genes <- names(markers)[markers > 0] 830 | tmp.cl <- unlist(cl.g) 831 | 832 | ### For each branch point, find the highest 833 | ### expression cluster. 834 | tmp.med <- sapply(cl.g, function(g) rowMaxs(cl.med[genes, g, drop = F])) 835 | row.names(tmp.med) <- genes 836 | ### Make sure the genes are discriminative between 837 | ### all the branches. 838 | genes <- genes[rowMaxs(tmp.med) - rowMins(tmp.med) > 1] 839 | 840 | ### Sample the markers based on the weigts. TO DO: 841 | ### randomforest sometimes give importance value of 842 | ### 0. adjust for that. 843 | genes <- sample(genes, round(length(genes) * p), prob = markers[genes]) 844 | 845 | ### Compute the correlation with the median cluster 846 | ### profile. add drop=F 847 | cl.cor <- WGCNA::cor(map.dat[genes, select.cells, drop = F], cl.med[genes, tmp.cl, drop = F]) 848 | cl.cor[is.na(cl.cor)] <- 0 849 | ### Compute the best match in each branch. 850 | tmp.score <- do.call("cbind", sapply(cl.g, function(x) rowMaxs(cl.cor[, 851 | x, 852 | drop = F 853 | ]), simplify = F)) 854 | row.names(tmp.score) <- row.names(cl.cor) 855 | #### Determine the best match. 856 | best.score <- setNames(rowMaxs(tmp.score), row.names(tmp.score)) 857 | ### determine the difference from the best match. 858 | diff.score <- best.score - tmp.score 859 | 860 | #### Give up on cells can't be discriminated,choose 861 | #### one branch randomly. 862 | unresolved.cl <- row.names(tmp.score)[rowSums(diff.score < low.th) == ncol(diff.score)] 863 | mapped.cl <- setNames(sample(colnames(tmp.score), length(unresolved.cl), replace = T), unresolved.cl) 864 | 865 | ### Cells mapped to one or more branches. 866 | mapped.cells <- setdiff(row.names(cl.cor), unresolved.cl) 867 | ### For binary branch, done already 868 | if (length(cl.g) == 2) { 869 | mapped.cl <- c(mapped.cl, setNames(colnames(diff.score)[apply(diff.score[mapped.cells, 870 | , 871 | drop = F 872 | ], 1, which.min)], mapped.cells)) 873 | return(mapped.cl) 874 | } 875 | ## The remaining options for mapped cells 876 | tmp.cl <- sapply(mapped.cells, function(x) colnames(diff.score)[which(diff.score[x, ] < low.th)], simplify = F) 877 | ### cells with multiple options 878 | resolve.cells <- names(tmp.cl)[sapply(tmp.cl, length) > 1] 879 | ### cells with only one option. Not further job. 880 | mapped.cells <- setdiff(mapped.cells, resolve.cells) 881 | if (length(mapped.cells) > 0) { 882 | mapped.cl <- c(mapped.cl, setNames(unlist(tmp.cl[mapped.cells]), mapped.cells)) 883 | } 884 | ### Resolve further options. 885 | if (length(resolve.cells) > 0) { 886 | tmp.cat <- sapply(tmp.cl[resolve.cells], function(x) paste(x, collapse = " ")) 887 | for (cat in unique(tmp.cat)) { 888 | tmp.cl <- unlist(strsplit(cat, " ")) 889 | select.cells <- names(tmp.cat)[tmp.cat == cat] 890 | mapped.cl <- c(mapped.cl, resolve_cl(cl.g[tmp.cl], 891 | cl.med, markers, dat, map.dat, select.cells, 892 | p = p, low.th = low.th 893 | )) 894 | } 895 | } 896 | return(mapped.cl) 897 | } 898 | 899 | 900 | 901 | #' Get top leaf match 902 | #' 903 | #' Returns the top leaf match for each cell and the corresponding fraction mapping there. 904 | #' 905 | #' @param memb.cl membership scores for each leaf 906 | #' 907 | #' @return a matrix where first column is found cluster and second column is confidence score 908 | #' 909 | #' @export 910 | getTopMatch <- function(memb.cl) { 911 | tmp.cl <- apply(memb.cl, 1, function(x) { 912 | y <- which.max(x)[1] 913 | as.character(c(colnames(memb.cl)[y], x[y])) 914 | }) 915 | rfv <- as.data.frame(t(tmp.cl)) 916 | rfv[, 2] <- as.numeric(rfv[, 2]) 917 | colnames(rfv) <- c("TopLeaf", "Value") 918 | rfv[is.na(rfv[, 1]), 1] <- "none" 919 | rfv[is.na(rfv[, 2]), 2] <- 0 920 | rfv 921 | } 922 | 923 | 924 | #' Generate reference set of pseudo-cells 925 | #' 926 | #' Creates a new reference set as input for cellToClusterMapping_byRank, where each 'cell' is the 927 | #' combiniation of several cells and this is run several times using different subsets of cells. 928 | #' @param refDat normalized data of the REFERENCE data set 929 | #' @param clustersF factor indicating which cluster each cell type is actually assigned to in the reference data set 930 | #' @param genesToMap which genes to include in the correlation mapping 931 | #' @param cellsPerMerge Number of cells to include in each combo cell 932 | #' @param numberOfMerges Number of combo cells to include per cell type 933 | #' @param mergeFunction function for combining cells into combo cells (use rowMeans or rowMedians) 934 | #' @param seed for resproducibility 935 | #' 936 | #' @return list where first element is data matrix of multi-cells by genes and 937 | #' second element is a vector of corresponding clusters 938 | #' 939 | #' @export 940 | generateMultipleCellReferenceSet <- function(refDat, 941 | clustersF, 942 | genesToUse = rownames(refDat), 943 | cellsPerMerge = 5, 944 | numberOfMerges = 10, 945 | mergeFunction = rowMedians, 946 | seed = 1) { 947 | if (!is.factor(clustersF)) { 948 | clustersF <- factor(clustersF) 949 | } 950 | names(clustersF) <- colnames(refDat) 951 | clusts <- levels(clustersF) 952 | refUse <- refDat[intersect(rownames(refDat), genesToUse), ] 953 | refOut <- matrix(nrow = dim(refUse)[1], ncol = numberOfMerges * length(clusts)) 954 | rownames(refOut) <- rownames(refUse) 955 | index <- 0 956 | for (k in 1:numberOfMerges) { 957 | i <- NULL 958 | for (cl in 1:length(clusts)) { 959 | val <- which(clustersF == clusts[cl]) 960 | set.seed(seed + index + cl + k) 961 | i <- c(i, sample(val, min(cellsPerMerge, length(val)))) 962 | } 963 | i <- is.element(1:length(clustersF), i) 964 | refOut[, (index + 1):(index + length(clusts))] <- do.call( 965 | "cbind", 966 | tapply(names(clustersF[i]), clustersF[i], function(x) rowMedians(refUse[, i][, x])) 967 | ) 968 | index <- index + length(clusts) 969 | } 970 | list(data = refOut, clusters = rep(clusts, numberOfMerges)) 971 | } 972 | 973 | 974 | #' Cell-based cluster mapping 975 | #' 976 | #' Maps cells to clusters by correlating every mapped cell with every reference cell, 977 | #' ranking the cells by correlation, and the reporting the cluster with the lowest average rank. 978 | #' 979 | #' @param mapDat normalized data of the MAPPING data set. 980 | #' @param refDat normalized data of the REFERENCE data set 981 | #' @param clustersF factor indicating which cluster each cell type is actually assigned 982 | #' to in the reference data set 983 | #' @param genesToMap character vector of which genes to include in the correlation mapping 984 | #' @param mergeFunction function for combining ranks; the tested choices are rowMeans or 985 | #' rowMedians (default) 986 | #' @param useRank use the rank of the correlation (default) or the correlation itself to 987 | #' determine the top cluster 988 | #' @param use additional parameter for cor (use='p' as default) 989 | #' @param method additional parameter for cor (method='p' as default) 990 | #' 991 | #' @return a two column data matrix where the first column is the mapped cluster and the second 992 | #' column is a confidence call indicating how close to the top of the ranked list cells of the 993 | #' assigned cluster were located relative to their best possible location in the ranked list. 994 | #' This confidence score seems to be a bit more reliable than correlation at determining how 995 | #' likely a cell in a training set is to being correctly assigned to the training cluster. 996 | #' 997 | #' @export 998 | cellToClusterMapping_byRank <- function(mapDat, 999 | refDat, 1000 | clustersF, 1001 | genesToMap = rownames(mapDat), 1002 | mergeFunction = rowMedians, 1003 | useRank = TRUE, 1004 | use = "p", 1005 | method = "p") { 1006 | if (is.null(names(clustersF))) names(clustersF) <- colnames(refDat) 1007 | kpVar <- intersect(genesToMap, intersect(rownames(mapDat), rownames(refDat))) 1008 | corrVar <- WGCNA::cor(mapDat[kpVar, ], refDat[kpVar, ], use = use, method = method) 1009 | corrVar[corrVar > 0.999999] <- NA # assume any perfect correlation is either an self-to-self mapping, or a mapping using exactly 1 non-zero gene 1010 | if (useRank) rankVar <- t(apply(-corrVar, 1, rank, na.last = "keep")) 1011 | if (!useRank) rankVar <- -corrVar 1012 | colnames(rankVar) <- names(clustersF) <- paste0("n", 1:length(clustersF)) 1013 | clMean <- do.call("cbind", tapply( 1014 | names(clustersF), 1015 | clustersF, function(x) match.fun(mergeFunction)(rankVar[, x], na.rm = TRUE) 1016 | )) 1017 | clMin <- apply(clMean, 1, min, na.rm = TRUE) 1018 | clMin[is.na(clMin)] <- 0 1019 | clMin[clMin == Inf] <- 1e+09 1020 | clBest <- colnames(clMean)[apply(clMean, 1, function(x) return(which.min(x)[1]))] 1021 | clBest[is.na(clBest)] <- colnames(clMean)[1] 1022 | if (useRank) clScore <- (table(clustersF)[clBest] / 2) / pmax(clMin, 1e-11) 1023 | if (!useRank) clScore <- -clMin 1024 | rfv <- data.frame(TopLeaf = clBest, Score = as.numeric(as.character(clScore))) 1025 | rownames(rfv) <- rownames(corrVar) 1026 | rfv 1027 | } 1028 | 1029 | 1030 | #' Correlation-based cluster mapping 1031 | #' 1032 | #' Primary function for doing correlation-based mapping to cluster medians. This is wrapper for cor 1033 | #' and returns a correlation matrix. 1034 | #' 1035 | #' @param mapDat normalized data of the MAPPING data set. Default is to map the data onto itself. 1036 | #' @param medianDat representative value for each leaf and node. If not entered, it is calculated 1037 | #' @param dend dendrogram for mapping. If provided, correlations to nodes are also returned 1038 | #' @param refDat normalized data of the REFERENCE data set. Ignored if medianDat is passed 1039 | #' @param clusters cluster calls for each cell. Ignored if medianDat is passed 1040 | #' @param genesToMap which genes to include in the correlation mapping 1041 | #' @param use additional parameter for cor (use='p' as default) 1042 | #' @param method additional parameter for cor (method='p' as default) 1043 | #' 1044 | #' @return matrix with the correlation between expression of each cell and representative value for 1045 | #' each leaf and node 1046 | #' 1047 | #' @export 1048 | corTreeMapping <- function(mapDat, 1049 | medianDat, 1050 | dend = NULL, 1051 | refDat = NA, 1052 | clusters = NA, 1053 | genesToMap = rownames(mapDat), 1054 | use = "p", 1055 | method = "p") { 1056 | if (is.na(medianDat)[1]) { 1057 | names(clusters) <- colnames(refDat) 1058 | medianDat <- do.call("cbind", tapply( 1059 | names(clusters), clusters, function(x) rowMedians(refDat[, x]) 1060 | )) 1061 | rownames(medianDat) <- rownames(refDat) 1062 | } 1063 | if (!is.null(dend)) { 1064 | medianDat <- leafToNodeMedians(dend, medianDat) 1065 | } 1066 | kpVar <- intersect(genesToMap, intersect(rownames(mapDat), rownames(medianDat))) 1067 | corrVar <- WGCNA::cor(mapDat[kpVar, ], medianDat[kpVar, ], use = use, method = method) 1068 | return(corrVar) 1069 | } 1070 | 1071 | 1072 | #' Gets subtree labels for lca function. 1073 | #' 1074 | #' @param dend a cluster dendrogram 1075 | #' 1076 | #' @return vector of subtree labels 1077 | #' 1078 | #' @export 1079 | get_subtree_label <- function(dend) { 1080 | library(dendextend) 1081 | l <- attr(dend, "label") 1082 | if (length(dend) > 1) { 1083 | for (i in 1:length(dend)) { 1084 | l <- c(l, get_subtree_label(dend[[i]])) 1085 | } 1086 | } 1087 | return(l) 1088 | } 1089 | 1090 | 1091 | #' Get lowest common ancestor (defined cluster pairs) 1092 | #' 1093 | #' Maps a cluster back up the tree to the first node where the mapped and correct clusters agree. 1094 | #' 1095 | #' @param dend a cluster dendrogram 1096 | #' @param l1 a vector of node labels 1097 | #' @param l2 a second fector of node labels (of the same length as l1) 1098 | #' @param l do not adjust; required for recursive function 1099 | #' 1100 | #' @return The function will return a vector for lowest common ancestor for every pair of nodes in l1 and l2 1101 | #' 1102 | #' @export 1103 | lca <- function(dend, 1104 | l1, 1105 | l2, 1106 | l = rep(attr(dend, "label"), length(l1))) { 1107 | library(dendextend) 1108 | node.height <- setNames(get_nodes_attr(dend, "height"), get_nodes_attr(dend, "label")) 1109 | if (length(dend) > 1) { 1110 | for (i in 1:length(dend)) { 1111 | tmp.l <- attr(dend[[i]], "label") 1112 | labels <- get_subtree_label(dend[[i]]) 1113 | select <- l1 %in% labels & l2 %in% labels 1114 | if (sum(select) > 0) { 1115 | select <- which(select)[node.height[l[select]] > node.height[tmp.l]] 1116 | l[select] <- tmp.l 1117 | l <- lca(dend[[i]], l1, l2, l) 1118 | } 1119 | } 1120 | } 1121 | l 1122 | } 1123 | 1124 | 1125 | #' Get lowest common ancestor (all cluster pairs in tree) 1126 | #' 1127 | #' Calculates the vector for lowest common ancestor for every pair of leaves in a tree and returns a 1128 | #' vector in a specific format for faster look-up. 1129 | #' 1130 | #' @param dend a cluster dendrogram 1131 | #' @param includeInternalNodes should internal nodes be included in the output? 1132 | #' @param verbose if TRUE, status will be printed to the screen, since function is relatively slow 1133 | #' for large trees (default FALSE) 1134 | #' 1135 | #' @return The function will return a vector for lowest common ancestor for every pair of leaves 1136 | #' in dend. Vector names are l1|||l2 for string parsing in other functions. 1137 | #' 1138 | #' @export 1139 | makeLCAtable <- function(dend, 1140 | includeInternalNodes = FALSE, 1141 | verbose = FALSE) { 1142 | library(dendextend) 1143 | nodes <- get_leaves_attr(dend, "label") 1144 | if (includeInternalNodes) nodes <- get_nodes_attr(dend, "label") 1145 | out <- nm <- rep(0, length(nodes)^2) 1146 | i <- 1 1147 | for (l1 in nodes) { 1148 | if (verbose) print(l1) 1149 | for (l2 in nodes) { 1150 | nm[i] <- paste(l1, l2, sep = "|||") 1151 | out[i] <- lca(dend, l1, l2) 1152 | i <- i + 1 1153 | } 1154 | } 1155 | names(out) <- nm 1156 | out 1157 | } 1158 | 1159 | 1160 | #' Plot dendrogram 1161 | #' 1162 | #' Plots a dendrogram with set not colors, shapes, sizes and labels. This is a wrapper for plot. 1163 | #' 1164 | #' @param tree a dendrogram object 1165 | #' @param value numeric vector corresponding to the size of each node 1166 | #' @param cexScale a global cex multiplier for node sizes 1167 | #' @param margins set the margins using par(mar=margins) 1168 | #' @param cols vector of node colors (or a single value) 1169 | #' @param pch vector of node pch shapes (or a single value) 1170 | #' @param ... additional parameters for the plot function 1171 | #' 1172 | #' @export 1173 | plotNodes <- function(tree, 1174 | value = rep(1, length(labels(tree))), 1175 | cexScale = 2, 1176 | margins = c(10, 5, 2, 2), 1177 | cols = "black", 1178 | pch = 19, 1179 | ...) { 1180 | library(dendextend) 1181 | tree <- set(tree, "nodes_pch", pch) 1182 | tree <- set(tree, "nodes_col", cols) 1183 | tree <- set(tree, "labels_cex", 1) 1184 | treeN <- set(tree, "nodes_cex", cexScale * value) 1185 | par(mar = margins) 1186 | plot(treeN, ylab = "height", ...) 1187 | } 1188 | 1189 | 1190 | 1191 | #' Build panel for one cluster (beta) 1192 | #' 1193 | #' This UNTESTED function finds the best small marker panel for marking a single cluster, using 1194 | #' proportion difference as the metric for determining the starting panel. 1195 | #' 1196 | #' @param mapDat normalized data of the mapping (=reference) data set. 1197 | #' @param clustersF cluster calls for each cell. 1198 | #' @param medianDat median value for each leaf 1199 | #' @param propIn proportions of cells with expression > 1 in each leaf 1200 | #' @param clust which cluster to target? 1201 | #' @param subSamp number of random nuclei to select from each cluster, EXCEPT the target cluster; 1202 | #' set as NA to not subsample 1203 | #' @param seed for reproducibility 1204 | #' @param maxSize maximum size of marker gene panel 1205 | #' @param dexCutoff criteria for stopping: when improvement in fraction of cells properly mapped 1206 | #' dips below this value 1207 | #' @param topGeneCount number of top genes by proportion to consider 1208 | #' 1209 | #' @return a matrix of the top marker genes for each cluster. Output matrix includes five columns: 1210 | #' clust = cluster; panel = ordered genes in the panel for that cluster; onCorrect = fraction of 1211 | #' correctly assigned cells in cluster; offCorrect = fraction of cells correctly assigned outside 1212 | #' of cluster; dexTotal = additional dex explained by last gene added. 1213 | #' 1214 | #' @export 1215 | buildPanel_oneCluster <- function(mapDat, 1216 | clustersF, 1217 | medianDat = NA, 1218 | propIn = NA, 1219 | clust = as.character(clustersF[1]), 1220 | subSamp = NA, 1221 | seed = 10, 1222 | maxSize = 20, 1223 | dexCutoff = 0.001, 1224 | topGeneCount = 100) { 1225 | 1226 | # SUBSAMPLE 1227 | if (!is.na(subSamp)[1]) { 1228 | kpSamp <- subsampleCells(clustersF, subSamp, seed) 1229 | kpSamp[as.character(clustersF) == clust] <- TRUE 1230 | mapDat <- mapDat[, kpSamp] 1231 | clustersF <- clustersF[kpSamp] 1232 | subSamp <- NA 1233 | } 1234 | 1235 | # REFORMAT THE CLUSTER VARIABLE 1236 | clust <- as.character(clust) 1237 | clustersIn <- clustersF 1238 | clustersF <- rep(clust, length(clustersIn)) 1239 | clustersF[as.character(clustersIn) != clust] <- "other" 1240 | clustersF <- factor(clustersF, levels = c(clust, "other")) 1241 | 1242 | # CALCULATE THE PROPORTION OF CELLS EXPRESSED IN 1243 | # EACH CLUSTERS, AND THE MEDIANS (SEND TO OTHER 1244 | # FUNCTIONS AS MEDIAN) 1245 | names(clustersF) <- colnames(mapDat) 1246 | if (is.na(propIn)[1]) { 1247 | propIn <- do.call("cbind", tapply( 1248 | names(clustersF), clustersF, function(x) rowMeans(mapDat[, x] >= 1) 1249 | )) 1250 | } 1251 | rownames(propIn) <- rownames(mapDat) 1252 | if (is.na(medianDat)[1]) { 1253 | medianDat <- do.call("cbind", tapply( 1254 | names(clustersF), clustersF, function(x) rowMedians(mapDat[, x]) 1255 | )) 1256 | } # switched clustersIn to clustersF 1257 | rownames(medianDat) <- rownames(mapDat) 1258 | 1259 | # FIND THE BEST GENE IN THE PANEL, UNTIL THE 1260 | # DESIRED PANEL SIZE IS REACHED 1261 | propDat <- cbind(propIn[, clust], rowMeans(propIn[, colnames(propIn) != clust])) 1262 | colnames(propDat) <- c(clust, "Other") 1263 | panel <- onCorrect <- offCorrect <- dexTotal <- NULL 1264 | first <- TRUE 1265 | tt <- dex <- 0 1266 | propDex <- propDat[, 1] - propDat[, 2] 1267 | topMark <- names(-sort(-propDex))[1:topGeneCount] # Some semi-reasonable way to cut down the gene count 1268 | while ((((length(panel) < maxSize) & (dex > dexCutoff))) | first) { 1269 | 1270 | # EUCLIDEAN MAPPING FOR EACH POSSIBLE ADDITION OF ONE GENE 1271 | first <- FALSE 1272 | otherGenes <- setdiff(topMark, panel) 1273 | matchCount <- offTarget <- totalCount <- rep(0, length(otherGenes)) 1274 | for (i in 1:length(otherGenes)) { 1275 | corMapTmp <- distTreeMapping( 1276 | mapDat = mapDat, medianDat = medianDat, genesToMap = c(panel, otherGenes[i]) 1277 | ) 1278 | corMapTmp[is.na(corMapTmp)] <- -1 1279 | higherOn <- corMapTmp[, clust] == apply(corMapTmp, 1, max) 1280 | matchCount[i] <- sum((clustersF == clust) & (higherOn)) / sum(clustersF == clust) 1281 | offTarget[i] <- sum((clustersF != clust) & (!higherOn)) / sum(clustersF != clust) 1282 | totalCount[i] <- mean(c(matchCount[i], offTarget[i])) 1283 | } 1284 | wm <- which.max(totalCount) 1285 | dex <- totalCount[wm] - tt 1286 | cr <- matchCount[wm] 1287 | tt <- totalCount[wm] 1288 | ot <- offTarget[wm] 1289 | addGene <- as.character(otherGenes)[wm] 1290 | panel <- c(panel, addGene) 1291 | onCorrect <- c(onCorrect, cr) 1292 | offCorrect <- c(offCorrect, ot) 1293 | dexTotal <- c(dexTotal, dex) 1294 | } 1295 | out <- data.frame(clust, panel, onCorrect, offCorrect, dexTotal) 1296 | out <- out[1:dim(out)[1] - 1, ] 1297 | out 1298 | } 1299 | 1300 | 1301 | #' Fraction of cells per layer 1302 | #' 1303 | #' Determines the expected proportions in each layer based on input 1304 | #' 1305 | #' @param layerIn a list corresponding to all layers of dissection for a given sample 1306 | #' @param layerNm names of all layers. set to NULL to have this calculated 1307 | #' @param scale if TRUE (default), scale to the total number of cells 1308 | #' 1309 | #' @return vector indicating the fraction of cells in each layerNm layer 1310 | #' 1311 | #' @export 1312 | layerScale <- function(layerIn, 1313 | layerNm = c("L1", "L2/3", "L4", "L5", "L6"), 1314 | scale = TRUE) { 1315 | if (is.null(layerNm)) { 1316 | for (l in 1:length(layerIn)) layerNm <- c(layerNm, layerIn[[l]]) 1317 | layerNm <- sort(unique(layerNm)) 1318 | } 1319 | total <- rep(0, length(layerNm)) 1320 | names(total) <- layerNm 1321 | for (l in 1:length(layerIn)) total[layerIn[[l]]] <- total[layerIn[[l]]] + 1322 | 1 / length(total[layerIn[[l]]]) 1323 | if (scale) total <- total / sum(total) 1324 | total 1325 | } 1326 | 1327 | 1328 | #' Layer weights per cell 1329 | #' 1330 | #' Returns a numeric vector saying how to weight a particular cell for each layer, using a smart 1331 | #' weighting strategy 1332 | #' @param layerIn a list corresponding to all layers of dissection for a given sample 1333 | #' @param useLayer target layer 1334 | #' @param spillFactor fractional amount of cells in a layer below which it is assumed no cells are 1335 | #' from that layer in multilayer dissection 1336 | #' @param weightCutoff anything less than this is set to 0 for convenience and to avoid rare types 1337 | #' @param layerNm names of all layers. set to NULL to have this calculated 1338 | #' 1339 | #' @return numeric vector saying how to weight a particular cell for each layer, using a smart 1340 | #' weighting strategy 1341 | #' 1342 | #' @export 1343 | smartLayerAllocation <- function(layerIn, 1344 | useLayer = "L1", 1345 | spillFactor = 0.15, 1346 | weightCutoff = 0.02, 1347 | layerNm = c("L1", "L2/3", "L4", "L5", "L6")) { 1348 | if (is.null(layerNm)) { 1349 | for (i in 1:length(layer)) layerNm <- c(layerNm, layerNm[[i]]) 1350 | layerNm <- sort(unique(layerNm)) 1351 | } 1352 | layerMat <- matrix(0, nrow = length(layerIn), ncol = length(layerNm)) 1353 | colnames(layerMat) <- layerNm 1354 | for (i in 1:length(layerIn)) layerMat[i, layerIn[[i]]] <- 1 1355 | oneCount <- rowSums(layerMat) == 1 1356 | wgtFrac <- colSums(rbind(layerMat[oneCount, ], layerMat[oneCount, ])) / 2 1357 | wgtFrac <- wgtFrac / max(wgtFrac) 1358 | wgtFrac[is.na(wgtFrac)] <- 0.01 1359 | wgtFrac[wgtFrac < spillFactor] <- 1e-06 * wgtFrac[wgtFrac < spillFactor] 1360 | wgtFrac <- wgtFrac / sum(wgtFrac) 1361 | for (i in which(!oneCount)) 1362 | layerMat[i, ] <- (layerMat[i, ] * wgtFrac) / sum(layerMat[i, ] * wgtFrac) 1363 | layerMat[layerMat < weightCutoff] <- 0 1364 | for (i in 1:length(oneCount)) 1365 | layerMat[i, ] <- layerMat[i, ] / sum(layerMat[i, ]) 1366 | out <- layerMat[, useLayer] 1367 | out 1368 | } 1369 | 1370 | 1371 | #' Layer weights per cell 1372 | #' 1373 | #' Returns a numeric vector saying how to weight a particular cell for each layer. This is a 1374 | #' wrapper for smartLayerAllocation 1375 | #' 1376 | #' @param layerIn a list corresponding to all layers of dissection for a given sample 1377 | #' @param useLayer target layer 1378 | #' @param cluster if passed the weights are smartly allocated based on laminar distributions by cluster 1379 | #' @param ... additional variables for smartLayerAllocation 1380 | #' 1381 | #' @return numeric vector with weights for cells in input layer 1382 | #' 1383 | #' @export 1384 | layerFraction <- function(layerIn, 1385 | useLayer = "L1", 1386 | cluster = NA, 1387 | ...) { 1388 | weight <- rep(0, length(layerIn)) 1389 | if (is.na(cluster)[1]) { 1390 | for (l in 1:length(weight)) weight[l] <- sum(layerIn[[l]] == useLayer) / length(layerIn[[l]]) 1391 | return(weight) 1392 | } 1393 | for (cli in unique(cluster)) 1394 | weight[cli == cluster] <- smartLayerAllocation(layerIn[cli == cluster], useLayer, ...) 1395 | weight[is.na(weight)] <- 0 1396 | weight 1397 | } 1398 | 1399 | 1400 | #' Filter possible cluster calls using priors 1401 | #' 1402 | #' This function will return a vector of possible clusters for cells that meet a set of priors for each layer 1403 | #' 1404 | #' @param cluster vector of all clusters 1405 | #' @param layer list of layers for each cluster entry (for data sets with only laminar dissections, 1406 | #' each list entry will be of length 1) 1407 | #' @param subsetVector a vector of TRUE/FALSE values indicated whether the entry is in the subset of 1408 | #' interest (e.g., Cre lines); default is all 1409 | #' @param useClusters a set of clusters to be considered a priori (e.g., GABA vs. glut); default is all 1410 | #' @param rareLimit define any values less than this as 0. The idea is to exclude rare cells 1411 | #' @param layerNm names of all layers. set to NULL to have this calculated 1412 | #' @param scaleByLayer if TRUE, scales to the proportion of cells in each layer 1413 | #' @param scaleByFn what function should be used for the layer scaling (default=max, ignored 1414 | #' if scaleByLayer=FALSE) 1415 | #' @param smartWeight if TRUE, multilayer dissections are weighted smartly by cluster, rather 1416 | #' than evenly by cluster (FALSE) 1417 | #' @param spillFactor fractional amount of cells in a layer below which it is assumed no cells 1418 | #' are from that layer in multilayer dissection 1419 | #' @param weightCutoff anything less than this is set to 0 for convenience 1420 | #' 1421 | #' @return a vector of possible clusters for cells that meet a set of priors for each layer 1422 | #' 1423 | #' @export 1424 | possibleClustersByPriors <- function(cluster, 1425 | layer, 1426 | subsetVector = rep(TRUE, length(cluster)), 1427 | useClusters = sort(unique(cluster)), 1428 | rareLimit = 0.005, 1429 | layerNm = c("L1", "L2/3", "L4", "L5", "L6"), 1430 | scaleByLayer = TRUE, 1431 | scaleByFn = max, 1432 | smartWeight = TRUE, 1433 | spillFactor = 0.15, 1434 | weightCutoff = 0.02) { 1435 | if (is.null(layerNm)) { 1436 | for (i in 1:length(layer)) layerNm <- c(layerNm, layerNm[[i]]) 1437 | layerNm <- sort(unique(layerNm)) 1438 | } 1439 | if (!is.factor(cluster)) cluster <- factor(cluster) 1440 | allClusters <- levels(cluster) 1441 | out <- matrix(NA, nrow = length(allClusters), ncol = length(layerNm)) 1442 | rownames(out) <- allClusters 1443 | colnames(out) <- layerNm 1444 | isClust <- is.element(cluster, useClusters) 1445 | sb <- subsetVector & isClust 1446 | if (sum(sb) == 0) { 1447 | return(out) 1448 | } # Don't run on cases with no data 1449 | subLayer <- layerScale(layer[sb], layerNm = layerNm) / 1450 | layerScale(layer[isClust], layerNm = layerNm) 1451 | subLayer <- subLayer / sum(subLayer) 1452 | kpLay <- names(subLayer)[subLayer >= rareLimit] 1453 | out[useClusters, kpLay] <- 0 1454 | 1455 | smartClust <- NA 1456 | if (smartWeight) smartClust <- cluster 1457 | 1458 | for (lay in kpLay) { 1459 | weight <- layerFraction(layer, lay, smartClust, 1460 | spillFactor = spillFactor, 1461 | weightCutoff = weightCutoff, layerNm = layerNm 1462 | ) 1463 | kpCl <- unique(cluster[sb & (weight > 0)]) 1464 | for (cli in kpCl) out[cli, lay] <- sum(weight[sb & (cluster == cli)]) 1465 | } 1466 | if (scaleByLayer) { 1467 | for (lay in kpLay) out[, lay] <- out[, lay] / scaleByFn(out[, lay], na.rm = TRUE) 1468 | out[out < rareLimit] <- 0 1469 | } 1470 | out 1471 | } 1472 | 1473 | 1474 | #' Get node height 1475 | #' 1476 | #' Returns the heights of each node, scaled from 0 (top) to 1 (leafs); this is a wrapper for dendextend functions 1477 | #' 1478 | #' @param tree a dendrogram object 1479 | #' 1480 | #' @return a vector of node heights 1481 | #' 1482 | #' @export 1483 | getNodeHeight <- function(tree) { 1484 | nodeHeight <- get_nodes_attr(tree, "height") 1485 | nodeHeight <- 1 - nodeHeight / max(nodeHeight) 1486 | names(nodeHeight) <- get_nodes_attr(tree, "label") 1487 | return(nodeHeight) 1488 | } 1489 | 1490 | 1491 | #' Fraction of correct calls per node 1492 | #' 1493 | #' This function returns the fraction correctly assigned to each node (as defined that the actual 1494 | #' and predicted cluster are both in the same node) 1495 | #' 1496 | #' @param dendIn dendrogram for mapping. Ignored if minimizeHeight=FALSE 1497 | #' @param clActual character vector of actual cluster assignments 1498 | #' @param clPredict character vector of predicted cluster assignments 1499 | #' @param minCount set to 0 results from clusters with fewer than this number of cells (default 1500 | #' is to consider all clusters) 1501 | #' @param defaultSum value to return in cases where there are fewer than minCount cells in the 1502 | #' actual cluster (e.g., cases that aren't considered at all) 1503 | #' @param out required for recursive function. Do not set! 1504 | #' 1505 | #' @return matrix of two columns: (1) node name and (2) the fraction of cells in that node that 1506 | #' are correctly assigned 1507 | #' 1508 | #' @export 1509 | fractionCorrectPerNode <- function(dendIn, 1510 | clActual, 1511 | clPredict, 1512 | minCount = 0.1, 1513 | defaultSum = -1, 1514 | out = NULL) { 1515 | clActual <- as.character(clActual) 1516 | clPredict <- as.character(clPredict) 1517 | if (length(dendIn) > 1) { 1518 | for (i in 1:length(dendIn)) { 1519 | allLabels <- labels(dendIn[[i]]) 1520 | nodeName <- attr(dendIn[[i]], "label") 1521 | isActual <- is.element(clActual, allLabels) 1522 | isPredict <- is.element(clPredict, allLabels) 1523 | fractionCorrrect <- signif(sum(isActual & isPredict) / sum(isPredict + 1e-11), 3) # CHECK THIS 1524 | if (sum(isActual) < minCount) fractionCorrrect <- defaultSum 1525 | out <- rbind(out, c(nodeName, fractionCorrrect)) 1526 | colnames(out) <- c("nodeName", "fractionCorrrect") 1527 | out <- fractionCorrectPerNode(dendIn[[i]], clActual, clPredict, minCount, defaultSum, out) 1528 | } 1529 | } 1530 | out <- as.data.frame(out) 1531 | out$fractionCorrrect <- as.numeric(as.character(out$fractionCorrrect)) 1532 | rownames(out) <- out[, 1] 1533 | out 1534 | } 1535 | 1536 | 1537 | #' Filter by meta-data 1538 | #' 1539 | #' Return a filter of TRUE/FALSE values for a given piece of meta-data (e.g., broad class). 1540 | #' 1541 | #' @param classVector vector corresponding to the class information for filtering (e.g., vector 1542 | #' of label calls) 1543 | #' @param sampleInfo matrix of sample information with rows corresponding to cells and columns 1544 | #' corresponding to meta-data 1545 | #' @param classColumn column name of class information 1546 | #' @param clusterColumn column name of cluster information 1547 | #' @param threshold minimum fraction of cluster cells from a given class to be considered present 1548 | #' 1549 | #' @return a matrix of filters with rows as clusters and columns as classes with entries of TRUE or 1550 | #' FALSE indicating whether cells from a given class can assigned to that cluster, given threshold. 1551 | #' 1552 | #' @export 1553 | filterByClass <- function(classVector, 1554 | sampleInfo, 1555 | classColumn = "cluster_type_label", 1556 | clusterColumn = "cluster_label", 1557 | threshold = 0.1) { 1558 | ## 1559 | out <- table(factor(sampleInfo[, clusterColumn]), factor(sampleInfo[, classColumn])) 1560 | out <- out / rowSums(out) 1561 | out <- out > threshold 1562 | 1563 | # Allow for names that not present in the 1564 | # sampleInfo file 1565 | out <- as.data.frame(out) 1566 | tmp <- setdiff(classVector, colnames(out)) 1567 | if (length(tmp) > 0) { 1568 | for (tm in tmp) out[, tm] <- TRUE 1569 | } 1570 | out <- as.matrix(out) 1571 | 1572 | # Remove '' names 1573 | classVector[classVector == ""] <- "none" 1574 | rownames(out)[rownames(out) == ""] <- "none" 1575 | colnames(out)[colnames(out) == ""] <- "none" 1576 | 1577 | # Find and output the filter matrix 1578 | outTable <- t(out[, classVector]) 1579 | if (length(classVector) == dim(sampleInfo)[1]) { 1580 | rownames(outTable) <- sampleInfo[, "sample_id"] 1581 | } 1582 | outTable 1583 | } 1584 | 1585 | #' Subsample cells 1586 | #' 1587 | #' Subsets a categorical vector to include up to a maximum number of values for each category. 1588 | #' 1589 | #' @param clusters vector of cluster labels (or any category) in factor or character format 1590 | #' @param subSamp maximum number of values for each category to subsample. Can be single integer 1591 | #' for global subsampling, or a *named* vector corresponding to how many values to take from each 1592 | #' category in clusters. 1593 | #' @param seed for reproducibility 1594 | #' 1595 | #' @return returns a vector of TRUE / FALSE with a maximum of subSamp TRUE calls per category 1596 | #' 1597 | #' @export 1598 | subsampleCells <- function(clusters, 1599 | subSamp = 25, 1600 | seed = 5) { 1601 | if(length(subSamp)==1) 1602 | subSamp = rep(subSamp,length(unique(as.character(clusters)))) 1603 | if(is.null(names(subSamp))) 1604 | names(subSamp) <- unique(as.character(clusters)) 1605 | kpSamp <- rep(FALSE, length(clusters)) 1606 | for (cli in unique(as.character(clusters))) { 1607 | val = subSamp[cli] 1608 | if(!is.na(val)[1]){ 1609 | set.seed(seed) 1610 | seed <- seed + 1 1611 | kp <- which(clusters == cli) 1612 | kpSamp[kp[sample(1:length(kp), min(length(kp), val))]] <- TRUE 1613 | } 1614 | } 1615 | kpSamp 1616 | } 1617 | 1618 | 1619 | #' Fraction of cells correctly assigned 1620 | #' 1621 | #' This function takes as input an ordered set of marker genes (e.g., from at iterative 1622 | #' algorithm), and returns a vector showing the fraction of cells correctly mapped. 1623 | #' 1624 | #' @param orderedGenes an ordered list of input genes (e.g. from an iterative algorithm) 1625 | #' @param mapDat normalized data of the mapping (=reference) data set. 1626 | #' @param medianDat median value for each leaf 1627 | #' @param clustersF cluster calls for each cell 1628 | #' @param verbose whether or not to show progress in the function 1629 | #' @param plot if TRUE, plotCorrectWithGenes is run 1630 | #' @param ... parameters passed to plotCorrectWithGenes (if plot=TRUE) 1631 | #' @param return if TRUE, the value is returned 1632 | #' 1633 | #' @return a vector showing the fraction of cells correctly mapped to each cluster 1634 | #' 1635 | #' @export 1636 | fractionCorrectWithGenes <- function(orderedGenes, 1637 | mapDat, 1638 | medianDat, 1639 | clustersF, 1640 | verbose = FALSE, 1641 | plot = TRUE, 1642 | return = TRUE, 1643 | ...) { 1644 | numGn <- 2:length(orderedGenes) 1645 | frac <- rep(0, length(orderedGenes)) 1646 | for (i in numGn) { 1647 | gns <- orderedGenes[1:i] 1648 | corMapTmp <- suppressWarnings(corTreeMapping( 1649 | mapDat = mapDat, 1650 | medianDat = medianDat, genesToMap = gns 1651 | )) 1652 | corMapTmp[is.na(corMapTmp)] <- 0 1653 | topLeafTmp <- getTopMatch(corMapTmp) 1654 | frac[i] <- 100 * mean(topLeafTmp[, 1] == clustersF) 1655 | } 1656 | frac[is.na(frac)] <- 0 1657 | if (plot) { 1658 | plotCorrectWithGenes(frac, genes = orderedGenes, ...) 1659 | } 1660 | if (return) { 1661 | return(frac) 1662 | } 1663 | } 1664 | 1665 | #' Correct mapping at different tree heights 1666 | #' 1667 | #' This function takes as input an ordered set of marker genes (e.g., from at iterative algorithm, 1668 | #' and returns an table showing the fraction of cells correctly mapped to a similar cell type 1669 | #' (as defined by the heights parameter). A height of 1 indicates correct mapping to the leaf. 1670 | #' 1671 | #' @param orderedGenes an ordered list of input genes (e.g. from an iterative algorithm) 1672 | #' @param dend dendrogram for mapping. 1673 | #' @param mapDat normalized data of the mapping (=reference) data set. 1674 | #' @param medianDat median value for each leaf 1675 | #' @param clustersF cluster calls for each cell 1676 | #' @param minVal minimum number of genes to consider from the list in the mapping 1677 | #' @param heights height in the tree to look at 1678 | #' @param verbose whether or not to show progress in the function 1679 | #' 1680 | #' @return a matrix of fractions of cells correctly mapped for different tree heights (columns) 1681 | #' and different gene panels (rows) 1682 | #' 1683 | #' @export 1684 | buildQualityTable <- function(orderedGenes, 1685 | dend, 1686 | mapDat, 1687 | medianDat, 1688 | clustersF, 1689 | minVal = 2, 1690 | heights = c((0:100) / 100), 1691 | verbose = FALSE) { 1692 | minVal <- max(2, round(minVal)) 1693 | nodeHeight <- get_nodes_attr(dend, "height") 1694 | nodeHeight <- 1 - nodeHeight / max(nodeHeight) 1695 | names(nodeHeight) <- get_nodes_attr(dend, "label") 1696 | 1697 | outTable <- matrix(0, nrow = length(orderedGenes), ncol = length(heights)) 1698 | rownames(outTable) <- orderedGenes 1699 | colnames(outTable) <- paste0("Height_", heights) 1700 | 1701 | for (r in minVal:length(orderedGenes)) { 1702 | if (verbose) print(r) 1703 | gns <- orderedGenes[1:r] 1704 | topLeafTmp <- suppressWarnings(getTopMatch(corTreeMapping( 1705 | mapDat = mapDat, medianDat = medianDat, genesToMap = gns 1706 | ))) 1707 | lcaVector <- nodeHeight[lca(dend, as.character(topLeafTmp[, 1]), clustersF)] 1708 | for (c in 1:length(heights)) { 1709 | outTable[r, c] <- mean(lcaVector >= (1 - heights[c])) 1710 | } 1711 | } 1712 | outTable 1713 | } 1714 | 1715 | #' Plot fraction correct 1716 | #' 1717 | #' This function is a wrapper for plot designd for plotting the fraction correctly mapped for a given gene set. 1718 | #' If geneN is the Nth gene, the plotted value indicates correct mapping using genes 1:N. 1719 | #' 1720 | #' @param frac a numeric vector indicating the fraction of cells correctly mapped for a given gene panel 1721 | #' @param genes ordered character vector (e.g., of genes) to be plotted; default is names(frac) 1722 | #' @param ... additional parameters for plot. 1723 | #' 1724 | #' @export 1725 | plotCorrectWithGenes <- function(frac, 1726 | genes = names(frac), 1727 | xlab = "Number of genes in panel", 1728 | main = "All clusters gene panel", 1729 | ylim = c(-10, 100), 1730 | lwd = 5, 1731 | ylab = "Percent of nuclei correctly mapping", 1732 | colLine = "grey", 1733 | ...) { 1734 | numGn <- 1:length(frac) 1735 | plot(numGn, frac, 1736 | type = "l", col = "grey", xlab = xlab, 1737 | ylab = ylab, main = main, ylim = ylim, lwd = lwd, ... 1738 | ) 1739 | abline(h = (-2:20) * 5, lty = "dotted", col = colLine) 1740 | abline(h = 0, col = "black", lwd = 2) 1741 | text(numGn, frac, genes, srt = 90) 1742 | } 1743 | 1744 | 1745 | #' Get binary (aka beta) score 1746 | #' 1747 | #' Returns a beta score which indicates the binaryness of a gene across clusters. High scores 1748 | #' (near 1) indicate that a gene is either on or off in nearly all cells of every cluster. 1749 | #' Scores near 0 indicate a cells is non-binary (e.g., not expressed, ubiquitous, or 1750 | #' randomly expressed). This value is used for gene filtering prior to defining clustering. 1751 | #' 1752 | #' @param propExpr a matrix of proportions of cells (rows) in a given cluster (columns) with 1753 | #' CPM/FPKM > 1 (or 0, HCT uses 1) 1754 | #' @param returnScore if TRUE returns the score, if FALSE returns the ranks 1755 | #' @param spec.exp scaling factor (recommended to leave as default) 1756 | #' 1757 | #' @return returns a numeric vector of beta score (or ranks) 1758 | #' 1759 | #' @export 1760 | getBetaScore <- function(propExpr, 1761 | returnScore = TRUE, 1762 | spec.exp = 2) { 1763 | calc_beta <- function(y, spec.exp = 2) { 1764 | d1 <- as.matrix(dist(y)) 1765 | eps1 <- 1e-10 1766 | # Marker score is combination of specificity and sparsity 1767 | score1 <- sum(d1^spec.exp) / (sum(d1) + eps1) 1768 | score1 1769 | } 1770 | 1771 | betaScore <- apply(propExpr, 1, calc_beta) 1772 | betaScore[is.na(betaScore)] <- 0 1773 | if (returnScore) return(betaScore) 1774 | scoreRank <- rank(-betaScore) 1775 | scoreRank 1776 | } 1777 | 1778 | 1779 | #' Calculate row medians 1780 | #' 1781 | #' This is a wrapper for matrixStats::rowMedians that doesn't crash if a vector is provided as input 1782 | #' 1783 | #' @param x a matrix or vector of values 1784 | #' 1785 | #' @return if the input is a matrix, return row medians. If a vector, it returns the inputted vector 1786 | #' 1787 | #' @export 1788 | rowMedians <- function(x,...) { 1789 | require(matrixStats) 1790 | if(is.matrix(x)) return(matrixStats::rowMedians(x,...)) 1791 | x 1792 | } 1793 | -------------------------------------------------------------------------------- /R/mfishMapping.r: -------------------------------------------------------------------------------- 1 | #' Build and plot dendrogram from gene panel 2 | #' 3 | #' Build and plot a dendrogram using correlation-based average linkage hierarchical 4 | #' clustering and only using a specified set of genes. The output is the expected 5 | #' accuracy of mapping to each node in the tree, which gives an idea of the best-case 6 | #' expected results for mFISH analysis. 7 | #' 8 | #' @param dend dendrogram for mapping. Ignored if medianDat is passed 9 | #' @param refDat normalized data of the REFERENCE data set. Ignored if medianDat is passed 10 | #' @param mapDat normalized data of the MAPPING data set. Default is to map the data onto itself. 11 | #' @param medianDat representative value for each leaf and node. If not entered, it is calculated 12 | #' @param requiredGenes minimum number of genes required to be expressed in a cluster (column 13 | #' of medianDat) for the cluster to be included (default=2) 14 | #' @param clusters cluster calls for each cell 15 | #' @param mappedAsReference if TRUE, returns the fraction of cells mapped to a node which are 16 | #' were orginally clustered from that node; if FALSE (default) returns the fraction of cells 17 | #' clustered under a node which are mapped to the correct node. 18 | #' @param genesToMap which genes to include in the correlation mapping 19 | #' @param plotdendro should the dendrogram be plotted (default = TRUE) 20 | #' @param returnDendro should the dendrogram be returned (default = TRUE) 21 | #' @param mar margins (for use with par) 22 | #' @param main,ylab add title and labels to plot (default is NULL) 23 | #' @param use,... additional parameters for cor 24 | #' 25 | #' @return a list where the first entry is the resulting tree and the second entry is the 26 | #' fraction of cells correctly mapping to each node using the inputted gene panel. 27 | #' 28 | #' @export 29 | buildTreeFromGenePanel <- function(dend = NA, 30 | refDat = NA, 31 | mapDat = refDat, 32 | medianDat = NA, 33 | requiredGenes = 2, 34 | clusters = NA, 35 | mappedAsReference = FALSE, 36 | genesToMap = rownames(mapDat), 37 | plotdendro = TRUE, 38 | returndendro = TRUE, 39 | mar = c(12, 5, 5, 5), 40 | main = NULL, 41 | ylab = NULL, 42 | use = "p", 43 | ...) { 44 | library(dendextend) 45 | 46 | # Calculate the median, if needed. 47 | if (is.na(medianDat[1])) { 48 | names(clusters) <- colnames(refDat) 49 | medianDat <- do.call("cbind", tapply( 50 | names(clusters), clusters, function(x) rowMedians(refDat[, x]) 51 | )) 52 | rownames(medianDat) <- rownames(refDat) 53 | if (!is.na(dend)) medianDat <- leafToNodeMedians(dend, medianDat) 54 | } 55 | gns <- intersect(genesToMap, intersect(rownames(mapDat), rownames(medianDat))) 56 | 57 | # Subset the data to relevant genes and clusters 58 | medianDat <- medianDat[gns, ] 59 | mapDat <- mapDat[gns, ] 60 | medianDat <- medianDat[, colSums(medianDat > 0) >= requiredGenes] 61 | kpDat <- (colSums(mapDat > 0) >= requiredGenes) & (is.element(clusters, colnames(medianDat))) 62 | mapDat <- mapDat[, kpDat] 63 | 64 | # Perform the correlation mapping 65 | facsCor <- corTreeMapping(medianDat = medianDat, mapDat = mapDat, use = use, ...) 66 | facsCl <- colnames(facsCor)[apply(facsCor, 1, which.max)] 67 | 68 | # Build a new tree based on mapping 69 | sCore <- function(x, use, ...) return(as.dist(1 - WGCNA::cor(x, use = use, ...))) 70 | dend <- getDend(medianDat, sCore, use = use, ...) 71 | 72 | # Which leaves have which nodes? 73 | has_any_labels <- function(sub_dend, the_labels) any(labels(sub_dend) %in% the_labels) 74 | node_labels <- NULL 75 | for (lab in labels(dend)) node_labels <- cbind( 76 | node_labels, noded_with_condition(dend, has_any_labels, the_labels = lab) 77 | ) 78 | rownames(node_labels) <- get_nodes_attr(dend, "label") 79 | colnames(node_labels) <- labels(dend) 80 | 81 | # Swap the mapped and reference nodes if 82 | # mappedAsReference=TRUE 83 | clTmp <- as.character(clusters[kpDat]) 84 | if (mappedAsReference) { 85 | temp <- clTmp 86 | clTmp <- facsCl 87 | facsCl <- temp 88 | } 89 | 90 | # Which clusters agree at the node level? 91 | agreeNodes <- apply(cbind(facsCl, clTmp), 1, function(lab, node_labels) { 92 | rowSums(node_labels[, lab]) == 2 93 | }, node_labels) 94 | colnames(agreeNodes) <- clTmp 95 | 96 | # Which clusters are in each nodes? 97 | isInNodes <- t(apply(node_labels, 1, function(node, cl, dend) { 98 | is.element(cl, labels(dend)[node]) 99 | }, clTmp, dend)) 100 | colnames(isInNodes) <- clTmp 101 | 102 | # For each node, plot the fraction of cells that 103 | # match if desired? 104 | fracAgree <- rowSums(agreeNodes) / rowSums(isInNodes) 105 | if (plotdendro) { 106 | par(mar = mar) 107 | dend %>% set("nodes_cex", 0) %>% set("branches_col", "grey") %>% plot() 108 | text(get_nodes_xy(dend)[, 1], get_nodes_xy(dend)[, 2], round(fracAgree * 100)) 109 | title(main = main, ylab = ylab) 110 | } 111 | 112 | # Return the results (if desired) 113 | if (returndendro) return(list(dend, fracAgree)) 114 | } 115 | 116 | 117 | #' Build a dendrogram from gene panel 118 | #' 119 | #' Build a dendrogram from an inputted data matrix. 120 | #' 121 | #' @param dat matrix of values (e.g., genes x clusters) for calculating the dendrogram 122 | #' @param distFun function for calculating distance matrix (default is correlation-based) 123 | #' @param ... additional variables for distFun 124 | #' 125 | #' @return dendrogram 126 | #' 127 | #' @export 128 | getDend <- function(dat, 129 | distFun = function(x) return(as.dist(1 - WGCNA::cor(x))), 130 | ...) { 131 | distCor <- distFun(dat, ...) 132 | distCor[is.na(distCor)] <- max(distCor, na.rm = TRUE) * 1.2 133 | # Avoid crashing by setting NA values to hang off the side of the tree. 134 | avgClust <- hclust(distCor, method = "average") 135 | dend <- as.dendrogram(avgClust) 136 | dend <- labelDend(dend)[[1]] 137 | return(dend) 138 | } 139 | 140 | 141 | #' Label dendrogram nodes 142 | #' 143 | #' Add numeric node labels to a dendrogram. 144 | #' 145 | #' @param dend dendrogram object 146 | #' @param distFun starting numeric node value (default=1) 147 | #' 148 | #' @return a list where the first item is the new dendrogram object and the 149 | #' second item is the final numeric node value. 150 | #' 151 | #' @export 152 | labelDend <- function(dend, n = 1) { 153 | if (is.null(attr(dend, "label"))) { 154 | attr(dend, "label") <- paste0("n", n) 155 | n <- n + 1 156 | } 157 | if (length(dend) > 1) { 158 | for (i in 1:length(dend)) { 159 | tmp <- labelDend(dend[[i]], n) 160 | dend[[i]] <- tmp[[1]] 161 | n <- tmp[[2]] 162 | } 163 | } 164 | return(list(dend, n)) 165 | } 166 | 167 | 168 | #' Summarize matrix 169 | #' 170 | #' Groups columns in a matrix by a specified group vector and summarizes using a specificed function. 171 | #' Optionally binarizes the matrix using a specified cutoff parameter. This is a wrapper for tapply. 172 | #' 173 | #' @param mat matrix where the columns (e.g., samples) are going to be grouped 174 | #' @param group vector of length dim(mat)[2] corresponding to the groups 175 | #' @param scale either 'none' (default),'row', or 'column' 176 | #' @param scaleQuantile what quantile of value should be set as 1 (default=1) 177 | #' @param binarize should the data be binarized? (default=FALSE) 178 | #' @param binMin minimum ON value for the binarized matrix (ignored if binarize=FALSE) 179 | #' @param summaryFunction function (or function name) to be used for summarization 180 | #' @param ... additional parameters for summaryFunction 181 | #' 182 | #' @return matrix of summarized values 183 | #' 184 | #' @export 185 | summarizeMatrix <- function(mat, 186 | group, 187 | scale = "none", 188 | scaleQuantile = 1, 189 | binarize = FALSE, 190 | binMin = 0.5, 191 | summaryFunction = median, 192 | ...) { 193 | 194 | # Make sure the names match up 195 | if (is.null(colnames(mat))) colnames(mat) <- names(group) 196 | if (is.null(colnames(mat))) colnames(mat) <- 1:length(group) 197 | names(group) <- colnames(mat) 198 | 199 | # Calculate the summary 200 | summaryFunction <- match.fun(summaryFunction) 201 | runFunction <- function(x, ...) { 202 | if (length(x) > 1) { 203 | return(apply(mat[, x], 1, summaryFunction, ...)) 204 | } 205 | return(mat[, x]) 206 | } 207 | summarizedMat <- do.call("cbind", tapply(names(group), group, runFunction, ...)) 208 | if (is.factor(group)) summarizedMat <- summarizedMat[, levels(group)] 209 | rownames(summarizedMat) <- rownames(mat) 210 | 211 | # Scale the data if desired 212 | if (substr(scale, 1, 1) == "r") { 213 | for (i in 1:dim(summarizedMat)[1]) 214 | summarizedMat[i, ] <- summarizedMat[i, ] / 215 | max(1e-06, quantile(summarizedMat[i, ], probs = scaleQuantile)) 216 | } 217 | if (substr(scale, 1, 1) == "c") { 218 | for (i in 1:dim(summarizedMat)[2]) 219 | summarizedMat[, i] <- summarizedMat[, i] / 220 | max(1e-06, quantile(summarizedMat[, i], probs = scaleQuantile)) 221 | } 222 | 223 | # Binarize the data if desired 224 | if (binarize) { 225 | summarizedMat <- summarizedMat > binMin 226 | summarizedMat <- summarizedMat + 1 - 1 227 | } 228 | summarizedMat 229 | } 230 | 231 | 232 | #' Scale mFISH data and map to RNA-seq reference 233 | #' 234 | #' This function is a wrapper for several other functions which aim to scale mFISH data to 235 | #' more closely match RNA-seq data and then map the mFISH data to the closest reference 236 | #' classes. There are several parameters allowing flexability in filtering and analysis. 237 | #' 238 | #' @param mapDat normalized data of the MAPPING data set. Default is to map the data onto itself. 239 | #' @param refSummaryDat normalized summary data of the REFERENCE data set (e.g., what to map against) 240 | #' @param genesToMap which genes to include in the mapping (calculated in not entered) 241 | #' @param mappingFunction which function to use for mapping (default is cellToClusterMapping_byCor) 242 | #' The function must include at least two parameters with the first one being mapped data and the 243 | #' second data the reference. Additional parameters are okay. Output must be a data frame where 244 | #' the first value is a mapped class. Additional columns are okay and will be returned) 245 | #' @param transform function for transformation of the data (default in none) 246 | #' @param noiselevel scalar value at or below which all values are set to 0 (default is 0) 247 | #' @param scaleFunction which function to use for scaling mapDat to refSummaryDat (default is setting 248 | #' 90th quantile of mapDat to max of refSummaryDat and truncating higher mapDat values) 249 | #' @param omitGenes genes to be included in the data frames but excluded from the mapping 250 | #' @param metadata a data frame of possible metadata (additional columns are okay and ignored): 251 | #' \describe{ 252 | #' \item{area}{a vector of cell areas for normalization} 253 | #' \item{experiment}{a vector indicating if multiple experiments should be scaled separately} 254 | #' \item{x,y}{x (e.g., parallel to layer) and y (e.g., across cortical layers) coordinates in tissue} 255 | #' } 256 | #' @param integerWeights if not NULL (default) a vector of integers corresponding to how many times 257 | #' each gene should be counted as part of the correlation. This is equivalent to calculating 258 | #' a weighted correlation, but only allows for integer weight values (for use with cor). 259 | #' @param binarize should the data be binarized? (default=FALSE) 260 | #' @param binMin minimum ON value for the binarized matrix (ignored if binarize=FALSE) 261 | #' @param ... additional parameters for passthrough into other functions 262 | #' 263 | #' @return a list with the following entrees: 264 | #' \describe{ 265 | #' \item{mapDat}{mapDat data matrix is passed through} 266 | #' \item{scaleDat}{scaled mapDat data matrix} 267 | #' \item{mappingResults}{Results of the mapping and associated confidence values (if any)} 268 | #' \item{metadata=metadata}{metadata is passed through unchanged} 269 | #' \item{scaledX/Y}{scaled x and y coordinates (or unscaled if scaling was not performed)} 270 | #' } 271 | #' 272 | #' @export 273 | fishScaleAndMap <- function(mapDat, 274 | refSummaryDat, 275 | genesToMap = NULL, 276 | mappingFunction = cellToClusterMapping_byCor, 277 | transform = function(x) x, 278 | noiselevel = 0, 279 | scaleFunction = quantileTruncate, 280 | omitGenes = NULL, 281 | metadata = data.frame(experiment = rep("all", dim(mapDat)[2])), 282 | integerWeights = NULL, 283 | binarize = FALSE, 284 | binMin = 0.5, 285 | ...) { 286 | 287 | # Setup 288 | mappingFunction <- match.fun(mappingFunction) 289 | scaleFunction <- match.fun(scaleFunction) 290 | transform <- match.fun(transform) 291 | if (is.null(genesToMap)) genesToMap <- colnames(mapDat) 292 | genesToMap <- intersect(genesToMap, rownames(refSummaryDat)) 293 | params <- colnames(metadata) 294 | refSummaryDat <- refSummaryDat[genesToMap, ] 295 | mapDat <- mapDat[genesToMap, ] 296 | 297 | # Transform the data to be mapped 298 | scaleDat <- as.matrix(mapDat[genesToMap, ]) 299 | scaleDat[scaleDat <= noiselevel] <- 0 # Set values less than or equal to noiselevel to 0 300 | if (is.element("area", params)) { 301 | # Account for spot area in gene expression calculation 302 | scaleDat <- t(t(scaleDat) / metadata$area) * mean(metadata$area) 303 | } 304 | scaleDat <- transform(scaleDat) 305 | 306 | # Scale to the reference data 307 | for (ex in unique(metadata$experiment)) { 308 | isExp <- metadata$experiment == ex 309 | for (g in genesToMap) scaleDat[g, isExp] <- scaleFunction(scaleDat[ 310 | g, isExp 311 | ], maxVal = max(refSummaryDat[g, ]), ...) 312 | } 313 | 314 | # Binarize, if desired 315 | if (binarize) { 316 | scaleDat <- scaleDat > binMin 317 | scaleDat <- scaleDat + 1 - 1 318 | } 319 | 320 | # Omit genes and weight scaling, if desired 321 | genesToMap2 <- genesToMap 322 | if (!is.null(integerWeights)) genesToMap2 <- rep(genesToMap2, integerWeights) 323 | genesToMap2 <- genesToMap2[!is.element(genesToMap2, omitGenes)] 324 | 325 | # Map the map data to the reference data 326 | scaleDat2 <- scaleDat[genesToMap2, ] 327 | refSummaryDat2 <- refSummaryDat[genesToMap2, ] 328 | mappingResults <- mappingFunction(refSummaryDat2, scaleDat2, ...) 329 | 330 | # Scale x and y coordinates to (0,1) within experiment, if desired 331 | for (ex in unique(metadata$experiment)) { 332 | isExp <- metadata$experiment == ex 333 | metadata$x[isExp] <- metadata$x[isExp] - min(metadata$x[isExp]) 334 | metadata$x[isExp] <- metadata$x[isExp] / max(metadata$x[isExp]) 335 | metadata$y[isExp] <- metadata$y[isExp] - min(metadata$y[isExp]) 336 | metadata$y[isExp] <- metadata$y[isExp] / max(metadata$y[isExp]) 337 | } 338 | 339 | # Return the results 340 | out <- list( 341 | mapDat = mapDat, scaleDat = scaleDat, 342 | mappingResults = mappingResults, metadata = metadata, 343 | scaledX = metadata$x, scaledY = metadata$y 344 | ) 345 | } 346 | 347 | 348 | #' Filter (subset) fishScaleAndMap object 349 | #' 350 | #' Subsets all components in a fishScaleAndMap object 351 | #' 352 | #' @param datFish a fishScaleAndMap output list 353 | #' @param subset a boolean or numeric vector of the elements to retain 354 | #' 355 | #' @return a fishScaleAndMap output subsetted to the requested elements 356 | #' 357 | #' @export 358 | filterCells <- function(datFish, 359 | subset) { 360 | ## Error checking 361 | if ((length(subset) != length(datFish$scaledX)) & (!is.numeric(subset))) { 362 | print("subset is incorrect format. Returning original entry.") 363 | return(datFish) 364 | } 365 | if (is.numeric(subset)) subset <- intersect(subset, 1:length(datFish$scaledX)) 366 | 367 | ## Subset all of the elements 368 | datFish$mapDat <- datFish$mapDat[, subset] 369 | datFish$scaleDat <- datFish$scaleDat[, subset] 370 | datFish$metadata <- datFish$metadata[subset, ] 371 | datFish$scaledX <- datFish$scaledX[subset] 372 | datFish$scaledY <- datFish$scaledY[subset] 373 | if (!is.null(datFish$mappingResults)) { 374 | datFish$mappingResults <- datFish$mappingResults[subset, ] 375 | } 376 | return(datFish) 377 | } 378 | 379 | 380 | #' Merge two fishScaleAndMap objects 381 | #' 382 | #' Merges all components of two fishScaleAndMap objects to create a new 383 | #' one. Note: only meta-data and mappingResults that is present in BOTH 384 | #' objects will be returned. 385 | #' 386 | #' @param datFish1 a fishScaleAndMap output list 387 | #' @param datFish2 a second fishScaleAndMap output list. 388 | #' 389 | #' @return a new fishScaleAndMap output list with the two original ones merged 390 | #' 391 | #' @export 392 | mergeFish <- function(datFish1, 393 | datFish2) { 394 | datFish <- datFish1 395 | datFish$mapDat <- cbind(datFish1$mapDat, datFish2$mapDat) 396 | datFish$scaleDat <- cbind(datFish1$scaleDat, datFish2$scaleDat) 397 | datFish$metadata <- rbind(datFish1$metadata, datFish2$metadata) 398 | datFish$scaledX <- c(datFish1$scaledX, datFish2$scaledX) 399 | datFish$scaledY <- c(datFish1$scaledY, datFish2$scaledY) 400 | if ((!is.null(datFish1$mappingResults)) & (!is.null(datFish2$mappingResults))) { 401 | datFish$mappingResults <- rbind(datFish1$mappingResults, datFish2$mappingResults) 402 | } 403 | return(datFish) 404 | } 405 | 406 | 407 | 408 | 409 | #' Rotate coordinates 410 | #' 411 | #' Rotates the scaledX and scaledY elements of a fishScaleAndMap output list so that the 412 | #' axis of interest (e.g., cortical layer) is paralled with the x cooridate plan. 413 | #' Rotation code is from https://stackoverflow.com/questions/15463462/rotate-graph-by-angle 414 | #' 415 | #' @param datFish a fishScaleAndMap output list 416 | #' @param flatVector a TRUE/FALSE vector ordred in the same way as the elements (e.g., cells) 417 | #' in datIn where all TRUE values correspond to cells who should have the same Y coordinate 418 | #' (e.g., be in the same layer). Alternatively a numeric vector of cell indices to include 419 | #' @param flipVector a numeric vector of values to ensure proper reflection on Y-axes (e.g., 420 | #' layer; default=NULL) 421 | #' @param subset a boolean or numeric vector of the elements to retain 422 | #' 423 | #' @return a fishScaleAndMap output list with updated scaledX and scaleY coordinates 424 | #' 425 | #' @export 426 | rotateXY <- function(datFish, 427 | flatVector = NULL, 428 | flipVector = NULL, 429 | subset = NULL) { 430 | 431 | ## Error checking 432 | datFishIn <- datFish 433 | if ((length(flatVector) != length(datFish$scaledX)) & (!is.numeric(flatVector))) { 434 | print("flatVector is incorrect format. Returning original entry.") 435 | return(datFish) 436 | } 437 | if (!is.null(subset)) { 438 | if ((length(subset) != length(datFish$scaledX)) & (!is.numeric(subset))) { 439 | print("subset is incorrect format. Returning original entry.") 440 | return(datFish) 441 | } 442 | } 443 | if (((length(flipVector) != length(datFish$scaledX)) & 444 | (!is.numeric(flipVector))) & (!is.null(flipVector))) { 445 | print("flipVector is incorrect format. Returning original entry.") 446 | return(datFish) 447 | } 448 | if (is.numeric(flatVector)) { 449 | flatVector <- intersect(flatVector, 1:length(datFish$scaledX)) 450 | } 451 | 452 | ## Subset the data if needed 453 | datFish <- datFishIn 454 | if (!is.null(subset)) { 455 | datFish <- filterCells(datFishIn, subset) 456 | flatVector <- flatVector[subset] 457 | flipVector <- flipVector[subset] 458 | } 459 | 460 | ## Caculate best angle 461 | v <- prcomp(cbind(datFish$scaledX, datFish$scaledY)[flatVector, ])$rotation 462 | beta <- as.numeric(atan(-v[2, 1]/v[1, 1])) 463 | 464 | ## Rotate coordinates (internal function) 465 | rotCor <- function(datFish, 466 | beta) { 467 | M <- cbind(datFish$scaledX, datFish$scaledY) 468 | rotm <- matrix(c(cos(beta), sin(beta), -sin(beta), cos(beta)), ncol = 2) # rotation matrix 469 | M2.1 <- t(t(M) - c(M[1, 1], M[1, 2])) # shift points, so that turning point is (0,0) 470 | M2.2 <- t(rotm %*% (t(M2.1))) # rotate 471 | M2.3 <- t(t(M2.2) + c(M[1, 1], M[1, 2])) # shift back 472 | x <- M2.3[, 1] 473 | y <- M2.3[, 2] 474 | 475 | x <- x - min(x) 476 | x <- x / max(x) 477 | y <- y - min(y) 478 | y <- y / max(y) 479 | 480 | datFish$scaledX <- x 481 | datFish$scaledY <- y 482 | datFish 483 | } 484 | datFish2 <- rotCor(datFish, beta) 485 | 486 | if (!is.null(flipVector)) { 487 | if (sum(datFish2$scaledY * flipVector) < sum((1 - datFish2$scaledY) * flipVector)) { 488 | datFish2 <- rotCor(datFish, beta)# + pi) # Pi should not be needed 489 | } 490 | } 491 | datFish <- datFish2 492 | 493 | ## Unsubset and return the data 494 | if (is.null(subset)) return(datFish) 495 | 496 | datFishIn$scaledX[subset] <- datFish$scaledX 497 | datFishIn$scaledY[subset] <- datFish$scaledY 498 | datFishIn 499 | } 500 | 501 | 502 | 503 | 504 | #' Plot distributions 505 | #' 506 | #' Plot the distributions of cells across the tissue with overlaying color information. This is 507 | #' a wrapper function for plot 508 | #' 509 | #' @param datIn a fishScaleAndMap output list 510 | #' @param group a character vector (or factor) indicating how to split the data (e.g., cluster 511 | #' call) or a metadata/mappingResults column name 512 | #' @param groups a character vector of groups to show (default is levels of group) 513 | #' @param colors a character vector (or factor) indicating how to color the plots (e.g., layer 514 | #' or gene expression) or a metadata/mappingResults column name (default is all black) 515 | #' @param colormap function to use for the colormap for the data (default gray.colors) 516 | #' @param maxrow maximum number of plots to show in one row (default=12) 517 | #' @param xlim,ylim for plot, but will be calculated if not entered 518 | #' @param pch,cex for plot. Can be single values or vectors 519 | #' @param main,xlab,ylab,... other parameters for plot (must be single values) 520 | #' @param singlePlot should everything be plot on a single page (default=TRUE) 521 | #' 522 | #' @return Only returns if there is an error 523 | #' 524 | #' @export 525 | plotDistributions <- function(datIn, 526 | group, 527 | groups = NULL, 528 | colors = rep("black", dim(datIn$mapDat)[2]), 529 | colormap = gray.colors, 530 | maxrow = 12, 531 | pch = 19, 532 | cex = 1.5, 533 | xlim = NULL, ylim = NULL, 534 | main = "", 535 | xlab = "", ylab = "", 536 | singlePlot = TRUE, 537 | ...) { 538 | colormap <- match.fun(colormap) 539 | meta <- cbind(datIn$metadata, datIn$mappingResults) 540 | if (length(group) == 1) { 541 | if (is.element(group, colnames(meta))) { 542 | group <- as.factor(meta[, group]) 543 | if (is.null(groups)) groups <- levels(group) 544 | } else { 545 | return(paste(group, "is not an available column name for division.")) 546 | } 547 | } 548 | if (length(colors) == 1) { 549 | if (is.element(colors, colnames(meta))) { 550 | colors <- as.numeric(as.factor(meta[, colors])) 551 | colors <- colormap(length(unique(colors)))[colors] 552 | } else { 553 | return(paste(colors, "is not an available column name for coloring.")) 554 | } 555 | } else { 556 | colors <- as.numeric(as.factor(colors)) 557 | colors <- colormap(length(unique(colors)))[colors] 558 | } 559 | 560 | if (is.null(xlim)) xlim <- range(datIn$scaledX) 561 | if (is.null(ylim)) ylim <- range(-datIn$scaledY) 562 | 563 | # Make the plot! 564 | if(singlePlot){ 565 | ncolv <- min(length(groups), maxrow) 566 | nrowv <- ceiling(length(groups)/maxrow) 567 | par(mfrow = c(nrowv, ncolv)) 568 | } 569 | for (gp in groups) { 570 | kp <- group == gp 571 | pch2 <- pch 572 | if (length(pch) > 1) pch2 <- pch[kp] 573 | cex2 <- cex 574 | if (length(cex) > 1) cex2 <- cex[kp] 575 | 576 | plot(datIn$scaledX[kp], -datIn$scaledY[kp], 577 | pch = pch2, col = colors[kp], xlim = xlim, 578 | ylim = ylim, main = paste(main, gp), xlab = xlab, 579 | ylab = ylab, cex = cex2, ... 580 | ) 581 | } 582 | } 583 | 584 | 585 | #' Plot heatmap 586 | #' 587 | #' Plot the heatmap of cells ordering by a specified order. This is a wrapper for heatmap.2 588 | #' 589 | #' @param datIn a fishScaleAndMap output list 590 | #' @param group a character vector (or factor) indicating how to order the heatmap (e.g., cluster 591 | #' call) or a metadata/mappingResults column name 592 | #' @param groups a character vector of groups to show (default is levels of group) 593 | #' @param grouplab label for the grouping in the heatmap (default is 'Grouping' or the value for group) 594 | #' @param useScaled plot the scaled (TRUE) or unscaled (FALSE; default) values 595 | #' @param capValue values above capValue will be capped at capValue (default is none) 596 | #' @param colormap set of values to use for the colormap for the data (default heat_colors) 597 | #' @param Rowv,Colv,dendrogram,trace,margins,rowsep,colsep,key,... other parameters for heatmap.2 598 | #' (some default values are different) 599 | #' 600 | #' @return Only returns if there is an error 601 | #' 602 | #' @export 603 | plotHeatmap <- function(datIn, 604 | group, 605 | groups = NULL, 606 | grouplab = "Grouping", 607 | useScaled = FALSE, 608 | capValue = Inf, 609 | colormap = grey.colors(1000), 610 | pch = 19, 611 | xlim = NULL, ylim = NULL, 612 | Rowv = FALSE, 613 | Colv = FALSE, 614 | dendrogram = "none", 615 | trace = "none", 616 | margins = c(6, 10), 617 | rowsep = NULL, 618 | sepwidth=c(0.4,0.4), 619 | key = FALSE, ...) { 620 | library(gplots) 621 | 622 | if (useScaled) { 623 | plotDat <- datIn$scaleDat 624 | } else { 625 | plotDat <- datIn$mapDat 626 | } 627 | plotDat <- pmin(plotDat, capValue) 628 | 629 | meta <- cbind(datIn$metadata, datIn$mappingResults) 630 | if (length(group) == 1) { 631 | if (is.element(group, colnames(meta))) { 632 | if (grouplab == "Grouping") grouplab <- group 633 | group <- as.factor(meta[, group]) 634 | if (is.null(groups)) groups <- levels(group) 635 | } else { 636 | return(paste(group, "is not an available column name for division.")) 637 | } 638 | } 639 | 640 | # Update the cell order 641 | groups <- c(groups, setdiff(levels(group), groups)) 642 | ord <- order(factor(group, levels = groups), -colSums(plotDat)) 643 | plotDat <- plotDat[, ord] 644 | group <- group[ord] 645 | 646 | # Append the cluster name to the plot data and find colseps 647 | cn <- rep("", length(colnames(plotDat))) 648 | colseps <- NULL 649 | for (g in unique(as.character(group))){ 650 | wg <- which(as.character(group)==g) 651 | cn[round(mean(wg))] <- g 652 | colseps <- c(colseps,min(wg)-1) 653 | } 654 | colnames(plotDat) <- paste(cn,colnames(plotDat)) 655 | 656 | # Make the plot! 657 | heatmap.2(plotDat, 658 | Rowv = Rowv, Colv = Colv, dendrogram = dendrogram, 659 | trace = trace, margins = margins, rowsep = rowsep, 660 | colsep = colseps, key = key, col = colormap, 661 | ... 662 | ) 663 | } 664 | 665 | 666 | #' Return top mapped correlation-based cluster and confidence 667 | #' 668 | #' Primary function for doing correlation-based mapping to cluster medians and also reporting the 669 | #' correlations and confidences. This is wrapper for getTopMatch and corTreeMapping. 670 | #' 671 | #' @param medianDat representative value for each leaf and node. If not entered, it is calculated 672 | #' @param mapDat normalized data of the MAPPING data set. Default is to map the data onto itself. 673 | #' @param refDat normalized data of the REFERENCE data set. Ignored if medianDat is passed 674 | #' @param clusters cluster calls for each cell. Ignored if medianDat is passed 675 | #' @param genesToMap which genes to include in the correlation mapping 676 | #' @param use additional parameter for cor (use='p' as default) 677 | #' @param method additional parameter for cor (method='p' as default) 678 | #' @param returnCor should the correlation matrix be appended to the return? 679 | #' @param ... not used 680 | #' 681 | #' @return data frame with the top match and associated correlation 682 | #' 683 | #' @export 684 | cellToClusterMapping_byCor <- function(medianDat, 685 | mapDat, 686 | refDat = NA, 687 | clusters = NA, 688 | genesToMap = rownames(mapDat), 689 | use = "p", 690 | method = "p", 691 | returnCor=FALSE, 692 | ...) { 693 | corVar <- corTreeMapping( 694 | medianDat = medianDat, 695 | mapDat = mapDat, refDat = refDat, clusters = clusters, 696 | genesToMap = genesToMap, use = use, method = method 697 | ) 698 | corMatch <- getTopMatch(corVar) 699 | colnames(corMatch) <- c("Class", "Correlation") 700 | 701 | dex <- apply(corVar, 1, function(x) return(diff(sort(-x)[1:2]))) 702 | corMatch$DifferenceBetweenTopTwoCorrelations <- dex 703 | if(returnCor) 704 | corMatch <- cbind(corMatch,corVar) 705 | corMatch 706 | } 707 | 708 | 709 | #' Quantile normalize, truncate, and scale 710 | #' 711 | #' Quantile normalize, truncate, and scale a numeric vector (e.g. mFISH data from one gene) 712 | #' 713 | #' @param x input data vector 714 | #' @param qprob probs value to result from quantile (default=0.9) 715 | #' @param maxVal max value for scaling (default=1) 716 | #' @param truncate should data above the qprob threshold be truncated (default=yes) 717 | #' @param ... not used 718 | #' 719 | #' @return scaled vector 720 | #' 721 | #' @export 722 | quantileTruncate <- function(x, 723 | qprob = 0.9, 724 | maxVal = 1, 725 | truncate = TRUE, 726 | ...) { 727 | qs <- quantile(x[x > 0], probs = qprob, na.rm = TRUE) 728 | if (is.na(qs)) return(x) 729 | if (truncate) x[x > qs] <- qs 730 | x * maxVal / qs 731 | } 732 | 733 | 734 | #' Plot TSNE 735 | #' 736 | #' Plot a TSNE of the data, with assigned colors and labels from provided variables. Note that this 737 | #' function is a modification of code from Pabloc (https://www.r-bloggers.com/author/pabloc/) from 738 | #' https://www.r-bloggers.com/playing-with-dimensions-from-clustering-pca-t-sne-to-carl-sagan/ 739 | #' 740 | #' @param datIn a fishScaleAndMap output list 741 | #' @param colorGroup a character vector (or factor) indicating how to color the Tsne (e.g., cluster 742 | #' call) or a metadata/mappingResults column name (default=NULL) 743 | #' @param labelGroup a character vector (or factor) indicating how to label the Tsne (e.g., cluster 744 | #' call) or a metadata/mappingResults column name (default=NULL) 745 | #' @param useScaled plot the scaled (TRUE) or unscaled (FALSE; default) values 746 | #' @param capValue values above capValue will be capped at capValue (default is none) 747 | #' @param perplexity,theta other parameters for Rtsne 748 | #' @param main title of the plot 749 | #' @param maxNchar what is the maximum number of characters to display in the plot for each entry? 750 | #' @param seed for reproducibility 751 | #' 752 | #' @return Only returns if there is an error 753 | #' 754 | #' @export 755 | plotTsne <- function(datIn, 756 | colorGroup = "none", 757 | labelGroup = "none", 758 | useScaled = FALSE, 759 | capValue = Inf, 760 | perplexity = 10, 761 | theta = 0.5, 762 | main = "TSNE plot", 763 | maxNchar = 1000, 764 | seed = 10) { 765 | library(Rtsne) 766 | library(ggplot2) 767 | 768 | # Get the data 769 | if (useScaled) { 770 | plotDat <- datIn$scaleDat 771 | } else { 772 | plotDat <- datIn$mapDat 773 | } 774 | plotDat <- pmin(plotDat, capValue) 775 | plotDat <- t(plotDat) 776 | 777 | # Color and label groups 778 | meta <- cbind(datIn$metadata, datIn$mappingResults) 779 | if (length(colorGroup) == 1) { 780 | if (is.element(colorGroup, colnames(meta))) { 781 | colorGroup <- as.factor(meta[, colorGroup]) 782 | } else { 783 | colorGroup <- as.factor(rep("none", dim(meta)[1])) 784 | } 785 | } 786 | if (length(labelGroup) == 1) { 787 | if (is.element(labelGroup, colnames(meta))) { 788 | labelGroup <- as.factor(meta[, labelGroup]) 789 | } else { 790 | labelGroup <- as.factor(rep("*", dim(meta)[1])) 791 | } 792 | } 793 | # Subset to maxNchar characters 794 | levs <- substr(levels(as.factor(labelGroup)), 1, maxNchar) 795 | labelGroup <- factor(as.character(substr( 796 | labelGroup, 1, maxNchar 797 | )), levels = as.character(unique(levs))) 798 | 799 | # Get the tsne corrdinates 800 | set.seed(seed) 801 | tsne_model_1 <- Rtsne(as.matrix(plotDat), 802 | check_duplicates = FALSE, 803 | pca = TRUE, perplexity = perplexity, theta = theta, dims = 2 804 | ) 805 | d_tsne_1 <- as.data.frame(tsne_model_1$Y) 806 | 807 | # Make the plot! 808 | plot_k <- ggplot(d_tsne_1, aes_string( 809 | x = "V1", y = "V2", 810 | color = colorGroup, label = labelGroup 811 | )) + 812 | geom_text() + xlab("TSNE 1") + ylab("TSNE 2") + 813 | ggtitle(main) + theme(legend.title = element_blank()) 814 | scale_colour_discrete() 815 | 816 | plot_k 817 | } 818 | -------------------------------------------------------------------------------- /R/update.r: -------------------------------------------------------------------------------- 1 | #' Update the mfishtools library 2 | #' 3 | #' @export 4 | update_mfishtools <- function() { 5 | devtools::install_github("AllenInstitute/mfishtools", build_vignettes = TRUE) 6 | } 7 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # mfishtools 2 | 3 | R functions for gene selection and analysis of mFISH data 4 | 5 | mfishtools includes many functions that are used for analysis of data for the CZI SpaceTx project, and mostly relies on correlation-based analysis with filtering. *This library is still in beta testing and may be buggy. Community involvement is encouraged through both issues and pull requests.* 6 | 7 | ## Installation 8 | 9 | Install prerequisites: 10 | ``` 11 | install.packages("devtools") 12 | devtools::install_github("AllenInstitute/scrattch.vis") 13 | devtools::install_github("AllenInstitute/tasic2016data") 14 | ``` 15 | 16 | Note that some people may need to manually install the `GO.db` and `WGCNA` libraries as well: 17 | ``` 18 | install.packages("BiocManager") 19 | BiocManager::install("GO.db") 20 | BiocManager::install("WGCNA") 21 | ``` 22 | 23 | Install `mfishtools` using: 24 | ``` 25 | # Quickly, but without the vignettes: 26 | devtools::install_github("AllenInstitute/mfishtools") 27 | 28 | # More slowly, but with the vignettes: 29 | install.packages("remotes", repos='http://cran.us.r-project.org') 30 | remotes::install_github("AllenInstitute/mfishtools", build_vignettes = TRUE) 31 | ``` 32 | 33 | 34 | ## Library use cases 35 | 36 | There are two primary use cases for this libary: 37 | 38 | 1. **Building a combinatorial marker gene panel for spatial transcriptomics.** [LINK TO VIGNETTE](http://htmlpreview.github.io/?https://github.com/AllenInstitute/mfishtools/blob/master/vignettes/inhibitory_marker_selection.html) This allows the generation of computationally "optimal" marker gene panels based on single cell/nucleus RNA-Seq reference data. A starting set of manually-selected marker genes is first selected, and then the remaining genes are chosen using a greedy algorithm. Relevant statistics and plots are generated that show the predicted success for the panel. 39 | 2. **Mapping cells from spatial transcriptomics data sets to reference cell types.** [LINK TO VIGNETTE](http://htmlpreview.github.io/?https://github.com/AllenInstitute/mfishtools/blob/master/vignettes/inhibitory_marker_mapping.html) This allows for cell type calling of cells in a spatial transcriptomics study, and also predicts the accuracy of the calls based on reference data. *Note: it is currently unclear how reliable this method is at correctly predicting cell type calls. Please review results carefully!* Plots can also be generated to show the results. 40 | 41 | Many functions are currently not included in these vignettes; please use the R help ("?") if additional information is needed, or e-mail me at jeremym@alleninstitute.org. 42 | 43 | ## License 44 | 45 | The license for this package is available on Github at: https://github.com/AllenInstitute/mfishtools/blob/master/LICENSE 46 | 47 | ## Level of Support 48 | 49 | We are planning on occasional updating this tool with no fixed schedule. Community involvement is encouraged through both issues and pull requests. 50 | 51 | ## Contribution Agreement 52 | 53 | If you contribute code to this repository through pull requests or other mechanisms, you are subject to the Allen Institute Contribution Agreement, which is available in full at: https://github.com/AllenInstitute/mfishtools/blob/master/CONTRIBUTION 54 | -------------------------------------------------------------------------------- /data/fishData.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenInstitute/mfishtools/df6d7f1104b149698a3cebd7f82dd9848bc7328a/data/fishData.rda -------------------------------------------------------------------------------- /data/metadata.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenInstitute/mfishtools/df6d7f1104b149698a3cebd7f82dd9848bc7328a/data/metadata.rda -------------------------------------------------------------------------------- /man/buildMappingBasedMarkerPanel.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{buildMappingBasedMarkerPanel} 4 | \alias{buildMappingBasedMarkerPanel} 5 | \title{Greedy algorithm for building marker gene panel} 6 | \usage{ 7 | buildMappingBasedMarkerPanel( 8 | mapDat, 9 | medianDat = NA, 10 | clustersF = NA, 11 | panelSize = 50, 12 | subSamp = 20, 13 | maxFcGene = 1000, 14 | qMin = 0.75, 15 | seed = 10, 16 | currentPanel = NULL, 17 | panelMin = 5, 18 | writeText = TRUE, 19 | corMapping = TRUE, 20 | optimize = "FractionCorrect", 21 | clusterDistance = NULL, 22 | clusterGenes = NULL, 23 | dend = NULL, 24 | percentSubset = 100 25 | ) 26 | } 27 | \arguments{ 28 | \item{mapDat}{normalized data of the mapping (=reference) data set.} 29 | 30 | \item{medianDat}{representative value for each leaf. If not entered, it is calculated} 31 | 32 | \item{clustersF}{cluster calls for each cell.} 33 | 34 | \item{panelSize}{number of genes to include in the marker gene panel} 35 | 36 | \item{subSamp}{number of random nuclei to select from each cluster (to increase speed); 37 | set as NA to not subsample} 38 | 39 | \item{maxFcGene}{maximum number of genes to consider at each iteration (to increase speed)} 40 | 41 | \item{qMin}{minimum quantile for fold change comparison (between 0 and 1, higher = more 42 | specific marker genes are included)} 43 | 44 | \item{seed}{for reproducibility} 45 | 46 | \item{currentPanel}{starting panel. Default is NULL.} 47 | 48 | \item{panelMin}{if there are fewer genes than this, the top number of these genes by fc 49 | rank are set as the starting panel. Cannot be less than 2.} 50 | 51 | \item{writeText}{should gene names and marker scores be output (default TRUE)} 52 | 53 | \item{corMapping}{if TRUE (default) map by correlation; otherwise, map by Euclidean 54 | distance (not recommended)} 55 | 56 | \item{optimize}{if 'FractionCorrect' (default) will seek to maximize the fraction of 57 | cells correctly mapping to final clusters 58 | if 'CorrelationDistance' will seek to minimize the total distance between actual 59 | cluster calls and mapped clusters 60 | if 'DendrogramHeight' will seek to minimize the total dendrogram height between 61 | actual cluster calls and mapped clusters} 62 | 63 | \item{clusterDistance}{only used if optimize='CorrelationDistance'; a matrix (or 64 | vector) of cluster distances. Will be calculated if NULL and if clusterGenes 65 | provided. (NOTE: order must be the same as medianDat and/or have column and row 66 | names corresponding to clusters in clustersF)} 67 | 68 | \item{clusterGenes}{a vector of genes used to calculate the cluster distance. 69 | Only used if optimize='CorrelationDistance' and clusterDistance=NULL.} 70 | 71 | \item{dend}{only used if optimize='DendrogramHeight' dendrogram; will error out of not provided} 72 | 73 | \item{percentSubset}{for each iteration the function can subset the set of possible 74 | genes to speed up the calculation.} 75 | } 76 | \value{ 77 | an ordered character vector corresponding to the marker gene panel 78 | } 79 | \description{ 80 | This is the primary function that iteratively builds a marker gene panel, one gene at a 81 | time by iteratively adding the most informative gene to the existing gene panel. 82 | } 83 | -------------------------------------------------------------------------------- /man/buildPanel_oneCluster.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{buildPanel_oneCluster} 4 | \alias{buildPanel_oneCluster} 5 | \title{Build panel for one cluster (beta)} 6 | \usage{ 7 | buildPanel_oneCluster( 8 | mapDat, 9 | clustersF, 10 | medianDat = NA, 11 | propIn = NA, 12 | clust = as.character(clustersF[1]), 13 | subSamp = NA, 14 | seed = 10, 15 | maxSize = 20, 16 | dexCutoff = 0.001, 17 | topGeneCount = 100 18 | ) 19 | } 20 | \arguments{ 21 | \item{mapDat}{normalized data of the mapping (=reference) data set.} 22 | 23 | \item{clustersF}{cluster calls for each cell.} 24 | 25 | \item{medianDat}{median value for each leaf} 26 | 27 | \item{propIn}{proportions of cells with expression > 1 in each leaf} 28 | 29 | \item{clust}{which cluster to target?} 30 | 31 | \item{subSamp}{number of random nuclei to select from each cluster, EXCEPT the target cluster; 32 | set as NA to not subsample} 33 | 34 | \item{seed}{for reproducibility} 35 | 36 | \item{maxSize}{maximum size of marker gene panel} 37 | 38 | \item{dexCutoff}{criteria for stopping: when improvement in fraction of cells properly mapped 39 | dips below this value} 40 | 41 | \item{topGeneCount}{number of top genes by proportion to consider} 42 | } 43 | \value{ 44 | a matrix of the top marker genes for each cluster. Output matrix includes five columns: 45 | clust = cluster; panel = ordered genes in the panel for that cluster; onCorrect = fraction of 46 | correctly assigned cells in cluster; offCorrect = fraction of cells correctly assigned outside 47 | of cluster; dexTotal = additional dex explained by last gene added. 48 | } 49 | \description{ 50 | This UNTESTED function finds the best small marker panel for marking a single cluster, using 51 | proportion difference as the metric for determining the starting panel. 52 | } 53 | -------------------------------------------------------------------------------- /man/buildQualityTable.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{buildQualityTable} 4 | \alias{buildQualityTable} 5 | \title{Correct mapping at different tree heights} 6 | \usage{ 7 | buildQualityTable( 8 | orderedGenes, 9 | dend, 10 | mapDat, 11 | medianDat, 12 | clustersF, 13 | minVal = 2, 14 | heights = c((0:100)/100), 15 | verbose = FALSE 16 | ) 17 | } 18 | \arguments{ 19 | \item{orderedGenes}{an ordered list of input genes (e.g. from an iterative algorithm)} 20 | 21 | \item{dend}{dendrogram for mapping.} 22 | 23 | \item{mapDat}{normalized data of the mapping (=reference) data set.} 24 | 25 | \item{medianDat}{median value for each leaf} 26 | 27 | \item{clustersF}{cluster calls for each cell} 28 | 29 | \item{minVal}{minimum number of genes to consider from the list in the mapping} 30 | 31 | \item{heights}{height in the tree to look at} 32 | 33 | \item{verbose}{whether or not to show progress in the function} 34 | } 35 | \value{ 36 | a matrix of fractions of cells correctly mapped for different tree heights (columns) 37 | and different gene panels (rows) 38 | } 39 | \description{ 40 | This function takes as input an ordered set of marker genes (e.g., from at iterative algorithm, 41 | and returns an table showing the fraction of cells correctly mapped to a similar cell type 42 | (as defined by the heights parameter). A height of 1 indicates correct mapping to the leaf. 43 | } 44 | -------------------------------------------------------------------------------- /man/buildTreeFromGenePanel.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{buildTreeFromGenePanel} 4 | \alias{buildTreeFromGenePanel} 5 | \title{Build and plot dendrogram from gene panel} 6 | \usage{ 7 | buildTreeFromGenePanel( 8 | dend = NA, 9 | refDat = NA, 10 | mapDat = refDat, 11 | medianDat = NA, 12 | requiredGenes = 2, 13 | clusters = NA, 14 | mappedAsReference = FALSE, 15 | genesToMap = rownames(mapDat), 16 | plotdendro = TRUE, 17 | returndendro = TRUE, 18 | mar = c(12, 5, 5, 5), 19 | main = NULL, 20 | ylab = NULL, 21 | use = "p", 22 | ... 23 | ) 24 | } 25 | \arguments{ 26 | \item{dend}{dendrogram for mapping. Ignored if medianDat is passed} 27 | 28 | \item{refDat}{normalized data of the REFERENCE data set. Ignored if medianDat is passed} 29 | 30 | \item{mapDat}{normalized data of the MAPPING data set. Default is to map the data onto itself.} 31 | 32 | \item{medianDat}{representative value for each leaf and node. If not entered, it is calculated} 33 | 34 | \item{requiredGenes}{minimum number of genes required to be expressed in a cluster (column 35 | of medianDat) for the cluster to be included (default=2)} 36 | 37 | \item{clusters}{cluster calls for each cell} 38 | 39 | \item{mappedAsReference}{if TRUE, returns the fraction of cells mapped to a node which are 40 | were orginally clustered from that node; if FALSE (default) returns the fraction of cells 41 | clustered under a node which are mapped to the correct node.} 42 | 43 | \item{genesToMap}{which genes to include in the correlation mapping} 44 | 45 | \item{plotdendro}{should the dendrogram be plotted (default = TRUE)} 46 | 47 | \item{mar}{margins (for use with par)} 48 | 49 | \item{main, ylab}{add title and labels to plot (default is NULL)} 50 | 51 | \item{use, ...}{additional parameters for cor} 52 | 53 | \item{returnDendro}{should the dendrogram be returned (default = TRUE)} 54 | } 55 | \value{ 56 | a list where the first entry is the resulting tree and the second entry is the 57 | fraction of cells correctly mapping to each node using the inputted gene panel. 58 | } 59 | \description{ 60 | Build and plot a dendrogram using correlation-based average linkage hierarchical 61 | clustering and only using a specified set of genes. The output is the expected 62 | accuracy of mapping to each node in the tree, which gives an idea of the best-case 63 | expected results for mFISH analysis. 64 | } 65 | -------------------------------------------------------------------------------- /man/cellToClusterMapping_byCor.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{cellToClusterMapping_byCor} 4 | \alias{cellToClusterMapping_byCor} 5 | \title{Return top mapped correlation-based cluster and confidence} 6 | \usage{ 7 | cellToClusterMapping_byCor( 8 | medianDat, 9 | mapDat, 10 | refDat = NA, 11 | clusters = NA, 12 | genesToMap = rownames(mapDat), 13 | use = "p", 14 | method = "p", 15 | ... 16 | ) 17 | } 18 | \arguments{ 19 | \item{medianDat}{representative value for each leaf and node. If not entered, it is calculated} 20 | 21 | \item{mapDat}{normalized data of the MAPPING data set. Default is to map the data onto itself.} 22 | 23 | \item{refDat}{normalized data of the REFERENCE data set. Ignored if medianDat is passed} 24 | 25 | \item{clusters}{cluster calls for each cell. Ignored if medianDat is passed} 26 | 27 | \item{genesToMap}{which genes to include in the correlation mapping} 28 | 29 | \item{use}{additional parameter for cor (use='p' as default)} 30 | 31 | \item{method}{additional parameter for cor (method='p' as default)} 32 | 33 | \item{...}{not used} 34 | } 35 | \value{ 36 | data frame with the top match and associated correlation 37 | } 38 | \description{ 39 | Primary function for doing correlation-based mapping to cluster medians and also reporting the 40 | correlations and confidences. This is wrapper for getTopMatch and corTreeMapping. 41 | } 42 | -------------------------------------------------------------------------------- /man/cellToClusterMapping_byRank.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{cellToClusterMapping_byRank} 4 | \alias{cellToClusterMapping_byRank} 5 | \title{Cell-based cluster mapping} 6 | \usage{ 7 | cellToClusterMapping_byRank( 8 | mapDat, 9 | refDat, 10 | clustersF, 11 | genesToMap = rownames(mapDat), 12 | mergeFunction = rowMedians, 13 | useRank = TRUE, 14 | use = "p", 15 | method = "p" 16 | ) 17 | } 18 | \arguments{ 19 | \item{mapDat}{normalized data of the MAPPING data set.} 20 | 21 | \item{refDat}{normalized data of the REFERENCE data set} 22 | 23 | \item{clustersF}{factor indicating which cluster each cell type is actually assigned 24 | to in the reference data set} 25 | 26 | \item{genesToMap}{character vector of which genes to include in the correlation mapping} 27 | 28 | \item{mergeFunction}{function for combining ranks; the tested choices are rowMeans or 29 | rowMedians (default)} 30 | 31 | \item{useRank}{use the rank of the correlation (default) or the correlation itself to 32 | determine the top cluster} 33 | 34 | \item{use}{additional parameter for cor (use='p' as default)} 35 | 36 | \item{method}{additional parameter for cor (method='p' as default)} 37 | } 38 | \value{ 39 | a two column data matrix where the first column is the mapped cluster and the second 40 | column is a confidence call indicating how close to the top of the ranked list cells of the 41 | assigned cluster were located relative to their best possible location in the ranked list. 42 | This confidence score seems to be a bit more reliable than correlation at determining how 43 | likely a cell in a training set is to being correctly assigned to the training cluster. 44 | } 45 | \description{ 46 | Maps cells to clusters by correlating every mapped cell with every reference cell, 47 | ranking the cells by correlation, and the reporting the cluster with the lowest average rank. 48 | } 49 | -------------------------------------------------------------------------------- /man/corTreeMapping.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{corTreeMapping} 4 | \alias{corTreeMapping} 5 | \title{Correlation-based cluster mapping} 6 | \usage{ 7 | corTreeMapping( 8 | mapDat, 9 | medianDat, 10 | dend = NULL, 11 | refDat = NA, 12 | clusters = NA, 13 | genesToMap = rownames(mapDat), 14 | use = "p", 15 | method = "p" 16 | ) 17 | } 18 | \arguments{ 19 | \item{mapDat}{normalized data of the MAPPING data set. Default is to map the data onto itself.} 20 | 21 | \item{medianDat}{representative value for each leaf and node. If not entered, it is calculated} 22 | 23 | \item{dend}{dendrogram for mapping. If provided, correlations to nodes are also returned} 24 | 25 | \item{refDat}{normalized data of the REFERENCE data set. Ignored if medianDat is passed} 26 | 27 | \item{clusters}{cluster calls for each cell. Ignored if medianDat is passed} 28 | 29 | \item{genesToMap}{which genes to include in the correlation mapping} 30 | 31 | \item{use}{additional parameter for cor (use='p' as default)} 32 | 33 | \item{method}{additional parameter for cor (method='p' as default)} 34 | } 35 | \value{ 36 | matrix with the correlation between expression of each cell and representative value for 37 | each leaf and node 38 | } 39 | \description{ 40 | Primary function for doing correlation-based mapping to cluster medians. This is wrapper for cor 41 | and returns a correlation matrix. 42 | } 43 | -------------------------------------------------------------------------------- /man/corTreeMapping_withFilter.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{corTreeMapping_withFilter} 4 | \alias{corTreeMapping_withFilter} 5 | \title{Correlation between nodes and leafs (deprecated)} 6 | \usage{ 7 | corTreeMapping_withFilter( 8 | dend = NA, 9 | refDat = NA, 10 | mapDat = refDat, 11 | medianExpr = NA, 12 | propExpr = NA, 13 | filterMatrix = NA, 14 | clusters = NA, 15 | numberOfGenes = 1200, 16 | outerLimitGenes = 7200, 17 | rankGeneFunction = function(x) getBetaScore(x, returnScore = FALSE), 18 | use = "p", 19 | ... 20 | ) 21 | } 22 | \arguments{ 23 | \item{dend}{dendrogram for mapping. Ignored if medianDat is passed} 24 | 25 | \item{refDat}{normalized data of the REFERENCE data set. Ignored if medianExpr 26 | and propExpr are passed} 27 | 28 | \item{mapDat}{normalized data of the MAPPING data set. Default is to map the 29 | data onto itself.} 30 | 31 | \item{medianExpr}{representative value for each leaf. If not entered, it is 32 | calculated} 33 | 34 | \item{propExpr}{proportion of cells in each type expressing a given gene. If not 35 | entered, it is calculated} 36 | 37 | \item{filterMatrix}{a matrix of TRUE/FALSE values to indicate whether a given 38 | cluster is possible} 39 | 40 | \item{clusters}{cluster calls for each cell. Ignored if medianExpr and propExpr 41 | are passed} 42 | 43 | \item{numberOfGenes}{how many variables genes} 44 | 45 | \item{outerLimitGenes}{choose different numberOfGenes per cell from the top overall 46 | outerLimitGenes (to speed up function)} 47 | 48 | \item{use, ...}{additional parameters for cor} 49 | 50 | \item{genesToMap}{which genes to include in the correlation mapping} 51 | } 52 | \value{ 53 | a matrix of correlation values with rows as mapped cells and columns as clusters 54 | } 55 | \description{ 56 | Returns the correlation between expression of each cell and representative 57 | value for each node and leaf. NOTE: this function is unstable and will 58 | eventually be merged with corTreeMapping. 59 | } 60 | -------------------------------------------------------------------------------- /man/distTreeMapping.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{distTreeMapping} 4 | \alias{distTreeMapping} 5 | \title{(Euclidean) distance mapping} 6 | \usage{ 7 | distTreeMapping( 8 | dend = NA, 9 | refDat = NA, 10 | mapDat = refDat, 11 | medianDat = NA, 12 | clusters = NA, 13 | genesToMap = rownames(mapDat), 14 | returnSimilarity = TRUE, 15 | use = "p", 16 | ... 17 | ) 18 | } 19 | \arguments{ 20 | \item{dend}{dendrogram for mapping. Ignored if medianDat is passed} 21 | 22 | \item{refDat}{normalized data of the REFERENCE data set. Ignored if medianDat is passed} 23 | 24 | \item{mapDat}{normalized data of the MAPPING data set. Default is to map the data onto itself.} 25 | 26 | \item{medianDat}{representative value for each leaf and node. If not entered, it is calculated} 27 | 28 | \item{clusters}{cluster calls for each cell. Ignored if medianDat is passed} 29 | 30 | \item{genesToMap}{which genes to include in the correlation mapping} 31 | 32 | \item{returnSimilarity}{FALSE to return distance, TRUE to return something like a similarity} 33 | 34 | \item{use, ...}{additional parameters for dist (for back-compatiblity; doesn't work)} 35 | } 36 | \value{ 37 | matrix of Euclidean distances between cells (rows) and clusters (columns) 38 | } 39 | \description{ 40 | Returns the distance between expression of each cell and representative value for each node and 41 | leaf (default is based on euclidean distance). In our hands this is does not work very well. 42 | } 43 | -------------------------------------------------------------------------------- /man/filterByClass.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{filterByClass} 4 | \alias{filterByClass} 5 | \title{Filter by meta-data} 6 | \usage{ 7 | filterByClass( 8 | classVector, 9 | sampleInfo, 10 | classColumn = "cluster_type_label", 11 | clusterColumn = "cluster_label", 12 | threshold = 0.1 13 | ) 14 | } 15 | \arguments{ 16 | \item{classVector}{vector corresponding to the class information for filtering (e.g., vector 17 | of label calls)} 18 | 19 | \item{sampleInfo}{matrix of sample information with rows corresponding to cells and columns 20 | corresponding to meta-data} 21 | 22 | \item{classColumn}{column name of class information} 23 | 24 | \item{clusterColumn}{column name of cluster information} 25 | 26 | \item{threshold}{minimum fraction of cluster cells from a given class to be considered present} 27 | } 28 | \value{ 29 | a matrix of filters with rows as clusters and columns as classes with entries of TRUE or 30 | FALSE indicating whether cells from a given class can assigned to that cluster, given threshold. 31 | } 32 | \description{ 33 | Return a filter of TRUE/FALSE values for a given piece of meta-data (e.g., broad class). 34 | } 35 | -------------------------------------------------------------------------------- /man/filterCells.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{filterCells} 4 | \alias{filterCells} 5 | \title{Filter (subset) fishScaleAndMap object} 6 | \usage{ 7 | filterCells(datFish, subset) 8 | } 9 | \arguments{ 10 | \item{datFish}{a fishScaleAndMap output list} 11 | 12 | \item{subset}{a boolean or numeric vector of the elements to retain} 13 | } 14 | \value{ 15 | a fishScaleAndMap output subsetted to the requested elements 16 | } 17 | \description{ 18 | Subsets all components in a fishScaleAndMap object 19 | } 20 | -------------------------------------------------------------------------------- /man/filterPanelGenes.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{filterPanelGenes} 4 | \alias{filterPanelGenes} 5 | \title{Filter genes for spatial transcriptomics panel} 6 | \usage{ 7 | filterPanelGenes( 8 | summaryExpr, 9 | propExpr = summaryExpr, 10 | onClusters = 1:dim(summaryExpr)[2], 11 | offClusters = NULL, 12 | geneLengths = NULL, 13 | startingGenes = c("GAD1", "SLC17A7"), 14 | numBinaryGenes = 500, 15 | minOn = 10, 16 | maxOn = 250, 17 | maxOff = 50, 18 | minLength = 960, 19 | fractionOnClusters = 0.5, 20 | onThreshold = 0.5, 21 | excludeGenes = NULL, 22 | excludeFamilies = c("LOC", "LINC", "FAM", "ORF", "KIAA", "FLJ", "DKFZ", "RIK", "RPS", 23 | "RPL", "\\\\-") 24 | ) 25 | } 26 | \arguments{ 27 | \item{summaryExpr}{Matrix of summarized expression levels for a given cluster. Typically the median 28 | or mean should be used. Rows are genes and columns are samples. ROW NAMES MUST BE GENE SYMBOLS!} 29 | 30 | \item{propExpr}{Proportion of cells expressed in each cluster for use with binary score calculation 31 | (default = summaryExpr, which is not recommended)} 32 | 33 | \item{onClusters}{Vector indicating which clusters should be included in the gene panel (default 34 | is all clusters. Can be logical or numeric, or a character string of cluster names)} 35 | 36 | \item{offClusters}{Vector indidicating from which clusters expression should be avoided} 37 | 38 | \item{numBinaryGenes}{Number of genes to include in the final panel. Genes are sorted by binary 39 | score using `getBetaScore` and this number of genes are chosen (default = 500)} 40 | 41 | \item{minOn}{Minimum summary expression level in most highly expressed "on" cluster (default = 10)} 42 | 43 | \item{maxOn}{Maximum summary expression level in most highly expressed "on" cluster (default = 250)} 44 | 45 | \item{maxOff}{Maximum summary expression level in most highly expressed "off" cluster (default = 50)} 46 | 47 | \item{minLength}{Minimum gene length for marker gene selection. Ignored if geneLength is not 48 | provided (default = 960)} 49 | 50 | \item{fractionOnClusters}{What is the maximum fraction of clusters in which a gene can be expressed 51 | (as defined by propExpr>onThreshold; default = 0.5). This prevents nearly ubiquitous genes from selection} 52 | 53 | \item{onThreshold}{What fraction of cells need to have expression for a gene to be defined as expressed 54 | (default = 0.5)} 55 | 56 | \item{excludeGenes}{Which genes should be excluded from the analysis (default is none)} 57 | 58 | \item{excludeFamilies}{Which gene classes or families should be excluded from the analysis? More 59 | specifically, any gene that contain these strings of characters anywhere in the symbol will be 60 | excluded (default is "LOC","LINC","FAM","ORF","KIAA","FLJ","DKFZ","RIK","RPS","RPL","\\-").} 61 | 62 | \item{geneLength}{Optional vector of gene lengths in same order as summaryExpr. Default is NULL} 63 | } 64 | \value{ 65 | A character vector of genes meeting all constraints 66 | } 67 | \description{ 68 | Returns a set of genes for inclusion in a spatial transcriptomics panel based on a series of 69 | hard-coded and user-defined constraints 70 | } 71 | -------------------------------------------------------------------------------- /man/fishScaleAndMap.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{fishScaleAndMap} 4 | \alias{fishScaleAndMap} 5 | \title{Scale mFISH data and map to RNA-seq reference} 6 | \usage{ 7 | fishScaleAndMap( 8 | mapDat, 9 | refSummaryDat, 10 | genesToMap = NULL, 11 | mappingFunction = cellToClusterMapping_byCor, 12 | transform = function(x) x, 13 | noiselevel = 0, 14 | scaleFunction = quantileTruncate, 15 | omitGenes = NULL, 16 | metadata = data.frame(experiment = rep("all", dim(mapDat)[2])), 17 | integerWeights = NULL, 18 | binarize = FALSE, 19 | binMin = 0.5, 20 | ... 21 | ) 22 | } 23 | \arguments{ 24 | \item{mapDat}{normalized data of the MAPPING data set. Default is to map the data onto itself.} 25 | 26 | \item{refSummaryDat}{normalized summary data of the REFERENCE data set (e.g., what to map against)} 27 | 28 | \item{genesToMap}{which genes to include in the mapping (calculated in not entered)} 29 | 30 | \item{mappingFunction}{which function to use for mapping (default is cellToClusterMapping_byCor) 31 | The function must include at least two parameters with the first one being mapped data and the 32 | second data the reference. Additional parameters are okay. Output must be a data frame where 33 | the first value is a mapped class. Additional columns are okay and will be returned)} 34 | 35 | \item{transform}{function for transformation of the data (default in none)} 36 | 37 | \item{noiselevel}{scalar value at or below which all values are set to 0 (default is 0)} 38 | 39 | \item{scaleFunction}{which function to use for scaling mapDat to refSummaryDat (default is setting 40 | 90th quantile of mapDat to max of refSummaryDat and truncating higher mapDat values)} 41 | 42 | \item{omitGenes}{genes to be included in the data frames but excluded from the mapping} 43 | 44 | \item{metadata}{a data frame of possible metadata (additional columns are okay and ignored): 45 | \describe{ 46 | \item{area}{a vector of cell areas for normalization} 47 | \item{experiment}{a vector indicating if multiple experiments should be scaled separately} 48 | \item{x,y}{x (e.g., parallel to layer) and y (e.g., across cortical layers) coordinates in tissue} 49 | }} 50 | 51 | \item{integerWeights}{if not NULL (default) a vector of integers corresponding to how many times 52 | each gene should be counted as part of the correlation. This is equivalent to calculating 53 | a weighted correlation, but only allows for integer weight values (for use with cor).} 54 | 55 | \item{binarize}{should the data be binarized? (default=FALSE)} 56 | 57 | \item{binMin}{minimum ON value for the binarized matrix (ignored if binarize=FALSE)} 58 | 59 | \item{...}{additional parameters for passthrough into other functions} 60 | } 61 | \value{ 62 | a list with the following entrees: 63 | \describe{ 64 | \item{mapDat}{mapDat data matrix is passed through} 65 | \item{scaleDat}{scaled mapDat data matrix} 66 | \item{mappingResults}{Results of the mapping and associated confidence values (if any)} 67 | \item{metadata=metadata}{metadata is passed through unchanged} 68 | \item{scaledX/Y}{scaled x and y coordinates (or unscaled if scaling was not performed)} 69 | } 70 | } 71 | \description{ 72 | This function is a wrapper for several other functions which aim to scale mFISH data to 73 | more closely match RNA-seq data and then map the mFISH data to the closest reference 74 | classes. There are several parameters allowing flexability in filtering and analysis. 75 | } 76 | -------------------------------------------------------------------------------- /man/fractionCorrectPerNode.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{fractionCorrectPerNode} 4 | \alias{fractionCorrectPerNode} 5 | \title{Fraction of correct calls per node} 6 | \usage{ 7 | fractionCorrectPerNode( 8 | dendIn, 9 | clActual, 10 | clPredict, 11 | minCount = 0.1, 12 | defaultSum = -1, 13 | out = NULL 14 | ) 15 | } 16 | \arguments{ 17 | \item{dendIn}{dendrogram for mapping. Ignored if minimizeHeight=FALSE} 18 | 19 | \item{clActual}{character vector of actual cluster assignments} 20 | 21 | \item{clPredict}{character vector of predicted cluster assignments} 22 | 23 | \item{minCount}{set to 0 results from clusters with fewer than this number of cells (default 24 | is to consider all clusters)} 25 | 26 | \item{defaultSum}{value to return in cases where there are fewer than minCount cells in the 27 | actual cluster (e.g., cases that aren't considered at all)} 28 | 29 | \item{out}{required for recursive function. Do not set!} 30 | } 31 | \value{ 32 | matrix of two columns: (1) node name and (2) the fraction of cells in that node that 33 | are correctly assigned 34 | } 35 | \description{ 36 | This function returns the fraction correctly assigned to each node (as defined that the actual 37 | and predicted cluster are both in the same node) 38 | } 39 | -------------------------------------------------------------------------------- /man/fractionCorrectWithGenes.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{fractionCorrectWithGenes} 4 | \alias{fractionCorrectWithGenes} 5 | \title{Fraction of cells correctly assigned} 6 | \usage{ 7 | fractionCorrectWithGenes( 8 | orderedGenes, 9 | mapDat, 10 | medianDat, 11 | clustersF, 12 | verbose = FALSE, 13 | plot = TRUE, 14 | return = TRUE, 15 | ... 16 | ) 17 | } 18 | \arguments{ 19 | \item{orderedGenes}{an ordered list of input genes (e.g. from an iterative algorithm)} 20 | 21 | \item{mapDat}{normalized data of the mapping (=reference) data set.} 22 | 23 | \item{medianDat}{median value for each leaf} 24 | 25 | \item{clustersF}{cluster calls for each cell} 26 | 27 | \item{verbose}{whether or not to show progress in the function} 28 | 29 | \item{plot}{if TRUE, plotCorrectWithGenes is run} 30 | 31 | \item{return}{if TRUE, the value is returned} 32 | 33 | \item{...}{parameters passed to plotCorrectWithGenes (if plot=TRUE)} 34 | } 35 | \value{ 36 | a vector showing the fraction of cells correctly mapped to each cluster 37 | } 38 | \description{ 39 | This function takes as input an ordered set of marker genes (e.g., from at iterative 40 | algorithm), and returns a vector showing the fraction of cells correctly mapped. 41 | } 42 | -------------------------------------------------------------------------------- /man/generateMultipleCellReferenceSet.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{generateMultipleCellReferenceSet} 4 | \alias{generateMultipleCellReferenceSet} 5 | \title{Generate reference set of pseudo-cells} 6 | \usage{ 7 | generateMultipleCellReferenceSet( 8 | refDat, 9 | clustersF, 10 | genesToUse = rownames(refDat), 11 | cellsPerMerge = 5, 12 | numberOfMerges = 10, 13 | mergeFunction = rowMedians, 14 | seed = 1 15 | ) 16 | } 17 | \arguments{ 18 | \item{refDat}{normalized data of the REFERENCE data set} 19 | 20 | \item{clustersF}{factor indicating which cluster each cell type is actually assigned to in the reference data set} 21 | 22 | \item{cellsPerMerge}{Number of cells to include in each combo cell} 23 | 24 | \item{numberOfMerges}{Number of combo cells to include per cell type} 25 | 26 | \item{mergeFunction}{function for combining cells into combo cells (use rowMeans or rowMedians)} 27 | 28 | \item{seed}{for resproducibility} 29 | 30 | \item{genesToMap}{which genes to include in the correlation mapping} 31 | } 32 | \value{ 33 | list where first element is data matrix of multi-cells by genes and 34 | second element is a vector of corresponding clusters 35 | } 36 | \description{ 37 | Creates a new reference set as input for cellToClusterMapping_byRank, where each 'cell' is the 38 | combiniation of several cells and this is run several times using different subsets of cells. 39 | } 40 | -------------------------------------------------------------------------------- /man/getBetaScore.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{getBetaScore} 4 | \alias{getBetaScore} 5 | \title{Get binary (aka beta) score} 6 | \usage{ 7 | getBetaScore(propExpr, returnScore = TRUE, spec.exp = 2) 8 | } 9 | \arguments{ 10 | \item{propExpr}{a matrix of proportions of cells (rows) in a given cluster (columns) with 11 | CPM/FPKM > 1 (or 0, HCT uses 1)} 12 | 13 | \item{returnScore}{if TRUE returns the score, if FALSE returns the ranks} 14 | 15 | \item{spec.exp}{scaling factor (recommended to leave as default)} 16 | } 17 | \value{ 18 | returns a numeric vector of beta score (or ranks) 19 | } 20 | \description{ 21 | Returns a beta score which indicates the binaryness of a gene across clusters. High scores 22 | (near 1) indicate that a gene is either on or off in nearly all cells of every cluster. 23 | Scores near 0 indicate a cells is non-binary (e.g., not expressed, ubiquitous, or 24 | randomly expressed). This value is used for gene filtering prior to defining clustering. 25 | } 26 | -------------------------------------------------------------------------------- /man/getBranchList.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{getBranchList} 4 | \alias{getBranchList} 5 | \title{Branch list} 6 | \usage{ 7 | getBranchList( 8 | dend, 9 | branches = list(), 10 | allTips = as.character(dend \%>\% labels()) 11 | ) 12 | } 13 | \arguments{ 14 | \item{dend}{dendrogram for mapping. Ignored if medianDat is passed} 15 | 16 | \item{branches}{do not change from default} 17 | 18 | \item{allTips}{do not change from default} 19 | } 20 | \value{ 21 | a list of branch information for use with leafToNodeMedians 22 | } 23 | \description{ 24 | Returns branches of a dendrogram in a specific format 25 | } 26 | -------------------------------------------------------------------------------- /man/getConfusionMatrix.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{getConfusionMatrix} 4 | \alias{getConfusionMatrix} 5 | \title{Confusion matrix} 6 | \usage{ 7 | getConfusionMatrix(realCluster, foundCluster, proportions = TRUE) 8 | } 9 | \arguments{ 10 | \item{realCluster}{character vector of assigned clusters} 11 | 12 | \item{foundCluster}{character vector of mapped clusters} 13 | 14 | \item{proportions}{FALSE if the counts are to be returned and TRUE if the proportions are to be returned} 15 | } 16 | \description{ 17 | Returns a confusion matrix of the found (mapped) vs. real (assigned) clusters. 18 | } 19 | -------------------------------------------------------------------------------- /man/getDend.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{getDend} 4 | \alias{getDend} 5 | \title{Build a dendrogram from gene panel} 6 | \usage{ 7 | getDend(dat, distFun = function(x) return(as.dist(1 - WGCNA::cor(x))), ...) 8 | } 9 | \arguments{ 10 | \item{dat}{matrix of values (e.g., genes x clusters) for calculating the dendrogram} 11 | 12 | \item{distFun}{function for calculating distance matrix (default is correlation-based)} 13 | 14 | \item{...}{additional variables for distFun} 15 | } 16 | \value{ 17 | dendrogram 18 | } 19 | \description{ 20 | Build a dendrogram from an inputted data matrix. 21 | } 22 | -------------------------------------------------------------------------------- /man/getNodeHeight.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{getNodeHeight} 4 | \alias{getNodeHeight} 5 | \title{Get node height} 6 | \usage{ 7 | getNodeHeight(tree) 8 | } 9 | \arguments{ 10 | \item{tree}{a dendrogram object} 11 | } 12 | \value{ 13 | a vector of node heights 14 | } 15 | \description{ 16 | Returns the heights of each node, scaled from 0 (top) to 1 (leafs); this is a wrapper for dendextend functions 17 | } 18 | -------------------------------------------------------------------------------- /man/getTopMatch.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{getTopMatch} 4 | \alias{getTopMatch} 5 | \title{Get top leaf match} 6 | \usage{ 7 | getTopMatch(memb.cl) 8 | } 9 | \arguments{ 10 | \item{memb.cl}{membership scores for each leaf} 11 | } 12 | \value{ 13 | a matrix where first column is found cluster and second column is confidence score 14 | } 15 | \description{ 16 | Returns the top leaf match for each cell and the corresponding fraction mapping there. 17 | } 18 | -------------------------------------------------------------------------------- /man/get_subtree_label.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{get_subtree_label} 4 | \alias{get_subtree_label} 5 | \title{Gets subtree labels for lca function.} 6 | \usage{ 7 | get_subtree_label(dend) 8 | } 9 | \arguments{ 10 | \item{dend}{a cluster dendrogram} 11 | } 12 | \value{ 13 | vector of subtree labels 14 | } 15 | \description{ 16 | Gets subtree labels for lca function. 17 | } 18 | -------------------------------------------------------------------------------- /man/labelDend.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{labelDend} 4 | \alias{labelDend} 5 | \title{Label dendrogram nodes} 6 | \usage{ 7 | labelDend(dend, n = 1) 8 | } 9 | \arguments{ 10 | \item{dend}{dendrogram object} 11 | 12 | \item{distFun}{starting numeric node value (default=1)} 13 | } 14 | \value{ 15 | a list where the first item is the new dendrogram object and the 16 | second item is the final numeric node value. 17 | } 18 | \description{ 19 | Add numeric node labels to a dendrogram. 20 | } 21 | -------------------------------------------------------------------------------- /man/layerFraction.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{layerFraction} 4 | \alias{layerFraction} 5 | \title{Layer weights per cell} 6 | \usage{ 7 | layerFraction(layerIn, useLayer = "L1", cluster = NA, ...) 8 | } 9 | \arguments{ 10 | \item{layerIn}{a list corresponding to all layers of dissection for a given sample} 11 | 12 | \item{useLayer}{target layer} 13 | 14 | \item{cluster}{if passed the weights are smartly allocated based on laminar distributions by cluster} 15 | 16 | \item{...}{additional variables for smartLayerAllocation} 17 | } 18 | \value{ 19 | numeric vector with weights for cells in input layer 20 | } 21 | \description{ 22 | Returns a numeric vector saying how to weight a particular cell for each layer. This is a 23 | wrapper for smartLayerAllocation 24 | } 25 | -------------------------------------------------------------------------------- /man/layerScale.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{layerScale} 4 | \alias{layerScale} 5 | \title{Fraction of cells per layer} 6 | \usage{ 7 | layerScale(layerIn, layerNm = c("L1", "L2/3", "L4", "L5", "L6"), scale = TRUE) 8 | } 9 | \arguments{ 10 | \item{layerIn}{a list corresponding to all layers of dissection for a given sample} 11 | 12 | \item{layerNm}{names of all layers. set to NULL to have this calculated} 13 | 14 | \item{scale}{if TRUE (default), scale to the total number of cells} 15 | } 16 | \value{ 17 | vector indicating the fraction of cells in each layerNm layer 18 | } 19 | \description{ 20 | Determines the expected proportions in each layer based on input 21 | } 22 | -------------------------------------------------------------------------------- /man/lca.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{lca} 4 | \alias{lca} 5 | \title{Get lowest common ancestor (defined cluster pairs)} 6 | \usage{ 7 | lca(dend, l1, l2, l = rep(attr(dend, "label"), length(l1))) 8 | } 9 | \arguments{ 10 | \item{dend}{a cluster dendrogram} 11 | 12 | \item{l1}{a vector of node labels} 13 | 14 | \item{l2}{a second fector of node labels (of the same length as l1)} 15 | 16 | \item{l}{do not adjust; required for recursive function} 17 | } 18 | \value{ 19 | The function will return a vector for lowest common ancestor for every pair of nodes in l1 and l2 20 | } 21 | \description{ 22 | Maps a cluster back up the tree to the first node where the mapped and correct clusters agree. 23 | } 24 | -------------------------------------------------------------------------------- /man/leafToNodeMedians.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{leafToNodeMedians} 4 | \alias{leafToNodeMedians} 5 | \title{Return mean node expression} 6 | \usage{ 7 | leafToNodeMedians(dend, medianDat, branches = getBranchList(dend), fnIn = mean) 8 | } 9 | \arguments{ 10 | \item{dend}{dendrogram for mapping. Ignored if medianDat is passed} 11 | 12 | \item{medianDat}{median expression data at each node} 13 | 14 | \item{branches}{a particular format of branch information from the dendrogram structure} 15 | 16 | \item{fnIn}{function to use to wrap up to the node level (default = mean)} 17 | } 18 | \value{ 19 | a matrix of mean node expression (rows=genes, columns=nodes) 20 | } 21 | \description{ 22 | Define expression at a node as the MEAN expression for each leaf as default (using the 23 | median removes all specific marker genes!) 24 | } 25 | -------------------------------------------------------------------------------- /man/makeLCAtable.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{makeLCAtable} 4 | \alias{makeLCAtable} 5 | \title{Get lowest common ancestor (all cluster pairs in tree)} 6 | \usage{ 7 | makeLCAtable(dend, includeInternalNodes = FALSE, verbose = FALSE) 8 | } 9 | \arguments{ 10 | \item{dend}{a cluster dendrogram} 11 | 12 | \item{includeInternalNodes}{should internal nodes be included in the output?} 13 | 14 | \item{verbose}{if TRUE, status will be printed to the screen, since function is relatively slow 15 | for large trees (default FALSE)} 16 | } 17 | \value{ 18 | The function will return a vector for lowest common ancestor for every pair of leaves 19 | in dend. Vector names are l1|||l2 for string parsing in other functions. 20 | } 21 | \description{ 22 | Calculates the vector for lowest common ancestor for every pair of leaves in a tree and returns a 23 | vector in a specific format for faster look-up. 24 | } 25 | -------------------------------------------------------------------------------- /man/map_dend.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{map_dend} 4 | \alias{map_dend} 5 | \title{Tree-based mapping} 6 | \usage{ 7 | map_dend( 8 | dend, 9 | cl, 10 | dat, 11 | map.dat, 12 | select.cells, 13 | p = 0.8, 14 | low.th = 0.2, 15 | default.markers = NULL 16 | ) 17 | } 18 | \arguments{ 19 | \item{dend}{dendrogram for mapping} 20 | 21 | \item{cl}{factor indicating which cluster each cell type is actually assigned to 22 | in the reference data set} 23 | 24 | \item{dat}{normalized data of the REFERENCE data set} 25 | 26 | \item{map.dat}{normalized data of the MAPPING data set. Default is to map the 27 | data onto itself.} 28 | 29 | \item{p}{proportion of marker genes to include in each iteration of the mapping 30 | algorithm.} 31 | 32 | \item{low.th}{the minimum difference in Pearson correlation required to decide 33 | on which branch to map to. otherwise, a random branch is chosen.} 34 | 35 | \item{default.markers}{not used} 36 | } 37 | \value{ 38 | a matrix of confidence scores (from 0 to 100) with rows as cells and columns 39 | as tree node/leafs. Values indicate the fraction of permutations in which the cell 40 | mapped to that node/leaf using the subset of cells/genes in map_dend 41 | } 42 | \description{ 43 | Returns the mapping membership of each cell to each node and leaf using a 44 | tree-based method. This is a wrapper function for map_dend. 45 | } 46 | -------------------------------------------------------------------------------- /man/mergeFish.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{mergeFish} 4 | \alias{mergeFish} 5 | \title{Merge two fishScaleAndMap objects} 6 | \usage{ 7 | mergeFish(datFish1, datFish2) 8 | } 9 | \arguments{ 10 | \item{datFish1}{a fishScaleAndMap output list} 11 | 12 | \item{datFish2}{a second fishScaleAndMap output list.} 13 | } 14 | \value{ 15 | a new fishScaleAndMap output list with the two original ones merged 16 | } 17 | \description{ 18 | Merges all components of two fishScaleAndMap objects to create a new 19 | one. Note: only meta-data and mappingResults that is present in BOTH 20 | objects will be returned. 21 | } 22 | -------------------------------------------------------------------------------- /man/mfishtools.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \docType{package} 4 | \name{mfishtools} 5 | \alias{mfishtools} 6 | \title{mfishtools: Building Gene Sets and Mapping mFISH Data.} 7 | \description{ 8 | This repository includes code for gene selection for spatial transcriptomics methods and for 9 | mapping of spatial transcriptomics (or RNA-Seq data) onto a RNA-Seq reference. Specific topics include: 10 | 1) Correlation-based mapping of cells to reference cell types 11 | 2) Iterative building of gene panels a greedy algorithm with pre-defined constraints 12 | 3) Visualizations related to gene mapping a gene panel selection 13 | } 14 | -------------------------------------------------------------------------------- /man/outputTopConfused.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{outputTopConfused} 4 | \alias{outputTopConfused} 5 | \title{Table of confused clusters} 6 | \usage{ 7 | outputTopConfused(confusionProp, count = 10) 8 | } 9 | \arguments{ 10 | \item{confusionProp}{confusion matrix (e.g., output from getConfusionMatrix).} 11 | 12 | \item{count}{number of top confusions to show} 13 | } 14 | \value{ 15 | a 3 x count matrix of the top confused pairs of clusters with the three 16 | columns corresponding to mapped cluster, assigned cluster, and fraction of 17 | cells incorrectly mapped, respectively. 18 | } 19 | \description{ 20 | This function returns a table of the top confused clusters (assigned clusters 21 | incorrectly mapped) 22 | } 23 | -------------------------------------------------------------------------------- /man/plotConfusionVsConfidence.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{plotConfusionVsConfidence} 4 | \alias{plotConfusionVsConfidence} 5 | \title{Confusion plot vs. confidence} 6 | \usage{ 7 | plotConfusionVsConfidence( 8 | foundClusterAndScore, 9 | realCluster, 10 | RI = (31:100)/100, 11 | main = "\% mapping (blue) / correct (orange)", 12 | ylab = "Percent", 13 | xlab = "Fraction correctly mapped to leaf", 14 | type = "l", 15 | xlim = range(RI), 16 | ... 17 | ) 18 | } 19 | \arguments{ 20 | \item{foundClusterAndScore}{matrix where first column is found cluster and second column is 21 | confidence score (e.g., output from getTopMatch)} 22 | 23 | \item{realCluster}{character vector of assigned clusters} 24 | 25 | \item{...}{additional parameters for the plot function} 26 | } 27 | \description{ 28 | Produces line plots showing the percent of correctly mapped cells above a certain confidence value (or score). 29 | This is a wrapper for plot. 30 | } 31 | -------------------------------------------------------------------------------- /man/plotCorrectWithGenes.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{plotCorrectWithGenes} 4 | \alias{plotCorrectWithGenes} 5 | \title{Plot fraction correct} 6 | \usage{ 7 | plotCorrectWithGenes( 8 | frac, 9 | genes = names(frac), 10 | xlab = "Number of genes in panel", 11 | main = "All clusters gene panel", 12 | ylim = c(-10, 100), 13 | lwd = 5, 14 | ylab = "Percent of nuclei correctly mapping", 15 | colLine = "grey", 16 | ... 17 | ) 18 | } 19 | \arguments{ 20 | \item{frac}{a numeric vector indicating the fraction of cells correctly mapped for a given gene panel} 21 | 22 | \item{genes}{ordered character vector (e.g., of genes) to be plotted; default is names(frac)} 23 | 24 | \item{...}{additional parameters for plot.} 25 | } 26 | \description{ 27 | This function is a wrapper for plot designd for plotting the fraction correctly mapped for a given gene set. 28 | If geneN is the Nth gene, the plotted value indicates correct mapping using genes 1:N. 29 | } 30 | -------------------------------------------------------------------------------- /man/plotDistributions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{plotDistributions} 4 | \alias{plotDistributions} 5 | \title{Plot distributions} 6 | \usage{ 7 | plotDistributions( 8 | datIn, 9 | group, 10 | groups = NULL, 11 | colors = rep("black", dim(datIn$mapDat)[2]), 12 | colormap = gray.colors, 13 | maxrow = 12, 14 | pch = 19, 15 | cex = 1.5, 16 | xlim = NULL, 17 | ylim = NULL, 18 | main = "", 19 | xlab = "", 20 | ylab = "", 21 | ... 22 | ) 23 | } 24 | \arguments{ 25 | \item{datIn}{a fishScaleAndMap output list} 26 | 27 | \item{group}{a character vector (or factor) indicating how to split the data (e.g., cluster 28 | call) or a metadata/mappingResults column name} 29 | 30 | \item{groups}{a character vector of groups to show (default is levels of group)} 31 | 32 | \item{colors}{a character vector (or factor) indicating how to color the plots (e.g., layer 33 | or gene expression) or a metadata/mappingResults column name (default is all black)} 34 | 35 | \item{colormap}{function to use for the colormap for the data (default gray.colors)} 36 | 37 | \item{maxrow}{maximum number of plots to show in one row (default=12)} 38 | 39 | \item{pch, cex}{for plot. Can be single values or vectors} 40 | 41 | \item{xlim, ylim}{for plot, but will be calculated if not entered} 42 | 43 | \item{main, xlab, ylab, ...}{other parameters for plot (must be single values)} 44 | } 45 | \value{ 46 | Only returns if there is an error 47 | } 48 | \description{ 49 | Plot the distributions of cells across the tissue with overlaying color information. This is 50 | a wrapper function for plot 51 | } 52 | -------------------------------------------------------------------------------- /man/plotHeatmap.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{plotHeatmap} 4 | \alias{plotHeatmap} 5 | \title{Plot heatmap} 6 | \usage{ 7 | plotHeatmap( 8 | datIn, 9 | group, 10 | groups = NULL, 11 | grouplab = "Grouping", 12 | useScaled = FALSE, 13 | capValue = Inf, 14 | colormap = grey.colors(1000), 15 | pch = 19, 16 | xlim = NULL, 17 | ylim = NULL, 18 | Rowv = FALSE, 19 | Colv = FALSE, 20 | dendrogram = "none", 21 | trace = "none", 22 | margins = c(6, 10), 23 | rowsep = NULL, 24 | sepwidth = c(0.4, 0.4), 25 | key = FALSE, 26 | ... 27 | ) 28 | } 29 | \arguments{ 30 | \item{datIn}{a fishScaleAndMap output list} 31 | 32 | \item{group}{a character vector (or factor) indicating how to order the heatmap (e.g., cluster 33 | call) or a metadata/mappingResults column name} 34 | 35 | \item{groups}{a character vector of groups to show (default is levels of group)} 36 | 37 | \item{grouplab}{label for the grouping in the heatmap (default is 'Grouping' or the value for group)} 38 | 39 | \item{useScaled}{plot the scaled (TRUE) or unscaled (FALSE; default) values} 40 | 41 | \item{capValue}{values above capValue will be capped at capValue (default is none)} 42 | 43 | \item{colormap}{set of values to use for the colormap for the data (default heat_colors)} 44 | 45 | \item{Rowv, Colv, dendrogram, trace, margins, rowsep, colsep, key, ...}{other parameters for heatmap.2 46 | (some default values are different)} 47 | } 48 | \value{ 49 | Only returns if there is an error 50 | } 51 | \description{ 52 | Plot the heatmap of cells ordering by a specified order. This is a wrapper for heatmap.2 53 | } 54 | -------------------------------------------------------------------------------- /man/plotNodes.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{plotNodes} 4 | \alias{plotNodes} 5 | \title{Plot dendrogram} 6 | \usage{ 7 | plotNodes( 8 | tree, 9 | value = rep(1, length(labels(tree))), 10 | cexScale = 2, 11 | margins = c(10, 5, 2, 2), 12 | cols = "black", 13 | pch = 19, 14 | ... 15 | ) 16 | } 17 | \arguments{ 18 | \item{tree}{a dendrogram object} 19 | 20 | \item{value}{numeric vector corresponding to the size of each node} 21 | 22 | \item{cexScale}{a global cex multiplier for node sizes} 23 | 24 | \item{margins}{set the margins using par(mar=margins)} 25 | 26 | \item{cols}{vector of node colors (or a single value)} 27 | 28 | \item{pch}{vector of node pch shapes (or a single value)} 29 | 30 | \item{...}{additional parameters for the plot function} 31 | } 32 | \description{ 33 | Plots a dendrogram with set not colors, shapes, sizes and labels. This is a wrapper for plot. 34 | } 35 | -------------------------------------------------------------------------------- /man/plotTsne.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{plotTsne} 4 | \alias{plotTsne} 5 | \title{Plot TSNE} 6 | \usage{ 7 | plotTsne( 8 | datIn, 9 | colorGroup = "none", 10 | labelGroup = "none", 11 | useScaled = FALSE, 12 | capValue = Inf, 13 | perplexity = 10, 14 | theta = 0.5, 15 | main = "TSNE plot", 16 | maxNchar = 1000, 17 | seed = 10 18 | ) 19 | } 20 | \arguments{ 21 | \item{datIn}{a fishScaleAndMap output list} 22 | 23 | \item{colorGroup}{a character vector (or factor) indicating how to color the Tsne (e.g., cluster 24 | call) or a metadata/mappingResults column name (default=NULL)} 25 | 26 | \item{labelGroup}{a character vector (or factor) indicating how to label the Tsne (e.g., cluster 27 | call) or a metadata/mappingResults column name (default=NULL)} 28 | 29 | \item{useScaled}{plot the scaled (TRUE) or unscaled (FALSE; default) values} 30 | 31 | \item{capValue}{values above capValue will be capped at capValue (default is none)} 32 | 33 | \item{perplexity, theta}{other parameters for Rtsne} 34 | 35 | \item{main}{title of the plot} 36 | 37 | \item{maxNchar}{what is the maximum number of characters to display in the plot for each entry?} 38 | 39 | \item{seed}{for reproducibility} 40 | } 41 | \value{ 42 | Only returns if there is an error 43 | } 44 | \description{ 45 | Plot a TSNE of the data, with assigned colors and labels from provided variables. Note that this 46 | function is a modification of code from Pabloc (https://www.r-bloggers.com/author/pabloc/) from 47 | https://www.r-bloggers.com/playing-with-dimensions-from-clustering-pca-t-sne-to-carl-sagan/ 48 | } 49 | -------------------------------------------------------------------------------- /man/possibleClustersByPriors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{possibleClustersByPriors} 4 | \alias{possibleClustersByPriors} 5 | \title{Filter possible cluster calls using priors} 6 | \usage{ 7 | possibleClustersByPriors( 8 | cluster, 9 | layer, 10 | subsetVector = rep(TRUE, length(cluster)), 11 | useClusters = sort(unique(cluster)), 12 | rareLimit = 0.005, 13 | layerNm = c("L1", "L2/3", "L4", "L5", "L6"), 14 | scaleByLayer = TRUE, 15 | scaleByFn = max, 16 | smartWeight = TRUE, 17 | spillFactor = 0.15, 18 | weightCutoff = 0.02 19 | ) 20 | } 21 | \arguments{ 22 | \item{cluster}{vector of all clusters} 23 | 24 | \item{layer}{list of layers for each cluster entry (for data sets with only laminar dissections, 25 | each list entry will be of length 1)} 26 | 27 | \item{subsetVector}{a vector of TRUE/FALSE values indicated whether the entry is in the subset of 28 | interest (e.g., Cre lines); default is all} 29 | 30 | \item{useClusters}{a set of clusters to be considered a priori (e.g., GABA vs. glut); default is all} 31 | 32 | \item{rareLimit}{define any values less than this as 0. The idea is to exclude rare cells} 33 | 34 | \item{layerNm}{names of all layers. set to NULL to have this calculated} 35 | 36 | \item{scaleByLayer}{if TRUE, scales to the proportion of cells in each layer} 37 | 38 | \item{scaleByFn}{what function should be used for the layer scaling (default=max, ignored 39 | if scaleByLayer=FALSE)} 40 | 41 | \item{smartWeight}{if TRUE, multilayer dissections are weighted smartly by cluster, rather 42 | than evenly by cluster (FALSE)} 43 | 44 | \item{spillFactor}{fractional amount of cells in a layer below which it is assumed no cells 45 | are from that layer in multilayer dissection} 46 | 47 | \item{weightCutoff}{anything less than this is set to 0 for convenience} 48 | } 49 | \value{ 50 | a vector of possible clusters for cells that meet a set of priors for each layer 51 | } 52 | \description{ 53 | This function will return a vector of possible clusters for cells that meet a set of priors for each layer 54 | } 55 | -------------------------------------------------------------------------------- /man/quantileTruncate.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{quantileTruncate} 4 | \alias{quantileTruncate} 5 | \title{Quantile normalize, truncate, and scale} 6 | \usage{ 7 | quantileTruncate(x, qprob = 0.9, maxVal = 1, truncate = TRUE, ...) 8 | } 9 | \arguments{ 10 | \item{x}{input data vector} 11 | 12 | \item{qprob}{probs value to result from quantile (default=0.9)} 13 | 14 | \item{maxVal}{max value for scaling (default=1)} 15 | 16 | \item{truncate}{should data above the qprob threshold be truncated (default=yes)} 17 | 18 | \item{...}{not used} 19 | } 20 | \value{ 21 | scaled vector 22 | } 23 | \description{ 24 | Quantile normalize, truncate, and scale a numeric vector (e.g. mFISH data from one gene) 25 | } 26 | -------------------------------------------------------------------------------- /man/resolve_cl.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{resolve_cl} 4 | \alias{resolve_cl} 5 | \title{Tree-based mapping (internal)} 6 | \usage{ 7 | resolve_cl( 8 | cl.g, 9 | cl.med, 10 | markers, 11 | dat, 12 | map.dat, 13 | select.cells, 14 | p = 0.7, 15 | low.th = 0.2 16 | ) 17 | } 18 | \arguments{ 19 | \item{cl.g}{all clusters} 20 | 21 | \item{cl.med}{cluster medians} 22 | 23 | \item{markers}{gene markers} 24 | 25 | \item{dat}{normalized data of the REFERENCE data set} 26 | 27 | \item{map.dat}{normalized data of the MAPPING data set. Default is to map the data onto itself.} 28 | 29 | \item{select.cells}{which cells to use?} 30 | 31 | \item{p}{proportion of marker genes to include in each iteration of the mapping algorithm.} 32 | 33 | \item{low.th}{the minimum difference in Pearson correlation required to decide on which branch 34 | to map to. otherwise, a random branch is chosen.} 35 | } 36 | \value{ 37 | a vector of the mapped cluster 38 | } 39 | \description{ 40 | Returns the mapped cluster call of each cell to each leaf. This function is called by map_dend 41 | } 42 | -------------------------------------------------------------------------------- /man/rfTreeMapping.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{rfTreeMapping} 4 | \alias{rfTreeMapping} 5 | \title{Tree-based mapping} 6 | \usage{ 7 | rfTreeMapping( 8 | dend, 9 | refDat, 10 | clustersF, 11 | mapDat = refDat, 12 | p = 0.7, 13 | low.th = 0.15, 14 | seed = 1 15 | ) 16 | } 17 | \arguments{ 18 | \item{dend}{dendrogram for mapping} 19 | 20 | \item{refDat}{normalized data of the REFERENCE data set} 21 | 22 | \item{clustersF}{factor indicating which cluster each cell type is actually assigned to 23 | in the reference data set} 24 | 25 | \item{mapDat}{normalized data of the MAPPING data set. Default is to map the data onto itself.} 26 | 27 | \item{p}{proportion of marker genes to include in each iteration of the mapping algorithm.} 28 | 29 | \item{low.th}{the minimum difference in Pearson correlation required to decide on which branch 30 | to map to. otherwise, a random branch is chosen.} 31 | 32 | \item{seed}{added for reproducibility} 33 | } 34 | \value{ 35 | a matrix of confidence scores (from 0 to 100) with rows as cells and columns 36 | as tree node/leafs. Values indicate the fraction of permutations in which the cell 37 | mapped to that node/leaf using the subset of cells/genes in map_dend 38 | } 39 | \description{ 40 | Returns the mapping membership of each cell to each node and leaf using a 41 | tree-based method. This is a wrapper function for map_dend. 42 | } 43 | -------------------------------------------------------------------------------- /man/rotateXY.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{rotateXY} 4 | \alias{rotateXY} 5 | \title{Rotate coordinates} 6 | \usage{ 7 | rotateXY(datFish, flatVector = NULL, flipVector = NULL, subset = NULL) 8 | } 9 | \arguments{ 10 | \item{datFish}{a fishScaleAndMap output list} 11 | 12 | \item{flatVector}{a TRUE/FALSE vector ordred in the same way as the elements (e.g., cells) 13 | in datIn where all TRUE values correspond to cells who should have the same Y coordinate 14 | (e.g., be in the same layer). Alternatively a numeric vector of cell indices to include} 15 | 16 | \item{flipVector}{a numeric vector of values to ensure proper reflection on Y-axes (e.g., 17 | layer; default=NULL)} 18 | 19 | \item{subset}{a boolean or numeric vector of the elements to retain} 20 | } 21 | \value{ 22 | a fishScaleAndMap output list with updated scaledX and scaleY coordinates 23 | } 24 | \description{ 25 | Rotates the scaledX and scaledY elements of a fishScaleAndMap output list so that the 26 | axis of interest (e.g., cortical layer) is paralled with the x cooridate plan. 27 | Rotation code is from https://stackoverflow.com/questions/15463462/rotate-graph-by-angle 28 | } 29 | -------------------------------------------------------------------------------- /man/smartLayerAllocation.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{smartLayerAllocation} 4 | \alias{smartLayerAllocation} 5 | \title{Layer weights per cell} 6 | \usage{ 7 | smartLayerAllocation( 8 | layerIn, 9 | useLayer = "L1", 10 | spillFactor = 0.15, 11 | weightCutoff = 0.02, 12 | layerNm = c("L1", "L2/3", "L4", "L5", "L6") 13 | ) 14 | } 15 | \arguments{ 16 | \item{layerIn}{a list corresponding to all layers of dissection for a given sample} 17 | 18 | \item{useLayer}{target layer} 19 | 20 | \item{spillFactor}{fractional amount of cells in a layer below which it is assumed no cells are 21 | from that layer in multilayer dissection} 22 | 23 | \item{weightCutoff}{anything less than this is set to 0 for convenience and to avoid rare types} 24 | 25 | \item{layerNm}{names of all layers. set to NULL to have this calculated} 26 | } 27 | \value{ 28 | numeric vector saying how to weight a particular cell for each layer, using a smart 29 | weighting strategy 30 | } 31 | \description{ 32 | Returns a numeric vector saying how to weight a particular cell for each layer, using a smart 33 | weighting strategy 34 | } 35 | -------------------------------------------------------------------------------- /man/subsampleCells.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/markerGenesAndMapping.r 3 | \name{subsampleCells} 4 | \alias{subsampleCells} 5 | \title{Subsample cells} 6 | \usage{ 7 | subsampleCells(clusters, subSamp = 25, seed = 5) 8 | } 9 | \arguments{ 10 | \item{clusters}{vector of cluster labels (or any category) in factor or character format} 11 | 12 | \item{subSamp}{maximum number of values for each category to subsample. Can be single integer 13 | for global subsampling, or a *named* vector corresponding to how many values to take from each 14 | category in clusters.} 15 | 16 | \item{seed}{for reproducibility} 17 | } 18 | \value{ 19 | returns a vector of TRUE / FALSE with a maximum of subSamp TRUE calls per category 20 | } 21 | \description{ 22 | Subsets a categorical vector to include up to a maximum number of values for each category. 23 | } 24 | -------------------------------------------------------------------------------- /man/summarizeMatrix.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mfishMapping.r 3 | \name{summarizeMatrix} 4 | \alias{summarizeMatrix} 5 | \title{Summarize matrix} 6 | \usage{ 7 | summarizeMatrix( 8 | mat, 9 | group, 10 | scale = "none", 11 | scaleQuantile = 1, 12 | binarize = FALSE, 13 | binMin = 0.5, 14 | summaryFunction = median, 15 | ... 16 | ) 17 | } 18 | \arguments{ 19 | \item{mat}{matrix where the columns (e.g., samples) are going to be grouped} 20 | 21 | \item{group}{vector of length dim(mat)[2] corresponding to the groups} 22 | 23 | \item{scale}{either 'none' (default),'row', or 'column'} 24 | 25 | \item{scaleQuantile}{what quantile of value should be set as 1 (default=1)} 26 | 27 | \item{binarize}{should the data be binarized? (default=FALSE)} 28 | 29 | \item{binMin}{minimum ON value for the binarized matrix (ignored if binarize=FALSE)} 30 | 31 | \item{summaryFunction}{function (or function name) to be used for summarization} 32 | 33 | \item{...}{additional parameters for summaryFunction} 34 | } 35 | \value{ 36 | matrix of summarized values 37 | } 38 | \description{ 39 | Groups columns in a matrix by a specified group vector and summarizes using a specificed function. 40 | Optionally binarizes the matrix using a specified cutoff parameter. This is a wrapper for tapply. 41 | } 42 | -------------------------------------------------------------------------------- /man/update_mfishtools.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/update.r 3 | \name{update_mfishtools} 4 | \alias{update_mfishtools} 5 | \title{Update the mfishtools library} 6 | \usage{ 7 | update_mfishtools() 8 | } 9 | \description{ 10 | Update the mfishtools library 11 | } 12 | -------------------------------------------------------------------------------- /mfishtools.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | BuildType: Package 16 | PackageUseDevtools: Yes 17 | PackageInstallArgs: --no-multiarch --with-keep.source 18 | PackageCheckArgs: –no-examples 19 | -------------------------------------------------------------------------------- /mfishtools_0.0.2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AllenInstitute/mfishtools/df6d7f1104b149698a3cebd7f82dd9848bc7328a/mfishtools_0.0.2.pdf -------------------------------------------------------------------------------- /vignettes/.gitignore: -------------------------------------------------------------------------------- 1 | *.R 2 | .Rhistory 3 | -------------------------------------------------------------------------------- /vignettes/inhibitory_marker_mapping.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Mapping mFISH data to reference data set" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{Mapping mFISH data to RNA-seq reference} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | \usepackage[utf8]{inputenc} 8 | --- 9 | 10 | This code reads in all of the data for an example mouse SST mFISH experiment and compares it against a reference FACs data set. **In this case we have very limited prior knowledge, so I am open to suggestions for how to check whether the results look reasonable.**. 11 | 12 | ### Workspace set-up 13 | 14 | Install the necessary packages. In this case we are using data from `tasic2016data` and plotting functions from `scrattch.vis`. 15 | 16 | ```{r install packages, eval=FALSE} 17 | install.packages("devtools") 18 | devtools::install_github("AllenInstitute/tasic2016data") # For our data example 19 | ``` 20 | 21 | 22 | Load libraries. 23 | 24 | ```{r load libraries} 25 | suppressPackageStartupMessages({ 26 | library(mfishtools) # This library! 27 | library(matrixStats) # For rowMedians function, which is fast 28 | library(gplots) # For some of the plots 29 | library(ggplot2) # For other plots 30 | library(tasic2016data) # For the data 31 | }) 32 | options(stringsAsFactors = FALSE) # IMPORTANT 33 | print("Libraries loaded.") 34 | ``` 35 | 36 | 37 | Read in the reference data (in this case we will use the Tasic 2016 data, which includes ~1800 cells from mouse primary visual cortex). 38 | 39 | ```{r load tasic data} 40 | annotations <- tasic_2016_anno 41 | #counts <- tasic_2016_counts # uncomment if using CPM below 42 | rpkm <- tasic_2016_rpkm 43 | annotations <- annotations[match(colnames(rpkm),annotations$sample_name),] # Put them in the correct order 44 | ``` 45 | 46 | 47 | Read in the mFISH data. An example data set is provided as part of `mfishtools`, and was loaded with the library. In this case, `fishData` is a matrix of gene expression levels (e.g., spot counts within a cell) for a given cell, with genes as rows and cells as columns. This is the same format as the RNA-seq data. 48 | 49 | ```{r what is mfish data} 50 | dim(fishData) 51 | ``` 52 | 53 | 54 | The `metadata` variable is a data frame with some specific requirements on column names. 55 | 56 | ```{r what is mfish metadata} 57 | colnames(metadata) 58 | ``` 59 | 60 | The required column names are as follows (and other column names are perfectly fine): 61 | - area = Area/volume of the cell (or just set to a constant) 62 | - experiment = Name of the experiment or experiments (or just set ot a constant) 63 | - layerData = Numeric call for the layer. Not requred, but useful for plotting an rotating x, y 64 | - x = X coordinate for cell (ideally this is the lateral coordinate) 65 | - y = Y coordinate for cell (ideally this is the laminar coordinate) 66 | 67 | 68 | ### Data preparations 69 | 70 | This analysis will only be looking at marker genes for GABAergic neurons, so we need to only consider cells mapping to GABAergic types. We also define some convenient cluster info variables here. 71 | 72 | ```{r define variables} 73 | clusterType = annotations$broad_type 74 | includeClas = "GABA-ergic Neuron" # In this analysis, we are only considering interneurons 75 | excludeClas = sort(setdiff(clusterType,includeClas)) 76 | kpSamp = !is.element(clusterType,excludeClas) 77 | anno = annotations[kpSamp,] 78 | cl = annotations$primary_type_label 79 | names(cl) = annotations$sample_name 80 | kpClust = sort(unique(cl[kpSamp])) 81 | ``` 82 | 83 | 84 | Convert the data to log2(rpkm). NOTE: we often use counts per million of introns + exons when performing this analysis. Currently, we don't know which method produces more reliable markers. Alternative code for calculating cpm is commented out below. 85 | 86 | ```{r convert to log2} 87 | normDat = log2(rpkm+1) 88 | #sf = colSums(counts)/10^6 89 | #cpms = t(t(counts)/sf) 90 | #normDat = log2(cpms+1) 91 | print("Data normalized!") 92 | ``` 93 | 94 | 95 | Calculate proportions, means, and medians. These are all used later for various reasons. One important thing to note is that we are using `cl` here as a vector of cell type calls for the RNA-seq data (with names corresponding to the column/sample names of the RNA-seq sample data); however, if you wanted to map to a more coarse definition of cell types, or to omit certain cell types, this would be the step to do it. The columns of these summary values calculated here define the cell types that mFISH data will be mapped against for the remainder of the vignette. 96 | 97 | ```{r calculate reference stats} 98 | exprThresh = 1 99 | medianExpr = do.call("cbind", tapply(names(cl), cl, function(x) rowMedians(normDat[,x]))) 100 | meanExpr = do.call("cbind", tapply(names(cl), cl, function(x) rowMeans(normDat[,x]))) 101 | propExpr = do.call("cbind", tapply(names(cl), cl, function(x) rowMeans(normDat[,x]>exprThresh))) 102 | rownames(medianExpr) <- rownames(propExpr) <- rownames(meanExpr) <- genes <- rownames(normDat) 103 | print("Summary values calculated!") 104 | ``` 105 | 106 | 107 | Consider only genes included in the mFISH experiment that are also present in the RNA-seq reference data set (usually this is all of them). 108 | 109 | ```{r format mFISH data} 110 | useGenes <- intersect(rownames(fishData),genes) # Define genes to be used in the analysis 111 | fishDat <- fishData[useGenes,] # Separate out the data from the metadata 112 | print("mFISH data is ready!") 113 | ``` 114 | 115 | 116 | ## Map the mFISH data! 117 | 118 | Do the mapping using the parameters defined below. *Currently these are all manually selected and a bit of a guess. I am open to suggestions for how to do this mapping in a more systematic way.* The idea behind this method is to attach various filtering and scaling strategies to the mFISH and RNA-seq data sets, and then use correlation-based mapping to find the best fitting cell cluster. My expectation (still to be tested) is that this strategy will be useful for smaller gene panels, but that other more computational-based strategies will be most effective for larger gene panels. 119 | 120 | ```{r map mFISH data, warning=FALSE} 121 | qprob <- 0.9 # Parameter for scaling mFISH to FACS 122 | thresh <- 3 # Set counts less than or equal to thresh to 0 1 123 | log2p1 <- function(x) return(log2(x+1)) # log transform function 124 | binarize <- FALSE # Should the data be binarized? 125 | weights <- NULL # Integer weights. SET TO NULL IF YOU DON'T KNOW WHAT YOU ARE DOING! 126 | #weights <- round(rowSums(fishDat)/min(rowSums(fishDat))) # Here is how to weight roughly by average expression level 127 | 128 | fishMouse <- fishScaleAndMap(mapDat=fishDat, refSummaryDat=medianExpr, 129 | mappingFunction = cellToClusterMapping_byCor, transform = log2p1, noiselevel = thresh, 130 | genesToMap = useGenes, metadata = metadata, qprob=qprob, binarize=binarize, 131 | omitGenes = NULL,integerWeights=weights) 132 | print("Data is mapped.") 133 | ``` 134 | 135 | 136 | #### View the mapping results 137 | 138 | First rotate the X and Y coordinate space to that layer 2 is parallel to the X axis. This won't be perfect since the tissue is not perfectly linear, but it will make for easier viewing of the data. This requires some meta-data variable which marks a subset of genes roughly along align (e.g., cortical layer calls). 139 | 140 | ```{r rotateXY, fig.width=6, fig.height=6} 141 | rotateAxis <- fishMouse$metadata$layerData==4 142 | flipVector <- fishMouse$metadata$layerData 143 | for (e in unique(fishMouse$metadata$experiment)){ 144 | subset <- fishMouse$metadata$experiment==e 145 | fishMouse <- rotateXY(fishMouse,rotateAxis,flipVector,subset) 146 | } 147 | ``` 148 | 149 | 150 | Now plot the location of all of the cells in the tissue section. 151 | 152 | ```{r plot cell locations, fig.width=6, fig.height=6} 153 | sc <- function(n,...) return(c("brown","pink","orange","turquoise","blue","green")[1:n]) # Standard colors without yellow 154 | lay <- as.character(fishMouse$metadata$layerData) 155 | plotDistributions(fishMouse,group = "experiment", xlab="Mouse - All cells", ylab="Layer", 156 | colors=lay, colormap=sc,maxrow=8,cex=1) 157 | ``` 158 | 159 | 160 | Show the cell distribution across all types in the tissue. 161 | 162 | ```{r plot mFISH distributions,fig.width=20,fig.height=20} 163 | allClusts <- colnames(medianExpr) 164 | plotDistributions(fishMouse,group = "Class", groups = allClusts, colors = lay, pch=lay, xlab="Mouse", 165 | ylab="Layer",colormap=sc,maxrow=8) 166 | ``` 167 | 168 | There are some broad observations that suggest we are not too far off. First, nearly all of the cells map to inhibitory types as expected. Second, many of the cell types have layer signatures, with SST/PVALB types more likely to be in deep layers and other inhibitory types more likely to be in upper layers. 169 | 170 | Now let's plot the heatmap to see how the data looks when clustered along the tree like this. First, unscaled data, capped at 10 counts. 171 | 172 | ```{r Plot mFISH heatmap (unscaled), fig.width=24,fig.height=6} 173 | cap = 10 174 | colorset = c("darkblue", "dodgerblue", "gray80", "orange", "orangered") 175 | heat_colors <- colorRampPalette(colorset)(1001) 176 | plotHeatmap(fishMouse,main="Mouse cells (unscaled, cap=10)",group="Class",groups=allClusts,capValue=cap, 177 | colormap=heat_colors,rowsep=NULL,dendrogram="row",Rowv = TRUE,margins = c(8,6)) 178 | ``` 179 | 180 | Next, let's plot the scaled heatmaps for comparison. This is the data that is used in the mapping. 181 | 182 | ```{r Plot mFISH heatmap (scaled), fig.width=24,fig.height=6} 183 | cap = 8 184 | colorset = c("darkblue", "dodgerblue", "gray80", "orange", "orangered") 185 | heat_colors <- colorRampPalette(colorset)(1001) 186 | plotHeatmap(fishMouse,main="Mouse cells (scaled, cap=8)",group="Class",groups=allClusts,capValue=cap,colormap=heat_colors, 187 | rowsep=NULL,dendrogram="row",Rowv = TRUE,margins = c(8,6),useScaled=TRUE) 188 | ``` 189 | 190 | A few genes have quite low expression and probably should be omitted from future experiments. Otherwise, by eye things look reasonable because there do appear to be distinct expression patterns across clusters (although it is really hard to tell by eye, and some of the cells in different blocks seem like they should be grouped together). 191 | 192 | 193 | ## Quantitative sanity check 194 | 195 | So far we have focused on getting the results and trying to determine agreement with expections based on resulting plots. The results seem mixed so far. We now want to do what I am calling a quantitative sanity check, where we compare results obtained by different computational methods, or between RNA-Seq and mFISH, or taking prior knowledge into consideration. Ideally we can build a mapping alorithm that adjusts parameters to try and optimize the results based on priors (or something to this effect), but for now we want to have quick ways of seeing what looks right and what looks wrong to help us make adjustments. 196 | 197 | First, let's see whether the proportion of cells identified in RNA-seq and mFISH agree. Note that we would only expect this to be the case when we have unbiased surveyed in both modalities, which is not the case here. 198 | 199 | ```{r type wrap-up, fig.width=8,fig.height=8} 200 | countFish9 <- table(fishMouse$mappingResults$Class) 201 | countFish9 <- countFish9[countFish9>0] 202 | countSeq <- table(anno$primary_type_label) 203 | fs <- intersect(names(countFish9),names(countSeq)) 204 | plot(as.numeric(countSeq[fs]),as.numeric(countFish9[fs]),main="Sst types", 205 | pch=19,col="grey",ylab="FISH cell count",xlab="RNA-seq cell count",xlim=c(0,max(as.numeric(countSeq[fs]))*1.1)) 206 | text(as.numeric(countSeq[fs]),as.numeric(countFish9[fs]),fs,cex=0.7,srt=0) 207 | ``` 208 | 209 | 210 | There is not great agreement between modalities. What if we wrap up by class? 211 | 212 | ```{r broad class wrap-up, fig.width=8,fig.height=8} 213 | # Second, broad class 214 | val = list(fishMouse$mappingResults$Class,anno$primary_type_label) 215 | for (i in 1:2){ 216 | val[[i]] <- as.character(lapply(val[[i]], function(x) strsplit(x," ")[[1]][1])) # Get the class within interneurons 217 | } 218 | countFish <- table(val[[1]]) 219 | countSeq <- table(val[[2]]) 220 | fs <- intersect(val[[1]],val[[2]]) 221 | plot(as.numeric(countSeq[fs]),as.numeric(countFish[fs]),main="Broad classes", 222 | pch=19,col="grey",ylab="FISH cell count",xlab="RNA-seq cell count", 223 | xlim=c(0,max(as.numeric(countSeq[fs]))*1.1),ylim=c(0,max(as.numeric(countFish[fs]))*1.1)) 224 | text(as.numeric(countSeq[fs]),as.numeric(countFish[fs]),fs,cex=0.7,srt=0) 225 | ``` 226 | 227 | Wrapping up by class, there is much better agreement between proportions across modalities. Whiel this is not necessarily something we'd expect to find, it is still useful to note. 228 | 229 | One possibility (that I think is likely the case for at least a subset of cells) is that there is some mapping issues. We can try and understand why cells are mapping specific ways by plotting expression levels between RNA-seq and mFISH for each cell type. Let's do that. 230 | 231 | ```{r mean expression comparison, fig.width=18,fig.height=11} 232 | meanFish <- summarizeMatrix(fishMouse$scaleDat,fishMouse$mappingResults$Class,summaryFunction = mean)[useGenes,] 233 | meanFish <- meanFish[,intersect(colnames(meanFish),colnames(medianExpr))] 234 | medSeq <- meanExpr[useGenes,colnames(meanFish)] 235 | corFS <- NULL 236 | par(mfrow=c(4,7)) 237 | for (ct in intersect(allClusts,colnames(medSeq))){ 238 | corMR <- signif(cor(medSeq[,ct],meanFish[,ct]),3) 239 | plot(medSeq[,ct],meanFish[,ct],xlab="RNA-seq",ylab=ct,pch=19,col="white", main=paste("R =",corMR)) 240 | text(medSeq[,ct],meanFish[,ct],rownames(medSeq),cex=0.9) 241 | corFS <- c(corFS,cor(medSeq[,ct],meanFish[,ct])) 242 | } 243 | names(corFS) <- intersect(allClusts,colnames(medSeq)) 244 | ``` 245 | 246 | Overall, there is quite good agreement for most, but not all, of these cell types. Similarly, this type of plot can give us a sense of which cell types we can be confident in and which ones we can't, although the exact relationship between confidence of mapping and mapping accuracy remains TBD. 247 | 248 | 249 | Now make a TSNE plot on all genes using the mFISH data. Do the data cluster by color, and show the points also corresponding to cluster (abbreviated). 250 | 251 | ```{r Plot mFISH TSNE (scaled), fig.width=10,fig.height=8} 252 | cap=10 253 | fishPlot <- fishMouse 254 | fishPlot$mappingResults$Broad <- val[[1]] 255 | fishPlot <- filterCells(fishPlot,fishPlot$mappingResults$Class!="none") 256 | p=plotTsne(fishPlot,main="Mouse cells (scaled)",colorGroup="Broad",labelGroup = "Broad", 257 | capValue=cap, useScaled=TRUE, perplexity = 10, maxNchar=5) 258 | p 259 | ``` 260 | 261 | Overall it seems that most cells map reasonably well by broad class calls, although it looks like there are a few errors. 262 | 263 | Finally, repeat but with the points corresponding to layer. 264 | 265 | ```{r Plot mFISH TSNE (scaled, color by layer), fig.width=10,fig.height=6} 266 | p=plotTsne(fishPlot,main="Mouse Sst cells (scaled)",colorGroup="Class",labelGroup = "layerData", 267 | capValue=cap, useScaled=TRUE, perplexity = 10) 268 | p 269 | ``` 270 | 271 | The layer and cell type appear to provide some complementary information. 272 | 273 | **It is important to note that these tools are very much in development and minimally tested, but even so we hope that that are useful.** 274 | -------------------------------------------------------------------------------- /vignettes/inhibitory_marker_selection.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Marker panel generation" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{Marker panel generation} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | \usepackage[utf8]{inputenc} 8 | --- 9 | 10 | This vignette demonstrates how to generate an optimal marker gene panel (of length N), and then predicts how well each of the FACs cells map into the types defined as a measure of panel quality. 11 | 12 | The strategy used is a correlation-baseed greedy algorithm, which aims to minimize the distance between the actual and predicted clusters (rather than maximizing the fraction correctly mapping to each cluster). 13 | 14 | ### Workspace set-up 15 | 16 | Install the necessary packages. In this case we are using data from `tasic2016data` and plotting functions from `scrattch.vis`. 17 | 18 | ```{r install packages, eval=FALSE} 19 | install.packages("devtools") 20 | devtools::install_github("AllenInstitute/scrattch.vis") # For plotting 21 | devtools::install_github("AllenInstitute/tasic2016data") # For our data example 22 | ``` 23 | 24 | 25 | Load libraries. 26 | 27 | ```{r load libraries} 28 | suppressPackageStartupMessages({ 29 | library(mfishtools) # This library! 30 | library(gplots) # This is for plotting gene panels only. 31 | library(scrattch.vis) # This is for plotting gene panels only. 32 | library(matrixStats) # For rowMedians function, which is fast 33 | library(tasic2016data) # For the data 34 | }) 35 | options(stringsAsFactors = FALSE) # IMPORTANT 36 | print("Libraries loaded.") 37 | ``` 38 | 39 | 40 | Read in the data (in this case we will use the Tasic 2016 data, which includes ~1800 cells from mouse primary visual cortex). 41 | 42 | ```{r load tasic data} 43 | annotations <- tasic_2016_anno 44 | counts <- tasic_2016_counts 45 | rpkm <- tasic_2016_rpkm 46 | annotations <- annotations[match(colnames(counts),annotations$sample_name),] # Put them in the correct order 47 | ``` 48 | 49 | ### Data preparations 50 | 51 | This analysis will only be looking at marker genes for GABAergic neurons, so we need to only consider cells mapping to GABAergic types. We also define some convenient cluster info variables here. 52 | 53 | ```{r define variables} 54 | clusterType = annotations$broad_type 55 | includeClas = "GABA-ergic Neuron" # In this analysis, we are only considering interneurons 56 | excludeClas = sort(setdiff(clusterType,includeClas)) 57 | gliaClas = setdiff(excludeClas,"Glutamatergic Neuron") 58 | kpSamp = !is.element(clusterType,excludeClas) 59 | anno = annotations[kpSamp,] 60 | cl = annotations$primary_type_label 61 | names(cl) = annotations$sample_name 62 | kpClust = sort(unique(cl[kpSamp])) 63 | gliaClust = sort(unique(cl[is.element(clusterType,gliaClas)])) 64 | ``` 65 | 66 | 67 | Convert the data to log2(rpkm). NOTE: we often use counts per million of introns + exons when performing this analysis. Currently, we don't know which method produces more reliable markers. Alternative code for calculating cpm is commented out below. 68 | 69 | ```{r convert to log2} 70 | normDat = log2(rpkm+1) 71 | #sf = colSums(counts)/10^6 72 | #cpms = t(t(counts)/sf) 73 | #normDat = log2(cpms+1) 74 | ``` 75 | 76 | 77 | Calculate proportions and medians. These are both needed for gene filtering and for marker selection. 78 | 79 | ```{r calculate proportions and medians} 80 | exprThresh = 1 81 | medianExpr = do.call("cbind", tapply(names(cl), cl, function(x) rowMedians(normDat[,x]))) 82 | propExpr = do.call("cbind", tapply(names(cl), cl, function(x) rowMeans(normDat[,x]>exprThresh))) 83 | rownames(medianExpr) <- rownames(propExpr) <- genes <- rownames(normDat) 84 | ``` 85 | 86 | 87 | ### Combinatorial marker gene selection 88 | 89 | This section is where gene selection happens. There are two steps: (1) gene filtering and (2) marker selection, both of which are described below. 90 | 91 | We first want to define some gene filters prior to running gene selection. Note that this filtering occurs prior to gene selection and does not occur during the selection process. To see details about all of the specific filters, see the code block below or use `?filterPanelGenes`. Overall, the goal of this section is to exclude genes that won't work properly with the specific spatial transcriptomics method desired because their expression is too low or too high, or they are too short. It also removes genes with too much off-target expression, genes that are expressed in too many cell types, and genes that are potentially not of interest because they are unannotated, or on a sex chromosome, or are don't work for any other reason. It also takes the most binary genes as the possible selection space in order to try and avoid genes whose only cell type differences are in magnitude. In this case, we will use a total of 250 binary genes. 92 | 93 | ```{r filter genes for panel selection} 94 | startingGenePanel <- c("Gad1","Slc32a1","Pvalb","Sst","Vip") 95 | runGenes <- NULL 96 | runGenes <- filterPanelGenes( 97 | summaryExpr = 2^medianExpr-1, # medians (could also try means); We enter linear values to match the linear limits below 98 | propExpr = propExpr, # proportions 99 | onClusters = kpClust, # clusters of interest for gene panel 100 | offClusters = gliaClust, # clusters to exclude expression 101 | geneLengths = NULL, # vector of gene lengths (not included here) 102 | startingGenes = startingGenePanel, # Starting genes (from above) 103 | numBinaryGenes = 250, # Number of binary genes (explained below) 104 | minOn = 10, # Minimum required expression in highest expressing cell type 105 | maxOn = 500, # Maximum allowed expression 106 | maxOff = 50, # Maximum allowed expression in off types (e.g., aviod glial expression) 107 | minLength = 960, # Minimum gene length (to allow probe design; ignored in this case) 108 | fractionOnClusters = 0.5, # Max fraction of on clusters (described above) 109 | excludeGenes = NULL, # Genes to exclude. Often sex chromosome or mitochondrial genes would be input here. 110 | excludeFamilies = c("LOC","Fam","RIK","RPS","RPL","\\-","Gm","Rnf","BC0")) # Avoid LOC markers, in this case 111 | ``` 112 | 113 | 114 | The second step is our marker panel selection. This strategy uses a greedy algorithm to iteratively add the "best" gene to the existing panel until the panel reaches a certain size. Specifically, each cell in the reference data set is correlated with each cluster median using the existing marker gene set, and the most highly correlated cluster is compared with the originally assigned cluster for each cell. By default the algorithm tries to optimize the fraction of cells correctly mapping to each cluster by using this correlation-based mapping. An alternative strategy that is particularly useful for smaller gene panels is to use a weighting strategy to penalize cells that map to a distant cluster more than cells that map to a nearby cluster. To do this we iteratively choose genes from the starting gene panel such that the addition of each gene minimizes the correlation distance between clusters. 115 | 116 | To do this we first determine the cluster distance based on correlation of top binary genes, which is what is done in this code block. 117 | 118 | ```{r identify binary markers} 119 | corDist <- function(x) return(as.dist(1-cor(x))) 120 | clusterDistance <- as.matrix(corDist(medianExpr[runGenes,kpClust])) 121 | print(dim(clusterDistance)) 122 | ``` 123 | 124 | 125 | This step constructs the gene panel, as described above. Once again, use `?buildMappingBasedMarkerPanel` and see the code block below for details on the parameters, but there are a few key options. First, we find it useful to subsample cells from each cluster which decreases the time for the algorithm to run and also more evenly weights the clusters for gene selection. 126 | 127 | ```{r build gene panels} 128 | fishPanel <- buildMappingBasedMarkerPanel( 129 | mapDat = normDat[runGenes,kpSamp], # Data for optimization 130 | medianDat = medianExpr[runGenes,kpClust], # Median expression levels of relevant genes in relevant clusters 131 | clustersF = cl[kpSamp], # Vector of cluster assignments 132 | panelSize = 30, # Final panel size 133 | currentPanel = startingGenePanel, # Starting gene panel 134 | subSamp = 15, # Maximum number of cells per cluster to include in analysis (20-50 is usually best) 135 | optimize = "CorrelationDistance", # CorrelationDistance maximizes the cluster distance as described 136 | clusterDistance = clusterDistance, # Cluster distance matrix 137 | percentSubset = 50 # Only consider a certain percent of genes each iteration to speed up calculations (in most cases this is not recommeded) 138 | ) 139 | 140 | ``` 141 | 142 | This is the panel! 143 | 144 | ### Assess panel quality 145 | 146 | First, let's plot the panel in the context of the clusters we care about. This can be done using the `scrattch.vis` library. 147 | 148 | ```{r panel gene plot in subset, fig.width=7,fig.height=10} 149 | plotGenes <- fishPanel 150 | plotData <- cbind(sample_name = colnames(rpkm), as.data.frame(t(rpkm[plotGenes,]))) 151 | clid_inh <- 1:23 # Cluster IDs for inhibitory clusters 152 | 153 | # violin plot example. Could be swapped with fire, dot, bar, box plot, heatmap, Quasirandom, etc. 154 | sample_fire_plot(data = plotData, anno = annotations, genes = plotGenes, grouping = "primary_type", 155 | log_scale=TRUE, max_width=15, label_height = 8, group_order = clid_inh) 156 | ``` 157 | 158 | The first half of this panel looks like most of the genes have fairly distinct and binarized patterning, while many of the genes in the latter half of the panel appear to be showing redundant information, suggesting that after a certain point this strategy becomes less than optimal in a practical sense. 159 | 160 | How do they look across ALL cell types (realizing that we don't expect to see all these types in the tissue we care about)? 161 | 162 | ```{r panel gene plot across all types, fig.width=14,fig.height=10} 163 | sample_fire_plot(data = plotData, anno = annotations, genes = plotGenes, grouping = "primary_type", 164 | log_scale=TRUE, max_width=15, label_height = 8) 165 | ``` 166 | 167 | We get a little bit of excitatory and glial cell separation for free, but it's not great (as expected). 168 | 169 | What fraction of cells are correctly mapped to leaf nodes? Note that we don't necessarily expect this number to be high. Also note that this is using all of the above genes, so will actually be higher than we would get with any given 9-gene panel. 170 | 171 | ```{r dsiplay fraction correctly mapped} 172 | assignedCluster <- suppressWarnings(getTopMatch(corTreeMapping(mapDat = normDat[runGenes,kpSamp], 173 | medianDat=medianExpr[runGenes,kpClust], genesToMap=fishPanel))) 174 | print(paste0("Percent correctly mapped: ",signif(100*mean(as.character(assignedCluster$TopLeaf)==cl[kpSamp],na.rm=TRUE),3),"%")) 175 | ``` 176 | 177 | Around ~72% of the cells are correctly mapped with this panel, which is reasonable, but less than ideal. How does the plot look for fewer genes? 178 | 179 | ```{r plot fraction correctly mapped,fig.width=9,fig.height=6} 180 | fractionCorrectWithGenes(fishPanel,normDat[,kpSamp],medianExpr[runGenes,kpClust],cl[kpSamp], 181 | main="Mapping quality for different numbers of included genes",return=FALSE) 182 | ``` 183 | 184 | This plot suggests that with 30 genes we are continuing to see improvement by adding new genes; however, by approximately a 20 gene panel the level of improvement decreases dramatically for each gene added. **Later releases of this library will include additional strategies for optimizing the panel beyond these initial 20 or so genes.** 185 | 186 | Finally, as an overview, we create a confusion matrix based on the top leaf assignments. This will let us address which pairs of clusters are the most confused. Are they adjacent/nearby on the tree? Note that the colors are distorted to highlight confusion in the tree. 187 | 188 | ```{r cluster confusion, fig.height=8,fig.width=8} 189 | membConfusionProp <- getConfusionMatrix(cl[kpSamp],assignedCluster[,1],TRUE) 190 | clOrd <- (annotations$primary_type_label[match(clid_inh,annotations$primary_type_id)]) # Cluster order 191 | heatmap.2(pmin(membConfusionProp,0.25)[clOrd,clOrd],Rowv=FALSE,Colv=FALSE,trace="none",dendrogram="none", 192 | margins=c(16,16),main="Confusion Matrix") 193 | ``` 194 | 195 | Most of the errors are nearby on the diagonal, which is what we were optimizing for using the clusterDistance strategy. 196 | --------------------------------------------------------------------------------