├── .Rbuildignore
├── .gitignore
├── CONTRIBUTION
├── DESCRIPTION
├── LICENSE
├── NAMESPACE
├── R
    ├── markerGenesAndMapping.r
    ├── mfishMapping.r
    └── update.r
├── README.md
├── data
    ├── fishData.rda
    └── metadata.rda
├── man
    ├── buildMappingBasedMarkerPanel.Rd
    ├── buildPanel_oneCluster.Rd
    ├── buildQualityTable.Rd
    ├── buildTreeFromGenePanel.Rd
    ├── cellToClusterMapping_byCor.Rd
    ├── cellToClusterMapping_byRank.Rd
    ├── corTreeMapping.Rd
    ├── corTreeMapping_withFilter.Rd
    ├── distTreeMapping.Rd
    ├── filterByClass.Rd
    ├── filterCells.Rd
    ├── filterPanelGenes.Rd
    ├── fishScaleAndMap.Rd
    ├── fractionCorrectPerNode.Rd
    ├── fractionCorrectWithGenes.Rd
    ├── generateMultipleCellReferenceSet.Rd
    ├── getBetaScore.Rd
    ├── getBranchList.Rd
    ├── getConfusionMatrix.Rd
    ├── getDend.Rd
    ├── getNodeHeight.Rd
    ├── getTopMatch.Rd
    ├── get_subtree_label.Rd
    ├── labelDend.Rd
    ├── layerFraction.Rd
    ├── layerScale.Rd
    ├── lca.Rd
    ├── leafToNodeMedians.Rd
    ├── makeLCAtable.Rd
    ├── map_dend.Rd
    ├── mergeFish.Rd
    ├── mfishtools.Rd
    ├── outputTopConfused.Rd
    ├── plotConfusionVsConfidence.Rd
    ├── plotCorrectWithGenes.Rd
    ├── plotDistributions.Rd
    ├── plotHeatmap.Rd
    ├── plotNodes.Rd
    ├── plotTsne.Rd
    ├── possibleClustersByPriors.Rd
    ├── quantileTruncate.Rd
    ├── resolve_cl.Rd
    ├── rfTreeMapping.Rd
    ├── rotateXY.Rd
    ├── smartLayerAllocation.Rd
    ├── subsampleCells.Rd
    ├── summarizeMatrix.Rd
    └── update_mfishtools.Rd
├── mfishtools.Rproj
├── mfishtools_0.0.2.pdf
└── vignettes
    ├── .gitignore
    ├── inhibitory_marker_mapping.Rmd
    ├── inhibitory_marker_mapping.html
    ├── inhibitory_marker_selection.Rmd
    └── inhibitory_marker_selection.html


/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^Meta$
2 | ^doc$
3 | ^.*\.Rproj$
4 | ^\.Rproj\.user$
5 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | Meta
2 | doc
3 | .Rproj.user
4 | .Rhistory
5 | .RData
6 | .Ruserdata
7 | 


--------------------------------------------------------------------------------
/CONTRIBUTION:
--------------------------------------------------------------------------------
 1 | Allen Institute Contribution Agreement
 2 | 
 3 | This document describes the terms under which you may make “Contributions” — which may include without limitation, software additions, revisions, bug fixes, configuration changes, documentation, or any other materials — to any of the projects owned or managed by the Allen Institute. If you have questions about these terms, please contact us at terms@alleninstitute.org.
 4 | 
 5 | You certify that:
 6 | 
 7 | •	Your Contributions are either:
 8 |   1.	Created in whole or in part by you and you have the right to submit them under the designated license (described below); or
 9 |   2.	Based upon previous work that, to the best of your knowledge, is covered under an appropriate open source license and you have the right under that license to submit that work with modifications, whether created in whole or in part by you, under the designated license; or
10 |   3.	Provided directly to you by some other person who certified (1) or (2) and you have not modified them.
11 | 
12 | •	You are granting your Contributions to the Allen Institute under the terms of the 2-Clause BSD license (the “designated license”).
13 | 
14 | •	You understand and agree that the Allen Institute projects and your Contributions are public and that a record of the Contributions (including all metadata and personal information you submit with them) is maintained indefinitely and may be redistributed consistent with the Allen Institute’s mission and the 2-Clause BSD license.
15 | 


--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: mfishtools
 2 | Type: Package
 3 | Title: Building Gene Sets and Mapping mFISH Data
 4 | Version: 0.0.2
 5 | Author: Jeremy Miller <jeremym@alleninstitute.org>
 6 | Maintainer: Jeremy Miller <jeremym@alleninstitute.org>
 7 | Description: This repository includes code for gene selection for spatial transcriptomics methods and for mapping of spatial transcriptomics (or RNA-Seq data) onto a RNA-Seq reference. Specific topics include:
 8 |     1) Correlation-based mapping of cells to reference cell types
 9 |     2) Iterative building of gene panels a greedy algorithm with pre-defined constraints
10 |     3) Visualizations related to gene mapping a gene panel selection
11 | License: What license is it under?
12 | Encoding: UTF-8
13 | LazyData: true
14 | Imports:
15 |     dendextend (>= 1.7.0),
16 |     gplots (>= 3.0.1),
17 |     pdist (>= 1.2),
18 |     matrixStats (>= 0.53.1),
19 |     Rtsne (>= 0.13),
20 |     ggplot2 (>= 2.2.1),
21 |     scrattch.vis (>= 0.0),
22 |     dplyr (>= 0.3.4),
23 |     WGCNA (>= 1.0)
24 | RoxygenNote: 7.1.1
25 | Suggests: 
26 |     knitr,
27 |     rmarkdown
28 | VignetteBuilder: knitr
29 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | The 2-Clause BSD License
 2 | Note: This license has also been called the "Simplified BSD License" and the "FreeBSD License". See also the 3-clause BSD License.
 3 | 
 4 | Copyright (c) 2018. Allen Institute. 
 5 | 
 6 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
 7 | 
 8 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
 9 | 
10 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
11 | 
12 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
13 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
 1 | # Generated by roxygen2: do not edit by hand
 2 | 
 3 | export(buildMappingBasedMarkerPanel)
 4 | export(buildPanel_oneCluster)
 5 | export(buildQualityTable)
 6 | export(buildTreeFromGenePanel)
 7 | export(cellToClusterMapping_byCor)
 8 | export(cellToClusterMapping_byRank)
 9 | export(corTreeMapping)
10 | export(corTreeMapping_withFilter)
11 | export(distTreeMapping)
12 | export(filterByClass)
13 | export(filterCells)
14 | export(filterPanelGenes)
15 | export(fishScaleAndMap)
16 | export(fractionCorrectPerNode)
17 | export(fractionCorrectWithGenes)
18 | export(generateMultipleCellReferenceSet)
19 | export(getBetaScore)
20 | export(getBranchList)
21 | export(getConfusionMatrix)
22 | export(getDend)
23 | export(getNodeHeight)
24 | export(getTopMatch)
25 | export(get_subtree_label)
26 | export(labelDend)
27 | export(layerFraction)
28 | export(layerScale)
29 | export(lca)
30 | export(leafToNodeMedians)
31 | export(makeLCAtable)
32 | export(map_dend)
33 | export(mergeFish)
34 | export(outputTopConfused)
35 | export(plotConfusionVsConfidence)
36 | export(plotCorrectWithGenes)
37 | export(plotDistributions)
38 | export(plotHeatmap)
39 | export(plotNodes)
40 | export(plotTsne)
41 | export(possibleClustersByPriors)
42 | export(quantileTruncate)
43 | export(resolve_cl)
44 | export(rfTreeMapping)
45 | export(rotateXY)
46 | export(smartLayerAllocation)
47 | export(subsampleCells)
48 | export(summarizeMatrix)
49 | export(update_mfishtools)
50 | 


--------------------------------------------------------------------------------
/R/markerGenesAndMapping.r:
--------------------------------------------------------------------------------
   1 | #' mfishtools: Building Gene Sets and Mapping mFISH Data.
   2 | #'
   3 | #' This repository includes code for gene selection for spatial transcriptomics methods and for
   4 | #' mapping of spatial transcriptomics (or RNA-Seq data) onto a RNA-Seq reference. Specific topics include:
   5 | #' 1) Correlation-based mapping of cells to reference cell types
   6 | #' 2) Iterative building of gene panels a greedy algorithm with pre-defined constraints
   7 | #' 3) Visualizations related to gene mapping a gene panel selection
   8 | #'
   9 | #' @docType package
  10 | #' @name mfishtools
  11 | NULL
  12 | 
  13 | 
  14 | #' Table of confused clusters
  15 | #'
  16 | #' This function returns a table of the top confused clusters (assigned clusters
  17 | #'   incorrectly mapped)
  18 | #'
  19 | #' @param confusionProp confusion matrix (e.g., output from getConfusionMatrix).
  20 | #' @param count number of top confusions to show
  21 | #'
  22 | #' @return a 3 x count matrix of the top confused pairs of clusters with the three
  23 | #'   columns corresponding to mapped cluster, assigned cluster, and fraction of
  24 | #'   cells incorrectly mapped, respectively.
  25 | #'
  26 | #' @export
  27 | outputTopConfused <- function(confusionProp,
  28 |                               count = 10) {
  29 |   topConfused <- NULL
  30 |   mfp <- confusionProp
  31 |   diag(mfp) <- 0
  32 |   rn <- rownames(mfp)
  33 |   for (i in 1:count) {
  34 |     wt <- which(mfp == max(mfp), arr.ind = TRUE)
  35 |     wt <- t(wt[1, ])
  36 |     topConfused <- rbind(topConfused, c(rn[wt], max(mfp)))
  37 |     mfp[wt] <- 0
  38 |   }
  39 |   colnames(topConfused) <- c("foundCluster", "realCluster", "proportionOff")
  40 |   rownames(topConfused) <- NULL
  41 |   topConfused <- as.data.frame(topConfused)
  42 |   topConfused[, 3] <- as.numeric(topConfused[, 3])
  43 |   return(topConfused)
  44 | }
  45 | 
  46 | #' Confusion plot vs. confidence
  47 | #'
  48 | #' Produces line plots showing the percent of correctly mapped cells above a certain confidence value (or score).
  49 | #'   This is a wrapper for plot.
  50 | #'
  51 | #' @param foundClusterAndScore matrix where first column is found cluster and second column is
  52 | #'   confidence score (e.g., output from getTopMatch)
  53 | #' @param realCluster character vector of assigned clusters
  54 | #' @param ... additional parameters for the plot function
  55 | #'
  56 | #' @export
  57 | plotConfusionVsConfidence <- function(foundClusterAndScore,
  58 |                                       realCluster,
  59 |                                       RI = (31:100) / 100,
  60 |                                       main = "% mapping (blue) / correct (orange)",
  61 |                                       ylab = "Percent",
  62 |                                       xlab = "Fraction correctly mapped to leaf",
  63 |                                       type = "l",
  64 |                                       xlim = range(RI),
  65 |                                       ...) {
  66 |   fracMap <- fracRight <- NULL
  67 |   for (r in RI) {
  68 |     isMap <- foundClusterAndScore[, 2] >= r
  69 |     fracMap <- c(fracMap, round(1000 * mean(isMap)) / 10)
  70 |     isRight <- (realCluster == foundClusterAndScore[, 1])[isMap]
  71 |     fracRight <- c(fracRight, round(1000 * mean(isRight)) / 10)
  72 |   }
  73 |   plot(RI, fracMap,
  74 |     xlim = xlim, col = "blue", main = main,
  75 |     ylab = ylab, xlab = xlab, type = "l", ...
  76 |   )
  77 |   lines(RI, fracRight, col = "orange")
  78 |   abline(h = 5 * (0:20), col = "grey", lty = "dotted")
  79 | }
  80 | 
  81 | 
  82 | #' Confusion matrix
  83 | #'
  84 | #' Returns a confusion matrix of the found (mapped) vs. real (assigned) clusters.
  85 | #'
  86 | #' @param realCluster character vector of assigned clusters
  87 | #' @param foundCluster character vector of mapped clusters
  88 | #' @param proportions FALSE if the counts are to be returned and TRUE if the proportions are to be returned
  89 | #'
  90 | #' @export
  91 | getConfusionMatrix <- function(realCluster,
  92 |                                foundCluster,
  93 |                                proportions = TRUE) {
  94 |   realCluster <- as.character(realCluster)
  95 |   foundCluster <- as.character(foundCluster)
  96 |   lev <- sort(unique(c(realCluster, foundCluster)))
  97 |   realCluster <- factor(realCluster, levels = lev)
  98 |   foundCluster <- factor(foundCluster, levels = lev)
  99 |   confusion <- table(foundCluster, realCluster)
 100 |   if (proportions) {
 101 |     cs <- colSums(confusion)
 102 |     for (i in 1:dim(confusion)[1])
 103 |       confusion[i, ] <- confusion[i, ] / pmax(cs, 1e-08)
 104 |   }
 105 |   confusion
 106 | }
 107 | 
 108 | 
 109 | 
 110 | 
 111 | #' Filter genes for spatial transcriptomics panel
 112 | #'
 113 | #' Returns a set of genes for inclusion in a spatial transcriptomics panel based on a series of 
 114 | #'   hard-coded and user-defined constraints 
 115 | #'
 116 | #' @param summaryExpr Matrix of summarized expression levels for a given cluster.  Typically the median
 117 | #'   or mean should be used.  Rows are genes and columns are samples.  ROW NAMES MUST BE GENE SYMBOLS!
 118 | #' @param propExpr Proportion of cells expressed in each cluster for use with binary score calculation
 119 | #'   (default = summaryExpr, which is not recommended)
 120 | #' @param onClusters Vector indicating which clusters should be included in the gene panel (default
 121 | #'   is all clusters. Can be logical or numeric, or a character string of cluster names)
 122 | #' @param offClusters Vector indidicating from which clusters expression should be avoided
 123 | #' @param geneLength Optional vector of gene lengths in same order as summaryExpr.  Default is NULL
 124 | #' @param numBinaryGenes Number of genes to include in the final panel.  Genes are sorted by binary
 125 | #'   score using `getBetaScore` and this number of genes are chosen (default = 500)
 126 | #' @param minOn Minimum summary expression level in most highly expressed "on" cluster (default = 10)
 127 | #' @param maxOn Maximum summary expression level in most highly expressed "on" cluster (default = 250) 
 128 | #' @param maxOff Maximum summary expression level in most highly expressed "off" cluster (default = 50)
 129 | #' @param minLength Minimum gene length for marker gene selection.  Ignored if geneLength is not 
 130 | #'   provided (default = 960)
 131 | #' @param fractionOnClusters What is the maximum fraction of clusters in which a gene can be expressed
 132 | #'   (as defined by propExpr>onThreshold; default = 0.5).  This prevents nearly ubiquitous genes from selection
 133 | #' @param onThreshold What fraction of cells need to have expression for a gene to be defined as expressed
 134 | #'   (default = 0.5)
 135 | #' @param excludeGenes Which genes should be excluded from the analysis (default is none)
 136 | #' @param excludeFamilies Which gene classes or families should be excluded from the analysis?  More
 137 | #'   specifically, any gene that contain these strings of characters anywhere in the symbol will be
 138 | #'   excluded (default is "LOC","LINC","FAM","ORF","KIAA","FLJ","DKFZ","RIK","RPS","RPL","\\-"). 
 139 | #' 
 140 | #' @return A character vector of genes meeting all constraints
 141 | #'
 142 | #' @export
 143 | filterPanelGenes <- function(summaryExpr,
 144 |                              propExpr = summaryExpr,
 145 |                              onClusters = 1:dim(summaryExpr)[2],
 146 |                              offClusters = NULL,
 147 |                              geneLengths = NULL,
 148 |                              startingGenes = c("GAD1","SLC17A7"),
 149 |                              numBinaryGenes = 500,                             
 150 |                              minOn = 10,
 151 |                              maxOn = 250,
 152 |                              maxOff = 50,
 153 |                              minLength = 960,
 154 |                              fractionOnClusters = 0.5,
 155 |                              onThreshold = 0.5,
 156 |                              excludeGenes = NULL,
 157 |                              excludeFamilies = c("LOC","LINC","FAM","ORF","KIAA","FLJ","DKFZ","RIK","RPS","RPL","\\-")) {
 158 |   ## Variable check
 159 |   if(!is.element("matrix",class(summaryExpr))) {
 160 |     summaryExpr <- as.matrix(summaryExpr)
 161 |   }
 162 |   if(!is.numeric(summaryExpr[1,1])) {
 163 |     stop("summaryExpr must be a matrix of numeric values.")
 164 |   }
 165 |   if(is.null(rownames(summaryExpr))) {
 166 |     stop("Please provide summaryExpr with genes as row names.")
 167 |   }
 168 |   if(!is.numeric(fractionOnClusters)){
 169 |     stop("fractionOnClusters needs to be numeric.")
 170 |   }
 171 |   fractionOnClusters = fractionOnClusters[1]
 172 |   if(fractionOnClusters>1) fractionOnClusters = fractionOnClusters/100 # Assume if it is greater than 1 then a percentage was given
 173 |   genes  <- rownames(summaryExpr)
 174 |   genesU <- toupper(genes)
 175 |   
 176 |   ## Define excluded genes
 177 |   excludeFamilies <- toupper(excludeFamilies)
 178 |   excludeGenes <- is.element(genes,excludeGenes)
 179 |   if(length(excludeFamilies)>0){
 180 |     for(i in 1:length(excludeFamilies)){
 181 |       excludeGenes <- excludeGenes|grepl(excludeFamilies[i],genesU)
 182 |     }
 183 |   }
 184 |   
 185 |   ## Determine max expression levels in on and off clusters
 186 |   if(is.character(onClusters)){
 187 |     onClusters <- is.element(colnames(summaryExpr),onClusters)
 188 |   }
 189 |   if(is.numeric(onClusters)){
 190 |     onClusters <- is.element(1:dim(summaryExpr)[2],onClusters)
 191 |   }
 192 |   if(sum(onClusters)<2){
 193 |     stop("Please provide at least two onClusters.  If cluster names were provided, check colnames(summaryExpr).")
 194 |   }
 195 |   if(is.character(offClusters)){
 196 |     offClusters <- is.element(colnames(summaryExpr),offClusters)
 197 |   }
 198 |   if(is.numeric(offClusters)){
 199 |     offClusters <- is.element(1:dim(summaryExpr)[2],offClusters)
 200 |   }
 201 |   
 202 |   ## Now add the actual constraints from above
 203 |   maxExprOn <- apply(summaryExpr[,onClusters],1,max)
 204 |   if(sum(offClusters)>1){
 205 |     maxExprOff <- apply(summaryExpr[,offClusters],1,max)
 206 |   } else if(sum(offClusters)==1){
 207 |     maxExprOff <- summaryExpr[,offClusters]
 208 |   } else {
 209 |     maxExprOff <- pmax(maxExprOn,0.0000001) * -Inf # Essentially this is saying there is no off constraint
 210 |   }
 211 |   
 212 |   ## Set the gene lengths, if needed
 213 |   if(!is.null(geneLengths)) if(length(geneLengths)!=length(offClusters)){
 214 |     stop("geneLengths must be in the same order and of same length as rows of summaryExpr.")
 215 |   }
 216 |   if(!is.null(geneLengths)) if(!is.numeric(geneLengths)){
 217 |     stop("geneLengths must be numeric.")
 218 |   }
 219 |   if(is.null(geneLengths)){
 220 |     geneLengths <- maxExprOn+Inf # Essentially this is saying there is no gene length constraint
 221 |   }
 222 |   
 223 |   ## Determine the acceptable genes
 224 |   keepGenes <- (!excludeGenes) & (maxExprOn > minOn) & (maxExprOn <= maxOn) & 
 225 |     (maxExprOff <= maxOff) & (geneLengths >= minLength) & 
 226 |     (rowMeans(propExpr[, onClusters] > onThreshold) <= fractionOnClusters) &
 227 |     (rowMeans(propExpr[, onClusters] > onThreshold) > 0)  # Genes expressed in more than 0 clusters!
 228 |   keepGenes[is.na(keepGenes)] <- FALSE                               
 229 |   
 230 |   ## Find the top binary genes (if needed) and return gene list
 231 |   message(paste(sum(keepGenes),"total genes pass constraints prior to binary score calculation."))
 232 |   if(sum(keepGenes)<=numBinaryGenes){
 233 |     warning("Fewer genes pass constraints than numBinaryGenes, so binary score was not calculated.")
 234 |     return(sort(union(rownames(propExpr)[keepGenes], startingGenes)))
 235 |   }
 236 |   
 237 |   topBeta     <- getBetaScore(propExpr[keepGenes,onClusters],FALSE)
 238 |   runGenes    <- names(topBeta)[topBeta<=numBinaryGenes]
 239 |   runGenes    <- sort(union(runGenes,startingGenes))
 240 |   runGenes
 241 | }
 242 | 
 243 | 
 244 | 
 245 | 
 246 | 
 247 | 
 248 | #' Branch list
 249 | #'
 250 | #' Returns branches of a dendrogram in a specific format
 251 | #'
 252 | #' @param dend dendrogram for mapping.  Ignored if medianDat is passed
 253 | #' @param branches do not change from default
 254 | #' @param allTips do not change from default
 255 | #'
 256 | #' @return a list of branch information for use with leafToNodeMedians
 257 | #'
 258 | #' @export
 259 | getBranchList <- function(dend,
 260 |                           branches = list(),
 261 |                           allTips = as.character(dend %>% labels())) {
 262 |   library(dendextend)
 263 |   numBranch <- dend %>% nnodes()
 264 |   if (numBranch > 1) {
 265 |     lb <- attr(dend, "label")
 266 |     lab1 <- paste("BranchInTree___", lb, sep = "")
 267 |     branches[[lab1]] <- list()
 268 |     cn <- as.character(dend %>% labels())
 269 |     for (i in 1:length(dend)) {
 270 |       nm <- as.character(dend[[i]] %>% labels())
 271 |       branches[[lab1]][[attr(dend[[i]], "label")]] <- list(nm, setdiff(cn, nm))
 272 |       if ((length(nm) > 1) & (length(nm) < (length(allTips) - 1))) {
 273 |         lab2 <- paste("BranchVsAll___", attr(dend[[i]], "label"), sep = "")
 274 |         branches[[lab2]] <- list()
 275 |         branches[[lab2]][["branch"]] <- list(nm, setdiff(allTips, nm))
 276 |       }
 277 |       branches <- getBranchList(dend[[i]], branches, allTips)
 278 |     }
 279 |   } else {
 280 |     leaf <- dend %>% labels()
 281 |     lab <- paste("LeafOnly___", leaf, sep = "")
 282 |     branches[[lab]][["leaf"]] <- list(leaf, setdiff(allTips, leaf))
 283 |   }
 284 |   branches <- branches[order(substr(names(branches), 1, 8))]
 285 |   branches
 286 | }
 287 | 
 288 | #' Return mean node expression
 289 | #'
 290 | #' Define expression at a node as the MEAN expression for each leaf as default (using the
 291 | #'   median removes all specific marker genes!)
 292 | #'
 293 | #' @param dend dendrogram for mapping.  Ignored if medianDat is passed
 294 | #' @param medianDat median expression data at each node
 295 | #' @param branches a particular format of branch information from the dendrogram structure
 296 | #' @param fnIn function to use to wrap up to the node level (default = mean)
 297 | #'
 298 | #' @return a matrix of mean node expression (rows=genes, columns=nodes)
 299 | #'
 300 | #' @export
 301 | leafToNodeMedians <- function(dend,
 302 |                               medianDat,
 303 |                               branches = getBranchList(dend),
 304 |                               fnIn = mean) {
 305 |   library(dendextend)
 306 |   if (is.null(rownames(medianDat))) {
 307 |     rownames(medianDat) <- 1:dim(medianDat)[1]
 308 |   }
 309 |   allGenes <- rownames(medianDat)
 310 |   brNames <- names(branches)
 311 |   brNames <- brNames[grep("BranchInTree", brNames)]
 312 |   medianNode <- matrix(0, nrow = length(allGenes), ncol = length(brNames))
 313 |   rownames(medianNode) <- allGenes
 314 |   colnames(medianNode) <- brNames
 315 |   for (n in brNames) medianNode[, n] <-
 316 |       apply(medianDat[, c(branches[[n]][[1]][[1]], branches[[n]][[1]][[2]])], 1, fnIn)
 317 |   medianNode <- cbind(medianDat, medianNode)
 318 |   nameOrd <-
 319 |     dend %>%
 320 |     get_nodes_attr("label", id = 1:(dend %>% nnodes()))
 321 |   nameOrd[substr(nameOrd, 1, 1) == "n"] <-
 322 |     paste0("BranchInTree___", nameOrd[substr(nameOrd, 1, 1) == "n"])
 323 |   medianNode <- medianNode[, nameOrd]
 324 |   colnames(medianNode) <- gsub("BranchInTree___", "", colnames(medianNode))
 325 |   medianNode
 326 | }
 327 | 
 328 | 
 329 | #' Greedy algorithm for building marker gene panel
 330 | #'
 331 | #' This is the primary function that iteratively builds a marker gene panel, one gene at a
 332 | #'   time by iteratively adding the most informative gene to the existing gene panel.
 333 | #'
 334 | #' @param mapDat normalized data of the mapping (=reference) data set.
 335 | #' @param medianDat representative value for each leaf.  If not entered, it is calculated
 336 | #' @param clustersF cluster calls for each cell.
 337 | #' @param panelSize number of genes to include in the marker gene panel
 338 | #' @param subSamp number of random nuclei to select from each cluster (to increase speed);
 339 | #'   set as NA to not subsample
 340 | #' @param maxFcGene maximum number of genes to consider at each iteration (to increase speed)
 341 | #' @param qMin minimum quantile for fold change comparison (between 0 and 1, higher = more
 342 | #'   specific marker genes are included)
 343 | #' @param seed for reproducibility
 344 | #' @param currentPanel starting panel.  Default is NULL.
 345 | #' @param panelMin if there are fewer genes than this, the top number of these genes by fc
 346 | #'   rank are set as the starting panel.  Cannot be less than 2.
 347 | #' @param writeText should gene names and marker scores be output (default TRUE)
 348 | #' @param corMapping if TRUE (default) map by correlation; otherwise, map by Euclidean
 349 | #'   distance (not recommended)
 350 | #' @param optimize if 'FractionCorrect' (default) will seek to maximize the fraction of
 351 | #'   cells correctly mapping to final clusters
 352 | #'   if 'CorrelationDistance' will seek to minimize the total distance between actual
 353 | #'   cluster calls and mapped clusters
 354 | #'   if 'DendrogramHeight' will seek to minimize the total dendrogram height between
 355 | #'   actual cluster calls and mapped clusters
 356 | #' @param clusterDistance only used if optimize='CorrelationDistance'; a matrix (or
 357 | #'   vector) of cluster distances.  Will be calculated if NULL and if clusterGenes
 358 | #'   provided. (NOTE: order must be the same as medianDat and/or have column and row
 359 | #'   names corresponding to clusters in clustersF)
 360 | #' @param clusterGenes a vector of genes used to calculate the cluster distance.
 361 | #'   Only used if optimize='CorrelationDistance' and clusterDistance=NULL.
 362 | #' @param dend only used if optimize='DendrogramHeight' dendrogram; will error out of not provided
 363 | #' @param percentSubset for each iteration the function can subset the set of possible
 364 | #'   genes to speed up the calculation.
 365 | #'
 366 | #' @return an ordered character vector corresponding to the marker gene panel
 367 | #'
 368 | #' @export
 369 | buildMappingBasedMarkerPanel <- function(mapDat,
 370 |                                          medianDat = NA,
 371 |                                          clustersF = NA,
 372 |                                          panelSize = 50,
 373 |                                          subSamp = 20,
 374 |                                          maxFcGene = 1000,
 375 |                                          qMin = 0.75,
 376 |                                          seed = 10,
 377 |                                          currentPanel = NULL,
 378 |                                          panelMin = 5,
 379 |                                          writeText = TRUE,
 380 |                                          corMapping = TRUE,
 381 |                                          optimize = "FractionCorrect",
 382 |                                          clusterDistance = NULL,
 383 |                                          clusterGenes = NULL,
 384 |                                          dend = NULL,
 385 |                                          percentSubset = 100) {
 386 | 
 387 |   # Return an error if optimize='DendrogramHeight'
 388 |   # and a dendrogram is not provided
 389 |   if ((optimize == "DendrogramHeight") & is.null(dend)) {
 390 |     return("Error: dendrogram not provided")
 391 |   }
 392 | 
 393 |   # CALCULATE THE MEDIAN
 394 |   if (is.na(medianDat[1])) {
 395 |     names(clustersF) <- colnames(mapDat)
 396 |     medianDat <- do.call("cbind", tapply(
 397 |       names(clustersF), clustersF, function(x) rowMedians(mapDat[, x])
 398 |     ))
 399 |     rownames(medianDat) <- rownames(mapDat)
 400 |   }
 401 |   if (is.null(rownames(medianDat))) {
 402 |     rownames(medianDat) <- rownames(mapDat)
 403 |   }
 404 | 
 405 |   # Convert the dendrogram height into a correlation
 406 |   # distance if dendrogram height is entered as the
 407 |   # option
 408 |   if (optimize == "FractionCorrect") {
 409 |     clusterDistance <- NULL
 410 |   }
 411 |   if (optimize == "CorrelationDistance") {
 412 |     if (is.null(clusterDistance)) {
 413 |       corDist <- function(x) return(as.dist(1 - WGCNA::cor(x)))
 414 |       if(is.null(clusterGenes)) clusterGenes = rownames(medianDat)
 415 |       clusterGenes <- intersect(clusterGenes, rownames(medianDat))
 416 |       clusterDistance <- as.matrix(corDist(medianDat[clusterGenes, ]))
 417 |     }
 418 |     if (is.matrix(clusterDistance)) {
 419 |       if (!is.null(rownames(clusterDistance))) {
 420 |         clusterDistance <- clusterDistance[colnames(medianDat), colnames(medianDat)]
 421 |       }
 422 |       clusterDistance <- as.vector(clusterDistance)
 423 |     }
 424 |   }
 425 | 
 426 |   if (optimize == "DendrogramHeight") {
 427 |     lcaTable <- makeLCAtable(dend)
 428 |     clusterDistance <- 1 - getNodeHeight(dend)[lcaTable]
 429 |     optimize <- "clusterDistance"
 430 |   }
 431 | 
 432 |   # TAKE THE TOP DEX GENES Use fold change (rather
 433 |   # than beta) because this function only receives
 434 |   # median as input
 435 |   fcDiff <- rank(apply(medianDat, 1, function(x) return(diff(quantile(x, c(1, qMin))))))
 436 |   if (dim(medianDat)[1] > maxFcGene) {
 437 |     kpGene <- names(fcDiff)[fcDiff <= maxFcGene]
 438 |     mapDat <- mapDat[kpGene, ]
 439 |     medianDat <- medianDat[kpGene, ]
 440 |   }
 441 | 
 442 |   panelMin <- max(2, panelMin)
 443 |   if (length(currentPanel) < panelMin) {
 444 |     panelMin <- max(2, panelMin - length(currentPanel))
 445 |     currentPanel <- unique(c(currentPanel, names(sort(fcDiff))[1:panelMin]))
 446 |     if (writeText) {
 447 |       print(paste("Setting starting panel as:", paste(currentPanel, sep = ", ", collapse = ", ")))
 448 |     }
 449 |   }
 450 | 
 451 |   # FIND THE NEXT GENE IN THE PANEL, IF THE DESIRED
 452 |   # PANEL SIZE IS NOT REACHED
 453 |   if (length(currentPanel) < panelSize) {
 454 | 
 455 |     # SUBSAMPLE
 456 |     if (!is.na(subSamp)) {
 457 |       kpSamp <- subsampleCells(clustersF, subSamp, seed)
 458 |       mapDat <- mapDat[, kpSamp]
 459 |       clustersF <- clustersF[kpSamp]
 460 |       subSamp <- NA
 461 |     }
 462 | 
 463 |     # CORRELATION MAPPING FOR EACH POSSIBLE ADDITION OF
 464 |     # ONE GENE
 465 |     otherGenes <- setdiff(rownames(mapDat), currentPanel)
 466 |     if (percentSubset < 100) {
 467 |       # Only look at a subset of genes if desired
 468 |       set.seed(seed + length(currentPanel))
 469 |       otherGenes <- otherGenes[sort(sample(
 470 |         1:length(otherGenes),
 471 |         ceiling(length(otherGenes) * percentSubset / 100)
 472 |       ))]
 473 |     }
 474 |     matchCount <- rep(0, length(otherGenes))
 475 |     clustIndex <- match(clustersF, colnames(medianDat))
 476 |     for (i in 1:length(otherGenes)) {
 477 |       ggnn <- c(currentPanel, otherGenes[i])
 478 |       if (corMapping) {
 479 |         corMapTmp <- corTreeMapping(mapDat = mapDat, medianDat = medianDat, genesToMap = ggnn)
 480 |       }
 481 |       if (!corMapping) {
 482 |         corMapTmp <- distTreeMapping(mapDat = mapDat, medianDat = medianDat, genesToMap = ggnn)
 483 |       }
 484 |       corMapTmp[is.na(corMapTmp)] <- -1
 485 |       topLeafTmp <- getTopMatch(corMapTmp)
 486 |       if (is.null(clusterDistance)) {
 487 |         matchCount[i] <- mean(clustersF == topLeafTmp[, 1])
 488 |       } else {
 489 |         tmpVal <- dim(medianDat)[2] * (match(topLeafTmp[, 1], colnames(medianDat)) - 1) + clustIndex # NEED TO CHECK THIS!!!!!!!
 490 |         matchCount[i] <- -mean(clusterDistance[tmpVal])
 491 |       }
 492 |     }
 493 |     wm <- which.max(matchCount)
 494 |     addGene <- as.character(otherGenes)[wm]
 495 |     if (writeText) {
 496 |       if (optimize == "FractionCorrect") {
 497 |         print(paste(
 498 |           "Added", addGene, "with", signif(matchCount[wm], 3),
 499 |           "now matching [", length(currentPanel), "]."
 500 |         ))
 501 |       } else {
 502 |         print(paste(
 503 |           "Added", addGene, "with average cluster distance",
 504 |           -signif(matchCount[wm], 3), "[", length(currentPanel), "]."
 505 |         ))
 506 |       }
 507 |     }
 508 |     currentPanel <- c(currentPanel, addGene)
 509 |     currentPanel <- buildMappingBasedMarkerPanel(
 510 |       mapDat = mapDat, medianDat = medianDat, clustersF = clustersF,
 511 |       panelSize = panelSize, subSamp = subSamp, maxFcGene = maxFcGene,
 512 |       qMin = qMin, seed = seed, currentPanel = currentPanel, panelMin = panelMin,
 513 |       writeText = writeText, corMapping = corMapping, optimize = optimize,
 514 |       clusterDistance = clusterDistance, clusterGenes = clusterGenes, dend = dend,
 515 |       percentSubset = percentSubset
 516 |     )
 517 |   }
 518 |   currentPanel
 519 | }
 520 | 
 521 | 
 522 | #' Correlation between nodes and leafs (deprecated)
 523 | #'
 524 | #' Returns the correlation between expression of each cell and representative
 525 | #'   value for each node and leaf.  NOTE: this function is unstable and will
 526 | #'   eventually be merged with corTreeMapping.
 527 | #'
 528 | #' @param dend dendrogram for mapping.  Ignored if medianDat is passed
 529 | #' @param refDat normalized data of the REFERENCE data set.  Ignored if medianExpr
 530 | #'   and propExpr are passed
 531 | #' @param mapDat normalized data of the MAPPING data set.  Default is to map the
 532 | #'   data onto itself.
 533 | #' @param medianExpr representative value for each leaf.  If not entered, it is
 534 | #'   calculated
 535 | #' @param propExpr proportion of cells in each type expressing a given gene.  If not
 536 | #'   entered, it is calculated
 537 | #' @param filterMatrix a matrix of TRUE/FALSE values to indicate whether a given
 538 | #'   cluster is possible
 539 | #' @param clusters cluster calls for each cell.  Ignored if medianExpr and propExpr
 540 | #'   are passed
 541 | #' @param numberOfGenes how many variables genes
 542 | #' @param outerLimitGenes choose different numberOfGenes per cell from the top overall
 543 | #'   outerLimitGenes (to speed up function)
 544 | #' @param genesToMap which genes to include in the correlation mapping
 545 | #' @param use,... additional parameters for cor
 546 | #'
 547 | #' @return a matrix of correlation values with rows as mapped cells and columns as clusters
 548 | #'
 549 | #' @export
 550 | corTreeMapping_withFilter <- function(dend = NA,
 551 |                                       refDat = NA,
 552 |                                       mapDat = refDat,
 553 |                                       medianExpr = NA,
 554 |                                       propExpr = NA,
 555 |                                       filterMatrix = NA,
 556 |                                       clusters = NA,
 557 |                                       numberOfGenes = 1200,
 558 |                                       outerLimitGenes = 7200,
 559 |                                       rankGeneFunction = function(x) getBetaScore(x, returnScore = FALSE),
 560 |                                       use = "p",
 561 |                                       ...) {
 562 | 
 563 |   # -- prepare the data
 564 |   if (is.na(medianExpr[1])) {
 565 |     names(clusters) <- colnames(refDat)
 566 |     medianExpr <- do.call("cbind", tapply(
 567 |       names(clusters), clusters, function(x) rowMedians(refDat[, x])
 568 |     ))
 569 |     rownames(medianExpr) <- rownames(refDat)
 570 |   }
 571 |   if (is.na(propExpr[1])) {
 572 |     names(clusters) <- colnames(refDat)
 573 |     medianExpr <- do.call("cbind", tapply(
 574 |       names(clusters), clusters, function(x) rowMeans(refDat[, x] > 1)
 575 |     ))
 576 |     rownames(propExpr) <- rownames(refDat)
 577 |   }
 578 |   filterMatrix <- filterMatrix[, colnames(medianExpr)]
 579 | 
 580 |   # -- take the top outerLimitGenes for the
 581 |   # proportion and median
 582 |   rankGn <- rankGeneFunction(propExpr)
 583 |   kpGn <- rankGn <= outerLimitGenes
 584 |   medianDat <- medianExpr[kpGn, ]
 585 |   propDat <- propExpr[kpGn, ]
 586 | 
 587 |   # -- find all possible filters
 588 |   filterVec <- apply(filterMatrix, 1, function(x) paste(x, collapse = "|", sep = "|"))
 589 |   vecs <- unique(filterVec)
 590 | 
 591 |   # -- find all gene lists based on these filters
 592 |   geneLists <- list()
 593 |   for (v in vecs) {
 594 |     kp <- filterMatrix[which(filterVec == v)[1], ]
 595 |     if (sum(kp) > 1) {
 596 |       geneLists[[v]] <- rownames(propDat)[rankGeneFunction(propDat[, kp]) <= numberOfGenes]
 597 |     } else {
 598 |       geneLists[[v]] <- rownames(propDat)[1:2]
 599 |     }
 600 |   }
 601 | 
 602 |   ## -- find the correlations
 603 |   kpVar <- intersect(
 604 |     names(rankGn)[rankGn <= numberOfGenes],
 605 |     intersect(rownames(mapDat), rownames(medianDat))
 606 |   )
 607 |   corrVar <- WGCNA::cor(mapDat[kpVar, ], medianDat[kpVar, ], use = use, ...)
 608 |   for (v in vecs) {
 609 |     kpRow <- filterVec == v
 610 |     kpCol <- filterMatrix[which(filterVec == v)[1], ]
 611 |     if (sum(kpCol) > 1) {
 612 |       kpVar <- intersect(geneLists[[v]], intersect(rownames(mapDat), rownames(medianDat)))
 613 |       corrVar[kpRow, kpCol] <- WGCNA::cor(mapDat[kpVar, kpRow], medianDat[kpVar, kpCol], use = use) # ,...)
 614 |     }
 615 |   }
 616 |   corrVar <- corrVar * filterMatrix[, colnames(corrVar)]
 617 |   corrVar
 618 | }
 619 | 
 620 | 
 621 | #' (Euclidean) distance mapping
 622 | #'
 623 | #' Returns the distance between expression of each cell and representative value for each node and
 624 | #'   leaf (default is based on euclidean distance).  In our hands this is does not work very well.
 625 | #'
 626 | #' @param dend dendrogram for mapping.  Ignored if medianDat is passed
 627 | #' @param refDat normalized data of the REFERENCE data set.  Ignored if medianDat is passed
 628 | #' @param mapDat normalized data of the MAPPING data set.  Default is to map the data onto itself.
 629 | #' @param medianDat representative value for each leaf and node.  If not entered, it is calculated
 630 | #' @param clusters cluster calls for each cell.  Ignored if medianDat is passed
 631 | #' @param genesToMap which genes to include in the correlation mapping
 632 | #' @param returnSimilarity FALSE to return distance, TRUE to return something like a similarity
 633 | #' @param use,... additional parameters for dist (for back-compatiblity; doesn't work)
 634 | #'
 635 | #' @return matrix of Euclidean distances between cells (rows) and clusters (columns)
 636 | #'
 637 | #' @export
 638 | distTreeMapping <- function(dend = NA,
 639 |                             refDat = NA,
 640 |                             mapDat = refDat,
 641 |                             medianDat = NA,
 642 |                             clusters = NA,
 643 |                             genesToMap = rownames(mapDat),
 644 |                             returnSimilarity = TRUE,
 645 |                             use = "p",
 646 |                             ...) {
 647 |   library(pdist)
 648 | 
 649 |   if (is.na(medianDat[1])) {
 650 |     names(clusters) <- colnames(refDat)
 651 |     medianDat <- do.call("cbind", tapply(
 652 |       names(clusters), clusters, function(x) rowMedians(refDat[, x])
 653 |     ))
 654 |     rownames(medianDat) <- rownames(refDat)
 655 |     medianDat <- leafToNodeMedians(dend, medianDat)
 656 |   }
 657 |   kpVar <- intersect(genesToMap, intersect(rownames(mapDat), rownames(medianDat)))
 658 |   if (length(kpVar) == 1) {
 659 |     kpVar <- c(kpVar, kpVar)
 660 |   }
 661 |   eucDist <- as.matrix(pdist(t(mapDat[kpVar, ]), t(medianDat[kpVar, ]), ...))
 662 |   rownames(eucDist) <- colnames(mapDat)
 663 |   colnames(eucDist) <- colnames(medianDat)
 664 |   if (!returnSimilarity) {
 665 |     return(eucDist)
 666 |   }
 667 |   eucDist <- sqrt(eucDist / max(eucDist))
 668 |   1 - eucDist
 669 | }
 670 | 
 671 | 
 672 | #' Tree-based mapping
 673 | #'
 674 | #' Returns the mapping membership of each cell to each node and leaf using a
 675 | #'   tree-based method.  This is a wrapper function for map_dend.
 676 | #'
 677 | #' @param dend dendrogram for mapping
 678 | #' @param refDat normalized data of the REFERENCE data set
 679 | #' @param clustersF factor indicating which cluster each cell type is actually assigned to
 680 | #'   in the reference data set
 681 | #' @param mapDat normalized data of the MAPPING data set.  Default is to map the data onto itself.
 682 | #' @param p proportion of marker genes to include in each iteration of the mapping algorithm.
 683 | #' @param low.th the minimum difference in Pearson correlation required to decide on which branch
 684 | #'   to map to. otherwise, a random branch is chosen.
 685 | #' @param seed added for reproducibility
 686 | #'
 687 | #' @return a matrix of confidence scores (from 0 to 100) with rows as cells and columns
 688 | #'   as tree node/leafs.  Values indicate the fraction of permutations in which the cell
 689 | #'   mapped to that node/leaf using the subset of cells/genes in map_dend
 690 | #'
 691 | #' @export
 692 | rfTreeMapping <- function(dend,
 693 |                           refDat,
 694 |                           clustersF,
 695 |                           mapDat = refDat,
 696 |                           p = 0.7,
 697 |                           low.th = 0.15,
 698 |                           seed = 1) {
 699 |   
 700 |   genes  <- intersect(rownames(refDat),rownames(mapDat))
 701 |   refDat <- as.matrix(refDat)[genes,] # need common genes and matrix format
 702 |   mapDat <- as.matrix(mapDat)[genes,] # need common genes and matrix format
 703 |   pseq.cells <- colnames(mapDat)
 704 |   # isMarker   <- ifelse(is.na(adjustMarkers[1]), FALSE, TRUE) # Doesn't do anything
 705 | 
 706 |   pseq.mem <- sapply(1:100, function(i) {
 707 |     j <- i
 708 |     if (i %% 25 == 0) print(i)
 709 |     go <- TRUE
 710 |     while (go) {
 711 |       # Allow for failures
 712 |       j <- j + 1000
 713 |       set.seed(j + seed) # Added for reproducibility
 714 |       tmp <- try(map_dend(dend, clustersF, refDat,
 715 |         mapDat, pseq.cells,
 716 |         p = p, low.th = low.th
 717 |       ))
 718 |       if (length(tmp) > 1) go <- FALSE
 719 |     }
 720 |     tmp
 721 |   }, simplify = F)
 722 | 
 723 |   memb <- unlist(pseq.mem)
 724 |   memb <- data.frame(cell = names(memb), cl = memb)
 725 |   memb$cl <- factor(memb$cl, levels = get_nodes_attr(dend, "label"))
 726 |   memb <- table(memb$cell, memb$cl)
 727 |   memb <- memb / 100
 728 |   return(memb)
 729 | }
 730 | 
 731 | 
 732 | #' Tree-based mapping
 733 | #'
 734 | #' Returns the mapping membership of each cell to each node and leaf using a
 735 | #'   tree-based method.  This is a wrapper function for map_dend.
 736 | #'
 737 | #' @param dend dendrogram for mapping
 738 | #' @param cl factor indicating which cluster each cell type is actually assigned to
 739 | #'   in the reference data set
 740 | #' @param dat normalized data of the REFERENCE data set
 741 | #' @param map.dat normalized data of the MAPPING data set.  Default is to map the
 742 | #'   data onto itself.
 743 | #' @param p proportion of marker genes to include in each iteration of the mapping
 744 | #'   algorithm.
 745 | #' @param low.th the minimum difference in Pearson correlation required to decide
 746 | #'   on which branch to map to. otherwise, a random branch is chosen.
 747 | #' @param default.markers not used
 748 | #'
 749 | #' @return a matrix of confidence scores (from 0 to 100) with rows as cells and columns
 750 | #'   as tree node/leafs.  Values indicate the fraction of permutations in which the cell
 751 | #'   mapped to that node/leaf using the subset of cells/genes in map_dend
 752 | #'
 753 | #' @export
 754 | map_dend <- function(dend,
 755 |                      cl,
 756 |                      dat,
 757 |                      map.dat,
 758 |                      select.cells,
 759 |                      p = 0.8,
 760 |                      low.th = 0.2,
 761 |                      default.markers = NULL) {
 762 |   final.cl <- c(setNames(rep(attr(dend, "label"), length(select.cells)), select.cells))
 763 |   if (length(dend) <= 1) {
 764 |     return(final.cl)
 765 |   }
 766 |   markers <- attr(dend, "markers")
 767 |   markers <- markers[names(markers) %in% row.names(map.dat)]
 768 |   cl.g <- sapply(dend, labels, simplify = F)
 769 |   names(cl.g) <- 1:length(cl.g)
 770 |   select.cl <- cl[cl %in% unlist(cl.g)]
 771 |   ### Sampling the cells from the reference cluster
 772 |   cells <- unlist(tapply(names(select.cl), select.cl, function(x) sample(x, round(length(x) * p))))
 773 |   genes <- names(markers)
 774 |   genes <- union(genes, default.markers)
 775 |   ### Compute reference cluster median based on
 776 |   ### subsampled cells
 777 |   cl.med <- do.call("cbind", tapply(
 778 |     cells, droplevels(cl[cells]),
 779 |     function(x) rowMedians(dat[genes, x, drop = F])
 780 |   ))
 781 |   row.names(cl.med) <- genes
 782 |   ### determine which branch to take.
 783 |   mapped.cl <- resolve_cl(cl.g, cl.med, markers, dat,
 784 |     map.dat, select.cells,
 785 |     p = p, low.th = low.th
 786 |   )
 787 |   if (length(mapped.cl) > 0) {
 788 |     for (i in unique(mapped.cl)) {
 789 |       select.cells <- names(mapped.cl)[mapped.cl == i]
 790 |       if (length(select.cells) > 0) {
 791 |         final.cl <- c(final.cl, map_dend(dend[[as.integer(i)]],
 792 |           cl, dat, map.dat, select.cells,
 793 |           p = p, low.th = low.th
 794 |         ))
 795 |       }
 796 |     }
 797 |   }
 798 |   return(cl = final.cl)
 799 | }
 800 | 
 801 | 
 802 | #' Tree-based mapping (internal)
 803 | #'
 804 | #' Returns the mapped cluster call of each cell to each leaf. This function is called by map_dend
 805 | #'
 806 | #' @param cl.g all clusters
 807 | #' @param cl.med cluster medians
 808 | #' @param markers gene markers
 809 | #' @param dat normalized data of the REFERENCE data set
 810 | #' @param map.dat normalized data of the MAPPING data set.  Default is to map the data onto itself.
 811 | #' @param select.cells which cells to use?
 812 | #' @param p proportion of marker genes to include in each iteration of the mapping algorithm.
 813 | #' @param low.th the minimum difference in Pearson correlation required to decide on which branch
 814 | #'   to map to. otherwise, a random branch is chosen.
 815 | #'
 816 | #' @return a vector of the mapped cluster
 817 | #'
 818 | #' @export
 819 | resolve_cl <- function(cl.g,
 820 |                        cl.med,
 821 |                        markers,
 822 |                        dat,
 823 |                        map.dat,
 824 |                        select.cells,
 825 |                        p = 0.7,
 826 |                        low.th = 0.2) {
 827 |   library(matrixStats)
 828 |   ##
 829 |   genes <- names(markers)[markers > 0]
 830 |   tmp.cl <- unlist(cl.g)
 831 | 
 832 |   ### For each branch point, find the highest
 833 |   ### expression cluster.
 834 |   tmp.med <- sapply(cl.g, function(g) rowMaxs(cl.med[genes, g, drop = F]))
 835 |   row.names(tmp.med) <- genes
 836 |   ### Make sure the genes are discriminative between
 837 |   ### all the branches.
 838 |   genes <- genes[rowMaxs(tmp.med) - rowMins(tmp.med) > 1]
 839 | 
 840 |   ### Sample the markers based on the weigts. TO DO:
 841 |   ### randomforest sometimes give importance value of
 842 |   ### 0. adjust for that.
 843 |   genes <- sample(genes, round(length(genes) * p), prob = markers[genes])
 844 | 
 845 |   ### Compute the correlation with the median cluster
 846 |   ### profile. add drop=F
 847 |   cl.cor <- WGCNA::cor(map.dat[genes, select.cells, drop = F], cl.med[genes, tmp.cl, drop = F])
 848 |   cl.cor[is.na(cl.cor)] <- 0
 849 |   ### Compute the best match in each branch.
 850 |   tmp.score <- do.call("cbind", sapply(cl.g, function(x) rowMaxs(cl.cor[,
 851 |       x,
 852 |       drop = F
 853 |     ]), simplify = F))
 854 |   row.names(tmp.score) <- row.names(cl.cor)
 855 |   #### Determine the best match.
 856 |   best.score <- setNames(rowMaxs(tmp.score), row.names(tmp.score))
 857 |   ### determine the difference from the best match.
 858 |   diff.score <- best.score - tmp.score
 859 | 
 860 |   #### Give up on cells can't be discriminated,choose
 861 |   #### one branch randomly.
 862 |   unresolved.cl <- row.names(tmp.score)[rowSums(diff.score < low.th) == ncol(diff.score)]
 863 |   mapped.cl <- setNames(sample(colnames(tmp.score), length(unresolved.cl), replace = T), unresolved.cl)
 864 | 
 865 |   ### Cells mapped to one or more branches.
 866 |   mapped.cells <- setdiff(row.names(cl.cor), unresolved.cl)
 867 |   ### For binary branch, done already
 868 |   if (length(cl.g) == 2) {
 869 |     mapped.cl <- c(mapped.cl, setNames(colnames(diff.score)[apply(diff.score[mapped.cells,
 870 |       ,
 871 |       drop = F
 872 |     ], 1, which.min)], mapped.cells))
 873 |     return(mapped.cl)
 874 |   }
 875 |   ## The remaining options for mapped cells
 876 |   tmp.cl <- sapply(mapped.cells, function(x) colnames(diff.score)[which(diff.score[x, ] < low.th)], simplify = F)
 877 |   ### cells with multiple options
 878 |   resolve.cells <- names(tmp.cl)[sapply(tmp.cl, length) > 1]
 879 |   ### cells with only one option. Not further job.
 880 |   mapped.cells <- setdiff(mapped.cells, resolve.cells)
 881 |   if (length(mapped.cells) > 0) {
 882 |     mapped.cl <- c(mapped.cl, setNames(unlist(tmp.cl[mapped.cells]), mapped.cells))
 883 |   }
 884 |   ### Resolve further options.
 885 |   if (length(resolve.cells) > 0) {
 886 |     tmp.cat <- sapply(tmp.cl[resolve.cells], function(x) paste(x, collapse = " "))
 887 |     for (cat in unique(tmp.cat)) {
 888 |       tmp.cl <- unlist(strsplit(cat, " "))
 889 |       select.cells <- names(tmp.cat)[tmp.cat == cat]
 890 |       mapped.cl <- c(mapped.cl, resolve_cl(cl.g[tmp.cl],
 891 |         cl.med, markers, dat, map.dat, select.cells,
 892 |         p = p, low.th = low.th
 893 |       ))
 894 |     }
 895 |   }
 896 |   return(mapped.cl)
 897 | }
 898 | 
 899 | 
 900 | 
 901 | #' Get top leaf match
 902 | #'
 903 | #' Returns the top leaf match for each cell and the corresponding fraction mapping there.
 904 | #'
 905 | #' @param memb.cl membership scores for each leaf
 906 | #'
 907 | #' @return a matrix where first column is found cluster and second column is confidence score
 908 | #'
 909 | #' @export
 910 | getTopMatch <- function(memb.cl) {
 911 |   tmp.cl <- apply(memb.cl, 1, function(x) {
 912 |     y <- which.max(x)[1]
 913 |     as.character(c(colnames(memb.cl)[y], x[y]))
 914 |   })
 915 |   rfv <- as.data.frame(t(tmp.cl))
 916 |   rfv[, 2] <- as.numeric(rfv[, 2])
 917 |   colnames(rfv) <- c("TopLeaf", "Value")
 918 |   rfv[is.na(rfv[, 1]), 1] <- "none"
 919 |   rfv[is.na(rfv[, 2]), 2] <- 0
 920 |   rfv
 921 | }
 922 | 
 923 | 
 924 | #' Generate reference set of pseudo-cells
 925 | #'
 926 | #' Creates a new reference set as input for cellToClusterMapping_byRank, where each 'cell' is the
 927 | #'   combiniation of several cells and this is run several times using different subsets of cells.
 928 | #' @param refDat  normalized data of the REFERENCE data set
 929 | #' @param clustersF factor indicating which cluster each cell type is actually assigned to in the reference data set
 930 | #' @param genesToMap which genes to include in the correlation mapping
 931 | #' @param cellsPerMerge Number of cells to include in each combo cell
 932 | #' @param numberOfMerges Number of combo cells to include per cell type
 933 | #' @param mergeFunction function for combining cells into combo cells (use rowMeans or rowMedians)
 934 | #' @param seed for resproducibility
 935 | #'
 936 | #' @return list where first element is data matrix of multi-cells by genes and
 937 | #'   second element is a vector of corresponding clusters
 938 | #'
 939 | #' @export
 940 | generateMultipleCellReferenceSet <- function(refDat,
 941 |                                              clustersF,
 942 |                                              genesToUse = rownames(refDat),
 943 |                                              cellsPerMerge = 5,
 944 |                                              numberOfMerges = 10,
 945 |                                              mergeFunction = rowMedians,
 946 |                                              seed = 1) {
 947 |   if (!is.factor(clustersF)) {
 948 |     clustersF <- factor(clustersF)
 949 |   }
 950 |   names(clustersF) <- colnames(refDat)
 951 |   clusts <- levels(clustersF)
 952 |   refUse <- refDat[intersect(rownames(refDat), genesToUse), ]
 953 |   refOut <- matrix(nrow = dim(refUse)[1], ncol = numberOfMerges * length(clusts))
 954 |   rownames(refOut) <- rownames(refUse)
 955 |   index <- 0
 956 |   for (k in 1:numberOfMerges) {
 957 |     i <- NULL
 958 |     for (cl in 1:length(clusts)) {
 959 |       val <- which(clustersF == clusts[cl])
 960 |       set.seed(seed + index + cl + k)
 961 |       i <- c(i, sample(val, min(cellsPerMerge, length(val))))
 962 |     }
 963 |     i <- is.element(1:length(clustersF), i)
 964 |     refOut[, (index + 1):(index + length(clusts))] <- do.call(
 965 |       "cbind",
 966 |       tapply(names(clustersF[i]), clustersF[i], function(x) rowMedians(refUse[, i][, x]))
 967 |     )
 968 |     index <- index + length(clusts)
 969 |   }
 970 |   list(data = refOut, clusters = rep(clusts, numberOfMerges))
 971 | }
 972 | 
 973 | 
 974 | #' Cell-based cluster mapping
 975 | #'
 976 | #' Maps cells to clusters by correlating every mapped cell with every reference cell,
 977 | #'   ranking the cells by correlation, and the reporting the cluster with the lowest average rank.
 978 | #'
 979 | #' @param mapDat normalized data of the MAPPING data set.
 980 | #' @param refDat normalized data of the REFERENCE data set
 981 | #' @param clustersF factor indicating which cluster each cell type is actually assigned
 982 | #'   to in the reference data set
 983 | #' @param genesToMap character vector of which genes to include in the correlation mapping
 984 | #' @param mergeFunction function for combining ranks; the tested choices are rowMeans or
 985 | #'   rowMedians (default)
 986 | #' @param useRank use the rank of the correlation (default) or the correlation itself to
 987 | #'   determine the top cluster
 988 | #' @param use additional parameter for cor (use='p' as default)
 989 | #' @param method additional parameter for cor (method='p' as default)
 990 | #'
 991 | #' @return a two column data matrix where the first column is the mapped cluster and the second
 992 | #'   column is a confidence call indicating how close to the top of the ranked list cells of the
 993 | #'   assigned cluster were located relative to their best possible location in the ranked list.
 994 | #'   This confidence score seems to be a bit more reliable than correlation at determining how
 995 | #'   likely a cell in a training set is to being correctly assigned to the training cluster.
 996 | #'
 997 | #' @export
 998 | cellToClusterMapping_byRank <- function(mapDat,
 999 |                                         refDat,
1000 |                                         clustersF,
1001 |                                         genesToMap = rownames(mapDat),
1002 |                                         mergeFunction = rowMedians,
1003 |                                         useRank = TRUE,
1004 |                                         use = "p",
1005 |                                         method = "p") {
1006 |   if (is.null(names(clustersF))) names(clustersF) <- colnames(refDat)
1007 |   kpVar <- intersect(genesToMap, intersect(rownames(mapDat), rownames(refDat)))
1008 |   corrVar <- WGCNA::cor(mapDat[kpVar, ], refDat[kpVar, ], use = use, method = method)
1009 |   corrVar[corrVar > 0.999999] <- NA # assume any perfect correlation is either an self-to-self mapping, or a mapping using exactly 1 non-zero gene
1010 |   if (useRank) rankVar <- t(apply(-corrVar, 1, rank, na.last = "keep"))
1011 |   if (!useRank) rankVar <- -corrVar
1012 |   colnames(rankVar) <- names(clustersF) <- paste0("n", 1:length(clustersF))
1013 |   clMean <- do.call("cbind", tapply(
1014 |     names(clustersF),
1015 |     clustersF, function(x) match.fun(mergeFunction)(rankVar[, x], na.rm = TRUE)
1016 |   ))
1017 |   clMin <- apply(clMean, 1, min, na.rm = TRUE)
1018 |   clMin[is.na(clMin)] <- 0
1019 |   clMin[clMin == Inf] <- 1e+09
1020 |   clBest <- colnames(clMean)[apply(clMean, 1, function(x) return(which.min(x)[1]))]
1021 |   clBest[is.na(clBest)] <- colnames(clMean)[1]
1022 |   if (useRank) clScore <- (table(clustersF)[clBest] / 2) / pmax(clMin, 1e-11)
1023 |   if (!useRank) clScore <- -clMin
1024 |   rfv <- data.frame(TopLeaf = clBest, Score = as.numeric(as.character(clScore)))
1025 |   rownames(rfv) <- rownames(corrVar)
1026 |   rfv
1027 | }
1028 | 
1029 | 
1030 | #' Correlation-based cluster mapping
1031 | #'
1032 | #' Primary function for doing correlation-based mapping to cluster medians.  This is wrapper for cor
1033 | #'   and returns a correlation matrix.
1034 | #'
1035 | #' @param mapDat normalized data of the MAPPING data set.  Default is to map the data onto itself.
1036 | #' @param medianDat representative value for each leaf and node.  If not entered, it is calculated
1037 | #' @param dend dendrogram for mapping.  If provided, correlations to nodes are also returned
1038 | #' @param refDat normalized data of the REFERENCE data set.  Ignored if medianDat is passed
1039 | #' @param clusters  cluster calls for each cell.  Ignored if medianDat is passed
1040 | #' @param genesToMap which genes to include in the correlation mapping
1041 | #' @param use additional parameter for cor (use='p' as default)
1042 | #' @param method additional parameter for cor (method='p' as default)
1043 | #'
1044 | #' @return matrix with the correlation between expression of each cell and representative value for
1045 | #'   each leaf and node
1046 | #'
1047 | #' @export
1048 | corTreeMapping <- function(mapDat,
1049 |                            medianDat,
1050 |                            dend = NULL,
1051 |                            refDat = NA,
1052 |                            clusters = NA,
1053 |                            genesToMap = rownames(mapDat),
1054 |                            use = "p",
1055 |                            method = "p") {
1056 |   if (is.na(medianDat)[1]) {
1057 |     names(clusters) <- colnames(refDat)
1058 |     medianDat <- do.call("cbind", tapply(
1059 |       names(clusters), clusters, function(x) rowMedians(refDat[, x])
1060 |     ))
1061 |     rownames(medianDat) <- rownames(refDat)
1062 |   }
1063 |   if (!is.null(dend)) {
1064 |     medianDat <- leafToNodeMedians(dend, medianDat)
1065 |   }
1066 |   kpVar <- intersect(genesToMap, intersect(rownames(mapDat), rownames(medianDat)))
1067 |   corrVar <- WGCNA::cor(mapDat[kpVar, ], medianDat[kpVar, ], use = use, method = method)
1068 |   return(corrVar)
1069 | }
1070 | 
1071 | 
1072 | #' Gets subtree labels for lca function.
1073 | #'
1074 | #' @param dend a cluster dendrogram
1075 | #'
1076 | #' @return vector of subtree labels
1077 | #'
1078 | #' @export
1079 | get_subtree_label <- function(dend) {
1080 |   library(dendextend)
1081 |   l <- attr(dend, "label")
1082 |   if (length(dend) > 1) {
1083 |     for (i in 1:length(dend)) {
1084 |       l <- c(l, get_subtree_label(dend[[i]]))
1085 |     }
1086 |   }
1087 |   return(l)
1088 | }
1089 | 
1090 | 
1091 | #' Get lowest common ancestor (defined cluster pairs)
1092 | #'
1093 | #' Maps a cluster back up the tree to the first node where the mapped and correct clusters agree.
1094 | #'
1095 | #' @param dend a cluster dendrogram
1096 | #' @param l1 a vector of node labels
1097 | #' @param l2 a second fector of node labels (of the same length as l1)
1098 | #' @param l do not adjust; required for recursive function
1099 | #'
1100 | #' @return The function will return a vector for lowest common ancestor for every pair of nodes in l1 and l2
1101 | #'
1102 | #' @export
1103 | lca <- function(dend,
1104 |                 l1,
1105 |                 l2,
1106 |                 l = rep(attr(dend, "label"), length(l1))) {
1107 |   library(dendextend)
1108 |   node.height <- setNames(get_nodes_attr(dend, "height"), get_nodes_attr(dend, "label"))
1109 |   if (length(dend) > 1) {
1110 |     for (i in 1:length(dend)) {
1111 |       tmp.l <- attr(dend[[i]], "label")
1112 |       labels <- get_subtree_label(dend[[i]])
1113 |       select <- l1 %in% labels & l2 %in% labels
1114 |       if (sum(select) > 0) {
1115 |         select <- which(select)[node.height[l[select]] > node.height[tmp.l]]
1116 |         l[select] <- tmp.l
1117 |         l <- lca(dend[[i]], l1, l2, l)
1118 |       }
1119 |     }
1120 |   }
1121 |   l
1122 | }
1123 | 
1124 | 
1125 | #' Get lowest common ancestor (all cluster pairs in tree)
1126 | #'
1127 | #' Calculates the vector for lowest common ancestor for every pair of leaves in a tree and returns a
1128 | #'   vector in a specific format for faster look-up.
1129 | #'
1130 | #' @param dend a cluster dendrogram
1131 | #' @param includeInternalNodes should internal nodes be included in the output?
1132 | #' @param verbose if TRUE, status will be printed to the screen, since function is relatively slow
1133 | #'   for large trees (default FALSE)
1134 | #'
1135 | #' @return The function will return a vector for lowest common ancestor for every pair of leaves
1136 | #'   in dend.  Vector names are l1|||l2 for string parsing in other functions.
1137 | #'
1138 | #' @export
1139 | makeLCAtable <- function(dend,
1140 |                          includeInternalNodes = FALSE,
1141 |                          verbose = FALSE) {
1142 |   library(dendextend)
1143 |   nodes <- get_leaves_attr(dend, "label")
1144 |   if (includeInternalNodes) nodes <- get_nodes_attr(dend, "label")
1145 |   out <- nm <- rep(0, length(nodes)^2)
1146 |   i <- 1
1147 |   for (l1 in nodes) {
1148 |     if (verbose) print(l1)
1149 |     for (l2 in nodes) {
1150 |       nm[i] <- paste(l1, l2, sep = "|||")
1151 |       out[i] <- lca(dend, l1, l2)
1152 |       i <- i + 1
1153 |     }
1154 |   }
1155 |   names(out) <- nm
1156 |   out
1157 | }
1158 | 
1159 | 
1160 | #' Plot dendrogram
1161 | #'
1162 | #' Plots a dendrogram with set not colors, shapes, sizes and labels.  This is a wrapper for plot.
1163 | #'
1164 | #' @param tree a dendrogram object
1165 | #' @param value numeric vector corresponding to the size of each node
1166 | #' @param cexScale a global cex multiplier for node sizes
1167 | #' @param margins set the margins using par(mar=margins)
1168 | #' @param cols vector of node colors (or a single value)
1169 | #' @param pch vector of node pch shapes (or a single value)
1170 | #' @param ... additional parameters for the plot function
1171 | #'
1172 | #' @export
1173 | plotNodes <- function(tree,
1174 |                       value = rep(1, length(labels(tree))),
1175 |                       cexScale = 2,
1176 |                       margins = c(10, 5, 2, 2),
1177 |                       cols = "black",
1178 |                       pch = 19,
1179 |                       ...) {
1180 |   library(dendextend)
1181 |   tree <- set(tree, "nodes_pch", pch)
1182 |   tree <- set(tree, "nodes_col", cols)
1183 |   tree <- set(tree, "labels_cex", 1)
1184 |   treeN <- set(tree, "nodes_cex", cexScale * value)
1185 |   par(mar = margins)
1186 |   plot(treeN, ylab = "height", ...)
1187 | }
1188 | 
1189 | 
1190 | 
1191 | #' Build panel for one cluster (beta)
1192 | #'
1193 | #' This UNTESTED function finds the best small marker panel for marking a single cluster, using
1194 | #'   proportion difference as the metric for determining the starting panel.
1195 | #'
1196 | #' @param mapDat normalized data of the mapping (=reference) data set.
1197 | #' @param clustersF cluster calls for each cell.
1198 | #' @param medianDat median value for each leaf
1199 | #' @param propIn proportions of cells with expression > 1 in each leaf
1200 | #' @param clust which cluster to target?
1201 | #' @param subSamp number of random nuclei to select from each cluster, EXCEPT the target cluster;
1202 | #'   set as NA to not subsample
1203 | #' @param seed for reproducibility
1204 | #' @param maxSize maximum size of marker gene panel
1205 | #' @param dexCutoff criteria for stopping: when improvement in fraction of cells properly mapped
1206 | #'   dips below this value
1207 | #' @param topGeneCount number of top genes by proportion to consider
1208 | #'
1209 | #' @return a matrix of the top marker genes for each cluster.  Output matrix includes five columns:
1210 | #'   clust = cluster; panel = ordered genes in the panel for that cluster; onCorrect = fraction of
1211 | #'   correctly assigned cells in cluster; offCorrect = fraction of cells correctly assigned outside
1212 | #'   of cluster; dexTotal = additional dex explained by last gene added.
1213 | #'
1214 | #' @export
1215 | buildPanel_oneCluster <- function(mapDat,
1216 |                                   clustersF,
1217 |                                   medianDat = NA,
1218 |                                   propIn = NA,
1219 |                                   clust = as.character(clustersF[1]),
1220 |                                   subSamp = NA,
1221 |                                   seed = 10,
1222 |                                   maxSize = 20,
1223 |                                   dexCutoff = 0.001,
1224 |                                   topGeneCount = 100) {
1225 | 
1226 |   # SUBSAMPLE
1227 |   if (!is.na(subSamp)[1]) {
1228 |     kpSamp <- subsampleCells(clustersF, subSamp, seed)
1229 |     kpSamp[as.character(clustersF) == clust] <- TRUE
1230 |     mapDat <- mapDat[, kpSamp]
1231 |     clustersF <- clustersF[kpSamp]
1232 |     subSamp <- NA
1233 |   }
1234 | 
1235 |   # REFORMAT THE CLUSTER VARIABLE
1236 |   clust <- as.character(clust)
1237 |   clustersIn <- clustersF
1238 |   clustersF <- rep(clust, length(clustersIn))
1239 |   clustersF[as.character(clustersIn) != clust] <- "other"
1240 |   clustersF <- factor(clustersF, levels = c(clust, "other"))
1241 | 
1242 |   # CALCULATE THE PROPORTION OF CELLS EXPRESSED IN
1243 |   # EACH CLUSTERS, AND THE MEDIANS (SEND TO OTHER
1244 |   # FUNCTIONS AS MEDIAN)
1245 |   names(clustersF) <- colnames(mapDat)
1246 |   if (is.na(propIn)[1]) {
1247 |     propIn <- do.call("cbind", tapply(
1248 |       names(clustersF), clustersF, function(x) rowMeans(mapDat[, x] >= 1)
1249 |     ))
1250 |   }
1251 |   rownames(propIn) <- rownames(mapDat)
1252 |   if (is.na(medianDat)[1]) {
1253 |     medianDat <- do.call("cbind", tapply(
1254 |       names(clustersF), clustersF, function(x) rowMedians(mapDat[, x])
1255 |     ))
1256 |   } # switched clustersIn to clustersF
1257 |   rownames(medianDat) <- rownames(mapDat)
1258 | 
1259 |   # FIND THE BEST GENE IN THE PANEL, UNTIL THE
1260 |   # DESIRED PANEL SIZE IS REACHED
1261 |   propDat <- cbind(propIn[, clust], rowMeans(propIn[, colnames(propIn) != clust]))
1262 |   colnames(propDat) <- c(clust, "Other")
1263 |   panel <- onCorrect <- offCorrect <- dexTotal <- NULL
1264 |   first <- TRUE
1265 |   tt <- dex <- 0
1266 |   propDex <- propDat[, 1] - propDat[, 2]
1267 |   topMark <- names(-sort(-propDex))[1:topGeneCount] # Some semi-reasonable way to cut down the gene count
1268 |   while ((((length(panel) < maxSize) & (dex > dexCutoff))) | first) {
1269 | 
1270 |     # EUCLIDEAN MAPPING FOR EACH POSSIBLE ADDITION OF ONE GENE
1271 |     first <- FALSE
1272 |     otherGenes <- setdiff(topMark, panel)
1273 |     matchCount <- offTarget <- totalCount <- rep(0, length(otherGenes))
1274 |     for (i in 1:length(otherGenes)) {
1275 |       corMapTmp <- distTreeMapping(
1276 |         mapDat = mapDat, medianDat = medianDat, genesToMap = c(panel, otherGenes[i])
1277 |       )
1278 |       corMapTmp[is.na(corMapTmp)] <- -1
1279 |       higherOn <- corMapTmp[, clust] == apply(corMapTmp, 1, max)
1280 |       matchCount[i] <- sum((clustersF == clust) & (higherOn)) / sum(clustersF == clust)
1281 |       offTarget[i] <- sum((clustersF != clust) & (!higherOn)) / sum(clustersF != clust)
1282 |       totalCount[i] <- mean(c(matchCount[i], offTarget[i]))
1283 |     }
1284 |     wm <- which.max(totalCount)
1285 |     dex <- totalCount[wm] - tt
1286 |     cr <- matchCount[wm]
1287 |     tt <- totalCount[wm]
1288 |     ot <- offTarget[wm]
1289 |     addGene <- as.character(otherGenes)[wm]
1290 |     panel <- c(panel, addGene)
1291 |     onCorrect <- c(onCorrect, cr)
1292 |     offCorrect <- c(offCorrect, ot)
1293 |     dexTotal <- c(dexTotal, dex)
1294 |   }
1295 |   out <- data.frame(clust, panel, onCorrect, offCorrect, dexTotal)
1296 |   out <- out[1:dim(out)[1] - 1, ]
1297 |   out
1298 | }
1299 | 
1300 | 
1301 | #' Fraction of cells per layer
1302 | #'
1303 | #' Determines the expected proportions in each layer based on input
1304 | #'
1305 | #' @param layerIn a list corresponding to all layers of dissection for a given sample
1306 | #' @param layerNm names of all layers.  set to NULL to have this calculated
1307 | #' @param scale if TRUE (default), scale to the total number of cells
1308 | #'
1309 | #' @return vector indicating the fraction of cells in each layerNm layer
1310 | #'
1311 | #' @export
1312 | layerScale <- function(layerIn,
1313 |                        layerNm = c("L1", "L2/3", "L4", "L5", "L6"),
1314 |                        scale = TRUE) {
1315 |   if (is.null(layerNm)) {
1316 |     for (l in 1:length(layerIn)) layerNm <- c(layerNm, layerIn[[l]])
1317 |     layerNm <- sort(unique(layerNm))
1318 |   }
1319 |   total <- rep(0, length(layerNm))
1320 |   names(total) <- layerNm
1321 |   for (l in 1:length(layerIn)) total[layerIn[[l]]] <- total[layerIn[[l]]] +
1322 |       1 / length(total[layerIn[[l]]])
1323 |   if (scale) total <- total / sum(total)
1324 |   total
1325 | }
1326 | 
1327 | 
1328 | #' Layer weights per cell
1329 | #'
1330 | #' Returns a numeric vector saying how to weight a particular cell for each layer, using a smart
1331 | #'   weighting strategy
1332 | #' @param layerIn a list corresponding to all layers of dissection for a given sample
1333 | #' @param useLayer target layer
1334 | #' @param spillFactor fractional amount of cells in a layer below which it is assumed no cells are
1335 | #'   from that layer in multilayer dissection
1336 | #' @param weightCutoff anything less than this is set to 0 for convenience and to avoid rare types
1337 | #' @param layerNm names of all layers.  set to NULL to have this calculated
1338 | #'
1339 | #' @return numeric vector saying how to weight a particular cell for each layer, using a smart
1340 | #'   weighting strategy
1341 | #'
1342 | #' @export
1343 | smartLayerAllocation <- function(layerIn,
1344 |                                  useLayer = "L1",
1345 |                                  spillFactor = 0.15,
1346 |                                  weightCutoff = 0.02,
1347 |                                  layerNm = c("L1", "L2/3", "L4", "L5", "L6")) {
1348 |   if (is.null(layerNm)) {
1349 |     for (i in 1:length(layer)) layerNm <- c(layerNm, layerNm[[i]])
1350 |     layerNm <- sort(unique(layerNm))
1351 |   }
1352 |   layerMat <- matrix(0, nrow = length(layerIn), ncol = length(layerNm))
1353 |   colnames(layerMat) <- layerNm
1354 |   for (i in 1:length(layerIn)) layerMat[i, layerIn[[i]]] <- 1
1355 |   oneCount <- rowSums(layerMat) == 1
1356 |   wgtFrac <- colSums(rbind(layerMat[oneCount, ], layerMat[oneCount, ])) / 2
1357 |   wgtFrac <- wgtFrac / max(wgtFrac)
1358 |   wgtFrac[is.na(wgtFrac)] <- 0.01
1359 |   wgtFrac[wgtFrac < spillFactor] <- 1e-06 * wgtFrac[wgtFrac < spillFactor]
1360 |   wgtFrac <- wgtFrac / sum(wgtFrac)
1361 |   for (i in which(!oneCount))
1362 |     layerMat[i, ] <- (layerMat[i, ] * wgtFrac) / sum(layerMat[i, ] * wgtFrac)
1363 |   layerMat[layerMat < weightCutoff] <- 0
1364 |   for (i in 1:length(oneCount))
1365 |     layerMat[i, ] <- layerMat[i, ] / sum(layerMat[i, ])
1366 |   out <- layerMat[, useLayer]
1367 |   out
1368 | }
1369 | 
1370 | 
1371 | #' Layer weights per cell
1372 | #'
1373 | #' Returns a numeric vector saying how to weight a particular cell for each layer.  This is a
1374 | #'   wrapper for smartLayerAllocation
1375 | #'
1376 | #' @param layerIn a list corresponding to all layers of dissection for a given sample
1377 | #' @param useLayer target layer
1378 | #' @param cluster if passed the weights are smartly allocated based on laminar distributions by cluster
1379 | #' @param ... additional variables for smartLayerAllocation
1380 | #'
1381 | #' @return numeric vector with weights for cells in input layer
1382 | #'
1383 | #' @export
1384 | layerFraction <- function(layerIn,
1385 |                           useLayer = "L1",
1386 |                           cluster = NA,
1387 |                           ...) {
1388 |   weight <- rep(0, length(layerIn))
1389 |   if (is.na(cluster)[1]) {
1390 |     for (l in 1:length(weight)) weight[l] <- sum(layerIn[[l]] == useLayer) / length(layerIn[[l]])
1391 |     return(weight)
1392 |   }
1393 |   for (cli in unique(cluster))
1394 |     weight[cli == cluster] <- smartLayerAllocation(layerIn[cli == cluster], useLayer, ...)
1395 |   weight[is.na(weight)] <- 0
1396 |   weight
1397 | }
1398 | 
1399 | 
1400 | #' Filter possible cluster calls using priors
1401 | #'
1402 | #' This function will return a vector of possible clusters for cells that meet a set of priors for each layer
1403 | #'
1404 | #' @param cluster vector of all clusters
1405 | #' @param layer list of layers for each cluster entry (for data sets with only laminar dissections,
1406 | #'   each list entry will be of length 1)
1407 | #' @param subsetVector a vector of TRUE/FALSE values indicated whether the entry is in the subset of
1408 | #'   interest (e.g., Cre lines); default is all
1409 | #' @param useClusters a set of clusters to be considered a priori (e.g., GABA vs. glut); default is all
1410 | #' @param rareLimit define any values less than this as 0.  The idea is to exclude rare cells
1411 | #' @param layerNm names of all layers.  set to NULL to have this calculated
1412 | #' @param scaleByLayer if TRUE, scales to the proportion of cells in each layer
1413 | #' @param scaleByFn what function should be used for the layer scaling (default=max, ignored
1414 | #'   if scaleByLayer=FALSE)
1415 | #' @param smartWeight if TRUE, multilayer dissections are weighted smartly by cluster, rather
1416 | #'   than evenly by cluster (FALSE)
1417 | #' @param spillFactor fractional amount of cells in a layer below which it is assumed no cells
1418 | #'   are from that layer in multilayer dissection
1419 | #' @param weightCutoff anything less than this is set to 0 for convenience
1420 | #'
1421 | #' @return a vector of possible clusters for cells that meet a set of priors for each layer
1422 | #'
1423 | #' @export
1424 | possibleClustersByPriors <- function(cluster,
1425 |                                      layer,
1426 |                                      subsetVector = rep(TRUE, length(cluster)),
1427 |                                      useClusters = sort(unique(cluster)),
1428 |                                      rareLimit = 0.005,
1429 |                                      layerNm = c("L1", "L2/3", "L4", "L5", "L6"),
1430 |                                      scaleByLayer = TRUE,
1431 |                                      scaleByFn = max,
1432 |                                      smartWeight = TRUE,
1433 |                                      spillFactor = 0.15,
1434 |                                      weightCutoff = 0.02) {
1435 |   if (is.null(layerNm)) {
1436 |     for (i in 1:length(layer)) layerNm <- c(layerNm, layerNm[[i]])
1437 |     layerNm <- sort(unique(layerNm))
1438 |   }
1439 |   if (!is.factor(cluster)) cluster <- factor(cluster)
1440 |   allClusters <- levels(cluster)
1441 |   out <- matrix(NA, nrow = length(allClusters), ncol = length(layerNm))
1442 |   rownames(out) <- allClusters
1443 |   colnames(out) <- layerNm
1444 |   isClust <- is.element(cluster, useClusters)
1445 |   sb <- subsetVector & isClust
1446 |   if (sum(sb) == 0) {
1447 |     return(out)
1448 |   } # Don't run on cases with no data
1449 |   subLayer <- layerScale(layer[sb], layerNm = layerNm) /
1450 |     layerScale(layer[isClust], layerNm = layerNm)
1451 |   subLayer <- subLayer / sum(subLayer)
1452 |   kpLay <- names(subLayer)[subLayer >= rareLimit]
1453 |   out[useClusters, kpLay] <- 0
1454 | 
1455 |   smartClust <- NA
1456 |   if (smartWeight) smartClust <- cluster
1457 | 
1458 |   for (lay in kpLay) {
1459 |     weight <- layerFraction(layer, lay, smartClust,
1460 |       spillFactor = spillFactor,
1461 |       weightCutoff = weightCutoff, layerNm = layerNm
1462 |     )
1463 |     kpCl <- unique(cluster[sb & (weight > 0)])
1464 |     for (cli in kpCl) out[cli, lay] <- sum(weight[sb & (cluster == cli)])
1465 |   }
1466 |   if (scaleByLayer) {
1467 |     for (lay in kpLay) out[, lay] <- out[, lay] / scaleByFn(out[, lay], na.rm = TRUE)
1468 |     out[out < rareLimit] <- 0
1469 |   }
1470 |   out
1471 | }
1472 | 
1473 | 
1474 | #' Get node height
1475 | #'
1476 | #' Returns the heights of each node, scaled from 0 (top) to 1 (leafs); this is a wrapper for dendextend functions
1477 | #'
1478 | #' @param tree a dendrogram object
1479 | #'
1480 | #' @return a vector of node heights
1481 | #'
1482 | #' @export
1483 | getNodeHeight <- function(tree) {
1484 |   nodeHeight <- get_nodes_attr(tree, "height")
1485 |   nodeHeight <- 1 - nodeHeight / max(nodeHeight)
1486 |   names(nodeHeight) <- get_nodes_attr(tree, "label")
1487 |   return(nodeHeight)
1488 | }
1489 | 
1490 | 
1491 | #' Fraction of correct calls per node
1492 | #'
1493 | #' This function returns the fraction correctly assigned to each node (as defined that the actual
1494 | #'   and predicted cluster are both in the same node)
1495 | #'
1496 | #' @param dendIn dendrogram for mapping.  Ignored if minimizeHeight=FALSE
1497 | #' @param clActual character vector of actual cluster assignments
1498 | #' @param clPredict character vector of predicted cluster assignments
1499 | #' @param minCount set to 0 results from clusters with fewer than this number of cells (default
1500 | #'   is to consider all clusters)
1501 | #' @param defaultSum value to return in cases where there are fewer than minCount cells in the
1502 | #'   actual cluster (e.g., cases that aren't considered at all)
1503 | #' @param out required for recursive function.  Do not set!
1504 | #'
1505 | #' @return matrix of two columns: (1) node name and (2) the fraction of cells in that node that
1506 | #'   are correctly assigned
1507 | #'
1508 | #' @export
1509 | fractionCorrectPerNode <- function(dendIn,
1510 |                                    clActual,
1511 |                                    clPredict,
1512 |                                    minCount = 0.1,
1513 |                                    defaultSum = -1,
1514 |                                    out = NULL) {
1515 |   clActual <- as.character(clActual)
1516 |   clPredict <- as.character(clPredict)
1517 |   if (length(dendIn) > 1) {
1518 |     for (i in 1:length(dendIn)) {
1519 |       allLabels <- labels(dendIn[[i]])
1520 |       nodeName <- attr(dendIn[[i]], "label")
1521 |       isActual <- is.element(clActual, allLabels)
1522 |       isPredict <- is.element(clPredict, allLabels)
1523 |       fractionCorrrect <- signif(sum(isActual & isPredict) / sum(isPredict + 1e-11), 3) # CHECK THIS
1524 |       if (sum(isActual) < minCount) fractionCorrrect <- defaultSum
1525 |       out <- rbind(out, c(nodeName, fractionCorrrect))
1526 |       colnames(out) <- c("nodeName", "fractionCorrrect")
1527 |       out <- fractionCorrectPerNode(dendIn[[i]], clActual, clPredict, minCount, defaultSum, out)
1528 |     }
1529 |   }
1530 |   out <- as.data.frame(out)
1531 |   out$fractionCorrrect <- as.numeric(as.character(out$fractionCorrrect))
1532 |   rownames(out) <- out[, 1]
1533 |   out
1534 | }
1535 | 
1536 | 
1537 | #' Filter by meta-data
1538 | #'
1539 | #' Return a filter of TRUE/FALSE values for a given piece of meta-data (e.g., broad class).
1540 | #'
1541 | #' @param classVector vector corresponding to the class information for filtering (e.g., vector
1542 | #'   of label calls)
1543 | #' @param sampleInfo matrix of sample information with rows corresponding to cells and columns
1544 | #'   corresponding to meta-data
1545 | #' @param classColumn column name of class information
1546 | #' @param clusterColumn column name of cluster information
1547 | #' @param threshold minimum fraction of cluster cells from a given class to be considered present
1548 | #'
1549 | #' @return a matrix of filters with rows as clusters and columns as classes with entries of TRUE or
1550 | #'   FALSE indicating whether cells from a given class can assigned to that cluster, given threshold.
1551 | #'
1552 | #' @export
1553 | filterByClass <- function(classVector,
1554 |                           sampleInfo,
1555 |                           classColumn = "cluster_type_label",
1556 |                           clusterColumn = "cluster_label",
1557 |                           threshold = 0.1) {
1558 |   ##
1559 |   out <- table(factor(sampleInfo[, clusterColumn]), factor(sampleInfo[, classColumn]))
1560 |   out <- out / rowSums(out)
1561 |   out <- out > threshold
1562 | 
1563 |   # Allow for names that not present in the
1564 |   # sampleInfo file
1565 |   out <- as.data.frame(out)
1566 |   tmp <- setdiff(classVector, colnames(out))
1567 |   if (length(tmp) > 0) {
1568 |     for (tm in tmp) out[, tm] <- TRUE
1569 |   }
1570 |   out <- as.matrix(out)
1571 | 
1572 |   # Remove '' names
1573 |   classVector[classVector == ""] <- "none"
1574 |   rownames(out)[rownames(out) == ""] <- "none"
1575 |   colnames(out)[colnames(out) == ""] <- "none"
1576 | 
1577 |   # Find and output the filter matrix
1578 |   outTable <- t(out[, classVector])
1579 |   if (length(classVector) == dim(sampleInfo)[1]) {
1580 |     rownames(outTable) <- sampleInfo[, "sample_id"]
1581 |   }
1582 |   outTable
1583 | }
1584 | 
1585 | #' Subsample cells
1586 | #'
1587 | #' Subsets a categorical vector to include up to a maximum number of values for each category.
1588 | #'
1589 | #' @param clusters vector of cluster labels (or any category) in factor or character format
1590 | #' @param subSamp maximum number of values for each category to subsample.  Can be single integer
1591 | #'   for global subsampling, or a *named* vector corresponding to how many values to take from each
1592 | #'   category in clusters.
1593 | #' @param seed for reproducibility
1594 | #'
1595 | #' @return returns a vector of TRUE / FALSE with a maximum of subSamp TRUE calls per category
1596 | #'
1597 | #' @export
1598 | subsampleCells <- function(clusters,
1599 |                            subSamp = 25,
1600 |                            seed = 5) {
1601 |   if(length(subSamp)==1)
1602 |     subSamp = rep(subSamp,length(unique(as.character(clusters))))
1603 |   if(is.null(names(subSamp))) 
1604 |     names(subSamp) <- unique(as.character(clusters))
1605 |   kpSamp <- rep(FALSE, length(clusters))
1606 |   for (cli in unique(as.character(clusters))) {
1607 |     val = subSamp[cli]
1608 |     if(!is.na(val)[1]){
1609 |       set.seed(seed)
1610 |       seed <- seed + 1
1611 |       kp <- which(clusters == cli)
1612 |       kpSamp[kp[sample(1:length(kp), min(length(kp), val))]] <- TRUE
1613 |     }
1614 |   }
1615 |   kpSamp
1616 | }
1617 | 
1618 | 
1619 | #' Fraction of cells correctly assigned
1620 | #'
1621 | #' This function takes as input an ordered set of marker genes (e.g., from at iterative
1622 | #'   algorithm), and returns a vector showing the fraction of cells correctly mapped.
1623 | #'
1624 | #' @param orderedGenes an ordered list of input genes (e.g. from an iterative algorithm)
1625 | #' @param mapDat normalized data of the mapping (=reference) data set.
1626 | #' @param medianDat median value for each leaf
1627 | #' @param clustersF cluster calls for each cell
1628 | #' @param verbose whether or not to show progress in the function
1629 | #' @param plot if TRUE, plotCorrectWithGenes is run
1630 | #' @param ... parameters passed to plotCorrectWithGenes (if plot=TRUE)
1631 | #' @param return if TRUE, the value is returned
1632 | #'
1633 | #' @return a vector showing the fraction of cells correctly mapped to each cluster
1634 | #'
1635 | #' @export
1636 | fractionCorrectWithGenes <- function(orderedGenes,
1637 |                                      mapDat,
1638 |                                      medianDat,
1639 |                                      clustersF,
1640 |                                      verbose = FALSE,
1641 |                                      plot = TRUE,
1642 |                                      return = TRUE,
1643 |                                      ...) {
1644 |   numGn <- 2:length(orderedGenes)
1645 |   frac <- rep(0, length(orderedGenes))
1646 |   for (i in numGn) {
1647 |     gns <- orderedGenes[1:i]
1648 |     corMapTmp <- suppressWarnings(corTreeMapping(
1649 |       mapDat = mapDat,
1650 |       medianDat = medianDat, genesToMap = gns
1651 |     ))
1652 |     corMapTmp[is.na(corMapTmp)] <- 0
1653 |     topLeafTmp <- getTopMatch(corMapTmp)
1654 |     frac[i] <- 100 * mean(topLeafTmp[, 1] == clustersF)
1655 |   }
1656 |   frac[is.na(frac)] <- 0
1657 |   if (plot) {
1658 |     plotCorrectWithGenes(frac, genes = orderedGenes, ...)
1659 |   }
1660 |   if (return) {
1661 |     return(frac)
1662 |   }
1663 | }
1664 | 
1665 | #' Correct mapping at different tree heights
1666 | #'
1667 | #' This function takes as input an ordered set of marker genes (e.g., from at iterative algorithm,
1668 | #'   and returns an table showing the fraction of cells correctly mapped to a similar cell type
1669 | #'   (as defined by the heights parameter).  A height of 1 indicates correct mapping to the leaf.
1670 | #'
1671 | #' @param orderedGenes an ordered list of input genes (e.g. from an iterative algorithm)
1672 | #' @param dend dendrogram for mapping.
1673 | #' @param mapDat normalized data of the mapping (=reference) data set.
1674 | #' @param medianDat median value for each leaf
1675 | #' @param clustersF cluster calls for each cell
1676 | #' @param minVal minimum number of genes to consider from the list in the mapping
1677 | #' @param heights height in the tree to look at
1678 | #' @param verbose whether or not to show progress in the function
1679 | #'
1680 | #' @return a matrix of fractions of cells correctly mapped for different tree heights (columns)
1681 | #'   and different gene panels (rows)
1682 | #'
1683 | #' @export
1684 | buildQualityTable <- function(orderedGenes,
1685 |                               dend,
1686 |                               mapDat,
1687 |                               medianDat,
1688 |                               clustersF,
1689 |                               minVal = 2,
1690 |                               heights = c((0:100) / 100),
1691 |                               verbose = FALSE) {
1692 |   minVal <- max(2, round(minVal))
1693 |   nodeHeight <- get_nodes_attr(dend, "height")
1694 |   nodeHeight <- 1 - nodeHeight / max(nodeHeight)
1695 |   names(nodeHeight) <- get_nodes_attr(dend, "label")
1696 | 
1697 |   outTable <- matrix(0, nrow = length(orderedGenes), ncol = length(heights))
1698 |   rownames(outTable) <- orderedGenes
1699 |   colnames(outTable) <- paste0("Height_", heights)
1700 | 
1701 |   for (r in minVal:length(orderedGenes)) {
1702 |     if (verbose) print(r)
1703 |     gns <- orderedGenes[1:r]
1704 |     topLeafTmp <- suppressWarnings(getTopMatch(corTreeMapping(
1705 |       mapDat = mapDat, medianDat = medianDat, genesToMap = gns
1706 |     )))
1707 |     lcaVector <- nodeHeight[lca(dend, as.character(topLeafTmp[, 1]), clustersF)]
1708 |     for (c in 1:length(heights)) {
1709 |       outTable[r, c] <- mean(lcaVector >= (1 - heights[c]))
1710 |     }
1711 |   }
1712 |   outTable
1713 | }
1714 | 
1715 | #' Plot fraction correct
1716 | #'
1717 | #' This function is a wrapper for plot designd for plotting the fraction correctly mapped for a given gene set.
1718 | #'   If geneN is the Nth gene, the plotted value indicates correct mapping using genes 1:N.
1719 | #'
1720 | #' @param frac a numeric vector indicating the fraction of cells correctly mapped for a given gene panel
1721 | #' @param genes ordered character vector (e.g., of genes) to be plotted; default is names(frac)
1722 | #' @param ... additional parameters for plot.
1723 | #'
1724 | #' @export
1725 | plotCorrectWithGenes <- function(frac,
1726 |                                  genes = names(frac),
1727 |                                  xlab = "Number of genes in panel",
1728 |                                  main = "All clusters gene panel",
1729 |                                  ylim = c(-10, 100),
1730 |                                  lwd = 5,
1731 |                                  ylab = "Percent of nuclei correctly mapping",
1732 |                                  colLine = "grey",
1733 |                                  ...) {
1734 |   numGn <- 1:length(frac)
1735 |   plot(numGn, frac,
1736 |     type = "l", col = "grey", xlab = xlab,
1737 |     ylab = ylab, main = main, ylim = ylim, lwd = lwd, ...
1738 |   )
1739 |   abline(h = (-2:20) * 5, lty = "dotted", col = colLine)
1740 |   abline(h = 0, col = "black", lwd = 2)
1741 |   text(numGn, frac, genes, srt = 90)
1742 | }
1743 | 
1744 | 
1745 | #' Get binary (aka beta) score
1746 | #'
1747 | #' Returns a beta score which indicates the binaryness of a gene across clusters.  High scores
1748 | #'   (near 1) indicate that a gene is either on or off in nearly all cells of every cluster.
1749 | #'   Scores near 0 indicate a cells is non-binary (e.g., not expressed, ubiquitous, or
1750 | #'   randomly expressed).  This value is used for gene filtering prior to defining clustering.
1751 | #'
1752 | #' @param propExpr a matrix of proportions of cells (rows) in a given cluster (columns) with
1753 | #'   CPM/FPKM > 1 (or 0, HCT uses 1)
1754 | #' @param returnScore if TRUE returns the score, if FALSE returns the ranks
1755 | #' @param spec.exp scaling factor (recommended to leave as default)
1756 | #'
1757 | #' @return returns a numeric vector of beta score (or ranks)
1758 | #'
1759 | #' @export
1760 | getBetaScore <- function(propExpr,
1761 |                          returnScore = TRUE,
1762 |                          spec.exp = 2) {
1763 |   calc_beta <- function(y, spec.exp = 2) {
1764 |     d1 <- as.matrix(dist(y))
1765 |     eps1 <- 1e-10
1766 |     # Marker score is combination of specificity and sparsity
1767 |     score1 <- sum(d1^spec.exp) / (sum(d1) + eps1)
1768 |     score1
1769 |   }
1770 | 
1771 |   betaScore <- apply(propExpr, 1, calc_beta)
1772 |   betaScore[is.na(betaScore)] <- 0
1773 |   if (returnScore) return(betaScore)
1774 |   scoreRank <- rank(-betaScore)
1775 |   scoreRank
1776 | }
1777 |                                    
1778 |                                  
1779 | #' Calculate row medians
1780 | #'
1781 | #' This is a wrapper for matrixStats::rowMedians that doesn't crash if a vector is provided as input
1782 | #'
1783 | #' @param x a matrix or vector of values
1784 | #'
1785 | #' @return if the input is a matrix, return row medians.  If a vector, it returns the inputted vector
1786 | #'
1787 | #' @export
1788 | rowMedians <- function(x,...) {
1789 |   require(matrixStats)
1790 |   if(is.matrix(x)) return(matrixStats::rowMedians(x,...))
1791 |   x
1792 | }
1793 | 


--------------------------------------------------------------------------------
/R/mfishMapping.r:
--------------------------------------------------------------------------------
  1 | #' Build and plot dendrogram from gene panel
  2 | #'
  3 | #' Build and plot a dendrogram using correlation-based average linkage hierarchical
  4 | #'   clustering and only using a specified set of genes.  The output is the expected
  5 | #'   accuracy of mapping to each node in the tree, which gives an idea of the best-case
  6 | #'   expected results for mFISH analysis.
  7 | #'
  8 | #' @param dend dendrogram for mapping.  Ignored if medianDat is passed
  9 | #' @param refDat normalized data of the REFERENCE data set.  Ignored if medianDat is passed
 10 | #' @param mapDat normalized data of the MAPPING data set.  Default is to map the data onto itself.
 11 | #' @param medianDat representative value for each leaf and node.  If not entered, it is calculated
 12 | #' @param requiredGenes minimum number of genes required to be expressed in a cluster (column
 13 | #'   of medianDat) for the cluster to be included (default=2)
 14 | #' @param clusters  cluster calls for each cell
 15 | #' @param mappedAsReference if TRUE, returns the fraction of cells mapped to a node which are
 16 | #'   were orginally clustered from that node; if FALSE (default) returns the fraction of cells
 17 | #'   clustered under a node which are mapped to the correct node.
 18 | #' @param genesToMap which genes to include in the correlation mapping
 19 | #' @param plotdendro should the dendrogram be plotted (default = TRUE)
 20 | #' @param returnDendro should the dendrogram be returned (default = TRUE)
 21 | #' @param mar margins (for use with par)
 22 | #' @param main,ylab add title and labels to plot (default is NULL)
 23 | #' @param use,... additional parameters for cor
 24 | #'
 25 | #' @return a list where the first entry is the resulting tree and the second entry is the
 26 | #'   fraction of cells correctly mapping to each node using the inputted gene panel.
 27 | #'
 28 | #' @export
 29 | buildTreeFromGenePanel <- function(dend = NA,
 30 |                                    refDat = NA,
 31 |                                    mapDat = refDat,
 32 |                                    medianDat = NA,
 33 |                                    requiredGenes = 2,
 34 |                                    clusters = NA,
 35 |                                    mappedAsReference = FALSE,
 36 |                                    genesToMap = rownames(mapDat),
 37 |                                    plotdendro = TRUE,
 38 |                                    returndendro = TRUE,
 39 |                                    mar = c(12, 5, 5, 5),
 40 |                                    main = NULL,
 41 |                                    ylab = NULL,
 42 |                                    use = "p",
 43 |                                    ...) {
 44 |   library(dendextend)
 45 | 
 46 |   # Calculate the median, if needed.
 47 |   if (is.na(medianDat[1])) {
 48 |     names(clusters) <- colnames(refDat)
 49 |     medianDat <- do.call("cbind", tapply(
 50 |       names(clusters), clusters, function(x) rowMedians(refDat[, x])
 51 |     ))
 52 |     rownames(medianDat) <- rownames(refDat)
 53 |     if (!is.na(dend)) medianDat <- leafToNodeMedians(dend, medianDat)
 54 |   }
 55 |   gns <- intersect(genesToMap, intersect(rownames(mapDat), rownames(medianDat)))
 56 | 
 57 |   # Subset the data to relevant genes and clusters
 58 |   medianDat <- medianDat[gns, ]
 59 |   mapDat <- mapDat[gns, ]
 60 |   medianDat <- medianDat[, colSums(medianDat > 0) >= requiredGenes]
 61 |   kpDat <- (colSums(mapDat > 0) >= requiredGenes) & (is.element(clusters, colnames(medianDat)))
 62 |   mapDat <- mapDat[, kpDat]
 63 | 
 64 |   # Perform the correlation mapping
 65 |   facsCor <- corTreeMapping(medianDat = medianDat, mapDat = mapDat, use = use, ...)
 66 |   facsCl <- colnames(facsCor)[apply(facsCor, 1, which.max)]
 67 | 
 68 |   # Build a new tree based on mapping
 69 |   sCore <- function(x, use, ...) return(as.dist(1 - WGCNA::cor(x, use = use, ...)))
 70 |   dend <- getDend(medianDat, sCore, use = use, ...)
 71 | 
 72 |   # Which leaves have which nodes?
 73 |   has_any_labels <- function(sub_dend, the_labels) any(labels(sub_dend) %in% the_labels)
 74 |   node_labels <- NULL
 75 |   for (lab in labels(dend)) node_labels <- cbind(
 76 |       node_labels, noded_with_condition(dend, has_any_labels, the_labels = lab)
 77 |     )
 78 |   rownames(node_labels) <- get_nodes_attr(dend, "label")
 79 |   colnames(node_labels) <- labels(dend)
 80 | 
 81 |   # Swap the mapped and reference nodes if
 82 |   # mappedAsReference=TRUE
 83 |   clTmp <- as.character(clusters[kpDat])
 84 |   if (mappedAsReference) {
 85 |     temp <- clTmp
 86 |     clTmp <- facsCl
 87 |     facsCl <- temp
 88 |   }
 89 | 
 90 |   # Which clusters agree at the node level?
 91 |   agreeNodes <- apply(cbind(facsCl, clTmp), 1, function(lab, node_labels) {
 92 |     rowSums(node_labels[, lab]) == 2
 93 |   }, node_labels)
 94 |   colnames(agreeNodes) <- clTmp
 95 | 
 96 |   # Which clusters are in each nodes?
 97 |   isInNodes <- t(apply(node_labels, 1, function(node, cl, dend) {
 98 |     is.element(cl, labels(dend)[node])
 99 |   }, clTmp, dend))
100 |   colnames(isInNodes) <- clTmp
101 | 
102 |   # For each node, plot the fraction of cells that
103 |   # match if desired?
104 |   fracAgree <- rowSums(agreeNodes) / rowSums(isInNodes)
105 |   if (plotdendro) {
106 |     par(mar = mar)
107 |     dend %>% set("nodes_cex", 0) %>% set("branches_col", "grey") %>% plot()
108 |     text(get_nodes_xy(dend)[, 1], get_nodes_xy(dend)[, 2], round(fracAgree * 100))
109 |     title(main = main, ylab = ylab)
110 |   }
111 | 
112 |   # Return the results (if desired)
113 |   if (returndendro) return(list(dend, fracAgree))
114 | }
115 | 
116 | 
117 | #' Build a dendrogram from gene panel
118 | #'
119 | #' Build a dendrogram from an inputted data matrix.
120 | #'
121 | #' @param dat matrix of values (e.g., genes x clusters) for calculating the dendrogram
122 | #' @param distFun function for calculating distance matrix (default is correlation-based)
123 | #' @param ... additional variables for distFun
124 | #'
125 | #' @return dendrogram
126 | #'
127 | #' @export
128 | getDend <- function(dat,
129 |                     distFun = function(x) return(as.dist(1 - WGCNA::cor(x))),
130 |                     ...) {
131 |   distCor <- distFun(dat, ...)
132 |   distCor[is.na(distCor)] <- max(distCor, na.rm = TRUE) * 1.2
133 |   # Avoid crashing by setting NA values to hang off the side of the tree.
134 |   avgClust <- hclust(distCor, method = "average")
135 |   dend <- as.dendrogram(avgClust)
136 |   dend <- labelDend(dend)[[1]]
137 |   return(dend)
138 | }
139 | 
140 | 
141 | #' Label dendrogram nodes
142 | #'
143 | #' Add numeric node labels to a dendrogram.
144 | #'
145 | #' @param dend dendrogram object
146 | #' @param distFun starting numeric node value (default=1)
147 | #'
148 | #' @return a list where the first item is the new dendrogram object and the
149 | #'   second item is the final numeric node value.
150 | #'
151 | #' @export
152 | labelDend <- function(dend, n = 1) {
153 |   if (is.null(attr(dend, "label"))) {
154 |     attr(dend, "label") <- paste0("n", n)
155 |     n <- n + 1
156 |   }
157 |   if (length(dend) > 1) {
158 |     for (i in 1:length(dend)) {
159 |       tmp <- labelDend(dend[[i]], n)
160 |       dend[[i]] <- tmp[[1]]
161 |       n <- tmp[[2]]
162 |     }
163 |   }
164 |   return(list(dend, n))
165 | }
166 | 
167 | 
168 | #' Summarize matrix
169 | #'
170 | #' Groups columns in a matrix by a specified group vector and summarizes using a specificed function.
171 | #'   Optionally binarizes the matrix using a specified cutoff parameter.  This is a wrapper for tapply.
172 | #'
173 | #' @param mat matrix where the columns (e.g., samples) are going to be grouped
174 | #' @param group vector of length dim(mat)[2] corresponding to the groups
175 | #' @param scale either 'none' (default),'row', or 'column'
176 | #' @param scaleQuantile what quantile of value should be set as 1 (default=1)
177 | #' @param binarize should the data be binarized? (default=FALSE)
178 | #' @param binMin minimum ON value for the binarized matrix (ignored if binarize=FALSE)
179 | #' @param summaryFunction function (or function name) to be used for summarization
180 | #' @param ... additional parameters for summaryFunction
181 | #'
182 | #' @return matrix of summarized values
183 | #'
184 | #' @export
185 | summarizeMatrix <- function(mat,
186 |                             group,
187 |                             scale = "none",
188 |                             scaleQuantile = 1,
189 |                             binarize = FALSE,
190 |                             binMin = 0.5,
191 |                             summaryFunction = median,
192 |                             ...) {
193 | 
194 |   # Make sure the names match up
195 |   if (is.null(colnames(mat))) colnames(mat) <- names(group)
196 |   if (is.null(colnames(mat))) colnames(mat) <- 1:length(group)
197 |   names(group) <- colnames(mat)
198 | 
199 |   # Calculate the summary
200 |   summaryFunction <- match.fun(summaryFunction)
201 |   runFunction <- function(x, ...) {
202 |     if (length(x) > 1) {
203 |       return(apply(mat[, x], 1, summaryFunction, ...))
204 |     }
205 |     return(mat[, x])
206 |   }
207 |   summarizedMat <- do.call("cbind", tapply(names(group), group, runFunction, ...))
208 |   if (is.factor(group)) summarizedMat <- summarizedMat[, levels(group)]
209 |   rownames(summarizedMat) <- rownames(mat)
210 | 
211 |   # Scale the data if desired
212 |   if (substr(scale, 1, 1) == "r") {
213 |     for (i in 1:dim(summarizedMat)[1])
214 |       summarizedMat[i, ] <- summarizedMat[i, ] /
215 |         max(1e-06, quantile(summarizedMat[i, ], probs = scaleQuantile))
216 |   }
217 |   if (substr(scale, 1, 1) == "c") {
218 |     for (i in 1:dim(summarizedMat)[2])
219 |       summarizedMat[, i] <- summarizedMat[, i] /
220 |         max(1e-06, quantile(summarizedMat[, i], probs = scaleQuantile))
221 |   }
222 | 
223 |   # Binarize the data if desired
224 |   if (binarize) {
225 |     summarizedMat <- summarizedMat > binMin
226 |     summarizedMat <- summarizedMat + 1 - 1
227 |   }
228 |   summarizedMat
229 | }
230 | 
231 | 
232 | #' Scale mFISH data and map to RNA-seq reference
233 | #'
234 | #' This function is a wrapper for several other functions which aim to scale mFISH data to
235 | #'   more closely match RNA-seq data and then map the mFISH data to the closest reference
236 | #'   classes.  There are several parameters allowing flexability in filtering and analysis.
237 | #'
238 | #' @param mapDat normalized data of the MAPPING data set.  Default is to map the data onto itself.
239 | #' @param refSummaryDat normalized summary data of the REFERENCE data set (e.g., what to map against)
240 | #' @param genesToMap which genes to include in the mapping (calculated in not entered)
241 | #' @param mappingFunction which function to use for mapping (default is cellToClusterMapping_byCor)
242 | #'   The function must include at least two parameters with the first one being mapped data and the
243 | #'   second data the reference.  Additional parameters are okay.  Output must be a data frame where
244 | #'   the first value is a mapped class.  Additional columns are okay and will be returned)
245 | #' @param transform function for transformation of the data (default in none)
246 | #' @param noiselevel scalar value at or below which all values are set to 0 (default is 0)
247 | #' @param scaleFunction which function to use for scaling mapDat to refSummaryDat (default is setting
248 | #'   90th quantile of mapDat to max of refSummaryDat and truncating higher mapDat values)
249 | #' @param omitGenes genes to be included in the data frames but excluded from the mapping
250 | #' @param metadata a data frame of possible metadata (additional columns are okay and ignored):
251 | #' \describe{
252 | #'   \item{area}{a vector of cell areas for normalization}
253 | #'   \item{experiment}{a vector indicating if multiple experiments should be scaled separately}
254 | #'   \item{x,y}{x (e.g., parallel to layer) and y (e.g., across cortical layers) coordinates in tissue}
255 | #' }
256 | #' @param integerWeights if not NULL (default) a vector of integers corresponding to how many times
257 | #'   each gene should be counted as part of the correlation.  This is equivalent to calculating
258 | #'   a weighted correlation, but only allows for integer weight values (for use with cor).
259 | #' @param binarize should the data be binarized? (default=FALSE)
260 | #' @param binMin minimum ON value for the binarized matrix (ignored if binarize=FALSE)
261 | #' @param ... additional parameters for passthrough into other functions
262 | #'
263 | #' @return a list with the following entrees:
264 | #' \describe{
265 | #'   \item{mapDat}{mapDat data matrix is passed through}
266 | #'   \item{scaleDat}{scaled mapDat data matrix}
267 | #'   \item{mappingResults}{Results of the mapping and associated confidence values (if any)}
268 | #'   \item{metadata=metadata}{metadata is passed through unchanged}
269 | #'   \item{scaledX/Y}{scaled x and y coordinates (or unscaled if scaling was not performed)}
270 | #' }
271 | #'
272 | #' @export
273 | fishScaleAndMap <- function(mapDat,
274 |                             refSummaryDat,
275 |                             genesToMap = NULL,
276 |                             mappingFunction = cellToClusterMapping_byCor,
277 |                             transform = function(x) x,
278 |                             noiselevel = 0,
279 |                             scaleFunction = quantileTruncate,
280 | 							omitGenes = NULL,
281 |                             metadata = data.frame(experiment = rep("all", dim(mapDat)[2])),
282 |                             integerWeights = NULL,
283 |                             binarize = FALSE,
284 |                             binMin = 0.5,
285 |                             ...) {
286 | 
287 |   # Setup
288 |   mappingFunction <- match.fun(mappingFunction)
289 |   scaleFunction <- match.fun(scaleFunction)
290 |   transform <- match.fun(transform)
291 |   if (is.null(genesToMap)) genesToMap <- colnames(mapDat)
292 |   genesToMap <- intersect(genesToMap, rownames(refSummaryDat))
293 |   params <- colnames(metadata)
294 |   refSummaryDat <- refSummaryDat[genesToMap, ]
295 |   mapDat <- mapDat[genesToMap, ]
296 | 
297 |   # Transform the data to be mapped
298 |   scaleDat <- as.matrix(mapDat[genesToMap, ])
299 |   scaleDat[scaleDat <= noiselevel] <- 0 # Set values less than or equal to noiselevel to 0
300 |   if (is.element("area", params)) {
301 |     # Account for spot area in gene expression calculation
302 |     scaleDat <- t(t(scaleDat) / metadata$area) * mean(metadata$area)
303 |   }
304 |   scaleDat <- transform(scaleDat)
305 | 
306 |   # Scale to the reference data
307 |   for (ex in unique(metadata$experiment)) {
308 |     isExp <- metadata$experiment == ex
309 |     for (g in genesToMap) scaleDat[g, isExp] <- scaleFunction(scaleDat[
310 |         g, isExp
311 |       ], maxVal = max(refSummaryDat[g, ]), ...)
312 |   }
313 | 
314 |   # Binarize, if desired
315 |   if (binarize) {
316 |     scaleDat <- scaleDat > binMin
317 |     scaleDat <- scaleDat + 1 - 1
318 |   }
319 | 
320 |   # Omit genes and weight scaling, if desired
321 |   genesToMap2 <- genesToMap
322 |   if (!is.null(integerWeights)) genesToMap2 <- rep(genesToMap2, integerWeights)
323 |   genesToMap2 <- genesToMap2[!is.element(genesToMap2, omitGenes)]
324 | 
325 |   # Map the map data to the reference data
326 |   scaleDat2 <- scaleDat[genesToMap2, ]
327 |   refSummaryDat2 <- refSummaryDat[genesToMap2, ]
328 |   mappingResults <- mappingFunction(refSummaryDat2, scaleDat2, ...)
329 | 
330 |   # Scale x and y coordinates to (0,1) within experiment, if desired
331 |   for (ex in unique(metadata$experiment)) {
332 |     isExp <- metadata$experiment == ex
333 |     metadata$x[isExp] <- metadata$x[isExp] - min(metadata$x[isExp])
334 |     metadata$x[isExp] <- metadata$x[isExp] / max(metadata$x[isExp])
335 |     metadata$y[isExp] <- metadata$y[isExp] - min(metadata$y[isExp])
336 |     metadata$y[isExp] <- metadata$y[isExp] / max(metadata$y[isExp])
337 |   }
338 | 
339 |   # Return the results
340 |   out <- list(
341 |     mapDat = mapDat, scaleDat = scaleDat,
342 |     mappingResults = mappingResults, metadata = metadata,
343 |     scaledX = metadata$x, scaledY = metadata$y
344 |   )
345 | }
346 | 
347 | 
348 | #' Filter (subset) fishScaleAndMap object
349 | #'
350 | #' Subsets all components in a fishScaleAndMap object
351 | #'
352 | #' @param datFish a fishScaleAndMap output list
353 | #' @param subset  a boolean or numeric vector of the elements to retain
354 | #'
355 | #' @return a fishScaleAndMap output subsetted to the requested elements
356 | #'
357 | #' @export
358 | filterCells <- function(datFish,
359 |                         subset) {
360 |   ## Error checking
361 |   if ((length(subset) != length(datFish$scaledX)) & (!is.numeric(subset))) {
362 |     print("subset is incorrect format.  Returning original entry.")
363 |     return(datFish)
364 |   }
365 |   if (is.numeric(subset)) subset <- intersect(subset, 1:length(datFish$scaledX))
366 | 
367 |   ## Subset all of the elements
368 |   datFish$mapDat <- datFish$mapDat[, subset]
369 |   datFish$scaleDat <- datFish$scaleDat[, subset]
370 |   datFish$metadata <- datFish$metadata[subset, ]
371 |   datFish$scaledX <- datFish$scaledX[subset]
372 |   datFish$scaledY <- datFish$scaledY[subset]
373 |   if (!is.null(datFish$mappingResults)) {
374 |     datFish$mappingResults <- datFish$mappingResults[subset, ]
375 |   }
376 |   return(datFish)
377 | }
378 | 
379 | 
380 | #' Merge two fishScaleAndMap objects
381 | #'
382 | #' Merges all components of two fishScaleAndMap objects to create a new
383 | #'   one. Note: only meta-data and mappingResults that is present in BOTH
384 | #'   objects will be returned.
385 | #'
386 | #' @param datFish1 a fishScaleAndMap output list
387 | #' @param datFish2 a second fishScaleAndMap output list.
388 | #'
389 | #' @return a new fishScaleAndMap output list with the two original ones merged
390 | #'
391 | #' @export
392 | mergeFish <- function(datFish1,
393 |                       datFish2) {
394 |   datFish <- datFish1
395 |   datFish$mapDat <- cbind(datFish1$mapDat, datFish2$mapDat)
396 |   datFish$scaleDat <- cbind(datFish1$scaleDat, datFish2$scaleDat)
397 |   datFish$metadata <- rbind(datFish1$metadata, datFish2$metadata)
398 |   datFish$scaledX <- c(datFish1$scaledX, datFish2$scaledX)
399 |   datFish$scaledY <- c(datFish1$scaledY, datFish2$scaledY)
400 |   if ((!is.null(datFish1$mappingResults)) & (!is.null(datFish2$mappingResults))) {
401 |     datFish$mappingResults <- rbind(datFish1$mappingResults, datFish2$mappingResults)
402 |   }
403 |   return(datFish)
404 | }
405 | 
406 | 
407 | 
408 | 
409 | #' Rotate coordinates
410 | #'
411 | #' Rotates the scaledX and scaledY elements of a fishScaleAndMap output list so that the
412 | #'   axis of interest (e.g., cortical layer) is paralled with the x cooridate plan.
413 | #'   Rotation code is from  https://stackoverflow.com/questions/15463462/rotate-graph-by-angle
414 | #'
415 | #' @param datFish a fishScaleAndMap output list
416 | #' @param flatVector a TRUE/FALSE vector ordred in the same way as the elements (e.g., cells)
417 | #'   in datIn where all TRUE values correspond to cells who should have the same Y coordinate
418 | #'   (e.g., be in the same layer).  Alternatively a numeric vector of cell indices to include
419 | #' @param flipVector a numeric vector of values to ensure proper reflection on Y-axes (e.g.,
420 | #'   layer; default=NULL)
421 | #' @param subset  a boolean or numeric vector of the elements to retain
422 | #'
423 | #' @return a fishScaleAndMap output list with updated scaledX and scaleY coordinates
424 | #'
425 | #' @export
426 | rotateXY <- function(datFish,
427 |                      flatVector = NULL,
428 |                      flipVector = NULL,
429 |                      subset = NULL) {
430 | 
431 |   ## Error checking
432 |   datFishIn <- datFish
433 |   if ((length(flatVector) != length(datFish$scaledX)) & (!is.numeric(flatVector))) {
434 |     print("flatVector is incorrect format.  Returning original entry.")
435 |     return(datFish)
436 |   }
437 |   if (!is.null(subset)) {
438 |     if ((length(subset) != length(datFish$scaledX)) & (!is.numeric(subset))) {
439 |       print("subset is incorrect format.  Returning original entry.")
440 |       return(datFish)
441 |     }
442 |   }
443 |   if (((length(flipVector) != length(datFish$scaledX)) &
444 |     (!is.numeric(flipVector))) & (!is.null(flipVector))) {
445 |     print("flipVector is incorrect format.  Returning original entry.")
446 |     return(datFish)
447 |   }
448 |   if (is.numeric(flatVector)) {
449 |     flatVector <- intersect(flatVector, 1:length(datFish$scaledX))
450 |   }
451 | 
452 |   ## Subset the data if needed
453 |   datFish <- datFishIn
454 |   if (!is.null(subset)) {
455 |     datFish <- filterCells(datFishIn, subset)
456 |     flatVector <- flatVector[subset]
457 |     flipVector <- flipVector[subset]
458 |   }
459 | 
460 |   ## Caculate best angle
461 |   v <- prcomp(cbind(datFish$scaledX, datFish$scaledY)[flatVector, ])$rotation
462 |   beta <- as.numeric(atan(-v[2, 1]/v[1, 1]))
463 | 
464 |   ## Rotate coordinates (internal function)
465 |   rotCor <- function(datFish,
466 |                        beta) {
467 |     M <- cbind(datFish$scaledX, datFish$scaledY)
468 |     rotm <- matrix(c(cos(beta), sin(beta), -sin(beta), cos(beta)), ncol = 2) # rotation matrix
469 |     M2.1 <- t(t(M) - c(M[1, 1], M[1, 2])) # shift points, so that turning point is (0,0)
470 |     M2.2 <- t(rotm %*% (t(M2.1))) # rotate
471 |     M2.3 <- t(t(M2.2) + c(M[1, 1], M[1, 2])) # shift back
472 |     x <- M2.3[, 1]
473 |     y <- M2.3[, 2]
474 | 
475 |     x <- x - min(x)
476 |     x <- x / max(x)
477 |     y <- y - min(y)
478 |     y <- y / max(y)
479 | 
480 |     datFish$scaledX <- x
481 |     datFish$scaledY <- y
482 |     datFish
483 |   }
484 |   datFish2 <- rotCor(datFish, beta)
485 | 
486 |   if (!is.null(flipVector)) {
487 |     if (sum(datFish2$scaledY * flipVector) < sum((1 - datFish2$scaledY) * flipVector)) {
488 |       datFish2 <- rotCor(datFish, beta)# + pi) # Pi should not be needed
489 |     }
490 |   }
491 |   datFish <- datFish2
492 | 
493 |   ## Unsubset and return the data
494 |   if (is.null(subset)) return(datFish)
495 | 
496 |   datFishIn$scaledX[subset] <- datFish$scaledX
497 |   datFishIn$scaledY[subset] <- datFish$scaledY
498 |   datFishIn
499 | }
500 | 
501 | 
502 | 
503 | 
504 | #' Plot distributions
505 | #'
506 | #' Plot the distributions of cells across the tissue with overlaying color information.  This is
507 | #'   a wrapper function for plot
508 | #'
509 | #' @param datIn  a fishScaleAndMap output list
510 | #' @param group a character vector (or factor) indicating how to split the data (e.g., cluster
511 | #'   call) or a metadata/mappingResults column name
512 | #' @param groups a character vector of groups to show (default is levels of group)
513 | #' @param colors a character vector (or factor) indicating how to color the plots (e.g., layer
514 | #'   or gene expression) or a metadata/mappingResults column name (default is all black)
515 | #' @param colormap function to use for the colormap for the data (default gray.colors)
516 | #' @param maxrow maximum number of plots to show in one row (default=12)
517 | #' @param xlim,ylim for plot, but will be calculated if not entered
518 | #' @param pch,cex for plot.  Can be single values or vectors
519 | #' @param main,xlab,ylab,... other parameters for plot (must be single values)
520 | #' @param singlePlot should everything be plot on a single page (default=TRUE)
521 | #'
522 | #' @return Only returns if there is an error
523 | #'
524 | #' @export
525 | plotDistributions <- function(datIn,
526 |                               group,
527 |                               groups = NULL,
528 |                               colors = rep("black", dim(datIn$mapDat)[2]),
529 |                               colormap = gray.colors,
530 |                               maxrow = 12,
531 |                               pch = 19,
532 |                               cex = 1.5,
533 |                               xlim = NULL, ylim = NULL,
534 |                               main = "",
535 |                               xlab = "", ylab = "",
536 | 			      singlePlot = TRUE,
537 |                               ...) {
538 |   colormap <- match.fun(colormap)
539 |   meta <- cbind(datIn$metadata, datIn$mappingResults)
540 |   if (length(group) == 1) {
541 |     if (is.element(group, colnames(meta))) {
542 |       group <- as.factor(meta[, group])
543 |       if (is.null(groups)) groups <- levels(group)
544 |     } else {
545 |       return(paste(group, "is not an available column name for division."))
546 |     }
547 |   }
548 |   if (length(colors) == 1) {
549 |     if (is.element(colors, colnames(meta))) {
550 |       colors <- as.numeric(as.factor(meta[, colors]))
551 |       colors <- colormap(length(unique(colors)))[colors]
552 |     } else {
553 |       return(paste(colors, "is not an available column name for coloring."))
554 |     }
555 |   } else {
556 |     colors <- as.numeric(as.factor(colors))
557 |     colors <- colormap(length(unique(colors)))[colors]
558 |   }
559 | 
560 |   if (is.null(xlim)) xlim <- range(datIn$scaledX)
561 |   if (is.null(ylim)) ylim <- range(-datIn$scaledY)
562 | 
563 |   # Make the plot!
564 |   if(singlePlot){
565 |     ncolv <- min(length(groups), maxrow)
566 |     nrowv <- ceiling(length(groups)/maxrow)
567 |     par(mfrow = c(nrowv, ncolv))
568 |   }
569 |   for (gp in groups) {
570 |     kp <- group == gp
571 |     pch2 <- pch
572 |     if (length(pch) > 1) pch2 <- pch[kp]
573 |     cex2 <- cex
574 |     if (length(cex) > 1) cex2 <- cex[kp]
575 | 
576 |     plot(datIn$scaledX[kp], -datIn$scaledY[kp],
577 |       pch = pch2, col = colors[kp], xlim = xlim,
578 |       ylim = ylim, main = paste(main, gp), xlab = xlab,
579 |       ylab = ylab, cex = cex2, ...
580 |     )
581 |   }
582 | }
583 | 
584 | 
585 | #' Plot heatmap
586 | #'
587 | #' Plot the heatmap of cells ordering by a specified order.  This is a wrapper for heatmap.2
588 | #'
589 | #' @param datIn  a fishScaleAndMap output list
590 | #' @param group a character vector (or factor) indicating how to order the heatmap (e.g., cluster
591 | #'   call) or a metadata/mappingResults column name
592 | #' @param groups a character vector of groups to show (default is levels of group)
593 | #' @param grouplab label for the grouping in the heatmap (default is 'Grouping' or the value for group)
594 | #' @param useScaled plot the scaled (TRUE) or unscaled (FALSE; default) values
595 | #' @param capValue values above capValue will be capped at capValue (default is none)
596 | #' @param colormap set of values to use for the colormap for the data (default heat_colors)
597 | #' @param Rowv,Colv,dendrogram,trace,margins,rowsep,colsep,key,... other parameters for heatmap.2
598 | #'   (some default values are different)
599 | #'
600 | #' @return Only returns if there is an error
601 | #'
602 | #' @export
603 | plotHeatmap <- function(datIn,
604 |                         group,
605 |                         groups = NULL,
606 |                         grouplab = "Grouping",
607 |                         useScaled = FALSE,
608 |                         capValue = Inf,
609 |                         colormap = grey.colors(1000),
610 |                         pch = 19,
611 |                         xlim = NULL, ylim = NULL,
612 |                         Rowv = FALSE,
613 |                         Colv = FALSE,
614 |                         dendrogram = "none",
615 |                         trace = "none",
616 |                         margins = c(6, 10),
617 |                         rowsep = NULL,
618 |                         sepwidth=c(0.4,0.4),
619 |                         key = FALSE, ...) {
620 |   library(gplots)
621 |   
622 |   if (useScaled) {
623 |     plotDat <- datIn$scaleDat
624 |   } else {
625 |     plotDat <- datIn$mapDat
626 |   }
627 |   plotDat <- pmin(plotDat, capValue)
628 |   
629 |   meta <- cbind(datIn$metadata, datIn$mappingResults)
630 |   if (length(group) == 1) {
631 |     if (is.element(group, colnames(meta))) {
632 |       if (grouplab == "Grouping") grouplab <- group
633 |       group <- as.factor(meta[, group])
634 |       if (is.null(groups)) groups <- levels(group)
635 |     } else {
636 |       return(paste(group, "is not an available column name for division."))
637 |     }
638 |   }
639 |   
640 |   # Update the cell order
641 |   groups <- c(groups, setdiff(levels(group), groups))
642 |   ord <- order(factor(group, levels = groups), -colSums(plotDat))
643 |   plotDat <- plotDat[, ord]
644 |   group <- group[ord]
645 |   
646 |   # Append the cluster name to the plot data and find colseps
647 |   cn <- rep("", length(colnames(plotDat)))
648 |   colseps <- NULL
649 |   for (g in unique(as.character(group))){
650 |     wg <- which(as.character(group)==g)
651 |     cn[round(mean(wg))] <- g
652 |     colseps <- c(colseps,min(wg)-1)
653 |   }
654 |   colnames(plotDat) <- paste(cn,colnames(plotDat))
655 |   
656 |   # Make the plot!
657 |   heatmap.2(plotDat,
658 |             Rowv = Rowv, Colv = Colv, dendrogram = dendrogram,
659 |             trace = trace, margins = margins, rowsep = rowsep,
660 |             colsep = colseps, key = key, col = colormap,
661 |             ...
662 |   )
663 | }
664 | 
665 | 
666 | #' Return top mapped correlation-based cluster and confidence
667 | #'
668 | #' Primary function for doing correlation-based mapping to cluster medians and also reporting the
669 | #'   correlations and confidences.  This is wrapper for getTopMatch and corTreeMapping.
670 | #'
671 | #' @param medianDat representative value for each leaf and node.  If not entered, it is calculated
672 | #' @param mapDat normalized data of the MAPPING data set.  Default is to map the data onto itself.
673 | #' @param refDat normalized data of the REFERENCE data set.  Ignored if medianDat is passed
674 | #' @param clusters  cluster calls for each cell.  Ignored if medianDat is passed
675 | #' @param genesToMap which genes to include in the correlation mapping
676 | #' @param use additional parameter for cor (use='p' as default)
677 | #' @param method additional parameter for cor (method='p' as default)
678 | #' @param returnCor should the correlation matrix be appended to the return?
679 | #' @param ... not used
680 | #'
681 | #' @return data frame with the top match and associated correlation
682 | #'
683 | #' @export
684 | cellToClusterMapping_byCor <- function(medianDat,
685 |                                        mapDat,
686 |                                        refDat = NA,
687 |                                        clusters = NA,
688 |                                        genesToMap = rownames(mapDat),
689 |                                        use = "p",
690 |                                        method = "p",
691 | 				       returnCor=FALSE,
692 |                                        ...) {
693 |   corVar <- corTreeMapping(
694 |     medianDat = medianDat,
695 |     mapDat = mapDat, refDat = refDat, clusters = clusters,
696 |     genesToMap = genesToMap, use = use, method = method
697 |   )
698 |   corMatch <- getTopMatch(corVar)
699 |   colnames(corMatch) <- c("Class", "Correlation")
700 | 
701 |   dex <- apply(corVar, 1, function(x) return(diff(sort(-x)[1:2])))
702 |   corMatch$DifferenceBetweenTopTwoCorrelations <- dex
703 |   if(returnCor)
704 |     corMatch <- cbind(corMatch,corVar)
705 |   corMatch
706 | }
707 | 
708 | 
709 | #' Quantile normalize, truncate, and scale
710 | #'
711 | #' Quantile normalize, truncate, and scale a numeric vector (e.g. mFISH data from one gene)
712 | #'
713 | #' @param x input data vector
714 | #' @param qprob probs value to result from quantile (default=0.9)
715 | #' @param maxVal max value for scaling (default=1)
716 | #' @param truncate should data above the qprob threshold be truncated (default=yes)
717 | #' @param ... not used
718 | #'
719 | #' @return scaled vector
720 | #'
721 | #' @export
722 | quantileTruncate <- function(x,
723 |                              qprob = 0.9,
724 |                              maxVal = 1,
725 |                              truncate = TRUE,
726 |                              ...) {
727 |   qs <- quantile(x[x > 0], probs = qprob, na.rm = TRUE)
728 |   if (is.na(qs)) return(x)
729 |   if (truncate) x[x > qs] <- qs
730 |   x * maxVal / qs
731 | }
732 | 
733 | 
734 | #' Plot TSNE
735 | #'
736 | #' Plot a TSNE of the data, with assigned colors and labels from provided variables.  Note that this
737 | #'   function is a modification of code from Pabloc (https://www.r-bloggers.com/author/pabloc/) from
738 | #'   https://www.r-bloggers.com/playing-with-dimensions-from-clustering-pca-t-sne-to-carl-sagan/
739 | #'
740 | #' @param datIn  a fishScaleAndMap output list
741 | #' @param colorGroup a character vector (or factor) indicating how to color the Tsne (e.g., cluster
742 | #'   call) or a metadata/mappingResults column name (default=NULL)
743 | #' @param labelGroup a character vector (or factor) indicating how to label the Tsne (e.g., cluster
744 | #'   call) or a metadata/mappingResults column name (default=NULL)
745 | #' @param useScaled plot the scaled (TRUE) or unscaled (FALSE; default) values
746 | #' @param capValue values above capValue will be capped at capValue (default is none)
747 | #' @param perplexity,theta other parameters for Rtsne
748 | #' @param main title of the plot
749 | #' @param maxNchar what is the maximum number of characters to display in the plot for each entry?
750 | #' @param seed for reproducibility
751 | #'
752 | #' @return Only returns if there is an error
753 | #'
754 | #' @export
755 | plotTsne <- function(datIn,
756 |                      colorGroup = "none",
757 |                      labelGroup = "none",
758 |                      useScaled = FALSE,
759 |                      capValue = Inf,
760 |                      perplexity = 10,
761 |                      theta = 0.5,
762 |                      main = "TSNE plot",
763 |                      maxNchar = 1000,
764 |                      seed = 10) {
765 |   library(Rtsne)
766 |   library(ggplot2)
767 | 
768 |   # Get the data
769 |   if (useScaled) {
770 |     plotDat <- datIn$scaleDat
771 |   } else {
772 |     plotDat <- datIn$mapDat
773 |   }
774 |   plotDat <- pmin(plotDat, capValue)
775 |   plotDat <- t(plotDat)
776 | 
777 |   # Color and label groups
778 |   meta <- cbind(datIn$metadata, datIn$mappingResults)
779 |   if (length(colorGroup) == 1) {
780 |     if (is.element(colorGroup, colnames(meta))) {
781 |       colorGroup <- as.factor(meta[, colorGroup])
782 |     } else {
783 |       colorGroup <- as.factor(rep("none", dim(meta)[1]))
784 |     }
785 |   }
786 |   if (length(labelGroup) == 1) {
787 |     if (is.element(labelGroup, colnames(meta))) {
788 |       labelGroup <- as.factor(meta[, labelGroup])
789 |     } else {
790 |       labelGroup <- as.factor(rep("*", dim(meta)[1]))
791 |     }
792 |   }
793 |   # Subset to maxNchar characters
794 |   levs <- substr(levels(as.factor(labelGroup)), 1, maxNchar)
795 |   labelGroup <- factor(as.character(substr(
796 |     labelGroup, 1, maxNchar
797 |   )), levels = as.character(unique(levs)))
798 | 
799 |   # Get the tsne corrdinates
800 |   set.seed(seed)
801 |   tsne_model_1 <- Rtsne(as.matrix(plotDat),
802 |     check_duplicates = FALSE,
803 |     pca = TRUE, perplexity = perplexity, theta = theta, dims = 2
804 |   )
805 |   d_tsne_1 <- as.data.frame(tsne_model_1$Y)
806 | 
807 |   # Make the plot!
808 |   plot_k <- ggplot(d_tsne_1, aes_string(
809 |     x = "V1", y = "V2",
810 |     color = colorGroup, label = labelGroup
811 |   )) +
812 |     geom_text() + xlab("TSNE 1") + ylab("TSNE 2") +
813 |     ggtitle(main) + theme(legend.title = element_blank())
814 |   scale_colour_discrete()
815 | 
816 |   plot_k
817 | }
818 | 


--------------------------------------------------------------------------------
/R/update.r:
--------------------------------------------------------------------------------
1 | #' Update the mfishtools library
2 | #'
3 | #' @export
4 | update_mfishtools <- function() {
5 |   devtools::install_github("AllenInstitute/mfishtools", build_vignettes = TRUE)
6 | }
7 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # mfishtools
 2 | 
 3 | R functions for gene selection and analysis of mFISH data
 4 | 
 5 | mfishtools includes many functions that are used for analysis of data for the CZI SpaceTx project, and mostly relies on correlation-based analysis with filtering.  *This library is still in beta testing and may be buggy.  Community involvement is encouraged through both issues and pull requests.*  
 6 | 
 7 | ## Installation
 8 | 
 9 | Install prerequisites:
10 | ```
11 | install.packages("devtools")
12 | devtools::install_github("AllenInstitute/scrattch.vis")
13 | devtools::install_github("AllenInstitute/tasic2016data")
14 | ```
15 | 
16 | Note that some people may need to manually install the `GO.db` and `WGCNA` libraries as well:
17 | ```
18 | install.packages("BiocManager")
19 | BiocManager::install("GO.db")
20 | BiocManager::install("WGCNA")
21 | ```
22 | 
23 | Install `mfishtools` using:
24 | ```
25 | # Quickly, but without the vignettes:
26 | devtools::install_github("AllenInstitute/mfishtools")
27 | 
28 | # More slowly, but with the vignettes:
29 | install.packages("remotes", repos='http://cran.us.r-project.org')
30 | remotes::install_github("AllenInstitute/mfishtools", build_vignettes = TRUE)
31 | ```
32 | 
33 | 
34 | ## Library use cases
35 | 
36 | There are two primary use cases for this libary:
37 | 
38 | 1. **Building a combinatorial marker gene panel for spatial transcriptomics.** [LINK TO VIGNETTE](http://htmlpreview.github.io/?https://github.com/AllenInstitute/mfishtools/blob/master/vignettes/inhibitory_marker_selection.html)  This allows the generation of computationally "optimal" marker gene panels based on single cell/nucleus RNA-Seq reference data.  A starting set of manually-selected marker genes is first selected, and then the remaining genes are chosen using a greedy algorithm.  Relevant statistics and plots are generated that show the predicted success for the panel.  
39 | 2. **Mapping cells from spatial transcriptomics data sets to reference cell types.** [LINK TO VIGNETTE](http://htmlpreview.github.io/?https://github.com/AllenInstitute/mfishtools/blob/master/vignettes/inhibitory_marker_mapping.html)  This allows for cell type calling of cells in a spatial transcriptomics study, and also predicts the accuracy of the calls based on reference data.  *Note: it is currently unclear how reliable this method is at correctly predicting cell type calls.  Please review results carefully!*  Plots can also be generated to show the results.  
40 | 
41 | Many functions are currently not included in these vignettes; please use the R help ("?") if additional information is needed, or e-mail me at jeremym@alleninstitute.org.
42 | 
43 | ## License
44 | 
45 | The license for this package is available on Github at: https://github.com/AllenInstitute/mfishtools/blob/master/LICENSE
46 | 
47 | ## Level of Support
48 | 
49 | We are planning on occasional updating this tool with no fixed schedule. Community involvement is encouraged through both issues and pull requests.
50 | 
51 | ## Contribution Agreement
52 | 
53 | If you contribute code to this repository through pull requests or other mechanisms, you are subject to the Allen Institute Contribution Agreement, which is available in full at: https://github.com/AllenInstitute/mfishtools/blob/master/CONTRIBUTION
54 | 


--------------------------------------------------------------------------------
/data/fishData.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AllenInstitute/mfishtools/df6d7f1104b149698a3cebd7f82dd9848bc7328a/data/fishData.rda


--------------------------------------------------------------------------------
/data/metadata.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AllenInstitute/mfishtools/df6d7f1104b149698a3cebd7f82dd9848bc7328a/data/metadata.rda


--------------------------------------------------------------------------------
/man/buildMappingBasedMarkerPanel.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{buildMappingBasedMarkerPanel}
 4 | \alias{buildMappingBasedMarkerPanel}
 5 | \title{Greedy algorithm for building marker gene panel}
 6 | \usage{
 7 | buildMappingBasedMarkerPanel(
 8 |   mapDat,
 9 |   medianDat = NA,
10 |   clustersF = NA,
11 |   panelSize = 50,
12 |   subSamp = 20,
13 |   maxFcGene = 1000,
14 |   qMin = 0.75,
15 |   seed = 10,
16 |   currentPanel = NULL,
17 |   panelMin = 5,
18 |   writeText = TRUE,
19 |   corMapping = TRUE,
20 |   optimize = "FractionCorrect",
21 |   clusterDistance = NULL,
22 |   clusterGenes = NULL,
23 |   dend = NULL,
24 |   percentSubset = 100
25 | )
26 | }
27 | \arguments{
28 | \item{mapDat}{normalized data of the mapping (=reference) data set.}
29 | 
30 | \item{medianDat}{representative value for each leaf.  If not entered, it is calculated}
31 | 
32 | \item{clustersF}{cluster calls for each cell.}
33 | 
34 | \item{panelSize}{number of genes to include in the marker gene panel}
35 | 
36 | \item{subSamp}{number of random nuclei to select from each cluster (to increase speed);
37 | set as NA to not subsample}
38 | 
39 | \item{maxFcGene}{maximum number of genes to consider at each iteration (to increase speed)}
40 | 
41 | \item{qMin}{minimum quantile for fold change comparison (between 0 and 1, higher = more
42 | specific marker genes are included)}
43 | 
44 | \item{seed}{for reproducibility}
45 | 
46 | \item{currentPanel}{starting panel.  Default is NULL.}
47 | 
48 | \item{panelMin}{if there are fewer genes than this, the top number of these genes by fc
49 | rank are set as the starting panel.  Cannot be less than 2.}
50 | 
51 | \item{writeText}{should gene names and marker scores be output (default TRUE)}
52 | 
53 | \item{corMapping}{if TRUE (default) map by correlation; otherwise, map by Euclidean
54 | distance (not recommended)}
55 | 
56 | \item{optimize}{if 'FractionCorrect' (default) will seek to maximize the fraction of
57 | cells correctly mapping to final clusters
58 | if 'CorrelationDistance' will seek to minimize the total distance between actual
59 | cluster calls and mapped clusters
60 | if 'DendrogramHeight' will seek to minimize the total dendrogram height between
61 | actual cluster calls and mapped clusters}
62 | 
63 | \item{clusterDistance}{only used if optimize='CorrelationDistance'; a matrix (or
64 | vector) of cluster distances.  Will be calculated if NULL and if clusterGenes
65 | provided. (NOTE: order must be the same as medianDat and/or have column and row
66 | names corresponding to clusters in clustersF)}
67 | 
68 | \item{clusterGenes}{a vector of genes used to calculate the cluster distance.
69 | Only used if optimize='CorrelationDistance' and clusterDistance=NULL.}
70 | 
71 | \item{dend}{only used if optimize='DendrogramHeight' dendrogram; will error out of not provided}
72 | 
73 | \item{percentSubset}{for each iteration the function can subset the set of possible
74 | genes to speed up the calculation.}
75 | }
76 | \value{
77 | an ordered character vector corresponding to the marker gene panel
78 | }
79 | \description{
80 | This is the primary function that iteratively builds a marker gene panel, one gene at a
81 |   time by iteratively adding the most informative gene to the existing gene panel.
82 | }
83 | 


--------------------------------------------------------------------------------
/man/buildPanel_oneCluster.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{buildPanel_oneCluster}
 4 | \alias{buildPanel_oneCluster}
 5 | \title{Build panel for one cluster (beta)}
 6 | \usage{
 7 | buildPanel_oneCluster(
 8 |   mapDat,
 9 |   clustersF,
10 |   medianDat = NA,
11 |   propIn = NA,
12 |   clust = as.character(clustersF[1]),
13 |   subSamp = NA,
14 |   seed = 10,
15 |   maxSize = 20,
16 |   dexCutoff = 0.001,
17 |   topGeneCount = 100
18 | )
19 | }
20 | \arguments{
21 | \item{mapDat}{normalized data of the mapping (=reference) data set.}
22 | 
23 | \item{clustersF}{cluster calls for each cell.}
24 | 
25 | \item{medianDat}{median value for each leaf}
26 | 
27 | \item{propIn}{proportions of cells with expression > 1 in each leaf}
28 | 
29 | \item{clust}{which cluster to target?}
30 | 
31 | \item{subSamp}{number of random nuclei to select from each cluster, EXCEPT the target cluster;
32 | set as NA to not subsample}
33 | 
34 | \item{seed}{for reproducibility}
35 | 
36 | \item{maxSize}{maximum size of marker gene panel}
37 | 
38 | \item{dexCutoff}{criteria for stopping: when improvement in fraction of cells properly mapped
39 | dips below this value}
40 | 
41 | \item{topGeneCount}{number of top genes by proportion to consider}
42 | }
43 | \value{
44 | a matrix of the top marker genes for each cluster.  Output matrix includes five columns:
45 |   clust = cluster; panel = ordered genes in the panel for that cluster; onCorrect = fraction of
46 |   correctly assigned cells in cluster; offCorrect = fraction of cells correctly assigned outside
47 |   of cluster; dexTotal = additional dex explained by last gene added.
48 | }
49 | \description{
50 | This UNTESTED function finds the best small marker panel for marking a single cluster, using
51 |   proportion difference as the metric for determining the starting panel.
52 | }
53 | 


--------------------------------------------------------------------------------
/man/buildQualityTable.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{buildQualityTable}
 4 | \alias{buildQualityTable}
 5 | \title{Correct mapping at different tree heights}
 6 | \usage{
 7 | buildQualityTable(
 8 |   orderedGenes,
 9 |   dend,
10 |   mapDat,
11 |   medianDat,
12 |   clustersF,
13 |   minVal = 2,
14 |   heights = c((0:100)/100),
15 |   verbose = FALSE
16 | )
17 | }
18 | \arguments{
19 | \item{orderedGenes}{an ordered list of input genes (e.g. from an iterative algorithm)}
20 | 
21 | \item{dend}{dendrogram for mapping.}
22 | 
23 | \item{mapDat}{normalized data of the mapping (=reference) data set.}
24 | 
25 | \item{medianDat}{median value for each leaf}
26 | 
27 | \item{clustersF}{cluster calls for each cell}
28 | 
29 | \item{minVal}{minimum number of genes to consider from the list in the mapping}
30 | 
31 | \item{heights}{height in the tree to look at}
32 | 
33 | \item{verbose}{whether or not to show progress in the function}
34 | }
35 | \value{
36 | a matrix of fractions of cells correctly mapped for different tree heights (columns)
37 |   and different gene panels (rows)
38 | }
39 | \description{
40 | This function takes as input an ordered set of marker genes (e.g., from at iterative algorithm,
41 |   and returns an table showing the fraction of cells correctly mapped to a similar cell type
42 |   (as defined by the heights parameter).  A height of 1 indicates correct mapping to the leaf.
43 | }
44 | 


--------------------------------------------------------------------------------
/man/buildTreeFromGenePanel.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{buildTreeFromGenePanel}
 4 | \alias{buildTreeFromGenePanel}
 5 | \title{Build and plot dendrogram from gene panel}
 6 | \usage{
 7 | buildTreeFromGenePanel(
 8 |   dend = NA,
 9 |   refDat = NA,
10 |   mapDat = refDat,
11 |   medianDat = NA,
12 |   requiredGenes = 2,
13 |   clusters = NA,
14 |   mappedAsReference = FALSE,
15 |   genesToMap = rownames(mapDat),
16 |   plotdendro = TRUE,
17 |   returndendro = TRUE,
18 |   mar = c(12, 5, 5, 5),
19 |   main = NULL,
20 |   ylab = NULL,
21 |   use = "p",
22 |   ...
23 | )
24 | }
25 | \arguments{
26 | \item{dend}{dendrogram for mapping.  Ignored if medianDat is passed}
27 | 
28 | \item{refDat}{normalized data of the REFERENCE data set.  Ignored if medianDat is passed}
29 | 
30 | \item{mapDat}{normalized data of the MAPPING data set.  Default is to map the data onto itself.}
31 | 
32 | \item{medianDat}{representative value for each leaf and node.  If not entered, it is calculated}
33 | 
34 | \item{requiredGenes}{minimum number of genes required to be expressed in a cluster (column
35 | of medianDat) for the cluster to be included (default=2)}
36 | 
37 | \item{clusters}{cluster calls for each cell}
38 | 
39 | \item{mappedAsReference}{if TRUE, returns the fraction of cells mapped to a node which are
40 | were orginally clustered from that node; if FALSE (default) returns the fraction of cells
41 | clustered under a node which are mapped to the correct node.}
42 | 
43 | \item{genesToMap}{which genes to include in the correlation mapping}
44 | 
45 | \item{plotdendro}{should the dendrogram be plotted (default = TRUE)}
46 | 
47 | \item{mar}{margins (for use with par)}
48 | 
49 | \item{main, ylab}{add title and labels to plot (default is NULL)}
50 | 
51 | \item{use, ...}{additional parameters for cor}
52 | 
53 | \item{returnDendro}{should the dendrogram be returned (default = TRUE)}
54 | }
55 | \value{
56 | a list where the first entry is the resulting tree and the second entry is the
57 |   fraction of cells correctly mapping to each node using the inputted gene panel.
58 | }
59 | \description{
60 | Build and plot a dendrogram using correlation-based average linkage hierarchical
61 |   clustering and only using a specified set of genes.  The output is the expected
62 |   accuracy of mapping to each node in the tree, which gives an idea of the best-case
63 |   expected results for mFISH analysis.
64 | }
65 | 


--------------------------------------------------------------------------------
/man/cellToClusterMapping_byCor.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{cellToClusterMapping_byCor}
 4 | \alias{cellToClusterMapping_byCor}
 5 | \title{Return top mapped correlation-based cluster and confidence}
 6 | \usage{
 7 | cellToClusterMapping_byCor(
 8 |   medianDat,
 9 |   mapDat,
10 |   refDat = NA,
11 |   clusters = NA,
12 |   genesToMap = rownames(mapDat),
13 |   use = "p",
14 |   method = "p",
15 |   ...
16 | )
17 | }
18 | \arguments{
19 | \item{medianDat}{representative value for each leaf and node.  If not entered, it is calculated}
20 | 
21 | \item{mapDat}{normalized data of the MAPPING data set.  Default is to map the data onto itself.}
22 | 
23 | \item{refDat}{normalized data of the REFERENCE data set.  Ignored if medianDat is passed}
24 | 
25 | \item{clusters}{cluster calls for each cell.  Ignored if medianDat is passed}
26 | 
27 | \item{genesToMap}{which genes to include in the correlation mapping}
28 | 
29 | \item{use}{additional parameter for cor (use='p' as default)}
30 | 
31 | \item{method}{additional parameter for cor (method='p' as default)}
32 | 
33 | \item{...}{not used}
34 | }
35 | \value{
36 | data frame with the top match and associated correlation
37 | }
38 | \description{
39 | Primary function for doing correlation-based mapping to cluster medians and also reporting the
40 |   correlations and confidences.  This is wrapper for getTopMatch and corTreeMapping.
41 | }
42 | 


--------------------------------------------------------------------------------
/man/cellToClusterMapping_byRank.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{cellToClusterMapping_byRank}
 4 | \alias{cellToClusterMapping_byRank}
 5 | \title{Cell-based cluster mapping}
 6 | \usage{
 7 | cellToClusterMapping_byRank(
 8 |   mapDat,
 9 |   refDat,
10 |   clustersF,
11 |   genesToMap = rownames(mapDat),
12 |   mergeFunction = rowMedians,
13 |   useRank = TRUE,
14 |   use = "p",
15 |   method = "p"
16 | )
17 | }
18 | \arguments{
19 | \item{mapDat}{normalized data of the MAPPING data set.}
20 | 
21 | \item{refDat}{normalized data of the REFERENCE data set}
22 | 
23 | \item{clustersF}{factor indicating which cluster each cell type is actually assigned
24 | to in the reference data set}
25 | 
26 | \item{genesToMap}{character vector of which genes to include in the correlation mapping}
27 | 
28 | \item{mergeFunction}{function for combining ranks; the tested choices are rowMeans or
29 | rowMedians (default)}
30 | 
31 | \item{useRank}{use the rank of the correlation (default) or the correlation itself to
32 | determine the top cluster}
33 | 
34 | \item{use}{additional parameter for cor (use='p' as default)}
35 | 
36 | \item{method}{additional parameter for cor (method='p' as default)}
37 | }
38 | \value{
39 | a two column data matrix where the first column is the mapped cluster and the second
40 |   column is a confidence call indicating how close to the top of the ranked list cells of the
41 |   assigned cluster were located relative to their best possible location in the ranked list.
42 |   This confidence score seems to be a bit more reliable than correlation at determining how
43 |   likely a cell in a training set is to being correctly assigned to the training cluster.
44 | }
45 | \description{
46 | Maps cells to clusters by correlating every mapped cell with every reference cell,
47 |   ranking the cells by correlation, and the reporting the cluster with the lowest average rank.
48 | }
49 | 


--------------------------------------------------------------------------------
/man/corTreeMapping.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{corTreeMapping}
 4 | \alias{corTreeMapping}
 5 | \title{Correlation-based cluster mapping}
 6 | \usage{
 7 | corTreeMapping(
 8 |   mapDat,
 9 |   medianDat,
10 |   dend = NULL,
11 |   refDat = NA,
12 |   clusters = NA,
13 |   genesToMap = rownames(mapDat),
14 |   use = "p",
15 |   method = "p"
16 | )
17 | }
18 | \arguments{
19 | \item{mapDat}{normalized data of the MAPPING data set.  Default is to map the data onto itself.}
20 | 
21 | \item{medianDat}{representative value for each leaf and node.  If not entered, it is calculated}
22 | 
23 | \item{dend}{dendrogram for mapping.  If provided, correlations to nodes are also returned}
24 | 
25 | \item{refDat}{normalized data of the REFERENCE data set.  Ignored if medianDat is passed}
26 | 
27 | \item{clusters}{cluster calls for each cell.  Ignored if medianDat is passed}
28 | 
29 | \item{genesToMap}{which genes to include in the correlation mapping}
30 | 
31 | \item{use}{additional parameter for cor (use='p' as default)}
32 | 
33 | \item{method}{additional parameter for cor (method='p' as default)}
34 | }
35 | \value{
36 | matrix with the correlation between expression of each cell and representative value for
37 |   each leaf and node
38 | }
39 | \description{
40 | Primary function for doing correlation-based mapping to cluster medians.  This is wrapper for cor
41 |   and returns a correlation matrix.
42 | }
43 | 


--------------------------------------------------------------------------------
/man/corTreeMapping_withFilter.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{corTreeMapping_withFilter}
 4 | \alias{corTreeMapping_withFilter}
 5 | \title{Correlation between nodes and leafs (deprecated)}
 6 | \usage{
 7 | corTreeMapping_withFilter(
 8 |   dend = NA,
 9 |   refDat = NA,
10 |   mapDat = refDat,
11 |   medianExpr = NA,
12 |   propExpr = NA,
13 |   filterMatrix = NA,
14 |   clusters = NA,
15 |   numberOfGenes = 1200,
16 |   outerLimitGenes = 7200,
17 |   rankGeneFunction = function(x) getBetaScore(x, returnScore = FALSE),
18 |   use = "p",
19 |   ...
20 | )
21 | }
22 | \arguments{
23 | \item{dend}{dendrogram for mapping.  Ignored if medianDat is passed}
24 | 
25 | \item{refDat}{normalized data of the REFERENCE data set.  Ignored if medianExpr
26 | and propExpr are passed}
27 | 
28 | \item{mapDat}{normalized data of the MAPPING data set.  Default is to map the
29 | data onto itself.}
30 | 
31 | \item{medianExpr}{representative value for each leaf.  If not entered, it is
32 | calculated}
33 | 
34 | \item{propExpr}{proportion of cells in each type expressing a given gene.  If not
35 | entered, it is calculated}
36 | 
37 | \item{filterMatrix}{a matrix of TRUE/FALSE values to indicate whether a given
38 | cluster is possible}
39 | 
40 | \item{clusters}{cluster calls for each cell.  Ignored if medianExpr and propExpr
41 | are passed}
42 | 
43 | \item{numberOfGenes}{how many variables genes}
44 | 
45 | \item{outerLimitGenes}{choose different numberOfGenes per cell from the top overall
46 | outerLimitGenes (to speed up function)}
47 | 
48 | \item{use, ...}{additional parameters for cor}
49 | 
50 | \item{genesToMap}{which genes to include in the correlation mapping}
51 | }
52 | \value{
53 | a matrix of correlation values with rows as mapped cells and columns as clusters
54 | }
55 | \description{
56 | Returns the correlation between expression of each cell and representative
57 |   value for each node and leaf.  NOTE: this function is unstable and will
58 |   eventually be merged with corTreeMapping.
59 | }
60 | 


--------------------------------------------------------------------------------
/man/distTreeMapping.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{distTreeMapping}
 4 | \alias{distTreeMapping}
 5 | \title{(Euclidean) distance mapping}
 6 | \usage{
 7 | distTreeMapping(
 8 |   dend = NA,
 9 |   refDat = NA,
10 |   mapDat = refDat,
11 |   medianDat = NA,
12 |   clusters = NA,
13 |   genesToMap = rownames(mapDat),
14 |   returnSimilarity = TRUE,
15 |   use = "p",
16 |   ...
17 | )
18 | }
19 | \arguments{
20 | \item{dend}{dendrogram for mapping.  Ignored if medianDat is passed}
21 | 
22 | \item{refDat}{normalized data of the REFERENCE data set.  Ignored if medianDat is passed}
23 | 
24 | \item{mapDat}{normalized data of the MAPPING data set.  Default is to map the data onto itself.}
25 | 
26 | \item{medianDat}{representative value for each leaf and node.  If not entered, it is calculated}
27 | 
28 | \item{clusters}{cluster calls for each cell.  Ignored if medianDat is passed}
29 | 
30 | \item{genesToMap}{which genes to include in the correlation mapping}
31 | 
32 | \item{returnSimilarity}{FALSE to return distance, TRUE to return something like a similarity}
33 | 
34 | \item{use, ...}{additional parameters for dist (for back-compatiblity; doesn't work)}
35 | }
36 | \value{
37 | matrix of Euclidean distances between cells (rows) and clusters (columns)
38 | }
39 | \description{
40 | Returns the distance between expression of each cell and representative value for each node and
41 |   leaf (default is based on euclidean distance).  In our hands this is does not work very well.
42 | }
43 | 


--------------------------------------------------------------------------------
/man/filterByClass.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{filterByClass}
 4 | \alias{filterByClass}
 5 | \title{Filter by meta-data}
 6 | \usage{
 7 | filterByClass(
 8 |   classVector,
 9 |   sampleInfo,
10 |   classColumn = "cluster_type_label",
11 |   clusterColumn = "cluster_label",
12 |   threshold = 0.1
13 | )
14 | }
15 | \arguments{
16 | \item{classVector}{vector corresponding to the class information for filtering (e.g., vector
17 | of label calls)}
18 | 
19 | \item{sampleInfo}{matrix of sample information with rows corresponding to cells and columns
20 | corresponding to meta-data}
21 | 
22 | \item{classColumn}{column name of class information}
23 | 
24 | \item{clusterColumn}{column name of cluster information}
25 | 
26 | \item{threshold}{minimum fraction of cluster cells from a given class to be considered present}
27 | }
28 | \value{
29 | a matrix of filters with rows as clusters and columns as classes with entries of TRUE or
30 |   FALSE indicating whether cells from a given class can assigned to that cluster, given threshold.
31 | }
32 | \description{
33 | Return a filter of TRUE/FALSE values for a given piece of meta-data (e.g., broad class).
34 | }
35 | 


--------------------------------------------------------------------------------
/man/filterCells.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{filterCells}
 4 | \alias{filterCells}
 5 | \title{Filter (subset) fishScaleAndMap object}
 6 | \usage{
 7 | filterCells(datFish, subset)
 8 | }
 9 | \arguments{
10 | \item{datFish}{a fishScaleAndMap output list}
11 | 
12 | \item{subset}{a boolean or numeric vector of the elements to retain}
13 | }
14 | \value{
15 | a fishScaleAndMap output subsetted to the requested elements
16 | }
17 | \description{
18 | Subsets all components in a fishScaleAndMap object
19 | }
20 | 


--------------------------------------------------------------------------------
/man/filterPanelGenes.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{filterPanelGenes}
 4 | \alias{filterPanelGenes}
 5 | \title{Filter genes for spatial transcriptomics panel}
 6 | \usage{
 7 | filterPanelGenes(
 8 |   summaryExpr,
 9 |   propExpr = summaryExpr,
10 |   onClusters = 1:dim(summaryExpr)[2],
11 |   offClusters = NULL,
12 |   geneLengths = NULL,
13 |   startingGenes = c("GAD1", "SLC17A7"),
14 |   numBinaryGenes = 500,
15 |   minOn = 10,
16 |   maxOn = 250,
17 |   maxOff = 50,
18 |   minLength = 960,
19 |   fractionOnClusters = 0.5,
20 |   onThreshold = 0.5,
21 |   excludeGenes = NULL,
22 |   excludeFamilies = c("LOC", "LINC", "FAM", "ORF", "KIAA", "FLJ", "DKFZ", "RIK", "RPS",
23 |     "RPL", "\\\\-")
24 | )
25 | }
26 | \arguments{
27 | \item{summaryExpr}{Matrix of summarized expression levels for a given cluster.  Typically the median
28 | or mean should be used.  Rows are genes and columns are samples.  ROW NAMES MUST BE GENE SYMBOLS!}
29 | 
30 | \item{propExpr}{Proportion of cells expressed in each cluster for use with binary score calculation
31 | (default = summaryExpr, which is not recommended)}
32 | 
33 | \item{onClusters}{Vector indicating which clusters should be included in the gene panel (default
34 | is all clusters. Can be logical or numeric, or a character string of cluster names)}
35 | 
36 | \item{offClusters}{Vector indidicating from which clusters expression should be avoided}
37 | 
38 | \item{numBinaryGenes}{Number of genes to include in the final panel.  Genes are sorted by binary
39 | score using `getBetaScore` and this number of genes are chosen (default = 500)}
40 | 
41 | \item{minOn}{Minimum summary expression level in most highly expressed "on" cluster (default = 10)}
42 | 
43 | \item{maxOn}{Maximum summary expression level in most highly expressed "on" cluster (default = 250)}
44 | 
45 | \item{maxOff}{Maximum summary expression level in most highly expressed "off" cluster (default = 50)}
46 | 
47 | \item{minLength}{Minimum gene length for marker gene selection.  Ignored if geneLength is not 
48 | provided (default = 960)}
49 | 
50 | \item{fractionOnClusters}{What is the maximum fraction of clusters in which a gene can be expressed
51 | (as defined by propExpr>onThreshold; default = 0.5).  This prevents nearly ubiquitous genes from selection}
52 | 
53 | \item{onThreshold}{What fraction of cells need to have expression for a gene to be defined as expressed
54 | (default = 0.5)}
55 | 
56 | \item{excludeGenes}{Which genes should be excluded from the analysis (default is none)}
57 | 
58 | \item{excludeFamilies}{Which gene classes or families should be excluded from the analysis?  More
59 | specifically, any gene that contain these strings of characters anywhere in the symbol will be
60 | excluded (default is "LOC","LINC","FAM","ORF","KIAA","FLJ","DKFZ","RIK","RPS","RPL","\\-").}
61 | 
62 | \item{geneLength}{Optional vector of gene lengths in same order as summaryExpr.  Default is NULL}
63 | }
64 | \value{
65 | A character vector of genes meeting all constraints
66 | }
67 | \description{
68 | Returns a set of genes for inclusion in a spatial transcriptomics panel based on a series of 
69 |   hard-coded and user-defined constraints
70 | }
71 | 


--------------------------------------------------------------------------------
/man/fishScaleAndMap.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{fishScaleAndMap}
 4 | \alias{fishScaleAndMap}
 5 | \title{Scale mFISH data and map to RNA-seq reference}
 6 | \usage{
 7 | fishScaleAndMap(
 8 |   mapDat,
 9 |   refSummaryDat,
10 |   genesToMap = NULL,
11 |   mappingFunction = cellToClusterMapping_byCor,
12 |   transform = function(x) x,
13 |   noiselevel = 0,
14 |   scaleFunction = quantileTruncate,
15 |   omitGenes = NULL,
16 |   metadata = data.frame(experiment = rep("all", dim(mapDat)[2])),
17 |   integerWeights = NULL,
18 |   binarize = FALSE,
19 |   binMin = 0.5,
20 |   ...
21 | )
22 | }
23 | \arguments{
24 | \item{mapDat}{normalized data of the MAPPING data set.  Default is to map the data onto itself.}
25 | 
26 | \item{refSummaryDat}{normalized summary data of the REFERENCE data set (e.g., what to map against)}
27 | 
28 | \item{genesToMap}{which genes to include in the mapping (calculated in not entered)}
29 | 
30 | \item{mappingFunction}{which function to use for mapping (default is cellToClusterMapping_byCor)
31 | The function must include at least two parameters with the first one being mapped data and the
32 | second data the reference.  Additional parameters are okay.  Output must be a data frame where
33 | the first value is a mapped class.  Additional columns are okay and will be returned)}
34 | 
35 | \item{transform}{function for transformation of the data (default in none)}
36 | 
37 | \item{noiselevel}{scalar value at or below which all values are set to 0 (default is 0)}
38 | 
39 | \item{scaleFunction}{which function to use for scaling mapDat to refSummaryDat (default is setting
40 | 90th quantile of mapDat to max of refSummaryDat and truncating higher mapDat values)}
41 | 
42 | \item{omitGenes}{genes to be included in the data frames but excluded from the mapping}
43 | 
44 | \item{metadata}{a data frame of possible metadata (additional columns are okay and ignored):
45 | \describe{
46 |   \item{area}{a vector of cell areas for normalization}
47 |   \item{experiment}{a vector indicating if multiple experiments should be scaled separately}
48 |   \item{x,y}{x (e.g., parallel to layer) and y (e.g., across cortical layers) coordinates in tissue}
49 | }}
50 | 
51 | \item{integerWeights}{if not NULL (default) a vector of integers corresponding to how many times
52 | each gene should be counted as part of the correlation.  This is equivalent to calculating
53 | a weighted correlation, but only allows for integer weight values (for use with cor).}
54 | 
55 | \item{binarize}{should the data be binarized? (default=FALSE)}
56 | 
57 | \item{binMin}{minimum ON value for the binarized matrix (ignored if binarize=FALSE)}
58 | 
59 | \item{...}{additional parameters for passthrough into other functions}
60 | }
61 | \value{
62 | a list with the following entrees:
63 | \describe{
64 |   \item{mapDat}{mapDat data matrix is passed through}
65 |   \item{scaleDat}{scaled mapDat data matrix}
66 |   \item{mappingResults}{Results of the mapping and associated confidence values (if any)}
67 |   \item{metadata=metadata}{metadata is passed through unchanged}
68 |   \item{scaledX/Y}{scaled x and y coordinates (or unscaled if scaling was not performed)}
69 | }
70 | }
71 | \description{
72 | This function is a wrapper for several other functions which aim to scale mFISH data to
73 |   more closely match RNA-seq data and then map the mFISH data to the closest reference
74 |   classes.  There are several parameters allowing flexability in filtering and analysis.
75 | }
76 | 


--------------------------------------------------------------------------------
/man/fractionCorrectPerNode.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{fractionCorrectPerNode}
 4 | \alias{fractionCorrectPerNode}
 5 | \title{Fraction of correct calls per node}
 6 | \usage{
 7 | fractionCorrectPerNode(
 8 |   dendIn,
 9 |   clActual,
10 |   clPredict,
11 |   minCount = 0.1,
12 |   defaultSum = -1,
13 |   out = NULL
14 | )
15 | }
16 | \arguments{
17 | \item{dendIn}{dendrogram for mapping.  Ignored if minimizeHeight=FALSE}
18 | 
19 | \item{clActual}{character vector of actual cluster assignments}
20 | 
21 | \item{clPredict}{character vector of predicted cluster assignments}
22 | 
23 | \item{minCount}{set to 0 results from clusters with fewer than this number of cells (default
24 | is to consider all clusters)}
25 | 
26 | \item{defaultSum}{value to return in cases where there are fewer than minCount cells in the
27 | actual cluster (e.g., cases that aren't considered at all)}
28 | 
29 | \item{out}{required for recursive function.  Do not set!}
30 | }
31 | \value{
32 | matrix of two columns: (1) node name and (2) the fraction of cells in that node that
33 |   are correctly assigned
34 | }
35 | \description{
36 | This function returns the fraction correctly assigned to each node (as defined that the actual
37 |   and predicted cluster are both in the same node)
38 | }
39 | 


--------------------------------------------------------------------------------
/man/fractionCorrectWithGenes.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{fractionCorrectWithGenes}
 4 | \alias{fractionCorrectWithGenes}
 5 | \title{Fraction of cells correctly assigned}
 6 | \usage{
 7 | fractionCorrectWithGenes(
 8 |   orderedGenes,
 9 |   mapDat,
10 |   medianDat,
11 |   clustersF,
12 |   verbose = FALSE,
13 |   plot = TRUE,
14 |   return = TRUE,
15 |   ...
16 | )
17 | }
18 | \arguments{
19 | \item{orderedGenes}{an ordered list of input genes (e.g. from an iterative algorithm)}
20 | 
21 | \item{mapDat}{normalized data of the mapping (=reference) data set.}
22 | 
23 | \item{medianDat}{median value for each leaf}
24 | 
25 | \item{clustersF}{cluster calls for each cell}
26 | 
27 | \item{verbose}{whether or not to show progress in the function}
28 | 
29 | \item{plot}{if TRUE, plotCorrectWithGenes is run}
30 | 
31 | \item{return}{if TRUE, the value is returned}
32 | 
33 | \item{...}{parameters passed to plotCorrectWithGenes (if plot=TRUE)}
34 | }
35 | \value{
36 | a vector showing the fraction of cells correctly mapped to each cluster
37 | }
38 | \description{
39 | This function takes as input an ordered set of marker genes (e.g., from at iterative
40 |   algorithm), and returns a vector showing the fraction of cells correctly mapped.
41 | }
42 | 


--------------------------------------------------------------------------------
/man/generateMultipleCellReferenceSet.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{generateMultipleCellReferenceSet}
 4 | \alias{generateMultipleCellReferenceSet}
 5 | \title{Generate reference set of pseudo-cells}
 6 | \usage{
 7 | generateMultipleCellReferenceSet(
 8 |   refDat,
 9 |   clustersF,
10 |   genesToUse = rownames(refDat),
11 |   cellsPerMerge = 5,
12 |   numberOfMerges = 10,
13 |   mergeFunction = rowMedians,
14 |   seed = 1
15 | )
16 | }
17 | \arguments{
18 | \item{refDat}{normalized data of the REFERENCE data set}
19 | 
20 | \item{clustersF}{factor indicating which cluster each cell type is actually assigned to in the reference data set}
21 | 
22 | \item{cellsPerMerge}{Number of cells to include in each combo cell}
23 | 
24 | \item{numberOfMerges}{Number of combo cells to include per cell type}
25 | 
26 | \item{mergeFunction}{function for combining cells into combo cells (use rowMeans or rowMedians)}
27 | 
28 | \item{seed}{for resproducibility}
29 | 
30 | \item{genesToMap}{which genes to include in the correlation mapping}
31 | }
32 | \value{
33 | list where first element is data matrix of multi-cells by genes and
34 |   second element is a vector of corresponding clusters
35 | }
36 | \description{
37 | Creates a new reference set as input for cellToClusterMapping_byRank, where each 'cell' is the
38 |   combiniation of several cells and this is run several times using different subsets of cells.
39 | }
40 | 


--------------------------------------------------------------------------------
/man/getBetaScore.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{getBetaScore}
 4 | \alias{getBetaScore}
 5 | \title{Get binary (aka beta) score}
 6 | \usage{
 7 | getBetaScore(propExpr, returnScore = TRUE, spec.exp = 2)
 8 | }
 9 | \arguments{
10 | \item{propExpr}{a matrix of proportions of cells (rows) in a given cluster (columns) with
11 | CPM/FPKM > 1 (or 0, HCT uses 1)}
12 | 
13 | \item{returnScore}{if TRUE returns the score, if FALSE returns the ranks}
14 | 
15 | \item{spec.exp}{scaling factor (recommended to leave as default)}
16 | }
17 | \value{
18 | returns a numeric vector of beta score (or ranks)
19 | }
20 | \description{
21 | Returns a beta score which indicates the binaryness of a gene across clusters.  High scores
22 |   (near 1) indicate that a gene is either on or off in nearly all cells of every cluster.
23 |   Scores near 0 indicate a cells is non-binary (e.g., not expressed, ubiquitous, or
24 |   randomly expressed).  This value is used for gene filtering prior to defining clustering.
25 | }
26 | 


--------------------------------------------------------------------------------
/man/getBranchList.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{getBranchList}
 4 | \alias{getBranchList}
 5 | \title{Branch list}
 6 | \usage{
 7 | getBranchList(
 8 |   dend,
 9 |   branches = list(),
10 |   allTips = as.character(dend \%>\% labels())
11 | )
12 | }
13 | \arguments{
14 | \item{dend}{dendrogram for mapping.  Ignored if medianDat is passed}
15 | 
16 | \item{branches}{do not change from default}
17 | 
18 | \item{allTips}{do not change from default}
19 | }
20 | \value{
21 | a list of branch information for use with leafToNodeMedians
22 | }
23 | \description{
24 | Returns branches of a dendrogram in a specific format
25 | }
26 | 


--------------------------------------------------------------------------------
/man/getConfusionMatrix.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{getConfusionMatrix}
 4 | \alias{getConfusionMatrix}
 5 | \title{Confusion matrix}
 6 | \usage{
 7 | getConfusionMatrix(realCluster, foundCluster, proportions = TRUE)
 8 | }
 9 | \arguments{
10 | \item{realCluster}{character vector of assigned clusters}
11 | 
12 | \item{foundCluster}{character vector of mapped clusters}
13 | 
14 | \item{proportions}{FALSE if the counts are to be returned and TRUE if the proportions are to be returned}
15 | }
16 | \description{
17 | Returns a confusion matrix of the found (mapped) vs. real (assigned) clusters.
18 | }
19 | 


--------------------------------------------------------------------------------
/man/getDend.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{getDend}
 4 | \alias{getDend}
 5 | \title{Build a dendrogram from gene panel}
 6 | \usage{
 7 | getDend(dat, distFun = function(x) return(as.dist(1 - WGCNA::cor(x))), ...)
 8 | }
 9 | \arguments{
10 | \item{dat}{matrix of values (e.g., genes x clusters) for calculating the dendrogram}
11 | 
12 | \item{distFun}{function for calculating distance matrix (default is correlation-based)}
13 | 
14 | \item{...}{additional variables for distFun}
15 | }
16 | \value{
17 | dendrogram
18 | }
19 | \description{
20 | Build a dendrogram from an inputted data matrix.
21 | }
22 | 


--------------------------------------------------------------------------------
/man/getNodeHeight.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{getNodeHeight}
 4 | \alias{getNodeHeight}
 5 | \title{Get node height}
 6 | \usage{
 7 | getNodeHeight(tree)
 8 | }
 9 | \arguments{
10 | \item{tree}{a dendrogram object}
11 | }
12 | \value{
13 | a vector of node heights
14 | }
15 | \description{
16 | Returns the heights of each node, scaled from 0 (top) to 1 (leafs); this is a wrapper for dendextend functions
17 | }
18 | 


--------------------------------------------------------------------------------
/man/getTopMatch.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{getTopMatch}
 4 | \alias{getTopMatch}
 5 | \title{Get top leaf match}
 6 | \usage{
 7 | getTopMatch(memb.cl)
 8 | }
 9 | \arguments{
10 | \item{memb.cl}{membership scores for each leaf}
11 | }
12 | \value{
13 | a matrix where first column is found cluster and second column is confidence score
14 | }
15 | \description{
16 | Returns the top leaf match for each cell and the corresponding fraction mapping there.
17 | }
18 | 


--------------------------------------------------------------------------------
/man/get_subtree_label.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{get_subtree_label}
 4 | \alias{get_subtree_label}
 5 | \title{Gets subtree labels for lca function.}
 6 | \usage{
 7 | get_subtree_label(dend)
 8 | }
 9 | \arguments{
10 | \item{dend}{a cluster dendrogram}
11 | }
12 | \value{
13 | vector of subtree labels
14 | }
15 | \description{
16 | Gets subtree labels for lca function.
17 | }
18 | 


--------------------------------------------------------------------------------
/man/labelDend.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{labelDend}
 4 | \alias{labelDend}
 5 | \title{Label dendrogram nodes}
 6 | \usage{
 7 | labelDend(dend, n = 1)
 8 | }
 9 | \arguments{
10 | \item{dend}{dendrogram object}
11 | 
12 | \item{distFun}{starting numeric node value (default=1)}
13 | }
14 | \value{
15 | a list where the first item is the new dendrogram object and the
16 |   second item is the final numeric node value.
17 | }
18 | \description{
19 | Add numeric node labels to a dendrogram.
20 | }
21 | 


--------------------------------------------------------------------------------
/man/layerFraction.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{layerFraction}
 4 | \alias{layerFraction}
 5 | \title{Layer weights per cell}
 6 | \usage{
 7 | layerFraction(layerIn, useLayer = "L1", cluster = NA, ...)
 8 | }
 9 | \arguments{
10 | \item{layerIn}{a list corresponding to all layers of dissection for a given sample}
11 | 
12 | \item{useLayer}{target layer}
13 | 
14 | \item{cluster}{if passed the weights are smartly allocated based on laminar distributions by cluster}
15 | 
16 | \item{...}{additional variables for smartLayerAllocation}
17 | }
18 | \value{
19 | numeric vector with weights for cells in input layer
20 | }
21 | \description{
22 | Returns a numeric vector saying how to weight a particular cell for each layer.  This is a
23 |   wrapper for smartLayerAllocation
24 | }
25 | 


--------------------------------------------------------------------------------
/man/layerScale.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{layerScale}
 4 | \alias{layerScale}
 5 | \title{Fraction of cells per layer}
 6 | \usage{
 7 | layerScale(layerIn, layerNm = c("L1", "L2/3", "L4", "L5", "L6"), scale = TRUE)
 8 | }
 9 | \arguments{
10 | \item{layerIn}{a list corresponding to all layers of dissection for a given sample}
11 | 
12 | \item{layerNm}{names of all layers.  set to NULL to have this calculated}
13 | 
14 | \item{scale}{if TRUE (default), scale to the total number of cells}
15 | }
16 | \value{
17 | vector indicating the fraction of cells in each layerNm layer
18 | }
19 | \description{
20 | Determines the expected proportions in each layer based on input
21 | }
22 | 


--------------------------------------------------------------------------------
/man/lca.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{lca}
 4 | \alias{lca}
 5 | \title{Get lowest common ancestor (defined cluster pairs)}
 6 | \usage{
 7 | lca(dend, l1, l2, l = rep(attr(dend, "label"), length(l1)))
 8 | }
 9 | \arguments{
10 | \item{dend}{a cluster dendrogram}
11 | 
12 | \item{l1}{a vector of node labels}
13 | 
14 | \item{l2}{a second fector of node labels (of the same length as l1)}
15 | 
16 | \item{l}{do not adjust; required for recursive function}
17 | }
18 | \value{
19 | The function will return a vector for lowest common ancestor for every pair of nodes in l1 and l2
20 | }
21 | \description{
22 | Maps a cluster back up the tree to the first node where the mapped and correct clusters agree.
23 | }
24 | 


--------------------------------------------------------------------------------
/man/leafToNodeMedians.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{leafToNodeMedians}
 4 | \alias{leafToNodeMedians}
 5 | \title{Return mean node expression}
 6 | \usage{
 7 | leafToNodeMedians(dend, medianDat, branches = getBranchList(dend), fnIn = mean)
 8 | }
 9 | \arguments{
10 | \item{dend}{dendrogram for mapping.  Ignored if medianDat is passed}
11 | 
12 | \item{medianDat}{median expression data at each node}
13 | 
14 | \item{branches}{a particular format of branch information from the dendrogram structure}
15 | 
16 | \item{fnIn}{function to use to wrap up to the node level (default = mean)}
17 | }
18 | \value{
19 | a matrix of mean node expression (rows=genes, columns=nodes)
20 | }
21 | \description{
22 | Define expression at a node as the MEAN expression for each leaf as default (using the
23 |   median removes all specific marker genes!)
24 | }
25 | 


--------------------------------------------------------------------------------
/man/makeLCAtable.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{makeLCAtable}
 4 | \alias{makeLCAtable}
 5 | \title{Get lowest common ancestor (all cluster pairs in tree)}
 6 | \usage{
 7 | makeLCAtable(dend, includeInternalNodes = FALSE, verbose = FALSE)
 8 | }
 9 | \arguments{
10 | \item{dend}{a cluster dendrogram}
11 | 
12 | \item{includeInternalNodes}{should internal nodes be included in the output?}
13 | 
14 | \item{verbose}{if TRUE, status will be printed to the screen, since function is relatively slow
15 | for large trees (default FALSE)}
16 | }
17 | \value{
18 | The function will return a vector for lowest common ancestor for every pair of leaves
19 |   in dend.  Vector names are l1|||l2 for string parsing in other functions.
20 | }
21 | \description{
22 | Calculates the vector for lowest common ancestor for every pair of leaves in a tree and returns a
23 |   vector in a specific format for faster look-up.
24 | }
25 | 


--------------------------------------------------------------------------------
/man/map_dend.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{map_dend}
 4 | \alias{map_dend}
 5 | \title{Tree-based mapping}
 6 | \usage{
 7 | map_dend(
 8 |   dend,
 9 |   cl,
10 |   dat,
11 |   map.dat,
12 |   select.cells,
13 |   p = 0.8,
14 |   low.th = 0.2,
15 |   default.markers = NULL
16 | )
17 | }
18 | \arguments{
19 | \item{dend}{dendrogram for mapping}
20 | 
21 | \item{cl}{factor indicating which cluster each cell type is actually assigned to
22 | in the reference data set}
23 | 
24 | \item{dat}{normalized data of the REFERENCE data set}
25 | 
26 | \item{map.dat}{normalized data of the MAPPING data set.  Default is to map the
27 | data onto itself.}
28 | 
29 | \item{p}{proportion of marker genes to include in each iteration of the mapping
30 | algorithm.}
31 | 
32 | \item{low.th}{the minimum difference in Pearson correlation required to decide
33 | on which branch to map to. otherwise, a random branch is chosen.}
34 | 
35 | \item{default.markers}{not used}
36 | }
37 | \value{
38 | a matrix of confidence scores (from 0 to 100) with rows as cells and columns
39 |   as tree node/leafs.  Values indicate the fraction of permutations in which the cell
40 |   mapped to that node/leaf using the subset of cells/genes in map_dend
41 | }
42 | \description{
43 | Returns the mapping membership of each cell to each node and leaf using a
44 |   tree-based method.  This is a wrapper function for map_dend.
45 | }
46 | 


--------------------------------------------------------------------------------
/man/mergeFish.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{mergeFish}
 4 | \alias{mergeFish}
 5 | \title{Merge two fishScaleAndMap objects}
 6 | \usage{
 7 | mergeFish(datFish1, datFish2)
 8 | }
 9 | \arguments{
10 | \item{datFish1}{a fishScaleAndMap output list}
11 | 
12 | \item{datFish2}{a second fishScaleAndMap output list.}
13 | }
14 | \value{
15 | a new fishScaleAndMap output list with the two original ones merged
16 | }
17 | \description{
18 | Merges all components of two fishScaleAndMap objects to create a new
19 |   one. Note: only meta-data and mappingResults that is present in BOTH
20 |   objects will be returned.
21 | }
22 | 


--------------------------------------------------------------------------------
/man/mfishtools.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \docType{package}
 4 | \name{mfishtools}
 5 | \alias{mfishtools}
 6 | \title{mfishtools: Building Gene Sets and Mapping mFISH Data.}
 7 | \description{
 8 | This repository includes code for gene selection for spatial transcriptomics methods and for
 9 | mapping of spatial transcriptomics (or RNA-Seq data) onto a RNA-Seq reference. Specific topics include:
10 | 1) Correlation-based mapping of cells to reference cell types
11 | 2) Iterative building of gene panels a greedy algorithm with pre-defined constraints
12 | 3) Visualizations related to gene mapping a gene panel selection
13 | }
14 | 


--------------------------------------------------------------------------------
/man/outputTopConfused.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{outputTopConfused}
 4 | \alias{outputTopConfused}
 5 | \title{Table of confused clusters}
 6 | \usage{
 7 | outputTopConfused(confusionProp, count = 10)
 8 | }
 9 | \arguments{
10 | \item{confusionProp}{confusion matrix (e.g., output from getConfusionMatrix).}
11 | 
12 | \item{count}{number of top confusions to show}
13 | }
14 | \value{
15 | a 3 x count matrix of the top confused pairs of clusters with the three
16 |   columns corresponding to mapped cluster, assigned cluster, and fraction of
17 |   cells incorrectly mapped, respectively.
18 | }
19 | \description{
20 | This function returns a table of the top confused clusters (assigned clusters
21 |   incorrectly mapped)
22 | }
23 | 


--------------------------------------------------------------------------------
/man/plotConfusionVsConfidence.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{plotConfusionVsConfidence}
 4 | \alias{plotConfusionVsConfidence}
 5 | \title{Confusion plot vs. confidence}
 6 | \usage{
 7 | plotConfusionVsConfidence(
 8 |   foundClusterAndScore,
 9 |   realCluster,
10 |   RI = (31:100)/100,
11 |   main = "\% mapping (blue) / correct (orange)",
12 |   ylab = "Percent",
13 |   xlab = "Fraction correctly mapped to leaf",
14 |   type = "l",
15 |   xlim = range(RI),
16 |   ...
17 | )
18 | }
19 | \arguments{
20 | \item{foundClusterAndScore}{matrix where first column is found cluster and second column is
21 | confidence score (e.g., output from getTopMatch)}
22 | 
23 | \item{realCluster}{character vector of assigned clusters}
24 | 
25 | \item{...}{additional parameters for the plot function}
26 | }
27 | \description{
28 | Produces line plots showing the percent of correctly mapped cells above a certain confidence value (or score).
29 |   This is a wrapper for plot.
30 | }
31 | 


--------------------------------------------------------------------------------
/man/plotCorrectWithGenes.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{plotCorrectWithGenes}
 4 | \alias{plotCorrectWithGenes}
 5 | \title{Plot fraction correct}
 6 | \usage{
 7 | plotCorrectWithGenes(
 8 |   frac,
 9 |   genes = names(frac),
10 |   xlab = "Number of genes in panel",
11 |   main = "All clusters gene panel",
12 |   ylim = c(-10, 100),
13 |   lwd = 5,
14 |   ylab = "Percent of nuclei correctly mapping",
15 |   colLine = "grey",
16 |   ...
17 | )
18 | }
19 | \arguments{
20 | \item{frac}{a numeric vector indicating the fraction of cells correctly mapped for a given gene panel}
21 | 
22 | \item{genes}{ordered character vector (e.g., of genes) to be plotted; default is names(frac)}
23 | 
24 | \item{...}{additional parameters for plot.}
25 | }
26 | \description{
27 | This function is a wrapper for plot designd for plotting the fraction correctly mapped for a given gene set.
28 |   If geneN is the Nth gene, the plotted value indicates correct mapping using genes 1:N.
29 | }
30 | 


--------------------------------------------------------------------------------
/man/plotDistributions.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{plotDistributions}
 4 | \alias{plotDistributions}
 5 | \title{Plot distributions}
 6 | \usage{
 7 | plotDistributions(
 8 |   datIn,
 9 |   group,
10 |   groups = NULL,
11 |   colors = rep("black", dim(datIn$mapDat)[2]),
12 |   colormap = gray.colors,
13 |   maxrow = 12,
14 |   pch = 19,
15 |   cex = 1.5,
16 |   xlim = NULL,
17 |   ylim = NULL,
18 |   main = "",
19 |   xlab = "",
20 |   ylab = "",
21 |   ...
22 | )
23 | }
24 | \arguments{
25 | \item{datIn}{a fishScaleAndMap output list}
26 | 
27 | \item{group}{a character vector (or factor) indicating how to split the data (e.g., cluster
28 | call) or a metadata/mappingResults column name}
29 | 
30 | \item{groups}{a character vector of groups to show (default is levels of group)}
31 | 
32 | \item{colors}{a character vector (or factor) indicating how to color the plots (e.g., layer
33 | or gene expression) or a metadata/mappingResults column name (default is all black)}
34 | 
35 | \item{colormap}{function to use for the colormap for the data (default gray.colors)}
36 | 
37 | \item{maxrow}{maximum number of plots to show in one row (default=12)}
38 | 
39 | \item{pch, cex}{for plot.  Can be single values or vectors}
40 | 
41 | \item{xlim, ylim}{for plot, but will be calculated if not entered}
42 | 
43 | \item{main, xlab, ylab, ...}{other parameters for plot (must be single values)}
44 | }
45 | \value{
46 | Only returns if there is an error
47 | }
48 | \description{
49 | Plot the distributions of cells across the tissue with overlaying color information.  This is
50 |   a wrapper function for plot
51 | }
52 | 


--------------------------------------------------------------------------------
/man/plotHeatmap.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{plotHeatmap}
 4 | \alias{plotHeatmap}
 5 | \title{Plot heatmap}
 6 | \usage{
 7 | plotHeatmap(
 8 |   datIn,
 9 |   group,
10 |   groups = NULL,
11 |   grouplab = "Grouping",
12 |   useScaled = FALSE,
13 |   capValue = Inf,
14 |   colormap = grey.colors(1000),
15 |   pch = 19,
16 |   xlim = NULL,
17 |   ylim = NULL,
18 |   Rowv = FALSE,
19 |   Colv = FALSE,
20 |   dendrogram = "none",
21 |   trace = "none",
22 |   margins = c(6, 10),
23 |   rowsep = NULL,
24 |   sepwidth = c(0.4, 0.4),
25 |   key = FALSE,
26 |   ...
27 | )
28 | }
29 | \arguments{
30 | \item{datIn}{a fishScaleAndMap output list}
31 | 
32 | \item{group}{a character vector (or factor) indicating how to order the heatmap (e.g., cluster
33 | call) or a metadata/mappingResults column name}
34 | 
35 | \item{groups}{a character vector of groups to show (default is levels of group)}
36 | 
37 | \item{grouplab}{label for the grouping in the heatmap (default is 'Grouping' or the value for group)}
38 | 
39 | \item{useScaled}{plot the scaled (TRUE) or unscaled (FALSE; default) values}
40 | 
41 | \item{capValue}{values above capValue will be capped at capValue (default is none)}
42 | 
43 | \item{colormap}{set of values to use for the colormap for the data (default heat_colors)}
44 | 
45 | \item{Rowv, Colv, dendrogram, trace, margins, rowsep, colsep, key, ...}{other parameters for heatmap.2
46 | (some default values are different)}
47 | }
48 | \value{
49 | Only returns if there is an error
50 | }
51 | \description{
52 | Plot the heatmap of cells ordering by a specified order.  This is a wrapper for heatmap.2
53 | }
54 | 


--------------------------------------------------------------------------------
/man/plotNodes.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{plotNodes}
 4 | \alias{plotNodes}
 5 | \title{Plot dendrogram}
 6 | \usage{
 7 | plotNodes(
 8 |   tree,
 9 |   value = rep(1, length(labels(tree))),
10 |   cexScale = 2,
11 |   margins = c(10, 5, 2, 2),
12 |   cols = "black",
13 |   pch = 19,
14 |   ...
15 | )
16 | }
17 | \arguments{
18 | \item{tree}{a dendrogram object}
19 | 
20 | \item{value}{numeric vector corresponding to the size of each node}
21 | 
22 | \item{cexScale}{a global cex multiplier for node sizes}
23 | 
24 | \item{margins}{set the margins using par(mar=margins)}
25 | 
26 | \item{cols}{vector of node colors (or a single value)}
27 | 
28 | \item{pch}{vector of node pch shapes (or a single value)}
29 | 
30 | \item{...}{additional parameters for the plot function}
31 | }
32 | \description{
33 | Plots a dendrogram with set not colors, shapes, sizes and labels.  This is a wrapper for plot.
34 | }
35 | 


--------------------------------------------------------------------------------
/man/plotTsne.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{plotTsne}
 4 | \alias{plotTsne}
 5 | \title{Plot TSNE}
 6 | \usage{
 7 | plotTsne(
 8 |   datIn,
 9 |   colorGroup = "none",
10 |   labelGroup = "none",
11 |   useScaled = FALSE,
12 |   capValue = Inf,
13 |   perplexity = 10,
14 |   theta = 0.5,
15 |   main = "TSNE plot",
16 |   maxNchar = 1000,
17 |   seed = 10
18 | )
19 | }
20 | \arguments{
21 | \item{datIn}{a fishScaleAndMap output list}
22 | 
23 | \item{colorGroup}{a character vector (or factor) indicating how to color the Tsne (e.g., cluster
24 | call) or a metadata/mappingResults column name (default=NULL)}
25 | 
26 | \item{labelGroup}{a character vector (or factor) indicating how to label the Tsne (e.g., cluster
27 | call) or a metadata/mappingResults column name (default=NULL)}
28 | 
29 | \item{useScaled}{plot the scaled (TRUE) or unscaled (FALSE; default) values}
30 | 
31 | \item{capValue}{values above capValue will be capped at capValue (default is none)}
32 | 
33 | \item{perplexity, theta}{other parameters for Rtsne}
34 | 
35 | \item{main}{title of the plot}
36 | 
37 | \item{maxNchar}{what is the maximum number of characters to display in the plot for each entry?}
38 | 
39 | \item{seed}{for reproducibility}
40 | }
41 | \value{
42 | Only returns if there is an error
43 | }
44 | \description{
45 | Plot a TSNE of the data, with assigned colors and labels from provided variables.  Note that this
46 |   function is a modification of code from Pabloc (https://www.r-bloggers.com/author/pabloc/) from
47 |   https://www.r-bloggers.com/playing-with-dimensions-from-clustering-pca-t-sne-to-carl-sagan/
48 | }
49 | 


--------------------------------------------------------------------------------
/man/possibleClustersByPriors.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{possibleClustersByPriors}
 4 | \alias{possibleClustersByPriors}
 5 | \title{Filter possible cluster calls using priors}
 6 | \usage{
 7 | possibleClustersByPriors(
 8 |   cluster,
 9 |   layer,
10 |   subsetVector = rep(TRUE, length(cluster)),
11 |   useClusters = sort(unique(cluster)),
12 |   rareLimit = 0.005,
13 |   layerNm = c("L1", "L2/3", "L4", "L5", "L6"),
14 |   scaleByLayer = TRUE,
15 |   scaleByFn = max,
16 |   smartWeight = TRUE,
17 |   spillFactor = 0.15,
18 |   weightCutoff = 0.02
19 | )
20 | }
21 | \arguments{
22 | \item{cluster}{vector of all clusters}
23 | 
24 | \item{layer}{list of layers for each cluster entry (for data sets with only laminar dissections,
25 | each list entry will be of length 1)}
26 | 
27 | \item{subsetVector}{a vector of TRUE/FALSE values indicated whether the entry is in the subset of
28 | interest (e.g., Cre lines); default is all}
29 | 
30 | \item{useClusters}{a set of clusters to be considered a priori (e.g., GABA vs. glut); default is all}
31 | 
32 | \item{rareLimit}{define any values less than this as 0.  The idea is to exclude rare cells}
33 | 
34 | \item{layerNm}{names of all layers.  set to NULL to have this calculated}
35 | 
36 | \item{scaleByLayer}{if TRUE, scales to the proportion of cells in each layer}
37 | 
38 | \item{scaleByFn}{what function should be used for the layer scaling (default=max, ignored
39 | if scaleByLayer=FALSE)}
40 | 
41 | \item{smartWeight}{if TRUE, multilayer dissections are weighted smartly by cluster, rather
42 | than evenly by cluster (FALSE)}
43 | 
44 | \item{spillFactor}{fractional amount of cells in a layer below which it is assumed no cells
45 | are from that layer in multilayer dissection}
46 | 
47 | \item{weightCutoff}{anything less than this is set to 0 for convenience}
48 | }
49 | \value{
50 | a vector of possible clusters for cells that meet a set of priors for each layer
51 | }
52 | \description{
53 | This function will return a vector of possible clusters for cells that meet a set of priors for each layer
54 | }
55 | 


--------------------------------------------------------------------------------
/man/quantileTruncate.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{quantileTruncate}
 4 | \alias{quantileTruncate}
 5 | \title{Quantile normalize, truncate, and scale}
 6 | \usage{
 7 | quantileTruncate(x, qprob = 0.9, maxVal = 1, truncate = TRUE, ...)
 8 | }
 9 | \arguments{
10 | \item{x}{input data vector}
11 | 
12 | \item{qprob}{probs value to result from quantile (default=0.9)}
13 | 
14 | \item{maxVal}{max value for scaling (default=1)}
15 | 
16 | \item{truncate}{should data above the qprob threshold be truncated (default=yes)}
17 | 
18 | \item{...}{not used}
19 | }
20 | \value{
21 | scaled vector
22 | }
23 | \description{
24 | Quantile normalize, truncate, and scale a numeric vector (e.g. mFISH data from one gene)
25 | }
26 | 


--------------------------------------------------------------------------------
/man/resolve_cl.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{resolve_cl}
 4 | \alias{resolve_cl}
 5 | \title{Tree-based mapping (internal)}
 6 | \usage{
 7 | resolve_cl(
 8 |   cl.g,
 9 |   cl.med,
10 |   markers,
11 |   dat,
12 |   map.dat,
13 |   select.cells,
14 |   p = 0.7,
15 |   low.th = 0.2
16 | )
17 | }
18 | \arguments{
19 | \item{cl.g}{all clusters}
20 | 
21 | \item{cl.med}{cluster medians}
22 | 
23 | \item{markers}{gene markers}
24 | 
25 | \item{dat}{normalized data of the REFERENCE data set}
26 | 
27 | \item{map.dat}{normalized data of the MAPPING data set.  Default is to map the data onto itself.}
28 | 
29 | \item{select.cells}{which cells to use?}
30 | 
31 | \item{p}{proportion of marker genes to include in each iteration of the mapping algorithm.}
32 | 
33 | \item{low.th}{the minimum difference in Pearson correlation required to decide on which branch
34 | to map to. otherwise, a random branch is chosen.}
35 | }
36 | \value{
37 | a vector of the mapped cluster
38 | }
39 | \description{
40 | Returns the mapped cluster call of each cell to each leaf. This function is called by map_dend
41 | }
42 | 


--------------------------------------------------------------------------------
/man/rfTreeMapping.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{rfTreeMapping}
 4 | \alias{rfTreeMapping}
 5 | \title{Tree-based mapping}
 6 | \usage{
 7 | rfTreeMapping(
 8 |   dend,
 9 |   refDat,
10 |   clustersF,
11 |   mapDat = refDat,
12 |   p = 0.7,
13 |   low.th = 0.15,
14 |   seed = 1
15 | )
16 | }
17 | \arguments{
18 | \item{dend}{dendrogram for mapping}
19 | 
20 | \item{refDat}{normalized data of the REFERENCE data set}
21 | 
22 | \item{clustersF}{factor indicating which cluster each cell type is actually assigned to
23 | in the reference data set}
24 | 
25 | \item{mapDat}{normalized data of the MAPPING data set.  Default is to map the data onto itself.}
26 | 
27 | \item{p}{proportion of marker genes to include in each iteration of the mapping algorithm.}
28 | 
29 | \item{low.th}{the minimum difference in Pearson correlation required to decide on which branch
30 | to map to. otherwise, a random branch is chosen.}
31 | 
32 | \item{seed}{added for reproducibility}
33 | }
34 | \value{
35 | a matrix of confidence scores (from 0 to 100) with rows as cells and columns
36 |   as tree node/leafs.  Values indicate the fraction of permutations in which the cell
37 |   mapped to that node/leaf using the subset of cells/genes in map_dend
38 | }
39 | \description{
40 | Returns the mapping membership of each cell to each node and leaf using a
41 |   tree-based method.  This is a wrapper function for map_dend.
42 | }
43 | 


--------------------------------------------------------------------------------
/man/rotateXY.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{rotateXY}
 4 | \alias{rotateXY}
 5 | \title{Rotate coordinates}
 6 | \usage{
 7 | rotateXY(datFish, flatVector = NULL, flipVector = NULL, subset = NULL)
 8 | }
 9 | \arguments{
10 | \item{datFish}{a fishScaleAndMap output list}
11 | 
12 | \item{flatVector}{a TRUE/FALSE vector ordred in the same way as the elements (e.g., cells)
13 | in datIn where all TRUE values correspond to cells who should have the same Y coordinate
14 | (e.g., be in the same layer).  Alternatively a numeric vector of cell indices to include}
15 | 
16 | \item{flipVector}{a numeric vector of values to ensure proper reflection on Y-axes (e.g.,
17 | layer; default=NULL)}
18 | 
19 | \item{subset}{a boolean or numeric vector of the elements to retain}
20 | }
21 | \value{
22 | a fishScaleAndMap output list with updated scaledX and scaleY coordinates
23 | }
24 | \description{
25 | Rotates the scaledX and scaledY elements of a fishScaleAndMap output list so that the
26 |   axis of interest (e.g., cortical layer) is paralled with the x cooridate plan.
27 |   Rotation code is from  https://stackoverflow.com/questions/15463462/rotate-graph-by-angle
28 | }
29 | 


--------------------------------------------------------------------------------
/man/smartLayerAllocation.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{smartLayerAllocation}
 4 | \alias{smartLayerAllocation}
 5 | \title{Layer weights per cell}
 6 | \usage{
 7 | smartLayerAllocation(
 8 |   layerIn,
 9 |   useLayer = "L1",
10 |   spillFactor = 0.15,
11 |   weightCutoff = 0.02,
12 |   layerNm = c("L1", "L2/3", "L4", "L5", "L6")
13 | )
14 | }
15 | \arguments{
16 | \item{layerIn}{a list corresponding to all layers of dissection for a given sample}
17 | 
18 | \item{useLayer}{target layer}
19 | 
20 | \item{spillFactor}{fractional amount of cells in a layer below which it is assumed no cells are
21 | from that layer in multilayer dissection}
22 | 
23 | \item{weightCutoff}{anything less than this is set to 0 for convenience and to avoid rare types}
24 | 
25 | \item{layerNm}{names of all layers.  set to NULL to have this calculated}
26 | }
27 | \value{
28 | numeric vector saying how to weight a particular cell for each layer, using a smart
29 |   weighting strategy
30 | }
31 | \description{
32 | Returns a numeric vector saying how to weight a particular cell for each layer, using a smart
33 |   weighting strategy
34 | }
35 | 


--------------------------------------------------------------------------------
/man/subsampleCells.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/markerGenesAndMapping.r
 3 | \name{subsampleCells}
 4 | \alias{subsampleCells}
 5 | \title{Subsample cells}
 6 | \usage{
 7 | subsampleCells(clusters, subSamp = 25, seed = 5)
 8 | }
 9 | \arguments{
10 | \item{clusters}{vector of cluster labels (or any category) in factor or character format}
11 | 
12 | \item{subSamp}{maximum number of values for each category to subsample.  Can be single integer
13 | for global subsampling, or a *named* vector corresponding to how many values to take from each
14 | category in clusters.}
15 | 
16 | \item{seed}{for reproducibility}
17 | }
18 | \value{
19 | returns a vector of TRUE / FALSE with a maximum of subSamp TRUE calls per category
20 | }
21 | \description{
22 | Subsets a categorical vector to include up to a maximum number of values for each category.
23 | }
24 | 


--------------------------------------------------------------------------------
/man/summarizeMatrix.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/mfishMapping.r
 3 | \name{summarizeMatrix}
 4 | \alias{summarizeMatrix}
 5 | \title{Summarize matrix}
 6 | \usage{
 7 | summarizeMatrix(
 8 |   mat,
 9 |   group,
10 |   scale = "none",
11 |   scaleQuantile = 1,
12 |   binarize = FALSE,
13 |   binMin = 0.5,
14 |   summaryFunction = median,
15 |   ...
16 | )
17 | }
18 | \arguments{
19 | \item{mat}{matrix where the columns (e.g., samples) are going to be grouped}
20 | 
21 | \item{group}{vector of length dim(mat)[2] corresponding to the groups}
22 | 
23 | \item{scale}{either 'none' (default),'row', or 'column'}
24 | 
25 | \item{scaleQuantile}{what quantile of value should be set as 1 (default=1)}
26 | 
27 | \item{binarize}{should the data be binarized? (default=FALSE)}
28 | 
29 | \item{binMin}{minimum ON value for the binarized matrix (ignored if binarize=FALSE)}
30 | 
31 | \item{summaryFunction}{function (or function name) to be used for summarization}
32 | 
33 | \item{...}{additional parameters for summaryFunction}
34 | }
35 | \value{
36 | matrix of summarized values
37 | }
38 | \description{
39 | Groups columns in a matrix by a specified group vector and summarizes using a specificed function.
40 |   Optionally binarizes the matrix using a specified cutoff parameter.  This is a wrapper for tapply.
41 | }
42 | 


--------------------------------------------------------------------------------
/man/update_mfishtools.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/update.r
 3 | \name{update_mfishtools}
 4 | \alias{update_mfishtools}
 5 | \title{Update the mfishtools library}
 6 | \usage{
 7 | update_mfishtools()
 8 | }
 9 | \description{
10 | Update the mfishtools library
11 | }
12 | 


--------------------------------------------------------------------------------
/mfishtools.Rproj:
--------------------------------------------------------------------------------
 1 | Version: 1.0
 2 | 
 3 | RestoreWorkspace: Default
 4 | SaveWorkspace: Default
 5 | AlwaysSaveHistory: Default
 6 | 
 7 | EnableCodeIndexing: Yes
 8 | UseSpacesForTab: Yes
 9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 | 
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 | 
15 | BuildType: Package
16 | PackageUseDevtools: Yes
17 | PackageInstallArgs: --no-multiarch --with-keep.source
18 | PackageCheckArgs: –no-examples
19 | 


--------------------------------------------------------------------------------
/mfishtools_0.0.2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AllenInstitute/mfishtools/df6d7f1104b149698a3cebd7f82dd9848bc7328a/mfishtools_0.0.2.pdf


--------------------------------------------------------------------------------
/vignettes/.gitignore:
--------------------------------------------------------------------------------
1 | *.R
2 | .Rhistory
3 | 


--------------------------------------------------------------------------------
/vignettes/inhibitory_marker_mapping.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Mapping mFISH data to reference data set"
  3 | output: rmarkdown::html_vignette
  4 | vignette: >
  5 |   %\VignetteIndexEntry{Mapping mFISH data to RNA-seq reference}
  6 |   %\VignetteEngine{knitr::rmarkdown}
  7 |   \usepackage[utf8]{inputenc} 
  8 | ---
  9 |   
 10 | This code reads in all of the data for an example mouse SST mFISH experiment and compares it against a reference FACs data set.  **In this case we have very limited prior knowledge, so I am open to suggestions for how to check whether the results look reasonable.**.    
 11 |   
 12 | ### Workspace set-up
 13 |   
 14 | Install the necessary packages.  In this case we are using data from `tasic2016data` and plotting functions from `scrattch.vis`.
 15 |   
 16 | ```{r install packages, eval=FALSE}
 17 | install.packages("devtools")
 18 | devtools::install_github("AllenInstitute/tasic2016data") # For our data example
 19 | ```
 20 |   
 21 |   
 22 | Load libraries.  
 23 |   
 24 | ```{r load libraries}
 25 | suppressPackageStartupMessages({
 26 |   library(mfishtools)    # This library!
 27 |   library(matrixStats)   # For rowMedians function, which is fast
 28 |   library(gplots)        # For some of the plots
 29 |   library(ggplot2)       # For other plots
 30 |   library(tasic2016data) # For the data
 31 | })
 32 | options(stringsAsFactors = FALSE)  # IMPORTANT
 33 | print("Libraries loaded.")
 34 | ```
 35 |   
 36 |   
 37 | Read in the reference data (in this case we will use the Tasic 2016 data, which includes ~1800 cells from mouse primary visual cortex).  
 38 |   
 39 | ```{r load tasic data}
 40 | annotations <- tasic_2016_anno
 41 | #counts      <- tasic_2016_counts  # uncomment if using CPM below
 42 | rpkm        <- tasic_2016_rpkm
 43 | annotations <- annotations[match(colnames(rpkm),annotations$sample_name),]  # Put them in the correct order
 44 | ```
 45 |   
 46 |   
 47 | Read in the mFISH data.  An example data set is provided as part of `mfishtools`, and was loaded with the library.  In this case, `fishData` is a matrix of gene expression levels (e.g., spot counts within a cell) for a given cell, with genes as rows and cells as columns.  This is the same format as the RNA-seq data.  
 48 | 
 49 | ```{r what is mfish data}
 50 | dim(fishData)
 51 | ```
 52 |   
 53 |   
 54 | The `metadata` variable is a data frame with some specific requirements on column names.  
 55 |   
 56 | ```{r what is mfish metadata}
 57 | colnames(metadata)
 58 | ```
 59 | 
 60 | The required column names are as follows (and other column names are perfectly fine):  
 61 |  - area       = Area/volume of the cell (or just set to a constant)  
 62 |  - experiment = Name of the experiment or experiments (or just set ot a constant)  
 63 |  - layerData  = Numeric call for the layer.  Not requred, but useful for plotting an rotating x, y  
 64 |  - x          = X coordinate for cell (ideally this is the lateral coordinate)  
 65 |  - y          = Y coordinate for cell (ideally this is the laminar coordinate)  
 66 |   
 67 |   
 68 | ### Data preparations
 69 |   
 70 | This analysis will only be looking at marker genes for GABAergic neurons, so we need to only consider cells mapping to GABAergic types.  We also define some convenient cluster info variables here.  
 71 |   
 72 | ```{r define variables}
 73 | clusterType = annotations$broad_type 
 74 | includeClas = "GABA-ergic Neuron"  # In this analysis, we are only considering interneurons
 75 | excludeClas = sort(setdiff(clusterType,includeClas))
 76 | kpSamp      = !is.element(clusterType,excludeClas)
 77 | anno        = annotations[kpSamp,]
 78 | cl          = annotations$primary_type_label
 79 | names(cl)   = annotations$sample_name
 80 | kpClust     = sort(unique(cl[kpSamp]))
 81 | ```
 82 |   
 83 |   
 84 | Convert the data to log2(rpkm).  NOTE: we often use counts per million of introns + exons when performing this analysis.  Currently, we don't know which method produces more reliable markers.  Alternative code for calculating cpm is commented out below.  
 85 |   
 86 | ```{r convert to log2}
 87 | normDat = log2(rpkm+1)
 88 | #sf      = colSums(counts)/10^6
 89 | #cpms    = t(t(counts)/sf)
 90 | #normDat = log2(cpms+1)
 91 | print("Data normalized!")
 92 | ```
 93 |   
 94 |   
 95 | Calculate proportions, means, and medians.  These are all used later for various reasons.  One important thing to note is that we are using `cl` here as a vector of cell type calls for the RNA-seq data (with names corresponding to the column/sample names of the RNA-seq sample data); however, if you wanted to map to a more coarse definition of cell types, or to omit certain cell types, this would be the step to do it.  The columns of these summary values calculated here define the cell types that mFISH data will be mapped against for the remainder of the vignette.  
 96 |   
 97 | ```{r calculate reference stats}        
 98 | exprThresh = 1
 99 | medianExpr = do.call("cbind", tapply(names(cl), cl, function(x) rowMedians(normDat[,x]))) 
100 | meanExpr   = do.call("cbind", tapply(names(cl), cl, function(x) rowMeans(normDat[,x]))) 
101 | propExpr   = do.call("cbind", tapply(names(cl), cl, function(x) rowMeans(normDat[,x]>exprThresh))) 
102 | rownames(medianExpr) <- rownames(propExpr) <- rownames(meanExpr) <- genes <- rownames(normDat)  
103 | print("Summary values calculated!")
104 | ```
105 |   
106 |   
107 | Consider only genes included in the mFISH experiment that are also present in the RNA-seq reference data set (usually this is all of them).  
108 |   
109 | ```{r format mFISH data}
110 | useGenes <- intersect(rownames(fishData),genes)  # Define genes to be used in the analysis
111 | fishDat  <- fishData[useGenes,]    # Separate out the data from the metadata
112 | print("mFISH data is ready!")
113 | ```
114 |   
115 |  
116 | ## Map the mFISH data!  
117 |   
118 | Do the mapping using the parameters defined below.  *Currently these are all manually selected and a bit of a guess.  I am open to suggestions for how to do this mapping in a more systematic way.*  The idea behind this method is to attach various filtering and scaling strategies to the mFISH and RNA-seq data sets, and then use correlation-based mapping to find the best fitting cell cluster.  My expectation (still to be tested) is that this strategy will be useful for smaller gene panels, but that other more computational-based strategies will be most effective for larger gene panels.  
119 |   
120 | ```{r map mFISH data, warning=FALSE}
121 | qprob    <- 0.9     # Parameter for scaling mFISH to FACS
122 | thresh   <- 3       # Set counts less than or equal to thresh to 0          1
123 | log2p1   <- function(x) return(log2(x+1))  # log transform function
124 | binarize <- FALSE   # Should the data be binarized?
125 | weights  <- NULL    # Integer weights.  SET TO NULL IF YOU DON'T KNOW WHAT YOU ARE DOING!
126 | #weights  <- round(rowSums(fishDat)/min(rowSums(fishDat)))  # Here is how to weight roughly by average expression level
127 |  
128 | fishMouse <- fishScaleAndMap(mapDat=fishDat, refSummaryDat=medianExpr, 
129 |   mappingFunction = cellToClusterMapping_byCor, transform = log2p1, noiselevel = thresh, 
130 |   genesToMap = useGenes, metadata = metadata, qprob=qprob, binarize=binarize,
131 |   omitGenes = NULL,integerWeights=weights)
132 | print("Data is mapped.")
133 | ```  
134 |   
135 |   
136 | #### View the mapping results
137 |   
138 | First rotate the X and Y coordinate space to that layer 2 is parallel to the X axis.  This won't be perfect since the tissue is not perfectly linear, but it will make for easier viewing of the data.  This requires some meta-data variable which marks a subset of genes roughly along align (e.g., cortical layer calls).   
139 |   
140 | ```{r rotateXY, fig.width=6, fig.height=6}  
141 | rotateAxis <- fishMouse$metadata$layerData==4
142 | flipVector <- fishMouse$metadata$layerData
143 | for (e in unique(fishMouse$metadata$experiment)){
144 |   subset  <- fishMouse$metadata$experiment==e
145 |   fishMouse <- rotateXY(fishMouse,rotateAxis,flipVector,subset)
146 | }
147 | ```
148 |   
149 |   
150 | Now plot the location of all of the cells in the tissue section.  
151 |   
152 | ```{r plot cell locations, fig.width=6, fig.height=6}  
153 | sc  <- function(n,...)  return(c("brown","pink","orange","turquoise","blue","green")[1:n])  # Standard colors without yellow
154 | lay <- as.character(fishMouse$metadata$layerData)
155 | plotDistributions(fishMouse,group = "experiment", xlab="Mouse - All cells", ylab="Layer",
156 |                   colors=lay, colormap=sc,maxrow=8,cex=1)
157 | ```  
158 |   
159 |   
160 | Show the cell distribution across all types in the tissue.   
161 |   
162 | ```{r plot mFISH distributions,fig.width=20,fig.height=20}  
163 | allClusts  <- colnames(medianExpr)
164 | plotDistributions(fishMouse,group = "Class", groups = allClusts, colors = lay, pch=lay, xlab="Mouse", 
165 |                   ylab="Layer",colormap=sc,maxrow=8)
166 | ```  
167 |   
168 | There are some broad observations that suggest we are not too far off.  First, nearly all of the cells map to inhibitory types as expected.  Second, many of the cell types have layer signatures, with SST/PVALB types more likely to be in deep layers and other inhibitory types more likely to be in upper layers.
169 |   
170 | Now let's plot the heatmap to see how the data looks when clustered along the tree like this.  First, unscaled data, capped at 10 counts.  
171 |   
172 | ```{r Plot mFISH heatmap (unscaled), fig.width=24,fig.height=6} 
173 | cap = 10
174 | colorset = c("darkblue", "dodgerblue", "gray80", "orange", "orangered")
175 | heat_colors <- colorRampPalette(colorset)(1001)
176 | plotHeatmap(fishMouse,main="Mouse cells (unscaled, cap=10)",group="Class",groups=allClusts,capValue=cap,
177 |                              colormap=heat_colors,rowsep=NULL,dendrogram="row",Rowv = TRUE,margins = c(8,6))
178 | ```
179 |   
180 | Next, let's plot the scaled heatmaps for comparison.  This is the data that is used in the mapping.  
181 |   
182 | ```{r Plot mFISH heatmap (scaled), fig.width=24,fig.height=6} 
183 | cap = 8
184 | colorset = c("darkblue", "dodgerblue", "gray80", "orange", "orangered")
185 | heat_colors <- colorRampPalette(colorset)(1001)
186 | plotHeatmap(fishMouse,main="Mouse cells (scaled, cap=8)",group="Class",groups=allClusts,capValue=cap,colormap=heat_colors,
187 |                              rowsep=NULL,dendrogram="row",Rowv = TRUE,margins = c(8,6),useScaled=TRUE)  
188 | ```
189 |   
190 | A few genes have quite low expression and probably should be omitted from future experiments.  Otherwise, by eye things look reasonable because there do appear to be distinct expression patterns across clusters (although it is really hard to tell by eye, and some of the cells in different blocks seem like they should be grouped together).  
191 |   
192 |   
193 | ## Quantitative sanity check
194 |   
195 | So far we have focused on getting the results and trying to determine agreement with expections based on resulting plots.  The results seem mixed so far.  We now want to do what I am calling a quantitative sanity check, where we compare results obtained by different computational methods, or between RNA-Seq and mFISH, or taking prior knowledge into consideration.  Ideally we can build a mapping alorithm that adjusts parameters to try and optimize the results based on priors (or something to this effect), but for now we want to have quick ways of seeing what looks right and what looks wrong to help us make adjustments.  
196 |   
197 | First, let's see whether the proportion of cells identified in RNA-seq and mFISH agree.  Note that we would only expect this to be the case when we have unbiased surveyed in both modalities, which is not the case here.  
198 |   
199 | ```{r type wrap-up, fig.width=8,fig.height=8} 
200 | countFish9 <- table(fishMouse$mappingResults$Class)
201 | countFish9 <- countFish9[countFish9>0]
202 | countSeq  <- table(anno$primary_type_label)
203 | fs <- intersect(names(countFish9),names(countSeq))
204 | plot(as.numeric(countSeq[fs]),as.numeric(countFish9[fs]),main="Sst types",
205 |   pch=19,col="grey",ylab="FISH cell count",xlab="RNA-seq cell count",xlim=c(0,max(as.numeric(countSeq[fs]))*1.1))
206 | text(as.numeric(countSeq[fs]),as.numeric(countFish9[fs]),fs,cex=0.7,srt=0)
207 | ```
208 |   
209 |   
210 | There is not great agreement between modalities.  What if we wrap up by class?
211 |   
212 | ```{r broad class wrap-up, fig.width=8,fig.height=8} 
213 | # Second, broad class
214 | val = list(fishMouse$mappingResults$Class,anno$primary_type_label)
215 | for (i in 1:2){
216 |   val[[i]] <- as.character(lapply(val[[i]], function(x) strsplit(x," ")[[1]][1]))  # Get the class within interneurons
217 | }
218 | countFish <- table(val[[1]])
219 | countSeq  <- table(val[[2]])
220 | fs <- intersect(val[[1]],val[[2]])
221 | plot(as.numeric(countSeq[fs]),as.numeric(countFish[fs]),main="Broad classes",
222 |   pch=19,col="grey",ylab="FISH cell count",xlab="RNA-seq cell count",
223 |   xlim=c(0,max(as.numeric(countSeq[fs]))*1.1),ylim=c(0,max(as.numeric(countFish[fs]))*1.1))
224 | text(as.numeric(countSeq[fs]),as.numeric(countFish[fs]),fs,cex=0.7,srt=0)
225 | ```  
226 |   
227 | Wrapping up by class, there is much better agreement between proportions across modalities.  Whiel this is not necessarily something we'd expect to find, it is still useful to note.  
228 |   
229 | One possibility (that I think is likely the case for at least a subset of cells) is that there is some mapping issues.  We can try and understand why cells are mapping specific ways by plotting expression levels between RNA-seq and mFISH for each cell type.  Let's do that.  
230 |   
231 | ```{r mean expression comparison, fig.width=18,fig.height=11} 
232 | meanFish    <- summarizeMatrix(fishMouse$scaleDat,fishMouse$mappingResults$Class,summaryFunction = mean)[useGenes,]
233 | meanFish    <- meanFish[,intersect(colnames(meanFish),colnames(medianExpr))]
234 | medSeq      <- meanExpr[useGenes,colnames(meanFish)]
235 | corFS       <- NULL
236 | par(mfrow=c(4,7))
237 | for (ct in intersect(allClusts,colnames(medSeq))){
238 |   corMR <- signif(cor(medSeq[,ct],meanFish[,ct]),3)
239 |   plot(medSeq[,ct],meanFish[,ct],xlab="RNA-seq",ylab=ct,pch=19,col="white", main=paste("R =",corMR))
240 |   text(medSeq[,ct],meanFish[,ct],rownames(medSeq),cex=0.9)
241 |   corFS <- c(corFS,cor(medSeq[,ct],meanFish[,ct]))
242 | }
243 | names(corFS) <- intersect(allClusts,colnames(medSeq))
244 | ```  
245 |   
246 | Overall, there is quite good agreement for most, but not all, of these cell types.  Similarly, this type of plot can give us a sense of which cell types we can be confident in and which ones we can't, although the exact relationship between confidence of mapping and mapping accuracy remains TBD.  
247 |   
248 |   
249 | Now make a TSNE plot on all genes using the mFISH data.  Do the data cluster by color, and show the points also corresponding to cluster (abbreviated).  
250 |   
251 | ```{r Plot mFISH TSNE (scaled), fig.width=10,fig.height=8} 
252 | cap=10
253 | fishPlot <- fishMouse 
254 | fishPlot$mappingResults$Broad <- val[[1]]
255 | fishPlot <- filterCells(fishPlot,fishPlot$mappingResults$Class!="none")
256 | p=plotTsne(fishPlot,main="Mouse cells (scaled)",colorGroup="Broad",labelGroup = "Broad",
257 |            capValue=cap, useScaled=TRUE, perplexity = 10, maxNchar=5)
258 | p
259 | ```
260 |   
261 | Overall it seems that most cells map reasonably well by broad class calls, although it looks like there are a few errors.  
262 |   
263 | Finally, repeat but with the points corresponding to layer.  
264 |   
265 | ```{r Plot mFISH TSNE (scaled, color by layer), fig.width=10,fig.height=6} 
266 | p=plotTsne(fishPlot,main="Mouse Sst cells (scaled)",colorGroup="Class",labelGroup = "layerData",
267 |            capValue=cap, useScaled=TRUE, perplexity = 10)
268 | p
269 | ```
270 | 
271 | The layer and cell type appear to provide some complementary information.  
272 |   
273 | **It is important to note that these tools are very much in development and minimally tested, but even so we hope that that are useful.**  
274 | 


--------------------------------------------------------------------------------
/vignettes/inhibitory_marker_selection.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title:  "Marker panel generation"
  3 | output: rmarkdown::html_vignette
  4 | vignette: >
  5 |   %\VignetteIndexEntry{Marker panel generation}
  6 |   %\VignetteEngine{knitr::rmarkdown}
  7 |   \usepackage[utf8]{inputenc} 
  8 | ---
  9 | 
 10 | This vignette demonstrates how to generate an optimal marker gene panel (of length N), and then predicts how well each of the FACs cells map into the types defined as a measure of panel quality.  
 11 |   
 12 | The strategy used is a correlation-baseed greedy algorithm, which aims to minimize the distance between the actual and predicted clusters (rather than maximizing the fraction correctly mapping to each cluster).  
 13 |   
 14 | ### Workspace set-up
 15 |   
 16 | Install the necessary packages.  In this case we are using data from `tasic2016data` and plotting functions from `scrattch.vis`.
 17 |   
 18 | ```{r install packages, eval=FALSE}
 19 | install.packages("devtools")
 20 | devtools::install_github("AllenInstitute/scrattch.vis")  # For plotting
 21 | devtools::install_github("AllenInstitute/tasic2016data") # For our data example
 22 | ```
 23 |   
 24 |   
 25 | Load libraries.  
 26 |   
 27 | ```{r load libraries}
 28 | suppressPackageStartupMessages({
 29 |   library(mfishtools)    # This library!
 30 |   library(gplots)        # This is for plotting gene panels only.
 31 |   library(scrattch.vis)  # This is for plotting gene panels only.
 32 |   library(matrixStats)   # For rowMedians function, which is fast
 33 |   library(tasic2016data) # For the data
 34 | })
 35 | options(stringsAsFactors = FALSE)  # IMPORTANT
 36 | print("Libraries loaded.")
 37 | ```
 38 |   
 39 |   
 40 | Read in the data (in this case we will use the Tasic 2016 data, which includes ~1800 cells from mouse primary visual cortex).  
 41 |   
 42 | ```{r load tasic data}
 43 | annotations <- tasic_2016_anno
 44 | counts      <- tasic_2016_counts
 45 | rpkm        <- tasic_2016_rpkm
 46 | annotations <- annotations[match(colnames(counts),annotations$sample_name),]  # Put them in the correct order
 47 | ```
 48 |   
 49 | ### Data preparations
 50 |   
 51 | This analysis will only be looking at marker genes for GABAergic neurons, so we need to only consider cells mapping to GABAergic types.  We also define some convenient cluster info variables here.  
 52 |   
 53 | ```{r define variables}
 54 | clusterType = annotations$broad_type 
 55 | includeClas = "GABA-ergic Neuron"  # In this analysis, we are only considering interneurons
 56 | excludeClas = sort(setdiff(clusterType,includeClas))
 57 | gliaClas    = setdiff(excludeClas,"Glutamatergic Neuron") 
 58 | kpSamp      = !is.element(clusterType,excludeClas)
 59 | anno        = annotations[kpSamp,]
 60 | cl          = annotations$primary_type_label
 61 | names(cl)   = annotations$sample_name
 62 | kpClust     = sort(unique(cl[kpSamp]))
 63 | gliaClust   = sort(unique(cl[is.element(clusterType,gliaClas)]))
 64 | ```
 65 |   
 66 |   
 67 | Convert the data to log2(rpkm).  NOTE: we often use counts per million of introns + exons when performing this analysis.  Currently, we don't know which method produces more reliable markers.  Alternative code for calculating cpm is commented out below.  
 68 |   
 69 | ```{r convert to log2}
 70 | normDat = log2(rpkm+1)
 71 | #sf      = colSums(counts)/10^6
 72 | #cpms    = t(t(counts)/sf)
 73 | #normDat = log2(cpms+1)
 74 | ```
 75 |   
 76 |   
 77 | Calculate proportions and medians.  These are both needed for gene filtering and for marker selection.    
 78 |   
 79 | ```{r calculate proportions and medians}        
 80 | exprThresh = 1
 81 | medianExpr = do.call("cbind", tapply(names(cl), cl, function(x) rowMedians(normDat[,x]))) 
 82 | propExpr   = do.call("cbind", tapply(names(cl), cl, function(x) rowMeans(normDat[,x]>exprThresh))) 
 83 | rownames(medianExpr) <- rownames(propExpr) <- genes <- rownames(normDat)  
 84 | ```
 85 |   
 86 |   
 87 | ### Combinatorial marker gene selection
 88 |   
 89 | This section is where gene selection happens.  There are two steps: (1) gene filtering and (2) marker selection, both of which are described below.  
 90 |   
 91 | We first want to define some gene filters prior to running gene selection.  Note that this filtering occurs prior to gene selection and does not occur during the selection process.  To see details about all of the specific filters, see the code block below or use `?filterPanelGenes`.  Overall, the goal of this section is to exclude genes that won't work properly with the specific spatial transcriptomics method desired because their expression is too low or too high, or they are too short.  It also removes genes with too much off-target expression, genes that are expressed in too many cell types, and genes that are potentially not of interest because they are unannotated, or on a sex chromosome, or are don't work for any other reason. It also takes the most binary genes as the possible selection space in order to try and avoid genes whose only cell type differences are in magnitude.  In this case, we will use a total of 250 binary genes.   
 92 |   
 93 | ```{r filter genes for panel selection}
 94 | startingGenePanel <-  c("Gad1","Slc32a1","Pvalb","Sst","Vip")
 95 | runGenes <- NULL
 96 | runGenes <- filterPanelGenes(
 97 |   summaryExpr = 2^medianExpr-1,  # medians (could also try means); We enter linear values to match the linear limits below
 98 |   propExpr    = propExpr,    # proportions
 99 |   onClusters  = kpClust,     # clusters of interest for gene panel
100 |   offClusters = gliaClust,   # clusters to exclude expression
101 |   geneLengths = NULL,        # vector of gene lengths (not included here)
102 |   startingGenes  = startingGenePanel,  # Starting genes (from above)
103 |   numBinaryGenes = 250,      # Number of binary genes (explained below)
104 |   minOn     = 10,   # Minimum required expression in highest expressing cell type
105 |   maxOn     = 500,  # Maximum allowed expression
106 |   maxOff    = 50,   # Maximum allowed expression in off types (e.g., aviod glial expression)
107 |   minLength = 960,  # Minimum gene length (to allow probe design; ignored in this case)
108 |   fractionOnClusters = 0.5,  # Max fraction of on clusters (described above)
109 |   excludeGenes    = NULL,    # Genes to exclude.  Often sex chromosome or mitochondrial genes would be input here.
110 |   excludeFamilies = c("LOC","Fam","RIK","RPS","RPL","\\-","Gm","Rnf","BC0")) # Avoid LOC markers, in this case
111 | ```
112 |   
113 |   
114 | The second step is our marker panel selection.  This strategy uses a greedy algorithm to iteratively add the "best" gene to the existing panel until the panel reaches a certain size.  Specifically, each cell in the reference data set is correlated with each cluster median using the existing marker gene set, and the most highly correlated cluster is compared with the originally assigned cluster for each cell.  By default the algorithm tries to optimize the fraction of cells correctly mapping to each cluster by using this correlation-based mapping.  An alternative strategy that is particularly useful for smaller gene panels is to use a weighting strategy to penalize cells that map to a distant cluster more than cells that map to a nearby cluster.  To do this we iteratively choose genes from the starting gene panel such that the addition of each gene minimizes the correlation distance between clusters.  
115 |   
116 | To do this we first determine the cluster distance based on correlation of top binary genes, which is what is done in this code block.  
117 |   
118 | ```{r identify binary markers}
119 | corDist         <- function(x) return(as.dist(1-cor(x)))
120 | clusterDistance <- as.matrix(corDist(medianExpr[runGenes,kpClust]))
121 | print(dim(clusterDistance))
122 | ```
123 |   
124 |   
125 | This step constructs the gene panel, as described above.  Once again, use `?buildMappingBasedMarkerPanel` and see the code block below for details on the parameters, but there are a few key options.  First, we find it useful to subsample cells from each cluster which decreases the time for the algorithm to run and also more evenly weights the clusters for gene selection.  
126 |   
127 | ```{r build gene panels}
128 | fishPanel <- buildMappingBasedMarkerPanel(
129 |   mapDat        = normDat[runGenes,kpSamp],     # Data for optimization
130 |   medianDat     = medianExpr[runGenes,kpClust], # Median expression levels of relevant genes in relevant clusters
131 |   clustersF     = cl[kpSamp],                   # Vector of cluster assignments
132 |   panelSize     = 30,                           # Final panel size
133 |   currentPanel  = startingGenePanel,            # Starting gene panel
134 |   subSamp       = 15,                           # Maximum number of cells per cluster to include in analysis (20-50 is usually best)
135 |   optimize      = "CorrelationDistance",        # CorrelationDistance maximizes the cluster distance as described
136 |   clusterDistance = clusterDistance,            # Cluster distance matrix
137 |   percentSubset = 50                            # Only consider a certain percent of genes each iteration to speed up calculations (in most cases this is not recommeded)
138 | )       
139 | 
140 | ```
141 |   
142 | This is the panel!
143 |   
144 | ### Assess panel quality  
145 |   
146 | First, let's plot the panel in the context of the clusters we care about.  This can be done using the `scrattch.vis` library.  
147 |   
148 | ```{r panel gene plot in subset, fig.width=7,fig.height=10}
149 | plotGenes <- fishPanel
150 | plotData  <- cbind(sample_name = colnames(rpkm), as.data.frame(t(rpkm[plotGenes,])))
151 | clid_inh  <- 1:23  # Cluster IDs for inhibitory clusters
152 | 
153 | # violin plot example.  Could be swapped with fire, dot, bar, box plot, heatmap, Quasirandom, etc.
154 | sample_fire_plot(data = plotData, anno = annotations, genes = plotGenes, grouping = "primary_type", 
155 |                  log_scale=TRUE, max_width=15, label_height = 8, group_order = clid_inh)
156 | ``` 
157 |   
158 | The first half of this panel looks like most of the genes have fairly distinct and binarized patterning, while many of the genes in the latter half of the panel appear to be showing redundant information, suggesting that after a certain point this strategy becomes less than optimal in a practical sense.
159 |   
160 | How do they look across ALL cell types (realizing that we don't expect to see all these types in the tissue we care about)?  
161 |   
162 | ```{r panel gene plot across all types, fig.width=14,fig.height=10}
163 | sample_fire_plot(data = plotData, anno = annotations, genes = plotGenes, grouping = "primary_type", 
164 |                  log_scale=TRUE, max_width=15, label_height = 8)
165 | ```
166 |   
167 | We get a little bit of excitatory and glial cell separation for free, but it's not great (as expected).  
168 |   
169 | What fraction of cells are correctly mapped to leaf nodes? Note that we don't necessarily expect this number to be high.  Also note that this is using all of the above genes, so will actually be higher than we would get with any given 9-gene panel.  
170 |   
171 | ```{r dsiplay fraction correctly mapped}
172 | assignedCluster <- suppressWarnings(getTopMatch(corTreeMapping(mapDat = normDat[runGenes,kpSamp], 
173 |                    medianDat=medianExpr[runGenes,kpClust], genesToMap=fishPanel)))
174 | print(paste0("Percent correctly mapped: ",signif(100*mean(as.character(assignedCluster$TopLeaf)==cl[kpSamp],na.rm=TRUE),3),"%"))
175 | ```
176 |   
177 | Around ~72% of the cells are correctly mapped with this panel, which is reasonable, but less than ideal.  How does the plot look for fewer genes?  
178 |   
179 | ```{r plot fraction correctly mapped,fig.width=9,fig.height=6}
180 | fractionCorrectWithGenes(fishPanel,normDat[,kpSamp],medianExpr[runGenes,kpClust],cl[kpSamp],
181 |                          main="Mapping quality for different numbers of included genes",return=FALSE)
182 | ```
183 |   
184 | This plot suggests that with 30 genes we are continuing to see improvement by adding new genes; however, by approximately a 20 gene panel the level of improvement decreases dramatically for each gene added.  **Later releases of this library will include additional strategies for optimizing the panel beyond these initial 20 or so genes.**  
185 |   
186 | Finally, as an overview, we create a confusion matrix based on the top leaf assignments. This will let us address which pairs of clusters are the most confused. Are they adjacent/nearby on the tree?  Note that the colors are distorted to highlight confusion in the tree.  
187 |   
188 | ```{r cluster confusion, fig.height=8,fig.width=8}
189 | membConfusionProp  <- getConfusionMatrix(cl[kpSamp],assignedCluster[,1],TRUE)
190 | clOrd <- (annotations$primary_type_label[match(clid_inh,annotations$primary_type_id)])  # Cluster order
191 | heatmap.2(pmin(membConfusionProp,0.25)[clOrd,clOrd],Rowv=FALSE,Colv=FALSE,trace="none",dendrogram="none",
192 |           margins=c(16,16),main="Confusion Matrix")
193 | ```
194 |   
195 | Most of the errors are nearby on the diagonal, which is what we were optimizing for using the clusterDistance strategy.  
196 |   


--------------------------------------------------------------------------------