├── .Rbuildignore ├── CommonProblems.md ├── DESCRIPTION ├── Evaluation ├── method_evaluation.R ├── method_evaluation.Rmd └── method_evaluation.html ├── LICENSE ├── NAMESPACE ├── NEWS ├── R └── MAUDE.R ├── README.md ├── doc ├── BACH2_base_editor_screen.R ├── BACH2_base_editor_screen.html ├── CD69_tutorial.R ├── CD69_tutorial.html ├── simulated_data_tutorial.R └── simulated_data_tutorial.html ├── images └── logo2.png ├── inst ├── CITATION └── extdata │ ├── CD69_bin_percentiles.txt │ └── Encode_Jurkat_DHS_both.merged.bed ├── man ├── calcFDRByExperiment.Rd ├── combineZStouffer.Rd ├── findGuideHits.Rd ├── findGuideHitsAllScreens.Rd ├── findOverlappingElements.Rd ├── getElementwiseStats.Rd ├── getNBGaussianLikelihood.Rd ├── getTilingElementwiseStats.Rd ├── getZScalesWithNTGuides.Rd └── makeBinModel.Rd └── vignettes ├── BACH2_base_editor_screen.Rmd ├── BACH2_data ├── loaded_BACH2_BE_CRISPResso_data.txt.gz └── sample_metadata.txt ├── CD69_tutorial.Rmd └── simulated_data_tutorial.Rmd /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^doc$ 2 | ^Meta$ 3 | ^Evaluation$ 4 | ^images$ 5 | ^CommonProblems.md$ 6 | ^.*\.Rproj$ 7 | ^\.Rproj\.user$ 8 | -------------------------------------------------------------------------------- /CommonProblems.md: -------------------------------------------------------------------------------- 1 | # Common Problems 2 | Here are some common problems and their solutions. 3 | 4 | ## `Mu optimization returned NaN objective: restricting search space` 5 | This warning indicates that when optimizing the mu value for a guide, the log likelihood function returned NaN. 6 | This warning does not necessarily mean there is a problem. 7 | 8 | There are three usual reasons for this: 9 | ### 1) Data that has really low coverage in the "input" and no pseudocount added (e.g. 
if A is 0% of the library, the likelihood of >1 reads in bin A is NaN) 10 | To resolve the warning, please ensure that you add a pseudocount to the data. 11 | 12 | ### 2) Data where one "guide" has really high abundance. (e.g. if a "guide" makes up 89% of your library, a Z score of 1 might mean that you now expect OVER 100% to end up in bin F) 13 | Usually, MAUDE's automatic restriction of the search space resolves this. The warning will still be shown, but you don't have to worry about it. 14 | 15 | ### 3) The mu search limits are too wide. (an extreme mu is so unlikely that the log likelihood is infinite, leading to this warning) 16 | If you only get this message a few times, there's no need to adjust anything; MAUDE automatically adjusts the search space. If many warnings are issued, it is recommended that you adjust the `limits` parameter. It defaults to c(-4,4). Narrowing the limits toward 0 should eliminate the warning message. 17 | 18 | ## Other problems 19 | Please submit an Issue, or contact the authors for any other problems you encounter so that we can fix/clarify things as necessary. 20 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: MAUDE 2 | Title: Mean Alterations Using Discrete Expression 3 | Version: 0.99.4 4 | Authors@R: person("Carl", "de Boer", email = "carl.deboer@ubc.ca", role = c("aut", "cre")) 5 | Description: This package is useful for inferring the difference in the mean expression of some entity as quantified by discrete measurements obtained by binning across expression space. One example of where this is useful is with pooled CRISPR screens with a FACS (cell sorting into discrete bins) readout. 
For example, you introduce a guide library into your CRISPRi-competent cells, sort them by expression of a reporter gene targeted by your guides, and want to know which guides altered expression of the reporter gene. 6 | Depends: R (>= 4.0), reshape, stats 7 | Suggests: knitr, rmarkdown, ggplot2, openxlsx, ggbio, GenomicRanges, Homo.sapiens, biovizBase, edgeR, DESeq2, cowplot 8 | VignetteBuilder: knitr 9 | License: MIT + file LICENSE 10 | Encoding: UTF-8 11 | LazyData: true 12 | RoxygenNote: 7.1.1 13 | biocViews: FunctionalPrediction, DifferentialExpression, GeneRegulation, GeneTarget, GenomeAnnotation, PeakDetection, FunctionalGenomics, Genetics, Transcriptomics, SystemsBiology, CRISPR, FlowCytometry, DNASeq, PooledScreens, Annotation, ExperimentalDesign, Normalization, QualityControl, GeneExpression, Transcription, SNP, GeneticVariability 14 | -------------------------------------------------------------------------------- /Evaluation/method_evaluation.R: -------------------------------------------------------------------------------- 1 | ## ----setup, include=FALSE----------------------------------------------------- 2 | knitr::opts_chunk$set(echo = TRUE) 3 | 4 | ## ----load libraries, results="hide", message = FALSE, warning=FALSE----------- 5 | library("ggplot2") 6 | library(reshape) 7 | library(cowplot) 8 | library(MAUDE) 9 | library(openxlsx) 10 | library(edgeR) 11 | library(DESeq2) 12 | 13 | ## ----set seed----------------------------------------------------------------- 14 | set.seed(35263377) 15 | 16 | ## ----Load and parse CD69 data------------------------------------------------- 17 | #a mapping to unify bin names from Simeonov data 18 | binmapBack = list("baseline" = "baseline", "low"="low", "medium"="medium","high"="high","back_" = "NS", 19 | "baseline_" = "baseline", "low_"="low", "medium_"="medium", "high_"="high", 20 | "A"="baseline", "B"="low", "E" = "medium", "F"="high") 21 | 22 | #this comes from manually reconstructing the CD69 density curve 
from extended data figure 1a (Simeonov et al) 23 | binBoundsCD69 = data.frame(Bin = c("A","F","B","E"), 24 | fraction = c(0.65747100, 0.02792824, 0.25146688, 0.06313389), 25 | stringsAsFactors = FALSE) 26 | fractionalBinBounds = makeBinModel(binBoundsCD69[c("Bin","fraction")]) 27 | fractionalBinBounds = rbind(fractionalBinBounds, fractionalBinBounds) 28 | fractionalBinBounds$screen = c(rep("1",6),rep("2",6)); 29 | #only keep bins A,B,E,F 30 | fractionalBinBounds = fractionalBinBounds[fractionalBinBounds$Bin %in% c("A","B","E","F"),] 31 | fractionalBinBounds$Bin = unlist(binmapBack[fractionalBinBounds$Bin]); 32 | 33 | #load data 34 | cd69OriginalResults = read.xlsx('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5675716/bin/NIHMS913084-supplement-supplementary_table_1.xlsx') 35 | cd69OriginalResults$NT = grepl("negative_control", cd69OriginalResults$gRNA_systematic_name) 36 | cd69OriginalResults$pos = cd69OriginalResults$PAM_3primeEnd_coord; 37 | cd69OriginalResults = unique(cd69OriginalResults) 38 | cd69CountData = melt(cd69OriginalResults, id.vars = c("pos","NT","gRNA_systematic_name")) 39 | cd69CountData = cd69CountData[grepl(".count$",cd69CountData$variable),] 40 | cd69CountData$theirBin = gsub("CD69(.*)([12]).count","\\1",cd69CountData$variable) 41 | cd69CountData$screen = gsub("CD69(.*)([12]).count","\\2",cd69CountData$variable) 42 | cd69CountData$reads= as.numeric(cd69CountData$value); cd69CountData$value=NULL; 43 | # convert their bin to one that is consistent 44 | cd69CountData$Bin = unlist(binmapBack[cd69CountData$theirBin]); 45 | binReadMat = data.frame(cast(cd69CountData[!is.na(cd69CountData$pos) | cd69CountData$NT,], pos+gRNA_systematic_name+NT+screen ~ Bin, value="reads")) 46 | 47 | ## ----calc logFC--------------------------------------------------------------- 48 | #confirm how to calc log2FC: 49 | cd69OriginalResults$l2fc.vsbg1 = log2((1+ cd69OriginalResults$CD69high_1.count.norm)/(1+cd69OriginalResults$CD69back_1.count.norm)) 50 | 
cd69OriginalResults$l2fc.vsbg2 = log2((1+ cd69OriginalResults$CD69high_2.count.norm)/(1+cd69OriginalResults$CD69back_2.count.norm)) 51 | cd69OriginalResults$l2fc.hilo1 = log2((1+ cd69OriginalResults$CD69high_1.count.norm)/(1+cd69OriginalResults$CD69baseline_1.count.norm)) 52 | cd69OriginalResults$l2fc.hilo2 = log2((1+ cd69OriginalResults$CD69high_2.count.norm)/(1+cd69OriginalResults$CD69baseline_2.count.norm)) 53 | #confirm that how we just calculated log2 fc is the same as originally calculated 54 | p=ggplot(cd69OriginalResults, aes(x=high2.l2fc,y= l2fc.vsbg2)) + geom_point(); print(p) 55 | 56 | ## ----run methods, results="hide", message = FALSE, warning=FALSE-------------- 57 | #edgeR 58 | x <- unique(cd69OriginalResults[c("gRNA_systematic_name","CD69baseline_1.count","CD69baseline_2.count", 59 | "CD69high_1.count", "CD69high2.count")]) 60 | row.names(x) = x$gRNA_systematic_name; x$gRNA_systematic_name=NULL; 61 | x=as.matrix(x) 62 | group <- factor(c(1,1,2,2)) 63 | y <- DGEList(counts=x,group=group) 64 | y <- calcNormFactors(y) 65 | design <- model.matrix(~group) 66 | y <- estimateDisp(y,design) 67 | #To perform likelihood ratio tests: 68 | fit <- glmFit(y,design) 69 | lrt <- glmLRT(fit,coef=2) 70 | edgeRGuideLevel = topTags(lrt, n=nrow(x))@.Data[[1]] 71 | edgeRGuideLevel$gRNA_systematic_name = row.names(edgeRGuideLevel); 72 | edgeRGuideLevel$metric = edgeRGuideLevel$logFC; 73 | edgeRGuideLevel$significance = -log(edgeRGuideLevel$PValue); 74 | edgeRGuideLevel$method="edgeR"; 75 | 76 | #DESeq2 77 | deseqGroups = data.frame(bin=factor(c(1,1,2,2))); 78 | row.names(deseqGroups) = c("CD69baseline_1.count","CD69baseline_2.count", 79 | "CD69high_1.count", "CD69high2.count"); 80 | dds <- DESeqDataSetFromMatrix(countData = x,colData = deseqGroups, design= ~ bin) 81 | dds <- DESeq(dds) 82 | res <- results(dds, name=resultsNames(dds)[2]) 83 | 84 | deseqGuideLevel = as.data.frame(res@listData) 85 | deseqGuideLevel$gRNA_systematic_name =res@rownames; 86 | 
deseqGuideLevel$metric = deseqGuideLevel$log2FoldChange; 87 | deseqGuideLevel$significance = -log(deseqGuideLevel$pvalue); 88 | deseqGuideLevel$method="DESeq2"; 89 | 90 | #MAUDE 91 | guideLevelStatsCD69 = findGuideHitsAllScreens(unique(binReadMat["screen"]), binReadMat, fractionalBinBounds, sortBins = c("baseline","high","low","medium"), unsortedBin = "NS") 92 | guideLevelStatsCD69$chr="chr12" 93 | guideLevelStatsCastCD69 = cast(guideLevelStatsCD69, gRNA_systematic_name + pos+NT ~ screen, value="Z") 94 | names(guideLevelStatsCastCD69)[ncol(guideLevelStatsCastCD69)-1:0]=c("s1","s2") 95 | guideLevelStatsCastCD69$significance = apply(guideLevelStatsCastCD69[c("s1","s2")],1, combineZStouffer) 96 | guideLevelStatsCastCD69$metric=apply(guideLevelStatsCastCD69[c("s1","s2")],1, mean) 97 | guideLevelStatsCastCD69$method = "MAUDE" 98 | 99 | #Two log fold change methods 100 | cd69OriginalResultsHiLow = cd69OriginalResults[c("gRNA_systematic_name","l2fc.hilo1","l2fc.hilo2")] 101 | cd69OriginalResultsVsBG = cd69OriginalResults[c("gRNA_systematic_name","l2fc.vsbg1","l2fc.vsbg2")] 102 | cd69OriginalResultsHiLow$significance = apply(cd69OriginalResultsHiLow[2:3], 1, FUN = mean) 103 | cd69OriginalResultsHiLow$metric = apply(cd69OriginalResultsHiLow[2:3], 1, FUN = mean) 104 | cd69OriginalResultsVsBG$significance = apply(cd69OriginalResultsVsBG[2:3], 1, FUN = mean) 105 | cd69OriginalResultsVsBG$metric = apply(cd69OriginalResultsVsBG[2:3], 1, FUN = mean) 106 | cd69OriginalResultsHiLow$method="logHivsLow" 107 | cd69OriginalResultsVsBG$method="logVsUnsorted" 108 | 109 | ## ----compile results---------------------------------------------------------- 110 | # predictions 111 | allResults = rbind(cd69OriginalResultsVsBG[c("method","gRNA_systematic_name","significance","metric")], 112 | cd69OriginalResultsHiLow[c("method","gRNA_systematic_name","significance","metric")], 113 | deseqGuideLevel[c("method","gRNA_systematic_name","significance","metric")], 114 | 
edgeRGuideLevel[c("method","gRNA_systematic_name","significance","metric")], 115 | guideLevelStatsCastCD69[c("method","gRNA_systematic_name","significance","metric")]) 116 | 117 | 118 | allResults = merge(allResults, cd69OriginalResults[c("gRNA_systematic_name","NT","pos")], 119 | by="gRNA_systematic_name") 120 | allResults = allResults[!is.na(allResults$pos) | allResults$NT,] 121 | allResults$promoter = allResults$pos <= 9913996 & allResults$pos >= 9912997 122 | allResults$gID = allResults$gRNA_systematic_name; allResults$gRNA_systematic_name=NULL; 123 | allResults$locus ="CD69" 124 | allResults$type ="CRISPRa" 125 | allResults$celltype ="Jurkat" 126 | 127 | ## ----input and parse TNFAIP3 data--------------------------------------------- 128 | 129 | ##read in TNFAIP3 data 130 | binFractionsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPR_FF_bin_fractions.txt.gz")))), 131 | sep="\t", stringsAsFactors = FALSE, header = TRUE) 132 | CRISPRaCountsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPRa_FF_countMatrix.txt.gz")))), 133 | sep="\t", stringsAsFactors = FALSE, header = TRUE) 134 | CRISPRiCountsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPRi_FF_countMatrix.txt.gz")))), 135 | sep="\t", stringsAsFactors = FALSE, header = TRUE) 136 | crispraGuides = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20180205_selected_CRISPRa_guides_seq.txt.gz")))), 137 | sep="\t", stringsAsFactors = FALSE, header = TRUE) 138 | crispraGuides$seq = gsub("(.*)(.{3})","\\1",crispraGuides$seq.w.PAM) 139 | crispraGuides$pos = round((as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\2", 140 | crispraGuides$guideID))+ 141 | 
as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\3", 142 | crispraGuides$guideID)))/2) 143 | 144 | CRISPRaCountsA20 = merge(CRISPRaCountsA20, unique(crispraGuides[c("seq","pos")]), by=c("seq"), all.x=TRUE) 145 | 146 | CRISPRiCountsA20$pos = round((as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\2", 147 | CRISPRiCountsA20$gID))+ 148 | as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\3", 149 | CRISPRiCountsA20$gID)))/2) 150 | 151 | binFractionsA20$expt = paste(binFractionsA20$celltype, binFractionsA20$CRISPRType,sep="_") 152 | 153 | #merge CRISPRi and CRISPRa 154 | A20CountData = melt(CRISPRaCountsA20, id.vars=c("seq","elementClass","element","NT","gID","pos")) 155 | A20CountData$CRISPRType="CRISPRa"; 156 | A20CountData2 = melt(CRISPRiCountsA20, id.vars=c("seq","elementClass","element","NT","gID","pos")) 157 | A20CountData2$CRISPRType="CRISPRi"; 158 | A20CountData = rbind(A20CountData, A20CountData2) 159 | rm('A20CountData2'); 160 | A20CountData$count = A20CountData$value; A20CountData$value=NULL; 161 | A20CountData$sample = A20CountData$variable; A20CountData$variable=NULL; 162 | A20CountData$bin = gsub("(.*)_(.*)_(.*)", "\\3", A20CountData$sample); 163 | A20CountData$screen = gsub("(.*)_(.*)_(.*)", "\\2", A20CountData$sample); 164 | A20CountData$celltype = gsub("(.*)_(.*)_(.*)", "\\1", A20CountData$sample); 165 | A20CountData$expt = paste(A20CountData$celltype, A20CountData$CRISPRType,sep="_") 166 | 167 | ## ----start evaluation and run on TNFAIP3, results="hide", message = FALSE, warning=FALSE---- 168 | #combine CD69 metrics 169 | # Pearson's r between replicates ## metricsBoth will contain all our evaluation metrics 170 | metricsBoth = 171 | data.frame(method=c("MAUDE","logHivsLow","logVsUnsorted"), metric="r",which="replicate_correl", 172 | value=c(cor(guideLevelStatsCastCD69$s1[!guideLevelStatsCastCD69$NT], 173 | guideLevelStatsCastCD69$s2[!guideLevelStatsCastCD69$NT]), 174 | cor(cd69OriginalResults$l2fc.hilo1[!cd69OriginalResults$NT], 175 | 
cd69OriginalResults$l2fc.hilo2[!cd69OriginalResults$NT]), 176 | cor(cd69OriginalResults$l2fc.vsbg1[!cd69OriginalResults$NT], 177 | cd69OriginalResults$l2fc.vsbg2[!cd69OriginalResults$NT])), 178 | sig=c(cor.test(guideLevelStatsCastCD69$s1[!guideLevelStatsCastCD69$NT], 179 | guideLevelStatsCastCD69$s2[!guideLevelStatsCastCD69$NT])$p.value, 180 | cor.test(cd69OriginalResults$l2fc.hilo1[!cd69OriginalResults$NT], 181 | cd69OriginalResults$l2fc.hilo2[!cd69OriginalResults$NT])$p.value, 182 | cor.test(cd69OriginalResults$l2fc.vsbg1[!cd69OriginalResults$NT], 183 | cd69OriginalResults$l2fc.vsbg2[!cd69OriginalResults$NT])$p.value), 184 | locus="CD69",type="CRISPRa",celltype="Jurkat",stringsAsFactors = FALSE) 185 | 186 | allResultsBoth = allResults; #combined results (significance, effect sizes) for CD69 and A20 screens 187 | for (e in unique(binFractionsA20$expt)){ 188 | curCelltype = gsub("(.*)_(.*)", "\\1", e); 189 | curtype = gsub("(.*)_(.*)", "\\2", e); 190 | curA20CountData = unique(A20CountData[A20CountData$expt==e, 191 | c("seq","pos", "NT","gID","count","screen","bin")]) 192 | curA20CountDataTotals = cast(curA20CountData, screen +bin~ ., fun.aggregate = sum, value="count") 193 | names(curA20CountDataTotals)[3] = "total"; 194 | curA20CountData = merge(curA20CountData, curA20CountDataTotals, by=c("screen","bin")) 195 | curA20CountData$CPM = curA20CountData$count/curA20CountData$total * 1E6; 196 | curCPMMat = cast(curA20CountData, seq + NT + gID + screen + pos ~ bin, value="CPM") 197 | curCPMMat$l2fc_hilo = log2((1+curCPMMat$F)/(1+curCPMMat$A)) 198 | if(curtype=="CRISPRi"){ 199 | curCPMMat$l2fc_vsbg = log2((1+curCPMMat$NS)/(1+curCPMMat$A)) 200 | }else{ #CRISPRa 201 | curCPMMat$l2fc_vsbg = log2((1+curCPMMat$F)/(1+curCPMMat$NS)) 202 | } 203 | curBins = as.data.frame(melt(binFractionsA20[binFractionsA20$expt==e,], 204 | id.vars = c("celltype","screen","CRISPRType","expt"))) 205 | names(curBins)[ncol(curBins) - (1:0)] = c("Bin","fraction") 206 | curBins2 = data.frame(); 207 
| for (s in unique(curBins$screen)){ 208 | curBins3 = makeBinModel(curBins[curBins$screen==s,c("Bin","fraction")]) 209 | curBins3$screen = s; 210 | curBins2 = rbind(curBins2, curBins3) 211 | } 212 | curBins2$Bin = as.character(curBins2$Bin); 213 | curCountMat = cast(curA20CountData, seq + NT + gID +pos + screen ~ bin, value="count") 214 | guideLevelStats = findGuideHitsAllScreens(experiments = unique(curCountMat["screen"]), 215 | countDataFrame = curCountMat, binStats = curBins2, 216 | sortBins = c("A","B","C","D","E","F"), unsortedBin = "NS", 217 | negativeControl="NT") 218 | 219 | guideLevelStatsCast = cast(guideLevelStats, gID + pos+NT ~ screen, value="Z") 220 | #names(guideLevelStatsCast)[4:ncol(guideLevelStatsCast)]=sprintf("s%i", 1:(ncol(guideLevelStatsCast)-3)) 221 | 222 | maudeZs = guideLevelStatsCast; 223 | 224 | guideLevelStatsCast$significance = apply(maudeZs[unique(curA20CountData$screen)],1, combineZStouffer) 225 | guideLevelStatsCast$metric=apply(maudeZs[unique(curA20CountData$screen)],1, mean) 226 | guideLevelStatsCast$method = "MAUDE" 227 | 228 | ### EdgeR 229 | library(edgeR) 230 | x= cast(unique(curA20CountData[curA20CountData$bin %in% c("A","F"), c("bin","gID","screen","count")]), gID ~ screen + bin, value="count") 231 | row.names(x) = x$gID; x$gID=NULL; 232 | x=as.matrix(x) 233 | group = grepl("_F",colnames(x))+1 234 | group <- factor(group) 235 | y <- DGEList(counts=x,group=group) 236 | y <- calcNormFactors(y) 237 | design <- model.matrix(~group) 238 | y <- estimateDisp(y,design) 239 | #To perform likelihood ratio tests: 240 | fit <- glmFit(y,design) 241 | lrt <- glmLRT(fit,coef=2) 242 | edgeRGuideLevel = topTags(lrt, n=nrow(x))@.Data[[1]] 243 | 244 | edgeRGuideLevel$gID = row.names(edgeRGuideLevel); 245 | edgeRGuideLevel$metric = edgeRGuideLevel$logFC; 246 | edgeRGuideLevel$significance = -log(edgeRGuideLevel$PValue); 247 | edgeRGuideLevel$method="edgeR"; 248 | 249 | ### DEseq 250 | library(DESeq2) 251 | deseqGroups = data.frame(bin=group); 
252 | row.names(deseqGroups) = colnames(x); 253 | dds <- DESeqDataSetFromMatrix(countData = x,colData = deseqGroups, design= ~ bin) 254 | dds <- DESeq(dds) 255 | #resultsNames(dds) # lists the coefficients 256 | res <- results(dds, name=resultsNames(dds)[2]) 257 | stopifnot(resultsNames(dds)[1]=="Intercept") 258 | deseqGuideLevel = as.data.frame(res@listData) 259 | deseqGuideLevel$gID =res@rownames; 260 | deseqGuideLevel$metric = deseqGuideLevel$log2FoldChange; 261 | deseqGuideLevel$significance = -log(deseqGuideLevel$pvalue); 262 | deseqGuideLevel$method="DESeq2"; 263 | 264 | curLRHiLow = cast(unique(curCPMMat[c("gID","NT","screen","l2fc_hilo")]), gID + NT ~ screen, value="l2fc_hilo") 265 | curLRVsBG = cast(unique(curCPMMat[c("gID","NT","screen","l2fc_vsbg")]), gID + NT ~ screen, value="l2fc_vsbg") 266 | numSamples = ncol(curLRHiLow)-2; 267 | sampleNames = unique(curCPMMat$screen) 268 | 269 | curLRVsBG$significance = apply(curLRVsBG[3:(numSamples+2)], MARGIN = 1, FUN = mean); 270 | curLRVsBG$metric = apply(curLRVsBG[3:(numSamples+2)], MARGIN = 1, FUN = mean); 271 | curLRVsBG$method="logVsUnsorted" 272 | curLRHiLow$significance = apply(curLRHiLow[3:(numSamples+2)], MARGIN = 1, FUN = mean); 273 | curLRHiLow$metric = apply(curLRHiLow[3:(numSamples+2)], MARGIN = 1, FUN = mean); 274 | curLRHiLow$method="logHivsLow" 275 | 276 | #compile results for A20 277 | curResults = rbind(unique(curLRHiLow[c("method","gID","significance","metric")]), 278 | unique(curLRVsBG[c("method","gID","significance","metric")]), 279 | deseqGuideLevel[c("method","gID","significance","metric")], 280 | edgeRGuideLevel[c("method","gID","significance","metric")], 281 | unique(guideLevelStatsCast[c("method","gID","significance","metric")])) 282 | 283 | curResults = merge(curResults, unique(curCPMMat[c("gID","NT","pos")]), by="gID") 284 | curResults = curResults[!is.na(curResults$pos) | curResults$NT,] 285 | curResults$promoter = grepl("TNFAIP3", curResults$gID) | 286 | (curResults$pos <= 138189439 & 
curResults$pos >= 138187040) # chr6:138188077-138188379;138187040 287 | 288 | #append the current results to all 289 | curResults$locus ="TNFAIP3" 290 | curResults$type =curtype 291 | curResults$celltype =curCelltype 292 | allResultsBoth = rbind(allResultsBoth, curResults); 293 | 294 | # (1) similarity between the effect sizes estimated per replicate, 295 | corLRHiLow = cor(curLRHiLow[!curLRHiLow$NT, 3:(3+numSamples-1)]) 296 | corLRVsBG = cor(curLRVsBG[!curLRVsBG$NT, 3:(3+numSamples-1)]) 297 | maudeZCors = cor(maudeZs[!maudeZs$NT, 4:ncol(maudeZs)]) 298 | 299 | maudeCorP=1 300 | maudeCorR=-1 301 | corLRHiLowP=1 302 | corLRHiLowR=-1 303 | corLRVsBGP=1; 304 | corLRVsBGR=-1 305 | #select the best inter-replicate correlation for each of the three approaches for which this is possible 306 | for(i in 1:(length(sampleNames)-1)){ 307 | for(j in (i+1):length(sampleNames)){ 308 | curR = cor(maudeZs[!maudeZs$NT, sampleNames[i]], maudeZs[!maudeZs$NT, sampleNames[j]]) 309 | curP = cor.test(maudeZs[!maudeZs$NT, sampleNames[i]], maudeZs[!maudeZs$NT, sampleNames[j]])$p.value 310 | if (maudeCorR < curR){ 311 | maudeCorR = curR; 312 | maudeCorP = curP; 313 | } 314 | curR = cor(curLRVsBG[!curLRVsBG$NT, sampleNames[i]], curLRVsBG[!curLRVsBG$NT, sampleNames[j]]) 315 | curP = cor.test(curLRVsBG[!curLRVsBG$NT, sampleNames[i]], 316 | curLRVsBG[!curLRVsBG$NT, sampleNames[j]])$p.value 317 | if (corLRVsBGR < curR){ 318 | corLRVsBGR = curR; 319 | corLRVsBGP = curP; 320 | } 321 | curR = cor(curLRHiLow[!curLRHiLow$NT, sampleNames[i]], curLRHiLow[!curLRHiLow$NT, sampleNames[j]]) 322 | curP = cor.test(curLRHiLow[!curLRHiLow$NT, sampleNames[i]], 323 | curLRHiLow[!curLRHiLow$NT, sampleNames[j]])$p.value 324 | if (corLRHiLowR < curR){ 325 | corLRHiLowR = curR; 326 | corLRHiLowP = curP; 327 | } 328 | } 329 | } 330 | metricsBoth = rbind(metricsBoth, 331 | data.frame(method=c("MAUDE","logHivsLow","logVsUnsorted"), 332 | metric="r",which="replicate_correl", 333 | value=c(maudeCorR, corLRHiLowR, 
corLRVsBGR), 334 | sig=c(maudeCorP, corLRHiLowP, corLRVsBGP), 335 | locus ="TNFAIP3", type =curtype, celltype =curCelltype, 336 | stringsAsFactors = FALSE)) 337 | } 338 | 339 | ## ----some useful functions---------------------------------------------------- 340 | #Some useful functions 341 | ranksumROC = function(x,y,na.rm=TRUE,...){ 342 | if (na.rm){ 343 | x=na.rm(x); 344 | y=na.rm(y); 345 | } 346 | curTest = wilcox.test(x,y,...); 347 | curTest$AUROC = (curTest$statistic/length(x))/length(y) 348 | return(curTest) 349 | } 350 | na.rm = function(x){ x[!is.na(x)]} 351 | 352 | ## ----evaluate methods part 2 and 3-------------------------------------------- 353 | # (1) similarity between the effect sizes estimated per replicate, 354 | # (above) 355 | metricsBoth$significant= metricsBoth$sig < 0.01; 356 | 357 | # Other evaluation metrics 358 | allExpts = unique(allResultsBoth[c("celltype","locus","type")]) 359 | for (ei in 1:nrow(allExpts)){ 360 | curCelltype = allExpts$celltype[ei] 361 | curtype = allExpts$type[ei]; 362 | curLocus = allExpts$locus[ei]; 363 | 364 | curResults = allResultsBoth[allResultsBoth$celltype==curCelltype & 365 | allResultsBoth$type==curtype & allResultsBoth$locus==curLocus,] 366 | for(m in unique(curResults$method)){ 367 | curData = curResults[curResults$method==m & !curResults$NT,] 368 | 369 | # (2) similarity in effect size between adjacent guides 370 | curData = curData[order(curData$pos),] 371 | guideEffectDistances = 372 | data.frame(method = m, random=FALSE, 373 | difference = abs(curData$metric[2:nrow(curData)] - curData$metric[1:(nrow(curData)-1)]), 374 | dist =abs(curData$pos[2:nrow(curData)] - curData$pos[1:(nrow(curData)-1)]), 375 | stringsAsFactors = FALSE) 376 | guideEffectDistances = guideEffectDistances[guideEffectDistances$dist < 100,] 377 | guideEffectDistances$dist=NULL; 378 | ### changed 10 in next line to 100 to make this more robust 379 | curData = curData[sample(nrow(curData), size = nrow(curData)*100, replace = TRUE),] 380 
| guideEffectDistances = 381 | rbind(guideEffectDistances, 382 | data.frame(method = m, random=TRUE, 383 | difference = abs(curData$metric[2:nrow(curData)] - 384 | curData$metric[1:(nrow(curData)-1)]), stringsAsFactors = FALSE)); 385 | # random should have more different than adjacent 386 | curRS = ranksumROC(guideEffectDistances$difference[guideEffectDistances$method==m & 387 | guideEffectDistances$random], 388 | guideEffectDistances$difference[guideEffectDistances$method==m & 389 | !guideEffectDistances$random]) 390 | metricsBoth = rbind(metricsBoth, data.frame(method=m, metric="AUROC-0.5", 391 | which="adjacent_vs_random",value = curRS$AUROC-0.5, 392 | locus=curLocus,celltype=curCelltype, type=curtype, 393 | sig=curRS$p.value, significant = curRS$p.value < 0.01)) 394 | 395 | # (3) ability to distinguish promoter-targeting guides from other guides. 396 | if (curtype=="CRISPRi" & !(m %in% c("edgeR","DESeq2"))){ 397 | # edgeR and DESeq2 are reversed for CRISPRi 398 | # non promoter should have higher effect than promoter (more -ve) 399 | curRS = ranksumROC(curResults$significance[curResults$method==m & !curResults$NT & 400 | !curResults$promoter], 401 | curResults$significance[curResults$method==m & !curResults$NT & 402 | curResults$promoter]) 403 | }else{ 404 | # promoter should have larger effect than non-promoter 405 | curRS = ranksumROC(curResults$significance[curResults$method==m & !curResults$NT & 406 | curResults$promoter], 407 | curResults$significance[curResults$method==m & !curResults$NT & 408 | !curResults$promoter]) 409 | } 410 | metricsBoth = rbind(metricsBoth, data.frame(method=m, metric="AUROC-0.5", which="promoter_vs_T", 411 | value = curRS$AUROC-0.5, locus=curLocus, 412 | celltype=curCelltype, type=curtype, sig=curRS$p.value, 413 | significant = curRS$p.value < 0.01)) 414 | } 415 | } 416 | 417 | ##compile all metrics; label the best in each test and whether any tests were significant (P<0.01) 418 | metricsBoth2 = metricsBoth; 419 | 
metricsBoth2$method = factor(as.character(metricsBoth2$method), 420 | levels = c("logVsUnsorted","logHivsLow","DESeq2","edgeR","MAUDE")) 421 | metricsBoth2Best = cast(metricsBoth2, which + locus + type+celltype ~ ., value="value", 422 | fun.aggregate = max) 423 | names(metricsBoth2Best)[ncol(metricsBoth2Best)] = "best" 424 | metricsBoth2 = merge(metricsBoth2, metricsBoth2Best, by = c("which","locus","type","celltype")) 425 | metricsBoth2AnySig = cast(metricsBoth2[colnames(metricsBoth2)!="value"], 426 | which + locus + type+celltype ~ ., value="significant", fun.aggregate = any) 427 | names(metricsBoth2AnySig)[ncol(metricsBoth2AnySig)] = "anySig" 428 | metricsBoth2 = merge(metricsBoth2, metricsBoth2AnySig, by = c("which","locus","type","celltype")) 429 | metricsBoth2$isBest = metricsBoth2$value==metricsBoth2$best; 430 | metricsBoth2$isBestNA = metricsBoth2$isBest; 431 | metricsBoth2$isBestNA[!metricsBoth2$isBestNA]=NA; 432 | metricsBoth2$pctOfMax = metricsBoth2$value/metricsBoth2$best * 100; 433 | 434 | #fill in NAs for edgeR and DESeq2 which cannot have inter-replicate correlations 435 | temp = metricsBoth2[metricsBoth2$metric=="r",] 436 | temp = temp[1:2,]; 437 | temp$method = c("edgeR","DESeq2") 438 | temp$value=NA; temp$isBest=NA; temp$significant=FALSE; temp$pctOfMax=NA; temp$isBestNA=NA; 439 | metricsBoth2 = rbind(metricsBoth2, temp) 440 | 441 | ## ----make graph, fig.width=10, fig.height=6----------------------------------- 442 | #make the final evaluation graph 443 | p1 = ggplot(metricsBoth2[metricsBoth2$which=="adjacent_vs_random",], 444 | aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() + 445 | geom_text(data=metricsBoth2[metricsBoth2$which=="adjacent_vs_random" & metricsBoth2$anySig,], 446 | aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() + 447 | scale_fill_gradient2(high="red", low="blue", mid="black") + 448 | theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 449 | axis.title.y = 
element_blank())+scale_colour_manual(values = c("green"), na.value=NA) + 450 | scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+ggtitle("Adjacent vs\nrandom guides"); 451 | p2 = ggplot(metricsBoth2[metricsBoth2$which=="promoter_vs_T",], 452 | aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() + 453 | geom_text(data=metricsBoth2[metricsBoth2$which=="promoter_vs_T" & metricsBoth2$anySig,], 454 | aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() + 455 | scale_fill_gradient2(high="red", low="blue", mid="black") + 456 | theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 457 | axis.title.y = element_blank())+scale_colour_manual(values = c("green"), na.value=NA)+ 458 | scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+ 459 | ggtitle("Promoter vs other\ntargeting guides"); 460 | p3 = ggplot(metricsBoth2[metricsBoth2$which=="replicate_correl",], 461 | aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() + 462 | geom_text(data=metricsBoth2[metricsBoth2$which=="replicate_correl" & metricsBoth2$anySig,], 463 | aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() + 464 | scale_fill_gradient2(high="red", low="blue", mid="black") + 465 | theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 466 | axis.title.y = element_blank())+scale_colour_manual(values = c("green"), na.value=NA)+ 467 | scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+ggtitle("Replicate\ncorrelations"); 468 | g= plot_grid(p1,p2,p3, align = 'h', nrow = 1); print(g) 469 | 470 | ## ----session info------------------------------------------------------------- 471 | sessionInfo() 472 | 473 | -------------------------------------------------------------------------------- /Evaluation/method_evaluation.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "CRISPR sorting screen analysis method comparison" 3 | author: 
"Carl de Boer" 4 | date: "5/10/2020" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{CRISPR sorting screen analysis method comparison} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | \usepackage[utf8]{inputenc} 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = TRUE) 14 | ``` 15 | 16 | ## CRISPR sort-screen evaluation test 17 | 18 | This document includes both the code and results for the evaluation metrics used in the MAUDE paper. This should make it fairly easy to extend the evaluation to new approaches and datasets. 19 | 20 | ```{r load libraries, results="hide", message = FALSE, warning=FALSE} 21 | library("ggplot2") 22 | library(reshape) 23 | library(cowplot) 24 | library(MAUDE) 25 | library(openxlsx) 26 | library(edgeR) 27 | library(DESeq2) 28 | ``` 29 | 30 | I forgot to set a seed when doing this for the publication, and the evaluation has a stochastic element, so the final results are likely to differ a little from the publication, especially for the datasets with a lower signal-to-noise ratio. If you run the stochastic steps multiple times, set a different seed, or don't set the seed at all, your results may also differ from this tutorial. 31 | ```{r set seed} 32 | set.seed(35263377) 33 | ``` 34 | 35 | ### CD69 screen 36 | Load the CD69 data and process it. 
37 | ```{r Load and parse CD69 data} 38 | #a mapping to unify bin names from Simeonov data 39 | binmapBack = list("baseline" = "baseline", "low"="low", "medium"="medium","high"="high","back_" = "NS", 40 | "baseline_" = "baseline", "low_"="low", "medium_"="medium", "high_"="high", 41 | "A"="baseline", "B"="low", "E" = "medium", "F"="high") 42 | 43 | #this comes from manually reconstructing the CD69 density curve from extended data figure 1a (Simeonov et al) 44 | binBoundsCD69 = data.frame(Bin = c("A","F","B","E"), 45 | fraction = c(0.65747100, 0.02792824, 0.25146688, 0.06313389), 46 | stringsAsFactors = FALSE) 47 | fractionalBinBounds = makeBinModel(binBoundsCD69[c("Bin","fraction")]) 48 | fractionalBinBounds = rbind(fractionalBinBounds, fractionalBinBounds) 49 | fractionalBinBounds$screen = c(rep("1",6),rep("2",6)); 50 | #only keep bins A,B,E,F 51 | fractionalBinBounds = fractionalBinBounds[fractionalBinBounds$Bin %in% c("A","B","E","F"),] 52 | fractionalBinBounds$Bin = unlist(binmapBack[fractionalBinBounds$Bin]); 53 | 54 | #load data 55 | cd69OriginalResults = read.xlsx('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5675716/bin/NIHMS913084-supplement-supplementary_table_1.xlsx') 56 | cd69OriginalResults$NT = grepl("negative_control", cd69OriginalResults$gRNA_systematic_name) 57 | cd69OriginalResults$pos = cd69OriginalResults$PAM_3primeEnd_coord; 58 | cd69OriginalResults = unique(cd69OriginalResults) 59 | cd69CountData = melt(cd69OriginalResults, id.vars = c("pos","NT","gRNA_systematic_name")) 60 | cd69CountData = cd69CountData[grepl(".count$",cd69CountData$variable),] 61 | cd69CountData$theirBin = gsub("CD69(.*)([12]).count","\\1",cd69CountData$variable) 62 | cd69CountData$screen = gsub("CD69(.*)([12]).count","\\2",cd69CountData$variable) 63 | cd69CountData$reads= as.numeric(cd69CountData$value); cd69CountData$value=NULL; 64 | # convert their bin to one that is consistent 65 | cd69CountData$Bin = unlist(binmapBack[cd69CountData$theirBin]); 66 | binReadMat = 
data.frame(cast(cd69CountData[!is.na(cd69CountData$pos) | cd69CountData$NT,], pos+gRNA_systematic_name+NT+screen ~ Bin, value="reads")) 67 | ``` 68 | 69 | 70 | I wanted to confirm that this is how they calculated logFC in Simeonov et al. The plot below should form a straight line on y=x. 71 | ```{r calc logFC} 72 | #confirm how to calc log2FC: 73 | cd69OriginalResults$l2fc.vsbg1 = log2((1+ cd69OriginalResults$CD69high_1.count.norm)/(1+cd69OriginalResults$CD69back_1.count.norm)) 74 | cd69OriginalResults$l2fc.vsbg2 = log2((1+ cd69OriginalResults$CD69high_2.count.norm)/(1+cd69OriginalResults$CD69back_2.count.norm)) 75 | cd69OriginalResults$l2fc.hilo1 = log2((1+ cd69OriginalResults$CD69high_1.count.norm)/(1+cd69OriginalResults$CD69baseline_1.count.norm)) 76 | cd69OriginalResults$l2fc.hilo2 = log2((1+ cd69OriginalResults$CD69high_2.count.norm)/(1+cd69OriginalResults$CD69baseline_2.count.norm)) 77 | #confirm that how we just calculated log2 fc is the same as originally calculated 78 | p=ggplot(cd69OriginalResults, aes(x=high2.l2fc,y= l2fc.vsbg2)) + geom_point(); print(p) 79 | ``` 80 | 81 | Run the various analysis methods on the CD69 data. 
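Two of the methods run on these data are plain log2 fold changes with a pseudocount of 1, the same form as in the check above. The following is a minimal sketch of that calculation on made-up normalized counts; the helper name `l2fcSketch` and the example vectors are illustrative assumptions and are not part of MAUDE or of the Simeonov et al. data.

```r
# Hypothetical helper (not part of MAUDE): log2 fold change with a pseudocount,
# mirroring the log2((1 + numerator)/(1 + denominator)) form used in this document.
l2fcSketch = function(numerator, denominator, pseudocount = 1){
  log2((pseudocount + numerator)/(pseudocount + denominator))
}

# Made-up normalized counts for three guides (illustration only)
highBin  = c(7, 0, 31)
unsorted = c(3, 0, 7)

# (1+7)/(1+3) = 2, (1+0)/(1+0) = 1, (1+31)/(1+7) = 4, so this gives 1, 0 and 2
l2fcSketch(highBin, unsorted)
```

The pseudocount keeps guides with zero reads finite, at the cost of pulling the fold changes of low-count guides toward 0.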
82 | ```{r run methods, results="hide", message = FALSE, warning=FALSE} 83 | #edgeR 84 | x <- unique(cd69OriginalResults[c("gRNA_systematic_name","CD69baseline_1.count","CD69baseline_2.count", 85 | "CD69high_1.count", "CD69high2.count")]) 86 | row.names(x) = x$gRNA_systematic_name; x$gRNA_systematic_name=NULL; 87 | x=as.matrix(x) 88 | group <- factor(c(1,1,2,2)) 89 | y <- DGEList(counts=x,group=group) 90 | y <- calcNormFactors(y) 91 | design <- model.matrix(~group) 92 | y <- estimateDisp(y,design) 93 | #To perform likelihood ratio tests: 94 | fit <- glmFit(y,design) 95 | lrt <- glmLRT(fit,coef=2) 96 | edgeRGuideLevel = topTags(lrt, n=nrow(x))@.Data[[1]] 97 | edgeRGuideLevel$gRNA_systematic_name = row.names(edgeRGuideLevel); 98 | edgeRGuideLevel$metric = edgeRGuideLevel$logFC; 99 | edgeRGuideLevel$significance = -log(edgeRGuideLevel$PValue); 100 | edgeRGuideLevel$method="edgeR"; 101 | 102 | #DESeq2 103 | deseqGroups = data.frame(bin=factor(c(1,1,2,2))); 104 | row.names(deseqGroups) = c("CD69baseline_1.count","CD69baseline_2.count", 105 | "CD69high_1.count", "CD69high2.count"); 106 | dds <- DESeqDataSetFromMatrix(countData = x,colData = deseqGroups, design= ~ bin) 107 | dds <- DESeq(dds) 108 | res <- results(dds, name=resultsNames(dds)[2]) 109 | 110 | deseqGuideLevel = as.data.frame(res@listData) 111 | deseqGuideLevel$gRNA_systematic_name =res@rownames; 112 | deseqGuideLevel$metric = deseqGuideLevel$log2FoldChange; 113 | deseqGuideLevel$significance = -log(deseqGuideLevel$pvalue); 114 | deseqGuideLevel$method="DESeq2"; 115 | 116 | #MAUDE 117 | guideLevelStatsCD69 = findGuideHitsAllScreens(unique(binReadMat["screen"]), binReadMat, fractionalBinBounds, sortBins = c("baseline","high","low","medium"), unsortedBin = "NS") 118 | guideLevelStatsCD69$chr="chr12" 119 | guideLevelStatsCastCD69 = cast(guideLevelStatsCD69, gRNA_systematic_name + pos+NT ~ screen, value="Z") 120 | names(guideLevelStatsCastCD69)[ncol(guideLevelStatsCastCD69)-1:0]=c("s1","s2") 121 | 
guideLevelStatsCastCD69$significance = apply(guideLevelStatsCastCD69[c("s1","s2")],1, combineZStouffer) 122 | guideLevelStatsCastCD69$metric=apply(guideLevelStatsCastCD69[c("s1","s2")],1, mean) 123 | guideLevelStatsCastCD69$method = "MAUDE" 124 | 125 | #Two log fold change methods 126 | cd69OriginalResultsHiLow = cd69OriginalResults[c("gRNA_systematic_name","l2fc.hilo1","l2fc.hilo2")] 127 | cd69OriginalResultsVsBG = cd69OriginalResults[c("gRNA_systematic_name","l2fc.vsbg1","l2fc.vsbg2")] 128 | cd69OriginalResultsHiLow$significance = apply(cd69OriginalResultsHiLow[2:3], 1, FUN = mean) 129 | cd69OriginalResultsHiLow$metric = apply(cd69OriginalResultsHiLow[2:3], 1, FUN = mean) 130 | cd69OriginalResultsVsBG$significance = apply(cd69OriginalResultsVsBG[2:3], 1, FUN = mean) 131 | cd69OriginalResultsVsBG$metric = apply(cd69OriginalResultsVsBG[2:3], 1, FUN = mean) 132 | cd69OriginalResultsHiLow$method="logHivsLow" 133 | cd69OriginalResultsVsBG$method="logVsUnsorted" 134 | ``` 135 | 136 | Compile the results from the various methods. 
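For MAUDE, the per-replicate Z-scores were combined above with `combineZStouffer`, while the effect size (`metric`) is their mean. Stouffer's method sums the Z-scores and divides by the square root of their number. Below is a minimal sketch of that formula, assuming equal weights and that missing replicates are skipped; the function name `stoufferSketch` is hypothetical and the package's own implementation may differ in its details.

```r
# Hypothetical re-statement of Stouffer's method (equal weights);
# NA replicates are dropped before combining (an assumption of this sketch).
stoufferSketch = function(z){
  z = z[!is.na(z)]
  sum(z)/sqrt(length(z))
}

stoufferSketch(c(2, 2))   # concordant replicates reinforce each other: 4/sqrt(2)
stoufferSketch(c(2, -2))  # discordant replicates cancel out
mean(c(2, 2))             # the mean, by contrast, stays at 2
```

Unlike the mean, the combined Z grows with the number of concordant replicates, which is why it serves as the significance score here.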
137 | ```{r compile results} 138 | # predictions 139 | allResults = rbind(cd69OriginalResultsVsBG[c("method","gRNA_systematic_name","significance","metric")], 140 | cd69OriginalResultsHiLow[c("method","gRNA_systematic_name","significance","metric")], 141 | deseqGuideLevel[c("method","gRNA_systematic_name","significance","metric")], 142 | edgeRGuideLevel[c("method","gRNA_systematic_name","significance","metric")], 143 | guideLevelStatsCastCD69[c("method","gRNA_systematic_name","significance","metric")]) 144 | 145 | 146 | allResults = merge(allResults, cd69OriginalResults[c("gRNA_systematic_name","NT","pos")], 147 | by="gRNA_systematic_name") 148 | allResults = allResults[!is.na(allResults$pos) | allResults$NT,] 149 | allResults$promoter = allResults$pos <= 9913996 & allResults$pos >= 9912997 150 | allResults$gID = allResults$gRNA_systematic_name; allResults$gRNA_systematic_name=NULL; 151 | allResults$locus ="CD69" 152 | allResults$type ="CRISPRa" 153 | allResults$celltype ="Jurkat" 154 | ``` 155 | 156 | ### TNFAIP3 157 | Load and parse the TNFAIP3 data from Ray et al. 
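Part of the parsing for this dataset takes each guide's position to be the midpoint of the start and end coordinates embedded in its ID. The sketch below applies the same regular expression to made-up IDs, assuming the `chromosome:start-end:strand` layout implied by the pattern; the IDs themselves are invented for illustration.

```r
# Made-up guide IDs in the assumed "chromosome:start-end:strand" layout
guideIDs = c("6:138188000-138188020:+", "6:138190010-138190030:-")

# Same pattern as used in this document; groups 2 and 3 are start and end
idPattern = "([0-9]+):([0-9]+)-([0-9]+):([+-])"
starts = as.numeric(gsub(idPattern, "\\2", guideIDs))
ends   = as.numeric(gsub(idPattern, "\\3", guideIDs))

# Midpoint of each guide's coordinates
midpoints = round((starts + ends)/2)
midpoints
```

`gsub` returns an ID unchanged when the pattern does not match, so `as.numeric` yields `NA` (with a warning) for such guides, and they end up with a missing `pos`.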
158 | ```{r input and parse TNFAIP3 data} 159 | 160 | ##read in TNFAIP3 data 161 | binFractionsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPR_FF_bin_fractions.txt.gz")))), 162 | sep="\t", stringsAsFactors = FALSE, header = TRUE) 163 | CRISPRaCountsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPRa_FF_countMatrix.txt.gz")))), 164 | sep="\t", stringsAsFactors = FALSE, header = TRUE) 165 | CRISPRiCountsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPRi_FF_countMatrix.txt.gz")))), 166 | sep="\t", stringsAsFactors = FALSE, header = TRUE) 167 | crispraGuides = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20180205_selected_CRISPRa_guides_seq.txt.gz")))), 168 | sep="\t", stringsAsFactors = FALSE, header = TRUE) 169 | crispraGuides$seq = gsub("(.*)(.{3})","\\1",crispraGuides$seq.w.PAM) 170 | crispraGuides$pos = round((as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\2", 171 | crispraGuides$guideID))+ 172 | as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\3", 173 | crispraGuides$guideID)))/2) 174 | 175 | CRISPRaCountsA20 = merge(CRISPRaCountsA20, unique(crispraGuides[c("seq","pos")]), by=c("seq"), all.x=TRUE) 176 | 177 | CRISPRiCountsA20$pos = round((as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\2", 178 | CRISPRiCountsA20$gID))+ 179 | as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\3", 180 | CRISPRiCountsA20$gID)))/2) 181 | 182 | binFractionsA20$expt = paste(binFractionsA20$celltype, binFractionsA20$CRISPRType,sep="_") 183 | 184 | #merge CRISPRi and CRISPRa 185 | A20CountData = melt(CRISPRaCountsA20, id.vars=c("seq","elementClass","element","NT","gID","pos")) 186 | 
A20CountData$CRISPRType="CRISPRa"; 187 | A20CountData2 = melt(CRISPRiCountsA20, id.vars=c("seq","elementClass","element","NT","gID","pos")) 188 | A20CountData2$CRISPRType="CRISPRi"; 189 | A20CountData = rbind(A20CountData, A20CountData2) 190 | rm('A20CountData2'); 191 | A20CountData$count = A20CountData$value; A20CountData$value=NULL; 192 | A20CountData$sample = A20CountData$variable; A20CountData$variable=NULL; 193 | A20CountData$bin = gsub("(.*)_(.*)_(.*)", "\\3", A20CountData$sample); 194 | A20CountData$screen = gsub("(.*)_(.*)_(.*)", "\\2", A20CountData$sample); 195 | A20CountData$celltype = gsub("(.*)_(.*)_(.*)", "\\1", A20CountData$sample); 196 | A20CountData$expt = paste(A20CountData$celltype, A20CountData$CRISPRType,sep="_") 197 | ``` 198 | 199 | Run each method on each experiment from Ray et al. 200 | ```{r start evaluation and run on TNFAIP3, results="hide", message = FALSE, warning=FALSE} 201 | #combine CD69 metrics 202 | # Pearson's r between replicates ## metricsBoth will contain all our evaluation metrics 203 | metricsBoth = 204 | data.frame(method=c("MAUDE","logHivsLow","logVsUnsorted"), metric="r",which="replicate_correl", 205 | value=c(cor(guideLevelStatsCastCD69$s1[!guideLevelStatsCastCD69$NT], 206 | guideLevelStatsCastCD69$s2[!guideLevelStatsCastCD69$NT]), 207 | cor(cd69OriginalResults$l2fc.hilo1[!cd69OriginalResults$NT], 208 | cd69OriginalResults$l2fc.hilo2[!cd69OriginalResults$NT]), 209 | cor(cd69OriginalResults$l2fc.vsbg1[!cd69OriginalResults$NT], 210 | cd69OriginalResults$l2fc.vsbg2[!cd69OriginalResults$NT])), 211 | sig=c(cor.test(guideLevelStatsCastCD69$s1[!guideLevelStatsCastCD69$NT], 212 | guideLevelStatsCastCD69$s2[!guideLevelStatsCastCD69$NT])$p.value, 213 | cor.test(cd69OriginalResults$l2fc.hilo1[!cd69OriginalResults$NT], 214 | cd69OriginalResults$l2fc.hilo2[!cd69OriginalResults$NT])$p.value, 215 | cor.test(cd69OriginalResults$l2fc.vsbg1[!cd69OriginalResults$NT], 216 | cd69OriginalResults$l2fc.vsbg2[!cd69OriginalResults$NT])$p.value), 
217 | locus="CD69",type="CRISPRa",celltype="Jurkat",stringsAsFactors = FALSE) 218 | 219 | allResultsBoth = allResults; #combined results (significance, effect sizes) for CD69 and A20 screens 220 | for (e in unique(binFractionsA20$expt)){ 221 | curCelltype = gsub("(.*)_(.*)", "\\1", e); 222 | curtype = gsub("(.*)_(.*)", "\\2", e); 223 | curA20CountData = unique(A20CountData[A20CountData$expt==e, 224 | c("seq","pos", "NT","gID","count","screen","bin")]) 225 | curA20CountDataTotals = cast(curA20CountData, screen +bin~ ., fun.aggregate = sum, value="count") 226 | names(curA20CountDataTotals)[3] = "total"; 227 | curA20CountData = merge(curA20CountData, curA20CountDataTotals, by=c("screen","bin")) 228 | curA20CountData$CPM = curA20CountData$count/curA20CountData$total * 1E6; 229 | curCPMMat = cast(curA20CountData, seq + NT + gID + screen + pos ~ bin, value="CPM") 230 | curCPMMat$l2fc_hilo = log2((1+curCPMMat$F)/(1+curCPMMat$A)) 231 | if(curtype=="CRISPRi"){ #CRISPRi: unsorted vs lowest bin 232 | curCPMMat$l2fc_vsbg = log2((1+curCPMMat$NS)/(1+curCPMMat$A)) 233 | }else{ #CRISPRa: highest bin vs unsorted 234 | curCPMMat$l2fc_vsbg = log2((1+curCPMMat$F)/(1+curCPMMat$NS)) 235 | } 236 | curBins = as.data.frame(melt(binFractionsA20[binFractionsA20$expt==e,], 237 | id.vars = c("celltype","screen","CRISPRType","expt"))) 238 | names(curBins)[ncol(curBins) - (1:0)] = c("Bin","fraction") 239 | curBins2 = data.frame(); 240 | for (s in unique(curBins$screen)){ 241 | curBins3 = makeBinModel(curBins[curBins$screen==s,c("Bin","fraction")]) 242 | curBins3$screen = s; 243 | curBins2 = rbind(curBins2, curBins3) 244 | } 245 | curBins2$Bin = as.character(curBins2$Bin); 246 | curCountMat = cast(curA20CountData, seq + NT + gID +pos + screen ~ bin, value="count") 247 | guideLevelStats = findGuideHitsAllScreens(experiments = unique(curCountMat["screen"]), 248 | countDataFrame = curCountMat, binStats = curBins2, 249 | sortBins = c("A","B","C","D","E","F"), unsortedBin = "NS", 250 | negativeControl="NT") 251 | 252 | guideLevelStatsCast = 
cast(guideLevelStats, gID + pos+NT ~ screen, value="Z") 253 | #names(guideLevelStatsCast)[4:ncol(guideLevelStatsCast)]=sprintf("s%i", 1:(ncol(guideLevelStatsCast)-3)) 254 | 255 | maudeZs = guideLevelStatsCast; 256 | 257 | guideLevelStatsCast$significance = apply(maudeZs[unique(curA20CountData$screen)],1, combineZStouffer) 258 | guideLevelStatsCast$metric=apply(maudeZs[unique(curA20CountData$screen)],1, mean) 259 | guideLevelStatsCast$method = "MAUDE" 260 | 261 | ### EdgeR 262 | library(edgeR) 263 | x= cast(unique(curA20CountData[curA20CountData$bin %in% c("A","F"), c("bin","gID","screen","count")]), gID ~ screen + bin, value="count") 264 | row.names(x) = x$gID; x$gID=NULL; 265 | x=as.matrix(x) 266 | group = grepl("_F",colnames(x))+1 267 | group <- factor(group) 268 | y <- DGEList(counts=x,group=group) 269 | y <- calcNormFactors(y) 270 | design <- model.matrix(~group) 271 | y <- estimateDisp(y,design) 272 | #To perform likelihood ratio tests: 273 | fit <- glmFit(y,design) 274 | lrt <- glmLRT(fit,coef=2) 275 | edgeRGuideLevel = topTags(lrt, n=nrow(x))@.Data[[1]] 276 | 277 | edgeRGuideLevel$gID = row.names(edgeRGuideLevel); 278 | edgeRGuideLevel$metric = edgeRGuideLevel$logFC; 279 | edgeRGuideLevel$significance = -log(edgeRGuideLevel$PValue); 280 | edgeRGuideLevel$method="edgeR"; 281 | 282 | ### DEseq 283 | library(DESeq2) 284 | deseqGroups = data.frame(bin=group); 285 | row.names(deseqGroups) = colnames(x); 286 | dds <- DESeqDataSetFromMatrix(countData = x,colData = deseqGroups, design= ~ bin) 287 | dds <- DESeq(dds) 288 | #resultsNames(dds) # lists the coefficients 289 | res <- results(dds, name=resultsNames(dds)[2]) 290 | stopifnot(resultsNames(dds)[1]=="Intercept") 291 | deseqGuideLevel = as.data.frame(res@listData) 292 | deseqGuideLevel$gID =res@rownames; 293 | deseqGuideLevel$metric = deseqGuideLevel$log2FoldChange; 294 | deseqGuideLevel$significance = -log(deseqGuideLevel$pvalue); 295 | deseqGuideLevel$method="DESeq2"; 296 | 297 | curLRHiLow = 
cast(unique(curCPMMat[c("gID","NT","screen","l2fc_hilo")]), gID + NT ~ screen, value="l2fc_hilo") 298 | curLRVsBG = cast(unique(curCPMMat[c("gID","NT","screen","l2fc_vsbg")]), gID + NT ~ screen, value="l2fc_vsbg") 299 | numSamples = ncol(curLRHiLow)-2; 300 | sampleNames = unique(curCPMMat$screen) 301 | 302 | curLRVsBG$significance = apply(curLRVsBG[3:(numSamples+2)], MARGIN = 1, FUN = mean); 303 | curLRVsBG$metric = apply(curLRVsBG[3:(numSamples+2)], MARGIN = 1, FUN = mean); 304 | curLRVsBG$method="logVsUnsorted" 305 | curLRHiLow$significance = apply(curLRHiLow[3:(numSamples+2)], MARGIN = 1, FUN = mean); 306 | curLRHiLow$metric = apply(curLRHiLow[3:(numSamples+2)], MARGIN = 1, FUN = mean); 307 | curLRHiLow$method="logHivsLow" 308 | 309 | #compile results for A20 310 | curResults = rbind(unique(curLRHiLow[c("method","gID","significance","metric")]), 311 | unique(curLRVsBG[c("method","gID","significance","metric")]), 312 | deseqGuideLevel[c("method","gID","significance","metric")], 313 | edgeRGuideLevel[c("method","gID","significance","metric")], 314 | unique(guideLevelStatsCast[c("method","gID","significance","metric")])) 315 | 316 | curResults = merge(curResults, unique(curCPMMat[c("gID","NT","pos")]), by="gID") 317 | curResults = curResults[!is.na(curResults$pos) | curResults$NT,] 318 | curResults$promoter = grepl("TNFAIP3", curResults$gID) | 319 | (curResults$pos <= 138189439 & curResults$pos >= 138187040) # chr6:138188077-138188379;138187040 320 | 321 | #append the current results to all 322 | curResults$locus ="TNFAIP3" 323 | curResults$type =curtype 324 | curResults$celltype =curCelltype 325 | allResultsBoth = rbind(allResultsBoth, curResults); 326 | 327 | # (1) similarity between the effect sizes estimated per replicate, 328 | corLRHiLow = cor(curLRHiLow[!curLRHiLow$NT, 3:(3+numSamples-1)]) 329 | corLRVsBG = cor(curLRVsBG[!curLRVsBG$NT, 3:(3+numSamples-1)]) 330 | maudeZCors = cor(maudeZs[!maudeZs$NT, 4:ncol(maudeZs)]) 331 | 332 | maudeCorP=1 333 | 
maudeCorR=-1 334 | corLRHiLowP=1 335 | corLRHiLowR=-1 336 | corLRVsBGP=1; 337 | corLRVsBGR=-1 338 | #select the best inter-replicate correlation for each of the three approaches for which this is possible 339 | for(i in 1:(length(sampleNames)-1)){ 340 | for(j in (i+1):length(sampleNames)){ 341 | curR = cor(maudeZs[!maudeZs$NT, sampleNames[i]], maudeZs[!maudeZs$NT, sampleNames[j]]) 342 | curP = cor.test(maudeZs[!maudeZs$NT, sampleNames[i]], maudeZs[!maudeZs$NT, sampleNames[j]])$p.value 343 | if (maudeCorR < curR){ 344 | maudeCorR = curR; 345 | maudeCorP = curP; 346 | } 347 | curR = cor(curLRVsBG[!curLRVsBG$NT, sampleNames[i]], curLRVsBG[!curLRVsBG$NT, sampleNames[j]]) 348 | curP = cor.test(curLRVsBG[!curLRVsBG$NT, sampleNames[i]], 349 | curLRVsBG[!curLRVsBG$NT, sampleNames[j]])$p.value 350 | if (corLRVsBGR < curR){ 351 | corLRVsBGR = curR; 352 | corLRVsBGP = curP; 353 | } 354 | curR = cor(curLRHiLow[!curLRHiLow$NT, sampleNames[i]], curLRHiLow[!curLRHiLow$NT, sampleNames[j]]) 355 | curP = cor.test(curLRHiLow[!curLRHiLow$NT, sampleNames[i]], 356 | curLRHiLow[!curLRHiLow$NT, sampleNames[j]])$p.value 357 | if (corLRHiLowR < curR){ 358 | corLRHiLowR = curR; 359 | corLRHiLowP = curP; 360 | } 361 | } 362 | } 363 | metricsBoth = rbind(metricsBoth, 364 | data.frame(method=c("MAUDE","logHivsLow","logVsUnsorted"), 365 | metric="r",which="replicate_correl", 366 | value=c(maudeCorR, corLRHiLowR, corLRVsBGR), 367 | sig=c(maudeCorP, corLRHiLowP, corLRVsBGP), 368 | locus ="TNFAIP3", type =curtype, celltype =curCelltype, 369 | stringsAsFactors = FALSE)) 370 | } 371 | ``` 372 | 373 | Here are some functions that I will make use of below. 
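The `ranksumROC` helper defined in the next chunk converts the Wilcoxon rank-sum statistic W into an AUROC by dividing it by the number of (x, y) pairs. The sketch below uses made-up values to show that this equals the fraction of pairs in which the x value exceeds the y value, counting ties as one half; the variable names are illustrative only.

```r
# Made-up scores for two groups (no ties, so the exact test is used)
x = c(3.1, 4.5, 5.2, 6.8)
y = c(1.0, 2.2, 4.9)

# AUROC from the rank-sum statistic, as in ranksumROC
W = wilcox.test(x, y)$statistic
aurocFromW = as.numeric(W)/(length(x)*length(y))

# AUROC by direct pair counting: x > y counts 1, a tie counts 0.5
pairwise = outer(x, y, ">") + 0.5*outer(x, y, "==")
aurocFromPairs = mean(pairwise)

# Both are 10/12: in 10 of the 12 pairs the x value is larger
c(aurocFromW, aurocFromPairs)
```

An AUROC of 0.5 means the two groups are indistinguishable, which is why the evaluation reports `AUROC-0.5`, so that 0 corresponds to no separation.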
374 | ```{r some useful functions} 375 | #Some useful functions 376 | ranksumROC = function(x,y,na.rm=TRUE,...){ 377 | if (na.rm){ 378 | x=na.rm(x); 379 | y=na.rm(y); 380 | } 381 | curTest = wilcox.test(x,y,...); 382 | curTest$AUROC = (curTest$statistic/length(x))/length(y) 383 | return(curTest) 384 | } 385 | na.rm = function(x){ x[!is.na(x)]} 386 | ``` 387 | 388 | ### Evaluation 389 | The inter-replicate correlations were calculated in the sections above and are stored in `metricsBoth`. Below, the two remaining metrics are calculated (distinguishing promoter vs other targeting guides, and adjacent vs randomly paired guides). 390 | ```{r evaluate methods part 2 and 3} 391 | # (1) similarity between the effect sizes estimated per replicate, 392 | # (above) 393 | metricsBoth$significant= metricsBoth$sig < 0.01; 394 | 395 | # Other evaluation metrics 396 | allExpts = unique(allResultsBoth[c("celltype","locus","type")]) 397 | for (ei in 1:nrow(allExpts)){ 398 | curCelltype = allExpts$celltype[ei] 399 | curtype = allExpts$type[ei]; 400 | curLocus = allExpts$locus[ei]; 401 | 402 | curResults = allResultsBoth[allResultsBoth$celltype==curCelltype & 403 | allResultsBoth$type==curtype & allResultsBoth$locus==curLocus,] 404 | for(m in unique(curResults$method)){ 405 | curData = curResults[curResults$method==m & !curResults$NT,] 406 | 407 | # (2) similarity in effect size between adjacent guides 408 | curData = curData[order(curData$pos),] 409 | guideEffectDistances = 410 | data.frame(method = m, random=FALSE, 411 | difference = abs(curData$metric[2:nrow(curData)] - curData$metric[1:(nrow(curData)-1)]), 412 | dist =abs(curData$pos[2:nrow(curData)] - curData$pos[1:(nrow(curData)-1)]), 413 | stringsAsFactors = FALSE) 414 | guideEffectDistances = guideEffectDistances[guideEffectDistances$dist < 100,] 415 | guideEffectDistances$dist=NULL; 416 | ### changed 10 in next line to 100 to make this more robust 417 | curData = curData[sample(nrow(curData), size = nrow(curData)*100, 
replace = TRUE),] 418 | guideEffectDistances = 419 | rbind(guideEffectDistances, 420 | data.frame(method = m, random=TRUE, 421 | difference = abs(curData$metric[2:nrow(curData)] - 422 | curData$metric[1:(nrow(curData)-1)]), stringsAsFactors = FALSE)); 423 | # random should have more different than adjacent 424 | curRS = ranksumROC(guideEffectDistances$difference[guideEffectDistances$method==m & 425 | guideEffectDistances$random], 426 | guideEffectDistances$difference[guideEffectDistances$method==m & 427 | !guideEffectDistances$random]) 428 | metricsBoth = rbind(metricsBoth, data.frame(method=m, metric="AUROC-0.5", 429 | which="adjacent_vs_random",value = curRS$AUROC-0.5, 430 | locus=curLocus,celltype=curCelltype, type=curtype, 431 | sig=curRS$p.value, significant = curRS$p.value < 0.01)) 432 | 433 | # (3) ability to distinguish promoter-targeting guides from other guides. 434 | if (curtype=="CRISPRi" & !(m %in% c("edgeR","DESeq2"))){ 435 | # edgeR and DESeq2 are reversed for CRISPRi 436 | # non promoter should have higher effect than promoter (more -ve) 437 | curRS = ranksumROC(curResults$significance[curResults$method==m & !curResults$NT & 438 | !curResults$promoter], 439 | curResults$significance[curResults$method==m & !curResults$NT & 440 | curResults$promoter]) 441 | }else{ 442 | # promoter should have larger effect than non-promoter 443 | curRS = ranksumROC(curResults$significance[curResults$method==m & !curResults$NT & 444 | curResults$promoter], 445 | curResults$significance[curResults$method==m & !curResults$NT & 446 | !curResults$promoter]) 447 | } 448 | metricsBoth = rbind(metricsBoth, data.frame(method=m, metric="AUROC-0.5", which="promoter_vs_T", 449 | value = curRS$AUROC-0.5, locus=curLocus, 450 | celltype=curCelltype, type=curtype, sig=curRS$p.value, 451 | significant = curRS$p.value < 0.01)) 452 | } 453 | } 454 | 455 | ##compile all metrics; label the best in each test and whether any tests were significant (P<0.01) 456 | metricsBoth2 = metricsBoth; 
457 | metricsBoth2$method = factor(as.character(metricsBoth2$method), 458 | levels = c("logVsUnsorted","logHivsLow","DESeq2","edgeR","MAUDE")) 459 | metricsBoth2Best = cast(metricsBoth2, which + locus + type+celltype ~ ., value="value", 460 | fun.aggregate = max) 461 | names(metricsBoth2Best)[ncol(metricsBoth2Best)] = "best" 462 | metricsBoth2 = merge(metricsBoth2, metricsBoth2Best, by = c("which","locus","type","celltype")) 463 | metricsBoth2AnySig = cast(metricsBoth2[colnames(metricsBoth2)!="value"], 464 | which + locus + type+celltype ~ ., value="significant", fun.aggregate = any) 465 | names(metricsBoth2AnySig)[ncol(metricsBoth2AnySig)] = "anySig" 466 | metricsBoth2 = merge(metricsBoth2, metricsBoth2AnySig, by = c("which","locus","type","celltype")) 467 | metricsBoth2$isBest = metricsBoth2$value==metricsBoth2$best; 468 | metricsBoth2$isBestNA = metricsBoth2$isBest; 469 | metricsBoth2$isBestNA[!metricsBoth2$isBestNA]=NA; 470 | metricsBoth2$pctOfMax = metricsBoth2$value/metricsBoth2$best * 100; 471 | 472 | #fill in NAs for edgeR and DESeq2 which cannot have inter-replicate correlations 473 | temp = metricsBoth2[metricsBoth2$metric=="r",] 474 | temp = temp[1:2,]; 475 | temp$method = c("edgeR","DESeq2") 476 | temp$value=NA; temp$isBest=NA; temp$significant=FALSE; temp$pctOfMax=NA; temp$isBestNA=NA; 477 | metricsBoth2 = rbind(metricsBoth2, temp) 478 | ``` 479 | 480 | Finally make the graph with all evaluation metrics as shown in the MAUDE paper. 
481 | ```{r make graph, fig.width=10, fig.height=6} 482 | #make the final evaluation graph 483 | p1 = ggplot(metricsBoth2[metricsBoth2$which=="adjacent_vs_random",], 484 | aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() + 485 | geom_text(data=metricsBoth2[metricsBoth2$which=="adjacent_vs_random" & metricsBoth2$anySig,], 486 | aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() + 487 | scale_fill_gradient2(high="red", low="blue", mid="black") + 488 | theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 489 | axis.title.y = element_blank())+scale_colour_manual(values = c("green"), na.value=NA) + 490 | scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+ggtitle("Adjacent vs\nrandom guides"); 491 | p2 = ggplot(metricsBoth2[metricsBoth2$which=="promoter_vs_T",], 492 | aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() + 493 | geom_text(data=metricsBoth2[metricsBoth2$which=="promoter_vs_T" & metricsBoth2$anySig,], 494 | aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() + 495 | scale_fill_gradient2(high="red", low="blue", mid="black") + 496 | theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 497 | axis.title.y = element_blank())+scale_colour_manual(values = c("green"), na.value=NA)+ 498 | scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+ 499 | ggtitle("Promoter vs other\ntargeting guides"); 500 | p3 = ggplot(metricsBoth2[metricsBoth2$which=="replicate_correl",], 501 | aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() + 502 | geom_text(data=metricsBoth2[metricsBoth2$which=="replicate_correl" & metricsBoth2$anySig,], 503 | aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() + 504 | scale_fill_gradient2(high="red", low="blue", mid="black") + 505 | theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 506 | axis.title.y = element_blank())+scale_colour_manual(values = c("green"), 
na.value=NA)+ 507 | scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+ggtitle("Replicate\ncorrelations"); 508 | g= plot_grid(p1,p2,p3, align = 'h', nrow = 1); print(g) 509 | ``` 510 | 511 | ### Session info 512 | ```{r session info} 513 | sessionInfo() 514 | ``` 515 | 516 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Carl de Boer 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(calcFDRByExperiment) 4 | export(combineZStouffer) 5 | export(findGuideHits) 6 | export(findGuideHitsAllScreens) 7 | export(findOverlappingElements) 8 | export(getElementwiseStats) 9 | export(getNBGaussianLikelihood) 10 | export(getTilingElementwiseStats) 11 | export(getZScalesWithNTGuides) 12 | export(makeBinModel) 13 | import(stats) 14 | importFrom(reshape,cast) 15 | -------------------------------------------------------------------------------- /NEWS: -------------------------------------------------------------------------------- 1 | Changes in version 1.0.0 (2020-05-29) 2 | + Submitted to Bioconductor 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MAUDE: Mean Alterations Using Discrete Expression 2 | 3 | [![DOI](https://zenodo.org/badge/135627989.svg)](https://zenodo.org/badge/latestdoi/135627989) 4 | 5 | MAUDE is an R package for finding differences in means of normally distributed (or nearly so) data, via measuring abundances in discrete bins. For example, a pooled CRISPRi screen with expression readout by FACS sorting into discrete bins and sequencing the abundances of the guides in each bin. Most of the documentation and examples are written with a CRISPRi-type sorting screen in mind, but there is no reason why it can't be used for any experiment where normally distributed expression values are read out via abundances in discrete expression bins. For example, MAUDE can also be used for [CRISPR base editor screens](https://de-boer-lab.github.io/MAUDE/doc/BACH2_base_editor_screen.html) where the readout is expression of a target gene, and reporter assays with expression readouts (e.g. 
[Rafi et al](https://www.biorxiv.org/content/10.1101/2023.04.26.538471v2)). 6 | 7 | See 'Usage' below for more information. 8 | 9 | 10 | Maude Flanders 11 | 12 | # Table of contents 13 | 14 | * [R Installation](#r-installation) 15 | * [Requirements](#requirements) 16 | * [Usage](#usage) 17 | * [Citation](#citation) 18 | 19 | 20 | # R Installation 21 | 22 | ## Option 1: Install directly from GitHub 23 | 24 | If you don't already have `devtools`, install it: 25 | ``` 26 | install.packages("devtools") 27 | ``` 28 | 29 | Use `devtools` to install from the GitHub page: 30 | 31 | ``` 32 | devtools::install_github("de-Boer-Lab/MAUDE") 33 | ``` 34 | 35 | ## Option 2: Install from download 36 | 37 | Download the latest MAUDE release (under "Releases" on the right hand side of this page). 38 | 39 | Decompress the directory contained within it (something like "MAUDE-1.0.2"). 40 | 41 | Then in R: 42 | If you don't already have `devtools`, install it: 43 | ``` 44 | install.packages("devtools") 45 | ``` 46 | 47 | Then install in R using: 48 | 49 | ``` 50 | devtools::install_local("C:\\Users\\cdeboer\\Downloads\\MAUDE-1.0.2") 51 | ``` 52 | 53 | # Requirements 54 | Right now we have three main requirements: 55 | 1. Negative control guides are included in the experiment (these are used for calibrating Z-scores and P-values, and so are not strictly needed if only the expression means are desired). 56 | 2. The abundance of the guides must have been measured somehow (usually by sequencing the guide DNA of unsorted cells, though there are ways to estimate this post-sort if the bins cover the majority of the distribution). 57 | 3. The fractions of cells sorted into each expression bin were quantified (typically the cell counts/fractions read off of the cell sorter). 58 | 59 | 60 | # Usage 61 | 62 | ## Tutorials 63 | We provide three tutorials on how to run a MAUDE analysis in R here: 64 | 1. 
[Re-analysis of CD69 screen data](https://de-boer-lab.github.io/MAUDE/doc/CD69_tutorial.html) 65 | 2. [Analysis of a simulated screen](https://de-boer-lab.github.io/MAUDE/doc/simulated_data_tutorial.html) 66 | 3. [Analysis of a CRISPR base editor non-coding mutation screen](https://de-boer-lab.github.io/MAUDE/doc/BACH2_base_editor_screen.html) 67 | 68 | For additional examples, see the [script for evaluating and comparing sorting-based CRISPR screen analysis methods.](https://de-boer-lab.github.io/MAUDE/Evaluation/method_evaluation.html) 69 | 70 | ## Quantifying guide DNA abundance 71 | After sequencing, you get fastqs, one per sorting bin and experiment. The first step for a MAUDE analysis is to quantify the number of guides residing in each bin. Here, we provide some guidance as to how to do this. 72 | 73 | We have previously used the aligner [`bowtie2`](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml). 74 | 75 | To make the bowtie2 reference `guide_seq_reference`: 76 | ```bash 77 | bowtie2-build guide_seqs.fa guide_seq_reference 78 | 79 | ``` 80 | where `guide_seqs.fa` is a fasta file including the sequences you are mapping against, which will include the guide DNA sequence and any flanking constant regions as well. The amount of constant sequence you include in the reference should be at least as much as what was sequenced. 81 | 82 | For example, with 20bp guides with constant flanking `GTTTAAGAGCTATGCTGGAAACAGCATAG`: 83 | ``` 84 | >guide1 85 | GTCGCATATCGCGATAGCGAGTTTAAGAGCTATGCTGGAAACAGCATAG 86 | >guide2 87 | GTCGTGAAAGTGCTGTTGAGGTTTAAGAGCTATGCTGGAAACAGCATAG 88 | ... 89 | ``` 90 | 91 | The following command is an example of how to quantify guide abundance into a format that can easily be input into R for MAUDE analysis: 92 | ```bash 93 | bowtie2 --no-head -x guide_seq_reference -U $sample.fastq.gz -S $sample.mapped.sam 94 | #here, we include all mapped reads, but by using Samtools, you can filter out reads that map to the wrong strand, have indels, etc. 
95 | cat $sample.mapped.sam | awk '{print $3}' | sort | uniq -c | sort > $sample.counts 96 | ``` 97 | Here, `$sample` is the sample name, with `$sample.fastq.gz` the corresponding fastq file, and `guide_seq_reference` is the `bowtie2` reference. The file `$sample.counts` will contain guide counts that can be input into R. 98 | 99 | To turn this into a format that can easily be used for a MAUDE analysis, you can input the data using something like the following: 100 | ```R 101 | #here, allSamples is a data.frame containing one sample per row, with columns including ID, expt, and Bin. There should be one file for every row in allSamples 102 | allData = data.frame(); 103 | for (i in 1:nrow(allSamples)){ 104 | curData = read.table(file=sprintf("%s/%s.counts",inDir,allSamples$ID[i]), quote="", header = F, row.names = NULL, stringsAsFactors = F) 105 | names(curData) = c("count","guideID"); 106 | curData = curData[curData$guideID!="*",] # remove unmapped counts 107 | curData$ID = allSamples$ID[i]; 108 | curData$expt = allSamples$expt[i]; 109 | curData$Bin = allSamples$Bin[i]; 110 | allData = rbind(allData, curData) 111 | } 112 | #now you have the data in a data.frame that can be reshaped to a MAUDE-compatible format: 113 | library(reshape) 114 | allDataCounts = as.data.frame(cast(allData, expt + guideID ~ Bin, value="count")); 115 | allDataCounts[is.na(allDataCounts)]=0; # fill in 0s for guides not observed at all 116 | #now you just need to label the non-targeting guides and this will be in the correct format 117 | ``` 118 | 119 | ## Encountering problems 120 | Should you encounter a problem using MAUDE: 121 | 1. [Consult the Common Problems](CommonProblems.md) 122 | 2. [Submit an Issue](https://github.com/Carldeboer/MAUDE/issues) 123 | 3. Contact the authors. 124 | 125 | 126 | # Citation 127 | Please cite: 128 | 129 | Carl G de Boer*, John P Ray*, Nir Hacohen, Aviv Regev.
[_MAUDE: Inferring Expression Changes in Sorting-Based CRISPR Screens_.](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02046-8) 2020 Jun 3;21(1):134. doi: 10.1186/s13059-020-02046-8. PMID: [32493396](https://pubmed.ncbi.nlm.nih.gov/32493396/). 130 | -------------------------------------------------------------------------------- /doc/BACH2_base_editor_screen.R: -------------------------------------------------------------------------------- 1 | ## ----setup, include=FALSE----------------------------------------------------- 2 | knitr::opts_chunk$set(echo = TRUE) 3 | 4 | ## ----------------------------------------------------------------------------- 5 | library(ggplot2) 6 | library(reshape) 7 | library(MAUDE) 8 | maudeGitPathRoot = "https://raw.githubusercontent.com/de-Boer-Lab/MAUDE/master" 9 | 10 | ## ----------------------------------------------------------------------------- 11 | allSamples = read.table(file=sprintf("%s/vignettes/BACH2_data/sample_metadata.txt",maudeGitPathRoot), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F) 12 | head(allSamples) 13 | #remove some of the control samples 14 | allSamples = allSamples[!grepl("HEK62", allSamples$replicate),] 15 | 16 | 17 | ## ----------------------------------------------------------------------------- 18 | if (FALSE){ 19 | #this was run on my computer to load the data from many files spread out over many subdirectories. Rather than upload all these files separately, I have loaded them all locally and saved the resulting concatenated file onto github. I leave this code here so that others may view how the data was loaded, should they have their own CRISPResso files to analyze. 
20 | inDir = "/Path/To/CRISPResso/Files"; 21 | setwd(inDir) 22 | 23 | 24 | allCRISPRessoData= data.frame() 25 | for (i in 1:nrow(allSamples)){ 26 | curData = read.table(file=sprintf("%s/%s/Alleles_frequency_table.txt",inDir, allSamples$directory[i]), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F, comment.char = "") 27 | curData2 = read.table(file=sprintf("%s/Deep_resequencing_analysis/%s/Alleles_frequency_table.txt",inDir, allSamples$directory[i]), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F, comment.char = "") 28 | curData = rbind(curData, curData2); 29 | curData$locus = allSamples$locus[i]; 30 | curData$replicate = allSamples$replicate[i]; 31 | curData$sortBin = allSamples$sortBin[i]; 32 | allCRISPRessoData = rbind(allCRISPRessoData, curData) 33 | } 34 | #remove gaps from sequences 35 | allCRISPRessoData$SeqSpecies = gsub("-","",allCRISPRessoData$Aligned_Sequence); # Allele 36 | #remove unwanted fields 37 | allCRISPRessoData$Aligned_Sequence=NULL; allCRISPRessoData$Reference_Sequence=NULL; allCRISPRessoData$X.Reads.1=NULL; 38 | write.table(allCRISPRessoData, file=sprintf("%s/loaded_BACH2_BE_CRISPResso_data.txt", inDir),row.names = F, col.names = T, quote=F, sep="\t") 39 | #I then gzipped this file and uploaded it to github 40 | }else{ 41 | #load the CRISPResso files from GitHub. 42 | z= gzcon(url(sprintf("%s/vignettes/BACH2_data/loaded_BACH2_BE_CRISPResso_data.txt.gz",maudeGitPathRoot))); 43 | fileConn=textConnection(readLines(z)); 44 | allCRISPRessoData = read.table(fileConn, sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F) 45 | close(fileConn) 46 | } 47 | 48 | ## ----------------------------------------------------------------------------- 49 | #CRISPResso has split some sequences into two or maybe more lines; below is an example 50 | #We also input multiple files from different sequencing runs. 
This merges all identical seq species'/samples 51 | allCRISPRessoData = cast(melt(allCRISPRessoData, id.vars = c("SeqSpecies","Reference_Name","Read_Status","n_deleted","n_inserted","n_mutated", "locus","replicate","sortBin")), SeqSpecies + Reference_Name + Read_Status + n_deleted + n_inserted + n_mutated + locus + replicate + sortBin ~ variable, value="value", fun.aggregate=sum) 52 | 53 | #Get read totals per replicate 54 | allCRISPRessoDataTotals = cast(allCRISPRessoData, locus + replicate + sortBin ~ ., value="X.Reads", fun.aggregate = sum) 55 | names(allCRISPRessoDataTotals)[ncol(allCRISPRessoDataTotals)] = "totalReads"; 56 | 57 | #Add totalReads column to allCRISPRessoData, calculate read fractions 58 | allCRISPRessoData = merge(allCRISPRessoData, allCRISPRessoDataTotals, by=c("locus","replicate","sortBin")) 59 | allCRISPRessoData$readFraction = allCRISPRessoData$X.Reads/allCRISPRessoData$totalReads; 60 | 61 | ## ----------------------------------------------------------------------------- 62 | p = ggplot(allCRISPRessoDataTotals, aes(x=replicate, fill=sortBin, y=totalReads)) + geom_bar(stat="identity", position="dodge") + facet_grid(. ~ locus) + theme_classic() + scale_y_continuous(expand=c(0,0))+theme(axis.text.x=element_text(hjust=1, angle=90)); print(p) 63 | 64 | ## ----------------------------------------------------------------------------- 65 | p = ggplot(allCRISPRessoDataTotals, aes(x=replicate, fill=sortBin, y=totalReads)) + geom_bar(stat="identity", position="dodge") + facet_grid(. 
~ locus) + theme_classic() + scale_y_log10(expand=c(0,0))+theme(axis.text.x=element_text(hjust=1, angle=90)); print(p) 66 | 67 | ## ----------------------------------------------------------------------------- 68 | allCRISPRessoData = allCRISPRessoData[ !(allCRISPRessoData$locus=="HEK" & grepl("^[ABCR][12]$", allCRISPRessoData$replicate)) & !(allCRISPRessoData$locus=="BACH2" & grepl("^HEK9[12]", allCRISPRessoData$replicate)),] 69 | 70 | ## ----------------------------------------------------------------------------- 71 | p = ggplot(allCRISPRessoData[allCRISPRessoData$sortBin=="NS",], aes(x= readFraction, colour=replicate)) + stat_ecdf()+facet_grid(locus ~ .) + scale_x_log10() + theme_bw(); print(p) 72 | 73 | ## ----------------------------------------------------------------------------- 74 | temp = cast(allCRISPRessoData[allCRISPRessoData$Read_Status=="UNMODIFIED" & allCRISPRessoData$Reference_Name=="Reference",], formula = sortBin + locus+ replicate ~. ,value = "readFraction", fun.aggregate = max) 75 | names(temp)[ncol(temp)]="readFraction"; 76 | p = ggplot(temp, aes(x=sortBin, y=replicate, fill=readFraction)) + geom_tile() + facet_grid(locus ~., scales="free_y")+ggtitle("just unmodified read fractions") + scale_fill_gradientn(colours=c("red","orange", "green","cyan","blue","violet"), limits=c(0,0.8)); print(p) 77 | min(temp$readFraction, na.rm = T) # 0.0356963 78 | 79 | ## ----------------------------------------------------------------------------- 80 | allCRISPRessoData = allCRISPRessoData[allCRISPRessoData$replicate!="A1-HEK",] # this sample had no data for bin D (BACH2) because it didn't PCR well 81 | 82 | allCRISPRessoData = allCRISPRessoData[allCRISPRessoData$replicate!="B2-HEK",] # this sample had very low coverage for bin D for BACH2 83 | 84 | ## ----------------------------------------------------------------------------- 85 | seqsObserved = cast(allCRISPRessoData, SeqSpecies + locus + Read_Status + Reference_Name~ .) 
86 | names(seqsObserved)[ncol(seqsObserved)]="seqRunsObserved"; 87 | 88 | retainedSamples = unique(allCRISPRessoData[,c("locus","replicate","sortBin")]) 89 | 90 | for (l in unique(allSamples$locus)){ 91 | seqsObserved$inAll[seqsObserved$locus == l] = seqsObserved$seqRunsObserved[seqsObserved$locus == l]==sum(retainedSamples$locus==l) 92 | } 93 | 94 | #require that all samples have at least one read for every species considered. This will help exclude read errors 95 | keepAlleles = seqsObserved[seqsObserved$inAll,] 96 | 97 | ## ----------------------------------------------------------------------------- 98 | #First make a count matrix, but only of alleles seen in all replicates 99 | readCountMat = cast(allCRISPRessoData[allCRISPRessoData$SeqSpecies %in% keepAlleles$SeqSpecies,], SeqSpecies + replicate +locus ~ sortBin, value="X.Reads") 100 | readCountMat[is.na(readCountMat)] = 0 101 | readCountMat = readCountMat[order(readCountMat$NS, decreasing = T),] 102 | 103 | #Label WT alleles 104 | wtSeqs = data.frame(locus = c("HEK","BACH2"), SeqSpecies = c("GGTAGCCAGAGACCCGCTGGTCTTCTTTCCCCTCCCCTGCCCTCCCCTCCCTTCAAGATGGCTGACAA","TGCCCCACCCTGTGCCCTTTTTACATTACACACAAATAGGGACGGATTTCCTGTAAGCTGATCTTGAAGAAAAAAAACATGTTAGACAAAGAAAATCAGAACTAAGA"), isWT=T, stringsAsFactors = F); 105 | readCountMat = merge(readCountMat, wtSeqs, by=c("locus","SeqSpecies"), all.x=T) 106 | readCountMat$isWT[is.na(readCountMat$isWT)]=F; 107 | 108 | allCRISPRessoData = merge(allCRISPRessoData, wtSeqs, by=c("locus","SeqSpecies"), all.x=T) 109 | allCRISPRessoData$isWT[is.na(allCRISPRessoData$isWT)]=F; 110 | 111 | 112 | binBounds = makeBinModel(data.frame(Bin=c("A","B","C","D","E","F"), fraction=rep(0.105,6))) # 10.5% bins 113 | # but only the top ~21% (CD) and bottom ~21% (AB) were retained 114 | binBounds = binBounds[binBounds$Bin %in% c("A","B","E","F"),] 115 | binBounds$Bin[binBounds$Bin=="E"]="C" 116 | binBounds$Bin[binBounds$Bin=="F"]="D" 117 | 118 | ## 
----------------------------------------------------------------------------- 119 | p = ggplot(binBounds, aes(colour=Bin)) + 120 | geom_density(data=data.frame(x=rnorm(100000)), aes(x=x), fill="gray", colour=NA)+ 121 | ggplot2::geom_segment(aes(x=binStartZ, xend=binEndZ, y=fraction, yend=fraction)) + 122 | xlab("Bin bounds as expression Z-scores") + 123 | ylab("Fraction of the distribution captured") +theme_classic()+scale_y_continuous(expand=c(0,0))+ 124 | coord_cartesian(ylim=c(0,0.7)); print(p) 125 | 126 | ## ----------------------------------------------------------------------------- 127 | curExpts = unique(readCountMat[c("replicate","locus")]) 128 | 129 | binStatsAll = data.frame(); 130 | for(i in 1:nrow(curExpts)){ 131 | binStatsAll = rbind(binStatsAll, cbind(binBounds, curExpts[i,])) 132 | } 133 | 134 | ## ----------------------------------------------------------------------------- 135 | #perform MAUDE analysis at guide (allele in this case) level for each locus. 136 | guideLevelStats = data.frame(); 137 | for( l in unique(allSamples$locus)){ 138 | message(l); 139 | statsA = findGuideHitsAllScreens(experiments = curExpts[curExpts$locus==l,], countDataFrame = readCountMat[readCountMat$locus==l,], binStats = binStatsAll[binStatsAll$locus==l,], sortBins = c("A","B","C","D"), unsortedBin = "NS", negativeControl = "isWT") 140 | guideLevelStats = rbind(guideLevelStats, statsA) 141 | } 142 | 143 | ## ----------------------------------------------------------------------------- 144 | baseChanges = data.frame() 145 | for (l in wtSeqs$locus){ 146 | wtSeq = wtSeqs$SeqSpecies[wtSeqs$locus==l]; 147 | wtSplit = strsplit(wtSeq,"")[[1]] 148 | curSeqs = unique(guideLevelStats$SeqSpecies[guideLevelStats$locus==l & guideLevelStats$libFraction > 0.001]) 149 | for (i in 1:length(curSeqs)){ 150 | curSplit = strsplit(curSeqs[i],"")[[1]] 151 | mismatches = c(); 152 | mismatchPoss=c(); 153 | for (j in 1:min(length(wtSplit),length(curSplit))){ 154 | if (wtSplit[j]!=curSplit[j]){ 
155 | mismatches = c(mismatches, curSplit[j]); 156 | mismatchPoss = c(mismatchPoss, j) 157 | } 158 | } 159 | if (length(mismatchPoss)>0){ 160 | baseChanges = rbind(baseChanges, data.frame(mismatch=mismatches, position = mismatchPoss, SeqSpecies=curSeqs[i], locus=l)) 161 | } 162 | } 163 | } 164 | 165 | baseChangesSummary = cast(baseChanges, SeqSpecies +locus~ ., value="mismatch") 166 | names(baseChangesSummary)[ncol(baseChangesSummary)] = "numMismatches"; 167 | p = ggplot(baseChangesSummary, aes(x=numMismatches, colour=locus)) + stat_ecdf()+ geom_vline(xintercept = 15); print(p) 168 | 169 | ## ----------------------------------------------------------------------------- 170 | #exclude any where the number of mismatches is greater than 15 - these are also likely read or PCR artifacts. 171 | baseChangesSummary$keepAlleles= baseChangesSummary$numMismatches<15; 172 | 173 | baseChanges$has = 1; 174 | 175 | baseChangesNumWith = cast(baseChanges[baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles],], position + mismatch + locus~ ., value="has", fun.aggregate = sum) 176 | names(baseChangesNumWith)[ncol(baseChangesNumWith)] = "numSeqsWith" 177 | baseChanges = merge(baseChanges, baseChangesNumWith, by=c("position","mismatch","locus")) 178 | baseChanges$mismatch = as.character(baseChanges$mismatch) 179 | baseChanges$SeqSpecies = as.character(baseChanges$SeqSpecies) 180 | 181 | 182 | p = ggplot(baseChanges[baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles] & !(baseChanges$SeqSpecies %in% baseChanges$SeqSpecies[baseChanges$numSeqsWith==1]),], aes(x=position, y=SeqSpecies, fill=mismatch)) + geom_tile()+scale_fill_manual(values=c("orange","green","blue","red"))+scale_x_continuous(expand=c(0,0))+theme_classic() + theme(axis.text.y = element_text(size=5, family = "Courier")) + facet_grid(locus ~ ., scales="free", space ="free"); print(p) 183 | 184 | ## 
----------------------------------------------------------------------------- 185 | #A_53 is the mutation of interest 186 | 187 | baseChangesMatrix = cast(baseChanges[baseChanges$locus == "BACH2" & baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles] & !(baseChanges$SeqSpecies %in% baseChanges$SeqSpecies[baseChanges$numSeqsWith==1]),], SeqSpecies ~ mismatch + position, value="has", fill = 0) 188 | A53_seq = baseChangesMatrix$SeqSpecies[apply(baseChangesMatrix[2:ncol(baseChangesMatrix)],1, sum)==1 & baseChangesMatrix$A_53==1] 189 | 190 | guideLevelStats$pooled = grepl("-",guideLevelStats$replicate); 191 | 192 | ## ----------------------------------------------------------------------------- 193 | wtAndVarMaudeMu = cast(guideLevelStats[guideLevelStats$SeqSpecies %in% c(A53_seq, wtSeqs$SeqSpecies[wtSeqs$locus=="BACH2"]),], replicate +pooled ~ SeqSpecies, value="mean") 194 | names(wtAndVarMaudeMu)[names(wtAndVarMaudeMu)==A53_seq]="rs72928038"; 195 | names(wtAndVarMaudeMu)[names(wtAndVarMaudeMu)==wtSeqs$SeqSpecies[wtSeqs$locus=="BACH2"]]="WT"; 196 | 197 | ttestResults = t.test(x = wtAndVarMaudeMu$rs72928038[!wtAndVarMaudeMu$pooled], y=wtAndVarMaudeMu$WT[!wtAndVarMaudeMu$pooled], alternative="less", paired=TRUE, var.equal = FALSE) 198 | 199 | ttestResults_mixedCells = t.test(x = wtAndVarMaudeMu$rs72928038[wtAndVarMaudeMu$pooled], y=wtAndVarMaudeMu$WT[wtAndVarMaudeMu$pooled], alternative="less", paired=TRUE, var.equal = FALSE) 200 | 201 | wtAndVarMaudeMu$replicate2 = gsub("-HEK","",wtAndVarMaudeMu$replicate) 202 | 203 | 204 | ## ----------------------------------------------------------------------------- 205 | meltedSNPMus = melt(as.data.frame(wtAndVarMaudeMu), id.vars=c("pooled","replicate", "replicate2")); 206 | meltedSNPMus$genotype = factor(ifelse(meltedSNPMus$variable=="WT", "G", "A (risk)"),levels=c("G","A (risk)")) 207 | p = ggplot(meltedSNPMus, aes(x=genotype, y=value, group=replicate)) + geom_point() + 
geom_line()+facet_grid(. ~ pooled)+theme_bw()+xlab("Genotype") + ylab("Mean expression") + ggtitle(sprintf("P=%f; P=%f", ttestResults$p.value, ttestResults_mixedCells$p.value)); print(p) 208 | 209 | -------------------------------------------------------------------------------- /doc/CD69_tutorial.R: -------------------------------------------------------------------------------- 1 | ## ----setup, include=FALSE----------------------------------------------------- 2 | knitr::opts_chunk$set(echo = TRUE) 3 | 4 | ## ----load libraries, results="hide", message = FALSE, warning=FALSE----------- 5 | #load required libraries 6 | library(openxlsx) 7 | library(reshape) 8 | library(ggplot2) 9 | library(MAUDE) 10 | library(GenomicRanges) 11 | library(ggbio) 12 | library(Homo.sapiens) 13 | 14 | ## ----input data--------------------------------------------------------------- 15 | #read in the CD69 screen data 16 | CD69Data = read.xlsx('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5675716/bin/NIHMS913084-supplement-supplementary_table_1.xlsx') 17 | 18 | #identify non-targeting guides 19 | CD69Data$isNontargeting = grepl("negative_control", CD69Data$gRNA_systematic_name) 20 | 21 | CD69Data = unique(CD69Data) # for some reason there were duplicated rows in this table - remove duplicates 22 | 23 | #reshape the count data so we can label the experimental replicates and bins, and remove all the non-count data 24 | cd69CountData = melt(CD69Data, id.vars = c("PAM_3primeEnd_coord","isNontargeting","gRNA_systematic_name")) 25 | cd69CountData = cd69CountData[grepl(".count$",cd69CountData$variable),] # keep only read count columns 26 | cd69CountData$Bin = gsub("CD69(.*)([12]).count","\\1",cd69CountData$variable) 27 | cd69CountData$expt = gsub("CD69(.*)([12]).count","\\2",cd69CountData$variable) 28 | cd69CountData$reads= as.numeric(cd69CountData$value); cd69CountData$value=NULL; 29 | cd69CountData$Bin = gsub("_","",cd69CountData$Bin) # remove extra underscores 30 | 31 | #reshape into a 
matrix 32 | binReadMat = data.frame(cast(cd69CountData[!is.na(cd69CountData$PAM_3primeEnd_coord) | cd69CountData$isNontargeting,], 33 | PAM_3primeEnd_coord+gRNA_systematic_name+isNontargeting+expt ~ Bin, value="reads")) 34 | #binReadMat now contains a matrix in the proper format for MAUDE analysis 35 | 36 | 37 | ## ----input set of DHS peaks--------------------------------------------------- 38 | dhsPeakBED = read.table(system.file("extdata", "Encode_Jurkat_DHS_both.merged.bed", package = "MAUDE", mustWork = TRUE), 39 | stringsAsFactors=FALSE, row.names=NULL, sep="\t", header=FALSE) 40 | names(dhsPeakBED) = c("chrom","start","end"); 41 | #add a column to include peak names 42 | dhsPeakBED$name = paste(dhsPeakBED$chrom, paste(dhsPeakBED$start, dhsPeakBED$end, sep="-"), sep=":") 43 | 44 | ## ----read in bin fractions---------------------------------------------------- 45 | #read in the bin fractions derived from Simeonov et al Extended Data Fig 1a and the "digitize" R package 46 | #Ideally, you derive this from the FACS sort data. 
47 | binStats = read.table(system.file("extdata", "CD69_bin_percentiles.txt", package = "MAUDE", mustWork = TRUE), 48 | stringsAsFactors=FALSE, row.names=NULL, sep="\t", header=TRUE) 49 | binStats$fraction = binStats$binEndQ - binStats$binStartQ; #the fraction of cells captured is the difference in bin start and end percentiles 50 | 51 | #plot the bins as the percentiles of the distribution captured by each bin 52 | p = ggplot(binStats, aes(colour=Bin)) + ggplot2::geom_segment(aes(x=binStartQ, xend=binEndQ, y=fraction, yend=fraction)) + 53 | xlab("Bin bounds as percentiles") + ylab("Fraction of the distribution captured") +theme_classic() + 54 | scale_y_continuous(expand=c(0,0))+coord_cartesian(ylim=c(0,0.7)); print(p) 55 | 56 | ## ----convert bin percentiles to Z scores-------------------------------------- 57 | #convert bin fractions to Z scores 58 | binStats$binStartZ = qnorm(binStats$binStartQ) 59 | binStats$binEndZ = qnorm(binStats$binEndQ) 60 | 61 | ## ----plot bins---------------------------------------------------------------- 62 | p = ggplot(binStats, aes(colour=Bin)) + 63 | geom_density(data=data.frame(x=rnorm(100000)), aes(x=x), fill="gray", colour=NA)+ 64 | ggplot2::geom_segment(aes(x=binStartZ, xend=binEndZ, y=fraction, yend=fraction)) + 65 | xlab("Bin bounds as expression Z-scores") + 66 | ylab("Fraction of the distribution captured") +theme_classic()+scale_y_continuous(expand=c(0,0))+ 67 | coord_cartesian(ylim=c(0,0.7)); print(p) 68 | 69 | ## ----duplicate bins for second replicate-------------------------------------- 70 | binStats = rbind(binStats, binStats) #duplicate data 71 | binStats$expt = c(rep("1",4),rep("2",4)); #name the first duplicate expt "1" and the next expt "2"; 72 | 73 | ## ----find guide effects------------------------------------------------------- 74 | guideLevelStats = findGuideHitsAllScreens(experiments = unique(binReadMat["expt"]), 75 | countDataFrame = binReadMat, binStats = binStats, 76 | sortBins = 
c("baseline","high","low","medium"), 77 | unsortedBin = "back", negativeControl = "isNontargeting") 78 | 79 | ## ----plot guide effects------------------------------------------------------- 80 | # Plot the guide-level mus 81 | p = ggplot(guideLevelStats, aes(x=mean, colour=isNontargeting, linetype=expt)) + geom_density()+ 82 | theme_classic()+scale_y_continuous(expand=c(0,0)) + geom_vline(xintercept = 0)+ 83 | xlab("Learned mean guide expression"); print(p); 84 | 85 | ## ----plot guide level Zs------------------------------------------------------ 86 | # Plot the guide-level Zs 87 | p = ggplot(guideLevelStats, aes(x=Z, colour=isNontargeting, linetype=expt)) + geom_density()+ 88 | theme_classic()+scale_y_continuous(expand=c(0,0)) + geom_vline(xintercept = 0)+ 89 | xlab("Learned guide expression Z score"); 90 | print(p) 91 | 92 | ## ----plot replicate scatter--------------------------------------------------- 93 | guideEffectsByRep = cast(guideLevelStats, 94 | gRNA_systematic_name + isNontargeting + PAM_3primeEnd_coord ~ expt, value="Z") 95 | 96 | p = ggplot(guideEffectsByRep[!guideEffectsByRep$isNontargeting,], aes(x=`1`, y=`2`)) + 97 | geom_point(size=0.3) + xlab("Replicate 1 Z score") + ylab("Replicate 2 Z score") + 98 | ggtitle(sprintf("r = %f",cor(guideEffectsByRep$`1`[!guideEffectsByRep$isNontargeting], 99 | guideEffectsByRep$`2`[!guideEffectsByRep$isNontargeting])))+theme_classic(); 100 | print(p) 101 | 102 | ## ----plot locus--------------------------------------------------------------- 103 | dhsPos = min(guideLevelStats$Z)*1.05; 104 | p=ggplot(guideLevelStats, aes(x=PAM_3primeEnd_coord, y=Z)) +geom_point(size=0.5)+facet_grid(expt ~.)+ 105 | ggplot2::geom_segment(data = dhsPeakBED, aes(x=start, xend=end,y=dhsPos, yend=dhsPos), colour="red") + 106 | theme_classic() + xlab("Genomic position") + ylab("Guide Z score"); 107 | print(p) 108 | 109 | ## ----infer sliding window effects--------------------------------------------- 110 | guideLevelStats$chrom = 
"chr12"; # we need to tell it what chromosome our guides are on - they're all on chr12 111 | slidingWindowElements = getTilingElementwiseStats(experiments = unique(binReadMat["expt"]), 112 | normNBSummaries = guideLevelStats, tails="both", window = 200, location = "PAM_3primeEnd_coord", 113 | chr="chrom",negativeControl = "isNontargeting") 114 | #override the default chromosome field 'chr' with the GRanges compatible 'chrom' 115 | names(slidingWindowElements)[names(slidingWindowElements)=="chr"]="chrom" 116 | 117 | ## ----tiles locus effects------------------------------------------------------ 118 | dhsPos = min(slidingWindowElements$meanZ)*1.05; 119 | p=ggplot(slidingWindowElements, aes(x=start, xend=end, y=meanZ,yend=meanZ, colour=FDR<0.01)) + 120 | ggplot2::geom_segment(size=1)+facet_grid(expt ~.) + theme_classic() + xlab("Genomic position") + 121 | ylab("Element Z score") + geom_hline(yintercept = 0) + 122 | ggplot2::geom_segment(data = dhsPeakBED, aes(x=start, xend=end,y=dhsPos, yend=dhsPos), colour="black"); 123 | print(p) 124 | 125 | ## ----tiled replicate scatter-------------------------------------------------- 126 | slidingWindowElementsByRep = cast(slidingWindowElements, chrom + start + end +numGuides ~ expt, 127 | value="meanZ") 128 | p = ggplot(slidingWindowElementsByRep, aes(x=`1`, y=`2`)) + geom_point(size=0.5) + 129 | xlab("Replicate 1 element effect Z score") + ylab("Replicate 2 element effect Z score") + 130 | ggtitle(sprintf("r = %f",cor(slidingWindowElementsByRep$`1`,slidingWindowElementsByRep$`2`)))+ 131 | theme_classic(); 132 | print(p) 133 | 134 | ## ----element-level stats------------------------------------------------------ 135 | #the next command annotates our guides with any DHS peak they lie in. 
136 | annotatedGuides = findOverlappingElements(guides = unique(guideLevelStats[!guideLevelStats$isNontargeting, 137 | c("PAM_3primeEnd_coord","gRNA_systematic_name","chrom")]), elements = dhsPeakBED, 138 | elements.start = "start", elements.end = "end", elements.chr = "chrom", 139 | guides.pos = "PAM_3primeEnd_coord", guides.chr = "chrom") 140 | 141 | #merge regulatory element annotations back onto guideLevelStats 142 | guideLevelStats = merge(guideLevelStats, annotatedGuides[c("gRNA_systematic_name", "name")], 143 | by="gRNA_systematic_name", all.x=TRUE) 144 | 145 | #this is where we are actually running MAUDE to find element-level stats 146 | dhsPeakStats = getElementwiseStats(experiments = unique(binReadMat["expt"]), 147 | normNBSummaries = guideLevelStats, negativeControl = "isNontargeting", 148 | elementIDs = "name") # "name" is the peak IDs from the DHS BED file 149 | 150 | #merge peak info back into dhsPeakStats 151 | dhsPeakStats = merge(dhsPeakStats, dhsPeakBED, by="name"); 152 | 153 | ## ----element locus effect view------------------------------------------------ 154 | p=ggplot(dhsPeakStats, aes(x=start, xend=end, y=meanZ,yend=meanZ, colour=FDR<0.01)) + 155 | ggplot2::geom_segment(size=1)+facet_grid(expt ~.) 
+ theme_classic() + xlab("Genomic position") + 156 | ylab("Element Z score") + geom_hline(yintercept = 0); 157 | print(p) 158 | 159 | ## ----element replicate scatter------------------------------------------------ 160 | dhsPeakStatsByRep = cast(dhsPeakStats, name ~ expt, value="meanZ") 161 | 162 | p = ggplot(dhsPeakStatsByRep, aes(x=`1`, y=`2`)) + geom_point() + 163 | xlab("Replicate 1 DHS effect Z score") + ylab("Replicate 2 DHS effect Z score") + 164 | ggtitle(sprintf("r = %f",cor(dhsPeakStatsByRep$`1`,dhsPeakStatsByRep$`2`)))+theme_classic(); 165 | print(p) 166 | 167 | ## ----guide effects per element------------------------------------------------ 168 | p=ggplot(guideLevelStats, aes(x=Z, group=name, colour=name == "chr12:9912678-9915275")) + stat_ecdf(alpha=0.3)+ 169 | stat_ecdf(data=guideLevelStats[!is.na(guideLevelStats$name) & 170 | guideLevelStats$name=="chr12:9912678-9915275",], size=1)+ 171 | facet_grid(expt ~.) + theme_classic() + xlab("Guide Z score")+scale_y_continuous(expand=c(0,0)) + 172 | scale_x_continuous(expand=c(0,0)) + scale_colour_manual(values=c("black","red")) + 173 | labs(colour = "CD69 promoter?")+ylab("Cumulative fraction"); 174 | print(p) 175 | 176 | ## ----promoter view------------------------------------------------------------ 177 | p=ggplot(guideLevelStats[!is.na(guideLevelStats$name) & guideLevelStats$name=="chr12:9912678-9915275",], 178 | aes(x=PAM_3primeEnd_coord, y=Z, colour=expt)) + 179 | geom_point(size=1)+ geom_line()+theme_classic() + xlab("Genomic position") + ylab("Guide Z score")+ 180 | geom_vline(xintercept = 9913497, colour="black"); 181 | print(p) 182 | 183 | ## ----promoter guide zoom------------------------------------------------------ 184 | p=ggplot(guideEffectsByRep[guideEffectsByRep$PAM_3primeEnd_coord < 9915275 & 185 | guideEffectsByRep$PAM_3primeEnd_coord > 9912678,], 186 | aes(x=`2`, y=`1`)) + geom_point() + 187 | geom_text(data=guideEffectsByRep[guideEffectsByRep$PAM_3primeEnd_coord < 9915275 & 188 | 
guideEffectsByRep$PAM_3primeEnd_coord > 9912678 & 189 | guideEffectsByRep$`1`>2 & guideEffectsByRep$`2`>2,], 190 | aes(label=gRNA_systematic_name)) + 191 | theme_classic()+xlab("Replicate 2 guide Z score") + ylab("Replicate 1 guide Z score"); 192 | print(p) 193 | 194 | 195 | ## ----reshaping window effects------------------------------------------------- 196 | slidingWindowElementsByReplicate = cast(melt(slidingWindowElements, 197 | id.vars=c("expt","numGuides","chrom","start","end")), 198 | numGuides+chrom+start+end ~ variable+expt, value="value") 199 | head(slidingWindowElementsByReplicate) 200 | 201 | ## ----cast to GRanges---------------------------------------------------------- 202 | #casting to data.frame is only needed if using cast 203 | slidingWindowElementsByReplicateGR = GRanges(as.data.frame(slidingWindowElementsByReplicate)) 204 | 205 | ## ----find doubly significant tiles-------------------------------------------- 206 | #require that both replicates are significant at an FDR of 0.01 and that the signs agree 207 | slidingWindowElementsByReplicateGR$significantUp = slidingWindowElementsByReplicateGR$FDR_1< 0.01 & 208 | slidingWindowElementsByReplicateGR$FDR_2 < 0.01 & slidingWindowElementsByReplicateGR$meanZ_1 > 0 & 209 | slidingWindowElementsByReplicateGR$meanZ_2 > 0; 210 | slidingWindowElementsByReplicateGR$significantDown = slidingWindowElementsByReplicateGR$FDR_1< 0.01 & 211 | slidingWindowElementsByReplicateGR$FDR_2 < 0.01 & slidingWindowElementsByReplicateGR$meanZ_1 < 0 & 212 | slidingWindowElementsByReplicateGR$meanZ_2 < 0; 213 | 214 | #merge overlapping regions in each set 215 | overlappingSlidingWindowElementsUp = 216 | reduce(slidingWindowElementsByReplicateGR[slidingWindowElementsByReplicateGR$significantUp]) 217 | overlappingSlidingWindowElementsDown = 218 | reduce(slidingWindowElementsByReplicateGR[slidingWindowElementsByReplicateGR$significantDown]) 219 | 220 | ## ----genome browser view, fig.width=10,
fig.height=5-------------------------- 221 | 222 | #which gene models do I want to plot? 223 | data(genesymbol, package = "biovizBase") 224 | wh <- genesymbol[c("CD69", "CLECL1", "KLRF1", "CLEC2D","CLEC2B")] 225 | wh <- range(wh, ignore.strand = TRUE) 226 | 227 | #make the genome tracks 228 | tracks(autoplot(Homo.sapiens, which = wh, gap.geom="chevron"), 229 | autoplot(overlappingSlidingWindowElementsUp, fill="red"), 230 | autoplot(overlappingSlidingWindowElementsDown, fill="blue"), heights=c(5,2,2)) + theme_classic() 231 | 232 | -------------------------------------------------------------------------------- /doc/simulated_data_tutorial.R: -------------------------------------------------------------------------------- 1 | ## ----setup, include=FALSE----------------------------------------------------- 2 | knitr::opts_chunk$set(echo = TRUE) 3 | 4 | ## ----load libraries, results="hide", message = FALSE, warning=FALSE----------- 5 | library(ggplot2) 6 | library(reshape) 7 | library(MAUDE) 8 | 9 | set.seed(76484377) 10 | 11 | ## ----simulation setup--------------------------------------------------------- 12 | #ground truth 13 | groundTruth = data.frame(element = 1:200, meanEffect = c((1:100)/100,rep(0,100))) #targeting 200 elements, half of which do nothing 14 | 15 | #guide - element map; 5 guides per element; gid is the guide ID 16 | guideMap = data.frame(element = rep(groundTruth$element, 5), gid = 1:(5*nrow(groundTruth)), NT=FALSE, mean=rep(groundTruth$meanEffect, 5)) 17 | guideMap = rbind(guideMap, data.frame(element = NA, gid = (1:1000)+nrow(guideMap), NT=TRUE, mean=0)); # 1000 non-targeting guides 18 | 19 | guideMap$abundance = rpois(n=nrow(guideMap), lambda=1000); #guide abundance drawing from a Poisson distribution with mean=1000 20 | guideMap$cells = rpois(n=nrow(guideMap), lambda=guideMap$abundance); #cell count drawing from a Poisson distribution with mean the abundance from above 21 | 22 | #create observations for different guides, with expression drawn
from normal(mean=mean, sd=1) 23 | cellObservations = data.frame(gid = rep(guideMap$gid, guideMap$cells)) 24 | cellObservations = merge(cellObservations, guideMap, by="gid") 25 | cellObservations$expression = rnorm(n=nrow(cellObservations), mean=cellObservations$mean); 26 | 27 | #create the bin model for this experiment - this represents 6 bins, each of which are 10%, where A+B+C catch the bottom ~30% and D+E+F catch the top 30%; in an actual experiment, the true captured fractions should be used here. 28 | binBounds = makeBinModel(data.frame(Bin=c("A","B","C","D","E","F"), fraction=rep(0.1,6))) 29 | if(FALSE){ 30 | # in reality, we shouldn't assume this distribution is exactly normal - we can re-assign expression bin bounds based on quantiles 31 | # of the actual simulated expression distribution. If you run the next two lines, the answer will improve slightly, but the 32 | # resulting graphs will look slightly different than those below. 33 | # correct for the actual distribution 34 | binBounds$binStartZ = quantile(cellObservations$expression, probs = binBounds$binStartQ); 35 | binBounds$binEndZ = quantile(cellObservations$expression, probs = binBounds$binEndQ); 36 | } 37 | 38 | # select some examples to inspect for both 39 | exampleNT = sample(guideMap$gid[guideMap$NT],10);# non-targeting 40 | exampleT = sample(guideMap$gid[!guideMap$NT],5);# and targeting guides 41 | 42 | #plot the select examples and show the bin structure 43 | p = ggplot(cellObservations[cellObservations$gid %in% c(exampleT, exampleNT),], 44 | aes(x=expression, group=gid, fill=NT))+geom_density(alpha=0.2) + 45 | geom_vline(xintercept = sort(unique(c(binBounds$binStartZ,binBounds$binEndZ))),colour="gray") + 46 | theme_classic() + scale_fill_manual(values=c("red","darkgray")) + xlab("Target expression") + 47 | scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + 48 | coord_cartesian(xlim=c(min(cellObservations$expression), max(cellObservations$expression)))+ 49 | 
geom_segment(data=binBounds, aes(x=binStartZ, xend=binEndZ, colour=Bin, y=0, yend=0), size=5, inherit.aes = FALSE); 50 | print(p) 51 | 52 | ## ----simulate sorting cells--------------------------------------------------- 53 | #for each bin, find which cells landed within the bin bounds 54 | for(i in 1:nrow(binBounds)){ 55 | cellObservations[[as.character(binBounds$Bin[i])]] = 56 | cellObservations$expression > binBounds$binStartZ[i] & cellObservations$expression < binBounds$binEndZ[i]; 57 | } 58 | 59 | #count the number of cells that ended up in each bin for each guide 60 | binLevelData = cast(melt(cellObservations[c("gid","element","NT",as.character(binBounds$Bin))], 61 | id.vars=c("gid","element","NT")), gid + element + NT + variable ~ ., fun.aggregate = sum) 62 | names(binLevelData)[(ncol(binLevelData)-1):ncol(binLevelData)]=c("Bin","cells"); 63 | 64 | #plot the cells/bin for each of our example guides 65 | p = ggplot(binLevelData[binLevelData$gid %in% exampleNT,], aes(x=Bin, group=Bin, y=cells)) + 66 | geom_boxplot(fill="darkgray")+geom_line(data=binLevelData[binLevelData$gid %in% exampleT,], 67 | aes(group=gid), colour="red")+ theme_classic() + xlab("Expression bin") + 68 | ylab("Captured cells/bin") + scale_y_continuous(expand=c(0,0)); print(p) 69 | 70 | ## ----simulate sequencing------------------------------------------------------ 71 | #get the total number of cells sorted into each of the bins 72 | totalCellsPerBin = cast(binLevelData,Bin ~ ., value="cells", fun.aggregate = sum) 73 | names(totalCellsPerBin)[ncol(totalCellsPerBin)]="totalCells"; 74 | 75 | #add bin totals to binLevelData 76 | binLevelData = merge(binLevelData, totalCellsPerBin, by="Bin") 77 | 78 | #generate reads for each guide per bin, following a negative binomial distribution 79 | #n=number of observations; size= total reads per bin; prob= probability of not getting a read at each drawing 80 | binLevelData$reads = rnbinom(n=nrow(binLevelData), size=binLevelData$totalCells*10, 81 | 
prob=1- binLevelData$cells/binLevelData$totalCells) 82 | 83 | #plot the distribution of reads for each example guide 84 | p = ggplot(binLevelData[binLevelData$gid %in% exampleNT,], aes(x=Bin, group=Bin, y=reads)) + 85 | geom_boxplot(fill="darkgray")+ 86 | geom_line(data=binLevelData[binLevelData$gid %in% exampleT,], aes(group=gid), colour="red")+ 87 | theme_classic() + xlab("Expression bin") + ylab("Reads/bin") + scale_y_continuous(expand=c(0,0)); 88 | print(p) 89 | 90 | ## ----simulate unsorted cells-------------------------------------------------- 91 | binReadMat = cast(binLevelData, element+gid+NT ~ Bin, value="reads") 92 | 93 | #pretend we sequence the unsorted cells to similar coverage as above: 94 | guideMap$NS = rnbinom(n=nrow(guideMap), size=sum(guideMap$cells)*10, 95 | prob=1- guideMap$abundance/sum(guideMap$cells)) 96 | binReadMat = merge(binReadMat, guideMap[c("gid","NS")], by="gid") 97 | 98 | binReadMat$screen="test"; # here, we're only doing the one screen - this simulation 99 | binBounds$screen="test"; 100 | 101 | ## ----run MAUDE---------------------------------------------------------------- 102 | # get guide-level stats 103 | guideLevelStats = findGuideHitsAllScreens(unique(binReadMat["screen"]), binReadMat, binBounds) 104 | 105 | #get element level stats 106 | elementLevelStats = getElementwiseStats(unique(guideLevelStats["screen"]), 107 | guideLevelStats, elementIDs="element",tails="upper") 108 | 109 | ## ----plor effects------------------------------------------------------------- 110 | elementLevelStats = merge(elementLevelStats, groundTruth, by="element") 111 | 112 | p = ggplot(elementLevelStats, aes(x=meanEffect, y=meanZ, colour=FDR<0.01)) + geom_point()+ 113 | geom_abline(intercept = 0, slope=1) + theme_classic() + scale_colour_manual(values=c("darkgray","red")) + 114 | xlab("True effect") + ylab("Inferred effect"); print(p) 115 | 116 | ## ----effets zoom in----------------------------------------------------------- 117 | p = 
ggplot(elementLevelStats, aes(x=meanEffect, y=meanZ, colour=FDR<0.01)) + geom_point()+ 118 | geom_abline(intercept = 0, slope=1) + theme_classic() + scale_colour_manual(values=c("darkgray","red")) + 119 | coord_cartesian(xlim = c(0,0.1),ylim = c(0,0.1)) + xlab("True effect") + ylab("Inferred effect"); 120 | print(p) 121 | 122 | -------------------------------------------------------------------------------- /images/logo2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/de-Boer-Lab/MAUDE/eb9c80a0bf37bb2b0c3c7ae7270a1fee71eaba20/images/logo2.png -------------------------------------------------------------------------------- /inst/CITATION: -------------------------------------------------------------------------------- 1 | citHeader("To cite MAUDE in publications use:") 2 | 3 | citEntry(entry = "Article", 4 | title = "MAUDE: inferring expression changes in sorting-based CRISPR screens", 5 | author = personList(as.person("Carl G. de Boer"), 6 | as.person("John P. Ray"), 7 | as.person("Nir Hacohen"), 8 | as.person("Aviv Regev")), 9 | journal = "Genome Biology", 10 | year = "2020", 11 | volume = "21", 12 | number = "1", 13 | pages = "134", 14 | url = "https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02046-8", 15 | 16 | textVersion = 17 | paste("de Boer CG, Ray JP, Hacohen N, Regev A.", 18 | "MAUDE: inferring expression changes in sorting-based CRISPR screens.", 19 | "Genome Biol. 2020;21(1):134. 
Published 2020 Jun 3.", 20 | "doi:10.1186/s13059-020-02046-8") 21 | ) 22 | -------------------------------------------------------------------------------- /inst/extdata/CD69_bin_percentiles.txt: -------------------------------------------------------------------------------- 1 | Bin binStartQ binEndQ 2 | baseline 0.001 0.658470995483854 3 | low 0.658470995483854 0.909937871020824 4 | medium 0.907937871020824 0.971071760477821 5 | high 0.971071760477821 0.999 6 | -------------------------------------------------------------------------------- /inst/extdata/Encode_Jurkat_DHS_both.merged.bed: -------------------------------------------------------------------------------- 1 | chr12 9885788 9886093 2 | chr12 9889662 9890295 3 | chr12 9894141 9894851 4 | chr12 9895391 9895533 5 | chr12 9895675 9896083 6 | chr12 9901260 9901720 7 | chr12 9902669 9904782 8 | chr12 9907175 9907332 9 | chr12 9907655 9907678 10 | chr12 9909466 9912465 11 | chr12 9912678 9915275 12 | chr12 9915630 9916048 13 | chr12 9916404 9918532 14 | chr12 9919586 9920156 15 | chr12 9923553 9923906 16 | chr12 9925876 9926553 17 | chr12 9928010 9928338 18 | chr12 9933989 9934360 19 | chr12 9938952 9939007 20 | chr12 9940013 9940299 21 | chr12 9945122 9945437 22 | chr12 9945544 9945599 23 | chr12 9946517 9946622 24 | chr12 9950506 9951060 25 | chr12 9961099 9961274 26 | chr12 9963380 9963792 27 | chr12 9964121 9964262 28 | chr12 9966189 9967318 29 | chr12 9973087 9973460 30 | -------------------------------------------------------------------------------- /man/calcFDRByExperiment.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/MAUDE.R 3 | \name{calcFDRByExperiment} 4 | \alias{calcFDRByExperiment} 5 | \title{FDR correction per experiment} 6 | \usage{ 7 | calcFDRByExperiment(experiments, x, tails) 8 | } 9 | \arguments{ 10 | \item{experiments}{a data.frame containing the headers that 
demarcate the screen ID, which are all also present in normNBSummaries} 11 | 12 | \item{x}{data.frame of element-level statistics, including columns for every column in 'experiments' and a column named 'p.value'} 13 | 14 | \item{tails}{whether to test for increased expression ("upper"), decreased ("lower"), or both ("both"); (defaults to "both")} 15 | } 16 | \value{ 17 | a numerical vector containing B-H Q values corrected separately for every experiment. 18 | } 19 | \description{ 20 | Perform Benjamini-Hochberg FDR correction of p-values within each experiment. The returned values correspond to Q values. This is run automatically within getTilingElementwiseStats and getElementwiseStats, and doesn't generally need to be used directly. 21 | } 22 | \examples{ 23 | fakeReadData = data.frame(id=rep(1:10000,2), expt=c(rep("e1",10000), rep("e2",10000)), 24 | A=rpois(20000, lambda = 100), B=rpois(20000, lambda = 100), 25 | C=rpois(20000, lambda = 100), D=rpois(20000, lambda = 100), 26 | E=rpois(20000, lambda = 100), F=rpois(20000, lambda = 100), 27 | NotSorted=rpois(20000, lambda = 100), 28 | position = rep(c(rep(NA, 1000), (1:9000)*10 + 5E7),2), 29 | chr=rep(c(rep(NA, 1000), rep("chr1", 9000)),2), 30 | negControl = rep(c(rep(TRUE,1000),rep(FALSE,9000)),2), 31 | stringsAsFactors = FALSE) 32 | #make one region an "enhancer" and "repressor" by skewing the reads 33 | enhancers = data.frame(name = c("enh","repr"), start = c(40000, 70000) + 5E7, 34 | end = c(40500, 70500) + 5E7, chr="chr1") 35 | enhancerData = findOverlappingElements(fakeReadData[!is.na(fakeReadData$position),], enhancers, 36 | guides.pos = "position",elements.start = "start", elements.end = "end") 37 | readSkew=1.2 # we will scale up/down the reads in ABC and DEF by this amount 38 | enhancerData[enhancerData$name=="enh", c("D","E","F")] = 39 | floor(readSkew* enhancerData[enhancerData$name=="enh", c("D","E","F")]); 40 | enhancerData[enhancerData$name=="repr", c("D","E","F")] = 41 | 
floor(enhancerData[enhancerData$name=="repr", c("D","E","F")]/readSkew); 42 | enhancerData[enhancerData$name=="repr", c("A","B","C")] = 43 | floor(readSkew* enhancerData[enhancerData$name=="repr", c("A","B","C")]); 44 | enhancerData[enhancerData$name=="enh", c("A","B","C")] = 45 | floor(enhancerData[enhancerData$name=="enh", c("A","B","C")]/readSkew); 46 | #replace the original data for these elements 47 | fakeReadData = rbind(fakeReadData[!(fakeReadData$id \%in\% enhancerData$id), ], 48 | enhancerData[names(fakeReadData)]) 49 | #make experiments and sorting strategy 50 | expts = unique(fakeReadData["expt"]); 51 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6))) 52 | curSortBins = rbind(curSortBins, curSortBins) # duplicate and use same for both expts 53 | curSortBins$expt = c(rep(expts$expt[1],6),rep(expts$expt[2],6)) 54 | guideHits = findGuideHitsAllScreens(experiments = expts, countDataFrame=fakeReadData, 55 | binStats = curSortBins, unsortedBin = "NotSorted", 56 | negativeControl="negControl") 57 | tilingElementStats = getTilingElementwiseStats(experiments = expts, normNBSummaries = guideHits, 58 | tails = "both", chr = "chr", location="position", negativeControl = "negControl") 59 | tilingElementStats$Q = calcFDRByExperiment(expts, tilingElementStats,"both") 60 | if(require("ggplot2")){ 61 | p=ggplot(tilingElementStats, aes(x=FDR, y=Q)) +geom_point()+geom_abline(intercept=0, slope=1)+ 62 | scale_y_log10()+scale_x_log10(); print(p) 63 | } 64 | } 65 | -------------------------------------------------------------------------------- /man/combineZStouffer.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/MAUDE.R 3 | \name{combineZStouffer} 4 | \alias{combineZStouffer} 5 | \title{Combines Z-scores using Stouffer's method} 6 | \usage{ 7 | combineZStouffer(x) 8 | } 9 | \arguments{ 10 | \item{x}{a vector 
of Z-scores to be combined} 11 | } 12 | \value{ 13 | Returns a single Z-score. 14 | } 15 | \description{ 16 | This function takes a vector of Z-scores and combines them into a single Z-score using Stouffer's method. 17 | } 18 | \examples{ 19 | combineZStouffer(rnorm(10)) 20 | } 21 | -------------------------------------------------------------------------------- /man/findGuideHits.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/MAUDE.R 3 | \name{findGuideHits} 4 | \alias{findGuideHits} 5 | \title{Calculate guide-level statistics for a single screen} 6 | \usage{ 7 | findGuideHits( 8 | countTable, 9 | curBinBounds, 10 | pseudocount = 10, 11 | meanFunction = mean, 12 | sortBins = c("A", "B", "C", "D", "E", "F"), 13 | unsortedBin = "NS", 14 | negativeControl = "NT", 15 | limits = c(-4, 4) 16 | ) 17 | } 18 | \arguments{ 19 | \item{countTable}{a table containing one column for each bin (A-F) and another column for non-targeting guide (logical-"NT"), and unsorted abundance (NS)} 20 | 21 | \item{curBinBounds}{a bin model as created by makeBinModel} 22 | 23 | \item{pseudocount}{the count to be added to each bin count, per 1e6 reads/bin total (default=10 pseudo reads per 1e6 reads total)} 24 | 25 | \item{meanFunction}{how to calculate the mean of the non-targeting guides for centering Z-scores. Defaults to 'mean'} 26 | 27 | \item{sortBins}{the names in countTable of the sorting bins. Defaults to c("A","B","C","D","E","F")} 28 | 29 | \item{unsortedBin}{the name in countTable of the unsorted bin. Defaults to "NS"} 30 | 31 | \item{negativeControl}{the name in countTable containing a logical representing whether or not the guide is non-Targeting (i.e. a negative control guide). Defaults to "NT"} 32 | 33 | \item{limits}{the limits to the mu optimization. 
Defaults to c(-4,4)} 34 | } 35 | \value{ 36 | a data.frame containing the guide-level statistics, including the Z score 'Z', log likelihood ratio 'llRatio', and estimated mean expression 'mean'. 37 | } 38 | \description{ 39 | Given a table of counts per guide/bin and a bin model for an experiment, calculate the optimal mean expression for each guide 40 | } 41 | \examples{ 42 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6))) 43 | fakeReadData = data.frame(id=1:1000, A=rpois(1000, lambda = 100), B=rpois(1000, lambda = 100), 44 | C=rpois(1000, lambda = 100), D=rpois(1000, lambda = 100), 45 | E=rpois(1000, lambda = 100), F=rpois(1000, lambda = 100), 46 | NotSorted=rpois(1000, lambda = 100), negControl = rnorm(1000)>0) 47 | guideHits = findGuideHits(fakeReadData, curSortBins, unsortedBin = "NotSorted", 48 | negativeControl="negControl") 49 | if(require("ggplot2")){ 50 | p=ggplot(guideHits, aes(x=Z, colour=negControl))+geom_density(); print(p) 51 | } 52 | } 53 | -------------------------------------------------------------------------------- /man/findGuideHitsAllScreens.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/MAUDE.R 3 | \name{findGuideHitsAllScreens} 4 | \alias{findGuideHitsAllScreens} 5 | \title{Calculate guide-level stats for multiple experiments} 6 | \usage{ 7 | findGuideHitsAllScreens(experiments, countDataFrame, binStats, ...) 
8 | } 9 | \arguments{ 10 | \item{experiments}{a data.frame containing the headers that demarcate the screen ID, which are all also present in countDataFrame and binStats} 11 | 12 | \item{countDataFrame}{a table containing one column for each bin (A-F) and another column for non-targeting guide (logical-"NT"), and unsorted abundance (NS), as well as columns corresponding to those in 'experiments'} 13 | 14 | \item{binStats}{a bin model as created by makeBinModel, as well as columns corresponding to those in 'experiments'} 15 | 16 | \item{...}{other parameters for findGuideHits} 17 | } 18 | \value{ 19 | guide-level stats for all experiments 20 | } 21 | \description{ 22 | Uses findGuideHits to find guide-level stats for each unique entry in 'experiments'. 23 | } 24 | \examples{ 25 | fakeReadData = data.frame(id=rep(1:1000,2), expt=c(rep("e1",1000), rep("e2",1000)), 26 | A=rpois(2000, lambda = 100), B=rpois(2000, lambda = 100), 27 | C=rpois(2000, lambda = 100), D=rpois(2000, lambda = 100), 28 | E=rpois(2000, lambda = 100), F=rpois(2000, lambda = 100), 29 | NotSorted=rpois(2000, lambda = 100), 30 | negControl = rep(rnorm(1000)>0,2), stringsAsFactors = FALSE) 31 | expts = unique(fakeReadData["expt"]); 32 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6))) 33 | curSortBins = rbind(curSortBins, curSortBins) # duplicate and use same for both expts 34 | curSortBins$expt = c(rep(expts$expt[1],6),rep(expts$expt[2],6)) 35 | guideHits = findGuideHitsAllScreens(experiments = expts, countDataFrame=fakeReadData, 36 | binStats = curSortBins, unsortedBin = "NotSorted", 37 | negativeControl="negControl") 38 | } 39 | -------------------------------------------------------------------------------- /man/findOverlappingElements.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/MAUDE.R 3 | \name{findOverlappingElements} 4 | 
\alias{findOverlappingElements} 5 | \title{Find overlaps between guides and annotated elements} 6 | \usage{ 7 | findOverlappingElements( 8 | guides, 9 | elements, 10 | guides.pos = "pos", 11 | guides.chr = "chr", 12 | elements.start = "st", 13 | elements.end = "en", 14 | elements.chr = "chr" 15 | ) 16 | } 17 | \arguments{ 18 | \item{guides}{a data.frame containing guide information including the guide's genomic position in a column named guides.pos} 19 | 20 | \item{elements}{a data.frame containing element information, as in a BED file, including the element's genomic start, end, and chromosome in columns named elements.start, elements.end, and elements.chr} 21 | 22 | \item{guides.pos}{the name of the column in guides that contains the genomic position targeted by the guide (defaults to "pos")} 23 | 24 | \item{guides.chr}{the name of the column in guides that contains the chromosome targeted by the guide (defaults to "chr")} 25 | 26 | \item{elements.start}{the name of the column in elements that contains the start coordinate of the element (defaults to "st")} 27 | 28 | \item{elements.end}{the name of the column in elements that contains the end coordinate of the element (defaults to "en")} 29 | 30 | \item{elements.chr}{the name of the column in elements that contains the chromosome of the element (defaults to "chr")} 31 | } 32 | \value{ 33 | Returns a new data.frame containing the intersection of elements and guides 34 | } 35 | \description{ 36 | Finds guides that overlap the elements of a BED-like data.frame (e.g.
open chromatin regions) and returns a new data.frame containing those overlaps 37 | } 38 | \examples{ 39 | set1 = data.frame(gid=1:10, chr=c(rep("chr1",5), rep("chr5",5)), 40 | pos= c(1:5,1:5)*10, stringsAsFactors = FALSE) 41 | set2 = data.frame(eid=1:4, chr=c("chr1","chr1","chr4","chr5"), st=c(5,25,1,45), 42 | en=c(15,50,50,55), stringsAsFactors = FALSE) 43 | findOverlappingElements(set1, set2) 44 | } 45 | -------------------------------------------------------------------------------- /man/getElementwiseStats.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/MAUDE.R 3 | \name{getElementwiseStats} 4 | \alias{getElementwiseStats} 5 | \title{Find active elements by annotation} 6 | \usage{ 7 | getElementwiseStats( 8 | experiments, 9 | normNBSummaries, 10 | elementIDs, 11 | tails = "both", 12 | negativeControl = "NT", 13 | ... 14 | ) 15 | } 16 | \arguments{ 17 | \item{experiments}{a data.frame containing the headers that demarcate the screen ID, which are all also present in normNBSummaries} 18 | 19 | \item{normNBSummaries}{data.frame of guide-level statistics as generated by findGuideHits()} 20 | 21 | \item{elementIDs}{the names of one or more columns within guideLevelStats that contain the element annotations.} 22 | 23 | \item{tails}{whether to test for increased expression ("upper"), decreased ("lower"), or both ("both"); (defaults to "both")} 24 | 25 | \item{negativeControl}{the name in normNBSummaries containing a logical representing whether or not the guide is non-Targeting (i.e. a negative control guide). 
Defaults to "NT"} 26 | 27 | \item{...}{other parameters for getZScalesWithNTGuides} 28 | } 29 | \value{ 30 | a data.frame containing the statistics for all elements 31 | } 32 | \description{ 33 | Tests guides for activity by considering a set of provided regulatory elements within the region and considering all guides within each region for the test. 34 | } 35 | \examples{ 36 | fakeReadData = data.frame(id=rep(1:10000,2), expt=c(rep("e1",10000), rep("e2",10000)), 37 | A=rpois(20000, lambda = 100), B=rpois(20000, lambda = 100), 38 | C=rpois(20000, lambda = 100), D=rpois(20000, lambda = 100), 39 | E=rpois(20000, lambda = 100), F=rpois(20000, lambda = 100), 40 | NotSorted=rpois(20000, lambda = 100), 41 | position = rep(c(rep(NA, 1000), (1:9000)*10 + 5E7),2), 42 | chr=rep(c(rep(NA, 1000), rep("chr1", 9000)),2), 43 | negControl = rep(c(rep(TRUE,1000),rep(FALSE,9000)),2), 44 | stringsAsFactors = FALSE) 45 | #make one region an "enhancer" and "repressor" by skewing the reads 46 | enhancers = data.frame(name = c("enh","repr"), start = c(40000, 70000) + 5E7, 47 | end = c(40500, 70500) + 5E7, chr="chr1") 48 | enhancerData = findOverlappingElements(fakeReadData[!is.na(fakeReadData$position),], 49 | enhancers, guides.pos = "position",elements.start = "start", elements.end = "end") 50 | readSkew=1.2 # we will scale up/down the reads in ABC and DEF by this amount 51 | enhancerData[enhancerData$name=="enh", c("D","E","F")] = 52 | floor(readSkew* enhancerData[enhancerData$name=="enh", c("D","E","F")]); 53 | enhancerData[enhancerData$name=="repr", c("D","E","F")] = 54 | floor(enhancerData[enhancerData$name=="repr", c("D","E","F")]/readSkew); 55 | enhancerData[enhancerData$name=="repr", c("A","B","C")] = 56 | floor(readSkew* enhancerData[enhancerData$name=="repr", c("A","B","C")]); 57 | enhancerData[enhancerData$name=="enh", c("A","B","C")] = 58 | floor(enhancerData[enhancerData$name=="enh", c("A","B","C")]/readSkew); 59 | #replace the original data for these elements 60 | 
fakeReadData = rbind(fakeReadData[!(fakeReadData$id \%in\% enhancerData$id), ], 61 | enhancerData[names(fakeReadData)]) 62 | #make experiments and sorting strategy 63 | expts = unique(fakeReadData["expt"]); 64 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6))) 65 | curSortBins = rbind(curSortBins, curSortBins) # duplicate and use same for both expts 66 | curSortBins$expt = c(rep(expts$expt[1],6),rep(expts$expt[2],6)) 67 | guideHits = findGuideHitsAllScreens(experiments = expts, countDataFrame=fakeReadData, 68 | binStats = curSortBins, unsortedBin = "NotSorted", 69 | negativeControl="negControl") 70 | 71 | potentialEnhancers = rbind(enhancers, data.frame(name = sprintf("EC\%i", 1:16), 72 | start = (5:20)*6000 + 5E7, end = (5:20)*6000 + 5E7+500, chr="chr1")) 73 | #annotate guides with elements, but first get all non-targeting guides 74 | guideHitsAnnotated = guideHits[is.na(guideHits$position),]; 75 | guideHitsAnnotated$name=NA; guideHitsAnnotated$start=NA; 76 | guideHitsAnnotated$end=NA; guideHitsAnnotated$chr=NA; 77 | guideHitsAnnotated = rbind(guideHitsAnnotated, 78 | findOverlappingElements(guideHits[!is.na(guideHits$position),], potentialEnhancers, 79 | guides.pos = "position",elements.start = "start", elements.end = "end") ) 80 | allElementHits = getElementwiseStats(experiments = expts, normNBSummaries = guideHitsAnnotated, 81 | elementIDs = "name", tails = "both", negativeControl = "negControl") 82 | allElementHits = merge(allElementHits, potentialEnhancers, by="name") 83 | if(require("ggplot2")){ 84 | p=ggplot(allElementHits, aes(x=start, xend=end, y=significanceZ, yend=significanceZ, 85 | colour=FDR<0.01, label=name)) + geom_segment()+ 86 | geom_text(data= allElementHits[allElementHits$FDR<0.01,], colour="black") + 87 | facet_grid(expt ~ .); print(p) 88 | } 89 | } 90 | -------------------------------------------------------------------------------- /man/getNBGaussianLikelihood.Rd: 
-------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/MAUDE.R 3 | \name{getNBGaussianLikelihood} 4 | \alias{getNBGaussianLikelihood} 5 | \title{Calculate the log likelihood of observed read counts} 6 | \usage{ 7 | getNBGaussianLikelihood(x, mu, k, sigma = 1, nullModel, libFract) 8 | } 9 | \arguments{ 10 | \item{x}{a vector of guide counts per bin} 11 | 12 | \item{mu}{the mean for the normal expression distribution} 13 | 14 | \item{k}{the vector of total counts per bin} 15 | 16 | \item{sigma}{for the normal expression distribution (defaults to 1)} 17 | 18 | \item{nullModel}{the bin bounds for the null model (for no change in expression)} 19 | 20 | \item{libFract}{the fraction of the unsorted library this guide comprises (e.g. from unsorted cells, or sequencing the vector)} 21 | } 22 | \value{ 23 | the log likelihood 24 | } 25 | \description{ 26 | Uses a normal distribution (N(mu,sigma)) to estimate how many reads are expected per bin under nullModel, and calculates the log likelihood under a negative binomial model. This function is usually not used directly. 
27 | } 28 | \examples{ 29 | #usually not used directly 30 | #make a bin sorting model with 6 10\% bins 31 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6))) 32 | readsForGuideX =c(10,20,30,100,200,100); #the reads for this guide 33 | getNBGaussianLikelihood(x=readsForGuideX, mu=1, k=rep(1E6,6), sigma=1, nullModel=curSortBins, 34 | libFract = 50/1E6) 35 | getNBGaussianLikelihood(x=readsForGuideX, mu=-1, k=rep(1E6,6), sigma=1, nullModel=curSortBins, 36 | libFract = 50/1E6) 37 | #mu=1 is far more likely (closer to 0) than mu=-1 for this distribution of reads 38 | } 39 | -------------------------------------------------------------------------------- /man/getTilingElementwiseStats.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/MAUDE.R 3 | \name{getTilingElementwiseStats} 4 | \alias{getTilingElementwiseStats} 5 | \title{Find active elements by sliding window} 6 | \usage{ 7 | getTilingElementwiseStats( 8 | experiments, 9 | normNBSummaries, 10 | tails = "both", 11 | location = "pos", 12 | chr = "chr", 13 | window = 500, 14 | minGuides = 5, 15 | negativeControl = "NT", 16 | ... 
17 | ) 18 | } 19 | \arguments{ 20 | \item{experiments}{a data.frame containing the headers that demarcate the screen ID, which are all also present in normNBSummaries} 21 | 22 | \item{normNBSummaries}{data.frame of guide-level statistics as generated by findGuideHits()} 23 | 24 | \item{tails}{whether to test for increased expression ("upper"), decreased ("lower"), or both ("both"); (defaults to "both")} 25 | 26 | \item{location}{the name of the column in normNBSummaries containing the chromosomal location (defaults to "pos")} 27 | 28 | \item{chr}{the name of the column in normNBSummaries containing the chromosome name (defaults to "chr")} 29 | 30 | \item{window}{the window width in base pairs (defaults to 500)} 31 | 32 | \item{minGuides}{the minimum number of guides in a window required for a test (defaults to 5)} 33 | 34 | \item{negativeControl}{the name in normNBSummaries containing a logical representing whether or not the guide is non-Targeting (i.e. a negative control guide). Defaults to "NT"} 35 | 36 | \item{...}{other parameters for getZScalesWithNTGuides} 37 | } 38 | \value{ 39 | a data.frame containing the statistics for all windows tested for activity 40 | } 41 | \description{ 42 | Tests guides for activity by considering a sliding window across the tested region and including all guides within the window for the test. 
43 | } 44 | \examples{ 45 | fakeReadData = data.frame(id=rep(1:10000,2), expt=c(rep("e1",10000), rep("e2",10000)), 46 | A=rpois(20000, lambda = 100), B=rpois(20000, lambda = 100), 47 | C=rpois(20000, lambda = 100), D=rpois(20000, lambda = 100), 48 | E=rpois(20000, lambda = 100), F=rpois(20000, lambda = 100), 49 | NotSorted=rpois(20000, lambda = 100), 50 | position = rep(c(rep(NA, 1000), (1:9000)*10 + 5E7),2), 51 | chr=rep(c(rep(NA, 1000), rep("chr1", 9000)),2), 52 | negControl = rep(c(rep(TRUE,1000),rep(FALSE,9000)),2), 53 | stringsAsFactors = FALSE) 54 | #make one region an "enhancer" and "repressor" by skewing the reads 55 | enhancers = data.frame(name = c("enh","repr"), start = c(40000, 70000) + 5E7, 56 | end = c(40500, 70500) + 5E7, chr="chr1") 57 | enhancerData = findOverlappingElements(fakeReadData[!is.na(fakeReadData$position),], enhancers, 58 | guides.pos = "position",elements.start = "start", elements.end = "end") 59 | readSkew=1.2 # we will scale up/down the reads in ABC and DEF by this amount 60 | enhancerData[enhancerData$name=="enh", c("D","E","F")] = 61 | floor(readSkew* enhancerData[enhancerData$name=="enh", c("D","E","F")]); 62 | enhancerData[enhancerData$name=="repr", c("D","E","F")] = 63 | floor(enhancerData[enhancerData$name=="repr", c("D","E","F")]/readSkew); 64 | enhancerData[enhancerData$name=="repr", c("A","B","C")] = 65 | floor(readSkew* enhancerData[enhancerData$name=="repr", c("A","B","C")]); 66 | enhancerData[enhancerData$name=="enh", c("A","B","C")] = 67 | floor(enhancerData[enhancerData$name=="enh", c("A","B","C")]/readSkew); 68 | #replace the original data for these elements 69 | fakeReadData = rbind(fakeReadData[!(fakeReadData$id \%in\% enhancerData$id), ], 70 | enhancerData[names(fakeReadData)]) 71 | #make experiments and sorting strategy 72 | expts = unique(fakeReadData["expt"]); 73 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6))) 74 | curSortBins = rbind(curSortBins, curSortBins) # 
duplicate and use same for both expts 75 | curSortBins$expt = c(rep(expts$expt[1],6),rep(expts$expt[2],6)) 76 | guideHits = findGuideHitsAllScreens(experiments = expts, countDataFrame=fakeReadData, 77 | binStats = curSortBins, unsortedBin = "NotSorted", 78 | negativeControl="negControl") 79 | if(require("ggplot2")){ 80 | p=ggplot(guideHits, aes(x=position, y=Z))+geom_point() ; print(p) 81 | } 82 | tilingElementStats = getTilingElementwiseStats(experiments = expts, normNBSummaries = guideHits, 83 | tails = "both", chr = "chr", location="position", negativeControl = "negControl") 84 | if(require("ggplot2")){ 85 | p=ggplot(tilingElementStats, aes(x=start, xend=end, y=significanceZ, yend=significanceZ, 86 | colour=FDR<0.01))+geom_segment() ; print(p) 87 | } 88 | } 89 | -------------------------------------------------------------------------------- /man/getZScalesWithNTGuides.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/MAUDE.R 3 | \name{getZScalesWithNTGuides} 4 | \alias{getZScalesWithNTGuides} 5 | \title{Calculate Z-score scaling factors using non-targeting guides} 6 | \usage{ 7 | getZScalesWithNTGuides(ntData, uGuidesPerElement, mergeBy, ntSampleFold = 10) 8 | } 9 | \arguments{ 10 | \item{ntData}{data.frame containing the data for the non-targeting guides} 11 | 12 | \item{uGuidesPerElement}{a unique vector of guide counts per element} 13 | 14 | \item{mergeBy}{a character vector containing the header(s) that demarcate the screen/experiment/replicate ID(s)} 15 | 16 | \item{ntSampleFold}{how many times to sample each non-targeting guide to make the Z score scale (defaults to 10)} 17 | } 18 | \value{ 19 | a data.frame containing a Z-score scaling factor, one for every number of guides and unique entry in mergeBy 20 | } 21 | \description{ 22 | Calculates scaling factors to calibrate element-wise Z-scores by repeatedly calculating a set of "null" 
Z-scores by repeatedly sampling the given numbers of non-targeting guides per element. This function is not normally used directly. 23 | } 24 | \examples{ 25 | fakeReadData = data.frame(id=rep(1:1000,2), expt=c(rep("e1",1000), rep("e2",1000)), 26 | A=rpois(2000, lambda = 100), B=rpois(2000, lambda = 100), 27 | C=rpois(2000, lambda = 100), D=rpois(2000, lambda = 100), 28 | E=rpois(2000, lambda = 100), F=rpois(2000, lambda = 100), 29 | NotSorted=rpois(2000, lambda = 100), 30 | negControl = rep(rnorm(1000)>0,2), stringsAsFactors = FALSE) 31 | expts = unique(fakeReadData["expt"]); 32 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6))) 33 | curSortBins = rbind(curSortBins, curSortBins) # duplicate and use same for both expts 34 | curSortBins$expt = c(rep(expts$expt[1],6),rep(expts$expt[2],6)) 35 | guideHits = findGuideHitsAllScreens(experiments = expts, countDataFrame=fakeReadData, 36 | binStats = curSortBins, unsortedBin = "NotSorted", 37 | negativeControl="negControl") 38 | guideZScales = getZScalesWithNTGuides(guideHits[guideHits$negControl,], uGuidesPerElement=1:10, 39 | mergeBy=names(expts)) 40 | if(require("ggplot2")){ 41 | p=ggplot(guideZScales, aes(x=numGuides, y=Zscale, colour=expt))+geom_point()+geom_line(); 42 | print(p) 43 | } 44 | } 45 | -------------------------------------------------------------------------------- /man/makeBinModel.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/MAUDE.R 3 | \name{makeBinModel} 4 | \alias{makeBinModel} 5 | \title{Create a bin model for a single experiment} 6 | \usage{ 7 | makeBinModel(curBinBounds, tailP = 0.001) 8 | } 9 | \arguments{ 10 | \item{curBinBounds}{a data.frame containing two columns: Bin (must be {A,B,C,D,E,F}), and fraction (the fractions of the total captured by each bin)} 11 | 12 | \item{tailP}{the fraction of the tails of the distribution 
not captured in any bin (defaults to 0.001)} 13 | } 14 | \value{ 15 | returns a data.frame with additional columns including the bin starts and ends in Z-score space, and in quantile space. 16 | } 17 | \description{ 18 | Provided with the fractions captured by each bin, creates a bin model for use with MAUDE analysis, assuming 3 contiguous bins on the tails of the distribution. You can easily remove or rename bins after they have been created with this function. An example is provided in the BACH2 Vignette. 19 | } 20 | \examples{ 21 | #generally, the bin bounds are retrieved from the FACS data 22 | binBounds = makeBinModel(data.frame(Bin=c("A","B","C","D","E","F"), fraction=rep(0.1,6))) 23 | if(require("ggplot2")){ 24 | p = ggplot() + 25 | geom_vline(xintercept= sort(unique(c(binBounds$binStartZ, binBounds$binEndZ))),colour="gray")+ 26 | theme_classic() + xlab("Target expression") + 27 | geom_segment(data=binBounds, aes(x=binStartZ, xend=binEndZ, colour=Bin, y=0, yend=0), 28 | size=5, inherit.aes = FALSE); 29 | print(p) 30 | } 31 | } 32 | -------------------------------------------------------------------------------- /vignettes/BACH2_base_editor_screen.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "BACH2 Base editor flow-FISH screen" 3 | author: "Carl de Boer" 4 | date: "24/02/2022" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{BACH2 Base editor flow-FISH screen} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | \usepackage[utf8]{inputenc} 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = TRUE) 14 | ``` 15 | 16 | # Introduction 17 | In this example, we use MAUDE to analyze a CRISPR base editor screen targeting a region containing an autoimmune-linked variant (rs72928038) within a BACH2 enhancer. 
Data are from Mouri et al. (Prioritization of autoimmune disease-associated genetic variants that perturb regulatory element activity in T cells), which also contains further details about the experiment: 18 | https://www.biorxiv.org/content/10.1101/2021.05.30.445673v1 19 | 20 | 21 | In this experiment, CRISPR/Cas9 base editors were targeted to rs72928038, where they mutate the DNA. Expression is then measured by sorting cells by BACH2 expression (Flow-FISH) into 4 expression bins. We then use MAUDE to estimate the mean expression of each of the alleles generated by the base editors. Only a minority of the alleles created correspond to rs72928038. 22 | 23 | Here, we have several experiments and replicates, four sorting bins per experiment, as well as unsorted cells. For more details, please see the publication referenced above. 24 | 25 | # Loading the data/libraries 26 | 27 | 28 | ```{r} 29 | library(ggplot2) 30 | library(reshape) 31 | library(MAUDE) 32 | maudeGitPathRoot = "https://raw.githubusercontent.com/de-Boer-Lab/MAUDE/master" 33 | ``` 34 | 35 | Load sample metadata. 36 | ```{r} 37 | allSamples = read.table(file=sprintf("%s/vignettes/BACH2_data/sample_metadata.txt",maudeGitPathRoot), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F) 38 | head(allSamples) 39 | #remove some of the control samples 40 | allSamples = allSamples[!grepl("HEK62", allSamples$replicate),] 41 | 42 | ``` 43 | 44 | Load the data. 45 | ```{r} 46 | if (FALSE){ 47 | #this was run on my computer to load the data from many files spread out over many subdirectories. Rather than upload all these files separately, I have loaded them all locally and saved the resulting concatenated file to GitHub. I leave this code here so that others may view how the data were loaded, should they have their own CRISPResso files to analyze. 
48 | inDir = "/Path/To/CRISPResso/Files"; 49 | setwd(inDir) 50 | 51 | 52 | allCRISPRessoData= data.frame() 53 | for (i in 1:nrow(allSamples)){ 54 | curData = read.table(file=sprintf("%s/%s/Alleles_frequency_table.txt",inDir, allSamples$directory[i]), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F, comment.char = "") 55 | curData2 = read.table(file=sprintf("%s/Deep_resequencing_analysis/%s/Alleles_frequency_table.txt",inDir, allSamples$directory[i]), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F, comment.char = "") 56 | curData = rbind(curData, curData2); 57 | curData$locus = allSamples$locus[i]; 58 | curData$replicate = allSamples$replicate[i]; 59 | curData$sortBin = allSamples$sortBin[i]; 60 | allCRISPRessoData = rbind(allCRISPRessoData, curData) 61 | } 62 | #remove gaps from sequences 63 | allCRISPRessoData$SeqSpecies = gsub("-","",allCRISPRessoData$Aligned_Sequence); # Allele 64 | #remove unwanted fields 65 | allCRISPRessoData$Aligned_Sequence=NULL; allCRISPRessoData$Reference_Sequence=NULL; allCRISPRessoData$X.Reads.1=NULL; 66 | write.table(allCRISPRessoData, file=sprintf("%s/loaded_BACH2_BE_CRISPResso_data.txt", inDir),row.names = F, col.names = T, quote=F, sep="\t") 67 | #I then gzipped this file and uploaded it to github 68 | }else{ 69 | #load the CRISPResso files from GitHub. 70 | z= gzcon(url(sprintf("%s/vignettes/BACH2_data/loaded_BACH2_BE_CRISPResso_data.txt.gz",maudeGitPathRoot))); 71 | fileConn=textConnection(readLines(z)); 72 | allCRISPRessoData = read.table(fileConn, sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F) 73 | close(fileConn) 74 | } 75 | ``` 76 | 77 | 78 | Polish CRISPResso genotype data, calculate reads per sample and genotype composition per sample. 79 | 80 | ```{r} 81 | #CRISPResso has split some sequences into two or maybe more lines; below is an example 82 | #We also input multiple files from different sequencing runs. 
This merges all identical seq species'/samples 83 | allCRISPRessoData = cast(melt(allCRISPRessoData, id.vars = c("SeqSpecies","Reference_Name","Read_Status","n_deleted","n_inserted","n_mutated", "locus","replicate","sortBin")), SeqSpecies + Reference_Name + Read_Status + n_deleted + n_inserted + n_mutated + locus + replicate + sortBin ~ variable, value="value", fun.aggregate=sum) 84 | 85 | #Get read totals per replicate 86 | allCRISPRessoDataTotals = cast(allCRISPRessoData, locus + replicate + sortBin ~ ., value="X.Reads", fun.aggregate = sum) 87 | names(allCRISPRessoDataTotals)[ncol(allCRISPRessoDataTotals)] = "totalReads"; 88 | 89 | #Add totalReads column to allCRISPRessoData, calculate read fractions 90 | allCRISPRessoData = merge(allCRISPRessoData, allCRISPRessoDataTotals, by=c("locus","replicate","sortBin")) 91 | allCRISPRessoData$readFraction = allCRISPRessoData$X.Reads/allCRISPRessoData$totalReads; 92 | ``` 93 | 94 | 95 | # Quality Control analysis for the experiment 96 | 97 | Reads per sample 98 | ```{r} 99 | p = ggplot(allCRISPRessoDataTotals, aes(x=replicate, fill=sortBin, y=totalReads)) + geom_bar(stat="identity", position="dodge") + facet_grid(. ~ locus) + theme_classic() + scale_y_continuous(expand=c(0,0))+theme(axis.text.x=element_text(hjust=1, angle=90)); print(p) 100 | ``` 101 | 102 | Same on a log scale 103 | ```{r} 104 | p = ggplot(allCRISPRessoDataTotals, aes(x=replicate, fill=sortBin, y=totalReads)) + geom_bar(stat="identity", position="dodge") + facet_grid(. ~ locus) + theme_classic() + scale_y_log10(expand=c(0,0))+theme(axis.text.x=element_text(hjust=1, angle=90)); print(p) 105 | ``` 106 | 107 | 108 | Toss all the cases where the locus was unmodified but sequenced anyway. 
Note that this means each sample was effectively only 50k cells, not 100k, because the gDNA was divided in half. 109 | ```{r} 110 | allCRISPRessoData = allCRISPRessoData[ !(allCRISPRessoData$locus=="HEK" & grepl("^[ABCR][12]$", allCRISPRessoData$replicate)) & !(allCRISPRessoData$locus=="BACH2" & grepl("^HEK9[12]", allCRISPRessoData$replicate)),] 111 | ``` 112 | 113 | 114 | Plot a CDF showing the cumulative fraction of genotypes (y axis) per sample (colour) by their abundance in the sequencing data (x axis). This illustrates how biased the data are towards rare genotypes (left) vs abundant genotypes (right). The more the curves shift to the right, the less complex the modifications to the DNA. Many of the rare genotypes on the left are sequencing artifacts. 115 | ```{r} 116 | p = ggplot(allCRISPRessoData[allCRISPRessoData$sortBin=="NS",], aes(x= readFraction, colour=replicate)) + stat_ecdf()+facet_grid(locus ~ .) + scale_x_log10() + theme_bw(); print(p) 117 | ``` 118 | 119 | Estimate the fraction of unmodified reads in each sample. 120 | ```{r} 121 | temp = cast(allCRISPRessoData[allCRISPRessoData$Read_Status=="UNMODIFIED" & allCRISPRessoData$Reference_Name=="Reference",], formula = sortBin + locus+ replicate ~. 
,value = "readFraction", fun.aggregate = max) 122 | names(temp)[ncol(temp)]="readFraction"; 123 | p = ggplot(temp, aes(x=sortBin, y=replicate, fill=readFraction)) + geom_tile() + facet_grid(locus ~., scales="free_y")+ggtitle("just unmodified read fractions") + scale_fill_gradientn(colours=c("red","orange", "green","cyan","blue","violet"), limits=c(0,0.8)); print(p) 124 | min(temp$readFraction, na.rm = T) # 0.0356963 125 | ``` 126 | 127 | 128 | Sample exclusion 129 | ```{r} 130 | allCRISPRessoData = allCRISPRessoData[allCRISPRessoData$replicate!="A1-HEK",] # this sample had no data for bin D (BACH2) because it didn't PCR well 131 | 132 | allCRISPRessoData = allCRISPRessoData[allCRISPRessoData$replicate!="B2-HEK",] # this sample had very low coverage for bin D for BACH2 133 | ``` 134 | 135 | 136 | 137 | Identify and retain only those alleles that are present in all samples 138 | ```{r} 139 | seqsObserved = cast(allCRISPRessoData, SeqSpecies + locus + Read_Status + Reference_Name~ .) 140 | names(seqsObserved)[ncol(seqsObserved)]="seqRunsObserved"; 141 | 142 | retainedSamples = unique(allCRISPRessoData[,c("locus","replicate","sortBin")]) 143 | 144 | for (l in unique(allSamples$locus)){ 145 | seqsObserved$inAll[seqsObserved$locus == l] = seqsObserved$seqRunsObserved[seqsObserved$locus == l]==sum(retainedSamples$locus==l) 146 | } 147 | 148 | #require that all samples have at least one read for every species considered. 
This will help exclude read errors 149 | keepAlleles = seqsObserved[seqsObserved$inAll,] 150 | ``` 151 | 152 | 153 | # MAUDE Analysis of base editor screen 154 | 155 | ```{r} 156 | #First make a count matrix, but only of alleles seen in all replicates 157 | readCountMat = cast(allCRISPRessoData[allCRISPRessoData$SeqSpecies %in% keepAlleles$SeqSpecies,], SeqSpecies + replicate +locus ~ sortBin, value="X.Reads") 158 | readCountMat[is.na(readCountMat)] = 0 159 | readCountMat = readCountMat[order(readCountMat$NS, decreasing = T),] 160 | 161 | #Label WT alleles 162 | wtSeqs = data.frame(locus = c("HEK","BACH2"), SeqSpecies = c("GGTAGCCAGAGACCCGCTGGTCTTCTTTCCCCTCCCCTGCCCTCCCCTCCCTTCAAGATGGCTGACAA","TGCCCCACCCTGTGCCCTTTTTACATTACACACAAATAGGGACGGATTTCCTGTAAGCTGATCTTGAAGAAAAAAAACATGTTAGACAAAGAAAATCAGAACTAAGA"), isWT=T, stringsAsFactors = F); 163 | readCountMat = merge(readCountMat, wtSeqs, by=c("locus","SeqSpecies"), all.x=T) 164 | readCountMat$isWT[is.na(readCountMat$isWT)]=F; 165 | 166 | allCRISPRessoData = merge(allCRISPRessoData, wtSeqs, by=c("locus","SeqSpecies"), all.x=T) 167 | allCRISPRessoData$isWT[is.na(allCRISPRessoData$isWT)]=F; 168 | 169 | 170 | binBounds = makeBinModel(data.frame(Bin=c("A","B","C","D","E","F"), fraction=rep(0.105,6))) # 10.5% bins 171 | # but only the top ~21% (CD) and bottom ~21% (AB) were retained 172 | binBounds = binBounds[binBounds$Bin %in% c("A","B","E","F"),] 173 | binBounds$Bin[binBounds$Bin=="E"]="C" 174 | binBounds$Bin[binBounds$Bin=="F"]="D" 175 | ``` 176 | 177 | The bin layout: 178 | ```{r} 179 | p = ggplot(binBounds, aes(colour=Bin)) + 180 | geom_density(data=data.frame(x=rnorm(100000)), aes(x=x), fill="gray", colour=NA)+ 181 | ggplot2::geom_segment(aes(x=binStartZ, xend=binEndZ, y=fraction, yend=fraction)) + 182 | xlab("Bin bounds as expression Z-scores") + 183 | ylab("Fraction of the distribution captured") +theme_classic()+scale_y_continuous(expand=c(0,0))+ 184 | coord_cartesian(ylim=c(0,0.7)); print(p) 185 | ``` 186 | 187 | 
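As a rough cross-check of the bin layout above, the Z-score bounds of the four retained bins can be approximated in base R. This is only a sketch: it assumes the bins sit contiguously at the two tails of a standard normal distribution and that a fraction `tailP = 0.001` is lost beyond each extreme. It does not call `makeBinModel`, so its output may differ from the package's; the names `binFraction`, `approxBins`, `binStartQ` and `binEndQ` are illustrative.

```r
# Sketch only: approximate bin bounds, assuming contiguous tail bins on a
# standard normal. This does not call makeBinModel and may differ from it.
tailP = 0.001        # assumed fraction lost beyond each extreme
binFraction = 0.105  # fraction of cells captured per bin (10.5%)

approxBins = data.frame(
  Bin = c("A", "B", "C", "D"),
  # A and B stack upward from the bottom tail; C and D end at the top tail
  binStartQ = c(tailP, tailP + binFraction,
                1 - tailP - 2 * binFraction, 1 - tailP - binFraction),
  stringsAsFactors = FALSE)
approxBins$binEndQ = approxBins$binStartQ + binFraction
# convert quantile bounds to expression Z-scores
approxBins$binStartZ = qnorm(approxBins$binStartQ)
approxBins$binEndZ = qnorm(approxBins$binEndQ)
print(approxBins)
```

Under these assumptions, bins A and B together cover roughly the bottom 21% of the distribution and bins C and D the top 21%, consistent with the comment in the chunk that builds `binBounds`.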
Replicate bin stats for each experiment (the same sorting parameters were used for each) 188 | ```{r} 189 | curExpts = unique(readCountMat[c("replicate","locus")]) 190 | 191 | binStatsAll = data.frame(); 192 | for(i in 1:nrow(curExpts)){ 193 | binStatsAll = rbind(binStatsAll, cbind(binBounds, curExpts[i,])) 194 | } 195 | ``` 196 | 197 | 198 | ```{r} 199 | #perform MAUDE analysis at guide (allele in this case) level for each locus. 200 | guideLevelStats = data.frame(); 201 | for( l in unique(allSamples$locus)){ 202 | message(l); 203 | statsA = findGuideHitsAllScreens(experiments = curExpts[curExpts$locus==l,], countDataFrame = readCountMat[readCountMat$locus==l,], binStats = binStatsAll[binStatsAll$locus==l,], sortBins = c("A","B","C","D"), unsortedBin = "NS", negativeControl = "isWT") 204 | guideLevelStats = rbind(guideLevelStats, statsA) 205 | } 206 | ``` 207 | 208 | 209 | Summarize base changes per allele 210 | ```{r} 211 | baseChanges = data.frame() 212 | for (l in wtSeqs$locus){ 213 | wtSeq = wtSeqs$SeqSpecies[wtSeqs$locus==l]; 214 | wtSplit = strsplit(wtSeq,"")[[1]] 215 | curSeqs = unique(guideLevelStats$SeqSpecies[guideLevelStats$locus==l & guideLevelStats$libFraction > 0.001]) 216 | for (i in 1:length(curSeqs)){ 217 | curSplit = strsplit(curSeqs[i],"")[[1]] 218 | mismatches = c(); 219 | mismatchPoss=c(); 220 | for (j in 1:min(length(wtSplit),length(curSplit))){ 221 | if (wtSplit[j]!=curSplit[j]){ 222 | mismatches = c(mismatches, curSplit[j]); 223 | mismatchPoss = c(mismatchPoss, j) 224 | } 225 | } 226 | if (length(mismatchPoss)>0){ 227 | baseChanges = rbind(baseChanges, data.frame(mismatch=mismatches, position = mismatchPoss, SeqSpecies=curSeqs[i], locus=l)) 228 | } 229 | } 230 | } 231 | 232 | baseChangesSummary = cast(baseChanges, SeqSpecies +locus~ ., value="mismatch") 233 | names(baseChangesSummary)[ncol(baseChangesSummary)] = "numMismatches"; 234 | p = ggplot(baseChangesSummary, aes(x=numMismatches, colour=locus)) + stat_ecdf()+ geom_vline(xintercept = 
15); print(p) 235 | ``` 236 | 237 | 238 | ```{r} 239 | #exclude alleles with 15 or more mismatches - these are also likely read or PCR artifacts. 240 | baseChangesSummary$keepAlleles= baseChangesSummary$numMismatches<15; 241 | 242 | baseChanges$has = 1; 243 | 244 | baseChangesNumWith = cast(baseChanges[baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles],], position + mismatch + locus~ ., value="has", fun.aggregate = sum) 245 | names(baseChangesNumWith)[ncol(baseChangesNumWith)] = "numSeqsWith" 246 | baseChanges = merge(baseChanges, baseChangesNumWith, by=c("position","mismatch","locus")) 247 | baseChanges$mismatch = as.character(baseChanges$mismatch) 248 | baseChanges$SeqSpecies = as.character(baseChanges$SeqSpecies) 249 | 250 | 251 | p = ggplot(baseChanges[baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles] & !(baseChanges$SeqSpecies %in% baseChanges$SeqSpecies[baseChanges$numSeqsWith==1]),], aes(x=position, y=SeqSpecies, fill=mismatch)) + geom_tile()+scale_fill_manual(values=c("orange","green","blue","red"))+scale_x_continuous(expand=c(0,0))+theme_classic() + theme(axis.text.y = element_text(size=5, family = "Courier")) + facet_grid(locus ~ ., scales="free", space ="free"); print(p) 252 | ``` 253 | 254 | 255 | ```{r} 256 | #A_53 is the mutation of interest 257 | 258 | baseChangesMatrix = cast(baseChanges[baseChanges$locus == "BACH2" & baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles] & !(baseChanges$SeqSpecies %in% baseChanges$SeqSpecies[baseChanges$numSeqsWith==1]),], SeqSpecies ~ mismatch + position, value="has", fill = 0) 259 | A53_seq = baseChangesMatrix$SeqSpecies[apply(baseChangesMatrix[2:ncol(baseChangesMatrix)],1, sum)==1 & baseChangesMatrix$A_53==1] 260 | 261 | guideLevelStats$pooled = grepl("-",guideLevelStats$replicate); 262 | ``` 263 | 264 | 265 | ```{r} 266 | wtAndVarMaudeMu = 
cast(guideLevelStats[guideLevelStats$SeqSpecies %in% c(A53_seq, wtSeqs$SeqSpecies[wtSeqs$locus=="BACH2"]),], replicate +pooled ~ SeqSpecies, value="mean") 267 | names(wtAndVarMaudeMu)[names(wtAndVarMaudeMu)==A53_seq]="rs72928038"; 268 | names(wtAndVarMaudeMu)[names(wtAndVarMaudeMu)==wtSeqs$SeqSpecies[wtSeqs$locus=="BACH2"]]="WT"; 269 | 270 | ttestResults = t.test(x = wtAndVarMaudeMu$rs72928038[!wtAndVarMaudeMu$pooled], y=wtAndVarMaudeMu$WT[!wtAndVarMaudeMu$pooled], alternative="less", paired=TRUE, var.equal = FALSE) 271 | 272 | ttestResults_mixedCells = t.test(x = wtAndVarMaudeMu$rs72928038[wtAndVarMaudeMu$pooled], y=wtAndVarMaudeMu$WT[wtAndVarMaudeMu$pooled], alternative="less", paired=TRUE, var.equal = FALSE) 273 | 274 | wtAndVarMaudeMu$replicate2 = gsub("-HEK","",wtAndVarMaudeMu$replicate) 275 | 276 | ``` 277 | 278 | Make plot of WT vs rs72928038 mean expression levels 279 | ```{r} 280 | meltedSNPMus = melt(as.data.frame(wtAndVarMaudeMu), id.vars=c("pooled","replicate", "replicate2")); 281 | meltedSNPMus$genotype = factor(ifelse(meltedSNPMus$variable=="WT", "G", "A (risk)"),levels=c("G","A (risk)")) 282 | p = ggplot(meltedSNPMus, aes(x=genotype, y=value, group=replicate)) + geom_point() + geom_line()+facet_grid(. 
~ pooled)+theme_bw()+xlab("Genotype") + ylab("Mean expression") + ggtitle(sprintf("P=%f; P=%f", ttestResults$p.value, ttestResults_mixedCells$p.value)); print(p) 283 | ``` 284 | 285 | -------------------------------------------------------------------------------- /vignettes/BACH2_data/loaded_BACH2_BE_CRISPResso_data.txt.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/de-Boer-Lab/MAUDE/eb9c80a0bf37bb2b0c3c7ae7270a1fee71eaba20/vignettes/BACH2_data/loaded_BACH2_BE_CRISPResso_data.txt.gz -------------------------------------------------------------------------------- /vignettes/BACH2_data/sample_metadata.txt: -------------------------------------------------------------------------------- 1 | directory locus locus2 replicate sortBin 2 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-A_S1.trimmed.1_BACH2-A1-A_S1.trimmed.2 BACH2 BACH2 A1 A 3 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-B_S2.trimmed.1_BACH2-A1-B_S2.trimmed.2 BACH2 BACH2 A1 B 4 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-C_S3.trimmed.1_BACH2-A1-C_S3.trimmed.2 BACH2 BACH2 A1 C 5 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-D_S4.trimmed.1_BACH2-A1-D_S4.trimmed.2 BACH2 BACH2 A1 D 6 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-HEK-A_S49.trimmed.1_BACH2-A1-HEK-A_S49.trimmed.2 BACH2 BACH2 A1-HEK A 7 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-HEK-B_S50.trimmed.1_BACH2-A1-HEK-B_S50.trimmed.2 BACH2 BACH2 A1-HEK B 8 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-HEK-C_S51.trimmed.1_BACH2-A1-HEK-C_S51.trimmed.2 BACH2 BACH2 A1-HEK C 9 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-HEK-NS_S53.trimmed.1_BACH2-A1-HEK-NS_S53.trimmed.2 BACH2 BACH2 A1-HEK NS 10 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-NS_S5.trimmed.1_BACH2-A1-NS_S5.trimmed.2 BACH2 BACH2 A1 NS 11 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-A_S6.trimmed.1_BACH2-A2-A_S6.trimmed.2 BACH2 BACH2 A2 A 12 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-B_S7.trimmed.1_BACH2-A2-B_S7.trimmed.2 BACH2 BACH2 A2 B 13 | 
CRISPResso_BACH2/CRISPResso_on_BACH2-A2-C_S8.trimmed.1_BACH2-A2-C_S8.trimmed.2 BACH2 BACH2 A2 C 14 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-D_S9.trimmed.1_BACH2-A2-D_S9.trimmed.2 BACH2 BACH2 A2 D 15 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-HEK-A_S54.trimmed.1_BACH2-A2-HEK-A_S54.trimmed.2 BACH2 BACH2 A2-HEK A 16 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-HEK-B_S55.trimmed.1_BACH2-A2-HEK-B_S55.trimmed.2 BACH2 BACH2 A2-HEK B 17 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-HEK-C_S56.trimmed.1_BACH2-A2-HEK-C_S56.trimmed.2 BACH2 BACH2 A2-HEK C 18 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-HEK-D_S57.trimmed.1_BACH2-A2-HEK-D_S57.trimmed.2 BACH2 BACH2 A2-HEK D 19 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-HEK-NS_S58.trimmed.1_BACH2-A2-HEK-NS_S58.trimmed.2 BACH2 BACH2 A2-HEK NS 20 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-NS_S10.trimmed.1_BACH2-A2-NS_S10.trimmed.2 BACH2 BACH2 A2 NS 21 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-A_S13.trimmed.1_BACH2-B1-A_S13.trimmed.2 BACH2 BACH2 B1 A 22 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-B_S14.trimmed.1_BACH2-B1-B_S14.trimmed.2 BACH2 BACH2 B1 B 23 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-C_S15.trimmed.1_BACH2-B1-C_S15.trimmed.2 BACH2 BACH2 B1 C 24 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-D_S16.trimmed.1_BACH2-B1-D_S16.trimmed.2 BACH2 BACH2 B1 D 25 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-HEK-A_S61.trimmed.1_BACH2-B1-HEK-A_S61.trimmed.2 BACH2 BACH2 B1-HEK A 26 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-HEK-B_S62.trimmed.1_BACH2-B1-HEK-B_S62.trimmed.2 BACH2 BACH2 B1-HEK B 27 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-HEK-C_S63.trimmed.1_BACH2-B1-HEK-C_S63.trimmed.2 BACH2 BACH2 B1-HEK C 28 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-HEK-D_S64.trimmed.1_BACH2-B1-HEK-D_S64.trimmed.2 BACH2 BACH2 B1-HEK D 29 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-HEK-NS_S65.trimmed.1_BACH2-B1-HEK-NS_S65.trimmed.2 BACH2 BACH2 B1-HEK NS 30 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-NS_S17.trimmed.1_BACH2-B1-NS_S17.trimmed.2 BACH2 BACH2 B1 NS 31 | 
CRISPResso_BACH2/CRISPResso_on_BACH2-B2-A_S18.trimmed.1_BACH2-B2-A_S18.trimmed.2 BACH2 BACH2 B2 A 32 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-B_S19.trimmed.1_BACH2-B2-B_S19.trimmed.2 BACH2 BACH2 B2 B 33 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-C_S20.trimmed.1_BACH2-B2-C_S20.trimmed.2 BACH2 BACH2 B2 C 34 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-D_S21.trimmed.1_BACH2-B2-D_S21.trimmed.2 BACH2 BACH2 B2 D 35 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-HEK-A_S66.trimmed.1_BACH2-B2-HEK-A_S66.trimmed.2 BACH2 BACH2 B2-HEK A 36 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-HEK-B_S67.trimmed.1_BACH2-B2-HEK-B_S67.trimmed.2 BACH2 BACH2 B2-HEK B 37 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-HEK-C_S68.trimmed.1_BACH2-B2-HEK-C_S68.trimmed.2 BACH2 BACH2 B2-HEK C 38 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-HEK-D_S69.trimmed.1_BACH2-B2-HEK-D_S69.trimmed.2 BACH2 BACH2 B2-HEK D 39 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-HEK-NS_S70.trimmed.1_BACH2-B2-HEK-NS_S70.trimmed.2 BACH2 BACH2 B2-HEK NS 40 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-NS_S22.trimmed.1_BACH2-B2-NS_S22.trimmed.2 BACH2 BACH2 B2 NS 41 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-A_S25.trimmed.1_BACH2-C1-A_S25.trimmed.2 BACH2 BACH2 C1 A 42 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-B_S26.trimmed.1_BACH2-C1-B_S26.trimmed.2 BACH2 BACH2 C1 B 43 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-C_S27.trimmed.1_BACH2-C1-C_S27.trimmed.2 BACH2 BACH2 C1 C 44 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-D_S28.trimmed.1_BACH2-C1-D_S28.trimmed.2 BACH2 BACH2 C1 D 45 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-HEK-A_S73.trimmed.1_BACH2-C1-HEK-A_S73.trimmed.2 BACH2 BACH2 C1-HEK A 46 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-HEK-B_S74.trimmed.1_BACH2-C1-HEK-B_S74.trimmed.2 BACH2 BACH2 C1-HEK B 47 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-HEK-C_S75.trimmed.1_BACH2-C1-HEK-C_S75.trimmed.2 BACH2 BACH2 C1-HEK C 48 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-HEK-D_S76.trimmed.1_BACH2-C1-HEK-D_S76.trimmed.2 BACH2 BACH2 C1-HEK D 49 | 
CRISPResso_BACH2/CRISPResso_on_BACH2-C1-HEK-NS_S77.trimmed.1_BACH2-C1-HEK-NS_S77.trimmed.2 BACH2 BACH2 C1-HEK NS 50 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-NS_S29.trimmed.1_BACH2-C1-NS_S29.trimmed.2 BACH2 BACH2 C1 NS 51 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-A_S30.trimmed.1_BACH2-C2-A_S30.trimmed.2 BACH2 BACH2 C2 A 52 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-B_S31.trimmed.1_BACH2-C2-B_S31.trimmed.2 BACH2 BACH2 C2 B 53 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-C_S32.trimmed.1_BACH2-C2-C_S32.trimmed.2 BACH2 BACH2 C2 C 54 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-D_S33.trimmed.1_BACH2-C2-D_S33.trimmed.2 BACH2 BACH2 C2 D 55 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-HEK-A_S78.trimmed.1_BACH2-C2-HEK-A_S78.trimmed.2 BACH2 BACH2 C2-HEK A 56 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-HEK-B_S79.trimmed.1_BACH2-C2-HEK-B_S79.trimmed.2 BACH2 BACH2 C2-HEK B 57 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-HEK-C_S80.trimmed.1_BACH2-C2-HEK-C_S80.trimmed.2 BACH2 BACH2 C2-HEK C 58 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-HEK-D_S81.trimmed.1_BACH2-C2-HEK-D_S81.trimmed.2 BACH2 BACH2 C2-HEK D 59 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-HEK-NS_S82.trimmed.1_BACH2-C2-HEK-NS_S82.trimmed.2 BACH2 BACH2 C2-HEK NS 60 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-NS_S34.trimmed.1_BACH2-C2-NS_S34.trimmed.2 BACH2 BACH2 C2 NS 61 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK62-A_S11.trimmed.1_BACH2-HEK62-A_S11.trimmed.2 BACH2 BACH2 HEK62 A 62 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK62-B_S23.trimmed.1_BACH2-HEK62-B_S23.trimmed.2 BACH2 BACH2 HEK62 B 63 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK62-C_S35.trimmed.1_BACH2-HEK62-C_S35.trimmed.2 BACH2 BACH2 HEK62 C 64 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK62-NS_S59.trimmed.1_BACH2-HEK62-NS_S59.trimmed.2 BACH2 BACH2 HEK62 NS 65 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK62D_S47.trimmed.1_BACH2-HEK62D_S47.trimmed.2 BACH2 BACH2 HEK62 D 66 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK91-A_S71.trimmed.1_BACH2-HEK91-A_S71.trimmed.2 BACH2 BACH2 HEK91 A 67 | 
CRISPResso_BACH2/CRISPResso_on_BACH2-HEK91-B_S83.trimmed.1_BACH2-HEK91-B_S83.trimmed.2 BACH2 BACH2 HEK91 B 68 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK91-C_S95.trimmed.1_BACH2-HEK91-C_S95.trimmed.2 BACH2 BACH2 HEK91 C 69 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK91-D_S12.trimmed.1_BACH2-HEK91-D_S12.trimmed.2 BACH2 BACH2 HEK91 D 70 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK91-NS_S24.trimmed.1_BACH2-HEK91-NS_S24.trimmed.2 BACH2 BACH2 HEK91 NS 71 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK92-A_S36.trimmed.1_BACH2-HEK92-A_S36.trimmed.2 BACH2 BACH2 HEK92 A 72 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK92-B_S48.trimmed.1_BACH2-HEK92-B_S48.trimmed.2 BACH2 BACH2 HEK92 B 73 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK92-C_S60.trimmed.1_BACH2-HEK92-C_S60.trimmed.2 BACH2 BACH2 HEK92 C 74 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK92-D_S72.trimmed.1_BACH2-HEK92-D_S72.trimmed.2 BACH2 BACH2 HEK92 D 75 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK92-NS_S84.trimmed.1_BACH2-HEK92-NS_S84.trimmed.2 BACH2 BACH2 HEK92 NS 76 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-A_S37.trimmed.1_BACH2-R1-A_S37.trimmed.2 BACH2 BACH2 R1 A 77 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-B_S38.trimmed.1_BACH2-R1-B_S38.trimmed.2 BACH2 BACH2 R1 B 78 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-C_S39.trimmed.1_BACH2-R1-C_S39.trimmed.2 BACH2 BACH2 R1 C 79 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-D_S40.trimmed.1_BACH2-R1-D_S40.trimmed.2 BACH2 BACH2 R1 D 80 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-HEK-A_S85.trimmed.1_BACH2-R1-HEK-A_S85.trimmed.2 BACH2 BACH2 R1-HEK A 81 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-HEK-B_S86.trimmed.1_BACH2-R1-HEK-B_S86.trimmed.2 BACH2 BACH2 R1-HEK B 82 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-HEK-C_S87.trimmed.1_BACH2-R1-HEK-C_S87.trimmed.2 BACH2 BACH2 R1-HEK C 83 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-HEK-D_S88.trimmed.1_BACH2-R1-HEK-D_S88.trimmed.2 BACH2 BACH2 R1-HEK D 84 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-HEK-NS_S89.trimmed.1_BACH2-R1-HEK-NS_S89.trimmed.2 BACH2 BACH2 R1-HEK NS 
85 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-NS_S41.trimmed.1_BACH2-R1-NS_S41.trimmed.2 BACH2 BACH2 R1 NS 86 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-A_S42.trimmed.1_BACH2-R2-A_S42.trimmed.2 BACH2 BACH2 R2 A 87 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-B_S43.trimmed.1_BACH2-R2-B_S43.trimmed.2 BACH2 BACH2 R2 B 88 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-C_S44.trimmed.1_BACH2-R2-C_S44.trimmed.2 BACH2 BACH2 R2 C 89 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-D_S45.trimmed.1_BACH2-R2-D_S45.trimmed.2 BACH2 BACH2 R2 D 90 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-HEK-A_S90.trimmed.1_BACH2-R2-HEK-A_S90.trimmed.2 BACH2 BACH2 R2-HEK A 91 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-HEK-B_S91.trimmed.1_BACH2-R2-HEK-B_S91.trimmed.2 BACH2 BACH2 R2-HEK B 92 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-HEK-C_S92.trimmed.1_BACH2-R2-HEK-C_S92.trimmed.2 BACH2 BACH2 R2-HEK C 93 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-HEK-D_S93.trimmed.1_BACH2-R2-HEK-D_S93.trimmed.2 BACH2 BACH2 R2-HEK D 94 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-HEK-NS_S94.trimmed.1_BACH2-R2-HEK-NS_S94.trimmed.2 BACH2 BACH2 R2-HEK NS 95 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-NS_S46.trimmed.1_BACH2-R2-NS_S46.trimmed.2 BACH2 BACH2 R2 NS 96 | CRISPResso_HEK/CRISPResso_on_HEK-A1-A_S97.trimmed.1 HEK HEK A1 A 97 | CRISPResso_HEK/CRISPResso_on_HEK-A1-B_S98.trimmed.1 HEK HEK A1 B 98 | CRISPResso_HEK/CRISPResso_on_HEK-A1-C_S99.trimmed.1 HEK HEK A1 C 99 | CRISPResso_HEK/CRISPResso_on_HEK-A1-D_S100.trimmed.1 HEK HEK A1 D 100 | CRISPResso_HEK/CRISPResso_on_HEK-A1-HEK-A_S145.trimmed.1 HEK HEK A1-HEK A 101 | CRISPResso_HEK/CRISPResso_on_HEK-A1-HEK-B_S146.trimmed.1 HEK HEK A1-HEK B 102 | CRISPResso_HEK/CRISPResso_on_HEK-A1-HEK-C_S147.trimmed.1 HEK HEK A1-HEK C 103 | CRISPResso_HEK/CRISPResso_on_HEK-A1-HEK-D_S148.trimmed.1 HEK HEK A1-HEK D 104 | CRISPResso_HEK/CRISPResso_on_HEK-A1-HEK-NS_S149.trimmed.1 HEK HEK A1-HEK NS 105 | CRISPResso_HEK/CRISPResso_on_HEK-A1-NS_S101.trimmed.1 HEK HEK A1 NS 106 | 
CRISPResso_HEK/CRISPResso_on_HEK-A2-A_S102.trimmed.1 HEK HEK A2 A 107 | CRISPResso_HEK/CRISPResso_on_HEK-A2-B_S103.trimmed.1 HEK HEK A2 B 108 | CRISPResso_HEK/CRISPResso_on_HEK-A2-C_S104.trimmed.1 HEK HEK A2 C 109 | CRISPResso_HEK/CRISPResso_on_HEK-A2-D_S105.trimmed.1 HEK HEK A2 D 110 | CRISPResso_HEK/CRISPResso_on_HEK-A2-HEK-A_S150.trimmed.1 HEK HEK A2-HEK A 111 | CRISPResso_HEK/CRISPResso_on_HEK-A2-HEK-B_S151.trimmed.1 HEK HEK A2-HEK B 112 | CRISPResso_HEK/CRISPResso_on_HEK-A2-HEK-C_S152.trimmed.1 HEK HEK A2-HEK C 113 | CRISPResso_HEK/CRISPResso_on_HEK-A2-HEK-D_S153.trimmed.1 HEK HEK A2-HEK D 114 | CRISPResso_HEK/CRISPResso_on_HEK-A2-HEK-NS_S154.trimmed.1 HEK HEK A2-HEK NS 115 | CRISPResso_HEK/CRISPResso_on_HEK-A2-NS_S106.trimmed.1 HEK HEK A2 NS 116 | CRISPResso_HEK/CRISPResso_on_HEK-B1-A_S109.trimmed.1 HEK HEK B1 A 117 | CRISPResso_HEK/CRISPResso_on_HEK-B1-B_S110.trimmed.1 HEK HEK B1 B 118 | CRISPResso_HEK/CRISPResso_on_HEK-B1-C_S111.trimmed.1 HEK HEK B1 C 119 | CRISPResso_HEK/CRISPResso_on_HEK-B1-D_S112.trimmed.1 HEK HEK B1 D 120 | CRISPResso_HEK/CRISPResso_on_HEK-B1-HEK-A_S157.trimmed.1 HEK HEK B1-HEK A 121 | CRISPResso_HEK/CRISPResso_on_HEK-B1-HEK-B_S158.trimmed.1 HEK HEK B1-HEK B 122 | CRISPResso_HEK/CRISPResso_on_HEK-B1-HEK-C_S159.trimmed.1 HEK HEK B1-HEK C 123 | CRISPResso_HEK/CRISPResso_on_HEK-B1-HEK-D_S160.trimmed.1 HEK HEK B1-HEK D 124 | CRISPResso_HEK/CRISPResso_on_HEK-B1-HEK-NS_S161.trimmed.1 HEK HEK B1-HEK NS 125 | CRISPResso_HEK/CRISPResso_on_HEK-B1-NS_S113.trimmed.1 HEK HEK B1 NS 126 | CRISPResso_HEK/CRISPResso_on_HEK-B2-A_S114.trimmed.1 HEK HEK B2 A 127 | CRISPResso_HEK/CRISPResso_on_HEK-B2-B_S115.trimmed.1 HEK HEK B2 B 128 | CRISPResso_HEK/CRISPResso_on_HEK-B2-C_S116.trimmed.1 HEK HEK B2 C 129 | CRISPResso_HEK/CRISPResso_on_HEK-B2-D_S117.trimmed.1 HEK HEK B2 D 130 | CRISPResso_HEK/CRISPResso_on_HEK-B2-HEK-A_S162.trimmed.1 HEK HEK B2-HEK A 131 | CRISPResso_HEK/CRISPResso_on_HEK-B2-HEK-B_S163.trimmed.1 HEK HEK B2-HEK B 132 | 
CRISPResso_HEK/CRISPResso_on_HEK-B2-HEK-C_S164.trimmed.1 HEK HEK B2-HEK C 133 | CRISPResso_HEK/CRISPResso_on_HEK-B2-HEK-D_S165.trimmed.1 HEK HEK B2-HEK D 134 | CRISPResso_HEK/CRISPResso_on_HEK-B2-HEK-NS_S166.trimmed.1 HEK HEK B2-HEK NS 135 | CRISPResso_HEK/CRISPResso_on_HEK-B2-NS_S118.trimmed.1 HEK HEK B2 NS 136 | CRISPResso_HEK/CRISPResso_on_HEK-C1-A_S121.trimmed.1 HEK HEK C1 A 137 | CRISPResso_HEK/CRISPResso_on_HEK-C1-B_S122.trimmed.1 HEK HEK C1 B 138 | CRISPResso_HEK/CRISPResso_on_HEK-C1-C_S123.trimmed.1 HEK HEK C1 C 139 | CRISPResso_HEK/CRISPResso_on_HEK-C1-D_S124.trimmed.1 HEK HEK C1 D 140 | CRISPResso_HEK/CRISPResso_on_HEK-C1-HEK-A_S169.trimmed.1 HEK HEK C1-HEK A 141 | CRISPResso_HEK/CRISPResso_on_HEK-C1-HEK-B_S170.trimmed.1 HEK HEK C1-HEK B 142 | CRISPResso_HEK/CRISPResso_on_HEK-C1-HEK-C_S171.trimmed.1 HEK HEK C1-HEK C 143 | CRISPResso_HEK/CRISPResso_on_HEK-C1-HEK-D_S172.trimmed.1 HEK HEK C1-HEK D 144 | CRISPResso_HEK/CRISPResso_on_HEK-C1-HEK-NS_S173.trimmed.1 HEK HEK C1-HEK NS 145 | CRISPResso_HEK/CRISPResso_on_HEK-C1-NS_S125.trimmed.1 HEK HEK C1 NS 146 | CRISPResso_HEK/CRISPResso_on_HEK-C2-A_S126.trimmed.1 HEK HEK C2 A 147 | CRISPResso_HEK/CRISPResso_on_HEK-C2-B_S127.trimmed.1 HEK HEK C2 B 148 | CRISPResso_HEK/CRISPResso_on_HEK-C2-C_S128.trimmed.1 HEK HEK C2 C 149 | CRISPResso_HEK/CRISPResso_on_HEK-C2-D_S129.trimmed.1 HEK HEK C2 D 150 | CRISPResso_HEK/CRISPResso_on_HEK-C2-HEK-A_S174.trimmed.1 HEK HEK C2-HEK A 151 | CRISPResso_HEK/CRISPResso_on_HEK-C2-HEK-B_S175.trimmed.1 HEK HEK C2-HEK B 152 | CRISPResso_HEK/CRISPResso_on_HEK-C2-HEK-C_S176.trimmed.1 HEK HEK C2-HEK C 153 | CRISPResso_HEK/CRISPResso_on_HEK-C2-HEK-D_S177.trimmed.1 HEK HEK C2-HEK D 154 | CRISPResso_HEK/CRISPResso_on_HEK-C2-HEK-NS_S178.trimmed.1 HEK HEK C2-HEK NS 155 | CRISPResso_HEK/CRISPResso_on_HEK-C2-NS_S130.trimmed.1 HEK HEK C2 NS 156 | CRISPResso_HEK/CRISPResso_on_HEK-HEK62-A_S107.trimmed.1 HEK HEK HEK62 A 157 | CRISPResso_HEK/CRISPResso_on_HEK-HEK62-B_S119.trimmed.1 HEK HEK HEK62 B 158 | 
CRISPResso_HEK/CRISPResso_on_HEK-HEK62-C_S131.trimmed.1 HEK HEK HEK62 C 159 | CRISPResso_HEK/CRISPResso_on_HEK-HEK62-NS_S155.trimmed.1 HEK HEK HEK62 NS 160 | CRISPResso_HEK/CRISPResso_on_HEK-HEK62D_S143.trimmed.1 HEK HEK HEK62 D 161 | CRISPResso_HEK/CRISPResso_on_HEK-HEK91-A_S167.trimmed.1 HEK HEK HEK91 A 162 | CRISPResso_HEK/CRISPResso_on_HEK-HEK91-B_S179.trimmed.1 HEK HEK HEK91 B 163 | CRISPResso_HEK/CRISPResso_on_HEK-HEK91-C_S191.trimmed.1 HEK HEK HEK91 C 164 | CRISPResso_HEK/CRISPResso_on_HEK-HEK91-D_S108.trimmed.1 HEK HEK HEK91 D 165 | CRISPResso_HEK/CRISPResso_on_HEK-HEK91-NS_S120.trimmed.1 HEK HEK HEK91 NS 166 | CRISPResso_HEK/CRISPResso_on_HEK-HEK92-A_S132.trimmed.1 HEK HEK HEK92 A 167 | CRISPResso_HEK/CRISPResso_on_HEK-HEK92-B_S144.trimmed.1 HEK HEK HEK92 B 168 | CRISPResso_HEK/CRISPResso_on_HEK-HEK92-C_S156.trimmed.1 HEK HEK HEK92 C 169 | CRISPResso_HEK/CRISPResso_on_HEK-HEK92-D_S168.trimmed.1 HEK HEK HEK92 D 170 | CRISPResso_HEK/CRISPResso_on_HEK-HEK92-NS_S180.trimmed.1 HEK HEK HEK92 NS 171 | CRISPResso_HEK/CRISPResso_on_HEK-R1-A_S133.trimmed.1 HEK HEK R1 A 172 | CRISPResso_HEK/CRISPResso_on_HEK-R1-B_S134.trimmed.1 HEK HEK R1 B 173 | CRISPResso_HEK/CRISPResso_on_HEK-R1-C_S135.trimmed.1 HEK HEK R1 C 174 | CRISPResso_HEK/CRISPResso_on_HEK-R1-D_S136.trimmed.1 HEK HEK R1 D 175 | CRISPResso_HEK/CRISPResso_on_HEK-R1-HEK-A_S181.trimmed.1 HEK HEK R1-HEK A 176 | CRISPResso_HEK/CRISPResso_on_HEK-R1-HEK-B_S182.trimmed.1 HEK HEK R1-HEK B 177 | CRISPResso_HEK/CRISPResso_on_HEK-R1-HEK-C_S183.trimmed.1 HEK HEK R1-HEK C 178 | CRISPResso_HEK/CRISPResso_on_HEK-R1-HEK-D_S184.trimmed.1 HEK HEK R1-HEK D 179 | CRISPResso_HEK/CRISPResso_on_HEK-R1-HEK-NS_S185.trimmed.1 HEK HEK R1-HEK NS 180 | CRISPResso_HEK/CRISPResso_on_HEK-R1-NS_S137.trimmed.1 HEK HEK R1 NS 181 | CRISPResso_HEK/CRISPResso_on_HEK-R2-A_S138.trimmed.1 HEK HEK R2 A 182 | CRISPResso_HEK/CRISPResso_on_HEK-R2-B_S139.trimmed.1 HEK HEK R2 B 183 | CRISPResso_HEK/CRISPResso_on_HEK-R2-C_S140.trimmed.1 HEK HEK R2 C 184 | 
CRISPResso_HEK/CRISPResso_on_HEK-R2-D_S141.trimmed.1 HEK HEK R2 D 185 | CRISPResso_HEK/CRISPResso_on_HEK-R2-HEK-A_S186.trimmed.1 HEK HEK R2-HEK A 186 | CRISPResso_HEK/CRISPResso_on_HEK-R2-HEK-B_S187.trimmed.1 HEK HEK R2-HEK B 187 | CRISPResso_HEK/CRISPResso_on_HEK-R2-HEK-C_S188.trimmed.1 HEK HEK R2-HEK C 188 | CRISPResso_HEK/CRISPResso_on_HEK-R2-HEK-D_S189.trimmed.1 HEK HEK R2-HEK D 189 | CRISPResso_HEK/CRISPResso_on_HEK-R2-HEK-NS_S190.trimmed.1 HEK HEK R2-HEK NS 190 | CRISPResso_HEK/CRISPResso_on_HEK-R2-NS_S142.trimmed.1 HEK HEK R2 NS 191 | -------------------------------------------------------------------------------- /vignettes/CD69_tutorial.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Re-analysis of CD69 CRISPRa tiling screen" 3 | author: "Carl de Boer" 4 | date: "5/21/2020" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{Re-analysis of CD69 CRISPRa tiling screen} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | \usepackage[utf8]{inputenc} 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = TRUE) 14 | ``` 15 | 16 | # Tutorial: Re-analysis of CD69 CRISPRa tiling screen 17 | 18 | See Simeonov et al. Discovery of stimulation-responsive immune enhancers with CRISPR activation. Nature. 2017 Sep 7;549(7670):111-115. doi: 10.1038/nature23875. Epub 2017 Aug 30. [PMC5675716](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5675716/) for the original publication associated with this dataset. 19 | 20 | This tutorial is designed to get you quickly up and running with MAUDE. 
The only requirements are that you have R and the following R libraries installed: 21 | `ggplot2, openxlsx, MAUDE, reshape, GenomicRanges, ggbio, Homo.sapiens` 22 | 23 | ## Load required libraries into R 24 | ```{r load libraries, results="hide", message = FALSE, warning=FALSE} 25 | #load required libraries 26 | library(openxlsx) 27 | library(reshape) 28 | library(ggplot2) 29 | library(MAUDE) 30 | library(GenomicRanges) 31 | library(ggbio) 32 | library(Homo.sapiens) 33 | ``` 34 | 35 | ## Load CD69 screen count data from Simeonov Supplementary Table 1 36 | ```{r input data} 37 | #read in the CD69 screen data 38 | CD69Data = read.xlsx('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5675716/bin/NIHMS913084-supplement-supplementary_table_1.xlsx') 39 | 40 | #identify non-targeting guides 41 | CD69Data$isNontargeting = grepl("negative_control", CD69Data$gRNA_systematic_name) 42 | 43 | CD69Data = unique(CD69Data) # for some reason there were duplicated rows in this table - remove duplicates 44 | 45 | #reshape the count data so we can label the experimental replicates and bins, and remove all the non-count data 46 | cd69CountData = melt(CD69Data, id.vars = c("PAM_3primeEnd_coord","isNontargeting","gRNA_systematic_name")) 47 | cd69CountData = cd69CountData[grepl(".count$",cd69CountData$variable),] # keep only read count columns 48 | cd69CountData$Bin = gsub("CD69(.*)([12]).count","\\1",cd69CountData$variable) 49 | cd69CountData$expt = gsub("CD69(.*)([12]).count","\\2",cd69CountData$variable) 50 | cd69CountData$reads= as.numeric(cd69CountData$value); cd69CountData$value=NULL; 51 | cd69CountData$Bin = gsub("_","",cd69CountData$Bin) # remove extra underscores 52 | 53 | #reshape into one row per guide and replicate, with one read count column per bin 54 | binReadMat = data.frame(cast(cd69CountData[!is.na(cd69CountData$PAM_3primeEnd_coord) | cd69CountData$isNontargeting,], 55 | PAM_3primeEnd_coord+gRNA_systematic_name+isNontargeting+expt ~ Bin, value="reads")) 56 | #binReadMat is now a data.frame in the proper format for MAUDE analysis 57 | 58 | ``` 59 | 60 | ## Read in the ENCODE DHS peaks at the locus
61 | We can use these as a reference point with which to view our results, and to annotate guides with regulatory elements. These are all the DHS peaks at CD69 in Jurkat cells, merged across two replicates. 62 | 63 | ```{r input set of DHS peaks} 64 | dhsPeakBED = read.table(system.file("extdata", "Encode_Jurkat_DHS_both.merged.bed", package = "MAUDE", mustWork = TRUE), 65 | stringsAsFactors=FALSE, row.names=NULL, sep="\t", header=FALSE) 66 | names(dhsPeakBED) = c("chrom","start","end"); 67 | #add a column to include peak names 68 | dhsPeakBED$name = paste(dhsPeakBED$chrom, paste(dhsPeakBED$start, dhsPeakBED$end, sep="-"), sep=":") 69 | ``` 70 | 71 | ## Read in and inspect bin fractions 72 | 73 | ```{r read in bin fractions} 74 | #read in the bin fractions derived from Simeonov et al Extended Data Fig 1a and the "digitize" R package 75 | #Ideally, you would derive these from the FACS sort data. 76 | binStats = read.table(system.file("extdata", "CD69_bin_percentiles.txt", package = "MAUDE", mustWork = TRUE), 77 | stringsAsFactors=FALSE, row.names=NULL, sep="\t", header=TRUE) 78 | binStats$fraction = binStats$binEndQ - binStats$binStartQ; #the fraction of cells captured is the difference in bin start and end percentiles 79 | 80 | #plot the bins as the percentiles of the distribution captured by each bin 81 | p = ggplot(binStats, aes(colour=Bin)) + ggplot2::geom_segment(aes(x=binStartQ, xend=binEndQ, y=fraction, yend=fraction)) + 82 | xlab("Bin bounds as percentiles") + ylab("Fraction of the distribution captured") +theme_classic() + 83 | scale_y_continuous(expand=c(0,0))+coord_cartesian(ylim=c(0,0.7)); print(p) 84 | ``` 85 | 86 | This shows the sizes of the bins in terms of the fractions of the overall distribution covered by each bin.
To express the bins in expression space, we need to convert these percentiles to Z scores with the inverse normal CDF (quantile) function `qnorm`: 87 | ```{r convert bin percentiles to Z scores} 88 | #convert bin percentile bounds to Z scores 89 | binStats$binStartZ = qnorm(binStats$binStartQ) 90 | binStats$binEndZ = qnorm(binStats$binEndQ) 91 | ``` 92 | 93 | Now let's plot the bins as the Z score bounds of a normal distribution, with an actual normal distribution overlaid in gray. 94 | ```{r plot bins} 95 | p = ggplot(binStats, aes(colour=Bin)) + 96 | geom_density(data=data.frame(x=rnorm(100000)), aes(x=x), fill="gray", colour=NA)+ 97 | ggplot2::geom_segment(aes(x=binStartZ, xend=binEndZ, y=fraction, yend=fraction)) + 98 | xlab("Bin bounds as expression Z-scores") + 99 | ylab("Fraction of the distribution captured") +theme_classic()+scale_y_continuous(expand=c(0,0))+ 100 | coord_cartesian(ylim=c(0,0.7)); print(p) 101 | ``` 102 | 103 | Here, we are forced to use the same distribution for both experimental replicates since we didn't have the underlying data. Ideally, you would derive these bin bounds from the FACS sorting data separately for each replicate (although it doesn't affect things that much if the bins were drawn similarly). 104 | ```{r duplicate bins for second replicate} 105 | binStats = rbind(binStats, binStats) #duplicate data 106 | binStats$expt = c(rep("1",4),rep("2",4)); #name the first duplicate expt "1" and the next expt "2"; 107 | ``` 108 | 109 | ## 1) MAUDE: Calculate guide level statistics 110 | Now we've finally gotten to the part where we're running MAUDE. To compute guide-level stats, we run `findGuideHitsAllScreens`. 111 | 112 | This step takes about a minute to run.
113 | ```{r find guide effects} 114 | guideLevelStats = findGuideHitsAllScreens(experiments = unique(binReadMat["expt"]), 115 | countDataFrame = binReadMat, binStats = binStats, 116 | sortBins = c("baseline","high","low","medium"), 117 | unsortedBin = "back", negativeControl = "isNontargeting") 118 | ``` 119 | 120 | Here, we calculated the mean expression of each guide 121 | ```{r plot guide effects} 122 | # Plot the guide-level mus 123 | p = ggplot(guideLevelStats, aes(x=mean, colour=isNontargeting, linetype=expt)) + geom_density()+ 124 | theme_classic()+scale_y_continuous(expand=c(0,0)) + geom_vline(xintercept = 0)+ 125 | xlab("Learned mean guide expression"); print(p); 126 | ``` 127 | 128 | But notice how the distributions for the non-targeting guides are off-centre. This is why we also calculate a re-centred Z-score using the non-targeting guides. 129 | 130 | ```{r plot guide level Zs} 131 | # Plot the guide-level Zs 132 | p = ggplot(guideLevelStats, aes(x=Z, colour=isNontargeting, linetype=expt)) + geom_density()+ 133 | theme_classic()+scale_y_continuous(expand=c(0,0)) + geom_vline(xintercept = 0)+ 134 | xlab("Learned guide expression Z score"); 135 | print(p) 136 | ``` 137 | 138 | Now the non-targeting guides are centred at 0, as expected since they have no effect on expression. 139 | 140 | Now that we have guide-level statistics, we can inspect them. 
We can see there is a high correlation between replicates: 141 | ```{r plot replicate scatter} 142 | guideEffectsByRep = cast(guideLevelStats, 143 | gRNA_systematic_name + isNontargeting + PAM_3primeEnd_coord ~ expt, value="Z") 144 | 145 | p = ggplot(guideEffectsByRep[!guideEffectsByRep$isNontargeting,], aes(x=`1`, y=`2`)) + 146 | geom_point(size=0.3) + xlab("Replicate 1 Z score") + ylab("Replicate 2 Z score") + 147 | ggtitle(sprintf("r = %f",cor(guideEffectsByRep$`1`[!guideEffectsByRep$isNontargeting], 148 | guideEffectsByRep$`2`[!guideEffectsByRep$isNontargeting])))+theme_classic(); 149 | print(p) 150 | ``` 151 | 152 | The guide effect sizes are highly correlated between the two replicates. Now let's map the results across the CD69 locus. 153 | 154 | ```{r plot locus} 155 | dhsPos = min(guideLevelStats$Z)*1.05; 156 | p=ggplot(guideLevelStats, aes(x=PAM_3primeEnd_coord, y=Z)) +geom_point(size=0.5)+facet_grid(expt ~.)+ 157 | ggplot2::geom_segment(data = dhsPeakBED, aes(x=start, xend=end,y=dhsPos, yend=dhsPos), colour="red") + 158 | theme_classic() + xlab("Genomic position") + ylab("Guide Z score"); 159 | print(p) 160 | ``` 161 | 162 | Here, the DHS peaks are shown in red. Clearly some regions contain many active guides, the majority of which have high Z-scores, indicating CD69 activation. 163 | 164 | ## 2) MAUDE: Identify active elements 165 | There are two ways to get element-level statistics, depending on how the screen was done. If you have regulatory element annotations, you can use `getElementwiseStats` to identify active elements. If you did a tiling screen, you could use the same approach, but you can also identify active regions in an unbiased way with `getTilingElementwiseStats`. 166 | ### 2a) Get element level stats with sliding window 167 | Now, we can combine adjacent guides in an unbiased way, using a sliding window across the locus to identify regions with more active guides than expected by chance.
Here we use a sliding window of 200 bp, and the default minimum guide number (5). Any 200 bp window with fewer than 5 guides will not be tested. 168 | 169 | ```{r infer sliding window effects} 170 | guideLevelStats$chrom = "chr12"; # we need to tell it what chromosome our guides are on - they're all on chr12 171 | slidingWindowElements = getTilingElementwiseStats(experiments = unique(binReadMat["expt"]), 172 | normNBSummaries = guideLevelStats, tails="both", window = 200, location = "PAM_3primeEnd_coord", 173 | chr="chrom",negativeControl = "isNontargeting") 174 | #override the default chromosome field 'chr' with the GRanges compatible 'chrom' 175 | names(slidingWindowElements)[names(slidingWindowElements)=="chr"]="chrom" 176 | ``` 177 | 178 | Now that we have element-level stats, let's inspect them! First, let's look at the whole locus. 179 | ```{r tiles locus effects} 180 | dhsPos = min(slidingWindowElements$meanZ)*1.05; 181 | p=ggplot(slidingWindowElements, aes(x=start, xend=end, y=meanZ,yend=meanZ, colour=FDR<0.01)) + 182 | ggplot2::geom_segment(size=1)+facet_grid(expt ~.) + theme_classic() + xlab("Genomic position") + 183 | ylab("Element Z score") + geom_hline(yintercept = 0) + 184 | ggplot2::geom_segment(data = dhsPeakBED, aes(x=start, xend=end,y=dhsPos, yend=dhsPos), colour="black"); 185 | print(p) 186 | ``` 187 | 188 | In the above, we see that some regions significantly upregulate or downregulate CD69, and these are often shared between the two replicates. The regions that upregulate CD69 preferentially lie in open chromatin (DHS peaks, shown in black above). 189 | 190 | How correlated are the effect size estimates for the two replicates at the element level?
191 | ```{r tiled replicate scatter} 192 | slidingWindowElementsByRep = cast(slidingWindowElements, chrom + start + end +numGuides ~ expt, 193 | value="meanZ") 194 | p = ggplot(slidingWindowElementsByRep, aes(x=`1`, y=`2`)) + geom_point(size=0.5) + 195 | xlab("Replicate 1 element effect Z score") + ylab("Replicate 2 element effect Z score") + 196 | ggtitle(sprintf("r = %f",cor(slidingWindowElementsByRep$`1`,slidingWindowElementsByRep$`2`)))+ 197 | theme_classic(); 198 | print(p) 199 | ``` 200 | 201 | Quite highly correlated! 202 | 203 | ### 2b) Get element level stats with annotated elements 204 | Since we have annotated elements anyway in `dhsPeakBED`, it might help us to identify which guides should be combined. This would be the only way to do it if we had only targeted these regions to begin with. 205 | 206 | ```{r element-level stats} 207 | #the next command annotates our guides with any DHS peak they lie in. 208 | annotatedGuides = findOverlappingElements(guides = unique(guideLevelStats[!guideLevelStats$isNontargeting, 209 | c("PAM_3primeEnd_coord","gRNA_systematic_name","chrom")]), elements = dhsPeakBED, 210 | elements.start = "start", elements.end = "end", elements.chr = "chrom", 211 | guides.pos = "PAM_3primeEnd_coord", guides.chr = "chrom") 212 | 213 | #merge regulatory element annotations back onto guideLevelStats 214 | guideLevelStats = merge(guideLevelStats, annotatedGuides[c("gRNA_systematic_name", "name")], 215 | by="gRNA_systematic_name", all.x=TRUE) 216 | 217 | #this is where we are actually running MAUDE to find element-level stats 218 | dhsPeakStats = getElementwiseStats(experiments = unique(binReadMat["expt"]), 219 | normNBSummaries = guideLevelStats, negativeControl = "isNontargeting", 220 | elementIDs = "name") # "name" is the peak IDs from the DHS BED file 221 | 222 | #merge peak info back into dhsPeakStats 223 | dhsPeakStats = merge(dhsPeakStats, dhsPeakBED, by="name"); 224 | ``` 225 | 226 | Now we can again look at the activity of the 
elements across the locus: 227 | ```{r element locus effect view} 228 | p=ggplot(dhsPeakStats, aes(x=start, xend=end, y=meanZ,yend=meanZ, colour=FDR<0.01)) + 229 | ggplot2::geom_segment(size=1)+facet_grid(expt ~.) + theme_classic() + xlab("Genomic position") + 230 | ylab("Element Z score") + geom_hline(yintercept = 0); 231 | print(p) 232 | ``` 233 | 234 | And the correlation between replicates: 235 | ```{r element replicate scatter} 236 | dhsPeakStatsByRep = cast(dhsPeakStats, name ~ expt, value="meanZ") 237 | 238 | p = ggplot(dhsPeakStatsByRep, aes(x=`1`, y=`2`)) + geom_point() + 239 | xlab("Replicate 1 DHS effect Z score") + ylab("Replicate 2 DHS effect Z score") + 240 | ggtitle(sprintf("r = %f",cor(dhsPeakStatsByRep$`1`,dhsPeakStatsByRep$`2`)))+theme_classic(); 241 | print(p) 242 | ``` 243 | 244 | Again, these are highly correlated. 245 | In this case, it doesn't look like the regulatory element annotations helped our analysis, but that is not always the case. 246 | 247 | Let's investigate why this might be by looking at the guide effect sizes for the guides within each DHS element. 248 | ```{r guide effects per element} 249 | p=ggplot(guideLevelStats, aes(x=Z, group=name, colour=name == "chr12:9912678-9915275")) + stat_ecdf(alpha=0.3)+ 250 | stat_ecdf(data=guideLevelStats[!is.na(guideLevelStats$name) & 251 | guideLevelStats$name=="chr12:9912678-9915275",], size=1)+ 252 | facet_grid(expt ~.) + theme_classic() + xlab("Guide Z score")+scale_y_continuous(expand=c(0,0)) + 253 | scale_x_continuous(expand=c(0,0)) + scale_colour_manual(values=c("black","red")) + 254 | labs(colour = "CD69 promoter?")+ylab("Cumulative fraction"); 255 | print(p) 256 | ``` 257 | 258 | Here, the promoter DHS peak is in red and all the other DHS peaks are in black. You can appreciate that the promoter tends to have among the most influential guides (rightward shift in the CDF curves).
Even for the promoter, there is substantial variability in the estimated guide effect sizes, ranging from no effect (near 0) to sizable effects (>3). This is probably due to a combination of factors, including experimental noise, how effectively each guide targets the region, and the effect of actually targeting each region. 259 | 260 | Let's zoom in on the promoter to see what exactly is happening here. 261 | ```{r promoter view} 262 | p=ggplot(guideLevelStats[!is.na(guideLevelStats$name) & guideLevelStats$name=="chr12:9912678-9915275",], 263 | aes(x=PAM_3primeEnd_coord, y=Z, colour=expt)) + 264 | geom_point(size=1)+ geom_line()+theme_classic() + xlab("Genomic position") + ylab("Guide Z score")+ 265 | geom_vline(xintercept = 9913497, colour="black"); 266 | print(p) 267 | ``` 268 | 269 | Here, I've marked the transcription start site with a vertical black line. CD69 lies on the antisense strand, so it is transcribed to the left of the black line. The guides targeting the 5' UTR of CD69 (left of black line) appear to be less effective than those in the promoter (right of black line), which is expected. The high correlation in the guide effect sizes between the two experiments indicates that the small effect sizes of many guides cannot be attributed to experimental noise. So most of the signal diversity within the promoter DHS is probably attributable to how effective the guide is at targeting the DNA, and how effective the activation domain of CRISPRa is where it is bound. 270 | 271 | Now let's look more closely at the guides and select a few that work especially well.
272 | ```{r promoter guide zoom} 273 | p=ggplot(guideEffectsByRep[guideEffectsByRep$PAM_3primeEnd_coord < 9915275 & 274 | guideEffectsByRep$PAM_3primeEnd_coord > 9912678,], 275 | aes(x=`2`, y=`1`)) + geom_point() + 276 | geom_text(data=guideEffectsByRep[guideEffectsByRep$PAM_3primeEnd_coord < 9915275 & 277 | guideEffectsByRep$PAM_3primeEnd_coord > 9912678 & 278 | guideEffectsByRep$`1`>2 & guideEffectsByRep$`2`>2,], 279 | aes(label=gRNA_systematic_name)) + 280 | theme_classic()+xlab("Replicate 2 guide Z score") + ylab("Replicate 1 guide Z score"); 281 | print(p) 282 | 283 | ``` 284 | 285 | We can see that, although there are four guides that have a large effect in one replicate, they don't have as great an effect in the other. Here, I've labeled the IDs of three guides that have a Z score over two in both replicates. If I were going to follow up targeting the CD69 promoter with more guides, these are the ones I would use. 286 | 287 | `GenomicRanges` is a great package to work with data associated with genomic coordinates. First, let's restructure our tiling element data.frame so that there is only one row per region and there is one of each data field for each replicate. 288 | 289 | ```{r reshaping window effects} 290 | slidingWindowElementsByReplicate = cast(melt(slidingWindowElements, 291 | id.vars=c("expt","numGuides","chrom","start","end")), 292 | numGuides+chrom+start+end ~ variable+expt, value="value") 293 | head(slidingWindowElementsByReplicate) 294 | ``` 295 | 296 | We can convert our data structures to `GRanges` objects as easily as this: 297 | 298 | ```{r cast to GRanges} 299 | #casting to data.frame is only needed if using cast 300 | slidingWindowElementsByReplicateGR = GRanges(as.data.frame(slidingWindowElementsByReplicate)) 301 | ``` 302 | 303 | Now let's label those ranges that were significant in both replicates, and merge overlapping significant ranges to get a set of expanded active regions. 
304 | 305 | ```{r find doubly significant tiles} 306 | #require that both replicates are significant at an FDR of 0.01 and that the signs agree 307 | slidingWindowElementsByReplicateGR$significantUp = slidingWindowElementsByReplicateGR$FDR_1< 0.01 & 308 | slidingWindowElementsByReplicateGR$FDR_2 < 0.01 & slidingWindowElementsByReplicateGR$meanZ_1 > 0 & 309 | slidingWindowElementsByReplicateGR$meanZ_2 > 0; 310 | slidingWindowElementsByReplicateGR$significantDown = slidingWindowElementsByReplicateGR$FDR_1< 0.01 & 311 | slidingWindowElementsByReplicateGR$FDR_2 < 0.01 & slidingWindowElementsByReplicateGR$meanZ_1 < 0 & 312 | slidingWindowElementsByReplicateGR$meanZ_2 < 0; 313 | 314 | #merge overlapping regions in each set 315 | overlappingSlidingWindowElementsUp = 316 | reduce(slidingWindowElementsByReplicateGR[slidingWindowElementsByReplicateGR$significantUp]) 317 | overlappingSlidingWindowElementsDown = 318 | reduce(slidingWindowElementsByReplicateGR[slidingWindowElementsByReplicateGR$significantDown]) 319 | ``` 320 | 321 | Now let's plot it all together as a genome browser track! 322 | 323 | ```{r genome browser view, fig.width=10, fig.height=5} 324 | 325 | #which gene models do I want to plot? 326 | data(genesymbol, package = "biovizBase") 327 | wh <- genesymbol[c("CD69", "CLECL1", "KLRF1", "CLEC2D","CLEC2B")] 328 | wh <- range(wh, ignore.strand = TRUE) 329 | 330 | #make the genome tracks 331 | tracks(autoplot(Homo.sapiens, which = wh, gap.geom="chevron"), 332 | autoplot(overlappingSlidingWindowElementsUp, fill="red"), 333 | autoplot(overlappingSlidingWindowElementsDown, fill="blue"), heights=c(5,2,2)) + theme_classic() 334 | ``` 335 | 336 | Here, the red regions are CD69-activating and the blue regions are CD69-repressing.
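
If you want to carry merged regions like these into other tools, one option is to write them out as a BED file. The following is only a minimal sketch and is not part of MAUDE: `toyWindows`, `toyRegions`, `toyBED`, the coordinates, and the output file name are all hypothetical stand-ins for the significant sliding windows above. It uses the same `reduce` call as above and assumes the usual BED convention of 0-based start coordinates.

```{r export merged regions sketch}
library(GenomicRanges)

# hypothetical overlapping windows standing in for the significant sliding windows
toyWindows = GRanges(seqnames = "chr12",
                     ranges = IRanges(start = c(100, 150, 400), end = c(299, 349, 599)))

# merge overlapping windows into contiguous regions, as done above with reduce()
toyRegions = reduce(toyWindows)

# GRanges coordinates are 1-based and inclusive; BED starts are 0-based, so subtract 1
toyBED = data.frame(chrom = as.character(seqnames(toyRegions)),
                    start = start(toyRegions) - 1,
                    end   = end(toyRegions))

# write a headerless, tab-delimited BED file to a temporary directory
write.table(toyBED, file = file.path(tempdir(), "toy_regions.bed"), sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

With real data, `toyWindows` would be replaced by the significant subset of `slidingWindowElementsByReplicateGR`.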
337 | 338 | -------------------------------------------------------------------------------- /vignettes/simulated_data_tutorial.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "A tutorial with simulated data" 3 | author: "Carl de Boer" 4 | date: "5/21/2020" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{A tutorial with simulated data} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | \usepackage[utf8]{inputenc} 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = TRUE) 14 | ``` 15 | 16 | # A tutorial with simulated data 17 | 18 | The following is an example that relies on simulated data and makes use of the `ggplot2` and `reshape` R packages. Since there is no data to load for this example, it should be relatively easy to run and to inspect the variables to better understand the desired formats. This should also help make clear the underlying assumptions of the model since the data are generated with many of these same assumptions. 19 | 20 | ## Create the simulated experimental data 21 | Load the required packages and set the seed. You only need to set the seed if you want your analysis to get exactly the same results as what is shown here. Otherwise, you can skip this step (and note that if you run things in a different order, the results may differ). 22 | ```{r load libraries, results="hide", message = FALSE, warning=FALSE} 23 | library(ggplot2) 24 | library(reshape) 25 | library(MAUDE) 26 | 27 | set.seed(76484377) 28 | ``` 29 | 30 | Set up the simulation. Here, we're using 5 guides per element, and 200 elements, 100 of which have no effect on expression. The half that affect expression have effect sizes ranging from 0.01 to 1 standard deviations. We also include 1000 non-targeting guides. 
31 | ```{r simulation setup} 32 | #ground truth 33 | groundTruth = data.frame(element = 1:200, meanEffect = c((1:100)/100,rep(0,100))) #targeting 200 elements, half of which do nothing 34 | 35 | #guide - element map; 5 guides per element; gid is the guide ID 36 | guideMap = data.frame(element = rep(groundTruth$element, 5), gid = 1:(5*nrow(groundTruth)), NT=FALSE, mean=rep(groundTruth$meanEffect, 5)) 37 | guideMap = rbind(guideMap, data.frame(element = NA, gid = (1:1000)+nrow(guideMap), NT=TRUE, mean=0)); # 1000 non-targeting guides 38 | 39 | guideMap$abundance = rpois(n=nrow(guideMap), lambda=1000); #guide abundance drawn from a Poisson distribution with mean=1000 40 | guideMap$cells = rpois(n=nrow(guideMap), lambda=guideMap$abundance); #cell count drawn from a Poisson distribution with mean equal to the abundance from above 41 | 42 | #create observations for different guides, with expression drawn from normal(mean=mean, sd=1) 43 | cellObservations = data.frame(gid = rep(guideMap$gid, guideMap$cells)) 44 | cellObservations = merge(cellObservations, guideMap, by="gid") 45 | cellObservations$expression = rnorm(n=nrow(cellObservations), mean=cellObservations$mean); 46 | 47 | #create the bin model for this experiment - this represents 6 bins, each of which captures 10% of cells, where A+B+C catch the bottom ~30% and D+E+F catch the top ~30%; in an actual experiment, the true captured fractions should be used here. 48 | binBounds = makeBinModel(data.frame(Bin=c("A","B","C","D","E","F"), fraction=rep(0.1,6))) 49 | if(FALSE){ 50 | # in reality, we shouldn't assume this distribution is exactly normal - we can re-assign expression bin bounds based on quantiles 51 | # of the actual simulated expression distribution. If you run the next two lines, the answer will improve slightly, but the 52 | # resulting graphs will look slightly different than those below.
53 | # correct for the actual distribution 54 | binBounds$binStartZ = quantile(cellObservations$expression, probs = binBounds$binStartQ); 55 | binBounds$binEndZ = quantile(cellObservations$expression, probs = binBounds$binEndQ); 56 | } 57 | 58 | # select some examples to inspect for both 59 | exampleNT = sample(guideMap$gid[guideMap$NT],10);# non-targeting 60 | exampleT = sample(guideMap$gid[!guideMap$NT],5);# and targeting guides 61 | 62 | #plot the select examples and show the bin structure 63 | p = ggplot(cellObservations[cellObservations$gid %in% c(exampleT, exampleNT),], 64 | aes(x=expression, group=gid, fill=NT))+geom_density(alpha=0.2) + 65 | geom_vline(xintercept = sort(unique(c(binBounds$binStartZ,binBounds$binEndZ))),colour="gray") + 66 | theme_classic() + scale_fill_manual(values=c("red","darkgray")) + xlab("Target expression") + 67 | scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + 68 | coord_cartesian(xlim=c(min(cellObservations$expression), max(cellObservations$expression)))+ 69 | geom_segment(data=binBounds, aes(x=binStartZ, xend=binEndZ, colour=Bin, y=0, yend=0), size=5, inherit.aes = FALSE); 70 | print(p) 71 | ``` 72 | 73 | 74 | Now that we've set up our experimental system, we simulate actually capturing cells in the bins 75 | ```{r simulate sorting cells} 76 | #for each bin, find which cells landed within the bin bounds 77 | for(i in 1:nrow(binBounds)){ 78 | cellObservations[[as.character(binBounds$Bin[i])]] = 79 | cellObservations$expression > binBounds$binStartZ[i] & cellObservations$expression < binBounds$binEndZ[i]; 80 | } 81 | 82 | #count the number of cells that ended up in each bin for each guide 83 | binLevelData = cast(melt(cellObservations[c("gid","element","NT",as.character(binBounds$Bin))], 84 | id.vars=c("gid","element","NT")), gid + element + NT + variable ~ ., fun.aggregate = sum) 85 | names(binLevelData)[(ncol(binLevelData)-1):ncol(binLevelData)]=c("Bin","cells"); 86 | 87 | #plot the cells/bin for each of 
our example guides 88 | p = ggplot(binLevelData[binLevelData$gid %in% exampleNT,], aes(x=Bin, group=Bin, y=cells)) + 89 | geom_boxplot(fill="darkgray")+geom_line(data=binLevelData[binLevelData$gid %in% exampleT,], 90 | aes(group=gid), colour="red")+ theme_classic() + xlab("Expression bin") + 91 | ylab("Captured cells/bin") + scale_y_continuous(expand=c(0,0)); print(p) 92 | ``` 93 | 94 | Now, we simulate sequencing the guides that end up in each bin, assuming we sequence 10 reads per sorted cell 95 | ```{r simulate sequencing} 96 | #get the total number of cells sorted into each of the bins 97 | totalCellsPerBin = cast(binLevelData,Bin ~ ., value="cells", fun.aggregate = sum) 98 | names(totalCellsPerBin)[ncol(totalCellsPerBin)]="totalCells"; 99 | 100 | #add bin totals to binLevelData 101 | binLevelData = merge(binLevelData, totalCellsPerBin, by="Bin") 102 | 103 | #generate reads for each guide per bin, following a negative binomial distribution 104 | #n=number of observations; size= total reads per bin; prob= probability of not getting a read at each drawing 105 | binLevelData$reads = rnbinom(n=nrow(binLevelData), size=binLevelData$totalCells*10, 106 | prob=1- binLevelData$cells/binLevelData$totalCells) 107 | 108 | #plot the distribution of reads for each example guide 109 | p = ggplot(binLevelData[binLevelData$gid %in% exampleNT,], aes(x=Bin, group=Bin, y=reads)) + 110 | geom_boxplot(fill="darkgray")+ 111 | geom_line(data=binLevelData[binLevelData$gid %in% exampleT,], aes(group=gid), colour="red")+ 112 | theme_classic() + xlab("Expression bin") + ylab("Reads/bin") + scale_y_continuous(expand=c(0,0)); 113 | print(p) 114 | ``` 115 | 116 | 117 | ```{r simulate unsorted cells} 118 | binReadMat = cast(binLevelData, element+gid+NT ~ Bin, value="reads") 119 | 120 | #pretend we sequence the unsorted cells to similar coverage as above: 121 | guideMap$NS = rnbinom(n=nrow(guideMap), size=sum(guideMap$cells)*10, 122 | prob=1- guideMap$abundance/sum(guideMap$cells)) 123 | 
binReadMat = merge(binReadMat, guideMap[c("gid","NS")], by="gid")

binReadMat$screen="test"; # here, we're only doing the one screen - this simulation
binBounds$screen="test";
```

## Run MAUDE

Up until now, we have been building up the data in a way that simulates an actual experiment. With the exception of defining the bin model(s), most of the above is not required for a real experiment (nor even possible, since we inspected hidden variables). Only the next two commands would be run for a real experiment.
```{r run MAUDE}
# get guide-level stats
guideLevelStats = findGuideHitsAllScreens(unique(binReadMat["screen"]), binReadMat, binBounds)

#get element level stats
elementLevelStats = getElementwiseStats(unique(guideLevelStats["screen"]),
                                        guideLevelStats, elementIDs="element",tails="upper")
```

Plot the actual effect sizes (those we defined at the beginning) compared with those predicted by MAUDE:
```{r plot effects}
elementLevelStats = merge(elementLevelStats, groundTruth, by="element")

p = ggplot(elementLevelStats, aes(x=meanEffect, y=meanZ, colour=FDR<0.01)) + geom_point()+
  geom_abline(intercept = 0, slope=1) + theme_classic() + scale_colour_manual(values=c("darkgray","red")) +
  xlab("True effect") + ylab("Inferred effect"); print(p)
```

We can see that the two are highly correlated. At more extreme effect sizes, MAUDE underestimates the effect size because of the uniform prior applied (in the form of a pseudocount). However, biological data can be very noisy, so we highly recommend this prior.

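To build intuition for why a uniform pseudocount shrinks extreme effects, here is a minimal sketch on made-up numbers. It is not MAUDE's likelihood model: `binIndex`, `reads`, and `meanBin` are hypothetical names introduced only for this illustration, and the "mean bin position" is a crude stand-in for the inferred effect.

```r
# Toy illustration only (assumed numbers, not MAUDE's actual model):
# a uniform pseudocount pulls a skewed bin profile back toward the centre.
binIndex = 1:6                   # hypothetical bins A-F, coded 1..6
reads    = c(0, 0, 1, 2, 7, 40)  # hypothetical reads for one strongly shifted guide

# mean bin position, weighting each bin by its share of the reads
meanBin = function(counts) sum(binIndex * counts) / sum(counts)

# estimate without any prior
rawMean = meanBin(reads)

# estimates after adding increasingly large uniform pseudocounts to every bin
pseudocounts = c(1, 10, 100)
shrunkMean = sapply(pseudocounts, function(pc) meanBin(reads + pc))
names(shrunkMean) = paste0("pc=", pseudocounts)

print(rawMean)
print(shrunkMean)
```

The larger the pseudocount relative to the observed reads, the closer the estimate moves to the middle of the bins (3.5 here). That is the same qualitative behaviour seen for the most extreme elements in the plot, and it is the price paid for robustness to noisy, low-coverage counts.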

Zoom in on the bottom left:
```{r effects zoom in}
p = ggplot(elementLevelStats, aes(x=meanEffect, y=meanZ, colour=FDR<0.01)) + geom_point()+
  geom_abline(intercept = 0, slope=1) + theme_classic() + scale_colour_manual(values=c("darkgray","red")) +
  coord_cartesian(xlim = c(0,0.1),ylim = c(0,0.1)) + xlab("True effect") + ylab("Inferred effect");
print(p)
```

In this particular simulation we have one false positive out of nearly 100 true positives (consistent with our FDR cutoff of 0.01), and three false negatives with very small effect sizes.

--------------------------------------------------------------------------------