├── .Rbuildignore
├── CommonProblems.md
├── DESCRIPTION
├── Evaluation
    ├── method_evaluation.R
    ├── method_evaluation.Rmd
    └── method_evaluation.html
├── LICENSE
├── NAMESPACE
├── NEWS
├── R
    └── MAUDE.R
├── README.md
├── doc
    ├── BACH2_base_editor_screen.R
    ├── BACH2_base_editor_screen.html
    ├── CD69_tutorial.R
    ├── CD69_tutorial.html
    ├── simulated_data_tutorial.R
    └── simulated_data_tutorial.html
├── images
    └── logo2.png
├── inst
    ├── CITATION
    └── extdata
    │   ├── CD69_bin_percentiles.txt
    │   └── Encode_Jurkat_DHS_both.merged.bed
├── man
    ├── calcFDRByExperiment.Rd
    ├── combineZStouffer.Rd
    ├── findGuideHits.Rd
    ├── findGuideHitsAllScreens.Rd
    ├── findOverlappingElements.Rd
    ├── getElementwiseStats.Rd
    ├── getNBGaussianLikelihood.Rd
    ├── getTilingElementwiseStats.Rd
    ├── getZScalesWithNTGuides.Rd
    └── makeBinModel.Rd
└── vignettes
    ├── BACH2_base_editor_screen.Rmd
    ├── BACH2_data
        ├── loaded_BACH2_BE_CRISPResso_data.txt.gz
        └── sample_metadata.txt
    ├── CD69_tutorial.Rmd
    └── simulated_data_tutorial.Rmd


/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^doc$
2 | ^Meta$
3 | ^Evaluation$
4 | ^images$
5 | ^CommonProblems\.md$
6 | ^.*\.Rproj$
7 | ^\.Rproj\.user$
8 | 


--------------------------------------------------------------------------------
/CommonProblems.md:
--------------------------------------------------------------------------------
 1 | # Common Problems
 2 | Here are some common problems and their solutions.
 3 | 
 4 | ## `Mu optimization returned NaN objective: restricting search space`
 5 | This warning indicates that when optimizing the mu value for a guide, the log likelihood function returned NaN. 
 6 | This warning does not necessarily mean there is a problem.
 7 | 
 8 | There are three usual reasons for this:
 9 | ### 1) Data with very low coverage in the "input" and no pseudocount added (e.g. if a guide is 0% of the library, the likelihood of >1 reads for it in bin A is NaN)
10 | To resolve the warning, ensure that you add a pseudocount to the data.
11 | 
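A minimal sketch, in base R, of why a pseudocount avoids the NaN. The guide names, read counts, and the pseudocount value of 10 are hypothetical and chosen only for illustration; this is not MAUDE's internal code.

```r
# Hypothetical read counts for three guides in the unsorted ("input") sample.
inputCounts <- c(guideA = 0, guideB = 120, guideC = 880)

# Without a pseudocount, guideA has an expected library fraction of exactly 0.
rawFraction <- inputCounts / sum(inputCounts)

# A zero fraction produces undefined log-likelihood terms such as 0 * log(0).
rawTerm <- rawFraction["guideA"] * log(rawFraction["guideA"])
print(is.nan(rawTerm))

# Adding a pseudocount (10 is an arbitrary illustrative value) to every guide
# keeps all fractions strictly positive, so the same term becomes finite.
pseudocount <- 10
adjFraction <- (inputCounts + pseudocount) / sum(inputCounts + pseudocount)
adjTerm <- adjFraction["guideA"] * log(adjFraction["guideA"])
print(is.finite(adjTerm))
```

The same idea applies to any count table: add the pseudocount before converting counts to fractions.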
12 | ### 2) Data where one "guide" has very high abundance (e.g. if a "guide" makes up 89% of your library, a Z score of 1 might mean that you now expect OVER 100% to end up in bin F)
13 | Usually, MAUDE's automatic restriction of the search space resolves this. The warning will still be shown, but you can safely ignore it.
14 | 
15 | ### 3) The mu search limits are too wide (an extreme mu is so unlikely that the log likelihood is infinite, leading to this warning)
16 | If you only get this message a few times, there's no need to adjust anything; MAUDE automatically adjusts the search space. If many warnings are issued, we recommend adjusting the `limits` parameter, which defaults to `c(-4,4)`. Moving the limits closer to 0 will eliminate the warning message.
17 | 
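The effect of the search limits can be sketched with base R's `optimize()`. The likelihood below is a toy Gaussian stand-in, not MAUDE's likelihood, and `toyLogLik`, `observed`, and the values used are hypothetical. It only illustrates how an extreme mu yields an infinite log likelihood while a search restricted to narrow limits stays finite.

```r
# Toy log likelihood: log density of hypothetical observations under a
# unit-variance Gaussian centred at mu. It is deliberately computed as
# log(dnorm(...)) so that densities which underflow to 0 give -Inf.
observed <- c(-0.2, 0.1, 0.4)
toyLogLik <- function(mu) sum(log(dnorm(observed, mean = mu, sd = 1)))

# At an extreme mu the density underflows to 0 and the log likelihood is -Inf.
extremeValue <- toyLogLik(50)
print(extremeValue)

# Restricting the search to narrow limits keeps the objective finite.
limits <- c(-4, 4)
fit <- optimize(toyLogLik, interval = limits, maximum = TRUE)
print(fit$maximum)
print(fit$objective)
```

Arithmetic on infinite values (for example subtracting two `-Inf` log likelihoods) gives NaN, which is the kind of result the warning reports.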
18 | ## Other problems
19 | For any other problems you encounter, please submit an Issue or contact the authors so that we can fix or clarify things as necessary.
20 | 


--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: MAUDE
 2 | Title: Mean Alterations Using Discrete Expression
 3 | Version: 0.99.4
 4 | Authors@R: person("Carl", "de Boer", email = "carl.deboer@ubc.ca", role = c("aut", "cre"))
 5 | Description: This package is useful for inferring the difference in the mean expression of some entity as quantified by discrete measurements obtained by binning across expression space. One example of where this is useful is with pooled CRISPR screens with a FACS (cell sorting into discrete bins) readout. For example, you introduce a guide library into your CRISPRi-competent cells, sort them by expression of a reporter gene targeted by your guides, and want to know which guides altered expression of the reporter gene.
 6 | Depends: R (>= 4.0), reshape, stats
 7 | Suggests: knitr, rmarkdown, ggplot2, openxlsx, ggbio, GenomicRanges, Homo.sapiens, biovizBase, edgeR, DESeq2, cowplot
 8 | VignetteBuilder: knitr
 9 | License: MIT + file LICENSE
10 | Encoding: UTF-8
11 | LazyData: true
12 | RoxygenNote: 7.1.1
13 | biocViews: FunctionalPrediction, DifferentialExpression, GeneRegulation, GeneTarget, GenomeAnnotation, PeakDetection, FunctionalGenomics, Genetics, Transcriptomics, SystemsBiology, CRISPR, FlowCytometry, DNASeq, PooledScreens, Annotation, ExperimentalDesign, Normalization, QualityControl, GeneExpression, Transcription, SNP, GeneticVariability
14 | 


--------------------------------------------------------------------------------
/Evaluation/method_evaluation.R:
--------------------------------------------------------------------------------
  1 | ## ----setup, include=FALSE-----------------------------------------------------
  2 | knitr::opts_chunk$set(echo = TRUE)
  3 | 
  4 | ## ----load libraries, results="hide", message = FALSE, warning=FALSE-----------
  5 | library("ggplot2")
  6 | library(reshape)
  7 | library(cowplot)
  8 | library(MAUDE)
  9 | library(openxlsx)
 10 | library(edgeR)
 11 | library(DESeq2)
 12 | 
 13 | ## ----set seed-----------------------------------------------------------------
 14 | set.seed(35263377)
 15 | 
 16 | ## ----Load and parse CD69 data-------------------------------------------------
 17 | #a mapping to unify bin names from Simeonov data
 18 | binmapBack = list("baseline" = "baseline", "low"="low", "medium"="medium","high"="high","back_" = "NS",
 19 |                   "baseline_" = "baseline", "low_"="low", "medium_"="medium", "high_"="high", 
 20 |                   "A"="baseline", "B"="low", "E" = "medium", "F"="high")
 21 | 
 22 | #this comes from manually reconstructing the CD69 density curve from extended data figure 1a (Simeonov et al)
 23 | binBoundsCD69 = data.frame(Bin = c("A","F","B","E"), 
 24 |                            fraction = c(0.65747100, 0.02792824, 0.25146688, 0.06313389), 
 25 |                            stringsAsFactors = FALSE) 
 26 | fractionalBinBounds = makeBinModel(binBoundsCD69[c("Bin","fraction")])
 27 | fractionalBinBounds = rbind(fractionalBinBounds, fractionalBinBounds)
 28 | fractionalBinBounds$screen = c(rep("1",6),rep("2",6));
 29 | #only keep bins A,B,E,F
 30 | fractionalBinBounds = fractionalBinBounds[fractionalBinBounds$Bin %in% c("A","B","E","F"),]
 31 | fractionalBinBounds$Bin = unlist(binmapBack[fractionalBinBounds$Bin]);
 32 | 
 33 | #load data
 34 | cd69OriginalResults = read.xlsx('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5675716/bin/NIHMS913084-supplement-supplementary_table_1.xlsx')
 35 | cd69OriginalResults$NT = grepl("negative_control", cd69OriginalResults$gRNA_systematic_name)
 36 | cd69OriginalResults$pos = cd69OriginalResults$PAM_3primeEnd_coord;
 37 | cd69OriginalResults = unique(cd69OriginalResults)
 38 | cd69CountData = melt(cd69OriginalResults, id.vars = c("pos","NT","gRNA_systematic_name"))
 39 | cd69CountData = cd69CountData[grepl("\\.count$",cd69CountData$variable),]
 40 | cd69CountData$theirBin = gsub("CD69(.*)([12])\\.count","\\1",cd69CountData$variable)
 41 | cd69CountData$screen = gsub("CD69(.*)([12])\\.count","\\2",cd69CountData$variable)
 42 | cd69CountData$reads= as.numeric(cd69CountData$value); cd69CountData$value=NULL;
 43 | # convert their bin to one that is consistent
 44 | cd69CountData$Bin = unlist(binmapBack[cd69CountData$theirBin]); 
 45 | binReadMat = data.frame(cast(cd69CountData[!is.na(cd69CountData$pos) | cd69CountData$NT,], pos+gRNA_systematic_name+NT+screen ~ Bin, value="reads"))
 46 | 
 47 | ## ----calc logFC---------------------------------------------------------------
 48 | #confirm how to calc log2FC:
 49 | cd69OriginalResults$l2fc.vsbg1 = log2((1+ cd69OriginalResults$CD69high_1.count.norm)/(1+cd69OriginalResults$CD69back_1.count.norm))
 50 | cd69OriginalResults$l2fc.vsbg2 = log2((1+ cd69OriginalResults$CD69high_2.count.norm)/(1+cd69OriginalResults$CD69back_2.count.norm))
 51 | cd69OriginalResults$l2fc.hilo1 = log2((1+ cd69OriginalResults$CD69high_1.count.norm)/(1+cd69OriginalResults$CD69baseline_1.count.norm))
 52 | cd69OriginalResults$l2fc.hilo2 = log2((1+ cd69OriginalResults$CD69high_2.count.norm)/(1+cd69OriginalResults$CD69baseline_2.count.norm))
 53 | #confirm that the log2 FC we just calculated matches the originally reported values
 54 | p=ggplot(cd69OriginalResults, aes(x=high2.l2fc,y= l2fc.vsbg2)) + geom_point(); print(p)
 55 | 
 56 | ## ----run methods, results="hide", message = FALSE, warning=FALSE--------------
 57 | #edgeR
 58 | x <- unique(cd69OriginalResults[c("gRNA_systematic_name","CD69baseline_1.count","CD69baseline_2.count", 
 59 |                                   "CD69high_1.count", "CD69high2.count")])
 60 | row.names(x) = x$gRNA_systematic_name; x$gRNA_systematic_name=NULL;
 61 | x=as.matrix(x)
 62 | group <- factor(c(1,1,2,2))
 63 | y <- DGEList(counts=x,group=group)
 64 | y <- calcNormFactors(y)
 65 | design <- model.matrix(~group)
 66 | y <- estimateDisp(y,design)
 67 | #To perform likelihood ratio tests:
 68 | fit <- glmFit(y,design)
 69 | lrt <- glmLRT(fit,coef=2)
 70 | edgeRGuideLevel = topTags(lrt, n=nrow(x))@.Data[[1]]
 71 | edgeRGuideLevel$gRNA_systematic_name = row.names(edgeRGuideLevel);
 72 | edgeRGuideLevel$metric = edgeRGuideLevel$logFC;
 73 | edgeRGuideLevel$significance = -log(edgeRGuideLevel$PValue);
 74 | edgeRGuideLevel$method="edgeR";
 75 | 
 76 | #DESeq2
 77 | deseqGroups = data.frame(bin=factor(c(1,1,2,2)));
 78 | row.names(deseqGroups) = c("CD69baseline_1.count","CD69baseline_2.count", 
 79 |                            "CD69high_1.count", "CD69high2.count");
 80 | dds <- DESeqDataSetFromMatrix(countData = x,colData = deseqGroups, design= ~ bin)
 81 | dds <- DESeq(dds)
 82 | res <- results(dds, name=resultsNames(dds)[2])
 83 | 
 84 | deseqGuideLevel = as.data.frame(res@listData)
 85 | deseqGuideLevel$gRNA_systematic_name =res@rownames;
 86 | deseqGuideLevel$metric = deseqGuideLevel$log2FoldChange;
 87 | deseqGuideLevel$significance = -log(deseqGuideLevel$pvalue);
 88 | deseqGuideLevel$method="DESeq2";
 89 | 
 90 | #MAUDE
 91 | guideLevelStatsCD69 = findGuideHitsAllScreens(unique(binReadMat["screen"]), binReadMat, fractionalBinBounds, sortBins = c("baseline","high","low","medium"), unsortedBin = "NS")
 92 | guideLevelStatsCD69$chr="chr12"
 93 | guideLevelStatsCastCD69 = cast(guideLevelStatsCD69, gRNA_systematic_name + pos+NT ~ screen, value="Z")
 94 | names(guideLevelStatsCastCD69)[ncol(guideLevelStatsCastCD69)-1:0]=c("s1","s2")
 95 | guideLevelStatsCastCD69$significance = apply(guideLevelStatsCastCD69[c("s1","s2")],1, combineZStouffer)
 96 | guideLevelStatsCastCD69$metric=apply(guideLevelStatsCastCD69[c("s1","s2")],1, mean)
 97 | guideLevelStatsCastCD69$method = "MAUDE"
 98 | 
 99 | #Two log fold change methods
100 | cd69OriginalResultsHiLow = cd69OriginalResults[c("gRNA_systematic_name","l2fc.hilo1","l2fc.hilo2")]
101 | cd69OriginalResultsVsBG = cd69OriginalResults[c("gRNA_systematic_name","l2fc.vsbg1","l2fc.vsbg2")]
102 | cd69OriginalResultsHiLow$significance = apply(cd69OriginalResultsHiLow[2:3], 1, FUN = mean)
103 | cd69OriginalResultsHiLow$metric = apply(cd69OriginalResultsHiLow[2:3], 1, FUN = mean)
104 | cd69OriginalResultsVsBG$significance = apply(cd69OriginalResultsVsBG[2:3], 1, FUN = mean)
105 | cd69OriginalResultsVsBG$metric = apply(cd69OriginalResultsVsBG[2:3], 1, FUN = mean)
106 | cd69OriginalResultsHiLow$method="logHivsLow"
107 | cd69OriginalResultsVsBG$method="logVsUnsorted"
108 | 
109 | ## ----compile results----------------------------------------------------------
110 | # predictions
111 | allResults = rbind(cd69OriginalResultsVsBG[c("method","gRNA_systematic_name","significance","metric")],
112 |                    cd69OriginalResultsHiLow[c("method","gRNA_systematic_name","significance","metric")],
113 |                    deseqGuideLevel[c("method","gRNA_systematic_name","significance","metric")], 
114 |                    edgeRGuideLevel[c("method","gRNA_systematic_name","significance","metric")],
115 |                    guideLevelStatsCastCD69[c("method","gRNA_systematic_name","significance","metric")]) 
116 | 
117 | 
118 | allResults = merge(allResults, cd69OriginalResults[c("gRNA_systematic_name","NT","pos")],
119 |                    by="gRNA_systematic_name")
120 | allResults = allResults[!is.na(allResults$pos) | allResults$NT,]
121 | allResults$promoter  = allResults$pos <= 9913996 & allResults$pos >= 9912997
122 | allResults$gID = allResults$gRNA_systematic_name; allResults$gRNA_systematic_name=NULL;
123 | allResults$locus ="CD69"
124 | allResults$type ="CRISPRa"
125 | allResults$celltype ="Jurkat"
126 | 
127 | ## ----input and parse TNFAIP3 data---------------------------------------------
128 | 
129 | ##read in TNFAIP3 data
130 | binFractionsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPR_FF_bin_fractions.txt.gz")))), 
131 |                              sep="\t", stringsAsFactors = FALSE, header = TRUE)
132 | CRISPRaCountsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPRa_FF_countMatrix.txt.gz")))), 
133 |                               sep="\t", stringsAsFactors = FALSE, header = TRUE)
134 | CRISPRiCountsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPRi_FF_countMatrix.txt.gz")))), 
135 |                               sep="\t", stringsAsFactors = FALSE, header = TRUE)
136 | crispraGuides = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20180205_selected_CRISPRa_guides_seq.txt.gz")))), 
137 |                            sep="\t", stringsAsFactors = FALSE, header = TRUE)
138 | crispraGuides$seq = gsub("(.*)(.{3})","\\1",crispraGuides$seq.w.PAM)
139 | crispraGuides$pos = round((as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\2",
140 |                                            crispraGuides$guideID))+
141 |                              as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\3",
142 |                                              crispraGuides$guideID)))/2)
143 | 
144 | CRISPRaCountsA20 = merge(CRISPRaCountsA20, unique(crispraGuides[c("seq","pos")]), by=c("seq"), all.x=TRUE)
145 | 
146 | CRISPRiCountsA20$pos = round((as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\2",
147 |                                               CRISPRiCountsA20$gID))+
148 |                                 as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\3",
149 |                                               CRISPRiCountsA20$gID)))/2)
150 | 
151 | binFractionsA20$expt = paste(binFractionsA20$celltype, binFractionsA20$CRISPRType,sep="_")
152 | 
153 | #merge CRISPRi and CRISPRa
154 | A20CountData = melt(CRISPRaCountsA20, id.vars=c("seq","elementClass","element","NT","gID","pos"))
155 | A20CountData$CRISPRType="CRISPRa";
156 | A20CountData2 = melt(CRISPRiCountsA20, id.vars=c("seq","elementClass","element","NT","gID","pos"))
157 | A20CountData2$CRISPRType="CRISPRi";
158 | A20CountData = rbind(A20CountData, A20CountData2)
159 | rm('A20CountData2');
160 | A20CountData$count = A20CountData$value; A20CountData$value=NULL;
161 | A20CountData$sample = A20CountData$variable; A20CountData$variable=NULL;
162 | A20CountData$bin = gsub("(.*)_(.*)_(.*)", "\\3", A20CountData$sample);
163 | A20CountData$screen = gsub("(.*)_(.*)_(.*)", "\\2", A20CountData$sample);
164 | A20CountData$celltype = gsub("(.*)_(.*)_(.*)", "\\1", A20CountData$sample);
165 | A20CountData$expt = paste(A20CountData$celltype, A20CountData$CRISPRType,sep="_")
166 | 
167 | ## ----start evaluation and run on TNFAIP3, results="hide", message = FALSE, warning=FALSE----
168 | #combine CD69 metrics 
169 | # metricsBoth will contain all our evaluation metrics, starting with Pearson's r between replicates
170 | metricsBoth = 
171 |   data.frame(method=c("MAUDE","logHivsLow","logVsUnsorted"), metric="r",which="replicate_correl",
172 |              value=c(cor(guideLevelStatsCastCD69$s1[!guideLevelStatsCastCD69$NT],
173 |                          guideLevelStatsCastCD69$s2[!guideLevelStatsCastCD69$NT]), 
174 |                      cor(cd69OriginalResults$l2fc.hilo1[!cd69OriginalResults$NT],
175 |                          cd69OriginalResults$l2fc.hilo2[!cd69OriginalResults$NT]),
176 |                      cor(cd69OriginalResults$l2fc.vsbg1[!cd69OriginalResults$NT],
177 |                          cd69OriginalResults$l2fc.vsbg2[!cd69OriginalResults$NT])),
178 |              sig=c(cor.test(guideLevelStatsCastCD69$s1[!guideLevelStatsCastCD69$NT],
179 |                             guideLevelStatsCastCD69$s2[!guideLevelStatsCastCD69$NT])$p.value, 
180 |                    cor.test(cd69OriginalResults$l2fc.hilo1[!cd69OriginalResults$NT],
181 |                             cd69OriginalResults$l2fc.hilo2[!cd69OriginalResults$NT])$p.value,
182 |                    cor.test(cd69OriginalResults$l2fc.vsbg1[!cd69OriginalResults$NT],
183 |                             cd69OriginalResults$l2fc.vsbg2[!cd69OriginalResults$NT])$p.value),
184 |              locus="CD69",type="CRISPRa",celltype="Jurkat",stringsAsFactors = FALSE)
185 | 
186 | allResultsBoth = allResults; #combined results (significance, effect sizes) for CD69 and A20 screens
187 | for (e in unique(binFractionsA20$expt)){
188 |   curCelltype = gsub("(.*)_(.*)", "\\1", e);
189 |   curtype = gsub("(.*)_(.*)", "\\2", e);
190 |   curA20CountData = unique(A20CountData[A20CountData$expt==e,
191 |                                         c("seq","pos", "NT","gID","count","screen","bin")])
192 |   curA20CountDataTotals = cast(curA20CountData, screen +bin~ ., fun.aggregate = sum, value="count")
193 |   names(curA20CountDataTotals)[3] = "total";
194 |   curA20CountData = merge(curA20CountData, curA20CountDataTotals, by=c("screen","bin"))
195 |   curA20CountData$CPM = curA20CountData$count/curA20CountData$total * 1E6;
196 |   curCPMMat = cast(curA20CountData, seq + NT + gID + screen + pos ~ bin, value="CPM")
197 |   curCPMMat$l2fc_hilo = log2((1+curCPMMat$F)/(1+curCPMMat$A))
198 |   if(curtype=="CRISPRi"){
199 |     curCPMMat$l2fc_vsbg = log2((1+curCPMMat$NS)/(1+curCPMMat$A))
200 |   }else{ #CRISPRa
201 |     curCPMMat$l2fc_vsbg = log2((1+curCPMMat$F)/(1+curCPMMat$NS))
202 |   }
203 |   curBins = as.data.frame(melt(binFractionsA20[binFractionsA20$expt==e,], 
204 |                                id.vars = c("celltype","screen","CRISPRType","expt")))
205 |   names(curBins)[ncol(curBins) - (1:0)] = c("Bin","fraction")
206 |   curBins2 = data.frame();
207 |   for (s in unique(curBins$screen)){
208 |     curBins3 = makeBinModel(curBins[curBins$screen==s,c("Bin","fraction")])
209 |     curBins3$screen = s;
210 |     curBins2 = rbind(curBins2, curBins3)
211 |   }
212 |   curBins2$Bin = as.character(curBins2$Bin);
213 |   curCountMat = cast(curA20CountData, seq + NT + gID +pos + screen ~ bin, value="count")
214 |   guideLevelStats = findGuideHitsAllScreens(experiments = unique(curCountMat["screen"]), 
215 |                                             countDataFrame = curCountMat, binStats = curBins2, 
216 |                                             sortBins = c("A","B","C","D","E","F"), unsortedBin = "NS", 
217 |                                             negativeControl="NT")
218 |   
219 |   guideLevelStatsCast = cast(guideLevelStats, gID + pos+NT ~ screen, value="Z")
220 |   #names(guideLevelStatsCast)[4:ncol(guideLevelStatsCast)]=sprintf("s%i", 1:(ncol(guideLevelStatsCast)-3))
221 |   
222 |   maudeZs = guideLevelStatsCast;
223 |   
224 |   guideLevelStatsCast$significance = apply(maudeZs[unique(curA20CountData$screen)],1, combineZStouffer)
225 |   guideLevelStatsCast$metric=apply(maudeZs[unique(curA20CountData$screen)],1, mean)
226 |   guideLevelStatsCast$method = "MAUDE"
227 |   
228 |   ### EdgeR
229 |   library(edgeR)
230 |   x= cast(unique(curA20CountData[curA20CountData$bin %in% c("A","F"), c("bin","gID","screen","count")]), gID ~ screen + bin, value="count")
231 |   row.names(x) = x$gID; x$gID=NULL;
232 |   x=as.matrix(x)
233 |   group = grepl("_F",colnames(x))+1
234 |   group <- factor(group)
235 |   y <- DGEList(counts=x,group=group)
236 |   y <- calcNormFactors(y)
237 |   design <- model.matrix(~group)
238 |   y <- estimateDisp(y,design)
239 |   #To perform likelihood ratio tests:
240 |   fit <- glmFit(y,design)
241 |   lrt <- glmLRT(fit,coef=2)
242 |   edgeRGuideLevel = topTags(lrt, n=nrow(x))@.Data[[1]]
243 |   
244 |   edgeRGuideLevel$gID = row.names(edgeRGuideLevel);
245 |   edgeRGuideLevel$metric = edgeRGuideLevel$logFC;
246 |   edgeRGuideLevel$significance = -log(edgeRGuideLevel$PValue);
247 |   edgeRGuideLevel$method="edgeR";
248 |   
249 |   ### DEseq
250 |   library(DESeq2)
251 |   deseqGroups = data.frame(bin=group);
252 |   row.names(deseqGroups) = colnames(x);
253 |   dds <- DESeqDataSetFromMatrix(countData = x,colData = deseqGroups, design= ~ bin)
254 |   dds <- DESeq(dds)
255 |   #resultsNames(dds) lists the coefficients; the first must be the intercept
256 |   stopifnot(resultsNames(dds)[1]=="Intercept")
257 |   res <- results(dds, name=resultsNames(dds)[2])
258 |   deseqGuideLevel = as.data.frame(res@listData)
259 |   deseqGuideLevel$gID =res@rownames;
260 |   deseqGuideLevel$metric = deseqGuideLevel$log2FoldChange;
261 |   deseqGuideLevel$significance = -log(deseqGuideLevel$pvalue);
262 |   deseqGuideLevel$method="DESeq2";
263 |   
264 |   curLRHiLow = cast(unique(curCPMMat[c("gID","NT","screen","l2fc_hilo")]), gID + NT ~ screen, value="l2fc_hilo")
265 |   curLRVsBG = cast(unique(curCPMMat[c("gID","NT","screen","l2fc_vsbg")]), gID + NT ~ screen, value="l2fc_vsbg")
266 |   numSamples = ncol(curLRHiLow)-2;
267 |   sampleNames = unique(curCPMMat$screen)
268 |   
269 |   curLRVsBG$significance = apply(curLRVsBG[3:(numSamples+2)], MARGIN = 1, FUN = mean);
270 |   curLRVsBG$metric = apply(curLRVsBG[3:(numSamples+2)], MARGIN = 1, FUN = mean);
271 |   curLRVsBG$method="logVsUnsorted"
272 |   curLRHiLow$significance = apply(curLRHiLow[3:(numSamples+2)], MARGIN = 1, FUN = mean);
273 |   curLRHiLow$metric = apply(curLRHiLow[3:(numSamples+2)], MARGIN = 1, FUN = mean);
274 |   curLRHiLow$method="logHivsLow"
275 |   
276 |   #compile results for A20
277 |   curResults = rbind(unique(curLRHiLow[c("method","gID","significance","metric")]),
278 |                      unique(curLRVsBG[c("method","gID","significance","metric")]),
279 |                      deseqGuideLevel[c("method","gID","significance","metric")], 
280 |                      edgeRGuideLevel[c("method","gID","significance","metric")],
281 |                      unique(guideLevelStatsCast[c("method","gID","significance","metric")]))
282 |   
283 |   curResults = merge(curResults, unique(curCPMMat[c("gID","NT","pos")]), by="gID")
284 |   curResults = curResults[!is.na(curResults$pos) | curResults$NT,]
285 |   curResults$promoter  = grepl("TNFAIP3", curResults$gID) | 
286 |     (curResults$pos <= 138189439 & curResults$pos >= 138187040) # chr6:138188077-138188379;138187040
287 |   
288 |   #append the current results to all
289 |   curResults$locus ="TNFAIP3"
290 |   curResults$type =curtype
291 |   curResults$celltype =curCelltype
292 |   allResultsBoth = rbind(allResultsBoth, curResults);
293 |   
294 |   # (1) similarity between the effect sizes estimated per replicate, 
295 |   corLRHiLow = cor(curLRHiLow[!curLRHiLow$NT, 3:(3+numSamples-1)])
296 |   corLRVsBG = cor(curLRVsBG[!curLRVsBG$NT, 3:(3+numSamples-1)])
297 |   maudeZCors = cor(maudeZs[!maudeZs$NT, 4:ncol(maudeZs)])
298 |   
299 |   maudeCorP=1
300 |   maudeCorR=-1
301 |   corLRHiLowP=1
302 |   corLRHiLowR=-1
303 |   corLRVsBGP=1;
304 |   corLRVsBGR=-1
305 |   #select the best inter-replicate correlation for each of the three approaches for which this is possible
306 |   for(i in 1:(length(sampleNames)-1)){ 
307 |     for(j in (i+1):length(sampleNames)){ 
308 |       curR = cor(maudeZs[!maudeZs$NT, sampleNames[i]], maudeZs[!maudeZs$NT, sampleNames[j]])
309 |       curP = cor.test(maudeZs[!maudeZs$NT, sampleNames[i]], maudeZs[!maudeZs$NT, sampleNames[j]])$p.value
310 |       if (maudeCorR < curR){
311 |         maudeCorR = curR;
312 |         maudeCorP = curP;
313 |       }
314 |       curR = cor(curLRVsBG[!curLRVsBG$NT, sampleNames[i]], curLRVsBG[!curLRVsBG$NT, sampleNames[j]])
315 |       curP = cor.test(curLRVsBG[!curLRVsBG$NT, sampleNames[i]], 
316 |                       curLRVsBG[!curLRVsBG$NT, sampleNames[j]])$p.value
317 |       if (corLRVsBGR < curR){
318 |         corLRVsBGR = curR;
319 |         corLRVsBGP = curP;
320 |       }
321 |       curR = cor(curLRHiLow[!curLRHiLow$NT, sampleNames[i]], curLRHiLow[!curLRHiLow$NT, sampleNames[j]])
322 |       curP = cor.test(curLRHiLow[!curLRHiLow$NT, sampleNames[i]], 
323 |                       curLRHiLow[!curLRHiLow$NT, sampleNames[j]])$p.value
324 |       if (corLRHiLowR < curR){
325 |         corLRHiLowR = curR;
326 |         corLRHiLowP = curP;
327 |       }
328 |     }
329 |   }
330 |   metricsBoth = rbind(metricsBoth, 
331 |                       data.frame(method=c("MAUDE","logHivsLow","logVsUnsorted"), 
332 |                                  metric="r",which="replicate_correl",
333 |                                  value=c(maudeCorR, corLRHiLowR, corLRVsBGR),
334 |                                  sig=c(maudeCorP, corLRHiLowP, corLRVsBGP), 
335 |                                  locus ="TNFAIP3", type =curtype, celltype =curCelltype, 
336 |                                  stringsAsFactors = FALSE))
337 | }
338 | 
339 | ## ----some useful functions----------------------------------------------------
340 | #Some useful functions
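341 | na.rm = function(x){ x[!is.na(x)]} #drop NAs from a vector
342 | #Wilcoxon rank-sum test of x vs y that also reports AUROC = W/(length(x)*length(y))
343 | ranksumROC = function(x,y,na.rm=TRUE,...){
344 |   if (na.rm){ #the logical argument na.rm does not mask the function na.rm() above
345 |     x=na.rm(x); y=na.rm(y);
346 |   }
347 |   curTest = wilcox.test(x,y,...);
348 |   curTest$AUROC = (curTest$statistic/length(x))/length(y)
349 |   return(curTest)
350 | }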
351 | 
352 | ## ----evaluate methods part 2 and 3--------------------------------------------
353 | # (1) similarity between the effect sizes estimated per replicate, 
354 | # (above)
355 | metricsBoth$significant= metricsBoth$sig < 0.01;
356 | 
357 | # Other evaluation metrics
358 | allExpts = unique(allResultsBoth[c("celltype","locus","type")])
359 | for (ei in 1:nrow(allExpts)){
360 |   curCelltype = allExpts$celltype[ei]
361 |   curtype = allExpts$type[ei];
362 |   curLocus = allExpts$locus[ei];
363 |   
364 |   curResults = allResultsBoth[allResultsBoth$celltype==curCelltype & 
365 |                                 allResultsBoth$type==curtype & allResultsBoth$locus==curLocus,]
366 |   for(m in unique(curResults$method)){
367 |     curData = curResults[curResults$method==m & !curResults$NT,]
368 |     
369 |     # (2) similarity in effect size between adjacent guides
370 |     curData = curData[order(curData$pos),]
371 |     guideEffectDistances = 
372 |       data.frame(method = m, random=FALSE, 
373 |                  difference = abs(curData$metric[2:nrow(curData)] - curData$metric[1:(nrow(curData)-1)]), 
374 |                  dist =abs(curData$pos[2:nrow(curData)] - curData$pos[1:(nrow(curData)-1)]), 
375 |                  stringsAsFactors = FALSE)
376 |     guideEffectDistances = guideEffectDistances[guideEffectDistances$dist < 100,]
377 |     guideEffectDistances$dist=NULL;
378 |     ### changed 10 in next line to 100 to make this more robust
379 |     curData = curData[sample(nrow(curData), size = nrow(curData)*100, replace = TRUE),]
380 |     guideEffectDistances = 
381 |       rbind(guideEffectDistances,
382 |             data.frame(method = m, random=TRUE,
383 |                        difference = abs(curData$metric[2:nrow(curData)] - 
384 |                                           curData$metric[1:(nrow(curData)-1)]), stringsAsFactors = FALSE));
385 |     # random pairs should differ more than adjacent pairs
386 |     curRS = ranksumROC(guideEffectDistances$difference[guideEffectDistances$method==m &
387 |                                                          guideEffectDistances$random],
388 |                        guideEffectDistances$difference[guideEffectDistances$method==m &
389 |                                                          !guideEffectDistances$random]) 
390 |     metricsBoth = rbind(metricsBoth, data.frame(method=m, metric="AUROC-0.5",
391 |                                                 which="adjacent_vs_random",value = curRS$AUROC-0.5,
392 |                                                 locus=curLocus,celltype=curCelltype, type=curtype,
393 |                                                 sig=curRS$p.value, significant = curRS$p.value < 0.01))
394 |     
395 |     # (3) ability to distinguish promoter-targeting guides from other guides. 
396 |     if (curtype=="CRISPRi" && !(m %in% c("edgeR","DESeq2"))){
397 |       # edgeR and DESeq2 are excluded here because they are reversed for CRISPRi
398 |       # non-promoter guides should have a higher value than promoter guides (promoter effects are more negative)
399 |       curRS = ranksumROC(curResults$significance[curResults$method==m & !curResults$NT &
400 |                                                    !curResults$promoter],
401 |                          curResults$significance[curResults$method==m & !curResults$NT &
402 |                                                    curResults$promoter]) 
403 |     }else{
404 |       # promoter should have larger effect than non-promoter
405 |       curRS = ranksumROC(curResults$significance[curResults$method==m & !curResults$NT &
406 |                                                    curResults$promoter],
407 |                          curResults$significance[curResults$method==m & !curResults$NT &
408 |                                                    !curResults$promoter]) 
409 |     }
410 |     metricsBoth = rbind(metricsBoth, data.frame(method=m, metric="AUROC-0.5", which="promoter_vs_T",
411 |                                                 value = curRS$AUROC-0.5, locus=curLocus, 
412 |                                                 celltype=curCelltype, type=curtype, sig=curRS$p.value,
413 |                                                 significant = curRS$p.value < 0.01))
414 |   }
415 | }
416 | 
417 | ##compile all metrics; label the best in each test and whether any tests were significant (P<0.01)
418 | metricsBoth2 = metricsBoth;
419 | metricsBoth2$method = factor(as.character(metricsBoth2$method), 
420 |                              levels = c("logVsUnsorted","logHivsLow","DESeq2","edgeR","MAUDE"))
421 | metricsBoth2Best = cast(metricsBoth2, which + locus + type+celltype ~ ., value="value", 
422 |                         fun.aggregate = max)
423 | names(metricsBoth2Best)[ncol(metricsBoth2Best)] = "best"
424 | metricsBoth2 = merge(metricsBoth2, metricsBoth2Best, by = c("which","locus","type","celltype"))
425 | metricsBoth2AnySig = cast(metricsBoth2[colnames(metricsBoth2)!="value"], 
426 |                           which + locus + type+celltype ~ ., value="significant", fun.aggregate = any)
427 | names(metricsBoth2AnySig)[ncol(metricsBoth2AnySig)] = "anySig"
428 | metricsBoth2 = merge(metricsBoth2, metricsBoth2AnySig, by = c("which","locus","type","celltype"))
429 | metricsBoth2$isBest = metricsBoth2$value==metricsBoth2$best;
430 | metricsBoth2$isBestNA = metricsBoth2$isBest;
431 | metricsBoth2$isBestNA[!metricsBoth2$isBestNA]=NA;
432 | metricsBoth2$pctOfMax = metricsBoth2$value/metricsBoth2$best * 100;
433 | 
434 | #fill in NAs for edgeR and DESeq2 which cannot have inter-replicate correlations
435 | temp = metricsBoth2[metricsBoth2$metric=="r",]
436 | temp = temp[1:2,];
437 | temp$method = c("edgeR","DESeq2")
438 | temp$value=NA; temp$isBest=NA; temp$significant=FALSE; temp$pctOfMax=NA; temp$isBestNA=NA;
439 | metricsBoth2 = rbind(metricsBoth2, temp)
440 | 
441 | ## ----make graph, fig.width=10, fig.height=6-----------------------------------
442 | #make the final evaluation graph
443 | p1 = ggplot(metricsBoth2[metricsBoth2$which=="adjacent_vs_random",], 
444 |             aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() +
445 |   geom_text(data=metricsBoth2[metricsBoth2$which=="adjacent_vs_random" & metricsBoth2$anySig,],
446 |             aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() +
447 |   scale_fill_gradient2(high="red", low="blue", mid="black") + 
448 |   theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 
449 |         axis.title.y = element_blank())+scale_colour_manual(values = c("green"), na.value=NA) + 
450 |   scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+ggtitle("Adjacent vs\nrandom guides");
451 | p2 = ggplot(metricsBoth2[metricsBoth2$which=="promoter_vs_T",], 
452 |             aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() +
453 |   geom_text(data=metricsBoth2[metricsBoth2$which=="promoter_vs_T" & metricsBoth2$anySig,],
454 |             aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() +
455 |   scale_fill_gradient2(high="red", low="blue", mid="black") + 
456 |   theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 
457 |         axis.title.y = element_blank())+scale_colour_manual(values = c("green"), na.value=NA)+
458 |   scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+
459 |   ggtitle("Promoter vs other\ntargeting guides");
460 | p3 = ggplot(metricsBoth2[metricsBoth2$which=="replicate_correl",], 
461 |             aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() +
462 |   geom_text(data=metricsBoth2[metricsBoth2$which=="replicate_correl" & metricsBoth2$anySig,],
463 |             aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() +
464 |   scale_fill_gradient2(high="red", low="blue", mid="black") + 
465 |   theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 
466 |         axis.title.y = element_blank())+scale_colour_manual(values = c("green"), na.value=NA)+
467 |   scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+ggtitle("Replicate\ncorrelations");
468 | g= plot_grid(p1,p2,p3, align = 'h', nrow = 1); print(g)
469 | 
470 | ## ----session info-------------------------------------------------------------
471 | sessionInfo()
472 | 
473 | 


--------------------------------------------------------------------------------
/Evaluation/method_evaluation.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "CRISPR sorting screen analysis method comparison"
  3 | author: "Carl de Boer"
  4 | date: "5/10/2020"
  5 | output: rmarkdown::html_vignette
  6 | vignette: >
  7 |   %\VignetteIndexEntry{CRISPR sorting screen analysis method comparison}
  8 |   %\VignetteEngine{knitr::rmarkdown}
  9 |   \usepackage[utf8]{inputenc}
 10 | ---
 11 | 
 12 | ```{r setup, include=FALSE}
 13 | knitr::opts_chunk$set(echo = TRUE)
 14 | ```
 15 | 
 16 | ## CRISPR sort-screen evaluation test
 17 | 
  18 | This document includes both the code and the results for the evaluation metrics used in the MAUDE paper. It is intended to make it fairly easy to extend the evaluation to new approaches and datasets.
 19 | 
 20 | ```{r load libraries, results="hide", message = FALSE, warning=FALSE}
 21 | library("ggplot2")
 22 | library(reshape)
 23 | library(cowplot)
 24 | library(MAUDE)
 25 | library(openxlsx)
 26 | library(edgeR)
 27 | library(DESeq2)
 28 | ```
 29 | 
  30 | I forgot to set a seed when running this for the publication. Because the evaluation includes a stochastic step, the results here are likely to differ slightly from the published ones, especially for datasets with a lower signal-to-noise ratio. Your results may also differ from this tutorial if you run the stochastic steps multiple times, set a different seed, or do not set a seed at all.
 31 | ```{r set seed}
 32 | set.seed(35263377)
 33 | ```
 34 | 
 35 | ### CD69 screen
 36 | Load the CD69 data and process it.
 37 | ```{r Load and parse CD69 data}
 38 | #a mapping to unify bin names from Simeonov data
 39 | binmapBack = list("baseline" = "baseline", "low"="low", "medium"="medium","high"="high","back_" = "NS",
 40 |                   "baseline_" = "baseline", "low_"="low", "medium_"="medium", "high_"="high", 
 41 |                   "A"="baseline", "B"="low", "E" = "medium", "F"="high")
 42 | 
 43 | #this comes from manually reconstructing the CD69 density curve from extended data figure 1a (Simeonov et al)
 44 | binBoundsCD69 = data.frame(Bin = c("A","F","B","E"), 
 45 |                            fraction = c(0.65747100, 0.02792824, 0.25146688, 0.06313389), 
 46 |                            stringsAsFactors = FALSE) 
 47 | fractionalBinBounds = makeBinModel(binBoundsCD69[c("Bin","fraction")])
 48 | fractionalBinBounds = rbind(fractionalBinBounds, fractionalBinBounds)
 49 | fractionalBinBounds$screen = c(rep("1",6),rep("2",6));
 50 | #only keep bins A,B,E,F
 51 | fractionalBinBounds = fractionalBinBounds[fractionalBinBounds$Bin %in% c("A","B","E","F"),]
 52 | fractionalBinBounds$Bin = unlist(binmapBack[fractionalBinBounds$Bin]);
 53 | 
 54 | #load data
 55 | cd69OriginalResults = read.xlsx('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5675716/bin/NIHMS913084-supplement-supplementary_table_1.xlsx')
 56 | cd69OriginalResults$NT = grepl("negative_control", cd69OriginalResults$gRNA_systematic_name)
 57 | cd69OriginalResults$pos = cd69OriginalResults$PAM_3primeEnd_coord;
 58 | cd69OriginalResults = unique(cd69OriginalResults)
 59 | cd69CountData = melt(cd69OriginalResults, id.vars = c("pos","NT","gRNA_systematic_name"))
 60 | cd69CountData = cd69CountData[grepl(".count$",cd69CountData$variable),]
 61 | cd69CountData$theirBin = gsub("CD69(.*)([12]).count","\\1",cd69CountData$variable)
 62 | cd69CountData$screen = gsub("CD69(.*)([12]).count","\\2",cd69CountData$variable)
 63 | cd69CountData$reads= as.numeric(cd69CountData$value); cd69CountData$value=NULL;
 64 | # convert their bin to one that is consistent
 65 | cd69CountData$Bin = unlist(binmapBack[cd69CountData$theirBin]); 
 66 | binReadMat = data.frame(cast(cd69CountData[!is.na(cd69CountData$pos) | cd69CountData$NT,], pos+gRNA_systematic_name+NT+screen ~ Bin, value="reads"))
 67 | ```
 68 | 
 69 | 
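  70 | I wanted to confirm that the log2 fold changes computed below (with a pseudocount of 1 added to the normalized counts) match how logFC was calculated in Simeonov et al. If they do, the points in the plot below should fall on the line y=x.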
 71 | ```{r calc logFC}
 72 | #confirm how to calc log2FC:
 73 | cd69OriginalResults$l2fc.vsbg1 = log2((1+ cd69OriginalResults$CD69high_1.count.norm)/(1+cd69OriginalResults$CD69back_1.count.norm))
 74 | cd69OriginalResults$l2fc.vsbg2 = log2((1+ cd69OriginalResults$CD69high_2.count.norm)/(1+cd69OriginalResults$CD69back_2.count.norm))
 75 | cd69OriginalResults$l2fc.hilo1 = log2((1+ cd69OriginalResults$CD69high_1.count.norm)/(1+cd69OriginalResults$CD69baseline_1.count.norm))
 76 | cd69OriginalResults$l2fc.hilo2 = log2((1+ cd69OriginalResults$CD69high_2.count.norm)/(1+cd69OriginalResults$CD69baseline_2.count.norm))
 77 | #confirm that how we just calculated log2 fc is the same as originally calculated
 78 | p=ggplot(cd69OriginalResults, aes(x=high2.l2fc,y= l2fc.vsbg2)) + geom_point(); print(p)
 79 | ```
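
As a rough illustration of the pseudocount-based log2 fold change used in the check above, here is a toy calculation. The counts and the helper name `l2fcPseudo` are made up for this sketch; they are not part of the Simeonov et al. data or of MAUDE.

```r
# Hypothetical helper (not part of MAUDE): log2 fold change with a pseudocount of 1
l2fcPseudo = function(numerator, denominator){
  log2((1 + numerator) / (1 + denominator))
}

# Made-up normalized counts for three guides
highBin  = c(7, 0, 15)  # counts in the sorted "high" bin
unsorted = c(1, 0, 3)   # counts in the unsorted population

l2fc = l2fcPseudo(highBin, unsorted)
print(l2fc)  # 2 0 2
```

The pseudocount keeps the ratio defined when a guide has zero reads in either sample, as for the second guide here.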
 80 | 
 81 | Run the various analysis methods on the CD69 data.
 82 | ```{r run methods, results="hide", message = FALSE, warning=FALSE}
 83 | #edgeR
 84 | x <- unique(cd69OriginalResults[c("gRNA_systematic_name","CD69baseline_1.count","CD69baseline_2.count", 
 85 |                                   "CD69high_1.count", "CD69high2.count")])
 86 | row.names(x) = x$gRNA_systematic_name; x$gRNA_systematic_name=NULL;
 87 | x=as.matrix(x)
 88 | group <- factor(c(1,1,2,2))
 89 | y <- DGEList(counts=x,group=group)
 90 | y <- calcNormFactors(y)
 91 | design <- model.matrix(~group)
 92 | y <- estimateDisp(y,design)
 93 | #To perform likelihood ratio tests:
 94 | fit <- glmFit(y,design)
 95 | lrt <- glmLRT(fit,coef=2)
 96 | edgeRGuideLevel = topTags(lrt, n=nrow(x))@.Data[[1]]
 97 | edgeRGuideLevel$gRNA_systematic_name = row.names(edgeRGuideLevel);
 98 | edgeRGuideLevel$metric = edgeRGuideLevel$logFC;
 99 | edgeRGuideLevel$significance = -log(edgeRGuideLevel$PValue);
100 | edgeRGuideLevel$method="edgeR";
101 | 
102 | #DESeq2
103 | deseqGroups = data.frame(bin=factor(c(1,1,2,2)));
104 | row.names(deseqGroups) = c("CD69baseline_1.count","CD69baseline_2.count", 
105 |                            "CD69high_1.count", "CD69high2.count");
106 | dds <- DESeqDataSetFromMatrix(countData = x,colData = deseqGroups, design= ~ bin)
107 | dds <- DESeq(dds)
108 | res <- results(dds, name=resultsNames(dds)[2])
109 | 
110 | deseqGuideLevel = as.data.frame(res@listData)
111 | deseqGuideLevel$gRNA_systematic_name =res@rownames;
112 | deseqGuideLevel$metric = deseqGuideLevel$log2FoldChange;
113 | deseqGuideLevel$significance = -log(deseqGuideLevel$pvalue);
114 | deseqGuideLevel$method="DESeq2";
115 | 
116 | #MAUDE
117 | guideLevelStatsCD69 = findGuideHitsAllScreens(unique(binReadMat["screen"]), binReadMat, fractionalBinBounds, sortBins = c("baseline","high","low","medium"), unsortedBin = "NS")
118 | guideLevelStatsCD69$chr="chr12"
119 | guideLevelStatsCastCD69 = cast(guideLevelStatsCD69, gRNA_systematic_name + pos+NT ~ screen, value="Z")
120 | names(guideLevelStatsCastCD69)[ncol(guideLevelStatsCastCD69)-1:0]=c("s1","s2")
121 | guideLevelStatsCastCD69$significance = apply(guideLevelStatsCastCD69[c("s1","s2")],1, combineZStouffer)
122 | guideLevelStatsCastCD69$metric=apply(guideLevelStatsCastCD69[c("s1","s2")],1, mean)
123 | guideLevelStatsCastCD69$method = "MAUDE"
124 | 
125 | #Two log fold change methods
126 | cd69OriginalResultsHiLow = cd69OriginalResults[c("gRNA_systematic_name","l2fc.hilo1","l2fc.hilo2")]
127 | cd69OriginalResultsVsBG = cd69OriginalResults[c("gRNA_systematic_name","l2fc.vsbg1","l2fc.vsbg2")]
128 | cd69OriginalResultsHiLow$significance = apply(cd69OriginalResultsHiLow[2:3], 1, FUN = mean)
129 | cd69OriginalResultsHiLow$metric = apply(cd69OriginalResultsHiLow[2:3], 1, FUN = mean)
130 | cd69OriginalResultsVsBG$significance = apply(cd69OriginalResultsVsBG[2:3], 1, FUN = mean)
131 | cd69OriginalResultsVsBG$metric = apply(cd69OriginalResultsVsBG[2:3], 1, FUN = mean)
132 | cd69OriginalResultsHiLow$method="logHivsLow"
133 | cd69OriginalResultsVsBG$method="logVsUnsorted"
134 | ```
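
The MAUDE block above combines the per-replicate Z-scores with `combineZStouffer` for significance and takes their mean as the effect size. For readers unfamiliar with Stouffer's method, the sketch below shows the textbook unweighted form (the sum of the Z-scores divided by the square root of their number). The function `stoufferZ` is a hypothetical stand-in written for this example and is not MAUDE's implementation, which may handle missing values or weights differently.

```r
# Hypothetical stand-in for illustration only (not MAUDE's combineZStouffer):
# unweighted Stouffer combination of Z-scores, ignoring NAs
stoufferZ = function(z){
  z = z[!is.na(z)]
  sum(z) / sqrt(length(z))
}

# Made-up Z-scores for three guides in two replicate screens
zMat = data.frame(s1 = c(2, 0.5, -1), s2 = c(2, -0.5, -3))

combined = apply(zMat, 1, stoufferZ)  # used as a significance score
averaged = apply(zMat, 1, mean)       # used as an effect size
print(cbind(zMat, combined, averaged))
```

Two concordant replicates with Z = 2 combine to about 2.83, while their mean stays at 2. This is why the combined value is the more natural significance score and the mean the more natural effect size.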
135 | 
136 | Compile the results from the various methods.
137 | ```{r compile results}
138 | # predictions
139 | allResults = rbind(cd69OriginalResultsVsBG[c("method","gRNA_systematic_name","significance","metric")],
140 |                    cd69OriginalResultsHiLow[c("method","gRNA_systematic_name","significance","metric")],
141 |                    deseqGuideLevel[c("method","gRNA_systematic_name","significance","metric")], 
142 |                    edgeRGuideLevel[c("method","gRNA_systematic_name","significance","metric")],
143 |                    guideLevelStatsCastCD69[c("method","gRNA_systematic_name","significance","metric")]) 
144 | 
145 | 
146 | allResults = merge(allResults, cd69OriginalResults[c("gRNA_systematic_name","NT","pos")],
147 |                    by="gRNA_systematic_name")
148 | allResults = allResults[!is.na(allResults$pos) | allResults$NT,]
149 | allResults$promoter  = allResults$pos <= 9913996 & allResults$pos >= 9912997
150 | allResults$gID = allResults$gRNA_systematic_name; allResults$gRNA_systematic_name=NULL;
151 | allResults$locus ="CD69"
152 | allResults$type ="CRISPRa"
153 | allResults$celltype ="Jurkat"
154 | ```
155 | 
156 | ### TNFAIP3
157 | Load and parse the TNFAIP3 data from Ray et al.
158 | ```{r input and parse TNFAIP3 data}
159 | 
160 | ##read in TNFAIP3 data
161 | binFractionsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPR_FF_bin_fractions.txt.gz")))), 
162 |                              sep="\t", stringsAsFactors = FALSE, header = TRUE)
163 | CRISPRaCountsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPRa_FF_countMatrix.txt.gz")))), 
164 |                               sep="\t", stringsAsFactors = FALSE, header = TRUE)
165 | CRISPRiCountsA20 = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20190828_CRISPRi_FF_countMatrix.txt.gz")))), 
166 |                               sep="\t", stringsAsFactors = FALSE, header = TRUE)
167 | crispraGuides = read.table(textConnection(readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE136nnn/GSE136693/suppl/GSE136693_20180205_selected_CRISPRa_guides_seq.txt.gz")))), 
168 |                            sep="\t", stringsAsFactors = FALSE, header = TRUE)
169 | crispraGuides$seq = gsub("(.*)(.{3})","\\1",crispraGuides$seq.w.PAM)
170 | crispraGuides$pos = round((as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\2",
171 |                                            crispraGuides$guideID))+
172 |                              as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\3",
173 |                                              crispraGuides$guideID)))/2)
174 | 
175 | CRISPRaCountsA20 = merge(CRISPRaCountsA20, unique(crispraGuides[c("seq","pos")]), by=c("seq"), all.x=TRUE)
176 | 
177 | CRISPRiCountsA20$pos = round((as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\2",
178 |                                               CRISPRiCountsA20$gID))+
179 |                                 as.numeric(gsub("([0-9]+):([0-9]+)-([0-9]+):([+-])","\\3",
180 |                                               CRISPRiCountsA20$gID)))/2)
181 | 
182 | binFractionsA20$expt = paste(binFractionsA20$celltype, binFractionsA20$CRISPRType,sep="_")
183 | 
184 | #merge CRISPRi and CRISPRa
185 | A20CountData = melt(CRISPRaCountsA20, id.vars=c("seq","elementClass","element","NT","gID","pos"))
186 | A20CountData$CRISPRType="CRISPRa";
187 | A20CountData2 = melt(CRISPRiCountsA20, id.vars=c("seq","elementClass","element","NT","gID","pos"))
188 | A20CountData2$CRISPRType="CRISPRi";
189 | A20CountData = rbind(A20CountData, A20CountData2)
190 | rm('A20CountData2');
191 | A20CountData$count = A20CountData$value; A20CountData$value=NULL;
192 | A20CountData$sample = A20CountData$variable; A20CountData$variable=NULL;
193 | A20CountData$bin = gsub("(.*)_(.*)_(.*)", "\\3", A20CountData$sample);
194 | A20CountData$screen = gsub("(.*)_(.*)_(.*)", "\\2", A20CountData$sample);
195 | A20CountData$celltype = gsub("(.*)_(.*)_(.*)", "\\1", A20CountData$sample);
196 | A20CountData$expt = paste(A20CountData$celltype, A20CountData$CRISPRType,sep="_")
197 | ```
198 | 
199 | Run each method on each experiment from Ray et al. 
200 | ```{r start evaluation and run on TNFAIP3, results="hide", message = FALSE, warning=FALSE}
201 | #combine CD69 metrics 
202 | # Pearson's r between replicates ## metricsBoth will contain all our evaluation metrics
203 | metricsBoth = 
204 |   data.frame(method=c("MAUDE","logHivsLow","logVsUnsorted"), metric="r",which="replicate_correl",
205 |              value=c(cor(guideLevelStatsCastCD69$s1[!guideLevelStatsCastCD69$NT],
206 |                          guideLevelStatsCastCD69$s2[!guideLevelStatsCastCD69$NT]), 
207 |                      cor(cd69OriginalResults$l2fc.hilo1[!cd69OriginalResults$NT],
208 |                          cd69OriginalResults$l2fc.hilo2[!cd69OriginalResults$NT]),
209 |                      cor(cd69OriginalResults$l2fc.vsbg1[!cd69OriginalResults$NT],
210 |                          cd69OriginalResults$l2fc.vsbg2[!cd69OriginalResults$NT])),
211 |              sig=c(cor.test(guideLevelStatsCastCD69$s1[!guideLevelStatsCastCD69$NT],
212 |                             guideLevelStatsCastCD69$s2[!guideLevelStatsCastCD69$NT])$p.value, 
213 |                    cor.test(cd69OriginalResults$l2fc.hilo1[!cd69OriginalResults$NT],
214 |                             cd69OriginalResults$l2fc.hilo2[!cd69OriginalResults$NT])$p.value,
215 |                    cor.test(cd69OriginalResults$l2fc.vsbg1[!cd69OriginalResults$NT],
216 |                             cd69OriginalResults$l2fc.vsbg2[!cd69OriginalResults$NT])$p.value),
217 |              locus="CD69",type="CRISPRa",celltype="Jurkat",stringsAsFactors = FALSE)
218 | 
219 | allResultsBoth = allResults; #combined results (significance, effect sizes) for CD69 and A20 screens
220 | for (e in unique(binFractionsA20$expt)){
221 |   curCelltype = gsub("(.*)_(.*)", "\\1", e);
222 |   curtype = gsub("(.*)_(.*)", "\\2", e);
223 |   curA20CountData = unique(A20CountData[A20CountData$expt==e,
224 |                                         c("seq","pos", "NT","gID","count","screen","bin")])
225 |   curA20CountDataTotals = cast(curA20CountData, screen +bin~ ., fun.aggregate = sum, value="count")
226 |   names(curA20CountDataTotals)[3] = "total";
227 |   curA20CountData = merge(curA20CountData, curA20CountDataTotals, by=c("screen","bin"))
228 |   curA20CountData$CPM = curA20CountData$count/curA20CountData$total * 1E6;
229 |   curCPMMat = cast(curA20CountData, seq + NT + gID + screen + pos ~ bin, value="CPM")
230 |   curCPMMat$l2fc_hilo = log2((1+curCPMMat$F)/(1+curCPMMat$A))
 231 |   if(curtype=="CRISPRi"){
232 |     curCPMMat$l2fc_vsbg = log2((1+curCPMMat$NS)/(1+curCPMMat$A))
233 |   }else{ #CRISPRa
234 |     curCPMMat$l2fc_vsbg = log2((1+curCPMMat$F)/(1+curCPMMat$NS))
235 |   }
236 |   curBins = as.data.frame(melt(binFractionsA20[binFractionsA20$expt==e,], 
237 |                                id.vars = c("celltype","screen","CRISPRType","expt")))
238 |   names(curBins)[ncol(curBins) - (1:0)] = c("Bin","fraction")
239 |   curBins2 = data.frame();
240 |   for (s in unique(curBins$screen)){
241 |     curBins3 = makeBinModel(curBins[curBins$screen==s,c("Bin","fraction")])
242 |     curBins3$screen = s;
243 |     curBins2 = rbind(curBins2, curBins3)
244 |   }
245 |   curBins2$Bin = as.character(curBins2$Bin);
246 |   curCountMat = cast(curA20CountData, seq + NT + gID +pos + screen ~ bin, value="count")
247 |   guideLevelStats = findGuideHitsAllScreens(experiments = unique(curCountMat["screen"]), 
248 |                                             countDataFrame = curCountMat, binStats = curBins2, 
249 |                                             sortBins = c("A","B","C","D","E","F"), unsortedBin = "NS", 
250 |                                             negativeControl="NT")
251 |   
252 |   guideLevelStatsCast = cast(guideLevelStats, gID + pos+NT ~ screen, value="Z")
253 |   #names(guideLevelStatsCast)[4:ncol(guideLevelStatsCast)]=sprintf("s%i", 1:(ncol(guideLevelStatsCast)-3))
254 |   
255 |   maudeZs = guideLevelStatsCast;
256 |   
257 |   guideLevelStatsCast$significance = apply(maudeZs[unique(curA20CountData$screen)],1, combineZStouffer)
258 |   guideLevelStatsCast$metric=apply(maudeZs[unique(curA20CountData$screen)],1, mean)
259 |   guideLevelStatsCast$method = "MAUDE"
260 |   
261 |   ### EdgeR
262 |   library(edgeR)
263 |   x= cast(unique(curA20CountData[curA20CountData$bin %in% c("A","F"), c("bin","gID","screen","count")]), gID ~ screen + bin, value="count")
264 |   row.names(x) = x$gID; x$gID=NULL;
265 |   x=as.matrix(x)
266 |   group = grepl("_F",colnames(x))+1
267 |   group <- factor(group)
268 |   y <- DGEList(counts=x,group=group)
269 |   y <- calcNormFactors(y)
270 |   design <- model.matrix(~group)
271 |   y <- estimateDisp(y,design)
272 |   #To perform likelihood ratio tests:
273 |   fit <- glmFit(y,design)
274 |   lrt <- glmLRT(fit,coef=2)
275 |   edgeRGuideLevel = topTags(lrt, n=nrow(x))@.Data[[1]]
276 |   
277 |   edgeRGuideLevel$gID = row.names(edgeRGuideLevel);
278 |   edgeRGuideLevel$metric = edgeRGuideLevel$logFC;
279 |   edgeRGuideLevel$significance = -log(edgeRGuideLevel$PValue);
280 |   edgeRGuideLevel$method="edgeR";
281 |   
282 |   ### DEseq
283 |   library(DESeq2)
284 |   deseqGroups = data.frame(bin=group);
285 |   row.names(deseqGroups) = colnames(x);
286 |   dds <- DESeqDataSetFromMatrix(countData = x,colData = deseqGroups, design= ~ bin)
287 |   dds <- DESeq(dds)
288 |   #resultsNames(dds) # lists the coefficients
289 |   res <- results(dds, name=resultsNames(dds)[2])
290 |   stopifnot(resultsNames(dds)[1]=="Intercept")
291 |   deseqGuideLevel = as.data.frame(res@listData)
292 |   deseqGuideLevel$gID =res@rownames;
293 |   deseqGuideLevel$metric = deseqGuideLevel$log2FoldChange;
294 |   deseqGuideLevel$significance = -log(deseqGuideLevel$pvalue);
295 |   deseqGuideLevel$method="DESeq2";
296 |   
297 |   curLRHiLow = cast(unique(curCPMMat[c("gID","NT","screen","l2fc_hilo")]), gID + NT ~ screen, value="l2fc_hilo")
298 |   curLRVsBG = cast(unique(curCPMMat[c("gID","NT","screen","l2fc_vsbg")]), gID + NT ~ screen, value="l2fc_vsbg")
299 |   numSamples = ncol(curLRHiLow)-2;
300 |   sampleNames = unique(curCPMMat$screen)
301 |   
302 |   curLRVsBG$significance = apply(curLRVsBG[3:(numSamples+2)], MARGIN = 1, FUN = mean);
303 |   curLRVsBG$metric = apply(curLRVsBG[3:(numSamples+2)], MARGIN = 1, FUN = mean);
304 |   curLRVsBG$method="logVsUnsorted"
305 |   curLRHiLow$significance = apply(curLRHiLow[3:(numSamples+2)], MARGIN = 1, FUN = mean);
306 |   curLRHiLow$metric = apply(curLRHiLow[3:(numSamples+2)], MARGIN = 1, FUN = mean);
307 |   curLRHiLow$method="logHivsLow"
308 |   
309 |   #compile results for A20
310 |   curResults = rbind(unique(curLRHiLow[c("method","gID","significance","metric")]),
311 |                      unique(curLRVsBG[c("method","gID","significance","metric")]),
312 |                      deseqGuideLevel[c("method","gID","significance","metric")], 
313 |                      edgeRGuideLevel[c("method","gID","significance","metric")],
314 |                      unique(guideLevelStatsCast[c("method","gID","significance","metric")]))
315 |   
316 |   curResults = merge(curResults, unique(curCPMMat[c("gID","NT","pos")]), by="gID")
317 |   curResults = curResults[!is.na(curResults$pos) | curResults$NT,]
318 |   curResults$promoter  = grepl("TNFAIP3", curResults$gID) | 
319 |     (curResults$pos <= 138189439 & curResults$pos >= 138187040) # chr6:138188077-138188379;138187040
320 |   
321 |   #append the current results to all
322 |   curResults$locus ="TNFAIP3"
323 |   curResults$type =curtype
324 |   curResults$celltype =curCelltype
325 |   allResultsBoth = rbind(allResultsBoth, curResults);
326 |   
327 |   # (1) similarity between the effect sizes estimated per replicate, 
328 |   corLRHiLow = cor(curLRHiLow[!curLRHiLow$NT, 3:(3+numSamples-1)])
329 |   corLRVsBG = cor(curLRVsBG[!curLRVsBG$NT, 3:(3+numSamples-1)])
330 |   maudeZCors = cor(maudeZs[!maudeZs$NT, 4:ncol(maudeZs)])
331 |   
332 |   maudeCorP=1
333 |   maudeCorR=-1
334 |   corLRHiLowP=1
335 |   corLRHiLowR=-1
336 |   corLRVsBGP=1;
337 |   corLRVsBGR=-1
338 |   #select the best inter-replicate correlation for each of the three approaches for which this is possible
339 |   for(i in 1:(length(sampleNames)-1)){ 
340 |     for(j in (i+1):length(sampleNames)){ 
341 |       curR = cor(maudeZs[!maudeZs$NT, sampleNames[i]], maudeZs[!maudeZs$NT, sampleNames[j]])
342 |       curP = cor.test(maudeZs[!maudeZs$NT, sampleNames[i]], maudeZs[!maudeZs$NT, sampleNames[j]])$p.value
343 |       if (maudeCorR < curR){
344 |         maudeCorR = curR;
345 |         maudeCorP = curP;
346 |       }
347 |       curR = cor(curLRVsBG[!curLRVsBG$NT, sampleNames[i]], curLRVsBG[!curLRVsBG$NT, sampleNames[j]])
348 |       curP = cor.test(curLRVsBG[!curLRVsBG$NT, sampleNames[i]], 
349 |                       curLRVsBG[!curLRVsBG$NT, sampleNames[j]])$p.value
350 |       if (corLRVsBGR < curR){
351 |         corLRVsBGR = curR;
352 |         corLRVsBGP = curP;
353 |       }
354 |       curR = cor(curLRHiLow[!curLRHiLow$NT, sampleNames[i]], curLRHiLow[!curLRHiLow$NT, sampleNames[j]])
355 |       curP = cor.test(curLRHiLow[!curLRHiLow$NT, sampleNames[i]], 
356 |                       curLRHiLow[!curLRHiLow$NT, sampleNames[j]])$p.value
357 |       if (corLRHiLowR < curR){
358 |         corLRHiLowR = curR;
359 |         corLRHiLowP = curP;
360 |       }
361 |     }
362 |   }
363 |   metricsBoth = rbind(metricsBoth, 
364 |                       data.frame(method=c("MAUDE","logHivsLow","logVsUnsorted"), 
365 |                                  metric="r",which="replicate_correl",
366 |                                  value=c(maudeCorR, corLRHiLowR, corLRVsBGR),
367 |                                  sig=c(maudeCorP, corLRHiLowP, corLRVsBGP), 
368 |                                  locus ="TNFAIP3", type =curtype, celltype =curCelltype, 
369 |                                  stringsAsFactors = FALSE))
370 | }
371 | ```
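
The nested loop above keeps, for each approach, the highest Pearson correlation over all pairs of replicates. The same selection can be sketched more compactly with `combn`, shown here on simulated data. The data frame `reps` and its column names are invented for this example.

```r
set.seed(1)

# Simulated per-guide scores: three replicates sharing one underlying signal,
# with the third replicate much noisier than the other two
shared = rnorm(200)
reps = data.frame(r1 = shared + rnorm(200, sd = 0.5),
                  r2 = shared + rnorm(200, sd = 0.5),
                  r3 = shared + rnorm(200, sd = 2))

# All unordered pairs of replicate columns (one pair per column of repPairs)
repPairs = combn(names(reps), 2)

# Pearson r and its p-value for every pair
pairR = apply(repPairs, 2, function(p) cor(reps[[p[1]]], reps[[p[2]]]))
pairP = apply(repPairs, 2, function(p) cor.test(reps[[p[1]]], reps[[p[2]]])$p.value)

# Keep the best-correlated pair together with its p-value
best = which.max(pairR)
bestPair = repPairs[, best]
print(data.frame(rep1 = repPairs[1, ], rep2 = repPairs[2, ], r = pairR, p = pairP))
print(bestPair)
```

Reporting the best pair rather than the average is a lenient choice: a single noisy replicate does not lower the score.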
372 | 
373 | Here are some functions that I will make use of below.
374 | ```{r some useful functions}
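 375 | #Some useful functions
 376 | na.rm = function(x){ x[!is.na(x)]} #drop NAs from a vector
 377 | #Wilcoxon rank-sum test of x vs y, with the AUROC (W/(n_x*n_y)) attached to the result
 378 | ranksumROC = function(x,y,na.rm=TRUE,...){
 379 |   if (na.rm){ #R resolves the call na.rm(x) to the function above, not the logical argument
 380 |     x=na.rm(x); y=na.rm(y);
 381 |   }
 382 |   curTest = wilcox.test(x,y,...);
 383 |   curTest$AUROC = (curTest$statistic/length(x))/length(y)
 384 |   return(curTest)
 385 | }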
386 | ```
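
To make the AUROC calculation concrete, here is a small worked example on made-up numbers. It uses a simplified copy of the helper, renamed `ranksumROCToy` so that it does not clash with the function above. The Wilcoxon statistic W counts the (x, y) pairs in which x exceeds y, so dividing it by the number of pairs gives the AUROC.

```r
# Simplified copy of the helper above, for illustration only
dropNA = function(x){ x[!is.na(x)] }
ranksumROCToy = function(x, y, ...){
  x = dropNA(x); y = dropNA(y)
  curTest = wilcox.test(x, y, ...)
  # W / (n_x * n_y): fraction of (x, y) pairs in which x > y
  curTest$AUROC = unname(curTest$statistic) / (length(x) * length(y))
  curTest
}

# Made-up scores: four usable "hit" guides (one NA) and six background guides
hits       = c(2.1, 3.4, 1.8, NA, 2.9)
background = c(0.2, -0.5, 1.9, 0.4, 0.1, -1.0)

rs = ranksumROCToy(hits, background)
print(rs$AUROC)        # 23 of 24 pairs have hit > background: 0.9583333
print(rs$AUROC - 0.5)  # the "AUROC-0.5" value reported in the evaluation
```

An AUROC of 0.5 means the two groups are indistinguishable, which is why the evaluation reports the AUROC minus 0.5: positive values indicate that the first group tends to score higher, negative values the reverse.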
387 | 
388 | ### Evaluation
389 | The inter-replicate correlations were calculated in the sections above and are stored in `metricsBoth`. Below, the two remaining metrics are calculated (distinguishing promoter vs other targeting guides, and adjacent vs randomly paired guides).
390 | ```{r evaluate methods part 2 and 3}
391 | # (1) similarity between the effect sizes estimated per replicate, 
392 | # (above)
393 | metricsBoth$significant= metricsBoth$sig < 0.01;
394 | 
395 | # Other evaluation metrics
396 | allExpts = unique(allResultsBoth[c("celltype","locus","type")])
397 | for (ei in 1:nrow(allExpts)){
398 |   curCelltype = allExpts$celltype[ei]
399 |   curtype = allExpts$type[ei];
400 |   curLocus = allExpts$locus[ei];
401 |   
402 |   curResults = allResultsBoth[allResultsBoth$celltype==curCelltype & 
403 |                                 allResultsBoth$type==curtype & allResultsBoth$locus==curLocus,]
404 |   for(m in unique(curResults$method)){
405 |     curData = curResults[curResults$method==m & !curResults$NT,]
406 |     
407 |     # (2) similarity in effect size between adjacent guides
408 |     curData = curData[order(curData$pos),]
409 |     guideEffectDistances = 
410 |       data.frame(method = m, random=FALSE, 
411 |                  difference = abs(curData$metric[2:nrow(curData)] - curData$metric[1:(nrow(curData)-1)]), 
412 |                  dist =abs(curData$pos[2:nrow(curData)] - curData$pos[1:(nrow(curData)-1)]), 
413 |                  stringsAsFactors = FALSE)
414 |     guideEffectDistances = guideEffectDistances[guideEffectDistances$dist < 100,]
415 |     guideEffectDistances$dist=NULL;
416 |     ### changed 10 in next line to 100 to make this more robust
417 |     curData = curData[sample(nrow(curData), size = nrow(curData)*100, replace = TRUE),]
418 |     guideEffectDistances = 
419 |       rbind(guideEffectDistances,
420 |             data.frame(method = m, random=TRUE,
421 |                        difference = abs(curData$metric[2:nrow(curData)] - 
422 |                                           curData$metric[1:(nrow(curData)-1)]), stringsAsFactors = FALSE));
 423 |     # random pairs should differ more than adjacent pairs
424 |     curRS = ranksumROC(guideEffectDistances$difference[guideEffectDistances$method==m &
425 |                                                          guideEffectDistances$random],
426 |                        guideEffectDistances$difference[guideEffectDistances$method==m &
427 |                                                          !guideEffectDistances$random]) 
428 |     metricsBoth = rbind(metricsBoth, data.frame(method=m, metric="AUROC-0.5",
429 |                                                 which="adjacent_vs_random",value = curRS$AUROC-0.5,
430 |                                                 locus=curLocus,celltype=curCelltype, type=curtype,
431 |                                                 sig=curRS$p.value, significant = curRS$p.value < 0.01))
432 |     
433 |     # (3) ability to distinguish promoter-targeting guides from other guides. 
434 |     if (curtype=="CRISPRi" & !(m %in% c("edgeR","DESeq2"))){ 
435 |       # edgeR and DESeq2 are reversed for CRISPRi
436 |       # non promoter should have higher effect than promoter (more -ve)
437 |       curRS = ranksumROC(curResults$significance[curResults$method==m & !curResults$NT &
438 |                                                    !curResults$promoter],
439 |                          curResults$significance[curResults$method==m & !curResults$NT &
440 |                                                    curResults$promoter]) 
441 |     }else{
442 |       # promoter should have larger effect than non-promoter
443 |       curRS = ranksumROC(curResults$significance[curResults$method==m & !curResults$NT &
444 |                                                    curResults$promoter],
445 |                          curResults$significance[curResults$method==m & !curResults$NT &
446 |                                                    !curResults$promoter]) 
447 |     }
448 |     metricsBoth = rbind(metricsBoth, data.frame(method=m, metric="AUROC-0.5", which="promoter_vs_T",
449 |                                                 value = curRS$AUROC-0.5, locus=curLocus, 
450 |                                                 celltype=curCelltype, type=curtype, sig=curRS$p.value,
451 |                                                 significant = curRS$p.value < 0.01))
452 |   }
453 | }
454 | 
455 | ##compile all metrics; label the best in each test and whether any tests were significant (P<0.01)
456 | metricsBoth2 = metricsBoth;
457 | metricsBoth2$method = factor(as.character(metricsBoth2$method), 
458 |                              levels = c("logVsUnsorted","logHivsLow","DESeq2","edgeR","MAUDE"))
459 | metricsBoth2Best = cast(metricsBoth2, which + locus + type+celltype ~ ., value="value", 
460 |                         fun.aggregate = max)
461 | names(metricsBoth2Best)[ncol(metricsBoth2Best)] = "best"
462 | metricsBoth2 = merge(metricsBoth2, metricsBoth2Best, by = c("which","locus","type","celltype"))
463 | metricsBoth2AnySig = cast(metricsBoth2[colnames(metricsBoth2)!="value"], 
464 |                           which + locus + type+celltype ~ ., value="significant", fun.aggregate = any)
465 | names(metricsBoth2AnySig)[ncol(metricsBoth2AnySig)] = "anySig"
466 | metricsBoth2 = merge(metricsBoth2, metricsBoth2AnySig, by = c("which","locus","type","celltype"))
467 | metricsBoth2$isBest = metricsBoth2$value==metricsBoth2$best;
468 | metricsBoth2$isBestNA = metricsBoth2$isBest;
469 | metricsBoth2$isBestNA[!metricsBoth2$isBestNA]=NA;
470 | metricsBoth2$pctOfMax = metricsBoth2$value/metricsBoth2$best * 100;
471 | 
472 | #fill in NAs for edgeR and DESeq2 which cannot have inter-replicate correlations
473 | temp = metricsBoth2[metricsBoth2$metric=="r",]
474 | temp = temp[1:2,];
475 | temp$method = c("edgeR","DESeq2")
476 | temp$value=NA; temp$isBest=NA; temp$significant=FALSE; temp$pctOfMax=NA; temp$isBestNA=NA;
477 | metricsBoth2 = rbind(metricsBoth2, temp)
478 | ```
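
The adjacent-vs-random metric can be hard to picture from the loop alone, so here is a minimal sketch on simulated data. It assumes a made-up effect profile that varies smoothly along the locus; all variable names are invented for the example. Guides less than 100 bp apart should then have more similar effects than randomly paired guides, giving an AUROC above 0.5.

```r
set.seed(42)

# Simulated tiling screen: 300 guides whose true effect varies smoothly with position
n = 300
pos = sort(sample(1:10000, n))
effect = sin(pos / 500) + rnorm(n, sd = 0.05)

# Absolute effect differences between neighbouring guides less than 100 bp apart
adjacentDiff = abs(diff(effect))[diff(pos) < 100]

# Absolute effect differences between randomly paired guides (resampled 100x)
shuffled = effect[sample(n, size = n * 100, replace = TRUE)]
randomDiff = abs(diff(shuffled))

# AUROC = W / (n_x * n_y); values above 0.5 mean random pairs differ more
w = wilcox.test(randomDiff, adjacentDiff)
auroc = unname(w$statistic) / (length(randomDiff) * length(adjacentDiff))
print(auroc - 0.5)
```

A method whose effect-size estimates are dominated by noise would give similar differences for adjacent and random pairs, pushing this value toward zero.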
479 | 
 480 | Finally, make the graph of all evaluation metrics, as shown in the MAUDE paper.
481 | ```{r make graph, fig.width=10, fig.height=6}
482 | #make the final evaluation graph
483 | p1 = ggplot(metricsBoth2[metricsBoth2$which=="adjacent_vs_random",], 
484 |             aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() +
485 |   geom_text(data=metricsBoth2[metricsBoth2$which=="adjacent_vs_random" & metricsBoth2$anySig,],
486 |             aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() +
487 |   scale_fill_gradient2(high="red", low="blue", mid="black") + 
488 |   theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 
489 |         axis.title.y = element_blank())+scale_colour_manual(values = c("green"), na.value=NA) + 
490 |   scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+ggtitle("Adjacent vs\nrandom guides");
491 | p2 = ggplot(metricsBoth2[metricsBoth2$which=="promoter_vs_T",], 
492 |             aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() +
493 |   geom_text(data=metricsBoth2[metricsBoth2$which=="promoter_vs_T" & metricsBoth2$anySig,],
494 |             aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() +
495 |   scale_fill_gradient2(high="red", low="blue", mid="black") + 
496 |   theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 
497 |         axis.title.y = element_blank())+scale_colour_manual(values = c("green"), na.value=NA)+
498 |   scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+
499 |   ggtitle("Promoter vs other\ntargeting guides");
500 | p3 = ggplot(metricsBoth2[metricsBoth2$which=="replicate_correl",], 
501 |             aes(x=method, fill=value, y=paste(locus,type,celltype))) + geom_tile() +
502 |   geom_text(data=metricsBoth2[metricsBoth2$which=="replicate_correl" & metricsBoth2$anySig,],
503 |             aes(label="*",colour=isBestNA),show.legend = FALSE) +theme_bw() +
504 |   scale_fill_gradient2(high="red", low="blue", mid="black") + 
505 |   theme(legend.position="top", axis.text.x = element_text(hjust=1, angle=45), 
506 |         axis.title.y = element_blank())+scale_colour_manual(values = c("green"), na.value=NA)+
507 |   scale_y_discrete(expand=c(0,0))+scale_x_discrete(expand=c(0,0))+ggtitle("Replicate\ncorrelations");
508 | g= plot_grid(p1,p2,p3, align = 'h', nrow = 1); print(g)
509 | ```
510 | 
511 | ### Session info
512 | ```{r session info}
513 | sessionInfo()
514 | ```
515 | 
516 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 Carl de Boer
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
 1 | # Generated by roxygen2: do not edit by hand
 2 | 
 3 | export(calcFDRByExperiment)
 4 | export(combineZStouffer)
 5 | export(findGuideHits)
 6 | export(findGuideHitsAllScreens)
 7 | export(findOverlappingElements)
 8 | export(getElementwiseStats)
 9 | export(getNBGaussianLikelihood)
10 | export(getTilingElementwiseStats)
11 | export(getZScalesWithNTGuides)
12 | export(makeBinModel)
13 | import(stats)
14 | importFrom(reshape,cast)
15 | 


--------------------------------------------------------------------------------
/NEWS:
--------------------------------------------------------------------------------
1 | Changes in version 1.0.0 (2020-05-29)
2 | + Submitted to Bioconductor
3 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # MAUDE: Mean Alterations Using Discrete Expression
  2 | 
  3 | [![DOI](https://zenodo.org/badge/135627989.svg)](https://zenodo.org/badge/latestdoi/135627989)
  4 | 
  5 | MAUDE is an R package for finding differences in the means of normally distributed (or nearly so) data by measuring abundances in discrete bins. A typical use case is a pooled CRISPRi screen in which expression is read out by FACS sorting cells into discrete bins and sequencing the guides in each bin to measure their abundances. Most of the documentation and examples are written with a CRISPRi-type sorting screen in mind, but MAUDE can be used for any experiment where normally distributed expression values are read out via abundances in discrete expression bins. For example, MAUDE can also be used for [CRISPR base editor screens](https://de-boer-lab.github.io/MAUDE/doc/BACH2_base_editor_screen.html) where the readout is expression of a target gene, and for reporter assays with expression readouts (e.g. [Rafi et al](https://www.biorxiv.org/content/10.1101/2023.04.26.538471v2)).
  6 | 
  7 | See 'Usage' below for more information.
  8 | 
  9 | 
 10 | <img src="images/logo2.png" alt="Maude Flanders" width="400"/>
 11 | 
 12 | # Table of contents
 13 | <!--ts-->
 14 |    * [R Installation](#r-installation)
 15 |    * [Requirements](#requirements)
 16 |    * [Usage](#usage)
 17 |    * [Citation](#citation)
 18 | <!--te-->
 19 | 
 20 | # R Installation
 21 | 
 22 | ## Option 1: Install directly from GitHub
 23 | 
 24 | If you don't already have `devtools`, install it:
 25 | ```
 26 | install.packages("devtools")
 27 | ```
 28 | 
 29 | Then use `devtools` to install MAUDE from GitHub:
 30 | 
 31 | ```
 32 | devtools::install_github("de-Boer-Lab/MAUDE")
 33 | ```
 34 | 
 35 | ## Option 2: Install from download
 36 | 
 37 | Download the latest MAUDE release (under "Releases" on the right-hand side of this page).
 38 | 
 39 | Decompress the directory contained within it (something like "MAUDE-1.0.2").
 40 | 
 41 | Then in R:
 42 | If you don't already have `devtools`, install it:
 43 | ```
 44 | install.packages("devtools")
 45 | ```
 46 | 
 47 | Then install in R using:
 48 | 
 49 | ```
 50 | devtools::install_local("C:\\Users\\cdeboer\\Downloads\\MAUDE-1.0.2")
 51 | ```
 52 | 
 53 | # Requirements
 54 | Right now we have three main requirements: 
 55 | 1. Negative control guides are included in the experiment (these are used for calibrating Z-scores and P-values, and so are not strictly needed if only the expression means are desired).
 56 | 2. The abundance of the guides has been measured somehow (usually by sequencing the guide DNA of unsorted cells, though there are ways to estimate this post-sort if the bins cover the majority of the distribution).
 57 | 3. The fraction of cells sorted into each expression bin has been quantified (typically the cell counts/fractions read off the cell sorter).
 58 | 
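Requirement 3 can be made concrete with a short sketch. Assuming the sorter reports the lower and upper percentile of each bin (the column names `binStartQ` and `binEndQ` follow the CD69 tutorial; the bin names and values below are made up for illustration), the captured fractions and the bin bounds as Z-scores can be derived in base R:

```R
# Hypothetical sort gates: lower/upper percentiles of each bin (illustrative values only)
binStats = data.frame(Bin = c("low", "medium", "high"),
                      binStartQ = c(0.05, 0.40, 0.85),
                      binEndQ   = c(0.25, 0.60, 0.95),
                      stringsAsFactors = FALSE)
# fraction of cells captured by each bin
binStats$fraction = binStats$binEndQ - binStats$binStartQ
# bin bounds as Z-scores of a standard normal expression distribution
binStats$binStartZ = qnorm(binStats$binStartQ)
binStats$binEndZ = qnorm(binStats$binEndQ)
print(binStats)
```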
 59 | 
 60 | # Usage
 61 | 
 62 | ## Tutorials
 63 | We provide three tutorials on how to run a MAUDE analysis in R here:
 64 | 1. [Re-analysis of CD69 screen data](https://de-boer-lab.github.io/MAUDE/doc/CD69_tutorial.html)
 65 | 2. [Analysis of a simulated screen](https://de-boer-lab.github.io/MAUDE/doc/simulated_data_tutorial.html)
 66 | 3. [Analysis of a CRISPR base editor non-coding mutation screen](https://de-boer-lab.github.io/MAUDE/doc/BACH2_base_editor_screen.html)
 67 | 
 68 | For additional examples, see the [script for evaluating and comparing sorting-based CRISPR screen analysis methods.](https://de-boer-lab.github.io/MAUDE/Evaluation/method_evaluation.html)
 69 | 
 70 | ## Quantifying guide DNA abundance
 71 | After sequencing, you get fastqs, one per sorting bin and experiment.  The first step for a MAUDE analysis is to quantify the number of guides residing in each bin.  Here, we provide some guidance as to how to do this.
 72 | 
 73 | We have previously used the aligner [`bowtie2`](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml).
 74 | 
 75 | To make the bowtie2 reference `guide_seq_reference`:
 76 | ```bash
 77 | bowtie2-build guide_seqs.fa guide_seq_reference
 78 | 
 79 | ```
 80 | where `guide_seqs.fa` is a fasta file including the sequences you are mapping against, which will include the guide DNA sequence and any flanking constant regions as well. The amount of constant sequence you include in the reference should be at least as much as what was sequenced.
 81 | 
 82 | For example, with 20bp guides with constant flanking `GTTTAAGAGCTATGCTGGAAACAGCATAG`:
 83 | ```
 84 | >guide1
 85 | GTCGCATATCGCGATAGCGAGTTTAAGAGCTATGCTGGAAACAGCATAG
 86 | >guide2
 87 | GTCGTGAAAGTGCTGTTGAGGTTTAAGAGCTATGCTGGAAACAGCATAG
 88 | ...
 89 | ```
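A reference file like this could also be generated in R rather than by hand. The following is a minimal sketch, assuming the guides are held in a named character vector (the variable names are illustrative; the two guide sequences and the constant flank are those of the example above):

```R
# illustrative guide sequences, named by guide ID
guides = c(guide1 = "GTCGCATATCGCGATAGCGA", guide2 = "GTCGTGAAAGTGCTGTTGAG")
constantFlank = "GTTTAAGAGCTATGCTGGAAACAGCATAG"
# interleave ">name" header lines with guide + flank sequences
fastaLines = as.vector(rbind(paste0(">", names(guides)), paste0(guides, constantFlank)))
writeLines(fastaLines, "guide_seqs.fa")
```

The resulting `guide_seqs.fa` can then be passed to `bowtie2-build` as shown earlier.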
 90 | 
 91 | The following command is an example of how to quantify guide abundance into a format that can easily be input into R for MAUDE analysis:
 92 | ```bash
 93 | bowtie2 --no-head -x guide_seq_reference -U $sample.fastq.gz -S $sample.mapped.sam
 94 | #here, we include all mapped reads, but by using Samtools, you can filter out reads that map to the wrong strand, have indels, etc.
 95 | cat $sample.mapped.sam | awk '{print $3}' | sort | uniq -c | sort > $sample.counts
 96 | ```
 97 | Here, `$sample` is the sample name, with `$sample.fastq.gz` the corresponding fastq file, and `guide_seq_reference` is the `bowtie2` reference.  The file `$sample.counts` will contain guide counts that can be input into R. 
 98 | 
 99 | To turn this into a format that can easily be used for a MAUDE analysis, you can input the data using something like the following:
100 | ```R
101 | #here, allSamples is a data.frame containing one sample per row, with columns including ID, expt, and Bin.  There should be one file for every row in allSamples
102 | allData = data.frame();
103 | for (i in 1:nrow(allSamples)){
104 |   curData = read.table(file=sprintf("%s/%s.counts",inDir,allSamples$ID[i]), quote="", header = F, row.names = NULL, stringsAsFactors = F)
105 |   names(curData) = c("count","guideID");
106 |   curData = curData[curData$guideID!="*",] # remove unmapped counts
107 |   curData$ID = allSamples$ID[i];
108 |   curData$expt = allSamples$expt[i];
109 |   curData$Bin = allSamples$Bin[i];
110 |   allData = rbind(allData, curData)
111 | }
112 | #now you have the data in a data.frame that can be reshaped to a MAUDE-compatible format:
113 | library(reshape)
114 | allDataCounts = as.data.frame(cast(allData, expt + guideID ~ Bin, value="count"));
115 | allDataCounts[is.na(allDataCounts)]=0; # fill in 0s for guides not observed at all
116 | #now you just need to label the non-targeting guides and this will be in the correct format
117 | ```
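How the non-targeting guides are labelled depends on how they are named in your library. As a sketch, assuming their IDs contain the string `negative_control` (an assumption; substitute your own naming pattern) and using a toy stand-in for `allDataCounts`:

```R
# toy stand-in for allDataCounts (illustrative values only)
allDataCounts = data.frame(expt = "1",
                           guideID = c("guide1", "guide2", "negative_control_1"),
                           A = c(10, 0, 5), B = c(3, 7, 6),
                           stringsAsFactors = FALSE)
# TRUE for non-targeting (negative control) guides
allDataCounts$isNontargeting = grepl("negative_control", allDataCounts$guideID)
```

The name of this column (here `isNontargeting`) is what the tutorials pass as the `negativeControl` argument of `findGuideHitsAllScreens`.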
118 | 
119 | ## Encountering problems
120 | Should you encounter a problem using MAUDE:
121 | 1. [Consult the Common Problems](CommonProblems.md)
122 | 2. [Submit an Issue](https://github.com/Carldeboer/MAUDE/issues)
123 | 3. Contact the authors.
124 | 
125 | 
126 | # Citation
127 | Please cite:
128 | 
129 | Carl G de Boer*, John P Ray*, Nir Hacohen, Aviv Regev. [_MAUDE: Inferring Expression Changes in Sorting-Based CRISPR Screens_.](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02046-8) Genome Biology. 2020 Jun 3;21(1):134. doi: 10.1186/s13059-020-02046-8. PMID: [32493396](https://pubmed.ncbi.nlm.nih.gov/32493396/).
130 | 


--------------------------------------------------------------------------------
/doc/BACH2_base_editor_screen.R:
--------------------------------------------------------------------------------
  1 | ## ----setup, include=FALSE-----------------------------------------------------
  2 | knitr::opts_chunk$set(echo = TRUE)
  3 | 
  4 | ## -----------------------------------------------------------------------------
  5 | library(ggplot2)
  6 | library(reshape)
  7 | library(MAUDE)
  8 | maudeGitPathRoot = "https://raw.githubusercontent.com/de-Boer-Lab/MAUDE/master"
  9 | 
 10 | ## -----------------------------------------------------------------------------
 11 | allSamples = read.table(file=sprintf("%s/vignettes/BACH2_data/sample_metadata.txt",maudeGitPathRoot), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F)
 12 | head(allSamples)
 13 | #remove some of the control samples
 14 | allSamples = allSamples[!grepl("HEK62", allSamples$replicate),]
 15 | 
 16 | 
 17 | ## -----------------------------------------------------------------------------
 18 | if (FALSE){
 19 |   #this was run on my computer to load the data from many files spread out over many subdirectories. Rather than upload all these files separately, I have loaded them all locally and saved the resulting concatenated file onto github. I leave this code here so that others may view how the data was loaded, should they have their own CRISPResso files to analyze.
 20 |   inDir = "/Path/To/CRISPResso/Files";
 21 |   setwd(inDir)
 22 |   
 23 |   
 24 |   allCRISPRessoData= data.frame()
 25 |   for (i in 1:nrow(allSamples)){
 26 |     curData = read.table(file=sprintf("%s/%s/Alleles_frequency_table.txt",inDir, allSamples$directory[i]), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F, comment.char = "")
 27 |     curData2 = read.table(file=sprintf("%s/Deep_resequencing_analysis/%s/Alleles_frequency_table.txt",inDir, allSamples$directory[i]), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F, comment.char = "")
 28 |     curData = rbind(curData, curData2);
 29 |     curData$locus = allSamples$locus[i];
 30 |     curData$replicate = allSamples$replicate[i];
 31 |     curData$sortBin = allSamples$sortBin[i];
 32 |     allCRISPRessoData = rbind(allCRISPRessoData, curData)
 33 |   }
 34 |   #remove gaps from sequences
 35 |   allCRISPRessoData$SeqSpecies = gsub("-","",allCRISPRessoData$Aligned_Sequence); # Allele
 36 |   #remove unwanted fields
 37 |   allCRISPRessoData$Aligned_Sequence=NULL; allCRISPRessoData$Reference_Sequence=NULL;  allCRISPRessoData$X.Reads.1=NULL;
 38 |   write.table(allCRISPRessoData, file=sprintf("%s/loaded_BACH2_BE_CRISPResso_data.txt", inDir),row.names = F, col.names = T, quote=F, sep="\t")
 39 |   #I then gzipped this file and uploaded it to github
 40 | }else{
 41 |   #load the CRISPResso files from GitHub. 
 42 |   z= gzcon(url(sprintf("%s/vignettes/BACH2_data/loaded_BACH2_BE_CRISPResso_data.txt.gz",maudeGitPathRoot)));
 43 |   fileConn=textConnection(readLines(z));
 44 |   allCRISPRessoData =  read.table(fileConn, sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F)
 45 |   close(fileConn)
 46 | }
 47 | 
 48 | ## -----------------------------------------------------------------------------
 49 | #CRISPResso has split some sequences into two or maybe more lines; below is an example
 50 | #We also input multiple files from different sequencing runs. This merges all identical seq species'/samples
 51 | allCRISPRessoData = cast(melt(allCRISPRessoData, id.vars = c("SeqSpecies","Reference_Name","Read_Status","n_deleted","n_inserted","n_mutated", "locus","replicate","sortBin")), SeqSpecies + Reference_Name + Read_Status + n_deleted + n_inserted + n_mutated + locus + replicate + sortBin ~ variable, value="value", fun.aggregate=sum)
 52 | 
 53 | #Get read totals per replicate
 54 | allCRISPRessoDataTotals = cast(allCRISPRessoData, locus + replicate + sortBin ~ ., value="X.Reads", fun.aggregate = sum)
 55 | names(allCRISPRessoDataTotals)[ncol(allCRISPRessoDataTotals)] = "totalReads";
 56 | 
 57 | #Add totalReads column to allCRISPRessoData, calculate read fractions
 58 | allCRISPRessoData = merge(allCRISPRessoData, allCRISPRessoDataTotals, by=c("locus","replicate","sortBin"))
 59 | allCRISPRessoData$readFraction = allCRISPRessoData$X.Reads/allCRISPRessoData$totalReads;
 60 | 
 61 | ## -----------------------------------------------------------------------------
 62 | p = ggplot(allCRISPRessoDataTotals, aes(x=replicate, fill=sortBin, y=totalReads)) + geom_bar(stat="identity", position="dodge") + facet_grid(. ~ locus) + theme_classic() + scale_y_continuous(expand=c(0,0))+theme(axis.text.x=element_text(hjust=1, angle=90)); print(p)
 63 | 
 64 | ## -----------------------------------------------------------------------------
 65 | p = ggplot(allCRISPRessoDataTotals, aes(x=replicate, fill=sortBin, y=totalReads)) + geom_bar(stat="identity", position="dodge") + facet_grid(. ~ locus) + theme_classic() + scale_y_log10(expand=c(0,0))+theme(axis.text.x=element_text(hjust=1, angle=90)); print(p)
 66 | 
 67 | ## -----------------------------------------------------------------------------
 68 | allCRISPRessoData = allCRISPRessoData[ !(allCRISPRessoData$locus=="HEK" & grepl("^[ABCR][12]$", allCRISPRessoData$replicate)) & !(allCRISPRessoData$locus=="BACH2" & grepl("^HEK9[12]", allCRISPRessoData$replicate)),]
 69 | 
 70 | ## -----------------------------------------------------------------------------
 71 | p = ggplot(allCRISPRessoData[allCRISPRessoData$sortBin=="NS",], aes(x= readFraction, colour=replicate)) + stat_ecdf()+facet_grid(locus ~ .) + scale_x_log10() + theme_bw(); print(p)
 72 | 
 73 | ## -----------------------------------------------------------------------------
 74 | temp = cast(allCRISPRessoData[allCRISPRessoData$Read_Status=="UNMODIFIED" & allCRISPRessoData$Reference_Name=="Reference",], formula = sortBin + locus+ replicate ~. ,value = "readFraction", fun.aggregate = max)
 75 | names(temp)[ncol(temp)]="readFraction";
 76 | p = ggplot(temp, aes(x=sortBin, y=replicate, fill=readFraction)) + geom_tile() + facet_grid(locus ~., scales="free_y")+ggtitle("just unmodified read fractions") + scale_fill_gradientn(colours=c("red","orange", "green","cyan","blue","violet"), limits=c(0,0.8)); print(p)
 77 | min(temp$readFraction, na.rm = T) # 0.0356963
 78 | 
 79 | ## -----------------------------------------------------------------------------
 80 | allCRISPRessoData = allCRISPRessoData[allCRISPRessoData$replicate!="A1-HEK",] # this sample had no data for bin D (BACH2) because it didn't PCR well
 81 | 
 82 | allCRISPRessoData = allCRISPRessoData[allCRISPRessoData$replicate!="B2-HEK",] # this sample had very low coverage for bin D for BACH2
 83 | 
 84 | ## -----------------------------------------------------------------------------
 85 | seqsObserved = cast(allCRISPRessoData, SeqSpecies + locus + Read_Status + Reference_Name~ .)
 86 | names(seqsObserved)[ncol(seqsObserved)]="seqRunsObserved";
 87 | 
 88 | retainedSamples = unique(allCRISPRessoData[,c("locus","replicate","sortBin")])
 89 | 
 90 | for (l in unique(allSamples$locus)){
 91 |   seqsObserved$inAll[seqsObserved$locus == l] = seqsObserved$seqRunsObserved[seqsObserved$locus == l]==sum(retainedSamples$locus==l)
 92 | }
 93 | 
 94 | #require that all samples have at least one read for every species considered. This will help exclude read errors
 95 | keepAlleles = seqsObserved[seqsObserved$inAll,]
 96 | 
 97 | ## -----------------------------------------------------------------------------
 98 | #First make a count matrix, but only of alleles seen in all replicates
 99 | readCountMat = cast(allCRISPRessoData[allCRISPRessoData$SeqSpecies %in% keepAlleles$SeqSpecies,], SeqSpecies  + replicate +locus ~ sortBin, value="X.Reads")
100 | readCountMat[is.na(readCountMat)] = 0
101 | readCountMat = readCountMat[order(readCountMat$NS, decreasing = T),]
102 | 
103 | #Label WT alleles
104 | wtSeqs = data.frame(locus = c("HEK","BACH2"), SeqSpecies =  c("GGTAGCCAGAGACCCGCTGGTCTTCTTTCCCCTCCCCTGCCCTCCCCTCCCTTCAAGATGGCTGACAA","TGCCCCACCCTGTGCCCTTTTTACATTACACACAAATAGGGACGGATTTCCTGTAAGCTGATCTTGAAGAAAAAAAACATGTTAGACAAAGAAAATCAGAACTAAGA"), isWT=T, stringsAsFactors = F);
105 | readCountMat = merge(readCountMat, wtSeqs, by=c("locus","SeqSpecies"), all.x=T)
106 | readCountMat$isWT[is.na(readCountMat$isWT)]=F;
107 | 
108 | allCRISPRessoData = merge(allCRISPRessoData, wtSeqs, by=c("locus","SeqSpecies"), all.x=T)
109 | allCRISPRessoData$isWT[is.na(allCRISPRessoData$isWT)]=F;
110 | 
111 | 
112 | binBounds = makeBinModel(data.frame(Bin=c("A","B","C","D","E","F"), fraction=rep(0.105,6))) # 10.5% bins
113 | # but only the top ~21% (CD) and bottom ~21% (AB) were retained
114 | binBounds = binBounds[binBounds$Bin %in% c("A","B","E","F"),] 
115 | binBounds$Bin[binBounds$Bin=="E"]="C"
116 | binBounds$Bin[binBounds$Bin=="F"]="D"
117 | 
118 | ## -----------------------------------------------------------------------------
119 | p = ggplot(binBounds, aes(colour=Bin))  + 
120 |   geom_density(data=data.frame(x=rnorm(100000)), aes(x=x), fill="gray", colour=NA)+ 
121 |   ggplot2::geom_segment(aes(x=binStartZ, xend=binEndZ, y=fraction, yend=fraction)) + 
122 |   xlab("Bin bounds as expression Z-scores") + 
123 |   ylab("Fraction of the distribution captured") +theme_classic()+scale_y_continuous(expand=c(0,0))+
124 |   coord_cartesian(ylim=c(0,0.7)); print(p)
125 | 
126 | ## -----------------------------------------------------------------------------
127 | curExpts = unique(readCountMat[c("replicate","locus")])
128 | 
129 | binStatsAll = data.frame();
130 | for(i in 1:nrow(curExpts)){
131 |   binStatsAll = rbind(binStatsAll, cbind(binBounds, curExpts[i,]))
132 | }
133 | 
134 | ## -----------------------------------------------------------------------------
135 | #perform MAUDE analysis at guide (allele in this case) level for each locus.
136 | guideLevelStats = data.frame();
137 | for( l in unique(allSamples$locus)){
138 |   message(l);
139 |   statsA = findGuideHitsAllScreens(experiments = curExpts[curExpts$locus==l,], countDataFrame = readCountMat[readCountMat$locus==l,], binStats = binStatsAll[binStatsAll$locus==l,], sortBins = c("A","B","C","D"), unsortedBin = "NS", negativeControl = "isWT")
140 |   guideLevelStats = rbind(guideLevelStats, statsA)
141 | }
142 | 
143 | ## -----------------------------------------------------------------------------
144 | baseChanges = data.frame()
145 | for (l in wtSeqs$locus){
146 |   wtSeq = wtSeqs$SeqSpecies[wtSeqs$locus==l];
147 |   wtSplit = strsplit(wtSeq,"")[[1]]
148 |   curSeqs = unique(guideLevelStats$SeqSpecies[guideLevelStats$locus==l & guideLevelStats$libFraction > 0.001])
149 |   for (i in 1:length(curSeqs)){
150 |     curSplit = strsplit(curSeqs[i],"")[[1]]
151 |     mismatches = c();
152 |     mismatchPoss=c();
153 |     for (j in 1:min(length(wtSplit),length(curSplit))){
154 |       if (wtSplit[j]!=curSplit[j]){
155 |         mismatches = c(mismatches, curSplit[j]);
156 |         mismatchPoss = c(mismatchPoss, j)
157 |       }
158 |     }
159 |     if (length(mismatchPoss)>0){
160 |       baseChanges = rbind(baseChanges, data.frame(mismatch=mismatches, position = mismatchPoss, SeqSpecies=curSeqs[i], locus=l))
161 |     }
162 |   }
163 | }
164 | 
165 | baseChangesSummary = cast(baseChanges, SeqSpecies +locus~ ., value="mismatch")
166 | names(baseChangesSummary)[ncol(baseChangesSummary)] = "numMismatches";
167 | p = ggplot(baseChangesSummary, aes(x=numMismatches, colour=locus)) + stat_ecdf()+ geom_vline(xintercept = 15); print(p)
168 | 
169 | ## -----------------------------------------------------------------------------
170 | #exclude any alleles with 15 or more mismatches - these are also likely read or PCR artifacts.
171 | baseChangesSummary$keepAlleles= baseChangesSummary$numMismatches<15;
172 | 
173 | baseChanges$has = 1;
174 | 
175 | baseChangesNumWith = cast(baseChanges[baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles],], position + mismatch + locus~ ., value="has", fun.aggregate = sum)
176 | names(baseChangesNumWith)[ncol(baseChangesNumWith)] = "numSeqsWith"
177 | baseChanges = merge(baseChanges, baseChangesNumWith, by=c("position","mismatch","locus"))
178 | baseChanges$mismatch = as.character(baseChanges$mismatch)
179 | baseChanges$SeqSpecies = as.character(baseChanges$SeqSpecies)
180 | 
181 | 
182 | p = ggplot(baseChanges[baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles] & !(baseChanges$SeqSpecies %in% baseChanges$SeqSpecies[baseChanges$numSeqsWith==1]),], aes(x=position, y=SeqSpecies, fill=mismatch)) + geom_tile()+scale_fill_manual(values=c("orange","green","blue","red"))+scale_x_continuous(expand=c(0,0))+theme_classic() + theme(axis.text.y = element_text(size=5, family = "Courier")) + facet_grid(locus ~ ., scales="free", space ="free"); print(p)
183 | 
184 | ## -----------------------------------------------------------------------------
185 | #A_53 is the mutation of interest
186 | 
187 | baseChangesMatrix = cast(baseChanges[baseChanges$locus == "BACH2" & baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles] & !(baseChanges$SeqSpecies %in% baseChanges$SeqSpecies[baseChanges$numSeqsWith==1]),], SeqSpecies ~ mismatch + position, value="has", fill = 0)
188 | A53_seq = baseChangesMatrix$SeqSpecies[apply(baseChangesMatrix[2:ncol(baseChangesMatrix)],1, sum)==1 & baseChangesMatrix$A_53==1]
189 | 
190 | guideLevelStats$pooled = grepl("-",guideLevelStats$replicate);
191 | 
192 | ## -----------------------------------------------------------------------------
193 | wtAndVarMaudeMu = cast(guideLevelStats[guideLevelStats$SeqSpecies %in% c(A53_seq, wtSeqs$SeqSpecies[wtSeqs$locus=="BACH2"]),], replicate +pooled ~ SeqSpecies, value="mean")
194 | names(wtAndVarMaudeMu)[names(wtAndVarMaudeMu)==A53_seq]="rs72928038";
195 | names(wtAndVarMaudeMu)[names(wtAndVarMaudeMu)==wtSeqs$SeqSpecies[wtSeqs$locus=="BACH2"]]="WT";
196 | 
197 | ttestResults = t.test(x = wtAndVarMaudeMu$rs72928038[!wtAndVarMaudeMu$pooled], y=wtAndVarMaudeMu$WT[!wtAndVarMaudeMu$pooled], alternative="less", paired=TRUE, var.equal = FALSE)
198 | 
199 | ttestResults_mixedCells = t.test(x = wtAndVarMaudeMu$rs72928038[wtAndVarMaudeMu$pooled], y=wtAndVarMaudeMu$WT[wtAndVarMaudeMu$pooled], alternative="less", paired=TRUE, var.equal = FALSE)
200 | 
201 | wtAndVarMaudeMu$replicate2 = gsub("-HEK","",wtAndVarMaudeMu$replicate)
202 | 
203 | 
204 | ## -----------------------------------------------------------------------------
205 | meltedSNPMus = melt(as.data.frame(wtAndVarMaudeMu), id.vars=c("pooled","replicate", "replicate2"));
206 | meltedSNPMus$genotype = factor(ifelse(meltedSNPMus$variable=="WT", "G", "A (risk)"),levels=c("G","A (risk)"))
207 | p = ggplot(meltedSNPMus, aes(x=genotype, y=value, group=replicate)) + geom_point() + geom_line()+facet_grid(. ~ pooled)+theme_bw()+xlab("Genotype") + ylab("Mean expression") + ggtitle(sprintf("P=%f; P=%f", ttestResults$p.value, ttestResults_mixedCells$p.value)); print(p)
208 | 
209 | 


--------------------------------------------------------------------------------
/doc/CD69_tutorial.R:
--------------------------------------------------------------------------------
  1 | ## ----setup, include=FALSE-----------------------------------------------------
  2 | knitr::opts_chunk$set(echo = TRUE)
  3 | 
  4 | ## ----load libraries, results="hide", message = FALSE, warning=FALSE-----------
  5 | #load required libraries
  6 | library(openxlsx)
  7 | library(reshape)
  8 | library(ggplot2)
  9 | library(MAUDE)
 10 | library(GenomicRanges)
 11 | library(ggbio)
 12 | library(Homo.sapiens)
 13 | 
 14 | ## ----input data---------------------------------------------------------------
 15 | #read in the CD69 screen data
 16 | CD69Data = read.xlsx('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5675716/bin/NIHMS913084-supplement-supplementary_table_1.xlsx')
 17 | 
 18 | #identify non-targeting guides
 19 | CD69Data$isNontargeting = grepl("negative_control", CD69Data$gRNA_systematic_name)
 20 | 
 21 | CD69Data = unique(CD69Data) # for some reason there were duplicated rows in this table - remove duplicates
 22 | 
 23 | #reshape the count data so we can label the experimental replicates and bins, and remove all the non-count data
 24 | cd69CountData = melt(CD69Data, id.vars = c("PAM_3primeEnd_coord","isNontargeting","gRNA_systematic_name"))
 25 | cd69CountData = cd69CountData[grepl(".count$",cd69CountData$variable),] # keep only read count columns
 26 | cd69CountData$Bin = gsub("CD69(.*)([12]).count","\\1",cd69CountData$variable)
 27 | cd69CountData$expt = gsub("CD69(.*)([12]).count","\\2",cd69CountData$variable)
 28 | cd69CountData$reads= as.numeric(cd69CountData$value); cd69CountData$value=NULL;
 29 | cd69CountData$Bin = gsub("_","",cd69CountData$Bin) # remove extra underscores
 30 | 
 31 | #reshape into a matrix
 32 | binReadMat = data.frame(cast(cd69CountData[!is.na(cd69CountData$PAM_3primeEnd_coord) | cd69CountData$isNontargeting,], 
 33 |   PAM_3primeEnd_coord+gRNA_systematic_name+isNontargeting+expt ~ Bin, value="reads"))
 34 | #binReadMat now contains a matrix in the proper format for MAUDE analysis
 35 | 
 36 | 
 37 | ## ----input set of DHS peaks---------------------------------------------------
 38 | dhsPeakBED = read.table(system.file("extdata", "Encode_Jurkat_DHS_both.merged.bed", package = "MAUDE", mustWork = TRUE), 
 39 |   stringsAsFactors=FALSE, row.names=NULL, sep="\t", header=FALSE)
 40 | names(dhsPeakBED) = c("chrom","start","end");
 41 | #add a column to include peak names
 42 | dhsPeakBED$name = paste(dhsPeakBED$chrom, paste(dhsPeakBED$start, dhsPeakBED$end, sep="-"), sep=":")
 43 | 
 44 | ## ----read in bin fractions----------------------------------------------------
 45 | #read in the bin fractions derived from Simeonov et al Extended Data Fig 1a and the "digitize" R package
 46 | #Ideally, you derive this from the FACS sort data. 
 47 | binStats = read.table(system.file("extdata", "CD69_bin_percentiles.txt", package = "MAUDE", mustWork = TRUE), 
 48 |   stringsAsFactors=FALSE, row.names=NULL, sep="\t", header=TRUE)
 49 | binStats$fraction = binStats$binEndQ - binStats$binStartQ; #the fraction of cells captured is the difference in bin start and end percentiles
 50 | 
 51 | #plot the bins as the percentiles of the distribution captured by each bin
 52 | p = ggplot(binStats, aes(colour=Bin)) + ggplot2::geom_segment(aes(x=binStartQ, xend=binEndQ, y=fraction, yend=fraction)) + 
 53 |   xlab("Bin bounds as percentiles") + ylab("Fraction of the distribution captured") +theme_classic() + 
 54 |   scale_y_continuous(expand=c(0,0))+coord_cartesian(ylim=c(0,0.7)); print(p)
 55 | 
 56 | ## ----convert bin percentiles to Z scores--------------------------------------
 57 | #convert bin fractions to Z scores
 58 | binStats$binStartZ = qnorm(binStats$binStartQ)
 59 | binStats$binEndZ = qnorm(binStats$binEndQ)
 60 | 
 61 | ## ----plot bins----------------------------------------------------------------
 62 | p = ggplot(binStats, aes(colour=Bin))  + 
 63 |   geom_density(data=data.frame(x=rnorm(100000)), aes(x=x), fill="gray", colour=NA)+ 
 64 |   ggplot2::geom_segment(aes(x=binStartZ, xend=binEndZ, y=fraction, yend=fraction)) + 
 65 |   xlab("Bin bounds as expression Z-scores") + 
 66 |   ylab("Fraction of the distribution captured") +theme_classic()+scale_y_continuous(expand=c(0,0))+
 67 |   coord_cartesian(ylim=c(0,0.7)); print(p)
 68 | 
 69 | ## ----duplicate bins for second replicate--------------------------------------
 70 | binStats = rbind(binStats, binStats) #duplicate data
 71 | binStats$expt = c(rep("1",4),rep("2",4)); #name the first duplicate expt "1" and the next expt "2";
 72 | 
 73 | ## ----find guide effects-------------------------------------------------------
 74 | guideLevelStats = findGuideHitsAllScreens(experiments = unique(binReadMat["expt"]), 
 75 |   countDataFrame = binReadMat, binStats = binStats, 
 76 |   sortBins = c("baseline","high","low","medium"), 
 77 |   unsortedBin = "back", negativeControl = "isNontargeting")
 78 | 
 79 | ## ----plot guide effects-------------------------------------------------------
 80 | # Plot the guide-level mus
 81 | p = ggplot(guideLevelStats, aes(x=mean, colour=isNontargeting, linetype=expt)) + geom_density()+
 82 |   theme_classic()+scale_y_continuous(expand=c(0,0)) + geom_vline(xintercept = 0)+
 83 |   xlab("Learned mean guide expression"); print(p);
 84 | 
 85 | ## ----plot guide level Zs------------------------------------------------------
 86 | # Plot the guide-level Zs
 87 | p = ggplot(guideLevelStats, aes(x=Z, colour=isNontargeting, linetype=expt)) + geom_density()+
 88 |   theme_classic()+scale_y_continuous(expand=c(0,0)) + geom_vline(xintercept = 0)+
 89 |   xlab("Learned guide expression Z score"); 
 90 | print(p)
 91 | 
 92 | ## ----plot replicate scatter---------------------------------------------------
 93 | guideEffectsByRep = cast(guideLevelStats, 
 94 |   gRNA_systematic_name + isNontargeting + PAM_3primeEnd_coord ~ expt, value="Z")
 95 | 
 96 | p = ggplot(guideEffectsByRep[!guideEffectsByRep$isNontargeting,], aes(x=`1`, y=`2`)) + 
 97 |   geom_point(size=0.3) + xlab("Replicate 1 Z score") + ylab("Replicate 2 Z score") + 
 98 |   ggtitle(sprintf("r = %f",cor(guideEffectsByRep$`1`[!guideEffectsByRep$isNontargeting],
 99 |     guideEffectsByRep$`2`[!guideEffectsByRep$isNontargeting])))+theme_classic(); 
100 | print(p)
101 | 
102 | ## ----plot locus---------------------------------------------------------------
103 | dhsPos = min(guideLevelStats$Z)*1.05;
104 | p=ggplot(guideLevelStats, aes(x=PAM_3primeEnd_coord, y=Z)) +geom_point(size=0.5)+facet_grid(expt ~.)+ 
105 |   ggplot2::geom_segment(data = dhsPeakBED, aes(x=start, xend=end,y=dhsPos, yend=dhsPos), colour="red") + 
106 |   theme_classic() + xlab("Genomic position") + ylab("Guide Z score"); 
107 | print(p)
108 | 
109 | ## ----infer sliding window effects---------------------------------------------
110 | guideLevelStats$chrom = "chr12"; # we need to tell it what chromosome our guides are on - they're all on chr12
111 | slidingWindowElements = getTilingElementwiseStats(experiments = unique(binReadMat["expt"]), 
112 |   normNBSummaries = guideLevelStats, tails="both", window = 200, location = "PAM_3primeEnd_coord",
113 |   chr="chrom",negativeControl = "isNontargeting")
114 | #override the default chromosome field 'chr' with the GRanges compatible 'chrom'
115 | names(slidingWindowElements)[names(slidingWindowElements)=="chr"]="chrom" 
116 | 
117 | ## ----tiles locus effects------------------------------------------------------
118 | dhsPos = min(slidingWindowElements$meanZ)*1.05;
119 | p=ggplot(slidingWindowElements, aes(x=start, xend=end, y=meanZ,yend=meanZ, colour=FDR<0.01)) +
120 |   ggplot2::geom_segment(size=1)+facet_grid(expt ~.) + theme_classic() + xlab("Genomic position") + 
121 |   ylab("Element Z score") + geom_hline(yintercept = 0) + 
122 |   ggplot2::geom_segment(data = dhsPeakBED, aes(x=start, xend=end,y=dhsPos, yend=dhsPos), colour="black");
123 | print(p)
124 | 
125 | ## ----tiled replicate scatter--------------------------------------------------
126 | slidingWindowElementsByRep = cast(slidingWindowElements, chrom + start + end +numGuides ~ expt, 
127 |   value="meanZ")
128 | p = ggplot(slidingWindowElementsByRep, aes(x=`1`, y=`2`)) + geom_point(size=0.5) + 
129 |   xlab("Replicate 1 element effect Z score") + ylab("Replicate 2 element effect Z score") + 
130 |   ggtitle(sprintf("r = %f",cor(slidingWindowElementsByRep$`1`,slidingWindowElementsByRep$`2`)))+
131 |   theme_classic(); 
132 | print(p)
133 | 
134 | ## ----element-level stats------------------------------------------------------
135 | #the next command annotates our guides with any DHS peak they lie in.
136 | annotatedGuides = findOverlappingElements(guides = unique(guideLevelStats[!guideLevelStats$isNontargeting,
137 |   c("PAM_3primeEnd_coord","gRNA_systematic_name","chrom")]), elements = dhsPeakBED, 
138 |   elements.start = "start", elements.end = "end", elements.chr = "chrom", 
139 |   guides.pos = "PAM_3primeEnd_coord", guides.chr = "chrom")
140 | 
141 | #merge regulatory element annotations back onto guideLevelStats
142 | guideLevelStats = merge(guideLevelStats, annotatedGuides[c("gRNA_systematic_name", "name")],
143 |                         by="gRNA_systematic_name", all.x=TRUE)
144 | 
145 | #this is where we are actually running MAUDE to find element-level stats
146 | dhsPeakStats = getElementwiseStats(experiments = unique(binReadMat["expt"]), 
147 |   normNBSummaries = guideLevelStats, negativeControl = "isNontargeting", 
148 |   elementIDs = "name") # "name" is the peak IDs from the DHS BED file
149 | 
150 | #merge peak info back into dhsPeakStats
151 | dhsPeakStats = merge(dhsPeakStats, dhsPeakBED, by="name");
152 | 
153 | ## ----element locus effect view------------------------------------------------
154 | p=ggplot(dhsPeakStats, aes(x=start, xend=end, y=meanZ,yend=meanZ, colour=FDR<0.01)) +
155 |   ggplot2::geom_segment(size=1)+facet_grid(expt ~.) + theme_classic() + xlab("Genomic position") + 
156 |   ylab("Element Z score") + geom_hline(yintercept = 0); 
157 | print(p)
158 | 
159 | ## ----element replicate scatter------------------------------------------------
160 | dhsPeakStatsByRep = cast(dhsPeakStats, name ~ expt, value="meanZ")
161 | 
162 | p = ggplot(dhsPeakStatsByRep, aes(x=`1`, y=`2`)) + geom_point() + 
163 |   xlab("Replicate 1 DHS effect Z score") + ylab("Replicate 2 DHS effect Z score") + 
164 |   ggtitle(sprintf("r = %f",cor(dhsPeakStatsByRep$`1`,dhsPeakStatsByRep$`2`)))+theme_classic(); 
165 | print(p)
166 | 
167 | ## ----guide effects per element------------------------------------------------
168 | p=ggplot(guideLevelStats, aes(x=Z, group=name, colour=name == "chr12:9912678-9915275")) + stat_ecdf(alpha=0.3)+ 
169 |   stat_ecdf(data=guideLevelStats[!is.na(guideLevelStats$name) &
170 |                                    guideLevelStats$name=="chr12:9912678-9915275",], size=1)+
171 |   facet_grid(expt ~.) + theme_classic() + xlab("Guide Z score")+scale_y_continuous(expand=c(0,0)) + 
172 |   scale_x_continuous(expand=c(0,0)) + scale_colour_manual(values=c("black","red")) + 
173 |   labs(colour = "CD69 promoter?")+ylab("Cumulative fraction"); 
174 | print(p)
175 | 
176 | ## ----promoter view------------------------------------------------------------
177 | p=ggplot(guideLevelStats[!is.na(guideLevelStats$name) & guideLevelStats$name=="chr12:9912678-9915275",],
178 |          aes(x=PAM_3primeEnd_coord, y=Z, colour=expt)) +
179 |   geom_point(size=1)+ geom_line()+theme_classic() + xlab("Genomic position") + ylab("Guide Z score")+
180 |   geom_vline(xintercept = 9913497, colour="black"); 
181 | print(p)
182 | 
183 | ## ----promoter guide zoom------------------------------------------------------
184 | p=ggplot(guideEffectsByRep[guideEffectsByRep$PAM_3primeEnd_coord < 9915275 & 
185 |                              guideEffectsByRep$PAM_3primeEnd_coord > 9912678,], 
186 |          aes(x=`2`, y=`1`)) + geom_point() + 
187 |   geom_text(data=guideEffectsByRep[guideEffectsByRep$PAM_3primeEnd_coord < 9915275 & 
188 |                                      guideEffectsByRep$PAM_3primeEnd_coord > 9912678 & 
189 |                                      guideEffectsByRep$`1`>2 & guideEffectsByRep$`2`>2,],
190 |             aes(label=gRNA_systematic_name)) + 
191 |   theme_classic()+xlab("Replicate 2 guide Z score") + ylab("Replicate 1 guide Z score"); 
192 | print(p) 
193 | 
194 | 
195 | ## ----reshaping window effects-------------------------------------------------
196 | slidingWindowElementsByReplicate = cast(melt(slidingWindowElements, 
197 |   id.vars=c("expt","numGuides","chrom","start","end")), 
198 |   numGuides+chrom+start+end ~ variable+expt, value="value")
199 | head(slidingWindowElementsByReplicate)
200 | 
201 | ## ----cast to GRanges----------------------------------------------------------
202 | #casting to data.frame is only needed if using cast
203 | slidingWindowElementsByReplicateGR = GRanges(as.data.frame(slidingWindowElementsByReplicate)) 
204 | 
205 | ## ----find doubly significant tiles--------------------------------------------
206 | #require that both replicates are significant at an FDR of 0.01 and that the signs agree
207 | slidingWindowElementsByReplicateGR$significantUp   = slidingWindowElementsByReplicateGR$FDR_1< 0.01 & 
208 |   slidingWindowElementsByReplicateGR$FDR_2 < 0.01 & slidingWindowElementsByReplicateGR$meanZ_1 > 0 &
209 |   slidingWindowElementsByReplicateGR$meanZ_2 > 0; 
210 | slidingWindowElementsByReplicateGR$significantDown = slidingWindowElementsByReplicateGR$FDR_1< 0.01 &
211 |   slidingWindowElementsByReplicateGR$FDR_2 < 0.01 & slidingWindowElementsByReplicateGR$meanZ_1 < 0 &
212 |   slidingWindowElementsByReplicateGR$meanZ_2 < 0;
213 | 
214 | #merge overlapping regions in each set
215 | overlappingSlidingWindowElementsUp =
216 |   reduce(slidingWindowElementsByReplicateGR[slidingWindowElementsByReplicateGR$significantUp])
217 | overlappingSlidingWindowElementsDown =
218 |   reduce(slidingWindowElementsByReplicateGR[slidingWindowElementsByReplicateGR$significantDown])
219 | 
220 | ## ----genome browser view, fig.width=10, fig.height=5--------------------------
221 | 
222 | #which gene models do I want to plot?
223 | data(genesymbol, package = "biovizBase")
224 | wh <- genesymbol[c("CD69", "CLECL1", "KLRF1", "CLEC2D","CLEC2B")]
225 | wh <- range(wh, ignore.strand = TRUE)
226 | 
227 | #make the genome tracks
228 | tracks(autoplot(Homo.sapiens, which = wh, gap.geom="chevron"), 
229 |   autoplot(overlappingSlidingWindowElementsUp, fill="red"), 
230 |   autoplot(overlappingSlidingWindowElementsDown, fill="blue"), heights=c(5,2,2)) + theme_classic()
231 | 
232 | 
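
The "find doubly significant tiles" step in the tutorial above can be illustrated on a toy table. This is a minimal sketch in base R, assuming the same column layout the tutorial produces after casting (`FDR_1`, `FDR_2`, `meanZ_1`, `meanZ_2`); the `toyTiles` data frame, its values, and the `fdrCutoff` variable are made up for illustration and are not part of MAUDE.

```r
# Hypothetical per-window results for two replicates (values are invented)
toyTiles = data.frame(start   = c(100, 150, 200, 250),
                      FDR_1   = c(0.001, 0.005, 0.2, 0.002),
                      FDR_2   = c(0.004, 0.5, 0.001, 0.003),
                      meanZ_1 = c(2.1, 1.8, -0.3, -2.5),
                      meanZ_2 = c(1.7, 0.2, -1.9, 1.1))

fdrCutoff = 0.01 # same threshold the tutorial code applies

# a window must pass the FDR cutoff in both replicates...
bothSignificant = toyTiles$FDR_1 < fdrCutoff & toyTiles$FDR_2 < fdrCutoff

# ...and the sign of the effect must agree between replicates
toyTiles$significantUp   = bothSignificant & toyTiles$meanZ_1 > 0 & toyTiles$meanZ_2 > 0
toyTiles$significantDown = bothSignificant & toyTiles$meanZ_1 < 0 & toyTiles$meanZ_2 < 0
print(toyTiles)
```

Only the first window is flagged here: the second and third fail the cutoff in one replicate, and the fourth passes in both but with opposite signs, so it is called neither up nor down.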


--------------------------------------------------------------------------------
/doc/simulated_data_tutorial.R:
--------------------------------------------------------------------------------
  1 | ## ----setup, include=FALSE-----------------------------------------------------
  2 | knitr::opts_chunk$set(echo = TRUE)
  3 | 
  4 | ## ----load libraries, results="hide", message = FALSE, warning=FALSE-----------
  5 | library(ggplot2)
  6 | library(reshape)
  7 | library(MAUDE)
  8 | 
  9 | set.seed(76484377)
 10 | 
 11 | ## ----simulation setup---------------------------------------------------------
 12 | #ground truth
 13 | groundTruth = data.frame(element = 1:200, meanEffect = c((1:100)/100,rep(0,100))) #targeting 200 elements, half of which do nothing
 14 | 
 15 | #guide - element map; 5 guides per element; gid is the guide ID
 16 | guideMap = data.frame(element = rep(groundTruth$element, 5), gid = 1:(5*nrow(groundTruth)), NT=FALSE, mean=rep(groundTruth$meanEffect, 5))
 17 | guideMap = rbind(guideMap, data.frame(element = NA, gid = (1:1000)+nrow(guideMap), NT=TRUE, mean=0)); # 1000 non-targeting guides
 18 | 
 19 | guideMap$abundance = rpois(n=nrow(guideMap), lambda=1000); #guide abundance drawn from a Poisson distribution with mean=1000
 20 | guideMap$cells = rpois(n=nrow(guideMap), lambda=guideMap$abundance); #cell count drawn from a Poisson distribution whose mean is the abundance from above
 21 | 
 22 | #create observations for different guides, with expression drawn from normal(mean=mean, sd=1)
 23 | cellObservations = data.frame(gid = rep(guideMap$gid, guideMap$cells))
 24 | cellObservations = merge(cellObservations, guideMap, by="gid")
 25 | cellObservations$expression = rnorm(n=nrow(cellObservations), mean=cellObservations$mean);
 26 | 
 27 | #create the bin model for this experiment - this represents 6 bins, each capturing 10% of the cells, where A+B+C catch the bottom ~30% and D+E+F catch the top ~30%; in an actual experiment, the true captured fractions should be used here. 
 28 | binBounds = makeBinModel(data.frame(Bin=c("A","B","C","D","E","F"), fraction=rep(0.1,6)))
 29 | if(FALSE){ 
 30 |   # in reality, we shouldn't assume this distribution is exactly normal - we can re-assign expression bin bounds based on quantiles 
 31 |   # of the actual simulated expression distribution.  If you run the next two lines, the answer will improve slightly, but the 
 32 |   # resulting graphs will look slightly different than those below.
 33 |   # correct for the actual distribution
 34 |   binBounds$binStartZ = quantile(cellObservations$expression, probs = binBounds$binStartQ);
 35 |   binBounds$binEndZ = quantile(cellObservations$expression, probs = binBounds$binEndQ);
 36 | }
 37 | 
 38 | # select some examples to inspect for both
 39 | exampleNT = sample(guideMap$gid[guideMap$NT],10);# non-targeting
 40 | exampleT = sample(guideMap$gid[!guideMap$NT],5);# and targeting guides
 41 | 
 42 | #plot the selected examples and show the bin structure
 43 | p = ggplot(cellObservations[cellObservations$gid %in% c(exampleT, exampleNT),], 
 44 |   aes(x=expression, group=gid, fill=NT))+geom_density(alpha=0.2) + 
 45 |   geom_vline(xintercept = sort(unique(c(binBounds$binStartZ,binBounds$binEndZ))),colour="gray") + 
 46 |   theme_classic() + scale_fill_manual(values=c("red","darkgray")) + xlab("Target expression") + 
 47 |   scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + 
 48 |   coord_cartesian(xlim=c(min(cellObservations$expression), max(cellObservations$expression)))+ 
 49 |   geom_segment(data=binBounds, aes(x=binStartZ, xend=binEndZ, colour=Bin, y=0, yend=0), size=5, inherit.aes = FALSE); 
 50 | print(p)
 51 | 
 52 | ## ----simulate sorting cells---------------------------------------------------
 53 | #for each bin, find which cells landed within the bin bounds
 54 | for(i in 1:nrow(binBounds)){
 55 |   cellObservations[[as.character(binBounds$Bin[i])]] = 
 56 |     cellObservations$expression > binBounds$binStartZ[i] & cellObservations$expression < binBounds$binEndZ[i];
 57 | }
 58 | 
 59 | #count the number of cells that ended up in each bin for each guide
 60 | binLevelData = cast(melt(cellObservations[c("gid","element","NT",as.character(binBounds$Bin))], 
 61 |   id.vars=c("gid","element","NT")), gid + element + NT + variable ~ ., fun.aggregate = sum)
 62 | names(binLevelData)[(ncol(binLevelData)-1):ncol(binLevelData)]=c("Bin","cells");
 63 | 
 64 | #plot the cells/bin for each of our example guides
 65 | p = ggplot(binLevelData[binLevelData$gid %in% exampleNT,], aes(x=Bin, group=Bin, y=cells)) +
 66 |   geom_boxplot(fill="darkgray")+geom_line(data=binLevelData[binLevelData$gid %in% exampleT,], 
 67 |     aes(group=gid), colour="red")+  theme_classic()  + xlab("Expression bin") + 
 68 |   ylab("Captured cells/bin")  + scale_y_continuous(expand=c(0,0)); print(p)
 69 | 
 70 | ## ----simulate sequencing------------------------------------------------------
 71 | #get the total number of cells sorted into each of the bins
 72 | totalCellsPerBin = cast(binLevelData,Bin ~ ., value="cells", fun.aggregate = sum)
 73 | names(totalCellsPerBin)[ncol(totalCellsPerBin)]="totalCells";
 74 | 
 75 | #add bin totals to binLevelData
 76 | binLevelData = merge(binLevelData, totalCellsPerBin, by="Bin")
 77 | 
 78 | #generate reads for each guide per bin, following a negative binomial distribution
 79 | #n=number of observations; size= total reads per bin; prob= probability of not getting a read at each drawing
 80 | binLevelData$reads = rnbinom(n=nrow(binLevelData), size=binLevelData$totalCells*10, 
 81 |   prob=1- binLevelData$cells/binLevelData$totalCells)
 82 | 
 83 | #plot the distribution of reads for each example guide
 84 | p = ggplot(binLevelData[binLevelData$gid %in% exampleNT,], aes(x=Bin, group=Bin, y=reads)) +
 85 |   geom_boxplot(fill="darkgray")+
 86 |   geom_line(data=binLevelData[binLevelData$gid %in% exampleT,], aes(group=gid), colour="red")+  
 87 |   theme_classic()  + xlab("Expression bin") + ylab("Reads/bin")  + scale_y_continuous(expand=c(0,0)); 
 88 | print(p)
 89 | 
 90 | ## ----simulate unsorted cells--------------------------------------------------
 91 | binReadMat = cast(binLevelData, element+gid+NT ~ Bin, value="reads")
 92 | 
 93 | #pretend we sequence the unsorted cells to similar coverage as above:
 94 | guideMap$NS = rnbinom(n=nrow(guideMap), size=sum(guideMap$cells)*10, 
 95 |   prob=1- guideMap$abundance/sum(guideMap$cells))
 96 | binReadMat = merge(binReadMat, guideMap[c("gid","NS")], by="gid")
 97 | 
 98 | binReadMat$screen="test"; # here, we're only doing the one screen - this simulation
 99 | binBounds$screen="test";
100 | 
101 | ## ----run MAUDE----------------------------------------------------------------
102 | # get guide-level stats 
103 | guideLevelStats = findGuideHitsAllScreens(unique(binReadMat["screen"]), binReadMat, binBounds)
104 | 
105 | #get element level stats
106 | elementLevelStats = getElementwiseStats(unique(guideLevelStats["screen"]),
107 |   guideLevelStats, elementIDs="element",tails="upper")
108 | 
109 | ## ----plot effects-------------------------------------------------------------
110 | elementLevelStats = merge(elementLevelStats, groundTruth, by="element")
111 | 
112 | p = ggplot(elementLevelStats, aes(x=meanEffect, y=meanZ, colour=FDR<0.01)) + geom_point()+
113 |   geom_abline(intercept = 0, slope=1) + theme_classic() + scale_colour_manual(values=c("darkgray","red")) + 
114 |   xlab("True effect") + ylab("Inferred effect"); print(p)
115 | 
116 | ## ----effects zoom in----------------------------------------------------------
117 | p = ggplot(elementLevelStats, aes(x=meanEffect, y=meanZ, colour=FDR<0.01)) + geom_point()+
118 |   geom_abline(intercept = 0, slope=1) + theme_classic() + scale_colour_manual(values=c("darkgray","red")) + 
119 |   coord_cartesian(xlim = c(0,0.1),ylim = c(0,0.1)) + xlab("True effect") + ylab("Inferred effect"); 
120 | print(p)
121 | 
122 | 
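
The "simulate sequencing" step above draws reads with `rnbinom(size = totalCells*10, prob = 1 - cells/totalCells)`. As a hedged aside, the sketch below checks what read depth that parameterization implies. In R's `(size, prob)` form the negative binomial has mean `size * (1 - prob) / prob`, so a guide should receive roughly 10 reads per cell when it makes up a small fraction of the bin. The numbers `cells = 50` and `totalCells = 100000` are hypothetical.

```r
# Hypothetical counts for one guide in one bin (not taken from the simulation)
cells = 50
totalCells = 100000

# same parameterization as the tutorial's rnbinom() call
size = totalCells * 10
prob = 1 - cells / totalCells

# theoretical mean of rnbinom(size, prob): size * (1 - prob) / prob
expectedReads = size * (1 - prob) / prob
print(expectedReads) # close to 10 * cells when cells << totalCells

# empirical check of the same quantity
set.seed(1)
simReads = rnbinom(n = 10000, size = size, prob = prob)
print(mean(simReads))
```

Because `size` is large relative to the mean, the draws are only slightly overdispersed relative to a Poisson with the same mean.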


--------------------------------------------------------------------------------
/images/logo2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/de-Boer-Lab/MAUDE/eb9c80a0bf37bb2b0c3c7ae7270a1fee71eaba20/images/logo2.png


--------------------------------------------------------------------------------
/inst/CITATION:
--------------------------------------------------------------------------------
 1 | citHeader("To cite MAUDE in publications use:")
 2 | 
 3 | citEntry(entry = "Article",
 4 |   title        = "MAUDE: inferring expression changes in sorting-based CRISPR screens",
 5 |   author       = personList(as.person("Carl G. de Boer"),
 6 |                    as.person("John P. Ray"),
 7 |                    as.person("Nir Hacohen"),
 8 |                    as.person("Aviv Regev")),
 9 |   journal      = "Genome Biology",
10 |   year         = "2020",
11 |   volume       = "21",
12 |   number       = "1",
13 |   pages        = "134",
14 |   url          = "https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02046-8",
15 | 
16 |   textVersion  =
17 |   paste("de Boer CG, Ray JP, Hacohen N, Regev A.",
18 | 		"MAUDE: inferring expression changes in sorting-based CRISPR screens.",  
19 |  		"Genome Biol. 2020;21(1):134. Published 2020 Jun 3.",
20 | 		"doi:10.1186/s13059-020-02046-8")
21 | )
22 | 


--------------------------------------------------------------------------------
/inst/extdata/CD69_bin_percentiles.txt:
--------------------------------------------------------------------------------
1 | Bin	binStartQ	binEndQ
2 | baseline	0.001	0.658470995483854
3 | low	0.658470995483854	0.909937871020824
4 | medium	0.907937871020824	0.971071760477821
5 | high	0.971071760477821	0.999
6 | 


--------------------------------------------------------------------------------
/inst/extdata/Encode_Jurkat_DHS_both.merged.bed:
--------------------------------------------------------------------------------
 1 | chr12	9885788	9886093
 2 | chr12	9889662	9890295
 3 | chr12	9894141	9894851
 4 | chr12	9895391	9895533
 5 | chr12	9895675	9896083
 6 | chr12	9901260	9901720
 7 | chr12	9902669	9904782
 8 | chr12	9907175	9907332
 9 | chr12	9907655	9907678
10 | chr12	9909466	9912465
11 | chr12	9912678	9915275
12 | chr12	9915630	9916048
13 | chr12	9916404	9918532
14 | chr12	9919586	9920156
15 | chr12	9923553	9923906
16 | chr12	9925876	9926553
17 | chr12	9928010	9928338
18 | chr12	9933989	9934360
19 | chr12	9938952	9939007
20 | chr12	9940013	9940299
21 | chr12	9945122	9945437
22 | chr12	9945544	9945599
23 | chr12	9946517	9946622
24 | chr12	9950506	9951060
25 | chr12	9961099	9961274
26 | chr12	9963380	9963792
27 | chr12	9964121	9964262
28 | chr12	9966189	9967318
29 | chr12	9973087	9973460
30 | 


--------------------------------------------------------------------------------
/man/calcFDRByExperiment.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/MAUDE.R
 3 | \name{calcFDRByExperiment}
 4 | \alias{calcFDRByExperiment}
 5 | \title{FDR correction per experiment}
 6 | \usage{
 7 | calcFDRByExperiment(experiments, x, tails)
 8 | }
 9 | \arguments{
10 | \item{experiments}{a data.frame containing the headers that demarcate the screen ID, which are all also present in x}
11 | 
12 | \item{x}{data.frame of element-level statistics, including columns for every column in 'experiments' and a column named 'p.value'}
13 | 
14 | \item{tails}{whether to test for increased expression ("upper"), decreased ("lower"), or both ("both"); (defaults to "both")}
15 | }
16 | \value{
17 | a numerical vector containing B-H Q values corrected separately for every experiment.
18 | }
19 | \description{
20 | Perform Benjamini-Hochberg FDR correction of p-values within each experiment. The returned values correspond to Q values. This is run automatically within getTilingElementwiseStats and getElementwiseStats, and doesn't generally need to be used directly.
21 | }
22 | \examples{
23 | fakeReadData = data.frame(id=rep(1:10000,2), expt=c(rep("e1",10000), rep("e2",10000)), 
24 |                           A=rpois(20000, lambda = 100), B=rpois(20000, lambda = 100),
25 |                           C=rpois(20000, lambda = 100), D=rpois(20000, lambda = 100),
26 |                           E=rpois(20000, lambda = 100), F=rpois(20000, lambda = 100),
27 |                           NotSorted=rpois(20000, lambda = 100), 
28 |                           position = rep(c(rep(NA, 1000), (1:9000)*10 + 5E7),2), 
29 |                           chr=rep(c(rep(NA, 1000), rep("chr1", 9000)),2), 
30 |                           negControl = rep(c(rep(TRUE,1000),rep(FALSE,9000)),2), 
31 |                           stringsAsFactors = FALSE)
32 | #make one region an "enhancer" and another a "repressor" by skewing the reads 
33 | enhancers = data.frame(name = c("enh","repr"), start = c(40000, 70000) + 5E7, 
34 |   end = c(40500, 70500) + 5E7, chr="chr1")
35 | enhancerData = findOverlappingElements(fakeReadData[!is.na(fakeReadData$position),], enhancers, 
36 |   guides.pos = "position",elements.start = "start", elements.end = "end")
37 | readSkew=1.2 # we will scale up/down the reads in ABC and DEF by this amount
38 | enhancerData[enhancerData$name=="enh", c("D","E","F")] = 
39 |   floor(readSkew* enhancerData[enhancerData$name=="enh", c("D","E","F")]);
40 | enhancerData[enhancerData$name=="repr", c("D","E","F")] = 
41 |   floor(enhancerData[enhancerData$name=="repr", c("D","E","F")]/readSkew);
42 | enhancerData[enhancerData$name=="repr", c("A","B","C")] = 
43 |   floor(readSkew* enhancerData[enhancerData$name=="repr", c("A","B","C")]);
44 | enhancerData[enhancerData$name=="enh", c("A","B","C")] = 
45 |   floor(enhancerData[enhancerData$name=="enh", c("A","B","C")]/readSkew);
46 | #replace the original data for these elements
47 | fakeReadData = rbind(fakeReadData[!(fakeReadData$id \%in\% enhancerData$id), ],
48 |   enhancerData[names(fakeReadData)])
49 | #make experiments and sorting strategy
50 | expts = unique(fakeReadData["expt"]);
51 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6)))
52 | curSortBins = rbind(curSortBins, curSortBins) # duplicate and use same for both expts
53 | curSortBins$expt = c(rep(expts$expt[1],6),rep(expts$expt[2],6))
54 | guideHits = findGuideHitsAllScreens(experiments = expts, countDataFrame=fakeReadData,
55 |                                     binStats = curSortBins, unsortedBin = "NotSorted",
56 |                                     negativeControl="negControl")
57 | tilingElementStats = getTilingElementwiseStats(experiments = expts, normNBSummaries = guideHits, 
58 |   tails = "both", chr = "chr", location="position", negativeControl = "negControl")
59 | tilingElementStats$Q = calcFDRByExperiment(expts, tilingElementStats,"both") 
60 | if(require("ggplot2")){
61 |   p=ggplot(tilingElementStats, aes(x=FDR, y=Q)) +geom_point()+geom_abline(intercept=0, slope=1)+
62 |     scale_y_log10()+scale_x_log10(); print(p)
63 | }
64 | }
65 | 
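
The per-experiment Benjamini-Hochberg correction described above can be sketched in base R. This is an illustration of the idea only, not MAUDE's implementation: it ignores the `tails` argument, and the `toyStats` table and its p-values are made up. The column names `expt` and `p.value` follow the argument documentation.

```r
# Toy element-level table with two experiments (values are invented)
toyStats = data.frame(expt = rep(c("e1", "e2"), each = 4),
                      p.value = c(0.001, 0.02, 0.03, 0.5, 0.2, 0.04, 0.9, 0.01))

# ave() applies p.adjust() separately within each experiment and
# returns the adjusted values in the original row order
toyStats$Q = ave(toyStats$p.value, toyStats$expt,
                 FUN = function(p) p.adjust(p, method = "BH"))
print(toyStats)
```

Correcting within each experiment keeps a screen with many strong hits from changing the Q values of another screen.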


--------------------------------------------------------------------------------
/man/combineZStouffer.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/MAUDE.R
 3 | \name{combineZStouffer}
 4 | \alias{combineZStouffer}
 5 | \title{Combines Z-scores using Stouffer's method}
 6 | \usage{
 7 | combineZStouffer(x)
 8 | }
 9 | \arguments{
10 | \item{x}{a vector of Z-scores to be combined}
11 | }
12 | \value{
13 | Returns a single Z-score.
14 | }
15 | \description{
16 | This function takes a vector of Z-scores and combines them into a single Z-score using Stouffer's method.
17 | }
18 | \examples{
19 | combineZStouffer(rnorm(10))
20 | }
21 | 
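
For readers unfamiliar with Stouffer's method, the textbook unweighted form divides the sum of the Z-scores by the square root of their count. The sketch below shows that formula; `stoufferSketch` is a hypothetical helper written for illustration, and dropping `NA` values is an assumption rather than documented behaviour of `combineZStouffer`.

```r
# Textbook unweighted Stouffer combination: sum(z) / sqrt(k)
# (hypothetical helper, not the package function)
stoufferSketch = function(z) {
  z = z[!is.na(z)]          # assumption: missing Z-scores are dropped
  sum(z) / sqrt(length(z))
}

# four guides that each show a modest Z of 1 combine to a Z of 2
zCombined = stoufferSketch(c(1, 1, 1, 1))
print(zCombined)
```

Under the null, independent Z-scores combined this way again follow a standard normal distribution, which is why several weak but consistent guide effects can add up to a significant element.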


--------------------------------------------------------------------------------
/man/findGuideHits.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/MAUDE.R
 3 | \name{findGuideHits}
 4 | \alias{findGuideHits}
 5 | \title{Calculate guide-level statistics for a single screen}
 6 | \usage{
 7 | findGuideHits(
 8 |   countTable,
 9 |   curBinBounds,
10 |   pseudocount = 10,
11 |   meanFunction = mean,
12 |   sortBins = c("A", "B", "C", "D", "E", "F"),
13 |   unsortedBin = "NS",
14 |   negativeControl = "NT",
15 |   limits = c(-4, 4)
16 | )
17 | }
18 | \arguments{
19 | \item{countTable}{a table containing one column for each bin (A-F) and another column for non-targeting guide (logical-"NT"), and unsorted abundance (NS)}
20 | 
21 | \item{curBinBounds}{a bin model as created by makeBinModel}
22 | 
23 | \item{pseudocount}{the count to be added to each bin count, per 1e6 reads/bin total (default=10 pseudo reads per 1e6 reads total)}
24 | 
25 | \item{meanFunction}{how to calculate the mean of the non-targeting guides for centering Z-scores.  Defaults to 'mean'}
26 | 
27 | \item{sortBins}{the names in countTable of the sorting bins.  Defaults to c("A","B","C","D","E","F")}
28 | 
29 | \item{unsortedBin}{the name in countTable of the unsorted bin.  Defaults to "NS"}
30 | 
31 | \item{negativeControl}{the name in countTable containing a logical representing whether or not the guide is non-Targeting (i.e. a negative control guide).  Defaults to "NT"}
32 | 
33 | \item{limits}{the limits to the mu optimization. Defaults to c(-4,4)}
34 | }
35 | \value{
36 | a data.frame containing the guide-level statistics, including the Z score 'Z', log likelihood ratio 'llRatio', and estimated mean expression 'mean'.
37 | }
38 | \description{
39 | Given a table of counts per guide/bin and a bin model for an experiment, calculate the optimal mean expression for each guide
40 | }
41 | \examples{
42 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6)))
43 | fakeReadData = data.frame(id=1:1000, A=rpois(1000, lambda = 100), B=rpois(1000, lambda = 100),
44 |                           C=rpois(1000, lambda = 100), D=rpois(1000, lambda = 100),
45 |                           E=rpois(1000, lambda = 100), F=rpois(1000, lambda = 100),
46 |                           NotSorted=rpois(1000, lambda = 100), negControl = rnorm(1000)>0)
47 | guideHits = findGuideHits(fakeReadData, curSortBins, unsortedBin = "NotSorted", 
48 |   negativeControl="negControl")
49 | if(require("ggplot2")){
50 |   p=ggplot(guideHits, aes(x=Z, colour=negControl))+geom_density(); print(p)
51 | }
52 | }
53 | 


--------------------------------------------------------------------------------
/man/findGuideHitsAllScreens.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/MAUDE.R
 3 | \name{findGuideHitsAllScreens}
 4 | \alias{findGuideHitsAllScreens}
 5 | \title{Calculate guide-level stats for multiple experiments}
 6 | \usage{
 7 | findGuideHitsAllScreens(experiments, countDataFrame, binStats, ...)
 8 | }
 9 | \arguments{
10 | \item{experiments}{a data.frame containing the headers that demarcate the screen ID, which are all also present in countDataFrame and binStats}
11 | 
12 | \item{countDataFrame}{a table containing one column for each bin (A-F) and another column for non-targeting guide (logical-"NT"), and unsorted abundance (NS), as well as columns corresponding to those in  'experiments'}
13 | 
14 | \item{binStats}{a bin model as created by makeBinModel, as well as columns corresponding to those in  'experiments'}
15 | 
16 | \item{...}{other parameters for findGuideHits}
17 | }
18 | \value{
19 | guide-level stats for all experiments
20 | }
21 | \description{
22 | Uses findGuideHits to find guide-level stats for each unique entry in 'experiments'.
23 | }
24 | \examples{
25 |  fakeReadData = data.frame(id=rep(1:1000,2), expt=c(rep("e1",1000), rep("e2",1000)), 
26 |                           A=rpois(2000, lambda = 100), B=rpois(2000, lambda = 100),
27 |                           C=rpois(2000, lambda = 100), D=rpois(2000, lambda = 100),
28 |                           E=rpois(2000, lambda = 100), F=rpois(2000, lambda = 100),
29 |                           NotSorted=rpois(2000, lambda = 100), 
30 |                           negControl = rep(rnorm(1000)>0,2), stringsAsFactors = FALSE)
31 | expts = unique(fakeReadData["expt"]);
32 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6)))
33 | curSortBins = rbind(curSortBins, curSortBins) # duplicate and use same for both expts
34 | curSortBins$expt = c(rep(expts$expt[1],6),rep(expts$expt[2],6))
35 | guideHits = findGuideHitsAllScreens(experiments = expts, countDataFrame=fakeReadData,
36 |                                     binStats = curSortBins, unsortedBin = "NotSorted",
37 |                                     negativeControl="negControl")
38 | }
39 | 


--------------------------------------------------------------------------------
/man/findOverlappingElements.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/MAUDE.R
 3 | \name{findOverlappingElements}
 4 | \alias{findOverlappingElements}
 5 | \title{Find overlaps between guides and annotated elements}
 6 | \usage{
 7 | findOverlappingElements(
 8 |   guides,
 9 |   elements,
10 |   guides.pos = "pos",
11 |   guides.chr = "chr",
12 |   elements.start = "st",
13 |   elements.end = "en",
14 |   elements.chr = "chr"
15 | )
16 | }
17 | \arguments{
18 | \item{guides}{a data.frame containing guide information, including each guide's genomic position in the column named by guides.pos}
19 | 
20 | \item{elements}{a data.frame containing element information, as in a BED file, including the element's genomic start, end, and chromosome in columns named elements.start, elements.end, and elements.chr}
21 | 
22 | \item{guides.pos}{the name of the column in guides that contains the genomic position targeted by the guide (defaults to "pos")}
23 | 
24 | \item{guides.chr}{the name of the column in guides that contains the genomic chromosome targeted by the guide  (defaults to "chr")}
25 | 
26 | \item{elements.start}{the name of the column in elements that contains the start coordinate of the element (defaults to "st")}
27 | 
28 | \item{elements.end}{the name of the column in elements that contains the end coordinate of the element (defaults to "en")}
29 | 
30 | \item{elements.chr}{the name of the column in elements that contains the chromosome of the element (defaults to "chr")}
31 | }
32 | \value{
33 | Returns a new data.frame containing the intersection of elements and guides
34 | }
35 | \description{
36 | Finds guides that overlap the elements of a BED-like data.frame (e.g. open chromatin regions) and returns a new data.frame containing those overlaps
37 | }
38 | \examples{
39 | set1 = data.frame(gid=1:10, chr=c(rep("chr1",5), rep("chr5",5)),
40 |   pos= c(1:5,1:5)*10, stringsAsFactors = FALSE)
41 | set2 = data.frame(eid=1:4, chr=c("chr1","chr1","chr4","chr5"), st=c(5,25,1,45),
42 |     en=c(15,50,50,55), stringsAsFactors = FALSE)
43 | findOverlappingElements(set1, set2)
44 | }
45 | 


--------------------------------------------------------------------------------
/man/getElementwiseStats.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/MAUDE.R
 3 | \name{getElementwiseStats}
 4 | \alias{getElementwiseStats}
 5 | \title{Find active elements by annotation}
 6 | \usage{
 7 | getElementwiseStats(
 8 |   experiments,
 9 |   normNBSummaries,
10 |   elementIDs,
11 |   tails = "both",
12 |   negativeControl = "NT",
13 |   ...
14 | )
15 | }
16 | \arguments{
17 | \item{experiments}{a data.frame containing the headers that demarcate the screen ID, which are all also present in normNBSummaries}
18 | 
19 | \item{normNBSummaries}{data.frame of guide-level statistics as generated by findGuideHits()}
20 | 
21 | \item{elementIDs}{the names of one or more columns within normNBSummaries that contain the element annotations.}
22 | 
23 | \item{tails}{whether to test for increased expression ("upper"), decreased ("lower"), or both ("both"); (defaults to "both")}
24 | 
25 | \item{negativeControl}{the name in normNBSummaries containing a logical representing whether or not the guide is non-Targeting (i.e. a negative control guide).  Defaults to "NT"}
26 | 
27 | \item{...}{other parameters for getZScalesWithNTGuides}
28 | }
29 | \value{
30 | a data.frame containing the statistics for all elements
31 | }
32 | \description{
33 | Tests guides for activity by considering a set of provided regulatory elements within the region and considering all guides within each region for the test.
34 | }
35 | \examples{
36 | fakeReadData = data.frame(id=rep(1:10000,2), expt=c(rep("e1",10000), rep("e2",10000)), 
37 |                           A=rpois(20000, lambda = 100), B=rpois(20000, lambda = 100),
38 |                           C=rpois(20000, lambda = 100), D=rpois(20000, lambda = 100),
39 |                           E=rpois(20000, lambda = 100), F=rpois(20000, lambda = 100),
40 |                           NotSorted=rpois(20000, lambda = 100), 
41 |                           position = rep(c(rep(NA, 1000), (1:9000)*10 + 5E7),2), 
42 |                           chr=rep(c(rep(NA, 1000), rep("chr1", 9000)),2), 
43 |                           negControl = rep(c(rep(TRUE,1000),rep(FALSE,9000)),2), 
44 |                           stringsAsFactors = FALSE)
45 | #make one region an "enhancer" and another a "repressor" by skewing the reads 
46 | enhancers = data.frame(name = c("enh","repr"), start = c(40000, 70000) + 5E7, 
47 |   end = c(40500, 70500) + 5E7, chr="chr1")
48 | enhancerData = findOverlappingElements(fakeReadData[!is.na(fakeReadData$position),], 
49 |   enhancers, guides.pos = "position",elements.start = "start", elements.end = "end")
50 | readSkew=1.2 # we will scale up/down the reads in ABC and DEF by this amount
51 | enhancerData[enhancerData$name=="enh", c("D","E","F")] = 
52 |   floor(readSkew* enhancerData[enhancerData$name=="enh", c("D","E","F")]);
53 | enhancerData[enhancerData$name=="repr", c("D","E","F")] = 
54 |   floor(enhancerData[enhancerData$name=="repr", c("D","E","F")]/readSkew);
55 | enhancerData[enhancerData$name=="repr", c("A","B","C")] = 
56 |   floor(readSkew* enhancerData[enhancerData$name=="repr", c("A","B","C")]);
57 | enhancerData[enhancerData$name=="enh", c("A","B","C")] = 
58 |   floor(enhancerData[enhancerData$name=="enh", c("A","B","C")]/readSkew);
59 | #replace the original data for these elements
60 | fakeReadData = rbind(fakeReadData[!(fakeReadData$id \%in\% enhancerData$id), ],
61 |   enhancerData[names(fakeReadData)])
62 | #make experiments and sorting strategy
63 | expts = unique(fakeReadData["expt"]);
64 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6)))
65 | curSortBins = rbind(curSortBins, curSortBins) # duplicate and use same for both expts
66 | curSortBins$expt = c(rep(expts$expt[1],6),rep(expts$expt[2],6))
67 | guideHits = findGuideHitsAllScreens(experiments = expts, countDataFrame=fakeReadData,
68 |                                     binStats = curSortBins, unsortedBin = "NotSorted",
69 |                                     negativeControl="negControl")
70 | 
71 | potentialEnhancers = rbind(enhancers, data.frame(name = sprintf("EC\%i", 1:16), 
72 |   start = (5:20)*6000 + 5E7, end = (5:20)*6000 + 5E7+500, chr="chr1"))
73 | #annotate guides with elements, but first get all non-targeting guides
74 | guideHitsAnnotated = guideHits[is.na(guideHits$position),];
75 | guideHitsAnnotated$name=NA; guideHitsAnnotated$start=NA; 
76 | guideHitsAnnotated$end=NA; guideHitsAnnotated$chr=NA;
77 | guideHitsAnnotated = rbind(guideHitsAnnotated, 
78 |   findOverlappingElements(guideHits[!is.na(guideHits$position),], potentialEnhancers, 
79 |     guides.pos = "position",elements.start = "start", elements.end = "end") )
80 | allElementHits = getElementwiseStats(experiments = expts, normNBSummaries = guideHitsAnnotated, 
81 |   elementIDs = "name", tails = "both", negativeControl = "negControl")
82 | allElementHits = merge(allElementHits, potentialEnhancers, by="name")
83 | if(require("ggplot2")){
84 |   p=ggplot(allElementHits, aes(x=start, xend=end, y=significanceZ, yend=significanceZ, 
85 |     colour=FDR<0.01, label=name)) + geom_segment()+
86 |     geom_text(data= allElementHits[allElementHits$FDR<0.01,], colour="black") + 
87 |     facet_grid(expt ~ .); print(p)
88 | }
89 | }
90 | 


--------------------------------------------------------------------------------
/man/getNBGaussianLikelihood.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/MAUDE.R
 3 | \name{getNBGaussianLikelihood}
 4 | \alias{getNBGaussianLikelihood}
 5 | \title{Calculate the log likelihood of observed read counts}
 6 | \usage{
 7 | getNBGaussianLikelihood(x, mu, k, sigma = 1, nullModel, libFract)
 8 | }
 9 | \arguments{
10 | \item{x}{a vector of guide counts per bin}
11 | 
12 | \item{mu}{the mean for the normal expression distribution}
13 | 
14 | \item{k}{the vector of total counts per bin}
15 | 
16 | \item{sigma}{the standard deviation for the normal expression distribution (defaults to 1)}
17 | 
18 | \item{nullModel}{the bin bounds for the null model (for no change in expression)}
19 | 
20 | \item{libFract}{the fraction of the unsorted library this guide comprises (e.g. from unsorted cells, or sequencing the vector)}
21 | }
22 | \value{
23 | the log likelihood
24 | }
25 | \description{
26 | Uses a normal distribution (N(mu,sigma)) to estimate how many reads are expected per bin under nullModel, and calculates the log likelihood under a negative binomial model. This function is usually not used directly.
27 | }
28 | \examples{
29 | #usually not used directly
30 | #make a bin sorting model with 6 10\% bins
31 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6)))
32 | readsForGuideX =c(10,20,30,100,200,100); #the reads for this guide
33 | getNBGaussianLikelihood(x=readsForGuideX, mu=1, k=rep(1E6,6), sigma=1, nullModel=curSortBins, 
34 |   libFract = 50/1E6)
35 | getNBGaussianLikelihood(x=readsForGuideX, mu=-1, k=rep(1E6,6), sigma=1, nullModel=curSortBins, 
36 |   libFract = 50/1E6)
37 | #mu=1 is far more likely (closer to 0) than mu=-1 for this distribution of reads
38 | }
39 | 


--------------------------------------------------------------------------------
/man/getTilingElementwiseStats.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/MAUDE.R
 3 | \name{getTilingElementwiseStats}
 4 | \alias{getTilingElementwiseStats}
 5 | \title{Find active elements by sliding window}
 6 | \usage{
 7 | getTilingElementwiseStats(
 8 |   experiments,
 9 |   normNBSummaries,
10 |   tails = "both",
11 |   location = "pos",
12 |   chr = "chr",
13 |   window = 500,
14 |   minGuides = 5,
15 |   negativeControl = "NT",
16 |   ...
17 | )
18 | }
19 | \arguments{
20 | \item{experiments}{a data.frame containing the headers that demarcate the screen ID, which are all also present in normNBSummaries}
21 | 
22 | \item{normNBSummaries}{data.frame of guide-level statistics as generated by findGuideHits()}
23 | 
24 | \item{tails}{whether to test for increased expression ("upper"), decreased ("lower"), or both ("both"); (defaults to "both")}
25 | 
26 | \item{location}{the name of the column in normNBSummaries containing the chromosomal location (defaults to "pos")}
27 | 
28 | \item{chr}{the name of the column in normNBSummaries containing the chromosome name (defaults to "chr")}
29 | 
30 | \item{window}{the window width in base pairs (defaults to 500)}
31 | 
32 | \item{minGuides}{the minimum number of guides in a window required for a test (defaults to 5)}
33 | 
34 | \item{negativeControl}{the name in normNBSummaries containing a logical representing whether or not the guide is non-Targeting (i.e. a negative control guide).  Defaults to "NT"}
35 | 
36 | \item{...}{other parameters for getZScalesWithNTGuides}
37 | }
38 | \value{
39 | a data.frame containing the statistics for all windows tested for activity
40 | }
41 | \description{
42 | Tests guides for activity by considering a sliding window across the tested region and including all guides within the window for the test.
43 | }
44 | \examples{
45 | fakeReadData = data.frame(id=rep(1:10000,2), expt=c(rep("e1",10000), rep("e2",10000)), 
46 |                           A=rpois(20000, lambda = 100), B=rpois(20000, lambda = 100),
47 |                           C=rpois(20000, lambda = 100), D=rpois(20000, lambda = 100),
48 |                           E=rpois(20000, lambda = 100), F=rpois(20000, lambda = 100),
49 |                           NotSorted=rpois(20000, lambda = 100), 
50 |                           position = rep(c(rep(NA, 1000), (1:9000)*10 + 5E7),2), 
51 |                           chr=rep(c(rep(NA, 1000), rep("chr1", 9000)),2), 
52 |                           negControl = rep(c(rep(TRUE,1000),rep(FALSE,9000)),2), 
53 |                           stringsAsFactors = FALSE)
54 | #make one region an "enhancer" and another a "repressor" by skewing the reads 
55 | enhancers = data.frame(name = c("enh","repr"), start = c(40000, 70000) + 5E7, 
56 |   end = c(40500, 70500) + 5E7, chr="chr1")
57 | enhancerData = findOverlappingElements(fakeReadData[!is.na(fakeReadData$position),], enhancers, 
58 |   guides.pos = "position",elements.start = "start", elements.end = "end")
59 | readSkew=1.2 # we will scale up/down the reads in ABC and DEF by this amount
60 | enhancerData[enhancerData$name=="enh", c("D","E","F")] = 
61 |   floor(readSkew* enhancerData[enhancerData$name=="enh", c("D","E","F")]);
62 | enhancerData[enhancerData$name=="repr", c("D","E","F")] = 
63 |   floor(enhancerData[enhancerData$name=="repr", c("D","E","F")]/readSkew);
64 | enhancerData[enhancerData$name=="repr", c("A","B","C")] = 
65 |   floor(readSkew* enhancerData[enhancerData$name=="repr", c("A","B","C")]);
66 | enhancerData[enhancerData$name=="enh", c("A","B","C")] = 
67 |   floor(enhancerData[enhancerData$name=="enh", c("A","B","C")]/readSkew);
68 | #replace the original data for these elements
69 | fakeReadData = rbind(fakeReadData[!(fakeReadData$id \%in\% enhancerData$id), ],
70 |   enhancerData[names(fakeReadData)])
71 | #make experiments and sorting strategy
72 | expts = unique(fakeReadData["expt"]);
73 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6)))
74 | curSortBins = rbind(curSortBins, curSortBins) # duplicate and use same for both expts
75 | curSortBins$expt = c(rep(expts$expt[1],6),rep(expts$expt[2],6))
76 | guideHits = findGuideHitsAllScreens(experiments = expts, countDataFrame=fakeReadData,
77 |                                     binStats = curSortBins, unsortedBin = "NotSorted",
78 |                                     negativeControl="negControl")
79 | if(require("ggplot2")){
80 |   p=ggplot(guideHits, aes(x=position, y=Z))+geom_point() ; print(p)
81 | }
82 | tilingElementStats = getTilingElementwiseStats(experiments = expts, normNBSummaries = guideHits, 
83 |   tails = "both", chr = "chr", location="position", negativeControl = "negControl")
84 | if(require("ggplot2")){
85 |   p=ggplot(tilingElementStats, aes(x=start, xend=end, y=significanceZ, yend=significanceZ, 
86 |     colour=FDR<0.01))+geom_segment() ; print(p)
87 | }
88 | }
89 | 


--------------------------------------------------------------------------------
/man/getZScalesWithNTGuides.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/MAUDE.R
 3 | \name{getZScalesWithNTGuides}
 4 | \alias{getZScalesWithNTGuides}
 5 | \title{Calculate Z-score scaling factors using non-targeting guides}
 6 | \usage{
 7 | getZScalesWithNTGuides(ntData, uGuidesPerElement, mergeBy, ntSampleFold = 10)
 8 | }
 9 | \arguments{
10 | \item{ntData}{data.frame containing the data for the non-targeting guides}
11 | 
12 | \item{uGuidesPerElement}{a unique vector of guide counts per element}
13 | 
14 | \item{mergeBy}{a character vector containing the header(s) that demarcate the screen/experiment/replicate ID(s)}
15 | 
16 | \item{ntSampleFold}{how many times to sample each non-targeting guide to make the Z score scale (defaults to 10)}
17 | }
18 | \value{
19 | a data.frame containing a Z-score scaling factor, one for every number of guides and unique entry in mergeBy
20 | }
21 | \description{
22 | Calculates scaling factors to calibrate element-wise Z-scores by repeatedly sampling the given numbers of non-targeting guides per element and computing a set of "null" Z-scores from each sample. This function is not normally used directly.
23 | }
24 | \examples{
25 | fakeReadData = data.frame(id=rep(1:1000,2), expt=c(rep("e1",1000), rep("e2",1000)), 
26 |                           A=rpois(2000, lambda = 100), B=rpois(2000, lambda = 100),
27 |                           C=rpois(2000, lambda = 100), D=rpois(2000, lambda = 100),
28 |                           E=rpois(2000, lambda = 100), F=rpois(2000, lambda = 100),
29 |                           NotSorted=rpois(2000, lambda = 100), 
30 |                           negControl = rep(rnorm(1000)>0,2), stringsAsFactors = FALSE)
31 | expts = unique(fakeReadData["expt"]);
32 | curSortBins = makeBinModel(data.frame(Bin = c("A","B","C","D","E","F"), fraction = rep(0.1,6)))
33 | curSortBins = rbind(curSortBins, curSortBins) # duplicate and use same for both expts
34 | curSortBins$expt = c(rep(expts$expt[1],6),rep(expts$expt[2],6))
35 | guideHits = findGuideHitsAllScreens(experiments = expts, countDataFrame=fakeReadData,
36 |                                     binStats = curSortBins, unsortedBin = "NotSorted",
37 |                                     negativeControl="negControl")
38 | guideZScales = getZScalesWithNTGuides(guideHits[guideHits$negControl,], uGuidesPerElement=1:10, 
39 |   mergeBy=names(expts))
40 | if(require("ggplot2")){
41 |   p=ggplot(guideZScales, aes(x=numGuides, y=Zscale, colour=expt))+geom_point()+geom_line();
42 |   print(p)
43 | }
44 | }
45 | 


--------------------------------------------------------------------------------
/man/makeBinModel.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/MAUDE.R
 3 | \name{makeBinModel}
 4 | \alias{makeBinModel}
 5 | \title{Create a bin model for a single experiment}
 6 | \usage{
 7 | makeBinModel(curBinBounds, tailP = 0.001)
 8 | }
 9 | \arguments{
10 | \item{curBinBounds}{a data.frame containing two columns: Bin (must be {A,B,C,D,E,F}), and fraction (the fractions of the total captured by each bin)}
11 | 
12 | \item{tailP}{the fraction of the tails of the distribution not captured in any bin (defaults to 0.001)}
13 | }
14 | \value{
15 | returns a data.frame with additional columns including the bin starts and ends in Z-score space, and in quantile space.
16 | }
17 | \description{
18 | Provided with the fractions captured by each bin, creates a bin model for use with MAUDE analysis, assuming 3 contiguous bins on the tails of the distribution. You can easily remove or rename bins after they have been created with this function. An example is provided in the BACH2 Vignette.
19 | }
20 | \examples{
21 | #generally, the bin bounds are retrieved from the FACS data
22 | binBounds = makeBinModel(data.frame(Bin=c("A","B","C","D","E","F"), fraction=rep(0.1,6))) 
23 | if(require("ggplot2")){
24 |   p = ggplot() + 
25 |     geom_vline(xintercept= sort(unique(c(binBounds$binStartZ, binBounds$binEndZ))),colour="gray")+
26 |     theme_classic() + xlab("Target expression") + 
27 |     geom_segment(data=binBounds, aes(x=binStartZ, xend=binEndZ, colour=Bin, y=0, yend=0), 
28 |       size=5, inherit.aes = FALSE); 
29 |   print(p)
30 | }
31 | }
32 | 


--------------------------------------------------------------------------------
/vignettes/BACH2_base_editor_screen.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "BACH2 Base editor flow-FISH screen"
  3 | author: "Carl de Boer"
  4 | date: "24/02/2022"
  5 | output: rmarkdown::html_vignette
  6 | vignette: >
  7 |   %\VignetteIndexEntry{BACH2 Base editor flow-FISH screen}
  8 |   %\VignetteEngine{knitr::rmarkdown}
  9 |   \usepackage[utf8]{inputenc}
 10 | ---
 11 | 
 12 | ```{r setup, include=FALSE}
 13 | knitr::opts_chunk$set(echo = TRUE)
 14 | ```
 15 | 
 16 | # Introduction
 17 | In this example, we use MAUDE to analyze a CRISPR base editor screen targeting a region containing an autoimmune-linked variant (rs72928038) within a BACH2 enhancer. Data are from Mouri et al. (Prioritization of autoimmune disease-associated genetic variants that perturb regulatory element activity in T cells), which also contains further details about the experiment:
 18 | https://www.biorxiv.org/content/10.1101/2021.05.30.445673v1
 19 | 
 20 | 
 21 | In this experiment, CRISPR/Cas9 base editors were targeted to rs72928038, where they mutate the DNA. Expression is then measured by sorting cells by BACH2 expression (Flow-FISH) into 4 expression bins. We then use MAUDE to estimate the mean expression of each of the alleles generated by the base editors. Only a minority of the alleles created correspond to rs72928038. 
 22 | 
 23 | Here, we have several experiments and replicates, four sorting bins per experiment, as well as unsorted cells. For more details, please see the publication referenced above.
 24 | 
 25 | # Loading the data/libraries
 26 | 
 27 | 
 28 | ```{r}
 29 | library(ggplot2)
 30 | library(reshape)
 31 | library(MAUDE)
 32 | maudeGitPathRoot = "https://raw.githubusercontent.com/de-Boer-Lab/MAUDE/master"
 33 | ```
 34 | 
 35 | Load sample metadata.
 36 | ```{r}
 37 | allSamples = read.table(file=sprintf("%s/vignettes/BACH2_data/sample_metadata.txt",maudeGitPathRoot), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F)
 38 | head(allSamples)
 39 | #remove some of the control samples
 40 | allSamples = allSamples[!grepl("HEK62", allSamples$replicate),]
 41 | 
 42 | ```
 43 | 
 44 | Load the data.
 45 | ```{r}
 46 | if (FALSE){
 47 |   #this was run on my computer to load the data from many files spread out over many subdirectories. Rather than upload all these files separately, I have loaded them all locally and saved the resulting concatenated file onto github. I leave this code here so that others may view how the data was loaded, should they have their own CRISPResso files to analyze.
 48 |   inDir = "/Path/To/CRISPResso/Files";
 49 |   setwd(inDir)
 50 |   
 51 |   
 52 |   allCRISPRessoData= data.frame()
 53 |   for (i in 1:nrow(allSamples)){
 54 |     curData = read.table(file=sprintf("%s/%s/Alleles_frequency_table.txt",inDir, allSamples$directory[i]), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F, comment.char = "")
 55 |     curData2 = read.table(file=sprintf("%s/Deep_resequencing_analysis/%s/Alleles_frequency_table.txt",inDir, allSamples$directory[i]), sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F, comment.char = "")
 56 |     curData = rbind(curData, curData2);
 57 |     curData$locus = allSamples$locus[i];
 58 |     curData$replicate = allSamples$replicate[i];
 59 |     curData$sortBin = allSamples$sortBin[i];
 60 |     allCRISPRessoData = rbind(allCRISPRessoData, curData)
 61 |   }
 62 |   #remove gaps from sequences
 63 |   allCRISPRessoData$SeqSpecies = gsub("-","",allCRISPRessoData$Aligned_Sequence); # Allele
 64 |   #remove unwanted fields
 65 |   allCRISPRessoData$Aligned_Sequence=NULL; allCRISPRessoData$Reference_Sequence=NULL;  allCRISPRessoData$X.Reads.1=NULL;
 66 |   write.table(allCRISPRessoData, file=sprintf("%s/loaded_BACH2_BE_CRISPResso_data.txt", inDir),row.names = F, col.names = T, quote=F, sep="\t")
 67 |   #I then gzipped this file and uploaded it to github
 68 | }else{
 69 |   #load the CRISPResso files from GitHub. 
 70 |   z= gzcon(url(sprintf("%s/vignettes/BACH2_data/loaded_BACH2_BE_CRISPResso_data.txt.gz",maudeGitPathRoot)));
 71 |   fileConn=textConnection(readLines(z));
 72 |   allCRISPRessoData =  read.table(fileConn, sep="\t", quote="", header = T, row.names = NULL, stringsAsFactors = F)
 73 |   close(fileConn)
 74 | }
 75 | ```
 76 | 
 77 | 
 78 | Polish CRISPResso genotype data, calculate reads per sample and genotype composition per sample.
 79 | 
 80 | ```{r}
 81 | #CRISPResso sometimes splits a sequence across two or more lines.
 82 | #We also load files from several sequencing runs. The cast below sums the reads of identical sequence species within each sample.
 83 | allCRISPRessoData = cast(melt(allCRISPRessoData, id.vars = c("SeqSpecies","Reference_Name","Read_Status","n_deleted","n_inserted","n_mutated", "locus","replicate","sortBin")), SeqSpecies + Reference_Name + Read_Status + n_deleted + n_inserted + n_mutated + locus + replicate + sortBin ~ variable, value="value", fun.aggregate=sum)
 84 | 
 85 | #Get read totals per replicate
 86 | allCRISPRessoDataTotals = cast(allCRISPRessoData, locus + replicate + sortBin ~ ., value="X.Reads", fun.aggregate = sum)
 87 | names(allCRISPRessoDataTotals)[ncol(allCRISPRessoDataTotals)] = "totalReads";
 88 | 
 89 | #Add totalReads column to allCRISPRessoData, calculate read fractions
 90 | allCRISPRessoData = merge(allCRISPRessoData, allCRISPRessoDataTotals, by=c("locus","replicate","sortBin"))
 91 | allCRISPRessoData$readFraction = allCRISPRessoData$X.Reads/allCRISPRessoData$totalReads;
 92 | ```
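The totals-then-fractions step above can be illustrated on toy data. The following is a minimal sketch in base R, assuming a made-up table with the same `X.Reads` column name; the `toy` objects are hypothetical and `aggregate()` stands in for the `cast()` call used in the vignette.

```r
# Toy allele counts: two samples, three sequence species (all values are made up).
toy = data.frame(
  SeqSpecies = c("AAA", "AAT", "ATT", "AAA", "AAT", "ATT"),
  replicate  = c("R1", "R1", "R1", "R2", "R2", "R2"),
  sortBin    = "NS",
  X.Reads    = c(80, 15, 5, 60, 30, 10),
  stringsAsFactors = FALSE
)

# Total reads per sample (replicate x sortBin), analogous to allCRISPRessoDataTotals.
toyTotals = aggregate(X.Reads ~ replicate + sortBin, data = toy, FUN = sum)
names(toyTotals)[names(toyTotals) == "X.Reads"] = "totalReads"

# Attach the totals to every row and convert counts to fractions.
toy = merge(toy, toyTotals, by = c("replicate", "sortBin"))
toy$readFraction = toy$X.Reads / toy$totalReads

# Sanity check: fractions should sum to 1 within each sample.
fractionSums = tapply(toy$readFraction, toy$replicate, sum)
print(fractionSums)
```

Checking that the fractions sum to 1 per sample is a cheap way to catch a merge that duplicated or dropped rows.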
 93 | 
 94 | 
 95 | # Quality Control analysis for the experiment
 96 | 
 97 | Reads per sample
 98 | ```{r}
 99 | p = ggplot(allCRISPRessoDataTotals, aes(x=replicate, fill=sortBin, y=totalReads)) + geom_bar(stat="identity", position="dodge") + facet_grid(. ~ locus) + theme_classic() + scale_y_continuous(expand=c(0,0))+theme(axis.text.x=element_text(hjust=1, angle=90)); print(p)
100 | ```
101 | 
102 | Same on a log scale
103 | ```{r}
104 | p = ggplot(allCRISPRessoDataTotals, aes(x=replicate, fill=sortBin, y=totalReads)) + geom_bar(stat="identity", position="dodge") + facet_grid(. ~ locus) + theme_classic() + scale_y_log10(expand=c(0,0))+theme(axis.text.x=element_text(hjust=1, angle=90)); print(p)
105 | ```
106 | 
107 | 
108 | Discard all cases where the locus was unmodified but sequenced anyway. Note that this means each sample was effectively only 50k cells, not 100k, because the gDNA was divided in half.
109 | ```{r}
110 | allCRISPRessoData = allCRISPRessoData[ !(allCRISPRessoData$locus=="HEK" & grepl("^[ABCR][12]$", allCRISPRessoData$replicate)) & !(allCRISPRessoData$locus=="BACH2" & grepl("^HEK9[12]", allCRISPRessoData$replicate)),]
111 | ```
112 | 
113 | 
114 | Plot a CDF showing the cumulative fraction of genotypes (y axis) per sample (colour) by their abundance in the sequencing data (x axis). This illustrates how biased the data are towards rare genotypes (left) vs abundant genotypes (right). The more the curves shift to the right, the less complex the modifications to the DNA. A lot of the ones to the left are from sequencing artifacts.
115 | ```{r}
116 | p = ggplot(allCRISPRessoData[allCRISPRessoData$sortBin=="NS",], aes(x= readFraction, colour=replicate)) + stat_ecdf()+facet_grid(locus ~ .) + scale_x_log10() + theme_bw(); print(p)
117 | ```
118 | 
119 | Estimate the %unmodified in each sample.
120 | ```{r}
121 | temp = cast(allCRISPRessoData[allCRISPRessoData$Read_Status=="UNMODIFIED" & allCRISPRessoData$Reference_Name=="Reference",], formula = sortBin + locus+ replicate ~. ,value = "readFraction", fun.aggregate = max)
122 | names(temp)[ncol(temp)]="readFraction";
123 | p = ggplot(temp, aes(x=sortBin, y=replicate, fill=readFraction)) + geom_tile() + facet_grid(locus ~., scales="free_y")+ggtitle("just unmodified read fractions") + scale_fill_gradientn(colours=c("red","orange", "green","cyan","blue","violet"), limits=c(0,0.8)); print(p)
124 | min(temp$readFraction, na.rm = T) # 0.0356963
125 | ```
126 | 
127 | 
128 | Sample exclusion
129 | ```{r}
130 | allCRISPRessoData = allCRISPRessoData[allCRISPRessoData$replicate!="A1-HEK",] # this sample had no data for bin D (BACH2) because it didn't PCR well
131 | 
132 | allCRISPRessoData = allCRISPRessoData[allCRISPRessoData$replicate!="B2-HEK",] # this sample had very low coverage for bin D for BACH2
133 | ```
134 | 
135 | 
136 | 
137 | Identify and retain only those alleles that are present in all samples
138 | ```{r}
139 | seqsObserved = cast(allCRISPRessoData, SeqSpecies + locus + Read_Status + Reference_Name~ .)
140 | names(seqsObserved)[ncol(seqsObserved)]="seqRunsObserved";
141 | 
142 | retainedSamples = unique(allCRISPRessoData[,c("locus","replicate","sortBin")])
143 | 
144 | for (l in unique(allSamples$locus)){
145 |   seqsObserved$inAll[seqsObserved$locus == l] = seqsObserved$seqRunsObserved[seqsObserved$locus == l]==sum(retainedSamples$locus==l)
146 | }
147 | 
148 | #require that all samples have at least one read for every species considered. This will help exclude read errors
149 | keepAlleles = seqsObserved[seqsObserved$inAll,]
150 | ```
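The "present in all samples" filter can be sketched on toy data as follows. This is an illustration in base R with hypothetical names (`toy`, `sample`, `keep`), not the vignette's code: the vignette counts rows per allele with `cast()`, which matches this only when each allele has at most one row per sample, as should hold after the merging step earlier.

```r
# Toy observations: which sequence species were seen in which sample (made up).
toy = data.frame(
  SeqSpecies = c("AAA", "AAA", "AAA", "AAT", "AAT", "ATT"),
  sample     = c("s1", "s2", "s3", "s1", "s2", "s3"),
  stringsAsFactors = FALSE
)
nSamples = length(unique(toy$sample))

# Count the distinct samples in which each species appears.
samplesPerSpecies = tapply(toy$sample, toy$SeqSpecies, function(x) length(unique(x)))

# Retain only species observed in every sample.
keep = names(samplesPerSpecies)[samplesPerSpecies == nSamples]
print(keep)
```

Here only `AAA` survives, since `AAT` is missing from one sample and `ATT` from two.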
151 | 
152 | 
153 | # MAUDE Analysis of base editor screen
154 | 
155 | ```{r}
156 | #First make a count matrix, but only of alleles seen in all replicates
157 | readCountMat = cast(allCRISPRessoData[allCRISPRessoData$SeqSpecies %in% keepAlleles$SeqSpecies,], SeqSpecies  + replicate +locus ~ sortBin, value="X.Reads")
158 | readCountMat[is.na(readCountMat)] = 0
159 | readCountMat = readCountMat[order(readCountMat$NS, decreasing = T),]
160 | 
161 | #Label WT alleles
162 | wtSeqs = data.frame(locus = c("HEK","BACH2"), SeqSpecies =  c("GGTAGCCAGAGACCCGCTGGTCTTCTTTCCCCTCCCCTGCCCTCCCCTCCCTTCAAGATGGCTGACAA","TGCCCCACCCTGTGCCCTTTTTACATTACACACAAATAGGGACGGATTTCCTGTAAGCTGATCTTGAAGAAAAAAAACATGTTAGACAAAGAAAATCAGAACTAAGA"), isWT=T, stringsAsFactors = F);
163 | readCountMat = merge(readCountMat, wtSeqs, by=c("locus","SeqSpecies"), all.x=T)
164 | readCountMat$isWT[is.na(readCountMat$isWT)]=F;
165 | 
166 | allCRISPRessoData = merge(allCRISPRessoData, wtSeqs, by=c("locus","SeqSpecies"), all.x=T)
167 | allCRISPRessoData$isWT[is.na(allCRISPRessoData$isWT)]=F;
168 | 
169 | 
170 | binBounds = makeBinModel(data.frame(Bin=c("A","B","C","D","E","F"), fraction=rep(0.105,6))) # 10.5% bins
171 | # but only the top ~21% (CD) and bottom ~21% (AB) were retained
172 | binBounds = binBounds[binBounds$Bin %in% c("A","B","E","F"),] 
173 | binBounds$Bin[binBounds$Bin=="E"]="C"
174 | binBounds$Bin[binBounds$Bin=="F"]="D"
175 | ```
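To make the bin layout concrete, the sketch below builds a comparable table by hand. It is an illustration in base R, not MAUDE's implementation: it assumes that three bins are stacked contiguously inward from each tail, that `tailP` of the distribution lies beyond the outermost bin on each side, and that expression is standard normal. The name `sketchBins` is hypothetical, and its columns merely mirror the `binStartZ` and `binEndZ` columns used later in the vignette.

```r
tailP    = 0.001   # mass beyond the outermost bins (makeBinModel's documented default)
fraction = 0.105   # fraction captured by each bin, as in the vignette

# Lower-tail bins (A, B, C), stacked upward from tailP in quantile space.
lowerStartQ = tailP + fraction * 0:2
lowerEndQ   = lowerStartQ + fraction
# Upper-tail bins (D, E, F) mirror them, stacked downward from 1 - tailP.
upperEndQ   = rev(1 - lowerStartQ)
upperStartQ = rev(1 - lowerEndQ)

sketchBins = data.frame(
  Bin       = c("A", "B", "C", "D", "E", "F"),
  binStartQ = c(lowerStartQ, upperStartQ),
  binEndQ   = c(lowerEndQ, upperEndQ),
  stringsAsFactors = FALSE
)
# Convert quantiles to Z-scores of a standard normal expression distribution.
sketchBins$binStartZ = qnorm(sketchBins$binStartQ)
sketchBins$binEndZ   = qnorm(sketchBins$binEndQ)

# Keep the two outermost bins on each side and relabel them A-D, as the vignette does.
sketchBins = sketchBins[sketchBins$Bin %in% c("A", "B", "E", "F"), ]
sketchBins$Bin[sketchBins$Bin == "E"] = "C"
sketchBins$Bin[sketchBins$Bin == "F"] = "D"
print(sketchBins)
```

Under these assumptions the four retained bins cover 42% of the distribution in total, and the layout is symmetric around Z = 0.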
176 | 
177 | The bin layout:
178 | ```{r}
179 | p = ggplot(binBounds, aes(colour=Bin))  + 
180 |   geom_density(data=data.frame(x=rnorm(100000)), aes(x=x), fill="gray", colour=NA)+ 
181 |   ggplot2::geom_segment(aes(x=binStartZ, xend=binEndZ, y=fraction, yend=fraction)) + 
182 |   xlab("Bin bounds as expression Z-scores") + 
183 |   ylab("Fraction of the distribution captured") +theme_classic()+scale_y_continuous(expand=c(0,0))+
184 |   coord_cartesian(ylim=c(0,0.7)); print(p)
185 | ```
186 | 
187 | Replicate bin stats for each experiment (the same sorting parameters were used for each)
188 | ```{r}
189 | curExpts = unique(readCountMat[c("replicate","locus")])
190 | 
191 | binStatsAll = data.frame();
192 | for(i in 1:nrow(curExpts)){
193 |   binStatsAll = rbind(binStatsAll, cbind(binBounds, curExpts[i,]))
194 | }
195 | ```
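Replicating one bin table for every experiment is a Cartesian product, so it can also be written without a loop. The sketch below uses toy stand-ins (`toyBins`, `toyExpts`, both hypothetical) and relies on the documented behaviour of base R's `merge()` when `by` is `NULL`; the rows come out in a different order than in the loop above, but the content is the same.

```r
# Toy stand-ins for binBounds and curExpts (values are made up).
toyBins  = data.frame(Bin = c("A", "B", "C", "D"), fraction = 0.105,
                      stringsAsFactors = FALSE)
toyExpts = data.frame(replicate = c("R1", "R2", "R3"), locus = "BACH2",
                      stringsAsFactors = FALSE)

# by = NULL asks merge() for the Cartesian product:
# every bin is paired with every experiment.
toyBinStats = merge(toyBins, toyExpts, by = NULL)
print(nrow(toyBinStats))
```

This also avoids growing a data.frame with `rbind()` inside a loop, which matters little for a handful of experiments but scales poorly.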
196 | 
197 | 
198 | ```{r}
199 | #perform MAUDE analysis at guide (allele in this case) level for each locus.
200 | guideLevelStats = data.frame();
201 | for( l in unique(allSamples$locus)){
202 |   message(l);
203 |   statsA = findGuideHitsAllScreens(experiments = curExpts[curExpts$locus==l,], countDataFrame = readCountMat[readCountMat$locus==l,], binStats = binStatsAll[binStatsAll$locus==l,], sortBins = c("A","B","C","D"), unsortedBin = "NS", negativeControl = "isWT")
204 |   guideLevelStats = rbind(guideLevelStats, statsA)
205 | }
206 | ```
207 | 
208 | 
209 | Summarize base changes per allele
210 | ```{r}
211 | baseChanges = data.frame()
212 | for (l in wtSeqs$locus){
213 |   wtSeq = wtSeqs$SeqSpecies[wtSeqs$locus==l];
214 |   wtSplit = strsplit(wtSeq,"")[[1]]
215 |   curSeqs = unique(guideLevelStats$SeqSpecies[guideLevelStats$locus==l & guideLevelStats$libFraction > 0.001])
216 |   for (i in seq_along(curSeqs)){
217 |     curSplit = strsplit(curSeqs[i],"")[[1]]
218 |     mismatches = c();
219 |     mismatchPoss=c();
220 |     for (j in seq_len(min(length(wtSplit),length(curSplit)))){
221 |       if (wtSplit[j]!=curSplit[j]){
222 |         mismatches = c(mismatches, curSplit[j]);
223 |         mismatchPoss = c(mismatchPoss, j)
224 |       }
225 |     }
226 |     if (length(mismatchPoss)>0){
227 |       baseChanges = rbind(baseChanges, data.frame(mismatch=mismatches, position = mismatchPoss, SeqSpecies=curSeqs[i], locus=l))
228 |     }
229 |   }
230 | }
231 | 
232 | baseChangesSummary = cast(baseChanges, SeqSpecies +locus~ ., value="mismatch")
233 | names(baseChangesSummary)[ncol(baseChangesSummary)] = "numMismatches";
234 | p = ggplot(baseChangesSummary, aes(x=numMismatches, colour=locus)) + stat_ecdf()+ geom_vline(xintercept = 15); print(p)
235 | ```
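The inner loop above compares an allele to the wild-type sequence one position at a time. The same comparison can be sketched in vectorised form; `findMismatches` below is a hypothetical helper written for illustration, not a MAUDE function, and like the loop it compares only up to the length of the shorter sequence and ignores indels.

```r
# Hypothetical helper: positions and bases at which altSeq differs from wtSeq.
findMismatches = function(wtSeq, altSeq) {
  wtSplit  = strsplit(wtSeq, "")[[1]]
  altSplit = strsplit(altSeq, "")[[1]]
  # Compare only the overlapping prefix of the two sequences.
  n   = min(length(wtSplit), length(altSplit))
  pos = which(wtSplit[seq_len(n)] != altSplit[seq_len(n)])
  data.frame(mismatch = altSplit[pos], position = pos, stringsAsFactors = FALSE)
}

# Toy sequences (made up): they differ at positions 3 and 8.
toyChanges = findMismatches("ACGTACGT", "ACATACGA")
print(toyChanges)

# Identical sequences give an empty data.frame rather than an error.
noChanges = findMismatches("ACGT", "ACGT")
```

Returning an empty data.frame for identical sequences means the result can be passed straight to `rbind()` without a separate length check.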
236 | 
237 | 
238 | ```{r}
239 | #exclude alleles with 15 or more mismatches - these are also likely read or PCR artifacts.
240 | baseChangesSummary$keepAlleles= baseChangesSummary$numMismatches<15;
241 | 
242 | baseChanges$has = 1;
243 | 
244 | baseChangesNumWith = cast(baseChanges[baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles],], position + mismatch + locus~ ., value="has", fun.aggregate = sum)
245 | names(baseChangesNumWith)[ncol(baseChangesNumWith)] = "numSeqsWith"
246 | baseChanges = merge(baseChanges, baseChangesNumWith, by=c("position","mismatch","locus"))
247 | baseChanges$mismatch = as.character(baseChanges$mismatch)
248 | baseChanges$SeqSpecies = as.character(baseChanges$SeqSpecies)
249 | 
250 | 
251 | p = ggplot(baseChanges[baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles] & !(baseChanges$SeqSpecies %in% baseChanges$SeqSpecies[baseChanges$numSeqsWith==1]),], aes(x=position, y=SeqSpecies, fill=mismatch)) + geom_tile()+scale_fill_manual(values=c("orange","green","blue","red"))+scale_x_continuous(expand=c(0,0))+theme_classic() + theme(axis.text.y = element_text(size=5, family = "Courier")) + facet_grid(locus ~ ., scales="free", space ="free"); print(p)
252 | ```
253 | 
254 | 
255 | ```{r}
256 | #A_53 is the mutation of interest
257 | 
258 | baseChangesMatrix = cast(baseChanges[baseChanges$locus == "BACH2" & baseChanges$SeqSpecies %in% baseChangesSummary$SeqSpecies[baseChangesSummary$keepAlleles] & !(baseChanges$SeqSpecies %in% baseChanges$SeqSpecies[baseChanges$numSeqsWith==1]),], SeqSpecies ~ mismatch + position, value="has", fill = 0)
259 | A53_seq = baseChangesMatrix$SeqSpecies[apply(baseChangesMatrix[2:ncol(baseChangesMatrix)],1, sum)==1 & baseChangesMatrix$A_53==1]
260 | 
261 | guideLevelStats$pooled = grepl("-",guideLevelStats$replicate);
262 | ```
263 | 
264 | 
265 | ```{r}
266 | wtAndVarMaudeMu = cast(guideLevelStats[guideLevelStats$SeqSpecies %in% c(A53_seq, wtSeqs$SeqSpecies[wtSeqs$locus=="BACH2"]),], replicate +pooled ~ SeqSpecies, value="mean")
267 | names(wtAndVarMaudeMu)[names(wtAndVarMaudeMu)==A53_seq]="rs72928038";
268 | names(wtAndVarMaudeMu)[names(wtAndVarMaudeMu)==wtSeqs$SeqSpecies[wtSeqs$locus=="BACH2"]]="WT";
269 | 
270 | ttestResults = t.test(x = wtAndVarMaudeMu$rs72928038[!wtAndVarMaudeMu$pooled], y=wtAndVarMaudeMu$WT[!wtAndVarMaudeMu$pooled], alternative="less", paired=TRUE, var.equal = FALSE)
271 | 
272 | ttestResults_mixedCells = t.test(x = wtAndVarMaudeMu$rs72928038[wtAndVarMaudeMu$pooled], y=wtAndVarMaudeMu$WT[wtAndVarMaudeMu$pooled], alternative="less", paired=TRUE, var.equal = FALSE)
273 | 
274 | wtAndVarMaudeMu$replicate2 = gsub("-HEK","",wtAndVarMaudeMu$replicate)
275 | 
276 | ```
277 | 
278 | Make plot of WT vs rs72928038 mean expression levels
279 | ```{r}
280 | meltedSNPMus = melt(as.data.frame(wtAndVarMaudeMu), id.vars=c("pooled","replicate", "replicate2"));
281 | meltedSNPMus$genotype = factor(ifelse(meltedSNPMus$variable=="WT", "G", "A (risk)"),levels=c("G","A (risk)"))
282 | p = ggplot(meltedSNPMus, aes(x=genotype, y=value, group=replicate)) + geom_point() + geom_line()+facet_grid(. ~ pooled)+theme_bw()+xlab("Genotype") + ylab("Mean expression") + ggtitle(sprintf("P=%f; P=%f", ttestResults$p.value, ttestResults_mixedCells$p.value)); print(p)
283 | ```
284 | 
285 | 


--------------------------------------------------------------------------------
/vignettes/BACH2_data/loaded_BACH2_BE_CRISPResso_data.txt.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/de-Boer-Lab/MAUDE/eb9c80a0bf37bb2b0c3c7ae7270a1fee71eaba20/vignettes/BACH2_data/loaded_BACH2_BE_CRISPResso_data.txt.gz


--------------------------------------------------------------------------------
/vignettes/BACH2_data/sample_metadata.txt:
--------------------------------------------------------------------------------
  1 | directory	locus	locus2	replicate	sortBin
  2 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-A_S1.trimmed.1_BACH2-A1-A_S1.trimmed.2	BACH2	BACH2	A1	A
  3 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-B_S2.trimmed.1_BACH2-A1-B_S2.trimmed.2	BACH2	BACH2	A1	B
  4 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-C_S3.trimmed.1_BACH2-A1-C_S3.trimmed.2	BACH2	BACH2	A1	C
  5 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-D_S4.trimmed.1_BACH2-A1-D_S4.trimmed.2	BACH2	BACH2	A1	D
  6 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-HEK-A_S49.trimmed.1_BACH2-A1-HEK-A_S49.trimmed.2	BACH2	BACH2	A1-HEK	A
  7 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-HEK-B_S50.trimmed.1_BACH2-A1-HEK-B_S50.trimmed.2	BACH2	BACH2	A1-HEK	B
  8 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-HEK-C_S51.trimmed.1_BACH2-A1-HEK-C_S51.trimmed.2	BACH2	BACH2	A1-HEK	C
  9 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-HEK-NS_S53.trimmed.1_BACH2-A1-HEK-NS_S53.trimmed.2	BACH2	BACH2	A1-HEK	NS
 10 | CRISPResso_BACH2/CRISPResso_on_BACH2-A1-NS_S5.trimmed.1_BACH2-A1-NS_S5.trimmed.2	BACH2	BACH2	A1	NS
 11 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-A_S6.trimmed.1_BACH2-A2-A_S6.trimmed.2	BACH2	BACH2	A2	A
 12 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-B_S7.trimmed.1_BACH2-A2-B_S7.trimmed.2	BACH2	BACH2	A2	B
 13 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-C_S8.trimmed.1_BACH2-A2-C_S8.trimmed.2	BACH2	BACH2	A2	C
 14 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-D_S9.trimmed.1_BACH2-A2-D_S9.trimmed.2	BACH2	BACH2	A2	D
 15 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-HEK-A_S54.trimmed.1_BACH2-A2-HEK-A_S54.trimmed.2	BACH2	BACH2	A2-HEK	A
 16 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-HEK-B_S55.trimmed.1_BACH2-A2-HEK-B_S55.trimmed.2	BACH2	BACH2	A2-HEK	B
 17 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-HEK-C_S56.trimmed.1_BACH2-A2-HEK-C_S56.trimmed.2	BACH2	BACH2	A2-HEK	C
 18 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-HEK-D_S57.trimmed.1_BACH2-A2-HEK-D_S57.trimmed.2	BACH2	BACH2	A2-HEK	D
 19 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-HEK-NS_S58.trimmed.1_BACH2-A2-HEK-NS_S58.trimmed.2	BACH2	BACH2	A2-HEK	NS
 20 | CRISPResso_BACH2/CRISPResso_on_BACH2-A2-NS_S10.trimmed.1_BACH2-A2-NS_S10.trimmed.2	BACH2	BACH2	A2	NS
 21 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-A_S13.trimmed.1_BACH2-B1-A_S13.trimmed.2	BACH2	BACH2	B1	A
 22 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-B_S14.trimmed.1_BACH2-B1-B_S14.trimmed.2	BACH2	BACH2	B1	B
 23 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-C_S15.trimmed.1_BACH2-B1-C_S15.trimmed.2	BACH2	BACH2	B1	C
 24 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-D_S16.trimmed.1_BACH2-B1-D_S16.trimmed.2	BACH2	BACH2	B1	D
 25 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-HEK-A_S61.trimmed.1_BACH2-B1-HEK-A_S61.trimmed.2	BACH2	BACH2	B1-HEK	A
 26 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-HEK-B_S62.trimmed.1_BACH2-B1-HEK-B_S62.trimmed.2	BACH2	BACH2	B1-HEK	B
 27 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-HEK-C_S63.trimmed.1_BACH2-B1-HEK-C_S63.trimmed.2	BACH2	BACH2	B1-HEK	C
 28 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-HEK-D_S64.trimmed.1_BACH2-B1-HEK-D_S64.trimmed.2	BACH2	BACH2	B1-HEK	D
 29 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-HEK-NS_S65.trimmed.1_BACH2-B1-HEK-NS_S65.trimmed.2	BACH2	BACH2	B1-HEK	NS
 30 | CRISPResso_BACH2/CRISPResso_on_BACH2-B1-NS_S17.trimmed.1_BACH2-B1-NS_S17.trimmed.2	BACH2	BACH2	B1	NS
 31 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-A_S18.trimmed.1_BACH2-B2-A_S18.trimmed.2	BACH2	BACH2	B2	A
 32 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-B_S19.trimmed.1_BACH2-B2-B_S19.trimmed.2	BACH2	BACH2	B2	B
 33 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-C_S20.trimmed.1_BACH2-B2-C_S20.trimmed.2	BACH2	BACH2	B2	C
 34 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-D_S21.trimmed.1_BACH2-B2-D_S21.trimmed.2	BACH2	BACH2	B2	D
 35 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-HEK-A_S66.trimmed.1_BACH2-B2-HEK-A_S66.trimmed.2	BACH2	BACH2	B2-HEK	A
 36 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-HEK-B_S67.trimmed.1_BACH2-B2-HEK-B_S67.trimmed.2	BACH2	BACH2	B2-HEK	B
 37 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-HEK-C_S68.trimmed.1_BACH2-B2-HEK-C_S68.trimmed.2	BACH2	BACH2	B2-HEK	C
 38 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-HEK-D_S69.trimmed.1_BACH2-B2-HEK-D_S69.trimmed.2	BACH2	BACH2	B2-HEK	D
 39 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-HEK-NS_S70.trimmed.1_BACH2-B2-HEK-NS_S70.trimmed.2	BACH2	BACH2	B2-HEK	NS
 40 | CRISPResso_BACH2/CRISPResso_on_BACH2-B2-NS_S22.trimmed.1_BACH2-B2-NS_S22.trimmed.2	BACH2	BACH2	B2	NS
 41 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-A_S25.trimmed.1_BACH2-C1-A_S25.trimmed.2	BACH2	BACH2	C1	A
 42 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-B_S26.trimmed.1_BACH2-C1-B_S26.trimmed.2	BACH2	BACH2	C1	B
 43 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-C_S27.trimmed.1_BACH2-C1-C_S27.trimmed.2	BACH2	BACH2	C1	C
 44 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-D_S28.trimmed.1_BACH2-C1-D_S28.trimmed.2	BACH2	BACH2	C1	D
 45 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-HEK-A_S73.trimmed.1_BACH2-C1-HEK-A_S73.trimmed.2	BACH2	BACH2	C1-HEK	A
 46 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-HEK-B_S74.trimmed.1_BACH2-C1-HEK-B_S74.trimmed.2	BACH2	BACH2	C1-HEK	B
 47 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-HEK-C_S75.trimmed.1_BACH2-C1-HEK-C_S75.trimmed.2	BACH2	BACH2	C1-HEK	C
 48 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-HEK-D_S76.trimmed.1_BACH2-C1-HEK-D_S76.trimmed.2	BACH2	BACH2	C1-HEK	D
 49 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-HEK-NS_S77.trimmed.1_BACH2-C1-HEK-NS_S77.trimmed.2	BACH2	BACH2	C1-HEK	NS
 50 | CRISPResso_BACH2/CRISPResso_on_BACH2-C1-NS_S29.trimmed.1_BACH2-C1-NS_S29.trimmed.2	BACH2	BACH2	C1	NS
 51 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-A_S30.trimmed.1_BACH2-C2-A_S30.trimmed.2	BACH2	BACH2	C2	A
 52 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-B_S31.trimmed.1_BACH2-C2-B_S31.trimmed.2	BACH2	BACH2	C2	B
 53 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-C_S32.trimmed.1_BACH2-C2-C_S32.trimmed.2	BACH2	BACH2	C2	C
 54 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-D_S33.trimmed.1_BACH2-C2-D_S33.trimmed.2	BACH2	BACH2	C2	D
 55 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-HEK-A_S78.trimmed.1_BACH2-C2-HEK-A_S78.trimmed.2	BACH2	BACH2	C2-HEK	A
 56 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-HEK-B_S79.trimmed.1_BACH2-C2-HEK-B_S79.trimmed.2	BACH2	BACH2	C2-HEK	B
 57 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-HEK-C_S80.trimmed.1_BACH2-C2-HEK-C_S80.trimmed.2	BACH2	BACH2	C2-HEK	C
 58 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-HEK-D_S81.trimmed.1_BACH2-C2-HEK-D_S81.trimmed.2	BACH2	BACH2	C2-HEK	D
 59 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-HEK-NS_S82.trimmed.1_BACH2-C2-HEK-NS_S82.trimmed.2	BACH2	BACH2	C2-HEK	NS
 60 | CRISPResso_BACH2/CRISPResso_on_BACH2-C2-NS_S34.trimmed.1_BACH2-C2-NS_S34.trimmed.2	BACH2	BACH2	C2	NS
 61 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK62-A_S11.trimmed.1_BACH2-HEK62-A_S11.trimmed.2	BACH2	BACH2	HEK62	A
 62 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK62-B_S23.trimmed.1_BACH2-HEK62-B_S23.trimmed.2	BACH2	BACH2	HEK62	B
 63 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK62-C_S35.trimmed.1_BACH2-HEK62-C_S35.trimmed.2	BACH2	BACH2	HEK62	C
 64 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK62-NS_S59.trimmed.1_BACH2-HEK62-NS_S59.trimmed.2	BACH2	BACH2	HEK62	NS
 65 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK62D_S47.trimmed.1_BACH2-HEK62D_S47.trimmed.2	BACH2	BACH2	HEK62	D
 66 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK91-A_S71.trimmed.1_BACH2-HEK91-A_S71.trimmed.2	BACH2	BACH2	HEK91	A
 67 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK91-B_S83.trimmed.1_BACH2-HEK91-B_S83.trimmed.2	BACH2	BACH2	HEK91	B
 68 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK91-C_S95.trimmed.1_BACH2-HEK91-C_S95.trimmed.2	BACH2	BACH2	HEK91	C
 69 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK91-D_S12.trimmed.1_BACH2-HEK91-D_S12.trimmed.2	BACH2	BACH2	HEK91	D
 70 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK91-NS_S24.trimmed.1_BACH2-HEK91-NS_S24.trimmed.2	BACH2	BACH2	HEK91	NS
 71 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK92-A_S36.trimmed.1_BACH2-HEK92-A_S36.trimmed.2	BACH2	BACH2	HEK92	A
 72 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK92-B_S48.trimmed.1_BACH2-HEK92-B_S48.trimmed.2	BACH2	BACH2	HEK92	B
 73 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK92-C_S60.trimmed.1_BACH2-HEK92-C_S60.trimmed.2	BACH2	BACH2	HEK92	C
 74 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK92-D_S72.trimmed.1_BACH2-HEK92-D_S72.trimmed.2	BACH2	BACH2	HEK92	D
 75 | CRISPResso_BACH2/CRISPResso_on_BACH2-HEK92-NS_S84.trimmed.1_BACH2-HEK92-NS_S84.trimmed.2	BACH2	BACH2	HEK92	NS
 76 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-A_S37.trimmed.1_BACH2-R1-A_S37.trimmed.2	BACH2	BACH2	R1	A
 77 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-B_S38.trimmed.1_BACH2-R1-B_S38.trimmed.2	BACH2	BACH2	R1	B
 78 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-C_S39.trimmed.1_BACH2-R1-C_S39.trimmed.2	BACH2	BACH2	R1	C
 79 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-D_S40.trimmed.1_BACH2-R1-D_S40.trimmed.2	BACH2	BACH2	R1	D
 80 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-HEK-A_S85.trimmed.1_BACH2-R1-HEK-A_S85.trimmed.2	BACH2	BACH2	R1-HEK	A
 81 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-HEK-B_S86.trimmed.1_BACH2-R1-HEK-B_S86.trimmed.2	BACH2	BACH2	R1-HEK	B
 82 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-HEK-C_S87.trimmed.1_BACH2-R1-HEK-C_S87.trimmed.2	BACH2	BACH2	R1-HEK	C
 83 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-HEK-D_S88.trimmed.1_BACH2-R1-HEK-D_S88.trimmed.2	BACH2	BACH2	R1-HEK	D
 84 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-HEK-NS_S89.trimmed.1_BACH2-R1-HEK-NS_S89.trimmed.2	BACH2	BACH2	R1-HEK	NS
 85 | CRISPResso_BACH2/CRISPResso_on_BACH2-R1-NS_S41.trimmed.1_BACH2-R1-NS_S41.trimmed.2	BACH2	BACH2	R1	NS
 86 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-A_S42.trimmed.1_BACH2-R2-A_S42.trimmed.2	BACH2	BACH2	R2	A
 87 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-B_S43.trimmed.1_BACH2-R2-B_S43.trimmed.2	BACH2	BACH2	R2	B
 88 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-C_S44.trimmed.1_BACH2-R2-C_S44.trimmed.2	BACH2	BACH2	R2	C
 89 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-D_S45.trimmed.1_BACH2-R2-D_S45.trimmed.2	BACH2	BACH2	R2	D
 90 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-HEK-A_S90.trimmed.1_BACH2-R2-HEK-A_S90.trimmed.2	BACH2	BACH2	R2-HEK	A
 91 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-HEK-B_S91.trimmed.1_BACH2-R2-HEK-B_S91.trimmed.2	BACH2	BACH2	R2-HEK	B
 92 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-HEK-C_S92.trimmed.1_BACH2-R2-HEK-C_S92.trimmed.2	BACH2	BACH2	R2-HEK	C
 93 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-HEK-D_S93.trimmed.1_BACH2-R2-HEK-D_S93.trimmed.2	BACH2	BACH2	R2-HEK	D
 94 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-HEK-NS_S94.trimmed.1_BACH2-R2-HEK-NS_S94.trimmed.2	BACH2	BACH2	R2-HEK	NS
 95 | CRISPResso_BACH2/CRISPResso_on_BACH2-R2-NS_S46.trimmed.1_BACH2-R2-NS_S46.trimmed.2	BACH2	BACH2	R2	NS
 96 | CRISPResso_HEK/CRISPResso_on_HEK-A1-A_S97.trimmed.1	HEK	HEK	A1	A
 97 | CRISPResso_HEK/CRISPResso_on_HEK-A1-B_S98.trimmed.1	HEK	HEK	A1	B
 98 | CRISPResso_HEK/CRISPResso_on_HEK-A1-C_S99.trimmed.1	HEK	HEK	A1	C
 99 | CRISPResso_HEK/CRISPResso_on_HEK-A1-D_S100.trimmed.1	HEK	HEK	A1	D
100 | CRISPResso_HEK/CRISPResso_on_HEK-A1-HEK-A_S145.trimmed.1	HEK	HEK	A1-HEK	A
101 | CRISPResso_HEK/CRISPResso_on_HEK-A1-HEK-B_S146.trimmed.1	HEK	HEK	A1-HEK	B
102 | CRISPResso_HEK/CRISPResso_on_HEK-A1-HEK-C_S147.trimmed.1	HEK	HEK	A1-HEK	C
103 | CRISPResso_HEK/CRISPResso_on_HEK-A1-HEK-D_S148.trimmed.1	HEK	HEK	A1-HEK	D
104 | CRISPResso_HEK/CRISPResso_on_HEK-A1-HEK-NS_S149.trimmed.1	HEK	HEK	A1-HEK	NS
105 | CRISPResso_HEK/CRISPResso_on_HEK-A1-NS_S101.trimmed.1	HEK	HEK	A1	NS
106 | CRISPResso_HEK/CRISPResso_on_HEK-A2-A_S102.trimmed.1	HEK	HEK	A2	A
107 | CRISPResso_HEK/CRISPResso_on_HEK-A2-B_S103.trimmed.1	HEK	HEK	A2	B
108 | CRISPResso_HEK/CRISPResso_on_HEK-A2-C_S104.trimmed.1	HEK	HEK	A2	C
109 | CRISPResso_HEK/CRISPResso_on_HEK-A2-D_S105.trimmed.1	HEK	HEK	A2	D
110 | CRISPResso_HEK/CRISPResso_on_HEK-A2-HEK-A_S150.trimmed.1	HEK	HEK	A2-HEK	A
111 | CRISPResso_HEK/CRISPResso_on_HEK-A2-HEK-B_S151.trimmed.1	HEK	HEK	A2-HEK	B
112 | CRISPResso_HEK/CRISPResso_on_HEK-A2-HEK-C_S152.trimmed.1	HEK	HEK	A2-HEK	C
113 | CRISPResso_HEK/CRISPResso_on_HEK-A2-HEK-D_S153.trimmed.1	HEK	HEK	A2-HEK	D
114 | CRISPResso_HEK/CRISPResso_on_HEK-A2-HEK-NS_S154.trimmed.1	HEK	HEK	A2-HEK	NS
115 | CRISPResso_HEK/CRISPResso_on_HEK-A2-NS_S106.trimmed.1	HEK	HEK	A2	NS
116 | CRISPResso_HEK/CRISPResso_on_HEK-B1-A_S109.trimmed.1	HEK	HEK	B1	A
117 | CRISPResso_HEK/CRISPResso_on_HEK-B1-B_S110.trimmed.1	HEK	HEK	B1	B
118 | CRISPResso_HEK/CRISPResso_on_HEK-B1-C_S111.trimmed.1	HEK	HEK	B1	C
119 | CRISPResso_HEK/CRISPResso_on_HEK-B1-D_S112.trimmed.1	HEK	HEK	B1	D
120 | CRISPResso_HEK/CRISPResso_on_HEK-B1-HEK-A_S157.trimmed.1	HEK	HEK	B1-HEK	A
121 | CRISPResso_HEK/CRISPResso_on_HEK-B1-HEK-B_S158.trimmed.1	HEK	HEK	B1-HEK	B
122 | CRISPResso_HEK/CRISPResso_on_HEK-B1-HEK-C_S159.trimmed.1	HEK	HEK	B1-HEK	C
123 | CRISPResso_HEK/CRISPResso_on_HEK-B1-HEK-D_S160.trimmed.1	HEK	HEK	B1-HEK	D
124 | CRISPResso_HEK/CRISPResso_on_HEK-B1-HEK-NS_S161.trimmed.1	HEK	HEK	B1-HEK	NS
125 | CRISPResso_HEK/CRISPResso_on_HEK-B1-NS_S113.trimmed.1	HEK	HEK	B1	NS
126 | CRISPResso_HEK/CRISPResso_on_HEK-B2-A_S114.trimmed.1	HEK	HEK	B2	A
127 | CRISPResso_HEK/CRISPResso_on_HEK-B2-B_S115.trimmed.1	HEK	HEK	B2	B
128 | CRISPResso_HEK/CRISPResso_on_HEK-B2-C_S116.trimmed.1	HEK	HEK	B2	C
129 | CRISPResso_HEK/CRISPResso_on_HEK-B2-D_S117.trimmed.1	HEK	HEK	B2	D
130 | CRISPResso_HEK/CRISPResso_on_HEK-B2-HEK-A_S162.trimmed.1	HEK	HEK	B2-HEK	A
131 | CRISPResso_HEK/CRISPResso_on_HEK-B2-HEK-B_S163.trimmed.1	HEK	HEK	B2-HEK	B
132 | CRISPResso_HEK/CRISPResso_on_HEK-B2-HEK-C_S164.trimmed.1	HEK	HEK	B2-HEK	C
133 | CRISPResso_HEK/CRISPResso_on_HEK-B2-HEK-D_S165.trimmed.1	HEK	HEK	B2-HEK	D
134 | CRISPResso_HEK/CRISPResso_on_HEK-B2-HEK-NS_S166.trimmed.1	HEK	HEK	B2-HEK	NS
135 | CRISPResso_HEK/CRISPResso_on_HEK-B2-NS_S118.trimmed.1	HEK	HEK	B2	NS
136 | CRISPResso_HEK/CRISPResso_on_HEK-C1-A_S121.trimmed.1	HEK	HEK	C1	A
137 | CRISPResso_HEK/CRISPResso_on_HEK-C1-B_S122.trimmed.1	HEK	HEK	C1	B
138 | CRISPResso_HEK/CRISPResso_on_HEK-C1-C_S123.trimmed.1	HEK	HEK	C1	C
139 | CRISPResso_HEK/CRISPResso_on_HEK-C1-D_S124.trimmed.1	HEK	HEK	C1	D
140 | CRISPResso_HEK/CRISPResso_on_HEK-C1-HEK-A_S169.trimmed.1	HEK	HEK	C1-HEK	A
141 | CRISPResso_HEK/CRISPResso_on_HEK-C1-HEK-B_S170.trimmed.1	HEK	HEK	C1-HEK	B
142 | CRISPResso_HEK/CRISPResso_on_HEK-C1-HEK-C_S171.trimmed.1	HEK	HEK	C1-HEK	C
143 | CRISPResso_HEK/CRISPResso_on_HEK-C1-HEK-D_S172.trimmed.1	HEK	HEK	C1-HEK	D
144 | CRISPResso_HEK/CRISPResso_on_HEK-C1-HEK-NS_S173.trimmed.1	HEK	HEK	C1-HEK	NS
145 | CRISPResso_HEK/CRISPResso_on_HEK-C1-NS_S125.trimmed.1	HEK	HEK	C1	NS
146 | CRISPResso_HEK/CRISPResso_on_HEK-C2-A_S126.trimmed.1	HEK	HEK	C2	A
147 | CRISPResso_HEK/CRISPResso_on_HEK-C2-B_S127.trimmed.1	HEK	HEK	C2	B
148 | CRISPResso_HEK/CRISPResso_on_HEK-C2-C_S128.trimmed.1	HEK	HEK	C2	C
149 | CRISPResso_HEK/CRISPResso_on_HEK-C2-D_S129.trimmed.1	HEK	HEK	C2	D
150 | CRISPResso_HEK/CRISPResso_on_HEK-C2-HEK-A_S174.trimmed.1	HEK	HEK	C2-HEK	A
151 | CRISPResso_HEK/CRISPResso_on_HEK-C2-HEK-B_S175.trimmed.1	HEK	HEK	C2-HEK	B
152 | CRISPResso_HEK/CRISPResso_on_HEK-C2-HEK-C_S176.trimmed.1	HEK	HEK	C2-HEK	C
153 | CRISPResso_HEK/CRISPResso_on_HEK-C2-HEK-D_S177.trimmed.1	HEK	HEK	C2-HEK	D
154 | CRISPResso_HEK/CRISPResso_on_HEK-C2-HEK-NS_S178.trimmed.1	HEK	HEK	C2-HEK	NS
155 | CRISPResso_HEK/CRISPResso_on_HEK-C2-NS_S130.trimmed.1	HEK	HEK	C2	NS
156 | CRISPResso_HEK/CRISPResso_on_HEK-HEK62-A_S107.trimmed.1	HEK	HEK	HEK62	A
157 | CRISPResso_HEK/CRISPResso_on_HEK-HEK62-B_S119.trimmed.1	HEK	HEK	HEK62	B
158 | CRISPResso_HEK/CRISPResso_on_HEK-HEK62-C_S131.trimmed.1	HEK	HEK	HEK62	C
159 | CRISPResso_HEK/CRISPResso_on_HEK-HEK62-NS_S155.trimmed.1	HEK	HEK	HEK62	NS
160 | CRISPResso_HEK/CRISPResso_on_HEK-HEK62D_S143.trimmed.1	HEK	HEK	HEK62	D
161 | CRISPResso_HEK/CRISPResso_on_HEK-HEK91-A_S167.trimmed.1	HEK	HEK	HEK91	A
162 | CRISPResso_HEK/CRISPResso_on_HEK-HEK91-B_S179.trimmed.1	HEK	HEK	HEK91	B
163 | CRISPResso_HEK/CRISPResso_on_HEK-HEK91-C_S191.trimmed.1	HEK	HEK	HEK91	C
164 | CRISPResso_HEK/CRISPResso_on_HEK-HEK91-D_S108.trimmed.1	HEK	HEK	HEK91	D
165 | CRISPResso_HEK/CRISPResso_on_HEK-HEK91-NS_S120.trimmed.1	HEK	HEK	HEK91	NS
166 | CRISPResso_HEK/CRISPResso_on_HEK-HEK92-A_S132.trimmed.1	HEK	HEK	HEK92	A
167 | CRISPResso_HEK/CRISPResso_on_HEK-HEK92-B_S144.trimmed.1	HEK	HEK	HEK92	B
168 | CRISPResso_HEK/CRISPResso_on_HEK-HEK92-C_S156.trimmed.1	HEK	HEK	HEK92	C
169 | CRISPResso_HEK/CRISPResso_on_HEK-HEK92-D_S168.trimmed.1	HEK	HEK	HEK92	D
170 | CRISPResso_HEK/CRISPResso_on_HEK-HEK92-NS_S180.trimmed.1	HEK	HEK	HEK92	NS
171 | CRISPResso_HEK/CRISPResso_on_HEK-R1-A_S133.trimmed.1	HEK	HEK	R1	A
172 | CRISPResso_HEK/CRISPResso_on_HEK-R1-B_S134.trimmed.1	HEK	HEK	R1	B
173 | CRISPResso_HEK/CRISPResso_on_HEK-R1-C_S135.trimmed.1	HEK	HEK	R1	C
174 | CRISPResso_HEK/CRISPResso_on_HEK-R1-D_S136.trimmed.1	HEK	HEK	R1	D
175 | CRISPResso_HEK/CRISPResso_on_HEK-R1-HEK-A_S181.trimmed.1	HEK	HEK	R1-HEK	A
176 | CRISPResso_HEK/CRISPResso_on_HEK-R1-HEK-B_S182.trimmed.1	HEK	HEK	R1-HEK	B
177 | CRISPResso_HEK/CRISPResso_on_HEK-R1-HEK-C_S183.trimmed.1	HEK	HEK	R1-HEK	C
178 | CRISPResso_HEK/CRISPResso_on_HEK-R1-HEK-D_S184.trimmed.1	HEK	HEK	R1-HEK	D
179 | CRISPResso_HEK/CRISPResso_on_HEK-R1-HEK-NS_S185.trimmed.1	HEK	HEK	R1-HEK	NS
180 | CRISPResso_HEK/CRISPResso_on_HEK-R1-NS_S137.trimmed.1	HEK	HEK	R1	NS
181 | CRISPResso_HEK/CRISPResso_on_HEK-R2-A_S138.trimmed.1	HEK	HEK	R2	A
182 | CRISPResso_HEK/CRISPResso_on_HEK-R2-B_S139.trimmed.1	HEK	HEK	R2	B
183 | CRISPResso_HEK/CRISPResso_on_HEK-R2-C_S140.trimmed.1	HEK	HEK	R2	C
184 | CRISPResso_HEK/CRISPResso_on_HEK-R2-D_S141.trimmed.1	HEK	HEK	R2	D
185 | CRISPResso_HEK/CRISPResso_on_HEK-R2-HEK-A_S186.trimmed.1	HEK	HEK	R2-HEK	A
186 | CRISPResso_HEK/CRISPResso_on_HEK-R2-HEK-B_S187.trimmed.1	HEK	HEK	R2-HEK	B
187 | CRISPResso_HEK/CRISPResso_on_HEK-R2-HEK-C_S188.trimmed.1	HEK	HEK	R2-HEK	C
188 | CRISPResso_HEK/CRISPResso_on_HEK-R2-HEK-D_S189.trimmed.1	HEK	HEK	R2-HEK	D
189 | CRISPResso_HEK/CRISPResso_on_HEK-R2-HEK-NS_S190.trimmed.1	HEK	HEK	R2-HEK	NS
190 | CRISPResso_HEK/CRISPResso_on_HEK-R2-NS_S142.trimmed.1	HEK	HEK	R2	NS
191 | 


--------------------------------------------------------------------------------
/vignettes/CD69_tutorial.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Re-analysis of CD69 CRISPRa tiling screen"
  3 | author: "Carl de Boer"
  4 | date: "5/21/2020"
  5 | output: rmarkdown::html_vignette
  6 | vignette: >
  7 |   %\VignetteIndexEntry{Re-analysis of CD69 CRISPRa tiling screen}
  8 |   %\VignetteEngine{knitr::rmarkdown}
  9 |   \usepackage[utf8]{inputenc}
 10 | ---
 11 | 
 12 | ```{r setup, include=FALSE}
 13 | knitr::opts_chunk$set(echo = TRUE)
 14 | ```
 15 | 
 16 | # Tutorial: Re-analysis of CD69 CRISPRa tiling screen
 17 | 
 18 | See Simeonov et al. Discovery of stimulation-responsive immune enhancers with CRISPR activation. Nature. 2017 Sep 7;549(7670):111-115. doi: 10.1038/nature23875. Epub 2017 Aug 30. [PMC5675716](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5675716/) for the original publication associated with this dataset.
 19 | 
 20 | This tutorial is designed to get you quickly up and running with MAUDE. The only requirements are that you have R and the following R libraries installed:
 21 | `ggplot2, openxlsx, MAUDE, reshape, GenomicRanges, ggbio, Homo.sapiens`
 22 | 
 23 | ## Load required libraries into R
 24 | ```{r load libraries, results="hide", message = FALSE, warning=FALSE}
 25 | #load required libraries
 26 | library(openxlsx)
 27 | library(reshape)
 28 | library(ggplot2)
 29 | library(MAUDE)
 30 | library(GenomicRanges)
 31 | library(ggbio)
 32 | library(Homo.sapiens)
 33 | ```
 34 | 
 35 | ## Load CD69 screen count data from Simeonov Supplementary Table 1
 36 | ```{r input data}
 37 | #read in the CD69 screen data
 38 | CD69Data = read.xlsx('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5675716/bin/NIHMS913084-supplement-supplementary_table_1.xlsx')
 39 | 
 40 | #identify non-targeting guides
 41 | CD69Data$isNontargeting = grepl("negative_control", CD69Data$gRNA_systematic_name)
 42 | 
 43 | CD69Data = unique(CD69Data) # for some reason there were duplicated rows in this table - remove duplicates
 44 | 
 45 | #reshape the count data so we can label the experimental replicates and bins, and remove all the non-count data
 46 | cd69CountData = melt(CD69Data, id.vars = c("PAM_3primeEnd_coord","isNontargeting","gRNA_systematic_name"))
 47 | cd69CountData = cd69CountData[grepl(".count$",cd69CountData$variable),] # keep only read count columns
 48 | cd69CountData$Bin = gsub("CD69(.*)([12]).count","\\1",cd69CountData$variable)
 49 | cd69CountData$expt = gsub("CD69(.*)([12]).count","\\2",cd69CountData$variable)
 50 | cd69CountData$reads= as.numeric(cd69CountData$value); cd69CountData$value=NULL;
 51 | cd69CountData$Bin = gsub("_","",cd69CountData$Bin) # remove extra underscores
 52 | 
 53 | #reshape into a matrix
 54 | binReadMat = data.frame(cast(cd69CountData[!is.na(cd69CountData$PAM_3primeEnd_coord) | cd69CountData$isNontargeting,], 
 55 |   PAM_3primeEnd_coord+gRNA_systematic_name+isNontargeting+expt ~ Bin, value="reads"))
 56 | #binReadMat now contains a matrix in the proper format for MAUDE analysis
 57 | 
 58 | ```
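
To see what the two `gsub` calls above extract, here is a small sketch on made-up column names. The names below are assumptions chosen to fit the pattern; the real column names come from the supplementary table and may differ.

```r
# Hypothetical count-column names (assumed for illustration only)
countCols <- c("CD69_baseline_1.count", "CD69_high_2.count", "CD69_back_1.count")

# Same pattern as in the chunk above: group 1 is the bin, group 2 the replicate
pattern <- "CD69(.*)([12]).count"

parsed <- data.frame(
  variable = countCols,
  Bin  = gsub("_", "", gsub(pattern, "\\1", countCols)),  # strip extra underscores
  expt = gsub(pattern, "\\2", countCols),
  stringsAsFactors = FALSE
)
print(parsed)
```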
 59 | 
 60 | ## Read in the ENCODE DHS peaks at the locus
 61 | We can use this as a reference point with which to view our results, and to annotate guides with regulatory elements. These are all the DHS peaks at CD69 in Jurkat cells, merged across two replicates.
 62 | 
 63 | ```{r input set of DHS peaks}
 64 | dhsPeakBED = read.table(system.file("extdata", "Encode_Jurkat_DHS_both.merged.bed", package = "MAUDE", mustWork = TRUE), 
 65 |   stringsAsFactors=FALSE, row.names=NULL, sep="\t", header=FALSE)
 66 | names(dhsPeakBED) = c("chrom","start","end");
 67 | #add a column to include peak names
 68 | dhsPeakBED$name = paste(dhsPeakBED$chrom, paste(dhsPeakBED$start, dhsPeakBED$end, sep="-"), sep=":")
 69 | ```
 70 | 
 71 | ## Read in and inspect bin fractions
 72 | 
 73 | ```{r read in bin fractions}
 74 | #read in the bin fractions derived from Simeonov et al Extended Data Fig 1a and the "digitize" R package
 75 | #Ideally, you derive this from the FACS sort data. 
 76 | binStats = read.table(system.file("extdata", "CD69_bin_percentiles.txt", package = "MAUDE", mustWork = TRUE), 
 77 |   stringsAsFactors=FALSE, row.names=NULL, sep="\t", header=TRUE)
 78 | binStats$fraction = binStats$binEndQ - binStats$binStartQ; #the fraction of cells captured is the difference in bin start and end percentiles
 79 | 
 80 | #plot the bins as the percentiles of the distribution captured by each bin
 81 | p = ggplot(binStats, aes(colour=Bin)) + ggplot2::geom_segment(aes(x=binStartQ, xend=binEndQ, y=fraction, yend=fraction)) + 
 82 |   xlab("Bin bounds as percentiles") + ylab("Fraction of the distribution captured") +theme_classic() + 
 83 |   scale_y_continuous(expand=c(0,0))+coord_cartesian(ylim=c(0,0.7)); print(p)
 84 | ```
 85 | 
 86 | This shows the sizes of the bins in terms of the fractions of the overall distribution covered by each bin. To work in expression space, we convert these percentiles to Z scores with the inverse normal CDF (the quantile function `qnorm`):
 87 | ```{r convert bin percentiles to Z scores}
 88 | #convert bin fractions to Z scores
 89 | binStats$binStartZ = qnorm(binStats$binStartQ)
 90 | binStats$binEndZ = qnorm(binStats$binEndQ)
 91 | ```
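
As a quick illustration of this conversion, the sketch below applies `qnorm` to made-up bin bounds (the percentiles are assumptions, not the CD69 bins) and checks with `pnorm` that the Z-score bounds still capture the same fraction of a standard normal distribution.

```r
# Toy bin bounds as percentiles of the expression distribution (assumed values)
toyBins <- data.frame(
  Bin       = c("low", "medium", "high"),
  binStartQ = c(0.00, 0.45, 0.90),
  binEndQ   = c(0.10, 0.55, 1.00)
)
toyBins$fraction <- toyBins$binEndQ - toyBins$binStartQ

# Percentiles -> Z scores; percentiles of 0 and 1 map to -Inf and Inf
toyBins$binStartZ <- qnorm(toyBins$binStartQ)
toyBins$binEndZ   <- qnorm(toyBins$binEndQ)
print(toyBins)

# pnorm undoes qnorm, so the captured fraction is recovered from the Z bounds
roundTrip <- pnorm(toyBins$binEndZ) - pnorm(toyBins$binStartZ)
```

Note that a bin starting at the 0th percentile or ending at the 100th has an infinite bound in Z-score space, which is expected for the outermost bins.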
 92 | 
 93 | Now let's plot the bins as the Z score bounds of a normal distribution, with an actual normal distribution overlaid in gray.
 94 | ```{r plot bins}
 95 | p = ggplot(binStats, aes(colour=Bin))  + 
 96 |   geom_density(data=data.frame(x=rnorm(100000)), aes(x=x), fill="gray", colour=NA)+ 
 97 |   ggplot2::geom_segment(aes(x=binStartZ, xend=binEndZ, y=fraction, yend=fraction)) + 
 98 |   xlab("Bin bounds as expression Z-scores") + 
 99 |   ylab("Fraction of the distribution captured") +theme_classic()+scale_y_continuous(expand=c(0,0))+
100 |   coord_cartesian(ylim=c(0,0.7)); print(p)
101 | ```
102 | 
103 | Here, we are forced to use the same distribution for both experimental replicates since we didn't have the underlying data. You should derive this data from the FACS sorting data separately for each replicate (although it doesn't affect things that much if the bins were drawn similarly).
104 | ```{r duplicate bins for second replicate}
105 | binStats = rbind(binStats, binStats) #duplicate data
106 | binStats$expt = c(rep("1",4),rep("2",4)); #name the first duplicate expt "1" and the next expt "2";
107 | ```
108 | 
109 | ## 1) MAUDE: Calculate guide level statistics
110 | Now we've finally gotten to the part where we're running MAUDE.  To compute guide-level stats, we run `findGuideHitsAllScreens`.
111 | 
112 | This step takes about a minute to run.
113 | ```{r find guide effects}
114 | guideLevelStats = findGuideHitsAllScreens(experiments = unique(binReadMat["expt"]), 
115 |   countDataFrame = binReadMat, binStats = binStats, 
116 |   sortBins = c("baseline","high","low","medium"), 
117 |   unsortedBin = "back", negativeControl = "isNontargeting")
118 | ```
119 | 
120 | Here, we calculated the mean expression of each guide:
121 | ```{r plot guide effects}
122 | # Plot the guide-level mus
123 | p = ggplot(guideLevelStats, aes(x=mean, colour=isNontargeting, linetype=expt)) + geom_density()+
124 |   theme_classic()+scale_y_continuous(expand=c(0,0)) + geom_vline(xintercept = 0)+
125 |   xlab("Learned mean guide expression"); print(p);
126 | ```
127 | 
128 | But notice how the distributions for the non-targeting guides are off-centre.  This is why we also calculate a re-centred Z-score using the non-targeting guides.
129 | 
130 | ```{r plot guide level Zs}
131 | # Plot the guide-level Zs
132 | p = ggplot(guideLevelStats, aes(x=Z, colour=isNontargeting, linetype=expt)) + geom_density()+
133 |   theme_classic()+scale_y_continuous(expand=c(0,0)) + geom_vline(xintercept = 0)+
134 |   xlab("Learned guide expression Z score"); 
135 | print(p)
136 | ```
137 | 
138 | Now the non-targeting guides are centred at 0, as expected since they have no effect on expression.  
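
The idea behind the re-centring can be sketched on simulated data. This is a simplified illustration that assumes a plain shift-and-scale by the non-targeting guides; MAUDE's internal calculation may differ in its details, and all values below are simulated.

```r
set.seed(1)

# Simulated guide-level means: non-targeting guides are off-centre (mean 0.3)
toyGuides <- data.frame(
  mean = c(rnorm(200, mean = 0.3, sd = 0.2), rnorm(50, mean = 1.0, sd = 0.2)),
  isNontargeting = c(rep(TRUE, 200), rep(FALSE, 50))
)

# Assumed re-centring: subtract the non-targeting mean, divide by their SD
ntMean <- mean(toyGuides$mean[toyGuides$isNontargeting])
ntSD   <- sd(toyGuides$mean[toyGuides$isNontargeting])
toyGuides$Z <- (toyGuides$mean - ntMean) / ntSD

# After re-centring, non-targeting guides have mean 0 and SD 1 by construction
print(tapply(toyGuides$Z, toyGuides$isNontargeting, mean))
```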
139 | 
140 | Now that we have guide-level statistics, we can inspect them. We can see there is a high correlation between replicates:
141 | ```{r plot replicate scatter}
142 | guideEffectsByRep = cast(guideLevelStats, 
143 |   gRNA_systematic_name + isNontargeting + PAM_3primeEnd_coord ~ expt, value="Z")
144 | 
145 | p = ggplot(guideEffectsByRep[!guideEffectsByRep$isNontargeting,], aes(x=`1`, y=`2`)) + 
146 |   geom_point(size=0.3) + xlab("Replicate 1 Z score") + ylab("Replicate 2 Z score") + 
147 |   ggtitle(sprintf("r = %f",cor(guideEffectsByRep$`1`[!guideEffectsByRep$isNontargeting],
148 |     guideEffectsByRep$`2`[!guideEffectsByRep$isNontargeting])))+theme_classic(); 
149 | print(p)
150 | ```
151 | 
152 | The guide-level Z scores are highly correlated between replicates.  Now let's map the results across the CD69 locus.
153 | 
154 | ```{r plot locus}
155 | dhsPos = min(guideLevelStats$Z)*1.05;
156 | p=ggplot(guideLevelStats, aes(x=PAM_3primeEnd_coord, y=Z)) +geom_point(size=0.5)+facet_grid(expt ~.)+ 
157 |   ggplot2::geom_segment(data = dhsPeakBED, aes(x=start, xend=end,y=dhsPos, yend=dhsPos), colour="red") + 
158 |   theme_classic() + xlab("Genomic position") + ylab("Guide Z score"); 
159 | print(p)
160 | ```
161 | 
162 | Here, the DHS peaks are shown in red. Clearly some regions contain many active guides, the majority of which have high Z-scores, indicating CD69 activation.
163 | 
164 | ## 2) MAUDE: Identify active elements
165 | There are two ways to get element-level statistics, depending on how the screen was done.  If you have regulatory element annotations, you can use `getElementwiseStats` to identify active elements.  If you did a tiling screen, you could use the same, but you can also identify active regions in an unbiased way with `getTilingElementwiseStats`.
166 | ### 2a) Get element level stats with sliding window
167 | Now, we can combine adjacent guides in an unbiased way, using a sliding window across the locus to identify regions with more active guides than expected by chance. Here we use a sliding window of 200 bp and the default minimum guide number (5). Any 200 bp window with fewer than 5 guides will not be tested.
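
To make the windowing idea concrete, here is a toy sketch of a sliding window over guide positions. It assumes a Stouffer-style combination of guide Z scores (sum of Z divided by the square root of the number of guides). It is not the implementation of `getTilingElementwiseStats`, and it does not attempt the scaling against non-targeting guides or the FDR calculation. All names and values below are made up.

```r
# Toy guides: one every 25 bp, with a small alternating baseline effect
toy <- data.frame(pos = seq(10, 2000, by = 25))
toy$Z <- ifelse(seq_len(nrow(toy)) %% 2 == 0, 0.5, -0.5)

# Simulate an active region between 800 and 1100 bp
active <- toy$pos >= 800 & toy$pos <= 1100
toy$Z[active] <- toy$Z[active] + 3

windowSize <- 200  # window width in bp
minGuides  <- 5    # windows with fewer guides are skipped

# Start one window at each guide position
windows <- do.call(rbind, lapply(toy$pos, function(s) {
  inWin <- toy$pos >= s & toy$pos < s + windowSize
  n <- sum(inWin)
  if (n < minGuides) return(NULL)  # too few guides: window is not tested
  data.frame(start = s, end = max(toy$pos[inWin]), numGuides = n,
             meanZ = mean(toy$Z[inWin]),
             stoufferZ = sum(toy$Z[inWin]) / sqrt(n))
}))

bestStart <- windows$start[which.max(windows$stoufferZ)]
print(windows[windows$start == bestStart, ])
```

In this toy example the highest-scoring window falls entirely inside the simulated active region, and windows near the end of the locus that contain fewer than five guides are dropped.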
168 | 
169 | ```{r infer sliding window effects}
170 | guideLevelStats$chrom = "chr12"; # we need to tell it what chromosome our guides are on - they're all on chr12
171 | slidingWindowElements = getTilingElementwiseStats(experiments = unique(binReadMat["expt"]), 
172 |   normNBSummaries = guideLevelStats, tails="both", window = 200, location = "PAM_3primeEnd_coord",
173 |   chr="chrom",negativeControl = "isNontargeting")
174 | #override the default chromosome field 'chr' with the GRanges compatible 'chrom'
175 | names(slidingWindowElements)[names(slidingWindowElements)=="chr"]="chrom" 
176 | ```
177 | 
178 | Now that we have element-level stats, let's inspect them! First, let's look at the whole locus.
179 | ```{r tiles locus effects}
180 | dhsPos = min(slidingWindowElements$meanZ)*1.05;
181 | p=ggplot(slidingWindowElements, aes(x=start, xend=end, y=meanZ,yend=meanZ, colour=FDR<0.01)) +
182 |   ggplot2::geom_segment(size=1)+facet_grid(expt ~.) + theme_classic() + xlab("Genomic position") + 
183 |   ylab("Element Z score") + geom_hline(yintercept = 0) + 
184 |   ggplot2::geom_segment(data = dhsPeakBED, aes(x=start, xend=end,y=dhsPos, yend=dhsPos), colour="black");
185 | print(p)
186 | ```
187 | 
188 | In the above, we see that some regions significantly up-regulate or down-regulate CD69 when targeted, and these are often shared between the two replicates. The regions that up-regulate CD69 preferentially lie in open chromatin (DHS, shown in black above).
189 | 
190 | How correlated are the effect size estimates for the two replicates at the element level?
191 | ```{r tiled replicate scatter}
192 | slidingWindowElementsByRep = cast(slidingWindowElements, chrom + start + end +numGuides ~ expt, 
193 |   value="meanZ")
194 | p = ggplot(slidingWindowElementsByRep, aes(x=`1`, y=`2`)) + geom_point(size=0.5) + 
195 |   xlab("Replicate 1 element effect Z score") + ylab("Replicate 2 element effect Z score") + 
196 |   ggtitle(sprintf("r = %f",cor(slidingWindowElementsByRep$`1`,slidingWindowElementsByRep$`2`)))+
197 |   theme_classic(); 
198 | print(p)
199 | ```
200 | 
201 | Quite highly correlated!
202 | 
203 | ### 2b) Get element level stats with annotated elements
204 | Since we have annotated elements anyway in `dhsPeakBED`, it might help us to identify which guides should be combined.  This would be the only way to do it if we had only targeted these regions to begin with.
205 | 
206 | ```{r element-level stats}
207 | #the next command annotates our guides with any DHS peak they lie in.
208 | annotatedGuides = findOverlappingElements(guides = unique(guideLevelStats[!guideLevelStats$isNontargeting,
209 |   c("PAM_3primeEnd_coord","gRNA_systematic_name","chrom")]), elements = dhsPeakBED, 
210 |   elements.start = "start", elements.end = "end", elements.chr = "chrom", 
211 |   guides.pos = "PAM_3primeEnd_coord", guides.chr = "chrom")
212 | 
213 | #merge regulatory element annotations back onto guideLevelStats
214 | guideLevelStats = merge(guideLevelStats, annotatedGuides[c("gRNA_systematic_name", "name")],
215 |                         by="gRNA_systematic_name", all.x=TRUE)
216 | 
217 | #this is where we are actually running MAUDE to find element-level stats
218 | dhsPeakStats = getElementwiseStats(experiments = unique(binReadMat["expt"]), 
219 |   normNBSummaries = guideLevelStats, negativeControl = "isNontargeting", 
220 |   elementIDs = "name") # "name" is the peak IDs from the DHS BED file
221 | 
222 | #merge peak info back into dhsPeakStats
223 | dhsPeakStats = merge(dhsPeakStats, dhsPeakBED, by="name");
224 | ```
225 | 
226 | Now we can again look at the activity of the elements across the locus:
227 | ```{r element locus effect view}
228 | p=ggplot(dhsPeakStats, aes(x=start, xend=end, y=meanZ,yend=meanZ, colour=FDR<0.01)) +
229 |   ggplot2::geom_segment(size=1)+facet_grid(expt ~.) + theme_classic() + xlab("Genomic position") + 
230 |   ylab("Element Z score") + geom_hline(yintercept = 0); 
231 | print(p)
232 | ```
233 | 
234 | And the correlation between replicates.
235 | ```{r element replicate scatter}
236 | dhsPeakStatsByRep = cast(dhsPeakStats, name ~ expt, value="meanZ")
237 | 
238 | p = ggplot(dhsPeakStatsByRep, aes(x=`1`, y=`2`)) + geom_point() + 
239 |   xlab("Replicate 1 DHS effect Z score") + ylab("Replicate 2 DHS effect Z score") + 
240 |   ggtitle(sprintf("r = %f",cor(dhsPeakStatsByRep$`1`,dhsPeakStatsByRep$`2`)))+theme_classic(); 
241 | print(p)
242 | ```
243 | 
244 | Again, these are highly correlated.
245 | In this case, it didn't look like the regulatory element annotations helped our analysis, but that is not always true. 
246 | 
247 | Let's investigate why this might be by looking at the guide effect sizes for the guides within each DHS element.
248 | ```{r guide effects per element}
249 | p=ggplot(guideLevelStats, aes(x=Z, group=name, colour=name == "chr12:9912678-9915275")) + stat_ecdf(alpha=0.3)+ 
250 |   stat_ecdf(data=guideLevelStats[!is.na(guideLevelStats$name) &
251 |                                    guideLevelStats$name=="chr12:9912678-9915275",], size=1)+
252 |   facet_grid(expt ~.) + theme_classic() + xlab("Guide Z score")+scale_y_continuous(expand=c(0,0)) + 
253 |   scale_x_continuous(expand=c(0,0)) + scale_colour_manual(values=c("black","red")) + 
254 |   labs(colour = "CD69 promoter?")+ylab("Cumulative fraction"); 
255 | print(p)
256 | ```
257 | 
258 | Here, the promoter DHS peak is in red and all the other DHS peaks are in black. You can appreciate that the promoter tends to have among the most influential guides (rightward shift in the CDF curves). Even for the promoter, there is substantial variability in the estimated guide effect sizes, ranging from no effect (near 0) to sizable effects (>3).  This is probably a combination of factors, including experimental noise, how effectively each guide targets the region, and the effect of actually targeting each region.
259 | 
260 | Let's zoom in on the promoter to see what exactly is happening here.
261 | ```{r promoter view}
262 | p=ggplot(guideLevelStats[!is.na(guideLevelStats$name) & guideLevelStats$name=="chr12:9912678-9915275",],
263 |          aes(x=PAM_3primeEnd_coord, y=Z, colour=expt)) +
264 |   geom_point(size=1)+ geom_line()+theme_classic() + xlab("Genomic position") + ylab("Guide Z score")+
265 |   geom_vline(xintercept = 9913497, colour="black"); 
266 | print(p)
267 | ```
268 | 
269 | Here, I've marked the transcription start site with a vertical black line. CD69 lies on the antisense (minus) strand, so it is transcribed to the left of the black line. The guides targeting the 5' UTR of CD69 (left of black line) appear to be less effective than those in the promoter (right of black line), which is expected. The high correlation in the guide effect sizes between the two experiments indicates that many of the low-effect size guides cannot be attributed to experimental noise. So most of the signal diversity within the promoter DHS is probably attributable to how effectively each guide targets the DNA, and how effective the activation domain of CRISPRa is where it is bound.
270 | 
271 | Now let's look more closely at the guides and select a few that work especially well.
272 | ```{r promoter guide zoom}
273 | p=ggplot(guideEffectsByRep[guideEffectsByRep$PAM_3primeEnd_coord < 9915275 & 
274 |                              guideEffectsByRep$PAM_3primeEnd_coord > 9912678,], 
275 |          aes(x=`2`, y=`1`)) + geom_point() + 
276 |   geom_text(data=guideEffectsByRep[guideEffectsByRep$PAM_3primeEnd_coord < 9915275 & 
277 |                                      guideEffectsByRep$PAM_3primeEnd_coord > 9912678 & 
278 |                                      guideEffectsByRep$`1`>2 & guideEffectsByRep$`2`>2,],
279 |             aes(label=gRNA_systematic_name)) + 
280 |   theme_classic()+xlab("Replicate 2 guide Z score") + ylab("Replicate 1 guide Z score"); 
281 | print(p) 
282 | 
283 | ```
284 | 
285 | We can see that, although there are four guides that have a large effect in one replicate, they don't have as great an effect in the other. Here, I've labeled the IDs of three guides that have a Z score over two in both replicates. If I were going to follow up targeting the CD69 promoter with more guides, these are the ones I would use.
286 | 
287 | `GenomicRanges` is a great package to work with data associated with genomic coordinates.  First, let's restructure our tiling element data.frame so that there is only one row per region and there is one of each data field for each replicate. 
288 | 
289 | ```{r reshaping window effects}
290 | slidingWindowElementsByReplicate = cast(melt(slidingWindowElements, 
291 |   id.vars=c("expt","numGuides","chrom","start","end")), 
292 |   numGuides+chrom+start+end ~ variable+expt, value="value")
293 | head(slidingWindowElementsByReplicate)
294 | ```
295 | 
296 | We can convert our data structures to `GRanges` objects as easily as this:
297 | 
298 | ```{r cast to GRanges}
299 | #casting to data.frame is only needed if using cast
300 | slidingWindowElementsByReplicateGR = GRanges(as.data.frame(slidingWindowElementsByReplicate)) 
301 | ```
302 | 
303 | Now let's label those ranges that were significant in both replicates, and merge overlapping significant ranges to get a set of expanded active regions.
304 | 
305 | ```{r find doubly significant tiles}
306 | #require that both replicates are significant at an FDR of 0.01 and that the signs agree
307 | slidingWindowElementsByReplicateGR$significantUp   = slidingWindowElementsByReplicateGR$FDR_1< 0.01 & 
308 |   slidingWindowElementsByReplicateGR$FDR_2 < 0.01 & slidingWindowElementsByReplicateGR$meanZ_1 > 0 &
309 |   slidingWindowElementsByReplicateGR$meanZ_2 > 0; 
310 | slidingWindowElementsByReplicateGR$significantDown = slidingWindowElementsByReplicateGR$FDR_1< 0.01 &
311 |   slidingWindowElementsByReplicateGR$FDR_2 < 0.01 & slidingWindowElementsByReplicateGR$meanZ_1 < 0 &
312 |   slidingWindowElementsByReplicateGR$meanZ_2 < 0;
313 | 
314 | #merge overlapping regions in each set
315 | overlappingSlidingWindowElementsUp =
316 |   reduce(slidingWindowElementsByReplicateGR[slidingWindowElementsByReplicateGR$significantUp])
317 | overlappingSlidingWindowElementsDown =
318 |   reduce(slidingWindowElementsByReplicateGR[slidingWindowElementsByReplicateGR$significantDown])
319 | ```
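To make the merging step concrete, here is a minimal base R sketch of what collapsing overlapping windows amounts to on a single chromosome. It is not the `GenomicRanges` implementation; the coordinates are invented, `mergeOverlapping` is a hypothetical helper, and the sketch assumes that windows which overlap or directly abut should be joined.

```r
# Hypothetical significant windows on a single chromosome (toy coordinates)
sigWindows = data.frame(start = c(100, 150, 400, 420, 900),
                        end   = c(300, 350, 600, 650, 1100))

# Hypothetical helper: merge windows that overlap or directly abut
mergeOverlapping = function(ranges) {
  if (nrow(ranges) == 0) return(ranges)
  ranges = ranges[order(ranges$start, ranges$end), ]
  # furthest end seen so far among the sorted windows
  runningEnd = cummax(ranges$end)
  # a new region begins when a window starts beyond everything seen before it
  startsNewRegion = c(TRUE, ranges$start[-1] > runningEnd[-nrow(ranges)] + 1)
  regionID = cumsum(startsNewRegion)
  data.frame(start = as.vector(tapply(ranges$start, regionID, min)),
             end   = as.vector(tapply(ranges$end, regionID, max)))
}

mergedRegions = mergeOverlapping(sigWindows)
print(mergedRegions)
```

The five toy windows collapse into three regions. In practice `reduce` is preferable, since it also keeps track of chromosomes and strands.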
320 | 
321 | Now let's plot it all together as a genome browser track!
322 | 
323 | ```{r genome browser view, fig.width=10, fig.height=5}
324 | 
325 | #which gene models do I want to plot?
326 | data(genesymbol, package = "biovizBase")
327 | wh <- genesymbol[c("CD69", "CLECL1", "KLRF1", "CLEC2D","CLEC2B")]
328 | wh <- range(wh, ignore.strand = TRUE)
329 | 
330 | #make the genome tracks
331 | tracks(autoplot(Homo.sapiens, which = wh, gap.geom="chevron"), 
332 |   autoplot(overlappingSlidingWindowElementsUp, fill="red"), 
333 |   autoplot(overlappingSlidingWindowElementsDown, fill="blue"), heights=c(5,2,2)) + theme_classic()
334 | ```
335 | 
336 | Here, the red regions are CD69-activating and the blue regions are CD69-repressing.
337 | 
338 | 


--------------------------------------------------------------------------------
/vignettes/simulated_data_tutorial.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "A tutorial with simulated data"
  3 | author: "Carl de Boer"
  4 | date: "5/21/2020"
  5 | output: rmarkdown::html_vignette
  6 | vignette: >
  7 |   %\VignetteIndexEntry{A tutorial with simulated data}
  8 |   %\VignetteEngine{knitr::rmarkdown}
  9 |   \usepackage[utf8]{inputenc}
 10 | ---
 11 | 
 12 | ```{r setup, include=FALSE}
 13 | knitr::opts_chunk$set(echo = TRUE)
 14 | ```
 15 | 
 16 | # A tutorial with simulated data
 17 | 
 18 | The following is an example that relies on simulated data and makes use of the `ggplot2` and `reshape` R packages.  Since there is no data to load for this example, it should be relatively easy to run and to inspect the variables to better understand the desired formats.  This should also help make clear the underlying assumptions of the model since the data are generated with many of these same assumptions.
 19 | 
 20 | ## Create the simulated experimental data
 21 | Load the required packages and set the seed.  You only need to set the seed if you want your analysis to get exactly the same results as what is shown here.  Otherwise, you can skip this step (and note that if you run things in a different order, the results may differ).
 22 | ```{r load libraries, results="hide", message = FALSE, warning=FALSE}
 23 | library(ggplot2)
 24 | library(reshape)
 25 | library(MAUDE)
 26 | 
 27 | set.seed(76484377)
 28 | ```
 29 | 
 30 | Set up the simulation.  Here, we're using 5 guides per element, and 200 elements, 100 of which have no effect on expression.  The half that affect expression have effect sizes ranging from 0.01 to 1 standard deviations.  We also include 1000 non-targeting guides.
 31 | ```{r simulation setup}
 32 | #ground truth
 33 | groundTruth = data.frame(element = 1:200, meanEffect = c((1:100)/100,rep(0,100))) #targeting 200 elements, half of which do nothing
 34 | 
 35 | #guide - element map; 5 guides per element; gid is the guide ID
 36 | guideMap = data.frame(element = rep(groundTruth$element, 5), gid = 1:(5*nrow(groundTruth)), NT=FALSE, mean=rep(groundTruth$meanEffect, 5))
 37 | guideMap = rbind(guideMap, data.frame(element = NA, gid = (1:1000)+nrow(guideMap), NT=TRUE, mean=0)); # 1000 non-targeting guides
 38 | 
 39 | guideMap$abundance = rpois(n=nrow(guideMap), lambda=1000); #guide abundance drawing from a poisson distribution with mean=1000
 40 | guideMap$cells = rpois(n=nrow(guideMap), lambda=guideMap$abundance); #cell count drawing from a poisson distribution with mean the abundance from above
 41 | 
 42 | #create observations for different guides, with expression drawn from normal(mean=mean, sd=1)
 43 | cellObservations = data.frame(gid = rep(guideMap$gid, guideMap$cells))
 44 | cellObservations = merge(cellObservations, guideMap, by="gid")
 45 | cellObservations$expression = rnorm(n=nrow(cellObservations), mean=cellObservations$mean);
 46 | 
 47 | #create the bin model for this experiment - this represents 6 bins, each of which is 10%, where A+B+C catch the bottom ~30% and D+E+F catch the top ~30%; in an actual experiment, the true captured fractions should be used here. 
 48 | binBounds = makeBinModel(data.frame(Bin=c("A","B","C","D","E","F"), fraction=rep(0.1,6)))
 49 | if(FALSE){ 
 50 |   # in reality, we shouldn't assume this distribution is exactly normal - we can re-assign expression bin bounds based on quantiles 
 51 |   # of the actual simulated expression distribution.  If you run the next two lines, the answer will improve slightly, but the 
 52 |   # resulting graphs will look slightly different than those below.
 53 |   # correct for the actual distribution
 54 |   binBounds$binStartZ = quantile(cellObservations$expression, probs = binBounds$binStartQ);
 55 |   binBounds$binEndZ = quantile(cellObservations$expression, probs = binBounds$binEndQ);
 56 | }
 57 | 
 58 | # select some examples to inspect for both
 59 | exampleNT = sample(guideMap$gid[guideMap$NT],10);# non-targeting
 60 | exampleT = sample(guideMap$gid[!guideMap$NT],5);# and targeting guides
 61 | 
 62 | #plot the select examples and show the bin structure
 63 | p = ggplot(cellObservations[cellObservations$gid %in% c(exampleT, exampleNT),], 
 64 |   aes(x=expression, group=gid, fill=NT))+geom_density(alpha=0.2) + 
 65 |   geom_vline(xintercept = sort(unique(c(binBounds$binStartZ,binBounds$binEndZ))),colour="gray") + 
 66 |   theme_classic() + scale_fill_manual(values=c("red","darkgray")) + xlab("Target expression") + 
 67 |   scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + 
 68 |   coord_cartesian(xlim=c(min(cellObservations$expression), max(cellObservations$expression)))+ 
 69 |   geom_segment(data=binBounds, aes(x=binStartZ, xend=binEndZ, colour=Bin, y=0, yend=0), size=5, inherit.aes = FALSE); 
 70 | print(p)
 71 | ```
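The bin layout can also be written out by hand, which may help clarify why a shift in expression changes how a guide's cells are distributed across bins. The sketch below is an approximation under stated assumptions, not the output of `makeBinModel`: it assumes expression is exactly standard normal, that bins A to C tile the bottom 30% and D to F the top 30%, and it ignores any special handling of the extreme tails. `binSketch` and `expectedFractions` are hypothetical names.

```r
# Assumed layout: A-C cover the bottom 30% and D-F the top 30% of a standard
# normal expression distribution; the middle 40% of cells is not captured
binSketch = data.frame(Bin = c("A", "B", "C", "D", "E", "F"),
                       binStartQ = c(0.0, 0.1, 0.2, 0.7, 0.8, 0.9),
                       binEndQ   = c(0.1, 0.2, 0.3, 0.8, 0.9, 1.0))

# Convert quantile bounds to expression (Z) bounds
binSketch$binStartZ = qnorm(binSketch$binStartQ)
binSketch$binEndZ   = qnorm(binSketch$binEndQ)
print(binSketch)

# Expected fraction of a guide's cells in each bin, given its mean expression shift
expectedFractions = function(mu) {
  pnorm(binSketch$binEndZ, mean = mu) - pnorm(binSketch$binStartZ, mean = mu)
}
round(expectedFractions(0), 3)  # no effect: 10% of cells in every bin
round(expectedFractions(1), 3)  # +1 SD effect: cells shift towards the high bins
```

A guide with no effect spreads its cells evenly over the six bins, whereas a guide that raises expression enriches the high bins and depletes the low ones. This difference in bin occupancy is the signal the rest of the tutorial tries to recover.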
 72 | 
 73 | 
 74 | Now that we've set up our experimental system, we simulate actually capturing cells in the bins.
 75 | ```{r simulate sorting cells}
 76 | #for each bin, find which cells landed within the bin bounds
 77 | for(i in 1:nrow(binBounds)){
 78 |   cellObservations[[as.character(binBounds$Bin[i])]] = 
 79 |     cellObservations$expression > binBounds$binStartZ[i] & cellObservations$expression < binBounds$binEndZ[i];
 80 | }
 81 | 
 82 | #count the number of cells that ended up in each bin for each guide
 83 | binLevelData = cast(melt(cellObservations[c("gid","element","NT",as.character(binBounds$Bin))], 
 84 |   id.vars=c("gid","element","NT")), gid + element + NT + variable ~ ., fun.aggregate = sum)
 85 | names(binLevelData)[(ncol(binLevelData)-1):ncol(binLevelData)]=c("Bin","cells");
 86 | 
 87 | #plot the cells/bin for each of our example guides
 88 | p = ggplot(binLevelData[binLevelData$gid %in% exampleNT,], aes(x=Bin, group=Bin, y=cells)) +
 89 |   geom_boxplot(fill="darkgray")+geom_line(data=binLevelData[binLevelData$gid %in% exampleT,], 
 90 |     aes(group=gid), colour="red")+  theme_classic()  + xlab("Expression bin") + 
 91 |   ylab("Captured cells/bin")  + scale_y_continuous(expand=c(0,0)); print(p)
 92 | ```
 93 | 
 94 | Now, we simulate sequencing the guides that end up in each bin, assuming we sequence 10 reads per sorted cell.
 95 | ```{r simulate sequencing}
 96 | #get the total number of cells sorted into each of the bins
 97 | totalCellsPerBin = cast(binLevelData,Bin ~ ., value="cells", fun.aggregate = sum)
 98 | names(totalCellsPerBin)[ncol(totalCellsPerBin)]="totalCells";
 99 | 
100 | #add bin totals to binLevelData
101 | binLevelData = merge(binLevelData, totalCellsPerBin, by="Bin")
102 | 
103 | #generate reads for each guide per bin, following a negative binomial distribution
104 | #n=number of observations; size= total reads per bin; prob= probability of not getting a read at each drawing
105 | binLevelData$reads = rnbinom(n=nrow(binLevelData), size=binLevelData$totalCells*10, 
106 |   prob=1- binLevelData$cells/binLevelData$totalCells)
107 | 
108 | #plot the distribution of reads for each example guide
109 | p = ggplot(binLevelData[binLevelData$gid %in% exampleNT,], aes(x=Bin, group=Bin, y=reads)) +
110 |   geom_boxplot(fill="darkgray")+
111 |   geom_line(data=binLevelData[binLevelData$gid %in% exampleT,], aes(group=gid), colour="red")+  
112 |   theme_classic()  + xlab("Expression bin") + ylab("Reads/bin")  + scale_y_continuous(expand=c(0,0)); 
113 | print(p)
114 | ```
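As a sanity check on the parameterization used in the chunk above, the following sketch works through one guide in one bin. The cell counts are invented for illustration. In R, `rnbinom(size, prob)` has mean `size*(1-prob)/prob` and variance `size*(1-prob)/prob^2`, so with `prob` close to 1 the expected read count is close to 10 reads per captured cell and the variance is only slightly larger than the mean.

```r
# Toy numbers (hypothetical): a guide contributing 120 of the 20,000 cells in a bin
cells = 120
totalCells = 20000
readsPerCell = 10

# Same parameterization as in the simulation above
size = totalCells * readsPerCell
prob = 1 - cells / totalCells

# Mean and variance of the negative binomial in R's (size, prob) parameterization
expectedReads = size * (1 - prob) / prob
varianceReads = size * (1 - prob) / prob^2
print(c(expected = expectedReads, variance = varianceReads))

# Empirical check against simulated draws
set.seed(1)
simulatedReads = rnbinom(n = 100000, size = size, prob = prob)
print(mean(simulatedReads))
```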
115 | 
116 | 
117 | ```{r simulate unsorted cells}
118 | binReadMat = cast(binLevelData, element+gid+NT ~ Bin, value="reads")
119 | 
120 | #pretend we sequence the unsorted cells to similar coverage as above:
121 | guideMap$NS = rnbinom(n=nrow(guideMap), size=sum(guideMap$cells)*10, 
122 |   prob=1- guideMap$abundance/sum(guideMap$cells))
123 | binReadMat = merge(binReadMat, guideMap[c("gid","NS")], by="gid")
124 | 
125 | binReadMat$screen="test"; # here, we're only doing the one screen - this simulation
126 | binBounds$screen="test";
127 | ```
128 | 
129 | ## Run MAUDE
130 | 
131 | Up until now, we have just been building up the data in a way that simulated an actual experiment. With the exception of defining the bin model(s), most of the above is not required (nor even possible, since we're inspecting hidden variables above) for an actual experiment. Only the next two commands would actually be run for a real experiment.
132 | ```{r run MAUDE}
133 | # get guide-level stats 
134 | guideLevelStats = findGuideHitsAllScreens(unique(binReadMat["screen"]), binReadMat, binBounds)
135 | 
136 | #get element level stats
137 | elementLevelStats = getElementwiseStats(unique(guideLevelStats["screen"]),
138 |   guideLevelStats, elementIDs="element",tails="upper")
139 | ```
140 | 
141 | Plot the actual effect sizes (those we defined at the beginning) compared with those predicted by MAUDE:
142 | ```{r plot effects}
143 | elementLevelStats = merge(elementLevelStats, groundTruth, by="element")
144 | 
145 | p = ggplot(elementLevelStats, aes(x=meanEffect, y=meanZ, colour=FDR<0.01)) + geom_point()+
146 |   geom_abline(intercept = 0, slope=1) + theme_classic() + scale_colour_manual(values=c("darkgray","red")) + 
147 |   xlab("True effect") + ylab("Inferred effect"); print(p)
148 | ```
149 | 
150 | We can see that the two are highly correlated.  At more extreme effect sizes, MAUDE underestimates the effect size because of the uniform prior applied (in the form of a pseudocount).  However, biological data can be very noisy and so we highly recommend this prior.
151 | 
152 | 
153 | Zoom in on the bottom left:
154 | ```{r effects zoom in}
155 | p = ggplot(elementLevelStats, aes(x=meanEffect, y=meanZ, colour=FDR<0.01)) + geom_point()+
156 |   geom_abline(intercept = 0, slope=1) + theme_classic() + scale_colour_manual(values=c("darkgray","red")) + 
157 |   coord_cartesian(xlim = c(0,0.1),ylim = c(0,0.1)) + xlab("True effect") + ylab("Inferred effect"); 
158 | print(p)
159 | ```
160 | 
161 | In this particular simulation we have 1 false positive out of nearly 100 true positives (consistent with our FDR cutoff of 0.01), and three false negatives with very small effect sizes.
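
Counts like these can be tabulated directly rather than read off the plot. The sketch below uses a small made-up table so that it runs on its own; `toyStats` is hypothetical. With the tutorial objects, the same lines should apply to `elementLevelStats`, which after the merge above carries both `meanEffect` and `FDR`.

```r
# Hypothetical toy results; in the tutorial this role is played by elementLevelStats
toyStats = data.frame(meanEffect = c(0, 0, 0, 0.01, 0.02, 0.5, 1),
                      FDR        = c(0.8, 0.5, 0.005, 0.3, 0.2, 1e-6, 1e-12))

# Ground truth and calls at the FDR cutoff used in the plots above
toyStats$trulyActive = toyStats$meanEffect > 0
toyStats$called      = toyStats$FDR < 0.01

# Cross-tabulate truth against calls
confusion = table(trulyActive = toyStats$trulyActive, called = toyStats$called)
print(confusion)

falsePositives = confusion["FALSE", "TRUE"]
falseNegatives = confusion["TRUE", "FALSE"]
# proportion of the called elements that are actually inactive
observedFDP = falsePositives / sum(toyStats$called)
```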
162 | 


--------------------------------------------------------------------------------