├── VisCap.cfg
├── README.md
└── VisCap.R


/VisCap.cfg:
--------------------------------------------------------------------------------
 1 | #VisCap configuration file for Linux users
 2 | 
 3 | 
 4 | interval_list_dir           <- "/path/to/dir/bed"  # directory where the interval file is stored 
 5 | #explorer_file               <- "C:\\Windows\\explorer.exe"
 6 | cov_file_pattern            <- "sample_interval_summary$"  # this is the extension of the coverage file from DepthOfCoverage
 7 | cov_field                   <- "_total_cvg" # field to be read from coverage file
 8 | interval_file_pattern       <- "interval_file.bed$" # interval filename
 9 | ylimits                     <- c(-2, 2)
10 | iqr_multiplier              <- 3
11 | threshold.min_exons         <- 1
12 | threshold.cnv_log2_cutoffs  <- c(-0.55, 0.40)
13 | iterative.calling.limit     <- 0        #Set to 0 to iterate until all failed samples are removed
14 | infer.batch.for.sub.out_dir <- TRUE     #Set to FALSE to prompt users for output directory
15 | clobber.output.directory    <- FALSE    #Set to FALSE to stop run when output directory already exists
16 | 
17 | #Setting a path to data skips prompts. Set to FALSE for deployment.
18 | dev_dir     <- FALSE
19 | 
20 | 
21 | 
22 | 
23 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Documentation
 2 | For updated and more detailed documentation, please refer to our Github Wiki page at https://github.com/pughlab/VisCap/wiki
 3 | 
 4 | # Introduction
 5 | VisCap is a copy number detection and visualization tool written in R (www.r-project.org) for analysis of next-generation sequencing data derived from hybrid-capture experiments. It requires installation of the R libraries "gplots", "zoo", and “cluster”. We reported validation of this algorithm for detection of germline variation in Genetics in Medicine:  
 6 | 
 7 | VisCap: inference and visualization of germ-line copy-number variants from targeted clinical sequencing data.  
 8 | Pugh TJ, Amr SS, Bowser MJ, Gowrisankar S, Hynes E, Mahanta LM, Rehm HL, Funke B, Lebo MS.  
 9 | Genet Med. 2015 Dec 17. doi: 10.1038/gim.2015.156. [Epub ahead of print]  
10 | PMID: 26681316  
11 | www.ncbi.nlm.nih.gov/pubmed/26681316  
12 | 
13 | # Input
14 | As input, VisCap reads interval summary files generated by the Genome Analysis Toolkit (GATK)"DepthOfCoverage" tool (www.broadinstitute.org/gatk). For each sample, these files contain a summary of the total coverage of each genome region interval listed in a reference interval list. For our purposes, this interval list contains the coordinates of each target (usually an exon) captured by a specific bait set. Importantly, the bed file provided to GATK DepthOfCoverage must not contain any overlapping intervals as these are merged into single intervals resulting in a DepthOfCoverage file containing few intervals than the original bed file.
15 |  
16 | # Algorithm
17 | The initial step in the VisCap program is to generate a matrix of all intervals captured (~1,383 for OtoGenome) and the fractional of total coverage assigned to these interval within each sample from a batch (10 samples for production runs). When different interval lists are used to make the coverage files for samples within a batch, VisCap will use only the intervals common to all samples. Therefore, inclusion of even a single coverage file with a small number of targets will result in only these targets being considered across the entire batch. To normalize fractional coverage values for copy number detection, each is divided by the median for that target across the entire batch. These are stored as a matrix of log2 ratios and written to a file (log2_ratio_table.xls).
18 | 
19 | The X-chromosome requires further normalization as there are significant fractional coverage differences between males and females. Depending on the balance of males and females in the batch, males may display single-copy loss of chromosome X or females may display a gain. These patterns are evident from two clusters of boxplots. The clusters are detected informatically by removing outlier probes and then partitioning all samples around two medoids, a more robust version of K-means clustering. Each cluster is then normalized to zero by subtracting the median of each cluster from each sample. To facilitate review of this procedure, boxplots of log2 ratios are generated before and after subtraction of the cluster medians (QC_chrX_pre-scale.pdf and QC_chrX_post-scale.pdf).
20 | 
21 | Boxplots are constructed using the standard "boxplot()" command within R. This provides a visual representation of the distribution of log2 ratios from each sample (QC_cnv_boxplot.pdf) as well as a numeric summary used for subsequent thresholding and quality control (QC_cnv_boxplots.xls). The boxplot output summary includes: upper whisker, Q3, median, Q1, and lower whisker. Upper (lower) whiskers are the greatest (lowest) values that fall within Q3 (Q1) plus a quantity (default: 3) times the interquartile range (Q3 - Q1). A sample fails quality control if either of the boxplot whiskers extends beyond the expected theoretical log2 ratio for a single-copy gain (0.58) or a single-copy loss (-1). Failed samples are identified in the boxplot summary output, and VisCap automatically repeats the analysis until all samples pass quality control. The complete set of output files is generated for each iteration and stored in separate folders. For further quality control, log2 ratios across the entire batch are visualized in a single heatmap-based output (QC_cnv_heatmap.pdf).
22 | 
23 | The thresholding strategy for determining if a log2 ratio represents a gain or a loss is dependent on the boxplot whiskers and a set of fixed thresholds. These fixed thresholds were established based on the analysis of positive control samples (see validation report) and they represent the minimum log2 ratio for gains (0.40) and maximum log2 ratio for losses (-0.55) at which a call is made. A copy number variant is called when a log2 ratio exceeds both the upper (lower) fixed threshold and the upper (lower) whisker for the sample. The minimum number of consecutive exons with log2 ratios outside of these thresholds to call a copy number variant can also be set by the user (default: 1).
24 | 
25 | # Output
26 | For each sample, candidate copy number variants are output as a text data table (*.cnvs.xls). All calls from all intervals across all samples are also written to a file (cnv_boxplot_outliers.xls, losses -1, copy-neutral 0, gains 1). To facilitate visual review of the data for each sample, log2 ratios are plotted by relative genome order (i.e. rank order, not to scale on genome) and data points supporting copy number variants are color-coded (gains: red, losses: blue). This plot is overlaid with guidelines depicting the two sets of thresholds used for copy number variation (whiskers: dashed, fixed: solid black) and the theoretical log2 ratios (solid light gray) for single-copy gains and single-copy losses. 
27 | 


--------------------------------------------------------------------------------
/VisCap.R:
--------------------------------------------------------------------------------
  1 | # VisCap: Visualize normalized capture coverage
  2 | # Generates heatmap of exon coverage from a directory of sample_interval_summary files
  3 | # By Trevor Pugh, March 2012 - April 2013
  4 | 
  5 | ###########
  6 | #Libraries#
  7 | ###########
  8 | 
  9 | library("cluster")
 10 | library("gplots")
 11 | library("zoo")
 12 | 
 13 | ##########
 14 | #Defaults#
 15 | ##########
 16 | 
 17 | svn.revision                <- "$Id: VisCap.R 1372 2013-04-25 13:53:52Z tp908 $"
 18 | source('VisCap.cfg')
 19 | #lane_dir                    <- "\\\\rfanfs.research.partners.org\\NGS"
 20 | #out_dir                     <- "\\\\rfanfs.research.partners.org\\gigpad_clinical\\Facilities\\Laboratory of Molecular Medicine_4171303\\VisCap_Outputs"
 21 | 
 22 | #lane_dir                    <- "\\\\rfanfs.research.partners.org\\NGS"
 23 | #out_dir                     <- "\\\\rfanfs.research.partners.org\\gigpad_clinical\\Facilities\\Laboratory of Molecular Medicine_4171303\\VisCap_Outputs"
 24 | # interval_list_dir           <- "\\\\Sfa6\\lmm$\\DEVELOPMENT\\ACTIVE_DEVELOPMENT\\NEXT_GEN_COPY_NUMBER\\VisCap\\interval_lists"   
 25 | #explorer_file               <- "C:\\Windows\\explorer.exe"
 26 | #cov_file_pattern            <- ".target.cov.sample_interval_summary$"
 27 | #cov_field                   <- "_total_cvg"
 28 | #interval_file_pattern       <- ".interval_list$"
 29 | #ylimits                     <- c(-2, 2)
 30 | #iqr_multiplier              <- 3
 31 | #threshold.min_exons         <- 1
 32 | #iterative.calling.limit     <- 0        #Set to 0 to iterate until all failed samples are removed
 33 | #infer.batch.for.sub.out_dir <- TRUE     #Set to FALSE to prompt users for output directory
 34 | #clobber.output.directory    <- FALSE    #Set to FALSE to stop run when output directory already exists
 35 | 
 36 | #Setting a path to data skips prompts. Set to FALSE for deployment.
 37 | #dev_dir    <- "\\\\Sfa6\\lmm$\\DEVELOPMENT\\ACTIVE_DEVELOPMENT\\NEXT_GEN_COPY_NUMBER\\VisCap\\test_data"
 38 | dev_dir     <- FALSE
 39 | 
 40 | ###########
 41 | #Functions#
 42 | ###########
 43 | 
 44 | #Modified winDialog function to run in non-interactive mode
 45 | winDialog.nonint <- function (type = c("ok", "okcancel", "yesno", "yesnocancel"), message) 
 46 |     {
 47 |         #if (!interactive())
 48 |         #    stop("winDialog() cannot be used non-interactively")
 49 |         type <- match.arg(type)
 50 |         res <- .Internal(winDialog(type, message))
 51 |         if (res == 10L) 
 52 |             return(invisible(NULL))
 53 |         c("NO", "CANCEL", "YES", "OK")[res + 2L]
 54 |     }
 55 | 
 56 | make_matrix_from_cov_files <- function(lane_dir, cov_file_pattern, cov_field) {
 57 |     #Join all cov files into a single matrix
 58 |     filenames <- list.files(lane_dir, full.names=TRUE, pattern=cov_file_pattern, recursive=TRUE)
 59 |     print(filenames)
 60 |     for(file in filenames) {
 61 |         tab      <- read.table(file, header=TRUE, sep ="\t", row.names=1)
 62 |         col_name <- colnames(tab)[grep(cov_field, colnames(tab))]
 63 | 
 64 |         #Make new matrix labelled with appropriate identifier
 65 |         #new_head <- gsub(cov_field, "", col_name)                  #sample identifier only
 66 |         new_head  <- basename(gsub(cov_file_pattern, "", file))      #derive name from original filename
 67 |         new       <- matrix(tab[,col_name], dimnames=list(rownames(tab), new_head) )
 68 |         
 69 |         #If this is the first file, make a new matrix. Otherwise, join it to the existing matrix using genome coordinates
 70 |         if(file == filenames[1]) {
 71 |             mat.cov <- new
 72 |         } else {
 73 |             mrg               <- merge(new, mat.cov, by = "row.names", all = TRUE)
 74 |             mat.cov           <- as.matrix(mrg[-1])
 75 |             rownames(mat.cov) <- mrg[,1]
 76 |         }
 77 |     }
 78 | 
 79 |     #Remove any rows with NA entries
 80 |     mat.cov <- mat.cov[complete.cases(mat.cov),]
 81 |     #TODO: Provide warning if n% of probes are discarded (i.e. multiple panels are present)
 82 |     
 83 |     # Extract fraction of total coverage assigned to each exon in each sample
 84 |     # Note two possible methods, depending on preference for X-chromosome handling.
 85 |     # mat.cov.totals <- colSums(mat.cov) #use X-chromosome for normalization
 86 |     mat.cov.totals <- colSums(mat.cov[grep("X",rownames(mat.cov), invert=TRUE),]) # do not use X-chromosome for normalization
 87 |     mat.frac_cov <- sweep(mat.cov, 2, mat.cov.totals, "/")
 88 | 
 89 |     #Order columns alphanumerically
 90 |     mat.frac_cov <- mat.frac_cov[,order(colnames(mat.frac_cov))]
 91 |     
 92 |     return(mat.frac_cov)
 93 | }
 94 | 
 95 | load_interval_names <- function(interval_list_dir, interval_file_pattern) {
 96 |     #Read interval list files and make lookup_table for genome coords and interval names
 97 |     lookup = c()
 98 |     filenames = list.files(interval_list_dir, full.names=TRUE, pattern=interval_file_pattern, recursive=FALSE)
 99 |     for(file in filenames) {
100 |         tab = read.table(file, header=FALSE, comment.char = "@", sep="\t", stringsAsFactors=FALSE)
101 | 	if(ncol(tab) < 4)
102 | 	{stop("The interval file should have at least 4 columns, chromosome, start, end, and interval name, all tab separated.")}
103 |         colnames(tab) <- c("chr", "start", "end", "interval_name")
104 |         rownames(tab) <- paste(tab$chr, ":", tab$start, "-", tab$end, sep = "")
105 |         tab$interval_file <- file
106 |         lookup <- rbind(lookup, tab)
107 |     }
108 |     return(lookup)
109 | }
110 | 
111 | annotate_interval_names <- function(coords, interval_lookup) {
112 |     #Add name column to mat containing interval name from lookup table, if provided
113 |     if(is.null(interval_lookup)) {
114 |     labels <- coords
115 |     } else {
116 |         matches <- match(coords, rownames(interval_lookup))
117 |         labels <- interval_lookup[matches, "interval_name"]
118 |     }
119 |     return(labels)
120 | }
121 | 
122 | divide_by_batch_median <- function(mat) {
123 |     #Fractional coverage normalization by median across each exon
124 |     rmeds <- apply(mat, 1, median)         ## calculate row medians
125 |     mat_norm <- sweep(mat, 1, rmeds, "/")  ## divide each entry by row median
126 |     mat_norm <- mat_norm[complete.cases(mat_norm),] ##remove NA entries
127 | 	return(mat_norm)
128 | }
129 | 
130 | heatmap_by_chrom <- function(mat, analysis_name, ylimits, out_dir) {
131 |     #Limit matrix for plotting
132 |     mat[mat < min(ylimits)] <- min(ylimits)
133 |     mat[mat > max(ylimits)] <- max(ylimits)
134 | 
135 | 	# Plot heatmap for each chromosome, compare each exon's value/median ratio across samples, save to files
136 | 	pdf(file=paste(out_dir, "/", analysis_name, ".pdf", sep=""))
137 | 
138 |     #Per-chromosome plots
139 | 	chroms = c("all", 1:22,"X","Y", "MT", "M")
140 | 	for(chr in chroms){
141 |         if(chr == "all") {
142 |             main_title <- "All Chromosomes"
143 |             matches <- rownames(mat)
144 |         } else {
145 |             main_title <- paste("Chromosome",chr)
146 |             matches <- grep(paste("^",chr,":",sep=""), rownames(mat))
147 |         }
148 | 		if(length(matches) > 1) {
149 | 			par("cex.main" = 0.5)
150 |             #Set heatmap color scale
151 |             steps = 100
152 |             color_scale = bluered(steps)
153 |             color_breaks = seq(ylimits[1], ylimits[2], by=(ylimits[2] - ylimits[1])/steps)
154 |             #color_breaks = seq(min(mat, na.rm = TRUE), max(mat, na.rm = TRUE), length.out=steps+1)
155 | 
156 |             #exons by samples
157 |             #heatmap.2(mat[matches,], Rowv=NA, Colv=NA, scale="none", cexCol=0.5, cexRow=0.3, col=color_scale, breaks=color_breaks, dendrogram="none", main=title, symbreaks=TRUE, trace="none")
158 |             #samples vs exons
159 |             xlabels <- gsub("__", "\n", colnames(mat))
160 |             heatmap.2(
161 |                         t(mat[matches,]),
162 | 
163 |                         Rowv       = NA,
164 |                         Colv       = NA,
165 |                         scale      = "none",
166 |                         cexCol     = 0.5,
167 |                         cexRow     = 0.5,
168 |                         col        = color_scale,
169 |                         breaks     = color_breaks,
170 |                         dendrogram = "none",
171 |                         symbreaks  = TRUE,
172 |                         trace      = "row",
173 |                         tracecol   = "black",
174 |                         main       = main_title,
175 |                         labRow     = xlabels,
176 |                         labCol     = NA
177 |             )
178 | 		}
179 | 	}
180 |     dev.off()
181 | }
182 | 
183 | score_boxplot <- function(bplot, llim= log2(1/2), ulim=log2(3/2)) {
184 |     fail_lower_thresh <- bplot$stats[1,] < llim
185 |     fail_upper_thresh <- bplot$stats[5,] > ulim
186 |     fail <- as.logical(fail_lower_thresh + fail_upper_thresh)
187 |     qc_string <- gsub(TRUE, "FAIL", gsub(FALSE, "PASS", fail))
188 |     return(qc_string)
189 | }
190 | 
191 | boxplot_cnv_matrix <- function(nmat, bplot_name, out_dir, ylimits, iqr_multiplier) {
192 |     plot_nmat <- nmat
193 |     plot_nmat[nmat < min(ylimits)] <- min(ylimits)
194 |     plot_nmat[nmat > max(ylimits)] <- max(ylimits)
195 |     pdf(file=paste(out_dir, "/", bplot_name, ".pdf", sep=""))
196 |     par(las=2, mar=c(12,4,4,2))
197 |     xlabels = sub("__", "\n", colnames(plot_nmat))
198 |     bplot <- boxplot(plot_nmat, range=iqr_multiplier, ylim=ylimits, srt=90, pch=16, cex.axis=0.6, names=xlabels, ylab="Log2 ratio sample/batch median")
199 |     bplot$names <- sub("\n", "__", bplot$names)
200 |     dev.off()
201 |     
202 |     #Write boxplot information to separate file
203 |     bplot$qc <- score_boxplot(bplot)
204 |     bplot.out <- cbind(round(t(bplot$stats), 3), bplot$qc)
205 |     rownames(bplot.out) <- bplot$names
206 |     colnames(bplot.out) <- c("del_threshold", "Q1", "median", "Q3", "amp_threshold", "qc")
207 |     write.table(bplot.out, paste(out_dir, "/", bplot_name, ".xls", sep=""), quote = FALSE, sep = "\t", col.names = NA)
208 |     return(bplot)
209 | }
210 | 
211 | filter_cnvs <- function(segs, threshold.min_exons, threshold.cnv_log2_cutoffs) {
212 |     #Apply consecutive exon filter
213 |     fsegs <- segs[(segs[,"CNV"] != 0) & (segs[,"Interval_count"] >= threshold.min_exons), , drop = FALSE]
214 | 
215 |     #Apply zero in normal range filter
216 |     fsegs <- fsegs[(as.numeric(fsegs[,"Loss_threshold"]) < 0) & (as.numeric(fsegs[,"Gain_threshold"]) > 0),, drop = FALSE]
217 | 
218 |     #Apply hard CNV threshold
219 |     fsegs <- fsegs[(as.numeric(fsegs[,"Median_log2ratio"]) < min(threshold.cnv_log2_cutoffs)) | (as.numeric(fsegs[,"Median_log2ratio"]) > max(threshold.cnv_log2_cutoffs)),, drop = FALSE]
220 | 
221 |     return(fsegs)
222 | }
223 | 
224 | call_cnvs <- function(nmat, ylimits, interval_lookup, threshold.min_exons, iqr_multiplier, threshold.cnv_log2_cutoffs, out_dir) {
225 |     #Plot boxplots to visualize ranges used to detect copy number variation
226 |     bplot <- boxplot_cnv_matrix(nmat, "QC_cnv_boxplot", out_dir, ylimits, iqr_multiplier)
227 |     
228 |     #Set outlier thresholds by distribution test
229 |     #Use boxplot$stats so whiskers in pdf accurately represent thresholds used
230 |     lbound <- round(bplot$stats[1,], 3)
231 |     ubound <- round(bplot$stats[5,], 3)
232 |     batch_size <- dim(bplot$stats)[2]
233 | 
234 |     #Use hard-threshold, if it is greater than boxplot thresholds
235 |     lbound[lbound > min(threshold.cnv_log2_cutoffs)] <- min(threshold.cnv_log2_cutoffs)
236 |     ubound[ubound < max(threshold.cnv_log2_cutoffs)] <- max(threshold.cnv_log2_cutoffs)
237 | 
238 |     #Make matrix of thresholds
239 |     lbound_mat <-  matrix(rep(lbound, dim(nmat)[1]), ncol=length(lbound), byrow=TRUE, dimnames=list(rownames(nmat),colnames(nmat)))
240 |     ubound_mat <-  matrix(rep(ubound, dim(nmat)[1]), ncol=length(lbound), byrow=TRUE, dimnames=list(rownames(nmat),colnames(nmat)))
241 | 
242 |     #Flag outliers
243 |     nmat_loutliers <- (nmat < lbound_mat) + 0 #Adding zero converts TRUE/FALSE to 1/0
244 |     nmat_uoutliers <- (nmat > ubound_mat) + 0 #Adding zero converts TRUE/FALSE to 1/0
245 |     #TODO: Estimate number of copies gained or lost (i.e. support -2 and +n)
246 |     
247 |     #Make tracking matrix of all zero values then subtract copies lost and add copies gained
248 |     nmat_cnvs <- matrix(data = 0, nrow = dim(nmat)[1], ncol = dim(nmat)[2], dimnames = list(rownames(nmat),colnames(nmat)))
249 |     nmat_cnvs <- nmat_cnvs - nmat_loutliers + nmat_uoutliers
250 |     gene_exon <- annotate_interval_names(rownames(nmat_cnvs), interval_lookup)
251 |     write.table(cbind.data.frame(gene_exon, nmat_cnvs), paste(out_dir, "/", "cnv_boxplot_outliers", ".xls", sep=""), quote = FALSE, sep = "\t", col.names = NA)
252 |     
253 |     #Merge consecutive calls and write out to file
254 |     all_fsegs <- c()
255 |     for(id in colnames(nmat_cnvs)) {
256 |         segs <- c()
257 |         segs.header <- c("Sample", "CNV", "Genome_start_interval", "Genome_end_interval", "Start_interval","End_interval", "Interval_count", "Min_log2ratio", "Median_log2ratio", "Max_log2ratio", "Loss_threshold", "Gain_threshold", "Batch_size")
258 |         chroms = c(1:22,"X","Y", "MT", "M")
259 |         for(chr in chroms){
260 |             matches <- grep(paste("^",chr,":",sep=""), rownames(nmat))
261 |             if(length(matches) > 1) {
262 |                 #Segmentation: Detect runs of consecutive copy number calls
263 |                 rl <- rle(nmat_cnvs[matches, id])
264 |                 values <- rl$values
265 |                 lengths <- rl$lengths
266 |                 starts <- c(rownames(nmat_cnvs[matches,])[1], names(rl$lengths[1:length(rl$lengths)-1]))
267 |                 ends <- names(rl$values)
268 | 
269 |                 #Calculate rounded log2 ratios of intervals involved
270 |                 log2s <- lapply(1:length(starts), function(x) round(nmat[(match(starts[x], names(nmat[,id])):match(ends[x], names(nmat[,id]))), id], 3))
271 |                 log2s_min <- unlist(lapply(log2s, min))
272 |                 log2s_med <- unlist(lapply(log2s, median))
273 |                 log2s_max <- unlist(lapply(log2s, max))
274 | 
275 |                 #Report genome coordinate range
276 |                 coordinates_part1 <- data.frame(strsplit(starts, "-"), stringsAsFactors = FALSE)[1,]
277 |                 coordinates_part2 <- data.frame(strsplit(ends, "-"), stringsAsFactors = FALSE)[2,]                
278 |                 coordinates <- paste(coordinates_part1, coordinates_part2, sep="-")
279 |                 
280 |                 #Lookup interval names
281 |                 start_names <- annotate_interval_names(starts,  interval_lookup)
282 |                 end_names <- annotate_interval_names(ends,  interval_lookup)
283 |                 
284 |                 #Add threshold columns, ensure consistent header
285 |                 segs <- rbind(segs, cbind(rep(id, length(values)), values, starts, ends, start_names, end_names, lengths, log2s_min, log2s_med, log2s_max, lbound_mat[1,id], ubound_mat[1,id], batch_size))
286 |                 colnames(segs) <- segs.header
287 |             }
288 |         }
289 | 
290 |         #Handle case where there are no cnvs on any chromosomes in any samples
291 |         if(is.null(segs)) {
292 |             segs <- matrix(data=rep(NA,length(segs.header)), ncol=length(segs.header), dimnames=list("", segs.header))[-1,,drop=FALSE]
293 |         }
294 |         
295 |         #Filter segments
296 |         fsegs <- filter_cnvs(segs, threshold.min_exons, threshold.cnv_log2_cutoffs)
297 |         
298 |         #Convert CNV type values to text
299 |         fsegs[,"CNV"] <- gsub("^1$", "Gain", fsegs[,"CNV"])
300 |         fsegs[,"CNV"] <- gsub("^-1$", "Loss", fsegs[,"CNV"])
301 | 
302 |         #Convert infinite values to large, numerical value
303 |         large_value <- "10"
304 |         num_cols <- c("Min_log2ratio", "Median_log2ratio", "Max_log2ratio", "Loss_threshold", "Gain_threshold")
305 |         fsegs[,num_cols] <- gsub("Inf", large_value, fsegs[,num_cols])
306 | 
307 |         #Number CNVs for later labeling on visual output
308 |         if(dim(fsegs)[1] > 0) {
309 |             CNV_id <- seq(1, dim(fsegs)[1])
310 |         } else {
311 |             CNV_id <- c()
312 |         }
313 |         fsegs <- cbind(fsegs[,1, drop=FALSE],
314 |                        CNV_id,
315 |                        fsegs[,2:dim(fsegs)[2], drop=FALSE])
316 | 
317 |         #Write output for each sample
318 |         write.table(fsegs, paste(out_dir, "/", id, ".cnvs.xls", sep=""), quote = FALSE, sep = "\t", col.names = TRUE, row.names = FALSE)
319 |         
320 |         #Add fsegs to master fsegs tracking matrix
321 |         all_fsegs <- rbind(all_fsegs, fsegs)
322 |     }
323 |     return(list(bplot, all_fsegs))
324 | }
325 | 
326 | space_label_names <- function(label_names) {
327 |     spacer <- "     "
328 |     if(length(label_names) > 1) {
329 |         alt_positions <- seq(from=2, to=length(label_names), by=2)
330 |         label_names[alt_positions] <- paste(label_names[alt_positions], spacer)
331 |     }
332 |     return(label_names)
333 | }
334 | 
335 | exon_plot_per_sample <- function(nmat, ylimits, interval_lookup, cnv_bplot_and_calls, out_dir) {
336 |     #Extract thresholds and cnvs, determine intervals from genome coordinates
337 |     nmat_bplot <- cnv_bplot_and_calls[[1]]
338 |     nmat_fsegs <- cnv_bplot_and_calls[[2]]
339 | 
340 |     #Limit matrix for plotting
341 |     nmat[nmat < ylimits[1]] <- ylimits[1]
342 |     nmat[nmat > ylimits[2]] <- ylimits[2]
343 | 
344 |     for(name in colnames(nmat)) {
345 |         #Get exon names from interval lists
346 |         interval_names <- annotate_interval_names(rownames(nmat), interval_lookup)
347 | 
348 |         #Plot all intervals, then plot by chromosome
349 |         pdf(file=paste(out_dir, "/", name, ".plot.pdf", sep=""))                 
350 | 		chroms <- c(".*", 1:22, "X", "Y")
351 | 		for(chr in chroms) {
352 |             #Grep rownames with genome coordinates for desired chromosome, support chr1: and 1: formats
353 | 			matches <- grep(paste("(^chr", chr, "|^", chr, "):", sep=""), rownames(nmat))
354 | 			if(length(matches) > 1) {
355 |                 nmat_chr <- nmat[matches, name, drop=FALSE]
356 |                 #Different title and axes labeling for All vs individual chromosomes
357 |                 if(chr == ".*") {
358 |                     title <- "All chromosomes"
359 | 					chrom_names <- sapply(rownames(nmat_chr), function(x) strsplit(x, ":")[[1]][1])
360 | 					labels_rle <- rle(chrom_names)
361 |                     #TODO: Find better method for spacing chromosome labels
362 |                     labels_rle$values <- space_label_names(labels_rle$values)
363 |                     #labels_rle$values <- rep("", length(labels_rle$values-1))
364 |                     labels_size <- 0.6
365 |                     draw_grid_lines <- FALSE
366 |                     label_called_CNVs <- FALSE
367 |                 } else {
368 |                     title <- paste("Chromosome", chr)
369 | 					#Convert genome coordinates to unique gene names
370 | 					exon_names <- annotate_interval_names(rownames(nmat_chr), interval_lookup)
371 | 					gene_names <- sapply(exon_names, function(x) strsplit(x, "_")[[1]][1])
372 | 					#Replace NA gene_names with nearest non-NA value
373 | 					gene_names <- na.locf(gene_names)
374 | 					labels_rle <- rle(gene_names)
375 |                     labels_size <- 1
376 |                     draw_grid_lines <- TRUE
377 |                     label_called_CNVs <- TRUE
378 |                 }
379 | 
380 |                 #Plot formatting and command
381 |                 shift <- -0.5 #shift for plotting gene guide lines
382 |                 par(pch=16)
383 | 				plot(nmat_chr, xlim=c(1 + shift,length(nmat_chr)), ylim=ylimits, main=title, ylab = "Log2 ratio sample/batch median", xaxt="n", xlab="")
384 |                 mtext(name, line=3, cex=0.5, adj=0)
385 |                 
386 |                 #Plot gene or chomosome grid lines
387 | 				section_ends <- (unlist(lapply(seq(1:length(labels_rle$lengths)), function(x) sum(labels_rle$lengths[1:x]))))
388 |                 section_starts <- c(1, section_ends[-length(section_ends)] + 1)
389 |                 grid_marks <- section_starts + shift
390 |                 grid_labels <- c(labels_rle$values)
391 |                 if(draw_grid_lines == TRUE) {
392 |                     abline(v=grid_marks, col="grey")
393 |                 }
394 | 				axis(1, las=3, at=grid_marks, labels=grid_labels, cex.axis=labels_size)
395 | 
396 |                 #Plot guidelines and thresholds
397 |                 name_bplot <- nmat_bplot$stats[,nmat_bplot$names == name]
398 |                 abline(h=0, col="black")                          #zero line
399 |                 abline(h=c(log2(1/2), log2(3/2)), col="grey")     #expected log2 ratios for single copy loss and gain
400 |                 abline(h=name_bplot[c(1,5)], lty=2, col="black")  #boxplot thresholds
401 |                 abline(h=threshold.cnv_log2_cutoffs, col="black") #hard cnv log2 ratio thresholds
402 |                 
403 |                 #Mark data points outside of ylimits and scaled to fit plot
404 |                 off_scale <- as.numeric(which((nmat_chr <= ylimits[1]) | (nmat_chr >= ylimits[2])))
405 |                 points(off_scale, nmat_chr[off_scale], pch=22)
406 |                                             
407 |                 #Color CNV data points
408 |                 types     <- matrix(byrow=TRUE, ncol=4,
409 |                                     dimnames=list(c(),
410 |                                         c("type", "col", "ypos.line", "ypos.label")),
411 |                                     data=c(
412 |                                           "Loss", "blue", -2,         -2.1,
413 |                                           "Gain", "red",   2,          2.1))
414 | 
415 |                 segs <- nmat_fsegs[(nmat_fsegs[,"Sample"] == name),,drop=FALSE]
416 |                 segs <- segs[(rownames(segs) %in% rownames(nmat_chr)),,drop=FALSE] #Filter to chromosome being plotted
417 |                 start_indexes <- which(rownames(nmat_chr) %in% segs[,"Genome_start_interval"])
418 |                 end_indexes   <- which(rownames(nmat_chr) %in% segs[,"Genome_end_interval"])
419 |                 if(length(start_indexes) > 0) {
420 |                     segs <- cbind(segs, types[match(segs[,"CNV"], types[,"type"]),,drop=FALSE])
421 |                     for(i in seq(1:length(start_indexes))) {
422 |                         indexes <- seq(start_indexes[i], end_indexes[i])
423 |                         #Plot colored points
424 |                         points(indexes, nmat_chr[indexes], col=segs[i, "col"])
425 | 
426 |                         #Label called CNVs with identifier
427 |                         if(label_called_CNVs == TRUE) {
428 |                         segments(start_indexes[i] - 0.25,
429 |                                  as.numeric(segs[i, "ypos.line"]),
430 |                                  end_indexes[i] + 0.25,
431 |                                  as.numeric(segs[i, "ypos.line"]),
432 |                                  col="orange", lwd=5)
433 |                             text(mean(c(start_indexes[i], end_indexes[i])),
434 |                                  as.numeric(segs[i, "ypos.label"]),
435 |                                  segs[i, "CNV_id"],
436 |                                  col="orange")
437 |                         }
438 |                     }
439 |                 }
440 |             }
441 |         }
442 | 	dev.off()
443 |     }
444 | }
445 | 
446 | rescale_chrX <- function(nmat_badX, ylimits, iqr_multiplier, out_dir) {
447 |     #Isolate chromosome X, remove outlier probes    
448 |     nmatX <- nmat_badX[grep("^X",rownames(nmat_badX)),]
449 |     if(dim(nmatX)[1] == 0) {
450 |         return(nmat_badX) # return original matrix if no probes are found on X chromosome
451 |     }
452 | 
453 |     #Assign cases to clusters and scale medians to zero
454 |     #    DEPRECATED METHOD: kmeans is mislead by outliers. multiple solutions possible.    
455 |     #    clusters <- kmeans(t(nmatX),2, nstart=1)
456 |     #    cluster_med <- apply(clusters$centers, 1, median)
457 |     #    scaling_factors <- cluster_med[clusters$cluster]
458 | 
459 |     #CURRENT METHOD: "Partitioning Around Medoids" followed by calculation of cluster medians
460 |     clusters <- pam(t(nmatX), 2, cluster.only=TRUE)
461 |     cluster_med <- c()
462 |     for(i in 1:max(clusters)) {
463 |         cluster_med[i] <- median(nmatX[,clusters == i])
464 |     }
465 |     scaling_factors <- cluster_med[clusters]
466 |     nmatX_scaled <- sweep(nmatX, 2, scaling_factors, "-")  ## subtract cluster median
467 | 
468 |     #Plot pre- and post-scaled values, write to file
469 |     bplotX <- boxplot_cnv_matrix(nmatX, "QC_chrX_pre-scale", out_dir, ylimits, iqr_multiplier)
470 |     bplotX_scaled <- boxplot_cnv_matrix(nmatX_scaled, "QC_chrX_post-scale", out_dir, ylimits, iqr_multiplier)
471 | 
472 |     #Infer sexes from two clusters of fractional coverage
473 |     farthest_from_zero <- abs(cluster_med) == max(abs(cluster_med))
474 |     if(cluster_med[farthest_from_zero] > 0) {
475 |         cluster_female <- farthest_from_zero
476 |     } else {
477 |         cluster_female <- !farthest_from_zero
478 |     }
479 |     sexes <- sapply(cluster_female[clusters], function(x) if(x){"Female"} else {"Male"})
480 |     names(sexes) <- names(clusters)
481 |     write.table(sexes, paste(out_dir, "sexes.xls", sep="/") , quote = FALSE, sep = "\t", col.names = FALSE)
482 | 
483 |     #Overwrite unscaled X-chromosome values with new, scaled values
484 |     nmat <- nmat_badX
485 |     nmat[grep("^X",rownames(nmat)),] <- nmatX_scaled
486 |     
487 |     return(nmat)
488 | }
489 | 
490 | remove_failed_samples <- function(mat, cnv_bplot_and_calls) {
491 |     passes <- score_boxplot(cnv_bplot_and_calls)
492 |     mat.trimmed <- mat[,passes]
493 |     return(mat.trimmed)
494 | }
495 | 
496 | ###########
497 | #Arguments#
498 | ###########
499 | 
500 | #Argument collection and parsing
501 | arguments <- commandArgs(trailingOnly = TRUE)
502 | 
503 | if(length(arguments) == 1) {  
504 |     if(dev_dir != FALSE) {
505 |         #Skips prompts if dev_dir is set
506 |         lane_dir <- dev_dir
507 |         out_dir <- dev_dir
508 |     } else {
509 |         #Collect input and output information from user
510 | 
511 |         #Input directory
512 |         lane_dir <- choose.dir(caption = "Select a lane directory (e.g. L001):", default = lane_dir)
513 |         if(is.na(lane_dir)) {
514 |             try( winDialog.nonint(type="ok", "Run canceled. No input lane directory provided."), silent=TRUE)
515 |             q(save="no")
516 |         }
517 |         
518 |         #Output diretory: Attempt to derive batch information from file name, prompt user if unsuccessful
519 |         file1 <- list.files(lane_dir, full.names=TRUE, pattern=cov_file_pattern, recursive=TRUE)[1]
520 |         batch.regex <- "__(B*[0-9]+)"
521 |         batch.match <- regexec(batch.regex, file1)[[1]]
522 |         batch <- substring(file1, batch.match[2], batch.match[2] + attr(batch.match, "match.length")[2] - 1)
523 |         if((infer.batch.for.sub.out_dir == FALSE) || is.na(batch)) {
524 |             out_dir <- choose.dir(caption = "Select an output directory:", default = out_dir)
525 |             batch   <- basename(out_dir)
526 |         } else {
527 |             out_dir <- paste(out_dir, batch, sep="/")
528 |         }
529 |         if(is.na(out_dir)) {
530 |             try( winDialog.nonint(type="ok", "Run canceled. No output directory provided."), silent=TRUE)
531 |             q(save="no")
532 |         }
533 | 
534 |         #If output directory already exists, prompt user to overwrite
535 |         if((clobber.output.directory == FALSE) & (file.exists(out_dir))) {
536 |             overwrite <- try( winDialog.nonint(type="yesno", "Output directory already exists. Overwrite?"), silent=TRUE)
537 |             if(overwrite == "NO") {
538 |                 shell(paste(explorer_file, out_dir, sep=" "), wait=FALSE)
539 |                 q(save="no")
540 |             }
541 |         }
542 | 		
543 | 		interval_list_dir <-arguments[1]
544 | 	}
545 | } else if(length(arguments) == 3) {
546 |     #Uses provided command line arguments
547 |     lane_dir <- arguments[1]
548 |     out_dir  <- arguments[2]
549 |     viscap.cfg <- arguments[3]
550 |     batch    <- basename(out_dir)
551 | } else {
552 |     #Usage statement
553 |     try( winDialog.nonint(type="ok", "Usage: VisCap.R lane_directory output_directory interval_lists_directory"), silent=TRUE)
554 |     q(save="no")
555 | }
556 | 
557 | ######
558 | #Main#
559 | ######
560 | 
561 | # Read coverage tables
562 | mat.all <- make_matrix_from_cov_files(lane_dir, cov_file_pattern, cov_field)
563 | mat.all[which(mat.all==0)]=0.00001
564 | 
565 | # Read interval name files
566 | interval_lookup <- load_interval_names(interval_list_dir, interval_file_pattern)
567 | 
568 | # Sort matrix by genome coordinates found in rownames
569 | chroms <- c(1:22,"X","Y", "MT", "M")
570 | chroms <- factor(chroms, levels=chroms, labels=chroms, ordered=TRUE)
571 | coords <- matrix(unlist(strsplit(rownames(mat.all), ":|-")), ncol=3, byrow=TRUE, dimnames=list(rownames(mat.all)))
572 | coords <- coords[order(match(coords[,1], chroms), as.numeric(coords[,2]), as.numeric(coords[,3])),]
573 | mat <- mat.all[rownames(coords),]
574 | 
575 | #Iteratively run VisCap algorithm, removing bad samples after each run
576 | if(iterative.calling.limit == 0) {
577 |     iterative.calling.limit <- dim(mat)[2]
578 | }
579 | 
580 | for(iteration in 1:iterative.calling.limit) {
581 |     if(iterative.calling.limit == 1) {
582 |         out_dir.iteration <- out_dir
583 |     } else {
584 |         out_dir.iteration <- paste(out_dir, "/", batch, "_run", iteration, sep="")
585 |     }
586 |     dir.create(out_dir.iteration, showWarnings = FALSE, recursive=TRUE)
587 |     
588 |     # Normalize exon coverage by exon
589 |     nmat_badX <- log2(divide_by_batch_median(mat))
590 |     nmat <- rescale_chrX(nmat_badX, ylimits, iqr_multiplier, out_dir.iteration)
591 | 
592 |     # Call cnvs then plot heatmaps by chromosome and per-sample exon coverage
593 |     heatmap_by_chrom(nmat, "QC_cnv_heatmap", ylimits, out_dir.iteration)
594 |     cnv_bplot_and_calls <- call_cnvs(nmat, ylimits, interval_lookup, threshold.min_exons, iqr_multiplier, threshold.cnv_log2_cutoffs, out_dir.iteration)
595 |     exon_plot_per_sample(nmat, ylimits, interval_lookup, cnv_bplot_and_calls, out_dir.iteration)
596 | 
597 |     # Write out matrix of log2 ratios to file
598 |     outfile <- paste(out_dir.iteration, "/", "log2_ratio_table", ".xls", sep="")
599 |     gene_exon <- annotate_interval_names(rownames(nmat), interval_lookup)
600 |     nmat.with.gene_exon <- cbind(gene_exon, nmat)
601 |     write.table(nmat.with.gene_exon, outfile, , quote = FALSE, sep = "\t", col.names = NA)
602 | 
603 |     #Write out VisCap run information
604 |     run_info_table <- matrix(ncol = 2, byrow=TRUE, data = c(
605 |             "Date",                                         date(),
606 |             "VisCap command",                               paste(commandArgs(), collapse=" "),
607 |             "Subversion revision information",              svn.revision,
608 |             "Batch directory",                              lane_dir,      
609 |             "Coverage file pattern",                        cov_file_pattern,
610 |             "Field within coverage file",                   cov_field,              
611 |             "Output directory",                             out_dir.iteration,                
612 |             "Interval name lookup files",                   interval_list_dir,
613 |             "Interval name lookup file pattern",            interval_file_pattern,
614 |             "Exons used for CNV detection",                 dim(nmat)[1],
615 |             "Samples used for CNV detection",               dim(nmat)[2],
616 |             "Samples not used for CNV detection",           paste(c(colnames(mat.all)[!(colnames(mat.all) %in% colnames(nmat))], ""), sep=","),
617 |             "Plot y-axis limits",                           paste(ylimits, collapse=","),
618 |             "Minimum consecutive exons to call CNV",        threshold.min_exons,
619 |             "IQR multiplier used for boxplots",             iqr_multiplier,
620 |             "Static log2 ratio thresholds to call CNVs",    paste(threshold.cnv_log2_cutoffs, collapse=","),
621 |             "Iteration",                                    iteration
622 |             ))
623 |     run_info_outfile <- paste(out_dir.iteration, "/", "VisCap_run_info", ".xls", sep="")
624 |     write.table(run_info_table, run_info_outfile, quote = FALSE, sep = "\t", col.names = FALSE, row.names = FALSE)
625 |     save.image(file=paste(out_dir.iteration, "/", "session", ".Rdata", sep=""))
626 | 
627 |     #Remove failed samples from matrix for next run
628 |     passes <- cnv_bplot_and_calls[[1]]$names[cnv_bplot_and_calls[[1]]$qc == "PASS"]
629 |     if(length(passes) == dim(mat)[2]) {
630 |         break
631 |     } else {
632 |         #Restrict mat only to samples that pass boxplot qc
633 |         mat <- mat[,passes]
634 |     }
635 | }
636 | 
637 | # Open output directory and quit R
638 | if((dev_dir == FALSE) && (length(arguments) == 0)) {
639 |     shell(paste(explorer_file, out_dir, sep=" "), wait=FALSE)
640 | }
641 | quit(save="no")
642 | 


--------------------------------------------------------------------------------