├── VisCap.cfg ├── README.md └── VisCap.R /VisCap.cfg: -------------------------------------------------------------------------------- 1 | #VisCap configuration file for Linux users 2 | 3 | 4 | interval_list_dir <- "/path/to/dir/bed" # directory where the interval file is stored 5 | #explorer_file <- "C:\\Windows\\explorer.exe" 6 | cov_file_pattern <- "sample_interval_summary$" # this is the extension of the coverage file from DepthOfCoverage 7 | cov_field <- "_total_cvg" # field to be read from coverage file 8 | interval_file_pattern <- "interval_file.bed$" # interval filename 9 | ylimits <- c(-2, 2) 10 | iqr_multiplier <- 3 11 | threshold.min_exons <- 1 12 | threshold.cnv_log2_cutoffs <- c(-0.55, 0.40) 13 | iterative.calling.limit <- 0 #Set to 0 to iterate until all failed samples are removed 14 | infer.batch.for.sub.out_dir <- TRUE #Set to FALSE to prompt users for output directory 15 | clobber.output.directory <- FALSE #Set to FALSE to stop run when output directory already exists 16 | 17 | #Setting a path to data skips prompts. Set to FALSE for deployment. 18 | dev_dir <- FALSE 19 | 20 | 21 | 22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Documentation 2 | For updated and more detailed documentation, please refer to our Github Wiki page at https://github.com/pughlab/VisCap/wiki 3 | 4 | # Introduction 5 | VisCap is a copy number detection and visualization tool written in R (www.r-project.org) for analysis of next-generation sequencing data derived from hybrid-capture experiments. It requires installation of the R libraries "gplots", "zoo", and “cluster”. We reported validation of this algorithm for detection of germline variation in Genetics in Medicine: 6 | 7 | VisCap: inference and visualization of germ-line copy-number variants from targeted clinical sequencing data. 8 | Pugh TJ, Amr SS, Bowser MJ, Gowrisankar S, Hynes E, Mahanta LM, Rehm HL, Funke B, Lebo MS. 9 | Genet Med. 2015 Dec 17. doi: 10.1038/gim.2015.156. [Epub ahead of print] 10 | PMID: 26681316 11 | www.ncbi.nlm.nih.gov/pubmed/26681316 12 | 13 | # Input 14 | As input, VisCap reads interval summary files generated by the Genome Analysis Toolkit (GATK)"DepthOfCoverage" tool (www.broadinstitute.org/gatk). For each sample, these files contain a summary of the total coverage of each genome region interval listed in a reference interval list. For our purposes, this interval list contains the coordinates of each target (usually an exon) captured by a specific bait set. Importantly, the bed file provided to GATK DepthOfCoverage must not contain any overlapping intervals as these are merged into single intervals resulting in a DepthOfCoverage file containing few intervals than the original bed file. 15 | 16 | # Algorithm 17 | The initial step in the VisCap program is to generate a matrix of all intervals captured (~1,383 for OtoGenome) and the fractional of total coverage assigned to these interval within each sample from a batch (10 samples for production runs). When different interval lists are used to make the coverage files for samples within a batch, VisCap will use only the intervals common to all samples. Therefore, inclusion of even a single coverage file with a small number of targets will result in only these targets being considered across the entire batch. To normalize fractional coverage values for copy number detection, each is divided by the median for that target across the entire batch. These are stored as a matrix of log2 ratios and written to a file (log2_ratio_table.xls). 18 | 19 | The X-chromosome requires further normalization as there are significant fractional coverage differences between males and females. Depending on the balance of males and females in the batch, males may display single-copy loss of chromosome X or females may display a gain. These patterns are evident from two clusters of boxplots. The clusters are detected informatically by removing outlier probes and then partitioning all samples around two medoids, a more robust version of K-means clustering. Each cluster is then normalized to zero by subtracting the median of each cluster from each sample. To facilitate review of this procedure, boxplots of log2 ratios are generated before and after subtraction of the cluster medians (QC_chrX_pre-scale.pdf and QC_chrX_post-scale.pdf). 20 | 21 | Boxplots are constructed using the standard "boxplot()" command within R. This provides a visual representation of the distribution of log2 ratios from each sample (QC_cnv_boxplot.pdf) as well as a numeric summary used for subsequent thresholding and quality control (QC_cnv_boxplots.xls). The boxplot output summary includes: upper whisker, Q3, median, Q1, and lower whisker. Upper (lower) whiskers are the greatest (lowest) values that fall within Q3 (Q1) plus a quantity (default: 3) times the interquartile range (Q3 - Q1). A sample fails quality control if either of the boxplot whiskers extends beyond the expected theoretical log2 ratio for a single-copy gain (0.58) or a single-copy loss (-1). Failed samples are identified in the boxplot summary output, and VisCap automatically repeats the analysis until all samples pass quality control. The complete set of output files is generated for each iteration and stored in separate folders. For further quality control, log2 ratios across the entire batch are visualized in a single heatmap-based output (QC_cnv_heatmap.pdf). 22 | 23 | The thresholding strategy for determining if a log2 ratio represents a gain or a loss is dependent on the boxplot whiskers and a set of fixed thresholds. These fixed thresholds were established based on the analysis of positive control samples (see validation report) and they represent the minimum log2 ratio for gains (0.40) and maximum log2 ratio for losses (-0.55) at which a call is made. A copy number variant is called when a log2 ratio exceeds both the upper (lower) fixed threshold and the upper (lower) whisker for the sample. The minimum number of consecutive exons with log2 ratios outside of these thresholds to call a copy number variant can also be set by the user (default: 1). 24 | 25 | # Output 26 | For each sample, candidate copy number variants are output as a text data table (*.cnvs.xls). All calls from all intervals across all samples are also written to a file (cnv_boxplot_outliers.xls, losses -1, copy-neutral 0, gains 1). To facilitate visual review of the data for each sample, log2 ratios are plotted by relative genome order (i.e. rank order, not to scale on genome) and data points supporting copy number variants are color-coded (gains: red, losses: blue). This plot is overlaid with guidelines depicting the two sets of thresholds used for copy number variation (whiskers: dashed, fixed: solid black) and the theoretical log2 ratios (solid light gray) for single-copy gains and single-copy losses. 27 | -------------------------------------------------------------------------------- /VisCap.R: -------------------------------------------------------------------------------- 1 | # VisCap: Visualize normalized capture coverage 2 | # Generates heatmap of exon coverage from a directory of sample_interval_summary files 3 | # By Trevor Pugh, March 2012 - April 2013 4 | 5 | ########### 6 | #Libraries# 7 | ########### 8 | 9 | library("cluster") 10 | library("gplots") 11 | library("zoo") 12 | 13 | ########## 14 | #Defaults# 15 | ########## 16 | 17 | svn.revision <- "$Id: VisCap.R 1372 2013-04-25 13:53:52Z tp908 $" 18 | source('VisCap.cfg') 19 | #lane_dir <- "\\\\rfanfs.research.partners.org\\NGS" 20 | #out_dir <- "\\\\rfanfs.research.partners.org\\gigpad_clinical\\Facilities\\Laboratory of Molecular Medicine_4171303\\VisCap_Outputs" 21 | 22 | #lane_dir <- "\\\\rfanfs.research.partners.org\\NGS" 23 | #out_dir <- "\\\\rfanfs.research.partners.org\\gigpad_clinical\\Facilities\\Laboratory of Molecular Medicine_4171303\\VisCap_Outputs" 24 | # interval_list_dir <- "\\\\Sfa6\\lmm$\\DEVELOPMENT\\ACTIVE_DEVELOPMENT\\NEXT_GEN_COPY_NUMBER\\VisCap\\interval_lists" 25 | #explorer_file <- "C:\\Windows\\explorer.exe" 26 | #cov_file_pattern <- ".target.cov.sample_interval_summary$" 27 | #cov_field <- "_total_cvg" 28 | #interval_file_pattern <- ".interval_list$" 29 | #ylimits <- c(-2, 2) 30 | #iqr_multiplier <- 3 31 | #threshold.min_exons <- 1 32 | #iterative.calling.limit <- 0 #Set to 0 to iterate until all failed samples are removed 33 | #infer.batch.for.sub.out_dir <- TRUE #Set to FALSE to prompt users for output directory 34 | #clobber.output.directory <- FALSE #Set to FALSE to stop run when output directory already exists 35 | 36 | #Setting a path to data skips prompts. Set to FALSE for deployment. 37 | #dev_dir <- "\\\\Sfa6\\lmm$\\DEVELOPMENT\\ACTIVE_DEVELOPMENT\\NEXT_GEN_COPY_NUMBER\\VisCap\\test_data" 38 | dev_dir <- FALSE 39 | 40 | ########### 41 | #Functions# 42 | ########### 43 | 44 | #Modified winDialog function to run in non-interactive mode 45 | winDialog.nonint <- function (type = c("ok", "okcancel", "yesno", "yesnocancel"), message) 46 | { 47 | #if (!interactive()) 48 | # stop("winDialog() cannot be used non-interactively") 49 | type <- match.arg(type) 50 | res <- .Internal(winDialog(type, message)) 51 | if (res == 10L) 52 | return(invisible(NULL)) 53 | c("NO", "CANCEL", "YES", "OK")[res + 2L] 54 | } 55 | 56 | make_matrix_from_cov_files <- function(lane_dir, cov_file_pattern, cov_field) { 57 | #Join all cov files into a single matrix 58 | filenames <- list.files(lane_dir, full.names=TRUE, pattern=cov_file_pattern, recursive=TRUE) 59 | print(filenames) 60 | for(file in filenames) { 61 | tab <- read.table(file, header=TRUE, sep ="\t", row.names=1) 62 | col_name <- colnames(tab)[grep(cov_field, colnames(tab))] 63 | 64 | #Make new matrix labelled with appropriate identifier 65 | #new_head <- gsub(cov_field, "", col_name) #sample identifier only 66 | new_head <- basename(gsub(cov_file_pattern, "", file)) #derive name from original filename 67 | new <- matrix(tab[,col_name], dimnames=list(rownames(tab), new_head) ) 68 | 69 | #If this is the first file, make a new matrix. Otherwise, join it to the existing matrix using genome coordinates 70 | if(file == filenames[1]) { 71 | mat.cov <- new 72 | } else { 73 | mrg <- merge(new, mat.cov, by = "row.names", all = TRUE) 74 | mat.cov <- as.matrix(mrg[-1]) 75 | rownames(mat.cov) <- mrg[,1] 76 | } 77 | } 78 | 79 | #Remove any rows with NA entries 80 | mat.cov <- mat.cov[complete.cases(mat.cov),] 81 | #TODO: Provide warning if n% of probes are discarded (i.e. multiple panels are present) 82 | 83 | # Extract fraction of total coverage assigned to each exon in each sample 84 | # Note two possible methods, depending on preference for X-chromosome handling. 85 | # mat.cov.totals <- colSums(mat.cov) #use X-chromosome for normalization 86 | mat.cov.totals <- colSums(mat.cov[grep("X",rownames(mat.cov), invert=TRUE),]) # do not use X-chromosome for normalization 87 | mat.frac_cov <- sweep(mat.cov, 2, mat.cov.totals, "/") 88 | 89 | #Order columns alphanumerically 90 | mat.frac_cov <- mat.frac_cov[,order(colnames(mat.frac_cov))] 91 | 92 | return(mat.frac_cov) 93 | } 94 | 95 | load_interval_names <- function(interval_list_dir, interval_file_pattern) { 96 | #Read interval list files and make lookup_table for genome coords and interval names 97 | lookup = c() 98 | filenames = list.files(interval_list_dir, full.names=TRUE, pattern=interval_file_pattern, recursive=FALSE) 99 | for(file in filenames) { 100 | tab = read.table(file, header=FALSE, comment.char = "@", sep="\t", stringsAsFactors=FALSE) 101 | if(ncol(tab) < 4) 102 | {stop("The interval file should have at least 4 columns, chromosome, start, end, and interval name, all tab separated.")} 103 | colnames(tab) <- c("chr", "start", "end", "interval_name") 104 | rownames(tab) <- paste(tab$chr, ":", tab$start, "-", tab$end, sep = "") 105 | tab$interval_file <- file 106 | lookup <- rbind(lookup, tab) 107 | } 108 | return(lookup) 109 | } 110 | 111 | annotate_interval_names <- function(coords, interval_lookup) { 112 | #Add name column to mat containing interval name from lookup table, if provided 113 | if(is.null(interval_lookup)) { 114 | labels <- coords 115 | } else { 116 | matches <- match(coords, rownames(interval_lookup)) 117 | labels <- interval_lookup[matches, "interval_name"] 118 | } 119 | return(labels) 120 | } 121 | 122 | divide_by_batch_median <- function(mat) { 123 | #Fractional coverage normalization by median across each exon 124 | rmeds <- apply(mat, 1, median) ## calculate row medians 125 | mat_norm <- sweep(mat, 1, rmeds, "/") ## divide each entry by row median 126 | mat_norm <- mat_norm[complete.cases(mat_norm),] ##remove NA entries 127 | return(mat_norm) 128 | } 129 | 130 | heatmap_by_chrom <- function(mat, analysis_name, ylimits, out_dir) { 131 | #Limit matrix for plotting 132 | mat[mat < min(ylimits)] <- min(ylimits) 133 | mat[mat > max(ylimits)] <- max(ylimits) 134 | 135 | # Plot heatmap for each chromosome, compare each exon's value/median ratio across samples, save to files 136 | pdf(file=paste(out_dir, "/", analysis_name, ".pdf", sep="")) 137 | 138 | #Per-chromosome plots 139 | chroms = c("all", 1:22,"X","Y", "MT", "M") 140 | for(chr in chroms){ 141 | if(chr == "all") { 142 | main_title <- "All Chromosomes" 143 | matches <- rownames(mat) 144 | } else { 145 | main_title <- paste("Chromosome",chr) 146 | matches <- grep(paste("^",chr,":",sep=""), rownames(mat)) 147 | } 148 | if(length(matches) > 1) { 149 | par("cex.main" = 0.5) 150 | #Set heatmap color scale 151 | steps = 100 152 | color_scale = bluered(steps) 153 | color_breaks = seq(ylimits[1], ylimits[2], by=(ylimits[2] - ylimits[1])/steps) 154 | #color_breaks = seq(min(mat, na.rm = TRUE), max(mat, na.rm = TRUE), length.out=steps+1) 155 | 156 | #exons by samples 157 | #heatmap.2(mat[matches,], Rowv=NA, Colv=NA, scale="none", cexCol=0.5, cexRow=0.3, col=color_scale, breaks=color_breaks, dendrogram="none", main=title, symbreaks=TRUE, trace="none") 158 | #samples vs exons 159 | xlabels <- gsub("__", "\n", colnames(mat)) 160 | heatmap.2( 161 | t(mat[matches,]), 162 | 163 | Rowv = NA, 164 | Colv = NA, 165 | scale = "none", 166 | cexCol = 0.5, 167 | cexRow = 0.5, 168 | col = color_scale, 169 | breaks = color_breaks, 170 | dendrogram = "none", 171 | symbreaks = TRUE, 172 | trace = "row", 173 | tracecol = "black", 174 | main = main_title, 175 | labRow = xlabels, 176 | labCol = NA 177 | ) 178 | } 179 | } 180 | dev.off() 181 | } 182 | 183 | score_boxplot <- function(bplot, llim= log2(1/2), ulim=log2(3/2)) { 184 | fail_lower_thresh <- bplot$stats[1,] < llim 185 | fail_upper_thresh <- bplot$stats[5,] > ulim 186 | fail <- as.logical(fail_lower_thresh + fail_upper_thresh) 187 | qc_string <- gsub(TRUE, "FAIL", gsub(FALSE, "PASS", fail)) 188 | return(qc_string) 189 | } 190 | 191 | boxplot_cnv_matrix <- function(nmat, bplot_name, out_dir, ylimits, iqr_multiplier) { 192 | plot_nmat <- nmat 193 | plot_nmat[nmat < min(ylimits)] <- min(ylimits) 194 | plot_nmat[nmat > max(ylimits)] <- max(ylimits) 195 | pdf(file=paste(out_dir, "/", bplot_name, ".pdf", sep="")) 196 | par(las=2, mar=c(12,4,4,2)) 197 | xlabels = sub("__", "\n", colnames(plot_nmat)) 198 | bplot <- boxplot(plot_nmat, range=iqr_multiplier, ylim=ylimits, srt=90, pch=16, cex.axis=0.6, names=xlabels, ylab="Log2 ratio sample/batch median") 199 | bplot$names <- sub("\n", "__", bplot$names) 200 | dev.off() 201 | 202 | #Write boxplot information to separate file 203 | bplot$qc <- score_boxplot(bplot) 204 | bplot.out <- cbind(round(t(bplot$stats), 3), bplot$qc) 205 | rownames(bplot.out) <- bplot$names 206 | colnames(bplot.out) <- c("del_threshold", "Q1", "median", "Q3", "amp_threshold", "qc") 207 | write.table(bplot.out, paste(out_dir, "/", bplot_name, ".xls", sep=""), quote = FALSE, sep = "\t", col.names = NA) 208 | return(bplot) 209 | } 210 | 211 | filter_cnvs <- function(segs, threshold.min_exons, threshold.cnv_log2_cutoffs) { 212 | #Apply consecutive exon filter 213 | fsegs <- segs[(segs[,"CNV"] != 0) & (segs[,"Interval_count"] >= threshold.min_exons), , drop = FALSE] 214 | 215 | #Apply zero in normal range filter 216 | fsegs <- fsegs[(as.numeric(fsegs[,"Loss_threshold"]) < 0) & (as.numeric(fsegs[,"Gain_threshold"]) > 0),, drop = FALSE] 217 | 218 | #Apply hard CNV threshold 219 | fsegs <- fsegs[(as.numeric(fsegs[,"Median_log2ratio"]) < min(threshold.cnv_log2_cutoffs)) | (as.numeric(fsegs[,"Median_log2ratio"]) > max(threshold.cnv_log2_cutoffs)),, drop = FALSE] 220 | 221 | return(fsegs) 222 | } 223 | 224 | call_cnvs <- function(nmat, ylimits, interval_lookup, threshold.min_exons, iqr_multiplier, threshold.cnv_log2_cutoffs, out_dir) { 225 | #Plot boxplots to visualize ranges used to detect copy number variation 226 | bplot <- boxplot_cnv_matrix(nmat, "QC_cnv_boxplot", out_dir, ylimits, iqr_multiplier) 227 | 228 | #Set outlier thresholds by distribution test 229 | #Use boxplot$stats so whiskers in pdf accurately represent thresholds used 230 | lbound <- round(bplot$stats[1,], 3) 231 | ubound <- round(bplot$stats[5,], 3) 232 | batch_size <- dim(bplot$stats)[2] 233 | 234 | #Use hard-threshold, if it is greater than boxplot thresholds 235 | lbound[lbound > min(threshold.cnv_log2_cutoffs)] <- min(threshold.cnv_log2_cutoffs) 236 | ubound[ubound < max(threshold.cnv_log2_cutoffs)] <- max(threshold.cnv_log2_cutoffs) 237 | 238 | #Make matrix of thresholds 239 | lbound_mat <- matrix(rep(lbound, dim(nmat)[1]), ncol=length(lbound), byrow=TRUE, dimnames=list(rownames(nmat),colnames(nmat))) 240 | ubound_mat <- matrix(rep(ubound, dim(nmat)[1]), ncol=length(lbound), byrow=TRUE, dimnames=list(rownames(nmat),colnames(nmat))) 241 | 242 | #Flag outliers 243 | nmat_loutliers <- (nmat < lbound_mat) + 0 #Adding zero converts TRUE/FALSE to 1/0 244 | nmat_uoutliers <- (nmat > ubound_mat) + 0 #Adding zero converts TRUE/FALSE to 1/0 245 | #TODO: Estimate number of copies gained or lost (i.e. support -2 and +n) 246 | 247 | #Make tracking matrix of all zero values then subtract copies lost and add copies gained 248 | nmat_cnvs <- matrix(data = 0, nrow = dim(nmat)[1], ncol = dim(nmat)[2], dimnames = list(rownames(nmat),colnames(nmat))) 249 | nmat_cnvs <- nmat_cnvs - nmat_loutliers + nmat_uoutliers 250 | gene_exon <- annotate_interval_names(rownames(nmat_cnvs), interval_lookup) 251 | write.table(cbind.data.frame(gene_exon, nmat_cnvs), paste(out_dir, "/", "cnv_boxplot_outliers", ".xls", sep=""), quote = FALSE, sep = "\t", col.names = NA) 252 | 253 | #Merge consecutive calls and write out to file 254 | all_fsegs <- c() 255 | for(id in colnames(nmat_cnvs)) { 256 | segs <- c() 257 | segs.header <- c("Sample", "CNV", "Genome_start_interval", "Genome_end_interval", "Start_interval","End_interval", "Interval_count", "Min_log2ratio", "Median_log2ratio", "Max_log2ratio", "Loss_threshold", "Gain_threshold", "Batch_size") 258 | chroms = c(1:22,"X","Y", "MT", "M") 259 | for(chr in chroms){ 260 | matches <- grep(paste("^",chr,":",sep=""), rownames(nmat)) 261 | if(length(matches) > 1) { 262 | #Segmentation: Detect runs of consecutive copy number calls 263 | rl <- rle(nmat_cnvs[matches, id]) 264 | values <- rl$values 265 | lengths <- rl$lengths 266 | starts <- c(rownames(nmat_cnvs[matches,])[1], names(rl$lengths[1:length(rl$lengths)-1])) 267 | ends <- names(rl$values) 268 | 269 | #Calculate rounded log2 ratios of intervals involved 270 | log2s <- lapply(1:length(starts), function(x) round(nmat[(match(starts[x], names(nmat[,id])):match(ends[x], names(nmat[,id]))), id], 3)) 271 | log2s_min <- unlist(lapply(log2s, min)) 272 | log2s_med <- unlist(lapply(log2s, median)) 273 | log2s_max <- unlist(lapply(log2s, max)) 274 | 275 | #Report genome coordinate range 276 | coordinates_part1 <- data.frame(strsplit(starts, "-"), stringsAsFactors = FALSE)[1,] 277 | coordinates_part2 <- data.frame(strsplit(ends, "-"), stringsAsFactors = FALSE)[2,] 278 | coordinates <- paste(coordinates_part1, coordinates_part2, sep="-") 279 | 280 | #Lookup interval names 281 | start_names <- annotate_interval_names(starts, interval_lookup) 282 | end_names <- annotate_interval_names(ends, interval_lookup) 283 | 284 | #Add threshold columns, ensure consistent header 285 | segs <- rbind(segs, cbind(rep(id, length(values)), values, starts, ends, start_names, end_names, lengths, log2s_min, log2s_med, log2s_max, lbound_mat[1,id], ubound_mat[1,id], batch_size)) 286 | colnames(segs) <- segs.header 287 | } 288 | } 289 | 290 | #Handle case where there are no cnvs on any chromosomes in any samples 291 | if(is.null(segs)) { 292 | segs <- matrix(data=rep(NA,length(segs.header)), ncol=length(segs.header), dimnames=list("", segs.header))[-1,,drop=FALSE] 293 | } 294 | 295 | #Filter segments 296 | fsegs <- filter_cnvs(segs, threshold.min_exons, threshold.cnv_log2_cutoffs) 297 | 298 | #Convert CNV type values to text 299 | fsegs[,"CNV"] <- gsub("^1$", "Gain", fsegs[,"CNV"]) 300 | fsegs[,"CNV"] <- gsub("^-1$", "Loss", fsegs[,"CNV"]) 301 | 302 | #Convert infinite values to large, numerical value 303 | large_value <- "10" 304 | num_cols <- c("Min_log2ratio", "Median_log2ratio", "Max_log2ratio", "Loss_threshold", "Gain_threshold") 305 | fsegs[,num_cols] <- gsub("Inf", large_value, fsegs[,num_cols]) 306 | 307 | #Number CNVs for later labeling on visual output 308 | if(dim(fsegs)[1] > 0) { 309 | CNV_id <- seq(1, dim(fsegs)[1]) 310 | } else { 311 | CNV_id <- c() 312 | } 313 | fsegs <- cbind(fsegs[,1, drop=FALSE], 314 | CNV_id, 315 | fsegs[,2:dim(fsegs)[2], drop=FALSE]) 316 | 317 | #Write output for each sample 318 | write.table(fsegs, paste(out_dir, "/", id, ".cnvs.xls", sep=""), quote = FALSE, sep = "\t", col.names = TRUE, row.names = FALSE) 319 | 320 | #Add fsegs to master fsegs tracking matrix 321 | all_fsegs <- rbind(all_fsegs, fsegs) 322 | } 323 | return(list(bplot, all_fsegs)) 324 | } 325 | 326 | space_label_names <- function(label_names) { 327 | spacer <- " " 328 | if(length(label_names) > 1) { 329 | alt_positions <- seq(from=2, to=length(label_names), by=2) 330 | label_names[alt_positions] <- paste(label_names[alt_positions], spacer) 331 | } 332 | return(label_names) 333 | } 334 | 335 | exon_plot_per_sample <- function(nmat, ylimits, interval_lookup, cnv_bplot_and_calls, out_dir) { 336 | #Extract thresholds and cnvs, determine intervals from genome coordinates 337 | nmat_bplot <- cnv_bplot_and_calls[[1]] 338 | nmat_fsegs <- cnv_bplot_and_calls[[2]] 339 | 340 | #Limit matrix for plotting 341 | nmat[nmat < ylimits[1]] <- ylimits[1] 342 | nmat[nmat > ylimits[2]] <- ylimits[2] 343 | 344 | for(name in colnames(nmat)) { 345 | #Get exon names from interval lists 346 | interval_names <- annotate_interval_names(rownames(nmat), interval_lookup) 347 | 348 | #Plot all intervals, then plot by chromosome 349 | pdf(file=paste(out_dir, "/", name, ".plot.pdf", sep="")) 350 | chroms <- c(".*", 1:22, "X", "Y") 351 | for(chr in chroms) { 352 | #Grep rownames with genome coordinates for desired chromosome, support chr1: and 1: formats 353 | matches <- grep(paste("(^chr", chr, "|^", chr, "):", sep=""), rownames(nmat)) 354 | if(length(matches) > 1) { 355 | nmat_chr <- nmat[matches, name, drop=FALSE] 356 | #Different title and axes labeling for All vs individual chromosomes 357 | if(chr == ".*") { 358 | title <- "All chromosomes" 359 | chrom_names <- sapply(rownames(nmat_chr), function(x) strsplit(x, ":")[[1]][1]) 360 | labels_rle <- rle(chrom_names) 361 | #TODO: Find better method for spacing chromosome labels 362 | labels_rle$values <- space_label_names(labels_rle$values) 363 | #labels_rle$values <- rep("", length(labels_rle$values-1)) 364 | labels_size <- 0.6 365 | draw_grid_lines <- FALSE 366 | label_called_CNVs <- FALSE 367 | } else { 368 | title <- paste("Chromosome", chr) 369 | #Convert genome coordinates to unique gene names 370 | exon_names <- annotate_interval_names(rownames(nmat_chr), interval_lookup) 371 | gene_names <- sapply(exon_names, function(x) strsplit(x, "_")[[1]][1]) 372 | #Replace NA gene_names with nearest non-NA value 373 | gene_names <- na.locf(gene_names) 374 | labels_rle <- rle(gene_names) 375 | labels_size <- 1 376 | draw_grid_lines <- TRUE 377 | label_called_CNVs <- TRUE 378 | } 379 | 380 | #Plot formatting and command 381 | shift <- -0.5 #shift for plotting gene guide lines 382 | par(pch=16) 383 | plot(nmat_chr, xlim=c(1 + shift,length(nmat_chr)), ylim=ylimits, main=title, ylab = "Log2 ratio sample/batch median", xaxt="n", xlab="") 384 | mtext(name, line=3, cex=0.5, adj=0) 385 | 386 | #Plot gene or chomosome grid lines 387 | section_ends <- (unlist(lapply(seq(1:length(labels_rle$lengths)), function(x) sum(labels_rle$lengths[1:x])))) 388 | section_starts <- c(1, section_ends[-length(section_ends)] + 1) 389 | grid_marks <- section_starts + shift 390 | grid_labels <- c(labels_rle$values) 391 | if(draw_grid_lines == TRUE) { 392 | abline(v=grid_marks, col="grey") 393 | } 394 | axis(1, las=3, at=grid_marks, labels=grid_labels, cex.axis=labels_size) 395 | 396 | #Plot guidelines and thresholds 397 | name_bplot <- nmat_bplot$stats[,nmat_bplot$names == name] 398 | abline(h=0, col="black") #zero line 399 | abline(h=c(log2(1/2), log2(3/2)), col="grey") #expected log2 ratios for single copy loss and gain 400 | abline(h=name_bplot[c(1,5)], lty=2, col="black") #boxplot thresholds 401 | abline(h=threshold.cnv_log2_cutoffs, col="black") #hard cnv log2 ratio thresholds 402 | 403 | #Mark data points outside of ylimits and scaled to fit plot 404 | off_scale <- as.numeric(which((nmat_chr <= ylimits[1]) | (nmat_chr >= ylimits[2]))) 405 | points(off_scale, nmat_chr[off_scale], pch=22) 406 | 407 | #Color CNV data points 408 | types <- matrix(byrow=TRUE, ncol=4, 409 | dimnames=list(c(), 410 | c("type", "col", "ypos.line", "ypos.label")), 411 | data=c( 412 | "Loss", "blue", -2, -2.1, 413 | "Gain", "red", 2, 2.1)) 414 | 415 | segs <- nmat_fsegs[(nmat_fsegs[,"Sample"] == name),,drop=FALSE] 416 | segs <- segs[(rownames(segs) %in% rownames(nmat_chr)),,drop=FALSE] #Filter to chromosome being plotted 417 | start_indexes <- which(rownames(nmat_chr) %in% segs[,"Genome_start_interval"]) 418 | end_indexes <- which(rownames(nmat_chr) %in% segs[,"Genome_end_interval"]) 419 | if(length(start_indexes) > 0) { 420 | segs <- cbind(segs, types[match(segs[,"CNV"], types[,"type"]),,drop=FALSE]) 421 | for(i in seq(1:length(start_indexes))) { 422 | indexes <- seq(start_indexes[i], end_indexes[i]) 423 | #Plot colored points 424 | points(indexes, nmat_chr[indexes], col=segs[i, "col"]) 425 | 426 | #Label called CNVs with identifier 427 | if(label_called_CNVs == TRUE) { 428 | segments(start_indexes[i] - 0.25, 429 | as.numeric(segs[i, "ypos.line"]), 430 | end_indexes[i] + 0.25, 431 | as.numeric(segs[i, "ypos.line"]), 432 | col="orange", lwd=5) 433 | text(mean(c(start_indexes[i], end_indexes[i])), 434 | as.numeric(segs[i, "ypos.label"]), 435 | segs[i, "CNV_id"], 436 | col="orange") 437 | } 438 | } 439 | } 440 | } 441 | } 442 | dev.off() 443 | } 444 | } 445 | 446 | rescale_chrX <- function(nmat_badX, ylimits, iqr_multiplier, out_dir) { 447 | #Isolate chromosome X, remove outlier probes 448 | nmatX <- nmat_badX[grep("^X",rownames(nmat_badX)),] 449 | if(dim(nmatX)[1] == 0) { 450 | return(nmat_badX) # return original matrix if no probes are found on X chromosome 451 | } 452 | 453 | #Assign cases to clusters and scale medians to zero 454 | # DEPRECATED METHOD: kmeans is mislead by outliers. multiple solutions possible. 455 | # clusters <- kmeans(t(nmatX),2, nstart=1) 456 | # cluster_med <- apply(clusters$centers, 1, median) 457 | # scaling_factors <- cluster_med[clusters$cluster] 458 | 459 | #CURRENT METHOD: "Partitioning Around Medoids" followed by calculation of cluster medians 460 | clusters <- pam(t(nmatX), 2, cluster.only=TRUE) 461 | cluster_med <- c() 462 | for(i in 1:max(clusters)) { 463 | cluster_med[i] <- median(nmatX[,clusters == i]) 464 | } 465 | scaling_factors <- cluster_med[clusters] 466 | nmatX_scaled <- sweep(nmatX, 2, scaling_factors, "-") ## subtract cluster median 467 | 468 | #Plot pre- and post-scaled values, write to file 469 | bplotX <- boxplot_cnv_matrix(nmatX, "QC_chrX_pre-scale", out_dir, ylimits, iqr_multiplier) 470 | bplotX_scaled <- boxplot_cnv_matrix(nmatX_scaled, "QC_chrX_post-scale", out_dir, ylimits, iqr_multiplier) 471 | 472 | #Infer sexes from two clusters of fractional coverage 473 | farthest_from_zero <- abs(cluster_med) == max(abs(cluster_med)) 474 | if(cluster_med[farthest_from_zero] > 0) { 475 | cluster_female <- farthest_from_zero 476 | } else { 477 | cluster_female <- !farthest_from_zero 478 | } 479 | sexes <- sapply(cluster_female[clusters], function(x) if(x){"Female"} else {"Male"}) 480 | names(sexes) <- names(clusters) 481 | write.table(sexes, paste(out_dir, "sexes.xls", sep="/") , quote = FALSE, sep = "\t", col.names = FALSE) 482 | 483 | #Overwrite unscaled X-chromosome values with new, scaled values 484 | nmat <- nmat_badX 485 | nmat[grep("^X",rownames(nmat)),] <- nmatX_scaled 486 | 487 | return(nmat) 488 | } 489 | 490 | remove_failed_samples <- function(mat, cnv_bplot_and_calls) { 491 | passes <- score_boxplot(cnv_bplot_and_calls) 492 | mat.trimmed <- mat[,passes] 493 | return(mat.trimmed) 494 | } 495 | 496 | ########### 497 | #Arguments# 498 | ########### 499 | 500 | #Argument collection and parsing 501 | arguments <- commandArgs(trailingOnly = TRUE) 502 | 503 | if(length(arguments) == 1) { 504 | if(dev_dir != FALSE) { 505 | #Skips prompts if dev_dir is set 506 | lane_dir <- dev_dir 507 | out_dir <- dev_dir 508 | } else { 509 | #Collect input and output information from user 510 | 511 | #Input directory 512 | lane_dir <- choose.dir(caption = "Select a lane directory (e.g. L001):", default = lane_dir) 513 | if(is.na(lane_dir)) { 514 | try( winDialog.nonint(type="ok", "Run canceled. No input lane directory provided."), silent=TRUE) 515 | q(save="no") 516 | } 517 | 518 | #Output diretory: Attempt to derive batch information from file name, prompt user if unsuccessful 519 | file1 <- list.files(lane_dir, full.names=TRUE, pattern=cov_file_pattern, recursive=TRUE)[1] 520 | batch.regex <- "__(B*[0-9]+)" 521 | batch.match <- regexec(batch.regex, file1)[[1]] 522 | batch <- substring(file1, batch.match[2], batch.match[2] + attr(batch.match, "match.length")[2] - 1) 523 | if((infer.batch.for.sub.out_dir == FALSE) || is.na(batch)) { 524 | out_dir <- choose.dir(caption = "Select an output directory:", default = out_dir) 525 | batch <- basename(out_dir) 526 | } else { 527 | out_dir <- paste(out_dir, batch, sep="/") 528 | } 529 | if(is.na(out_dir)) { 530 | try( winDialog.nonint(type="ok", "Run canceled. No output directory provided."), silent=TRUE) 531 | q(save="no") 532 | } 533 | 534 | #If output directory already exists, prompt user to overwrite 535 | if((clobber.output.directory == FALSE) & (file.exists(out_dir))) { 536 | overwrite <- try( winDialog.nonint(type="yesno", "Output directory already exists. Overwrite?"), silent=TRUE) 537 | if(overwrite == "NO") { 538 | shell(paste(explorer_file, out_dir, sep=" "), wait=FALSE) 539 | q(save="no") 540 | } 541 | } 542 | 543 | interval_list_dir <-arguments[1] 544 | } 545 | } else if(length(arguments) == 3) { 546 | #Uses provided command line arguments 547 | lane_dir <- arguments[1] 548 | out_dir <- arguments[2] 549 | viscap.cfg <- arguments[3] 550 | batch <- basename(out_dir) 551 | } else { 552 | #Usage statement 553 | try( winDialog.nonint(type="ok", "Usage: VisCap.R lane_directory output_directory interval_lists_directory"), silent=TRUE) 554 | q(save="no") 555 | } 556 | 557 | ###### 558 | #Main# 559 | ###### 560 | 561 | # Read coverage tables 562 | mat.all <- make_matrix_from_cov_files(lane_dir, cov_file_pattern, cov_field) 563 | mat.all[which(mat.all==0)]=0.00001 564 | 565 | # Read interval name files 566 | interval_lookup <- load_interval_names(interval_list_dir, interval_file_pattern) 567 | 568 | # Sort matrix by genome coordinates found in rownames 569 | chroms <- c(1:22,"X","Y", "MT", "M") 570 | chroms <- factor(chroms, levels=chroms, labels=chroms, ordered=TRUE) 571 | coords <- matrix(unlist(strsplit(rownames(mat.all), ":|-")), ncol=3, byrow=TRUE, dimnames=list(rownames(mat.all))) 572 | coords <- coords[order(match(coords[,1], chroms), as.numeric(coords[,2]), as.numeric(coords[,3])),] 573 | mat <- mat.all[rownames(coords),] 574 | 575 | #Iteratively run VisCap algorithm, removing bad samples after each run 576 | if(iterative.calling.limit == 0) { 577 | iterative.calling.limit <- dim(mat)[2] 578 | } 579 | 580 | for(iteration in 1:iterative.calling.limit) { 581 | if(iterative.calling.limit == 1) { 582 | out_dir.iteration <- out_dir 583 | } else { 584 | out_dir.iteration <- paste(out_dir, "/", batch, "_run", iteration, sep="") 585 | } 586 | dir.create(out_dir.iteration, showWarnings = FALSE, recursive=TRUE) 587 | 588 | # Normalize exon coverage by exon 589 | nmat_badX <- log2(divide_by_batch_median(mat)) 590 | nmat <- rescale_chrX(nmat_badX, ylimits, iqr_multiplier, out_dir.iteration) 591 | 592 | # Call cnvs then plot heatmaps by chromosome and per-sample exon coverage 593 | heatmap_by_chrom(nmat, "QC_cnv_heatmap", ylimits, out_dir.iteration) 594 | cnv_bplot_and_calls <- call_cnvs(nmat, ylimits, interval_lookup, threshold.min_exons, iqr_multiplier, threshold.cnv_log2_cutoffs, out_dir.iteration) 595 | exon_plot_per_sample(nmat, ylimits, interval_lookup, cnv_bplot_and_calls, out_dir.iteration) 596 | 597 | # Write out matrix of log2 ratios to file 598 | outfile <- paste(out_dir.iteration, "/", "log2_ratio_table", ".xls", sep="") 599 | gene_exon <- annotate_interval_names(rownames(nmat), interval_lookup) 600 | nmat.with.gene_exon <- cbind(gene_exon, nmat) 601 | write.table(nmat.with.gene_exon, outfile, , quote = FALSE, sep = "\t", col.names = NA) 602 | 603 | #Write out VisCap run information 604 | run_info_table <- matrix(ncol = 2, byrow=TRUE, data = c( 605 | "Date", date(), 606 | "VisCap command", paste(commandArgs(), collapse=" "), 607 | "Subversion revision information", svn.revision, 608 | "Batch directory", lane_dir, 609 | "Coverage file pattern", cov_file_pattern, 610 | "Field within coverage file", cov_field, 611 | "Output directory", out_dir.iteration, 612 | "Interval name lookup files", interval_list_dir, 613 | "Interval name lookup file pattern", interval_file_pattern, 614 | "Exons used for CNV detection", dim(nmat)[1], 615 | "Samples used for CNV detection", dim(nmat)[2], 616 | "Samples not used for CNV detection", paste(c(colnames(mat.all)[!(colnames(mat.all) %in% colnames(nmat))], ""), sep=","), 617 | "Plot y-axis limits", paste(ylimits, collapse=","), 618 | "Minimum consecutive exons to call CNV", threshold.min_exons, 619 | "IQR multiplier used for boxplots", iqr_multiplier, 620 | "Static log2 ratio thresholds to call CNVs", paste(threshold.cnv_log2_cutoffs, collapse=","), 621 | "Iteration", iteration 622 | )) 623 | run_info_outfile <- paste(out_dir.iteration, "/", "VisCap_run_info", ".xls", sep="") 624 | write.table(run_info_table, run_info_outfile, quote = FALSE, sep = "\t", col.names = FALSE, row.names = FALSE) 625 | save.image(file=paste(out_dir.iteration, "/", "session", ".Rdata", sep="")) 626 | 627 | #Remove failed samples from matrix for next run 628 | passes <- cnv_bplot_and_calls[[1]]$names[cnv_bplot_and_calls[[1]]$qc == "PASS"] 629 | if(length(passes) == dim(mat)[2]) { 630 | break 631 | } else { 632 | #Restrict mat only to samples that pass boxplot qc 633 | mat <- mat[,passes] 634 | } 635 | } 636 | 637 | # Open output directory and quit R 638 | if((dev_dir == FALSE) && (length(arguments) == 0)) { 639 | shell(paste(explorer_file, out_dir, sep=" "), wait=FALSE) 640 | } 641 | quit(save="no") 642 | --------------------------------------------------------------------------------