├── DESCRIPTION ├── LICENSE.txt ├── NAMESPACE ├── R ├── buildcodon.R ├── buildref.R ├── codondnds.R ├── dndscv-package.R ├── dndscv.R ├── fitlnpbin.R ├── geneci.R ├── genesetdnds.R ├── sitednds.R └── withingenednds.R ├── README.md ├── data ├── cancergenes_cgc81.rda ├── covariates_hg19.rda ├── covariates_hg19_chrx.rda ├── covariates_hg19_hg38_epigenome_pcawg.rda ├── dataset_normaloesophagus.rda ├── dataset_normalskin.rda ├── dataset_normalskin_genes.rda ├── dataset_simbreast.rda ├── dataset_tcgablca.rda ├── knownhotcodons_hg19.rda ├── knownhotspots_hg19.rda ├── refcds_GRCh38_hg38.rda ├── refcds_hg19.rda ├── submod_12r_3w.rda ├── submod_192r_3w.rda └── submod_2r_3w.rda ├── dndscv.Rproj ├── inst ├── doc │ ├── dNdScv.Rmd │ └── dNdScv.html └── extdata │ ├── BioMart_human_GRCh37_chr3_segment.txt │ ├── chr3_segment.fa │ ├── chr3_segment.fa.fai │ └── refcds_example_chr3_segment.rda ├── man ├── buildcodon.Rd ├── buildref.Rd ├── codondnds.Rd ├── dndscv-package.Rd ├── dndscv.Rd ├── fitlnpbin.Rd ├── geneci.Rd ├── genesetdnds.Rd ├── sitednds.Rd └── withingenednds.Rd └── vignettes ├── Ensembl_BioMart_screenshot1.png ├── Ensembl_BioMart_screenshot2.png ├── buildref.Rmd ├── buildref.html ├── dNdScv.Rmd ├── dNdScv.html ├── example_output_refcds.rda ├── sitednds.Rmd └── sitednds.html /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: dndscv 2 | Title: Poisson-based dN/dS models to quantify natural selection in somatic evolution 3 | Type: Package 4 | Version: 0.0.1.0 5 | Date: 2019-07-03 6 | Author: Inigo Martincorena 7 | Maintainer: Inigo Martincorena 8 | Description: This package contains functions for studying selection on coding sequences using a Poisson implementation of dN/dS. A Poisson model of dN/dS facilitates the study of selection beyond traditional codon models, including complex context-dependent mutation effects and selection on nonsense and splice site mutations. 
This model is best suited for resequencing studies, with very low density of mutations per base pair. The model was initially developed for cancer genome sequencing studies, and specific functions are provided to perform driver gene discovery using the dNdScv method on human cancer genomic data. 9 | Reference: Martincorena I et al, 2017. Universal patterns of selection in cancer and somatic tissues. Cell. https://www.ncbi.nlm.nih.gov/pubmed/29056346 10 | biocViews: 11 | Imports: 12 | seqinr, 13 | MASS, 14 | GenomicRanges, 15 | Biostrings, 16 | IRanges, 17 | MASS, 18 | Rsamtools, 19 | poilog, 20 | plyr 21 | LazyData: false 22 | License: GPL-3 23 | Encoding: UTF-8 24 | Suggests: knitr, rmarkdown 25 | VignetteBuilder: knitr 26 | RoxygenNote: 7.3.1 27 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(buildcodon) 4 | export(buildref) 5 | export(codondnds) 6 | export(dndscv) 7 | export(fitlnpbin) 8 | export(geneci) 9 | export(genesetdnds) 10 | export(sitednds) 11 | export(withingenednds) 12 | import(Biostrings) 13 | import(GenomicRanges) 14 | import(IRanges) 15 | import(MASS) 16 | import(Rsamtools) 17 | import(seqinr) 18 | -------------------------------------------------------------------------------- /R/buildcodon.R: -------------------------------------------------------------------------------- 1 | #' buildcodon 2 | #' 3 | #' This function takes a RefCDS object as input and adds to it two fields required to run the codondnds function. Usage: RefCDS = buildcodon(RefCDS) 4 | #' 5 | #' @author Inigo Martincorena (Wellcome Sanger Institute) 6 | #' @details Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041. 
7 | #' 8 | #' @param refcds Input RefCDS object 9 | #' @param numcode NCBI genetic code number (default = 1; standard genetic code). To see the list of genetic codes supported use: ? seqinr::translate 10 | #' 11 | #' @export 12 | 13 | buildcodon = function(refcds, numcode = 1) { 14 | 15 | ## 1. Valid chromosomes and reference CDS per gene 16 | message("Adding codon-level information to RefCDS to run codondnds...") 17 | 18 | nt = c("A","C","G","T") 19 | trinuc_list = paste(rep(nt,each=16,times=1), rep(nt,each=4,times=4), rep(nt,each=1,times=16), sep="") 20 | trinuc_ind = structure(1:64, names=trinuc_list) 21 | trinuc_subs = NULL; for (j in 1:length(trinuc_list)) { trinuc_subs = c(trinuc_subs, paste(trinuc_list[j], paste(substr(trinuc_list[j],1,1), setdiff(nt,substr(trinuc_list[j],2,2)), substr(trinuc_list[j],3,3), sep=""), sep=">")) } 22 | trinuc_subsind = structure(1:192, names=trinuc_subs) 23 | 24 | # Precalculating a 64x64 matrix with the functional impact of each codon transition (1=Synonymous, 2=Missense, 3=Nonsense) 25 | impact_matrix = array(NA, dim=c(64,64)) 26 | colnames(impact_matrix) = rownames(impact_matrix) = trinuc_list 27 | for (j in 1:64) { 28 | for (h in 1:64) { 29 | from_aa = seqinr::translate(strsplit(trinuc_list[j],"")[[1]], numcode = numcode) 30 | to_aa = seqinr::translate(strsplit(trinuc_list[h],"")[[1]], numcode = numcode) 31 | # Annotating the impact of the mutation 32 | if (to_aa == from_aa){ 33 | impact_matrix[j,h] = 1 34 | } else if (to_aa == "*"){ 35 | impact_matrix[j,h] = 3 36 | } else if ((to_aa != "*") & (from_aa != "*") & (to_aa != from_aa)){ 37 | impact_matrix[j,h] = 2 38 | } else if (from_aa=="*") { 39 | impact_matrix[j,h] = NA 40 | } 41 | } 42 | } 43 | 44 | # Adding two new fields to refcds containing the full vector of all site changes listing the rate parameters and aminoacid impact 45 | 46 | for (j in 1:length(refcds)) { 47 | 48 | cdsseq = as.character(as.vector(refcds[[j]]$seq_cds)) 49 | cdsseq1up = 
as.character(as.vector(refcds[[j]]$seq_cds1up)) 50 | cdsseq1down = as.character(as.vector(refcds[[j]]$seq_cds1down)) 51 | 52 | # Exonic mutations 53 | 54 | ind = rep(1:length(cdsseq), each=3) 55 | old_trinuc = paste(cdsseq1up[ind], cdsseq[ind], cdsseq1down[ind], sep="") 56 | new_base = c(sapply(cdsseq, function(x) nt[nt!=x])) 57 | new_trinuc = paste(cdsseq1up[ind], new_base, cdsseq1down[ind], sep="") 58 | codon_start = rep(seq(1,length(cdsseq),by=3),each=9) 59 | old_codon = paste(cdsseq[codon_start], cdsseq[codon_start+1], cdsseq[codon_start+2], sep="") 60 | pos_in_codon = rep(rep(1:3, each=3), length.out=length(old_codon)) 61 | aux = strsplit(old_codon,"") 62 | new_codon = sapply(1:length(old_codon), function(x) { new_codonx = aux[[x]]; new_codonx[pos_in_codon[x]] = new_base[x]; return(new_codonx) } ) 63 | new_codon = paste(new_codon[1,], new_codon[2,], new_codon[3,], sep="") 64 | 65 | imp = impact_matrix[(trinuc_ind[new_codon]-1)*64 + trinuc_ind[old_codon]] 66 | matrind = trinuc_subsind[paste(old_trinuc, new_trinuc, sep=">")] 67 | 68 | refcds[[j]]$codon_impact = imp 69 | refcds[[j]]$codon_rates = matrind 70 | 71 | if (round(j/1000)==(j/1000)) { message(sprintf(' %0.3g%% ...', round(j/length(refcds),2)*100)) } 72 | } 73 | 74 | return(refcds) 75 | 76 | } # EOF 77 | -------------------------------------------------------------------------------- /R/buildref.R: -------------------------------------------------------------------------------- 1 | #' buildref 2 | #' 3 | #' Function to build a RefCDS object from a reference genome and a table of transcripts. The RefCDS object has to be precomputed for any new species or assembly prior to running dndscv. This function generates an .rda file that needs to be input into dndscv using the refdb argument. Note that when multiple CDS share the same gene name (second column of cdsfile), the longest coding CDS will be chosen for the gene. CDS with ambiguous bases (N) will not be considered. 
4 | #' 5 | #' @author Inigo Martincorena (Wellcome Sanger Institute) 6 | #' @details Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041. 7 | #' 8 | #' @param cdsfile Path to the reference transcript table. 9 | #' @param genomefile Path to the indexed reference genome file. 10 | #' @param outfile Output file name (default = "RefCDS.rda"). 11 | #' @param numcode NCBI genetic code number (default = 1; standard genetic code). To see the list of genetic codes supported use: ? seqinr::translate 12 | #' @param excludechrs Vector or string with chromosome names to be excluded from the RefCDS object (default: no chromosome will be excluded). The mitochondrial chromosome should be excluded as it has different genetic code and mutation rates, either using the excludechrs argument or not including mitochondrial transcripts in cdsfile. 13 | #' @param onlychrs Vector of valid chromosome names (default: all chromosomes will be included) 14 | #' @param useids Combine gene IDs and gene names (columns 1 and 2 of the input table) as long gene names (default = F) 15 | #' 16 | #' @export 17 | 18 | buildref = function(cdsfile, genomefile, outfile = "RefCDS.rda", numcode = 1, excludechrs = NULL, onlychrs = NULL, useids = F) { 19 | 20 | ## 1. 
Valid chromosomes and reference CDS per gene 21 | message("[1/3] Preparing the environment...") 22 | 23 | reftable = read.table(cdsfile, header=1, sep="\t", stringsAsFactors=F, quote="\"", na.strings="-", fill=TRUE) # Loading the reference table 24 | colnames(reftable) = c("gene.id","gene.name","cds.id","chr","chr.coding.start","chr.coding.end","cds.start","cds.end","length","strand") 25 | reftable[,5:10] = suppressWarnings(lapply(reftable[,5:10], as.numeric)) # Convert to numeric 26 | 27 | # Checking for systematic absence of gene names (it happens in some BioMart inputs) 28 | longname = paste(reftable$gene.id, reftable$gene.name, sep=":") # Gene name combining the Gene stable ID and the Associated gene name 29 | if (useids==T) { 30 | reftable$gene.name = longname # Replacing gene names by a concatenation of gene ID and gene name 31 | } 32 | if (length(unique(reftable$gene.name))0) { 41 | validchrs = validchrs[validchrs %in% onlychrs] 42 | } 43 | 44 | # Restricting to chromosomes present in both the genome file and the CDS table 45 | if (any(validchrs %in% unique(reftable$chr))) { 46 | validchrs = validchrs[validchrs %in% unique(reftable$chr)] 47 | } else { # Try adding a chr prefix 48 | reftable$chr = paste("chr", reftable$chr, sep="") 49 | validchrs = validchrs[validchrs %in% unique(reftable$chr)] 50 | if (length(validchrs)==0) { # No matching chromosome names 51 | stop("No chromosome names in common between the genome file and the CDS table") 52 | } 53 | } 54 | 55 | 56 | # Removing genes that fall partially or completely outside of the available chromosomes/contigs 57 | 58 | reftable = reftable[reftable[,1]!="" & reftable[,2]!="" & reftable[,3]!="" & !is.na(reftable[,5]) & !is.na(reftable[,6]),] # Removing invalid entries 59 | reftable = reftable[which(reftable$chr %in% validchrs),] # Only valid chromosomes 60 | 61 | transc_gr = GenomicRanges::GRanges(reftable$chr, IRanges::IRanges(reftable$chr.coding.start,reftable$chr.coding.end)) 62 | chrs_gr = 
Rsamtools::scanFaIndex(genomefile) 63 | ol = as.data.frame(GenomicRanges::findOverlaps(transc_gr, chrs_gr, type="within", select="all")) 64 | 65 | # Issuing an error if any transcript falls outside of the limits of a chromosome. Possibly due to a mismatch between the assembly used for the reference table and the reference genome. 66 | if (length(unique(ol[,1])) < nrow(reftable)) { 67 | stop(sprintf("Aborting buildref. %0.0f rows in cdsfile have coordinates that fall outside of the corresponding chromosome length. Please ensure that you are using the same assembly for the cdsfile and genomefile",nrow(reftable)-length(unique(ol[,1])))) 68 | } 69 | 70 | reftable = reftable[unique(ol[,1]),] 71 | 72 | # Identifying genes starting or ending at the ends of a chromosome/contig 73 | # Because buildref and dndscv need to access the base before and after each coding position, genes overlapping the ends 74 | # of a contig will be trimmed by three bases and a warning will be issued listing those genes. 
75 | 76 | fullcds = intersect(reftable$cds.id[reftable$cds.start==1], reftable$cds.id[reftable$cds.end==reftable$length]) # List of complete CDS 77 | 78 | ol_start = as.data.frame(GenomicRanges::findOverlaps(transc_gr, chrs_gr, type="start", select="all"))[,1] # Genes overlapping contig starts 79 | if (any(ol_start)) { 80 | reftable[ol_start,"chr.coding.start"] = reftable[ol_start,"chr.coding.start"] + 3 # Truncate the first 3 bases 81 | reftable[ol_start,"cds.start"] = reftable[ol_start,"cds.start"] + 3 # Truncate the first 3 bases 82 | } 83 | 84 | ol_end = as.data.frame(GenomicRanges::findOverlaps(transc_gr, chrs_gr, type="end", select="all"))[,1] # Genes overlapping contig ends 85 | if (any(ol_end)) { 86 | reftable[ol_end,"chr.coding.end"] = reftable[ol_end,"chr.coding.end"] - 3 # Truncate the last 3 bases 87 | reftable[ol_end,"cds.end"] = reftable[ol_end,"cds.end"] - 3 # Truncate the last 3 bases 88 | } 89 | 90 | if (any(c(ol_start,ol_end))) { 91 | warning(sprintf("The following genes were found to start or end at the first or last base of their contig. Since dndscv needs trinucleotide contexts for all coding bases, codons overlapping ends of contigs have been trimmed. 
Affected genes: %s.", paste(reftable[unique(c(ol_start,ol_end)),"gene.name"], collapse=", "))) 92 | } 93 | 94 | 95 | # Selecting the longest complete CDS for every gene (required when there are multiple alternative transcripts per unique gene name) 96 | 97 | cds_table = unique(reftable[,c(1:3,9)]) 98 | cds_table = cds_table[order(cds_table$gene.name, -cds_table$length), ] # Sorting CDS from longest to shortest 99 | cds_table = cds_table[(cds_table$length %% 3)==0, ] # Removing CDS of length not multiple of 3 100 | cds_table = cds_table[cds_table$cds.id %in% fullcds, ] # Complete CDS 101 | reftable = reftable[reftable$cds.id %in% fullcds, ] # Complete CDS 102 | gene_list = unique(cds_table$gene.name) 103 | 104 | reftable = reftable[order(reftable$chr, reftable$chr.coding.start), ] 105 | cds_split = split(reftable, f=reftable$cds.id) 106 | gene_split = split(cds_table, f=cds_table$gene.name) 107 | 108 | 109 | ## 2. Building the RefCDS object 110 | message("[2/3] Building the RefCDS object...") 111 | 112 | # Subfunction to extract the coding sequence 113 | get_CDSseq = function(gr, strand) { 114 | cdsseq = strsplit(paste(as.vector(Rsamtools::scanFa(genomefile, gr)),collapse=""),"")[[1]] 115 | if (strand==-1) { 116 | cdsseq = rev(seqinr::comp(cdsseq,forceToLower=F,ambiguous=T)) 117 | } 118 | return(cdsseq) 119 | } 120 | 121 | # Subfunction to extract essential splice site sequences 122 | # Definition of essential splice sites: (5' splice site: +1,+2,+5; 3' splice site: -1,-2) 123 | get_splicesites = function(cds) { 124 | splpos = numeric(0) 125 | if (nrow(cds)>1) { # If the CDS has more than one exon 126 | if (cds[1,10]==1) { # + strand 127 | spl5prime = cds[-nrow(cds),6] # Exon end before splice site 128 | spl3prime = cds[-1,5] # Exon start after splice site 129 | splpos = unique(sort(c(spl5prime+1, spl5prime+2, spl5prime+5, spl3prime-1, spl3prime-2))) 130 | } else if (cds[1,10]==-1) { # - strand 131 | spl5prime = cds[-1,5] # Exon end before splice site 132 | 
spl3prime = cds[-nrow(cds),6] # Exon start after splice site 133 | splpos = unique(sort(c(spl5prime-1, spl5prime-2, spl5prime-5, spl3prime+1, spl3prime+2))) 134 | } 135 | } 136 | return(splpos) 137 | } 138 | 139 | # Subfunction to extract the essential splice site sequence 140 | get_spliceseq = function(gr, strand) { 141 | spliceseq = unname(as.vector(Rsamtools::scanFa(genomefile, gr))) 142 | if (strand==-1) { 143 | spliceseq = seqinr::comp(spliceseq,forceToLower=F,ambiguous=T) 144 | } 145 | return(spliceseq) 146 | } 147 | 148 | # Initialising and populating the RefCDS object 149 | 150 | RefCDS = array(list(NULL), length(gene_split)) # Initialising empty object 151 | invalid_genes = rep(0, length(gene_split)) # Initialising empty object 152 | 153 | for (j in 1:length(gene_split)) { 154 | 155 | gene_cdss = gene_split[[j]] 156 | h = keeptrying = 1 157 | 158 | while (h<=nrow(gene_cdss) & keeptrying) { 159 | 160 | pid = gene_cdss[h,3] 161 | cds = cds_split[[pid]] 162 | strand = cds[1,10] 163 | chr = cds[1,4] 164 | gr = GenomicRanges::GRanges(chr, IRanges::IRanges(cds[,5], cds[,6])) 165 | cdsseq = get_CDSseq(gr,strand) 166 | pseq = seqinr::translate(cdsseq, numcode = numcode) 167 | 168 | if (all(pseq[-length(pseq)]!="*") & all(cdsseq!="N")) { # A valid CDS has been found (no stop codons inside the CDS excluding the last codon) and no "N" nucleotides 169 | 170 | # Essential splice sites 171 | splpos = get_splicesites(cds) # Essential splice sites 172 | if (length(splpos)>0) { # CDSs with a single exon do not have splice sites 173 | gr_spl = GenomicRanges::GRanges(chr, IRanges::IRanges(splpos, splpos)) 174 | splseq = get_spliceseq(gr_spl, strand) 175 | } 176 | 177 | # Obtaining the splicing sequences and the coding and splicing sequence contexts 178 | if (strand==1) { 179 | 180 | cdsseq1up = get_CDSseq(GenomicRanges::GRanges(chr, IRanges::IRanges(cds[,5]-1, cds[,6]-1)), strand) 181 | cdsseq1down = get_CDSseq(GenomicRanges::GRanges(chr, IRanges::IRanges(cds[,5]+1, 
cds[,6]+1)), strand) 182 | if (length(splpos)>0) { 183 | splseq1up = get_spliceseq(GenomicRanges::GRanges(chr, IRanges::IRanges(splpos-1, splpos-1)), strand) 184 | splseq1down = get_spliceseq(GenomicRanges::GRanges(chr, IRanges::IRanges(splpos+1, splpos+1)), strand) 185 | } 186 | 187 | } else if (strand==-1) { 188 | 189 | cdsseq1up = get_CDSseq(GenomicRanges::GRanges(chr, IRanges::IRanges(cds[,5]+1, cds[,6]+1)), strand) 190 | cdsseq1down = get_CDSseq(GenomicRanges::GRanges(chr, IRanges::IRanges(cds[,5]-1, cds[,6]-1)), strand) 191 | if (length(splpos)>0) { 192 | splseq1up = get_spliceseq(GenomicRanges::GRanges(chr, IRanges::IRanges(splpos+1, splpos+1)), strand) 193 | splseq1down = get_spliceseq(GenomicRanges::GRanges(chr, IRanges::IRanges(splpos-1, splpos-1)), strand) 194 | } 195 | 196 | } 197 | 198 | # Annotating the CDS in the RefCDS database 199 | 200 | RefCDS[[j]]$gene_name = gene_cdss[h,2] 201 | RefCDS[[j]]$gene_id = gene_cdss[h,1] 202 | RefCDS[[j]]$protein_id = gene_cdss[h,3] 203 | RefCDS[[j]]$CDS_length = gene_cdss[h,4] 204 | RefCDS[[j]]$chr = cds[1,4] 205 | RefCDS[[j]]$strand = strand 206 | RefCDS[[j]]$intervals_cds = unname(as.matrix(cds[,5:6])) 207 | RefCDS[[j]]$intervals_splice = splpos 208 | 209 | RefCDS[[j]]$seq_cds = Biostrings::DNAString(paste(cdsseq, collapse="")) 210 | RefCDS[[j]]$seq_cds1up = Biostrings::DNAString(paste(cdsseq1up, collapse="")) 211 | RefCDS[[j]]$seq_cds1down = Biostrings::DNAString(paste(cdsseq1down, collapse="")) 212 | 213 | if (length(splpos)>0) { # If there are splice sites in the gene 214 | RefCDS[[j]]$seq_splice = Biostrings::DNAString(paste(splseq, collapse="")) 215 | RefCDS[[j]]$seq_splice1up = Biostrings::DNAString(paste(splseq1up, collapse="")) 216 | RefCDS[[j]]$seq_splice1down = Biostrings::DNAString(paste(splseq1down, collapse="")) 217 | } 218 | 219 | keeptrying = 0 # Stopping the while loop 220 | } 221 | h = h+1 222 | } 223 | if (keeptrying) { 224 | invalid_genes[j] = 1 # No valid CDS was found for this gene and the 
gene will be removed from the RefCDS object 225 | } 226 | if (round(j/1000)==(j/1000)) { message(sprintf(' %0.3g%% ...', round(j/length(gene_split),2)*100)) } 227 | } 228 | 229 | RefCDS = RefCDS[!invalid_genes] # Removing genes without a valid CDS 230 | 231 | 232 | ## 3. L matrices: number of synonymous, missense, nonsense and splice sites in each CDS at each trinucleotide context 233 | message("[3/3] Calculating the impact of all possible coding changes...") 234 | 235 | nt = c("A","C","G","T") 236 | trinuc_list = paste(rep(nt,each=16,times=1), rep(nt,each=4,times=4), rep(nt,each=1,times=16), sep="") 237 | trinuc_ind = structure(1:64, names=trinuc_list) 238 | 239 | trinuc_subs = NULL; for (j in 1:length(trinuc_list)) { trinuc_subs = c(trinuc_subs, paste(trinuc_list[j], paste(substr(trinuc_list[j],1,1), setdiff(nt,substr(trinuc_list[j],2,2)), substr(trinuc_list[j],3,3), sep=""), sep=">")) } 240 | trinuc_subsind = structure(1:192, names=trinuc_subs) 241 | 242 | # Precalculating a 64x64 matrix with the functional impact of each codon transition (1=Synonymous, 2=Missense, 3=Nonsense) 243 | impact_matrix = array(NA, dim=c(64,64)) 244 | colnames(impact_matrix) = rownames(impact_matrix) = trinuc_list 245 | for (j in 1:64) { 246 | for (h in 1:64) { 247 | from_aa = seqinr::translate(strsplit(trinuc_list[j],"")[[1]], numcode = numcode) 248 | to_aa = seqinr::translate(strsplit(trinuc_list[h],"")[[1]], numcode = numcode) 249 | # Annotating the impact of the mutation 250 | if (to_aa == from_aa){ 251 | impact_matrix[j,h] = 1 252 | } else if (to_aa == "*"){ 253 | impact_matrix[j,h] = 3 254 | } else if ((to_aa != "*") & (from_aa != "*") & (to_aa != from_aa)){ 255 | impact_matrix[j,h] = 2 256 | } else if (from_aa=="*") { 257 | impact_matrix[j,h] = NA 258 | } 259 | } 260 | } 261 | 262 | for (j in 1:length(RefCDS)) { 263 | 264 | L = array(0, dim=c(192,4)) 265 | cdsseq = as.character(as.vector(RefCDS[[j]]$seq_cds)) 266 | cdsseq1up = as.character(as.vector(RefCDS[[j]]$seq_cds1up)) 267 
| cdsseq1down = as.character(as.vector(RefCDS[[j]]$seq_cds1down)) 268 | 269 | # 1. Exonic mutations 270 | 271 | ind = rep(1:length(cdsseq), each=3) 272 | old_trinuc = paste(cdsseq1up[ind], cdsseq[ind], cdsseq1down[ind], sep="") 273 | new_base = c(sapply(cdsseq, function(x) nt[nt!=x])) 274 | new_trinuc = paste(cdsseq1up[ind], new_base, cdsseq1down[ind], sep="") 275 | codon_start = rep(seq(1,length(cdsseq),by=3),each=9) 276 | old_codon = paste(cdsseq[codon_start], cdsseq[codon_start+1], cdsseq[codon_start+2], sep="") 277 | pos_in_codon = rep(rep(1:3, each=3), length.out=length(old_codon)) 278 | aux = strsplit(old_codon,"") 279 | new_codon = sapply(1:length(old_codon), function(x) { new_codonx = aux[[x]]; new_codonx[pos_in_codon[x]] = new_base[x]; return(new_codonx) } ) 280 | new_codon = paste(new_codon[1,], new_codon[2,], new_codon[3,], sep="") 281 | 282 | imp = impact_matrix[(trinuc_ind[new_codon]-1)*64 + trinuc_ind[old_codon]] 283 | matrind = trinuc_subsind[paste(old_trinuc, new_trinuc, sep=">")] 284 | 285 | # Synonymous 286 | matrix_ind = table(matrind[which(imp==1)]) 287 | L[as.numeric(names(matrix_ind)), 1] = matrix_ind 288 | 289 | # Missense 290 | matrix_ind = table(matrind[which(imp==2)]) 291 | L[as.numeric(names(matrix_ind)), 2] = matrix_ind 292 | 293 | # Nonsense 294 | matrix_ind = table(matrind[which(imp==3)]) 295 | L[as.numeric(names(matrix_ind)), 3] = matrix_ind 296 | 297 | # 2. 
Splice site mutations 298 | if (length(RefCDS[[j]]$intervals_splice)>0) { 299 | splseq = as.character(as.vector(RefCDS[[j]]$seq_splice)) 300 | splseq1up = as.character(as.vector(RefCDS[[j]]$seq_splice1up)) 301 | splseq1down = as.character(as.vector(RefCDS[[j]]$seq_splice1down)) 302 | old_trinuc = rep(paste(splseq1up, splseq, splseq1down, sep=""), each=3) 303 | new_trinuc = paste(rep(splseq1up, each=3), c(sapply(splseq, function(x) nt[nt!=x])), rep(splseq1down,each=3), sep="") 304 | matrind = trinuc_subsind[paste(old_trinuc, new_trinuc, sep=">")] 305 | matrix_ind = table(matrind) 306 | L[as.numeric(names(matrix_ind)), 4] = matrix_ind 307 | } 308 | 309 | RefCDS[[j]]$L = L # Saving the L matrix 310 | if (round(j/1000)==(j/1000)) { message(sprintf(' %0.3g%% ...', round(j/length(RefCDS),2)*100)) } # Progress relative to the filtered RefCDS, which this loop iterates over 311 | } 312 | 313 | ## Saving the reference GenomicRanges object 314 | 315 | aux = unlist(sapply(1:length(RefCDS), function(x) t(cbind(x,rbind(RefCDS[[x]]$intervals_cds,cbind(RefCDS[[x]]$intervals_splice,RefCDS[[x]]$intervals_splice)))))) 316 | df_genes = as.data.frame(t(array(aux,dim=c(3,length(aux)/3)))) 317 | colnames(df_genes) = c("ind","start","end") 318 | df_genes$chr = unlist(sapply(1:length(RefCDS), function(x) rep(RefCDS[[x]]$chr,nrow(RefCDS[[x]]$intervals_cds)+length(RefCDS[[x]]$intervals_splice)))) 319 | df_genes$gene = sapply(RefCDS, function(x) x$gene_name)[df_genes$ind] 320 | 321 | gr_genes = GenomicRanges::GRanges(df_genes$chr, IRanges::IRanges(df_genes$start, df_genes$end)) 322 | GenomicRanges::mcols(gr_genes)$names = df_genes$gene 323 | 324 | save(RefCDS, gr_genes, file=outfile) 325 | 326 | } # EOF 327 | -------------------------------------------------------------------------------- /R/codondnds.R: -------------------------------------------------------------------------------- 1 | #' codondnds 2 | #' 3 | #' Function to estimate codon-wise dN/dS values and p-values against neutrality. 
To generate a valid RefCDS input object for this function, use the buildcodon function. Note that recurrent artefacts or SNP contamination can violate the null model and dominate the list of codons under apparent selection. Be very critical of the results and if suspicious codons appear recurrently mutated consider refining the variant calling (e.g. using a better unmatched normal panel). 4 | #' 5 | #' @author Inigo Martincorena (Wellcome Sanger Institute) 6 | #' 7 | #' @param dndsout Output object from dndscv. 8 | #' @param refcds RefCDS object annotated with codon-level information using the buildcodon function. 9 | #' @param min_recurr Minimum number of mutations per codon to estimate codon-wise dN/dS ratios. [default=2] 10 | #' @param gene_list List of genes to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of genes is decided a priori. [default=NULL, codondnds will be run on all genes in dndsout] 11 | #' @param codon_list List of hotspot codons to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of codons is decided a priori. [default=NULL, codondnds will be run on all genes in dndsout] 12 | #' @param theta_option 2 options: "mle" (uses the MLE of the negative binomial size parameter) or "conservative" (uses the lower bound of the CI95). Values other than "mle" will lead to the conservative option. [default="conservative"] 13 | #' @param syn_drivers Vector with a list of known synonymous driver mutations to exclude from the background model [default="TP53:T125T"]. See Martincorena et al., Cell, 2017 (PMID:29056346). 14 | #' @param method Overdispersion model: NB = Negative Binomial (Gamma-Poisson), LNP = Poisson-Lognormal (see Hess et al., BiorXiv, 2019). [default="NB"] 15 | #' @param numbins Number of bins to discretise the rvec vector [default=1e4]. 
This enables fast execution of the LNP model in datasets of arbitrary size. 16 | #' 17 | #' @return 'codondnds' returns a table of recurrently mutated codons and the estimates of the size parameter: 18 | #' @return - recurcodons: Table of recurrently mutated codons with codon-wise dN/dS values and p-values 19 | #' @return - recurcodons_ext: The same table of recurrently mutated codons, but including additional information on the contribution of different changes within a codon. 20 | #' @return - overdisp: Maximum likelihood estimate and CI95% for the overdispersion parameter (the size parameter of the negative binomial distribution or the sigma parameter of the lognormal distribution). The lower the size value or the higher the sigma value the higher the variation of the mutation rate across codons not captured by the trinucleotide change or by variation across genes. 21 | #' @return - LL: Log-likelihood of the fit of the overdispersed model (see "method" argument) to all synonymous sites at a codon level. 22 | #' 23 | #' @export 24 | 25 | codondnds = function(dndsout, refcds, min_recurr = 2, gene_list = NULL, codon_list = NULL, theta_option = "conservative", syn_drivers = "TP53:T125T", method = "NB", numbins = 1e4) { 26 | 27 | ## 1. Fitting an overdispersed distribution at the codon level considering the background mutation rate of the gene and of each trinucleotide 28 | message("[1] Codon-wise overdispersed model accounting for trinucleotides and relative gene mutability...") 29 | 30 | if (nrow(dndsout$mle_submodel)!=195) { stop("Invalid input: dndsout must be generated using the default trinucleotide substitution model in dndscv.") } 31 | if (is.null(refcds[[1]]$codon_impact)) { stop("Invalid input: the input RefCDS object must contain codon-level annotation. 
Use the buildcodon function to add this information.") } 32 | 33 | # Restricting refcds to genes in the dndsout object 34 | refcds = refcds[sapply(refcds, function(x) x$gene_name) %in% dndsout$genemuts$gene_name] # Only input genes 35 | 36 | # Restricting the analysis to an input list of genes 37 | if (!is.null(gene_list)) { 38 | g = as.vector(dndsout$genemuts$gene_name) 39 | # Correcting CDKN2A if required (hg19) 40 | if (any(g %in% c("CDKN2A.p14arf","CDKN2A.p16INK4a")) & any(gene_list=="CDKN2A")) { 41 | gene_list = unique(c(setdiff(gene_list,"CDKN2A"),"CDKN2A.p14arf","CDKN2A.p16INK4a")) 42 | } 43 | nonex = gene_list[!(gene_list %in% g)] 44 | if (length(nonex)>0) { 45 | warning(sprintf("The following input gene names are not in dndsout input object and will not be analysed: %s.", paste(nonex,collapse=", "))) 46 | } 47 | refaux = refcds[sapply(refcds, function(x) x$gene_name) %in% gene_list] # Only input genes 48 | numtests = sum(sapply(refaux, function(x) x$CDS_length))/3 # Number of codons in genes listed in gene_list 49 | } else { 50 | numtests = sum(sapply(refcds, function(x) x$CDS_length))/3 # Number of codons in all genes 51 | } 52 | 53 | # Relative mutation rate per gene 54 | # Note that this assumes that the gene order in genemuts has not been altered with respect to the N and L matrices, as it is currently the case in dndscv 55 | relmr = dndsout$genemuts$exp_syn_cv/dndsout$genemuts$exp_syn 56 | names(relmr) = dndsout$genemuts$gene_name 57 | 58 | # Substitution rates (192 trinucleotide rates, strand-specific) 59 | sm = setNames(dndsout$mle_submodel$mle, dndsout$mle_submodel$name) 60 | sm["TTT>TGT"] = 1 # Adding the TTT>TGT rate (which is arbitrarily set to 1 relative to t) 61 | sm = sm*sm["t"] # Absolute rates 62 | sm = sm[setdiff(names(sm),c("wmis","wnon","wspl","t"))] # Removing selection parameters 63 | sm = sm[order(names(sm))] # Sorting 64 | 65 | # Annotated mutations per gene 66 | annotsubs = 
dndsout$annotmuts[which(dndsout$annotmuts$impact=="Synonymous"),] 67 | if (nrow(annotsubs)<2) { 68 | stop("Too few synonymous mutations found in the input. codondnds cannot run without synonymous mutations.") 69 | } 70 | annotsubs = annotsubs[!(paste(annotsubs$gene,annotsubs$aachange,sep=":") %in% syn_drivers),] 71 | annotsubs$codon = as.numeric(substr(annotsubs$aachange,2,nchar(annotsubs$aachange)-1)) # Numeric codon position 72 | annotsubs = split(annotsubs, f=annotsubs$gene) 73 | 74 | # Calculating observed and expected mutation rates per codon for every gene 75 | numcodons = sum(sapply(refcds, function(x) x$CDS_length))/3 # Number of codons in all genes 76 | nvec = rvec = array(NA, numcodons) 77 | pos = 1 78 | 79 | for (j in 1:length(refcds)) { 80 | 81 | nvec_syn = rvec_syn = rvec_ns = array(0,refcds[[j]]$CDS_length/3) # Initialising the obs and exp vectors 82 | gene = refcds[[j]]$gene_name 83 | sm_rel = sm * relmr[gene] 84 | 85 | # Expected rates 86 | ind = rep(1:(refcds[[j]]$CDS_length/3), each=9) 87 | syn = which(refcds[[j]]$codon_impact==1) # Synonymous changes 88 | ns = which(refcds[[j]]$codon_impact %in% c(2,3)) # Missense and nonsense changes 89 | 90 | aux = sapply(split(refcds[[j]]$codon_rates[syn], f=ind[syn]), function(x) sum(sm_rel[x])) 91 | rvec_syn[as.numeric(names(aux))] = aux 92 | 93 | aux = sapply(split(refcds[[j]]$codon_rates[ns], f=ind[ns]), function(x) sum(sm_rel[x])) 94 | rvec_ns[as.numeric(names(aux))] = aux 95 | 96 | # Observed mutations 97 | subs = annotsubs[[gene]] 98 | if (!is.null(subs)) { 99 | obs_syn = table(subs$codon) 100 | nvec_syn[as.numeric(names(obs_syn))] = obs_syn 101 | } 102 | 103 | rvec[pos:(pos+refcds[[j]]$CDS_length/3-1)] = rvec_syn 104 | nvec[pos:(pos+refcds[[j]]$CDS_length/3-1)] = nvec_syn 105 | pos = pos + refcds[[j]]$CDS_length/3 106 | 107 | refcds[[j]]$codon_rvec_ns = rvec_ns 108 | 109 | if (round(j/2000)==(j/2000)) { message(sprintf(' %0.3g%% ...', round(j/length(refcds),2)*100)) } 110 | } 111 | 112 | rvec = rvec * 
sum(nvec) / sum(rvec) # Small correction ensuring that global observed and expected rates are identical 113 | 114 | 115 | message("[2] Estimating overdispersion and calculating codon-wise dN/dS ratios...") 116 | 117 | # Estimation of overdispersion: Using optimize appears to yield reliable results. Problems experienced with fitdistr, glm.nb and theta.ml. Consider using grid search if problems appear with optimize. 118 | if (method=="LNP") { # Modelling rates per codon with a Poisson-Lognormal mixture 119 | 120 | lnp_est = fitlnpbin(nvec, rvec, theta_option = theta_option, numbins = numbins) 121 | theta_ml = lnp_est$ml$minimum 122 | theta_ci95 = lnp_est$sig_ci95 123 | LL = -lnp_est$ml$objective # LogLik 124 | thetaout = setNames(c(theta_ml, theta_ci95), c("MLE","CI95_high")) 125 | 126 | } else { # Modelling rates per codon as negative binomially distributed (i.e. quantifying uncertainty above Poisson using a Gamma) 127 | 128 | nbin = function(theta, n=nvec, r=rvec) { -sum(dnbinom(x=n, mu=r, log=T, size=theta)) } # nbin loglik function for optimisation 129 | ml = optimize(nbin, interval=c(0,1000)) 130 | theta_ml = ml$minimum 131 | LL = -ml$objective # LogLik 132 | 133 | # CI95% for theta using profile likelihood and iterative grid search (this yields slightly conservative CI95) 134 | grid_proflik = function(bins=5, iter=5) { 135 | for (j in 1:iter) { 136 | if (j==1) { 137 | thetavec = sort(c(0, 10^seq(-3,3,length.out=bins), theta_ml, theta_ml*10, 1e4)) # Initial vals 138 | } else { 139 | thetavec = sort(c(seq(thetavec[ind[1]], thetavec[ind[1]+1], length.out=bins), seq(thetavec[ind[2]-1], thetavec[ind[2]], length.out=bins))) # Refining previous iteration 140 | } 141 | 142 | proflik = sapply(thetavec, function(theta) -sum(dnbinom(x=nvec, mu=rvec, size=theta, log=T))-ml$objective) < qchisq(.95,1)/2 # Values of theta within CI95% 143 | ind = c(which(proflik[1:(length(proflik)-1)]==F & proflik[2:length(proflik)]==T)[1], 144 | which(proflik[1:(length(proflik)-1)]==T & 
proflik[2:length(proflik)]==F)[1]+1) 145 | if (is.na(ind[1])) { ind[1] = 1 } 146 | if (is.na(ind[2])) { ind[2] = length(thetavec) } 147 | } 148 | return(thetavec[ind]) 149 | } 150 | theta_ci95 = grid_proflik(bins=5, iter=5) 151 | thetaout = setNames(c(theta_ml, theta_ci95), c("MLE","CI95low","CI95_high")) 152 | } 153 | 154 | 155 | ## 2. Calculating codon-wise dN/dS ratios and P-values for recurrently mutated codons 156 | 157 | # Theta option 158 | if (theta_option=="mle" | theta_option=="MLE") { 159 | theta = theta_ml 160 | } else { # Conservative 161 | message(" Using the conservative bound of the confidence interval of the overdispersion parameter.") 162 | theta = theta_ci95[1] 163 | } 164 | 165 | # Creating the recurcodons object 166 | annotsubs = dndsout$annotmuts[which(dndsout$annotmuts$impact %in% c("Missense","Nonsense")),] 167 | annotsubs$codon = substr(annotsubs$aachange,1,nchar(annotsubs$aachange)-1) # Codon position 168 | annotsubs$codonsub = paste(annotsubs$chr,annotsubs$gene,annotsubs$codon,sep=":") 169 | annotsubs = annotsubs[which(annotsubs$ref!=annotsubs$mut),] 170 | freqs = sort(table(annotsubs$codonsub), decreasing=T) 171 | 172 | recurcodons = read.table(text=names(freqs), header=0, sep=":", stringsAsFactors=F) # Frequency table of mutations 173 | colnames(recurcodons) = c("chr","gene","codon") 174 | recurcodons$freq = freqs 175 | 176 | # Gene RHT 177 | if (!is.null(gene_list)) { 178 | message(" Performing Restricted Hypothesis Testing on the input list of a-priori genes") 179 | recurcodons = recurcodons[which(recurcodons$gene %in% gene_list), ] # Restricting the p-value and q-value calculations to gene_list 180 | } 181 | 182 | # Codon RHT 183 | if (!is.null(codon_list)) { 184 | message(" Performing Restricted Hypothesis Testing on the input list of a-priori codons (numtests = length(codon_list))") 185 | mutstr = paste(recurcodons$gene,recurcodons$codon,sep=":") 186 | if (!any(mutstr %in% codon_list)) { 187 | stop("No mutation was observed in the 
restricted list of known hotspots. Codon-RHT cannot be run.") 188 | } 189 | recurcodons = recurcodons[which(mutstr %in% codon_list), ] # Restricting the p-value and q-value calculations to codon_list 190 | numtests = length(codon_list) 191 | 192 | # Calculating global dN/dS ratios at known hotcodons 193 | auxcodons = as.data.frame(do.call("rbind",strsplit(codon_list,split=":")), stringsAsFactors=F) 194 | auxcodons$V3 = as.numeric(substr(auxcodons$V2,2,nchar(auxcodons$V2))) 195 | auxcodons = auxcodons[auxcodons$V1 %in% names(relmr), ] 196 | colnames(auxcodons) = c("gene","codon","numcodon") 197 | auxcodons$mu = NA 198 | geneind = setNames(1:length(refcds), sapply(refcds, function(x) x$gene_name)) 199 | for (j in 1:nrow(auxcodons)) { 200 | auxcodons$mu[j] = refcds[[geneind[auxcodons$gene[j]]]]$codon_rvec_ns[auxcodons$numcodon[j]] # Background non-synonymous rate for this codon 201 | } 202 | neutralexp = sum(auxcodons$mu) # Number of mutations expected at known hotspots under neutrality 203 | numobs = sum(recurcodons$freq) # Number observed 204 | poistest = poisson.test(numobs, T=neutralexp) 205 | globaldnds_knowncodons = setNames(c(numobs, neutralexp, poistest$estimate, poistest$conf.int), c("obs","exp","dnds","cilow","cihigh")) 206 | message(sprintf(" Mutations at known hotspots: %0.0f observed, %0.3g expected, obs/exp~%0.3g (CI95:%0.3g,%0.3g).", globaldnds_knowncodons[1], globaldnds_knowncodons[2], globaldnds_knowncodons[3], globaldnds_knowncodons[4], globaldnds_knowncodons[5])) 207 | } 208 | 209 | # Restricting the recurcodons output by min_recurr 210 | recurcodons = recurcodons[recurcodons$freq>=min_recurr, ] # Restricting the output to codons with min_recurr 211 | 212 | if (nrow(recurcodons)>0) { 213 | 214 | recurcodons$mu = NA 215 | codonnumeric = as.numeric(substr(recurcodons$codon,2,nchar(recurcodons$codon))) # Numeric codon position 216 | geneind = setNames(1:length(refcds), sapply(refcds, function(x) x$gene_name)) 217 | 218 | for (j in 
1:nrow(recurcodons)) { 219 | recurcodons$mu[j] = refcds[[geneind[recurcodons$gene[j]]]]$codon_rvec_ns[codonnumeric[j]] # Background non-synonymous rate for this codon 220 | } 221 | 222 | recurcodons$dnds = recurcodons$freq / recurcodons$mu # Codon-wise dN/dS (point estimate) 223 | 224 | if (method=="LNP") { # Modelling rates per codon with a Poisson-Lognormal mixture 225 | 226 | # Cumulative Lognormal-Poisson using poilog::dpoilog 227 | dpoilog = poilog::dpoilog 228 | ppoilog = function(n, mu, sig) { 229 | p = sum(dpoilog(n=floor(n+1):floor(n*10+1000), mu=log(mu)-sig^2/2, sig=sig)) 230 | return(p) 231 | } 232 | 233 | message(sprintf(" Modelling substitution rates using a Lognormal-Poisson: sig = %0.3g (upperbound = %0.3g)", theta_ml, theta_ci95)) 234 | recurcodons$pval = apply(recurcodons, 1, function(x) ppoilog(n=as.numeric(x["freq"])-0.5, mu=as.numeric(x["mu"]), sig=theta)) 235 | 236 | } else { # Negative binomial model 237 | 238 | message(sprintf(" Modelling substitution rates using a Negative Binomial: theta = %0.3g (CI95:%0.3g,%0.3g)", theta_ml, theta_ci95[1], theta_ci95[2])) 239 | recurcodons$pval = pnbinom(q=recurcodons$freq-0.5, mu=recurcodons$mu, size=theta, lower.tail=F) 240 | } 241 | 242 | recurcodons = recurcodons[order(recurcodons$pval, -recurcodons$freq), ] # Sorting by p-val and frequency 243 | recurcodons$qval = p.adjust(recurcodons$pval, method="BH", n=numtests) # P-value adjustment for all possible changes 244 | rownames(recurcodons) = NULL 245 | 246 | # Additional annotation 247 | annotsubs$mutaa = substr(annotsubs$aachange,nchar(annotsubs$aachange),nchar(annotsubs$aachange)) 248 | annotsubs$simplent = paste(annotsubs$ref,annotsubs$mut,sep=">") 249 | annotsubs$mutnt = paste(annotsubs$chr,annotsubs$pos,annotsubs$simplent,annotsubs$mutaa,sep="_") 250 | aux = split(annotsubs, f=annotsubs$codonsub) 251 | recurcodons_ext = recurcodons 252 | recurcodons_ext$codonsub = paste(recurcodons_ext$chr,recurcodons_ext$gene,recurcodons_ext$codon,sep=":") 253 | 
recurcodons_ext$mutnt = recurcodons_ext$mutaa = NA 254 | for (j in 1:nrow(recurcodons_ext)) { 255 | x = aux[[recurcodons_ext$codonsub[j]]] 256 | f = sort(table(x$mutaa),decreasing=T) 257 | recurcodons_ext$mutaa[j] = paste(names(f),f,sep=":",collapse="|") 258 | f = sort(table(x$mutnt),decreasing=T) 259 | recurcodons_ext$mutnt[j] = paste(names(f),f,sep=":",collapse="|") 260 | } 261 | 262 | } else { 263 | recurcodons = recurcodons_ext = NULL 264 | warning("No codon was found with the minimum recurrence requested [default min_recurr=2]") 265 | } 266 | 267 | if (is.null(codon_list)) { 268 | return(list(recurcodons=recurcodons, recurcodons_ext=recurcodons_ext, overdisp=thetaout, LL=LL)) 269 | } else { 270 | return(list(recurcodons=recurcodons, recurcodons_ext=recurcodons_ext, overdisp=thetaout, LL=LL, globaldnds_knowncodons=globaldnds_knowncodons)) 271 | } 272 | } 273 | -------------------------------------------------------------------------------- /R/dndscv-package.R: -------------------------------------------------------------------------------- 1 | # Package definitions for dndscv 2 | # 3 | # Author: Inigo Martincorena 4 | ############################################################################### 5 | 6 | #' Detection of selection in cancer and somatic evolution 7 | #' 8 | #' The dNdScv R package is a suite of maximum-likelihood dN/dS methods designed 9 | #' to quantify selection in cancer and somatic evolution (Martincorena et al., 2017). 10 | #' The package contains functions to quantify dN/dS ratios for missense, nonsense 11 | #' and essential splice mutations, at the level of individual genes, groups of 12 | #' genes or at whole-genome level. The dNdScv method was designed to detect cancer 13 | #' driver genes (i.e. genes under positive selection in cancer) on datasets ranging 14 | #' from a few samples to thousands of samples, in whole-exome/genome or targeted 15 | #' sequencing studies. 
16 | #' @name dndscv-package 17 | #' @docType package 18 | #' @title Detection of selection in cancer and somatic evolution 19 | #' @author Inigo Martincorena, Wellcome Trust Sanger Institute, \email{im3@@sanger.ac.uk} 20 | #' @references Martincorena I, et al. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. Cell. 21 | #' @keywords package 22 | #' @seealso \code{\link{dndscv}} 23 | #' @seealso \code{\link{buildref}} 24 | #' @import seqinr 25 | #' @import MASS 26 | #' @import GenomicRanges 27 | #' @import IRanges 28 | #' @import Rsamtools 29 | #' @import Biostrings 30 | NA 31 | -------------------------------------------------------------------------------- /R/fitlnpbin.R: -------------------------------------------------------------------------------- 1 | #' fitlnpbin 2 | #' 3 | #' Function to fit a Lognormal-Poisson model to estimate overdispersion on synonymous changes for sitednds and codondnds. 4 | #' 5 | #' @author Inigo Martincorena (Wellcome Sanger Institute) 6 | #' 7 | #' @param nvec Vector of observed counts of mutations per site. 8 | #' @param rvec Vector of expected counts of mutations per site. 9 | #' @param level Confidence level desired for the confidence interval of the overdispersion parameter [default=0.95] 10 | #' @param theta_option 2 options: "mle" (uses the MLE of the overdispersion parameter) or "conservative" (uses the conservative bound of the CI95). Values other than "mle" will lead to the conservative option [default="conservative"] 11 | #' @param numbins Number of bins to discretise the rvec vector [default=1e4]. This enables fast execution of the LNP model in datasets of arbitrary size. 12 | #' 13 | #' @return 'fitlnpbin' returns the maximum likelihood estimate and confidence intervals of the "sig" overdispersion parameter of the LNP model. 14 | #' 15 | #' @export 16 | 17 | fitlnpbin = function(nvec, rvec, level = 0.95, theta_option = "conservative", numbins = 1e4) { 18 | 19 | # 1. 
Binning the r vector 20 | minrate = 1e-8 # Values <<1/exome_length are due to 0 observed counts for a given trinucleotide 21 | rvec = pmax(minrate, rvec) # Setting values below minrate to minrate 22 | br = cut(log(rvec),breaks=numbins) # Binning rvec in log space 23 | binmeans = tapply(rvec, br, mean) # Mean value per bin 24 | rvecbinned = as.numeric(binmeans[br]) # Binned values for rvec 25 | message(sprintf(" Binning the rate vector: maximum deviation of %0.3g", max(abs(rvecbinned-rvec)/rvec))) 26 | rvec = rvecbinned # Using the binned values 27 | freqs = as.matrix(plyr::count(cbind(rvec,nvec))) # Frequency table 28 | 29 | # Excluding sites with rates < minrate from the calculation (they should yield LL=0) 30 | freqs = freqs[which(freqs[,1] > minrate), ] 31 | 32 | # 2. Vectorising dpoilog 33 | dpoilog = poilog::dpoilog 34 | lnp = function(freqs, sig) { 35 | -sum(apply(freqs, 1, function(x) x[3]*log(dpoilog(n=x[2], mu=log(x[1])-sig^2/2, sig=sig)))) # vectorised dpoilog with fixed expected rates and log-transformed 36 | } 37 | 38 | # 3. Estimating the MLE: grid search followed by optim within reasonable bounds 39 | lnp_mle = function(minsig=1e-2, maxsig=5, bins=5, iter=8) { 40 | 41 | lls = sigs = NULL # Saving the log-likelihoods calculated 42 | 43 | # 1. Grid search to identify reasonable bounds 44 | for (j in 1:iter) { 45 | if (j==1) { 46 | sigvec = sort(exp(seq(log(minsig), log(maxsig), length.out=bins))) # Initial vals 47 | } else { 48 | sigvec = seq(sigvec[pmax(1,ind-1)], sigvec[pmin(bins,ind+1)], length.out=bins) # Refining previous iteration 49 | } 50 | proflik = sapply(sigvec, function(sig) lnp(freqs=freqs, sig=sig)) 51 | ind = which.min(proflik) 52 | 53 | sigs = c(sigs, sigvec) # Saving the result 54 | lls = c(lls, proflik) # Saving the result 55 | } 56 | 57 | # 2. 
Optim for precise estimation of the MLE (optim without narrow bounds tends to fail) 58 | f = function(sig, n=nvec, r=rvec) { lnp(freqs=freqs, sig=sig) } 59 | ml = optimize(f, interval=c(sigvec[pmax(1,ind-1)], sigvec[pmin(bins,ind+1)])) 60 | 61 | sigs = c(sigs, ml$minimum) # Saving the result 62 | lls = c(lls, ml$objective) # Saving the result 63 | ll = cbind(sigs,lls) 64 | 65 | return(list(ml=ml, ll=unique(ll[order(ll[,1]),]))) 66 | } 67 | 68 | ml = lnp_mle(minsig=1e-2, maxsig=5, bins=5, iter=8) # Maximum likelihood estimate of the overdispersion 69 | 70 | 71 | # 4. Estimating the upper bound of the CI95% using profile likelihood (the search below only considers sig values above the MLE) 72 | # This is done exploiting the points already evaluated for the MLE 73 | 74 | if (theta_option == "mle") { 75 | ml$sig_ci95 = NA # We only estimate the upper bound of sig if requested by the user 76 | } else { 77 | grid_proflik = function(minsig=1e-2, maxsig=5, bins=5, iter=8) { 78 | for (j in 1:iter) { 79 | if (j==1) { 80 | sigvec = ml$ll[,1] 81 | ind = min(which(ml$ll[,1]>ml$ml$minimum & (ml$ll[,2]-ml$ml$objective)>qchisq(.95,1)/2)) # First value outside of bounds 82 | } 83 | sigvec = seq(sigvec[pmax(1,ind-1)], sigvec[pmin(length(sigvec),ind)], length.out=bins) # New grid based on the previous iteration 84 | proflik = sapply(sigvec, function(sig) lnp(freqs=freqs, sig=sig)) # Calculating log-likelihoods 85 | ind = min(which(sigvec>ml$ml$minimum & (proflik-ml$ml$objective)>qchisq(.95,1)/2)) # First value outside of bounds 86 | } 87 | return(sigvec[ind]) # Conservative estimate for the upper bound of the CI95% for sig 88 | } 89 | ml$sig_ci95 = grid_proflik(minsig=1e-2, maxsig=5, bins=5, iter=8) 90 | } 91 | 92 | return(ml) 93 | } 94 | -------------------------------------------------------------------------------- /R/geneci.R: -------------------------------------------------------------------------------- 1 | #' geneci 2 | #' 3 | #' Function to calculate confidence intervals for dN/dS values per gene under the dNdScv model using profile 
likelihood. To generate a valid dndsout input object for this function, use outmats=T when running dndscv. 4 | #' 5 | #' @author Inigo Martincorena (Wellcome Sanger Institute) 6 | #' @details Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041. 7 | #' 8 | #' @param dndsout Output object from dndscv. 9 | #' @param gene_list List of genes to restrict the analysis (by default, all genes in dndsout will be analysed) 10 | #' @param level Confidence level desired [default = 0.95] 11 | #' 12 | #' @return ci: Dataframe with the confidence intervals for dN/dS ratios per gene under the dNdScv model. 13 | #' 14 | #' @export 15 | 16 | geneci = function(dndsout, gene_list = NULL, level = 0.95) { 17 | 18 | # Ensuring valid level value 19 | if (level > 1) { 20 | warning("Confidence level must be lower than 1, using 0.95 as default") 21 | level = 0.95 22 | } 23 | 24 | # N and L matrices 25 | N = dndsout$N 26 | L = dndsout$L 27 | if (length(N)==0) { stop(sprintf("Invalid input: the dndsout input object must be generated using outmats=T as an argument to dndscv.")) } 28 | if (nrow(dndsout$mle_submodel)!=195) { stop(sprintf("Invalid input: dndsout must be generated using the default trinucleotide substitution model in dndscv."))} 29 | 30 | # Restricting the analysis to an input list of genes 31 | if (!is.null(gene_list)) { 32 | g = as.vector(dndsout$genemuts$gene_name) # Genes in the input object 33 | nonex = gene_list[!(gene_list %in% g)] # Excluding genes from the input gene_list if they are not present in the input dndsout object 34 | if (length(nonex)>0) { 35 | warning(sprintf("The following input gene names are not in dndsout input object and will not be analysed: %s.", paste(nonex,collapse=", "))) 36 | } 37 | dndsout$annotmuts = dndsout$annotmuts[which(dndsout$annotmuts$gene %in% gene_list), ] # Restricting to genes of interest 38 | dndsout$genemuts = dndsout$genemuts[which(g %in% gene_list), ] # Restricting to 
genes of interest 39 | N = N[,,which(g %in% gene_list)] # Restricting to genes of interest 40 | L = L[,,which(g %in% gene_list)] # Restricting to genes of interest 41 | } 42 | gene_list = as.vector(dndsout$genemuts$gene_name) 43 | 44 | wnoneqspl = all(dndsout$sel_cv$wnon_cv==dndsout$sel_cv$wspl_cv) # Deciding on wnon==wspl based on the input object 45 | 46 | ## Subfunction: Analytical opt_t (aka tML) given fixed w values 47 | mle_tcvgivenw = function(n, theta, exp_neutral_cv, E) { 48 | shape = theta; scale = exp_neutral_cv/theta 49 | tml = (n+shape-1)/(1+E+(1/scale)) 50 | if (shape<=1) { # i.e. when theta<=1 51 | tml = max(shape*scale,tml) # i.e. tml is bounded to the mean of the gamma (i.e. y[9]) when theta<=1 52 | } 53 | return(pmax(tml,1e-6)) 54 | } 55 | 56 | ## Subfunction: Log-Likelihood of the model given fixed w values (requires finding MLEs for t and the free w values given the fixed w values) 57 | loglik_givenw = function(w,x,y,mutrates,theta,wtype,wnoneqspl) { 58 | 59 | # 1. tML given w 60 | exp_neutral_cv = y[9] 61 | exp_rel = y[6:8]/y[5] 62 | n = y[1] + sum(y[wtype+1]) 63 | E = sum(exp_rel[wtype])*w 64 | tML = mle_tcvgivenw(n, theta, exp_neutral_cv, E) 65 | mrfold = max(1e-10, tML/y[5]) # Correction factor of "t" under the model 66 | 67 | # 2. Calculating the MLEs of the unconstrained w values 68 | if (!wnoneqspl) { 69 | wfree = y[2:4]/y[6:8]/mrfold; wfree[y[2:4]==0] = 0 # MLEs for w given tval 70 | } else { 71 | wmisfree = y[2]/y[6]/mrfold; wmisfree[y[2]==0] = 0 72 | wtruncfree = sum(y[3:4])/sum(y[7:8])/mrfold; wtruncfree[sum(y[3:4])==0] = 0 73 | wfree = c(wmisfree,wtruncfree,wtruncfree) # MLEs for w given tval 74 | } 75 | wfree[wtype] = w # Replacing free w values by fixed input values 76 | 77 | # 2. 
loglik of the model under tML and w 78 | llpois = sum(dpois(x=x$n, lambda=x$l*mutrates*mrfold*t(array(c(1,wfree),dim=c(4,length(mutrates)))), log=T)) 79 | llgamm = dgamma(x=tML, shape=theta, scale=exp_neutral_cv/theta, log=T) 80 | return(-(llpois+llgamm)) 81 | } 82 | 83 | ## Subfunction: Working with vector inputs 84 | loglik_vec = function(wfixed,x,y,mutrates,theta,wtype,wnoneqspl) { 85 | sapply(wfixed, function(w) loglik_givenw(w,x,y,mutrates,theta,wtype,wnoneqspl)) 86 | } 87 | 88 | 89 | ## Subfunction: iterative search for the CI95% boundaries for wvec 90 | iterative_search_ci95 = function(wtype,x,y,mutrates,theta,wmle,ml,grid_size=10,iter=10,wnoneqspl=T,wmax = 10000) { 91 | 92 | if (wmle[wtype][1]<wmax) { 93 | 94 | # Iteratively searching for the lower bound of the CI95% for "t" 95 | if (wmle[wtype][1]>0) { 96 | search_range = c(1e-9, wmle[wtype][1]) 97 | for (it in 1:iter) { 98 | wvec = seq(search_range[1], search_range[2],length.out=grid_size) 99 | ll = -loglik_vec(wvec,x,y,mutrates,theta,wtype,wnoneqspl) 100 | lr = 2*(ml-ll) > qchisq(p=level,df=1) 101 | ind = max(which(wvec<=wmle[wtype][1] & lr)) 102 | search_range = c(wvec[ind], wvec[ind+1]) 103 | } 104 | w_low = wvec[ind] 105 | } else { 106 | w_low = 0 107 | } 108 | 109 | # Iteratively searching for the higher bound of the CI95% for "t" 110 | search_range = c(wmle[wtype][1], wmax) 111 | llhighbound = -loglik_vec(wmax,x,y,mutrates,theta,wtype,wnoneqspl) 112 | outofboundaries = !(2*(ml-llhighbound) > qchisq(p=level,df=1)) 113 | if (!outofboundaries) { 114 | for (it in 1:iter) { 115 | wvec = seq(search_range[1], search_range[2],length.out=grid_size) 116 | ll = -loglik_vec(wvec,x,y,mutrates,theta,wtype,wnoneqspl) 117 | lr = 2*(ml-ll) > qchisq(p=level,df=1) 118 | ind = min(which(wvec>=wmle[wtype][1] & lr)) 119 | search_range = c(wvec[ind-1], wvec[ind]) 120 | } 121 | w_high = wvec[ind] 122 | } else { 123 | w_high = wmax 124 | } 125 | 126 | } else { 127 | wmle[wtype] = w_low = w_high = wmax 128 | } 129 | 130 | return(c(wmle[wtype][1],w_low,w_high)) 131 | } 132 | 133 | 134 | ## Subfunction: calculate the MLEs and CI95% 
of each independent w value (unconstraining the other values) 135 | ci95cv_intt = function(x,y,mutrates,theta,grid_size=10,iter=10,wnoneqspl=T) { 136 | 137 | # MLE 138 | exp_neutral_cv = y[9] 139 | n = y[1]; E = 0 # Only synonymous mutations are considered 140 | tML = mle_tcvgivenw(n, theta, exp_neutral_cv, E) 141 | mrfold = max(1e-10, tML/y[5]) 142 | if (!wnoneqspl) { 143 | wmle = y[2:4]/y[6:8]/mrfold; wmle[y[2:4]==0] = 0 # MLEs for w given tval 144 | } else { 145 | wmisfree = y[2]/y[6]/mrfold; wmisfree[y[2]==0] = 0 146 | wtruncfree = sum(y[3:4])/sum(y[7:8])/mrfold; wtruncfree[sum(y[3:4])==0] = 0 147 | wmle = c(wmisfree,wtruncfree,wtruncfree) # MLEs for w given tval 148 | } 149 | llpois = sum(dpois(x=x$n, lambda=x$l*mutrates*mrfold*t(array(c(1,wmle),dim=c(4,length(mutrates)))), log=T)) 150 | llgamm = dgamma(x=tML, shape=theta, scale=y[9]/theta, log=T) 151 | ml = llpois+llgamm 152 | 153 | # Iteratively searching for the lower bound of the CI95% for "t" 154 | w_ci95 = array(NA,c(3,3)) 155 | colnames(w_ci95) = c("mle","low","high") 156 | rownames(w_ci95) = c("mis","non","spl") 157 | if (!wnoneqspl) { 158 | for (h in 1:3) { 159 | w_ci95[h,] = iterative_search_ci95(wtype=h,x,y,mutrates,theta,wmle,ml,grid_size,iter,wnoneqspl) 160 | } 161 | } else { 162 | w_ci95[1,] = iterative_search_ci95(wtype=1,x,y,mutrates,theta,wmle,ml,grid_size,iter,wnoneqspl) 163 | w_ci95[2,] = iterative_search_ci95(wtype=c(2,3),x,y,mutrates,theta,wmle,ml,grid_size,iter,wnoneqspl) 164 | w_ci95[3,] = w_ci95[2,] 165 | } 166 | return(w_ci95) 167 | } 168 | 169 | 170 | ## Calculating CI95% across all genes 171 | 172 | message("Calculating CI95 across all genes...") 173 | 174 | ci95 = array(NA, dim=c(length(gene_list),9)) 175 | colnames(ci95) = c("mis_mle","non_mle","spl_mle","mis_low","non_low","spl_low","mis_high","non_high","spl_high") 176 | theta = dndsout$nbreg$theta 177 | 178 | data("submod_192r_3w", package="dndscv") 179 | parmle = setNames(dndsout$mle_submodel[,2], dndsout$mle_submodel[,1]) 180 
| mutrates = sapply(substmodel[,1], function(x) prod(parmle[base::strsplit(x,split="\\*")[[1]]])) # Expected rate per available site 181 | 182 | for (j in 1:length(gene_list)) { 183 | geneind = which(dndsout$genemuts$gene_name==gene_list[j]) 184 | y = as.numeric(dndsout$genemuts[geneind,-1]) 185 | if (length(gene_list)==1) { 186 | x = list(n=N, l=L) 187 | } else { 188 | x = list(n=N[,,geneind], l=L[,,geneind]) 189 | } 190 | ci95[j,] = c(ci95cv_intt(x,y,mutrates,theta,grid_size=10,iter=10,wnoneqspl=wnoneqspl)) 191 | if (round(j/1000)==(j/1000)) { print(j/length(gene_list), digits=2) } # Progress 192 | } 193 | 194 | ci95df = cbind(gene=gene_list, as.data.frame(ci95)) 195 | 196 | # Restricting columns if we forced wnon==wspl 197 | if (wnoneqspl==T) { 198 | ci95df = ci95df[,-c(4,7,10)] 199 | colnames(ci95df) = c("gene","mis_mle","tru_mle","mis_low","tru_low","mis_high","tru_high") 200 | } 201 | 202 | return(ci95df) 203 | 204 | } # EOF 205 | -------------------------------------------------------------------------------- /R/genesetdnds.R: -------------------------------------------------------------------------------- 1 | #' genesetdnds 2 | #' 3 | #' Function to estimate global dN/dS values for a gene set when using whole-exome 4 | #' data. Global dN/dS values for a set of genes can also be obtained using dndscv 5 | #' (gene_list argument), but that option estimates the trinucleotide mutation rates 6 | #' exclusively from the list of genes of interest. This may be preferable in large 7 | #' datasets, but in small datasets, the genesetdnds option estimates global dN/dS 8 | #' values for a set of genes while using all genes in the exome to fit the 9 | #' substitution model. Usage: genesetdnds(dndsout, gene_list). 10 | #' 11 | #' @author Inigo Martincorena (Wellcome Sanger Institute) 12 | #' @details Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041. 13 | #' 14 | #' @param dndsout Output object from dndscv. 
To generate a valid input object for this function, use outmats=T when running dndscv. 15 | #' @param gene_list List of genes to restrict the analysis (gene set). 16 | #' @param sm Substitution model (precomputed models are available in the data directory) 17 | #' 18 | #' @return 'genesetdnds' returns a list of objects: 19 | #' @return globaldnds_geneset: Global dN/dS estimates in the gene set, including Wald CI95%, Wald p-values and LRT (recommended) p-values. 20 | #' @return globaldnds_rest: Global dN/dS estimates in all other genes, including Wald CI95%, Wald p-values and LRT (recommended) p-values. 21 | #' 22 | #' @export 23 | 24 | genesetdnds = function(dndsout, gene_list, sm = "192r_3w") { 25 | 26 | ## 1. Input 27 | 28 | if (is.null(dndsout$N)) { stop(sprintf("Invalid input: dndsout must be generated using outmats=T in dndscv.")) } 29 | 30 | allg = as.vector(dndsout$genemuts$gene) # All genes in the dndsout object 31 | nonex = gene_list[!(gene_list %in% allg)] 32 | if (length(nonex)>0) { 33 | stop(sprintf("The following input gene names are not in the dndsout object: %s. To see the list of genes available use as.vector(dndsout$genemuts$gene).", paste(nonex,collapse=", "))) 34 | } 35 | if (length(gene_list)<2) { 36 | stop("The gene_list argument needs to contain at least two genes") 37 | } 38 | 39 | # Substitution model (The user can also input a custom substitution model as a matrix) 40 | if (length(sm)==1) { 41 | data(list=sprintf("submod_%s",sm), package="dndscv") 42 | } else { 43 | substmodel = sm 44 | } 45 | 46 | ## 2. 
Estimation of the global rate and selection parameters 47 | 48 | Lall = dndsout$L 49 | Nall = dndsout$N 50 | geneind = which(allg %in% gene_list) # Genes in the gene set 51 | L = rbind(apply(Lall[,,geneind], c(1,2), sum), apply(Lall[,,-geneind], c(1,2), sum)) 52 | N = rbind(apply(Nall[,,geneind], c(1,2), sum), apply(Nall[,,-geneind], c(1,2), sum)) 53 | 54 | # Subfunction: fitting substitution model 55 | 56 | fit_substmodel = function(N, L, substmodel, testpar) { 57 | 58 | l = c(L); n = c(N); r = c(substmodel) 59 | n = n[l!=0]; r = r[l!=0]; l = l[l!=0] 60 | 61 | params = unique(base::strsplit(x=paste(r,collapse="*"), split="\\*")[[1]]) 62 | indmat = as.data.frame(array(0, dim=c(length(r),length(params)))) 63 | colnames(indmat) = params 64 | for (j in 1:length(r)) { 65 | indmat[j, base::strsplit(r[j], split="\\*")[[1]]] = 1 66 | } 67 | 68 | model = glm(formula = n ~ offset(log(l)) + . -1, data=indmat, family=poisson(link=log)) 69 | mle = exp(coefficients(model)) # Maximum-likelihood estimates for the rate params 70 | ci = exp(confint.default(model)) # Wald confidence intervals 71 | 72 | pvals.wald = coef(summary(model))[,4] 73 | model.lrt = drop1(model, test="LRT", scope=testpar) 74 | pvals.lrt = setNames(model.lrt[[5]], row.names(model.lrt)) 75 | 76 | par = data.frame(name=gsub("\`","",rownames(ci)), mle=mle[rownames(ci)], cilow=ci[,1], cihigh=ci[,2], pval.wald=pvals.wald[rownames(ci)], pval.lrt=pvals.lrt[rownames(ci)]) # MLEs and Wald CI95% for the selection parameters 77 | 78 | return(list(par=par, model=model)) 79 | } 80 | 81 | syneqs = substmodel[,1] # Rate model for synonymous sites 82 | 83 | # Model 1: Fitting all mutation rates and the 3 global selection parameters 84 | 85 | rmatrix = array("",dim=dim(L)) 86 | rmatrix[,1] = c(paste(syneqs,"*r_rel",sep=""), syneqs) # This adds an extra parameter (r_rel) to account for a different mutation rate (synonymous density) in the gene set 87 | rmatrix[,2] = c(paste(syneqs,"*r_rel*wmis_geneset",sep=""), 
paste(syneqs,"*wmis_rest",sep="")) 88 | rmatrix[,3] = c(paste(syneqs,"*r_rel*wnon_geneset",sep=""), paste(syneqs,"*wnon_rest",sep="")) 89 | rmatrix[,4] = c(paste(syneqs,"*r_rel*wspl_geneset",sep=""), paste(syneqs,"*wspl_rest",sep="")) 90 | poissout = fit_substmodel(N, L, rmatrix, testpar = c("wmis_geneset","wnon_geneset","wspl_geneset","wmis_rest","wnon_rest","wspl_rest")) # Original substitution model 91 | par1 = poissout$par 92 | 93 | # Model 2: Fitting all mutation rates and the 2 global selection parameters 94 | 95 | rmatrix = array("",dim=dim(L)) 96 | rmatrix[,1] = c(paste(syneqs,"*r_rel",sep=""), syneqs) # This adds an extra parameter (r_rel) to account for a different mutation rate (synonymous density) in the gene set 97 | rmatrix[,2] = c(paste(syneqs,"*r_rel*wmis_geneset",sep=""), paste(syneqs,"*wmis_rest",sep="")) 98 | rmatrix[,3] = c(paste(syneqs,"*r_rel*wtru_geneset",sep=""), paste(syneqs,"*wtru_rest",sep="")) 99 | rmatrix[,4] = c(paste(syneqs,"*r_rel*wtru_geneset",sep=""), paste(syneqs,"*wtru_rest",sep="")) 100 | poissout = fit_substmodel(N, L, rmatrix, testpar = c("wmis_geneset","wtru_geneset","wmis_rest","wtru_rest")) # Original substitution model 101 | par2 = poissout$par 102 | 103 | # Model 3: Fitting all mutation rates and 1 global selection parameter 104 | 105 | rmatrix = array("",dim=dim(L)) 106 | rmatrix[,1] = c(paste(syneqs,"*r_rel",sep=""), syneqs) # This adds an extra parameter (r_rel) to account for a different mutation rate (synonymous density) in the gene set 107 | rmatrix[,2] = c(paste(syneqs,"*r_rel*wall_geneset",sep=""), paste(syneqs,"*wall_rest",sep="")) 108 | rmatrix[,3] = c(paste(syneqs,"*r_rel*wall_geneset",sep=""), paste(syneqs,"*wall_rest",sep="")) 109 | rmatrix[,4] = c(paste(syneqs,"*r_rel*wall_geneset",sep=""), paste(syneqs,"*wall_rest",sep="")) 110 | poissout = fit_substmodel(N, L, rmatrix, testpar = c("wall_geneset","wall_rest")) # Original substitution model 111 | par3 = poissout$par 112 | 113 | globaldnds_geneset = 
rbind(par1, par2, par3)[c("wmis_geneset","wnon_geneset","wspl_geneset","wtru_geneset","wall_geneset"),-1] 114 | globaldnds_rest = rbind(par1, par2, par3)[c("wmis_rest","wnon_rest","wspl_rest","wtru_rest","wall_rest"),-1] 115 | return(list(globaldnds_geneset=globaldnds_geneset, globaldnds_rest=globaldnds_rest)) 116 | } 117 | -------------------------------------------------------------------------------- /R/sitednds.R: -------------------------------------------------------------------------------- 1 | #' sitednds 2 | #' 3 | #' Function to estimate site-wise dN/dS values and p-values against neutrality. To generate a valid input object for this function, use outmats=T when running dndscv. This function is in testing, please interpret the results with caution. Also note that recurrent artefacts or SNP contamination can violate the null model and dominate the list of sites under apparent selection. A considerable number of significant synonymous sites may reflect a problem with the data. Be very critical of the results and if suspicious sites appear recurrently mutated consider refining the variant calling (e.g. using a better unmatched normal panel). In the future, this function may be extended to perform inferences at a codon level instead of at a single-base level. 4 | #' 5 | #' @author Inigo Martincorena (Wellcome Sanger Institute) 6 | #' 7 | #' @param dndsout Output object from dndscv. To generate a valid input object for this function, use outmats=T when running dndscv. 8 | #' @param min_recurr Minimum number of mutations per site to estimate site-wise dN/dS ratios. [default=2] 9 | #' @param gene_list List of genes to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of genes is decided a priori. [default=NULL, sitednds will be run on all genes in dndsout] 10 | #' @param site_list List of hotspot sites to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). 
Note that q-values are only valid if the list of sites is decided a priori. [default=NULL, sitednds will be run on all sites in dndsout] 11 | #' @param trinuc_list List of trinucleotide substitutions to which the sitednds analysis is restricted. This is used to estimate separate overdispersion parameters for different substitution contexts [default=NULL, sitednds will be run on all substitution contexts] 12 | #' @param theta_option 2 options: "mle" (uses the MLE of the overdispersion parameter) or "conservative" (uses the conservative bound of the CI95). Values other than "mle" or "MLE" will lead to the conservative option [default="conservative"] 13 | #' @param syn_drivers Vector with a list of known synonymous driver mutations to exclude from the background model [default="TP53:T125T"]. See Martincorena et al., Cell, 2017 (PMID:29056346). 14 | #' @param method Overdispersion model: NB = Negative Binomial (Gamma-Poisson), LNP = Poisson-Lognormal (see Hess et al., bioRxiv, 2019). [default="NB"] 15 | #' @param numbins Number of bins to discretise the rvec vector [default=1e4]. This enables fast execution of the LNP model in datasets of arbitrary size. 16 | #' @param kc List of a-priori known cancer genes (to be excluded when fitting the background model) [default="cgc81"] 17 | #' 18 | #' @return 'sitednds' returns a list containing a table of recurrently mutated sites and the estimates of the overdispersion parameter: 19 | #' @return - recursites: Table of recurrently mutated sites with site-wise dN/dS values and p-values 20 | #' @return - overdisp: Maximum likelihood estimate and CI95% for the overdispersion parameter (the size parameter of the negative binomial distribution or the sigma parameter of the lognormal distribution). The lower the size value or the higher the sigma value, the higher the variation of the mutation rate across sites not captured by the trinucleotide change or by variation across genes.
21 | #' @return - fpr_nonsyn_q05: Fraction of the significant non-synonymous sites (qval<0.05) that are estimated to be false positives. This assumes that all synonymous mutations (except those in TP53 and CDKN2A) are false positives, thus offering a conservative estimate of the false positive rate. 22 | #' @return - LL: Log-likelihood of the fit of the overdispersed model (see "method" argument) to all synonymous sites. 23 | #' 24 | #' @export 25 | 26 | sitednds = function(dndsout, min_recurr = 2, gene_list = NULL, site_list = NULL, trinuc_list = NULL, theta_option = "conservative", syn_drivers = "TP53:T125T", method = "NB", numbins = 1e4, kc = "cgc81") { 27 | 28 | ## 1. Fitting a negative binomial distribution at the site level considering the background mutation rate of the gene and of each trinucleotide 29 | message("[1] Site-wise overdispersed model accounting for trinucleotides and relative gene mutability...") 30 | 31 | # N and L matrices for synonymous mutations 32 | if (length(dndsout$N)==0) { stop(sprintf("Invalid input: dndsout must be generated using outmats=T in dndscv.")) } 33 | if (nrow(dndsout$mle_submodel)!=195) { stop("Invalid input: dndsout must be generated using the default trinucleotide substitution model in dndscv.") } 34 | 35 | # Restricting the analysis to an input list of genes 36 | if (!is.null(gene_list)) { 37 | g = as.vector(dndsout$genemuts$gene_name) 38 | # Correcting CDKN2A if required (hg19) 39 | if (any(g %in% c("CDKN2A.p14arf","CDKN2A.p16INK4a")) & any(gene_list=="CDKN2A")) { 40 | gene_list = unique(c(setdiff(gene_list,"CDKN2A"),"CDKN2A.p14arf","CDKN2A.p16INK4a")) 41 | } 42 | nonex = gene_list[!(gene_list %in% g)] 43 | if (length(nonex)>0) { 44 | warning(sprintf("The following input gene names are not in dndsout input object and will not be analysed: %s.", paste(nonex,collapse=", "))) 45 | } 46 | numtests = sum(dndsout$L[,,which(g %in% gene_list)]) 47 | } else { 48 | numtests = sum(dndsout$L) 49 | } 50 | 51 | # Input: known cancer 
genes to exclude from the background model fitting (the user can input a gene list as a character vector) 52 | if (is.null(kc)) { 53 | known_cancergenes = "" 54 | } else if (kc[1] %in% c("cgc81")) { 55 | data(list=sprintf("cancergenes_%s",kc), package="dndscv") 56 | } else { 57 | known_cancergenes = kc 58 | } 59 | 60 | # L matrix 61 | L = dndsout$L 62 | L[,2:4,which(as.vector(dndsout$genemuts$gene_name) %in% known_cancergenes)] = 0 # Removing non-synonymous sites in known_cancergenes 63 | L = apply(L, c(1,3), sum) # Total number of sites to be considered in the background model 64 | 65 | 66 | # Counts of observed mutations 67 | 68 | annotsubs = dndsout$annotmuts[which(dndsout$annotmuts$impact %in% c("Synonymous","Missense","Nonsense","Essential_Splice")),] 69 | num_syn_drivers_masked = sum(paste(annotsubs$gene,annotsubs$aachange,sep=":") %in% syn_drivers) 70 | 71 | if (!is.null(trinuc_list)) { # Restricting sitednds to certain trinucleotide changes 72 | annotsubs = annotsubs[which(paste(annotsubs$ref3_cod,annotsubs$mut3_cod,sep=">") %in% trinuc_list), ] 73 | if (nrow(annotsubs)==0) { stop("No mutations left after restricting by trinuc_list. 
Please review your input arguments and your mutation table (dndsout$annotmuts).") } 74 | } 75 | 76 | annotsubs$trisub = paste(annotsubs$chr,annotsubs$pos,annotsubs$ref,annotsubs$mut,annotsubs$gene,annotsubs$aachange,annotsubs$impact,annotsubs$ref3_cod,annotsubs$mut3_cod,sep=":") 77 | annotsubs = annotsubs[which(annotsubs$ref!=annotsubs$mut),] 78 | freqs = sort(table(annotsubs$trisub), decreasing=T) 79 | 80 | # Relative mutation rate per gene 81 | # Note that this assumes that the gene order in genemuts has not been altered with respect to the N and L matrices, as it is currently the case in dndscv 82 | relmr = dndsout$genemuts$exp_syn_cv/dndsout$genemuts$exp_syn 83 | names(relmr) = dndsout$genemuts$gene_name 84 | 85 | # Substitution rates (192 trinucleotide rates, strand-specific) 86 | sm = setNames(dndsout$mle_submodel$mle, dndsout$mle_submodel$name) 87 | sm["TTT>TGT"] = 1 # Adding the TTT>TGT rate (which is arbitrarily set to 1 relative to t) 88 | sm = sm*sm["t"] # Absolute rates 89 | sm = sm[setdiff(names(sm),c("wmis","wnon","wspl","t"))] # Removing selection parameters 90 | sm = sm[order(names(sm))] # Sorting 91 | 92 | if (!is.null(trinuc_list)) { # Restricting sitednds to certain trinucleotide changes 93 | sm[!(names(sm) %in% trinuc_list)] = 0 # Setting all other rates to zero 94 | } 95 | 96 | mat_trisub = array(sm, dim=c(192,nrow(dndsout$genemuts))) # Relative mutation rates by trinucleotide 97 | mat_relmr = t(array(relmr, dim=c(nrow(dndsout$genemuts),192))) # Relative mutation rates by gene 98 | R = mat_trisub * mat_relmr # Expected rate for each mutation type in each gene 99 | 100 | # Expanded vectors: full vectors of observed and expected mutations per site across all sites considered in dndsout 101 | rvec = rep(R, times=L) # Expanded vector of expected mutation counts per site 102 | nvec = array(0, length(rvec)) # Initialising the vector with observed mutation counts per site 103 | 104 | mutsites = read.table(text=names(freqs), header=0, sep=":", 
stringsAsFactors=F) # Frequency table of mutations 105 | colnames(mutsites) = c("chr","pos","ref","mut","gene","aachange","impact","ref3_cod","mut3_cod") 106 | mutsites$freq = freqs 107 | trindex = setNames(1:192, names(sm)) 108 | geneindex = setNames(1:length(names(relmr)), names(relmr)) 109 | mutsites$trindex = trindex[paste(mutsites$ref3_cod, mutsites$mut3_cod, sep=">")] 110 | mutsites$geneindex = geneindex[mutsites$gene] 111 | 112 | # Mutations for the background model (excluding non-synonymous mutations in known_cancergenes) 113 | synsites = mutsites[!(mutsites$impact!="Synonymous" & mutsites$gene %in% known_cancergenes),] 114 | synsites = synsites[!(paste(synsites$gene,synsites$aachange,sep=":") %in% syn_drivers),] 115 | Lcum = array(cumsum(L), dim=dim(L)) # Cumulative L indicating the position to place a given mutation in the nvec following rvec 116 | synsites$vecindex = apply(as.matrix(synsites[,c("trindex","geneindex")]), 1, function(x) Lcum[x[1], x[2]]) # Index for the mutation 117 | synsites = synsites[order(synsites$vecindex), ] # Sorting by index in the nvec 118 | 119 | # Stop execution with an error if there are no synonymous mutations 120 | if (nrow(synsites)<2) { 121 | stop("Too few synonymous mutations found in the input. 
sitednds cannot run without synonymous mutations.") 122 | } 123 | 124 | # Correcting the index when there are multiple synonymous mutations in the same gene and trinucleotide class 125 | s = snew = synsites$vecindex 126 | sameind = 0 127 | for (j in 2:nrow(synsites)) { 128 | if (s[j]<=s[j-1]) { 129 | sameind = sameind + 1 # Annotating a run of elements 130 | snew[j] = s[j-1] - sameind # We assign it an earlier position in the vector 131 | } else { 132 | sameind = 0 133 | } 134 | } 135 | synsites$vecindex2 = snew 136 | 137 | nvec[synsites$vecindex2] = synsites$freq # Expanded nvec for the negative binomial regression 138 | rvec = rvec * (sum(dndsout$genemuts$n_syn)-num_syn_drivers_masked) / sum(dndsout$genemuts$exp_syn_cv) # Minor correction ensuring that global observed and expected rates are identical (this works after subsetting substitutions) 139 | 140 | # Estimation of overdispersion: Using optimize appears to yield reliable results. Problems experienced with fitdistr, glm.nb and theta.ml. Consider using grid search if problems appear with optimize. 141 | if (method=="LNP") { # Modelling rates per site with a Poisson-Lognormal mixture 142 | 143 | lnp_est = fitlnpbin(nvec, rvec, theta_option = theta_option, numbins = numbins) 144 | theta_ml = lnp_est$ml$minimum 145 | theta_ci95 = lnp_est$sig_ci95 146 | LL = -lnp_est$ml$objective # LogLik 147 | thetaout = setNames(c(theta_ml, theta_ci95), c("MLE","CI95_high")) 148 | 149 | } else { # Modelling rates per site as negative binomially distributed (i.e. 
quantifying uncertainty above Poisson using a Gamma) 150 | 151 | nbin = function(theta, n=nvec, r=rvec) { -sum(dnbinom(x=n, mu=r, log=T, size=theta)) } # nbin loglik function for optimisation 152 | ml = optimize(nbin, interval=c(0,1000)) 153 | theta_ml = ml$minimum 154 | LL = -ml$objective # LogLik 155 | 156 | # CI95% for theta using profile likelihood and iterative grid search (this yields slightly conservative CI95) 157 | grid_proflik = function(bins=5, iter=5) { 158 | for (j in 1:iter) { 159 | if (j==1) { 160 | thetavec = sort(c(0, 10^seq(-3,3,length.out=bins), theta_ml, theta_ml*10, 1e4)) # Initial vals 161 | } else { 162 | thetavec = sort(c(seq(thetavec[ind[1]], thetavec[ind[1]+1], length.out=bins), seq(thetavec[ind[2]-1], thetavec[ind[2]], length.out=bins))) # Refining previous iteration 163 | } 164 | 165 | proflik = sapply(thetavec, function(theta) -sum(dnbinom(x=nvec, mu=rvec, size=theta, log=T))-ml$objective) < qchisq(.95,1)/2 # Values of theta within CI95% 166 | ind = c(which(proflik[1:(length(proflik)-1)]==F & proflik[2:length(proflik)]==T)[1], 167 | which(proflik[1:(length(proflik)-1)]==T & proflik[2:length(proflik)]==F)[1]+1) 168 | if (is.na(ind[1])) { ind[1] = 1 } 169 | if (is.na(ind[2])) { ind[2] = length(thetavec) } 170 | } 171 | return(thetavec[ind]) 172 | } 173 | theta_ci95 = grid_proflik(bins=5, iter=5) 174 | thetaout = setNames(c(theta_ml, theta_ci95), c("MLE","CI95low","CI95_high")) 175 | } 176 | 177 | 178 | ## 2. 
Calculating site-wise dN/dS ratios and P-values for recurrently mutated sites 179 | message("[2] Calculating site-wise dN/dS ratios and p-values...") 180 | 181 | # Theta option 182 | if (theta_option=="mle" | theta_option=="MLE") { 183 | theta = theta_ml 184 | } else { # Conservative 185 | message(" Using the conservative bound of the confidence interval of the overdispersion parameter.") 186 | theta = theta_ci95[1] 187 | } 188 | 189 | # Creating the recursites object 190 | recursites = mutsites[, c("chr","pos","ref","mut","gene","aachange","impact","ref3_cod","mut3_cod","freq")] 191 | recursites$mu = relmr[recursites$gene] * sm[paste(recursites$ref3_cod,recursites$mut3_cod,sep=">")] 192 | 193 | # Gene RHT 194 | if (!is.null(gene_list)) { 195 | message(" Performing Restricted Hypothesis Testing on the input list of a-priori genes") 196 | recursites = recursites[which(recursites$gene %in% gene_list), ] # Restricting the p-value and q-value calculations to gene_list 197 | } 198 | 199 | # Site RHT 200 | if (!is.null(site_list)) { 201 | message(" Performing Restricted Hypothesis Testing on the input list of a-priori sites (numtests = length(site_list))") 202 | mutstr = paste(recursites$chr,recursites$pos,recursites$ref,recursites$mut,recursites$gene,recursites$aachange,recursites$ref3_cod,recursites$mut3_cod,sep=":") 203 | if (!any(mutstr %in% site_list)) { 204 | stop("No mutation was observed in the restricted list of known hotspots.
Site-RHT cannot be run.") 205 | } 206 | recursites = recursites[which(mutstr %in% site_list), ] # Restricting the p-value and q-value calculations to site_list 207 | numtests = length(site_list) 208 | 209 | # Calculating global dN/dS ratios at known hotspots 210 | auxsites = as.data.frame(do.call("rbind",strsplit(site_list,split=":")), stringsAsFactors=F) 211 | auxsites = auxsites[auxsites$V5 %in% names(relmr), ] 212 | neutralexp = sum(relmr[auxsites$V5]*sm[paste(auxsites$V7,auxsites$V8,sep=">")]) # Number of mutations expected at known hotspots under neutrality 213 | numobs = sum(recursites$freq) # Number observed 214 | poistest = poisson.test(numobs, T=neutralexp) 215 | globaldnds_knownsites = setNames(c(numobs, neutralexp, poistest$estimate, poistest$conf.int), c("obs","exp","dnds","cilow","cihigh")) 216 | message(sprintf(" Mutations at known hotspots: %0.0f observed, %0.3g expected, obs/exp~%0.3g (CI95:%0.3g,%0.3g).", globaldnds_knownsites[1], globaldnds_knownsites[2], globaldnds_knownsites[3], globaldnds_knownsites[4], globaldnds_knownsites[5])) 217 | } 218 | 219 | # Restricting the recursites output by min_recurr 220 | recursites = recursites[recursites$freq>=min_recurr, ] # Restricting the output to sites with min_recurr 221 | 222 | if (nrow(recursites)>0) { 223 | 224 | recursites$dnds = recursites$freq / recursites$mu # Site-wise dN/dS (point estimate) 225 | 226 | if (method=="LNP") { # Modelling rates per site with a Poisson-Lognormal mixture 227 | 228 | # Cumulative Lognormal-Poisson using poilog::dpoilog 229 | dpoilog = poilog::dpoilog 230 | ppoilog = function(n, mu, sig) { 231 | p = sum(dpoilog(n=floor(n+1):floor(n*10+1000), mu=log(mu)-sig^2/2, sig=sig)) 232 | return(p) 233 | } 234 | 235 | message(sprintf(" Modelling substitution rates using a Lognormal-Poisson: sig = %0.3g (upperbound = %0.3g)", theta_ml, theta_ci95)) 236 | recursites$pval = apply(recursites, 1, function(x) ppoilog(n=as.numeric(x["freq"])-0.5, mu=as.numeric(x["mu"]), sig=theta)) 237 | 
238 | } else { # Negative binomial model 239 | 240 | message(sprintf(" Modelling substitution rates using a Negative Binomial: theta = %0.3g (CI95:%0.3g,%0.3g)", theta_ml, theta_ci95[1], theta_ci95[2])) 241 | recursites$pval = pnbinom(q=recursites$freq-0.5, mu=recursites$mu, size=theta, lower.tail=F) 242 | } 243 | 244 | recursites = recursites[order(recursites$pval, -recursites$freq), ] # Sorting by p-val and frequency 245 | recursites$qval = p.adjust(recursites$pval, method="BH", n=numtests) # P-value adjustment for all possible changes 246 | rownames(recursites) = NULL 247 | 248 | # Estimating False Positive Rates based on the observed number of significant synonymous hits 249 | 250 | qcutoff = 0.05 # q-value cutoff to estimate false positive rates 251 | if (any(recursites$qvalqcutoff) { 262 | warning(sprintf("The estimated false positive rate for nonsynonymous hits (qval<0.05) is %0.3f (CI95:%0.3f,%0.3f). High false positive rates (>>0.05) evidence problems with the data or the model and mean that the results are not reliable.", fpr_nonsyn$estimate, fpr_nonsyn$conf.int[1], fpr_nonsyn$conf.int[2])) 263 | } 264 | 265 | if (!is.null(site_list)) { # We do not compute FPRs when restricting the analysis to a list of a priori hotspots 266 | fpr_nonsyn = NULL 267 | } 268 | 269 | } else { 270 | fpr_nonsyn = NULL 271 | } 272 | 273 | } else { 274 | recursites = fpr_nonsyn = lnp_est = NULL 275 | warning("No site was found with the minimum recurrence requested [default min_recurr=2]") 276 | } 277 | 278 | if (is.null(site_list)) { 279 | return(list(recursites=recursites, overdisp=thetaout, fpr_nonsyn_q05=fpr_nonsyn, LL=LL)) 280 | } else { 281 | return(list(recursites=recursites, overdisp=thetaout, fpr_nonsyn_q05=fpr_nonsyn, LL=LL, globaldnds_knownsites=globaldnds_knownsites)) 282 | } 283 | } 284 | -------------------------------------------------------------------------------- /R/withingenednds.R: 
-------------------------------------------------------------------------------- 1 | #' withingenednds 2 | #' 3 | #' This function uses Poisson and Negative Binomial regression models at single-site level to study selection across different regions (coding and non-coding) within a gene. 4 | #' 5 | #' @author Inigo Martincorena (Wellcome Sanger Institute) 6 | #' @details Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041. 7 | #' 8 | #' @param mutations Data frame with all the mutations detected in the study (5-column input table as for dndscv: sampleID, chr, pos, ref, mut). 9 | #' @param gene Name of the gene of interest. This function is currently designed to work on a single gene, but combined analyses of multiple genes could be done using the sites output table generated by this function. 10 | #' @param covtable Table with all sites of interest in the gene. This should be a data frame with one row per site and the following columns: chr, pos, dc (duplex depth). Additional columns will not be used. 11 | #' @param dndsout dndscv output object for the dataset. This is mainly used for the MLEs of the substitution model. Running dndscv on all genes in the dataset is recommended unless the gene of interest is believed to have a different substitution model. 12 | #' @param genomeFile Path to a reference fasta file for the genome assembly. 13 | #' @param regionschr Optional data frame with user-defined regions of interest in the gene. This allows the user to define arbitrary regions within a gene (coding or non-coding) from which to calculate omega (selection or obs/exp) values (e.g. protein domains, splicing regulator regions, core promoters, etc). The table should contain the following columns: chr, start, end, wname (a unique name for the w parameter, e.g. wdomain1, wcorepromoter), impacts (e.g. 
Missense or Missense,Nonsense will restrict the w calculation to Missense mutations, or to Missense and Nonsense mutations, in the region, respectively; multiple impacts are comma-separated), layered (1/0; using "0" removes other w parameters influencing the site, whereas using "1" models selection as relative to other w parameters active at these sites). 14 | #' @param regionsaa Optional data frame with user-defined regions of interest in the gene, using aminoacid coordinates. The table should contain the following columns: gene, aa_start, aa_end, wname (e.g. wdomain1), impacts. 15 | #' @param fixtheta Pre-calculated overdispersion (theta) parameter. This should be calculated using sitednds(., method="NB"). 16 | #' @param refdb Reference database ("hg19" [default], a path to an .rda file, or a pre-loaded array object in the right format). 17 | #' @param numcode NCBI genetic code number (default = 1; standard genetic code). To see the list of genetic codes supported use: ?seqinr::translate 18 | #' @param normalisefromsyn Normalise the substitution rates based on the synonymous mutations in the gene. Using TRUE is recommended. Using FALSE uses the expected synonymous mutation rate of the gene from the dndscv negative binomial regression model (dndsout$genemuts). 19 | #' @param syndrivers Vector of known synonymous driver sites defined by their aminoacid position, to be excluded from the background model (e.g. syndrivers = c("T125T","E224E","Q331Q") for TP53). 20 | #' @param exon_flank_length Exon flank length in bp [default = 10]. Using a value higher than 0 will calculate a separate selection (w) coefficient for synonymous mutations in exon flanks. 21 | #' @param intron_flank_length Intron flank length in bp [default = 10]. Intronic sites occurring within these flanks but not already classified as Essential_Splice will receive a separate w parameter. 22 | #' @param sitefilename Optionally, provide a file name to save the table of all annotated sites in the gene. This table is also always contained in the output object.
23 | #' 24 | #' @return 'withingenednds' returns a list of objects: 25 | #' @return - sites: Table with the annotation of all sites in the gene (from covtable), including all functional annotations in the "regions" input object as well as default annotations (Missense, Nonsense, Essential_Splice, Start_loss, Stop_loss, etc). 26 | #' @return - par.pois: Poisson regression results (not recommended). 27 | #' @return - par.nb: Negative binomial results fitting a new overdispersion parameter to the data (when fixtheta is not provided). 28 | #' @return - par.nbfix: Negative binomial results using the input fixtheta value as recommended. 29 | #' @return - model.pois: Poisson regression object. 30 | #' @return - model.nb: Negative binomial regression object. 31 | #' @return - model.nbfix: Negative binomial regression object. 32 | #' 33 | #' @export 34 | 35 | withingenednds = function(mutations, gene, covtable, dndsout, genomeFile, regionschr = NULL, regionsaa = NULL, fixtheta = NULL, normalisefromsyn = TRUE, syndrivers = NULL, exon_flank_length = 10, intron_flank_length = 10, sitefilename = NULL, refdb = "hg19", numcode = 1) { 36 | 37 | ## 1. 
Environment 38 | message("Running gene-level selection analyses...") 39 | 40 | # [Input] Reference database 41 | refdb_class = class(refdb) 42 | if ("character" %in% refdb_class) { 43 | if (refdb == "hg19") { 44 | data("refcds_hg19", package="dndscv") 45 | } else { 46 | load(refdb) 47 | } 48 | } else if("array" %in% refdb_class) { 49 | # use the user-supplied RefCDS object 50 | RefCDS = refdb 51 | } else { 52 | stop("Expected refdb to be \"hg19\", a file path, or a RefCDS-formatted array object.") 53 | } 54 | 55 | # Annotating all possible coding changes in the positions provided in the input covtable 56 | library("Rsamtools") 57 | covtable$ref = as.vector(scanFa(genomeFile, GRanges(covtable$chr, IRanges(covtable$pos, covtable$pos)))) 58 | covtable$ref3 = as.vector(scanFa(genomeFile, GRanges(covtable$chr, IRanges(covtable$pos-1, covtable$pos+1)))) 59 | 60 | allsubs = data.frame(sampleID="allsubs", chr = rep(covtable$chr, each=3), pos = rep(covtable$pos, each=3), ref = rep(covtable$ref, each=3), mut = NA, ref3 = rep(covtable$ref3, each=3), mut3 = NA, ref_cod = NA, mut_cod = NA, ref3_cod = NA, mut3_cod = NA) 61 | for (j in seq(1,nrow(allsubs),3)) { 62 | allsubs$mut[j:(j+2)] = setdiff(c("A","C","G","T"),allsubs$ref[j]) 63 | } 64 | allsubs$mut3 = paste(substr(allsubs$ref3,1,1), allsubs$mut, substr(allsubs$ref3,3,3), sep="") 65 | aux = dndscv(allsubs, gene_list = gene, max_coding_muts_per_sample = Inf, max_muts_per_gene_per_sample = Inf, outp = 1)$annotmuts # Annotated mutations 66 | 67 | # Adding duplex depth, functional impact annotation to all possible changes in the input table 68 | aux$mstr = paste(aux$chr, aux$pos, aux$mut, sep=":") # mutation string ID 69 | obssubs = dndsout$annotmuts[which(dndsout$annotmuts$ref %in% c("A","C","G","T") & dndsout$annotmuts$mut %in% c("A","C","G","T") & dndsout$annotmuts$gene == gene), ] 70 | obssubs$mstr = paste(obssubs$chr, obssubs$pos, obssubs$mut, sep=":") 71 | m$mstr = paste(m$chr, m$pos, m$mut, sep=":") 72 | 73 | sites = 
cbind(data.frame(gene=gene, pid=aux$pid[1]), allsubs[,-1]) # Initialising the sites table 74 | pos2dc = setNames(covtable$dc, covtable$pos) 75 | sites$dc = pos2dc[as.character(sites$pos)] 76 | sites$mstr = paste(sites$chr, sites$pos, sites$mut, sep=":") 77 | sites$obs = table(m$mstr)[sites$mstr]; sites$obs[is.na(sites$obs)] = 0 # Annotating the number of observed mutations at each site 78 | 79 | mstr2imp1 = setNames(aux$ntchange, aux$mstr) 80 | mstr2imp2 = setNames(aux$aachange, aux$mstr) 81 | mstr2imp3 = setNames(aux$impact, aux$mstr) 82 | sites$ntchange = mstr2imp1[sites$mstr] 83 | sites$aachange = mstr2imp2[sites$mstr] 84 | sites$impact = mstr2imp3[sites$mstr] 85 | 86 | # Adding strand and annotating the coding trinucleotide change 87 | aux2 = unique(dndsout$annotmuts[,c("gene","strand")]) 88 | gene2strand = setNames(aux2[,2],aux2[,1]) 89 | sites$strand = gene2strand[sites$gene] 90 | 91 | sites$ref_cod = sites$ref 92 | sites$ref_cod[sites$strand==-1] = seqinr::comp(sites$ref[sites$strand==-1], forceToLower = F) 93 | sites$mut_cod = sites$mut 94 | sites$mut_cod[sites$strand==-1] = seqinr::comp(sites$mut[sites$strand==-1], forceToLower = F) 95 | 96 | revcomp = function(seqvec) { 97 | as.vector(sapply(seqvec, function(x) paste(seqinr::comp(rev(unlist(strsplit(x,split=""))), forceToLower=F), collapse=""))) # Reverse complement of a vector of sequence motifs 98 | } 99 | sites$ref3_cod = sites$ref3 100 | sites$mut3_cod = sites$mut3 101 | if (any(sites$strand==-1)) { 102 | sites$ref3_cod[sites$strand==-1] = revcomp(sites$ref3[sites$strand==-1]) 103 | sites$mut3_cod[sites$strand==-1] = revcomp(sites$mut3[sites$strand==-1]) 104 | } 105 | 106 | # Adding the trinucleotide substitution rates from the dndsout model 107 | 108 | mle_submodel = setNames(dndsout$mle_submodel[,2], dndsout$mle_submodel[,1]) 109 | mle_submodel = c(mle_submodel, "TTT>TGT"=1) # Adding the missing rate (note that all rates in mle_submodel in dNdScv are relative to TTT>TGT) 110 | mle_submodel = 
mle_submodel * mle_submodel["t"] # Absolute rates 111 | sites$r = mle_submodel[paste(sites$ref3_cod, sites$mut3_cod, sep=">")] 112 | sites$r = sites$r * sites$dc / mean(sites$dc) # Normalising the expected rates at a site by the duplex depth 113 | 114 | # Normalising the global expected rates by the estimated mutation rate of the gene using one of two alternative models: 115 | # 1. normalisefromsyn = TRUE. We normalise "r" using the synonymous mutations observed in the gene (excluding known syn driver sites) 116 | # 2. normalisefromsyn = FALSE. We normalise "r" using the negative binomial model from dndscv. 117 | 118 | if (normalisefromsyn == TRUE) { 119 | mobs = sum(sites$obs[which(sites$impact=="Synonymous" & !(sites$aachange %in% syndrivers))]) 120 | mexp = sum(sites$r[which(sites$impact=="Synonymous" & !(sites$aachange %in% syndrivers))]) 121 | sites$rnorm = sites$r * mobs / mexp 122 | } else { 123 | message("Option not recommended: Normalising the mutation rate of the gene based on the negative binomial regression model of dNdScv") 124 | sites$rnorm = sites$r * dndsout$genemuts$exp_syn_cv[dndsout$genemuts$gene_name==gene] / dndsout$genemuts$exp_syn[dndsout$genemuts$gene_name==gene] 125 | } 126 | 127 | # Excluding sites with rate = 0 (e.g. due to lack of duplex coverage) 128 | ratezero = which(sites$rnorm==0) 129 | if (length(ratezero)>0) { 130 | sites = sites[-ratezero, ] 131 | message(sprintf("%0.0f sites have been excluded due to having a duplex depth and/or a predicted mutation rate of 0", length(ratezero))) 132 | } 133 | 134 | ## 2.
Creating an index regression matrix with all functional annotations (each row is a site and each column is a parameter in the selection model) 135 | 136 | # Annotating Missense, Nonsense, Essential_Splice and Stop_loss mutations 137 | mutmat = data.frame(t = rep(1,nrow(sites)), wmis = 0, wnon = 0, wspl = 0, wsyndriv = 0, 138 | wintr = 0, woutcds = 0, wexfl = 0, winfl = 0, wnonlastex = 0, wstoploss = 0, wstartloss = 0) 139 | mutmat$wmis[which(sites$impact=="Missense")] = 1 140 | mutmat$wnon[which(sites$impact=="Nonsense")] = 1 141 | mutmat$wspl[which(sites$impact=="Essential_Splice")] = 1 142 | mutmat$wsyndriv[which(sites$impact=="Synonymous" & sites$aachange %in% syndrivers)] = 1 143 | mutmat$wstoploss[which(sites$impact=="Stop_loss")] = 1 144 | 145 | # Annotating Start_loss mutations (mutations in codon 1) 146 | 147 | #start_loss = which(sites$impact != "Synonymous" & substr(sites$aachange,2,nchar(sites$aachange)-1) == "1") # There is no need to check for synonymous changes in ATG 148 | start_loss = which(substr(sites$aachange,2,nchar(sites$aachange)-1) == "1") # There is no need to check for synonymous changes in ATG 149 | mutmat$wstartloss[start_loss] = 1 150 | sites$impact[start_loss] = "Start_loss" 151 | 152 | # Annotating sites outside the CDS, introns, exon flanks, and intron flanks 153 | 154 | refcdsgene = RefCDS[[which(sapply(RefCDS, function(x) x$gene_name)==gene)]] 155 | exons = refcdsgene$intervals_cds 156 | esspl = refcdsgene$intervals_splice 157 | 158 | gr_sites = GenomicRanges::GRanges(sites$gene, IRanges::IRanges(sites$pos,sites$pos)) 159 | gr_exons = GenomicRanges::GRanges(gene, IRanges::IRanges(exons[,1],exons[,2])) 160 | gr_cds = GenomicRanges::GRanges(gene, IRanges::IRanges(min(exons),max(exons))) 161 | gr_outcds = setdiff(gr_sites, gr_cds) 162 | ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_outcds, type="any", select="all")) 163 | mutmat$woutcds[unique(ol[,1])] = 1 # Annotation of the sites outside the CDS span 164 | 165 | # For genes with >1 exon, we also annotate
essential splice sites, and intron and exon flanks. 166 | if (length(esspl)>0) { 167 | gr_esspl = GenomicRanges::GRanges(gene, IRanges::IRanges(esspl,esspl)) 168 | gr_introns = setdiff(setdiff(gr_cds, gr_exons), gr_esspl) 169 | exfl = rbind(cbind(exons[2:nrow(exons),1],exons[2:nrow(exons),1]+exon_flank_length-1), cbind(exons[1:(nrow(exons)-1),2]-exon_flank_length+1,exons[1:(nrow(exons)-1),2])) 170 | gr_exonflanks = GenomicRanges::GRanges(gene, IRanges::IRanges(exfl[,1],exfl[,2])) 171 | gr_exonflanks = intersect(gr_exonflanks, gr_exons) # Ensuring that the exon flanks do not extend into introns 172 | infl = rbind(cbind(exons[2:nrow(exons),1]-intron_flank_length,exons[2:nrow(exons),1]-1), cbind(exons[1:(nrow(exons)-1),2]+1,exons[1:(nrow(exons)-1),2]+intron_flank_length)) 173 | gr_intrflanks = GenomicRanges::GRanges(gene, IRanges::IRanges(infl[,1],infl[,2])) 174 | gr_intrflanks = setdiff(gr_intrflanks, gr_exons) # Removing any overlaps with exonic sequences 175 | gr_intrflanks = setdiff(gr_intrflanks, gr_esspl) # Removing any overlaps with essential splice site sequences 176 | gr_intrflanks = intersect(gr_intrflanks, gr_cds) # Removing UTR sequences 177 | 178 | ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_introns, type="any", select="all")) 179 | mutmat$wintr[unique(ol[,1])] = 1 # Annotation of the intronic sites 180 | 181 | ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_introns, type="any", select="all")) 182 | mutmat$wintr[unique(ol[,1])] = 1 # Annotation of the intronic sites 183 | 184 | ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_exonflanks, type="any", select="all")) 185 | mutmat$wexfl[seq(1,nrow(sites)) %in% ol[,1] & sites$impact=="Synonymous"] = 1 # Annotation of the exonic flank sites (only for synonymous mutations) 186 | 187 | ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_intrflanks, type="any", select="all")) 188 | mutmat$winfl[unique(ol[,1])] = 1 # Annotation of the intronic flank sites 189 | } 190 |
191 | # Annotation of nonsense mutations in the last coding exon 192 | 193 | if (nrow(exons)>1) { 194 | if (refcdsgene$strand==1) { 195 | lastexon = exons[nrow(exons),] 196 | } else { 197 | lastexon = exons[1,] 198 | } 199 | 200 | gr_lastexon = GenomicRanges::GRanges(gene, IRanges::IRanges(min(lastexon),max(lastexon))) 201 | ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_lastexon, type="any", select="all")) 202 | mutmat$wnonlastex[seq(1,nrow(sites)) %in% ol[,1] & sites$impact=="Nonsense"] = 1 203 | mutmat$wnon[seq(1,nrow(sites)) %in% ol[,1] & sites$impact=="Nonsense"] = 0 # Removing Nonsense mutations in the last exon from the wnon annotation 204 | } 205 | 206 | ## 3. Other annotations from the user 207 | 208 | # Regions defined by chr position 209 | if (!is.null(regionschr)) { 210 | 211 | wnames = unique(regionschr$wname) 212 | badnames = intersect(wnames,unique(colnames(mutmat))) 213 | if (length(badnames)>0) { stop(sprintf("The following w names are not allowed in the input regions as they match existing parameters: %s", paste(badnames,collapse = ","))) } 214 | 215 | for (j in 1:length(wnames)) { 216 | 217 | aux = regionschr[which(regionschr$wname==wnames[j]), ] 218 | if (length(unique(aux$impacts))!=1) { stop("regionschr: different values found in the impacts column for a given feature, please correct your input object") } 219 | imps = strsplit(unique(aux$impacts), split=",")[[1]] 220 | if (any(!(imps %in% setdiff(unique(sites$impact),NA)))) { stop("regionschr: invalid impact terms used in the input object, please correct the impact column") } 221 | 222 | gr_wname = GenomicRanges::GRanges(gene, IRanges::IRanges(aux$start,aux$end)) 223 | ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_wname, type="any", select="all")) 224 | mutmat[,wnames[j]] = 0 # Initialising this field in the data frame 225 | if (length(imps)==0) { 226 | indx = unique(ol[,1]) # If no impacts are indicated, we consider all sites independent of their impact 227 | } else 
{ 228 | indx = intersect(unique(ol[,1]), which(sites$impact %in% imps)) 229 | } 230 | mutmat[indx,wnames[j]] = 1 231 | 232 | # Removing previous w parameters if layered=0 233 | if (aux$layered[1]==0 | aux$layered[1]==FALSE) { 234 | mutmat[indx, setdiff(colnames(mutmat),c("t",wnames[j]))] = 0 # Removing previous annotations at these sites 235 | } 236 | } 237 | } 238 | 239 | # Regions defined by amino acid position 240 | if (!is.null(regionsaa)) { 241 | 242 | sites$aux = 1:nrow(sites) 243 | aas = sites[which(!is.na(sites$aachange) & sites$aachange!="."),] 244 | aas$aapos = as.numeric(substr(aas$aachange,2,nchar(aas$aachange)-1)) 245 | gr_aas = GenomicRanges::GRanges(gene, IRanges::IRanges(aas$aapos,aas$aapos)) 246 | 247 | wnames = unique(regionsaa$wname) 248 | badnames = intersect(wnames,unique(colnames(mutmat))) 249 | if (length(badnames)>0) { stop(sprintf("The following w names are not allowed in the input regions as they match existing parameters: %s", paste(badnames,collapse = ", "))) } 250 | 251 | for (j in 1:length(wnames)) { 252 | 253 | aux = regionsaa[which(regionsaa$wname==wnames[j]), ] 254 | if (length(unique(aux$impacts))!=1) { stop("regionsaa: different values found in the impacts column for a given feature, please correct your input object") } 255 | imps = strsplit(unique(aux$impacts), split=",")[[1]] 256 | if (any(!(imps %in% setdiff(unique(sites$impact),NA)))) { stop("regionsaa: invalid impact terms used in the input object, please correct the impact column") } 257 | 258 | gr_wname = GenomicRanges::GRanges(gene, IRanges::IRanges(aux$start,aux$end)) 259 | ol = as.data.frame(GenomicRanges::findOverlaps(gr_aas, gr_wname, type="any", select="all")) 260 | mutmat[,wnames[j]] = 0 # Initialising this field in the data frame 261 | if (length(imps)==0) { 262 | indx = aas$aux[unique(ol[,1])] # If no impacts are indicated, we consider all sites independent of their impact 263 | } else { 264 | indx = aas$aux[intersect(unique(ol[,1]), which(aas$impact %in% imps))]
265 | } 266 | mutmat[indx,wnames[j]] = 1 267 | 268 | # Removing previous w parameters if layered=0 269 | if (aux$layered[1]==0 | aux$layered[1]==FALSE) { 270 | mutmat[indx, setdiff(colnames(mutmat),c("t",wnames[j]))] = 0 # Removing previous annotations at these sites 271 | } 272 | } 273 | sites = sites[, setdiff(colnames(sites),"aux")] 274 | } 275 | 276 | ## 4. Poisson regression: fitting the selection model 277 | 278 | model = glm(formula = sites$obs ~ offset(log(sites$rnorm)) + . -1, data=mutmat, family=poisson(link=log)) 279 | mle = exp(coefficients(model)) # Maximum-likelihood estimates for the rate params 280 | pvals = coef(summary(model))[,4] 281 | model.lrt = drop1(model, test= "Chisq") 282 | pvals.lrt = setNames(model.lrt[[5]], row.names(model.lrt)) 283 | ci = exp(confint.default(model)) # Wald confidence intervals 284 | par.pois = data.frame(name=gsub("\`","",rownames(ci)), mle=mle[rownames(ci)], cilow=ci[,1], cihigh=ci[,2], pval.wald=pvals[rownames(ci)], pval.lrt=pvals.lrt[rownames(ci)]) # MLEs and Wald CI95% for the selection parameters 285 | 286 | # Adding obs/exp statistics to the regression outputs 287 | nobs = apply(mutmat, 2, function(x) sum(sites$obs[x==1])) 288 | nexp = apply(mutmat, 2, function(x) sum(sites$rnorm[x==1])) 289 | par.pois$obs = nobs[par.pois$name] 290 | par.pois$exp = nexp[par.pois$name] 291 | 292 | ## 5. Negative binomial regression 293 | 294 | model.nbfix = model.nb = par.nbfix = par.nb = NULL 295 | 296 | if (!is.null(fixtheta)) { 297 | 298 | model.nbfix = glm(formula = sites$obs ~ offset(log(sites$rnorm)) + . 
-1, data=mutmat, family=MASS::negative.binomial(fixtheta)) 299 | mle = exp(coefficients(model.nbfix)) # Maximum-likelihood estimates for the rate params 300 | pvals = coef(summary(model.nbfix))[,4] 301 | model.lrt = drop1(model.nbfix, test= "Chisq") 302 | pvals.lrt = setNames(model.lrt[[5]], row.names(model.lrt)) 303 | ci = exp(confint.default(model.nbfix)) # Wald confidence intervals 304 | par.nbfix = data.frame(name=gsub("\`","",rownames(ci)), mle=mle[rownames(ci)], cilow=ci[,1], cihigh=ci[,2], pval.wald=pvals[rownames(ci)], pval.lrt=pvals.lrt[rownames(ci)]) # MLEs and Wald CI95% for the selection parameters 305 | par.nbfix$obs = nobs[par.nbfix$name] 306 | par.nbfix$exp = nexp[par.nbfix$name] 307 | 308 | } else { 309 | 310 | model.nb = MASS::glm.nb(formula = as.vector(sites$obs) ~ offset(log(sites$rnorm)) + . -1, data=mutmat) 311 | mle = exp(coefficients(model.nb)) # Maximum-likelihood estimates for the rate params 312 | pvals = coef(summary(model.nb))[,4] 313 | model.lrt = drop1(model.nb, test= "Chisq") 314 | pvals.lrt = setNames(model.lrt[[5]], row.names(model.lrt)) 315 | ci = exp(confint.default(model.nb)) # Wald confidence intervals 316 | par.nb = data.frame(name=gsub("\`","",rownames(ci)), mle=mle[rownames(ci)], cilow=ci[,1], cihigh=ci[,2], pval.wald=pvals[rownames(ci)], pval.lrt=pvals.lrt[rownames(ci)]) # MLEs and Wald CI95% for the selection parameters 317 | par.nb$obs = nobs[par.nb$name] 318 | par.nb$exp = nexp[par.nb$name] 319 | } 320 | 321 | ## 6. 
Outputs 322 | 323 | # Annotated sites table 324 | 325 | sites2 = cbind(sites, mutmat) 326 | if (!is.null(sitefilename)) { 327 | sites2 = apply(sites2,2,as.character) 328 | write.table(sites2, file=sitefilename, col.names = TRUE, row.names = FALSE, quote = FALSE, sep = "\t") 329 | } 330 | 331 | # Inform the user about the sites used as neutral reference by the model 332 | 333 | neutral_sites = (rowSums(mutmat[,setdiff(colnames(mutmat),"t")])==0) 334 | neutral_nsyn = sum(sites$obs[which(neutral_sites & sites$impact=="Synonymous")]) # Number of synonymous mutations used as background for dN/dS 335 | neutral_nsynsites = length(which(neutral_sites & sites$impact=="Synonymous")) 336 | neutral_othersites = length(which(neutral_sites & sites$impact!="Synonymous")) 337 | nonneutral_nsyn = sum(sites$obs[which(neutral_sites==0 & sites$impact=="Synonymous")]) # Number of synonymous mutations excluded from the neutral background by the "w" annotations 338 | 339 | message(sprintf(" Sites used as neutral reference: %0.0f synonymous mutations across %0.0f synonymous sites.", neutral_nsyn, neutral_nsynsites)) 340 | if (neutral_othersites>0) { 341 | message(sprintf(" %0.0f sites not classified as synonymous coding sites were used in the background model. Please ensure that this was intended.", neutral_othersites)) 342 | } 343 | if (nonneutral_nsyn>0) { 344 | message(sprintf(" %0.0f synonymous mutations were not used in the neutral background model. This may be due to excluding known synonymous driver mutations (when using syndrivers), excluding synonymous mutations in exon flanks (when using exon_flank_length>0) and/or due to annotations provided by the user.
Please ensure that this is the desired behaviour.", nonneutral_nsyn)) 345 | } 346 | 347 | # Output object 348 | out = list(sites = sites2, dnds.pois = par.pois, dnds.nb = par.nb, dnds.nbfix = par.nbfix, model.pois = model, model.nb = model.nb, model.nbfix = model.nbfix) 349 | 350 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | dndscv 2 | ===== 3 | 4 | Description 5 | --- 6 | The **dNdScv** R package is a group of maximum-likelihood dN/dS methods designed to 7 | quantify selection in cancer and somatic evolution (Martincorena *et al.*, 2017). The 8 | package contains functions to quantify dN/dS ratios for missense, nonsense and 9 | essential splice mutations, at the level of individual genes, groups of genes or at 10 | whole-exome level. The *dndscv* function within the package was originally designed 11 | to detect cancer driver genes (*i.e.* genes under positive selection in cancer genomes) 12 | on datasets ranging from a few samples to thousands of samples, in whole-exome/genome 13 | or targeted sequencing studies. 14 | 15 | Although initially designed for cancer genomic studies, this package can also be used 16 | with appropriate caution to study selection in other resequencing studies, such 17 | as SNP analyses, mutation accumulation studies in bacteria or for the discovery 18 | of mutations causing developmental disorders using data from human trios. Please 19 | study the optional arguments carefully if you are using the dndscv function for 20 | other applications. 
21 | 22 | When using the dndscv function in the package (sel_cv output object), the background 23 | mutation rate of each gene is estimated by combining local information 24 | (synonymous mutations in the gene) and global information (variation of the mutation 25 | rate across genes, exploiting epigenomic covariates), and controlling for the sequence 26 | composition of the gene and mutational signatures. Constraining the expected neutral 27 | mutation rate of a gene using information from other genes considerably increases 28 | the sensitivity to detect positive selection in sparse datasets. 29 | 30 | Unlike traditional implementations of dN/dS using Markov-chain models, the underlying 31 | Poisson assumptions in dNdScv allow the use of more complex context-dependent 32 | substitution models and the estimation of dN/dS ratios for truncating mutations. 33 | By default, *dNdScv* uses a trinucleotide context-dependent substitution model, 34 | which is important to avoid common biases affecting simpler substitution 35 | models in dN/dS (Greenman *et al.*, 2006, and Martincorena *et al*, 2017). 36 | 37 | Installation 38 | -------- 39 | You can use devtools::install_github() to install *dndscv* from this repository: 40 | 41 | > library(devtools); install_github("im3sanger/dndscv") 42 | 43 | Tutorial 44 | -------- 45 | For a tutorial on dNdScv see the vignette included with the package. This includes 46 | examples for whole-exome/genome data and for targeted data. 47 | 48 | [Tutorial: getting started with dNdScv](http://htmlpreview.github.io/?http://github.com/im3sanger/dndscv/blob/master/vignettes/dNdScv.html) 49 | 50 | Genome assemblies and species 51 | -------- 52 | By default, *dNdScv* assumes that mutation data is mapped to the GRCh37/hg19 assembly of the 53 | human genome. 
If you are using human data mapped to the GRCh38/hg38 assembly, you can use 54 | refdb="hg38" as an argument in dndscv to use the default GRCh38/hg38 precomputed database 55 | and epigenomic covariates (please ensure that you have downloaded the latest version of 56 | dNdScv). 57 | 58 | Users interested in trying *dNdScv* on a different set of transcripts, a different human assembly 59 | or a different species can use the buildref function to create a custom RefCDS, as explained 60 | in this [tutorial](http://htmlpreview.github.io/?http://github.com/im3sanger/dndscv/blob/master/vignettes/buildref.html). 61 | 62 | Pre-computed RefCDS files (RefCDS objects) to run *dNdScv* on some popular species 63 | (e.g. mouse, rat, cow, dog, yeast or SARS-CoV-2) are available from 64 | [this link](https://github.com/im3sanger/dndscv_data/tree/master/data). 65 | 66 | Reference 67 | ---- 68 | Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*. 69 | http://www.cell.com/cell/fulltext/S0092-8674(17)31136-4 70 | 71 | Author 72 | -------- 73 | Inigo Martincorena 74 | 75 | Acknowledgements 76 | -------- 77 | Moritz Gerstung and Peter Campbell for advice and inspiration. Federico Abascal and Andrew Lawson for testing, feedback and ideas. 
78 | -------------------------------------------------------------------------------- /data/cancergenes_cgc81.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/cancergenes_cgc81.rda -------------------------------------------------------------------------------- /data/covariates_hg19.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/covariates_hg19.rda -------------------------------------------------------------------------------- /data/covariates_hg19_chrx.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/covariates_hg19_chrx.rda -------------------------------------------------------------------------------- /data/covariates_hg19_hg38_epigenome_pcawg.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/covariates_hg19_hg38_epigenome_pcawg.rda -------------------------------------------------------------------------------- /data/dataset_normaloesophagus.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/dataset_normaloesophagus.rda -------------------------------------------------------------------------------- /data/dataset_normalskin.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/dataset_normalskin.rda 
-------------------------------------------------------------------------------- /data/dataset_normalskin_genes.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/dataset_normalskin_genes.rda -------------------------------------------------------------------------------- /data/dataset_simbreast.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/dataset_simbreast.rda -------------------------------------------------------------------------------- /data/dataset_tcgablca.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/dataset_tcgablca.rda -------------------------------------------------------------------------------- /data/knownhotcodons_hg19.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/knownhotcodons_hg19.rda -------------------------------------------------------------------------------- /data/knownhotspots_hg19.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/knownhotspots_hg19.rda -------------------------------------------------------------------------------- /data/refcds_GRCh38_hg38.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/refcds_GRCh38_hg38.rda -------------------------------------------------------------------------------- /data/refcds_hg19.rda: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/refcds_hg19.rda -------------------------------------------------------------------------------- /data/submod_12r_3w.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/submod_12r_3w.rda -------------------------------------------------------------------------------- /data/submod_192r_3w.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/submod_192r_3w.rda -------------------------------------------------------------------------------- /data/submod_2r_3w.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/submod_2r_3w.rda -------------------------------------------------------------------------------- /dndscv.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: No 4 | SaveWorkspace: No 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 4 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | PackageRoxygenize: rd,collate,namespace 22 | -------------------------------------------------------------------------------- /inst/doc/dNdScv.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Selection analyses and cancer driver 
discovery using dNdScv" 3 | author: "Inigo Martincorena (May 2017)" 4 | output: rmarkdown::html_vignette 5 | vignette: > 6 | %\VignetteIndexEntry{Selection analyses and cancer driver discovery using dNdScv} 7 | %\VignetteEngine{knitr::rmarkdown} 8 | \usepackage[utf8]{inputenc} 9 | --- 10 | 11 | The **dNdScv** R package is a group of maximum-likelihood dN/dS methods designed to quantify selection in cancer and somatic evolution (Martincorena *et al.*, 2017). The package contains functions to quantify dN/dS ratios for missense, nonsense and essential splice mutations, at the level of individual genes, groups of genes or at whole-genome level. The *dNdScv* method was designed to detect cancer driver genes (*i.e.* genes under positive selection in cancer) on datasets ranging from a few samples to thousands of samples, in whole-exome/genome or targeted sequencing studies. 12 | 13 | The background mutation rate of each gene is estimated by combining local information (synonymous mutations in the gene) and global information (variation of the mutation rate across genes, exploiting epigenomic covariates), and controlling for the sequence composition of the gene and mutational signatures. Unlike traditional implementations of dN/dS, *dNdScv* uses trinucleotide context-dependent substitution matrices to avoid common mutation biases affecting dN/dS (Greenman *et al.*, 2006). 14 | 15 | This vignette shows how to perform driver discovery and selection analyses with dNdScv in cancer sequencing data. The current version of dNdScv only supports human data, but future versions will incorporate functions to allow studies of selection on any species. 16 | 17 | Main reference: 18 | Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*. 
19 | 20 | ##Example 1: Driver discovery (positive selection) in cancer exomes/genomes 21 | 22 | ####The simplest way to run dNdScv 23 | 24 | ```{r message=FALSE} 25 | library("seqinr") 26 | library("Biostrings") 27 | library("MASS") 28 | library("GenomicRanges") 29 | library("dndscv") 30 | data("dataset_simbreast", package="dndscv") 31 | dndsout = dndscv(mutations) 32 | ``` 33 | 34 | ####Inputs and default parameters 35 | 36 | For this example, we have used a simulated dataset of 196 breast cancer exomes provided in the package. The simplest way to run the dndscv function is to provide a table of mutations. Mutations are provided as a *data.frame* with five columns (sampleID, chromosome, position, reference base and mutant base). It is important that only unique mutations are provided in the file. Multiple instances of the same mutation in related samples (for example, when sequencing multiple samples of the same tumour) should only be listed once. 37 | 38 | ```{r} 39 | head(mutations) 40 | ``` 41 | 42 | With the example dataset provided, the function should take about one minute to run. In this example, the function issues a warning as it detects the same mutation in more than one sample, requesting the user to verify that the input table of mutations only contains independent mutation events. In this case, each sample corresponds to a different patient and so the warning can be safely ignored. 43 | 44 | We have run dNdScv with default parameters. This includes removing ultra-hypermutator samples and subsampling mutations when encountering too many mutations per gene in the same sample. These were designed to protect against loss of sensitivity from ultra-hypermutators and from clustered artefacts in the input mutation table, but there are occasions when the user will prefer to relax these (see Example 2). 45 | 46 | #####dndscv outputs: Table of significant genes 47 | 48 | The output of the *dndscv* function is a list of objects. 
For an analysis of exome or genome data, the most relevant output will often be the result of neutrality tests at gene level. *P-values* for substitutions are obtained by Likelihood-Ratio Tests as described in (Martincorena *et al*, 2017) and q-values are obtained by Benjamini-Hochberg's multiple testing correction. The table also includes information on the number of substitutions of each class observed in each gene, as well as maximum-likelihood estimates (MLEs) of the dN/dS ratios for each gene, for missense (*wmis*), nonsense (*wnon*), essential splice site mutations (*wspl*) and indels (*wind*). 49 | 50 | ```{r} 51 | sel_cv = dndsout$sel_cv 52 | print(head(sel_cv), digits = 3) 53 | signif_genes = sel_cv[sel_cv$qglobal_cv<0.1, c("gene_name","qglobal_cv")] 54 | rownames(signif_genes) = NULL 55 | print(signif_genes) 56 | ``` 57 | 58 | Note in the table that the dN/dS ratios of significant cancer genes are typically extremely high, often >10 or even >100. This contains information about the fraction of mutations observed in a gene that are genuine drivers under positive selection. For example, for a highly significant gene, a dN/dS of 10 indicates that there are 10 times more non-synonymous mutations in the gene than neutrally expected, suggesting that at least around 90% of these mutations are genuine drivers (Greenman *et al*, 2006; Martincorena *et al*, 2017). 59 | 60 | #####dndscv outputs: Global dN/dS estimates 61 | 62 | Another output that can be of interest is a table with the global MLEs for the dN/dS ratios across all genes. dN/dS ratios with associated confidence intervals are calculated for missense, nonsense and essential splice site substitutions separately, as well as for all non-synonymous substitutions (*wall*) and for all truncating substitutions together (*wtru*), which include nonsense and essential splice site mutations.
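The excess-of-mutations reasoning used for the table of significant genes (a dN/dS of 10 suggesting that around 90% of the non-synonymous mutations are drivers) can be applied in the same way to the global ratios. The sketch below only illustrates that arithmetic: `excess_fraction` is a hypothetical helper written for this example and is not part of the dndscv package. It assumes that the excess of observed over neutrally expected mutations is (w-1)/w for w>1, and the result is best read as a rough lower bound rather than a precise estimate.

```{r}
# Hypothetical helper, not part of dndscv: fraction of observed non-synonymous
# mutations in excess of the neutral expectation, given a dN/dS ratio w.
# For w <= 1 there is no excess, so the function returns 0.
excess_fraction = function(w) {
    ifelse(w > 1, (w - 1) / w, 0)
}

# Illustrative dN/dS values (not taken from any dataset)
w = c(0.8, 1, 2, 10, 100)
print(data.frame(w = w, excess = excess_fraction(w)))
```

The point estimates ignore the uncertainty in the ratio itself, so applying the same function to the confidence limits of a dN/dS estimate may give a more honest range.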
63 | 64 | ```{r} 65 | print(dndsout$globaldnds) 66 | ``` 67 | 68 | Global dN/dS ratios in somatic evolution of cancer, and seemingly of healthy somatic tissues, appear to show a near-universal pattern of dN/dS~1, with exome-wide dN/dS ratios typically slightly higher than 1 (Martincorena *et al.*, 2017). Only occasionally have I found datasets with global dN/dS<1, but upon closer examination, this has typically been caused by contamination of the catalogue of somatic mutations with germline SNPs. Melanoma tumours are an exception, showing a bias towards slight underestimation of dN/dS due to the signature of ultraviolet-induced mutations extending beyond the trinucleotide model (Martincorena *et al.*, 2017). In my personal experience, datasets of somatic mutations with global dN/dS<<1 have always reflected a problem of SNP contamination or an inadequate substitution model, and so the evaluation of global dN/dS values can help identify problems in certain datasets. 69 | 70 | #####Other useful outputs 71 | 72 | The dndscv function also outputs other results that can be of interest, such as an annotated table of coding mutations (*annotmuts*), MLEs of mutation rate parameters (*mle_submodel*), lists of samples and mutations excluded from the analysis and a table with the observed and expected number of mutations per gene (*genemuts*), among others. 73 | 74 | ```{r} 75 | head(dndsout$annotmuts) 76 | ``` 77 | 78 | dNdScv relies on a negative binomial regression model across genes to refine the estimated background mutation rate for a gene. This assumes that the variation of the mutation rate across genes that remains unexplained by covariates or by the sequence composition of the gene can be modelled as a Gamma distribution. This model typically works well on clean cancer genomic datasets, but not all datasets may be suitable for this model.
In particular, very low estimates of $\theta$ (the overdispersion parameter), particularly $\theta<1$, may reflect problems with the suitability of the dNdScv model for the dataset. 79 | 80 | ```{r} 81 | print(dndsout$nbreg$theta) 82 | ``` 83 | 84 | ##### dNdSloc: local neutrality test 85 | 86 | An additional set of neutrality tests per gene is performed using a more traditional dN/dS model in which the local mutation rate for a gene is estimated exclusively from the synonymous mutations observed in the gene (*dNdSloc*) (Wong, Martincorena, *et al*., 2014). This test is typically only powered in very large datasets. For example, in the dataset used in this example, comprising 196 simulated breast cancer exomes, this model only detects *ARID1A* as significantly mutated. 87 | 88 | ```{r} 89 | signif_genes_localmodel = as.vector(dndsout$sel_loc$gene_name[dndsout$sel_loc$qall_loc<0.1]) 90 | print(signif_genes_localmodel) 91 | ``` 92 | 93 | ##Example 2: Driver discovery in targeted sequencing data 94 | 95 | The dndscv function can take a list of genes as an input to restrict the analysis of selection. This is strictly required when analysing targeted sequencing data, and might also be used to obtain global dN/dS ratios for a particular group of genes.
96 | 97 | To exemplify the use of the dndscv function on targeted data, we can use another example dataset provided with the dNdScv package: 98 | 99 | ```{r message=FALSE} 100 | library("seqinr") 101 | library("Biostrings") 102 | library("MASS") 103 | library("GenomicRanges") 104 | library("dndscv") 105 | data("dataset_normalskin", package="dndscv") 106 | data("dataset_normalskin_genes", package="dndscv") 107 | dndsskin = dndscv(m, gene_list=target_genes, max_muts_per_gene_per_sample = Inf, max_coding_muts_per_sample = Inf) 108 | ``` 109 | 110 | This dataset comprises 3,408 unique somatic mutations detected by ultra-deep (~500x) targeted sequencing of 74 cancer genes in 234 small biopsies of normal human skin (epidermis) from four healthy individuals. Note that all of the mutations listed in the input table are genuinely independent events and so, again, we can safely ignore the two warnings issued by dndscv. For more details on this study see: 111 | 112 | **Martincorena I, *et al*. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. Science. 348(6237):880-6.** doi: 10.1126/science.aaa6806. 113 | 114 | In the paper above, we described strong evidence of positive selection on somatic mutations occurring in normal human skin throughout life. These mutations are detected as microscopic clones of mutant cells in normal skin. The dNdScv analysis below recapitulates some of the key analyses of selection in this study: 115 | 116 | ```{r} 117 | sel_cv = dndsskin$sel_cv 118 | print(head(sel_cv[sel_cv$qglobal_cv<0.1,c(1:10,16,17)]), digits = 3) 119 | print(dndsskin$globaldnds, digits = 3) 120 | ``` 121 | 122 | ##Example 3: Using different substitution models 123 | 124 | Classic maximum-likelihood implementations of dN/dS use a simple substitution model with a single rate parameter.
Mutations are classified as either transitions (C<>T, G<>A) or transversions, and the single rate parameter is a transition/transversion (ts/tv) ratio reflecting the relative frequency of both classes of substitutions (Goldman & Yang, 1994). The dndscv function can take a different substitution model as input. The user can choose from existing substitution models provided in the *data* directory as part of the package or input a different substitution model as a matrix: 125 | 126 | ```{r message=FALSE} 127 | library("dndscv") 128 | # 192 rates (used as default) 129 | data("submod_192r_3w", package="dndscv") 130 | colnames(substmodel) = c("syn","mis","non","spl") 131 | head(substmodel) 132 | # 12 rates (no context-dependence) 133 | data("submod_12r_3w", package="dndscv") 134 | colnames(substmodel) = c("syn","mis","non","spl") 135 | head(substmodel) 136 | # 2 rates (classic ts/tv model) 137 | data("submod_2r_3w", package="dndscv") 138 | colnames(substmodel) = c("syn","mis","non","spl") 139 | head(substmodel) 140 | ``` 141 | 142 | We can fit a traditional ts/tv model to the skin dataset using the code below: 143 | 144 | ```{r message=FALSE} 145 | library("seqinr") 146 | library("Biostrings") 147 | library("MASS") 148 | library("GenomicRanges") 149 | library("dndscv") 150 | data("dataset_normalskin", package="dndscv") 151 | data("dataset_normalskin_genes", package="dndscv") 152 | dndsskin_2r = dndscv(m, gene_list=target_genes, max_muts_per_gene_per_sample = Inf, max_coding_muts_per_sample = Inf, sm = "2r_3w") 153 | print(dndsskin_2r$mle_submodel) 154 | sel_cv = dndsskin_2r$sel_cv 155 | print(head(sel_cv[sel_cv$qglobal_cv<0.1, c(1:10,16,17)]), digits = 3) 156 | ``` 157 | 158 | In general, the full trinucleotide model is recommended for cancer genomic datasets as it typically provides the least biased dN/dS estimates. 
The impact of using simplistic mutation models can be considerable on global dN/dS ratios (see Martincorena *et al*., 2017), and can lead to false signals of negative or positive selection. In general, the impact of simple substitution models on gene-level inferences of selection tends to be smaller. 159 | 160 | ###Additional references 161 | * Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*. 171(5):1029-1041. 162 | * Goldman N, Yang Z. (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. *Molecular biology and evolution*. 11:725-736. 163 | * Greenman C, *et al*. (2006) Statistical analysis of pathogenicity of somatic mutations in cancer. *Genetics*. 173(4):2187-98. 164 | * Wong CC, Martincorena I, *et al*. (2014) Inactivating CUX1 mutations promote tumorigenesis. *Nature Genetics*. 46(1):33-8. 165 | * Martincorena I, *et al*. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. *Science*. 348(6237):880-6.
166 | -------------------------------------------------------------------------------- /inst/extdata/BioMart_human_GRCh37_chr3_segment.txt: -------------------------------------------------------------------------------- 1 | gene.id gene.name cds.id chr chr.coding.start chr.coding.end cds.start cds.end length strand 2 | ENSG00000121879 PIK3CA ENSP00000418145 3 116615 116677 1 63 63 1 3 | ENSG00000121879 PIK3CA ENSP00000263967 3 116615 116966 1 352 3207 1 4 | ENSG00000121879 PIK3CA ENSP00000263967 3 117479 117688 353 562 3207 1 5 | ENSG00000121879 PIK3CA ENSP00000263967 3 119079 119329 563 813 3207 1 6 | ENSG00000121879 PIK3CA ENSP00000263967 3 121333 121578 814 1059 3207 1 7 | ENSG00000121879 PIK3CA ENSP00000263967 3 122292 122377 1060 1145 3207 1 8 | ENSG00000121879 PIK3CA ENSP00000263967 3 127384 127489 1146 1251 3207 1 9 | ENSG00000121879 PIK3CA ENSP00000263967 3 127975 128127 1252 1404 3207 1 10 | ENSG00000121879 PIK3CA ENSP00000263967 3 128220 128354 1405 1539 3207 1 11 | ENSG00000121879 PIK3CA ENSP00000263967 3 135999 136123 1540 1664 3207 1 12 | ENSG00000121879 PIK3CA ENSP00000263967 3 136985 137066 1665 1746 3207 1 13 | ENSG00000121879 PIK3CA ENSP00000263967 3 137360 137524 1747 1911 3207 1 14 | ENSG00000121879 PIK3CA ENSP00000263967 3 137738 137841 1912 2015 3207 1 15 | ENSG00000121879 PIK3CA ENSP00000263967 3 138775 138946 2016 2187 3207 1 16 | ENSG00000121879 PIK3CA ENSP00000263967 3 141870 141976 2188 2294 3207 1 17 | ENSG00000121879 PIK3CA ENSP00000263967 3 142489 142610 2295 2416 3207 1 18 | ENSG00000121879 PIK3CA ENSP00000263967 3 143751 143829 2417 2495 3207 1 19 | ENSG00000121879 PIK3CA ENSP00000263967 3 147061 147231 2496 2666 3207 1 20 | ENSG00000121879 PIK3CA ENSP00000263967 3 147793 147910 2667 2784 3207 1 21 | ENSG00000121879 PIK3CA ENSP00000263967 3 148014 148165 2785 2936 3207 1 22 | ENSG00000121879 PIK3CA ENSP00000263967 3 151883 152153 2937 3207 3207 1 23 | ENSG00000121879 PIK3CA ENSP00000417479 3 116615 116968 1 354 354 1 24 | 
ENSG00000171121 KCNMB3 ENSP00000417091 3 184438 184499 1 62 522 -1 25 | ENSG00000171121 KCNMB3 ENSP00000417091 3 168532 168723 63 254 522 -1 26 | ENSG00000171121 KCNMB3 ENSP00000417091 3 162284 162482 255 453 522 -1 27 | ENSG00000171121 KCNMB3 ENSP00000417091 3 157785 157853 454 522 522 -1 28 | ENSG00000171121 KCNMB3 ENSP00000376451 3 162284 162482 249 447 828 -1 29 | ENSG00000171121 KCNMB3 ENSP00000376451 3 168532 168779 1 248 828 -1 30 | ENSG00000171121 KCNMB3 ENSP00000376451 3 160693 161073 448 828 828 -1 31 | ENSG00000171121 KCNMB3 ENSP00000319370 3 168532 168723 69 260 840 -1 32 | ENSG00000171121 KCNMB3 ENSP00000319370 3 162284 162482 261 459 840 -1 33 | ENSG00000171121 KCNMB3 ENSP00000319370 3 168825 168892 1 68 840 -1 34 | ENSG00000171121 KCNMB3 ENSP00000319370 3 160693 161073 460 840 840 -1 35 | ENSG00000171121 KCNMB3 ENSP00000327866 3 168532 168723 63 254 834 -1 36 | ENSG00000171121 KCNMB3 ENSP00000327866 3 162284 162482 255 453 834 -1 37 | ENSG00000171121 KCNMB3 ENSP00000327866 3 160693 161073 454 834 834 -1 38 | ENSG00000171121 KCNMB3 ENSP00000327866 3 184438 184499 1 62 834 -1 39 | ENSG00000171121 KCNMB3 ENSP00000418536 3 168532 168723 3 194 774 -1 40 | ENSG00000171121 KCNMB3 ENSP00000418536 3 162284 162482 195 393 774 -1 41 | ENSG00000171121 KCNMB3 ENSP00000418536 3 176737 176738 1 2 774 -1 42 | ENSG00000171121 KCNMB3 ENSP00000418536 3 160693 161073 394 774 774 -1 43 | ENSG00000121864 ZNF639 ENSP00000417740 3 246083 246140 1 58 1458 1 44 | ENSG00000121864 ZNF639 ENSP00000417740 3 247407 247517 59 169 1458 1 45 | ENSG00000121864 ZNF639 ENSP00000417740 3 250778 250912 170 304 1458 1 46 | ENSG00000121864 ZNF639 ENSP00000417740 3 251058 252211 305 1458 1458 1 47 | ENSG00000121864 ZNF639 ENSP00000418870 3 246083 246140 1 58 649 1 48 | ENSG00000121864 ZNF639 ENSP00000418870 3 247407 247517 59 169 649 1 49 | ENSG00000121864 ZNF639 ENSP00000418870 3 250778 250912 170 304 649 1 50 | ENSG00000121864 ZNF639 ENSP00000418870 3 251058 251402 305 649 649 1 51 | 
ENSG00000121864 ZNF639 ENSP00000418628 3 246083 246140 1 58 412 1 52 | ENSG00000121864 ZNF639 ENSP00000418628 3 247407 247517 59 169 412 1 53 | ENSG00000121864 ZNF639 ENSP00000418628 3 250778 250912 170 304 412 1 54 | ENSG00000121864 ZNF639 ENSP00000418628 3 251058 251165 305 412 412 1 55 | ENSG00000121864 ZNF639 ENSP00000325634 3 246083 246140 1 58 1458 1 56 | ENSG00000121864 ZNF639 ENSP00000325634 3 247407 247517 59 169 1458 1 57 | ENSG00000121864 ZNF639 ENSP00000325634 3 250778 250912 170 304 1458 1 58 | ENSG00000121864 ZNF639 ENSP00000325634 3 251058 252211 305 1458 1458 1 59 | ENSG00000121864 ZNF639 ENSP00000419650 3 246083 246140 1 58 865 1 60 | ENSG00000121864 ZNF639 ENSP00000419650 3 247407 247517 59 169 865 1 61 | ENSG00000121864 ZNF639 ENSP00000419650 3 250778 250912 170 304 865 1 62 | ENSG00000121864 ZNF639 ENSP00000419650 3 251058 251618 305 865 865 1 63 | ENSG00000121864 ZNF639 ENSP00000418766 3 246083 246140 1 58 1458 1 64 | ENSG00000121864 ZNF639 ENSP00000418766 3 247407 247517 59 169 1458 1 65 | ENSG00000121864 ZNF639 ENSP00000418766 3 250778 250912 170 304 1458 1 66 | ENSG00000121864 ZNF639 ENSP00000418766 3 251058 252211 305 1458 1458 1 67 | ENSG00000121864 ZNF639 ENSP00000417232 3 246083 246140 1 58 390 1 68 | ENSG00000121864 ZNF639 ENSP00000417232 3 247407 247517 59 169 390 1 69 | ENSG00000121864 ZNF639 ENSP00000417232 3 250778 250912 170 304 390 1 70 | ENSG00000121864 ZNF639 ENSP00000417232 3 251058 251143 305 390 390 1 71 | ENSG00000171109 MFN1 ENSP00000420617 3 266641 266752 1 112 2226 1 72 | ENSG00000171109 MFN1 ENSP00000420617 3 269689 269824 113 248 2226 1 73 | ENSG00000171109 MFN1 ENSP00000420617 3 276629 276791 249 411 2226 1 74 | ENSG00000171109 MFN1 ENSP00000420617 3 280147 280271 412 536 2226 1 75 | ENSG00000171109 MFN1 ENSP00000420617 3 282086 282194 537 645 2226 1 76 | ENSG00000171109 MFN1 ENSP00000420617 3 282907 283014 646 753 2226 1 77 | ENSG00000171109 MFN1 ENSP00000420617 3 285228 285381 754 907 2226 1 78 | ENSG00000171109 MFN1 
ENSP00000420617 3 285825 285892 908 975 2226 1 79 | ENSG00000171109 MFN1 ENSP00000420617 3 293009 293130 976 1097 2226 1 80 | ENSG00000171109 MFN1 ENSP00000420617 3 294831 294957 1098 1224 2226 1 81 | ENSG00000171109 MFN1 ENSP00000420617 3 295133 295237 1225 1329 2226 1 82 | ENSG00000171109 MFN1 ENSP00000420617 3 296130 296232 1330 1432 2226 1 83 | ENSG00000171109 MFN1 ENSP00000420617 3 296374 296603 1433 1662 2226 1 84 | ENSG00000171109 MFN1 ENSP00000420617 3 303358 303510 1663 1815 2226 1 85 | ENSG00000171109 MFN1 ENSP00000420617 3 304222 304418 1816 2012 2226 1 86 | ENSG00000171109 MFN1 ENSP00000420617 3 307793 307927 2013 2147 2226 1 87 | ENSG00000171109 MFN1 ENSP00000420617 3 309770 309848 2148 2226 2226 1 88 | ENSG00000171109 MFN1 ENSP00000419134 3 266641 266752 1 112 464 1 89 | ENSG00000171109 MFN1 ENSP00000419134 3 269689 269824 113 248 464 1 90 | ENSG00000171109 MFN1 ENSP00000419134 3 276629 276791 249 411 464 1 91 | ENSG00000171109 MFN1 ENSP00000419134 3 280147 280199 412 464 464 1 92 | ENSG00000171109 MFN1 ENSP00000263969 3 269689 269824 113 248 2226 1 93 | ENSG00000171109 MFN1 ENSP00000263969 3 276629 276791 249 411 2226 1 94 | ENSG00000171109 MFN1 ENSP00000263969 3 280147 280271 412 536 2226 1 95 | ENSG00000171109 MFN1 ENSP00000263969 3 282086 282194 537 645 2226 1 96 | ENSG00000171109 MFN1 ENSP00000263969 3 282907 283014 646 753 2226 1 97 | ENSG00000171109 MFN1 ENSP00000263969 3 285228 285381 754 907 2226 1 98 | ENSG00000171109 MFN1 ENSP00000263969 3 285825 285892 908 975 2226 1 99 | ENSG00000171109 MFN1 ENSP00000263969 3 293009 293130 976 1097 2226 1 100 | ENSG00000171109 MFN1 ENSP00000263969 3 294831 294957 1098 1224 2226 1 101 | ENSG00000171109 MFN1 ENSP00000263969 3 295133 295237 1225 1329 2226 1 102 | ENSG00000171109 MFN1 ENSP00000263969 3 296130 296232 1330 1432 2226 1 103 | ENSG00000171109 MFN1 ENSP00000263969 3 296374 296603 1433 1662 2226 1 104 | ENSG00000171109 MFN1 ENSP00000263969 3 303358 303510 1663 1815 2226 1 105 | ENSG00000171109 MFN1 
ENSP00000263969 3 304222 304418 1816 2012 2226 1 106 | ENSG00000171109 MFN1 ENSP00000263969 3 307793 307927 2013 2147 2226 1 107 | ENSG00000171109 MFN1 ENSP00000263969 3 266641 266752 1 112 2226 1 108 | ENSG00000171109 MFN1 ENSP00000263969 3 309770 309848 2148 2226 2226 1 109 | ENSG00000171109 MFN1 ENSP00000420148 3 282086 282194 96 204 372 1 110 | ENSG00000171109 MFN1 ENSP00000420148 3 282907 283014 205 312 372 1 111 | ENSG00000171109 MFN1 ENSP00000420148 3 280177 280271 1 95 372 1 112 | ENSG00000171109 MFN1 ENSP00000420148 3 285228 285287 313 372 372 1 113 | ENSG00000171109 MFN1 ENSP00000419926 3 280147 280271 1 125 1482 1 114 | ENSG00000171109 MFN1 ENSP00000419926 3 282086 282194 126 234 1482 1 115 | ENSG00000171109 MFN1 ENSP00000419926 3 282907 283014 235 342 1482 1 116 | ENSG00000171109 MFN1 ENSP00000419926 3 285228 285381 343 496 1482 1 117 | ENSG00000171109 MFN1 ENSP00000419926 3 285825 285892 497 564 1482 1 118 | ENSG00000171109 MFN1 ENSP00000419926 3 293009 293130 565 686 1482 1 119 | ENSG00000171109 MFN1 ENSP00000419926 3 294831 294957 687 813 1482 1 120 | ENSG00000171109 MFN1 ENSP00000419926 3 295133 295237 814 918 1482 1 121 | ENSG00000171109 MFN1 ENSP00000419926 3 303358 303510 919 1071 1482 1 122 | ENSG00000171109 MFN1 ENSP00000419926 3 304222 304418 1072 1268 1482 1 123 | ENSG00000171109 MFN1 ENSP00000419926 3 307793 307927 1269 1403 1482 1 124 | ENSG00000171109 MFN1 ENSP00000419926 3 309770 309848 1404 1482 1482 1 125 | ENSG00000171109 MFN1 ENSP00000280653 3 266641 266752 1 112 1893 1 126 | ENSG00000171109 MFN1 ENSP00000280653 3 269689 269824 113 248 1893 1 127 | ENSG00000171109 MFN1 ENSP00000280653 3 276629 276791 249 411 1893 1 128 | ENSG00000171109 MFN1 ENSP00000280653 3 280147 280271 412 536 1893 1 129 | ENSG00000171109 MFN1 ENSP00000280653 3 282086 282194 537 645 1893 1 130 | ENSG00000171109 MFN1 ENSP00000280653 3 282907 283014 646 753 1893 1 131 | ENSG00000171109 MFN1 ENSP00000280653 3 285228 285381 754 907 1893 1 132 | ENSG00000171109 MFN1 
ENSP00000280653 3 285825 285892 908 975 1893 1 133 | ENSG00000171109 MFN1 ENSP00000280653 3 293009 293130 976 1097 1893 1 134 | ENSG00000171109 MFN1 ENSP00000280653 3 294831 294957 1098 1224 1893 1 135 | ENSG00000171109 MFN1 ENSP00000280653 3 295133 295237 1225 1329 1893 1 136 | ENSG00000171109 MFN1 ENSP00000280653 3 303358 303510 1330 1482 1893 1 137 | ENSG00000171109 MFN1 ENSP00000280653 3 304222 304418 1483 1679 1893 1 138 | ENSG00000171109 MFN1 ENSP00000280653 3 307793 307927 1680 1814 1893 1 139 | ENSG00000171109 MFN1 ENSP00000280653 3 309770 309848 1815 1893 1893 1 140 | ENSG00000114450 GNB4 ENSP00000232564 3 343933 343989 1 57 1023 -1 141 | ENSG00000114450 GNB4 ENSP00000232564 3 338678 338716 58 96 1023 -1 142 | ENSG00000114450 GNB4 ENSP00000232564 3 337188 337294 97 203 1023 -1 143 | ENSG00000114450 GNB4 ENSP00000232564 3 334282 334345 204 267 1023 -1 144 | ENSG00000114450 GNB4 ENSP00000232564 3 332674 332836 268 430 1023 -1 145 | ENSG00000114450 GNB4 ENSP00000232564 3 331504 331570 431 497 1023 -1 146 | ENSG00000114450 GNB4 ENSP00000232564 3 331201 331402 498 699 1023 -1 147 | ENSG00000114450 GNB4 ENSP00000232564 3 322979 323195 700 916 1023 -1 148 | ENSG00000114450 GNB4 ENSP00000232564 3 319002 319108 917 1023 1023 -1 149 | ENSG00000114450 GNB4 ENSP00000420066 3 332674 332836 35 197 496 -1 150 | ENSG00000114450 GNB4 ENSP00000420066 3 331504 331570 198 264 496 -1 151 | ENSG00000114450 GNB4 ENSP00000420066 3 331201 331402 265 466 496 -1 152 | ENSG00000114450 GNB4 ENSP00000420066 3 334282 334315 1 34 496 -1 153 | ENSG00000114450 GNB4 ENSP00000420066 3 319079 319108 467 496 496 -1 154 | ENSG00000114450 GNB4 ENSP00000419693 3 343933 343989 1 57 1023 -1 155 | ENSG00000114450 GNB4 ENSP00000419693 3 338678 338716 58 96 1023 -1 156 | ENSG00000114450 GNB4 ENSP00000419693 3 337188 337294 97 203 1023 -1 157 | ENSG00000114450 GNB4 ENSP00000419693 3 334282 334345 204 267 1023 -1 158 | ENSG00000114450 GNB4 ENSP00000419693 3 332674 332836 268 430 1023 -1 159 | 
ENSG00000114450 GNB4 ENSP00000419693 3 331504 331570 431 497 1023 -1 160 | ENSG00000114450 GNB4 ENSP00000419693 3 331201 331402 498 699 1023 -1 161 | ENSG00000114450 GNB4 ENSP00000419693 3 322979 323195 700 916 1023 -1 162 | ENSG00000114450 GNB4 ENSP00000419693 3 319002 319108 917 1023 1023 -1 163 | ENSG00000114450 GNB4 ENSP00000420606 3 343933 343989 1 57 239 -1 164 | ENSG00000114450 GNB4 ENSP00000420606 3 338678 338716 58 96 239 -1 165 | ENSG00000114450 GNB4 ENSP00000420606 3 337188 337294 97 203 239 -1 166 | ENSG00000114450 GNB4 ENSP00000420606 3 334310 334345 204 239 239 -1 167 | ENSG00000136518 ACTL6A ENSP00000397552 3 480882 480906 1 25 1290 1 168 | ENSG00000136518 ACTL6A ENSP00000397552 3 487613 487689 26 102 1290 1 169 | ENSG00000136518 ACTL6A ENSP00000397552 3 487856 488030 103 277 1290 1 170 | ENSG00000136518 ACTL6A ENSP00000397552 3 491158 491258 278 378 1290 1 171 | ENSG00000136518 ACTL6A ENSP00000397552 3 492159 492256 379 476 1290 1 172 | ENSG00000136518 ACTL6A ENSP00000397552 3 494006 494100 477 571 1290 1 173 | ENSG00000136518 ACTL6A ENSP00000397552 3 494409 494515 572 678 1290 1 174 | ENSG00000136518 ACTL6A ENSP00000397552 3 494613 494702 679 768 1290 1 175 | ENSG00000136518 ACTL6A ENSP00000397552 3 498429 498490 769 830 1290 1 176 | ENSG00000136518 ACTL6A ENSP00000397552 3 498683 498797 831 945 1290 1 177 | ENSG00000136518 ACTL6A ENSP00000397552 3 498929 499009 946 1026 1290 1 178 | ENSG00000136518 ACTL6A ENSP00000394014 3 491158 491258 152 252 1164 1 179 | ENSG00000136518 ACTL6A ENSP00000394014 3 492159 492256 253 350 1164 1 180 | ENSG00000136518 ACTL6A ENSP00000394014 3 494006 494100 351 445 1164 1 181 | ENSG00000136518 ACTL6A ENSP00000394014 3 494409 494515 446 552 1164 1 182 | ENSG00000136518 ACTL6A ENSP00000394014 3 494613 494702 553 642 1164 1 183 | ENSG00000136518 ACTL6A ENSP00000394014 3 498429 498490 643 704 1164 1 184 | ENSG00000136518 ACTL6A ENSP00000394014 3 498683 498797 705 819 1164 1 185 | ENSG00000136518 ACTL6A ENSP00000394014 3 
498929 499009 820 900 1164 1 186 | ENSG00000136518 ACTL6A ENSP00000394014 3 487880 488030 1 151 1164 1 187 | ENSG00000136518 ACTL6A ENSP00000420153 3 487880 488030 1 151 151 1 188 | ENSG00000136518 ACTL6A ENSP00000376430 3 491158 491258 152 252 1164 1 189 | ENSG00000136518 ACTL6A ENSP00000376430 3 492159 492256 253 350 1164 1 190 | ENSG00000136518 ACTL6A ENSP00000376430 3 494006 494100 351 445 1164 1 191 | ENSG00000136518 ACTL6A ENSP00000376430 3 494409 494515 446 552 1164 1 192 | ENSG00000136518 ACTL6A ENSP00000376430 3 494613 494702 553 642 1164 1 193 | ENSG00000136518 ACTL6A ENSP00000376430 3 498429 498490 643 704 1164 1 194 | ENSG00000136518 ACTL6A ENSP00000376430 3 498683 498797 705 819 1164 1 195 | ENSG00000136518 ACTL6A ENSP00000376430 3 498929 499009 820 900 1164 1 196 | ENSG00000136518 ACTL6A ENSP00000376430 3 487880 488030 1 151 1164 1 197 | -------------------------------------------------------------------------------- /inst/extdata/chr3_segment.fa.fai: -------------------------------------------------------------------------------- 1 | 3 500001 23 500001 500002 2 | -------------------------------------------------------------------------------- /inst/extdata/refcds_example_chr3_segment.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/inst/extdata/refcds_example_chr3_segment.rda -------------------------------------------------------------------------------- /man/buildcodon.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/buildcodon.R 3 | \name{buildcodon} 4 | \alias{buildcodon} 5 | \title{buildcodon} 6 | \usage{ 7 | buildcodon(refcds, numcode = 1) 8 | } 9 | \arguments{ 10 | \item{refcds}{Input RefCDS object} 11 | 12 | \item{numcode}{NCBI genetic code number (default = 1; standard genetic code). 
To see the list of genetic codes supported use: ? seqinr::translate} 13 | } 14 | \description{ 15 | This function takes a RefCDS object as input and adds to it two fields required to run the codondnds function. Usage: RefCDS = buildcodon(RefCDS) 16 | } 17 | \details{ 18 | Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041. 19 | } 20 | \author{ 21 | Inigo Martincorena (Wellcome Sanger Institute) 22 | } 23 | -------------------------------------------------------------------------------- /man/buildref.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/buildref.R 3 | \name{buildref} 4 | \alias{buildref} 5 | \title{buildref} 6 | \usage{ 7 | buildref( 8 | cdsfile, 9 | genomefile, 10 | outfile = "RefCDS.rda", 11 | numcode = 1, 12 | excludechrs = NULL, 13 | onlychrs = NULL, 14 | useids = F 15 | ) 16 | } 17 | \arguments{ 18 | \item{cdsfile}{Path to the reference transcript table.} 19 | 20 | \item{genomefile}{Path to the indexed reference genome file.} 21 | 22 | \item{outfile}{Output file name (default = "RefCDS.rda").} 23 | 24 | \item{numcode}{NCBI genetic code number (default = 1; standard genetic code). To see the list of genetic codes supported use: ? seqinr::translate} 25 | 26 | \item{excludechrs}{Vector or string with chromosome names to be excluded from the RefCDS object (default: no chromosome will be excluded). 
The mitochondrial chromosome should be excluded as it has different genetic code and mutation rates, either using the excludechrs argument or not including mitochondrial transcripts in cdsfile.} 27 | 28 | \item{onlychrs}{Vector of valid chromosome names (default: all chromosomes will be included)} 29 | 30 | \item{useids}{Combine gene IDs and gene names (columns 1 and 2 of the input table) as long gene names (default = F)} 31 | } 32 | \description{ 33 | Function to build a RefCDS object from a reference genome and a table of transcripts. The RefCDS object has to be precomputed for any new species or assembly prior to running dndscv. This function generates an .rda file that needs to be input into dndscv using the refdb argument. Note that when multiple CDS share the same gene name (second column of cdsfile), the longest coding CDS will be chosen for the gene. CDS with ambiguous bases (N) will not be considered. 34 | } 35 | \details{ 36 | Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041. 
37 | } 38 | \author{ 39 | Inigo Martincorena (Wellcome Sanger Institute) 40 | } 41 | -------------------------------------------------------------------------------- /man/codondnds.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/codondnds.R 3 | \name{codondnds} 4 | \alias{codondnds} 5 | \title{codondnds} 6 | \usage{ 7 | codondnds( 8 | dndsout, 9 | refcds, 10 | min_recurr = 2, 11 | gene_list = NULL, 12 | codon_list = NULL, 13 | theta_option = "conservative", 14 | syn_drivers = "TP53:T125T", 15 | method = "NB", 16 | numbins = 10000 17 | ) 18 | } 19 | \arguments{ 20 | \item{dndsout}{Output object from dndscv.} 21 | 22 | \item{refcds}{RefCDS object annotated with codon-level information using the buildcodon function.} 23 | 24 | \item{min_recurr}{Minimum number of mutations per codon to estimate codon-wise dN/dS ratios. [default=2]} 25 | 26 | \item{gene_list}{List of genes to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of genes is decided a priori. [default=NULL, codondnds will be run on all genes in dndsout]} 27 | 28 | \item{codon_list}{List of hotspot codons to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of codons is decided a priori. [default=NULL, codondnds will be run on all genes in dndsout]} 29 | 30 | \item{theta_option}{2 options: "mle" (uses the MLE of the negative binomial size parameter) or "conservative" (uses the lower bound of the CI95). Values other than "mle" will lead to the conservative option. [default="conservative"]} 31 | 32 | \item{syn_drivers}{Vector with a list of known synonymous driver mutations to exclude from the background model [default="TP53:T125T"]. 
See Martincorena et al., Cell, 2017 (PMID:29056346).} 33 | 34 | \item{method}{Overdispersion model: NB = Negative Binomial (Gamma-Poisson), LNP = Poisson-Lognormal (see Hess et al., BiorXiv, 2019). [default="NB"]} 35 | 36 | \item{numbins}{Number of bins to discretise the rvec vector [default=1e4]. This enables fast execution of the LNP model in datasets of arbitrarily any size.} 37 | } 38 | \value{ 39 | 'codondnds' returns a table of recurrently mutated codons and the estimates of the size parameter: 40 | 41 | - recurcodons: Table of recurrently mutated codons with codon-wise dN/dS values and p-values 42 | 43 | - recurcodons_ext: The same table of recurrently mutated codons, but including additional information on the contribution of different changes within a codon. 44 | 45 | - overdisp: Maximum likelihood estimate and CI95% for the overdispersion parameter (the size parameter of the negative binomial distribution or the sigma parameter of the lognormal distribution). The lower the size value or the higher the sigma value the higher the variation of the mutation rate across codons not captured by the trinucleotide change or by variation across genes. 46 | 47 | - LL: Log-likelihood of the fit of the overdispersed model (see "method" argument) to all synonymous sites at a codon level. 48 | } 49 | \description{ 50 | Function to estimate codon-wise dN/dS values and p-values against neutrality. To generate a valid RefCDS input object for this function, use the buildcodon function. Note that recurrent artefacts or SNP contamination can violate the null model and dominate the list of codons under apparent selection. Be very critical of the results and if suspicious codons appear recurrently mutated consider refining the variant calling (e.g. using a better unmatched normal panel). 
51 | } 52 | \author{ 53 | Inigo Martincorena (Wellcome Sanger Institute) 54 | } 55 | -------------------------------------------------------------------------------- /man/dndscv-package.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/dndscv-package.R 3 | \docType{package} 4 | \name{dndscv-package} 5 | \alias{dndscv-package} 6 | \title{Detection of selection in cancer and somatic evolution} 7 | \description{ 8 | Detection of selection in cancer and somatic evolution 9 | } 10 | \details{ 11 | The dNdScv R package is a suite of maximum-likelihood dN/dS methods designed 12 | to quantify selection in cancer and somatic evolution (Martincorena et al., 2017). 13 | The package contains functions to quantify dN/dS ratios for missense, nonsense 14 | and essential splice mutations, at the level of individual genes, groups of 15 | genes or at whole-genome level. The dNdScv method was designed to detect cancer 16 | driver genes (i.e. genes under positive selection in cancer) on datasets ranging 17 | from a few samples to thousands of samples, in whole-exome/genome or targeted 18 | sequencing studies. 19 | } 20 | \references{ 21 | Martincorena I, et al. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. Cell. 
22 | } 23 | \seealso{ 24 | \code{\link{dndscv}} 25 | 26 | \code{\link{buildref}} 27 | } 28 | \author{ 29 | Inigo Martincorena, Wellcome Trust Sanger Institute, \email{im3@sanger.ac.uk} 30 | } 31 | \keyword{package} 32 | -------------------------------------------------------------------------------- /man/dndscv.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/dndscv.R 3 | \name{dndscv} 4 | \alias{dndscv} 5 | \title{dNdScv} 6 | \usage{ 7 | dndscv( 8 | mutations, 9 | gene_list = NULL, 10 | refdb = "hg19", 11 | sm = "192r_3w", 12 | kc = "cgc81", 13 | cv = "hg19", 14 | max_muts_per_gene_per_sample = 3, 15 | max_coding_muts_per_sample = 3000, 16 | use_indel_sites = T, 17 | min_indels = 5, 18 | maxcovs = 20, 19 | constrain_wnon_wspl = T, 20 | outp = 3, 21 | numcode = 1, 22 | outmats = F, 23 | mingenecovs = 500, 24 | onesided = F, 25 | dc = NULL 26 | ) 27 | } 28 | \arguments{ 29 | \item{mutations}{Table of mutations (5 columns: sampleID, chr, pos, ref, alt). Only list independent events as mutations.} 30 | 31 | \item{gene_list}{List of genes to restrict the analysis (use for targeted sequencing studies)} 32 | 33 | \item{refdb}{Reference database (path to .rda file or a pre-loaded array object in the right format)} 34 | 35 | \item{sm}{Substitution model (precomputed models are available in the data directory)} 36 | 37 | \item{kc}{List of a-priori known cancer genes (to be excluded from the indel background model)} 38 | 39 | \item{cv}{Covariates (a matrix of covariates -columns- for each gene -rows-) [default: reference covariates] [cv=NULL runs dndscv without covariates]} 40 | 41 | \item{max_muts_per_gene_per_sample}{If nsegment[1] & mutations$pos10 or even >100. This contains information about the fraction of mutations observed in a gene that are genuine drivers under positive selection. 
For example, for a highly significant gene, a dN/dS of 10 indicates that there are 10 times more non-synonymous mutations in the gene than expected under neutrality. The excess over the neutral expectation, (10-1)/10, suggests that at least around 90% of these mutations are genuine drivers (Greenman *et al*., 2006; Martincorena *et al*., 2017).

Users can calculate confidence intervals for the dN/dS ratios per gene using the *geneci* function in the package. To use this function, run *dndscv* with the optional argument `outmats=T` and then use `ci = geneci(dndsout)`. For more details see `? geneci`.

#####dndscv outputs: Global dN/dS estimates

Another output that can be of interest is a table with the global MLEs for the dN/dS ratios across all genes. dN/dS ratios with associated confidence intervals are calculated for missense, nonsense and essential splice site substitutions separately, as well as for all non-synonymous substitutions (*wall*) and for all truncating substitutions together (*wtru*), which include nonsense and essential splice site mutations.

```{r}
print(dndsout$globaldnds)
```

Global dN/dS ratios in somatic evolution of cancer, and seemingly of healthy somatic tissues, appear to show a near-universal pattern of dN/dS~1, with exome-wide dN/dS ratios typically slightly higher than 1 (Martincorena *et al.*, 2017). Only occasionally have I found datasets with global dN/dS<1, but upon closer examination, this has typically been caused by contamination of the catalogue of somatic mutations with germline SNPs. Melanoma tumours are an exception: they show a bias towards slight underestimation of dN/dS due to the signature of ultraviolet-induced mutations extending beyond the trinucleotide model (Martincorena *et al.*, 2017).
In my personal experience, datasets of somatic mutations with global dN/dS<<1 have always reflected a problem of SNP contamination or an inadequate substitution model, and so the evaluation of global dN/dS values can help identify problems in certain datasets.

#####Other useful outputs

The dndscv function also outputs other results that can be of interest, such as an annotated table of coding mutations (*annotmuts*), MLEs of mutation rate parameters (*mle_submodel*), lists of samples and mutations excluded from the analysis and a table with the observed and expected number of mutations per gene (*genemuts*), among others.

```{r}
head(dndsout$annotmuts)
```

dNdScv relies on a negative binomial regression model across genes to refine the estimated background mutation rate for a gene. This assumes that the variation of the mutation rate across genes that remains unexplained by covariates or by the sequence composition of the gene can be modelled as a Gamma distribution. This model typically works well on clean cancer genomic datasets, but not all datasets may be suitable for it. In particular, very low estimates of $\theta$ (the overdispersion parameter), especially $\theta<1$, may reflect problems with the suitability of the dNdScv model for the dataset.

```{r}
print(dndsout$nbreg$theta)
```

##### dNdSloc: local neutrality test

An additional set of neutrality tests per gene is performed using a more traditional dN/dS model in which the local mutation rate for a gene is estimated exclusively from the synonymous mutations observed in the gene (*dNdSloc*) (Wong, Martincorena, *et al*., 2014). This test is typically only powered in very large datasets. For example, in the dataset used in this example, comprising 196 simulated breast cancer exomes, this model only detects *ARID1A* as significantly mutated.

```{r}
signif_genes_localmodel = as.vector(dndsout$sel_loc$gene_name[dndsout$sel_loc$qall_loc<0.1])
print(signif_genes_localmodel)
```

##Driver discovery in targeted sequencing data

The dndscv function can take a list of genes as an input to restrict the analysis of selection. This is strictly required when analysing targeted sequencing data, and might also be used to obtain global dN/dS ratios for a particular group of genes.

To exemplify the use of the dndscv function on targeted data, we can use another example dataset provided with the dNdScv package:

```{r message=FALSE}
library("dndscv")
data("dataset_normalskin", package="dndscv")
data("dataset_normalskin_genes", package="dndscv")
dndsskin = dndscv(m, gene_list=target_genes, max_muts_per_gene_per_sample = Inf, max_coding_muts_per_sample = Inf)
```

This dataset comprises 3,408 unique somatic mutations detected by ultra-deep (~500x) targeted sequencing of 74 cancer genes in 234 small biopsies of normal human skin (epidermis) from four healthy individuals. Note that all of the mutations listed in the input table are genuinely independent events and so, again, we can safely ignore the two warnings issued by dndscv. For more details on this study see:

**Martincorena I, *et al*. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. Science. 348(6237):880-6.** doi: 10.1126/science.aaa6806.

In the paper above, we described strong evidence of positive selection on somatic mutations occurring in normal human skin throughout life. These mutations are detected as microscopic clones of mutant cells in normal skin.
The dNdScv analysis below recapitulates some of the key analyses of selection in this study:

```{r}
sel_cv = dndsskin$sel_cv
print(sel_cv[sel_cv$qglobal_cv<0.1,c(1:10,19)], digits = 3)
print(dndsskin$globaldnds, digits = 3)
```

##Using different substitution models

Classic maximum-likelihood implementations of dN/dS use a simple substitution model with a single rate parameter. Mutations are classified as either transitions (C<>T, G<>A) or transversions, and the single rate parameter is a transition/transversion (ts/tv) ratio reflecting the relative frequency of both classes of substitutions (Goldman & Yang, 1994). The dndscv function can take a different substitution model as input. The user can choose from existing substitution models provided in the *data* directory as part of the package or input a different substitution model as a matrix:

```{r message=FALSE}
library("dndscv")
# 192 rates (used as default)
data("submod_192r_3w", package="dndscv")
colnames(substmodel) = c("syn","mis","non","spl")
head(substmodel)
# 12 rates (no context-dependence)
data("submod_12r_3w", package="dndscv")
colnames(substmodel) = c("syn","mis","non","spl")
head(substmodel)
# 2 rates (classic ts/tv model)
data("submod_2r_3w", package="dndscv")
colnames(substmodel) = c("syn","mis","non","spl")
head(substmodel)
```

We can fit a traditional ts/tv model to the skin dataset using the code below:

```{r message=FALSE}
library("dndscv")
data("dataset_normalskin", package="dndscv")
data("dataset_normalskin_genes", package="dndscv")
dndsskin_2r = dndscv(m, gene_list=target_genes, max_muts_per_gene_per_sample = Inf, max_coding_muts_per_sample = Inf, sm = "2r_3w")
print(dndsskin_2r$mle_submodel)
sel_cv = dndsskin_2r$sel_cv
print(head(sel_cv[sel_cv$qglobal_cv<0.1, c(1:10,19)]),
digits = 3)
```

In general, the full trinucleotide model is recommended for cancer genomic datasets as it typically provides the least biased dN/dS estimates. The impact of simplistic mutation models on global dN/dS ratios can be considerable (see Martincorena *et al*., 2017) and can lead to false signals of negative or positive selection. In general, the impact of simple substitution models on gene-level inferences of selection tends to be smaller. The fitted models can be compared using AIC:

```{r message=FALSE}
AIC(dndsskin$poissmodel)
AIC(dndsskin_2r$poissmodel)
```

###References
* Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*. 171(5):1029-1041.
* Goldman N, Yang Z. (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. *Molecular biology and evolution*. 11:725-736.
* Greenman C, *et al*. (2006) Statistical analysis of pathogenicity of somatic mutations in cancer. *Genetics*. 173(4):2187-98.
* Wong CC, Martincorena I, *et al*. (2014) Inactivating CUX1 mutations promote tumorigenesis. *Nature Genetics*. 46(1):33-8.
* Martincorena I, *et al*. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. *Science*. 348(6237):880-6.
162 | -------------------------------------------------------------------------------- /vignettes/example_output_refcds.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/vignettes/example_output_refcds.rda -------------------------------------------------------------------------------- /vignettes/sitednds.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Hotspot discovery using sitewise dN/dS" 3 | author: "Inigo Martincorena" 4 | output: 5 | html_document: 6 | toc: true 7 | toc_float: true 8 | --- 9 | 10 | **Warning: this function is in testing. Users are advised to interpret the results with caution.** 11 | 12 | The importance of recurrently mutated hotspots is widely appreciated in cancer. This tutorial shows how to apply the new **sitednds** function provided in the latest version of the *dNdScv* package to estimate dN/dS ratios at single-site level. Sitewise dN/dS estimation has a rich history in comparative genomics (e.g. Massingham and Goldman, 2005) but it has only been used in cancer studies occasionally (e.g. Martincorena *et al.*, 2015). Yet, studying the relative strength of selection at single sites can be valuable, as emphasised by a recent study (Cannataro *et al.*, 2018). 13 | 14 | The new *sitednds* function allows the user to compute maximum-likelihood dN/dS estimates for recurrently mutated sites, as well as p-values against neutrality. Sitewise dN/dS ratios reflect the ratio between the number of observed mutations and the number expected under neutrality, while controlling for trinucleotide rates and for variable mutation rates across genes. In sparse datasets, point estimates for lowly-recurrent sites are likely to be underestimated, but p- and q-values provide a measure of their significance. 
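
To make the observed/expected idea concrete, here is a minimal sketch using base R only. The numbers are toy values chosen for illustration, and the use of `theta` as the `size` parameter of `pnbinom` is an assumption of this sketch; it is not meant to reproduce the internal code of *sitednds*. It computes a sitewise dN/dS ratio and compares the upper-tail probability under a Poisson model with the one under an overdispersed (negative binomial) model of the kind discussed in the next paragraph.

```r
# Toy values, chosen only for illustration (not taken from any dataset)
n_obs <- 5      # observed mutations at one site
mu    <- 0.01   # expected mutations at that site under neutrality
theta <- 0.5    # hypothetical overdispersion parameter (assumed to act as 'size')

# Sitewise dN/dS: ratio of observed to expected mutations
dnds_site <- n_obs / mu

# P(X >= n_obs) under a Poisson model with mean mu
p_pois <- ppois(n_obs - 1, lambda = mu, lower.tail = FALSE)

# P(X >= n_obs) under a negative binomial with mean mu and size theta
p_nb <- pnbinom(n_obs - 1, size = theta, mu = mu, lower.tail = FALSE)

print(c(dnds = dnds_site, p_poisson = p_pois, p_negbin = p_nb))
```

With these toy values the negative binomial tail probability is larger than the Poisson one, which is the sense in which allowing for unexplained rate variation across sites gives more conservative p-values.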
15 | 16 | An important aspect is that mutation rates can vary considerably across sites, even after correcting for these known mutational biases. *sitednds* models the observed mutation counts across synonymous sites as following a negative binomial distribution. This effectively controls for Poisson noise in the mutation counts per site and fits a Gamma distribution to the unexplained variation in mutation rate across sites. P-values for site recurrence are calculated using the fitted negative binomial distribution. These p-values should be more conservative and reliable than those based only on Poisson variation or on non-parametric bootstrapping, but they still rely on the assumption that the Gamma distribution appropriately captures the unexplained variation across sites. 17 | 18 | A major limitation is that mapping artefacts and SNP contamination are common problems in cancer genomic datasets, and these tend to lead to recurrent false positive mutation calls. In noisy datasets, the results of *sitednds* can be dominated by artefacts. Users trying *sitednds* should be very critical of the results. In the context of cancer genomic studies, a considerable number of synonymous recurrently mutated sites among the significant hits in *sitednds* almost certainly indicates a problem with the variant calling. This tutorial illustrates the issue by analysing two real datasets. 19 | 20 | ###Sitewise dN/dS ratios in a cancer dataset 21 | 22 | As a small example, in this tutorial we will use public somatic mutation calls from bladder cancers from TCGA. To reduce the risk of false positives and increase the signal-to-noise ratio, this example will only consider mutations in Cancer Gene Census genes (v81). 
23 | 24 | ```{r message=FALSE, warning=FALSE} 25 | library("dndscv") 26 | data("dataset_tcgablca", package="dndscv") # Loading the bladder cancer data 27 | data("cancergenes_cgc81", package="dndscv") # Loading the genes in the Cancer Gene Census (v81) 28 | dndsout = dndscv(mutations, outmats=T, gene_list=known_cancergenes) 29 | ``` 30 | 31 | The *sitednds* function takes the output of *dndscv* as input. For the *dndsout* object to be compatible with *sitednds*, users must use the "outmats=T" argument in *dndscv*. After running *dndscv*, we can evaluate the results at the gene level as explained in the main tutorial of *dndscv*: 32 | 33 | ```{r message=FALSE, warning=FALSE} 34 | sel_cv = dndsout$sel_cv 35 | print(head(sel_cv, 10), digits = 3) # Printing the top 10 genes by q-value 36 | ``` 37 | 38 | The table above reveals a problem with this dataset. The gene *MLLT3* appears as significant in *dndscv* (i.e. it violates the neutral null model of dN/dS=1), but only because of a very large excess of synonymous mutations (notice the high number of synonymous mutations and the very low dN/dS values). We can further confirm that the low dN/dS value in this gene is due to an excess of synonymous mutations and not genuine negative selection by comparing the observed number of synonymous mutations in the gene (43) with the expected number (*exp_syn* and *exp_syn_cv* columns below): 39 | 40 | ```{r message=FALSE, warning=FALSE} 41 | print(dndsout$genemuts[dndsout$genemuts$gene_name=="MLLT3",]) 42 | ``` 43 | 44 | Thus, *MLLT3* is a false positive, most likely due to recurrent artefacts or SNP contamination in the gene. A careful examination of all statistically significant genes in the dataset reveals other likely false positives. As we will see below, this will also affect the sitewise dN/dS analysis. 45 | 46 | To run the sitewise dN/dS model on this dataset, we only need to input the *dndsout* object into the *sitednds* function. 
By default, *sitednds* will calculate sitewise dN/dS ratios and p-values for sites mutated at least twice (use the argument *min_recurr* to control this). While p-values are only provided for recurrently mutated sites, the false discovery adjustment corrects for all possible changes. 47 | 48 | ```{r message=FALSE, warning=FALSE} 49 | hotspots = sitednds(dndsout) # Running sitewise dN/dS 50 | print(hotspots$theta) # Overdispersion (unexplained variation of the mutation rate across sites) 51 | ``` 52 | 53 | You can see that the maximum-likelihood estimate of *theta* is very low. This reflects considerable variation in the mutation rate across sites, not explained by the trinucleotide context or by the estimated relative mutation rate of the gene. *sitednds* takes this into account when calculating p-values. If there is large uncertainty in the estimation of *theta*, users can choose to use the lower-bound estimate of *theta* instead of the maximum-likelihood estimate when calculating p-values (use the argument *theta_option="conservative"* in *sitednds*). 54 | 55 | The main output of *sitednds* is a table with all hotspots studied, including their position, the gene affected, the amino acid change induced, the number of times that the mutation was observed, the expected number of mutations at this site by chance under neutrality (mu) and the dN/dS ratio. The table also contains p-values and q-values for the probability of observing that many mutations at the site by chance. Again, please treat these p-values with caution. 56 | 57 | ```{r message=FALSE, warning=FALSE} 58 | print(head(hotspots$recursites,10)) # First 10 lines of the table of hotspots 59 | ``` 60 | 61 | We can choose a significance cutoff (e.g. 
q-value<0.05) to list the significant hotspots in the dataset: 62 | 63 | ```{r message=FALSE, warning=FALSE} 64 | signifsites = hotspots$recursites[hotspots$recursites$qval<0.05, ] 65 | print(signifsites[,c("gene","aachange","impact","freq","dnds","qval")], digits=5) 66 | ``` 67 | 68 | Careful examination of the significant hotspots reveals many well-known cancer-driver hotspots, including in *FGFR3* (e.g. S249, Y375, G372), *TP53* (e.g. R248), *PIK3CA* (e.g. E542, H1047), *HRAS* (Q61), *KRAS* (G12), *ERBB2* (S310), *ERBB3* (V104), etc. Note that the exact amino acid position affected depends on the protein isoform used for annotation (see Ensembl protein IDs in *dndsout$annotmuts*). 69 | 70 | However, the table of significant hotspots also contains a considerable number of likely false positives, including multiple synonymous sites in *MLLT3*. A proper analysis of these data would require careful reevaluation and improvement of the mutation calls before repeating this analysis. Significant improvements to somatic mutation calls against recurrent artefacts can be achieved by using an unmatched normal panel and by more stringent filtering of germline SNP contamination. 71 | 72 | The TCGA mutation calls used in this example are an old version and it is likely that more recent versions are much less affected by artefacts. However, I decided to use this dataset as an example to highlight the importance of critically examining the results and the impact of recurrent artefacts on driver discovery at gene and site level. 73 | 74 | As a final note, users with whole-genome or whole-exome data can run *sitednds* on all genes. However, given the frequent presence of recurrent artefacts and the sparsity of cancer datasets, the signal-to-noise ratio can be considerably increased by running *sitednds* on a list of known cancer genes. 
To do so, I recommend running *dndscv* on all genes and then running *sitednds* on a list of genes of interest using the optional *gene_list* argument in *sitednds*. Running *dndscv* on all genes ensures that mutations from all genes are used to estimate the trinucleotide mutation rates, typically increasing their accuracy. 75 | 76 | 77 | ###Sitewise dN/dS ratios in normal oesophagus 78 | 79 | In a recent study, we sequenced 844 small biopsies of normal oesophageal epithelium from 9 transplant donors to study the extent of mutation and selection in a normal tissue (Martincorena *et al*., 2018). In this part of the tutorial, we will reanalyse this dataset using *dndscv* and *sitednds*. We first run *dndscv* using the settings from the analysis of normal skin data described in the main *dNdScv* tutorial. 80 | 81 | ```{r message=FALSE, warning=FALSE} 82 | library("dndscv") 83 | data("dataset_normaloesophagus", package="dndscv") # Loading the mutations in normal oesophagus 84 | mutations = unique(mutations) # Removing duplicate mutations (more conservative) 85 | data("dataset_normalskin_genes", package="dndscv") 86 | dndsout = dndscv(mutations, outmats=T, gene_list=target_genes, max_coding_muts_per_sample=Inf, max_muts_per_gene_per_sample=Inf) # outmats=T is required to run sitednds 87 | ``` 88 | 89 | We can see the list of genes under positive selection and the global dN/dS values using: 90 | 91 | ```{r message=FALSE, warning=FALSE} 92 | sel_cv = dndsout$sel_cv 93 | print(sel_cv[sel_cv$qglobal_cv<0.05, c(1:6,19)], digits = 3) 94 | print(dndsout$globaldnds, digits = 3) 95 | ``` 96 | 97 | To apply the *sitednds* model, we simply use the following code. Only the top 30 hotspots by q-value are shown, but a total of 133 sites are identified as significant with 5% FDR. 
98 | 99 | ```{r message=FALSE, warning=FALSE} 100 | hotspots = sitednds(dndsout) # Running sitewise dN/dS 101 | signifsites = hotspots$recursites[hotspots$recursites$qval<0.05, ] 102 | head(signifsites[,c("gene","aachange","impact","freq","dnds","qval")], 30) 103 | ``` 104 | 105 | Remarkably, owing to the very large number of mutant clones identified in this study, this analysis finds a large number of statistically significant sites (n=81 for qval<0.05). Reassuringly, they are all in genes detected under positive selection in the original publication (Martincorena *et al*., 2018). This comprises 61 sites in *NOTCH1*, 17 sites in *TP53*, the well-known *PIK3CA* hotspot H1047R and one site each in *FAT1* and *TP63*. 106 | 107 | This analysis also identifies a known driver hotspot in a synonymous site of *TP53* (T125T), which is known to affect splicing of *TP53*. Intriguingly, it also identifies a synonymous site in *NOTCH1* (V717V), which deserves careful follow-up analysis. Apart from these two synonymous sites, all other 79 significant hotspots are non-synonymous. 108 | 109 | 110 | ###References 111 | * Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*. 171(5):1029-1041. doi:10.1016/j.cell.2017.09.042. 112 | * Massingham T, Goldman N. (2005) Detecting amino acid sites under positive selection and purifying selection. *Genetics*. 169(3):1753-62. doi:10.1534/genetics.104.032144. 113 | * Martincorena I, *et al*. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. *Science*. 348(6237):880-6. doi:10.1126/science.aaa6806. 114 | * Cannataro VL, Gaffney SG, Townsend JP. (2018) Effect Sizes of Somatic Mutations in Cancer. *J Natl Cancer Inst*. doi:10.1093/jnci/djy168. 115 | * Martincorena I, Fowler JC, *et al*. (2018) Somatic mutant clones colonize the human esophagus with age. *Science*. doi:10.1126/science.aau3879. 
116 | --------------------------------------------------------------------------------