├── DESCRIPTION
├── LICENSE.txt
├── NAMESPACE
├── R
    ├── buildcodon.R
    ├── buildref.R
    ├── codondnds.R
    ├── dndscv-package.R
    ├── dndscv.R
    ├── fitlnpbin.R
    ├── geneci.R
    ├── genesetdnds.R
    ├── sitednds.R
    └── withingenednds.R
├── README.md
├── data
    ├── cancergenes_cgc81.rda
    ├── covariates_hg19.rda
    ├── covariates_hg19_chrx.rda
    ├── covariates_hg19_hg38_epigenome_pcawg.rda
    ├── dataset_normaloesophagus.rda
    ├── dataset_normalskin.rda
    ├── dataset_normalskin_genes.rda
    ├── dataset_simbreast.rda
    ├── dataset_tcgablca.rda
    ├── knownhotcodons_hg19.rda
    ├── knownhotspots_hg19.rda
    ├── refcds_GRCh38_hg38.rda
    ├── refcds_hg19.rda
    ├── submod_12r_3w.rda
    ├── submod_192r_3w.rda
    └── submod_2r_3w.rda
├── dndscv.Rproj
├── inst
    ├── doc
    │   ├── dNdScv.Rmd
    │   └── dNdScv.html
    └── extdata
    │   ├── BioMart_human_GRCh37_chr3_segment.txt
    │   ├── chr3_segment.fa
    │   ├── chr3_segment.fa.fai
    │   └── refcds_example_chr3_segment.rda
├── man
    ├── buildcodon.Rd
    ├── buildref.Rd
    ├── codondnds.Rd
    ├── dndscv-package.Rd
    ├── dndscv.Rd
    ├── fitlnpbin.Rd
    ├── geneci.Rd
    ├── genesetdnds.Rd
    ├── sitednds.Rd
    └── withingenednds.Rd
└── vignettes
    ├── Ensembl_BioMart_screenshot1.png
    ├── Ensembl_BioMart_screenshot2.png
    ├── buildref.Rmd
    ├── buildref.html
    ├── dNdScv.Rmd
    ├── dNdScv.html
    ├── example_output_refcds.rda
    ├── sitednds.Rmd
    └── sitednds.html


/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: dndscv
 2 | Title: Poisson-based dN/dS models to quantify natural selection in somatic evolution
 3 | Type: Package
 4 | Version: 0.0.1.0
 5 | Date: 2019-07-03
 6 | Author: Inigo Martincorena
 7 | Maintainer: Inigo Martincorena <im3@sanger.ac.uk>
 8 | Description: This package contains functions for studying selection on coding sequences using a Poisson implementation of dN/dS. A Poisson model of dN/dS facilitates the study of selection beyond traditional codon models, including complex context-dependent mutation effects and selection on nonsense and splice site mutations. This model is best suited for resequencing studies, with very low density of mutations per base pair. The model was initially developed for cancer genome sequencing studies, and specific functions are provided to perform driver gene discovery using the dNdScv method on human cancer genomic data.
 9 | Reference: Martincorena I et al, 2017. Universal patterns of selection in cancer and somatic tissues. Cell. https://www.ncbi.nlm.nih.gov/pubmed/29056346
10 | biocViews:
11 | Imports:
12 |   seqinr,
13 |   MASS,
14 |   GenomicRanges,
15 |   Biostrings,
16 |   IRanges,
17 |   Rsamtools,
18 |   poilog,
19 |   plyr
20 | LazyData: false
21 | License: GPL-3
22 | Encoding: UTF-8
23 | Suggests: knitr, rmarkdown
24 | VignetteBuilder: knitr
25 | RoxygenNote: 7.3.1
26 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
 1 | # Generated by roxygen2: do not edit by hand
 2 | 
 3 | export(buildcodon)
 4 | export(buildref)
 5 | export(codondnds)
 6 | export(dndscv)
 7 | export(fitlnpbin)
 8 | export(geneci)
 9 | export(genesetdnds)
10 | export(sitednds)
11 | export(withingenednds)
12 | import(Biostrings)
13 | import(GenomicRanges)
14 | import(IRanges)
15 | import(MASS)
16 | import(Rsamtools)
17 | import(seqinr)
18 | 


--------------------------------------------------------------------------------
/R/buildcodon.R:
--------------------------------------------------------------------------------
 1 | #' buildcodon
 2 | #' 
 3 | #' This function takes a RefCDS object as input and adds to it two fields required to run the codondnds function. Usage: RefCDS = buildcodon(RefCDS)
 4 | #' 
 5 | #' @author Inigo Martincorena (Wellcome Sanger Institute)
 6 | #' @details Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041.
 7 | #' 
 8 | #' @param refcds Input RefCDS object
 9 | #' @param numcode NCBI genetic code number (default = 1; standard genetic code). To see the list of genetic codes supported use: ? seqinr::translate
10 | #'
11 | #' @export
12 | 
13 | buildcodon = function(refcds, numcode = 1) {
14 |     
15 |     ## 1. Precomputing trinucleotide and codon lookup tables
16 |     message("Adding codon-level information to RefCDS to run codondnds...")
17 | 
18 |     nt = c("A","C","G","T")
19 |     trinuc_list = paste(rep(nt,each=16,times=1), rep(nt,each=4,times=4), rep(nt,each=1,times=16), sep="")
20 |     trinuc_ind = structure(1:64, names=trinuc_list)
21 |     trinuc_subs = NULL; for (j in 1:length(trinuc_list)) { trinuc_subs = c(trinuc_subs, paste(trinuc_list[j], paste(substr(trinuc_list[j],1,1), setdiff(nt,substr(trinuc_list[j],2,2)), substr(trinuc_list[j],3,3), sep=""), sep=">")) }
22 |     trinuc_subsind = structure(1:192, names=trinuc_subs)
23 |     
24 |     # Precalculating a 64x64 matrix with the functional impact of each codon transition (1=Synonymous, 2=Missense, 3=Nonsense)
25 |     impact_matrix = array(NA, dim=c(64,64))
26 |     colnames(impact_matrix) = rownames(impact_matrix) = trinuc_list
27 |     for (j in 1:64) {
28 |         for (h in 1:64) {
29 |             from_aa = seqinr::translate(strsplit(trinuc_list[j],"")[[1]], numcode = numcode)
30 |             to_aa = seqinr::translate(strsplit(trinuc_list[h],"")[[1]], numcode = numcode)
31 |             # Annotating the impact of the mutation
32 |             if (to_aa == from_aa){ 
33 |                 impact_matrix[j,h] = 1
34 |             } else if (to_aa == "*"){
35 |                 impact_matrix[j,h] = 3
36 |             } else if ((to_aa != "*") & (from_aa != "*") & (to_aa != from_aa)){
37 |                 impact_matrix[j,h] = 2
38 |             } else if (from_aa=="*") {
39 |                 impact_matrix[j,h] = NA
40 |             }
41 |         }
42 |     }
43 |         
44 |     # Adding two new fields to refcds: for every possible site change, the amino acid impact (codon_impact) and the index of the trinucleotide rate parameter (codon_rates)
45 | 
46 |     for (j in 1:length(refcds)) {
47 |         
48 |         cdsseq = as.character(as.vector(refcds[[j]]$seq_cds))
49 |         cdsseq1up = as.character(as.vector(refcds[[j]]$seq_cds1up))
50 |         cdsseq1down = as.character(as.vector(refcds[[j]]$seq_cds1down))
51 |         
52 |         # Exonic mutations
53 |         
54 |         ind = rep(1:length(cdsseq), each=3)
55 |         old_trinuc = paste(cdsseq1up[ind], cdsseq[ind], cdsseq1down[ind], sep="")
56 |         new_base = c(sapply(cdsseq, function(x) nt[nt!=x]))
57 |         new_trinuc = paste(cdsseq1up[ind], new_base, cdsseq1down[ind], sep="")
58 |         codon_start = rep(seq(1,length(cdsseq),by=3),each=9)
59 |         old_codon = paste(cdsseq[codon_start], cdsseq[codon_start+1], cdsseq[codon_start+2], sep="")
60 |         pos_in_codon = rep(rep(1:3, each=3), length.out=length(old_codon))
61 |         aux = strsplit(old_codon,"")
62 |         new_codon = sapply(1:length(old_codon), function(x) { new_codonx = aux[[x]]; new_codonx[pos_in_codon[x]] = new_base[x]; return(new_codonx) } )
63 |         new_codon = paste(new_codon[1,], new_codon[2,], new_codon[3,], sep="")
64 |         
65 |         imp = impact_matrix[(trinuc_ind[new_codon]-1)*64 + trinuc_ind[old_codon]]
66 |         matrind = trinuc_subsind[paste(old_trinuc, new_trinuc, sep=">")]
67 |         
68 |         refcds[[j]]$codon_impact = imp
69 |         refcds[[j]]$codon_rates = matrind
70 |         
71 |         if (round(j/1000)==(j/1000)) { message(sprintf('    %0.3g%% ...', round(j/length(refcds),2)*100)) }
72 |     }
73 |     
74 |     return(refcds)
75 |     
76 | } # EOF
77 | 
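
The lookup tables built at the top of `buildcodon` can be reproduced in base R. The following is a minimal sketch, not part of the package: it hard-codes the standard genetic code (NCBI table 1) instead of calling `seqinr::translate`, and `classify_change` is a hypothetical helper name introduced only for illustration.

```r
nt <- c("A", "C", "G", "T")

# 64 trinucleotides, in the same order as trinuc_list in buildcodon
trinuc_list <- paste0(rep(nt, each = 16), rep(nt, each = 4, times = 4), rep(nt, times = 16))
trinuc_ind <- setNames(1:64, trinuc_list)

# 192 substitutions of the middle base, e.g. "AAA>ACA"
trinuc_subs <- unlist(lapply(trinuc_list, function(tri) {
  alt <- setdiff(nt, substr(tri, 2, 2))
  paste0(tri, ">", substr(tri, 1, 1), alt, substr(tri, 3, 3))
}))

# Assumption: standard genetic code, codons enumerated in T, C, A, G order
tcag <- c("T", "C", "A", "G")
codons <- paste0(rep(tcag, each = 16), rep(tcag, each = 4, times = 4), rep(tcag, times = 16))
aa <- strsplit("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
genetic_code <- setNames(aa, codons)

# Hypothetical helper: 1 = synonymous, 2 = missense, 3 = nonsense, NA = change from a stop codon
classify_change <- function(from_codon, to_codon) {
  from_aa <- genetic_code[[from_codon]]
  to_aa <- genetic_code[[to_codon]]
  if (to_aa == from_aa) return(1L)
  if (to_aa == "*") return(3L)
  if (from_aa == "*") return(NA_integer_)
  2L
}

# 64x64 impact matrix: rows are the original codon, columns the mutated codon
impact_matrix <- matrix(NA_integer_, 64, 64, dimnames = list(trinuc_list, trinuc_list))
for (from in trinuc_list) {
  for (to in trinuc_list) {
    impact_matrix[from, to] <- classify_change(from, to)
  }
}

# R matrices are stored column by column, so cell [from, to] sits at
# linear position (column - 1) * 64 + row, which is the index buildcodon uses
from <- "TGG"
to <- "TGA"
linear <- (trinuc_ind[[to]] - 1) * 64 + trinuc_ind[[from]]
impact_matrix[linear]
```

In this ordering the last of the 192 substitutions is `TTT>TGT`, the change whose rate codondnds fixes at 1 as the reference.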


--------------------------------------------------------------------------------
/R/buildref.R:
--------------------------------------------------------------------------------
  1 | #' buildref
  2 | #'
  3 | #' Function to build a RefCDS object from a reference genome and a table of transcripts. The RefCDS object has to be precomputed for any new species or assembly before running dndscv. This function writes an .rda file, which is then passed to dndscv through the refdb argument. Note that when multiple CDS share the same gene name (second column of cdsfile), the longest valid CDS is chosen for the gene. CDS with ambiguous bases (N) are not considered.
  4 | #'
  5 | #' @author Inigo Martincorena (Wellcome Sanger Institute)
  6 | #' @details Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041.
  7 | #'
  8 | #' @param cdsfile Path to the reference transcript table.
  9 | #' @param genomefile Path to the indexed reference genome file.
 10 | #' @param outfile Output file name (default = "RefCDS.rda").
 11 | #' @param numcode NCBI genetic code number (default = 1; standard genetic code). To see the list of genetic codes supported use: ? seqinr::translate
 12 | #' @param excludechrs Vector or string with chromosome names to be excluded from the RefCDS object (default: no chromosome will be excluded). The mitochondrial chromosome should be excluded as it has different genetic code and mutation rates, either using the excludechrs argument or not including mitochondrial transcripts in cdsfile.
 13 | #' @param onlychrs Vector of valid chromosome names (default: all chromosomes will be included)
 14 | #' @param useids Combine gene IDs and gene names (columns 1 and 2 of the input table) as long gene names (default = F)
 15 | #'
 16 | #' @export
 17 | 
 18 | buildref = function(cdsfile, genomefile, outfile = "RefCDS.rda", numcode = 1, excludechrs = NULL, onlychrs = NULL, useids = F) {
 19 | 
 20 |     ## 1. Valid chromosomes and reference CDS per gene
 21 |     message("[1/3] Preparing the environment...")
 22 | 
 23 |     reftable = read.table(cdsfile, header=1, sep="\t", stringsAsFactors=F, quote="\"", na.strings="-", fill=TRUE) # Loading the reference table
 24 |     colnames(reftable) = c("gene.id","gene.name","cds.id","chr","chr.coding.start","chr.coding.end","cds.start","cds.end","length","strand")
 25 |     reftable[,5:10] = suppressWarnings(lapply(reftable[,5:10], as.numeric)) # Convert to numeric
 26 | 
 27 |     # Checking for systematic absence of gene names (it happens in some BioMart inputs)
 28 |     longname = paste(reftable$gene.id, reftable$gene.name, sep=":") # Gene name combining the Gene stable ID and the Associated gene name
 29 |     if (useids==T) {
 30 |         reftable$gene.name = longname # Replacing gene names by a concatenation of gene ID and gene name
 31 |     }
 32 |     if (length(unique(reftable$gene.name))<length(unique(longname))) {
 33 |         warning(sprintf("%0.0f unique gene IDs (column 1) found. %0.0f unique gene names (column 2) found. Consider combining gene names and gene IDs or replacing gene names by gene IDs to avoid losing genes (see useids argument in ? buildref)",length(unique(reftable$gene.id)),length(unique(reftable$gene.name))))
 34 |     }
 35 | 
 36 |     # Reading chromosome names in the fasta file using its index file if it already exists or creating it if it does not exist. The index file is also used by scanFa later.
 37 |     validchrs = as.character(GenomicRanges::seqnames(Rsamtools::scanFaIndex(genomefile)))
 38 | 
 39 |     validchrs = setdiff(validchrs, excludechrs)
 40 |     if (length(onlychrs)>0) {
 41 |         validchrs = validchrs[validchrs %in% onlychrs]
 42 |     }
 43 | 
 44 |     # Restricting to chromosomes present in both the genome file and the CDS table
 45 |     if (any(validchrs %in% unique(reftable$chr))) {
 46 |         validchrs = validchrs[validchrs %in% unique(reftable$chr)]
 47 |     } else { # Try adding a chr prefix
 48 |         reftable$chr = paste("chr", reftable$chr, sep="")
 49 |         validchrs = validchrs[validchrs %in% unique(reftable$chr)]
 50 |         if (length(validchrs)==0) { # No matching chromosome names
 51 |             stop("No chromosome names in common between the genome file and the CDS table")
 52 |         }
 53 |     }
 54 | 
 55 | 
 56 |     # Removing genes that fall partially or completely outside of the available chromosomes/contigs
 57 | 
 58 |     reftable = reftable[reftable[,1]!="" & reftable[,2]!="" & reftable[,3]!="" & !is.na(reftable[,5]) & !is.na(reftable[,6]),] # Removing invalid entries
 59 |     reftable = reftable[which(reftable$chr %in% validchrs),] # Only valid chromosomes
 60 | 
 61 |     transc_gr = GenomicRanges::GRanges(reftable$chr, IRanges::IRanges(reftable$chr.coding.start,reftable$chr.coding.end))
 62 |     chrs_gr = Rsamtools::scanFaIndex(genomefile)
 63 |     ol = as.data.frame(GenomicRanges::findOverlaps(transc_gr, chrs_gr, type="within", select="all"))
 64 | 
 65 |     # Issuing an error if any transcript falls outside of the limits of a chromosome. Possibly due to a mismatch between the assembly used for the reference table and the reference genome.
 66 |     if (length(unique(ol[,1])) < nrow(reftable)) {
 67 |         stop(sprintf("Aborting buildref. %0.0f rows in cdsfile have coordinates that fall outside of the corresponding chromosome length. Please ensure that you are using the same assembly for the cdsfile and genomefile",nrow(reftable)-length(unique(ol[,1]))))
 68 |     }
 69 | 
 70 |     reftable = reftable[unique(ol[,1]),]
 71 | 
 72 |     # Identifying genes starting or ending at the ends of a chromosome/contig
 73 |     # Because buildref and dndscv need to access the base before and after each coding position, genes overlapping the ends
 74 |     # of a contig will be trimmed by three bases and a warning will be issued listing those genes.
 75 | 
 76 |     fullcds = intersect(reftable$cds.id[reftable$cds.start==1], reftable$cds.id[reftable$cds.end==reftable$length]) # List of complete CDS
 77 | 
 78 |     ol_start = as.data.frame(GenomicRanges::findOverlaps(transc_gr, chrs_gr, type="start", select="all"))[,1] # Genes overlapping contig starts
 79 |     if (any(ol_start)) {
 80 |         reftable[ol_start,"chr.coding.start"] = reftable[ol_start,"chr.coding.start"] + 3 # Truncate the first 3 bases
 81 |         reftable[ol_start,"cds.start"] = reftable[ol_start,"cds.start"] + 3 # Truncate the first 3 bases
 82 |     }
 83 | 
 84 |     ol_end = as.data.frame(GenomicRanges::findOverlaps(transc_gr, chrs_gr, type="end", select="all"))[,1] # Genes overlapping contig ends
 85 |     if (any(ol_end)) {
 86 |         reftable[ol_end,"chr.coding.end"] = reftable[ol_end,"chr.coding.end"] - 3 # Truncate the last 3 bases
 87 |         reftable[ol_end,"cds.end"] = reftable[ol_end,"cds.end"] - 3 # Truncate the last 3 bases
 88 |     }
 89 | 
 90 |     if (any(c(ol_start,ol_end))) {
 91 |         warning(sprintf("The following genes were found to start or end at the first or last base of their contig. Since dndscv needs trinucleotide contexts for all coding bases, codons overlapping ends of contigs have been trimmed. Affected genes: %s.", paste(reftable[unique(c(ol_start,ol_end)),"gene.name"], collapse=", ")))
 92 |     }
 93 | 
 94 | 
 95 |     # Selecting the longest complete CDS for every gene (required when there are multiple alternative transcripts per unique gene name)
 96 | 
 97 |     cds_table = unique(reftable[,c(1:3,9)])
 98 |     cds_table = cds_table[order(cds_table$gene.name, -cds_table$length), ] # Sorting CDS from longest to shortest
 99 |     cds_table = cds_table[(cds_table$length %% 3)==0, ] # Removing CDS of length not multiple of 3
100 |     cds_table = cds_table[cds_table$cds.id %in% fullcds, ] # Complete CDS
101 |     reftable = reftable[reftable$cds.id %in% fullcds, ] # Complete CDS
102 |     gene_list = unique(cds_table$gene.name)
103 | 
104 |     reftable = reftable[order(reftable$chr, reftable$chr.coding.start), ]
105 |     cds_split = split(reftable, f=reftable$cds.id)
106 |     gene_split = split(cds_table, f=cds_table$gene.name)
107 | 
108 | 
109 |     ## 2. Building the RefCDS object
110 |     message("[2/3] Building the RefCDS object...")
111 | 
112 |     # Subfunction to extract the coding sequence
113 |     get_CDSseq = function(gr, strand) {
114 |         cdsseq = strsplit(paste(as.vector(Rsamtools::scanFa(genomefile, gr)),collapse=""),"")[[1]]
115 |         if (strand==-1) {
116 |             cdsseq = rev(seqinr::comp(cdsseq,forceToLower=F,ambiguous=T))
117 |         }
118 |         return(cdsseq)
119 |     }
120 | 
121 |     # Subfunction to extract essential splice site sequences
122 |     # Definition of essential splice sites: (5' splice site: +1,+2,+5; 3' splice site: -1,-2)
123 |     get_splicesites = function(cds) {
124 |         splpos = numeric(0)
125 |         if (nrow(cds)>1) { # If the CDS has more than one exon
126 |             if (cds[1,10]==1) { # + strand
127 |                 spl5prime = cds[-nrow(cds),6] # Exon end before splice site
128 |                 spl3prime = cds[-1,5] # Exon start after splice site
129 |                 splpos = unique(sort(c(spl5prime+1, spl5prime+2, spl5prime+5, spl3prime-1, spl3prime-2)))
130 |             } else if (cds[1,10]==-1) { # - strand
131 |                 spl5prime = cds[-1,5] # Genomic exon start next to the 5' splice site (transcript runs right to left)
132 |                 spl3prime = cds[-nrow(cds),6] # Genomic exon end next to the 3' splice site
133 |                 splpos = unique(sort(c(spl5prime-1, spl5prime-2, spl5prime-5, spl3prime+1, spl3prime+2)))
134 |             }
135 |         }
136 |         return(splpos)
137 |     }
138 | 
139 |     # Subfunction to extract the essential splice site sequence
140 |     get_spliceseq = function(gr, strand) {
141 |         spliceseq = unname(as.vector(Rsamtools::scanFa(genomefile, gr)))
142 |         if (strand==-1) {
143 |             spliceseq = seqinr::comp(spliceseq,forceToLower=F,ambiguous=T)
144 |         }
145 |         return(spliceseq)
146 |     }
147 | 
148 |     # Initialising and populating the RefCDS object
149 | 
150 |     RefCDS = array(list(NULL), length(gene_split)) # Initialising empty object
151 |     invalid_genes = rep(0, length(gene_split)) # Initialising empty object
152 | 
153 |     for (j in 1:length(gene_split)) {
154 | 
155 |         gene_cdss = gene_split[[j]]
156 |         h = keeptrying = 1
157 | 
158 |         while (h<=nrow(gene_cdss) & keeptrying) {
159 | 
160 |             pid = gene_cdss[h,3]
161 |             cds = cds_split[[pid]]
162 |             strand = cds[1,10]
163 |             chr = cds[1,4]
164 |             gr = GenomicRanges::GRanges(chr, IRanges::IRanges(cds[,5], cds[,6]))
165 |             cdsseq = get_CDSseq(gr,strand)
166 |             pseq = seqinr::translate(cdsseq, numcode = numcode)
167 | 
168 |             if (all(pseq[-length(pseq)]!="*") & all(cdsseq!="N")) { # A valid CDS has been found (no stop codons inside the CDS excluding the last codon) and no "N" nucleotides
169 | 
170 |                 # Essential splice sites
171 |                 splpos = get_splicesites(cds) # Essential splice sites
172 |                 if (length(splpos)>0) { # CDSs with a single exon do not have splice sites
173 |                     gr_spl = GenomicRanges::GRanges(chr, IRanges::IRanges(splpos, splpos))
174 |                     splseq = get_spliceseq(gr_spl, strand)
175 |                 }
176 | 
177 |                 # Obtaining the splicing sequences and the coding and splicing sequence contexts
178 |                 if (strand==1) {
179 | 
180 |                     cdsseq1up = get_CDSseq(GenomicRanges::GRanges(chr, IRanges::IRanges(cds[,5]-1, cds[,6]-1)), strand)
181 |                     cdsseq1down = get_CDSseq(GenomicRanges::GRanges(chr, IRanges::IRanges(cds[,5]+1, cds[,6]+1)), strand)
182 |                     if (length(splpos)>0) {
183 |                         splseq1up = get_spliceseq(GenomicRanges::GRanges(chr, IRanges::IRanges(splpos-1, splpos-1)), strand)
184 |                         splseq1down = get_spliceseq(GenomicRanges::GRanges(chr, IRanges::IRanges(splpos+1, splpos+1)), strand)
185 |                     }
186 | 
187 |                 } else if (strand==-1) {
188 | 
189 |                     cdsseq1up = get_CDSseq(GenomicRanges::GRanges(chr, IRanges::IRanges(cds[,5]+1, cds[,6]+1)), strand)
190 |                     cdsseq1down = get_CDSseq(GenomicRanges::GRanges(chr, IRanges::IRanges(cds[,5]-1, cds[,6]-1)), strand)
191 |                     if (length(splpos)>0) {
192 |                         splseq1up = get_spliceseq(GenomicRanges::GRanges(chr, IRanges::IRanges(splpos+1, splpos+1)), strand)
193 |                         splseq1down = get_spliceseq(GenomicRanges::GRanges(chr, IRanges::IRanges(splpos-1, splpos-1)), strand)
194 |                     }
195 | 
196 |                 }
197 | 
198 |                 # Annotating the CDS in the RefCDS database
199 | 
200 |                 RefCDS[[j]]$gene_name = gene_cdss[h,2]
201 |                 RefCDS[[j]]$gene_id = gene_cdss[h,1]
202 |                 RefCDS[[j]]$protein_id = gene_cdss[h,3]
203 |                 RefCDS[[j]]$CDS_length = gene_cdss[h,4]
204 |                 RefCDS[[j]]$chr = cds[1,4]
205 |                 RefCDS[[j]]$strand = strand
206 |                 RefCDS[[j]]$intervals_cds = unname(as.matrix(cds[,5:6]))
207 |                 RefCDS[[j]]$intervals_splice = splpos
208 | 
209 |                 RefCDS[[j]]$seq_cds = Biostrings::DNAString(paste(cdsseq, collapse=""))
210 |                 RefCDS[[j]]$seq_cds1up = Biostrings::DNAString(paste(cdsseq1up, collapse=""))
211 |                 RefCDS[[j]]$seq_cds1down = Biostrings::DNAString(paste(cdsseq1down, collapse=""))
212 | 
213 |                 if (length(splpos)>0) { # If there are splice sites in the gene
214 |                     RefCDS[[j]]$seq_splice = Biostrings::DNAString(paste(splseq, collapse=""))
215 |                     RefCDS[[j]]$seq_splice1up = Biostrings::DNAString(paste(splseq1up, collapse=""))
216 |                     RefCDS[[j]]$seq_splice1down = Biostrings::DNAString(paste(splseq1down, collapse=""))
217 |                 }
218 | 
219 |                 keeptrying = 0 # Stopping the while loop
220 |             }
221 |             h = h+1
222 |         }
223 |         if (keeptrying) {
224 |             invalid_genes[j] = 1 # No valid CDS was found for this gene and the gene will be removed from the RefCDS object
225 |         }
226 |         if (round(j/1000)==(j/1000)) { message(sprintf('    %0.3g%% ...', round(j/length(gene_split),2)*100)) }
227 |     }
228 | 
229 |     RefCDS = RefCDS[!invalid_genes] # Removing genes without a valid CDS
230 | 
231 | 
232 |     ## 3. L matrices: number of synonymous, missense, nonsense and splice sites in each CDS at each trinucleotide context
233 |     message("[3/3] Calculating the impact of all possible coding changes...")
234 | 
235 |     nt = c("A","C","G","T")
236 |     trinuc_list = paste(rep(nt,each=16,times=1), rep(nt,each=4,times=4), rep(nt,each=1,times=16), sep="")
237 |     trinuc_ind = structure(1:64, names=trinuc_list)
238 | 
239 |     trinuc_subs = NULL; for (j in 1:length(trinuc_list)) { trinuc_subs = c(trinuc_subs, paste(trinuc_list[j], paste(substr(trinuc_list[j],1,1), setdiff(nt,substr(trinuc_list[j],2,2)), substr(trinuc_list[j],3,3), sep=""), sep=">")) }
240 |     trinuc_subsind = structure(1:192, names=trinuc_subs)
241 | 
242 |     # Precalculating a 64x64 matrix with the functional impact of each codon transition (1=Synonymous, 2=Missense, 3=Nonsense)
243 |     impact_matrix = array(NA, dim=c(64,64))
244 |     colnames(impact_matrix) = rownames(impact_matrix) = trinuc_list
245 |     for (j in 1:64) {
246 |         for (h in 1:64) {
247 |             from_aa = seqinr::translate(strsplit(trinuc_list[j],"")[[1]], numcode = numcode)
248 |             to_aa = seqinr::translate(strsplit(trinuc_list[h],"")[[1]], numcode = numcode)
249 |             # Annotating the impact of the mutation
250 |             if (to_aa == from_aa){
251 |                 impact_matrix[j,h] = 1
252 |             } else if (to_aa == "*"){
253 |                 impact_matrix[j,h] = 3
254 |             } else if ((to_aa != "*") & (from_aa != "*") & (to_aa != from_aa)){
255 |                 impact_matrix[j,h] = 2
256 |             } else if (from_aa=="*") {
257 |                 impact_matrix[j,h] = NA
258 |             }
259 |         }
260 |     }
261 | 
262 |     for (j in 1:length(RefCDS)) {
263 | 
264 |         L = array(0, dim=c(192,4))
265 |         cdsseq = as.character(as.vector(RefCDS[[j]]$seq_cds))
266 |         cdsseq1up = as.character(as.vector(RefCDS[[j]]$seq_cds1up))
267 |         cdsseq1down = as.character(as.vector(RefCDS[[j]]$seq_cds1down))
268 | 
269 |         # 1. Exonic mutations
270 | 
271 |         ind = rep(1:length(cdsseq), each=3)
272 |         old_trinuc = paste(cdsseq1up[ind], cdsseq[ind], cdsseq1down[ind], sep="")
273 |         new_base = c(sapply(cdsseq, function(x) nt[nt!=x]))
274 |         new_trinuc = paste(cdsseq1up[ind], new_base, cdsseq1down[ind], sep="")
275 |         codon_start = rep(seq(1,length(cdsseq),by=3),each=9)
276 |         old_codon = paste(cdsseq[codon_start], cdsseq[codon_start+1], cdsseq[codon_start+2], sep="")
277 |         pos_in_codon = rep(rep(1:3, each=3), length.out=length(old_codon))
278 |         aux = strsplit(old_codon,"")
279 |         new_codon = sapply(1:length(old_codon), function(x) { new_codonx = aux[[x]]; new_codonx[pos_in_codon[x]] = new_base[x]; return(new_codonx) } )
280 |         new_codon = paste(new_codon[1,], new_codon[2,], new_codon[3,], sep="")
281 | 
282 |         imp = impact_matrix[(trinuc_ind[new_codon]-1)*64 + trinuc_ind[old_codon]]
283 |         matrind = trinuc_subsind[paste(old_trinuc, new_trinuc, sep=">")]
284 | 
285 |         # Synonymous
286 |         matrix_ind = table(matrind[which(imp==1)])
287 |         L[as.numeric(names(matrix_ind)), 1] = matrix_ind
288 | 
289 |         # Missense
290 |         matrix_ind = table(matrind[which(imp==2)])
291 |         L[as.numeric(names(matrix_ind)), 2] = matrix_ind
292 | 
293 |         # Nonsense
294 |         matrix_ind = table(matrind[which(imp==3)])
295 |         L[as.numeric(names(matrix_ind)), 3] = matrix_ind
296 | 
297 |         # 2. Splice site mutations
298 |         if (length(RefCDS[[j]]$intervals_splice)>0) {
299 |             splseq = as.character(as.vector(RefCDS[[j]]$seq_splice))
300 |             splseq1up = as.character(as.vector(RefCDS[[j]]$seq_splice1up))
301 |             splseq1down = as.character(as.vector(RefCDS[[j]]$seq_splice1down))
302 |             old_trinuc = rep(paste(splseq1up, splseq, splseq1down, sep=""), each=3)
303 |             new_trinuc = paste(rep(splseq1up, each=3), c(sapply(splseq, function(x) nt[nt!=x])), rep(splseq1down,each=3), sep="")
304 |             matrind = trinuc_subsind[paste(old_trinuc, new_trinuc, sep=">")]
305 |             matrix_ind = table(matrind)
306 |             L[as.numeric(names(matrix_ind)), 4] = matrix_ind
307 |         }
308 | 
309 |         RefCDS[[j]]$L = L # Saving the L matrix
310 |         if (round(j/1000)==(j/1000)) { message(sprintf('    %0.3g%% ...', round(j/length(RefCDS),2)*100)) }
311 |     }
312 | 
313 |     ## Saving the reference GenomicRanges object
314 | 
315 |     aux = unlist(sapply(1:length(RefCDS), function(x) t(cbind(x,rbind(RefCDS[[x]]$intervals_cds,cbind(RefCDS[[x]]$intervals_splice,RefCDS[[x]]$intervals_splice))))))
316 |     df_genes = as.data.frame(t(array(aux,dim=c(3,length(aux)/3))))
317 |     colnames(df_genes) = c("ind","start","end")
318 |     df_genes$chr = unlist(sapply(1:length(RefCDS), function(x) rep(RefCDS[[x]]$chr,nrow(RefCDS[[x]]$intervals_cds)+length(RefCDS[[x]]$intervals_splice))))
319 |     df_genes$gene = sapply(RefCDS, function(x) x$gene_name)[df_genes$ind]
320 | 
321 |     gr_genes = GenomicRanges::GRanges(df_genes$chr, IRanges::IRanges(df_genes$start, df_genes$end))
322 |     GenomicRanges::mcols(gr_genes)$names = df_genes$gene
323 | 
324 |     save(RefCDS, gr_genes, file=outfile)
325 | 
326 | } # EOF
327 | 
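
The essential splice site definition used by `get_splicesites` (5' splice site: +1, +2, +5; 3' splice site: -1, -2) can be checked on a toy transcript. This is a sketch under stated assumptions: `splice_positions` is a hypothetical standalone helper that takes exon start and end vectors sorted by genomic coordinate rather than the `cds` table, and the coordinates are invented for illustration.

```r
# Hypothetical helper mirroring the logic of get_splicesites in buildref.
# exon_start / exon_end: genomic coordinates of the coding exons, sorted ascending.
# strand: 1 for the + strand, -1 for the - strand.
splice_positions <- function(exon_start, exon_end, strand) {
  n <- length(exon_start)
  if (n < 2) return(numeric(0))  # single-exon CDS: no splice sites

  if (strand == 1) {
    donor <- exon_end[-n]        # exon ends followed by an intron
    acceptor <- exon_start[-1]   # exon starts preceded by an intron
    pos <- c(donor + 1, donor + 2, donor + 5, acceptor - 1, acceptor - 2)
  } else {
    donor <- exon_start[-1]      # on the - strand the transcript runs right to left
    acceptor <- exon_end[-n]
    pos <- c(donor - 1, donor - 2, donor - 5, acceptor + 1, acceptor + 2)
  }
  sort(unique(pos))
}

# Made-up three-exon CDS
starts <- c(100, 300, 500)
ends <- c(200, 400, 600)

plus_sites <- splice_positions(starts, ends, strand = 1)
minus_sites <- splice_positions(starts, ends, strand = -1)
```

Each intron contributes five positions, so a three-exon CDS yields ten essential splice sites on either strand; only the side of the intron carrying the +5 position changes with the strand.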


--------------------------------------------------------------------------------
/R/codondnds.R:
--------------------------------------------------------------------------------
  1 | #' codondnds
  2 | #'
  3 | #' Function to estimate codon-wise dN/dS values and p-values against neutrality. To generate a valid RefCDS input object for this function, use the buildcodon function. Note that recurrent artefacts or SNP contamination can violate the null model and dominate the list of codons under apparent selection. Be very critical of the results and if suspicious codons appear recurrently mutated consider refining the variant calling (e.g. using a better unmatched normal panel).
  4 | #'
  5 | #' @author Inigo Martincorena (Wellcome Sanger Institute)
  6 | #'
  7 | #' @param dndsout Output object from dndscv.
  8 | #' @param refcds RefCDS object annotated with codon-level information using the buildcodon function.
  9 | #' @param min_recurr Minimum number of mutations per codon to estimate codon-wise dN/dS ratios. [default=2]
 10 | #' @param gene_list List of genes to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of genes is decided a priori. [default=NULL, codondnds will be run on all genes in dndsout]
 11 | #' @param codon_list List of hotspot codons to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of codons is decided a priori. [default=NULL, codondnds will be run on all codons of the genes in dndsout]
 12 | #' @param theta_option 2 options: "mle" (uses the MLE of the negative binomial size parameter) or "conservative" (uses the lower bound of the CI95). Values other than "mle" will lead to the conservative option. [default="conservative"]
 13 | #' @param syn_drivers Vector with a list of known synonymous driver mutations to exclude from the background model [default="TP53:T125T"]. See Martincorena et al., Cell, 2017 (PMID:29056346).
 14 | #' @param method Overdispersion model: NB = Negative Binomial (Gamma-Poisson), LNP = Poisson-Lognormal (see Hess et al., BiorXiv, 2019). [default="NB"]
 15 | #' @param numbins Number of bins to discretise the rvec vector [default=1e4]. This enables fast execution of the LNP model in datasets of arbitrary size.
 16 | #'
 17 | #' @return 'codondnds' returns a table of recurrently mutated codons and the estimates of the size parameter:
 18 | #' @return - recurcodons: Table of recurrently mutated codons with codon-wise dN/dS values and p-values
 19 | #' @return - recurcodons_ext: The same table of recurrently mutated codons, but including additional information on the contribution of different changes within a codon.
 20 | #' @return - overdisp: Maximum likelihood estimate and CI95% for the overdispersion parameter (the size parameter of the negative binomial distribution or the sigma parameter of the lognormal distribution). The lower the size value or the higher the sigma value the higher the variation of the mutation rate across codons not captured by the trinucleotide change or by variation across genes.
 21 | #' @return - LL: Log-likelihood of the fit of the overdispersed model (see "method" argument) to all synonymous sites at a codon level.
 22 | #'
 23 | #' @export
 24 | 
 25 | codondnds = function(dndsout, refcds, min_recurr = 2, gene_list = NULL, codon_list = NULL, theta_option = "conservative", syn_drivers = "TP53:T125T", method = "NB", numbins = 1e4) {
 26 | 
 27 |     ## 1. Fitting an overdispersed distribution at the codon level considering the background mutation rate of the gene and of each trinucleotide
 28 |     message("[1] Codon-wise overdispersed model accounting for trinucleotides and relative gene mutability...")
 29 | 
 30 |     if (nrow(dndsout$mle_submodel)!=195) { stop("Invalid input: dndsout must be generated using the default trinucleotide substitution model in dndscv.") }
 31 |     if (is.null(refcds[[1]]$codon_impact)) { stop("Invalid input: the input RefCDS object must contain codon-level annotation. Use the buildcodon function to add this information.") }
 32 | 
 33 |     # Restricting refcds to genes in the dndsout object
 34 |     refcds = refcds[sapply(refcds, function(x) x$gene_name) %in% dndsout$genemuts$gene_name] # Only input genes
 35 | 
 36 |     # Restricting the analysis to an input list of genes
 37 |     if (!is.null(gene_list)) {
 38 |         g = as.vector(dndsout$genemuts$gene_name)
 39 |         # Correcting CDKN2A if required (hg19)
 40 |         if (any(g %in% c("CDKN2A.p14arf","CDKN2A.p16INK4a")) & any(gene_list=="CDKN2A")) {
 41 |             gene_list = unique(c(setdiff(gene_list,"CDKN2A"),"CDKN2A.p14arf","CDKN2A.p16INK4a"))
 42 |         }
 43 |         nonex = gene_list[!(gene_list %in% g)]
 44 |         if (length(nonex)>0) {
 45 |             warning(sprintf("The following input gene names are not in dndsout input object and will not be analysed: %s.", paste(nonex,collapse=", ")))
 46 |         }
 47 |         refaux = refcds[sapply(refcds, function(x) x$gene_name) %in% gene_list] # Only input genes
 48 |         numtests = sum(sapply(refaux, function(x) x$CDS_length))/3 # Number of codons in genes listed in gene_list
 49 |     } else {
 50 |         numtests = sum(sapply(refcds, function(x) x$CDS_length))/3 # Number of codons in all genes
 51 |     }
 52 | 
 53 |     # Relative mutation rate per gene
 54 |     # Note that this assumes that the gene order in genemuts has not been altered with respect to the N and L matrices, as it is currently the case in dndscv
 55 |     relmr = dndsout$genemuts$exp_syn_cv/dndsout$genemuts$exp_syn
 56 |     names(relmr) = dndsout$genemuts$gene_name
 57 | 
 58 |     # Substitution rates (192 trinucleotide rates, strand-specific)
 59 |     sm = setNames(dndsout$mle_submodel$mle, dndsout$mle_submodel$name)
 60 |     sm["TTT>TGT"] = 1 # Adding the TTT>TGT rate (which is arbitrarily set to 1 relative to t)
 61 |     sm = sm*sm["t"] # Absolute rates
 62 |     sm = sm[setdiff(names(sm),c("wmis","wnon","wspl","t"))] # Removing selection parameters
 63 |     sm = sm[order(names(sm))] # Sorting
 64 | 
 65 |     # Annotated mutations per gene
 66 |     annotsubs = dndsout$annotmuts[which(dndsout$annotmuts$impact=="Synonymous"),]
 67 |     if (nrow(annotsubs)<2) {
 68 |         stop("Too few synonymous mutations found in the input. codondnds cannot run without synonymous mutations.")
 69 |     }
 70 |     annotsubs = annotsubs[!(paste(annotsubs$gene,annotsubs$aachange,sep=":") %in% syn_drivers),]
 71 |     annotsubs$codon = as.numeric(substr(annotsubs$aachange,2,nchar(annotsubs$aachange)-1)) # Numeric codon position
 72 |     annotsubs = split(annotsubs, f=annotsubs$gene)
 73 | 
 74 |     # Calculating observed and expected mutation rates per codon for every gene
 75 |     numcodons = sum(sapply(refcds, function(x) x$CDS_length))/3 # Number of codons in all genes
 76 |     nvec = rvec = array(NA, numcodons)
 77 |     pos = 1
 78 | 
 79 |     for (j in 1:length(refcds)) {
 80 | 
 81 |         nvec_syn = rvec_syn = rvec_ns = array(0,refcds[[j]]$CDS_length/3) # Initialising the obs and exp vectors
 82 |         gene = refcds[[j]]$gene_name
 83 |         sm_rel = sm * relmr[gene]
 84 | 
 85 |         # Expected rates
 86 |         ind = rep(1:(refcds[[j]]$CDS_length/3), each=9)
 87 |         syn = which(refcds[[j]]$codon_impact==1) # Synonymous changes
 88 |         ns = which(refcds[[j]]$codon_impact %in% c(2,3)) # Missense and nonsense changes
 89 | 
 90 |         aux = sapply(split(refcds[[j]]$codon_rates[syn], f=ind[syn]), function(x) sum(sm_rel[x]))
 91 |         rvec_syn[as.numeric(names(aux))] = aux
 92 | 
 93 |         aux = sapply(split(refcds[[j]]$codon_rates[ns], f=ind[ns]), function(x) sum(sm_rel[x]))
 94 |         rvec_ns[as.numeric(names(aux))] = aux
 95 | 
 96 |         # Observed mutations
 97 |         subs = annotsubs[[gene]]
 98 |         if (!is.null(subs)) {
 99 |             obs_syn = table(subs$codon)
100 |             nvec_syn[as.numeric(names(obs_syn))] = obs_syn
101 |         }
102 | 
103 |         rvec[pos:(pos+refcds[[j]]$CDS_length/3-1)] = rvec_syn
104 |         nvec[pos:(pos+refcds[[j]]$CDS_length/3-1)] = nvec_syn
105 |         pos = pos + refcds[[j]]$CDS_length/3
106 | 
107 |         refcds[[j]]$codon_rvec_ns = rvec_ns
108 | 
109 |         if (round(j/2000)==(j/2000)) { message(sprintf('    %0.3g%% ...', round(j/length(refcds),2)*100)) }
110 |     }
111 | 
112 |     rvec = rvec * sum(nvec) / sum(rvec) # Small correction ensuring that global observed and expected rates are identical
113 | 
114 | 
115 |     message("[2] Estimating overdispersion and calculating codon-wise dN/dS ratios...")
116 | 
117 |     # Estimation of overdispersion: Using optimize appears to yield reliable results. Problems experienced with fitdistr, glm.nb and theta.ml. Consider using grid search if problems appear with optimize.
118 |     if (method=="LNP") { # Modelling rates per codon with a Poisson-Lognormal mixture
119 | 
120 |         lnp_est = fitlnpbin(nvec, rvec, theta_option = theta_option, numbins = numbins)
121 |         theta_ml = lnp_est$ml$minimum
122 |         theta_ci95 = lnp_est$sig_ci95
123 |         LL = -lnp_est$ml$objective # LogLik
124 |         thetaout = setNames(c(theta_ml, theta_ci95), c("MLE","CI95_high"))
125 | 
126 |     } else { # Modelling rates per codon as negative binomially distributed (i.e. quantifying uncertainty above Poisson using a Gamma)
127 | 
128 |         nbin = function(theta, n=nvec, r=rvec) { -sum(dnbinom(x=n, mu=r, log=T, size=theta)) } # nbin loglik function for optimisation
129 |         ml = optimize(nbin, interval=c(0,1000))
130 |         theta_ml = ml$minimum
131 |         LL = -ml$objective # LogLik
132 | 
133 |         # CI95% for theta using profile likelihood and iterative grid search (this yields slightly conservative CI95)
134 |         grid_proflik = function(bins=5, iter=5) {
135 |             for (j in 1:iter) {
136 |                 if (j==1) {
137 |                     thetavec = sort(c(0, 10^seq(-3,3,length.out=bins), theta_ml, theta_ml*10, 1e4)) # Initial vals
138 |                 } else {
139 |                     thetavec = sort(c(seq(thetavec[ind[1]], thetavec[ind[1]+1], length.out=bins), seq(thetavec[ind[2]-1], thetavec[ind[2]], length.out=bins))) # Refining previous iteration
140 |                 }
141 | 
142 |                 proflik = sapply(thetavec, function(theta) -sum(dnbinom(x=nvec, mu=rvec, size=theta, log=T))-ml$objective) < qchisq(.95,1)/2 # Values of theta within CI95%
143 |                 ind = c(which(proflik[1:(length(proflik)-1)]==F & proflik[2:length(proflik)]==T)[1],
144 |                         which(proflik[1:(length(proflik)-1)]==T & proflik[2:length(proflik)]==F)[1]+1)
145 |                 if (is.na(ind[1])) { ind[1] = 1 }
146 |                 if (is.na(ind[2])) { ind[2] = length(thetavec) }
147 |             }
148 |             return(thetavec[ind])
149 |         }
150 |         theta_ci95 = grid_proflik(bins=5, iter=5)
151 |         thetaout = setNames(c(theta_ml, theta_ci95), c("MLE","CI95low","CI95_high"))
152 |     }
153 | 
154 | 
155 |     ## 2. Calculating codon-wise dN/dS ratios and P-values for recurrently mutated codons
156 | 
157 |     # Theta option
158 |     if (theta_option=="mle" | theta_option=="MLE") {
159 |         theta = theta_ml
160 |     } else { # Conservative
161 |         message("    Using the conservative bound of the confidence interval of the overdispersion parameter.")
162 |         theta = theta_ci95[1]
163 |     }
164 | 
165 |     # Creating the recurcodons object
166 |     annotsubs = dndsout$annotmuts[which(dndsout$annotmuts$impact %in% c("Missense","Nonsense")),]
167 |     annotsubs$codon = substr(annotsubs$aachange,1,nchar(annotsubs$aachange)-1) # Codon position
168 |     annotsubs$codonsub = paste(annotsubs$chr,annotsubs$gene,annotsubs$codon,sep=":")
169 |     annotsubs = annotsubs[which(annotsubs$ref!=annotsubs$mut),]
170 |     freqs = sort(table(annotsubs$codonsub), decreasing=T)
171 | 
172 |     recurcodons = read.table(text=names(freqs), header=0, sep=":", stringsAsFactors=F) # Frequency table of mutations
173 |     colnames(recurcodons) = c("chr","gene","codon")
174 |     recurcodons$freq = freqs
175 | 
176 |     # Gene RHT
177 |     if (!is.null(gene_list)) {
178 |         message("    Performing Restricted Hypothesis Testing on the input list of a-priori genes")
179 |         recurcodons = recurcodons[which(recurcodons$gene %in% gene_list), ] # Restricting the p-value and q-value calculations to gene_list
180 |     }
181 | 
182 |     # Codon RHT
183 |     if (!is.null(codon_list)) {
184 |         message("    Performing Restricted Hypothesis Testing on the input list of a-priori codons (numtests = length(codon_list))")
185 |         mutstr = paste(recurcodons$gene,recurcodons$codon,sep=":")
186 |         if (!any(mutstr %in% codon_list)) {
187 |             stop("No mutation was observed in the restricted list of known hotspots. Codon-RHT cannot be run.")
188 |         }
189 |         recurcodons = recurcodons[which(mutstr %in% codon_list), ] # Restricting the p-value and q-value calculations to codon_list
190 |         numtests = length(codon_list)
191 | 
192 |         # Calculating global dN/dS ratios at known hotcodons
193 |         auxcodons = as.data.frame(do.call("rbind",strsplit(codon_list,split=":")), stringsAsFactors=F)
194 |         auxcodons$V3 = as.numeric(substr(auxcodons$V2,2,nchar(auxcodons$V2)))
195 |         auxcodons = auxcodons[auxcodons$V1 %in% names(relmr), ]
196 |         colnames(auxcodons) = c("gene","codon","numcodon")
197 |         auxcodons$mu = NA
198 |         geneind = setNames(1:length(refcds), sapply(refcds, function(x) x$gene_name))
199 |         for (j in 1:nrow(auxcodons)) {
200 |             auxcodons$mu[j] = refcds[[geneind[auxcodons$gene[j]]]]$codon_rvec_ns[auxcodons$numcodon[j]] # Background non-synonymous rate for this codon
201 |         }
202 |         neutralexp = sum(auxcodons$mu) # Number of mutations expected at known hotspots under neutrality
203 |         numobs = sum(recurcodons$freq) # Number observed
204 |         poistest = poisson.test(numobs, T=neutralexp)
205 |         globaldnds_knowncodons = setNames(c(numobs, neutralexp, poistest$estimate, poistest$conf.int), c("obs","exp","dnds","cilow","cihigh"))
206 |         message(sprintf("    Mutations at known hotspots: %0.0f observed, %0.3g expected, obs/exp~%0.3g (CI95:%0.3g,%0.3g).", globaldnds_knowncodons[1], globaldnds_knowncodons[2], globaldnds_knowncodons[3], globaldnds_knowncodons[4], globaldnds_knowncodons[5]))
207 |     }
208 | 
209 |     # Restricting the recurcodons output by min_recurr
210 |     recurcodons = recurcodons[recurcodons$freq>=min_recurr, ] # Restricting the output to codons with min_recurr
211 | 
212 |     if (nrow(recurcodons)>0) {
213 | 
214 |         recurcodons$mu = NA
215 |         codonnumeric = as.numeric(substr(recurcodons$codon,2,nchar(recurcodons$codon))) # Numeric codon position
216 |         geneind = setNames(1:length(refcds), sapply(refcds, function(x) x$gene_name))
217 | 
218 |         for (j in 1:nrow(recurcodons)) {
219 |             recurcodons$mu[j] = refcds[[geneind[recurcodons$gene[j]]]]$codon_rvec_ns[codonnumeric[j]] # Background non-synonymous rate for this codon
220 |         }
221 | 
222 |         recurcodons$dnds = recurcodons$freq / recurcodons$mu # Codon-wise dN/dS (point estimate)
223 | 
224 |         if (method=="LNP") { # Modelling rates per codon with a Poisson-Lognormal mixture
225 | 
226 |             # Cumulative Lognormal-Poisson using poilog::dpoilog
227 |             dpoilog = poilog::dpoilog
228 |             ppoilog = function(n, mu, sig) {
229 |                 p = sum(dpoilog(n=floor(n+1):floor(n*10+1000), mu=log(mu)-sig^2/2, sig=sig))
230 |                 return(p)
231 |             }
232 | 
233 |             message(sprintf("    Modelling substitution rates using a Lognormal-Poisson: sig = %0.3g (upperbound = %0.3g)", theta_ml, theta_ci95))
234 |             recurcodons$pval = apply(recurcodons, 1, function(x) ppoilog(n=as.numeric(x["freq"])-0.5, mu=as.numeric(x["mu"]), sig=theta))
235 | 
236 |         } else { # Negative binomial model
237 | 
238 |             message(sprintf("    Modelling substitution rates using a Negative Binomial: theta = %0.3g (CI95:%0.3g,%0.3g)", theta_ml, theta_ci95[1], theta_ci95[2]))
239 |             recurcodons$pval = pnbinom(q=recurcodons$freq-0.5, mu=recurcodons$mu, size=theta, lower.tail=F)
240 |         }
241 | 
242 |         recurcodons = recurcodons[order(recurcodons$pval, -recurcodons$freq), ] # Sorting by p-val and frequency
243 |         recurcodons$qval = p.adjust(recurcodons$pval, method="BH", n=numtests) # P-value adjustment for all possible changes
244 |         rownames(recurcodons) = NULL
245 | 
246 |         # Additional annotation
247 |         annotsubs$mutaa = substr(annotsubs$aachange,nchar(annotsubs$aachange),nchar(annotsubs$aachange))
248 |         annotsubs$simplent = paste(annotsubs$ref,annotsubs$mut,sep=">")
249 |         annotsubs$mutnt = paste(annotsubs$chr,annotsubs$pos,annotsubs$simplent,annotsubs$mutaa,sep="_")
250 |         aux = split(annotsubs, f=annotsubs$codonsub)
251 |         recurcodons_ext = recurcodons
252 |         recurcodons_ext$codonsub = paste(recurcodons_ext$chr,recurcodons_ext$gene,recurcodons_ext$codon,sep=":")
253 |         recurcodons_ext$mutnt = recurcodons_ext$mutaa = NA
254 |         for (j in 1:nrow(recurcodons_ext)) {
255 |             x = aux[[recurcodons_ext$codonsub[j]]]
256 |             f = sort(table(x$mutaa),decreasing=T)
257 |             recurcodons_ext$mutaa[j] = paste(names(f),f,sep=":",collapse="|")
258 |             f = sort(table(x$mutnt),decreasing=T)
259 |             recurcodons_ext$mutnt[j] = paste(names(f),f,sep=":",collapse="|")
260 |         }
261 | 
262 |     } else {
263 |         recurcodons = recurcodons_ext = NULL
264 |         warning("No codon was found with the minimum recurrence requested [default min_recurr=2]")
265 |     }
266 | 
267 |     if (is.null(codon_list)) {
268 |         return(list(recurcodons=recurcodons, recurcodons_ext=recurcodons_ext, overdisp=thetaout, LL=LL))
269 |     } else {
270 |         return(list(recurcodons=recurcodons, recurcodons_ext=recurcodons_ext, overdisp=thetaout, LL=LL, globaldnds_knowncodons=globaldnds_knowncodons))
271 |     }
272 | }
273 | 
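
The overdispersion step in codondnds (negative binomial fit with `optimize`, followed by a profile-likelihood interval) can be illustrated on simulated data. This is only a minimal sketch, not the package's implementation: the rates, the true size parameter and the single log-spaced grid are assumptions made for illustration, whereas codondnds refines its grid iteratively and works on real per-codon expected rates.

```r
# Sketch: ML estimate and profile-likelihood CI95 of the negative binomial
# size parameter (theta), with fixed expected rates per codon.
# All inputs below are simulated (hypothetical), not package data.
set.seed(1)
rvec <- runif(2000, min = 0.1, max = 3)              # hypothetical expected counts per codon
nvec <- rnbinom(length(rvec), mu = rvec, size = 2)   # simulated observed counts (true theta = 2)

# Negative log-likelihood as a function of theta
nbin <- function(theta, n = nvec, r = rvec) {
    -sum(dnbinom(x = n, mu = r, size = theta, log = TRUE))
}

# One-dimensional optimisation over the same interval used in codondnds
ml <- optimize(nbin, interval = c(0, 1000))
theta_ml <- ml$minimum

# Profile likelihood on a single log-spaced grid (an assumption of this sketch):
# theta values whose negative log-likelihood lies within qchisq(0.95, 1)/2 of the minimum
thetavec <- 10^seq(-2, 3, length.out = 400)
inside <- sapply(thetavec, nbin) - ml$objective < qchisq(0.95, 1) / 2
theta_ci95 <- range(thetavec[inside])
```

A lower theta means more variation in mutation rates across codons, which is why the "conservative" option in codondnds uses the lower bound of this interval when computing p-values.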


--------------------------------------------------------------------------------
/R/dndscv-package.R:
--------------------------------------------------------------------------------
 1 | # Package definitions for dndscv
 2 | #
 3 | # Author: Inigo Martincorena
 4 | ###############################################################################
 5 | 
 6 | #' Detection of selection in cancer and somatic evolution
 7 | #'
 8 | #' The dNdScv R package is a suite of maximum-likelihood dN/dS methods designed
 9 | #' to quantify selection in cancer and somatic evolution (Martincorena et al., 2017).
10 | #' The package contains functions to quantify dN/dS ratios for missense, nonsense
11 | #' and essential splice mutations, at the level of individual genes, groups of
12 | #' genes or at whole-genome level. The dNdScv method was designed to detect cancer
13 | #' driver genes (i.e. genes under positive selection in cancer) on datasets ranging
14 | #' from a few samples to thousands of samples, in whole-exome/genome or targeted
15 | #' sequencing studies.
16 | #' @name dndscv-package
17 | #' @docType package
18 | #' @title Detection of selection in cancer and somatic evolution
19 | #' @author Inigo Martincorena, Wellcome Trust Sanger Institute, \email{im3@@sanger.ac.uk}
20 | #' @references Martincorena I, et al. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. Cell.
21 | #' @keywords package
22 | #' @seealso \code{\link{dndscv}}
23 | #' @seealso \code{\link{buildref}}
24 | #' @import seqinr
25 | #' @import MASS
26 | #' @import GenomicRanges
27 | #' @import IRanges
28 | #' @import Rsamtools
29 | #' @import Biostrings
30 | NA
31 | 


--------------------------------------------------------------------------------
/R/fitlnpbin.R:
--------------------------------------------------------------------------------
 1 | #' fitlnpbin
 2 | #'
 3 | #' Function to fit a Lognormal-Poisson model to estimate overdispersion on synonymous changes for sitednds and codondnds.
 4 | #'
 5 | #' @author Inigo Martincorena (Wellcome Sanger Institute)
 6 | #' 
 7 | #' @param nvec Vector of observed counts of mutations per site.
 8 | #' @param rvec Vector of expected counts of mutations per site.
 9 | #' @param level Confidence level desired for the confidence interval of the overdispersion parameter [default=0.95]
10 | #' @param theta_option 2 options: "mle" (uses the MLE of the overdispersion parameter) or "conservative" (uses the conservative bound of the CI95). Values other than "mle" will lead to the conservative option [default="conservative"]
11 | #' @param numbins Number of bins to discretise the rvec vector [default=1e4]. This enables fast execution of the LNP model in datasets of arbitrary size.
12 | #'
13 | #' @return 'fitlnpbin' returns a list with the maximum likelihood fit (ml), the log-likelihood values evaluated during the search (ll) and the conservative bound of the confidence interval (sig_ci95) of the "sig" overdispersion parameter of the LNP model.
14 | #' 
15 | #' @export
16 | 
17 | fitlnpbin = function(nvec, rvec, level = 0.95, theta_option = "conservative", numbins = 1e4) {
18 |     
19 |     # 1. Binning the r vector
20 |     minrate = 1e-8 # Values <<1/exome_length are due to 0 observed counts for a given trinucleotide
21 |     rvec = pmax(minrate, rvec) # Setting values below minrate to minrate
22 |     br = cut(log(rvec),breaks=numbins) # Binning rvec in log space
23 |     binmeans = tapply(rvec, br, mean) # Mean value per bin
24 |     rvecbinned = as.numeric(binmeans[br]) # Binned values for rvec
25 |     message(sprintf("    Binning the rate vector: maximum deviation of %0.3g", max(abs(rvecbinned-rvec)/rvec)))
26 |     rvec = rvecbinned # Using the binned values
27 |     freqs = as.matrix(plyr::count(cbind(rvec,nvec))) # Frequency table
28 |     
29 |     # Excluding sites with rates < minrate from the calculation (they should yield LL=0)
30 |     freqs = freqs[which(freqs[,1] > minrate), ]
31 |     
32 |     # 2. Vectorising dpoilog
33 |     dpoilog = poilog::dpoilog
34 |     lnp = function(freqs, sig) {
35 |         -sum(apply(freqs, 1, function(x) x[3]*log(dpoilog(n=x[2], mu=log(x[1])-sig^2/2, sig=sig)))) # vectorised dpoilog with fixed expected rates and log-transformed
36 |     }
37 |     
38 |     # 3. Estimating the MLE: grid search followed by optim within reasonable bounds
39 |     lnp_mle = function(minsig=1e-2, maxsig=5, bins=5, iter=8) {
40 |         
41 |         lls = sigs = NULL # Saving the log-likelihoods calculated
42 |         
43 |         # 1. Grid search to identify reasonable bounds
44 |         for (j in 1:iter) {
45 |             if (j==1) {
46 |                 sigvec = sort(exp(seq(log(minsig), log(maxsig), length.out=bins))) # Initial vals
47 |             } else {
48 |                 sigvec = seq(sigvec[pmax(1,ind-1)], sigvec[pmin(bins,ind+1)], length.out=bins) # Refining previous iteration
49 |             }
50 |             proflik = sapply(sigvec, function(sig) lnp(freqs=freqs, sig=sig))
51 |             ind = which.min(proflik)
52 |             
53 |             sigs = c(sigs, sigvec) # Saving the result
54 |             lls = c(lls, proflik) # Saving the result
55 |         }
56 |         
57 |         # 2. Optim for precise estimation of the MLE (optim without narrow bounds tends to fail)
58 |         f = function(sig, n=nvec, r=rvec) { lnp(freqs=freqs, sig=sig) }
59 |         ml = optimize(f, interval=c(sigvec[pmax(1,ind-1)], sigvec[pmin(bins,ind+1)]))
60 |         
61 |         sigs = c(sigs, ml$minimum) # Saving the result
62 |         lls = c(lls, ml$objective) # Saving the result
63 |         ll = cbind(sigs,lls)
64 |         
65 |         return(list(ml=ml, ll=unique(ll[order(ll[,1]),])))
66 |     }
67 |     
68 |     ml = lnp_mle(minsig=1e-2, maxsig=5, bins=5, iter=8) # Maximum likelihood estimate of the overdispersion
69 |     
70 |     
71 |     # 4. Estimating the upper bound of the CI95% using profile likelihood
72 |     #    This is done exploiting the points already evaluated for the MLE
73 |     
74 |     if (theta_option == "mle") {
75 |         ml$sig_ci95 = NA # We only estimate the upper bound of sig if requested by the user
76 |     } else { 
77 |         grid_proflik = function(minsig=1e-2, maxsig=5, bins=5, iter=8) {
78 |             for (j in 1:iter) {
79 |                 if (j==1) {
80 |                     sigvec = ml$ll[,1]
81 |                     ind = min(which(ml$ll[,1]>ml$ml$minimum & (ml$ll[,2]-ml$ml$objective)>qchisq(level,1)/2)) # First value above the MLE that falls outside of the confidence interval
82 |                 }
83 |                 sigvec = seq(sigvec[pmax(1,ind-1)], sigvec[pmin(length(sigvec),ind)], length.out=bins) # New grid based on the previous iteration
84 |                 proflik = sapply(sigvec, function(sig) lnp(freqs=freqs, sig=sig)) # Calculating negative log-likelihoods
85 |                 ind = min(which(sigvec>ml$ml$minimum & (proflik-ml$ml$objective)>qchisq(level,1)/2)) # First value above the MLE that falls outside of the confidence interval
86 |             }
87 |             return(sigvec[ind]) # Conservative estimate for the upper bound of the confidence interval for sig
88 |         }
89 |         ml$sig_ci95 = grid_proflik(minsig=1e-2, maxsig=5, bins=5, iter=8)
90 |     }
91 |     
92 |     return(ml)
93 | }
94 | 
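
The binning step at the start of fitlnpbin can be tried in isolation with base R. The sketch below is illustrative only: the simulated rate vector and the small number of bins are assumptions, and the Poisson-Lognormal likelihood itself (which relies on `poilog::dpoilog`) is left out.

```r
# Sketch: discretising a vector of expected rates in log space, as done in
# step 1 of fitlnpbin. The rates are simulated (hypothetical).
set.seed(1)
rvec <- exp(rnorm(1e5, mean = -8, sd = 1))   # hypothetical expected rates per site
numbins <- 100                               # far fewer bins than the 1e4 default, for illustration

br <- cut(log(rvec), breaks = numbins)       # equal-width bins in log space
binmeans <- tapply(rvec, br, mean)           # mean rate per bin (NA for empty bins)
rvecbinned <- as.numeric(binmeans[br])       # factor indexing uses the integer bin codes

# Largest relative change introduced by the binning
maxdev <- max(abs(rvecbinned - rvec) / rvec)
```

Because every site is replaced by the mean of its own bin, the total expected count is unchanged, while the number of distinct rate values drops to at most `numbins`. This is what allows the likelihood to be evaluated on a frequency table rather than on every site.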


--------------------------------------------------------------------------------
/R/geneci.R:
--------------------------------------------------------------------------------
  1 | #' geneci
  2 | #'
  3 | #' Function to calculate confidence intervals for dN/dS values per gene under the dNdScv model using profile likelihood. To generate a valid dndsout input object for this function, use outmats=T when running dndscv.
  4 | #'
  5 | #' @author Inigo Martincorena (Wellcome Sanger Institute)
  6 | #' @details Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041.
  7 | #' 
  8 | #' @param dndsout Output object from dndscv.
  9 | #' @param gene_list List of genes to restrict the analysis (by default, all genes in dndsout will be analysed)
 10 | #' @param level Confidence level desired [default = 0.95]
 11 | #'
 12 | #' @return ci: Dataframe with the confidence intervals for dN/dS ratios per gene under the dNdScv model.
 13 | #' 
 14 | #' @export
 15 | 
 16 | geneci = function(dndsout, gene_list = NULL, level = 0.95) {
 17 | 
 18 |     # Ensuring valid level value
 19 |     if (level <= 0 | level >= 1) {
 20 |         warning("Confidence level must be higher than 0 and lower than 1, using 0.95 as default")
 21 |         level = 0.95
 22 |     }
 23 |     
 24 |     # N and L matrices
 25 |     N = dndsout$N
 26 |     L = dndsout$L
 27 |     if (length(N)==0) { stop(sprintf("Invalid input: the dndsout input object must be generated using outmats=T as an argument to dndscv.")) }
 28 |     if (nrow(dndsout$mle_submodel)!=195) { stop(sprintf("Invalid input: dndsout must be generated using the default trinucleotide substitution model in dndscv."))}
 29 |     
 30 |     # Restricting the analysis to an input list of genes
 31 |     if (!is.null(gene_list)) {
 32 |         g = as.vector(dndsout$genemuts$gene_name) # Genes in the input object
 33 |         nonex = gene_list[!(gene_list %in% g)] # Excluding genes from the input gene_list if they are not present in the input dndsout object
 34 |         if (length(nonex)>0) {
 35 |             warning(sprintf("The following input gene names are not in dndsout input object and will not be analysed: %s.", paste(nonex,collapse=", ")))
 36 |         }
 37 |         dndsout$annotmuts = dndsout$annotmuts[which(dndsout$annotmuts$gene %in% gene_list), ] # Restricting to genes of interest
 38 |         dndsout$genemuts = dndsout$genemuts[which(g %in% gene_list), ] # Restricting to genes of interest
 39 |         N = N[,,which(g %in% gene_list)] # Restricting to genes of interest
 40 |         L = L[,,which(g %in% gene_list)] # Restricting to genes of interest
 41 |     }
 42 |     gene_list = as.vector(dndsout$genemuts$gene_name)
 43 |     
 44 |     wnoneqspl = all(dndsout$sel_cv$wnon_cv==dndsout$sel_cv$wspl_cv) # Deciding on wnon==wspl based on the input object
 45 |     
 46 |     ## Subfunction: Analytical opt_t (aka tML) given fixed w values
 47 |     mle_tcvgivenw = function(n, theta, exp_neutral_cv, E) {
 48 |         shape = theta; scale = exp_neutral_cv/theta
 49 |         tml = (n+shape-1)/(1+E+(1/scale))
 50 |         if (shape<=1) { # i.e. when theta<=1
 51 |             tml = max(shape*scale,tml) # i.e. tml is bounded to the mean of the gamma (i.e. y[9]) when theta<=1
 52 |         }
 53 |         return(pmax(tml,1e-6))
 54 |     }
 55 |     
 56 |     ## Subfunction: Log-Likelihood of the model given fixed w values (requires finding MLEs for t and the free w values given the fixed w values)
 57 |     loglik_givenw = function(w,x,y,mutrates,theta,wtype,wnoneqspl) {
 58 |     
 59 |         # 1. tML given w
 60 |         exp_neutral_cv = y[9]
 61 |         exp_rel = y[6:8]/y[5]
 62 |         n = y[1] + sum(y[wtype+1])
 63 |         E = sum(exp_rel[wtype])*w
 64 |         tML = mle_tcvgivenw(n, theta, exp_neutral_cv, E)
 65 |         mrfold = max(1e-10, tML/y[5]) # Correction factor of "t" under the model
 66 |     
 67 |         # 2. Calculating the MLEs of the unconstrained w values
 68 |         if (!wnoneqspl) {
 69 |           wfree = y[2:4]/y[6:8]/mrfold; wfree[y[2:4]==0] = 0 # MLEs for w given tval
 70 |         } else {
 71 |           wmisfree = y[2]/y[6]/mrfold; wmisfree[y[2]==0] = 0
 72 |           wtruncfree = sum(y[3:4])/sum(y[7:8])/mrfold; wtruncfree[sum(y[3:4])==0] = 0
 73 |           wfree = c(wmisfree,wtruncfree,wtruncfree) # MLEs for w given tval
 74 |         }
 75 |         wfree[wtype] = w # Replacing free w values by fixed input values
 76 |     
 77 |         # 3. loglik of the model under tML and w
 78 |         llpois = sum(dpois(x=x$n, lambda=x$l*mutrates*mrfold*t(array(c(1,wfree),dim=c(4,length(mutrates)))), log=T))
 79 |         llgamm = dgamma(x=tML, shape=theta, scale=exp_neutral_cv/theta, log=T)
 80 |         return(-(llpois+llgamm))
 81 |     }
 82 |     
 83 |     ## Subfunction: Working with vector inputs
 84 |     loglik_vec = function(wfixed,x,y,mutrates,theta,wtype,wnoneqspl) {
 85 |         sapply(wfixed, function(w) loglik_givenw(w,x,y,mutrates,theta,wtype,wnoneqspl))
 86 |     }
 87 |     
 88 |     
 89 |     ## Subfunction: iterative search for the CI95% boundaries for wvec
 90 |     iterative_search_ci95 = function(wtype,x,y,mutrates,theta,wmle,ml,grid_size=10,iter=10,wnoneqspl=T,wmax = 10000) {
 91 |     
 92 |       if (wmle[wtype][1]<wmax) {
 93 |     
 94 |         # Iteratively searching for the lower bound of the confidence interval for w
 95 |         if (wmle[wtype][1]>0) {
 96 |           search_range = c(1e-9, wmle[wtype][1])
 97 |           for (it in 1:iter) {
 98 |             wvec = seq(search_range[1], search_range[2],length.out=grid_size)
 99 |             ll = -loglik_vec(wvec,x,y,mutrates,theta,wtype,wnoneqspl)
100 |             lr = 2*(ml-ll) > qchisq(p=level,df=1)
101 |             ind = max(which(wvec<=wmle[wtype][1] & lr))
102 |             search_range = c(wvec[ind], wvec[ind+1])
103 |           }
104 |           w_low = wvec[ind]
105 |         } else {
106 |           w_low = 0
107 |         }
108 |         
109 |         # Iteratively searching for the upper bound of the confidence interval for w
110 |         search_range = c(wmle[wtype][1], wmax)
111 |         llhighbound = -loglik_vec(wmax,x,y,mutrates,theta,wtype,wnoneqspl)
112 |         outofboundaries = !(2*(ml-llhighbound) > qchisq(p=level,df=1))
113 |         if (!outofboundaries) {
114 |           for (it in 1:iter) {
115 |             wvec = seq(search_range[1], search_range[2],length.out=grid_size)
116 |             ll = -loglik_vec(wvec,x,y,mutrates,theta,wtype,wnoneqspl)
117 |             lr = 2*(ml-ll) > qchisq(p=level,df=1)
118 |             ind = min(which(wvec>=wmle[wtype][1] & lr))
119 |             search_range = c(wvec[ind-1], wvec[ind])
120 |           }
121 |           w_high = wvec[ind]
122 |         } else {
123 |           w_high = wmax
124 |         }
125 |     
126 |       } else {
127 |         wmle[wtype] = w_low = w_high = wmax
128 |       }
129 |     
130 |       return(c(wmle[wtype][1],w_low,w_high))
131 |     }
132 |     
133 |     
134 |     ## Subfunction: calculate the MLEs and CI95% of each independent w value (unconstraining the other values)
135 |     ci95cv_intt = function(x,y,mutrates,theta,grid_size=10,iter=10,wnoneqspl=T) {
136 |     
137 |       # MLE
138 |       exp_neutral_cv = y[9]
139 |       n = y[1]; E = 0 # Only synonymous mutations are considered
140 |       tML = mle_tcvgivenw(n, theta, exp_neutral_cv, E)
141 |       mrfold = max(1e-10, tML/y[5])
142 |       if (!wnoneqspl) {
143 |         wmle = y[2:4]/y[6:8]/mrfold; wmle[y[2:4]==0] = 0 # MLEs for w given tval
144 |       } else {
145 |         wmisfree = y[2]/y[6]/mrfold; wmisfree[y[2]==0] = 0
146 |         wtruncfree = sum(y[3:4])/sum(y[7:8])/mrfold; wtruncfree[sum(y[3:4])==0] = 0
147 |         wmle = c(wmisfree,wtruncfree,wtruncfree) # MLEs for w given tval
148 |       }
149 |       llpois = sum(dpois(x=x$n, lambda=x$l*mutrates*mrfold*t(array(c(1,wmle),dim=c(4,length(mutrates)))), log=T))
150 |       llgamm = dgamma(x=tML, shape=theta, scale=y[9]/theta, log=T)
151 |       ml = llpois+llgamm
152 |     
153 |       # Iteratively searching for the bounds of the confidence interval of each w value
154 |       w_ci95 = array(NA,c(3,3))
155 |       colnames(w_ci95) = c("mle","low","high")
156 |       rownames(w_ci95) = c("mis","non","spl")
157 |       if (!wnoneqspl) {
158 |         for (h in 1:3) {
159 |           w_ci95[h,] = iterative_search_ci95(wtype=h,x,y,mutrates,theta,wmle,ml,grid_size,iter,wnoneqspl)
160 |         }
161 |       } else {
162 |         w_ci95[1,] = iterative_search_ci95(wtype=1,x,y,mutrates,theta,wmle,ml,grid_size,iter,wnoneqspl)
163 |         w_ci95[2,] = iterative_search_ci95(wtype=c(2,3),x,y,mutrates,theta,wmle,ml,grid_size,iter,wnoneqspl)
164 |         w_ci95[3,] = w_ci95[2,]
165 |       }
166 |       return(w_ci95)
167 |     }
168 |     
169 |     
170 |     ## Calculating CI95% across all genes
171 |     
172 |     message("Calculating CI95 across all genes...")
173 |     
174 |     ci95 = array(NA, dim=c(length(gene_list),9))
175 |     colnames(ci95) = c("mis_mle","non_mle","spl_mle","mis_low","non_low","spl_low","mis_high","non_high","spl_high")
176 |     theta = dndsout$nbreg$theta
177 |     
178 |     data("submod_192r_3w", package="dndscv")
179 |     parmle =  setNames(dndsout$mle_submodel[,2], dndsout$mle_submodel[,1])
180 |     mutrates = sapply(substmodel[,1], function(x) prod(parmle[base::strsplit(x,split="\\*")[[1]]])) # Expected rate per available site
181 |     
182 |     for (j in 1:length(gene_list)) {
183 |         geneind = which(dndsout$genemuts$gene_name==gene_list[j])
184 |         y = as.numeric(dndsout$genemuts[geneind,-1])
185 |         if (length(gene_list)==1) {
186 |             x = list(n=N, l=L)
187 |         } else {
188 |             x = list(n=N[,,geneind], l=L[,,geneind])
189 |         }
190 |         ci95[j,] = c(ci95cv_intt(x,y,mutrates,theta,grid_size=10,iter=10,wnoneqspl=wnoneqspl))
191 |         if (round(j/1000)==(j/1000)) { print(j/length(gene_list), digits=2) } # Progress
192 |     }
193 |     
194 |     ci95df = cbind(gene=gene_list, as.data.frame(ci95))
195 |     
196 |     # Restricting columns if we forced wnon==wspl
197 |     if (wnoneqspl==T) {
198 |         ci95df = ci95df[,-c(4,7,10)]
199 |         colnames(ci95df) = c("gene","mis_mle","tru_mle","mis_low","tru_low","mis_high","tru_high")
200 |     }
201 | 
202 |     return(ci95df)
203 |     
204 | } # EOF
205 | 
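
The `mutrates` line in `geneci` turns each rate expression of the substitution model (a product of parameter names joined by `*`) into a number by multiplying the fitted values of those parameters. The following is a minimal sketch of that step only. The names `substmodel_toy`, `parmle_toy` and `mutrates_toy`, and all values, are hypothetical and are not part of the package.

```r
# Toy rate expressions: each one is a product of named parameters (hypothetical names)
substmodel_toy <- c("t", "t*r1", "t*r2")

# Toy maximum-likelihood estimates for each parameter (made-up values)
parmle_toy <- c(t = 2e-6, r1 = 3, r2 = 0.5)

# Split every expression on "*" and multiply the matching parameter values,
# mirroring the sapply/strsplit/prod pattern used in geneci
mutrates_toy <- sapply(
  substmodel_toy,
  function(x) prod(parmle_toy[strsplit(x, split = "\\*")[[1]]])
)

print(mutrates_toy)
```

Each element of `mutrates_toy` is named after its rate expression, so the vector can be matched back to the rows of the substitution model.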


--------------------------------------------------------------------------------
/R/genesetdnds.R:
--------------------------------------------------------------------------------
  1 | #' genesetdnds
  2 | #'
  3 | #' Function to estimate global dN/dS values for a gene set when using whole-exome
  4 | #' data. Global dN/dS values for a set of genes can also be obtained using dndscv
  5 | #' (gene_list argument), but that option estimates the trinucleotide mutation rates
  6 | #' exclusively from the list of genes of interest. This may be preferable in large
  7 | #' datasets, but in small datasets, the genesetdnds option estimates global dN/dS
  8 | #' values for a set of genes while using all genes in the exome to fit the
  9 | #' substitution model. Usage: genesetdnds(dndsout, gene_list).
 10 | #'
 11 | #' @author Inigo Martincorena (Wellcome Sanger Institute)
 12 | #' @details Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041.
 13 | #'
 14 | #' @param dndsout Output object from dndscv. To generate a valid input object for this function, use outmats=T when running dndscv.
 15 | #' @param gene_list List of genes to restrict the analysis (gene set).
 16 | #' @param sm Substitution model (precomputed models are available in the data directory)
 17 | #'
 18 | #' @return 'genesetdnds' returns a list of objects:
 19 | #' @return globaldnds_geneset: Global dN/dS estimates in the gene set, including Wald CI95%, Wald p-values and LRT (recommended) p-values.
 20 | #' @return globaldnds_rest: Global dN/dS estimates in all other genes, including Wald CI95%, Wald p-values and LRT (recommended) p-values.
 21 | #'
 22 | #' @export
 23 | 
 24 | genesetdnds = function(dndsout, gene_list, sm = "192r_3w") {
 25 | 
 26 |     ## 1. Input
 27 | 
 28 |     if (is.null(dndsout$N)) { stop(sprintf("Invalid input: dndsout must be generated using outmats=T in dndscv.")) }
 29 | 
 30 |     allg = as.vector(dndsout$genemuts$gene_name) # All genes in the dndsout object
 31 |     nonex = gene_list[!(gene_list %in% allg)]
 32 |     if (length(nonex)>0) {
 33 |         stop(sprintf("The following input gene names are not in the dndsout object: %s. To see the list of genes available use as.vector(dndsout$genemuts$gene_name).", paste(nonex,collapse=", ")))
 34 |     }
 35 |     if (length(gene_list)<2) {
 36 |         stop("The gene_list argument needs to contain at least two genes")
 37 |     }
 38 | 
 39 |     # Substitution model (The user can also input a custom substitution model as a matrix)
 40 |     if (length(sm)==1) {
 41 |         data(list=sprintf("submod_%s",sm), package="dndscv")
 42 |     } else {
 43 |         substmodel = sm
 44 |     }
 45 | 
 46 |     ## 2. Estimation of the global rate and selection parameters
 47 | 
 48 |     Lall = dndsout$L
 49 |     Nall = dndsout$N
 50 |     geneind = which(allg %in% gene_list) # Genes in the gene set
 51 |     L = rbind(apply(Lall[,,geneind], c(1,2), sum), apply(Lall[,,-geneind], c(1,2), sum))
 52 |     N = rbind(apply(Nall[,,geneind], c(1,2), sum), apply(Nall[,,-geneind], c(1,2), sum))
 53 | 
 54 |     # Subfunction: fitting substitution model
 55 | 
 56 |     fit_substmodel = function(N, L, substmodel, testpar) {
 57 | 
 58 |         l = c(L); n = c(N); r = c(substmodel)
 59 |         n = n[l!=0]; r = r[l!=0]; l = l[l!=0]
 60 | 
 61 |         params = unique(base::strsplit(x=paste(r,collapse="*"), split="\\*")[[1]])
 62 |         indmat = as.data.frame(array(0, dim=c(length(r),length(params))))
 63 |         colnames(indmat) = params
 64 |         for (j in 1:length(r)) {
 65 |             indmat[j, base::strsplit(r[j], split="\\*")[[1]]] = 1
 66 |         }
 67 | 
 68 |         model = glm(formula = n ~ offset(log(l)) + . -1, data=indmat, family=poisson(link=log))
 69 |         mle = exp(coefficients(model)) # Maximum-likelihood estimates for the rate params
 70 |         ci = exp(confint.default(model)) # Wald confidence intervals
 71 | 
 72 |         pvals.wald = coef(summary(model))[,4]
 73 |         model.lrt = drop1(model, test="LRT", scope=testpar)
 74 |         pvals.lrt = setNames(model.lrt[[5]], row.names(model.lrt))
 75 | 
 76 |         par = data.frame(name=gsub("\`","",rownames(ci)), mle=mle[rownames(ci)], cilow=ci[,1], cihigh=ci[,2], pval.wald=pvals.wald[rownames(ci)], pval.lrt=pvals.lrt[rownames(ci)]) # MLEs and Wald CI95% for the selection parameters
 77 | 
 78 |         return(list(par=par, model=model))
 79 |     }
 80 | 
 81 |     syneqs = substmodel[,1] # Rate model for synonymous sites
 82 | 
 83 |     # Model 1: Fitting all mutation rates and the 3 global selection parameters
 84 | 
 85 |     rmatrix = array("",dim=dim(L))
 86 |     rmatrix[,1] = c(paste(syneqs,"*r_rel",sep=""), syneqs) # This adds an extra parameter (r_rel) to account for a different mutation rate (synonymous density) in the gene set
 87 |     rmatrix[,2] = c(paste(syneqs,"*r_rel*wmis_geneset",sep=""), paste(syneqs,"*wmis_rest",sep=""))
 88 |     rmatrix[,3] = c(paste(syneqs,"*r_rel*wnon_geneset",sep=""), paste(syneqs,"*wnon_rest",sep=""))
 89 |     rmatrix[,4] = c(paste(syneqs,"*r_rel*wspl_geneset",sep=""), paste(syneqs,"*wspl_rest",sep=""))
 90 |     poissout = fit_substmodel(N, L, rmatrix, testpar = c("wmis_geneset","wnon_geneset","wspl_geneset","wmis_rest","wnon_rest","wspl_rest")) # Original substitution model
 91 |     par1 = poissout$par
 92 | 
 93 |     # Model 2: Fitting all mutation rates and the 2 global selection parameters
 94 | 
 95 |     rmatrix = array("",dim=dim(L))
 96 |     rmatrix[,1] = c(paste(syneqs,"*r_rel",sep=""), syneqs) # This adds an extra parameter (r_rel) to account for a different mutation rate (synonymous density) in the gene set
 97 |     rmatrix[,2] = c(paste(syneqs,"*r_rel*wmis_geneset",sep=""), paste(syneqs,"*wmis_rest",sep=""))
 98 |     rmatrix[,3] = c(paste(syneqs,"*r_rel*wtru_geneset",sep=""), paste(syneqs,"*wtru_rest",sep=""))
 99 |     rmatrix[,4] = c(paste(syneqs,"*r_rel*wtru_geneset",sep=""), paste(syneqs,"*wtru_rest",sep=""))
100 |     poissout = fit_substmodel(N, L, rmatrix, testpar = c("wmis_geneset","wtru_geneset","wmis_rest","wtru_rest")) # Model with wnon==wspl (wtru)
101 |     par2 = poissout$par
102 | 
103 |     # Model 3: Fitting all mutation rates and 1 global selection parameter (wall)
104 | 
105 |     rmatrix = array("",dim=dim(L))
106 |     rmatrix[,1] = c(paste(syneqs,"*r_rel",sep=""), syneqs) # This adds an extra parameter (r_rel) to account for a different mutation rate (synonymous density) in the gene set
107 |     rmatrix[,2] = c(paste(syneqs,"*r_rel*wall_geneset",sep=""), paste(syneqs,"*wall_rest",sep=""))
108 |     rmatrix[,3] = c(paste(syneqs,"*r_rel*wall_geneset",sep=""), paste(syneqs,"*wall_rest",sep=""))
109 |     rmatrix[,4] = c(paste(syneqs,"*r_rel*wall_geneset",sep=""), paste(syneqs,"*wall_rest",sep=""))
110 |     poissout = fit_substmodel(N, L, rmatrix, testpar = c("wall_geneset","wall_rest")) # Model with a single w for all non-synonymous changes
111 |     par3 = poissout$par
112 | 
113 |     globaldnds_geneset = rbind(par1, par2, par3)[c("wmis_geneset","wnon_geneset","wspl_geneset","wtru_geneset","wall_geneset"),-1]
114 |     globaldnds_rest = rbind(par1, par2, par3)[c("wmis_rest","wnon_rest","wspl_rest","wtru_rest","wall_rest"),-1]
115 |     return(list(globaldnds_geneset=globaldnds_geneset, globaldnds_rest=globaldnds_rest))
116 | }
117 | 
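
The `fit_substmodel` helper in `genesetdnds` parses rate expressions into a 0/1 indicator matrix and fits a Poisson regression with `log(l)` as an offset, so that the exponentiated coefficients are the rate and selection parameters. The following is a minimal sketch of that idea on a two-class toy problem. The counts, site numbers and the two-parameter model (`t`, `wmis`) are made up for illustration and are much simpler than the package's substitution models.

```r
# Toy data: available sites (l) and observed mutations (n) for two classes
l <- c(1000, 3000)      # synonymous sites, missense sites (made-up numbers)
n <- c(10, 60)          # observed synonymous and missense mutations (made-up numbers)
r <- c("t", "t*wmis")   # rate expressions: background rate t, missense rate t*wmis

# Build the indicator matrix: one column per parameter, 1 if the parameter
# appears in the rate expression of that row
params <- unique(strsplit(paste(r, collapse = "*"), split = "\\*")[[1]])
indmat <- as.data.frame(array(0, dim = c(length(r), length(params))))
colnames(indmat) <- params
for (j in seq_along(r)) {
  indmat[j, strsplit(r[j], split = "\\*")[[1]]] <- 1
}

# Poisson regression without intercept; log(l) enters as an offset so that
# coefficients are log rates per site
model <- glm(n ~ offset(log(l)) + . - 1, data = indmat, family = poisson(link = log))
mle <- exp(coef(model))  # back-transform to the rate scale
print(mle)
```

In this toy case the model is saturated, so `t` recovers the synonymous density (10/1000) and `wmis` recovers the ratio of missense to synonymous densities ((60/3000)/(10/1000)).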


--------------------------------------------------------------------------------
/R/sitednds.R:
--------------------------------------------------------------------------------
  1 | #' sitednds
  2 | #'
  3 | #' Function to estimate site-wise dN/dS values and p-values against neutrality. To generate a valid input object for this function, use outmats=T when running dndscv. This function is in testing, please interpret the results with caution. Also note that recurrent artefacts or SNP contamination can violate the null model and dominate the list of sites under apparent selection. A considerable number of significant synonymous sites may reflect a problem with the data. Be very critical of the results and if suspicious sites appear recurrently mutated consider refining the variant calling (e.g. using a better unmatched normal panel). In the future, this function may be extended to perform inferences at a codon level instead of at a single-base level.
  4 | #'
  5 | #' @author Inigo Martincorena (Wellcome Sanger Institute)
  6 | #'
  7 | #' @param dndsout Output object from dndscv. To generate a valid input object for this function, use outmats=T when running dndscv.
  8 | #' @param min_recurr Minimum number of mutations per site to estimate site-wise dN/dS ratios. [default=2]
  9 | #' @param gene_list List of genes to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of genes is decided a priori. [default=NULL, sitednds will be run on all genes in dndsout]
 10 | #' @param site_list List of hotspot sites to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of sites is decided a priori. [default=NULL, sitednds will be run on all sites in dndsout]
 11 | #' @param trinuc_list List of trinucleotide substitution to restrict the analysis of sitednds. This is used to estimate separate overdispersion parameters for different substitution contexts [default=NULL, sitednds will be run on all substitution contexts]
 12 | #' @param theta_option 2 options: "mle" (uses the MLE of the overdispersion parameter) or "conservative" (uses the conservative bound of the CI95). Values other than "mle" will lead to the conservative option [default="conservative"]
 13 | #' @param syn_drivers Vector with a list of known synonymous driver mutations to exclude from the background model [default="TP53:T125T"]. See Martincorena et al., Cell, 2017 (PMID:29056346).
 14 | #' @param method Overdispersion model: NB = Negative Binomial (Gamma-Poisson), LNP = Poisson-Lognormal (see Hess et al., BiorXiv, 2019). [default="NB"]
 15 | #' @param numbins Number of bins to discretise the rvec vector [default=1e4]. This enables fast execution of the LNP model in datasets of arbitrary size.
 16 | #' @param kc List of a-priori known cancer genes (to be excluded when fitting the background model)
 17 | #'
 18 | #' @return 'sitednds' returns a table of recurrently mutated sites and the estimates of the size parameter:
 19 | #' @return - recursites: Table of recurrently mutated sites with site-wise dN/dS values and p-values
 20 | #' @return - overdisp: Maximum likelihood estimate and CI95% for the overdispersion parameter (the size parameter of the negative binomial distribution or the sigma parameter of the lognormal distribution). The lower the size value or the higher the sigma value the higher the variation of the mutation rate across sites not captured by the trinucleotide change or by variation across genes.
 21 | #' @return - fpr_nonsyn_q05: Fraction of the significant non-synonymous sites (qval<0.05) that are estimated to be false positives. This assumes that all synonymous mutations (except those in TP53 and CDKN2A) are false positives, thus offering a conservative estimate of the false positive rate.
 22 | #' @return - LL: Log-likelihood of the fit of the overdispersed model (see "method" argument) to all synonymous sites.
 23 | #'
 24 | #' @export
 25 | 
 26 | sitednds = function(dndsout, min_recurr = 2, gene_list = NULL, site_list = NULL, trinuc_list = NULL, theta_option = "conservative", syn_drivers = "TP53:T125T", method = "NB", numbins = 1e4, kc = "cgc81") {
 27 | 
 28 |     ## 1. Fitting a negative binomial distribution at the site level considering the background mutation rate of the gene and of each trinucleotide
 29 |     message("[1] Site-wise overdispersed model accounting for trinucleotides and relative gene mutability...")
 30 | 
 31 |     # N and L matrices for synonymous mutations
 32 |     if (length(dndsout$N)==0) { stop(sprintf("Invalid input: dndsout must be generated using outmats=T in dndscv.")) }
 33 |     if (nrow(dndsout$mle_submodel)!=195) { stop("Invalid input: dndsout must be generated using the default trinucleotide substitution model in dndscv.") }
 34 | 
 35 |     # Restricting the analysis to an input list of genes
 36 |     if (!is.null(gene_list)) {
 37 |         g = as.vector(dndsout$genemuts$gene_name)
 38 |         # Correcting CDKN2A if required (hg19)
 39 |         if (any(g %in% c("CDKN2A.p14arf","CDKN2A.p16INK4a")) & any(gene_list=="CDKN2A")) {
 40 |             gene_list = unique(c(setdiff(gene_list,"CDKN2A"),"CDKN2A.p14arf","CDKN2A.p16INK4a"))
 41 |         }
 42 |         nonex = gene_list[!(gene_list %in% g)]
 43 |         if (length(nonex)>0) {
 44 |             warning(sprintf("The following input gene names are not in dndsout input object and will not be analysed: %s.", paste(nonex,collapse=", ")))
 45 |         }
 46 |         numtests = sum(dndsout$L[,,which(g %in% gene_list)])
 47 |     } else {
 48 |         numtests = sum(dndsout$L)
 49 |     }
 50 | 
 51 |     # Input: known cancer genes to exclude from the background model fitting (the user can input a gene list as a character vector)
 52 |     if (is.null(kc)) {
 53 |         known_cancergenes = ""
 54 |     } else if (kc[1] %in% c("cgc81")) {
 55 |         data(list=sprintf("cancergenes_%s",kc), package="dndscv")
 56 |     } else {
 57 |         known_cancergenes = kc
 58 |     }
 59 | 
 60 |     # L matrix
 61 |     L = dndsout$L
 62 |     L[,2:4,which(as.vector(dndsout$genemuts$gene_name) %in% known_cancergenes)] = 0 # Removing non-synonymous sites in known_cancergenes
 63 |     L = apply(L, c(1,3), sum) # Total number of sites to be considered in the background model
 64 | 
 65 | 
 66 |     # Counts of observed mutations
 67 | 
 68 |     annotsubs = dndsout$annotmuts[which(dndsout$annotmuts$impact %in% c("Synonymous","Missense","Nonsense","Essential_Splice")),]
 69 |     num_syn_drivers_masked = sum(paste(annotsubs$gene,annotsubs$aachange,sep=":") %in% syn_drivers)
 70 | 
 71 |     if (!is.null(trinuc_list)) { # Restricting sitednds to certain trinucleotide changes
 72 |         annotsubs = annotsubs[which(paste(annotsubs$ref3_cod,annotsubs$mut3_cod,sep=">") %in% trinuc_list), ]
 73 |         if (nrow(annotsubs)==0) { stop("No mutations left after restricting by trinuc_list. Please review your input arguments and your mutation table (dndsout$annotmuts).") }
 74 |     }
 75 | 
 76 |     annotsubs$trisub = paste(annotsubs$chr,annotsubs$pos,annotsubs$ref,annotsubs$mut,annotsubs$gene,annotsubs$aachange,annotsubs$impact,annotsubs$ref3_cod,annotsubs$mut3_cod,sep=":")
 77 |     annotsubs = annotsubs[which(annotsubs$ref!=annotsubs$mut),]
 78 |     freqs = sort(table(annotsubs$trisub), decreasing=T)
 79 | 
 80 |     # Relative mutation rate per gene
 81 |     # Note that this assumes that the gene order in genemuts has not been altered with respect to the N and L matrices, as it is currently the case in dndscv
 82 |     relmr = dndsout$genemuts$exp_syn_cv/dndsout$genemuts$exp_syn
 83 |     names(relmr) = dndsout$genemuts$gene_name
 84 | 
 85 |     # Substitution rates (192 trinucleotide rates, strand-specific)
 86 |     sm = setNames(dndsout$mle_submodel$mle, dndsout$mle_submodel$name)
 87 |     sm["TTT>TGT"] = 1 # Adding the TTT>TGT rate (which is arbitrarily set to 1 relative to t)
 88 |     sm = sm*sm["t"] # Absolute rates
 89 |     sm = sm[setdiff(names(sm),c("wmis","wnon","wspl","t"))] # Removing selection parameters
 90 |     sm = sm[order(names(sm))] # Sorting
 91 | 
 92 |     if (!is.null(trinuc_list)) { # Restricting sitednds to certain trinucleotide changes
 93 |         sm[!(names(sm) %in% trinuc_list)] = 0 # Setting all other rates to zero
 94 |     }
 95 | 
 96 |     mat_trisub = array(sm, dim=c(192,nrow(dndsout$genemuts))) # Relative mutation rates by trinucleotide
 97 |     mat_relmr = t(array(relmr, dim=c(nrow(dndsout$genemuts),192))) # Relative mutation rates by gene
 98 |     R = mat_trisub * mat_relmr # Expected rate for each mutation type in each gene
 99 | 
100 |     # Expanded vectors: full vectors of observed and expected mutations per site across all sites considered in dndsout
101 |     rvec = rep(R, times=L) # Expanded vector of expected mutation counts per site
102 |     nvec = array(0, length(rvec)) # Initialising the vector with observed mutation counts per site
103 | 
104 |     mutsites = read.table(text=names(freqs), header=0, sep=":", stringsAsFactors=F) # Frequency table of mutations
105 |     colnames(mutsites) = c("chr","pos","ref","mut","gene","aachange","impact","ref3_cod","mut3_cod")
106 |     mutsites$freq = freqs
107 |     trindex = setNames(1:192, names(sm))
108 |     geneindex = setNames(1:length(names(relmr)), names(relmr))
109 |     mutsites$trindex = trindex[paste(mutsites$ref3_cod, mutsites$mut3_cod, sep=">")]
110 |     mutsites$geneindex = geneindex[mutsites$gene]
111 | 
112 |     # Mutations for the background model (excluding non-synonymous mutations in known_cancergenes)
113 |     synsites = mutsites[!(mutsites$impact!="Synonymous" & mutsites$gene %in% known_cancergenes),]
114 |     synsites = synsites[!(paste(synsites$gene,synsites$aachange,sep=":") %in% syn_drivers),]
115 |     Lcum = array(cumsum(L), dim=dim(L)) # Cumulative L indicating the position to place a given mutation in the nvec following rvec
116 |     synsites$vecindex = apply(as.matrix(synsites[,c("trindex","geneindex")]), 1, function(x) Lcum[x[1], x[2]]) # Index for the mutation
117 |     synsites = synsites[order(synsites$vecindex), ] # Sorting by index in the nvec
118 | 
119 |     # Stop execution with an error if there are fewer than two synonymous sites (the index-correction loop below needs at least two rows)
120 |     if (nrow(synsites)<2) {
121 |         stop("Too few synonymous mutations found in the input. sitednds cannot run without synonymous mutations.")
122 |     }
123 | 
124 |     # Correcting the index when there are multiple synonymous mutations in the same gene and trinucleotide class
125 |     s = snew = synsites$vecindex
126 |     sameind = 0
127 |     for (j in 2:nrow(synsites)) {
128 |         if (s[j]<=s[j-1]) {
129 |             sameind = sameind + 1 # Annotating a run of elements
130 |             snew[j] = s[j-1] - sameind # We assign it an earlier position in the vector
131 |         } else {
132 |             sameind = 0
133 |         }
134 |     }
135 |     synsites$vecindex2 = snew
136 | 
137 |     nvec[synsites$vecindex2] = synsites$freq # Expanded nvec for the negative binomial regression
138 |     rvec = rvec * (sum(dndsout$genemuts$n_syn)-num_syn_drivers_masked) / sum(dndsout$genemuts$exp_syn_cv) # Minor correction ensuring that global observed and expected rates are identical (this works after subsetting substitutions)
139 | 
140 |     # Estimation of overdispersion: Using optimize appears to yield reliable results. Problems experienced with fitdistr, glm.nb and theta.ml. Consider using grid search if problems appear with optimize.
141 |     if (method=="LNP") { # Modelling rates per site with a Poisson-Lognormal mixture
142 | 
143 |         lnp_est = fitlnpbin(nvec, rvec, theta_option = theta_option, numbins = numbins)
144 |         theta_ml = lnp_est$ml$minimum
145 |         theta_ci95 = lnp_est$sig_ci95
146 |         LL = -lnp_est$ml$objective # LogLik
147 |         thetaout = setNames(c(theta_ml, theta_ci95), c("MLE","CI95_high"))
148 | 
149 |     } else { # Modelling rates per site as negative binomially distributed (i.e. quantifying uncertainty above Poisson using a Gamma)
150 | 
151 |         nbin = function(theta, n=nvec, r=rvec) { -sum(dnbinom(x=n, mu=r, log=T, size=theta)) } # nbin loglik function for optimisation
152 |         ml = optimize(nbin, interval=c(0,1000))
153 |         theta_ml = ml$minimum
154 |         LL = -ml$objective # LogLik
155 | 
156 |         # CI95% for theta using profile likelihood and iterative grid search (this yields slightly conservative CI95)
157 |         grid_proflik = function(bins=5, iter=5) {
158 |             for (j in 1:iter) {
159 |                 if (j==1) {
160 |                     thetavec = sort(c(0, 10^seq(-3,3,length.out=bins), theta_ml, theta_ml*10, 1e4)) # Initial vals
161 |                 } else {
162 |                     thetavec = sort(c(seq(thetavec[ind[1]], thetavec[ind[1]+1], length.out=bins), seq(thetavec[ind[2]-1], thetavec[ind[2]], length.out=bins))) # Refining previous iteration
163 |                 }
164 | 
165 |                 proflik = sapply(thetavec, function(theta) -sum(dnbinom(x=nvec, mu=rvec, size=theta, log=T))-ml$objective) < qchisq(.95,1)/2 # Values of theta within CI95%
166 |                 ind = c(which(proflik[1:(length(proflik)-1)]==F & proflik[2:length(proflik)]==T)[1],
167 |                         which(proflik[1:(length(proflik)-1)]==T & proflik[2:length(proflik)]==F)[1]+1)
168 |                 if (is.na(ind[1])) { ind[1] = 1 }
169 |                 if (is.na(ind[2])) { ind[2] = length(thetavec) }
170 |             }
171 |             return(thetavec[ind])
172 |         }
173 |         theta_ci95 = grid_proflik(bins=5, iter=5)
174 |         thetaout = setNames(c(theta_ml, theta_ci95), c("MLE","CI95low","CI95_high"))
175 |     }
176 | 
177 | 
178 |     ## 2. Calculating site-wise dN/dS ratios and P-values for recurrently mutated sites
179 |     message("[2] Calculating site-wise dN/dS ratios and p-values...")
180 | 
181 |     # Theta option
182 |     if (theta_option=="mle" | theta_option=="MLE") {
183 |         theta = theta_ml
184 |     } else { # Conservative
185 |         message("    Using the conservative bound of the confidence interval of the overdispersion parameter.")
186 |         theta = theta_ci95[1]
187 |     }
188 | 
189 |     # Creating the recursites object
190 |     recursites = mutsites[, c("chr","pos","ref","mut","gene","aachange","impact","ref3_cod","mut3_cod","freq")]
191 |     recursites$mu = relmr[recursites$gene] * sm[paste(recursites$ref3_cod,recursites$mut3_cod,sep=">")]
192 | 
193 |     # Gene RHT
194 |     if (!is.null(gene_list)) {
195 |         message("    Performing Restricted Hypothesis Testing on the input list of a-priori genes")
196 |         recursites = recursites[which(recursites$gene %in% gene_list), ] # Restricting the p-value and q-value calculations to gene_list
197 |     }
198 | 
199 |     # Site RHT
200 |     if (!is.null(site_list)) {
201 |         message("    Performing Restricted Hypothesis Testing on the input list of a-priori sites (numtests = length(site_list))")
202 |         mutstr = paste(recursites$chr,recursites$pos,recursites$ref,recursites$mut,recursites$gene,recursites$aachange,recursites$ref3_cod,recursites$mut3_cod,sep=":")
203 |         if (!any(mutstr %in% site_list)) {
204 |             stop("No mutation was observed in the restricted list of known hotspots. Site-RHT cannot be run.")
205 |         }
206 |         recursites = recursites[which(mutstr %in% site_list), ] # Restricting the p-value and q-value calculations to site_list
207 |         numtests = length(site_list)
208 | 
209 |         # Calculating global dN/dS ratios at known hotspots
210 |         auxsites = as.data.frame(do.call("rbind",strsplit(site_list,split=":")), stringsAsFactors=F)
211 |         auxsites = auxsites[auxsites$V5 %in% names(relmr), ]
212 |         neutralexp = sum(relmr[auxsites$V5]*sm[paste(auxsites$V7,auxsites$V8,sep=">")]) # Number of mutations expected at known hotspots under neutrality
213 |         numobs = sum(recursites$freq) # Number observed
214 |         poistest = poisson.test(numobs, T=neutralexp)
215 |         globaldnds_knownsites = setNames(c(numobs, neutralexp, poistest$estimate, poistest$conf.int), c("obs","exp","dnds","cilow","cihigh"))
216 |         message(sprintf("    Mutations at known hotspots: %0.0f observed, %0.3g expected, obs/exp~%0.3g (CI95:%0.3g,%0.3g).", globaldnds_knownsites[1], globaldnds_knownsites[2], globaldnds_knownsites[3], globaldnds_knownsites[4], globaldnds_knownsites[5]))
217 |     }
218 | 
219 |     # Restricting the recursites output by min_recurr
220 |     recursites = recursites[recursites$freq>=min_recurr, ] # Restricting the output to sites with min_recurr
221 | 
222 |     if (nrow(recursites)>0) {
223 | 
224 |         recursites$dnds = recursites$freq / recursites$mu # Site-wise dN/dS (point estimate)
225 | 
226 |         if (method=="LNP") { # Modelling rates per site with a Poisson-Lognormal mixture
227 | 
228 |             # Cumulative Lognormal-Poisson using poilog::dpoilog
229 |             dpoilog = poilog::dpoilog
230 |             ppoilog = function(n, mu, sig) {
231 |                 p = sum(dpoilog(n=floor(n+1):floor(n*10+1000), mu=log(mu)-sig^2/2, sig=sig))
232 |                 return(p)
233 |             }
234 | 
235 |             message(sprintf("    Modelling substitution rates using a Lognormal-Poisson: sig = %0.3g (upperbound = %0.3g)", theta_ml, theta_ci95))
236 |             recursites$pval = apply(recursites, 1, function(x) ppoilog(n=as.numeric(x["freq"])-0.5, mu=as.numeric(x["mu"]), sig=theta))
237 | 
238 |         } else { # Negative binomial model
239 | 
240 |             message(sprintf("    Modelling substitution rates using a Negative Binomial: theta = %0.3g (CI95:%0.3g,%0.3g)", theta_ml, theta_ci95[1], theta_ci95[2]))
241 |             recursites$pval = pnbinom(q=recursites$freq-0.5, mu=recursites$mu, size=theta, lower.tail=F)
242 |         }
243 | 
244 |         recursites = recursites[order(recursites$pval, -recursites$freq), ] # Sorting by p-val and frequency
245 |         recursites$qval = p.adjust(recursites$pval, method="BH", n=numtests) # P-value adjustment for all possible changes
246 |         rownames(recursites) = NULL
247 | 
248 |         # Estimating False Positive Rates based on the observed number of significant synonymous hits
249 | 
250 |         qcutoff = 0.05 # q-value cutoff to estimate false positive rates
251 |         if (any(recursites$qval<qcutoff)) {
252 | 
253 |             signifsites = recursites[recursites$qval<qcutoff, ]
254 |             obs_hits = c(sum(signifsites$impact=="Synonymous" & !(signifsites$gene %in% c("TP53","CDKN2A","CDKN2A.p14arf"))), sum(signifsites$impact!="Synonymous"))
255 |             exp_frac = c(sum(dndsout$genemuts$exp_syn), sum(dndsout$genemuts$exp_mis+dndsout$genemuts$exp_non+dndsout$genemuts$exp_spl))
256 |             obsexp = (obs_hits[2]/obs_hits[1])/(exp_frac[2]/exp_frac[1])
257 |             fpr_nonsyn = list()
258 |             fpr_nonsyn$estimate = 1 / pmax(1,obsexp) # i.e. 1 - driver_fraction
259 |             fpr_nonsyn$conf.int = rev(1 / pmax(1,as.vector(poisson.test(x=obs_hits[2:1], T=exp_frac[2:1])$conf.int)))
260 | 
261 |             if (fpr_nonsyn$estimate>qcutoff) {
262 |                 warning(sprintf("The estimated false positive rate for nonsynonymous hits (qval<0.05) is %0.3f (CI95:%0.3f,%0.3f). High false positive rates (>>0.05) evidence problems with the data or the model and mean that the results are not reliable.", fpr_nonsyn$estimate, fpr_nonsyn$conf.int[1], fpr_nonsyn$conf.int[2]))
263 |             }
264 | 
265 |             if (!is.null(site_list)) { # We do not compute FPRs when restricting the analysis to a list of a priori hotspots
266 |                 fpr_nonsyn = NULL
267 |             }
268 | 
269 |         } else {
270 |             fpr_nonsyn = NULL
271 |         }
272 | 
273 |     } else {
274 |         recursites = fpr_nonsyn = lnp_est = NULL
275 |         warning("No site was found with the minimum recurrence requested [default min_recurr=2]")
276 |     }
277 | 
278 |     if (is.null(site_list)) {
279 |         return(list(recursites=recursites, overdisp=thetaout, fpr_nonsyn_q05=fpr_nonsyn, LL=LL))
280 |     } else {
281 |         return(list(recursites=recursites, overdisp=thetaout, fpr_nonsyn_q05=fpr_nonsyn, LL=LL, globaldnds_knownsites=globaldnds_knownsites))
282 |     }
283 | }
284 | 
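
With `method="NB"`, `sitednds` estimates the overdispersion parameter by minimising the negative binomial negative log-likelihood with `optimize`. It then obtains site-wise p-values from the upper tail of that distribution with `pnbinom`. The following is a minimal sketch of those two steps on simulated data. The simulated counts, the constant expected rate and the helper `pval_site` are assumptions made for illustration and do not reproduce the package's handling of genes, trinucleotide contexts or confidence intervals.

```r
set.seed(1)

# Simulated per-site data: expected counts (rvec_toy) and observed counts (nvec_toy)
rvec_toy <- rep(1, 5000)                                            # made-up expected rate per site
nvec_toy <- rnbinom(n = length(rvec_toy), mu = rvec_toy, size = 2)  # true size (theta) = 2

# Negative log-likelihood of the negative binomial as a function of theta
nbin <- function(theta, n = nvec_toy, r = rvec_toy) {
  -sum(dnbinom(x = n, mu = r, size = theta, log = TRUE))
}

# One-dimensional optimisation over theta (interval chosen to avoid theta = 0)
ml <- optimize(nbin, interval = c(1e-3, 1000))
theta_ml <- ml$minimum
LL <- -ml$objective  # log-likelihood at the estimate

# Hypothetical helper: upper-tail p-value for observing >= freq mutations
# at a site with expected count mu
pval_site <- function(freq, mu, theta) {
  pnbinom(q = freq - 0.5, mu = mu, size = theta, lower.tail = FALSE)
}

p2 <- pval_site(freq = 2, mu = 0.1, theta = theta_ml)
p5 <- pval_site(freq = 5, mu = 0.1, theta = theta_ml)
print(c(theta_ml = theta_ml, p2 = p2, p5 = p5))
```

Subtracting 0.5 from the observed count makes `pnbinom(..., lower.tail = FALSE)` return the probability of seeing at least `freq` mutations, which is why the p-value falls as recurrence increases.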


--------------------------------------------------------------------------------
/R/withingenednds.R:
--------------------------------------------------------------------------------
  1 | #' withingenednds
  2 | #' 
  3 | #' This function uses Poisson and Negative Binomial regression models at single-site level to study selection across different regions (coding and non-coding) within a gene.
  4 | #' 
  5 | #' @author Inigo Martincorena (Wellcome Sanger Institute)
  6 | #' @details Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041.
  7 | #' 
  8 | #' @param mutations Data frame with all the mutations detected in the study (5-column input table as for dndscv: sampleID, chr, pos, ref, mut).
  9 | #' @param gene Name of the gene of interest. This function is currently designed to work on a single gene, but combined analyses of multiple genes could be done using the sites output table generated by this function.
 10 | #' @param covtable Table with all sites of interest in the gene. This should be a data frame with one row per site and the following columns: chr, pos, dc (duplex depth). Additional columns will not be used.
 11 | #' @param dndsout dndscv output object for the dataset. This is mainly used for the MLEs of the substitution model. Running dndscv on all genes in the dataset is recommended unless the gene of interest is believed to have a different substitution model.
 12 | #' @param genomeFile Path to a reference fasta file for the genome assembly.
 13 | #' @param regionschr Optional data frame with user-defined regions of interest in the gene. This allows the user to define arbitrary regions within a gene (coding or non-coding) from which to calculate omega (selection or obs/exp) values (e.g. protein domains, splicing regulator regions, core promoters, etc). The table should contain the following columns: chr, start, end, wname (a unique name for the w parameter, e.g. wdomain1, wcorepromoter), impacts (e.g. Missense or Missense|Nonsense will restrict the w calculation with Missense or Missense and Nonsense mutations in the region, respectively), layered (1/0; using "0" removes other w parameters influencing the site, whereas using "1" models selection as relative to other w parameters active at these sites).
 14 | #' @param regionsaa Optional data frame with user-defined regions of interest in the gene, using aminoacid coordinates. The table should contain the following columns: gene, aa_start, aa_end, w feature name (e.g. wdomain1), impacts.
 15 | #' @param fixtheta Pre-calculated overdispersion (theta) parameter. This should be calculated using sitednds(., method="NB").
 16 | #' @param refdb Reference database (path to .rda file or a pre-loaded array object in the right format).
 17 | #' @param numcode NCBI genetic code number (default = 1; standard genetic code). To see the list of genetic codes supported use: ?seqinr::translate
 18 | #' @param normalisefromsyn Normalise the substitution rates based on the synonymous mutations in the gene. Using TRUE is recommended. Using FALSE uses the expected synonymous mutation rate of the gene from the dndscv negative binomial regression model (dndsout$genemuts). 
 19 | #' @param syndrivers Vector of known synonymous driver sites defined by their aminoacid position, to be excluded from the background model (e.g. syndrivers = c("T125T","E224E","Q331Q") for TP53).
 20 | #' @param exon_flank_length Exon flank length in bp [default = 10]. Using a value higher than 0 will calculate a separate selection (w) coefficient for synonymous mutations in exon flanks.
 21 | #' @param intron_flank_length Intron flank length in bp [default = 10]. Intronic sites occurring within these flanks but not already classified as Essential_Splice will receive a separate w parameter.
 22 | #' @param sitefilename Optionally, provide a file name to save the table of all annotated sites in the gene. This table is also always contained in the output object. 
 23 | #'
 24 | #' @return 'withingenednds' returns a list of objects:
 25 | #' @return - sites: Table with the annotation of all sites in the gene (from covtable), including all functional annotations in the "regions" input object as well as default annotations (Missense, Nonsense, Essential_Splice, Start_loss, Stop_loss, etc).
 26 | #' @return - dnds.pois: Poisson regression results (not recommended).
 27 | #' @return - dnds.nb: Negative binomial results fitting a new overdispersion parameter to the data (when fixtheta is not provided).
 28 | #' @return - dnds.nbfix: Negative binomial results using the input fixtheta value as recommended.
 29 | #' @return - model.pois: Poisson regression object.
 30 | #' @return - model.nb: Negative binomial regression object with a fitted overdispersion parameter (NULL when fixtheta is provided).
 31 | #' @return - model.nbfix: Negative binomial regression object using the input fixtheta value (NULL when fixtheta is not provided).
 32 | #'
 33 | #' @export
 34 | 
 35 | withingenednds = function(mutations, gene, covtable, dndsout, genomeFile, regionschr = NULL, regionsaa = NULL, fixtheta = NULL, normalisefromsyn = TRUE, syndrivers = NULL, exon_flank_length = 10, intron_flank_length = 10, sitefilename = NULL, refdb = "hg19", numcode = 1) {
 36 |   
 37 |   ## 1. Environment
 38 |   message("Running gene-level selection analyses...")
 39 |   
 40 |   # [Input] Reference database
 41 |   refdb_class = class(refdb)
 42 |   if ("character" %in% refdb_class) {
 43 |     if (refdb == "hg19") {
 44 |       data("refcds_hg19", package="dndscv")
 45 |     } else {
 46 |       load(refdb)
 47 |     }
 48 |   } else if("array" %in% refdb_class) {
 49 |     # use the user-supplied RefCDS object
 50 |     RefCDS = refdb
 51 |   } else {
 52 |     stop("Expected refdb to be \"hg19\", a file path, or a RefCDS-formatted array object.")
 53 |   }
 54 |   
 55 |   # Annotating all possible coding changes in the positions provided in the input covtable
 56 |   library("Rsamtools")
 57 |   covtable$ref = as.vector(scanFa(genomeFile, GRanges(covtable$chr, IRanges(covtable$pos, covtable$pos))))
 58 |   covtable$ref3 = as.vector(scanFa(genomeFile, GRanges(covtable$chr, IRanges(covtable$pos-1, covtable$pos+1))))
 59 |   
 60 |   allsubs = data.frame(sampleID="allsubs", chr = rep(covtable$chr, each=3), pos = rep(covtable$pos, each=3), ref = rep(covtable$ref, each=3), mut = NA, ref3 = rep(covtable$ref3, each=3), mut3 = NA, ref_cod = NA, mut_cod = NA, ref3_cod = NA, mut3_cod = NA)
 61 |   for (j in seq(1,nrow(allsubs),3)) {
 62 |     allsubs$mut[j:(j+2)] = setdiff(c("A","C","G","T"),allsubs$ref[j])
 63 |   }
 64 |   allsubs$mut3 = paste(substr(allsubs$ref3,1,1), allsubs$mut, substr(allsubs$ref3,3,3), sep="")
 65 |   aux = dndscv(allsubs, gene_list = gene, max_coding_muts_per_sample = Inf, max_muts_per_gene_per_sample = Inf, outp = 1)$annotmuts # Annotated mutations
 66 |   
 67 |   # Adding duplex depth, functional impact annotation to all possible changes in the input table
 68 |   aux$mstr = paste(aux$chr, aux$pos, aux$mut, sep=":") # mutation string ID
 69 |   obssubs = dndsout$annotmuts[which(dndsout$annotmuts$ref %in% c("A","C","G","T") & dndsout$annotmuts$mut %in% c("A","C","G","T") & dndsout$annotmuts$gene == gene), ]
 70 |   obssubs$mstr = paste(obssubs$chr, obssubs$pos, obssubs$mut, sep=":")
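 71 |   m = mutations[,1:5]; colnames(m) = c("sampleID","chr","pos","ref","mut"); m$mstr = paste(m$chr, m$pos, m$mut, sep=":") # Mutation string IDs for the input mutations (first five columns: sampleID, chr, pos, ref, mut)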
 72 |   
 73 |   sites = cbind(data.frame(gene=gene, pid=aux$pid[1]), allsubs[,-1]) # Initialising the sites table
 74 |   pos2dc = setNames(covtable$dc, covtable$pos)
 75 |   sites$dc = pos2dc[as.character(sites$pos)]
 76 |   sites$mstr = paste(sites$chr, sites$pos, sites$mut, sep=":")
 77 |   sites$obs = table(m$mstr)[sites$mstr]; sites$obs[is.na(sites$obs)] = 0 # Annotating the number of observed mutations at each site
 78 |   
 79 |   mstr2imp1 = setNames(aux$ntchange, aux$mstr)
 80 |   mstr2imp2 = setNames(aux$aachange, aux$mstr)
 81 |   mstr2imp3 = setNames(aux$impact, aux$mstr)
 82 |   sites$ntchange = mstr2imp1[sites$mstr]
 83 |   sites$aachange = mstr2imp2[sites$mstr]
 84 |   sites$impact = mstr2imp3[sites$mstr]
 85 |   
 86 |   # Adding strand and annotating the coding trinucleotide change
 87 |   aux2 = unique(dndsout$annotmuts[,c("gene","strand")])
 88 |   gene2strand = setNames(aux2[,2],aux2[,1])
 89 |   sites$strand = gene2strand[sites$gene]
 90 |   
 91 |   sites$ref_cod = sites$ref
 92 |   sites$ref_cod[sites$strand==-1] = seqinr::comp(sites$ref[sites$strand==-1], forceToLower = F)
 93 |   sites$mut_cod = sites$mut
 94 |   sites$mut_cod[sites$strand==-1] = seqinr::comp(sites$mut[sites$strand==-1], forceToLower = F)
 95 |   
 96 |   revcomp = function(seqvec) {
 97 |     as.vector(sapply(seqvec, function(x) paste(seqinr::comp(rev(unlist(strsplit(x,split=""))), forceToLower=F), collapse=""))) # Reverse complement of a vector of sequence motifs
 98 |   }
 99 |   sites$ref3_cod = sites$ref3
100 |   sites$mut3_cod = sites$mut3
101 |   if (any(sites$strand==-1)) {
102 |     sites$ref3_cod[sites$strand==-1] = revcomp(sites$ref3[sites$strand==-1])
103 |     sites$mut3_cod[sites$strand==-1] = revcomp(sites$mut3[sites$strand==-1])
104 |   }
105 |  
106 |   # Adding the trinucleotide substitution rates from the dndsout model
107 |   
108 |   mle_submodel = setNames(dndsout$mle_submodel[,2], dndsout$mle_submodel[,1])
109 |   mle_submodel = c(mle_submodel, "TTT>TGT"=1) # Adding the missing rate (note that all rates in mle_submodel in dNdScv are relative to TTT>TGT)
110 |   mle_submodel = mle_submodel * mle_submodel["t"] # Absolute rates
111 |   sites$r = mle_submodel[paste(sites$ref3_cod, sites$mut3_cod, sep=">")]
112 |   sites$r = sites$r * sites$dc / mean(sites$dc) # Normalising the expected rates at a site by the duplex depth
113 |   
114 |   # Normalising the global expected rates by the estimated mutation rate of the gene using one of two alternative models:
115 |   # 1. normalisefromsyn = TRUE. We normalise "r" using the synonymous mutations observed in the gene (excluding known syn driver sites)
116 |   # 2. normalisefromsyn = FALSE. We normalise "r" using the negative binomial model from dndscv.
117 |   
118 |   if (normalisefromsyn == TRUE) {
119 |     mobs = sum(sites$obs[which(sites$impact=="Synonymous" & !(sites$aachange %in% syndrivers))])
120 |     mexp = sum(sites$r[which(sites$impact=="Synonymous" & !(sites$aachange %in% syndrivers))])
121 |     sites$rnorm = sites$r * mobs / mexp
122 |   } else {
123 |     message("Option not recommended: Normalising the mutation rate of the gene based on the negative binomial regression model of dNdScv")
124 |     sites$rnorm = sites$r * dndsout$genemuts$exp_syn_cv[dndsout$genemuts$gene_name==gene] / dndsout$genemuts$exp_syn[dndsout$genemuts$gene_name==gene]
125 |   }
126 |   
127 |   # Excluding sites with rate = 0 (e.g. due to lack of duplex coverage)
128 |   ratezero = which(sites$rnorm==0)
129 |   if (length(ratezero)>0) {
130 |     sites = sites[-ratezero, ]
131 |     message(sprintf("%0.0f sites have been excluded due to having a duplex depth and/or a predicted mutation rate of 0", length(ratezero)))
132 |   }
133 |   
134 |   ## 2. Creating an index regression matrix with all functional annotations (each row is a site and each column is a parameter in the selection model)
135 |   
136 |   # Annotating Missense, Nonsense, Essential_Splice and Stop_loss mutations
137 |   mutmat = data.frame(t = rep(1,nrow(sites)), wmis = 0, wnon = 0, wspl = 0, wsyndriv = 0, 
138 |                       wintr = 0, woutcds = 0, wexfl = 0, winfl = 0, wnonlastex = 0, wstoploss = 0, wstartloss = 0)
139 |   mutmat$wmis[which(sites$impact=="Missense")] = 1
140 |   mutmat$wnon[which(sites$impact=="Nonsense")] = 1
141 |   mutmat$wspl[which(sites$impact=="Essential_Splice")] = 1
142 |   mutmat$wsyndriv[which(sites$impact=="Synonymous" & sites$aachange %in% syndrivers)] = 1
143 |   mutmat$wstoploss[which(sites$impact=="Stop_loss")] = 1
144 |   
145 |   # Annotating Start_loss mutations (mutations in codon 1)
146 |   
147 |   #start_loss = which(sites$impact != "Synonymous" & substr(sites$aachange,2,nchar(sites$aachange)-1) == "1") # There is no need to check for synonymous changes in ATG
148 |   start_loss = which(substr(sites$aachange,2,nchar(sites$aachange)-1) == "1") # There is no need to check for synonymous changes in ATG
149 |   mutmat$wstartloss[start_loss] = 1
150 |   sites$impact[start_loss] = "Start_loss"
151 |   
152 |   # Annotating introns, exon flanks, and intron flanks
153 |   
154 |   refcdsgene = RefCDS[[which(sapply(RefCDS, function(x) x$gene_name)==gene)]]
155 |   exons = refcdsgene$intervals_cds
156 |   esspl = refcdsgene$intervals_splice
157 |   
158 |   gr_sites = GenomicRanges::GRanges(sites$gene, IRanges::IRanges(sites$pos,sites$pos))
159 |   gr_exons = GenomicRanges::GRanges(gene, IRanges::IRanges(exons[,1],exons[,2]))
160 |   gr_cds = GenomicRanges::GRanges(gene, IRanges::IRanges(min(exons),max(exons)))
161 |   gr_outcds = setdiff(gr_sites, gr_cds)
162 |   ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_outcds, type="any", select="all"))
163 |   mutmat$woutcds[unique(ol[,1])] = 1 # Annotation of the sites outside the CDS boundaries
164 |   
165 |   # For genes with >1 exon, we also annotate essential splice sites, and intron and exon flanks.
166 |   if (length(esspl)>0) {
167 |     gr_esspl = GenomicRanges::GRanges(gene, IRanges::IRanges(esspl,esspl))
168 |     gr_introns = setdiff(setdiff(gr_cds, gr_exons), gr_esspl)
169 |     exfl = rbind(cbind(exons[2:nrow(exons),1],exons[2:nrow(exons),1]+exon_flank_length-1), cbind(exons[1:(nrow(exons)-1),2]-exon_flank_length+1,exons[1:(nrow(exons)-1),2]))
170 |     gr_exonflanks = GenomicRanges::GRanges(gene, IRanges::IRanges(exfl[,1],exfl[,2]))
171 |     gr_exonflanks = intersect(gr_exonflanks, gr_exons) # Ensuring that the exon flanks do not extend into introns
172 |     infl = rbind(cbind(exons[2:nrow(exons),1]-intron_flank_length,exons[2:nrow(exons),1]-1), cbind(exons[1:(nrow(exons)-1),2]+1,exons[1:(nrow(exons)-1),2]+intron_flank_length))
173 |     gr_intrflanks = GenomicRanges::GRanges(gene, IRanges::IRanges(infl[,1],infl[,2]))
174 |     gr_intrflanks = setdiff(gr_intrflanks, gr_exons) # Removing any overlaps with exonic sequences
175 |     gr_intrflanks = setdiff(gr_intrflanks, gr_esspl) # Removing any overlaps with essential splice site sequences
176 |     gr_intrflanks = intersect(gr_intrflanks, gr_cds) # Removing UTR sequences
177 |     
178 |     ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_introns, type="any", select="all"))
179 |     mutmat$wintr[unique(ol[,1])] = 1 # Annotation of the intronic sites
183 |     
184 |     ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_exonflanks, type="any", select="all"))
185 |     mutmat$wexfl[seq(1,nrow(sites)) %in% ol[,1] & sites$impact=="Synonymous"] = 1 # Annotation of the exonic flank sites (only for synonymous mutations)
186 |     
187 |     ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_intrflanks, type="any", select="all"))
188 |     mutmat$winfl[unique(ol[,1])] = 1 # Annotation of the intronic flank sites
189 |   }
190 |   
191 |   # Annotation of nonsense mutations in the last coding exon
192 |   
193 |   if (nrow(exons)>1) {
194 |     if (refcdsgene$strand==1) {
195 |       lastexon = exons[nrow(exons),]
196 |     } else {
197 |       lastexon = exons[1,]
198 |     }
199 |     
200 |     gr_lastexon = GenomicRanges::GRanges(gene, IRanges::IRanges(min(lastexon),max(lastexon)))
201 |     ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_lastexon, type="any", select="all"))
202 |     mutmat$wnonlastex[seq(1,nrow(sites)) %in% ol[,1] & sites$impact=="Nonsense"] = 1
203 |     mutmat$wnon[seq(1,nrow(sites)) %in% ol[,1] & sites$impact=="Nonsense"] = 0 # Removing Nonsense mutations in the last exon from the wnon annotation
204 |   }
205 |   
206 |   ## 3. Other annotations from the user
207 |   
208 |   # Regions defined by chr position
209 |   if (!is.null(regionschr)) {
210 |     
211 |     wnames = unique(regionschr$wname)
212 |     badnames = intersect(wnames,unique(colnames(mutmat)))
213 |     if (length(badnames)>0) { stop(sprintf("The following w names are not allowed in the input regions as they match existing parameters: %s", paste(badnames,collapse = ","))) }
214 |     
215 |     for (j in 1:length(wnames)) {
216 |       
217 |       aux = regionschr[which(regionschr$wname==wnames[j]), ]
218 |       if (length(unique(aux$impacts))!=1) { stop("regionschr: different values found in the impacts column for a given feature, please correct your input object") }
219 |       imps = strsplit(unique(aux$impacts), split=",")[[1]]
220 |       if (any(!(imps %in% setdiff(unique(sites$impact),NA)))) { stop("regionschr: invalid impact terms used in the input object, please correct the impact column") }
221 |         
222 |       gr_wname = GenomicRanges::GRanges(gene, IRanges::IRanges(aux$start,aux$end))
223 |       ol = as.data.frame(GenomicRanges::findOverlaps(gr_sites, gr_wname, type="any", select="all"))
224 |       mutmat[,wnames[j]] = 0 # Initialising this field in the data frame
225 |       if (length(imps)==0) {
226 |         indx = unique(ol[,1]) # If no impacts are indicated, we consider all sites independent of their impact
227 |       } else {
228 |         indx = intersect(unique(ol[,1]), which(sites$impact %in% imps))
229 |       }
230 |       mutmat[indx,wnames[j]] = 1
231 |       
232 |       # Removing previous w parameters if layered=0
233 |       if (aux$layered[1]==0 | aux$layered[1]==FALSE) {
234 |         mutmat[indx, setdiff(colnames(mutmat),c("t",wnames[j]))] = 0 # Removing previous annotations at these sites
235 |       }
236 |     }
237 |   }
238 |   
239 |   # Regions defined by aminoacid position
240 |   if (!is.null(regionsaa)) {
241 |     
242 |     sites$aux = 1:nrow(sites)
243 |     aas = sites[which(!is.na(sites$aachange) & sites$aachange!="."),]
244 |     aas$aapos = as.numeric(substr(aas$aachange,2,nchar(aas$aachange)-1))
245 |     gr_aas = GenomicRanges::GRanges(gene, IRanges::IRanges(aas$aapos,aas$aapos))
246 |     
247 |     wnames = unique(regionsaa$wname)
248 |     badnames = intersect(wnames,unique(colnames(mutmat)))
249 |     if (length(badnames)>0) { stop(sprintf("The following w names are not allowed in the input regions as they match existing parameters: %s", paste(badnames,collapse = ", "))) }
250 |     
251 |     for (j in 1:length(wnames)) {
252 |       
253 |       aux = regionsaa[which(regionsaa$wname==wnames[j]), ]
254 |       if (length(unique(aux$impacts))!=1) { stop("regionsaa: different values found in the impacts column for a given feature, please correct your input object") }
255 |       imps = strsplit(unique(aux$impacts), split=",")[[1]]
256 |       if (any(!(imps %in% setdiff(unique(sites$impact),NA)))) { stop("regionsaa: invalid impact terms used in the input object, please correct the impacts column") }
257 |       
258 |       gr_wname = GenomicRanges::GRanges(gene, IRanges::IRanges(aux$aa_start,aux$aa_end)) # Regions in amino acid coordinates
259 |       ol = as.data.frame(GenomicRanges::findOverlaps(gr_aas, gr_wname, type="any", select="all"))
260 |       mutmat[,wnames[j]] = 0 # Initialising this field in the data frame
261 |       if (length(imps)==0) {
262 |         indx = aas$aux[unique(ol[,1])] # If no impacts are indicated, we consider all sites independent of their impact
263 |       } else {
264 |         indx = aas$aux[intersect(unique(ol[,1]), which(aas$impact %in% imps))]
265 |       }
266 |       mutmat[indx,wnames[j]] = 1
267 |       
268 |       # Removing previous w parameters if layered=0
269 |       if (aux$layered[1]==0 | aux$layered[1]==FALSE) {
270 |         mutmat[indx, setdiff(colnames(mutmat),c("t",wnames[j]))] = 0 # Removing previous annotations at these sites
271 |       }
272 |     }
273 |     sites = sites[, setdiff(colnames(sites),"aux")]
274 |   }
275 |   
276 |   ## 4. Poisson regression: fitting the selection model
277 |   
278 |   model = glm(formula = sites$obs ~ offset(log(sites$rnorm)) + . -1, data=mutmat, family=poisson(link=log))
279 |   mle = exp(coefficients(model)) # Maximum-likelihood estimates for the rate params
280 |   pvals = coef(summary(model))[,4]
281 |   model.lrt = drop1(model, test= "Chisq")
282 |   pvals.lrt = setNames(model.lrt[[5]], row.names(model.lrt))
283 |   ci = exp(confint.default(model)) # Wald confidence intervals
284 |   par.pois = data.frame(name=gsub("\`","",rownames(ci)), mle=mle[rownames(ci)], cilow=ci[,1], cihigh=ci[,2], pval.wald=pvals[rownames(ci)], pval.lrt=pvals.lrt[rownames(ci)]) # MLEs and Wald CI95% for the selection parameters
285 | 
286 |   # Adding obs/exp statistics to the regression outputs
287 |   nobs = apply(mutmat, 2, function(x) sum(sites$obs[x==1]))
288 |   nexp = apply(mutmat, 2, function(x) sum(sites$rnorm[x==1]))
289 |   par.pois$obs = nobs[par.pois$name]
290 |   par.pois$exp = nexp[par.pois$name]
291 |   
292 |   ## 5. Negative binomial regression
293 |   
294 |   model.nbfix = model.nb = par.nbfix = par.nb = NULL
295 |   
296 |   if (!is.null(fixtheta)) {
297 |     
298 |     model.nbfix = glm(formula = sites$obs ~ offset(log(sites$rnorm)) + . -1, data=mutmat, family=MASS::negative.binomial(fixtheta))
299 |     mle = exp(coefficients(model.nbfix)) # Maximum-likelihood estimates for the rate params
300 |     pvals = coef(summary(model.nbfix))[,4]
301 |     model.lrt = drop1(model.nbfix, test= "Chisq")
302 |     pvals.lrt = setNames(model.lrt[[5]], row.names(model.lrt))
303 |     ci = exp(confint.default(model.nbfix)) # Wald confidence intervals
304 |     par.nbfix = data.frame(name=gsub("\`","",rownames(ci)), mle=mle[rownames(ci)], cilow=ci[,1], cihigh=ci[,2], pval.wald=pvals[rownames(ci)], pval.lrt=pvals.lrt[rownames(ci)]) # MLEs and Wald CI95% for the selection parameters
305 |     par.nbfix$obs = nobs[par.nbfix$name]
306 |     par.nbfix$exp = nexp[par.nbfix$name]
307 |     
308 |   } else {
309 |     
310 |     model.nb = MASS::glm.nb(formula = as.vector(sites$obs) ~ offset(log(sites$rnorm)) + . -1, data=mutmat)
311 |     mle = exp(coefficients(model.nb)) # Maximum-likelihood estimates for the rate params
312 |     pvals = coef(summary(model.nb))[,4]
313 |     model.lrt = drop1(model.nb, test= "Chisq")
314 |     pvals.lrt = setNames(model.lrt[[5]], row.names(model.lrt))
315 |     ci = exp(confint.default(model.nb)) # Wald confidence intervals
316 |     par.nb = data.frame(name=gsub("\`","",rownames(ci)), mle=mle[rownames(ci)], cilow=ci[,1], cihigh=ci[,2], pval.wald=pvals[rownames(ci)], pval.lrt=pvals.lrt[rownames(ci)]) # MLEs and Wald CI95% for the selection parameters
317 |     par.nb$obs = nobs[par.nb$name]
318 |     par.nb$exp = nexp[par.nb$name]
319 |   }
320 |   
321 |   ## 6. Outputs
322 |   
323 |   # Annotated sites table
324 |   
325 |   sites2 = cbind(sites, mutmat)
326 |   if (!is.null(sitefilename)) {
327 |     sites2 = apply(sites2,2,as.character)
328 |     write.table(sites2, file=sitefilename, col.names = T, row.names = F, quote = F, sep = "\t")
329 |   }
330 |   
331 |   # Inform the user about the sites used as neutral reference by the model
332 |   
333 |   neutral_sites = (rowSums(mutmat[,setdiff(colnames(mutmat),"t")])==0)
334 |   neutral_nsyn = sum(sites$obs[which(neutral_sites & sites$impact=="Synonymous")]) # Number of synonymous mutations used as background for dN/dS
335 |   neutral_nsynsites = length(which(neutral_sites & sites$impact=="Synonymous"))
336 |   neutral_othersites = length(which(neutral_sites & sites$impact!="Synonymous"))
337 |   nonneutral_nsyn = sum(sites$obs[which(neutral_sites==0 & sites$impact=="Synonymous")]) # Number of synonymous mutations excluded from the neutral background by the "w" annotations
338 |   
339 |   message(sprintf("   Sites used as neutral reference: %0.0f synonymous mutations across %0.0f synonymous sites.", neutral_nsyn, neutral_nsynsites))
340 |   if (neutral_othersites>0) {
341 |     message(sprintf("   %0.0f sites not classified as synonymous coding sites were used in the background model. Please ensure that this was intended.", neutral_othersites))
342 |   }
343 |   if (nonneutral_nsyn>0) {
344 |     message(sprintf("   %0.0f synonymous mutations were not used in the neutral background model. This may be due to excluding known synonymous driver mutations (when using syndrivers), excluding synonymous mutations in exon flanks (when using exon_flank_length>0) and/or due to annotations provided by the user. Please ensure that this is the desired behaviour.", nonneutral_nsyn))
345 |   }
346 |   
347 |   # Output object
348 |   out = list(sites = sites2, dnds.pois = par.pois, dnds.nb = par.nb,  dnds.nbfix = par.nbfix, model.pois = model, model.nb = model.nb, model.nbfix = model.nbfix)
349 |   invisible(out)
350 | }


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | dndscv
 2 | =====
 3 | 
 4 | Description
 5 | ---
 6 | The **dNdScv** R package is a group of maximum-likelihood dN/dS methods designed to 
 7 | 	quantify selection in cancer and somatic evolution (Martincorena *et al.*, 2017). The
 8 | 	package contains functions to quantify dN/dS ratios for missense, nonsense and 
 9 | 	essential splice mutations, at the level of individual genes, groups of genes or at 
10 | 	whole-exome level. The *dndscv* function within the package was originally designed 
11 | 	to detect cancer driver genes (*i.e.* genes under positive selection in cancer genomes)
12 | 	on datasets ranging from a few samples to thousands of samples, in whole-exome/genome 
13 | 	or targeted sequencing studies. 
14 | 	
15 | Although initially designed for cancer genomic studies, this package can also be used 
16 |     with appropriate caution to study selection in other resequencing studies, such 
17 |     as SNP analyses, mutation accumulation studies in bacteria or for the discovery 
18 |     of mutations causing developmental disorders using data from human trios. Please 
19 |     study the optional arguments carefully if you are using the dndscv function for 
20 |     other applications.
21 | 	
22 | When using the dndscv function in the package (sel_cv output object), the background 
23 |     mutation rate of each gene is estimated by combining local information 
24 | 	(synonymous mutations in the gene) and global information (variation of the mutation 
25 | 	rate across genes, exploiting epigenomic covariates), and controlling for the sequence 
26 | 	composition of the gene and mutational signatures. Constraining the expected neutral
27 | 	mutation rate of a gene using information from other genes considerably increases
28 | 	the sensitivity to detect positive selection in sparse datasets.
29 | 	
30 | Unlike traditional implementations of dN/dS using Markov-chain models, the underlying
31 |     Poisson assumptions in dNdScv allow the use of more complex context-dependent 
32 |     substitution models and the estimation of dN/dS ratios for truncating mutations. 
33 |     By default, *dNdScv* uses a trinucleotide context-dependent substitution model, 
34 |     which is important to avoid common biases affecting simpler substitution 
 35 | 	models in dN/dS (Greenman *et al.*, 2006, and Martincorena *et al.*, 2017).
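
As a rough illustration of what a trinucleotide context-dependent model involves, the base R sketch below enumerates the rate classes of a full trinucleotide model: each of the 64 trinucleotide contexts can change its central base to one of three alternatives, giving 192 classes. The object names (`nt`, `grid`, `classes`) are illustrative only and are not part of the package. The `TTT>TGT` notation follows the rate names used in the package code.

```r
# Illustrative sketch only: enumerate all trinucleotide substitution classes.
nt = c("A", "C", "G", "T")

# All combinations of 5' base, central base, 3' base and mutant central base
grid = expand.grid(first = nt, centre = nt, last = nt, mut = nt,
                   stringsAsFactors = FALSE)

# A substitution must change the central base
grid = grid[grid$centre != grid$mut, ]

# Class names in "reference trinucleotide > mutant trinucleotide" notation
classes = paste0(grid$first, grid$centre, grid$last, ">",
                 grid$first, grid$mut, grid$last)

length(classes)        # number of rate classes in a full trinucleotide model
"TTT>TGT" %in% classes # the class used as the reference rate in the package code
```

Simpler models collapse these classes into fewer rate parameters, which is where the biases mentioned above can arise.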
36 | 
37 | Installation
38 | --------
39 | You can use devtools::install_github() to install *dndscv* from this repository:
40 | 
41 | 	> library(devtools); install_github("im3sanger/dndscv")
42 | 
43 | Tutorial
44 | --------
45 | For a tutorial on dNdScv see the vignette included with the package. This includes 
46 | examples for whole-exome/genome data and for targeted data.
47 | 
48 | [Tutorial: getting started with dNdScv](http://htmlpreview.github.io/?http://github.com/im3sanger/dndscv/blob/master/vignettes/dNdScv.html)
49 | 
50 | Genome assemblies and species
51 | --------
52 | By default, *dNdScv* assumes that mutation data is mapped to the GRCh37/hg19 assembly of the
53 | human genome. If you are using human data mapped to the GRCh38/hg38 assembly, you can use 
54 | refdb="hg38" as an argument in dndscv to use the default GRCh38/hg38 precomputed database
55 | and epigenomic covariates (please ensure that you have downloaded the latest version of
56 | dNdScv).
57 | 
58 | Users interested in trying *dNdScv* on a different set of transcripts, a different human assembly
59 | or a different species can use the buildref function to create a custom RefCDS, as explained
60 | in this [tutorial](http://htmlpreview.github.io/?http://github.com/im3sanger/dndscv/blob/master/vignettes/buildref.html).
61 | 
62 | Pre-computed RefCDS files (RefCDS objects) to run *dNdScv* on some popular species
63 | (e.g. mouse, rat, cow, dog, yeast or SARS-CoV-2) are available from 
64 | [this link](https://github.com/im3sanger/dndscv_data/tree/master/data).
65 | 
66 | Reference
67 | ----
68 | Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*.
69 | http://www.cell.com/cell/fulltext/S0092-8674(17)31136-4
70 | 
71 | Author
72 | --------
73 | Inigo Martincorena
74 | 
75 | Acknowledgements
76 | --------
77 | Moritz Gerstung and Peter Campbell for advice and inspiration. Federico Abascal and Andrew Lawson for testing, feedback and ideas.
78 | 


--------------------------------------------------------------------------------
/data/cancergenes_cgc81.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/cancergenes_cgc81.rda


--------------------------------------------------------------------------------
/data/covariates_hg19.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/covariates_hg19.rda


--------------------------------------------------------------------------------
/data/covariates_hg19_chrx.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/covariates_hg19_chrx.rda


--------------------------------------------------------------------------------
/data/covariates_hg19_hg38_epigenome_pcawg.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/covariates_hg19_hg38_epigenome_pcawg.rda


--------------------------------------------------------------------------------
/data/dataset_normaloesophagus.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/dataset_normaloesophagus.rda


--------------------------------------------------------------------------------
/data/dataset_normalskin.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/dataset_normalskin.rda


--------------------------------------------------------------------------------
/data/dataset_normalskin_genes.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/dataset_normalskin_genes.rda


--------------------------------------------------------------------------------
/data/dataset_simbreast.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/dataset_simbreast.rda


--------------------------------------------------------------------------------
/data/dataset_tcgablca.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/dataset_tcgablca.rda


--------------------------------------------------------------------------------
/data/knownhotcodons_hg19.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/knownhotcodons_hg19.rda


--------------------------------------------------------------------------------
/data/knownhotspots_hg19.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/knownhotspots_hg19.rda


--------------------------------------------------------------------------------
/data/refcds_GRCh38_hg38.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/refcds_GRCh38_hg38.rda


--------------------------------------------------------------------------------
/data/refcds_hg19.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/refcds_hg19.rda


--------------------------------------------------------------------------------
/data/submod_12r_3w.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/submod_12r_3w.rda


--------------------------------------------------------------------------------
/data/submod_192r_3w.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/submod_192r_3w.rda


--------------------------------------------------------------------------------
/data/submod_2r_3w.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/data/submod_2r_3w.rda


--------------------------------------------------------------------------------
/dndscv.Rproj:
--------------------------------------------------------------------------------
 1 | Version: 1.0
 2 | 
 3 | RestoreWorkspace: No
 4 | SaveWorkspace: No
 5 | AlwaysSaveHistory: Default
 6 | 
 7 | EnableCodeIndexing: Yes
 8 | UseSpacesForTab: Yes
 9 | NumSpacesForTab: 4
10 | Encoding: UTF-8
11 | 
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 | 
15 | AutoAppendNewline: Yes
16 | StripTrailingWhitespace: Yes
17 | 
18 | BuildType: Package
19 | PackageUseDevtools: Yes
20 | PackageInstallArgs: --no-multiarch --with-keep.source
21 | PackageRoxygenize: rd,collate,namespace
22 | 


--------------------------------------------------------------------------------
/inst/doc/dNdScv.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Selection analyses and cancer driver discovery using dNdScv"
  3 | author: "Inigo Martincorena (May 2017)"
  4 | output: rmarkdown::html_vignette
  5 | vignette: >
  6 |   %\VignetteIndexEntry{Selection analyses and cancer driver discovery using dNdScv}
  7 |   %\VignetteEngine{knitr::rmarkdown}
  8 |   \usepackage[utf8]{inputenc}
  9 | ---
 10 | 
 11 | The **dNdScv** R package is a group of maximum-likelihood dN/dS methods designed to quantify selection in cancer and somatic evolution (Martincorena *et al.*, 2017). The package contains functions to quantify dN/dS ratios for missense, nonsense and essential splice mutations, at the level of individual genes, groups of genes or at whole-genome level. The *dNdScv* method was designed to detect cancer driver genes (*i.e.* genes under positive selection in cancer) on datasets ranging from a few samples to thousands of samples, in whole-exome/genome or targeted sequencing studies. 
 12 | 
 13 | The background mutation rate of each gene is estimated by combining local information (synonymous mutations in the gene) and global information (variation of the mutation rate across genes, exploiting epigenomic covariates), and controlling for the sequence composition of the gene and mutational signatures. Unlike traditional implementations of dN/dS, *dNdScv* uses trinucleotide context-dependent substitution matrices to avoid common mutation biases affecting dN/dS (Greenman *et al.*, 2006).
 14 | 
 15 | This vignette shows how to perform driver discovery and selection analyses with dNdScv in cancer sequencing data. The current version of dNdScv only supports human data, but future versions will incorporate functions to allow studies of selection on any species.
 16 | 
 17 | Main reference:
 18 | Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*.
 19 | 
 20 | ##Example 1: Driver discovery (positive selection) in cancer exomes/genomes
 21 | 
 22 | ####The simplest way to run dNdScv
 23 | 
 24 | ```{r message=FALSE}
 25 | library("seqinr")
 26 | library("Biostrings")
 27 | library("MASS")
 28 | library("GenomicRanges")
 29 | library("dndscv")
 30 | data("dataset_simbreast", package="dndscv")
 31 | dndsout = dndscv(mutations)
 32 | ```
 33 | 
 34 | ####Inputs and default parameters
 35 | 
 36 | For this example, we have used a simulated dataset of 196 breast cancer exomes provided in the package. The simplest way to run the dndscv function is to provide a table of mutations. Mutations are provided as a *data.frame* with five columns (sampleID, chromosome, position, reference base and mutant base). It is important that only unique mutations are provided in the file. Multiple instances of the same mutation in related samples (for example, when sequencing multiple samples of the same tumour) should only be listed once.
 37 | 
 38 | ```{r}
 39 | head(mutations)
 40 | ```
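
The requirement that each mutation be listed only once can be illustrated with a small sketch. This is a toy example, not package code: the column names (`sampleID`, `chr`, `pos`, `ref`, `mut`), the sample naming scheme and the way patients are derived from sample names are all assumptions made for illustration.

```r
# Toy mutation table with the five expected columns (names are illustrative).
mutations_toy <- data.frame(
  sampleID = c("P1_a", "P1_b", "P2_a"),
  chr      = c("17", "17", "3"),
  pos      = c(7577120L, 7577120L, 178936091L),
  ref      = c("C", "C", "G"),
  mut      = c("T", "T", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical convention: the patient ID is the part of sampleID before "_".
patient <- sub("_.*$", "", mutations_toy$sampleID)

# Flag repeated instances of the same mutation within the same patient.
key <- data.frame(patient, mutations_toy[, c("chr", "pos", "ref", "mut")])
mutations_unique <- mutations_toy[!duplicated(key), ]
print(mutations_unique)
```

Here the two samples from patient P1 share the same mutation, so only the first instance is kept. Whether repeated mutations are truly the same event depends on the study design and should be decided by the user.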
 41 | 
 42 | With the example dataset provided, the function should take about one minute to run. In this example, the function issues a warning as it detects the same mutation in more than one sample, requesting the user to verify that the input table of mutations only contains independent mutation events. In this case, each sample corresponds to a different patient and so the warning can be safely ignored.
 43 | 
 44 | We have run dNdScv with default parameters. This includes removing ultra-hypermutator samples and subsampling mutations when encountering too many mutations per gene in the same sample. These were designed to protect against loss of sensitivity from ultra-hypermutators and from clustered artefacts in the input mutation table, but there are occasions when the user will prefer to relax these (see Example 2).
 45 | 
 46 | #####dndscv outputs: Table of significant genes
 47 | 
 48 | The output of the *dndscv* function is a list of objects. For an analysis of exome or genome data, the most relevant output will often be the result of neutrality tests at gene level. *P-values* for substitutions are obtained by Likelihood-Ratio Tests as described in Martincorena *et al*. (2017), and q-values are obtained by Benjamini-Hochberg's multiple testing correction. The table also includes information on the number of substitutions of each class observed in each gene, as well as maximum-likelihood estimates (MLEs) of the dN/dS ratios for each gene, for missense (*wmis*), nonsense (*wnon*), essential splice site mutations (*wspl*) and indels (*wind*).
 49 | 
 50 | ```{r}
 51 | sel_cv = dndsout$sel_cv
 52 | print(head(sel_cv), digits = 3)
 53 | signif_genes = sel_cv[sel_cv$qglobal_cv<0.1, c("gene_name","qglobal_cv")]
 54 | rownames(signif_genes) = NULL
 55 | print(signif_genes)
 56 | ```
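
As a rough illustration of how q-values relate to p-values, the sketch below applies the Benjamini-Hochberg correction with base R's `p.adjust` to a made-up vector of p-values. The gene names and p-values are invented for the example, and this is not necessarily how dndscv computes its q-values internally.

```r
# Made-up gene-level p-values (illustrative only).
pvals <- c(geneA = 0.001, geneB = 0.01, geneC = 0.02, geneD = 0.5)

# Benjamini-Hochberg adjustment from the stats package.
qvals <- p.adjust(pvals, method = "BH")
print(qvals)

# Genes passing a 10% false discovery rate threshold.
signif_toy <- names(qvals)[qvals < 0.1]
print(signif_toy)
```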
 57 | 
 58 | Note in the table that the dN/dS ratios of significant cancer genes are typically extremely high, often >10 or even >100. This contains information about the fraction of mutations observed in a gene that are genuine drivers under positive selection. For example, for a highly significant gene, a dN/dS of 10 indicates that there are 10 times more non-synonymous mutations in the gene than neutrally expected, suggesting that at least around 90% of these mutations are genuine drivers (Greenman *et al*, 2006; Martincorena *et al*, 2017).
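
The arithmetic behind the 90% figure can be sketched as follows, under the assumption that any excess of non-synonymous mutations over the neutral expectation is attributed to drivers, so that the driver fraction is approximately (w - 1) / w for a dN/dS ratio w > 1. The helper name `driver_fraction` is hypothetical and is not part of the package.

```r
# Approximate lower bound on the fraction of non-synonymous mutations that
# are drivers, given a dN/dS ratio w (assumes w is well estimated).
driver_fraction <- function(w) {
  ifelse(w > 1, (w - 1) / w, 0)
}

print(driver_fraction(c(2, 10, 100)))
```

This is a point calculation only; in practice the confidence interval of the dN/dS estimate should be taken into account.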
 59 | 
 60 | #####dndscv outputs: Global dN/dS estimates
 61 | 
 62 | Another output that can be of interest is a table with the global MLEs for the dN/dS ratios across all genes. dN/dS ratios with associated confidence intervals are calculated for missense, nonsense and essential splice site substitutions separately, as well as for all non-synonymous substitutions (*wall*) and for all truncating substitutions together (*wtru*), which include nonsense and essential splice site mutations.
 63 | 
 64 | ```{r}
 65 | print(dndsout$globaldnds)
 66 | ```
 67 | 
 68 | Global dN/dS ratios in somatic evolution of cancer, and seemingly of healthy somatic tissues, appear to show a near-universal pattern of dN/dS~1, with exome-wide dN/dS ratios typically slightly higher than 1 (Martincorena *et al.*, 2017). Only occasionally have I found datasets with global dN/dS<1 and, upon closer examination, this has typically been caused by contamination of the catalogue of somatic mutations with germline SNPs. Melanoma tumours are an exception: they show a bias towards slight underestimation of dN/dS because the signature of ultraviolet-induced mutations extends beyond the trinucleotide model (Martincorena *et al.*, 2017). In my personal experience, datasets of somatic mutations with global dN/dS<<1 have always reflected a problem of SNP contamination or an inadequate substitution model, and so the evaluation of global dN/dS values can help identify problems in certain datasets.
 69 | 
 70 | #####Other useful outputs
 71 | 
 72 | The dndscv function also outputs other results that can be of interest, such as an annotated table of coding mutations (*annotmuts*), MLEs of mutation rate parameters (*mle_submodel*), lists of samples and mutations excluded from the analysis and a table with the observed and expected number of mutations per gene (*genemuts*), among others.
 73 | 
 74 | ```{r}
 75 | head(dndsout$annotmuts)
 76 | ```
 77 | 
 78 | dNdScv relies on a negative binomial regression model across genes to refine the estimated background mutation rate for a gene. This assumes that the variation of the mutation rate across genes that remains unexplained by covariates or by the sequence composition of the gene can be modelled as a Gamma distribution. This model typically works well on clean cancer genomic datasets, but not all datasets may be suitable for this model. In particular, very low estimates of $\theta$ (the overdispersion parameter), particularly $\theta<1$, may reflect problems with the suitability of the dNdScv model for the dataset.
 79 | 
 80 | ```{r}
 81 | print(dndsout$nbreg$theta)
 82 | ```
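
To give some intuition for $\theta$, the sketch below simulates overdispersed counts and fits a negative binomial regression with `MASS::glm.nb`. The simulated covariate, the coefficients and the value of `true_theta` are arbitrary assumptions, and the model is far simpler than the one fitted inside dndscv. In this parameterisation the variance is mu + mu^2/theta, so smaller values of theta correspond to stronger overdispersion.

```r
library(MASS)  # provides glm.nb

set.seed(1)
n_genes    <- 2000
covariate  <- rnorm(n_genes)                 # hypothetical gene-level covariate
expected   <- exp(1.5 + 0.3 * covariate)     # simulated expected counts per gene
true_theta <- 5                              # arbitrary value for the simulation

# Gamma-Poisson (negative binomial) counts with the chosen overdispersion.
counts <- rnbinom(n_genes, size = true_theta, mu = expected)

# Negative binomial regression; the fitted object stores the estimate of theta.
fit <- glm.nb(counts ~ covariate)
print(fit$theta)
```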
 83 | 
 84 | ##### dNdSloc: local neutrality test
 85 | 
 86 | An additional set of neutrality tests per gene is performed using a more traditional dN/dS model in which the local mutation rate for a gene is estimated exclusively from the synonymous mutations observed in the gene (*dNdSloc*) (Wong, Martincorena, *et al*., 2014). This test is typically only powered in very large datasets. For example, in the dataset used in this example, comprising 196 simulated breast cancer exomes, this model only detects *ARID1A* as significantly mutated.
 87 | 
 88 | ```{r}
 89 | signif_genes_localmodel = as.vector(dndsout$sel_loc$gene_name[dndsout$sel_loc$qall_loc<0.1])
 90 | print(signif_genes_localmodel)
 91 | ```
 92 | 
 93 | ##Example 2: Driver discovery in targeted sequencing data
 94 | 
 95 | The dndscv function can take a list of genes as an input to restrict the analysis of selection. This is strictly required when analysing targeted sequencing data, and might also be used to obtain global dN/dS ratios for a particular group of genes.
 96 | 
 97 | To exemplify the use of the dndscv function on targeted data, we can use another example dataset provided with the dNdScv package:
 98 | 
 99 | ```{r message=FALSE}
100 | library("seqinr")
101 | library("Biostrings")
102 | library("MASS")
103 | library("GenomicRanges")
104 | library("dndscv")
105 | data("dataset_normalskin", package="dndscv")
106 | data("dataset_normalskin_genes", package="dndscv")
107 | dndsskin = dndscv(m, gene_list=target_genes, max_muts_per_gene_per_sample = Inf, max_coding_muts_per_sample = Inf)
108 | ```
109 | 
110 | This dataset comprises 3,408 unique somatic mutations detected by ultra-deep (~500x) targeted sequencing of 74 cancer genes in 234 small biopsies of normal human skin (epidermis) from four healthy individuals. Note that all of the mutations listed in the input table are genuinely independent events and so, again, we can safely ignore the two warnings issued by dndscv. For more details on this study see:
111 | 
112 | **Martincorena I, *et al*. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. Science. 348(6237):880-6.** doi: 10.1126/science.aaa6806.
113 | 
114 | In the paper above, we described strong evidence of positive selection on somatic mutations occurring in normal human skin throughout life. These mutations are detected as microscopic clones of mutant cells in normal skin. The dNdScv analysis below recapitulates some of the key analyses of selection in this study:
115 | 
116 | ```{r}
117 | sel_cv = dndsskin$sel_cv
118 | print(head(sel_cv[sel_cv$qglobal_cv<0.1,c(1:10,16,17)]), digits = 3)
119 | print(dndsskin$globaldnds, digits = 3)
120 | ```
121 | 
122 | ##Example 3: Using different substitution models
123 | 
124 | Classic maximum-likelihood implementations of dN/dS use a simple substitution model with a single rate parameter. Mutations are classified as either transitions (C<>T, G<>A) or transversions, and the single rate parameter is a transition/transversion (ts/tv) ratio reflecting the relative frequency of both classes of substitutions (Goldman & Yang, 1994). The dndscv function can take a different substitution model as input. The user can choose from existing substitution models provided in the *data* directory as part of the package or input a different substitution model as a matrix:
125 | 
126 | ```{r message=FALSE}
127 | library("dndscv")
128 | # 192 rates (used as default)
129 | data("submod_192r_3w", package="dndscv")
130 | colnames(substmodel) = c("syn","mis","non","spl")
131 | head(substmodel)
132 | # 12 rates (no context-dependence)
133 | data("submod_12r_3w", package="dndscv")
134 | colnames(substmodel) = c("syn","mis","non","spl")
135 | head(substmodel)
136 | # 2 rates (classic ts/tv model)
137 | data("submod_2r_3w", package="dndscv")
138 | colnames(substmodel) = c("syn","mis","non","spl")
139 | head(substmodel)
140 | ```
141 | 
142 | We can fit a traditional ts/tv model to the skin dataset using the code below:
143 | 
144 | ```{r message=FALSE}
145 | library("seqinr")
146 | library("Biostrings")
147 | library("MASS")
148 | library("GenomicRanges")
149 | library("dndscv")
150 | data("dataset_normalskin", package="dndscv")
151 | data("dataset_normalskin_genes", package="dndscv")
152 | dndsskin_2r = dndscv(m, gene_list=target_genes, max_muts_per_gene_per_sample = Inf, max_coding_muts_per_sample = Inf, sm = "2r_3w")
153 | print(dndsskin_2r$mle_submodel)
154 | sel_cv = dndsskin_2r$sel_cv
155 | print(head(sel_cv[sel_cv$qglobal_cv<0.1, c(1:10,16,17)]), digits = 3)
156 | ```
157 | 
158 | In general, the full trinucleotide model is recommended for cancer genomic datasets as it typically provides the least biased dN/dS estimates. The impact of using simplistic mutation models can be considerable on global dN/dS ratios (see Martincorena *et al*., 2017), and can lead to false signals of negative or positive selection. In general, the impact of simple substitution models on gene-level inferences of selection tends to be smaller.
159 | 
160 | ###Additional references
161 | * Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*. 171(5):1029-1041.
162 | * Goldman N, Yang Z. (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. *Molecular biology and evolution*. 11:725-736.
163 | * Greenman C, *et al*. (2006) Statistical analysis of pathogenicity of somatic mutations in cancer. *Genetics*. 173(4):2187-98.
164 | * Wong CC, Martincorena I, *et al*. (2014) Inactivating CUX1 mutations promote tumorigenesis. *Nature Genetics*. 46(1):33-8.
165 | * Martincorena I, *et al*. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. *Science*. 348(6237):880-6.
166 | 


--------------------------------------------------------------------------------
/inst/extdata/BioMart_human_GRCh37_chr3_segment.txt:
--------------------------------------------------------------------------------
  1 | gene.id	gene.name	cds.id	chr	chr.coding.start	chr.coding.end	cds.start	cds.end	length	strand
  2 | ENSG00000121879	PIK3CA	ENSP00000418145	3	116615	116677	1	63	63	1
  3 | ENSG00000121879	PIK3CA	ENSP00000263967	3	116615	116966	1	352	3207	1
  4 | ENSG00000121879	PIK3CA	ENSP00000263967	3	117479	117688	353	562	3207	1
  5 | ENSG00000121879	PIK3CA	ENSP00000263967	3	119079	119329	563	813	3207	1
  6 | ENSG00000121879	PIK3CA	ENSP00000263967	3	121333	121578	814	1059	3207	1
  7 | ENSG00000121879	PIK3CA	ENSP00000263967	3	122292	122377	1060	1145	3207	1
  8 | ENSG00000121879	PIK3CA	ENSP00000263967	3	127384	127489	1146	1251	3207	1
  9 | ENSG00000121879	PIK3CA	ENSP00000263967	3	127975	128127	1252	1404	3207	1
 10 | ENSG00000121879	PIK3CA	ENSP00000263967	3	128220	128354	1405	1539	3207	1
 11 | ENSG00000121879	PIK3CA	ENSP00000263967	3	135999	136123	1540	1664	3207	1
 12 | ENSG00000121879	PIK3CA	ENSP00000263967	3	136985	137066	1665	1746	3207	1
 13 | ENSG00000121879	PIK3CA	ENSP00000263967	3	137360	137524	1747	1911	3207	1
 14 | ENSG00000121879	PIK3CA	ENSP00000263967	3	137738	137841	1912	2015	3207	1
 15 | ENSG00000121879	PIK3CA	ENSP00000263967	3	138775	138946	2016	2187	3207	1
 16 | ENSG00000121879	PIK3CA	ENSP00000263967	3	141870	141976	2188	2294	3207	1
 17 | ENSG00000121879	PIK3CA	ENSP00000263967	3	142489	142610	2295	2416	3207	1
 18 | ENSG00000121879	PIK3CA	ENSP00000263967	3	143751	143829	2417	2495	3207	1
 19 | ENSG00000121879	PIK3CA	ENSP00000263967	3	147061	147231	2496	2666	3207	1
 20 | ENSG00000121879	PIK3CA	ENSP00000263967	3	147793	147910	2667	2784	3207	1
 21 | ENSG00000121879	PIK3CA	ENSP00000263967	3	148014	148165	2785	2936	3207	1
 22 | ENSG00000121879	PIK3CA	ENSP00000263967	3	151883	152153	2937	3207	3207	1
 23 | ENSG00000121879	PIK3CA	ENSP00000417479	3	116615	116968	1	354	354	1
 24 | ENSG00000171121	KCNMB3	ENSP00000417091	3	184438	184499	1	62	522	-1
 25 | ENSG00000171121	KCNMB3	ENSP00000417091	3	168532	168723	63	254	522	-1
 26 | ENSG00000171121	KCNMB3	ENSP00000417091	3	162284	162482	255	453	522	-1
 27 | ENSG00000171121	KCNMB3	ENSP00000417091	3	157785	157853	454	522	522	-1
 28 | ENSG00000171121	KCNMB3	ENSP00000376451	3	162284	162482	249	447	828	-1
 29 | ENSG00000171121	KCNMB3	ENSP00000376451	3	168532	168779	1	248	828	-1
 30 | ENSG00000171121	KCNMB3	ENSP00000376451	3	160693	161073	448	828	828	-1
 31 | ENSG00000171121	KCNMB3	ENSP00000319370	3	168532	168723	69	260	840	-1
 32 | ENSG00000171121	KCNMB3	ENSP00000319370	3	162284	162482	261	459	840	-1
 33 | ENSG00000171121	KCNMB3	ENSP00000319370	3	168825	168892	1	68	840	-1
 34 | ENSG00000171121	KCNMB3	ENSP00000319370	3	160693	161073	460	840	840	-1
 35 | ENSG00000171121	KCNMB3	ENSP00000327866	3	168532	168723	63	254	834	-1
 36 | ENSG00000171121	KCNMB3	ENSP00000327866	3	162284	162482	255	453	834	-1
 37 | ENSG00000171121	KCNMB3	ENSP00000327866	3	160693	161073	454	834	834	-1
 38 | ENSG00000171121	KCNMB3	ENSP00000327866	3	184438	184499	1	62	834	-1
 39 | ENSG00000171121	KCNMB3	ENSP00000418536	3	168532	168723	3	194	774	-1
 40 | ENSG00000171121	KCNMB3	ENSP00000418536	3	162284	162482	195	393	774	-1
 41 | ENSG00000171121	KCNMB3	ENSP00000418536	3	176737	176738	1	2	774	-1
 42 | ENSG00000171121	KCNMB3	ENSP00000418536	3	160693	161073	394	774	774	-1
 43 | ENSG00000121864	ZNF639	ENSP00000417740	3	246083	246140	1	58	1458	1
 44 | ENSG00000121864	ZNF639	ENSP00000417740	3	247407	247517	59	169	1458	1
 45 | ENSG00000121864	ZNF639	ENSP00000417740	3	250778	250912	170	304	1458	1
 46 | ENSG00000121864	ZNF639	ENSP00000417740	3	251058	252211	305	1458	1458	1
 47 | ENSG00000121864	ZNF639	ENSP00000418870	3	246083	246140	1	58	649	1
 48 | ENSG00000121864	ZNF639	ENSP00000418870	3	247407	247517	59	169	649	1
 49 | ENSG00000121864	ZNF639	ENSP00000418870	3	250778	250912	170	304	649	1
 50 | ENSG00000121864	ZNF639	ENSP00000418870	3	251058	251402	305	649	649	1
 51 | ENSG00000121864	ZNF639	ENSP00000418628	3	246083	246140	1	58	412	1
 52 | ENSG00000121864	ZNF639	ENSP00000418628	3	247407	247517	59	169	412	1
 53 | ENSG00000121864	ZNF639	ENSP00000418628	3	250778	250912	170	304	412	1
 54 | ENSG00000121864	ZNF639	ENSP00000418628	3	251058	251165	305	412	412	1
 55 | ENSG00000121864	ZNF639	ENSP00000325634	3	246083	246140	1	58	1458	1
 56 | ENSG00000121864	ZNF639	ENSP00000325634	3	247407	247517	59	169	1458	1
 57 | ENSG00000121864	ZNF639	ENSP00000325634	3	250778	250912	170	304	1458	1
 58 | ENSG00000121864	ZNF639	ENSP00000325634	3	251058	252211	305	1458	1458	1
 59 | ENSG00000121864	ZNF639	ENSP00000419650	3	246083	246140	1	58	865	1
 60 | ENSG00000121864	ZNF639	ENSP00000419650	3	247407	247517	59	169	865	1
 61 | ENSG00000121864	ZNF639	ENSP00000419650	3	250778	250912	170	304	865	1
 62 | ENSG00000121864	ZNF639	ENSP00000419650	3	251058	251618	305	865	865	1
 63 | ENSG00000121864	ZNF639	ENSP00000418766	3	246083	246140	1	58	1458	1
 64 | ENSG00000121864	ZNF639	ENSP00000418766	3	247407	247517	59	169	1458	1
 65 | ENSG00000121864	ZNF639	ENSP00000418766	3	250778	250912	170	304	1458	1
 66 | ENSG00000121864	ZNF639	ENSP00000418766	3	251058	252211	305	1458	1458	1
 67 | ENSG00000121864	ZNF639	ENSP00000417232	3	246083	246140	1	58	390	1
 68 | ENSG00000121864	ZNF639	ENSP00000417232	3	247407	247517	59	169	390	1
 69 | ENSG00000121864	ZNF639	ENSP00000417232	3	250778	250912	170	304	390	1
 70 | ENSG00000121864	ZNF639	ENSP00000417232	3	251058	251143	305	390	390	1
 71 | ENSG00000171109	MFN1	ENSP00000420617	3	266641	266752	1	112	2226	1
 72 | ENSG00000171109	MFN1	ENSP00000420617	3	269689	269824	113	248	2226	1
 73 | ENSG00000171109	MFN1	ENSP00000420617	3	276629	276791	249	411	2226	1
 74 | ENSG00000171109	MFN1	ENSP00000420617	3	280147	280271	412	536	2226	1
 75 | ENSG00000171109	MFN1	ENSP00000420617	3	282086	282194	537	645	2226	1
 76 | ENSG00000171109	MFN1	ENSP00000420617	3	282907	283014	646	753	2226	1
 77 | ENSG00000171109	MFN1	ENSP00000420617	3	285228	285381	754	907	2226	1
 78 | ENSG00000171109	MFN1	ENSP00000420617	3	285825	285892	908	975	2226	1
 79 | ENSG00000171109	MFN1	ENSP00000420617	3	293009	293130	976	1097	2226	1
 80 | ENSG00000171109	MFN1	ENSP00000420617	3	294831	294957	1098	1224	2226	1
 81 | ENSG00000171109	MFN1	ENSP00000420617	3	295133	295237	1225	1329	2226	1
 82 | ENSG00000171109	MFN1	ENSP00000420617	3	296130	296232	1330	1432	2226	1
 83 | ENSG00000171109	MFN1	ENSP00000420617	3	296374	296603	1433	1662	2226	1
 84 | ENSG00000171109	MFN1	ENSP00000420617	3	303358	303510	1663	1815	2226	1
 85 | ENSG00000171109	MFN1	ENSP00000420617	3	304222	304418	1816	2012	2226	1
 86 | ENSG00000171109	MFN1	ENSP00000420617	3	307793	307927	2013	2147	2226	1
 87 | ENSG00000171109	MFN1	ENSP00000420617	3	309770	309848	2148	2226	2226	1
 88 | ENSG00000171109	MFN1	ENSP00000419134	3	266641	266752	1	112	464	1
 89 | ENSG00000171109	MFN1	ENSP00000419134	3	269689	269824	113	248	464	1
 90 | ENSG00000171109	MFN1	ENSP00000419134	3	276629	276791	249	411	464	1
 91 | ENSG00000171109	MFN1	ENSP00000419134	3	280147	280199	412	464	464	1
 92 | ENSG00000171109	MFN1	ENSP00000263969	3	269689	269824	113	248	2226	1
 93 | ENSG00000171109	MFN1	ENSP00000263969	3	276629	276791	249	411	2226	1
 94 | ENSG00000171109	MFN1	ENSP00000263969	3	280147	280271	412	536	2226	1
 95 | ENSG00000171109	MFN1	ENSP00000263969	3	282086	282194	537	645	2226	1
 96 | ENSG00000171109	MFN1	ENSP00000263969	3	282907	283014	646	753	2226	1
 97 | ENSG00000171109	MFN1	ENSP00000263969	3	285228	285381	754	907	2226	1
 98 | ENSG00000171109	MFN1	ENSP00000263969	3	285825	285892	908	975	2226	1
 99 | ENSG00000171109	MFN1	ENSP00000263969	3	293009	293130	976	1097	2226	1
100 | ENSG00000171109	MFN1	ENSP00000263969	3	294831	294957	1098	1224	2226	1
101 | ENSG00000171109	MFN1	ENSP00000263969	3	295133	295237	1225	1329	2226	1
102 | ENSG00000171109	MFN1	ENSP00000263969	3	296130	296232	1330	1432	2226	1
103 | ENSG00000171109	MFN1	ENSP00000263969	3	296374	296603	1433	1662	2226	1
104 | ENSG00000171109	MFN1	ENSP00000263969	3	303358	303510	1663	1815	2226	1
105 | ENSG00000171109	MFN1	ENSP00000263969	3	304222	304418	1816	2012	2226	1
106 | ENSG00000171109	MFN1	ENSP00000263969	3	307793	307927	2013	2147	2226	1
107 | ENSG00000171109	MFN1	ENSP00000263969	3	266641	266752	1	112	2226	1
108 | ENSG00000171109	MFN1	ENSP00000263969	3	309770	309848	2148	2226	2226	1
109 | ENSG00000171109	MFN1	ENSP00000420148	3	282086	282194	96	204	372	1
110 | ENSG00000171109	MFN1	ENSP00000420148	3	282907	283014	205	312	372	1
111 | ENSG00000171109	MFN1	ENSP00000420148	3	280177	280271	1	95	372	1
112 | ENSG00000171109	MFN1	ENSP00000420148	3	285228	285287	313	372	372	1
113 | ENSG00000171109	MFN1	ENSP00000419926	3	280147	280271	1	125	1482	1
114 | ENSG00000171109	MFN1	ENSP00000419926	3	282086	282194	126	234	1482	1
115 | ENSG00000171109	MFN1	ENSP00000419926	3	282907	283014	235	342	1482	1
116 | ENSG00000171109	MFN1	ENSP00000419926	3	285228	285381	343	496	1482	1
117 | ENSG00000171109	MFN1	ENSP00000419926	3	285825	285892	497	564	1482	1
118 | ENSG00000171109	MFN1	ENSP00000419926	3	293009	293130	565	686	1482	1
119 | ENSG00000171109	MFN1	ENSP00000419926	3	294831	294957	687	813	1482	1
120 | ENSG00000171109	MFN1	ENSP00000419926	3	295133	295237	814	918	1482	1
121 | ENSG00000171109	MFN1	ENSP00000419926	3	303358	303510	919	1071	1482	1
122 | ENSG00000171109	MFN1	ENSP00000419926	3	304222	304418	1072	1268	1482	1
123 | ENSG00000171109	MFN1	ENSP00000419926	3	307793	307927	1269	1403	1482	1
124 | ENSG00000171109	MFN1	ENSP00000419926	3	309770	309848	1404	1482	1482	1
125 | ENSG00000171109	MFN1	ENSP00000280653	3	266641	266752	1	112	1893	1
126 | ENSG00000171109	MFN1	ENSP00000280653	3	269689	269824	113	248	1893	1
127 | ENSG00000171109	MFN1	ENSP00000280653	3	276629	276791	249	411	1893	1
128 | ENSG00000171109	MFN1	ENSP00000280653	3	280147	280271	412	536	1893	1
129 | ENSG00000171109	MFN1	ENSP00000280653	3	282086	282194	537	645	1893	1
130 | ENSG00000171109	MFN1	ENSP00000280653	3	282907	283014	646	753	1893	1
131 | ENSG00000171109	MFN1	ENSP00000280653	3	285228	285381	754	907	1893	1
132 | ENSG00000171109	MFN1	ENSP00000280653	3	285825	285892	908	975	1893	1
133 | ENSG00000171109	MFN1	ENSP00000280653	3	293009	293130	976	1097	1893	1
134 | ENSG00000171109	MFN1	ENSP00000280653	3	294831	294957	1098	1224	1893	1
135 | ENSG00000171109	MFN1	ENSP00000280653	3	295133	295237	1225	1329	1893	1
136 | ENSG00000171109	MFN1	ENSP00000280653	3	303358	303510	1330	1482	1893	1
137 | ENSG00000171109	MFN1	ENSP00000280653	3	304222	304418	1483	1679	1893	1
138 | ENSG00000171109	MFN1	ENSP00000280653	3	307793	307927	1680	1814	1893	1
139 | ENSG00000171109	MFN1	ENSP00000280653	3	309770	309848	1815	1893	1893	1
140 | ENSG00000114450	GNB4	ENSP00000232564	3	343933	343989	1	57	1023	-1
141 | ENSG00000114450	GNB4	ENSP00000232564	3	338678	338716	58	96	1023	-1
142 | ENSG00000114450	GNB4	ENSP00000232564	3	337188	337294	97	203	1023	-1
143 | ENSG00000114450	GNB4	ENSP00000232564	3	334282	334345	204	267	1023	-1
144 | ENSG00000114450	GNB4	ENSP00000232564	3	332674	332836	268	430	1023	-1
145 | ENSG00000114450	GNB4	ENSP00000232564	3	331504	331570	431	497	1023	-1
146 | ENSG00000114450	GNB4	ENSP00000232564	3	331201	331402	498	699	1023	-1
147 | ENSG00000114450	GNB4	ENSP00000232564	3	322979	323195	700	916	1023	-1
148 | ENSG00000114450	GNB4	ENSP00000232564	3	319002	319108	917	1023	1023	-1
149 | ENSG00000114450	GNB4	ENSP00000420066	3	332674	332836	35	197	496	-1
150 | ENSG00000114450	GNB4	ENSP00000420066	3	331504	331570	198	264	496	-1
151 | ENSG00000114450	GNB4	ENSP00000420066	3	331201	331402	265	466	496	-1
152 | ENSG00000114450	GNB4	ENSP00000420066	3	334282	334315	1	34	496	-1
153 | ENSG00000114450	GNB4	ENSP00000420066	3	319079	319108	467	496	496	-1
154 | ENSG00000114450	GNB4	ENSP00000419693	3	343933	343989	1	57	1023	-1
155 | ENSG00000114450	GNB4	ENSP00000419693	3	338678	338716	58	96	1023	-1
156 | ENSG00000114450	GNB4	ENSP00000419693	3	337188	337294	97	203	1023	-1
157 | ENSG00000114450	GNB4	ENSP00000419693	3	334282	334345	204	267	1023	-1
158 | ENSG00000114450	GNB4	ENSP00000419693	3	332674	332836	268	430	1023	-1
159 | ENSG00000114450	GNB4	ENSP00000419693	3	331504	331570	431	497	1023	-1
160 | ENSG00000114450	GNB4	ENSP00000419693	3	331201	331402	498	699	1023	-1
161 | ENSG00000114450	GNB4	ENSP00000419693	3	322979	323195	700	916	1023	-1
162 | ENSG00000114450	GNB4	ENSP00000419693	3	319002	319108	917	1023	1023	-1
163 | ENSG00000114450	GNB4	ENSP00000420606	3	343933	343989	1	57	239	-1
164 | ENSG00000114450	GNB4	ENSP00000420606	3	338678	338716	58	96	239	-1
165 | ENSG00000114450	GNB4	ENSP00000420606	3	337188	337294	97	203	239	-1
166 | ENSG00000114450	GNB4	ENSP00000420606	3	334310	334345	204	239	239	-1
167 | ENSG00000136518	ACTL6A	ENSP00000397552	3	480882	480906	1	25	1290	1
168 | ENSG00000136518	ACTL6A	ENSP00000397552	3	487613	487689	26	102	1290	1
169 | ENSG00000136518	ACTL6A	ENSP00000397552	3	487856	488030	103	277	1290	1
170 | ENSG00000136518	ACTL6A	ENSP00000397552	3	491158	491258	278	378	1290	1
171 | ENSG00000136518	ACTL6A	ENSP00000397552	3	492159	492256	379	476	1290	1
172 | ENSG00000136518	ACTL6A	ENSP00000397552	3	494006	494100	477	571	1290	1
173 | ENSG00000136518	ACTL6A	ENSP00000397552	3	494409	494515	572	678	1290	1
174 | ENSG00000136518	ACTL6A	ENSP00000397552	3	494613	494702	679	768	1290	1
175 | ENSG00000136518	ACTL6A	ENSP00000397552	3	498429	498490	769	830	1290	1
176 | ENSG00000136518	ACTL6A	ENSP00000397552	3	498683	498797	831	945	1290	1
177 | ENSG00000136518	ACTL6A	ENSP00000397552	3	498929	499009	946	1026	1290	1
178 | ENSG00000136518	ACTL6A	ENSP00000394014	3	491158	491258	152	252	1164	1
179 | ENSG00000136518	ACTL6A	ENSP00000394014	3	492159	492256	253	350	1164	1
180 | ENSG00000136518	ACTL6A	ENSP00000394014	3	494006	494100	351	445	1164	1
181 | ENSG00000136518	ACTL6A	ENSP00000394014	3	494409	494515	446	552	1164	1
182 | ENSG00000136518	ACTL6A	ENSP00000394014	3	494613	494702	553	642	1164	1
183 | ENSG00000136518	ACTL6A	ENSP00000394014	3	498429	498490	643	704	1164	1
184 | ENSG00000136518	ACTL6A	ENSP00000394014	3	498683	498797	705	819	1164	1
185 | ENSG00000136518	ACTL6A	ENSP00000394014	3	498929	499009	820	900	1164	1
186 | ENSG00000136518	ACTL6A	ENSP00000394014	3	487880	488030	1	151	1164	1
187 | ENSG00000136518	ACTL6A	ENSP00000420153	3	487880	488030	1	151	151	1
188 | ENSG00000136518	ACTL6A	ENSP00000376430	3	491158	491258	152	252	1164	1
189 | ENSG00000136518	ACTL6A	ENSP00000376430	3	492159	492256	253	350	1164	1
190 | ENSG00000136518	ACTL6A	ENSP00000376430	3	494006	494100	351	445	1164	1
191 | ENSG00000136518	ACTL6A	ENSP00000376430	3	494409	494515	446	552	1164	1
192 | ENSG00000136518	ACTL6A	ENSP00000376430	3	494613	494702	553	642	1164	1
193 | ENSG00000136518	ACTL6A	ENSP00000376430	3	498429	498490	643	704	1164	1
194 | ENSG00000136518	ACTL6A	ENSP00000376430	3	498683	498797	705	819	1164	1
195 | ENSG00000136518	ACTL6A	ENSP00000376430	3	498929	499009	820	900	1164	1
196 | ENSG00000136518	ACTL6A	ENSP00000376430	3	487880	488030	1	151	1164	1
197 | 


--------------------------------------------------------------------------------
/inst/extdata/chr3_segment.fa.fai:
--------------------------------------------------------------------------------
1 | 3	500001	23	500001	500002
2 | 


--------------------------------------------------------------------------------
/inst/extdata/refcds_example_chr3_segment.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/inst/extdata/refcds_example_chr3_segment.rda


--------------------------------------------------------------------------------
/man/buildcodon.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/buildcodon.R
 3 | \name{buildcodon}
 4 | \alias{buildcodon}
 5 | \title{buildcodon}
 6 | \usage{
 7 | buildcodon(refcds, numcode = 1)
 8 | }
 9 | \arguments{
10 | \item{refcds}{Input RefCDS object}
11 | 
12 | \item{numcode}{NCBI genetic code number (default = 1; standard genetic code). To see the list of genetic codes supported use: ? seqinr::translate}
13 | }
14 | \description{
15 | This function takes a RefCDS object as input and adds to it two fields required to run the codondnds function. Usage: RefCDS = buildcodon(RefCDS)
16 | }
17 | \details{
18 | Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041.
19 | }
20 | \author{
21 | Inigo Martincorena (Wellcome Sanger Institute)
22 | }
23 | 


--------------------------------------------------------------------------------
/man/buildref.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/buildref.R
 3 | \name{buildref}
 4 | \alias{buildref}
 5 | \title{buildref}
 6 | \usage{
 7 | buildref(
 8 |   cdsfile,
 9 |   genomefile,
10 |   outfile = "RefCDS.rda",
11 |   numcode = 1,
12 |   excludechrs = NULL,
13 |   onlychrs = NULL,
14 |   useids = F
15 | )
16 | }
17 | \arguments{
18 | \item{cdsfile}{Path to the reference transcript table.}
19 | 
20 | \item{genomefile}{Path to the indexed reference genome file.}
21 | 
22 | \item{outfile}{Output file name (default = "RefCDS.rda").}
23 | 
24 | \item{numcode}{NCBI genetic code number (default = 1; standard genetic code). To see the list of genetic codes supported use: ? seqinr::translate}
25 | 
26 | \item{excludechrs}{Vector or string with chromosome names to be excluded from the RefCDS object (default: no chromosome will be excluded). The mitochondrial chromosome should be excluded as it has different genetic code and mutation rates, either using the excludechrs argument or not including mitochondrial transcripts in cdsfile.}
27 | 
28 | \item{onlychrs}{Vector of valid chromosome names (default: all chromosomes will be included)}
29 | 
30 | \item{useids}{Combine gene IDs and gene names (columns 1 and 2 of the input table) as long gene names (default = F)}
31 | }
32 | \description{
33 | Function to build a RefCDS object from a reference genome and a table of transcripts. The RefCDS object has to be precomputed for any new species or assembly prior to running dndscv. This function generates an .rda file that needs to be input into dndscv using the refdb argument. Note that when multiple CDS share the same gene name (second column of cdsfile), the longest coding CDS will be chosen for the gene. CDS with ambiguous bases (N) will not be considered.
34 | }
35 | \details{
36 | Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041.
37 | }
38 | \author{
39 | Inigo Martincorena (Wellcome Sanger Institute)
40 | }
41 | 


--------------------------------------------------------------------------------
/man/codondnds.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/codondnds.R
 3 | \name{codondnds}
 4 | \alias{codondnds}
 5 | \title{codondnds}
 6 | \usage{
 7 | codondnds(
 8 |   dndsout,
 9 |   refcds,
10 |   min_recurr = 2,
11 |   gene_list = NULL,
12 |   codon_list = NULL,
13 |   theta_option = "conservative",
14 |   syn_drivers = "TP53:T125T",
15 |   method = "NB",
16 |   numbins = 10000
17 | )
18 | }
19 | \arguments{
20 | \item{dndsout}{Output object from dndscv.}
21 | 
22 | \item{refcds}{RefCDS object annotated with codon-level information using the buildcodon function.}
23 | 
24 | \item{min_recurr}{Minimum number of mutations per codon to estimate codon-wise dN/dS ratios. [default=2]}
25 | 
26 | \item{gene_list}{List of genes to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of genes is decided a priori. [default=NULL, codondnds will be run on all genes in dndsout]}
27 | 
28 | \item{codon_list}{List of hotspot codons to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of codons is decided a priori. [default=NULL, codondnds will be run on all genes in dndsout]}
29 | 
30 | \item{theta_option}{2 options: "mle" (uses the MLE of the negative binomial size parameter) or "conservative" (uses the lower bound of the CI95). Values other than "mle" will lead to the conservative option. [default="conservative"]}
31 | 
32 | \item{syn_drivers}{Vector with a list of known synonymous driver mutations to exclude from the background model [default="TP53:T125T"]. See Martincorena et al., Cell, 2017 (PMID:29056346).}
33 | 
34 | \item{method}{Overdispersion model: NB = Negative Binomial (Gamma-Poisson), LNP = Poisson-Lognormal (see Hess et al., BiorXiv, 2019). [default="NB"]}
35 | 
36 | \item{numbins}{Number of bins to discretise the rvec vector [default=1e4]. This enables fast execution of the LNP model in datasets of arbitrary size.}
37 | }
38 | \value{
39 | 'codondnds' returns a table of recurrently mutated codons and the estimates of the size parameter:
40 | 
41 | - recurcodons: Table of recurrently mutated codons with codon-wise dN/dS values and p-values
42 | 
43 | - recurcodons_ext: The same table of recurrently mutated codons, but including additional information on the contribution of different changes within a codon.
44 | 
45 | - overdisp: Maximum likelihood estimate and CI95% for the overdispersion parameter (the size parameter of the negative binomial distribution or the sigma parameter of the lognormal distribution). The lower the size value or the higher the sigma value the higher the variation of the mutation rate across codons not captured by the trinucleotide change or by variation across genes.
46 | 
47 | - LL: Log-likelihood of the fit of the overdispersed model (see "method" argument) to all synonymous sites at a codon level.
48 | }
49 | \description{
50 | Function to estimate codon-wise dN/dS values and p-values against neutrality. To generate a valid RefCDS input object for this function, use the buildcodon function. Note that recurrent artefacts or SNP contamination can violate the null model and dominate the list of codons under apparent selection. Be very critical of the results and if suspicious codons appear recurrently mutated consider refining the variant calling (e.g. using a better unmatched normal panel).
51 | }
52 | \author{
53 | Inigo Martincorena (Wellcome Sanger Institute)
54 | }
55 | 


--------------------------------------------------------------------------------
/man/dndscv-package.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/dndscv-package.R
 3 | \docType{package}
 4 | \name{dndscv-package}
 5 | \alias{dndscv-package}
 6 | \title{Detection of selection in cancer and somatic evolution}
 7 | \description{
 8 | Detection of selection in cancer and somatic evolution
 9 | }
10 | \details{
11 | The dNdScv R package is a suite of maximum-likelihood dN/dS methods designed
12 | to quantify selection in cancer and somatic evolution (Martincorena et al., 2017).
13 | The package contains functions to quantify dN/dS ratios for missense, nonsense
14 | and essential splice mutations, at the level of individual genes, groups of
15 | genes or at whole-genome level. The dNdScv method was designed to detect cancer
16 | driver genes (i.e. genes under positive selection in cancer) on datasets ranging
17 | from a few samples to thousands of samples, in whole-exome/genome or targeted
18 | sequencing studies.
19 | }
20 | \references{
21 | Martincorena I, et al. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. Cell. 171(5):1029-1041.
22 | }
23 | \seealso{
24 | \code{\link{dndscv}}
25 | 
26 | \code{\link{buildref}}
27 | }
28 | \author{
29 | Inigo Martincorena, Wellcome Trust Sanger Institute, \email{im3@sanger.ac.uk}
30 | }
31 | \keyword{package}
32 | 


--------------------------------------------------------------------------------
/man/dndscv.Rd:
--------------------------------------------------------------------------------
  1 | % Generated by roxygen2: do not edit by hand
  2 | % Please edit documentation in R/dndscv.R
  3 | \name{dndscv}
  4 | \alias{dndscv}
  5 | \title{dNdScv}
  6 | \usage{
  7 | dndscv(
  8 |   mutations,
  9 |   gene_list = NULL,
 10 |   refdb = "hg19",
 11 |   sm = "192r_3w",
 12 |   kc = "cgc81",
 13 |   cv = "hg19",
 14 |   max_muts_per_gene_per_sample = 3,
 15 |   max_coding_muts_per_sample = 3000,
 16 |   use_indel_sites = T,
 17 |   min_indels = 5,
 18 |   maxcovs = 20,
 19 |   constrain_wnon_wspl = T,
 20 |   outp = 3,
 21 |   numcode = 1,
 22 |   outmats = F,
 23 |   mingenecovs = 500,
 24 |   onesided = F,
 25 |   dc = NULL
 26 | )
 27 | }
 28 | \arguments{
 29 | \item{mutations}{Table of mutations (5 columns: sampleID, chr, pos, ref, alt). Only list independent events as mutations.}
 30 | 
 31 | \item{gene_list}{List of genes to restrict the analysis (use for targeted sequencing studies)}
 32 | 
 33 | \item{refdb}{Reference database (path to .rda file or a pre-loaded array object in the right format)}
 34 | 
 35 | \item{sm}{Substitution model (precomputed models are available in the data directory)}
 36 | 
 37 | \item{kc}{List of a-priori known cancer genes (to be excluded from the indel background model)}
 38 | 
 39 | \item{cv}{Covariates (a matrix of covariates -columns- for each gene -rows-) [default: reference covariates] [cv=NULL runs dndscv without covariates]}
 40 | 
 41 | \item{max_muts_per_gene_per_sample}{Maximum number of mutations per gene per sample. If n<Inf, only the first n mutations by chromosomal position will be kept, which is an arbitrary choice (default = 3, please set this to Inf to avoid filtering out any mutation)}
 42 | 
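 43 | \item{max_coding_muts_per_sample}{Maximum number of coding mutations per sample (default = 3000). Samples exceeding this number are excluded from the analysis, as hypermutator samples often reduce power to detect selection}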
 44 | 
 45 | \item{use_indel_sites}{Use unique indel sites instead of the total number of indels (default = TRUE, which tends to be more robust for typical cancer or somatic mutation datasets)}
 46 | 
 47 | \item{min_indels}{Minimum number of indels required to run the indel recurrence module}
 48 | 
 49 | \item{maxcovs}{Maximum number of covariates that will be considered (additional columns in the matrix of covariates will be excluded)}
 50 | 
 51 | \item{constrain_wnon_wspl}{This constrains wnon==wspl in the dNdScv model (this typically leads to higher power to detect selection)}
 52 | 
 53 | \item{outp}{Output: 1 = Global dN/dS values; 2 = Global dN/dS and dNdSloc; 3 = Global dN/dS, dNdSloc and dNdScv}
 54 | 
 55 | \item{numcode}{NCBI genetic code number (default = 1; standard genetic code). To see the list of genetic codes supported use: ? seqinr::translate. Note that the same genetic code must be used in the dndscv and buildref functions.}
 56 | 
 57 | \item{outmats}{Output the internal N and L matrices (default = F)}
 58 | 
 59 | \item{mingenecovs}{Minimum number of genes required to run the negative binomial regression model with covariates (default = 500)}
 60 | 
 61 | \item{onesided}{Option to run one-sided positive and negative selection tests per gene (default = FALSE). Note that one-sided tests are only performed for the wnon==wspl model, so using onesided=TRUE will overwrite constrain_wnon_wspl to TRUE.}
 62 | 
 63 | \item{dc}{Duplex coverage per gene. Named Numeric Vector with values reflecting the mean duplex coverage per site per gene, and names corresponding to gene names. Use this argument only when running dNdScv on duplex sequencing data to use gene coverage in the offset of the regression model (default = NULL)}
 64 | }
 65 | \value{
 66 | 'dndscv' returns a list of objects:
 67 | 
 68 | - globaldnds: Global dN/dS estimates across all genes.
 69 | 
 70 | - sel_cv: Gene-wise selection results using dNdScv.
 71 | 
 72 | - sel_loc: Gene-wise selection results using dNdSloc.
 73 | 
 74 | - annotmuts: Annotated coding mutations.
 75 | 
 76 | - genemuts: Observed and expected numbers of mutations per gene.
 77 | 
 78 | - geneindels: Observed and expected numbers of indels per gene.
 79 | 
 80 | - mle_submodel: MLEs of the substitution model.
 81 | 
 82 | - exclsamples: Samples excluded from the analysis.
 83 | 
 84 | - exclmuts: Coding mutations excluded from the analysis.
 85 | 
 86 | - nbreg: Negative binomial regression model for substitutions.
 87 | 
 88 | - nbregind: Negative binomial regression model for indels.
 89 | 
 90 | - poissmodel: Poisson regression model used to fit the substitution model and the global dNdS values.
 91 | 
 92 | - wrongmuts: Table of input mutations with a wrong annotation of the reference base (if any).
 93 | }
 94 | \description{
 95 | Analyses of selection using the dNdScv and dNdSloc models. Default parameters typically increase the performance of the method on cancer genomic studies. Default arguments use the GRCh37/hg19 version of the human genome. To run dNdScv on other assemblies or species see the buildref function and the dndscv_data GitHub repository.
 96 | }
 97 | \details{
 98 | Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041.
 99 | }
100 | \author{
101 | Inigo Martincorena (Wellcome Sanger Institute)
102 | }
103 | 


--------------------------------------------------------------------------------
/man/fitlnpbin.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/fitlnpbin.R
 3 | \name{fitlnpbin}
 4 | \alias{fitlnpbin}
 5 | \title{fitlnpbin}
 6 | \usage{
 7 | fitlnpbin(
 8 |   nvec,
 9 |   rvec,
10 |   level = 0.95,
11 |   theta_option = "conservative",
12 |   numbins = 10000
13 | )
14 | }
15 | \arguments{
16 | \item{nvec}{Vector of observed counts of mutations per site.}
17 | 
18 | \item{rvec}{Vector of expected counts of mutations per site.}
19 | 
20 | \item{level}{Confidence level desired for the confidence interval of the overdispersion parameter [default=0.95]}
21 | 
22 | \item{theta_option}{2 options: "mle" (uses the MLE of the overdispersion parameter) or "conservative" (uses the conservative bound of the CI95). Values other than "mle" will lead to the conservative option [default="conservative"]}
23 | 
24 | \item{numbins}{Number of bins to discretise the rvec vector [default=1e4]. This enables fast execution of the LNP model in datasets of arbitrary size.}
25 | }
26 | \value{
27 | 'fitlnpbin' returns the maximum likelihood estimate and confidence intervals of the "sig" overdispersion parameter of the LNP model.
28 | }
29 | \description{
30 | Function to fit a Lognormal-Poisson model to estimate overdispersion on synonymous changes for sitednds and codondnds.
31 | }
32 | \author{
33 | Inigo Martincorena (Wellcome Sanger Institute)
34 | }
35 | 


--------------------------------------------------------------------------------
/man/geneci.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/geneci.R
 3 | \name{geneci}
 4 | \alias{geneci}
 5 | \title{geneci}
 6 | \usage{
 7 | geneci(dndsout, gene_list = NULL, level = 0.95)
 8 | }
 9 | \arguments{
10 | \item{dndsout}{Output object from dndscv.}
11 | 
12 | \item{gene_list}{List of genes to restrict the analysis (by default, all genes in dndsout will be analysed)}
13 | 
14 | \item{level}{Confidence level desired [default = 0.95]}
15 | }
16 | \value{
17 | ci: Dataframe with the confidence intervals for dN/dS ratios per gene under the dNdScv model.
18 | }
19 | \description{
20 | Function to calculate confidence intervals for dN/dS values per gene under the dNdScv model using profile likelihood. To generate a valid dndsout input object for this function, use outmats=T when running dndscv.
21 | }
22 | \details{
23 | Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041.
24 | }
25 | \author{
26 | Inigo Martincorena (Wellcome Sanger Institute)
27 | }
28 | 


--------------------------------------------------------------------------------
/man/genesetdnds.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/genesetdnds.R
 3 | \name{genesetdnds}
 4 | \alias{genesetdnds}
 5 | \title{genesetdnds}
 6 | \usage{
 7 | genesetdnds(dndsout, gene_list, sm = "192r_3w")
 8 | }
 9 | \arguments{
10 | \item{dndsout}{Output object from dndscv. To generate a valid input object for this function, use outmats=T when running dndscv.}
11 | 
12 | \item{gene_list}{List of genes to restrict the analysis (gene set).}
13 | 
14 | \item{sm}{Substitution model (precomputed models are available in the data directory)}
15 | }
16 | \value{
17 | 'genesetdnds' returns a list of objects:
18 | 
19 | globaldnds_geneset: Global dN/dS estimates in the gene set, including Wald CI95%, Wald p-values and LRT (recommended) p-values.
20 | 
21 | globaldnds_rest: Global dN/dS estimates in all other genes, including Wald CI95%, Wald p-values and LRT (recommended) p-values.
22 | }
23 | \description{
24 | Function to estimate global dN/dS values for a gene set when using whole-exome
25 | data. Global dN/dS values for a set of genes can also be obtained using dndscv
26 | (gene_list argument), but that option estimates the trinucleotide mutation rates
27 | exclusively from the list of genes of interest. This may be preferable in large
28 | datasets, but in small datasets, the genesetdnds option estimates global dN/dS
29 | values for a set of genes while using all genes in the exome to fit the
30 | substitution model. Usage: genesetdnds(dndsout, gene_list).
31 | }
32 | \details{
33 | Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041.
34 | }
35 | \author{
36 | Inigo Martincorena (Wellcome Sanger Institute)
37 | }
38 | 


--------------------------------------------------------------------------------
/man/sitednds.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/sitednds.R
 3 | \name{sitednds}
 4 | \alias{sitednds}
 5 | \title{sitednds}
 6 | \usage{
 7 | sitednds(
 8 |   dndsout,
 9 |   min_recurr = 2,
10 |   gene_list = NULL,
11 |   site_list = NULL,
12 |   trinuc_list = NULL,
13 |   theta_option = "conservative",
14 |   syn_drivers = "TP53:T125T",
15 |   method = "NB",
16 |   numbins = 10000,
17 |   kc = "cgc81"
18 | )
19 | }
20 | \arguments{
21 | \item{dndsout}{Output object from dndscv. To generate a valid input object for this function, use outmats=T when running dndscv.}
22 | 
23 | \item{min_recurr}{Minimum number of mutations per site to estimate site-wise dN/dS ratios. [default=2]}
24 | 
25 | \item{gene_list}{List of genes to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of genes is decided a priori. [default=NULL, sitednds will be run on all genes in dndsout]}
26 | 
27 | \item{site_list}{List of hotspot sites to restrict the p-value and q-value calculations (Restricted Hypothesis Testing). Note that q-values are only valid if the list of sites is decided a priori. [default=NULL, sitednds will be run on all genes in dndsout]}
28 | 
29 | \item{trinuc_list}{List of trinucleotide substitutions to restrict the analysis of sitednds. This is used to estimate separate overdispersion parameters for different substitution contexts [default=NULL, sitednds will be run on all substitution contexts]}
30 | 
31 | \item{theta_option}{2 options: "mle" (uses the MLE of the overdispersion parameter) or "conservative" (uses the conservative bound of the CI95). Values other than "mle" will lead to the conservative option [default="conservative"]}
32 | 
33 | \item{syn_drivers}{Vector with a list of known synonymous driver mutations to exclude from the background model [default="TP53:T125T"]. See Martincorena et al., Cell, 2017 (PMID:29056346).}
34 | 
35 | \item{method}{Overdispersion model: NB = Negative Binomial (Gamma-Poisson), LNP = Poisson-Lognormal (see Hess et al., BiorXiv, 2019). [default="NB"]}
36 | 
37 | \item{numbins}{Number of bins to discretise the rvec vector [default=1e4]. This enables fast execution of the LNP model in datasets of arbitrary size.}
38 | 
39 | \item{kc}{List of a-priori known cancer genes (to be excluded when fitting the background model)}
40 | }
41 | \value{
42 | 'sitednds' returns a table of recurrently mutated sites and the estimates of the size parameter:
43 | 
44 | - recursites: Table of recurrently mutated sites with site-wise dN/dS values and p-values
45 | 
46 | - overdisp: Maximum likelihood estimate and CI95% for the overdispersion parameter (the size parameter of the negative binomial distribution or the sigma parameter of the lognormal distribution). The lower the size value or the higher the sigma value the higher the variation of the mutation rate across sites not captured by the trinucleotide change or by variation across genes.
47 | 
48 | - fpr_nonsyn_q05: Fraction of the significant non-synonymous sites (qval<0.05) that are estimated to be false positives. This assumes that all synonymous mutations (except those in TP53 and CDKN2A) are false positives, thus offering a conservative estimate of the false positive rate.
49 | 
50 | - LL: Log-likelihood of the fit of the overdispersed model (see "method" argument) to all synonymous sites.
51 | }
52 | \description{
53 | Function to estimate site-wise dN/dS values and p-values against neutrality. To generate a valid input object for this function, use outmats=T when running dndscv. This function is in testing, please interpret the results with caution. Also note that recurrent artefacts or SNP contamination can violate the null model and dominate the list of sites under apparent selection. A considerable number of significant synonymous sites may reflect a problem with the data. Be very critical of the results and if suspicious sites appear recurrently mutated consider refining the variant calling (e.g. using a better unmatched normal panel). In the future, this function may be extended to perform inferences at a codon level instead of at a single-base level.
54 | }
55 | \author{
56 | Inigo Martincorena (Wellcome Sanger Institute)
57 | }
58 | 


--------------------------------------------------------------------------------
/man/withingenednds.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/withingenednds.R
 3 | \name{withingenednds}
 4 | \alias{withingenednds}
 5 | \title{withingenednds}
 6 | \usage{
 7 | withingenednds(
 8 |   mutations,
 9 |   gene,
10 |   covtable,
11 |   dndsout,
12 |   genomeFile,
13 |   regionschr = NULL,
14 |   regionsaa = NULL,
15 |   fixtheta = NULL,
16 |   normalisefromsyn = TRUE,
17 |   syndrivers = NULL,
18 |   exon_flank_length = 10,
19 |   intron_flank_length = 10,
20 |   sitefilename = NULL,
21 |   refdb = "hg19",
22 |   numcode = 1
23 | )
24 | }
25 | \arguments{
26 | \item{mutations}{Data frame with all the mutations detected in the study (5-column input table as for dndscv: sampleID, chr, pos, ref, mut).}
27 | 
28 | \item{gene}{Name of the gene of interest. This function is currently designed to work on a single gene, but combined analyses of multiple genes could be done using the sites output table generated by this function.}
29 | 
30 | \item{covtable}{Table with all sites of interest in the gene. This should be a data frame with one row per site and the following columns: chr, pos, dc (duplex depth). Additional columns will not be used.}
31 | 
32 | \item{dndsout}{dndscv output object for the dataset. This is mainly used for the MLEs of the substitution model. Running dndscv on all genes in the dataset is recommended unless the gene of interest is believed to have a different substitution model.}
33 | 
34 | \item{genomeFile}{Path to a reference fasta file for the genome assembly.}
35 | 
36 | \item{regionschr}{Optional data frame with user-defined regions of interest in the gene. This allows the user to define arbitrary regions within a gene (coding or non-coding) from which to calculate omega (selection or obs/exp) values (e.g. protein domains, splicing regulator regions, core promoters, etc). The table should contain the following columns: chr, start, end, wname (a unique name for the w parameter, e.g. wdomain1, wcorepromoter), impacts (e.g. Missense or Missense|Nonsense will restrict the w calculation with Missense or Missense and Nonsense mutations in the region, respectively), layered (1/0; using "0" removes other w parameters influencing the site, whereas using "1" models selection as relative to other w parameters active at these sites).}
37 | 
38 | \item{regionsaa}{Optional data frame with user-defined regions of interest in the gene, using aminoacid coordinates. The table should contain the following columns: gene, aa_start, aa_end, w feature name (e.g. wdomain1), impacts.}
39 | 
40 | \item{fixtheta}{Pre-calculated overdispersion (theta) parameter. This should be calculated using sitednds(., method="NB").}
41 | 
42 | \item{normalisefromsyn}{Normalise the substitution rates based on the synonymous mutations in the gene. Using TRUE is recommended. Using FALSE uses the expected synonymous mutation rate of the gene from the dndscv negative binomial regression model (dndsout$genemuts).}
43 | 
44 | \item{syndrivers}{Vector of known synonymous driver sites defined by their aminoacid position, to be excluded from the background model (e.g. syndrivers = c("T125T","E224E","Q331Q") for TP53).}
45 | 
46 | \item{exon_flank_length}{Exon flank length in bp [default = 10]. Using a value higher than 0 will calculate a separate selection (w) coefficient for synonymous mutations in exon flanks.}
47 | 
48 | \item{intron_flank_length}{Intron flank length in bp [default = 10]. Intronic sites occurring within these flanks but not already classified as Essential_Splice will receive a separate w parameter.}
49 | 
50 | \item{sitefilename}{Optionally, provide a file name to save the table of all annotated sites in the gene. This table is also always contained in the output object.}
51 | 
52 | \item{refdb}{Reference database (path to .rda file or a pre-loaded array object in the right format).}
53 | 
54 | \item{numcode}{NCBI genetic code number (default = 1; standard genetic code). To see the list of genetic codes supported use: ? seqinr::translate}
55 | }
56 | \value{
57 | 'withingenednds' returns a list of objects:
58 | 
59 | - sites: Table with the annotation of all sites in the gene (from covtable), including all functional annotations in the "regions" input object as well as default annotations (Missense, Nonsense, Essential_Splice, Start_loss, Stop_loss, etc).
60 | 
61 | - par.pois: Poisson regression results (not recommended).
62 | 
63 | - par.nb: Negative binomial results fitting a new overdispersion parameter to the data (when fixtheta is not provided).
64 | 
65 | - par.nbfix: Negative binomial results using the input fixtheta value as recommended.
66 | 
67 | - model.pois: Poisson regression object.
68 | 
69 | - model.nb: Negative binomial regression object.
70 | 
71 | - model.nbfix: Negative binomial regression object.
72 | }
73 | \description{
74 | This function uses Poisson and Negative Binomial regression models at single-site level to study selection across different regions (coding and non-coding) within a gene.
75 | }
76 | \details{
77 | Martincorena I, et al. (2017) Universal patterns of selection in cancer and somatic tissues. Cell. 171(5):1029-1041.
78 | }
79 | \author{
80 | Inigo Martincorena (Wellcome Sanger Institute)
81 | }
82 | 


--------------------------------------------------------------------------------
/vignettes/Ensembl_BioMart_screenshot1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/vignettes/Ensembl_BioMart_screenshot1.png


--------------------------------------------------------------------------------
/vignettes/Ensembl_BioMart_screenshot2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/vignettes/Ensembl_BioMart_screenshot2.png


--------------------------------------------------------------------------------
/vignettes/buildref.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Using dNdScv in a different species or assembly"
  3 | author: "Inigo Martincorena"
  4 | output: 
  5 |   html_document:
  6 |     toc: true
  7 |     toc_float: true
  8 | ---
  9 | 
 10 | By default, **dNdScv** assumes human data with mutations mapped to the GRCh37/hg19 assembly of the human genome. Adapting the method to run with data from other species or genome assemblies requires the generation of a new reference database (RefCDS object). This tutorial explains how to use the *buildref* function provided in the latest version of the package to generate a new reference database. Users interested in analysing GRCh37/hg19 data using different transcripts to those used by default in the *dNdScv* package can also use this function to generate a bespoke RefCDS object with their transcripts of interest.
 11 | 
 12 | Although designed for cancer genomic studies, *dNdScv* could also be used to quantify selection in other resequencing studies, such as SNP analyses, mutation accumulation studies in bacteria or the discovery of mutations causing developmental disorders using data from human trios.
 13 | 
 14 | To cite this package please use:
 15 | Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*. 171(5):1029-1041.
 16 | 
 17 | ### Inputs for the buildref function
 18 | 
 19 | As a small example, in this tutorial I show how to generate a RefCDS object for the protein coding genes in a small (500kb) segment of chromosome 3 in the GRCh37 assembly. This segment contains a few genes including *PIK3CA*. The *buildref* function needs two inputs: (1) the path to a tab-delimited table of transcripts, (2) the path to a fasta file for the reference genome of interest.
 20 | 
 21 | You can find the example files using the code below to retrieve their location in your system (remember to install the latest version of the package):
 22 | 
 23 | ```{r message=FALSE, warning=FALSE}
 24 | path_cds_table = system.file("extdata", "BioMart_human_GRCh37_chr3_segment.txt", package = "dndscv", mustWork = TRUE)
 25 | path_genome_fasta = system.file("extdata", "chr3_segment.fa", package = "dndscv", mustWork = TRUE)
 26 | ```
 27 | 
 28 | We can first load the table of transcripts to see the input format required by *buildref*.
 29 | 
 30 | ```{r message=FALSE, warning=FALSE}
 31 | reftable = read.table(path_cds_table, header=TRUE, sep="\t", stringsAsFactors=FALSE)
 32 | print(reftable[1:21,])
 33 | ```
 34 | 
 35 | The input table of transcripts is a tab-delimited text file. Each row corresponds to an exon (in the case of genes with multiple exons) or to a whole CDS (in the absence of introns):
 36 | 
 37 | 1. **Gene ID**: This column is only used for annotation purposes (it does not need to be a recognised gene ID).
 38 | 2. **Gene name**: This is the gene name that dNdScv will use. If multiple transcripts with the same gene name are provided, *buildref* will use the longest coding CDS as the reference CDS for dNdScv. If you want to use more than one transcript for a given gene in dNdScv, make sure to assign a different name to each using this column.
 39 | 3. **CDS ID**: This should be a unique identifier for each transcript. The default RefCDS in *dNdScv* uses the Ensembl protein ID.
 40 | 4. **Chromosome name**: Note that chromosome names and coordinates must be consistent with the input reference fasta file provided.
 41 | 5. **Chromosomal start**: Starting coordinate (1-based) of the exon.
 42 | 6. **Chromosomal end**: End coordinate (1-based) of the exon.
 43 | 7. **CDS start**: First coding base of the exon.
 44 | 8. **CDS end**: Last coding base of the exon.
 45 | 9. **Length**: Full length of the CDS for this transcript (this is used to identify the longest coding CDS for each gene).
 46 | 10. **Strand**: Coding strand of the transcript.
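
The column layout above can be checked before calling *buildref*. The following is a minimal sketch in base R and is not part of the package: the helper name `check_cds_table` and the toy table are hypothetical, and the specific checks (10 columns, start not larger than end, strand coded as 1 or -1 as in Ensembl BioMart exports) are my assumptions about what a well-formed table looks like.

```r
# Hypothetical helper (not part of dndscv): basic sanity checks on a
# buildref-style transcript table with the 10 columns described above.
check_cds_table <- function(tab) {
  problems <- character(0)
  if (ncol(tab) != 10) {
    # Without the expected layout the remaining checks are meaningless
    return(sprintf("expected 10 columns, found %d", ncol(tab)))
  }
  # Columns 5-6: chromosomal start/end; columns 7-8: CDS start/end
  if (any(tab[[5]] > tab[[6]], na.rm = TRUE)) {
    problems <- c(problems, "chromosomal start > chromosomal end")
  }
  if (any(tab[[7]] > tab[[8]], na.rm = TRUE)) {
    problems <- c(problems, "CDS start > CDS end")
  }
  # Assumption: strand is coded as 1 / -1, as in Ensembl BioMart exports
  if (!all(tab[[10]] %in% c(1, -1))) {
    problems <- c(problems, "strand values other than 1 / -1")
  }
  problems
}

# Toy two-exon transcript with made-up identifiers and coordinates
toy <- data.frame(
  gene.id = c("G1", "G1"), gene.name = c("GENEA", "GENEA"),
  cds.id = c("P1", "P1"), chr = c("3", "3"),
  chr.start = c(100, 300), chr.end = c(150, 380),
  cds.start = c(1, 52), cds.end = c(51, 132),
  length = c(132, 132), strand = c(1, 1),
  stringsAsFactors = FALSE
)
issues <- check_cds_table(toy)
print(issues)  # an empty character vector means no problems were found
```

An empty result only means that these basic checks passed; it does not guarantee that the coordinates are consistent with the fasta file.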
 47 | 
 48 | The *buildref* function was originally designed to use a table of transcripts from Ensembl BioMart as input. You can use the [BioMart website](https://www.ensembl.org/biomart/martview/) to download all coding transcripts from a given genome assembly for a wide range of species. Within BioMart, click on the "Ensembl genes" database and choose your species of interest. Then use the "Attributes" menu on the left and click on the "Structures" option (exon-level information) to select the 10 columns needed by *buildref*. Make sure that your table has all 10 columns in the right order, which you can achieve by selecting each attribute in the right order (see example image below).
 49 | 
 50 | ![ ](Ensembl_BioMart_screenshot1.png)
 51 | 
 52 | Once you have selected all attributes, click on the "Results" tab to export and download the table as a .tsv (tab-delimited) file (see image below). Note that using BioMart will generate a large table with multiple transcripts per gene. The *buildref* function will choose the longest complete coding sequence per gene as the reference transcript. *buildref* currently also discards transcripts with ambiguous bases (e.g. "N"). However, if you are interested in particular transcripts, you can edit the table to contain only the transcript of interest per gene. If you would like *dNdScv* to use more than one transcript for genes that have very different alternative transcripts, such as the *CDKN2A* human gene, you can simply assign a different gene name for each transcript of interest (e.g. *CDKN2A-1*, *CDKN2A-2*). Also note that, for some species, the "Gene name" column may be sparse. If appropriate, users can modify the table to copy the "Gene stable ID" column in the "Gene name" column.  
 53 | 
 54 | ![ ](Ensembl_BioMart_screenshot2.png)
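
The table edits described above (filling a sparse "Gene name" column from the gene ID, or giving alternative transcripts distinct gene names) can be done in a few lines of base R before passing the file to *buildref*. This is only a sketch on a toy table: the identifiers and column names are invented for illustration, and only the first three of the 10 columns are shown.

```r
# Toy table with only the first three columns (gene ID, gene name, CDS ID);
# all identifiers are invented for illustration.
tab <- data.frame(
  gene.id   = c("G1", "G2", "G2", "G3"),
  gene.name = c("GENEA", "GENEB", "GENEB", ""),
  cds.id    = c("P1", "P2", "P3", "P4"),
  stringsAsFactors = FALSE
)

# 1. Fill missing or empty gene names with the gene ID
empty <- is.na(tab$gene.name) | tab$gene.name == ""
tab$gene.name[empty] <- tab$gene.id[empty]

# 2. Give each transcript of GENEB its own gene name (GENEB-1, GENEB-2),
#    so that both are kept instead of only the longest CDS.
#    Rows sharing a CDS ID (exons of one transcript) get the same suffix.
sel <- tab$gene.name == "GENEB"
tab$gene.name[sel] <- paste0("GENEB-", as.integer(factor(tab$cds.id[sel])))

print(tab)
```

The edited table could then be written back as a tab-delimited file, for example with `write.table(tab, "edited_table.txt", sep="\t", quote=FALSE, row.names=FALSE)`, where the file name is a placeholder.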
 55 | 
 56 | Although the *buildref* function was originally designed to use BioMart output tables and fasta files of whole genome assemblies, you can use it in more flexible ways. For example, you could provide a fasta file with a CDS sequence per entry instead of full chromosomes, or you could annotate different protein domains under a different "gene name" to perform inferences of selection at a domain level. Just note that, since dNdScv uses substitution models in a trinucleotide context, buildref will need to access the base before and after each CDS in the input fasta file.
 57 | 
 58 | ### Using buildref
 59 | 
 60 | ```{r message=FALSE, warning=FALSE}
 61 | library(dndscv)
 62 | path_cds_table = system.file("extdata", "BioMart_human_GRCh37_chr3_segment.txt", package = "dndscv", mustWork = TRUE)
 63 | path_genome_fasta = system.file("extdata", "chr3_segment.fa", package = "dndscv", mustWork = TRUE)
 64 | buildref(cdsfile=path_cds_table, genomefile=path_genome_fasta, outfile = "example_output_refcds.rda", excludechrs="MT")
 65 | ```
 66 | 
 67 | The code above will generate a RefCDS object and save it as a file using the *outfile* file name. Some users may be interested in loading the object and examining its structure (```load("example_output_refcds.rda"); print(RefCDS[[1]])```). The RefCDS object is an array of lists with an entry per gene (or, strictly speaking, per unique gene name). It contains information on the sequence of the gene and a table (192 rows by 4 columns) with the precomputed impact of all possible coding mutations in the gene and their trinucleotide context.
 68 | 
 69 | In the example above, I have used the optional argument `excludechrs` as a reminder that genes in the mitochondrial genome should be excluded. The reasons for this are that mitochondrial genes have a different genetic code and that the mitochondrial genome normally has very different mutation rates and requires a different substitution model. When using *buildref*, you should use the arguments "excludechrs" or "onlychrs" to exclude certain chromosomes or to restrict RefCDS to chromosomes of interest, respectively (e.g. including only chromosomes 1-22,X,Y for humans, excluding other contigs). If you are interested in running dN/dS analyses on the mitochondrial genome, you can use buildref to create a separate RefCDS object for the mitochondria. Just remember that when building a RefCDS for genomes with a non-standard genetic code, you need to inform *buildref* and *dndscv* about the genetic code of your genome using the argument `numcode` in both functions. To see the list of genetic codes supported use:
 70 | 
 71 | ```{r warning=FALSE}
 72 | ? seqinr::translate
 73 | ```
 74 | 
 75 | ###Running dNdScv with a new RefCDS object
 76 | 
 77 | We can use the simulated breast cancer data provided with the package to test the new RefCDS object that we have generated above. In the example, the reference fasta file was only a segment of chromosome 3 (3:178800000-179300000) and the gene coordinates in the transcript table had been adjusted accordingly. Thus, we need to correct the position of the mutations in the example file for consistency with the new RefCDS object.
 78 | 
 79 | ```{r message=FALSE, warning=FALSE}
 80 | library("dndscv")
 81 | segment = c(178.8e6,179.3e6) # The 500kb segment of the genome used to generate the new RefCDS
 82 | data("dataset_simbreast", package="dndscv") # Example dataset
 83 | mutations = mutations[mutations$chr=="3" & mutations$pos>segment[1] & mutations$pos<segment[2], ] # Restricting the mutations to those inside the segment
 84 | mutations$pos = mutations$pos-segment[1]+1 # Correcting the position of the mutations to their position in the reference fasta file used
 85 | dndsout = dndscv(mutations, refdb="example_output_refcds.rda", cv=NULL)
 86 | print(dndsout$sel_cv) # This is shown as an example but these results based on a few genes should not be trusted
 87 | ```
 88 | 
 89 | ###Running *dNdScv* with and without covariates
 90 | 
 91 | Using `cv=NULL` we are telling *dndscv* not to use the default covariates provided with the package. As explained in the original publication (Martincorena *et al.*, 2017), *dNdScv* can use covariates to reduce the uncertainty in the variation of the mutation rate across genes. Covariates can improve the performance of the model by increasing its sensitivity and specificity, although their impact can be small. 
 92 | 
 93 | You can see the impact of covariates in the example dataset provided using the code below. Note that we are now using the default RefCDS and running this analysis considering all genes in the genome:
 94 | 
 95 | ```{r message=FALSE, warning=FALSE}
 96 | data("dataset_simbreast", package="dndscv")
 97 | dndsout = dndscv(mutations)
 98 | dndsout_nocovariates = dndscv(mutations, cv=NULL)
 99 | print(dndsout$sel_cv[dndsout$sel_cv$qglobal_cv<0.1, c("gene_name","qglobal_cv")])
100 | print(dndsout_nocovariates$sel_cv[dndsout_nocovariates$sel_cv$qglobal_cv<0.1, c("gene_name","qglobal_cv")])
101 | ```
102 | 
103 | In the example dataset, only one gene is lost from the list of significant genes when running dNdScv without covariates. However, this is only a simulated dataset and does not necessarily reflect the likely gains of using covariates in real datasets. The benefits of using covariates can be larger in datasets with larger unexplained variation of the mutation rates across genes, and so users may want to generate their own covariates for a new RefCDS object. In the original publication, I used chromatin information from the Roadmap Epigenomics project to generate the 20 default covariates for GRCh37. Other useful covariates can be sequencing coverage or expression level, for example. Users interested in trying their own covariates can see the input format by loading the default covariates used by *dNdScv*:
104 | 
105 | ```{r message=FALSE, warning=FALSE}
106 | data("covariates_hg19", package="dndscv")
107 | print(covs[1:5,])
108 | ```
109 | 
110 | Covariates need to be formatted as a numeric matrix with all genes as rows (use row.names=*gene_names*) and a column per covariate. By default, a maximum of 20 covariates are used, but users can change this with the *maxcovs* argument in the dndscv function. You can also see the impact of different covariates by exploring the regression output (```dndsout$nbreg``` and ```dndsout$nbregind``` for substitutions and indels, respectively).
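
As a minimal sketch of this input format, the snippet below builds a toy covariate matrix in base R. The gene names, covariate names and values are all made up for illustration; only the overall shape (numeric matrix, genes as row names, one column per covariate) follows the description above.

```r
# Hypothetical gene names and covariate values, for illustration only
gene_names = c("GENE_A", "GENE_B", "GENE_C")
toy_covs = matrix(
  c(0.12, -1.30, 0.85,   # made-up covariate 1 (e.g. a chromatin score)
    2.10,  0.40, 1.75),  # made-up covariate 2 (e.g. log expression)
  nrow = length(gene_names),
  dimnames = list(gene_names, c("cov1", "cov2"))
)
print(toy_covs)

# Basic checks on the expected structure
stopifnot(is.matrix(toy_covs), is.numeric(toy_covs), !anyDuplicated(rownames(toy_covs)))
```

A matrix of this kind would then be supplied through the `cv` argument of *dndscv*, with row names matching the gene names in the RefCDS object being used.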
111 | 
112 | When using a new RefCDS, please exercise caution in interpreting the results. For example, low values of theta (e.g. ```dndsout$nbreg$theta``` << 1) indicate that there is large unexplained variation in the mutation density across genes and may mean that *dNdScv* is not adequate for this dataset.
113 | 
114 | ###Precomputed RefCDS objects for some species
115 | 
116 | I have generated precomputed RefCDS objects for a few species and you can find them in this [link](https://github.com/im3sanger/dndscv_data/tree/master/data). Please use these with caution as they have not been tested extensively.
117 | 
118 | If you have generated a new RefCDS object and/or covariates for an interesting species and would like to share it with other users, please drop me an [email](https://www.sanger.ac.uk/people/directory/martincorena-inigo).
119 | 
120 | ###References
121 | * Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*. 171(5):1029-1041.
122 | 


--------------------------------------------------------------------------------
/vignettes/dNdScv.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Selection analyses and cancer driver discovery using dNdScv"
  3 | author: "Inigo Martincorena"
  4 | output: 
  5 |   html_document:
  6 |     toc: true
  7 |     toc_float: true
  8 | ---
  9 | 
 10 | The **dNdScv** R package is a suite of maximum-likelihood dN/dS methods designed to quantify selection in cancer and somatic evolution (Martincorena *et al.*, 2017). The package contains functions to quantify dN/dS ratios for missense, nonsense and essential splice mutations, at the level of individual genes, groups of genes or at whole-genome level. The *dNdScv* method was designed to detect cancer driver genes (*i.e.* genes under positive selection in cancer) on datasets ranging from a few samples to thousands of samples, in whole-exome/genome or targeted sequencing studies.
 11 | 
 12 | Unlike traditional implementations of dN/dS, the *dNdScv* package uses trinucleotide context-dependent substitution models to avoid common mutation biases affecting dN/dS (Greenman *et al.*, 2006, Martincorena *et al.*, 2017). The package includes two different dN/dS models. *dNdSloc*, like traditional dN/dS implementations, uses the number of synonymous mutations in a gene to infer the local mutation rate, without exploiting additional information from other genes. *dNdScv* offers a much more powerful alternative, combining local information (synonymous mutations in the gene) and global information (variation of the mutation rate across genes, exploiting epigenomic covariates) to estimate the background mutation rate. *dNdScv* should be preferred in most situations.
 13 | 
 14 | This vignette shows how to perform driver discovery and selection analyses with *dNdScv* in cancer sequencing data. By default, dNdScv assumes human data from the GRCh37/hg19 assembly, but the latest version of the package provides a function (*buildref*) to generate the necessary reference file to run dNdScv on other species or assemblies (see the relevant tutorial). Although designed for cancer genomic studies, *dNdScv* can also be used to quantify selection in other resequencing studies, such as SNP analyses, mutation accumulation studies in bacteria or for the discovery of mutations causing developmental disorders using data from human trios.
 15 | 
 16 | To cite this package please use:
 17 | Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*.
 18 | 
 19 | ##Driver discovery (positive selection) in cancer exomes/genomes
 20 | 
 21 | ####The simplest way to run dNdScv
 22 | 
 23 | ```{r message=FALSE, warning=FALSE}
 24 | library("dndscv")
 25 | ```
 26 | ```{r message=FALSE}
 27 | data("dataset_simbreast", package="dndscv")
 28 | dndsout = dndscv(mutations)
 29 | ```
 30 | 
 31 | ####Inputs and default parameters
 32 | 
 33 | For this example, we have used a simulated dataset of 196 breast cancer exomes provided in the package. The simplest way to run the dndscv function is to provide a table of mutations. Mutations are provided as a *data.frame* with five columns (sampleID, chromosome, position, reference base and mutant base). It is important that only unique mutations are provided in the file. Multiple instances of the same mutation in related samples (for example, when sequencing multiple samples of the same tumour) should only be listed once.
 34 | 
 35 | ```{r}
 36 | head(mutations)
 37 | ```
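
The requirement that each mutation event is listed only once can be sketched as follows. This is a toy example under stated assumptions: the table values are illustrative, the column names (`sampleID`, `chr`, `pos`, `ref`, `mut`) are assumed to match the example dataset, and the sample naming scheme (a tumour identifier followed by `_a`, `_b`) is hypothetical.

```r
# Toy mutation table (illustrative values); column names assumed to follow the example dataset
toy_muts = data.frame(
  sampleID = c("T1_a", "T1_b", "T2_a"),
  chr = c("3", "3", "17"),
  pos = c(178936091, 178936091, 7577120),
  ref = c("G", "G", "C"),
  mut = c("A", "A", "T"),
  stringsAsFactors = FALSE
)

# Hypothetical naming scheme: T1_a and T1_b are two samples of the same tumour
tumour = sub("_.*$", "", toy_muts$sampleID)

# Keep a single record per tumour for each distinct mutation
dup = duplicated(data.frame(tumour, toy_muts[, c("chr", "pos", "ref", "mut")]))
unique_muts = toy_muts[!dup, ]
print(unique_muts)
```

How related samples are identified will depend on the study design; the point is only that a mutation shared by several samples of the same tumour should contribute one row to the input table.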
 38 | 
 39 | With the example dataset provided, the function should take about one minute to run. In this example, the function issues a warning as it detects the same mutation in more than one sample, requesting the user to verify that the input table of mutations only contains independent mutation events. In this case, each sample corresponds to a different patient and so the warning can be safely ignored.
 40 | 
 41 | We have run dNdScv with default parameters. This includes removing ultra-hypermutator samples and subsampling mutations when encountering too many mutations per gene in the same sample. These were designed to protect against loss of sensitivity from ultra-hypermutators and from clustered artefacts in the input mutation table, but there are occasions when the user will prefer to relax these (see Example 2).
 42 | 
 43 | #####dndscv outputs: Table of significant genes
 44 | 
 45 | The output of the *dndscv* function is a list of objects. For an analysis of exome or genome data, the most relevant output will often be the result of neutrality tests at gene level. *P-values* for substitutions are obtained by Likelihood-Ratio Tests as described in (Martincorena *et al*, 2017) and q-values are obtained by Benjamini-Hochberg's multiple testing correction. The table also includes information on the number of substitutions of each class observed in each gene, as well as maximum-likelihood estimates (MLEs) of the dN/dS ratios for each gene, for missense (*wmis*), nonsense (*wnon*), essential splice site mutations (*wspl*) and indels (*wind*). The global q-values integrating all mutation types are available in the *qglobal_cv* and *qallsubs_cv* columns for analyses with and without indels, respectively.
 46 | 
 47 | ```{r}
 48 | sel_cv = dndsout$sel_cv
 49 | print(head(sel_cv), digits = 3)
 50 | signif_genes = sel_cv[sel_cv$qglobal_cv<0.1, c("gene_name","qglobal_cv")]
 51 | rownames(signif_genes) = NULL
 52 | print(signif_genes)
 53 | ```
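
For readers unfamiliar with Benjamini-Hochberg q-values, the short sketch below applies the correction to a handful of made-up p-values using base R's `p.adjust`. The p-values are purely illustrative and are not taken from the *dndscv* output.

```r
# Made-up p-values for five hypothetical genes
pvals = c(0.0001, 0.004, 0.03, 0.20, 0.75)

# Benjamini-Hochberg adjustment
qvals = p.adjust(pvals, method = "BH")
print(qvals)

# Number of genes passing a q-value cutoff of 0.1
print(sum(qvals < 0.1))
```

With a cutoff of q<0.1, as used above for `signif_genes`, three of these five hypothetical genes would be reported as significant.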
 54 | 
 55 | Note in the table that the dN/dS ratios of significant cancer genes are typically extremely high, often >10 or even >100. This contains information about the fraction of mutations observed in a gene that are genuine drivers under positive selection. For example, for a highly significant gene, a dN/dS of 10 indicates that there are 10 times more non-synonymous mutations in the gene than neutrally expected, suggesting that at least around 90% of these mutations are genuine drivers (Greenman *et al*, 2006; Martincorena *et al*, 2017).
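
The arithmetic behind this statement can be written out explicitly. If a gene has a dN/dS ratio of $w$, then roughly a fraction $(w-1)/w$ of its non-synonymous mutations is in excess of the neutral expectation. The helper function below is a hypothetical illustration and is not part of the package.

```r
# Approximate fraction of non-synonymous mutations in excess of neutral expectation
# for a given dN/dS ratio w (hypothetical helper, not part of dndscv)
driver_fraction = function(w) pmax(0, (w - 1) / w)

print(driver_fraction(c(1, 2, 10, 100)))
```

For dN/dS ratios of 1, 2, 10 and 100, this gives excess fractions of 0, 0.5, 0.9 and 0.99, respectively.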
 56 | 
 57 | Users can calculate confidence intervals for the dN/dS ratios per gene using the *geneci* function in the package. To use this function you need to run *dndscv* with the optional argument `outmats=T` and then use `ci = geneci(dndsout)`. For more details see `? geneci`.
 58 | 
 59 | #####dndscv outputs: Global dN/dS estimates
 60 | 
 61 | Another output that can be of interest is a table with the global MLEs for the dN/dS ratios across all genes. dN/dS ratios with associated confidence intervals are calculated for missense, nonsense and essential splice site substitutions separately, as well as for all non-synonymous substitutions (*wall*) and for all truncating substitutions together (*wtru*), which include nonsense and essential splice site mutations.
 62 | 
 63 | ```{r}
 64 | print(dndsout$globaldnds)
 65 | ```
 66 | 
 67 | Global dN/dS ratios in somatic evolution of cancer, and seemingly of healthy somatic tissues, appear to show a near-universal pattern of dN/dS~1, with exome-wide dN/dS ratios typically slightly higher than 1 (Martincorena *et al.*, 2017). Only occasionally have I found datasets with global dN/dS<1, and upon closer examination this has typically been caused by contamination of the catalogue of somatic mutations with germline SNPs. Melanoma tumours are an exception: they show a bias towards slight underestimation of dN/dS because the signature of ultraviolet-induced mutations extends beyond the trinucleotide model (Martincorena *et al.*, 2017). In my personal experience, datasets of somatic mutations with global dN/dS<<1 have always reflected a problem of SNP contamination or an inadequate substitution model, and so the evaluation of global dN/dS values can help identify problems in certain datasets.
 68 | 
 69 | #####Other useful outputs
 70 | 
 71 | The dndscv function also outputs other results that can be of interest, such as an annotated table of coding mutations (*annotmuts*), MLEs of mutation rate parameters (*mle_submodel*), lists of samples and mutations excluded from the analysis and a table with the observed and expected number of mutations per gene (*genemuts*), among others.
 72 | 
 73 | ```{r}
 74 | head(dndsout$annotmuts)
 75 | ```
 76 | 
 77 | dNdScv relies on a negative binomial regression model across genes to refine the estimated background mutation rate for a gene. This assumes that the variation of the mutation rate across genes that remains unexplained by covariates or by the sequence composition of the gene can be modelled as a Gamma distribution. This model typically works well on clean cancer genomic datasets, but not all datasets may be suitable for this model. In particular, very low estimates of $\theta$ (the overdispersion parameter), particularly $\theta<1$, may reflect problems with the suitability of the dNdScv model for the dataset.
 78 | 
 79 | ```{r}
 80 | print(dndsout$nbreg$theta)
 81 | ```
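
To give a rough sense of scale for $\theta$, the sketch below converts a few made-up values into the implied coefficient of variation of the unexplained mutation rate across genes. It assumes the standard negative binomial parametrisation, in which the Gamma-distributed rate multiplier has mean 1 and shape $\theta$, so that its coefficient of variation is $1/\sqrt{\theta}$. The values are illustrative and not taken from any dataset.

```r
# Illustrative theta values (not taken from any dataset)
theta = c(0.5, 1, 5, 20)

# A Gamma rate multiplier with mean 1 and shape theta has CV = 1/sqrt(theta)
cv_rate = 1 / sqrt(theta)
print(data.frame(theta = theta, cv_rate = round(cv_rate, 2)))
```

Under this assumption, $\theta=1$ corresponds to unexplained variation as large as the mean rate itself, which is one way to see why very low estimates of $\theta$ are a warning sign.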
 82 | 
 83 | ##### dNdSloc: local neutrality test
 84 | 
 85 | An additional set of neutrality tests per gene is performed using a more traditional dN/dS model in which the local mutation rate for a gene is estimated exclusively from the synonymous mutations observed in the gene (*dNdSloc*) (Wong, Martincorena, *et al*., 2014). This test is typically only powered in very large datasets. For example, in the dataset used in this example, comprising 196 simulated breast cancer exomes, this model only detects *ARID1A* as significantly mutated.
 86 | 
 87 | ```{r}
 88 | signif_genes_localmodel = as.vector(dndsout$sel_loc$gene_name[dndsout$sel_loc$qall_loc<0.1])
 89 | print(signif_genes_localmodel)
 90 | ```
 91 | 
 92 | ##Driver discovery in targeted sequencing data
 93 | 
 94 | The dndscv function can take a list of genes as an input to restrict the analysis of selection. This is strictly required when analysing targeted sequencing data, and might also be used to obtain global dN/dS ratios for a particular group of genes.
 95 | 
 96 | To exemplify the use of the dndscv function on targeted data, we can use another example dataset provided with the dNdScv package:
 97 | 
 98 | ```{r message=FALSE}
 99 | library("dndscv")
100 | data("dataset_normalskin", package="dndscv")
101 | data("dataset_normalskin_genes", package="dndscv")
102 | dndsskin = dndscv(m, gene_list=target_genes, max_muts_per_gene_per_sample = Inf, max_coding_muts_per_sample = Inf)
103 | ```
104 | 
105 | This dataset comprises 3,408 unique somatic mutations detected by ultra-deep (~500x) targeted sequencing of 74 cancer genes in 234 small biopsies of normal human skin (epidermis) from four healthy individuals. Note that all of the mutations listed in the input table are genuinely independent events and so, again, we can safely ignore the two warnings issued by dndscv. For more details on this study see:
106 | 
107 | **Martincorena I, *et al*. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. Science. 348(6237):880-6.** doi: 10.1126/science.aaa6806.
108 | 
109 | In the paper above, we described strong evidence of positive selection on somatic mutations occurring in normal human skin throughout life. These mutations are detected as microscopic clones of mutant cells in normal skin. The dNdScv analysis below recapitulates some of the key analyses of selection in this study:
110 | 
111 | ```{r}
112 | sel_cv = dndsskin$sel_cv
113 | print(sel_cv[sel_cv$qglobal_cv<0.1,c(1:10,19)], digits = 3)
114 | print(dndsskin$globaldnds, digits = 3)
115 | ```
116 | 
117 | ##Using different substitution models
118 | 
119 | Classic maximum-likelihood implementations of dN/dS use a simple substitution model with a single rate parameter. Mutations are classified as either transitions (C<>T, G<>A) or transversions, and the single rate parameter is a transition/transversion (ts/tv) ratio reflecting the relative frequency of both classes of substitutions (Goldman & Yang, 1994). The dndscv function can take a different substitution model as input. The user can choose from existing substitution models provided in the *data* directory as part of the package or input a different substitution model as a matrix:
120 | 
121 | ```{r message=FALSE}
122 | library("dndscv")
123 | # 192 rates (used as default)
124 | data("submod_192r_3w", package="dndscv")
125 | colnames(substmodel) = c("syn","mis","non","spl")
126 | head(substmodel)
127 | # 12 rates (no context-dependence)
128 | data("submod_12r_3w", package="dndscv")
129 | colnames(substmodel) = c("syn","mis","non","spl")
130 | head(substmodel)
131 | # 2 rates (classic ts/tv model)
132 | data("submod_2r_3w", package="dndscv")
133 | colnames(substmodel) = c("syn","mis","non","spl")
134 | head(substmodel)
135 | ```
136 | 
137 | We can fit a traditional ts/tv model to the skin dataset using the code below:
138 | 
139 | ```{r message=FALSE}
140 | library("dndscv")
141 | data("dataset_normalskin", package="dndscv")
142 | data("dataset_normalskin_genes", package="dndscv")
143 | dndsskin_2r = dndscv(m, gene_list=target_genes, max_muts_per_gene_per_sample = Inf, max_coding_muts_per_sample = Inf, sm = "2r_3w")
144 | print(dndsskin_2r$mle_submodel)
145 | sel_cv = dndsskin_2r$sel_cv
146 | print(head(sel_cv[sel_cv$qglobal_cv<0.1, c(1:10,19)]), digits = 3)
147 | ```
148 | 
149 | In general, the full trinucleotide model is recommended for cancer genomic datasets as it typically provides the least biased dN/dS estimates. The impact of using simplistic mutation models can be considerable on global dN/dS ratios (see Martincorena *et al*., 2017), and can lead to false signals of negative or positive selection. The impact of simple substitution models on gene-level inferences of selection tends to be smaller. The fit of different substitution models can be compared using AIC, with lower values indicating a better-supported model:
150 | 
151 | ```{r message=FALSE}
152 | AIC(dndsskin$poissmodel)
153 | AIC(dndsskin_2r$poissmodel)
154 | ```
155 | 
156 | ###References
157 | * Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*. 171(5):1029-1041.
158 | * Goldman N, Yang Z. (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. *Molecular biology and evolution*. 11:725-736.
159 | * Greenman C, *et al*. (2006) Statistical analysis of pathogenicity of somatic mutations in cancer. *Genetics*. 173(4):2187-98.
160 | * Wong CC, Martincorena I, *et al*. (2014) Inactivating CUX1 mutations promote tumorigenesis. *Nature Genetics*. 46(1):33-8.
161 | * Martincorena I, *et al*. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. *Science*. 348(6237):880-6.
162 | 


--------------------------------------------------------------------------------
/vignettes/example_output_refcds.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/im3sanger/dndscv/69007c2bbd2d6dae003a30dcfe5dda3df722b2f8/vignettes/example_output_refcds.rda


--------------------------------------------------------------------------------
/vignettes/sitednds.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Hotspot discovery using sitewise dN/dS"
  3 | author: "Inigo Martincorena"
  4 | output: 
  5 |   html_document:
  6 |     toc: true
  7 |     toc_float: true
  8 | ---
  9 | 
 10 | **Warning: this function is in testing. Users are advised to interpret the results with caution.**
 11 | 
 12 | The importance of recurrently mutated hotspots is widely appreciated in cancer. This tutorial shows how to apply the new **sitednds** function provided in the latest version of the *dNdScv* package to estimate dN/dS ratios at single-site level. Sitewise dN/dS estimation has a rich history in comparative genomics (e.g. Massingham and Goldman, 2005) but it has only been used in cancer studies occasionally (e.g. Martincorena *et al.*, 2015). Yet, studying the relative strength of selection at single sites can be valuable, as emphasised by a recent study (Cannataro *et al.*, 2018).
 13 | 
 14 | The new *sitednds* function allows the user to compute maximum-likelihood dN/dS estimates for recurrently mutated sites, as well as p-values against neutrality. Sitewise dN/dS ratios reflect the ratio between the number of observed mutations and the number expected under neutrality, while controlling for trinucleotide rates and for variable mutation rates across genes. In sparse datasets, point estimates for lowly-recurrent sites are likely to be underestimated, but p- and q-values provide a measure of their significance.
 15 | 
 16 | An important aspect is that mutation rates can vary considerably across sites, even after correcting for these known mutational biases. *sitednds* models the observed mutation counts across synonymous sites as following a negative binomial distribution. This effectively controls for Poisson noise in the mutation counts per site and fits a Gamma distribution to the unexplained variation in mutation rate across sites. P-values for site recurrence are calculated using the fitted negative binomial distribution. These p-values should be more conservative and reliable than only considering Poisson variation or non-parametric bootstrapping, but they still rely on the assumption that the Gamma distribution appropriately captures the unexplained variation across sites.
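
The difference between Poisson and negative binomial p-values can be illustrated with a small base R sketch. The expected count `mu`, the overdispersion `theta` and the observed count `k` are made-up values, and this is only an illustration of the general idea using `ppois` and `pnbinom`, not necessarily the exact calculation performed inside *sitednds*.

```r
# Made-up values for a single hypothetical site
mu = 0.01     # expected number of mutations at the site under neutrality
theta = 0.5   # overdispersion (shape of the Gamma distribution)
k = 3         # observed number of mutations at the site

# Probability of observing k or more mutations by chance
p_pois = ppois(k - 1, lambda = mu, lower.tail = FALSE)            # Poisson only
p_nb = pnbinom(k - 1, size = theta, mu = mu, lower.tail = FALSE)  # negative binomial

print(c(poisson = p_pois, negbin = p_nb))
```

With these illustrative values the negative binomial p-value is larger (more conservative) than the Poisson one, because part of the recurrence is attributed to unexplained variation in the mutation rate across sites.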
 17 | 
 18 | A major limitation is the fact that mapping artefacts and SNP contamination are common problems in cancer genomic datasets, and these tend to lead to recurrent false positive mutation calls. In noisy datasets, the results of *sitednds* can be dominated by artefacts. Users trying *sitednds* should be very critical of the results. In the context of cancer genomic studies, a considerable number of synonymous recurrently mutated sites among the significant hits in *sitednds* almost certainly indicates a problem with the variant calling. This is exemplified in this tutorial by analysing two real datasets.
 19 | 
 20 | ###Sitewise dN/dS ratios in a cancer dataset
 21 | 
 22 | As a small example, in this tutorial we will use public somatic mutation calls from bladder cancers from TCGA. To reduce the risk of false positives and increase the signal to noise ratio, this example will only consider mutations in Cancer Gene Census genes (v81).
 23 | 
 24 | ```{r message=FALSE, warning=FALSE}
 25 | library("dndscv")
 26 | data("dataset_tcgablca", package="dndscv") # Loading the bladder cancer data
 27 | data("cancergenes_cgc81", package="dndscv") # Loading the genes in the Cancer Gene Census (v81)
 28 | dndsout = dndscv(mutations, outmats=T, gene_list=known_cancergenes)
 29 | ```
 30 | 
 31 | The *sitednds* function takes the output of *dndscv* as input. In order for the dndsout object to be compatible with *sitednds*, users must use the "outmats=T" argument in *dndscv*. After running *dndscv*, we can evaluate the results at the gene level as explained in the main tutorial of *dndscv*:
 32 | 
 33 | ```{r message=FALSE, warning=FALSE}
 34 | sel_cv = dndsout$sel_cv
 35 | print(head(sel_cv, 10), digits = 3) # Printing the top 10 genes by q-value
 36 | ```
 37 | 
 38 | The table above reveals a problem with this dataset. The gene *MLLT3* appears as significant in *dndscv* (i.e. it violates the neutral null model of dN/dS=1), but this is due to a very large excess of synonymous mutations (notice the high number of synonymous mutations and the very low dN/dS values). We can further confirm that the low dN/dS value in this gene is due to an excess of synonymous mutations and not genuine negative selection by comparing the observed number of synonymous mutations in the gene (43) and the expected number (*exp_syn* and *exp_syn_cv* columns below):
 39 | 
 40 | ```{r message=FALSE, warning=FALSE}
 41 | print(dndsout$genemuts[dndsout$genemuts$gene_name=="MLLT3",])
 42 | ```
 43 | 
 44 | Thus, *MLLT3* is a false positive, most likely due to recurrent artefacts or SNP contamination in the gene. A careful examination of all statistically significant genes in the dataset reveals other likely false positives. As we will see below, this will also affect the sitewise dN/dS analysis.
 45 | 
 46 | To run the sitewise dN/dS model on this dataset, we only need to input the *dndsout* object into the *sitednds* function. By default, *sitednds* will calculate sitewise dN/dS ratios and p-values for sites mutated at least two times (use the argument *min_recurr* to control this). While p-values are only provided for recurrently mutated sites, false discovery adjustment corrects for all possible changes.
 47 | 
 48 | ```{r message=FALSE, warning=FALSE}
 49 | hotspots = sitednds(dndsout) # Running sitewise dN/dS
 50 | print(hotspots$theta) # Overdispersion (unexplained variation of the mutation rate across sites)
 51 | ```
 52 | 
 53 | You can see that the maximum-likelihood estimate of *theta* is very low. This reflects considerable variation in the mutation rate across sites, not explained by the trinucleotide context or by the estimated relative mutation rate of the gene. *sitednds* takes this into account when calculating p-values. If there is large uncertainty in the estimation of *theta*, users can choose to use the lower bound estimate of theta instead of the maximum-likelihood estimate, when calculating p-values (use the argument *theta_option="conservative"* in *sitednds*).
 54 | 
 55 | The main output of *sitednds* is a table with all hotspots studied, including their position, the gene affected, the amino acid change induced, the number of times that the mutation was observed, the expected number of mutations at this site by chance under neutrality (mu) and the dN/dS ratio. The table also contains p-values and q-values for the probability of observing that many mutations at the site by chance. Again, please treat these p-values with caution.
 56 | 
 57 | ```{r message=FALSE, warning=FALSE}
 58 | print(head(hotspots$recursites,10)) # First 10 lines of the table of hotspots
 59 | ```
 60 | 
 61 | We can choose a significance cutoff (e.g. q-value<0.05) to list the significant hotspots in the dataset:
 62 | 
 63 | ```{r message=FALSE, warning=FALSE}
 64 | signifsites = hotspots$recursites[hotspots$recursites$qval<0.05, ]
 65 | print(signifsites[,c("gene","aachange","impact","freq","dnds","qval")], digits=5)
 66 | ```
 67 | 
 68 | Careful examination of the significant hotspots reveals many well-known cancer-driver hotspots, including in *FGFR3* (e.g. S249, Y375, G372), *TP53* (e.g. R248), *PIK3CA* (e.g. E542, H1047), *HRAS* (Q61), *KRAS* (G12), *ERBB2* (S310), *ERBB3* (V104), etc. Note that the exact amino acid position affected depends on the exact protein isoform used for annotation (see Ensembl protein IDs in *dndsout$annotmuts*).
 69 | 
 70 | However, the table of significant hotspots also contains a considerable number of likely false positives, including multiple synonymous sites in *MLLT3*. A proper analysis of these data would require careful reevaluation and improvement of the mutation calls, before repeating this analysis. Significant improvements to somatic mutation calls against recurrent artefacts can be achieved by using an unmatched normal panel and by more stringent filtering of germline SNP contamination. 
 71 | 
 72 | The TCGA mutation calls used in this example are an old version and it is likely that more recent versions are much less affected by artefacts. However, I decided to use this dataset as an example to highlight the importance of critically examining the results and the impact of recurrent artefacts on driver discovery at gene and site level.
 73 | 
 74 | As a final note, users with whole-genome or whole-exome data can run *sitednds* on all genes. However, given the frequent presence of recurrent artefacts and the sparsity of cancer datasets, the signal-to-noise ratio can be considerably increased by running *sitednds* on a list of known cancer genes. To do so, I recommend running *dndscv* on all genes and then running *sitednds* on a list of genes of interest using the optional *gene_list* argument in *sitednds*. Running *dndscv* on all genes ensures that mutations from all genes are used to estimate the trinucleotide mutation rates, typically increasing their accuracy.
 75 | 
 76 | 
 77 | ###Sitewise dN/dS ratios in normal oesophagus
 78 | 
 79 | In a recent study, we sequenced 844 small biopsies of normal oesophageal epithelium from 9 transplant donors to study the extent of mutation and selection in a normal tissue (Martincorena *et al*., 2018). In this part of the tutorial, we will reanalyse this dataset using *dndscv* and *sitednds*. We first run *dndscv* using the settings from the analysis of normal skin data described in the main *dNdScv* tutorial.
 80 | 
 81 | ```{r message=FALSE, warning=FALSE}
 82 | library("dndscv")
 83 | data("dataset_normaloesophagus", package="dndscv") # Loading the mutations in normal oesophagus
 84 | mutations = unique(mutations) # Removing duplicate mutations (more conservative)
 85 | data("dataset_normalskin_genes", package="dndscv")
 86 | dndsout = dndscv(mutations, outmats=T, gene_list=target_genes, max_coding_muts_per_sample=Inf, max_muts_per_gene_per_sample=Inf) # outmats=T is required to run sitednds
 87 | ```
 88 | 
 89 | We can see the list of genes under positive selection and the global dN/dS values using:
 90 | 
 91 | ```{r message=FALSE, warning=FALSE}
 92 | sel_cv = dndsout$sel_cv
 93 | print(sel_cv[sel_cv$qglobal_cv<0.05, c(1:6,19)], digits = 3)
 94 | print(dndsout$globaldnds, digits = 3)
 95 | ```
 96 | 
 97 | To apply the *sitednds* model, we simply use the following code. Only the top 30 hotspots by q-value are shown, but a total of 133 sites are identified as significant with 5% FDR.
 98 | 
 99 | ```{r message=FALSE, warning=FALSE}
100 | hotspots = sitednds(dndsout) # Running sitewise dN/dS
101 | signifsites = hotspots$recursites[hotspots$recursites$qval<0.05, ]
102 | head(signifsites[,c("gene","aachange","impact","freq","dnds","qval")], 30)
103 | ```
104 | 
105 | Remarkably, owing to the very large number of mutant clones identified in this study, this analysis finds a large number of statistically significant sites (n=81 for qval<0.05). Reassuringly, they are all in genes detected under positive selection in the original publication (Martincorena *et al*., 2018). These comprise 61 sites in *NOTCH1*, 17 sites in *TP53*, the well-known *PIK3CA* hotspot H1047R, and one site each in *FAT1* and *TP63*.
106 | 
107 | This analysis also identifies a driver hotspot at a synonymous site of *TP53* (T125T), which is known to affect splicing of the gene. Intriguingly, it also identifies a synonymous site in *NOTCH1* (V717V), which deserves careful follow-up analysis. Apart from these two synonymous sites, the remaining 79 significant hotspots are non-synonymous.
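
Per-gene and per-impact counts like those quoted above can be obtained by tabulating the table of significant sites. The sketch below uses a small invented data frame, `recursites_toy`, as a stand-in for `hotspots$recursites`. Its values are made up for illustration and are not results from this dataset; only the column names `gene`, `impact` and `qval` are taken from the output shown earlier.

```r
# Toy stand-in for hotspots$recursites (all values are invented for illustration)
recursites_toy <- data.frame(
  gene   = c("NOTCH1", "NOTCH1", "TP53", "TP53", "PIK3CA"),
  impact = c("Missense", "Synonymous", "Missense", "Nonsense", "Missense"),
  qval   = c(0.001, 0.03, 0.0004, 0.20, 0.01),
  stringsAsFactors = FALSE
)

# Keep the sites that are significant at 5% FDR
signif_toy <- recursites_toy[recursites_toy$qval < 0.05, ]

# Number of significant sites per gene, most frequently hit genes first
sites_per_gene <- sort(table(signif_toy$gene), decreasing = TRUE)
print(sites_per_gene)

# Number of significant synonymous sites, which deserve a closer look
n_synonymous <- sum(signif_toy$impact == "Synonymous")
print(n_synonymous)
```

On real output, replacing `recursites_toy` with `hotspots$recursites` should give the corresponding breakdown for the dataset being analysed.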
108 | 
109 | 
110 | ###References
111 | * Martincorena I, *et al*. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. *Cell*. 171(5):1029-1041. doi:10.1016/j.cell.2017.09.042.
112 | * Massingham T, Goldman N. (2005) Detecting amino acid sites under positive selection and purifying selection. *Genetics*. 169(3):1753-62. doi:10.1534/genetics.104.032144.
113 | * Martincorena I, *et al*. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. *Science*. 348(6237):880-6. doi:10.1126/science.aaa6806.
114 | * Cannataro VL, Gaffney SG, Townsend JP. (2018) Effect Sizes of Somatic Mutations in Cancer. *J Natl Cancer Inst*. doi:10.1093/jnci/djy168.
115 | * Martincorena I, Fowler JC, *et al*. (2018) Somatic mutant clones colonize the human esophagus with age. *Science*. doi:10.1126/science.aau3879.
116 | 


--------------------------------------------------------------------------------