├── .Rbuildignore ├── .gitignore ├── DESCRIPTION ├── NAMESPACE ├── NEWS.md ├── R ├── G_functions.R ├── Import_Filter.R ├── RcppExports.R ├── export_functions.R ├── format_genomic.R ├── onunload.R ├── plotting_functions.R └── takagi_sim.R ├── README.md ├── all_plots.png ├── man ├── FilterSNPs.Rd ├── ImportFromGATK.Rd ├── countSNPs_cpp.Rd ├── getFDRThreshold.Rd ├── getG.Rd ├── getPvals.Rd ├── getQTLTable.Rd ├── getSigRegions.Rd ├── importFromTable.Rd ├── plotGprimeDist.Rd ├── plotQTLStats.Rd ├── plotSimulatedThresholds.Rd ├── runGprimeAnalysis.Rd ├── runQTLseqAnalysis.Rd ├── simulateAlleleFreq.Rd ├── simulateConfInt.Rd ├── simulateSNPindex.Rd └── tricubeStat.Rd ├── src ├── .gitignore ├── RcppExports.cpp └── countSNPs.cpp └── vignettes ├── .gitignore ├── QTLseqr.Rmd └── QTLseqr.pdf /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj 2 | .Rproj.user 3 | .Rhistory 4 | .RData 5 | .Ruserdata 6 | inst/doc 7 | QTLseqr.Rproj 8 | calculation tests.xlsx 9 | README.Rmd 10 | temp 11 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: QTLseqr 2 | Type: Package 3 | Title: QTL mapping using Bulk Segregant Analysis of Next Generation Sequencing data. 4 | Version: 0.7.5.2 5 | Date: 12-3-2017 6 | Authors@R: c( 7 | person("Ben Nathan", "Mansfeld", email = "mansfeld@msu.edu", role = c("aut", "cre")), 8 | person("Rebecca", "Grumet", email = "grumet@msu.edu", role = c("ths")) 9 | ) 10 | BugReports: https://github.com/bmansfeld/QTLseqr/issues 11 | Description: QTLseqr performs QTL mapping using Bulk Segregant Analysis of Next Gen Sequencing data. 
12 | The package is an R implementation of the analyses described in Magwene PM, Willis JH, Kelly JK (2011) The Statistics of Bulk Segregant Analysis Using Next Generation Sequencing. PLOS Computational Biology 7(11): e1002255. doi: 10.1371/journal.pcbi.1002255 and Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka, C., Uemura, A., Utsushi,H., Tamiru, M., Takuno, S., Innan, H., Cano, L. M., Kamoun, S. and Terauchi, R. (2013), QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. *Plant J*, 74: 174–183. doi:10.1111/tpj.12105 13 | License: GPL-3 14 | Encoding: UTF-8 15 | LazyData: true 16 | Imports: 17 | modeest (>= 2.3.2), 18 | ggplot2 (>= 2.2.0), 19 | gtools, 20 | dplyr, 21 | readr, 22 | tidyr, 23 | Rcpp, 24 | locfit 25 | Suggests: 26 | knitr, 27 | rmarkdown, 28 | kableExtra, 29 | Yang2013data 30 | RoxygenNote: 6.1.1 31 | biocViews: Software, Sequencing, Visualization 32 | VignetteBuilder: knitr 33 | LinkingTo: Rcpp 34 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(countSNPs_cpp) 4 | export(filterSNPs) 5 | export(getFDRThreshold) 6 | export(getPvals) 7 | export(getQTLTable) 8 | export(getSigRegions) 9 | export(importFromGATK) 10 | export(importFromTable) 11 | export(plotGprimeDist) 12 | export(plotQTLStats) 13 | export(plotSimulatedThresholds) 14 | export(runGprimeAnalysis) 15 | export(runQTLseqAnalysis) 16 | export(simulateConfInt) 17 | importFrom(Rcpp,sourceCpp) 18 | importFrom(dplyr,"%>%") 19 | useDynLib(QTLseqr) 20 | -------------------------------------------------------------------------------- /NEWS.md: -------------------------------------------------------------------------------- 1 | # QTLseqr 0.7.5.2 2 | ## Bugs 3 | * Import failure caused by 0.7.5.1 fixed 4 | 5 | # QTLseqr 
0.7.5.1 6 | ## Bugs 7 | * Import failure caused by 0.7.5 fixed 8 | 9 | # QTLseqr 0.7.5 10 | ## Bugs 11 | * Column type guessing was sometimes wrong when reading files. Set the default to col_character. This will help with very large read depth importing. 12 | 13 | # QTLseqr 0.7.4 14 | ## Bugs 15 | * Fixed a compatibility issue with new versions of the `modeest` package. Please note that from now on QTLseqr requires `modeest (> 2.3.2)` 16 | 17 | # QTLseqr 0.7.3 18 | ## Updates 19 | * Added a `...` for all functions that use tricube smoothing, so that users can easily pass higher maxk values to `locfit.raw`. 20 | * Added _"A note about window sizes"_ to the vignette. 21 | 22 | # QTLseqr 0.7.2 23 | ## Updates 24 | * Added `depthDifference` parameter to `filterSNPs` function. This helps filtering SNPs with high absolute differences in read depth between the bulks. 25 | * `getQTLTable` now also reports the genomic position of the maximum of each peak. 26 | * Updates to the vignette about filtering SNPs. 27 | 28 | # QTLseqr 0.7.1 29 | ## Bug fixes 30 | * Corrected a bug in checking for negative bulk sizes 31 | 32 | # QTLseqr 0.7.0 33 | ## Updates 34 | * Added `importFromTable` function to allow users to import from a delimited file. 35 | * Allowed different size bulks in `runQTLseqAnalysis`. 36 | * Updated vignettes and documentation files. 37 | * Some documentation link fixes 38 | 39 | # QTLseqr 0.6.5 40 | ## Bug fixes 41 | * Corrected a bug in import that happened when high or low bulks were named with things that looked like CHROM, POS, ALT or REF. Now the function ignores those columns when renaming to HIGH.xx or LOW.xx. Also better definition of column types to force CHROM to be char and POS to be int. 42 | 43 | # QTLseqr 0.6.4 44 | ## Bug fixes 45 | * Corrected a windowSize that was set to 1e6 instead of the function parameter in 'runQTLanalysis'. This was causing problems in calculating window depth.
46 | 47 | # QTLseqr 0.6.3 48 | ## Bug fixes 49 | * Corrected a call to global env variables in `importFromGATK`. 50 | * Fixed issues with bulk names that had periods in them. 51 | * Bug fix in export functions using `table` as variable name. 52 | 53 | # QTLseqr 0.6.2 54 | ## Bug fixes 55 | * Added responsive x axis breaks. Axis break labels were getting squashed on small chromosomes. 56 | 57 | # QTLseqr 0.6.1 58 | ## Updates 59 | * added `plotSimulatedThresholds` function to help users visualize their confidence intervals 60 | * some manual corrections and mods 61 | 62 | # QTLseqr 0.6.0 63 | ## Updates 64 | * Added QTLseq analysis functionality 65 | * `plotQTLStats` can now plot confidence intervals in $\Delta (SNP\text{-}index)$ plots 66 | * Export functions run faster and allow for detection of QTL in either "Gprime" or "QTLseq" methods 67 | * removed Bioconductor dependency 68 | 69 | # QTLseqr 0.5.8 70 | ## Updates 71 | * changed $\Delta SNP\text{-}index$ to $\Delta (SNP\text{-}index)$ 72 | 73 | # QTLseqr 0.5.7 74 | ## Updates 75 | * `plotQTLStats` now allows for chromosome facet shape scaling using the 'scaleChroms' parameter 76 | 77 | # QTLseqr 0.5.6 78 | ## Bug fixes 79 | * Added package `locfit` to Imports. 80 | 81 | # QTLseqr 0.5.5 82 | ## Updates 83 | * `plotGprimeDist` null distribution label more accurate. 84 | 85 | # QTLseqr 0.5.4 86 | 87 | ## Updates 88 | * `plotGprimeDist` now plots histograms of filtered and raw data, overlaid with the null distribution. This is easier to interpret. 89 | 90 | # QTLseqr 0.5.3 91 | 92 | ## New features 93 | 94 | * Added a `NEWS.md` file to track changes to the package. 95 | * in `plotGprimeDist` plotting now includes density plots of all data, data after QTL filtering and the null-distribution assuming mean and variance of the filtered set.
96 | 97 | ## Bug fixes 98 | * Corrected a bug in `getFDRThreshold` function that was using regular p-values and not adjusted p-values to define threshold 99 | -------------------------------------------------------------------------------- /R/G_functions.R: -------------------------------------------------------------------------------- 1 | #Functions for calculating and manipulating the G statistic 2 | 3 | #' Calculates the G statistic 4 | #' 5 | #' The function is used by \code{\link{runGprimeAnalysis}} to calculate the G 6 | #' statistic. G is defined by the equation: \deqn{G = 2*\sum_{i=1}^{4} 7 | #' n_{i}*ln\frac{obs(n_i)}{exp(n_i)}}{G = 2 * \sum n_i * ln(obs(n_i)/exp(n_i))} 8 | #' Where for each SNP, \eqn{n_i} from i = 1 to 4 corresponds to the reference 9 | #' and alternate allele depths for each bulk, as described in the following 10 | #' table: \tabular{rcc}{ Allele \tab High Bulk \tab Low Bulk \cr Reference \tab 11 | #' \eqn{n_1} \tab \eqn{n_2} \cr Alternate \tab \eqn{n_3} \tab \eqn{n_4} \cr} 12 | #' ...and \eqn{obs(n_i)} are the observed allele depths as described in the data 13 | #' frame. The function calculates the G statistic using expected values assuming 14 | #' read depth is equal for all alleles in both bulks: \deqn{exp(n_1) = ((n_1 + 15 | #' n_2)*(n_1 + n_3))/(n_1 + n_2 + n_3 + n_4)} \deqn{exp(n_2) = ((n_2 + n_1)*(n_2 16 | #' + n_4))/(n_1 + n_2 + n_3 + n_4)} etc...
17 | #' 18 | #' @param LowRef A vector of the reference allele depth in the low bulk 19 | #' @param HighRef A vector of the reference allele depth in the high bulk 20 | #' @param LowAlt A vector of the alternate allele depth in the low bulk 21 | #' @param HighAlt A vector of the alternate allele depth in the high bulk 22 | #' 23 | #' @return A vector of G statistic values with the same length as the input vectors 24 | #' 25 | #' @seealso \href{https://doi.org/10.1371/journal.pcbi.1002255}{The Statistics 26 | #' of Bulk Segregant Analysis Using Next Generation Sequencing} 27 | #' \code{\link{tricubeStat}} for G prime calculation 28 | 29 | getG <- function(LowRef, HighRef, LowAlt, HighAlt) 30 | { 31 | exp <- c( 32 | (LowRef + HighRef) * (LowRef + LowAlt) / (LowRef + HighRef + LowAlt + HighAlt), 33 | (LowRef + HighRef) * (HighRef + HighAlt) / (LowRef + HighRef + LowAlt + HighAlt), 34 | (LowRef + LowAlt) * (LowAlt + HighAlt) / (LowRef + HighRef + LowAlt + HighAlt), 35 | (LowAlt + HighAlt) * (HighRef + HighAlt) / (LowRef + HighRef + LowAlt + HighAlt) 36 | ) 37 | obs <- c(LowRef, HighRef, LowAlt, HighAlt) 38 | 39 | G <- 40 | 2 * (rowSums(obs * log( 41 | matrix(obs, ncol = 4) / matrix(exp, ncol = 4) 42 | ))) 43 | return(G) 44 | } 45 | 46 | #' Calculate tricube weighted statistics for each SNP 47 | #' 48 | #' Uses local regression (wrapper for \code{\link[locfit]{locfit}}) to predict a 49 | #' tricube smoothed version of the statistic supplied for each SNP. This works as a 50 | #' weighted average across neighboring SNPs that accounts for linkage 51 | #' disequilibrium (LD) while minimizing noise attributed to SNP calling errors. 52 | #' Values for neighboring SNPs within the window are weighted by physical 53 | #' distance from the focal SNP.
54 | #' 55 | #' @return Returns a vector of the weighted statistic calculated with a tricube 56 | #' smoothing kernel 57 | #' 58 | #' @param POS A vector of genomic positions for each SNP 59 | #' @param Stat A vector of values for a given statistic for each SNP 60 | #' @param windowSize the window size (in base pairs) bracketing each SNP for which 61 | #' to calculate the statistics. Magwene et al. recommend a window size of ~25 62 | #' cM, but also recommend optionally trying several window sizes to test if 63 | #' peaks are over- or undersmoothed. 64 | #' @param ... Other arguments passed to \code{\link[locfit]{locfit}} and 65 | #' subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful 66 | #' in cases where you get "out of vertex space" warnings; set the maxk higher 67 | #' than the default 100. See \code{\link[locfit]{locfit.raw}}(). 68 | #' @examples df_filt_4mb$Gprime <- tricubeStat(POS, Stat = GStat, windowSize = 4e6) 69 | #' @seealso \code{\link{getG}} for G statistic calculation 70 | #' @seealso \code{\link[locfit]{locfit}} for local regression 71 | 72 | tricubeStat <- function(POS, Stat, windowSize = 2e6, ...) 73 | { 74 | if (windowSize <= 0) 75 | stop("A positive smoothing window is required") 76 | stats::predict(locfit::locfit(Stat ~ locfit::lp(POS, h = windowSize, deg = 0), ...), POS) 77 | } 78 | 79 | 80 | #' Non-parametric estimation of the null distribution of G' 81 | #' 82 | #' The function is used by \code{\link{runGprimeAnalysis}} to estimate p-values for the 83 | #' weighted G' statistic based on the non-parametric estimation method described 84 | #' in Magwene et al. 2011. Briefly, using the natural log of Gprime a median 85 | #' absolute deviation (MAD) is calculated. The Gprime set is trimmed to exclude 86 | #' outlier regions (i.e. QTL) based on Hampel's rule. An alternate method for 87 | #' filtering out QTL is proposed using absolute delta SNP indices greater than 88 | #' a set threshold to filter out potential QTL.
An estimation of the mode of the trimmed set 89 | #' is calculated using the \code{\link[modeest]{mlv}} function from the package 90 | #' modeest. Finally, the mean and variance of the set are estimated using the 91 | #' median and mode and p-values are estimated from a log normal distribution. 92 | #' 93 | #' @param Gprime a vector of G prime values (tricube weighted G statistics) 94 | #' @param deltaSNP a vector of delta SNP values for use for QTL region filtering 95 | #' @param outlierFilter one of either "deltaSNP" or "Hampel". Method for 96 | #' filtering outlier (i.e. QTL) regions for p-value estimation 97 | #' @param filterThreshold The absolute delta SNP index to use to filter out putative QTL 98 | #' @export getPvals 99 | 100 | getPvals <- 101 | function(Gprime, 102 | deltaSNP = NULL, 103 | outlierFilter = c("deltaSNP", "Hampel"), 104 | filterThreshold) 105 | { 106 | outlierFilter <- match.arg(outlierFilter) 107 | if (outlierFilter == "deltaSNP") { 108 | 109 | if (abs(filterThreshold) >= 0.5) { 110 | stop("filterThreshold should be less than 0.5") 111 | } 112 | 113 | message("Using deltaSNP-index to filter outlier regions with a threshold of ", filterThreshold) 114 | trimGprime <- Gprime[abs(deltaSNP) < abs(filterThreshold)] 115 | } else { 116 | message("Using Hampel's rule to filter outlier regions") 117 | lnGprime <- log(Gprime) 118 | 119 | medianLogGprime <- median(lnGprime) 120 | 121 | # calculate the left median absolute deviation (MAD) of the log G' values 122 | MAD <- 123 | median(medianLogGprime - lnGprime[lnGprime <= medianLogGprime]) 124 | 125 | # Trim the G prime set to exclude outlier regions (i.e.
QTL) using Hampel's rule 126 | trimGprime <- 127 | Gprime[lnGprime - median(lnGprime) <= 5.2 * MAD] 128 | } 129 | 130 | medianTrimGprime <- median(trimGprime) 131 | 132 | # estimate the mode of the trimmed G' set using the half-sample method 133 | message("Estimating the mode of a trimmed G prime set using the 'modeest' package...") 134 | modeTrimGprime <- 135 | modeest::mlv(x = trimGprime, bw = 0.5, method = "hsm")[1] 136 | 137 | muE <- log(medianTrimGprime) 138 | varE <- abs(muE - log(modeTrimGprime)) 139 | # use the log normal distribution to get p-values 140 | message("Calculating p-values...") 141 | pval <- 142 | 1 - plnorm(q = Gprime, 143 | meanlog = muE, 144 | sdlog = sqrt(varE)) 145 | 146 | return(pval) 147 | } 148 | 149 | 150 | #' Find false discovery rate threshold 151 | #' 152 | #' Given a vector of p-values and a set false discovery rate alpha the function 153 | #' returns the highest p-value in the vector for which the Benjamini-Hochberg 154 | #' adjusted p-value (i.e. q-value) is less than that alpha, or NA if there is none. 155 | #' 156 | #' @param pvalues a vector of p-values 157 | #' @param alpha the required false discovery rate alpha 158 | #' 159 | #' @return The p-value threshold that corresponds to the Benjamini-Hochberg adjusted p-value at the FDR set by alpha. 160 | #' @export getFDRThreshold 161 | 162 | getFDRThreshold <- function(pvalues, alpha = 0.01) 163 | { 164 | sortedPvals <- sort(pvalues, decreasing = FALSE) 165 | pAdj <- p.adjust(sortedPvals, method = "BH") 166 | if (!any(pAdj < alpha)) { 167 | fdrThreshold <- NA 168 | } else { 169 | fdrThreshold <- sortedPvals[max(which(pAdj < alpha))] 170 | } 171 | return(fdrThreshold) 172 | } 173 | 174 | #' Identify QTL using a smoothed G statistic 175 | #' 176 | #' A wrapper for all the functions that perform the full G prime analysis to 177 | #' identify QTL. The following steps are performed:\cr 1) Genome-wide G 178 | #' statistics are calculated by \code{\link{getG}}.
\cr G is defined by the 179 | #' equation: \deqn{G = 2*\sum_{i=1}^{4} n_{i}*ln\frac{obs(n_i)}{exp(n_i)}}{G = 2 180 | #' * \sum n_i * ln(obs(n_i)/exp(n_i))} Where for each SNP, \eqn{n_i} from i = 1 181 | #' to 4 corresponds to the reference and alternate allele depths for each bulk, 182 | #' as described in the following table: \tabular{rcc}{ Allele \tab High Bulk 183 | #' \tab Low Bulk \cr Reference \tab \eqn{n_1} \tab \eqn{n_2} \cr Alternate \tab 184 | #' \eqn{n_3} \tab \eqn{n_4} \cr} ...and \eqn{obs(n_i)} are the observed allele 185 | #' depths as described in the data frame. \code{\link{getG}} calculates the G statistic 186 | #' using expected values assuming read depth is equal for all alleles in both 187 | #' bulks: \deqn{exp(n_1) = ((n_1 + n_2)*(n_1 + n_3))/(n_1 + n_2 + n_3 + n_4)} 188 | #' \deqn{exp(n_2) = ((n_2 + n_1)*(n_2 + n_4))/(n_1 + n_2 + n_3 + n_4)} 189 | #' \deqn{exp(n_3) = ((n_3 + n_1)*(n_3 + n_4))/(n_1 + n_2 + n_3 + n_4)} 190 | #' \deqn{exp(n_4) = ((n_4 + n_2)*(n_4 + n_3))/(n_1 + n_2 + n_3 + n_4)}\cr 2) G' 191 | #' - A tricube-smoothed G statistic is predicted by local regression within each 192 | #' chromosome using \code{\link{tricubeStat}}. This works as a weighted average 193 | #' across neighboring SNPs that accounts for linkage disequilibrium (LD) while 194 | #' minimizing noise attributed to SNP calling errors. G values for neighboring 195 | #' SNPs within the window are weighted by physical distance from the focal SNP. 196 | #' \cr \cr 3) P-values are estimated using the non-parametric method 197 | #' described by Magwene et al. 2011 with the function \code{\link{getPvals}}. 198 | #' Briefly, using the natural log of Gprime a median absolute deviation (MAD) is 199 | #' calculated. The Gprime set is trimmed to exclude outlier regions (i.e. QTL) 200 | #' based on Hampel's rule. An alternate method for filtering out QTL is proposed 201 | #' using absolute delta SNP indices greater than 0.1 to filter out potential 202 | #' QTL.
An estimation of the mode of the trimmed set is calculated using the 203 | #' \code{\link[modeest]{mlv}} function from the package modeest. Finally, the 204 | #' mean and variance of the set are estimated using the median and mode and 205 | #' p-values are estimated from a log normal distribution. \cr \cr 4) Negative 206 | #' Log10- and Benjamini-Hochberg adjusted p-values are calculated using 207 | #' \code{\link[stats]{p.adjust}} 208 | #' 209 | #' @param SNPset Data frame SNP set containing previously filtered SNPs 210 | #' @param windowSize the window size (in base pairs) bracketing each SNP for which 211 | #' to calculate the statistics. Magwene et al. recommend a window size of ~25 212 | #' cM, but also recommend optionally trying several window sizes to test if 213 | #' peaks are over- or undersmoothed. 214 | #' @param outlierFilter one of either "deltaSNP" or "Hampel". Method for 215 | #' filtering outlier (i.e. QTL) regions for p-value estimation 216 | #' @param filterThreshold The absolute delta SNP index to use to filter out putative QTL (default = 0.1) 217 | #' @param ... Other arguments passed to \code{\link[locfit]{locfit}} and 218 | #' subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful 219 | #' in cases where you get "out of vertex space" warnings; set the maxk higher 220 | #' than the default 100. See \code{\link[locfit]{locfit.raw}}(). But if you 221 | #' are getting that warning you should seriously consider increasing your 222 | #' window size. 223 | #' 224 | #' @return The supplied SNP set as a data frame after G' analysis.
Includes five new 225 | #' columns: \itemize{\item{G - The G statistic for each SNP} \item{Gprime - 226 | #' The tricube smoothed G statistic based on the supplied window size} 227 | #' \item{pvalue - the p-value at each SNP calculated by non-parametric 228 | #' estimation} \item{negLog10Pval - the -Log10(pvalue) supplied for quick 229 | #' plotting} \item{qvalue - the Benjamini-Hochberg adjusted p-value}} 230 | #' 231 | #' 232 | #' @importFrom dplyr %>% 233 | #' 234 | #' @export runGprimeAnalysis 235 | #' 236 | #' @examples df_filt <- runGprimeAnalysis(df_filt, windowSize = 2e6, outlierFilter = "deltaSNP") 237 | #' @useDynLib QTLseqr 238 | #' @importFrom Rcpp sourceCpp 239 | 240 | 241 | runGprimeAnalysis <- 242 | function(SNPset, 243 | windowSize = 1e6, 244 | outlierFilter = "deltaSNP", 245 | filterThreshold = 0.1, 246 | ...) 247 | { 248 | message("Counting SNPs in each window...") 249 | SNPset <- SNPset %>% 250 | dplyr::group_by(CHROM) %>% 251 | dplyr::mutate(nSNPs = countSNPs_cpp(POS = POS, windowSize = windowSize)) 252 | 253 | message("Calculating tricube smoothed delta SNP index...") 254 | SNPset <- SNPset %>% 255 | dplyr::mutate(tricubeDeltaSNP = tricubeStat(POS = POS, Stat = deltaSNP, windowSize = windowSize, ...)) 256 | 257 | message("Calculating G and G' statistics...") 258 | SNPset <- SNPset %>% 259 | dplyr::mutate( 260 | G = getG( 261 | LowRef = AD_REF.LOW, 262 | HighRef = AD_REF.HIGH, 263 | LowAlt = AD_ALT.LOW, 264 | HighAlt = AD_ALT.HIGH 265 | ), 266 | Gprime = tricubeStat( 267 | POS = POS, 268 | Stat = G, 269 | windowSize = windowSize, 270 | ...
271 | ) 272 | ) %>% 273 | dplyr::ungroup() %>% 274 | dplyr::mutate( 275 | pvalue = getPvals( 276 | Gprime = Gprime, 277 | deltaSNP = deltaSNP, 278 | outlierFilter = outlierFilter, 279 | filterThreshold = filterThreshold 280 | ), 281 | negLog10Pval = -log10(pvalue), 282 | qvalue = p.adjust(p = pvalue, method = "BH") 283 | ) 284 | 285 | return(as.data.frame(SNPset)) 286 | } 287 | -------------------------------------------------------------------------------- /R/Import_Filter.R: -------------------------------------------------------------------------------- 1 | #' Imports SNP data from GATK VariantsToTable output 2 | #' 3 | #' Imports SNP data from the output of the 4 | #' \href{https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php}{VariantsToTable} 5 | #' function in GATK. After importing the data, the function then calculates 6 | #' total reference allele frequency for both bulks together and the delta SNP index 7 | #' (i.e. SNP index of the low bulk subtracted from the SNP index of the high 8 | #' bulk), and returns a data frame. The required GATK fields 9 | #' (-F) are CHROM (Chromosome) and POS (Position). The required Genotype fields 10 | #' (-GF) are AD (Allele Depth) and DP (Depth). Recommended 11 | #' fields are REF (Reference allele) and ALT (Alternative allele). Recommended 12 | #' Genotype fields are PL (Phred-scaled likelihoods) and GQ (Genotype Quality). 13 | #' 14 | #' @param file The name of the GATK VariantsToTable output .table file which the 15 | #' data are to be read from. 16 | #' @param highBulk The sample name of the High Bulk 17 | #' @param lowBulk The sample name of the Low Bulk 18 | #' @param chromList a string vector of the chromosomes to be used in the 19 | #' analysis. Useful for filtering out unwanted contigs etc.
20 | #' @return Returns a data frame containing columns for Read depth (DP), 21 | #' Reference Allele Depth (AD_REF) and Alternative Allele Depth (AD_ALT), 22 | #' Genotype Quality (GQ) and SNPindex for each bulk (indicated by .HIGH and 23 | #' .LOW column name suffix). Total reference allele frequency "REF_FRQ" is the 24 | #' sum of AD_REF for both bulks divided by total Depth for that SNP. The 25 | #' deltaSNP column is equal to SNPindex.HIGH - SNPindex.LOW. 26 | #' 27 | #' @seealso \code{\link{getG}} for explanation of how G statistic is 28 | #' calculated. 29 | #' \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What 30 | #' is a VCF and how should I interpret it?} for more information on GATK 31 | #' Fields and Genotype Fields 32 | #' @examples df <- importFromGATK(file = "file.table", 33 | #' highBulk = highBulkSampleName, 34 | #' lowBulk = lowBulkSampleName, 35 | #' chromList = c("Chr1","Chr4","Chr7")) 36 | #' @export importFromGATK 37 | 38 | importFromGATK <- function(file, 39 | highBulk = character(), 40 | lowBulk = character(), 41 | chromList = NULL) { 42 | 43 | # first read one line to help define col types 44 | colheader <- read.delim(file, nrows=1, check.names=FALSE) 45 | 46 | # identify the sample name specific int and chr columns 47 | int_matches <- grep('DP|GQ', names(colheader), value=TRUE) 48 | chr_matches <- grep('\\.AD', names(colheader), value=TRUE) 49 | 50 | # create cols_spec class definitions 51 | int_cols <- do.call(readr::cols, setNames( 52 | rep(list(readr::col_integer()), length(int_matches)), 53 | int_matches))$cols 54 | 55 | chr_cols <- do.call(readr::cols, setNames( 56 | rep(list(readr::col_character()), length(chr_matches)), 57 | chr_matches))$cols 58 | 59 | col_defs <- readr::cols(CHROM = "c", POS = "i") 60 | col_defs$cols <- c(readr::cols(CHROM = "c", POS = "i")$cols, 61 | int_cols, 62 | chr_cols 63 | ) 64 | 65
| SNPset <- readr::read_tsv(file = file, 66 | col_names = TRUE, 67 | guess_max = 10000, 68 | col_types = col_defs) 69 | 70 | if (!all( 71 | c( 72 | "CHROM", 73 | "POS", 74 | paste0(highBulk, ".AD"), 75 | paste0(lowBulk, ".AD"), 76 | paste0(highBulk, ".DP"), 77 | paste0(lowBulk, ".DP") 78 | ) %in% names(SNPset))) { 79 | stop("One of the required fields is missing. Check your table file.") 80 | } 81 | 82 | # rename columns based on bulk names and flip headers (ie HIGH.AD -> AD.HIGH to match the rest of the functions 83 | colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")] <- 84 | gsub(pattern = highBulk, 85 | replacement = "HIGH", 86 | x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")]) 87 | colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")] <- 88 | gsub(pattern = lowBulk, 89 | replacement = "LOW", 90 | x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")]) 91 | 92 | colnames(SNPset) <- 93 | sapply(strsplit(colnames(SNPset), "[.]"), 94 | function(x) {paste0(rev(x),collapse = '.')}) 95 | 96 | #Keep only wanted chromosomes 97 | if (!is.null(chromList)) { 98 | message("Removing the following chromosomes: ", paste(unique(SNPset$CHROM)[!unique(SNPset$CHROM) %in% chromList], collapse = ", ")) 99 | SNPset <- SNPset[SNPset$CHROM %in% chromList, ] 100 | } 101 | #arrange the chromosomes by natural order sort, eg Chr1, Chr10, Chr2 >>> Chr1, Chr2, Chr10 102 | SNPset$CHROM <- 103 | factor(SNPset$CHROM, levels = gtools::mixedsort(unique(SNPset$CHROM))) 104 | 105 | SNPset <- SNPset %>% 106 | tidyr::separate( 107 | col = AD.LOW, 108 | into = "AD_REF.LOW", 109 | sep = ",", 110 | extra = "drop", 111 | convert = TRUE 112 | ) %>% 113 | tidyr::separate( 114 | col = AD.HIGH, 115 | into = "AD_REF.HIGH", 116 | sep = ",", 117 | extra = "drop", 118 | convert = TRUE 119 | ) %>% 120 | dplyr::mutate( 121 | AD_ALT.HIGH = DP.HIGH - AD_REF.HIGH, 122 | AD_ALT.LOW = DP.LOW - AD_REF.LOW, 123 | SNPindex.HIGH = 
AD_ALT.HIGH / DP.HIGH, 124 | SNPindex.LOW = AD_ALT.LOW / DP.LOW, 125 | REF_FRQ = (AD_REF.HIGH + AD_REF.LOW) / (DP.HIGH + DP.LOW), 126 | deltaSNP = SNPindex.HIGH - SNPindex.LOW 127 | ) %>% 128 | dplyr::select( 129 | -dplyr::contains("HIGH"), 130 | -dplyr::contains("LOW"), 131 | -dplyr::one_of("deltaSNP", "REF_FRQ"), 132 | dplyr::matches("AD.*.LOW"), 133 | dplyr::contains("LOW"), 134 | dplyr::matches("AD.*.HIGH"), 135 | dplyr::contains("HIGH"), 136 | dplyr::everything() 137 | ) 138 | 139 | return(as.data.frame(SNPset)) 140 | } 141 | 142 | 143 | #' Import SNP data from a delimited file 144 | #' 145 | #' After importing the data from a delimited file, the function then calculates 146 | #' total reference allele frequency for both bulks together and the delta SNP index 147 | #' (i.e. SNP index of the low bulk subtracted from the SNP index of the high 148 | #' bulk), and returns a data frame. The required columns in the 149 | #' file are CHROM (Chromosome) and POS (Position) as well as the reference and 150 | #' alternate allele depths (number of reads supporting each allele). The allele 151 | #' depths should be in columns named in this format: 152 | #' \code{AD_(REF/ALT).sampleName}. For example, the column for alternate 153 | #' allele depth for a high bulk sample named "sample1" should be 154 | #' "AD_ALT.sample1". Any other columns describing the SNPs are allowed, e.g. the 155 | #' actual allele calls, or a quality score. If the column is bulk-specific, it 156 | #' should be named \code{columnName.sampleName}, e.g. "QUAL.sample1". 157 | #' 158 | #' @param file The name of the file which the data are to be read from. 159 | #' @param highBulk The sample name of the High Bulk. Defaults to "HIGH" 160 | #' @param lowBulk The sample name of the Low Bulk. Defaults to "LOW" 161 | #' @param chromList a string vector of the chromosomes to be used in the 162 | #' analysis. Useful for filtering out unwanted contigs etc. 163 | #' @param sep the field separator character.
Values on each line of the file are 164 | #' separated by this character. Default is for a csv file, i.e. ",". 165 | #' @return Returns a data frame containing columns for per bulk total Read depth (DP), 166 | #' Reference Allele Depth (AD_REF) and Alternative Allele Depth (AD_ALT), any 167 | #' other SNP associated columns in the file, and SNPindex for each bulk 168 | #' (indicated by .HIGH and .LOW column name suffix). Total reference allele 169 | #' frequency "REF_FRQ" is the sum of AD_REF for both bulks divided by total 170 | #' Depth for that SNP. The deltaSNP column is equal to SNPindex.HIGH - 171 | #' SNPindex.LOW. 172 | #' 173 | #' @export importFromTable 174 | 175 | importFromTable <- 176 | function(file, 177 | highBulk = "HIGH", 178 | lowBulk = "LOW", 179 | chromList = NULL, 180 | sep = ",") { 181 | SNPset <- 182 | readr::read_delim( 183 | file = file, 184 | delim = sep, 185 | col_names = TRUE, 186 | col_types = readr::cols( 187 | .default = readr::col_guess(), 188 | CHROM = "c", 189 | POS = "i" 190 | ) 191 | ) 192 | # check CHROM 193 | if (!"CHROM" %in% names(SNPset)) { 194 | stop("No 'CHROM' column found.") 195 | } 196 | 197 | # check POS 198 | if (!"POS" %in% names(SNPset)) { 199 | stop("No 'POS' column found.") 200 | } 201 | 202 | # check AD_REF.HIGH 203 | if (!paste0("AD_REF.", highBulk) %in% names(SNPset)) { 204 | stop( 205 | "No High Bulk AD_REF column found. Column should be named 'AD_REF.highBulkName'." 206 | ) 207 | } 208 | 209 | # check AD_REF.LOW 210 | if (!paste0("AD_REF.", lowBulk) %in% names(SNPset)) { 211 | stop("No Low Bulk AD_REF column found. Column should be named 'AD_REF.lowBulkName'.") 212 | } 213 | 214 | # check AD_ALT.HIGH 215 | if (!paste0("AD_ALT.", highBulk) %in% names(SNPset)) { 216 | stop( 217 | "No High Bulk AD_ALT column found. Column should be named 'AD_ALT.highBulkName'." 218 | ) 219 | } 220 | 221 | # check AD_ALT.LOW 222 | if (!paste0("AD_ALT.", lowBulk) %in% names(SNPset)) { 223 | stop("No Low Bulk AD_ALT column found.
Column should be named 'AD_ALT.lowBulkName'.") 224 | } 225 | 226 | # Keep only wanted chromosomes 227 | if (!is.null(chromList)) { 228 | message("Removing the following chromosomes: ", 229 | paste(unique(SNPset$CHROM)[!unique(SNPset$CHROM) %in% chromList], collapse = ", ")) 230 | SNPset <- SNPset[SNPset$CHROM %in% chromList, ] 231 | } 232 | # arrange the chromosomes by natural order sort, eg Chr1, Chr10, Chr2 >>> Chr1, Chr2, Chr10 233 | SNPset$CHROM <- 234 | factor(SNPset$CHROM, levels = gtools::mixedsort(unique(SNPset$CHROM))) 235 | 236 | # Rename columns 237 | message("Renaming the following columns: ", 238 | paste(colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")][grep(highBulk, x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")])], collapse = ", ")) 239 | colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")] <- 240 | gsub(pattern = highBulk, 241 | replacement = "HIGH", 242 | x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")]) 243 | 244 | message("Renaming the following columns: ", 245 | paste(colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")][grep(lowBulk, x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")])], collapse = ", ")) 246 | colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")] <- 247 | gsub(pattern = lowBulk, 248 | replacement = "LOW", 249 | x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")]) 250 | 251 | # calculate DPs 252 | SNPset <- SNPset %>% 253 | dplyr::mutate( 254 | DP.HIGH = AD_REF.HIGH + AD_ALT.HIGH, 255 | DP.LOW = AD_REF.LOW + AD_ALT.LOW, 256 | SNPindex.HIGH = AD_ALT.HIGH / DP.HIGH, 257 | SNPindex.LOW = AD_ALT.LOW / DP.LOW, 258 | REF_FRQ = (AD_REF.HIGH + AD_REF.LOW) / (DP.HIGH + DP.LOW), 259 | deltaSNP = SNPindex.HIGH - SNPindex.LOW 260 | ) %>% 261 | dplyr::select( 262 | -dplyr::contains("HIGH"),-dplyr::contains("LOW"),-dplyr::one_of("deltaSNP", "REF_FRQ"), 263 
| dplyr::matches("AD.*.LOW"), 264 | dplyr::contains("LOW"), 265 | dplyr::matches("AD.*.HIGH"), 266 | dplyr::contains("HIGH"), 267 | dplyr::everything() 268 | ) 269 | 270 | return(as.data.frame(SNPset)) 271 | } 272 | 273 | 274 | ## Not exported; still only works for GATK output. 275 | importFromVCF <- function(file, 276 | highBulk = character(), 277 | lowBulk = character(), 278 | chromList = NULL) { 279 | 280 | vcf <- vcfR::read.vcfR(file = file) 281 | message("Keeping SNPs that pass all filters") 282 | vcf <- vcf[vcf@fix[, "FILTER"] == "PASS"] 283 | 284 | fix <- dplyr::as_tibble(vcf@fix[, c("CHROM", "POS", "REF", "ALT")]) %>% dplyr::mutate(Key = seq(1:nrow(.))) 285 | 286 | # if (!all( 287 | # c( 288 | # "CHROM", 289 | # "POS", 290 | # paste0(highBulk, ".AD"), 291 | # paste0(lowBulk, ".AD"), 292 | # paste0(highBulk, ".DP"), 293 | # paste0(lowBulk, ".DP") 294 | # ) %in% names(SNPset))) { 295 | # stop("One of the required fields is missing. Check your VCF file.") 296 | # } 297 | 298 | tidy_gt <- vcfR::extract_gt_tidy(vcf, format_fields = c("AD", "DP", "GQ"), gt_column_prepend = "", alleles = FALSE) 299 | 300 | SNPset <- tidy_gt %>% 301 | dplyr::filter(Indiv == lowBulk) %>% dplyr::select(-Indiv) %>% 302 | dplyr::left_join(dplyr::select(dplyr::filter(tidy_gt, Indiv == highBulk), -Indiv), 303 | by = "Key", 304 | suffix = c(".LOW", ".HIGH")) %>% 305 | tidyr::separate( 306 | col = "AD.LOW", 307 | into = c("AD_REF.LOW", "AD_ALT.LOW"), 308 | sep = ",", 309 | extra = "merge", 310 | convert = TRUE 311 | ) %>% 312 | tidyr::separate( 313 | col = "AD.HIGH", 314 | into = c("AD_REF.HIGH", "AD_ALT.HIGH"), 315 | sep = ",", 316 | extra = "merge", 317 | convert = TRUE 318 | ) %>% 319 | dplyr::full_join(x = fix, by = "Key") %>% 320 | dplyr::mutate( 321 | AD_ALT.HIGH = DP.HIGH - AD_REF.HIGH, 322 | AD_ALT.LOW = DP.LOW - AD_REF.LOW, 323 | SNPindex.HIGH = AD_ALT.HIGH / DP.HIGH, 324 | SNPindex.LOW = AD_ALT.LOW / DP.LOW, 325 | REF_FRQ = (AD_REF.HIGH + AD_REF.LOW) / (DP.HIGH + DP.LOW), 326 | deltaSNP = SNPindex.HIGH - SNPindex.LOW 327 | ) 
%>% 328 | dplyr::select(-Key) 329 | #Keep only wanted chromosomes 330 | if (!is.null(chromList)) { 331 | message("Removing the following chromosomes: ", paste(unique(SNPset$CHROM)[!unique(SNPset$CHROM) %in% chromList], collapse = ", ")) 332 | SNPset <- SNPset[SNPset$CHROM %in% chromList, ] 333 | } 334 | as.data.frame(SNPset) 335 | } 336 | 337 | 338 | #' Filter SNPs based on read depth and quality 339 | #' 340 | #' Use filtering parameters to filter out high and low depth reads as well as 341 | #' low Genotype Quality as defined by GATK. All filters are optional but recommended. 342 | #' 343 | #' @param SNPset The data frame imported by \code{importFromGATK} 344 | #' @param refAlleleFreq A numeric < 1. This will filter out SNPs with a 345 | #' Reference Allele Frequency less than \code{refAlleleFreq} or greater than 346 | #' 1 - \code{refAlleleFreq}. E.g. \code{refAlleleFreq = 0.3} will keep SNPs 347 | #' with 0.3 <= REF_FRQ <= 0.7 348 | #' @param filterAroundMedianDepth Filters total SNP read depth for both bulks. A 349 | #' median and median absolute deviation (MAD) of depth will be calculated. 350 | #' SNPs with total read depth more than \code{filterAroundMedianDepth} 351 | #' MADs away from the median will be filtered. 352 | #' @param minTotalDepth The minimum total read depth for a SNP (counting both 353 | #' bulks) 354 | #' @param maxTotalDepth The maximum total read depth for a SNP (counting both 355 | #' bulks) 356 | #' @param minSampleDepth The minimum read depth for a SNP in each bulk 357 | #' @param depthDifference The maximum absolute difference in read depth between the bulks. 358 | #' @param minGQ The minimum Genotype Quality as set by GATK. This is a measure 359 | #' of how confident GATK was with the assigned genotype (i.e. homozygous ref, 360 | #' heterozygous, homozygous alt). 
See 361 | #' \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What 362 | #' is a VCF and how should I interpret it?} 363 | #' @param verbose logical. If \code{TRUE} will report number of SNPs filtered in 364 | #' each step. 365 | #' @return Returns a subset of the data frame supplied which meets the filtering 366 | #' conditions applied by the selected parameters. If \code{verbose} is 367 | #' \code{TRUE} the function reports the number of SNPs filtered in each step 368 | #' as well as the initial number of SNPs, the total number of SNPs filtered 369 | #' and the remaining number. 370 | #' 371 | #' @seealso See \code{\link[stats]{mad}} for an explanation of how the 372 | #' median absolute deviation is calculated. 373 | #' \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What 374 | #' is a VCF and how should I interpret it?} for more information on GATK 375 | #' Fields and Genotype Fields 376 | #' @examples df_filt <- filterSNPs( 377 | #' df, 378 | #' refAlleleFreq = 0.3, 379 | #' minTotalDepth = 40, 380 | #' maxTotalDepth = 80, 381 | #' minSampleDepth = 20, 382 | #' minGQ = 99, 383 | #' verbose = TRUE 384 | #' ) 385 | #' 386 | #' @export filterSNPs 387 | 388 | filterSNPs <- function(SNPset, 389 | refAlleleFreq, 390 | filterAroundMedianDepth, 391 | minTotalDepth, 392 | maxTotalDepth, 393 | minSampleDepth, 394 | depthDifference, 395 | minGQ, 396 | verbose = TRUE) { 397 | 398 | org_count <- nrow(SNPset) 399 | count <- nrow(SNPset) 400 | 401 | # Filter by total reference allele frequency 402 | if (!missing(refAlleleFreq)) { 403 | if (verbose) { 404 | message( 405 | "Filtering by reference allele frequency: ", 406 | refAlleleFreq, 407 | " <= REF_FRQ <= ", 408 | 1 - refAlleleFreq 409 | ) 410 | } 411 | SNPset <- dplyr::filter(SNPset, SNPset$REF_FRQ <= 1 - refAlleleFreq & 412 | SNPset$REF_FRQ >= refAlleleFreq) 413 | if (verbose) { 414 | message("...Filtered ", 
count - nrow(SNPset), " SNPs") 415 | } 416 | count <- nrow(SNPset) 417 | } 418 | 419 | #Total read depth filtering 420 | 421 | if (!missing(filterAroundMedianDepth)) { 422 | # Filter by total read depth: keep SNPs within filterAroundMedianDepth MADs of the median 423 | madDP <- 424 | mad( 425 | x = (SNPset$DP.HIGH + SNPset$DP.LOW), 426 | constant = 1, 427 | na.rm = TRUE 428 | ) 429 | medianDP <- 430 | median(x = (SNPset$DP.HIGH + SNPset$DP.LOW), 431 | na.rm = TRUE) 432 | maxDP <- medianDP + filterAroundMedianDepth * madDP 433 | minDP <- medianDP - filterAroundMedianDepth * madDP 434 | SNPset <- dplyr::filter(SNPset, (DP.HIGH + DP.LOW) <= maxDP & 435 | (DP.HIGH + DP.LOW) >= minDP) 436 | 437 | if(verbose) {message("Filtering by total read depth: ", 438 | filterAroundMedianDepth, 439 | " MADs around the median: ", minDP, " <= Total DP <= ", maxDP) 440 | message("...Filtered ", count - nrow(SNPset), " SNPs")} 441 | count <- nrow(SNPset) 442 | 443 | } 444 | 445 | if (!missing(minTotalDepth)) { 446 | # Filter by minimum total SNP depth 447 | if (verbose) { 448 | message("Filtering by total sample read depth: Total DP >= ", 449 | minTotalDepth) 450 | } 451 | SNPset <- 452 | dplyr::filter(SNPset, (DP.HIGH + DP.LOW) >= minTotalDepth) 453 | 454 | if (verbose) { 455 | message("...Filtered ", count - nrow(SNPset), " SNPs") 456 | } 457 | count <- nrow(SNPset) 458 | } 459 | 460 | if (!missing(maxTotalDepth)) { 461 | # Filter by maximum total SNP depth 462 | if (verbose) { 463 | message("Filtering by total sample read depth: Total DP <= ", 464 | maxTotalDepth) 465 | } 466 | SNPset <- 467 | dplyr::filter(SNPset, (DP.HIGH + DP.LOW) <= maxTotalDepth) 468 | if (verbose) { 469 | message("...Filtered ", count - nrow(SNPset), " SNPs") 470 | } 471 | count <- nrow(SNPset) 472 | } 473 | 474 | 475 | # Filter by minimum read depth in each sample 476 | if (!missing(minSampleDepth)) { 477 | if (verbose) { 478 | message("Filtering by per sample read depth: DP >= ", 479 | minSampleDepth) 480 | } 481 | SNPset <- 482 
| dplyr::filter(SNPset, DP.HIGH >= minSampleDepth & 483 | SNPset$DP.LOW >= minSampleDepth) 484 | if (verbose) { 485 | message("...Filtered ", count - nrow(SNPset), " SNPs") 486 | } 487 | count <- nrow(SNPset) 488 | } 489 | 490 | # Filter by Genotype Quality 491 | if (!missing(minGQ)) { 492 | if (all(c("GQ.LOW", "GQ.HIGH") %in% names(SNPset))) { 493 | if (verbose) { 494 | message("Filtering by Genotype Quality: GQ >= ", minGQ) 495 | } 496 | SNPset <- 497 | dplyr::filter(SNPset, GQ.LOW >= minGQ & GQ.HIGH >= minGQ) 498 | if (verbose) { 499 | message("...Filtered ", count - nrow(SNPset), " SNPs") 500 | } 501 | count <- nrow(SNPset)} 502 | else { 503 | message("GQ columns not found. Skipping...") 504 | } 505 | } 506 | 507 | # Filter by difference between high and low bulks 508 | if (!missing(depthDifference)) { 509 | if (verbose) { 510 | message("Filtering by difference between bulks <= ", 511 | depthDifference) 512 | } 513 | SNPset <- 514 | dplyr::filter(SNPset, abs(DP.HIGH - DP.LOW) <= depthDifference) 515 | if (verbose) { 516 | message("...Filtered ", count - nrow(SNPset), " SNPs") 517 | } 518 | count <- nrow(SNPset) 519 | } 520 | 521 | # #Filter SNP Clusters 522 | # if (!is.null(SNPsInCluster) & !is.null(ClusterWin)) { 523 | # tmp <- which(diff(SNPset$POS, SNPsInCluster-1) < ClusterWin) 524 | # message("...Filtered ", count - nrow(SNPset), " SNPs") 525 | # count <- nrow(SNPset) 526 | # } 527 | if (verbose) { 528 | message( 529 | "Original SNP number: ", 530 | org_count, 531 | ", Filtered: ", 532 | org_count - count, 533 | ", Remaining: ", 534 | count 535 | ) 536 | } 537 | return(as.data.frame(SNPset)) 538 | } 539 | -------------------------------------------------------------------------------- /R/RcppExports.R: -------------------------------------------------------------------------------- 1 | # Generated by using Rcpp::compileAttributes() -> do not edit by hand 2 | # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 3 | 4 | #' Count number of SNPs within a 
sliding window 5 | #' 6 | #' For each SNP returns how many SNPs are bracketing it within the set window size 7 | #' 8 | #' @param POS A numeric vector of genomic positions for each SNP 9 | #' @param windowSize The required window size 10 | #' @export countSNPs_cpp 11 | countSNPs_cpp <- function(POS, windowSize) { 12 | .Call('_QTLseqr_countSNPs_cpp', PACKAGE = 'QTLseqr', POS, windowSize) 13 | } 14 | 15 | -------------------------------------------------------------------------------- /R/export_functions.R: -------------------------------------------------------------------------------- 1 | #' Return SNPs in significant regions 2 | #' 3 | #' The function takes a SNP set after calculation of p- and q-values or Takagi confidence intervals and returns 4 | #' a list containing all SNPs with q-values below a set alpha, or with a tricube-smoothed deltaSNP beyond the requested confidence interval, respectively. Each entry in the list 5 | #' is a SNP set data frame covering a contiguous region that passes the chosen threshold, 6 | #' e.g. adjusted p-values lower than the set false discovery rate alpha. 7 | #' 8 | #' @param SNPset Data frame SNP set containing previously filtered SNPs. 9 | #' @param method either "Gprime" or "QTLseq". The method for detecting significant regions. 10 | #' @param alpha numeric. The required false discovery rate alpha for use with \code{method = "Gprime"} 11 | #' @param interval numeric. For use with \code{method = "QTLseq"}. The Takagi based confidence interval requested. This will find the column named "CI_\*\*", where \*\* is the requested interval, e.g. 99. 
12 | #' 13 | #' @export getSigRegions 14 | 15 | getSigRegions <- 16 | function(SNPset, 17 | method = "Gprime", 18 | alpha = 0.05, 19 | interval = 99) 20 | { 21 | conf <- paste0("CI_", interval) 22 | 23 | if (!method %in% c("Gprime", "QTLseq")) { 24 | stop("method must be either \"Gprime\" or \"QTLseq\"") 25 | } 26 | 27 | if ((method == "Gprime") & !("qvalue" %in% colnames(SNPset))) { 28 | stop("Please first use runGprimeAnalysis to calculate q-values") 29 | } 30 | 31 | if ((method == "QTLseq") & !(any(names(SNPset) %in% conf))) { 32 | stop( 33 | "Can't find the requested confidence interval. Please check that the requested interval exists or first use runQTLseqAnalysis to calculate confidence intervals" 34 | ) 35 | } 36 | 37 | #QTL <- getSigRegions(SNPset = SNPset, method = method, interval = interval, alpha = alpha) 38 | fdrT <- getFDRThreshold(SNPset$pvalue, alpha = alpha) 39 | GprimeT <- SNPset[which(SNPset$pvalue == fdrT), "Gprime"] 40 | #merged_QTL <- dplyr::bind_rows(QTL, .id = "id") 41 | SNPset <- SNPset %>% 42 | dplyr::group_by(CHROM) 43 | 44 | if (method == "QTLseq") { 45 | qtltable <- 46 | SNPset %>% dplyr::mutate(passThresh = abs(tricubeDeltaSNP) > abs(!!as.name(conf))) %>% 47 | dplyr::group_by(CHROM, run = { 48 | run = rle(passThresh) 49 | rep(seq_along(run$lengths), run$lengths) 50 | }) %>% 51 | dplyr::filter(passThresh == TRUE) %>% dplyr::ungroup() %>% 52 | dplyr::group_by(CHROM) %>% dplyr::group_by(CHROM, qtl = { 53 | qtl = rep(seq_along(rle(run)$lengths), rle(run)$lengths) 54 | }) %>% 55 | #don't need run variable anymore 56 | dplyr::select(-run,-qtl,-passThresh) 57 | } else { 58 | qtltable <- SNPset %>% dplyr::mutate(passThresh = qvalue <= alpha) %>% 59 | dplyr::group_by(CHROM, run = { 60 | run = rle(passThresh) 61 | rep(seq_along(run$lengths), run$lengths) 62 | }) %>% 63 | dplyr::filter(passThresh == TRUE) %>% dplyr::ungroup() %>% 64 | dplyr::group_by(CHROM) %>% dplyr::group_by(CHROM, qtl = { 65 | qtl = rep(seq_along(rle(run)$lengths), rle(run)$lengths) 
66 | }) %>% 67 | #don't need run variable anymore 68 | dplyr::select(-run,-qtl,-passThresh) 69 | } 70 | 71 | qtltable <- as.data.frame(qtltable) 72 | qtlList <- 73 | split(qtltable, factor( 74 | paste(qtltable$CHROM, qtltable$qtl, sep = "_"), 75 | levels = gtools::mixedsort(unique( 76 | paste(qtltable$CHROM, qtltable$qtl, sep = "_") 77 | )) 78 | )) 79 | 80 | return(qtlList) 81 | } 82 | 83 | 84 | #' Export a summarized table of QTL 85 | #' @param SNPset Data frame SNP set containing previously filtered SNPs 86 | #' @param method either "Gprime" or "QTLseq". The method for detecting significant regions. 87 | #' @param alpha numeric. The required false discovery rate alpha for use with \code{method = "Gprime"} 88 | #' @param interval numeric. For use with \code{method = "QTLseq"}. The Takagi based confidence interval requested. This will find the column named "CI_\*\*", where \*\* is the requested interval, e.g. 99. 89 | #' @param export logical. If TRUE will export a csv table. 90 | #' @param fileName either a character string naming a file or a connection open for writing. "" indicates output to the console. 91 | #' 92 | #' @return Returns a summarized table of QTL identified. The table contains the following columns: 93 | #' \itemize{ 94 | #' \item{id - the QTL identification number} 95 | #' \item{chromosome - The chromosome on which the region was identified} 96 | #' \item{start - the start position on that chromosome, i.e. 
the position of the first SNP that passes the FDR threshold} 97 | #' \item{end - the end position} 98 | #' \item{length - the length in base pairs from start to end of the region} 99 | #' \item{nSNPs - the number of SNPs in the region} 100 | #' \item{avgSNPs_Mb - the average number of SNPs/Mb within that region} 101 | #' \item{peakDeltaSNP - the tricube-smoothed deltaSNP-index value at the peak summit} 102 | #' \item{posPeakDeltaSNP - the position of the absolute maximum tricube-smoothed deltaSNP-index} 103 | #' \item{maxGprime - the max G' score in the region} 104 | #' \item{posMaxGprime - the genomic position of the maximum G' value in the QTL} 105 | #' \item{meanGprime - the average G' score of that region} 106 | #' \item{sdGprime - the standard deviation of G' within the region} 107 | #' \item{AUCaT - the Area Under the Curve but above the Threshold line, an indicator of how significant or wide the peak is} 108 | #' \item{meanPval - the average p-value in the region} 109 | #' \item{meanQval - the average adjusted p-value in the region} 110 | #'} 111 | #' @export getQTLTable 112 | #' 113 | #' @importFrom dplyr %>% 114 | 115 | getQTLTable <- 116 | function(SNPset, 117 | method = "Gprime", 118 | alpha = 0.05, 119 | interval = 99, 120 | export = FALSE, 121 | fileName = "QTL.csv") 122 | { 123 | conf <- paste0("CI_", interval) 124 | 125 | if (!method %in% c("Gprime", "QTLseq")) { 126 | stop("method must be either \"Gprime\" or \"QTLseq\"") 127 | } 128 | 129 | if ((method == "Gprime") & 130 | !("qvalue" %in% colnames(SNPset))) { 131 | stop("Please first use runGprimeAnalysis to calculate q-values") 132 | } 133 | 134 | if ((method == "QTLseq") & !(any(names(SNPset) %in% conf))) { 135 | stop( 136 | "Can't find the requested confidence interval. 
Please check that the requested interval exists or first use runQTLseqAnalysis to calculate confidence intervals" 137 | ) 138 | } 139 | 140 | #QTL <- getSigRegions(SNPset = SNPset, method = method, interval = interval, alpha = alpha) 141 | fdrT <- getFDRThreshold(SNPset$pvalue, alpha = alpha) 142 | GprimeT <- SNPset[which(SNPset$pvalue == fdrT), "Gprime"] 143 | #merged_QTL <- dplyr::bind_rows(QTL, .id = "id") 144 | SNPset <- SNPset %>% 145 | dplyr::group_by(CHROM) 146 | 147 | if (method == "QTLseq") { 148 | qtltable <- 149 | SNPset %>% 150 | dplyr::mutate(passThresh = abs(tricubeDeltaSNP) > abs(!!as.name(conf))) %>% 151 | dplyr::group_by(CHROM, run = { 152 | run = rle(passThresh) 153 | rep(seq_along(run$lengths), run$lengths) 154 | }) %>% 155 | dplyr::filter(passThresh == TRUE) %>% 156 | dplyr::ungroup() %>% 157 | dplyr::group_by(CHROM) %>% 158 | dplyr::group_by(CHROM, qtl = { 159 | qtl = rep(seq_along(rle(run)$lengths), rle(run)$lengths) 160 | }) %>% 161 | #don't need run variable anymore 162 | dplyr::select(-run) %>% 163 | dplyr::summarize( 164 | start = min(POS), 165 | end = max(POS), 166 | length = end - start, 167 | nSNPs = length(POS), 168 | avgSNPs_Mb = round(length(POS) / (max(POS) - min(POS)) * 1e6), 169 | peakDeltaSNP = ifelse( 170 | mean(tricubeDeltaSNP) >= 0, 171 | max(tricubeDeltaSNP), 172 | min(tricubeDeltaSNP) 173 | ), 174 | posPeakDeltaSNP = POS[which.max(abs(tricubeDeltaSNP))], 175 | avgDeltaSNP = mean(tricubeDeltaSNP) 176 | ) 177 | } else { 178 | qtltable <- SNPset %>% dplyr::mutate(passThresh = qvalue <= alpha) %>% 179 | dplyr::group_by(CHROM, run = { 180 | run = rle(passThresh) 181 | rep(seq_along(run$lengths), run$lengths) 182 | }) %>% 183 | dplyr::filter(passThresh == TRUE) %>% dplyr::ungroup() %>% 184 | dplyr::group_by(CHROM) %>% dplyr::group_by(CHROM, qtl = { 185 | qtl = rep(seq_along(rle(run)$lengths), rle(run)$lengths) 186 | }) %>% 187 | #don't need run variable anymore 188 | dplyr::select(-run) %>% 189 | dplyr::summarize( 190 | start = 
min(POS), 191 | end = max(POS), 192 | length = end - start, 193 | nSNPs = length(POS), 194 | avgSNPs_Mb = round(length(POS) / (max(POS) - min(POS)) * 1e6), 195 | peakDeltaSNP = ifelse( 196 | mean(tricubeDeltaSNP) >= 0, 197 | max(tricubeDeltaSNP), 198 | min(tricubeDeltaSNP) 199 | ), 200 | posPeakDeltaSNP = POS[which.max(abs(tricubeDeltaSNP))], 201 | avgDeltaSNP = mean(tricubeDeltaSNP), 202 | #Gprime stuff 203 | maxGprime = max(Gprime), 204 | posMaxGprime = POS[which.max(Gprime)], 205 | meanGprime = mean(Gprime), 206 | sdGprime = sd(Gprime), 207 | AUCaT = sum(diff(POS) * (((head(Gprime, -1) + tail(Gprime, -1)) / 2 208 | ) - GprimeT)), 209 | meanPval = mean(pvalue), 210 | meanQval = mean(qvalue) 211 | ) 212 | } 213 | 214 | qtltable <- as.data.frame(qtltable) 215 | 216 | if (export) { 217 | write.csv(file = fileName, 218 | x = qtltable, 219 | row.names = FALSE) 220 | } 221 | return(qtltable) 222 | } 223 | -------------------------------------------------------------------------------- /R/format_genomic.R: -------------------------------------------------------------------------------- 1 | format_genomic <- function(...) { 2 | # Format a vector of numeric values according 3 | # to the International System of Units. 4 | # http://en.wikipedia.org/wiki/SI_prefix 5 | # 6 | # Based on code by Ben Tupper 7 | # https://stat.ethz.ch/pipermail/r-help/2012-January/299804.html 8 | # Args: 9 | # ...: Args passed to format() 10 | # 11 | # Returns: 12 | # A function to format a vector of strings using 13 | # SI prefix notation 14 | # 15 | 16 | function(x) { 17 | limits <- c(1e0, 1e3, 1e6) 18 | #prefix <- c("","Kb","Mb") 19 | 20 | # Vector with array indices according to position in intervals 21 | i <- findInterval(abs(x), limits) 22 | 23 | # Set prefix to " " for very small values < 1e-24 24 | i <- ifelse(i==0, which(limits == 1e0), i) 25 | 26 | paste(format(round(x/limits[i], 1), 27 | trim=TRUE, scientific=FALSE, ...) 
28 | # ,prefix[i] 29 | ) 30 | } 31 | } 32 | -------------------------------------------------------------------------------- /R/onunload.R: -------------------------------------------------------------------------------- 1 | .onUnload <- function (libpath) { 2 | library.dynam.unload("QTLseqr", libpath) 3 | } -------------------------------------------------------------------------------- /R/plotting_functions.R: -------------------------------------------------------------------------------- 1 | #' Plots different parameters for QTL identification 2 | #' 3 | #' A wrapper for ggplot to plot genome wide distribution of parameters used to 4 | #' identify QTL. 5 | #' 6 | #' @param SNPset a data frame with SNPs and genotype fields as imported by 7 | #' \code{importFromGATK} and after running \code{runGprimeAnalysis} or \code{runQTLseqAnalysis} 8 | #' @param subset a vector of chromosome names for use in quick plotting of 9 | #' chromosomes of interest. Defaults to 10 | #' NULL and will plot all chromosomes in the SNPset 11 | #' @param var character. The parameter for plotting. Must be one of: "nSNPs", 12 | #' "deltaSNP", "Gprime", "negLog10Pval" 13 | #' @param scaleChroms boolean. If TRUE (default) then chromosome facets will be 14 | #' scaled to relative chromosome sizes. If FALSE all facets will be equal 15 | #' sizes. This is basically a convenience argument for setting both scales and 16 | #' space as "free_x" in ggplot2::facet_grid. 17 | #' @param line boolean. If TRUE will plot line graph. If FALSE will plot points. 18 | #' Plotting points will take more time. 19 | #' @param plotThreshold boolean. Should we plot the False Discovery Rate 20 | #' threshold (FDR). Only plots line if var is "Gprime" or "negLog10Pval". 21 | #' @param plotIntervals boolean. Whether or not to plot the two-sided Takagi confidence intervals in "deltaSNP" plots. 22 | #' @param q numeric. The q-value to use as the FDR threshold. 
If too low, no 23 | #' line will be drawn and a warning will be given. 24 | #' @param ... arguments to pass to ggplot2::geom_line or ggplot2::geom_point for 25 | #' changing colors etc. 26 | #' 27 | #' @return Plots a ggplot graph for all chromosomes or those requested in 28 | #' \code{subset}. By setting \code{var} to "nSNPs" the distribution of SNPs 29 | #' used to calculate G' will be plotted. "deltaSNP" will plot a tri-cube 30 | #' weighted delta SNP-index for each SNP. "Gprime" will plot the tri-cube 31 | #' weighted G' value. Setting "negLog10Pval" will plot the -log10 of the p-value 32 | #' at each SNP. In "Gprime" and "negLog10Pval" plots, a genome wide FDR threshold of 33 | #' q can be drawn by setting "plotThreshold" to TRUE. The default is a red 34 | #' line. If you would like to plot a different line we suggest setting 35 | #' "plotThreshold" to FALSE and manually adding a line using 36 | #' ggplot2::geom_hline. 37 | #' 38 | #' @examples p <- plotQTLStats(df_filt_6Mb, var = "Gprime", plotThreshold = TRUE, q = 0.01, subset = c("Chr3","Chr4")) 39 | #' @export plotQTLStats 40 | 41 | plotQTLStats <- 42 | function(SNPset, 43 | subset = NULL, 44 | var = "nSNPs", 45 | scaleChroms = TRUE, 46 | line = TRUE, 47 | plotThreshold = FALSE, 48 | plotIntervals = FALSE, 49 | q = 0.05, 50 | ...) 
{ 51 | 52 | #get fdr threshold by ordering snps by pval then getting the last pval 53 | #with a qval < q 54 | 55 | if (!all(subset %in% unique(SNPset$CHROM))) { 56 | whichnot <- 57 | paste(subset[base::which(!subset %in% unique(SNPset$CHROM))], collapse = ', ') 58 | stop(paste0("The following are not true chromosome names: ", whichnot)) 59 | } 60 | 61 | if (!var %in% c("nSNPs", "deltaSNP", "Gprime", "negLog10Pval")) 62 | stop( 63 | "Please choose one of the following variables to plot: \"nSNPs\", \"deltaSNP\", \"Gprime\", \"negLog10Pval\"" 64 | ) 65 | 66 | #don't plot threshold lines in deltaSNPprime or number of SNPs as they are not relevant 67 | if ((plotThreshold == TRUE & 68 | var == "deltaSNP") | 69 | (plotThreshold == TRUE & var == "nSNPs")) { 70 | message("FDR threshold is not plotted in deltaSNP or nSNPs plots") 71 | plotThreshold <- FALSE 72 | } 73 | #if you need to plot threshold get the FDR, but check if there are any values that pass fdr 74 | 75 | GprimeT <- 0 76 | logFdrT <- 0 77 | 78 | if (plotThreshold == TRUE) { 79 | fdrT <- getFDRThreshold(SNPset$pvalue, alpha = q) 80 | 81 | if (is.na(fdrT)) { 82 | warning("The q threshold is too low. 
No threshold line will be drawn") 83 | plotThreshold <- FALSE 84 | 85 | } else { 86 | logFdrT <- -log10(fdrT) 87 | GprimeT <- SNPset[which(SNPset$pvalue == fdrT), "Gprime"] 88 | } 89 | } 90 | 91 | SNPset <- 92 | if (is.null(subset)) { 93 | SNPset 94 | } else { 95 | SNPset[SNPset$CHROM %in% subset,] 96 | } 97 | 98 | p <- ggplot2::ggplot(data = SNPset) + 99 | ggplot2::scale_x_continuous(breaks = seq(from = 0,to = max(SNPset$POS), by = 10^(floor(log10(max(SNPset$POS))))), labels = format_genomic(), name = "Genomic Position (Mb)") + 100 | ggplot2::theme(plot.margin = ggplot2::margin( 101 | b = 10, 102 | l = 20, 103 | r = 20, 104 | unit = "pt" 105 | )) 106 | 107 | if (var == "Gprime") { 108 | threshold <- GprimeT 109 | p <- p + ggplot2::ylab("G' value") 110 | } 111 | 112 | if (var == "negLog10Pval") { 113 | threshold <- logFdrT 114 | p <- 115 | p + ggplot2::ylab(expression("-" * log[10] * '(p-value)')) 116 | } 117 | 118 | if (var == "nSNPs") { 119 | p <- p + ggplot2::ylab("Number of SNPs in window") 120 | } 121 | 122 | if (var == "deltaSNP") { 123 | var <- "tricubeDeltaSNP" 124 | p <- 125 | p + ggplot2::ylab(expression(Delta * '(SNP-index)')) + 126 | ggplot2::ylim(-0.55, 0.55) + 127 | ggplot2::geom_hline(yintercept = 0, 128 | color = "black", 129 | alpha = 0.4) 130 | if (plotIntervals == TRUE) { 131 | 132 | ints_df <- 133 | dplyr::select(SNPset, CHROM, POS, dplyr::matches("CI_")) %>% tidyr::gather(key = "Interval", value = "value",-CHROM,-POS) 134 | 135 | p <- p + ggplot2::geom_line(data = ints_df, ggplot2::aes(x = POS, y = value, color = Interval)) + 136 | ggplot2::geom_line(data = ints_df, ggplot2::aes( 137 | x = POS, 138 | y = -value, 139 | color = Interval 140 | )) 141 | } 142 | } 143 | 144 | if (line) { 145 | p <- 146 | p + ggplot2::geom_line(ggplot2::aes_string(x = "POS", y = var), ...) 147 | } 148 | 149 | if (!line) { 150 | p <- 151 | p + ggplot2::geom_point(ggplot2::aes_string(x = "POS", y = var), ...) 
152 | } 153 | 154 | if (plotThreshold == TRUE) 155 | p <- 156 | p + ggplot2::geom_hline( 157 | ggplot2::aes_string(yintercept = "threshold"), 158 | color = "red", 159 | size = 1, 160 | alpha = 0.4 161 | ) 162 | 163 | if (scaleChroms == TRUE) { 164 | p <- p + ggplot2::facet_grid(~ CHROM, scales = "free_x", space = "free_x") 165 | } else { 166 | p <- p + ggplot2::facet_grid(~ CHROM, scales = "free_x") 167 | } 168 | 169 | p 170 | 171 | } 172 | 173 | 174 | #' Plots Gprime distribution 175 | #' 176 | #' Plots a ggplot histogram of the distribution of Gprime with a log normal 177 | #' distribution overlay 178 | #' 179 | #' @param SNPset a data frame with SNPs and genotype fields as imported by 180 | #' \code{importFromGATK} and after running \code{runGprimeAnalysis} 181 | #' @param outlierFilter one of either "deltaSNP" or "Hampel". Method for 182 | #' filtering outlier (i.e. QTL) regions for p-value estimation 183 | #' @param filterThreshold The absolute delta SNP index to use to filter out 184 | #' putative QTL (default = 0.1) 185 | #' @param binwidth The binwidth for the histogram. Recommended and default = 0.5 186 | #' 187 | #' @return Plots a ggplot histogram of the G' value distribution. The raw data 188 | #' as well as the filtered G' values (excluding putative QTL) are plotted. It 189 | #' will then overlay an estimated log normal distribution with the same mean 190 | #' and variance as the null G' distribution. This allows you to verify whether, 191 | #' after filtering, your G' values appear to be approximately log-normally distributed and thus 192 | #' can be used to estimate p-values using the non-parametric estimation method 193 | #' described in Magwene et al. (2011). Briefly, using the natural log of 194 | #' Gprime a median absolute deviation (MAD) is calculated. The Gprime set is 195 | #' trimmed to exclude outlier regions (i.e. QTL) based on Hampel's rule. 
An 196 | #' estimation of the mode of the trimmed set is calculated using the 197 | #' \code{\link[modeest]{mlv}} function from the package modeest. Finally, the 198 | #' mean and variance of the set are estimated using the median and mode 199 | #' and used to plot the log normal distribution. 200 | #' 201 | #' @examples plotGprimeDist(df_filt_6Mb, outlierFilter = "deltaSNP") 202 | #' 203 | #' @seealso \code{\link{getPvals}} for how p-values are calculated. 204 | #' @export plotGprimeDist 205 | 206 | 207 | plotGprimeDist <- 208 | function(SNPset, 209 | outlierFilter = c("deltaSNP", "Hampel"), 210 | filterThreshold = 0.1, 211 | binwidth = 0.5) 212 | { 213 | if (match.arg(outlierFilter) == "deltaSNP") { 214 | trim_df <- SNPset[abs(SNPset$deltaSNP) < filterThreshold, ] 215 | trimGprime <- trim_df$Gprime 216 | } else { 217 | # Non-parametric estimation of the null distribution of G' 218 | 219 | lnGprime <- log(SNPset$Gprime) 220 | 221 | # calculate left median absolute deviation of ln(G') 222 | MAD <- 223 | median(abs(lnGprime[lnGprime <= median(lnGprime)] - median(lnGprime))) 224 | 225 | # Trim the G prime set to exclude outlier regions (i.e. 
QTL) using Hampel's rule 226 | trim_df <- 227 | SNPset[lnGprime - median(lnGprime) <= 5.2 * median(MAD),] 228 | trimGprime <- trim_df$Gprime 229 | } 230 | medianTrimGprime <- median(trimGprime) 231 | 232 | # estimate the mode of the trimmed G' set using the half-sample method 233 | modeTrimGprime <- 234 | modeest::mlv(x = trimGprime, bw = 0.5, method = "hsm")[[1]] 235 | 236 | muE <- log(medianTrimGprime) 237 | varE <- abs(muE - log(modeTrimGprime)) 238 | 239 | n <- length(trim_df$Gprime) 240 | bw <- binwidth 241 | 242 | #plot Gprime distribution 243 | p <- ggplot2::ggplot(SNPset) + 244 | ggplot2::xlim(0, max(SNPset$Gprime) + 1) + 245 | ggplot2::xlab("G' value") + 246 | ggplot2::geom_histogram(ggplot2::aes(x = Gprime, fill = "Raw Data"), binwidth = bw) + 247 | ggplot2::geom_histogram(data = trim_df, 248 | ggplot2::aes(x = Gprime, fill = "After filtering"), 249 | binwidth = bw) + 250 | ggplot2::stat_function( 251 | ggplot2::aes(color = "black"), 252 | size = 1, 253 | fun = function(x, mean, sd, n, bw) { 254 | dlnorm(x = x, 255 | meanlog = mean, 256 | sdlog = sd) * n * bw 257 | }, 258 | args = c( 259 | mean = muE, 260 | sd = sqrt(varE), 261 | n = n, 262 | bw = bw 263 | ) 264 | ) + 265 | 266 | # ggplot2::stat_function( 267 | # fun = dlnorm * n, 268 | # size = 1, 269 | # args = c(meanlog = muE, sdlog = sqrt(varE)), 270 | # ggplot2::aes( 271 | # color = paste0( 272 | # "Null distribution \n G' ~ lnN(", 273 | # round(muE, 2), 274 | # ",", 275 | # round(varE, 2), 276 | # ")" 277 | # ) 278 | # ) 279 | # ) + 280 | ggplot2::scale_fill_discrete(name = "Distribution") + 281 | ggplot2::scale_colour_manual(name = "Null distribution" , values = "black", labels = as.expression(bquote(~theta["G'"]~" ~ lnN("*.(round(muE, 2))*","*.(round(varE, 2))*")"))) + 282 | ggplot2::guides(fill = ggplot2::guide_legend(order = 1, reverse = TRUE)) 283 | 284 | #ggplot2::annotate(x = 10, y = 0.325, geom="text", 285 | # label = paste0("G' ~ lnN(", round(muE, 2), ",",round(varE, 2), ")"), 286 | # 
color = "blue") 287 | return(p) 288 | } 289 | 290 | 291 | #' Plots simulation data for QTLseq analysis 292 | #' 293 | #' Uses the method for simulating delta SNP-index confidence interval thresholds 294 | #' described in Takagi et al. (2013). Genotypes are randomly assigned for 295 | #' each individual in the bulk, based on the population structure. The total 296 | #' alternate allele frequency in each bulk is calculated at each depth and used to simulate 297 | #' delta SNP-indices, with a user-defined number of bootstrap replications. 298 | #' The requested confidence intervals are then calculated from the bootstraps. 299 | #' This function plots the simulated confidence intervals by the read depth. 300 | #' 301 | #' @param SNPset optional. Either supply your data set to extract read depths from or supply depth vector. 302 | #' @param popStruc the population structure. Defaults to "F2" and assumes "RIL" otherwise. 303 | #' @param bulkSize non-negative integer. The number of individuals in each bulk 304 | #' @param depth optional integer vector. A read depth for which to replicate SNP-index calls. If read depth is defined SNPset will be ignored. 305 | #' @param replications integer. The number of bootstrap replications. 306 | #' @param filter numeric. An optional minimum SNP-index filter 307 | #' @param intervals numeric vector. Confidence intervals supplied as two-sided percentiles, e.g. intervals = 95 will return the two-sided 95\% confidence interval, 2.5\% on each side. 308 | #' 309 | #' @return Plots a deltaSNP by depth plot. Helps if the user wants to know the delta SNP index needed to pass a certain CI at a specified depth. 
310 | #' 311 | #' @export plotSimulatedThresholds 312 | #' 313 | #' @examples plotSimulatedThresholds(SNPset = NULL, popStruc = "F2", bulkSize = 25, depth = 1:150, replications = 10000, filter = 0.3, intervals = c(95, 99)) 314 | 315 | plotSimulatedThresholds <- 316 | function(SNPset = NULL, 317 | popStruc = "F2", 318 | bulkSize, 319 | depth = NULL, 320 | replications = 10000, 321 | filter = 0.3, 322 | intervals = c(95, 99)) { 323 | 324 | if (is.null(depth)) { 325 | if (!is.null(SNPset)) { 326 | message( 327 | "Variable 'depth' not defined, using min and max depth from data: ", 328 | min(SNPset$minDP), 329 | "-", 330 | max(SNPset$minDP) 331 | ) 332 | depth <- min(SNPset$minDP):max(SNPset$minDP) 333 | } else { 334 | stop("No SNPset or depth supplied") 335 | } 336 | } 337 | 338 | #convert intervals to quantiles 339 | if (all(intervals >= 1)) { 340 | message( 341 | "Returning the following two sided confidence intervals: ", 342 | paste(intervals, collapse = ", ") 343 | ) 344 | quantiles <- (100 - intervals) / 200 345 | } else { 346 | stop( 347 | "Confidence intervals ('intervals' parameter) should be supplied as two-sided percentiles, e.g. intervals = 95 returns the two-sided 95% confidence interval, 2.5% on each side." 
348 | ) 349 | } 350 | 351 | CI <- 352 | simulateConfInt( 353 | popStruc = popStruc, 354 | bulkSize = bulkSize, 355 | depth = depth, 356 | replications = replications, 357 | filter = filter, 358 | intervals = quantiles 359 | ) 360 | 361 | CI <- 362 | tidyr::gather(CI, key = "Interval", value = "deltaSNP",-depth) 363 | 364 | ggplot2::ggplot(data = CI) + 365 | ggplot2::geom_line(ggplot2::aes(x = depth, y = deltaSNP, color = Interval)) + 366 | ggplot2::geom_line(ggplot2::aes(x = depth, y = -deltaSNP,color = Interval)) 367 | } -------------------------------------------------------------------------------- /R/takagi_sim.R: -------------------------------------------------------------------------------- 1 | # QTLseq simulation functions 2 | 3 | 4 | #' Randomly calculates an alternate allele frequency within a bulk 5 | #' 6 | #' @param n non-negative integer. The number of individuals in each bulk 7 | #' @param pop the population structure. Defaults to "F2" and assumes "RIL" 8 | #' population otherwise. 9 | #' 10 | #' @return an alternate allele frequency within the bulk. Used for simulating 11 | #' SNP-indeces. 12 | #' 13 | simulateAlleleFreq <- function(n, pop = "F2") { 14 | if (pop == "F2") { 15 | mean(sample( 16 | x = c(0, 0.5, 1), 17 | size = n, 18 | prob = c(1, 2, 1), 19 | replace = TRUE 20 | )) 21 | } else { 22 | mean(sample( 23 | x = c(0, 1), 24 | size = n, 25 | prob = c(1, 1), 26 | replace = TRUE 27 | )) 28 | } 29 | } 30 | 31 | 32 | #' Simulates a delta SNP-index with replication 33 | #' 34 | #' @param depth integer. A read depth for which to replicate SNP-index calls. 35 | #' @param altFreq1 numeric. The alternate allele frequency for bulk A. 36 | #' @param altFreq2 numeric. The alternate allele frequency for bulk B. 37 | #' @param replicates integer. The number of bootstrap replications. 38 | #' @param filter numeric. 
an optional minimum SNP-index filter 39 | #' 40 | #' @return Returns a vector of delta SNP-indices, one per retained replicate 41 | 42 | 43 | simulateSNPindex <- 44 | function(depth, 45 | altFreq1, 46 | altFreq2, 47 | replicates = 10000, 48 | filter = NULL) { 49 | 50 | SNPindex_H <- rbinom(replicates, size = depth, altFreq1) / depth 51 | SNPindex_L <- rbinom(replicates, size = depth, altFreq2) / depth 52 | deltaSNP <- SNPindex_H - SNPindex_L 53 | 54 | if (!is.null(filter)) { 55 | deltaSNP <- deltaSNP[SNPindex_H >= filter | SNPindex_L >= filter] 56 | } 57 | deltaSNP 58 | } 59 | 60 | 61 | #' Simulation of delta SNP index confidence intervals 62 | #' 63 | #' The method for simulating delta SNP-index confidence interval thresholds 64 | #' as described in Takagi et al. (2013). Genotypes are randomly assigned for 65 | #' each individual in the bulk, based on the population structure. The total 66 | #' alternative allele frequency in each bulk is calculated at each depth used to simulate 67 | #' delta SNP-indices, with a user-defined number of bootstrapped replications. 68 | #' The requested confidence intervals are then calculated from the bootstraps. 69 | #' 70 | #' @param popStruc the population structure. Defaults to "F2" and assumes "RIL" otherwise. 71 | #' @param bulkSize non-negative integer vector. The number of individuals in 72 | #' each simulated bulk. Can be of length 1, then both bulks are set to the 73 | #' same size. Assumes the first value in the vector is the simulated high 74 | #' bulk. 75 | #' @param depth integer. A read depth for which to replicate SNP-index calls. 76 | #' @param replications integer. The number of bootstrap replications. 77 | #' @param filter numeric. 
An optional minimum SNP-index filter 78 | #' @param intervals numeric vector of probabilities with values in [0,1] 79 | #' corresponding to the requested confidence intervals 80 | #' 81 | #' @return A data frame of delta SNP-index thresholds corresponding to the 82 | #' requested confidence intervals at the user set depths. 83 | #' @export simulateConfInt 84 | #' 85 | #' @examples CI <- 86 | #' simulateConfInt( 87 | #' popStruc = "F2", 88 | #' bulkSize = 50, 89 | #' depth = 1:100, 90 | #' intervals = c(0.05, 0.95, 0.025, 0.975, 0.005, 0.995, 0.0025, 0.9975)) 91 | 92 | simulateConfInt <- function(popStruc = "F2", 93 | bulkSize, 94 | depth = 1:100, 95 | replications = 10000, 96 | filter = 0.3, 97 | intervals = c(0.05, 0.025)) { 98 | if (popStruc == "F2") { 99 | message( 100 | "Assuming bulks selected from F2 population, with ", 101 | paste(bulkSize, collapse = " and "), 102 | " individuals per bulk." 103 | ) 104 | } else { 105 | message( 106 | "Assuming bulks selected from RIL population, with ", 107 | paste(bulkSize, collapse = " and "), 108 | " individuals per bulk." 109 | ) 110 | } 111 | 112 | if (length(bulkSize) == 1) { 113 | message("The 'bulkSize' argument is of length 1, setting number of individuals in both bulks to: ", bulkSize) 114 | bulkSize[2] <- bulkSize[1] 115 | } 116 | 117 | if (length(bulkSize) > 2) { 118 | message("The 'bulkSize' argument is longer than 2. Using the first two values as the bulk size.") 119 | } 120 | 121 | if (any(bulkSize < 0)) { 122 | stop("Negative bulkSize values") 123 | } 124 | 125 | #makes a vector of possible alt allele frequencies once. 
this is then sampled for each replicate 126 | tmp_freq <- 127 | replicate(n = replications * 10, simulateAlleleFreq(n = bulkSize[1], pop = popStruc)) 128 | 129 | tmp_freq2 <- 130 | replicate(n = replications * 10, simulateAlleleFreq(n = bulkSize[2], pop = popStruc)) 131 | 132 | message( 133 | paste0( 134 | "Simulating ", 135 | replications, 136 | " SNPs with reads at each depth: ", 137 | min(depth), 138 | "-", 139 | max(depth) 140 | ) 141 | ) 142 | message(paste0( 143 | "Keeping SNPs with >= ", 144 | filter, 145 | " SNP-index in both simulated bulks" 146 | )) 147 | 148 | # tmp allele freqs are sampled to produce 'replicate' numbers of probablities. these 149 | # are then used as altFreq probs to simulate SNP index values, per bulk. 150 | CI <- sapply( 151 | X = depth, 152 | FUN = function(x) 153 | { 154 | quantile( 155 | x = simulateSNPindex( 156 | depth = x, 157 | altFreq1 = sample( 158 | x = tmp_freq, 159 | size = replications, 160 | replace = TRUE 161 | ), 162 | altFreq2 = sample( 163 | x = tmp_freq2, 164 | size = replications, 165 | replace = TRUE 166 | ), 167 | replicates = replications, 168 | filter = filter 169 | ), 170 | probs = intervals, 171 | names = TRUE 172 | ) 173 | } 174 | ) 175 | 176 | CI <- as.data.frame(CI) 177 | 178 | if (length(CI) > 1) { 179 | CI <- data.frame(t(CI)) 180 | } 181 | 182 | names(CI) <- paste0("CI_", 100 - (intervals * 200)) 183 | CI <- cbind(depth, CI) 184 | 185 | #to long format for easy plotting 186 | # tidyr::gather(data = CI, 187 | # key = interval, 188 | # convert = TRUE, 189 | # value = SNPindex,-depth) %>% 190 | # dplyr::mutate(Confidence = factor(ifelse( 191 | # interval > 0.5, 192 | # paste0(round((1 - interval) * 200, digits = 1), "%"), 193 | # paste0((interval * 200), "%") 194 | # ))) 195 | CI 196 | } 197 | 198 | 199 | #' Calculates delta SNP confidence intervals for QTLseq analysis 200 | #' 201 | #'The method for simulating delta SNP-index confidence interval thresholds 202 | #' as described in Takagi et al., (2013). 
Genotypes are randomly assigned for 203 | #' each individual in the bulk, based on the population structure. The total 204 | #' alternative allele frequency in each bulk is calculated at each depth used to simulate 205 | #' delta SNP-indices, with a user-defined number of bootstrapped replications. 206 | #' The requested confidence intervals are then calculated from the bootstraps. 207 | #' 208 | #' @param SNPset The data frame imported by \code{importFromGATK} 209 | #' @param windowSize the window size (in base pairs) bracketing each SNP for which to calculate the statistics. 210 | #' @param popStruc the population structure. Defaults to "F2" and assumes "RIL" otherwise 211 | #' @param bulkSize non-negative integer vector. The number of individuals in 212 | #' each simulated bulk. Can be of length 1, then both bulks are set to the 213 | #' same size. Assumes the first value in the vector is the simulated high 214 | #' bulk. 215 | #' @param depth integer. A read depth for which to replicate SNP-index calls. 216 | #' @param replications integer. The number of bootstrap replications. 217 | #' @param filter numeric. An optional minimum SNP-index filter 218 | #' @param intervals numeric vector. Confidence intervals supplied as two-sided 219 | #' percentiles, e.g. intervals = 95 returns the two-sided 95\% 220 | #' confidence interval, 2.5\% on each side. 221 | #' @param ... Other arguments passed to \code{\link[locfit]{locfit}} and 222 | #' subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful 223 | #' in cases where you get "out of vertex space" warnings; set the maxk higher 224 | #' than the default 100. See \code{\link[locfit]{locfit.raw}}(). But if you 225 | #' are getting that warning you should seriously consider increasing your 226 | #' window size. 227 | #' 228 | #' @return A SNPset data frame with delta SNP-index thresholds corresponding to the 229 | #' requested confidence intervals matching the tricube smoothed depth at each SNP. 
230 | #' @export runQTLseqAnalysis 231 | #' 232 | #' @examples df_filt <- runQTLseqAnalysis( 233 | #' SNPset = df_filt, 234 | #' bulkSize = c(25, 35), 235 | #' windowSize = 1e6, 236 | #' popStruc = "F2", 237 | #' replications = 10000, 238 | #' intervals = c(95, 99) 239 | #' ) 240 | #' 241 | 242 | runQTLseqAnalysis <- function(SNPset, 243 | windowSize = 1e6, 244 | popStruc = "F2", 245 | bulkSize, 246 | depth = NULL, 247 | replications = 10000, 248 | filter = 0.3, 249 | intervals = c(95, 99), 250 | ...) { 251 | 252 | message("Counting SNPs in each window...") 253 | SNPset <- SNPset %>% 254 | dplyr::group_by(CHROM) %>% 255 | dplyr::mutate(nSNPs = countSNPs_cpp(POS = POS, windowSize = windowSize)) 256 | 257 | message("Calculating tricube smoothed delta SNP index...") 258 | SNPset <- SNPset %>% 259 | dplyr::mutate(tricubeDeltaSNP = tricubeStat(POS = POS, Stat = deltaSNP, windowSize, ...)) 260 | 261 | #convert intervals to quantiles 262 | if (all(intervals >= 1)) { 263 | message( 264 | "Returning the following two sided confidence intervals: ", 265 | paste(intervals, collapse = ", ") 266 | ) 267 | quantiles <- (100 - intervals) / 200 268 | } else { 269 | stop( 270 | "Confidence intervals ('intervals' parameter) should be supplied as two-sided percentiles, e.g. intervals = 95 returns the two-sided 95% confidence interval, 2.5% on each side." 
271 | ) 272 | } 273 | 274 | #calculate min depth per snp between bulks 275 | SNPset <- 276 | SNPset %>% 277 | dplyr::mutate(minDP = pmin(DP.LOW, DP.HIGH)) 278 | 279 | SNPset <- 280 | SNPset %>% 281 | dplyr::group_by(CHROM) %>% 282 | dplyr::mutate(tricubeDP = floor(tricubeStat(POS, minDP, windowSize = windowSize, ...))) 283 | 284 | if (is.null(depth)) { 285 | message( 286 | "Variable 'depth' not defined, using min and max depth from data: ", 287 | min(SNPset$minDP), 288 | "-", 289 | max(SNPset$minDP) 290 | ) 291 | depth <- min(SNPset$minDP):max(SNPset$minDP) 292 | } 293 | 294 | #simulate confidence intervals 295 | CI <- 296 | simulateConfInt( 297 | popStruc = popStruc, 298 | bulkSize = bulkSize, 299 | depth = depth, 300 | replications = replications, 301 | filter = filter, 302 | intervals = quantiles 303 | ) 304 | 305 | 306 | #match name of column for easier joining of repeat columns 307 | names(CI)[1] <- "tricubeDP" 308 | 309 | #use join as a quick way to match min depth to matching conf intervals. 310 | SNPset <- 311 | dplyr::left_join(x = SNPset, 312 | y = CI #, commented out because of above change. need to remove eventually 313 | # by = c("tricubeDP" = "depth")) 314 | ) 315 | as.data.frame(SNPset) 316 | 317 | } 318 | 319 | 320 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # QTLseqr v0.7.5.2 5 | 6 | QTLseqr is an R package for QTL mapping using NGS Bulk Segregant 7 | Analysis. 8 | 9 | QTLseqr is still under development and is offered without any 10 | guarantee. 
11 | 12 | ### **For more detailed instructions please read the vignette [here](https://github.com/bmansfeld/QTLseqr/raw/master/vignettes/QTLseqr.pdf)** 13 | 14 | ### For updates read the [NEWS.md](https://github.com/bmansfeld/QTLseqr/blob/master/NEWS.md) 15 | 16 | # Installation 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | You can install QTLseqr from github with: 29 | 30 | ``` r 31 | # install devtools first to download packages from github 32 | install.packages("devtools") 33 | 34 | # use devtools to install QTLseqr 35 | devtools::install_github("bmansfeld/QTLseqr") 36 | ``` 37 | 38 | **Note:** Apart from regular package dependencies, there are some 39 | Bioconductor tools that we use as well, as such you will be prompted to 40 | install support for Bioconductor, if you haven’t already. QTLseqr makes 41 | use of C++ to make some tasks significantly faster (like counting SNPs). 42 | Because of this, in order to install QTLseqr from github you will be 43 | required to install some compiling tools (Rtools and Xcode, for Windows 44 | and Mac, respectively). 45 | 46 | **If you use QTLseqr in published research, please cite:** 47 | 48 | > Mansfeld B.N. and Grumet R, QTLseqr: An R package for bulk segregant 49 | > analysis with next-generation sequencing *The Plant Genome* 50 | > [doi:10.3835/plantgenome2018.01.0006](https://dl.sciencesocieties.org/publications/tpg/abstracts/11/2/180006) 51 | 52 | We also recommend citing the paper for the corresponding method you work 53 | with. 54 | 55 | QTL-seq method: 56 | 57 | > Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka, 58 | > C., Uemura, A., Utsushi, H., Tamiru, M., Takuno, S., Innan, H., Cano, 59 | > L. M., Kamoun, S. and Terauchi, R. (2013), QTL-seq: rapid mapping of 60 | > quantitative trait loci in rice by whole genome resequencing of DNA 61 | > from two bulked populations. *Plant J*, 74: 174–183. 
62 | > [doi:10.1111/tpj.12105](https://onlinelibrary.wiley.com/doi/full/10.1111/tpj.12105) 63 | 64 | G prime method: 65 | 66 | > Magwene PM, Willis JH, Kelly JK (2011) The Statistics of Bulk 67 | > Segregant Analysis Using Next Generation Sequencing. *PLOS 68 | > Computational Biology* 7(11): e1002255. 69 | > [doi.org/10.1371/journal.pcbi.1002255](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002255) 70 | 71 | ## Abstract 72 | 73 | Next Generation Sequencing Bulk Segregant Analysis (NGS-BSA) is 74 | efficient in detecting quantitative trait loci (QTL). Despite the 75 | popularity of NGS-BSA and the R statistical platform, no R packages are 76 | currently available for NGS-BSA. We present QTLseqr, an R package for 77 | NGS-BSA that identifies QTL using two statistical approaches: QTL-seq 78 | and G’. These approaches use a simulation method and a tricube smoothed 79 | G statistic, respectively, to identify and assess statistical 80 | significance of QTL. QTLseqr can import and filter SNP data, calculate 81 | SNP distributions, relative allele frequencies, G’ values, and 82 | log10(p-values), enabling identification and plotting of QTL. 83 | 84 | # Examples: 85 | 86 | ## Example figure 87 | 88 | ![Example 89 | figure](https://github.com/bmansfeld/QTLseqr/raw/master/all_plots.png 90 | "Example figure") 91 | 92 | ### **For more detailed instructions please read the vignette 93 | [here](https://github.com/bmansfeld/QTLseqr/raw/master/vignettes/QTLseqr.pdf)** 94 | 95 | This is a basic example which shows you how to import and analyze 96 | NGS-BSA data. 97 | 98 | ``` r 99 | 100 | #load the package 101 | library("QTLseqr") 102 | 103 | #Set sample and file names 104 | HighBulk <- "SRR834931" 105 | LowBulk <- "SRR834927" 106 | file <- "SNPs_from_GATK.table" 107 | 108 | #Choose which chromosomes will be included in the analysis (i.e. 
exclude smaller contigs) 109 | Chroms <- paste0(rep("Chr", 12), 1:12) 110 | 111 | #Import SNP data from file 112 | df <- 113 | importFromGATK( 114 | file = file, 115 | highBulk = HighBulk, 116 | lowBulk = LowBulk, 117 | chromList = Chroms 118 | ) 119 | 120 | #Filter SNPs based on some criteria 121 | df_filt <- 122 | filterSNPs( 123 | SNPset = df, 124 | refAlleleFreq = 0.20, 125 | minTotalDepth = 100, 126 | maxTotalDepth = 400, 127 | minSampleDepth = 40, 128 | minGQ = 99 129 | ) 130 | 131 | 132 | #Run G' analysis 133 | df_filt <- runGprimeAnalysis( 134 | SNPset = df_filt, 135 | windowSize = 1e6, 136 | outlierFilter = "deltaSNP") 137 | 138 | #Run QTLseq analysis 139 | df_filt <- runQTLseqAnalysis( 140 | SNPset = df_filt, 141 | windowSize = 1e6, 142 | popStruc = "F2", 143 | bulkSize = c(25, 25), 144 | replications = 10000, 145 | intervals = c(95, 99) 146 | ) 147 | 148 | #Plot 149 | plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.01) 150 | plotQTLStats(SNPset = df_filt, var = "deltaSNP", plotIntervals = TRUE) 151 | 152 | #export summary CSV 153 | getQTLTable(SNPset = df_filt, alpha = 0.01, export = TRUE, fileName = "my_BSA_QTL.csv") 154 | ``` 155 | -------------------------------------------------------------------------------- /all_plots.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bmansfeld/QTLseqr/5e761379a805b65038c415c8d3ce7aa61abe89dc/all_plots.png -------------------------------------------------------------------------------- /man/FilterSNPs.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/Import_Filter.R 3 | \name{filterSNPs} 4 | \alias{filterSNPs} 5 | \title{Filter SNPs based on read depth and quality} 6 | \usage{ 7 | filterSNPs(SNPset, refAlleleFreq, filterAroundMedianDepth, minTotalDepth, 8 | maxTotalDepth, minSampleDepth, 
depthDifference, minGQ, verbose = TRUE) 9 | } 10 | \arguments{ 11 | \item{SNPset}{The data frame imported by \code{importFromGATK}} 12 | 13 | \item{refAlleleFreq}{A numeric < 1. This will filter out SNPs with a 14 | Reference Allele Frequency less than \code{refAlleleFreq} and greater than 15 | 1 - \code{refAlleleFreq}. E.g. \code{refAlleleFreq = 0.3} will keep SNPs 16 | with 0.3 <= REF_FRQ <= 0.7} 17 | 18 | \item{filterAroundMedianDepth}{Filters total SNP read depth for both bulks. A 19 | median and median absolute deviation (MAD) of depth will be calculated. 20 | SNPs with read depth greater or less than \code{filterAroundMedianDepth} 21 | MADs away from the median will be filtered.} 22 | 23 | \item{minTotalDepth}{The minimum total read depth for a SNP (counting both 24 | bulks)} 25 | 26 | \item{maxTotalDepth}{The maximum total read depth for a SNP (counting both 27 | bulks)} 28 | 29 | \item{minSampleDepth}{The minimum read depth for a SNP in each bulk} 30 | 31 | \item{depthDifference}{The maximum absolute difference in read depth between the bulks.} 32 | 33 | \item{minGQ}{The minimum Genotype Quality as set by GATK. This is a measure 34 | of how confident GATK was with the assigned genotype (i.e. homozygous ref, 35 | heterozygous, homozygous alt). See 36 | \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What 37 | is a VCF and how should I interpret it?}} 38 | 39 | \item{verbose}{logical. If \code{TRUE} will report number of SNPs filtered in 40 | each step.} 41 | } 42 | \value{ 43 | Returns a subset of the data frame supplied which meets the filtering 44 | conditions applied by the selected parameters. If \code{verbose} is 45 | \code{TRUE} the function reports the number of SNPs filtered in each step 46 | as well as the initial number of SNPs, the total number of SNPs filtered 47 | and the remaining number. 
48 | } 49 | \description{ 50 | Use filtering parameters to filter out high and low depth reads as well as 51 | low Genotype Quality as defined by GATK. All filters are optional but recommended. 52 | } 53 | \examples{ 54 | df_filt <- filterSNPs( 55 | df, 56 | refAlleleFreq = 0.3, 57 | minTotalDepth = 40, 58 | maxTotalDepth = 80, 59 | minSampleDepth = 20, 60 | minGQ = 99, 61 | verbose = TRUE 62 | ) 63 | 64 | } 65 | \seealso{ 66 | See \code{\link[stats]{mad}} for explanation of calculation of 67 | median absolute deviation. 68 | \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What 69 | is a VCF and how should I interpret it?} for more information on GATK 70 | Fields and Genotype Fields 71 | } 72 | -------------------------------------------------------------------------------- /man/ImportFromGATK.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/Import_Filter.R 3 | \name{importFromGATK} 4 | \alias{importFromGATK} 5 | \title{Imports SNP data from GATK VariantsToTable output} 6 | \usage{ 7 | importFromGATK(file, highBulk = character(), lowBulk = character(), 8 | chromList = NULL) 9 | } 10 | \arguments{ 11 | \item{file}{The name of the GATK VariantsToTable output .table file which the 12 | data are to be read from.} 13 | 14 | \item{highBulk}{The sample name of the High Bulk} 15 | 16 | \item{lowBulk}{The sample name of the Low Bulk} 17 | 18 | \item{chromList}{a string vector of the chromosomes to be used in the 19 | analysis. Useful for filtering out unwanted contigs etc.} 20 | } 21 | \value{ 22 | Returns a data frame containing columns for Read depth (DP), 23 | Reference Allele Depth (AD_REF) and Alternative Allele Depth (AD_ALT), 24 | Genotype Quality (GQ) and SNPindex for each bulk (indicated by .HIGH and 25 | .LOW column name suffix). 
Total reference allele frequency "REF_FRQ" is the 26 | sum of AD.REF for both bulks divided by total Depth for that SNP. The 27 | deltaSNPindex is equal to SNPindex.HIGH - SNPindex.LOW. The GStat column 28 | is the calculated G statistic for that SNP. 29 | } 30 | \description{ 31 | Imports SNP data from the output of the 32 | \href{https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php}{VariantsToTable} 33 | function in GATK. After importing the data, the function then calculates 34 | total reference allele frequency for both bulks together, the delta SNP index 35 | (i.e. SNP index of the low bulk subtracted from the SNP index of the high 36 | bulk), the G statistic and returns a data frame. The required GATK fields 37 | (-F) are CHROM (Chromosome) and POS (Position). The required Genotype fields 38 | (-GF) are AD (Allele Depth), DP (Depth). Recommended 39 | fields are REF (Reference allele) and ALT (Alternative allele). Recommended 40 | Genotype fields are PL (Phred-scaled likelihoods) and GQ (Genotype Quality). 41 | } 42 | \examples{ 43 | df <- importFromGATK(file = "file.table", 44 | highBulk = highBulkSampleName, 45 | lowBulk = lowBulkSampleName, 46 | chromList = c("Chr1","Chr4","Chr7")) 47 | } 48 | \seealso{ 49 | \code{\link{getG}} for explanation of how G statistic is 50 | calculated. 
51 | \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What 52 | is a VCF and how should I interpret it?} for more information on GATK 53 | Fields and Genotype Fields 54 | } 55 | -------------------------------------------------------------------------------- /man/countSNPs_cpp.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/RcppExports.R 3 | \name{countSNPs_cpp} 4 | \alias{countSNPs_cpp} 5 | \title{Count number of SNPs within a sliding window} 6 | \usage{ 7 | countSNPs_cpp(POS, windowSize) 8 | } 9 | \arguments{ 10 | \item{POS}{A numeric vector of genomic positions for each SNP} 11 | 12 | \item{windowSize}{The required window size} 13 | } 14 | \description{ 15 | For each SNP returns how many SNPs are bracketing it within the set window size 16 | } 17 | -------------------------------------------------------------------------------- /man/getFDRThreshold.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/G_functions.R 3 | \name{getFDRThreshold} 4 | \alias{getFDRThreshold} 5 | \title{Find false discovery rate threshold} 6 | \usage{ 7 | getFDRThreshold(pvalues, alpha = 0.01) 8 | } 9 | \arguments{ 10 | \item{pvalues}{a vector of p-values} 11 | 12 | \item{alpha}{the required false discovery rate alpha} 13 | } 14 | \value{ 15 | The p-value threshold that corresponds to the Benjamini-Hochberg adjusted p-value at the FDR set by alpha. 16 | } 17 | \description{ 18 | Given a vector of p-values and a set false discovery rate alpha the function 19 | returns the lowest p-value in the vector for which the Benjamini-Hochberg 20 | adjusted p-value (ie q-value) is less than that alpha. 
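The description above can be read in more than one way. The sketch below takes the threshold to be the largest raw p-value whose Benjamini-Hochberg adjusted value is still below alpha; that reading, and the helper name `fdr_threshold_sketch`, are assumptions for illustration and are not taken from the package source.

``` r
# Hypothetical helper: find a raw p-value cutoff for a given FDR alpha.
fdr_threshold_sketch <- function(pvalues, alpha = 0.01) {
  # sort so that raw and adjusted p-values are both non-decreasing
  sorted_p <- sort(pvalues)
  # Benjamini-Hochberg adjusted p-values (q-values) from base R
  q <- stats::p.adjust(sorted_p, method = "BH")
  passing <- which(q < alpha)
  # no p-value passes the requested FDR
  if (length(passing) == 0) {
    return(NA_real_)
  }
  # largest raw p-value that still passes (assumed interpretation)
  sorted_p[max(passing)]
}

fdr_threshold_sketch(c(0.001, 0.002, 0.5, 0.9), alpha = 0.01)
```

Returning `NA` when nothing passes keeps the "no significant SNPs" case explicit for the caller.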
21 | } 22 | -------------------------------------------------------------------------------- /man/getG.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/G_functions.R 3 | \name{getG} 4 | \alias{getG} 5 | \title{Calculates the G statistic} 6 | \usage{ 7 | getG(LowRef, HighRef, LowAlt, HighAlt) 8 | } 9 | \arguments{ 10 | \item{LowRef}{A vector of the reference allele depth in the low bulk} 11 | 12 | \item{HighRef}{A vector of the reference allele depth in the high bulk} 13 | 14 | \item{LowAlt}{A vector of the alternate allele depth in the low bulk} 15 | 16 | \item{HighAlt}{A vector of the alternate allele depth in the high bulk} 17 | } 18 | \value{ 19 | A vector of G statistic values with the same length as 20 | } 21 | \description{ 22 | The function is used by \code{\link{runGprimeAnalysis}} to calculate the G 23 | statistic. G is defined by the equation: \deqn{G = 2*\sum_{i=1}^{4} 24 | n_{i}*ln\frac{obs(n_i)}{exp(n_i)}}{G = 2 * \sum n_i * ln(obs(n_i)/exp(n_i))} 25 | where, for each SNP, \eqn{n_i} from i = 1 to 4 corresponds to the reference 26 | and alternate allele depths for each bulk, as described in the following 27 | table: \tabular{rcc}{ Allele \tab High Bulk \tab Low Bulk \cr Reference \tab 28 | \eqn{n_1} \tab \eqn{n_2} \cr Alternate \tab \eqn{n_3} \tab \eqn{n_4} \cr} 29 | ...and \eqn{obs(n_i)} are the observed allele depths as described in the data 30 | frame. Method 1 calculates the G statistic using expected values assuming 31 | read depth is equal for all alleles in both bulks: \deqn{exp(n_1) = ((n_1 + 32 | n_2)*(n_1 + n_3))/(n_1 + n_2 + n_3 + n_4)} \deqn{exp(n_2) = ((n_2 + n_1)*(n_2 33 | + n_4))/(n_1 + n_2 + n_3 + n_4)} etc... 
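A minimal sketch of the calculation described above, written as a standalone helper. The name `g_stat_sketch` and the handling of zero counts (a zero depth contributes nothing to the sum) are assumptions for illustration, not the package's own implementation.

``` r
# Hypothetical helper: G statistic for one or more SNPs (vectorised).
# n1/n2 = reference depth in high/low bulk, n3/n4 = alternate depth in high/low bulk.
g_stat_sketch <- function(n1, n2, n3, n4) {
  obs <- cbind(n1, n2, n3, n4)
  total <- rowSums(obs)
  # expected depth = (allele total * bulk total) / grand total
  expd <- cbind(
    (n1 + n2) * (n1 + n3),
    (n2 + n1) * (n2 + n4),
    (n3 + n4) * (n3 + n1),
    (n4 + n3) * (n4 + n2)
  ) / total
  # 0 * log(0) evaluates to NaN in R; na.rm = TRUE treats such terms as 0 (assumption)
  2 * rowSums(obs * log(obs / expd), na.rm = TRUE)
}

# equal depths in every cell: observed equals expected, so G is 0
g_stat_sketch(10, 10, 10, 10)
# opposite allele enrichment in the two bulks gives a positive G
g_stat_sketch(30, 10, 10, 30)
```

Building the observed and expected depths as matrices keeps the helper vectorised, so it can be applied to whole depth columns at once.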
34 | } 35 | \seealso{ 36 | \href{https://doi.org/10.1371/journal.pcbi.1002255}{The Statistics 37 | of Bulk Segregant Analysis Using Next Generation Sequencing} 38 | \code{\link{tricubeStat}} for G prime calculation 39 | } 40 | -------------------------------------------------------------------------------- /man/getPvals.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/G_functions.R 3 | \name{getPvals} 4 | \alias{getPvals} 5 | \title{Non-parametric estimation of the null distribution of G'} 6 | \usage{ 7 | getPvals(Gprime, deltaSNP = NULL, outlierFilter = c("deltaSNP", 8 | "Hampel"), filterThreshold) 9 | } 10 | \arguments{ 11 | \item{Gprime}{a vector of G prime values (tricube weighted G statistics)} 12 | 13 | \item{deltaSNP}{a vector of delta SNP values for use for QTL region filtering} 14 | 15 | \item{outlierFilter}{one of either "deltaSNP" or "Hampel". Method for 16 | filtering outlier (ie QTL) regions for p-value estimation} 17 | 18 | \item{filterThreshold}{The absolute delta SNP index to use to filter out putative QTL} 19 | } 20 | \description{ 21 | The function is used by \code{\link{runGprimeAnalysis}} to estimate p-values for the 22 | weighted G' statistic based on the non-parametric estimation method described 23 | in Magwene et al. 2011. Briefly, using the natural log of Gprime a median 24 | absolute deviation (MAD) is calculated. The Gprime set is trimmed to exclude 25 | outlier regions (i.e. QTL) based on Hampel's rule. An alternate method for 26 | filtering out QTL is proposed using absolute delta SNP indices greater than 27 | a set threshold to filter out potential QTL. An estimation of the mode of the trimmed set 28 | is calculated using the \code{\link[modeest]{mlv}} function from the package 29 | modeest. 
Finally, the mean and variance of the set are estimated using the 30 | median and mode and p-values are estimated from a log normal distribution. 31 | } 32 | -------------------------------------------------------------------------------- /man/getQTLTable.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/export_functions.R 3 | \name{getQTLTable} 4 | \alias{getQTLTable} 5 | \title{Export a summarized table of QTL} 6 | \usage{ 7 | getQTLTable(SNPset, method = "Gprime", alpha = 0.05, interval = 99, 8 | export = FALSE, fileName = "QTL.csv") 9 | } 10 | \arguments{ 11 | \item{SNPset}{Data frame SNP set containing previously filtered SNPs} 12 | 13 | \item{method}{either "Gprime" or "QTLseq". The method for detecting significant regions.} 14 | 15 | \item{alpha}{numeric. The required false discovery rate alpha for use with \code{method = "Gprime"}} 16 | 17 | \item{interval}{numeric. For use with \code{method = "QTLseq"}. The Takagi based confidence interval requested. This will find the column named "CI_\*\*", where \*\* is the requested interval, i.e. 99.} 18 | 19 | \item{export}{logical. If TRUE will export a csv table.} 20 | 21 | \item{fileName}{either a character string naming a file or a connection open for writing. "" indicates output to the console.} 22 | } 23 | \value{ 24 | Returns a summarized table of QTL identified. The table contains the following columns: 25 | \itemize{ 26 | \item{id - the QTL identification number} 27 | \item{chromosome - The chromosome on which the region was identified} 28 | \item{start - the start position on that chromosome, i.e. 
the position of the first SNP that passes the FDR threshold} 29 | \item{end - the end position} 30 | \item{length - the length in basepairs from start to end of the region} 31 | \item{nSNPs - the number of SNPs in the region} 32 | \item{avgSNPs_Mb - the average number of SNPs/Mb within that region} 33 | \item{peakDeltaSNP - the tricube-smoothed deltaSNP-index value at the peak summit} 34 | \item{posPeakDeltaSNP - the position of the absolute maximum tricube-smoothed deltaSNP-index} 35 | \item{maxGprime - the max G' score in the region} 36 | \item{posMaxGprime - the genomic position of the maximum G' value in the QTL} 37 | \item{meanGprime - the average G' score of that region} 38 | \item{sdGprime - the standard deviation of G' within the region} 39 | \item{AUCaT - the Area Under the Curve but above the Threshold line, an indicator of how significant or wide the peak is} 40 | \item{meanPval - the average p-value in the region} 41 | \item{meanQval - the average adjusted p-value in the region} 42 | } 43 | } 44 | \description{ 45 | Export a summarized table of QTL 46 | } 47 | -------------------------------------------------------------------------------- /man/getSigRegions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/export_functions.R 3 | \name{getSigRegions} 4 | \alias{getSigRegions} 5 | \title{Return SNPs in significant regions} 6 | \usage{ 7 | getSigRegions(SNPset, method = "Gprime", alpha = 0.05, interval = 99) 8 | } 9 | \arguments{ 10 | \item{SNPset}{Data frame SNP set containing previously filtered SNPs.} 11 | 12 | \item{method}{either "Gprime" or "QTLseq". The method for detecting significant regions.} 13 | 14 | \item{alpha}{numeric. The required false discovery rate alpha for use with \code{method = "Gprime"}} 15 | 16 | \item{interval}{numeric. For use with \code{method = "QTLseq"}. The Takagi based confidence interval requested. 
This will find the column named "CI_\*\*", where \*\* is the requested interval, e.g. 99.} 17 | } 18 | \description{ 19 | The function takes a SNP set after calculation of p- and q-values or Takagi confidence intervals and returns 20 | a list containing all SNPs with q-values or deltaSNP below a set alpha or confidence intervals, respectively. Each entry in the list 21 | is a SNP set data frame in a contiguous region with adjusted p-values lower 22 | than the set false discovery rate alpha. 23 | } 24 | -------------------------------------------------------------------------------- /man/importFromTable.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/Import_Filter.R 3 | \name{importFromTable} 4 | \alias{importFromTable} 5 | \title{Import SNP data from a delimited file} 6 | \usage{ 7 | importFromTable(file, highBulk = "HIGH", lowBulk = "LOW", 8 | chromList = NULL, sep = ",") 9 | } 10 | \arguments{ 11 | \item{file}{The name of the file which the data are to be read from.} 12 | 13 | \item{highBulk}{The sample name of the High Bulk. Defaults to "HIGH"} 14 | 15 | \item{lowBulk}{The sample name of the Low Bulk. Defaults to "LOW"} 16 | 17 | \item{chromList}{a string vector of the chromosomes to be used in the 18 | analysis. Useful for filtering out unwanted contigs etc.} 19 | 20 | \item{sep}{the field separator character. Values on each line of the file are 21 | separated by this character. The default is for a csv file, i.e. ",".} 22 | } 23 | \value{ 24 | Returns a data frame containing columns for per bulk total Read depth (DP), 25 | Reference Allele Depth (AD_REF) and Alternative Allele Depth (AD_ALT), any 26 | other SNP associated columns in the file, and SNPindex for each bulk 27 | (indicated by .HIGH and .LOW column name suffix). Total reference allele 28 | frequency "REF_FRQ" is the sum of AD_REF for both bulks divided by total 29 | Depth for that SNP.
The deltaSNPindex is equal to SNPindex.HIGH - 30 | SNPindex.LOW. 31 | } 32 | \description{ 33 | After importing the data from a delimited file, the function then calculates 34 | total reference allele frequency for both bulks together, the delta SNP index 35 | (i.e. SNP index of the low bulk subtracted from the SNP index of the high 36 | bulk), the G statistic and returns a data frame. The required columns in the 37 | file are CHROM (Chromosome) and POS (Position) as well as the reference and 38 | alternate allele depths (number of reads supporting each allele). The allele 39 | depths should be in columns named in this format: 40 | \code{AD_(<ALT/REF>).<sampleName>}. For example, the column for alternate 41 | allele depth for a high bulk sample named "sample1", should be 42 | "AD_ALT.sample1". Any other columns describing the SNPs are allowed, e.g. the 43 | actual allele calls, or a quality score. If the column is bulk specific, it 44 | should be named \code{columnName.sampleName}, e.g. "QUAL.sample1". 45 | } 46 | -------------------------------------------------------------------------------- /man/plotGprimeDist.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotting_functions.R 3 | \name{plotGprimeDist} 4 | \alias{plotGprimeDist} 5 | \title{Plots Gprime distribution} 6 | \usage{ 7 | plotGprimeDist(SNPset, outlierFilter = c("deltaSNP", "Hampel"), 8 | filterThreshold = 0.1, binwidth = 0.5) 9 | } 10 | \arguments{ 11 | \item{SNPset}{a data frame with SNPs and genotype fields as imported by 12 | \code{importFromGATK} and after running \code{runGprimeAnalysis}} 13 | 14 | \item{outlierFilter}{one of either "deltaSNP" or "Hampel". Method for 15 | filtering outlier (i.e. QTL) regions for p-value estimation} 16 | 17 | \item{filterThreshold}{The absolute delta SNP index to use to filter out 18 | putative QTL (default = 0.1)} 19 | 20 | \item{binwidth}{The binwidth for the histogram.
Recommended and default = 0.5} 21 | } 22 | \value{ 23 | Plots a ggplot histogram of the G' value distribution. The raw data 24 | as well as the filtered G' values (excluding putative QTL) are plotted. It 25 | will then overlay an estimated log normal distribution with the same mean 26 | and variance as the null G' distribution. This lets you verify whether, 27 | after filtering, your G' values appear close to log normally distributed and thus 28 | can be used to estimate p-values using the non-parametric estimation method 29 | described in Magwene et al. (2011). Briefly, using the natural log of 30 | Gprime a median absolute deviation (MAD) is calculated. The Gprime set is 31 | trimmed to exclude outlier regions (i.e. QTL) based on Hampel's rule. An 32 | estimation of the mode of the trimmed set is calculated using the 33 | \code{\link[modeest]{mlv}} function from the package modeest. Finally, the 34 | mean and variance of the set are estimated using the median and mode 35 | and used to plot the log normal distribution. 36 | } 37 | \description{ 38 | Plots a ggplot histogram of the distribution of Gprime with a log normal 39 | distribution overlay 40 | } 41 | \examples{ 42 | plotGprimeDist(df_filt_6Mb, outlierFilter = "deltaSNP") 43 | 44 | } 45 | \seealso{ 46 | \code{\link{getPvals}} for how p-values are calculated. 47 | } 48 | -------------------------------------------------------------------------------- /man/plotQTLStats.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotting_functions.R 3 | \name{plotQTLStats} 4 | \alias{plotQTLStats} 5 | \title{Plots different parameters for QTL identification} 6 | \usage{ 7 | plotQTLStats(SNPset, subset = NULL, var = "nSNPs", 8 | scaleChroms = TRUE, line = TRUE, plotThreshold = FALSE, 9 | plotIntervals = FALSE, q = 0.05, ...)
10 | } 11 | \arguments{ 12 | \item{SNPset}{a data frame with SNPs and genotype fields as imported by 13 | \code{importFromGATK} and after running \code{runGprimeAnalysis} or \code{runQTLseqAnalysis}} 14 | 15 | \item{subset}{a vector of chromosome names for use in quick plotting of 16 | chromosomes of interest. Defaults to 17 | NULL and will plot all chromosomes in the SNPset} 18 | 19 | \item{var}{character. The parameter for plotting. Must be one of: "nSNPs", 20 | "deltaSNP", "Gprime", "negLog10Pval"} 21 | 22 | \item{scaleChroms}{boolean. If TRUE (default) then chromosome facets will be 23 | scaled to relative chromosome sizes. If FALSE all facets will be equal 24 | sizes. This is basically a convenience argument for setting both scales and 25 | space as "free_x" in ggplot2::facet_grid.} 26 | 27 | \item{line}{boolean. If TRUE will plot line graph. If FALSE will plot points. 28 | Plotting points will take more time.} 29 | 30 | \item{plotThreshold}{boolean. Should we plot the False Discovery Rate 31 | threshold (FDR). Only plots line if var is "Gprime" or "negLog10Pval".} 32 | 33 | \item{plotIntervals}{boolean. Whether or not to plot the two-sided Takagi confidence intervals in "deltaSNP" plots.} 34 | 35 | \item{q}{numeric. The q-value to use as the FDR threshold. If too low, no 36 | line will be drawn and a warning will be given.} 37 | 38 | \item{...}{arguments to pass to ggplot2::geom_line or ggplot2::geom_point for 39 | changing colors etc.} 40 | } 41 | \value{ 42 | Plots a ggplot graph for all chromosomes or those requested in 43 | \code{subset}. By setting \code{var} to "nSNPs" the distribution of SNPs 44 | used to calculate G' will be plotted. "deltaSNP" will plot a tri-cube 45 | weighted delta SNP-index for each SNP. "Gprime" will plot the tri-cube 46 | weighted G' value. Setting "negLog10Pval" will plot the -log10 of the p-value 47 | at each SNP.
In "Gprime" and "negLog10Pval" plots, a genome wide FDR threshold of 48 | q can be drawn by setting "plotThreshold" to TRUE. The default is a red 49 | line. If you would like to plot a different line we suggest setting 50 | "plotThreshold" to FALSE and manually adding a line using 51 | ggplot2::geom_hline. 52 | } 53 | \description{ 54 | A wrapper for ggplot to plot genome wide distribution of parameters used to 55 | identify QTL. 56 | } 57 | \examples{ 58 | p <- plotQTLStats(df_filt_6Mb, var = "Gprime", plotThreshold = TRUE, q = 0.01, subset = c("Chr3","Chr4")) 59 | } 60 | -------------------------------------------------------------------------------- /man/plotSimulatedThresholds.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotting_functions.R 3 | \name{plotSimulatedThresholds} 4 | \alias{plotSimulatedThresholds} 5 | \title{Plots simulation data for QTLseq analysis} 6 | \usage{ 7 | plotSimulatedThresholds(SNPset = NULL, popStruc = "F2", bulkSize, 8 | depth = NULL, replications = 10000, filter = 0.3, 9 | intervals = c(95, 99)) 10 | } 11 | \arguments{ 12 | \item{SNPset}{optional. Either supply your data set to extract read depths from or supply depth vector.} 13 | 14 | \item{popStruc}{the population structure. Defaults to "F2" and assumes "RIL" otherwise.} 15 | 16 | \item{bulkSize}{non-negative integer. The number of individuals in each bulk} 17 | 18 | \item{depth}{optional integer vector. A read depth for which to replicate SNP-index calls. If read depth is defined SNPset will be ignored.} 19 | 20 | \item{replications}{integer. The number of bootstrap replications.} 21 | 22 | \item{filter}{numeric. An optional minimum SNP-index filter} 23 | 24 | \item{intervals}{numeric vector. Confidence intervals supplied as two-sided percentiles. i.e.
If intervals = 95, the two-sided 95\% confidence interval is returned, 2.5\% on each side.} 25 | } 26 | \value{ 27 | Plots a deltaSNP by depth plot. Helps if the user wants to know the delta SNP index needed to pass a certain CI at a specified depth. 28 | } 29 | \description{ 30 | Plots simulated delta SNP-index confidence interval thresholds as described in Takagi et al. (2013). Genotypes are randomly assigned for 31 | each individual in the bulk, based on the population structure. The total 32 | alternative allele frequency in each bulk is calculated at each depth used to simulate 33 | delta SNP-indices, with a user defined number of bootstrapped replications. 34 | The requested confidence intervals are then calculated from the bootstraps. 35 | This function plots the simulated confidence intervals by the read depth. 36 | } 37 | \examples{ 38 | plotSimulatedThresholds(SNPset = NULL, popStruc = "F2", bulkSize = 25, depth = 1:150, replications = 10000, filter = 0.3, intervals = c(95, 99)) 39 | } 40 | -------------------------------------------------------------------------------- /man/runGprimeAnalysis.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/G_functions.R 3 | \name{runGprimeAnalysis} 4 | \alias{runGprimeAnalysis} 5 | \title{Identify QTL using a smoothed G statistic} 6 | \usage{ 7 | runGprimeAnalysis(SNPset, windowSize = 1e+06, 8 | outlierFilter = "deltaSNP", filterThreshold = 0.1, ...) 9 | } 10 | \arguments{ 11 | \item{SNPset}{Data frame SNP set containing previously filtered SNPs} 12 | 13 | \item{windowSize}{the window size (in base pairs) bracketing each SNP for which 14 | to calculate the statistics. Magwene et al. recommend a window size of ~25 15 | cM, but also recommend optionally trying several window sizes to test if 16 | peaks are over- or undersmoothed.} 17 | 18 | \item{outlierFilter}{one of either "deltaSNP" or "Hampel".
Method for 19 | filtering outlier (i.e. QTL) regions for p-value estimation} 20 | 21 | \item{filterThreshold}{The absolute delta SNP index to use to filter out putative QTL (default = 0.1)} 22 | 23 | \item{...}{Other arguments passed to \code{\link[locfit]{locfit}} and 24 | subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful 25 | in cases where you get "out of vertex space" warnings; set the maxk higher 26 | than the default 100. See \code{\link[locfit]{locfit.raw}}(). But if you 27 | are getting that warning you should seriously consider increasing your 28 | window size.} 29 | } 30 | \value{ 31 | The supplied SNP set tibble after G' analysis. Includes five new 32 | columns: \itemize{\item{G - The G statistic for each SNP} \item{Gprime - 33 | The tricube smoothed G statistic based on the supplied window size} 34 | \item{pvalue - the p-value at each SNP calculated by non-parametric 35 | estimation} \item{negLog10Pval - the -Log10(pvalue) supplied for quick 36 | plotting} \item{qvalue - the Benjamini-Hochberg adjusted p-value}} 37 | } 38 | \description{ 39 | A wrapper for all the functions that perform the full G prime analysis to 40 | identify QTL. The following steps are performed:\cr 1) Genome-wide G 41 | statistics are calculated by \code{\link{getG}}. \cr G is defined by the 42 | equation: \deqn{G = 2*\sum_{i=1}^{4} n_{i}*ln\frac{obs(n_i)}{exp(n_i)}}{G = 2 43 | * \sum n_i * ln(obs(n_i)/exp(n_i))} Where for each SNP, \eqn{n_i} from i = 1 44 | to 4 corresponds to the reference and alternate allele depths for each bulk, 45 | as described in the following table: \tabular{rcc}{ Allele \tab High Bulk 46 | \tab Low Bulk \cr Reference \tab \eqn{n_1} \tab \eqn{n_2} \cr Alternate \tab 47 | \eqn{n_3} \tab \eqn{n_4} \cr} ...and \eqn{obs(n_i)} are the observed allele 48 | depths as described in the data frame.
\code{\link{getG}} calculates the G statistic 49 | using expected values assuming read depth is equal for all alleles in both 50 | bulks: \deqn{exp(n_1) = ((n_1 + n_2)*(n_1 + n_3))/(n_1 + n_2 + n_3 + n_4)} 51 | \deqn{exp(n_2) = ((n_2 + n_1)*(n_2 + n_4))/(n_1 + n_2 + n_3 + n_4)} 52 | \deqn{exp(n_3) = ((n_3 + n_1)*(n_3 + n_4))/(n_1 + n_2 + n_3 + n_4)} 53 | \deqn{exp(n_4) = ((n_4 + n_2)*(n_4 + n_3))/(n_1 + n_2 + n_3 + n_4)}\cr 2) G' 54 | - A tricube-smoothed G statistic is predicted by local regression within each 55 | chromosome using \code{\link{tricubeStat}}. This works as a weighted average 56 | across neighboring SNPs that accounts for linkage disequilibrium (LD) while 57 | minimizing noise attributed to SNP calling errors. G values for neighboring 58 | SNPs within the window are weighted by physical distance from the focal SNP. 59 | \cr \cr 3) P-values are estimated using the non-parametric method 60 | described by Magwene et al. 2011 with the function \code{\link{getPvals}}. 61 | Briefly, using the natural log of Gprime a median absolute deviation (MAD) is 62 | calculated. The Gprime set is trimmed to exclude outlier regions (i.e. QTL) 63 | based on Hampel's rule. An alternate method for filtering out QTL is proposed 64 | using absolute delta SNP indices greater than 0.1 to filter out potential 65 | QTL. An estimation of the mode of the trimmed set is calculated using the 66 | \code{\link[modeest]{mlv}} function from the package modeest. Finally, the 67 | mean and variance of the set are estimated using the median and mode and 68 | p-values are estimated from a log normal distribution.
\cr \cr 4) Negative 69 | Log10- and Benjamini-Hochberg adjusted p-values are calculated using 70 | \code{\link[stats]{p.adjust}} 71 | } 72 | \examples{ 73 | df_filt <- runGprimeAnalysis(df_filt,windowSize = 2e6,outlierFilter = "deltaSNP") 74 | } 75 | -------------------------------------------------------------------------------- /man/runQTLseqAnalysis.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/takagi_sim.R 3 | \name{runQTLseqAnalysis} 4 | \alias{runQTLseqAnalysis} 5 | \title{Calculates delta SNP confidence intervals for QTLseq analysis} 6 | \usage{ 7 | runQTLseqAnalysis(SNPset, windowSize = 1e+06, popStruc = "F2", 8 | bulkSize, depth = NULL, replications = 10000, filter = 0.3, 9 | intervals = c(95, 99), ...) 10 | } 11 | \arguments{ 12 | \item{SNPset}{The data frame imported by \code{importFromGATK}} 13 | 14 | \item{windowSize}{the window size (in base pairs) bracketing each SNP for which to calculate the statistics.} 15 | 16 | \item{popStruc}{the population structure. Defaults to "F2" and assumes "RIL" otherwise} 17 | 18 | \item{bulkSize}{non-negative integer vector. The number of individuals in 19 | each simulated bulk. Can be of length 1, then both bulks are set to the 20 | same size. Assumes the first value in the vector is the simulated high 21 | bulk.} 22 | 23 | \item{depth}{integer. A read depth for which to replicate SNP-index calls.} 24 | 25 | \item{replications}{integer. The number of bootstrap replications.} 26 | 27 | \item{filter}{numeric. An optional minimum SNP-index filter} 28 | 29 | \item{intervals}{numeric vector. Confidence intervals supplied as two-sided 30 | percentiles. i.e.
If intervals = 95, the two-sided 95\% 31 | confidence interval is returned, 2.5\% on each side.} 32 | 33 | \item{...}{Other arguments passed to \code{\link[locfit]{locfit}} and 34 | subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful 35 | in cases where you get "out of vertex space" warnings; set the maxk higher 36 | than the default 100. See \code{\link[locfit]{locfit.raw}}(). But if you 37 | are getting that warning you should seriously consider increasing your 38 | window size.} 39 | } 40 | \value{ 41 | A SNPset data frame with delta SNP-index thresholds corresponding to the 42 | requested confidence intervals matching the tricube smoothed depth at each SNP. 43 | } 44 | \description{ 45 | The method for simulating delta SNP-index confidence interval thresholds 46 | as described in Takagi et al. (2013). Genotypes are randomly assigned for 47 | each individual in the bulk, based on the population structure. The total 48 | alternative allele frequency in each bulk is calculated at each depth used to simulate 49 | delta SNP-indices, with a user defined number of bootstrapped replications. 50 | The requested confidence intervals are then calculated from the bootstraps. 51 | } 52 | \examples{ 53 | df_filt <- runQTLseqAnalysis( 54 | SNPset = df_filt, 55 | bulkSize = c(25, 35), 56 | windowSize = 1e6, 57 | popStruc = "F2", 58 | replications = 10000, 59 | intervals = c(95, 99) 60 | ) 61 | 62 | } 63 | -------------------------------------------------------------------------------- /man/simulateAlleleFreq.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/takagi_sim.R 3 | \name{simulateAlleleFreq} 4 | \alias{simulateAlleleFreq} 5 | \title{Randomly calculates an alternate allele frequency within a bulk} 6 | \usage{ 7 | simulateAlleleFreq(n, pop = "F2") 8 | } 9 | \arguments{ 10 | \item{n}{non-negative integer.
The number of individuals in each bulk} 11 | 12 | \item{pop}{the population structure. Defaults to "F2" and assumes "RIL" 13 | population otherwise.} 14 | } 15 | \value{ 16 | an alternate allele frequency within the bulk. Used for simulating 17 | SNP-indices. 18 | } 19 | \description{ 20 | Randomly calculates an alternate allele frequency within a bulk 21 | } 22 | -------------------------------------------------------------------------------- /man/simulateConfInt.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/takagi_sim.R 3 | \name{simulateConfInt} 4 | \alias{simulateConfInt} 5 | \title{Simulation of delta SNP index confidence intervals} 6 | \usage{ 7 | simulateConfInt(popStruc = "F2", bulkSize, depth = 1:100, 8 | replications = 10000, filter = 0.3, intervals = c(0.05, 0.025)) 9 | } 10 | \arguments{ 11 | \item{popStruc}{the population structure. Defaults to "F2" and assumes "RIL" otherwise} 12 | 13 | \item{bulkSize}{non-negative integer vector. The number of individuals in 14 | each simulated bulk. Can be of length 1, then both bulks are set to the 15 | same size. Assumes the first value in the vector is the simulated high 16 | bulk.} 17 | 18 | \item{depth}{integer. A read depth for which to replicate SNP-index calls.} 19 | 20 | \item{replications}{integer. The number of bootstrap replications.} 21 | 22 | \item{filter}{numeric. An optional minimum SNP-index filter} 23 | 24 | \item{intervals}{numeric vector of probabilities with values in [0,1] 25 | corresponding to the requested confidence intervals} 26 | } 27 | \value{ 28 | A data frame of delta SNP-index thresholds corresponding to the 29 | requested confidence intervals at the user set depths. 30 | } 31 | \description{ 32 | The method for simulating delta SNP-index confidence interval thresholds 33 | as described in Takagi et al. (2013).
Genotypes are randomly assigned for 34 | each individual in the bulk, based on the population structure. The total 35 | alternative allele frequency in each bulk is calculated at each depth used to simulate 36 | delta SNP-indices, with a user defined number of bootstrapped replications. 37 | The requested confidence intervals are then calculated from the bootstraps. 38 | } 39 | \examples{ 40 | CI <- 41 | simulateConfInt( 42 | popStruc = "F2", 43 | bulkSize = 50, 44 | depth = 1:100, 45 | intervals = c(0.05, 0.95, 0.025, 0.975, 0.005, 0.995, 0.0025, 0.9975)) 46 | } 47 | -------------------------------------------------------------------------------- /man/simulateSNPindex.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/takagi_sim.R 3 | \name{simulateSNPindex} 4 | \alias{simulateSNPindex} 5 | \title{Simulates a delta SNP-index with replication} 6 | \usage{ 7 | simulateSNPindex(depth, altFreq1, altFreq2, replicates = 10000, 8 | filter = NULL) 9 | } 10 | \arguments{ 11 | \item{depth}{integer. A read depth for which to replicate SNP-index calls.} 12 | 13 | \item{altFreq1}{numeric. The alternate allele frequency for bulk A.} 14 | 15 | \item{altFreq2}{numeric. The alternate allele frequency for bulk B.} 16 | 17 | \item{replicates}{integer. The number of bootstrap replications.} 18 | 19 | \item{filter}{numeric.
an optional minimum SNP-index filter} 20 | } 21 | \value{ 22 | Returns a vector of delta SNP-indices of length \code{replicates} 23 | } 24 | \description{ 25 | Simulates a delta SNP-index with replication 26 | } 27 | -------------------------------------------------------------------------------- /man/tricubeStat.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/G_functions.R 3 | \name{tricubeStat} 4 | \alias{tricubeStat} 5 | \title{Calculate tricube weighted statistics for each SNP} 6 | \usage{ 7 | tricubeStat(POS, Stat, windowSize = 2e+06, ...) 8 | } 9 | \arguments{ 10 | \item{POS}{A vector of genomic positions for each SNP} 11 | 12 | \item{Stat}{A vector of values for a given statistic for each SNP} 13 | 14 | \item{...}{Other arguments passed to \code{\link[locfit]{locfit}} and 15 | subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful 16 | in cases where you get "out of vertex space" warnings; set the maxk higher 17 | than the default 100. See \code{\link[locfit]{locfit.raw}}().} 18 | 19 | \item{windowSize}{the window size (in base pairs) bracketing each SNP for which 20 | to calculate the statistics. Magwene et al. recommend a window size of ~25 21 | cM, but also recommend optionally trying several window sizes to test if 22 | peaks are over- or undersmoothed.} 23 | } 24 | \value{ 25 | Returns a vector of the weighted statistic calculated with a tricube 26 | smoothing kernel 27 | } 28 | \description{ 29 | Uses local regression (wrapper for \code{\link[locfit]{locfit}}) to predict a 30 | tricube smoothed version of the statistic supplied for each SNP. This works as a 31 | weighted average across neighboring SNPs that accounts for linkage 32 | disequilibrium (LD) while minimizing noise attributed to SNP calling errors. 33 | Values for neighboring SNPs within the window are weighted by physical 34 | distance from the focal SNP.
35 | } 36 | \examples{ 37 | df_filt_4mb$Gprime <- tricubeStat(POS, Stat = GStat, windowSize = 4e6) 38 | } 39 | \seealso{ 40 | \code{\link{getG}} for G statistic calculation 41 | 42 | \code{\link[locfit]{locfit}} for local regression 43 | } 44 | -------------------------------------------------------------------------------- /src/.gitignore: -------------------------------------------------------------------------------- 1 | *.o 2 | *.so 3 | *.dll 4 | -------------------------------------------------------------------------------- /src/RcppExports.cpp: -------------------------------------------------------------------------------- 1 | // Generated by using Rcpp::compileAttributes() -> do not edit by hand 2 | // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 3 | 4 | #include <Rcpp.h> 5 | 6 | using namespace Rcpp; 7 | 8 | // countSNPs_cpp 9 | NumericVector countSNPs_cpp(NumericVector POS, double windowSize); 10 | RcppExport SEXP _QTLseqr_countSNPs_cpp(SEXP POSSEXP, SEXP windowSizeSEXP) { 11 | BEGIN_RCPP 12 | Rcpp::RObject rcpp_result_gen; 13 | Rcpp::RNGScope rcpp_rngScope_gen; 14 | Rcpp::traits::input_parameter< NumericVector >::type POS(POSSEXP); 15 | Rcpp::traits::input_parameter< double >::type windowSize(windowSizeSEXP); 16 | rcpp_result_gen = Rcpp::wrap(countSNPs_cpp(POS, windowSize)); 17 | return rcpp_result_gen; 18 | END_RCPP 19 | } 20 | 21 | static const R_CallMethodDef CallEntries[] = { 22 | {"_QTLseqr_countSNPs_cpp", (DL_FUNC) &_QTLseqr_countSNPs_cpp, 2}, 23 | {NULL, NULL, 0} 24 | }; 25 | 26 | RcppExport void R_init_QTLseqr(DllInfo *dll) { 27 | R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); 28 | R_useDynamicSymbols(dll, FALSE); 29 | } 30 | -------------------------------------------------------------------------------- /src/countSNPs.cpp: -------------------------------------------------------------------------------- 1 | #include <Rcpp.h> 2 | using namespace Rcpp; 3 | 4 | //' Count number of SNPs within a sliding window 5 | //' 6 | //' For each SNP returns how
many SNPs are bracketing it within the set window size 7 | //' 8 | //' @param POS A numeric vector of genomic positions for each SNP (assumed sorted in ascending order) 9 | //' @param windowSize The required window size 10 | //' @export countSNPs_cpp 11 | // [[Rcpp::export]] 12 | NumericVector countSNPs_cpp(NumericVector POS, double windowSize) { 13 | unsigned int nout=POS.size(), i, left=0, right=0; 14 | NumericVector out(nout); 15 | 16 | for( i=0; i < nout; i++ ) { 17 | while ((right + 1 < nout) && (POS[right + 1] <= POS[i] + windowSize / 2)) // bounds check first so POS[right + 1] is never read past the end 18 | right++; 19 | 20 | while (POS[left] <= POS[i] - windowSize / 2) // drop SNPs that fell out of the left half-window 21 | left++; 22 | 23 | out[i] = right - left + 1 ; 24 | } 25 | return out; 26 | } 27 | 28 | 29 | // You can include R code blocks in C++ files processed with sourceCpp 30 | // (useful for testing and development). The R code will be automatically 31 | // run after the compilation. 32 | // 33 | 34 | /*** R 35 | countSNPs_cpp(c(1, 5, 10, 1e6), windowSize = 100) 36 | */ 37 | -------------------------------------------------------------------------------- /vignettes/.gitignore: -------------------------------------------------------------------------------- 1 | QTLseqr_cache 2 | QTLseqr_files 3 | -------------------------------------------------------------------------------- /vignettes/QTLseqr.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Next-generation sequencing bulk segregant analysis with QTLseqr" 3 | author: "Ben N.
Mansfeld and Rebecca Grumet" 4 | date: "`r Sys.Date()`" 5 | output: 6 | pdf_document: 7 | toc: yes 8 | graphics: yes 9 | urlcolor: blue 10 | vignette: > 11 | %\VignetteIndexEntry{NGS-BSA with QTLseqr} 12 | %\VignetteEncoding{UTF-8} 13 | %\VignetteEngine{knitr::rmarkdown} 14 | header-includes: \usepackage{graphicx} 15 | \usepackage{float} 16 | --- 17 | 18 | ```{r setup, echo=FALSE, results="hide"} 19 | knitr::opts_chunk$set(tidy=FALSE, cache=TRUE, 20 | dev="png", 21 | message=FALSE, error=FALSE, warning=TRUE) 22 | ``` 23 | # Current version: Development - `r packageVersion("QTLseqr")` 24 | 25 | # Introduction 26 | 27 | ## Citations 28 | 29 | **If you use QTLseqr in published research, please cite:** 30 | 31 | > Mansfeld B.N. and Grumet R, 32 | > QTLseqr: An R package for bulk segregant analysis with next-generation sequencing 33 | > *The Plant Genome* [doi:10.3835/plantgenome2018.01.0006](https://dl.sciencesocieties.org/publications/tpg/abstracts/11/2/180006) 34 | 35 | We also recommend citing the paper for the corresponding method you work with. 36 | 37 | QTL-seq method: 38 | 39 | > Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka, C., Uemura, A., Utsushi, 40 | > H., Tamiru, M., Takuno, S., Innan, H., Cano, L. M., Kamoun, S. and Terauchi, R. (2013), 41 | > QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA 42 | > from two bulked populations. *Plant J*, 74: 174–183. [doi:10.1111/tpj.12105](https://onlinelibrary.wiley.com/doi/full/10.1111/tpj.12105) 43 | 44 | G prime method: 45 | 46 | > Magwene PM, Willis JH, Kelly JK (2011) The Statistics of Bulk Segregant Analysis Using Next 47 | > Generation Sequencing. *PLOS Computational Biology* 7(11): e1002255. 
48 | > [doi.org/10.1371/journal.pcbi.1002255](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002255) 49 | 50 | ## Quick Start 51 | Here are the basic steps required to run and plot QTLseq and $G'$ analysis 52 | 53 | ```{r installGitHub, eval = FALSE} 54 | # download and install the devtools package 55 | install.packages("devtools") 56 | 57 | # download the QTLseqr package from GitHub 58 | devtools::install_github("bmansfeld/QTLseqr") 59 | ``` 60 | 61 | ```{r quickStart, eval=FALSE} 62 | #load the package 63 | library("QTLseqr") 64 | 65 | #Set sample and file names 66 | HighBulk <- "SRR834931" 67 | LowBulk <- "SRR834927" 68 | file <- "SNPs_from_GATK.table" 69 | 70 | #Choose which chromosomes will be included in the analysis (i.e. exclude smaller contigs) 71 | Chroms <- paste0(rep("Chr", 12), 1:12) 72 | 73 | #Import SNP data from file 74 | df <- 75 | importFromGATK( 76 | file = file, 77 | highBulk = HighBulk, 78 | lowBulk = LowBulk, 79 | chromList = Chroms 80 | ) 81 | 82 | #Filter SNPs based on some criteria 83 | df_filt <- 84 | filterSNPs( 85 | SNPset = df, 86 | refAlleleFreq = 0.20, 87 | minTotalDepth = 100, 88 | maxTotalDepth = 400, 89 | minSampleDepth = 40, 90 | minGQ = 99 91 | ) 92 | 93 | #Run G' analysis 94 | df_filt <- runGprimeAnalysis(SNPset = df_filt, 95 | windowSize = 1e6, 96 | outlierFilter = "deltaSNP") 97 | 98 | #Run QTLseq analysis 99 | df_filt <- runQTLseqAnalysis( 100 | SNPset = df_filt, 101 | windowSize = 1e6, 102 | popStruc = "F2", 103 | bulkSize = c(300, 450), 104 | replications = 10000, 105 | intervals = c(95, 99) 106 | ) 107 | 108 | #Plot 109 | plotQTLStats( 110 | SNPset = df_filt, 111 | var = "Gprime", 112 | plotThreshold = TRUE, 113 | q = 0.01 114 | ) 115 | 116 | plotQTLStats( 117 | SNPset = df_filt, 118 | var = "deltaSNP", 119 | plotIntervals = TRUE) 120 | 121 | #export summary CSV 122 | getQTLTable( 123 | SNPset = df_filt, 124 | alpha = 0.01, 125 | export = TRUE, 126 | fileName = "my_BSA_QTL.csv" 127 | ) 128 | ``` 129 
| 130 | # Standard workflow 131 | 132 | ## Installation 133 | Let's install and load the QTLseqr package: 134 | ```{r install_load, eval=FALSE} 135 | #Install step if you have not done so yet: 136 | #install.packages("devtools") 137 | 138 | devtools::install_github("bmansfeld/QTLseqr") 139 | library("QTLseqr") 140 | ``` 141 | 142 | Great! We now need to load some data. QTLseqr has a sister package, [Yang2013data](https://github.com/bmansfeld/Yang2013data) (derived from [Yang et al. (2013)](https://doi.org/10.1371/journal.pone.0068433)), that has some trial data you can play with. 143 | 144 | ## Input data 145 | 146 | QTLseqr supports importing data either from a table file exported by the VariantsToTable function built into GATK or from a delimited file containing allele read depths for each bulk. The two available functions are `importFromGATK` and `importFromTable`. 147 | 148 | Both functions import the SNP data and perform some preliminary calculations to find the following: 149 | 150 | $$Reference\ allele\ frequency = \frac{Ref\ allele\ depth_{HighBulk} + Ref\ allele\ depth_{LowBulk}}{Total\ read\ depth\ for\ both\ bulks}$$ 151 | 152 | $$SNP\text{-}index_{per\ bulk} = \frac{Alternate\ allele\ depth}{Total\ read\ depth}$$ 153 | 154 | $$\Delta (SNP\text{-}index) = SNP \text{-} index_{High Bulk} - SNP\text{-}index_{Low Bulk}$$ 155 | 156 | To demonstrate the use of the import functions we will first load the Yang et al. (2013) data file. 157 | We first need to download the package that contains the data from GitHub. 
158 | ```{r installdata} 159 | #download and load data package (~50Mb) 160 | devtools::install_github("bmansfeld/Yang2013data") 161 | library("Yang2013data") 162 | 163 | #Get the path to the data file 164 | rawData <- system.file( 165 | "extdata", 166 | "Yang_et_al_2013.table", 167 | package = "Yang2013data", 168 | mustWork = TRUE) 169 | ``` 170 | If you have your own data you can simply refer to it directly: 171 | ```{r, eval = FALSE} 172 | rawData <- "C:/PATH/TO/MY/DIR/My_BSA_data.table" 173 | ``` 174 | 175 | We define the sample name for each of the bulks. We can also define a vector of the chromosomes to be included in the analysis (i.e. to exclude smaller contigs). In this case, Chr1, Chr2 ... Chr12. 176 | ```{r} 177 | HighBulk <- "SRR834931" 178 | LowBulk <- "SRR834927" 179 | Chroms <- paste0(rep("Chr", 12), 1:12) 180 | ``` 181 | 182 | ### Importing SNPs from GATK 183 | 184 | Following the [GATK best practices guide](https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS) for whole genome sequencing should result in a VCF that is compatible with QTLseqr. In general the workflow suggested by GATK is per-sample variant calling followed by joint genotyping across samples.
This will produce a VCF file that includes **BOTH** bulks, each with a different sample name (here SRR834927 and SRR834931). For example, a single SNP record looks like this: 185 | 186 | ```{r VCFrow, echo=FALSE, warning=FALSE} 187 | library(kableExtra) 188 | x <- data.frame(CHROM = "Chr1", POS = 31071, ID = ".", REF = "A", ALT = "G", QUAL = 1390.44, FILTER = "PASS", INFO = "..\\*...", FORMAT = "GT:AD:DP:GQ:PL", SRR834927 = "0/1:34,36:70:99:897,0,855", SRR834931 = "0/1:26,22:48:99:522,0,698") 189 | kable_styling(knitr::kable(x = x, format = "latex", booktabs = TRUE), latex_options = "scale_down") 190 | ``` 191 | \**info column removed for brevity* 192 | 193 | GATK provides a fast VCF parser, the [VariantsToTable](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php) tool, that extracts the necessary fields for easy use in downstream analysis. 194 | 195 | We highly recommend reading [What is a VCF and how should I interpret it?](http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it) for more information on GATK VCF Fields and Genotype Fields. 196 | 197 | Though the use of GATK's VariantsToTable function is out of the scope of this vignette, the syntax for use with QTLseqr should look something like this: 198 | 199 | ```{bash, eval=FALSE} 200 | java -jar GenomeAnalysisTK.jar \ 201 | -T VariantsToTable \ 202 | -R ${REF} \ 203 | -V ${NAME} \ 204 | -F CHROM -F POS -F REF -F ALT \ 205 | -GF AD -GF DP -GF GQ -GF PL \ 206 | -o ${NAME}.table 207 | ``` 208 | Where `${REF}` is the reference genome file and `${NAME}` is the VCF file you wish to parse. 209 | 210 | To run QTLseqr successfully, the required VCF fields `(-F)` are CHROM (Chromosome) and POS (Position). The required Genotype fields `(-GF)` are AD (Allele Depth) and DP (Depth).
Recommended fields are REF (Reference allele) and ALT (Alternative allele). Recommended Genotype fields are PL (Phred-scaled likelihoods) and GQ (Genotype Quality). 211 | 212 | #### The `importFromGATK` function 213 | 214 | The `importFromGATK` function imports SNP data from the output of the VariantsToTable function in GATK. After importing the data, the function calculates the total reference allele frequency for both bulks together, the $SNP\text{-}index$ for each SNP in each bulk, and the $\Delta (SNP\text{-}index)$. 215 | 216 | Here we use the `importFromGATK` function to import the raw data; it returns a data frame that includes these calculated values. 217 | 218 | Let's import: 219 | ```{r import, cache=TRUE} 220 | #import data 221 | df <- 222 | importFromGATK( 223 | file = rawData, 224 | highBulk = HighBulk, 225 | lowBulk = LowBulk, 226 | chromList = Chroms 227 | ) 228 | ``` 229 | 230 | ### Importing from a delimited file 231 | 232 | You can also import SNPs from a csv, tsv or any other delimited file. Your file must include some necessary columns: CHROM (Chromosome names) and POS (the SNP position) as well as the reference and alternate allele depths (number of reads supporting each allele). The allele depths should be in columns named `AD_REF.sampleName` and `AD_ALT.sampleName`. For example, the column for alternate allele depth for a high bulk sample named "sample1" should be "AD_ALT.sample1". Any other columns describing the SNPs are allowed, e.g. the actual allele calls, or a quality score. If a column is bulk specific, it should be named columnName.sampleName, e.g. "QUAL.sample1".
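
The naming convention and the preliminary calculations can be illustrated with a small, self-contained sketch. This is not the package's own code: the sample name "sample2", the depth values, and the temporary file path are made up for illustration, and the calculations simply follow the formulas given in the Input data section.

```r
# Toy SNP table following the column naming convention described above.
# "sample1" plays the high bulk and "sample2" (a made-up name) the low bulk.
snps <- data.frame(
  CHROM = rep("Chr1", 3),
  POS = c(31071, 31478, 33667),
  AD_REF.sample1 = c(26, 40, 24),
  AD_ALT.sample1 = c(22, 34, 29),
  AD_REF.sample2 = c(34, 34, 20),
  AD_ALT.sample2 = c(36, 52, 48)
)
input_cols <- names(snps)

# Per-bulk read depth = reference depth + alternate depth
dp_high <- snps$AD_REF.sample1 + snps$AD_ALT.sample1
dp_low <- snps$AD_REF.sample2 + snps$AD_ALT.sample2

# Quantities defined by the formulas in the "Input data" section
snps$REF_FRQ <- (snps$AD_REF.sample1 + snps$AD_REF.sample2) / (dp_high + dp_low)
snps$SNPindex.HIGH <- snps$AD_ALT.sample1 / dp_high
snps$SNPindex.LOW <- snps$AD_ALT.sample2 / dp_low
snps$deltaSNP <- snps$SNPindex.HIGH - snps$SNPindex.LOW

# Write only the input columns to a comma-delimited file (temporary path)
csv_path <- tempfile(fileext = ".csv")
write.csv(snps[, input_cols], csv_path, row.names = FALSE)
```

A file written this way has the column layout that `importFromTable` expects, with `highBulk = "sample1"` and `lowBulk = "sample2"`; the import function performs the frequency calculations itself, so the derived columns are shown here only to make the formulas concrete.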
233 | 234 | Here are the first few rows of an example file: 235 | ```{r csvexample, echo=FALSE} 236 | x <- 237 | data.frame( 238 | CHROM = rep("Chr1", 5), 239 | POS = c(31071, 31478, 33667, 34057, 35239), 240 | REF = c("A", "C", "A", "C", "A"), 241 | ALT = c("G", "T", "G", "T", "C"), 242 | AD_REF.SRR834927 = c(34, 34, 20, 38, 25), 243 | AD_ALT.SRR834927 = c(36, 52, 48, 40, 36), 244 | GQ.SRR834927 = rep(99, 5), 245 | AD_REF.SRR834931 = c(26, 40, 24, 29, 40), 246 | AD_ALT.SRR834931 = c(22, 34, 29, 26, 60), 247 | GQ.SRR834931 = rep(99, 5) 248 | ) 249 | 250 | kableExtra::kable_styling(knitr::kable(x = x, format = "latex", booktabs = TRUE), latex_options = "scale_down") 251 | ``` 252 | 253 | #### The `importFromTable` function 254 | 255 | To load the data from a delimited file we use the `importFromTable` function. In our case the file is a csv, so it is comma delimited. The columns are denoted by the sample names for the bulks (SRR834931 and SRR834927). These names will be renamed to "HIGH" and "LOW". If you want, you can set the column names as such in advance, as the defaults for the `highBulk` and `lowBulk` arguments are "HIGH" and "LOW", respectively. As in the `importFromGATK` function, the `chromList` argument should be a string vector of the chromosomes to be used in the analysis. This is useful for filtering out unwanted contigs etc. 256 | 257 | ```{r importcsv, eval=FALSE} 258 | df <- importFromTable(file = "Yang2013.csv", 259 | highBulk = HighBulk, 260 | lowBulk = LowBulk, 261 | chromList = Chroms) 262 | ``` 263 | 264 | ## Loaded data frame 265 | The loaded data frame should look like this.
This data frame also includes other information about the genotype calls from GATK: 266 | ```{r viewdf} 267 | head(df) 268 | ``` 269 | 270 | Let's review the column headers: 271 | 272 | * CHROM - The chromosome this SNP is in 273 | * POS - The position on the chromosome in nt 274 | * REF - The reference allele at that position 275 | * ALT - The alternate allele 276 | * DP.HIGH - The read depth at that position in the high bulk 277 | * AD_REF.HIGH - The allele depth of the reference allele in the high bulk 278 | * AD_ALT.HIGH - The alternate allele depth in the high bulk 279 | * GQ.HIGH - The genotype quality score (how confident we are in the genotyping) 280 | * SNPindex.HIGH - The calculated SNP-index for the high bulk 281 | * Same as above for the low bulk 282 | * REF_FRQ - The reference allele frequency as defined above 283 | * deltaSNP - The $\Delta (SNP\text{-}index)$ as defined above 284 | 285 | ## Filtering SNPs 286 | Now that we have loaded the data into R we can start cleaning it up by filtering some of the low confidence SNPs. 287 | While GATK has its own filtering tools, QTLseqr offers some options for filtering that may help reduce noise and improve results. Filtering is mainly based on read depth for each SNP, such that we can try to eliminate SNPs with low confidence, due to low coverage, and SNPs that may be in repetitive regions and thus have inflated read depth. You can also filter based on the absolute difference in read depth between the bulks. If you have your own quality columns you can always use R's base subsetting functions to filter out any SNPs you don't want. 288 | 289 | ### Read depth histograms 290 | 291 | One way to assess filtering thresholds and check the quality of our data is by plotting histograms of the read depths. We can get an idea of where to draw our thresholds. We'll use the ggplot2 package for this purpose, but you could use base R to plot as well.
292 | 293 | Let's look at total read depth for example: 294 | ```{r plothist1, warning = FALSE, fig.align="center", fig.width=4, fig.height=3, dpi=300} 295 | library("ggplot2") 296 | ggplot(data = df) + 297 | geom_histogram(aes(x = DP.HIGH + DP.LOW)) + 298 | xlim(0,1000) 299 | 300 | ``` 301 | 302 | ...or look at total reference allele frequency: 303 | ```{r plothist2, warning=FALSE, fig.align = "center", fig.width=4, fig.height=3, dpi=300} 304 | ggplot(data = df) + 305 | geom_histogram(aes(x = REF_FRQ)) 306 | ``` 307 | 308 | We can plot our per-bulk SNP-index to check if our data is good. We expect to find two small peaks, one at each end, and most of the SNPs should be approximately normally distributed around 0.5 in an F2 population. Here is the HIGH bulk for example: 309 | 310 | ```{r plotSNPindex, warning=FALSE, fig.align = "center", fig.width=4, fig.height=3, dpi=300} 311 | ggplot(data = df) + 312 | geom_histogram(aes(x = SNPindex.HIGH)) 313 | ``` 314 | 315 | 316 | ### Using the filterSNPs function 317 | Now that we have an idea about our read depth distribution we can filter out low confidence SNPs. In general we recommend filtering extremely low and high coverage SNPs, in both bulks together (`minTotalDepth/maxTotalDepth`) and/or in each bulk separately (`minSampleDepth`). We have the option to filter based on reference allele frequency (`refAlleleFreq`); this removes SNPs that for some reason are over- or under-represented in *BOTH* bulks. We can also filter SNPs that have large discrepancies in read depth between the bulks (e.g. one bulk has a depth of 500 and the other has 5). Such discrepancies can throw off the G statistic. We can also use the GATK GQ score (Genotype Quality) to filter out low confidence SNPs. If the `verbose` parameter is set to `TRUE` (default) the function will report the numbers of SNPs filtered in each step.
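
To make the individual filters concrete, here is a rough base R sketch of the same kind of subsetting on a toy data frame. It is only an approximation written for illustration, not the implementation of `filterSNPs`: the data values are invented, and the assumptions that `refAlleleFreq = 0.20` keeps SNPs with a reference allele frequency between 0.20 and 0.80 and that all boundaries are inclusive are ours.

```r
# Toy data frame using the column names produced by the import functions.
# The values are invented for illustration.
toy <- data.frame(
  DP.HIGH = c(60, 500, 45, 120, 30),
  DP.LOW = c(70, 5, 50, 130, 90),
  REF_FRQ = c(0.50, 0.45, 0.95, 0.55, 0.40),
  GQ.HIGH = c(99, 99, 99, 99, 99),
  GQ.LOW = c(99, 99, 99, 60, 99)
)

total_depth <- toy$DP.HIGH + toy$DP.LOW

# Each condition mirrors one argument of the filterSNPs call shown below
# (boundary handling is an assumption of this sketch).
keep <-
  toy$REF_FRQ >= 0.20 & toy$REF_FRQ <= 0.80 & # refAlleleFreq = 0.20
  total_depth >= 100 & total_depth <= 400 &   # minTotalDepth / maxTotalDepth
  abs(toy$DP.HIGH - toy$DP.LOW) <= 100 &      # depthDifference = 100
  toy$DP.HIGH >= 40 & toy$DP.LOW >= 40 &      # minSampleDepth = 40
  toy$GQ.HIGH >= 99 & toy$GQ.LOW >= 99        # minGQ = 99

toy_filt <- toy[keep, ]
```

In this toy set only the first SNP passes every condition; each of the other four rows fails at least one filter (total depth, reference allele frequency, genotype quality, and per-sample depth, respectively).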
318 | ```{r filtSNPs-source, eval = FALSE, message = FALSE} 319 | df_filt <- 320 | filterSNPs( 321 | SNPset = df, 322 | refAlleleFreq = 0.20, 323 | minTotalDepth = 100, 324 | maxTotalDepth = 400, 325 | depthDifference = 100, 326 | minSampleDepth = 40, 327 | minGQ = 99, 328 | verbose = TRUE 329 | ) 330 | ``` 331 | 332 | ```{r filtSNPs-msgs, message = TRUE, warning = FALSE, collapse = TRUE, echo = FALSE} 333 | df_filt <- 334 | filterSNPs( 335 | SNPset = df, 336 | refAlleleFreq = 0.20, 337 | minTotalDepth = 100, 338 | maxTotalDepth = 400, 339 | depthDifference = 100, 340 | minSampleDepth = 40, 341 | minGQ = 99, 342 | verbose = TRUE 343 | ) 344 | ``` 345 | 346 | This step is quick and we can go back and plot some histograms to see if we are happy with the results, and we can quickly re-run the filtering step if not. 347 | 348 | ## Running the analysis 349 | 350 | The analysis in QTLseqr is an implementation of both pipelines for bulk segregant analysis, $G'$ and $\Delta (SNP\text{-}index)$, described by Magwene et al. (2011) and Takagi et al. (2013), respectively. We recommend reading both papers to fully understand the considerations and math behind the analysis. 351 | 352 | There are two main analysis functions: 353 | 354 | 1. `runGprimeAnalysis` - performs Magwene et al type $G'$ analysis 355 | 1. `runQTLseqAnalysis` - performs Takagi et al type QTLseq analysis 356 | 357 | ### A note about window sizes 358 | 359 | QTLseqr utilizes a Nadaraya-Watson smoothing kernel to produce tricube-smoothed statistics for analysis. These smoothed statistics function as a weighted moving average across neighboring SNPs that accounts for linkage disequilibrium (LD), while minimizing noise attributed to SNP calling errors (Magwene et al., 2011). In a tricube-weighted window, SNPs that are close to the focal SNP have a high weighting value, while SNPs closer to the edge of the window have low weights. 
The tricube-smoothed values are predicted by constant local regression within each chromosome. The calculations are performed using the `locfit` function from the locfit package using a user defined window size and the degree of the polynomial set to zero. 360 | 361 | As this window is using a tricube-smoothing kernel, the window size *can* be much larger than you might expect. However, in the rice examples below, we choose a window size of 1Mb for the sliding window analysis to replicate the original results published in Yang et al. (2013). For a discussion about window size, we recommend reading Magwene et al. (2011). In general, larger windows will produce smoother data. The functions making these calculations are rather fast, so we recommend testing several window sizes for your data, and then deciding on the optimal size. 362 | 363 | When running either of the analysis functions above, some users will get a "newsplit: out of vertex space" error, coming from the `locfit` function. This is a memory allocation error that usually happens when either the window size is set too small, or the organism's genome is very large and thus the default window size becomes too small. In such cases, the user may pass a higher `maxk` value to either analysis function (the default in `locfit.raw` is 100). **However**, if you are getting that error, you should _*seriously*_ consider increasing your window size, as it is probably too small to effectively manage the noise in your data. Please read the literature and make educated choices about your selected window size. 364 | 365 | ### QTLseq analysis 366 | Takagi et al. (2013) developed the method for QTLseq type NGS-BSA. The analysis is based on calculating the allele frequency differences, or $\Delta (SNP\text{-}index)$, from the allele depths at each SNP. To determine regions of the genome that significantly differ from the expected $\Delta (SNP\text{-}index)$ of 0, a simulation approach is used.
Briefly, at each read depth, simulated SNP frequencies are bootstrapped, and the extreme quantiles are used as simulated confidence intervals. The true data are averaged over a sliding window and regions that surpass the CI are putative QTL. 367 | 368 | When the analysis is run the following steps are performed: 369 | 370 | 1. First the number of SNPs within the sliding window is counted. 371 | 372 | 1. A tricube-smoothed $\Delta (SNP\text{-}index)$ is calculated within the set window size. 373 | 374 | 1. The minimum read depth at each position is calculated and the tricube-smoothed depth is calculated for the window. 375 | 376 | 1. The simulation is performed for data derived read depths (can be set by the user): 377 | Alternate allele frequency is calculated per bulk based on the population type and size (F2 or RIL). $\Delta (SNP\text{-}index)$ is then simulated over several replications (default = 10000) for each bulk. The quantiles from the simulations are used to estimate the confidence intervals. For example, the 99th quantile of 10000 $\Delta (SNP\text{-}index)$ simulations represents the 99% confidence interval for the true data. 378 | 379 | 1. Confidence intervals are matched with the relevant window depth at each SNP. 380 | 381 | Here is an example for running the analysis for an F2 population. In Yang et al. (2013), the bulks are of different sizes (385 and 430 for high and low bulk respectively), so we set `bulkSize = c(385, 430)`. If your bulks are the same size you can simply set one value, e.g. `bulkSize = 25`.
The simulation is bootstrapped 10000 times and the two-sided 95 and 99% confidence intervals are calculated: 382 | 383 | ```{r qtlseqanalysis-src, eval = FALSE} 384 | df_filt <- runQTLseqAnalysis(df_filt, 385 | windowSize = 1e6, 386 | popStruc = "F2", 387 | bulkSize = c(385, 430), 388 | replications = 10000, 389 | intervals = c(95, 99) 390 | ) 391 | ``` 392 | ```{r qtlseqanalysis-msg, message = TRUE, warning = FALSE, collapse = TRUE, echo = FALSE} 393 | df_filt <- runQTLseqAnalysis( 394 | df_filt, 395 | windowSize = 1e6, 396 | popStruc = "F2", 397 | bulkSize = c(385, 430), 398 | replications = 10000, 399 | intervals = c(95, 99) 400 | ) 401 | ``` 402 | 403 | ### G' analysis 404 | An alternate approach to determine statistical significance of QTL from NGS-BSA was proposed by Magwene et al. (2011): calculating a modified G statistic for each SNP based on the observed and expected allele depths and smoothing this value using a tricube smoothing kernel. Using the smoothed G statistic, or $G'$, Magwene et al. allow for noise reduction while also addressing linkage disequilibrium between SNPs. Furthermore, as $G'$ is close to being log normally distributed, p-values can be estimated for each SNP using non-parametric estimation of the null distribution of $G'$. This provides a clear and easy-to-interpret result as well as the option for multiple testing corrections. 405 | 406 | Here, we will briefly summarize the steps performed by the main analysis function, `runGprimeAnalysis`. 407 | 408 | The following steps are performed: 409 | 410 | 1. First the number of SNPs within the sliding window is counted. 411 | 412 | 1. A tricube-smoothed $\Delta (SNP\text{-}index)$ is calculated within the set window size. 413 | 414 | 1. Genome-wide G statistics are calculated by `getG`.
415 | $G$ is defined by the equation: 416 | 417 | $$G = 2 \cdot \sum_{i=1}^{4} n_i \cdot \ln\left(\frac{obs(n_i)}{exp(n_i)}\right)$$ 418 | 419 | Where for each SNP, $n_i$ from i = 1 to 4 corresponds to the reference and alternate allele depths for each bulk, as described in the following table: 420 | 421 | |Allele|High Bulk|Low Bulk| 422 | |------|---------|--------| 423 | |Reference| $n_1$ | $n_2$ | 424 | |Alternate| $n_3$ | $n_4$ | 425 | 426 | ...and $obs(n_i)$ are the observed allele depths as described in the data frame. `getG` calculates the G statistic using expected values assuming read depth is equal for all alleles in both bulks: 427 | $$ 428 | exp(n_1) = \frac{(n_1 + n_2) \cdot (n_1 + n_3)}{(n_1 + n_2 + n_3 + n_4)} 429 | $$ 430 | $$ 431 | exp(n_2) = \frac{(n_2 + n_1) \cdot (n_2 + n_4)}{(n_1 + n_2 + n_3 + n_4)} 432 | $$ 433 | $$ 434 | exp(n_3) = \frac{(n_3 + n_1) \cdot (n_3 + n_4)}{(n_1 + n_2 + n_3 + n_4)} 435 | $$ 436 | $$ 437 | exp(n_4) = \frac{(n_4 + n_2) \cdot (n_4 + n_3)}{(n_1 + n_2 + n_3 + n_4)} 438 | $$ 439 | 440 | 441 | 1. G' - A tricube-smoothed G statistic is predicted by constant local regression within each chromosome using the `tricubeStat` function. This works as a weighted average across neighboring SNPs that accounts for linkage disequilibrium (LD) while minimizing noise attributed to SNP calling errors. G values for neighboring SNPs within the window are weighted by physical distance from the focal SNP. 442 | 443 | 1. P-values are estimated using the non-parametric method described by Magwene et al. (2011) with the function `getPvals`. Briefly, using the natural log of $G'$ a median absolute deviation (MAD) is calculated. The $G'$ set is trimmed to exclude outlier regions (i.e. QTL) based on Hampel's rule. An alternate method for filtering out QTL that we propose is using absolute $\Delta (SNP\text{-}index)$ values greater than a set threshold (default = 0.1) to filter out potential QTL.
An estimation of the mode of the trimmed set is calculated using the `mlv` function from the package `modeest`. Finally, the mean and variance of the set are estimated using the median and mode and p-values are estimated from a log normal distribution. 444 | 445 | 1. Negative Log10- and Benjamini-Hochberg adjusted p-values are calculated using `p.adjust`. 446 | 447 | Let's run the function: 448 | 449 | ```{r gprimeanalysis-src, eval = FALSE} 450 | df_filt <- runGprimeAnalysis(df_filt, 451 | windowSize = 1e6, 452 | outlierFilter = "deltaSNP", 453 | filterThreshold = 0.1) 454 | ``` 455 | ```{r gprimeanalysis-msg, message = TRUE, warning = FALSE, collapse = TRUE, echo = FALSE} 456 | df_filt <- runGprimeAnalysis(df_filt, 457 | windowSize = 1e6, 458 | outlierFilter = "deltaSNP", 459 | filterThreshold = 0.1) 460 | ``` 461 | 462 | Some additional columns are added to the filtered data frame: 463 | ```{r} 464 | head(df_filt) 465 | ``` 466 | 467 | * nSNPs - the number of SNPs bracketing the focal SNP within the set sliding window 468 | * tricubeDeltaSNP - the tricube-smoothed $\Delta (SNP\text{-}index)$ 469 | * G - the G value for the SNP 470 | * Gprime - the tricube-smoothed G value 471 | * pvalue - the p-value calculated by non-parametric estimation 472 | * negLog10Pval - the $-log_{10}(p\text{-}value)$ 473 | * qvalue - Benjamini-Hochberg adjusted p-values 474 | 475 | ## Plotting the data 476 | 477 | QTLseqr offers two main plotting functions to check the validity of the $G'$ analysis and to plot genome-wide or chromosome specific QTL analysis plots. 478 | 479 | ### G' distribution plots 480 | 481 | Due to the fact that p-values are estimated from the null distribution of $G'$, an important check is to see if the null distribution of $G'$ values is close to log normally distributed. 
For this purpose we use the `plotGprimeDist` function, which plots the $G'$ histograms of both raw and filtered $G'$ sets (see P-value calculation above) alongside the log-normal null distribution (which is reported in the legend). We can also use this to test which filtering method (Hampel or deltaSNP) estimates a more accurate null distribution. If you use the `"deltaSNP"` method, plotting $G'$ distributions with different filter thresholds might also help reveal a better $G'$ null distribution. 482 | 483 | ```{r gprimedist hampel, message = FALSE, warning = FALSE, fig.height=4 , dpi=300} 484 | plotGprimeDist(SNPset = df_filt, outlierFilter = "Hampel") 485 | 486 | ``` 487 | 488 | ```{r gprimedist deltaSNP, message=FALSE, warning = FALSE, fig.height=4, dpi=300} 489 | plotGprimeDist(SNPset = df_filt, outlierFilter = "deltaSNP", filterThreshold = 0.1) 490 | ``` 491 | 492 | ### QTL analysis plots 493 | Now that we are happy with our filtered data and it seems that the $G'$ distribution is close to log-normal, we can finally plot some genome-wide figures and try to identify QTL. 494 | 495 | Let's start by plotting the SNP/window distribution: 496 | ```{r plotnSNPs, fig.align = "center", fig.width=12, fig.height=4} 497 | p1 <- plotQTLStats(SNPset = df_filt, var = "nSNPs") 498 | p1 499 | ``` 500 | This is informative as we can assess if there are regions with extremely low SNP density. 501 | 502 | More importantly, let's identify some QTL by plotting the smoothed $\Delta (SNP\text{-}index)$ and $G'$ values. If we've performed QTLseq analysis we can also set `plotIntervals` to `TRUE` and plot the confidence intervals to identify QTL using that method.
503 | 504 | ```{r plotdeltaSNP, fig.align = "center", fig.width=12, fig.height=4, dpi=300} 505 | p2 <- plotQTLStats(SNPset = df_filt, var = "deltaSNP", plotIntervals = TRUE) 506 | p2 507 | ``` 508 | We can see that there are some regions that have $\Delta (SNP\text{-}index)$ that pass the confidence interval thresholds, and are putative QTL. The directionality of the $\Delta (SNP\text{-}index)$ is also important for $G'$ analysis. If the allele contributing to the trait is from the reference parent, the $\Delta (SNP\text{-}index)$ should be less than 0. However, if the $\Delta (SNP\text{-}index) > 0$ then the contributing parent is the one with the alternate alleles. 509 | 510 | Let's look at the $G'$ values to see if these regions are significant and pass the FDR (q) of 0.01. 511 | ```{r plotGprime, fig.align = "center", fig.width=12, fig.height=4, dpi=300} 512 | p3 <- plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.01) 513 | p3 514 | ``` 515 | 516 | Great! It looks like there are QTL identified on Chromosomes 1, 2, 5, 8 and 10. 517 | Based on the $\Delta (SNP\text{-}index)$ and $G'$ plots, the QTL on Chr1 originates from the reference parent (Nipponbare rice, in this case) and the QTL on Chr8 was contributed by the other parent, for example. 518 | 519 | We can also use the `plotQTLStats` function to plot the $-log_{10}(p\text{-}value)$. While this number is a direct derivative of $G'$, it can be more self-explanatory for some. We can use the `subset` parameter to plot one or a few of the chromosomes, say for a close-up figure of a QTL of interest.
Here we look at the $-log_{10}(p\text{-}value)$ plots of Chromosomes 1 and 8: 520 | 521 | ```{r subsetlogpval, fig.align = "center", fig.width=6, fig.height=3, dpi=300} 522 | QTLplots <- plotQTLStats( 523 | SNPset = df_filt, 524 | var = "negLog10Pval", 525 | plotThreshold = TRUE, 526 | q = 0.01, 527 | subset = c("Chr1", "Chr8") 528 | ) 529 | QTLplots 530 | ``` 531 | 532 | ## Extracting QTL data 533 | 534 | Now that we've plotted and identified some putative QTL we can extract the data using two functions, `getSigRegions` and `getQTLTable`. 535 | 536 | ### Extracting significant regions 537 | The `getSigRegions` function will produce a list in which each element represents a QTL region. The elements are subsets from the original data frame you supplied. Any contiguous region in which the adjusted p-value passes the set alpha (i.e. the plotted statistic stays above the threshold line) will be returned. If the statistic dips below the threshold within a region, that region will be split into two elements. 538 | 539 | Let's examine the `head` of the first QTL: 540 | ```{r getsigreg} 541 | QTL <- getSigRegions(SNPset = df_filt, alpha = 0.01) 542 | head(QTL[[1]]) 543 | ``` 544 | 545 | ### Output QTL summary 546 | 547 | While `getSigRegions` is useful for examining every SNP within each QTL and perhaps for some downstream analysis, the `getQTLTable` function will summarize those results and can output a CSV by setting `export = TRUE` and `fileName = "MyQTLsummary.csv"`. We can set `method` as either `"Gprime"` or `"QTLseq"` depending on the type of analysis; `"Gprime"` will use `alpha` as the FDR threshold and `"QTLseq"` will use the `interval` parameter, which should match one of the intervals calculated above.
548 | 549 | Here is the summary for significant regions with an FDR of 0.01: 550 | ```{r QTLtable} 551 | results <- getQTLTable(SNPset = df_filt, method = "Gprime", alpha = 0.01, export = FALSE) 552 | results 553 | 554 | ``` 555 | 556 | The columns are: 557 | 558 | * chromosome - The chromosome on which the region was identified 559 | * qtl - the QTL identification number in this chromosome 560 | * start - the start position on that chromosome, i.e. the position of the first SNP that passes the FDR threshold 561 | * end - the end position 562 | * length - the length in base pairs from start to end of the region 563 | * nSNPs - the number of SNPs in the region 564 | * avgSNPs_Mb - the average number of SNPs/Mb within that region 565 | * peakDeltaSNP - the $\Delta (SNP\text{-}index)$ value at the peak summit 566 | * posPeakDeltaSNP - the position of the absolute maximum tricube-smoothed deltaSNP-index 567 | * maxGprime - the max $G'$ score in the region 568 | * meanGprime - the average $G'$ score of that region 569 | * posMaxGprime - the genomic position of the maximum $G'$ value in the QTL 570 | * sdGprime - the standard deviation of $G'$ within the region 571 | * AUCaT - the **A**rea **U**nder the **C**urve but **a**bove the **T**hreshold line, an indicator of how significant or wide the peak is 572 | * meanPval - the average p-value in the region 573 | * meanQval - the average adjusted p-value in the region 574 | 575 | # Summary 576 | 577 | We've reviewed how to load SNP data from GATK and filter the data to contain high confidence SNPs. We then performed $\Delta (SNP\text{-}index)$ and $G'$ analysis and calculated p-values and q-values based on the tricube-smoothed $G'$ values. The QTL regions that pass our defined threshold can be stored as a list for further analysis or summarized as a table for publication.
578 | 579 | # Session info 580 | ```{r sessioninfo, echo=FALSE, cache=FALSE} 581 | sessionInfo() 582 | ``` -------------------------------------------------------------------------------- /vignettes/QTLseqr.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bmansfeld/QTLseqr/5e761379a805b65038c415c8d3ce7aa61abe89dc/vignettes/QTLseqr.pdf --------------------------------------------------------------------------------