├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── NAMESPACE
├── NEWS.md
├── R
    ├── G_functions.R
    ├── Import_Filter.R
    ├── RcppExports.R
    ├── export_functions.R
    ├── format_genomic.R
    ├── onunload.R
    ├── plotting_functions.R
    └── takagi_sim.R
├── README.md
├── all_plots.png
├── man
    ├── FilterSNPs.Rd
    ├── ImportFromGATK.Rd
    ├── countSNPs_cpp.Rd
    ├── getFDRThreshold.Rd
    ├── getG.Rd
    ├── getPvals.Rd
    ├── getQTLTable.Rd
    ├── getSigRegions.Rd
    ├── importFromTable.Rd
    ├── plotGprimeDist.Rd
    ├── plotQTLStats.Rd
    ├── plotSimulatedThresholds.Rd
    ├── runGprimeAnalysis.Rd
    ├── runQTLseqAnalysis.Rd
    ├── simulateAlleleFreq.Rd
    ├── simulateConfInt.Rd
    ├── simulateSNPindex.Rd
    └── tricubeStat.Rd
├── src
    ├── .gitignore
    ├── RcppExports.cpp
    └── countSNPs.cpp
└── vignettes
    ├── .gitignore
    ├── QTLseqr.Rmd
    └── QTLseqr.pdf


/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^.*\.Rproj$
2 | ^\.Rproj\.user$
3 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | .Rproj
 2 | .Rproj.user
 3 | .Rhistory
 4 | .RData
 5 | .Ruserdata
 6 | inst/doc
 7 | QTLseqr.Rproj
 8 | calculation tests.xlsx
 9 | README.Rmd
10 | temp
11 | 


--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: QTLseqr
 2 | Type: Package
 3 | Title: QTL mapping using Bulk Segregant Analysis of Next Generation Sequencing data.
 4 | Version: 0.7.5.2
 5 | Date: 12-3-2017
 6 | Authors@R: c(
 7 |     person("Ben Nathan", "Mansfeld", email = "mansfeld@msu.edu", role = c("aut", "cre")),
 8 |     person("Rebecca", "Grumet", email = "grumet@msu.edu", role = c("ths"))
 9 |     )
10 | BugReports: https://github.com/bmansfeld/QTLseqr/issues
11 | Description: QTLseqr performs QTL mapping using Bulk Segregant Analysis of Next Gen Sequencing data.
 12 |     The package is an R implementation of the analyses described in Magwene PM, Willis JH, Kelly JK (2011) The Statistics of Bulk Segregant Analysis Using Next Generation Sequencing. PLOS Computational Biology 7(11): e1002255. doi: 10.1371/journal.pcbi.1002255 and Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka, C., Uemura, A., Utsushi, H., Tamiru, M., Takuno, S., Innan, H., Cano, L. M., Kamoun, S. and Terauchi, R. (2013), QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. *Plant J*, 74: 174–183. doi:10.1111/tpj.12105
13 | License: GPL-3
14 | Encoding: UTF-8
15 | LazyData: true
16 | Imports:
17 |     modeest (>= 2.3.2),
18 |     ggplot2 (>= 2.2.0),
19 |     gtools,
20 |     dplyr,
21 |     readr,
22 |     tidyr,
23 |     Rcpp,
24 |     locfit
25 | Suggests:
26 |     knitr,
27 |     rmarkdown,
28 |     kableExtra,
29 |     Yang2013data
30 | RoxygenNote: 6.1.1
31 | biocViews: Software, Sequencing, Visualization
32 | VignetteBuilder: knitr
33 | LinkingTo: Rcpp
34 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
 1 | # Generated by roxygen2: do not edit by hand
 2 | 
 3 | export(countSNPs_cpp)
 4 | export(filterSNPs)
 5 | export(getFDRThreshold)
 6 | export(getPvals)
 7 | export(getQTLTable)
 8 | export(getSigRegions)
 9 | export(importFromGATK)
10 | export(importFromTable)
11 | export(plotGprimeDist)
12 | export(plotQTLStats)
13 | export(plotSimulatedThresholds)
14 | export(runGprimeAnalysis)
15 | export(runQTLseqAnalysis)
16 | export(simulateConfInt)
17 | importFrom(Rcpp,sourceCpp)
18 | importFrom(dplyr,"%>%")
19 | useDynLib(QTLseqr)
20 | 


--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
 1 | # QTLseqr 0.7.5.2
 2 | ## Bugs
 3 | * Import failure caused by 0.7.5.1 fixed
 4 | 
 5 | # QTLseqr 0.7.5.1
 6 | ## Bugs
 7 | * Import failure caused by 0.7.5 fixed
 8 | 
 9 | # QTLseqr 0.7.5
10 | ## Bugs
 11 | * Reading files would sometimes guess column types wrong. Set the default to col_character. This will help when importing very large read depths.
12 | 
13 | # QTLseqr 0.7.4
14 | ## Bugs
 15 | * Fixed a compatibility issue with new versions of the `modeest` package. Please note that from now on QTLseqr requires `modeest (>= 2.3.2)`
16 | 
17 | # QTLseqr 0.7.3
18 | ## Updates
 19 | * Added a `...` argument to all functions that use tricube smoothing, so that users can easily pass higher `maxk` values to `locfit.raw`.
20 | * Added _"A note about window sizes"_ to the vignette.
21 | 
22 | # QTLseqr 0.7.2
23 | ## Updates
 24 | * Added `depthDifference` parameter to the `filterSNPs` function. This helps filter out SNPs with high absolute differences in read depth between the bulks.
25 | * `getQTLTable` now also reports the genomic position of the maximum of each peak. 
26 | * Updates to the vignette about filtering SNPs.
27 | 
28 | # QTLseqr 0.7.1
29 | ## Bug fixes
 30 | * Corrected a bug in checking for negative bulk sizes
31 | 
32 | # QTLseqr 0.7.0
33 | ## Updates
34 | * Added `importFromTable` function to allow users to import from a delimited file.
35 | * Allowed different size bulks in `runQTLseqAnalysis`.
36 | * Updated vignettes and documentation files. 
37 | * Some documentation link fixes
38 | 
39 | # QTLseqr 0.6.5
40 | ## Bug fixes
 41 | * Corrected a bug in import that happened when high or low bulks were named with things that looked like CHROM, POS, ALT or REF. Now the function ignores those columns when renaming to HIGH.xx or LOW.xx. Also improved the definition of column types to force CHROM to be char and POS to be int.
42 | 
43 | # QTLseqr 0.6.4
44 | ## Bug fixes
 45 | * Corrected a windowSize that was set to 1e6 instead of the function parameter in 'runQTLanalysis'. This was causing problems in calculating window depth.
46 | 
47 | # QTLseqr 0.6.3
48 | ## Bug fixes
49 | * Corrected a call to global env variables in `importFromGATK`.
50 | * Fixed issues with bulk names that had periods in them.
51 | * Bug fix in export functions using `table` as variable name.
52 | 
53 | # QTLseqr 0.6.2
54 | ## Bug fixes
 55 | * Added responsive x axis breaks. Axis break labels were getting squashed on small chromosomes.
56 | 
57 | # QTLseqr 0.6.1
58 | ## Updates
 59 | * added `plotSimulatedThresholds` function to help users visualize their confidence intervals
60 | * some manual corrections and mods
61 | 
62 | # QTLseqr 0.6.0
63 | ## Updates
64 | * Added QTLseq analysis functionality
65 | * `plotQTLStats` can now plot confidence intervals in $\Delta (SNP\text{-}index)$ plots
66 | * Export functions run faster and allow for detection of QTL in either "Gprime" or "QTLseq" methods
67 | * removed Bioconductor dependency
68 | 
69 | # QTLseqr 0.5.8
70 | ## Updates
71 | * changed $\Delta SNP\text{-}index$ to $\Delta (SNP\text{-}index)$
72 | 
73 | # QTLseqr 0.5.7
74 | ## Updates
75 | * `plotQTLStats` now allows for chromosome facet shape scaling using the 'scaleChroms' parameter
76 | 
77 | # QTLseqr 0.5.6
78 | ## Bug fixes
79 | * Added package `locfit` to Imports. 
80 | 
81 | # QTLseqr 0.5.5
82 | ## Updates
83 | * `plotGprimeDist` null distribution label more accurate.
84 | 
85 | # QTLseqr 0.5.4
86 | 
87 | ## Updates
 88 | * `plotGprimeDist` now plots histograms of filtered and raw data, overlaid with the null distribution, which is easier to interpret.
89 | 
90 | # QTLseqr 0.5.3
91 | 
92 | ## New features
93 | 
94 | * Added a `NEWS.md` file to track changes to the package.
95 | * in `plotGprimeDist` plotting now includes density plots of all data, data after QTL filtering and the null-distribution assuming mean and variance of the filtered set.
96 | 
97 | ## Bug fixes
 98 | * Corrected a bug in the `getFDRThreshold` function, which was using regular p-values rather than adjusted p-values to define the threshold
99 | 
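
The `getFDRThreshold` fix listed above concerns which p-values define the threshold. As a rough base-R illustration (the p-values are made up, and this is a sketch of the idea rather than the package's own code), the threshold is the largest raw p-value whose Benjamini-Hochberg adjusted value is still below alpha:

```r
# Hypothetical p-values, for illustration only
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.6)
alpha <- 0.05

sorted_p <- sort(pvals)
p_adj <- p.adjust(sorted_p, method = "BH")

# Largest raw p-value whose adjusted p-value is below alpha (NA if none)
fdr_threshold <- if (any(p_adj < alpha)) {
    sorted_p[max(which(p_adj < alpha))]
} else {
    NA
}

# Number of p-values that a naive raw cutoff at alpha would have accepted
n_raw <- sum(pvals < alpha)
```

With these invented values the adjusted cutoff keeps only the two smallest p-values, whereas comparing raw p-values to alpha would have kept four.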


--------------------------------------------------------------------------------
/R/G_functions.R:
--------------------------------------------------------------------------------
  1 | #Functions for calculating and manipulating the G statistic
  2 | 
  3 | #' Calculates the G statistic
  4 | #'
  5 | #' The function is used by \code{\link{runGprimeAnalysis}} to calculate the G
  6 | #' statistic. G is defined by the equation: \deqn{G = 2*\sum_{i=1}^{4}
  7 | #' n_{i}*ln\frac{obs(n_i)}{exp(n_i)}}{G = 2 * \sum n_i * ln(obs(n_i)/exp(n_i))}
  8 | #' Where for each SNP, \eqn{n_i} from i = 1 to 4 corresponds to the reference
  9 | #' and alternate allele depths for each bulk, as described in the following
 10 | #' table: \tabular{rcc}{ Allele \tab High Bulk \tab Low Bulk \cr Reference \tab
 11 | #' \eqn{n_1} \tab \eqn{n_2} \cr Alternate \tab \eqn{n_3} \tab \eqn{n_4} \cr}
 12 | #' ...and \eqn{obs(n_i)} are the observed allele depths as described in the data
 13 | #' frame. Method 1 calculates the G statistic using expected values assuming
 14 | #' read depth is equal for all alleles in both bulks: \deqn{exp(n_1) = ((n_1 +
 15 | #' n_2)*(n_1 + n_3))/(n_1 + n_2 + n_3 + n_4)} \deqn{exp(n_2) = ((n_2 + n_1)*(n_2
 16 | #' + n_4))/(n_1 + n_2 + n_3 + n_4)} etc...
 17 | #'
 18 | #' @param LowRef A vector of the reference allele depth in the low bulk
 19 | #' @param HighRef A vector of the reference allele depth in the high bulk
 20 | #' @param LowAlt A vector of the alternate allele depth in the low bulk
 21 | #' @param HighAlt A vector of the alternate allele depth in the high bulk
 22 | #'
 23 | #' @return A vector of G statistic values with the same length as the supplied allele depth vectors
 24 | #'
 25 | #' @seealso \href{https://doi.org/10.1371/journal.pcbi.1002255}{The Statistics
 26 | #'   of Bulk Segregant Analysis Using Next Generation Sequencing}
 27 | #'   \code{\link{tricubeStat}} for G prime calculation
 28 | 
 29 | getG <- function(LowRef, HighRef, LowAlt, HighAlt)
 30 | {
 31 |     exp <- c(
 32 |         (LowRef + HighRef) * (LowRef + LowAlt) / (LowRef + HighRef + LowAlt + HighAlt),
 33 |         (LowRef + HighRef) * (HighRef + HighAlt) / (LowRef + HighRef + LowAlt + HighAlt),
 34 |         (LowRef + LowAlt) * (LowAlt + HighAlt) / (LowRef + HighRef + LowAlt + HighAlt),
 35 |         (LowAlt + HighAlt) * (HighRef + HighAlt) / (LowRef + HighRef + LowAlt + HighAlt)
 36 |     )
 37 |     obs <- c(LowRef, HighRef, LowAlt, HighAlt)
 38 |     
 39 |     G <-
 40 |         2 * (rowSums(obs * log(
 41 |             matrix(obs, ncol = 4) / matrix(exp, ncol = 4)
 42 |         )))
 43 |     return(G)
 44 | }
 45 | 
 46 | #' Calculate tricube weighted statistics for each SNP
 47 | #'
 48 | #' Uses local regression (wrapper for \code{\link[locfit]{locfit}}) to predict a
 49 | #' tricube smoothed version of the statistic supplied for each SNP. This works as a
 50 | #' weighted average across neighboring SNPs that accounts for Linkage
 51 | #' disequilibrium (LD) while minimizing noise attributed to SNP calling errors.
 52 | #' Values for neighboring SNPs within the window are weighted by physical
 53 | #' distance from the focal SNP.
 54 | #'
 55 | #' @return Returns a vector of the weighted statistic calculated with a tricube
 56 | #'   smoothing kernel
 57 | #'
 58 | #' @param POS A vector of genomic positions for each SNP
 59 | #' @param Stat A vector of values for a given statistic for each SNP
 60 | #' @param windowSize the window size (in base pairs) bracketing each SNP for which
 61 | #'   to calculate the statistic. Magwene et al. recommend a window size of ~25
 62 | #'   cM, but also recommend optionally trying several window sizes to test if
 63 | #'   peaks are over- or undersmoothed.
 64 | #' @param ... Other arguments passed to \code{\link[locfit]{locfit}} and
 65 | #'   subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful
 66 | #'   in cases where you get "out of vertex space" warnings; set the maxk higher
 67 | #'   than the default 100. See \code{\link[locfit]{locfit.raw}}().
 68 | #' @examples df_filt_4mb$Gprime <- tricubeStat(POS, Stat = GStat, windowSize = 4e6)
 69 | #' @seealso \code{\link{getG}} for G statistic calculation
 70 | #' @seealso \code{\link[locfit]{locfit}} for local regression
 71 | 
 72 | tricubeStat <- function(POS, Stat, windowSize = 2e6, ...)
 73 | {
 74 |     if (windowSize <= 0)
 75 |         stop("A positive smoothing window is required")
 76 |     stats::predict(locfit::locfit(Stat ~ locfit::lp(POS, h = windowSize, deg = 0), ...), POS)
 77 | }
 78 | 
 79 | 
 80 | #' Non-parametric estimation of the null distribution of G'
 81 | #'
 82 | #' The function is used by \code{\link{runGprimeAnalysis}} to estimate p-values for the
 83 | #' weighted G' statistic based on the non-parametric estimation method described
 84 | #' in Magwene et al. 2011. Briefly, using the natural log of Gprime a median
 85 | #' absolute deviation (MAD) is calculated. The Gprime set is trimmed to exclude
 86 | #' outlier regions (i.e. QTL) based on Hampel's rule. An alternate method for
 87 | #' filtering out QTL is proposed using absolute delta SNP indices greater than
 88 | #' a set threshold to filter out potential QTL. An estimation of the mode of the trimmed set
 89 | #' is calculated using the \code{\link[modeest]{mlv}} function from the package
 90 | #' modeest. Finally, the mean and variance of the set are estimated using the
 91 | #' median and mode and p-values are estimated from a log normal distribution.
 92 | #'
 93 | #' @param Gprime a vector of G prime values (tricube weighted G statistics)
 94 | #' @param deltaSNP a vector of delta SNP values for use for QTL region filtering
 95 | #' @param outlierFilter one of either "deltaSNP" or "Hampel". Method for
 96 | #'   filtering outlier (i.e. QTL) regions for p-value estimation
 97 | #' @param filterThreshold The absolute delta SNP index to use to filter out putative QTL
 98 | #' @export getPvals
 99 | 
100 | getPvals <-
101 |     function(Gprime,
102 |         deltaSNP = NULL,
103 |         outlierFilter = c("deltaSNP", "Hampel"),
104 |         filterThreshold)
105 |     {
106 |         outlierFilter <- match.arg(outlierFilter)
107 |         if (outlierFilter == "deltaSNP") {
108 |             
109 |             if (abs(filterThreshold) >= 0.5) {
110 |                 stop("filterThreshold should be less than 0.5")
111 |             }
112 |             
113 |             message("Using deltaSNP-index to filter outlier regions with a threshold of ", filterThreshold)
114 |             trimGprime <- Gprime[abs(deltaSNP) < abs(filterThreshold)]
115 |         } else {
116 |             message("Using Hampel's rule to filter outlier regions")
117 |             lnGprime <- log(Gprime)
118 |             
119 |             medianLogGprime <- median(lnGprime)
120 |             
121 |             # calculate the left median absolute deviation (MAD) of ln(G')
122 |             MAD <-
123 |                 median(medianLogGprime - lnGprime[lnGprime <= medianLogGprime])
124 |             
125 |             # Trim the G prime set to exclude outlier regions (i.e. QTL) using Hampel's rule
126 |             trimGprime <-
127 |                 Gprime[lnGprime - median(lnGprime) <= 5.2 * MAD]
128 |         }
129 |         
130 |         medianTrimGprime <- median(trimGprime)
131 |         
132 |         # estimate the mode of the trimmed G' set using the half-sample method
133 |         message("Estimating the mode of a trimmed G prime set using the 'modeest' package...")
134 |         modeTrimGprime <-
135 |             modeest::mlv(x = trimGprime, bw = 0.5, method = "hsm")[1]
136 |         
137 |         muE <- log(medianTrimGprime)
138 |         varE <- abs(muE - log(modeTrimGprime))
139 |         #use the log normal distribution to get pvals
140 |         message("Calculating p-values...")
141 |         pval <-
142 |             1 - plnorm(q = Gprime,
143 |                 meanlog = muE,
144 |                 sdlog = sqrt(varE))
145 |         
146 |         return(pval)
147 |     }
148 | 
149 | 
150 | #' Find false discovery rate threshold
151 | #'
152 | #' Given a vector of p-values and a set false discovery rate alpha the function
153 | #' returns the highest p-value in the vector for which the Benjamini-Hochberg
154 | #' adjusted p-value (i.e. q-value) is less than that alpha.
155 | #'
156 | #' @param pvalues a vector of p-values
157 | #' @param alpha the required false discovery rate alpha
158 | #'
159 | #' @return The p-value threshold that corresponds to the Benjamini-Hochberg adjusted p-value at the FDR set by alpha, or NA if no adjusted p-value is below alpha.
160 | #' @export getFDRThreshold
161 | 
162 | getFDRThreshold <- function(pvalues, alpha = 0.01)
163 | {
164 |     sortedPvals <- sort(pvalues, decreasing = FALSE)
165 |     pAdj <- p.adjust(sortedPvals, method = "BH")
166 |     if (!any(pAdj < alpha)) {
167 |         fdrThreshold <- NA
168 |     } else {
169 |         fdrThreshold <- sortedPvals[max(which(pAdj < alpha))]
170 |     }
171 |     return(fdrThreshold)
172 | }
173 | 
174 | #' Identify QTL using a smoothed G statistic
175 | #'
176 | #' A wrapper for all the functions that perform the full G prime analysis to
177 | #' identify QTL. The following steps are performed:\cr 1) Genome-wide G
178 | #' statistics are calculated by \code{\link{getG}}. \cr G is defined by the
179 | #' equation: \deqn{G = 2*\sum_{i=1}^{4} n_{i}*ln\frac{obs(n_i)}{exp(n_i)}}{G = 2
180 | #' * \sum n_i * ln(obs(n_i)/exp(n_i))} Where for each SNP, \eqn{n_i} from i = 1
181 | #' to 4 corresponds to the reference and alternate allele depths for each bulk,
182 | #' as described in the following table: \tabular{rcc}{ Allele \tab High Bulk
183 | #' \tab Low Bulk \cr Reference \tab \eqn{n_1} \tab \eqn{n_2} \cr Alternate \tab
184 | #' \eqn{n_3} \tab \eqn{n_4} \cr} ...and \eqn{obs(n_i)} are the observed allele
185 | #' depths as described in the data frame. \code{\link{getG}} calculates the G statistic
186 | #' using expected values assuming read depth is equal for all alleles in both
187 | #' bulks: \deqn{exp(n_1) = ((n_1 + n_2)*(n_1 + n_3))/(n_1 + n_2 + n_3 + n_4)}
188 | #' \deqn{exp(n_2) = ((n_2 + n_1)*(n_2 + n_4))/(n_1 + n_2 + n_3 + n_4)}
189 | #' \deqn{exp(n_3) = ((n_3 + n_1)*(n_3 + n_4))/(n_1 + n_2 + n_3 + n_4)}
190 | #' \deqn{exp(n_4) = ((n_4 + n_2)*(n_4 + n_3))/(n_1 + n_2 + n_3 + n_4)}\cr 2) G'
191 | #' - A tricube-smoothed G statistic is predicted by local regression within each
192 | #' chromosome using \code{\link{tricubeStat}}. This works as a weighted average
193 | #' across neighboring SNPs that accounts for Linkage disequilibrium (LD) while
194 | #' minimizing noise attributed to SNP calling errors. G values for neighboring
195 | #' SNPs within the window are weighted by physical distance from the focal SNP.
196 | #' \cr \cr 3) P-values are estimated using the non-parametric method
197 | #' described by Magwene et al. 2011 with the function \code{\link{getPvals}}.
198 | #' Briefly, using the natural log of Gprime a median absolute deviation (MAD) is
199 | #' calculated. The Gprime set is trimmed to exclude outlier regions (i.e. QTL)
200 | #' based on Hampel's rule. An alternate method for filtering out QTL is proposed
201 | #' using absolute delta SNP indices greater than 0.1 to filter out potential
202 | #' QTL. An estimation of the mode of the trimmed set is calculated using the
203 | #' \code{\link[modeest]{mlv}} function from the package modeest. Finally, the
204 | #' mean and variance of the set are estimated using the median and mode and
205 | #' p-values are estimated from a log normal distribution. \cr \cr 4) Negative
206 | #' Log10- and Benjamini-Hochberg adjusted p-values are calculated using
207 | #' \code{\link[stats]{p.adjust}}
208 | #'
209 | #' @param SNPset Data frame SNP set containing previously filtered SNPs
210 | #' @param windowSize the window size (in base pairs) bracketing each SNP for which
211 | #'   to calculate the statistic. Magwene et al. recommend a window size of ~25
212 | #'   cM, but also recommend optionally trying several window sizes to test if
213 | #'   peaks are over- or undersmoothed.
214 | #' @param outlierFilter one of either "deltaSNP" or "Hampel". Method for
215 | #'   filtering outlier (i.e. QTL) regions for p-value estimation
216 | #' @param filterThreshold The absolute delta SNP index to use to filter out putative QTL (default = 0.1)
217 | #' @param ... Other arguments passed to \code{\link[locfit]{locfit}} and
218 | #'   subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful
219 | #'   in cases where you get "out of vertex space" warnings; set the maxk higher
220 | #'   than the default 100. See \code{\link[locfit]{locfit.raw}}(). But if you
221 | #'   are getting that warning you should seriously consider increasing your
222 | #'   window size.
223 | #'   
224 | #' @return The supplied SNP set tibble after G' analysis. Includes five new
225 | #'   columns: \itemize{\item{G - The G statistic for each SNP} \item{Gprime -
226 | #'   The tricube smoothed G statistic based on the supplied window size}
227 | #'   \item{pvalue - the pvalue at each SNP calculated by non-parametric
228 | #'   estimation} \item{negLog10Pval - the -Log10(pvalue) supplied for quick
229 | #'   plotting} \item{qvalue - the Benjamini-Hochberg adjusted p-value}}
230 | #'
231 | #'
232 | #' @importFrom dplyr %>%
233 | #'
234 | #' @export runGprimeAnalysis
235 | #'
236 | #' @examples df_filt <- runGprimeAnalysis(df_filt,windowSize = 2e6,outlierFilter = "deltaSNP")
237 | #' @useDynLib QTLseqr
238 | #' @importFrom Rcpp sourceCpp
239 | 
240 | 
241 | runGprimeAnalysis <-
242 |     function(SNPset,
243 |         windowSize = 1e6,
244 |         outlierFilter = "deltaSNP",
245 |         filterThreshold = 0.1, 
246 |         ...)
247 |     {
248 |         message("Counting SNPs in each window...")
249 |         SNPset <- SNPset %>%
250 |             dplyr::group_by(CHROM) %>%
251 |             dplyr::mutate(nSNPs = countSNPs_cpp(POS = POS, windowSize = windowSize))
252 |         
253 |         message("Calculating tricube smoothed delta SNP index...")
254 |         SNPset <- SNPset %>%
255 |             dplyr::mutate(tricubeDeltaSNP = tricubeStat(POS = POS, Stat = deltaSNP, windowSize, ...))
256 |         
257 |         message("Calculating G and G' statistics...")
258 |         SNPset <- SNPset %>%
259 |             dplyr::mutate(
260 |                 G = getG(
261 |                     LowRef = AD_REF.LOW,
262 |                     HighRef = AD_REF.HIGH,
263 |                     LowAlt = AD_ALT.LOW,
264 |                     HighAlt = AD_ALT.HIGH
265 |                 ),
266 |                 Gprime = tricubeStat(
267 |                     POS = POS,
268 |                     Stat = G,
269 |                     windowSize = windowSize,
270 |                     ...
271 |                 )
272 |             ) %>%
273 |             dplyr::ungroup() %>%
274 |             dplyr::mutate(
275 |                 pvalue = getPvals(
276 |                     Gprime = Gprime,
277 |                     deltaSNP = deltaSNP,
278 |                     outlierFilter = outlierFilter,
279 |                     filterThreshold = filterThreshold
280 |                 ),
281 |                 negLog10Pval = -log10(pvalue),
282 |                 qvalue = p.adjust(p = pvalue, method = "BH")
283 |             )
284 |         
285 |         return(as.data.frame(SNPset))
286 |     }
287 | 
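
As a hedged check of the formula documented for `getG` in this file, the sketch below computes G for a single SNP by hand from a 2x2 table of allele depths. The depths are invented for illustration, and the helper name `g_single` is hypothetical; it is not part of the package.

```r
# Hypothetical helper: G statistic for one SNP from its four allele depths
g_single <- function(low_ref, high_ref, low_alt, high_alt) {
    # Rows are bulks, columns are alleles (matrix() fills column by column)
    obs <- matrix(c(low_ref, high_ref, low_alt, high_alt), nrow = 2,
                  dimnames = list(c("LOW", "HIGH"), c("REF", "ALT")))
    # Expected counts: row total * column total / grand total
    expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
    2 * sum(obs * log(obs / expd))
}

# Invented depths: identical allele ratios in both bulks
g_null <- g_single(low_ref = 20, high_ref = 20, low_alt = 20, high_alt = 20)

# Invented depths: allele ratios skewed in opposite directions between bulks
g_skew <- g_single(low_ref = 30, high_ref = 10, low_alt = 10, high_alt = 30)
```

Here `g_null` is zero because observed and expected depths coincide, while `g_skew` is large. Note that in this sketch a zero depth in any cell produces `NaN`, since `0 * log(0)` is undefined in R.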


--------------------------------------------------------------------------------
/R/Import_Filter.R:
--------------------------------------------------------------------------------
  1 | #' Imports SNP data from GATK VariantsToTable output
  2 | #'
  3 | #' Imports SNP data from the output of the
  4 | #' \href{https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php}{VariantsToTable}
  5 | #' function in GATK. After importing the data, the function then calculates
  6 | #' total reference allele frequency for both bulks together and the delta SNP index
  7 | #' (i.e. SNP index of the low bulk subtracted from the SNP index of the high
  8 | #' bulk), and returns a data frame. The required GATK fields
  9 | #' (-F) are CHROM (Chromosome) and POS (Position). The required Genotype fields
 10 | #' (-GF) are AD (Allele Depth) and DP (Depth). Recommended
 11 | #' fields are REF (Reference allele) and ALT (Alternative allele). Recommended
 12 | #' Genotype fields are PL (Phred-scaled likelihoods) and GQ (Genotype Quality).
 13 | #'
 14 | #' @param file The name of the GATK VariantsToTable output .table file which the
 15 | #'   data are to be read from.
 16 | #' @param highBulk The sample name of the High Bulk
 17 | #' @param lowBulk The sample name of the Low Bulk
 18 | #' @param chromList a string vector of the chromosomes to be used in the
 19 | #'   analysis. Useful for filtering out unwanted contigs etc.
 20 | #' @return Returns a data frame containing columns for Read depth (DP),
 21 | #'   Reference Allele Depth (AD_REF) and Alternative Allele Depth (AD_ALT),
 22 | #'   Genotype Quality (GQ) and SNPindex for each bulk (indicated by .HIGH and
 23 | #'   .LOW column name suffix). Total reference allele frequency "REF_FRQ" is the
 24 | #'   sum of AD_REF for both bulks divided by total Depth for that SNP. The
 25 | #'   deltaSNP column is equal to SNPindex.HIGH - SNPindex.LOW. The G statistic
 26 | #'   is calculated later by \code{\link{runGprimeAnalysis}}.
 27 | #' @seealso \code{\link{getG}} for explanation of how G statistic is
 28 | #'   calculated.
 29 | #'   \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What
 30 | #'   is a VCF and how should I interpret it?} for more information on GATK
 31 | #'   Fields and Genotype Fields
 32 | #' @examples df <- importFromGATK(file = file.table,
 33 | #'     highBulk = highBulkSampleName,
 34 | #'     lowBulk = lowBulkSampleName,
 35 | #'     chromList = c("Chr1","Chr4","Chr7"))
 36 | #' @export importFromGATK
 37 | 
 38 | importFromGATK <- function(file,
 39 |     highBulk = character(),
 40 |     lowBulk = character(),
 41 |     chromList = NULL) {
 42 |     
 43 |     # first read one line to help define col types
 44 |     colheader <- read.delim(file, nrows=1, check.names=FALSE)
 45 |     
 46 |     # identify the sample name specific int and chr columns
 47 |     int_matches <- grep('DP|GQ', names(colheader), value=TRUE)
 48 |     chr_matches <- grep('\\.AD', names(colheader), value=TRUE)
 49 |     
 50 |     # create cols_spec class definitions
 51 |     int_cols <- do.call(readr::cols, setNames(
 52 |         rep(list(readr::col_integer()), length(int_matches)), 
 53 |         int_matches))$cols
 54 |     
 55 |     chr_cols <- do.call(readr::cols, setNames(
 56 |         rep(list(readr::col_character()), length(chr_matches)), 
 57 |         chr_matches))$cols
 58 |     
 59 |     col_defs <- readr::cols(CHROM = "c", POS = "i")
 60 |     col_defs$cols <- c(readr::cols(CHROM = "c", POS = "i")$cols,
 61 |                        int_cols,
 62 |                        chr_cols
 63 |     )
 64 |     
 65 |     SNPset <- readr::read_tsv(file = file, 
 66 |                               col_names = TRUE, 
 67 |                               guess_max = 10000,
 68 |                               col_types = col_defs)
 69 |     
 70 |     if (!all(
 71 |         c(
 72 |         "CHROM", 
 73 |         "POS", 
 74 |         paste0(highBulk, ".AD"), 
 75 |         paste0(lowBulk, ".AD"), 
 76 |         paste0(highBulk, ".DP"), 
 77 |         paste0(lowBulk, ".DP")
 78 |         ) %in% names(SNPset))) {
 79 |         stop("One of the required fields is missing. Check your table file.")
 80 |     }
 81 |     
 82 |     # rename columns based on bulk names and flip headers (i.e. HIGH.AD -> AD.HIGH) to match the rest of the functions
 83 |     colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")] <- 
 84 |         gsub(pattern = highBulk, 
 85 |              replacement = "HIGH", 
 86 |              x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")])
 87 |     colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")] <- 
 88 |         gsub(pattern = lowBulk, 
 89 |              replacement = "LOW", 
 90 |              x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")])
 91 |     
 92 |     colnames(SNPset) <-
 93 |         sapply(strsplit(colnames(SNPset), "[.]"),
 94 |             function(x) {paste0(rev(x),collapse = '.')})
 95 |     
 96 |     #Keep only wanted chromosomes
 97 |     if (!is.null(chromList)) {
 98 |         message("Removing the following chromosomes: ", paste(unique(SNPset$CHROM)[!unique(SNPset$CHROM) %in% chromList], collapse = ", "))
 99 |         SNPset <- SNPset[SNPset$CHROM %in% chromList, ]
100 |     }
101 |     #arrange the chromosomes by natural order sort, eg Chr1, Chr10, Chr2 >>> Chr1, Chr2, Chr10
102 |     SNPset$CHROM <-
103 |         factor(SNPset$CHROM, levels = gtools::mixedsort(unique(SNPset$CHROM)))
104 |     
105 |     SNPset <- SNPset %>%
106 |         tidyr::separate(
107 |             col = AD.LOW,
108 |             into = "AD_REF.LOW",
109 |             sep = ",",
110 |             extra = "drop",
111 |             convert = TRUE
112 |         ) %>%
113 |         tidyr::separate(
114 |             col = AD.HIGH,
115 |             into = "AD_REF.HIGH",
116 |             sep = ",",
117 |             extra = "drop",
118 |             convert = TRUE
119 |         ) %>%
120 |         dplyr::mutate(
121 |             AD_ALT.HIGH = DP.HIGH - AD_REF.HIGH,
122 |             AD_ALT.LOW = DP.LOW - AD_REF.LOW,
123 |             SNPindex.HIGH = AD_ALT.HIGH / DP.HIGH,
124 |             SNPindex.LOW = AD_ALT.LOW / DP.LOW,
125 |             REF_FRQ = (AD_REF.HIGH + AD_REF.LOW) / (DP.HIGH + DP.LOW),
126 |             deltaSNP = SNPindex.HIGH - SNPindex.LOW
127 |         ) %>%
128 |         dplyr::select(
129 |             -dplyr::contains("HIGH"),
130 |             -dplyr::contains("LOW"),
131 |             -dplyr::one_of("deltaSNP", "REF_FRQ"),
132 |             dplyr::matches("AD.*.LOW"),
133 |             dplyr::contains("LOW"),
134 |             dplyr::matches("AD.*.HIGH"),
135 |             dplyr::contains("HIGH"),
136 |             dplyr::everything()
137 |         )
138 |     
139 |     return(as.data.frame(SNPset))
140 | }
141 | 
142 | 
143 | #' Import SNP data from a delimited file
144 | #'
145 | #' After importing the data from a delimited file, the function then calculates
146 | #' total reference allele frequency for both bulks together, the delta SNP index
147 | #' (i.e. SNP index of the low bulk subtracted from the SNP index of the high
148 | #' bulk), the G statistic and returns a data frame. The required columns in the
149 | #' file are CHROM (Chromosome) and POS (Position) as well as the reference and
150 | #' alternate allele depths (number of reads supporting each allele). The allele
151 | #' depths should be in columns named in this format:
152 | #' \code{AD_(<ALT/REF>).<sampleName>}. For example, the column for alternate
153 | #' allele depth for a high bulk sample named "sample1", should be
154 | #' "AD_ALT.sample1". Any other columns describing the SNPs are allowed, e.g. the
155 | #' actual allele calls, or a quality score. If the column is bulk-specific, it
156 | #' should be named \code{columnName.sampleName}, e.g. "QUAL.sample1".
157 | #'
158 | #' @param file The name of the file which the data are to be read from.
159 | #' @param highBulk The sample name of the High Bulk. Defaults to "HIGH"
160 | #' @param lowBulk The sample name of the Low Bulk. Defaults to "LOW"
161 | #' @param chromList a string vector of the chromosomes to be used in the
162 | #'   analysis. Useful for filtering out unwanted contigs etc.
163 | #' @param sep the field separator character. Values on each line of the file are
164 | #'   separated by this character. Defaults to "," (i.e. a csv file).
165 | #' @return Returns a data frame containing columns for per bulk total Read depth (DP),
166 | #'   Reference Allele Depth (AD_REF) and Alternative Allele Depth (AD_ALT), any
167 | #'   other SNP associated columns in the file, and SNPindex for each bulk
168 | #'   (indicated by .HIGH and .LOW column name suffix). Total reference allele
169 | #'   frequency "REF_FRQ" is the sum of AD_REF for both bulks divided by total
170 | #'   depth for that SNP. The delta SNP index (deltaSNP) is equal to SNPindex.HIGH -
171 | #'   SNPindex.LOW.
172 | #'
173 | #' @export importFromTable
174 | 
175 | importFromTable <-
176 |     function(file,
177 |              highBulk = "HIGH",
178 |              lowBulk = "LOW",
179 |              chromList = NULL,
180 |              sep = ",") {
181 |         SNPset <-
182 |             readr::read_delim(
183 |                 file = file,
184 |                 delim = sep,
185 |                 col_names = TRUE,
186 |                 col_types = readr::cols(
187 |                     .default = readr::col_guess(),
188 |                     CHROM = "c",
189 |                     POS = "i"
190 |                 )
191 |             )
192 |         # check CHROM
193 |         if (!"CHROM" %in% names(SNPset)) {
194 |             stop("No 'CHROM' column found.")
195 |         }
196 |         
197 |         # check POS
198 |         if (!"POS" %in% names(SNPset)) {
199 |             stop("No 'POS' column found.")
200 |         }
201 |         
202 |         # check AD_REF.HIGH
203 |         if (!paste0("AD_REF.", highBulk) %in% names(SNPset)) {
204 |             stop(
205 |                 "No High Bulk AD_REF column found. Column should be named 'AD_REF.highBulkName'."
206 |             )
207 |         }
208 |         
209 |         # check AD_REF.LOW
210 |         if (!paste0("AD_REF.", lowBulk) %in% names(SNPset)) {
211 |             stop("No Low Bulk AD_REF column found. Column should be named 'AD_REF.lowBulkName'.")
212 |         }
213 |         
214 |         # check AD_ALT.HIGH
215 |         if (!paste0("AD_ALT.", highBulk) %in% names(SNPset)) {
216 |             stop(
217 |                 "No High Bulk AD_ALT column found. Column should be named 'AD_ALT.highBulkName'."
218 |             )
219 |         }
220 |         
221 |         # check AD_ALT.LOW
222 |         if (!paste0("AD_ALT.", lowBulk) %in% names(SNPset)) {
223 |             stop("No Low Bulk AD_ALT column found. Column should be named 'AD_ALT.lowBulkName'.")
224 |         }
225 |         
226 |         # Keep only wanted chromosomes
227 |         if (!is.null(chromList)) {
228 |             message("Removing the following chromosomes: ",
229 |                     paste(unique(SNPset$CHROM)[!unique(SNPset$CHROM) %in% chromList], collapse = ", "))
230 |             SNPset <- SNPset[SNPset$CHROM %in% chromList, ]
231 |         }
232 |         # arrange the chromosomes by natural order sort, eg Chr1, Chr10, Chr2 >>> Chr1, Chr2, Chr10
233 |         SNPset$CHROM <-
234 |             factor(SNPset$CHROM, levels = gtools::mixedsort(unique(SNPset$CHROM)))
235 |         
236 |         # Rename columns
237 |         message("Renaming the following columns: ",
238 |                 paste(colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")][grep(highBulk, x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")])], collapse = ", "))
239 |         colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")] <-
240 |             gsub(pattern = highBulk,
241 |                  replacement = "HIGH",
242 |                  x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")])
243 |         
244 |         message("Renaming the following columns: ",
245 |                 paste(colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")][grep(lowBulk, x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")])], collapse = ", "))
246 |         colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")] <-
247 |             gsub(pattern = lowBulk,
248 |                  replacement = "LOW",
249 |                  x = colnames(SNPset)[!colnames(SNPset) %in% c("CHROM", "POS", "REF", "ALT")])
250 |         
251 |         # calculate DPs
252 |         SNPset <- SNPset %>%
253 |             dplyr::mutate(
254 |                 DP.HIGH = AD_REF.HIGH + AD_ALT.HIGH,
255 |                 DP.LOW = AD_REF.LOW + AD_ALT.LOW,
256 |                 SNPindex.HIGH = AD_ALT.HIGH / DP.HIGH,
257 |                 SNPindex.LOW = AD_ALT.LOW / DP.LOW,
258 |                 REF_FRQ = (AD_REF.HIGH + AD_REF.LOW) / (DP.HIGH + DP.LOW),
259 |                 deltaSNP = SNPindex.HIGH - SNPindex.LOW
260 |             ) %>%
261 |             dplyr::select(
262 |                 -dplyr::contains("HIGH"), -dplyr::contains("LOW"), -dplyr::one_of("deltaSNP", "REF_FRQ"),
263 |                 dplyr::matches("AD.*.LOW"),
264 |                 dplyr::contains("LOW"),
265 |                 dplyr::matches("AD.*.HIGH"),
266 |                 dplyr::contains("HIGH"),
267 |                 dplyr::everything()
268 |             )
269 |         
270 |         return(as.data.frame(SNPset))
271 |     }
272 | 
273 | 
274 | ## Not exported: currently only works with VCF files produced by GATK.
275 | importFromVCF <- function(file,
276 |                           highBulk = character(),
277 |                           lowBulk = character(),
278 |                           chromList = NULL) {
279 |     
280 |     vcf <- vcfR::read.vcfR(file = file)
281 |     message("Keeping SNPs that pass all filters")
282 |     vcf <- vcf[vcf@fix[, "FILTER"] == "PASS"] 
283 |     
284 |     fix <- dplyr::as_tibble(vcf@fix[, c("CHROM", "POS", "REF", "ALT")]) %>% dplyr::mutate(Key = seq_len(nrow(.)))
285 |     
286 |     # if (!all(
287 |     #     c(
288 |     #         "CHROM", 
289 |     #         "POS", 
290 |     #         paste0(highBulk, ".AD"), 
291 |     #         paste0(lowBulk, ".AD"), 
292 |     #         paste0(highBulk, ".DP"), 
293 |     #         paste0(lowBulk, ".DP")
294 |     #     ) %in% names(SNPset))) {
295 |     #     stop("One of the required fields is missing. Check your VCF file.")
296 |     # }
297 |     
298 |     tidy_gt <- vcfR::extract_gt_tidy(vcf, format_fields = c("AD", "DP", "GQ"), gt_column_prepend = "", alleles = FALSE)
299 |     
300 |     SNPset <- tidy_gt %>%
301 |         dplyr::filter(Indiv == lowBulk) %>% dplyr::select(-Indiv) %>%
302 |         dplyr::left_join(dplyr::select(dplyr::filter(tidy_gt, Indiv == highBulk), -Indiv),
303 |                          by = "Key",
304 |                          suffix = c(".LOW", ".HIGH")) %>%
305 |         tidyr::separate(
306 |             col = "AD.LOW",
307 |             into = c("AD_REF.LOW", "AD_ALT.LOW"),
308 |             sep = ",",
309 |             extra = "merge",
310 |             convert = TRUE
311 |         ) %>%
312 |         tidyr::separate(
313 |             col = "AD.HIGH",
314 |             into = c("AD_REF.HIGH", "AD_ALT.HIGH"),
315 |             sep = ",",
316 |             extra = "merge", 
317 |             convert = TRUE
318 |         ) %>%
319 |         dplyr::full_join(x = fix, by = "Key") %>%
320 |         dplyr::mutate(
321 |             AD_ALT.HIGH = DP.HIGH - AD_REF.HIGH,
322 |             AD_ALT.LOW = DP.LOW - AD_REF.LOW,
323 |             SNPindex.HIGH = AD_ALT.HIGH / DP.HIGH,
324 |             SNPindex.LOW = AD_ALT.LOW / DP.LOW,
325 |             REF_FRQ = (AD_REF.HIGH + AD_REF.LOW) / (DP.HIGH + DP.LOW),
326 |             deltaSNP = SNPindex.HIGH - SNPindex.LOW
327 |         ) %>%
328 |         dplyr::select(-Key)
329 |     # Keep only wanted chromosomes
330 |     if (!is.null(chromList)) {
331 |         message("Removing the following chromosomes: ", paste(unique(SNPset$CHROM)[!unique(SNPset$CHROM) %in% chromList], collapse = ", "))
332 |         SNPset <- SNPset[SNPset$CHROM %in% chromList, ]
333 |     }
334 |     as.data.frame(SNPset)
335 | }
336 | 
337 | 
338 | #' Filter SNPs based on read depth and quality
339 | #'
340 | #' Use filtering parameters to filter out SNPs with high or low read depth as well as
341 | #' low Genotype Quality as defined by GATK. All filters are optional but recommended.
342 | #'
343 | #' @param SNPset The data frame imported by \code{ImportFromGATK}
344 | #' @param refAlleleFreq A numeric < 1. This will filter out SNPs with a
345 | #'   Reference Allele Frequency less than \code{refAlleleFreq} or greater than
346 | #'   1 - \code{refAlleleFreq}. E.g. \code{refAlleleFreq = 0.3} will keep SNPs
347 | #'   with 0.3 <= REF_FRQ <= 0.7
348 | #' @param filterAroundMedianDepth Filters total SNP read depth for both bulks. A
349 | #'   median and median absolute deviation (MAD) of depth will be calculated.
350 | #'   SNPs with read depth more than \code{filterAroundMedianDepth}
351 | #'   MADs above or below the median will be filtered out.
352 | #' @param minTotalDepth The minimum total read depth for a SNP (counting both
353 | #'   bulks)
354 | #' @param maxTotalDepth The maximum total read depth for a SNP (counting both
355 | #'   bulks)
356 | #' @param minSampleDepth The minimum read depth for a SNP in each bulk
357 | #' @param depthDifference The maximum absolute difference in read depth between the bulks.
358 | #' @param minGQ The minimum Genotype Quality as set by GATK. This is a measure
359 | #'   of how confident GATK was with the assigned genotype (i.e. homozygous ref,
360 | #'   heterozygous, homozygous alt). See
361 | #'   \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What
362 | #'   is a VCF and how should I interpret it?}
363 | #' @param verbose logical. If \code{TRUE} will report number of SNPs filtered in
364 | #'   each step.
365 | #' @return Returns a subset of the data frame supplied which meets the filtering
366 | #'   conditions applied by the selected parameters. If \code{verbose} is
367 | #'   \code{TRUE} the function reports the number of SNPs filtered in each step
368 | #'   as well as the initial number of SNPs, the total number of SNPs filtered
369 | #'   and the remaining number.
370 | #'
371 | #' @seealso See \code{\link[stats]{mad}} for an explanation of how the
372 | #'   median absolute deviation is calculated. See
373 | #'   \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What
374 | #'   is a VCF and how should I interpret it?} for more information on GATK
375 | #'   Fields and Genotype Fields
376 | #' @examples df_filt <- filterSNPs(
377 | #'     df,
378 | #'     refAlleleFreq = 0.3,
379 | #'     minTotalDepth = 40,
380 | #'     maxTotalDepth = 80,
381 | #'     minSampleDepth = 20,
382 | #'     minGQ = 99,
383 | #'     verbose = TRUE
384 | #' )
385 | #'
386 | #' @export filterSNPs
387 | 
388 | filterSNPs <- function(SNPset,
389 |     refAlleleFreq,
390 |     filterAroundMedianDepth,
391 |     minTotalDepth,
392 |     maxTotalDepth,
393 |     minSampleDepth,
394 |     depthDifference,
395 |     minGQ,
396 |     verbose = TRUE) {
397 |     
398 |     org_count <- nrow(SNPset)
399 |     count <- nrow(SNPset)
400 |     
401 |     # Filter by total reference allele frequency
402 |     if (!missing(refAlleleFreq)) {
403 |         if (verbose) {
404 |             message(
405 |                 "Filtering by reference allele frequency: ",
406 |                 refAlleleFreq,
407 |                 " <= REF_FRQ <= ",
408 |                 1 - refAlleleFreq
409 |             )
410 |         }
411 |         SNPset <- dplyr::filter(SNPset, REF_FRQ <= 1 - refAlleleFreq &
412 |                 REF_FRQ >= refAlleleFreq)
413 |         if (verbose) {
414 |             message("...Filtered ", count - nrow(SNPset), " SNPs")
415 |         }
416 |         count <- nrow(SNPset)
417 |     }
418 |     
419 |     #Total read depth filtering
420 |     
421 |     if (!missing(filterAroundMedianDepth)) {
422 |         # Keep SNPs whose total read depth lies within filterAroundMedianDepth MADs of the median
423 |         madDP <-
424 |             mad(
425 |                 x = (SNPset$DP.HIGH + SNPset$DP.LOW),
426 |                 constant = 1,
427 |                 na.rm = TRUE
428 |             )
429 |         medianDP <-
430 |             median(x = (SNPset$DP.HIGH + SNPset$DP.LOW),
431 |                 na.rm = TRUE)
432 |         maxDP <- medianDP + filterAroundMedianDepth * madDP
433 |         minDP <- medianDP - filterAroundMedianDepth * madDP
434 |         SNPset <- dplyr::filter(SNPset, (DP.HIGH + DP.LOW) <= maxDP &
435 |                 (DP.HIGH + DP.LOW) >= minDP)
436 | 
437 |         if (verbose) {message("Filtering by total read depth: ",
438 |             filterAroundMedianDepth,
439 |             " MADs around the median: ", minDP, " <= Total DP <= ", maxDP)
440 |             message("...Filtered ", count - nrow(SNPset), " SNPs")}
441 |         count <- nrow(SNPset)
442 |         
443 |     }
444 |     
445 |     if (!missing(minTotalDepth)) {
446 |         # Filter by minimum total SNP depth
447 |         if (verbose) {
448 |             message("Filtering by total sample read depth: Total DP >= ",
449 |                 minTotalDepth)
450 |         }
451 |         SNPset <-
452 |             dplyr::filter(SNPset, (DP.HIGH + DP.LOW) >= minTotalDepth)
453 | 
454 |         if (verbose) {
455 |             message("...Filtered ", count - nrow(SNPset), " SNPs")
456 |         }
457 |         count <- nrow(SNPset)
458 |     }
459 |     
460 |     if (!missing(maxTotalDepth)) {
461 |         # Filter by maximum total SNP depth
462 |         if (verbose) {
463 |             message("Filtering by total sample read depth: Total DP <= ",
464 |                 maxTotalDepth)
465 |         }
466 |         SNPset <-
467 |             dplyr::filter(SNPset, (DP.HIGH + DP.LOW) <= maxTotalDepth)
468 |         if (verbose) {
469 |             message("...Filtered ", count - nrow(SNPset), " SNPs")
470 |         }
471 |         count <- nrow(SNPset)
472 |     }
473 |     
474 |     
475 |     # Filter by minimum read depth in each sample
476 |     if (!missing(minSampleDepth)) {
477 |         if (verbose) {
478 |             message("Filtering by per sample read depth: DP >= ",
479 |                 minSampleDepth)
480 |         }
481 |         SNPset <-
482 |             dplyr::filter(SNPset, DP.HIGH >= minSampleDepth &
483 |                     DP.LOW >= minSampleDepth)
484 |         if (verbose) {
485 |             message("...Filtered ", count - nrow(SNPset), " SNPs")
486 |         }
487 |         count <- nrow(SNPset)
488 |     }
489 |     
490 |     # Filter by Genotype Quality
491 |     if (!missing(minGQ)) {
492 |         if (all(c("GQ.LOW", "GQ.HIGH") %in% names(SNPset))) {
493 |             if (verbose) {
494 |                 message("Filtering by Genotype Quality: GQ >= ", minGQ)
495 |             }
496 |             SNPset <-
497 |                 dplyr::filter(SNPset, GQ.LOW >= minGQ & GQ.HIGH >= minGQ)
498 |             if (verbose) {
499 |                 message("...Filtered ", count - nrow(SNPset), " SNPs")
500 |             }
501 |             count <- nrow(SNPset)
502 |         } else {
503 |             message("GQ columns not found. Skipping...")
504 |         }
505 |     }
506 |     
507 |     # Filter by difference between high and low bulks
508 |     if (!missing(depthDifference)) {
509 |         if (verbose) {
510 |             message("Filtering by difference between bulks <= ",
511 |                     depthDifference)
512 |         }
513 |         SNPset <-
514 |             dplyr::filter(SNPset, abs(DP.HIGH - DP.LOW) <= depthDifference)
515 |         if (verbose) {
516 |             message("...Filtered ", count - nrow(SNPset), " SNPs")
517 |         }
518 |         count <- nrow(SNPset)
519 |     }
520 |     
521 |     # #Filter SNP Clusters
522 |     # if (!is.null(SNPsInCluster) & !is.null(ClusterWin)) {
523 |     #     tmp <- which(diff(SNPset$POS, SNPsInCluster-1) < ClusterWin)
524 |     # message("...Filtered ", count - nrow(SNPset), " SNPs")
525 |     # count <- nrow(SNPset)
526 |     # }
527 |     if (verbose) {
528 |         message(
529 |             "Original SNP number: ",
530 |             org_count,
531 |             ", Filtered: ",
532 |             org_count - count,
533 |             ", Remaining: ",
534 |             count
535 |         )
536 |     }
537 |     return(as.data.frame(SNPset))
538 | }
539 | 
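
The median/MAD depth window used by `filterAroundMedianDepth` can be checked by hand on a toy vector. The following is a minimal base-R sketch, not part of the package: the depth values and the names `total_dp` and `n_mads` are invented for illustration, and only `stats::mad()` and `stats::median()` are assumed.

```r
# Toy total read depths (DP.HIGH + DP.LOW) for eight SNPs; values are made up.
total_dp <- c(60, 62, 58, 61, 59, 150, 10, 60)

# Hypothetical stand-in for the filterAroundMedianDepth argument.
n_mads <- 3

# constant = 1 returns the raw MAD, without the normal-consistency scaling.
mad_dp <- mad(total_dp, constant = 1, na.rm = TRUE)
median_dp <- median(total_dp, na.rm = TRUE)

# Depth window: median +/- n_mads * MAD.
min_dp <- median_dp - n_mads * mad_dp
max_dp <- median_dp + n_mads * mad_dp

# Keep only SNPs whose total depth falls inside the window (bounds inclusive).
keep <- total_dp >= min_dp & total_dp <= max_dp
kept_dp <- total_dp[keep]
print(kept_dp)
```

Because the MAD is robust to outliers, the two extreme depths (150 and 10) barely affect the window and are the only values dropped here.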


--------------------------------------------------------------------------------
/R/RcppExports.R:
--------------------------------------------------------------------------------
 1 | # Generated by using Rcpp::compileAttributes() -> do not edit by hand
 2 | # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
 3 | 
 4 | #' Count number of SNPs within a sliding window
 5 | #' 
 6 | #' For each SNP returns how many SNPs are bracketing it within the set window size
 7 | #' 
 8 | #' @param POS A numeric vector of genomic positions for each SNP
 9 | #' @param windowSize The required window size
10 | #' @export countSNPs_cpp
11 | countSNPs_cpp <- function(POS, windowSize) {
12 |     .Call('_QTLseqr_countSNPs_cpp', PACKAGE = 'QTLseqr', POS, windowSize)
13 | }
14 | 
15 | 


--------------------------------------------------------------------------------
/R/export_functions.R:
--------------------------------------------------------------------------------
  1 | #' Return SNPs in significant regions
  2 | #'
  3 | #' The function takes a SNP set after calculation of p- and q-values or Takagi confidence intervals and returns
  4 | #' a list containing all SNPs with q-values below a set alpha or with deltaSNP values beyond the confidence intervals, respectively. Each entry in the list
  5 | #' is a SNP set data frame covering one contiguous region in which the SNPs pass
  6 | #' the chosen threshold.
  7 | #'
  8 | #' @param SNPset Data frame SNP set containing previously filtered SNPs.
  9 | #' @param method either "Gprime" or "QTLseq". The method for detecting significant regions.
 10 | #' @param alpha numeric. The required false discovery rate alpha for use with \code{method = "Gprime"}
 11 | #' @param interval numeric. For use with \code{method = "QTLseq"}. The Takagi based confidence interval requested. This will find the column named "CI_\*\*", where \*\* is the requested interval, e.g. 99.
 12 | #'
 13 | #' @export getSigRegions
 14 | 
 15 | getSigRegions <-
 16 |     function(SNPset,
 17 |              method = "Gprime",
 18 |              alpha = 0.05,
 19 |              interval = 99)
 20 |     {
 21 |         conf <- paste0("CI_", interval)
 22 |         
 23 |         if (!method %in% c("Gprime", "QTLseq")) {
 24 |             stop("method must be either \"Gprime\" or \"QTLseq\"")
 25 |         }
 26 |         
 27 |         if ((method == "Gprime") & !("qvalue" %in% colnames(SNPset))) {
 28 |             stop("Please first use runGprimeAnalysis to calculate q-values")
 29 |         }
 30 |         
 31 |         if ((method == "QTLseq") & !(any(names(SNPset) %in% conf))) {
 32 |             stop(
 33 |                 "Can't find the requested confidence interval. Please check that the requested interval exists or first use runQTLseqAnalysis to calculate confidence intervals"
 34 |             )
 35 |         }
 36 |         
 37 |         #QTL <- getSigRegions(SNPset = SNPset, method = method, interval = interval, alpha = alpha)
 38 |         fdrT <- getFDRThreshold(SNPset$pvalue, alpha = alpha)
 39 |         GprimeT <- SNPset[which(SNPset$pvalue == fdrT), "Gprime"]
 40 |         #merged_QTL <- dplyr::bind_rows(QTL, .id = "id")
 41 |         SNPset <- SNPset %>%
 42 |             dplyr::group_by(CHROM)
 43 |         
 44 |         if (method == "QTLseq") {
 45 |             qtltable <-
 46 |                 SNPset %>% dplyr::mutate(passThresh = abs(tricubeDeltaSNP) > abs(!!as.name(conf))) %>%
 47 |                 dplyr::group_by(CHROM, run = {
 48 |                     run = rle(passThresh)
 49 |                     rep(seq_along(run$lengths), run$lengths)
 50 |                 }) %>%
 51 |                 dplyr::filter(passThresh == TRUE) %>% dplyr::ungroup() %>%
 52 |                 dplyr::group_by(CHROM) %>% dplyr::group_by(CHROM, qtl = {
 53 |                     qtl = rep(seq_along(rle(run)$lengths), rle(run)$lengths)
 54 |                 }) %>%
 55 |                 #dont need run variable anymore
 56 |                 dplyr::select(-run,-qtl,-passThresh)
 57 |         } else {
 58 |             qtltable <- SNPset %>% dplyr::mutate(passThresh = qvalue <= alpha) %>%
 59 |                 dplyr::group_by(CHROM, run = {
 60 |                     run = rle(passThresh)
 61 |                     rep(seq_along(run$lengths), run$lengths)
 62 |                 }) %>%
 63 |                 dplyr::filter(passThresh == TRUE) %>% dplyr::ungroup() %>%
 64 |                 dplyr::group_by(CHROM) %>% dplyr::group_by(CHROM, qtl = {
 65 |                     qtl = rep(seq_along(rle(run)$lengths), rle(run)$lengths)
 66 |                 }) %>%
 67 |                 #dont need run variable anymore
 68 |                 dplyr::select(-run,-qtl,-passThresh)
 69 |         }
 70 |         
 71 |         qtltable <- as.data.frame(qtltable)
 72 |         qtlList <-
 73 |             split(qtltable, factor(
 74 |                 paste(qtltable$CHROM, qtltable$qtl, sep = "_"),
 75 |                 levels = gtools::mixedsort(unique(
 76 |                     paste(qtltable$CHROM, qtltable$qtl, sep = "_")
 77 |                 ))
 78 |             ))
 79 |         
 80 |         return(qtlList)
 81 |     }
 82 | 
 83 | 
 84 | #' Export a summarized table of QTL
 85 | #' @param SNPset Data frame SNP set containing previously filtered SNPs
 86 | #' @param method either "Gprime" or "QTLseq". The method for detecting significant regions.
 87 | #' @param alpha numeric. The required false discovery rate alpha for use with \code{method = "Gprime"}
 88 | #' @param interval numeric. For use with \code{method = "QTLseq"}. The Takagi based confidence interval requested. This will find the column named "CI_\*\*", where \*\* is the requested interval, e.g. 99.
 89 | #' @param export logical. If TRUE will export a csv table.
 90 | #' @param fileName either a character string naming a file or a connection open for writing. "" indicates output to the console.
 91 | #'
 92 | #' @return Returns a summarized table of QTL identified. The table contains the following columns:
 93 | #' \itemize{
 94 | #' \item{id - the QTL identification number}
 95 | #' \item{chromosome - The chromosome on which the region was identified}
 96 | #' \item{start - the start position on that chromosome, i.e. the position of the first SNP that passes the FDR threshold}
 97 | #' \item{end - the end position}
 98 | #' \item{length - the length in basepairs from start to end of the region}
 99 | #' \item{nSNPs - the number of SNPs in the region}
100 | #' \item{avgSNPs_Mb - the average number of SNPs/Mb within that region}
101 | #' \item{peakDeltaSNP - the tricube-smoothed deltaSNP-index value at the peak summit}
102 | #' \item{posPeakDeltaSNP - the position of the absolute maximum tricube-smoothed deltaSNP-index}
103 | #' \item{maxGprime - the max G' score in the region}
104 | #' \item{posMaxGprime - the genomic position of the maximum G' value in the QTL}
105 | #' \item{meanGprime - the average G' score of that region}
106 | #' \item{sdGprime - the standard deviation of G' within the region}
107 | #' \item{AUCaT - the Area Under the Curve but above the Threshold line, an indicator of how significant or wide the peak is}
108 | #' \item{meanPval - the average p-value in the region}
109 | #' \item{meanQval - the average adjusted p-value in the region}
110 | #'}
111 | #' @export getQTLTable
112 | #'
113 | #' @importFrom dplyr %>%
114 | 
115 | getQTLTable <-
116 |     function(SNPset,
117 |              method = "Gprime",
118 |              alpha = 0.05,
119 |              interval = 99,
120 |              export = FALSE,
121 |              fileName = "QTL.csv")
122 |     {
123 |         conf <- paste0("CI_", interval)
124 |         
125 |         if (!method %in% c("Gprime", "QTLseq")) {
126 |             stop("method must be either \"Gprime\" or \"QTLseq\"")
127 |         }
128 |         
129 |         if ((method == "Gprime") &
130 |             !("qvalue" %in% colnames(SNPset))) {
131 |             stop("Please first use runGprimeAnalysis to calculate q-values")
132 |         }
133 |         
134 |         if ((method == "QTLseq") & !(any(names(SNPset) %in% conf))) {
135 |             stop(
136 |                 "Can't find the requested confidence interval. Please check that the requested interval exists or first use runQTLseqAnalysis to calculate confidence intervals"
137 |             )
138 |         }
139 |         
140 |         #QTL <- getSigRegions(SNPset = SNPset, method = method, interval = interval, alpha = alpha)
141 |         fdrT <- getFDRThreshold(SNPset$pvalue, alpha = alpha)
142 |         GprimeT <- SNPset[which(SNPset$pvalue == fdrT), "Gprime"]
143 |         #merged_QTL <- dplyr::bind_rows(QTL, .id = "id")
144 |         SNPset <- SNPset %>%
145 |             dplyr::group_by(CHROM)
146 |         
147 |         if (method == "QTLseq") {
148 |             qtltable <-
149 |                 SNPset %>% 
150 |                 dplyr::mutate(passThresh = abs(tricubeDeltaSNP) > abs(!!as.name(conf))) %>%
151 |                 dplyr::group_by(CHROM, run = {
152 |                     run = rle(passThresh)
153 |                     rep(seq_along(run$lengths), run$lengths)
154 |                 }) %>%
155 |                 dplyr::filter(passThresh == TRUE) %>% 
156 |                 dplyr::ungroup() %>%
157 |                 dplyr::group_by(CHROM) %>% 
158 |                 dplyr::group_by(CHROM, qtl = {
159 |                     qtl = rep(seq_along(rle(run)$lengths), rle(run)$lengths)
160 |                 }) %>%
161 |                 #dont need run variable anymore
162 |                 dplyr::select(-run) %>%
163 |                 dplyr::summarize(
164 |                     start = min(POS),
165 |                     end = max(POS),
166 |                     length = end - start,
167 |                     nSNPs = length(POS),
168 |                     avgSNPs_Mb = round(length(POS) / (max(POS) - min(POS)) * 1e6),
169 |                     peakDeltaSNP = ifelse(
170 |                         mean(tricubeDeltaSNP) >= 0,
171 |                         max(tricubeDeltaSNP),
172 |                         min(tricubeDeltaSNP)
173 |                     ),
174 |                     posPeakDeltaSNP = POS[which.max(abs(tricubeDeltaSNP))],
175 |                     avgDeltaSNP = mean(tricubeDeltaSNP)
176 |                 )
177 |         } else {
178 |             qtltable <- SNPset %>% dplyr::mutate(passThresh = qvalue <= alpha) %>%
179 |                 dplyr::group_by(CHROM, run = {
180 |                     run = rle(passThresh)
181 |                     rep(seq_along(run$lengths), run$lengths)
182 |                 }) %>%
183 |                 dplyr::filter(passThresh == TRUE) %>% dplyr::ungroup() %>%
184 |                 dplyr::group_by(CHROM) %>% dplyr::group_by(CHROM, qtl = {
185 |                     qtl = rep(seq_along(rle(run)$lengths), rle(run)$lengths)
186 |                 }) %>%
187 |                 #dont need run variable anymore
188 |                 dplyr::select(-run) %>%
189 |                 dplyr::summarize(
190 |                     start = min(POS),
191 |                     end = max(POS),
192 |                     length = end - start,
193 |                     nSNPs = length(POS),
194 |                     avgSNPs_Mb = round(length(POS) / (max(POS) - min(POS)) * 1e6),
195 |                     peakDeltaSNP = ifelse(
196 |                         mean(tricubeDeltaSNP) >= 0,
197 |                         max(tricubeDeltaSNP),
198 |                         min(tricubeDeltaSNP)
199 |                     ),
200 |                     posPeakDeltaSNP = POS[which.max(abs(tricubeDeltaSNP))],
201 |                     avgDeltaSNP = mean(tricubeDeltaSNP),
202 |                     #Gprime stuff
203 |                     maxGprime = max(Gprime),
204 |                     posMaxGprime = POS[which.max(Gprime)],
205 |                     meanGprime = mean(Gprime),
206 |                     sdGprime = sd(Gprime),
207 |                     AUCaT = sum(diff(POS) * (((head(Gprime, -1) + tail(Gprime, -1)) / 2
208 |                     ) - GprimeT)),
209 |                     meanPval = mean(pvalue),
210 |                     meanQval = mean(qvalue)
211 |                 )
212 |         }
213 |         
214 |         qtltable <- as.data.frame(qtltable)
215 |         
216 |         if (export) {
217 |             write.csv(file = fileName,
218 |                       x = qtltable,
219 |                       row.names = FALSE)
220 |         }
221 |         return(qtltable)
222 |     }
223 | 
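
Both `getSigRegions` and `getQTLTable` rely on run-length encoding to turn a per-SNP pass/fail flag into contiguous regions. The idea can be sketched in base R for a single chromosome as follows. This is a simplified illustration rather than the package's exact pipeline: the vectors `pass` and `pos` are invented, and the region numbering uses `match()` instead of a second `rle()` call.

```r
# Made-up per-SNP threshold flags and positions on one chromosome.
pass <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
pos <- seq(100, 1000, by = 100)

# Label every run of identical flags with a running index.
runs <- rle(pass)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Keep only SNPs that pass the threshold, together with their run labels.
sig_pos <- pos[pass]
sig_run <- run_id[pass]

# Renumber the surviving runs 1, 2, 3, ... so each is one candidate region.
region <- match(sig_run, unique(sig_run))

# One list entry per contiguous region, holding the SNP positions in it.
regions <- split(sig_pos, region)
print(regions)
```

Summaries such as `start`, `end`, or `nSNPs` then reduce to `min()`, `max()`, and `length()` applied to each list entry.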


--------------------------------------------------------------------------------
/R/format_genomic.R:
--------------------------------------------------------------------------------
 1 | format_genomic <- function(...) {
 2 |       # Format a vector of numeric values according
 3 |       # to the International System of Units.
 4 |       # http://en.wikipedia.org/wiki/SI_prefix
 5 |       #
 6 |       # Based on code by Ben Tupper
 7 |       # https://stat.ethz.ch/pipermail/r-help/2012-January/299804.html
 8 |       # Args:
 9 |       #   ...: Args passed to format()
10 |       #
11 |       # Returns:
12 |       #   A function that formats a numeric vector of positions,
13 |       #   scaling each value by 1, 1e3 or 1e6
14 |       #
15 | 
16 |       function(x) {
17 |             limits <- c(1e0,   1e3, 1e6)
18 |             #prefix <- c("","Kb","Mb")
19 | 
20 |             # Vector with array indices according to position in intervals
21 |             i <- findInterval(abs(x), limits)
22 | 
23 |             # Values with abs(x) < 1 fall below the first limit; leave them unscaled
24 |             i <- ifelse(i==0, which(limits == 1e0), i)
25 | 
26 |             paste(format(round(x/limits[i], 1),
27 |                          trim=TRUE, scientific=FALSE, ...)
28 |                 #  ,prefix[i]
29 |             )
30 |       }
31 | }
32 | 
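
The scaling logic inside `format_genomic` can be tried out in isolation. The sketch below restates it as a stand-alone helper so it runs without the package; the name `scale_genomic` and the input values are hypothetical, and the extra `...` formatting arguments are omitted.

```r
# Hypothetical stand-alone version of the closure returned by format_genomic().
scale_genomic <- function(x) {
    limits <- c(1e0, 1e3, 1e6)

    # Index of the largest limit that does not exceed abs(x); 0 if abs(x) < 1.
    i <- findInterval(abs(x), limits)

    # Values below 1 are left unscaled (divided by 1).
    i <- ifelse(i == 0, 1, i)

    # Divide by the matching limit and round to one decimal place.
    format(round(x / limits[i], 1), trim = TRUE, scientific = FALSE)
}

labels <- scale_genomic(c(500, 2500, 3e6))
print(labels)
```

Since the unit suffix is commented out in the original function, a label no longer says whether it is in bp, Kb, or Mb. This is unambiguous only when all axis breaks fall in the same range.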


--------------------------------------------------------------------------------
/R/onunload.R:
--------------------------------------------------------------------------------
1 | .onUnload <- function (libpath) {
2 |     library.dynam.unload("QTLseqr", libpath)
3 | }


--------------------------------------------------------------------------------
/R/plotting_functions.R:
--------------------------------------------------------------------------------
  1 | #' Plots different parameters for QTL identification
  2 | #'
  3 | #' A wrapper for ggplot to plot genome wide distribution of parameters used to
  4 | #' identify QTL.
  5 | #'
  6 | #' @param SNPset a data frame with SNPs and genotype fields as imported by
  7 | #'   \code{ImportFromGATK} and after running \code{runGprimeAnalysis} or \code{runQTLseqAnalysis}
  8 | #' @param subset a vector of chromosome names for use in quick plotting of
  9 | #'   chromosomes of interest. Defaults to
 10 | #'   NULL and will plot all chromosomes in the SNPset
 11 | #' @param var character. The parameter to plot. Must be one of: "nSNPs",
 12 | #'   "deltaSNP", "Gprime", "negLog10Pval"
 13 | #' @param scaleChroms boolean. If TRUE (default) then chromosome facets will be
 14 | #'   scaled to relative chromosome sizes. If FALSE all facets will be equal
 15 | #'   sizes. This is basically a convenience argument for setting both scales and
 16 | #'   space as "free_x" in ggplot2::facet_grid.
 17 | #' @param line boolean. If TRUE will plot line graph. If FALSE will plot points.
 18 | #'   Plotting points will take more time.
 19 | #' @param plotThreshold boolean. Whether to plot the False Discovery Rate
 20 | #'   (FDR) threshold. Only plots the line if var is "Gprime" or "negLog10Pval".
 21 | #' @param plotIntervals boolean. Whether or not to plot the two-sided Takagi confidence intervals in "deltaSNP" plots.
 22 | #' @param q numeric. The q-value to use as the FDR threshold. If too low, no
 23 | #'   line will be drawn and a warning will be given.
 24 | #' @param ... arguments to pass to ggplot2::geom_line or ggplot2::geom_point for
 25 | #'   changing colors etc.
 26 | #'
 27 | #' @return Plots a ggplot graph for all chromosomes or those requested in
 28 | #'   \code{subset}. By setting \code{var} to "nSNPs" the distribution of SNPs
 29 | #'   used to calculate G' will be plotted. "deltaSNP" will plot a tri-cube
 30 | #'   weighted delta SNP-index for each SNP. "Gprime" will plot the tri-cube
 31 | #'   weighted G' value. Setting "negLog10Pval" will plot the -log10 of the p-value
 32 | #'   at each SNP. In "Gprime" and "negLog10Pval" plots, a genome wide FDR threshold of
 33 | #'   q can be drawn by setting "plotThreshold" to TRUE. The default is a red
 34 | #'   line. If you would like to plot a different line we suggest setting
 35 | #'   "plotThreshold" to FALSE and manually adding a line using
 36 | #'   ggplot2::geom_hline.
 37 | #'
 38 | #' @examples p <- plotQTLStats(df_filt_6Mb, var = "Gprime", plotThreshold = TRUE, q = 0.01, subset = c("Chr3","Chr4"))
 39 | #' @export plotQTLStats
 40 | 
 41 | plotQTLStats <-
 42 |     function(SNPset,
 43 |         subset = NULL,
 44 |         var = "nSNPs",
 45 |         scaleChroms = TRUE,
 46 |         line = TRUE,
 47 |         plotThreshold = FALSE,
 48 |         plotIntervals = FALSE,
 49 |         q = 0.05,
 50 |         ...) {
 51 |         
 52 |         #get fdr threshold by ordering snps by pval then getting the last pval
 53 |         #with a qval < q
 54 |         
 55 |         if (!all(subset %in% unique(SNPset$CHROM))) {
 56 |             whichnot <-
 57 |                 paste(subset[base::which(!subset %in% unique(SNPset$CHROM))], collapse = ', ')
 58 |             stop(paste0("The following are not true chromosome names: ", whichnot))
 59 |         }
 60 |         
 61 |         if (!var %in% c("nSNPs", "deltaSNP", "Gprime", "negLog10Pval"))
 62 |             stop(
 63 |                 "Please choose one of the following variables to plot: \"nSNPs\", \"deltaSNP\", \"Gprime\", \"negLog10Pval\""
 64 |             )
 65 |         
 66 |         #don't plot threshold lines in deltaSNP or nSNPs plots as they are not relevant
 67 |         if ((plotThreshold == TRUE &&
 68 |                 var == "deltaSNP") ||
 69 |                 (plotThreshold == TRUE && var == "nSNPs")) {
 70 |             message("FDR threshold is not plotted in deltaSNP or nSNPs plots")
 71 |             plotThreshold <- FALSE
 72 |         }
 73 |         #if you need to plot threshold get the FDR, but check if there are any values that pass fdr
 74 |         
 75 |         GprimeT <- 0
 76 |         logFdrT <- 0
 77 |         
 78 |         if (plotThreshold == TRUE) {
 79 |             fdrT <- getFDRThreshold(SNPset$pvalue, alpha = q)
 80 |             
 81 |             if (is.na(fdrT)) {
 82 |                 warning("The q threshold is too low. No threshold line will be drawn")
 83 |                 plotThreshold <- FALSE
 84 |                 
 85 |             } else {
 86 |                 logFdrT <- -log10(fdrT)
 87 |                 GprimeT <- SNPset[which(SNPset$pvalue == fdrT), "Gprime"]
 88 |             }
 89 |         }
 90 |         
 91 |         SNPset <-
 92 |             if (is.null(subset)) {
 93 |                 SNPset
 94 |             } else {
 95 |                 SNPset[SNPset$CHROM %in% subset,]
 96 |             }
 97 |         
 98 |         p <- ggplot2::ggplot(data = SNPset) +
 99 |             ggplot2::scale_x_continuous(breaks = seq(from = 0,to = max(SNPset$POS), by = 10^(floor(log10(max(SNPset$POS))))), labels = format_genomic(), name = "Genomic Position (Mb)") +
100 |             ggplot2::theme(plot.margin = ggplot2::margin(
101 |                 b = 10,
102 |                 l = 20,
103 |                 r = 20,
104 |                 unit = "pt"
105 |             ))
106 |         
107 |         if (var == "Gprime") {
108 |             threshold <- GprimeT
109 |             p <- p + ggplot2::ylab("G' value")
110 |         }
111 |         
112 |         if (var == "negLog10Pval") {
113 |             threshold <- logFdrT
114 |             p <-
115 |                 p + ggplot2::ylab(expression("-" * log[10] * '(p-value)'))
116 |         }
117 |         
118 |         if (var == "nSNPs") {
119 |             p <- p + ggplot2::ylab("Number of SNPs in window")
120 |         }
121 |         
122 |         if (var == "deltaSNP") {
123 |             var <- "tricubeDeltaSNP"
124 |             p <-
125 |                 p + ggplot2::ylab(expression(Delta * '(SNP-index)')) +
126 |                 ggplot2::ylim(-0.55, 0.55) +
127 |                 ggplot2::geom_hline(yintercept = 0,
128 |                     color = "black",
129 |                     alpha = 0.4)
130 |             if (plotIntervals == TRUE) {
131 |                 
132 |                 ints_df <-
133 |                      dplyr::select(SNPset, CHROM, POS, dplyr::matches("CI_")) %>% tidyr::gather(key = "Interval", value = "value",-CHROM,-POS)
134 |                 
135 |                 p <- p + ggplot2::geom_line(data = ints_df, ggplot2::aes(x = POS, y = value, color = Interval)) +
136 |                     ggplot2::geom_line(data = ints_df, ggplot2::aes(
137 |                         x = POS,
138 |                         y = -value,
139 |                         color = Interval
140 |                     ))
141 |             }
142 |         }
143 |         
144 |         if (line) {
145 |             p <-
146 |                 p + ggplot2::geom_line(ggplot2::aes_string(x = "POS", y = var), ...)
147 |         }
148 |         
149 |         if (!line) {
150 |             p <-
151 |                 p + ggplot2::geom_point(ggplot2::aes_string(x = "POS", y = var), ...)
152 |         }
153 |         
154 |         if (plotThreshold == TRUE)
155 |             p <-
156 |             p + ggplot2::geom_hline(
157 |                 ggplot2::aes_string(yintercept = "threshold"),
158 |                 color = "red",
159 |                 size = 1,
160 |                 alpha = 0.4
161 |             )
162 |         
163 |         if (scaleChroms == TRUE) {
164 |            p <- p + ggplot2::facet_grid(~ CHROM, scales = "free_x", space = "free_x")
165 |         } else {
166 |            p <- p + ggplot2::facet_grid(~ CHROM, scales = "free_x")    
167 |         }
168 |         
169 |         p
170 |         
171 |     }
172 | 
173 | 
174 | #' Plots Gprime distribution
175 | #'
176 | #' Plots a ggplot histogram of the distribution of Gprime with a log normal
177 | #' distribution overlay
178 | #'
179 | #' @param SNPset a data frame with SNPs and genotype fields as imported by
180 | #'   \code{ImportFromGATK} and after running \code{runGprimeAnalysis}
181 | #' @param outlierFilter one of either "deltaSNP" or "Hampel". Method for
182 | #'   filtering outlier (i.e. QTL) regions for p-value estimation
183 | #' @param filterThreshold The absolute delta SNP index to use to filter out
184 | #'   putative QTL (default = 0.1)
185 | #' @param binwidth The binwidth for the histogram. Recommended and default = 0.5
186 | #'
187 | #' @return Plots a ggplot histogram of the G' value distribution. The raw data
188 | #'   as well as the filtered G' values (excluding putative QTL) are plotted. It
189 | #'   will then overlay an estimated log normal distribution with the same mean
190 | #'   and variance as the null G' distribution. This lets you verify whether,
191 | #'   after filtering, your G' values appear to be close to log normally
192 | #'   distributed and can thus be used to estimate p-values using the
193 | #'   non-parametric method described in Magwene et al. (2011). Briefly, using
194 | #'   the natural log of Gprime, a median absolute deviation (MAD) is calculated.
195 | #'   The Gprime set is trimmed to exclude outlier regions (i.e. QTL) based on
196 | #'   Hampel's rule. An estimate of the mode of the trimmed set is calculated
197 | #'   using the \code{\link[modeest]{mlv}} function from the package modeest.
198 | #'   Finally, the mean and variance of the log normal distribution are estimated
199 | #'   from the median and mode of the trimmed set and used to plot the overlay.
200 | #'
201 | #' @examples plotGprimeDist(df_filt_6Mb, outlierFilter = "deltaSNP")
202 | #'
203 | #' @seealso \code{\link{getPvals}} for how p-values are calculated.
204 | #' @export plotGprimeDist
205 | 
206 | 
207 | plotGprimeDist <-
208 |     function(SNPset,
209 |         outlierFilter = c("deltaSNP", "Hampel"),
210 |         filterThreshold = 0.1,
211 |         binwidth = 0.5)
212 |     {
213 |         if (match.arg(outlierFilter) == "deltaSNP") {
214 |             trim_df <- SNPset[abs(SNPset$deltaSNP) < filterThreshold, ]
215 |             trimGprime <- trim_df$Gprime
216 |         } else {
217 |             # Non-parametric estimation of the null distribution of G'
218 |             
219 |             lnGprime <- log(SNPset$Gprime)
220 |             
221 |             # calculate the left median absolute deviation of ln(G')
222 |             MAD <-
223 |                 median(abs(lnGprime[lnGprime <= median(lnGprime)] - median(lnGprime)))
224 |             
225 |             # Trim the G prime set to exclude outlier regions (i.e. QTL) using Hampel's rule
226 |             trim_df <-
227 |                 SNPset[lnGprime - median(lnGprime) <= 5.2 * median(MAD),]
228 |             trimGprime <- trim_df$Gprime
229 |         }
230 |         medianTrimGprime <- median(trimGprime)
231 |         
232 |         # estimate the mode of the trimmed G' set using the half-sample method
233 |         modeTrimGprime <-
234 |             modeest::mlv(x = trimGprime, bw = 0.5, method = "hsm")[[1]]
235 |         
236 |         muE <- log(medianTrimGprime)
237 |         varE <- abs(muE - log(modeTrimGprime))
238 |         
239 |         n <- length(trim_df$Gprime)
240 |         bw <- binwidth
241 |         
242 |         #plot Gprime distribution
243 |         p <- ggplot2::ggplot(SNPset) +
244 |             ggplot2::xlim(0, max(SNPset$Gprime) + 1) +
245 |             ggplot2::xlab("G' value") +
246 |             ggplot2::geom_histogram(ggplot2::aes(x = Gprime, fill = "Raw Data"), binwidth = bw) +
247 |             ggplot2::geom_histogram(data = trim_df,
248 |                 ggplot2::aes(x = Gprime, fill = "After filtering"),
249 |                 binwidth = bw) +
250 |             ggplot2::stat_function(
251 |                 ggplot2::aes(color = "black"),
252 |                 size = 1,
253 |                 fun = function(x, mean, sd, n, bw) {
254 |                     dlnorm(x = x,
255 |                         meanlog = mean,
256 |                         sdlog = sd) * n * bw
257 |                 },
258 |                 args = c(
259 |                     mean = muE,
260 |                     sd = sqrt(varE),
261 |                     n = n,
262 |                     bw = bw
263 |                 )
264 |             ) +
265 |             
266 |             # ggplot2::stat_function(
267 |             #     fun = dlnorm * n,
268 |             #     size = 1,
269 |             #     args = c(meanlog = muE, sdlog = sqrt(varE)),
270 |             # ggplot2::aes(
271 |             #     color = paste0(
272 |             #         "Null distribution \n G' ~ lnN(",
273 |             #         round(muE, 2),
274 |             #         ",",
275 |             #         round(varE, 2),
276 |         #         ")"
277 |         #     )
278 |         #     )
279 |         # ) +
280 |         ggplot2::scale_fill_discrete(name = "Distribution") +
281 |             ggplot2::scale_colour_manual(name = "Null distribution" , values = "black", labels = as.expression(bquote(~theta["G'"]~" ~ lnN("*.(round(muE, 2))*","*.(round(varE, 2))*")")))  +
282 |             ggplot2::guides(fill = ggplot2::guide_legend(order = 1, reverse = TRUE))
283 |         
284 |         #ggplot2::annotate(x = 10, y = 0.325, geom="text",
285 |         #    label = paste0("G' ~ lnN(", round(muE, 2), ",",round(varE, 2), ")"),
286 |         #    color = "blue")
287 |         return(p)
288 |     }
289 | 
290 | 
291 | #' Plots simulation data for QTLseq analysis
292 | #'
293 | #' The method for simulating delta SNP-index confidence interval thresholds
294 | #' as described in Takagi et al., (2013). Genotypes are randomly assigned for
295 | #' each individual in the bulk, based on the population structure. The total
296 | #' alternative allele frequency in each bulk is calculated at each depth used to simulate
297 | #' delta SNP-indices, with a user defined number of bootstrapped replications.
298 | #' The requested confidence intervals are then calculated from the bootstraps.
299 | #' This function plots the simulated confidence intervals by the read depth.
300 | #'
301 | #' @param SNPset optional. Either supply your data set to extract read depths from or supply depth vector.
302 | #' @param popStruc the population structure. Defaults to "F2" and assumes "RIL" otherwise.
303 | #' @param bulkSize non-negative integer. The number of individuals in each bulk
304 | #' @param depth optional integer vector. A read depth for which to replicate SNP-index calls. If read depth is defined SNPset will be ignored.
305 | #' @param replications integer. The number of bootstrap replications.
306 | #' @param filter numeric. An optional minimum SNP-index filter
307 | #' @param intervals numeric vector. Confidence intervals supplied as two-sided percentiles, e.g. intervals = 95 will return the two-sided 95\% confidence interval, 2.5\% on each side.
308 | #'
309 | #' @return Plots a deltaSNP by depth plot. Helps if the user wants to know the delta SNP index needed to pass a certain CI at a specified depth.
310 | #'
311 | #' @export plotSimulatedThresholds
312 | #'
313 | #' @examples plotSimulatedThresholds(SNPset = NULL, popStruc = "F2", bulkSize = 25, depth = 1:150, replications = 10000, filter = 0.3, intervals = c(95, 99))
314 | 
315 | plotSimulatedThresholds <-
316 |     function(SNPset = NULL,
317 |              popStruc = "F2",
318 |              bulkSize,
319 |              depth = NULL,
320 |              replications = 10000,
321 |              filter = 0.3,
322 |              intervals = c(95, 99)) {
323 |         
324 |         if (is.null(depth)) {
325 |             if (!is.null(SNPset)) {
326 |                 message(
327 |                     "Variable 'depth' not defined, using min and max depth from data: ",
328 |                     min(SNPset$minDP),
329 |                     "-",
330 |                     max(SNPset$minDP)
331 |                 )
332 |                 depth <- min(SNPset$minDP):max(SNPset$minDP)
333 |             } else {
334 |                 stop("No SNPset or depth supplied")
335 |             }
336 |         }
337 |         
338 |         #convert intervals to quantiles
339 |         if (all(intervals >= 1)) {
340 |             message(
341 |                 "Returning the following two sided confidence intervals: ",
342 |                 paste(intervals, collapse = ", ")
343 |             )
344 |             quantiles <- (100 - intervals) / 200
345 |         } else {
346 |             stop(
347 |                 "Confidence intervals ('intervals' parameter) should be supplied as two-sided percentiles, e.g. intervals = 95 will return the two-sided 95% confidence interval, 2.5% on each side."
348 |             )
349 |         }
350 |         
351 |         CI <-
352 |             simulateConfInt(
353 |                 popStruc = popStruc,
354 |                 bulkSize = bulkSize,
355 |                 depth = depth,
356 |                 replications = replications,
357 |                 filter = filter,
358 |                 intervals = quantiles
359 |             )
360 |         
361 |         CI <-
362 |             tidyr::gather(CI, key = "Interval", value = "deltaSNP",-depth)
363 |         
364 |         ggplot2::ggplot(data = CI) + 
365 |             ggplot2::geom_line(ggplot2::aes(x = depth, y = deltaSNP, color = Interval)) +
366 |             ggplot2::geom_line(ggplot2::aes(x = depth, y = -deltaSNP,color = Interval))
367 |     }    
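
The Hampel trimming and log-normal fit described for `plotGprimeDist` above can be sketched on synthetic data. This is a minimal illustration, not the package's code: the G' values are made up, and the mode is taken from a kernel density peak (`stats::density`) as a stand-in for `modeest::mlv`. It relies on the log-normal identities median = exp(mu) and mode = exp(mu - sigma^2), so mu - log(mode) estimates the variance.

```r
set.seed(1)
# Synthetic G' values (made-up data): a log-normal null plus inflated "QTL" values
gprime <- c(rlnorm(5000, meanlog = 0.5, sdlog = 0.6), runif(100, 20, 40))

ln_g <- log(gprime)
med <- median(ln_g)

# Left median absolute deviation: uses only values at or below the median
left_mad <- median(abs(ln_g[ln_g <= med] - med))

# Hampel's rule: drop values more than 5.2 left-MADs above the median
keep <- (ln_g - med) <= 5.2 * left_mad
trimmed <- gprime[keep]

# Log-normal parameters from the median and mode of the trimmed set
mu_e <- log(median(trimmed))
d <- density(trimmed)              # assumption: density peak as the mode estimate
mode_e <- d$x[which.max(d$y)]
var_e <- abs(mu_e - log(mode_e))
```

Because the cutoff is built from the lower half of the data, a cluster of very large G' values has little influence on it, which is the point of using the left MAD.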


--------------------------------------------------------------------------------
/R/takagi_sim.R:
--------------------------------------------------------------------------------
  1 | # QTLseq simulation functions
  2 | 
  3 | 
  4 | #' Randomly calculates an alternate allele frequency within a bulk
  5 | #'
  6 | #' @param n non-negative integer. The number of individuals in each bulk
  7 | #' @param pop the population structure. Defaults to "F2" and assumes "RIL" 
  8 | #' population otherwise.
  9 | #'
 10 | #' @return an alternate allele frequency within the bulk. Used for simulating 
 11 | #' SNP-indices.
 12 | #'
 13 | simulateAlleleFreq <- function(n, pop = "F2") {
 14 |     if (pop == "F2") {
 15 |         mean(sample(
 16 |             x = c(0, 0.5, 1),
 17 |             size = n,
 18 |             prob = c(1, 2, 1),
 19 |             replace = TRUE
 20 |         ))
 21 |     } else {
 22 |         mean(sample(
 23 |             x = c(0, 1),
 24 |             size = n,
 25 |             prob = c(1, 1),
 26 |             replace = TRUE
 27 |         ))
 28 |     }
 29 | }
 30 | 
 31 | 
 32 | #' Simulates a delta SNP-index with replication
 33 | #'
 34 | #' @param depth integer. A read depth for which to replicate SNP-index calls.
 35 | #' @param altFreq1 numeric. The alternate allele frequency for bulk A. 
 36 | #' @param altFreq2 numeric. The alternate allele frequency for bulk B. 
 37 | #' @param replicates integer. The number of bootstrap replications.
 38 | #' @param filter numeric. an optional minimum SNP-index filter
 39 | #'
 40 | #' @return Returns a vector of up to \code{replicates} delta SNP-indices (fewer if \code{filter} removes some)
 41 | 
 42 | 
 43 | simulateSNPindex <-
 44 |     function(depth,
 45 |              altFreq1,
 46 |              altFreq2,
 47 |              replicates = 10000,
 48 |              filter = NULL) {
 49 |         
 50 |         SNPindex_H <- rbinom(replicates, size = depth, altFreq1) / depth
 51 |         SNPindex_L <- rbinom(replicates, size = depth, altFreq2) / depth
 52 |         deltaSNP <- SNPindex_H - SNPindex_L
 53 |         
 54 |         if (!is.null(filter)) {
 55 |             deltaSNP <- deltaSNP[SNPindex_H >= filter | SNPindex_L >= filter]
 56 |         }
 57 |         deltaSNP
 58 |     }
 59 | 
 60 | 
 61 | #' Simulation of delta SNP-index confidence intervals
 62 | #'
 63 | #' The method for simulating delta SNP-index confidence interval thresholds
 64 | #' as described in Takagi et al., (2013). Genotypes are randomly assigned for
 65 | #' each individual in the bulk, based on the population structure. The total
 66 | #' alternative allele frequency in each bulk is calculated at each depth used to simulate
 67 | #' delta SNP-indices, with a user defined number of bootstrapped replications.
 68 | #' The requested confidence intervals are then calculated from the bootstraps.
 69 | #'
 70 | #' @param popStruc the population structure. Defaults to "F2" and assumes "RIL" otherwise.
 71 | #' @param bulkSize non-negative integer vector. The number of individuals in
 72 | #'   each simulated bulk. Can be of length 1, then both bulks are set to the
 73 | #'   same size. Assumes the first value in the vector is the simulated high
 74 | #'   bulk.
 75 | #' @param depth integer. A read depth for which to replicate SNP-index calls.
 76 | #' @param replications integer. The number of bootstrap replications.
 77 | #' @param filter numeric. An optional minimum SNP-index filter
 78 | #' @param intervals numeric vector of probabilities with values in [0,1] 
 79 | #' corresponding to the requested confidence intervals 
 80 | #'
 81 | #' @return A data frame of delta SNP-index thresholds corresponding to the
 82 | #' requested confidence intervals at the user set depths.
 83 | #' @export simulateConfInt
 84 | #'
 85 | #' @examples CI <-
 86 | #' simulateConfInt(
 87 | #'    popStruc = "F2",
 88 | #'    bulkSize = 50,
 89 | #'    depth = 1:100,
 90 | #'    intervals = c(0.05, 0.95, 0.025, 0.975, 0.005, 0.995, 0.0025, 0.9975))
 91 |      
 92 | simulateConfInt <- function(popStruc = "F2",
 93 |                             bulkSize,
 94 |                             depth = 1:100,
 95 |                             replications = 10000,
 96 |                             filter = 0.3,
 97 |                             intervals = c(0.05, 0.025)) {
 98 |     if (popStruc == "F2") {
 99 |         message(
100 |             "Assuming bulks selected from F2 population, with ",
101 |             paste(bulkSize, collapse = " and "),
102 |             " individuals per bulk."
103 |         )
104 |     } else {
105 |         message(
106 |             "Assuming bulks selected from RIL population, with ",
107 |             paste(bulkSize, collapse = " and "),
108 |             " individuals per bulk."
109 |         )
110 |     }
111 |     
112 |     if (length(bulkSize) == 1) {
113 |         message("The 'bulkSize' argument is of length 1, setting number of individuals in both bulks to: ", bulkSize)
114 |         bulkSize[2] <- bulkSize[1]
115 |     }
116 |     
117 |     if (length(bulkSize) > 2) {
118 |         message("The 'bulkSize' argument is longer than 2. Using the first two values as the bulk sizes.")
119 |     }
120 |     
121 |     if (any(bulkSize < 0)) {
122 |         stop("Negative bulkSize values")
123 |     }
124 |     
125 |     #makes a vector of possible alt allele frequencies once. this is then sampled for each replicate
126 |     tmp_freq <-
127 |         replicate(n = replications * 10, simulateAlleleFreq(n = bulkSize[1], pop = popStruc))
128 |     
129 |     tmp_freq2 <-
130 |         replicate(n = replications * 10, simulateAlleleFreq(n = bulkSize[2], pop = popStruc))
131 |     
132 |     message(
133 |         paste0(
134 |             "Simulating ",
135 |             replications,
136 |             " SNPs with reads at each depth: ",
137 |             min(depth),
138 |             "-",
139 |             max(depth)
140 |         )
141 |     )
142 |     message(paste0(
143 |         "Keeping SNPs with >= ",
144 |         filter,
145 |         " SNP-index in at least one simulated bulk"
146 |     ))
147 |     
148 |     # tmp allele freqs are sampled to produce 'replications' probabilities. these
149 |     # are then used as altFreq probs to simulate SNP index values, per bulk.
150 |     CI <- sapply(
151 |         X = depth,
152 |         FUN = function(x)
153 |         {
154 |             quantile(
155 |                 x = simulateSNPindex(
156 |                     depth = x,
157 |                     altFreq1 = sample(
158 |                         x = tmp_freq,
159 |                         size = replications,
160 |                         replace = TRUE
161 |                     ),
162 |                     altFreq2 = sample(
163 |                         x = tmp_freq2,
164 |                         size = replications,
165 |                         replace = TRUE
166 |                     ),
167 |                     replicates = replications,
168 |                     filter = filter
169 |                 ),
170 |                 probs = intervals,
171 |                 names = TRUE
172 |             )
173 |         }
174 |     )
175 |     
176 |     CI <- as.data.frame(CI)
177 |     
178 |     if (length(CI) > 1) {
179 |         CI <- data.frame(t(CI))
180 |     }
181 |     
182 |     names(CI) <- paste0("CI_", 100 - (intervals * 200))
183 |     CI <- cbind(depth, CI)
184 |     
185 |     #to long format for easy plotting
186 |     # tidyr::gather(data = CI,
187 |     #     key = interval,
188 |     #     convert = TRUE,
189 |     #     value = SNPindex,-depth) %>%
190 |     #     dplyr::mutate(Confidence = factor(ifelse(
191 |     #         interval > 0.5,
192 |     #         paste0(round((1 - interval) * 200, digits = 1), "%"),
193 |     #         paste0((interval * 200), "%")
194 |     # )))
195 |     CI
196 | }
197 | 
198 | 
199 | #' Calculates delta SNP confidence intervals for QTLseq analysis
200 | #'
201 | #' The method for simulating delta SNP-index confidence interval thresholds
202 | #' as described in Takagi et al., (2013). Genotypes are randomly assigned for
203 | #' each individual in the bulk, based on the population structure. The total
204 | #' alternative allele frequency in each bulk is calculated at each depth used to simulate
205 | #' delta SNP-indices, with a user defined number of bootstrapped replications.
206 | #' The requested confidence intervals are then calculated from the bootstraps.
207 | #'
208 | #' @param SNPset The data frame imported by \code{ImportFromGATK} 
209 | #' @param windowSize the window size (in base pairs) bracketing each SNP for which to calculate the statistics.
210 | #' @param popStruc the population structure. Defaults to "F2" and assumes "RIL" otherwise
211 | #' @param bulkSize non-negative integer vector. The number of individuals in
212 | #'   each simulated bulk. Can be of length 1, then both bulks are set to the
213 | #'   same size. Assumes the first value in the vector is the simulated high
214 | #'   bulk.
215 | #' @param depth integer. A read depth for which to replicate SNP-index calls.
216 | #' @param replications integer. The number of bootstrap replications.
217 | #' @param filter numeric. An optional minimum SNP-index filter
218 | #' @param intervals numeric vector. Confidence intervals supplied as two-sided
219 | #'   percentiles. i.e. If intervals = '95' will return the two sided 95\%
220 | #'   confidence interval, 2.5\% on each side.
221 | #' @param ... Other arguments passed to \code{\link[locfit]{locfit}} and
222 | #'   subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful
223 | #'   in cases where you get "out of vertex space" warnings; set maxk higher
224 | #'   than the default 100. See \code{\link[locfit]{locfit.raw}}(). But if you
225 | #'   are getting that warning you should seriously consider increasing your
226 | #'   window size.
227 | #'
228 | #' @return A SNPset data frame with delta SNP-index thresholds corresponding to the
229 | #' requested confidence intervals matching the tricube smoothed depth at each SNP.
230 | #' @export runQTLseqAnalysis
231 | #'
232 | #' @examples df_filt <- runQTLseqAnalysis(
233 | #' SNPset = df_filt,
234 | #' bulkSize = c(25, 35),
235 | #' windowSize = 1e6,
236 | #' popStruc = "F2",
237 | #' replications = 10000,
238 | #' intervals = c(95, 99)
239 | #' )
240 | #' 
241 | 
242 | runQTLseqAnalysis <- function(SNPset,
243 |                               windowSize = 1e6,
244 |                               popStruc = "F2",
245 |                               bulkSize,
246 |                               depth = NULL,
247 |                               replications = 10000,
248 |                               filter = 0.3,
249 |                               intervals = c(95, 99),
250 |                               ...) {
251 |     
252 |     message("Counting SNPs in each window...")
253 |     SNPset <- SNPset %>%
254 |         dplyr::group_by(CHROM) %>%
255 |         dplyr::mutate(nSNPs = countSNPs_cpp(POS = POS, windowSize = windowSize))
256 |     
257 |     message("Calculating tricube smoothed delta SNP index...")
258 |     SNPset <- SNPset %>%
259 |         dplyr::mutate(tricubeDeltaSNP = tricubeStat(POS = POS, Stat = deltaSNP, windowSize, ...))
260 |     
261 |     #convert intervals to quantiles
262 |     if (all(intervals >= 1)) {
263 |         message(
264 |             "Returning the following two sided confidence intervals: ",
265 |             paste(intervals, collapse = ", ")
266 |         )
267 |         quantiles <- (100 - intervals) / 200
268 |     } else {
269 |         stop(
270 |             "Confidence intervals ('intervals' parameter) should be supplied as two-sided percentiles, e.g. intervals = 95 will return the two-sided 95% confidence interval, 2.5% on each side."
271 |         )
272 |     }
273 |     
274 |     #calculate min depth per snp between bulks
275 |     SNPset <-
276 |         SNPset %>%
277 |         dplyr::mutate(minDP = pmin(DP.LOW, DP.HIGH))
278 |     
279 |     SNPset <-
280 |         SNPset %>%
281 |         dplyr::group_by(CHROM) %>%
282 |         dplyr::mutate(tricubeDP = floor(tricubeStat(POS, minDP, windowSize = windowSize, ...)))
283 |     
284 |     if (is.null(depth)) {
285 |         message(
286 |             "Variable 'depth' not defined, using min and max depth from data: ",
287 |             min(SNPset$minDP),
288 |             "-",
289 |             max(SNPset$minDP)
290 |         )
291 |         depth <- min(SNPset$minDP):max(SNPset$minDP)
292 |     }
293 |     
294 |     #simulate confidence intervals
295 |     CI <-
296 |         simulateConfInt(
297 |             popStruc = popStruc,
298 |             bulkSize = bulkSize,
299 |             depth = depth,
300 |             replications = replications,
301 |             filter = filter,
302 |             intervals = quantiles
303 |         )
304 |     
305 |     
306 |     #match name of column for easier joining of repeat columns
307 |     names(CI)[1] <- "tricubeDP"
308 |     
309 |     #use join as a quick way to match min depth to matching conf intervals.
310 |     SNPset <-
311 |         dplyr::left_join(x = SNPset,
312 |                          y = CI
313 |                          # no explicit `by`: joins on all shared columns, including "tricubeDP" (renamed above)
314 |         )
315 |     as.data.frame(SNPset)
316 | 
317 | }
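
The simulation steps in `simulateAlleleFreq`, `simulateSNPindex` and `simulateConfInt` can be condensed into one base-R sketch for a single read depth. The bulk size, depth and filter values below are arbitrary choices for illustration. Unlike `simulateConfInt`, this sketch draws one bulk frequency per replicate directly rather than resampling from a precomputed pool.

```r
set.seed(42)
bulk_size <- 25   # individuals per bulk (assumed value)
depth <- 50       # read depth (assumed value)
reps <- 10000     # number of replicates

# Alternate allele frequency of one simulated F2 bulk:
# genotypes contribute 0, 0.5 or 1 in a 1:2:1 ratio
sim_freq <- function(n) {
    mean(sample(c(0, 0.5, 1), size = n, prob = c(1, 2, 1), replace = TRUE))
}

freq_high <- replicate(reps, sim_freq(bulk_size))
freq_low <- replicate(reps, sim_freq(bulk_size))

# Reads carrying the alternate allele are drawn binomially at the given depth
idx_high <- rbinom(reps, size = depth, prob = freq_high) / depth
idx_low <- rbinom(reps, size = depth, prob = freq_low) / depth
delta <- idx_high - idx_low

# Keep replicates where at least one bulk reaches the minimum SNP-index
delta <- delta[idx_high >= 0.3 | idx_low >= 0.3]

# Two-sided 95% interval of the simulated delta SNP-index
ci95 <- quantile(delta, probs = c(0.025, 0.975))
```

Repeating this over a vector of depths gives one threshold per depth, which is the shape of the table that `simulateConfInt` returns.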
318 | 
319 | 
320 | 
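
The conversion between the percentile-style `intervals` accepted by `runQTLseqAnalysis` and the lower-tail probabilities passed to `simulateConfInt` is simple arithmetic. A small worked example, using only expressions that appear in the code above (the variable `ci_names` is introduced here for illustration):

```r
intervals <- c(95, 99)                 # two-sided confidence levels, in percent
quantiles <- (100 - intervals) / 200   # lower-tail probabilities: 0.025 and 0.005
ci_names <- paste0("CI_", 100 - (quantiles * 200))  # column names: "CI_95", "CI_99"
```

Only the lower tail is simulated. The plotting functions draw both `value` and `-value`, which treats the simulated distribution as symmetric around zero.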


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | 
  2 | <!-- README.md is generated from README.Rmd. Please edit that file -->
  3 | 
  4 | # QTLseqr v0.7.5.2
  5 | 
  6 | QTLseqr is an R package for QTL mapping using NGS Bulk Segregant
  7 | Analysis.
  8 | 
  9 | QTLseqr is still under development and is offered without any
 10 | guarantee.
 11 | 
 12 | ### **For more detailed instructions please read the vignette [here](https://github.com/bmansfeld/QTLseqr/raw/master/vignettes/QTLseqr.pdf)**
 13 | 
 14 | ### For updates read the [NEWS.md](https://github.com/bmansfeld/QTLseqr/blob/master/NEWS.md)
 15 | 
 16 | # Installation
 17 | 
 18 | <!-- You can install and update QTLseqr by using our [drat](http://dirk.eddelbuettel.com/code/drat.html) repository hosted on our github page: -->
 19 | 
 20 | <!-- ```{r drat-install, eval = FALSE} -->
 21 | 
 22 | <!-- install.packages("QTLseqr", repos = "http://bmansfeld.github.io/drat") -->
 23 | 
 24 | <!-- ``` -->
 25 | 
 26 | <!-- OR You can install QTLseqr from github with: -->
 27 | 
 28 | You can install QTLseqr from github with:
 29 | 
 30 | ``` r
 31 | # install devtools first to download packages from github
 32 | install.packages("devtools")
 33 | 
 34 | # use devtools to install QTLseqr
 35 | devtools::install_github("bmansfeld/QTLseqr")
 36 | ```
 37 | 
 38 | **Note:** Apart from regular package dependencies, QTLseqr also
 39 | uses some Bioconductor tools, so you will be prompted to install
 40 | support for Bioconductor if you haven’t already. QTLseqr uses C++
 41 | to make some tasks (like counting SNPs) significantly faster.
 42 | Because of this, in order to install QTLseqr from GitHub you will
 43 | need to install compiler tools (Rtools on Windows and Xcode on
 44 | Mac).
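
Before installing from GitHub it can help to check that a compiler toolchain is visible to R. The snippet below is a rough sketch in base R only. It assumes the tools are called `make`, `g++` or `clang++` and are on your `PATH`, which may not hold for every setup (some Rtools installations, for example, are located differently), so treat a negative result as a hint rather than a verdict.

```r
# Look up common build tools on the PATH (an empty string means "not found").
# The tool names checked here are an assumption, not a QTLseqr requirement list.
tools <- Sys.which(c("make", "g++", "clang++"))

has_make <- nzchar(tools[["make"]])
has_cxx  <- any(nzchar(tools[c("g++", "clang++")]))

if (has_make && has_cxx) {
    message("Build tools found on PATH.")
} else {
    message("Build tools not found on PATH; install Rtools (Windows) or Xcode (Mac) first.")
}
```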
 45 | 
 46 | **If you use QTLseqr in published research, please cite:**
 47 | 
 48 | > Mansfeld B.N. and Grumet R., QTLseqr: An R package for bulk segregant
 49 | > analysis with next-generation sequencing. *The Plant Genome*
 50 | > [doi:10.3835/plantgenome2018.01.0006](https://dl.sciencesocieties.org/publications/tpg/abstracts/11/2/180006)
 51 | 
 52 | We also recommend citing the paper for the corresponding method you work
 53 | with.
 54 | 
 55 | QTL-seq method:
 56 | 
 57 | > Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka,
 58 | > C., Uemura, A., Utsushi, H., Tamiru, M., Takuno, S., Innan, H., Cano,
 59 | > L. M., Kamoun, S. and Terauchi, R. (2013), QTL-seq: rapid mapping of
 60 | > quantitative trait loci in rice by whole genome resequencing of DNA
 61 | > from two bulked populations. *Plant J*, 74: 174–183.
 62 | > [doi:10.1111/tpj.12105](https://onlinelibrary.wiley.com/doi/full/10.1111/tpj.12105)
 63 | 
 64 | G prime method:
 65 | 
 66 | > Magwene PM, Willis JH, Kelly JK (2011) The Statistics of Bulk
 67 | > Segregant Analysis Using Next Generation Sequencing. *PLOS
 68 | > Computational Biology* 7(11): e1002255.
 69 | > [doi:10.1371/journal.pcbi.1002255](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002255)
 70 | 
 71 | ## Abstract
 72 | 
 73 | Next Generation Sequencing Bulk Segregant Analysis (NGS-BSA) is
 74 | efficient in detecting quantitative trait loci (QTL). Despite the
 75 | popularity of NGS-BSA and the R statistical platform, no R packages are
 76 | currently available for NGS-BSA. We present QTLseqr, an R package for
 77 | NGS-BSA that identifies QTL using two statistical approaches: QTL-seq
 78 | and G’. These approaches use a simulation method and a tricube smoothed
 79 | G statistic, respectively, to identify and assess statistical
 80 | significance of QTL. QTLseqr can import and filter SNP data, calculate
 81 | SNP distributions, relative allele frequencies, G’ values, and
 82 | -log10(p-values), enabling identification and plotting of QTL.
 83 | 
 84 | # Examples:
 85 | 
 86 | ## Example figure
 87 | 
 88 | ![Example
 89 | figure](https://github.com/bmansfeld/QTLseqr/raw/master/all_plots.png
 90 | "Example figure")
 91 | 
 92 | ### **For more detailed instructions please read the vignette [here](https://github.com/bmansfeld/QTLseqr/raw/master/vignettes/QTLseqr.pdf)**
 93 | 
 94 | 
 95 | This is a basic example which shows you how to import and analyze
 96 | NGS-BSA data.
 97 | 
 98 | ``` r
 99 | 
100 | #load the package
101 | library("QTLseqr")
102 | 
103 | #Set sample and file names
104 | HighBulk <- "SRR834931"
105 | LowBulk <- "SRR834927"
106 | file <- "SNPs_from_GATK.table"
107 | 
108 | #Choose which chromosomes will be included in the analysis (i.e. exclude smaller contigs)
109 | Chroms <- paste0(rep("Chr", 12), 1:12)
110 | 
111 | #Import SNP data from file
112 | df <-
113 |     importFromGATK(
114 |         file = file,
115 |         highBulk = HighBulk,
116 |         lowBulk = LowBulk,
117 |         chromList = Chroms
118 |      )
119 | 
120 | #Filter SNPs based on some criteria
121 | df_filt <-
122 |     filterSNPs(
123 |         SNPset = df,
124 |         refAlleleFreq = 0.20,
125 |         minTotalDepth = 100,
126 |         maxTotalDepth = 400,
127 |         minSampleDepth = 40,
128 |         minGQ = 99
129 |     )
130 | 
131 | 
132 | #Run G' analysis
133 | df_filt <- runGprimeAnalysis(
134 |     SNPset = df_filt,
135 |     windowSize = 1e6,
136 |     outlierFilter = "deltaSNP")
137 | 
138 | #Run QTLseq analysis
139 | df_filt <- runQTLseqAnalysis(
140 |     SNPset = df_filt,
141 |     windowSize = 1e6,
142 |     popStruc = "F2",
143 |     bulkSize = c(25, 25),
144 |     replications = 10000,
145 |     intervals = c(95, 99)
146 | )
147 | 
148 | #Plot
149 | plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.01)
150 | plotQTLStats(SNPset = df_filt, var = "deltaSNP", plotIntervals = TRUE)
151 | 
152 | #export summary CSV
153 | getQTLTable(SNPset = df_filt, alpha = 0.01, export = TRUE, fileName = "my_BSA_QTL.csv")
154 | ```
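
To make the statistics used in the example above more concrete, here is a small base-R sketch that computes the SNP-index, delta SNP-index and G statistic for a single SNP. The allele depths are made-up numbers, and the helper `g_stat()` is hypothetical (it is not the package's `getG`). It follows the G formula given in the `getG` documentation and assumes the SNP-index is the alternate allele depth divided by the total depth.

```r
# Hypothetical allele depths for one SNP (made-up numbers)
high_ref <- 10; high_alt <- 40   # high bulk
low_ref  <- 30; low_alt  <- 20   # low bulk

# SNP-index per bulk: fraction of reads supporting the alternate allele
snp_index_high <- high_alt / (high_ref + high_alt)
snp_index_low  <- low_alt / (low_ref + low_alt)

# delta SNP-index: high bulk minus low bulk (0.4 for these numbers)
delta_snp <- snp_index_high - snp_index_low

# G statistic for the 2x2 table of observed counts:
# G = 2 * sum(obs * ln(obs / exp)), with exp = row total * column total / N
g_stat <- function(high_ref, low_ref, high_alt, low_alt) {
    # rows: reference, alternate; columns: high bulk, low bulk
    obs <- matrix(c(high_ref, low_ref, high_alt, low_alt), nrow = 2, byrow = TRUE)
    expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
    2 * sum(obs * log(obs / expected))
}

G <- g_stat(high_ref, low_ref, high_alt, low_alt)   # approximately 17.26
```

Note that this sketch does not handle zero counts, where `log(0)` would produce `NaN`; depth filtering beforehand, as in the example above, makes that case less likely.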
155 | 
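The `q` and `alpha` arguments in the example above are false discovery rate cut-offs. As a rough illustration of the idea (a sketch, not the package's own code), the snippet below uses base R's `p.adjust()` with the Benjamini-Hochberg method to find the largest p-value whose adjusted value is still below a chosen alpha. The p-values and the helper name `fdr_threshold()` are made up for this example.

```r
# Hypothetical helper: largest p-value whose BH-adjusted value is below alpha
fdr_threshold <- function(pvalues, alpha = 0.01) {
    sorted_p <- sort(pvalues)
    q <- p.adjust(sorted_p, method = "BH")
    passing <- which(q < alpha)
    if (length(passing) == 0) {
        return(NA_real_)  # nothing passes at this alpha
    }
    sorted_p[max(passing)]
}

# Made-up p-values: a few strong signals among mostly null results
p <- c(0.0001, 0.0004, 0.0019, 0.03, 0.2, 0.45, 0.6, 0.8)
threshold <- fdr_threshold(p, alpha = 0.01)   # 0.0019 for these values
```

Returning `NA` when nothing passes mirrors the situation described for `plotQTLStats`, where no threshold line is drawn if `q` is too low.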


--------------------------------------------------------------------------------
/all_plots.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bmansfeld/QTLseqr/5e761379a805b65038c415c8d3ce7aa61abe89dc/all_plots.png


--------------------------------------------------------------------------------
/man/FilterSNPs.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/Import_Filter.R
 3 | \name{filterSNPs}
 4 | \alias{filterSNPs}
 5 | \title{Filter SNPs based on read depth and quality}
 6 | \usage{
 7 | filterSNPs(SNPset, refAlleleFreq, filterAroundMedianDepth, minTotalDepth,
 8 |   maxTotalDepth, minSampleDepth, depthDifference, minGQ, verbose = TRUE)
 9 | }
10 | \arguments{
11 | \item{SNPset}{The data frame imported by \code{importFromGATK}}
12 | 
13 | \item{refAlleleFreq}{A numeric < 1. This will filter out SNPs with a
14 | Reference Allele Frequency less than \code{refAlleleFreq} or greater than
15 | 1 - \code{refAlleleFreq}. E.g. \code{refAlleleFreq = 0.3} will keep SNPs
16 | with 0.3 <= REF_FRQ <= 0.7}
17 | 
18 | \item{filterAroundMedianDepth}{Filters total SNP read depth for both bulks. A
19 | median and median absolute deviation (MAD) of depth will be calculated.
20 | SNPs with read depth greater or less than \code{filterAroundMedianDepth}
21 | MADs away from the median will be filtered.}
22 | 
23 | \item{minTotalDepth}{The minimum total read depth for a SNP (counting both
24 | bulks)}
25 | 
26 | \item{maxTotalDepth}{The maximum total read depth for a SNP (counting both
27 | bulks)}
28 | 
29 | \item{minSampleDepth}{The minimum read depth for a SNP in each bulk}
30 | 
31 | \item{depthDifference}{The maximum absolute difference in read depth between the bulks.}
32 | 
33 | \item{minGQ}{The minimum Genotype Quality as set by GATK. This is a measure
34 | of how confident GATK was with the assigned genotype (i.e. homozygous ref,
35 | heterozygous, homozygous alt). See
36 | \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What
37 | is a VCF and how should I interpret it?}}
38 | 
39 | \item{verbose}{logical. If \code{TRUE} will report number of SNPs filtered in
40 | each step.}
41 | }
42 | \value{
43 | Returns a subset of the data frame supplied which meets the filtering
44 |   conditions applied by the selected parameters. If \code{verbose} is
45 |   \code{TRUE} the function reports the number of SNPs filtered in each step
46 |   as well as the initial number of SNPs, the total number of SNPs filtered
47 |   and the remaining number.
48 | }
49 | \description{
50 | Use filtering parameters to filter out high and low depth reads as well as
51 | low Genotype Quality as defined by GATK. All filters are optional but recommended.
52 | }
53 | \examples{
54 | df_filt <- filterSNPs(
55 |     df,
56 |     refAlleleFreq = 0.3,
57 |     minTotalDepth = 40,
58 |     maxTotalDepth = 80,
59 |     minSampleDepth = 20,
60 |     minGQ = 99,
61 |     verbose = TRUE
62 | )
63 | 
64 | }
65 | \seealso{
66 | See \code{\link[stats]{mad}} for explanation of calculation of
67 |   median absolute deviation.
68 |   \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What
69 |   is a VCF and how should I interpret it?} for more information on GATK
70 |   Fields and Genotype Fields
71 | }
72 | 


--------------------------------------------------------------------------------
/man/ImportFromGATK.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/Import_Filter.R
 3 | \name{importFromGATK}
 4 | \alias{importFromGATK}
 5 | \title{Imports SNP data from GATK VariantsToTable output}
 6 | \usage{
 7 | importFromGATK(file, highBulk = character(), lowBulk = character(),
 8 |   chromList = NULL)
 9 | }
10 | \arguments{
11 | \item{file}{The name of the GATK VariantsToTable output .table file which the
12 | data are to be read from.}
13 | 
14 | \item{highBulk}{The sample name of the High Bulk}
15 | 
16 | \item{lowBulk}{The sample name of the Low Bulk}
17 | 
18 | \item{chromList}{a string vector of the chromosomes to be used in the
19 | analysis. Useful for filtering out unwanted contigs etc.}
20 | }
21 | \value{
22 | Returns a data frame containing columns for Read depth (DP),
23 |   Reference Allele Depth (AD_REF) and Alternative Allele Depth (AD_ALT),
24 |   Genotype Quality (GQ) and SNPindex for each bulk (indicated by .HIGH and
25 |   .LOW column name suffix). Total reference allele frequency "REF_FRQ" is the
26 |   sum of AD_REF for both bulks divided by total Depth for that SNP. The
27 |   deltaSNPindex is equal to SNPindex.HIGH - SNPindex.LOW. The GStat column
28 |   is the calculated G statistic for that SNP.
29 | }
30 | \description{
31 | Imports SNP data from the output of the
32 | \href{https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php}{VariantsToTable}
33 | function in GATK. After importing the data, the function then calculates
34 | total reference allele frequency for both bulks together, the delta SNP index
35 | (i.e. SNP index of the low bulk subtracted from the SNP index of the high
36 | bulk), the G statistic and returns a data frame. The required GATK fields
37 | (-F) are CHROM (Chromosome) and POS (Position). The required Genotype fields
38 | (-GF) are AD (Allele Depth) and DP (Depth). Recommended
39 | fields are REF (Reference allele) and ALT (Alternative allele). Recommended
40 | Genotype fields are PL (Phred-scaled likelihoods) and GQ (Genotype Quality).
41 | }
42 | \examples{
43 | df <- importFromGATK(file = file.table,
44 |     highBulk = highBulkSampleName,
45 |     lowBulk = lowBulkSampleName,
46 |     chromList = c("Chr1","Chr4","Chr7"))
47 | }
48 | \seealso{
49 | \code{\link{getG}} for explanation of how G statistic is
50 |   calculated.
51 |   \href{http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it}{What
52 |   is a VCF and how should I interpret it?} for more information on GATK
53 |   Fields and Genotype Fields
54 | }
55 | 


--------------------------------------------------------------------------------
/man/countSNPs_cpp.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/RcppExports.R
 3 | \name{countSNPs_cpp}
 4 | \alias{countSNPs_cpp}
 5 | \title{Count number of SNPs within a sliding window}
 6 | \usage{
 7 | countSNPs_cpp(POS, windowSize)
 8 | }
 9 | \arguments{
10 | \item{POS}{A numeric vector of genomic positions for each SNP}
11 | 
12 | \item{windowSize}{The required window size}
13 | }
14 | \description{
15 | For each SNP returns how many SNPs are bracketing it within the set window size
16 | }
17 | 


--------------------------------------------------------------------------------
/man/getFDRThreshold.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/G_functions.R
 3 | \name{getFDRThreshold}
 4 | \alias{getFDRThreshold}
 5 | \title{Find false discovery rate threshold}
 6 | \usage{
 7 | getFDRThreshold(pvalues, alpha = 0.01)
 8 | }
 9 | \arguments{
10 | \item{pvalues}{a vector of p-values}
11 | 
12 | \item{alpha}{the required false discovery rate alpha}
13 | }
14 | \value{
15 | The p-value threshold that corresponds to the Benjamini-Hochberg adjusted p-value at the FDR set by alpha.
16 | }
17 | \description{
18 | Given a vector of p-values and a set false discovery rate alpha the function
19 | returns the highest p-value in the vector for which the Benjamini-Hochberg
20 | adjusted p-value (i.e. q-value) is less than that alpha.
21 | }
22 | 


--------------------------------------------------------------------------------
/man/getG.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/G_functions.R
 3 | \name{getG}
 4 | \alias{getG}
 5 | \title{Calculates the G statistic}
 6 | \usage{
 7 | getG(LowRef, HighRef, LowAlt, HighAlt)
 8 | }
 9 | \arguments{
10 | \item{LowRef}{A vector of the reference allele depth in the low bulk}
11 | 
12 | \item{HighRef}{A vector of the reference allele depth in the high bulk}
13 | 
14 | \item{LowAlt}{A vector of the alternate allele depth in the low bulk}
15 | 
16 | \item{HighAlt}{A vector of the alternate allele depth in the high bulk}
17 | }
18 | \value{
19 | A vector of G statistic values with the same length as the supplied vectors
20 | }
21 | \description{
22 | The function is used by \code{\link{runGprimeAnalysis}} to calculate the G
23 | statistic. G is defined by the equation: \deqn{G = 2*\sum_{i=1}^{4}
24 | n_{i}*ln\frac{obs(n_i)}{exp(n_i)}}{G = 2 * \sum n_i * ln(obs(n_i)/exp(n_i))}
25 | Where for each SNP, \eqn{n_i} from i = 1 to 4 corresponds to the reference
26 | and alternate allele depths for each bulk, as described in the following
27 | table: \tabular{rcc}{ Allele \tab High Bulk \tab Low Bulk \cr Reference \tab
28 | \eqn{n_1} \tab \eqn{n_2} \cr Alternate \tab \eqn{n_3} \tab \eqn{n_4} \cr}
29 | ...and \eqn{obs(n_i)} are the observed allele depths as described in the data
30 | frame. Method 1 calculates the G statistic using expected values assuming
31 | read depth is equal for all alleles in both bulks: \deqn{exp(n_1) = ((n_1 +
32 | n_2)*(n_1 + n_3))/(n_1 + n_2 + n_3 + n_4)} \deqn{exp(n_2) = ((n_2 + n_1)*(n_2
33 | + n_4))/(n_1 + n_2 + n_3 + n_4)} etc...
34 | }
35 | \seealso{
36 | \href{https://doi.org/10.1371/journal.pcbi.1002255}{The Statistics
37 |   of Bulk Segregant Analysis Using Next Generation Sequencing}
38 |   \code{\link{tricubeStat}} for G prime calculation
39 | }
40 | 


--------------------------------------------------------------------------------
/man/getPvals.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/G_functions.R
 3 | \name{getPvals}
 4 | \alias{getPvals}
 5 | \title{Non-parametric estimation of the null distribution of G'}
 6 | \usage{
 7 | getPvals(Gprime, deltaSNP = NULL, outlierFilter = c("deltaSNP",
 8 |   "Hampel"), filterThreshold)
 9 | }
10 | \arguments{
11 | \item{Gprime}{a vector of G prime values (tricube weighted G statistics)}
12 | 
13 | \item{deltaSNP}{a vector of delta SNP values for use for QTL region filtering}
14 | 
15 | \item{outlierFilter}{one of either "deltaSNP" or "Hampel". Method for
16 | filtering outlier (ie QTL) regions for p-value estimation}
17 | 
18 | \item{filterThreshold}{The absolute delta SNP index to use to filter out putative QTL}
19 | }
20 | \description{
21 | The function is used by \code{\link{runGprimeAnalysis}} to estimate p-values for the
22 | weighted G' statistic based on the non-parametric estimation method described
23 | in Magwene et al. 2011. Briefly, using the natural log of Gprime a median
24 | absolute deviation (MAD) is calculated. The Gprime set is trimmed to exclude
25 | outlier regions (i.e. QTL) based on Hampel's rule. An alternate method for
26 | filtering out QTL is proposed using absolute delta SNP indices greater than
27 | a set threshold to filter out potential QTL. An estimation of the mode of the trimmed set
28 | is calculated using the \code{\link[modeest]{mlv}} function from the package
29 | modeest. Finally, the mean and variance of the set are estimated using the
30 | median and mode and p-values are estimated from a log normal distribution.
31 | }
32 | 


--------------------------------------------------------------------------------
/man/getQTLTable.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/export_functions.R
 3 | \name{getQTLTable}
 4 | \alias{getQTLTable}
 5 | \title{Export a summarized table of QTL}
 6 | \usage{
 7 | getQTLTable(SNPset, method = "Gprime", alpha = 0.05, interval = 99,
 8 |   export = FALSE, fileName = "QTL.csv")
 9 | }
10 | \arguments{
11 | \item{SNPset}{Data frame SNP set containing previously filtered SNPs}
12 | 
13 | \item{method}{either "Gprime" or "QTLseq". The method for detecting significant regions.}
14 | 
15 | \item{alpha}{numeric. The required false discovery rate alpha for use with \code{method = "Gprime"}}
16 | 
17 | \item{interval}{numeric. For use with \code{method = "QTLseq"}. The Takagi based confidence interval requested. This will find the column named "CI_\*\*", where \*\* is the requested interval, e.g. 99.}
18 | 
19 | \item{export}{logical. If TRUE will export a csv table.}
20 | 
21 | \item{fileName}{either a character string naming a file or a connection open for writing. "" indicates output to the console.}
22 | }
23 | \value{
24 | Returns a summarized table of QTL identified. The table contains the following columns:
25 | \itemize{
26 | \item{id - the QTL identification number}
27 | \item{chromosome - The chromosome on which the region was identified}
28 | \item{start - the start position on that chromosome, i.e. the position of the first SNP that passes the FDR threshold}
29 | \item{end - the end position}
30 | \item{length - the length in basepairs from start to end of the region}
31 | \item{nSNPs - the number of SNPs in the region}
32 | \item{avgSNPs_Mb - the average number of SNPs/Mb within that region}
33 | \item{peakDeltaSNP - the tricube-smoothed deltaSNP-index value at the peak summit}
34 | \item{posPeakDeltaSNP - the position of the absolute maximum tricube-smoothed deltaSNP-index}
35 | \item{maxGprime - the max G' score in the region}
36 | \item{posMaxGprime - the genomic position of the maximum G' value in the QTL}
37 | \item{meanGprime - the average G' score of that region}
38 | \item{sdGprime - the standard deviation of G' within the region}
39 | \item{AUCaT - the Area Under the Curve but above the Threshold line, an indicator of how significant or wide the peak is}
40 | \item{meanPval - the average p-value in the region}
41 | \item{meanQval - the average adjusted p-value in the region}
42 | }
43 | }
44 | \description{
45 | Export a summarized table of QTL
46 | }
47 | 


--------------------------------------------------------------------------------
/man/getSigRegions.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/export_functions.R
 3 | \name{getSigRegions}
 4 | \alias{getSigRegions}
 5 | \title{Return SNPs in significant regions}
 6 | \usage{
 7 | getSigRegions(SNPset, method = "Gprime", alpha = 0.05, interval = 99)
 8 | }
 9 | \arguments{
10 | \item{SNPset}{Data frame SNP set containing previously filtered SNPs.}
11 | 
12 | \item{method}{either "Gprime" or "QTLseq". The method for detecting significant regions.}
13 | 
14 | \item{alpha}{numeric. The required false discovery rate alpha for use with \code{method = "Gprime"}}
15 | 
16 | \item{interval}{numeric. For use with \code{method = "QTLseq"}. The Takagi based confidence interval requested. This will find the column named "CI_\*\*", where \*\* is the requested interval, e.g. 99.}
17 | }
18 | \description{
19 | The function takes a SNP set after calculation of p- and q-values or Takagi confidence intervals and returns
20 | a list containing all SNPs with q-values or deltaSNP below a set alpha or confidence intervals, respectively. Each entry in the list
21 | is a SNP set data frame in a contiguous region with adjusted pvalues lower
22 | than the set false discovery rate alpha.
23 | }
24 | 


--------------------------------------------------------------------------------
/man/importFromTable.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/Import_Filter.R
 3 | \name{importFromTable}
 4 | \alias{importFromTable}
 5 | \title{Import SNP data from a delimited file}
 6 | \usage{
 7 | importFromTable(file, highBulk = "HIGH", lowBulk = "LOW",
 8 |   chromList = NULL, sep = ",")
 9 | }
10 | \arguments{
11 | \item{file}{The name of the file which the data are to be read from.}
12 | 
13 | \item{highBulk}{The sample name of the High Bulk. Defaults to "HIGH"}
14 | 
15 | \item{lowBulk}{The sample name of the Low Bulk. Defaults to "LOW"}
16 | 
17 | \item{chromList}{a string vector of the chromosomes to be used in the
18 | analysis. Useful for filtering out unwanted contigs etc.}
19 | 
20 | \item{sep}{the field separator character. Values on each line of the file are
21 | separated by this character. Default is for csv file ie ",".}
22 | }
23 | \value{
24 | Returns a data frame containing columns for per bulk total Read depth (DP),
25 |   Reference Allele Depth (AD_REF) and Alternative Allele Depth (AD_ALT), any
26 |   other SNP associated columns in the file, and SNPindex for each bulk
27 |   (indicated by .HIGH and .LOW column name suffix). Total reference allele
28 |   frequency "REF_FRQ" is the sum of AD_REF for both bulks divided by total
29 |   Depth for that SNP. The deltaSNPindex is equal to SNPindex.HIGH -
30 |   SNPindex.LOW.
31 | }
32 | \description{
33 | After importing the data from a delimited file, the function then calculates
34 | total reference allele frequency for both bulks together, the delta SNP index
35 | (i.e. SNP index of the low bulk subtracted from the SNP index of the high
36 | bulk), the G statistic and returns a data frame. The required columns in the
37 | file are CHROM (Chromosome) and POS (Position) as well as the reference and
38 | alternate allele depths (number of reads supporting each allele). The allele
39 | depths should be in columns named in this format:
40 | \code{AD_(<ALT/REF>).<sampleName>}. For example, the column for alternate
41 | allele depth for a high bulk sample named "sample1", should be
42 | "AD_ALT.sample1". Any other columns describing the SNPs are allowed, e.g. the
43 | actual allele calls, or a quality score. If the column is bulk specific, it
44 | should be named \code{columnName.sampleName}, e.g. "QUAL.sample1".
45 | }
46 | 


--------------------------------------------------------------------------------
/man/plotGprimeDist.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/plotting_functions.R
 3 | \name{plotGprimeDist}
 4 | \alias{plotGprimeDist}
 5 | \title{Plots Gprime distribution}
 6 | \usage{
 7 | plotGprimeDist(SNPset, outlierFilter = c("deltaSNP", "Hampel"),
 8 |   filterThreshold = 0.1, binwidth = 0.5)
 9 | }
10 | \arguments{
11 | \item{SNPset}{a data frame with SNPs and genotype fields as imported by
12 | \code{importFromGATK} and after running \code{runGprimeAnalysis}}
13 | 
14 | \item{outlierFilter}{one of either "deltaSNP" or "Hampel". Method for
15 | filtering outlier (ie QTL) regions for p-value estimation}
16 | 
17 | \item{filterThreshold}{The absolute delta SNP index to use to filter out
18 | putative QTL (default = 0.1)}
19 | 
20 | \item{binwidth}{The binwidth for the histogram. Recommended and default = 0.5}
21 | }
22 | \value{
23 | Plots a ggplot histogram of the G' value distribution. The raw data
24 |   as well as the filtered G' values (excluding putative QTL) are plotted. It
25 |   will then overlay an estimated log normal distribution with the same mean
26 |   and variance as the null G' distribution. This allows you to verify that,
27 |   after filtering, your G' values appear close to log normally distributed and
28 |   thus can be used to estimate p-values using the non-parametric estimation method
29 |   described in Magwene et al. (2011). Briefly, using the natural log of
30 |   Gprime a median absolute deviation (MAD) is calculated. The Gprime set is
31 |   trimmed to exclude outlier regions (i.e. QTL) based on Hampel's rule. An
32 |   estimation of the mode of the trimmed set is calculated using the
33 |   \code{\link[modeest]{mlv}} function from the package modeest. Finally, the
34 |   mean and variance of the set are estimated using the median and mode and
35 |   used to plot the log normal distribution.
36 | }
37 | \description{
38 | Plots a ggplot histogram of the distribution of Gprime with a log normal
39 | distribution overlay
40 | }
41 | \examples{
42 | plotGprimeDist(df_filt_6Mb, outlierFilter = "deltaSNP")
43 | 
44 | }
45 | \seealso{
46 | \code{\link{getPvals}} for how p-values are calculated.
47 | }
48 | 


--------------------------------------------------------------------------------
/man/plotQTLStats.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/plotting_functions.R
 3 | \name{plotQTLStats}
 4 | \alias{plotQTLStats}
 5 | \title{Plots different parameters for QTL identification}
 6 | \usage{
 7 | plotQTLStats(SNPset, subset = NULL, var = "nSNPs",
 8 |   scaleChroms = TRUE, line = TRUE, plotThreshold = FALSE,
 9 |   plotIntervals = FALSE, q = 0.05, ...)
10 | }
11 | \arguments{
12 | \item{SNPset}{a data frame with SNPs and genotype fields as imported by
13 | \code{importFromGATK} and after running \code{runGprimeAnalysis} or \code{runQTLseqAnalysis}}
14 | 
15 | \item{subset}{a vector of chromosome names for use in quick plotting of
16 | chromosomes of interest. Defaults to
17 | NULL and will plot all chromosomes in the SNPset}
18 | 
19 | \item{var}{character. The parameter for plotting. Must be one of: "nSNPs",
20 | "deltaSNP", "Gprime", "negLog10Pval"}
21 | 
22 | \item{scaleChroms}{boolean. if TRUE (default) then chromosome facets will be 
23 | scaled to relative chromosome sizes. If FALSE all facets will be equal
24 | sizes. This is basically a convenience argument for setting both scales and 
25 | shape as "free_x" in ggplot2::facet_grid.}
26 | 
27 | \item{line}{boolean. If TRUE will plot line graph. If FALSE will plot points.
28 | Plotting points will take more time.}
29 | 
30 | \item{plotThreshold}{boolean. Should we plot the False Discovery Rate
31 | threshold (FDR). Only plots line if var is "Gprime" or "negLog10Pval".}
32 | 
33 | \item{plotIntervals}{boolean. Whether or not to plot the two-sided Takagi confidence intervals in "deltaSNP" plots.}
34 | 
35 | \item{q}{numeric. The q-value to use as the FDR threshold. If too low, no
36 | line will be drawn and a warning will be given.}
37 | 
38 | \item{...}{arguments to pass to ggplot2::geom_line or ggplot2::geom_point for
39 | changing colors etc.}
40 | }
41 | \value{
42 | Plots a ggplot graph for all chromosomes or those requested in
43 |   \code{subset}. By setting \code{var} to "nSNPs" the distribution of SNPs
44 |   used to calculate G' will be plotted. "deltaSNP" will plot a tri-cube
45 |   weighted delta SNP-index for each SNP. "Gprime" will plot the tri-cube
46 |   weighted G' value. Setting "negLog10Pval" will plot the -log10 of the p-value
47 |   at each SNP. In "Gprime" and "negLog10Pval" plots, a genome wide FDR threshold of
48 |   q can be drawn by setting "plotThreshold" to TRUE. The default is a red
49 |   line. If you would like to plot a different line we suggest setting
50 |   "plotThreshold" to FALSE and manually adding a line using
51 |   ggplot2::geom_hline.
52 | }
53 | \description{
54 | A wrapper for ggplot to plot genome wide distribution of parameters used to
55 | identify QTL.
56 | }
57 | \examples{
58 | p <- plotQTLStats(df_filt_6Mb, var = "Gprime", plotThreshold = TRUE, q = 0.01, subset = c("Chr3","Chr4"))
59 | }
60 | 


--------------------------------------------------------------------------------
/man/plotSimulatedThresholds.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/plotting_functions.R
 3 | \name{plotSimulatedThresholds}
 4 | \alias{plotSimulatedThresholds}
 5 | \title{Plots simulation data for QTLseq analysis}
 6 | \usage{
 7 | plotSimulatedThresholds(SNPset = NULL, popStruc = "F2", bulkSize,
 8 |   depth = NULL, replications = 10000, filter = 0.3,
 9 |   intervals = c(95, 99))
10 | }
11 | \arguments{
12 | \item{SNPset}{optional. Either supply your data set to extract read depths from or supply depth vector.}
13 | 
14 | \item{popStruc}{the population structure. Defaults to "F2" and assumes "RIL" otherwise.}
15 | 
16 | \item{bulkSize}{non-negative integer. The number of individuals in each bulk}
17 | 
18 | \item{depth}{optional integer vector. A read depth for which to replicate SNP-index calls. If read depth is defined SNPset will be ignored.}
19 | 
20 | \item{replications}{integer. The number of bootstrap replications.}
21 | 
22 | \item{filter}{numeric. An optional minimum SNP-index filter}
23 | 
24 | \item{intervals}{numeric vector. Confidence intervals supplied as two-sided percentiles. i.e. If intervals = '95' will return the two sided 95\% confidence interval, 2.5\% on each side.}
25 | }
26 | \value{
27 | Plots a deltaSNP by depth plot. Helps if the user wants to know the delta SNP index needed to pass a certain CI at a specified depth.
28 | }
29 | \description{
30 | Simulation is performed as described in Takagi et al. (2013). Genotypes are
31 | randomly assigned for each individual in the bulk, based on the population
32 | structure. The total alternative allele frequency in each bulk is calculated
33 | at each depth and used to simulate delta SNP-indices, with a user defined number of bootstrapped replications.
34 | The requested confidence intervals are then calculated from the bootstraps.
35 | This function plots the simulated confidence intervals by the read depth.
36 | }
37 | \examples{
38 | plotSimulatedThresholds(SNPset = NULL, popStruc = "F2", bulkSize = 25, depth = 1:150, replications = 10000, filter = 0.3, intervals = c(95, 99))
39 | }
40 | 


--------------------------------------------------------------------------------
/man/runGprimeAnalysis.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/G_functions.R
 3 | \name{runGprimeAnalysis}
 4 | \alias{runGprimeAnalysis}
 5 | \title{Identify QTL using a smoothed G statistic}
 6 | \usage{
 7 | runGprimeAnalysis(SNPset, windowSize = 1e+06,
 8 |   outlierFilter = "deltaSNP", filterThreshold = 0.1, ...)
 9 | }
10 | \arguments{
11 | \item{SNPset}{Data frame SNP set containing previously filtered SNPs}
12 | 
13 | \item{windowSize}{the window size (in base pairs) bracketing each SNP for which
14 | to calculate the statistics. Magwene et al. recommend a window size of ~25
15 | cM, but also recommend optionally trying several window sizes to test if
16 | peaks are over- or undersmoothed.}
17 | 
18 | \item{outlierFilter}{one of either "deltaSNP" or "Hampel". Method for
19 | filtering outlier (ie QTL) regions for p-value estimation}
20 | 
21 | \item{filterThreshold}{The absolute delta SNP index to use to filter out putative QTL (default = 0.1)}
22 | 
23 | \item{...}{Other arguments passed to \code{\link[locfit]{locfit}} and
24 | subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful
25 | in cases where you get "out of vertex space warnings"; Set the maxk higher
26 | than the default 100. See \code{\link[locfit]{locfit.raw}}(). But if you
27 | are getting that warning you should seriously consider increasing your
28 | window size.}
29 | }
30 | \value{
31 | The supplied SNP set tibble after G' analysis. Includes five new
32 |   columns: \itemize{\item{G - The G statistic for each SNP} \item{Gprime -
33 |   The tricube smoothed G statistic based on the supplied window size}
34 |   \item{pvalue - the p-value at each SNP calculated by non-parametric
35 |   estimation} \item{negLog10Pval - the -Log10(pvalue) supplied for quick
36 |   plotting} \item{qvalue - the Benjamini-Hochberg adjusted p-value}}
37 | }
38 | \description{
39 | A wrapper for all the functions that perform the full G prime analysis to
40 | identify QTL. The following steps are performed:\cr 1) Genome-wide G
41 | statistics are calculated by \code{\link{getG}}. \cr G is defined by the
42 | equation: \deqn{G = 2*\sum_{i=1}^{4} n_{i}*ln\frac{obs(n_i)}{exp(n_i)}}{G = 2
43 | * \sum n_i * ln(obs(n_i)/exp(n_i))} Where for each SNP, \eqn{n_i} from i = 1
44 | to 4 corresponds to the reference and alternate allele depths for each bulk,
45 | as described in the following table: \tabular{rcc}{ Allele \tab High Bulk
46 | \tab Low Bulk \cr Reference \tab \eqn{n_1} \tab \eqn{n_2} \cr Alternate \tab
47 | \eqn{n_3} \tab \eqn{n_4} \cr} ...and \eqn{obs(n_i)} are the observed allele
48 | depths as described in the data frame. \code{\link{getG}} calculates the G statistic
49 | using expected values assuming read depth is equal for all alleles in both
50 | bulks: \deqn{exp(n_1) = ((n_1 + n_2)*(n_1 + n_3))/(n_1 + n_2 + n_3 + n_4)}
51 | \deqn{exp(n_2) = ((n_2 + n_1)*(n_2 + n_4))/(n_1 + n_2 + n_3 + n_4)}
52 | \deqn{exp(n_3) = ((n_3 + n_1)*(n_3 + n_4))/(n_1 + n_2 + n_3 + n_4)}
53 | \deqn{exp(n_4) = ((n_4 + n_2)*(n_4 + n_3))/(n_1 + n_2 + n_3 + n_4)}\cr 2) G'
54 | - A tricube-smoothed G statistic is predicted by local regression within each
55 | chromosome using \code{\link{tricubeStat}}. This works as a weighted average
56 | across neighboring SNPs that accounts for linkage disequilibrium (LD) while
57 | minimizing noise attributed to SNP calling errors. G values for neighboring
58 | SNPs within the window are weighted by physical distance from the focal SNP.
59 | \cr \cr 3) P-values are estimated using the non-parametric method
60 | described by Magwene et al. 2011 with the function \code{\link{getPvals}}.
61 | Briefly, using the natural log of Gprime a median absolute deviation (MAD) is
62 | calculated. The Gprime set is trimmed to exclude outlier regions (i.e. QTL)
63 | based on Hampel's rule. An alternate method for filtering out QTL is proposed
64 | using absolute delta SNP indices greater than 0.1 to filter out potential
65 | QTL. An estimation of the mode of the trimmed set is calculated using the
66 | \code{\link[modeest]{mlv}} function from the package modeest. Finally, the
67 | mean and variance of the set are estimated using the median and mode and
68 | p-values are estimated from a log normal distribution. \cr \cr 4) Negative
69 | Log10- and Benjamini-Hochberg adjusted p-values are calculated using
70 | \code{\link[stats]{p.adjust}}
71 | }
72 | \examples{
73 | df_filt <- runGprimeAnalysis(df_filt,windowSize = 2e6,outlierFilter = "deltaSNP")
74 | }
75 | 


--------------------------------------------------------------------------------
/man/runQTLseqAnalysis.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/takagi_sim.R
 3 | \name{runQTLseqAnalysis}
 4 | \alias{runQTLseqAnalysis}
 5 | \title{Calculates delta SNP confidence intervals for QTLseq analysis}
 6 | \usage{
 7 | runQTLseqAnalysis(SNPset, windowSize = 1e+06, popStruc = "F2",
 8 |   bulkSize, depth = NULL, replications = 10000, filter = 0.3,
 9 |   intervals = c(95, 99), ...)
10 | }
11 | \arguments{
12 | \item{SNPset}{The data frame imported by \code{importFromGATK}}
13 | 
14 | \item{windowSize}{the window size (in base pairs) bracketing each SNP for which to calculate the statistics.}
15 | 
16 | \item{popStruc}{the population structure. Defaults to "F2" and assumes "RIL" otherwise}
17 | 
18 | \item{bulkSize}{non-negative integer vector. The number of individuals in
19 | each simulated bulk. Can be of length 1, then both bulks are set to the
20 | same size. Assumes the first value in the vector is the simulated high
21 | bulk.}
22 | 
23 | \item{depth}{integer. A read depth for which to replicate SNP-index calls.}
24 | 
25 | \item{replications}{integer. The number of bootstrap replications.}
26 | 
27 | \item{filter}{numeric. An optional minimum SNP-index filter}
28 | 
29 | \item{intervals}{numeric vector. Confidence intervals supplied as two-sided
30 | percentiles. i.e. If intervals = '95' will return the two sided 95\%
31 | confidence interval, 2.5\% on each side.}
32 | 
33 | \item{...}{Other arguments passed to \code{\link[locfit]{locfit}} and
34 | subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful
35 | in cases where you get "out of vertex space" warnings; set the maxk higher
36 | than the default 100. See \code{\link[locfit]{locfit.raw}}().  But if you
37 | are getting that warning you should seriously consider increasing your
38 | window size.}
39 | }
40 | \value{
41 | A SNPset data frame with delta SNP-index thresholds corresponding to the
42 | requested confidence intervals matching the tricube smoothed depth at each SNP.
43 | }
44 | \description{
45 | The method for simulating delta SNP-index confidence interval thresholds
46 | as described in Takagi et al., (2013). Genotypes are randomly assigned for
47 | each individual in the bulk, based on the population structure. The total
48 | alternative allele frequency in each bulk is calculated and used, at each depth, to simulate
49 | delta SNP-indices, with a user-defined number of bootstrap replications.
50 | The requested confidence intervals are then calculated from the bootstraps.
51 | }
52 | \examples{
53 | df_filt <- runQTLseqAnalysis(
54 | SNPset = df_filt,
55 | bulkSize = c(25, 35),
56 | windowSize = 1e6,
57 | popStruc = "F2",
58 | replications = 10000,
59 | intervals = c(95, 99)
60 | )
61 | 
62 | }
63 | 


--------------------------------------------------------------------------------
/man/simulateAlleleFreq.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/takagi_sim.R
 3 | \name{simulateAlleleFreq}
 4 | \alias{simulateAlleleFreq}
 5 | \title{Randomly calculates an alternate allele frequency within a bulk}
 6 | \usage{
 7 | simulateAlleleFreq(n, pop = "F2")
 8 | }
 9 | \arguments{
10 | \item{n}{non-negative integer. The number of individuals in each bulk}
11 | 
12 | \item{pop}{the population structure. Defaults to "F2" and assumes "RIL" 
13 | population otherwise.}
14 | }
15 | \value{
16 | an alternate allele frequency within the bulk. Used for simulating 
17 | SNP-indices.
18 | }
19 | \description{
20 | Randomly calculates an alternate allele frequency within a bulk
21 | }
22 | 


--------------------------------------------------------------------------------
/man/simulateConfInt.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/takagi_sim.R
 3 | \name{simulateConfInt}
 4 | \alias{simulateConfInt}
 5 | \title{Simulation of delta SNP-index confidence intervals}
 6 | \usage{
 7 | simulateConfInt(popStruc = "F2", bulkSize, depth = 1:100,
 8 |   replications = 10000, filter = 0.3, intervals = c(0.05, 0.025))
 9 | }
10 | \arguments{
11 | \item{popStruc}{the population structure. Defaults to "F2" and assumes "RIL" otherwise}
12 | 
13 | \item{bulkSize}{non-negative integer vector. The number of individuals in
14 | each simulated bulk. Can be of length 1, then both bulks are set to the
15 | same size. Assumes the first value in the vector is the simulated high
16 | bulk.}
17 | 
18 | \item{depth}{integer. A read depth for which to replicate SNP-index calls.}
19 | 
20 | \item{replications}{integer. The number of bootstrap replications.}
21 | 
22 | \item{filter}{numeric. An optional minimum SNP-index filter}
23 | 
24 | \item{intervals}{numeric vector of probabilities with values in [0,1] 
25 | corresponding to the requested confidence intervals}
26 | }
27 | \value{
28 | A data frame of delta SNP-index thresholds corresponding to the
29 | requested confidence intervals at the user set depths.
30 | }
31 | \description{
32 | The method for simulating delta SNP-index confidence interval thresholds
33 | as described in Takagi et al., (2013). Genotypes are randomly assigned for
34 | each individual in the bulk, based on the population structure. The total
35 | alternative allele frequency in each bulk is calculated and used, at each depth, to simulate
36 | delta SNP-indices, with a user-defined number of bootstrap replications.
37 | The requested confidence intervals are then calculated from the bootstraps.
38 | }
39 | \examples{
40 | CI <-
41 | simulateConfInt(
42 |    popStruc = "F2",
43 |    bulkSize = 50,
44 |    depth = 1:100,
45 |    intervals = c(0.05, 0.95, 0.025, 0.975, 0.005, 0.995, 0.0025, 0.9975))
46 | }
47 | 


--------------------------------------------------------------------------------
/man/simulateSNPindex.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/takagi_sim.R
 3 | \name{simulateSNPindex}
 4 | \alias{simulateSNPindex}
 5 | \title{Simulates a delta SNP-index with replication}
 6 | \usage{
 7 | simulateSNPindex(depth, altFreq1, altFreq2, replicates = 10000,
 8 |   filter = NULL)
 9 | }
10 | \arguments{
11 | \item{depth}{integer. A read depth for which to replicate SNP-index calls.}
12 | 
13 | \item{altFreq1}{numeric. The alternate allele frequency for bulk A.}
14 | 
15 | \item{altFreq2}{numeric. The alternate allele frequency for bulk B.}
16 | 
17 | \item{replicates}{integer. The number of bootstrap replications.}
18 | 
19 | \item{filter}{numeric. an optional minimum SNP-index filter}
20 | }
21 | \value{
22 | Returns a vector of delta SNP-indices of length \code{replicates}
23 | }
24 | \description{
25 | Simulates a delta SNP-index with replication
26 | }
27 | 


--------------------------------------------------------------------------------
/man/tricubeStat.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/G_functions.R
 3 | \name{tricubeStat}
 4 | \alias{tricubeStat}
 5 | \title{Calculate tricube weighted statistics for each SNP}
 6 | \usage{
 7 | tricubeStat(POS, Stat, windowSize = 2e+06, ...)
 8 | }
 9 | \arguments{
10 | \item{POS}{A vector of genomic positions for each SNP}
11 | 
12 | \item{Stat}{A vector of values for a given statistic for each SNP}
13 | 
14 | \item{...}{Other arguments passed to \code{\link[locfit]{locfit}} and
15 | subsequently to \code{\link[locfit]{locfit.raw}}() (or the lfproc). Useful
16 | in cases where you get "out of vertex space" warnings; set the maxk higher
17 | than the default 100. See \code{\link[locfit]{locfit.raw}}().}
18 | 
19 | \item{windowSize}{the window size (in base pairs) bracketing each SNP for which
20 | to calculate the statistics. Magwene et al. recommend a window size of ~25
21 | cM, but also recommend optionally trying several window sizes to test if
22 | peaks are over- or undersmoothed.}
23 | }
24 | \value{
25 | Returns a vector of the weighted statistic calculated with a tricube
26 |   smoothing kernel
27 | }
28 | \description{
29 | Uses local regression (wrapper for \code{\link[locfit]{locfit}}) to predict a
30 | tricube smoothed version of the statistic supplied for each SNP. This works as a
31 | weighted average across neighboring SNPs that accounts for linkage
32 | disequilibrium (LD) while minimizing noise attributed to SNP calling errors.
33 | Values for neighboring SNPs within the window are weighted by physical
34 | distance from the focal SNP.
35 | }
36 | \examples{
37 | df_filt_4mb$Gprime <- tricubeStat(POS, Stat = GStat, windowSize = 4e6)
38 | }
39 | \seealso{
40 | \code{\link{getG}} for G statistic calculation
41 | 
42 | \code{\link[locfit]{locfit}} for local regression
43 | }
44 | 


--------------------------------------------------------------------------------
/src/.gitignore:
--------------------------------------------------------------------------------
1 | *.o
2 | *.so
3 | *.dll
4 | 


--------------------------------------------------------------------------------
/src/RcppExports.cpp:
--------------------------------------------------------------------------------
 1 | // Generated by using Rcpp::compileAttributes() -> do not edit by hand
 2 | // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
 3 | 
 4 | #include <Rcpp.h>
 5 | 
 6 | using namespace Rcpp;
 7 | 
 8 | // countSNPs_cpp
 9 | NumericVector countSNPs_cpp(NumericVector POS, double windowSize);
10 | RcppExport SEXP _QTLseqr_countSNPs_cpp(SEXP POSSEXP, SEXP windowSizeSEXP) {
11 | BEGIN_RCPP
12 |     Rcpp::RObject rcpp_result_gen;
13 |     Rcpp::RNGScope rcpp_rngScope_gen;
14 |     Rcpp::traits::input_parameter< NumericVector >::type POS(POSSEXP);
15 |     Rcpp::traits::input_parameter< double >::type windowSize(windowSizeSEXP);
16 |     rcpp_result_gen = Rcpp::wrap(countSNPs_cpp(POS, windowSize));
17 |     return rcpp_result_gen;
18 | END_RCPP
19 | }
20 | 
21 | static const R_CallMethodDef CallEntries[] = {
22 |     {"_QTLseqr_countSNPs_cpp", (DL_FUNC) &_QTLseqr_countSNPs_cpp, 2},
23 |     {NULL, NULL, 0}
24 | };
25 | 
26 | RcppExport void R_init_QTLseqr(DllInfo *dll) {
27 |     R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
28 |     R_useDynamicSymbols(dll, FALSE);
29 | }
30 | 


--------------------------------------------------------------------------------
/src/countSNPs.cpp:
--------------------------------------------------------------------------------
 1 | #include <Rcpp.h>
 2 | using namespace Rcpp;
 3 | 
 4 | //' Count number of SNPs within a sliding window
 5 | //' 
 6 | //' For each SNP returns how many SNPs are bracketing it within the set window size
 7 | //' 
 8 | //' @param POS A numeric vector of genomic positions for each SNP
 9 | //' @param windowSize The required window size
10 | //' @export countSNPs_cpp
11 | // [[Rcpp::export]]
12 | NumericVector countSNPs_cpp(NumericVector POS, double windowSize) {
13 |     unsigned int nout=POS.size(), i, left=0, right=0;
14 |     NumericVector out(nout);
15 |     
16 |     for( i=0; i < nout; i++ ) {
17 |         while ((right + 1 < nout) && (POS[right + 1] <= POS[i] + windowSize / 2))
18 |             right++;
19 |         
20 |         while (POS[left] <= POS[i] - windowSize / 2)
21 |             left++;
22 |         
23 |         out[i] = right - left + 1 ;
24 |     }
25 |     return out;
26 | }
27 | 
28 | 
29 | // You can include R code blocks in C++ files processed with sourceCpp
30 | // (useful for testing and development). The R code will be automatically 
31 | // run after the compilation.
32 | //
33 | 
34 | /*** R
35 | countSNPs_cpp(c(1, 5, 10, 20), 10)
36 | */
37 | 


--------------------------------------------------------------------------------
/vignettes/.gitignore:
--------------------------------------------------------------------------------
1 | QTLseqr_cache
2 | QTLseqr_files
3 | 


--------------------------------------------------------------------------------
/vignettes/QTLseqr.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Next-generation sequencing bulk segregant analysis with QTLseqr"
  3 | author: "Ben N. Mansfeld and Rebecca Grumet"
  4 | date: "`r Sys.Date()`"
  5 | output: 
  6 |   pdf_document:
  7 |     toc: yes
  8 | graphics: yes
  9 | urlcolor: blue
 10 | vignette: >
 11 |   %\VignetteIndexEntry{NGS-BSA with QTLseqr}
 12 |   %\VignetteEncoding{UTF-8}
 13 |   %\VignetteEngine{knitr::rmarkdown}
 14 | header-includes: \usepackage{graphicx}
 15 |   \usepackage{float}
 16 | ---
 17 | 
 18 | ```{r setup, echo=FALSE, results="hide"}
 19 | knitr::opts_chunk$set(tidy=FALSE, cache=TRUE,
 20 |                       dev="png",
 21 |                       message=FALSE, error=FALSE, warning=TRUE)
 22 | ```
 23 | # Current version: Development - `r packageVersion("QTLseqr")`   
 24 | 
 25 | # Introduction
 26 | 
 27 | ## Citations
 28 | 
 29 | **If you use QTLseqr in published research, please cite:**
 30 | 
 31 | > Mansfeld B.N. and Grumet R,
 32 | > QTLseqr: An R package for bulk segregant analysis with next-generation sequencing
 33 | > *The Plant Genome* [doi:10.3835/plantgenome2018.01.0006](https://dl.sciencesocieties.org/publications/tpg/abstracts/11/2/180006)
 34 | 
 35 | We also recommend citing the paper for the corresponding method you work with.
 36 | 
 37 | QTL-seq method:
 38 | 
 39 | > Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka, C., Uemura, A., Utsushi,
 40 | > H., Tamiru, M., Takuno, S., Innan, H., Cano, L. M., Kamoun, S. and Terauchi, R. (2013), 
 41 | > QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA 
 42 | > from two bulked populations. *Plant J*, 74: 174–183. [doi:10.1111/tpj.12105](https://onlinelibrary.wiley.com/doi/full/10.1111/tpj.12105)
 43 | 
 44 | G prime method:
 45 | 
 46 | > Magwene PM, Willis JH, Kelly JK (2011) The Statistics of Bulk Segregant Analysis Using Next 
 47 | > Generation Sequencing. *PLOS Computational Biology* 7(11): e1002255. 
 48 | > [doi.org/10.1371/journal.pcbi.1002255](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002255)
 49 | 
 50 | ## Quick Start
 51 | Here are the basic steps required to run and plot QTLseq and $G'$ analysis:
 52 | 
 53 | ```{r installGitHub, eval = FALSE}
 54 | # download and install the devtools package
 55 | install.packages("devtools")
 56 | 
 57 | # download the QTLseqr package from GitHub
 58 | devtools::install_github("bmansfeld/QTLseqr")
 59 | ```
 60 | 
 61 | ```{r quickStart, eval=FALSE}
 62 | #load the package
 63 | library("QTLseqr")
 64 | 
 65 | #Set sample and file names
 66 | HighBulk <- "SRR834931"
 67 | LowBulk <- "SRR834927"
 68 | file <- "SNPs_from_GATK.table"
 69 | 
 70 | #Choose which chromosomes will be included in the analysis (i.e. exclude smaller contigs)
 71 | Chroms <- paste0(rep("Chr", 12), 1:12)
 72 | 
 73 | #Import SNP data from file
 74 | df <-
 75 |     importFromGATK(
 76 |         file = file,
 77 |         highBulk = HighBulk,
 78 |         lowBulk = LowBulk,
 79 |         chromList = Chroms
 80 |     )
 81 | 
 82 | #Filter SNPs based on some criteria
 83 | df_filt <-
 84 |     filterSNPs(
 85 |         SNPset = df,
 86 |         refAlleleFreq = 0.20,
 87 |         minTotalDepth = 100,
 88 |         maxTotalDepth = 400,
 89 |         minSampleDepth = 40,
 90 |         minGQ = 99
 91 |     )
 92 | 
 93 | #Run G' analysis
 94 | df_filt <- runGprimeAnalysis(SNPset = df_filt,
 95 |                              windowSize = 1e6,
 96 |                              outlierFilter = "deltaSNP")
 97 | 
 98 | #Run QTLseq analysis
 99 | df_filt <- runQTLseqAnalysis(
100 |     SNPset = df_filt,
101 |     windowSize = 1e6,
102 |     popStruc = "F2",
103 |     bulkSize = c(300, 450),
104 |     replications = 10000,
105 |     intervals = c(95, 99)
106 | )
107 | 
108 | #Plot
109 | plotQTLStats(
110 |     SNPset = df_filt,
111 |     var = "Gprime",
112 |     plotThreshold = TRUE,
113 |     q = 0.01
114 | )
115 | 
116 | plotQTLStats(
117 |     SNPset = df_filt,
118 |     var = "deltaSNP",
119 |     plotIntervals = TRUE)
120 | 
121 | #export summary CSV
122 | getQTLTable(
123 |     SNPset = df_filt,
124 |     alpha = 0.01,
125 |     export = TRUE,
126 |     fileName = "my_BSA_QTL.csv"
127 | )
128 | ```
129 | 
130 | # Standard workflow
131 | 
132 | ## Installation 
133 | Let's install and load the QTLseqr package:
134 | ```{r install_load, eval=FALSE}
135 | #Install step if you have not done so yet:
136 | #install.packages("devtools")
137 | 
138 | devtools::install_github("bmansfeld/QTLseqr")
139 | library("QTLseqr")
140 | ```
141 | 
142 | Great! We now need to load some data. QTLseqr has a sister package, [Yang2013data](https://github.com/bmansfeld/Yang2013data) (derived from [Yang et al. (2013)](https://doi.org/10.1371/journal.pone.0068433)), that has some trial data you can play with.
143 | 
144 | ## Input data
145 | 
146 | QTLseqr supports importing data either from a table file exported by the VariantsToTable tool built into GATK or from a delimited file containing allele read depths for each bulk. The two available functions are `importFromGATK` and `importFromTable`.
147 | 
148 | Both functions import the SNP data and perform some preliminary calculations to find the following:
149 | 
150 | $$Reference\ allele\ frequency = \frac{Ref\ allele\ depth_{HighBulk} + Ref\ allele\ depth_{LowBulk}}{Total\ read\ depth\ for\ both\ bulks}$$
151 | 
152 | $$SNP\text{-}index_{per\ bulk} = \frac{Alternate\ allele\ depth}{Total\ read\ depth}$$
153 | 
154 | $$\Delta (SNP\text{-}index) = SNP \text{-} index_{High Bulk} - SNP\text{-}index_{Low Bulk}$$
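
As a minimal sketch of these three calculations, the base R snippet below works through a single hypothetical SNP. The allele depths and variable names are made up for illustration and are not taken from the package source, and the total depth of each bulk is assumed to be the sum of its two allele depths. The import functions compute the same quantities for every SNP.

```r
# Hypothetical allele depths for one SNP (illustrative values only)
ad_ref_high <- 26; ad_alt_high <- 22   # high bulk
ad_ref_low  <- 34; ad_alt_low  <- 36   # low bulk

# Assumption: total depth per bulk is the sum of both allele depths
dp_high <- ad_ref_high + ad_alt_high
dp_low  <- ad_ref_low + ad_alt_low

# Reference allele frequency across both bulks
ref_frq <- (ad_ref_high + ad_ref_low) / (dp_high + dp_low)

# Per-bulk SNP-index: fraction of reads supporting the alternate allele
snpindex_high <- ad_alt_high / dp_high
snpindex_low  <- ad_alt_low / dp_low

# Delta SNP-index: high bulk minus low bulk
delta_snp <- snpindex_high - snpindex_low

round(c(REF_FRQ = ref_frq, SNPindex.HIGH = snpindex_high,
        SNPindex.LOW = snpindex_low, deltaSNP = delta_snp), 3)
```

With these depths the low bulk carries a slightly larger share of alternate reads, so the $\Delta (SNP\text{-}index)$ is slightly negative.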
155 | 
156 | To demonstrate the use of the import functions we will first load the Yang et al. (2013) data file.
157 | We first need to download the package that contains the data from github.
158 | ```{r installdata}
159 | #download and load data package (~50Mb)
160 | devtools::install_github("bmansfeld/Yang2013data")
161 | library("Yang2013data")
162 | 
163 | #Import the data
164 | rawData <- system.file(
165 |     "extdata", 
166 |     "Yang_et_al_2013.table", 
167 |     package = "Yang2013data", 
168 |     mustWork = TRUE)
169 | ```
170 | If you have your own data you can simply refer to it directly:
171 | ```{r, eval = FALSE}
172 | rawData <- "C:/PATH/TO/MY/DIR/My_BSA_data.table"
173 | ```
174 | 
175 | We define the sample name for each of the bulks. We can also define a vector of the chromosomes to be included in the analysis (i.e. exclude smaller contigs). In this case, Chr1, Chr2 ... Chr12.
176 | ```{r}
177 | HighBulk <- "SRR834931"
178 | LowBulk <- "SRR834927"
179 | Chroms <- paste0(rep("Chr", 12), 1:12)
180 | ```
181 | 
182 | ### Importing SNPs from GATK
183 | 
184 | Working directly with the [GATK best practices guide](https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS) for whole genome sequencing should result in a VCF that is compatible with QTLseqr. In general, the workflow suggested by GATK is per-sample variant calling followed by joint genotyping across samples. This will produce a VCF file that includes **BOTH** bulks, each with a different sample name (here SRR834927 and SRR834931). Here is one SNP, for example:
185 | 
186 | ```{r VCFrow, echo=FALSE, warning=FALSE}
187 | library(kableExtra)
188 | x <- data.frame(CHROM = "Chr1", POS = 31071, ID = ".", REF = "A", ALT = "G", QUAL = 1390.44, FILTER = "PASS", INFO = "..\\*...", FORMAT = "GT:AD:DP:GQ:PL", SRR834927 = "0/1:34,36:70:99:897,0,855", SRR834931 = "0/1:26,22:48:99:522,0,698")
189 | kable_styling(knitr::kable(x = x, format = "latex", booktabs = TRUE), latex_options = "scale_down")
190 | ```
191 | \**info column removed for brevity*
192 | 
193 | GATK have provided a fast VCF parser, the [VariantsToTable](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php) tool, that extracts the necessary fields for easy use in downstream analysis. 
194 | 
195 | We highly recommend reading [What is a VCF and how should I interpret it?](http://gatkforums.broadinstitute.org/gatk/discussion/1268/what-is-a-vcf-and-how-should-i-interpret-it) for more information on GATK VCF Fields and Genotype Fields.
196 | 
197 | Though the use of GATK's VariantsToTable function is out of the scope of this vignette, the syntax for use with QTLseqr should look something like this:
198 | 
199 | ```{bash, eval=FALSE}
200 | java -jar GenomeAnalysisTK.jar \
201 | -T VariantsToTable \
202 | -R ${REF} \
203 | -V ${NAME} \
204 | -F CHROM -F POS -F REF -F ALT \
205 | -GF AD -GF DP -GF GQ -GF PL \
206 | -o ${NAME}.table
207 | ```
208 | Where `${REF}` is the reference genome file and `${NAME}` is the VCF file you wish to parse.
209 | 
210 | To run QTLseqr successfully, the required VCF fields `(-F)` are CHROM (Chromosome) and POS (Position). The required Genotype fields `(-GF)` are AD (Allele Depth) and DP (Depth). Recommended VCF fields are REF (Reference allele) and ALT (Alternative allele). Recommended Genotype fields are PL (Phred-scaled likelihoods) and GQ (Genotype Quality).
211 | 
212 | #### The `importFromGATK` function
213 | 
214 | The `importFromGATK` function imports SNP data from the output of the VariantsToTable function in GATK. After importing the data, the function then calculates total reference allele frequency for both bulks together, the SNP index for each bulk, and the $\Delta (SNP\text{-}index)$.
215 | 
216 | We now use the `importFromGATK` function to import the raw data; it returns a data frame that includes the calculated values described above.
217 | 
218 | Let's import:
219 | ```{r import, cache=TRUE}
220 | #import data
221 | df <-
222 |     importFromGATK(
223 |         file = rawData,
224 |         highBulk = HighBulk,
225 |         lowBulk = LowBulk,
226 |         chromList = Chroms
227 |      )
228 | ```
229 | 
230 | ### Importing from a delimited file
231 | 
232 | You can also import SNPs from a csv, tsv or any other delimited file. Your file must include some necessary columns: CHROM (Chromosome names) and POS (the SNP position) as well as the reference and alternate allele depths (number of reads supporting each allele). The allele depths should be in columns named in this format: *AD_\<ALT/REF\>.\<sampleName\>*. For example, the column for alternate allele depth for a high bulk sample named "sample1" should be "AD_ALT.sample1". Any other columns describing the SNPs are allowed, e.g. the actual allele calls or a quality score. If a column is bulk specific, it should be named columnName.sampleName, e.g. "QUAL.sample1".
233 | 
234 | Here is the header of an example file:
235 | ```{r csvexample, echo=FALSE}
236 | x <-
237 |     data.frame(
238 |     CHROM = rep("Chr1", 5),
239 |     POS = c(31071, 31478, 33667, 34057, 35239),
240 |     REF = c("A", "C", "A", "C", "A"),
241 |     ALT = c("G", "T", "G", "T", "C"),
242 |     AD_REF.SRR834927 = c(34, 34, 20, 38, 25),
243 |     AD_ALT.SRR834927 = c(36, 52, 48, 40, 36),
244 |     GQ.SRR834927 = rep(99, 5),
245 |     AD_REF.SRR834931 = c(26, 40, 24, 29, 40),
246 |     AD_ALT.SRR834931 = c(22, 34, 29, 26, 60),
247 |     GQ.SRR834931 = rep(99, 5)
248 |     )
249 | 
250 | kableExtra::kable_styling(knitr::kable(x = x, format = "latex", booktabs = TRUE), latex_options = "scale_down")
251 | ```
252 | 
253 | #### The `importFromTable` function
254 | 
255 | To load the data from a delimited file we use the `importFromTable` function. In our case the file is a csv, so it is comma delimited. The columns are denoted by the sample names for the bulks (SRR834931 and SRR834927). These sample names will be replaced with "HIGH" and "LOW". If you want, you can name the columns this way in advance, as the defaults for the `highBulk` and `lowBulk` arguments are "HIGH" and "LOW", respectively. As in the `importFromGATK` function, the `chromList` argument should be a string vector of the chromosomes to be used in the analysis. This is useful for filtering out unwanted contigs, etc.
256 | 
257 | ```{r importcsv, eval=FALSE}
258 | df <- importFromTable(file = "Yang2013.csv",
259 |                       highBulk = HighBulk,
260 |                       lowBulk = LowBulk,
261 |                       chromList = Chroms)
262 | ```
263 | 
264 | ## Loaded data frame
265 | The loaded data frame should look like this. This data frame has other information about the genotype calls from GATK:
266 | ```{r viewdf}
267 | head(df)
268 | ```
269 | 
270 | Let's review the column headers:
271 | 
272 | * CHROM - The chromosome this SNP is in
273 | * POS - The position on the chromosome in nt
274 | * REF - The reference allele at that position
275 | * ALT - The alternate allele
276 | * DP.HIGH - The read depth at that position in the high bulk
277 | * AD_REF.HIGH - The allele depth of the reference allele in the high bulk
278 | * AD_ALT.HIGH - The alternate allele depth in the high bulk
279 | * GQ.HIGH - The genotype quality score, (how confident we are in the genotyping)
280 | * SNPindex.HIGH - The calculated SNP-index for the high bulk
281 | * Same as above for the low bulk
282 | * REF_FRQ - The reference allele frequency as defined above
283 | * deltaSNP - The $\Delta (SNP\text{-}index)$ as defined above
284 | 
285 | ## Filtering SNPs
286 | Now that we have loaded the data into R we can start cleaning it up by filtering some of the low confidence SNPs.
287 | While GATK has its own filtering tools, QTLseqr offers some options for filtering that may help reduce noise and improve results. Filtering is mainly based on read depth for each SNP, such that we can try to eliminate SNPs with low confidence, due to low coverage, and SNPs that may be in repetitive regions and thus have inflated read depth. You can also filter based on the absolute difference in read depth between the bulks. If you have your own quality columns you can always use R's base subsetting functions to filter out any SNPs you don't want. 
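
For instance, base subsetting on a custom column might look like the sketch below. The toy data frame, the `MY_QUAL` column and both thresholds are hypothetical and only illustrate the idea; they are not part of QTLseqr.

```r
# Toy SNP set with a hypothetical user-supplied quality column (MY_QUAL)
snps <- data.frame(
    CHROM = rep("Chr1", 4),
    POS = c(31071, 31478, 33667, 34057),
    DP.HIGH = c(48, 74, 53, 55),
    DP.LOW = c(70, 86, 68, 78),
    MY_QUAL = c(60, 12, 45, 30)
)

# Keep SNPs with a total depth of at least 120 and MY_QUAL of at least 30
keep <- (snps$DP.HIGH + snps$DP.LOW) >= 120 & snps$MY_QUAL >= 30
snps_filt <- snps[keep, ]
snps_filt
```

Here the first SNP is dropped for low total depth and the second for low quality, leaving the last two rows.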
288 | 
289 | ### Read depth histograms
290 | 
291 | One way to assess filtering thresholds and check the quality of our data is by plotting histograms of the read depths. We can get an idea of where to draw our thresholds. We'll use the ggplot2 package for this purpose, but you could use base R to plot as well.
292 | 
293 | Let's look at total read depth, for example:
294 | ```{r plothist1, warning = FALSE, fig.align="center", fig.width=4, fig.height=3, dpi=300}
295 | library("ggplot2")
296 | ggplot(data = df) + 
297 |     geom_histogram(aes(x = DP.HIGH + DP.LOW)) + 
298 |     xlim(0,1000)
299 | 
300 | ```
301 | 
302 | ...or look at total reference allele frequency:
303 | ```{r plothist2, warning=FALSE, fig.align = "center", fig.width=4, fig.height=3, dpi=300}
304 | ggplot(data = df) +
305 |     geom_histogram(aes(x = REF_FRQ))
306 | ```
307 | 
308 | We can plot our per-bulk SNP-index to check if our data is good. We expect to find two small peaks, one at each end, and most of the SNPs should be approximately normally distributed around 0.5 in an F2 population. Here is the HIGH bulk for example:
309 | 
310 | ```{r plotSNPindex, warning=FALSE, fig.align = "center", fig.width=4, fig.height=3, dpi=300}
311 | ggplot(data = df) +
312 |     geom_histogram(aes(x = SNPindex.HIGH))
313 | ```
314 | 
315 | 
316 | ### Using the filterSNPs function
317 | Now that we have an idea about our read depth distribution we can filter out low confidence SNPs. In general we recommend filtering extremely low and high coverage SNPs, either in both bulks combined (`minTotalDepth/maxTotalDepth`) and/or in each bulk separately (`minSampleDepth`). We also have the option to filter based on reference allele frequency (`refAlleleFreq`), which removes SNPs that for some reason are over- or under-represented in *BOTH* bulks. We can also filter SNPs that have large discrepancies in read depth between the bulks (`depthDifference`), e.g. one bulk has a depth of 500 and the other has 5. Such discrepancies can throw off the G statistic. Finally, we can use the GATK GQ score (Genotype Quality, `minGQ`) to filter out low confidence SNPs. If the `verbose` parameter is set to `TRUE` (default) the function will report the number of SNPs filtered in each step.
318 | ```{r filtSNPs-source, eval = FALSE, message = FALSE}
319 | df_filt <-
320 |     filterSNPs(
321 |         SNPset = df,
322 |         refAlleleFreq = 0.20,
323 |         minTotalDepth = 100,
324 |         maxTotalDepth = 400, 
325 |         depthDifference = 100,
326 |         minSampleDepth = 40,
327 |         minGQ = 99,
328 |         verbose = TRUE
329 |     )
330 | ```
331 | 
332 | ```{r filtSNPs-msgs, message = TRUE, warning = FALSE, collapse = TRUE, echo = FALSE}
333 | df_filt <-
334 |     filterSNPs(
335 |         SNPset = df,
336 |         refAlleleFreq = 0.20,
337 |         minTotalDepth = 100,
338 |         maxTotalDepth = 400,
339 |         depthDifference = 100,
340 |         minSampleDepth = 40,
341 |         minGQ = 99,
342 |         verbose = TRUE
343 |     )
344 | ```
345 | 
346 | This step is quick and we can go back and plot some histograms to see if we are happy with the results, and we can quickly re-run the filtering step if not.
347 | 
348 | ## Running the analysis
349 | 
350 | The analysis in QTLseqr is an implementation of both pipelines for bulk segregant analysis, $G'$ and $\Delta (SNP\text{-}index)$, described by Magwene et al. (2011) and Takagi et al. (2013), respectively. We recommend reading both papers to fully understand the considerations and math behind the analysis. 
351 | 
352 | There are two main analysis functions: 
353 | 
354 | 1. `runGprimeAnalysis` - performs Magwene et al. type $G'$ analysis
355 | 1. `runQTLseqAnalysis` - performs Takagi et al. type QTLseq analysis
356 | 
357 | ### A note about window sizes
358 | 
359 | QTLseqr utilizes a Nadaraya-Watson smoothing kernel to produce tricube-smoothed statistics for analysis. These smoothed statistics function as a weighted moving average across neighboring SNPs that accounts for linkage disequilibrium (LD), while minimizing noise attributed to SNP calling errors (Magwene et al., 2011). In a tricube-weighted window, SNPs that are close to the focal SNP have a high weighting value, while SNPs closer to the edge of the window have low weights. The tricube-smoothed values are predicted by constant local regression within each chromosome. The calculations are performed using the `locfit` function from the locfit package using a user defined window size and the degree of the polynomial set to zero.
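
To make the weighting concrete, here is a minimal base R sketch of a tricube-weighted average at a single focal SNP. It is an illustration only, not the package's implementation (which relies on `locfit`): the helper `tricube_weights`, the toy positions and statistics, and the assumption that the window is centered on the focal SNP with a half-width of `window_size / 2` are all hypothetical.

```r
# Tricube weight: (1 - d^3)^3 for scaled distance d < 1, zero outside the window.
# ASSUMPTION: the window is centered on the focal SNP (half-width = window_size / 2).
tricube_weights <- function(pos, focal, window_size) {
    d <- abs(pos - focal) / (window_size / 2)
    ifelse(d < 1, (1 - d^3)^3, 0)
}

# Made-up SNP positions (bp) and a made-up per-SNP statistic
pos  <- c(100000, 300000, 500000, 700000, 900000)
stat <- c(0.1, 0.2, 0.6, 0.3, 0.1)

# Weights relative to the focal SNP at 500 kb, using a 1 Mb window
w <- tricube_weights(pos, focal = 500000, window_size = 1e6)

# Degree-zero (constant) local regression reduces to a weighted mean
smoothed <- sum(w * stat) / sum(w)
```

The focal SNP gets a weight of 1 and the weights fall off smoothly towards the window edges, so distant SNPs contribute little to the smoothed value.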
360 | 
361 | As this window is using a tricube-smoothing kernel, the window size *can* be much larger than you might expect. However, in the rice examples below, we choose a window size of 1Mb for the sliding window analysis to replicate the original results published in Yang et al. (2013). For a discussion about window size, we recommend reading Magwene et al. (2011). In general, larger windows will produce smoother data. The functions making these calculations are rather fast, so we recommend testing several window sizes for your data, and then deciding on the optimal size.
362 | 
363 | When running either analysis function above, some users will get a "newsplit: out of vertex space" error coming from the `locfit` function. This is a memory allocation error that usually happens when the window size is set too small, or when the organism's genome is very large and the default window size is therefore too small. In such cases, the user may pass a higher `maxk` value to either analysis function (the default in `locfit.raw` is 100). **However**, if you are getting this error, you should _*seriously*_ consider increasing your window size, as it is probably too small to effectively manage the noise in your data. Please read the literature and make educated choices about your selected window size. 
364 | 
365 | ### QTLseq analysis
366 | Takagi et al. (2013) developed the method for QTLseq type NGS-BSA. The analysis is based on calculating the allele frequency differences, or $\Delta (SNP\text{-}index)$, from the allele depths at each SNP. To determine regions of the genome that significantly differ from the expected $\Delta (SNP\text{-}index)$ of 0, a simulation approach is used. Briefly, at each read depth, simulated SNP frequencies are bootstrapped, and the extreme quantiles are used as simulated confidence intervals. The true data are averaged over a sliding window and regions that surpass the CI are putative QTL.
367 | 
368 | When the analysis is run the following steps are performed:
369 | 
370 | 1. First, the number of SNPs within the sliding window is counted.
371 | 
372 | 1. A tricube-smoothed $\Delta (SNP\text{-}index)$ is calculated within the set window size.
373 | 
374 | 1. The minimum read depth at each position is calculated and the tricube-smoothed depth is calculated for the window.
375 | 
376 | 1. The simulation is performed for data-derived read depths (can be set by the user): 
377 | The alternate allele frequency is calculated per bulk based on the population type (F2 or RIL) and the bulk size. The $\Delta (SNP\text{-}index)$ is then simulated over several replications (default = 10000) for each bulk. The quantiles from the simulations are used to estimate the confidence intervals. For example, the 99th quantile of 10000 $\Delta (SNP\text{-}index)$ simulations represents the 99% confidence interval for the true data.
378 | 
379 | 1. Confidence intervals are matched with the relevant window depth at each SNP.
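
The simulation idea behind these steps can be sketched in base R for a single read depth. This is a rough illustration under stated assumptions, not the code that `runQTLseqAnalysis` runs: the function `simulate_delta_ci` is hypothetical, the F2 genotypes are assumed to segregate 1:2:1, and reads are assumed to be drawn binomially at the given depth.

```r
set.seed(1)

# Hypothetical helper: simulated two-sided confidence interval of the
# delta SNP-index at one read depth, for an F2 population under the null (no QTL).
simulate_delta_ci <- function(depth, bulk_size, replications = 10000, interval = 95) {
    # ASSUMPTION: each F2 individual carries 0, 1 or 2 alternate alleles (1:2:1),
    # so its alternate allele frequency is 0, 0.5 or 1
    sim_freq <- function(n) {
        mean(sample(c(0, 0.5, 1), size = n, replace = TRUE, prob = c(0.25, 0.5, 0.25)))
    }
    freq_high <- replicate(replications, sim_freq(bulk_size[1]))
    freq_low  <- replicate(replications, sim_freq(bulk_size[2]))

    # ASSUMPTION: reads sample alleles binomially from each bulk at the given depth
    snp_high <- rbinom(replications, size = depth, prob = freq_high) / depth
    snp_low  <- rbinom(replications, size = depth, prob = freq_low) / depth
    delta <- snp_high - snp_low

    # Two-sided interval taken from the extreme quantiles of the simulations
    tail_prob <- (1 - interval / 100) / 2
    quantile(delta, probs = c(tail_prob, 1 - tail_prob))
}

# Depth of 50 reads, bulk sizes as in the example below; fewer replications for speed
ci <- simulate_delta_ci(depth = 50, bulk_size = c(385, 430), replications = 2000)
ci
```

Under the null the simulated interval straddles 0, and it narrows as the read depth increases, which is why the intervals are matched to the window depth at each SNP.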
380 | 
381 | Here is an example for running the analysis for an F2 population. In Yang et al. (2013), the bulks are of different sizes (385 and 430 for high and low bulk respectively), so we set `bulkSize = c(385, 430)`. If your bulks are the same size you can simply set one value, i.e. `bulkSize = 25`. The simulation is bootstrapped 10000 times and the two-sided 95 and 99% confidence intervals are calculated:
382 | 
383 | ```{r qtlseqanalysis-src, eval = FALSE}
384 | df_filt <- runQTLseqAnalysis(df_filt,
385 |     windowSize = 1e6,
386 |     popStruc = "F2", 
387 |     bulkSize = c(385, 430), 
388 |     replications = 10000, 
389 |     intervals = c(95, 99)
390 |     )
391 | ```
392 | ```{r qtlseqanalysis-msg, message = TRUE, warning = FALSE, collapse = TRUE, echo = FALSE}
393 | df_filt <- runQTLseqAnalysis(
394 |     df_filt,
395 |     windowSize = 1e6,
396 |     popStruc = "F2", 
397 |     bulkSize = c(385, 430),
398 |     replications = 10000, 
399 |     intervals = c(95, 99)
400 |     )
401 | ```
402 | 
403 | ### G' analysis
404 | An alternative approach for determining the statistical significance of QTL from NGS-BSA was proposed by Magwene et al. (2011): calculating a modified G statistic for each SNP based on the observed and expected allele depths, and smoothing this value using a tricube smoothing kernel. Using the smoothed G statistic, or $G'$, Magwene et al. allow for noise reduction while also addressing linkage disequilibrium between SNPs. Furthermore, as $G'$ is close to being log-normally distributed, p-values can be estimated for each SNP using non-parametric estimation of the null distribution of $G'$. This provides a clear and easy-to-interpret result as well as the option for multiple testing corrections.
405 | 
406 | Here, we will briefly summarize the steps performed by the main analysis function, `runGprimeAnalysis`.
407 | 
408 | The following steps are performed:
409 | 
410 | 1. First, the number of SNPs within the sliding window is counted.
411 | 
412 | 1. A tricube-smoothed $\Delta (SNP\text{-}index)$ is calculated within the set window size.
413 | 
414 | 1. Genome-wide G statistics are calculated by `getG`.
415 |     $G$ is defined by the equation:
416 | 
417 |     $$G = 2 \sum_{i=1}^{4} n_i \ln\left(\frac{obs(n_i)}{exp(n_i)}\right)$$
418 | 
419 |     where, for each SNP, $n_i$ from $i = 1$ to $4$ corresponds to the reference and alternate allele depths for each bulk, as described in the following table:
420 | 
421 |     |Allele|High Bulk|Low Bulk|
422 |     |------|---------|--------|
423 |     |Reference| $n_1$ | $n_2$ |
424 |     |Alternate| $n_3$ | $n_4$ |
425 | 
426 |     ...and $obs(n_i)$ are the observed allele depths as described in the data frame. `getG` calculates the G statistic using expected values assuming read depth is equal for all alleles in both bulks:
427 |     $$
428 |     exp(n_1) = \frac{(n_1 + n_2)*(n_1 + n_3)}{(n_1 + n_2 + n_3 + n_4)}
429 |     $$ 
430 |     $$
431 |     exp(n_2) = \frac{(n_2 + n_1)*(n_2 + n_4)}{(n_1 + n_2 + n_3 + n_4)}
432 |     $$
433 |     $$
434 |     exp(n_3) = \frac{(n_3 + n_1)*(n_3 + n_4)}{(n_1 + n_2 + n_3 + n_4)}
435 |     $$
436 |     $$
437 |     exp(n_4) = \frac{(n_4 + n_2)*(n_4 + n_3)}{(n_1 + n_2 + n_3 + n_4)}
438 |     $$
439 | 
440 | 
441 | 1. $G'$ - A tricube-smoothed G statistic is predicted by constant local regression within each chromosome using the `tricubeStat` function. This works as a weighted average across neighboring SNPs that accounts for linkage disequilibrium (LD) while minimizing noise attributed to SNP calling errors. G values for neighboring SNPs within the window are weighted by physical distance from the focal SNP. 
442 | 
443 | 1. P-values are estimated using the non-parametric method described by Magwene et al. (2011) with the function `getPvals`. Briefly, using the natural log of $G'$, a median absolute deviation (MAD) is calculated. The $G'$ set is trimmed to exclude outlier regions (i.e. QTL) based on Hampel's rule. An alternate method for filtering out QTL that we propose is using absolute $\Delta (SNP\text{-}index)$ values greater than a set threshold (default = 0.1) to filter out potential QTL. An estimation of the mode of the trimmed set is calculated using the `mlv` function from the package `modeest`. Finally, the mean and variance of the set are estimated using the median and mode, and p-values are estimated from a log-normal distribution. 
444 | 
445 | 1. The negative log10 of the p-values is calculated, and Benjamini-Hochberg adjusted p-values are calculated using `p.adjust`.
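
As a worked example of the G statistic defined in the steps above, the following base R sketch computes G for a single SNP directly from the formula and the expected values. The allele depths are made-up numbers chosen for illustration, and this is not the source of `getG`.

```r
# Hypothetical allele depths at one SNP (see the table above):
# n1 = reference/high bulk, n2 = reference/low bulk,
# n3 = alternate/high bulk, n4 = alternate/low bulk
n <- c(n1 = 30, n2 = 10, n3 = 10, n4 = 30)
N <- sum(n)

# Expected depths, following the four equations above
expected <- c(
    n1 = (n[["n1"]] + n[["n2"]]) * (n[["n1"]] + n[["n3"]]),
    n2 = (n[["n2"]] + n[["n1"]]) * (n[["n2"]] + n[["n4"]]),
    n3 = (n[["n3"]] + n[["n1"]]) * (n[["n3"]] + n[["n4"]]),
    n4 = (n[["n4"]] + n[["n2"]]) * (n[["n4"]] + n[["n3"]])
) / N

# G = 2 * sum(n_i * ln(obs(n_i) / exp(n_i)))
# Note: a depth of zero would give 0 * log(0) = NaN and needs separate handling
G <- 2 * sum(n * log(n / expected))
G
```

Here every expected depth is 20, so the skew of the reference allele towards the high bulk gives a G of roughly 20.9. With identical depths in all four cells, G would be 0.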
446 | 
447 | Let's run the function:
448 | 
449 | ```{r gprimeanalysis-src, eval = FALSE}
450 | df_filt <- runGprimeAnalysis(df_filt,
451 |     windowSize = 1e6,
452 |     outlierFilter = "deltaSNP",
453 |     filterThreshold = 0.1)
454 | ```
455 | ```{r gprimeanalysis-msg, message = TRUE, warning = FALSE, collapse = TRUE, echo = FALSE}
456 | df_filt <- runGprimeAnalysis(df_filt,
457 |     windowSize = 1e6,
458 |     outlierFilter = "deltaSNP",
459 |     filterThreshold = 0.1)
460 | ```
461 | 
462 | Some additional columns are added to the filtered data frame:
463 | ```{r}
464 | head(df_filt)
465 | ```
466 | 
467 | * nSNPs - the number of SNPs bracketing the focal SNP within the set sliding window
468 | * tricubeDeltaSNP - the tricube-smoothed $\Delta (SNP\text{-}index)$
469 | * G - the G value for the SNP
470 | * Gprime - the tricube-smoothed G value
471 | * pvalue - the p-value calculated by non-parametric estimation
472 | * negLog10Pval - the $-log_{10}(p\text{-}value)$ 
473 | * qvalue - Benjamini-Hochberg adjusted p-values
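
The relationship between the last three columns can be reproduced with base R alone. The p-values below are invented for illustration; only `p.adjust` and `log10` are real functions here.

```r
# Made-up p-values for five SNPs
pvalue <- c(0.0001, 0.004, 0.03, 0.2, 0.8)

# -log10(p-value): larger values mean stronger evidence
negLog10Pval <- -log10(pvalue)

# Benjamini-Hochberg adjusted p-values (q-values)
qvalue <- p.adjust(pvalue, method = "BH")

data.frame(pvalue, negLog10Pval, qvalue)
```

Adjusted p-values are never smaller than the raw p-values, so fewer SNPs pass a given threshold after the multiple testing correction.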
474 | 
475 | ## Plotting the data
476 | 
477 | QTLseqr offers two main plotting functions to check the validity of the $G'$ analysis and to plot genome-wide or chromosome specific QTL analysis plots.
478 | 
479 | ### G' distribution plots
480 | 
481 | Because p-values are estimated from the null distribution of $G'$, an important check is to see if the null distribution of $G'$ values is close to log-normally distributed. For this purpose we use the `plotGprimeDist` function, which plots the $G'$ histograms of both raw and filtered $G'$ sets (see P-value calculation above) alongside the log-normal null distribution (which is reported in the legend). We can also use this to test which filtering method (Hampel or DeltaSNP) estimates a more accurate null distribution. If you use the `"deltaSNP"` method, plotting $G'$ distributions with different filter thresholds might also help reveal a better $G'$ null distribution. 
482 | 
483 | ```{r gprimedist hampel, message = FALSE, warning = FALSE, fig.height=4 , dpi=300}
484 | plotGprimeDist(SNPset = df_filt, outlierFilter = "Hampel")
485 | 
486 | ```
487 | 
488 | ```{r gprimedist deltaSNP, message=FALSE, warning = FALSE, fig.height=4, dpi=300}
489 | plotGprimeDist(SNPset = df_filt, outlierFilter = "deltaSNP", filterThreshold = 0.1)
490 | ```
491 | 
492 | ### QTL analysis plots
493 | Now that we are happy with our filtered data and it seems that the $G'$ distribution is close to log-normal, we can finally plot some genome-wide figures and try to identify QTL.
494 | 
495 | Let's start by plotting the SNP/window distribution:
496 | ```{r plotnSNPs, fig.align = "center", fig.width=12, fig.height=4}
497 | p1 <- plotQTLStats(SNPset = df_filt, var = "nSNPs")
498 | p1
499 | ```
500 | This is informative as we can assess if there are regions with extremely low SNP density.
501 | 
502 | More importantly, let's identify some QTL by plotting the smoothed $\Delta (SNP\text{-}index)$ and $G'$ values. If we've performed QTLseq analysis, we can also set `plotIntervals` to `TRUE` and plot the confidence intervals to identify QTL using that method.
503 | 
504 | ```{r plotdeltaSNP, fig.align = "center", fig.width=12, fig.height=4, dpi=300}
505 | p2 <- plotQTLStats(SNPset = df_filt, var = "deltaSNP", plotIntervals = TRUE)
506 | p2
507 | ```
508 | We can see that there are some regions where the $\Delta (SNP\text{-}index)$ passes the confidence interval thresholds; these are putative QTL. The directionality of the $\Delta (SNP\text{-}index)$ is also important for $G'$ analysis. If the allele contributing to the trait is from the reference parent, the $\Delta (SNP\text{-}index)$ should be less than 0. However, if the $\Delta (SNP\text{-}index) > 0$, then the contributing parent is the one with the alternate alleles. 
509 | 
510 | Let's look at the $G'$ values to see if these regions are significant and pass the FDR (q) of 0.01.
511 | ```{r plotGprime, fig.align = "center", fig.width=12, fig.height=4, dpi=300}
512 | p3 <- plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.01)
513 | p3
514 | ```
515 | 
516 | Great! It looks like there are QTL identified on Chromosomes 1, 2, 5, 8 and 10.
517 | Based on the $\Delta (SNP\text{-}index)$ and $G'$ plots, we can see, for example, that the QTL on Chr1 originates from the reference parent (Nipponbare rice, in this case) and that the QTL on Chr8 was contributed by the other parent.
518 | 
519 | We can also use the `plotQTLStats` function to plot the $-log_{10}(p\text{-}value)$. While this number is a direct derivative of $G'$, it can be more self-explanatory for some. We can use the `subset` parameter to plot one or a few of the chromosomes, say for a close-up figure of a QTL of interest. Here we look at the $-log_{10}(p\text{-}value)$ plots of Chromosomes 1 and 8:
520 | 
521 | ```{r subsetlogpval, fig.align = "center", fig.width=6, fig.height=3, dpi=300}
522 | QTLplots <- plotQTLStats(
523 |     SNPset = df_filt, 
524 |     var = "negLog10Pval", 
525 |     plotThreshold = TRUE, 
526 |     q = 0.01, 
527 |     subset = c("Chr1", "Chr8")
528 |     )
529 | QTLplots
530 | ```
531 | 
532 | ## Extracting QTL data
533 | 
534 | Now that we've plotted and identified some putative QTL, we can extract the data using two functions: `getSigRegions` and `getQTLTable`.
535 | 
536 | ### Extracting significant regions
537 | The `getSigRegions` function will produce a list in which each element represents a QTL region. The elements are subsets of the original data frame you supplied. Any contiguous region that passes the significance threshold, i.e. with adjusted p-values below the set `alpha`, will be returned. If the signal dips below the threshold within a region, that region will be split into two elements. 
538 | 
539 | Let's examine the `head` of the first QTL:
540 | ```{r getsigreg}
541 | QTL <- getSigRegions(SNPset = df_filt, alpha = 0.01)
542 | head(QTL[[1]])
543 | ```
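
The idea of splitting significant SNPs into contiguous regions can be sketched with run-length encoding in base R. This is a simplified illustration with made-up q-values on a single chromosome, not the code used by `getSigRegions`; a real implementation would also need to keep regions on different chromosomes apart.

```r
alpha <- 0.01

# Made-up adjusted p-values for eight consecutive SNPs on one chromosome
qvalue <- c(0.5, 0.005, 0.002, 0.2, 0.001, 0.003, 0.004, 0.6)

# Flag SNPs that pass the threshold, then label each run of consecutive flags
sig <- qvalue < alpha
runs <- rle(sig)
run_id <- rep(seq_along(runs$values), runs$lengths)

# One list element per contiguous significant run (row indices of the SNPs)
regions <- split(which(sig), run_id[sig])
regions
```

The fourth SNP fails the threshold, so the significant SNPs on either side of it end up in two separate list elements.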
544 | 
545 | ### Output QTL summary
546 | 
547 | While `getSigRegions` is useful for examining every SNP within each QTL and perhaps for some downstream analysis, the `getQTLTable` function will summarize those results and can output a CSV by setting `export = TRUE` and `fileName = "MyQTLsummary.csv"`. We can set `method` to either `"Gprime"` or `"QTLseq"` depending on the type of analysis; `"Gprime"` will use `alpha` as the FDR threshold and `"QTLseq"` will use the `interval` parameter, which should match one of the intervals calculated above.
548 | 
549 | Here is the summary for significant regions with an FDR of 0.01:
550 | ```{r QTLtable}
551 | results <- getQTLTable(SNPset = df_filt, method = "Gprime", alpha = 0.01, export = FALSE)
552 | results
553 | 
554 | ```
555 | 
556 | The columns are:
557 | 
558 | * chromosome - The chromosome on which the region was identified
559 | * qtl - the QTL identification number in this chromosome
560 | * start - the start position on that chromosome, i.e. the position of the first SNP that passes the FDR threshold
561 | * end - the end position
562 | * length - the length in base pairs from start to end of the region
563 | * nSNPs - the number of SNPs in the region
564 | * avgSNPs_Mb - the average number of SNPs/Mb within that region
565 | * peakDeltaSNP - the $\Delta (SNP\text{-}index)$ value at the peak summit
566 | * posPeakDeltaSNP - the position of the absolute maximum tricube-smoothed deltaSNP-index
567 | * maxGprime - the max $G'$ score in the region
568 | * meanGprime - the average $G'$ score of that region
569 | * posMaxGprime - the genomic position of the maximum $G'$ value in the QTL
570 | * sdGprime - the standard deviation of $G'$ within the region
571 | * AUCaT - the **A**rea **U**nder the **C**urve but **a**bove the **T**hreshold line, an indicator of how significant or wide the peak is
572 | * meanPval - the average p-value in the region
573 | * meanQval - the average adjusted p-value in the region
574 | 
575 | # Summary
576 | 
577 | We've reviewed how to load SNP data from GATK and filter the data to contain high-confidence SNPs. We then performed $\Delta (SNP\text{-}index)$ and $G'$ analysis and calculated p-values and q-values based on the tricube-smoothed $G'$ values. The QTL regions that pass our defined threshold can be stored as a list for further analysis or summarized as a table for publication.
578 | 
579 | # Session info
580 | ```{r sessioninfo, echo=FALSE, cache=FALSE}
581 | sessionInfo()
582 | ```


--------------------------------------------------------------------------------
/vignettes/QTLseqr.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bmansfeld/QTLseqr/5e761379a805b65038c415c8d3ce7aa61abe89dc/vignettes/QTLseqr.pdf


--------------------------------------------------------------------------------