├── .github ├── ISSUE_TEMPLATE │ └── config.yml ├── README.md └── pull_request_template.md ├── .gitignore ├── DESCRIPTION ├── NAMESPACE ├── NEWS ├── R ├── EM.R ├── M_helpers.R ├── SparseDOSSA2.R ├── SparseDOSSA2_fit.R ├── copulasso.R ├── data.R ├── fit.R ├── generate.R ├── helpers.R ├── integration_helpers.R ├── likelihood_expectation.R ├── sim_helpers.R ├── spike.R └── sysdata.rda ├── data └── Stool_subset.rda ├── man ├── SparseDOSSA2.Rd ├── Stool_subset.Rd ├── control_fit.Rd ├── control_integrate.Rd ├── fitCV_SparseDOSSA2.Rd └── fit_SparseDOSSA2.Rd ├── tests ├── testthat.R └── testthat │ ├── test-likelihood.R │ └── test-spike_in.R └── vignettes ├── SparseDOSSA2.Rmd └── SparseDOSSA2.html /.github/ISSUE_TEMPLATE/config.yml: -------------------------------------------------------------------------------- 1 | blank_issues_enabled: false 2 | contact_links: 3 | - name: Please proceed to the bioBakery Forum to make a bug report or feature request 4 | url: https://forum.biobakery.org/ 5 | about: Thanks for your interest in the bioBakery! We use a Discourse forum (https://forums.biobakery.org) for interaction with our user community, including fielding bug reports and feature requests. Please submit your comment (including all relevant details) as a new post under 1) an appropriate method-specific forum topic (e.g. "MetaPhlAn") or 2) the generic "Feature/Pull Request" topic. Doing a search for related posts before starting a new post is greatly appreciated. We look forward to hearing from you. 6 | -------------------------------------------------------------------------------- /.github/README.md: -------------------------------------------------------------------------------- 1 | 2 | # "Simulating realistic microbial observations with SparseDOSSA2" 3 | 4 | Author Name: "Siyuan Ma" 5 | Affiliation: Harvard T.H. Chan School of Public Health. 
6 | Broad Institute email: siyuan.ma@pennmedicine.upenn.edu 7 | 8 | Tutorial: https://github.com/biobakery/biobakery/wiki/SparseDOSSA2 9 | 10 | # Introduction 11 | SparseDOSSA2 is an R package for fitting to, and simulating from, realistic microbial abundance observations. It provides functionalities for: a) generation of realistic synthetic microbial observations, b) spiking-in of associations with metadata variables, e.g. for benchmarking or power analysis purposes, and c) fitting the SparseDOSSA 2 model to real-world microbial abundance observations, which can then be used for a). This vignette is intended to provide working examples for these functionalities. 12 | 13 | ``` 14 | library(SparseDOSSA2) 15 | # tidyverse packages for utilities 16 | library(magrittr) 17 | library(dplyr) 18 | library(ggplot2) 19 | ``` 20 | 21 | # Installation 22 | SparseDOSSA2 is a Bioconductor package and can be installed via the following command. 23 | ``` 24 | # if (!requireNamespace("BiocManager", quietly = TRUE)) 25 | #     install.packages("BiocManager") 26 | # BiocManager::install("SparseDOSSA2") 27 | ``` 28 | # Simulating realistic microbial observations with SparseDOSSA2 29 | The most important functionality of SparseDOSSA2 is the simulation of realistic synthetic microbial observations. To this end, SparseDOSSA2 provides three pre-trained templates, "Stool", "Vaginal", and "IBD", targeting continuous, discrete, and diseased population structures, respectively. 30 | ``` 31 | Stool_simulation <- SparseDOSSA2(template = "Stool", 32 | n_sample = 100, 33 | n_feature = 100, 34 | verbose = TRUE) 35 | Vaginal_simulation <- SparseDOSSA2(template = "Vaginal", 36 | n_sample = 100, 37 | n_feature = 100, 38 | verbose = TRUE) 39 | ``` 40 | 41 | # Fitting to microbiome datasets with SparseDOSSA2 42 | SparseDOSSA2 provides two functions, fit_SparseDOSSA2 and fitCV_SparseDOSSA2, to fit the SparseDOSSA2 model to microbial count or relative abundance observations. 
For these functions, as input, SparseDOSSA2 requires a feature-by-sample table of microbial abundance observations. We provide with SparseDOSSA2 a minimal example of such a dataset: a five-by-five subset of the HMP1-II stool study. 43 | ``` 44 | data("Stool_subset", package = "SparseDOSSA2") 45 | # columns are samples. 46 | Stool_subset[1:2, 1, drop = FALSE] 47 | ``` 48 | 49 | ## Fitting SparseDOSSA2 model with fit_SparseDOSSA2 50 | fit_SparseDOSSA2 fits the SparseDOSSA2 model to estimate the model parameters: per-feature prevalence, mean and standard deviation of non-zero abundances, and feature-feature correlations. It also estimates the joint distribution of these parameters and (if the input is counts) a read count distribution. 51 | ``` 52 | fitted <- fit_SparseDOSSA2(data = Stool_subset, 53 | control = list(verbose = TRUE)) 54 | # fitted mean log non-zero abundance values of the first two features 55 | fitted$EM_fit$fit$mu[1:2] 56 | ``` 57 | 58 | ## Fitting SparseDOSSA2 model with fitCV_SparseDOSSA2 59 | The user can additionally achieve optimal model fitting via fitCV_SparseDOSSA2. They can either provide a vector of tuning parameter values (lambdas) to control sparsity in the estimation of the correlation matrix parameter, or a grid will be selected automatically. fitCV_SparseDOSSA2 uses cross-validation to select an "optimal" model fit across these tuning parameters via average testing log-likelihood. This is a computationally intensive procedure, best suited for users who want an accurate fit to the input dataset, i.e. for simulating new microbial observations on the same features as the input (rather than new features). 
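Before turning to cross-validated fitting, note that the metadata spike-in functionality (item b in the introduction) is driven by the same SparseDOSSA2() generator shown earlier. The following is a minimal sketch using only parameters documented in the package; the effect size and spiked-feature percentage are illustrative choices, not recommendations:

```r
library(SparseDOSSA2)

# Simulate from the pre-trained "Stool" template while spiking in associations
# between features and (auto-generated) metadata. With spike_metadata = "both",
# both the abundance and the prevalence of the selected features are
# associated with the metadata variables (one continuous, one binary,
# generated internally when metadata_matrix is NULL).
set.seed(1)
sim_spiked <- SparseDOSSA2(
  template = "Stool",
  n_sample = 100,
  n_feature = 100,
  spike_metadata = "both",
  metadata_effect_size = 1,            # log fold change / log odds ratio (illustrative)
  perc_feature_spiked_metadata = 0.05, # 5% of features associated with metadata
  verbose = FALSE
)

# Inspect which features were spiked against which metadata columns,
# in which property, and at what effect size:
sim_spiked$spike_metadata$feature_metadata_spike_df
```

Per the function documentation, spike_metadata can instead be supplied as a data.frame (columns metadata_datum, feature_spiked, associated_property, effect_size) for full control over individual associations, in which case metadata_matrix must also be provided.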
60 | ``` 61 | set.seed(1) 62 | fitted_CV <- fitCV_SparseDOSSA2(data = Stool_subset, 63 | lambdas = c(0.1, 1), 64 | K = 2, 65 | control = list(verbose = TRUE)) 66 | # the average log likelihood of different tuning parameters 67 | apply(fitted_CV$EM_fit$logLik_CV, 2, mean) 68 | # The second lambda (1) had better performance in terms of log likelihood, 69 | # and will be selected as the default fit. 70 | ``` 71 | 72 | # Parallelization controls with future 73 | SparseDOSSA2 internally uses the R package future to allow for parallel computation. The user can thus specify parallelization through future's interface. See the reference manual of future for more details. This is particularly useful when fitting SparseDOSSA2 in a high-performance computing environment. 74 | ``` 75 | ## regular fitting 76 | # system.time(fitted_regular <- 77 | #                fit_SparseDOSSA2(data = Stool_subset, 78 | #                                 control = list(verbose = FALSE))) 79 | ## parallel fitting with future: 80 | # future::plan(future::multisession) 81 | # system.time(fitted_parallel <- 82 | #                fit_SparseDOSSA2(data = Stool_subset, 83 | #                                 control = list(verbose = FALSE))) 84 | 85 | ## For CV fitting, there are three components that can be parallelized, in order: 86 | ## different cross validation folds, different tuning parameter lambdas, 87 | ## and different samples. 
It is usually most efficient to parallelize at the 88 | ## sample level: 89 | # system.time(fitted_regular_CV <- 90 | #                fitCV_SparseDOSSA2(data = Stool_subset, 91 | #                                   lambdas = c(0.1, 1), 92 | #                                   K = 2, 93 | #                                   control = list(verbose = TRUE))) 94 | # future::plan(list(future::sequential, future::sequential, future::multisession)) 95 | # system.time(fitted_parallel_CV <- 96 | #                fitCV_SparseDOSSA2(data = Stool_subset, 97 | #                                   lambdas = c(0.1, 1), 98 | #                                   K = 2, 99 | #                                   control = list(verbose = TRUE))) 100 | ``` 101 | 102 | # Session info 103 | ``` 104 | sessionInfo() 105 | 106 | R version 3.6.2 (2019-12-12) 107 | Platform: x86_64-apple-darwin15.6.0 (64-bit) 108 | Running under: macOS Mojave 10.14.6 109 | 110 | Matrix products: default 111 | BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 112 | LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib 113 | 114 | locale: 115 | [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 116 | 117 | attached base packages: 118 | [1] stats graphics grDevices utils datasets methods base 119 | 120 | other attached packages: 121 | [1] SparseDOSSA2_0.99.0 Rmpfr_0.8-2 gmp_0.6-1 igraph_1.2.6 122 | [5] truncnorm_1.0-8 magrittr_2.0.1 future.apply_1.7.0 future_1.21.0 123 | [9] huge_1.3.4.1 mvtnorm_1.1-1 ks_1.11.7 BiocCheck_1.22.0 124 | 125 | loaded via a namespace (and not attached): 126 | [1] Rcpp_1.0.5 compiler_3.6.2 BiocManager_1.30.10 bitops_1.0-6 127 | [5] tools_3.6.2 digest_0.6.27 mclust_5.4.7 jsonlite_1.7.2 128 | [9] lattice_0.20-41 pkgconfig_2.0.3 Matrix_1.2-18 graph_1.64.0 129 | [13] curl_4.3 parallel_3.6.2 xfun_0.20 stringr_1.4.0 130 | [17] httr_1.4.2 knitr_1.30 globals_0.14.0 stats4_3.6.2 131 | [21] grid_3.6.2 getopt_1.20.3 optparse_1.6.6 Biobase_2.46.0 132 | [25] listenv_0.8.0 R6_2.5.0 parallelly_1.23.0 XML_3.99-0.3 133 | [29] RBGL_1.62.1 codetools_0.2-18 biocViews_1.54.0 BiocGenerics_0.32.0 134 | [33] MASS_7.3-53 
stringdist_0.9.6.3 RUnit_0.4.32 KernSmooth_2.23-18 135 | [37] stringi_1.5.3 RCurl_1.98-1.2 136 | ``` 137 | 138 | # Contributions 139 | Thanks go to these wonderful people: 140 | 141 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | We welcome feedback and issue reporting for all bioBakery tools through our [Discourse site](https://forum.biobakery.org/c/pull-request/featurepull-request/). For users that would like to directly contribute to the [tools](https://github.com/biobakery/) we are happy to field PRs to address **bug fixes**. Please note the turn around time on our end might be a bit long to field these but that does not mean we don't value the contribution! We currently **don't** accept PRs to add **new functionality** to tools but we would be happy to receive your feedback on [Discourse](https://forum.biobakery.org/c/pull-request/featurepull-request/). 2 | 3 | Also, we will make sure to attribute your contribution in our User’s manual(README.md) and in any associated paper Acknowledgements. 
4 | 5 | 6 | ## Description 7 | 8 | 9 | ## Related Issue 10 | 11 | 12 | 13 | 14 | ## Screenshots (if appropriate): 15 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.Rproj 2 | .Rbuildignore 3 | .Rproj.user 4 | .Rhistory 5 | .RData 6 | .Ruserdata 7 | .DS_Store 8 | *.sbatch 9 | *.html -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: SparseDOSSA2 2 | Type: Package 3 | Title: A Statistical Model for Describing and Simulating Microbial Community Profiles 4 | Version: 0.99.2 5 | Author: Siyuan Ma 6 | Maintainer: Siyuan Ma 7 | Description: SparseDOSSA 2 is an R package for fitting to, and simulating from, realistic microbial 8 |     abundance observations. It provides functionalities for: 9 |     a) generation of synthetic microbial observations, based on either pre-fitted templates that the package provides, or user-trained results. 10 |     b) spiking-in of associations with metadata variables or between feature pairs, for e.g. benchmarking or power analysis purposes. 11 |     c) fitting the SparseDOSSA 2 model to real-world microbial abundance observations. 
12 | License: MIT + file LICENSE 13 | Encoding: UTF-8 14 | RoxygenNote: 7.1.2 15 | Depends: 16 | ks, 17 | mvtnorm, 18 | huge, 19 | future.apply, 20 | magrittr, 21 | truncnorm, 22 | igraph, 23 | Rmpfr 24 | VignetteBuilder: knitr 25 | Suggests: 26 | testthat (>= 2.1.0), 27 | cubature, 28 | BiocStyle, 29 | knitr, 30 | rmarkdown 31 | biocViews: Metagenomics, Microbiome, StatisticalMethod 32 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(SparseDOSSA2) 4 | export(control_fit) 5 | export(control_integrate) 6 | export(fitCV_SparseDOSSA2) 7 | export(fit_SparseDOSSA2) 8 | importFrom(magrittr,"%>%") 9 | -------------------------------------------------------------------------------- /NEWS: -------------------------------------------------------------------------------- 1 | Changes in version 0.99.2 (2021-10-06) 2 | + Changed the spike_metadata interface in the SparseDOSSA2 function. 
3 |   spike_metadata can now be either character ("none" the default, 4 |   "abundance", "prevalence", or "both", same as before), or a data.frame 5 |   configuring the metadata-microbe associations (replacing the old 6 |   feature_metadata_spike_df parameter) 7 | + More detailed documentation for these parameters 8 | + Better checking and error reporting for metadata spike-in user errors -------------------------------------------------------------------------------- /R/EM.R: -------------------------------------------------------------------------------- 1 | EM <- function(data, 2 |                lambda, 3 |                control = list()) { 4 |   # initialization, filtering 5 |   if(is.null(lambda) || any(lambda <= 0)) 6 |     stop("lambda must be a positive value!") 7 |   control <- do.call(control_fit, control) 8 |   debug_copulasso_file <- NULL 9 |   if(!is.null(control$debug_dir)) { 10 |     dir.create(control$debug_dir, recursive = TRUE) 11 |     control$debug_dir <- normalizePath(control$debug_dir) 12 |   } 13 | 14 |   l_filtering <- filter_data(data) 15 |   data <- data[l_filtering$ind_sample, l_filtering$ind_feature, drop = FALSE] 16 | 17 |   # initialize EM using relative abundances 18 |   feature_param <- fit_featureParam(data) 19 |   fit_copulasso <- copulasso( 20 |     data = data, 21 |     marginals = feature_param, 22 |     lambda = lambda, 23 |     control = list(debug_dir = control$debug_dir) 24 |   ) 25 |   feature_param[, "mu"] <- feature_param[, "mu"] - mean(feature_param[, "mu"]) 26 |   params <- list(pi0 = feature_param[, "pi0"], 27 |                  mu = feature_param[, "mu"], 28 |                  sigma = feature_param[, "sigma"], 29 |                  Sigma = fit_copulasso$Sigma, 30 |                  Omega = fit_copulasso$Omega, 31 |                  Corr_star = fit_copulasso$Corr_star, 32 |                  diff = rep(NA_real_, 4), 33 |                  logLik = -Inf, 34 |                  time = Sys.time()) 35 |   if(fit_copulasso$copulasso_code != 0) { 36 |     warning("Missing values in Omega estimation! 
(lambda too small?)") 37 |     return(list(lambda = lambda, 38 |                 fit = params, 39 |                 convergence = list(converge = FALSE, 40 |                                    converge_code = 4, 41 |                                    n_iter = 0)))  # no EM iterations have run yet 42 |   } 43 | 44 |   # EM algorithm 45 |   i_iter <- 0 46 |   converge <- FALSE 47 |   ll_easums <- list() 48 |   ll_params <- list() 49 |   while(TRUE) { 50 |     i_iter <- i_iter + 1 51 |     if(control$verbose) 52 |       message("EM iteration ", i_iter) 53 | 54 |     # E step 55 |     e_asums <- future.apply::future_vapply( 56 |       seq_len(nrow(data)), 57 |       function(i_sample) 58 |         get_es(x = data[i_sample, , drop = TRUE], 59 |                pi0 = params$pi0, mu = params$mu, sigma = params$sigma, 60 |                Omega = params$Omega, Sigma = params$Sigma, 61 |                control = control$control_numint), 62 |       rep(0.0, 10) 63 |     ) %>% t() 64 |     if(any(is.na(e_asums[, c("ea", "dx", "eloga", "eloga2")]))) { 65 |       warning("Numeric integration in E step returned NAs!") 66 |       converge_code <- 3 67 |       break 68 |     } 69 |     if(any(e_asums[, "eloga"]^2 >= e_asums[, "eloga2"])) { 70 |       warning("Numeric integration in E step gave bad expectation values!") 71 |       converge_code <- 2 72 |       break 73 |     } 74 |     ll_easums[[i_iter]] <- e_asums 75 | 76 |     ## M step 77 |     fit_copulasso <- copulasso(data = data * exp(e_asums[, "eloga"]), 78 |                                marginals = feature_param, 79 |                                lambda = lambda) 80 |     if(fit_copulasso$copulasso_code != 0) { 81 |       warning("Missing values in Omega estimation! 
(lambda too small?)") 82 |       converge_code <- 4 83 |       break 84 |     } 85 |     sigmahat <- get_sigmahat(data = data, 86 |                              eloga = e_asums[, "eloga"], 87 |                              eloga2 = e_asums[, "eloga2"]) 88 |     feature_param[, "sigma"] <- sigmahat 89 |     params_new <- list(pi0 = feature_param[, "pi0"], 90 |                        mu = feature_param[, "mu"], 91 |                        sigma = feature_param[, "sigma"], 92 |                        Sigma = fit_copulasso$Sigma, 93 |                        Omega = fit_copulasso$Omega, 94 |                        Corr_star = fit_copulasso$Corr_star, 95 |                        logLik = mean(e_asums[, "logLik"])) 96 |     diff_abs <- get_diff(params_new[["logLik"]], params[["logLik"]], 97 |                          denom_c = control$abs_tol, method = "abs") 98 |     diff_rel <- get_diff(params_new[["logLik"]], params[["logLik"]], 99 |                          denom_c = control$abs_tol, method = "rel") 100 |     params <- c(params_new, 101 |                 list(diff = c(diff_abs, diff_rel), 102 |                      time = Sys.time())) 103 |     ll_params[[i_iter]] <- params 104 | 105 |     if(!is.null(control$debug_dir)) { 106 |       l_debug <- list(ll_easums = ll_easums, 107 |                       ll_params = ll_params, 108 |                       l_filtering = l_filtering) 109 |       save(l_debug, 110 |            file = paste0(control$debug_dir, 111 |                          "/debug_EM.RData")) 112 |     } 113 | 114 |     if(max(diff_abs) < control$abs_tol & max(diff_rel) < control$rel_tol) { 115 |       converge <- TRUE 116 |       converge_code <- 0 117 |       break 118 |     } 119 |     if(i_iter + 1 > control$maxit) { 120 |       warning("Maximum EM iteration reached!") 121 |       converge_code <- 1 122 |       break 123 |     } 124 |   } 125 | 126 |   return(list(fit = params, 127 |               lambda = lambda, 128 |               convergence = list(converge = converge, 129 |                                  converge_code = converge_code, 130 |                                  n_iter = i_iter), 131 |               l_filtering = l_filtering)) 132 | } 133 | 134 | EM_CV <- function(data, 135 |                   lambdas, 136 |                   K = 5, 137 |                   control = list()) { 138 |   # initialization, filtering 139 |   control <- do.call(control_fit, control) 140 |   l_filtering <- filter_data(data) 141 |   data <- data[l_filtering$ind_sample, l_filtering$ind_feature, drop = FALSE] 142 |   if(is.null(lambdas)) 143 |     lambdas <- 10^seq(-2, 0, length.out = 5) 144 |   if(any(lambdas <= 
0)) 145 | stop("lambdas must be positive values!") 146 | if(!is.null(control$debug_dir)) { 147 | dir.create(control$debug_dir, recursive = TRUE) 148 | control$debug_dir <- normalizePath(control$debug_dir) 149 | } 150 | 151 | if(control$verbose) message("Performing full data fit...") 152 | l_fits_full <- future::future({ 153 | future.apply::future_lapply( 154 | seq_along(lambdas), 155 | function(i_lambda) { 156 | control_tmp <- control 157 | control_tmp$verbose <- FALSE 158 | if(!is.null(control$debug_dir)) 159 | control_tmp$debug_dir <- paste0(control$debug_dir, 160 | "/K_0/lambda_", 161 | i_lambda) 162 | EM(data = data, 163 | lambda = lambdas[i_lambda], 164 | control = control_tmp) 165 | }) 166 | }) 167 | l_fits_full <- future::value(l_fits_full) 168 | 169 | # CV fits 170 | CV_folds <- make_CVfolds(n = nrow(data), K = K) 171 | if(!is.null(control$debug_dir)) 172 | save(CV_folds, file = paste0(control$debug_dir, "/CV_folds.RData")) 173 | ll_results_CV <- list() 174 | for(k in seq_len(K)) { 175 | if(control$verbose) 176 | message("Performing CV k=", k) 177 | ll_results_CV[[k]] <- future::future({ 178 | data_training <- data[CV_folds != k, , drop = FALSE] 179 | data_testing <- data[CV_folds == k, , drop = FALSE] 180 | 181 | l_fits_k <- 182 | future.apply::future_lapply( 183 | seq_along(lambdas), 184 | function(i_lambda) { 185 | control_tmp <- control 186 | control_tmp$verbose <- FALSE 187 | if(!is.null(control$debug_dir)) 188 | control_tmp$debug_dir <- paste0(control$debug_dir, 189 | "/K_", k, 190 | "/lambda_", i_lambda) 191 | EM(data = data_training, 192 | lambda = lambdas[i_lambda], 193 | control = control_tmp) 194 | }) 195 | 196 | # Fill in parameters estimates for features not present in training data 197 | for(i_lambda in seq_along(lambdas)) 198 | l_fits_k[[i_lambda]]$fit <- 199 | fill_estimates_CV(l_fits_k[[i_lambda]]$fit, 200 | l_fits_full[[i_lambda]]$fit, 201 | l_fits_k[[i_lambda]]$l_filtering$ind_feature) 202 | 203 | # Calculate ll in testing data 204 | 
l_logLik <- future.apply::future_lapply( 205 | seq_along(lambdas), 206 | function(i_lambda) { 207 | params <- l_fits_k[[i_lambda]]$fit 208 | future.apply::future_sapply( 209 | seq_len(nrow(data_testing)), 210 | function(i_sample) { 211 | logLik <- 212 | dx(x = data_testing[i_sample, , drop = TRUE], 213 | pi0 = params$pi0, mu = params$mu, sigma = params$sigma, 214 | Omega = params$Omega, Sigma = params$Sigma, 215 | control = control$control_numint, 216 | log.p = TRUE) 217 | }) 218 | }) 219 | 220 | list(l_fits = l_fits_k, 221 | l_logLik = l_logLik) 222 | }) 223 | } 224 | ll_results_CV <- future::value(ll_results_CV) 225 | 226 | # Aggregate CV results 227 | logLik_CV <- 228 | sapply(seq_along(lambdas), 229 | function(i_lambda) { 230 | logLik <- rep(NA_real_, nrow(data)) 231 | for(k in seq_len(K)) 232 | logLik[CV_folds == k] <- ll_results_CV[[k]]$l_logLik[[i_lambda]] 233 | return(logLik) 234 | }) 235 | ll_fits_CV <- lapply(ll_results_CV, function(results_k) results_k$l_fits) 236 | 237 | return(list(fit = 238 | l_fits_full[[ 239 | order( 240 | apply(-logLik_CV, 241 | 2, 242 | function(x) 243 | mean(setdiff(x, Inf))))[1]]]$fit, 244 | lambdas = lambdas, 245 | logLik_CV = logLik_CV, 246 | l_fits_full = l_fits_full, 247 | CV_folds = CV_folds, 248 | ll_fits_CV = ll_fits_CV, 249 | l_filtering = l_filtering 250 | )) 251 | } 252 | -------------------------------------------------------------------------------- /R/M_helpers.R: -------------------------------------------------------------------------------- 1 | get_sigmahat <- function(data, eloga, eloga2) { 2 | mutilde <- 3 | vapply(seq_len(ncol(data)), 4 | function(j) { 5 | ind_j <- data[, j] > 0 6 | mean(eloga[ind_j] + 7 | log(data[ind_j, j])) 8 | }, 9 | 0.0) 10 | 11 | sigma2tilde <- 12 | vapply(seq_len(ncol(data)), 13 | function(j) { 14 | ind_j <- data[, j] > 0 15 | mean(eloga2[ind_j] + 16 | 2 * eloga[ind_j] * log(data[ind_j, j]) + 17 | log(data[ind_j, j])^2) 18 | }, 19 | 0.0) 20 | 21 | return(sqrt(sigma2tilde - mutilde^2)) 
22 | } -------------------------------------------------------------------------------- /R/SparseDOSSA2.R: -------------------------------------------------------------------------------- 1 | #' Simulate synthetic microbial abundance observations with SparseDOSSA2 2 | #' 3 | #' \code{SparseDOSSA2} generates synthetic microbial abundance observations 4 | #' from either a pre-trained template, or user-provided fitted results from 5 | #' \code{fit_SparseDOSSA2} or \code{fitCV_SparseDOSSA2}. Additional options 6 | #' are available for simulating associations between microbial features 7 | #' and metadata variables. 8 | #' 9 | #' @param template can be 1) a character string (\code{"Stool"}, \code{"Vaginal"}, 10 | #' or \code{"IBD"}) indicating one of the pre-trained templates in SparseDOSSA2, 11 | #' or 2) user-provided, fitted results. In the latter case this should be an output 12 | #' from \code{fit_SparseDOSSA2} or \code{fitCV_SparseDOSSA2}. 13 | #' @param n_sample number of samples to simulate 14 | #' @param new_features \code{TRUE}/\code{FALSE} indicator for whether or not new 15 | #' features should be simulated. If \code{FALSE} then the same set of features 16 | #' in \code{template} will be simulated. 17 | #' @param n_feature number of features to simulate. Only relevant when 18 | #' \code{new_features} is \code{TRUE} 19 | #' @param spike_metadata for metadata spike-in configurations. Must be one of two things: 20 | #' 21 | #' \itemize{ 22 | #' \item a character string of \code{"none"}, \code{"both"}, \code{"abundance"}, 23 | #' or \code{"prevalence"}, indicating whether or not 24 | #' association with metadata will be spiked in. For the spiked-in case, it 25 | #' indicates if features' abundance/prevalence/both characteristics will be associated 26 | #' with metadata (also see explanations for \code{metadata_effect_size} and 27 | #' \code{perc_feature_spiked_metadata}) 28 | #' \item a data.frame for detailed spike-in configurations. 
This is the more 29 | #' advanced approach, where detailed specification for metadata-microbial 30 | #' feature associations are provided. Note: if \code{spike_metadata} is provided 31 | #' as a data.frame, then \code{metadata_matrix} must be provided as well 32 | #' (cannot be generated automatically). In this case, \code{spike_metadata} 33 | #' must have exactly four columns: \code{metadata_datum}, \code{feature_spiked}, 34 | #' \code{associated_property}, and \code{effect_size}. Each row of the data.frame 35 | #' configures one specific metadata-microbe association. Specifically: 36 | #' \itemize{ 37 | #' \item \code{metadata_datum} (integer) indicates the column number for the metadata 38 | #' variable to be associated with the microbe 39 | #' \item \code{feature_spiked} (character) indicates the microbe name to be associated 40 | #' with the metadata variable 41 | #' \item \code{associated_property} (character, either \code{"abundance"} or 42 | #' \code{"prevalence"}), indicating the property of the microbe to be modified. 43 | #' If you want the microbe to be associated with the metadata variable 44 | #' in both properties, include two rows in \code{spike_metadata}, one for 45 | #' abundance and one for prevalence 46 | #' \item \code{effect_size} (numeric) indicating the strength of the association. 47 | #' This corresponds to log fold change in non-zero abundance for 48 | #' \code{"abundance"} spike-in, and log odds ratio for \code{"prevalence"} 49 | #' spike-in 50 | #' } 51 | #' } 52 | #' @param metadata_effect_size (for when \code{spike_metadata} is \code{"abundance"}, 53 | #' \code{"prevalence"}, or \code{"both"}) effect size of the spiked-in associations. 
This is 54 | #' non-zero log fold change for abundance spike-in, and log odds ratio for prevalence spike-in 55 | #' @param perc_feature_spiked_metadata (for when \code{spike_metadata} is \code{"abundance"}, 56 | #' \code{"prevalence"}, or \code{"both"}) 57 | #' percentage of features to be associated with metadata 58 | #' @param metadata_matrix the user can provide a metadata matrix to use for spiking-in 59 | #' of feature abundances. If using default (\code{NULL}) two variables will be generated: 60 | #' one continous, and a binary one of balanced cases and controls. Note: if 61 | #' \code{spike_metadata} is provided as a data.frame, then the user must provide 62 | #' \code{metadata_matrix} too 63 | #' @param median_read_depth targeted median per-sample read depth 64 | #' @param verbose whether detailed information should be printed 65 | #' 66 | #' @return a list with the following component: 67 | #' \describe{ 68 | #' \item{simulated_data}{ 69 | #' feature by sample matrix of simulated microbial count observations 70 | #' } 71 | #' \item{simulated_matrices}{ 72 | #' list of all simulated data matrices, including that of null (i.e. not spiked-in) absolute 73 | #' abundances, spiked-in absolute abundances, and normalized relative abundances 74 | #' } 75 | #' \item{params}{ 76 | #' parameters used for simulation. These are provided in \code{template}. 77 | #' } 78 | #' \item{spike_metadata}{ 79 | #' list of variables provided or generated for metadata spike-in. This include 80 | #' \code{spike_metadata} for the original \code{spike_metadata} parameter provided 81 | #' by the user, \code{metadata_matrix} for the 82 | #' metadata (either provided by the user or internally generated), and 83 | #' \code{feature_metadata_spike_df} 84 | #' for detailed specification of which metadata variables were used to spike-in associations 85 | #' with which features, in what properties at which effect sizes. 
This is the 86 | #' same as \code{spike_metadata} if the latter was provided as a data.frame. 87 | #' } 88 | #' } 89 | #' @export 90 | #' @author Siyuan Ma, \email{syma.research@gmail.com} 91 | #' 92 | #' @examples 93 | #' ## Using one of the pre-trained SparseDOSSA2 templates: 94 | #' sim <- SparseDOSSA2(template = "Stool", n_sample = 200, new_features = FALSE) 95 | #' ## Using user-provided trained SparseDOSSA2 model: 96 | #' data("Stool_subset") 97 | #' fitted <- fit_SparseDOSSA2(data = Stool_subset) 98 | #' sim <- SparseDOSSA2(template = fitted, n_sample = 200, new_features = FALSE) 99 | SparseDOSSA2 <- function(template = "Stool", 100 |                          n_sample = 100, 101 |                          new_features = TRUE, 102 |                          n_feature = 100, 103 |                          spike_metadata = "none", 104 |                          metadata_effect_size = 1, 105 |                          perc_feature_spiked_metadata = 0.05, 106 |                          metadata_matrix = NULL, 107 |                          median_read_depth = 50000, 108 |                          verbose = TRUE) { 109 |   if(is.character(template)) { 110 |     if(!template %in% c("Stool", "Vaginal", "IBD")) 111 |       stop("Pre-trained template must be one of \"Stool\", \"Vaginal\", or \"IBD\"!") 112 |     template <- get(template) 113 |   } 114 | 115 |   # generate per-feature params 116 |   feature_param_template <- cbind("pi0" = template$EM_fit$fit$pi0, 117 |                                   "mu" = template$EM_fit$fit$mu, 118 |                                   "sigma" = template$EM_fit$fit$sigma) 119 |   if(!new_features) { 120 |     if(verbose) 121 |       message("new_features is FALSE, ", 122 |               "adopting per-feature parameters from template ", 123 |               "(n_feature will be ignored)...") 124 |     n_feature <- nrow(feature_param_template) 125 |     feature_param <- feature_param_template 126 |     Omega <- template$EM_fit$fit$Omega 127 |     # check that Omega and Sigma agree 128 |     if(max(abs(Omega %*% template$EM_fit$fit$Sigma - 129 |                diag(rep(1, length(template$EM_fit$fit$pi0))))) > 1e-10) 130 |       stop("Omega should be the inverse of Sigma!") 131 |   } else { 132 |     if(verbose) 133 |       message("new_features is TRUE, ", 134 |               "generating new per-feature parameters, based on template...") 135 | 
feature_param <- 136 | generate_featureParam(F_fit = template$F_fit, 137 | param_original = feature_param_template, 138 | n_feature = n_feature) 139 | Omega <- diag(rep(1, nrow(feature_param))) 140 | } 141 | features <- 142 | rownames(Omega) <- 143 | colnames(Omega) <- 144 | rownames(feature_param) 145 | 146 | # generate null absolute abundance matrix 147 | if(verbose) 148 | message("Generating null absolute abundance matrix...") 149 | mat_null <- generate_a(n = n_sample, 150 | feature_param = feature_param, 151 | Omega = Omega) 152 | 153 | # generate spiked-in association with metadata 154 | if(!(is.character(spike_metadata) | is.data.frame(spike_metadata))) 155 | stop("spike_metadata must be either character or a data.frame!") 156 | if(is.character(spike_metadata)) { 157 | spike_metadata <- match.arg(spike_metadata, 158 | choices = c("none", "abundance", 159 | "prevalence", "both")) 160 | } 161 | if(identical(spike_metadata, "none")) { 162 | if(verbose) 163 | message("spike_metadata is \"none\", ", 164 | "no metadata association will be simulated...") 165 | 166 | mat_spiked_metadata <- mat_null 167 | feature_metadata_spike_df <- NULL 168 | } else { 169 | if(verbose) 170 | message("Spiking in metadata association...") 171 | 172 | # metadata_matrix 173 | if(is.null(metadata_matrix)) { 174 | if(is.data.frame(spike_metadata)) 175 | stop("spike_metadata is provided as a data.frame. 
User must specify ", 176 | "metadata_matrix as well!") 177 | if(verbose) 178 | message("metadata_matrix is not provided; ", 179 | "simulating default metadata_matrix...") 180 | metadata_matrix <- cbind(rnorm(n = n_sample), 181 | rbinom(n = n_sample, 182 | size = 1, 183 | prob = 0.5)) 184 | rownames(metadata_matrix) <- colnames(mat_null) 185 | } else { 186 | if(!is.matrix(metadata_matrix)) 187 | stop("metadata_matrix must be a matrix ", 188 | "(model matrix where categorical variables are dummified)!") 189 | if(nrow(metadata_matrix) != n_sample) 190 | stop("n_sample does not agree with number of samples in ", 191 | "metadata_matrix!") 192 | if(!is.null(rownames(metadata_matrix))) 193 | colnames(mat_null) <- rownames(metadata_matrix) 194 | } 195 | n_metadata <- ncol(metadata_matrix) 196 | 197 | # metadata_effect_size 198 | if(length(metadata_effect_size) != 1 & 199 | length(metadata_effect_size) != n_metadata) 200 | stop("Length of metadata_effect_size can only be either 1 or number of ", 201 | "columns of metadata_matrix!") 202 | if(length(metadata_effect_size) == 1) 203 | metadata_effect_size <- rep(metadata_effect_size, n_metadata) 204 | 205 | # feature_metadata_spike_df 206 | if(is.data.frame(spike_metadata)) { 207 | if(verbose) { 208 | message("spike_metadata is provided as a data.frame; ", 209 | "will use for simulating metadata association ", 210 | "(metadata_effect_size and perc_feature_spiked_metadata will ", 211 | "be ignored)...") 212 | } 213 | 214 | # check format 215 | if(!all(c("metadata_datum", 216 | "feature_spiked", 217 | "associated_property", 218 | "effect_size") %in% 219 | colnames(spike_metadata))) 220 | stop("spike_metadata does not follow the correct format! 
", 221 | "Must have the following columns: metadata_datum, ", 222 | "feature_spiked, associated_property, and effect_size.") 223 | if(!all(spike_metadata$feature_spiked %in% 224 | features)) 225 | stop("feature_spiked in spike_metadata must provide the ", 226 | "spiked feature names!") 227 | if(!all(spike_metadata$metadata_datum %in% 228 | seq(1, n_metadata))) 229 | stop("metadata_datum in spike_metadata must provide the ", 230 | "associated metadata column number!") 231 | if(!all(spike_metadata$associated_property %in% 232 | c("prevalence", "abundance"))) 233 | stop("associated_property in spike_metadata must be ", 234 | "either \"prevalence\" or \"abundance\"!") 235 | 236 | feature_metadata_spike_df <- spike_metadata 237 | } else { 238 | if(verbose) 239 | message("spike_metadata is specified as ", spike_metadata, "; ", 240 | "generating default metadata association...") 241 | feature_metadata_spike_df <- 242 | generate_feature_metadata_spike_df( 243 | features = features, 244 | perc_feature_spiked_metadata = perc_feature_spiked_metadata, 245 | n_metadata = n_metadata, 246 | effect_size = metadata_effect_size, 247 | spike_metadata = spike_metadata) 248 | } 249 | if(verbose) 250 | message("Generating feature abundances with spiked-in metadata ", 251 | "associations...") 252 | mat_spiked_metadata <- 253 | spike_a_metadata(null = mat_null, 254 | feature_param = feature_param, 255 | metadata = metadata_matrix, 256 | spike_df = feature_metadata_spike_df) 257 | } 258 | 259 | mat_rel <- apply(mat_spiked_metadata, 2, TSS) 260 | if(any(is.na(template$depth_fit))) { 261 | mat_count <- mat_rel 262 | } else { 263 | if(verbose) 264 | message("Generating count matrix...") 265 | # generate read depth 266 | depth_new <- generate_depth(mu_depth = template$depth_fit["mu_depth"], 267 | sigma_depth = template$depth_fit["sigma_depth"], 268 | n = n_sample, 269 | median_depth = median_read_depth) 270 | # generate read counts 271 | mat_count <- generate_count(rel = mat_rel, 272 | depth = 
depth_new) 273 | } 274 | 275 | return(list(simulated_data = mat_count, 276 | simulated_matrices = list(rel = mat_rel, 277 | a_spiked = mat_spiked_metadata, 278 | a_null = mat_null), 279 | params = list(feature_param = feature_param, 280 | Omega = Omega), 281 | template = template, 282 | spike_metadata = list(spike_metadata = spike_metadata, 283 | metadata_matrix = metadata_matrix, 284 | feature_metadata_spike_df = 285 | feature_metadata_spike_df))) 286 | } 287 | -------------------------------------------------------------------------------- /R/SparseDOSSA2_fit.R: -------------------------------------------------------------------------------- 1 | #' Fit SparseDOSSA 2 model to a microbiome abundance dataset 2 | #' 3 | #' \code{fit_SparseDOSSA2} fits the SparseDOSSA 2 model (zero-inflated log normal 4 | #' marginals connected through Gaussian copula) to microbial abundances. It takes 5 | #' as input a feature-by-sample microbial count or relative abundance table and 6 | #' a penalization tuning parameter \code{lambda} to control the sparsity of 7 | #' feature-feature correlations. It then adopts a penalized expectation-maximization 8 | #' algorithm to provide estimates of the model parameters. 9 | #' 10 | #' @param data feature-by-sample matrix of abundances (proportions or 11 | #' counts) 12 | #' @param lambda positive penalization parameter for the sparsity of feature-feature 13 | #' correlations. Defaults to the maximum value \code{1}, where features are assumed to be 14 | #' independent (no correlations, most sparse) 15 | #' @param control a named list of additional control parameters. See help page for 16 | #' \code{control_fit} 17 | #' 18 | #' @return a list, with the following components: 19 | #' \describe{ 20 | #' \item{EM_fit}{ 21 | #' list of fitted parameters from the EM algorithm.
22 | #' } 23 | #' \item{F_fit}{ 24 | #' fitted parameters for the joint distribution of per-feature prevalence, abundance, 25 | #' and variability parameters (for simulating new features) 26 | #' } 27 | #' \item{depth_fit}{fitted parameters for the read depth distribution. Only applicable 28 | #' to count data. 29 | #' } 30 | #' \item{l_filtering}{list of quality control filtering for sample and features. 31 | #' } 32 | #' } 33 | 34 | #' @export 35 | #' @author Siyuan Ma, \email{syma.research@gmail.com} 36 | #' @examples 37 | #' data("Stool_subset") 38 | #' fitted <- fit_SparseDOSSA2(data = Stool_subset) 39 | fit_SparseDOSSA2 <- function(data, 40 | lambda = 1, 41 | control = list()) { 42 | # initialization, filtering 43 | if(any(data < 0)) 44 | stop("data has negative values!") 45 | data <- t(data) 46 | control <- do.call(control_fit, control) 47 | if(control$verbose) 48 | message("Filtering for all-zero features/samples...") 49 | l_filtering <- filter_data(data) 50 | data <- data[l_filtering$ind_sample, l_filtering$ind_feature, drop = FALSE] 51 | if(!is.null(control$debug_dir)) 52 | dir.create(control$debug_dir) 53 | 54 | # check if table is count and fit library size parameters if needed 55 | row_sums <- apply(data, 1, sum) 56 | if(all(row_sums > 1)) { 57 | if(control$verbose) 58 | message("Data appears to be count table. 
", 59 | "Fitting library size distribution...") 60 | depth_fit <- fit_depth(depth = row_sums) 61 | } else { 62 | if(control$verbose) 63 | message("Data appears to be relative abundance table.") 64 | depth_fit <- c("mu_depth" = NA, "sigma_depth" = NA) ## FIXME 65 | } 66 | 67 | # normalize data 68 | data <- t(apply(data, 1, TSS, correct = TRUE)) 69 | 70 | # EM fitting 71 | if(control$verbose) 72 | message("Fitting EM algorithm...") 73 | EM_fit <- EM(data = data, 74 | lambda = lambda, 75 | control = control) 76 | 77 | # fit per-feature parameter 3d Gaussian kernel density 78 | if(control$verbose) 79 | message("Fitting joint distribution of per-feature parameters...") 80 | F_fit <- fit_F(feature_param = cbind("pi0" = EM_fit$fit$pi0, 81 | "mu" = EM_fit$fit$mu, 82 | "sigma" = EM_fit$fit$sigma)) 83 | 84 | return(list(EM_fit = EM_fit, 85 | F_fit = F_fit, 86 | depth_fit = depth_fit, 87 | l_filtering = l_filtering)) 88 | } 89 | 90 | #' Control options for fit_SparseDOSSA2 and fitCV_SparseDOSSA2 91 | #' 92 | #' @param maxit maximum number of EM iterations 93 | #' @param rel_tol relative change threshold in the log likelihood 94 | #' for algorithm convergence 95 | #' @param abs_tol absolute change threshold in the log likelihood 96 | #' for algorithm convergence 97 | #' @param control_numint a named list of control parameters for the 98 | #' numerical integrations during the E step. See help page for 99 | #' \code{control_numint} 100 | #' @param verbose whether or not detailed running messages should be provided 101 | #' @param debug_dir directory for intermediate output, such as the 102 | #' EM expectations and parameter values during each step of the 103 | #' EM algorithm.
Default to \code{NULL} in which case no such output 104 | #' will be generated 105 | #' 106 | #' @return a list of the same names 107 | #' @export 108 | control_fit <- function(maxit = 100, 109 | rel_tol = 1e-2, 110 | abs_tol = 1e-2, 111 | control_numint = list(), 112 | verbose = FALSE, 113 | debug_dir = NULL) { 114 | list(maxit = maxit, 115 | rel_tol = rel_tol, 116 | abs_tol = abs_tol, 117 | control_numint = control_numint, 118 | verbose = verbose, 119 | debug_dir = debug_dir) 120 | } 121 | 122 | 123 | #' Fit SparseDOSSA 2 model to a microbiome abundance dataset with cross validation 124 | #' 125 | #' \code{fitCV_SparseDOSSA2} randomly partitions the data into fitting and testing 126 | #' subsets. It fits the SparseDOSSA 2 model to the fitting sets and uses log likelihood 127 | #' of the fitted parameters in the testing sets as the criteria for selection of 128 | #' tuning parameter lambda. 129 | #' 130 | #' @param data feature-by-sample matrix of abundances (proportions or 131 | #' counts). 132 | #' @param lambdas vector of positive penalization parameters for the sparsity of feature-feature 133 | #' correlations. The function fits SparseDOSSA 2 models to each of the lambda values, and uses 134 | #' cross validation likelihood to select the optimal one. If not provided this will be chosen 135 | #' automatically. 136 | #' @param control a named list of additional control parameters. See help page for 137 | #' \code{control_fit}. 138 | #' 139 | #' @return a list, with the following components: 140 | #' \describe{ 141 | #' \item{EM_fit}{ 142 | #' list of fitted parameters from the EM algorithm, with additional cross validation likelihood. 143 | #' } 144 | #' \item{F_fit}{ 145 | #' fitted parameters for the joint distribution of per-feature prevalence, abundance, 146 | #' and variability parameters (for simulating new features) 147 | #' } 148 | #' \item{depth_fit}{fitted parameters for the read depth distribution. Only applicable 149 | #' to count data. 
150 | #' } 151 | #' \item{l_filtering}{list of quality control filtering for sample and features. 152 | #' } 153 | #' } 154 | 155 | #' @export 156 | #' @author Siyuan Ma, \email{syma.research@gmail.com} 157 | #' @examples 158 | #' data("Stool_subset") 159 | #' fitted <- fitCV_SparseDOSSA2(data = Stool_subset, 160 | #' lambdas = c(0.1, 1), 161 | #' K = 5) 162 | #' 163 | fitCV_SparseDOSSA2 <- function(data, 164 | lambdas = 10^seq(-2, 0, length.out = 5), 165 | K = 5, 166 | control = list()) { 167 | # initialization, filtering 168 | if(any(data < 0)) 169 | stop("data has negative values!") 170 | data <- t(data) 171 | control <- do.call(control_fit, control) 172 | if(control$verbose) 173 | message("Filtering for all-zero features/samples...") 174 | l_filtering <- filter_data(data) 175 | data <- data[l_filtering$ind_sample, l_filtering$ind_feature, drop = FALSE] 176 | if(!is.null(control$debug_dir)) 177 | dir.create(control$debug_dir) 178 | 179 | # check if table is count and fit library size parameters if needed 180 | row_sums <- apply(data, 1, sum) 181 | if(all(row_sums > 1)) { 182 | if(control$verbose) 183 | message("Data appears to be count table.
", 184 | "Fitting library size distribution...") 185 | depth_fit <- fit_depth(depth = row_sums) 186 | } else { 187 | if(control$verbose) 188 | message("Data appears to be relative abundance table.") 189 | depth_fit <- c("mu_depth" = NA, "sigma_depth" = NA) ## FIXME 190 | } 191 | 192 | # normalize data 193 | data <- t(apply(data, 1, TSS, correct = TRUE)) 194 | 195 | # CV EM fitting 196 | if(control$verbose) 197 | message("Fitting EM algorithm with cross-validation...") 198 | EM_fit <- EM_CV(data = data, 199 | lambdas = lambdas, 200 | K = K, 201 | control = control) 202 | 203 | # fit per-feature parameter 3d Gaussian kernel density 204 | if(control$verbose) 205 | message("Fitting joint distribution of per-feature parameters...") 206 | F_fit <- fit_F(feature_param = cbind("pi0" = EM_fit$fit$pi0, 207 | "mu" = EM_fit$fit$mu, 208 | "sigma" = EM_fit$fit$sigma)) 209 | 210 | return(list(EM_fit = EM_fit, 211 | F_fit = F_fit, 212 | depth_fit = depth_fit, 213 | l_filtering = l_filtering)) 214 | } -------------------------------------------------------------------------------- /R/copulasso.R: -------------------------------------------------------------------------------- 1 | copulasso <- function(data, marginals, 2 | lambda, 3 | control = list()) { 4 | control <- do.call(control_copulasso, control) 5 | if(ncol(data) != nrow(marginals)) 6 | stop("Dimension of data and marginals do not agree!") 7 | 8 | if(lambda >= 1) { 9 | Omega <- 10 | Sigma <- 11 | Corr_star <- diag(1, ncol(data)) 12 | return(list(Omega = Omega, 13 | Sigma = Sigma, 14 | Corr_star = Corr_star, 15 | copulasso_code = 0)) 16 | } else { 17 | data_g <- vapply(seq_len(ncol(data)), 18 | function(i_feature) 19 | a_to_g(a = data[, i_feature], 20 | pi0 = marginals[i_feature, "pi0"], 21 | mu = marginals[i_feature, "mu"], 22 | sigma = marginals[i_feature, "sigma"]), 23 | rep(0.0, nrow(data))) 24 | Corr_star <- cor(data_g, method = "pearson") 25 | S <- get_s(cor_data = Corr_star, 26 | pi0 = marginals[, "pi0"], 27 | glim = 
marginals[, "glim"], 28 | g0 = marginals[, "g0"], 29 | sigmaMod = marginals[, "sigmaMod"]) 30 | Omega <- diag(1/(diag(S) + lambda)) 31 | 32 | if(control$simplify) 33 | z <- which(rowSums(abs(S) > lambda) > 1) 34 | else 35 | z <- seq_len(nrow(S)) 36 | q <- length(z) 37 | if (q > 0) { 38 | S_conditioned <- condition_ridge(S[z, z, drop = FALSE], 39 | lambda = control$lambda_ridge, 40 | method = "ridge1") 41 | out.glasso <- huge::huge.glasso(x = S_conditioned, 42 | lambda = lambda, 43 | verbose = FALSE) 44 | Omega[z, z] <- out.glasso$icov[[1]] 45 | } 46 | } 47 | 48 | if(!is.null(control$debug_dir)) 49 | save(Omega, file = paste0(control$debug_dir, "/debug_copulasso.RData")) 50 | 51 | if(any(is.na(Omega))) { 52 | # warning("Missing values in Omega estimation! (lambda to small?)") # FIXME 53 | Omega <- diag(rep(1, nrow(Omega))) 54 | return(list(Omega = Omega, 55 | copulasso_code = 1)) 56 | } 57 | 58 | if(control$symm) { 59 | Omega <- enforce_symm(Omega, method = "svd") 60 | } 61 | if(control$corr) { 62 | Omega <- enforce_corr(Omega) 63 | } 64 | 65 | Omega <- threshold_matrix(Omega, 66 | threshold_zero = control$threshold_zero) 67 | Sigma <- threshold_matrix(solve(Omega), 68 | threshold_zero = control$threshold_zero) 69 | 70 | return(list(Omega = Omega, 71 | Sigma = Sigma, 72 | Corr_star = Corr_star, 73 | copulasso_code = 0)) 74 | } 75 | 76 | control_copulasso <- function(lambda_ridge = 1e-6, 77 | threshold_zero = 1e-6, 78 | simplify = FALSE, 79 | symm = TRUE, 80 | corr = TRUE, 81 | debug_dir = NULL) { 82 | list(lambda_ridge = lambda_ridge, 83 | threshold_zero = threshold_zero, 84 | simplify = simplify, 85 | symm = symm, 86 | corr = corr, 87 | debug_dir = debug_dir) 88 | } 89 | 90 | Rho <- function(rho_p, 91 | pi0_1, pi0_2, 92 | glim_1, glim_2, 93 | g0_1, g0_2, 94 | sigmaMod_1, sigmaMod_2) { 95 | mat_cor <- matrix(c(1, rho_p, rho_p, 1), 96 | 2, 2) 97 | part1 <- g0_1 * g0_2 * 98 | mvtnorm::pmvnorm(upper = c(glim_1, glim_2), 99 | corr = mat_cor) 100 | part2 <- g0_1 * 101 | 
tmvtnorm::mtmvnorm(sigma = mat_cor, 102 | lower = c(-Inf, glim_2), 103 | upper = c(glim_1, Inf), 104 | doComputeVariance = FALSE)$tmean[2] * 105 | mvtnorm::pmvnorm(lower = c(-Inf, glim_2), 106 | upper = c(glim_1, Inf), 107 | corr = mat_cor) 108 | part3 <- 109 | g0_2 * 110 | tmvtnorm::mtmvnorm(sigma = mat_cor, 111 | lower = c(glim_1, -Inf), 112 | upper = c(Inf, glim_2), 113 | doComputeVariance = FALSE)$tmean[1] * 114 | mvtnorm::pmvnorm(lower = c(glim_1, -Inf), 115 | upper = c(Inf, glim_2), 116 | corr = mat_cor) 117 | part4 <- { 118 | fit_mtmvnorm <- tmvtnorm::mtmvnorm(sigma = mat_cor, 119 | lower = c(glim_1, glim_2), 120 | doComputeVariance = TRUE) 121 | if (any(is.na(fit_mtmvnorm$tmean))) 122 | 0 123 | else 124 | (prod(fit_mtmvnorm$tmean) + fit_mtmvnorm$tvar[1, 2]) * 125 | mvtnorm::pmvnorm(lower = c(glim_1, glim_2), 126 | corr = mat_cor) 127 | } 128 | 129 | return((part1 + part2 + part3 + part4) / sigmaMod_1 / sigmaMod_2) 130 | } 131 | 132 | vRho <- Vectorize(Rho, vectorize.args = "rho_p") 133 | 134 | iRho <- function(rho_s, 135 | pi0_1, pi0_2, 136 | glim_1, glim_2, 137 | g0_1, g0_2, 138 | sigmaMod_1, sigmaMod_2) { 139 | f_lim <- vRho(c(-0.99, 0.99), 140 | pi0_1 = pi0_1, pi0_2 = pi0_2, 141 | glim_1 = glim_1, glim_2 = glim_2, 142 | g0_1 = g0_1, g0_2 = g0_2, 143 | sigmaMod_1 = sigmaMod_1, sigmaMod_2 = sigmaMod_2) 144 | if(rho_s <= f_lim[1]) 145 | return(-0.99) ## FIXME 146 | if(rho_s >= f_lim[2]) 147 | return(0.99) 148 | 149 | uniroot(f = function(x) 150 | vRho(x, 151 | pi0_1 = pi0_1, pi0_2 = pi0_2, 152 | glim_1 = glim_1, glim_2 = glim_2, 153 | g0_1 = g0_1, g0_2 = g0_2, 154 | sigmaMod_1 = sigmaMod_1, sigmaMod_2 = sigmaMod_2) - 155 | rho_s, 156 | interval = c(-0.99, 0.99), 157 | f.lower = f_lim[1], 158 | f.upper = f_lim[2], 159 | maxiter = 100)$root 160 | } 161 | 162 | get_s <- function(cor_data, pi0, glim, g0, sigmaMod) { 163 | df_index <- expand.grid(feature2 = seq_len(nrow(cor_data)), 164 | feature1 = seq_len(nrow(cor_data))) 165 | df_index <- subset(df_index, 
feature1 < feature2) 166 | df_index$i_combo <- seq_len(nrow(df_index)) 167 | 168 | ss <- 169 | future.apply::future_vapply( 170 | df_index$i_combo, 171 | function(ii_combo) { 172 | ind_feature1 <- df_index[ii_combo, ]$feature1 173 | ind_feature2 <- df_index[ii_combo, ]$feature2 174 | 175 | iRho(rho_s = cor_data[ind_feature1, ind_feature2], 176 | pi0_1 = pi0[ind_feature1], pi0_2 = pi0[ind_feature2], 177 | glim_1 = glim[ind_feature1], glim_2 = glim[ind_feature2], 178 | g0_1 = g0[ind_feature1], g0_2 = g0[ind_feature2], 179 | sigmaMod_1 = sigmaMod[ind_feature1], 180 | sigmaMod_2 = sigmaMod[ind_feature2]) 181 | }, 182 | 0.0) 183 | 184 | to_return <- cor_data 185 | to_return[lower.tri(to_return)] <- ss 186 | to_return[upper.tri(to_return)] <- upper_tri(t(to_return), warning = FALSE) 187 | 188 | return(to_return) 189 | } 190 | 191 | solve_ridge <- function(S, lambda, method = "ridge1") { 192 | svd_fit <- svd(S) 193 | if(method == "ridge1") 194 | eigen_inv <- (sqrt(svd_fit$d^2 + 4 * lambda) - svd_fit$d) / 2 / lambda 195 | if(method == "ridge2") 196 | eigen_inv <- 1 / (svd_fit$d + lambda) 197 | 198 | return(svd_fit$u %*% diag(eigen_inv) %*% t(svd_fit$u)) 199 | } 200 | 201 | condition_ridge <- function(S, lambda, method = "ridge1") { 202 | svd_fit <- svd(S) 203 | if(method == "ridge1") 204 | eigen_cond <- 2 * lambda / (sqrt(svd_fit$d^2 + 4 * lambda) - svd_fit$d) 205 | if(method == "ridge2") 206 | eigen_cond <- 1 / (svd_fit$d + lambda) 207 | 208 | return(svd_fit$u %*% diag(eigen_cond) %*% t(svd_fit$u)) 209 | } -------------------------------------------------------------------------------- /R/data.R: -------------------------------------------------------------------------------- 1 | #' A subset of the HMP1-II stool samples 2 | #' 3 | #' A dataset containing species-level microbial counts of a subset of 4 | #' the HMP1-II stool samples. This includes the top 5 most abundant 5 | #' species and top 5 most deeply sequenced samples. 
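A note on the two ridge helpers above: with `method = "ridge1"`, the eigenvalue transforms used by `solve_ridge` and `condition_ridge` are exact reciprocals of each other, so `solve_ridge(S, lambda)` is the matrix inverse of `condition_ridge(S, lambda)` (both reuse the singular vectors of `S`). A quick standalone check with hypothetical values (illustrative only, not package code):

```r
S <- matrix(c(2, 1, 1, 2), 2, 2)   # hypothetical symmetric input matrix
lambda <- 0.1
d <- svd(S)$d                      # singular values of S
# eigenvalue transform in solve_ridge(), method = "ridge1"
eig_inv  <- (sqrt(d^2 + 4 * lambda) - d) / (2 * lambda)
# eigenvalue transform in condition_ridge(), method = "ridge1"
eig_cond <- 2 * lambda / (sqrt(d^2 + 4 * lambda) - d)
# the two transforms multiply to 1 eigenvalue-wise
```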
6 | #' 7 | #' @format A matrix with 5 rows (species) and 5 columns (samples) 8 | #' @source \url{https://www.hmpdacc.org/hmp/} 9 | "Stool_subset" -------------------------------------------------------------------------------- /R/fit.R: -------------------------------------------------------------------------------- 1 | fit_featureParam <- function(data) { 2 | feature_param <- 3 | t(vapply(seq_len(ncol(data)), 4 | function(i_feature) 5 | fit_oneFeatureParam( 6 | data[, i_feature, drop = TRUE]), 7 | rep(0.0, 6))) 8 | # fill in missing sigmas (those with only one non-zero observation) 9 | # with the overall median across features 10 | feature_param[is.na(feature_param[, "sigma"]), "sigma"] <- 11 | median(feature_param[, "sigma"], na.rm = TRUE) 12 | 13 | rownames(feature_param) <- colnames(data) 14 | return(feature_param) 15 | } 16 | 17 | fit_oneFeatureParam <- function(x) { 18 | ind_nonzero <- x > 0 19 | if(!any(ind_nonzero)) 20 | stop("Does not support features with all zero values!") 21 | if(all(ind_nonzero)) ## FIXME 22 | pi0 <- 0.5 / length(x) 23 | else 24 | pi0 <- mean(!ind_nonzero) 25 | mu <- mean(log(x[ind_nonzero])) 26 | sigma <- sd(log(x[ind_nonzero])) 27 | 28 | # Additional parameters to help with computation 29 | glim <- qnorm(pi0) 30 | g0 <- truncnorm::etruncnorm(b = glim) 31 | sigmaMod <- sqrt(pi0 * g0^2 + 32 | (1 - pi0) * 33 | (truncnorm::vtruncnorm(a = glim) + 34 | truncnorm::etruncnorm(a = glim)^2)) 35 | 36 | return(c("pi0" = pi0, 37 | "mu" = mu, 38 | "sigma" = sigma, 39 | "glim" = glim, 40 | "g0" = g0, 41 | "sigmaMod" = sigmaMod)) 42 | } 43 | 44 | fit_F <- function(feature_param) { 45 | if(!all(colnames(feature_param) == c("pi0", "mu", "sigma"))) 46 | stop("feature_param is not of the correct format!") 47 | # transform pi0 parameter before estimation 48 | feature_param[, "pi0"] <- log(feature_param[, "pi0"]) - log(1 - feature_param[, "pi0"]) 49 | 50 | ind_zero_sigma <- feature_param[, "sigma"] == 0 51 | 52 | K_nonzero <- ks::Hscv(x =
feature_param[!ind_zero_sigma, , drop = FALSE]) 53 | if(any(ind_zero_sigma)) 54 | K_zero <- ks::Hscv(x = feature_param[ind_zero_sigma, -3, drop = FALSE]) 55 | else 56 | K_zero <- NULL 57 | 58 | return(list(p0_sigma = mean(ind_zero_sigma), 59 | K_nonzero = K_nonzero, 60 | K_zero = K_zero)) 61 | } 62 | 63 | fit_depth <- function(depth) { 64 | return(c("mu_depth" = mean(log(depth)), 65 | "sigma_depth" = sd(log(depth)))) 66 | } 67 | -------------------------------------------------------------------------------- /R/generate.R: -------------------------------------------------------------------------------- 1 | generate_a <- function(n, 2 | feature_param, Omega, 3 | maxit = 10, verbose = FALSE) { 4 | i_iter <- 0 5 | samples_a <- matrix(NA, 6 | nrow = nrow(feature_param), 7 | ncol = 0) 8 | rownames(samples_a) <- rownames(feature_param) 9 | 10 | while(TRUE) { 11 | if(i_iter + 1 > maxit) 12 | stop("Can't satisfy conditions!") 13 | i_iter <- i_iter + 1 14 | if(verbose) 15 | print(i_iter) 16 | 17 | samples_a <- cbind(samples_a, 18 | t(rcopulasso(n = n, 19 | pi0 = feature_param[, "pi0"], 20 | mu = feature_param[, "mu"], 21 | sigma = feature_param[, "sigma"], 22 | Omega = Omega))) 23 | 24 | ind_nonzero <- apply(samples_a > 0, 2, any) ## FIXME??
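As a side note on the read-depth model: `fit_depth` above treats log read depths as normal, and `generate_depth` (in generate.R) resamples on the log scale and rescales so the median matches a target depth. A standalone sketch of that round trip, with made-up numbers (illustrative only, not package code):

```r
set.seed(42)
# hypothetical "observed" depths, drawn log-normal for illustration
depth <- round(rlnorm(100, meanlog = 16, sdlog = 0.5))
mu_depth    <- mean(log(depth))   # what fit_depth() would return
sigma_depth <- sd(log(depth))
# resample new depths on the log scale, as in generate_depth()
depth_new <- exp(rnorm(50, mean = mu_depth, sd = sigma_depth))
# rescale so the median hits a target depth (here 5e4, hypothetical)
depth_new <- round(depth_new / median(depth_new) * 5e4)
```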
25 | if(sum(ind_nonzero) >= n) { 26 | samples_a <- samples_a[, ind_nonzero][, seq_len(n)] 27 | colnames(samples_a) <- paste0("Sample", seq_len(n)) 28 | return(samples_a) 29 | } 30 | } 31 | } 32 | 33 | rcopulasso <- function(n, pi0, mu, sigma, Omega) { 34 | if(length(pi0) != length(mu) | 35 | length(pi0) != length(sigma) | 36 | length(pi0) != nrow(Omega)) 37 | stop("Parameter dimensions must agree!") 38 | 39 | # sample marginals 40 | mat_amarginals <- 41 | vapply(seq_len(length(pi0)), 42 | function(i) 43 | rZILogN_one(n = n, 44 | pi0 = pi0[i], 45 | mu = mu[i], 46 | sigma = sigma[i]), 47 | rep(0.0, n)) 48 | # arrange from smallest to largest for shuffling 49 | mat_amarginals <- 50 | apply(mat_amarginals, 2, function(x) x[order(x)]) 51 | 52 | # sample ranks 53 | mat_rank <- 54 | mvtnorm::rmvnorm(n = n, sigma = solve(Omega)) 55 | mat_rank <- apply(mat_rank, 2, rank) 56 | 57 | mat_a <- 58 | vapply(seq_len(length(pi0)), 59 | function(i) 60 | mat_amarginals[, i, drop = TRUE][mat_rank[, i, drop = TRUE]], 61 | rep(0.0, n)) 62 | 63 | return(mat_a) 64 | } 65 | 66 | rZILogN_one <- function(n, pi0, mu, sigma) { 67 | return(exp(rnorm(n = n, mean = mu, sd = sigma)) * 68 | rbinom(n = n, size = 1, prob = 1 - pi0)) 69 | } 70 | 71 | generate_featureParam <- function(F_fit, param_original, n_feature) { 72 | if(!all(names(F_fit) == c("p0_sigma", "K_nonzero", "K_zero"))) 73 | stop("F_fit is not of the correct format!") 74 | 75 | # transform pi parameter 76 | param_original[, "pi0"] <- logit(param_original[, "pi0"]) 77 | 78 | if(F_fit$p0_sigma > 0) 79 | ind_sigma <- rbinom(n = n_feature, size = 1, 80 | prob = 1 - F_fit$p0_sigma) == 1 81 | else 82 | ind_sigma <- rep(TRUE, length = n_feature) 83 | 84 | # simulate parameters for non-zero sigmas 85 | param_original_nonZeroSigma <- 86 | param_original[param_original[, "sigma"] > 0, , drop = FALSE] 87 | ind_sigma_nonzero_tosimulate <- ind_sigma[ind_sigma] 88 | param_simulated <- 89 | matrix(NA, nrow = sum(ind_sigma_nonzero_tosimulate), ncol = 
3) 90 | dimnames(param_simulated) <- list(paste0("Feature", seq_len(n_feature)), 91 | c("pi0", "mu", "sigma")) 92 | while(TRUE) { 93 | nFeature_nonzero <- sum(ind_sigma_nonzero_tosimulate) 94 | simulated_mixture <- 95 | param_original_nonZeroSigma[sample.int( 96 | n = nrow(param_original_nonZeroSigma), 97 | size = nFeature_nonzero, 98 | replace = TRUE), , drop = FALSE] 99 | simulated_difference <- mvtnorm::rmvnorm(n = nFeature_nonzero, 100 | sigma = F_fit$K_nonzero) 101 | simulated_mixture <- simulated_mixture + simulated_difference 102 | param_simulated[ind_sigma_nonzero_tosimulate, ] <- simulated_mixture 103 | ind_sigma_nonzero_tosimulate <- param_simulated[, "sigma"] <= 0 104 | if(!any(ind_sigma_nonzero_tosimulate)) break 105 | } 106 | 107 | nFeature_zero <- sum(!ind_sigma) 108 | if(nFeature_zero > 0) { 109 | param_original_zeroSigma <- 110 | param_original[param_original[, "sigma"] == 0, , drop = FALSE] 111 | simulated_mixture <- 112 | param_original_zeroSigma[sample.int( 113 | n = nrow(param_original_zeroSigma), 114 | size = nFeature_zero, 115 | replace = TRUE), , drop = FALSE] 116 | simulated_difference <- mvtnorm::rmvnorm(n = nFeature_zero, 117 | sigma = F_fit$K_zero) 118 | simulated_mixture[, c("pi0", "mu")] <- 119 | simulated_mixture[, c("pi0", "mu"), drop = FALSE] + 120 | simulated_difference 121 | param_simulated <- rbind(param_simulated, simulated_mixture) 122 | } 123 | 124 | param_simulated[, "pi0"] <- expit(param_simulated[, "pi0"]) 125 | return(param_simulated) 126 | } 127 | 128 | 129 | #' @importFrom magrittr %>% 130 | generate_feature_metadata_spike_df <- 131 | function(features, 132 | perc_feature_spiked_metadata, 133 | n_metadata, 134 | effect_size, 135 | spike_metadata = c("both", 136 | "abundance", 137 | "prevalence")) { 138 | n_feature_spiked <- ceiling(length(features) * 139 | perc_feature_spiked_metadata) 140 | if(spike_metadata == "both") 141 | spike_metadata <- c("abundance", "prevalence") 142 | 143 | feature_metadata_spike_df <- 144 | 
purrr::map2_dfr(seq_len(n_metadata), 145 | effect_size, 146 | function(metadatum_i, effect_size_i) { 147 | data.frame(metadata_datum = metadatum_i, 148 | effect_size = 149 | effect_size_i * 150 | sample(c(-1, 1), 151 | size = n_feature_spiked, 152 | replace = TRUE), 153 | feature_spiked = 154 | sample(features, 155 | size = n_feature_spiked, 156 | replace = FALSE), 157 | stringsAsFactors = FALSE) 158 | }) 159 | feature_metadata_spike_df <- 160 | tidyr::expand_grid(feature_metadata_spike_df, 161 | associated_property = spike_metadata) %>% 162 | dplyr::select(metadata_datum, 163 | feature_spiked, 164 | associated_property, 165 | effect_size) %>% 166 | dplyr::arrange(metadata_datum, 167 | associated_property, 168 | -effect_size) %>% 169 | as.data.frame() 170 | 171 | return(feature_metadata_spike_df) 172 | } 173 | 174 | generate_depth <- function(mu_depth, 175 | sigma_depth, 176 | n, median_depth) { 177 | depth <- exp(rnorm(n = n, mean = mu_depth, sd = sigma_depth)) 178 | depth <- round(depth / median(depth) * median_depth) 179 | return(depth) 180 | } 181 | 182 | generate_count <- function(rel, depth) { 183 | mat_count <- 184 | vapply(seq_along(depth), 185 | function(i_sample) { 186 | if(all(rel[, i_sample] == 0)) 187 | return(rep(0, nrow(rel))) 188 | else 189 | rmultinom(n = 1, size = depth[i_sample], prob = rel[, i_sample]) 190 | }, 191 | rep(0, nrow(rel))) 192 | dimnames(mat_count) <- dimnames(rel) 193 | return(mat_count) 194 | } 195 | -------------------------------------------------------------------------------- /R/helpers.R: -------------------------------------------------------------------------------- 1 | logit <- function(x) log(x) - log(1 - x) 2 | 3 | expit <- function(x) 4 | exp(x) / (1 + exp(x)) 5 | 6 | TSS <- function(x, correct = FALSE) { 7 | if(any(x < 0)) 8 | stop("Negative x values are not accepted!") 9 | if(all(x == 0)) return(x) 10 | # this is a special case where only one feature is present 11 | if(sum(x > 0) == 1 & correct) { 12 | x[x > 0] <- 1 - 
0.5 / length(x) 13 | return(x) 14 | } 15 | return(x / sum(x)) 16 | } 17 | 18 | lower_tri <- function(x, warning = FALSE) { 19 | if(!isSymmetric(x) & warning) 20 | warning("x is not symmetric!") 21 | 22 | x[lower.tri(x)] 23 | } 24 | 25 | upper_tri <- function(x, warning = FALSE) { 26 | if(!isSymmetric(x) & warning) 27 | warning("x is not symmetric!") 28 | 29 | x[upper.tri(x)] 30 | } 31 | 32 | enforce_symm <- function(x, method = "svd") { 33 | if(nrow(x) != ncol(x)) 34 | stop("x does not appear to be a covariance matrix!") 35 | x_out <- x 36 | if(!isSymmetric(x_out)) { 37 | if(method == "average") { 38 | lower_averaged <- (lower_tri(x_out, warning = FALSE) + 39 | lower_tri(t(x_out), warning = FALSE)) / 2 40 | x_out[lower.tri(x_out)] <- lower_averaged 41 | x_out[upper.tri(x_out)] <- 42 | t(x_out)[upper.tri(x_out)] 43 | } 44 | if(method == "lower") 45 | x_out[upper.tri(x_out)] <- upper_tri(t(x_out), warning = FALSE) 46 | if(method == "upper") 47 | x_out[lower.tri(x_out)] <- lower_tri(t(x_out), warning = FALSE) 48 | if(method == "svd") { 49 | svd_fit <- svd(x) 50 | if(Matrix::rankMatrix(svd_fit$u) < nrow(x)) 51 | # Sometimes svd runs into weird cases where the matrix is not singular but the 52 | # eigen vector matrix is 53 | svd_fit <- svd(x + diag(rep(1e-16, nrow(x)))) 54 | 55 | # In case it's not pos-def 56 | svd_fit$d <- abs(svd_fit$d) ## FIXME 57 | 58 | x_out <- svd_fit$u %*% diag(svd_fit$d) %*% t(svd_fit$u) 59 | 60 | # I don't know how this could happen, but even after this x_out can still be not symmetric!
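The SVD branch above symmetrizes `x` by rebuilding it from its left singular vectors as `u %*% diag(d) %*% t(u)`, which is symmetric by construction (up to floating-point rounding). A tiny standalone illustration with a hypothetical 2x2 input (not package code):

```r
# slightly asymmetric "covariance" matrix (off-diagonals 0.50 vs 0.45)
x <- matrix(c(1, 0.50, 0.45, 1), 2, 2)
svd_fit <- svd(x)
# rebuild from the left singular vectors only, as enforce_symm(method = "svd") does
x_sym <- svd_fit$u %*% diag(svd_fit$d) %*% t(svd_fit$u)
```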
61 | ## FIXME 62 | x_out[upper.tri(x_out)] <- upper_tri(t(x_out), warning = FALSE) 63 | } 64 | } 65 | 66 | return(x_out) 67 | } 68 | 69 | enforce_corr <- function(x) { 70 | if(nrow(x) != ncol(x)) 71 | stop("x does not appear to be a covariance matrix!") 72 | 73 | diag_sigma <- diag(sqrt(abs(diag(solve(x))))) 74 | x_out <- diag_sigma %*% x %*% diag_sigma 75 | 76 | return(x_out) 77 | } 78 | 79 | a_to_u <- function(a, pi0, mu, sigma, 80 | half_pi0 = FALSE) { 81 | if(any(pi0 == 1) | any(pi0 == 0)) 82 | stop("zero or one valued pi0 is not supported!") 83 | 84 | if(half_pi0) 85 | to_return <- pi0 / 2 86 | else 87 | to_return <- pi0 88 | 89 | ind_nonzero <- a > 0 90 | if(any(ind_nonzero)) ##FIXME 91 | to_return[ind_nonzero] <- 92 | pi0[ind_nonzero] + 93 | pnorm(log(a[ind_nonzero]), 94 | mean = mu[ind_nonzero], 95 | sd = sigma[ind_nonzero]) * 96 | (1 - pi0[ind_nonzero]) 97 | 98 | return(to_return) 99 | } 100 | 101 | a_to_u_marg <- function(a, pi0, mu, sigma, 102 | half_pi0 = FALSE, 103 | u_tol = 1e-5) { 104 | if(half_pi0) 105 | to_return <- rep(pi0 / 2, 106 | length = length(a)) 107 | else 108 | to_return <- rep(pi0, 109 | length = length(a)) 110 | 111 | ind_nonzero <- a > 0 112 | if(any(ind_nonzero)) ##FIXME 113 | to_return[ind_nonzero] <- 114 | pi0 + 115 | pnorm(log(a[ind_nonzero]), 116 | mean = mu, 117 | sd = sigma) * 118 | (1 - u_tol) * 119 | (1 - pi0) 120 | 121 | return(to_return) 122 | } 123 | 124 | a_to_g <- function(a, pi0, mu, sigma) { 125 | if(pi0 == 1 | pi0 == 0) 126 | stop("zero or one valued pi0 is not supported!") 127 | ind_nonzero <- a > 0 128 | 129 | to_return <- u <- a_to_u_marg(a, pi0, mu, sigma) 130 | to_return[ind_nonzero] <- qnorm(u[ind_nonzero]) 131 | to_return[!ind_nonzero] <- 132 | -mean(to_return[ind_nonzero]) * 133 | sum(ind_nonzero) / sum(1 - ind_nonzero) 134 | 135 | return(to_return) 136 | } 137 | 138 | a <- function(x, asum) { 139 | a <- x * asum 140 | a[x == 0] <- 0 141 | 142 | return(a) 143 | } 144 | 145 | make_CVfolds <- function(n, K) { 146 
| cut(sample.int(n), breaks = K, labels = FALSE) 147 | } 148 | 149 | filter_data <- function(data, 150 | k_feature = 2, k_sample = 1, 151 | maxit = 3) { 152 | i_iter <- 0 153 | ind_feature <- rep(TRUE, ncol(data)) 154 | ind_sample <- rep(TRUE, nrow(data)) 155 | 156 | while(TRUE) { 157 | if (i_iter + 1 > maxit) 158 | stop("Max iteration reached!") 159 | i_iter <- i_iter + 1 160 | 161 | ind_feature_tmp <- 162 | apply(data[ind_sample, ind_feature, drop = FALSE] > 0, 2, sum) >= 163 | k_feature 164 | ind_feature[ind_feature] <- ind_feature_tmp 165 | ind_sample_tmp <- 166 | apply(data[ind_sample, ind_feature, drop = FALSE] > 0, 1, sum) >= 167 | k_sample 168 | ind_sample[ind_sample] <- ind_sample_tmp 169 | 170 | if (all(ind_feature_tmp) & all(ind_sample_tmp)) 171 | return(list(ind_feature = ind_feature, 172 | ind_sample = ind_sample)) 173 | } 174 | } 175 | 176 | fill_estimates_CV <- function(params_CV, params_full, ind_feature) { 177 | params_return <- params_CV 178 | 179 | params_return[c("pi0", "mu", "sigma")] <- params_full[c("pi0", "mu", "sigma")] 180 | for(param in c("pi0", "mu", "sigma")) 181 | params_return[[param]][ind_feature] <- params_CV[[param]] 182 | 183 | params_return$Sigma <- 184 | params_return$Omega <- 185 | diag(rep(1, length(ind_feature))) 186 | params_return[["Sigma"]][ind_feature, ind_feature] <- params_CV[["Sigma"]] 187 | params_return[["Omega"]][ind_feature, ind_feature] <- threshold_matrix(solve(params_CV[["Sigma"]])) 188 | 189 | return(params_return) 190 | } 191 | 192 | threshold_matrix <- function(x, threshold_zero = 1e-16) { 193 | ## FIXME 194 | if(!is.null(threshold_zero)) { 195 | diag_x <- diag(x) 196 | x[abs(x) < threshold_zero] <- 0 197 | diag(x) <- diag_x 198 | } 199 | return(x) 200 | } -------------------------------------------------------------------------------- /R/integration_helpers.R: -------------------------------------------------------------------------------- 1 | integrate2 <- function(f, 2 | lower, upper, 3 | rel_tol, abs_tol, 
max_eval, 4 | precBits, 5 | ...) { 6 | if(lower == 0 & upper == 0) 7 | return(list(integral = 0, 8 | error = 0, 9 | neval = 0, 10 | returnCode = 1)) 11 | 12 | ff <- function(x) f(x, ...) 13 | 14 | neval <- 2 15 | knots_spline <- Rmpfr::mpfr(c(lower, upper), 16 | precBits = precBits) 17 | vals_spline <- 18 | Rmpfr::mpfr( 19 | ff(as.double(knots_spline)), 20 | precBits = precBits) 21 | errors_spline <- Inf 22 | 23 | while(TRUE) { 24 | i_max_error <- which(errors_spline == max(errors_spline))[1] 25 | knots_spline <- c(knots_spline[seq(1, i_max_error)], 26 | Rmpfr::mean(knots_spline[c(i_max_error, i_max_error + 1)]), 27 | knots_spline[seq(i_max_error + 1, neval)]) 28 | 29 | vals_spline <- c(vals_spline[seq(1, i_max_error)], 30 | Rmpfr::mpfr(ff(as.double(knots_spline[i_max_error + 1])), 31 | precBits = precBits), 32 | vals_spline[seq(i_max_error + 1, neval)]) 33 | 34 | neval <- neval + 1 35 | knots_diff <- knots_spline[-1] - knots_spline[-neval] 36 | # linear spline for estimating integration 37 | coefs_spline <- Rmpfr::mpfrArray(NA, precBits = precBits, 38 | dim = c(2, neval - 1)) 39 | coefs_spline[2, ] <- 40 | (vals_spline[-1] - vals_spline[-neval]) / 41 | knots_diff 42 | coefs_spline[1, ] <- 43 | vals_spline[-neval] - knots_spline[-neval] * coefs_spline[2, ] 44 | integral <- sum(coefs_spline[1, ] * knots_spline[-1] + 45 | coefs_spline[2, ] / 2 * knots_spline[-1]^2 - 46 | coefs_spline[1, ] * knots_spline[-neval] - 47 | coefs_spline[2, ] / 2 * knots_spline[-neval]^2) 48 | 49 | # error estimation 50 | errors_spline <- estimate_errors( 51 | knots_diff, 52 | c(Rmpfr::mpfr(0, precBits = precBits), 53 | coefs_spline[2, ], 54 | Rmpfr::mpfr(0, precBits = precBits)), 55 | precBits = precBits) 56 | error <- sum(errors_spline) 57 | 58 | if(neval >= max_eval) 59 | break 60 | if(integral < 0) 61 | stop("Negative integration values; something went wrong!") 62 | if(integral > 0) 63 | if(error / abs(integral) < rel_tol | 64 | error < abs_tol) 65 | break 66 | } 67 | 
return(list(integral = integral, 68 | error = error, 69 | neval = neval, 70 | returnCode = 0, 71 | debug = list(knots_spline = knots_spline, 72 | vals_spline = vals_spline, 73 | errors_spline = errors_spline))) 74 | } 75 | 76 | get_intLimits <- function(x, pi0, mu, sigma, Omega, Sigma, 77 | maxit = 10) { 78 | ind_nonzero <- x != 0 79 | logx <- log(x[ind_nonzero]) 80 | mat_range <- cbind(mu[ind_nonzero] - logx - sqrt(2000)*sigma[ind_nonzero], 81 | mu[ind_nonzero] - logx + sqrt(2000)*sigma[ind_nonzero]) 82 | range <- c(max(c(mat_range[, 1], -745 - min(logx))), 83 | min(mat_range[, 2])) 84 | 85 | if(range[1] >= range[2]) 86 | return(c(0, 0)) ## FIXME?? 87 | 88 | i_iter <- 1 89 | vlim <- seq(from = range[1], to = range[2], length.out = 3) 90 | vval <- sapply(vlim, 91 | function(vv) 92 | dloga_asum(asum = exp(vv), x = x, 93 | pi0 = pi0, mu = mu, sigma = sigma, 94 | Omega = Omega, Sigma = Sigma)) 95 | while(TRUE) { 96 | vflag <- vval > -745 97 | if(vflag[1] | rev(vflag)[1]) 98 | stop("Positive dx values at integration limits!") 99 | if(sum(vflag) > 1) 100 | return(vlim[c(min(which(vflag)) - 1, 101 | max(which(vflag)) + 1)]) 102 | 103 | if(i_iter + 1 > maxit) 104 | return(range(vlim)) ## FIXME?? 
105 | i_iter <- i_iter + 1 106 | 107 | if(all(vval == -Inf)) { 108 | vlim <- c(range[1], 109 | sort(runif(3, min = range[1], max = range[2])), 110 | range[2]) 111 | vval <- c(-Inf, 112 | sapply(vlim[seq(2, 4)], 113 | function(vv) 114 | dloga_asum(asum = exp(vv), x = x, 115 | pi0 = pi0, mu = mu, sigma = sigma, 116 | Omega = Omega, Sigma = Sigma)), 117 | -Inf) 118 | } else { 119 | ind_max <- order(-vval)[1] 120 | if(ind_max == 1) ind_max <- 2 121 | if(ind_max == length(vlim)) ind_max <- length(vlim) - 1 122 | vlim <- vlim[seq(ind_max - 1, 123 | ind_max + 1)] 124 | vval <- vval[seq(ind_max - 1, 125 | ind_max + 1)] 126 | 127 | vlim <- c(vlim[1], 128 | mean(vlim[seq(1, 2)]), 129 | vlim[2], 130 | mean(vlim[seq(2, 3)]), 131 | vlim[3]) 132 | vval <- c(vval[1], 133 | dloga_asum(asum = exp(vlim[2]), x = x, 134 | pi0 = pi0, mu = mu, sigma = sigma, 135 | Omega = Omega, Sigma = Sigma), 136 | vval[2], 137 | dloga_asum(asum = exp(vlim[4]), x = x, 138 | pi0 = pi0, mu = mu, sigma = sigma, 139 | Omega = Omega, Sigma = Sigma), 140 | vval[3]) 141 | } 142 | } 143 | } 144 | 145 | dloga_asum <- function(asum, x, pi0, mu, sigma, Omega, Sigma) { 146 | if(asum * min(x[x != 0]) <= 0) 147 | return(-Inf) 148 | if(asum > 0) 149 | return(dloga(a(x, asum), pi0 = pi0, mu = mu, sigma = sigma, 150 | Omega = Omega, Sigma = Sigma, 151 | log.p = TRUE)) 152 | } 153 | 154 | get_diff <- function(x, x_old, 155 | denom_c = 1e-5, 156 | method = "abs") { 157 | x <- as.vector(x) 158 | x_old <- as.vector(x_old) 159 | 160 | abs_diff <- abs(x - x_old) 161 | if(method == "abs") 162 | return(max(abs_diff)) 163 | if(method == "rel") { 164 | rel_diff <- abs_diff / (abs(x) + denom_c) 165 | return(max(rel_diff)) 166 | } 167 | } 168 | 169 | Vectorize2 <- function(FUN, vectorize.args = arg.names, SIMPLIFY = TRUE, 170 | USE.NAMES = TRUE) 171 | { 172 | arg.names <- as.list(formals(FUN)) 173 | arg.names[["..."]] <- NULL 174 | arg.names <- names(arg.names) 175 | vectorize.args <- as.character(vectorize.args) 176 | 177 | 
if(length(vectorize.args) != 1) 178 | stop("Can only vectorize over one argument!") 179 | 180 | if (!length(vectorize.args)) 181 | return(FUN) 182 | if (!all(vectorize.args %in% arg.names)) 183 | stop("must specify names of formal arguments for 'vectorize'") 184 | collisions <- arg.names %in% c("FUN", "SIMPLIFY", "USE.NAMES", 185 | "vectorize.args") 186 | if (any(collisions)) 187 | stop(sQuote("FUN"), " may not have argument(s) named ", 188 | paste(sQuote(arg.names[collisions]), collapse = ", ")) 189 | FUNV <- function() { 190 | args <- lapply(as.list(match.call())[-1L], eval, parent.frame()) 191 | names <- if (is.null(names(args))) 192 | character(length(args)) 193 | else names(args) 194 | dovec <- names %in% vectorize.args 195 | val <- do.call("mapply", 196 | c(FUN = FUN, args[dovec], 197 | MoreArgs = list(args[!dovec]), 198 | SIMPLIFY = TRUE, USE.NAMES = USE.NAMES)) 199 | if(!is.null(dim(args[dovec][[1]]))) 200 | val <- array(val, dim = dim(args[dovec][[1]])) 201 | return(val) 202 | } 203 | formals(FUNV) <- formals(FUN) 204 | FUNV 205 | } 206 | 207 | estimate_errors <- function(knots_diff, slopes, precBits) { 208 | if(length(knots_diff) != length(slopes) - 2) 209 | stop("Length of knots_diff and slopes should agree!") 210 | errors <- rep(NA, length = length(knots_diff)) 211 | 212 | y3s <- knots_diff * slopes[-c(1, length(slopes))] 213 | errors <- 214 | Rmpfr::sapplyMpfr(seq_along(knots_diff), 215 | function(i_region) 216 | estimate_one_error( 217 | knots_diff[i_region], 218 | y3s[i_region], 219 | slopes[c(i_region, i_region + 1, i_region + 2)], 220 | precBits = precBits 221 | )) 222 | 223 | return(errors) 224 | } 225 | 226 | estimate_one_error <- function(x3, y3, slopes, precBits) { 227 | slope1 <- slopes[1] 228 | slope2 <- slopes[3] 229 | if((slope1 <= slopes[2] & slopes[2] <= slope2) | 230 | (slope1 >= slopes[2] & slopes[2] >= slope2)) { 231 | # function is convex/concave here 232 | 233 | # if slopes are the same 234 | if(slope1 == slope2) 235 | 
return(Rmpfr::mpfr(0, precBits = precBits)) 236 | 237 | # if not calculate where lines meet 238 | x2 <- (slope2 * x3 - y3) / (slope2 - slope1) 239 | y2 <- x2 * slope1 240 | return(abs(x2 * y3 - x3 * y2) / 2) 241 | } else { 242 | return(abs(x3 * y3) / 2) 243 | } 244 | } 245 | -------------------------------------------------------------------------------- /R/likelihood_expectation.R: -------------------------------------------------------------------------------- 1 | dx <- function(x, 2 | pi0, mu, sigma, Omega, Sigma, 3 | control = list(), 4 | log.p = FALSE) { 5 | control <- do.call(control_integrate, control) 6 | limits <- get_intLimits( 7 | x = x, 8 | pi0 = pi0, mu = mu, sigma = sigma, 9 | Omega = Omega, Sigma = Sigma, 10 | maxit = control$maxit_limits) 11 | 12 | fit_integrate <- 13 | integrate2(vintegrand_dx, 14 | lower = limits[1], upper = limits[2], 15 | rel_tol = control$rel_tol, abs_tol = control$abs_tol, 16 | max_eval = control$max_eval, 17 | precBits = control$precBits, 18 | x = x, pi0 = pi0, mu = mu, sigma = sigma, 19 | Omega = Omega, Sigma = Sigma) 20 | 21 | # jacobian 22 | fit_integrate$integral <- as.double(log(fit_integrate$integral) - sum(log(x[x > 0]))) 23 | fit_integrate$error <- as.double(exp(log(fit_integrate$error) - sum(log(x[x > 0])))) 24 | 25 | if(log.p) { 26 | return(fit_integrate$integral) 27 | } 28 | else { 29 | fit_integrate$integral <- exp(fit_integrate$integral) 30 | if(control$only_value) 31 | return(fit_integrate$integral) 32 | return(fit_integrate) 33 | } 34 | } 35 | 36 | #' control parameters for the 37 | #' numerical integrations during the E step of SparseDOSSA2's fitting 38 | #' 39 | #' @param rel_tol relative change threshold in the integration values for the 40 | #' integration to converge 41 | #' @param abs_tol absolute change threshold in the integration values for the 42 | #' integration to converge 43 | #' @param max_eval maximum of integration evaluations allowed 44 | #' @param maxit_limits maximum number of tries allowed 
to guess the integration's 45 | #' lower and upper limits 46 | #' @param precBits numeric precision used for the integration values 47 | #' @param only_value whether or not only the integration value should be returned 48 | #' 49 | #' @return a list of the same names 50 | #' @export 51 | control_integrate <- function(rel_tol = 1e-2, 52 | abs_tol = 0, 53 | max_eval = 50, 54 | maxit_limits = 10, 55 | precBits = 200, 56 | only_value = TRUE) { 57 | list(rel_tol = rel_tol, 58 | abs_tol = abs_tol, 59 | max_eval = max_eval, 60 | maxit_limits = maxit_limits, 61 | precBits = precBits, 62 | only_value = only_value) 63 | } 64 | 65 | ploga <- function(a, 66 | pi0, mu, sigma, Sigma, 67 | log.p = TRUE) { 68 | ind_zero <- a == 0 69 | 70 | if(!any(ind_zero)) { 71 | log_p <- 1 72 | } else { 73 | a <- a[ind_zero] 74 | pi0 <- pi0[ind_zero] 75 | mu <- mu[ind_zero] 76 | sigma <- sigma[ind_zero] 77 | Sigma <- Sigma[ind_zero, ind_zero, drop = FALSE] 78 | 79 | u <- a_to_u(a, 80 | pi0 = pi0, mu = mu, sigma = sigma) 81 | g <- qnorm(u) 82 | log_p <- 83 | log(mvtnorm::pmvnorm( 84 | lower = -Inf, 85 | upper = g, 86 | mean = rep(0, length(g)), 87 | sigma = Sigma)) 88 | } 89 | 90 | if(log.p) 91 | return(log_p) 92 | else 93 | return(exp(log_p)) 94 | } 95 | 96 | mloga <- function(a, 97 | pi0, mu, sigma, Sigma, 98 | doComputeVariance = FALSE, 99 | log_ploga = 0) { 100 | ind_zero <- a == 0 101 | 102 | if(!any(ind_zero)) { 103 | return(list(mean_cond = NULL, 104 | Sigma_cond = matrix(nrow = 0, ncol = 0))) 105 | } else { 106 | a <- a[ind_zero] 107 | pi0 <- pi0[ind_zero] 108 | mu <- mu[ind_zero] 109 | sigma <- sigma[ind_zero] 110 | Sigma <- Sigma[ind_zero, ind_zero, drop = FALSE] 111 | 112 | u <- a_to_u(a, 113 | pi0 = pi0, mu = mu, sigma = sigma) 114 | g <- qnorm(u) 115 | m_cond <- 116 | tmvtnorm::mtmvnorm(sigma = Sigma, 117 | upper = g, 118 | doComputeVariance = doComputeVariance) 119 | } 120 | 121 | return(list(mean_cond = m_cond$tmean, 122 | Sigma_cond = m_cond$tvar)) 123 | } 124 | 125 | 
dloga_forInt <- function(a, 126 | pi0, mu, sigma, Omega, Sigma, 127 | log_ploga = 0, 128 | mean_cond, 129 | Sigma_cond = NA, 130 | mean_cond_choice = "half", 131 | Sigma_cond_choice = "cond", 132 | log.p = TRUE) { 133 | 134 | u <- a_to_u(a, 135 | pi0 = pi0, mu = mu, sigma = sigma, 136 | half_pi0 = TRUE) 137 | g <- qnorm(u) 138 | 139 | ind_nonzero <- a > 0 140 | if(any(abs(g) == Inf)) { 141 | log_d <- -Inf 142 | } else if(all(ind_nonzero)) { 143 | log_d <- 144 | mvtnorm::dmvnorm(x = g, 145 | mean = rep(0, length(g)), 146 | sigma = Sigma, 147 | log = TRUE) + 148 | sum(g^2) / 2 - 149 | sum((log(a) - mu)^2 / (sigma)^2 / 2) - 150 | sum(log(sigma)) + 151 | sum(log(1 - pi0)) 152 | } else if(!any(ind_nonzero)) { 153 | log_d <- 154 | log_ploga 155 | } else { 156 | if(mean_cond_choice == "half") 157 | mean_zero <- g[!ind_nonzero] 158 | else 159 | mean_zero <- mean_cond 160 | if(Sigma_cond_choice == "cond") 161 | Sigma_nonzero <- solve(Omega[ind_nonzero, ind_nonzero, drop = FALSE]) 162 | else if (Sigma_cond_choice == "full") 163 | Sigma_nonzero <- Sigma[ind_nonzero, ind_nonzero, drop = FALSE] 164 | else 165 | Sigma_nonzero <- solve(Omega[ind_nonzero, ind_nonzero, drop = FALSE]) + 166 | Sigma[ind_nonzero, !ind_nonzero, drop = FALSE] %*% 167 | solve(Sigma[!ind_nonzero, !ind_nonzero, drop = FALSE]) %*% 168 | Sigma_cond %*% 169 | solve(Sigma[!ind_nonzero, !ind_nonzero, drop = FALSE]) %*% 170 | Sigma[!ind_nonzero, ind_nonzero, drop = FALSE] 171 | log_d <- 172 | log_ploga + 173 | mvtnorm::dmvnorm( 174 | x = g[ind_nonzero], 175 | mean = (Sigma[ind_nonzero, !ind_nonzero, drop = FALSE] %*% 176 | solve(Sigma[!ind_nonzero, !ind_nonzero, drop = FALSE], 177 | mean_zero))[, 1], 178 | sigma = Sigma_nonzero, 179 | log = TRUE) + 180 | sum(g[ind_nonzero]^2) / 2 - 181 | sum((log(a[ind_nonzero]) - mu[ind_nonzero])^2 / 182 | (sigma[ind_nonzero])^2 / 2) - 183 | sum(log(sigma[ind_nonzero])) + 184 | sum(log(1 - pi0[ind_nonzero])) 185 | } 186 | 187 | if(log.p) 188 | return(log_d) 189 | else 190 | 
return(exp(log_d)) 191 | } 192 | 193 | dloga <- function(a, 194 | pi0, mu, sigma, Omega, Sigma, 195 | log.p = TRUE) { 196 | 197 | u <- a_to_u(a, 198 | pi0 = pi0, mu = mu, sigma = sigma) 199 | g <- qnorm(u) 200 | 201 | ind_nonzero <- a > 0 202 | if(any(abs(g) == Inf)) { 203 | log_d <- -Inf 204 | } else if(all(ind_nonzero)) { 205 | log_d <- 206 | mvtnorm::dmvnorm(x = g, 207 | mean = rep(0, length = length(pi0)), 208 | sigma = Sigma, 209 | log = TRUE) + 210 | sum(g^2) / 2 - 211 | sum((log(a) - mu)^2 / 212 | sigma^2 / 2) - 213 | sum(log(sigma)) + 214 | sum(log(1 - pi0)) 215 | } else if(!any(ind_nonzero)) { 216 | log_d <- 217 | pmvnorm2( 218 | lower = -Inf, 219 | upper = g, 220 | mean = rep(0, length(g)), 221 | sigma = Sigma, 222 | log.p = TRUE) 223 | } else { 224 | log_d <- 225 | pmvnorm2( 226 | lower = -Inf, 227 | upper = g[!ind_nonzero], 228 | mean = (-solve(Omega[!ind_nonzero, !ind_nonzero, drop = FALSE], 229 | Omega[!ind_nonzero, ind_nonzero, drop = FALSE]) %*% 230 | g[ind_nonzero])[, 1], 231 | sigma = solve(Omega[!ind_nonzero, !ind_nonzero, drop = FALSE]), 232 | log.p = TRUE) + 233 | mvtnorm::dmvnorm(x = g[ind_nonzero], 234 | mean = rep(0, length = sum(ind_nonzero)), 235 | sigma = Sigma[ind_nonzero, ind_nonzero, drop = FALSE], 236 | log = TRUE) + 237 | sum(g[ind_nonzero]^2) / 2 - 238 | sum((log(a[ind_nonzero]) - mu[ind_nonzero])^2 / 239 | (sigma[ind_nonzero])^2 / 2) - 240 | sum(log(sigma[ind_nonzero])) + 241 | sum(log(1 - pi0[ind_nonzero])) 242 | } 243 | 244 | if(log.p) 245 | return(log_d) 246 | else 247 | return(exp(log_d)) 248 | } 249 | 250 | pmvnorm2 <- function(lower = -Inf, upper = Inf, 251 | mean, sigma, log.p = FALSE) { 252 | if(length(lower) == 1) 253 | lower <- rep(lower, times = length(mean)) 254 | if(length(upper) == 1) 255 | upper <- rep(upper, times = length(mean)) 256 | 257 | g <- igraph::graph_from_adjacency_matrix( 258 | adjmatrix = (abs(sigma) > 0) * 1, 259 | mode = "undirected", 260 | diag = FALSE 261 | ) 262 | comp <- igraph::components(g) 263 | 
vp <- 264 | vapply(seq_len(comp$no), 265 | function(i_comp) { 266 | i_ind <- comp$membership == i_comp 267 | log(mvtnorm::pmvnorm(lower = lower[i_ind], 268 | upper = upper[i_ind], 269 | mean = mean[i_ind], 270 | sigma = sigma[i_ind, i_ind, drop = FALSE])) 271 | }, 272 | 0.0) 273 | logp <- sum(vp) 274 | if(log.p) 275 | return(logp) 276 | else 277 | return(exp(logp)) 278 | } 279 | 280 | dloga_old <- function(a, 281 | pi0, mu, sigma, Omega, 282 | log.p = TRUE) { 283 | u <- a_to_u_old(a, 284 | pi0 = pi0, mu = mu, sigma = sigma) 285 | g <- qnorm(u) 286 | 287 | if(any(abs(g) == Inf)) { 288 | log_d <- -Inf 289 | } else { 290 | log_d <- du(g, Omega) - 291 | sum((log(a[a > 0]) - mu[a > 0])^2 / (sigma[a > 0])^2 / 2) - 292 | log(2 * pi) / 2 * sum(a > 0) - sum(log(sigma[a > 0])) 293 | } 294 | 295 | if(log.p) 296 | return(log_d) 297 | else 298 | return(exp(log_d)) 299 | } 300 | 301 | du <- function(g, Omega, log.p = TRUE) { 302 | # Without normalizing constant (2pi)^(-2/p)! 303 | if(any(g == -Inf | g == Inf)) return(-Inf) 304 | log_d <- log_dmvnorm(S = g %*% t(g), Omega = Omega) + sum(g^2)/2 + 305 | log(det(Omega)) / 2 306 | 307 | if(log.p) 308 | return(log_d) 309 | else 310 | return(exp(log_d)) 311 | } 312 | 313 | log_dmvnorm <- function(S, Omega) { 314 | - sum(Omega * S) / 2 315 | } 316 | 317 | integrand_dx <- function(log_asum, x, 318 | pi0, mu, sigma, 319 | Omega, Sigma) { 320 | dloga(a = a(x, exp(log_asum)), 321 | pi0 = pi0, mu = mu, sigma = sigma, 322 | Omega = Omega, Sigma = Sigma, 323 | log.p = FALSE) 324 | } 325 | 326 | vintegrand_dx <- Vectorize2(integrand_dx, vectorize.args = "log_asum") 327 | 328 | ea <- function(x, 329 | pi0, mu, sigma, Omega, Sigma, 330 | control = list(), log.p = FALSE) { 331 | control <- do.call(control_integrate, control) 332 | limits <- get_intLimits( 333 | x = x, 334 | pi0 = pi0, mu = mu, sigma = sigma, 335 | Omega = Omega, Sigma = Sigma, 336 | maxit = control$maxit_limits) 337 | 338 | fit_integrate <- 339 | integrate2(vintegrand_ea, 340 | lower = 
limits[1], upper = limits[2], 341 | rel_tol = control$rel_tol, abs_tol = control$abs_tol, 342 | max_eval = control$max_eval, 343 | precBits = control$precBits, 344 | x = x, pi0 = pi0, mu = mu, sigma = sigma, 345 | Omega = Omega, Sigma = Sigma) 346 | 347 | # jacobian 348 | fit_integrate$integral <- as.double(log(fit_integrate$integral) - sum(log(x[x > 0]))) 349 | fit_integrate$error <- as.double(exp(log(fit_integrate$error) - sum(log(x[x > 0])))) 350 | 351 | if(log.p) { 352 | return(fit_integrate$integral) 353 | } 354 | else { 355 | fit_integrate$integral <- exp(fit_integrate$integral) 356 | if(control$only_value) 357 | return(fit_integrate$integral) 358 | return(fit_integrate) 359 | } 360 | } 361 | 362 | integrand_ea <- function(log_asum, x, 363 | pi0, mu, sigma, 364 | Omega, Sigma) { 365 | exp(dloga(a = a(x, exp(log_asum)), 366 | pi0 = pi0, mu = mu, sigma = sigma, 367 | Omega = Omega, Sigma = Sigma, 368 | log.p = TRUE) + 369 | log_asum) 370 | } 371 | 372 | vintegrand_ea <- Vectorize2(integrand_ea, 373 | vectorize.args = "log_asum") 374 | 375 | eloga <- function(x, 376 | pi0, mu, sigma, Omega, Sigma, 377 | control = list(), log.p = FALSE) { 378 | control <- do.call(control_integrate, control) 379 | limits <- get_intLimits( 380 | x = x, 381 | pi0 = pi0, mu = mu, sigma = sigma, 382 | Omega = Omega, Sigma = Sigma, 383 | maxit = control$maxit_limits) 384 | 385 | fit_integrate <- 386 | integrate2(vintegrand_eloga, 387 | lower = limits[1], upper = limits[2], 388 | rel_tol = control$rel_tol, abs_tol = control$abs_tol, 389 | max_eval = control$max_eval, 390 | precBits = control$precBits, 391 | x = x, pi0 = pi0, mu = mu, sigma = sigma, 392 | Omega = Omega, Sigma = Sigma) 393 | 394 | # jacobian 395 | fit_integrate$integral <- as.double(log(fit_integrate$integral) - sum(log(x[x > 0]))) 396 | fit_integrate$error <- as.double(exp(log(fit_integrate$error) - sum(log(x[x > 0])))) 397 | 398 | if(log.p) { 399 | return(fit_integrate$integral) 400 | } 401 | else { 402 | fit_integrate$integral <- 
exp(fit_integrate$integral) 403 | if(control$only_value) 404 | return(fit_integrate$integral) 405 | return(fit_integrate) 406 | } 407 | } 408 | 409 | integrand_eloga <- function(log_asum, x, 410 | pi0, mu, sigma, 411 | Omega, Sigma) { 412 | dloga(a = a(x, exp(log_asum)), 413 | pi0 = pi0, mu = mu, sigma = sigma, 414 | Omega = Omega, Sigma = Sigma, 415 | log.p = FALSE) * log_asum 416 | } 417 | 418 | vintegrand_eloga <- Vectorize2(integrand_eloga, 419 | vectorize.args = "log_asum") 420 | 421 | eloga2 <- function(x, 422 | pi0, mu, sigma, Omega, Sigma, 423 | control = list(), log.p = FALSE) { 424 | control <- do.call(control_integrate, control) 425 | limits <- get_intLimits( 426 | x = x, 427 | pi0 = pi0, mu = mu, sigma = sigma, 428 | Omega = Omega, Sigma = Sigma, 429 | maxit = control$maxit_limits) 430 | 431 | fit_integrate <- 432 | integrate2(vintegrand_eloga2, 433 | lower = limits[1], upper = limits[2], 434 | rel_tol = control$rel_tol, abs_tol = control$abs_tol, 435 | max_eval = control$max_eval, 436 | precBits = control$precBits, 437 | x = x, pi0 = pi0, mu = mu, sigma = sigma, 438 | Omega = Omega, Sigma = Sigma) 439 | 440 | # jacobian 441 | fit_integrate$integral <- as.double(log(fit_integrate$integral) - sum(log(x[x > 0]))) 442 | fit_integrate$error <- as.double(exp(log(fit_integrate$error) - sum(log(x[x > 0])))) 443 | 444 | if(log.p) { 445 | return(fit_integrate$integral) 446 | } 447 | else { 448 | fit_integrate$integral <- exp(fit_integrate$integral) 449 | if(control$only_value) 450 | return(fit_integrate$integral) 451 | return(fit_integrate) 452 | } 453 | } 454 | 455 | integrand_eloga2 <- function(log_asum, x, 456 | pi0, mu, sigma, 457 | Omega, Sigma) { 458 | dloga(a = a(x, exp(log_asum)), 459 | pi0 = pi0, mu = mu, sigma = sigma, 460 | Omega = Omega, Sigma = Sigma, 461 | log.p = FALSE) * log_asum^2 462 | } 463 | 464 | vintegrand_eloga2 <- Vectorize2(integrand_eloga2, 465 | vectorize.args = "log_asum") 466 | 467 | get_es <- function(x, pi0, mu, sigma, Omega, Sigma, 468 | 
control) { 469 | time_start <- Sys.time() 470 | control <- do.call(control_integrate, control) 471 | 472 | limits <- get_intLimits( 473 | x = x, 474 | pi0 = pi0, mu = mu, sigma = sigma, 475 | Omega = Omega, Sigma = Sigma, 476 | maxit = control$maxit_limits) 477 | 478 | neval <- 2 479 | knots_spline <- Rmpfr::mpfr(c(limits[1], limits[2]), 480 | precBits = control$precBits) 481 | vals_spline <- 482 | Rmpfr::mpfr( 483 | vintegrand_dx(as.double(knots_spline), 484 | pi0 = pi0, x = x, mu = mu, sigma = sigma, 485 | Omega = Omega, Sigma = Sigma), 486 | precBits = control$precBits) 487 | errors_spline <- Inf 488 | 489 | # find knots using dx 490 | while(TRUE) { 491 | i_max_error <- which(errors_spline == max(errors_spline))[1] 492 | knots_spline <- c(knots_spline[seq(1, i_max_error)], 493 | Rmpfr::mean(knots_spline[c(i_max_error, i_max_error + 1)]), 494 | knots_spline[seq(i_max_error + 1, neval)]) 495 | vals_spline <- c(vals_spline[seq(1, i_max_error)], 496 | Rmpfr::mpfr(vintegrand_dx(as.double(knots_spline[i_max_error + 1]), 497 | pi0 = pi0, x = x, mu = mu, sigma = sigma, 498 | Omega = Omega, Sigma = Sigma), 499 | precBits = control$precBits), 500 | vals_spline[seq(i_max_error + 1, neval)]) 501 | 502 | neval <- neval + 1 503 | knots_diff <- knots_spline[-1] - knots_spline[-neval] 504 | # linear spline for estimating integration 505 | coefs_spline <- Rmpfr::mpfrArray(NA, precBits = control$precBits, 506 | dim = c(2, neval - 1)) 507 | coefs_spline[2, ] <- 508 | (vals_spline[-1] - vals_spline[-neval]) / 509 | knots_diff 510 | coefs_spline[1, ] <- 511 | vals_spline[-neval] - knots_spline[-neval] * coefs_spline[2, ] 512 | integral_dx <- sum(coefs_spline[1, ] * knots_spline[-1] + 513 | coefs_spline[2, ] / 2 * knots_spline[-1]^2 - 514 | coefs_spline[1, ] * knots_spline[-neval] - 515 | coefs_spline[2, ] / 2 * knots_spline[-neval]^2) 516 | # error estimation 517 | errors_spline <- estimate_errors( 518 | knots_diff, 519 | c(Rmpfr::mpfr(0, precBits = control$precBits), 520 | 
coefs_spline[2, ], 521 | Rmpfr::mpfr(0, precBits = control$precBits)), 522 | precBits = control$precBits) 523 | error_dx <- sum(errors_spline) 524 | 525 | if(neval >= control$max_eval) 526 | break 527 | if(integral_dx < 0) 528 | stop("Negative integration values; something went wrong!") 529 | if(integral_dx > 0) 530 | if(error_dx / abs(integral_dx) < control$rel_tol | 531 | error_dx < control$abs_tol) 532 | break 533 | } 534 | 535 | # modify eloga, eloga2 536 | integral_eloga <- sum(coefs_spline[1, ] / 2 * knots_spline[-1]^2 + 537 | coefs_spline[2, ] / 3 * knots_spline[-1]^3 - 538 | coefs_spline[1, ] / 2 * knots_spline[-neval]^2 - 539 | coefs_spline[2, ] / 3 * knots_spline[-neval]^3) 540 | error_eloga <- sum(abs(knots_diff * 541 | ((vals_spline * knots_spline)[-1] - 542 | (vals_spline * knots_spline)[-neval]))) 543 | integral_eloga2 <- sum(coefs_spline[1, ] / 3 * knots_spline[-1]^3 + 544 | coefs_spline[2, ] / 4 * knots_spline[-1]^4 - 545 | coefs_spline[1, ] / 3 * knots_spline[-neval]^3 - 546 | coefs_spline[2, ] / 4 * knots_spline[-neval]^4) 547 | error_eloga2 <- sum(abs(knots_diff * 548 | ((vals_spline * knots_spline^2)[-1] - 549 | (vals_spline * knots_spline^2)[-neval]))) 550 | 551 | # refit for ea 552 | coefs_spline <- Rmpfr::mpfrArray(NA, precBits = control$precBits, 553 | dim = c(2, neval - 1)) 554 | coefs_spline[2, ] <- 555 | ((vals_spline * exp(knots_spline))[-1] - 556 | (vals_spline * exp(knots_spline))[-neval]) / 557 | knots_diff 558 | coefs_spline[1, ] <- 559 | (vals_spline * exp(knots_spline))[-neval] - 560 | knots_spline[-neval] * coefs_spline[2, ] 561 | 562 | integral_ea <- sum(coefs_spline[1, ] * knots_spline[-1] + 563 | coefs_spline[2, ] / 2 * knots_spline[-1]^2 - 564 | coefs_spline[1, ] * knots_spline[-neval] - 565 | coefs_spline[2, ] / 2 * knots_spline[-neval]^2) 566 | error_ea <- sum(abs(knots_diff * 567 | ((vals_spline * exp(knots_spline))[-1] - 568 | (vals_spline * exp(knots_spline))[-neval]))) 569 | 570 | return(c("ea" = as.double(integral_ea / 
integral_dx), 571 | "dx" = as.double(integral_dx), 572 | "logLik" = as.double(log(integral_dx) - sum(log(x[x > 0]))), 573 | "eloga" = as.double(integral_eloga / integral_dx), 574 | "eloga2" = as.double(integral_eloga2 / integral_dx), 575 | "error_ea" = as.double(error_ea), 576 | "error_dx" = as.double(error_dx), 577 | "error_eloga" = as.double(error_eloga), 578 | "error_eloga2" = as.double(error_eloga2), 579 | "time" = as.numeric(Sys.time() - time_start, units = "secs"))) 580 | } -------------------------------------------------------------------------------- /R/sim_helpers.R: -------------------------------------------------------------------------------- 1 | filter_params <- function(params, max_pi0 = 0.95) { 2 | ind <- params$pi0 < max_pi0 3 | return(list(pi0 = params$pi0[ind], 4 | mu = params$mu[ind], 5 | sigma = params$sigma[ind], 6 | Omega = threshold_matrix(solve(params$Sigma[ind, ind])), 7 | Sigma = params$Sigma[ind, ind])) 8 | } 9 | 10 | params_a_to_x <- function(params_a, n = 100000) { 11 | samples_a <- rcopulasso(n = round(n * 1.5), # FIXME 12 | pi0 = params_a$pi0, 13 | mu = params_a$mu, 14 | sigma = params_a$sigma, 15 | Omega = params_a$Omega) 16 | samples_a <- samples_a[!apply(samples_a == 0, 1, all), ][1:n, ] 17 | samples_x <- t(apply(samples_a, 1, function(x) x / sum(x))) 18 | mat_marginals <- get_marginals(samples_x) 19 | return(list(pi0 = mat_marginals[, "pi0"], 20 | mu = mat_marginals[, "mu"], 21 | sigma = mat_marginals[, "sigma"], 22 | Corr = cor(samples_x, method = "spear"))) 23 | } -------------------------------------------------------------------------------- /R/spike.R: -------------------------------------------------------------------------------- 1 | spike_a_metadata <- function(null, 2 | feature_param, 3 | metadata, 4 | spike_df) { 5 | if(ncol(null) != nrow(metadata)) 6 | stop("Sample size of null abundance matrix and metadata do not agree!") 7 | if(nrow(spike_df) != 8 | nrow(dplyr::distinct(spike_df, 9 | metadata_datum, 10 | 
feature_spiked, 11 | associated_property))) 12 | stop("feature-metadata spiking specification data frame has duplicate ", 13 | "metadata_datum, feature_spiked, and associated_property tuples!") 14 | spiked <- null 15 | for(feature_i in unique(spike_df$feature_spiked)) { 16 | spike_df_i <- subset(spike_df, feature_spiked == feature_i) 17 | spike_df_i_abundance <- subset(spike_df_i, 18 | associated_property == "abundance") 19 | spike_df_i_prevalence <- subset(spike_df_i, 20 | associated_property == "prevalence") 21 | spiked[feature_i, ] <- 22 | spike_oneA_metadata(param = feature_param[feature_i, ], 23 | metadata = metadata, 24 | col_abundance = spike_df_i_abundance$metadata_datum, 25 | effect_abundance = spike_df_i_abundance$effect_size, 26 | col_prevalence = spike_df_i_prevalence$metadata_datum, 27 | effect_prevalence = spike_df_i_prevalence$effect_size) 28 | } 29 | 30 | dimnames(spiked) <- dimnames(null) 31 | return(spiked) 32 | } 33 | 34 | spike_oneA_metadata <- function(param, 35 | metadata, 36 | col_abundance = c(), 37 | effect_abundance = c(), 38 | col_prevalence = c(), 39 | effect_prevalence = c()) { 40 | effect_abundance_all <- rep(0, ncol(metadata)) 41 | effect_prevalence_all <- rep(0, ncol(metadata)) 42 | 43 | effect_abundance_all[col_abundance] <- effect_abundance 44 | effect_prevalence_all[col_prevalence] <- effect_prevalence 45 | 46 | pi0 <- expit(logit(param["pi0"]) - (metadata %*% effect_prevalence_all)[, 1]) 47 | mu <- param["mu"] + (metadata %*% effect_abundance_all)[, 1] 48 | 49 | a <- rZILogN_one(n = nrow(metadata), 50 | pi0 = pi0, 51 | mu = mu, 52 | sigma = param["sigma"]) 53 | return(a) 54 | } 55 | -------------------------------------------------------------------------------- /R/sysdata.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biobakery/SparseDOSSA2/26a998a6e3a5f04d6a86cce14d6d3229ca82633e/R/sysdata.rda 
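The zero-inflated log-normal spike-in implemented in `spike.R` above can be sketched as follows. This is an illustrative Python re-implementation, not the package's code: the function name `spike_one_feature`, the two-column metadata layout, and the effect sizes are hypothetical. As in `spike_oneA_metadata`, prevalence effects enter on the logit scale with a minus sign (so a positive effect lowers the zero-probability `pi0`, i.e. raises prevalence), and abundance effects shift the log-normal mean `mu`.

```python
import numpy as np

rng = np.random.default_rng(0)

def expit(x):
    # inverse logit
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def spike_one_feature(pi0, mu, sigma, metadata,
                      effect_prevalence, effect_abundance):
    """Draw one feature's absolute abundances across samples from a
    zero-inflated log-normal with metadata-dependent parameters
    (hypothetical sketch of SparseDOSSA2's spike_oneA_metadata)."""
    n = metadata.shape[0]
    # prevalence effects: logit-scale shift of the zero-probability
    pi0_i = expit(logit(pi0) - metadata @ effect_prevalence)
    # abundance effects: linear shift of the log-normal mean
    mu_i = mu + metadata @ effect_abundance
    present = rng.random(n) >= pi0_i                  # Bernoulli presence
    values = rng.lognormal(mean=mu_i, sigma=sigma)    # non-zero abundances
    return np.where(present, values, 0.0)

# hypothetical metadata: one binary and one continuous column
meta = np.column_stack([rng.integers(0, 2, 500),
                        rng.normal(size=500)]).astype(float)
a = spike_one_feature(pi0=0.5, mu=0.0, sigma=1.0, metadata=meta,
                      effect_prevalence=np.array([1.0, 0.0]),
                      effect_abundance=np.array([0.0, 0.5]))
```

With these settings, samples with the binary covariate equal to 1 should show higher prevalence, and samples with larger values of the continuous covariate higher non-zero abundance, mirroring the separate prevalence/abundance rows a user would supply in `spike_df`.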
-------------------------------------------------------------------------------- /data/Stool_subset.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biobakery/SparseDOSSA2/26a998a6e3a5f04d6a86cce14d6d3229ca82633e/data/Stool_subset.rda -------------------------------------------------------------------------------- /man/SparseDOSSA2.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/SparseDOSSA2.R 3 | \name{SparseDOSSA2} 4 | \alias{SparseDOSSA2} 5 | \title{Simulate synthetic microbial abundance observations with SparseDOSSA2} 6 | \usage{ 7 | SparseDOSSA2( 8 | template = "Stool", 9 | n_sample = 100, 10 | new_features = TRUE, 11 | n_feature = 100, 12 | spike_metadata = "none", 13 | metadata_effect_size = 1, 14 | perc_feature_spiked_metadata = 0.05, 15 | metadata_matrix = NULL, 16 | median_read_depth = 50000, 17 | verbose = TRUE 18 | ) 19 | } 20 | \arguments{ 21 | \item{template}{can be 1) a character string (\code{"Stool"}, \code{"Vaginal"}, 22 | or \code{"IBD"}) indicating one of the pre-trained templates in SparseDOSSA2, 23 | or 2) user-provided, fitted results. In the latter case this should be an output 24 | from \code{fit_SparseDOSSA2} or \code{fitCV_SparseDOSSA2}.} 25 | 26 | \item{n_sample}{number of samples to simulate} 27 | 28 | \item{new_features}{\code{TRUE}/\code{FALSE} indicator for whether or not new 29 | features should be simulated. If \code{FALSE} then the same set of features 30 | in \code{template} will be simulated.} 31 | 32 | \item{n_feature}{number of features to simulate. Only relevant when 33 | \code{new_features} is \code{TRUE}} 34 | 35 | \item{spike_metadata}{for metadata spike-in configurations. 
Must be one of two things: 36 | 37 | \itemize{ 38 | \item a character string of \code{"none"}, \code{"both"}, \code{"abundance"}, 39 | or \code{"prevalence"}, indicating whether or not 40 | association with metadata will be spiked in. For the spiked-in case, it 41 | indicates if features' abundance/prevalence/both characteristics will be associated 42 | with metadata (also see explanations for \code{metadata_effect_size} and 43 | \code{perc_feature_spiked_metadata}) 44 | \item a data.frame for detailed spike-in configurations. This is the more 45 | advanced approach, where detailed specifications for metadata-microbial 46 | feature associations are provided. Note: if \code{spike_metadata} is provided 47 | as a data.frame, then \code{metadata_matrix} must be provided as well 48 | (cannot be generated automatically). In this case, \code{spike_metadata} 49 | must have exactly four columns: \code{metadata_datum}, \code{feature_spiked}, 50 | \code{associated_property}, and \code{effect_size}. Each row of the data.frame 51 | configures one specific metadata-microbe association. Specifically: 52 | \itemize{ 53 | \item \code{metadata_datum} (integer) indicates the column number for the metadata 54 | variable to be associated with the microbe 55 | \item \code{feature_spiked} (character) indicates the microbe name to be associated 56 | with the metadata variable 57 | \item \code{associated_property} (character, either \code{"abundance"} or 58 | \code{"prevalence"}), indicating the property of the microbe to be modified. 59 | If you want the microbe to be associated with the metadata variable 60 | in both properties, include two rows in \code{spike_metadata}, one for 61 | abundance and one for prevalence 62 | \item \code{effect_size} (numeric) indicating the strength of the association. 
63 | This corresponds to log fold change in non-zero abundance for 64 | \code{"abundance"} spike-in, and log odds ratio for \code{"prevalence"} 65 | spike-in 66 | } 67 | }} 68 | 69 | \item{metadata_effect_size}{(for when \code{spike_metadata} is \code{"abundance"}, 70 | \code{"prevalence"}, or \code{"both"}) effect size of the spiked-in associations. This is 71 | non-zero log fold change for abundance spike-in, and log odds ratio for prevalence spike-in} 72 | 73 | \item{perc_feature_spiked_metadata}{(for when \code{spike_metadata} is \code{"abundance"}, 74 | \code{"prevalence"}, or \code{"both"}) 75 | percentage of features to be associated with metadata} 76 | 77 | \item{metadata_matrix}{the user can provide a metadata matrix to use for spiking-in 78 | of feature abundances. If using the default (\code{NULL}), two variables will be generated: 79 | one continuous, and one binary with balanced cases and controls. Note: if 80 | \code{spike_metadata} is provided as a data.frame, then the user must provide 81 | \code{metadata_matrix} too} 82 | 83 | \item{median_read_depth}{targeted median per-sample read depth} 84 | 85 | \item{verbose}{whether detailed information should be printed} 86 | } 87 | \value{ 88 | a list with the following components: 89 | \describe{ 90 | \item{simulated_data}{ 91 | feature-by-sample matrix of simulated microbial count observations 92 | } 93 | \item{simulated_matrices}{ 94 | list of all simulated data matrices, including those of null (i.e. not spiked-in) absolute 95 | abundances, spiked-in absolute abundances, and normalized relative abundances 96 | } 97 | \item{params}{ 98 | parameters used for simulation. These are provided in \code{template}. 99 | } 100 | \item{spike_metadata}{ 101 | list of variables provided or generated for metadata spike-in. 
This includes 102 | \code{spike_metadata} for the original \code{spike_metadata} parameter provided 103 | by the user, \code{metadata_matrix} for the 104 | metadata (either provided by the user or internally generated), and 105 | \code{feature_metadata_spike_df} 106 | for detailed specification of which metadata variables were used to spike in associations 107 | with which features, in what properties at which effect sizes. This is the 108 | same as \code{spike_metadata} if the latter was provided as a data.frame. 109 | } 110 | } 111 | } 112 | \description{ 113 | \code{SparseDOSSA2} generates synthetic microbial abundance observations 114 | from either a pre-trained template, or user-provided fitted results from 115 | \code{fit_SparseDOSSA2} or \code{fitCV_SparseDOSSA2}. Additional options 116 | are available for simulating associations between microbial features 117 | and metadata variables. 118 | } 119 | \examples{ 120 | ## Using one of the pre-trained SparseDOSSA2 templates: 121 | sim <- SparseDOSSA2(template = "Stool", n_sample = 200, new_features = FALSE) 122 | ## Using a user-provided trained SparseDOSSA2 model: 123 | data("Stool_subset") 124 | fitted <- fit_SparseDOSSA2(data = Stool_subset) 125 | sim <- SparseDOSSA2(template = fitted, n_sample = 200, new_features = FALSE) 126 | } 127 | \author{ 128 | Siyuan Ma, \email{syma.research@gmail.com} 129 | } 130 | -------------------------------------------------------------------------------- /man/Stool_subset.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/data.R 3 | \docType{data} 4 | \name{Stool_subset} 5 | \alias{Stool_subset} 6 | \title{A subset of the HMP1-II stool samples} 7 | \format{ 8 | A matrix with 5 rows (species) and 5 columns (samples) 9 | } 10 | \source{ 11 | \url{https://www.hmpdacc.org/hmp/} 12 | } 13 | \usage{ 14 | Stool_subset 15 | } 16 | \description{ 17 | A dataset containing 
species-level microbial counts of a subset of 18 | the HMP1-II stool samples. This includes the top 5 most abundant 19 | species and top 5 most deeply sequenced samples. 20 | } 21 | \keyword{datasets} 22 | -------------------------------------------------------------------------------- /man/control_fit.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/SparseDOSSA2_fit.R 3 | \name{control_fit} 4 | \alias{control_fit} 5 | \title{Control options for fit_SparseDOSSA2 and fitCV_SparseDOSSA2} 6 | \usage{ 7 | control_fit( 8 | maxit = 100, 9 | rel_tol = 0.01, 10 | abs_tol = 0.01, 11 | control_numint = list(), 12 | verbose = FALSE, 13 | debug_dir = NULL 14 | ) 15 | } 16 | \arguments{ 17 | \item{maxit}{maximum number of EM iterations} 18 | 19 | \item{rel_tol}{relative change threshold in the log likelihood 20 | for algorithm convergence} 21 | 22 | \item{abs_tol}{absolute change threshold in the log likelihood 23 | for algorithm convergence} 24 | 25 | \item{control_numint}{a named list of control parameters for the 26 | numerical integrations during the E step. See help page for 27 | \code{control_integrate}} 28 | 29 | \item{verbose}{whether or not detailed running messages should be provided} 30 | 31 | \item{debug_dir}{directory for intermediate output, such as the 32 | EM expectations and parameter values during each step of the 33 | EM algorithm. 
Defaults to \code{NULL}, in which case no such output 34 | will be generated} 35 | } 36 | \value{ 37 | a list with the same names as the arguments 38 | } 39 | \description{ 40 | Control options for fit_SparseDOSSA2 and fitCV_SparseDOSSA2 41 | } 42 | -------------------------------------------------------------------------------- /man/control_integrate.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/likelihood_expectation.R 3 | \name{control_integrate} 4 | \alias{control_integrate} 5 | \title{control parameters for the 6 | numerical integrations during the E step of SparseDOSSA2's fitting} 7 | \usage{ 8 | control_integrate( 9 | rel_tol = 0.01, 10 | abs_tol = 0, 11 | max_eval = 50, 12 | maxit_limits = 10, 13 | precBits = 200, 14 | only_value = TRUE 15 | ) 16 | } 17 | \arguments{ 18 | \item{rel_tol}{relative change threshold in the integration values for the 19 | integration to converge} 20 | 21 | \item{abs_tol}{absolute change threshold in the integration values for the 22 | integration to converge} 23 | 24 | \item{max_eval}{maximum number of integration evaluations allowed} 25 | 26 | \item{maxit_limits}{maximum number of tries allowed to guess the integration's 27 | lower and upper limits} 28 | 29 | \item{precBits}{numeric precision used for the integration values} 30 | 31 | \item{only_value}{whether or not only the integration value should be returned} 32 | } 33 | \value{ 34 | a list with the same names as the arguments 35 | } 36 | \description{ 37 | control parameters for the 38 | numerical integrations during the E step of SparseDOSSA2's fitting 39 | } 40 | -------------------------------------------------------------------------------- /man/fitCV_SparseDOSSA2.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/SparseDOSSA2_fit.R 3 | \name{fitCV_SparseDOSSA2} 4 | 
\alias{fitCV_SparseDOSSA2} 5 | \title{Fit SparseDOSSA 2 model to a microbiome abundance dataset with cross validation} 6 | \usage{ 7 | fitCV_SparseDOSSA2( 8 | data, 9 | lambdas = 10^seq(-2, 0, length.out = 5), 10 | K = 5, 11 | control = list() 12 | ) 13 | } 14 | \arguments{ 15 | \item{data}{feature-by-sample matrix of abundances (proportions or 16 | counts).} 17 | 18 | \item{lambdas}{vector of positive penalization parameters for the sparsity of feature-feature 19 | correlations. The function fits SparseDOSSA 2 models to each of the lambda values, and uses 20 | cross validation likelihood to select the optimal one. If not provided, these will be chosen 21 | automatically.} 22 | \item{K}{number of cross validation folds.} 23 | \item{control}{a named list of additional control parameters. See help page for 24 | \code{control_fit}.} 25 | } 26 | \value{ 27 | a list, with the following components: 28 | \describe{ 29 | \item{EM_fit}{ 30 | list of fitted parameters from the EM algorithm, with additional cross validation likelihood. 31 | } 32 | \item{F_fit}{ 33 | fitted parameters for the joint distribution of per-feature prevalence, abundance, 34 | and variability parameters (for simulating new features) 35 | } 36 | \item{depth_fit}{fitted parameters for the read depth distribution. Only applicable 37 | to count data. 38 | } 39 | \item{l_filtering}{list of quality control filtering for samples and features. 40 | } 41 | } 42 | } 43 | \description{ 44 | \code{fitCV_SparseDOSSA2} randomly partitions the data into fitting and testing 45 | subsets. It fits the SparseDOSSA 2 model to the fitting sets and uses log likelihood 46 | of the fitted parameters in the testing sets as the criterion for selection of 47 | tuning parameter lambda. 
48 | } 49 | \examples{ 50 | data("Stool_subset") 51 | fitted <- fitCV_SparseDOSSA2(data = Stool_subset, 52 | lambdas = c(0.1, 1), 53 | K = 5) 54 | 55 | } 56 | \author{ 57 | Siyuan Ma, \email{syma.research@gmail.com} 58 | } 59 | -------------------------------------------------------------------------------- /man/fit_SparseDOSSA2.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/SparseDOSSA2_fit.R 3 | \name{fit_SparseDOSSA2} 4 | \alias{fit_SparseDOSSA2} 5 | \title{Fit SparseDOSSA 2 model to a microbiome abundance dataset} 6 | \usage{ 7 | fit_SparseDOSSA2(data, lambda = 1, control = list()) 8 | } 9 | \arguments{ 10 | \item{data}{feature-by-sample matrix of abundances (proportions or 11 | counts)} 12 | 13 | \item{lambda}{positive penalization parameter for the sparsity of feature-feature 14 | correlations. Defaults to the maximum value \code{1}, where features are assumed to be 15 | independent (no correlations, most sparse)} 16 | 17 | \item{control}{a named list of additional control parameters. See help page for 18 | \code{control_fit}} 19 | } 20 | \value{ 21 | a list, with the following components: 22 | \describe{ 23 | \item{EM_fit}{ 24 | list of fitted parameters from the EM algorithm. 25 | } 26 | \item{F_fit}{ 27 | fitted parameters for the joint distribution of per-feature prevalence, abundance, 28 | and variability parameters (for simulating new features) 29 | } 30 | \item{depth_fit}{fitted parameters for the read depth distribution. Only applicable 31 | to count data. 32 | } 33 | \item{l_filtering}{list of quality control filtering for samples and features. 34 | } 35 | } 36 | } 37 | \description{ 38 | \code{fit_SparseDOSSA2} fits the SparseDOSSA 2 model (zero-inflated log-normal 39 | marginals connected through a Gaussian copula) to microbial abundances. 
It takes 40 | as input a feature-by-sample microbial count or relative abundance table and 41 | a penalization tuning parameter \code{lambda} to control the sparsity of 42 | feature-feature correlations. It then adopts a penalized expectation-maximization 43 | algorithm to estimate the model parameters. 44 | } 45 | \examples{ 46 | data("Stool_subset") 47 | fitted <- fit_SparseDOSSA2(data = Stool_subset) 48 | } 49 | \author{ 50 | Siyuan Ma, \email{syma.research@gmail.com} 51 | } 52 | -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | library(SparseDOSSA2) 3 | 4 | test_check("SparseDOSSA2") 5 | -------------------------------------------------------------------------------- /tests/testthat/test-likelihood.R: -------------------------------------------------------------------------------- 1 | test_that("dloga works", { 2 | pi0 <- rep(0.5, 3) 3 | mu <- rep(0, 3) 4 | sigma <- rep(1, 3) 5 | Sigma <- diag(rep(1, 3)) 6 | Omega <- solve(Sigma) 7 | 8 | # certain dloga values should be the same as directly calculated from dnorm 9 | expect_equal(dnorm(0)^2 * 0.125, 10 | dloga(a = c(0, 1, 1), 11 | pi0 = pi0, 12 | mu = mu, 13 | sigma = sigma, 14 | Omega = Omega, 15 | Sigma = Sigma, 16 | log.p = FALSE)) 17 | 18 | int_func <- function(x) dloga(a = c(0, exp(x)), 19 | pi0 = pi0, 20 | mu = mu, 21 | sigma = sigma, 22 | Omega = Omega, 23 | Sigma = Sigma, 24 | log.p = FALSE) 25 | 26 | # dloga should integrate out to discrete components of the likelihood 27 | expect_lt(abs(0.125 - 28 | cubature::pcubature(int_func, lowerLimit = rep(-10, 2), upperLimit = rep(10, 2))$integral), 29 | 1e-5) 30 | 31 | # test for non-trivial correlation 32 | Sigma <- matrix(0.1, 3, 3) 33 | diag(Sigma) <- rep(1, 3) 34 | Omega <- solve(Sigma) 35 | 36 | int_func <- function(x) dloga(a = c(0, exp(x)), 37 | pi0 = pi0, 38 | mu = mu, 39 | sigma = 
sigma, 40 | Omega = Omega, 41 | Sigma = Sigma, 42 | log.p = FALSE) 43 | 44 | expect_lt( 45 | abs( 46 | cubature::pcubature(f = int_func, lowerLimit = rep(-10, 2), upperLimit = rep(10, 2))$integral - 47 | mvtnorm::pmvnorm(lower = c(-Inf, 0, 0), upper = c(0, Inf, Inf), 48 | mean = rep(0, 3), sigma = Sigma)), 49 | 1e-5) 50 | }) -------------------------------------------------------------------------------- /tests/testthat/test-spike_in.R: -------------------------------------------------------------------------------- 1 | test_that("spike in works", { 2 | # # no spike-in 3 | # sim <- SparseDOSSA2() 4 | # 5 | # # spike-in default 6 | # sim <- SparseDOSSA2(spike_metadata = "both") 7 | # plot(sim$spike_metadata$metadata_matrix[, 1], 8 | # sim$simulated_matrices$rel[60,]) 9 | # boxplot(sim$simulated_matrices$rel[19, ] ~ 10 | # sim$spike_metadata$metadata_matrix[, 2]) 11 | # 12 | # # spike-in matrix 13 | # # error 14 | # spike_metadata <- sim$spike_metadata$feature_metadata_spike_df[c(1, 20), ] 15 | # sim <- SparseDOSSA2(spike_metadata = spike_metadata) 16 | # 17 | # # error 18 | # metadata_matrix <- matrix(rnorm(100), ncol = 1) 19 | # sim <- SparseDOSSA2(spike_metadata = spike_metadata, 20 | # metadata_matrix = metadata_matrix) 21 | # 22 | # # runs 23 | # metadata_matrix <- matrix(rnorm(300), ncol = 3) 24 | # sim <- SparseDOSSA2(spike_metadata = spike_metadata, 25 | # metadata_matrix = metadata_matrix) 26 | }) -------------------------------------------------------------------------------- /vignettes/SparseDOSSA2.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Simulating realistic microbial observations with SparseDOSSA2" 3 | author: 4 | - name: "Siyuan Ma" 5 | affiliation: 6 | - Harvard T.H. 
Chan School of Public Health 7 | - Broad Institute 8 | email: syma.research@gmail.com 9 | package: SparseDOSSA2 10 | date: "12/01/2020" 11 | output: 12 | BiocStyle::html_document 13 | vignette: > 14 | %\VignetteIndexEntry{SparseDOSSA2} 15 | %\VignetteEngine{knitr::rmarkdown} 16 | %\VignetteEncoding{UTF-8} 17 | bibliography: references.bib 18 | --- 19 | 20 | 21 | ```{r setup, include=FALSE} 22 | knitr::opts_chunk$set(echo = TRUE) 23 | knitr::opts_chunk$set(cache = FALSE) 24 | ``` 25 | 26 | # Introduction 27 | 28 | `SparseDOSSA2` is an R package for fitting to and simulating realistic microbial 29 | abundance observations. It provides functionalities for: a) generation of realistic synthetic microbial observations, b) spiking-in of associations with metadata variables 30 | for e.g. benchmarking or power analysis purposes, and c) fitting the SparseDOSSA 2 31 | model to real-world microbial abundance observations that can be used for a). This vignette is intended to provide working examples for these functionalities. 32 | 33 | ```{r, message=FALSE} 34 | library(SparseDOSSA2) 35 | # tidyverse packages for utilities 36 | library(magrittr) 37 | library(dplyr) 38 | library(ggplot2) 39 | ``` 40 | 41 | # Installation 42 | 43 | SparseDOSSA2 is a Bioconductor package and can be installed via the following 44 | command. 45 | 46 | ```{r Installation, eval = FALSE} 47 | # if (!requireNamespace("BiocManager", quietly = TRUE)) 48 | # install.packages("BiocManager") 49 | # BiocManager::install("SparseDOSSA2") 50 | ``` 51 | 52 | # Simulating realistic microbial observations with `SparseDOSSA2` 53 | The most important functionality of `SparseDOSSA2` is the simulation of 54 | realistic synthetic microbial observations. To this end, `SparseDOSSA2` provides 55 | three pre-trained templates, `"Stool"`, `"Vaginal"`, and `"IBD"`, targeting 56 | continuous, discrete, and diseased population structures. 
57 | ```{r SparseDOSSA2_generate} 58 | Stool_simulation <- SparseDOSSA2(template = "Stool", 59 | n_sample = 100, 60 | n_feature = 100, 61 | verbose = TRUE) 62 | Vaginal_simulation <- SparseDOSSA2(template = "Vaginal", 63 | n_sample = 100, 64 | n_feature = 100, 65 | verbose = TRUE) 66 | ``` 67 | 68 | # Fitting to microbiome datasets with SparseDOSSA2 69 | 70 | SparseDOSSA2 provides two functions, `fit_SparseDOSSA2` and `fitCV_SparseDOSSA2`, 71 | to fit the SparseDOSSA2 model to microbial count or relative abundance observations. 72 | As input, these functions require a feature-by-sample 73 | table of microbial abundance observations. We provide with SparseDOSSA2 a minimal 74 | example of such a dataset: a five-by-five subset of the [HMP1-II](https://www.hmpdacc.org/hmp/) 75 | stool study. 76 | 77 | ```{r load data} 78 | data("Stool_subset", package = "SparseDOSSA2") 79 | # columns are samples. 80 | Stool_subset[1:2, 1, drop = FALSE] 81 | ``` 82 | 83 | ## Fitting SparseDOSSA2 model with `fit_SparseDOSSA2` 84 | 85 | `fit_SparseDOSSA2` fits the SparseDOSSA2 model to estimate the 86 | model parameters: per-feature prevalence, mean and standard deviation of 87 | non-zero abundances, and feature-feature correlations. 88 | It also estimates the joint distribution of these parameters 89 | and (if the input is counts) a read depth distribution. 90 | 91 | ```{r SparseDOSSA2_fit} 92 | fitted <- fit_SparseDOSSA2(data = Stool_subset, 93 | control = list(verbose = TRUE)) 94 | # fitted mean log non-zero abundance values of the first two features 95 | fitted$EM_fit$fit$mu[1:2] 96 | ``` 97 | 98 | ## Fitting SparseDOSSA2 model with `fitCV_SparseDOSSA2` 99 | 100 | The user can additionally achieve optimal model fitting via 101 | `fitCV_SparseDOSSA2`. They can either provide a vector of tuning parameter 102 | values (`lambdas`) to control sparsity in the estimation of the correlation 103 | matrix parameter, or a grid will be selected automatically. 
104 | `fitCV_SparseDOSSA2` uses cross validation to select an "optimal" model fit 105 | across these tuning parameters via average testing log-likelihood. This is a 106 | computationally intensive procedure, best suited for users who would like an 107 | accurate fit to the input dataset, for simulating new microbial 108 | observations on the same features as the input (i.e. not new features). 109 | 110 | ```{r SparseDOSSA2_fit_CV} 111 | # set.seed(1) 112 | # fitted_CV <- fitCV_SparseDOSSA2(data = Stool_subset, 113 | # lambdas = c(0.1, 1), 114 | # K = 2, 115 | # control = list(verbose = TRUE)) 116 | # the average log likelihood of different tuning parameters 117 | # apply(fitted_CV$EM_fit$logLik_CV, 2, mean) 118 | # The second lambda (1) had better performance in terms of log likelihood, 119 | # and will be selected as the default fit. 120 | ``` 121 | 122 | ## Parallelization controls with `future` 123 | 124 | `SparseDOSSA2` internally uses `r BiocStyle::CRANpkg("future")` to allow for 125 | parallel computation. The user can thus specify parallelization through `future`'s 126 | interface. See the [reference manual](https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html) 127 | for `future` for more details. This is 128 | particularly suited if fitting `SparseDOSSA2` in a high-performance computing 129 | environment. 130 | 131 | ```{r future} 132 | ## regular fitting 133 | # system.time(fitted_regular <- 134 | # fit_SparseDOSSA2(data = Stool_subset, 135 | # control = list(verbose = FALSE))) 136 | ## parallel fitting with future: 137 | # future::plan(future::multisession()) 138 | # system.time(fitted_parallel <- 139 | # fit_SparseDOSSA2(data = Stool_subset, 140 | # control = list(verbose = FALSE))) 141 | 142 | ## For CV fitting, there are three components that can be parallelized, in order: 143 | ## different cross validation folds, different tuning parameter lambdas, 144 | ## and different samples. 
It is usually most efficient to parallelize at the 145 | ## sample level: 146 | # system.time(fitted_regular_CV <- 147 | # fitCV_SparseDOSSA2(data = Stool_subset, 148 | # lambdas = c(0.1, 1), 149 | # K = 2, 150 | # control = list(verbose = TRUE))) 151 | # future::plan(list(future::sequential, future::sequential, future::multisession)) 152 | # system.time(fitted_parallel_CV <- 153 | # fitCV_SparseDOSSA2(data = Stool_subset, 154 | # lambdas = c(0.1, 1), 155 | # K = 2, 156 | # control = list(verbose = TRUE))) 157 | 158 | ``` 159 | 160 | 161 | # Session info 162 | ```{r sessioninfo} 163 | sessionInfo() 164 | ``` 165 | # References --------------------------------------------------------------------------------
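As a closing illustration of the `spike_metadata` data.frame interface documented in `man/SparseDOSSA2.Rd` above, the following sketch configures one abundance and one prevalence association. It is a minimal, hedged example: the feature names (`"Feature1"`, `"Feature2"`) and the metadata layout are placeholders, not package data, and in practice `feature_spiked` must name features present in the chosen template.

```r
library(SparseDOSSA2)

# Hypothetical metadata for 100 samples: one binary and one continuous variable.
# When spike_metadata is a data.frame, metadata_matrix must be supplied by the user.
metadata_matrix <- cbind(rbinom(100, 1, 0.5),  # column 1: binary group
                         rnorm(100))           # column 2: continuous covariate

# One row per metadata-microbe association, using the four required columns:
# metadata column 1 shifts the non-zero abundance of "Feature1" (log fold change 2);
# metadata column 2 shifts the prevalence of "Feature2" (log odds ratio 1).
spike_df <- data.frame(metadata_datum      = c(1L, 2L),
                       feature_spiked      = c("Feature1", "Feature2"),
                       associated_property = c("abundance", "prevalence"),
                       effect_size         = c(2, 1))

sim <- SparseDOSSA2(template = "Stool",
                    n_sample = 100,
                    new_features = FALSE,
                    spike_metadata = spike_df,
                    metadata_matrix = metadata_matrix)
```

To associate one feature with a metadata variable in both properties, include two rows for that feature, one with `associated_property = "abundance"` and one with `"prevalence"`, as the man page describes.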