├── .github ├── ISSUE_TEMPLATE │ └── config.yml ├── README.md └── pull_request_template.md ├── .gitignore ├── DESCRIPTION ├── NAMESPACE ├── NEWS ├── R ├── EM.R ├── M_helpers.R ├── SparseDOSSA2.R ├── SparseDOSSA2_fit.R ├── copulasso.R ├── data.R ├── fit.R ├── generate.R ├── helpers.R ├── integration_helpers.R ├── likelihood_expectation.R ├── sim_helpers.R ├── spike.R └── sysdata.rda ├── data └── Stool_subset.rda ├── man ├── SparseDOSSA2.Rd ├── Stool_subset.Rd ├── control_fit.Rd ├── control_integrate.Rd ├── fitCV_SparseDOSSA2.Rd └── fit_SparseDOSSA2.Rd ├── tests ├── testthat.R └── testthat │ ├── test-likelihood.R │ └── test-spike_in.R └── vignettes ├── SparseDOSSA2.Rmd └── SparseDOSSA2.html /.github/ISSUE_TEMPLATE/config.yml: -------------------------------------------------------------------------------- 1 | blank_issues_enabled: false 2 | contact_links: 3 | - name: Please proceed to the bioBakery Forum to make a bug report or feature request 4 | url: https://forum.biobakery.org/ 5 | about: Thanks for your interest in the bioBakery! We use a Discourse forum (https://forums.biobakery.org) for interaction with our user community, including fielding bug reports and feature requests. Please submit your comment (including all relevant details) as a new post under 1) an appropriate method-specific forum topic (e.g. "MetaPhlAn") or 2) the generic "Feature/Pull Request" topic. Doing a search for related posts before starting a new post is greatly appreciated. We look forward to hearing from you. 6 | -------------------------------------------------------------------------------- /.github/README.md: -------------------------------------------------------------------------------- 1 | 2 | # "Simulating realistic microbial observations with SparseDOSSA2" 3 | 4 | Author Name: "Siyuan Ma" 5 | Affiliation: Harvard T.H. Chan School of Public Health. 
6 | Broad Institute email: siyuan.ma@pennmedicine.upenn.edu 7 | 8 | Tutorial: https://github.com/biobakery/biobakery/wiki/SparseDOSSA2 9 | 10 | # Introduction 11 | SparseDOSSA2 is an R package for fitting to, and simulating from, realistic microbial abundance observations. It provides functionalities for: a) generation of realistic synthetic microbial observations, b) spiking-in of associations with metadata variables, e.g. for benchmarking or power analysis purposes, and c) fitting the SparseDOSSA 2 model to real-world microbial abundance observations, which can then be used for a). This vignette is intended to provide working examples for these functionalities. 12 | 13 | ``` 14 | library(SparseDOSSA2) 15 | # tidyverse packages for utilities 16 | library(magrittr) 17 | library(dplyr) 18 | library(ggplot2) 19 | ``` 20 | 21 | # Installation 22 | SparseDOSSA2 is a Bioconductor package and can be installed via the following command. 23 | ``` 24 | # if (!requireNamespace("BiocManager", quietly = TRUE)) 25 | #     install.packages("BiocManager") 26 | # BiocManager::install("SparseDOSSA2") 27 | ``` 28 | # Simulating realistic microbial observations with SparseDOSSA2 29 | The most important functionality of SparseDOSSA2 is the simulation of realistic synthetic microbial observations. To this end, SparseDOSSA2 provides three pre-trained templates, "Stool", "Vaginal", and "IBD", targeting continuous, discrete, and diseased population structures, respectively. 30 | ``` 31 | Stool_simulation <- SparseDOSSA2(template = "Stool", 32 | n_sample = 100, 33 | n_feature = 100, 34 | verbose = TRUE) 35 | Vaginal_simulation <- SparseDOSSA2(template = "Vaginal", 36 | n_sample = 100, 37 | n_feature = 100, 38 | verbose = TRUE) 39 | ``` 40 | 41 | # Fitting to microbiome datasets with SparseDOSSA2 42 | SparseDOSSA2 provides two functions, fit_SparseDOSSA2 and fitCV_SparseDOSSA2, to fit the SparseDOSSA2 model to microbial count or relative abundance observations. 
For these functions, as input, SparseDOSSA2 requires a feature-by-sample table of microbial abundance observations. We provide with SparseDOSSA2 a minimal example of such a dataset: a five-by-five subset of the HMP1-II stool study. 43 | ``` 44 | data("Stool_subset", package = "SparseDOSSA2") 45 | # columns are samples. 46 | Stool_subset[1:2, 1, drop = FALSE] 47 | ``` 48 | 49 | ## Fitting SparseDOSSA2 model with fit_SparseDOSSA2 50 | fit_SparseDOSSA2 fits the SparseDOSSA2 model to estimate the model parameters: per-feature prevalence, mean and standard deviation of non-zero abundances, and feature-feature correlations. It also estimates the joint distribution of these parameters and (if the input is counts) a read count distribution. 51 | ``` 52 | fitted <- fit_SparseDOSSA2(data = Stool_subset, 53 | control = list(verbose = TRUE)) 54 | # fitted mean log non-zero abundance values of the first two features 55 | fitted$EM_fit$fit$mu[1:2] 56 | ``` 57 | 58 | ## Fitting SparseDOSSA2 model with fitCV_SparseDOSSA2 59 | The user can additionally achieve optimal model fitting via fitCV_SparseDOSSA2. They can either provide a vector of tuning parameter values (lambdas) to control sparsity in the estimation of the correlation matrix parameter, or a grid will be selected automatically. fitCV_SparseDOSSA2 uses cross-validation to select an "optimal" model fit across these tuning parameters via average testing log-likelihood. This is a computationally intensive procedure, best suited for users who want an accurate fit to the input dataset, i.e. for simulating new microbial observations on the same features as the input (rather than new features). 
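Before turning to cross-validated fitting, note that the metadata spike-in functionality (item b in the introduction) is driven by the same SparseDOSSA2() generator shown earlier. The following is a minimal sketch using only parameters documented in the package; the effect size and spiked-feature percentage are illustrative choices, not recommendations:

```r
library(SparseDOSSA2)

# Simulate from the pre-trained "Stool" template while spiking in associations
# between features and (auto-generated) metadata. With spike_metadata = "both",
# both the abundance and the prevalence of the selected features are
# associated with the metadata variables (one continuous, one binary,
# generated internally when metadata_matrix is NULL).
set.seed(1)
sim_spiked <- SparseDOSSA2(
  template = "Stool",
  n_sample = 100,
  n_feature = 100,
  spike_metadata = "both",
  metadata_effect_size = 1,            # log fold change / log odds ratio (illustrative)
  perc_feature_spiked_metadata = 0.05, # 5% of features associated with metadata
  verbose = FALSE
)

# Inspect which features were spiked against which metadata columns,
# in which property, and at what effect size:
sim_spiked$spike_metadata$feature_metadata_spike_df
```

Per the function documentation, spike_metadata can instead be supplied as a data.frame (columns metadata_datum, feature_spiked, associated_property, effect_size) for full control over individual associations, in which case metadata_matrix must also be provided.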
60 | ``` 61 | set.seed(1) 62 | fitted_CV <- fitCV_SparseDOSSA2(data = Stool_subset, 63 | lambdas = c(0.1, 1), 64 | K = 2, 65 | control = list(verbose = TRUE)) 66 | # the average log likelihood of different tuning parameters 67 | apply(fitted_CV$EM_fit$logLik_CV, 2, mean) 68 | # The second lambda (1) had better performance in terms of log likelihood, 69 | # and will be selected as the default fit. 70 | ``` 71 | 72 | # Parallelization controls with future 73 | SparseDOSSA2 internally uses the R package future to allow for parallel computation. The user can thus specify parallelization through future's interface. See the reference manual of future for more details. This is particularly useful when fitting SparseDOSSA2 in a high-performance computing environment. 74 | ``` 75 | ## regular fitting 76 | # system.time(fitted_regular <- 77 | #                fit_SparseDOSSA2(data = Stool_subset, 78 | #                                 control = list(verbose = FALSE))) 79 | ## parallel fitting with future: 80 | # future::plan(future::multisession) 81 | # system.time(fitted_parallel <- 82 | #                fit_SparseDOSSA2(data = Stool_subset, 83 | #                                 control = list(verbose = FALSE))) 84 | 85 | ## For CV fitting, there are three components that can be parallelized, in order: 86 | ## different cross validation folds, different tuning parameter lambdas, 87 | ## and different samples. 
It is usually most efficient to parallelize at the 88 | ## sample level: 89 | # system.time(fitted_regular_CV <- 90 | #                fitCV_SparseDOSSA2(data = Stool_subset, 91 | #                                   lambdas = c(0.1, 1), 92 | #                                   K = 2, 93 | #                                   control = list(verbose = TRUE))) 94 | # future::plan(list(future::sequential, future::sequential, future::multisession)) 95 | # system.time(fitted_parallel_CV <- 96 | #                fitCV_SparseDOSSA2(data = Stool_subset, 97 | #                                   lambdas = c(0.1, 1), 98 | #                                   K = 2, 99 | #                                   control = list(verbose = TRUE))) 100 | ``` 101 | 102 | # Session info 103 | ``` 104 | sessionInfo() 105 | 106 | R version 3.6.2 (2019-12-12) 107 | Platform: x86_64-apple-darwin15.6.0 (64-bit) 108 | Running under: macOS Mojave 10.14.6 109 | 110 | Matrix products: default 111 | BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 112 | LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib 113 | 114 | locale: 115 | [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 116 | 117 | attached base packages: 118 | [1] stats graphics grDevices utils datasets methods base 119 | 120 | other attached packages: 121 | [1] SparseDOSSA2_0.99.0 Rmpfr_0.8-2 gmp_0.6-1 igraph_1.2.6 122 | [5] truncnorm_1.0-8 magrittr_2.0.1 future.apply_1.7.0 future_1.21.0 123 | [9] huge_1.3.4.1 mvtnorm_1.1-1 ks_1.11.7 BiocCheck_1.22.0 124 | 125 | loaded via a namespace (and not attached): 126 | [1] Rcpp_1.0.5 compiler_3.6.2 BiocManager_1.30.10 bitops_1.0-6 127 | [5] tools_3.6.2 digest_0.6.27 mclust_5.4.7 jsonlite_1.7.2 128 | [9] lattice_0.20-41 pkgconfig_2.0.3 Matrix_1.2-18 graph_1.64.0 129 | [13] curl_4.3 parallel_3.6.2 xfun_0.20 stringr_1.4.0 130 | [17] httr_1.4.2 knitr_1.30 globals_0.14.0 stats4_3.6.2 131 | [21] grid_3.6.2 getopt_1.20.3 optparse_1.6.6 Biobase_2.46.0 132 | [25] listenv_0.8.0 R6_2.5.0 parallelly_1.23.0 XML_3.99-0.3 133 | [29] RBGL_1.62.1 codetools_0.2-18 biocViews_1.54.0 BiocGenerics_0.32.0 134 | [33] MASS_7.3-53 
stringdist_0.9.6.3 RUnit_0.4.32 KernSmooth_2.23-18 135 | [37] stringi_1.5.3 RCurl_1.98-1.2 136 | ``` 137 | 138 | # Contributions 139 | Thanks go to these wonderful people: 140 | 141 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | We welcome feedback and issue reporting for all bioBakery tools through our [Discourse site](https://forum.biobakery.org/c/pull-request/featurepull-request/). For users that would like to directly contribute to the [tools](https://github.com/biobakery/) we are happy to field PRs to address **bug fixes**. Please note the turn around time on our end might be a bit long to field these but that does not mean we don't value the contribution! We currently **don't** accept PRs to add **new functionality** to tools but we would be happy to receive your feedback on [Discourse](https://forum.biobakery.org/c/pull-request/featurepull-request/). 2 | 3 | Also, we will make sure to attribute your contribution in our User’s manual(README.md) and in any associated paper Acknowledgements. 
4 | 5 | 6 | ## Description 7 | 8 | 9 | ## Related Issue 10 | 11 | 12 | 13 | 14 | ## Screenshots (if appropriate): 15 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.Rproj 2 | .Rbuildignore 3 | .Rproj.user 4 | .Rhistory 5 | .RData 6 | .Ruserdata 7 | .DS_Store 8 | *.sbatch 9 | *.html -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: SparseDOSSA2 2 | Type: Package 3 | Title: A Statistical Model for Describing and Simulating Microbial Community Profiles 4 | Version: 0.99.2 5 | Author: Siyuan Ma 6 | Maintainer: Siyuan Ma 7 | Description: SparseDOSSA 2 is an R package for fitting to, and simulating from, realistic microbial 8 |     abundance observations. It provides functionalities for: 9 |     a) generation of synthetic microbial observations, based on either pre-fitted templates that the package provides, or user-trained results. 10 |     b) spiking-in of associations with metadata variables or between feature pairs, for e.g. benchmarking or power analysis purposes. 11 |     c) fitting the SparseDOSSA 2 model to real-world microbial abundance observations. 
12 | License: MIT + file LICENSE 13 | Encoding: UTF-8 14 | RoxygenNote: 7.1.2 15 | Depends: 16 | ks, 17 | mvtnorm, 18 | huge, 19 | future.apply, 20 | magrittr, 21 | truncnorm, 22 | igraph, 23 | Rmpfr 24 | VignetteBuilder: knitr 25 | Suggests: 26 | testthat (>= 2.1.0), 27 | cubature, 28 | BiocStyle, 29 | knitr, 30 | rmarkdown 31 | biocViews: Metagenomics, Microbiome, StatisticalMethod 32 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(SparseDOSSA2) 4 | export(control_fit) 5 | export(control_integrate) 6 | export(fitCV_SparseDOSSA2) 7 | export(fit_SparseDOSSA2) 8 | importFrom(magrittr,"%>%") 9 | -------------------------------------------------------------------------------- /NEWS: -------------------------------------------------------------------------------- 1 | Changes in version 0.99.2 (2021-10-06) 2 | + Changed the spike_metadata interface in the SparseDOSSA2 function. 
3 |   spike_metadata can now be either character ("none" the default, 4 |   "abundance", "prevalence", or "both", same as before), or a data.frame 5 |   configuring the metadata-microbe associations (replacing the old 6 |   feature_metadata_spike_df parameter) 7 | + More detailed documentation for these parameters 8 | + Better checking and error reporting for metadata spike-in user errors -------------------------------------------------------------------------------- /R/EM.R: -------------------------------------------------------------------------------- 1 | EM <- function(data, 2 |                lambda, 3 |                control = list()) { 4 |   # initialization, filtering 5 |   if(is.null(lambda) || any(lambda <= 0)) 6 |     stop("lambda must be a positive value!") 7 |   control <- do.call(control_fit, control) 8 |   debug_copulasso_file <- NULL 9 |   if(!is.null(control$debug_dir)) { 10 |     dir.create(control$debug_dir, recursive = TRUE) 11 |     control$debug_dir <- normalizePath(control$debug_dir) 12 |   } 13 | 14 |   l_filtering <- filter_data(data) 15 |   data <- data[l_filtering$ind_sample, l_filtering$ind_feature, drop = FALSE] 16 | 17 |   # initialize EM using relative abundances 18 |   feature_param <- fit_featureParam(data) 19 |   fit_copulasso <- copulasso( 20 |     data = data, 21 |     marginals = feature_param, 22 |     lambda = lambda, 23 |     control = list(debug_dir = control$debug_dir) 24 |   ) 25 |   feature_param[, "mu"] <- feature_param[, "mu"] - mean(feature_param[, "mu"]) 26 |   params <- list(pi0 = feature_param[, "pi0"], 27 |                  mu = feature_param[, "mu"], 28 |                  sigma = feature_param[, "sigma"], 29 |                  Sigma = fit_copulasso$Sigma, 30 |                  Omega = fit_copulasso$Omega, 31 |                  Corr_star = fit_copulasso$Corr_star, 32 |                  diff = rep(NA_real_, 4), 33 |                  logLik = -Inf, 34 |                  time = Sys.time()) 35 |   if(fit_copulasso$copulasso_code != 0) { 36 |     warning("Missing values in Omega estimation! 
(lambda too small?)") 37 |     return(list(lambda = lambda, 38 |                 fit = params, 39 |                 convergence = list(converge = FALSE, 40 |                                    converge_code = 4, 41 |                                    n_iter = 0)))  # no EM iterations have run yet 42 |   } 43 | 44 |   # EM algorithm 45 |   i_iter <- 0 46 |   converge <- FALSE 47 |   ll_easums <- list() 48 |   ll_params <- list() 49 |   while(TRUE) { 50 |     i_iter <- i_iter + 1 51 |     if(control$verbose) 52 |       message("EM iteration ", i_iter) 53 | 54 |     # E step 55 |     e_asums <- future.apply::future_vapply( 56 |       seq_len(nrow(data)), 57 |       function(i_sample) 58 |         get_es(x = data[i_sample, , drop = TRUE], 59 |                pi0 = params$pi0, mu = params$mu, sigma = params$sigma, 60 |                Omega = params$Omega, Sigma = params$Sigma, 61 |                control = control$control_numint), 62 |       rep(0.0, 10) 63 |     ) %>% t() 64 |     if(any(is.na(e_asums[, c("ea", "dx", "eloga", "eloga2")]))) { 65 |       warning("Numeric integration in E step returned NAs!") 66 |       converge_code <- 3 67 |       break 68 |     } 69 |     if(any(e_asums[, "eloga"]^2 >= e_asums[, "eloga2"])) { 70 |       warning("Numeric integration in E step gave bad expectation values!") 71 |       converge_code <- 2 72 |       break 73 |     } 74 |     ll_easums[[i_iter]] <- e_asums 75 | 76 |     ## M step 77 |     fit_copulasso <- copulasso(data = data * exp(e_asums[, "eloga"]), 78 |                                marginals = feature_param, 79 |                                lambda = lambda) 80 |     if(fit_copulasso$copulasso_code != 0) { 81 |       warning("Missing values in Omega estimation! 
(lambda too small?)") 82 |       converge_code <- 4 83 |       break 84 |     } 85 |     sigmahat <- get_sigmahat(data = data, 86 |                              eloga = e_asums[, "eloga"], 87 |                              eloga2 = e_asums[, "eloga2"]) 88 |     feature_param[, "sigma"] <- sigmahat 89 |     params_new <- list(pi0 = feature_param[, "pi0"], 90 |                        mu = feature_param[, "mu"], 91 |                        sigma = feature_param[, "sigma"], 92 |                        Sigma = fit_copulasso$Sigma, 93 |                        Omega = fit_copulasso$Omega, 94 |                        Corr_star = fit_copulasso$Corr_star, 95 |                        logLik = mean(e_asums[, "logLik"])) 96 |     diff_abs <- get_diff(params_new[["logLik"]], params[["logLik"]], 97 |                          denom_c = control$abs_tol, method = "abs") 98 |     diff_rel <- get_diff(params_new[["logLik"]], params[["logLik"]], 99 |                          denom_c = control$abs_tol, method = "rel") 100 |     params <- c(params_new, 101 |                 list(diff = c(diff_abs, diff_rel), 102 |                      time = Sys.time())) 103 |     ll_params[[i_iter]] <- params 104 | 105 |     if(!is.null(control$debug_dir)) { 106 |       l_debug <- list(ll_easums = ll_easums, 107 |                       ll_params = ll_params, 108 |                       l_filtering = l_filtering) 109 |       save(l_debug, 110 |            file = paste0(control$debug_dir, 111 |                          "/debug_EM.RData")) 112 |     } 113 | 114 |     if(max(diff_abs) < control$abs_tol & max(diff_rel) < control$rel_tol) { 115 |       converge <- TRUE 116 |       converge_code <- 0 117 |       break 118 |     } 119 |     if(i_iter + 1 > control$maxit) { 120 |       warning("Maximum EM iteration reached!") 121 |       converge_code <- 1 122 |       break 123 |     } 124 |   } 125 | 126 |   return(list(fit = params, 127 |               lambda = lambda, 128 |               convergence = list(converge = converge, 129 |                                  converge_code = converge_code, 130 |                                  n_iter = i_iter), 131 |               l_filtering = l_filtering)) 132 | } 133 | 134 | EM_CV <- function(data, 135 |                   lambdas, 136 |                   K = 5, 137 |                   control = list()) { 138 |   # initialization, filtering 139 |   control <- do.call(control_fit, control) 140 |   l_filtering <- filter_data(data) 141 |   data <- data[l_filtering$ind_sample, l_filtering$ind_feature, drop = FALSE] 142 |   if(is.null(lambdas)) 143 |     lambdas <- 10^seq(-2, 0, length.out = 5) 144 |   if(any(lambdas <= 
0)) 145 | stop("lambdas must be positive values!") 146 | if(!is.null(control$debug_dir)) { 147 | dir.create(control$debug_dir, recursive = TRUE) 148 | control$debug_dir <- normalizePath(control$debug_dir) 149 | } 150 | 151 | if(control$verbose) message("Performing full data fit...") 152 | l_fits_full <- future::future({ 153 | future.apply::future_lapply( 154 | seq_along(lambdas), 155 | function(i_lambda) { 156 | control_tmp <- control 157 | control_tmp$verbose <- FALSE 158 | if(!is.null(control$debug_dir)) 159 | control_tmp$debug_dir <- paste0(control$debug_dir, 160 | "/K_0/lambda_", 161 | i_lambda) 162 | EM(data = data, 163 | lambda = lambdas[i_lambda], 164 | control = control_tmp) 165 | }) 166 | }) 167 | l_fits_full <- future::value(l_fits_full) 168 | 169 | # CV fits 170 | CV_folds <- make_CVfolds(n = nrow(data), K = K) 171 | if(!is.null(control$debug_dir)) 172 | save(CV_folds, file = paste0(control$debug_dir, "/CV_folds.RData")) 173 | ll_results_CV <- list() 174 | for(k in seq_len(K)) { 175 | if(control$verbose) 176 | message("Performing CV k=", k) 177 | ll_results_CV[[k]] <- future::future({ 178 | data_training <- data[CV_folds != k, , drop = FALSE] 179 | data_testing <- data[CV_folds == k, , drop = FALSE] 180 | 181 | l_fits_k <- 182 | future.apply::future_lapply( 183 | seq_along(lambdas), 184 | function(i_lambda) { 185 | control_tmp <- control 186 | control_tmp$verbose <- FALSE 187 | if(!is.null(control$debug_dir)) 188 | control_tmp$debug_dir <- paste0(control$debug_dir, 189 | "/K_", k, 190 | "/lambda_", i_lambda) 191 | EM(data = data_training, 192 | lambda = lambdas[i_lambda], 193 | control = control_tmp) 194 | }) 195 | 196 | # Fill in parameters estimates for features not present in training data 197 | for(i_lambda in seq_along(lambdas)) 198 | l_fits_k[[i_lambda]]$fit <- 199 | fill_estimates_CV(l_fits_k[[i_lambda]]$fit, 200 | l_fits_full[[i_lambda]]$fit, 201 | l_fits_k[[i_lambda]]$l_filtering$ind_feature) 202 | 203 | # Calculate ll in testing data 204 | 
l_logLik <- future.apply::future_lapply( 205 | seq_along(lambdas), 206 | function(i_lambda) { 207 | params <- l_fits_k[[i_lambda]]$fit 208 | future.apply::future_sapply( 209 | seq_len(nrow(data_testing)), 210 | function(i_sample) { 211 | logLik <- 212 | dx(x = data_testing[i_sample, , drop = TRUE], 213 | pi0 = params$pi0, mu = params$mu, sigma = params$sigma, 214 | Omega = params$Omega, Sigma = params$Sigma, 215 | control = control$control_numint, 216 | log.p = TRUE) 217 | }) 218 | }) 219 | 220 | list(l_fits = l_fits_k, 221 | l_logLik = l_logLik) 222 | }) 223 | } 224 | ll_results_CV <- future::value(ll_results_CV) 225 | 226 | # Aggregate CV results 227 | logLik_CV <- 228 | sapply(seq_along(lambdas), 229 | function(i_lambda) { 230 | logLik <- rep(NA_real_, nrow(data)) 231 | for(k in seq_len(K)) 232 | logLik[CV_folds == k] <- ll_results_CV[[k]]$l_logLik[[i_lambda]] 233 | return(logLik) 234 | }) 235 | ll_fits_CV <- lapply(ll_results_CV, function(results_k) results_k$l_fits) 236 | 237 | return(list(fit = 238 | l_fits_full[[ 239 | order( 240 | apply(-logLik_CV, 241 | 2, 242 | function(x) 243 | mean(setdiff(x, Inf))))[1]]]$fit, 244 | lambdas = lambdas, 245 | logLik_CV = logLik_CV, 246 | l_fits_full = l_fits_full, 247 | CV_folds = CV_folds, 248 | ll_fits_CV = ll_fits_CV, 249 | l_filtering = l_filtering 250 | )) 251 | } 252 | -------------------------------------------------------------------------------- /R/M_helpers.R: -------------------------------------------------------------------------------- 1 | get_sigmahat <- function(data, eloga, eloga2) { 2 | mutilde <- 3 | vapply(seq_len(ncol(data)), 4 | function(j) { 5 | ind_j <- data[, j] > 0 6 | mean(eloga[ind_j] + 7 | log(data[ind_j, j])) 8 | }, 9 | 0.0) 10 | 11 | sigma2tilde <- 12 | vapply(seq_len(ncol(data)), 13 | function(j) { 14 | ind_j <- data[, j] > 0 15 | mean(eloga2[ind_j] + 16 | 2 * eloga[ind_j] * log(data[ind_j, j]) + 17 | log(data[ind_j, j])^2) 18 | }, 19 | 0.0) 20 | 21 | return(sqrt(sigma2tilde - mutilde^2)) 
22 | } -------------------------------------------------------------------------------- /R/SparseDOSSA2.R: -------------------------------------------------------------------------------- 1 | #' Simulate synthetic microbial abundance observations with SparseDOSSA2 2 | #' 3 | #' \code{SparseDOSSA2} generates synthetic microbial abundance observations 4 | #' from either a pre-trained template, or user-provided fitted results from 5 | #' \code{fit_SparseDOSSA2} or \code{fitCV_SparseDOSSA2}. Additional options 6 | #' are available for simulating associations between microbial features 7 | #' and metadata variables. 8 | #' 9 | #' @param template can be 1) a character string (\code{"Stool"}, \code{"Vaginal"}, 10 | #' or \code{"IBD"}) indicating one of the pre-trained templates in SparseDOSSA2, 11 | #' or 2) user-provided, fitted results. In the latter case this should be an output 12 | #' from \code{fit_SparseDOSSA2} or \code{fitCV_SparseDOSSA2}. 13 | #' @param n_sample number of samples to simulate 14 | #' @param new_features \code{TRUE}/\code{FALSE} indicator for whether or not new 15 | #' features should be simulated. If \code{FALSE} then the same set of features 16 | #' in \code{template} will be simulated. 17 | #' @param n_feature number of features to simulate. Only relevant when 18 | #' \code{new_features} is \code{TRUE} 19 | #' @param spike_metadata for metadata spike-in configurations. Must be one of two things: 20 | #' 21 | #' \itemize{ 22 | #' \item a character string of \code{"none"}, \code{"both"}, \code{"abundance"}, 23 | #' or \code{"prevalence"}, indicating whether or not 24 | #' association with metadata will be spiked in. For the spiked-in case, it 25 | #' indicates if features' abundance/prevalence/both characteristics will be associated 26 | #' with metadata (also see explanations for \code{metadata_effect_size} and 27 | #' \code{perc_feature_spiked_metadata}) 28 | #' \item a data.frame for detailed spike-in configurations. 
This is the more 29 | #' advanced approach, where detailed specification for metadata-microbial 30 | #' feature associations are provided. Note: if \code{spike_metadata} is provided 31 | #' as a data.frame, then \code{metadata_matrix} must be provided as well 32 | #' (cannot be generated automatically). In this case, \code{spike_metadata} 33 | #' must have exactly four columns: \code{metadata_datum}, \code{feature_spiked}, 34 | #' \code{associated_property}, and \code{effect_size}. Each row of the data.frame 35 | #' configures one specific metadata-microbe association. Specifically: 36 | #' \itemize{ 37 | #' \item \code{metadata_datum} (integer) indicates the column number for the metadata 38 | #' variable to be associated with the microbe 39 | #' \item \code{feature_spiked} (character) indicates the microbe name to be associated 40 | #' with the metadata variable 41 | #' \item \code{associated_property} (character, either \code{"abundance"} or 42 | #' \code{"prevalence"}), indicating the property of the microbe to be modified. 43 | #' If you want the microbe to be associated with the metadata variable 44 | #' in both properties, include two rows in \code{spike_metadata}, one for 45 | #' abundance and one for prevalence 46 | #' \item \code{effect_size} (numeric) indicating the strength of the association. 47 | #' This corresponds to log fold change in non-zero abundance for 48 | #' \code{"abundance"} spike-in, and log odds ratio for \code{"prevalence"} 49 | #' spike-in 50 | #' } 51 | #' } 52 | #' @param metadata_effect_size (for when \code{spike_metadata} is \code{"abundance"}, 53 | #' \code{"prevalence"}, or \code{"both"}) effect size of the spiked-in associations. 
This is 54 | #' non-zero log fold change for abundance spike-in, and log odds ratio for prevalence spike-in 55 | #' @param perc_feature_spiked_metadata (for when \code{spike_metadata} is \code{"abundance"}, 56 | #' \code{"prevalence"}, or \code{"both"}) 57 | #' percentage of features to be associated with metadata 58 | #' @param metadata_matrix the user can provide a metadata matrix to use for spiking-in 59 | #' of feature abundances. If using default (\code{NULL}) two variables will be generated: 60 | #' one continous, and a binary one of balanced cases and controls. Note: if 61 | #' \code{spike_metadata} is provided as a data.frame, then the user must provide 62 | #' \code{metadata_matrix} too 63 | #' @param median_read_depth targeted median per-sample read depth 64 | #' @param verbose whether detailed information should be printed 65 | #' 66 | #' @return a list with the following component: 67 | #' \describe{ 68 | #' \item{simulated_data}{ 69 | #' feature by sample matrix of simulated microbial count observations 70 | #' } 71 | #' \item{simulated_matrices}{ 72 | #' list of all simulated data matrices, including that of null (i.e. not spiked-in) absolute 73 | #' abundances, spiked-in absolute abundances, and normalized relative abundances 74 | #' } 75 | #' \item{params}{ 76 | #' parameters used for simulation. These are provided in \code{template}. 77 | #' } 78 | #' \item{spike_metadata}{ 79 | #' list of variables provided or generated for metadata spike-in. This include 80 | #' \code{spike_metadata} for the original \code{spike_metadata} parameter provided 81 | #' by the user, \code{metadata_matrix} for the 82 | #' metadata (either provided by the user or internally generated), and 83 | #' \code{feature_metadata_spike_df} 84 | #' for detailed specification of which metadata variables were used to spike-in associations 85 | #' with which features, in what properties at which effect sizes. 
This is the 86 | #' same as \code{spike_metadata} if the latter was provided as a data.frame. 87 | #' } 88 | #' } 89 | #' @export 90 | #' @author Siyuan Ma, \email{syma.research@gmail.com} 91 | #' 92 | #' @examples 93 | #' ## Using one of the pre-trained SparseDOSSA2 templates: 94 | #' sim <- SparseDOSSA2(template = "Stool", n_sample = 200, new_features = FALSE) 95 | #' ## Using user-provided trained SparseDOSSA2 model: 96 | #' data("Stool_subset") 97 | #' fitted <- fit_SparseDOSSA2(data = Stool_subset) 98 | #' sim <- SparseDOSSA2(template = fitted, n_sample = 200, new_features = FALSE) 99 | SparseDOSSA2 <- function(template = "Stool", 100 |                          n_sample = 100, 101 |                          new_features = TRUE, 102 |                          n_feature = 100, 103 |                          spike_metadata = "none", 104 |                          metadata_effect_size = 1, 105 |                          perc_feature_spiked_metadata = 0.05, 106 |                          metadata_matrix = NULL, 107 |                          median_read_depth = 50000, 108 |                          verbose = TRUE) { 109 |   if(is.character(template)) { 110 |     if(!template %in% c("Stool", "Vaginal", "IBD")) 111 |       stop("Pre-trained template must be one of \"Stool\", \"Vaginal\", or \"IBD\"!") 112 |     template <- get(template) 113 |   } 114 | 115 |   # generate per-feature params 116 |   feature_param_template <- cbind("pi0" = template$EM_fit$fit$pi0, 117 |                                   "mu" = template$EM_fit$fit$mu, 118 |                                   "sigma" = template$EM_fit$fit$sigma) 119 |   if(!new_features) { 120 |     if(verbose) 121 |       message("new_features is FALSE, ", 122 |               "adopting per-feature parameters from template ", 123 |               "(n_feature will be ignored)...") 124 |     n_feature <- nrow(feature_param_template) 125 |     feature_param <- feature_param_template 126 |     Omega <- template$EM_fit$fit$Omega 127 |     # check that Omega and Sigma agree 128 |     if(max(abs(Omega %*% template$EM_fit$fit$Sigma - 129 |                diag(rep(1, length(template$EM_fit$fit$pi0))))) > 1e-10) 130 |       stop("Omega should be the inverse of Sigma!") 131 |   } else { 132 |     if(verbose) 133 |       message("new_features is TRUE, ", 134 |               "generating new per-feature parameters, based on template...") 135 | 
feature_param <- 136 | generate_featureParam(F_fit = template$F_fit, 137 | param_original = feature_param_template, 138 | n_feature = n_feature) 139 | Omega <- diag(rep(1, nrow(feature_param))) 140 | } 141 | features <- 142 | rownames(Omega) <- 143 | colnames(Omega) <- 144 | rownames(feature_param) 145 | 146 | # generate null absolute abundance matrix 147 | if(verbose) 148 | message("Generating null absolute abundance matrix...") 149 | mat_null <- generate_a(n = n_sample, 150 | feature_param = feature_param, 151 | Omega = Omega) 152 | 153 | # generate spiked-in association with metadata 154 | if(!(is.character(spike_metadata) | is.data.frame(spike_metadata))) 155 | stop("spike_metadata must be either character or a data.frame!") 156 | if(is.character(spike_metadata)) { 157 | spike_metadata <- match.arg(spike_metadata, 158 | choices = c("none", "abundance", 159 | "prevalence", "both")) 160 | } 161 | if(identical(spike_metadata, "none")) { 162 | if(verbose) 163 | message("spike_metadata is \"none\", ", 164 | "no metadata association will be simulated...") 165 | 166 | mat_spiked_metadata <- mat_null 167 | feature_metadata_spike_df <- NULL 168 | } else { 169 | if(verbose) 170 | message("Spiking in metadata association...") 171 | 172 | # metadata_matrix 173 | if(is.null(metadata_matrix)) { 174 | if(is.data.frame(spike_metadata)) 175 | stop("spike_metadata is provided as a data.frame. 
User must specify ", 176 | "metadata_matrix as well!") 177 | if(verbose) 178 | message("metadata_matrix is not provided; ", 179 | "simulating default metadata_matrix...") 180 | metadata_matrix <- cbind(rnorm(n = n_sample), 181 | rbinom(n = n_sample, 182 | size = 1, 183 | prob = 0.5)) 184 | rownames(metadata_matrix) <- colnames(mat_null) 185 | } else { 186 | if(!is.matrix(metadata_matrix)) 187 | stop("metadata_matrix must be a matrix ", 188 | "(model matrix where categorical variables are dummified)!") 189 | if(nrow(metadata_matrix) != n_sample) 190 | stop("n_sample does not agree with number of samples in ", 191 | "metadata_matrix!") 192 | if(!is.null(rownames(metadata_matrix))) 193 | colnames(mat_null) <- rownames(metadata_matrix) 194 | } 195 | n_metadata <- ncol(metadata_matrix) 196 | 197 | # metadata_effect_size 198 | if(length(metadata_effect_size) != 1 & 199 | length(metadata_effect_size) != n_metadata) 200 | stop("Length of metadata_effect_size can only be either 1 or number of ", 201 | "columns of metadata_matrix!") 202 | if(length(metadata_effect_size) == 1) 203 | metadata_effect_size <- rep(metadata_effect_size, n_metadata) 204 | 205 | # feature_metadata_spike_df 206 | if(is.data.frame(spike_metadata)) { 207 | if(verbose) { 208 | message("spike_metadata is provided as a data.frame; ", 209 | "will use for simulating metadata association ", 210 | "(metadata_effect_size and perc_feature_spiked_metadata will ", 211 | "be ignored)...") 212 | } 213 | 214 | # check format 215 | if(!all(c("metadata_datum", 216 | "feature_spiked", 217 | "associated_property", 218 | "effect_size") %in% 219 | colnames(spike_metadata))) 220 | stop("spike_metadata does not follow the correct format! 
", 221 | "Must have the following columns: metadata_datum, ", 222 | "feature_spiked, associated_property, and effect_size.") 223 | if(!all(spike_metadata$feature_spiked %in% 224 | features)) 225 | stop("feature_spiked in spike_metadata must provide the ", 226 | "spiked feature names!") 227 | if(!all(spike_metadata$metadata_datum %in% 228 | seq(1, n_metadata))) 229 | stop("metadata_datum in spike_metadata must provide the ", 230 | "associated metadata column number!") 231 | if(!all(spike_metadata$associated_property %in% 232 | c("prevalence", "abundance"))) 233 | stop("associated_property in spike_metadata must be ", 234 | "either \"prevalence\" or \"abundance\"!") 235 | 236 | feature_metadata_spike_df <- spike_metadata 237 | } else { 238 | if(verbose) 239 | message("spike_metadata is specified as ", spike_metadata, "; ", 240 | "generating default metadata association...") 241 | feature_metadata_spike_df <- 242 | generate_feature_metadata_spike_df( 243 | features = features, 244 | perc_feature_spiked_metadata = perc_feature_spiked_metadata, 245 | n_metadata = n_metadata, 246 | effect_size = metadata_effect_size, 247 | spike_metadata = spike_metadata) 248 | } 249 | if(verbose) 250 | message("Generating feature abundances with spiked-in metadata ", 251 | "associations...") 252 | mat_spiked_metadata <- 253 | spike_a_metadata(null = mat_null, 254 | feature_param = feature_param, 255 | metadata = metadata_matrix, 256 | spike_df = feature_metadata_spike_df) 257 | } 258 | 259 | mat_rel <- apply(mat_spiked_metadata, 2, TSS) 260 | if(any(is.na(template$depth_fit))) { 261 | mat_count <- mat_rel 262 | } else { 263 | if(verbose) 264 | message("Generating count matrix...") 265 | # generate read depth 266 | depth_new <- generate_depth(mu_depth = template$depth_fit["mu_depth"], 267 | sigma_depth = template$depth_fit["sigma_depth"], 268 | n = n_sample, 269 | median_depth = median_read_depth) 270 | # generate read counts 271 | mat_count <- generate_count(rel = mat_rel, 272 | depth = 
depth_new) 273 | } 274 | 275 | return(list(simulated_data = mat_count, 276 | simulated_matrices = list(rel = mat_rel, 277 | a_spiked = mat_spiked_metadata, 278 | a_null = mat_null), 279 | params = list(feature_param = feature_param, 280 | Omega = Omega), 281 | template = template, 282 | spike_metadata = list(spike_metadata = spike_metadata, 283 | metadata_matrix = metadata_matrix, 284 | feature_metadata_spike_df = 285 | feature_metadata_spike_df))) 286 | } 287 | -------------------------------------------------------------------------------- /R/SparseDOSSA2_fit.R: -------------------------------------------------------------------------------- 1 | #' Fit SparseDOSSA 2 model to a microbiome abundance dataset 2 | #' 3 | #' \code{fit_SparseDOSSA2} fits the SparseDOSSA 2 model (zero-inflated log normal 4 | #' marginals connected through Gaussian copula) to microbial abundances. It takes 5 | #' as input a feature-by-sample microbial count or relative abundance table and 6 | #' a penalization tuning parameter \code{lambda} to control the sparsity of 7 | #' feature-feature correlations. It then adopts a penalized expectation-maximization 8 | #' algorithm to provide estimates of the model parameters. 9 | #' 10 | #' @param data feature-by-sample matrix of abundances (proportions or 11 | #' counts) 12 | #' @param lambda positive penalization parameter for the sparsity of feature-feature 13 | #' correlations. Defaults to the maximum value \code{1}, where features are assumed to be 14 | #' independent (no correlations, most sparse) 15 | #' @param control a named list of additional control parameters. See help page for 16 | #' \code{control_fit} 17 | #' 18 | #' @return a list, with the following components: 19 | #' \describe{ 20 | #' \item{EM_fit}{ 21 | #' list of fitted parameters from the EM algorithm.
22 | #' } 23 | #' \item{F_fit}{ 24 | #' fitted parameters for the joint distribution of per-feature prevalence, abundance, 25 | #' and variability parameters (for simulating new features) 26 | #' } 27 | #' \item{depth_fit}{fitted parameters for the read depth distribution. Only applicable 28 | #' to count data. 29 | #' } 30 | #' \item{l_filtering}{list of quality control filtering for sample and features. 31 | #' } 32 | #' } 33 | 34 | #' @export 35 | #' @author Siyuan Ma, \email{syma.research@gmail.com} 36 | #' @examples 37 | #' data("Stool_subset") 38 | #' fitted <- fit_SparseDOSSA2(data = Stool_subset) 39 | fit_SparseDOSSA2 <- function(data, 40 | lambda = 1, 41 | control = list()) { 42 | # initialization, filtering 43 | if(any(data < 0)) 44 | stop("data has negative values!") 45 | data <- t(data) 46 | control <- do.call(control_fit, control) 47 | if(control$verbose) 48 | message("Filtering for all-zero features/samples...") 49 | l_filtering <- filter_data(data) 50 | data <- data[l_filtering$ind_sample, l_filtering$ind_feature, drop = FALSE] 51 | if(!is.null(control$debug_dir)) 52 | dir.create(control$debug_dir) 53 | 54 | # check if table is count and fit library size parameters if needed 55 | row_sums <- apply(data, 1, sum) 56 | if(all(row_sums > 1)) { 57 | if(control$verbose) 58 | message("Data appears to be count table. 
", 59 | "Fitting library size distribution...") 60 | depth_fit <- fit_depth(depth = row_sums) 61 | } else { 62 | if(control$verbose) 63 | message("Data appears to be relative abundance table.") 64 | depth_fit <- c("mu_depth" = NA, "sigma_depth" = NA) ## FIXME 65 | } 66 | 67 | # normalize data 68 | data <- t(apply(data, 1, TSS, correct = TRUE)) 69 | 70 | # EM fitting 71 | if(control$verbose) 72 | message("Fitting EM algorithm...") 73 | EM_fit <- EM(data = data, 74 | lambda = lambda, 75 | control = control) 76 | 77 | # fit per-feature parameter 3d Gaussian kernel density 78 | if(control$verbose) 79 | message("Fitting joint distribution of per-feature parameters...") 80 | F_fit <- fit_F(feature_param = cbind("pi0" = EM_fit$fit$pi0, 81 | "mu" = EM_fit$fit$mu, 82 | "sigma" = EM_fit$fit$sigma)) 83 | 84 | return(list(EM_fit = EM_fit, 85 | F_fit = F_fit, 86 | depth_fit = depth_fit, 87 | l_filtering = l_filtering)) 88 | } 89 | 90 | #' Control options for fit_SparseDOSSA2 and fitCV_SparseDOSSA2 91 | #' 92 | #' @param maxit maximum number of EM iterations 93 | #' @param rel_tol relative change threshold in the log likelihood 94 | #' for algorithm convergence 95 | #' @param abs_tol absolute change threshold in the log likelihood 96 | #' for algorithm convergence 97 | #' @param control_numint a named list of control parameters for the 98 | #' numerical integrations during the E step. See help page for 99 | #' \code{control_numint} 100 | #' @param verbose whether or not detailed running messages should be provided 101 | #' @param debug_dir directory for intermediate output, such as the 102 | #' EM expectations and parameter values during each step of the 103 | #' EM algorithm.
Default to \code{NULL} in which case no such output 104 | #' will be generated 105 | #' 106 | #' @return a list of the same names 107 | #' @export 108 | control_fit <- function(maxit = 100, 109 | rel_tol = 1e-2, 110 | abs_tol = 1e-2, 111 | control_numint = list(), 112 | verbose = FALSE, 113 | debug_dir = NULL) { 114 | list(maxit = maxit, 115 | rel_tol = rel_tol, 116 | abs_tol = abs_tol, 117 | control_numint = control_numint, 118 | verbose = verbose, 119 | debug_dir = debug_dir) 120 | } 121 | 122 | 123 | #' Fit SparseDOSSA 2 model to a microbiome abundance dataset with cross validation 124 | #' 125 | #' \code{fitCV_SparseDOSSA2} randomly partitions the data into fitting and testing 126 | #' subsets. It fits the SparseDOSSA 2 model to the fitting sets and uses log likelihood 127 | #' of the fitted parameters in the testing sets as the criteria for selection of 128 | #' tuning parameter lambda. 129 | #' 130 | #' @param data feature-by-sample matrix of abundances (proportions or 131 | #' counts). 132 | #' @param lambdas vector of positive penalization parameters for the sparsity of feature-feature 133 | #' correlations. The function fits SparseDOSSA 2 models to each of the lambda values, and uses 134 | #' cross validation likelihood to select the optimal one. If not provided this will be chosen 135 | #' automatically. 136 | #' @param control a named list of additional control parameters. See help page for 137 | #' \code{control_fit}. 138 | #' 139 | #' @return a list, with the following components: 140 | #' \describe{ 141 | #' \item{EM_fit}{ 142 | #' list of fitted parameters from the EM algorithm, with additional cross validation likelihood. 143 | #' } 144 | #' \item{F_fit}{ 145 | #' fitted parameters for the joint distribution of per-feature prevalence, abundance, 146 | #' and variability parameters (for simulating new features) 147 | #' } 148 | #' \item{depth_fit}{fitted parameters for the read depth distribution. Only applicable 149 | #' to count data. 
150 | #' } 151 | #' \item{l_filtering}{list of quality control filtering for sample and features. 152 | #' } 153 | #' } 154 | 155 | #' @export 156 | #' @author Siyuan Ma, \email{syma.research@gmail.com} 157 | #' @examples 158 | #' data("Stool_subset") 159 | #' fitted <- fitCV_SparseDOSSA2(data = Stool_subset, 160 | #' lambdas = c(0.1, 1), 161 | #' K = 5) 162 | #' 163 | fitCV_SparseDOSSA2 <- function(data, 164 | lambdas = 10^seq(-2, 0, length.out = 5), 165 | K = 5, 166 | control = list()) { 167 | # initialization, filtering 168 | if(any(data < 0)) 169 | stop("data has negative values!") 170 | data <- t(data) 171 | control <- do.call(control_fit, control) 172 | if(control$verbose) 173 | message("Filtering for all-zero features/samples...") 174 | l_filtering <- filter_data(data) 175 | data <- data[l_filtering$ind_sample, l_filtering$ind_feature, drop = FALSE] 176 | if(!is.null(control$debug_dir)) 177 | dir.create(control$debug_dir) 178 | 179 | # check if table is count and fit library size parameters if needed 180 | row_sums <- apply(data, 1, sum) 181 | if(all(row_sums > 1)) { 182 | if(control$verbose) 183 | message("Data appears to be count table.
", 184 | "Fitting library size distribution...") 185 | depth_fit <- fit_depth(depth = row_sums) 186 | } else { 187 | if(control$verbose) 188 | message("Data appears to be relative abundance table.") 189 | depth_fit <- c("mu_depth" = NA, "sigma_depth" = NA) ## FIXME 190 | } 191 | 192 | # normalize data 193 | data <- t(apply(data, 1, TSS, correct = TRUE)) 194 | 195 | # CV EM fitting 196 | if(control$verbose) 197 | message("Fitting EM algorithm with cross-validation...") 198 | EM_fit <- EM_CV(data = data, 199 | lambdas = lambdas, 200 | K = K, 201 | control = control) 202 | 203 | # fit per-feature parameter 3d Gaussian kernel density 204 | if(control$verbose) 205 | message("Fitting joint distribution of per-feature parameters...") 206 | F_fit <- fit_F(feature_param = cbind("pi0" = EM_fit$fit$pi0, 207 | "mu" = EM_fit$fit$mu, 208 | "sigma" = EM_fit$fit$sigma)) 209 | 210 | return(list(EM_fit = EM_fit, 211 | F_fit = F_fit, 212 | depth_fit = depth_fit, 213 | l_filtering = l_filtering)) 214 | } -------------------------------------------------------------------------------- /R/copulasso.R: -------------------------------------------------------------------------------- 1 | copulasso <- function(data, marginals, 2 | lambda, 3 | control = list()) { 4 | control <- do.call(control_copulasso, control) 5 | if(ncol(data) != nrow(marginals)) 6 | stop("Dimension of data and marginals do not agree!") 7 | 8 | if(lambda >= 1) { 9 | Omega <- 10 | Sigma <- 11 | Corr_star <- diag(1, ncol(data)) 12 | return(list(Omega = Omega, 13 | Sigma = Sigma, 14 | Corr_star = Corr_star, 15 | copulasso_code = 0)) 16 | } else { 17 | data_g <- vapply(seq_len(ncol(data)), 18 | function(i_feature) 19 | a_to_g(a = data[, i_feature], 20 | pi0 = marginals[i_feature, "pi0"], 21 | mu = marginals[i_feature, "mu"], 22 | sigma = marginals[i_feature, "sigma"]), 23 | rep(0.0, nrow(data))) 24 | Corr_star <- cor(data_g, method = "pearson") 25 | S <- get_s(cor_data = Corr_star, 26 | pi0 = marginals[, "pi0"], 27 | glim = 
marginals[, "glim"], 28 | g0 = marginals[, "g0"], 29 | sigmaMod = marginals[, "sigmaMod"]) 30 | Omega <- diag(1/(diag(S) + lambda)) 31 | 32 | if(control$simplify) 33 | z <- which(rowSums(abs(S) > lambda) > 1) 34 | else 35 | z <- seq_len(nrow(S)) 36 | q <- length(z) 37 | if (q > 0) { 38 | S_conditioned <- condition_ridge(S[z, z, drop = FALSE], 39 | lambda = control$lambda_ridge, 40 | method = "ridge1") 41 | out.glasso <- huge::huge.glasso(x = S_conditioned, 42 | lambda = lambda, 43 | verbose = FALSE) 44 | Omega[z, z] <- out.glasso$icov[[1]] 45 | } 46 | } 47 | 48 | if(!is.null(control$debug_dir)) 49 | save(Omega, file = paste0(control$debug_dir, "/debug_copulasso.RData")) 50 | 51 | if(any(is.na(Omega))) { 52 | # warning("Missing values in Omega estimation! (lambda to small?)") # FIXME 53 | Omega <- diag(rep(1, nrow(Omega))) 54 | return(list(Omega = Omega, 55 | copulasso_code = 1)) 56 | } 57 | 58 | if(control$symm) { 59 | Omega <- enforce_symm(Omega, method = "svd") 60 | } 61 | if(control$corr) { 62 | Omega <- enforce_corr(Omega) 63 | } 64 | 65 | Omega <- threshold_matrix(Omega, 66 | threshold_zero = control$threshold_zero) 67 | Sigma <- threshold_matrix(solve(Omega), 68 | threshold_zero = control$threshold_zero) 69 | 70 | return(list(Omega = Omega, 71 | Sigma = Sigma, 72 | Corr_star = Corr_star, 73 | copulasso_code = 0)) 74 | } 75 | 76 | control_copulasso <- function(lambda_ridge = 1e-6, 77 | threshold_zero = 1e-6, 78 | simplify = FALSE, 79 | symm = TRUE, 80 | corr = TRUE, 81 | debug_dir = NULL) { 82 | list(lambda_ridge = lambda_ridge, 83 | threshold_zero = threshold_zero, 84 | simplify = simplify, 85 | symm = symm, 86 | corr = corr, 87 | debug_dir = debug_dir) 88 | } 89 | 90 | Rho <- function(rho_p, 91 | pi0_1, pi0_2, 92 | glim_1, glim_2, 93 | g0_1, g0_2, 94 | sigmaMod_1, sigmaMod_2) { 95 | mat_cor <- matrix(c(1, rho_p, rho_p, 1), 96 | 2, 2) 97 | part1 <- g0_1 * g0_2 * 98 | mvtnorm::pmvnorm(upper = c(glim_1, glim_2), 99 | corr = mat_cor) 100 | part2 <- g0_1 * 101 | 
tmvtnorm::mtmvnorm(sigma = mat_cor, 102 | lower = c(-Inf, glim_2), 103 | upper = c(glim_1, Inf), 104 | doComputeVariance = FALSE)$tmean[2] * 105 | mvtnorm::pmvnorm(lower = c(-Inf, glim_2), 106 | upper = c(glim_1, Inf), 107 | corr = mat_cor) 108 | part3 <- 109 | g0_2 * 110 | tmvtnorm::mtmvnorm(sigma = mat_cor, 111 | lower = c(glim_1, -Inf), 112 | upper = c(Inf, glim_2), 113 | doComputeVariance = FALSE)$tmean[1] * 114 | mvtnorm::pmvnorm(lower = c(glim_1, -Inf), 115 | upper = c(Inf, glim_2), 116 | corr = mat_cor) 117 | part4 <- { 118 | fit_mtmvnorm <- tmvtnorm::mtmvnorm(sigma = mat_cor, 119 | lower = c(glim_1, glim_2), 120 | doComputeVariance = TRUE) 121 | if (any(is.na(fit_mtmvnorm$tmean))) 122 | 0 123 | else 124 | (prod(fit_mtmvnorm$tmean) + fit_mtmvnorm$tvar[1, 2]) * 125 | mvtnorm::pmvnorm(lower = c(glim_1, glim_2), 126 | corr = mat_cor) 127 | } 128 | 129 | return((part1 + part2 + part3 + part4) / sigmaMod_1 / sigmaMod_2) 130 | } 131 | 132 | vRho <- Vectorize(Rho, vectorize.args = "rho_p") 133 | 134 | iRho <- function(rho_s, 135 | pi0_1, pi0_2, 136 | glim_1, glim_2, 137 | g0_1, g0_2, 138 | sigmaMod_1, sigmaMod_2) { 139 | f_lim <- vRho(c(-0.99, 0.99), 140 | pi0_1 = pi0_1, pi0_2 = pi0_2, 141 | glim_1 = glim_1, glim_2 = glim_2, 142 | g0_1 = g0_1, g0_2 = g0_2, 143 | sigmaMod_1 = sigmaMod_1, sigmaMod_2 = sigmaMod_2) 144 | if(rho_s <= f_lim[1]) 145 | return(-0.99) ## FIXME 146 | if(rho_s >= f_lim[2]) 147 | return(0.99) 148 | 149 | uniroot(f = function(x) 150 | vRho(x, 151 | pi0_1 = pi0_1, pi0_2 = pi0_2, 152 | glim_1 = glim_1, glim_2 = glim_2, 153 | g0_1 = g0_1, g0_2 = g0_2, 154 | sigmaMod_1 = sigmaMod_1, sigmaMod_2 = sigmaMod_2) - 155 | rho_s, 156 | interval = c(-0.99, 0.99), 157 | f.lower = f_lim[1], 158 | f.upper = f_lim[2], 159 | maxiter = 100)$root 160 | } 161 | 162 | get_s <- function(cor_data, pi0, glim, g0, sigmaMod) { 163 | df_index <- expand.grid(feature2 = seq_len(nrow(cor_data)), 164 | feature1 = seq_len(nrow(cor_data))) 165 | df_index <- subset(df_index, 
feature1 < feature2) 166 | df_index$i_combo <- seq_len(nrow(df_index)) 167 | 168 | ss <- 169 | future.apply::future_vapply( 170 | df_index$i_combo, 171 | function(ii_combo) { 172 | ind_feature1 <- df_index[ii_combo, ]$feature1 173 | ind_feature2 <- df_index[ii_combo, ]$feature2 174 | 175 | iRho(rho_s = cor_data[ind_feature1, ind_feature2], 176 | pi0_1 = pi0[ind_feature1], pi0_2 = pi0[ind_feature2], 177 | glim_1 = glim[ind_feature1], glim_2 = glim[ind_feature2], 178 | g0_1 = g0[ind_feature1], g0_2 = g0[ind_feature2], 179 | sigmaMod_1 = sigmaMod[ind_feature1], 180 | sigmaMod_2 = sigmaMod[ind_feature2]) 181 | }, 182 | 0.0) 183 | 184 | to_return <- cor_data 185 | to_return[lower.tri(to_return)] <- ss 186 | to_return[upper.tri(to_return)] <- upper_tri(t(to_return), warning = FALSE) 187 | 188 | return(to_return) 189 | } 190 | 191 | solve_ridge <- function(S, lambda, method = "ridge1") { 192 | svd_fit <- svd(S) 193 | if(method == "ridge1") 194 | eigen_inv <- (sqrt(svd_fit$d^2 + 4 * lambda) - svd_fit$d) / 2 / lambda 195 | if(method == "ridge2") 196 | eigen_inv <- 1 / (svd_fit$d + lambda) 197 | 198 | return(svd_fit$u %*% diag(eigen_inv) %*% t(svd_fit$u)) 199 | } 200 | 201 | condition_ridge <- function(S, lambda, method = "ridge1") { 202 | svd_fit <- svd(S) 203 | if(method == "ridge1") 204 | eigen_cond <- 2 * lambda / (sqrt(svd_fit$d^2 + 4 * lambda) - svd_fit$d) 205 | if(method == "ridge2") 206 | eigen_cond <- 1 / (svd_fit$d + lambda) 207 | 208 | return(svd_fit$u %*% diag(eigen_cond) %*% t(svd_fit$u)) 209 | } -------------------------------------------------------------------------------- /R/data.R: -------------------------------------------------------------------------------- 1 | #' A subset of the HMP1-II stool samples 2 | #' 3 | #' A dataset containing species-level microbial counts of a subset of 4 | #' the HMP1-II stool samples. This includes the top 5 most abundant 5 | #' species and top 5 most deeply sequenced samples. 
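A note on the two ridge helpers above: with `method = "ridge1"`, the eigenvalue transforms used by `solve_ridge` and `condition_ridge` are exact reciprocals of each other, so `solve_ridge(S, lambda)` is the matrix inverse of `condition_ridge(S, lambda)` (both reuse the singular vectors of `S`). A quick standalone check with hypothetical values (illustrative only, not package code):

```r
S <- matrix(c(2, 1, 1, 2), 2, 2)   # hypothetical symmetric input matrix
lambda <- 0.1
d <- svd(S)$d                      # singular values of S
# eigenvalue transform in solve_ridge(), method = "ridge1"
eig_inv  <- (sqrt(d^2 + 4 * lambda) - d) / (2 * lambda)
# eigenvalue transform in condition_ridge(), method = "ridge1"
eig_cond <- 2 * lambda / (sqrt(d^2 + 4 * lambda) - d)
# the two transforms multiply to 1 eigenvalue-wise
```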
6 | #' 7 | #' @format A matrix with 5 rows (species) and 5 columns (samples) 8 | #' @source \url{https://www.hmpdacc.org/hmp/} 9 | "Stool_subset" -------------------------------------------------------------------------------- /R/fit.R: -------------------------------------------------------------------------------- 1 | fit_featureParam <- function(data) { 2 | feature_param <- 3 | t(vapply(seq_len(ncol(data)), 4 | function(i_feature) 5 | fit_oneFeatureParam( 6 | data[, i_feature, drop = TRUE]), 7 | rep(0.0, 6))) 8 | # fill in missing sigmas (those with only one non-zero observation) 9 | # with the overall median across features 10 | feature_param[is.na(feature_param[, "sigma"]), "sigma"] <- 11 | median(feature_param[, "sigma"], na.rm = TRUE) 12 | 13 | rownames(feature_param) <- colnames(data) 14 | return(feature_param) 15 | } 16 | 17 | fit_oneFeatureParam <- function(x) { 18 | ind_nonzero <- x > 0 19 | if(!any(ind_nonzero)) 20 | stop("Does not support features with all zero values!") 21 | if(all(ind_nonzero)) ## FIXME 22 | pi0 <- 0.5 / length(x) 23 | else 24 | pi0 <- mean(!ind_nonzero) 25 | mu <- mean(log(x[ind_nonzero])) 26 | sigma <- sd(log(x[ind_nonzero])) 27 | 28 | # Additional parameters to help with computation 29 | glim <- qnorm(pi0) 30 | g0 <- truncnorm::etruncnorm(b = glim) 31 | sigmaMod <- sqrt(pi0 * g0^2 + 32 | (1 - pi0) * 33 | (truncnorm::vtruncnorm(a = glim) + 34 | truncnorm::etruncnorm(a = glim)^2)) 35 | 36 | return(c("pi0" = pi0, 37 | "mu" = mu, 38 | "sigma" = sigma, 39 | "glim" = glim, 40 | "g0" = g0, 41 | "sigmaMod" = sigmaMod)) 42 | } 43 | 44 | fit_F <- function(feature_param) { 45 | if(!all(colnames(feature_param) == c("pi0", "mu", "sigma"))) 46 | stop("feature_param is not of the correct format!") 47 | # transform pi0 parameter before estimation 48 | feature_param[, "pi0"] <- log(feature_param[, "pi0"]) - log(1 - feature_param[, "pi0"]) 49 | 50 | ind_zero_sigma <- feature_param[, "sigma"] == 0 51 | 52 | K_nonzero <- ks::Hscv(x =
feature_param[!ind_zero_sigma, , drop = FALSE]) 53 | if(any(ind_zero_sigma)) 54 | K_zero <- ks::Hscv(x = feature_param[ind_zero_sigma, -3, drop = FALSE]) 55 | else 56 | K_zero <- NULL 57 | 58 | return(list(p0_sigma = mean(ind_zero_sigma), 59 | K_nonzero = K_nonzero, 60 | K_zero = K_zero)) 61 | } 62 | 63 | fit_depth <- function(depth) { 64 | return(c("mu_depth" = mean(log(depth)), 65 | "sigma_depth" = sd(log(depth)))) 66 | } 67 | -------------------------------------------------------------------------------- /R/generate.R: -------------------------------------------------------------------------------- 1 | generate_a <- function(n, 2 | feature_param, Omega, 3 | maxit = 10, verbose = FALSE) { 4 | i_iter <- 0 5 | samples_a <- matrix(NA, 6 | nrow = nrow(feature_param), 7 | ncol = 0) 8 | rownames(samples_a) <- rownames(feature_param) 9 | 10 | while(TRUE) { 11 | if(i_iter + 1 > maxit) 12 | stop("Can't satisfy conditions!") 13 | i_iter <- i_iter + 1 14 | if(verbose) 15 | print(i_iter) 16 | 17 | samples_a <- cbind(samples_a, 18 | t(rcopulasso(n = n, 19 | pi0 = feature_param[, "pi0"], 20 | mu = feature_param[, "mu"], 21 | sigma = feature_param[, "sigma"], 22 | Omega = Omega))) 23 | 24 | ind_nonzero <- apply(samples_a > 0, 2, any) ## FIXME??
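As a side note on the read-depth model: `fit_depth` above treats log read depths as normal, and `generate_depth` (in generate.R) resamples on the log scale and rescales so the median matches a target depth. A standalone sketch of that round trip, with made-up numbers (illustrative only, not package code):

```r
set.seed(42)
# hypothetical "observed" depths, drawn log-normal for illustration
depth <- round(rlnorm(100, meanlog = 16, sdlog = 0.5))
mu_depth    <- mean(log(depth))   # what fit_depth() would return
sigma_depth <- sd(log(depth))
# resample new depths on the log scale, as in generate_depth()
depth_new <- exp(rnorm(50, mean = mu_depth, sd = sigma_depth))
# rescale so the median hits a target depth (here 5e4, hypothetical)
depth_new <- round(depth_new / median(depth_new) * 5e4)
```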
25 | if(sum(ind_nonzero) >= n) { 26 | samples_a <- samples_a[, ind_nonzero][, seq_len(n)] 27 | colnames(samples_a) <- paste0("Sample", seq_len(n)) 28 | return(samples_a) 29 | } 30 | } 31 | } 32 | 33 | rcopulasso <- function(n, pi0, mu, sigma, Omega) { 34 | if(length(pi0) != length(mu) | 35 | length(pi0) != length(sigma) | 36 | length(pi0) != nrow(Omega)) 37 | stop("Parameter dimensions must agree!") 38 | 39 | # sample marginals 40 | mat_amarginals <- 41 | vapply(seq_len(length(pi0)), 42 | function(i) 43 | rZILogN_one(n = n, 44 | pi0 = pi0[i], 45 | mu = mu[i], 46 | sigma = sigma[i]), 47 | rep(0.0, n)) 48 | # arrange from smallest to largest for shuffling 49 | mat_amarginals <- 50 | apply(mat_amarginals, 2, function(x) x[order(x)]) 51 | 52 | # sample ranks 53 | mat_rank <- 54 | mvtnorm::rmvnorm(n = n, sigma = solve(Omega)) 55 | mat_rank <- apply(mat_rank, 2, rank) 56 | 57 | mat_a <- 58 | vapply(seq_len(length(pi0)), 59 | function(i) 60 | mat_amarginals[, i, drop = TRUE][mat_rank[, i, drop = TRUE]], 61 | rep(0.0, n)) 62 | 63 | return(mat_a) 64 | } 65 | 66 | rZILogN_one <- function(n, pi0, mu, sigma) { 67 | return(exp(rnorm(n = n, mean = mu, sd = sigma)) * 68 | rbinom(n = n, size = 1, prob = 1 - pi0)) 69 | } 70 | 71 | generate_featureParam <- function(F_fit, param_original, n_feature) { 72 | if(!all(names(F_fit) == c("p0_sigma", "K_nonzero", "K_zero"))) 73 | stop("F_fit is not of the correct format!") 74 | 75 | # transform pi parameter 76 | param_original[, "pi0"] <- logit(param_original[, "pi0"]) 77 | 78 | if(F_fit$p0_sigma > 0) 79 | ind_sigma <- rbinom(n = n_feature, size = 1, 80 | prob = 1 - F_fit$p0_sigma) == 1 81 | else 82 | ind_sigma <- rep(TRUE, length = n_feature) 83 | 84 | # simulate parameters for non-zero sigmas 85 | param_original_nonZeroSigma <- 86 | param_original[param_original[, "sigma"] > 0, , drop = FALSE] 87 | ind_sigma_nonzero_tosimulate <- ind_sigma[ind_sigma] 88 | param_simulated <- 89 | matrix(NA, nrow = sum(ind_sigma_nonzero_tosimulate), ncol = 
3) 90 | dimnames(param_simulated) <- list(paste0("Feature", seq_len(n_feature)), 91 | c("pi0", "mu", "sigma")) 92 | while(TRUE) { 93 | nFeature_nonzero <- sum(ind_sigma_nonzero_tosimulate) 94 | simulated_mixture <- 95 | param_original_nonZeroSigma[sample.int( 96 | n = nrow(param_original_nonZeroSigma), 97 | size = nFeature_nonzero, 98 | replace = TRUE), , drop = FALSE] 99 | simulated_difference <- mvtnorm::rmvnorm(n = nFeature_nonzero, 100 | sigma = F_fit$K_nonzero) 101 | simulated_mixture <- simulated_mixture + simulated_difference 102 | param_simulated[ind_sigma_nonzero_tosimulate, ] <- simulated_mixture 103 | ind_sigma_nonzero_tosimulate <- param_simulated[, "sigma"] <= 0 104 | if(!any(ind_sigma_nonzero_tosimulate)) break 105 | } 106 | 107 | nFeature_zero <- sum(!ind_sigma) 108 | if(nFeature_zero > 0) { 109 | param_original_zeroSigma <- 110 | param_original[param_original[, "sigma"] == 0, , drop = FALSE] 111 | simulated_mixture <- 112 | param_original_zeroSigma[sample.int( 113 | n = nrow(param_original_zeroSigma), 114 | size = nFeature_zero, 115 | replace = TRUE), , drop = FALSE] 116 | simulated_difference <- mvtnorm::rmvnorm(n = nFeature_zero, 117 | sigma = F_fit$K_zero) 118 | simulated_mixture[, c("pi0", "mu")] <- 119 | simulated_mixture[, c("pi0", "mu"), drop = FALSE] + 120 | simulated_difference 121 | param_simulated <- rbind(param_simulated, simulated_mixture) 122 | } 123 | 124 | param_simulated[, "pi0"] <- expit(param_simulated[, "pi0"]) 125 | return(param_simulated) 126 | } 127 | 128 | 129 | #' @importFrom magrittr %>% 130 | generate_feature_metadata_spike_df <- 131 | function(features, 132 | perc_feature_spiked_metadata, 133 | n_metadata, 134 | effect_size, 135 | spike_metadata = c("both", 136 | "abundance", 137 | "prevalence")) { 138 | n_feature_spiked <- ceiling(length(features) * 139 | perc_feature_spiked_metadata) 140 | if(spike_metadata == "both") 141 | spike_metadata <- c("abundance", "prevalence") 142 | 143 | feature_metadata_spike_df <- 144 | 
purrr::map2_dfr(seq_len(n_metadata), 145 | effect_size, 146 | function(metadatum_i, effect_size_i) { 147 | data.frame(metadata_datum = metadatum_i, 148 | effect_size = 149 | effect_size_i * 150 | sample(c(-1, 1), 151 | size = n_feature_spiked, 152 | replace = TRUE), 153 | feature_spiked = 154 | sample(features, 155 | size = n_feature_spiked, 156 | replace = FALSE), 157 | stringsAsFactors = FALSE) 158 | }) 159 | feature_metadata_spike_df <- 160 | tidyr::expand_grid(feature_metadata_spike_df, 161 | associated_property = spike_metadata) %>% 162 | dplyr::select(metadata_datum, 163 | feature_spiked, 164 | associated_property, 165 | effect_size) %>% 166 | dplyr::arrange(metadata_datum, 167 | associated_property, 168 | -effect_size) %>% 169 | as.data.frame() 170 | 171 | return(feature_metadata_spike_df) 172 | } 173 | 174 | generate_depth <- function(mu_depth, 175 | sigma_depth, 176 | n, median_depth) { 177 | depth <- exp(rnorm(n = n, mean = mu_depth, sd = sigma_depth)) 178 | depth <- round(depth / median(depth) * median_depth) 179 | return(depth) 180 | } 181 | 182 | generate_count <- function(rel, depth) { 183 | mat_count <- 184 | vapply(seq_along(depth), 185 | function(i_sample) { 186 | if(all(rel[, i_sample] == 0)) 187 | return(rep(0, nrow(rel))) 188 | else 189 | rmultinom(n = 1, size = depth[i_sample], prob = rel[, i_sample]) 190 | }, 191 | rep(0, nrow(rel))) 192 | dimnames(mat_count) <- dimnames(rel) 193 | return(mat_count) 194 | } 195 | -------------------------------------------------------------------------------- /R/helpers.R: -------------------------------------------------------------------------------- 1 | logit <- function(x) log(x) - log(1 - x) 2 | 3 | expit <- function(x) 4 | exp(x) / (1 + exp(x)) 5 | 6 | TSS <- function(x, correct = FALSE) { 7 | if(any(x < 0)) 8 | stop("Negative x values are not accepted!") 9 | if(all(x == 0)) return(x) 10 | # this is a special case where only one feature is present 11 | if(sum(x > 0) == 1 & correct) { 12 | x[x > 0] <- 1 - 
0.5 / length(x) 13 | return(x) 14 | } 15 | return(x / sum(x)) 16 | } 17 | 18 | lower_tri <- function(x, warning = FALSE) { 19 | if(!isSymmetric(x) & warning) 20 | warning("x is not symmetric!") 21 | 22 | x[lower.tri(x)] 23 | } 24 | 25 | upper_tri <- function(x, warning = FALSE) { 26 | if(!isSymmetric(x) & warning) 27 | warning("x is not symmetric!") 28 | 29 | x[upper.tri(x)] 30 | } 31 | 32 | enforce_symm <- function(x, method = "svd") { 33 | if(nrow(x) != ncol(x)) 34 | stop("x does not appear to be a covariance matrix!") 35 | x_out <- x 36 | if(!isSymmetric(x_out)) { 37 | if(method == "average") { 38 | lower_averaged <- (lower_tri(x_out, warning = FALSE) + 39 | lower_tri(t(x_out), warning = FALSE)) / 2 40 | x_out[lower.tri(x_out)] <- lower_averaged 41 | x_out[upper.tri(x_out)] <- 42 | t(x_out)[upper.tri(x_out)] 43 | } 44 | if(method == "lower") 45 | x_out[upper.tri(x_out)] <- upper_tri(t(x_out), warning = FALSE) 46 | if(method == "upper") 47 | x_out[lower.tri(x_out)] <- lower_tri(t(x_out), warning = FALSE) 48 | if(method == "svd") { 49 | svd_fit <- svd(x) 50 | if(Matrix::rankMatrix(svd_fit$u) < nrow(x)) 51 | # Sometimes svd runs into weird cases where the matrix is not singular but the 52 | # eigen vector matrix is 53 | svd_fit <- svd(x + diag(rep(1e-16, nrow(x)))) 54 | 55 | # In case it's not pos-def 56 | svd_fit$d <- abs(svd_fit$d) ## FIXME 57 | 58 | x_out <- svd_fit$u %*% diag(svd_fit$d) %*% t(svd_fit$u) 59 | 60 | # I don't know how this could happen, but even after this x_out can still be not symmetric!
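The SVD branch above symmetrizes `x` by rebuilding it from its left singular vectors as `u %*% diag(d) %*% t(u)`, which is symmetric by construction (up to floating-point rounding). A tiny standalone illustration with a hypothetical 2x2 input (not package code):

```r
# slightly asymmetric "covariance" matrix (off-diagonals 0.50 vs 0.45)
x <- matrix(c(1, 0.50, 0.45, 1), 2, 2)
svd_fit <- svd(x)
# rebuild from the left singular vectors only, as enforce_symm(method = "svd") does
x_sym <- svd_fit$u %*% diag(svd_fit$d) %*% t(svd_fit$u)
```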
61 | ## FIXME 62 | x_out[upper.tri(x_out)] <- upper_tri(t(x_out), warning = FALSE) 63 | } 64 | } 65 | 66 | return(x_out) 67 | } 68 | 69 | enforce_corr <- function(x) { 70 | if(nrow(x) != ncol(x)) 71 | stop("x does not appear to be a covariance matrix!") 72 | 73 | diag_sigma <- diag(sqrt(abs(diag(solve(x))))) 74 | x_out <- diag_sigma %*% x %*% diag_sigma 75 | 76 | return(x_out) 77 | } 78 | 79 | a_to_u <- function(a, pi0, mu, sigma, 80 | half_pi0 = FALSE) { 81 | if(any(pi0 == 1) | any(pi0 == 0)) 82 | stop("zero or one valued pi0 is not supported!") 83 | 84 | if(half_pi0) 85 | to_return <- pi0 / 2 86 | else 87 | to_return <- pi0 88 | 89 | ind_nonzero <- a > 0 90 | if(any(ind_nonzero)) ##FIXME 91 | to_return[ind_nonzero] <- 92 | pi0[ind_nonzero] + 93 | pnorm(log(a[ind_nonzero]), 94 | mean = mu[ind_nonzero], 95 | sd = sigma[ind_nonzero]) * 96 | (1 - pi0[ind_nonzero]) 97 | 98 | return(to_return) 99 | } 100 | 101 | a_to_u_marg <- function(a, pi0, mu, sigma, 102 | half_pi0 = FALSE, 103 | u_tol = 1e-5) { 104 | if(half_pi0) 105 | to_return <- rep(pi0 / 2, 106 | length = length(a)) 107 | else 108 | to_return <- rep(pi0, 109 | length = length(a)) 110 | 111 | ind_nonzero <- a > 0 112 | if(any(ind_nonzero)) ##FIXME 113 | to_return[ind_nonzero] <- 114 | pi0 + 115 | pnorm(log(a[ind_nonzero]), 116 | mean = mu, 117 | sd = sigma) * 118 | (1 - u_tol) * 119 | (1 - pi0) 120 | 121 | return(to_return) 122 | } 123 | 124 | a_to_g <- function(a, pi0, mu, sigma) { 125 | if(pi0 == 1 | pi0 == 0) 126 | stop("zero or one valued pi0 is not supported!") 127 | ind_nonzero <- a > 0 128 | 129 | to_return <- u <- a_to_u_marg(a, pi0, mu, sigma) 130 | to_return[ind_nonzero] <- qnorm(u[ind_nonzero]) 131 | to_return[!ind_nonzero] <- 132 | -mean(to_return[ind_nonzero]) * 133 | sum(ind_nonzero) / sum(1 - ind_nonzero) 134 | 135 | return(to_return) 136 | } 137 | 138 | a <- function(x, asum) { 139 | a <- x * asum 140 | a[x == 0] <- 0 141 | 142 | return(a) 143 | } 144 | 145 | make_CVfolds <- function(n, K) { 146 
| cut(sample.int(n), breaks = K, labels = FALSE) 147 | } 148 | 149 | filter_data <- function(data, 150 | k_feature = 2, k_sample = 1, 151 | maxit = 3) { 152 | i_iter <- 0 153 | ind_feature <- rep(TRUE, ncol(data)) 154 | ind_sample <- rep(TRUE, nrow(data)) 155 | 156 | while(TRUE) { 157 | if (i_iter + 1 > maxit) 158 | stop("Max iteration reached!") 159 | i_iter <- i_iter + 1 160 | 161 | ind_feature_tmp <- 162 | apply(data[ind_sample, ind_feature, drop = FALSE] > 0, 2, sum) >= 163 | k_feature 164 | ind_feature[ind_feature] <- ind_feature_tmp 165 | ind_sample_tmp <- 166 | apply(data[ind_sample, ind_feature, drop = FALSE] > 0, 1, sum) >= 167 | k_sample 168 | ind_sample[ind_sample] <- ind_sample_tmp 169 | 170 | if (all(ind_feature_tmp) & all(ind_sample_tmp)) 171 | return(list(ind_feature = ind_feature, 172 | ind_sample = ind_sample)) 173 | } 174 | } 175 | 176 | fill_estimates_CV <- function(params_CV, params_full, ind_feature) { 177 | params_return <- params_CV 178 | 179 | params_return[c("pi0", "mu", "sigma")] <- params_full[c("pi0", "mu", "sigma")] 180 | for(param in c("pi0", "mu", "sigma")) 181 | params_return[[param]][ind_feature] <- params_CV[[param]] 182 | 183 | params_return$Sigma <- 184 | params_return$Omega <- 185 | diag(rep(1, length(ind_feature))) 186 | params_return[["Sigma"]][ind_feature, ind_feature] <- params_CV[["Sigma"]] 187 | params_return[["Omega"]][ind_feature, ind_feature] <- threshold_matrix(solve(params_CV[["Sigma"]])) 188 | 189 | return(params_return) 190 | } 191 | 192 | threshold_matrix <- function(x, threshold_zero = 1e-16) { 193 | ## FIXME 194 | if(!is.null(threshold_zero)) { 195 | diag_x <- diag(x) 196 | x[abs(x) < threshold_zero] <- 0 197 | diag(x) <- diag_x 198 | } 199 | return(x) 200 | } -------------------------------------------------------------------------------- /R/integration_helpers.R: -------------------------------------------------------------------------------- 1 | integrate2 <- function(f, 2 | lower, upper, 3 | rel_tol, abs_tol, 
max_eval, 4 | precBits, 5 | ...) { 6 | if(lower == 0 & upper == 0) 7 | return(list(integral = 0, 8 | error = 0, 9 | neval = 0, 10 | returnCode = 1)) 11 | 12 | ff <- function(x) f(x, ...) 13 | 14 | neval <- 2 15 | knots_spline <- Rmpfr::mpfr(c(lower, upper), 16 | precBits = precBits) 17 | vals_spline <- 18 | Rmpfr::mpfr( 19 | ff(as.double(knots_spline)), 20 | precBits = precBits) 21 | errors_spline <- Inf 22 | 23 | while(TRUE) { 24 | i_max_error <- which(errors_spline == max(errors_spline))[1] 25 | knots_spline <- c(knots_spline[seq(1, i_max_error)], 26 | Rmpfr::mean(knots_spline[c(i_max_error, i_max_error + 1)]), 27 | knots_spline[seq(i_max_error + 1, neval)]) 28 | 29 | vals_spline <- c(vals_spline[seq(1, i_max_error)], 30 | Rmpfr::mpfr(ff(as.double(knots_spline[i_max_error + 1])), 31 | precBits = precBits), 32 | vals_spline[seq(i_max_error + 1, neval)]) 33 | 34 | neval <- neval + 1 35 | knots_diff <- knots_spline[-1] - knots_spline[-neval] 36 | # linear spline for estimating integration 37 | coefs_spline <- Rmpfr::mpfrArray(NA, precBits = precBits, 38 | dim = c(2, neval - 1)) 39 | coefs_spline[2, ] <- 40 | (vals_spline[-1] - vals_spline[-neval]) / 41 | knots_diff 42 | coefs_spline[1, ] <- 43 | vals_spline[-neval] - knots_spline[-neval] * coefs_spline[2, ] 44 | integral <- sum(coefs_spline[1, ] * knots_spline[-1] + 45 | coefs_spline[2, ] / 2 * knots_spline[-1]^2 - 46 | coefs_spline[1, ] * knots_spline[-neval] - 47 | coefs_spline[2, ] / 2 * knots_spline[-neval]^2) 48 | 49 | # error estimation 50 | errors_spline <- estimate_errors( 51 | knots_diff, 52 | c(Rmpfr::mpfr(0, precBits = precBits), 53 | coefs_spline[2, ], 54 | Rmpfr::mpfr(0, precBits = precBits)), 55 | precBits = precBits) 56 | error <- sum(errors_spline) 57 | 58 | if(neval >= max_eval) 59 | break 60 | if(integral < 0) 61 | stop("Negative integration values; something went wrong!") 62 | if(integral > 0) 63 | if(error / abs(integral) < rel_tol | 64 | error < abs_tol) 65 | break 66 | } 67 | 
return(list(integral = integral, 68 | error = error, 69 | neval = neval, 70 | returnCode = 0, 71 | debug = list(knots_spline = knots_spline, 72 | vals_spline = vals_spline, 73 | errors_spline = errors_spline))) 74 | } 75 | 76 | get_intLimits <- function(x, pi0, mu, sigma, Omega, Sigma, 77 | maxit = 10) { 78 | ind_nonzero <- x != 0 79 | logx <- log(x[ind_nonzero]) 80 | mat_range <- cbind(mu[ind_nonzero] - logx - sqrt(2000)*sigma[ind_nonzero], 81 | mu[ind_nonzero] - logx + sqrt(2000)*sigma[ind_nonzero]) 82 | range <- c(max(c(mat_range[, 1], -745 - min(logx))), 83 | min(mat_range[, 2])) 84 | 85 | if(range[1] >= range[2]) 86 | return(c(0, 0)) ## FIXME?? 87 | 88 | i_iter <- 1 89 | vlim <- seq(from = range[1], to = range[2], length.out = 3) 90 | vval <- sapply(vlim, 91 | function(vv) 92 | dloga_asum(asum = exp(vv), x = x, 93 | pi0 = pi0, mu = mu, sigma = sigma, 94 | Omega = Omega, Sigma = Sigma)) 95 | while(TRUE) { 96 | vflag <- vval > -745 97 | if(vflag[1] | rev(vflag)[1]) 98 | stop("Positive dx values at integration limits!") 99 | if(sum(vflag) > 1) 100 | return(vlim[c(min(which(vflag)) - 1, 101 | max(which(vflag)) + 1)]) 102 | 103 | if(i_iter + 1 > maxit) 104 | return(range(vlim)) ## FIXME?? 
105 | i_iter <- i_iter + 1 106 | 107 | if(all(vval == -Inf)) { 108 | vlim <- c(range[1], 109 | sort(runif(3, min = range[1], max = range[2])), 110 | range[2]) 111 | vval <- c(-Inf, 112 | sapply(vlim[seq(2, 4)], 113 | function(vv) 114 | dloga_asum(asum = exp(vv), x = x, 115 | pi0 = pi0, mu = mu, sigma = sigma, 116 | Omega = Omega, Sigma = Sigma)), 117 | -Inf) 118 | } else { 119 | ind_max <- order(-vval)[1] 120 | if(ind_max == 1) ind_max <- 2 121 | if(ind_max == length(vlim)) ind_max <- length(vlim) - 1 122 | vlim <- vlim[seq(ind_max - 1, 123 | ind_max + 1)] 124 | vval <- vval[seq(ind_max - 1, 125 | ind_max + 1)] 126 | 127 | vlim <- c(vlim[1], 128 | mean(vlim[seq(1, 2)]), 129 | vlim[2], 130 | mean(vlim[seq(2, 3)]), 131 | vlim[3]) 132 | vval <- c(vval[1], 133 | dloga_asum(asum = exp(vlim[2]), x = x, 134 | pi0 = pi0, mu = mu, sigma = sigma, 135 | Omega = Omega, Sigma = Sigma), 136 | vval[2], 137 | dloga_asum(asum = exp(vlim[4]), x = x, 138 | pi0 = pi0, mu = mu, sigma = sigma, 139 | Omega = Omega, Sigma = Sigma), 140 | vval[3]) 141 | } 142 | } 143 | } 144 | 145 | dloga_asum <- function(asum, x, pi0, mu, sigma, Omega, Sigma) { 146 | if(asum * min(x[x != 0]) <= 0) 147 | return(-Inf) 148 | if(asum > 0) 149 | return(dloga(a(x, asum), pi0 = pi0, mu = mu, sigma = sigma, 150 | Omega = Omega, Sigma = Sigma, 151 | log.p = TRUE)) 152 | } 153 | 154 | get_diff <- function(x, x_old, 155 | denom_c = 1e-5, 156 | method = "abs") { 157 | x <- as.vector(x) 158 | x_old <- as.vector(x_old) 159 | 160 | abs_diff <- abs(x - x_old) 161 | if(method == "abs") 162 | return(max(abs_diff)) 163 | if(method == "rel") { 164 | rel_diff <- abs_diff / (abs(x) + denom_c) 165 | return(max(rel_diff)) 166 | } 167 | } 168 | 169 | Vectorize2 <- function(FUN, vectorize.args = arg.names, SIMPLIFY = TRUE, 170 | USE.NAMES = TRUE) 171 | { 172 | arg.names <- as.list(formals(FUN)) 173 | arg.names[["..."]] <- NULL 174 | arg.names <- names(arg.names) 175 | vectorize.args <- as.character(vectorize.args) 176 | 177 | 
if(length(vectorize.args) != 1) 178 | stop("Can only vectorize over one argument!") 179 | 180 | if (!length(vectorize.args)) 181 | return(FUN) 182 | if (!all(vectorize.args %in% arg.names)) 183 | stop("must specify names of formal arguments for 'vectorize'") 184 | collisions <- arg.names %in% c("FUN", "SIMPLIFY", "USE.NAMES", 185 | "vectorize.args") 186 | if (any(collisions)) 187 | stop(sQuote("FUN"), " may not have argument(s) named ", 188 | paste(sQuote(arg.names[collisions]), collapse = ", ")) 189 | FUNV <- function() { 190 | args <- lapply(as.list(match.call())[-1L], eval, parent.frame()) 191 | names <- if (is.null(names(args))) 192 | character(length(args)) 193 | else names(args) 194 | dovec <- names %in% vectorize.args 195 | val <- do.call("mapply", 196 | c(FUN = FUN, args[dovec], 197 | MoreArgs = list(args[!dovec]), 198 | SIMPLIFY = TRUE, USE.NAMES = USE.NAMES)) 199 | if(!is.null(dim(args[dovec][[1]]))) 200 | val <- array(val, dim = dim(args[dovec][[1]])) 201 | return(val) 202 | } 203 | formals(FUNV) <- formals(FUN) 204 | FUNV 205 | } 206 | 207 | estimate_errors <- function(knots_diff, slopes, precBits) { 208 | if(length(knots_diff) != length(slopes) - 2) 209 | stop("Length of knots_diff and slopes should agree!") 210 | errors <- rep(NA, length = length(knots_diff)) 211 | 212 | y3s <- knots_diff * slopes[-c(1, length(slopes))] 213 | errors <- 214 | Rmpfr::sapplyMpfr(seq_along(knots_diff), 215 | function(i_region) 216 | estimate_one_error( 217 | knots_diff[i_region], 218 | y3s[i_region], 219 | slopes[c(i_region, i_region + 1, i_region + 2)], 220 | precBits = precBits 221 | )) 222 | 223 | return(errors) 224 | } 225 | 226 | estimate_one_error <- function(x3, y3, slopes, precBits) { 227 | slope1 <- slopes[1] 228 | slope2 <- slopes[3] 229 | if((slope1 <= slopes[2] & slopes[2] <= slope2) | 230 | (slope1 >= slopes[2] & slopes[2] >= slope2)) { 231 | # function is convex/concave here 232 | 233 | # if slopes are the same 234 | if(slope1 == slope2) 235 | 
return(Rmpfr::mpfr(0, precBits = precBits)) 236 | 237 | # if not calculate where lines meet 238 | x2 <- (slope2 * x3 - y3) / (slope2 - slope1) 239 | y2 <- x2 * slope1 240 | return(abs(x2 * y3 - x3 * y2) / 2) 241 | } else { 242 | return(abs(x3 * y3) / 2) 243 | } 244 | } 245 | -------------------------------------------------------------------------------- /R/likelihood_expectation.R: -------------------------------------------------------------------------------- 1 | dx <- function(x, 2 | pi0, mu, sigma, Omega, Sigma, 3 | control = list(), 4 | log.p = FALSE) { 5 | control <- do.call(control_integrate, control) 6 | limits <- get_intLimits( 7 | x = x, 8 | pi0 = pi0, mu = mu, sigma = sigma, 9 | Omega = Omega, Sigma = Sigma, 10 | maxit = control$maxit_limits) 11 | 12 | fit_integrate <- 13 | integrate2(vintegrand_dx, 14 | lower = limits[1], upper = limits[2], 15 | rel_tol = control$rel_tol, abs_tol = control$abs_tol, 16 | max_eval = control$max_eval, 17 | precBits = control$precBits, 18 | x = x, pi0 = pi0, mu = mu, sigma = sigma, 19 | Omega = Omega, Sigma = Sigma) 20 | 21 | # jacobian 22 | fit_integrate$integral <- as.double(log(fit_integrate$integral) - sum(log(x[x > 0]))) 23 | fit_integrate$error <- as.double(exp(log(fit_integrate$error) - sum(log(x[x > 0])))) 24 | 25 | if(log.p) { 26 | return(fit_integrate$integral) 27 | } 28 | else { 29 | fit_integrate$integral <- exp(fit_integrate$integral) 30 | if(control$only_value) 31 | return(fit_integrate$integral) 32 | return(fit_integrate) 33 | } 34 | } 35 | 36 | #' control parameters for the 37 | #' numerical integrations during the E step of SparseDOSSA2's fitting 38 | #' 39 | #' @param rel_tol relative change threshold in the integration values for the 40 | #' integration to converge 41 | #' @param abs_tol absolute change threshold in the integration values for the 42 | #' integration to converge 43 | #' @param max_eval maximum of integration evaluations allowed 44 | #' @param maxit_limits maximum number of tries allowed 
to guess the integration's 45 | #' lower and upper limits 46 | #' @param precBits numeric precision used for the integration values 47 | #' @param only_value whether or not only the integration value should be returned 48 | #' 49 | #' @return a list of the same names 50 | #' @export 51 | control_integrate <- function(rel_tol = 1e-2, 52 | abs_tol = 0, 53 | max_eval = 50, 54 | maxit_limits = 10, 55 | precBits = 200, 56 | only_value = TRUE) { 57 | list(rel_tol = rel_tol, 58 | abs_tol = abs_tol, 59 | max_eval = max_eval, 60 | maxit_limits = maxit_limits, 61 | precBits = precBits, 62 | only_value = only_value) 63 | } 64 | 65 | ploga <- function(a, 66 | pi0, mu, sigma, Sigma, 67 | log.p = TRUE) { 68 | ind_zero <- a == 0 69 | 70 | if(!any(ind_zero)) { 71 | log_p <- 1 72 | } else { 73 | a <- a[ind_zero] 74 | pi0 <- pi0[ind_zero] 75 | mu <- mu[ind_zero] 76 | sigma <- sigma[ind_zero] 77 | Sigma <- Sigma[ind_zero, ind_zero, drop = FALSE] 78 | 79 | u <- a_to_u(a, 80 | pi0 = pi0, mu = mu, sigma = sigma) 81 | g <- qnorm(u) 82 | log_p <- 83 | log(mvtnorm::pmvnorm( 84 | lower = -Inf, 85 | upper = g, 86 | mean = rep(0, length(g)), 87 | sigma = Sigma)) 88 | } 89 | 90 | if(log.p) 91 | return(log_p) 92 | else 93 | return(exp(log_p)) 94 | } 95 | 96 | mloga <- function(a, 97 | pi0, mu, sigma, Sigma, 98 | doComputeVariance = FALSE, 99 | log_ploga = 0) { 100 | ind_zero <- a == 0 101 | 102 | if(!any(ind_zero)) { 103 | return(list(mean_cond = NULL, 104 | Sigma_cond = matrix(nrow = 0, ncol = 0))) 105 | } else { 106 | a <- a[ind_zero] 107 | pi0 <- pi0[ind_zero] 108 | mu <- mu[ind_zero] 109 | sigma <- sigma[ind_zero] 110 | Sigma <- Sigma[ind_zero, ind_zero, drop = FALSE] 111 | 112 | u <- a_to_u(a, 113 | pi0 = pi0, mu = mu, sigma = sigma) 114 | g <- qnorm(u) 115 | m_cond <- 116 | tmvtnorm::mtmvnorm(sigma = Sigma, 117 | upper = g, 118 | doComputeVariance = doComputeVariance) 119 | } 120 | 121 | return(list(mean_cond = m_cond$tmean, 122 | Sigma_cond = m_cond$tvar)) 123 | } 124 | 125 | 
dloga_forInt <- function(a, 126 | pi0, mu, sigma, Omega, Sigma, 127 | log_ploga = 0, 128 | mean_cond, 129 | Sigma_cond = NA, 130 | mean_cond_choice = "half", 131 | Sigma_cond_choice = "cond", 132 | log.p = TRUE) { 133 | 134 | u <- a_to_u(a, 135 | pi0 = pi0, mu = mu, sigma = sigma, 136 | half_pi0 = TRUE) 137 | g <- qnorm(u) 138 | 139 | ind_nonzero <- a > 0 140 | if(any(abs(g) == Inf)) { 141 | log_d <- -Inf 142 | } else if(all(ind_nonzero)) { 143 | log_d <- 144 | mvtnorm::dmvnorm(x = g, 145 | mean = rep(0, length(g)), 146 | sigma = Sigma, 147 | log = TRUE) + 148 | sum(g^2) / 2 - 149 | sum((log(a) - mu)^2 / (sigma)^2 / 2) - 150 | sum(log(sigma)) + 151 | sum(log(1 - pi0)) 152 | } else if(!any(ind_nonzero)) { 153 | log_d <- 154 | log_ploga 155 | } else { 156 | if(mean_cond_choice == "half") 157 | mean_zero <- g[!ind_nonzero] 158 | else 159 | mean_zero <- mean_cond 160 | if(Sigma_cond_choice == "cond") 161 | Sigma_nonzero <- solve(Omega[ind_nonzero, ind_nonzero, drop = FALSE]) 162 | else if (Sigma_cond_choice == "full") 163 | Sigma_nonzero <- Sigma[ind_nonzero, ind_nonzero, drop = FALSE] 164 | else 165 | Sigma_nonzero <- solve(Omega[ind_nonzero, ind_nonzero, drop = FALSE]) + 166 | Sigma[ind_nonzero, !ind_nonzero, drop = FALSE] %*% 167 | solve(Sigma[!ind_nonzero, !ind_nonzero, drop = FALSE]) %*% 168 | Sigma_cond %*% 169 | solve(Sigma[!ind_nonzero, !ind_nonzero, drop = FALSE]) %*% 170 | Sigma[!ind_nonzero, ind_nonzero, drop = FALSE] 171 | log_d <- 172 | log_ploga + 173 | mvtnorm::dmvnorm( 174 | x = g[ind_nonzero], 175 | mean = (Sigma[ind_nonzero, !ind_nonzero, drop = FALSE] %*% 176 | solve(Sigma[!ind_nonzero, !ind_nonzero, drop = FALSE], 177 | mean_zero))[, 1], 178 | sigma = Sigma_nonzero, 179 | log = TRUE) + 180 | sum(g[ind_nonzero]^2) / 2 - 181 | sum((log(a[ind_nonzero]) - mu[ind_nonzero])^2 / 182 | (sigma[ind_nonzero])^2 / 2) - 183 | sum(log(sigma[ind_nonzero])) + 184 | sum(log(1 - pi0[ind_nonzero])) 185 | } 186 | 187 | if(log.p) 188 | return(log_d) 189 | else 190 | 
return(exp(log_d)) 191 | } 192 | 193 | dloga <- function(a, 194 | pi0, mu, sigma, Omega, Sigma, 195 | log.p = TRUE) { 196 | 197 | u <- a_to_u(a, 198 | pi0 = pi0, mu = mu, sigma = sigma) 199 | g <- qnorm(u) 200 | 201 | ind_nonzero <- a > 0 202 | if(any(abs(g) == Inf)) { 203 | log_d <- -Inf 204 | } else if(all(ind_nonzero)) { 205 | log_d <- 206 | mvtnorm::dmvnorm(x = g, 207 | mean = rep(0, length = length(pi0)), 208 | sigma = Sigma, 209 | log = TRUE) + 210 | sum(g^2) / 2 - 211 | sum((log(a) - mu)^2 / 212 | sigma^2 / 2) - 213 | sum(log(sigma)) + 214 | sum(log(1 - pi0)) 215 | } else if(!any(ind_nonzero)) { 216 | log_d <- 217 | pmvnorm2( 218 | lower = -Inf, 219 | upper = g, 220 | mean = rep(0, length(g)), 221 | sigma = Sigma, 222 | log.p = TRUE) 223 | } else { 224 | log_d <- 225 | pmvnorm2( 226 | lower = -Inf, 227 | upper = g[!ind_nonzero], 228 | mean = (-solve(Omega[!ind_nonzero, !ind_nonzero, drop = FALSE], 229 | Omega[!ind_nonzero, ind_nonzero, drop = FALSE]) %*% 230 | g[ind_nonzero])[, 1], 231 | sigma = solve(Omega[!ind_nonzero, !ind_nonzero, drop = FALSE]), 232 | log.p = TRUE) + 233 | mvtnorm::dmvnorm(x = g[ind_nonzero], 234 | mean = rep(0, length = sum(ind_nonzero)), 235 | sigma = Sigma[ind_nonzero, ind_nonzero, drop = FALSE], 236 | log = TRUE) + 237 | sum(g[ind_nonzero]^2) / 2 - 238 | sum((log(a[ind_nonzero]) - mu[ind_nonzero])^2 / 239 | (sigma[ind_nonzero])^2 / 2) - 240 | sum(log(sigma[ind_nonzero])) + 241 | sum(log(1 - pi0[ind_nonzero])) 242 | } 243 | 244 | if(log.p) 245 | return(log_d) 246 | else 247 | return(exp(log_d)) 248 | } 249 | 250 | pmvnorm2 <- function(lower = -Inf, upper = Inf, 251 | mean, sigma, log.p = FALSE) { 252 | if(length(lower) == 1) 253 | lower <- rep(lower, times = length(mean)) 254 | if(length(upper) == 1) 255 | upper <- rep(upper, times = length(mean)) 256 | 257 | g <- igraph::graph_from_adjacency_matrix( 258 | adjmatrix = (abs(sigma) > 0) * 1, 259 | mode = "undirected", 260 | diag = FALSE 261 | ) 262 | comp <- igraph::components(g) 263 | 
vp <- 264 | vapply(seq_len(comp$no), 265 | function(i_comp) { 266 | i_ind <- comp$membership == i_comp 267 | log(mvtnorm::pmvnorm(lower = lower[i_ind], 268 | upper = upper[i_ind], 269 | mean = mean[i_ind], 270 | sigma = sigma[i_ind, i_ind, drop = FALSE])) 271 | }, 272 | 0.0) 273 | logp <- sum(vp) 274 | if(log.p) 275 | return(logp) 276 | else 277 | return(exp(logp)) 278 | } 279 | 280 | dloga_old <- function(a, 281 | pi0, mu, sigma, Omega, 282 | log.p = TRUE) { 283 | u <- a_to_u_old(a, 284 | pi0 = pi0, mu = mu, sigma = sigma) 285 | g <- qnorm(u) 286 | 287 | if(any(abs(g) == Inf)) { 288 | log_d <- -Inf 289 | } else { 290 | log_d <- du(g, Omega) - 291 | sum((log(a[a > 0]) - mu[a > 0])^2 / (sigma[a > 0])^2 / 2) - 292 | log(2 * pi) / 2 * sum(a > 0) - sum(log(sigma[a > 0])) 293 | } 294 | 295 | if(log.p) 296 | return(log_d) 297 | else 298 | return(exp(log_d)) 299 | } 300 | 301 | du <- function(g, Omega, log.p = TRUE) { 302 | # Without normalizing constant (2pi)^(-2/p)! 303 | if(any(g == -Inf | g == Inf)) return(-Inf) 304 | log_d <- log_dmvnorm(S = g %*% t(g), Omega = Omega) + sum(g^2)/2 + 305 | log(det(Omega)) / 2 306 | 307 | if(log.p) 308 | return(log_d) 309 | else 310 | return(exp(log_d)) 311 | } 312 | 313 | log_dmvnorm <- function(S, Omega) { 314 | - sum(Omega * S) / 2 315 | } 316 | 317 | integrand_dx <- function(log_asum, x, 318 | pi0, mu, sigma, 319 | Omega, Sigma) { 320 | dloga(a = a(x, exp(log_asum)), 321 | pi0 = pi0, mu = mu, sigma = sigma, 322 | Omega = Omega, Sigma = Sigma, 323 | log.p = FALSE) 324 | } 325 | 326 | vintegrand_dx <- Vectorize2(integrand_dx, vectorize.args = "log_asum") 327 | 328 | ea <- function(x, 329 | pi0, mu, sigma, Omega, Sigma, 330 | control = list(), log.p = FALSE) { 331 | control <- do.call(control_integrate, control) 332 | limits <- get_intLimits( 333 | x = x, 334 | pi0 = pi0, mu = mu, sigma = sigma, 335 | Omega = Omega, Sigma = Sigma, 336 | maxit = control$maxit_limits) 337 | 338 | fit_integrate <- 339 | integrate2(vintegrand_ea, 340 | lower = 
limits[1], upper = limits[2], 341 | rel_tol = control$rel_tol, abs_tol = control$abs_tol, 342 | max_eval = control$max_eval, 343 | precBits = control$precBits, 344 | x = x, pi0 = pi0, mu = mu, sigma = sigma, 345 | Omega = Omega, Sigma = Sigma) 346 | 347 | # jacobian 348 | fit_integrate$integral <- as.double(log(fit_integrate$integral) - sum(log(x[x > 0]))) 349 | fit_integrate$error <- as.double(exp(log(fit_integrate$error) - sum(log(x[x > 0])))) 350 | 351 | if(log.p) { 352 | return(fit_integrate$integral) 353 | } 354 | else { 355 | fit_integrate$integral <- exp(fit_integrate$integral) 356 | if(control$only_value) 357 | return(fit_integrate$integral) 358 | return(fit_integrate) 359 | } 360 | } 361 | 362 | integrand_ea <- function(log_asum, x, 363 | pi0, mu, sigma, 364 | Omega, Sigma) { 365 | exp(dloga(a = a(x, exp(log_asum)), 366 | pi0 = pi0, mu = mu, sigma = sigma, 367 | Omega = Omega, Sigma = Sigma, 368 | log.p = TRUE) + 369 | log_asum) 370 | } 371 | 372 | vintegrand_ea <- Vectorize2(integrand_ea, 373 | vectorize.args = "log_asum") 374 | 375 | eloga <- function(x, 376 | pi0, mu, sigma, Omega, Sigma, 377 | control = list(), log.p = FALSE) { 378 | control <- do.call(control_integrate, control) 379 | limits <- get_intLimits( 380 | x = x, 381 | pi0 = pi0, mu = mu, sigma = sigma, 382 | Omega = Omega, Sigma = Sigma, 383 | maxit = control$maxit_limits) 384 | 385 | fit_integrate <- 386 | integrate2(vintegrand_eloga, 387 | lower = limits[1], upper = limits[2], 388 | rel_tol = control$rel_tol, abs_tol = control$abs_tol, 389 | max_eval = control$max_eval, 390 | precBits = control$precBits, 391 | x = x, pi0 = pi0, mu = mu, sigma = sigma, 392 | Omega = Omega, Sigma = Sigma) 393 | 394 | # jacobian 395 | fit_integrate$integral <- as.double(log(fit_integrate$integral) - sum(log(x[x > 0]))) 396 | fit_integrate$error <- as.double(exp(log(fit_integrate$error) - sum(log(x[x > 0])))) 397 | 398 | if(log.p) { 399 | return(fit_integrate$integral) 400 | } 401 | else { 402 | fit_integrate$integral <- 
exp(fit_integrate$integral) 403 | if(control$only_value) 404 | return(fit_integrate$integral) 405 | return(fit_integrate) 406 | } 407 | } 408 | 409 | integrand_eloga <- function(log_asum, x, 410 | pi0, mu, sigma, 411 | Omega, Sigma) { 412 | dloga(a = a(x, exp(log_asum)), 413 | pi0 = pi0, mu = mu, sigma = sigma, 414 | Omega = Omega, Sigma = Sigma, 415 | log.p = FALSE) * log_asum 416 | } 417 | 418 | vintegrand_eloga <- Vectorize2(integrand_eloga, 419 | vectorize.args = "log_asum") 420 | 421 | eloga2 <- function(x, 422 | pi0, mu, sigma, Omega, Sigma, 423 | control = list(), log.p = FALSE) { 424 | control <- do.call(control_integrate, control) 425 | limits <- get_intLimits( 426 | x = x, 427 | pi0 = pi0, mu = mu, sigma = sigma, 428 | Omega = Omega, Sigma = Sigma, 429 | maxit = control$maxit_limits) 430 | 431 | fit_integrate <- 432 | integrate2(vintegrand_eloga2, 433 | lower = limits[1], upper = limits[2], 434 | rel_tol = control$rel_tol, abs_tol = control$abs_tol, 435 | max_eval = control$max_eval, 436 | precBits = control$precBits, 437 | x = x, pi0 = pi0, mu = mu, sigma = sigma, 438 | Omega = Omega, Sigma = Sigma) 439 | 440 | # jacobian 441 | fit_integrate$integral <- as.double(log(fit_integrate$integral) - sum(log(x[x > 0]))) 442 | fit_integrate$error <- as.double(exp(log(fit_integrate$error) - sum(log(x[x > 0])))) 443 | 444 | if(log.p) { 445 | return(fit_integrate$integral) 446 | } 447 | else { 448 | fit_integrate$integral <- exp(fit_integrate$integral) 449 | if(control$only_value) 450 | return(fit_integrate$integral) 451 | return(fit_integrate) 452 | } 453 | } 454 | 455 | integrand_eloga2 <- function(log_asum, x, 456 | pi0, mu, sigma, 457 | Omega, Sigma) { 458 | dloga(a = a(x, exp(log_asum)), 459 | pi0 = pi0, mu = mu, sigma = sigma, 460 | Omega = Omega, Sigma = Sigma, 461 | log.p = FALSE) * log_asum^2 462 | } 463 | 464 | vintegrand_eloga2 <- Vectorize2(integrand_eloga2, 465 | vectorize.args = "log_asum") 466 | 467 | get_es <- function(x, pi0, mu, sigma, Omega, Sigma, 468 | 
control) { 469 | time_start <- Sys.time() 470 | control <- do.call(control_integrate, control) 471 | 472 | limits <- get_intLimits( 473 | x = x, 474 | pi0 = pi0, mu = mu, sigma = sigma, 475 | Omega = Omega, Sigma = Sigma, 476 | maxit = control$maxit_limits) 477 | 478 | neval <- 2 479 | knots_spline <- Rmpfr::mpfr(c(limits[1], limits[2]), 480 | precBits = control$precBits) 481 | vals_spline <- 482 | Rmpfr::mpfr( 483 | vintegrand_dx(as.double(knots_spline), 484 | pi0 = pi0, x = x, mu = mu, sigma = sigma, 485 | Omega = Omega, Sigma = Sigma), 486 | precBits = control$precBits) 487 | errors_spline <- Inf 488 | 489 | # find knots using dx 490 | while(TRUE) { 491 | i_max_error <- which(errors_spline == max(errors_spline))[1] 492 | knots_spline <- c(knots_spline[seq(1, i_max_error)], 493 | Rmpfr::mean(knots_spline[c(i_max_error, i_max_error + 1)]), 494 | knots_spline[seq(i_max_error + 1, neval)]) 495 | vals_spline <- c(vals_spline[seq(1, i_max_error)], 496 | Rmpfr::mpfr(vintegrand_dx(as.double(knots_spline[i_max_error + 1]), 497 | pi0 = pi0, x = x, mu = mu, sigma = sigma, 498 | Omega = Omega, Sigma = Sigma), 499 | precBits = control$precBits), 500 | vals_spline[seq(i_max_error + 1, neval)]) 501 | 502 | neval <- neval + 1 503 | knots_diff <- knots_spline[-1] - knots_spline[-neval] 504 | # linear spline for estimating integration 505 | coefs_spline <- Rmpfr::mpfrArray(NA, precBits = control$precBits, 506 | dim = c(2, neval - 1)) 507 | coefs_spline[2, ] <- 508 | (vals_spline[-1] - vals_spline[-neval]) / 509 | knots_diff 510 | coefs_spline[1, ] <- 511 | vals_spline[-neval] - knots_spline[-neval] * coefs_spline[2, ] 512 | integral_dx <- sum(coefs_spline[1, ] * knots_spline[-1] + 513 | coefs_spline[2, ] / 2 * knots_spline[-1]^2 - 514 | coefs_spline[1, ] * knots_spline[-neval] - 515 | coefs_spline[2, ] / 2 * knots_spline[-neval]^2) 516 | # error estimation 517 | errors_spline <- estimate_errors( 518 | knots_diff, 519 | c(Rmpfr::mpfr(0, precBits = control$precBits), 520 | 
coefs_spline[2, ], 521 | Rmpfr::mpfr(0, precBits = control$precBits)), 522 | precBits = control$precBits) 523 | error_dx <- sum(errors_spline) 524 | 525 | if(neval >= control$max_eval) 526 | break 527 | if(integral_dx < 0) 528 | stop("Negative integration values; something went wrong!") 529 | if(integral_dx > 0) 530 | if(error_dx / abs(integral_dx) < control$rel_tol | 531 | error_dx < control$abs_tol) 532 | break 533 | } 534 | 535 | # modify eloga, eloga2 536 | integral_eloga <- sum(coefs_spline[1, ] / 2 * knots_spline[-1]^2 + 537 | coefs_spline[2, ] / 3 * knots_spline[-1]^3 - 538 | coefs_spline[1, ] / 2 * knots_spline[-neval]^2 - 539 | coefs_spline[2, ] / 3 * knots_spline[-neval]^3) 540 | error_eloga <- sum(abs(knots_diff * 541 | ((vals_spline * knots_spline)[-1] - 542 | (vals_spline * knots_spline)[-neval]))) 543 | integral_eloga2 <- sum(coefs_spline[1, ] / 3 * knots_spline[-1]^3 + 544 | coefs_spline[2, ] / 4 * knots_spline[-1]^4 - 545 | coefs_spline[1, ] / 3 * knots_spline[-neval]^3 - 546 | coefs_spline[2, ] / 4 * knots_spline[-neval]^4) 547 | error_eloga2 <- sum(abs(knots_diff * 548 | ((vals_spline * knots_spline^2)[-1] - 549 | (vals_spline * knots_spline^2)[-neval]))) 550 | 551 | # refit for ea 552 | coefs_spline <- Rmpfr::mpfrArray(NA, precBits = control$precBits, 553 | dim = c(2, neval - 1)) 554 | coefs_spline[2, ] <- 555 | ((vals_spline * exp(knots_spline))[-1] - 556 | (vals_spline * exp(knots_spline))[-neval]) / 557 | knots_diff 558 | coefs_spline[1, ] <- 559 | (vals_spline * exp(knots_spline))[-neval] - 560 | knots_spline[-neval] * coefs_spline[2, ] 561 | 562 | integral_ea <- sum(coefs_spline[1, ] * knots_spline[-1] + 563 | coefs_spline[2, ] / 2 * knots_spline[-1]^2 - 564 | coefs_spline[1, ] * knots_spline[-neval] - 565 | coefs_spline[2, ] / 2 * knots_spline[-neval]^2) 566 | error_ea <- sum(abs(knots_diff * 567 | ((vals_spline * exp(knots_spline))[-1] - 568 | (vals_spline * exp(knots_spline))[-neval]))) 569 | 570 | return(c("ea" = as.double(integral_ea / 
integral_dx), 571 | "dx" = as.double(integral_dx), 572 | "logLik" = as.double(log(integral_dx) - sum(log(x[x > 0]))), 573 | "eloga" = as.double(integral_eloga / integral_dx), 574 | "eloga2" = as.double(integral_eloga2 / integral_dx), 575 | "error_ea" = as.double(error_ea), 576 | "error_dx" = as.double(error_dx), 577 | "error_eloga" = as.double(error_eloga), 578 | "error_eloga2" = as.double(error_eloga2), 579 | "time" = as.numeric(Sys.time() - time_start, units = "secs"))) 580 | } -------------------------------------------------------------------------------- /R/sim_helpers.R: -------------------------------------------------------------------------------- 1 | filter_params <- function(params, max_pi0 = 0.95) { 2 | ind <- params$pi0 < max_pi0 3 | return(list(pi0 = params$pi0[ind], 4 | mu = params$mu[ind], 5 | sigma = params$sigma[ind], 6 | Omega = threshold_matrix(solve(params$Sigma[ind, ind])), 7 | Sigma = params$Sigma[ind, ind])) 8 | } 9 | 10 | params_a_to_x <- function(params_a, n = 100000) { 11 | samples_a <- rcopulasso(n = round(n * 1.5), # FIXME 12 | pi0 = params_a$pi0, 13 | mu = params_a$mu, 14 | sigma = params_a$sigma, 15 | Omega = params_a$Omega) 16 | samples_a <- samples_a[!apply(samples_a == 0, 1, all), ][1:n, ] 17 | samples_x <- t(apply(samples_a, 1, function(x) x / sum(x))) 18 | mat_marginals <- get_marginals(samples_x) 19 | return(list(pi0 = mat_marginals[, "pi0"], 20 | mu = mat_marginals[, "mu"], 21 | sigma = mat_marginals[, "sigma"], 22 | Corr = cor(samples_x, method = "spear"))) 23 | } -------------------------------------------------------------------------------- /R/spike.R: -------------------------------------------------------------------------------- 1 | spike_a_metadata <- function(null, 2 | feature_param, 3 | metadata, 4 | spike_df) { 5 | if(ncol(null) != nrow(metadata)) 6 | stop("Sample size of null abundance matrix and metadata do not agree!") 7 | if(nrow(spike_df) != 8 | nrow(dplyr::distinct(spike_df, 9 | metadata_datum, 10 | 
feature_spiked, 11 | associated_property))) 12 | stop("feature-metadata spiking specification data frame has duplicate ", 13 | "metadata_datum, feature_spiked, and associated_property tuples!") 14 | spiked <- null 15 | for(feature_i in unique(spike_df$feature_spiked)) { 16 | spike_df_i <- subset(spike_df, feature_spiked == feature_i) 17 | spike_df_i_abundance <- subset(spike_df_i, 18 | associated_property == "abundance") 19 | spike_df_i_prevalence <- subset(spike_df_i, 20 | associated_property == "prevalence") 21 | spiked[feature_i, ] <- 22 | spike_oneA_metadata(param = feature_param[feature_i, ], 23 | metadata = metadata, 24 | col_abundance = spike_df_i_abundance$metadata_datum, 25 | effect_abundance = spike_df_i_abundance$effect_size, 26 | col_prevalence = spike_df_i_prevalence$metadata_datum, 27 | effect_prevalence = spike_df_i_prevalence$effect_size) 28 | } 29 | 30 | dimnames(spiked) <- dimnames(null) 31 | return(spiked) 32 | } 33 | 34 | spike_oneA_metadata <- function(param, 35 | metadata, 36 | col_abundance = c(), 37 | effect_abundance = c(), 38 | col_prevalence = c(), 39 | effect_prevalence = c()) { 40 | effect_abundance_all <- rep(0, ncol(metadata)) 41 | effect_prevalence_all <- rep(0, ncol(metadata)) 42 | 43 | effect_abundance_all[col_abundance] <- effect_abundance 44 | effect_prevalence_all[col_prevalence] <- effect_prevalence 45 | 46 | pi0 <- expit(logit(param["pi0"]) - (metadata %*% effect_prevalence_all)[, 1]) 47 | mu <- param["mu"] + (metadata %*% effect_abundance_all)[, 1] 48 | 49 | a <- rZILogN_one(n = nrow(metadata), 50 | pi0 = pi0, 51 | mu = mu, 52 | sigma = param["sigma"]) 53 | return(a) 54 | } 55 | -------------------------------------------------------------------------------- /R/sysdata.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biobakery/SparseDOSSA2/26a998a6e3a5f04d6a86cce14d6d3229ca82633e/R/sysdata.rda 
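The zero-inflated log-normal spike-in implemented in `spike.R` above can be sketched as follows. This is an illustrative Python re-implementation, not the package's code: the function name `spike_one_feature`, the two-column metadata layout, and the effect sizes are hypothetical. As in `spike_oneA_metadata`, prevalence effects enter on the logit scale with a minus sign (so a positive effect lowers the zero-probability `pi0`, i.e. raises prevalence), and abundance effects shift the log-normal mean `mu`.

```python
import numpy as np

rng = np.random.default_rng(0)

def expit(x):
    # inverse logit
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def spike_one_feature(pi0, mu, sigma, metadata,
                      effect_prevalence, effect_abundance):
    """Draw one feature's absolute abundances across samples from a
    zero-inflated log-normal with metadata-dependent parameters
    (hypothetical sketch of SparseDOSSA2's spike_oneA_metadata)."""
    n = metadata.shape[0]
    # prevalence effects: logit-scale shift of the zero-probability
    pi0_i = expit(logit(pi0) - metadata @ effect_prevalence)
    # abundance effects: linear shift of the log-normal mean
    mu_i = mu + metadata @ effect_abundance
    present = rng.random(n) >= pi0_i                  # Bernoulli presence
    values = rng.lognormal(mean=mu_i, sigma=sigma)    # non-zero abundances
    return np.where(present, values, 0.0)

# hypothetical metadata: one binary and one continuous column
meta = np.column_stack([rng.integers(0, 2, 500),
                        rng.normal(size=500)]).astype(float)
a = spike_one_feature(pi0=0.5, mu=0.0, sigma=1.0, metadata=meta,
                      effect_prevalence=np.array([1.0, 0.0]),
                      effect_abundance=np.array([0.0, 0.5]))
```

With these settings, samples with the binary covariate equal to 1 should show higher prevalence, and samples with larger values of the continuous covariate higher non-zero abundance, mirroring the separate prevalence/abundance rows a user would supply in `spike_df`.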
-------------------------------------------------------------------------------- /data/Stool_subset.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/biobakery/SparseDOSSA2/26a998a6e3a5f04d6a86cce14d6d3229ca82633e/data/Stool_subset.rda -------------------------------------------------------------------------------- /man/SparseDOSSA2.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/SparseDOSSA2.R 3 | \name{SparseDOSSA2} 4 | \alias{SparseDOSSA2} 5 | \title{Simulate synthetic microbial abundance observations with SparseDOSSA2} 6 | \usage{ 7 | SparseDOSSA2( 8 | template = "Stool", 9 | n_sample = 100, 10 | new_features = TRUE, 11 | n_feature = 100, 12 | spike_metadata = "none", 13 | metadata_effect_size = 1, 14 | perc_feature_spiked_metadata = 0.05, 15 | metadata_matrix = NULL, 16 | median_read_depth = 50000, 17 | verbose = TRUE 18 | ) 19 | } 20 | \arguments{ 21 | \item{template}{can be 1) a character string (\code{"Stool"}, \code{"Vaginal"}, 22 | or \code{"IBD"}) indicating one of the pre-trained templates in SparseDOSSA2, 23 | or 2) user-provided, fitted results. In the latter case this should be an output 24 | from \code{fit_SparseDOSSA2} or \code{fitCV_SparseDOSSA2}.} 25 | 26 | \item{n_sample}{number of samples to simulate} 27 | 28 | \item{new_features}{\code{TRUE}/\code{FALSE} indicator for whether or not new 29 | features should be simulated. If \code{FALSE} then the same set of features 30 | in \code{template} will be simulated.} 31 | 32 | \item{n_feature}{number of features to simulate. Only relevant when 33 | \code{new_features} is \code{TRUE}} 34 | 35 | \item{spike_metadata}{for metadata spike-in configurations. 
Must be one of two things: 36 | 37 | \itemize{ 38 | \item a character string of \code{"none"}, \code{"both"}, \code{"abundance"}, 39 | or \code{"prevalence"}, indicating whether or not 40 | association with metadata will be spiked in. For the spiked-in case, it 41 | indicates if features' abundance/prevalence/both characteristics will be associated 42 | with metadata (also see explanations for \code{metadata_effect_size} and 43 | \code{perc_feature_spiked_metadata}) 44 | \item a data.frame for detailed spike-in configurations. This is the more 45 | advanced approach, where detailed specifications for metadata-microbial 46 | feature associations are provided. Note: if \code{spike_metadata} is provided 47 | as a data.frame, then \code{metadata_matrix} must be provided as well 48 | (cannot be generated automatically). In this case, \code{spike_metadata} 49 | must have exactly four columns: \code{metadata_datum}, \code{feature_spiked}, 50 | \code{associated_property}, and \code{effect_size}. Each row of the data.frame 51 | configures one specific metadata-microbe association. Specifically: 52 | \itemize{ 53 | \item \code{metadata_datum} (integer) indicates the column number for the metadata 54 | variable to be associated with the microbe 55 | \item \code{feature_spiked} (character) indicates the microbe name to be associated 56 | with the metadata variable 57 | \item \code{associated_property} (character, either \code{"abundance"} or 58 | \code{"prevalence"}), indicating the property of the microbe to be modified. 59 | If you want the microbe to be associated with the metadata variable 60 | in both properties, include two rows in \code{spike_metadata}, one for 61 | abundance and one for prevalence 62 | \item \code{effect_size} (numeric) indicating the strength of the association. 
63 | This corresponds to log fold change in non-zero abundance for 64 | \code{"abundance"} spike-in, and log odds ratio for \code{"prevalence"} 65 | spike-in 66 | } 67 | }} 68 | 69 | \item{metadata_effect_size}{(for when \code{spike_metadata} is \code{"abundance"}, 70 | \code{"prevalence"}, or \code{"both"}) effect size of the spiked-in associations. This is 71 | non-zero log fold change for abundance spike-in, and log odds ratio for prevalence spike-in} 72 | 73 | \item{perc_feature_spiked_metadata}{(for when \code{spike_metadata} is \code{"abundance"}, 74 | \code{"prevalence"}, or \code{"both"}) 75 | percentage of features to be associated with metadata} 76 | 77 | \item{metadata_matrix}{the user can provide a metadata matrix to use for spiking-in 78 | of feature abundances. If using the default (\code{NULL}), two variables will be generated: 79 | one continuous, and one binary with balanced cases and controls. Note: if 80 | \code{spike_metadata} is provided as a data.frame, then the user must provide 81 | \code{metadata_matrix} too} 82 | 83 | \item{median_read_depth}{targeted median per-sample read depth} 84 | 85 | \item{verbose}{whether detailed information should be printed} 86 | } 87 | \value{ 88 | a list with the following components: 89 | \describe{ 90 | \item{simulated_data}{ 91 | feature-by-sample matrix of simulated microbial count observations 92 | } 93 | \item{simulated_matrices}{ 94 | list of all simulated data matrices, including those of null (i.e. not spiked-in) absolute 95 | abundances, spiked-in absolute abundances, and normalized relative abundances 96 | } 97 | \item{params}{ 98 | parameters used for simulation. These are provided in \code{template}. 99 | } 100 | \item{spike_metadata}{ 101 | list of variables provided or generated for metadata spike-in. 
This includes 102 | \code{spike_metadata} for the original \code{spike_metadata} parameter provided 103 | by the user, \code{metadata_matrix} for the 104 | metadata (either provided by the user or internally generated), and 105 | \code{feature_metadata_spike_df} 106 | for detailed specification of which metadata variables were used to spike in associations 107 | with which features, in what properties at which effect sizes. This is the 108 | same as \code{spike_metadata} if the latter was provided as a data.frame. 109 | } 110 | } 111 | } 112 | \description{ 113 | \code{SparseDOSSA2} generates synthetic microbial abundance observations 114 | from either a pre-trained template, or user-provided fitted results from 115 | \code{fit_SparseDOSSA2} or \code{fitCV_SparseDOSSA2}. Additional options 116 | are available for simulating associations between microbial features 117 | and metadata variables. 118 | } 119 | \examples{ 120 | ## Using one of the pre-trained SparseDOSSA2 templates: 121 | sim <- SparseDOSSA2(template = "Stool", n_sample = 200, new_features = FALSE) 122 | ## Using a user-provided trained SparseDOSSA2 model: 123 | data("Stool_subset") 124 | fitted <- fit_SparseDOSSA2(data = Stool_subset) 125 | sim <- SparseDOSSA2(template = fitted, n_sample = 200, new_features = FALSE) 126 | } 127 | \author{ 128 | Siyuan Ma, \email{syma.research@gmail.com} 129 | } 130 | -------------------------------------------------------------------------------- /man/Stool_subset.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/data.R 3 | \docType{data} 4 | \name{Stool_subset} 5 | \alias{Stool_subset} 6 | \title{A subset of the HMP1-II stool samples} 7 | \format{ 8 | A matrix with 5 rows (species) and 5 columns (samples) 9 | } 10 | \source{ 11 | \url{https://www.hmpdacc.org/hmp/} 12 | } 13 | \usage{ 14 | Stool_subset 15 | } 16 | \description{ 17 | A dataset containing 
species-level microbial counts of a subset of 18 | the HMP1-II stool samples. This includes the top 5 most abundant 19 | species and top 5 most deeply sequenced samples. 20 | } 21 | \keyword{datasets} 22 | -------------------------------------------------------------------------------- /man/control_fit.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/SparseDOSSA2_fit.R 3 | \name{control_fit} 4 | \alias{control_fit} 5 | \title{Control options for fit_SparseDOSSA2 and fitCV_SparseDOSSA2} 6 | \usage{ 7 | control_fit( 8 | maxit = 100, 9 | rel_tol = 0.01, 10 | abs_tol = 0.01, 11 | control_numint = list(), 12 | verbose = FALSE, 13 | debug_dir = NULL 14 | ) 15 | } 16 | \arguments{ 17 | \item{maxit}{maximum number of EM iterations} 18 | 19 | \item{rel_tol}{relative change threshold in the log likelihood 20 | for algorithm convergence} 21 | 22 | \item{abs_tol}{absolute change threshold in the log likelihood 23 | for algorithm convergence} 24 | 25 | \item{control_numint}{a named list of control parameters for the 26 | numerical integrations during the E step. See help page for 27 | \code{control_integrate}} 28 | 29 | \item{verbose}{whether or not detailed running messages should be provided} 30 | 31 | \item{debug_dir}{directory for intermediate output, such as the 32 | EM expectations and parameter values during each step of the 33 | EM algorithm. 
Defaults to \code{NULL}, in which case no such output 34 | will be generated} 35 | } 36 | \value{ 37 | a list with the same names as the arguments 38 | } 39 | \description{ 40 | Control options for fit_SparseDOSSA2 and fitCV_SparseDOSSA2 41 | } 42 | -------------------------------------------------------------------------------- /man/control_integrate.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/likelihood_expectation.R 3 | \name{control_integrate} 4 | \alias{control_integrate} 5 | \title{control parameters for the 6 | numerical integrations during the E step of SparseDOSSA2's fitting} 7 | \usage{ 8 | control_integrate( 9 | rel_tol = 0.01, 10 | abs_tol = 0, 11 | max_eval = 50, 12 | maxit_limits = 10, 13 | precBits = 200, 14 | only_value = TRUE 15 | ) 16 | } 17 | \arguments{ 18 | \item{rel_tol}{relative change threshold in the integration values for the 19 | integration to converge} 20 | 21 | \item{abs_tol}{absolute change threshold in the integration values for the 22 | integration to converge} 23 | 24 | \item{max_eval}{maximum number of integration evaluations allowed} 25 | 26 | \item{maxit_limits}{maximum number of tries allowed to guess the integration's 27 | lower and upper limits} 28 | 29 | \item{precBits}{numeric precision used for the integration values} 30 | 31 | \item{only_value}{whether or not only the integration value should be returned} 32 | } 33 | \value{ 34 | a list with the same names as the arguments 35 | } 36 | \description{ 37 | control parameters for the 38 | numerical integrations during the E step of SparseDOSSA2's fitting 39 | } 40 | -------------------------------------------------------------------------------- /man/fitCV_SparseDOSSA2.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/SparseDOSSA2_fit.R 3 | \name{fitCV_SparseDOSSA2} 4 | 
\alias{fitCV_SparseDOSSA2} 5 | \title{Fit SparseDOSSA 2 model to a microbiome abundance dataset with cross validation} 6 | \usage{ 7 | fitCV_SparseDOSSA2( 8 | data, 9 | lambdas = 10^seq(-2, 0, length.out = 5), 10 | K = 5, 11 | control = list() 12 | ) 13 | } 14 | \arguments{ 15 | \item{data}{feature-by-sample matrix of abundances (proportions or 16 | counts).} 17 | 18 | \item{lambdas}{vector of positive penalization parameters for the sparsity of feature-feature 19 | correlations. The function fits SparseDOSSA 2 models to each of the lambda values, and uses 20 | cross validation likelihood to select the optimal one. If not provided, these will be chosen 21 | automatically.} 22 | \item{K}{number of cross validation folds.} 23 | \item{control}{a named list of additional control parameters. See help page for 24 | \code{control_fit}.} 25 | } 26 | \value{ 27 | a list, with the following components: 28 | \describe{ 29 | \item{EM_fit}{ 30 | list of fitted parameters from the EM algorithm, with additional cross validation likelihood. 31 | } 32 | \item{F_fit}{ 33 | fitted parameters for the joint distribution of per-feature prevalence, abundance, 34 | and variability parameters (for simulating new features) 35 | } 36 | \item{depth_fit}{fitted parameters for the read depth distribution. Only applicable 37 | to count data. 38 | } 39 | \item{l_filtering}{list of quality control filtering for samples and features. 40 | } 41 | } 42 | } 43 | \description{ 44 | \code{fitCV_SparseDOSSA2} randomly partitions the data into fitting and testing 45 | subsets. It fits the SparseDOSSA 2 model to the fitting sets and uses log likelihood 46 | of the fitted parameters in the testing sets as the criterion for selection of 47 | tuning parameter lambda. 
48 | } 49 | \examples{ 50 | data("Stool_subset") 51 | fitted <- fitCV_SparseDOSSA2(data = Stool_subset, 52 | lambdas = c(0.1, 1), 53 | K = 5) 54 | 55 | } 56 | \author{ 57 | Siyuan Ma, \email{syma.research@gmail.com} 58 | } 59 | -------------------------------------------------------------------------------- /man/fit_SparseDOSSA2.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/SparseDOSSA2_fit.R 3 | \name{fit_SparseDOSSA2} 4 | \alias{fit_SparseDOSSA2} 5 | \title{Fit SparseDOSSA 2 model to a microbiome abundance dataset} 6 | \usage{ 7 | fit_SparseDOSSA2(data, lambda = 1, control = list()) 8 | } 9 | \arguments{ 10 | \item{data}{feature-by-sample matrix of abundances (proportions or 11 | counts)} 12 | 13 | \item{lambda}{positive penalization parameter for the sparsity of feature-feature 14 | correlations. Defaults to the maximum value \code{1}, where features are assumed to be 15 | independent (no correlations, most sparse)} 16 | 17 | \item{control}{a named list of additional control parameters. See help page for 18 | \code{control_fit}} 19 | } 20 | \value{ 21 | a list, with the following components: 22 | \describe{ 23 | \item{EM_fit}{ 24 | list of fitted parameters from the EM algorithm. 25 | } 26 | \item{F_fit}{ 27 | fitted parameters for the joint distribution of per-feature prevalence, abundance, 28 | and variability parameters (for simulating new features) 29 | } 30 | \item{depth_fit}{fitted parameters for the read depth distribution. Only applicable 31 | to count data. 32 | } 33 | \item{l_filtering}{list of quality control filtering for samples and features. 34 | } 35 | } 36 | } 37 | \description{ 38 | \code{fit_SparseDOSSA2} fits the SparseDOSSA 2 model (zero-inflated log-normal 39 | marginals connected through a Gaussian copula) to microbial abundances. 
It takes 40 | as input a feature-by-sample microbial count or relative abundance table and 41 | a penalization tuning parameter \code{lambda} to control the sparsity of 42 | feature-feature correlations. It then adopts a penalized expectation-maximization 43 | algorithm to estimate the model parameters. 44 | } 45 | \examples{ 46 | data("Stool_subset") 47 | fitted <- fit_SparseDOSSA2(data = Stool_subset) 48 | } 49 | \author{ 50 | Siyuan Ma, \email{syma.research@gmail.com} 51 | } 52 | -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | library(SparseDOSSA2) 3 | 4 | test_check("SparseDOSSA2") 5 | -------------------------------------------------------------------------------- /tests/testthat/test-likelihood.R: -------------------------------------------------------------------------------- 1 | test_that("dloga works", { 2 | pi0 <- rep(0.5, 3) 3 | mu <- rep(0, 3) 4 | sigma <- rep(1, 3) 5 | Sigma <- diag(rep(1, 3)) 6 | Omega <- solve(Sigma) 7 | 8 | # certain dloga values should be the same as directly calculated from dnorm 9 | expect_equal(dnorm(0)^2 * 0.125, 10 | dloga(a = c(0, 1, 1), 11 | pi0 = pi0, 12 | mu = mu, 13 | sigma = sigma, 14 | Omega = Omega, 15 | Sigma = Sigma, 16 | log.p = FALSE)) 17 | 18 | int_func <- function(x) dloga(a = c(0, exp(x)), 19 | pi0 = pi0, 20 | mu = mu, 21 | sigma = sigma, 22 | Omega = Omega, 23 | Sigma = Sigma, 24 | log.p = FALSE) 25 | 26 | # dloga should integrate out to discrete components of the likelihood 27 | expect_lt(abs(0.125 - 28 | cubature::pcubature(int_func, lowerLimit = rep(-10, 2), upperLimit = rep(10, 2))$integral), 29 | 1e-5) 30 | 31 | # test for non-trivial correlation 32 | Sigma <- matrix(0.1, 3, 3) 33 | diag(Sigma) <- rep(1, 3) 34 | Omega <- solve(Sigma) 35 | 36 | int_func <- function(x) dloga(a = c(0, exp(x)), 37 | pi0 = pi0, 38 | mu = mu, 39 | sigma = 
sigma, 40 | Omega = Omega, 41 | Sigma = Sigma, 42 | log.p = FALSE) 43 | 44 | expect_lt( 45 | abs( 46 | cubature::pcubature(f = int_func, lowerLimit = rep(-10, 2), upperLimit = rep(10, 2))$integral - 47 | mvtnorm::pmvnorm(lower = c(-Inf, 0, 0), upper = c(0, Inf, Inf), 48 | mean = rep(0, 3), sigma = Sigma)), 49 | 1e-5) 50 | }) -------------------------------------------------------------------------------- /tests/testthat/test-spike_in.R: -------------------------------------------------------------------------------- 1 | test_that("spike in works", { 2 | # # no spike-in 3 | # sim <- SparseDOSSA2() 4 | # 5 | # # spike-in default 6 | # sim <- SparseDOSSA2(spike_metadata = "both") 7 | # plot(sim$spike_metadata$metadata_matrix[, 1], 8 | # sim$simulated_matrices$rel[60,]) 9 | # boxplot(sim$simulated_matrices$rel[19, ] ~ 10 | # sim$spike_metadata$metadata_matrix[, 2]) 11 | # 12 | # # spike-in matrix 13 | # # error 14 | # spike_metadata <- sim$spike_metadata$feature_metadata_spike_df[c(1, 20), ] 15 | # sim <- SparseDOSSA2(spike_metadata = spike_metadata) 16 | # 17 | # # error 18 | # metadata_matrix <- matrix(rnorm(100), ncol = 1) 19 | # sim <- SparseDOSSA2(spike_metadata = spike_metadata, 20 | # metadata_matrix = metadata_matrix) 21 | # 22 | # # runs 23 | # metadata_matrix <- matrix(rnorm(300), ncol = 3) 24 | # sim <- SparseDOSSA2(spike_metadata = spike_metadata, 25 | # metadata_matrix = metadata_matrix) 26 | }) -------------------------------------------------------------------------------- /vignettes/SparseDOSSA2.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Simulating realistic microbial observations with SparseDOSSA2" 3 | author: 4 | - name: "Siyuan Ma" 5 | affiliation: 6 | - Harvard T.H. 
Chan School of Public Health 7 | - Broad Institute 8 | email: syma.research@gmail.com 9 | package: SparseDOSSA2 10 | date: "12/01/2020" 11 | output: 12 | BiocStyle::html_document 13 | vignette: > 14 | %\VignetteIndexEntry{SparseDOSSA2} 15 | %\VignetteEngine{knitr::rmarkdown} 16 | %\VignetteEncoding{UTF-8} 17 | bibliography: references.bib 18 | --- 19 | 20 | 21 | ```{r setup, include=FALSE} 22 | knitr::opts_chunk$set(echo = TRUE) 23 | knitr::opts_chunk$set(cache = FALSE) 24 | ``` 25 | 26 | # Introduction 27 | 28 | `SparseDOSSA2` is an R package for fitting to and simulating realistic microbial 29 | abundance observations. It provides functionalities for: a) generation of realistic synthetic microbial observations, b) spiking-in of associations with metadata variables 30 | for e.g. benchmarking or power analysis purposes, and c) fitting the SparseDOSSA 2 31 | model to real-world microbial abundance observations that can be used for a). This vignette is intended to provide working examples for these functionalities. 32 | 33 | ```{r, message=FALSE} 34 | library(SparseDOSSA2) 35 | # tidyverse packages for utilities 36 | library(magrittr) 37 | library(dplyr) 38 | library(ggplot2) 39 | ``` 40 | 41 | # Installation 42 | 43 | SparseDOSSA2 is a Bioconductor package and can be installed via the following 44 | command. 45 | 46 | ```{r Installation, eval = FALSE} 47 | # if (!requireNamespace("BiocManager", quietly = TRUE)) 48 | # install.packages("BiocManager") 49 | # BiocManager::install("SparseDOSSA2") 50 | ``` 51 | 52 | # Simulating realistic microbial observations with `SparseDOSSA2` 53 | The most important functionality of `SparseDOSSA2` is the simulation of 54 | realistic synthetic microbial observations. To this end, `SparseDOSSA2` provides 55 | three pre-trained templates, `"Stool"`, `"Vaginal"`, and `"IBD"`, targeting 56 | continuous, discrete, and diseased population structures. 
57 | ```{r SparseDOSSA2_generate} 58 | Stool_simulation <- SparseDOSSA2(template = "Stool", 59 | n_sample = 100, 60 | n_feature = 100, 61 | verbose = TRUE) 62 | Vaginal_simulation <- SparseDOSSA2(template = "Vaginal", 63 | n_sample = 100, 64 | n_feature = 100, 65 | verbose = TRUE) 66 | ``` 67 | 68 | # Fitting to microbiome datasets with SparseDOSSA2 69 | 70 | SparseDOSSA2 provides two functions, `fit_SparseDOSSA2` and `fitCV_SparseDOSSA2`, 71 | to fit the SparseDOSSA2 model to microbial count or relative abundance observations. 72 | As input, these functions require a feature-by-sample 73 | table of microbial abundance observations. We provide with SparseDOSSA2 a minimal 74 | example of such a dataset: a five-by-five subset of the [HMP1-II](https://www.hmpdacc.org/hmp/) 75 | stool study. 76 | 77 | ```{r load data} 78 | data("Stool_subset", package = "SparseDOSSA2") 79 | # columns are samples. 80 | Stool_subset[1:2, 1, drop = FALSE] 81 | ``` 82 | 83 | ## Fitting SparseDOSSA2 model with `fit_SparseDOSSA2` 84 | 85 | `fit_SparseDOSSA2` fits the SparseDOSSA2 model to estimate the 86 | model parameters: per-feature prevalence, mean and standard deviation of 87 | non-zero abundances, and feature-feature correlations. 88 | It also estimates the joint distribution of these parameters 89 | and (if the input is counts) a read depth distribution. 90 | 91 | ```{r SparseDOSSA2_fit} 92 | fitted <- fit_SparseDOSSA2(data = Stool_subset, 93 | control = list(verbose = TRUE)) 94 | # fitted mean log non-zero abundance values of the first two features 95 | fitted$EM_fit$fit$mu[1:2] 96 | ``` 97 | 98 | ## Fitting SparseDOSSA2 model with `fitCV_SparseDOSSA2` 99 | 100 | The user can additionally achieve optimal model fitting via 101 | `fitCV_SparseDOSSA2`. They can either provide a vector of tuning parameter 102 | values (`lambdas`) to control sparsity in the estimation of the correlation 103 | matrix parameter, or a grid will be selected automatically. 
104 | `fitCV_SparseDOSSA2` uses cross validation to select an "optimal" model fit 105 | across these tuning parameters via average testing log-likelihood. This is a 106 | computationally intensive procedure, best suited for users who would like an 107 | accurate fit to the input dataset, for simulating new microbial 108 | observations on the same features as the input (i.e. not new features). 109 | 110 | ```{r SparseDOSSA2_fit_CV} 111 | # set.seed(1) 112 | # fitted_CV <- fitCV_SparseDOSSA2(data = Stool_subset, 113 | # lambdas = c(0.1, 1), 114 | # K = 2, 115 | # control = list(verbose = TRUE)) 116 | # the average log likelihood of different tuning parameters 117 | # apply(fitted_CV$EM_fit$logLik_CV, 2, mean) 118 | # The second lambda (1) had better performance in terms of log likelihood, 119 | # and will be selected as the default fit. 120 | ``` 121 | 122 | ## Parallelization controls with `future` 123 | 124 | `SparseDOSSA2` internally uses `r BiocStyle::CRANpkg("future")` to allow for 125 | parallel computation. The user can thus specify parallelization through `future`'s 126 | interface. See the [reference manual](https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html) 127 | for `future` for more details. This is 128 | particularly suited if fitting `SparseDOSSA2` in a high-performance computing 129 | environment. 130 | 131 | ```{r future} 132 | ## regular fitting 133 | # system.time(fitted_regular <- 134 | # fit_SparseDOSSA2(data = Stool_subset, 135 | # control = list(verbose = FALSE))) 136 | ## parallel fitting with future: 137 | # future::plan(future::multisession()) 138 | # system.time(fitted_parallel <- 139 | # fit_SparseDOSSA2(data = Stool_subset, 140 | # control = list(verbose = FALSE))) 141 | 142 | ## For CV fitting, there are three components that can be parallelized, in order: 143 | ## different cross validation folds, different tuning parameter lambdas, 144 | ## and different samples. 
It is usually most efficient to parallelize at the 145 | ## sample level: 146 | # system.time(fitted_regular_CV <- 147 | # fitCV_SparseDOSSA2(data = Stool_subset, 148 | # lambdas = c(0.1, 1), 149 | # K = 2, 150 | # control = list(verbose = TRUE))) 151 | # future::plan(list(future::sequential, future::sequential, future::multisession)) 152 | # system.time(fitted_parallel_CV <- 153 | # fitCV_SparseDOSSA2(data = Stool_subset, 154 | # lambdas = c(0.1, 1), 155 | # K = 2, 156 | # control = list(verbose = TRUE))) 157 | 158 | ``` 159 | 160 | 161 | # Session info 162 | ```{r sessioninfo} 163 | sessionInfo() 164 | ``` 165 | # References --------------------------------------------------------------------------------
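As a closing illustration of the `spike_metadata` data.frame interface documented in `man/SparseDOSSA2.Rd` above, the following sketch configures one abundance and one prevalence association. It is a minimal, hedged example: the feature names (`"Feature1"`, `"Feature2"`) and the metadata layout are placeholders, not package data, and in practice `feature_spiked` must name features present in the chosen template.

```r
library(SparseDOSSA2)

# Hypothetical metadata for 100 samples: one binary and one continuous variable.
# When spike_metadata is a data.frame, metadata_matrix must be supplied by the user.
metadata_matrix <- cbind(rbinom(100, 1, 0.5),  # column 1: binary group
                         rnorm(100))           # column 2: continuous covariate

# One row per metadata-microbe association, using the four required columns:
# metadata column 1 shifts the non-zero abundance of "Feature1" (log fold change 2);
# metadata column 2 shifts the prevalence of "Feature2" (log odds ratio 1).
spike_df <- data.frame(metadata_datum      = c(1L, 2L),
                       feature_spiked      = c("Feature1", "Feature2"),
                       associated_property = c("abundance", "prevalence"),
                       effect_size         = c(2, 1))

sim <- SparseDOSSA2(template = "Stool",
                    n_sample = 100,
                    new_features = FALSE,
                    spike_metadata = spike_df,
                    metadata_matrix = metadata_matrix)
```

To associate one feature with a metadata variable in both properties, include two rows for that feature, one with `associated_property = "abundance"` and one with `"prevalence"`, as the man page describes.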