├── .Rbuildignore ├── .gitignore ├── .travis.yml ├── DESCRIPTION ├── NAMESPACE ├── NEWS.md ├── R ├── bootstrapify.R ├── collect.R ├── compat-dplyr.R ├── compat-purrr.R ├── compat-tidyr.R ├── samplify.R ├── util.R └── zzz.R ├── README.Rmd ├── README.md ├── _pkgdown.yml ├── codecov.yml ├── cran-comments.md ├── man ├── bootstrapify.Rd ├── collect.resampled_df.Rd ├── figures │ ├── README-bootstrap-plots-1.png │ ├── README-non-bootstrap-plots-1.png │ └── strap-god.jpg ├── reexports.Rd └── samplify.Rd ├── strapgod.Rproj ├── tests ├── testthat.R └── testthat │ ├── test-bootstrapify.R │ ├── test-collect.R │ ├── test-compat-tidyr.R │ ├── test-dplyr-compat.R │ ├── test-dplyr-do.R │ ├── test-dplyr-group-funs.R │ ├── test-dplyr-summarise.R │ └── test-samplify.R └── vignettes ├── .gitignore ├── dplyr-support.Rmd └── virtual-bootstraps.Rmd /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^CRAN-RELEASE$ 2 | ^cran-comments\.md$ 3 | ^pkgdown$ 4 | ^_pkgdown\.yml$ 5 | ^docs$ 6 | ^codecov\.yml$ 7 | ^\.travis\.yml$ 8 | ^README\.Rmd$ 9 | ^strapgod\.Rproj$ 10 | ^\.Rproj\.user$ 11 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | docs/ 2 | inst/doc 3 | .Rhistory 4 | .RData 5 | .Rproj.user 6 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | # R for travis: see documentation at https://docs.travis-ci.com/user/languages/r 2 | 3 | language: r 4 | sudo: false 5 | cache: packages 6 | 7 | matrix: 8 | include: 9 | - r: devel 10 | - r: release 11 | after_success: 12 | - Rscript -e 'covr::codecov()' 13 | before_cache: 14 | - Rscript -e 'remotes::install_cran("pkgdown")' 15 | # For the README 16 | - Rscript -e 'remotes::install_cran("ggplot2")' 17 | deploy: 18 | provider: script 19 | script: Rscript -e 'pkgdown::deploy_site_github(run_dont_run = TRUE)' 20 | skip_cleanup: true 21 | - r: oldrel 22 | - r: 3.2 23 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: strapgod 2 | Title: Resampled Data Frames 3 | Version: 0.0.4.9000 4 | Authors@R: 5 | person(given = "Davis", 6 | family = "Vaughan", 7 | role = c("aut", "cre"), 8 | email = "davis@rstudio.com") 9 | Description: Create data frames with virtual groups that can be used with 10 | 'dplyr' to efficiently compute resampled statistics, generate the data for 11 | hypothetical outcome plots, and fit multiple models on resampled variations 12 | of the original data. 13 | License: GPL-3 14 | Encoding: UTF-8 15 | LazyData: true 16 | Roxygen: list(markdown = TRUE) 17 | RoxygenNote: 6.1.1 18 | Depends: 19 | R (>= 3.2.0) 20 | Imports: 21 | dplyr (>= 0.8.3), 22 | tidyr (>= 1.0.0), 23 | rlang, 24 | tibble, 25 | lifecycle 26 | Suggests: 27 | broom, 28 | testthat, 29 | covr, 30 | knitr, 31 | rmarkdown 32 | URL: https://github.com/DavisVaughan/strapgod 33 | BugReports: https://github.com/DavisVaughan/strapgod/issues 34 | VignetteBuilder: knitr 35 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | S3method(anti_join,resampled_df) 4 | S3method(arrange,resampled_df) 5 | S3method(arrange_,resampled_df) 6 | S3method(bootstrapify,data.frame) 7 | S3method(bootstrapify,grouped_df) 8 | S3method(bootstrapify,tbl_df) 9 | S3method(collect,resampled_df) 10 | S3method(distinct,resampled_df) 11 | S3method(filter,resampled_df) 12 | S3method(full_join,resampled_df) 13 | S3method(group_by,resampled_df) 14 | S3method(group_indices,resampled_df) 15 | S3method(group_indices_,resampled_df) 16 | S3method(group_nest,resampled_df) 17 | S3method(group_split,resampled_df) 18 | S3method(inner_join,resampled_df) 19 | S3method(left_join,resampled_df) 20 | S3method(mutate,resampled_df) 21 | S3method(mutate_,resampled_df) 22 | S3method(nest,resampled_df) 23 | S3method(pull,resampled_df) 24 | S3method(rename,resampled_df) 25 | S3method(right_join,resampled_df) 26 | S3method(samplify,data.frame) 27 | S3method(samplify,grouped_df) 28 | S3method(samplify,tbl_df) 29 | S3method(select,resampled_df) 30 | S3method(semi_join,resampled_df) 31 | S3method(slice,resampled_df) 32 | S3method(slice_,resampled_df) 33 | S3method(summarise,resampled_df) 34 | S3method(summarise_,resampled_df) 35 | S3method(transmute,resampled_df) 36 | export(bootstrapify) 37 | export(filter) 38 | export(samplify) 39 | importFrom(dplyr,anti_join) 40 | importFrom(dplyr,arrange) 41 | importFrom(dplyr,arrange_) 42 | importFrom(dplyr,collect) 43 | importFrom(dplyr,distinct) 44 | importFrom(dplyr,filter) 45 | importFrom(dplyr,full_join) 46 | importFrom(dplyr,group_by) 47 | importFrom(dplyr,group_indices) 48 | importFrom(dplyr,group_indices_) 49 | importFrom(dplyr,group_nest) 50 | importFrom(dplyr,group_split) 51 | importFrom(dplyr,inner_join) 52 | importFrom(dplyr,left_join) 53 | importFrom(dplyr,mutate) 54 | importFrom(dplyr,mutate_) 55 | importFrom(dplyr,pull) 56 | importFrom(dplyr,rename) 57 | importFrom(dplyr,right_join) 58 | importFrom(dplyr,select) 59 | importFrom(dplyr,semi_join) 60 | importFrom(dplyr,slice) 61 | importFrom(dplyr,slice_) 62 | importFrom(dplyr,summarise) 63 | importFrom(dplyr,summarise_) 64 | importFrom(dplyr,transmute) 65 | importFrom(lifecycle,deprecated) 66 | importFrom(rlang,":=") 67 | importFrom(tidyr,nest) 68 | -------------------------------------------------------------------------------- /NEWS.md: -------------------------------------------------------------------------------- 1 | # strapgod (development version) 2 | 3 | # strapgod 0.0.4 4 | 5 | * The resampled data frame method for `tidyr::nest()` has been updated to 6 | support the changes in tidyr 1.0.0. 7 | 8 | # strapgod 0.0.3 9 | 10 | * In dplyr 0.8.2, the `ptype` is carried along as an attribute 11 | in `group_split()`. A test was updated to reflect this. 12 | 13 | * In dplyr 0.8.2, the behavior of `tbl_vars()` was changed to return a classed 14 | object. A test has been updated to reflect this (#13). 15 | 16 | # strapgod 0.0.2 17 | 18 | * In dplyr 0.8.1, the behavior of `group_map()` was moved to `group_modify()`, 19 | and `group_map()` was repurposed to always return a list. strapgod has been 20 | updated to reflect these changes. 21 | 22 | # strapgod 0.0.1 23 | 24 | * Added a `NEWS.md` file to track changes to the package. 25 | -------------------------------------------------------------------------------- /R/bootstrapify.R: -------------------------------------------------------------------------------- 1 | #' Create a bootstrapped tibble 2 | #' 3 | #' @description 4 | #' 5 | #' `bootstrapify()` creates a bootstrapped tibble with _virtual groups_. 6 | #' 7 | #' @inherit samplify details 8 | #' 9 | #' @inherit samplify return 10 | #' 11 | #' @seealso [collect.resampled_df()] 12 | #' 13 | #' @inheritParams samplify 14 | #' 15 | #' @examples 16 | #' library(dplyr) 17 | #' library(broom) 18 | #' 19 | #' bootstrapify(iris, 5) 20 | #' 21 | #' iris %>% 22 | #' bootstrapify(5) %>% 23 | #' summarise(per_strap_mean = mean(Petal.Width)) 24 | #' 25 | #' iris %>% 26 | #' group_by(Species) %>% 27 | #' bootstrapify(5) %>% 28 | #' summarise(per_strap_species_mean = mean(Petal.Width)) 29 | #' 30 | #' iris %>% 31 | #' bootstrapify(5) %>% 32 | #' do(tidy(lm(Sepal.Width ~ Sepal.Length + Species, data = .))) 33 | #' 34 | #' # Alternatively, use the newer group_modify() 35 | #' iris %>% 36 | #' bootstrapify(5) %>% 37 | #' group_modify(~tidy(lm(Sepal.Width ~ Sepal.Length + Species, data = .x))) 38 | #' 39 | #' # Alter the name of the group with `key` 40 | #' # Materialize them with collect() 41 | #' straps <- bootstrapify(iris, 5, key = ".straps") 42 | #' collect(straps) 43 | #' 44 | #' @family virtual samplers 45 | #' 46 | #' @name bootstrapify 47 | 48 | #' @rdname bootstrapify 49 | #' @export 50 | bootstrapify <- function(data, times, ..., key = ".bootstrap") { 51 | UseMethod("bootstrapify") 52 | } 53 | 54 | #' @export 55 | bootstrapify.data.frame <- function(data, times, ..., key = ".bootstrap") { 56 | bootstrapify(dplyr::as_tibble(data), times, ..., key = key) 57 | } 58 | 59 | #' @export 60 | bootstrapify.tbl_df <- function(data, times, ..., key = ".bootstrap") { 61 | samplify( 62 | data = data, 63 | times = times, 64 | size = nrow(data), 65 | ..., 66 | replace = TRUE, 67 | key = key 68 | ) 69 | } 70 | 71 | #' @export 72 | bootstrapify.grouped_df <- function(data, times, ..., key = ".bootstrap") { 73 | samplify( 74 | data = data, 75 | times = times, 76 | size = dplyr::group_size(data), 77 | ..., 78 | replace = TRUE, 79 | key = key 80 | ) 81 | } 82 | -------------------------------------------------------------------------------- /R/collect.R: -------------------------------------------------------------------------------- 1 | #' Force virtual groups to become explicit rows 2 | #' 3 | #' When `collect()` is used on a `resampled_df`, the virtual bootstrap groups 4 | #' are made explicit. 5 | #' 6 | #' @param x A `resampled_df`. 7 | #' 8 | #' @param ... Not used. 9 | #' 10 | #' @param id Optional. A single character that specifies a name for a column 11 | #' containing a sequence from `1:n` for each bootstrap group. 12 | #' 13 | #' @param original_id Optional. A single character that specifies a name for 14 | #' a column containing the original position of the bootstrapped row. 15 | #' 16 | #' @examples 17 | #' library(dplyr) 18 | #' 19 | #' # virtual groups become real rows 20 | #' collect(bootstrapify(iris, 5)) 21 | #' 22 | #' # add on the id column for an identifier per bootstrap 23 | #' collect(bootstrapify(iris, 5), id = ".id") 24 | #' 25 | #' # add on the original_id column to know which row this bootstrapped row 26 | #' # originally came from 27 | #' collect(bootstrapify(iris, 5), original_id = ".original_id") 28 | #' 29 | #' @export 30 | collect.resampled_df <- function(x, ..., id = NULL, original_id = NULL) { 31 | 32 | check_empty_dots(...) 33 | validate_null_or_single_character(id, "id") 34 | validate_null_or_single_character(original_id, "original_id") 35 | 36 | group_syms <- dplyr::groups(x) 37 | group_tbl <- dplyr::group_data(x) 38 | x <- dplyr::ungroup(x) 39 | 40 | # The only column names that are not in the x and are not '.rows' is the .key 41 | # Could potentially have multiple bootstrap columns 42 | .key <- setdiff(colnames(group_tbl), c(colnames(x), ".rows")) 43 | 44 | # Strip off non-virtual groups 45 | .out <- dplyr::select(group_tbl, !!!.key, .rows) 46 | 47 | # Order of these calls matters 48 | .out <- maybe_use_id(.out, id) 49 | .out <- add_straps(.out, x) 50 | .out <- maybe_use_original_id(.out, original_id) 51 | 52 | # Flatten 53 | .out <- tidyr::unnest(.out, cols = c(!!id, !!original_id, ...x)) 54 | 55 | .out <- dplyr::group_by(.out, !!!group_syms) 56 | 57 | .out 58 | } 59 | 60 | # ------------------------------------------------------------------------------ 61 | 62 | # id = 1:n for each group 63 | maybe_use_id <- function(.out, id) { 64 | 65 | if(!is.null(id)) { 66 | 67 | id_col <- map(.out[[".rows"]], seq_along) 68 | 69 | .out <- tibble::add_column(.out, !!id := id_col, .before = ".rows") 70 | } 71 | 72 | .out 73 | } 74 | 75 | # Repeat `x` rows to generate the bootstraps 76 | # Does vctrs::vec_slice() actually speed this up? 77 | # Limited benchmarking seemed inconclusive 78 | add_straps <- function(.out, x) { 79 | 80 | .out[["...x"]] <- map( 81 | .x = .out[[".rows"]], 82 | .f = function(idx) x[idx, , drop = FALSE] 83 | ) 84 | 85 | .out 86 | } 87 | 88 | maybe_use_original_id <- function(.out, original_id) { 89 | 90 | if (!is.null(original_id)) { 91 | .out <- dplyr::rename(.out, !!original_id := .rows) 92 | } 93 | else { 94 | .out[[".rows"]] <- NULL 95 | } 96 | 97 | .out 98 | } 99 | 100 | # ------------------------------------------------------------------------------ 101 | 102 | validate_null_or_single_character <- function(.x, .x_nm) { 103 | 104 | if (is.null(.x)) { 105 | return(invisible(.x)) 106 | } 107 | 108 | if (!rlang::is_scalar_character(.x)) { 109 | msg <- paste0("`", .x_nm, "` must be a character of size 1.") 110 | rlang::abort(msg) 111 | } 112 | 113 | invisible(.x) 114 | } 115 | -------------------------------------------------------------------------------- /R/compat-dplyr.R: -------------------------------------------------------------------------------- 1 | # ------------------------------------------------------------------------------ 2 | # Interesting dplyr functions 3 | 4 | # summarise() 5 | # do() 6 | # ungroup() 7 | # group_nest() 8 | # group_map() 9 | # group_modify() 10 | # group_walk() 11 | # group_split() 12 | # group_keys() 13 | # group_indices() 14 | 15 | # In theory we could let the default `summarise()` do its thing. But if the 16 | # user did a double `bootstrapify()` call, only one level of it will be removed 17 | # and the post-summarise() object will still be a resampled_df, even though 18 | # all of the bootstrap rows have been materialized. 19 | 20 | #' @importFrom dplyr summarise 21 | #' @export 22 | summarise.resampled_df <- function(.data, ...) { 23 | maybe_new_grouped_df(NextMethod()) 24 | } 25 | 26 | # For `group_nest()`, the default method works unless `keep = TRUE`. In that 27 | # case, we need to `collect()` so the groups are available to be 'kept'. 28 | 29 | #' @importFrom dplyr group_nest 30 | #' @export 31 | group_nest.resampled_df <- function(.tbl, ..., .key = "data", keep = FALSE) { 32 | 33 | if (keep) { 34 | dplyr::group_nest(collect(.tbl), ..., .key = .key, keep = keep) 35 | } 36 | else { 37 | NextMethod() 38 | } 39 | 40 | } 41 | 42 | # Same idea as group_nest() 43 | 44 | #' @importFrom dplyr group_split 45 | #' @export 46 | group_split.resampled_df <- function(.tbl, ..., keep = TRUE) { 47 | 48 | if (keep) { 49 | dplyr::group_split(collect(.tbl), ..., keep = keep) 50 | } 51 | else { 52 | NextMethod() 53 | } 54 | 55 | } 56 | 57 | # `group_indices()` returns garbage unless we `collect()` first 58 | 59 | #' @importFrom dplyr group_indices 60 | #' @export 61 | group_indices.resampled_df <- function(.data, ...) { 62 | dplyr::group_indices(collect(.data), ...) 63 | } 64 | 65 | # ------------------------------------------------------------------------------ 66 | # Interesting dplyr functions - Standard evaluation backwards compat 67 | 68 | # nocov start 69 | 70 | #' @importFrom dplyr summarise_ 71 | #' @export 72 | summarise_.resampled_df <- function(.data, ..., .dots = list()) { 73 | maybe_new_grouped_df(NextMethod()) 74 | } 75 | 76 | #' @importFrom dplyr group_indices_ 77 | #' @export 78 | group_indices_.resampled_df <- function(.data, ..., .dots = list()) { 79 | dplyr::group_indices_(collect(.data), ..., .dots = list()) 80 | } 81 | 82 | # nocov end 83 | 84 | # ------------------------------------------------------------------------------ 85 | # dplyr support 86 | 87 | #' @importFrom dplyr mutate 88 | #' @export 89 | mutate.resampled_df <- function(.data, ...) { 90 | dplyr::mutate(collect(.data), ...) 91 | } 92 | 93 | #' @importFrom dplyr transmute 94 | #' @export 95 | transmute.resampled_df <- function(.data, ...) { 96 | dplyr::transmute(collect(.data), ...) 97 | } 98 | 99 | # Required to export filter, otherwise: 100 | # Warning: declared S3 method 'filter.resampled_df' not found 101 | # because of stats::filter 102 | 103 | #' @export 104 | dplyr::filter 105 | 106 | #' @importFrom dplyr filter 107 | #' @export 108 | filter.resampled_df <- function(.data, ...) { 109 | dplyr::filter(collect(.data), ...) 110 | } 111 | 112 | #' @importFrom dplyr arrange 113 | #' @export 114 | arrange.resampled_df <- function(.data, ...) { 115 | dplyr::arrange(collect(.data), ...) 116 | } 117 | 118 | #' @importFrom dplyr distinct 119 | #' @export 120 | distinct.resampled_df <- function(.data, ..., .keep_all = FALSE) { 121 | dplyr::distinct(collect(.data), ..., .keep_all = .keep_all) 122 | } 123 | 124 | #' @importFrom dplyr select 125 | #' @export 126 | select.resampled_df <- function(.data, ...) { 127 | dplyr::select(collect(.data), ...) 128 | } 129 | 130 | #' @importFrom dplyr slice 131 | #' @export 132 | slice.resampled_df <- function(.data, ...) { 133 | dplyr::slice(collect(.data), ...) 134 | } 135 | 136 | #' @importFrom dplyr pull 137 | #' @export 138 | pull.resampled_df <- function(.data, var = -1) { 139 | dplyr::pull(collect(.data), var = !!rlang::enquo(var)) 140 | } 141 | 142 | #' @importFrom dplyr rename 143 | #' @export 144 | rename.resampled_df <- function(.data, ...) { 145 | dplyr::rename(collect(.data), ...) 146 | } 147 | 148 | #' @importFrom dplyr full_join 149 | #' @export 150 | full_join.resampled_df <- function(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...) { 151 | dplyr::full_join(collect(x), collect(y), by = by, copy = copy, suffix = suffix, ...) 152 | } 153 | 154 | #' @importFrom dplyr inner_join 155 | #' @export 156 | inner_join.resampled_df <- function(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...) { 157 | dplyr::inner_join(collect(x), collect(y), by = by, copy = copy, suffix = suffix, ...) 158 | } 159 | 160 | #' @importFrom dplyr left_join 161 | #' @export 162 | left_join.resampled_df <- function(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...) { 163 | dplyr::left_join(collect(x), collect(y), by = by, copy = copy, suffix = suffix, ...) 164 | } 165 | 166 | #' @importFrom dplyr right_join 167 | #' @export 168 | right_join.resampled_df <- function(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...) { 169 | dplyr::right_join(collect(x), collect(y), by = by, copy = copy, suffix = suffix, ...) 170 | } 171 | 172 | #' @importFrom dplyr anti_join 173 | #' @export 174 | anti_join.resampled_df <- function(x, y, by = NULL, copy = FALSE, ...) { 175 | dplyr::anti_join(collect(x), collect(y), by = by, copy = copy, ...) 176 | } 177 | 178 | #' @importFrom dplyr semi_join 179 | #' @export 180 | semi_join.resampled_df <- function(x, y, by = NULL, copy = FALSE, ...) { 181 | dplyr::semi_join(collect(x), collect(y), by = by, copy = copy, ...) 182 | } 183 | 184 | #' @importFrom dplyr group_by 185 | #' @export 186 | group_by.resampled_df <- function(.data, ..., add = FALSE, .drop = FALSE) { 187 | 188 | if (add) { 189 | .data <- collect(.data) 190 | } 191 | else { 192 | .data <- dplyr::ungroup(.data) 193 | } 194 | 195 | dplyr::group_by(.data, ..., add = add, .drop = .drop) 196 | } 197 | 198 | # ------------------------------------------------------------------------------ 199 | # Backwards compat support for deprecated standard eval dplyr 200 | 201 | # nocov start 202 | 203 | # Only a few of them need it. arrange_.grouped_df() 204 | # directly calls arrange_impl() causing a problem. 205 | 206 | #' @importFrom dplyr arrange_ 207 | #' @export 208 | arrange_.resampled_df <- function(.data, ..., .dots = list()) { 209 | dplyr::arrange_(collect(.data), ..., .dots = .dots) 210 | } 211 | 212 | #' @importFrom dplyr mutate_ 213 | #' @export 214 | mutate_.resampled_df <- function(.data, ..., .dots = list()) { 215 | dplyr::mutate_(collect(.data), ..., .dots = .dots) 216 | } 217 | 218 | #' @importFrom dplyr slice_ 219 | #' @export 220 | slice_.resampled_df <- function(.data, ..., .dots = list()) { 221 | dplyr::slice_(collect(.data), ..., .dots = .dots) 222 | } 223 | 224 | # nocov end 225 | 226 | # ------------------------------------------------------------------------------ 227 | # Util 228 | 229 | maybe_new_grouped_df <- function(x) { 230 | 231 | if (dplyr::is_grouped_df(x)) { 232 | x <- dplyr::new_grouped_df(x = x, groups = dplyr::group_data(x)) 233 | } 234 | 235 | x 236 | } 237 | -------------------------------------------------------------------------------- /R/compat-purrr.R: -------------------------------------------------------------------------------- 1 | # nocov start - compat-purrr (last updated: rlang 0.3.0.9000) 2 | 3 | # This file serves as a reference for compatibility functions for 4 | # purrr. They are not drop-in replacements but allow a similar style 5 | # of programming. This is useful in cases where purrr is too heavy a 6 | # package to depend on. Please find the most recent version in rlang's 7 | # repository. 8 | 9 | map <- function(.x, .f, ...) { 10 | lapply(.x, .f, ...) 11 | } 12 | 13 | map2 <- function(.x, .y, .f, ...) { 14 | out <- mapply(.f, .x, .y, MoreArgs = list(...), SIMPLIFY = FALSE) 15 | if (length(out) == length(.x)) { 16 | rlang::set_names(out, names(.x)) 17 | } else { 18 | rlang::set_names(out, NULL) 19 | } 20 | } 21 | 22 | # nocov end 23 | -------------------------------------------------------------------------------- /R/compat-tidyr.R: -------------------------------------------------------------------------------- 1 | # ------------------------------------------------------------------------------ 2 | # tidyr support 3 | 4 | #' @importFrom tidyr nest 5 | #' @importFrom lifecycle deprecated 6 | #' @export 7 | nest.resampled_df <- function(.data, ..., .key = deprecated()) { 8 | tidyr::nest(collect(.data), ..., .key = .key) 9 | } 10 | -------------------------------------------------------------------------------- /R/samplify.R: -------------------------------------------------------------------------------- 1 | #' Created a resampled tibble 2 | #' 3 | #' `samplify()` creates a resampled tibble with _virtual groups_. 4 | #' 5 | #' @details 6 | #' 7 | #' The following functions have special / interesting behavior when used with 8 | #' a `resampled_df`: 9 | #' 10 | #' - [dplyr::collect()] 11 | #' 12 | #' - [dplyr::summarise()] 13 | #' 14 | #' - [dplyr::do()] 15 | #' 16 | #' - [dplyr::group_map()] 17 | #' 18 | #' - [dplyr::group_modify()] 19 | #' 20 | #' - [dplyr::group_walk()] 21 | #' 22 | #' - [dplyr::group_nest()] 23 | #' 24 | #' - [dplyr::group_split()] 25 | #' 26 | #' @param data A tbl. 27 | #' 28 | #' @param times A single integer specifying the number of resamples. 29 | #' If the `tibble` is grouped, this is the number of resamples per group. 30 | #' 31 | #' @param size A single integer specifying the size of each resample. For a 32 | #' grouped data frame, this is also allowed to be an integer vector with size 33 | #' equal to the number of groups in `data`. This can be helpful when sampling 34 | #' without replacement when the number of rows per group is very different. 35 | #' 36 | #' @param ... Not used. 37 | #' 38 | #' @param replace Whether or not to sample with replacement. 39 | #' 40 | #' @param key A single character specifying the name of the virtual group 41 | #' that is added. 42 | #' 43 | #' @return A `resampled_df` with an extra group specified by the `key`. 44 | #' 45 | #' @examples 46 | #' library(dplyr) 47 | #' library(broom) 48 | #' 49 | #' samplify(iris, times = 3, size = 20) 50 | #' 51 | #' iris %>% 52 | #' samplify(times = 3, size = 20) %>% 53 | #' summarise(per_strap_mean = mean(Petal.Width)) 54 | #' 55 | #' iris %>% 56 | #' group_by(Species) %>% 57 | #' samplify(times = 3, size = 20) %>% 58 | #' summarise(per_strap_species_mean = mean(Petal.Width)) 59 | #' 60 | #' # Alter the name of the group with `key` 61 | #' # Materialize them with collect() 62 | #' samps <- samplify(iris, times = 3, size = 5, key = ".samps") 63 | #' collect(samps) 64 | #' 65 | #' collect(samps, id = ".id", original_id = ".orig_id") 66 | #' 67 | #' #---------------------------------------------------------------------------- 68 | #' 69 | #' # Be careful not to specify a `size` larger 70 | #' # than one of your groups! This will throw an error. 71 | #' 72 | #' iris_group_sizes_of_50_and_5 <- iris[1:55,] %>% 73 | #' group_by(Species) %>% 74 | #' group_trim() 75 | #' 76 | #' count(iris_group_sizes_of_50_and_5, Species) 77 | #' 78 | #' # size = 10 > min_group_size = 5 79 | #' \dontrun{ 80 | #' iris_group_sizes_of_50_and_5 %>% 81 | #' samplify(times = 2, size = 10) 82 | #' } 83 | #' 84 | #' # Instead, pass a vector of sizes to `samplify()` if this 85 | #' # structure is absolutely required for your use case. 86 | #' 87 | #' # size of 10 for the first group 88 | #' # size of 5 for the second group 89 | #' # total number of rows is 10 * 2 + 5 * 2 = 30 90 | #' iris_group_sizes_of_50_and_5 %>% 91 | #' samplify(times = 2, size = c(10, 5)) %>% 92 | #' collect() 93 | #' 94 | #' @family virtual samplers 95 | #' 96 | #' @seealso [collect.resampled_df()] 97 | #' 98 | #' @name samplify 99 | NULL 100 | 101 | #' @rdname samplify 102 | #' @export 103 | samplify <- function(data, times, size, ..., 104 | replace = FALSE, key = ".sample") { 105 | 106 | check_empty_dots(...) 107 | validate_is_scalar_character(key, "key") 108 | validate_is_bool(replace, "replace") 109 | validate_is_scalar_positive_integerish(times, "times") 110 | validate_is_positive_integerish(size, "size") 111 | 112 | UseMethod("samplify") 113 | } 114 | 115 | #' @export 116 | samplify.data.frame <- function(data, times, size, ..., 117 | replace = FALSE, key = ".sample") { 118 | samplify( 119 | data = dplyr::as_tibble(data), 120 | times = times, 121 | size = size, 122 | ..., 123 | replace = replace, 124 | key = key 125 | ) 126 | } 127 | 128 | #' @export 129 | samplify.tbl_df <- function(data, times, size, ..., 130 | replace = FALSE, key = ".sample") { 131 | 132 | # additional check for size 1 here 133 | validate_is( 134 | size, "size", rlang::is_scalar_integerish, 135 | "a single integer (for ungrouped data frames)" 136 | ) 137 | 138 | .row_slice_ids <- seq_len(nrow(data)) 139 | 140 | group_info <- index_sampler( 141 | .row_slice_ids = .row_slice_ids, 142 | times = times, 143 | key = key, 144 | size = size, 145 | replace = replace 146 | ) 147 | 148 | dplyr::new_grouped_df(data, group_info, class = "resampled_df") 149 | } 150 | 151 | #' @export 152 | samplify.grouped_df <- function(data, times, size, ..., 153 | replace = FALSE, key = ".sample") { 154 | 155 | size <- recycle_size(size, dplyr::n_groups(data)) 156 | 157 | # extract existing group_info 158 | group_info <- dplyr::group_data(data) 159 | index_list <- group_info[[".rows"]] 160 | 161 | new_row_index_tbl <- map2( 162 | .x = index_list, 163 | .y = size, 164 | .f = function(.x, .y) { 165 | index_sampler( 166 | .row_slice_ids = .x, 167 | times = times, 168 | key = key, 169 | size = .y, 170 | replace = replace 171 | ) 172 | } 173 | ) 174 | 175 | # overwrite current .rows and unnest 176 | group_info[[".rows"]] <- new_row_index_tbl 177 | group_info <- tidyr::unnest(group_info, cols = .rows, names_repair = "universal") 178 | 179 | dplyr::new_grouped_df(data, group_info, class = "resampled_df") 180 | } 181 | 182 | # ------------------------------------------------------------------------------ 183 | # Utility 184 | 185 | # Actually perform the resampling of the row indices 186 | # and create the group tbl information from that 187 | 188 | index_sampler <- function(.row_slice_ids, 189 | times, 190 | key, 191 | size, 192 | replace = FALSE) { 193 | 194 | check_size(size, length(.row_slice_ids), replace) 195 | 196 | .bootstrap_id <- seq_len(times) 197 | 198 | # must unquote the colname as `.rows` is an arg to tibble() 199 | .row_col <- ".rows" 200 | 201 | .index_list <- replicate( 202 | n = times, 203 | expr = sample( 204 | x = .row_slice_ids, 205 | size = size, 206 | replace = replace, 207 | prob = NULL 208 | ), 209 | simplify = FALSE 210 | ) 211 | 212 | dplyr::tibble( 213 | !!key := .bootstrap_id, 214 | !!.row_col := .index_list 215 | ) 216 | 217 | } 218 | 219 | # dplyr:::check_size() 220 | check_size <- function (size, n, replace = FALSE) { 221 | if (size <= n || replace) 222 | return(invisible(size)) 223 | 224 | msg <- paste0( 225 | "`size` (%i) must be less than or equal to the ", 226 | "size of the data / current group (%i), ", 227 | "set `replace = TRUE` to use sampling with replacement." 228 | ) 229 | 230 | stop(sprintf(msg, size, n), call. = FALSE) 231 | } 232 | 233 | recycle_size <- function(.x, .n) { 234 | 235 | .n_x <- length(.x) 236 | 237 | if (.n_x == .n) { 238 | return(.x) 239 | } 240 | 241 | if (.n_x == 1L) { 242 | .x <- rep(.x, times = .n) 243 | return(.x) 244 | } 245 | 246 | msg <- paste0("`size` must be size 1 or ", .n, " (the number of groups).") 247 | rlang::abort(msg) 248 | } 249 | 250 | # ------------------------------------------------------------------------------ 251 | # Validation 252 | 253 | validate_is <- function(.x, .x_nm, .is_f, .expected) { 254 | 255 | if (!.is_f(.x)) { 256 | msg <- paste0("`", .x_nm, "` must be ", .expected, ".") 257 | rlang::abort(msg) 258 | } 259 | 260 | invisible(.x) 261 | } 262 | 263 | validate_is_scalar_character <- function(.x, .x_nm) { 264 | validate_is(.x, .x_nm, rlang::is_scalar_character, "a single character") 265 | } 266 | 267 | is_bool <- function (x) { 268 | rlang::is_logical(x, n = 1) && !is.na(x) 269 | } 270 | 271 | validate_is_bool <- function(.x, .x_nm) { 272 | validate_is(.x, .x_nm, is_bool, "a single logical (TRUE/FALSE)") 273 | } 274 | 275 | is_positive <- function(x) { 276 | isTRUE(all(x > 0)) 277 | } 278 | 279 | validate_is_scalar_positive_integerish <- function(.x, .x_nm) { 280 | validate_is(.x, .x_nm, rlang::is_scalar_integerish, "a single integer") 281 | validate_is(.x, .x_nm, is_positive, "a positive integer") 282 | } 283 | 284 | validate_is_positive_integerish <- function(.x, .x_nm) { 285 | validate_is(.x, .x_nm, rlang::is_integerish, "an integer vector") 286 | validate_is(.x, .x_nm, is_positive, "a positive integer vector") 287 | } 288 | -------------------------------------------------------------------------------- /R/util.R: -------------------------------------------------------------------------------- 1 | check_empty_dots <- function(...) { 2 | dots <- rlang::enquos(...) 3 | 4 | dots_are_empty <- length(dots) == 0L 5 | 6 | if (!dots_are_empty) { 7 | rlang::abort("`...` must be empty.") 8 | } 9 | 10 | invisible() 11 | } 12 | -------------------------------------------------------------------------------- /R/zzz.R: -------------------------------------------------------------------------------- 1 | #' @importFrom dplyr collect 2 | NULL 3 | 4 | #' @importFrom rlang := 5 | NULL 6 | 7 | # For collect() 8 | utils::globalVariables( 9 | c( 10 | ".rows", 11 | "...x" 12 | ) 13 | ) 14 | -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: github_document 3 | editor_options: 4 | chunk_output_type: console 5 | --- 6 | 7 | 8 | 9 | ```{r setup, include = FALSE} 10 | knitr::opts_chunk$set( 11 | collapse = TRUE, 12 | comment = "#>", 13 | fig.path = "man/figures/README-", 14 | out.width = "100%" 15 | ) 16 | ``` 17 | # strapgod 18 | 19 | 20 | [![Codecov test coverage](https://codecov.io/gh/DavisVaughan/strapgod/branch/master/graph/badge.svg)](https://codecov.io/gh/DavisVaughan/strapgod?branch=master) 21 | [![Travis build status](https://travis-ci.org/DavisVaughan/strapgod.svg?branch=master)](https://travis-ci.org/DavisVaughan/strapgod) 22 | [![CRAN status](https://www.r-pkg.org/badges/version/strapgod)](https://cran.r-project.org/package=strapgod) 23 | [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental) 24 | 25 | 26 |

27 | 28 |

29 | 30 | ## Introduction 31 | 32 | The goal of strapgod is to create _virtual groups_ on top of a `tibble` or `grouped_df` as a way of resampling the original data frame. You can then efficiently perform various dplyr operations on this `resampled_df`, like: `summarise()`, `do()`, `group_map()`, and more, to easily compute bootstrapped and resampled statistics. 33 | 34 | ## Installation 35 | 36 | You can install the released version of strapgod from [CRAN](https://CRAN.R-project.org) with: 37 | 38 | ``` r 39 | install.packages("strapgod") 40 | ``` 41 | 42 | Install the development version from GitHub with: 43 | 44 | ``` r 45 | devtools::install_github("DavisVaughan/strapgod") 46 | ``` 47 | 48 | ## Learning about strapgod 49 | 50 | If you aren't already on the [pkgdown site](https://davisvaughan.github.io/strapgod/), I would encourage starting there. From there, you will be able to click on these two vignettes to learn about working with resampled tibbles. 51 | 52 | - `vignette("virtual-bootstraps", "strapgod")` 53 | 54 | - `vignette("dplyr-support", "strapgod")` 55 | 56 | ## Example 57 | 58 | Create resampled data frames with `bootstrapify()` or `samplify()`. Notice how we grouped by the _virtual_ column, `.bootstrap` and there are still only 150 rows even though we bootstrapped this dataset 10 times. 59 | 60 | ```{r first-strap, warning=FALSE, message=FALSE} 61 | library(strapgod) 62 | library(dplyr) 63 | set.seed(123) 64 | 65 | bootstrapify(iris, 10) 66 | ``` 67 | 68 | You can feed a `resampled_df` into `summarise()` or `group_modify()` to perform 69 | efficient bootstrapped computations. 70 | 71 | ```{r summarise} 72 | iris %>% 73 | bootstrapify(10) %>% 74 | summarise(per_strap_mean = mean(Petal.Width)) 75 | ``` 76 | 77 | The original data can be grouped as well, and the bootstraps will be created for each group. 78 | 79 | ```{r summarise-by-group} 80 | iris %>% 81 | group_by(Species) %>% 82 | bootstrapify(10) %>% 83 | summarise(per_strap_per_species_mean = mean(Petal.Width)) 84 | ``` 85 | 86 | ## Plotting bootstrapped results 87 | 88 | A fun example of using strapgod is to create bootstrapped visualizations quickly and easily for hypothetical outcome plots. 89 | 90 | ```{r non-bootstrap-plots, warning=FALSE, message=FALSE} 91 | set.seed(123) 92 | library(ggplot2) 93 | 94 | # without bootstrap 95 | mtcars %>% 96 | ggplot(aes(hp, mpg)) + 97 | geom_smooth(se = FALSE) + 98 | ylim(y = c(0, 40)) 99 | ``` 100 | 101 | ```{r bootstrap-plots, warning=FALSE, message=FALSE} 102 | # with bootstrap 103 | mtcars %>% 104 | bootstrapify(10) %>% 105 | collect() %>% 106 | ggplot(aes(hp, mpg, group = .bootstrap)) + 107 | geom_smooth(se = FALSE) + 108 | ylim(y = c(0, 40)) 109 | ``` 110 | 111 | ## In the wild 112 | 113 | - Claus Wilke has used strapgod to power some pieces of his [ungeviz](https://github.com/wilkelab/ungeviz) package for visualizing uncertainty. 114 | 115 | - You can watch Claus's [rstudio::conf 2019](https://resources.rstudio.com/rstudio-conf-2019/visualizing-uncertainty-with-hypothetical-outcomes-plots) talk to see ungeviz and strapgod in action. 116 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # strapgod 5 | 6 | 7 | 8 | [![Codecov test 9 | coverage](https://codecov.io/gh/DavisVaughan/strapgod/branch/master/graph/badge.svg)](https://codecov.io/gh/DavisVaughan/strapgod?branch=master) 10 | [![Travis build 11 | status](https://travis-ci.org/DavisVaughan/strapgod.svg?branch=master)](https://travis-ci.org/DavisVaughan/strapgod) 12 | [![CRAN 13 | status](https://www.r-pkg.org/badges/version/strapgod)](https://cran.r-project.org/package=strapgod) 14 | [![Lifecycle: 15 | experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental) 16 | 17 | 18 |

19 | 20 | 21 | 22 |

23 | 24 | ## Introduction 25 | 26 | The goal of strapgod is to create *virtual groups* on top of a `tibble` 27 | or `grouped_df` as a way of resampling the original data frame. You can 28 | then efficiently perform various dplyr operations on this 29 | `resampled_df`, like: `summarise()`, `do()`, `group_map()`, and more, to 30 | easily compute bootstrapped and resampled statistics. 31 | 32 | ## Installation 33 | 34 | You can install the released version of strapgod from 35 | [CRAN](https://CRAN.R-project.org) with: 36 | 37 | ``` r 38 | install.packages("strapgod") 39 | ``` 40 | 41 | Install the development version from GitHub with: 42 | 43 | ``` r 44 | devtools::install_github("DavisVaughan/strapgod") 45 | ``` 46 | 47 | ## Learning about strapgod 48 | 49 | If you aren’t already on the [pkgdown 50 | site](https://davisvaughan.github.io/strapgod/), I would encourage 51 | starting there. From there, you will be able to click on these two 52 | vignettes to learn about working with resampled tibbles. 53 | 54 | - `vignette("virtual-bootstraps", "strapgod")` 55 | 56 | - `vignette("dplyr-support", "strapgod")` 57 | 58 | ## Example 59 | 60 | Create resampled data frames with `bootstrapify()` or `samplify()`. 61 | Notice how we grouped by the *virtual* column, `.bootstrap` and there 62 | are still only 150 rows even though we bootstrapped this dataset 10 63 | times. 64 | 65 | ``` r 66 | library(strapgod) 67 | library(dplyr) 68 | set.seed(123) 69 | 70 | bootstrapify(iris, 10) 71 | #> # A tibble: 150 x 5 72 | #> # Groups: .bootstrap [10] 73 | #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species 74 | #> 75 | #> 1 5.1 3.5 1.4 0.2 setosa 76 | #> 2 4.9 3 1.4 0.2 setosa 77 | #> 3 4.7 3.2 1.3 0.2 setosa 78 | #> 4 4.6 3.1 1.5 0.2 setosa 79 | #> 5 5 3.6 1.4 0.2 setosa 80 | #> 6 5.4 3.9 1.7 0.4 setosa 81 | #> 7 4.6 3.4 1.4 0.3 setosa 82 | #> 8 5 3.4 1.5 0.2 setosa 83 | #> 9 4.4 2.9 1.4 0.2 setosa 84 | #> 10 4.9 3.1 1.5 0.1 setosa 85 | #> # … with 140 more rows 86 | ``` 87 | 88 | You can feed a `resampled_df` into `summarise()` or `group_map()` to 89 | perform efficient bootstrapped computations. 90 | 91 | ``` r 92 | iris %>% 93 | bootstrapify(10) %>% 94 | summarise(per_strap_mean = mean(Petal.Width)) 95 | #> # A tibble: 10 x 2 96 | #> .bootstrap per_strap_mean 97 | #> 98 | #> 1 1 1.20 99 | #> 2 2 1.22 100 | #> 3 3 1.23 101 | #> 4 4 1.13 102 | #> 5 5 1.20 103 | #> 6 6 1.15 104 | #> 7 7 1.18 105 | #> 8 8 1.13 106 | #> 9 9 1.31 107 | #> 10 10 1.19 108 | ``` 109 | 110 | The original data can be grouped as well, and the bootstraps will be 111 | created for each group. 112 | 113 | ``` r 114 | iris %>% 115 | group_by(Species) %>% 116 | bootstrapify(10) %>% 117 | summarise(per_strap_per_species_mean = mean(Petal.Width)) 118 | #> # A tibble: 30 x 3 119 | #> # Groups: Species [3] 120 | #> Species .bootstrap per_strap_per_species_mean 121 | #> 122 | #> 1 setosa 1 0.25 123 | #> 2 setosa 2 0.246 124 | #> 3 setosa 3 0.24 125 | #> 4 setosa 4 0.238 126 | #> 5 setosa 5 0.252 127 | #> 6 setosa 6 0.274 128 | #> 7 setosa 7 0.238 129 | #> 8 setosa 8 0.258 130 | #> 9 setosa 9 0.252 131 | #> 10 setosa 10 0.256 132 | #> # … with 20 more rows 133 | ``` 134 | 135 | ## Plotting bootstrapped results 136 | 137 | A fun example of using strapgod is to create bootstrapped visualizations 138 | quickly and easily for hypothetical outcome plots. 139 | 140 | ``` r 141 | set.seed(123) 142 | library(ggplot2) 143 | 144 | # without bootstrap 145 | mtcars %>% 146 | ggplot(aes(hp, mpg)) + 147 | geom_smooth(se = FALSE) + 148 | ylim(y = c(0, 40)) 149 | ``` 150 | 151 | 152 | 153 | ``` r 154 | # with bootstrap 155 | mtcars %>% 156 | bootstrapify(10) %>% 157 | collect() %>% 158 | ggplot(aes(hp, mpg, group = .bootstrap)) + 159 | geom_smooth(se = FALSE) + 160 | ylim(y = c(0, 40)) 161 | ``` 162 | 163 | 164 | 165 | ## In the wild 166 | 167 | - Claus Wilke has used strapgod to power some pieces of his 168 | [ungeviz](https://github.com/wilkelab/ungeviz) package for 169 | visualizing uncertainty. 170 | 171 | - You can watch Claus’s 172 | [rstudio::conf 2019](https://resources.rstudio.com/rstudio-conf-2019/visualizing-uncertainty-with-hypothetical-outcomes-plots) 173 | talk to see ungeviz and strapgod in action. 174 | -------------------------------------------------------------------------------- /_pkgdown.yml: -------------------------------------------------------------------------------- 1 | destination: docs 2 | -------------------------------------------------------------------------------- /codecov.yml: -------------------------------------------------------------------------------- 1 | comment: false 2 | 3 | coverage: 4 | status: 5 | project: 6 | default: 7 | target: auto 8 | threshold: 1% 9 | patch: 10 | default: 11 | target: auto 12 | threshold: 1% 13 | -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## Comments 2 | This is the fourth release of this package. 3 | 4 | This release requires tidyr 1.0.0 and updates the `resampled_df` method for 5 | `tidyr::nest()`. 6 | 7 | ## Test environments 8 | * local OS X install, R 3.6.0 9 | * ubuntu 14.04 (on travis-ci) (devel and release) 10 | * win-builder (devel and release) 11 | 12 | ## R CMD check results 13 | 14 | 0 errors | 0 warnings | 0 notes 15 | -------------------------------------------------------------------------------- /man/bootstrapify.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/bootstrapify.R 3 | \name{bootstrapify} 4 | \alias{bootstrapify} 5 | \title{Create a bootstrapped tibble} 6 | \usage{ 7 | bootstrapify(data, times, ..., key = ".bootstrap") 8 | } 9 | \arguments{ 10 | \item{data}{A tbl.} 11 | 12 | \item{times}{A single integer specifying the number of resamples. 13 | If the \code{tibble} is grouped, this is the number of resamples per group.} 14 | 15 | \item{...}{Not used.} 16 | 17 | \item{key}{A single character specifying the name of the virtual group 18 | that is added.} 19 | } 20 | \value{ 21 | A \code{resampled_df} with an extra group specified by the \code{key}. 22 | } 23 | \description{ 24 | \code{bootstrapify()} creates a bootstrapped tibble with \emph{virtual groups}. 25 | } 26 | \details{ 27 | The following functions have special / interesting behavior when used with 28 | a \code{resampled_df}: 29 | \itemize{ 30 | \item \code{\link[dplyr:collect]{dplyr::collect()}} 31 | \item \code{\link[dplyr:summarise]{dplyr::summarise()}} 32 | \item \code{\link[dplyr:do]{dplyr::do()}} 33 | \item \code{\link[dplyr:group_map]{dplyr::group_map()}} 34 | \item \code{\link[dplyr:group_modify]{dplyr::group_modify()}} 35 | \item \code{\link[dplyr:group_walk]{dplyr::group_walk()}} 36 | \item \code{\link[dplyr:group_nest]{dplyr::group_nest()}} 37 | \item \code{\link[dplyr:group_split]{dplyr::group_split()}} 38 | } 39 | } 40 | \examples{ 41 | library(dplyr) 42 | library(broom) 43 | 44 | bootstrapify(iris, 5) 45 | 46 | iris \%>\% 47 | bootstrapify(5) \%>\% 48 | summarise(per_strap_mean = mean(Petal.Width)) 49 | 50 | iris \%>\% 51 | group_by(Species) \%>\% 52 | bootstrapify(5) \%>\% 53 | summarise(per_strap_species_mean = mean(Petal.Width)) 54 | 55 | iris \%>\% 56 | bootstrapify(5) \%>\% 57 | do(tidy(lm(Sepal.Width ~ Sepal.Length + Species, data = .))) 58 | 59 | # Alternatively, use the newer group_modify() 60 | iris \%>\% 61 | bootstrapify(5) \%>\% 62 | group_modify(~tidy(lm(Sepal.Width ~ Sepal.Length + Species, data = .x))) 63 | 64 | # Alter the name of the group with `key` 65 | # Materialize them with collect() 66 | straps <- bootstrapify(iris, 5, key = ".straps") 67 | collect(straps) 68 | 69 | } 70 | \seealso{ 71 | \code{\link[=collect.resampled_df]{collect.resampled_df()}} 72 | 73 | Other virtual samplers: \code{\link{samplify}} 74 | } 75 | \concept{virtual samplers} 76 | -------------------------------------------------------------------------------- /man/collect.resampled_df.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/collect.R 3 | \name{collect.resampled_df} 4 | \alias{collect.resampled_df} 5 | \title{Force virtual groups to become explicit rows} 6 | \usage{ 7 | \method{collect}{resampled_df}(x, ..., id = NULL, original_id = NULL) 8 | } 9 | \arguments{ 10 | \item{x}{A \code{resampled_df}.} 11 | 12 | \item{...}{Not used.} 13 | 14 | \item{id}{Optional. A single character that specifies a name for a column 15 | containing a sequence from \code{1:n} for each bootstrap group.} 16 | 17 | \item{original_id}{Optional. A single character that specifies a name for 18 | a column containing the original position of the bootstrapped row.} 19 | } 20 | \description{ 21 | When \code{collect()} is used on a \code{resampled_df}, the virtual bootstrap groups 22 | are made explicit. 23 | } 24 | \examples{ 25 | library(dplyr) 26 | 27 | # virtual groups become real rows 28 | collect(bootstrapify(iris, 5)) 29 | 30 | # add on the id column for an identifier per bootstrap 31 | collect(bootstrapify(iris, 5), id = ".id") 32 | 33 | # add on the original_id column to know which row this bootstrapped row 34 | # originally came from 35 | collect(bootstrapify(iris, 5), original_id = ".original_id") 36 | 37 | } 38 | -------------------------------------------------------------------------------- /man/figures/README-bootstrap-plots-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DavisVaughan/strapgod/ea2b1ecfc780a44ffa934c9bc7c2032954f4ffaa/man/figures/README-bootstrap-plots-1.png -------------------------------------------------------------------------------- /man/figures/README-non-bootstrap-plots-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DavisVaughan/strapgod/ea2b1ecfc780a44ffa934c9bc7c2032954f4ffaa/man/figures/README-non-bootstrap-plots-1.png -------------------------------------------------------------------------------- /man/figures/strap-god.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DavisVaughan/strapgod/ea2b1ecfc780a44ffa934c9bc7c2032954f4ffaa/man/figures/strap-god.jpg -------------------------------------------------------------------------------- /man/reexports.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/compat-dplyr.R 3 | \docType{import} 4 | \name{reexports} 5 | \alias{reexports} 6 | \alias{filter} 7 | \title{Objects exported from other packages} 8 | \keyword{internal} 9 | \description{ 10 | These objects are imported from other packages. Follow the links 11 | below to see their documentation. 12 | 13 | \describe{ 14 | \item{dplyr}{\code{\link[dplyr]{filter}}} 15 | }} 16 | 17 | -------------------------------------------------------------------------------- /man/samplify.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/samplify.R 3 | \name{samplify} 4 | \alias{samplify} 5 | \title{Created a resampled tibble} 6 | \usage{ 7 | samplify(data, times, size, ..., replace = FALSE, key = ".sample") 8 | } 9 | \arguments{ 10 | \item{data}{A tbl.} 11 | 12 | \item{times}{A single integer specifying the number of resamples. 13 | If the \code{tibble} is grouped, this is the number of resamples per group.} 14 | 15 | \item{size}{A single integer specifying the size of each resample. For a 16 | grouped data frame, this is also allowed to be an integer vector with size 17 | equal to the number of groups in \code{data}. This can be helpful when sampling 18 | without replacement when the number of rows per group is very different.} 19 | 20 | \item{...}{Not used.} 21 | 22 | \item{replace}{Whether or not to sample with replacement.} 23 | 24 | \item{key}{A single character specifying the name of the virtual group 25 | that is added.} 26 | } 27 | \value{ 28 | A \code{resampled_df} with an extra group specified by the \code{key}. 29 | } 30 | \description{ 31 | \code{samplify()} creates a resampled tibble with \emph{virtual groups}. 32 | } 33 | \details{ 34 | The following functions have special / interesting behavior when used with 35 | a \code{resampled_df}: 36 | \itemize{ 37 | \item \code{\link[dplyr:collect]{dplyr::collect()}} 38 | \item \code{\link[dplyr:summarise]{dplyr::summarise()}} 39 | \item \code{\link[dplyr:do]{dplyr::do()}} 40 | \item \code{\link[dplyr:group_map]{dplyr::group_map()}} 41 | \item \code{\link[dplyr:group_modify]{dplyr::group_modify()}} 42 | \item \code{\link[dplyr:group_walk]{dplyr::group_walk()}} 43 | \item \code{\link[dplyr:group_nest]{dplyr::group_nest()}} 44 | \item \code{\link[dplyr:group_split]{dplyr::group_split()}} 45 | } 46 | } 47 | \examples{ 48 | library(dplyr) 49 | library(broom) 50 | 51 | samplify(iris, times = 3, size = 20) 52 | 53 | iris \%>\% 54 | samplify(times = 3, size = 20) \%>\% 55 | summarise(per_strap_mean = mean(Petal.Width)) 56 | 57 | iris \%>\% 58 | group_by(Species) \%>\% 59 | samplify(times = 3, size = 20) \%>\% 60 | summarise(per_strap_species_mean = mean(Petal.Width)) 61 | 62 | # Alter the name of the group with `key` 63 | # Materialize them with collect() 64 | samps <- samplify(iris, times = 3, size = 5, key = ".samps") 65 | collect(samps) 66 | 67 | collect(samps, id = ".id", original_id = ".orig_id") 68 | 69 | #---------------------------------------------------------------------------- 70 | 71 | # Be careful not to specify a `size` larger 72 | # than one of your groups! This will throw an error. 73 | 74 | iris_group_sizes_of_50_and_5 <- iris[1:55,] \%>\% 75 | group_by(Species) \%>\% 76 | group_trim() 77 | 78 | count(iris_group_sizes_of_50_and_5, Species) 79 | 80 | # size = 10 > min_group_size = 5 81 | \dontrun{ 82 | iris_group_sizes_of_50_and_5 \%>\% 83 | samplify(times = 2, size = 10) 84 | } 85 | 86 | # Instead, pass a vector of sizes to `samplify()` if this 87 | # structure is absolutely required for your use case. 88 | 89 | # size of 10 for the first group 90 | # size of 5 for the second group 91 | # total number of rows is 10 * 2 + 5 * 2 = 30 92 | iris_group_sizes_of_50_and_5 \%>\% 93 | samplify(times = 2, size = c(10, 5)) \%>\% 94 | collect() 95 | 96 | } 97 | \seealso{ 98 | \code{\link[=collect.resampled_df]{collect.resampled_df()}} 99 | 100 | Other virtual samplers: \code{\link{bootstrapify}} 101 | } 102 | \concept{virtual samplers} 103 | -------------------------------------------------------------------------------- /strapgod.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: No 4 | SaveWorkspace: No 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | PackageRoxygenize: rd,collate,namespace 22 | -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | library(strapgod) 3 | 4 | test_check("strapgod") 5 | -------------------------------------------------------------------------------- /tests/testthat/test-bootstrapify.R: -------------------------------------------------------------------------------- 1 | context("test-bootstrapify") 2 | 3 | library(dplyr) 4 | 5 | test_that("can create bootstrapped data frames", { 6 | 7 | expect_error( 8 | x <- bootstrapify(iris, 2), 9 | NA 10 | ) 11 | 12 | expect_is(x, "resampled_df") 13 | expect_is(x, "grouped_df") 14 | 15 | x_gd <- group_data(x) 16 | 17 | expect_equal( 18 | colnames(x_gd), 19 | c(".bootstrap", ".rows") 20 | ) 21 | 22 | expect_equal( 23 | x_gd[[".bootstrap"]], 24 | c(1, 2) 25 | ) 26 | 27 | expect_equal( 28 | nrow(x_gd), 29 | 2 30 | ) 31 | 32 | expect_equal( 33 | unique(vapply(x_gd$.rows, length, integer(1))), 34 | 150 35 | ) 36 | 37 | }) 38 | 39 | test_that("can bootstrap with groups", { 40 | 41 | x <- iris %>% 42 | group_by(Species) %>% 43 | bootstrapify(5) 44 | 45 | x_gd <- group_data(x) 46 | 47 | expect_equal( 48 | colnames(x_gd), 49 | c("Species", ".bootstrap", ".rows") 50 | ) 51 | 52 | expect_equal( 53 | nrow(x_gd), 54 | 15 55 | ) 56 | 57 | expect_equal( 58 | unique(vapply(x_gd$.rows, length, integer(1))), 59 | 50 60 | ) 61 | 62 | }) 63 | 64 | test_that("can correctly double bootstrap", { 65 | 66 | once <- iris %>% 67 | bootstrapify(5) 68 | 69 | twice <- once %>% 70 | bootstrapify(5, key = ".bootstrap1") 71 | 72 | once_gd <- group_data(once) 73 | twice_gd <- group_data(twice) 74 | 75 | once_strap_1 <- once_gd$.rows[[1]] 76 | 77 | twice_1 <- dplyr::filter(twice_gd, .bootstrap == 1) 78 | 79 | # Check that the indices generated from bootstrapping the second 80 | # time are subsets of the indices in each group from the first 81 | # bootstrap 82 | 83 | each_twice_is_subset_of_once <- vapply( 84 | X = twice_1$.rows, 85 | FUN = function(x) {all(x %in% once_strap_1)}, 86 | FUN.VALUE = logical(1) 87 | ) 88 | 89 | expect_true(all(each_twice_is_subset_of_once)) 90 | 91 | expect_equal( 92 | colnames(twice_gd), 93 | c(".bootstrap", ".bootstrap1", ".rows") 94 | ) 95 | 96 | }) 97 | 98 | test_that("universal name repair kicks in with double bootstrap", { 99 | once <- iris %>% 100 | bootstrapify(5) 101 | 102 | # Expect universal name repair 103 | expect_message( 104 | twice <- once %>% 105 | bootstrapify(5), 106 | ".bootstrap -> .bootstrap...1" 107 | ) 108 | 109 | expect_equal( 110 | colnames(group_data(twice)), 111 | c(".bootstrap...1", ".bootstrap...2", ".rows") 112 | ) 113 | }) 114 | 115 | test_that("can alter the key", { 116 | x <- bootstrapify(iris, 5, key = ".boot") 117 | 118 | expect_equal( 119 | colnames(group_data(x))[1], 120 | ".boot" 121 | ) 122 | }) 123 | 124 | test_that("cannot pass into the `...`", { 125 | expect_error( 126 | bootstrapify(iris, 5, "key-in-dots"), 127 | "`...` must be empty" 128 | ) 129 | }) 130 | 131 | test_that("invalid inputs are caught", { 132 | expect_error( 133 | bootstrapify(iris, 5, key = 1), 134 | "`key` must be a single character." 135 | ) 136 | }) 137 | -------------------------------------------------------------------------------- /tests/testthat/test-collect.R: -------------------------------------------------------------------------------- 1 | context("test-collect") 2 | 3 | library(dplyr) 4 | 5 | test_that("can collect()", { 6 | 7 | x <- bootstrapify(iris, 5) 8 | 9 | expect_error( 10 | x_c <- collect(x), 11 | NA 12 | ) 13 | 14 | expect_equal( 15 | nrow(x_c), 16 | 750 17 | ) 18 | 19 | expect_is( 20 | x_c, 21 | "grouped_df" 22 | ) 23 | 24 | expect_false("resample_df" %in% class(x_c)) 25 | 26 | expect_equal( 27 | colnames(x_c), 28 | c(".bootstrap", colnames(iris)) 29 | ) 30 | 31 | }) 32 | 33 | test_that("can do a double bootstrap collect", { 34 | 35 | x <- iris %>% 36 | bootstrapify(5) %>% 37 | bootstrapify(10, key = ".bootstrap_2") 38 | 39 | x_c <- collect(x) 40 | 41 | expect_equal( 42 | nrow(x_c), 43 | nrow(iris) * 5 * 10 44 | ) 45 | 46 | expect_equal( 47 | x_c$.bootstrap, 48 | rep(1:5, each = nrow(iris) * 10) 49 | ) 50 | 51 | expect_equal( 52 | x_c$.bootstrap_2, 53 | rep(1:10, each = nrow(iris), times = 5) 54 | ) 55 | 56 | }) 57 | 58 | test_that("all groups are preserved", { 59 | 60 | iris_g <- iris %>% 61 | group_by(Species) 62 | 63 | x <- bootstrapify(iris_g, 5) 64 | 65 | x_c <- collect(x) 66 | 67 | expect_equal( 68 | group_vars(x_c), 69 | c("Species", ".bootstrap") 70 | ) 71 | 72 | expect_equal( 73 | nrow(x_c), 74 | sum(group_size(iris_g) * 5) 75 | ) 76 | 77 | }) 78 | 79 | test_that("`key` is propagated to `collect()`", { 80 | 81 | x <- bootstrapify(iris, 5, key = ".boot") 82 | 83 | expect_equal( 84 | colnames(collect(x))[1], 85 | ".boot" 86 | ) 87 | 88 | }) 89 | 90 | test_that("can collect with `id`", { 91 | 92 | x <- bootstrapify(iris, 5) 93 | 94 | x_c <- collect(x, id = ".id") 95 | 96 | expect_equal( 97 | colnames(x_c), 98 | c(".bootstrap", ".id", colnames(iris)) 99 | ) 100 | 101 | expect_equal( 102 | x_c$.id, 103 | rep(1:150, times = 5) 104 | ) 105 | 106 | }) 107 | 108 | test_that("can collect with `original_id`", { 109 | 110 | x <- bootstrapify(iris, 5) 111 | 112 | x_c <- collect(x, original_id = ".o_id") 113 | 114 | expect_equal( 115 | colnames(x_c), 116 | c(".bootstrap", ".o_id", colnames(iris)) 117 | ) 118 | 119 | expect_equal( 120 | x_c$.o_id, 121 | unlist(group_data(x)$.rows) 122 | ) 123 | 124 | }) 125 | 126 | test_that("can collect with `id` and `original_id`", { 127 | 128 | x <- bootstrapify(iris, 5) 129 | 130 | x_c <- collect(x, id = ".id", original_id = ".o_id") 131 | 132 | # want this column order. most intuitive. 133 | expect_equal( 134 | colnames(x_c), 135 | c(".bootstrap", ".id", ".o_id", colnames(iris)) 136 | ) 137 | 138 | }) 139 | 140 | test_that("`id` must be a single character", { 141 | 142 | x <- bootstrapify(iris, 5) 143 | 144 | expect_error( 145 | collect(x, id = 1), 146 | "`id` must be a character of size 1." 147 | ) 148 | 149 | }) 150 | 151 | test_that("`original_id` must be a single character", { 152 | 153 | x <- bootstrapify(iris, 5) 154 | 155 | expect_error( 156 | collect(x, original_id = 1), 157 | "`original_id` must be a character of size 1." 158 | ) 159 | 160 | }) 161 | -------------------------------------------------------------------------------- /tests/testthat/test-compat-tidyr.R: -------------------------------------------------------------------------------- 1 | context("test-compat-tidyr") 2 | 3 | test_that("nest()", { 4 | 5 | x <- bootstrapify(iris, 5) 6 | 7 | expect_error( 8 | x_n <- nest(x), 9 | NA 10 | ) 11 | 12 | expect_equal( 13 | x_n$.bootstrap, 14 | 1:5 15 | ) 16 | 17 | expect_equal( 18 | colnames(nest(x, .x = -.bootstrap)), 19 | c(".bootstrap", ".x") 20 | ) 21 | 22 | }) 23 | -------------------------------------------------------------------------------- /tests/testthat/test-dplyr-compat.R: -------------------------------------------------------------------------------- 1 | context("test-dplyr-compat") 2 | 3 | library(dplyr) 4 | 5 | test_that("can ungroup() to lose `resampled_df`", { 6 | 7 | x <- bootstrapify(iris, 5) 8 | 9 | x_ug <- ungroup(x) 10 | 11 | expect_is(x_ug, "tbl_df") 12 | expect_false("resampled_df" %in% class(x_ug)) 13 | expect_equal(attr(x_ug, "groups"), NULL) 14 | 15 | }) 16 | 17 | test_that("can group_by() after bootstrapify()", { 18 | 19 | x <- bootstrapify(iris, 5) 20 | 21 | x_g1 <- group_by(x, Species) 22 | 23 | expect_is(x_g1, "grouped_df") 24 | expect_false("resampled_df" %in% class(x_g1)) 25 | expect_false(".bootstrap" %in% colnames(x_g1)) 26 | 27 | x_g2 <- group_by(x, Species, add = TRUE) 28 | 29 | expect_is(x_g2, "grouped_df") 30 | expect_false("resampled_df" %in% class(x_g2)) 31 | expect_true(".bootstrap" %in% colnames(x_g2)) 32 | expect_equal(group_vars(x_g2), c(".bootstrap", "Species")) 33 | 34 | expect_equal( 35 | nrow(x_g2), 36 | 5 * 150 37 | ) 38 | 39 | x_g3 <- group_by(x, add = TRUE) 40 | 41 | expect_is(x_g3, "grouped_df") 42 | expect_false("resampled_df" %in% class(x_g3)) 43 | expect_true(".bootstrap" %in% colnames(x_g3)) 44 | expect_equal(group_vars(x_g3), ".bootstrap") 45 | 46 | x_g4 <- group_by(x) 47 | 48 | expect_is(x_g4, "tbl_df") 49 | expect_false("grouped_df" %in% class(x_g4)) 50 | expect_false("resampled_df" %in% class(x_g4)) 51 | expect_false(".bootstrap" %in% colnames(x_g4)) 52 | 53 | # We know we get an error here 54 | # as the bootstraps aren't materialized 55 | expect_error( 56 | group_by(x, .bootstrap) 57 | ) 58 | 59 | }) 60 | 61 | test_that("mutate()", { 62 | x <- bootstrapify(iris, 2) 63 | expect_equal( 64 | nrow(mutate(x, x = 4)), 65 | 300 66 | ) 67 | }) 68 | 69 | test_that("transmute()", { 70 | x <- bootstrapify(iris, 2) 71 | expect_equal( 72 | nrow(transmute(x, x = 4)), 73 | 300 74 | ) 75 | }) 76 | 77 | test_that("filter()", { 78 | x <- bootstrapify(iris, 2) 79 | expect_equal( 80 | nrow(filter(x, .bootstrap <= 1)), 81 | 150 82 | ) 83 | }) 84 | 85 | test_that("arrange()", { 86 | x <- bootstrapify(iris, 2) 87 | 88 | expect_equal( 89 | nrow(arrange(x, desc(.bootstrap))), 90 | 300 91 | ) 92 | 93 | expect_equal( 94 | arrange(x, desc(.bootstrap))$.bootstrap, 95 | rep(c(2,1), each = 150) 96 | ) 97 | }) 98 | 99 | test_that("distinct()", { 100 | x <- bootstrapify(iris, 2) 101 | 102 | expect_equal( 103 | colnames(distinct(x)), 104 | c(".bootstrap", colnames(iris)) 105 | ) 106 | 107 | expect_equal( 108 | distinct(x, .bootstrap), 109 | group_by(tibble(.bootstrap = 1:2), .bootstrap) 110 | ) 111 | 112 | }) 113 | 114 | test_that("select()", { 115 | x <- bootstrapify(iris, 2) 116 | 117 | expect_equal( 118 | colnames(select(x, group_cols())), 119 | ".bootstrap" 120 | ) 121 | 122 | expect_equal( 123 | nrow(select(x, group_cols())), 124 | 300 125 | ) 126 | 127 | }) 128 | 129 | test_that("slice()", { 130 | x <- bootstrapify(iris, 2) 131 | 132 | expect_equal( 133 | nrow(slice(x, 1)), 134 | 2 135 | ) 136 | }) 137 | 138 | test_that("pull()", { 139 | x <- bootstrapify(iris, 2) 140 | 141 | expect_equal( 142 | length(pull(x, ".bootstrap")), 143 | 300 144 | ) 145 | 146 | expect_equal( 147 | length(pull(x, "Sepal.Length")), 148 | 300 149 | ) 150 | }) 151 | 152 | test_that("rename()", { 153 | x <- bootstrapify(iris, 2) 154 | 155 | expect_equal( 156 | colnames(rename(x, y = .bootstrap)), 157 | c("y", colnames(iris)) 158 | ) 159 | }) 160 | 161 | test_that("full_join()", { 162 | 163 | mini_iris <- iris[1:20, 1:2] 164 | x <- bootstrapify(mini_iris, 2) 165 | y <- tibble(Sepal.Length = 100, new = 1) 166 | 167 | x_fj <- full_join(x, y, by = "Sepal.Length") 168 | 169 | expect_equal( 170 | nrow(x_fj), 171 | 41 172 | ) 173 | 174 | expect_equal( 175 | x_fj$new, 176 | c(rep(NA_real_, 40), 1) 177 | ) 178 | 179 | expect_equal( 180 | colnames(x_fj), 181 | c(".bootstrap", "Sepal.Length", "Sepal.Width", "new") 182 | ) 183 | 184 | }) 185 | 186 | test_that("inner_join()", { 187 | 188 | mini_iris <- iris[1:20, 1:2] 189 | x <- bootstrapify(mini_iris, 2) 190 | y <- tibble(Sepal.Length = 100, new = 1) 191 | 192 | x_ij <- inner_join(x, y, by = "Sepal.Length") 193 | 194 | expect_equal( 195 | nrow(x_ij), 196 | 0 197 | ) 198 | 199 | expect_equal( 200 | colnames(x_ij), 201 | c(".bootstrap", "Sepal.Length", "Sepal.Width", "new") 202 | ) 203 | 204 | }) 205 | 206 | test_that("left_join()", { 207 | 208 | mini_iris <- iris[1:20, 1:2] 209 | x <- bootstrapify(mini_iris, 2) 210 | y <- tibble(Sepal.Length = 100, new = 1) 211 | 212 | x_lj <- left_join(x, y, by = "Sepal.Length") 213 | 214 | expect_equal( 215 | nrow(x_lj), 216 | 40 217 | ) 218 | 219 | expect_equal( 220 | x_lj$new, 221 | rep(NA_real_, times = 40) 222 | ) 223 | 224 | expect_equal( 225 | colnames(x_lj), 226 | c(".bootstrap", "Sepal.Length", "Sepal.Width", "new") 227 | ) 228 | 229 | }) 230 | 231 | test_that("right_join()", { 232 | 233 | mini_iris <- iris[1:20, 1:2] 234 | x <- bootstrapify(mini_iris, 2) 235 | y <- tibble(Sepal.Length = 100, new = 1) 236 | 237 | x_rj <- right_join(x, y, by = "Sepal.Length") 238 | 239 | expect_equal( 240 | nrow(x_rj), 241 | 1 242 | ) 243 | 244 | expect_equal( 245 | x_rj$new, 246 | 1 247 | ) 248 | 249 | expect_equal( 250 | colnames(x_rj), 251 | c(".bootstrap", "Sepal.Length", "Sepal.Width", "new") 252 | ) 253 | 254 | }) 255 | 256 | test_that("anti_join()", { 257 | 258 | mini_iris <- iris[1:20, 1:2] 259 | x <- bootstrapify(mini_iris, 2) 260 | y <- tibble(Sepal.Length = 100, new = 1) 261 | 262 | x_aj <- anti_join(x, y, by = "Sepal.Length") 263 | 264 | expect_equal( 265 | x_aj, 266 | collect(x) 267 | ) 268 | 269 | }) 270 | 271 | test_that("semi_join()", { 272 | 273 | mini_iris <- iris[1:20, 1:2] 274 | x <- bootstrapify(mini_iris, 2) 275 | y <- tibble(Sepal.Length = 100, new = 1) 276 | 277 | x_sj <- semi_join(x, y, by = "Sepal.Length") 278 | 279 | expect_equal( 280 | x_sj, 281 | collect(x)[0,] 282 | ) 283 | 284 | }) 285 | 286 | # ------------------------------------------------------------------------------ 287 | # dplyr functions implictly supported 288 | 289 | context("test-dplyr-compat-extra") 290 | 291 | test_that("add_count()", { 292 | x <- bootstrapify(iris, 2) 293 | expect_equal( 294 | nrow(add_count(x)), 295 | 300 296 | ) 297 | }) 298 | 299 | test_that("add_tally()", { 300 | x <- bootstrapify(iris, 2) 301 | 302 | expect_equal( 303 | nrow(add_tally(x)), 304 | 300 305 | ) 306 | 307 | expect_equal( 308 | nrow(add_tally(x, .bootstrap, sort = TRUE)), 309 | 300 310 | ) 311 | 312 | expect_equal( 313 | unique(add_tally(x, .bootstrap, sort = TRUE)$n), 314 | c(300, 150) 315 | ) 316 | }) 317 | 318 | test_that("as_tibble()", { 319 | 320 | x <- bootstrapify(iris, 2) 321 | 322 | x_at <- as_tibble(x) 323 | 324 | # can convert to tibble without expanding 325 | # virtual groups 326 | expect_equal( 327 | nrow(x_at), 328 | 150 329 | ) 330 | 331 | expect_false(".bootstrap" %in% colnames(x_at)) 332 | expect_false("resampled_df" %in% class(x_at)) 333 | 334 | }) 335 | 336 | test_that("bind_rows() fails sadly", { 337 | 338 | x <- bootstrapify(iris, 2) 339 | 340 | # Cant currently do anything about this 341 | expect_error( 342 | bind_rows(x, iris), 343 | "Column `.bootstrap` is unknown" 344 | ) 345 | 346 | }) 347 | 348 | test_that("bind_cols() works", { 349 | 350 | x <- bootstrapify(iris, 2) 351 | 352 | x_bc_1 <- bind_cols(x, iris) 353 | 354 | expect_is(x_bc_1, "resampled_df") 355 | 356 | expect_equal( 357 | ncol(x_bc_1), 358 | 10 359 | ) 360 | 361 | expect_equal( 362 | nrow(x_bc_1), 363 | 150 364 | ) 365 | 366 | expect_equal( 367 | nrow(collect(x_bc_1)), 368 | 300 369 | ) 370 | 371 | x_bc_2 <- bind_cols(iris, x) 372 | 373 | expect_is(x_bc_2, "data.frame") 374 | expect_false("tbl_df" %in% class(x_bc_2)) 375 | }) 376 | 377 | test_that("sample_n()", { 378 | x <- bootstrapify(iris, 2) 379 | 380 | expect_equal( 381 | nrow(sample_n(x, 5)), 382 | 10 383 | ) 384 | }) 385 | -------------------------------------------------------------------------------- /tests/testthat/test-dplyr-do.R: -------------------------------------------------------------------------------- 1 | context("test-dplyr-do") 2 | 3 | library(dplyr) 4 | 5 | test_that("can use do() with bootstraps", { 6 | 7 | x <- bootstrapify(iris, 5) 8 | 9 | expect_equal( 10 | do(x, .), 11 | collect(x) 12 | ) 13 | 14 | }) 15 | 16 | test_that("can return non-data frames", { 17 | 18 | x <- bootstrapify(iris, 5) 19 | 20 | x_do <- do(x, mod = lm(Sepal.Width ~ Species, data = .)) 21 | 22 | expect_is(x_do, "rowwise_df") 23 | expect_false("resampled_df" %in% class(x_do)) 24 | 25 | expect_equal( 26 | nrow(x_do), 27 | 5 28 | ) 29 | 30 | }) 31 | 32 | test_that("multiple groups", { 33 | 34 | x <- iris %>% 35 | group_by(Species) %>% 36 | bootstrapify(5) 37 | 38 | x_do <- do(x, mod = lm(Sepal.Width ~ Sepal.Length, data = .)) 39 | 40 | expect_equal( 41 | nrow(x_do), 42 | 15 43 | ) 44 | 45 | }) 46 | 47 | test_that("double bootstrap", { 48 | 49 | x <- iris %>% 50 | bootstrapify(5) %>% 51 | bootstrapify(10, key = "b2") 52 | 53 | x_do <- do(x, mod = lm(Sepal.Width ~ Sepal.Length, data = .)) 54 | 55 | expect_equal( 56 | colnames(x_do), 57 | c(".bootstrap", "b2", "mod") 58 | ) 59 | 60 | }) 61 | -------------------------------------------------------------------------------- /tests/testthat/test-dplyr-group-funs.R: -------------------------------------------------------------------------------- 1 | context("test-dplyr-group-funs") 2 | 3 | library(dplyr) 4 | 5 | test_that("group_nest()", { 6 | 7 | x <- iris %>% 8 | bootstrapify(5) 9 | 10 | x_gn <- group_nest(x) 11 | 12 | expect_equal(nrow(x_gn), 5) 13 | expect_false("resampled_df" %in% class(x_gn)) 14 | 15 | xx <- bootstrapify(x, 10, key = "bs2") 16 | 17 | xx_gn <- group_nest(xx) 18 | 19 | expect_equal(nrow(xx_gn), 50) 20 | expect_false("resampled_df" %in% class(xx_gn)) 21 | 22 | }) 23 | 24 | test_that("group_nest() with `keep = TRUE`", { 25 | 26 | x <- iris %>% 27 | bootstrapify(5) 28 | 29 | x_gn <- group_nest(x, keep = TRUE) 30 | 31 | expect_equal( 32 | colnames(x_gn$data[[1]])[1], 33 | ".bootstrap" 34 | ) 35 | 36 | }) 37 | 38 | test_that("group_modify()", { 39 | 40 | x <- iris %>% 41 | bootstrapify(5) 42 | 43 | expect_equal( 44 | group_modify(x, ~.x), 45 | collect(x) 46 | ) 47 | 48 | x_gm <- group_modify(x, ~dplyr::tibble(.g = list(.y))) 49 | 50 | expect_equal(nrow(x_gm), 5) 51 | 52 | expect_equal( 53 | x_gm$.g[[1]], 54 | dplyr::tibble(.bootstrap = 1L) 55 | ) 56 | 57 | }) 58 | 59 | test_that("group_map()", { 60 | 61 | x <- iris %>% 62 | bootstrapify(5) 63 | 64 | # Don't check attributes, as group_split() 65 | # also adds the ptype as an attribute as of dplyr 0.8.2 66 | expect_equivalent( 67 | group_map(x, ~.x), 68 | group_split(x, keep = FALSE) 69 | ) 70 | 71 | x_gm <- group_map(x, ~dplyr::tibble(.g = list(.y))) 72 | 73 | expect_is(x_gm, "list") 74 | 75 | expect_identical( 76 | x_gm[[1]], 77 | tibble(.g = list(tibble(.bootstrap = 1L))) 78 | ) 79 | 80 | # `keep` argument 81 | expect_equal( 82 | unlist(group_map(x, ~ncol(.x), keep = TRUE)), 83 | rep(6, times = 5) 84 | ) 85 | 86 | }) 87 | 88 | test_that("group_walk()", { 89 | 90 | x <- iris %>% 91 | bootstrapify(5) 92 | 93 | res <- NULL 94 | 95 | group_walk(x, ~{ res <<- dplyr::bind_rows(res, .y) }) 96 | 97 | expect_equal( 98 | res, 99 | dplyr::tibble(.bootstrap = 1:5) 100 | ) 101 | 102 | res <- NULL 103 | 104 | group_walk(x, ~{ res <<- dplyr::bind_rows(res, .x) }) 105 | 106 | expect_equal( 107 | res, 108 | select(ungroup(collect(x)), -.bootstrap) 109 | ) 110 | 111 | }) 112 | 113 | test_that("group_split() - `keep` argument", { 114 | 115 | x <- iris %>% 116 | bootstrapify(5) 117 | 118 | x_gs <- group_split(x, keep = TRUE) 119 | 120 | expect_equal( 121 | colnames(x_gs[[1]]), 122 | c(".bootstrap", colnames(iris)) 123 | ) 124 | 125 | x_gs2 <- group_split(x, keep = FALSE) 126 | 127 | expect_equal( 128 | colnames(x_gs2[[1]]), 129 | colnames(iris) 130 | ) 131 | }) 132 | 133 | test_that("group_keys() can find the virtual groups", { 134 | 135 | x <- iris %>% 136 | group_by(Species) %>% 137 | bootstrapify(1) %>% 138 | bootstrapify(2) 139 | 140 | x_keys <- group_keys(x) 141 | 142 | expect_equal( 143 | x_keys[[1]], 144 | rep(unique(iris$Species), each = 2) 145 | ) 146 | 147 | expect_equal( 148 | x_keys[[2]], 149 | rep(1, times = 6) 150 | ) 151 | 152 | expect_equal( 153 | x_keys[[3]], 154 | rep(1:2, times = 3) 155 | ) 156 | }) 157 | 158 | test_that("group_data() finds virtual groups", { 159 | 160 | x <- iris %>% 161 | bootstrapify(2) 162 | 163 | x_gd <- group_data(x) 164 | 165 | expect_equal( 166 | x_gd$.bootstrap, 167 | c(1, 2) 168 | ) 169 | }) 170 | 171 | test_that("group_indices() returns collect()ed indices", { 172 | 173 | x <- iris %>% 174 | bootstrapify(2) 175 | 176 | x_gi <- group_indices(x) 177 | 178 | expect_equal( 179 | x_gi, 180 | c(rep(1, times = 150), rep(2, times = 150)) 181 | ) 182 | 183 | }) 184 | 185 | test_that("group_vars() returns virtual groups", { 186 | 187 | x <- iris %>% 188 | bootstrapify(2) 189 | 190 | expect_equal( 191 | group_vars(x), 192 | ".bootstrap" 193 | ) 194 | 195 | # I think it is correct to expect that tbl_vars() 196 | # doesn't return the virtual group 197 | expect_equal( 198 | as.character(tbl_vars(x)), 199 | colnames(iris) 200 | ) 201 | 202 | }) 203 | -------------------------------------------------------------------------------- /tests/testthat/test-dplyr-summarise.R: -------------------------------------------------------------------------------- 1 | context("test-dplyr-summarise") 2 | 3 | library(dplyr) 4 | 5 | test_that("summarise() works with a simple bootstrap", { 6 | 7 | x <- bootstrapify(iris, 5) 8 | 9 | expect_error( 10 | x_s <- summarise(x, y = mean(Sepal.Width)), 11 | NA 12 | ) 13 | 14 | expect_equal( 15 | x_s$.bootstrap, 16 | 1:5 17 | ) 18 | 19 | expect_false("grouped_df" %in% class(x_s)) 20 | expect_false("resampled_df" %in% class(x_s)) 21 | 22 | }) 23 | 24 | test_that("summarise() == summarize()", { 25 | 26 | x <- bootstrapify(iris, 5) 27 | 28 | expect_equal( 29 | summarise(x, y = mean(Sepal.Width)), 30 | summarize(x, y = mean(Sepal.Width)) 31 | ) 32 | 33 | }) 34 | 35 | test_that("works with a double bootstrap", { 36 | 37 | x <- iris %>% 38 | bootstrapify(5) %>% 39 | bootstrapify(10, key = "bs2") 40 | 41 | expect_error( 42 | x_s <- summarise(x, y = mean(Sepal.Width)), 43 | NA 44 | ) 45 | 46 | # We want it to be a grouped_df, but no longer a resampled_df 47 | # because the groups have been materialized 48 | expect_is(x_s, "grouped_df") 49 | expect_false("resampled_df" %in% class(x_s)) 50 | 51 | expect_equal( 52 | nrow(x_s), 53 | 5 * 10 54 | ) 55 | 56 | }) 57 | 58 | test_that("works with existing groups", { 59 | 60 | x <- iris %>% 61 | group_by(Species) %>% 62 | bootstrapify(5) 63 | 64 | expect_error( 65 | x_s <- summarise(x, y = mean(Sepal.Width)), 66 | NA 67 | ) 68 | 69 | # We want it to be a grouped_df, but no longer a resampled_df 70 | # because the groups have been materialized 71 | expect_is(x_s, "grouped_df") 72 | expect_false("resampled_df" %in% class(x_s)) 73 | 74 | expect_equal( 75 | nrow(x_s), 76 | 5 * 3 77 | ) 78 | 79 | expect_equal( 80 | colnames(x_s), 81 | c("Species", ".bootstrap", "y") 82 | ) 83 | 84 | }) 85 | -------------------------------------------------------------------------------- /tests/testthat/test-samplify.R: -------------------------------------------------------------------------------- 1 | context("test-samplify") 2 | 3 | # many of the would-be tests here are already covered by `test-bootstrapify.R` 4 | 5 | library(dplyr) 6 | 7 | test_that("can alter the `size` of each resample", { 8 | 9 | x <- samplify(iris, 1, 5) 10 | 11 | expect_equal( 12 | length(group_data(x)$.rows[[1]]), 13 | 5 14 | ) 15 | 16 | }) 17 | 18 | test_that("cannot specify vector `size` for ungrouped data frames", { 19 | 20 | expect_error( 21 | samplify(iris, 2, c(1, 5)), 22 | "a single integer \\(for ungrouped data frames\\)" 23 | ) 24 | 25 | }) 26 | 27 | test_that("can specify vector `size` for grouped data frames", { 28 | 29 | iris_g <- group_by(iris, Species) 30 | 31 | expect_error( 32 | x <- samplify(iris_g, 1, c(1, 2, 3)), 33 | NA 34 | ) 35 | 36 | x_gd <- group_data(x) 37 | 38 | expect_equal( 39 | vapply(x_gd$.rows, length, integer(1)), 40 | c(1, 2, 3) 41 | ) 42 | 43 | expect_error( 44 | samplify(x, 2, c(1, 5)), 45 | "must be size 1 or 3 \\(the number of groups\\)" 46 | ) 47 | 48 | }) 49 | 50 | test_that("cannot sample more than the number of rows without replacement", { 51 | 52 | expect_error( 53 | samplify(iris, 1, 151), 54 | "`size` \\(151\\) must be less than or equal to the size of the data / current group \\(150\\)" 55 | ) 56 | 57 | iris_g <- group_by(iris, Species) 58 | 59 | expect_error( 60 | samplify(iris_g, 1, c(49, 49, 51)), 61 | "`size` \\(51\\) must be less than or equal to the size of the data / current group \\(50\\)" 62 | ) 63 | 64 | }) 65 | 66 | test_that("can sample with replacement past the number of rows", { 67 | 68 | expect_error( 69 | x <- samplify(iris, 1, 151, replace = TRUE), 70 | NA 71 | ) 72 | 73 | x_gd <- group_data(x) 74 | 75 | expect_equal( 76 | length(x_gd$.rows[[1]]), 77 | 151 78 | ) 79 | 80 | iris_g <- group_by(iris, Species) 81 | 82 | expect_error( 83 | xx <- samplify(iris_g, 1, c(51, 55, 40), replace = TRUE), 84 | NA 85 | ) 86 | 87 | xx_gd <- group_data(xx) 88 | 89 | 90 | expect_equal( 91 | vapply(xx_gd$.rows, length, integer(1)), 92 | c(51, 55, 40) 93 | ) 94 | 95 | }) 96 | 97 | test_that("`replace` must be a bool", { 98 | expect_error( 99 | samplify(iris, 1, 1, replace = NA), 100 | "a single logical \\(TRUE/FALSE\\)" 101 | ) 102 | }) 103 | 104 | test_that("`size` is recycled as necessary", { 105 | 106 | iris_g <- group_by(iris, Species) 107 | 108 | x <- samplify(iris_g, 2, 5) 109 | 110 | x_gd <- group_data(x) 111 | 112 | expect_equal( 113 | unique(vapply(x_gd$.rows, length, integer(1))), 114 | 5 115 | ) 116 | 117 | }) 118 | -------------------------------------------------------------------------------- /vignettes/.gitignore: -------------------------------------------------------------------------------- 1 | *.html 2 | *.R 3 | -------------------------------------------------------------------------------- /vignettes/dplyr-support.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Support for dplyr" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{Support for dplyr} 6 | %\VignetteEncoding{UTF-8} 7 | %\VignetteEngine{knitr::rmarkdown} 8 | editor_options: 9 | chunk_output_type: console 10 | --- 11 | 12 | ```{r, include = FALSE} 13 | knitr::opts_chunk$set( 14 | collapse = TRUE, 15 | comment = "#>" 16 | ) 17 | ``` 18 | 19 | ```{r setup, message=FALSE, warning=FALSE} 20 | set.seed(123) 21 | library(strapgod) 22 | library(dplyr) 23 | ``` 24 | 25 | # Introduction 26 | 27 | As much as possible, strapgod attempts to let you use any dplyr function that you want on the `resampled_df` object that is returned by `bootstrapify()` and `samplify()`. Some functions have specialized behavior, like `summarise()`, while most others just call `collect()` to materialize the bootstrap rows before passing on to the underlying dplyr function. 28 | 29 | What follows is a list of the dplyr functions that have "special" properties when used on a `resampled_df`. 30 | 31 | # collect() 32 | 33 | The most important dplyr function for strapgod is `collect()`. Generally, this has been used to force a computation from a data base query and return the results as a tibble, and it has a similar context here. `collect()` forces the materialization of the virtual groups, and returns the full grouped tibble back to you. 34 | 35 | ```{r} 36 | x <- bootstrapify(iris, 10) 37 | 38 | # Not materialized 39 | x 40 | 41 | # Materialized 42 | collect(x) 43 | ``` 44 | 45 | When calling `collect()` directly, there are two arguments available to extract extra information about the bootstraps. 46 | 47 | `id` adds a sequence of integers from `1:n` for each bootstrap group. It would be equivalent to adding the `row_number()` by group after the `collect()`, but saves some typing. 48 | 49 | ```{r} 50 | collect(x, id = ".id") 51 | ``` 52 | 53 | `original_id` tacks on the original row of the current bootstrap observation. It is generally more useful than `id`, as it provides a way to link the bootstrap rows back to the original data. 54 | 55 | ```{r} 56 | collect(x, original_id = ".original_id") 57 | ``` 58 | 59 | # summarise() 60 | 61 | The motivation for this package was `summarise()`. It efficiently computes the summary results, only materializing the bootstrap rows as they are needed at the C++ level. 62 | 63 | ```{r} 64 | summarise(x, mean_length = mean(Sepal.Length)) 65 | ``` 66 | 67 | You can group by other columns before creating the virtual groups, and `bootstrapify()` will respect those extra groups in the `summarise()` call. Pay attention to how easy it is to go from a non-bootstrapped version to a bootstrapped version. It's just one extra line! 68 | 69 | ```{r} 70 | # Non-bootstrapped 71 | iris %>% 72 | group_by(Species) %>% 73 | summarise( 74 | mean_length_across_species = mean(Sepal.Length) 75 | ) 76 | 77 | # Bootstrapped 78 | iris %>% 79 | group_by(Species) %>% 80 | bootstrapify(5) %>% 81 | summarise( 82 | mean_length_across_species = mean(Sepal.Length) 83 | ) 84 | ``` 85 | 86 | # do() 87 | 88 | While `dplyr::do()` is basically deprecated and has been replaced by `group_modify()`, it still has its uses sometimes. Like `summarise()`, `do()` materializes the groups only when they are required. Here we run the same linear model on each bootstrapped set of data. 89 | 90 | ```{r} 91 | do(x, model = lm(Sepal.Length ~ Sepal.Width, data = .)) 92 | ``` 93 | 94 | # group_nest() 95 | 96 | `group_nest()` will materialize the groups so that they become columns in the outer tibble after the nest has been performed. 97 | 98 | ```{r} 99 | group_nest(x) 100 | ``` 101 | 102 | You can set `keep = TRUE` to include the groups in the inner tibbles as well. 103 | 104 | ```{r} 105 | group_nest(x, keep = TRUE)$data[[1]] 106 | ``` 107 | 108 | # group_split() 109 | 110 | `group_split()` allows you to materialize all of the bootstrap tibbles into separate tibbles, all bundled together into a list. 111 | 112 | ```{r} 113 | group_split(x) %>% head(n = 3) 114 | ``` 115 | 116 | You can specify `keep = FALSE` if you never want to see the bootstrap columns. 117 | 118 | ```{r} 119 | group_split(x, keep = FALSE) %>% head(n = 3) 120 | ``` 121 | 122 | # group_modify() 123 | 124 | `group_modify()` is similar to `do()`, but (as of dplyr 0.8.0.1) always returns a data frame and gives you access to the non-group and group data separately. 125 | 126 | ```{r} 127 | # Just show the first 2 rows of each bootstrap 128 | group_modify(x, ~head(.x, n = 2)) 129 | 130 | # As you iterate though each group, you have access to that 131 | # group's metadata through `.y` if you need it. 132 | group_modify_group_data <- group_modify(x, ~tibble(.g = list(.y))) 133 | 134 | group_modify_group_data 135 | 136 | group_modify_group_data$.g[[1]] 137 | ``` 138 | 139 | Like `do()`, it can be a convenient way to run multiple models as long as you return a data frame from each one. 140 | 141 | ```{r} 142 | x %>% 143 | group_by(Species, add = TRUE) %>% 144 | group_modify(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x))) 145 | ``` 146 | 147 | # ungroup() 148 | 149 | `ungroup()` will return the original tibble back to you, without materializing the virtual groups. 150 | 151 | ```{r} 152 | ungroup(x) 153 | ``` 154 | 155 | # as_tibble() 156 | 157 | Like `ungroup()`, you can get the original tibble back by converting it to one explicitly with `as_tibble()`. 158 | 159 | ```{r} 160 | as_tibble(x) 161 | ``` 162 | 163 | # Other dplyr functions 164 | 165 | Most other dplyr functions work by first calling `collect()`, and then passing off to the underlying dplyr implementation. This means you can use `mutate()` like so: 166 | 167 | ```{r} 168 | mutate(x, mean = mean(Sepal.Length)) 169 | ``` 170 | 171 | This doesn't really get you anything in terms of speed, but can be convenient as an automatic way to convert back to a tibble and keep going with your workflow. 172 | -------------------------------------------------------------------------------- /vignettes/virtual-bootstraps.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Virtual Bootstraps" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{Virtual Bootstraps} 6 | %\VignetteEncoding{UTF-8} 7 | %\VignetteEngine{knitr::rmarkdown} 8 | editor_options: 9 | chunk_output_type: console 10 | --- 11 | 12 | ```{r, include = FALSE} 13 | knitr::opts_chunk$set( 14 | collapse = TRUE, 15 | comment = "#>" 16 | ) 17 | ``` 18 | 19 | ```{r setup, message=FALSE, warning=FALSE} 20 | set.seed(123) 21 | library(strapgod) 22 | library(dplyr) 23 | iris <- as_tibble(iris) 24 | ``` 25 | 26 | # Introduction 27 | 28 | The goal of strapgod is to make it easy to create _virtual groups_ on top of tibbles for use with resampling. This means that your tibble is grouped, but you don't actually "materialize" the groups until you actually need them. By doing this, some computations involving large amounts of bootstraps or resamples can be made much more efficient. 29 | 30 | # Creating resampled data frames 31 | 32 | There are two core functions that help you generate a `resampled_df` object. 33 | 34 | `bootstrapify()` takes a data frame and bootstraps the rows of that data frame a set number of `times` to generate the virtual groups. 35 | 36 | ```{r} 37 | iris_boot <- bootstrapify(iris, times = 10) 38 | 39 | nrow(iris) 40 | nrow(iris_boot) 41 | 42 | iris_boot 43 | ``` 44 | 45 | What you'll immediately notice is that: 46 | 47 | - The tibble still only has 150 rows. 48 | 49 | - The tibble is now grouped by `.bootstrap`, which isn't a column in the tibble. 50 | 51 | The invisible `.bootstrap` column is the virtual group. It hasn't been materialized (there are still only 150 rows, not 150 * 10 rows), but dplyr still seems to know about it. 52 | 53 | `samplify()` is the other function that can generate resampled tibbles. It is a slight generalization of `bootstrapify()` that also allows you to specify the size of each resample, and if you want to resample with replacement or not. 54 | 55 | ```{r} 56 | iris_samp <- samplify(iris, times = 10, size = 20, replace = FALSE) 57 | 58 | iris_samp 59 | ``` 60 | 61 | This result: 62 | 63 | - Has 10 resamples 64 | 65 | - Each one is of size 20 66 | 67 | - And the resampling was done without replacement each time 68 | 69 | # Resampled summaries 70 | 71 | What can you do with these neat resampled data frames? Great question! For one thing, you can `summarise()` the tibble to compute bootstrapped summaries quickly and efficiently. 72 | 73 | ```{r} 74 | # without the bootstrap 75 | iris %>% 76 | summarise( 77 | mean_length = mean(Sepal.Length) 78 | ) 79 | 80 | # with the bootstrap 81 | iris %>% 82 | bootstrapify(10) %>% 83 | summarise( 84 | mean_length = mean(Sepal.Length) 85 | ) 86 | ``` 87 | 88 | This makes it easy to compute bootstrapped estimates of individual statistics, along with bootstrapped standard deviations around those estimates. 89 | 90 | ```{r} 91 | iris %>% 92 | bootstrapify(10) %>% 93 | summarise(mean_length = mean(Sepal.Length)) %>% 94 | summarise( 95 | bootstrapped_mean = mean(mean_length), 96 | bootstrapped_sd = sd(mean_length) 97 | ) 98 | ``` 99 | 100 | If you want, you can take an existing grouped data frame and bootstrapify that as well, allowing you to compute bootstrapped statistics _across_ some other variable. 101 | 102 | ```{r} 103 | iris_group_strap <- iris %>% 104 | group_by(Species) %>% 105 | bootstrapify(100) 106 | 107 | iris_group_strap 108 | ``` 109 | 110 | Reusing the code from above, we can now compute bootstrapped estimates for the mean `Sepal.Length` of each `Species`, along with standard deviations around those estimates. 111 | 112 | ```{r} 113 | iris_group_strap %>% 114 | summarise(mean_length = mean(Sepal.Length)) %>% 115 | summarise( 116 | bootstrapped_mean = mean(mean_length), 117 | bootstrapped_sd = sd(mean_length) 118 | ) 119 | ``` 120 | 121 | # Understanding virtual groups 122 | 123 | The virtual groups are stored in the `group_data()` metadata of the `resampled_df` object. Every grouped data frame has one of these, and they are used internally to power the dplyr `group_by()` system. 124 | 125 | ```{r} 126 | group_data(iris_boot) 127 | ``` 128 | 129 | The `.bootstrap` column contains the unique values of the groups, and the `.rows` column is a list column, where each element is an integer vector. That integer vector holds the rows that belong to that specific group. So, for `.bootstrap == 1`, there is a vector with 150 integers identifying the rows belonging to that resample. 130 | 131 | ```{r} 132 | group_data(iris_boot)$.rows[[1]] 133 | ``` 134 | 135 | When a call to `collect()` is made, this row index information is used to construct the output. Essentially, we start with the `group_data()` and utilize the `.rows` info to replicate the rows of the original data frame for each group, building up the complete resampled data frame. Notice how we now have the `150 * 10 = 1500` rows from the 10 bootstraps. 136 | 137 | ```{r} 138 | collect(iris_boot) 139 | ``` 140 | 141 | To learn more about `collect()`, and the other supported dplyr functions in strapgod, read the `vignette("dplyr-support", "strapgod")`. 142 | --------------------------------------------------------------------------------