├── .travis.yml
├── DESCRIPTION
├── NAMESPACE
├── NEWS.md
├── R
│   ├── Crosstab.R
│   ├── Crosstab3way.R
│   ├── DesignEffectAndMOE.R
│   ├── InternalHelperFunctions.R
│   ├── SummaryStatistics.R
│   ├── Topline.R
│   ├── data.R
│   ├── globals.R
│   ├── moeCrosstab.R
│   ├── moeCrosstab3way.R
│   ├── moeTopline.R
│   ├── moeWaveCrosstab.R
│   └── moeWaveCrosstab3way.R
├── README.Rmd
├── README.html
├── README.md
├── cran-comments.md
├── data
│   └── illinois.rda
├── man
│   ├── crosstab.Rd
│   ├── crosstab_3way.Rd
│   ├── deff_calc.Rd
│   ├── figures
│   │   ├── README-unnamed-chunk-12-1.png
│   │   ├── README-unnamed-chunk-8-1.png
│   │   └── README-unnamed-chunk-9-1.png
│   ├── illinois.Rd
│   ├── moe_crosstab.Rd
│   ├── moe_crosstab_3way.Rd
│   ├── moe_topline.Rd
│   ├── moe_wave_crosstab.Rd
│   ├── moe_wave_crosstab_3way.Rd
│   ├── moedeff_calc.Rd
│   ├── summary_table.Rd
│   ├── topline.Rd
│   └── wtd_mean.Rd
└── vignettes
    ├── .gitignore
    ├── crosstab3way.Rmd
    ├── crosstabs.Rmd
    └── toplines.Rmd

/.travis.yml:
--------------------------------------------------------------------------------
# R for travis: see documentation at https://docs.travis-ci.com/user/languages/r

language: R
cache: packages

--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
Package: pollster
Type: Package
Title: Calculate Crosstab and Topline Tables of Weighted Survey Data
Version: 0.1.6
Author: John D. Johnson [aut, cre]
Maintainer: John D. Johnson
Description: Calculate common types of tables for weighted survey data.
    Options include topline and (2-way and 3-way) crosstab tables of
    categorical or ordinal data as well as summary tables of weighted
    numeric variables. Optionally, include the margin of error at
    selected confidence intervals including the design effect. The
    design effect is calculated as described by
    Kish (1965) beginning
    on page 257. Output takes the form of tibbles (simple data frames).
    This package conveniently handles labelled data, such as that
    commonly used by 'Stata' and 'SPSS.' Complex survey design is
    not supported at this time.
Depends:
    R (>= 2.10)
Imports:
    dplyr (>= 0.8.0),
    stringr (>= 1.0.0),
    tidyr (>= 1.1.0),
    labelled (>= 2.0.0),
    forcats (>= 1.0.0),
    rlang (>= 0.4.5)
Suggests:
    ggplot2 (>= 3.3.0),
    knitr,
    rmarkdown
License: CC0
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.2.0
VignetteBuilder: knitr

--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
# Generated by roxygen2: do not edit by hand

export(crosstab)
export(crosstab_3way)
export(deff_calc)
export(moe_crosstab)
export(moe_crosstab_3way)
export(moe_topline)
export(moe_wave_crosstab)
export(moe_wave_crosstab_3way)
export(moedeff_calc)
export(summary_table)
export(topline)
export(wtd_mean)
import(dplyr)
import(labelled)
import(rlang)
import(stringr)
import(tidyr)
importFrom(stats,weighted.mean)

--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
# pollster 0.1.6
* fix bug which led to explicitly missing factor levels still being included in the "valid percent" column.

# pollster 0.1.5
* replace deprecated `forcats::fct_explicit_na` with `forcats::fct_na_value_to_level`
* require at least forcats v1.0.0

# pollster 0.1.4
* rows are explicitly rearranged by x and z in wide 3-way crosstabs
* weights of 0 are removed prior to calculation of the design effect
* an error is now given in all table-creating functions if the weight variable includes NA values

# pollster 0.1.3
* crosstab functions now include an option to include a column with unweighted frequencies. currently it is not available for column percents.
* a bug is fixed that gave an error in `crosstab(..., pct_type = "col", format = "wide", n = FALSE)`
* `crosstab_3way` now places the n column at the end of the dataframe, consistent with `crosstab`
* fix bug in `moe_crosstab` & `moe_crosstab_3way` that reported unweighted n in place of weighted n

# pollster 0.1.2

* pollster now depends on the most recent version of tidyr (1.1.0) because it uses the argument `names_sort = TRUE` to ensure that `tidyr::pivot_wider` arranges rows and columns in the order of their factor levels.

# pollster 0.1.1

* improvements to how crosstab functions conditionally convert factor to date class. This includes removing the lubridate dependency.
* crosstab functions now convert factors in crosstabs to numeric values when all values are numeric
* crosstabs now show a value of 0% instead of NA when there are no values.
* add CRAN installation to readme


# pollster 0.1.0

* package accepted by CRAN on 03/25/2020

--------------------------------------------------------------------------------
/R/Crosstab.R:
--------------------------------------------------------------------------------
#' weighted crosstabs
#'
#' \code{crosstab} returns a tibble containing a weighted crosstab of two variables
#'
#' Options include row, column, or cell percentages. The tibble can be in long or wide format.
#'
#' @param df The data source
#' @param x The independent variable
#' @param y The dependent variable
#' @param weight The weighting variable
#' @param remove An optional character vector of values to remove from final table (e.g. "refused").
#' This will not affect any calculations made. The vector is not case-sensitive.
#' @param n logical, if TRUE numeric totals are included. They are included in a separate column for row and cell
#' percentages, but in a separate row for wide format column percentages.
#' @param pct_type Controls the kind of percentage values returned. One of "row," "cell," or "column."
#' @param format one of "long" or "wide"
#' @param unwt_n logical, if TRUE a column "unweighted_n" is included containing the unweighted frequency count.
#' It is not available when pct_type is "column"
#'
#' @return a tibble
#' @export
#' @import dplyr
#' @import stringr
#' @import tidyr
#' @import labelled
#' @import rlang
#'
#' @examples
#' crosstab(df = illinois, x = voter, y = raceethnic, weight = weight)
#' crosstab(df = illinois, x = voter, y = raceethnic, weight = weight, n = FALSE)

crosstab <- function(df, x, y, weight, remove = "", n = TRUE,
                     pct_type = "row", format = "wide", unwt_n = FALSE){

  # make sure no weights are NA
  w <- df %>% pull({{weight}})
  if(length(w[is.na(w)]) > 0){
    stop("The weight variable contains missing values.", call. = FALSE)
  }

  # make sure the arguments are all correct
  stopifnot(pct_type %in% c("row", "cell", "column", "col"),
            format %in% c("wide", "long"))

  if(str_to_lower(pct_type) == "row"){
    d.output <- df %>%
      # Remove missing cases
      filter(!is.na({{x}}),
             !is.na({{y}})) %>%
      # Convert to ordered factors
      mutate({{x}} := to_factor({{x}}, sort_levels = "values"),
             {{y}} := to_factor({{y}}, sort_levels = "values")) %>%
      # Calculate denominator
      group_by({{x}}) %>%
      mutate(total = sum({{weight}}),
             unweighted_n = n()) %>%
      # Calculate proportions
      group_by({{x}}, {{y}}) %>%
      summarise(pct = (sum({{weight}})/first(total))*100,
                n = first(total),
                unweighted_n = first(unweighted_n)) %>%
      # Remove values included in "remove" string
      filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
             !str_to_upper({{y}}) %in% str_to_upper(remove)) %>%
      ungroup()

    # spread if format = wide
    if(format == "wide"){
      d.output <- d.output %>%
        # Spread so x is rows and y is columns
        pivot_wider(names_from = {{y}}, values_from = pct,
                    values_fill = list(pct = 0), names_sort = TRUE) %>%
        # move total row to end
        select(-one_of("n", "unweighted_n"), one_of("n",
                                                    "unweighted_n")) %>%
        ungroup()
    }

    # remove n if required
    if(n == FALSE){
      d.output <- select(d.output, -n)
    }

    # remove unweighted n if required
    if(unwt_n == FALSE){
      d.output <- select(d.output, -unweighted_n)
    }

  } else if(str_to_lower(pct_type) %in% c("col", "column")){
    d.output <- df %>%
      # Remove missing cases
      filter(!is.na({{x}}),
             !is.na({{y}})) %>%
      # Convert to ordered factors
      mutate({{x}} := to_factor({{x}}, sort_levels = "values"),
             {{y}} := to_factor({{y}}, sort_levels = "values")) %>%
      # calculate denominator
      group_by({{y}}) %>%
      mutate(total = sum({{weight}})) %>%
      # calculate proportions
      group_by({{x}}, {{y}}) %>%
      summarise(pct = (sum({{weight}})/first(total))*100,
                n = first(total)) %>%
      ungroup() %>%
      # remove values included in "remove" string
      filter(! str_to_upper({{x}}) %in% str_to_upper(remove),
             ! str_to_upper({{y}}) %in% str_to_upper(remove))

    if(format == "wide"){
      # make the total row separately
      total.row <- d.output %>%
        group_by({{y}}, n) %>%
        summarise() %>%
        pivot_wider(names_from = {{y}}, values_from = n, values_fill = list(pct = 0),
                    names_sort = TRUE) %>%
        mutate({{x}} := "n")

      # spread the output table
      d.output <- d.output %>%
        # drop the n column
        select(-n) %>%
        # spread so x is rows and y is columns
        pivot_wider(names_from = {{y}}, values_from = pct, values_fill = list(pct = 0),
                    names_sort = TRUE)

      # if n = TRUE, then add the n row
      # this causes the response column to switch from factor to character
      if(n == TRUE){
        d.output <- suppressWarnings(bind_rows(d.output, total.row))
      }
    }

    # remove n column if n == FALSE (long format only)
    if(n == FALSE & format == "long"){
      d.output <- select(d.output, -n)
    }

  } else
  if(str_to_lower(pct_type) == "cell"){
    d.output <- df %>%
      # Remove missing cases
      filter(!is.na({{x}}),
             !is.na({{y}})) %>%
      # Convert to ordered factors
      mutate({{x}} := to_factor({{x}}, sort_levels = "values"),
             {{y}} := to_factor({{y}}, sort_levels = "values")) %>%
      # Calculate denominator
      mutate(total = sum({{weight}}),
             unweighted_n = n()) %>%
      # Calculate proportions
      group_by({{x}}, {{y}}) %>%
      summarise(pct = (sum({{weight}})/first(total))*100,
                n = first(total),
                unweighted_n = first(unweighted_n)) %>%
      # Remove values included in "remove" string
      filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
             !str_to_upper({{y}}) %in% str_to_upper(remove))

    # if format = wide, then spread the table
    if(format == "wide"){
      d.output <- d.output %>%
        # Spread so x is rows and y is columns
        pivot_wider(names_from = {{y}}, values_from = pct, values_fill = 0,
                    names_sort = TRUE) %>%
        # move total row to end
        select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
        ungroup()
    }

    # remove n column if n == FALSE
    if(n == FALSE){
      d.output <- select(d.output, -n)
    }

    # remove unweighted n column if unwt_n == FALSE
    if(unwt_n == FALSE){
      d.output <- select(d.output, -unweighted_n)
    }
  }

  # test if date or number
  factor.true.type <- what_is_this_factor(pull(d.output, {{x}}))
  if(factor.true.type == "date"){
    d.output %>%
      as_tibble() %>%
      mutate({{x}} := as.Date({{x}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y")))
  } else if(factor.true.type == "number"){
    d.output %>%
      as_tibble() %>%
      mutate({{x}} := as.numeric(as.character({{x}})))
  } else{
    d.output %>%
      as_tibble()
  }
}

--------------------------------------------------------------------------------
/R/Crosstab3way.R:
--------------------------------------------------------------------------------
#' weighted 3-way crosstabs
#'
#' \code{crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable
#'
#' Options include row or cell percentages. The tibble can be in long or wide format.
#' These tables are ideal for use with small multiples created with ggplot2::facet_wrap.
#'
#' @param df The data source
#' @param x The independent variable
#' @param y The dependent variable
#' @param z The second control variable
#' @param weight The weighting variable
#' @param remove An optional character vector of values to remove from final table (e.g. "refused").
#' This will not affect any calculations made. The vector is not case-sensitive.
#' @param n logical, if TRUE numeric totals are included.
#' @param pct_type Controls the kind of percentage values returned. One of "row" or "cell."
#' @param format one of "long" or "wide"
#' @param unwt_n logical, if TRUE a column is added containing unweighted frequency counts
#'
#' @return a tibble
#' @export
#' @import dplyr
#' @import stringr
#' @import tidyr
#' @import labelled
#' @import rlang
#'
#' @examples
#' crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight)
#' crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight,
#' format = "wide")

crosstab_3way <- function(df, x, y, z,
                          weight, remove = c(""),
                          n = TRUE, pct_type = "row", format = "wide",
                          unwt_n = FALSE){

  # make sure no weights are NA
  w <- df %>% pull({{weight}})
  if(length(w[is.na(w)]) > 0){
    stop("The weight variable contains missing values.", call.
         = FALSE)
  }

  # make sure the arguments are all correct
  stopifnot(pct_type %in% c("row", "cell"),
            format %in% c("wide", "long"))

  # row percents
  if(pct_type == "row"){
    d.output <- df %>%
      # Remove missing cases
      filter(!is.na({{x}}),
             !is.na({{y}}),
             !is.na({{z}})) %>%
      # Convert to ordered factors
      mutate({{x}} := to_factor({{x}}, sort_levels = "values"),
             {{y}} := to_factor({{y}}, sort_levels = "values"),
             {{z}} := to_factor({{z}}, sort_levels = "values")) %>%
      # Calculate denominator
      group_by({{x}}, {{z}}) %>%
      mutate(total = sum({{weight}}),
             unweighted_n = n()) %>%
      # Calculate proportions
      group_by({{x}}, {{y}}, {{z}}) %>%
      summarise(pct = (sum({{weight}})/first(total))*100,
                n = first(total),
                unweighted_n = first(unweighted_n)) %>%
      # Remove values included in "remove" string
      filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
             !str_to_upper({{y}}) %in% str_to_upper(remove),
             !str_to_upper({{z}}) %in% str_to_upper(remove)) %>%
      ungroup()

    # wide format, if required
    if(str_to_lower(format) == "wide"){
      d.output <- d.output %>%
        pivot_wider(names_from = {{y}}, values_from = pct,
                    values_fill = list(pct = 0), names_sort = TRUE) %>%
        select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
        arrange({{x}}, {{z}})
    }
  } else if(pct_type == "cell"){
    d.output <- df %>%
      # Remove missing cases
      filter(!is.na({{x}}),
             !is.na({{y}}),
             !is.na({{z}})) %>%
      # Convert to ordered factors
      mutate({{x}} := to_factor({{x}}, sort_levels = "values"),
             {{y}} := to_factor({{y}}, sort_levels = "values"),
             {{z}} := to_factor({{z}}, sort_levels = "values")) %>%
      # Calculate denominator
      group_by({{z}}) %>%
      mutate(total = sum({{weight}}),
             unweighted_n = n()) %>%
      # Calculate proportions
      group_by({{x}}, {{y}}, {{z}}) %>%
      summarise(pct = (sum({{weight}})/first(total))*100,
                n = first(total),
                unweighted_n = first(unweighted_n)) %>%
      # Remove values included in "remove" string
      filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
             !str_to_upper({{y}}) %in% str_to_upper(remove),
             !str_to_upper({{z}}) %in% str_to_upper(remove)) %>%
      ungroup()

    # wide format, if required
    if(str_to_lower(format) == "wide"){
      d.output <- d.output %>%
        pivot_wider(names_from = {{y}}, values_from = pct,
                    values_fill = list(pct = 0), names_sort = TRUE) %>%
        select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
        arrange({{x}}, {{z}})
    }
  }


  if(n == FALSE){
    d.output <- select(d.output, -n)
  }

  if(unwt_n == FALSE){
    d.output <- select(d.output, -unweighted_n)
  }

  # test if date or number
  factor.true.type <- what_is_this_factor(pull(d.output, {{z}}))
  if(factor.true.type == "date"){
    d.output %>%
      as_tibble() %>%
      mutate({{z}} := as.Date({{z}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y")))
  } else if(factor.true.type == "number"){
    d.output %>%
      as_tibble() %>%
      mutate({{z}} := as.numeric(as.character({{z}})))
  } else{
    d.output %>%
      as_tibble()
  }
}

--------------------------------------------------------------------------------
/R/DesignEffectAndMOE.R:
--------------------------------------------------------------------------------
#' Calculate the design effect of a sample
#'
#' \code{deff_calc} returns a single number
#'
#' This function returns the design effect of a given sample using the formula
#' length(w)*sum(w^2)/(sum(w)^2).
#' It is designed for use in the moe family of functions. If any weights are equal to 0, they are removed prior to calculation.
#'
#' @param w a vector of weights
#'
#' @return A number
#' @export
#'
#' @examples
#' deff_calc(illinois$weight)
#'
deff_calc <- function(w){
  # check for weights of 0
  if(length(w[w == 0]) > 0){
    message("Your data includes weights equal to zero. These are removed before calculating the design effect.")
    # remove any weights that are 0
    w <- w[w > 0]
  }

  length(w)*sum(w^2)/(sum(w)^2)
}


#' Calculate the margin of error (including design effect) of a sample
#'
#' \code{moedeff_calc} returns a single number. It is designed for use in the moe family of functions.
#'
#' This function returns the margin of error including design effect of a given sample of weighted data using the formula
#' sqrt(deff)*zscore*sqrt((pct*(1-pct))/(n-1))*100
#'
#' @param pct a proportion
#' @param deff a design effect
#' @param n the sample size
#' @param zscore defaults to 1.96, consistent with a 95\% confidence interval.
#'
#' @return A percentage
#' @export
#'
#' @examples
#' moedeff_calc(pct = 0.515, deff = 1.6, n = 214)
moedeff_calc <- function(pct, deff, n, zscore = 1.96){
  sqrt(deff)*zscore*sqrt((pct*(1-pct))/(n-1))*100
}

--------------------------------------------------------------------------------
/R/InternalHelperFunctions.R:
--------------------------------------------------------------------------------
# Thanks to Berry Boessenkool for this function
# https://github.com/brry/berryFunctions/blob/master/R/is.error.R
# This function checks if a function returns an error
is_error <- function(expr){
  expr_name <- deparse(substitute(expr))
  test <- try(expr, silent=TRUE)
  iserror <- inherits(test, "try-error")
  # output:
  iserror
}

# This function checks if a value is a date
is_date <- function(potentialdate){
  if(is_error(as.Date(potentialdate, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y"))) == TRUE){
    FALSE
  } else {TRUE}
}

# This function checks if a value is a number
is_a_number <- function(potentialnumber){
  !is.na(suppressWarnings(as.numeric(as.character(potentialnumber))))
}

# this function takes a vector of factor values & checks if all the
# values are (1) a date or (2) a number.
what_is_this_factor <- function(factorvector){
  if(all(sapply({{factorvector}}, is_date))){
    "date"
  } else if(all(sapply({{factorvector}}, is_a_number))) {
    "number"
  } else {
    "factor"
  }
}

--------------------------------------------------------------------------------
/R/SummaryStatistics.R:
--------------------------------------------------------------------------------
#' weighted mean
#'
#' \code{wtd_mean} returns the weighted mean of a variable. It's a tidy-compatible
#' wrapper around stats::weighted.mean().
#'
#' @param df The data source
#' @param variable the variable, it should be numeric
#' @param weight The weighting variable
#'
#' @return a numeric value
#' @export
#' @import dplyr
#' @import rlang
#' @importFrom stats weighted.mean
#'
#' @examples
#' wtd_mean(illinois, age, weight)
#'
#' library(dplyr)
#' illinois %>% wtd_mean(age, weight)
wtd_mean <- function(df, variable, weight){
  df %>%
    summarise(mean = weighted.mean(x = {{variable}}, w = {{weight}})) %>%
    pull()
}

#' weighted summary table
#'
#' \code{summary_table} returns a tibble containing a weighted summary table of a single variable.
#'
#' The resulting tibble includes columns for the variable name, unweighted observations,
#' weighted observations, weighted mean, minimum value, maximum value,
#' unweighted missing values, and weighted missing values
#'
#' @param df The data source
#' @param variable the variable to summarize, it should be numeric
#' @param weight The weighting variable
#' @param name_style the style of the column names--one of "clean" or "pretty."
#' Clean names are all lower case and words are separated by an underscore.
#' Pretty names begin with a capital letter and words are separated by a space.
#'
#' @return a tibble
#' @export
#' @import dplyr
#' @import rlang
#' @importFrom stats weighted.mean
#'
#' @examples
#' summary_table(illinois, age, weight)
#' summary_table(illinois, age, weight, name_style = "pretty")
summary_table <- function(df, variable, weight, name_style = "clean"){

  # make sure no weights are NA
  w <- df %>% pull({{weight}})
  if(length(w[is.na(w)]) > 0){
    stop("The weight variable contains missing values.", call.
         = FALSE)
  }

  stopifnot(name_style %in% c("clean", "pretty"))

  unweighted_observations <- df %>%
    filter(!is.na({{variable}})) %>%
    pull({{variable}}) %>%
    length()

  weighted_observations <- df %>%
    filter(!is.na({{variable}})) %>%
    pull({{weight}}) %>%
    sum()

  weighted_mean <- df %>%
    wtd_mean({{variable}}, {{weight}})

  min_value <- df %>%
    summarise(min = min({{variable}}, na.rm = TRUE)) %>%
    pull()

  max_value <- df %>%
    summarise(max = max({{variable}}, na.rm = TRUE)) %>%
    pull()

  missing_observations <- df %>%
    filter(is.na({{variable}})) %>%
    pull({{variable}}) %>%
    length()

  missing_weighted_observations <- df %>%
    filter(is.na({{variable}})) %>%
    pull({{weight}}) %>%
    sum()

  variable_name <- df %>%
    select({{variable}}) %>%
    names()

  output <- tibble(variable_name, unweighted_observations, weighted_observations,
                   weighted_mean, min_value, max_value, missing_observations,
                   missing_weighted_observations)

  if(name_style == "pretty"){
    names(output) <- c("Variable", "Unweighted obs",
                       "Weighted obs", "Weighted mean", "Min", "Max",
                       "Unweighted missing", "Weighted missing")
  }

  output
}

--------------------------------------------------------------------------------
/R/Topline.R:
--------------------------------------------------------------------------------
#' weighted topline
#'
#' \code{topline} returns a tibble containing a weighted topline of one variable
#'
#' By default the table includes a column for frequency count, percent, valid percent, and cumulative percent.
#'
#' @param df The data source
#' @param variable the variable name
#' @param weight The weighting variable
#' @param remove An optional character vector of values to remove from final table (e.g.
#' "refused").
#' This will not affect any calculations made. The vector is not case-sensitive.
#' @param n logical, if TRUE a frequency column is included
#' @param pct logical, if TRUE a column of percents is included
#' @param valid_pct logical, if TRUE a column of valid percents is included
#' @param cum_pct logical, if TRUE a column of cumulative percents is included
#'
#' @return a tibble
#' @export
#' @import dplyr
#' @import stringr
#' @import tidyr
#' @import labelled
#' @import rlang
#'
#' @examples
#' topline(illinois, sex, weight)
#' topline(illinois, sex, weight, pct = FALSE)

topline <- function(df, variable, weight, remove = c(""), n = TRUE,
                    pct = TRUE, valid_pct = TRUE, cum_pct = TRUE){

  # make sure no weights are NA
  w <- df %>% pull({{weight}})
  if(length(w[is.na(w)]) > 0){
    stop("The weight variable contains missing values.", call.
         = FALSE)
  }

  # Make table
  d.output <- df %>%
    # Convert to ordered factors
    mutate({{variable}} := to_factor({{variable}}, sort_levels = "values"),
           {{variable}} := forcats::fct_na_value_to_level({{variable}},
                                                          level = "(Missing)")) %>%
    # Calculate denominator
    mutate(total = sum({{weight}}),
           valid.total = sum(({{weight}})[{{variable}} != "(Missing)"])) %>%
    # Calculate proportions
    group_by({{variable}}) %>%
    summarise(pct = (sum({{weight}})/first(total))*100,
              valid.pct = (sum({{weight}})/first(valid.total)*100),
              n = sum({{weight}})) %>%
    ungroup() %>%
    mutate(cum = cumsum(valid.pct),
           valid.pct = replace(valid.pct, {{variable}} == "(Missing)", NA),
           cum = replace(cum, {{variable}} == "(Missing)", NA)) %>%
    select(Response = {{variable}}, Frequency = n, Percent = pct,
           `Valid Percent` = valid.pct, `Cumulative Percent` = cum) %>%
    # Remove values included in "remove" string
    filter(! str_to_upper(Response) %in% str_to_upper(remove))

  # remove columns as requested
  if(valid_pct == FALSE){
    d.output <- select(d.output, -`Valid Percent`)
  }

  if(cum_pct == FALSE){
    d.output <- select(d.output, -`Cumulative Percent`)
  }

  if(n == FALSE){
    d.output <- select(d.output, -Frequency)
  }

  if(pct == FALSE){
    d.output <- select(d.output, -Percent)
  }

  d.output %>%
    as_tibble()
}

--------------------------------------------------------------------------------
/R/data.R:
--------------------------------------------------------------------------------
#' Illinois respondents to the Voting and Registration Supplement for the Current Population Survey
#'
#' A dataset containing the responses of 36,207 Illinois respondents to the Current
#' Population Survey's biennial Voting and Registration Supplement for the Current
#' Population Survey, 1996-2018.
#'
#' @format A data frame with 36207 rows and 9 variables:
#' \describe{
#'   \item{year}{year of survey}
#'   \item{fips}{the state fips code}
#'   \item{sex}{sex of the respondent, labelled value}
#'   \item{educ6}{highest level of education for respondent, labelled values}
#'   \item{raceethnic}{one of white, black, Hispanic, or other, labelled values}
#'   \item{maritalstatus}{one of Married, Widowed/divorced/Sep, or Never Married, labelled values}
#'   \item{rv}{indicates if the respondent is registered to vote, labelled values}
#'   \item{voter}{indicates if the respondent voted, labelled values}
#'   \item{age}{the age of the respondent, numeric values}
#'   \item{weight}{the number of people each respondent is calculated to represent}
#' }
#' @source \url{https://www.census.gov/topics/public-sector/voting.html}
"illinois"

--------------------------------------------------------------------------------
/R/globals.R:
--------------------------------------------------------------------------------
utils::globalVariables(c("Cumulative Percent", "Percent", "Frequency",
                         "Valid Percent", "cum", "d.output", "deff",
                         "moe", "observations", "pct", "total", "valid.pct",
                         "valid.total", "Response", "unweighted_n"))

--------------------------------------------------------------------------------
/R/moeCrosstab.R:
--------------------------------------------------------------------------------
#' weighted crosstabs with margin of error
#'
#' \code{moe_crosstab} returns a tibble containing a weighted crosstab of two variables with margin of error
#'
#' Options include row or cell percentages. The tibble can be in long or wide format. The margin of
#' error includes the design effect of the weights.
#'
#' @param df The data source
#' @param x The independent variable
#' @param y The dependent variable
#' @param weight The weighting variable
#' @param remove An optional character vector of values to remove from final table (e.g. "refused").
#' This will not affect any calculations made. The vector is not case-sensitive.
#' @param n logical, if TRUE numeric totals are included.
#' @param pct_type Controls the kind of percentage values returned. One of "row" or "cell."
#' Column percents are not supported.
#' @param format one of "long" or "wide"
#' @param zscore defaults to 1.96, consistent with a 95\% confidence interval
#' @param unwt_n logical, if TRUE it adds a column with unweighted frequency values
#'
#' @return a tibble
#' @export
#' @import dplyr
#' @import stringr
#' @import tidyr
#' @import labelled
#' @import rlang
#'
#' @examples
#' moe_crosstab(df = illinois, x = voter, y = raceethnic, weight = weight)
#' moe_crosstab(df = illinois, x = voter, y = raceethnic, weight = weight, n = FALSE)

moe_crosstab <- function(df, x, y, weight, remove = c(""),
                         n = TRUE, pct_type = "row", format = "long",
                         zscore = 1.96, unwt_n = FALSE){

  # make sure no weights are NA
  w <- df %>% pull({{weight}})
  if(length(w[is.na(w)]) > 0){
    stop("The weight variable contains missing values.", call.
         = FALSE)
  }

  # make sure the arguments are all correct
  stopifnot(pct_type %in% c("row", "cell"),
            format %in% c("wide", "long"))

  # calculate the design effect
  deff <- df %>% pull({{weight}}) %>% deff_calc()

  # build the table, either row percents or cell percents
  if(pct_type == "row"){
    d.output <- df %>%
      filter(!is.na({{x}}),
             !is.na({{y}})) %>%
      mutate({{x}} := to_factor({{x}}),
             {{y}} := to_factor({{y}})) %>%
      group_by({{x}}) %>%
      mutate(total = sum({{weight}}),
             unweighted_n = length({{weight}})) %>%
      group_by({{x}}, {{y}}) %>%
      summarise(observations = sum({{weight}}),
                pct = observations/first(total),
                n = first(total),
                unweighted_n = first(unweighted_n)) %>%
      ungroup() %>%
      mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>%
      mutate(pct = pct*100) %>%
      select(-observations) %>%
      # Remove values included in "remove" string
      filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
             !str_to_upper({{y}}) %in% str_to_upper(remove)) %>%
      # move total row to end
      select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n"))
  } else if(pct_type == "cell"){
    d.output <- df %>%
      filter(!is.na({{x}}),
             !is.na({{y}})) %>%
      mutate({{x}} := to_factor({{x}}),
             {{y}} := to_factor({{y}})) %>%
      # calculate denominator
      mutate(total = sum({{weight}}),
             unweighted_n = length({{weight}})) %>%
      group_by({{x}}, {{y}}) %>%
      summarise(observations = sum({{weight}}),
                pct = observations/first(total),
                n = first(total),
                unweighted_n = first(unweighted_n)) %>%
      ungroup() %>%
      mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>%
      mutate(pct = pct*100) %>%
      select(-observations) %>%
      # Remove values included in "remove" string
      filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
!str_to_upper({{y}}) %in% str_to_upper(remove)) %>% 95 | # move total row to end 96 | select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) 97 | } 98 | 99 | # convert to wide format if required 100 | if(format == "wide"){ 101 | d.output <- d.output %>% 102 | pivot_wider(names_from = {{y}}, values_from = c(pct, moe), 103 | values_fill = list(pct = 0), names_sort = TRUE) %>% 104 | select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) 105 | } 106 | 107 | # remove n if required 108 | if(n == FALSE){ 109 | d.output <- select(d.output, -n) 110 | } 111 | 112 | # remove unweighted_n if required 113 | if(unwt_n == FALSE){ 114 | d.output <- select(d.output, -unweighted_n) 115 | } 116 | 117 | # test if date or number 118 | factor.true.type <- what_is_this_factor(pull(d.output, {{x}})) 119 | if(factor.true.type == "date"){ 120 | d.output %>% 121 | as_tibble() %>% 122 | mutate({{x}} := as.Date({{x}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y"))) 123 | } else if(factor.true.type == "number"){ 124 | d.output %>% 125 | as_tibble() %>% 126 | mutate({{x}} := as.numeric(as.character({{x}}))) 127 | } else{ 128 | d.output %>% 129 | as_tibble() 130 | } 131 | } 132 | -------------------------------------------------------------------------------- /R/moeCrosstab3way.R: -------------------------------------------------------------------------------- 1 | #' weighted 3-way crosstabs with margin of error 2 | #' 3 | #' \code{moe_crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable with margin of error 4 | #' 5 | #' Options include row or cell percentages. The tibble can be in long or wide format. 6 | #' These tables are ideal for use with small multiples created with ggplot2::facet_wrap. 
7 | #' 8 | #' @param df The data source 9 | #' @param x The independent variable 10 | #' @param y The dependent variable 11 | #' @param z The second control variable 12 | #' @param weight The weighting variable 13 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused"). 14 | #' This will not affect any calculations made. The vector is not case-sensitive. 15 | #' @param n logical, if TRUE numeric totals are included. 16 | #' @param pct_type Controls the kind of percentage values returned. One of "row" or "cell." 17 | #' @param format one of "long" or "wide" 18 | #' @param zscore defaults to 1.96, consistent with a 95\% confidence interval 19 | #' @param unwt_n logical, if TRUE it adds a column with unweighted frequency values 20 | #' 21 | #' @return a tibble 22 | #' @export 23 | #' @import dplyr 24 | #' @import stringr 25 | #' @import tidyr 26 | #' @import labelled 27 | #' @import rlang 28 | #' 29 | #' @examples 30 | #' moe_crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight) 31 | #' moe_crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight, 32 | #' format = "wide") 33 | 34 | moe_crosstab_3way <- function(df, x, y, z, 35 | weight, remove = c(""), 36 | n = TRUE, pct_type = "row", 37 | format = "long", zscore = 1.96, 38 | unwt_n = FALSE){ 39 | 40 | # make sure no weights are NA 41 | w <- df %>% pull({{weight}}) 42 | if(length(w[is.na(w)]) > 0){ 43 | stop("The weight variable contains missing values.", call. 
= FALSE) 44 | } 45 | 46 | # make sure the arguments are all correct 47 | stopifnot(pct_type %in% c("row", "cell"), 48 | format %in% c("wide", "long")) 49 | 50 | # calculate the design effect 51 | deff <- df %>% pull({{weight}}) %>% deff_calc() 52 | 53 | # build the table, either row percents or cell percents 54 | if(pct_type == "row"){ 55 | d.output <- df %>% 56 | filter(!is.na({{x}}), 57 | !is.na({{y}}), 58 | !is.na({{z}})) %>% 59 | mutate({{x}} := to_factor({{x}}), 60 | {{y}} := to_factor({{y}}), 61 | {{z}} := to_factor({{z}})) %>% 62 | group_by({{z}}, {{x}}) %>% 63 | mutate(total = sum({{weight}}), 64 | unweighted_n = length({{weight}})) %>% 65 | group_by({{z}}, {{x}}, {{y}}) %>% 66 | summarise(observations = sum({{weight}}), 67 | pct = observations/first(total), 68 | n = first(total), 69 | unweighted_n = first(unweighted_n)) %>% 70 | ungroup() %>% 71 | mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>% 72 | mutate(pct = pct*100) %>% 73 | select(-observations) %>% 74 | # Remove values included in "remove" string 75 | filter(!str_to_upper({{x}}) %in% str_to_upper(remove), 76 | !str_to_upper({{y}}) %in% str_to_upper(remove), 77 | !str_to_upper({{z}}) %in% str_to_upper(remove)) %>% 78 | # move total row to end 79 | select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) 80 | } else if(pct_type == "cell"){ 81 | d.output <- df %>% 82 | filter(!is.na({{x}}), 83 | !is.na({{y}}), !is.na({{z}})) %>% 84 | mutate({{x}} := to_factor({{x}}), 85 | {{y}} := to_factor({{y}}), {{z}} := to_factor({{z}})) %>% 86 | # calculate denominator within each level of z (z handled as in the row branch) 87 | group_by({{z}}) %>% 88 | mutate(total = sum({{weight}}), 89 | unweighted_n = length({{weight}})) %>% 90 | group_by({{z}}, {{x}}, {{y}}) %>% 91 | summarise(observations = sum({{weight}}), 92 | pct = observations/first(total), 93 | n = first(total), 94 | unweighted_n = first(unweighted_n)) %>% 95 | ungroup() %>% 96 | mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>% 97 | mutate(pct = pct*100) %>%
98 | select(-observations) %>% 99 | # Remove values included in "remove" string 100 | filter(!str_to_upper({{x}}) %in% str_to_upper(remove), 101 | !str_to_upper({{y}}) %in% str_to_upper(remove)) %>% 102 | # move total row to end 103 | select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) 104 | } 105 | 106 | # convert to wide format if required 107 | if(format == "wide"){ 108 | d.output <- d.output %>% 109 | pivot_wider(names_from = {{y}}, values_from = c(pct, moe), 110 | values_fill = list(pct = 0, moe = 0)) %>% 111 | select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>% 112 | arrange({{x}}, {{z}}) 113 | } 114 | 115 | # remove n if required 116 | if(n == FALSE){ 117 | d.output <- select(d.output, -n) 118 | } 119 | 120 | # remove unweighted_n if required 121 | if(unwt_n == FALSE){ 122 | d.output <- select(d.output, -unweighted_n) 123 | } 124 | 125 | # test if date or number 126 | factor.true.type <- what_is_this_factor(pull(d.output, {{z}})) 127 | if(factor.true.type == "date"){ 128 | d.output %>% 129 | as_tibble() %>% 130 | mutate({{z}} := as.Date({{z}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y"))) 131 | } else if(factor.true.type == "number"){ 132 | d.output %>% 133 | as_tibble() %>% 134 | mutate({{z}} := as.numeric(as.character({{z}}))) 135 | } else{ 136 | d.output %>% 137 | as_tibble() 138 | } 139 | } 140 | -------------------------------------------------------------------------------- /R/moeTopline.R: -------------------------------------------------------------------------------- 1 | #' weighted topline with margin of error 2 | #' 3 | #' \code{moe_topline} returns a tibble containing a weighted topline of one variable with margin of error 4 | #' 5 | #' By default the table includes a column for frequency count, percent, valid percent, and cumulative percent. 
6 | #' 7 | #' @param df The data source 8 | #' @param variable the variable name 9 | #' @param weight The weighting variable (required; there is no default) 10 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused"). 11 | #' This will not affect any calculations made. The vector is not case-sensitive. 12 | #' @param n logical, if TRUE a frequency column is included. 13 | #' Frequencies are weighted counts. 14 | #' @param pct logical, if TRUE a column of percents is included 15 | #' @param valid_pct logical, if TRUE a column of valid percents is included 16 | #' @param cum_pct logical, if TRUE a column of cumulative percents is included 17 | #' @param zscore defaults to 1.96, consistent with a 95\% confidence interval 18 | #' 19 | #' @return a tibble 20 | #' @export 21 | #' @import dplyr 22 | #' @import stringr 23 | #' @import tidyr 24 | #' @import labelled 25 | #' @import rlang 26 | #' 27 | #' @examples 28 | #' moe_topline(df = illinois, variable = educ6, weight = weight) 29 | #' moe_topline(df = illinois, variable = educ6, weight = weight, remove = c("LT HS")) 30 | moe_topline <- function(df, variable, weight, remove = c(""), 31 | n = TRUE, pct = TRUE, valid_pct = TRUE, cum_pct = TRUE, zscore = 1.96){ 32 | 33 | # make sure no weights are NA 34 | w <- df %>% pull({{weight}}) 35 | if(length(w[is.na(w)]) > 0){ 36 | stop("The weight variable contains missing values.", call.
= FALSE) 37 | } 38 | 39 | # calculate the design effect 40 | deff <- df %>% pull({{weight}}) %>% deff_calc() 41 | 42 | # calculate the valid unweighted sample count 43 | unweighted.n <- df %>% 44 | filter(!is.na({{variable}})) %>% 45 | nrow() 46 | 47 | output <- df %>% 48 | filter(!is.na({{variable}})) %>% 49 | mutate({{variable}} := to_factor({{variable}}), 50 | total = sum({{weight}}), 51 | valid.total = sum(({{weight}})[{{variable}} != "(Missing)"])) %>% 52 | group_by({{variable}}) %>% 53 | summarise(valid.pct = (sum({{weight}})/first(valid.total)*100), 54 | n = sum({{weight}}), 55 | pct = (n/first(total))) %>% 56 | ungroup() %>% 57 | mutate(moe = moedeff_calc(pct = valid.pct/100, deff = deff, n = unweighted.n, zscore = zscore), 58 | cum = cumsum(valid.pct), 59 | valid.pct = replace(valid.pct, {{variable}} == "(Missing)", NA), 60 | cum = replace(cum, {{variable}} == "(Missing)", NA)) %>% 61 | mutate(pct = pct*100) %>% 62 | select(Response = {{variable}}, Frequency = n, Percent = pct, 63 | `Valid Percent` = valid.pct, `MOE` = moe, `Cumulative Percent` = cum) %>% 64 | # Remove values included in "remove" string 65 | filter(! 
str_to_upper(Response) %in% str_to_upper(remove)) 66 | 67 | # remove columns as requested (the table built above is named `output`) 68 | if(valid_pct == FALSE){ 69 | output <- select(output, -`Valid Percent`) 70 | } 71 | 72 | if(cum_pct == FALSE){ 73 | output <- select(output, -`Cumulative Percent`) 74 | } 75 | 76 | if(n == FALSE){ 77 | output <- select(output, -Frequency) 78 | } 79 | 80 | if(pct == FALSE){ 81 | output <- select(output, -Percent) 82 | } 83 | 84 | output %>% 85 | as_tibble() 86 | } 87 | -------------------------------------------------------------------------------- /R/moeWaveCrosstab.R: -------------------------------------------------------------------------------- 1 | #' weighted crosstabs with margin of error, where the x-variable identifies different survey waves 2 | #' 3 | #' \code{moe_wave_crosstab} returns a tibble containing a weighted crosstab of two variables 4 | #' with margin of error. Use this function when the x-variable indicates different survey 5 | #' waves for which weights were calculated independently. 6 | #' 7 | #' Options include row or cell percentages. The tibble can be in long or wide format. The margin of 8 | #' error includes the design effect of the weights, calculated separately for each 9 | #' survey wave. 10 | #' 11 | #' @param df The data source 12 | #' @param x The independent variable, which uniquely identifies survey waves 13 | #' @param y The dependent variable 14 | #' @param weight The weighting variable (required; there is no default) 15 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused"). 16 | #' This will not affect any calculations made. The vector is not case-sensitive. 17 | #' @param n logical, if TRUE numeric totals are included. 18 | #' @param pct_type Controls the kind of percentage values returned. One of "row" or "cell". 19 | #' Column percents are not supported.
20 | #' @param format one of "long" or "wide" 21 | #' @param zscore defaults to 1.96, consistent with a 95\% confidence interval 22 | #' @param unwt_n logical, if TRUE it adds a column with unweighted frequency values 23 | #' 24 | #' @return a tibble 25 | #' @export 26 | #' @import dplyr 27 | #' @import stringr 28 | #' @import tidyr 29 | #' @import labelled 30 | #' @import rlang 31 | #' 32 | #' @examples 33 | #' moe_wave_crosstab(df = illinois, x = year, y = maritalstatus, weight = weight) 34 | #' moe_wave_crosstab(df = illinois, x = year, y = maritalstatus, weight = weight, format = "wide") 35 | 36 | moe_wave_crosstab <- function(df, x, y, weight, remove = c(""), n = TRUE, 37 | pct_type = "row", format = "long", 38 | zscore = 1.96, unwt_n = FALSE){ 39 | 40 | # make sure no weights are NA 41 | w <- df %>% pull({{weight}}) 42 | if(length(w[is.na(w)]) > 0){ 43 | stop("The weight variable contains missing values.", call. = FALSE) 44 | } 45 | 46 | # make sure the arguments are all correct 47 | stopifnot(pct_type %in% c("row", "cell"), 48 | format %in% c("wide", "long")) 49 | 50 | # Calculate the design effect for each wave individually, as that is 51 | # the level at which the weights are calculated 52 | stats.by.wave <- df %>% 53 | filter(!is.na({{x}})) %>% 54 | mutate({{x}} := to_factor({{x}})) %>% 55 | group_by({{x}}) %>% 56 | summarise(deff = deff_calc({{weight}})) %>% 57 | ungroup() 58 | 59 | # build the table, either row percents or cell percents 60 | if(pct_type == "row"){ 61 | d.output <- df %>% 62 | filter(!is.na({{x}}), 63 | !is.na({{y}})) %>% 64 | mutate({{x}} := to_factor({{x}}), 65 | {{y}} := to_factor({{y}})) %>% 66 | group_by({{x}}) %>% 67 | mutate(total = sum({{weight}}), 68 | unweighted_n = length({{weight}})) %>% 69 | group_by({{x}}, {{y}}) %>% 70 | summarise(observations = sum({{weight}}), 71 | pct = observations/first(total), 72 | n = first(total), 73 | unweighted_n = first(unweighted_n)) %>% 74 | ungroup() %>% 75 | inner_join(stats.by.wave) %>% 76 
| mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>% 77 | mutate(pct = pct*100) %>% 78 | select(-observations) %>% 79 | # Remove values included in "remove" string 80 | filter(!str_to_upper({{x}}) %in% str_to_upper(remove), 81 | !str_to_upper({{y}}) %in% str_to_upper(remove)) %>% 82 | # move total row to end 83 | select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>% 84 | select(-deff) 85 | } else if(pct_type == "cell"){ 86 | d.output <- df %>% 87 | filter(!is.na({{x}}), 88 | !is.na({{y}})) %>% 89 | mutate({{x}} := to_factor({{x}}), 90 | {{y}} := to_factor({{y}})) %>% 91 | # calculate denominator 92 | mutate(total = sum({{weight}}), 93 | unweighted_n = length({{weight}})) %>% 94 | group_by({{x}}, {{y}}) %>% 95 | summarise(observations = sum({{weight}}), 96 | pct = observations/first(total), 97 | n = first(total), 98 | unweighted_n = first(unweighted_n)) %>% 99 | ungroup() %>% 100 | inner_join(stats.by.wave) %>% 101 | mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>% 102 | mutate(pct = pct*100) %>% 103 | select(-observations) %>% 104 | # Remove values included in "remove" string 105 | filter(!str_to_upper({{x}}) %in% str_to_upper(remove), 106 | !str_to_upper({{y}}) %in% str_to_upper(remove)) %>% 107 | # move total row to end 108 | select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>% 109 | select(-deff) 110 | } 111 | 112 | # convert to wide format if required 113 | if(format == "wide"){ 114 | d.output <- d.output %>% 115 | pivot_wider(names_from = {{y}}, values_from = c(pct, moe), 116 | values_fill = list(pct = 0, moe = 0)) %>% 117 | select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) 118 | } 119 | 120 | # remove n if required 121 | if(n == FALSE){ 122 | d.output <- select(d.output, -n) 123 | } 124 | # remove unweighted_n if required 125 | if(unwt_n == FALSE){ 126 | d.output <- select(d.output, -unweighted_n) 127 | } 128 | 129 | # test if date 
or number 130 | factor.true.type <- what_is_this_factor(pull(d.output, {{x}})) 131 | if(factor.true.type == "date"){ 132 | d.output %>% 133 | as_tibble() %>% 134 | mutate({{x}} := as.Date({{x}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y"))) 135 | } else if(factor.true.type == "number"){ 136 | d.output %>% 137 | as_tibble() %>% 138 | mutate({{x}} := as.numeric(as.character({{x}}))) 139 | } else{ 140 | d.output %>% 141 | as_tibble() 142 | } 143 | } 144 | -------------------------------------------------------------------------------- /R/moeWaveCrosstab3way.R: -------------------------------------------------------------------------------- 1 | #' weighted 3-way crosstabs with margin of error, where the z-variable identifies different survey waves 2 | #' 3 | #' \code{moe_wave_crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable with margin of error. 4 | #' Use this function when the z-variable indicates different survey 5 | #' waves for which weights were calculated independently. 6 | #' 7 | #' Options include row or cell percentages. The tibble can be in long or wide format. 8 | #' These tables are ideal for use with small multiples created with ggplot2::facet_wrap. 9 | #' 10 | #' @param df The data source 11 | #' @param x The independent variable 12 | #' @param y The dependent variable 13 | #' @param z The second control variable, uniquely identifies survey waves 14 | #' @param weight The weighting variable 15 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused"). 16 | #' This will not affect any calculations made. The vector is not case-sensitive. 17 | #' @param n logical, if TRUE numeric totals are included. 18 | #' @param pct_type Controls the kind of percentage values returned. One of "row" or "cell." 
19 | #' @param format one of "long" or "wide" 20 | #' @param zscore defaults to 1.96, consistent with a 95\% confidence interval 21 | #' @param unwt_n logical, if TRUE it adds a column with unweighted frequency values 22 | #' 23 | #' @return a tibble 24 | #' @export 25 | #' @import dplyr 26 | #' @import stringr 27 | #' @import tidyr 28 | #' @import labelled 29 | #' @import rlang 30 | #' 31 | #' @examples 32 | #' moe_wave_crosstab_3way(illinois, x = sex, y = educ6, z = year, weight = weight) 33 | #' moe_wave_crosstab_3way(illinois, x = sex, y = educ6, z = year, weight = weight, format = "wide") 34 | 35 | moe_wave_crosstab_3way <- function(df, x, y, z, 36 | weight, remove = c(""), 37 | n = TRUE, pct_type = "row", format = "long", 38 | zscore = 1.96, unwt_n = FALSE){ 39 | 40 | # make sure no weights are NA 41 | w <- df %>% pull({{weight}}) 42 | if(length(w[is.na(w)]) > 0){ 43 | stop("The weight variable contains missing values.", call. = FALSE) 44 | } 45 | 46 | # make sure the arguments are all correct 47 | stopifnot(pct_type %in% c("row", "cell"), 48 | format %in% c("wide", "long")) 49 | 50 | # Calculate the design effect for each wave individually, as that is 51 | # the level at which the weights are calculated 52 | stats.by.wave <- df %>% 53 | filter(!is.na({{z}})) %>% 54 | mutate({{z}} := to_factor({{z}})) %>% 55 | group_by({{z}}) %>% 56 | summarise(deff = deff_calc({{weight}})) %>% 57 | ungroup() 58 | 59 | # build the table, either row percents or cell percents 60 | if(pct_type == "row"){ 61 | d.output <- df %>% 62 | filter(!is.na({{x}}), 63 | !is.na({{y}}), 64 | !is.na({{z}})) %>% 65 | mutate({{x}} := to_factor({{x}}), 66 | {{y}} := to_factor({{y}}), 67 | {{z}} := to_factor({{z}})) %>% 68 | group_by({{z}}, {{x}}) %>% 69 | mutate(total = sum({{weight}}), 70 | unweighted_n = length({{weight}})) %>% 71 | group_by({{z}}, {{x}}, {{y}}) %>% 72 | summarise(observations = sum({{weight}}), 73 | pct = observations/first(total), 74 | n = first(total), 75 | unweighted_n =
first(unweighted_n)) %>% 76 | ungroup() %>% 77 | inner_join(stats.by.wave) %>% 78 | mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>% 79 | mutate(pct = pct*100) %>% 80 | select(-observations) %>% 81 | # Remove values included in "remove" string 82 | filter(!str_to_upper({{x}}) %in% str_to_upper(remove), 83 | !str_to_upper({{y}}) %in% str_to_upper(remove), 84 | !str_to_upper({{z}}) %in% str_to_upper(remove)) %>% 85 | # move total row to end 86 | select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>% 87 | select(-deff) 88 | } else if(pct_type == "cell"){ 89 | d.output <- df %>% 90 | filter(!is.na({{x}}), 91 | !is.na({{y}}), !is.na({{z}})) %>% 92 | mutate({{x}} := to_factor({{x}}), 93 | {{y}} := to_factor({{y}}), {{z}} := to_factor({{z}})) %>% 94 | # calculate denominator within each wave; z must be a factor to join with stats.by.wave 95 | group_by({{z}}) %>% 96 | mutate(total = sum({{weight}}), 97 | unweighted_n = length({{weight}})) %>% 98 | group_by({{z}}, {{x}}, {{y}}) %>% 99 | summarise(observations = sum({{weight}}), 100 | pct = observations/first(total), 101 | n = first(total), 102 | unweighted_n = first(unweighted_n)) %>% 103 | ungroup() %>% 104 | inner_join(stats.by.wave) %>% 105 | mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>% 106 | mutate(pct = pct*100) %>% 107 | select(-observations) %>% 108 | # Remove values included in "remove" string 109 | filter(!str_to_upper({{x}}) %in% str_to_upper(remove), 110 | !str_to_upper({{y}}) %in% str_to_upper(remove), !str_to_upper({{z}}) %in% str_to_upper(remove)) %>% 111 | # move total row to end 112 | select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>% 113 | select(-deff) 114 | } 115 | 116 | # convert to wide format if required 117 | if(format == "wide"){ 118 | d.output <- d.output %>% 119 | pivot_wider(names_from = {{y}}, values_from = c(pct, moe), 120 | values_fill = list(pct = 0, moe = 0)) %>% 121 | select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) 122 | } 123 | 124 | # remove n if required 125 | if(n == FALSE){ 126 | d.output <-
select(d.output, -n) 127 | } 128 | 129 | # remove unweighted_n if required 130 | if(unwt_n == FALSE){ 131 | d.output <- select(d.output, -unweighted_n) 132 | } 133 | 134 | # test if date or number 135 | factor.true.type <- what_is_this_factor(pull(d.output, {{z}})) 136 | if(factor.true.type == "date"){ 137 | d.output %>% 138 | as_tibble() %>% 139 | mutate({{z}} := as.Date({{z}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y"))) 140 | } else if(factor.true.type == "number"){ 141 | d.output %>% 142 | as_tibble() %>% 143 | mutate({{z}} := as.numeric(as.character({{z}}))) 144 | } else{ 145 | d.output %>% 146 | as_tibble() 147 | } 148 | } 149 | -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Weighted data survey tables in R" 3 | author: "John Johnson" 4 | date: "3/9/2020" 5 | output: 6 | github_document: 7 | toc: true 8 | toc_depth: 2 9 | --- 10 | 11 | ```{r setup, include=FALSE} 12 | knitr::opts_chunk$set(echo = TRUE) 13 | ``` 14 | 15 | ```{r, echo = FALSE} 16 | knitr::opts_chunk$set( 17 | collapse = TRUE, 18 | comment = "#>", 19 | fig.path = "man/figures/README-" 20 | ) 21 | ``` 22 | 23 | `pollster` is an R package for making topline and crosstab tables of simple weighted survey data. The package is designed for use with labelled data, like what you might use the `haven` package to import from Stata or SPSS. It follows tidyverse programming conventions, and output tables are also in the form of a tidy data frame, or tibble. 24 | 25 | Only simple weights are currently supported. For complex survey designs, we recommend the excellent [`survey` package](http://r-survey.r-forge.r-project.org/survey/). 
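As a rough illustration of what a table built from simple weights involves, the sketch below uses base R and made-up data. It computes a weighted percentage, a design effect from the weights using Kish's approximation, and a margin of error inflated by that design effect. The helper names `kish_deff()` and `moe_pct()` are hypothetical and exist only for this example. They are not the package's own `deff_calc()` or `moedeff_calc()`, and the formulas are an assumption about the general approach rather than a copy of the package internals.

```r
# Toy data: five respondents with unequal weights (made-up values).
responses <- c("Voted", "Voted", "Not voted", "Voted", "Not voted")
weights   <- c(1.5, 0.5, 1.0, 2.0, 1.0)

# Weighted percentage: sum of weights in the category over the total weight.
pct_voted <- sum(weights[responses == "Voted"]) / sum(weights) * 100

# Kish's approximate design effect from the weights alone
# (hypothetical helper, not part of pollster).
kish_deff <- function(w) {
  length(w) * sum(w^2) / sum(w)^2
}

# Margin of error in percentage points, inflated by the design effect
# (hypothetical helper; p is a proportion between 0 and 1, n is the
# unweighted sample size).
moe_pct <- function(p, n, deff, zscore = 1.96) {
  zscore * sqrt(deff * p * (1 - p) / n) * 100
}

deff <- kish_deff(weights)
moe  <- moe_pct(p = pct_voted / 100, n = length(weights), deff = deff)
```

With equal weights the design effect in this sketch is exactly 1, so the margin of error reduces to the usual simple random sample formula. Unequal weights push it above 1 and widen the margin.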
26 | 27 | The core functions are: 28 | 29 | * `topline()` 30 | * `crosstab()` 31 | * `crosstab_3way()` 32 | 33 | Each of these functions also has a twin version which includes a column for the margin of error calculated to include the design effect of the weights. 34 | 35 | * `moe_topline()` 36 | * `moe_crosstab()` 37 | * `moe_crosstab_3way()` 38 | 39 | There are also two special functions which calculate the design effect component of the margin of error for each survey wave independently. 40 | 41 | * `moe_wave_crosstab()` 42 | * `moe_wave_crosstab_3way()` 43 | 44 | Other functions are included to calculate simple weighted summary statistics. 45 | 46 | * `wtd_mean()` is a tidy-compliant wrapper around `stats::weighted.mean()` 47 | * `summary_table()` returns a tibble with summary statistics similar to the Stata command `sum` 48 | 49 | ## Installation 50 | 51 | Install the released version from CRAN. 52 | 53 | ``` 54 | install.packages("pollster") 55 | ``` 56 | 57 | Or install the development version from GitHub. 58 | 59 | ``` 60 | remotes::install_github("jdjohn215/pollster") 61 | ``` 62 | 63 | ## Basic usage 64 | 65 | `pollster` includes a dataset of Illinois responses to the Current Population Survey's voter registration supplement. 66 | 67 | ```{r} 68 | library(pollster) 69 | head(illinois) 70 | ``` 71 | 72 | Make a topline table like this. The output is a tibble. 73 | 74 | ```{r} 75 | topline(df = illinois, variable = maritalstatus, weight = weight) 76 | ``` 77 | 78 | Make a crosstab like this. 79 | 80 | ```{r} 81 | crosstab(df = illinois, x = educ6, y = maritalstatus, weight = weight) 82 | ``` 83 | 84 | If you prefer, you can also get the output in long format. 85 | 86 | ```{r} 87 | crosstab(df = illinois, x = educ6, y = maritalstatus, weight = weight, format = "long") 88 | ``` 89 | 90 | A three-way crosstab is just a normal crosstab with a third control variable. Often, this third variable is time.
91 | 92 | ```{r} 93 | crosstab_3way(df = illinois, x = educ6, y = maritalstatus, z = year, weight = weight) 94 | ``` 95 | 96 | ## Making tables and graphs 97 | 98 | Wide format is best for displaying table output. Long format is best for making graphs. `pollster` outputs dovetail seamlessly with [`knitr::kable()`](https://www.rdocumentation.org/packages/knitr/versions/1.28/topics/kable) and [`ggplot2::ggplot()`](https://ggplot2.tidyverse.org/). These examples show very basic html table output, but you can customize the appearance of your tables almost endlessly in either html or pdf formats using Hao Zhu's excellent [`kableExtra` package](https://haozhu233.github.io/kableExtra/). 99 | 100 | ```{r} 101 | library(dplyr) 102 | crosstab(df = illinois, x = sex, y = educ6, weight = weight) %>% 103 | knitr::kable(digits = 0) 104 | ``` 105 | 106 | 107 | ```{r} 108 | library(ggplot2) 109 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, format = "long") %>% 110 | ggplot(aes(educ6, pct, fill = sex)) + 111 | geom_bar(stat = "identity", position = "dodge") 112 | ``` 113 | 114 | Three-way crosstabs are ideal for plotting time series graphs and/or faceted plots. 115 | 116 | ```{r} 117 | crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight, format = "long") %>% 118 | ggplot(aes(year, pct, col = sex)) + 119 | geom_line() + 120 | facet_wrap(facets = vars(educ6)) 121 | ``` 122 | 123 | ## Margin of error 124 | 125 | Each `pollster` function comes with a twin function which includes a margin of error column. For example: 126 | 127 | ```{r} 128 | moe_topline(df = illinois, variable = voter, weight = weight) 129 | ``` 130 | 131 | 132 | By default, `moe_crosstab` output comes in long format, but you can also specify wide format. 
133 | 134 | ```{r} 135 | moe_crosstab(df = illinois, x = raceethnic, y = voter, weight = weight, format = "wide") 136 | ``` 137 | 138 | ```{r} 139 | moe_crosstab(df = illinois, x = raceethnic, y = voter, weight = weight) %>% 140 | ggplot(aes(x = pct, y = raceethnic, xmin = (pct - moe), xmax = (pct + moe), color = voter)) + 141 | geom_pointrange(position = position_dodge(width = 0.2)) 142 | ``` 143 | 144 | ## Summary table 145 | 146 | `summary_table()` creates a simple summary table of a weighted numeric variable. 147 | 148 | ```{r} 149 | summary_table(df = illinois, variable = age, weight = weight) 150 | ``` 151 | 152 | You can choose `name_style = "pretty"` if you want column headings appropriate for a formatted table. 153 | 154 | ```{r} 155 | summary_table(df = illinois, variable = age, 156 | weight = weight, name_style = "pretty") %>% 157 | knitr::kable() 158 | ``` 159 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Weighted data survey tables in R 2 | ================ 3 | John Johnson 4 | 3/9/2020 5 | 6 | `pollster` is an R package for making topline and crosstab tables of 7 | simple weighted survey data. The package is designed for use with 8 | labelled data, like what you might use the `haven` package to import 9 | from Stata or SPSS. It follows tidyverse programming conventions, and 10 | output tables are also in the form of a tidy data frame, or tibble. 11 | 12 | Only simple weights are currently supported. For complex survey designs, 13 | we recommend the excellent [`survey` 14 | package](http://r-survey.r-forge.r-project.org/survey/). 15 | 16 | The core functions are: 17 | 18 | - `topline()` 19 | - `crosstab()` 20 | - `crosstab_3way()` 21 | 22 | Each of these functions also has a twin version which includes a column 23 | for the margin of error calculated to include the design effect of the 24 | weights. 
25 | 26 | - `moe_topline()` 27 | - `moe_crosstab()` 28 | - `moe_crosstab_3way()` 29 | 30 | There are also two special functions which calculate the design effect 31 | component of the margin of error for each survey wave independently. 32 | 33 | - `moe_wave_crosstab()` 34 | - `moe_wave_crosstab_3way()` 35 | 36 | Other functions are included to calculate simple weighted summary 37 | statistics. 38 | 39 | - `wtd_mean()` is a tidy-compliant wrapper around 40 | `stats::weighted.mean()` 41 | - `summary_table()` returns a tibble with summary statistics similar to 42 | the Stata command `sum` 43 | 44 | ## Installation 45 | 46 | Install the released version from CRAN. 47 | 48 | install.packages("pollster") 49 | 50 | Or install the development version from GitHub. 51 | 52 | remotes::install_github("jdjohn215/pollster") 53 | 54 | ## Basic usage 55 | 56 | `pollster` includes a dataset of Illinois responses to the Current 57 | Population Survey’s voter registration supplement. 58 | 59 | ``` r 60 | library(pollster) 61 | head(illinois) 62 | #> # A tibble: 6 x 10 63 | #> year fips sex educ6 raceethnic maritalstatus rv voter age 64 | #> 65 | #> 1 1996 17 [IL] 1 [Mal… 2 [HS] 1 [White] 1 [Married] 2 [Not… 2 [Not… 29 66 | #> 2 1996 17 [IL] 2 [Fem… 3 [Som… 1 [White] 1 [Married] 1 [Reg… 1 [Vot… 28 67 | #> 3 1996 17 [IL] 2 [Fem… 2 [HS] 1 [White] 3 [Never Mar… 1 [Reg… 1 [Vot… 82 68 | #> 4 1996 17 [IL] 2 [Fem… 2 [HS] 1 [White] 3 [Never Mar… 1 [Reg… 1 [Vot… 72 69 | #> 5 1996 17 [IL] 1 [Mal… 2 [HS] 2 [Black] 1 [Married] 1 [Reg… 1 [Vot… 75 70 | #> 6 1996 17 [IL] 2 [Fem… 2 [HS] 2 [Black] 1 [Married] 1 [Reg… 1 [Vot… 60 71 | #> # … with 1 more variable: weight 72 | ``` 73 | 74 | Make a topline table like this. The output is a tibble. 75 | 76 | ``` r 77 | topline(df = illinois, variable = maritalstatus, weight = weight) 78 | #> # A tibble: 3 x 5 79 | #> Response Frequency Percent `Valid Percent` `Cumulative Percent` 80 | #> 81 | #> 1 Married 55001786. 53.6 53.6 53.6 82 | #> 2 Widow/divorced/Sep 18635087.
18.1 18.1 71.7 83 | #> 3 Never Married 29041640. 28.3 28.3 100 84 | ``` 85 | 86 | Make a crosstab like this. 87 | 88 | ``` r 89 | crosstab(df = illinois, x = educ6, y = maritalstatus, weight = weight) 90 | #> # A tibble: 6 x 5 91 | #> educ6 Married `Widow/divorced/Sep` `Never Married` n 92 | #> 93 | #> 1 LT HS 40.0 29.1 30.9 10770999. 94 | #> 2 HS 52.9 21.0 26.1 31409418. 95 | #> 3 Some Col 44.6 17.4 38.0 21745113. 96 | #> 4 AA 57.4 18.4 24.2 8249909. 97 | #> 5 BA 61.1 11.3 27.6 19937965. 98 | #> 6 Post-BA 70.7 12.9 16.5 10565110. 99 | ``` 100 | 101 | If you prefer, you can also get the output in long 102 | format. 103 | 104 | ``` r 105 | crosstab(df = illinois, x = educ6, y = maritalstatus, weight = weight, format = "long") 106 | #> # A tibble: 18 x 4 107 | #> educ6 maritalstatus pct n 108 | #> 109 | #> 1 LT HS Married 40.0 10770999. 110 | #> 2 LT HS Widow/divorced/Sep 29.1 10770999. 111 | #> 3 LT HS Never Married 30.9 10770999. 112 | #> 4 HS Married 52.9 31409418. 113 | #> 5 HS Widow/divorced/Sep 21.0 31409418. 114 | #> 6 HS Never Married 26.1 31409418. 115 | #> 7 Some Col Married 44.6 21745113. 116 | #> 8 Some Col Widow/divorced/Sep 17.4 21745113. 117 | #> 9 Some Col Never Married 38.0 21745113. 118 | #> 10 AA Married 57.4 8249909. 119 | #> 11 AA Widow/divorced/Sep 18.4 8249909. 120 | #> 12 AA Never Married 24.2 8249909. 121 | #> 13 BA Married 61.1 19937965. 122 | #> 14 BA Widow/divorced/Sep 11.3 19937965. 123 | #> 15 BA Never Married 27.6 19937965. 124 | #> 16 Post-BA Married 70.7 10565110. 125 | #> 17 Post-BA Widow/divorced/Sep 12.9 10565110. 126 | #> 18 Post-BA Never Married 16.5 10565110. 127 | ``` 128 | 129 | A three-way crosstab is just a normal crosstab with a third control 130 | variable. Often, this third variable is 131 | time. 
132 | 133 | ``` r 134 | crosstab_3way(df = illinois, x = educ6, y = maritalstatus, z = year, weight = weight) 135 | #> # A tibble: 72 x 6 136 | #> educ6 year n Married `Widow/divorced/Sep` `Never Married` 137 | #> 138 | #> 1 LT HS 1996 1182402. 41.0 28.8 30.2 139 | #> 2 LT HS 1998 1159148. 42.2 33.6 24.2 140 | #> 3 LT HS 2000 1036154. 44.3 32.6 23.1 141 | #> 4 LT HS 2002 1074704. 38.0 30.4 31.6 142 | #> 5 LT HS 2004 936926. 41.0 30.3 28.6 143 | #> 6 LT HS 2006 918858. 38.6 31.7 29.7 144 | #> 7 LT HS 2008 909755. 42.1 28.1 29.8 145 | #> 8 LT HS 2010 806647. 40.6 24.6 34.7 146 | #> 9 LT HS 2012 705132. 35.7 26.9 37.4 147 | #> 10 LT HS 2014 782926. 43.7 23.7 32.7 148 | #> # … with 62 more rows 149 | ``` 150 | 151 | ## Making tables and graphs 152 | 153 | Wide format is best for displaying table output. Long format is best for 154 | making graphs. `pollster` outputs dovetail seamlessly with 155 | [`knitr::kable()`](https://www.rdocumentation.org/packages/knitr/versions/1.28/topics/kable) 156 | and [`ggplot2::ggplot()`](https://ggplot2.tidyverse.org/). These 157 | examples show very basic html table output, but you can customize the 158 | appearance of your tables almost endlessly in either html or pdf formats 159 | using Hao Zhu’s excellent [`kableExtra` 160 | package](https://haozhu233.github.io/kableExtra/). 
161 | 162 | ``` r 163 | library(dplyr) 164 | #> 165 | #> Attaching package: 'dplyr' 166 | #> The following objects are masked from 'package:stats': 167 | #> 168 | #> filter, lag 169 | #> The following objects are masked from 'package:base': 170 | #> 171 | #> intersect, setdiff, setequal, union 172 | crosstab(df = illinois, x = sex, y = educ6, weight = weight) %>% 173 | knitr::kable(digits = 0) 174 | ``` 175 | 176 | | sex | LT HS | HS | Some Col | AA | BA | Post-BA | n | 177 | | :----- | ----: | -: | -------: | -: | -: | ------: | -------: | 178 | | Male | 11 | 31 | 21 | 7 | 20 | 11 | 49108796 | 179 | | Female | 10 | 30 | 22 | 9 | 19 | 10 | 53569718 | 180 | 181 | ``` r 182 | library(ggplot2) 183 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, format = "long") %>% 184 | ggplot(aes(educ6, pct, fill = sex)) + 185 | geom_bar(stat = "identity", position = "dodge") 186 | ``` 187 | 188 | ![](man/figures/README-unnamed-chunk-8-1.png) 189 | 190 | Three-way crosstabs are ideal for plotting time series graphs and/or 191 | faceted 192 | plots. 193 | 194 | ``` r 195 | crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight, format = "long") %>% 196 | ggplot(aes(year, pct, col = sex)) + 197 | geom_line() + 198 | facet_wrap(facets = vars(educ6)) 199 | ``` 200 | 201 | ![](man/figures/README-unnamed-chunk-9-1.png) 202 | 203 | ## Margin of error 204 | 205 | Each `pollster` function comes with a twin function which includes a 206 | margin of error column. For example: 207 | 208 | ``` r 209 | moe_topline(df = illinois, variable = voter, weight = weight) 210 | #> # A tibble: 2 x 6 211 | #> Response Frequency Percent `Valid Percent` MOE `Cumulative Percent` 212 | #> 213 | #> 1 Voted 56230937. 63.7 63.7 0.551 63.7 214 | #> 2 Not voted 32070164. 36.3 36.3 0.551 100 215 | ``` 216 | 217 | By default, `moe_crosstab` output comes in long format, but you can also 218 | specify wide 219 | format. 
220 | 221 | ``` r 222 | moe_crosstab(df = illinois, x = raceethnic, y = voter, weight = weight, format = "wide") 223 | #> # A tibble: 4 x 6 224 | #> raceethnic n pct_Voted `pct_Not voted` moe_Voted `moe_Not voted` 225 | #> 226 | #> 1 White 24167 64.4 35.6 0.624 0.624 227 | #> 2 Black 3980 71.6 28.4 1.45 1.45 228 | #> 3 Hispanic 2106 48.3 51.7 2.21 2.21 229 | #> 4 Other 1006 48.7 51.3 3.19 3.19 230 | ``` 231 | 232 | ``` r 233 | moe_crosstab(df = illinois, x = raceethnic, y = voter, weight = weight) %>% 234 | ggplot(aes(x = pct, y = raceethnic, xmin = (pct - moe), xmax = (pct + moe), color = voter)) + 235 | geom_pointrange(position = position_dodge(width = 0.2)) 236 | ``` 237 | 238 | ![](man/figures/README-unnamed-chunk-12-1.png) 239 | 240 | ## Summary table 241 | 242 | `summary_table()` creates a simple summary table of a weighted numeric 243 | variable. 244 | 245 | ``` r 246 | summary_table(df = illinois, variable = age, weight = weight) 247 | #> # A tibble: 1 x 8 248 | #> variable_name unweighted_obse… weighted_observ… weighted_mean min_value 249 | #> 250 | #> 1 age 36207 102678514. 46.2 18 251 | #> # … with 3 more variables: max_value , missing_observations , 252 | #> # missing_weighted_observations 253 | ``` 254 | 255 | You can choose `name_style = "pretty"` if you want column headings 256 | appropriate for a formatted table. 
257 | 258 | ``` r 259 | summary_table(df = illinois, variable = age, 260 | weight = weight, name_style = "pretty") %>% 261 | knitr::kable() 262 | ``` 263 | 264 | | Variable | Unweighted obs | Weighted obs | Weighted mean | Min | Max | Unweighted missing | Weighted missing | 265 | | :------- | -------------: | -----------: | ------------: | --: | --: | -----------------: | ---------------: | 266 | | age | 36207 | 102678514 | 46.19646 | 18 | 90 | 0 | 0 | 267 | -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## Test environments 2 | * local OS X install, R 4.2.0 3 | * win-builder (devel, release, and old) 4 | 5 | ## R CMD check results 6 | 7 | 0 errors | 0 warnings | 0 notes 8 | 9 | ## Downstream dependencies 10 | * No downstream dependencies 11 | -------------------------------------------------------------------------------- /data/illinois.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jdjohn215/pollster/06d85efce6a2bb3b9c10a045f2ff854b64050c24/data/illinois.rda -------------------------------------------------------------------------------- /man/crosstab.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/Crosstab.R 3 | \name{crosstab} 4 | \alias{crosstab} 5 | \title{weighted crosstabs} 6 | \usage{ 7 | crosstab( 8 | df, 9 | x, 10 | y, 11 | weight, 12 | remove = "", 13 | n = TRUE, 14 | pct_type = "row", 15 | format = "wide", 16 | unwt_n = FALSE 17 | ) 18 | } 19 | \arguments{ 20 | \item{df}{The data source} 21 | 22 | \item{x}{The independent variable} 23 | 24 | \item{y}{The dependent variable} 25 | 26 | \item{weight}{The weighting variable} 27 | 28 | \item{remove}{An optional character vector of values to remove from final table (e.g. 
"refused"). 29 | This will not affect any calculations made. The vector is not case-sensitive.} 30 | 31 | \item{n}{logical, if TRUE numeric totals are included. They are included in a separate column for row and cell 32 | percentages, but in a separate row for wide format column percentages.} 33 | 34 | \item{pct_type}{Controls the kind of percentage values returned. One of "row," "cell," or "column."} 35 | 36 | \item{format}{one of "long" or "wide"} 37 | 38 | \item{unwt_n}{logical, if TRUE a column "unweighted_n" is included containing the unweighted frequency count. It is not available when pct_type is "column"} 39 | } 40 | \value{ 41 | a tibble 42 | } 43 | \description{ 44 | \code{crosstab} returns a tibble containing a weighted crosstab of two variables 45 | } 46 | \details{ 47 | Options include row, column, or cell percentages. The tibble can be in long or wide format. 48 | } 49 | \examples{ 50 | crosstab(df = illinois, x = voter, y = raceethnic, weight = weight) 51 | crosstab(df = illinois, x = voter, y = raceethnic, weight = weight, n = FALSE) 52 | } 53 | -------------------------------------------------------------------------------- /man/crosstab_3way.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/Crosstab3way.R 3 | \name{crosstab_3way} 4 | \alias{crosstab_3way} 5 | \title{weighted 3-way crosstabs} 6 | \usage{ 7 | crosstab_3way( 8 | df, 9 | x, 10 | y, 11 | z, 12 | weight, 13 | remove = c(""), 14 | n = TRUE, 15 | pct_type = "row", 16 | format = "wide", 17 | unwt_n = FALSE 18 | ) 19 | } 20 | \arguments{ 21 | \item{df}{The data source} 22 | 23 | \item{x}{The independent variable} 24 | 25 | \item{y}{The dependent variable} 26 | 27 | \item{z}{The second control variable} 28 | 29 | \item{weight}{The weighting variable} 30 | 31 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused"). 
32 | This will not affect any calculations made. The vector is not case-sensitive.} 33 | 34 | \item{n}{logical, if TRUE numeric totals are included.} 35 | 36 | \item{pct_type}{Controls the kind of percentage values returned. One of "row" or "cell."} 37 | 38 | \item{format}{one of "long" or "wide"} 39 | 40 | \item{unwt_n}{logical, if TRUE a column is added containing unweighted frequency counts} 41 | } 42 | \value{ 43 | a tibble 44 | } 45 | \description{ 46 | \code{crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable 47 | } 48 | \details{ 49 | Options include row or cell percentages. The tibble can be in long or wide format. 50 | These tables are ideal for use with small multiples created with ggplot2::facet_wrap. 51 | } 52 | \examples{ 53 | crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight) 54 | crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight, 55 | format = "wide") 56 | } 57 | -------------------------------------------------------------------------------- /man/deff_calc.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/DesignEffectAndMOE.R 3 | \name{deff_calc} 4 | \alias{deff_calc} 5 | \title{Calculate the design effect of a sample} 6 | \usage{ 7 | deff_calc(w) 8 | } 9 | \arguments{ 10 | \item{w}{a vector of weights} 11 | } 12 | \value{ 13 | A number 14 | } 15 | \description{ 16 | \code{deff_calc} returns a single number 17 | } 18 | \details{ 19 | This function returns the design effect of a given sample using the formula 20 | length(w)*sum(w^2)/(sum(w)^2). 21 | It is designed for use in the moe family of functions. If any weights are equal to 0, they are removed prior to calculation. 
22 | } 23 | \examples{ 24 | deff_calc(illinois$weight) 25 | 26 | } 27 | -------------------------------------------------------------------------------- /man/figures/README-unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jdjohn215/pollster/06d85efce6a2bb3b9c10a045f2ff854b64050c24/man/figures/README-unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /man/figures/README-unnamed-chunk-8-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jdjohn215/pollster/06d85efce6a2bb3b9c10a045f2ff854b64050c24/man/figures/README-unnamed-chunk-8-1.png -------------------------------------------------------------------------------- /man/figures/README-unnamed-chunk-9-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jdjohn215/pollster/06d85efce6a2bb3b9c10a045f2ff854b64050c24/man/figures/README-unnamed-chunk-9-1.png -------------------------------------------------------------------------------- /man/illinois.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/data.R 3 | \docType{data} 4 | \name{illinois} 5 | \alias{illinois} 6 | \title{Illinois respondents to the Voting and Registration Supplement for the Current Population Survey} 7 | \format{ 8 | A data frame with 36207 rows and 10 variables: 9 | \describe{ 10 | \item{year}{year of survey} 11 | \item{fips}{the state fips code} 12 | \item{sex}{sex of the respondent, labelled values} 13 | \item{educ6}{highest level of education for respondent, labelled values} 14 | \item{raceethnic}{one of white, black, Hispanic, or other, labelled values} 15 | \item{maritalstatus}{one of Married, Widow/divorced/Sep, or Never Married, labelled values}
16 | \item{rv}{indicates if the respondent is registered to vote, labelled values} 17 | \item{voter}{indicates if the respondent voted, labelled values} 18 | \item{age}{the age of the respondent, numeric values} 19 | \item{weight}{the number of people each respondent is calculated to represent} 20 | } 21 | } 22 | \source{ 23 | \url{https://www.census.gov/topics/public-sector/voting.html} 24 | } 25 | \usage{ 26 | illinois 27 | } 28 | \description{ 29 | A dataset containing the responses of 36,207 Illinois respondents to the Current 30 | Population Survey's biennial Voting and Registration Supplement, 31 | 1996-2018. 32 | } 33 | \keyword{datasets} 34 | -------------------------------------------------------------------------------- /man/moe_crosstab.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/moeCrosstab.R 3 | \name{moe_crosstab} 4 | \alias{moe_crosstab} 5 | \title{weighted crosstabs with margin of error} 6 | \usage{ 7 | moe_crosstab( 8 | df, 9 | x, 10 | y, 11 | weight, 12 | remove = c(""), 13 | n = TRUE, 14 | pct_type = "row", 15 | format = "long", 16 | zscore = 1.96, 17 | unwt_n = FALSE 18 | ) 19 | } 20 | \arguments{ 21 | \item{df}{The data source} 22 | 23 | \item{x}{The independent variable} 24 | 25 | \item{y}{The dependent variable} 26 | 27 | \item{weight}{The weighting variable} 28 | 29 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused"). 30 | This will not affect any calculations made. The vector is not case-sensitive.} 31 | 32 | \item{n}{logical, if TRUE numeric totals are included.} 33 | 34 | \item{pct_type}{Controls the kind of percentage values returned. One of "row" or "cell."
35 | Column percents are not supported.} 36 | 37 | \item{format}{one of "long" or "wide"} 38 | 39 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval} 40 | 41 | \item{unwt_n}{logical, if TRUE it adds a column with unweighted frequency values} 42 | } 43 | \value{ 44 | a tibble 45 | } 46 | \description{ 47 | \code{moe_crosstab} returns a tibble containing a weighted crosstab of two variables with margin of error 48 | } 49 | \details{ 50 | Options include row or cell percentages. The tibble can be in long or wide format. The margin of 51 | error includes the design effect of the weights. 52 | } 53 | \examples{ 54 | moe_crosstab(df = illinois, x = voter, y = raceethnic, weight = weight) 55 | moe_crosstab(df = illinois, x = voter, y = raceethnic, weight = weight, n = FALSE) 56 | } 57 | -------------------------------------------------------------------------------- /man/moe_crosstab_3way.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/moeCrosstab3way.R 3 | \name{moe_crosstab_3way} 4 | \alias{moe_crosstab_3way} 5 | \title{weighted 3-way crosstabs with margin of error} 6 | \usage{ 7 | moe_crosstab_3way( 8 | df, 9 | x, 10 | y, 11 | z, 12 | weight, 13 | remove = c(""), 14 | n = TRUE, 15 | pct_type = "row", 16 | format = "long", 17 | zscore = 1.96, 18 | unwt_n = FALSE 19 | ) 20 | } 21 | \arguments{ 22 | \item{df}{The data source} 23 | 24 | \item{x}{The independent variable} 25 | 26 | \item{y}{The dependent variable} 27 | 28 | \item{z}{The second control variable} 29 | 30 | \item{weight}{The weighting variable} 31 | 32 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused"). 33 | This will not affect any calculations made. 
The vector is not case-sensitive.} 34 | 35 | \item{n}{logical, if TRUE numeric totals are included.} 36 | 37 | \item{pct_type}{Controls the kind of percentage values returned. One of "row" or "cell."} 38 | 39 | \item{format}{one of "long" or "wide"} 40 | 41 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval} 42 | 43 | \item{unwt_n}{logical, if TRUE it adds a column with unweighted frequency values} 44 | } 45 | \value{ 46 | a tibble 47 | } 48 | \description{ 49 | \code{moe_crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable with margin of error 50 | } 51 | \details{ 52 | Options include row or cell percentages. The tibble can be in long or wide format. 53 | These tables are ideal for use with small multiples created with ggplot2::facet_wrap. 54 | } 55 | \examples{ 56 | moe_crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight) 57 | moe_crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight, 58 | format = "wide") 59 | } 60 | -------------------------------------------------------------------------------- /man/moe_topline.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/moeTopline.R 3 | \name{moe_topline} 4 | \alias{moe_topline} 5 | \title{weighted topline with margin of error} 6 | \usage{ 7 | moe_topline( 8 | df, 9 | variable, 10 | weight, 11 | remove = c(""), 12 | n = TRUE, 13 | pct = TRUE, 14 | valid_pct = TRUE, 15 | cum_pct = TRUE, 16 | zscore = 1.96 17 | ) 18 | } 19 | \arguments{ 20 | \item{df}{The data source} 21 | 22 | \item{variable}{the variable name} 23 | 24 | \item{weight}{The weighting variable} 25 | 26 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused"). 27 | This will not affect any calculations made.
The vector is not case-sensitive.} 28 | 29 | \item{n}{logical, if TRUE a frequency column 30 | is included.} 31 | 32 | \item{pct}{logical, if TRUE a column of percents is included} 33 | 34 | \item{valid_pct}{logical, if TRUE a column of valid percents is included} 35 | 36 | \item{cum_pct}{logical, if TRUE a column of cumulative percents is included} 37 | 38 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval} 39 | } 40 | \value{ 41 | a tibble 42 | } 43 | \description{ 44 | \code{moe_topline} returns a tibble containing a weighted topline of one variable with margin of error 45 | } 46 | \details{ 47 | By default the table includes a column for frequency count, percent, valid percent, margin of error, and cumulative percent. 48 | } 49 | \examples{ 50 | moe_topline(df = illinois, variable = educ6, weight = weight) 51 | moe_topline(df = illinois, variable = educ6, weight = weight, remove = c("LT HS")) 52 | } 53 | -------------------------------------------------------------------------------- /man/moe_wave_crosstab.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/moeWaveCrosstab.R 3 | \name{moe_wave_crosstab} 4 | \alias{moe_wave_crosstab} 5 | \title{weighted crosstabs with margin of error, where the x-variable identifies different survey waves} 6 | \usage{ 7 | moe_wave_crosstab( 8 | df, 9 | x, 10 | y, 11 | weight, 12 | remove = c(""), 13 | n = TRUE, 14 | pct_type = "row", 15 | format = "long", 16 | zscore = 1.96, 17 | unwt_n = FALSE 18 | ) 19 | } 20 | \arguments{ 21 | \item{df}{The data source} 22 | 23 | \item{x}{The independent variable, which uniquely identifies survey waves} 24 | 25 | \item{y}{The dependent variable} 26 | 27 | \item{weight}{The weighting variable} 28 | 29 | \item{remove}{An optional character vector of values to remove from final table
(e.g. "refused"). 30 | This will not affect any calculations made. The vector is not case-sensitive.} 31 | 32 | \item{n}{logical, if TRUE numeric totals are included.} 33 | 34 | \item{pct_type}{Controls the kind of percentage values returned. One of "row" or "cell." 35 | Column percents are not supported.} 36 | 37 | \item{format}{one of "long" or "wide"} 38 | 39 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval} 40 | 41 | \item{unwt_n}{logical, if TRUE it adds a column with unweighted frequency values} 42 | } 43 | \value{ 44 | a tibble 45 | } 46 | \description{ 47 | \code{moe_wave_crosstab} returns a tibble containing a weighted crosstab of two variables 48 | with margin of error. Use this function when the x-variable indicates different survey 49 | waves for which weights were calculated independently. 50 | } 51 | \details{ 52 | Options include row or cell percentages. The tibble can be in long or wide format. The margin of 53 | error includes the design effect of the weights, calculated separately for each 54 | survey wave. 
55 | } 56 | \examples{ 57 | moe_wave_crosstab(df = illinois, x = year, y = maritalstatus, weight = weight) 58 | moe_wave_crosstab(df = illinois, x = year, y = maritalstatus, weight = weight, format = "wide") 59 | } 60 | -------------------------------------------------------------------------------- /man/moe_wave_crosstab_3way.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/moeWaveCrosstab3way.R 3 | \name{moe_wave_crosstab_3way} 4 | \alias{moe_wave_crosstab_3way} 5 | \title{weighted 3-way crosstabs with margin of error, where the z-variable identifies different survey waves} 6 | \usage{ 7 | moe_wave_crosstab_3way( 8 | df, 9 | x, 10 | y, 11 | z, 12 | weight, 13 | remove = c(""), 14 | n = TRUE, 15 | pct_type = "row", 16 | format = "long", 17 | zscore = 1.96, 18 | unwt_n = FALSE 19 | ) 20 | } 21 | \arguments{ 22 | \item{df}{The data source} 23 | 24 | \item{x}{The independent variable} 25 | 26 | \item{y}{The dependent variable} 27 | 28 | \item{z}{The second control variable, uniquely identifies survey waves} 29 | 30 | \item{weight}{The weighting variable} 31 | 32 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused"). 33 | This will not affect any calculations made. The vector is not case-sensitive.} 34 | 35 | \item{n}{logical, if TRUE numeric totals are included.} 36 | 37 | \item{pct_type}{Controls the kind of percentage values returned. One of "row" or "cell."} 38 | 39 | \item{format}{one of "long" or "wide"} 40 | 41 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval} 42 | 43 | \item{unwt_n}{logical, if TRUE it adds a column with unweighted frequency values} 44 | } 45 | \value{ 46 | a tibble 47 | } 48 | \description{ 49 | \code{moe_wave_crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable with margin of error. 
50 | Use this function when the z-variable indicates different survey 51 | waves for which weights were calculated independently. 52 | } 53 | \details{ 54 | Options include row or cell percentages. The tibble can be in long or wide format. 55 | These tables are ideal for use with small multiples created with ggplot2::facet_wrap. 56 | } 57 | \examples{ 58 | moe_wave_crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight) 59 | moe_wave_crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight, format = "wide") 60 | } 61 | -------------------------------------------------------------------------------- /man/moedeff_calc.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/DesignEffectAndMOE.R 3 | \name{moedeff_calc} 4 | \alias{moedeff_calc} 5 | \title{Calculate the margin of error (including design effect) of a sample} 6 | \usage{ 7 | moedeff_calc(pct, deff, n, zscore = 1.96) 8 | } 9 | \arguments{ 10 | \item{pct}{a proportion} 11 | 12 | \item{deff}{a design effect} 13 | 14 | \item{n}{the sample size} 15 | 16 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval.} 17 | } 18 | \value{ 19 | A percentage 20 | } 21 | \description{ 22 | \code{moedeff_calc} returns a single number. It is designed for use in the moe family of functions.
23 | } 24 | \details{ 25 | This function returns the margin of error including design effect of a given sample of weighted data using the formula 26 | sqrt(deff)*zscore*sqrt((pct*(1-pct))/(n-1))*100 27 | } 28 | \examples{ 29 | moedeff_calc(pct = 0.515, deff = 1.6, n = 214) 30 | } 31 | -------------------------------------------------------------------------------- /man/summary_table.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/SummaryStatistics.R 3 | \name{summary_table} 4 | \alias{summary_table} 5 | \title{weighted summary table} 6 | \usage{ 7 | summary_table(df, variable, weight, name_style = "clean") 8 | } 9 | \arguments{ 10 | \item{df}{The data source} 11 | 12 | \item{variable}{the variable to summarize, it should be numeric} 13 | 14 | \item{weight}{The weighting variable} 15 | 16 | \item{name_style}{the style of the column names--one of "clean" or "pretty." 17 | Clean names are all lower case and words are separated by an underscore. 18 | Pretty names begin with a capital letter and words are separated by a space.} 19 | } 20 | \value{ 21 | a tibble 22 | } 23 | \description{ 24 | \code{summary_table} returns a tibble containing a weighted summary table of a single variable.
25 | } 26 | \details{ 27 | The resulting tibble includes columns for the variable name, unweighted observations, 28 | weighted observations, weighted mean, minimum value, maximum value, 29 | unweighted missing values, and weighted missing values. 30 | } 31 | \examples{ 32 | summary_table(illinois, age, weight) 33 | summary_table(illinois, age, weight, name_style = "pretty") 34 | } 35 | -------------------------------------------------------------------------------- /man/topline.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/Topline.R 3 | \name{topline} 4 | \alias{topline} 5 | \title{weighted topline} 6 | \usage{ 7 | topline( 8 | df, 9 | variable, 10 | weight, 11 | remove = c(""), 12 | n = TRUE, 13 | pct = TRUE, 14 | valid_pct = TRUE, 15 | cum_pct = TRUE 16 | ) 17 | } 18 | \arguments{ 19 | \item{df}{The data source} 20 | 21 | \item{variable}{the variable name} 22 | 23 | \item{weight}{The weighting variable} 24 | 25 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused"). 26 | This will not affect any calculations made. The vector is not case-sensitive.} 27 | 28 | \item{n}{logical, if TRUE a frequency column 29 | is included.} 30 | 31 | \item{pct}{logical, if TRUE a column of percents is included} 32 | 33 | \item{valid_pct}{logical, if TRUE a column of valid percents is included} 34 | 35 | \item{cum_pct}{logical, if TRUE a column of cumulative percents is included} 36 | } 37 | \value{ 38 | a tibble 39 | } 40 | \description{ 41 | \code{topline} returns a tibble containing a weighted topline of one variable 42 | } 43 | \details{ 44 | By default the table includes a column for frequency count, percent, valid percent, and cumulative percent.
45 | } 46 | \examples{ 47 | topline(illinois, sex, weight) 48 | topline(illinois, sex, weight, pct = FALSE) 49 | } 50 | -------------------------------------------------------------------------------- /man/wtd_mean.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/SummaryStatistics.R 3 | \name{wtd_mean} 4 | \alias{wtd_mean} 5 | \title{weighted mean} 6 | \usage{ 7 | wtd_mean(df, variable, weight) 8 | } 9 | \arguments{ 10 | \item{df}{The data source} 11 | 12 | \item{variable}{the variable, it should be numeric} 13 | 14 | \item{weight}{The weighting variable} 15 | } 16 | \value{ 17 | a numeric value 18 | } 19 | \description{ 20 | \code{wtd_mean} returns the weighted mean of a variable. It's a tidy-compatible 21 | wrapper around stats::weighted.mean(). 22 | } 23 | \examples{ 24 | wtd_mean(illinois, age, weight) 25 | 26 | library(dplyr) 27 | illinois \%>\% wtd_mean(age, weight) 28 | } 29 | -------------------------------------------------------------------------------- /vignettes/.gitignore: -------------------------------------------------------------------------------- 1 | *.html 2 | *.R 3 | -------------------------------------------------------------------------------- /vignettes/crosstab3way.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "3-way crosstabs" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{crosstab3way} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | %\VignetteEncoding{UTF-8} 8 | --- 9 | 10 | ```{r, include = FALSE} 11 | knitr::opts_chunk$set( 12 | collapse = TRUE, 13 | comment = "#>" 14 | ) 15 | ``` 16 | 17 | ```{r setup} 18 | library(pollster) 19 | library(dplyr) 20 | library(knitr) 21 | library(ggplot2) 22 | ``` 23 | 24 | It's common to want to view a crosstab of two variables by a third variable, for instance educational attainment by sex *and* 
marital status. The function `crosstab_3way` accomplishes this. Row and cell percents are both supported; column percents are not. 25 | 26 | ```{r} 27 | illinois %>% 28 | # filter for recent years & limited ages 29 | filter(year > 2009, 30 | age > 39) %>% 31 | crosstab_3way(x = sex, y = educ6, z = maritalstatus, weight = weight, 32 | remove = c("widow/divorced/sep"), 33 | n = FALSE) %>% 34 | kable(digits = 0, caption = "Educational attainment by sex and marital status among Illinois residents ages 40+", 35 | format = "html") 36 | ``` 37 | 38 | Three-way crosstabs plot well as small multiples using ggplot facets. 39 | 40 | ```{r, fig.width=6, fig.height=4} 41 | illinois %>% 42 | # filter for recent years & limited ages 43 | filter(year > 2009, 44 | age > 34) %>% 45 | crosstab_3way(x = sex, y = educ6, z = maritalstatus, weight = weight, 46 | remove = c("widow/divorced/sep"), 47 | format = "long") %>% 48 | ggplot(aes(educ6, pct, fill = maritalstatus)) + 49 | geom_bar(stat = "identity", position = position_dodge()) + 50 | facet_wrap(facets = vars(sex)) + 51 | labs(title = "Educational attainment by sex and marital status", 52 | subtitle = "Illinois residents ages 35+") + 53 | theme(legend.position = "top") 54 | ``` 55 | 56 | The same plot can be made with margins of error as well. (See the "crosstabs" vignette for a more detailed discussion of margins of error.)
57 | 58 | 59 | ```{r, fig.width=6, fig.height=4} 60 | illinois %>% 61 | # filter for recent years & limited ages 62 | filter(year > 2009, 63 | age > 34) %>% 64 | moe_crosstab_3way(x = sex, y = educ6, z = maritalstatus, weight = weight, 65 | remove = c("widow/divorced/sep"), format = "long") %>% 66 | ggplot(aes(educ6, pct, fill = maritalstatus)) + 67 | geom_bar(stat = "identity", position = position_dodge(), 68 | alpha = 0.5) + 69 | geom_errorbar(aes(ymin = (pct - moe), ymax = (pct + moe), 70 | color = maritalstatus), 71 | position = position_dodge()) + 72 | facet_wrap(facets = vars(sex)) + 73 | labs(title = "Educational attainment by sex and marital status", 74 | subtitle = "Illinois residents ages 35+", 75 | caption = "Current Population Survey, 2010-2018") + 76 | theme(legend.position = "top") 77 | ``` 78 | 79 | ### Special case, when the z-variable identifies survey waves 80 | 81 | If the z-variable in your crosstab uniquely identifies survey waves for which the weights were independently generated, it is best practice to calculate the design effect independently for each wave. `moe_wave_crosstab_3way` does just that. All of the arguments remain the same as in `moe_crosstab_3way`.
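
To see why the per-wave calculation matters, here is a minimal base-R sketch using the two formulas documented on the `deff_calc` and `moedeff_calc` help pages. The helper names `deff_sketch` and `moe_sketch` and the `toy` data frame are made up for illustration; they are not part of `pollster`, and the package's internal code may differ.

```r
# Design effect from a vector of weights, following the formula on the
# deff_calc help page. Zero weights are dropped first, as that page describes.
# (deff_sketch is a hypothetical stand-in, not a pollster function.)
deff_sketch <- function(w) {
  w <- w[w != 0]
  length(w) * sum(w^2) / (sum(w)^2)
}

# Margin of error in percentage points, following the formula on the
# moedeff_calc help page. `pct` is a proportion between 0 and 1.
# (moe_sketch is likewise hypothetical.)
moe_sketch <- function(pct, deff, n, zscore = 1.96) {
  sqrt(deff) * zscore * sqrt((pct * (1 - pct)) / (n - 1)) * 100
}

# Toy data, invented for this example: wave 1 has equal weights,
# wave 2 has unequal weights.
toy <- data.frame(
  wave   = c(1, 1, 1, 1, 2, 2, 2, 2),
  weight = c(1, 1, 1, 1, 1, 2, 3, 2)
)

# One design effect per wave, versus a single pooled design effect.
deff_by_wave <- tapply(toy$weight, toy$wave, deff_sketch)
deff_pooled  <- deff_sketch(toy$weight)

# Margin of error for a 50% estimate within each wave (n = 4 per wave).
moe_by_wave <- moe_sketch(pct = 0.5, deff = deff_by_wave, n = 4)

print(deff_by_wave)  # wave 1: 1, wave 2: 1.125
print(deff_pooled)
print(moe_by_wave)
```

In this toy example the pooled design effect is larger than either per-wave value, because pooling treats the difference in weight scale between waves as extra variability. Computing the design effect separately for each wave avoids that.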
82 | 83 | ```{r} 84 | moe_wave_crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight) 85 | ``` 86 | -------------------------------------------------------------------------------- /vignettes/crosstabs.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "crosstabs" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{crosstabs} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | %\VignetteEncoding{UTF-8} 8 | --- 9 | 10 | ```{r, include = FALSE} 11 | knitr::opts_chunk$set( 12 | collapse = TRUE, 13 | comment = "#>" 14 | ) 15 | ``` 16 | 17 | ```{r setup} 18 | library(pollster) 19 | library(dplyr) 20 | library(knitr) 21 | library(ggplot2) 22 | ``` 23 | 24 | Crosstabs can come in [wide or long format](https://en.wikipedia.org/wiki/Wide_and_narrow_data). Each is useful, depending on your purpose. Wide data is best for display tables. Long data is usually better for making plots, for instance. 25 | 26 | Here is a wide table. 27 | 28 | ```{r} 29 | crosstab(df = illinois, x = sex, y = educ6, weight = weight) %>% 30 | kable() 31 | ``` 32 | 33 | And here is long format. 34 | 35 | ```{r} 36 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, format = "long") 37 | ``` 38 | 39 | By default, row percentages are used. You can also explicitly choose cell or column percentages using the `pct_type` argument. I discourage the use of column percentages--it's better to just flip the x and y variables and make row percents--but the option is included to match functionality provided by other standard statistical software. 40 | 41 | ```{r} 42 | # cell percentages 43 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, pct_type = "cell") 44 | 45 | # column percentages 46 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, pct_type = "column") 47 | ``` 48 | 49 | To make a graph, just feed your `tibble` output to a `ggplot2` function. 
50 | 51 | ```{r, fig.width=5.6} 52 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, format = "long") %>% 53 | ggplot(aes(x = educ6, y = pct, fill = sex)) + 54 | geom_bar(stat = "identity", position = position_dodge()) + 55 | labs(title = "Educational attainment of the Illinois adult population by gender") 56 | ``` 57 | 58 | ## Margin of error 59 | 60 | ### How the margin of error is calculated 61 | 62 | The margin of error is calculated including the design effect of the sample weights, using the following formula: 63 | 64 | `sqrt(design effect)*zscore*sqrt((pct*(1-pct))/(n-1))*100` 65 | 66 | The design effect is calculated using the formula `length(weights)*sum(weights^2)/(sum(weights)^2)`. 67 | 68 | ------ 69 | 70 | Get a crosstab table with the margin of error in a separate column using the `moe_crosstab` function. By default, a z-score of 1.96 (95% confidence interval) is used. Supply your own desired z-score using the `zscore` argument. Only row and cell percents are supported. By default, the table format is long because I anticipate making visualizations will be the most common use case for this function. 71 | 72 | ```{r} 73 | moe_crosstab(illinois, educ6, voter, weight) 74 | ``` 75 | 76 | A wide format table looks like this. 77 | 78 | ```{r} 79 | moe_crosstab(illinois, educ6, voter, weight, format = "wide") 80 | ``` 81 | 82 | `ggplot2` offers [multiple ways](http://www.sthda.com/english/wiki/ggplot2-error-bars-quick-start-guide-r-software-and-data-visualization) to visualize the margin of error. Here is one good option. (Please note, if you don't have ggplot2 >= [3.3.0](https://www.tidyverse.org/blog/2020/03/ggplot2-3-3-0/) you'll get an error message.) 
83 | 84 | ```{r, fig.width=5} 85 | illinois %>% 86 | filter(year == 2016) %>% 87 | moe_crosstab(educ6, voter, weight) %>% 88 | ggplot(aes(x = pct, y = educ6, xmin = (pct - moe), xmax = (pct + moe), 89 | color = voter)) + 90 | geom_pointrange(position = position_dodge(width = 0.2)) 91 | ``` 92 | 93 | ### Special case, the x-variable identifies survey waves 94 | 95 | If the x-variable in your crosstab uniquely identifies survey waves for which the weights were independently generated, it is best practice to calculate the design effect independently for each wave. `moe_wave_crosstab` does just that. All of the arguments remain the same as in `moe_crosstab`. 96 | 97 | ```{r} 98 | moe_wave_crosstab(df = illinois, x = year, y = rv, weight = weight) 99 | ``` 100 | -------------------------------------------------------------------------------- /vignettes/toplines.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "toplines" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{toplines} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | %\VignetteEncoding{UTF-8} 8 | --- 9 | 10 | ```{r, include = FALSE} 11 | knitr::opts_chunk$set( 12 | collapse = TRUE, 13 | comment = "#>" 14 | ) 15 | ``` 16 | 17 | ```{r setup} 18 | library(pollster) 19 | library(dplyr) 20 | library(knitr) 21 | library(ggplot2) 22 | ``` 23 | 24 | The default topline table comes with columns for response category, frequency count, percent, valid percent, and cumulative percent. 25 | 26 | ```{r} 27 | topline(df = illinois, variable = voter, weight = weight) %>% 28 | kable() 29 | ``` 30 | 31 | Because the output is a `tibble`, it's simple to manipulate it in any way you want after creating it. Use `dplyr::select` to remove columns or `dplyr::filter` to remove rows. For convenience, the `topline` function also provides ways to do this within the function call. 
For example, the `remove` argument accepts a character vector of response values to be removed from the table *after* all statistics are calculated. This is especially useful for survey data with a "refused" category. 32 | 33 | ```{r} 34 | topline(df = illinois, variable = voter, weight = weight, 35 | remove = c("(Missing)"), pct = FALSE) %>% 36 | mutate(Frequency = prettyNum(Frequency, big.mark = ",")) %>% 37 | kable(digits = 0) 38 | ``` 39 | 40 | Refer to the [`kableExtra` package](https://CRAN.R-project.org/package=kableExtra) for lots of examples on how to format the appearance of these tables in either HTML or PDF LaTeX formats. I recommend the vignettes "Create Awesome HTML Table with knitr::kable and kableExtra" and "Create Awesome PDF Table with knitr::kable and kableExtra". 41 | 42 | ## Graphs 43 | 44 | ```{r, fig.width=4} 45 | topline(df = illinois, variable = voter, weight = weight) %>% 46 | ggplot(aes(Response, Percent, fill = Response)) + 47 | geom_bar(stat = "identity") 48 | ``` 49 | 50 | ## Margin of error 51 | 52 | Get a topline table with the margin of error in a separate column using the `moe_topline` function. By default, a z-score of 1.96 (95% confidence interval) is used. Supply your own desired z-score using the `zscore` argument. 53 | 54 | ```{r} 55 | moe_topline(df = illinois, variable = educ6, weight = weight) 56 | ``` 57 | 58 | The margin of error is calculated including the design effect of the sample weights, using the following formula: 59 | 60 | `sqrt(design effect)*zscore*sqrt((pct*(1-pct))/(n-1))*100` 61 | 62 | The design effect is calculated using the formula `length(weights)*sum(weights^2)/(sum(weights)^2)`. 63 | --------------------------------------------------------------------------------
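The two formulas quoted in the vignettes can be checked by hand with a few lines of base R. This is only a sketch: `kish_deff` and `moe_pct` are hypothetical helper names made up for illustration, not pollster functions, and `pct` is assumed to be a proportion between 0 and 1 so that the result comes out in percentage points.

```r
# Hypothetical helpers (not part of pollster) that transcribe the two
# formulas quoted in the vignettes.

# Design effect: length(weights) * sum(weights^2) / (sum(weights)^2)
kish_deff <- function(weights) {
  length(weights) * sum(weights^2) / (sum(weights)^2)
}

# Margin of error in percentage points.
# Assumes pct is a proportion in [0, 1] and n is the number of observations.
moe_pct <- function(pct, n, weights, zscore = 1.96) {
  sqrt(kish_deff(weights)) * zscore * sqrt((pct * (1 - pct)) / (n - 1)) * 100
}

# Equal weights: the design effect is exactly 1.
equal_w <- rep(1, 100)
deff_equal <- kish_deff(equal_w)

# Unequal weights: half the sample weighted 1, half weighted 3.
# 100 * (50 * 1 + 50 * 9) / 200^2 = 1.25
unequal_w <- rep(c(1, 3), each = 50)
deff_unequal <- kish_deff(unequal_w)

# The same 50% estimate gets a wider margin of error under unequal weights.
moe_equal <- moe_pct(0.5, n = 100, weights = equal_w)
moe_unequal <- moe_pct(0.5, n = 100, weights = unequal_w)
```

With these made-up weights the margin of error grows by a factor of `sqrt(1.25)`, which is the sense in which the design effect inflates the margin of error when weights vary.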