├── .travis.yml
├── DESCRIPTION
├── NAMESPACE
├── NEWS.md
├── R
    ├── Crosstab.R
    ├── Crosstab3way.R
    ├── DesignEffectAndMOE.R
    ├── InternalHelperFunctions.R
    ├── SummaryStatistics.R
    ├── Topline.R
    ├── data.R
    ├── globals.R
    ├── moeCrosstab.R
    ├── moeCrosstab3way.R
    ├── moeTopline.R
    ├── moeWaveCrosstab.R
    └── moeWaveCrosstab3way.R
├── README.Rmd
├── README.html
├── README.md
├── cran-comments.md
├── data
    └── illinois.rda
├── man
    ├── crosstab.Rd
    ├── crosstab_3way.Rd
    ├── deff_calc.Rd
    ├── figures
    │   ├── README-unnamed-chunk-12-1.png
    │   ├── README-unnamed-chunk-8-1.png
    │   └── README-unnamed-chunk-9-1.png
    ├── illinois.Rd
    ├── moe_crosstab.Rd
    ├── moe_crosstab_3way.Rd
    ├── moe_topline.Rd
    ├── moe_wave_crosstab.Rd
    ├── moe_wave_crosstab_3way.Rd
    ├── moedeff_calc.Rd
    ├── summary_table.Rd
    ├── topline.Rd
    └── wtd_mean.Rd
└── vignettes
    ├── .gitignore
    ├── crosstab3way.Rmd
    ├── crosstabs.Rmd
    └── toplines.Rmd


/.travis.yml:
--------------------------------------------------------------------------------
1 | # R for travis: see documentation at https://docs.travis-ci.com/user/languages/r
2 | 
3 | language: R
4 | cache: packages
5 | 


--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: pollster
 2 | Type: Package
 3 | Title: Calculate Crosstab and Topline Tables of Weighted Survey Data
 4 | Version: 0.1.6
 5 | Author: John D. Johnson [aut, cre]
 6 | Maintainer: John D. Johnson <john.d.johnson@marquette.edu>
 7 | Description: Calculate common types of tables for weighted survey data.
 8 |     Options include topline and (2-way and 3-way) crosstab tables of
 9 |     categorical or ordinal data as well as summary tables of weighted
10 |     numeric variables. Optionally, include the margin of error at
11 |     selected confidence intervals including the design effect. The
12 |     design effect is calculated as described by 
13 |     Kish (1965) <doi:10.1002/bimj.19680100122> beginning
14 |     on page 257. Output takes the form of tibbles (simple data frames).
15 |     This package conveniently handles labelled data, such as that
16 |     commonly used by 'Stata' and 'SPSS.' Complex survey design is 
17 |     not supported at this time. 
18 | Depends:
19 | 	R (>= 2.10)
20 | Imports:
21 |    dplyr (>= 0.8.0),
22 |    stringr (>= 1.0.0),
23 |    tidyr (>= 1.1.0),
24 |    labelled (>= 2.0.0),
25 |    forcats (>= 1.0.0),
26 |    rlang  (>= 0.4.5)
27 | Suggests:
28 |     ggplot2 (>= 3.3.0),
29 |     knitr,
30 |     rmarkdown
31 | License: CC0
32 | Encoding: UTF-8
33 | LazyData: true
34 | RoxygenNote: 7.2.0
35 | VignetteBuilder: knitr
36 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
 1 | # Generated by roxygen2: do not edit by hand
 2 | 
 3 | export(crosstab)
 4 | export(crosstab_3way)
 5 | export(deff_calc)
 6 | export(moe_crosstab)
 7 | export(moe_crosstab_3way)
 8 | export(moe_topline)
 9 | export(moe_wave_crosstab)
10 | export(moe_wave_crosstab_3way)
11 | export(moedeff_calc)
12 | export(summary_table)
13 | export(topline)
14 | export(wtd_mean)
15 | import(dplyr)
16 | import(labelled)
17 | import(rlang)
18 | import(stringr)
19 | import(tidyr)
20 | importFrom(stats,weighted.mean)
21 | 


--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
 1 | # pollster 0.1.6
 2 | * fix bug which led to explicitly missing factor levels still being included in the "valid percent" column.
 3 | 
 4 | # pollster 0.1.5
 5 | * replace deprecated `forcats::fct_explicit_na` with `forcats::fct_na_value_to_level`
 6 | * require at least forcats v1.0.0
 7 | 
 8 | # pollster 0.1.4
 9 | * rows are explicitly rearranged by x and z in wide 3-way crosstabs
10 | * weights of 0 are removed prior to calculation of the design effect
11 | * an error is now given in all table-creating functions if the weight variable includes NA values
12 | 
13 | # pollster 0.1.3
14 | * crosstab functions now include an option to add a column with unweighted frequencies. Currently, it is not available for column percents.
15 | * a bug is fixed that gave an error in `crosstab(..., pct_type = "col", format = "wide", n = FALSE)`
16 | * `crosstab_3way` now places the n column at the end of the dataframe, consistent with `crosstab`
17 | * fix bug in `moe_crosstab` & `moe_crosstab_3way` that reported unweighted n in place of weighted n
18 | 
19 | # pollster 0.1.2
20 | 
21 | * pollster now depends on the most recent version of tidyr (1.1.0) because it uses the argument `names_sort = TRUE` to ensure that `tidyr::pivot_wider` arranges rows and columns in the order of their factor levels.
22 | 
23 | # pollster 0.1.1
24 | 
25 | * improvements to how crosstab functions conditionally convert factor to date class. This includes removing the lubridate dependency.
26 | * crosstab functions now convert factors in crosstabs to numeric values when all values are numeric
27 | * crosstabs now show a value of 0% instead of NA when there are no values.
28 | * add CRAN installation to readme
29 | 
30 | 
31 | # pollster 0.1.0
32 | 
33 | * package accepted by CRAN on 03/25/2020
34 | 


--------------------------------------------------------------------------------
/R/Crosstab.R:
--------------------------------------------------------------------------------
  1 | #' weighted crosstabs
  2 | #'
  3 | #' \code{crosstab} returns a tibble containing a weighted crosstab of two variables
  4 | #'
  5 | #'  Options  include row, column, or cell percentages. The tibble can be in long or wide format.
  6 | #'
  7 | #' @param df The data source
  8 | #' @param x The independent variable
  9 | #' @param y The dependent variable
 10 | #' @param weight The weighting variable
 11 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused").
 12 | #' This will not affect any calculations made. The vector is not case-sensitive.
 13 | #' @param n logical, if TRUE numeric totals are included. They are included in a separate column for row and cell
 14 | #' percentages, but in a separate row for wide format column percentages.
 15 | #' @param pct_type Controls the kind of percentage values returned. One of "row," "cell," or "column."
 16 | #' @param format one of "long" or "wide"
 17 | #' @param unwt_n logical, if TRUE a column "unweighted_n" is included containing the unweighted frequency count. It is not available when pct_type is "column"
 18 | #'
 19 | #' @return a tibble
 20 | #' @export
 21 | #' @import dplyr
 22 | #' @import stringr
 23 | #' @import tidyr
 24 | #' @import labelled
 25 | #' @import rlang
 26 | #'
 27 | #' @examples
 28 | #' crosstab(df = illinois, x = voter, y = raceethnic, weight = weight)
 29 | #' crosstab(df = illinois, x = voter, y = raceethnic, weight = weight, n = FALSE)
 30 | 
 31 | crosstab <- function(df, x, y, weight, remove = "", n = TRUE,
 32 |                      pct_type = "row", format = "wide", unwt_n = FALSE){
 33 | 
 34 |   # make sure no weights are NA
 35 |   w <- df %>% pull({{weight}})
 36 |   if(length(w[is.na(w)]) > 0){
 37 |     stop("The weight variable contains missing values.", call. = FALSE)
 38 |   }
 39 | 
 40 |   # make sure the arguments are all correct
 41 |   stopifnot(pct_type %in% c("row", "cell", "column", "col"),
 42 |             format %in% c("wide", "long"))
 43 | 
 44 |   if(str_to_lower(pct_type) == "row"){
 45 |     d.output <- df %>%
 46 |       # Remove missing cases
 47 |       filter(!is.na({{x}}),
 48 |              !is.na({{y}})) %>%
 49 |       # Convert to ordered factors
 50 |       mutate({{x}} := to_factor({{x}}, sort_levels = "values"),
 51 |              {{y}} := to_factor({{y}}, sort_levels = "values")) %>%
 52 |       # Calculate denominator
 53 |       group_by({{x}}) %>%
 54 |       mutate(total = sum({{weight}}),
 55 |              unweighted_n = n()) %>%
 56 |       # Calculate proportions
 57 |       group_by({{x}}, {{y}}) %>%
 58 |       summarise(pct = (sum({{weight}})/first(total))*100,
 59 |                 n = first(total),
 60 |                 unweighted_n = first(unweighted_n)) %>%
 61 |       # Remove values included in "remove" string
 62 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
 63 |              !str_to_upper({{y}}) %in% str_to_upper(remove)) %>%
 64 |       ungroup()
 65 | 
 66 |     # spread if format = wide
 67 |     if(format == "wide"){
 68 |       d.output <- d.output %>%
 69 |         # Spread so x is rows and y is columns
 70 |         pivot_wider(names_from = {{y}}, values_from = pct,
 71 |                     values_fill = list(pct = 0), names_sort = TRUE) %>%
 72 |         # move the n and unweighted_n columns to the end
 73 |         select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
 74 |         ungroup()
 75 |     }
 76 | 
 77 |     # remove n if required
 78 |     if(n == FALSE){
 79 |       d.output <- select(d.output, -n)
 80 |     }
 81 | 
 82 |     # remove unweighted n if required
 83 |     if(unwt_n == FALSE){
 84 |       d.output <- select(d.output, -unweighted_n)
 85 |     }
 86 | 
 87 |   } else if(str_to_lower(pct_type) %in% c("col", "column")){
 88 |     d.output <- df %>%
 89 |       # Remove missing cases
 90 |       filter(!is.na({{x}}),
 91 |              !is.na({{y}})) %>%
 92 |       # Convert to ordered factors
 93 |       mutate({{x}} := to_factor({{x}}, sort_levels = "values"),
 94 |              {{y}} := to_factor({{y}}, sort_levels = "values")) %>%
 95 |       # calculate denominator
 96 |       group_by({{y}}) %>%
 97 |       mutate(total = sum({{weight}})) %>%
 98 |       # calculate proportions
 99 |       group_by({{x}}, {{y}}) %>%
100 |       summarise(pct = (sum({{weight}})/first(total))*100,
101 |                 n = first(total)) %>%
102 |       ungroup() %>%
103 |       # remove values included in "remove" string
104 |       filter(! str_to_upper({{x}}) %in% str_to_upper(remove),
105 |              ! str_to_upper({{y}}) %in% str_to_upper(remove))
106 | 
107 |     if(format == "wide"){
108 |       # make the total row separately
109 |       total.row <- d.output %>%
110 |         group_by({{y}}, n) %>%
111 |         summarise() %>%
112 |         pivot_wider(names_from = {{y}}, values_from = n, values_fill = list(pct = 0),
113 |                     names_sort = TRUE) %>%
114 |         mutate({{x}} := "n")
115 | 
116 |       # spread the output table
117 |       d.output <- d.output %>%
118 |         # drop the n column
119 |         select(-n) %>%
120 |         # spread so x is rows and y is columns
121 |         pivot_wider(names_from = {{y}}, values_from = pct, values_fill = list(pct = 0),
122 |                     names_sort = TRUE)
123 | 
124 |       # if n = TRUE, then add the n row
125 |       # this causes the response column to switch from factor to character
126 |       if(n == TRUE){
127 |         d.output <- suppressWarnings(bind_rows(d.output, total.row))
128 |       }
129 |     }
130 | 
131 |     # remove n column if n == FALSE (long format only)
132 |     if(n == FALSE & format == "long"){
133 |       d.output <- select(d.output, -n)
134 |     }
135 | 
136 |   } else if(str_to_lower(pct_type) == "cell"){
137 |     d.output <- df %>%
138 |       # Remove missing cases
139 |       filter(!is.na({{x}}),
140 |              !is.na({{y}})) %>%
141 |       # Convert to ordered factors
142 |       mutate({{x}} := to_factor({{x}}, sort_levels = "values"),
143 |              {{y}} := to_factor({{y}}, sort_levels = "values")) %>%
144 |       # Calculate denominator
145 |       mutate(total = sum({{weight}}),
146 |              unweighted_n = n()) %>%
147 |       # Calculate proportions
148 |       group_by({{x}}, {{y}}) %>%
149 |       summarise(pct = (sum({{weight}})/first(total))*100,
150 |                 n = first(total),
151 |                 unweighted_n = first(unweighted_n)) %>%
152 |       # Remove values included in "remove" string
153 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
154 |              !str_to_upper({{y}}) %in% str_to_upper(remove))
155 | 
156 |     # if format = wide, then spread the table
157 |     if(format == "wide"){
158 |       d.output <- d.output %>%
159 |         # Spread so x is rows and y is columns
160 |         pivot_wider(names_from = {{y}}, values_from = pct, values_fill = 0,
161 |                     names_sort = TRUE) %>%
162 |         # move the n and unweighted_n columns to the end
163 |         select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
164 |         ungroup()
165 |     }
166 | 
167 |     # remove n column if n == FALSE
168 |     if(n == FALSE){
169 |       d.output <- select(d.output, -n)
170 |     }
171 | 
172 |     # remove unweighted n column if unwt_n == FALSE
173 |     if(unwt_n == FALSE){
174 |       d.output <- select(d.output, -unweighted_n)
175 |     }
176 |   }
177 | 
178 |   # test if date or number
179 |   factor.true.type <- what_is_this_factor(pull(d.output, {{x}}))
180 |   if(factor.true.type == "date"){
181 |     d.output %>%
182 |       as_tibble() %>%
183 |       mutate({{x}} := as.Date({{x}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y")))
184 |   } else if(factor.true.type == "number"){
185 |     d.output %>%
186 |       as_tibble() %>%
187 |       mutate({{x}} := as.numeric(as.character({{x}})))
188 |   } else{
189 |     d.output %>%
190 |       as_tibble()
191 |   }
192 | }
193 | 
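
Note on /R/Crosstab.R: the row-percent branch above divides each cell's summed weight by the summed weight of its x group. The following is a minimal base R sketch of that arithmetic only. The toy data frame `d` and its values are invented for illustration, and `xtabs()` with `prop.table()` stands in for the package's dplyr pipeline rather than reproducing it.

```r
# Hypothetical toy data: two x groups, a yes/no outcome, and a weight column
d <- data.frame(
  x = c("a", "a", "b", "b"),
  y = c("yes", "no", "yes", "yes"),
  weight = c(1, 3, 2, 2)
)

# Weighted counts: sum of weights in each x-by-y cell
wt <- xtabs(weight ~ x + y, data = d)

# Row percentages: each cell divided by its row (x group) total, times 100
row_pct <- prop.table(wt, margin = 1) * 100

row_pct
```

In this toy case group "a" splits 75/25 between "no" and "yes", while group "b" is 100 percent "yes"; the empty "b"/"no" cell shows as 0, which is comparable to the `values_fill` of 0 used in the wide format above.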


--------------------------------------------------------------------------------
/R/Crosstab3way.R:
--------------------------------------------------------------------------------
  1 | #' weighted 3-way crosstabs
  2 | #'
  3 | #' \code{crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable
  4 | #'
  5 | #'  Options  include row or cell percentages. The tibble can be in long or wide format.
  6 | #'  These tables are ideal for use with small multiples created with ggplot2::facet_wrap.
  7 | #'
  8 | #' @param df The data source
  9 | #' @param x The independent variable
 10 | #' @param y The dependent variable
 11 | #' @param z The second control variable
 12 | #' @param weight The weighting variable
 13 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused").
 14 | #' This will not affect any calculations made. The vector is not case-sensitive.
 15 | #' @param n logical, if TRUE numeric totals are included.
 16 | #' @param pct_type Controls the kind of percentage values returned. One of "row" or "cell."
 17 | #' @param format one of "long" or "wide"
 18 | #' @param unwt_n logical, if TRUE a column is added containing unweighted frequency counts
 19 | #'
 20 | #' @return a tibble
 21 | #' @export
 22 | #' @import dplyr
 23 | #' @import stringr
 24 | #' @import tidyr
 25 | #' @import labelled
 26 | #' @import rlang
 27 | #'
 28 | #' @examples
 29 | #' crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight)
 30 | #' crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight,
 31 | #' format = "wide")
 32 | 
 33 | crosstab_3way <- function(df, x, y, z,
 34 |                           weight, remove = c(""),
 35 |                           n = TRUE, pct_type = "row", format = "wide",
 36 |                           unwt_n = FALSE){
 37 | 
 38 |   # make sure no weights are NA
 39 |   w <- df %>% pull({{weight}})
 40 |   if(length(w[is.na(w)]) > 0){
 41 |     stop("The weight variable contains missing values.", call. = FALSE)
 42 |   }
 43 | 
 44 |   # make sure the arguments are all correct
 45 |   stopifnot(pct_type %in% c("row", "cell"),
 46 |             format %in% c("wide", "long"))
 47 | 
 48 |   # row percents
 49 |   if(pct_type == "row"){
 50 |     d.output <- df %>%
 51 |       # Remove missing cases
 52 |       filter(!is.na({{x}}),
 53 |              !is.na({{y}}),
 54 |              !is.na({{z}})) %>%
 55 |       # Convert to ordered factors
 56 |       mutate({{x}} := to_factor({{x}}, sort_levels = "values"),
 57 |              {{y}} := to_factor({{y}}, sort_levels = "values"),
 58 |              {{z}} := to_factor({{z}}, sort_levels = "values")) %>%
 59 |       # Calculate denominator
 60 |       group_by({{x}}, {{z}}) %>%
 61 |       mutate(total = sum({{weight}}),
 62 |              unweighted_n = n()) %>%
 63 |       # Calculate proportions
 64 |       group_by({{x}}, {{y}}, {{z}}) %>%
 65 |       summarise(pct = (sum({{weight}})/first(total))*100,
 66 |                 n = first(total),
 67 |                 unweighted_n = first(unweighted_n)) %>%
 68 |       # Remove values included in "remove" string
 69 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
 70 |              !str_to_upper({{y}}) %in% str_to_upper(remove),
 71 |              !str_to_upper({{z}}) %in% str_to_upper(remove)) %>%
 72 |       ungroup()
 73 | 
 74 |     # wide format, if required
 75 |     if(str_to_lower(format) == "wide"){
 76 |       d.output <- d.output %>%
 77 |         pivot_wider(names_from = {{y}}, values_from = pct,
 78 |                     values_fill = list(pct = 0), names_sort = TRUE) %>%
 79 |         select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
 80 |         arrange({{x}}, {{z}})
 81 |     }
 82 |   } else if(pct_type == "cell"){
 83 |     d.output <- df %>%
 84 |       # Remove missing cases
 85 |       filter(!is.na({{x}}),
 86 |              !is.na({{y}}),
 87 |              !is.na({{z}})) %>%
 88 |       # Convert to ordered factors
 89 |       mutate({{x}} := to_factor({{x}}, sort_levels = "values"),
 90 |              {{y}} := to_factor({{y}}, sort_levels = "values"),
 91 |              {{z}} := to_factor({{z}}, sort_levels = "values")) %>%
 92 |       # Calculate denominator
 93 |       group_by({{z}}) %>%
 94 |       mutate(total = sum({{weight}}),
 95 |              unweighted_n = n()) %>%
 96 |       # Calculate proportions
 97 |       group_by({{x}}, {{y}}, {{z}}) %>%
 98 |       summarise(pct = (sum({{weight}})/first(total))*100,
 99 |                 n = first(total),
100 |                 unweighted_n = first(unweighted_n)) %>%
101 |       # Remove values included in "remove" string
102 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
103 |              !str_to_upper({{y}}) %in% str_to_upper(remove),
104 |              !str_to_upper({{z}}) %in% str_to_upper(remove)) %>%
105 |       ungroup()
106 | 
107 |     # wide format, if required
108 |     if(str_to_lower(format) == "wide"){
109 |       d.output <- d.output %>%
110 |         pivot_wider(names_from = {{y}}, values_from = pct,
111 |                     values_fill = list(pct = 0), names_sort = TRUE) %>%
112 |         select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
113 |         arrange({{x}}, {{z}})
114 |     }
115 |   }
116 | 
117 | 
118 |   if(n == FALSE){
119 |     d.output <- select(d.output, -n)
120 |   }
121 | 
122 |   if(unwt_n == FALSE){
123 |     d.output <- select(d.output, -unweighted_n)
124 |   }
125 | 
126 |   # test if date or number
127 |   factor.true.type <- what_is_this_factor(pull(d.output, {{z}}))
128 |   if(factor.true.type == "date"){
129 |     d.output %>%
130 |       as_tibble() %>%
131 |       mutate({{z}} := as.Date({{z}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y")))
132 |   } else if(factor.true.type == "number"){
133 |     d.output %>%
134 |       as_tibble() %>%
135 |       mutate({{z}} := as.numeric(as.character({{z}})))
136 |   } else{
137 |     d.output %>%
138 |       as_tibble()
139 |   }
140 | }
141 | 


--------------------------------------------------------------------------------
/R/DesignEffectAndMOE.R:
--------------------------------------------------------------------------------
 1 | #' Calculate the design effect of a sample
 2 | #'
 3 | #' \code{deff_calc} returns a single number
 4 | #'
 5 | #'  This function returns the design effect of a given sample using the formula
 6 | #'  length(w)*sum(w^2)/(sum(w)^2).
 7 | #'  It is designed for use in the moe family of functions. If any weights are equal to 0, they are removed prior to calculation.
 8 | #'
 9 | #' @param w a vector of weights
10 | #'
11 | #' @return A number
12 | #' @export
13 | #'
14 | #' @examples
15 | #' deff_calc(illinois$weight)
16 | #'
17 | deff_calc <- function(w){
18 |   # check for weights of 0
19 |   if(length(w[w == 0]) > 0){
20 |     message("Your data includes weights equal to zero. These are removed before calculating the design effect.")
21 |     # remove any weights that are 0
22 |     w <- w[w > 0]
23 |   }
24 | 
25 |   length(w)*sum(w^2)/(sum(w)^2)
26 | }
27 | 
28 | 
29 | #' Calculate the margin of error (including design effect) of a sample
30 | #'
31 | #' \code{moedeff_calc} returns a single number. It is designed for use in the moe family of functions.
32 | #'
33 | #'  This function returns the margin of error including design effect of a given sample of weighted data using the formula
34 | #'  sqrt(deff)*zscore*sqrt((pct*(1-pct))/(n-1))*100
35 | #'
36 | #' @param pct a proportion
37 | #' @param deff a design effect
38 | #' @param n the sample size
39 | #' @param zscore defaults to 1.96, consistent with a 95\% confidence interval.
40 | #'
41 | #' @return A percentage
42 | #' @export
43 | #'
44 | #' @examples
45 | #' moedeff_calc(pct = 0.515, deff = 1.6, n = 214)
46 | moedeff_calc <- function(pct, deff, n, zscore = 1.96){
47 |   sqrt(deff)*zscore*sqrt((pct*(1-pct))/(n-1))*100
48 | }
49 | 
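
Note on /R/DesignEffectAndMOE.R: the two formulas documented above can be checked by hand on small numbers. The sketch below is a standalone illustration in base R. The weight vector `w` and the helper name `moe_sketch` are hypothetical, chosen so the arithmetic is easy to follow; the helper simply restates the documented margin-of-error formula.

```r
# Hypothetical weights, chosen only to keep the arithmetic simple
w <- c(1, 1, 2, 4)

# Design effect as documented for deff_calc(): n * sum(w^2) / sum(w)^2
# Here: 4 * (1 + 1 + 4 + 16) / 8^2 = 88 / 64
deff <- length(w) * sum(w^2) / sum(w)^2

# Margin of error in percentage points, restating the documented formula:
# sqrt(deff) * zscore * sqrt(pct * (1 - pct) / (n - 1)) * 100
moe_sketch <- function(pct, deff, n, zscore = 1.96) {
  sqrt(deff) * zscore * sqrt(pct * (1 - pct) / (n - 1)) * 100
}

deff                                      # 1.375
moe_sketch(pct = 0.5, deff = 1, n = 401)  # 4.9 (no design effect)
moe_sketch(pct = 0.5, deff = deff, n = 401)
```

With equal weights the design effect is 1 and the formula reduces to the familiar simple-random-sample margin of error; unequal weights inflate it by a factor of sqrt(deff).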


--------------------------------------------------------------------------------
/R/InternalHelperFunctions.R:
--------------------------------------------------------------------------------
 1 | # Thanks to Berry Boessenkool for this function
 2 | # https://github.com/brry/berryFunctions/blob/master/R/is.error.R
 3 | # This function checks if a function returns an error
 4 | is_error <- function(expr){
 5 |   expr_name <- deparse(substitute(expr))
 6 |   test <- try(expr, silent=TRUE)
 7 |   iserror <- inherits(test, "try-error")
 8 |   # output:
 9 |   iserror
10 | }
11 | 
12 | # This function checks if a value is a date
13 | is_date <- function(potentialdate){
14 |   if(is_error(as.Date(potentialdate, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y"))) == TRUE){
15 |     FALSE
16 |   } else {TRUE}
17 | }
18 | 
19 | # This function checks if a value is a number
20 | is_a_number <- function(potentialnumber){
21 |   !is.na(suppressWarnings(as.numeric(as.character(potentialnumber))))
22 | }
23 | 
24 | # this function takes a vector of factor values & checks if all the
25 | # values are (1) a date or (2) a number.
26 | what_is_this_factor <- function(factorvector){
27 |   if(all(sapply({{factorvector}}, is_date))){
28 |     "date"
29 |   } else if(all(sapply({{factorvector}}, is_a_number))) {
30 |     "number"
31 |   } else {
32 |     "factor"
33 |   }
34 | }
35 | 


--------------------------------------------------------------------------------
/R/SummaryStatistics.R:
--------------------------------------------------------------------------------
  1 | #' weighted mean
  2 | #'
  3 | #' \code{wtd_mean} returns the weighted mean of a variable. It's a tidy-compatible
  4 | #' wrapper around stats::weighted.mean().
  5 | #'
  6 | #' @param df The data source
  7 | #' @param variable the variable, it should be numeric
  8 | #' @param weight The weighting variable
  9 | #'
 10 | #' @return a numeric value
 11 | #' @export
 12 | #' @import dplyr
 13 | #' @import rlang
 14 | #' @importFrom stats weighted.mean
 15 | #'
 16 | #' @examples
 17 | #' wtd_mean(illinois, age, weight)
 18 | #'
 19 | #' library(dplyr)
 20 | #' illinois %>% wtd_mean(age, weight)
 21 | wtd_mean <- function(df, variable, weight){
 22 |   df %>%
 23 |     summarise(mean = weighted.mean(x = {{variable}}, w = {{weight}})) %>%
 24 |     pull()
 25 | }
 26 | 
 27 | #' weighted summary table
 28 | #'
 29 | #' \code{summary_table} returns a tibble containing a weighted summary table of a single variable.
 30 | #'
 31 | #'  The resulting tibble includes columns for the variable name, unweighted observations,
 32 | #'  weighted observations, weighted mean, minimum value, maximum value,
 33 | #'  unweighted missing values, and weighted missing values
 34 | #'
 35 | #' @param df The data source
 36 | #' @param variable the variable to summarize, it should be numeric
 37 | #' @param weight The weighting variable
 38 | #' @param name_style the style of the column names--one of "clean" or "pretty."
 39 | #' Clean names are all lower case and words are separated by an underscore.
 40 | #' Pretty names begin with a capital letter and words are separated by a space.
 41 | #'
 42 | #' @return a tibble
 43 | #' @export
 44 | #' @import dplyr
 45 | #' @import rlang
 46 | #' @importFrom stats weighted.mean
 47 | #'
 48 | #' @examples
 49 | #' summary_table(illinois, age, weight)
 50 | #' summary_table(illinois, age, weight, name_style = "pretty")
 51 | summary_table <- function(df, variable, weight, name_style = "clean"){
 52 | 
 53 |   # make sure no weights are NA
 54 |   w <- df %>% pull({{weight}})
 55 |   if(length(w[is.na(w)]) > 0){
 56 |     stop("The weight variable contains missing values.", call. = FALSE)
 57 |   }
 58 | 
 59 |   stopifnot(name_style %in% c("clean", "pretty"))
 60 | 
 61 |   unweighted_observations <- df %>%
 62 |     filter(!is.na({{variable}})) %>%
 63 |     pull({{variable}}) %>%
 64 |     length()
 65 | 
 66 |   weighted_observations <- df %>%
 67 |     filter(!is.na({{variable}})) %>%
 68 |     pull({{weight}}) %>%
 69 |     sum()
 70 | 
 71 |   weighted_mean <- df %>%
 72 |     wtd_mean({{variable}}, {{weight}})
 73 | 
 74 |   min_value <- df %>%
 75 |     summarise(min = min({{variable}}, na.rm = TRUE)) %>%
 76 |     pull()
 77 | 
 78 |   max_value <- df %>%
 79 |     summarise(max = max({{variable}}, na.rm = TRUE)) %>%
 80 |     pull()
 81 | 
 82 |   missing_observations <- df %>%
 83 |     filter(is.na({{variable}})) %>%
 84 |     pull({{variable}}) %>%
 85 |     length()
 86 | 
 87 |   missing_weighted_observations <- df %>%
 88 |     filter(is.na({{variable}})) %>%
 89 |     pull({{weight}}) %>%
 90 |     sum()
 91 | 
 92 |   variable_name <- df %>%
 93 |     select({{variable}}) %>%
 94 |     names()
 95 | 
 96 |   output <- tibble(variable_name, unweighted_observations, weighted_observations,
 97 |          weighted_mean, min_value, max_value, missing_observations,
 98 |          missing_weighted_observations)
 99 | 
100 |   if(name_style == "pretty"){
101 |     names(output) <- c("Variable", "Unweighted obs",
102 |                        "Weighted obs", "Weighted mean", "Min", "Max",
103 |                        "Unweighted missing", "Weighted missing")
104 |   }
105 | 
106 |   output
107 | }
108 | 
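
Note on /R/SummaryStatistics.R: `wtd_mean()` above passes its columns to `stats::weighted.mean()` without setting `na.rm`, which defaults to `FALSE` in base R. The sketch below uses invented vectors to show what that default implies; it is an illustration of the base function, not a statement about how the package is meant to be used.

```r
# Hypothetical values and weights
x <- c(10, 20, 30)
w <- c(1, 1, 2)

# Weighted mean: (10*1 + 20*1 + 30*2) / (1 + 1 + 2)
complete_mean <- weighted.mean(x, w)

# The same data with one missing value
x_na <- c(10, NA, 30)

# With the default na.rm = FALSE the result is NA
default_mean <- weighted.mean(x_na, w)

# With na.rm = TRUE the NA and its weight are dropped: (10*1 + 30*2) / (1 + 2)
narm_mean <- weighted.mean(x_na, w, na.rm = TRUE)

c(complete = complete_mean, default = default_mean, na_rm = narm_mean)
```

If the summarized variable contains missing values, it may therefore be worth filtering them out before calling `wtd_mean()`.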


--------------------------------------------------------------------------------
/R/Topline.R:
--------------------------------------------------------------------------------
 1 | #' weighted topline
 2 | #'
 3 | #' \code{topline} returns a tibble containing a weighted topline of one variable
 4 | #'
 5 | #'  By default the table includes a column for frequency count, percent, valid percent, and cumulative percent.
 6 | #'
 7 | #' @param df The data source
 8 | #' @param variable the variable name
 9 | #' @param weight The weighting variable
10 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused").
11 | #' This will not affect any calculations made. The vector is not case-sensitive.
12 | #' @param n logical, if TRUE a frequency column is included,
13 | #' containing the weighted count of each response.
14 | #' @param pct logical, if TRUE a column of percents is included
15 | #' @param valid_pct logical, if TRUE a column of valid percents is included
16 | #' @param cum_pct logical, if TRUE a column of cumulative percents is included
17 | #'
18 | #' @return a tibble
19 | #' @export
20 | #' @import dplyr
21 | #' @import stringr
22 | #' @import tidyr
23 | #' @import labelled
24 | #' @import rlang
25 | #'
26 | #' @examples
27 | #' topline(illinois, sex, weight)
28 | #' topline(illinois, sex, weight, pct = FALSE)
29 | 
30 | topline <- function(df, variable, weight, remove = c(""), n = TRUE,
31 |                     pct = TRUE, valid_pct = TRUE, cum_pct = TRUE){
32 | 
33 |   # make sure no weights are NA
34 |   w <- df %>% pull({{weight}})
35 |   if(length(w[is.na(w)]) > 0){
36 |     stop("The weight variable contains missing values.", call. = FALSE)
37 |   }
38 | 
39 |   # Make table
40 |   d.output <- df %>%
41 |     # Convert to ordered factors
42 |     mutate({{variable}} := to_factor({{variable}}, sort_levels = "values"),
43 |            {{variable}} := forcats::fct_na_value_to_level({{variable}},
44 |                                                           level = "(Missing)")) %>%
45 |     # Calculate denominator
46 |     mutate(total = sum({{weight}}),
47 |            valid.total = sum(({{weight}})[{{variable}} != "(Missing)"])) %>%
48 |     # Calculate proportions
49 |     group_by({{variable}}) %>%
50 |     summarise(pct = (sum({{weight}})/first(total))*100,
51 |               valid.pct = (sum({{weight}})/first(valid.total)*100),
52 |               n = sum({{weight}})) %>%
53 |     ungroup() %>%
54 |     mutate(cum = cumsum(valid.pct),
55 |            valid.pct = replace(valid.pct, {{variable}} == "(Missing)", NA),
56 |            cum = replace(cum, {{variable}} == "(Missing)", NA)) %>%
57 |     select(Response = {{variable}}, Frequency = n, Percent = pct,
58 |            `Valid Percent` = valid.pct, `Cumulative Percent` = cum) %>%
59 |     # Remove values included in "remove" string
60 |     filter(! str_to_upper(Response) %in% str_to_upper(remove))
61 | 
62 |   # remove columns as requested
63 |   if(valid_pct == FALSE){
64 |     d.output <- select(d.output, -`Valid Percent`)
65 |   }
66 | 
67 |   if(cum_pct == FALSE){
68 |     d.output <- select(d.output, -`Cumulative Percent`)
69 |   }
70 | 
71 |   if(n == FALSE){
72 |     d.output <- select(d.output, -Frequency)
73 |   }
74 | 
75 |   if(pct == FALSE){
76 |     d.output <- select(d.output, -Percent)
77 |   }
78 | 
79 |   d.output %>%
80 |     as_tibble()
81 | }
82 | 
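
Note on /R/Topline.R: the difference between the Percent and Valid Percent columns comes down to the denominator. The base R sketch below works through that on invented data. The vectors `resp` and `w` are hypothetical, and `split()` with `vapply()` replaces the package's dplyr pipeline for the sake of a dependency-free example.

```r
# Hypothetical responses and weights; NA marks a missing answer
resp <- c("yes", "no", NA, "yes")
w <- c(2, 1, 1, 4)

# Give missing answers an explicit label, as topline() does with "(Missing)"
resp[is.na(resp)] <- "(Missing)"

# Weighted frequency of each response, as a named numeric vector
freq <- vapply(split(w, resp), sum, numeric(1))

# Percent: every weight counts toward the denominator
pct <- freq / sum(freq) * 100

# Valid percent: "(Missing)" weights are excluded from the denominator,
# and the "(Missing)" row itself is set to NA
is_valid <- names(freq) != "(Missing)"
valid_pct <- freq / sum(freq[is_valid]) * 100
valid_pct[!is_valid] <- NA

round(cbind(freq, pct, valid_pct), 1)
```

Here "yes" is 75 percent of all weighted responses but roughly 85.7 percent of the valid ones, because the single missing response is dropped from the second denominator.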


--------------------------------------------------------------------------------
/R/data.R:
--------------------------------------------------------------------------------
 1 | #' Illinois respondents to the Voting and Registration Supplement for the Current Population Survey
 2 | #'
 3 | #' A dataset containing the responses of 36,207 Illinois respondents to the
 4 | #' biennial Voting and Registration Supplement of the Current Population
 5 | #' Survey, 1996-2018.
 6 | #'
 7 | #' @format A data frame with 36207 rows and 10 variables:
 8 | #' \describe{
 9 | #'   \item{year}{year of survey}
10 | #'   \item{fips}{the state fips code}
11 | #'   \item{sex}{sex of the respondent, labelled value}
12 | #'   \item{educ6}{highest level of education for respondent, labelled values}
13 | #'   \item{raceethnic}{one of white, black, Hispanic, or other, labelled values}
14 | #'   \item{maritalstatus}{one of Married, Widowed/divorced/Sep, or Never Married, labelled values}
15 | #'   \item{rv}{indicates if the respondent is registered to vote, labelled values}
16 | #'   \item{voter}{indicates if the respondent voted, labelled values}
17 | #'   \item{age}{the age of the respondent, numeric values}
18 | #'   \item{weight}{the number of people each respondent is calculated to represent}
19 | #' }
20 | #' @source \url{https://www.census.gov/topics/public-sector/voting.html}
21 | "illinois"
22 | 


--------------------------------------------------------------------------------
/R/globals.R:
--------------------------------------------------------------------------------
1 | utils::globalVariables(c("Cumulative Percent", "Percent", "Frequency",
2 |                          "Valid Percent", "cum", "d.output", "deff",
3 |                          "moe", "observations", "pct", "total", "valid.pct",
4 |                          "valid.total", "Response", "unweighted_n"))
5 | 


--------------------------------------------------------------------------------
/R/moeCrosstab.R:
--------------------------------------------------------------------------------
  1 | #' weighted crosstabs with margin of error
  2 | #'
  3 | #' \code{moe_crosstab} returns a tibble containing a weighted crosstab of two variables with margin of error
  4 | #'
  5 | #'  Options  include row or cell percentages. The tibble can be in long or wide format. The margin of
  6 | #'  error includes the design effect of the weights.
  7 | #'
  8 | #' @param df The data source
  9 | #' @param x The independent variable
 10 | #' @param y The dependent variable
 11 | #' @param weight The weighting variable
 12 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused").
 13 | #' This will not affect any calculations made. The vector is not case-sensitive.
 14 | #' @param n logical, if TRUE numeric totals are included.
 15 | #' @param pct_type Controls the kind of percentage values returned. One of "row" or "cell".
 16 | #' Column percents are not supported.
 17 | #' @param format one of "long" or "wide"
 18 | #' @param zscore defaults to 1.96, consistent with a 95\% confidence interval
 19 | #' @param unwt_n logical, if TRUE it adds a column with unweighted frequency values
 20 | #'
 21 | #' @return a tibble
 22 | #' @export
 23 | #' @import dplyr
 24 | #' @import stringr
 25 | #' @import tidyr
 26 | #' @import labelled
 27 | #' @import rlang
 28 | #'
 29 | #' @examples
 30 | #' moe_crosstab(df = illinois, x = voter, y = raceethnic, weight = weight)
 31 | #' moe_crosstab(df = illinois, x = voter, y = raceethnic, weight = weight, n = FALSE)
 32 | 
 33 | moe_crosstab <- function(df, x, y, weight, remove = c(""),
 34 |                          n = TRUE, pct_type = "row", format = "long",
 35 |                          zscore = 1.96, unwt_n = FALSE){
 36 | 
 37 |   # make sure no weights are NA
 38 |   w <- df %>% pull({{weight}})
 39 |   if(length(w[is.na(w)]) > 0){
 40 |     stop("The weight variable contains missing values.", call. = FALSE)
 41 |   }
 42 | 
 43 |   # make sure the arguments are all correct
 44 |   stopifnot(pct_type %in% c("row", "cell"),
 45 |             format %in% c("wide", "long"))
 46 | 
 47 |   # calculate the design effect
 48 |   deff <- df %>% pull({{weight}}) %>% deff_calc()
 49 | 
 50 |   # build the table, either row percents or cell percents
 51 |   if(pct_type == "row"){
 52 |     d.output <- df %>%
 53 |       filter(!is.na({{x}}),
 54 |              !is.na({{y}})) %>%
 55 |       mutate({{x}} := to_factor({{x}}),
 56 |              {{y}} := to_factor({{y}})) %>%
 57 |       group_by({{x}}) %>%
 58 |       mutate(total = sum({{weight}}),
 59 |              unweighted_n = length({{weight}})) %>%
 60 |       group_by({{x}}, {{y}}) %>%
 61 |       summarise(observations = sum({{weight}}),
 62 |                 pct = observations/first(total),
 63 |                 n = first(total),
 64 |                 unweighted_n = first(unweighted_n)) %>%
 65 |       ungroup() %>%
 66 |       mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>%
 67 |       mutate(pct = pct*100) %>%
 68 |       select(-observations) %>%
 69 |       # Remove values included in "remove" string
 70 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
 71 |              !str_to_upper({{y}}) %in% str_to_upper(remove)) %>%
 72 |       # move total row to end
 73 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n"))
 74 |   } else if(pct_type == "cell"){
 75 |     d.output <- df %>%
 76 |       filter(!is.na({{x}}),
 77 |              !is.na({{y}})) %>%
 78 |       mutate({{x}} := to_factor({{x}}),
 79 |              {{y}} := to_factor({{y}})) %>%
 80 |       # calculate denominator
 81 |       mutate(total = sum({{weight}}),
 82 |              unweighted_n = length({{weight}})) %>%
 83 |       group_by({{x}}, {{y}}) %>%
 84 |       summarise(observations = sum({{weight}}),
 85 |                 pct = observations/first(total),
 86 |                 n = first(total),
 87 |                 unweighted_n = first(unweighted_n)) %>%
 88 |       ungroup() %>%
 89 |       mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>%
 90 |       mutate(pct = pct*100) %>%
 91 |       select(-observations) %>%
 92 |       # Remove values included in "remove" string
 93 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
 94 |              !str_to_upper({{y}}) %in% str_to_upper(remove)) %>%
 95 |       # move total row to end
 96 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n"))
 97 |   }
 98 | 
 99 |   # convert to wide format if required
100 |   if(format == "wide"){
101 |     d.output <- d.output %>%
102 |       pivot_wider(names_from = {{y}}, values_from = c(pct, moe),
103 |                   values_fill = list(pct = 0), names_sort = TRUE) %>%
104 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n"))
105 |   }
106 | 
107 |   # remove n if required
108 |   if(n == FALSE){
109 |     d.output <- select(d.output, -n)
110 |   }
111 | 
112 |   # remove unweighted_n if required
113 |   if(unwt_n == FALSE){
114 |     d.output <- select(d.output, -unweighted_n)
115 |   }
116 | 
117 |   # test if date or number
118 |   factor.true.type <- what_is_this_factor(pull(d.output, {{x}}))
119 |   if(factor.true.type == "date"){
120 |     d.output %>%
121 |       as_tibble() %>%
122 |       mutate({{x}} := as.Date({{x}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y")))
123 |   } else if(factor.true.type == "number"){
124 |     d.output %>%
125 |       as_tibble() %>%
126 |       mutate({{x}} := as.numeric(as.character({{x}})))
127 |   } else{
128 |     d.output %>%
129 |       as_tibble()
130 |   }
131 | }
132 | 


--------------------------------------------------------------------------------
/R/moeCrosstab3way.R:
--------------------------------------------------------------------------------
  1 | #' weighted 3-way crosstabs with margin of error
  2 | #'
  3 | #' \code{moe_crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable with margin of error
  4 | #'
  5 | #'  Options  include row or cell percentages. The tibble can be in long or wide format.
  6 | #'  These tables are ideal for use with small multiples created with ggplot2::facet_wrap.
  7 | #'
  8 | #' @param df The data source
  9 | #' @param x The independent variable
 10 | #' @param y The dependent variable
 11 | #' @param z The second control variable
 12 | #' @param weight The weighting variable
 13 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused").
 14 | #' This will not affect any calculations made. The vector is not case-sensitive.
 15 | #' @param n logical, if TRUE numeric totals are included.
 16 | #' @param pct_type Controls the kind of percentage values returned. One of "row" or "cell".
 17 | #' @param format one of "long" or "wide"
 18 | #' @param zscore defaults to 1.96, consistent with a 95\% confidence interval
 19 | #' @param unwt_n logical, if TRUE it adds a column with unweighted frequency values
 20 | #'
 21 | #' @return a tibble
 22 | #' @export
 23 | #' @import dplyr
 24 | #' @import stringr
 25 | #' @import tidyr
 26 | #' @import labelled
 27 | #' @import rlang
 28 | #'
 29 | #' @examples
 30 | #' moe_crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight)
 31 | #' moe_crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight,
 32 | #' format = "wide")
 33 | 
 34 | moe_crosstab_3way <- function(df, x, y, z,
 35 |                               weight, remove = c(""),
 36 |                               n = TRUE, pct_type = "row",
 37 |                               format = "long", zscore = 1.96,
 38 |                               unwt_n = FALSE){
 39 | 
 40 |   # make sure no weights are NA
 41 |   w <- df %>% pull({{weight}})
 42 |   if(length(w[is.na(w)]) > 0){
 43 |     stop("The weight variable contains missing values.", call. = FALSE)
 44 |   }
 45 | 
 46 |   # make sure the arguments are all correct
 47 |   stopifnot(pct_type %in% c("row", "cell"),
 48 |             format %in% c("wide", "long"))
 49 | 
 50 |   # calculate the design effect
 51 |   deff <- df %>% pull({{weight}}) %>% deff_calc()
 52 | 
 53 |   # build the table, either row percents or cell percents
 54 |   if(pct_type == "row"){
 55 |     d.output <- df %>%
 56 |       filter(!is.na({{x}}),
 57 |              !is.na({{y}}),
 58 |              !is.na({{z}})) %>%
 59 |       mutate({{x}} := to_factor({{x}}),
 60 |              {{y}} := to_factor({{y}}),
 61 |              {{z}} := to_factor({{z}})) %>%
 62 |       group_by({{z}}, {{x}}) %>%
 63 |       mutate(total = sum({{weight}}),
 64 |              unweighted_n = length({{weight}})) %>%
 65 |       group_by({{z}}, {{x}}, {{y}}) %>%
 66 |       summarise(observations = sum({{weight}}),
 67 |                 pct = observations/first(total),
 68 |                 n = first(total),
 69 |                 unweighted_n = first(unweighted_n)) %>%
 70 |       ungroup() %>%
 71 |       mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>%
 72 |       mutate(pct = pct*100) %>%
 73 |       select(-observations) %>%
 74 |       # Remove values included in "remove" string
 75 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
 76 |              !str_to_upper({{y}}) %in% str_to_upper(remove),
 77 |              !str_to_upper({{z}}) %in% str_to_upper(remove)) %>%
 78 |       # move total row to end
 79 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n"))
 80 |   } else if(pct_type == "cell"){
 81 |     d.output <- df %>%
 82 |       filter(!is.na({{x}}),
 83 |              !is.na({{y}}), !is.na({{z}})) %>%
 84 |       mutate({{x}} := to_factor({{x}}),
 85 |              {{y}} := to_factor({{y}}), {{z}} := to_factor({{z}})) %>%
 86 |       # calculate denominator
 87 |       group_by({{z}}) %>%
 88 |       mutate(total = sum({{weight}}),
 89 |              unweighted_n = length({{weight}})) %>%
 90 |       group_by({{z}}, {{x}}, {{y}}) %>%
 91 |       summarise(observations = sum({{weight}}),
 92 |                 pct = observations/first(total),
 93 |                 n = first(total),
 94 |                 unweighted_n = first(unweighted_n)) %>%
 95 |       ungroup() %>%
 96 |       mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>%
 97 |       mutate(pct = pct*100) %>%
 98 |       select(-observations) %>%
 99 |       # Remove values included in "remove" string
100 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
101 |              !str_to_upper({{y}}) %in% str_to_upper(remove)) %>%
102 |       # move total row to end
103 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n"))
104 |   }
105 | 
106 |   # convert to wide format if required
107 |   if(format == "wide"){
108 |     d.output <- d.output %>%
109 |       pivot_wider(names_from = {{y}}, values_from = c(pct, moe),
110 |                   values_fill = list(pct = 0, moe = 0)) %>%
111 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
112 |       arrange({{x}}, {{z}})
113 |   }
114 | 
115 |   # remove n if required
116 |   if(n == FALSE){
117 |     d.output <- select(d.output, -n)
118 |   }
119 | 
120 |   # remove unweighted_n if required
121 |   if(unwt_n == FALSE){
122 |     d.output <- select(d.output, -unweighted_n)
123 |   }
124 | 
125 |   # test if date or number
126 |   factor.true.type <- what_is_this_factor(pull(d.output, {{z}}))
127 |   if(factor.true.type == "date"){
128 |     d.output %>%
129 |       as_tibble() %>%
130 |       mutate({{z}} := as.Date({{z}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y")))
131 |   } else if(factor.true.type == "number"){
132 |     d.output %>%
133 |       as_tibble() %>%
134 |       mutate({{z}} := as.numeric(as.character({{z}})))
135 |   } else{
136 |     d.output %>%
137 |       as_tibble()
138 |   }
139 | }
140 | 


--------------------------------------------------------------------------------
/R/moeTopline.R:
--------------------------------------------------------------------------------
 1 | #' weighted topline with margin of error
 2 | #'
 3 | #' \code{moe_topline} returns a tibble containing a weighted topline of one variable with margin of error
 4 | #'
 5 | #'  By default the table includes a column for frequency count, percent, valid percent, and cumulative percent.
 6 | #'
 7 | #' @param df The data source
 8 | #' @param variable the variable name
 9 | #' @param weight The weighting variable
10 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused").
11 | #' This will not affect any calculations made. The vector is not case-sensitive.
12 | #' @param n logical, if TRUE a frequency column (the weighted count of
13 | #' each response) is included
14 | #' @param pct logical, if TRUE a column of percents is included
15 | #' @param valid_pct logical, if TRUE a column of valid percents is included
16 | #' @param cum_pct logical, if TRUE a column of cumulative percents is included
17 | #' @param zscore defaults to 1.96, consistent with a 95\% confidence interval
18 | #'
19 | #' @return a tibble
20 | #' @export
21 | #' @import dplyr
22 | #' @import stringr
23 | #' @import tidyr
24 | #' @import labelled
25 | #' @import rlang
26 | #'
27 | #' @examples
28 | #' moe_topline(df = illinois, variable = educ6, weight = weight)
29 | #' moe_topline(df = illinois, variable = educ6, weight = weight, remove = c("LT HS"))
30 | moe_topline <- function(df, variable, weight, remove = c(""),
31 |                         n = TRUE, pct = TRUE, valid_pct = TRUE, cum_pct = TRUE, zscore = 1.96){
32 | 
33 |   # make sure no weights are NA
34 |   w <- df %>% pull({{weight}})
35 |   if(length(w[is.na(w)]) > 0){
36 |     stop("The weight variable contains missing values.", call. = FALSE)
37 |   }
38 | 
39 |   # calculate the design effect
40 |   deff <- df %>% pull({{weight}}) %>% deff_calc()
41 | 
42 |   # calculate the valid unweighted sample count
43 |   unweighted.n <- df %>%
44 |     filter(!is.na({{variable}})) %>%
45 |     nrow()
46 | 
47 |   output <- df %>%
48 |     filter(!is.na({{variable}})) %>%
49 |     mutate({{variable}} := to_factor({{variable}}),
50 |            total = sum({{weight}}),
51 |            valid.total = sum(({{weight}})[{{variable}} != "(Missing)"])) %>%
52 |     group_by({{variable}}) %>%
53 |     summarise(valid.pct = (sum({{weight}})/first(valid.total)*100),
54 |               n = sum({{weight}}),
55 |               pct = (n/first(total))) %>%
56 |     ungroup() %>%
57 |     mutate(moe = moedeff_calc(pct = valid.pct/100, deff = deff, n = unweighted.n, zscore = zscore),
58 |            cum = cumsum(valid.pct),
59 |            valid.pct = replace(valid.pct, {{variable}} == "(Missing)", NA),
60 |            cum = replace(cum, {{variable}} == "(Missing)", NA)) %>%
61 |     mutate(pct = pct*100) %>%
62 |     select(Response = {{variable}}, Frequency = n, Percent = pct,
63 |            `Valid Percent` = valid.pct, `MOE` = moe, `Cumulative Percent` = cum) %>%
64 |     # Remove values included in "remove" string
65 |     filter(! str_to_upper(Response) %in% str_to_upper(remove))
66 | 
67 |   # remove columns as requested
68 |   if(valid_pct == FALSE){
69 |     output <- select(output, -`Valid Percent`)
70 |   }
71 | 
72 |   if(cum_pct == FALSE){
73 |     output <- select(output, -`Cumulative Percent`)
74 |   }
75 | 
76 |   if(n == FALSE){
77 |     output <- select(output, -Frequency)
78 |   }
79 | 
80 |   if(pct == FALSE){
81 |     output <- select(output, -Percent)
82 |   }
83 | 
84 |   output %>%
85 |     as_tibble()
86 | }
87 | 


--------------------------------------------------------------------------------
/R/moeWaveCrosstab.R:
--------------------------------------------------------------------------------
  1 | #' weighted crosstabs with margin of error, where the x-variable identifies different survey waves
  2 | #'
  3 | #' \code{moe_wave_crosstab} returns a tibble containing a weighted crosstab of two variables
  4 | #'  with margin of error. Use this function when the x-variable indicates different survey
  5 | #'  waves for which weights were calculated independently.
  6 | #'
  7 | #'  Options  include row or cell percentages. The tibble can be in long or wide format. The margin of
  8 | #'  error includes the design effect of the weights, calculated separately for each
  9 | #'  survey wave.
 10 | #'
 11 | #' @param df The data source
 12 | #' @param x The independent variable, which uniquely identifies survey waves
 13 | #' @param y The dependent variable
 14 | #' @param weight The weighting variable
 15 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused").
 16 | #' This will not affect any calculations made. The vector is not case-sensitive.
 17 | #' @param n logical, if TRUE numeric totals are included.
 18 | #' @param pct_type Controls the kind of percentage values returned. One of "row" or "cell".
 19 | #' Column percents are not supported.
 20 | #' @param format one of "long" or "wide"
 21 | #' @param zscore defaults to 1.96, consistent with a 95\% confidence interval
 22 | #' @param unwt_n logical, if TRUE it adds a column with unweighted frequency values
 23 | #'
 24 | #' @return a tibble
 25 | #' @export
 26 | #' @import dplyr
 27 | #' @import stringr
 28 | #' @import tidyr
 29 | #' @import labelled
 30 | #' @import rlang
 31 | #'
 32 | #' @examples
 33 | #' moe_wave_crosstab(df = illinois, x = year, y = maritalstatus, weight = weight)
 34 | #' moe_wave_crosstab(df = illinois, x = year, y = maritalstatus, weight = weight, format = "wide")
 35 | 
 36 | moe_wave_crosstab <- function(df, x, y, weight, remove = c(""), n = TRUE,
 37 |                               pct_type = "row", format = "long",
 38 |                               zscore = 1.96, unwt_n = FALSE){
 39 | 
 40 |   # make sure no weights are NA
 41 |   w <- df %>% pull({{weight}})
 42 |   if(length(w[is.na(w)]) > 0){
 43 |     stop("The weight variable contains missing values.", call. = FALSE)
 44 |   }
 45 | 
 46 |   # make sure the arguments are all correct
 47 |   stopifnot(pct_type %in% c("row", "cell"),
 48 |             format %in% c("wide", "long"))
 49 | 
 50 |   # Calculate the design effect for each wave individually, as that is
 51 |   # the level at which the weights are calculated
 52 |   stats.by.wave <- df %>%
 53 |     filter(!is.na({{x}})) %>%
 54 |     mutate({{x}} := to_factor({{x}})) %>%
 55 |     group_by({{x}}) %>%
 56 |     summarise(deff = deff_calc({{weight}})) %>%
 57 |     ungroup()
 58 | 
 59 |   # build the table, either row percents or cell percents
 60 |   if(pct_type == "row"){
 61 |     d.output <- df %>%
 62 |       filter(!is.na({{x}}),
 63 |              !is.na({{y}})) %>%
 64 |       mutate({{x}} := to_factor({{x}}),
 65 |              {{y}} := to_factor({{y}})) %>%
 66 |       group_by({{x}}) %>%
 67 |       mutate(total = sum({{weight}}),
 68 |              unweighted_n = length({{weight}})) %>%
 69 |       group_by({{x}}, {{y}}) %>%
 70 |       summarise(observations = sum({{weight}}),
 71 |                 pct = observations/first(total),
 72 |                 n = first(total),
 73 |                 unweighted_n = first(unweighted_n)) %>%
 74 |       ungroup() %>%
 75 |       inner_join(stats.by.wave) %>%
 76 |       mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>%
 77 |       mutate(pct = pct*100) %>%
 78 |       select(-observations) %>%
 79 |       # Remove values included in "remove" string
 80 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
 81 |              !str_to_upper({{y}}) %in% str_to_upper(remove)) %>%
 82 |       # move total row to end
 83 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
 84 |       select(-deff)
 85 |   } else if(pct_type == "cell"){
 86 |     d.output <- df %>%
 87 |       filter(!is.na({{x}}),
 88 |              !is.na({{y}})) %>%
 89 |       mutate({{x}} := to_factor({{x}}),
 90 |              {{y}} := to_factor({{y}})) %>%
 91 |       # calculate denominator
 92 |       mutate(total = sum({{weight}}),
 93 |              unweighted_n = length({{weight}})) %>%
 94 |       group_by({{x}}, {{y}}) %>%
 95 |       summarise(observations = sum({{weight}}),
 96 |                 pct = observations/first(total),
 97 |                 n = first(total),
 98 |                 unweighted_n = first(unweighted_n)) %>%
 99 |       ungroup() %>%
100 |       inner_join(stats.by.wave) %>%
101 |       mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>%
102 |       mutate(pct = pct*100) %>%
103 |       select(-observations) %>%
104 |       # Remove values included in "remove" string
105 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
106 |              !str_to_upper({{y}}) %in% str_to_upper(remove)) %>%
107 |       # move total row to end
108 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
109 |       select(-deff)
110 |   }
111 | 
112 |   # convert to wide format if required
113 |   if(format == "wide"){
114 |     d.output <- d.output %>%
115 |       pivot_wider(names_from = {{y}}, values_from = c(pct, moe),
116 |                   values_fill = list(pct = 0, moe = 0)) %>%
117 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n"))
118 |   }
119 | 
120 |   # remove n if required
121 |   if(n == FALSE){
122 |     d.output <- select(d.output, -n)
123 |   }
124 |   # remove unweighted_n if required
125 |   if(unwt_n == FALSE){
126 |     d.output <- select(d.output, -unweighted_n)
127 |   }
128 | 
129 |   # test if date or number
130 |   factor.true.type <- what_is_this_factor(pull(d.output, {{x}}))
131 |   if(factor.true.type == "date"){
132 |     d.output %>%
133 |       as_tibble() %>%
134 |       mutate({{x}} := as.Date({{x}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y")))
135 |   } else if(factor.true.type == "number"){
136 |     d.output %>%
137 |       as_tibble() %>%
138 |       mutate({{x}} := as.numeric(as.character({{x}})))
139 |   } else{
140 |     d.output %>%
141 |       as_tibble()
142 |   }
143 | }
144 | 


--------------------------------------------------------------------------------
/R/moeWaveCrosstab3way.R:
--------------------------------------------------------------------------------
  1 | #' weighted 3-way crosstabs with margin of error, where the z-variable identifies different survey waves
  2 | #'
  3 | #' \code{moe_wave_crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable with margin of error.
  4 | #' Use this function when the z-variable indicates different survey
  5 | #' waves for which weights were calculated independently.
  6 | #'
  7 | #'  Options  include row or cell percentages. The tibble can be in long or wide format.
  8 | #'  These tables are ideal for use with small multiples created with ggplot2::facet_wrap.
  9 | #'
 10 | #' @param df The data source
 11 | #' @param x The independent variable
 12 | #' @param y The dependent variable
 13 | #' @param z The second control variable, uniquely identifies survey waves
 14 | #' @param weight The weighting variable
 15 | #' @param remove An optional character vector of values to remove from final table (e.g. "refused").
 16 | #' This will not affect any calculations made. The vector is not case-sensitive.
 17 | #' @param n logical, if TRUE numeric totals are included.
 18 | #' @param pct_type Controls the kind of percentage values returned. One of "row" or "cell".
 19 | #' @param format one of "long" or "wide"
 20 | #' @param zscore defaults to 1.96, consistent with a 95\% confidence interval
 21 | #' @param unwt_n logical, if TRUE it adds a column with unweighted frequency values
 22 | #'
 23 | #' @return a tibble
 24 | #' @export
 25 | #' @import dplyr
 26 | #' @import stringr
 27 | #' @import tidyr
 28 | #' @import labelled
 29 | #' @import rlang
 30 | #'
 31 | #' @examples
 32 | #' moe_wave_crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight)
 33 | #' moe_wave_crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight, format = "wide")
 34 | 
 35 | moe_wave_crosstab_3way <- function(df, x, y, z,
 36 |                               weight, remove = c(""),
 37 |                               n = TRUE, pct_type = "row", format = "long",
 38 |                               zscore = 1.96, unwt_n = FALSE){
 39 | 
 40 |   # make sure no weights are NA
 41 |   w <- df %>% pull({{weight}})
 42 |   if(length(w[is.na(w)]) > 0){
 43 |     stop("The weight variable contains missing values.", call. = FALSE)
 44 |   }
 45 | 
 46 |   # make sure the arguments are all correct
 47 |   stopifnot(pct_type %in% c("row", "cell"),
 48 |             format %in% c("wide", "long"))
 49 | 
 50 |   # Calculate the design effect for each wave individually, as that is
 51 |   # the level at which the weights are calculated
 52 |   stats.by.wave <- df %>%
 53 |     filter(!is.na({{z}})) %>%
 54 |     mutate({{z}} := to_factor({{z}})) %>%
 55 |     group_by({{z}}) %>%
 56 |     summarise(deff = deff_calc({{weight}})) %>%
 57 |     ungroup()
 58 | 
 59 |   # build the table, either row percents or cell percents
 60 |   if(pct_type == "row"){
 61 |     d.output <- df %>%
 62 |       filter(!is.na({{x}}),
 63 |              !is.na({{y}}),
 64 |              !is.na({{z}})) %>%
 65 |       mutate({{x}} := to_factor({{x}}),
 66 |              {{y}} := to_factor({{y}}),
 67 |              {{z}} := to_factor({{z}})) %>%
 68 |       group_by({{z}}, {{x}}) %>%
 69 |       mutate(total = sum({{weight}}),
 70 |              unweighted_n = length({{weight}})) %>%
 71 |       group_by({{z}}, {{x}}, {{y}}) %>%
 72 |       summarise(observations = sum({{weight}}),
 73 |                 pct = observations/first(total),
 74 |                 n = first(total),
 75 |                 unweighted_n = first(unweighted_n)) %>%
 76 |       ungroup() %>%
 77 |       inner_join(stats.by.wave) %>%
 78 |       mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>%
 79 |       mutate(pct = pct*100) %>%
 80 |       select(-observations) %>%
 81 |       # Remove values included in "remove" string
 82 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
 83 |              !str_to_upper({{y}}) %in% str_to_upper(remove),
 84 |              !str_to_upper({{z}}) %in% str_to_upper(remove)) %>%
 85 |       # move total row to end
 86 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
 87 |       select(-deff)
 88 |   } else if(pct_type == "cell"){
 89 |     d.output <- df %>%
 90 |       filter(!is.na({{x}}),
 91 |              !is.na({{y}}), !is.na({{z}})) %>%
 92 |       mutate({{x}} := to_factor({{x}}),
 93 |              {{y}} := to_factor({{y}}), {{z}} := to_factor({{z}})) %>%
 94 |       # calculate denominator
 95 |       group_by({{z}}) %>%
 96 |       mutate(total = sum({{weight}}),
 97 |              unweighted_n = length({{weight}})) %>%
 98 |       group_by({{z}}, {{x}}, {{y}}) %>%
 99 |       summarise(observations = sum({{weight}}),
100 |                 pct = observations/first(total),
101 |                 n = first(total),
102 |                 unweighted_n = first(unweighted_n)) %>%
103 |       ungroup() %>%
104 |       inner_join(stats.by.wave) %>%
105 |       mutate(moe = moedeff_calc(pct = pct, deff = deff, n = unweighted_n, zscore = zscore)) %>%
106 |       mutate(pct = pct*100) %>%
107 |       select(-observations) %>%
108 |       # Remove values included in "remove" string
109 |       filter(!str_to_upper({{x}}) %in% str_to_upper(remove),
110 |              !str_to_upper({{y}}) %in% str_to_upper(remove)) %>%
111 |       # move total row to end
112 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n")) %>%
113 |       select(-deff)
114 |   }
115 | 
116 |   # convert to wide format if required
117 |   if(format == "wide"){
118 |     d.output <- d.output %>%
119 |       pivot_wider(names_from = {{y}}, values_from = c(pct, moe),
120 |                   values_fill = list(pct = 0, moe = 0)) %>%
121 |       select(-one_of("n", "unweighted_n"), one_of("n", "unweighted_n"))
122 |   }
123 | 
124 |   # remove n if required
125 |   if(n == FALSE){
126 |     d.output <- select(d.output, -n)
127 |   }
128 | 
129 |   # remove unweighted_n if required
130 |   if(unwt_n == FALSE){
131 |     d.output <- select(d.output, -unweighted_n)
132 |   }
133 | 
134 |   # test if date or number
135 |   factor.true.type <- what_is_this_factor(pull(d.output, {{z}}))
136 |   if(factor.true.type == "date"){
137 |     d.output %>%
138 |       as_tibble() %>%
139 |       mutate({{z}} := as.Date({{z}}, tryFormats = c("%Y-%m-%d", "%Y/%m/%d","%d-%m-%Y","%m-%d-%Y")))
140 |   } else if(factor.true.type == "number"){
141 |     d.output %>%
142 |       as_tibble() %>%
143 |       mutate({{z}} := as.numeric(as.character({{z}})))
144 |   } else{
145 |     d.output %>%
146 |       as_tibble()
147 |   }
148 | }
149 | 


--------------------------------------------------------------------------------
/README.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Weighted data survey tables in R"
  3 | author: "John Johnson"
  4 | date: "3/9/2020"
  5 | output:
  6 |   github_document:
  7 |     toc: true
  8 |     toc_depth: 2
  9 | ---
 10 | 
 11 | ```{r setup, include=FALSE}
 12 | knitr::opts_chunk$set(echo = TRUE)
 13 | ```
 14 | 
 15 | ```{r, echo = FALSE}
 16 | knitr::opts_chunk$set(
 17 |   collapse = TRUE,
 18 |   comment = "#>",
 19 |   fig.path = "man/figures/README-"
 20 | )
 21 | ```
 22 | 
 23 | `pollster` is an R package for making topline and crosstab tables of simple weighted survey data. The package is designed for use with labelled data, like what you might use the `haven` package to import from Stata or SPSS. It follows tidyverse programming conventions, and output tables are also in the form of a tidy data frame, or tibble.
 24 | 
 25 | Only simple weights are currently supported. For complex survey designs, we recommend the excellent [`survey` package](http://r-survey.r-forge.r-project.org/survey/). 
 26 | 
 27 | The core functions are:
 28 | 
 29 | * `topline()`
 30 | * `crosstab()`
 31 | * `crosstab_3way()`
 32 | 
 33 | Each of these functions also has a twin version which includes a column for the margin of error calculated to include the design effect of the weights.
 34 | 
 35 | * `moe_topline()`
 36 | * `moe_crosstab()`
 37 | * `moe_crosstab_3way()`
 38 | 
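To make "margin of error including the design effect" concrete, here is a minimal base-R sketch. It assumes Kish's approximation for the design effect and the usual normal-approximation margin of error for a proportion; `kish_deff()` and `moe_with_deff()` are hypothetical names used only for illustration, and the package's own `deff_calc()` and `moedeff_calc()` may differ in detail.

```r
# Hypothetical helpers for illustration; not the package's internal code.

# Kish's approximate design effect: n * sum(w^2) / sum(w)^2.
# Equal weights give exactly 1; more variable weights give larger values.
kish_deff <- function(w) {
  length(w) * sum(w^2) / sum(w)^2
}

# Margin of error, in percentage points, for a proportion `p` (0 to 1)
# estimated from `n` unweighted respondents, inflated by the design effect.
moe_with_deff <- function(p, n, deff, zscore = 1.96) {
  100 * zscore * sqrt(deff * p * (1 - p) / n)
}

equal_w  <- rep(1, 100)                   # every respondent counts the same
uneven_w <- rep(c(0.5, 1.5), times = 50)  # half down-weighted, half up-weighted

kish_deff(equal_w)   # 1
kish_deff(uneven_w)  # 1.25

moe_with_deff(p = 0.5, n = 100, deff = kish_deff(equal_w))   # 9.8
moe_with_deff(p = 0.5, n = 100, deff = kish_deff(uneven_w))  # about 10.96
```

The point of the comparison is that the same 100 respondents yield a wider margin of error once their weights vary.
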
 39 | There are also two special functions which calculate the design effect component of the margin of error for each survey wave independently.
 40 | 
 41 | * `moe_wave_crosstab()`
 42 | * `moe_wave_crosstab_3way()`
 43 | 
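The difference between a pooled and a per-wave design effect can be sketched in base R. The toy data frame below is invented for illustration, and the Kish approximation is an assumption rather than a statement of the package's exact formula.

```r
# Toy data (made up): two waves of four respondents each.
toy <- data.frame(
  wave   = rep(c("2016", "2018"), each = 4),
  weight = c(1, 1, 1, 1,          # wave 2016: equal weights
             0.5, 0.5, 1.5, 1.5)  # wave 2018: uneven weights
)

# Kish approximation of the design effect (assumed for this sketch)
kish_deff <- function(w) length(w) * sum(w^2) / sum(w)^2

# A single design effect computed from all weights pooled together
pooled_deff <- kish_deff(toy$weight)

# One design effect per wave, matching the level at which weights were built
deff_by_wave <- tapply(toy$weight, toy$wave, kish_deff)

pooled_deff   # 1.125
deff_by_wave  # 2016: 1, 2018: 1.25
```

In this toy case the pooled value overstates the design effect for the equally weighted wave and understates it for the unevenly weighted one, which is the motivation for computing it wave by wave.
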
 44 | Other functions are included to calculate simple weighted summary statistics.
 45 | 
 46 | * `wtd_mean()` is a tidy-compliant wrapper around `stats::weighted.mean()`
 47 | * `summary_table()` returns a tibble with summary statistics similar to the Stata command `sum`
 48 | 
 49 | ## Installation
 50 | 
 51 | Install the released version from CRAN.
 52 | 
 53 | ```
 54 | install.packages("pollster")
 55 | ```
 56 | 
 57 | Or install the development version from GitHub.
 58 | 
 59 | ```
 60 | remotes::install_github("jdjohn215/pollster")
 61 | ```
 62 | 
 63 | ## Basic usage
 64 | 
 65 | `pollster` includes a dataset of Illinois responses to the Current Population Survey's voter registration supplement.
 66 | 
 67 | ```{r}
 68 | library(pollster)
 69 | head(illinois)
 70 | ```
 71 | 
 72 | Make a topline table like this. The output is a tibble.
 73 | 
 74 | ```{r}
 75 | topline(df = illinois, variable = maritalstatus, weight = weight)
 76 | ```
 77 | 
 78 | Make a crosstab like this.
 79 | 
 80 | ```{r}
 81 | crosstab(df = illinois, x = educ6, y = maritalstatus, weight = weight)
 82 | ```
 83 | 
 84 | If you prefer, you can also get the output in long format.
 85 | 
 86 | ```{r}
 87 | crosstab(df = illinois, x = educ6, y = maritalstatus, weight = weight, format = "long")
 88 | ```
 89 | 
 90 | A three-way crosstab is just a normal crosstab with a third control variable. Often, this third variable is time.
 91 | 
 92 | ```{r}
 93 | crosstab_3way(df = illinois, x = educ6, y = maritalstatus, z = year, weight = weight)
 94 | ```
 95 | 
 96 | ## Making tables and graphs
 97 | 
 98 | Wide format is best for displaying table output. Long format is best for making graphs. `pollster` outputs dovetail seamlessly with [`knitr::kable()`](https://www.rdocumentation.org/packages/knitr/versions/1.28/topics/kable) and [`ggplot2::ggplot()`](https://ggplot2.tidyverse.org/). These examples show very basic html table output, but you can customize the appearance of your tables almost endlessly in either html or pdf formats using Hao Zhu's excellent [`kableExtra` package](https://haozhu233.github.io/kableExtra/).
 99 | 
100 | ```{r}
101 | library(dplyr)
102 | crosstab(df = illinois, x = sex, y = educ6, weight = weight) %>%
103 |   knitr::kable(digits = 0)
104 | ```
105 | 
106 | 
107 | ```{r}
108 | library(ggplot2)
109 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, format = "long") %>%
110 |   ggplot(aes(educ6, pct, fill = sex)) +
111 |   geom_bar(stat = "identity", position = "dodge")
112 | ```
113 | 
114 | Three-way crosstabs are ideal for plotting time series graphs and/or faceted plots.
115 | 
116 | ```{r}
117 | crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight, format = "long") %>%
118 |   ggplot(aes(year, pct, col = sex)) +
119 |   geom_line() +
120 |   facet_wrap(facets = vars(educ6))
121 | ```
122 | 
123 | ## Margin of error
124 | 
125 | Each core `pollster` function (`topline()`, `crosstab()`, and `crosstab_3way()`) has a twin function that adds a margin of error column. For example:
126 | 
127 | ```{r}
128 | moe_topline(df = illinois, variable = voter, weight = weight)
129 | ```
130 | 
131 | 
132 | By default, `moe_crosstab` output comes in long format, but you can also specify wide format.
133 | 
134 | ```{r}
135 | moe_crosstab(df = illinois, x = raceethnic, y = voter, weight = weight, format = "wide")
136 | ```
137 | 
138 | ```{r}
139 | moe_crosstab(df = illinois, x = raceethnic, y = voter, weight = weight) %>%
140 |   ggplot(aes(x = pct, y = raceethnic, xmin = (pct - moe), xmax = (pct + moe), color = voter)) +
141 |   geom_pointrange(position = position_dodge(width = 0.2))
142 | ```
143 | 
144 | ## Summary table
145 | 
146 | `summary_table()` creates a simple summary table of a weighted numeric variable.
147 | 
148 | ```{r}
149 | summary_table(df = illinois, variable = age, weight = weight)
150 | ```
151 | 
152 | You can choose `name_style = "pretty"` if you want column headings appropriate for a formatted table.
153 | 
154 | ```{r}
155 | summary_table(df = illinois, variable = age, 
156 |               weight = weight, name_style = "pretty") %>%
157 |   knitr::kable()
158 | ```
159 | 
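The weighted percentages in the topline tables above can be reproduced by hand in base R. The sketch below uses a small made-up data frame (`toy`, with hypothetical `response` and `weight` columns) rather than the `illinois` data, and it only illustrates the arithmetic; it is not the package's internal implementation.

```r
# Toy data (hypothetical): five respondents with a response and a weight
toy <- data.frame(
  response = c("Married", "Married", "Never Married", "Never Married", "Married"),
  weight   = c(2, 1, 3, 1, 3)
)

# Weighted frequency: the sum of the weights within each response category
wtd_freq <- sapply(split(toy$weight, toy$response), sum)

# Weighted percent: each category's share of the total weight
wtd_pct <- 100 * wtd_freq / sum(toy$weight)

wtd_pct
```

Each respondent contributes their weight, rather than a count of one, to both the numerator and the denominator.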


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | Weighted data survey tables in R
  2 | ================
  3 | John Johnson
  4 | 3/9/2020
  5 | 
  6 | `pollster` is an R package for making topline and crosstab tables of
  7 | simple weighted survey data. The package is designed for use with
  8 | labelled data, like what you might use the `haven` package to import
  9 | from Stata or SPSS. It follows tidyverse programming conventions, and
 10 | output tables are also in the form of a tidy data frame, or tibble.
 11 | 
 12 | Only simple weights are currently supported. For complex survey designs,
 13 | we recommend the excellent [`survey`
 14 | package](http://r-survey.r-forge.r-project.org/survey/).
 15 | 
 16 | The core functions are:
 17 | 
 18 |   - `topline()`
 19 |   - `crosstab()`
 20 |   - `crosstab_3way()`
 21 | 
 22 | Each of these functions also has a twin version which includes a column
 23 | for the margin of error calculated to include the design effect of the
 24 | weights.
 25 | 
 26 |   - `moe_topline()`
 27 |   - `moe_crosstab()`
 28 |   - `moe_crosstab_3way()`
 29 | 
 30 | There are also two special functions which calculate the design effect
 31 | component of the margin of error for each survey wave independently.
 32 | 
 33 |   - `moe_wave_crosstab()`
 34 |   - `moe_wave_crosstab_3way()`
 35 | 
 36 | Other functions are included to calculate simple weighted summary
 37 | statistics.
 38 | 
 39 |   - `wtd_mean()` is a tidy-compliant wrapper around
 40 |     `stats::weighted.mean()`
 41 |   - `summary_table()` returns a tibble with summary statistics similar to
 42 |     the Stata command `sum`
 43 | 
 44 | ## Installation
 45 | 
 46 | Install the released version from CRAN.
 47 | 
 48 |     install.packages("pollster")
 49 | 
 50 | Or install the development version from GitHub.
 51 | 
 52 |     remotes::install_github("jdjohn215/pollster")
 53 | 
 54 | ## Basic usage
 55 | 
 56 | `pollster` includes a dataset of Illinois responses to the Current
 57 | Population Survey’s voter registration supplement.
 58 | 
 59 | ``` r
 60 | library(pollster)
 61 | head(illinois)
 62 | #> # A tibble: 6 x 10
 63 | #>    year    fips     sex   educ6 raceethnic maritalstatus      rv   voter   age
 64 | #>   <dbl> <dbl+l> <dbl+l> <dbl+l>  <dbl+lbl>     <dbl+lbl> <dbl+l> <dbl+l> <dbl>
 65 | #> 1  1996 17 [IL] 1 [Mal… 2 [HS]   1 [White] 1 [Married]   2 [Not… 2 [Not…    29
 66 | #> 2  1996 17 [IL] 2 [Fem… 3 [Som…  1 [White] 1 [Married]   1 [Reg… 1 [Vot…    28
 67 | #> 3  1996 17 [IL] 2 [Fem… 2 [HS]   1 [White] 3 [Never Mar… 1 [Reg… 1 [Vot…    82
 68 | #> 4  1996 17 [IL] 2 [Fem… 2 [HS]   1 [White] 3 [Never Mar… 1 [Reg… 1 [Vot…    72
 69 | #> 5  1996 17 [IL] 1 [Mal… 2 [HS]   2 [Black] 1 [Married]   1 [Reg… 1 [Vot…    75
 70 | #> 6  1996 17 [IL] 2 [Fem… 2 [HS]   2 [Black] 1 [Married]   1 [Reg… 1 [Vot…    60
 71 | #> # … with 1 more variable: weight <dbl>
 72 | ```
 73 | 
 74 | Make a topline table like this. The output is a tibble.
 75 | 
 76 | ``` r
 77 | topline(df = illinois, variable = maritalstatus, weight = weight)
 78 | #> # A tibble: 3 x 5
 79 | #>   Response           Frequency Percent `Valid Percent` `Cumulative Percent`
 80 | #>   <fct>                  <dbl>   <dbl>           <dbl>                <dbl>
 81 | #> 1 Married            55001786.    53.6            53.6                 53.6
 82 | #> 2 Widow/divorced/Sep 18635087.    18.1            18.1                 71.7
 83 | #> 3 Never Married      29041640.    28.3            28.3                100
 84 | ```
 85 | 
 86 | Make a crosstab like this.
 87 | 
 88 | ``` r
 89 | crosstab(df = illinois, x = educ6, y = maritalstatus, weight = weight)
 90 | #> # A tibble: 6 x 5
 91 | #>   educ6    Married `Widow/divorced/Sep` `Never Married`         n
 92 | #>   <fct>      <dbl>                <dbl>           <dbl>     <dbl>
 93 | #> 1 LT HS       40.0                 29.1            30.9 10770999.
 94 | #> 2 HS          52.9                 21.0            26.1 31409418.
 95 | #> 3 Some Col    44.6                 17.4            38.0 21745113.
 96 | #> 4 AA          57.4                 18.4            24.2  8249909.
 97 | #> 5 BA          61.1                 11.3            27.6 19937965.
 98 | #> 6 Post-BA     70.7                 12.9            16.5 10565110.
 99 | ```
100 | 
101 | If you prefer, you can also get the output in long
102 | format.
103 | 
104 | ``` r
105 | crosstab(df = illinois, x = educ6, y = maritalstatus, weight = weight, format = "long")
106 | #> # A tibble: 18 x 4
107 | #>    educ6    maritalstatus        pct         n
108 | #>    <fct>    <fct>              <dbl>     <dbl>
109 | #>  1 LT HS    Married             40.0 10770999.
110 | #>  2 LT HS    Widow/divorced/Sep  29.1 10770999.
111 | #>  3 LT HS    Never Married       30.9 10770999.
112 | #>  4 HS       Married             52.9 31409418.
113 | #>  5 HS       Widow/divorced/Sep  21.0 31409418.
114 | #>  6 HS       Never Married       26.1 31409418.
115 | #>  7 Some Col Married             44.6 21745113.
116 | #>  8 Some Col Widow/divorced/Sep  17.4 21745113.
117 | #>  9 Some Col Never Married       38.0 21745113.
118 | #> 10 AA       Married             57.4  8249909.
119 | #> 11 AA       Widow/divorced/Sep  18.4  8249909.
120 | #> 12 AA       Never Married       24.2  8249909.
121 | #> 13 BA       Married             61.1 19937965.
122 | #> 14 BA       Widow/divorced/Sep  11.3 19937965.
123 | #> 15 BA       Never Married       27.6 19937965.
124 | #> 16 Post-BA  Married             70.7 10565110.
125 | #> 17 Post-BA  Widow/divorced/Sep  12.9 10565110.
126 | #> 18 Post-BA  Never Married       16.5 10565110.
127 | ```
128 | 
129 | A three-way crosstab is just a normal crosstab with a third control
130 | variable. Often, this third variable is
131 | time.
132 | 
133 | ``` r
134 | crosstab_3way(df = illinois, x = educ6, y = maritalstatus, z = year, weight = weight)
135 | #> # A tibble: 72 x 6
136 | #>    educ6  year        n Married `Widow/divorced/Sep` `Never Married`
137 | #>    <fct> <dbl>    <dbl>   <dbl>                <dbl>           <dbl>
138 | #>  1 LT HS  1996 1182402.    41.0                 28.8            30.2
139 | #>  2 LT HS  1998 1159148.    42.2                 33.6            24.2
140 | #>  3 LT HS  2000 1036154.    44.3                 32.6            23.1
141 | #>  4 LT HS  2002 1074704.    38.0                 30.4            31.6
142 | #>  5 LT HS  2004  936926.    41.0                 30.3            28.6
143 | #>  6 LT HS  2006  918858.    38.6                 31.7            29.7
144 | #>  7 LT HS  2008  909755.    42.1                 28.1            29.8
145 | #>  8 LT HS  2010  806647.    40.6                 24.6            34.7
146 | #>  9 LT HS  2012  705132.    35.7                 26.9            37.4
147 | #> 10 LT HS  2014  782926.    43.7                 23.7            32.7
148 | #> # … with 62 more rows
149 | ```
150 | 
151 | ## Making tables and graphs
152 | 
153 | Wide format is best for displaying table output. Long format is best for
154 | making graphs. `pollster` outputs dovetail seamlessly with
155 | [`knitr::kable()`](https://www.rdocumentation.org/packages/knitr/versions/1.28/topics/kable)
156 | and [`ggplot2::ggplot()`](https://ggplot2.tidyverse.org/). These
157 | examples show very basic html table output, but you can customize the
158 | appearance of your tables almost endlessly in either html or pdf formats
159 | using Hao Zhu’s excellent [`kableExtra`
160 | package](https://haozhu233.github.io/kableExtra/).
161 | 
162 | ``` r
163 | library(dplyr)
164 | #> 
165 | #> Attaching package: 'dplyr'
166 | #> The following objects are masked from 'package:stats':
167 | #> 
168 | #>     filter, lag
169 | #> The following objects are masked from 'package:base':
170 | #> 
171 | #>     intersect, setdiff, setequal, union
172 | crosstab(df = illinois, x = sex, y = educ6, weight = weight) %>%
173 |   knitr::kable(digits = 0)
174 | ```
175 | 
176 | | sex    | LT HS | HS | Some Col | AA | BA | Post-BA |        n |
177 | | :----- | ----: | -: | -------: | -: | -: | ------: | -------: |
178 | | Male   |    11 | 31 |       21 |  7 | 20 |      11 | 49108796 |
179 | | Female |    10 | 30 |       22 |  9 | 19 |      10 | 53569718 |
180 | 
181 | ``` r
182 | library(ggplot2)
183 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, format = "long") %>%
184 |   ggplot(aes(educ6, pct, fill = sex)) +
185 |   geom_bar(stat = "identity", position = "dodge")
186 | ```
187 | 
188 | ![](man/figures/README-unnamed-chunk-8-1.png)<!-- -->
189 | 
190 | Three-way crosstabs are ideal for plotting time series graphs and/or
191 | faceted
192 | plots.
193 | 
194 | ``` r
195 | crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight, format = "long") %>%
196 |   ggplot(aes(year, pct, col = sex)) +
197 |   geom_line() +
198 |   facet_wrap(facets = vars(educ6))
199 | ```
200 | 
201 | ![](man/figures/README-unnamed-chunk-9-1.png)<!-- -->
202 | 
203 | ## Margin of error
204 | 
205 | Each core `pollster` function (`topline()`, `crosstab()`, and `crosstab_3way()`)
206 | has a twin function that adds a margin of error column. For example:
207 | 
208 | ``` r
209 | moe_topline(df = illinois, variable = voter, weight = weight)
210 | #> # A tibble: 2 x 6
211 | #>   Response  Frequency Percent `Valid Percent`   MOE `Cumulative Percent`
212 | #>   <fct>         <dbl>   <dbl>           <dbl> <dbl>                <dbl>
213 | #> 1 Voted     56230937.    63.7            63.7 0.551                 63.7
214 | #> 2 Not voted 32070164.    36.3            36.3 0.551                100
215 | ```
216 | 
217 | By default, `moe_crosstab` output comes in long format, but you can also
218 | specify wide
219 | format.
220 | 
221 | ``` r
222 | moe_crosstab(df = illinois, x = raceethnic, y = voter, weight = weight, format = "wide")
223 | #> # A tibble: 4 x 6
224 | #>   raceethnic     n pct_Voted `pct_Not voted` moe_Voted `moe_Not voted`
225 | #>   <fct>      <int>     <dbl>           <dbl>     <dbl>           <dbl>
226 | #> 1 White      24167      64.4            35.6     0.624           0.624
227 | #> 2 Black       3980      71.6            28.4     1.45            1.45 
228 | #> 3 Hispanic    2106      48.3            51.7     2.21            2.21 
229 | #> 4 Other       1006      48.7            51.3     3.19            3.19
230 | ```
231 | 
232 | ``` r
233 | moe_crosstab(df = illinois, x = raceethnic, y = voter, weight = weight) %>%
234 |   ggplot(aes(x = pct, y = raceethnic, xmin = (pct - moe), xmax = (pct + moe), color = voter)) +
235 |   geom_pointrange(position = position_dodge(width = 0.2))
236 | ```
237 | 
238 | ![](man/figures/README-unnamed-chunk-12-1.png)<!-- -->
239 | 
240 | ## Summary table
241 | 
242 | `summary_table()` creates a simple summary table of a weighted numeric
243 | variable.
244 | 
245 | ``` r
246 | summary_table(df = illinois, variable = age, weight = weight)
247 | #> # A tibble: 1 x 8
248 | #>   variable_name unweighted_obse… weighted_observ… weighted_mean min_value
249 | #>   <chr>                    <int>            <dbl>         <dbl>     <dbl>
250 | #> 1 age                      36207       102678514.          46.2        18
251 | #> # … with 3 more variables: max_value <dbl>, missing_observations <int>,
252 | #> #   missing_weighted_observations <dbl>
253 | ```
254 | 
255 | You can choose `name_style = "pretty"` if you want column headings
256 | appropriate for a formatted table.
257 | 
258 | ``` r
259 | summary_table(df = illinois, variable = age, 
260 |               weight = weight, name_style = "pretty") %>%
261 |   knitr::kable()
262 | ```
263 | 
264 | | Variable | Unweighted obs | Weighted obs | Weighted mean | Min | Max | Unweighted missing | Weighted missing |
265 | | :------- | -------------: | -----------: | ------------: | --: | --: | -----------------: | ---------------: |
266 | | age      |          36207 |    102678514 |      46.19646 |  18 |  90 |                  0 |                0 |
267 | 
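The README describes `wtd_mean()` as a wrapper around `stats::weighted.mean()`. As a minimal sketch of what a weighted mean computes, the example below uses made-up `age` and `weight` vectors (not the `illinois` data) and compares a manual calculation with the base R function.

```r
# Toy data (hypothetical): three ages and their survey weights
age    <- c(20, 40, 60)
weight <- c(1, 1, 2)

# Weighted mean by hand: sum(weight * value) / sum(weight)
manual <- sum(weight * age) / sum(weight)

# The same calculation using the base R function that wtd_mean() wraps
builtin <- stats::weighted.mean(age, weight)

c(manual = manual, builtin = builtin)
```

The respondent with weight 2 pulls the mean toward 60, so both values come out above the unweighted mean of 40.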


--------------------------------------------------------------------------------
/cran-comments.md:
--------------------------------------------------------------------------------
 1 | ## Test environments
 2 | * local OS X install, R 4.2.0
 3 | * win-builder (devel, release, and old)
 4 | 
 5 | ## R CMD check results
 6 | 
 7 | 0 errors | 0 warnings | 0 notes
 8 | 
 9 | ## Downstream dependencies
10 | * No downstream dependencies
11 | 


--------------------------------------------------------------------------------
/data/illinois.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdjohn215/pollster/06d85efce6a2bb3b9c10a045f2ff854b64050c24/data/illinois.rda


--------------------------------------------------------------------------------
/man/crosstab.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/Crosstab.R
 3 | \name{crosstab}
 4 | \alias{crosstab}
 5 | \title{weighted crosstabs}
 6 | \usage{
 7 | crosstab(
 8 |   df,
 9 |   x,
10 |   y,
11 |   weight,
12 |   remove = "",
13 |   n = TRUE,
14 |   pct_type = "row",
15 |   format = "wide",
16 |   unwt_n = FALSE
17 | )
18 | }
19 | \arguments{
20 | \item{df}{The data source}
21 | 
22 | \item{x}{The independent variable}
23 | 
24 | \item{y}{The dependent variable}
25 | 
26 | \item{weight}{The weighting variable}
27 | 
28 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused").
29 | This will not affect any calculations made. The vector is not case-sensitive.}
30 | 
31 | \item{n}{logical, if TRUE numeric totals are included. They are included in a separate column for row and cell
32 | percentages, but in a separate row for wide format column percentages.}
33 | 
34 | \item{pct_type}{Controls the kind of percentage values returned. One of "row", "cell", or "column".}
35 | 
36 | \item{format}{one of "long" or "wide"}
37 | 
38 | \item{unwt_n}{logical, if TRUE a column "unweighted_n" is included containing the unweighted frequency count. It is not available when pct_type is "column"}
39 | }
40 | \value{
41 | a tibble
42 | }
43 | \description{
44 | \code{crosstab} returns a tibble containing a weighted crosstab of two variables
45 | }
46 | \details{
47 | Options include row, column, or cell percentages. The tibble can be in long or wide format.
48 | }
49 | \examples{
50 | crosstab(df = illinois, x = voter, y = raceethnic, weight = weight)
51 | crosstab(df = illinois, x = voter, y = raceethnic, weight = weight, n = FALSE)
52 | }
53 | 
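The difference between the three `pct_type` options can be sketched with base R's `xtabs()` and `prop.table()`. The data frame `toy` and its columns are hypothetical, and this only illustrates what row, column, and cell percentages mean for weighted data; `crosstab()` itself may compute them differently.

```r
# Toy data (hypothetical): two categorical variables and a weight
toy <- data.frame(
  x      = c("a", "a", "b", "b"),
  y      = c("yes", "no", "yes", "no"),
  weight = c(3, 1, 2, 2)
)

# Weighted contingency table: each cell holds the sum of the weights
tab <- xtabs(weight ~ x + y, data = toy)

# Row percentages: each row of the table sums to 100
row_pct <- 100 * prop.table(tab, margin = 1)

# Column percentages: each column of the table sums to 100
col_pct <- 100 * prop.table(tab, margin = 2)

# Cell percentages: the whole table sums to 100
cell_pct <- 100 * prop.table(tab)

row_pct
```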


--------------------------------------------------------------------------------
/man/crosstab_3way.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/Crosstab3way.R
 3 | \name{crosstab_3way}
 4 | \alias{crosstab_3way}
 5 | \title{weighted 3-way crosstabs}
 6 | \usage{
 7 | crosstab_3way(
 8 |   df,
 9 |   x,
10 |   y,
11 |   z,
12 |   weight,
13 |   remove = c(""),
14 |   n = TRUE,
15 |   pct_type = "row",
16 |   format = "wide",
17 |   unwt_n = FALSE
18 | )
19 | }
20 | \arguments{
21 | \item{df}{The data source}
22 | 
23 | \item{x}{The independent variable}
24 | 
25 | \item{y}{The dependent variable}
26 | 
27 | \item{z}{The second control variable}
28 | 
29 | \item{weight}{The weighting variable}
30 | 
31 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused").
32 | This will not affect any calculations made. The vector is not case-sensitive.}
33 | 
34 | \item{n}{logical, if TRUE numeric totals are included.}
35 | 
36 | \item{pct_type}{Controls the kind of percentage values returned. One of "row" or "cell".}
37 | 
38 | \item{format}{one of "long" or "wide"}
39 | 
40 | \item{unwt_n}{logical, if TRUE a column is added containing unweighted frequency counts}
41 | }
42 | \value{
43 | a tibble
44 | }
45 | \description{
46 | \code{crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable
47 | }
48 | \details{
49 | Options include row or cell percentages. The tibble can be in long or wide format.
50 |  These tables are ideal for use with small multiples created with ggplot2::facet_wrap.
51 | }
52 | \examples{
53 | crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight)
54 | crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight,
55 | format = "long")
56 | }
57 | 


--------------------------------------------------------------------------------
/man/deff_calc.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/DesignEffectAndMOE.R
 3 | \name{deff_calc}
 4 | \alias{deff_calc}
 5 | \title{Calculate the design effect of a sample}
 6 | \usage{
 7 | deff_calc(w)
 8 | }
 9 | \arguments{
10 | \item{w}{a vector of weights}
11 | }
12 | \value{
13 | A number
14 | }
15 | \description{
16 | \code{deff_calc} returns the design effect of a vector of survey weights as a single number
17 | }
18 | \details{
19 | This function returns the design effect of a given sample using the formula
20 |  length(w)*sum(w^2)/(sum(w)^2).
21 |  It is designed for use in the moe family of functions. If any weights are equal to 0, they are removed prior to calculation.
22 | }
23 | \examples{
24 | deff_calc(illinois$weight)
25 | 
26 | }
27 | 
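The formula quoted in the details above, `length(w)*sum(w^2)/(sum(w)^2)`, can be tried out on toy weights. The function name `deff_sketch` is hypothetical and stands in for `deff_calc()` so that the example runs without the package installed; it follows the documented behaviour of dropping zero weights first.

```r
# Sketch of the documented design effect formula (hypothetical helper name)
deff_sketch <- function(w) {
  w <- w[w != 0]                      # zero weights are removed first
  length(w) * sum(w^2) / (sum(w)^2)   # n * sum(w^2) / (sum(w))^2
}

# Equal weights carry no design effect, so the result is 1
equal_weights <- deff_sketch(c(1, 1, 1, 1))

# Unequal weights give a design effect greater than 1; the 0 is ignored
unequal_weights <- deff_sketch(c(1, 2, 3, 0))

c(equal = equal_weights, unequal = unequal_weights)
```

The more the weights vary, the larger the design effect, and the wider the resulting margin of error.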


--------------------------------------------------------------------------------
/man/figures/README-unnamed-chunk-12-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdjohn215/pollster/06d85efce6a2bb3b9c10a045f2ff854b64050c24/man/figures/README-unnamed-chunk-12-1.png


--------------------------------------------------------------------------------
/man/figures/README-unnamed-chunk-8-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdjohn215/pollster/06d85efce6a2bb3b9c10a045f2ff854b64050c24/man/figures/README-unnamed-chunk-8-1.png


--------------------------------------------------------------------------------
/man/figures/README-unnamed-chunk-9-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdjohn215/pollster/06d85efce6a2bb3b9c10a045f2ff854b64050c24/man/figures/README-unnamed-chunk-9-1.png


--------------------------------------------------------------------------------
/man/illinois.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/data.R
 3 | \docType{data}
 4 | \name{illinois}
 5 | \alias{illinois}
 6 | \title{Illinois respondents to the Voting and Registration Supplement for the Current Population Survey}
 7 | \format{
 8 | A data frame with 36207 rows and 10 variables:
 9 | \describe{
10 |   \item{year}{year of survey}
11 |   \item{fips}{the state fips code}
12 |   \item{sex}{sex of the respondent, labelled value}
13 |   \item{educ6}{highest level of education for respondent, labelled values}
14 |   \item{raceethnic}{one of white, black, Hispanic, or other, labelled values}
15 |   \item{maritalstatus}{one of Married, Widowed/divorced/Sep, or Never Married, labelled values}
16 |   \item{rv}{indicates if the respondent is registered to vote, labelled values}
17 |   \item{voter}{indicates if the respondent voted, labelled values}
18 |   \item{age}{the age of the respondent, numeric values}
19 |   \item{weight}{the number of people each respondent is calculated to represent}
20 | }
21 | }
22 | \source{
23 | \url{https://www.census.gov/topics/public-sector/voting.html}
24 | }
25 | \usage{
26 | illinois
27 | }
28 | \description{
29 | A dataset containing the responses of 36,207 Illinois respondents to the
30 | Current Population Survey's biennial Voting and Registration Supplement,
31 | 1996-2018.
32 | }
33 | \keyword{datasets}
34 | 


--------------------------------------------------------------------------------
/man/moe_crosstab.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/moeCrosstab.R
 3 | \name{moe_crosstab}
 4 | \alias{moe_crosstab}
 5 | \title{weighted crosstabs with margin of error}
 6 | \usage{
 7 | moe_crosstab(
 8 |   df,
 9 |   x,
10 |   y,
11 |   weight,
12 |   remove = c(""),
13 |   n = TRUE,
14 |   pct_type = "row",
15 |   format = "long",
16 |   zscore = 1.96,
17 |   unwt_n = FALSE
18 | )
19 | }
20 | \arguments{
21 | \item{df}{The data source}
22 | 
23 | \item{x}{The independent variable}
24 | 
25 | \item{y}{The dependent variable}
26 | 
27 | \item{weight}{The weighting variable}
28 | 
29 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused").
30 | This will not affect any calculations made. The vector is not case-sensitive.}
31 | 
32 | \item{n}{logical, if TRUE numeric totals are included.}
33 | 
34 | \item{pct_type}{Controls the kind of percentage values returned. One of "row" or "cell".
35 | Column percents are not supported.}
36 | 
37 | \item{format}{one of "long" or "wide"}
38 | 
39 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval}
40 | 
41 | \item{unwt_n}{logical, if TRUE it adds a column with unweighted frequency values}
42 | }
43 | \value{
44 | a tibble
45 | }
46 | \description{
47 | \code{moe_crosstab} returns a tibble containing a weighted crosstab of two variables with margin of error
48 | }
49 | \details{
50 | Options include row or cell percentages. The tibble can be in long or wide format. The margin of
51 |  error includes the design effect of the weights.
52 | }
53 | \examples{
54 | moe_crosstab(df = illinois, x = voter, y = raceethnic, weight = weight)
55 | moe_crosstab(df = illinois, x = voter, y = raceethnic, weight = weight, n = FALSE)
56 | }
57 | 


--------------------------------------------------------------------------------
/man/moe_crosstab_3way.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/moeCrosstab3way.R
 3 | \name{moe_crosstab_3way}
 4 | \alias{moe_crosstab_3way}
 5 | \title{weighted 3-way crosstabs with margin of error}
 6 | \usage{
 7 | moe_crosstab_3way(
 8 |   df,
 9 |   x,
10 |   y,
11 |   z,
12 |   weight,
13 |   remove = c(""),
14 |   n = TRUE,
15 |   pct_type = "row",
16 |   format = "long",
17 |   zscore = 1.96,
18 |   unwt_n = FALSE
19 | )
20 | }
21 | \arguments{
22 | \item{df}{The data source}
23 | 
24 | \item{x}{The independent variable}
25 | 
26 | \item{y}{The dependent variable}
27 | 
28 | \item{z}{The second control variable}
29 | 
30 | \item{weight}{The weighting variable}
31 | 
32 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused").
33 | This will not affect any calculations made. The vector is not case-sensitive.}
34 | 
35 | \item{n}{logical, if TRUE numeric totals are included.}
36 | 
37 | \item{pct_type}{Controls the kind of percentage values returned. One of "row" or "cell".}
38 | 
39 | \item{format}{one of "long" or "wide"}
40 | 
41 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval}
42 | 
43 | \item{unwt_n}{logical, if TRUE it adds a column with unweighted frequency values}
44 | }
45 | \value{
46 | a tibble
47 | }
48 | \description{
49 | \code{moe_crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable with margin of error
50 | }
51 | \details{
52 | Options include row or cell percentages. The tibble can be in long or wide format.
53 |  These tables are ideal for use with small multiples created with ggplot2::facet_wrap.
54 | }
55 | \examples{
56 | moe_crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight)
57 | moe_crosstab_3way(df = illinois, x = sex, y = educ6, z = maritalstatus, weight = weight,
58 | format = "wide")
59 | }
60 | 


--------------------------------------------------------------------------------
/man/moe_topline.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/moeTopline.R
 3 | \name{moe_topline}
 4 | \alias{moe_topline}
 5 | \title{weighted topline with margin of error}
 6 | \usage{
 7 | moe_topline(
 8 |   df,
 9 |   variable,
10 |   weight,
11 |   remove = c(""),
12 |   n = TRUE,
13 |   pct = TRUE,
14 |   valid_pct = TRUE,
15 |   cum_pct = TRUE,
16 |   zscore = 1.96
17 | )
18 | }
19 | \arguments{
20 | \item{df}{The data source}
21 | 
22 | \item{variable}{the variable name}
23 | 
24 | \item{weight}{The weighting variable}
25 | 
26 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused").
27 | This will not affect any calculations made. The vector is not case-sensitive.}
28 | 
29 | \item{n}{logical, if TRUE a frequency column
30 | is included}
31 | 
32 | \item{pct}{logical, if TRUE a column of percents is included}
33 | 
34 | \item{valid_pct}{logical, if TRUE a column of valid percents is included}
35 | 
36 | \item{cum_pct}{logical, if TRUE a column of cumulative percents is included}
37 | 
38 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval}
39 | }
40 | \value{
41 | a tibble
42 | }
43 | \description{
44 | \code{moe_topline} returns a tibble containing a weighted topline of one variable with margin of error
45 | }
46 | \details{
47 | By default the table includes a column for frequency count, percent, valid percent, and cumulative percent.
48 | }
49 | \examples{
50 | moe_topline(df = illinois, variable = educ6, weight = weight)
51 | moe_topline(df = illinois, variable = educ6, weight = weight, remove = c("LT HS"))
52 | }
53 | 
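One common way to fold a design effect into a margin of error is to inflate the simple-random-sample standard error of a proportion by the square root of the design effect, then multiply by the chosen `zscore`. The sketch below follows that textbook form. The helper name `moe_sketch` and its arguments are hypothetical, and the exact formula used by `moe_topline()` is not shown in this file and may differ in detail.

```r
# Hypothetical helper: margin of error (in percentage points) for a percentage,
# inflated by the design effect of the weights
moe_sketch <- function(pct, n, deff, zscore = 1.96) {
  p <- pct / 100                                   # convert percent to proportion
  100 * zscore * sqrt(deff) * sqrt(p * (1 - p) / n)
}

# With no design effect (deff = 1) this is the usual simple-random-sample MOE
moe_srs <- moe_sketch(pct = 50, n = 1000, deff = 1)

# A design effect of 1.5 widens the margin by a factor of sqrt(1.5)
moe_weighted <- moe_sketch(pct = 50, n = 1000, deff = 1.5)

c(srs = moe_srs, weighted = moe_weighted)
```

A larger `zscore` (for example 2.576 for a 99% interval) widens the margin in the same proportional way.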


--------------------------------------------------------------------------------
/man/moe_wave_crosstab.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/moeWaveCrosstab.R
 3 | \name{moe_wave_crosstab}
 4 | \alias{moe_wave_crosstab}
 5 | \title{weighted crosstabs with margin of error, where the x-variable identifies different survey waves}
 6 | \usage{
 7 | moe_wave_crosstab(
 8 |   df,
 9 |   x,
10 |   y,
11 |   weight,
12 |   remove = c(""),
13 |   n = TRUE,
14 |   pct_type = "row",
15 |   format = "long",
16 |   zscore = 1.96,
17 |   unwt_n = FALSE
18 | )
19 | }
20 | \arguments{
21 | \item{df}{The data source}
22 | 
23 | \item{x}{The independent variable, which uniquely identifies survey waves}
24 | 
25 | \item{y}{The dependent variable}
26 | 
27 | \item{weight}{The weighting variable}
28 | 
29 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused").
30 | This will not affect any calculations made. The vector is not case-sensitive.}
31 | 
32 | \item{n}{logical, if TRUE numeric totals are included.}
33 | 
34 | \item{pct_type}{Controls the kind of percentage values returned. One of "row" or "cell".
35 | Column percents are not supported.}
36 | 
37 | \item{format}{one of "long" or "wide"}
38 | 
39 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval}
40 | 
41 | \item{unwt_n}{logical, if TRUE it adds a column with unweighted frequency values}
42 | }
43 | \value{
44 | a tibble
45 | }
46 | \description{
47 | \code{moe_wave_crosstab} returns a tibble containing a weighted crosstab of two variables
48 |  with margin of error. Use this function when the x-variable indicates different survey
49 |  waves for which weights were calculated independently.
50 | }
51 | \details{
52 | Options include row or cell percentages. The tibble can be in long or wide format. The margin of
53 |  error includes the design effect of the weights, calculated separately for each
54 |  survey wave.
55 | }
56 | \examples{
57 | moe_wave_crosstab(df = illinois, x = year, y = maritalstatus, weight = weight)
58 | moe_wave_crosstab(df = illinois, x = year, y = maritalstatus, weight = weight, format = "wide")
59 | }
60 | 


--------------------------------------------------------------------------------
/man/moe_wave_crosstab_3way.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/moeWaveCrosstab3way.R
 3 | \name{moe_wave_crosstab_3way}
 4 | \alias{moe_wave_crosstab_3way}
 5 | \title{weighted 3-way crosstabs with margin of error, where the z-variable identifies different survey waves}
 6 | \usage{
 7 | moe_wave_crosstab_3way(
 8 |   df,
 9 |   x,
10 |   y,
11 |   z,
12 |   weight,
13 |   remove = c(""),
14 |   n = TRUE,
15 |   pct_type = "row",
16 |   format = "long",
17 |   zscore = 1.96,
18 |   unwt_n = FALSE
19 | )
20 | }
21 | \arguments{
22 | \item{df}{The data source}
23 | 
24 | \item{x}{The independent variable}
25 | 
26 | \item{y}{The dependent variable}
27 | 
28 | \item{z}{The second control variable, uniquely identifies survey waves}
29 | 
30 | \item{weight}{The weighting variable}
31 | 
32 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused").
33 | This will not affect any calculations made. The vector is not case-sensitive.}
34 | 
35 | \item{n}{logical, if TRUE numeric totals are included.}
36 | 
37 | \item{pct_type}{Controls the kind of percentage values returned. One of "row" or "cell."}
38 | 
39 | \item{format}{one of "long" or "wide"}
40 | 
41 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval}
42 | 
43 | \item{unwt_n}{logical, if TRUE it adds a column with unweighted frequency values}
44 | }
45 | \value{
46 | a tibble
47 | }
48 | \description{
49 | \code{moe_wave_crosstab_3way} returns a tibble containing a weighted crosstab of two variables by a third variable with margin of error.
50 | Use this function when the z-variable indicates different survey
51 | waves for which weights were calculated independently.
52 | }
53 | \details{
54 | Options include row or cell percentages. The tibble can be in long or wide format.
55 |  These tables are ideal for use with small multiples created with ggplot2::facet_wrap.
56 | }
57 | \examples{
58 | moe_wave_crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight)
59 | moe_wave_crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight, format = "wide")
60 | }
61 | 


--------------------------------------------------------------------------------
/man/moedeff_calc.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/DesignEffectAndMOE.R
 3 | \name{moedeff_calc}
 4 | \alias{moedeff_calc}
 5 | \title{Calculate the margin of error (including design effect) of a sample}
 6 | \usage{
 7 | moedeff_calc(pct, deff, n, zscore = 1.96)
 8 | }
 9 | \arguments{
10 | \item{pct}{a proportion}
11 | 
12 | \item{deff}{a design effect}
13 | 
14 | \item{n}{the sample size}
15 | 
16 | \item{zscore}{defaults to 1.96, consistent with a 95\% confidence interval.}
17 | }
18 | \value{
19 | A percentage
20 | }
21 | \description{
22 | \code{moedeff_calc} returns a single number. It is designed for use in the moe family of functions.
23 | }
24 | \details{
25 | This function returns the margin of error including design effect of a given sample of weighted data using the formula
26 |  sqrt(deff)*zscore*sqrt((pct*(1-pct))/(n-1))*100
27 | }
28 | \examples{
29 | moedeff_calc(pct = 0.515, deff = 1.6, n = 214)
30 | }
31 | 


--------------------------------------------------------------------------------
/man/summary_table.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/SummaryStatistics.R
 3 | \name{summary_table}
 4 | \alias{summary_table}
 5 | \title{weighted summary table}
 6 | \usage{
 7 | summary_table(df, variable, weight, name_style = "clean")
 8 | }
 9 | \arguments{
10 | \item{df}{The data source}
11 | 
12 | \item{variable}{the variable to summarize, it should be numeric}
13 | 
14 | \item{weight}{The weighting variable}
15 | 
16 | \item{name_style}{the style of the column names--one of "clean" or "pretty."
17 | Clean names are all lower case and words are separated by an underscore.
18 | Pretty names begin with a capital letter and words are separated by a space.}
19 | }
20 | \value{
21 | a tibble
22 | }
23 | \description{
24 | \code{summary_table} returns a tibble containing a weighted summary table of a single variable.
25 | }
26 | \details{
27 | The resulting tibble includes columns for the variable name, unweighted observations,
28 |  weighted observations, weighted mean, minimum value, maximum value,
29 |  unweighted missing values, and weighted missing values.
30 | }
31 | \examples{
32 | summary_table(illinois, age, weight)
33 | summary_table(illinois, age, weight, name_style = "pretty")
34 | }
35 | 


--------------------------------------------------------------------------------
/man/topline.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/Topline.R
 3 | \name{topline}
 4 | \alias{topline}
 5 | \title{weighted topline}
 6 | \usage{
 7 | topline(
 8 |   df,
 9 |   variable,
10 |   weight,
11 |   remove = c(""),
12 |   n = TRUE,
13 |   pct = TRUE,
14 |   valid_pct = TRUE,
15 |   cum_pct = TRUE
16 | )
17 | }
18 | \arguments{
19 | \item{df}{The data source}
20 | 
21 | \item{variable}{the variable name}
22 | 
23 | \item{weight}{The weighting variable}
24 | 
25 | \item{remove}{An optional character vector of values to remove from final table (e.g. "refused").
26 | This will not affect any calculations made. The vector is not case-sensitive.}
27 | 
28 | \item{n}{logical, if TRUE a frequency column is included
29 | in the table}
30 | 
31 | \item{pct}{logical, if TRUE a column of percents is included}
32 | 
33 | \item{valid_pct}{logical, if TRUE a column of valid percents is included}
34 | 
35 | \item{cum_pct}{logical, if TRUE a column of cumulative percents is included}
36 | }
37 | \value{
38 | a tibble
39 | }
40 | \description{
41 | \code{topline} returns a tibble containing a weighted topline of one variable
42 | }
43 | \details{
44 | By default the table includes a column for frequency count, percent, valid percent, and cumulative percent.
45 | }
46 | \examples{
47 | topline(illinois, sex, weight)
48 | topline(illinois, sex, weight, pct = FALSE)
49 | }
50 | 


--------------------------------------------------------------------------------
/man/wtd_mean.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/SummaryStatistics.R
 3 | \name{wtd_mean}
 4 | \alias{wtd_mean}
 5 | \title{weighted mean}
 6 | \usage{
 7 | wtd_mean(df, variable, weight)
 8 | }
 9 | \arguments{
10 | \item{df}{The data source}
11 | 
12 | \item{variable}{the variable, it should be numeric}
13 | 
14 | \item{weight}{The weighting variable}
15 | }
16 | \value{
17 | a numeric value
18 | }
19 | \description{
20 | \code{wtd_mean} returns the weighted mean of a variable. It's a tidy-compatible
21 | wrapper around stats::weighted.mean().
22 | }
23 | \examples{
24 | wtd_mean(illinois, age, weight)
25 | 
26 | library(dplyr)
27 | illinois \%>\% wtd_mean(age, weight)
28 | }
29 | 


--------------------------------------------------------------------------------
/vignettes/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 | *.R
3 | 


--------------------------------------------------------------------------------
/vignettes/crosstab3way.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "3-way crosstabs"
 3 | output: rmarkdown::html_vignette
 4 | vignette: >
 5 |   %\VignetteIndexEntry{crosstab3way}
 6 |   %\VignetteEngine{knitr::rmarkdown}
 7 |   %\VignetteEncoding{UTF-8}
 8 | ---
 9 | 
10 | ```{r, include = FALSE}
11 | knitr::opts_chunk$set(
12 |   collapse = TRUE,
13 |   comment = "#>"
14 | )
15 | ```
16 | 
17 | ```{r setup}
18 | library(pollster)
19 | library(dplyr)
20 | library(knitr)
21 | library(ggplot2)
22 | ```
23 | 
24 | It's common to want to view a crosstab of two variables by a third variable, for instance educational attainment by sex *and* marital status. The function `crosstab_3way` accomplishes this. Row and cell percents are both supported; column percents are not.
25 | 
26 | ```{r}
27 | illinois %>%
28 |   # filter for recent years & limited ages
29 |   filter(year > 2009,
30 |          age > 39) %>%
31 |   crosstab_3way(x = sex, y = educ6, z = maritalstatus, weight = weight,
32 |                 remove = c("widow/divorced/sep"),
33 |                 n = FALSE) %>%
 34 |   kable(digits = 0, caption = "Educational attainment by sex and marital status among Illinois residents ages 40+",
35 |         format = "html")
36 | ```
37 | 
38 | Three-way crosstabs plot well as small multiples using ggplot facets.
39 | 
40 | ```{r, fig.width=6, fig.height=4}
41 | illinois %>%
42 |   # filter for recent years & limited ages
43 |   filter(year > 2009,
44 |          age > 34) %>%
45 |   crosstab_3way(x = sex, y = educ6, z = maritalstatus, weight = weight,
46 |                 remove = c("widow/divorced/sep"), 
47 |                 format = "long") %>%
48 |   ggplot(aes(educ6, pct, fill = maritalstatus)) +
49 |   geom_bar(stat = "identity", position = position_dodge()) +
50 |   facet_wrap(facets = vars(sex)) +
 51 |   labs(title = "Educational attainment by sex and marital status",
 52 |        subtitle = "Illinois residents ages 35+") +
53 |   theme(legend.position = "top")
54 | ```
55 | 
 56 | The same plot can be made with margins of error as well. (See the "crosstabs" vignette for a more detailed discussion of margins of error.)
57 | 
58 | 
59 | ```{r, fig.width=6, fig.height=4}
60 | illinois %>%
61 |   # filter for recent years & limited ages
62 |   filter(year > 2009,
63 |          age > 34) %>%
64 |   moe_crosstab_3way(x = sex, y = educ6, z = maritalstatus, weight = weight,
65 |                 remove = c("widow/divorced/sep"), format = "long") %>%
66 |   ggplot(aes(educ6, pct, fill = maritalstatus)) +
67 |   geom_bar(stat = "identity", position = position_dodge(),
68 |            alpha = 0.5) +
69 |   geom_errorbar(aes(ymin = (pct - moe), ymax = (pct + moe),
70 |                     color = maritalstatus),
71 |                 position = position_dodge()) +
72 |   facet_wrap(facets = vars(sex)) +
73 |   labs(title = "Educational attainment by sex and marital status",
74 |        subtitle = "Illinois residents ages 35+",
75 |        caption = "Current Population Survey, 2010-2018") +
76 |   theme(legend.position = "top")
77 | ```
78 | 
79 | ### Special case, when the z-variable identifies survey waves
80 | 
 81 | If the z-variable in your crosstab uniquely identifies survey waves for which the weights were independently generated, it is best practice to calculate the design effect independently for each wave. `moe_wave_crosstab_3way` does just that. All of the arguments remain the same as in `moe_crosstab_3way`.
82 | 
83 | ```{r}
84 | moe_wave_crosstab_3way(df = illinois, x = sex, y = educ6, z = year, weight = weight)
85 | ```
86 | 


--------------------------------------------------------------------------------
/vignettes/crosstabs.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "crosstabs"
  3 | output: rmarkdown::html_vignette
  4 | vignette: >
  5 |   %\VignetteIndexEntry{crosstabs}
  6 |   %\VignetteEngine{knitr::rmarkdown}
  7 |   %\VignetteEncoding{UTF-8}
  8 | ---
  9 | 
 10 | ```{r, include = FALSE}
 11 | knitr::opts_chunk$set(
 12 |   collapse = TRUE,
 13 |   comment = "#>"
 14 | )
 15 | ```
 16 | 
 17 | ```{r setup}
 18 | library(pollster)
 19 | library(dplyr)
 20 | library(knitr)
 21 | library(ggplot2)
 22 | ```
 23 | 
 24 | Crosstabs can come in [wide or long format](https://en.wikipedia.org/wiki/Wide_and_narrow_data). Each is useful, depending on your purpose. Wide data is best for display tables. Long data is usually better for making plots.
 25 | 
 26 | Here is a wide table.
 27 | 
 28 | ```{r}
 29 | crosstab(df = illinois, x = sex, y = educ6, weight = weight) %>%
 30 |   kable()
 31 | ```
 32 | 
 33 | And here is long format.
 34 | 
 35 | ```{r}
 36 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, format = "long")
 37 | ```
 38 | 
 39 | By default, row percentages are used. You can also explicitly choose cell or column percentages using the `pct_type` argument. I discourage the use of column percentages--it's better to just flip the x and y variables and make row percents--but the option is included to match functionality provided by other standard statistical software.
 40 | 
 41 | ```{r}
 42 | # cell percentages
 43 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, pct_type = "cell")
 44 | 
 45 | # column percentages
 46 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, pct_type = "column")
 47 | ```
 48 | 
 49 | To make a graph, just feed your `tibble` output to a `ggplot2` function.
 50 | 
 51 | ```{r, fig.width=5.6}
 52 | crosstab(df = illinois, x = sex, y = educ6, weight = weight, format = "long") %>%
 53 |   ggplot(aes(x = educ6, y = pct, fill = sex)) +
 54 |   geom_bar(stat = "identity", position = position_dodge()) +
 55 |   labs(title = "Educational attainment of the Illinois adult population by gender")
 56 | ```
 57 | 
 58 | ## Margin of error
 59 | 
 60 | ### How the margin of error is calculated
 61 | 
 62 | The margin of error is calculated including the design effect of the sample weights, using the following formula:
 63 | 
 64 | `sqrt(design effect)*zscore*sqrt((pct*(1-pct))/(n-1))*100`
 65 | 
 66 | The design effect is calculated using the formula `length(weights)*sum(weights^2)/(sum(weights)^2)`.
 67 | 
 68 | ------
 69 | 
 70 | Get a crosstab with the margin of error in a separate column using the `moe_crosstab` function. By default, a z-score of 1.96 (95% confidence interval) is used. Supply your own desired z-score using the `zscore` argument. Only row and cell percents are supported. By default, the table format is long because I anticipate that making visualizations will be the most common use case for this function.
 71 | 
 72 | ```{r}
 73 | moe_crosstab(illinois, educ6, voter, weight)
 74 | ```
 75 | 
 76 | A wide format table looks like this.
 77 | 
 78 | ```{r}
 79 | moe_crosstab(illinois, educ6, voter, weight, format = "wide")
 80 | ```
 81 | 
 82 | `ggplot2` offers [multiple ways](http://www.sthda.com/english/wiki/ggplot2-error-bars-quick-start-guide-r-software-and-data-visualization) to visualize the margin of error. Here is one good option. (Please note, if you don't have ggplot2 >= [3.3.0](https://www.tidyverse.org/blog/2020/03/ggplot2-3-3-0/) you'll get an error message.)
 83 | 
 84 | ```{r, fig.width=5}
 85 | illinois %>%
 86 |   filter(year == 2016) %>%
 87 |   moe_crosstab(educ6, voter, weight) %>%
 88 |   ggplot(aes(x = pct, y = educ6, xmin = (pct - moe), xmax = (pct + moe),
 89 |              color = voter)) +
 90 |   geom_pointrange(position = position_dodge(width = 0.2))
 91 | ```
 92 | 
 93 | ### Special case, the x-variable identifies survey waves
 94 | 
 95 | If the x-variable in your crosstab uniquely identifies survey waves for which the weights were independently generated, it is best practice to calculate the design effect independently for each wave. `moe_wave_crosstab` does just that. All of the arguments remain the same as in `moe_crosstab`.
 96 | 
 97 | ```{r}
 98 | moe_wave_crosstab(df = illinois, x = year, y = rv, weight = weight)
 99 | ```
100 | 
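To see why the per-wave calculation matters, here is a minimal base R sketch using made-up data. It is not how `moe_wave_crosstab` works internally; the data frame `toy`, its columns, and the helper `deff` are hypothetical and exist only for illustration. The sketch applies the design effect formula from the "How the margin of error is calculated" section, once to the pooled weights and once to each wave separately.

```r
# Hypothetical data: two survey waves whose weights were built independently
# and therefore sit on very different scales
toy <- data.frame(
  wave   = c(rep(2016, 4), rep(2018, 4)),
  weight = c(1, 1, 1, 1, 10, 10, 10, 10)
)

# Design effect, following the formula given earlier in this vignette
deff <- function(w) length(w) * sum(w^2) / (sum(w)^2)

# Pooled across waves: the scale difference between waves inflates the result
deff_pooled <- deff(toy$weight)

# Separately for each wave: weights are constant within a wave, so each is 1
deff_by_wave <- tapply(toy$weight, toy$wave, deff)

deff_pooled
deff_by_wave
```

In this toy case every respondent within a wave has the same weight, so neither wave carries any design effect on its own. The pooled figure is larger than 1 only because the two waves' weights are on different scales, which would overstate the margin of error.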


--------------------------------------------------------------------------------
/vignettes/toplines.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "toplines"
 3 | output: rmarkdown::html_vignette
 4 | vignette: >
 5 |   %\VignetteIndexEntry{toplines}
 6 |   %\VignetteEngine{knitr::rmarkdown}
 7 |   %\VignetteEncoding{UTF-8}
 8 | ---
 9 | 
10 | ```{r, include = FALSE}
11 | knitr::opts_chunk$set(
12 |   collapse = TRUE,
13 |   comment = "#>"
14 | )
15 | ```
16 | 
17 | ```{r setup}
18 | library(pollster)
19 | library(dplyr)
20 | library(knitr)
21 | library(ggplot2)
22 | ```
23 | 
24 | The default topline table comes with columns for response category, frequency count, percent, valid percent, and cumulative percent.
25 | 
26 | ```{r}
27 | topline(df = illinois, variable = voter, weight = weight) %>%
28 |   kable()
29 | ```
30 | 
31 | Because the output is a `tibble`, it's simple to manipulate it in any way you want after creating it. Use `dplyr::select` to remove columns or `dplyr::filter` to remove rows. For convenience, the `topline` function also provides ways to do this within the function call. For example, the `remove` argument accepts a character vector of response values to be removed from the table *after* all statistics are calculated. This is especially useful for survey data with a "refused" category.
32 | 
33 | ```{r}
34 | topline(df = illinois, variable = voter, weight = weight, 
35 |         remove = c("(Missing)"), pct = FALSE) %>%
36 |   mutate(Frequency = prettyNum(Frequency, big.mark = ",")) %>%
37 |   kable(digits = 0)
38 | ```
39 | 
40 | Refer to the [`kableExtra` package](https://CRAN.R-project.org/package=kableExtra) for lots of examples on how to format the appearance of these tables in either HTML or PDF (LaTeX) formats. I recommend the vignettes "Create Awesome HTML Table with knitr::kable and kableExtra" and "Create Awesome PDF Table with knitr::kable and kableExtra."
41 | 
42 | ## Graphs
43 | 
44 | ```{r, fig.width=4}
45 | topline(df = illinois, variable = voter, weight = weight) %>%
46 |   ggplot(aes(Response, Percent, fill = Response)) +
47 |   geom_bar(stat = "identity")
48 | ```
49 | 
50 | ## Margin of error
51 | 
52 | Get a topline table with the margin of error in a separate column using the `moe_topline` function. By default, a z-score of 1.96 (95% confidence interval) is used. Supply your own desired z-score using the `zscore` argument.
53 | 
54 | ```{r}
55 | moe_topline(df = illinois, variable = educ6, weight = weight)
56 | ```
57 | 
58 | The margin of error is calculated including the design effect of the sample weights, using the following formula:
59 | 
60 | `sqrt(design effect)*zscore*sqrt((pct*(1-pct))/(n-1))*100`
61 | 
62 | The design effect is calculated using the formula `length(weights)*sum(weights^2)/(sum(weights)^2)`.
63 | 
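The two formulas above can be checked by hand. The following is a minimal base R sketch, not the package's own code; the weights `w` and the helper names `deff_manual` and `moe_manual` are hypothetical and chosen only for illustration.

```r
# Hypothetical survey weights for five respondents (illustrative values only)
w <- c(0.5, 1.0, 1.5, 2.0, 1.0)

# Design effect: length(weights) * sum(weights^2) / sum(weights)^2
deff_manual <- function(weights) {
  length(weights) * sum(weights^2) / (sum(weights)^2)
}

# Margin of error in percentage points, following the formula above
moe_manual <- function(pct, deff, n, zscore = 1.96) {
  sqrt(deff) * zscore * sqrt((pct * (1 - pct)) / (n - 1)) * 100
}

# Unequal weights give a design effect above 1 (here 42.5 / 36, about 1.18)
deff <- deff_manual(w)

# Equal weights give a design effect of exactly 1
deff_equal <- deff_manual(rep(1, 5))

# Margin of error for a 50% estimate from 1,000 respondents,
# with and without the design effect
moe_weighted   <- moe_manual(pct = 0.5, deff = deff, n = 1000)
moe_unweighted <- moe_manual(pct = 0.5, deff = 1, n = 1000)
```

The weighted margin of error is larger than the unweighted one by a factor of `sqrt(deff)`. With the package loaded, `moedeff_calc(pct = 0.5, deff = deff, n = 1000)` should return the same value as `moe_weighted`, assuming it implements the documented formula.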


--------------------------------------------------------------------------------