├── .gitignore ├── .travis.yml ├── 01-mrp.Rmd ├── 02-mrp_noncensus.Rmd ├── 03-mrp_ideal.Rmd ├── 04-references.Rmd ├── README.md ├── _bookdown.yml ├── _build.sh ├── _deploy.sh ├── _output.yml ├── book.bib ├── data_public ├── chapter1 │ ├── data │ │ ├── cces18_common_vv.csv.gz │ │ ├── poststrat_df.csv │ │ └── statelevel_predictors.csv │ ├── models │ │ ├── fit_mrp_1.rds │ │ └── fit_mrp_2.rds │ └── screenshots │ │ ├── screenshot1.png │ │ ├── screenshot2.png │ │ ├── screenshot3.png │ │ ├── screenshot4.PNG │ │ └── screenshot5.PNG ├── chapter2 │ └── models │ │ ├── fit_abortion_noncensus.rds │ │ ├── fit_abortion_standard.rds │ │ └── fit_party_example.rds └── chapter3 │ └── idealpoint.stan ├── index.Rmd ├── mrp-case.Rproj ├── mrp.bib ├── now.json ├── packages.bib ├── preamble.tex ├── style.css └── toc.css /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .git 5 | _publish.R 6 | _book 7 | _bookdown_files 8 | poststrat_data 9 | MRP-case-studies.log 10 | MRP-case-studies.tex 11 | rsconnect 12 | cces18_common_vv.csv 13 | usa_00001.csv 14 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: r 2 | cache: packages 3 | pandoc_version: 2.9.2.1 4 | 5 | addons: 6 | apt: 7 | packages: 8 | - ghostscript 9 | 10 | before_script: 11 | - chmod +x ./_build.sh 12 | - chmod +x ./_deploy.sh 13 | 14 | script: 15 | - ./_build.sh 16 | - ./_deploy.sh 17 | -------------------------------------------------------------------------------- /01-mrp.Rmd: -------------------------------------------------------------------------------- 1 | 2 | 3 | # Introduction to Mister P 4 | 5 | ```{r packages-1, warning=FALSE, message=FALSE, echo = FALSE, cache=FALSE} 6 | library(rstan) 7 | library(rstanarm) 8 | library(data.table) 9 | library(dplyr) 10 | library(forcats) 11 | library(tidyr) 12 | library(reshape2) 13 | library(stringr) 14 | library(readr) 15 | 16 | library(ggplot2) 17 | library(scales) 18 | library(bayesplot) 19 | library(gridExtra) 20 | library(ggalt) 21 | library(usmap) 22 | library(gridExtra) 23 | library(scales) 24 | library(kableExtra) 25 | library(formatR) 26 | 27 | theme_set(bayesplot::theme_default()) 28 | 29 | # Improves performance in some systems, but can be removed if it causes problems 30 | # Sys.setenv(LOCAL_CPPFLAGS = '-march=corei7 -mtune=corei7') 31 | 32 | # Use all available cores 33 | options(mc.cores = parallel::detectCores(logical = FALSE)) 34 | ``` 35 | 36 | Multilevel regression and poststratification (MRP, also called MrP or Mister P) has become widely used in two closely related applications: 37 | 38 | 1. Small-area estimation: Subnational surveys are not always available, and even then finding comparable surveys across subnational units is rare. However, public views at the subnational level are often central, as many policies are decided by local goverments or subnational area representatives at national assemblies. MRP allows us to use national surveys to generate reliable estimates of subnational opinion (@park2004bayesian, @lax2009gay, @lax2009states, @kiewiet2018predicting). 39 | 40 | 2. Using nonrepresentative surveys: Many surveys face serious difficulties in recruiting representative samples of participants (e.g. because of non-response bias). However, with proper statistical adjustment, nonrepresentative surveys can be used to generate accurate opinion estimates (@wang2015xbox, @downes2018multilevel). 41 | 42 | This initial chapter introduces MRP in the context of public opinion research. Following a brief introduction to the data, we will describe the two essential stages of MRP: building an individual-response model and using poststratification. First, we take individual responses to a national survey and use multilevel modeling in order to predict opinion estimates based on demographic-geographic subgroups (e.g. middle-aged white female with postgraduate education in California). Secondly, these opinion estimates by subgroups are weighted by the frequency of these subgroups at the (national or subnational) unit of interest. With these two steps, MRP emerged (@gelman1997poststratification) as an approach that brought together the advantages of regularized estimation and poststratification, two techniques that had shown promising results in the field of survey research (see @fay1979estimates and @little1993post). After presenting how MRP can be used for obtaining subregion or subgroup estimates and for adjusting for nonresponse bias, we will conclude with some practical considerations. 43 | 44 | ## Data {#data} 45 | 46 | ### Survey data {.unnumbered #survey-data} 47 | 48 | The first step is to gather and recode raw survey data. These surveys should include some respondent demographic information and some type of geographic indicator (e.g. state, congressional district). In this case, we will use data from the 2018 Cooperative Congressional Election Study (@2018CCES), a US nationwide survey designed by a consortium of 60 research teams and administered by YouGov. The outcome of interest in this introduction is a dichotomous question: 49 | 50 | > Allow employers to decline coverage of abortions in insurance plans (Support / Oppose) 51 | 52 | Apart from the outcome measure, we will consider a set of geographic-demographic factors that will be used as predictors in the first stage and that define the geographic-demographic subgroups for the second stage. Even though some of these variables may be continous (e.g. age, income), we must split them into intervals to create a factor with different levels. As we will see in a moment, these factors and their corresponding levels need to match the ones in the postratification table. In this case, we will use the following factors with the indicated levels: 53 | 54 | * State: 50 US states ($S = 50$). 55 | * Age: 18-29, 30-39, 40-49, 50-59, 60-69, 70+ ($A = 6$). 56 | * Gender: Female, Male ($G = 2$). 57 | * Ethnicity: (Non-hispanic) White, Black, Hispanic, Other (which also includes Mixed) ($R = 4$). 58 | * Education: No HS, HS, Some college, 4-year college, Post-grad ($E = 5$). 59 | 60 | ```{r, echo = FALSE, cache=FALSE} 61 | # The US census and CCES data use FIPS codes to identify states. For better 62 | # interpretability, we label these FIPS codes with their corresponding abbreviation. 63 | # Note that the FIPS codes include the district of Columbia and US territories which 64 | # are not considered in this study, creating some gaps in the numbering system. 65 | state_abb <- datasets::state.abb 66 | state_fips <- c(1,2,4,5,6,8,9,10,12,13,15,16,17,18,19,20,21,22,23,24, 67 | 25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42, 68 | 44,45,46,47,48,49,50,51,53,54,55,56) 69 | recode_fips <- function(column) { 70 | factor(column, levels = state_fips, labels = state_abb) 71 | } 72 | 73 | # Recode CCES 74 | clean_cces <- function(df, remove_nas = TRUE){ 75 | 76 | ## Abortion -- dichotomous (0 - Oppose / 1 - Support) 77 | df$abortion <- abs(df$CC18_321d-2) 78 | 79 | ## State -- factor 80 | df$state <- recode_fips(df$inputstate) 81 | 82 | ## Gender -- dichotomous (coded as -0.5 Female, +0.5 Male) 83 | df$male <- abs(df$gender-2)-0.5 84 | 85 | ## ethnicity -- factor 86 | df$eth <- factor(df$race, 87 | levels = 1:8, 88 | labels = c("White", "Black", "Hispanic", "Asian", "Native American", "Mixed", "Other", "Middle Eastern")) 89 | df$eth <- fct_collapse(df$eth, "Other" = c("Asian", "Other", "Middle Eastern", "Mixed", "Native American")) 90 | 91 | ## Age -- cut into factor 92 | df$age <- 2018 - df$birthyr 93 | df$age <- cut(as.integer(df$age), breaks = c(0, 29, 39, 49, 59, 69, 120), 94 | labels = c("18-29","30-39","40-49","50-59","60-69","70+"), 95 | ordered_result = TRUE) 96 | 97 | ## Education -- factor 98 | df$educ <- factor(as.integer(df$educ), 99 | levels = 1:6, 100 | labels = c("No HS", "HS", "Some college", "Associates", "4-Year College", "Post-grad"), ordered = TRUE) 101 | df$educ <- fct_collapse(df$educ, "Some college" = c("Some college", "Associates")) 102 | 103 | # Filter out unnecessary columns and remove NAs 104 | df <- df %>% select(abortion, state, eth, male, age, educ) 105 | if (remove_nas){ 106 | df <- df %>% drop_na() 107 | } 108 | 109 | return(df) 110 | 111 | } 112 | ``` 113 | 114 | ```{r, results = 'asis', cache=FALSE, warning=FALSE, message=FALSE} 115 | cces_all_df <- read_csv("data_public/chapter1/data/cces18_common_vv.csv.gz") 116 | 117 | # Preprocessing 118 | cces_all_df <- clean_cces(cces_all_df, remove_nas = TRUE) 119 | ``` 120 | 121 | Details about how we preprocess the CCES data using the `clean_cces()` function can be found in the appendix. 122 | 123 | The full 2018 CCES consist of almost 60,000 respondents. However, most studies work with a smaller national survey. To show how MRP works in these cases, we take a random sample of 5,000 participants and work with the sample instead of the full CCES. Obviously, in a more realistic setting we would always use all the available data. 124 | 125 | ```{r, cache=FALSE} 126 | # We set the seed to an arbitrary number for reproducibility. 127 | set.seed(1010) 128 | 129 | # For clarity, we will call the full survey with 60,000 respondents cces_all_df, 130 | # while the 5,000 person sample will be called cces_df. 'df' stands for data frame, 131 | # the most frequently used two dimensional data structure in R. 132 | cces_df <- cces_all_df %>% sample_n(5000) 133 | ``` 134 | 135 | ```{r, echo=FALSE} 136 | kable(head(cces_df), format = 'markdown') 137 | ``` 138 | 139 | ### Poststratification table {.unnumbered #poststratification-table} 140 | 141 | The poststratification table reflects the number of people in the population of interest that, according to a large survey, corresponds to each combination of the demographic-geographic factors. In the US context it is typical to use Decennial Census data or the American Community Survey, although we can of course use any other large-scale surveys that reflects the frequency of the different demographic types within any geographic area of interest. The poststratification table will be used in the second stage to poststratify the estimates obtained for each subgroup. For this, it is central that the factors (and their levels) used in the survey match the factors obtained in the census. Therefore, MRP is in principle limited to use individual-level variables that are present both the survey and the census. For instance, the CCES includes information on respondent's religion, but as this information is not available in the census we are not able to use this variable. Chapter 13 will cover different approaches to incorporate noncensus variables into the analysis. Similarly, the levels of the factors in the survey of interest are required to match the ones in the large survey used to build the poststratification table. For instance, the CCES included 'Middle Eastern' as an option for ethnicity, while the census data we used did not include it. Therefore, people who identified as 'Middle Eastern' in the CCES had to be included in the 'Other' category. 142 | 143 | In this case, we will base our poststratification table on the 2014-2018 American Community Survey (ACS), a set of yearly surveys conducted by the US Census that provides estimates of the number of US residents according to a series of variables that include our poststratification variables. As we defined the levels for these variables, the poststratification table must have $50 \times 6 \times 2 \times 4 \times 5 = 12,000$ rows. This means we actually have more rows in the poststratification table than observed units, which necessarily implies that there are some combinations in the poststratification table that we don't observe in the CCES sample. 144 | 145 | ```{r, results = 'asis', cache=FALSE, warning=FALSE, message=FALSE} 146 | # Load data frame created in the appendix. The data frame that contains the poststratification 147 | # table is called poststrat_df 148 | poststrat_df <- read_csv("data_public/chapter1/data/poststrat_df.csv") 149 | ``` 150 | 151 | ```{r, echo=FALSE} 152 | kable(head(poststrat_df), format = 'markdown') 153 | ``` 154 | 155 | For instance, the first row in the poststratification table indicates that there are 23,948 Alabamians that are white, male, between 18 and 29 years old, and without a high school degree. 156 | 157 | Every MRP study requires some degree of data wrangling in order to make the factors in the survey of interest match the factors available in the census. The code shown in the appendix can be used as a template to download the ACS data and make it match with a given survey of interest. 158 | 159 | ### Group-level predictors {.unnumbered #group-level-predictors} 160 | 161 | The individual-response model used in the first stage can include group-level predictors, which are particularly useful to reduce unexplained group-level variation by accounting for structured differences among the states. For instance, most national-level surveys in the US tend to include many participants from a state such as New York, but few from a small state like Vermont. This can result in noisy estimates for the effect of being from Vermont. The intuition is that by including state-level predictors, such as the Republican voteshare in a previous election or the percentage of Evangelicals at each state, the model is able to account for how similar Vermont is to New York and other more populous states, and therefore to produce more precise estimates. These group-level predictors do not need to be available in the census nor they have to be converted to factors, and in many cases are readily available. A more detailed discussion on the importance of builidng a reasonable model for predicting opinion, and how state-level predictors can be a key element in this regard, can be found in @lax2009states and @buttice2013mrp. 162 | 163 | In our example, we will include two state-level predictors: the geographical region (Northeast, North Central, South, and West) and the Republican vote share in the 2016 presidential election. 164 | 165 | ```{r, results = 'asis', cache=FALSE, warning=FALSE, message=FALSE} 166 | # Read statelevel_predictors.csv in a dataframe called statelevel_predictors 167 | statelevel_predictors_df <- read_csv('data_public/chapter1/data/statelevel_predictors.csv') 168 | ``` 169 | 170 | ```{r, echo=FALSE} 171 | kable(head(statelevel_predictors_df), format = 'markdown', digits = 2) 172 | ``` 173 | 174 | ### Exploratory data analysis {.unnumbered #EDA} 175 | 176 | In the previous steps we have obtained a 5,000-person sample from the CCES survey and also generated a poststratification table using census data. As a first exploratory step, we will check if the frequencies for the different levels of the factors considered in the CCES data are similar to the frequencies reported in the census. If this was not the case, we will start suspecting some degree of nonresponse bias in the CCES survey. 177 | 178 | For clarity, the levels in the plots follow their natural order in the case of age and education, ordering the others by the approximate proportion of Republican support. 179 | 180 | ```{r, fig.width=14, fig.height=3.5, echo=FALSE, fig.align = "center", warning=FALSE, cache=FALSE} 181 | # Age 182 | age_sample <- cces_df %>% mutate(age = factor(age, ordered = FALSE)) %>% group_by(age) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n)) 183 | age_post <- poststrat_df %>% mutate(age = factor(age, ordered = FALSE)) %>% group_by(age) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post)) 184 | age <- inner_join(age_sample, age_post, by = "age") %>% select(age, Sample, Population) 185 | age_plot <- ggplot() + 186 | ylab("") + xlab("Proportion") + theme_bw() + coord_flip() + 187 | geom_dumbbell(data = age, aes(y = age, x = Sample, xend = Population)) + 188 | geom_point(data = melt(age, id = "age"), aes(y = age, x = value, color = variable), size = 2) + 189 | scale_x_continuous(limits = c(0, 0.35), breaks = c(0, .1, .2, .3)) + theme(legend.position = "none") + ggtitle("Age") 190 | 191 | # Gender 192 | male_sample <- cces_df %>% group_by(male) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n)) 193 | male_post <- poststrat_df %>% group_by(male) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post)) 194 | male <- inner_join(male_sample, male_post, by = "male") %>% select(male, Sample, Population) %>% 195 | mutate(male = factor(male, levels = c(-0.5, 0.5), labels = c("Female", "Male"))) 196 | male_plot <- ggplot() + 197 | ylab("") + xlab("") + theme_bw() + coord_flip() + 198 | geom_dumbbell(data = male, aes(y = male, x = Sample, xend = Population)) + 199 | geom_point(data = melt(male, id = "male"), aes(y = male, x = value, color = variable), size = 2) + 200 | scale_x_continuous(limits = c(0, 0.6), breaks = c(0, .2, .4, .6)) + theme(legend.position = "none") + ggtitle("Gender") 201 | 202 | # Ethnicity 203 | ethnicity_sample <- cces_df %>% group_by(eth) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n)) 204 | ethnicity_post <- poststrat_df %>% group_by(eth) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post)) 205 | ethnicity <- inner_join(ethnicity_sample, ethnicity_post, by = "eth") %>% select(eth, Sample, Population) 206 | ethnicity$eth <- factor(ethnicity$eth, 207 | levels = c("Black", "Hispanic", "Other", "White"), 208 | labels = c("Black", "Hispanic", "Other", "White")) 209 | ethnicity_plot <- ggplot() + 210 | ylab("") + xlab("") + theme_bw() + coord_flip() + 211 | geom_dumbbell(data = ethnicity, aes(y = eth, x = Sample, xend = Population)) + 212 | geom_point(data = melt(ethnicity, id = "eth"), aes(y = eth, x = value, color = variable), size = 2) + 213 | scale_x_continuous(limits = c(0, 0.8), breaks = c(0, .2, .4, .6, 0.8)) + theme(legend.position = "none") + ggtitle("Ethnicity") 214 | 215 | # Education 216 | educ_sample <- cces_df %>% mutate(educ = factor(educ, ordered = FALSE)) %>% group_by(educ) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n)) 217 | educ_post <- poststrat_df %>% mutate(educ = factor(educ, ordered = FALSE)) %>% group_by(educ) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post)) 218 | educ <- inner_join(educ_sample, educ_post, by = "educ") %>% select(educ, Sample, Population) 219 | educ$educ <- factor(educ$educ, 220 | levels = c("No HS", "HS", "Some college", "4-Year College", "Post-grad"), 221 | labels = c("No HS", "HS", "Some\nCollege", "4-year\nCollege", "Post-grad")) 222 | educ_plot <- ggplot() + 223 | ylab("") + xlab("") + theme_bw() + coord_flip() + 224 | geom_dumbbell(data = educ, aes(y = educ, x = Sample, xend = Population)) + 225 | geom_point(data = melt(educ, id = "educ"), aes(y = educ, x = value, color = variable), size = 2) + 226 | scale_x_continuous(limits = c(0, 0.33), breaks = c(0, .1, .2, .3)) + theme(legend.position = "none") + ggtitle("Education") 227 | 228 | grid.arrange(age_plot, male_plot, ethnicity_plot, educ_plot, 229 | widths = c(1.5, 0.75, 1.5, 1.5)) 230 | ``` 231 | 232 | ```{r, fig.width=14, fig.height=3.5, echo=FALSE, fig.align = "center", warning=FALSE, cache=FALSE} 233 | state_sample <- cces_df %>% group_by(state) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n)) 234 | state_post <- poststrat_df %>% group_by(state) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post)) 235 | state <- left_join(state_sample, state_post, by = "state") %>% select(state, Sample, Population) %>% left_join(statelevel_predictors_df, by = "state") 236 | states_order <- state$repvote 237 | state$state <- fct_reorder(state$state, states_order) 238 | state <- state %>% select(state, Sample, Population) 239 | 240 | ggplot() + 241 | ylab("") + xlab("Proportion") + theme_bw() + coord_flip() + 242 | geom_dumbbell(data = state, aes(y = state, x = Sample, xend = Population)) + 243 | geom_point(data = melt(state, id = "state"), aes(y = state, x = value, color = variable), size = 2) + 244 | scale_x_continuous(limits = c(0, 0.13), breaks = c(0, .025, .05, .075, .1, .125)) + ggtitle("State") + 245 | theme(legend.position = "bottom", legend.title=element_blank()) 246 | ``` 247 | 248 | We see that our 5,000-participant CCES sample does not differ too much from the target population according to the American Community Survey. This should not be surprising, as the CCES intends to use a representative sample. 249 | 250 | In general, we recommend checking the differences between the sample and the target population. In this case, the comparison has been based on the factors that are going to be used in MRP. However, even if some non-response bias existed for any of these factors MRP would be able to adjust for it, as we will see more in detail in subsection 4. Therefore, it may be especially important to compare the sample and target population with respect to the variables that are *not* going to be used in MRP -- and, consequently, where we will not be able to correct any outcome measure bias due to differential non-response in these non-MRP variables. 251 | 252 | ## First stage: Estimating the Individual-Response Model {#first-stage} 253 | 254 | The first stage is to use a multilevel logistic regression model to predict the outcome measure based on a set of factors. Having a plausible model to predict opinion is central for MRP to work well. 255 | 256 | The model we use in this example is described below. It includes varying intercepts for age, ethnicity, education, and state, where the variation for the state intercepts is in turn influenced by the region effects (coded as indicator variables) and the Republican vote share in the 2016 election. As there are only two levels for gender, it is preferable to model it as a predictor for computational efficiency. Additionally, we include varying intercepts for the interaction between gender and ethnicity, education and age, and education and ethnicity (see @ghitza2013deep for an in-depth discussion on interactions in the context of MRP). 257 | 258 | $$ 259 | Pr(y_i = 1) = logit^{-1}( 260 | \alpha_{\rm s[i]}^{\rm state} 261 | + \alpha_{\rm a[i]}^{\rm age} 262 | + \alpha_{\rm r[i]}^{\rm eth} 263 | + \alpha_{\rm e[i]}^{\rm educ} 264 | + \beta^{\rm male} \cdot {\rm Male}_{\rm i} 265 | + \alpha_{\rm g[i], r[i]}^{\rm male.eth} 266 | + \alpha_{\rm e[i], a[i]}^{\rm educ.age} 267 | + \alpha_{\rm e[i], r[i]}^{\rm educ.eth} 268 | ) 269 | $$ 270 | where: 271 | 272 | $$ 273 | \begin{aligned} 274 | \alpha_{\rm s}^{\rm state} &\sim {\rm normal}(\gamma^0 + \gamma^{\rm south} \cdot {\rm South}_{\rm s} + \gamma^{\rm northcentral} \cdot {\rm NorthCentral}_{\rm s} + \gamma^{\rm west} \cdot {\rm West}_{\rm s} \\ & \quad + \gamma^{\rm repvote} \cdot {\rm RepVote}_{\rm s}, \sigma^{\rm state}) \textrm{ for s = 1,...,50}\\ 275 | \alpha_{\rm a}^{\rm age} & \sim {\rm normal}(0,\sigma^{\rm age}) \textrm{ for a = 1,...,6}\\ 276 | \alpha_{\rm r}^{\rm eth} & \sim {\rm normal}(0,\sigma^{\rm eth}) \textrm{ for r = 1,...,4}\\ 277 | \alpha_{\rm e}^{\rm educ} & \sim {\rm normal}(0,\sigma^{\rm educ}) \textrm{ for e = 1,...,5}\\ 278 | \alpha_{\rm g,r}^{\rm male.eth} & \sim {\rm normal}(0,\sigma^{\rm male.eth}) \textrm{ for g = 1,2 and r = 1,...,4}\\ 279 | \alpha_{\rm e,a}^{\rm educ.age} & \sim {\rm normal}(0,\sigma^{\rm educ.age}) \textrm{ for e = 1,...,5 and a = 1,...,6}\\ 280 | \alpha_{\rm e,r}^{\rm educ.eth} & \sim {\rm normal}(0,\sigma^{\rm educ.eth}) \textrm{ for e = 1,...,5 and r = 1,...,4}\\ 281 | \end{aligned} 282 | $$ 283 | 284 | Where: 285 | 286 | * $\alpha_{\rm a}^{\rm age}$: The effect of subject $i$'s age on the probability of supporting the statement. 287 | 288 | * $\alpha_{\rm r}^{\rm eth}$: The effect of subject $i$'s ethnicity on the probability of supporting the statement. 289 | 290 | * $\alpha_{\rm e}^{\rm educ}$: The effect of subject $i$'s education on the probability of supporting the statement. 291 | 292 | * $\alpha_{\rm s}^{\rm state}$: The effect of subject $i$'s state on the probability of supporting the statement. As we have a state-level predictor (the Republican vote share in the 2016 election), we need to build another model in which $\alpha_{\rm s}^{\rm state}$ is the outcome of a linear regression with an expected value determined by an intercept $\gamma^0$, the effect of the region coded as indicator variables (with Northeast as the baseline level), and the effect of the Republican vote share $\gamma^{\rm demvote}$. 293 | 294 | * $\beta^{\rm male}$: The average effect of being male on the probability of supporting abortion. We could have used a similar formulation as in the previous cases (i.e. $\alpha_{\rm g}^{\rm gender} \sim N(0, \sigma^{\rm gender})$), but having only two levels (i.e. male and female) can create some estimation problems. 295 | 296 | * $\alpha_{\rm e,r}^{\rm male.eth}$ and $\alpha_{\rm e,r}^{\rm educ.age}$: In the survey literature it is common practice to include these two interactions. 297 | 298 | * $\alpha_{\rm e,r}^{\rm educ.eth}$: In the next section we will explore public opinion on required abortion coverage at the different levels of education and ethnicity. It is, therefore, a good idea to also include this interaction. 299 | 300 | Readers without a background in multilevel modeling may be surprised to see this formulation. Why are we using terms such as $\alpha_{\rm eth}^{\rm eth}$ instead of the much more common method of creating an indicator variable for each state (e.g. $\beta^{\rm white} \cdot {\rm White}_{i} + \beta^{\rm black} \cdot {\rm Black}_{i} + ...$)? The answer is that this approach allows to share information between the levels of each variable (e.g. different ethnicities), preventing levels with less data from being too sensitive to the few observed values. For instance, it could happen that we only surveyed ten Hispanics, and that none of them turned out to agree that employers should be able to decline abortion coverage in insurance plans. Under the typical approach, the model would take this data too seriously and consider that Hispanics necessarily oppose this statement (i.e. $\beta^{\rm hispanic} = - \infty$). We know, however, that this is not the case. It may be that Hispanics are less likely to support the statement, but from such a small sample size it is impossible to know. What the multilevel model will do is to partially pool the varying intercept for Hispanics towards the average accross all ethnicities (i.e. in our model, the average across all ethnicities is fixed at zero), making it negative but far from the unrealistic negative infinity. This pooling will be data-dependent, meaning that it will pool the varying intercept towards the average more strongly the smaller the sample size in that level. In fact, if the sample size for a certain level is zero, the estimate varying intercept would be the average coefficient for all the other levels. We recommend @gelman2006data for an introduction to multilevel modeling. 301 | 302 | The `rstanarm` package allows the user to conduct complicated regression analyses in Stan with the simplicity of standard formula notation in R. `stan_glmer()`, the function that allows to fit generalized linear multilevel models, uses the same notation as the `lme4` package (see documentation [here](https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf)). That is, we specify the varying intercepts as `(1 | group)` and the interactions are expressed as `(1 | group1:group2)`, where the `:` operator creates a new grouping factor that consists of the combined levels of the two groups (i.e. this is the same as pasting together the levels of both factors). However, this syntax only accepts predictors at the individual level, and thus the two state-level predictors must be expanded to the individual level (see [p. 265-266]@gelman2006data). Notice that: 303 | 304 | $$ 305 | \begin{aligned} 306 | \alpha_{\rm s}^{\rm state} &\sim {\rm normal}(\gamma^0 + \gamma^{\rm south} \cdot {\rm South}_{\rm s} + \gamma^{\rm northcentral} \cdot {\rm NorthCentral}_{\rm s} + \gamma^{\rm west} \cdot {\rm West}_{\rm s} + \gamma^{\rm repvote} \cdot {\rm RepVote}_{\rm s}, \sigma^{\rm state}) \\ 307 | &= \underbrace{\gamma^0}_\text{Intercept} + 308 | \underbrace{{\rm normal}(0, \sigma^{\rm state})}_\text{State varying intercept} + 309 | \underbrace{\gamma^{\rm south} \cdot {\rm South}_{\rm s} + \gamma^{\rm northcentral} \cdot {\rm NorthCentral}_{\rm s} + \gamma^{\rm west} \cdot {\rm West}_{\rm s} + \gamma^{\rm repvote} \cdot {\rm RepVote}_{\rm s}}_\text{State-level predictors expanded to the individual level} 310 | \end{aligned} 311 | $$ 312 | 313 | Consequently, we can then reexpress the model as: 314 | 315 | $$ 316 | \begin{aligned} 317 | Pr(y_i = 1) =& logit^{-1}( 318 | \gamma^0 319 | + \alpha_{\rm s[i]}^{\rm state} 320 | + \alpha_{\rm a[i]}^{\rm age} 321 | + \alpha_{\rm r[i]}^{\rm eth} 322 | + \alpha_{\rm e[i]}^{\rm educ} 323 | + \beta^{\rm male} \cdot {\rm Male}_{\rm i} 324 | + \alpha_{\rm g[i], r[i]}^{\rm male.eth} 325 | + \alpha_{\rm e[i], a[i]}^{\rm educ.age} 326 | + \alpha_{\rm e[i], r[i]}^{\rm educ.eth} 327 | + \gamma^{\rm south} \cdot {\rm South}_{\rm s} \\ 328 | &+ \gamma^{\rm northcentral} \cdot {\rm NorthCentral}_{\rm s} 329 | + \gamma^{\rm west} \cdot {\rm West}_{\rm s} 330 | + \gamma^{\rm repvote} \cdot {\rm RepVote}_{\rm s}) 331 | \end{aligned} 332 | $$ 333 | 334 | In the previous version of the model, $\alpha_{\rm s[i]}^{\rm state}$ was informed by several state-level predictors. This reparametrization expands the state-level predictors at the individual level, and thus $\alpha_{\rm s[i]}^{\rm state}$ now represents the variance introduced by the state adjusting for the region and 2016 Republican vote share. Similarly, $\gamma^0$, which previously represented the state-level intercept, now becomes the individual-level intercept. The two parameterizations of the multilevel model are mathematically equivalent, and using one or the other is simply a matter of preference. The former one highlights the role that state-level predictos have in accounting for structured differences among the states, while the later is closer to the `rstanarm` syntax. 335 | 336 | ```{r, cache=FALSE, warning=FALSE, message=FALSE, eval=TRUE} 337 | # Expand state-level predictors to the individual level 338 | cces_df <- left_join(cces_df, statelevel_predictors_df, by = "state") 339 | ``` 340 | 341 | ```{r, cache=FALSE, warning=FALSE, message=FALSE, eval=FALSE} 342 | # Fit in stan_glmer 343 | fit <- stan_glmer(abortion ~ (1 | state) + (1 | eth) + (1 | educ) + male + 344 | (1 | male:eth) + (1 | educ:age) + (1 | educ:eth) + 345 | repvote + factor(region), 346 | family = binomial(link = "logit"), 347 | data = cces_df, 348 | prior = normal(0, 1, autoscale = TRUE), 349 | prior_covariance = decov(scale = 0.50), 350 | adapt_delta = 0.99, 351 | refresh = 0, 352 | seed = 1010) 353 | ``` 354 | 355 | ```{r, echo = FALSE, eval=TRUE} 356 | # we save the model for future use. By default we do not retrain the model and 357 | # save it, only retrieving the previously file version. To train the model again, 358 | # simply change eval=TRUE in the previous cell and eval=FALSE in this one. 359 | #saveRDS(fit, file = "data_public/chapter1/models/fit_mrp_1.rds") 360 | fit <- readRDS("data_public/chapter1/models/fit_mrp_1.rds") 361 | ``` 362 | 363 | As a first pass to check whether the model is performing well, we must check that there are no warnings about divergences, failure to converge or tree depth. Fitting the model with the default settings produced a few divergent transitions, and thus we decided to try increasing `adapt_delta` to 0.99 and introducing stronger priors than the `rstanarm` defaults. After doing this, the divergences dissapeared. In the [Computational Issues]() subsection we provide more details about divergent transitions and potential solutions. 364 | 365 | ```{r} 366 | print(fit) 367 | ``` 368 | 369 | We can interpret the resulting model as follows: 370 | 371 | * `Intercept` ($\gamma^0$): The global intercept corresponds to the expected outcome in the logit scale when having all the predictors equal to zero. In this case, this does not have a clear interpretation, as it is then influenced by the varying intercepts for state, age, ethnicity, education, and interactions. Furthermore, it corresponds to the impractical scenario of someone in a state with zero Republican vote share. 372 | 373 | * `male` ($\beta^{\rm male}$): The median estimate for this coefficient is `r round(fit$coefficients["male"], 1)`, with a standard error (measured using the Mean Absolute Deviation) of `r round(fit$ses["male"], 1)`. Using the divide-by-four rule (@gelman2020raos, Chapter 13), we see that, adjusting for the other covariates, males present up to a `r round(fit$coefficients["male"], 1)/4*100`% $\pm$ `r round(fit$ses["male"], 1)/4*100`% higher probability of supporting the right of employers to decline coverage of abortions relative to females. 374 | 375 | * `repvote` ($\gamma^{\rm repvote}$): As the scale of `repvote` was between 0 and 1, this coefficient corresponds to the difference in probability of supporting the statement between someone that was in a state in which no one voted Republican to someone whose state voted all Republican. This is not reasonable, and therefore we start by dividing the median coefficient by 10. Doing this, we consider a difference of a 10% increase in Republican vote share. This means that we expect that someone from a state with a 55% Republican vote share has approximately $\frac{`r round(fit$coefficients["repvote"], 1)`}{10}/4 = 4\%$ ($\pm `r round(fit$ses["repvote"], 1)/4*100`\%$) higher probability of supporting the statement relative to another individual with similar characteristics from a state in which Republicans received 45% of the vote. 376 | 377 | * `regionSouth` ($\gamma^{\rm south}$): According to the model, we expect that someone from a state in the south has, adjusting for the other covariates, up to a `r round(fit$coefficients["factor(region)South"], 1)`/4 = `r round(round(fit$coefficients["factor(region)South"], 1)/4*100, 0)`% ($\pm$ `r round(round(fit$ses["factor(region)South"], 1)/4*100, 0)`%) higher probability of supporting the statement relative to someone from the Northeast, which was the baseline category. The interpretation for `regionNorthCentral` and `regionWest` is similar. 378 | 379 | * `Error terms` ($\sigma^{\rm state}$, $\sigma^{\rm age}$, $\sigma^{\rm eth}$, $\sigma^{\rm educ}$, $\sigma^{\rm male.eth}$, $\sigma^{\rm educ.age}$, $\sigma^{\rm educ.eth}$): Remember that the intercepts for the different levels of state, age, ethnicity, education, and the specified interactions are distributed following a normal distribution centered at zero and with a standard deviation that is estimated from the data. The `Error terms` section gives us the estimates for these group-level standard deviations. For instance, $\alpha_{\rm r}^{\rm ethnicity} \sim {\rm normal}(0, \sigma^{\rm ethnicity})$, where the median estimate for $\sigma^{\rm ethnicity}$ is `r round(sqrt(fit$stan_summary["Sigma[eth:(Intercept),(Intercept)]","mean"]), 3)`. In other words, the variyng intercepts for the different ethnicity groups have a standard deviation that is estimated to be `r round(sqrt(fit$stan_summary["Sigma[eth:(Intercept),(Intercept)]","mean"]), 3)` on the logit scale, or `r round(sqrt(fit$stan_summary["Sigma[eth:(Intercept),(Intercept)]","mean"]), 3)`/4 = `r round(round(sqrt(fit$stan_summary["Sigma[eth:(Intercept),(Intercept)]","mean"]), 3)/4, 3)` on the probability scale. In some cases, we may also want to check the intercepts corresponding to the individual levels of a factor. In `rstanarm`, this can be done using `fit$coefficients`. For instance, the median values for the varying intercepts of race are $\alpha^{\rm eth}_{r = {\rm White}}$ = `r round(fit$coefficients["b[(Intercept) eth:White]"], 2)`, $\alpha^{\rm eth}_{r = {\rm Black}}$ = `r round(fit$coefficients["b[(Intercept) eth:Black]"], 2)`, $\alpha^{\rm eth}_{r = {\rm Hispanic}}$ = `r round(fit$coefficients["b[(Intercept) eth:Hispanic]"], 2)`, $\alpha^{\rm eth}_{r = {\rm Other}}$ = `r round(fit$coefficients["b[(Intercept) eth:Other]"], 2)`. 380 | 381 | ## Second Stage: Poststratification {#second-stage} 382 | 383 | ### Estimation at the national level {.unnumbered #estimational-national-level} 384 | 385 | Currently, all we have achieved is a model that, considering certain factor-type predictors, predicts support for providing employers with the option to decline abortion coverage. To go from this model to a national or subnational estimate, we need to weight the model predictions for the different subgroups by the actual frequency of these subgroups. This idea can be expressed as: 386 | 387 | $$ 388 | \theta^{MRP} = \frac{\sum N_j \theta_j}{\sum N_j} 389 | $$ 390 | 391 | where $\theta^{MRP}$ is the MRP estimate, $\theta_{\rm j}$ corresponds to the model estimate for a specific subgroup defined in a cell of the poststratification table (e.g. young Hispanic men with a High School diploma in Arkansas), and $N_{\rm j}$ corresponds to the number of people in that subgroup. For a more in-depth review of poststratification, see Chapter 13 of @gelman2020raos. 392 | 393 | The values of $\theta_{j}$ for the different subgroups can be obtained with the `posterior_epred()` function. Of course, as `stan_glmer()` performs Bayesian inference, $\theta_{j}$ for any given subgroup will not be a single point estimate but a vector of posterior draws. 394 | 395 | ```{r, message=FALSE, cache=FALSE, echo=TRUE} 396 | # Expand state level predictors to the individual level 397 | poststrat_df <- left_join(poststrat_df, statelevel_predictors_df, by = "state") 398 | 399 | # Posterior_epred returns the posterior estimates for the different subgroups stored in the 400 | # poststrat_df dataframe. 401 | epred_mat <- posterior_epred(fit, newdata = poststrat_df, draws = 1000) 402 | mrp_estimates_vector <- epred_mat %*% poststrat_df$n / sum(poststrat_df$n) 403 | mrp_estimate <- c(mean = mean(mrp_estimates_vector), sd = sd(mrp_estimates_vector)) 404 | cat("MRP estimate mean, sd: ", round(mrp_estimate, 3)) 405 | ``` 406 | 407 | `posterior_epred()` returns a matrix $P$ with $D$ rows and $J$ columns, where $D$ corresponds to the number of draws from the posterior distribution (in this case 1000, as we specified `draws = 1000`) and $J$ is the number of subgroups in the poststratification table (i.e. 12,000). This matrix, which was called `epred_mat` in our code, is multiplied by a vector $k$ of length $J$ that contains the number of people in each subgroup of the poststratification table. This results in a vector of length $D$ that is then divided by the total number of people considered in the poststratification table, a scalar which is calculated by adding all the values in $k$. 408 | 409 | $$\theta^{MRP} = \frac{P \times k}{\sum_j^J k_j}$$ 410 | 411 | The end result is a vector that we call $\theta^{MRP}$, and which contains $D$ estimates for the national-level statement support. 412 | 413 | We can compare these results to the 5,000-person unadjusted sample estimate: 414 | 415 | ```{r, message=FALSE, cache=FALSE, echo=FALSE} 416 | # Remember that the standard error of a proportion is sqrt(p(1-p)/n). We define 417 | # a function called get_se_bernoulli to obtain the SE based on p and n 418 | get_se_bernoulli <- function(p, n){ 419 | return(sqrt(p*(1-p)/n)) 420 | } 421 | 422 | sample_cces_estimate <- c(mean = mean(cces_df$abortion), 423 | se = get_se_bernoulli(mean(cces_df$abortion), nrow(cces_df))) 424 | cat("Unadjusted 5000-respondent survey mean, sd: ", (round(sample_cces_estimate, 3))) 425 | ``` 426 | 427 | Additionally, we compare with the population support estimated by the full CCES with close to 60,000 participants: 428 | 429 | ```{r, message=FALSE, cache=FALSE, echo=FALSE} 430 | full_cces_estimate <- c(mean = mean(cces_all_df$abortion), 431 | se = get_se_bernoulli(mean(cces_all_df$abortion), nrow(cces_all_df))) 432 | cat("Unadjusted 60,000-respondent survey mean, sd: ", (round(full_cces_estimate, 3))) 433 | ``` 434 | 435 | In general, we see that both the unadjusted sample estimate and the MRP estimate are quite close to the results of the full survey. In other words, MRP is not providing a notable advantage against the unadjusted sample national estimates. However, it is important to clarify that we were somewhat lucky in obtaining this result as a product of using data from the CCES, a high quality survey that intends to be representative (and appears to be, at least with respect to the variables considered in our poststratification table). Many real-world surveys are not as representative relative to the variables considered in the poststratification step, and in these cases MRP will help correcting the biased estimates from the unadjusted survey. We will see an example of this in section 4, where we exemplify how MRP adjusts a clearly biased sample. 436 | 437 | ### Estimation for subnational units {.unnumbered #estimational-subnational-level} 438 | 439 | As we mentioned, small area estimation is one of the main applications of MRP. In this case, we will get an estimate of the support for employer's right to decline coverage of abortions per state: 440 | 441 | $$ 442 | \theta_s^{MRP} = \frac{\sum_{j \in s} N_j \theta_j}{\sum_{j \in s} N_j} 443 | $$ 444 | 445 | ```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center"} 446 | # Create empty dataframe 447 | states_df <- data.frame( 448 | state = state_abb, 449 | mrp_estimate = NA, 450 | mrp_estimate_se = NA, 451 | sample_cces_estimate = NA, 452 | sample_cces_estimate_se = NA, 453 | full_cces_estimate = NA, 454 | full_cces_estimate_se = NA, 455 | n_sample = NA, 456 | n_full = NA 457 | ) 458 | 459 | # Loop to populate the dataframe 460 | for(i in 1:nrow(states_df)) { 461 | # Currently, the matrix epred_mat and the poststratification table contain 12,000 462 | # rows. We need to filter the ones that correspond to state in row i. We do so 463 | # by defining the following condition: 464 | filtering_condition <- which(poststrat_df$state == states_df$state[i]) 465 | 466 | # Filtering matrix epred_mat with filtering_condition 467 | state_epred_mat <- epred_mat[ ,filtering_condition] 468 | 469 | # Filtering poststratification table with filtering_condition 470 | k_filtered <- poststrat_df[filtering_condition, ]$n 471 | 472 | # Poststratification step 473 | mrp_estimates_vector_sub <- state_epred_mat %*% k_filtered / sum(k_filtered) 474 | 475 | # MRP estimate for state in row i 476 | states_df$mrp_estimate[i] <- mean(mrp_estimates_vector_sub) 477 | states_df$mrp_estimate_se[i] <- sd(mrp_estimates_vector_sub) 478 | 479 | # 5,000-sample survey unadjusted estimate for state in row i 480 | states_df$sample_cces_estimate[i] <- mean( 481 | filter(cces_df, state==states_df$state[i])$abortion) 482 | states_df$n_sample[i] <- nrow(filter(cces_df, state==states_df$state[i])) 483 | states_df$sample_cces_estimate_se[i] <- get_se_bernoulli( 484 | states_df$sample_cces_estimate[i], states_df$n_sample[i]) 485 | 486 | # 60,000-person survey unadjusted estimate for state in row i 487 | states_df$full_cces_estimate[i] <- mean( 488 | filter(cces_all_df, state==states_df$state[i])$abortion) 489 | states_df$n_full[i] <- nrow(filter(cces_all_df, 490 | state==states_df$state[i])) 491 | states_df$full_cces_estimate_se[i] <- get_se_bernoulli( 492 | states_df$full_cces_estimate[i], states_df$n_full[i]) 493 | } 494 | ``` 495 | 496 | We start visualizing the estimates by state from the unadjusted 5,000-person sample. Again, the states are ordered by Republican vote in the 2016 election, and therefore we expect that statement support to follow an increasing trend. 497 | 498 | ```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center"} 499 | # Order states by republican voteshare 500 | states_df$state <- fct_reorder(states_df$state, states_order) 501 | 502 | compare1 <- ggplot(data=states_df) + 503 | geom_point(aes(x=state, y=sample_cces_estimate), color = "#E37B1C") + 504 | geom_errorbar(aes(ymin=sample_cces_estimate - 2*sample_cces_estimate_se, 505 | ymax=sample_cces_estimate + 2*sample_cces_estimate_se, 506 | x=state), alpha=.5, width = 0, color = "#E37B1C") + 507 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 508 | labels=c("0%","25%","50%","75%","100%"), 509 | expand=c(0,0)) + 510 | coord_cartesian(ylim=c(0, 1)) + 511 | theme_bw() + 512 | labs(x="States",y="Support")+ 513 | theme(legend.position="none", 514 | axis.title=element_text(size=10), 515 | axis.text.y=element_text(size=10), 516 | axis.text.x=element_text(angle=90,size=8, vjust=0.3), 517 | legend.title=element_text(size=10), 518 | legend.text=element_text(size=10)) 519 | 520 | compare2 <- ggplot(data = data.frame())+ 521 | geom_point(aes(y=sample_cces_estimate[1], x = .25), color = "#E37B1C") + 522 | geom_errorbar(data=data.frame(), aes(y = sample_cces_estimate[1], 523 | x = .25, 524 | ymin = sample_cces_estimate[1] - 2*sample_cces_estimate[2], 525 | ymax = sample_cces_estimate[1] + 2*sample_cces_estimate[2]), 526 | width = 0, color = "#E37B1C") + 527 | geom_text(aes(x = Inf, y = sample_cces_estimate[1] + 0.06, label = "Unadjusted Sample"), 528 | hjust = -.05, size = 4, color = "#E37B1C") + 529 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 530 | labels=c("0%","25%","50%","75%","100%"), 531 | limits=c(0,1),expand=c(0,0))+ 532 | scale_x_continuous(limits=c(0,1),expand=c(0,0), breaks=c(.25, .75)) + 533 | coord_cartesian(clip = 'off') + 534 | theme_bw()+ 535 | labs(x="Population",y="")+ 536 | theme(legend.position="none", 537 | axis.title.y=element_blank(), 538 | axis.title.x=element_text(size=10, margin = margin(t = 19, r = 0, b = , l = 0)), 539 | axis.text=element_blank(), 540 | axis.ticks=element_blank(), 541 | legend.title=element_text(size=10), 542 | legend.text=element_text(size=10), 543 | plot.margin = margin(5.5, 105, 5.5, 5.5, "pt") 544 | ) 545 | 546 | bayesplot_grid(compare1,compare2, 547 | grid_args = list(nrow=1, widths = c(5,1.4))) 548 | ``` 549 | 550 | In states with small samples, we see considerably wide 95\% confidence intervals. We can add the MRP-adjusted estimates to this plot. 551 | 552 | ```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center"} 553 | compare1 <- ggplot(data=states_df) + 554 | geom_point(aes(x=state, y=sample_cces_estimate), color = "#E37B1C") + 555 | geom_errorbar(aes(ymin=sample_cces_estimate - 2*sample_cces_estimate_se, 556 | ymax=sample_cces_estimate + 2*sample_cces_estimate_se, 557 | x=state), alpha=.5, width = 0, color = "#E37B1C") + 558 | geom_point(data=states_df, aes(x=state, y=mrp_estimate), color = "#7B1CE3") + 559 | geom_errorbar(data=states_df, aes(ymin=mrp_estimate - 2*mrp_estimate_se, 560 | ymax=mrp_estimate + 2*mrp_estimate_se, 561 | x=state), alpha=.5, width = 0, color = "#7B1CE3") + 562 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 563 | labels=c("0%","25%","50%","75%","100%"), 564 | expand=c(0,0))+ 565 | coord_cartesian(ylim=c(0, 1)) + 566 | theme_bw()+ 567 | labs(x="States",y="Support")+ 568 | theme(legend.position="none", 569 | axis.title=element_text(size=10), 570 | axis.text.y=element_text(size=10), 571 | axis.text.x=element_text(angle=90,size=8, vjust=0.3), 572 | legend.title=element_text(size=10), 573 | legend.text=element_text(size=10)) 574 | 575 | compare2 <- ggplot(data = data.frame())+ 576 | geom_point(aes(y=sample_cces_estimate[1], x = .25), color = "#E37B1C") + 577 | geom_errorbar(data=data.frame(), aes(y = sample_cces_estimate[1], 578 | x = .25, 579 | ymin = sample_cces_estimate[1] - 2*sample_cces_estimate[2], 580 | ymax = sample_cces_estimate[1] + 2*sample_cces_estimate[2]), 581 | width = 0, color = "#E37B1C") + 582 | geom_text(aes(x = Inf, y = sample_cces_estimate[1]+0.06, label = "Unadjusted Sample"), 583 | hjust = -.05, size = 4, color = "#E37B1C") + 584 | geom_point(aes(y = mrp_estimate[1], x = .75), color = "#7B1CE3") + 585 | geom_errorbar(aes(y = mrp_estimate[1], 586 | x = .75, 587 | ymin = mrp_estimate[1] - 2*mrp_estimate[2], 588 | ymax = mrp_estimate[1] + 2*mrp_estimate[2]), 589 | width = 0, color = "#7B1CE3") + 590 | geom_text(data = data.frame(), aes(x = Inf, y = mrp_estimate[1]+.0, label = "Sample with MRP"), 591 | hjust = -.05, size = 4, color = "#7B1CE3") + 592 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 593 | labels=c("0%","25%","50%","75%","100%"), 594 | limits=c(0,1),expand=c(0,0))+ 595 | scale_x_continuous(limits=c(0,1),expand=c(0,0), breaks=c(.25, .75)) + 596 | coord_cartesian(clip = 'off') + 597 | theme_bw()+ 598 | labs(x="Population",y="")+ 599 | theme(legend.position="none", 600 | axis.title.y=element_blank(), 601 | axis.title.x=element_text(size=10, margin = margin(t = 19, r = 0, b = , l = 0)), 602 | axis.text=element_blank(), 603 | axis.ticks=element_blank(), 604 | legend.title=element_text(size=10), 605 | legend.text=element_text(size=10), 606 | plot.margin = margin(5.5, 105, 5.5, 5.5, "pt") 607 | ) 608 | 609 | bayesplot_grid(compare1,compare2, 610 | grid_args = list(nrow=1, widths = c(5,1.4))) 611 | ``` 612 | 613 | In general, MRP produces less extreme values by partially pooling information across the factor levels. To illustrate this, we can compare the sample and MRP estimates with the results form the full 60,000-respondent CCES. Of course, in any applied situation we would be using the data from all the participants, but as we took a 5,000 person sample the full 60,000-respondent survey serves as a reference point. 614 | 615 | ```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center", results = 'hide'} 616 | compare1 <- ggplot(data=states_df) + 617 | geom_point(aes(x=state, y=sample_cces_estimate), color = "#E37B1C") + 618 | geom_errorbar(aes(ymin=sample_cces_estimate - 2*sample_cces_estimate_se, 619 | ymax=sample_cces_estimate + 2*sample_cces_estimate_se, 620 | x=state), alpha=.5, width = 0, color = "#E37B1C") + 621 | geom_point(data=states_df, aes(x=state, y=mrp_estimate), color = "#7B1CE3") + 622 | geom_errorbar(data=states_df, aes(ymin=mrp_estimate - 2*mrp_estimate_se, 623 | ymax=mrp_estimate + 2*mrp_estimate_se, 624 | x=state), alpha=.5, width = 0, color = "#7B1CE3") + 625 | geom_point(aes(x=state, y=full_cces_estimate), color = "#1CE37B") + 626 | geom_errorbar(data=states_df, aes(ymin=full_cces_estimate - 2*full_cces_estimate_se, 627 | ymax=full_cces_estimate + 2*full_cces_estimate_se, 628 | x=state), alpha=.5, width = 0, color = "#1CE37B") + 629 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 630 | labels=c("0%","25%","50%","75%","100%"), 631 | expand=c(0,0))+ 632 | coord_cartesian(ylim=c(0, 1)) + 633 | theme_bw()+ 634 | labs(x="States",y="Support")+ 635 | theme(legend.position="none", 636 | axis.title=element_text(size=10), 637 | axis.text.y=element_text(size=10), 638 | axis.text.x=element_text(angle=90,size=8, vjust=0.3), 639 | legend.title=element_text(size=10), 640 | legend.text=element_text(size=10)) 641 | 642 | compare2 <- ggplot(data = data.frame()) + 643 | geom_point(aes(y=sample_cces_estimate[1], x = .25), color = "#E37B1C") + 644 | geom_errorbar(data=data.frame(), aes(y = sample_cces_estimate[1], 645 | x = .25, 646 | ymin = sample_cces_estimate[1] - 2*sample_cces_estimate[2], 647 | ymax = sample_cces_estimate[1] + 2*sample_cces_estimate[2]), 648 | width = 0, color = "#E37B1C") + 649 | geom_text(aes(x = Inf, y = sample_cces_estimate[1]+0.06, label = "Unadjusted Sample"), 650 | hjust = -.05, size = 4, color = "#E37B1C") + 651 | geom_point(aes(y = mrp_estimate[1], x = .75), color = "#7B1CE3") + 652 | geom_errorbar(aes(y = mrp_estimate[1], 653 | x = .75, 654 | ymin = mrp_estimate[1] - 2*mrp_estimate[2], 655 | ymax = mrp_estimate[1] + 2*mrp_estimate[2]), 656 | width = 0, color = "#7B1CE3") + 657 | geom_text(data = data.frame(), aes(x = Inf, y = mrp_estimate[1]+.0, label = "Sample with MRP"), 658 | hjust = -.05, size = 4, color = "#7B1CE3") + 659 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 660 | labels=c("0%","25%","50%","75%","100%"), 661 | limits=c(0,1),expand=c(0,0)) + 662 | geom_point(data = data.frame(), aes(y=full_cces_estimate[1], x = .5), color = "#1CE37B") + 663 | geom_errorbar(data = data.frame(), aes(y = full_cces_estimate[1], 664 | x = .5, 665 | ymin = full_cces_estimate[1] - 2*full_cces_estimate[2], 666 | ymax = full_cces_estimate[1] + 2*full_cces_estimate[2]), 667 | width = 0, color = "#1CE37B") + 668 | geom_text(data = data.frame(), aes(x = Inf, y = full_cces_estimate-0.06, label = "Complete Survey"), 669 | hjust = -.06, size = 4, color = "#1CE37B") + 670 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 671 | labels=c("0%","25%","50%","75%","100%"), 672 | limits=c(0,1),expand=c(0,0))+ 673 | scale_x_continuous(limits=c(0,1),expand=c(0,0), breaks=c(.25, .75)) + 674 | coord_cartesian(clip = 'off') + 675 | theme_bw() + 676 | labs(x="Population",y="")+ 677 | theme(legend.position="none", 678 | axis.title.y=element_blank(), 679 | axis.title.x=element_text(size=10, margin = margin(t = 19, r = 0, b = , l = 0)), 680 | axis.text=element_blank(), 681 | axis.ticks=element_blank(), 682 | legend.title=element_text(size=10), 683 | legend.text=element_text(size=10), 684 | plot.margin = margin(5.5, 105, 5.5, 5.5, "pt") 685 | ) 686 | 687 | bayesplot_grid(compare1,compare2, 688 | grid_args = list(nrow=1, widths = c(5,1.4))) 689 | ``` 690 | 691 | Overall, the MRP estimates are closer to the full survey estimates. This is particularly clear for the states with a smaller sample size. 692 | 693 | As a final way of presenting the MRP estimates, we can plot them on a US map. The symmetric color range goes from 10% to 90% support, as this scale allows for comparison with the other maps. However, the MRP estimates for statement support are concentrated in a relatively small range, which makes the colors appear muted. 694 | 695 | ```{r, fig.height=4, fig.width=6, echo=FALSE, warning=FALSE, message=FALSE, fig.align = "center"} 696 | library(usmap) 697 | 698 | # Load map and merge data 699 | states_map <- us_map(regions = "states") 700 | states_df_melted <- states_df %>% select(state, mrp_estimate) 701 | states_map <- left_join(states_map, states_df_melted, by = c("abbr" = "state")) %>% drop_na() 702 | 703 | # Plot 704 | ggplot(states_map, aes(x = x, y = y, group = group)) + 705 | geom_polygon(colour = "lightgray") + 706 | geom_polygon(aes(fill = mrp_estimate)) + theme_void() + 707 | scale_fill_gradient2(midpoint = 0.5, limits = c(0.1, .9), breaks = c(.1, .5, .9), 708 | name = "Support", low = muted("blue"), high = muted("red")) + 709 | theme(legend.margin=margin(l = 0.5, unit='cm')) 710 | 711 | ``` 712 | 713 | ### Estimation for subgroups within subnational units {.unnumbered #estimational-subgroup-level} 714 | 715 | MRP can also be used to obtain estimates for more complex cases, such as subgroups within states. For instance, we can study support for declining coverage of abortions by state and ethnicity within state. For clarity, we order the races according to their support for the statement. 716 | 717 | ```{r, message=FALSE, warning=FALSE, echo=FALSE, cache=FALSE} 718 | # Create dataframe for all combinations of state and ethnicity 719 | subgroups_df <- cces_df %>% expand(state, eth) %>% 720 | mutate(mrp_subgroup_estimate = NA, 721 | mrp_subgroup_estimate_se = NA) 722 | 723 | # Loop to populate the dataframe 724 | for(i in 1:nrow(subgroups_df)) { 725 | 726 | # Filtering and poststratification 727 | filtering_condition <- which(poststrat_df$state == subgroups_df$state[i] & 728 | poststrat_df$eth == subgroups_df$eth[i]) 729 | epred_mat_filtered <- epred_mat[, filtering_condition] 730 | k_filtered <- poststrat_df[filtering_condition, ]$n 731 | mrp_subgroup_estimates_vector <- epred_mat_filtered %*% k_filtered / sum(k_filtered) 732 | 733 | # Estimates for MRP 734 | subgroups_df$mrp_subgroup_estimate[i] <- mean(mrp_subgroup_estimates_vector) 735 | subgroups_df$mrp_subgroup_estimate_se[i] <- sd(mrp_subgroup_estimates_vector) 736 | 737 | } 738 | ``` 739 | 740 | ```{r, message=FALSE, warning=FALSE, echo=FALSE, cache=FALSE, fig.height=1.9, fig.width=9, fig.align = "center"} 741 | # Load map and merge data 742 | states_map <- us_map(regions = "states") 743 | subgroups_df_melted <- subgroups_df %>% select(state, mrp_subgroup_estimate, eth) 744 | states_map <- left_join(states_map, subgroups_df_melted, by = c("abbr" = "state")) %>% drop_na() 745 | 746 | # Declare order for ethnicity 747 | states_map$eth <- factor(states_map$eth, 748 | levels = c("Black", "Hispanic", "Other", "White"), 749 | labels = c("Black", "Hispanic", "Other", "White")) 750 | # Plot 751 | ggplot(states_map, aes(x = x, y = y, group = group)) + 752 | geom_polygon(colour = "lightgray") + 753 | geom_polygon(aes(fill = mrp_subgroup_estimate)) + theme_void() + facet_grid(cols = vars(eth)) + 754 | scale_fill_gradient2(midpoint = 0.5, limits = c(0.1, .9), breaks = c(.1, .5, .9), 755 | name = "Support", low = muted("blue"), high = muted("red")) + 756 | theme(legend.margin=margin(l = 0.5, unit='cm'), legend.position = "none") 757 | ``` 758 | 759 | Similarly, we can look at the outcome in ethnicity-education subgroups by state. 760 | 761 | ```{r, message=FALSE, warning=FALSE, echo=FALSE} 762 | # Make the educ column in the poststrat_df data frame an ordered factor. This allows a comparison 763 | # with the educ column in the subgroups_df data frame, which is already an ordered factor. We also 764 | # relabel the levels for prettier plotting 765 | poststrat_df$educ <- factor(poststrat_df$educ, 766 | levels = c("No HS", "HS", "Some college", "4-Year College", "Post-grad"), 767 | labels = c("No HS", "HS", "Some\ncollege", "4-Year\ncollege", "Post-grad"), 768 | ordered = TRUE) 769 | 770 | # Create dataframe for all combinations of state and ethnicity 771 | subgroups_df <- poststrat_df %>% expand(state, eth, educ) %>% 772 | mutate(mrp_subgroup_estimate = NA, 773 | mrp_subgroup_estimate_se = NA) 774 | 775 | # Loop to populate the dataframe 776 | for(i in 1:nrow(subgroups_df)) { 777 | 778 | # Filtering and poststratification 779 | filtering_condition <- which(poststrat_df$state == subgroups_df$state[i] & 780 | poststrat_df$eth == subgroups_df$eth[i] & 781 | poststrat_df$educ == subgroups_df$educ[i]) 782 | epred_mat_filtered <- epred_mat[, filtering_condition] 783 | k_filtered <- poststrat_df[filtering_condition, ]$n 784 | mrp_subgroup_estimates_vector <- epred_mat_filtered %*% k_filtered / sum(k_filtered) 785 | 786 | # Estimates for MRP 787 | subgroups_df$mrp_subgroup_estimate[i] <- mean(mrp_subgroup_estimates_vector) 788 | subgroups_df$mrp_subgroup_estimate_se[i] <- sd(mrp_subgroup_estimates_vector) 789 | 790 | } 791 | ``` 792 | 793 | ```{r, message=FALSE, warning=FALSE, echo=FALSE, cache=FALSE, fig.height=11, fig.width=12, fig.align = "center"} 794 | # Load map and merge data 795 | states_map <- us_map(regions = "states") 796 | subgroups_df_melted <- subgroups_df %>% select(state, mrp_subgroup_estimate, eth, educ) 797 | states_map <- left_join(states_map, subgroups_df_melted, by = c("abbr" = "state")) %>% drop_na() 798 | # Declare order for ethnicity 799 | states_map$eth <- factor(states_map$eth, 800 | levels = c("Black", "Hispanic", "Other", "White"), 801 | labels = c("Black", "Hispanic", "Other", "White")) 802 | 803 | ggplot(states_map, aes(x = x, y = y, group = group)) + 804 | geom_polygon(colour = "lightgray") + 805 | geom_polygon(aes(fill = mrp_subgroup_estimate)) + theme_void() + facet_grid(vars(educ), vars(eth)) + 806 | scale_fill_gradient2(midpoint = 0.5, limits = c(0.1, .9), breaks = c(.1, .5, .9), 807 | name = "Support", low = muted("blue"), high = muted("red")) + 808 | theme(legend.margin=margin(l = 0.5, unit='cm'), legend.position = "none") 809 | 810 | ``` 811 | 812 | ## Adjusting for Nonrepresentative Surveys {#nonrepresentative-survey} 813 | 814 | We have already introduced that MRP is an effective statistical adjustment method to correct for differences between the sample and target population for a set of key variables. We start this second example by obtaining an artificially nonrepresentative sample that gives more weight to respondents that are older, male, and from Republican states. 815 | 816 | ```{r, warning=FALSE} 817 | set.seed(1010) 818 | 819 | # We add the state-level predictors to cces_all_df 820 | cces_all_df <- left_join(cces_all_df, statelevel_predictors_df, by = "state") 821 | 822 | # We take a sample from cces_all_df giving extra weight to respondents that are older, male, 823 | # and from Republican states. 824 | cces_biased_df <- cces_all_df %>% 825 | sample_n(5000, weight = I(5*repvote + (age=="18-29")*0.5 + (age=="30-39")*1 + 826 | (age=="40-49")*2 + (age=="50-59")*4 + 827 | (age=="60-69")*6 + (age=="70+")*8 + (male==1)*20 + 828 | (eth=="White")*1.05)) 829 | ``` 830 | 831 | ```{r, fig.width=14, fig.height=3.5, echo=FALSE, fig.align = "center", warning=FALSE, cache=FALSE} 832 | # Age 833 | age_sample <- cces_biased_df %>% mutate(age = factor(age, ordered = FALSE)) %>% group_by(age) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n)) 834 | age_post <- poststrat_df %>% mutate(age = factor(age, ordered = FALSE)) %>% group_by(age) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post)) 835 | age <- inner_join(age_sample, age_post, by = "age") %>% select(age, Sample, Population) 836 | age_plot <- ggplot() + 837 | ylab("") + xlab("Proportion") + theme_bw() + coord_flip() + 838 | geom_dumbbell(data = age, aes(y = age, x = Sample, xend = Population)) + 839 | geom_point(data = melt(age, id = "age"), aes(y = age, x = value, color = variable), size = 2) + 840 | scale_x_continuous(limits = c(0, 0.35), breaks = c(0, .1, .2, .3)) + theme(legend.position = "none") + ggtitle("Age") 841 | 842 | # Gender 843 | male_sample <- cces_biased_df %>% group_by(male) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n)) 844 | male_post <- poststrat_df %>% group_by(male) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post)) 845 | male <- inner_join(male_sample, male_post, by = "male") %>% select(male, Sample, Population) %>% 846 | mutate(male = factor(male, levels = c(-0.5, +0.5), labels = c("Female", "Male"))) 847 | male_plot <- ggplot() + 848 | ylab("") + xlab("") + theme_bw() + coord_flip() + 849 | geom_dumbbell(data = male, aes(y = male, x = Sample, xend = Population)) + 850 | geom_point(data = melt(male, id = "male"), aes(y = male, x = value, color = variable), size = 2) + 851 | scale_x_continuous(limits = c(0, 0.6), breaks = c(0, .2, .4, .6)) + theme(legend.position = "none") + ggtitle("Gender") 852 | 853 | # Ethnicity 854 | ethnicity_sample <- cces_biased_df %>% group_by(eth) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n)) 855 | ethnicity_post <- poststrat_df %>% group_by(eth) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post)) 856 | ethnicity <- inner_join(ethnicity_sample, ethnicity_post, by = "eth") %>% select(eth, Sample, Population) 857 | ethnicity$eth <- factor(ethnicity$eth, 858 | levels = c("Black", "Hispanic", "Other", "White"), 859 | labels = c("Black", "Hispanic", "Other", "White")) 860 | ethnicity_plot <- ggplot() + 861 | ylab("") + xlab("") + theme_bw() + coord_flip() + 862 | geom_dumbbell(data = ethnicity, aes(y = eth, x = Sample, xend = Population)) + 863 | geom_point(data = melt(ethnicity, id = "eth"), aes(y = eth, x = value, color = variable), size = 2) + 864 | scale_x_continuous(limits = c(0, 0.9), breaks = c(0, .2, .4, .6, 0.8)) + theme(legend.position = "none") + ggtitle("Ethnicity") 865 | 866 | # Education 867 | educ_sample <- cces_biased_df %>% mutate(educ = factor(educ, ordered = FALSE)) %>% group_by(educ) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n)) 868 | educ_post <- poststrat_df %>% mutate(educ = factor(educ, ordered = FALSE)) %>% group_by(educ) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post)) 869 | educ <- inner_join(educ_sample, educ_post, by = "educ") %>% select(educ, Sample, Population) 870 | educ$educ <- factor(educ$educ, 871 | levels = c("No HS", "HS", "Some college", "4-Year College", "Post-grad"), 872 | labels = c("No HS", "HS", "Some\nCollege", "4-year\nCollege", "Post-grad")) 873 | educ_plot <- ggplot() + 874 | ylab("") + xlab("") + theme_bw() + coord_flip() + 875 | geom_dumbbell(data = educ, aes(y = educ, x = Sample, xend = Population)) + 876 | geom_point(data = melt(educ, id = "educ"), aes(y = educ, x = value, color = variable), size = 2) + 877 | scale_x_continuous(limits = c(0, 0.33), breaks = c(0, .1, .2, .3)) + theme(legend.position = "none") + ggtitle("Education") 878 | 879 | grid.arrange(age_plot, male_plot, ethnicity_plot, educ_plot, 880 | widths = c(1.5, 0.75, 1.5, 1.5)) 881 | ``` 882 | 883 | ```{r, fig.width=14, fig.height=3.5, echo=FALSE, fig.align = "center", warning=FALSE} 884 | state_sample <- cces_biased_df %>% group_by(state) %>% summarise(n = n()) %>% mutate(Sample = n/sum(n)) 885 | state_post <- poststrat_df %>% group_by(state) %>% summarise(n_post = sum(n)) %>% mutate(Population = n_post/sum(n_post)) 886 | state <- full_join(state_sample, state_post, by = "state") %>% select(state, Sample, Population) 887 | 888 | state$state <- fct_reorder(state$state, states_order) 889 | 890 | ggplot() + 891 | ylab("") + xlab("Proportion") + theme_bw() + coord_flip() + 892 | geom_dumbbell(data = state, aes(y = state, x = Sample, xend = Population)) + 893 | geom_point(data = melt(state, id = "state"), aes(y = state, x = value, color = variable), size = 2) + 894 | scale_x_continuous(limits = c(0, 0.13), breaks = c(0, .025, .05, .075, .1, .125)) + ggtitle("State") + 895 | theme(legend.position = "bottom", legend.title=element_blank()) 896 | ``` 897 | 898 | ```{r, cache=FALSE, warning=FALSE, message=FALSE, eval=FALSE} 899 | fit <- stan_glmer(abortion ~ (1 | state) + (1 | eth) + (1 | educ) + (1 | age) + male + 900 | (1 | male:eth) + (1 | educ:age) + (1 | educ:eth) + 901 | repvote + factor(region), 902 | family = binomial(link = "logit"), 903 | data = cces_biased_df, 904 | prior = normal(0, 1, autoscale = TRUE), 905 | prior_covariance = decov(scale = 0.50), 906 | adapt_delta = 0.99, 907 | refresh = 0, 908 | seed = 1010) 909 | ``` 910 | 911 | ```{r, echo = FALSE, eval=TRUE} 912 | # we save the model for future use. By default we do not retrain the model and 913 | # save it, only retrieving the previously file version. To train the model again, 914 | # simply change eval=TRUE in the previous cell and eval=FALSE in this one. 915 | #saveRDS(fit, file = "data_public/chapter1/models/fit_mrp_2.rds") 916 | fit <- readRDS("data_public/chapter1/models/fit_mrp_2.rds") 917 | ``` 918 | 919 | ```{r, cache = TRUE, echo = FALSE, warning=FALSE, message=FALSE} 920 | # National 921 | epred_mat <- posterior_epred(fit, newdata = poststrat_df, draws = 4000) 922 | mrp_estimates_vector <- epred_mat %*% poststrat_df$n / sum(poststrat_df$n) 923 | mrp_estimate <- c(mean = mean(mrp_estimates_vector), sd = sd(mrp_estimates_vector)) 924 | cces_estimate <- c(mean = mean(cces_biased_df$abortion), se = get_se_bernoulli(mean(cces_biased_df$abortion), nrow(cces_biased_df))) 925 | full_cces_estimate <- c(mean = mean(cces_all_df$abortion), se = get_se_bernoulli(mean(cces_all_df$abortion), nrow(cces_all_df))) 926 | 927 | # By state 928 | states_df <- data.frame( 929 | state = state_abb, 930 | mrp_estimate = NA, 931 | mrp_estimate_se = NA, 932 | cces_estimate = NA, 933 | cces_estimate_se = NA, 934 | full_cces_estimate = NA, 935 | full_cces_estimate_se = NA, 936 | n_sample = NA, 937 | n_full = NA 938 | ) 939 | 940 | for(i in 1:nrow(states_df)) { 941 | filtering_condition <- which(poststrat_df$state == states_df$state[i]) 942 | state_epred_mat <- epred_mat[ ,filtering_condition] 943 | k_filtered <- poststrat_df[filtering_condition, ]$n 944 | mrp_estimates_vector <- state_epred_mat %*% k_filtered / sum(k_filtered) 945 | 946 | # MRP estimate 947 | states_df$mrp_estimate[i] <- mean(mrp_estimates_vector) 948 | states_df$mrp_estimate_se[i] <- sd(mrp_estimates_vector) 949 | 950 | # Biased 5,000-sample survey unadjusted estimate 951 | states_df$cces_estimate[i] <- mean(filter(cces_biased_df, state==states_df$state[i])$abortion) 952 | states_df$n_sample[i] <- nrow(filter(cces_biased_df, state==states_df$state[i])) 953 | states_df$cces_estimate_se[i] <- get_se_bernoulli(states_df$cces_estimate[i], states_df$n_sample[i]) 954 | 955 | # Full 60,000-person survey unadjusted estimate 956 | states_df$full_cces_estimate[i] <- mean(filter(cces_all_df, state==states_df$state[i])$abortion) 957 | states_df$n_full[i] <- nrow(filter(cces_all_df, state==states_df$state[i])) 958 | states_df$full_cces_estimate_se[i] <- get_se_bernoulli(states_df$full_cces_estimate[i], states_df$n_full[i]) 959 | } 960 | ``` 961 | 962 | As expected, our remarkably nonrepresentative sample produces estimates that are lower than what we obtained by using a random sample in the previous section. 963 | 964 | ```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center", cache=FALSE} 965 | # Order states by republican voteshare 966 | states_df$state <- fct_reorder(states_df$state, states_order) 967 | 968 | compare1 <- ggplot(data=states_df) + 969 | geom_point(aes(x = state, y = cces_estimate), color = "#E37B1C") + 970 | geom_errorbar(aes(ymin=cces_estimate - 2*cces_estimate_se, 971 | ymax=cces_estimate + 2*cces_estimate_se, 972 | x=state), alpha=.5, width = 0, color = "#E37B1C") + 973 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 974 | labels=c("0%","25%","50%","75%","100%"), 975 | expand=c(0,0)) + 976 | coord_cartesian(ylim=c(0, 1)) + 977 | theme_bw() + 978 | labs(x="States",y="Support")+ 979 | theme(legend.position="none", 980 | axis.title=element_text(size=10), 981 | axis.text.y=element_text(size=10), 982 | axis.text.x=element_text(angle=90,size=8, vjust=0.3), 983 | legend.title=element_text(size=10), 984 | legend.text=element_text(size=10)) 985 | 986 | compare2 <- ggplot(data = data.frame())+ 987 | geom_point(aes(y = cces_estimate[1], x = .25), color = "#E37B1C") + 988 | geom_errorbar(data = data.frame(), aes(y = cces_estimate[1], 989 | x = .25, 990 | ymin = cces_estimate[1] - 2*cces_estimate[2], 991 | ymax = cces_estimate[1] + 2*cces_estimate[2]), 992 | width = 0, color = "#E37B1C") + 993 | geom_text(aes(x = Inf, y = cces_estimate[1] + 0.06, label = "Unadjusted Sample"), 994 | hjust = -.05, size = 4, color = "#E37B1C") + 995 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 996 | labels=c("0%","25%","50%","75%","100%"), 997 | limits=c(0,1),expand=c(0,0))+ 998 | scale_x_continuous(limits=c(0,1),expand=c(0,0), breaks=c(.25, .75)) + 999 | coord_cartesian(clip = 'off') + 1000 | theme_bw()+ 1001 | labs(x="Population",y="")+ 1002 | theme(legend.position="none", 1003 | axis.title.y=element_blank(), 1004 | axis.title.x=element_text(size=10, margin = margin(t = 19, r = 0, b = , l = 0)), 1005 | axis.text=element_blank(), 1006 | axis.ticks=element_blank(), 1007 | legend.title=element_text(size=10), 1008 | legend.text=element_text(size=10), 1009 | plot.margin = margin(5.5, 105, 5.5, 5.5, "pt") 1010 | ) 1011 | 1012 | bayesplot_grid(compare1,compare2, 1013 | grid_args = list(nrow=1, widths = c(5,1.4))) 1014 | ``` 1015 | 1016 | MRP seems to partially correct for the nonrepresentative sample: 1017 | 1018 | ```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center", cache=FALSE} 1019 | compare1 <- ggplot(data=states_df) + 1020 | geom_point(aes(x = state, y = cces_estimate), color = "#E37B1C") + 1021 | geom_errorbar(aes(ymin=cces_estimate - 2*cces_estimate_se, 1022 | ymax=cces_estimate + 2*cces_estimate_se, 1023 | x=state), alpha=.5, width = 0, color = "#E37B1C") + 1024 | geom_point(data = states_df, aes(x=state, y=mrp_estimate), color = "#7B1CE3") + 1025 | geom_errorbar(data = states_df, aes(ymin=mrp_estimate - 2*mrp_estimate_se, 1026 | ymax=mrp_estimate + 2*mrp_estimate_se, 1027 | x=state), alpha=.5, width = 0, color = "#7B1CE3") + 1028 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 1029 | labels=c("0%","25%","50%","75%","100%"), 1030 | expand=c(0,0))+ 1031 | coord_cartesian(ylim=c(0, 1)) + 1032 | theme_bw()+ 1033 | labs(x="States",y="Support")+ 1034 | theme(legend.position="none", 1035 | axis.title=element_text(size=10), 1036 | axis.text.y=element_text(size=10), 1037 | axis.text.x=element_text(angle=90,size=8, vjust=0.3), 1038 | legend.title=element_text(size=10), 1039 | legend.text=element_text(size=10)) 1040 | 1041 | compare2 <- ggplot(data = data.frame())+ 1042 | geom_point(aes(y=cces_estimate[1], x = .25), color = "#E37B1C") + 1043 | geom_errorbar(data=data.frame(), aes(y = cces_estimate[1], 1044 | x = .25, 1045 | ymin = cces_estimate[1] - 2*cces_estimate[2], 1046 | ymax = cces_estimate[1] + 2*cces_estimate[2]), 1047 | width = 0, color = "#E37B1C") + 1048 | geom_text(aes(x = Inf, y = cces_estimate[1]+0.06, label = "Unadjusted Sample"), 1049 | hjust = -.05, size = 4, color = "#E37B1C") + 1050 | geom_point(aes(y = mrp_estimate[1], x = .75), color = "#7B1CE3") + 1051 | geom_errorbar(data=data.frame(), aes(y = mrp_estimate[1], 1052 | x = .75, 1053 | ymin = mrp_estimate[1] - 2*mrp_estimate[2], 1054 | ymax = mrp_estimate[1] + 2*mrp_estimate[2]), 1055 | width = 0, color = "#7B1CE3") + 1056 | geom_text(data = data.frame(), aes(x = Inf, y = mrp_estimate[1]+0.02, label = "Sample with MRP"), 1057 | hjust = -.05, size = 4, color = "#7B1CE3") + 1058 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 1059 | labels=c("0%","25%","50%","75%","100%"), 1060 | limits=c(0,1),expand=c(0,0))+ 1061 | scale_x_continuous(limits=c(0,1),expand=c(0,0), breaks=c(.25, .75)) + 1062 | coord_cartesian(clip = 'off') + 1063 | theme_bw()+ 1064 | labs(x="Population",y="")+ 1065 | theme(legend.position="none", 1066 | axis.title.y=element_blank(), 1067 | axis.title.x=element_text(size=10, margin = margin(t = 19, r = 0, b = , l = 0)), 1068 | axis.text=element_blank(), 1069 | axis.ticks=element_blank(), 1070 | legend.title=element_text(size=10), 1071 | legend.text=element_text(size=10), 1072 | plot.margin = margin(5.5, 105, 5.5, 5.5, "pt") 1073 | ) 1074 | 1075 | bayesplot_grid(compare1,compare2, 1076 | grid_args = list(nrow=1, widths = c(5,1.4))) 1077 | ``` 1078 | 1079 | Lastly, we see how the MRP national and subnational estimates based on the nonrepresentative sample are, overall, much closer to the 60,000-person survey than the biased unadjusted sample estimates. 1080 | 1081 | ```{r, warning=FALSE, echo=FALSE, message=FALSE, fig.height=3.5, fig.width=10, fig.align = "center", results = 'hide', cache=FALSE} 1082 | compare1 <- ggplot(data=states_df) + 1083 | geom_point(aes(x=state, y=cces_estimate), color = "#E37B1C") + 1084 | geom_errorbar(aes(ymin=cces_estimate - 2*cces_estimate_se, 1085 | ymax=cces_estimate + 2*cces_estimate_se, 1086 | x=state), alpha=.5, width = 0, color = "#E37B1C") + 1087 | geom_point(data=states_df, aes(x=state, y=mrp_estimate), color = "#7B1CE3") + 1088 | geom_errorbar(data=states_df, aes(ymin=mrp_estimate - 2*mrp_estimate_se, 1089 | ymax=mrp_estimate + 2*mrp_estimate_se, 1090 | x=state), alpha=.5, width = 0, color = "#7B1CE3") + 1091 | geom_point(aes(x=state, y=full_cces_estimate), color = "#1CE37B") + 1092 | geom_errorbar(data=states_df, aes(ymin=full_cces_estimate - 2*full_cces_estimate_se, 1093 | ymax=full_cces_estimate + 2*full_cces_estimate_se, 1094 | x=state), alpha=.5, width = 0, color = "#1CE37B") + 1095 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 1096 | labels=c("0%","25%","50%","75%","100%"), 1097 | expand=c(0,0))+ 1098 | coord_cartesian(ylim=c(0, 1)) + 1099 | theme_bw()+ 1100 | labs(x="States",y="Support")+ 1101 | theme(legend.position="none", 1102 | axis.title=element_text(size=10), 1103 | axis.text.y=element_text(size=10), 1104 | axis.text.x=element_text(angle=90,size=8, vjust=0.3), 1105 | legend.title=element_text(size=10), 1106 | legend.text=element_text(size=10)) 1107 | 1108 | compare2 <- ggplot(data = data.frame()) + 1109 | geom_point(aes(y=cces_estimate[1], x = .25), color = "#E37B1C") + 1110 | geom_errorbar(data=data.frame(), aes(y = cces_estimate[1], 1111 | x = .25, 1112 | ymin = cces_estimate[1] - 2*cces_estimate[2], 1113 | ymax = cces_estimate[1] + 2*cces_estimate[2]), 1114 | width = 0, color = "#E37B1C") + 1115 | geom_text(aes(x = Inf, y = cces_estimate[1]+0.06, label = "Unadjusted Sample"), 1116 | hjust = -.05, size = 4, color = "#E37B1C") + 1117 | geom_point(aes(y = mrp_estimate[1], x = .75), color = "#7B1CE3") + 1118 | geom_errorbar(data=data.frame(), aes(y = mrp_estimate[1], 1119 | x = .75, 1120 | ymin = mrp_estimate[1] - 2*mrp_estimate[2], 1121 | ymax = mrp_estimate[1] + 2*mrp_estimate[2]), 1122 | width = 0, color = "#7B1CE3") + 1123 | geom_text(data = data.frame(), aes(x = Inf, y = mrp_estimate[1]+0.02, label = "Sample with MRP"), 1124 | hjust = -.05, size = 4, color = "#7B1CE3") + 1125 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 1126 | labels=c("0%","25%","50%","75%","100%"), 1127 | limits=c(0,1),expand=c(0,0)) + 1128 | geom_point(data = data.frame(), aes(y = full_cces_estimate[1], x = .5), color = "#1CE37B") + 1129 | geom_errorbar(data = data.frame(), aes(y = full_cces_estimate[1], 1130 | x = .5, 1131 | ymin = full_cces_estimate[1] - 2*full_cces_estimate[2], 1132 | ymax = full_cces_estimate[1] + 2*full_cces_estimate[2]), 1133 | width = 0, color = "#1CE37B") + 1134 | geom_text(data = data.frame(), aes(x = Inf, y = full_cces_estimate-0.04, label = "Complete Survey"), 1135 | hjust = -.06, size = 4, color = "#1CE37B") + 1136 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 1137 | labels=c("0%","25%","50%","75%","100%"), 1138 | limits=c(0,1),expand=c(0,0))+ 1139 | scale_x_continuous(limits=c(0,1),expand=c(0,0), breaks=c(.25, .75)) + 1140 | coord_cartesian(clip = 'off') + 1141 | theme_bw() + 1142 | labs(x="Population",y="")+ 1143 | theme(legend.position="none", 1144 | axis.title.y=element_blank(), 1145 | axis.title.x=element_text(size=10, margin = margin(t = 19, r = 0, b = , l = 0)), 1146 | axis.text=element_blank(), 1147 | axis.ticks=element_blank(), 1148 | legend.title=element_text(size=10), 1149 | legend.text=element_text(size=10), 1150 | plot.margin = margin(5.5, 105, 5.5, 5.5, "pt") 1151 | ) 1152 | 1153 | bayesplot_grid(compare1,compare2, 1154 | grid_args = list(nrow=1, widths = c(5,1.4))) 1155 | ``` 1156 | 1157 | 1158 | ## Practical Considerations {#practical-considerations} 1159 | 1160 | ### Census incompletness and uncertainty {.unnumbered #census-incompletness} 1161 | 1162 | There are two main problems we can encounter when dealing with census data. 1163 | 1164 | * It is possible that some variables that we may want to use for poststratification are not available. For instance, party ID is not registered in the US census and ethnicity is not registered in the French census. This additional information can be included in the poststratification table based on other (generally smaller) surveys that contain these variables. 1165 | 1166 | * A great number of demogaphic-geographic combinations can require a large poststratification table, which in turn can result in unreliable census estimates. The American Community Survey we use in this case study does not only provide estimates of the actual figures that would have been obtained if the entire population was sampled, it also includes a measure of uncertainty around these estimates. Ideally, this uncertainty should be taken into account in the poststratification. For simplicity, this introduction has skipped this step, but this could mean the MRP-based estimates present an underestimated uncertainty. 1167 | 1168 | ### Nonreponse and missing data {.unnumbered #nonresponse} 1169 | 1170 | We have seen that MRP is a method that can mitigate potential biases in the sample, but it is not a substitute for a better data collection effort that tries to minimize systematic nonresponse patterns. 1171 | 1172 | ### Model complexity {.unnumbered #model-complexity} 1173 | 1174 | MRP depends upon the use of a regularized model (i.e. that prevents overfitting by limiting its complexity). Different approaches can be used for this goal (e.g. non-multilevel regression, random forests, or a neural network; see @bisbee2019barp for an implementation that uses Bayesian Additive Regression Trees and @ornstein2020stacked for an ensemble of predictive models), but there are several advantages of using a Bayesian multilevel model. First, the multilevel structure allows for partially pooling information across different levels of a factor, which can be crucial when dealing with certain levels with few samples. Second, the Bayesian approach propagates uncertainty across the modeling, and thus gives more realistic confidence intervals. 1175 | 1176 | Apart from selecting the demogaphic-geographic categories included in the poststratification table, there are several decisions the modeler should make. As we have already mentioned, adding relevant state-level predictors to the model often improves results, particularly when we have few data about some states. The inclusion of interactions can also be benefitial, especially when studying subgroups within subgroups (e.g. demographic subgroups within states; @ghitza2013deep). Lastly, the use of structured priors can also serve to reduce both bias and variance by sharing information across the levels of a factor (@gao2020structuredpriors). 1177 | 1178 | ### Empty cells in the poststratification table {.unnumbered #empty-cells} 1179 | 1180 | It is very frequent that some of the cells in the poststratification table are empty, meaning that there are not anyone that fulfills some specific combination of factors. For instance, in a given poststratification table we might find that there can be no people younger than 20, without a high school degree, and earning more than $500,000 a year in a particularly small state. In our example, we made sure that all the cells in the poststratification table were present even if the weight of that cell was zero, but this was only for illustrative purposes. 1181 | 1182 | ### Subnational units not represented in the survey {.unnumbered #missing-subnational-units} 1183 | 1184 | It is fairly common for small-sample surveys not to include anyone from a particular subnational unit. For instance, a small national survey in the US may not include any participant from Wyoming. An important advantage of MRP is that we can still produce estimates for this state using the information from the participants in other states. Going back to the first parametrization of the multilevel model that we presented, $\alpha^{\rm state}_{\rm s = Wyoming}$ will be calculated based on the region and Republican voteshare of the 2016 -- even in the abscence of information about the effect of residing in Wyoming specifically. As we have already explained, including subnational-level predictors is always recommended, particularly considering that data at the subnational level is easy to obtain in many cases. However, when dealing with subnational units that are not represented in our survey these predictors become even more central, as they are able to capture structured differences among the states and therefore allow for more precise estimation in the missing subnational areas. 1185 | 1186 | ### Computational issues {.unnumbered #computational-issues} 1187 | 1188 | Stan uses Hamiltonian Monte Carlo to explore the posterior distribution. In some cases, the geometry of the posterior distribution is too complex, making the Hamiltonian Monte Carlo "diverge". This produces a warning indicating the presence of divergent transitions after warmup, something that implies the model could present biased estimates (see @betancourt2017hmc for more details). Usually, a few divergent transitions do not indicate a serious problem. There are, in any case, three potential solutions to this problem that do not involve reformulating the model: (i) a non-centered parametization; (ii) increasing the `adapt_delta` parameter; and (iii) including stronger priors. Fortunately we don't have to worry about (i), as `rstanarm` already provides a non-centered parametization for the model. Therefore, we can focus on the other two. 1189 | 1190 | 1. Exploring the posterior distribution is somewhat similar as cartographing a mountainous terrain, and a divergent transition is similar to falling down a very steep slope, with the consequence of not being able to correctly map that area. In this analogy, what the cartographer could do is moving through the steep slope giving smaller steps to avoid falling. In Stan, the step size is set up automatically, but we can change a parameter called `adapt_delta` that controls the step size. By default we have that `adapt_delta = .95`, but we can increase that number to make Stan take smaller steps, which should reduce the number of divergent transitions. The maximum value we can set for `adapt_delta` is close (but necessarely less than) 1, with the downside that an increase implies a somewhat slower exploration of the posterior distribution. Usually, an `adapt_delta = 0.99` works well if we only have a few divergent transitions. 1191 | 1192 | 2. However, there are cases in which increasing `adapt_delta` is not sufficient, and divergent transitions still occur. In this case, introducing weakly informative priors can be extremelly helpful. Although `rstanarm` provides by default weakly informative priors, in most applications these tend to be too weak. By using more reasonable priors, we make the posterior distribution easier to explore. 1193 | + The priors for the scaled coefficients are ${\rm normal}(0, 2.5)$. When the coefficients are not scaled, `rstanarm` will automatically adjust the scaling of the priors as detailed in the [prior vignette](https://cran.r-project.org/web/packages/rstanarm/vignettes/priors.html). In most cases, and particularly when we find computational issues, it is reasonable to give stronger priors on the scaled coefficients such as ${\rm normal}(0, 1)$. 1194 | + Multilevel models with multiple group-level standard deviation parameters (e.g. $\sigma^{\rm age}$, $\sigma^{\rm eth}$, $\sigma^{\rm educ.eth}$, etc.) tend to be hard to estimate and sometimes present serious computational issues. The default prior for the covariance matrix is `decov(reg. = 1, conc. = 1, shape = 1, scale = 1)`. However, in a varying-intercept model such as this one (i.e. with structure `(1 | a) + (1 | b) + ... + (1 | n)`) the group-level standard deviations are independent of each other, and therefore the prior is simply a gamma distribution with some shape and scale. Consequently, `decov(shape = 1, scale = 1)` implies a weakly informative prior ${\rm Gamma(shape = 1, scale = 1)} = {\rm Exponential(scale = 1)}$ on each group-level standard deviation. This is too weak in most situations, and using something like ${\rm Exponential(scale = 0.5)}$ can be crucial for stabilizing computation. 1195 | 1196 | Therefore, something like this has much fewer chances of running into computational issues than simply leaving the defaults: 1197 | 1198 | ```{r, eval=FALSE} 1199 | fit <- stan_glmer(abortion ~ (1 | state) + (1 | eth) + (1 | educ) + (1 | age) + male + 1200 | (1 | male:eth) + (1 | educ:age) + (1 | educ:eth) + 1201 | repvote + factor(region), 1202 | family = binomial(link = "logit"), 1203 | data = cces_df, 1204 | prior = normal(0, 1, autoscale = TRUE), 1205 | prior_covariance = decov(scale = 0.50), 1206 | adapt_delta = 0.99, 1207 | refresh = 0, 1208 | seed = 1010) 1209 | ``` 1210 | 1211 | More details about divergent transitions can be found in the [Brief Guide to Stan’s Warnings](https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup) and in the [Stan Reference Manual](https://mc-stan.org/docs/2_24/reference-manual/divergent-transitions.html). More information and references about priors can be found in the [Prior Choice Recommendations Wiki](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations). 1212 | 1213 | ## Appendix: Downloading and Processing Data 1214 | 1215 | ```{r packages-3, message=FALSE, echo = FALSE} 1216 | library(data.table) 1217 | library(dplyr) 1218 | library(forcats) 1219 | library(tidyr) 1220 | library(reshape2) 1221 | library(kableExtra) 1222 | library(ggplot2) 1223 | ``` 1224 | 1225 | ### CCES {#CCES} 1226 | 1227 | The 2018 CCES raw survey data can be downloaded from the [CCES Dataverse](https://dataverse.harvard.edu) 1228 | via [this link](https://dataverse.harvard.edu/api/access/datafile/3588803?format=original&gbrecs=true); 1229 | by default the downloaded filename is called `cces18_common_vv.csv`. 1230 | 1231 | Every MRP study requires some degree of data wrangling in order to make the factors in the survey of interest match the factors available in census data and other population-level surveys and census. Here we use the [R Tidyverse](https://www.tidyverse.org/) to process the survey data so that it aligns with the postratification table. Because initial recoding errors are fatal, it is important to check that each step of the recoding process produces expected results, either by viewing or summarizing the data. Because the data is all tabular data, we use the utility function [`head`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/head) to inspect the first few lines of a dataframe before and after each operation. 1232 | 1233 | First, we examine the contents of the data as downloaded, looking only at those columns which provide the demographic-geographic information of interest. In this case, these are labeled as `inputstate`, `gender`, `birthyr`, `race`, and `educ`. 1234 | 1235 | ```{r results="asis", cache=FALSE} 1236 | cces_all <- read_csv("data_public/chapter1/data/cces18_common_vv.csv.gz") 1237 | ``` 1238 | 1239 | ```{r echo=FALSE, warning=FALSE, cache=FALSE} 1240 | kable(head(select(cces_all, inputstate, gender, birthyr, race, educ)), format='markdown') 1241 | ``` 1242 | 1243 | As we have seen, it is crucial that the geography and demographics for the survery must match the geography and demographics in the poststratification table. If there is not a direct one-to-one relationship between the survey and the population data, the survey data must be recoded until a clean mapping exists. We write R functions to encapsulate the recoding steps. 1244 | 1245 | We start considering the geographic information. Both the CCES survey and the US Census data use numeric [FIPS codes](https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt) to record state information. We can use R factors to map FIPS codes to the standard two-letter state name abbreviations. Because both surveys use this encoding, we make this into a reusable function `recode_fips`. 1246 | 1247 | ```{r results="asis", cache=FALSE} 1248 | # Note that the FIPS codes include the district of Columbia and US territories which 1249 | # are not considered in this study, creating some gaps in the numbering system. 1250 | state_ab <- datasets::state.abb 1251 | state_fips <- c(1,2,4,5,6,8,9,10,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30, 1252 | 31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56) 1253 | recode_fips <- function(column) { 1254 | factor(column, levels = state_fips, labels = state_ab) 1255 | } 1256 | ``` 1257 | 1258 | Secondly, we recode the demographics in order for them to be compatible with the American Community Survey data. In some cases this requires changing the levels of a factor (e.g. ethnicity) and in others we may need ot split a continous variable into different intervals (e.g. age). `clean_cces` uses the `recode_fips` function defined above to clean up the states. By default, the `clean_cces` functions drops rows where there is non-response in any of the considered factors or in the outcome variable; if this information is not missing at random, this introduces (more) bias into our survey. 1259 | 1260 | ```{r} 1261 | # Recode CCES 1262 | clean_cces <- function(df, remove_nas = TRUE){ 1263 | 1264 | ## Abortion -- dichotomous (0 - Oppose / 1 - Support) 1265 | df$abortion <- abs(df$CC18_321d-2) 1266 | 1267 | ## State -- factor 1268 | df$state <- recode_fips(df$inputstate) 1269 | 1270 | ## Gender -- dichotomous (coded as -0.5 Female, +0.5 Male) 1271 | df$male <- abs(df$gender-2)-0.5 1272 | 1273 | ## ethnicity -- factor 1274 | df$eth <- factor(df$race, 1275 | levels = 1:8, 1276 | labels = c("White", "Black", "Hispanic", "Asian", "Native American", 1277 | "Mixed", "Other", "Middle Eastern")) 1278 | df$eth <- fct_collapse(df$eth, "Other" = c("Asian", "Other", "Middle Eastern", 1279 | "Mixed", "Native American")) 1280 | 1281 | ## Age -- cut into factor 1282 | df$age <- 2018 - df$birthyr 1283 | df$age <- cut(as.integer(df$age), breaks = c(0, 29, 39, 49, 59, 69, 120), 1284 | labels = c("18-29","30-39","40-49","50-59","60-69","70+"), 1285 | ordered_result = TRUE) 1286 | 1287 | ## Education -- factor 1288 | df$educ <- factor(as.integer(df$educ), 1289 | levels = 1:6, 1290 | labels = c("No HS", "HS", "Some college", "Associates", 1291 | "4-Year College", "Post-grad"), ordered = TRUE) 1292 | df$educ <- fct_collapse(df$educ, "Some college" = c("Some college", "Associates")) 1293 | 1294 | # Filter out unnecessary columns and remove NAs 1295 | df <- df %>% select(abortion, state, eth, male, age, educ) 1296 | if (remove_nas){ 1297 | df <- df %>% drop_na() 1298 | } 1299 | 1300 | return(df) 1301 | } 1302 | ``` 1303 | 1304 | ### American Community Survey {#ACS} 1305 | 1306 | We used the US American Community Survey to create a poststratification table. We will show two different (but equivalent) ways to do this. 1307 | 1308 | #### Alternative 1: IPUMS {.unnumbered #ACS-IPUMS} 1309 | 1310 | The [Integrated Public Use Microdata Series (IPUMS)](https://ipums.org/what-is-ipums) (@ipums2020) is a service run by the University of Minnesota that allows easy access to census and survey data. We focus on the IPUMS USA section, which preserves and harmonizes US census microdata, including the American Community Survey. Other researchers may be interested in IPUMS international, which contains census microdata for over 100 countries. 1311 | 1312 | In order to create the poststratification table we took the following steps: 1313 | 1314 | 1. Register at IPUMS using the following [link](https://uma.pop.umn.edu/usa/user/new) 1315 | 2. On ipums.org, select [IPUMS USA](https://uma.pop.umn.edu/usa/user/new) and then click on “Get Data”. This tool allows to easily select certain variables from census microdata using an intuitive point-and-click interface. 1316 | 1317 | ```{r, echo=FALSE, out.width = '50%', fig.align='center'} 1318 | knitr::include_graphics("data_public/chapter1/screenshots/screenshot1.png") 1319 | ``` 1320 | 1321 | 3. We first need to select a sample (i.e. the survey we want to use for the poststratification table) with a menu that is opened by clicking on the SELECT SAMPLES button shown above. In our case, we we will select the 2018 5-year ACS survey and then click on SUBMIT SAMPLE. 1322 | 1323 | ```{r, echo=FALSE, out.width = '50%', fig.align='center'} 1324 | knitr::include_graphics("data_public/chapter1/screenshots/screenshot2.png") 1325 | ``` 1326 | 1327 | 4. After selecting the sample we need to select the variables that will be included in our poststratification table. The multiple variables are conveniently categorized by HOUSEHOLD (household-level variables), PERSON (individual-level variables), and A-Z (alphabetically). For instance, clicking on PERSON > DEMOGRAPHIC displays the demographic variables, as shown below. Note that the rightmost column shows if that variable is available in the 2018 5-year ACS. Note that if you click on a certain variable IPUMS will provide a description and show the codes and frequencies. Based on the data available in your survey of interest, this is a useful tool to decide which variables to include in the poststratification table. In our case, we select: 1328 | 1329 | * On PERSON > DEMOGRAPHIC select SEX and AGE 1330 | * On PERSON > RACE, ETHNICITY, AND NATIVITY select RACE, HISPAN, and CITIZEN 1331 | * On PERSON > EDUCATION select EDUC 1332 | * On HOUSEHOLD > GEOGRAPHIC select STATEFIP 1333 | 1334 | ```{r, echo=FALSE, out.width = '50%', fig.align='center'} 1335 | knitr::include_graphics("data_public/chapter1/screenshots/screenshot3.png") 1336 | ``` 1337 | 1338 | 6. We can review the variables we have selected by clicking on VIEW CART. This view also includes the ones which are automatically selected by IPUMS. 1339 | 1340 | ```{r, echo=FALSE, out.width = '45%', fig.align='center'} 1341 | knitr::include_graphics("data_public/chapter1/screenshots/screenshot4.png") 1342 | ``` 1343 | 1344 | 7. After reviewing these variables we should select CREATE DATA EXTRACT. By default the data format is a .dat with fixed-width text, but if we prefer we can change this to csv. After clicking SUBMIT EXTRACT the data will be generated. This can take a while, but you will receive an email when the file is ready for download. 1345 | 1346 | ```{r, echo=FALSE, out.width = '45%', fig.align='center'} 1347 | knitr::include_graphics("data_public/chapter1/screenshots/screenshot5.png") 1348 | ``` 1349 | 1350 | 7. Lastly, we download and preprocess the data. There are two main considerations: 1351 | 1352 | + Focus on the population of interest: We must take into account that the population of interest for the CCES survey, which only considers US citizens above 18 years of age, is different from the population reflected in the ACS. Therefore, we had to remove the cases of underages and non-citizens in the census data. 1353 | 1354 | + Match the levels of the two datasets: The levels of the variables in the poststratification table must match the levels of the variables in the CCES dataset. This required preprocessing the variables of the CCES and ACS in a way that the levels were compatible. 1355 | 1356 | ```{r, eval=FALSE} 1357 | ## Read data downloaded from IPUMS. This step can be slow, as the dataset is almost 1.5Gb 1358 | ## (note: due to its size, this file is not included in the book repo) 1359 | temp_df <- read.csv('usa_00001.csv') 1360 | 1361 | ## Remove non-citizens 1362 | temp_df <- temp_df %>% filter(CITIZEN<3) 1363 | 1364 | ## State 1365 | temp_df$state <- temp_df$STATEFIP 1366 | 1367 | ## Gender 1368 | temp_df$male <- abs(temp_df$SEX-2)-0.5 1369 | 1370 | ## Ethnicity 1371 | temp_df$RACE <- factor(temp_df$RACE, 1372 | levels = 1:9, 1373 | labels = c("White", "Black", "Native American", "Chinese", 1374 | "Japanese", "Other Asian or Pacific Islander", 1375 | "Other race, nec", "Two major races", 1376 | "Three or more major races")) 1377 | temp_df$eth <- fct_collapse(temp_df$RACE, 1378 | "Other" = c("Native American", "Chinese", 1379 | "Japanese", "Other Asian or Pacific Islander", 1380 | "Other race, nec", "Two major races", 1381 | "Three or more major races")) 1382 | levels(temp_df$eth) <- c(levels(temp_df$eth), "Hispanic") 1383 | ## add hispanic as ethnicity. This is done only for individuals that indicate being white 1384 | # in RACE and of hispanic origin in HISPAN 1385 | temp_df$eth[(temp_df$HISPAN!=0) & temp_df$eth=="White"] <- "Hispanic" 1386 | 1387 | ## Age 1388 | temp_df$age <- cut(as.integer(temp_df$AGE), breaks = c(0, 17, 29, 39, 49, 59, 69, 120), 1389 | labels = c("0-17", "18-29","30-39","40-49","50-59","60-69","70+"), 1390 | ordered_result = TRUE) 1391 | # filter out underages 1392 | temp_df <- filter(temp_df, age!="0-17") 1393 | temp_df$age <- droplevels(temp_df$age) 1394 | 1395 | ## Education 1396 | # we need to use EDUCD (i.e. education detailed) instead of EDUC (i.e. general codes), as the 1397 | # latter does not contain enough information about whether high school was completed or not. 1398 | temp_df$educ <- cut(as.integer(temp_df$EDUCD), c(0, 61, 64, 100, 101, Inf), 1399 | ordered_result = TRUE, 1400 | labels = c("No HS", "HS", "Some college", "4-Year College", "Post-grad")) 1401 | 1402 | # Clean temp_df by dropping NAs and cleaning states with recode_fips 1403 | temp_df <- temp_df %>% drop_na(state, eth, male, age, educ, PERWT) %>% 1404 | select(state, eth, male, age, educ, PERWT) %>% filter(state %in% state_fips) %>% 1405 | mutate(state = recode_fips(state)) 1406 | # Generate cell frequencies using groupby 1407 | poststrat_df <- temp_df %>% 1408 | group_by(state, eth, male, age, educ, .drop = FALSE) %>% 1409 | summarise(n = sum(as.numeric(PERWT))) 1410 | # Write as csv 1411 | write.csv(poststrat_df, "poststrat_df.csv", row.names = FALSE) 1412 | ``` 1413 | 1414 | If you use IPUMS in your project, don’t forget to [cite it](https://ipums.org/about/citation). 1415 | 1416 | #### Alternative 2: ACS PUMS {.unnumbered #ACS-PUMS} 1417 | 1418 | Some researchers may prefer to access the 2018 5-year ACS data directly without using IPUMS, which makes the process less intuitive but also more reproducible. Additionally, this does not require creating an account, as the Public Use Microdata Sample (PUMS) from the ACS can be downloaded directly from the [data repository](https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/). The repository contains two .zip files for each state: one for individual-level variables and other for household-level variables. All the variables considered in our analysis are available in the individual-level files, but we will also download and process the household-level variable income to show how this could be done. 1419 | 1420 | ```{r, eval=FALSE} 1421 | # We start downloading all the zip files using wget. If you are using Windows you can download 1422 | # a pre-built wget from http://gnuwin32.sourceforge.net/packages/wget.htm 1423 | dir.create("poststrat_data/") 1424 | system('wget -O poststrat_data -e robots=off -nd -A "csv_*.zip" -R "index.html","csv_hus.zip", 1425 | "csv_pus.zip" https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/') 1426 | ``` 1427 | 1428 | If this does not work, you can also access the [data repository](https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/) and download the files directly from your browser. 1429 | 1430 | Once the data is downloaded, we process the .zip files for each state and then merge them together. IPUMS integrates census data accross different surveys, which results in different naming conventions and levels in some of the variables with respect to the PUMS data directly downloaded from the ACS repository. Therefore, the preprocessing steps is slightly different from the code shown above, but as the underlying data is the same we obtain an identical poststratification table. 1431 | 1432 | ```{r, eval=FALSE} 1433 | list_states_abb <- datasets::state.abb 1434 | list_states_num <- rep(NA, length(list_states_abb)) 1435 | 1436 | list_of_poststrat_df <- list() 1437 | 1438 | for(i in 1:length(list_states_num)){ 1439 | # Unzip and read household and person files for state i 1440 | p_name <- paste0("postrat_data/csv_p", tolower(list_states_abb[i]),".zip") 1441 | h_name <- paste0("postrat_data/csv_h", tolower(list_states_abb[i]),".zip") 1442 | p_csv_name <- grep('\\.csv$', unzip(p_name, list=TRUE)$Name, ignore.case=TRUE, value=TRUE) 1443 | temp_df_p_state <- fread(unzip(p_name, files = p_csv_name), header=TRUE, 1444 | select=c("SERIALNO","ST","CIT","PWGTP","RAC1P","HISP","SEX", 1445 | "AGEP","SCHL")) 1446 | h_csv_name <- grep('\\.csv$', unzip(h_name, list=TRUE)$Name, ignore.case=TRUE, value=TRUE) 1447 | temp_df_h_state <- fread(unzip(h_name, files = h_csv_name), 1448 | header=TRUE, select=c("SERIALNO","FINCP")) 1449 | 1450 | # Merge the individual and household level variables according to the serial number 1451 | temp_df <- merge(temp_df_h_state, temp_df_p_state, by = "SERIALNO") 1452 | 1453 | # Update list of state numbers that will be used later 1454 | list_states_num[i] <- temp_df$ST[1] 1455 | 1456 | ## Filter by citizenship 1457 | temp_df <- temp_df %>% filter(CIT!=5) 1458 | 1459 | ## State 1460 | temp_df$state <- temp_df$ST 1461 | 1462 | ## Gender 1463 | temp_df$male <- abs(temp_df$SEX-2)-0.5 1464 | 1465 | ## Tthnicity 1466 | temp_df$RAC1P <- factor(temp_df$RAC1P, 1467 | levels = 1:9, 1468 | labels = c("White", "Black", "Native Indian", "Native Alaskan", 1469 | "Native Indian or Alaskan", "Asian", "Pacific Islander", 1470 | "Other", "Mixed")) 1471 | temp_df$eth <- fct_collapse(temp_df$RAC1P, "Native American" = c("Native Indian", 1472 | "Native Alaskan", 1473 | "Native Indian or Alaskan")) 1474 | temp_df$eth <- fct_collapse(temp_df$eth, "Other" = c("Asian", "Pacific Islander", "Other", 1475 | "Native American", "Mixed")) 1476 | levels(temp_df$eth) <- c(levels(temp_df$eth), "Hispanic") 1477 | temp_df$eth[(temp_df$HISP!=1) & temp_df$eth=="White"] <- "Hispanic" 1478 | 1479 | ## Age 1480 | temp_df$age <- cut(as.integer(temp_df$AGEP), breaks = c(0, 17, 29, 39, 49, 59, 69, 120), 1481 | labels = c("0-17", "18-29","30-39","40-49","50-59","60-69","70+"), 1482 | ordered_result = TRUE) 1483 | # filter out underages 1484 | temp_df <- filter(temp_df, age!="0-17") 1485 | temp_df$age <- droplevels(temp_df$age) 1486 | 1487 | ## Income (not currently used) 1488 | temp_df$income <- cut(as.integer(temp_df$FINCP), 1489 | breaks = c(-Inf, 9999, 19999, 29999, 39999, 49999, 59999, 69999, 79999, 1490 | 99999, 119999, 149999, 199999, 249999, 349999, 499999, Inf), 1491 | ordered_result = TRUE, 1492 | labels = c("<$10,000", "$10,000 - $19,999", "$20,000 - $29,999", 1493 | "$30,000 - $39,999", "$40,000 - $49,999", 1494 | "$50,000 - $59,999", "$60,000 - $69,999", 1495 | "$70,000 - $79,999","$80,000 - $99,999", 1496 | "$100,000 - $119,999", "$120,000 - $149,999", 1497 | "$150,000 - $199,999","$200,000 - $249,999", 1498 | "$250,000 - $349,999", "$350,000 - $499,999", 1499 | ">$500,000")) 1500 | temp_df$income <- fct_explicit_na(temp_df$income, "Prefer Not to Say") 1501 | 1502 | ## Education 1503 | temp_df$educ <- cut(as.integer(temp_df$SCHL), breaks = c(0, 15, 17, 19, 20, 21, 24), 1504 | ordered_result = TRUE, 1505 | labels = c("No HS", "HS", "Some college", "Associates", 1506 | "4-Year College", "Post-grad")) 1507 | temp_df$educ <- fct_collapse(temp_df$educ, "Some college" = c("Some college", "Associates")) 1508 | 1509 | # Calculate the poststratification table 1510 | temp_df <- temp_df %>% drop_na(state, eth, male, age, educ, PWGTP) %>% 1511 | select(state, eth, male, age, educ, PWGTP) 1512 | 1513 | ## We sum by the inidividual-level weight PWGTP 1514 | list_of_poststrat_df[[i]] <- temp_df %>% 1515 | group_by(state, eth, male, age, educ, .drop = FALSE) %>% 1516 | summarise(n = sum(as.numeric(PWGTP))) 1517 | 1518 | print(paste0("Data from ", list_states_abb[i], " completed")) 1519 | } 1520 | 1521 | # Join list of state-level poststratification files 1522 | poststrat_df <- rbindlist(list_of_poststrat_df) 1523 | # Clean up state names 1524 | poststrat_df$state <- recode_fips(state) 1525 | # Write as csv 1526 | write.csv(poststrat_df, "poststrat_df.csv", row.names = FALSE) 1527 | ``` 1528 | 1529 | Some researchers may prefer to access the 2018 5-year ACS data directly without using IPUMS, which makes the process less intuitive but also more reproducible. Additionally, this does not require creating an account, as the Public Use Microdata Sample (PUMS) from the ACS can be downloaded directly from the [data repository](https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/). The repository contains two .zip files for each state: one for individual-level variables and other for household-level variables. There are also two files, `csv_hus.zip` and `csv_hus.zip`, which contain these variables for all states. We will start creating a folder called `poststrat_data` and downloading these two files: 1530 | 1531 | ```{r, eval=FALSE} 1532 | # If you are using Windows you can download a pre-built wget from 1533 | # http://gnuwin32.sourceforge.net/packages/wget.htm 1534 | dir.create("poststrat_data/") 1535 | system('wget -O poststrat_data2 -e robots=off -nd -A "csv_hus.zip","csv_pus.zip" 1536 | https://www2.census.gov/programs-surveys/acs/data/pums/2018/5-Year/') 1537 | ``` 1538 | 1539 | If this does not work, you can also download the files directly from the data repository. 1540 | 1541 | Once the data is downloaded, we unzip the files and merge them together. IPUMS integrates census data accross different surveys, which results in different naming conventions and levels in some of the variables with respect to the PUMS data directly downloaded from the ACS repository. Therefore, the preprocessing steps is slightly different from the code shown above, but as the underlying data is the same we obtain an identical poststratification table. All the variables considered in our analysis are available in the individual-level files, but we will also download and process the household-level variable income in order to show how this could be done. 1542 | 1543 | ```{r, eval=FALSE} 1544 | list_states_abb <- datasets::state.abb 1545 | list_states_num <- rep(NA, length(list_states_abb)) 1546 | 1547 | # Unzip and read household and person files 1548 | p_name <- paste0("poststrat_data/csv_pus.zip") 1549 | h_name <- paste0("poststrat_data/csv_hus.zip") 1550 | 1551 | p_csv_name <- grep('\\.csv$', unzip(p_name, list=TRUE)$Name, ignore.case=TRUE, value=TRUE) 1552 | unzip(p_name, files = p_csv_name, exdir = "poststrat_data") 1553 | p_csv_name = paste0("poststrat_data/", p_csv_name) 1554 | temp_df_p <- rbindlist(lapply(p_csv_name, fread, header=TRUE, 1555 | select=c("SERIALNO","ST","CIT","PWGTP","RAC1P","HISP","SEX", 1556 | "AGEP","SCHL"))) 1557 | 1558 | h_csv_name <- grep('\\.csv$', unzip(h_name, list=TRUE)$Name, ignore.case=TRUE, value=TRUE) 1559 | unzip(h_name, files = h_csv_name, exdir = "poststrat_data") 1560 | h_csv_name = paste0("poststrat_data/", h_csv_name) 1561 | temp_df_h <- rbindlist(lapply(h_csv_name, fread, header=TRUE, select=c("SERIALNO","FINCP"))) 1562 | 1563 | # Merge the individual and household level variables according to the serial number 1564 | temp_df <- merge(temp_df_h, temp_df_p, by = "SERIALNO") 1565 | 1566 | # Exclude associated ares that are not states based on FIPS codes 1567 | temp_df <- temp_df %>% filter(ST %in% state_fips) 1568 | temp_df$ST <- factor(temp_df$ST, levels = state_fips, labels = state_ab) 1569 | 1570 | ## Filter by citizenship 1571 | temp_df <- temp_df %>% filter(CIT!=5) 1572 | 1573 | ## State 1574 | temp_df$state <- temp_df$ST 1575 | 1576 | ## Gender 1577 | temp_df$male <- abs(temp_df$SEX-2)-0.5 1578 | 1579 | ## Ethnicity 1580 | temp_df$RAC1P <- factor(temp_df$RAC1P, 1581 | levels = 1:9, 1582 | labels = c("White", "Black", "Native Indian", "Native Alaskan", 1583 | "Native Indian or Alaskan", "Asian", "Pacific Islander", 1584 | "Other", "Mixed")) 1585 | temp_df$eth <- fct_collapse(temp_df$RAC1P, "Native American" = c("Native Indian", 1586 | "Native Alaskan", 1587 | "Native Indian or Alaskan")) 1588 | temp_df$eth <- fct_collapse(temp_df$eth, "Other" = c("Asian", "Pacific Islander", "Other", 1589 | "Native American", "Mixed")) 1590 | levels(temp_df$eth) <- c(levels(temp_df$eth), "Hispanic") 1591 | temp_df$eth[(temp_df$HISP!=1) & temp_df$eth=="White"] <- "Hispanic" 1592 | 1593 | ## Age 1594 | temp_df$age <- cut(as.integer(temp_df$AGEP), breaks = c(0, 17, 29, 39, 49, 59, 69, 120), 1595 | labels = c("0-17", "18-29","30-39","40-49","50-59","60-69","70+"), 1596 | ordered_result = TRUE) 1597 | # filter out underages 1598 | temp_df <- filter(temp_df, age!="0-17") 1599 | temp_df$age <- droplevels(temp_df$age) 1600 | 1601 | ## Income (not currently used) 1602 | temp_df$income <- cut(as.integer(temp_df$FINCP), 1603 | breaks = c(-Inf, 9999, 19999, 29999, 39999, 49999, 59999, 69999, 79999, 1604 | 99999, 119999, 149999, 199999, 249999, 349999, 499999, Inf), 1605 | ordered_result = TRUE, 1606 | labels = c("<$10,000", "$10,000 - $19,999", "$20,000 - $29,999", 1607 | "$30,000 - $39,999", "$40,000 - $49,999", 1608 | "$50,000 - $59,999", "$60,000 - $69,999", 1609 | "$70,000 - $79,999","$80,000 - $99,999", 1610 | "$100,000 - $119,999", "$120,000 - $149,999", 1611 | "$150,000 - $199,999","$200,000 - $249,999", 1612 | "$250,000 - $349,999", "$350,000 - $499,999", 1613 | ">$500,000")) 1614 | temp_df$income <- fct_explicit_na(temp_df$income, "Prefer Not to Say") 1615 | 1616 | ## Education 1617 | temp_df$educ <- cut(as.integer(temp_df$SCHL), breaks = c(0, 15, 17, 19, 20, 21, 24), 1618 | ordered_result = TRUE, 1619 | labels = c("No HS", "HS", "Some college", "Associates", 1620 | "4-Year College", "Post-grad")) 1621 | temp_df$educ <- fct_collapse(temp_df$educ, "Some college" = c("Some college", "Associates")) 1622 | 1623 | # Calculate the poststratification table 1624 | temp_df <- temp_df %>% drop_na(state, eth, male, age, educ, PWGTP) %>% 1625 | select(state, eth, male, age, educ, PWGTP) 1626 | 1627 | ## We sum by the inidividual-level weight PWGTP 1628 | poststrat_df <- temp_df %>% 1629 | group_by(state, eth, male, age, educ, .drop = FALSE) %>% 1630 | summarise(n = sum(as.numeric(PWGTP))) 1631 | 1632 | # Write as csv 1633 | write.csv(poststrat_df, "poststrat_df.csv", row.names = FALSE) 1634 | ``` 1635 | -------------------------------------------------------------------------------- /02-mrp_noncensus.Rmd: -------------------------------------------------------------------------------- 1 | # MRP with Noncensus Variables 2 | 3 | 26 | 27 | 53 | 99 | 100 | ```{r packages-2, message=FALSE, echo = FALSE} 101 | library(brms) 102 | library(rstanarm) 103 | library(data.table) 104 | library(dplyr) 105 | library(forcats) 106 | library(tidyr) 107 | library(reshape2) 108 | library(kableExtra) 109 | 110 | library(ggplot2) 111 | library(bayesplot) 112 | library(gridExtra) 113 | library(ggalt) 114 | library(scales) 115 | library(usmap) 116 | 117 | theme_set(bayesplot::theme_default()) 118 | 119 | Sys.setenv(LOCAL_CPPFLAGS = '-march=corei7 -mtune=corei7') 120 | options(mc.cores = parallel::detectCores(logical = FALSE)) 121 | ``` 122 | 123 | When our sample population is different than our target population, MRP can only adjust for the predictors included in the model. As these are restricted by the variables in the poststratification table, which in turn are limited by the questions asked in the census, the characteristics that we can use for poststratification are quite reduced. This is the reason researchers tend to use simple demographic and geographic variables, which unfortunately do not provide much help if the bias in the survey originates from non-response in voters of a certain party, for instance. As a potential solution, @kastellec2015polarizing propose extending the postratification table using a survey that contains one or multiple non-census variables that could help adjusting for the differences between the sample and the target population. For instance, if our survey asked for partisanship, we could use the CCES to extend the poststratification table such as that it also contains this variable. The extension is done in two steps. First, we fit a multilevel model in which we try to predict partisanship in the CCES based on the variables available in the census. Second, we use this model to predict, for each cell in the original poststratification table, what proportion of subjects are Democrats, Republicans, or Independents. This extended poststratification table that contains partisanship will allow us to (a) generate MRP estimates that adjust for differential party nonresponse in the original survey; and/or (b) obtain estimates outcome of interest by party. 124 | 125 | For this case study we will continue using the previous example of studying support for the right of employers to exclude abortion coverage. 126 | 127 | ```{r, echo = FALSE} 128 | state_abb <- datasets::state.abb 129 | state_fips <- c(1,2,4,5,6,8,9,10,12,13,15,16,17,18,19,20,21,22,23,24, 130 | 25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42, 131 | 44,45,46,47,48,49,50,51,53,54,55,56) 132 | recode_fips <- function(column) { 133 | factor(column, levels = state_fips, labels = state_abb) 134 | } 135 | 136 | get_se_bernoulli <- function(p, n){ 137 | return(sqrt(p*(1-p)/n)) 138 | } 139 | 140 | # Reed CCES data again, but this time we also read the partisanship variable 141 | clean_cces2 <- function(df, remove_nas = TRUE){ 142 | 143 | ## Abortion -- dichotomous (0 - Oppose / 1 - Support) 144 | df$abortion <- abs(df$CC18_321d-2) 145 | 146 | ## State -- factor 147 | df$state <- recode_fips(df$inputstate) 148 | 149 | ## Gender -- dichotomous (-0.5 Female, +0.5 Male) 150 | df$male <- abs(df$gender-2)-0.5 151 | 152 | ## ethnicity -- factor 153 | df$eth <- factor(df$race, 154 | levels = 1:8, 155 | labels = c("White", "Black", "Hispanic", "Asian", "Native American", "Mixed", "Other", "Middle Eastern")) 156 | df$eth <- fct_collapse(df$eth, "Other" = c("Asian", "Other", "Middle Eastern", "Mixed", "Native American")) 157 | 158 | ## Age -- cut into factor 159 | df$age <- 2018 - df$birthyr 160 | df$age <- cut(as.integer(df$age), breaks = c(0, 29, 39, 49, 59, 69, 120), 161 | labels = c("18-29","30-39","40-49","50-59","60-69","70+"), 162 | ordered_result = TRUE) 163 | 164 | ## Education -- factor 165 | df$educ <- factor(as.integer(df$educ), 166 | levels = 1:6, 167 | labels = c("No HS", "HS", "Some college", "Associates", "4-Year College", "Post-grad"), ordered = TRUE) 168 | df$educ <- fct_collapse(df$educ, "Some college" = c("Some college", "Associates")) 169 | 170 | # Party 171 | df$party <- factor(df$pid3, 172 | levels = 1:5, 173 | labels = c("Democrat", "Republican", "Independent", "Other", "Not sure"), 174 | ordered = TRUE) 175 | df$party <- fct_collapse(df$party, "Independent" = c("Independent", "Other", "Not sure")) 176 | 177 | # Clean and remove NAs 178 | df <- df %>% select(abortion, state, eth, male, age, educ, party) 179 | if (remove_nas){ 180 | df <- df %>% drop_na() 181 | } 182 | 183 | return(df) 184 | 185 | } 186 | ``` 187 | 188 | ```{r, results = 'asis', cache=FALSE, warning=FALSE, message=FALSE} 189 | # Read CCES data with the same outcome variable and predictors, but also including 190 | # party. This is done by the clean_cces2 function (not shown, see Github code) 191 | cces_all_df <- read_csv("data_public/chapter1/data/cces18_common_vv.csv.gz") 192 | cces_all_df <- clean_cces2(cces_all_df, remove_nas = TRUE) 193 | 194 | # Read poststratification table 195 | poststrat_df <- read_csv("data_public/chapter1/data/poststrat_df.csv") 196 | 197 | # Read state-level predictors and add them to the CCES and poststratification table 198 | statelevel_predictors_df <- read_csv('data_public/chapter1/data/statelevel_predictors.csv') 199 | cces_all_df <- left_join(cces_all_df, statelevel_predictors_df, 200 | by = "state", keep = FALSE) 201 | poststrat_df <- left_join(poststrat_df, statelevel_predictors_df, 202 | by = "state", keep = FALSE) 203 | ``` 204 | 205 | ## Model-based Extension of the Poststratification Table 206 | 207 | As we have described, we start fitting a multilevel model to predict partisanship as a function of the same demographic and geographic variables used in the standard MRP model, which will allow us to predict the proportion of Republicans, Democrats, and Independents in each row of the poststratification table. As there are three levels for partisanship, we use a Bayesian multinomial (i.e. unordered) logistic regression which can be fitted in `brms` (currently, `rstanarm` does not support multinomial logistic regression). 208 | 209 | For this extension step we should use a survey that we think is to some degree representative with respect to the variable that we are trying to include in the poststratification table. In our example, if we extended our census-based poststratification table using a highly non-representative survey with respect to party, we would indeed generated a biased poststratification table and ultimately obtain compromised MRP estimates. In other words, this is our opportunity to bring outside information in order to generate a richer poststratification table that can adjust for potential biases in the main survey, so we need to make sure that the survey we use to extend the poststratification table is trustworthy with respect to the non-census variable. 210 | 211 | In this example, we will use a 5,000-person sample of the CCES to extend the poststratification table to include partisanship, which is addressed in the CCES: 212 | 213 | > Generally speaking, do you think of yourself as a ...? (Democrat, Republican, Independent, Other, Not Sure) 214 | 215 | For simplicity, we included the few respondents that indicated "Other" or "Not Sure" as Independents. 216 | 217 | ```{r, cache=FALSE, warning=FALSE, message=FALSE, eval=FALSE} 218 | # Setting seed to arbitrary number for reproducibility 219 | set.seed(1010) 220 | 221 | # Taking random sample from the CCES survey 222 | cces_unbiased_df <- cces_all_df %>% sample_n(5000) 223 | 224 | fit_party <- brm(party ~ (1 | state) + (1 | eth) + (1 | age) + (1 | educ) + male + 225 | (1 | male:eth) + (1 | educ:age) + (1 | educ:eth) + 226 | repvote + factor(region), 227 | family = "categorical", 228 | data = cces_unbiased_df, 229 | prior = c(prior(normal(0, 5), class = Intercept), 230 | prior(normal(0, 1), class = b), 231 | prior(exponential(0.5), class = sd, dpar = muIndependent), 232 | prior(exponential(0.5), class = sd, dpar = muRepublican)), 233 | control = list(adapt_delta = 0.9, max_treedepth = 10), 234 | seed = 1010) 235 | ``` 236 | 237 | ```{r, echo = FALSE, eval=TRUE} 238 | # we save the model for future use. By default we do not retrain the model and 239 | # save it, only retrieving the previously file version. To train the model again, 240 | # simply change eval=TRUE in the previous cell and eval=FALSE in this one. 241 | #saveRDS(fit_party, file = "data_public/chapter2/models/fit_party_example.rds") 242 | fit_party <- readRDS("data_public/chapter2/models/fit_party_example.rds") 243 | ``` 244 | 245 | This model gives us, for each poststratification cell $j$, an estimate for the proportion of Democrats ($\hat{\theta}_{{\rm Democrat}, j}$), Republicans ($\hat{\theta}_{{\rm Republican}, j}$), and Independents ($\hat{\theta}_{{\rm Independent}, j}$). We can multiply these quantities by the number of people in cell $j$ to estimate the number of Democrats ($N_j \: \hat{\theta}_{{\rm Democrat}, j}$), Republicans ($N_j \: \hat{\theta}_{{\rm Republican}, j}$), and Independents ($N_j \: \hat{\theta}_{{\rm Independent}, j}$), obtaining an extended poststratification table in which each cell has been expanded into three. That is, if the original poststratification table had $J$ rows (e.g. 12,000 in our case), the new one will have $3 J$ (e.g. 36,000). There is, however, a certain complication that must be taken into account. The model-based estimates for the proportion of Democrats, Republicans, and Independents are not single numbers, but several draws from the posterior distribution that capture the uncertainty about these estimates. For instance, if we have 500 draws for $\hat{\theta}_{{\rm Democrat}, j}$, $\hat{\theta}_{{\rm Republican}, j}$, and $\hat{\theta}_{{\rm Independent}, j}$, we can imagine 500 poststratification tables with different numbers for each cell. 246 | 247 | ```{r} 248 | # Use posterior_epred to predict partisanship for original poststratification table 249 | pred_mat <- brms::posterior_epred(fit_party, newdata = poststrat_df, ndraws = 500, transform = TRUE) 250 | 251 | # Extend poststratification table 252 | poststrat_df_threefold <- poststrat_df[rep(seq_len(nrow(poststrat_df)), each = 3), ] 253 | poststrat_df_threefold$party <- rep(c("Democrat", "Republican", "Independent"), nrow(poststrat_df)) 254 | 255 | # Calculate new numbers for the cells of the new poststratification table. K 256 | # is a matrix containing 36000 rows (one for each cell of the poststratification table) 257 | # and 500 columns (corresponding to the 500 draws). 258 | K_theta <- apply(pred_mat, 1, function(x){as.vector(t(x))}) 259 | K <- K_theta * rep(poststrat_df$n, each = 3) 260 | ``` 261 | 262 | 263 | 270 | 271 |
272 | 273 | In sum, we started with a poststratification table with 12,000 rows. Here we can see the first three rows: 274 | 275 | ```{r, echo=FALSE} 276 | poststrat_df[1:3,1:6] %>% kable() %>% kable_styling(full_width = TRUE) 277 | ``` 278 | 279 | We have used a model-based approach to include partisanship in this poststratification table, that now has 36,000 rows (again, each row in the original table has been split into three). However, in order to consider the uncertainty in these model-based estimates we have actually built 500 different poststratification tables. Here we show the first 9 rows of one of these 500 poststratification tables: 280 | 281 | ```{r, echo=FALSE} 282 | poststrat_df_threefold[1:9,c(1, 2, 3, 4, 5, 9, 6)] %>% mutate(n = round(K[1:9, 1], 1)) %>% kable() %>% kable_styling(full_width = TRUE) 283 | ``` 284 | 285 | ## Adjusting for Nonresponse Bias 286 | 287 | We have described how to extend the poststratification table by including partisanship. Now, we will use this poststratification table to adjust for differential party nonresponse. 288 | 289 | ### Setting up example with an artificially nonrepresentative sample 290 | 291 | To demostraty how non-census MRP can adjust for party, we will use a survey that is biased with respect to party. As we are already familiar with the CCES dataset, what we are going to do is to take a different sample of 5,000 respondents that simulates a high nonresponse rate among Republicans and, to a lesser degree, Independents. 292 | 293 | ```{r} 294 | # Random sample of 5,000 that weights by party 295 | cces_biased_df <- cces_all_df %>% sample_n(5000, weight = I((cces_all_df$party=="Democrat")*1 + 296 | (cces_all_df$party=="Independent")*0.75 + 297 | (cces_all_df$party=="Republican")*0.5)) 298 | ``` 299 | 300 | Previously, we saw that the national average support for requiring companies to cover abortion in their insurance plans was around `r round(mean(cces_all_df$abortion, na.rm = TRUE), 3)*100`% according to the CCES. Comparatively, this biased sample of the CCES gives an estimate of `r round(mean(cces_biased_df$abortion, na.rm = TRUE), 3)*100`%. This is not surprising, as missing Republicans and Independents in the survey should reduce support for the employers' right to decline abortion coverage. 301 | 302 | ### Standard MRP 303 | 304 | We fit a standard MRP (i.e. without including party) on the nonrepresentative sample, using the same model as in the MRP introduction and the non-extended poststratification table. 305 | 306 | ```{r, cache=FALSE, warning=FALSE, message=FALSE, eval=FALSE} 307 | fit_abortion_standard <- stan_glmer(abortion ~ (1 | state) + (1 | eth) + (1 | age) + (1 | educ) + male + 308 | (1 | male:eth) + (1 | educ:age) + (1 | educ:eth) + 309 | repvote + factor(region), 310 | family = binomial(link = "logit"), 311 | data = cces_biased_df, 312 | prior = normal(0, 1, autoscale = TRUE), 313 | prior_covariance = decov(scale = 0.50), 314 | adapt_delta = 0.99, 315 | seed = 1010) 316 | ``` 317 | 318 | ```{r, echo = FALSE, eval=TRUE} 319 | # we save the model for future use. By default we do not retrain the model and 320 | # save it, only retrieving the previously file version. To train the model again, 321 | # simply change eval=TRUE in the previous cell and eval=FALSE in this one. 322 | #saveRDS(fit_abortion_standard, file = "data_public/chapter2/models/fit_abortion_standard.rds") 323 | fit_abortion_standard <- readRDS("data_public/chapter2/models/fit_abortion_standard.rds") 324 | ``` 325 | 326 | ```{r} 327 | standard_epred_mat <- rstanarm::posterior_epred(fit_abortion_standard, newdata = poststrat_df, draws = 500) 328 | standard_mrp_estimates_vector <- (standard_epred_mat %*% poststrat_df$n)/sum(poststrat_df$n) 329 | ``` 330 | 331 | ```{r, echo = FALSE} 332 | cat("Standard MRP estimate mean, sd: ", round(mean(standard_mrp_estimates_vector), 3), round(sd(standard_mrp_estimates_vector), 3)) 333 | ``` 334 | 335 | The standard MRP with the nonrepresentative sample gives a national-level estimate of `r 100*round(mean(standard_mrp_estimates_vector), 3)`% ($\pm$ `r 100*round(sd(standard_mrp_estimates_vector), 3)`%). As this estimate does not consider partisanship, standard MRP is not being able to adjust for the smaller statement support that results from oversampling Democrats. 336 | 337 | ### Non-census MRP with partisanship as a predictor 338 | 339 | In the first section we have created a poststratification table that contains partisanship. After doing this, the next step of the non-census MRP approach is to fit the same model as we did in the standard MRP, but also including party as a predictor: 340 | 341 | $$ 342 | Pr(y_i = 1) = logit^{-1}( 343 | \alpha_{\rm s[i]}^{\rm state} 344 | + \alpha_{\rm a[i]}^{\rm age} 345 | + \alpha_{\rm r[i]}^{\rm eth} 346 | + \alpha_{\rm e[i]}^{\rm educ} 347 | + \beta^{\rm male} \cdot {\rm Male}_{\rm i} 348 | + \alpha_{\rm g[i], r[i]}^{\rm male.eth} 349 | + \alpha_{\rm e[i], a[i]}^{\rm educ.age} 350 | + \alpha_{\rm e[i], r[i]}^{\rm educ.eth} 351 | + \alpha_{\rm p[i]}^{\rm party} 352 | ) 353 | $$ 354 | 355 | $$ 356 | \begin{align*} 357 | \alpha_{\rm s}^{\rm state} &\sim {\rm Normal}(\gamma^0 + \gamma^{\rm south} \cdot {\rm South}_{\rm s} + \gamma^{\rm midwest} \cdot {\rm Midwest}_{\rm s} + \gamma^{\rm west} \cdot {\rm West}_{\rm s} + \gamma^{\rm repvote} \cdot {\rm RepVote}_{\rm s}, \sigma_{\rm state}) \textrm{ for s = 1,...,50}\\ 358 | \alpha_{\rm a}^{\rm age} & \sim {\rm Normal}(0,\sigma_{\rm age}) \textrm{ for a = 1,...,6}\\ 359 | \alpha_{\rm r}^{\rm eth} & \sim {\rm Normal}(0,\sigma_{\rm eth}) \textrm{ for r = 1,...,4}\\ 360 | \alpha_{\rm e}^{\rm educ} & \sim {\rm Normal}(0,\sigma_{\rm educ}) \textrm{ for e = 1,...,5}\\ 361 | \alpha_{\rm g,r}^{\rm male.eth} & \sim {\rm Normal}(0,\sigma_{\rm male.eth}) \textrm{ for g = 1,2 and r = 1,...,4}\\ 362 | \alpha_{\rm e,a}^{\rm educ.age} & \sim {\rm Normal}(0,\sigma_{\rm educ.age}) \textrm{ for e = 1,...,5 and a = 1,...,6}\\ 363 | \alpha_{\rm e,r}^{\rm educ.eth} & \sim {\rm Normal}(0,\sigma_{\rm educ.eth}) \textrm{ for e = 1,...,5 and r = 1,...,4}\\ 364 | \alpha_{\rm p}^{\rm party} & \sim {\rm Normal}(0,\sigma_{\rm party}) \textrm{ for p = 1,2,3}\\ 365 | \end{align*} 366 | $$ 367 | 368 | ```{r, cache=FALSE, warning=FALSE, message=FALSE, eval=FALSE} 369 | fit_abortion_noncensus <- stan_glmer(abortion ~ (1 | state) + (1 | eth) + (1 | age) + (1 | educ) + male + 370 | (1 | male:eth) + (1 | educ:age) + (1 | educ:eth) + 371 | repvote + factor(region) + (1 | party), 372 | family = binomial(link = "logit"), 373 | data = cces_biased_df, 374 | prior = normal(0, 1, autoscale = TRUE), 375 | prior_covariance = decov(scale = 0.50), 376 | adapt_delta = 0.99, 377 | seed = 1010) 378 | ``` 379 | 380 | ```{r, echo = FALSE, eval=TRUE} 381 | # we save the model for future use. By default we do not retrain the model and 382 | # save it, only retrieving the previously file version. To train the model again, 383 | # simply change eval=TRUE in the previous cell and eval=FALSE in this one. 384 | #saveRDS(fit_abortion_noncensus, file = "data_public/chapter2/models/fit_abortion_noncensus.rds") 385 | fit_abortion_noncensus <- readRDS("data_public/chapter2/models/fit_abortion_noncensus.rds") 386 | ``` 387 | 388 | Using `posterior_epred` allows us to estimate abortion coverage support for each of the cells in the extended poststratification table. As we set `draws = 500`, we obtain 500 estimates for each cell. In standard MRP, we will weight each the statement support estimates for each poststratification cell by the number of people in that cell according to the model-based estimates obtained in the previous section. However, as in this case the number of people in each cell was estimated with uncertainty, we need to propagate the uncertainty in the first (party prediction) model to the final MRP estimates. Essentially, what we can do is randomly pick one of the 500 statement support estimates for each poststratification cell (i.e. a 36,000 vector) we have just obtained and weight it by one of the 500 poststratification tables that resulted from the first model. Repeating the process for the remaining draws gives us a distribution of 500 MRP estimates for national support that correctly captures the uncertainty in the two models. 389 | 390 | ```{r} 391 | # Use posterior_epred to predict stance on abortion insurance coverage for extended poststratification table 392 | noncensus_epred_mat <- rstanarm::posterior_epred(fit_abortion_noncensus, newdata = poststrat_df_threefold, draws = 500) 393 | 394 | # Calculate national MRP estimates propagating uncertainty from the two models 395 | noncensus_mrp_estimates_vector <- colSums(t(noncensus_epred_mat)*K) / sum(K[,1]) 396 | ``` 397 | 398 | ```{r, echo = FALSE} 399 | cat("Noncensus MRP estimate mean, sd: ", round(mean(noncensus_mrp_estimates_vector), 3), round(sd(noncensus_mrp_estimates_vector), 3)) 400 | ``` 401 | 402 | 403 | 416 | 417 |
418 | 419 | Our national-level estimate for the right to exclude abortion coverage from employer-sponsored insurance resulting from this non-census variable MRP is `r 100*round(mean(noncensus_mrp_estimates_vector), 3)`% (`r 100*round(sd(noncensus_mrp_estimates_vector), 3)`%). Unsurprisingly, this is much closer to the full (unbiased) 60,000 participant survey (`r 100*round(mean(cces_all_df$abortion), 3)` $\pm$ `r 100*round(sqrt(mean(cces_all_df$abortion)*(1-mean(cces_all_df$abortion))/nrow(cces_all_df)), 3)`%) than the standard MRP estimate seen above (`r 100*round(mean(standard_mrp_estimates_vector), 3)` $\pm$ `r 100*round(sd(standard_mrp_estimates_vector), 3)`%). Using an extended poststratification table that contained partisanship allowed us to adjust for differential partisan nonresponse. 420 | 421 | Of course, we can also obtain state-level estimates and compare standard MRP with non-census MRP. 422 | 423 | ```{r} 424 | # Create empty dataframe 425 | state_estimates_df <- data.frame( 426 | state = state_abb, 427 | standard_mrp_state_estimate = NA, 428 | standard_mrp_state_estimate_se = NA, 429 | noncensus_mrp_state_estimate = NA, 430 | noncensus_mrp_state_estimate_se = NA, 431 | full_cces_state_estimate = NA, 432 | full_cces_state_estimate_se = NA, 433 | n_full = NA 434 | ) 435 | 436 | # Loop to populate the dataframe 437 | for(i in 1:nrow(state_estimates_df)) { 438 | # Filtering condition for standard_epred_mat (12,000 rows) 439 | filtering_condition <- which(poststrat_df$state == state_estimates_df$state[i]) 440 | # Filtering condition for noncensus_epred_mat (36,000 rows) 441 | filtering_condition_threefold <- which(poststrat_df_threefold$state == state_estimates_df$state[i]) 442 | 443 | # Standard MRP estimate 444 | state_standard_epred_mat <- standard_epred_mat[ ,filtering_condition] 445 | k_filtered <- poststrat_df[filtering_condition, ]$n 446 | standard_mrp_state_estimates_vector <- state_standard_epred_mat %*% k_filtered / sum(k_filtered) 447 | state_estimates_df$standard_mrp_state_estimate[i] <- mean(standard_mrp_state_estimates_vector) 448 | state_estimates_df$standard_mrp_state_estimate_se[i] <- sd(standard_mrp_state_estimates_vector) 449 | 450 | # Noncensus MRP estimate 451 | state_noncensus_epred_mat <- noncensus_epred_mat[ ,filtering_condition_threefold] 452 | K_filtered <- K[filtering_condition_threefold, ] 453 | noncensus_mrp_state_estimates_vector <- colSums(t(state_noncensus_epred_mat)*K_filtered) / colSums(K_filtered) 454 | state_estimates_df$noncensus_mrp_state_estimate[i] <- mean(noncensus_mrp_state_estimates_vector) 455 | state_estimates_df$noncensus_mrp_state_estimate_se[i] <- sd(noncensus_mrp_state_estimates_vector) 456 | 457 | # Full survey estimate 458 | state_estimates_df$full_cces_state_estimate[i] <- mean(filter(cces_all_df, state==state_estimates_df$state[i])$abortion) 459 | state_estimates_df$n_full[i] <- nrow(filter(cces_all_df, state==state_estimates_df$state[i])) 460 | state_estimates_df$full_cces_state_estimate_se[i] <- get_se_bernoulli(state_estimates_df$full_cces_state_estimate[i], state_estimates_df$n_full[i]) 461 | } 462 | ``` 463 | 464 | ```{r, fig.width=10, fig.height=3.5, warning=FALSE, message=FALSE, results = 'hide', echo=FALSE} 465 | states_order <- poststrat_df %>% group_by(state) %>% summarise(repvote = first(repvote)) %>% arrange(repvote) %>% .$state 466 | state_estimates_df$state <- factor(state_estimates_df$state, levels = states_order, ordered = TRUE) 467 | 468 | compare1 <- ggplot(data=state_estimates_df) + 469 | geom_point(aes(x=state, y=standard_mrp_state_estimate), color = "#E37B1C") + 470 | geom_errorbar(aes(ymin=standard_mrp_state_estimate - 2*standard_mrp_state_estimate_se, 471 | ymax=standard_mrp_state_estimate + 2*standard_mrp_state_estimate_se, 472 | x=state), alpha=.5, width = 0, color = "#E37B1C") + 473 | geom_point(data=state_estimates_df, aes(x=state, y=noncensus_mrp_state_estimate), color = "#7B1CE3") + 474 | geom_errorbar(data=state_estimates_df, aes(ymin=noncensus_mrp_state_estimate - 2*noncensus_mrp_state_estimate_se, 475 | ymax=noncensus_mrp_state_estimate + 2*noncensus_mrp_state_estimate_se, 476 | x=state), alpha=.5, width = 0, color = "#7B1CE3") + 477 | geom_point(aes(x=state, y=full_cces_state_estimate), color = "#1CE37B") + 478 | geom_errorbar(data=state_estimates_df, aes(ymin=full_cces_state_estimate - 2*full_cces_state_estimate_se, 479 | ymax=full_cces_state_estimate + 2*full_cces_state_estimate_se, 480 | x=state), alpha=.5, width = 0, color = "#1CE37B") + 481 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 482 | labels=c("0%","25%","50%","75%","100%"), 483 | expand=c(0,0))+ 484 | coord_cartesian(ylim=c(0, 1)) + 485 | theme_bw()+ 486 | labs(x="States",y="Support")+ 487 | theme(legend.position="none", 488 | axis.title=element_text(size=10), 489 | axis.text.y=element_text(size=10), 490 | axis.text.x=element_text(angle=90,size=8, vjust=0.3), 491 | legend.title=element_text(size=10), 492 | legend.text=element_text(size=10)) 493 | 494 | compare2 <- ggplot(data = state_estimates_df) + 495 | geom_point(aes(y=mean(standard_mrp_estimates_vector), x = .25), color = "#E37B1C") + 496 | geom_errorbar(data=state_estimates_df, aes(y = mean(standard_mrp_estimates_vector), 497 | x = .25, 498 | ymin = mean(standard_mrp_estimates_vector) - 2*sd(standard_mrp_estimates_vector), 499 | ymax = mean(standard_mrp_estimates_vector) + 2*sd(standard_mrp_estimates_vector)), 500 | width = 0, color = "#E37B1C") + 501 | geom_text(data = data.frame(), aes(x = Inf, y = mean(standard_mrp_estimates_vector) - 0.03, label = "Standard MRP"), 502 | hjust = -.05, size = 4, color = "#E37B1C") + 503 | geom_point(aes(y = mean(noncensus_mrp_estimates_vector), x = .75), color = "#7B1CE3") + 504 | geom_errorbar(aes(y = mean(noncensus_mrp_estimates_vector), 505 | x = .75, 506 | ymin = mean(noncensus_mrp_estimates_vector) - 2*sd(noncensus_mrp_estimates_vector), 507 | ymax = mean(noncensus_mrp_estimates_vector) + 2*sd(noncensus_mrp_estimates_vector)), 508 | width = 0, color = "#7B1CE3") + 509 | geom_text(data = data.frame(), aes(x = Inf, y = mean(noncensus_mrp_estimates_vector) - 0.01, label = "Non-census MRP"), 510 | hjust = -.05, size = 4, color = "#7B1CE3") + 511 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 512 | labels=c("0%","25%","50%","75%","100%"), 513 | limits=c(0,1),expand=c(0,0)) + 514 | geom_point(data = data.frame(), aes(y=mean(cces_all_df$abortion), x = .5), color = "#1CE37B") + 515 | geom_errorbar(data = data.frame(), aes(y = mean(cces_all_df$abortion), 516 | x = .5, 517 | ymin = mean(cces_all_df$abortion) - 2*sqrt(mean(cces_all_df$abortion)*(1-mean(cces_all_df$abortion))/nrow(cces_all_df)), 518 | ymax = mean(cces_all_df$abortion) + 2*sqrt(mean(cces_all_df$abortion)*(1-mean(cces_all_df$abortion))/nrow(cces_all_df))), 519 | width = 0, color = "#1CE37B") + 520 | geom_text(data = data.frame(), aes(x = Inf, y = mean(cces_all_df$abortion)+0.05, label = "Complete Survey"), 521 | hjust = -.06, size = 4, color = "#1CE37B") + 522 | scale_y_continuous(breaks=c(0,.25,.5,.75,1), 523 | labels=c("0%","25%","50%","75%","100%"), 524 | limits=c(0,1),expand=c(0,0))+ 525 | scale_x_continuous(limits=c(0,1),expand=c(0,0), breaks=c(.25, .75)) + 526 | coord_cartesian(clip = 'off') + 527 | theme_bw() + 528 | labs(x="Population",y="")+ 529 | theme(legend.position="none", 530 | axis.title.y=element_blank(), 531 | axis.title.x=element_text(size=10, margin = margin(t = 19, r = 0, b = , l = 0)), 532 | axis.text=element_blank(), 533 | axis.ticks=element_blank(), 534 | legend.title=element_text(size=10), 535 | legend.text=element_text(size=10), 536 | plot.margin = margin(5.5, 105, 5.5, 5.5, "pt") 537 | ) 538 | 539 | bayesplot_grid(compare1,compare2, 540 | grid_args = list(nrow=1, widths = c(5,1.4))) 541 | ``` 542 | 543 | In general, we see that the estimates from the standard MRP are upwardly biased with respect to the 60,000 survey estimates. Conversely, the MRP with non-census variables is able to adjust for the differential partisan nonresponse. 544 | 545 | ## Obtaining Estimates for Non-census Variable Subgroups 546 | 547 | Even if we do not suspect that our survey population is different from our target population with respect to a non-census variable, using non-census MRP can allow us to obtain different estimates for the levels of the non-census variable. Here, we obtain and plot support for declining coverage of abortions by state and party within state. 548 | 549 | ```{r, warning=FALSE} 550 | subgroup_estimates_df <- cces_biased_df %>% expand(state, party) %>% 551 | mutate(noncensus_mrp_subgroup_estimate = NA, 552 | noncensus_mrp_subgroup_estimate_se = NA) 553 | 554 | for(i in 1:nrow(subgroup_estimates_df)) { 555 | filtering_condition_threefold <- which(poststrat_df_threefold$state == subgroup_estimates_df$state[i] & 556 | poststrat_df_threefold$party == subgroup_estimates_df$party[i]) 557 | subgroup_noncensus_epred_mat <- noncensus_epred_mat[ ,filtering_condition_threefold] 558 | K_filtered <- K[filtering_condition_threefold, ] 559 | noncensus_mrp_subgroup_estimates_vector <- colSums(t(subgroup_noncensus_epred_mat)*K_filtered) / colSums(K_filtered) 560 | 561 | subgroup_estimates_df$noncensus_mrp_subgroup_estimate[i] <- mean(noncensus_mrp_subgroup_estimates_vector) 562 | subgroup_estimates_df$noncensus_mrp_subgroup_estimate_se[i] <- sd(noncensus_mrp_subgroup_estimates_vector) 563 | } 564 | ``` 565 | 566 | ```{r, message=FALSE, warning=FALSE, echo=FALSE, cache=FALSE, fig.height=3.25, fig.width=12, fig.align = "center"} 567 | states_map <- us_map(regions = "states") 568 | subgroup_estimates_df_melted <- subgroup_estimates_df %>% select(state, noncensus_mrp_subgroup_estimate, party) 569 | states_map <- left_join(states_map, subgroup_estimates_df_melted, by = c("abbr" = "state")) %>% drop_na() 570 | 571 | ggplot(states_map, aes(x = x, y = y, group = group)) + 572 | geom_polygon(colour = "lightgray") + 573 | geom_polygon(aes(fill = noncensus_mrp_subgroup_estimate)) + theme_void() + facet_grid(cols = vars(party)) + 574 | scale_fill_gradient2(midpoint = 0.5, limits = c(0, 1), breaks = c(0, .5, 1), 575 | name = "Support", low = muted("blue"), high = muted("red")) + 576 | theme(legend.margin=margin(l = 0.5, unit='cm')) 577 | 578 | ``` 579 | 580 | -------------------------------------------------------------------------------- /04-references.Rmd: -------------------------------------------------------------------------------- 1 | `r if (knitr:::is_html_output()) ' 2 | # References {-} 3 | '` 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MRP Case Studies 2 | 3 | Current version hosted [here](https://bookdown.org/jl5522/MRP-case-studies/) 4 | -------------------------------------------------------------------------------- /_bookdown.yml: -------------------------------------------------------------------------------- 1 | book_filename: "MRP-case-studies" 2 | language: 3 | ui: 4 | chapter_name: "Chapter " 5 | delete_merged_file: true 6 | -------------------------------------------------------------------------------- /_build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | set -ev 4 | 5 | Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::gitbook')" 6 | Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::pdf_book')" 7 | Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::epub_book')" 8 | 9 | -------------------------------------------------------------------------------- /_deploy.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | set -e 4 | 5 | [ -z "${GITHUB_PAT}" ] && exit 0 6 | [ "${TRAVIS_BRANCH}" != "master" ] && exit 0 7 | 8 | git config --global user.email "xie@yihui.name" 9 | git config --global user.name "Yihui Xie" 10 | 11 | git clone -b gh-pages https://${GITHUB_PAT}@github.com/${TRAVIS_REPO_SLUG}.git book-output 12 | cd book-output 13 | cp -r ../_book/* ./ 14 | git add --all * 15 | git commit -m"Update the book" || true 16 | git push -q origin gh-pages 17 | -------------------------------------------------------------------------------- /_output.yml: -------------------------------------------------------------------------------- 1 | bookdown::gitbook: 2 | css: style.css 3 | config: 4 | toc: 5 | collapse: section 6 | before: | 7 |
  • MRP Case Studies
  • 8 | after: | 9 |
  • Code and data
  • 10 | fontsettings: 11 | theme: white 12 | family: serif 13 | size: 1 14 | edit: https://github.com/JuanLopezMartin/MRPCaseStudy/edit/master/%s 15 | sharing: 16 | github: yes 17 | facebook: no 18 | -------------------------------------------------------------------------------- /book.bib: -------------------------------------------------------------------------------- 1 | @Book{xie2015, 2 | title = {Dynamic Documents with {R} and knitr}, 3 | author = {Yihui Xie}, 4 | publisher = {Chapman and Hall/CRC}, 5 | address = {Boca Raton, Florida}, 6 | year = {2015}, 7 | edition = {2nd}, 8 | note = {ISBN 978-1498716963}, 9 | url = {http://yihui.name/knitr/}, 10 | } 11 | -------------------------------------------------------------------------------- /data_public/chapter1/data/cces18_common_vv.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JuanLopezMartin/MRPCaseStudy/fe2dfefc1bec7559bfb37f515cc03b0a67b567c0/data_public/chapter1/data/cces18_common_vv.csv.gz -------------------------------------------------------------------------------- /data_public/chapter1/data/statelevel_predictors.csv: -------------------------------------------------------------------------------- 1 | state,repvote,region 2 | AL,0.643741436,South 3 | AK,0.583856547,West 4 | AZ,0.518900234,West 5 | AR,0.642851377,South 6 | CA,0.338717892,West 7 | CO,0.473166666,West 8 | CT,0.428584525,Northeast 9 | DE,0.440013786,South 10 | FL,0.506188355,South 11 | GA,0.526611726,South 12 | HI,0.325586625,West 13 | ID,0.683101767,West 14 | IL,0.409799486,North Central 15 | IN,0.601173095,North Central 16 | IA,0.550635478,North Central 17 | KS,0.611114703,North Central 18 | KY,0.65670629,South 19 | LA,0.601716772,South 20 | ME,0.484032089,Northeast 21 | MD,0.359837503,South 22 | MA,0.353487213,Northeast 23 | MI,0.501176682,North Central 24 | MN,0.491681431,North Central 25 | MS,0.590898473,South 26 | MO,0.59818561,North Central 27 | MT,0.611096643,West 28 | NE,0.635476741,North Central 29 | NV,0.487062906,West 30 | NH,0.498029716,Northeast 31 | NJ,0.427158099,Northeast 32 | NM,0.453492051,West 33 | NY,0.382275815,Northeast 34 | NC,0.519037458,South 35 | ND,0.698092429,North Central 36 | OH,0.542676846,North Central 37 | OK,0.693047372,South 38 | OR,0.438441611,West 39 | PA,0.503755358,Northeast 40 | RI,0.416892959,Northeast 41 | SC,0.574602564,South 42 | SD,0.659718581,North Central 43 | TN,0.63624343,South 44 | TX,0.547132256,South 45 | UT,0.623836582,West 46 | VT,0.348135737,Northeast 47 | VA,0.471736237,South 48 | WA,0.412130688,West 49 | WV,0.721610523,South 50 | WI,0.50407989,North Central 51 | WY,0.757053196,West 52 | -------------------------------------------------------------------------------- /data_public/chapter1/models/fit_mrp_1.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JuanLopezMartin/MRPCaseStudy/fe2dfefc1bec7559bfb37f515cc03b0a67b567c0/data_public/chapter1/models/fit_mrp_1.rds -------------------------------------------------------------------------------- /data_public/chapter1/models/fit_mrp_2.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JuanLopezMartin/MRPCaseStudy/fe2dfefc1bec7559bfb37f515cc03b0a67b567c0/data_public/chapter1/models/fit_mrp_2.rds -------------------------------------------------------------------------------- /data_public/chapter1/screenshots/screenshot1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JuanLopezMartin/MRPCaseStudy/fe2dfefc1bec7559bfb37f515cc03b0a67b567c0/data_public/chapter1/screenshots/screenshot1.png -------------------------------------------------------------------------------- /data_public/chapter1/screenshots/screenshot2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JuanLopezMartin/MRPCaseStudy/fe2dfefc1bec7559bfb37f515cc03b0a67b567c0/data_public/chapter1/screenshots/screenshot2.png -------------------------------------------------------------------------------- /data_public/chapter1/screenshots/screenshot3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JuanLopezMartin/MRPCaseStudy/fe2dfefc1bec7559bfb37f515cc03b0a67b567c0/data_public/chapter1/screenshots/screenshot3.png -------------------------------------------------------------------------------- /data_public/chapter1/screenshots/screenshot4.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JuanLopezMartin/MRPCaseStudy/fe2dfefc1bec7559bfb37f515cc03b0a67b567c0/data_public/chapter1/screenshots/screenshot4.PNG -------------------------------------------------------------------------------- /data_public/chapter1/screenshots/screenshot5.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JuanLopezMartin/MRPCaseStudy/fe2dfefc1bec7559bfb37f515cc03b0a67b567c0/data_public/chapter1/screenshots/screenshot5.PNG -------------------------------------------------------------------------------- /data_public/chapter2/models/fit_abortion_noncensus.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JuanLopezMartin/MRPCaseStudy/fe2dfefc1bec7559bfb37f515cc03b0a67b567c0/data_public/chapter2/models/fit_abortion_noncensus.rds -------------------------------------------------------------------------------- /data_public/chapter2/models/fit_abortion_standard.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JuanLopezMartin/MRPCaseStudy/fe2dfefc1bec7559bfb37f515cc03b0a67b567c0/data_public/chapter2/models/fit_abortion_standard.rds -------------------------------------------------------------------------------- /data_public/chapter2/models/fit_party_example.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JuanLopezMartin/MRPCaseStudy/fe2dfefc1bec7559bfb37f515cc03b0a67b567c0/data_public/chapter2/models/fit_party_example.rds -------------------------------------------------------------------------------- /data_public/chapter3/idealpoint.stan: -------------------------------------------------------------------------------- 1 | // 2 | // Ideal Point Multilevel Modeling and Postratification 3 | // 4 | 5 | 6 | data { 7 | int J; //Participants 8 | int K; //Questions 9 | int N; //no. of observations 10 | int S; //no. of states 11 | int P; //no. of states 12 | int participant[N]; // Participant for observation n 13 | int question[N]; // Question for observation n 14 | int state[N]; // State for observation n 15 | int age[N]; // Age for observation n 16 | int ethnicity[N]; // Ethnicity for observation n 17 | int educ[N]; // Education for observation n 18 | real male[N]; // Gender for observation n 19 | int region[S]; // Region for state s 20 | real repvote[S]; // Republican voteshare for state s 21 | int y[N]; // Support for observation n 22 | int postrat_state[P]; 23 | int postrat_age[P]; 24 | int postrat_ethnicity[P]; 25 | int postrat_educ[P]; 26 | real postrat_male[P]; 27 | } 28 | parameters { 29 | vector[S] alpha_state_raw; 30 | vector[6] alpha_age_raw; 31 | vector[5] alpha_educ_raw; 32 | vector[4] alpha_ethnicity_raw; 33 | vector[4] alpha_region_raw; 34 | real beta_male; 35 | real beta_repvote; 36 | real sigma_state; 37 | real sigma_age; 38 | real sigma_ethnicity; 39 | real sigma_educ; 40 | real sigma_region; 41 | 42 | real mu_alpha; 43 | real sigma_alpha; 44 | real mu_beta; 45 | real sigma_beta; 46 | real mu_gamma; 47 | real sigma_gamma; 48 | 49 | vector[K] beta_raw; 50 | vector[J] alpha_raw; 51 | vector[K] gamma_raw; 52 | } 53 | transformed parameters{ 54 | vector[6] alpha_age = 0 + sigma_age*alpha_age_raw; 55 | vector[5] alpha_educ = 0 + sigma_educ*alpha_educ_raw; 56 | vector[4] alpha_ethnicity = 0 + sigma_ethnicity*alpha_ethnicity_raw; 57 | vector[4] alpha_region = 0 + sigma_region*alpha_region_raw; 58 | vector[K] beta = mu_beta + sigma_beta*beta_raw; 59 | vector[K] gamma = mu_gamma + sigma_gamma*gamma_raw; 60 | 61 | vector[S] alpha_state; 62 | vector[J] alpha; 63 | 64 | real alpha_mean; 65 | real alpha_sd; 66 | vector[J] alpha_adj; 67 | vector[K] beta_adj; 68 | vector[K] gamma_adj; 69 | 70 | for(s in 1:S) 71 | alpha_state[s] = alpha_region[region[s]] + beta_repvote*repvote[s] + sigma_state*alpha_state_raw[s]; 72 | for (j in 1:J) 73 | alpha[j] = mu_alpha + alpha_state[state[j]] + alpha_age[age[j]] + alpha_ethnicity[ethnicity[j]] + alpha_educ[educ[j]] + beta_male*male[j] + sigma_alpha*alpha_raw[j]; 74 | 75 | alpha_mean = mean(alpha); 76 | alpha_sd = sd(alpha); 77 | alpha_adj = (alpha - alpha_mean)/alpha_sd; 78 | beta_adj = (beta - alpha_mean)/alpha_sd; 79 | gamma_adj = gamma*alpha_sd; 80 | } 81 | 82 | 83 | 84 | model { 85 | //priors on predictors 86 | sigma_state ~ exponential(0.5); // prior for sigma_state 87 | sigma_age ~ exponential(0.5); // prior for sigma_age 88 | sigma_ethnicity ~ exponential(0.5); // prior for sigma_ethnicity 89 | sigma_educ ~ exponential(0.5); // prior for sigma_educ 90 | sigma_region ~ exponential(0.5); // prior for sigma_educ 91 | beta_male ~ normal(0, 2); // prior for beta_male 92 | beta_repvote ~ normal(0, 2); // prior for beta_repvote 93 | 94 | //priors on parameters 95 | mu_beta ~ normal(0, 2); // prior for mu_beta 96 | sigma_beta ~ exponential(1); // prior for sigma_beta 97 | mu_gamma ~ normal(0, 2); // prior for mu_gamma 98 | sigma_gamma ~ exponential(1); // prior for sigma_gamma 99 | 100 | alpha_state_raw ~ std_normal(); // implies alpha_state ~ normal(alpha_region, sigma_state) 101 | alpha_age_raw ~ std_normal(); // implies alpha_age ~ normal(0, sigma_age) 102 | alpha_ethnicity_raw ~ std_normal(); // implies alpha_ethnicity ~ normal(0, sigma_ethnicity) 103 | alpha_educ_raw ~ std_normal(); // implies alpha_educ ~ normal(0, sigma_educ) 104 | alpha_region_raw ~ std_normal(); // implies alpha_region ~ normal(0, sigma_region) 105 | 106 | gamma_raw ~ std_normal(); // implies beta ~ normal(mu_beta, sigma_beta) 107 | beta_raw ~ std_normal(); // implies beta ~ normal(mu_beta, sigma_beta) 108 | alpha_raw ~ std_normal(); // implies alpha ~ normal(mu_alpha + alpha_state + alpha_age + ..., sigma_alpha) 109 | for (n in 1:N) 110 | y[n] ~ bernoulli_logit(gamma_adj[question[n]] * (alpha_adj[participant[n]] - beta_adj[question[n]])); 111 | } 112 | 113 | generated quantities{ 114 | vector[P] alpha_pred_raw; 115 | vector[P] alpha_pred; 116 | 117 | 118 | for (p in 1:P) 119 | alpha_pred_raw[p] = alpha_state[postrat_state[p]] + alpha_age[postrat_age[p]] + alpha_ethnicity[postrat_ethnicity[p]] + alpha_educ[postrat_educ[p]] + beta_male*postrat_male[p]; 120 | 121 | alpha_pred = (alpha_pred_raw - mean(alpha_pred_raw)) / sd(alpha_pred_raw); 122 | } 123 | -------------------------------------------------------------------------------- /index.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Multilevel Regression and Poststratification Case Studies" 3 | date: "2022-05-16" 4 | author: "Juan Lopez-Martin, Justin H. Phillips, and Andrew Gelman" 5 | site: bookdown::bookdown_site 6 | output: 7 | bookdown::gitbook: 8 | css: style.css 9 | config: 10 | toc: 11 | collapse: section 12 | before: | 13 |
  • MRP Case Studies
  • 14 | after: | 15 |
  • Data and code
  • 16 | fontsettings: 17 | theme: white 18 | family: serif 19 | size: 1 20 | edit: https://github.com/JuanLopezMartin/MRPCaseStudy/edit/master/%s 21 | sharing: 22 | github: yes 23 | facebook: no 24 | documentclass: book 25 | bibliography: [book.bib, mrp.bib] 26 | biblio-style: apalike 27 | link-citations: yes 28 | description: "Introduction to Bayesian Multilevel Modeling and Poststratification using rstanarm, brms, and Stan" 29 | --- 30 | 31 | # Preface {-} 32 | 33 | The following case studies intend to introduce users to Multilevel Modeling and Poststratification (MRP) and some of its extensions, providing reusable code and clear explanations. The first section^[The first section corresponds to a draft version of the introductory chapter to _Multilevel Regression and Poststratification: A Practical Guide and New Developments_, and oncoming book on the topic. This chapter has received additional contributions by Shiro Kuriwaki and Jonah Sol Gabry.] presents MRP, a statistical technique that allows to estimate subnational estimates from national surveys while adjusting for nonrepresentativeness. The second chapter extends MRP to overcome the limitation of only using variables included in the census. The last chapter develops a new approach that combines MRP with an ideal point model, allowing to obtain subnational estimates of latent attitudes based on multiple survey questions and improving the subnational estimates for an individual survey item based on other related items. 34 | 35 | These case studies do not display some non-essential code, such as the ones used to generate figures and tables. However, all the code and data is available on the corresponding [GitHub repo](https://github.com/JuanLopezMartin/MRPCaseStudy). 36 | 37 | The tutorials assume certain familiarity with R and Bayesian Statistics. A good reference to the required background is @gelman2020raos. Additionally, multilevel models are covered in @gelman2006data (Part 2A) or @mcelreath2020statistical (Chapters 12 and 13). 38 | 39 | The case studies are still under development. Please send any feedback to [jl5522@columbia.edu](jl5522@columbia.edu). 40 | 41 | -------------------------------------------------------------------------------- /mrp-case.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: knitr 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Website 19 | -------------------------------------------------------------------------------- /mrp.bib: -------------------------------------------------------------------------------- 1 | @article{little1993post, 2 | title={Post-stratification: a modeler's perspective}, 3 | author={Little, Roderick JA}, 4 | journal={Journal of the American Statistical Association}, 5 | volume={88}, 6 | number={423}, 7 | pages={1001--1012}, 8 | year={1993}, 9 | publisher={Taylor \& Francis Group} 10 | } 11 | @article{park2004bayesian, 12 | title={Bayesian multilevel estimation with poststratification: state-level estimates from national polls}, 13 | author={Park, David K and Gelman, Andrew and Bafumi, Joseph}, 14 | journal={Political Analysis}, 15 | volume={12}, 16 | number={4}, 17 | pages={375--385}, 18 | year={2004}, 19 | publisher={Cambridge University Press} 20 | } 21 | 22 | @article{shirley2015hierarchical, 23 | title={Hierarchical models for estimating state and demographic trends in US death penalty public opinion}, 24 | author={Shirley, Kenneth E and Gelman, Andrew}, 25 | journal={Journal of the Royal Statistical Society: Series A (Statistics in Society)}, 26 | volume={178}, 27 | number={1}, 28 | pages={1--28}, 29 | year={2015}, 30 | publisher={Wiley Online Library} 31 | } 32 | 33 | @article{barr2013random, 34 | title={Random effects structure for confirmatory hypothesis testing: Keep it maximal}, 35 | author={Barr, Dale J and Levy, Roger and Scheepers, Christoph and Tily, Harry J}, 36 | journal={Journal of memory and language}, 37 | volume={68}, 38 | number={3}, 39 | pages={255--278}, 40 | year={2013}, 41 | publisher={Elsevier} 42 | } 43 | 44 | @article{ghitza2013deep, 45 | title={Deep interactions with MRP: Election turnout and voting patterns among small electoral subgroups}, 46 | author={Ghitza, Yair and Gelman, Andrew}, 47 | journal={American Journal of Political Science}, 48 | volume={57}, 49 | number={3}, 50 | pages={762--776}, 51 | year={2013}, 52 | publisher={Wiley Online Library} 53 | } 54 | @article{lei20172008, 55 | title={The 2008 election: A preregistered replication analysis}, 56 | author={Lei, Rayleigh and Gelman, Andrew and Ghitza, Yair}, 57 | journal={Statistics and Public Policy}, 58 | pages={1--8}, 59 | year={2017}, 60 | publisher={Taylor \& Francis} 61 | } 62 | 63 | @article{gelman2007struggles, 64 | title={Struggles with survey weighting and regression modeling}, 65 | author={Gelman, Andrew}, 66 | journal={Statistical Science}, 67 | pages={153--164}, 68 | year={2007}, 69 | publisher={JSTOR} 70 | } 71 | 72 | @article{lax2009should, 73 | title={How should we estimate public opinion in the states\?}, 74 | author={Lax, Jeffrey R and Phillips, Justin H}, 75 | journal={American Journal of Political Science}, 76 | volume={53}, 77 | number={1}, 78 | pages={107--121}, 79 | year={2009}, 80 | publisher={Wiley Online Library} 81 | } 82 | 83 | @article{park2004bayesian, 84 | title={Bayesian multilevel estimation with poststratification: state-level estimates from national polls}, 85 | author={Park, David K and Gelman, Andrew and Bafumi, Joseph}, 86 | journal={Political Analysis}, 87 | volume={12}, 88 | number={4}, 89 | pages={375--385}, 90 | year={2004}, 91 | publisher={Cambridge University Press} 92 | } 93 | 94 | @article{gelman2005analysis, 95 | title={Analysis of variance-why it is more important than ever}, 96 | author={Gelman, Andrew and others}, 97 | journal={The annals of statistics}, 98 | volume={33}, 99 | number={1}, 100 | pages={1--53}, 101 | year={2005}, 102 | publisher={Institute of Mathematical Statistics} 103 | } 104 | @article{si2017bayesian, 105 | title={Bayesian hierarchical weighting adjustment and survey inference}, 106 | author={Si, Yajuan and Trangucci, Rob and Gabry, Jonah Sol and Gelman, Andrew}, 107 | journal={arXiv preprint arXiv:1707.08220}, 108 | year={2017} 109 | } 110 | 111 | @article{lax2009states, 112 | title={How should we estimate public opinion in the states?}, 113 | author={Lax, Jeffrey R and Phillips, Justin H}, 114 | journal={American Journal of Political Science}, 115 | volume={53}, 116 | number={1}, 117 | pages={107--121}, 118 | year={2009}, 119 | publisher={Wiley Online Library} 120 | } 121 | @article{lax2009gay, 122 | title={Gay rights in the states: Public opinion and policy responsiveness}, 123 | author={Lax, Jeffrey R and Phillips, Justin H}, 124 | journal={American Political Science Review}, 125 | volume={103}, 126 | number={3}, 127 | pages={367--386}, 128 | year={2009}, 129 | publisher={Cambridge University Press} 130 | } 131 | @article{wang2015xbox, 132 | title={Forecasting elections with non-representative polls}, 133 | author={Wang, Wei and Rothschild, David and Goel, Sharad and Gelman, Andrew}, 134 | journal={International Journal of Forecasting}, 135 | volume={31}, 136 | number={3}, 137 | pages={980--991}, 138 | year={2015}, 139 | publisher={Elsevier} 140 | } 141 | @article{buttice2013mrp, 142 | title={How does multilevel regression and poststratification perform with conventional national surveys?}, 143 | author={Buttice, Matthew K and Highton, Benjamin}, 144 | journal={Political analysis}, 145 | volume={21}, 146 | number={4}, 147 | year={2013} 148 | } 149 | @article{kastellec2010primer, 150 | title={Estimating state public opinion with multi-level regression and poststratification using R}, 151 | author={Kastellec, Jonathan P and Lax, Jeffrey R and Phillips, Justin H}, 152 | journal={Unpublished manuscript, Princeton University}, 153 | year={2010} 154 | } 155 | @book{gelman2007data, 156 | title={Data analysis using regression and multilevelhierarchical models}, 157 | author={Gelman, Andrew and Hill, Jennifer}, 158 | volume={1}, 159 | year={2007}, 160 | publisher={Cambridge University Press New York, NY, USA} 161 | } 162 | @article{downes2018multilevel, 163 | title={Multilevel regression and poststratification: A modeling approach to estimating population quantities from highly selected survey samples}, 164 | author={Downes, Marnie and Gurrin, Lyle C and English, Dallas R and Pirkis, Jane and Currier, Dianne and Spittal, Matthew J and Carlin, John B}, 165 | journal={American journal of epidemiology}, 166 | volume={187}, 167 | number={8}, 168 | pages={1780--1790}, 169 | year={2018}, 170 | publisher={Oxford University Press} 171 | } 172 | @article{kiewiet2018predicting, 173 | title={Predicting state presidential election results using national tracking polls and multilevel regression with poststratification (MRP)}, 174 | author={Kiewiet de Jonge, Chad P and Langer, Gary and Sinozich, Sofi}, 175 | journal={Public Opinion Quarterly}, 176 | volume={82}, 177 | number={3}, 178 | pages={419--446}, 179 | year={2018}, 180 | publisher={Oxford University Press US} 181 | } 182 | @book{gelman2020raos, 183 | title={Regression and other stories}, 184 | author={Gelman, Andrew and Hill, Jennifer and Vehtari, Aki}, 185 | year={2020}, 186 | publisher={Cambridge University Press} 187 | } 188 | @book{mcelreath2020statistical, 189 | title={Statistical rethinking: A Bayesian course with examples in R and Stan}, 190 | author={McElreath, Richard}, 191 | year={2020}, 192 | publisher={CRC press} 193 | } 194 | @book{gelman2006data, 195 | title={Data analysis using regression and multilevel/hierarchical models}, 196 | author={Gelman, Andrew and Hill, Jennifer}, 197 | year={2006}, 198 | publisher={Cambridge university press} 199 | } 200 | @article{gao2020structuredpriors, 201 | title={Improving multilevel regression and poststratification with structured priors}, 202 | author={Gao, Yuxiang and Kennedy, Lauren and Simpson, Daniel and Gelman, Andrew and others}, 203 | journal={Bayesian Analysis}, 204 | year={2020}, 205 | publisher={International Society for Bayesian Analysis} 206 | } 207 | @article{betancourt2017hmc, 208 | title={A conceptual introduction to Hamiltonian Monte Carlo}, 209 | author={Betancourt, Michael}, 210 | journal={arXiv preprint arXiv:1701.02434}, 211 | year={2017} 212 | } 213 | @article{kastellec2015polarizing, 214 | title={Polarizing the electoral connection: Partisan representation in Supreme Court confirmation politics}, 215 | author={Kastellec, Jonathan P and Lax, Jeffrey R and Malecki, Michael and Phillips, Justin H}, 216 | journal={The journal of politics}, 217 | volume={77}, 218 | number={3}, 219 | pages={787--804}, 220 | year={2015}, 221 | publisher={University of Chicago Press Chicago, IL} 222 | } 223 | @article{tausanovitch2013ideal, 224 | title={Measuring constituent policy preferences in congress, state legislatures, and cities}, 225 | author={Tausanovitch, Chris and Warshaw, Christopher}, 226 | journal={The Journal of Politics}, 227 | volume={75}, 228 | number={2}, 229 | pages={330--342}, 230 | year={2013}, 231 | publisher={Cambridge University Press New York, USA} 232 | } 233 | @article{caughey2015dynamic, 234 | title={Dynamic estimation of latent opinion using a hierarchical group-level IRT model}, 235 | author={Caughey, Devin and Warshaw, Christopher}, 236 | journal={Political Analysis}, 237 | pages={197--211}, 238 | year={2015}, 239 | publisher={JSTOR} 240 | } 241 | @article{caughey2018policy, 242 | title={Policy preferences and policy change: Dynamic responsiveness in the American states, 1936--2014}, 243 | author={Caughey, Devin and Warshaw, Christopher}, 244 | year={2018}, 245 | publisher={Cambridge University Press (CUP)} 246 | } 247 | @article{bergquist2019does, 248 | title={Does global warming increase public concern about climate change?}, 249 | author={Bergquist, Parrish and Warshaw, Christopher}, 250 | journal={The Journal of Politics}, 251 | volume={81}, 252 | number={2}, 253 | pages={686--691}, 254 | year={2019}, 255 | publisher={The University of Chicago Press Chicago, IL} 256 | } 257 | @article{bisbee2019barp, 258 | title={BARP: Improving Mister P Using Bayesian Additive Regression Trees}, 259 | author={Bisbee, James}, 260 | journal={American Political Science Review}, 261 | volume={113}, 262 | number={4}, 263 | pages={1060--1065}, 264 | year={2019}, 265 | publisher={Cambridge University Press} 266 | } 267 | @article{bafumi2005practical, 268 | title={Practical issues in implementing and understanding Bayesian ideal point estimation}, 269 | author={Bafumi, Joseph and Gelman, Andrew and Park, David K and Kaplan, Noah}, 270 | journal={Political Analysis}, 271 | volume={13}, 272 | number={2}, 273 | pages={171--187}, 274 | year={2005}, 275 | publisher={Cambridge University Press} 276 | } 277 | @article{clinton2004statistical, 278 | title={The statistical analysis of roll call data}, 279 | author={Clinton, Joshua and Jackman, Simon and Rivers, Douglas}, 280 | journal={American Political Science Review}, 281 | pages={355--370}, 282 | year={2004}, 283 | publisher={JSTOR} 284 | } 285 | @article{tausanovitch2014representation, 286 | title={Representation in municipal government}, 287 | author={Tausanovitch, Chris and Warshaw, Christopher}, 288 | journal={American Political Science Review}, 289 | pages={605--641}, 290 | year={2014}, 291 | publisher={JSTOR} 292 | } 293 | @book{ipums2020, 294 | title={IPUMS USA: Version 10.0 [dataset]}, 295 | author={Ruggles, Steven and Flood, Sarah and Goeken, Ronald and Grover, Josiah and Meyer, Erin and Pacas, Jose and Sobek, Matthew}, 296 | volume={}, 297 | year={2020}, 298 | publisher={Minneapolis, MN: IPUMS, 2020} 299 | } 300 | @article{ornstein2020stacked, 301 | title={Stacked Regression and Poststratification}, 302 | author={Ornstein, Joseph T}, 303 | journal={Political Analysis}, 304 | volume={28}, 305 | number={2}, 306 | pages={293--301}, 307 | year={2020}, 308 | publisher={Cambridge University Press} 309 | } 310 | @article{gelman1997poststratification, 311 | title={Poststratification into many categories using hierarchical logistic regression}, 312 | author={Gelman, Andrew and Little, Thomas C}, 313 | year={1997}, 314 | journal={Survey Methodology}, 315 | volume={23}, 316 | number={2}, 317 | pages={127–-35} 318 | } 319 | @article{fay1979estimates, 320 | title={Estimates of income for small places: an application of James-Stein procedures to census data}, 321 | author={Fay, Robert E and Herriot, Roger A}, 322 | journal={Journal of the American Statistical Association}, 323 | volume={74}, 324 | number={366a}, 325 | pages={269--277}, 326 | year={1979}, 327 | publisher={Taylor \& Francis} 328 | } 329 | @data{2018CCES, 330 | author = {Schaffner, Brian and Ansolabehere, Stephen and Luks, Sam}, 331 | publisher = {Harvard Dataverse}, 332 | title = {CCES 2018}, 333 | year = {2018}, 334 | url = {https://doi.org/10.7910/DVN/ZSBZ7K} 335 | } -------------------------------------------------------------------------------- /now.json: -------------------------------------------------------------------------------- 1 | { 2 | "type": "static", 3 | "public": true 4 | } 5 | -------------------------------------------------------------------------------- /packages.bib: -------------------------------------------------------------------------------- 1 | @Manual{R-base, 2 | title = {R: A Language and Environment for Statistical Computing}, 3 | author = {{R Core Team}}, 4 | organization = {R Foundation for Statistical Computing}, 5 | address = {Vienna, Austria}, 6 | year = {2020}, 7 | url = {https://www.R-project.org/}, 8 | } 9 | 10 | @Manual{R-bookdown, 11 | title = {bookdown: Authoring Books and Technical Documents with R Markdown}, 12 | author = {Yihui Xie}, 13 | year = {2020}, 14 | note = {R package version 0.21}, 15 | url = {https://github.com/rstudio/bookdown}, 16 | } 17 | 18 | @Manual{R-knitr, 19 | title = {knitr: A General-Purpose Package for Dynamic Report Generation in R}, 20 | author = {Yihui Xie}, 21 | year = {2020}, 22 | note = {R package version 1.30}, 23 | url = {https://yihui.org/knitr/}, 24 | } 25 | 26 | @Manual{R-rmarkdown, 27 | title = {rmarkdown: Dynamic Documents for R}, 28 | author = {JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone}, 29 | year = {2020}, 30 | note = {R package version 2.4}, 31 | url = {https://github.com/rstudio/rmarkdown}, 32 | } 33 | 34 | @Book{bookdown2016, 35 | title = {bookdown: Authoring Books and Technical Documents with {R} Markdown}, 36 | author = {Yihui Xie}, 37 | publisher = {Chapman and Hall/CRC}, 38 | address = {Boca Raton, Florida}, 39 | year = {2016}, 40 | note = {ISBN 978-1138700109}, 41 | url = {https://github.com/rstudio/bookdown}, 42 | } 43 | 44 | @Book{knitr2015, 45 | title = {Dynamic Documents with {R} and knitr}, 46 | author = {Yihui Xie}, 47 | publisher = {Chapman and Hall/CRC}, 48 | address = {Boca Raton, Florida}, 49 | year = {2015}, 50 | edition = {2nd}, 51 | note = {ISBN 978-1498716963}, 52 | url = {https://yihui.org/knitr/}, 53 | } 54 | 55 | @InCollection{knitr2014, 56 | booktitle = {Implementing Reproducible Computational Research}, 57 | editor = {Victoria Stodden and Friedrich Leisch and Roger D. Peng}, 58 | title = {knitr: A Comprehensive Tool for Reproducible Research in {R}}, 59 | author = {Yihui Xie}, 60 | publisher = {Chapman and Hall/CRC}, 61 | year = {2014}, 62 | note = {ISBN 978-1466561595}, 63 | url = {http://www.crcpress.com/product/isbn/9781466561595}, 64 | } 65 | 66 | @Book{rmarkdown2018, 67 | title = {R Markdown: The Definitive Guide}, 68 | author = {Yihui Xie and J.J. Allaire and Garrett Grolemund}, 69 | publisher = {Chapman and Hall/CRC}, 70 | address = {Boca Raton, Florida}, 71 | year = {2018}, 72 | note = {ISBN 9781138359338}, 73 | url = {https://bookdown.org/yihui/rmarkdown}, 74 | } 75 | 76 | @Book{rmarkdown2020, 77 | title = {R Markdown Cookbook}, 78 | author = {Yihui Xie and Christophe Dervieux and Emily Riederer}, 79 | publisher = {Chapman and Hall/CRC}, 80 | address = {Boca Raton, Florida}, 81 | year = {2020}, 82 | note = {ISBN 9780367563837}, 83 | url = {https://bookdown.org/yihui/rmarkdown-cookbook}, 84 | } 85 | 86 | -------------------------------------------------------------------------------- /preamble.tex: -------------------------------------------------------------------------------- 1 | \usepackage[most]{tcolorbox} 2 | \usepackage{amsmath} 3 | \usepackage{tabu} 4 | 5 | \usepackage{titlesec} 6 | \titleformat{\chapter}[display] 7 | {\normalfont\huge\bfseries}{\chaptertitlename\ \thechapter}{20pt}{\Huge} 8 | \titlespacing*{\chapter}{0pt}{-50pt}{20pt} 9 | 10 | \usepackage{fancyhdr} 11 | \pagestyle{fancy} 12 | \renewcommand{\headrulewidth}{0pt} 13 | \fancyhf{} 14 | \fancyhead[LO]{\nouppercase{\rightmark}} 15 | \fancyhead[RE]{\nouppercase{\leftmark}} 16 | \fancyhead[LE,RO]{\thepage} 17 | -------------------------------------------------------------------------------- /style.css: -------------------------------------------------------------------------------- 1 | p.caption { 2 | color: #777; 3 | margin-top: 10px; 4 | } 5 | p code { 6 | white-space: inherit; 7 | } 8 | pre { 9 | word-break: normal; 10 | word-wrap: normal; 11 | } 12 | pre code { 13 | white-space: inherit; 14 | } 15 | 16 | .book .book-body .page-wrapper .page-inner { 17 | max-width: 1000px; 18 | } -------------------------------------------------------------------------------- /toc.css: -------------------------------------------------------------------------------- 1 | #TOC ul, 2 | #TOC li, 3 | #TOC span, 4 | #TOC a { 5 | margin: 0; 6 | padding: 0; 7 | position: relative; 8 | } 9 | #TOC { 10 | line-height: 1; 11 | border-radius: 5px 5px 0 0; 12 | background: #141414; 13 | background: linear-gradient(to bottom, #333333 0%, #141414 100%); 14 | border-bottom: 2px solid #0fa1e0; 15 | width: auto; 16 | } 17 | #TOC:after, 18 | #TOC ul:after { 19 | content: ''; 20 | display: block; 21 | clear: both; 22 | } 23 | #TOC a { 24 | background: #141414; 25 | background: linear-gradient(to bottom, #333333 0%, #141414 100%); 26 | color: #ffffff; 27 | display: block; 28 | padding: 19px 20px; 29 | text-decoration: none; 30 | text-shadow: none; 31 | } 32 | #TOC ul { 33 | list-style: none; 34 | } 35 | #TOC > ul > li { 36 | display: inline-block; 37 | float: left; 38 | margin: 0; 39 | } 40 | #TOC > ul > li > a { 41 | color: #ffffff; 42 | } 43 | #TOC > ul > li:hover:after { 44 | content: ''; 45 | display: block; 46 | width: 0; 47 | height: 0; 48 | position: absolute; 49 | left: 50%; 50 | bottom: 0; 51 | border-left: 10px solid transparent; 52 | border-right: 10px solid transparent; 53 | border-bottom: 10px solid #0fa1e0; 54 | margin-left: -10px; 55 | } 56 | #TOC > ul > li:first-child > a { 57 | border-radius: 5px 0 0 0; 58 | } 59 | #TOC.align-right > ul > li:first-child > a, 60 | #TOC.align-center > ul > li:first-child > a { 61 | border-radius: 0; 62 | } 63 | #TOC.align-right > ul > li:last-child > a { 64 | border-radius: 0 5px 0 0; 65 | } 66 | #TOC > ul > li.active > a, 67 | #TOC > ul > li:hover > a { 68 | color: #ffffff; 69 | box-shadow: inset 0 0 3px #000000; 70 | background: #070707; 71 | background: linear-gradient(to bottom, #262626 0%, #070707 100%); 72 | } 73 | #TOC .has-sub { 74 | z-index: 1; 75 | } 76 | #TOC .has-sub:hover > ul { 77 | display: block; 78 | } 79 | #TOC .has-sub ul { 80 | display: none; 81 | position: absolute; 82 | width: 200px; 83 | top: 100%; 84 | left: 0; 85 | } 86 | #TOC .has-sub ul li a { 87 | background: #0fa1e0; 88 | border-bottom: 1px dotted #31b7f1; 89 | filter: none; 90 | display: block; 91 | line-height: 120%; 92 | padding: 10px; 93 | color: #ffffff; 94 | } 95 | #TOC .has-sub ul li:hover a { 96 | background: #0c7fb0; 97 | } 98 | #TOC ul ul li:hover > a { 99 | color: #ffffff; 100 | } 101 | #TOC .has-sub .has-sub:hover > ul { 102 | display: block; 103 | } 104 | #TOC .has-sub .has-sub ul { 105 | display: none; 106 | position: absolute; 107 | left: 100%; 108 | top: 0; 109 | } 110 | #TOC .has-sub .has-sub ul li a { 111 | background: #0c7fb0; 112 | border-bottom: 1px dotted #31b7f1; 113 | } 114 | #TOC .has-sub .has-sub ul li a:hover { 115 | background: #0a6d98; 116 | } 117 | #TOC ul ul li.last > a, 118 | #TOC ul ul li:last-child > a, 119 | #TOC ul ul ul li.last > a, 120 | #TOC ul ul ul li:last-child > a, 121 | #TOC .has-sub ul li:last-child > a, 122 | #TOC .has-sub ul li.last > a { 123 | border-bottom: 0; 124 | } 125 | #TOC ul { 126 | font-size: 1.2rem; 127 | } 128 | --------------------------------------------------------------------------------