├── scripts
│   ├── .gitignore
│   ├── header.tex
│   ├── nep_gendergap.R
│   └── clean-data.R
├── .gitignore
├── README.md
├── slides.pdf
├── images
│   ├── Stein.jpg
│   ├── bayes.png
│   ├── ulam.jpg
│   ├── gelman.jpeg
│   ├── bayes-grave.jpg
│   ├── dag_2level.png
│   ├── stan_logo.png
│   ├── uk17_yg_est.png
│   ├── gender-gap-xp.pdf
│   ├── demo_xbox_v_exit.pdf
│   ├── further_reading.png
│   ├── mrp-stan-tweet.png
│   ├── poll_screenshot.jpg
│   ├── uk17_predictions.png
│   ├── subgroup_predictions.pdf
│   ├── us16_fig-lead-pid-wgt.pdf
│   └── us16_NationalTimeTrend.pdf
├── data
│   ├── cleaned.RData
│   └── output_538.RData
├── mrp-aapor.Rproj
├── LICENSE
├── class.R
└── slides.Rmd
/scripts/.gitignore:
--------------------------------------------------------------------------------
1 | slides_files
2 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # mrp-aapor
2 | Short course on MRP at 2018 AAPOR Conference
3 |
--------------------------------------------------------------------------------
/slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/slides.pdf
--------------------------------------------------------------------------------
/images/Stein.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/Stein.jpg
--------------------------------------------------------------------------------
/images/bayes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/bayes.png
--------------------------------------------------------------------------------
/images/ulam.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/ulam.jpg
--------------------------------------------------------------------------------
/data/cleaned.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/data/cleaned.RData
--------------------------------------------------------------------------------
/images/gelman.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/gelman.jpeg
--------------------------------------------------------------------------------
/data/output_538.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/data/output_538.RData
--------------------------------------------------------------------------------
/images/bayes-grave.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/bayes-grave.jpg
--------------------------------------------------------------------------------
/images/dag_2level.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/dag_2level.png
-------------------------------------------------------------------------------- /images/stan_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/stan_logo.png -------------------------------------------------------------------------------- /images/uk17_yg_est.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/uk17_yg_est.png -------------------------------------------------------------------------------- /images/gender-gap-xp.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/gender-gap-xp.pdf -------------------------------------------------------------------------------- /images/demo_xbox_v_exit.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/demo_xbox_v_exit.pdf -------------------------------------------------------------------------------- /images/further_reading.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/further_reading.png -------------------------------------------------------------------------------- /images/mrp-stan-tweet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/mrp-stan-tweet.png -------------------------------------------------------------------------------- /images/poll_screenshot.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/poll_screenshot.jpg -------------------------------------------------------------------------------- /images/uk17_predictions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/uk17_predictions.png -------------------------------------------------------------------------------- /images/subgroup_predictions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/subgroup_predictions.pdf -------------------------------------------------------------------------------- /images/us16_fig-lead-pid-wgt.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/us16_fig-lead-pid-wgt.pdf -------------------------------------------------------------------------------- /images/us16_NationalTimeTrend.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/us16_NationalTimeTrend.pdf -------------------------------------------------------------------------------- /scripts/header.tex: -------------------------------------------------------------------------------- 1 | \newcommand{\columnsbegin}{\begin{columns}} 2 | \newcommand{\columnsend}{\end{columns}} 3 | \setbeamertemplate{footline}[frame number] -------------------------------------------------------------------------------- /mrp-aapor.Rproj: -------------------------------------------------------------------------------- 1 
| Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: No 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Doug Rivers 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /scripts/nep_gendergap.R: -------------------------------------------------------------------------------- 1 | library(tidyverse) 2 | 3 | load("~/Dropbox (YouGov Analytics)/Data/frame/NEP/2016/nep16.Rdata") 4 | 5 | nep16 <- nep16 %>% 6 | mutate(state = droplevels(state)) 7 | 8 | fixef <- prop.table( 9 | xtabs(weight ~ gender + presvote16, data = nep16), 1)[,"clinton"] 10 | fixef[2] <- fixef[2] - fixef[1] 11 | 12 | tbl <- prop.table(xtabs(weight ~ state + presvote16 + gender, data = nep16), 13 | c(1,3))[,'clinton',] 14 | coefs <- data_frame(male = tbl[,1], gap = tbl[,2] - tbl[,1]) 15 | 16 | plot.gender.by.state <- function(main, coefs, fixef) { 17 | n <- nrow(coefs) 18 | x <- rep(c(0,1), rep(n, 2)) 19 | y <- 100 * c(coefs[[1]], coefs[[1]] + coefs[[2]]) 20 | plot(x = x, y = y, pch = 19, cex = 0.5, xlim = c(0, 1), 21 | ylim = c(0, 100), main = main,xlab = "Gender", 22 | ylab = "Percent voting for Clinton", axes = FALSE) 23 | axis(1, at = c(0, 1), labels = c("Male", "Female")) 24 | axis(2, at = seq(0, 100, 20)) 25 | for (i in seq_len(n)) lines(x = c(0, 1), 26 | y = c(y[i], y[i+n]), col = "grey", lwd = 0.5) 27 | abline(100 * fixef, col = "red3", lwd = 2) 28 | } 29 | pdf("~/Projects/mrp-aapor/images/gender-gap-xp.pdf") 30 | plot.gender.by.state("Exit poll", coefs, fixef) 31 | dev.off() 32 | 33 | 34 | 35 | 36 | -------------------------------------------------------------------------------- /scripts/clean-data.R: -------------------------------------------------------------------------------- 1 | library(tidyverse) 2 | library(haven) 3 | 4 | # add DC to list of states 5 | # make state a factor with postal codes as levels 6 | all.states <- data_frame(state = c(state.abb, "DC"), 7 | name = c(state.name, "District of Columbia"), 8 | division = c(as.character(state.division), "South Atlantic"), 9 | region = c(as.character(state.region), "South")) %>% 10 | 
arrange(state) %>% 11 | mutate(state = factor(state, levels = sort(state)), 12 | division = factor(division, levels = levels(state.division)), 13 | region = factor(region, levels = levels(state.region))) 14 | name.to.abb <- function(name) 15 | all.states$state[match(tolower(name), tolower(all.states$name))] 16 | 17 | 18 | # data from final pew survey before 2016 presidential election 19 | pew <- read_spss(file = "Data/Oct16 public.sav") %>% 20 | mutate_if(is.labelled, as_factor) %>% 21 | transmute( 22 | state = name.to.abb(as.character(sstate)), 23 | votereg = fct_recode(regfinal, 24 | yes = "Registered/Plan to/N.Dakota", 25 | no = "Not registered"), 26 | turnout = fct_collapse(plan1, 27 | yes = c("Plan to vote", "Already voted"), 28 | no = c("Already voted", "Don't plan to vote", "Don't know/Refused (VOL.)")), 29 | vote16full = fct_recode(q10horse, 30 | clinton = "Clinton/lean Clinton", 31 | trump = "Trump/lean Trump", 32 | johnson = "Johnson/lean Johnson", 33 | stein = "Stein/lean Stein", 34 | other = "Other-refused to lean", 35 | dk = "DK-refused to lean"), 36 | vote16 = case_when( 37 | votereg == "no" | turnout == "no" ~ "nonvoter", 38 | vote16full == "clinton" ~ "clinton", 39 | vote16full == "trump" ~ "trump", 40 | vote16full %in% c("johnson", "stein", "other") ~ "other"), 41 | vote16 = factor(vote16, levels = c("clinton", "trump", "other", "nonvoter")), 42 | age = fct_recode(age, 43 | "97" = "97 or older", 44 | NULL = "Don't know/Refused (VOL.)"), 45 | age = as.integer(as.character(age)), 46 | age_top_codes = case_when( 47 | age < 80 ~ "<80", 48 | age >= 80 & age < 85 ~ "80-84", 49 | age >= 85 ~ "85+"), 50 | raceeth = fct_recode(racethn, 51 | white = "White, non-Hisp", 52 | black = "Black, non-Hisp", 53 | hispanic = "Hispanic", 54 | other = "Other", 55 | NULL = "Don't know/Refused (VOL.)"), 56 | gender = fct_relabel(sex, tolower), 57 | educ = fct_recode(educ2, 58 | "grades 1-8" = "Less than high school (Grades 1-8 or no formal schooling)", 59 | "hs dropout" = "High school incomplete (Grades 9-11 or Grade 12 with NO diploma)", 60 | "hs grad" = "High school graduate (Grade 12 with diploma or GED certificate)", 61 | "some col" = "Some college, no degree (includes some community college)", 62 | "assoc" = "Two year associate degree from a college or university", 63 | "col grad" = "Four year college or university degree/Bachelor's degree (e.g., BS, BA, AB)", 64 | "postgrad" = "Some postgraduate or professional schooling, no postgraduate degree", 65 | "postgrad" = "Postgraduate or professional degree, including master's, doctorate, medical or law degree", 66 | NULL = "Don't know/Refused (VOL.)"), 67 | weight = weight / mean(weight)) %>% 68 | left_join(all.states, by = "state") 69 | for(v in pew) attr(v, "label") <- NULL 70 | 71 | # current population survey registration and voting supplement november 2016 72 | cps <- read_spss(file = "Data/cpsnov2016.sav") %>% 73 | mutate_if(is.labelled, as_factor) %>% 74 | mutate_if(is.factor, ~ fct_recode(., NULL = "-1")) %>% 75 | transmute( 76 | state = factor(as.character(gestfips)), # levels in alphabetical order 77 | age_top_codes = factor(case_when(prtage == "80-84 Years Old" ~ "80-84", 78 | prtage == "85+ Years Old" ~ "85+", !is.na(prtage) ~ "<80"), 79 | levels = c("<80", "80-84", "85+")), 80 | age = as.integer(ifelse(age_top_codes == "<80", as.character(prtage), NA)), 81 | gender = fct_relabel(pesex, tolower), 82 | hisp = fct_recode(pehspnon, 83 | yes = "HISPANIC", 84 | no = "NON-HISPANIC"), 85 | race = fct_collapse(ptdtrace, 86 | white = "White 
Only", 87 | black = "Black Only", 88 | other = c("American Indian, Alaskan Native Only", "Asian Only", 89 | "Hawaiian/Pacific Islander Only", "White-Black", "White-AI", 90 | "White-Asian", "White-HP", "Black-AI", "Black-Asian", "Black-HP", 91 | "AI-Asian", "AI-HP", "Asian-HP", "W-B-AI", "W-B-A", "W-B-HP", 92 | "W-AI-A", "W-AI-HP", "W-A-HP", "B-AI-A", "W-B-AI-A", "W-AI-A-HP", 93 | "Other 3 Race Combinations", "Other 4 and 5 Race Combinations")), 94 | raceeth = factor(ifelse(hisp == "yes", "hispanic", as.character(race)), 95 | levels = c("white", "black", "hispanic", "other")), 96 | educ = fct_collapse(peeduca, 97 | "grades 1-8" = c("LESS THAN 1ST GRADE", "1ST, 2ND, 3RD OR 4TH GRADE", 98 | "5TH OR 6TH GRADE", "7TH OR 8TH GRADE"), 99 | "hs dropout" = c("9TH GRADE", "10TH GRADE", "11TH GRADE", "12TH GRADE NO DIPLOMA"), 100 | "hs grad" = "HIGH SCHOOL GRAD-DIPLOMA OR EQUIV (GED)", 101 | "some col" = "SOME COLLEGE BUT NO DEGREE", 102 | "assoc" = c("ASSOCIATE DEGREE-OCCUPATIONAL/VOCATIONAL", "ASSOCIATE DEGREE-ACADEMIC PROGRAM"), 103 | "col grad" = "BACHELOR'S DEGREE (EX: BA, AB, BS)", 104 | "postgrad" = c("MASTER'S DEGREE (EX: MA, MS, MEng, MEd, MSW)", 105 | "PROFESSIONAL SCHOOL DEG (EX: MD, DDS, DVM)", "DOCTORATE DEGREE (EX: PhD, EdD)")), 106 | citizen = fct_collapse(prcitshp, 107 | yes = c("NATIVE, BORN IN THE UNITED STATES", 108 | "NATIVE, BORN IN PUERTO RICO OR OTHER U.S. ISLAND AREAS", 109 | "NATIVE, BORN ABROAD OF AMERICAN PARENT OR PARENTS", 110 | "FOREIGN BORN, U.S. CITIZEN BY NATURALIZATION"), 111 | no = "FOREIGN BORN, NOT A CITIZEN OF THE UNITED STATES"), 112 | votereg = factor(case_when( 113 | pes1 == "Yes" | pes2 == "Yes" ~ "yes", 114 | citizen == "No" | pes2 %in% c("No", "Don't Know") ~ "no"), 115 | levels = c("yes", "no")), 116 | turnout = factor(case_when( 117 | pes1 == "Yes" ~ "yes", 118 | votereg == "no" | pes1 %in% c("No", "Don't Know") ~ "no"), 119 | levels = c("yes", "no")), 120 | weight = as.numeric(pwsswgt) / 10000) %>% 121 | left_join(all.states, by = "state") %>% 122 | filter(age_top_codes %in% c("80-84", "85+") | age >= 18, citizen == "yes") 123 | 124 | 125 | # vote counts and citizen voting age population for 2012 and 2016 126 | vap12 <- read_csv("Data/vap2012.csv") %>% 127 | mutate(state = name.to.abb(state), 128 | cvap = as.integer(1000 * cvap)) %>% 129 | select(state, cvap) 130 | votes12 <- read_csv("Data/2012_0_0_1.csv", skip = 1) %>% 131 | transmute( 132 | state = name.to.abb(name), 133 | obama = vote1, 134 | romney = vote2, 135 | johnson = vote4, 136 | stein = vote5, 137 | turnout = totalvote 138 | ) %>% 139 | inner_join(vap12, by = "state") %>% 140 | inner_join(all.states, by = "state") %>% 141 | arrange(state) 142 | vap16 <- read_csv("Data/vap2016.csv") %>% 143 | mutate(state = name.to.abb(state), 144 | cvap = as.integer(1000 * cvap)) %>% 145 | select(state, cvap) 146 | votes16 <- read_csv("Data/2016_0_0_1.csv", skip = 1) %>% 147 | transmute( 148 | state = name.to.abb(name), 149 | clinton = vote1, 150 | trump = vote2, 151 | johnson = vote4, 152 | stein = vote5, 153 | mcmullin = vote3, 154 | turnout = totalvote) %>% 155 | inner_join(vap16, by = "state") %>% 156 | inner_join(all.states, by = "state") %>% 157 | arrange(state) 158 | 159 | # save clean data 160 | save(pew, cps, votes12, votes16, file = "Data/cleaned.RData") 161 | -------------------------------------------------------------------------------- /class.R: -------------------------------------------------------------------------------- 1 | options(width=85) 2 | library("tidyverse") 3 | 
load("data/cleaned.RData") 4 | 5 | 6 | 7 | ########################## 8 | # 2. Post-stratification # 9 | ########################## 10 | 11 | 12 | ## Recode pew data 13 | pew <- pew %>% 14 | filter( 15 | complete.cases(age, raceeth, gender, educ, vote16)) %>% 16 | mutate( 17 | demvote = ifelse(vote16 == "clinton", 1, 0), 18 | age4 = factor(case_when(age < 30 ~ "18-29", 19 | age < 45 ~ "30-44", age < 65 ~ "45-64", 20 | TRUE ~ "65+")), 21 | race3 = fct_collapse(raceeth, 22 | white = c("white", "other")), 23 | educ4 = fct_collapse(educ, 24 | "hs" = c("grades 1-8", "hs dropout", "hs grad"), 25 | "some col" = c("some col", "assoc"))) 26 | 27 | # Save a full version for MNL 28 | pew_nv <- pew 29 | pew <- pew %>% filter(vote16 != 'nonvoter') 30 | 31 | 32 | ## ...then do the same for CPS 33 | cps <- cps %>% 34 | filter( 35 | complete.cases(age_top_codes, 36 | raceeth, gender, educ, turnout)) %>% 37 | mutate( 38 | age4 = factor(case_when( 39 | age_top_codes == "<80" & age < 30 ~ "18-29", 40 | age_top_codes == "<80" & age < 45 ~ "30-44", 41 | age_top_codes == "<80" & age < 65 ~ "45-64", 42 | TRUE ~ "65+")), 43 | race3 = fct_collapse(raceeth, 44 | white = c("white", "other")), 45 | educ4 = fct_collapse(educ, 46 | "hs" = c("grades 1-8", "hs dropout", "hs grad"), 47 | "some col" = c("some col", "assoc"))) 48 | 49 | # Save a full version for MNL 50 | cps_nv <- cps 51 | cps <- cps %>% filter(turnout == "yes") 52 | 53 | ## Check that the datasets are consistent -- mistakes will be made! 54 | compare_distributions <- function(var, data1, data2, wgt1, wgt2, digits = 1) { 55 | stopifnot(all(levels(data1[[var]]) == levels(data2[[var]]))) 56 | formula1 <- as.formula(paste(wgt1, "~", var)) 57 | formula2 <- as.formula(paste(wgt2, "~", var)) 58 | tbl <- rbind(round(100 * prop.table(xtabs(formula1, data1)), digits), 59 | round(100 * prop.table(xtabs(formula2, data2)), digits)) 60 | row.names(tbl) <- c(substitute(data1), substitute(data2)) 61 | tbl 62 | } 63 | compare_distributions("race3", pew, cps, "", "weight") 64 | 65 | 66 | ## Compare variables in `pew` and `cps` 67 | compare_distributions("educ4", pew, cps, "", "weight") 68 | compare_distributions("age4", pew, cps, "", "weight") 69 | compare_distributions("gender", pew, cps, "", "weight") 70 | 71 | 72 | ## Step 2: create post-strata 73 | library(survey) 74 | pop.counts <- xtabs(weight ~ age4 + gender + race3 + educ4, data = cps) 75 | sample.counts <- xtabs(~ age4 + gender + race3 + educ4, data = pew) 76 | pew <- mutate(pew, 77 | weight0 = sum(pop.counts) / sum(sample.counts)) 78 | sample.weights <- xtabs(weight0 ~ age4 + gender + race3 + 79 | educ4, data = pew) 80 | nr <- nonresponse(sample.weights, sample.counts, pop.counts) 81 | 82 | 83 | ## Check for empty cells and/or large weights 84 | sparseCells(nr, nrweight = 4) 85 | 86 | 87 | ## Look for categories adjacent to empty cells 88 | neighbours(14, nr) # use nr$index to get cell index 89 | 90 | 91 | ## Step 3: collapse cells 92 | nr$index[,,"black","hs"] 93 | nr <- joinCells(nr, 10, 11, 14, 15) # update the nr object 94 | nr$index[,,"black","hs"] 95 | 96 | 97 | ## Eliminate remaining empty cells 98 | nr <- joinCells(nr, 18, 19, 21, 22) # hisp 30-64 hs 99 | nr <- joinCells(nr, 44, 48, 68, 72, 92, 96) # hisp 65+ >hs 100 | nr <- joinCells(nr, 57, 61, 81, 85) # black 18-29 col+ 101 | sparseCells(nr, nrweight = 4) # no more bad cells 102 | 103 | 104 | ## Step 4: compute weights and add to dataframe 105 | get_weights <- function(data, nr) { 106 | wgt_arr <- weights(nr) 107 | var.names <- names(dimnames(wgt_arr)) 
108 | indexes <- data %>% 109 | select(var.names) %>% 110 | mutate_all(as.integer) %>% 111 | as.matrix() 112 | wgt_arr[indexes] 113 | } 114 | pew$ps.weight <- get_weights(pew, nr) 115 | 116 | 117 | ## Check that the post-stratification worked 118 | compare_distributions("race3", pew, cps, "ps.weight", "weight") 119 | compare_distributions("educ4", pew, cps, "ps.weight", "weight") 120 | compare_distributions("age4", pew, cps, "ps.weight", "weight") 121 | 122 | 123 | ## Step 5: compute estimates using the new weight 124 | round(100 * prop.table(xtabs(ps.weight ~ vote16, pew)), 1) 125 | design <- svydesign(ids = ~ 1, weights = ~ ps.weight, data = pew) 126 | round(100 * prop.table(svytable(~ vote16, design)), 1) 127 | cv <- function(x) sd(x) / mean(x) # coefficient of variation 128 | cv(pew$ps.weight)^2 # weighting loss 129 | 130 | 131 | ## State estimates 132 | tbl <- 100 * prop.table(xtabs(ps.weight ~ state + vote16, data = pew), 1) 133 | round(tbl, 1)[1:10,] 134 | 135 | 136 | ## Plotting the estimates 137 | estimates <- votes16 %>% 138 | transmute(state, name, 139 | actual = 100 * clinton / turnout) %>% 140 | mutate(post.stratified = tbl[,"clinton"]) 141 | 142 | library(ggplot2) 143 | p1 <- ggplot(estimates, aes(actual, post.stratified)) + 144 | geom_abline(slope = 1, intercept = 0, col = "grey") + 145 | geom_point() + 146 | lims(x = c(0, 100), y = c(0, 100)) + 147 | labs(x = "Percent of votes", y = "Post-stratifed estimate") + 148 | theme_minimal() 149 | p2 <- ggplot(estimates, aes(post.stratified - actual)) + 150 | geom_histogram(binwidth = 10, center = 0, fill = "gray") + 151 | labs(x = "Error in estimate") + 152 | theme_minimal() 153 | library(gridExtra) 154 | grid.arrange(p1, p2, nrow = 1) 155 | 156 | 157 | ## Mapping the estimates 158 | library(maps) 159 | library(mapproj) 160 | us_map <- map_data("state") 161 | estimates %>% 162 | mutate(state_name = tolower(name), 163 | clinton_pct = cut(post.stratified, breaks = c(-Inf, 40, 45, 50, 55, 60, 100), 164 | labels = c("<40", "40-45", "45-50", "50-55", "55-60", ">60"))) %>% 165 | ggplot(aes(map_id = state_name)) + 166 | geom_map(aes(fill = clinton_pct), map = us_map) + 167 | expand_limits(x = us_map$long, y = us_map$lat) + 168 | coord_map("albers", lat0 = 39, lat1 = 45) + 169 | scale_fill_brewer(name = "Clinton %", type = "div", palette = "RdBu") + 170 | theme(axis.line = element_blank()) + 171 | theme_void() 172 | 173 | 174 | 175 | ############################ 176 | # 3. Multilevel regression # 177 | ############################ 178 | 179 | 180 | ## Fixed effects (no pooling) 181 | no.pooling <- lm(demvote ~ state - 1, data = pew) 182 | no.pooling <- data_frame( 183 | state = gsub("state", "", names(coef(no.pooling))), 184 | no.pooling = 100 * coef(no.pooling)) %>% 185 | mutate(state = factor(state, levels = levels(pew$state))) 186 | head(no.pooling) 187 | 188 | 189 | ## Grand mean (complete pooling) 190 | round(100 * mean(pew$demvote), 1) # 1. mean of dichotomous indicator 191 | round(100 * prop.table(xtabs(~ demvote, data = pew)), 1) # 2. cross-tabulation 192 | complete.pooling <- lm(demvote ~ 1, data = pew) # 3. 
intercept in regression 193 | complete.pooling <- 100 * coef(complete.pooling) 194 | round(complete.pooling, 1) 195 | 196 | 197 | ## Predicting random effects 198 | library(lme4) 199 | partial.pooling <- lmer(demvote ~ 1 + (1 | state), data = pew) 200 | partial.pooling 201 | 202 | 203 | ## Extracting predictions 204 | fixef(partial.pooling) # grand mean 205 | ranef(partial.pooling)$state %>% head(4) # state effects 206 | coef(partial.pooling)$state %>% head(4) # state predictions 207 | 208 | 209 | ## Which is better? 210 | partial.pooling <- coef(partial.pooling)$state %>% 211 | as_data_frame(rownames = "state") %>% 212 | transmute(state = factor(state, levels = levels(pew$state)), 213 | partial.pooling = 100 * `(Intercept)`) 214 | estimates <- estimates %>% 215 | left_join(no.pooling, by = "state") %>% 216 | left_join(partial.pooling, by = "state") %>% 217 | mutate(n = as.integer(xtabs(~ state, data = pew))) 218 | estimates 219 | rmse <- function(est, act) sqrt(mean((est - act)^2, na.rm = TRUE)) 220 | RMSE <- estimates %>% 221 | summarize(complete.pooling = rmse(complete.pooling, actual), 222 | no.pooling = rmse(no.pooling, actual), 223 | partial.pooling = rmse(partial.pooling, actual)) %>% 224 | unlist() 225 | RMSE 226 | 227 | 228 | ## 2016 U.S. election: estimates *vs.* actuals 229 | p1 <- ggplot(estimates, aes(actual, no.pooling)) + 230 | geom_point() + 231 | labs(title = "No pooling", x = "Percent of voters", 232 | y = "Percent of voters") + 233 | lims(x = c(0,100), y = c(0,100)) + 234 | theme_minimal() 235 | p2 <- ggplot(estimates, aes(actual, partial.pooling)) + 236 | geom_point() + 237 | labs(title = "Partial pooling", x = "Percent of voters", 238 | y = "Percent of voters") + 239 | lims(x = c(0,100), y = c(0,100)) + 240 | theme_minimal() 241 | suppressMessages(library(gridExtra)) 242 | grid.arrange(p1, p2, nrow = 1) 243 | 244 | 245 | ## Shrinkage 246 | na.omit(estimates) %>% 247 | select(-post.stratified) %>% 248 | rename("Actual" = actual, "No pooling" = no.pooling, 249 | "Partial pooling" = partial.pooling) %>% 250 | gather(Estimator, Percentage, -state, -name, -n) %>% 251 | mutate(Estimator = factor(Estimator, levels = c("Post-stratified", 252 | "No pooling", "Partial pooling", "Actual"))) %>% 253 | ggplot(aes(Estimator, Percentage, group = state, color = n)) + 254 | geom_line(size = 0.7) + 255 | scale_color_gradientn(colors = blues9[-1]) + 256 | lims(y = c(0, 100)) + 257 | theme_minimal() 258 | 259 | 260 | ## Complete pooling: common intercepts and slopes 261 | pew <- mutate(pew, female = ifelse(gender == "female", 1, 0)) 262 | fit1 <- lm(demvote ~ 1 + female, data = pew) 263 | arm::display(fit1) 264 | 265 | 266 | ## No pooling: separate intercepts and slopes 267 | fit2 <- lm(demvote ~ 0 + state + state:female, data = pew) 268 | coef.fit2 <- as.matrix(coef(fit2)) 269 | round(coef(fit2)[c(1:5, 90:98)], 2) 270 | xtabs(~ gender + state, data = pew) # no men in WY 271 | coef.fit2 <- coef(fit2) 272 | coef.fit2 <- as.matrix(data.frame(intercept = coef.fit2[1:48], 273 | slope = coef.fit2[50:97], row.names = 274 | gsub("state", "", names(coef.fit2)[1:48]))) 275 | 276 | 277 | ## Separate intercepts with common slope 278 | fit3 <- lm(demvote ~ state + female - 1, data = pew) 279 | coef.fit3 <- coef(fit3) 280 | names(coef.fit3) <- gsub("state", "", names(coef.fit3)) 281 | round(coef.fit3, 2) 282 | 283 | 284 | ## Which should you believe? 
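## The two panels drawn below compare the state-by-state (no pooling) intercepts
## and gender slopes from fit2 with the common-slope estimates from fit3; the
## red line in each panel is the complete-pooling regression fit1.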
285 | par(mfrow=c(1,2)) 286 | plot.gender.by.state <- function(main, coefs, fixef) { 287 | n <- nrow(coefs) 288 | x <- rep(c(0,1), rep(n, 2)) 289 | y <- 100 * c(coefs[[1]], coefs[[1]] + coefs[[2]]) 290 | plot(x = x, y = y, pch = 19, cex = 0.5, xlim = c(0, 1), 291 | ylim = c(0, 100), main = main,xlab = "Gender", 292 | ylab = "Percent voting for Clinton", axes = FALSE) 293 | axis(1, at = c(0, 1), labels = c("Male", "Female")) 294 | axis(2, at = seq(0, 100, 20)) 295 | for (i in seq_len(n)) lines(x = c(0, 1), 296 | y = c(y[i], y[i+n]), col = "grey", lwd = 0.5) 297 | abline(100 * fixef, col = "red3", lwd = 2) 298 | } 299 | plot.gender.by.state("No pooling", as_data_frame(coef.fit2)[-49,], 300 | coef(fit1)) 301 | coef.fit3.df <- data_frame(Intercept = coef.fit3[-length(coef.fit3)], 302 | Slope = coef.fit3[length(coef.fit3)]) 303 | plot.gender.by.state("Common slope", coef.fit3.df, coef(fit1)) 304 | par(mfrow=c(1,1)) 305 | 306 | 307 | ## Random effects: varying intercepts, common slopes 308 | fit4 <- lmer(demvote ~ 1 + female + (1 | state), data = pew) 309 | fit4 310 | 311 | 312 | ## Random effects: varying intercepts and slopes (first try) 313 | (fit5 <- lmer(demvote ~ 1 + female + (1 + female | state), 314 | data = pew)) # fails to converge 315 | 316 | 317 | ## Fixing the convergence failure (by centering covariates) 318 | pew <- mutate(pew, female.c = female - 0.5) 319 | fit6 <- lmer(demvote ~ 1 + female.c + (1 + female.c | state), 320 | data = pew) 321 | fixef(fit6) # not comparable to prior models 322 | head(coef(fit6)$state) 323 | 324 | 325 | ## Unscale the estimates for comparability 326 | fixef.fit6 <- fixef(fit6) 327 | fixef.fit6[1] <- fixef.fit6[1] - 0.5 * fixef.fit6[2] 328 | coef.fit6 <- coef(fit6)$state 329 | coef.fit6[[1]] <- coef.fit6[[1]] - 0.5 * coef.fit6[[2]] 330 | fixef.fit6 331 | head(coef.fit6) 332 | 333 | 334 | ## Comparing the random effects estimates 335 | par(mfrow=c(1,2)) 336 | plot.gender.by.state("Varying intercepts, common slope", 337 | coef(fit4)$state, fixef(fit4)) 338 | plot.gender.by.state("Varying intercepts and slopes", 339 | coef.fit6, fixef.fit6) 340 | par(mfrow=c(1,1)) 341 | 342 | 343 | 344 | ## Estimating models with covariates at both levels 345 | obama12 <- votes12 %>% 346 | mutate(obama12 = obama / turnout) %>% 347 | select(state, obama12) 348 | pew <- left_join(pew, obama12, by = "state") 349 | fit7 <- lmer(demvote ~ 1 + female.c + obama12 + (1 + female.c | state), data = pew) 350 | arm::display(fit7) 351 | 352 | 353 | ## First attempt at MRP 354 | cps <- cps %>% 355 | mutate(female = ifelse(gender == "female", 1, 0), 356 | female.c = female - 0.5) %>% 357 | left_join(obama12, by = "state") 358 | prob <- predict(fit7, newdata = cps, allow.new.levels = TRUE) 359 | mrp1 <- cps %>% 360 | mutate(prob = prob) %>% 361 | group_by(state) %>% 362 | summarize(mrp1 = 100 * weighted.mean(prob, weight)) 363 | estimates <- left_join(estimates, mrp1, by = "state") 364 | head(estimates) 365 | 366 | 367 | ## Plotting first MRP estimates 368 | RMSE["mrp1"] <- with(estimates, rmse(mrp1, actual)) 369 | ggplot(estimates, aes(actual, mrp1)) + 370 | geom_abline(intercept = 0, slope = 1, col = "grey") + 371 | geom_point(size = 1.5) + 372 | lims(x = c(0, 100), y = c(0, 100)) + 373 | labs(x = "Percentage of vote for Clinton", y = "Estimate") + 374 | theme_minimal() 375 | 376 | 377 | ######################### 378 | # 4. Bayesian inference # 379 | ######################### 380 | 381 | 382 | ## Example: What is the average airfare to AAPOR in Denver? 
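## The next three lines apply the standard conjugate update for a normal mean
## with known sigma: with prior theta ~ N(theta_0, omega_0^2) and n iid
## observations with mean ybar,
##   posterior precision = 1 / omega_0^2 + n / sigma^2
##   posterior mean = (theta_0 / omega_0^2 + n * ybar / sigma^2) / posterior precision
## Reading the constants below, this corresponds to theta_0 = 400, omega_0 = 100,
## ybar = 300, sigma = 150, and n = 3 (an interpretation of the code, not stated
## explicitly in the source).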
383 | post.precis <- 1 / 100^2 + 3 / 150^2 384 | post.mean <- (400 / 100^2 + 300 * 3 / 150^2) / post.precis 385 | post.sd <- 1 / sqrt(post.precis) 386 | 387 | 388 | ## Bayesian updating: from prior to posterior 389 | ggplot(data_frame(theta = c(0, 700) ), aes(theta)) + 390 | stat_function(fun = dnorm, color = "red", 391 | args = list(mean = 300, sd = 200 / sqrt(3))) + 392 | stat_function(fun = dnorm, color = "blue", 393 | args = list(mean = 400, sd = 100)) + 394 | stat_function(fun = dnorm, color = "purple", 395 | args = list(mean = post.mean, sd = post.sd)) + 396 | labs(x = quote(theta), y = "") + 397 | scale_x_continuous(breaks = seq(0, 700, 100)) + 398 | scale_y_continuous(breaks = NULL) + 399 | theme_minimal() 400 | 401 | 402 | ## A first Stan program 403 | model_code <- "data { 404 | int n; 405 | real y[n]; 406 | real theta_0; 407 | real omega_0; 408 | real sigma; 409 | } 410 | parameters { 411 | real theta; 412 | } 413 | model { 414 | theta ~ normal(theta_0, omega_0); 415 | for (i in 1:n) { 416 | y[i] ~ normal(theta, sigma); 417 | } 418 | }" 419 | 420 | 421 | ## Example (continued) 422 | library(rstan) 423 | rstan_options(auto_write = TRUE) 424 | options(mc.cores = parallel::detectCores()) 425 | y <- c(161, 250, 489) 426 | data <- list(y = y, n = length(y), theta_0 = 350, 427 | omega_0 = 150, sigma = 100) 428 | sims <- stan(model_code = model_code, data = data, 429 | chains = 4, iter = 500, seed = 1234) 430 | 431 | 432 | ## Stan output 433 | print(sims) 434 | 435 | 436 | ## Stan code for estimating both mean and SD 437 | data <- list(y = y, n = length(y), theta_0 = 350, omega_0 = 150) 438 | model_code <- "data { 439 | int n; 440 | real y[n]; 441 | real theta_0; 442 | real omega_0; 443 | } 444 | parameters { 445 | real theta; 446 | real sigma; // moved to parameter block 447 | } 448 | model { 449 | theta ~ normal(theta_0, omega_0); 450 | sigma ~ normal(150, 150); 451 | y ~ normal(theta, sigma); // assumed iid 452 | }" 453 | sims <- stan(model_code = model_code, data = data, 454 | iter = 500, seed = 1234) 455 | 456 | 457 | ## Output from two-parameter model 458 | print(sims) 459 | 460 | 461 | ## Graphical display of parameters 462 | plot(sims) 463 | 464 | 465 | ## Traceplot 466 | traceplot(sims) 467 | 468 | 469 | ## First hierarchical model in Stan 470 | data <- with(pew, list(y = demvote, group = as.integer(state), 471 | n = nrow(pew), J = nlevels(state))) 472 | code <- "data { 473 | int n; // number of respondents 474 | int J; // number of groups (states) 475 | int y[n]; // demvote 476 | int group[n]; // state index 477 | } 478 | parameters { 479 | real mu_theta; // hyper parameters 480 | real sigma_theta; 481 | vector[J] theta; // group parameters 482 | } 483 | model { 484 | sigma_theta ~ normal(0, 5); 485 | for (j in 1:J) 486 | theta[j] ~ normal(mu_theta, sigma_theta); 487 | for (i in 1:n) 488 | y[i] ~ bernoulli_logit(theta[ group[i] ]); 489 | }" 490 | sims <- stan(model_code = code, data = data) 491 | 492 | 493 | ## Stan output 494 | print(sims) 495 | 496 | 497 | ## A plot of the state estimates 498 | names(sims) <- c("mu_theta", "sigma_theta", levels(pew$state), "lp__") 499 | plot(sims, par = "theta") 500 | 501 | 502 | ## Add a covariate to the model 503 | data <- with(pew, list(y = demvote, group = as.integer(state), 504 | x = female, n = nrow(pew), J = nlevels(state))) 505 | code <- "data { 506 | int n; // number of respondents 507 | int J; // number of groups (states) 508 | int y[n]; // demvote 509 | int group[n]; // state index 510 | vector[n] x; // added covariate (gender) 
511 | } 512 | parameters { 513 | real mu_alpha; // mean intercept 514 | real sigma_alpha; // sd intercept 515 | real beta; // coefficient of gender 516 | vector[J] alpha; // group intercepts 517 | } 518 | model { 519 | sigma_alpha ~ normal(0, 5); 520 | for (j in 1:J) 521 | alpha[j] ~ normal(mu_alpha, sigma_alpha); 522 | for (i in 1:n) 523 | y[i] ~ bernoulli_logit(alpha[group[i]] + beta * x[i]); 524 | }" 525 | sims <- stan(model_code = code, data = data) 526 | 527 | 528 | ## Simplifying the computations 529 | X <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 530 | region + qlogis(obama12), data = pew) 531 | head(X, 4) 532 | 533 | 534 | ## Hierarchical model with multiple covariates 535 | model_code <- "data { 536 | int n; // number of respondents 537 | int k; // number of covariates 538 | matrix[n, k] X; // covariate matrix 539 | int y[n]; // outcome (demvote) 540 | int J; // number of groups (states) 541 | int group[n]; // group index 542 | } 543 | parameters { 544 | vector[k] beta; // fixed effects 545 | real sigma_alpha; // sd intercept 546 | vector[J] alpha; // group intercepts 547 | } 548 | model { 549 | vector[n] Xb; 550 | beta ~ normal(0, 4); 551 | sigma_alpha ~ normal(0.2, 1); // prior for sd 552 | alpha ~ normal(0, 1); // standardized intercepts 553 | Xb = X * beta; 554 | for (i in 1:n) 555 | Xb[i] += sigma_alpha * alpha[ group[i] ]; 556 | y ~ bernoulli_logit(Xb); 557 | }" 558 | 559 | 560 | ## Estimating the model in R 561 | X <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 562 | region + qlogis(obama12), data = pew) 563 | data <- list(n = nrow(X), k = ncol(X), X = X, y = pew$demvote, 564 | J = nlevels(pew$state), group = as.integer(pew$state)) 565 | sims <- stan(model_code = model_code, data = data, 566 | seed = 1234) 567 | names(sims) 568 | 569 | 570 | ## Rename the coefficients for easier reading 571 | coef.names <- c(colnames(X), "sigma_alpha", levels(pew$state), "lp__") 572 | names(sims) <- coef.names 573 | names(sims) 574 | 575 | 576 | ## Summary of fixed effect estimates 577 | print(sims, par = "beta") 578 | 579 | 580 | ## Imputation in Stan 581 | X0 <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 582 | region + qlogis(obama12), data = cps) 583 | data <- list(n = nrow(X), k = ncol(X), X = X, y = pew$demvote, 584 | J = nlevels(pew$state), group = as.integer(pew$state), 585 | N = nrow(X0), X0 = X0, group0 = as.integer(cps$state)) 586 | 587 | 588 | ## The complete Stan program 589 | model_code <- "data { 590 | int n; // number of respondents 591 | int k; // number of covariates 592 | matrix[n, k] X; // covariate matrix 593 | int y[n]; // outcome (demvote) 594 | int J; // number of groups (states) 595 | int group[n]; // group index 596 | int N; // population size 597 | matrix[N, k] X0; // population covariates 598 | int group0[N]; // group index in population 599 | } 600 | parameters { 601 | vector[k] beta; // fixed effects 602 | real sigma_alpha; // sd intercept 603 | vector[J] alpha; // group intercepts 604 | } 605 | model { 606 | vector[n] Xb; 607 | beta ~ normal(0, 4); 608 | sigma_alpha ~ normal(0.2, 1); 609 | alpha ~ normal(0, 1); 610 | Xb = X * beta; 611 | for (i in 1:n) 612 | Xb[i] += sigma_alpha * alpha[ group[i] ]; 613 | y ~ bernoulli_logit(Xb); 614 | } 615 | generated quantities { 616 | int yimp[N]; 617 | { 618 | vector[N] Xb0; 619 | Xb0 = X0 * beta; 620 | for (i in 1:N) 621 | yimp[i] = bernoulli_logit_rng(Xb0[i] + sigma_alpha * alpha[ group0[i] ]); 622 | } 623 | }" 624 | sims <- stan(model_code = model_code, data = data, 625 | seed = 1234) 626 | 627 
| 628 | ## Extracting the simulations 629 | imputations <- extract(sims, pars = "yimp")$yimp[sample( 630 | nrow(sims), size = 500), ] 631 | get_state_estimates <- function(imputations) { 632 | state_by_clinton <- function(imputed_values) 100 * prop.table( 633 | xtabs(weight ~ state + imputed_values, data = cps), 1)[,"1"] 634 | state_estimates <- apply(imputations, 1, state_by_clinton) 635 | apply(state_estimates, 1, mean) 636 | } 637 | estimates$mrp2 <- get_state_estimates(imputations) 638 | RMSE["mrp2"] <- with(estimates, rmse(mrp2, actual)) 639 | 640 | 641 | ## The easy way with `rstanarm` 642 | library(rstanarm) 643 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 644 | region + qlogis(obama12) + (1 | state), data = pew, family = binomial) 645 | imputations <- posterior_predict(fit, draws = 500, 646 | newdata = select(cps, age4, gender, race3, educ4, region, obama12, state)) 647 | 648 | 649 | ## The complete program in `rstanarm` 650 | library(rstanarm) 651 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 652 | region + qlogis(obama12) + (1 | state), data = pew, family = binomial) 653 | cpstmp <- cps %>% 654 | select(age4, gender, race3, educ4, region, obama12, state) 655 | imputations <- posterior_predict(fit, draws = 500, 656 | newdata = select(cps, age4, gender, race3, educ4, region, obama12, state)) 657 | estimates$mrp3 <- get_state_estimates(imputations) 658 | RMSE["mrp3"] <- with(estimates, rmse(mrp3, actual)) 659 | RMSE 660 | 661 | 662 | ## Accuracy of state level estimates 663 | p1 <- ggplot(estimates, aes(actual, mrp3)) + 664 | geom_abline(intercept = 0, slope = 1, col = "grey") + 665 | geom_point(size = 1.5) + 666 | lims(x = c(0, 100), y = c(0, 100)) + 667 | labs(x = "Percentage of vote for Clinton", y = "Estimate") + 668 | theme_minimal() 669 | p2 <- ggplot(estimates, aes(mrp3 - actual)) + 670 | geom_histogram(binwidth = 4, center = 0, fill = "gray") + 671 | lims(x = c(-20, 20)) + 672 | labs(x = "Error in estimate") + 673 | theme_minimal() 674 | grid.arrange(p1, p2, nrow = 1) 675 | 676 | 677 | ## What the map now looks like 678 | estimates %>% 679 | mutate(state_name = tolower(name), 680 | clinton_pct = cut(mrp3, breaks = c(-Inf, 40, 45, 50, 55, 60, 100), 681 | labels = c("<40", "40-45", "45-50", "50-55", "55-60", ">60"))) %>% 682 | ggplot(aes(map_id = state_name)) + 683 | geom_map(aes(fill = clinton_pct), map = us_map) + 684 | expand_limits(x = us_map$long, y = us_map$lat) + 685 | coord_map("albers", lat0 = 39, lat1 = 45) + 686 | scale_fill_brewer(name = "Clinton %", type = "div", palette = "RdBu") + 687 | theme(axis.line = element_blank()) + 688 | theme_void() 689 | 690 | 691 | 692 | ###################### 693 | # 5. Advanced Topics # 694 | ###################### 695 | 696 | ## Allow race and gender effects to vary across states 697 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 698 | region + qlogis(obama12) + (1 + gender + race3 | state), 699 | data = pew, family = binomial) 700 | cpstmp <- cps %>% 701 | select(age4, gender, race3, educ4, region, obama12, state) 702 | imputations <- posterior_predict(fit, draws = 500, 703 | newdata = select(cps, age4, gender, race3, educ4, region, obama12, state)) 704 | estimates$mrp4 <- get_state_estimates(imputations) 705 | RMSE["mrp4"] <- with(estimates, rmse(mrp4, actual)) 706 | RMSE 707 | 708 | ###################### 709 | # 6. 
MNL # 710 | ###################### 711 | 712 | library(brms) 713 | 714 | # Includes Non-Voters 715 | pew_nv <- left_join(pew_nv, obama12, by = "state") 716 | cps_nv <- left_join(cps_nv, obama12, by = "state") 717 | 718 | fit <- brm(vote16 ~ 1 + age4 + gender + race3 + educ4 + 719 | region + qlogis(obama12) + (1 | state), data = pew_nv, 720 | family = categorical, chains = 4, cores = 4) 721 | imputations <- posterior_predict(fit, nsamples = 500, allow_new_levels = TRUE, 722 | newdata = select(cps_nv, age4, gender, race3, educ4, region, obama12, state)) 723 | 724 | state_table <- function(imp){ 725 | 100 * prop.table(xtabs(weight ~ state + imp, data = cps_nv), 1) 726 | } 727 | 728 | tbls <- array(apply(imputations[1:10,], 1, state_table), 729 | dim=c(51, 4, 10), 730 | dimnames=list(state=levels(cps_nv$state), 731 | vote16=levels(pew_nv$vote16), 732 | samples = 1:10)) 733 | estimate <- apply(tbls, c('state', 'vote16'), mean) 734 | sd <- apply(tbls, c('state', 'vote16'), sd) 735 | 736 | -------------------------------------------------------------------------------- /slides.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Multilevel Regression and Post-stratification" 3 | author: "Douglas Rivers" 4 | date: "May 15, 2018" 5 | output: 6 | beamer_presentation: 7 | fig_caption: false 8 | includes: 9 | in_header: scripts/header.tex 10 | classoption: "aspectratio=169" 11 | always_allow_html: yes 12 | --- 13 | 14 | ```{r setup, include=FALSE} 15 | knitr::opts_chunk$set(echo = FALSE) 16 | options(width=85) 17 | ``` 18 | 19 | 20 | ## Course Schedule 21 | 22 | | Time | Topics | 23 | |-------------|--------------------------------| 24 | | 2:30 - 3:15 | Overview and Examples | 25 | | 3:15 - 4:00 | Post-stratification | 26 | | 4:00 - 4:45 | Multilevel Regression | 27 | | 4:45 - 5:30 | Bayesian inference | 28 | | 5:30 - 6:00 | Advanced topics and discussion | 29 | 30 | 31 | ## Course materials and basic R setup 32 | 33 | Slides, code, and datasets can be downloaded from Github: 34 | 35 | 36 | 37 | If you wish to run the examples shown in class, you will need to 38 | 39 | 1. Download R from one of the sites listed at and install. R is an open source programming environment for statistical computing. 40 | 2. (Optional) Download RStudio Desktop from . RStudio is an integrated development environment for R. The free open source version is recommended. 41 | 3. Start R or RStudio and enter the following at the R command prompt: 42 | 43 | \footnotesize 44 | ```{r, eval=FALSE, echo=TRUE} 45 | install.packages(c("tidyverse", "lme4", "survey", "arm", "maps", "mapproj", 46 | "gridExtra")) 47 | ``` 48 | 49 | 50 | # 1. Overview and Examples 51 | 52 | 53 | ## What is MRP? 54 | 55 | ![](images/mrp-stan-tweet.png) 56 | 57 | \centering 58 | \LARGE 59 | 60 | Formally, **M**ultilevel **R**egression and **P**ost-stratificaiton 61 | 62 | \medskip 63 | 64 | \centering 65 | \LARGE 66 | Informally, **Mr. P** 67 | 68 | 69 | ## Who is the ubiquitous Mr. P? 70 | 71 | \columnsbegin 72 | \column{.3\textwidth} 73 | ```{r, out.width = "1.5in"} 74 | knitr::include_graphics("images/gelman.jpeg") 75 | ``` 76 | 77 | \smallskip 78 | 79 | \centering 80 | 81 | Andrew Gelman 82 | 83 | \column{.6\textwidth} 84 | 85 | Gelman is a prolific statistician, political scientist and blogger at Columbia University. 86 | 87 | \medskip 88 | 89 | Developed MRP in a series of papers starting in 1997. Adapts and extends some techniques for *small area estimation* from survey statistics. 
90 | 91 | \medskip 92 | 93 | Ideas (post-stratification, multilevel models, hierarchical Bayesian estimation) were well known, but the combination has been very effective in polling applications. 94 | 95 | \columnsend 96 | 97 | 98 | ## . . . and his mysterious friend Stan? 99 | 100 | \columnsbegin 101 | \column{.3\textwidth} 102 | ```{r, out.width = "1.5in"} 103 | knitr::include_graphics("images/ulam.jpeg") 104 | ``` 105 | 106 | \medskip 107 | 108 | \centering 109 | 110 | Stanislav Ulam (1909-1984) 111 | 112 | \column{.4\textwidth} 113 | 114 | Ulam was a physicist who invented the *Monte Carlo method* while working on the Manhattan Project. 115 | 116 | \medskip 117 | 118 | Stan is an open source software project developing *Markov chain Monte Carlo* (MCMC) software for statistical inference and other applications. 119 | 120 | \column{.3\textwidth} 121 | ```{r, out.width = "1.5in"} 122 | knitr::include_graphics("images/stan_logo.png") 123 | ``` 124 | 125 | \smallskip 126 | 127 | \centering 128 | 129 | 130 | 131 | \columnsend 132 | 133 | 134 | 135 | ## What problems does MRP address? 136 | 137 | - **Selection bias.** Flexible and robust method for correcting for imbalances in sample composition, even when these are severe and can involve a large number of variables. 138 | - **Small area estimation.** Can provide good estimates for sub-national units (such as states, Congressional districts, counties, *etc.*) 139 | - **Trend analysis.** Estimate means of survey variables over time with a set of rolling cross-sections 140 | 141 | 142 | ## Selection bias: the Xbox panel 143 | 144 | ![](images/poll_screenshot.jpg) 145 | 146 | * 750,148 interviews with 345,858 Xbox users during the 2012 election campaign 147 | * However, the sample had some problems.... 148 | 149 | (Gelman, Goel, Rivers, and Rothschild, "The Mythical Swing Voter", *Quarterly Journal of Political Science,* 2016) 150 | 151 | 152 | ## Xbox panel demographics 153 | 154 | \centering 155 | 156 | ![](images/demo_xbox_v_exit.pdf){height=75%} 157 | 158 | \centering 159 | 160 | 93% were male, but there were still over 5,000 women in the sample. 161 | 162 | 163 | 164 | ## MRP estimates of 2012 voting from Xbox panel 165 | 166 | \centering 167 | 168 | ![](images/subgroup_predictions.pdf){height=60%} 169 | 170 | 75,096 cells: 2 gender x 4 race x 4 age x 4 education x 4 party x 3 ideology x 50 states 171 | 172 | 173 | ## Lesson 1 174 | 175 | \centering 176 | 177 | \LARGE 178 | 179 | Under some circumstances, **big data** (large sample sizes) and **modelling** can handle empty cells and correct for severe selection bias. 180 | 181 | 182 | ## Small area estimation: the 2017 UK general election 183 | 184 | \centering 185 | 186 | ![](images/uk17_predictions.png) 187 | 188 | 189 | ## Comparison of MRP forecasts and outcomes 190 | 191 | \centering 192 | 193 | ![](images/uk17_yg_est.png) 194 | 195 | 196 | ## Lesson 2 197 | 198 | \centering 199 | \LARGE 200 | 201 | It's possible to get pretty good estimates of **small areas** (parliamentary constituencies, congressional districts, even precincts) with effective predictors and multilevel models. 202 | 203 | 204 | ## Why do polls bounce around so much? 
205 | 206 | \centering 207 | 208 | ![](images/us16_fig-lead-pid-wgt.pdf) 209 | 210 | 211 | ## Trend analysis: the 2016 US presidential election 212 | 213 | \centering 214 | ![](images/us16_NationalTimeTrend.pdf) 215 | 216 | 217 | ## Lesson 3 218 | 219 | \centering 220 | \LARGE 221 | 222 | Large swings in the 2012 and 2016 pre-election polls were methodological artifacts due to inadequate **post-stratification** and **modeling**. 223 | 224 | 225 | ## Ingredients for MRP and running example 226 | 227 | **Survey** Pew Research Center's *October 2016 Political Survey* (2,583 interviews, conducted October 20-25, 2016). 228 | 229 | **Survey variable** 2016 Presidential voting intention 230 | 231 | **Covariates** Individual characteristics (from the survey) and group-level predictors (2012 state vote) 232 | 233 | **Post-strata** Age x Gender x Race x Education x State 234 | 235 | **Stratum counts** from the November 2016 Voting and Registration Supplement to the *Current Population Survey* 236 | 237 | 238 | ## Data sources 239 | 240 | The file `cleaned.RData` contains four R data frames: 241 | 242 | * `pew` - Pew Research Center's **October 2016 Political Survey**. The original data can be found at . 243 | * `cps` - the November 2016 Voting and Registration Supplement to the **Current Population Survey**. The full dataset can be downloaded from . 244 | * `votes12` and `votes16` - votes cast for major presidential candidates, turnout, and voting age population by state. **Vote counts** are from and **population counts** are from . 245 | 246 | Code `clean-data.R` for creating `cleaned.RData` is on the course GitHub site. 247 | 248 | 249 | ## How MRP works 250 | 251 | **Step 1: multilevel regression** Fit a model relating the survey variable (vote) to individual and group-level covariates (age, gender, race, education, state, 2012 vote). 252 | 253 | **Step 2: imputation** Impute the survey variable (proportion voting Democratic) for all combinations of age x gender x race x education x state, that is, for 254 | each cell of the post-stratification (4 age x 2 gender x 3 race x 4 education x 51 states). 255 | 256 | **Step 3: post-stratification** Average the imputed values in each cell and weight the estimated cell mean by the population count in each of the $4 \times 2 \times 3 \times 4 \times 51 = 4896$ cells. 257 | 258 | 259 | ## Some important concepts we will cover 260 | 261 | * Fit **multilevel** regression models for survey variables which are "richly parameterized" and can flexibly adjust for high-dimensional selection bias. 262 | * Estimate models using **hierarchical Bayesian priors** to avoid over-fitting. 263 | * Impute using the **posterior predictive distribution** (not the sample average). 264 | * Use **multiple imputations** to accurately reflect the sources of uncertainty in the imputation process. 265 | 266 | Don't worry if you don't understand the terminology -- details will follow -- but it's just **regression** and **imputation**. 267 | 268 | 269 | # 2. Post-stratification 270 | 271 | 272 | ## Review of stratified sampling 273 | 274 | In *stratified sampling*, we divide the population into **strata** (*e.g.*, regions) and draw samples within each stratum. 275 | 276 | The **stratified estimator** is obtained by weighting the stratum estimates by the population size in each.
277 | 278 | A stratified design can be more **efficient** than a simple random sample of the same size because 279 | 280 | * Eliminates variability in the proportion of sample coming from each stratum 281 | * Larger sample sizes could be allocated to more heterogeneous strata, less to homogeneous strata 282 | 283 | 284 | ## Notation for stratified sampling 285 | 286 | **Population** $U = U_1 \cup \cdots \cup U_H$ partitioned into $H$ strata 287 | 288 | **Sample** $S = S_1 \cup \cdots \cup S_H$ where $S_h \subset U_h$. 289 | 290 | **Population sizes** $N = N_1 + \cdots + N_H$ 291 | 292 | **Sample sizes** $n = n_1 + \cdots + n_H$ 293 | 294 | **Sample means** $\bar y_1,\dots, \bar y_H$ 295 | 296 | **Stratified estimator** $\bar y_{\text{ST}} = \dfrac{N_1}{N} \bar y_1 + \cdots + \dfrac{N_H}{N} \bar y_H$ 297 | 298 | 299 | The stratified estimator is a **weighted** combination of the stratum means. The weights are **known** -- sampling variation comes from errors in estimating the stratum means. 300 | 301 | 302 | ## Stratification weights 303 | 304 | An equivalent way of computing the stratified estimator is to define the **stratum weights** 305 | 306 | $$ 307 | w_h = \dfrac{\text{population proportion in stratum}}{\text{sample proportion in stratum}} 308 | = \dfrac{N_h/N}{n_h/n} 309 | $$ 310 | Each respondent gets the weight associated with their stratum: 311 | $$ 312 | \bar y_{\text{ST}} = \text{weighted average of sample values} = \dfrac{ w_{h_1} y_1 + \cdots + w_{h_n} y_n }{ w_{h_1} + \cdots + w_{h_n}} 313 | $$ 314 | where $h_i$ and $w_{h_i}$ are the **stratum** and **weight**, respectively, associated with respondent $i$. 315 | 316 | 317 | ## Variance of the stratified estimator 318 | 319 | $$ 320 | V(\bar y_{\text{ST}}) = \sum_{h=1}^H \left( \dfrac{N_h}{N} \right)^{\!2} \left( 1 - \dfrac{n_h}{N_h} \right) \dfrac{S_h^2}{n_h} = \dfrac{1}{n} \sum_{i=1}^n w_{h_i}^2 (1 - f_h) \dfrac{S_{h_i}^2}{n} 321 | $$ 322 | There are three factors that determine the variance of the stratified estimator: 323 | 324 | * Size of the **sampling fractions** $f_h = n_h / N_h$ in each stratum. Usually these are close to zero and can be ignored. 325 | * The **within stratum variances** $\sum_{i \in U_h} (y_i - \bar y_h)^2 / (N_h - 1)$. 326 | * Variability of the **stratum weights** $w_h$. 327 | 328 | $S_h^2$ can be substantially less than the total variance of $y$, so there are potential large gains in efficiency from stratifying, compared to SRSWOR. 329 | 330 | Variation in the weights tends to *increase* the standard error of the estimator, unless it's possible to allocate sample size $n_h$ proportional to $S_h^2$. 331 | 332 | 333 | ## Post-stratification 334 | 335 | When the sample design is **not** stratified, it may still be feasible to use the stratified estimator if the sample can be divided into groups (called **post-strata**) whose population sizes are known. 336 | 337 | The sample sizes in the post-strata are not fixed by design, so the variance formula only holds conditional upon the realized sample sizes $n_1,\dots,n_H$. (See Holt and Smith, "Post-stratification", *Journal of the Royal Statistical Society, Series A*, 1979.) 338 | 339 | Even though the sample may have started with equal selection probabilities, the proportions in the post-strata often differ substantially from the population proportions. Failure to post-stratify results in large bias, but large weights may result in unstable estimates. 340 | 341 | Usual advice is to collapse post-strata to have at least 20 respondents per cell. 
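As a concrete illustration of the stratum weight formula $w_h = (N_h/N)/(n_h/n)$ above, the minimal sketch below computes post-stratification weights directly from a population and a sample cross-tabulation. `pop.counts` and `sample.counts` are constructed in the next section; the other names here are only for illustration. Empty sample cells produce infinite weights, which is one reason the next section collapses cells with the `survey` package's bookkeeping instead.

\footnotesize
```{r, eval=FALSE, echo=TRUE}
# Minimal sketch, assuming pop.counts and sample.counts cross-tabulate the
# same factors in the same order (they are built in the next section)
pop.share <- pop.counts / sum(pop.counts)           # N_h / N
sample.share <- sample.counts / sum(sample.counts)  # n_h / n
w <- pop.share / sample.share                       # w_h = (N_h/N) / (n_h/n)
summary(as.vector(w))  # empty sample cells show up as Inf (or NaN) weights
```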
342 | 343 | 344 | ## Post-stratification in R 345 | 346 | **Step 1** Recode survey and population data (usually from a *public use microdata* file) so that the post-stratifying variables are consistent. 347 | 348 | **Step 2** Choose the post-strata 349 | 350 | **Step 3** Collapse categories to eliminate empty cells 351 | 352 | **Step 4** Compute the post-stratification weights 353 | 354 | **Step 5** Estimate means and proportions (with standard errors) 355 | 356 | 357 | ## Step 1: recode Pew data... 358 | 359 | Variables should be `factor`s (R's version of categorical variables) with the same `levels` (categories) *in the same order*. 360 | 361 | \scriptsize 362 | ```{r, echo=TRUE, warning=FALSE, message=FALSE} 363 | suppressMessages(library("tidyverse")) 364 | load("data/cleaned.RData") 365 | pew <- pew %>% 366 | filter( 367 | complete.cases(age, raceeth, gender, educ, vote16), 368 | vote16 != "nonvoter") %>% 369 | mutate( 370 | demvote = ifelse(vote16 == "clinton", 1, 0), 371 | age4 = factor(case_when(age < 30 ~ "18-29", 372 | age < 45 ~ "30-44", age < 65 ~ "45-64", 373 | TRUE ~ "65+")), 374 | race3 = fct_collapse(raceeth, 375 | white = c("white", "other")), 376 | educ4 = fct_collapse(educ, 377 | "hs" = c("grades 1-8", "hs dropout", "hs grad"), 378 | "some col" = c("some col", "assoc"))) 379 | ``` 380 | 381 | 382 | ## ...then do the same for CPS 383 | 384 | \scriptsize 385 | ```{r, echo=TRUE, warning=FALSE, messages=FALSE} 386 | cps <- cps %>% 387 | filter( 388 | complete.cases(age_top_codes, 389 | raceeth, gender, educ, turnout), 390 | turnout == "yes") %>% 391 | mutate( 392 | age4 = factor(case_when( 393 | age_top_codes == "<80" & age < 30 ~ "18-29", 394 | age_top_codes == "<80" & age < 45 ~ "30-44", 395 | age_top_codes == "<80" & age < 65 ~ "45-64", 396 | TRUE ~ "65+")), 397 | race3 = fct_collapse(raceeth, 398 | white = c("white", "other")), 399 | educ4 = fct_collapse(educ, 400 | "hs" = c("grades 1-8", "hs dropout", "hs grad"), 401 | "some col" = c("some col", "assoc"))) 402 | ``` 403 | 404 | ## Check that the datasets are consistent -- mistakes will be made! 405 | 406 | Time spent cleaning the data at this stage is time well spent. 407 | 408 | \footnotesize 409 | ```{r, echo=TRUE, include=TRUE} 410 | compare_distributions <- function(var, data1, data2, wgt1, wgt2, digits = 1) { 411 | stopifnot(all(levels(data1[[var]]) == levels(data2[[var]]))) 412 | formula1 <- as.formula(paste(wgt1, "~", var)) 413 | formula2 <- as.formula(paste(wgt2, "~", var)) 414 | tbl <- rbind(round(100 * prop.table(xtabs(formula1, data1)), digits), 415 | round(100 * prop.table(xtabs(formula2, data2)), digits)) 416 | row.names(tbl) <- c(substitute(data1), substitute(data2)) 417 | tbl 418 | } 419 | compare_distributions("race3", pew, cps, "", "weight") 420 | ``` 421 | 422 | ## Compare variables in `pew` and `cps` 423 | 424 | \scriptsize 425 | ```{r, echo=TRUE} 426 | compare_distributions("educ4", pew, cps, "", "weight") 427 | compare_distributions("age4", pew, cps, "", "weight") 428 | compare_distributions("gender", pew, cps, "", "weight") 429 | ``` 430 | 431 | ## Step 2: create post-strata 432 | 433 | The `survey` package contains useful functions for post-stratification, including `postStratify`, `rake`, and `nonresponse`. 
434 | 435 | \footnotesize 436 | ```{r, echo=TRUE, message = FALSE, warning = FALSE} 437 | suppressMessages(library(survey)) 438 | pop.counts <- xtabs(weight ~ age4 + gender + race3 + educ4, data = cps) 439 | sample.counts <- xtabs(~ age4 + gender + race3 + educ4, data = pew) 440 | pew <- mutate(pew, 441 | weight0 = sum(pop.counts) / sum(sample.counts)) 442 | sample.weights <- xtabs(weight0 ~ age4 + gender + race3 + 443 | educ4, data = pew) 444 | nr <- nonresponse(sample.weights, sample.counts, pop.counts) 445 | ``` 446 | 447 | \normalsize 448 | \centering 449 | 450 | `nr` is an object which keeps track of the cells (post-strata). 451 | 452 | 453 | ## Check for empty cells and/or large weights 454 | 455 | \tiny 456 | ```{r, echo=TRUE} 457 | sparseCells(nr, nrweight = 4) 458 | ``` 459 | 460 | ## Look for categories adjacent to empty cells 461 | 462 | \scriptsize 463 | ```{r, echo = TRUE, include=TRUE} 464 | neighbours(14, nr) # use nr$index to get cell index 465 | ``` 466 | 467 | 468 | ## Step 3: collapse cells 469 | 470 | Cells 10, 11, 14 and 15 contain blacks aged 30-64 with a high school education. 471 | 472 | \scriptsize 473 | ```{r echo=TRUE} 474 | nr$index[,,"black","hs"] 475 | nr <- joinCells(nr, 10, 11, 14, 15) # update the nr object 476 | nr$index[,,"black","hs"] 477 | ``` 478 | 479 | ## Eliminate remaining empty cells 480 | 481 | Combine males and females and collapse age and education categories for minorities. Each call to `joinCells` collapses some cells and stores the result in `nr`. 482 | 483 | \scriptsize 484 | ```{r echo=TRUE, include=TRUE} 485 | nr <- joinCells(nr, 18, 19, 21, 22) # hisp 30-64 hs 486 | nr <- joinCells(nr, 44, 48, 68, 72, 92, 96) # hisp 65+ >hs 487 | nr <- joinCells(nr, 57, 61, 81, 85) # black 18-29 col+ 488 | sparseCells(nr, nrweight = 4) # no more bad cells 489 | ``` 490 | 491 | \normalsize 492 | 493 | `NULL` means that, after collapsing, there are no empty cells or weights > 4. 494 | 495 | 496 | ## Step 4: compute weights and add to dataframe 497 | 498 | `weights(nr)` is a four dimensional (age x gender x race x education) array of **stratum weights**. I created `get_weights` to convert the stratum weights into a vector of **individual weights**. 
499 | 500 | \footnotesize 501 | ```{r echo=TRUE} 502 | get_weights <- function(data, nr) { 503 | wgt_arr <- weights(nr) 504 | var.names <- names(dimnames(wgt_arr)) 505 | indexes <- data %>% 506 | select(var.names) %>% 507 | mutate_all(as.integer) %>% 508 | as.matrix() 509 | wgt_arr[indexes] 510 | } 511 | pew$ps.weight <- get_weights(pew, nr) 512 | ``` 513 | 514 | 515 | ## Check that the post-stratification worked 516 | 517 | \scriptsize 518 | ```{r, echo=TRUE} 519 | compare_distributions("race3", pew, cps, "ps.weight", "weight") 520 | compare_distributions("educ4", pew, cps, "ps.weight", "weight") 521 | compare_distributions("age4", pew, cps, "ps.weight", "weight") 522 | ``` 523 | 524 | 525 | ## Step 5: compute estimates using the new weight 526 | 527 | Use either `xtabs` (in base R) or the `survey` package 528 | 529 | \scriptsize 530 | ```{r, echo=TRUE, include=TRUE, warning=FALSE, message=FALSE} 531 | round(100 * prop.table(xtabs(ps.weight ~ vote16, pew)), 1) 532 | design <- svydesign(ids = ~ 1, weights = ~ ps.weight, data = pew) 533 | round(100 * prop.table(svytable(~ vote16, design)), 1) 534 | cv <- function(x) sd(x) / mean(x) # coefficient of variation 535 | cv(pew$ps.weight)^2 # weighting loss 536 | ``` 537 | 538 | 539 | ## State estimates 540 | 541 | \footnotesize 542 | ```{r echo=TRUE} 543 | tbl <- 100 * prop.table(xtabs(ps.weight ~ state + vote16, data = pew), 1) 544 | round(tbl, 1)[1:10,] 545 | ``` 546 | 547 | ## Plotting the estimates 548 | 549 | State estimates are nearly useless, even if approximately unbiased. 550 | 551 | \scriptsize 552 | ```{r echo=FALSE, warning=FALSE, fig.width=9, fig.height=4.5, fig.align="center"} 553 | estimates <- votes16 %>% 554 | transmute(state, name, 555 | actual = 100 * clinton / turnout) %>% 556 | mutate(post.stratified = tbl[,"clinton"]) 557 | 558 | suppressMessages(library(ggplot2)) 559 | p1 <- ggplot(estimates, aes(actual, post.stratified)) + 560 | geom_abline(slope = 1, intercept = 0, col = "grey") + 561 | geom_point() + 562 | lims(x = c(0, 100), y = c(0, 100)) + 563 | labs(x = "Percent of votes", y = "Post-stratifed estimate") + 564 | theme_minimal() 565 | p2 <- ggplot(estimates, aes(post.stratified - actual)) + 566 | geom_histogram(binwidth = 10, center = 0, fill = "gray") + 567 | labs(x = "Error in estimate") + 568 | theme_minimal() 569 | 570 | suppressMessages(library(gridExtra)) 571 | grid.arrange(p1, p2, nrow = 1) 572 | ``` 573 | 574 | 575 | ## Mapping the estimates 576 | 577 | *Probably not something you'd want to share with the world.* 578 | 579 | ```{r, echo=FALSE, include=TRUE, message=FALSE, warning=FALSE} 580 | suppressMessages(library(maps)) 581 | suppressMessages(library(mapproj)) 582 | us_map <- map_data("state") 583 | estimates %>% 584 | mutate(state_name = tolower(name), 585 | clinton_pct = cut(post.stratified, breaks = c(-Inf, 40, 45, 50, 55, 60, 100), 586 | labels = c("<40", "40-45", "45-50", "50-55", "55-60", ">60"))) %>% 587 | ggplot(aes(map_id = state_name)) + 588 | geom_map(aes(fill = clinton_pct), map = us_map) + 589 | expand_limits(x = us_map$long, y = us_map$lat) + 590 | coord_map("albers", lat0 = 39, lat1 = 45) + 591 | scale_fill_brewer(name = "Clinton %", type = "div", palette = "RdBu") + 592 | theme(axis.line = element_blank()) + 593 | theme_void() 594 | ``` 595 | 596 | 597 | 598 | ## Problems with post-stratification 599 | 600 | **Empty cells** 601 | 602 | * Collapsing cells increases bias 603 | * Alternative is _raking_ (works if no empty cells in marginals) 604 | 605 | \smallskip 606 | 607 | **Small cells** 
608 | 609 | * Even with non-empty cells, mean estimates will be noisy if the cell size is small. 610 | * Recommended cell size is at least 20 -- suggest collapsing cells 611 | 612 | \smallskip 613 | 614 | **Large weights** 615 | 616 | * Common practice is to *trim* large weights. 617 | * Solves the wrong problem: population cell count is ok, sample cell estimate is noisy. 618 | 619 | 620 | # 3. Multilevel regression 621 | 622 | ## Key ideas 623 | 624 | * Use a regression model to **predict** values of the survey variable for each combination of covariates. 625 | * Use a **richly parameterized model** for the data (the first level) 626 | + Allow intercepts and possibly slopes to vary across groups in the data ("deep interactions"). 627 | + Use variables with many categories (such as counties or states), some of which may not be present in the data. 628 | * Introduce a second level model for the **parameters** in the first level. 629 | + The parameters of the second level model are sometimes called *hyperparameters*. 630 | + The second level model is used to avoid overfitting. Results in *shrinkage* or *regularization*. 631 | * Also known as hierarchical models, mixed models, random effects, variance components, repeated measures, longitudinal analysis, *etc.* 632 | 633 | 634 | ## Stein's paradox 635 | 636 | \columnsbegin 637 | \column{.3\textwidth} 638 | ```{r, out.width = "1.5in"} 639 | knitr::include_graphics("images/stein.jpg") 640 | ``` 641 | 642 | \smallskip 643 | 644 | \centering 645 | 646 | Charles Stein (1920-2016) 647 | 648 | \column{.5\textwidth} 649 | 650 | The **mean square error** (or **risk**) of an estimator $\hat\theta$ of $\theta$ is 651 | $$ 652 | \text{MSE}(\hat\theta, \theta) = E_\theta(\hat\theta - \theta)^2 653 | $$ 654 | 655 | An estimator $\hat\theta$ is **inadmissable** if there exists another estimator $\tilde\theta$ with smaller risk for *all* $\theta$. The estimator $\tilde\theta$ is said to **dominate** $\hat\theta$. 656 | 657 | \medskip 658 | 659 | When estimating three or more means of independent normal distributions, Stein showed that the sample mean is **inadmissable**. 660 | 661 | \columnsend 662 | 663 | 664 | ## A tale of three estimators 665 | 666 | To keep things simple, we start with models for vote by state without any covariates. 667 | 668 | **No pooling** estimate the model separately for each state (by a regression on 51 state dummies) 669 | 670 | **Complete pooling** estimate a single **grand mean** (equivalent to a regression with just an intercept) 671 | 672 | **Partial pooling** estimate a *random effects* model (to be explained shortly -- think of it as a compromise between no pooling and complete pooling) 673 | 674 | \bigskip 675 | \small 676 | 677 | (For pedagogical purposes, we will use *linear* instead of *logistic regression* in this section. The same concepts apply to nonlinear model.) 678 | 679 | 680 | ## Fixed effects (no pooling) 681 | 682 | We use `lm` to estimate a model with state dummies and no intercept (so the estimates are the predicted Clinton vote in each state). 
683 | 684 | \scriptsize 685 | ```{r, echo=TRUE} 686 | no.pooling <- lm(demvote ~ state - 1, data = pew) 687 | no.pooling <- data_frame( 688 | state = gsub("state", "", names(coef(no.pooling))), 689 | no.pooling = 100 * coef(no.pooling)) %>% 690 | mutate(state = factor(state, levels = levels(pew$state))) 691 | head(no.pooling) 692 | ``` 693 | 694 | 695 | ## Grand mean (complete pooling) 696 | 697 | The grand mean can be computed in **three equivalent ways**: 698 | 699 | \scriptsize 700 | ```{r, echo=TRUE} 701 | round(100 * mean(pew$demvote), 1) # 1. mean of dichotomous indicator 702 | round(100 * prop.table(xtabs(~ demvote, data = pew)), 1) # 2. cross-tabulation 703 | complete.pooling <- lm(demvote ~ 1, data = pew) # 3. intercept in regression 704 | complete.pooling <- 100 * coef(complete.pooling) 705 | round(complete.pooling, 1) 706 | ``` 707 | 708 | 709 | ## Random effects (partial pooling) 710 | 711 | Estimate a model with one **fixed effect** (the grand mean) plus **random effects** for each state, 712 | $$ 713 | \text{vote} = \text{grand mean} + \text{state effect} + \text{individual error} 714 | $$ 715 | or, in symbols, 716 | $$ 717 | y_i = \mu + \alpha_{j_i} + \epsilon_i \qquad (j_i = \text{state of respondent}) 718 | $$ 719 | where 720 | $$ 721 | \alpha_1,\dots,\alpha_J \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\alpha) \qquad 722 | \epsilon_1,\dots,\epsilon_n \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\epsilon) 723 | $$ 724 | The grand mean $\mu_i$ is a **fixed effect** while the state effects $\alpha_j$ are **random effects** (assumed to be $\text{Normal}(0, \sigma_\alpha)$). 725 | 726 | In classical statistics, parameters (like $\mu$, $\sigma_\epsilon$ and $\sigma_\alpha$) are **estimated** while random effects are **predicted**. 727 | 728 | ## What's the difference between fixed and random effects models? 729 | 730 | In both models, 731 | $$ 732 | y_i = \mu + \alpha_{j_i} + \epsilon_i 733 | $$ 734 | with the only difference being 735 | 736 | * **Fixed effects** $\alpha_1,\dots,\alpha_J$ are *unknown* parameters to be estimated. 737 | + $E(y_i) = \mu + \alpha_{j_i}$ and $V(y_i) = \sigma_\epsilon^2$ 738 | * **Random effects** $\alpha_1,\dots,\alpha_J$ are *unknown* random variables to be predicted. 739 | + $E(y_i) = \mu$ and $V(y_i) = \sigma_\alpha^2 + \sigma_\epsilon^2$ 740 | + But $E(y_i | \alpha_{j_i}) = \mu + \alpha_{j_i}$ and $V(y_i | \alpha_{j_i}) = \sigma_\epsilon^2$ 741 | 742 | If you find this confusing, you are not alone. Gelman and Hill argue that the only meaningful difference is that the $\alpha_j$ are modelled in one and not in the other. 743 | 744 | 745 | ## Predicting random effects 746 | 747 | The default method for R's `lmer` function (in the `lme4` package) is *REML* (residualized or restricted maximum likelihood), which is unbiased for linear models. 748 | 749 | \scriptsize 750 | ```{r, echo=TRUE} 751 | suppressMessages(library(lme4)) 752 | partial.pooling <- lmer(demvote ~ 1 + (1 | state), data = pew) 753 | partial.pooling 754 | ``` 755 | 756 | ## Extracting predictions 757 | 758 | \scriptsize 759 | ```{r, echo=TRUE} 760 | fixef(partial.pooling) # grand mean 761 | ranef(partial.pooling)$state %>% head(4) # state effects 762 | coef(partial.pooling)$state %>% head(4) # state predictions 763 | ``` 764 | 765 | 766 | ## Which is better? 
767 | 768 | \scriptsize 769 | ```{r echo=FALSE} 770 | partial.pooling <- coef(partial.pooling)$state %>% 771 | as_data_frame(rownames = "state") %>% 772 | transmute(state = factor(state, levels = levels(pew$state)), 773 | partial.pooling = 100 * `(Intercept)`) 774 | estimates <- estimates %>% 775 | left_join(no.pooling, by = "state") %>% 776 | left_join(partial.pooling, by = "state") %>% 777 | mutate(n = as.integer(xtabs(~ state, data = pew))) 778 | estimates 779 | ``` 780 | 781 | 782 | ```{r echo=FALSE, include=FALSE} 783 | rmse <- function(est, act) sqrt(mean((est - act)^2, na.rm = TRUE)) 784 | RMSE <- estimates %>% 785 | summarize(complete.pooling = rmse(complete.pooling, actual), 786 | no.pooling = rmse(no.pooling, actual), 787 | partial.pooling = rmse(partial.pooling, actual)) %>% 788 | unlist() 789 | RMSE 790 | ``` 791 | 792 | \normalsize 793 | 794 | The RMSE is about the same for *no pooling* (`r round(RMSE["no.pooling"], 1)`%) and *complete pooling* (`r round(RMSE[ 795 | "complete.pooling"], 1)`%), but substantially smaller for *partial pooling* (`r round(RMSE["partial.pooling"], 1)`%). 796 | 797 | 798 | 799 | ## 2016 U.S. election: estimates *vs.* actuals 800 | 801 | ```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=4.5, fig.width=9, fig.align="center"} 802 | p1 <- ggplot(estimates, aes(actual, no.pooling)) + 803 | geom_point() + 804 | labs(title = "No pooling", x = "Percent of voters", 805 | y = "Percent of voters") + 806 | lims(x = c(0,100), y = c(0,100)) + 807 | theme_minimal() 808 | p2 <- ggplot(estimates, aes(actual, partial.pooling)) + 809 | geom_point() + 810 | labs(title = "Partial pooling", x = "Percent of voters", 811 | y = "Percent of voters") + 812 | lims(x = c(0,100), y = c(0,100)) + 813 | theme_minimal() 814 | suppressMessages(library(gridExtra)) 815 | grid.arrange(p1, p2, nrow = 1) 816 | ``` 817 | 818 | 819 | ## Shrinkage 820 | 821 | ```{r echo=FALSE, fig.align="center"} 822 | na.omit(estimates) %>% 823 | select(-post.stratified) %>% 824 | rename("Actual" = actual, "No pooling" = no.pooling, 825 | "Partial pooling" = partial.pooling) %>% 826 | gather(Estimator, Percentage, -state, -name, -n) %>% 827 | mutate(Estimator = factor(Estimator, levels = c("Post-stratified", 828 | "No pooling", "Partial pooling", "Actual"))) %>% 829 | ggplot(aes(Estimator, Percentage, group = state, color = n)) + 830 | geom_line(size = 0.7) + 831 | scale_color_gradientn(colors = blues9[-1]) + 832 | lims(y = c(0, 100)) + 833 | theme_minimal() 834 | ``` 835 | 836 | 837 | ## What *exactly* is a random effects estimator? 838 | 839 | The random effects/partial pooling estimate of a group mean is a weighted average of the fixed effects/no pooling estimate and the grand mean/complete pooling estimate. 840 | 841 | $$ 842 | \hat\theta_j^{\text{RE}} = w_j \, \bar y_j + (1 - w_j) \, \bar y 843 | $$ 844 | 845 | * The weight on the no pooling estimator $\bar y_j$ is proportional to sample size $n_j$ in group $j$ and inversely proportional to the **within group variance**. 846 | * The weight on the complete pooling is inversely proportional to the **between group variance**. 847 | * Sometimes referred to as an **empirical Bayesian estimator**. The amount of shrinkage is data dependent. 848 | 849 | 850 | ## Adding individual covariates: fixed effects 851 | 852 | To get better predictions, let's add a covariate -- **gender** -- to the model. 
There are now three *fixed effects* estimators: 853 | 854 | **Complete pooling** estimate a single slope and single intercept for all states 855 | 856 | **No pooling** estimate separate intercepts and slopes for each state 857 | 858 | **Complete pooling of slopes, no pooling of intercepts** estimate separate intercepts for each state and a common slope 859 | 860 | *(It rarely makes sense to estimate separate slopes and a common intercept, so we ignore this possibility.)* 861 | 862 | 863 | ## Complete pooling: common intercepts and slopes 864 | 865 | $$ 866 | y_i = \alpha + \beta x_i + \epsilon_i \qquad \epsilon_i \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\epsilon) 867 | $$ 868 | 869 | \scriptsize 870 | ```{r, echo=TRUE} 871 | pew <- mutate(pew, female = ifelse(gender == "female", 1, 0)) 872 | fit1 <- lm(demvote ~ 1 + female, data = pew) 873 | arm::display(fit1) 874 | ``` 875 | 876 | \normalsize 877 | 878 | $\hat\beta = 0.14$ implies a 14% gender gap (42% of men voted for Clinton *vs.* 56% of women). 879 | 880 | 881 | ## No pooling: separate intercepts and slopes 882 | 883 | $$ 884 | y_i = \alpha_{j_i} + \beta_{j_i} x_i + \epsilon_i 885 | $$ 886 | 887 | \tiny 888 | ```{r, echo=TRUE} 889 | fit2 <- lm(demvote ~ 0 + state + state:female, data = pew) 890 | coef.fit2 <- as.matrix(coef(fit2)) 891 | round(coef(fit2)[c(1:5, 90:98)], 2) 892 | ``` 893 | 894 | \normalsize 895 | It's impossible to estimate the slope in Wyoming: 896 | 897 | \tiny 898 | ```{r, echo=FALSE} 899 | options(width = 100) 900 | xtabs(~ gender + state, data = pew) # no men in WY 901 | options(width = 85) 902 | ``` 903 | 904 | ```{r echo=FALSE, include=FALSE} 905 | coef.fit2 <- coef(fit2) 906 | coef.fit2 <- as.matrix(data.frame(intercept = coef.fit2[1:48], 907 | slope = coef.fit2[50:97], row.names = 908 | gsub("state", "", names(coef.fit2)[1:48]))) 909 | ``` 910 | 911 | 912 | ## Separate intercepts with common slope 913 | 914 | $$ 915 | y_i = \alpha_{j_i} + \beta x_i + \epsilon_i 916 | $$ 917 | 918 | \scriptsize 919 | ```{r echo=TRUE} 920 | fit3 <- lm(demvote ~ state + female - 1, data = pew) 921 | coef.fit3 <- coef(fit3) 922 | names(coef.fit3) <- gsub("state", "", names(coef.fit3)) 923 | round(coef.fit3, 2) 924 | ``` 925 | 926 | 927 | ## Which should you believe? 
928 | 929 | ```{r, echo=FALSE, fig.align="center"} 930 | par(mfrow=c(1,2)) 931 | plot.gender.by.state <- function(main, coefs, fixef) { 932 | n <- nrow(coefs) 933 | x <- rep(c(0,1), rep(n, 2)) 934 | y <- 100 * c(coefs[[1]], coefs[[1]] + coefs[[2]]) 935 | plot(x = x, y = y, pch = 19, cex = 0.5, xlim = c(0, 1), 936 | ylim = c(0, 100), main = main,xlab = "Gender", 937 | ylab = "Percent voting for Clinton", axes = FALSE) 938 | axis(1, at = c(0, 1), labels = c("Male", "Female")) 939 | axis(2, at = seq(0, 100, 20)) 940 | for (i in seq_len(n)) lines(x = c(0, 1), 941 | y = c(y[i], y[i+n]), col = "grey", lwd = 0.5) 942 | abline(100 * fixef, col = "red3", lwd = 2) 943 | } 944 | plot.gender.by.state("No pooling", as_data_frame(coef.fit2)[-49,], 945 | coef(fit1)) 946 | coef.fit3.df <- data_frame(Intercept = coef.fit3[-length(coef.fit3)], 947 | Slope = coef.fit3[length(coef.fit3)]) 948 | plot.gender.by.state("Common slope", coef.fit3.df, coef(fit1)) 949 | par(mfrow=c(1,1)) 950 | ``` 951 | 952 | 953 | ## Gender gaps in 2016 Exit poll 954 | 955 | \centering 956 | ![](images/gender-gap-xp.pdf) 957 | 958 | 959 | ## Multilevel models with a single covariate 960 | 961 | **Varying intercepts, common slope** 962 | 963 | $$ 964 | y_i = \mu + \alpha_{j_i} + \beta x_i + \epsilon_i 965 | $$ 966 | 967 | * The intercept for state $j$ is $\mu + \alpha_j$. 968 | * The slope (gender gap) is $\beta$ (same for all states). 969 | 970 | **Varying intercepts and slopes** 971 | 972 | $$ 973 | y_i = \mu + \alpha_{j_i} + \beta x_i + \gamma_{j_i} x_i + \epsilon_i 974 | $$ 975 | 976 | * The intercept for state $j$ is $\mu + \alpha_j$. 977 | * The slope (gender gap) for state $j$ is $\beta + \gamma_j$ (varies). 978 | 979 | 980 | ## Random effects: varying intercepts, common slopes 981 | 982 | \scriptsize 983 | ```{r echo=TRUE} 984 | # Varying intercept and common slope 985 | fit4 <- lmer(demvote ~ 1 + female + (1 | state), data = pew) 986 | fit4 987 | ``` 988 | 989 | 990 | ## Random effects: varying intercepts and slopes (first try) 991 | 992 | \tiny 993 | ```{r, echo=TRUE} 994 | (fit5 <- lmer(demvote ~ 1 + female + (1 + female | state), 995 | data = pew)) # fails to converge 996 | ``` 997 | 998 | 999 | ## Fixing the convergence failure (by centering covariates) 1000 | 1001 | \scriptsize 1002 | ```{r echo=TRUE} 1003 | pew <- mutate(pew, female.c = female - 0.5) 1004 | fit6 <- lmer(demvote ~ 1 + female.c + (1 + female.c | state), 1005 | data = pew) 1006 | fixef(fit6) # not comparable to prior models 1007 | head(coef(fit6)$state) 1008 | ``` 1009 | 1010 | \normalsize 1011 | 1012 | (No need to wrestle with `lmer` if you use Stan -- which is up next.) 1013 | 1014 | 1015 | ## Unscale the estimates for comparability 1016 | 1017 | \scriptsize 1018 | ```{r, echo=TRUE} 1019 | fixef.fit6 <- fixef(fit6) 1020 | fixef.fit6[1] <- fixef.fit6[1] - 0.5 * fixef.fit6[2] 1021 | coef.fit6 <- coef(fit6)$state 1022 | coef.fit6[[1]] <- coef.fit6[[1]] - 0.5 * coef.fit6[[2]] 1023 | fixef.fit6 1024 | head(coef.fit6) 1025 | ``` 1026 | 1027 | 1028 | ## Comparing the random effects estimates 1029 | 1030 | Gender gaps are correlated with state effects ($r = 0.56$). 
1031 | 1032 | ```{r, echo=FALSE, fig.width=9, fig.height=5} 1033 | par(mfrow=c(1,2)) 1034 | plot.gender.by.state("Varying intercepts, common slope", 1035 | coef(fit4)$state, fixef(fit4)) 1036 | plot.gender.by.state("Varying intercepts and slopes", 1037 | coef.fit6, fixef.fit6) 1038 | par(mfrow=c(1,1)) 1039 | ``` 1040 | 1041 | 1042 | ## Adding group-level covariates to the model 1043 | 1044 | Add a **group-level covariate** (2012 vote) to the model. This is impossible in the no-pooling model, since group level variables are **collinear** with state indicators. 1045 | 1046 | As before, $i$ indexes respondents and $j$ indexes states. The variables are: 1047 | $$ 1048 | \begin{aligned} 1049 | y_i &= \text{respondent 2016 vote (1 = Clinton, 0 = other)} \\ 1050 | x_i &= \text{respondent gender (1 = female, 0 = male)} \\ 1051 | z_j &= \text{proportion in state voting for Obama in 2012} 1052 | \end{aligned} 1053 | $$ 1054 | At the respondent level, both intercepts and slopes vary by group (state): 1055 | $$ 1056 | \text{\textbf{Respondent model}} \quad y_i = \alpha_{j_i} + \beta_{j_i} x_i + \epsilon_i \qquad (i = 1,\dots,n) 1057 | $$ 1058 | The state intercepts and slopes are modelled as: 1059 | $$ 1060 | \text{\textbf{Group model}} \qquad \alpha_j = \mu + \delta z_j + \eta_j \qquad 1061 | \beta_j = \gamma + \xi_j 1062 | $$ 1063 | The errors are all assumed to be iid normal with zero means: 1064 | $$ 1065 | \epsilon_i \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\epsilon) 1066 | \qquad 1067 | \eta_j \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\eta) 1068 | \qquad 1069 | \xi_j \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\xi) 1070 | $$ 1071 | 1072 | ## An equivalent version of the model 1073 | 1074 | We can move covariates from the **upper** (group) level to the **lower** (respondent) level without changing anything: 1075 | $$ 1076 | y_i = \mu + \gamma x_i + \delta z_{j_i} + \left( \eta_{j_i} + \xi_{j_i} x_i + \epsilon_i \right) 1077 | $$ 1078 | The first three terms are **fixed effects**, while the last terms (in parentheses) are **random effects**. 1079 | 1080 | The random effects in this formulation all have **mean zero**, which is required by `lmer`. This assumption is not restrictive -- anything on the RHS of the group model with non-zero mean can be moved to the respondent level fixed effects. 
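## A simulation sketch of the two-level model

To make the two formulations concrete, here is a small simulation sketch (all parameter values invented) that generates data from the group model and then the respondent model; the combined formulation corresponds to an `lmer` call of the form shown on the next slide.

\scriptsize
```{r, echo=TRUE, eval=FALSE}
# Simulate from the two-level model with invented parameter values
set.seed(1)
J <- 50; n <- 2000
mu <- 0.3; gamma <- 0.1; delta <- 0.5              # fixed effects
sigma_eta <- 0.05; sigma_xi <- 0.03; sigma_eps <- 0.45
z   <- runif(J, 0.3, 0.7)                          # group-level covariate
eta <- rnorm(J, 0, sigma_eta)                      # group intercept errors
xi  <- rnorm(J, 0, sigma_xi)                       # group slope errors
j   <- sample(J, n, replace = TRUE)                # respondent's group
x   <- rbinom(n, 1, 0.5)                           # respondent covariate
y   <- mu + gamma * x + delta * z[j] + eta[j] + xi[j] * x + rnorm(n, 0, sigma_eps)
sim <- data.frame(y, x, z = z[j], group = factor(j))
lme4::lmer(y ~ 1 + x + z + (1 + x | group), data = sim)
```

\normalsize

(Note that by default `lmer` also estimates a correlation between the varying intercepts and slopes.)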
1081 | 1082 | 1083 | ## Estimating models with covariates at both levels 1084 | 1085 | Join group level covariates to respondent data: 1086 | 1087 | \scriptsize 1088 | ```{r, echo=TRUE} 1089 | obama12 <- votes12 %>% 1090 | mutate(obama12 = obama / turnout) %>% 1091 | select(state, obama12) 1092 | pew <- left_join(pew, obama12, by = "state") 1093 | fit7 <- lmer(demvote ~ 1 + female.c + obama12 + (1 + female.c | state), data = pew) 1094 | ``` 1095 | 1096 | \tiny 1097 | ```{r} 1098 | options(width = 105) 1099 | arm::display(fit7) 1100 | options(width = 85) 1101 | ``` 1102 | 1103 | 1104 | 1105 | 1106 | ## First attempt at MRP 1107 | 1108 | We can use the R `predict` function to **impute** `demvote` onto `cps`: 1109 | 1110 | \scriptsize 1111 | ```{r echo=TRUE, warning=FALSE, message=FALSE} 1112 | cps <- cps %>% 1113 | mutate(female = ifelse(gender == "female", 1, 0), 1114 | female.c = female - 0.5) %>% 1115 | left_join(obama12, by = "state") 1116 | prob <- predict(fit7, newdata = cps, allow.new.levels = TRUE) 1117 | mrp1 <- cps %>% 1118 | mutate(prob = prob) %>% 1119 | group_by(state) %>% 1120 | summarize(mrp1 = 100 * weighted.mean(prob, weight)) 1121 | estimates <- left_join(estimates, mrp1, by = "state") 1122 | head(estimates) 1123 | ``` 1124 | 1125 | 1126 | ## Plotting first MRP estimates 1127 | 1128 | \centering 1129 | 1130 | ```{r echo=FALSE, message=FALSE, warning = FALSE, fig.width=4.5, fig.height=4, fig.aling="center"} 1131 | RMSE["mrp1"] <- with(estimates, rmse(mrp1, actual)) 1132 | ggplot(estimates, aes(actual, mrp1)) + 1133 | geom_abline(intercept = 0, slope = 1, col = "grey") + 1134 | geom_point(size = 1.5) + 1135 | lims(x = c(0, 100), y = c(0, 100)) + 1136 | labs(x = "Percentage of vote for Clinton", y = "Estimate") + 1137 | theme_minimal() 1138 | ``` 1139 | 1140 | \normalsize 1141 | 1142 | The MRP estimates are now pretty good (RMSE = `r round(RMSE["mrp1"], 2)`)! 1143 | 1144 | 1145 | 1146 | 1147 | ## Summary of multilevel models 1148 | 1149 | * Multilevel models allow us to estimate models with large numbers of parameters and still get sensible results. 1150 | * Multilevel models **shrink** or **regularize** estimates. 1151 | + Random effect estimates of group parameters are shrunk toward a "grand mean" effect. 1152 | * The amount of **shrinkage** in estimating a parameter depends upon: 1153 | + How much data is available to estimate the parameter. 1154 | + How much the parameter varies across groups. 1155 | * Substantial improvements can be made by adding **group-level predictors** (like past vote). 1156 | * Also possible to estimate multilevel **logistic regressions** using `glmer`, but we will focus on Bayesian estimation instead. 1157 | 1158 | 1159 | # 4. Bayesian inference 1160 | 1161 | 1162 | ## What is Bayesian statistics? 1163 | 1164 | \columnsbegin 1165 | \column{.3\textwidth} 1166 | ```{r, out.width = "1.5in"} 1167 | knitr::include_graphics("images/bayes.png") 1168 | ``` 1169 | 1170 | \smallskip 1171 | 1172 | \centering 1173 | 1174 | Thomas Bayes (1701-61) 1175 | 1176 | \column{.5\textwidth} 1177 | 1178 | Bayesian statistics is a method for combining **prior information** with **data** to draw inferences. 1179 | 1180 | \medskip 1181 | 1182 | Probabilities are **subjective measures of uncertainty** and, in Bayesian inference, can be applied to *anything*, including events that have already happened. 
1183 | 1184 | \medskip 1185 | 1186 | Bayesian inference seems particularly well suited for situations where there is uncertainty about the sampling process and no **objective sampling distribution**. 1187 | 1188 | \columnsend 1189 | 1190 | 1191 | ## Advantages and disadvantages of Bayesian inference 1192 | 1193 | \columnsbegin 1194 | \column{.3\textwidth} 1195 | ```{r, out.width = "1.5in"} 1196 | knitr::include_graphics("images/bayes-grave.jpg") 1197 | ``` 1198 | 1199 | \smallskip 1200 | 1201 | \centering 1202 | \small 1203 | 1204 | Cotton and Bayes 1205 | 1206 | family vault 1207 | 1208 | \footnotesize 1209 | 1210 | (Bunhill Fields, London) 1211 | 1212 | \column{.3\textwidth} 1213 | 1214 | \normalsize 1215 | 1216 | **Advantages** 1217 | 1218 | * Does not depend on having large samples 1219 | * Can incorporate prior (non-sample) information 1220 | * Better estimates of variance parameters 1221 | * Simpler conceptual framework, avoids *ad hockery* 1222 | 1223 | \column{.3\textwidth} 1224 | 1225 | \normalsize 1226 | 1227 | **Disadvantages** 1228 | 1229 | * Requires specification of prior probability distributions 1230 | * Inferences are not "objective" -- different results depending on what priors you use 1231 | * More math 1232 | * Takes *much* more computational effort 1233 | 1234 | \columnsend 1235 | 1236 | 1237 | ## Basic ideas 1238 | 1239 | In Bayesian statistics, both data and parameters are considered to be random variables. The important distinction is between what is **observed** (the data, except possibly for missing values) and what isn't (parameters, such as regression coefficients). 1240 | 1241 | As in classical statistics, we specify a **model** (probability distribution) for the data, which depends on some unknown parameters. In addition, we summarize any information that we might have about the unknown parameters using a **prior probability distribution**. 1242 | 1243 | Bayesian inference uses Bayes theorem to combine the prior information with the data to obtain a **posterior probability distribution** for the parameters. 1244 | 1245 | 1246 | 1247 | ## Notation for Bayesian analysis 1248 | 1249 | **Data** $y = (y_1,\dots,y_n)$ 1250 | 1251 | **Parameters** $\theta = (\theta_1,\dots,\theta_k)$ 1252 | 1253 | **Prior distribution** $p(\theta) = p(\theta_1,\dots,\theta_k)$ 1254 | 1255 | **Model** $p(y|\theta)$ (*a.k.a.* the **likelihood function**) 1256 | 1257 | **Posterior distribution** $p(\theta | y)$ 1258 | 1259 | 1260 | ## Example: What is the average airfare to AAPOR in Denver? 1261 | 1262 | ```{r echo=FALSE, include=FALSE} 1263 | post.precis <- 1 / 100^2 + 3 / 150^2 1264 | post.mean <- (400 / 100^2 + 300 * 3 / 150^2) / post.precis 1265 | post.sd <- 1 / sqrt(post.precis) 1266 | ``` 1267 | 1268 | 1269 | **Population parameters** 1270 | 1271 | * $\theta = \text{population average fare}$ 1272 | * $\sigma = \text{population s.d.} = 150$ (assumed to be known) 1273 | 1274 | **Data** My airfare was $y_1 = 161$ and two friends spent $y_2 = 250$ and $y_3 = 489$. 
1275 | 1276 | $$ 1277 | \bar y = \dfrac{161 + 250 + 489}{3} = 300 \qquad n = 3 1278 | $$ 1279 | 1280 | **Model** $y_1,\dots,y_n$ are independent $\text{Normal}(\theta, 150)$ random variables 1281 | 1282 | **Prior information** $\theta \sim \text{Normal}(400, 100)$ (so $P(200 < \theta < 600) \approx 95\%$) 1283 | 1284 | **Posterior distribution** $\theta | y_1,y_2,y_3 \sim \text{Normal}(`r round(post.mean)`, `r round(post.sd)`)$ 1285 | 1286 | 1287 | ## Bayesian updating: from prior to posterior 1288 | 1289 | \centering 1290 | 1291 | From Bayes' Theorem, it can be shown that $\theta | y \sim \text{Normal}(343, 65)$. 1292 | 1293 | 1294 | ```{r echo=FALSE, fig.width=4.5, fig.height=2.75, fig.align="center"} 1295 | ggplot(data_frame(theta = c(0, 700) ), aes(theta)) + 1296 | stat_function(fun = dnorm, color = "red", 1297 | args = list(mean = 300, sd = 200 / sqrt(3))) + 1298 | stat_function(fun = dnorm, color = "blue", 1299 | args = list(mean = 400, sd = 100)) + 1300 | stat_function(fun = dnorm, color = "purple", 1301 | args = list(mean = post.mean, sd = post.sd)) + 1302 | labs(x = quote(theta), y = "") + 1303 | scale_x_continuous(breaks = seq(0, 700, 100)) + 1304 | scale_y_continuous(breaks = NULL) + 1305 | theme_minimal() 1306 | ``` 1307 | 1308 | *The posterior is a compromise between the prior and the likelihood.* 1309 | 1310 | 1311 | 1312 | ## Relationship between prior and posterior mean 1313 | 1314 | In the normal model, the **posterior mean** is a weighted average of the **prior mean** and the **sample mean**. 1315 | $$ 1316 | \begin{aligned} 1317 | \text{posterior mean} &= \text{weight} \times \text{prior mean} + (1 - \text{weight}) \times \text{sample mean} \\[10pt] 1318 | \text{weight} &= \dfrac{ \text{prior precision} }{ \text{prior precision} + \text{sample precision} } \\[3pt] 1319 | \text{prior precision} &= \dfrac{1}{ \text{prior variance} } \\[3pt] 1320 | \text{sample precision} &= \dfrac{n}{ \text{data variance} } 1321 | \end{aligned} 1322 | $$ 1323 | 1324 | When the sample size is large, the weight on the prior becomes small. However, when the sample size is small, the posterior **shrinks** the sample mean toward the prior. 1325 | 1326 | 1327 | ## Markov Chain Monte Carlo (MCMC) 1328 | 1329 | For simple models (like the normal model with a normal prior), the posterior can be derived **analytically**. This is impossible for most models and priors, which limited the application of Bayesian methods before MCMC (about 1990). 1330 | 1331 | MCMC is a technique for **simulating** draws from a probability distribution. With simulated draws, we can: 1332 | 1333 | * Use a histogram or density plot to see what the **posterior distribution** looks like. 1334 | * Compute an estimate of the **posterior mean** (the average of the draws) or **posterior standard deviation** (the s.d. of the draws) 1335 | * Compute quantiles to obtain **credible intervals** (a.k.a. "confidence intervals"). 1336 | 1337 | Programs like BUGS and Stan automate most of this process. 1338 | 1339 | 1340 | ## Instructions for installing Stan on MacOS 1341 | 1342 | 1343 | 1344 | 1. Install a C++ compiler (Xcode for MacOS or Rtools for Windows) 1345 | 1346 | 2. 
Create a `Makevars` file your `.R` folder (in your home directory): 1347 | 1348 | \scriptsize 1349 | ``` 1350 | CC=clang 1351 | CXX=clang++ -arch x86_64 -ftemplate-depth-256 1352 | CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable \ 1353 | -Wno-unused-function -Wno-macro-redefined \ 1354 | -Wno-unused-local-typedefs -Wno-c++11-inline-namespace \ 1355 | -Wno-unneeded-internal-declaration -Wno-unknown-pragmas 1356 | ``` 1357 | 1358 | \normalsize 1359 | 1360 | 3. Install `rstan` from CRAN 1361 | 1362 | \scriptsize 1363 | ```{r echo=TRUE, eval=FALSE} 1364 | install.packages(c("rstan", "rstanarm")) 1365 | ``` 1366 | 1367 | 1368 | ## A first Stan program 1369 | 1370 | **Model** $y_1,\dots,y_n \stackrel{\text{iid}}{\sim} \text{Normal}(\theta, \sigma)$ ($\sigma = 150$) 1371 | 1372 | **Prior** $\theta \sim \text{Normal}(\theta_0, \omega_0)$ ($\theta_0 = 350$, $\omega_0 = 100$) 1373 | 1374 | \scriptsize 1375 | ```{r, echo=TRUE} 1376 | model_code <- "data { 1377 | int n; 1378 | real y[n]; 1379 | real theta_0; 1380 | real omega_0; 1381 | real sigma; 1382 | } 1383 | parameters { 1384 | real theta; 1385 | } 1386 | model { 1387 | theta ~ normal(theta_0, omega_0); 1388 | for (i in 1:n) { 1389 | y[i] ~ normal(theta, sigma); 1390 | } 1391 | }" 1392 | ``` 1393 | 1394 | 1395 | ## Components of a Stan program 1396 | 1397 | **Data** (what you know) 1398 | 1399 | \scriptsize 1400 | ```{r eval=FALSE, echo=TRUE} 1401 | data { 1402 | int n; // number of data points 1403 | real y[n]; // data on fares 1404 | real theta_0; // prior mean 1405 | real omega_0; // prior sd 1406 | real sigma; // sd of fares 1407 | } 1408 | ``` 1409 | 1410 | \normalsize 1411 | 1412 | **Parameters** (what you don't know) 1413 | 1414 | \scriptsize 1415 | ```{r eval=FALSE, echo=TRUE} 1416 | parameters { 1417 | real theta; // mean fare 1418 | } 1419 | ``` 1420 | 1421 | \normalsize 1422 | 1423 | **Model** (joint distribution of data and parameters) 1424 | 1425 | \scriptsize 1426 | ```{r eval=FALSE, echo=TRUE} 1427 | model { 1428 | theta ~ normal(theta_0, omega_0); // prior 1429 | for (i in 1:n) 1430 | y[i] ~ normal(theta, sigma); // likelihood 1431 | ``` 1432 | 1433 | 1434 | ## Running a Stan program from R 1435 | 1436 | To run a Stan program from R, you will need: 1437 | 1438 | * Load `rstan` package (using the `library` function) 1439 | * Code your Stan `model_code` (as a string or from an external file) 1440 | * Create a named list `data` (containing all of the items listed in the Stan `data` block) 1441 | * Call the `stan` function with `model_code` and `data` as arguments (and possibly other options) 1442 | 1443 | Stan returns draws from the posterior distribution of the parameters, arranged into a three-dimensional array (draws x chains x parameters). 1444 | 1445 | 1446 | ## Example (continued) 1447 | 1448 | \footnotesize 1449 | ```{r, echo=TRUE,warning=FALSE,message=FALSE} 1450 | suppressMessages(library(rstan)) 1451 | rstan_options(auto_write = TRUE) 1452 | options(mc.cores = parallel::detectCores()) 1453 | y <- c(161, 250, 489) 1454 | data <- list(y = y, n = length(y), theta_0 = 350, 1455 | omega_0 = 150, sigma = 100) 1456 | sims <- stan(model_code = model_code, data = data, 1457 | chains = 4, iter = 500, seed = 1234) 1458 | ``` 1459 | 1460 | \normalsize 1461 | 1462 | This will simulate four **chains** of length `iter = 500`. 1463 | 1464 | \smallskip 1465 | 1466 | By default, half of each chain is for **warmup**, so the result will be a 500 x 4 x 1 array (iterations x chains x parameters). 
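## Checking against the closed-form posterior

Because this first model is conjugate (normal likelihood, normal prior, known $\sigma$), the posterior is also available in closed form. A quick sketch (not evaluated here) that computes it from the values in `data`, for comparison with the simulation output on the next slide:

\scriptsize
```{r, echo=TRUE, eval=FALSE}
# Closed-form normal-normal posterior, for comparison with the Stan draws
prior.precision  <- 1 / data$omega_0^2
sample.precision <- data$n / data$sigma^2
post.mean <- (data$theta_0 * prior.precision + mean(data$y) * sample.precision) /
  (prior.precision + sample.precision)
post.sd <- 1 / sqrt(prior.precision + sample.precision)
c(analytic.mean = post.mean, analytic.sd = post.sd)
summary(sims, pars = "theta")$summary[, c("mean", "sd")]   # MCMC values
```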
1467 | 1468 | 1469 | ## Stan output 1470 | 1471 | \scriptsize 1472 | ```{r, echo=TRUE, include=TRUE} 1473 | print(sims) 1474 | ``` 1475 | 1476 | 1477 | ## How to read Stan output 1478 | 1479 | * `mean` is the **posterior mean** -- estimated to be 302.94 for $\theta$. 1480 | * `se_mean` is the **monte carlo standard error**. 1481 | * The value 2.63 tells you that only the first two digits are reliable -- the point estimate is probably between 297 and 309. 1482 | * To make this smaller, redo the simulations with a larger value of `iter`. 1483 | * `sd` is the **posterior standard deviation** of $\theta$ -- beyond the error in the computation of the point estimate. 1484 | * The percentages are **posterior quantiles**. 1485 | * The 95% **credible interval** is 200.47 to 404.01. 1486 | * `Rhat` is a **convergence** diagnostic. You should look for values near one. 1487 | 1488 | 1489 | ## Weakly informative priors 1490 | 1491 | Bayesian inference is often criticized for being *subjective* since different priors will yield different inferences. 1492 | 1493 | **Non-informative priors** can be used to reproduce many classical frequentist ("objective") results. Sometimes these priors are not *proper* (*e.g.*, $\theta$ is uniformly distributed on $\Bbb{R}$). 1494 | 1495 | **Weakly informative priors** purposely include less information than we actually have, but still provide some *regularization* (shrinkage). 1496 | 1497 | For example, logistic regression coefficients are rarely greater than five or six (the difference between the top and bottom decile is about 4.4, between the .95 and .05 quantiles is about 5.9). In a linear model, a ``six sigma'' effect would be considered extremely large by any standard. 1498 | 1499 | 1500 | ## Priors on standard deviations 1501 | 1502 | In real world applications, we do not know the standard deviation of the data, so there are two unknown parameters ($\mu$ and $\sigma$). 1503 | 1504 | Since $\sigma > 0$, we need to use a prior distribution for $\sigma$ that assigns zero probability to negative numbers. There are many possible choices. 1505 | 1506 | Traditionally, variance priors were selected for computational convenience, but with modern MCMC methods this is no longer necessary. 1507 | 1508 | Gelman, "Prior distributions for variance parameters in hierarchical models," *Bayesian Analysis* (2006) recommends the *Folded noncentral t* distribution (the distribution of the absolute value of a noncentral $t$ random variable). A *half normal* prior may also be suitable. 1509 | 1510 | 1511 | ## Stan code for estimating both mean and SD 1512 | 1513 | Move `sigma` from the `data` block to the `parameters` block. 
1514 | 1515 | \scriptsize 1516 | ```{r stan_mean_sd, echo=TRUE, warning=FALSE, message=FALSE} 1517 | data <- list(y = y, n = length(y), theta_0 = 350, omega_0 = 150) 1518 | model_code <- "data { 1519 | int n; 1520 | real y[n]; 1521 | real theta_0; 1522 | real omega_0; 1523 | } 1524 | parameters { 1525 | real theta; 1526 | real sigma; // moved to parameter block 1527 | } 1528 | model { 1529 | theta ~ normal(theta_0, omega_0); 1530 | sigma ~ normal(150, 150); 1531 | y ~ normal(theta, sigma); // assumed iid 1532 | }" 1533 | sims <- stan(model_code = model_code, data = data, 1534 | iter = 500, seed = 1234) 1535 | ``` 1536 | 1537 | 1538 | ## Output from two-parameter model 1539 | 1540 | \scriptsize 1541 | ```{r mean_sd_output, echo=TRUE} 1542 | print(sims) 1543 | ``` 1544 | 1545 | 1546 | ## Graphical display of parameters 1547 | 1548 | \footnotesize 1549 | ```{r, fig.height=4.5,message=FALSE,warning=FALSE} 1550 | plot(sims) 1551 | ``` 1552 | 1553 | 1554 | ## Traceplot 1555 | 1556 | \footnotesize 1557 | ```{r, fig.height=4.5,message=FALSE,warning=FALSE} 1558 | traceplot(sims) 1559 | ``` 1560 | 1561 | ## Hierarchical priors 1562 | 1563 | Multilevel models contain many parameters and it is difficult to specify **high dimensional priors**. In a model for respondents in U.S. states, we might have intercepts for each state, each of which would need to be assigned a prior distribution. 1564 | 1565 | Instead, we use **hierarchical priors** where the priors for the model parameters depend upon **hyperparameters**. That is, we model relationships between the many model parameters using a second level model. 1566 | 1567 | Hierarchical priors are **data dependent**. We infer relationships between model parameters using observed patterns in the data. 1568 | 1569 | For example, we can model state intercepts in a voting model using past vote. To the extent that 2016 respondents in states that voted Republican in 2012 also say they will vote Republican in 2016, we can infer a correlation between past vote and state intercepts in the 2016 model. 1570 | 1571 | 1572 | ## Example: a simple two-level model 1573 | 1574 | ![](images/dag_2level.png) 1575 | 1576 | 1577 | ## Notation for hierarchical normal model 1578 | 1579 | **Individuals** indexed by $i = 1,\dots,n$. 1580 | 1581 | **Groups** indexed by $j = 1,\dots,J$. Individual $i$ belongs to group $j_i$. 1582 | 1583 | **Data Model** $y_i \stackrel{\text{ind}}{\sim} \text{Normal}(\theta_{j_i}, \sigma_y)$ 1584 | 1585 | **Group Model** $\theta_j \stackrel{\text{ind}}{\sim} \text{Normal}(\mu_\theta, \sigma_\theta)$ (Hierarchical prior) 1586 | 1587 | **Hyper Prior** $\mu_\theta \sim \text{Uniform}(-\infty, \infty)$ and $\sigma_\theta \sim \text{Normal}_+(0, 5)$ 1588 | 1589 | We can then use Bayes' Theorem to derive the posterior distribution of both the group parameters ($\theta_1,\dots,\theta_J$) and the hyperparameters ($\mu_\theta, \sigma_\theta$).
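## The hierarchical normal model in Stan (sketch)

For reference, the hierarchical normal model above translates almost line for line into Stan. The sketch below is not part of the course code and is not run here; the voting example on the next slide uses a Bernoulli data model instead. Parameters left without a sampling statement get implicit flat priors.

\scriptsize
```{r, echo=TRUE, eval=FALSE}
normal_hier_code <- "data {
  int n;                      // number of individuals
  int J;                      // number of groups
  real y[n];                  // outcomes
  int group[n];               // group index for each individual
}
parameters {
  real mu_theta;              // hyperparameter: mean of group means (flat prior)
  real<lower=0> sigma_theta;  // hyperparameter: sd of group means
  real<lower=0> sigma_y;      // data-level sd (flat prior)
  vector[J] theta;            // group means
}
model {
  sigma_theta ~ normal(0, 5);              // half-normal via the lower bound
  theta ~ normal(mu_theta, sigma_theta);   // hierarchical prior
  for (i in 1:n)
    y[i] ~ normal(theta[ group[i] ], sigma_y);
}"
```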
1590 | 1591 | 1592 | ## First hierarchical model in Stan 1593 | 1594 | \scriptsize 1595 | ```{r hb1, echo=TRUE, warning=FALSE, message=FALSE} 1596 | data <- with(pew, list(y = demvote, group = as.integer(state), 1597 | n = nrow(pew), J = nlevels(state))) 1598 | code <- "data { 1599 | int n; // number of respondents 1600 | int J; // number of groups (states) 1601 | int y[n]; // demvote 1602 | int group[n]; // state index 1603 | } 1604 | parameters { 1605 | real mu_theta; // hyper parameters 1606 | real sigma_theta; 1607 | vector[J] theta; // group parameters 1608 | } 1609 | model { 1610 | sigma_theta ~ normal(0, 5); 1611 | for (j in 1:J) 1612 | theta[j] ~ normal(mu_theta, sigma_theta); 1613 | for (i in 1:n) 1614 | y[i] ~ bernoulli_logit(theta[ group[i] ]); 1615 | }" 1616 | sims <- stan(model_code = code, data = data) 1617 | ``` 1618 | 1619 | 1620 | ## Stan output 1621 | 1622 | \tiny 1623 | ```{r} 1624 | print(sims) 1625 | ``` 1626 | 1627 | 1628 | ## A plot of the state estimates 1629 | 1630 | \centering 1631 | 1632 | ```{r warning=FALSE, message=FALSE, fig.width=5} 1633 | names(sims) <- c("mu_theta", "sigma_theta", levels(pew$state), "lp__") 1634 | plot(sims, par = "theta") 1635 | ``` 1636 | 1637 | 1638 | ## Add a covariate to the model 1639 | 1640 | \scriptsize 1641 | ```{r hb2, echo=TRUE, warning=FALSE, message=FALSE} 1642 | data <- with(pew, list(y = demvote, group = as.integer(state), 1643 | x = female, n = nrow(pew), J = nlevels(state))) 1644 | code <- "data { 1645 | int n; // number of respondents 1646 | int J; // number of groups (states) 1647 | int y[n]; // demvote 1648 | int group[n]; // state index 1649 | vector[n] x; // added covariate (gender) 1650 | } 1651 | parameters { 1652 | real mu_alpha; // mean intercept 1653 | real sigma_alpha; // sd intercept 1654 | real beta; // coefficient of gender 1655 | vector[J] alpha; // group intercepts 1656 | } 1657 | model { 1658 | sigma_alpha ~ normal(0, 5); 1659 | for (j in 1:J) 1660 | alpha[j] ~ normal(mu_alpha, sigma_alpha); 1661 | for (i in 1:n) 1662 | y[i] ~ bernoulli_logit(alpha[group[i]] + beta * x[i]); 1663 | }" 1664 | sims <- stan(model_code = code, data = data) 1665 | ``` 1666 | 1667 | 1668 | ## Simplifying the computations 1669 | 1670 | Make a `model.matrix` $X$ in R for fixed effects. 1671 | 1672 | \footnotesize 1673 | ```{r, echo=TRUE, include = TRUE} 1674 | X <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 1675 | region + qlogis(obama12), data = pew) 1676 | ``` 1677 | 1678 | \scriptsize 1679 | ```{r} 1680 | head(X, 4) 1681 | ``` 1682 | 1683 | 1684 | ## Simplifying the computations (continued) 1685 | 1686 | Add `X = X` to the R `data` list and modify the Stan code to use matrix multiplication: 1687 | 1688 | \scriptsize 1689 | ```{r echo=TRUE, eval=FALSE} 1690 | data { 1691 | ... 1692 | matrix[n, k] X; // covariate matrix 1693 | } 1694 | parameters { 1695 | vector[k] beta; // fixed effects 1696 | ... 1697 | } 1698 | model { 1699 | vector[n] Xb; 1700 | Xb = X * beta; 1701 | ... 1702 | } 1703 | 1704 | ``` 1705 | 1706 | 1707 | ## Simplifying the computations (continued) 1708 | 1709 | * Move hierarchical means into fixed effects, so mean of random effects is zero. 1710 | * Random effects are usually easier to program directly in Stan. 1711 | * Sometimes helpful to standardized parameters and multiply by the SD. 1712 | * Stan's default is to assume elements of a vector are independent. 
1713 | 1714 | \scriptsize 1715 | ```{r echo=TRUE, eval=FALSE} 1716 | parameters { 1717 | real sigma_alpha; // sd intercept 1718 | vector[J] alpha; // group intercepts 1719 | ... 1720 | } 1721 | model { 1722 | sigma_alpha ~ normal(0.2, 1); // prior for sd 1723 | alpha ~ normal(0, 1); // standardized intercept 1724 | ... 1725 | for (i in 1:n) 1726 | Xb[i] += sigma_alpha * alpha[ group[i] ]; 1727 | y ~ bernoulli_logit(Xb); 1728 | } 1729 | 1730 | ``` 1731 | 1732 | 1733 | ## Hierarchical model with multiple covariates 1734 | 1735 | \scriptsize 1736 | ```{r hb3, echo=TRUE, message=FALSE, warning=FALSE} 1737 | model_code <- "data { 1738 | int n; // number of respondents 1739 | int k; // number of covariates 1740 | matrix[n, k] X; // covariate matrix 1741 | int y[n]; // outcome (demvote) 1742 | int J; // number of groups (states) 1743 | int group[n]; // group index 1744 | } 1745 | parameters { 1746 | vector[k] beta; // fixed effects 1747 | real sigma_alpha; // sd intercept 1748 | vector[J] alpha; // group intercepts 1749 | } 1750 | model { 1751 | vector[n] Xb; 1752 | beta ~ normal(0, 4); 1753 | sigma_alpha ~ normal(0.2, 1); // prior for sd 1754 | alpha ~ normal(0, 1); // standardized intercepts 1755 | Xb = X * beta; 1756 | for (i in 1:n) 1757 | Xb[i] += sigma_alpha * alpha[ group[i] ]; 1758 | y ~ bernoulli_logit(Xb); 1759 | }" 1760 | ``` 1761 | 1762 | 1763 | ## Estimating the model in R 1764 | 1765 | \scriptsize 1766 | ```{r echo=TRUE, warning=FALSE, message=FALSE} 1767 | X <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 1768 | region + qlogis(obama12), data = pew) 1769 | data <- list(n = nrow(X), k = ncol(X), X = X, y = pew$demvote, 1770 | J = nlevels(pew$state), group = as.integer(pew$state)) 1771 | sims <- stan(model_code = model_code, data = data, 1772 | seed = 1234) 1773 | ``` 1774 | 1775 | \normalsize 1776 | 1777 | Coefficients in the output are in the same order specified in the `model` block: 1778 | 1779 | \scriptsize 1780 | ```{r echo=TRUE, eval=FALSE} 1781 | parameters { 1782 | vector[J] alpha; // group intercepts 1783 | real sigma_alpha; // sd intercept 1784 | vector[k] beta; // fixed effects 1785 | } 1786 | ``` 1787 | 1788 | \tiny 1789 | ```{r echo=FALSE} 1790 | options(width = 105) 1791 | names(sims) 1792 | options(width = 85) 1793 | ``` 1794 | 1795 | 1796 | ## Rename the coefficients for easier reading 1797 | 1798 | \scriptsize 1799 | ```{r echo=TRUE} 1800 | coef.names <- c(colnames(X), "sigma_alpha", levels(pew$state), "lp__") 1801 | names(sims) <- coef.names 1802 | ``` 1803 | 1804 | \tiny 1805 | ```{r} 1806 | options(width = 105) 1807 | names(sims) 1808 | options(width = 85) 1809 | ``` 1810 | 1811 | 1812 | ## Summary of fixed effect estimates 1813 | 1814 | \tiny 1815 | ```{r echo=TRUE, warning=FALSE, message=FALSE} 1816 | print(sims, par = "beta") 1817 | ``` 1818 | 1819 | 1820 | 1821 | ## Predictive distributions: imputation of survey variables for the population 1822 | 1823 | * The final step in MRP is to **impute** vote for the entire population. 1824 | + The sample is a trivial proportion of the population. 1825 | + We need to impute the survey variable to everyone **not** surveyed. 1826 | 1827 | * The **posterior predictive distribution** $p(\tilde y | y)$ is the conditional distribution of a **new** draw $\tilde y$ from the model, conditional upon the **observed** data $y$. 1828 | 1829 | * This requires averaging $p(\tilde y | \theta)$ over the posterior distribution $p(\theta | y)$, *i.e.*, over the uncertainty in both $\tilde y$ *and* $\theta$. 
1830 | 1831 | * Contrast this with 1832 | + **Regression imputation** the expected value of $\tilde y$ is used 1833 | + **Plug-in methods** a point estimate is substituted for the unknown parameter. 1834 | 1835 | 1836 | ## Imputation in Stan 1837 | 1838 | Munge the population data in R 1839 | 1840 | \scriptsize 1841 | ```{r echo=TRUE} 1842 | X0 <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 1843 | region + qlogis(obama12), data = cps) 1844 | data <- list(n = nrow(X), k = ncol(X), X = X, y = pew$demvote, 1845 | J = nlevels(pew$state), group = as.integer(pew$state), 1846 | N = nrow(X0), X0 = X0, group0 = as.integer(cps$state)) 1847 | ``` 1848 | 1849 | \normalsize 1850 | 1851 | and add to the Stan `data` block: 1852 | 1853 | \scriptsize 1854 | 1855 | ```{r, eval=FALSE, echo=TRUE} 1856 | data { 1857 | ... 1858 | // add population data definitions 1859 | int N; // number of rows in population (cps) 1860 | matrix[N, k] X0; // covariates in population 1861 | int group0[N]; // group index in population 1862 | } 1863 | ``` 1864 | 1865 | 1866 | ## The generated quantities block in Stan 1867 | 1868 | Tell Stan what you want to impute and how to create the imputations. 1869 | 1870 | \footnotesize 1871 | ```{r echo=TRUE, eval=FALSE} 1872 | generated quantities { 1873 | int yimp[N]; 1874 | { 1875 | vector[N] Xb0; 1876 | Xb0 = X0 * beta; 1877 | for (i in 1:N) 1878 | yimp[i] = bernoulli_logit_rng(Xb0[i] + sigma_alpha * alpha[ group0[i] ]); 1879 | } 1880 | } 1881 | ``` 1882 | 1883 | \normalsize 1884 | 1885 | Note the use of the `bernoulli_logit_rng` (random number generator) function to draw from the posterior predictive distribution. The `generated quantities` block cannot contain any distributions (indicated by `~`). 1886 | 1887 | 1888 | ## The complete Stan program 1889 | 1890 | \tiny 1891 | 1892 | ```{r, echo=TRUE, warning=FALSE, message=FALSE} 1893 | model_code <- "data { 1894 | int n; // number of respondents 1895 | int k; // number of covariates 1896 | matrix[n, k] X; // covariate matrix 1897 | int y[n]; // outcome (demvote) 1898 | int J; // number of groups (states) 1899 | int group[n]; // group index 1900 | int N; // population size 1901 | matrix[N, k] X0; // population covariates 1902 | int group0[N]; // group index in population 1903 | } 1904 | parameters { 1905 | vector[k] beta; // fixed effects 1906 | real sigma_alpha; // sd intercept 1907 | vector[J] alpha; // group intercepts 1908 | } 1909 | model { 1910 | vector[n] Xb; 1911 | beta ~ normal(0, 4); 1912 | sigma_alpha ~ normal(0.2, 1); 1913 | alpha ~ normal(0, 1); 1914 | Xb = X * beta; 1915 | for (i in 1:n) 1916 | Xb[i] += sigma_alpha * alpha[ group[i] ]; 1917 | y ~ bernoulli_logit(Xb); 1918 | } 1919 | generated quantities { 1920 | int yimp[N]; 1921 | { 1922 | vector[N] Xb0; 1923 | Xb0 = X0 * beta; 1924 | for (i in 1:N) 1925 | yimp[i] = bernoulli_logit_rng(Xb0[i] + sigma_alpha * alpha[ group0[i] ]); 1926 | } 1927 | }" 1928 | ``` 1929 | 1930 | ```{r include=FALSE} 1931 | sims <- stan(model_code = model_code, data = data, 1932 | seed = 1234) 1933 | ``` 1934 | 1935 | 1936 | 1937 | ## Extracting the simulations 1938 | 1939 | Stan has imputed 4000 values for each of the rows in `cps`. We sample 500 (much more than necessary, but it's still fast). 
1940 | 1941 | \footnotesize 1942 | ```{r extracting_sims, echo=TRUE} 1943 | yimp <- extract(sims, pars = "yimp")$yimp # matrix of draws x rows of cps 1944 | imputations <- yimp[sample(nrow(yimp), size = 500), ] 1945 | get_state_estimates <- function(imputations) { 1946 | state_by_clinton <- function(imputed_values) 100 * prop.table( 1947 | xtabs(weight ~ state + imputed_values, data = cps), 1)[,"1"] 1948 | state_estimates <- apply(imputations, 1, state_by_clinton) 1949 | apply(state_estimates, 1, mean) 1950 | } 1951 | estimates$mrp2 <- get_state_estimates(imputations) 1952 | RMSE["mrp2"] <- with(estimates, rmse(mrp2, actual)) 1953 | ``` 1954 | 1955 | \normalsize 1956 | 1957 | Now we can perform any analyses we wish on the imputed `cps` data and average the results over the 500 imputed datasets to get point estimates. 1958 | 1959 | 1960 | 1961 | ## The easy way with `rstanarm` 1962 | 1963 | * `rstanarm` is an R package that writes and executes Stan code for you. 1964 | * It uses the same notation as `lme4` for specifying multilevel models. 1965 | * For example, to estimate the same model as `mrp2`, use the following code: 1966 | 1967 | ```{r eval=FALSE, echo=TRUE} 1968 | library(rstanarm) 1969 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 1970 | region + qlogis(obama12) + (1 | state), data = pew, family = binomial) 1971 | ``` 1972 | 1973 | * The function `posterior_predict` in `rstanarm` substitutes for the usual `predict` function in R: 1974 | 1975 | ```{r eval=FALSE, echo=TRUE} 1976 | imputations <- posterior_predict(fit, draws = 500, 1977 | newdata = select(cps, age4, gender, race3, educ4, region, obama12, state)) 1978 | ``` 1979 | 1980 | (This creates a matrix `imputations` of dimension `draws` x `nrow(newdata)`.) 1981 | 1982 | * Extract the estimates using `get_state_estimates` (a sketch for adding interval estimates follows).
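## Interval estimates from the imputations (sketch)

`get_state_estimates` averages over the posterior predictive draws; the same `imputations` matrix can also provide interval estimates for the state shares. A sketch, assuming the `imputations` and `cps` objects above (`get_state_intervals` is a hypothetical helper, not part of the original code):

\scriptsize
```{r, echo=TRUE, eval=FALSE}
# Credible intervals for the state-level Clinton share across imputed draws
get_state_intervals <- function(imputations, probs = c(0.05, 0.5, 0.95)) {
  state_by_clinton <- function(imputed_values) 100 * prop.table(
    xtabs(weight ~ state + imputed_values, data = cps), 1)[, "1"]
  draws <- apply(imputations, 1, state_by_clinton)   # states x draws
  t(apply(draws, 1, quantile, probs = probs))        # one row per state
}
head(get_state_intervals(imputations))
```

\normalsize

These intervals reflect uncertainty in the model parameters and in the imputations, but not the sampling error of the CPS itself, which is discussed under standard errors below.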
1983 | 1984 | 1985 | ## The complete program in `rstanarm` 1986 | 1987 | \footnotesize 1988 | 1989 | ```{r, echo=TRUE, warning=FALSE, message=FALSE} 1990 | library(rstanarm) 1991 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 1992 | region + qlogis(obama12) + (1 | state), data = pew, family = binomial) 1993 | cpstmp <- cps %>% 1994 | select(age4, gender, race3, educ4, region, obama12, state) 1995 | imputations <- posterior_predict(fit, draws = 500, 1996 | newdata = select(cps, age4, gender, race3, educ4, region, obama12, state)) 1997 | estimates$mrp3 <- get_state_estimates(imputations) 1998 | RMSE["mrp3"] <- with(estimates, rmse(mrp3, actual)) 1999 | RMSE 2000 | ``` 2001 | 2002 | 2003 | ## Accuracy of state level estimates 2004 | 2005 | \centering 2006 | ```{r, echo=FALSE, fig.width=9, fig.height=4.5} 2007 | p1 <- ggplot(estimates, aes(actual, mrp3)) + 2008 | geom_abline(intercept = 0, slope = 1, col = "grey") + 2009 | geom_point(size = 1.5) + 2010 | lims(x = c(0, 100), y = c(0, 100)) + 2011 | labs(x = "Percentage of vote for Clinton", y = "Estimate") + 2012 | theme_minimal() 2013 | p2 <- ggplot(estimates, aes(mrp3 - actual)) + 2014 | geom_histogram(binwidth = 4, center = 0, fill = "gray") + 2015 | lims(x = c(-20, 20)) + 2016 | labs(x = "Error in estimate") + 2017 | theme_minimal() 2018 | grid.arrange(p1, p2, nrow = 1) 2019 | ``` 2020 | 2021 | 2022 | ## What the map now looks like 2023 | 2024 | ```{r} 2025 | estimates %>% 2026 | mutate(state_name = tolower(name), 2027 | clinton_pct = cut(mrp3, breaks = c(-Inf, 40, 45, 50, 55, 60, 100), 2028 | labels = c("<40", "40-45", "45-50", "50-55", "55-60", ">60"))) %>% 2029 | ggplot(aes(map_id = state_name)) + 2030 | geom_map(aes(fill = clinton_pct), map = us_map) + 2031 | expand_limits(x = us_map$long, y = us_map$lat) + 2032 | coord_map("albers", lat0 = 39, lat1 = 45) + 2033 | scale_fill_brewer(name = "Clinton %", type = "div", palette = "RdBu") + 2034 | theme(axis.line = element_blank()) + 2035 | theme_void() 2036 | ``` 2037 | 2038 | 2039 | ## How well did we do? 2040 | 2041 | \centering 2042 | ```{r echo=FALSE} 2043 | load("data/output_538.RData") 2044 | out538 <- out538 %>% 2045 | filter(!(state %in% c("ME1", "ME2", "NE1", "NE2", "NE3", "US"))) %>% 2046 | transmute(state = factor(as.character(state)), `538` = clinton_share) 2047 | estimates <- left_join(estimates, out538, by = "state") 2048 | RMSE["538"] <- with(estimates, rmse(`538`, actual)) 2049 | tmp <- RMSE 2050 | names(tmp) <- c("Complete pooling", "No pooling", "Partial pooling", 2051 | "MRP: gender", "MRP: demos + 2012 vote", "MRP3", "538") 2052 | tmp <- tmp[c(1:5, 7)]; tmp <- rev(sort(tmp)) 2053 | par(mar = c(3, 11, 4, 2)) 2054 | barplot(sort(tmp), horiz = TRUE, border = NA, las = 1, 2055 | xlim = c(0, 16), ylim = c(0, 6.6), xlab = "", ylab = "", 2056 | width = 1, space = 0.1, 2057 | main = "Root mean square error", 2058 | col = c("navy", "blue3", "purple3", "purple1", "red2", "red3")) 2059 | text(x = tmp, y = seq(6, 0.6, length.out = 6), pos = 4, 2060 | labels = paste0(round(tmp, 1), "%")) 2061 | ``` 2062 | 2063 | # 5. Advanced Topics 2064 | 2065 | ## Where to from here? 2066 | 2067 | 1. More complicated models 2068 | 2. More than two response categories in survey variable 2069 | 3. More than two survey variables 2070 | 4. Missing variables from the post-stratification 2071 | 5. Computation of standard errors 2072 | 6. 
2073 | 2074 | 2075 | ## More complicated models 2076 | 2077 | Selection bias is **ignorable** if the survey variables are independent of sample selection *conditional on the covariates*. We would like to be able to fit models with many covariates and flexible functional forms. 2078 | 2079 | The models considered so far have been very simple. Instead, we might: 2080 | 2081 | * Incorporate more **interactions** (such as race x region) 2082 | * Allow both **slopes** and **intercepts** to vary across states 2083 | * Add additional **levels** to the model (*e.g.*, counties within states within regions) 2084 | 2085 | This is not difficult with `rstanarm`, but as the next example demonstrates, there is no guarantee that more complicated models will perform better. 2086 | 2087 | 2088 | ## Allow race and gender effects to vary across states 2089 | 2090 | \scriptsize 2091 | ```{r, echo=TRUE, warning=FALSE, message=FALSE} 2092 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 2093 | region + qlogis(obama12) + (1 + gender + race3 | state), 2094 | data = pew, family = binomial) 2095 | cpstmp <- cps %>% 2096 | select(age4, gender, race3, educ4, region, obama12, state) 2097 | imputations <- posterior_predict(fit, draws = 500, 2098 | newdata = cpstmp) 2099 | estimates$mrp4 <- get_state_estimates(imputations) 2100 | RMSE["mrp4"] <- with(estimates, rmse(mrp4, actual)) 2101 | RMSE 2102 | ``` 2103 | 2104 | 2105 | ## More than two response categories in the survey variable 2106 | 2107 | In real applications, there are almost always more than two response categories, *e.g.* vote for Clinton, Trump, Johnson, Stein, do not vote, undecided, *etc.* 2108 | 2109 | A simple solution is to estimate a set of binary response models (*e.g.*, Clinton *vs.* not Clinton, Trump *vs.* not Trump, *etc.*), ignoring correlations between the parameters of these models. 2110 | 2111 | A better solution is a **multinomial logit model**. When the response variable $y_i$ takes values $q = 0, 1, \dots, Q$, a flexible model (with $\beta_0 = 0$ fixing the reference category for identification) is 2112 | $$ 2113 | P(y_i = q) = \dfrac{ e^{\beta_q^T x_i} }{ \sum_{r=0}^Q e^{\beta_r^T x_i} } 2114 | $$ 2115 | 2116 | See Section 9.6 of the Stan Reference Manual for code to implement this model. The R `brms` package can also estimate this model, as in the sketch below.
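\footnotesize

A rough sketch of the `brms` approach (not part of the course code; it assumes the `pew` data contained a factor with the full set of response categories, here called `vote5`):

```{r eval=FALSE, echo=TRUE}
library(brms)
# Multinomial (categorical) logit with the same predictors and a varying
# intercept by state; each non-reference category gets its own coefficients.
fit_cat <- brm(vote5 ~ 1 + age4 + gender + race3 + educ4 +
    region + qlogis(obama12) + (1 | state),
  data = pew, family = categorical(link = "logit"))
# posterior_predict(fit_cat, newdata = ...) imputes a category for each CPS
# row, which can then be post-stratified category by category.
```

\normalsize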
2117 | 2118 | ## More than two survey variables 2119 | 2120 | Most surveys contain more than a single question, so there are more than two variables to model. For example, pre-election polls contain questions about likelihood of voting, preferred candidate, and policy preferences. 2121 | 2122 | Estimating separate MRP models for several variables can yield inconsistent results. Each estimate is equivalent to creating a separate set of weights for each variable. 2123 | 2124 | In principle, we can specify bivariate or multivariate response models, yielding a single, consistent set of estimates for the marginal and joint distributions. In practice, this can be quite challenging and is an active research area. 2125 | 2126 | 2127 | ## Missing variables in the post-stratification 2128 | 2129 | Population data are often **missing** for some predictors of survey outcomes. For example, voting models can benefit substantially from having individual-level data on past vote. 2130 | The assumption of **ignorable nonresponse** or selection is usually more plausible conditional upon a larger set of variables. 2131 | 2132 | A reasonable approach is to **impute** missing variables onto PUMS. For instance, the 2012 exit poll can be used to impute candidate preference for voters in the 2012 CPS Registration and Voting Supplement. A single imputation, followed by raking of the marginals to known vote totals, usually performs well. 2133 | 2134 | ## Computation of standard errors 2135 | 2136 | When post-stratifying to **known population counts**, the only source of error is the estimate of the means of the survey variable in the post-strata. This is reflected in the posterior standard deviation of the draws and in the simulations of generated quantities. 2137 | 2138 | When imputing onto a **public use microdata survey** (like CPS), there is an additional source of error (the sampling error in the PUMS survey). 2139 | 2140 | There is a simple formula for calculating the combined variance: 2141 | 2142 | * Create a set of imputed datasets. 2143 | * For each imputed dataset, compute the variance with the usual complete-data formula and average these. 2144 | * Also, compute the variance of the estimates across the imputed datasets. 2145 | * Add the two variances (the $(M+1)/M$ factor below adjusts for the finite number of imputations). 2146 | 2147 | With $M$ imputed estimates $\hat\theta_1,\dots,\hat\theta_M$ and $\bar\theta = \frac{1}{M}\sum_{i=1}^M \hat\theta_i$: 2148 | $$ 2149 | \begin{aligned} 2150 | \text{Var}(\bar\theta) &= \dfrac{1}{M} \sum_{i=1}^M \widehat{\text{s.e.}}^2(\hat\theta_i) 2151 | + \dfrac{M+1}{M} \dfrac{1}{M-1} \sum_{i=1}^M (\hat\theta_i - \bar\theta)^2 \\[10pt] 2152 | &\approx \text{average estimated variance} + \text{variance of estimates} 2153 | \end{aligned} 2154 | $$
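\footnotesize

A minimal R sketch of this combining rule (illustrative only; `est` and `se` are assumed to hold the $M$ per-imputation estimates and their standard errors):

```{r eval=FALSE, echo=TRUE}
# Within-imputation variance plus (1 + 1/M) times the
# between-imputation variance, as in the formula above.
combine_imputations <- function(est, se) {
  M <- length(est)
  total_var <- mean(se^2) + (1 + 1 / M) * var(est)
  c(estimate = mean(est), se = sqrt(total_var))
}
```

\normalsize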
2155 | 2156 | 2157 | ## References: Gelman papers 2158 | 2159 | Gelman and Little, "Poststratification into many categories using hierarchical logistic regression," *Survey Methodology* (1998), pp. 127-135. 2160 | 2161 | Park, Gelman and Bafumi, "Bayesian multilevel analysis with poststratification: state-level estimates from national polls," *Political Analysis* (2004), pp. 375-385. 2162 | 2163 | Gelman, "Analysis of variance: Why it is more important than ever" (with discussion), *Annals of Statistics* (2005), pp. 1-53. 2164 | 2165 | Gelman, "Struggles with survey weighting" (with discussion), *Statistical Science* (2007), pp. 153-164. 2166 | 2167 | Ghitza and Gelman, "Deep interactions with MRP: Election turnout and voting patterns among small electoral subgroups," *American Journal of Political Science* (2013), pp. 762-776. 2168 | 2169 | Gelman, Goel, Rivers and Rothschild, "The Mythical Swing Voter," *Quarterly Journal of Political Science* (2016), pp. 103-130. 2170 | 2171 | 2172 | ## Further reading 2173 | 2174 | \centering 2175 | 2176 | ![](images/further_reading.png) 2177 | 2178 | 2179 | ## What we have covered 2180 | 2181 | 1. Fit a multilevel regression model to survey data, using both individual- and group-level covariates. The purpose of this model is **prediction**, not explanation. 2182 | 2183 | 2. Avoid **over-fitting** by using hierarchical priors or random effects. 2184 | 2185 | 3. The Stein paradox is real and important. When making estimates for many small areas, **shrinkage** or **partial pooling** can substantially improve over "direct" estimates (just using the data from each area separately). 2186 | 2187 | 4. The model is used to **impute** the survey variable for non-sampled units, using either a public use microdata survey or census counts. This is the **post-stratification** component. 2188 | 2189 | 5. Model failure is a real risk, but many successful applications and validations suggest that the risk is often exaggerated. The estimates are fairly robust to failures of the upper-level model as long as the data are not too sparse. 2190 | 2191 | \centering 2192 | \LARGE 2193 | 2194 | Good luck! 2195 | 2196 | 2197 | ## Course evaluations 2198 | 2199 | Please fill out the online course evaluation at 2200 | 2201 | \medskip 2202 | \centering 2203 | 2204 | 2205 | 2206 | \bigskip 2207 | \large 2208 | \centering 2209 | 2210 | Thanks! --------------------------------------------------------------------------------