├── scripts
│   ├── .gitignore
│   ├── header.tex
│   ├── nep_gendergap.R
│   └── clean-data.R
├── .gitignore
├── README.md
├── slides.pdf
├── images
│   ├── Stein.jpg
│   ├── bayes.png
│   ├── ulam.jpg
│   ├── gelman.jpeg
│   ├── bayes-grave.jpg
│   ├── dag_2level.png
│   ├── stan_logo.png
│   ├── uk17_yg_est.png
│   ├── gender-gap-xp.pdf
│   ├── demo_xbox_v_exit.pdf
│   ├── further_reading.png
│   ├── mrp-stan-tweet.png
│   ├── poll_screenshot.jpg
│   ├── uk17_predictions.png
│   ├── subgroup_predictions.pdf
│   ├── us16_fig-lead-pid-wgt.pdf
│   └── us16_NationalTimeTrend.pdf
├── data
│   ├── cleaned.RData
│   └── output_538.RData
├── mrp-aapor.Rproj
├── LICENSE
├── class.R
└── slides.Rmd
/scripts/.gitignore:
--------------------------------------------------------------------------------
1 | slides_files
2 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # mrp-aapor
2 | Short course on MRP at 2018 AAPOR Conference
3 |
--------------------------------------------------------------------------------
/slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/slides.pdf
--------------------------------------------------------------------------------
/images/Stein.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/Stein.jpg
--------------------------------------------------------------------------------
/images/bayes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/bayes.png
--------------------------------------------------------------------------------
/images/ulam.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/ulam.jpg
--------------------------------------------------------------------------------
/data/cleaned.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/data/cleaned.RData
--------------------------------------------------------------------------------
/images/gelman.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/gelman.jpeg
--------------------------------------------------------------------------------
/data/output_538.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/data/output_538.RData
--------------------------------------------------------------------------------
/images/bayes-grave.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/bayes-grave.jpg
--------------------------------------------------------------------------------
/images/dag_2level.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/dag_2level.png
-------------------------------------------------------------------------------- /images/stan_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/stan_logo.png -------------------------------------------------------------------------------- /images/uk17_yg_est.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/uk17_yg_est.png -------------------------------------------------------------------------------- /images/gender-gap-xp.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/gender-gap-xp.pdf -------------------------------------------------------------------------------- /images/demo_xbox_v_exit.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/demo_xbox_v_exit.pdf -------------------------------------------------------------------------------- /images/further_reading.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/further_reading.png -------------------------------------------------------------------------------- /images/mrp-stan-tweet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/mrp-stan-tweet.png -------------------------------------------------------------------------------- /images/poll_screenshot.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/poll_screenshot.jpg -------------------------------------------------------------------------------- /images/uk17_predictions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/uk17_predictions.png -------------------------------------------------------------------------------- /images/subgroup_predictions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/subgroup_predictions.pdf -------------------------------------------------------------------------------- /images/us16_fig-lead-pid-wgt.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/us16_fig-lead-pid-wgt.pdf -------------------------------------------------------------------------------- /images/us16_NationalTimeTrend.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rdrivers/mrp-aapor/HEAD/images/us16_NationalTimeTrend.pdf -------------------------------------------------------------------------------- /scripts/header.tex: -------------------------------------------------------------------------------- 1 | \newcommand{\columnsbegin}{\begin{columns}} 2 | \newcommand{\columnsend}{\end{columns}} 3 | \setbeamertemplate{footline}[frame number] -------------------------------------------------------------------------------- /mrp-aapor.Rproj: -------------------------------------------------------------------------------- 1 
| Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: No 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Doug Rivers 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /scripts/nep_gendergap.R: -------------------------------------------------------------------------------- 1 | library(tidyverse) 2 | 3 | load("~/Dropbox (YouGov Analytics)/Data/frame/NEP/2016/nep16.Rdata") 4 | 5 | nep16 <- nep16 %>% 6 | mutate(state = droplevels(state)) 7 | 8 | fixef <- prop.table( 9 | xtabs(weight ~ gender + presvote16, data = nep16), 1)[,"clinton"] 10 | fixef[2] <- fixef[2] - fixef[1] 11 | 12 | tbl <- prop.table(xtabs(weight ~ state + presvote16 + gender, data = nep16), 13 | c(1,3))[,'clinton',] 14 | coefs <- data_frame(male = tbl[,1], gap = tbl[,2] - tbl[,1]) 15 | 16 | plot.gender.by.state <- function(main, coefs, fixef) { 17 | n <- nrow(coefs) 18 | x <- rep(c(0,1), rep(n, 2)) 19 | y <- 100 * c(coefs[[1]], coefs[[1]] + coefs[[2]]) 20 | plot(x = x, y = y, pch = 19, cex = 0.5, xlim = c(0, 1), 21 | ylim = c(0, 100), main = main,xlab = "Gender", 22 | ylab = "Percent voting for Clinton", axes = FALSE) 23 | axis(1, at = c(0, 1), labels = c("Male", "Female")) 24 | axis(2, at = seq(0, 100, 20)) 25 | for (i in seq_len(n)) lines(x = c(0, 1), 26 | y = c(y[i], y[i+n]), col = "grey", lwd = 0.5) 27 | abline(100 * fixef, col = "red3", lwd = 2) 28 | } 29 | pdf("~/Projects/mrp-aapor/images/gender-gap-xp.pdf") 30 | plot.gender.by.state("Exit poll", coefs, fixef) 31 | dev.off() 32 | 33 | 34 | 35 | 36 | -------------------------------------------------------------------------------- /scripts/clean-data.R: -------------------------------------------------------------------------------- 1 | library(tidyverse) 2 | library(haven) 3 | 4 | # add DC to list of states 5 | # make state a factor with postal codes as levels 6 | all.states <- data_frame(state = c(state.abb, "DC"), 7 | name = c(state.name, "District of Columbia"), 8 | division = c(as.character(state.division), "South Atlantic"), 9 | region = c(as.character(state.region), "South")) %>% 10 | 
arrange(state) %>% 11 | mutate(state = factor(state, levels = sort(state)), 12 | division = factor(division, levels = levels(state.division)), 13 | region = factor(region, levels = levels(state.region))) 14 | name.to.abb <- function(name) 15 | all.states$state[match(tolower(name), tolower(all.states$name))] 16 | 17 | 18 | # data from final pew survey before 2016 presidential election 19 | pew <- read_spss(file = "Data/Oct16 public.sav") %>% 20 | mutate_if(is.labelled, as_factor) %>% 21 | transmute( 22 | state = name.to.abb(as.character(sstate)), 23 | votereg = fct_recode(regfinal, 24 | yes = "Registered/Plan to/N.Dakota", 25 | no = "Not registered"), 26 | turnout = fct_collapse(plan1, 27 | yes = c("Plan to vote", "Already voted"), 28 | no = c("Already voted", "Don't plan to vote", "Don't know/Refused (VOL.)")), 29 | vote16full = fct_recode(q10horse, 30 | clinton = "Clinton/lean Clinton", 31 | trump = "Trump/lean Trump", 32 | johnson = "Johnson/lean Johnson", 33 | stein = "Stein/lean Stein", 34 | other = "Other-refused to lean", 35 | dk = "DK-refused to lean"), 36 | vote16 = case_when( 37 | votereg == "no" | turnout == "no" ~ "nonvoter", 38 | vote16full == "clinton" ~ "clinton", 39 | vote16full == "trump" ~ "trump", 40 | vote16full %in% c("johnson", "stein", "other") ~ "other"), 41 | vote16 = factor(vote16, levels = c("clinton", "trump", "other", "nonvoter")), 42 | age = fct_recode(age, 43 | "97" = "97 or older", 44 | NULL = "Don't know/Refused (VOL.)"), 45 | age = as.integer(as.character(age)), 46 | age_top_codes = case_when( 47 | age < 80 ~ "<80", 48 | age >= 80 & age < 85 ~ "80-84", 49 | age >= 85 ~ "85+"), 50 | raceeth = fct_recode(racethn, 51 | white = "White, non-Hisp", 52 | black = "Black, non-Hisp", 53 | hispanic = "Hispanic", 54 | other = "Other", 55 | NULL = "Don't know/Refused (VOL.)"), 56 | gender = fct_relabel(sex, tolower), 57 | educ = fct_recode(educ2, 58 | "grades 1-8" = "Less than high school (Grades 1-8 or no formal schooling)", 59 | "hs dropout" = "High school incomplete (Grades 9-11 or Grade 12 with NO diploma)", 60 | "hs grad" = "High school graduate (Grade 12 with diploma or GED certificate)", 61 | "some col" = "Some college, no degree (includes some community college)", 62 | "assoc" = "Two year associate degree from a college or university", 63 | "col grad" = "Four year college or university degree/Bachelor's degree (e.g., BS, BA, AB)", 64 | "postgrad" = "Some postgraduate or professional schooling, no postgraduate degree", 65 | "postgrad" = "Postgraduate or professional degree, including master's, doctorate, medical or law degree", 66 | NULL = "Don't know/Refused (VOL.)"), 67 | weight = weight / mean(weight)) %>% 68 | left_join(all.states, by = "state") 69 | for(v in pew) attr(v, "label") <- NULL 70 | 71 | # current population survey registration and voting supplement november 2016 72 | cps <- read_spss(file = "Data/cpsnov2016.sav") %>% 73 | mutate_if(is.labelled, as_factor) %>% 74 | mutate_if(is.factor, ~ fct_recode(., NULL = "-1")) %>% 75 | transmute( 76 | state = factor(as.character(gestfips)), # levels in alphabetical order 77 | age_top_codes = factor(case_when(prtage == "80-84 Years Old" ~ "80-84", 78 | prtage == "85+ Years Old" ~ "85+", !is.na(prtage) ~ "<80"), 79 | levels = c("<80", "80-84", "85+")), 80 | age = as.integer(ifelse(age_top_codes == "<80", as.character(prtage), NA)), 81 | gender = fct_relabel(pesex, tolower), 82 | hisp = fct_recode(pehspnon, 83 | yes = "HISPANIC", 84 | no = "NON-HISPANIC"), 85 | race = fct_collapse(ptdtrace, 86 | white = "White 
Only", 87 | black = "Black Only", 88 | other = c("American Indian, Alaskan Native Only", "Asian Only", 89 | "Hawaiian/Pacific Islander Only", "White-Black", "White-AI", 90 | "White-Asian", "White-HP", "Black-AI", "Black-Asian", "Black-HP", 91 | "AI-Asian", "AI-HP", "Asian-HP", "W-B-AI", "W-B-A", "W-B-HP", 92 | "W-AI-A", "W-AI-HP", "W-A-HP", "B-AI-A", "W-B-AI-A", "W-AI-A-HP", 93 | "Other 3 Race Combinations", "Other 4 and 5 Race Combinations")), 94 | raceeth = factor(ifelse(hisp == "yes", "hispanic", as.character(race)), 95 | levels = c("white", "black", "hispanic", "other")), 96 | educ = fct_collapse(peeduca, 97 | "grades 1-8" = c("LESS THAN 1ST GRADE", "1ST, 2ND, 3RD OR 4TH GRADE", 98 | "5TH OR 6TH GRADE", "7TH OR 8TH GRADE"), 99 | "hs dropout" = c("9TH GRADE", "10TH GRADE", "11TH GRADE", "12TH GRADE NO DIPLOMA"), 100 | "hs grad" = "HIGH SCHOOL GRAD-DIPLOMA OR EQUIV (GED)", 101 | "some col" = "SOME COLLEGE BUT NO DEGREE", 102 | "assoc" = c("ASSOCIATE DEGREE-OCCUPATIONAL/VOCATIONAL", "ASSOCIATE DEGREE-ACADEMIC PROGRAM"), 103 | "col grad" = "BACHELOR'S DEGREE (EX: BA, AB, BS)", 104 | "postgrad" = c("MASTER'S DEGREE (EX: MA, MS, MEng, MEd, MSW)", 105 | "PROFESSIONAL SCHOOL DEG (EX: MD, DDS, DVM)", "DOCTORATE DEGREE (EX: PhD, EdD)")), 106 | citizen = fct_collapse(prcitshp, 107 | yes = c("NATIVE, BORN IN THE UNITED STATES", 108 | "NATIVE, BORN IN PUERTO RICO OR OTHER U.S. ISLAND AREAS", 109 | "NATIVE, BORN ABROAD OF AMERICAN PARENT OR PARENTS", 110 | "FOREIGN BORN, U.S. CITIZEN BY NATURALIZATION"), 111 | no = "FOREIGN BORN, NOT A CITIZEN OF THE UNITED STATES"), 112 | votereg = factor(case_when( 113 | pes1 == "Yes" | pes2 == "Yes" ~ "yes", 114 | citizen == "No" | pes2 %in% c("No", "Don't Know") ~ "no"), 115 | levels = c("yes", "no")), 116 | turnout = factor(case_when( 117 | pes1 == "Yes" ~ "yes", 118 | votereg == "no" | pes1 %in% c("No", "Don't Know") ~ "no"), 119 | levels = c("yes", "no")), 120 | weight = as.numeric(pwsswgt) / 10000) %>% 121 | left_join(all.states, by = "state") %>% 122 | filter(age_top_codes %in% c("80-84", "85+") | age >= 18, citizen == "yes") 123 | 124 | 125 | # vote counts and citizen voting age population for 2012 and 2016 126 | vap12 <- read_csv("Data/vap2012.csv") %>% 127 | mutate(state = name.to.abb(state), 128 | cvap = as.integer(1000 * cvap)) %>% 129 | select(state, cvap) 130 | votes12 <- read_csv("Data/2012_0_0_1.csv", skip = 1) %>% 131 | transmute( 132 | state = name.to.abb(name), 133 | obama = vote1, 134 | romney = vote2, 135 | johnson = vote4, 136 | stein = vote5, 137 | turnout = totalvote 138 | ) %>% 139 | inner_join(vap12, by = "state") %>% 140 | inner_join(all.states, by = "state") %>% 141 | arrange(state) 142 | vap16 <- read_csv("Data/vap2016.csv") %>% 143 | mutate(state = name.to.abb(state), 144 | cvap = as.integer(1000 * cvap)) %>% 145 | select(state, cvap) 146 | votes16 <- read_csv("Data/2016_0_0_1.csv", skip = 1) %>% 147 | transmute( 148 | state = name.to.abb(name), 149 | clinton = vote1, 150 | trump = vote2, 151 | johnson = vote4, 152 | stein = vote5, 153 | mcmullin = vote3, 154 | turnout = totalvote) %>% 155 | inner_join(vap16, by = "state") %>% 156 | inner_join(all.states, by = "state") %>% 157 | arrange(state) 158 | 159 | # save clean data 160 | save(pew, cps, votes12, votes16, file = "Data/cleaned.RData") 161 | -------------------------------------------------------------------------------- /class.R: -------------------------------------------------------------------------------- 1 | options(width=85) 2 | library("tidyverse") 3 | 
load("data/cleaned.RData") 4 | 5 | 6 | 7 | ########################## 8 | # 2. Post-stratification # 9 | ########################## 10 | 11 | 12 | ## Recode pew data 13 | pew <- pew %>% 14 | filter( 15 | complete.cases(age, raceeth, gender, educ, vote16)) %>% 16 | mutate( 17 | demvote = ifelse(vote16 == "clinton", 1, 0), 18 | age4 = factor(case_when(age < 30 ~ "18-29", 19 | age < 45 ~ "30-44", age < 65 ~ "45-64", 20 | TRUE ~ "65+")), 21 | race3 = fct_collapse(raceeth, 22 | white = c("white", "other")), 23 | educ4 = fct_collapse(educ, 24 | "hs" = c("grades 1-8", "hs dropout", "hs grad"), 25 | "some col" = c("some col", "assoc"))) 26 | 27 | # Save a full version for MNL 28 | pew_nv <- pew 29 | pew <- pew %>% filter(vote16 != 'nonvoter') 30 | 31 | 32 | ## ...then do the same for CPS 33 | cps <- cps %>% 34 | filter( 35 | complete.cases(age_top_codes, 36 | raceeth, gender, educ, turnout)) %>% 37 | mutate( 38 | age4 = factor(case_when( 39 | age_top_codes == "<80" & age < 30 ~ "18-29", 40 | age_top_codes == "<80" & age < 45 ~ "30-44", 41 | age_top_codes == "<80" & age < 65 ~ "45-64", 42 | TRUE ~ "65+")), 43 | race3 = fct_collapse(raceeth, 44 | white = c("white", "other")), 45 | educ4 = fct_collapse(educ, 46 | "hs" = c("grades 1-8", "hs dropout", "hs grad"), 47 | "some col" = c("some col", "assoc"))) 48 | 49 | # Save a full version for MNL 50 | cps_nv <- cps 51 | cps <- cps %>% filter(turnout == "yes") 52 | 53 | ## Check that the datasets are consistent -- mistakes will be made! 54 | compare_distributions <- function(var, data1, data2, wgt1, wgt2, digits = 1) { 55 | stopifnot(all(levels(data1[[var]]) == levels(data2[[var]]))) 56 | formula1 <- as.formula(paste(wgt1, "~", var)) 57 | formula2 <- as.formula(paste(wgt2, "~", var)) 58 | tbl <- rbind(round(100 * prop.table(xtabs(formula1, data1)), digits), 59 | round(100 * prop.table(xtabs(formula2, data2)), digits)) 60 | row.names(tbl) <- c(substitute(data1), substitute(data2)) 61 | tbl 62 | } 63 | compare_distributions("race3", pew, cps, "", "weight") 64 | 65 | 66 | ## Compare variables in `pew` and `cps` 67 | compare_distributions("educ4", pew, cps, "", "weight") 68 | compare_distributions("age4", pew, cps, "", "weight") 69 | compare_distributions("gender", pew, cps, "", "weight") 70 | 71 | 72 | ## Step 2: create post-strata 73 | library(survey) 74 | pop.counts <- xtabs(weight ~ age4 + gender + race3 + educ4, data = cps) 75 | sample.counts <- xtabs(~ age4 + gender + race3 + educ4, data = pew) 76 | pew <- mutate(pew, 77 | weight0 = sum(pop.counts) / sum(sample.counts)) 78 | sample.weights <- xtabs(weight0 ~ age4 + gender + race3 + 79 | educ4, data = pew) 80 | nr <- nonresponse(sample.weights, sample.counts, pop.counts) 81 | 82 | 83 | ## Check for empty cells and/or large weights 84 | sparseCells(nr, nrweight = 4) 85 | 86 | 87 | ## Look for categories adjacent to empty cells 88 | neighbours(14, nr) # use nr$index to get cell index 89 | 90 | 91 | ## Step 3: collapse cells 92 | nr$index[,,"black","hs"] 93 | nr <- joinCells(nr, 10, 11, 14, 15) # update the nr object 94 | nr$index[,,"black","hs"] 95 | 96 | 97 | ## Eliminate remaining empty cells 98 | nr <- joinCells(nr, 18, 19, 21, 22) # hisp 30-64 hs 99 | nr <- joinCells(nr, 44, 48, 68, 72, 92, 96) # hisp 65+ >hs 100 | nr <- joinCells(nr, 57, 61, 81, 85) # black 18-29 col+ 101 | sparseCells(nr, nrweight = 4) # no more bad cells 102 | 103 | 104 | ## Step 4: compute weights and add to dataframe 105 | get_weights <- function(data, nr) { 106 | wgt_arr <- weights(nr) 107 | var.names <- names(dimnames(wgt_arr)) 
108 | indexes <- data %>% 109 | select(var.names) %>% 110 | mutate_all(as.integer) %>% 111 | as.matrix() 112 | wgt_arr[indexes] 113 | } 114 | pew$ps.weight <- get_weights(pew, nr) 115 | 116 | 117 | ## Check that the post-stratification worked 118 | compare_distributions("race3", pew, cps, "ps.weight", "weight") 119 | compare_distributions("educ4", pew, cps, "ps.weight", "weight") 120 | compare_distributions("age4", pew, cps, "ps.weight", "weight") 121 | 122 | 123 | ## Step 5: compute estimates using the new weight 124 | round(100 * prop.table(xtabs(ps.weight ~ vote16, pew)), 1) 125 | design <- svydesign(ids = ~ 1, weights = ~ ps.weight, data = pew) 126 | round(100 * prop.table(svytable(~ vote16, design)), 1) 127 | cv <- function(x) sd(x) / mean(x) # coefficient of variation 128 | cv(pew$ps.weight)^2 # weighting loss 129 | 130 | 131 | ## State estimates 132 | tbl <- 100 * prop.table(xtabs(ps.weight ~ state + vote16, data = pew), 1) 133 | round(tbl, 1)[1:10,] 134 | 135 | 136 | ## Plotting the estimates 137 | estimates <- votes16 %>% 138 | transmute(state, name, 139 | actual = 100 * clinton / turnout) %>% 140 | mutate(post.stratified = tbl[,"clinton"]) 141 | 142 | library(ggplot2) 143 | p1 <- ggplot(estimates, aes(actual, post.stratified)) + 144 | geom_abline(slope = 1, intercept = 0, col = "grey") + 145 | geom_point() + 146 | lims(x = c(0, 100), y = c(0, 100)) + 147 | labs(x = "Percent of votes", y = "Post-stratifed estimate") + 148 | theme_minimal() 149 | p2 <- ggplot(estimates, aes(post.stratified - actual)) + 150 | geom_histogram(binwidth = 10, center = 0, fill = "gray") + 151 | labs(x = "Error in estimate") + 152 | theme_minimal() 153 | library(gridExtra) 154 | grid.arrange(p1, p2, nrow = 1) 155 | 156 | 157 | ## Mapping the estimates 158 | library(maps) 159 | library(mapproj) 160 | us_map <- map_data("state") 161 | estimates %>% 162 | mutate(state_name = tolower(name), 163 | clinton_pct = cut(post.stratified, breaks = c(-Inf, 40, 45, 50, 55, 60, 100), 164 | labels = c("<40", "40-45", "45-50", "50-55", "55-60", ">60"))) %>% 165 | ggplot(aes(map_id = state_name)) + 166 | geom_map(aes(fill = clinton_pct), map = us_map) + 167 | expand_limits(x = us_map$long, y = us_map$lat) + 168 | coord_map("albers", lat0 = 39, lat1 = 45) + 169 | scale_fill_brewer(name = "Clinton %", type = "div", palette = "RdBu") + 170 | theme(axis.line = element_blank()) + 171 | theme_void() 172 | 173 | 174 | 175 | ############################ 176 | # 3. Multilevel regression # 177 | ############################ 178 | 179 | 180 | ## Fixed effects (no pooling) 181 | no.pooling <- lm(demvote ~ state - 1, data = pew) 182 | no.pooling <- data_frame( 183 | state = gsub("state", "", names(coef(no.pooling))), 184 | no.pooling = 100 * coef(no.pooling)) %>% 185 | mutate(state = factor(state, levels = levels(pew$state))) 186 | head(no.pooling) 187 | 188 | 189 | ## Grand mean (complete pooling) 190 | round(100 * mean(pew$demvote), 1) # 1. mean of dichotomous indicator 191 | round(100 * prop.table(xtabs(~ demvote, data = pew)), 1) # 2. cross-tabulation 192 | complete.pooling <- lm(demvote ~ 1, data = pew) # 3. 
intercept in regression 193 | complete.pooling <- 100 * coef(complete.pooling) 194 | round(complete.pooling, 1) 195 | 196 | 197 | ## Predicting random effects 198 | library(lme4) 199 | partial.pooling <- lmer(demvote ~ 1 + (1 | state), data = pew) 200 | partial.pooling 201 | 202 | 203 | ## Extracting predictions 204 | fixef(partial.pooling) # grand mean 205 | ranef(partial.pooling)$state %>% head(4) # state effects 206 | coef(partial.pooling)$state %>% head(4) # state predictions 207 | 208 | 209 | ## Which is better? 210 | partial.pooling <- coef(partial.pooling)$state %>% 211 | as_data_frame(rownames = "state") %>% 212 | transmute(state = factor(state, levels = levels(pew$state)), 213 | partial.pooling = 100 * `(Intercept)`) 214 | estimates <- estimates %>% 215 | left_join(no.pooling, by = "state") %>% 216 | left_join(partial.pooling, by = "state") %>% 217 | mutate(n = as.integer(xtabs(~ state, data = pew))) 218 | estimates 219 | rmse <- function(est, act) sqrt(mean((est - act)^2, na.rm = TRUE)) 220 | RMSE <- estimates %>% 221 | summarize(complete.pooling = rmse(complete.pooling, actual), 222 | no.pooling = rmse(no.pooling, actual), 223 | partial.pooling = rmse(partial.pooling, actual)) %>% 224 | unlist() 225 | RMSE 226 | 227 | 228 | ## 2016 U.S. election: estimates *vs.* actuals 229 | p1 <- ggplot(estimates, aes(actual, no.pooling)) + 230 | geom_point() + 231 | labs(title = "No pooling", x = "Percent of voters", 232 | y = "Percent of voters") + 233 | lims(x = c(0,100), y = c(0,100)) + 234 | theme_minimal() 235 | p2 <- ggplot(estimates, aes(actual, partial.pooling)) + 236 | geom_point() + 237 | labs(title = "Partial pooling", x = "Percent of voters", 238 | y = "Percent of voters") + 239 | lims(x = c(0,100), y = c(0,100)) + 240 | theme_minimal() 241 | suppressMessages(library(gridExtra)) 242 | grid.arrange(p1, p2, nrow = 1) 243 | 244 | 245 | ## Shrinkage 246 | na.omit(estimates) %>% 247 | select(-post.stratified) %>% 248 | rename("Actual" = actual, "No pooling" = no.pooling, 249 | "Partial pooling" = partial.pooling) %>% 250 | gather(Estimator, Percentage, -state, -name, -n) %>% 251 | mutate(Estimator = factor(Estimator, levels = c("Post-stratified", 252 | "No pooling", "Partial pooling", "Actual"))) %>% 253 | ggplot(aes(Estimator, Percentage, group = state, color = n)) + 254 | geom_line(size = 0.7) + 255 | scale_color_gradientn(colors = blues9[-1]) + 256 | lims(y = c(0, 100)) + 257 | theme_minimal() 258 | 259 | 260 | ## Complete pooling: common intercepts and slopes 261 | pew <- mutate(pew, female = ifelse(gender == "female", 1, 0)) 262 | fit1 <- lm(demvote ~ 1 + female, data = pew) 263 | arm::display(fit1) 264 | 265 | 266 | ## No pooling: separate intercepts and slopes 267 | fit2 <- lm(demvote ~ 0 + state + state:female, data = pew) 268 | coef.fit2 <- as.matrix(coef(fit2)) 269 | round(coef(fit2)[c(1:5, 90:98)], 2) 270 | xtabs(~ gender + state, data = pew) # no men in WY 271 | coef.fit2 <- coef(fit2) 272 | coef.fit2 <- as.matrix(data.frame(intercept = coef.fit2[1:48], 273 | slope = coef.fit2[50:97], row.names = 274 | gsub("state", "", names(coef.fit2)[1:48]))) 275 | 276 | 277 | ## Separate intercepts with common slope 278 | fit3 <- lm(demvote ~ state + female - 1, data = pew) 279 | coef.fit3 <- coef(fit3) 280 | names(coef.fit3) <- gsub("state", "", names(coef.fit3)) 281 | round(coef.fit3, 2) 282 | 283 | 284 | ## Which should you believe? 
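## The two panels drawn below compare the state-by-state (no pooling) intercepts
## and gender slopes from fit2 with the common-slope estimates from fit3; the
## red line in each panel is the complete-pooling regression fit1.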
285 | par(mfrow=c(1,2)) 286 | plot.gender.by.state <- function(main, coefs, fixef) { 287 | n <- nrow(coefs) 288 | x <- rep(c(0,1), rep(n, 2)) 289 | y <- 100 * c(coefs[[1]], coefs[[1]] + coefs[[2]]) 290 | plot(x = x, y = y, pch = 19, cex = 0.5, xlim = c(0, 1), 291 | ylim = c(0, 100), main = main,xlab = "Gender", 292 | ylab = "Percent voting for Clinton", axes = FALSE) 293 | axis(1, at = c(0, 1), labels = c("Male", "Female")) 294 | axis(2, at = seq(0, 100, 20)) 295 | for (i in seq_len(n)) lines(x = c(0, 1), 296 | y = c(y[i], y[i+n]), col = "grey", lwd = 0.5) 297 | abline(100 * fixef, col = "red3", lwd = 2) 298 | } 299 | plot.gender.by.state("No pooling", as_data_frame(coef.fit2)[-49,], 300 | coef(fit1)) 301 | coef.fit3.df <- data_frame(Intercept = coef.fit3[-length(coef.fit3)], 302 | Slope = coef.fit3[length(coef.fit3)]) 303 | plot.gender.by.state("Common slope", coef.fit3.df, coef(fit1)) 304 | par(mfrow=c(1,1)) 305 | 306 | 307 | ## Random effects: varying intercepts, common slopes 308 | fit4 <- lmer(demvote ~ 1 + female + (1 | state), data = pew) 309 | fit4 310 | 311 | 312 | ## Random effects: varying intercepts and slopes (first try) 313 | (fit5 <- lmer(demvote ~ 1 + female + (1 + female | state), 314 | data = pew)) # fails to converge 315 | 316 | 317 | ## Fixing the convergence failure (by centering covariates) 318 | pew <- mutate(pew, female.c = female - 0.5) 319 | fit6 <- lmer(demvote ~ 1 + female.c + (1 + female.c | state), 320 | data = pew) 321 | fixef(fit6) # not comparable to prior models 322 | head(coef(fit6)$state) 323 | 324 | 325 | ## Unscale the estimates for comparability 326 | fixef.fit6 <- fixef(fit6) 327 | fixef.fit6[1] <- fixef.fit6[1] - 0.5 * fixef.fit6[2] 328 | coef.fit6 <- coef(fit6)$state 329 | coef.fit6[[1]] <- coef.fit6[[1]] - 0.5 * coef.fit6[[2]] 330 | fixef.fit6 331 | head(coef.fit6) 332 | 333 | 334 | ## Comparing the random effects estimates 335 | par(mfrow=c(1,2)) 336 | plot.gender.by.state("Varying intercepts, common slope", 337 | coef(fit4)$state, fixef(fit4)) 338 | plot.gender.by.state("Varying intercepts and slopes", 339 | coef.fit6, fixef.fit6) 340 | par(mfrow=c(1,1)) 341 | 342 | 343 | 344 | ## Estimating models with covariates at both levels 345 | obama12 <- votes12 %>% 346 | mutate(obama12 = obama / turnout) %>% 347 | select(state, obama12) 348 | pew <- left_join(pew, obama12, by = "state") 349 | fit7 <- lmer(demvote ~ 1 + female.c + obama12 + (1 + female.c | state), data = pew) 350 | arm::display(fit7) 351 | 352 | 353 | ## First attempt at MRP 354 | cps <- cps %>% 355 | mutate(female = ifelse(gender == "female", 1, 0), 356 | female.c = female - 0.5) %>% 357 | left_join(obama12, by = "state") 358 | prob <- predict(fit7, newdata = cps, allow.new.levels = TRUE) 359 | mrp1 <- cps %>% 360 | mutate(prob = prob) %>% 361 | group_by(state) %>% 362 | summarize(mrp1 = 100 * weighted.mean(prob, weight)) 363 | estimates <- left_join(estimates, mrp1, by = "state") 364 | head(estimates) 365 | 366 | 367 | ## Plotting first MRP estimates 368 | RMSE["mrp1"] <- with(estimates, rmse(mrp1, actual)) 369 | ggplot(estimates, aes(actual, mrp1)) + 370 | geom_abline(intercept = 0, slope = 1, col = "grey") + 371 | geom_point(size = 1.5) + 372 | lims(x = c(0, 100), y = c(0, 100)) + 373 | labs(x = "Percentage of vote for Clinton", y = "Estimate") + 374 | theme_minimal() 375 | 376 | 377 | ######################### 378 | # 4. Bayesian inference # 379 | ######################### 380 | 381 | 382 | ## Example: What is the average airfare to AAPOR in Denver? 
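## The next three lines apply the standard conjugate update for a normal mean
## with known sigma: with prior theta ~ N(theta_0, omega_0^2) and n iid
## observations with mean ybar,
##   posterior precision = 1 / omega_0^2 + n / sigma^2
##   posterior mean = (theta_0 / omega_0^2 + n * ybar / sigma^2) / posterior precision
## Reading the constants below, this corresponds to theta_0 = 400, omega_0 = 100,
## ybar = 300, sigma = 150, and n = 3 (an interpretation of the code, not stated
## explicitly in the source).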
383 | post.precis <- 1 / 100^2 + 3 / 150^2 384 | post.mean <- (400 / 100^2 + 300 * 3 / 150^2) / post.precis 385 | post.sd <- 1 / sqrt(post.precis) 386 | 387 | 388 | ## Bayesian updating: from prior to posterior 389 | ggplot(data_frame(theta = c(0, 700) ), aes(theta)) + 390 | stat_function(fun = dnorm, color = "red", 391 | args = list(mean = 300, sd = 200 / sqrt(3))) + 392 | stat_function(fun = dnorm, color = "blue", 393 | args = list(mean = 400, sd = 100)) + 394 | stat_function(fun = dnorm, color = "purple", 395 | args = list(mean = post.mean, sd = post.sd)) + 396 | labs(x = quote(theta), y = "") + 397 | scale_x_continuous(breaks = seq(0, 700, 100)) + 398 | scale_y_continuous(breaks = NULL) + 399 | theme_minimal() 400 | 401 | 402 | ## A first Stan program 403 | model_code <- "data { 404 | int n; 405 | real y[n]; 406 | real theta_0; 407 | real omega_0; 408 | real sigma; 409 | } 410 | parameters { 411 | real theta; 412 | } 413 | model { 414 | theta ~ normal(theta_0, omega_0); 415 | for (i in 1:n) { 416 | y[i] ~ normal(theta, sigma); 417 | } 418 | }" 419 | 420 | 421 | ## Example (continued) 422 | library(rstan) 423 | rstan_options(auto_write = TRUE) 424 | options(mc.cores = parallel::detectCores()) 425 | y <- c(161, 250, 489) 426 | data <- list(y = y, n = length(y), theta_0 = 350, 427 | omega_0 = 150, sigma = 100) 428 | sims <- stan(model_code = model_code, data = data, 429 | chains = 4, iter = 500, seed = 1234) 430 | 431 | 432 | ## Stan output 433 | print(sims) 434 | 435 | 436 | ## Stan code for estimating both mean and SD 437 | data <- list(y = y, n = length(y), theta_0 = 350, omega_0 = 150) 438 | model_code <- "data { 439 | int n; 440 | real y[n]; 441 | real theta_0; 442 | real omega_0; 443 | } 444 | parameters { 445 | real theta; 446 | real sigma; // moved to parameter block 447 | } 448 | model { 449 | theta ~ normal(theta_0, omega_0); 450 | sigma ~ normal(150, 150); 451 | y ~ normal(theta, sigma); // assumed iid 452 | }" 453 | sims <- stan(model_code = model_code, data = data, 454 | iter = 500, seed = 1234) 455 | 456 | 457 | ## Output from two-parameter model 458 | print(sims) 459 | 460 | 461 | ## Graphical display of parameters 462 | plot(sims) 463 | 464 | 465 | ## Traceplot 466 | traceplot(sims) 467 | 468 | 469 | ## First hierarchical model in Stan 470 | data <- with(pew, list(y = demvote, group = as.integer(state), 471 | n = nrow(pew), J = nlevels(state))) 472 | code <- "data { 473 | int n; // number of respondents 474 | int J; // number of groups (states) 475 | int y[n]; // demvote 476 | int group[n]; // state index 477 | } 478 | parameters { 479 | real mu_theta; // hyper parameters 480 | real sigma_theta; 481 | vector[J] theta; // group parameters 482 | } 483 | model { 484 | sigma_theta ~ normal(0, 5); 485 | for (j in 1:J) 486 | theta[j] ~ normal(mu_theta, sigma_theta); 487 | for (i in 1:n) 488 | y[i] ~ bernoulli_logit(theta[ group[i] ]); 489 | }" 490 | sims <- stan(model_code = code, data = data) 491 | 492 | 493 | ## Stan output 494 | print(sims) 495 | 496 | 497 | ## A plot of the state estimates 498 | names(sims) <- c("mu_theta", "sigma_theta", levels(pew$state), "lp__") 499 | plot(sims, par = "theta") 500 | 501 | 502 | ## Add a covariate to the model 503 | data <- with(pew, list(y = demvote, group = as.integer(state), 504 | x = female, n = nrow(pew), J = nlevels(state))) 505 | code <- "data { 506 | int n; // number of respondents 507 | int J; // number of groups (states) 508 | int y[n]; // demvote 509 | int group[n]; // state index 510 | vector[n] x; // added covariate (gender) 
511 | } 512 | parameters { 513 | real mu_alpha; // mean intercept 514 | real sigma_alpha; // sd intercept 515 | real beta; // coefficient of gender 516 | vector[J] alpha; // group intercepts 517 | } 518 | model { 519 | sigma_alpha ~ normal(0, 5); 520 | for (j in 1:J) 521 | alpha[j] ~ normal(mu_alpha, sigma_alpha); 522 | for (i in 1:n) 523 | y[i] ~ bernoulli_logit(alpha[group[i]] + beta * x[i]); 524 | }" 525 | sims <- stan(model_code = code, data = data) 526 | 527 | 528 | ## Simplifying the computations 529 | X <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 530 | region + qlogis(obama12), data = pew) 531 | head(X, 4) 532 | 533 | 534 | ## Hierarchical model with multiple covariates 535 | model_code <- "data { 536 | int n; // number of respondents 537 | int k; // number of covariates 538 | matrix[n, k] X; // covariate matrix 539 | int y[n]; // outcome (demvote) 540 | int J; // number of groups (states) 541 | int group[n]; // group index 542 | } 543 | parameters { 544 | vector[k] beta; // fixed effects 545 | real sigma_alpha; // sd intercept 546 | vector[J] alpha; // group intercepts 547 | } 548 | model { 549 | vector[n] Xb; 550 | beta ~ normal(0, 4); 551 | sigma_alpha ~ normal(0.2, 1); // prior for sd 552 | alpha ~ normal(0, 1); // standardized intercepts 553 | Xb = X * beta; 554 | for (i in 1:n) 555 | Xb[i] += sigma_alpha * alpha[ group[i] ]; 556 | y ~ bernoulli_logit(Xb); 557 | }" 558 | 559 | 560 | ## Estimating the model in R 561 | X <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 562 | region + qlogis(obama12), data = pew) 563 | data <- list(n = nrow(X), k = ncol(X), X = X, y = pew$demvote, 564 | J = nlevels(pew$state), group = as.integer(pew$state)) 565 | sims <- stan(model_code = model_code, data = data, 566 | seed = 1234) 567 | names(sims) 568 | 569 | 570 | ## Rename the coefficients for easier reading 571 | coef.names <- c(colnames(X), "sigma_alpha", levels(pew$state), "lp__") 572 | names(sims) <- coef.names 573 | names(sims) 574 | 575 | 576 | ## Summary of fixed effect estimates 577 | print(sims, par = "beta") 578 | 579 | 580 | ## Imputation in Stan 581 | X0 <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 582 | region + qlogis(obama12), data = cps) 583 | data <- list(n = nrow(X), k = ncol(X), X = X, y = pew$demvote, 584 | J = nlevels(pew$state), group = as.integer(pew$state), 585 | N = nrow(X0), X0 = X0, group0 = as.integer(cps$state)) 586 | 587 | 588 | ## The complete Stan program 589 | model_code <- "data { 590 | int n; // number of respondents 591 | int k; // number of covariates 592 | matrix[n, k] X; // covariate matrix 593 | int y[n]; // outcome (demvote) 594 | int J; // number of groups (states) 595 | int group[n]; // group index 596 | int N; // population size 597 | matrix[N, k] X0; // population covariates 598 | int group0[N]; // group index in population 599 | } 600 | parameters { 601 | vector[k] beta; // fixed effects 602 | real sigma_alpha; // sd intercept 603 | vector[J] alpha; // group intercepts 604 | } 605 | model { 606 | vector[n] Xb; 607 | beta ~ normal(0, 4); 608 | sigma_alpha ~ normal(0.2, 1); 609 | alpha ~ normal(0, 1); 610 | Xb = X * beta; 611 | for (i in 1:n) 612 | Xb[i] += sigma_alpha * alpha[ group[i] ]; 613 | y ~ bernoulli_logit(Xb); 614 | } 615 | generated quantities { 616 | int yimp[N]; 617 | { 618 | vector[N] Xb0; 619 | Xb0 = X0 * beta; 620 | for (i in 1:N) 621 | yimp[i] = bernoulli_logit_rng(Xb0[i] + sigma_alpha * alpha[ group0[i] ]); 622 | } 623 | }" 624 | sims <- stan(model_code = model_code, data = data, 625 | seed = 1234) 626 | 627 
| 628 | ## Extracting the simulations 629 | imputations <- extract(sims, pars = "yimp")$yimp[sample( 630 | nrow(sims), size = 500), ] 631 | get_state_estimates <- function(imputations) { 632 | state_by_clinton <- function(imputed_values) 100 * prop.table( 633 | xtabs(weight ~ state + imputed_values, data = cps), 1)[,"1"] 634 | state_estimates <- apply(imputations, 1, state_by_clinton) 635 | apply(state_estimates, 1, mean) 636 | } 637 | estimates$mrp2 <- get_state_estimates(imputations) 638 | RMSE["mrp2"] <- with(estimates, rmse(mrp2, actual)) 639 | 640 | 641 | ## The easy way with `rstanarm` 642 | library(rstanarm) 643 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 644 | region + qlogis(obama12) + (1 | state), data = pew, family = binomial) 645 | imputations <- posterior_predict(fit, draws = 500, 646 | newdata = select(cps, age4, gender, race3, educ4, region, obama12, state)) 647 | 648 | 649 | ## The complete program in `rstanarm` 650 | library(rstanarm) 651 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 652 | region + qlogis(obama12) + (1 | state), data = pew, family = binomial) 653 | cpstmp <- cps %>% 654 | select(age4, gender, race3, educ4, region, obama12, state) 655 | imputations <- posterior_predict(fit, draws = 500, 656 | newdata = select(cps, age4, gender, race3, educ4, region, obama12, state)) 657 | estimates$mrp3 <- get_state_estimates(imputations) 658 | RMSE["mrp3"] <- with(estimates, rmse(mrp3, actual)) 659 | RMSE 660 | 661 | 662 | ## Accuracy of state level estimates 663 | p1 <- ggplot(estimates, aes(actual, mrp3)) + 664 | geom_abline(intercept = 0, slope = 1, col = "grey") + 665 | geom_point(size = 1.5) + 666 | lims(x = c(0, 100), y = c(0, 100)) + 667 | labs(x = "Percentage of vote for Clinton", y = "Estimate") + 668 | theme_minimal() 669 | p2 <- ggplot(estimates, aes(mrp3 - actual)) + 670 | geom_histogram(binwidth = 4, center = 0, fill = "gray") + 671 | lims(x = c(-20, 20)) + 672 | labs(x = "Error in estimate") + 673 | theme_minimal() 674 | grid.arrange(p1, p2, nrow = 1) 675 | 676 | 677 | ## What the map now looks like 678 | estimates %>% 679 | mutate(state_name = tolower(name), 680 | clinton_pct = cut(mrp3, breaks = c(-Inf, 40, 45, 50, 55, 60, 100), 681 | labels = c("<40", "40-45", "45-50", "50-55", "55-60", ">60"))) %>% 682 | ggplot(aes(map_id = state_name)) + 683 | geom_map(aes(fill = clinton_pct), map = us_map) + 684 | expand_limits(x = us_map$long, y = us_map$lat) + 685 | coord_map("albers", lat0 = 39, lat1 = 45) + 686 | scale_fill_brewer(name = "Clinton %", type = "div", palette = "RdBu") + 687 | theme(axis.line = element_blank()) + 688 | theme_void() 689 | 690 | 691 | 692 | ###################### 693 | # 5. Advanced Topics # 694 | ###################### 695 | 696 | ## Allow race and gender effects to vary across states 697 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 698 | region + qlogis(obama12) + (1 + gender + race3 | state), 699 | data = pew, family = binomial) 700 | cpstmp <- cps %>% 701 | select(age4, gender, race3, educ4, region, obama12, state) 702 | imputations <- posterior_predict(fit, draws = 500, 703 | newdata = select(cps, age4, gender, race3, educ4, region, obama12, state)) 704 | estimates$mrp4 <- get_state_estimates(imputations) 705 | RMSE["mrp4"] <- with(estimates, rmse(mrp4, actual)) 706 | RMSE 707 | 708 | ###################### 709 | # 6. 
MNL # 710 | ###################### 711 | 712 | library(brms) 713 | 714 | # Includes Non-Voters 715 | pew_nv <- left_join(pew_nv, obama12, by = "state") 716 | cps_nv <- left_join(cps_nv, obama12, by = "state") 717 | 718 | fit <- brm(vote16 ~ 1 + age4 + gender + race3 + educ4 + 719 | region + qlogis(obama12) + (1 | state), data = pew_nv, 720 | family = categorical, chains = 4, cores = 4) 721 | imputations <- posterior_predict(fit, nsamples = 500, allow_new_levels = TRUE, 722 | newdata = select(cps_nv, age4, gender, race3, educ4, region, obama12, state)) 723 | 724 | state_table <- function(imp){ 725 | 100 * prop.table(xtabs(weight ~ state + imp, data = cps_nv), 1) 726 | } 727 | 728 | tbls <- array(apply(imputations[1:10,], 1, state_table), 729 | dim=c(51, 4, 10), 730 | dimnames=list(state=levels(cps_nv$state), 731 | vote16=levels(pew_nv$vote16), 732 | samples = 1:10)) 733 | estimate <- apply(tbls, c('state', 'vote16'), mean) 734 | sd <- apply(tbls, c('state', 'vote16'), sd) 735 | 736 | -------------------------------------------------------------------------------- /slides.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Multilevel Regression and Post-stratification" 3 | author: "Douglas Rivers" 4 | date: "May 15, 2018" 5 | output: 6 | beamer_presentation: 7 | fig_caption: false 8 | includes: 9 | in_header: scripts/header.tex 10 | classoption: "aspectratio=169" 11 | always_allow_html: yes 12 | --- 13 | 14 | ```{r setup, include=FALSE} 15 | knitr::opts_chunk$set(echo = FALSE) 16 | options(width=85) 17 | ``` 18 | 19 | 20 | ## Course Schedule 21 | 22 | | Time | Topics | 23 | |-------------|--------------------------------| 24 | | 2:30 - 3:15 | Overview and Examples | 25 | | 3:15 - 4:00 | Post-stratification | 26 | | 4:00 - 4:45 | Multilevel Regression | 27 | | 4:45 - 5:30 | Bayesian inference | 28 | | 5:30 - 6:00 | Advanced topics and discussion | 29 | 30 | 31 | ## Course materials and basic R setup 32 | 33 | Slides, code, and datasets can be downloaded from Github: 34 | 35 | 36 | 37 | If you wish to run the examples shown in class, you will need to 38 | 39 | 1. Download R from one of the sites listed at and install. R is an open source programming environment for statistical computing. 40 | 2. (Optional) Download RStudio Desktop from . RStudio is an integrated development environment for R. The free open source version is recommended. 41 | 3. Start R or RStudio and enter the following at the R command prompt: 42 | 43 | \footnotesize 44 | ```{r, eval=FALSE, echo=TRUE} 45 | install.packages(c("tidyverse", "lme4", "survey", "arm", "maps", "mapproj", 46 | "gridExtra")) 47 | ``` 48 | 49 | 50 | # 1. Overview and Examples 51 | 52 | 53 | ## What is MRP? 54 | 55 | ![](images/mrp-stan-tweet.png) 56 | 57 | \centering 58 | \LARGE 59 | 60 | Formally, **M**ultilevel **R**egression and **P**ost-stratificaiton 61 | 62 | \medskip 63 | 64 | \centering 65 | \LARGE 66 | Informally, **Mr. P** 67 | 68 | 69 | ## Who is the ubiquitous Mr. P? 70 | 71 | \columnsbegin 72 | \column{.3\textwidth} 73 | ```{r, out.width = "1.5in"} 74 | knitr::include_graphics("images/gelman.jpeg") 75 | ``` 76 | 77 | \smallskip 78 | 79 | \centering 80 | 81 | Andrew Gelman 82 | 83 | \column{.6\textwidth} 84 | 85 | Gelman is a prolific statistician, political scientist and blogger at Columbia University. 86 | 87 | \medskip 88 | 89 | Developed MRP in a series of papers starting in 1997. Adapts and extends some techniques for *small area estimation* from survey statistics. 
90 | 91 | \medskip 92 | 93 | Ideas (post-stratification, multilevel models, hierarchical Bayesian estimation) were well known, but the combination has been very effective in polling applications. 94 | 95 | \columnsend 96 | 97 | 98 | ## . . . and his mysterious friend Stan? 99 | 100 | \columnsbegin 101 | \column{.3\textwidth} 102 | ```{r, out.width = "1.5in"} 103 | knitr::include_graphics("images/ulam.jpeg") 104 | ``` 105 | 106 | \medskip 107 | 108 | \centering 109 | 110 | Stanislav Ulam (1909-1984) 111 | 112 | \column{.4\textwidth} 113 | 114 | Ulam was a physicist who invented the *Monte Carlo method* while working on the Manhattan Project. 115 | 116 | \medskip 117 | 118 | Stan is an open source software project developing *Markov chain Monte Carlo* (MCMC) software for statistical inference and other applications. 119 | 120 | \column{.3\textwidth} 121 | ```{r, out.width = "1.5in"} 122 | knitr::include_graphics("images/stan_logo.png") 123 | ``` 124 | 125 | \smallskip 126 | 127 | \centering 128 | 129 | 130 | 131 | \columnsend 132 | 133 | 134 | 135 | ## What problems does MRP address? 136 | 137 | - **Selection bias.** Flexible and robust method for correcting for imbalances in sample composition, even when these are severe and can involve a large number of variables. 138 | - **Small area estimation.** Can provide good estimates for sub-national units (such as states, Congressional districts, counties, *etc.*) 139 | - **Trend analysis.** Estimate means of survey variables over time with a set of rolling cross-sections 140 | 141 | 142 | ## Selection bias: the Xbox panel 143 | 144 | ![](images/poll_screenshot.jpg) 145 | 146 | * 750,148 interviews with 345,858 Xbox users during the 2012 election campaign 147 | * However, the sample had some problems.... 148 | 149 | (Gelman, Goel, Rivers, and Rothschild, "The Mythical Swing Voter", *Quarterly Journal of Political Science,* 2016) 150 | 151 | 152 | ## Xbox panel demographics 153 | 154 | \centering 155 | 156 | ![](images/demo_xbox_v_exit.pdf){height=75%} 157 | 158 | \centering 159 | 160 | 93% were male, but there were still over 5,000 women in the sample. 161 | 162 | 163 | 164 | ## MRP estimates of 2012 voting from Xbox panel 165 | 166 | \centering 167 | 168 | ![](images/subgroup_predictions.pdf){height=60%} 169 | 170 | 75,096 cells: 2 gender x 4 race x 4 age x 4 education x 4 party x 3 ideology x 50 states 171 | 172 | 173 | ## Lesson 1 174 | 175 | \centering 176 | 177 | \LARGE 178 | 179 | Under some circumstances, **big data** (large sample sizes) and **modelling** can handle empty cells and correct for severe selection bias. 180 | 181 | 182 | ## Small area estimation: the 2017 UK general election 183 | 184 | \centering 185 | 186 | ![](images/uk17_predictions.png) 187 | 188 | 189 | ## Comparison of MRP forecasts and outcomes 190 | 191 | \centering 192 | 193 | ![](images/uk17_yg_est.png) 194 | 195 | 196 | ## Lesson 2 197 | 198 | \centering 199 | \LARGE 200 | 201 | It's possible to get pretty good estimates of **small areas** (parliamentary constituencies, congressional districts, even precincts) with effective predictors and multilevel models. 202 | 203 | 204 | ## Why do polls bounce around so much? 
205 | 206 | \centering 207 | 208 | ![](images/us16_fig-lead-pid-wgt.pdf) 209 | 210 | 211 | ## Trend analysis: the 2016 US presidential election 212 | 213 | \centering 214 | ![](images/us16_NationalTimeTrend.pdf) 215 | 216 | 217 | ## Lesson 3 218 | 219 | \centering 220 | \LARGE 221 | 222 | Large swings in the 2012 and 2016 pre-election polls were methodological artifacts due to inadequate **post-stratification** and **modeling**. 223 | 224 | 225 | ## Ingredients for MRP and running example 226 | 227 | **Survey** Pew Research Center's *October 2016 Political Survey* (2,583 interviews, conducted October 20-25, 2016). 228 | 229 | **Survey variable** 2016 Presidential voting intention 230 | 231 | **Covariates** Individual characteristics (from the survey) and group-level predictors (2012 state vote) 232 | 233 | **Post-strata** Age x Gender x Race x Education x State 234 | 235 | **Stratum counts** from the November 2016 Voting and Registration Supplement to the *Current Population Survey* 236 | 237 | 238 | ## Data sources 239 | 240 | The file `cleaned.RData` contains four R data frames: 241 | 242 | * `pew` - Pew Research Center's **October 2016 Political Survey**. The original data can be found at . 243 | * `cps` - the November 2016 Voting and Registration Supplement to the **Current Population Survey**. The full dataset can be downloaded from . 244 | * `votes12` and `votes16` - votes cast for major presidential candidates, turnout, and voting age population by state. **Vote counts** are from and **population counts** are from . 245 | 246 | Code `clean-data.R` for creating `cleaned.RData` is on the course GitHub site. 247 | 248 | 249 | ## How MRP works 250 | 251 | **Step 1: multilevel regression** Fit a model relating the survey variable (vote) to individual and group-level covariates (age, gender, race, education, state, 2012 vote). 252 | 253 | **Step 2: imputation** Impute the survey variable (proportion voting Democratic) for all combinations of age x gender x race x education x state, that is, for 254 | each cell of the post-stratification (4 age x 2 gender x 3 race x 4 education x 51 states). 255 | 256 | **Step 3: post-stratification** Average the imputed values in each cell and weight the estimated cell mean by the population count in each of the $4 \times 2 \times 3 \times 4 \times 51 = 4896$ cells. 257 | 258 | 259 | ## Some important concepts we will cover 260 | 261 | * Fit **multilevel** regression models for survey variables which are "richly parameterized" and can flexibly adjust for high-dimensional selection bias. 262 | * Estimate models using **hierarchical Bayesian priors** to avoid over-fitting. 263 | * Impute using the **posterior predictive distribution** (not the sample average). 264 | * Use **multiple imputations** to accurately reflect the sources of uncertainty in the imputation process. 265 | 266 | Don't worry if you don't understand the terminology -- details will follow -- but it's just **regression** and **imputation**. 267 | 268 | 269 | # 2. Post-stratification 270 | 271 | 272 | ## Review of stratified sampling 273 | 274 | In *stratified sampling*, we divide the population into **strata** (*e.g.*, regions) and draw samples within each stratum. 275 | 276 | The **stratified estimator** is obtained by weighting the stratum estimates by the population size in each.
277 | 278 | A stratified design can be more **efficient** than a simple random sample of the same size because 279 | 280 | * Eliminates variability in the proportion of sample coming from each stratum 281 | * Larger sample sizes could be allocated to more heterogeneous strata, less to homogeneous strata 282 | 283 | 284 | ## Notation for stratified sampling 285 | 286 | **Population** $U = U_1 \cup \cdots \cup U_H$ partitioned into $H$ strata 287 | 288 | **Sample** $S = S_1 \cup \cdots \cup S_H$ where $S_h \subset U_h$. 289 | 290 | **Population sizes** $N = N_1 + \cdots + N_H$ 291 | 292 | **Sample sizes** $n = n_1 + \cdots + n_H$ 293 | 294 | **Sample means** $\bar y_1,\dots, \bar y_H$ 295 | 296 | **Stratified estimator** $\bar y_{\text{ST}} = \dfrac{N_1}{N} \bar y_1 + \cdots + \dfrac{N_H}{N} \bar y_H$ 297 | 298 | 299 | The stratified estimator is a **weighted** combination of the stratum means. The weights are **known** -- sampling variation comes from errors in estimating the stratum means. 300 | 301 | 302 | ## Stratification weights 303 | 304 | An equivalent way of computing the stratified estimator is to define the **stratum weights** 305 | 306 | $$ 307 | w_h = \dfrac{\text{population proportion in stratum}}{\text{sample proportion in stratum}} 308 | = \dfrac{N_h/N}{n_h/n} 309 | $$ 310 | Each respondent gets the weight associated with their stratum: 311 | $$ 312 | \bar y_{\text{ST}} = \text{weighted average of sample values} = \dfrac{ w_{h_1} y_1 + \cdots + w_{h_n} y_n }{ w_{h_1} + \cdots + w_{h_n}} 313 | $$ 314 | where $h_i$ and $w_{h_i}$ are the **stratum** and **weight**, respectively, associated with respondent $i$. 315 | 316 | 317 | ## Variance of the stratified estimator 318 | 319 | $$ 320 | V(\bar y_{\text{ST}}) = \sum_{h=1}^H \left( \dfrac{N_h}{N} \right)^{\!2} \left( 1 - \dfrac{n_h}{N_h} \right) \dfrac{S_h^2}{n_h} = \dfrac{1}{n} \sum_{i=1}^n w_{h_i}^2 (1 - f_h) \dfrac{S_{h_i}^2}{n} 321 | $$ 322 | There are three factors that determine the variance of the stratified estimator: 323 | 324 | * Size of the **sampling fractions** $f_h = n_h / N_h$ in each stratum. Usually these are close to zero and can be ignored. 325 | * The **within stratum variances** $\sum_{i \in U_h} (y_i - \bar y_h)^2 / (N_h - 1)$. 326 | * Variability of the **stratum weights** $w_h$. 327 | 328 | $S_h^2$ can be substantially less than the total variance of $y$, so there are potential large gains in efficiency from stratifying, compared to SRSWOR. 329 | 330 | Variation in the weights tends to *increase* the standard error of the estimator, unless it's possible to allocate sample size $n_h$ proportional to $S_h^2$. 331 | 332 | 333 | ## Post-stratification 334 | 335 | When the sample design is **not** stratified, it may still be feasible to use the stratified estimator if the sample can be divided into groups (called **post-strata**) whose population sizes are known. 336 | 337 | The sample sizes in the post-strata are not fixed by design, so the variance formula only holds conditional upon the realized sample sizes $n_1,\dots,n_H$. (See Holt and Smith, "Post-stratification", *Journal of the Royal Statistical Society, Series A*, 1979.) 338 | 339 | Even though the sample may have started with equal selection probabilities, the proportions in the post-strata often differ substantially from the population proportions. Failure to post-stratify results in large bias, but large weights may result in unstable estimates. 340 | 341 | Usual advice is to collapse post-strata to have at least 20 respondents per cell. 
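As a concrete illustration of the stratum weight formula $w_h = (N_h/N)/(n_h/n)$ above, the minimal sketch below computes post-stratification weights directly from a population and a sample cross-tabulation. `pop.counts` and `sample.counts` are constructed in the next section; the other names here are only for illustration. Empty sample cells produce infinite weights, which is one reason the next section collapses cells with the `survey` package's bookkeeping instead.

\footnotesize
```{r, eval=FALSE, echo=TRUE}
# Minimal sketch, assuming pop.counts and sample.counts cross-tabulate the
# same factors in the same order (they are built in the next section)
pop.share <- pop.counts / sum(pop.counts)           # N_h / N
sample.share <- sample.counts / sum(sample.counts)  # n_h / n
w <- pop.share / sample.share                       # w_h = (N_h/N) / (n_h/n)
summary(as.vector(w))  # empty sample cells show up as Inf (or NaN) weights
```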
342 | 343 | 344 | ## Post-stratification in R 345 | 346 | **Step 1** Recode survey and population data (usually from a *public use microdata* file) so that the post-stratifying variables are consistent. 347 | 348 | **Step 2** Choose the post-strata 349 | 350 | **Step 3** Collapse categories to eliminate empty cells 351 | 352 | **Step 4** Compute the post-stratification weights 353 | 354 | **Step 5** Estimate means and proportions (with standard errors) 355 | 356 | 357 | ## Step 1: recode Pew data... 358 | 359 | Variables should be `factor`s (R's version of categorical variables) with the same `levels` (categories) *in the same order*. 360 | 361 | \scriptsize 362 | ```{r, echo=TRUE, warning=FALSE, message=FALSE} 363 | suppressMessages(library("tidyverse")) 364 | load("data/cleaned.RData") 365 | pew <- pew %>% 366 | filter( 367 | complete.cases(age, raceeth, gender, educ, vote16), 368 | vote16 != "nonvoter") %>% 369 | mutate( 370 | demvote = ifelse(vote16 == "clinton", 1, 0), 371 | age4 = factor(case_when(age < 30 ~ "18-29", 372 | age < 45 ~ "30-44", age < 65 ~ "45-64", 373 | TRUE ~ "65+")), 374 | race3 = fct_collapse(raceeth, 375 | white = c("white", "other")), 376 | educ4 = fct_collapse(educ, 377 | "hs" = c("grades 1-8", "hs dropout", "hs grad"), 378 | "some col" = c("some col", "assoc"))) 379 | ``` 380 | 381 | 382 | ## ...then do the same for CPS 383 | 384 | \scriptsize 385 | ```{r, echo=TRUE, warning=FALSE, messages=FALSE} 386 | cps <- cps %>% 387 | filter( 388 | complete.cases(age_top_codes, 389 | raceeth, gender, educ, turnout), 390 | turnout == "yes") %>% 391 | mutate( 392 | age4 = factor(case_when( 393 | age_top_codes == "<80" & age < 30 ~ "18-29", 394 | age_top_codes == "<80" & age < 45 ~ "30-44", 395 | age_top_codes == "<80" & age < 65 ~ "45-64", 396 | TRUE ~ "65+")), 397 | race3 = fct_collapse(raceeth, 398 | white = c("white", "other")), 399 | educ4 = fct_collapse(educ, 400 | "hs" = c("grades 1-8", "hs dropout", "hs grad"), 401 | "some col" = c("some col", "assoc"))) 402 | ``` 403 | 404 | ## Check that the datasets are consistent -- mistakes will be made! 405 | 406 | Time spent cleaning the data at this stage is time well spent. 407 | 408 | \footnotesize 409 | ```{r, echo=TRUE, include=TRUE} 410 | compare_distributions <- function(var, data1, data2, wgt1, wgt2, digits = 1) { 411 | stopifnot(all(levels(data1[[var]]) == levels(data2[[var]]))) 412 | formula1 <- as.formula(paste(wgt1, "~", var)) 413 | formula2 <- as.formula(paste(wgt2, "~", var)) 414 | tbl <- rbind(round(100 * prop.table(xtabs(formula1, data1)), digits), 415 | round(100 * prop.table(xtabs(formula2, data2)), digits)) 416 | row.names(tbl) <- c(substitute(data1), substitute(data2)) 417 | tbl 418 | } 419 | compare_distributions("race3", pew, cps, "", "weight") 420 | ``` 421 | 422 | ## Compare variables in `pew` and `cps` 423 | 424 | \scriptsize 425 | ```{r, echo=TRUE} 426 | compare_distributions("educ4", pew, cps, "", "weight") 427 | compare_distributions("age4", pew, cps, "", "weight") 428 | compare_distributions("gender", pew, cps, "", "weight") 429 | ``` 430 | 431 | ## Step 2: create post-strata 432 | 433 | The `survey` package contains useful functions for post-stratification, including `postStratify`, `rake`, and `nonresponse`. 
434 | 435 | \footnotesize 436 | ```{r, echo=TRUE, message = FALSE, warning = FALSE} 437 | suppressMessages(library(survey)) 438 | pop.counts <- xtabs(weight ~ age4 + gender + race3 + educ4, data = cps) 439 | sample.counts <- xtabs(~ age4 + gender + race3 + educ4, data = pew) 440 | pew <- mutate(pew, 441 | weight0 = sum(pop.counts) / sum(sample.counts)) 442 | sample.weights <- xtabs(weight0 ~ age4 + gender + race3 + 443 | educ4, data = pew) 444 | nr <- nonresponse(sample.weights, sample.counts, pop.counts) 445 | ``` 446 | 447 | \normalsize 448 | \centering 449 | 450 | `nr` is an object which keeps track of the cells (post-strata). 451 | 452 | 453 | ## Check for empty cells and/or large weights 454 | 455 | \tiny 456 | ```{r, echo=TRUE} 457 | sparseCells(nr, nrweight = 4) 458 | ``` 459 | 460 | ## Look for categories adjacent to empty cells 461 | 462 | \scriptsize 463 | ```{r, echo = TRUE, include=TRUE} 464 | neighbours(14, nr) # use nr$index to get cell index 465 | ``` 466 | 467 | 468 | ## Step 3: collapse cells 469 | 470 | Cells 10, 11, 14 and 15 contain blacks aged 30-64 with a high school education. 471 | 472 | \scriptsize 473 | ```{r echo=TRUE} 474 | nr$index[,,"black","hs"] 475 | nr <- joinCells(nr, 10, 11, 14, 15) # update the nr object 476 | nr$index[,,"black","hs"] 477 | ``` 478 | 479 | ## Eliminate remaining empty cells 480 | 481 | Combine males and females and collapse age and education categories for minorities. Each call to `joinCells` collapses some cells and stores the result in `nr`. 482 | 483 | \scriptsize 484 | ```{r echo=TRUE, include=TRUE} 485 | nr <- joinCells(nr, 18, 19, 21, 22) # hisp 30-64 hs 486 | nr <- joinCells(nr, 44, 48, 68, 72, 92, 96) # hisp 65+ >hs 487 | nr <- joinCells(nr, 57, 61, 81, 85) # black 18-29 col+ 488 | sparseCells(nr, nrweight = 4) # no more bad cells 489 | ``` 490 | 491 | \normalsize 492 | 493 | `NULL` means that, after collapsing, there are no empty cells or weights > 4. 494 | 495 | 496 | ## Step 4: compute weights and add to dataframe 497 | 498 | `weights(nr)` is a four dimensional (age x gender x race x education) array of **stratum weights**. I created `get_weights` to convert the stratum weights into a vector of **individual weights**. 
499 | 500 | \footnotesize 501 | ```{r echo=TRUE} 502 | get_weights <- function(data, nr) { 503 | wgt_arr <- weights(nr) 504 | var.names <- names(dimnames(wgt_arr)) 505 | indexes <- data %>% 506 | select(var.names) %>% 507 | mutate_all(as.integer) %>% 508 | as.matrix() 509 | wgt_arr[indexes] 510 | } 511 | pew$ps.weight <- get_weights(pew, nr) 512 | ``` 513 | 514 | 515 | ## Check that the post-stratification worked 516 | 517 | \scriptsize 518 | ```{r, echo=TRUE} 519 | compare_distributions("race3", pew, cps, "ps.weight", "weight") 520 | compare_distributions("educ4", pew, cps, "ps.weight", "weight") 521 | compare_distributions("age4", pew, cps, "ps.weight", "weight") 522 | ``` 523 | 524 | 525 | ## Step 5: compute estimates using the new weight 526 | 527 | Use either `xtabs` (in base R) or the `survey` package 528 | 529 | \scriptsize 530 | ```{r, echo=TRUE, include=TRUE, warning=FALSE, message=FALSE} 531 | round(100 * prop.table(xtabs(ps.weight ~ vote16, pew)), 1) 532 | design <- svydesign(ids = ~ 1, weights = ~ ps.weight, data = pew) 533 | round(100 * prop.table(svytable(~ vote16, design)), 1) 534 | cv <- function(x) sd(x) / mean(x) # coefficient of variation 535 | cv(pew$ps.weight)^2 # weighting loss 536 | ``` 537 | 538 | 539 | ## State estimates 540 | 541 | \footnotesize 542 | ```{r echo=TRUE} 543 | tbl <- 100 * prop.table(xtabs(ps.weight ~ state + vote16, data = pew), 1) 544 | round(tbl, 1)[1:10,] 545 | ``` 546 | 547 | ## Plotting the estimates 548 | 549 | State estimates are nearly useless, even if approximately unbiased. 550 | 551 | \scriptsize 552 | ```{r echo=FALSE, warning=FALSE, fig.width=9, fig.height=4.5, fig.align="center"} 553 | estimates <- votes16 %>% 554 | transmute(state, name, 555 | actual = 100 * clinton / turnout) %>% 556 | mutate(post.stratified = tbl[,"clinton"]) 557 | 558 | suppressMessages(library(ggplot2)) 559 | p1 <- ggplot(estimates, aes(actual, post.stratified)) + 560 | geom_abline(slope = 1, intercept = 0, col = "grey") + 561 | geom_point() + 562 | lims(x = c(0, 100), y = c(0, 100)) + 563 | labs(x = "Percent of votes", y = "Post-stratifed estimate") + 564 | theme_minimal() 565 | p2 <- ggplot(estimates, aes(post.stratified - actual)) + 566 | geom_histogram(binwidth = 10, center = 0, fill = "gray") + 567 | labs(x = "Error in estimate") + 568 | theme_minimal() 569 | 570 | suppressMessages(library(gridExtra)) 571 | grid.arrange(p1, p2, nrow = 1) 572 | ``` 573 | 574 | 575 | ## Mapping the estimates 576 | 577 | *Probably not something you'd want to share with the world.* 578 | 579 | ```{r, echo=FALSE, include=TRUE, message=FALSE, warning=FALSE} 580 | suppressMessages(library(maps)) 581 | suppressMessages(library(mapproj)) 582 | us_map <- map_data("state") 583 | estimates %>% 584 | mutate(state_name = tolower(name), 585 | clinton_pct = cut(post.stratified, breaks = c(-Inf, 40, 45, 50, 55, 60, 100), 586 | labels = c("<40", "40-45", "45-50", "50-55", "55-60", ">60"))) %>% 587 | ggplot(aes(map_id = state_name)) + 588 | geom_map(aes(fill = clinton_pct), map = us_map) + 589 | expand_limits(x = us_map$long, y = us_map$lat) + 590 | coord_map("albers", lat0 = 39, lat1 = 45) + 591 | scale_fill_brewer(name = "Clinton %", type = "div", palette = "RdBu") + 592 | theme(axis.line = element_blank()) + 593 | theme_void() 594 | ``` 595 | 596 | 597 | 598 | ## Problems with post-stratification 599 | 600 | **Empty cells** 601 | 602 | * Collapsing cells increases bias 603 | * Alternative is _raking_ (works if no empty cells in marginals) 604 | 605 | \smallskip 606 | 607 | **Small cells** 
608 | 609 | * Even with non-empty cells, mean estimates will be noisy if the cell size is small. 610 | * Recommended cell size is at least 20 -- suggest collapsing cells 611 | 612 | \smallskip 613 | 614 | **Large weights** 615 | 616 | * Common practice is to *trim* large weights. 617 | * Solves the wrong problem: population cell count is ok, sample cell estimate is noisy. 618 | 619 | 620 | # 3. Multilevel regression 621 | 622 | ## Key ideas 623 | 624 | * Use a regression model to **predict** values of the survey variable for each combination of covariates. 625 | * Use a **richly parameterized model** for the data (the first level) 626 | + Allow intercepts and possibly slopes to vary across groups in the data ("deep interactions"). 627 | + Use variables with many categories (such as counties or states), some of which may not be present in the data. 628 | * Introduce a second level model for the **parameters** in the first level. 629 | + The parameters of the second level model are sometimes called *hyperparameters*. 630 | + The second level model is used to avoid overfitting. Results in *shrinkage* or *regularization*. 631 | * Also known as hierarchical models, mixed models, random effects, variance components, repeated measures, longitudinal analysis, *etc.* 632 | 633 | 634 | ## Stein's paradox 635 | 636 | \columnsbegin 637 | \column{.3\textwidth} 638 | ```{r, out.width = "1.5in"} 639 | knitr::include_graphics("images/stein.jpg") 640 | ``` 641 | 642 | \smallskip 643 | 644 | \centering 645 | 646 | Charles Stein (1920-2016) 647 | 648 | \column{.5\textwidth} 649 | 650 | The **mean square error** (or **risk**) of an estimator $\hat\theta$ of $\theta$ is 651 | $$ 652 | \text{MSE}(\hat\theta, \theta) = E_\theta(\hat\theta - \theta)^2 653 | $$ 654 | 655 | An estimator $\hat\theta$ is **inadmissable** if there exists another estimator $\tilde\theta$ with smaller risk for *all* $\theta$. The estimator $\tilde\theta$ is said to **dominate** $\hat\theta$. 656 | 657 | \medskip 658 | 659 | When estimating three or more means of independent normal distributions, Stein showed that the sample mean is **inadmissable**. 660 | 661 | \columnsend 662 | 663 | 664 | ## A tale of three estimators 665 | 666 | To keep things simple, we start with models for vote by state without any covariates. 667 | 668 | **No pooling** estimate the model separately for each state (by a regression on 51 state dummies) 669 | 670 | **Complete pooling** estimate a single **grand mean** (equivalent to a regression with just an intercept) 671 | 672 | **Partial pooling** estimate a *random effects* model (to be explained shortly -- think of it as a compromise between no pooling and complete pooling) 673 | 674 | \bigskip 675 | \small 676 | 677 | (For pedagogical purposes, we will use *linear* instead of *logistic regression* in this section. The same concepts apply to nonlinear model.) 678 | 679 | 680 | ## Fixed effects (no pooling) 681 | 682 | We use `lm` to estimate a model with state dummies and no intercept (so the estimates are the predicted Clinton vote in each state). 
683 | 684 | \scriptsize 685 | ```{r, echo=TRUE} 686 | no.pooling <- lm(demvote ~ state - 1, data = pew) 687 | no.pooling <- data_frame( 688 | state = gsub("state", "", names(coef(no.pooling))), 689 | no.pooling = 100 * coef(no.pooling)) %>% 690 | mutate(state = factor(state, levels = levels(pew$state))) 691 | head(no.pooling) 692 | ``` 693 | 694 | 695 | ## Grand mean (complete pooling) 696 | 697 | The grand mean can be computed in **three equivalent ways**: 698 | 699 | \scriptsize 700 | ```{r, echo=TRUE} 701 | round(100 * mean(pew$demvote), 1) # 1. mean of dichotomous indicator 702 | round(100 * prop.table(xtabs(~ demvote, data = pew)), 1) # 2. cross-tabulation 703 | complete.pooling <- lm(demvote ~ 1, data = pew) # 3. intercept in regression 704 | complete.pooling <- 100 * coef(complete.pooling) 705 | round(complete.pooling, 1) 706 | ``` 707 | 708 | 709 | ## Random effects (partial pooling) 710 | 711 | Estimate a model with one **fixed effect** (the grand mean) plus **random effects** for each state, 712 | $$ 713 | \text{vote} = \text{grand mean} + \text{state effect} + \text{individual error} 714 | $$ 715 | or, in symbols, 716 | $$ 717 | y_i = \mu + \alpha_{j_i} + \epsilon_i \qquad (j_i = \text{state of respondent}) 718 | $$ 719 | where 720 | $$ 721 | \alpha_1,\dots,\alpha_J \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\alpha) \qquad 722 | \epsilon_1,\dots,\epsilon_n \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\epsilon) 723 | $$ 724 | The grand mean $\mu_i$ is a **fixed effect** while the state effects $\alpha_j$ are **random effects** (assumed to be $\text{Normal}(0, \sigma_\alpha)$). 725 | 726 | In classical statistics, parameters (like $\mu$, $\sigma_\epsilon$ and $\sigma_\alpha$) are **estimated** while random effects are **predicted**. 727 | 728 | ## What's the difference between fixed and random effects models? 729 | 730 | In both models, 731 | $$ 732 | y_i = \mu + \alpha_{j_i} + \epsilon_i 733 | $$ 734 | with the only difference being 735 | 736 | * **Fixed effects** $\alpha_1,\dots,\alpha_J$ are *unknown* parameters to be estimated. 737 | + $E(y_i) = \mu + \alpha_{j_i}$ and $V(y_i) = \sigma_\epsilon^2$ 738 | * **Random effects** $\alpha_1,\dots,\alpha_J$ are *unknown* random variables to be predicted. 739 | + $E(y_i) = \mu$ and $V(y_i) = \sigma_\alpha^2 + \sigma_\epsilon^2$ 740 | + But $E(y_i | \alpha_{j_i}) = \mu + \alpha_{j_i}$ and $V(y_i | \alpha_{j_i}) = \sigma_\epsilon^2$ 741 | 742 | If you find this confusing, you are not alone. Gelman and Hill argue that the only meaningful difference is that the $\alpha_j$ are modelled in one and not in the other. 743 | 744 | 745 | ## Predicting random effects 746 | 747 | The default method for R's `lmer` function (in the `lme4` package) is *REML* (residualized or restricted maximum likelihood), which is unbiased for linear models. 748 | 749 | \scriptsize 750 | ```{r, echo=TRUE} 751 | suppressMessages(library(lme4)) 752 | partial.pooling <- lmer(demvote ~ 1 + (1 | state), data = pew) 753 | partial.pooling 754 | ``` 755 | 756 | ## Extracting predictions 757 | 758 | \scriptsize 759 | ```{r, echo=TRUE} 760 | fixef(partial.pooling) # grand mean 761 | ranef(partial.pooling)$state %>% head(4) # state effects 762 | coef(partial.pooling)$state %>% head(4) # state predictions 763 | ``` 764 | 765 | 766 | ## Which is better? 
767 | 768 | \scriptsize 769 | ```{r echo=FALSE} 770 | partial.pooling <- coef(partial.pooling)$state %>% 771 | as_data_frame(rownames = "state") %>% 772 | transmute(state = factor(state, levels = levels(pew$state)), 773 | partial.pooling = 100 * `(Intercept)`) 774 | estimates <- estimates %>% 775 | left_join(no.pooling, by = "state") %>% 776 | left_join(partial.pooling, by = "state") %>% 777 | mutate(n = as.integer(xtabs(~ state, data = pew))) 778 | estimates 779 | ``` 780 | 781 | 782 | ```{r echo=FALSE, include=FALSE} 783 | rmse <- function(est, act) sqrt(mean((est - act)^2, na.rm = TRUE)) 784 | RMSE <- estimates %>% 785 | summarize(complete.pooling = rmse(complete.pooling, actual), 786 | no.pooling = rmse(no.pooling, actual), 787 | partial.pooling = rmse(partial.pooling, actual)) %>% 788 | unlist() 789 | RMSE 790 | ``` 791 | 792 | \normalsize 793 | 794 | The RMSE is about the same for *no pooling* (`r round(RMSE["no.pooling"], 1)`%) and *complete pooling* (`r round(RMSE[ 795 | "complete.pooling"], 1)`%), but substantially smaller for *partial pooling* (`r round(RMSE["partial.pooling"], 1)`%). 796 | 797 | 798 | 799 | ## 2016 U.S. election: estimates *vs.* actuals 800 | 801 | ```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=4.5, fig.width=9, fig.align="center"} 802 | p1 <- ggplot(estimates, aes(actual, no.pooling)) + 803 | geom_point() + 804 | labs(title = "No pooling", x = "Percent of voters", 805 | y = "Percent of voters") + 806 | lims(x = c(0,100), y = c(0,100)) + 807 | theme_minimal() 808 | p2 <- ggplot(estimates, aes(actual, partial.pooling)) + 809 | geom_point() + 810 | labs(title = "Partial pooling", x = "Percent of voters", 811 | y = "Percent of voters") + 812 | lims(x = c(0,100), y = c(0,100)) + 813 | theme_minimal() 814 | suppressMessages(library(gridExtra)) 815 | grid.arrange(p1, p2, nrow = 1) 816 | ``` 817 | 818 | 819 | ## Shrinkage 820 | 821 | ```{r echo=FALSE, fig.align="center"} 822 | na.omit(estimates) %>% 823 | select(-post.stratified) %>% 824 | rename("Actual" = actual, "No pooling" = no.pooling, 825 | "Partial pooling" = partial.pooling) %>% 826 | gather(Estimator, Percentage, -state, -name, -n) %>% 827 | mutate(Estimator = factor(Estimator, levels = c("Post-stratified", 828 | "No pooling", "Partial pooling", "Actual"))) %>% 829 | ggplot(aes(Estimator, Percentage, group = state, color = n)) + 830 | geom_line(size = 0.7) + 831 | scale_color_gradientn(colors = blues9[-1]) + 832 | lims(y = c(0, 100)) + 833 | theme_minimal() 834 | ``` 835 | 836 | 837 | ## What *exactly* is a random effects estimator? 838 | 839 | The random effects/partial pooling estimate of a group mean is a weighted average of the fixed effects/no pooling estimate and the grand mean/complete pooling estimate. 840 | 841 | $$ 842 | \hat\theta_j^{\text{RE}} = w_j \, \bar y_j + (1 - w_j) \, \bar y 843 | $$ 844 | 845 | * The weight on the no pooling estimator $\bar y_j$ is proportional to sample size $n_j$ in group $j$ and inversely proportional to the **within group variance**. 846 | * The weight on the complete pooling is inversely proportional to the **between group variance**. 847 | * Sometimes referred to as an **empirical Bayesian estimator**. The amount of shrinkage is data dependent. 848 | 849 | 850 | ## Adding individual covariates: fixed effects 851 | 852 | To get better predictions, let's add a covariate -- **gender** -- to the model. 
There are now three *fixed effects* estimators: 853 | 854 | **Complete pooling** estimate a single slope and single intercept for all states 855 | 856 | **No pooling** estimate separate intercepts and slopes for each state 857 | 858 | **Complete pooling of slopes, no pooling of intercepts** estimate separate intercepts for each state and a common slope 859 | 860 | *(It rarely makes sense to estimate separate slopes and a common intercept, so we ignore this possibility.)* 861 | 862 | 863 | ## Complete pooling: common intercepts and slopes 864 | 865 | $$ 866 | y_i = \alpha + \beta x_i + \epsilon_i \qquad \epsilon_i \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\epsilon) 867 | $$ 868 | 869 | \scriptsize 870 | ```{r, echo=TRUE} 871 | pew <- mutate(pew, female = ifelse(gender == "female", 1, 0)) 872 | fit1 <- lm(demvote ~ 1 + female, data = pew) 873 | arm::display(fit1) 874 | ``` 875 | 876 | \normalsize 877 | 878 | $\hat\beta = 0.14$ implies a 14% gender gap (42% of men voted for Clinton *vs.* 56% of women). 879 | 880 | 881 | ## No pooling: separate intercepts and slopes 882 | 883 | $$ 884 | y_i = \alpha_{j_i} + \beta_{j_i} x_i + \epsilon_i 885 | $$ 886 | 887 | \tiny 888 | ```{r, echo=TRUE} 889 | fit2 <- lm(demvote ~ 0 + state + state:female, data = pew) 890 | coef.fit2 <- as.matrix(coef(fit2)) 891 | round(coef(fit2)[c(1:5, 90:98)], 2) 892 | ``` 893 | 894 | \normalsize 895 | It's impossible to estimate the slope in Wyoming: 896 | 897 | \tiny 898 | ```{r, echo=FALSE} 899 | options(width = 100) 900 | xtabs(~ gender + state, data = pew) # no men in WY 901 | options(width = 85) 902 | ``` 903 | 904 | ```{r echo=FALSE, include=FALSE} 905 | coef.fit2 <- coef(fit2) 906 | coef.fit2 <- as.matrix(data.frame(intercept = coef.fit2[1:48], 907 | slope = coef.fit2[50:97], row.names = 908 | gsub("state", "", names(coef.fit2)[1:48]))) 909 | ``` 910 | 911 | 912 | ## Separate intercepts with common slope 913 | 914 | $$ 915 | y_i = \alpha_{j_i} + \beta x_i + \epsilon_i 916 | $$ 917 | 918 | \scriptsize 919 | ```{r echo=TRUE} 920 | fit3 <- lm(demvote ~ state + female - 1, data = pew) 921 | coef.fit3 <- coef(fit3) 922 | names(coef.fit3) <- gsub("state", "", names(coef.fit3)) 923 | round(coef.fit3, 2) 924 | ``` 925 | 926 | 927 | ## Which should you believe? 
928 | 929 | ```{r, echo=FALSE, fig.align="center"} 930 | par(mfrow=c(1,2)) 931 | plot.gender.by.state <- function(main, coefs, fixef) { 932 | n <- nrow(coefs) 933 | x <- rep(c(0,1), rep(n, 2)) 934 | y <- 100 * c(coefs[[1]], coefs[[1]] + coefs[[2]]) 935 | plot(x = x, y = y, pch = 19, cex = 0.5, xlim = c(0, 1), 936 | ylim = c(0, 100), main = main,xlab = "Gender", 937 | ylab = "Percent voting for Clinton", axes = FALSE) 938 | axis(1, at = c(0, 1), labels = c("Male", "Female")) 939 | axis(2, at = seq(0, 100, 20)) 940 | for (i in seq_len(n)) lines(x = c(0, 1), 941 | y = c(y[i], y[i+n]), col = "grey", lwd = 0.5) 942 | abline(100 * fixef, col = "red3", lwd = 2) 943 | } 944 | plot.gender.by.state("No pooling", as_data_frame(coef.fit2)[-49,], 945 | coef(fit1)) 946 | coef.fit3.df <- data_frame(Intercept = coef.fit3[-length(coef.fit3)], 947 | Slope = coef.fit3[length(coef.fit3)]) 948 | plot.gender.by.state("Common slope", coef.fit3.df, coef(fit1)) 949 | par(mfrow=c(1,1)) 950 | ``` 951 | 952 | 953 | ## Gender gaps in 2016 Exit poll 954 | 955 | \centering 956 | ![](images/gender-gap-xp.pdf) 957 | 958 | 959 | ## Multilevel models with a single covariate 960 | 961 | **Varying intercepts, common slope** 962 | 963 | $$ 964 | y_i = \mu + \alpha_{j_i} + \beta x_i + \epsilon_i 965 | $$ 966 | 967 | * The intercept for state $j$ is $\mu + \alpha_j$. 968 | * The slope (gender gap) is $\beta$ (same for all states). 969 | 970 | **Varying intercepts and slopes** 971 | 972 | $$ 973 | y_i = \mu + \alpha_{j_i} + \beta x_i + \gamma_{j_i} x_i + \epsilon_i 974 | $$ 975 | 976 | * The intercept for state $j$ is $\mu + \alpha_j$. 977 | * The slope (gender gap) for state $j$ is $\beta + \gamma_j$ (varies). 978 | 979 | 980 | ## Random effects: varying intercepts, common slopes 981 | 982 | \scriptsize 983 | ```{r echo=TRUE} 984 | # Varying intercept and common slope 985 | fit4 <- lmer(demvote ~ 1 + female + (1 | state), data = pew) 986 | fit4 987 | ``` 988 | 989 | 990 | ## Random effects: varying intercepts and slopes (first try) 991 | 992 | \tiny 993 | ```{r, echo=TRUE} 994 | (fit5 <- lmer(demvote ~ 1 + female + (1 + female | state), 995 | data = pew)) # fails to converge 996 | ``` 997 | 998 | 999 | ## Fixing the convergence failure (by centering covariates) 1000 | 1001 | \scriptsize 1002 | ```{r echo=TRUE} 1003 | pew <- mutate(pew, female.c = female - 0.5) 1004 | fit6 <- lmer(demvote ~ 1 + female.c + (1 + female.c | state), 1005 | data = pew) 1006 | fixef(fit6) # not comparable to prior models 1007 | head(coef(fit6)$state) 1008 | ``` 1009 | 1010 | \normalsize 1011 | 1012 | (No need to wrestle with `lmer` if you use Stan -- which is up next.) 1013 | 1014 | 1015 | ## Unscale the estimates for comparability 1016 | 1017 | \scriptsize 1018 | ```{r, echo=TRUE} 1019 | fixef.fit6 <- fixef(fit6) 1020 | fixef.fit6[1] <- fixef.fit6[1] - 0.5 * fixef.fit6[2] 1021 | coef.fit6 <- coef(fit6)$state 1022 | coef.fit6[[1]] <- coef.fit6[[1]] - 0.5 * coef.fit6[[2]] 1023 | fixef.fit6 1024 | head(coef.fit6) 1025 | ``` 1026 | 1027 | 1028 | ## Comparing the random effects estimates 1029 | 1030 | Gender gaps are correlated with state effects ($r = 0.56$). 
1031 | 1032 | ```{r, echo=FALSE, fig.width=9, fig.height=5} 1033 | par(mfrow=c(1,2)) 1034 | plot.gender.by.state("Varying intercepts, common slope", 1035 | coef(fit4)$state, fixef(fit4)) 1036 | plot.gender.by.state("Varying intercepts and slopes", 1037 | coef.fit6, fixef.fit6) 1038 | par(mfrow=c(1,1)) 1039 | ``` 1040 | 1041 | 1042 | ## Adding group-level covariates to the model 1043 | 1044 | Add a **group-level covariate** (2012 vote) to the model. This is impossible in the no-pooling model, since group level variables are **collinear** with state indicators. 1045 | 1046 | As before, $i$ indexes respondents and $j$ indexes states. The variables are: 1047 | $$ 1048 | \begin{aligned} 1049 | y_i &= \text{respondent 2016 vote (1 = Clinton, 0 = other)} \\ 1050 | x_i &= \text{respondent gender (1 = female, 0 = male)} \\ 1051 | z_j &= \text{proportion in state voting for Obama in 2012} 1052 | \end{aligned} 1053 | $$ 1054 | At the respondent level, both intercepts and slopes vary by group (state): 1055 | $$ 1056 | \text{\textbf{Respondent model}} \quad y_i = \alpha_{j_i} + \beta_{j_i} x_i + \epsilon_i \qquad (i = 1,\dots,n) 1057 | $$ 1058 | The state intercepts and slopes are modelled as: 1059 | $$ 1060 | \text{\textbf{Group model}} \qquad \alpha_j = \mu + \delta z_j + \eta_j \qquad 1061 | \beta_j = \gamma + \xi_j 1062 | $$ 1063 | The errors are all assumed to be iid normal with zero means: 1064 | $$ 1065 | \epsilon_i \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\epsilon) 1066 | \qquad 1067 | \eta_j \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\eta) 1068 | \qquad 1069 | \xi_j \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma_\xi) 1070 | $$ 1071 | 1072 | ## An equivalent version of the model 1073 | 1074 | We can move covariates from the **upper** (group) level to the **lower** (respondent) level without changing anything: 1075 | $$ 1076 | y_i = \mu + \gamma x_i + \delta z_{j_i} + \left( \eta_{j_i} + \xi_{j_i} x_i + \epsilon_i \right) 1077 | $$ 1078 | The first three terms are **fixed effects**, while the last terms (in parentheses) are **random effects**. 1079 | 1080 | The random effects in this formulation all have **mean zero**, which is required by `lmer`. This assumption is not restrictive -- anything on the RHS of the group model with non-zero mean can be moved to the respondent level fixed effects. 
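## A simulation sketch of the two-level model

To make the two formulations concrete, here is a small simulation sketch (all parameter values invented) that generates data from the group model and then the respondent model; the combined formulation corresponds to an `lmer` call of the form shown on the next slide.

\scriptsize
```{r, echo=TRUE, eval=FALSE}
# Simulate from the two-level model with invented parameter values
set.seed(1)
J <- 50; n <- 2000
mu <- 0.3; gamma <- 0.1; delta <- 0.5              # fixed effects
sigma_eta <- 0.05; sigma_xi <- 0.03; sigma_eps <- 0.45
z   <- runif(J, 0.3, 0.7)                          # group-level covariate
eta <- rnorm(J, 0, sigma_eta)                      # group intercept errors
xi  <- rnorm(J, 0, sigma_xi)                       # group slope errors
j   <- sample(J, n, replace = TRUE)                # respondent's group
x   <- rbinom(n, 1, 0.5)                           # respondent covariate
y   <- mu + gamma * x + delta * z[j] + eta[j] + xi[j] * x + rnorm(n, 0, sigma_eps)
sim <- data.frame(y, x, z = z[j], group = factor(j))
lme4::lmer(y ~ 1 + x + z + (1 + x | group), data = sim)
```

\normalsize

(Note that by default `lmer` also estimates a correlation between the varying intercepts and slopes.)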
1081 | 1082 | 1083 | ## Estimating models with covariates at both levels 1084 | 1085 | Join group level covariates to respondent data: 1086 | 1087 | \scriptsize 1088 | ```{r, echo=TRUE} 1089 | obama12 <- votes12 %>% 1090 | mutate(obama12 = obama / turnout) %>% 1091 | select(state, obama12) 1092 | pew <- left_join(pew, obama12, by = "state") 1093 | fit7 <- lmer(demvote ~ 1 + female.c + obama12 + (1 + female.c | state), data = pew) 1094 | ``` 1095 | 1096 | \tiny 1097 | ```{r} 1098 | options(width = 105) 1099 | arm::display(fit7) 1100 | options(width = 85) 1101 | ``` 1102 | 1103 | 1104 | 1105 | 1106 | ## First attempt at MRP 1107 | 1108 | We can use the R `predict` function to **impute** `demvote` onto `cps`: 1109 | 1110 | \scriptsize 1111 | ```{r echo=TRUE, warning=FALSE, message=FALSE} 1112 | cps <- cps %>% 1113 | mutate(female = ifelse(gender == "female", 1, 0), 1114 | female.c = female - 0.5) %>% 1115 | left_join(obama12, by = "state") 1116 | prob <- predict(fit7, newdata = cps, allow.new.levels = TRUE) 1117 | mrp1 <- cps %>% 1118 | mutate(prob = prob) %>% 1119 | group_by(state) %>% 1120 | summarize(mrp1 = 100 * weighted.mean(prob, weight)) 1121 | estimates <- left_join(estimates, mrp1, by = "state") 1122 | head(estimates) 1123 | ``` 1124 | 1125 | 1126 | ## Plotting first MRP estimates 1127 | 1128 | \centering 1129 | 1130 | ```{r echo=FALSE, message=FALSE, warning = FALSE, fig.width=4.5, fig.height=4, fig.aling="center"} 1131 | RMSE["mrp1"] <- with(estimates, rmse(mrp1, actual)) 1132 | ggplot(estimates, aes(actual, mrp1)) + 1133 | geom_abline(intercept = 0, slope = 1, col = "grey") + 1134 | geom_point(size = 1.5) + 1135 | lims(x = c(0, 100), y = c(0, 100)) + 1136 | labs(x = "Percentage of vote for Clinton", y = "Estimate") + 1137 | theme_minimal() 1138 | ``` 1139 | 1140 | \normalsize 1141 | 1142 | The MRP estimates are now pretty good (RMSE = `r round(RMSE["mrp1"], 2)`)! 1143 | 1144 | 1145 | 1146 | 1147 | ## Summary of multilevel models 1148 | 1149 | * Multilevel models allow us to estimate models with large numbers of parameters and still get sensible results. 1150 | * Multilevel models **shrink** or **regularize** estimates. 1151 | + Random effect estimates of group parameters are shrunk toward a "grand mean" effect. 1152 | * The amount of **shrinkage** in estimating a parameter depends upon: 1153 | + How much data is available to estimate the parameter. 1154 | + How much the parameter varies across groups. 1155 | * Substantial improvements can be made by adding **group-level predictors** (like past vote). 1156 | * Also possible to estimate multilevel **logistic regressions** using `glmer`, but we will focus on Bayesian estimation instead. 1157 | 1158 | 1159 | # 4. Bayesian inference 1160 | 1161 | 1162 | ## What is Bayesian statistics? 1163 | 1164 | \columnsbegin 1165 | \column{.3\textwidth} 1166 | ```{r, out.width = "1.5in"} 1167 | knitr::include_graphics("images/bayes.png") 1168 | ``` 1169 | 1170 | \smallskip 1171 | 1172 | \centering 1173 | 1174 | Thomas Bayes (1701-61) 1175 | 1176 | \column{.5\textwidth} 1177 | 1178 | Bayesian statistics is a method for combining **prior information** with **data** to draw inferences. 1179 | 1180 | \medskip 1181 | 1182 | Probabilities are **subjective measures of uncertainty** and, in Bayesian inference, can be applied to *anything*, including events that have already happened. 
1183 | 1184 | \medskip 1185 | 1186 | Bayesian inference seems particularly well suited for situations where there is uncertainty about the sampling process and no **objective sampling distribution**. 1187 | 1188 | \columnsend 1189 | 1190 | 1191 | ## Advantages and disadvantages of Bayesian inference 1192 | 1193 | \columnsbegin 1194 | \column{.3\textwidth} 1195 | ```{r, out.width = "1.5in"} 1196 | knitr::include_graphics("images/bayes-grave.jpg") 1197 | ``` 1198 | 1199 | \smallskip 1200 | 1201 | \centering 1202 | \small 1203 | 1204 | Cotton and Bayes 1205 | 1206 | family vault 1207 | 1208 | \footnotesize 1209 | 1210 | (Bunhill Fields, London) 1211 | 1212 | \column{.3\textwidth} 1213 | 1214 | \normalsize 1215 | 1216 | **Advantages** 1217 | 1218 | * Does not depend on having large samples 1219 | * Can incorporate prior (non-sample) information 1220 | * Better estimates of variance parameters 1221 | * Simpler conceptual framework, avoids *ad hockery* 1222 | 1223 | \column{.3\textwidth} 1224 | 1225 | \normalsize 1226 | 1227 | **Disadvantages** 1228 | 1229 | * Requires specification of prior probability distributions 1230 | * Inferences are not "objective" -- different results depending on what priors you use 1231 | * More math 1232 | * Takes *much* more computational effort 1233 | 1234 | \columnsend 1235 | 1236 | 1237 | ## Basic ideas 1238 | 1239 | In Bayesian statistics, both data and parameters are considered to be random variables. The important distinction is between what is **observed** (the data, except possibly for missing values) and what isn't (parameters, such as regression coefficients). 1240 | 1241 | As in classical statistics, we specify a **model** (probability distribution) for the data, which depends on some unknown parameters. In addition, we summarize any information that we might have about the unknown parameters using a **prior probability distribution**. 1242 | 1243 | Bayesian inference uses Bayes theorem to combine the prior information with the data to obtain a **posterior probability distribution** for the parameters. 1244 | 1245 | 1246 | 1247 | ## Notation for Bayesian analysis 1248 | 1249 | **Data** $y = (y_1,\dots,y_n)$ 1250 | 1251 | **Parameters** $\theta = (\theta_1,\dots,\theta_k)$ 1252 | 1253 | **Prior distribution** $p(\theta) = p(\theta_1,\dots,\theta_k)$ 1254 | 1255 | **Model** $p(y|\theta)$ (*a.k.a.* the **likelihood function**) 1256 | 1257 | **Posterior distribution** $p(\theta | y)$ 1258 | 1259 | 1260 | ## Example: What is the average airfare to AAPOR in Denver? 1261 | 1262 | ```{r echo=FALSE, include=FALSE} 1263 | post.precis <- 1 / 100^2 + 3 / 150^2 1264 | post.mean <- (400 / 100^2 + 300 * 3 / 150^2) / post.precis 1265 | post.sd <- 1 / sqrt(post.precis) 1266 | ``` 1267 | 1268 | 1269 | **Population parameters** 1270 | 1271 | * $\theta = \text{population average fare}$ 1272 | * $\sigma = \text{population s.d.} = 150$ (assumed to be known) 1273 | 1274 | **Data** My airfare was $y_1 = 161$ and two friends spent $y_2 = 250$ and $y_3 = 489$. 
1275 | 1276 | $$ 1277 | \bar y = \dfrac{161 + 250 + 489}{3} = 300 \qquad n = 3 1278 | $$ 1279 | 1280 | **Model** $y_1,\dots,y_n$ are independent $\text{Normal}(\theta, 150)$ random variables 1281 | 1282 | **Prior information** $\theta \sim \text{Normal}(400, 100)$ (so $P(200 < \theta < 600) \approx 95\%$) 1283 | 1284 | **Posterior distribution** $\theta | y_1,y_2,y_3 \sim \text{Normal}(`r round(post.mean)`, `r round(post.sd)`)$ 1285 | 1286 | 1287 | ## Bayesian updating: from prior to posterior 1288 | 1289 | \centering 1290 | 1291 | From Bayes' Theorem, it can be shown that $\theta | y \sim \text{Normal}(343, 65)$. 1292 | 1293 | 1294 | ```{r echo=FALSE, fig.width=4.5, fig.height=2.75, fig.align="center"} 1295 | ggplot(data_frame(theta = c(0, 700) ), aes(theta)) + 1296 | stat_function(fun = dnorm, color = "red", 1297 | args = list(mean = 300, sd = 200 / sqrt(3))) + 1298 | stat_function(fun = dnorm, color = "blue", 1299 | args = list(mean = 400, sd = 100)) + 1300 | stat_function(fun = dnorm, color = "purple", 1301 | args = list(mean = post.mean, sd = post.sd)) + 1302 | labs(x = quote(theta), y = "") + 1303 | scale_x_continuous(breaks = seq(0, 700, 100)) + 1304 | scale_y_continuous(breaks = NULL) + 1305 | theme_minimal() 1306 | ``` 1307 | 1308 | *The posterior is a compromise between the prior and the likelihood.* 1309 | 1310 | 1311 | 1312 | ## Relationship between prior and posterior mean 1313 | 1314 | In the normal model, the **posterior mean** is a weighted average of the **prior mean** and the **sample mean**. 1315 | $$ 1316 | \begin{aligned} 1317 | \text{posterior mean} &= \text{weight} \times \text{prior mean} + (1 - \text{weight}) \times \text{sample mean} \\[10pt] 1318 | \text{weight} &= \dfrac{ \text{prior precision} }{ \text{prior precision} + \text{sample precision} } \\[3pt] 1319 | \text{prior precision} &= \dfrac{1}{ \text{prior variance} } \\[3pt] 1320 | \text{sample precision} &= \dfrac{n}{ \text{data variance} } 1321 | \end{aligned} 1322 | $$ 1323 | 1324 | When the sample size is large, the weight on the prior becomes small. However, when the sample size is small, the posterior **shrinks** the sample mean toward the prior. 1325 | 1326 | 1327 | ## Markov Chain Monte Carlo (MCMC) 1328 | 1329 | For simple models (like the normal model with a normal prior), the posterior can be derived **analytically**. This is impossible for most models and priors, which limited the application of Bayesian methods before MCMC (about 1990). 1330 | 1331 | MCMC is a technique for **simulating** draws from a probability distribution. With simulated draws, we can: 1332 | 1333 | * Use a histogram or density plot to see what the **posterior distribution** looks like. 1334 | * Compute an estimate of the **posterior mean** (the average of the draws) or **posterior standard deviation** (the s.d. of the draws) 1335 | * Compute quantiles to obtain **credible intervals** (a.k.a. "confidence intervals"). 1336 | 1337 | Programs like BUGS and Stan automate most of this process. 1338 | 1339 | 1340 | ## Instructions for installing Stan on MacOS 1341 | 1342 | 1343 | 1344 | 1. Install a C++ compiler (Xcode for MacOS or Rtools for Windows) 1345 | 1346 | 2. 
Create a `Makevars` file your `.R` folder (in your home directory): 1347 | 1348 | \scriptsize 1349 | ``` 1350 | CC=clang 1351 | CXX=clang++ -arch x86_64 -ftemplate-depth-256 1352 | CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable \ 1353 | -Wno-unused-function -Wno-macro-redefined \ 1354 | -Wno-unused-local-typedefs -Wno-c++11-inline-namespace \ 1355 | -Wno-unneeded-internal-declaration -Wno-unknown-pragmas 1356 | ``` 1357 | 1358 | \normalsize 1359 | 1360 | 3. Install `rstan` from CRAN 1361 | 1362 | \scriptsize 1363 | ```{r echo=TRUE, eval=FALSE} 1364 | install.packages(c("rstan", "rstanarm")) 1365 | ``` 1366 | 1367 | 1368 | ## A first Stan program 1369 | 1370 | **Model** $y_1,\dots,y_n \stackrel{\text{iid}}{\sim} \text{Normal}(\theta, \sigma)$ ($\sigma = 150$) 1371 | 1372 | **Prior** $\theta \sim \text{Normal}(\theta_0, \omega_0)$ ($\theta_0 = 350$, $\omega_0 = 100$) 1373 | 1374 | \scriptsize 1375 | ```{r, echo=TRUE} 1376 | model_code <- "data { 1377 | int n; 1378 | real y[n]; 1379 | real theta_0; 1380 | real omega_0; 1381 | real sigma; 1382 | } 1383 | parameters { 1384 | real theta; 1385 | } 1386 | model { 1387 | theta ~ normal(theta_0, omega_0); 1388 | for (i in 1:n) { 1389 | y[i] ~ normal(theta, sigma); 1390 | } 1391 | }" 1392 | ``` 1393 | 1394 | 1395 | ## Components of a Stan program 1396 | 1397 | **Data** (what you know) 1398 | 1399 | \scriptsize 1400 | ```{r eval=FALSE, echo=TRUE} 1401 | data { 1402 | int n; // number of data points 1403 | real y[n]; // data on fares 1404 | real theta_0; // prior mean 1405 | real omega_0; // prior sd 1406 | real sigma; // sd of fares 1407 | } 1408 | ``` 1409 | 1410 | \normalsize 1411 | 1412 | **Parameters** (what you don't know) 1413 | 1414 | \scriptsize 1415 | ```{r eval=FALSE, echo=TRUE} 1416 | parameters { 1417 | real theta; // mean fare 1418 | } 1419 | ``` 1420 | 1421 | \normalsize 1422 | 1423 | **Model** (joint distribution of data and parameters) 1424 | 1425 | \scriptsize 1426 | ```{r eval=FALSE, echo=TRUE} 1427 | model { 1428 | theta ~ normal(theta_0, omega_0); // prior 1429 | for (i in 1:n) 1430 | y[i] ~ normal(theta, sigma); // likelihood 1431 | ``` 1432 | 1433 | 1434 | ## Running a Stan program from R 1435 | 1436 | To run a Stan program from R, you will need: 1437 | 1438 | * Load `rstan` package (using the `library` function) 1439 | * Code your Stan `model_code` (as a string or from an external file) 1440 | * Create a named list `data` (containing all of the items listed in the Stan `data` block) 1441 | * Call the `stan` function with `model_code` and `data` as arguments (and possibly other options) 1442 | 1443 | Stan returns draws from the posterior distribution of the parameters, arranged into a three-dimensional array (draws x chains x parameters). 1444 | 1445 | 1446 | ## Example (continued) 1447 | 1448 | \footnotesize 1449 | ```{r, echo=TRUE,warning=FALSE,message=FALSE} 1450 | suppressMessages(library(rstan)) 1451 | rstan_options(auto_write = TRUE) 1452 | options(mc.cores = parallel::detectCores()) 1453 | y <- c(161, 250, 489) 1454 | data <- list(y = y, n = length(y), theta_0 = 350, 1455 | omega_0 = 150, sigma = 100) 1456 | sims <- stan(model_code = model_code, data = data, 1457 | chains = 4, iter = 500, seed = 1234) 1458 | ``` 1459 | 1460 | \normalsize 1461 | 1462 | This will simulate four **chains** of length `iter = 500`. 1463 | 1464 | \smallskip 1465 | 1466 | By default, half of each chain is for **warmup**, so the result will be a 500 x 4 x 1 array (iterations x chains x parameters). 
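## Checking against the closed-form posterior

Because this first model is conjugate (normal likelihood, normal prior, known $\sigma$), the posterior is also available in closed form. A quick sketch (not evaluated here) that computes it from the values in `data`, for comparison with the simulation output on the next slide:

\scriptsize
```{r, echo=TRUE, eval=FALSE}
# Closed-form normal-normal posterior, for comparison with the Stan draws
prior.precision  <- 1 / data$omega_0^2
sample.precision <- data$n / data$sigma^2
post.mean <- (data$theta_0 * prior.precision + mean(data$y) * sample.precision) /
  (prior.precision + sample.precision)
post.sd <- 1 / sqrt(prior.precision + sample.precision)
c(analytic.mean = post.mean, analytic.sd = post.sd)
summary(sims, pars = "theta")$summary[, c("mean", "sd")]   # MCMC values
```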
1467 | 1468 | 1469 | ## Stan output 1470 | 1471 | \scriptsize 1472 | ```{r, echo=TRUE, include=TRUE} 1473 | print(sims) 1474 | ``` 1475 | 1476 | 1477 | ## How to read Stan output 1478 | 1479 | * `mean` is the **posterior mean** -- estimated to be 302.94 for $\theta$. 1480 | * `se_mean` is the **monte carlo standard error**. 1481 | * The value 2.63 tells you that only the first two digits are reliable -- the point estimate is probably between 297 and 309. 1482 | * To make this smaller, redo the simulations with a larger value of `iter`. 1483 | * `sd` is the **posterior standard deviation** of $\theta$ -- beyond the error in the computation of the point estimate. 1484 | * The percentages are **posterior quantiles**. 1485 | * The 95% **credible interval** is 200.47 to 404.01. 1486 | * `Rhat` is a **convergence** diagnostic. You should look for values near one. 1487 | 1488 | 1489 | ## Weakly informative priors 1490 | 1491 | Bayesian inference is often criticized for being *subjective* since different priors will yield different inferences. 1492 | 1493 | **Non-informative priors** can be used to reproduce many classical frequentist ("objective") results. Sometimes these priors are not *proper* (*e.g.*, $\theta$ is uniformly distributed on $\Bbb{R}$). 1494 | 1495 | **Weakly informative priors** purposely include less information than we actually have, but still provide some *regularization* (shrinkage). 1496 | 1497 | For example, logistic regression coefficients are rarely greater than five or six (the difference between the top and bottom decile is about 4.4, between the .95 and .05 quantiles is about 5.9). In a linear model, a ``six sigma'' effect would be considered extremely large by any standard. 1498 | 1499 | 1500 | ## Priors on standard deviations 1501 | 1502 | In real world applications, we do not know the standard deviation of the data, so there are two unknown parameters ($\mu$ and $\sigma$). 1503 | 1504 | Since $\sigma > 0$, we need to use a prior distribution for $\sigma$ that assigns zero probability to negative numbers. There are many possible choices. 1505 | 1506 | Traditionally, variance priors were selected for computational convenience, but with modern MCMC methods this is no longer necessary. 1507 | 1508 | Gelman, "Prior distributions for variance parameters in hierarchical models," *Bayesian Analysis* (2006) recommends the *Folded noncentral t* distribution (the distribution of the absolute value of a noncentral $t$ random variable). A *half normal* prior may also be suitable. 1509 | 1510 | 1511 | ## Stan code for estimating both mean and SD 1512 | 1513 | Move `sigma` from the `data` block to the `parameters` block. 
1514 | 1515 | \scriptsize 1516 | ```{r stan_mean_sd, echo=TRUE, warning=FALSE, message=FALSE} 1517 | data <- list(y = y, n = length(y), theta_0 = 350, omega_0 = 150) 1518 | model_code <- "data { 1519 | int n; 1520 | real y[n]; 1521 | real theta_0; 1522 | real omega_0; 1523 | } 1524 | parameters { 1525 | real theta; 1526 | real sigma; // moved to parameter block 1527 | } 1528 | model { 1529 | theta ~ normal(theta_0, omega_0); 1530 | sigma ~ normal(150, 150); 1531 | y ~ normal(theta, sigma); // assumed iid 1532 | }" 1533 | sims <- stan(model_code = model_code, data = data, 1534 | iter = 500, seed = 1234) 1535 | ``` 1536 | 1537 | 1538 | ## Output from two-parameter model 1539 | 1540 | \scriptsize 1541 | ```{r mean_sd_output, echo=TRUE} 1542 | print(sims) 1543 | ``` 1544 | 1545 | 1546 | ## Graphical display of parameters 1547 | 1548 | \footnotesize 1549 | ```{r, fig.height=4.5,message=FALSE,warning=FALSE} 1550 | plot(sims) 1551 | ``` 1552 | 1553 | 1554 | ## Traceplot 1555 | 1556 | \footnotesize 1557 | ```{r, fig.height=4.5,message=FALSE,warning=FALSE} 1558 | traceplot(sims) 1559 | ``` 1560 | 1561 | ## Hierarchical priors 1562 | 1563 | Multilevel models contain many parameters and it is difficult to specify **high dimensional priors**. In a model for respondents in U.S. states, we might have intercepts for each state, each of which would need to be assigned a prior distribution. 1564 | 1565 | Instead, we use **hierarchical priors** where the priors for the model parameters depend upon **hyperparameters**. That is, we model relationships between the many model parameters using a second level model. 1566 | 1567 | Hierarchical priors are **data dependent**. We infer relationships between model parameters using observed patterns in the data. 1568 | 1569 | For example, we can model state intercepts in a voting model using past vote. To the extent that 2016 respondents in states that voted Republican in 2012 also say they will vote Republican in 2016, we can infer a correlation between past vote and state intercepts in the 2016 model. 1570 | 1571 | 1572 | ## Example: a simple two-level model 1573 | 1574 | ![](images/dag_2level.png) 1575 | 1576 | 1577 | ## Notation for hierarchical normal model 1578 | 1579 | **Individuals** indexed by $i = 1,\dots,n$. 1580 | 1581 | **Groups** indexed by $j = 1,\dots,J$. Individual $i$ belongs to group $j_i$. 1582 | 1583 | **Data Model** $y_i \stackrel{\text{ind}}{\sim} \text{Normal}(\theta_{j_i}, \sigma_y)$ 1584 | 1585 | **Group Model** $\theta_j \stackrel{\text{ind}}{\sim} \text{Normal}(\mu_\theta, \sigma_\theta)$ (Hierarchical prior) 1586 | 1587 | **Hyper Prior** $\mu_\theta \sim \text{Uniform}(-\infty, \infty)$ and $\sigma_\theta \sim \text{Normal}_+(0, 5)$ 1588 | 1589 | We can then use Bayes' Theorem to derive the posterior distribution of both the group parameters ($\theta_1,\dots,\theta_J$) and the hyperparameters ($\mu_\theta, \sigma_\theta$).
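## The hierarchical normal model in Stan (sketch)

For reference, the hierarchical normal model above translates almost line for line into Stan. The sketch below is not part of the course code and is not run here; the voting example on the next slide uses a Bernoulli data model instead. Parameters left without a sampling statement get implicit flat priors.

\scriptsize
```{r, echo=TRUE, eval=FALSE}
normal_hier_code <- "data {
  int n;                      // number of individuals
  int J;                      // number of groups
  real y[n];                  // outcomes
  int group[n];               // group index for each individual
}
parameters {
  real mu_theta;              // hyperparameter: mean of group means (flat prior)
  real<lower=0> sigma_theta;  // hyperparameter: sd of group means
  real<lower=0> sigma_y;      // data-level sd (flat prior)
  vector[J] theta;            // group means
}
model {
  sigma_theta ~ normal(0, 5);              // half-normal via the lower bound
  theta ~ normal(mu_theta, sigma_theta);   // hierarchical prior
  for (i in 1:n)
    y[i] ~ normal(theta[ group[i] ], sigma_y);
}"
```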
1590 | 1591 | 1592 | ## First hierarchical model in Stan 1593 | 1594 | \scriptsize 1595 | ```{r hb1, echo=TRUE, warning=FALSE, message=FALSE} 1596 | data <- with(pew, list(y = demvote, group = as.integer(state), 1597 | n = nrow(pew), J = nlevels(state))) 1598 | code <- "data { 1599 | int n; // number of respondents 1600 | int J; // number of groups (states) 1601 | int y[n]; // demvote 1602 | int group[n]; // state index 1603 | } 1604 | parameters { 1605 | real mu_theta; // hyper parameters 1606 | real sigma_theta; 1607 | vector[J] theta; // group parameters 1608 | } 1609 | model { 1610 | sigma_theta ~ normal(0, 5); 1611 | for (j in 1:J) 1612 | theta[j] ~ normal(mu_theta, sigma_theta); 1613 | for (i in 1:n) 1614 | y[i] ~ bernoulli_logit(theta[ group[i] ]); 1615 | }" 1616 | sims <- stan(model_code = code, data = data) 1617 | ``` 1618 | 1619 | 1620 | ## Stan output 1621 | 1622 | \tiny 1623 | ```{r} 1624 | print(sims) 1625 | ``` 1626 | 1627 | 1628 | ## A plot of the state estimates 1629 | 1630 | \centering 1631 | 1632 | ```{r warning=FALSE, message=FALSE, fig.width=5} 1633 | names(sims) <- c("mu_theta", "sigma_theta", levels(pew$state), "lp__") 1634 | plot(sims, par = "theta") 1635 | ``` 1636 | 1637 | 1638 | ## Add a covariate to the model 1639 | 1640 | \scriptsize 1641 | ```{r hb2, echo=TRUE, warning=FALSE, message=FALSE} 1642 | data <- with(pew, list(y = demvote, group = as.integer(state), 1643 | x = female, n = nrow(pew), J = nlevels(state))) 1644 | code <- "data { 1645 | int n; // number of respondents 1646 | int J; // number of groups (states) 1647 | int y[n]; // demvote 1648 | int group[n]; // state index 1649 | vector[n] x; // added covariate (gender) 1650 | } 1651 | parameters { 1652 | real mu_alpha; // mean intercept 1653 | real sigma_alpha; // sd intercept 1654 | real beta; // coefficient of gender 1655 | vector[J] alpha; // group intercepts 1656 | } 1657 | model { 1658 | sigma_alpha ~ normal(0, 5); 1659 | for (j in 1:J) 1660 | alpha[j] ~ normal(mu_alpha, sigma_alpha); 1661 | for (i in 1:n) 1662 | y[i] ~ bernoulli_logit(alpha[group[i]] + beta * x[i]); 1663 | }" 1664 | sims <- stan(model_code = code, data = data) 1665 | ``` 1666 | 1667 | 1668 | ## Simplifying the computations 1669 | 1670 | Make a `model.matrix` $X$ in R for fixed effects. 1671 | 1672 | \footnotesize 1673 | ```{r, echo=TRUE, include = TRUE} 1674 | X <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 1675 | region + qlogis(obama12), data = pew) 1676 | ``` 1677 | 1678 | \scriptsize 1679 | ```{r} 1680 | head(X, 4) 1681 | ``` 1682 | 1683 | 1684 | ## Simplifying the computations (continued) 1685 | 1686 | Add `X = X` to the R `data` list and modify the Stan code to use matrix multiplication: 1687 | 1688 | \scriptsize 1689 | ```{r echo=TRUE, eval=FALSE} 1690 | data { 1691 | ... 1692 | matrix[n, k] X; // covariate matrix 1693 | } 1694 | parameters { 1695 | vector[k] beta; // fixed effects 1696 | ... 1697 | } 1698 | model { 1699 | vector[n] Xb; 1700 | Xb = X * beta; 1701 | ... 1702 | } 1703 | 1704 | ``` 1705 | 1706 | 1707 | ## Simplifying the computations (continued) 1708 | 1709 | * Move hierarchical means into fixed effects, so mean of random effects is zero. 1710 | * Random effects are usually easier to program directly in Stan. 1711 | * Sometimes helpful to standardized parameters and multiply by the SD. 1712 | * Stan's default is to assume elements of a vector are independent. 
1713 | 1714 | \scriptsize 1715 | ```{r echo=TRUE, eval=FALSE} 1716 | parameters { 1717 | real sigma_alpha; // sd intercept 1718 | vector[J] alpha; // group intercepts 1719 | ... 1720 | } 1721 | model { 1722 | sigma_alpha ~ normal(0.2, 1); // prior for sd 1723 | alpha ~ normal(0, 1); // standardized intercept 1724 | ... 1725 | for (i in 1:n) 1726 | Xb[i] += sigma_alpha * alpha[ group[i] ]; 1727 | y ~ bernoulli_logit(Xb); 1728 | } 1729 | 1730 | ``` 1731 | 1732 | 1733 | ## Hierarchical model with multiple covariates 1734 | 1735 | \scriptsize 1736 | ```{r hb3, echo=TRUE, message=FALSE, warning=FALSE} 1737 | model_code <- "data { 1738 | int n; // number of respondents 1739 | int k; // number of covariates 1740 | matrix[n, k] X; // covariate matrix 1741 | int y[n]; // outcome (demvote) 1742 | int J; // number of groups (states) 1743 | int group[n]; // group index 1744 | } 1745 | parameters { 1746 | vector[k] beta; // fixed effects 1747 | real sigma_alpha; // sd intercept 1748 | vector[J] alpha; // group intercepts 1749 | } 1750 | model { 1751 | vector[n] Xb; 1752 | beta ~ normal(0, 4); 1753 | sigma_alpha ~ normal(0.2, 1); // prior for sd 1754 | alpha ~ normal(0, 1); // standardized intercepts 1755 | Xb = X * beta; 1756 | for (i in 1:n) 1757 | Xb[i] += sigma_alpha * alpha[ group[i] ]; 1758 | y ~ bernoulli_logit(Xb); 1759 | }" 1760 | ``` 1761 | 1762 | 1763 | ## Estimating the model in R 1764 | 1765 | \scriptsize 1766 | ```{r echo=TRUE, warning=FALSE, message=FALSE} 1767 | X <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 1768 | region + qlogis(obama12), data = pew) 1769 | data <- list(n = nrow(X), k = ncol(X), X = X, y = pew$demvote, 1770 | J = nlevels(pew$state), group = as.integer(pew$state)) 1771 | sims <- stan(model_code = model_code, data = data, 1772 | seed = 1234) 1773 | ``` 1774 | 1775 | \normalsize 1776 | 1777 | Coefficients in the output are in the same order specified in the `model` block: 1778 | 1779 | \scriptsize 1780 | ```{r echo=TRUE, eval=FALSE} 1781 | parameters { 1782 | vector[J] alpha; // group intercepts 1783 | real sigma_alpha; // sd intercept 1784 | vector[k] beta; // fixed effects 1785 | } 1786 | ``` 1787 | 1788 | \tiny 1789 | ```{r echo=FALSE} 1790 | options(width = 105) 1791 | names(sims) 1792 | options(width = 85) 1793 | ``` 1794 | 1795 | 1796 | ## Rename the coefficients for easier reading 1797 | 1798 | \scriptsize 1799 | ```{r echo=TRUE} 1800 | coef.names <- c(colnames(X), "sigma_alpha", levels(pew$state), "lp__") 1801 | names(sims) <- coef.names 1802 | ``` 1803 | 1804 | \tiny 1805 | ```{r} 1806 | options(width = 105) 1807 | names(sims) 1808 | options(width = 85) 1809 | ``` 1810 | 1811 | 1812 | ## Summary of fixed effect estimates 1813 | 1814 | \tiny 1815 | ```{r echo=TRUE, warning=FALSE, message=FALSE} 1816 | print(sims, par = "beta") 1817 | ``` 1818 | 1819 | 1820 | 1821 | ## Predictive distributions: imputation of survey variables for the population 1822 | 1823 | * The final step in MRP is to **impute** vote for the entire population. 1824 | + The sample is a trivial proportion of the population. 1825 | + We need to impute the survey variable to everyone **not** surveyed. 1826 | 1827 | * The **posterior predictive distribution** $p(\tilde y | y)$ is the conditional distribution of a **new** draw $\tilde y$ from the model, conditional upon the **observed** data $y$. 1828 | 1829 | * This requires averaging $p(\tilde y | \theta)$ over the posterior distribution $p(\theta | y)$, *i.e.*, over the uncertainty in both $\tilde y$ *and* $\theta$. 
1830 | 1831 | * Contrast this with 1832 | + **Regression imputation** the expected value of $\tilde y$ is used 1833 | + **Plug-in methods** a point estimate is substituted for the unknown parameter. 1834 | 1835 | 1836 | ## Imputation in Stan 1837 | 1838 | Munge the population data in R 1839 | 1840 | \scriptsize 1841 | ```{r echo=TRUE} 1842 | X0 <- model.matrix(~ 1 + age4 + gender + race3 + educ4 + 1843 | region + qlogis(obama12), data = cps) 1844 | data <- list(n = nrow(X), k = ncol(X), X = X, y = pew$demvote, 1845 | J = nlevels(pew$state), group = as.integer(pew$state), 1846 | N = nrow(X0), X0 = X0, group0 = as.integer(cps$state)) 1847 | ``` 1848 | 1849 | \normalsize 1850 | 1851 | and add to the Stan `data` block: 1852 | 1853 | \scriptsize 1854 | 1855 | ```{r, eval=FALSE, echo=TRUE} 1856 | data { 1857 | ... 1858 | // add population data definitions 1859 | int N; // number of rows in population (cps) 1860 | matrix[N, k] X0; // covariates in population 1861 | int group0[N]; // group index in population 1862 | } 1863 | ``` 1864 | 1865 | 1866 | ## The generated quantities block in Stan 1867 | 1868 | Tell Stan what you want to impute and how to create the imputations. 1869 | 1870 | \footnotesize 1871 | ```{r echo=TRUE, eval=FALSE} 1872 | generated quantities { 1873 | int yimp[N]; 1874 | { 1875 | vector[N] Xb0; 1876 | Xb0 = X0 * beta; 1877 | for (i in 1:N) 1878 | yimp[i] = bernoulli_logit_rng(Xb0[i] + sigma_alpha * alpha[ group0[i] ]); 1879 | } 1880 | } 1881 | ``` 1882 | 1883 | \normalsize 1884 | 1885 | Note the use of the `bernoulli_logit_rng` (random number generator) function to draw from the posterior predictive distribution. The `generated quantities` block cannot contain any distributions (indicated by `~`). 1886 | 1887 | 1888 | ## The complete Stan program 1889 | 1890 | \tiny 1891 | 1892 | ```{r, echo=TRUE, warning=FALSE, message=FALSE} 1893 | model_code <- "data { 1894 | int n; // number of respondents 1895 | int k; // number of covariates 1896 | matrix[n, k] X; // covariate matrix 1897 | int y[n]; // outcome (demvote) 1898 | int J; // number of groups (states) 1899 | int group[n]; // group index 1900 | int N; // population size 1901 | matrix[N, k] X0; // population covariates 1902 | int group0[N]; // group index in population 1903 | } 1904 | parameters { 1905 | vector[k] beta; // fixed effects 1906 | real sigma_alpha; // sd intercept 1907 | vector[J] alpha; // group intercepts 1908 | } 1909 | model { 1910 | vector[n] Xb; 1911 | beta ~ normal(0, 4); 1912 | sigma_alpha ~ normal(0.2, 1); 1913 | alpha ~ normal(0, 1); 1914 | Xb = X * beta; 1915 | for (i in 1:n) 1916 | Xb[i] += sigma_alpha * alpha[ group[i] ]; 1917 | y ~ bernoulli_logit(Xb); 1918 | } 1919 | generated quantities { 1920 | int yimp[N]; 1921 | { 1922 | vector[N] Xb0; 1923 | Xb0 = X0 * beta; 1924 | for (i in 1:N) 1925 | yimp[i] = bernoulli_logit_rng(Xb0[i] + sigma_alpha * alpha[ group0[i] ]); 1926 | } 1927 | }" 1928 | ``` 1929 | 1930 | ```{r include=FALSE} 1931 | sims <- stan(model_code = model_code, data = data, 1932 | seed = 1234) 1933 | ``` 1934 | 1935 | 1936 | 1937 | ## Extracting the simulations 1938 | 1939 | Stan has imputed 4000 values for each of the rows in `cps`. We sample 500 (much more than necessary, but it's still fast). 
1940 | 1941 | \footnotesize 1942 | ```{r extracting_sims, echo=TRUE} 1943 | yimp <- extract(sims, pars = "yimp")$yimp # matrix of draws x rows of cps 1944 | imputations <- yimp[sample(nrow(yimp), size = 500), ] 1945 | get_state_estimates <- function(imputations) { 1946 | state_by_clinton <- function(imputed_values) 100 * prop.table( 1947 | xtabs(weight ~ state + imputed_values, data = cps), 1)[,"1"] 1948 | state_estimates <- apply(imputations, 1, state_by_clinton) 1949 | apply(state_estimates, 1, mean) 1950 | } 1951 | estimates$mrp2 <- get_state_estimates(imputations) 1952 | RMSE["mrp2"] <- with(estimates, rmse(mrp2, actual)) 1953 | ``` 1954 | 1955 | \normalsize 1956 | 1957 | Now we can perform any analyses we wish on the imputed `cps` data and average the results over the 500 imputed datasets to get point estimates. 1958 | 1959 | 1960 | 1961 | ## The easy way with `rstanarm` 1962 | 1963 | * `rstanarm` is an R package that writes and executes Stan code for you. 1964 | * It uses the same notation as `lme4` for specifying multilevel models. 1965 | * For example, to estimate the same model as `mrp2`, use the following code: 1966 | 1967 | ```{r eval=FALSE, echo=TRUE} 1968 | library(rstanarm) 1969 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 1970 | region + qlogis(obama12) + (1 | state), data = pew, family = binomial) 1971 | ``` 1972 | 1973 | * The function `posterior_predict` in `rstanarm` substitutes for the usual `predict` function in R: 1974 | 1975 | ```{r eval=FALSE, echo=TRUE} 1976 | imputations <- posterior_predict(fit, draws = 500, 1977 | newdata = select(cps, age4, gender, race3, educ4, region, obama12, state)) 1978 | ``` 1979 | 1980 | (This creates a matrix `imputations` of dimension `draws` x `nrow(newdata)`.) 1981 | 1982 | * Extract the estimates using `get_state_estimates` (a sketch for adding interval estimates follows).
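## Interval estimates from the imputations (sketch)

`get_state_estimates` averages over the posterior predictive draws; the same `imputations` matrix can also provide interval estimates for the state shares. A sketch, assuming the `imputations` and `cps` objects above (`get_state_intervals` is a hypothetical helper, not part of the original code):

\scriptsize
```{r, echo=TRUE, eval=FALSE}
# Credible intervals for the state-level Clinton share across imputed draws
get_state_intervals <- function(imputations, probs = c(0.05, 0.5, 0.95)) {
  state_by_clinton <- function(imputed_values) 100 * prop.table(
    xtabs(weight ~ state + imputed_values, data = cps), 1)[, "1"]
  draws <- apply(imputations, 1, state_by_clinton)   # states x draws
  t(apply(draws, 1, quantile, probs = probs))        # one row per state
}
head(get_state_intervals(imputations))
```

\normalsize

These intervals reflect uncertainty in the model parameters and in the imputations, but not the sampling error of the CPS itself, which is discussed under standard errors below.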
1983 | 1984 | 1985 | ## The complete program in `rstanarm` 1986 | 1987 | \footnotesize 1988 | 1989 | ```{r, echo=TRUE, warning=FALSE, message=FALSE} 1990 | library(rstanarm) 1991 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 1992 | region + qlogis(obama12) + (1 | state), data = pew, family = binomial) 1993 | cpstmp <- cps %>% 1994 | select(age4, gender, race3, educ4, region, obama12, state) 1995 | imputations <- posterior_predict(fit, draws = 500, 1996 | newdata = select(cps, age4, gender, race3, educ4, region, obama12, state)) 1997 | estimates$mrp3 <- get_state_estimates(imputations) 1998 | RMSE["mrp3"] <- with(estimates, rmse(mrp3, actual)) 1999 | RMSE 2000 | ``` 2001 | 2002 | 2003 | ## Accuracy of state level estimates 2004 | 2005 | \centering 2006 | ```{r, echo=FALSE, fig.width=9, fig.height=4.5} 2007 | p1 <- ggplot(estimates, aes(actual, mrp3)) + 2008 | geom_abline(intercept = 0, slope = 1, col = "grey") + 2009 | geom_point(size = 1.5) + 2010 | lims(x = c(0, 100), y = c(0, 100)) + 2011 | labs(x = "Percentage of vote for Clinton", y = "Estimate") + 2012 | theme_minimal() 2013 | p2 <- ggplot(estimates, aes(mrp3 - actual)) + 2014 | geom_histogram(binwidth = 4, center = 0, fill = "gray") + 2015 | lims(x = c(-20, 20)) + 2016 | labs(x = "Error in estimate") + 2017 | theme_minimal() 2018 | grid.arrange(p1, p2, nrow = 1) 2019 | ``` 2020 | 2021 | 2022 | ## What the map now looks like 2023 | 2024 | ```{r} 2025 | estimates %>% 2026 | mutate(state_name = tolower(name), 2027 | clinton_pct = cut(mrp3, breaks = c(-Inf, 40, 45, 50, 55, 60, 100), 2028 | labels = c("<40", "40-45", "45-50", "50-55", "55-60", ">60"))) %>% 2029 | ggplot(aes(map_id = state_name)) + 2030 | geom_map(aes(fill = clinton_pct), map = us_map) + 2031 | expand_limits(x = us_map$long, y = us_map$lat) + 2032 | coord_map("albers", lat0 = 39, lat1 = 45) + 2033 | scale_fill_brewer(name = "Clinton %", type = "div", palette = "RdBu") + 2034 | theme(axis.line = element_blank()) + 2035 | theme_void() 2036 | ``` 2037 | 2038 | 2039 | ## How well did we do? 2040 | 2041 | \centering 2042 | ```{r echo=FALSE} 2043 | load("data/output_538.RData") 2044 | out538 <- out538 %>% 2045 | filter(!(state %in% c("ME1", "ME2", "NE1", "NE2", "NE3", "US"))) %>% 2046 | transmute(state = factor(as.character(state)), `538` = clinton_share) 2047 | estimates <- left_join(estimates, out538, by = "state") 2048 | RMSE["538"] <- with(estimates, rmse(`538`, actual)) 2049 | tmp <- RMSE 2050 | names(tmp) <- c("Complete pooling", "No pooling", "Partial pooling", 2051 | "MRP: gender", "MRP: demos + 2012 vote", "MRP3", "538") 2052 | tmp <- tmp[c(1:5, 7)]; tmp <- rev(sort(tmp)) 2053 | par(mar = c(3, 11, 4, 2)) 2054 | barplot(sort(tmp), horiz = TRUE, border = NA, las = 1, 2055 | xlim = c(0, 16), ylim = c(0, 6.6), xlab = "", ylab = "", 2056 | width = 1, space = 0.1, 2057 | main = "Root mean square error", 2058 | col = c("navy", "blue3", "purple3", "purple1", "red2", "red3")) 2059 | text(x = tmp, y = seq(6, 0.6, length.out = 6), pos = 4, 2060 | labels = paste0(round(tmp, 1), "%")) 2061 | ``` 2062 | 2063 | # 5. Advanced Topics 2064 | 2065 | ## Where to from here? 2066 | 2067 | 1. More complicated models 2068 | 2. More than two response categories in survey variable 2069 | 3. More than two survey variables 2070 | 4. Missing variables from the post-stratification 2071 | 5. Computation of standard errors 2072 | 6. 
2073 | 2074 | 2075 | ## More complicated models 2076 | 2077 | Selection bias is **ignorable** if the survey variables are independent of sample selection *conditional on the covariates*. We would like to be able to fit models with many covariates and flexible functional forms. 2078 | 2079 | The models considered so far have been very simple. Instead, we might: 2080 | 2081 | * Incorporate more **interactions** (such as race x region) 2082 | * Allow both **slopes** and **intercepts** to vary across states 2083 | * Add additional **levels** to the model (*e.g.*, counties within states within regions) 2084 | 2085 | This is not difficult with `rstanarm`, but as the next example demonstrates, there is no guarantee that more complicated models will perform better. 2086 | 2087 | 2088 | ## Allow race and gender effects to vary across states 2089 | 2090 | \scriptsize 2091 | ```{r, echo=TRUE, warning=FALSE, message=FALSE} 2092 | fit <- stan_glmer(demvote ~ 1 + age4 + gender + race3 + educ4 + 2093 | region + qlogis(obama12) + (1 + gender + race3 | state), 2094 | data = pew, family = binomial) 2095 | cpstmp <- cps %>% 2096 | select(age4, gender, race3, educ4, region, obama12, state) 2097 | imputations <- posterior_predict(fit, draws = 500, 2098 | newdata = cpstmp) 2099 | estimates$mrp4 <- get_state_estimates(imputations) 2100 | RMSE["mrp4"] <- with(estimates, rmse(mrp4, actual)) 2101 | RMSE 2102 | ``` 2103 | 2104 | 2105 | ## More than two response categories in the survey variable 2106 | 2107 | In real applications, there are almost always more than two response categories, *e.g.* vote for Clinton, Trump, Johnson, Stein, do not vote, undecided, *etc.* 2108 | 2109 | A simple solution is to estimate a set of binary response models (*e.g.*, Clinton *vs.* not Clinton, Trump *vs.* not Trump, *etc.*), ignoring correlations between the parameters of these models. 2110 | 2111 | A better solution is a **multinomial logit model**. When the response variable $y_i$ takes values $q = 0, 1, \dots, Q$, a flexible model (with $\beta_0 = 0$ fixing the reference category for identification) is 2112 | $$ 2113 | P(y_i = q) = \dfrac{ e^{\beta_q^T x_i} }{ \sum_{r=0}^Q e^{\beta_r^T x_i} } 2114 | $$ 2115 | 2116 | See Section 9.6 of the Stan Reference Manual for code to implement this model. The R `brms` package can also estimate this model, as in the sketch below.
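\footnotesize

A rough sketch of the `brms` approach (not part of the course code; it assumes the `pew` data contained a factor with the full set of response categories, here called `vote5`):

```{r eval=FALSE, echo=TRUE}
library(brms)
# Multinomial (categorical) logit with the same predictors and a varying
# intercept by state; each non-reference category gets its own coefficients.
fit_cat <- brm(vote5 ~ 1 + age4 + gender + race3 + educ4 +
    region + qlogis(obama12) + (1 | state),
  data = pew, family = categorical(link = "logit"))
# posterior_predict(fit_cat, newdata = ...) imputes a category for each CPS
# row, which can then be post-stratified category by category.
```

\normalsize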
2117 | 2118 | ## More than two survey variables 2119 | 2120 | Most surveys contain more than a single question, so there are more than two variables to model. For example, pre-election polls contain questions about likelihood of voting, preferred candidate, and policy preferences. 2121 | 2122 | Estimating separate MRP models for several variables can yield inconsistent results. Each estimate is equivalent to creating a separate set of weights for each variable. 2123 | 2124 | In principle, we can specify bivariate or multivariate response models, yielding a single, consistent set of estimates for the marginal and joint distributions. In practice, this can be quite challenging and is an active research area. 2125 | 2126 | 2127 | ## Missing variables in the post-stratification 2128 | 2129 | Population data are often **missing** for some predictors of survey outcomes. For example, voting models can benefit substantially from having individual-level data on past vote. 2130 | The assumption of **ignorable nonresponse** or selection is usually more plausible conditional upon a larger set of variables. 2131 | 2132 | A reasonable approach is to **impute** missing variables onto PUMS. For instance, the 2012 exit poll can be used to impute candidate preference for voters in the 2012 CPS Registration and Voting Supplement. A single imputation, followed by raking of the marginals to known vote totals, usually performs well. 2133 | 2134 | ## Computation of standard errors 2135 | 2136 | When post-stratifying to **known population counts**, the only source of error is the estimate of the means of the survey variable in the post-strata. This is reflected in the posterior standard deviation of the draws and in the simulations of generated quantities. 2137 | 2138 | When imputing onto a **public use microdata survey** (like CPS), there is an additional source of error (the sampling error in the PUMS survey). 2139 | 2140 | There is a simple formula for calculating the combined variance: 2141 | 2142 | * Create a set of imputed datasets. 2143 | * For each imputed dataset, compute the variance with the usual complete-data formula and average these. 2144 | * Also, compute the variance of the estimates across the imputed datasets. 2145 | * Add the two variances (the $(M+1)/M$ factor below adjusts for the finite number of imputations). 2146 | 2147 | With $M$ imputed estimates $\hat\theta_1,\dots,\hat\theta_M$ and $\bar\theta = \frac{1}{M}\sum_{i=1}^M \hat\theta_i$: 2148 | $$ 2149 | \begin{aligned} 2150 | \text{Var}(\bar\theta) &= \dfrac{1}{M} \sum_{i=1}^M \widehat{\text{s.e.}}^2(\hat\theta_i) 2151 | + \dfrac{M+1}{M} \dfrac{1}{M-1} \sum_{i=1}^M (\hat\theta_i - \bar\theta)^2 \\[10pt] 2152 | &\approx \text{average estimated variance} + \text{variance of estimates} 2153 | \end{aligned} 2154 | $$
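\footnotesize

A minimal R sketch of this combining rule (illustrative only; `est` and `se` are assumed to hold the $M$ per-imputation estimates and their standard errors):

```{r eval=FALSE, echo=TRUE}
# Within-imputation variance plus (1 + 1/M) times the
# between-imputation variance, as in the formula above.
combine_imputations <- function(est, se) {
  M <- length(est)
  total_var <- mean(se^2) + (1 + 1 / M) * var(est)
  c(estimate = mean(est), se = sqrt(total_var))
}
```

\normalsize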
2155 | 2156 | 2157 | ## References: Gelman papers 2158 | 2159 | Gelman and Little, "Poststratification into many categories using hierarchical logistic regression," *Survey Methodology* (1998), pp. 127-135. 2160 | 2161 | Park, Gelman and Bafumi, "Bayesian multilevel analysis with poststratification: state-level estimates from national polls," *Political Analysis* (2004), pp. 375-385. 2162 | 2163 | Gelman, "Analysis of variance: Why it is more important than ever" (with discussion), *Annals of Statistics* (2005), pp. 1-53. 2164 | 2165 | Gelman, "Struggles with survey weighting" (with discussion), *Statistical Science* (2007), pp. 153-164. 2166 | 2167 | Ghitza and Gelman, "Deep interactions with MRP: Election turnout and voting patterns among small electoral subgroups," *American Journal of Political Science* (2013), pp. 762-776. 2168 | 2169 | Gelman, Goel, Rivers and Rothschild, "The Mythical Swing Voter," *Quarterly Journal of Political Science* (2016), pp. 103-130. 2170 | 2171 | 2172 | ## Further reading 2173 | 2174 | \centering 2175 | 2176 | ![](images/further_reading.png) 2177 | 2178 | 2179 | ## What we have covered 2180 | 2181 | 1. Fit a multilevel regression model to survey data, using both individual- and group-level covariates. The purpose of this model is **prediction**, not explanation. 2182 | 2183 | 2. Avoid **over-fitting** by using hierarchical priors or random effects. 2184 | 2185 | 3. The Stein paradox is real and important. When making estimates for many small areas, **shrinkage** or **partial pooling** can substantially improve over "direct" estimates (just using the data from each area separately). 2186 | 2187 | 4. The model is used to **impute** the survey variable for non-sampled units, using either a public use microdata survey or census counts. This is the **post-stratification** component. 2188 | 2189 | 5. Model failure is a real risk, but many successful applications and validations suggest that the risk is often exaggerated. The estimates are fairly robust to failures of the upper-level model as long as the data are not too sparse. 2190 | 2191 | \centering 2192 | \LARGE 2193 | 2194 | Good luck! 2195 | 2196 | 2197 | ## Course evaluations 2198 | 2199 | Please fill out the online course evaluation at 2200 | 2201 | \medskip 2202 | \centering 2203 | 2204 | 2205 | 2206 | \bigskip 2207 | \large 2208 | \centering 2209 | 2210 | Thanks! --------------------------------------------------------------------------------