├── LICENSE
├── README.md
├── preprocess
│   └── ACS_PS_annenberg_fiveyear.R
├── realdata_annenberg
│   ├── baselinemeanzeroN01_annenberg_v3_centered.stan
│   ├── fitmodel_annenberg_v3.R
│   ├── poststrat_annenberg_v3.R
│   ├── proposedN01_annenberg_v3_centered.stan
│   └── proposedarN01_annenberg_v3_centered.stan
├── simulation
│   ├── baselinemeanzeroN01_v3.stan
│   ├── biasfacet_combinedplots.R
│   ├── poststrat_pipeline_testing_age3_v3_posteriorvariance.R
│   ├── proportion_agecat.R
│   ├── proposedN01_v3.stan
│   ├── proposedarN01_v3.stan
│   └── threemodelwriteup_v3_posteriorvariance.R
└── simulation_spatialmrp
    ├── icar_mrp_simulation_bym2.R
    ├── icar_mrp_viz.R
    ├── ma_adj_matrix.rds
    ├── ma_adj_matrix_sparse.rds
    ├── poverty_poststrat.rds
    ├── prob_sampling_old.rds
    ├── proportion_puma.R
    ├── smooth_x.rds
    └── us_pumas.rds

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Yuxiang Gao

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Code for reproducing results in "Improving multilevel regression and poststratification with structured priors"

## Authors
- Yuxiang Gao
- Lauren Kennedy
- Daniel Simpson
- Andrew Gelman

Accepted for publication in _Bayesian Analysis_. arXiv preprint: https://arxiv.org/abs/1908.06716

## Steps to run the simulation pipeline in the simulation directory (directed structured priors example)
1. The two .dta files used in poststrat_pipeline_testing_age3_v3_posteriorvariance.R come from the paper "Estimating State Public Opinion with Multi-level Regression and Poststratification using R" (https://scholar.princeton.edu/jkastellec/publications). Download these two files before running the simulation pipeline and put them in the simulation directory.

2. Choose the age preference curve with coef_age in poststrat_pipeline_testing_age3_v3_posteriorvariance.R.

3. Choose the sample size, number of simulation runs, probability indices and number of age categories with sample_size, runs, r and age_grouping_multiplier in poststrat_pipeline_testing_age3_v3_posteriorvariance.R.

4. Run poststrat_pipeline_testing_age3_v3_posteriorvariance.R.

5. Run threemodelwriteup_v3_posteriorvariance.R with the same configuration. This produces the bias plots shown in the paper for that configuration.

6. Run proportion_agecat.R to visualize the share of simulations in which the RW and AR priors outperformed the IID priors.

## Steps to run the simulation pipeline in the simulation_spatialmrp directory (undirected structured priors example)

1. Choose sample_size in icar_mrp_simulation_bym2.R and then run the script. This conducts the spatial MRP simulations, based on binomial regression.

2. Run icar_mrp_viz.R with the same configuration to visualize the spatial MRP simulations.

3. Run proportion_puma.R to visualize the share of simulations in which the BYM2 priors outperformed the IID priors.

## Steps to run the data analysis on the 2008 Annenberg phone survey in the realdata_annenberg directory
1. Request access to the 2008 National Annenberg Election Survey (NAES) telephone survey from the Annenberg Public Policy Center and put the phone survey text file into the realdata_annenberg directory. Rename it to annenbergphone2008.txt.

2. acs_ps.rds is the 2006-2010 5-year American Community Survey (ACS) in a cleaned-up format. Create a new path ~/Desktop/annenbergdata/fiveyearacs/, then download the 5-year ACS and unzip it in this path. Finally, run ACS_PS_annenberg_fiveyear.R in the preprocess directory to produce acs_ps.rds in ~/Desktop/annenbergdata/fiveyearacs/.

3. Make sure state_level_update.dta, gay_marriage_megapoll.dta, acs_ps.rds and annenbergphone2008.txt are in the realdata_annenberg directory. The first two files are from Step 1 of the simulation pipeline.

4. Run fitmodel_annenberg_v3.R for age_grouping_multiplier = 12, 48, 72 and for all three models in the variables m_file and m_name.

5. Run poststrat_annenberg_v3.R.

## Notes:

1. To download the 2006-2010 5-year ACS, download both the United States Population Records and the United States Housing Unit Records in the data set "2006-2010 ACS 5-year Public Use Microdata Samples (PUMS) - CSV format", which can be found at https://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t#

2. The US heatmap and proportion tables in the Annenberg data analysis are derived from annenbergphone2008.txt and acs_ps.rds.

The covariate N in acs_ps.rds, which corresponds to a person's weight, is summed when calculating heatmap and proportion values. As an example, to get the proportions for the education covariate in the 2006-2010 ACS, we use:

```R
acs_ps %>% group_by(education) %>%
  summarise(percentage = sum(N)/sum(acs_ps$N)) %>%
  dplyr::mutate(percentage = round(100 * percentage, 2)) %>% data.frame()
```

3. For a working simulation example that doesn't require downloading data files, see my StanCon 2019 repo.

4. Noncentered parameterizations were used for the simulation studies, whereas centered parameterizations were used for the real data analysis. NUTS sampling in Stan struggled with the noncentered parameterizations for the real data analysis, but centering fixed this.

5. In the simulation_spatialmrp directory: poverty_poststrat.rds contains the poststratification matrix derived from the 2013-2017 5-year ACS, ma_adj_matrix_sparse.rds contains the adjacency matrix of the 52 PUMAs in the state of MA, and smooth_x.rds contains a sample from a Gaussian Markov random field over the space of the 52 PUMAs. ma_adj_matrix_sparse.rds is derived from the tigris package.
--------------------------------------------------------------------------------
/preprocess/ACS_PS_annenberg_fiveyear.R:
--------------------------------------------------------------------------------
rm(list=ls())

# for 5-year ACS 2006-2010
# load the ACS and make the poststratification (PS) table.
library(dplyr)
library(readr)
library(data.table)

# data dictionary below
# https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2006-2010.pdf?#

# load person-level data, keeping adults only (AGEP > 17)
acs1 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_pus/ss10pusa.csv", sep = ",", header = TRUE)
acs1 = acs1[acs1$AGEP > 17,]
acs1_tr = acs1[,c("serialno",
                  "AGEP",  # age
                  "SEX",   # sex
                  "PWGTP", # person's weight
                  "HISP",  # hispanic ethnicity
                  "RAC1P", # race
                  "ST",    # state
                  "SCHL",  # educational attainment
                  "RT"     # record type
)]
rm(acs1)
gc() # free up space


acs2 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_pus/ss10pusb.csv", sep = ",", header = TRUE)
acs2 = acs2[acs2$AGEP > 17,]
acs2_tr = acs2[,c("serialno",
                  "AGEP",  # age
                  "SEX",   # sex
                  "PWGTP", # person's weight
                  "HISP",  # hispanic ethnicity
                  "RAC1P", # race
                  "ST",    # state
                  "SCHL",  # educational attainment
                  "RT"     # record type
)]
rm(acs2)
gc() # free up space


acs3 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_pus/ss10pusc.csv", sep = ",", header = TRUE)
acs3 = acs3[acs3$AGEP > 17,]
acs3_tr = acs3[,c("serialno",
                  "AGEP",  # age
                  "SEX",   # sex
                  "PWGTP", # person's weight
                  "HISP",  # hispanic ethnicity
                  "RAC1P", # race
                  "ST",    # state
                  "SCHL",  # educational attainment
                  "RT"     # record type
)]
rm(acs3)
gc()


acs4 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_pus/ss10pusd.csv", sep = ",", header = TRUE)
acs4 = acs4[acs4$AGEP > 17,]
acs4_tr = acs4[,c("serialno",
                  "AGEP",  # age
                  "SEX",   # sex
                  "PWGTP", # person's weight
                  "HISP",  # hispanic ethnicity
                  "RAC1P", # race
                  "ST",    # state
                  "SCHL",  # educational attainment
                  "RT"     # record type
)]
rm(acs4)
gc()

acs_tr = rbind(acs1_tr, acs2_tr, acs3_tr, acs4_tr)
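# The four block-wise fread() / filter / column-select steps above could be
# collapsed into a single loop. A sketch (commented out; assumes the same file
# paths and columns as above, and that data.table is loaded):
#
# files = sprintf("~/Desktop/annenbergdata/fiveyearacs/csv_pus/ss10pus%s.csv", letters[1:4])
# keep  = c("serialno","AGEP","SEX","PWGTP","HISP","RAC1P","ST","SCHL","RT")
# acs_tr = rbindlist(lapply(files, function(f) fread(f, select = keep)[AGEP > 17]))
#
# Reading only the needed columns with fread(select = ...) also keeps peak
# memory lower than loading each full file and subsetting afterwards.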


acsh1 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_hus/ss10husa.csv", sep = ",", header = TRUE)
acsh1_tr = acsh1[,c("serialno", "HINCP")]
rm(acsh1)
gc()


acsh2 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_hus/ss10husb.csv", sep = ",", header = TRUE)
acsh2_tr = acsh2[,c("serialno","HINCP")]
rm(acsh2)
gc()


acsh3 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_hus/ss10husc.csv", sep = ",", header = TRUE)
acsh3_tr = acsh3[,c("serialno","HINCP")]
rm(acsh3)
gc()


acsh4 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_hus/ss10husd.csv", sep = ",", header = TRUE)
acsh4_tr = acsh4[,c("serialno","HINCP")]
rm(acsh4)
gc()



#sum(names(acsh1)==names(acsh2))
#sum(names(acsh2)==names(acsh3))
#sum(names(acsh3)==names(acsh4))


acsh_tr = rbind(acsh1_tr, acsh2_tr, acsh3_tr, acsh4_tr)

# check that there are no duplicate rows in acsh_tr
print(length(unique(acsh_tr$serialno)))
print(dim(acsh_tr))

acs_tr = dplyr::left_join(acs_tr, acsh_tr, by = c("serialno"="serialno"))

rm(acs1_tr, acs2_tr, acs3_tr, acs4_tr)
rm(acsh1_tr, acsh2_tr, acsh3_tr, acsh4_tr)

acs_tr = acs_tr[complete.cases(acs_tr),] # remove NAs

# PREPROCESS RACE

# Recoded detailed race code (RAC1P):
# 1 .White alone
# 2 .Black or African American alone
# 3 .American Indian alone
# 4 .Alaska Native alone
# 5 .American Indian and Alaska Native tribes specified; or American
#   .Indian or Alaska native, not specified and no other races
# 6 .Asian alone
# 7 .Native Hawaiian and Other Pacific Islander alone
# 8 .Some other race alone
# 9 .Two or more major race groups

acs_tr$race_x = ifelse(acs_tr$RAC1P==1 & acs_tr$HISP==1, "white",
                ifelse(acs_tr$RAC1P==2 & acs_tr$HISP==1, "black",
                ifelse(acs_tr$RAC1P==6 &
                       acs_tr$HISP==1, "asian",
                ifelse(acs_tr$RAC1P==3 & acs_tr$HISP==1, "americanindian",
                ifelse(!(acs_tr$RAC1P %in% c(1,2,6,3)) & acs_tr$HISP %in% 2:24, "hisp",
                       "other")))))


# PREPROCESS EDUCATION

# Educational attainment (SCHL):
# bb .N/A (less than 3 years old)
# 01 .No schooling completed
# 02 .Nursery school to grade 4
# 03 .Grade 5 or grade 6
# 04 .Grade 7 or grade 8
# 05 .Grade 9
# 06 .Grade 10
# 07 .Grade 11
# 08 .12th grade, no diploma
# 09 .High school graduate
# 10 .Some college, but less than 1 year
# 11 .One or more years of college, no degree
# 12 .Associate's degree
# 13 .Bachelor's degree
# 14 .Master's degree
# 15 .Professional school degree
# 16 .Doctorate degree

acs_tr$education_x = ifelse(acs_tr$SCHL %in% 1:8, "nohighschool",
                     ifelse(acs_tr$SCHL==9, "highschool",
                     ifelse(acs_tr$SCHL %in% c(10,11), "somecollege",
                     ifelse(acs_tr$SCHL==12, "twoyeardegree",
                     ifelse(acs_tr$SCHL==13, "fouryeardegree",
                            "fouryeardegreeplus")))))

# PREPROCESS STATE

# state codes (ST):
# 01 .Alabama/AL
# 02 .Alaska/AK
# 04 .Arizona/AZ
# 05 .Arkansas/AR
# 06 .California/CA
# 08 .Colorado/CO
# 09 .Connecticut/CT
# 10 .Delaware/DE
# 11 .District of Columbia/DC
# 12 .Florida/FL
# 13 .Georgia/GA
# 15 .Hawaii/HI
# 16 .Idaho/ID
# 17 .Illinois/IL
# 18 .Indiana/IN
# 19 .Iowa/IA
# 20 .Kansas/KS
# 21 .Kentucky/KY
# 22 .Louisiana/LA
# 23 .Maine/ME
# 24 .Maryland/MD
# 25 .Massachusetts/MA
# 26 .Michigan/MI
# 27 .Minnesota/MN
# 28 .Mississippi/MS
# 29 .Missouri/MO
# 30 .Montana/MT
# 31 .Nebraska/NE
# 32 .Nevada/NV
# 33 .New Hampshire/NH
# 34 .New Jersey/NJ
# 35 .New Mexico/NM
# 36 .New York/NY
# 37 .North Carolina/NC
# 38 .North Dakota/ND
# 39 .Ohio/OH
# 40 .Oklahoma/OK
# 41 .Oregon/OR
# 42 .Pennsylvania/PA
# 44 .Rhode Island/RI
# 45 .South Carolina/SC
# 46 .South Dakota/SD
# 47 .Tennessee/TN
# 48 .Texas/TX
# 49 .Utah/UT
# 50 .Vermont/VT
# 51 .Virginia/VA
# 53 .Washington/WA
# 54 .West Virginia/WV
# 55 .Wisconsin/WI
# 56 .Wyoming/WY
# 72 .Puerto Rico/PR

# DOUBLE CHECK BELOW
acs_tr$state_x = case_when(acs_tr$ST == 1 ~ "alabama",
                           acs_tr$ST == 2 ~ "alaska",
                           acs_tr$ST == 4 ~ "arizona",
                           acs_tr$ST == 5 ~ "arkansas",
                           acs_tr$ST == 6 ~ "california",
                           acs_tr$ST == 8 ~ "colorado",
                           acs_tr$ST == 9 ~ "connecticut",
                           acs_tr$ST == 10 ~ "delaware",
                           acs_tr$ST == 11 ~ "columbia",
                           acs_tr$ST == 12 ~ "florida",
                           acs_tr$ST == 13 ~ "georgia",
                           acs_tr$ST == 15 ~ "hawaii",
                           acs_tr$ST == 16 ~ "idaho",
                           acs_tr$ST == 17 ~ "illinois",
                           acs_tr$ST == 18 ~ "indiana",
                           acs_tr$ST == 19 ~ "iowa",
                           acs_tr$ST == 20 ~ "kansas",
                           acs_tr$ST == 21 ~ "kentucky",
                           acs_tr$ST == 22 ~ "louisiana",
                           acs_tr$ST == 23 ~ "maine",
                           acs_tr$ST == 24 ~ "maryland",
                           acs_tr$ST == 25 ~ "massachusetts",
                           acs_tr$ST == 26 ~ "michigan",
                           acs_tr$ST == 27 ~ "minnesota",
                           acs_tr$ST == 28 ~ "mississippi",
                           acs_tr$ST == 29 ~ "missouri",
                           acs_tr$ST == 30 ~ "montana",
                           acs_tr$ST == 31 ~ "nebraska",
                           acs_tr$ST == 32 ~ "nevada",
                           acs_tr$ST == 33 ~ "newhampshire",
                           acs_tr$ST == 34 ~ "newjersey",
                           acs_tr$ST == 35 ~ "newmexico",
                           acs_tr$ST == 36 ~ "newyork",
                           acs_tr$ST == 37 ~ "northcarolina",
                           acs_tr$ST == 38 ~ "northdakota",
                           acs_tr$ST == 39 ~ "ohio",
                           acs_tr$ST == 40 ~ "oklahoma",
                           acs_tr$ST == 41 ~ "oregon",
                           acs_tr$ST == 42 ~ "pennsylvania",
                           acs_tr$ST == 44 ~
                           "rhodeisland",
                           acs_tr$ST == 45 ~ "southcarolina",
                           acs_tr$ST == 46 ~ "southdakota",
                           acs_tr$ST == 47 ~ "tennessee",
                           acs_tr$ST == 48 ~ "texas",
                           acs_tr$ST == 49 ~ "utah",
                           acs_tr$ST == 50 ~ "vermont",
                           acs_tr$ST == 51 ~ "virginia",
                           acs_tr$ST == 53 ~ "washington",
                           acs_tr$ST == 54 ~ "westvirginia",
                           acs_tr$ST == 55 ~ "wisconsin",
                           acs_tr$ST == 56 ~ "wyoming",
                           TRUE ~ as.character(acs_tr$ST))


# PREPROCESS INCOME
# match annenberg coding
acs_tr$income_x = case_when(acs_tr$HINCP < 10000 ~ 1,
                            acs_tr$HINCP >= 10000 & acs_tr$HINCP < 15000 ~ 2,
                            acs_tr$HINCP >= 15000 & acs_tr$HINCP < 25000 ~ 3,
                            acs_tr$HINCP >= 25000 & acs_tr$HINCP < 35000 ~ 4,
                            acs_tr$HINCP >= 35000 & acs_tr$HINCP < 50000 ~ 5,
                            acs_tr$HINCP >= 50000 & acs_tr$HINCP < 75000 ~ 6,
                            acs_tr$HINCP >= 75000 & acs_tr$HINCP < 100000 ~ 7,
                            acs_tr$HINCP >= 100000 & acs_tr$HINCP < 150000 ~ 8,
                            acs_tr$HINCP >= 150000 ~ 9)

# PREPROCESS SEX
# 1 .Male
# 2 .Female

# PREPROCESS AGE
# no need to preprocess age

# construct maximal poststratification matrix --------------------------

# 1. do dplyr grouping
acs_ps = acs_tr %>%
  group_by(AGEP, SEX, race_x, education_x, state_x, income_x) %>%
  summarize(N = sum(PWGTP)) %>% # N is the sum of person weights within each cell
  ungroup()

# check that you don't have all subpopulation cells
# prod(apply(acs_ps, 2, function(x) {length(unique(x))})[-7])

# 2. create empty maximal poststrat matrix
acs_ps_maximal_x = expand.grid(age = unique(acs_tr$AGEP),
                               sex = unique(acs_tr$SEX),
                               race = unique(acs_tr$race_x),
                               education = unique(acs_tr$education_x),
                               state = unique(acs_tr$state_x),
                               income = unique(acs_tr$income_x),
                               stringsAsFactors = FALSE
)

# 3. grab the groupings from acs_ps
acs_ps_maximal = dplyr::left_join(acs_ps_maximal_x, acs_ps, by = c("age"="AGEP",
                                                                   "sex"="SEX",
                                                                   "race"="race_x",
                                                                   "education"="education_x",
                                                                   "state"="state_x",
                                                                   "income"="income_x"))


# 4. Save poststratification matrix
saveRDS(data.frame(acs_ps), file='~/Desktop/annenbergdata/fiveyearacs/acs_ps.rds')

rm(list=ls())

--------------------------------------------------------------------------------
/realdata_annenberg/baselinemeanzeroN01_annenberg_v3_centered.stan:
--------------------------------------------------------------------------------
data {
  int<lower=0> N;          // number of samples
  int N_groups_age;        // the number of groups for age
  int N_groups_race;       // the number of groups for race
  int N_groups_sex;        // the number of groups for sex

  int N_groups_education;  // the number of groups for education
  int N_groups_income;     // the number of groups for income
  int N_groups_state;      // the number of groups for state

  int age[N];              // age group index for each sample
  int race[N];             // race group index for each sample
  int sex[N];              // sex indicator for each sample

  int education[N];        // education group index for each sample
  int income[N];           // income group index for each sample
  int state[N];            // state index for each sample

  int region_index[51];    // the region that each state is in, according to jkastell
  real state_vs[51];       // the 2004 Republican vote share, from jkastell
  real relig[51];          // the 2004 conservative religion percentage

  int y[N];                // the response vector
}
parameters {
  vector[N_groups_age] U_age;    // the random effect for age, not multiplied by sigma_age
  vector[N_groups_race] U_race;  // the random effect for race, not multiplied by sigma_race
  vector[N_groups_income] U_income;
  vector[N_groups_state] U_state;  // centered parameterization for U_state, so U_state_transformed = U_state
  vector[N_groups_education] U_education;

  vector[5] U_region;              // the nested random effect for region

  real<lower=0> sigma_age;         // sd of U_age (hyperparameter)
  real<lower=0> sigma_race;        // sd of U_race (hyperparameter)
  real<lower=0> sigma_income;
  real<lower=0> sigma_state;
  real<lower=0> sigma_education;

  real<lower=0> sigma_region;
  real beta_state;                 // coefficient for Republican vote share in every state
  real beta_relig;                 // coefficient for conservative religion share in every state

  real intercept;                  // the intercept (global fixed effect)
  real beta_sex;

}
transformed parameters {
  vector[N_groups_age] U_age_transformed;
  vector[N_groups_race] U_race_transformed;
  vector[N_groups_income] U_income_transformed;
  vector[N_groups_state] U_state_transformed;
  vector[N_groups_education] U_education_transformed;

  vector[5] U_region_transformed;

  vector[N] yhat;

  U_age_transformed = sigma_age * U_age;     // the random effect for age
  U_race_transformed = sigma_race * U_race;  // the random effect for race
  U_income_transformed = sigma_income * U_income;
  //U_state_transformed = sigma_state * U_state;
  U_education_transformed = sigma_education * U_education;

  U_region_transformed = sigma_region * U_region;

  U_state_transformed = U_state;

  // noncentered parameterization for U_state_transformed
  // for (j in 1:51) { // j iterates through the state index defined through levels_state in the fitmodel_annenberg R file
  //   U_state_transformed[j] = (U_state[j] * sigma_state) * ( U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]) );
  // }

  for (i in 1:N) {
    yhat[i] = intercept + U_age_transformed[age[i]] + U_race_transformed[race[i]] +
              U_income_transformed[income[i]] + U_state_transformed[state[i]] +
              U_education_transformed[education[i]] + beta_sex * sex[i]; // the linear predictor at each point
  }

}
model {
  sigma_age ~ normal(0,1);   // hyperprior on the sd of the age effects
  sigma_race ~ normal(0,1);  // hyperprior on the sd of the race effects
  sigma_income ~ normal(0,1);
  sigma_state ~ normal(0,1);
  sigma_education ~ normal(0,1);

  sigma_region ~ normal(0,1);

  U_age ~ normal(0, 1);
  U_race ~ normal(0, 1); // random effect for race is normal
  U_income ~ normal(0, 1);
  //U_state ~ normal(0, 1);
  U_education ~ normal(0, 1);

  U_region ~ normal(0,1);
  beta_state ~ normal(0,1);
  beta_relig ~ normal(0,1);

  for (j in 1:51) {
    U_state[j] ~ normal(U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]), sigma_state);
  }

  intercept ~ normal(0, 1);
  beta_sex ~ normal(0, 1);

  for (i in 1:N) {
    y[i] ~ bernoulli(inv_logit(yhat[i])); // the response
  }

}

--------------------------------------------------------------------------------
/realdata_annenberg/fitmodel_annenberg_v3.R:
--------------------------------------------------------------------------------
rm(list=ls())

library(rstan)
library(dplyr)

options(mc.cores = parallel::detectCores()) # use multiple cores; the default is 4
rstan_options(auto_write = TRUE) # save the compiled stan model to disk so there's no need to recompile

# ---------------------------------------------------
age_grouping_multiplier = 12 # the number of strata for age

iterations = 2000
num_chains = 4
mtreedepth = 15
ad = 0.99

use_state = TRUE # if this is true we use state as response

# m_file = "baselinemeanzeroN01_annenberg_v3_centered.stan"
# m_name = "baseline_model_annenberg_centered"

m_file = "proposedarN01_annenberg_v3_centered.stan"
m_name = "ar_model_annenberg_centered"

# m_file = "proposedN01_annenberg_v3_centered.stan"
# m_name = "rw_model_annenberg_centered"

covariates_list = c("CEc01_c",  # FAVOR SAME-SEX MARRIAGE - THE RESPONSE
                    "WA01_c",   # SEX
                    "WA02_c",   # AGE
                    "WA03_c",   # EDUCATION
                    "WC03_c",   # RACE
                    "WA04_c",   # INCOME QUESTION 1
                    "WA05_c",   # INCOME QUESTION 2
                    "WB03_c",   # LINE OF WORK
                    "WFc01_c",  # STATE OF RESIDENCY
                    "WC06_c",   # YEARS LIVED IN THE US
                    "WFa01_c",  # ADULTS IN HOUSEHOLD
                    "WC01_c",   # Are you of Hispanic or Latino origin or descent?
                    "WHb01_c")  # SEXUAL ORIENTATION


m = stan_model(file = m_file, model_name = m_name)

acs_ps = readRDS("acs_ps.rds") # load poststratification matrix
# "AGEP" "SEX" "race_x" "education_x" "state_x" "income_x" "N"
colnames(acs_ps) = c("age", "sex", "race", "education", "state", "income", "N")
# ---------------------------------------------------

survey_dat = readr::read_tsv("annenbergphone2008.txt", col_names = TRUE)

# Preprocess -------------------------------------------------------------------------

# Load state data from jkastell -----------------------------------------------------
# Load state-level data from http://www.princeton.edu/~jkastell/mrp_primer.html
StateLevel = foreign::read.dta("state_level_update.dta", convert.underscore = TRUE)
StateLevel = StateLevel[order(StateLevel$sstate.initnum),]

# Load poll data from http://www.princeton.edu/~jkastell/mrp_primer.html
MarriageDataSP = foreign::read.dta("gay_marriage_megapoll.dta", convert.underscore = TRUE)

# check that the row names of StateLevel are equal to sstate.initnum
sum(as.numeric(as.character(rownames(StateLevel))) == StateLevel$sstate.initnum)

stateregions = unique(MarriageDataSP[,c("statename", "state.initnum", "region.cat", "region")])

stateregions_final = dplyr::left_join(x = StateLevel,
                                      y = stateregions,
                                      by = c("sstate.initnum"="state.initnum"))[,c("sstate.initnum",
                                                                                   "sstate",
                                                                                   "sstatename",
                                                                                   "region",
                                                                                   "region.cat",
                                                                                   "kerry.04",
                                                                                   "p.evang",
                                                                                   "p.mormon")]

# Alaska and Hawaii don't have a region, so we set them to west
stateregions_final[stateregions_final$sstatename == "Alaska", c("region")] = "west"
stateregions_final[stateregions_final$sstatename == "Hawaii", c("region")] = "west"
stateregions_final[stateregions_final$sstatename == "Alaska", c("region.cat")] = 4
stateregions_final[stateregions_final$sstatename == "Hawaii", c("region.cat")] = 4

stateregions_final$lowercasename = tolower(stringr::str_replace_all(stateregions_final$sstatename, " ", ""))

stateregions_final$lowercasename[stateregions_final$lowercasename=="d.c."] = "columbia"

# ------------------------------------------------------------------------------------

survey_dat_selected = survey_dat %>%
  select(covariates_list) %>%
  filter(WA02_c < 100,        # age less than 100
         CEc01_c %in% c(1,3)  # 1 is vote yes for gay marriage, 3 is no
  ) #%>%
  # mutate(gayfavor = as.numeric(CEc01_c),
  #        sex = as.numeric(WA01_c),
  #        age = as.numeric(WA02_c),
  #        education = as.numeric(WA03_c),
  #        race = as.numeric(WC03_c),
  #        income = as.numeric(WA04_c),
  #        state = as.numeric(WFc01_c),
  #        yearsinus = as.numeric(WC06_c),
  #        adultsinhouse = as.numeric(WFa01_c),
  #        orientation = as.numeric(WHb01_c)
  # ) # create new variable names

survey_dat_selected$CEc01_c[survey_dat_selected$CEc01_c==3] = 0 # recode 3 to 0: 0 is no to gay marriage, 1 is yes
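# The nested ifelse() recodings below could equivalently be written with
# dplyr::case_when(), which reads top-to-bottom. A sketch (commented out) for
# the race recoding, using the same WC03_c/WC01_c codes as the code below:
#
# survey_dat_selected$race_x = dplyr::case_when(
#   survey_dat_selected$WC03_c == 1 & survey_dat_selected$WC01_c == 2 ~ "white",
#   survey_dat_selected$WC03_c == 2 & survey_dat_selected$WC01_c == 2 ~ "black",
#   survey_dat_selected$WC03_c == 3 & survey_dat_selected$WC01_c == 2 ~ "asian",
#   survey_dat_selected$WC03_c == 4 & survey_dat_selected$WC01_c == 2 ~ "americanindian",
#   survey_dat_selected$WC03_c == 5 & survey_dat_selected$WC01_c == 1 ~ "hisp",
#   TRUE ~ "other")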

# Preprocess race
survey_dat_selected = survey_dat_selected %>% filter(WC03_c < 10) # remove 998 and 999 from race

survey_dat_selected$race_x = ifelse(survey_dat_selected$WC03_c==1 & survey_dat_selected$WC01_c==2, "white",
                             ifelse(survey_dat_selected$WC03_c==2 & survey_dat_selected$WC01_c==2, "black",
                             ifelse(survey_dat_selected$WC03_c==3 & survey_dat_selected$WC01_c==2, "asian",
                             ifelse(survey_dat_selected$WC03_c==4 & survey_dat_selected$WC01_c==2, "americanindian",
                             ifelse(survey_dat_selected$WC03_c==5 & survey_dat_selected$WC01_c==1, "hisp",
                                    "other")))))

# Preprocess education
survey_dat_selected = survey_dat_selected %>% filter(WA03_c < 100) # remove 998 and 999

survey_dat_selected$education_x = ifelse(survey_dat_selected$WA03_c %in% c(1,2), "nohighschool",
                                  ifelse(survey_dat_selected$WA03_c %in% c(3,4), "highschool",
                                  ifelse(survey_dat_selected$WA03_c==5, "somecollege",
                                  ifelse(survey_dat_selected$WA03_c==6, "twoyeardegree",
                                  ifelse(survey_dat_selected$WA03_c==7, "fouryeardegree",
                                         "fouryeardegreeplus")))))


# Preprocess state
survey_dat_selected$state_x = case_when(survey_dat_selected$WFc01_c == 1 ~ "alabama",
                                        survey_dat_selected$WFc01_c == 4 ~ "arizona",
                                        survey_dat_selected$WFc01_c == 5 ~ "arkansas",
                                        survey_dat_selected$WFc01_c == 6 ~ "california",
                                        survey_dat_selected$WFc01_c == 8 ~ "colorado",
                                        survey_dat_selected$WFc01_c == 9 ~ "connecticut",
                                        survey_dat_selected$WFc01_c == 10 ~ "delaware",
                                        survey_dat_selected$WFc01_c == 11 ~ "columbia",
                                        survey_dat_selected$WFc01_c == 12 ~ "florida",
                                        survey_dat_selected$WFc01_c == 13 ~ "georgia",
                                        survey_dat_selected$WFc01_c == 16 ~ "idaho",
                                        survey_dat_selected$WFc01_c == 17 ~ "illinois",
                                        survey_dat_selected$WFc01_c == 18 ~ "indiana",
                                        survey_dat_selected$WFc01_c == 19 ~
"iowa", 152 | survey_dat_selected$WFc01_c == 20 ~ "kansas", 153 | survey_dat_selected$WFc01_c == 21 ~ "kentucky", 154 | survey_dat_selected$WFc01_c == 22 ~ "louisiana", 155 | survey_dat_selected$WFc01_c == 23 ~ "maine", 156 | survey_dat_selected$WFc01_c == 24 ~ "maryland", 157 | survey_dat_selected$WFc01_c == 25 ~ "massachusetts", 158 | survey_dat_selected$WFc01_c == 26 ~ "michigan", 159 | survey_dat_selected$WFc01_c == 27 ~ "minnesota", 160 | survey_dat_selected$WFc01_c == 28 ~ "mississippi", 161 | survey_dat_selected$WFc01_c == 29 ~ "missouri", 162 | survey_dat_selected$WFc01_c == 30 ~ "montana", 163 | survey_dat_selected$WFc01_c == 31 ~ "nebraska", 164 | survey_dat_selected$WFc01_c == 32 ~ "nevada", 165 | survey_dat_selected$WFc01_c == 33 ~ "newhampshire", 166 | survey_dat_selected$WFc01_c == 34 ~ "newjersey", 167 | survey_dat_selected$WFc01_c == 35 ~ "newmexico", 168 | survey_dat_selected$WFc01_c == 36 ~ "newyork", 169 | survey_dat_selected$WFc01_c == 37 ~ "northcarolina", 170 | survey_dat_selected$WFc01_c == 38 ~ "northdakota", 171 | survey_dat_selected$WFc01_c == 39 ~ "ohio", 172 | survey_dat_selected$WFc01_c == 40 ~ "oklahoma", 173 | survey_dat_selected$WFc01_c == 41 ~ "oregon", 174 | survey_dat_selected$WFc01_c == 42 ~ "pennsylvania", 175 | survey_dat_selected$WFc01_c == 44 ~ "rhodeisland", 176 | survey_dat_selected$WFc01_c == 45 ~ "southcarolina", 177 | survey_dat_selected$WFc01_c == 46 ~ "southdakota", 178 | survey_dat_selected$WFc01_c == 47 ~ "tennessee", 179 | survey_dat_selected$WFc01_c == 48 ~ "texas", 180 | survey_dat_selected$WFc01_c == 49 ~ "utah", 181 | survey_dat_selected$WFc01_c == 50 ~ "vermont", 182 | survey_dat_selected$WFc01_c == 51 ~ "virginia", 183 | survey_dat_selected$WFc01_c == 53 ~ "washington", 184 | survey_dat_selected$WFc01_c == 54 ~ "westvirginia", 185 | survey_dat_selected$WFc01_c == 55 ~ "wisconsin", 186 | survey_dat_selected$WFc01_c == 56 ~ "wyoming", 187 | TRUE ~ as.character(survey_dat_selected$WFc01_c)) 188 | 189 | 190 | # 
Preprocess income 191 | survey_dat_selected = survey_dat_selected %>% filter(WA04_c < 10) 192 | 193 | # no need to preprocess sex covariate 194 | 195 | # no need to preprocess age covariate 196 | 197 | # Select covariates that we're interested in now ---------------------------------------------------- 198 | 199 | gaymarriagedata = survey_dat_selected[,c("CEc01_c", # favors gay marriage 200 | "WA01_c", # sex 201 | "WA02_c", # age 202 | "education_x", # education 203 | "race_x", # race 204 | "WA04_c", # household income 205 | "state_x")] %>% data.frame() # state 206 | 207 | # change to characters because levels in maximal poststrat matrix are in characters 208 | gaymarriagedata$CEc01_c = as.character(gaymarriagedata$CEc01_c) 209 | gaymarriagedata$WA01_c = as.character(gaymarriagedata$WA01_c) 210 | gaymarriagedata$WA04_c = as.character(gaymarriagedata$WA04_c) 211 | 212 | gaymarriagedata_rm = gaymarriagedata[complete.cases(gaymarriagedata),] # remove NAs 213 | 214 | 215 | acs_ps$race = factor(acs_ps$race) 216 | acs_ps$sex = factor(acs_ps$sex) 217 | acs_ps$education = factor(acs_ps$education) 218 | acs_ps$income = factor(acs_ps$income) 219 | acs_ps$state = factor(acs_ps$state) 220 | 221 | # store original levels from maximal poststrat matrix 222 | levels_race = levels(acs_ps$race) 223 | levels_sex = levels(acs_ps$sex) 224 | levels_education = levels(acs_ps$education) 225 | levels_income = levels(acs_ps$income) 226 | levels_state = levels(acs_ps$state) 227 | 228 | # do some checks to see that we didn't make a typo. 
the levels in the maximal poststrat matrix should be a superset of the survey levels 229 | unique(gaymarriagedata_rm$race_x) %in% levels_race 230 | unique(gaymarriagedata_rm$WA01_c) %in% levels_sex 231 | unique(gaymarriagedata_rm$education_x) %in% levels_education # use the full column name; $education would rely on partial matching 232 | unique(gaymarriagedata_rm$WA04_c) %in% levels_income 233 | unique(gaymarriagedata_rm$state_x) %in% levels_state 234 | 235 | # get state index from jkastell in this dataframe as well 236 | 237 | # check that all state names in the annenberg survey are the same as in jkastell's file 238 | sum(unique(gaymarriagedata_rm$state_x) %in% unique(stateregions_final$lowercasename)) # should equal 49 239 | 240 | # do left_join 241 | gaymarriagedata_rm = dplyr::left_join(x = gaymarriagedata_rm, y = stateregions_final[,c("sstate.initnum", "region.cat", "lowercasename")], 242 | by = c("state_x"="lowercasename")) 243 | 244 | 245 | # use above levels to factor survey data 246 | gaymarriagedata_rm$race_x = factor(gaymarriagedata_rm$race_x, levels = levels_race) # factor race 247 | gaymarriagedata_rm$WA01_c = factor(gaymarriagedata_rm$WA01_c, levels = levels_sex) # factor sex 248 | gaymarriagedata_rm$education_x = factor(gaymarriagedata_rm$education_x, levels = levels_education) # factor education 249 | gaymarriagedata_rm$WA04_c = factor(gaymarriagedata_rm$WA04_c, levels = levels_income) # factor income 250 | gaymarriagedata_rm$state_x = factor(gaymarriagedata_rm$state_x, levels = levels_state) # factor state 251 | 252 | 253 | gaymarriagedata_rm_clean_mapped = gaymarriagedata_rm 254 | # rename levels in our survey data to numerical values so we can feed it into a stan file 255 | levels(gaymarriagedata_rm_clean_mapped$race_x) = c(1, 2, 3, 4, 5, 6) # americanindian=1, asian=2, black=3, hisp=4, other=5, white=6 256 | levels(gaymarriagedata_rm_clean_mapped$WA01_c) = c(0, 1) # men=0, women=1 257 | levels(gaymarriagedata_rm_clean_mapped$education_x) = c(1, 2, 3, 4, 5, 6) # fouryeardegree=1, fouryeardegreeplus=2, highschool=3, nohighschool=4,
somecollege=5, twoyeardegree=6 258 | levels(gaymarriagedata_rm_clean_mapped$state_x) = 1:51 259 | 260 | #levels(gaymarriagedata_rm_clean_mapped$gayFavorFederalMarriage) = c(0, 1) # no==0, yes==1 261 | 262 | summary(gaymarriagedata_rm_clean_mapped) 263 | summary(gaymarriagedata_rm) 264 | # create age-category covariate, i.e. stratify age 265 | # 1. acs_ps$age ranges from 18 to 95 so we remove 96 and 97 266 | gaymarriagedata_rm_clean_mapped = gaymarriagedata_rm_clean_mapped %>% filter(WA02_c <= 95 & WA02_c >= 18) 267 | gaymarriagedata_rm_clean_mapped$age_cat = cut(gaymarriagedata_rm_clean_mapped$WA02_c, 268 | age_grouping_multiplier) 269 | 270 | # store levels of age_cat 271 | levels_age_cat = levels(gaymarriagedata_rm_clean_mapped$age_cat) 272 | 273 | # change levels of age_cat to numeric 274 | levels(gaymarriagedata_rm_clean_mapped$age_cat) = 1:age_grouping_multiplier 275 | 276 | 277 | # ----------------------------------------------------------------------------- 278 | 279 | # rename columns 280 | print(colnames(gaymarriagedata_rm_clean_mapped)) 281 | # "CEc01_c" "WA01_c" "WA02_c" "education_x" "race_x" "WA04_c" "state_x" "sstate.initnum" "region.cat" "age_cat" 282 | colnames(gaymarriagedata_rm_clean_mapped) = c("gayfavor", "sex", "age", "education", "race", "income", "state", "jk_state_index", "jk_region_cat", "age_cat") 283 | 284 | 285 | # we do a left join because we're using the state index ordering that I defined 286 | stateregions_final_ = dplyr::left_join(x = data.frame(levels_state, stringsAsFactors = FALSE), y = stateregions_final, by = c("levels_state"="lowercasename")) 287 | 288 | # we will use the sstate.initnum order from stateregions_final 289 | # coef_state comes from data used in http://www.princeton.edu/~jkastell/MRP_primer/mrp_primer.pdf 290 | coef_state = (stateregions_final_$kerry.04)/100 # value for 2004 Democratic vote share in every state.
This is X_{\text{state-vs},j} for j \in [51] 291 | 292 | # coef_relig comes from data used in http://www.princeton.edu/~jkastell/MRP_primer/mrp_primer.pdf. This is the percentage of conservative religions in every state 293 | coef_relig = (stateregions_final_$p.evang + stateregions_final_$p.mormon)/100 294 | 295 | # unfactor stuff in gaymarriagedata_rm_clean_mapped 296 | X = gaymarriagedata_rm_clean_mapped[,!(names(gaymarriagedata_rm_clean_mapped) %in% c("gayfavor", "age"))] 297 | Y = as.numeric(as.character(gaymarriagedata_rm_clean_mapped[,names(gaymarriagedata_rm_clean_mapped)=="gayfavor"])) 298 | 299 | # change covariates to numeric so we can feed into stan model 300 | X$sex = as.numeric(as.character(X$sex)) 301 | X$age_cat = as.numeric(as.character(X$age_cat)) 302 | X$education = as.numeric(as.character(X$education)) 303 | X$race = as.numeric(as.character(X$race)) 304 | X$income = as.numeric(as.character(X$income)) 305 | X$state = as.numeric(as.character(X$state)) 306 | 307 | # note that states 2 and 12 (alaska and hawaii) are not present in the survey sample 308 | 309 | fit_realdata = sampling(m, 310 | data = list(N = dim(X)[1], 311 | N_groups_age = age_grouping_multiplier, # dependent on number of age groups 312 | N_groups_race = length(unique(acs_ps$race)), # there are 6 categories in race 313 | N_groups_sex = length(unique(acs_ps$sex)), # there are 2 categories in gender 314 | N_groups_education = length(unique(acs_ps$education)), 315 | N_groups_income = length(unique(acs_ps$income)), 316 | 317 | state_vs = coef_state, # the 51 values for 2004 Democratic vote share based on the state index ordering I defined for levels_state 318 | relig = coef_relig, # the 51 values for conservative religion based on the state index ordering I defined for levels_state 319 | region_index = stateregions_final_$region.cat, # the 51 values for state region based on the state index ordering I defined for levels_state 320 | 321 | N_groups_state = length(unique(acs_ps$state)), 322
| age = X$age_cat, 323 | race = X$race, 324 | sex = X$sex, 325 | education = X$education, 326 | income = X$income, 327 | state = X$state, # we are using the state index that I defined to fit the model 328 | y = Y), 329 | iter=iterations, 330 | chains=num_chains, 331 | seed=21, 332 | control=list(max_treedepth=mtreedepth, adapt_delta=ad)) 333 | 334 | saveRDS(fit_realdata, 335 | paste0(m_name, 336 | "_", 337 | age_grouping_multiplier, 338 | ".rds")) 339 | 340 | print("Done script.") 341 | -------------------------------------------------------------------------------- /realdata_annenberg/proposedN01_annenberg_v3_centered.stan: -------------------------------------------------------------------------------- 1 | data { 2 | int N; // number of samples 3 | int N_groups_age; // the number of groups for age 4 | int N_groups_race; // the number of groups for race 5 | int N_groups_sex; // the number of groups for sex 6 | 7 | int N_groups_education; // the number of groups for education 8 | int N_groups_income; // the number of groups for income 9 | int N_groups_state; // the number of groups for state 10 | 11 | int age[N]; // the column vector of design matrix X for age 12 | int race[N]; // the column vector of design matrix X for race 13 | int sex[N]; // the column vector of design matrix X for sex. 
Vector of 0/1 14 | 15 | int education[N]; // the column vector of design matrix X for education 16 | int income[N]; // the column vector of design matrix X for income 17 | int state[N]; // the column vector of design matrix X for state 18 | 19 | int region_index[51]; // the index representing the region that a state is in according to jkastell 20 | real state_vs[51]; // the 2004 Democratic vote share from jkastell 21 | real relig[51]; // the 2004 conservative religion percentage 22 | 23 | int y[N]; // the response vector 24 | } 25 | parameters { 26 | vector[N_groups_age] U_age; // the random effect for age, not multiplied by sigma_age 27 | vector[N_groups_race] U_race; // the random effect for race, not multiplied by sigma_race 28 | vector[N_groups_income] U_income; 29 | vector[N_groups_state] U_state; // we are using the centered parameterization for U_state, so U_state_transformed = U_state 30 | vector[N_groups_education] U_education; 31 | 32 | vector[5] U_region; // the nested random effect for region 33 | 34 | real<lower=0> sigma_age; // sd of U_age (hyperparam). 35 | real<lower=0> sigma_race; // sd of U_race (hyperparam).
36 | real<lower=0> sigma_income; 37 | real<lower=0> sigma_state; 38 | real<lower=0> sigma_education; 39 | 40 | real<lower=0> sigma_region; 41 | real beta_state; // coef for Democratic vote share in every state 42 | real beta_relig; // coef for conservative religion share in every state 43 | 44 | real intercept; // the intercept (global fixed effect) 45 | real beta_sex; 46 | 47 | 48 | } 49 | transformed parameters { 50 | vector[N_groups_age] U_age_transformed; 51 | vector[N_groups_race] U_race_transformed; 52 | vector[N_groups_income] U_income_transformed; 53 | vector[N_groups_state] U_state_transformed; 54 | vector[N_groups_education] U_education_transformed; 55 | 56 | vector[5] U_region_transformed; 57 | 58 | vector[N] yhat; 59 | 60 | U_age_transformed = sigma_age * U_age; // the random effect for age 61 | U_race_transformed = sigma_race * U_race; // the random effect for race 62 | U_income_transformed = sigma_income * U_income; 63 | //U_state_transformed = sigma_state * U_state; 64 | U_education_transformed = sigma_education * U_education; 65 | 66 | U_region_transformed = sigma_region * U_region; 67 | 68 | U_state_transformed = U_state; 69 | 70 | for (i in 1:N) { 71 | yhat[i] = intercept + U_age_transformed[age[i]] + U_race_transformed[race[i]] + U_income_transformed[income[i]] + U_state_transformed[state[i]] + U_education_transformed[education[i]] + beta_sex * sex[i]; // the linear predictor at each point 72 | } 73 | 74 | } 75 | model { 76 | sigma_age ~ normal(0,1); // sigma_age ~ halfnormal(0,1) given the lower bound. hyperparam. 77 | sigma_race ~ normal(0,1); // sigma_race ~ halfnormal(0,1) given the lower bound. hyperparam. 78 | sigma_income ~ normal(0,1); 79 | sigma_state ~ normal(0,1); 80 | sigma_education ~ normal(0,1); 81 | 82 | sigma_region ~ normal(0,1); 83 | 84 | for (j in 2:N_groups_age) { 85 | U_age[j] ~ normal(U_age[j-1],1); 86 | } 87 | 88 | sum(U_age) ~ normal(0, 0.01 * N_groups_age); // soft sum-to-zero constraint: identifies the level of the rw(1) prior.
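The `sum(U_age)` soft constraint immediately above is what identifies the random-walk prior. A sketch of the standard identification argument (notation added here, not from the original source; J denotes N_groups_age):

```latex
% The RW(1) prior specifies only increments,
%   U_j \mid U_{j-1} \sim \mathcal{N}(U_{j-1}, 1), \qquad j = 2, \dots, J,
% so the joint density
\[
p(U_1, \dots, U_J) \;\propto\; \exp\!\Big( -\tfrac{1}{2} \sum_{j=2}^{J} (U_j - U_{j-1})^2 \Big)
\]
% is invariant to a common shift U_j -> U_j + c: the level of U_age is
% confounded with the intercept. The statement
\[
\sum_{j=1}^{J} U_j \;\sim\; \mathcal{N}\big(0,\; (0.01\,J)^2\big)
\]
% is a soft version of the hard constraint sum_j U_j = 0; it removes the
% flat direction while keeping the log density smooth for HMC.
```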
89 | 90 | U_race ~ normal(0, 1); // random effect for race is normal 91 | U_income ~ normal(0, 1); 92 | //U_state ~ normal(0, 1); 93 | U_education ~ normal(0, 1); 94 | 95 | U_region ~ normal(0,1); 96 | beta_state ~ normal(0,1); 97 | beta_relig ~ normal(0,1); 98 | 99 | for (j in 1:51) { 100 | U_state[j] ~ normal(U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]), sigma_state); 101 | } 102 | 103 | intercept ~ normal(0, 1); 104 | beta_sex ~ normal(0, 1); 105 | 106 | 107 | for (i in 1:N) { 108 | y[i] ~ bernoulli(inv_logit(yhat[i])); // the response 109 | } 110 | 111 | } 112 | -------------------------------------------------------------------------------- /realdata_annenberg/proposedarN01_annenberg_v3_centered.stan: -------------------------------------------------------------------------------- 1 | data { 2 | int N; // number of samples 3 | int N_groups_age; // the number of groups for age 4 | int N_groups_race; // the number of groups for race 5 | int N_groups_sex; // the number of groups for sex: 2 6 | 7 | int N_groups_education; // the number of groups for education 8 | int N_groups_income; // the number of groups for income 9 | int N_groups_state; // the number of groups for state 10 | 11 | int age[N]; // the column vector of design matrix X for age 12 | int race[N]; // the column vector of design matrix X for race 13 | int sex[N]; // the column vector of design matrix X for sex 14 | 15 | int education[N]; // the column vector of design matrix X for education 16 | int income[N]; // the column vector of design matrix X for income 17 | int state[N]; // the column vector of design matrix X for state 18 | 19 | int region_index[51]; // the index representing the region that a state is in according to jkastell 20 | real state_vs[51]; // the 2004 Democratic vote share from jkastell 21 | real relig[51]; // the 2004 conservative religion percentage 22 | 23 | int y[N]; // the response vector 24 | } 25 | parameters { 26 |
vector[N_groups_age] U_age; // the random effect for age, not multiplied by sigma_age 27 | vector[N_groups_race] U_race; // the random effect for race, not multiplied by sigma_race 28 | vector[N_groups_income] U_income; 29 | vector[N_groups_state] U_state; // we are using the centered parameterization for U_state, so U_state_transformed = U_state 30 | vector[N_groups_education] U_education; 31 | 32 | vector[5] U_region; // the nested random effect for region 33 | 34 | real<lower=0> sigma_age; // sd of U_age (hyperparam). 35 | real<lower=0> sigma_race; // sd of U_race (hyperparam). 36 | real<lower=0> sigma_income; 37 | real<lower=0> sigma_state; 38 | real<lower=0> sigma_education; 39 | 40 | real<lower=0> sigma_region; 41 | real beta_state; // coef for Democratic vote share in every state 42 | real beta_relig; // coef for conservative religion share in every state 43 | 44 | real intercept; // the intercept (global fixed effect) 45 | real beta_sex; 46 | 47 | real<lower=0, upper=1> rho; // the untransformed autoregressive coefficient; the beta prior below needs support on (0, 1) 48 | } 49 | transformed parameters { 50 | vector[N_groups_age] U_age_transformed; 51 | vector[N_groups_race] U_race_transformed; 52 | vector[N_groups_income] U_income_transformed; 53 | vector[N_groups_state] U_state_transformed; 54 | vector[N_groups_education] U_education_transformed; 55 | 56 | vector[5] U_region_transformed; 57 | 58 | vector[N] yhat; 59 | real rho_transformed; 60 | 61 | rho_transformed = (rho * 2) - 1; // map the autoregressive coefficient from (0, 1) to (-1, 1) 62 | 63 | U_age_transformed = sigma_age * U_age; // the random effect for age 64 | U_race_transformed = sigma_race * U_race; // the random effect for race 65 | U_income_transformed = sigma_income * U_income; 66 | //U_state_transformed = sigma_state * U_state; 67 | U_education_transformed = sigma_education * U_education; 68 | 69 | U_region_transformed = sigma_region * U_region; 70 | 71 | U_state_transformed = U_state; 72 | 73 | // noncentered parameterization for U_state_transformed 74 | // for (j in 1:51) { // j iterates through the state index that I
defined through levels_state in fitmodel_annenberg R file 75 | // U_state_transformed[j] = (U_state[j] * sigma_state) * ( U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]) ); 76 | // } 77 | 78 | for (i in 1:N) { 79 | yhat[i] = intercept + U_age_transformed[age[i]] + U_race_transformed[race[i]] + U_income_transformed[income[i]] + U_state_transformed[state[i]] + U_education_transformed[education[i]] + beta_sex * sex[i]; // the linear predictor at each point 80 | } 81 | 82 | } 83 | model { 84 | sigma_age ~ normal(0,1); // sigma_A ~ halfnormal(0,1). hyperparam. 85 | sigma_race ~ normal(0,1); // sigma_I ~ halfnormal(0,1). hyperparam. 86 | sigma_income ~ normal(0,1); 87 | sigma_state ~ normal(0,1); 88 | sigma_education ~ normal(0,1); 89 | 90 | sigma_region ~ normal(0,1); 91 | 92 | rho ~ beta(0.5, 0.5); // prior on autoregressive coefficient 93 | 94 | U_age[1] ~ normal(0, 1/sqrt(1-rho_transformed^2)); // stationary sd of an AR(1) with unit innovation sd is 1/sqrt(1 - rho^2); normal(0, 1) would be wrong 95 | for (j in 2:N_groups_age) { 96 | U_age[j] ~ normal(rho_transformed * U_age[j-1],1); 97 | } 98 | 99 | U_race ~ normal(0, 1); // random effect for race is normal 100 | U_income ~ normal(0, 1); 101 | //U_state ~ normal(0, 1); 102 | U_education ~ normal(0, 1); 103 | 104 | U_region ~ normal(0,1); 105 | beta_state ~ normal(0,1); 106 | beta_relig ~ normal(0,1); 107 | 108 | for (j in 1:51) { 109 | U_state[j] ~ normal(U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]), sigma_state); 110 | } 111 | 112 | intercept ~ normal(0, 1); 113 | beta_sex ~ normal(0, 1); 114 | 115 | for (i in 1:N) { 116 | y[i] ~ bernoulli(inv_logit(yhat[i])); // the response 117 | } 118 | 119 | } 120 | -------------------------------------------------------------------------------- /simulation/baselinemeanzeroN01_v3.stan: -------------------------------------------------------------------------------- 1 | data { 2 | int N; // number of samples 3 | int N_groups_age; // the number
of groups for age 4 | int N_groups_income; // the number of groups for income 5 | int age[N]; // the column vector of design matrix X for age 6 | int income[N]; // the column vector of design matrix X for income 7 | 8 | int state_index[N]; // the index (from 1 to 51) representing the state for every datapoint 9 | int region_index[51]; // the index (from 1 to 5) representing the region the state is in for all 51 states 10 | real state_vs[51]; // the 2004 Democratic vote share for every state. This is hard coded for now. 11 | real relig[51]; // the 2004 conservative religion percentage in every state 12 | 13 | int y[N]; // the response vector 14 | } 15 | parameters { 16 | vector[N_groups_age] U_age; // the random effect for age, not multiplied by sigma_age 17 | vector[N_groups_income] U_income; // the random effect for income, not multiplied by sigma_income 18 | 19 | vector[51] U_state; // the random effect for state, not multiplied by sigma_state 20 | vector[5] U_region; // the nested random effect for region, not multiplied by sigma_region 21 | 22 | real<lower=0> sigma_age; // sd of U_age (hyperparam). halfnormal prior put on this. 23 | real<lower=0> sigma_income; // sd of U_income (hyperparam). halfnormal prior put on this. 24 | 25 | real<lower=0> sigma_state; // sd of state 26 | real<lower=0> sigma_region; // sd of region 27 | real beta_state; // coeff. for Democratic vote share in every state 28 | real beta_relig; // coeff.
for conservative religion share in every state 29 | 30 | real intercept; // the intercept (global fixed effect) 31 | 32 | } 33 | transformed parameters { 34 | vector[N_groups_age] U_age_transformed; 35 | vector[N_groups_income] U_income_transformed; 36 | 37 | vector[51] U_state_transformed; 38 | vector[5] U_region_transformed; 39 | 40 | vector[N] yhat; 41 | //real intercept; 42 | 43 | U_age_transformed = sigma_age * U_age; // the random effect for age 44 | U_income_transformed = sigma_income * U_income; // the random effect for income 45 | U_region_transformed = sigma_region * U_region; // the random effect for region 46 | 47 | // noncentered parameterization for U_state_transformed 48 | for (j in 1:51) { 49 | U_state_transformed[j] = (U_state[j] * sigma_state) + ( U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]) ); 50 | } 51 | 52 | for (i in 1:N) { 53 | yhat[i] = intercept + U_age_transformed[age[i]] + U_income_transformed[income[i]] + U_state_transformed[state_index[i]]; // the linear predictor at each point 54 | } 55 | 56 | } 57 | model { 58 | sigma_age ~ normal(0, 1); 59 | sigma_income ~ normal(0, 1); 60 | 61 | sigma_region ~ normal(0,1); 62 | sigma_state ~ normal(0, 1); 63 | 64 | U_age ~ normal(0, 1); // random effect for age is normal 65 | U_income ~ normal(0, 1); // random effect for income is normal 66 | 67 | U_region ~ normal(0,1); 68 | 69 | U_state ~ normal(0, 1); 70 | 71 | beta_state ~ normal(0, 1); 72 | beta_relig ~ normal(0, 1); 73 | intercept ~ normal(0, 1); 74 | 75 | 76 | for (i in 1:N) { 77 | y[i] ~ bernoulli(inv_logit(yhat[i])); // the response 78 | } 79 | 80 | } 81 | -------------------------------------------------------------------------------- /simulation/biasfacet_combinedplots.R: -------------------------------------------------------------------------------- 1 | # run this script after threemodelwriteup_v3_posteriorvariance.R to get the main plots in the paper 2 | # this script corresponds to 
n=100, 500 3 | rm(list=ls()) # clear previous workspace 4 | gc() 5 | 6 | library(ggplot2) 7 | 8 | bias_facet_df_100 = readRDS("biasfacet_200_12_1_100.rds") 9 | bias_facet_df_500 = readRDS("biasfacet_200_12_1_500.rds") 10 | 11 | biasfacet_pngname = "biasfacet_200_12_100_500.png" 12 | allmediansfacet_pngname = "allmediansfacet_200_12_100_500.png" 13 | allquantilediff_pngname = "allquantilediff_facet_200_12_100_500.png" 14 | 15 | runs = 200 16 | age_grouping_multiplier = 12 17 | 18 | # --------------------------- 19 | bias_facet_df_100$age_cat_grouped = ifelse(bias_facet_df_100$age_cat %in% c(1,2,3,4), "Age categories 1-4", 20 | ifelse(bias_facet_df_100$age_cat %in% c(5,6,7,8), "Age categories 5-8", 21 | "Age categories 9-12")) 22 | 23 | bias_facet_df_500$age_cat_grouped = ifelse(bias_facet_df_500$age_cat %in% c(1,2,3,4), "Age categories 1-4", 24 | ifelse(bias_facet_df_500$age_cat %in% c(5,6,7,8), "Age categories 5-8", 25 | "Age categories 9-12")) 26 | 27 | p1 = cowplot::plot_grid(ggplot(bias_facet_df_100, aes(x=prob_sampling_old, y=value, group=cell)) + 28 | geom_line(aes(col=as.factor(age_cat_grouped)), show.legend = TRUE, size = 0.5) + 29 | facet_wrap(. 
~ Model,ncol=3,nrow=1, labeller = as_labeller( 30 | function(value) { 31 | return(value) # Lets you change the facet labels 32 | }) 33 | ) + 34 | xlab("\n Probability of response for the elderly \n") + 35 | ylab("\n Average Bias \n") + 36 | ylim( -max(max(abs(bias_facet_df_500$value), abs(bias_facet_df_100$value))), 37 | max(max(abs(bias_facet_df_500$value), abs(bias_facet_df_100$value))) ) + 38 | # scale_colour_manual(values=c("1"="#fbb4b9", 39 | # "2"="#f768a1", 40 | # "3"="#c51b8a", 41 | # "4"="#7a0177", 42 | # "5"="#fdcc8a", 43 | # "6"="#fc8d59", 44 | # "7"="#e34a33", 45 | # "8"="#b30000", 46 | # "9"="#b2e2e2", 47 | # "10"="#66c2a4", 48 | # "11"="#2ca25f", 49 | # "12"="#006d2c") 50 | # ) + 51 | scale_colour_manual(values=c("Age categories 1-4"="#fdcc8a", 52 | "Age categories 5-8"="#b30000", 53 | "Age categories 9-12"="#006d2c") 54 | ) + 55 | theme_bw() + 56 | theme(plot.title = element_text(size = 50, face = "bold"), 57 | axis.text=element_text(size=80), 58 | axis.title=element_text(size=80, face="bold",margin=200), 59 | strip.text.x = element_text(size=80, face="bold", margin = margin(t=25,b=25) ), 60 | strip.background = element_rect(fill="transparent",color="transparent"), 61 | legend.position = "none" 62 | ) + 63 | geom_hline(yintercept=0, color="black", size=5, linetype = "dashed") + 64 | guides(col=guide_legend(title="Age category", 65 | override.aes = list(size = 10*1.6))), 66 | 67 | ggplot(bias_facet_df_500, aes(x=prob_sampling_old, y=value, group=cell)) + 68 | geom_line(aes(col=as.factor(age_cat_grouped)), show.legend = TRUE, size = 0.5) + 69 | facet_wrap(. 
~ Model,ncol=3,nrow=1, labeller = as_labeller( 70 | function(value) { 71 | return(value) # Lets you change the facet labels 72 | }) 73 | ) + 74 | xlab("\n Probability of response for the elderly \n") + 75 | ylab("\n Average Bias \n") + 76 | ylim( -max(max(abs(bias_facet_df_500$value), abs(bias_facet_df_100$value))), 77 | max(max(abs(bias_facet_df_500$value), abs(bias_facet_df_100$value))) ) + 78 | # scale_colour_manual(values=c("1"="#fbb4b9", 79 | # "2"="#f768a1", 80 | # "3"="#c51b8a", 81 | # "4"="#7a0177", 82 | # "5"="#fdcc8a", 83 | # "6"="#fc8d59", 84 | # "7"="#e34a33", 85 | # "8"="#b30000", 86 | # "9"="#b2e2e2", 87 | # "10"="#66c2a4", 88 | # "11"="#2ca25f", 89 | # "12"="#006d2c") 90 | # ) + 91 | scale_colour_manual(values=c("Age categories 1-4"="#fdcc8a", 92 | "Age categories 5-8"="#b30000", 93 | "Age categories 9-12"="#006d2c") 94 | ) + 95 | theme_bw() + 96 | theme(plot.title = element_text(size = 50, face = "bold"), 97 | axis.text=element_text(size=80), 98 | axis.title=element_text(size=80, face="bold",margin=200), 99 | strip.text.x = element_text(size=80, face="bold", margin = margin(t=25,b=25) ), 100 | strip.background = element_rect(fill="transparent",color="transparent"), 101 | legend.position = "none" 102 | ) + 103 | geom_hline(yintercept=0, color="black", size=5, linetype = "dashed") + 104 | guides(col=guide_legend(title="Age category", 105 | override.aes = list(size = 10*1.6))) 106 | , 107 | align="v", 108 | nrow=2, 109 | 110 | labels = c("n = 100", "n = 500"), 111 | 112 | label_size = 85, 113 | 114 | vjust = 1.25, 115 | hjust = -0.25) 116 | 117 | legend_p1 = cowplot::get_legend(ggplot(bias_facet_df_500, aes(x=prob_sampling_old, y=value, group=cell)) + 118 | geom_line(aes(col=as.factor(age_cat_grouped)), show.legend = TRUE, size = 0.5) + 119 | facet_wrap(. 
~ Model,ncol=3,nrow=1, labeller = as_labeller( 120 | function(value) { 121 | return(value) # Lets you change the facet labels 122 | }) 123 | ) + 124 | xlab("\n Probability of response for the elderly \n") + 125 | ylab("\n Average Bias \n") + 126 | # scale_colour_manual(values=c("1"="#fbb4b9", 127 | # "2"="#f768a1", 128 | # "3"="#c51b8a", 129 | # "4"="#7a0177", 130 | # "5"="#fdcc8a", 131 | # "6"="#fc8d59", 132 | # "7"="#e34a33", 133 | # "8"="#b30000", 134 | # "9"="#b2e2e2", 135 | # "10"="#66c2a4", 136 | # "11"="#2ca25f", 137 | # "12"="#006d2c") 138 | # ) + 139 | scale_colour_manual(values=c("Age categories 1-4"="#fdcc8a", 140 | "Age categories 5-8"="#b30000", 141 | "Age categories 9-12"="#006d2c") 142 | ) + 143 | theme_bw() + 144 | theme(plot.title = element_text(size = 50, face = "bold"), 145 | axis.text=element_text(size=50), 146 | axis.title=element_text(size=80, face="bold",margin=200), 147 | legend.text = element_text(size=80), 148 | legend.position = "bottom", 149 | legend.key.size = unit(30*1.6,"line"), 150 | legend.key.height = unit(7*1.6,"line"), 151 | legend.key = element_rect(fill = "transparent",color = "transparent"), 152 | legend.title = element_text(size=80, face="bold"), 153 | strip.text.x = element_text(size=80, face="bold", margin = margin(t=25,b=25) ), 154 | strip.background = element_rect(fill="transparent",color="transparent") 155 | ) + 156 | guides(col=guide_legend(title="Age category", 157 | override.aes = list(size = 10*1.6)))) 158 | 159 | png(biasfacet_pngname, 160 | width = 5200, 161 | height = 5000) 162 | 163 | plot(cowplot::plot_grid(p1, 164 | legend_p1, 165 | ncol=1, 166 | rel_heights = c(3, 0.3))) 167 | dev.off() 168 | 169 | 170 | # plot combined plots for allmedians_facet 171 | allmedians_facet_df_100 = readRDS("allmedians_facet_200_12_100_.rds") 172 | allmedians_facet_df_500 = readRDS("allmedians_facet_200_12_500_.rds") 173 | points_df_final = readRDS("points_df_final.rds") 174 | 175 | allmedians_facet_df_100_filtered = 
allmedians_facet_df_100[allmedians_facet_df_100$p %in% c(0.1, 0.5, 0.9),] 176 | 177 | # labels for plots: Strongly undersampled, Representative sampling, Strongly oversampled 178 | allmedians_facet_df_100_filtered[allmedians_facet_df_100_filtered$p==0.1, c("prob_sampling_old")] = "Strongly undersampling age categories 9-12" 179 | allmedians_facet_df_100_filtered[allmedians_facet_df_100_filtered$p==0.5, c("prob_sampling_old")] = "Representatively sampling age categories 9-12" 180 | allmedians_facet_df_100_filtered[allmedians_facet_df_100_filtered$p==0.9, c("prob_sampling_old")] = "Strongly oversampling age categories 9-12" 181 | 182 | allmedians_facet_df_100_filtered$prob_sampling_old = factor(allmedians_facet_df_100_filtered$prob_sampling_old, levels = c("Strongly undersampling age categories 9-12", 183 | "Representatively sampling age categories 9-12", 184 | "Strongly oversampling age categories 9-12")) 185 | 186 | 187 | allmedians_facet_df_500_filtered = allmedians_facet_df_500[allmedians_facet_df_500$p %in% c(0.1, 0.5, 0.9),] 188 | 189 | # labels for plots: Strongly undersampled, Representative sampling, Strongly oversampled 190 | allmedians_facet_df_500_filtered[allmedians_facet_df_500_filtered$p==0.1, c("prob_sampling_old")] = "Strongly undersampling age categories 9-12" 191 | allmedians_facet_df_500_filtered[allmedians_facet_df_500_filtered$p==0.5, c("prob_sampling_old")] = "Representatively sampling age categories 9-12" 192 | allmedians_facet_df_500_filtered[allmedians_facet_df_500_filtered$p==0.9, c("prob_sampling_old")] = "Strongly oversampling age categories 9-12" 193 | 194 | allmedians_facet_df_500_filtered$prob_sampling_old = factor(allmedians_facet_df_500_filtered$prob_sampling_old, levels = c("Strongly undersampling age categories 9-12", 195 | "Representatively sampling age categories 9-12", 196 | "Strongly oversampling age categories 9-12")) 197 | 198 | 199 | p2 = cowplot::plot_grid(ggplot(allmedians_facet_df_100_filtered, aes(x=age_cat, 
y=posterior_medians, color=Model, linetype=Model)) + #geom_point(size=5)+ 200 | geom_rect(allmedians_facet_df_100_filtered, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2, 201 | xmax = Inf, 202 | ymin = -Inf, 203 | ymax = Inf), 204 | alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) + 205 | geom_jitter(alpha=0.1, show.legend=FALSE) + 206 | geom_point(aes(x = age_cat, y = mrp), 207 | size=7, 208 | colour = "black", 209 | fill="black", 210 | show.legend = F, 211 | data = points_df_final, 212 | inherit.aes = F) + 213 | geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 214 | facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=1, labeller = as_labeller( 215 | function(value) { 216 | return(value) # Lets you change the facet labels 217 | }) 218 | ) + 219 | xlab("\n Age Category \n") + 220 | ylab(paste("\n Median of", runs,"Posteriors \n")) + 221 | scale_x_continuous(breaks= scales::pretty_breaks()) + 222 | coord_cartesian(ylim=c(0,1), expand=FALSE) + 223 | theme_bw() + 224 | theme(plot.title = element_text(size = 50, face = "bold"), 225 | axis.text=element_text(size=50), 226 | axis.title=element_text(size=50, face="bold",margin=200), 227 | legend.text = element_text(size=50), 228 | legend.position = "none", 229 | legend.key.size = unit(25,"line"), 230 | legend.key.height = unit(7,"line"), 231 | legend.key = element_rect(fill = "transparent",color = "transparent"), 232 | legend.title = element_text(size=50, face="bold"), 233 | strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ), 234 | strip.background = element_rect(fill="transparent",color="transparent")) + 235 | scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 236 | guides(color = guide_legend(override.aes = list(size=20))), 237 | 238 | ggplot(allmedians_facet_df_500_filtered, aes(x=age_cat, y=posterior_medians, color=Model, linetype=Model)) + #geom_point(size=5)+ 239 | geom_rect(allmedians_facet_df_500_filtered, mapping = aes(xmin = (2 * 
(age_grouping_multiplier * 2/3) + 1)/2, 240 | xmax = Inf, 241 | ymin = -Inf, 242 | ymax = Inf), 243 | alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) + 244 | geom_jitter(alpha=0.1, show.legend=FALSE) + 245 | geom_point(aes(x = age_cat, y = mrp), 246 | size=7, 247 | colour = "black", 248 | fill="black", 249 | show.legend = F, 250 | data = points_df_final, 251 | inherit.aes = F) + 252 | geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 253 | facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller( 254 | function(value) { 255 | return(value) # Lets you change the facet labels 256 | }) 257 | ) + 258 | xlab("\n Age Category \n") + 259 | ylab(paste("\n Median of", runs,"Posteriors \n")) + 260 | scale_x_continuous(breaks= scales::pretty_breaks()) + 261 | coord_cartesian(ylim=c(0,1), expand=FALSE) + 262 | theme_bw() + 263 | theme(plot.title = element_text(size = 50, face = "bold"), 264 | axis.text=element_text(size=50), 265 | axis.title=element_text(size=50, face="bold",margin=200), 266 | legend.text = element_text(size=50), 267 | legend.position = "none", 268 | legend.key.size = unit(25,"line"), 269 | legend.key.height = unit(7,"line"), 270 | legend.key = element_rect(fill = "transparent",color = "transparent"), 271 | legend.title = element_text(size=50, face="bold"), 272 | strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ), 273 | strip.background = element_rect(fill="transparent",color="transparent")) + 274 | scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 275 | guides(color = guide_legend(override.aes = list(size=20))), 276 | 277 | ncol = 1, 278 | 279 | labels = c("n = 100", "n = 500"), 280 | 281 | label_size = 50, 282 | 283 | vjust = 1.5, 284 | hjust = -0.25 285 | ) 286 | 287 | 288 | legend_p2 = cowplot::get_legend(ggplot(allmedians_facet_df_500_filtered, aes(x=age_cat, y=posterior_medians, color=Model, linetype=Model)) + #geom_point(size=5)+ 289 | 
geom_rect(allmedians_facet_df_500_filtered, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2, 290 | xmax = Inf, 291 | ymin = -Inf, 292 | ymax = Inf), 293 | alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) + 294 | geom_jitter(alpha=0.1, show.legend=FALSE) + 295 | geom_point(aes(x = age_cat, y = mrp), 296 | size=7, 297 | colour = "black", 298 | fill="black", 299 | show.legend = F, 300 | data = points_df_final, 301 | inherit.aes = F) + 302 | geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 303 | facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller( 304 | function(value) { 305 | return(value) # Lets you change the facet labels 306 | }) 307 | ) + 308 | xlab("\n Age Category \n") + 309 | ylab(paste("\n Median of", runs,"Posteriors \n")) + 310 | scale_x_continuous(breaks= scales::pretty_breaks()) + 311 | theme_bw() + 312 | theme(plot.title = element_text(size = 50, face = "bold"), 313 | axis.text=element_text(size=35), 314 | axis.title=element_text(size=50, face="bold",margin=200), 315 | legend.text = element_text(size=50), 316 | legend.position = "bottom", 317 | legend.key.size = unit(25,"line"), 318 | legend.key.height = unit(7,"line"), 319 | legend.key = element_rect(fill = "transparent",color = "transparent"), 320 | legend.title = element_text(size=50, face="bold"), 321 | strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ), 322 | strip.background = element_rect(fill="transparent",color="transparent")) + 323 | scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 324 | guides(color = guide_legend(override.aes = list(size=20)))) 325 | 326 | png(allmediansfacet_pngname, 327 | width = 3625, 328 | height = 2200) 329 | 330 | plot(cowplot::plot_grid(p2, 331 | legend_p2, 332 | ncol=1, 333 | rel_heights = c(3, 0.3))) 334 | dev.off() 335 | 336 | 337 | # plot combined plots for allquantilediff_facet 338 | allquantilediff_facet_df_100 = 
readRDS("allquantilediff_facet_200_12_100_.rds") 339 | allquantilediff_facet_df_500 = readRDS("allquantilediff_facet_200_12_500_.rds") 340 | 341 | allquantilediff_facet_df_100_filtered = allquantilediff_facet_df_100[allquantilediff_facet_df_100$p %in% c(0.1, 0.5, 0.9),] 342 | 343 | # labels for plots: Strongly undersampled, Representative sampling, Strongly oversampled 344 | allquantilediff_facet_df_100_filtered[allquantilediff_facet_df_100_filtered$p==0.1, c("prob_sampling_old")] = "Strongly undersampling age categories 9-12" 345 | allquantilediff_facet_df_100_filtered[allquantilediff_facet_df_100_filtered$p==0.5, c("prob_sampling_old")] = "Representatively sampling age categories 9-12" 346 | allquantilediff_facet_df_100_filtered[allquantilediff_facet_df_100_filtered$p==0.9, c("prob_sampling_old")] = "Strongly oversampling age categories 9-12" 347 | 348 | allquantilediff_facet_df_100_filtered$prob_sampling_old = factor(allquantilediff_facet_df_100_filtered$prob_sampling_old, levels = c("Strongly undersampling age categories 9-12", 349 | "Representatively sampling age categories 9-12", 350 | "Strongly oversampling age categories 9-12")) 351 | 352 | 353 | allquantilediff_facet_df_500_filtered = allquantilediff_facet_df_500[allquantilediff_facet_df_500$p %in% c(0.1, 0.5, 0.9),] 354 | 355 | # labels for plots: Strongly undersampled, Representative sampling, Strongly oversampled 356 | allquantilediff_facet_df_500_filtered[allquantilediff_facet_df_500_filtered$p==0.1, c("prob_sampling_old")] = "Strongly undersampling age categories 9-12" 357 | allquantilediff_facet_df_500_filtered[allquantilediff_facet_df_500_filtered$p==0.5, c("prob_sampling_old")] = "Representatively sampling age categories 9-12" 358 | allquantilediff_facet_df_500_filtered[allquantilediff_facet_df_500_filtered$p==0.9, c("prob_sampling_old")] = "Strongly oversampling age categories 9-12" 359 | 360 | allquantilediff_facet_df_500_filtered$prob_sampling_old = 
factor(allquantilediff_facet_df_500_filtered$prob_sampling_old, levels = c("Strongly undersampling age categories 9-12", 361 | "Representatively sampling age categories 9-12", 362 | "Strongly oversampling age categories 9-12")) 363 | 364 | 365 | 366 | p3 = cowplot::plot_grid(ggplot(allquantilediff_facet_df_100_filtered, aes(x=age_cat, y=quantildiff90_10, color=Model, linetype=Model)) + #geom_point(size=5)+ 367 | geom_rect(allquantilediff_facet_df_100_filtered, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2, 368 | xmax = Inf, 369 | ymin = -Inf, 370 | ymax = Inf), 371 | alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) + 372 | geom_jitter(alpha=0.1, show.legend=FALSE) + 373 | geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 374 | facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller( 375 | function(value) { 376 | return(value) # Lets you change the facet labels 377 | }) 378 | ) + 379 | xlab("\n Age Category \n") + 380 | ylab(paste("\n 90th - 10th quantile of Posteriors \n")) + 381 | scale_x_continuous(breaks= scales::pretty_breaks()) + 382 | coord_cartesian(ylim=c(0, max(allquantilediff_facet_df_100_filtered$quantildiff90_10)), expand=FALSE) + 383 | theme_bw() + 384 | theme(plot.title = element_text(size = 50, face = "bold"), 385 | axis.text=element_text(size=50), 386 | axis.title=element_text(size=50, face="bold",margin=200), 387 | legend.text = element_text(size=50), 388 | legend.position = "none", 389 | legend.key.size = unit(25,"line"), 390 | legend.key.height = unit(7,"line"), 391 | legend.key = element_rect(fill = "transparent",color = "transparent"), 392 | legend.title = element_text(size=50, face="bold"), 393 | strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ), 394 | strip.background = element_rect(fill="transparent",color="transparent")) + 395 | scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 396 | guides(color = guide_legend(override.aes = 
list(size=20))), 397 | 398 | ggplot(allquantilediff_facet_df_500_filtered, aes(x=age_cat, y=quantildiff90_10, color=Model, linetype=Model)) + #geom_point(size=5)+ 399 | geom_rect(allquantilediff_facet_df_500_filtered, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2, 400 | xmax = Inf, 401 | ymin = -Inf, 402 | ymax = Inf), 403 | alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) + 404 | geom_jitter(alpha=0.1, show.legend=FALSE) + 405 | geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 406 | facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller( 407 | function(value) { 408 | return(value) # Lets you change the facet labels 409 | }) 410 | ) + 411 | xlab("\n Age Category \n") + 412 | ylab(paste("\n 90th - 10th quantile of Posteriors \n")) + 413 | scale_x_continuous(breaks= scales::pretty_breaks()) + 414 | coord_cartesian(ylim=c(0, max(allquantilediff_facet_df_100_filtered$quantildiff90_10)), expand=FALSE) + 415 | theme_bw() + 416 | theme(plot.title = element_text(size = 50, face = "bold"), 417 | axis.text=element_text(size=50), 418 | axis.title=element_text(size=50, face="bold",margin=200), 419 | legend.text = element_text(size=50), 420 | legend.position = "none", 421 | legend.key.size = unit(25,"line"), 422 | legend.key.height = unit(7,"line"), 423 | legend.key = element_rect(fill = "transparent",color = "transparent"), 424 | legend.title = element_text(size=50, face="bold"), 425 | strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ), 426 | strip.background = element_rect(fill="transparent",color="transparent")) + 427 | scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 428 | guides(color = guide_legend(override.aes = list(size=20))), 429 | 430 | ncol=1, 431 | 432 | labels = c("n = 100", "n = 500"), 433 | 434 | label_size = 50, 435 | 436 | vjust = 1, 437 | hjust = -0.25 438 | ) 439 | 440 | legend_p3 = cowplot::get_legend(ggplot(allquantilediff_facet_df_500_filtered, 
aes(x=age_cat, y=quantildiff90_10, color=Model, linetype=Model)) + #geom_point(size=5)+ 441 | geom_rect(allquantilediff_facet_df_500_filtered, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2, 442 | xmax = Inf, 443 | ymin = -Inf, 444 | ymax = Inf), 445 | alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) + 446 | geom_jitter(alpha=0.1, show.legend=FALSE) + 447 | geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 448 | facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller( 449 | function(value) { 450 | return(value) # Lets you change the facet labels 451 | }) 452 | ) + 453 | xlab("\n Age Category \n") + 454 | ylab(paste("\n 90th - 10th quantile of", runs,"Posteriors \n")) + 455 | scale_x_continuous(breaks= scales::pretty_breaks()) + 456 | theme_bw() + 457 | theme(plot.title = element_text(size = 50, face = "bold"), 458 | axis.text=element_text(size=35), 459 | axis.title=element_text(size=50, face="bold",margin=200), 460 | legend.text = element_text(size=50), 461 | legend.position = "bottom", 462 | legend.key.size = unit(25,"line"), 463 | legend.key.height = unit(7,"line"), 464 | legend.key = element_rect(fill = "transparent",color = "transparent"), 465 | legend.title = element_text(size=50, face="bold"), 466 | strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ), 467 | strip.background = element_rect(fill="transparent",color="transparent")) + 468 | scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 469 | guides(color = guide_legend(override.aes = list(size=20)))) 470 | 471 | png(allquantilediff_pngname, 472 | width = 3625, 473 | height = 2200) 474 | 475 | plot(cowplot::plot_grid(p3, 476 | legend_p3, 477 | ncol=1, 478 | rel_heights = c(3, 0.3))) 479 | dev.off() 480 | 481 | print(warnings()) 482 | -------------------------------------------------------------------------------- /simulation/proportion_agecat.R: 
-------------------------------------------------------------------------------- 1 | rm(list=ls()) 2 | gc() 3 | 4 | options(bitmapType="cairo") 5 | 6 | library(reshape2) 7 | library(rstan) 8 | library(dplyr) 9 | library(ggplot2) 10 | library(gridExtra) 11 | library(knitr) 12 | library(matrixStats) 13 | 14 | # Global params ------------------------------------------- 15 | 16 | save_ridgeplots = TRUE 17 | sample_size = 1000 18 | runs = 100 19 | r = 1:9/10 20 | income_multiplier = 1 # partitions income into more categories 21 | age_grouping_multiplier = 12 # how much we take the maximal poststratification. make sure this can divide 60 22 | number_of_states = 51 23 | 24 | response_tag = "binary" 25 | 26 | # --------------------------------------------------------- 27 | rw_baseline_agecat_bias_comparison = matrix(0, length(r), age_grouping_multiplier) 28 | ar_baseline_agecat_bias_comparison = matrix(0, length(r), age_grouping_multiplier) 29 | 30 | rw_baseline_agecat_sd_comparison = matrix(0, length(r), age_grouping_multiplier) 31 | ar_baseline_agecat_sd_comparison = matrix(0, length(r), age_grouping_multiplier) 32 | 33 | 34 | counter = 1 35 | prob_sampling_old = rep(0, length(r)) 36 | 37 | for (p in r) { 38 | 39 | load(paste0( 40 | runs, 41 | "_", 42 | age_grouping_multiplier, 43 | "_", 44 | income_multiplier, 45 | "_", 46 | sample_size, 47 | "_", 48 | response_tag, 49 | "_oldest_", 50 | p * 10, 51 | ".RData")) 52 | 53 | points_df_final = unique(readRDS("points_df_final.rds")[,1:2]) %>% arrange(age_cat) # the true poststratified preferences for every age category 54 | 55 | # Get the interpretable p which is the probability of sampling someone in ages 61 - 80 56 | poststrat_final_p = cbind(poststrat_final, 57 | (poststrat_final$p_response * poststrat_final$N)/sum((poststrat_final$p_response * poststrat_final$N))) # this is the probability of sampling someone in ages 61 - 80 58 | colnames(poststrat_final_p)[length(colnames(poststrat_final_p))] = "prob_sampling_old" 59 | 
prob_sampling_old[counter] = sum(poststrat_final_p[as.numeric(as.character(poststrat_final_p$age_cat)) >= (age_grouping_multiplier*2/3 + 1), 60 | c("prob_sampling_old")]) # very important 61 | 62 | rw_baseline_agecat_bias_comparison[counter,] = colSums((abs(sweep(median_quantile_list_rw, 63 | 2, 64 | points_df_final$mrp, 65 | "-")) - 66 | abs(sweep(median_quantile_list_baseline, 67 | 2, 68 | points_df_final$mrp, 69 | "-"))) <= 0)/dim(median_quantile_list_baseline)[1] 70 | 71 | ar_baseline_agecat_bias_comparison[counter,] = colSums((abs(sweep(median_quantile_list_ar, 72 | 2, 73 | points_df_final$mrp, 74 | "-")) - 75 | abs(sweep(median_quantile_list_baseline, 76 | 2, 77 | points_df_final$mrp, 78 | "-"))) <= 0)/dim(median_quantile_list_baseline)[1] 79 | 80 | 81 | rw_baseline_agecat_sd_comparison[counter,] = colSums(((quantile90_list_rw - quantile10_list_rw) - 82 | (quantile90_list_baseline - quantile10_list_baseline))<= 0)/dim(median_quantile_list_rw)[1] 83 | 84 | ar_baseline_agecat_sd_comparison[counter,] = colSums(((quantile90_list_ar - quantile10_list_ar) - 85 | (quantile90_list_baseline - quantile10_list_baseline))<= 0)/dim(median_quantile_list_rw)[1] 86 | 87 | counter = counter + 1 88 | } 89 | 90 | 91 | # ar model 92 | ar_baseline_agecat_bias_comparison_melted = melt(ar_baseline_agecat_bias_comparison) 93 | colnames(ar_baseline_agecat_bias_comparison_melted) = c("p", "age_cat", "proportion") 94 | ar_baseline_agecat_bias_comparison_melted$age_cat_grouped = "" 95 | 96 | ar_baseline_agecat_bias_comparison_melted[ar_baseline_agecat_bias_comparison_melted$age_cat %in% 1:4, c("age_cat_grouped")] = "Age categories 1-4" 97 | ar_baseline_agecat_bias_comparison_melted[ar_baseline_agecat_bias_comparison_melted$age_cat %in% 5:8, c("age_cat_grouped")] = "Age categories 5-8" 98 | ar_baseline_agecat_bias_comparison_melted[ar_baseline_agecat_bias_comparison_melted$age_cat %in% 9:12, c("age_cat_grouped")] = "Age categories 9-12" 99 | 100 | 101 | # rw model 102 | 
rw_baseline_agecat_bias_comparison_melted = melt(rw_baseline_agecat_bias_comparison) 103 | colnames(rw_baseline_agecat_bias_comparison_melted) = c("p", "age_cat", "proportion") 104 | rw_baseline_agecat_bias_comparison_melted$age_cat_grouped = "" 105 | 106 | rw_baseline_agecat_bias_comparison_melted[rw_baseline_agecat_bias_comparison_melted$age_cat %in% 1:4, c("age_cat_grouped")] = "Age categories 1-4" 107 | rw_baseline_agecat_bias_comparison_melted[rw_baseline_agecat_bias_comparison_melted$age_cat %in% 5:8, c("age_cat_grouped")] = "Age categories 5-8" 108 | rw_baseline_agecat_bias_comparison_melted[rw_baseline_agecat_bias_comparison_melted$age_cat %in% 9:12, c("age_cat_grouped")] = "Age categories 9-12" 109 | 110 | rw_ar_baseline_agecat_bias_comparison_melted = rbind(cbind(ar_baseline_agecat_bias_comparison_melted, Model = "AR - Baseline priors comparison"), 111 | cbind(rw_baseline_agecat_bias_comparison_melted, Model = "RW - Baseline priors comparison")) 112 | 113 | mapdf_prob_sampling_old_bias = data.frame(old = sort(unique(rw_ar_baseline_agecat_bias_comparison_melted$p)), 114 | new = prob_sampling_old) 115 | rw_ar_baseline_agecat_bias_comparison_melted$prob_sampling_old = round(mapdf_prob_sampling_old_bias$new[match(rw_ar_baseline_agecat_bias_comparison_melted$p, 116 | mapdf_prob_sampling_old_bias$old)], 117 | digits = 2) 118 | 119 | 120 | saveRDS(prob_sampling_old, 121 | "prob_sampling_old.rds") 122 | 123 | saveRDS(rw_ar_baseline_agecat_bias_comparison_melted, paste0("proportion_", 124 | runs, 125 | "_", 126 | age_grouping_multiplier, 127 | "_", 128 | income_multiplier, 129 | "_", 130 | sample_size,".rds")) 131 | 132 | png(filename = paste0("proportion_", 133 | runs, 134 | "_", 135 | age_grouping_multiplier, 136 | "_", 137 | income_multiplier, 138 | "_", 139 | sample_size,"smaller.png"), 140 | width = 5400/5, height = 2700/5) 141 | 142 | ggplot(rw_ar_baseline_agecat_bias_comparison_melted, aes(x=prob_sampling_old, y=proportion, group = age_cat)) + 143 | 
geom_line(aes(col=as.factor(age_cat_grouped)), show.legend = TRUE, size = 1) + 144 | scale_colour_manual(values=c("Age categories 1-4"="#fdcc8a", 145 | "Age categories 5-8"="#b30000", 146 | "Age categories 9-12"="#006d2c")) + 147 | facet_wrap(.~Model,ncol=2,nrow=1,labeller = as_labeller(function(value){return(value)})) + 148 | xlab("\n Probability of response for the elderly \n") + 149 | ylab("\n Proportion of the time that structured priors outperform \n") + 150 | theme_bw() + 151 | theme(plot.title = element_text(size = 50/5, face = "bold"), 152 | axis.text=element_text(size=50*1.6/5), 153 | axis.title=element_text(size=50*1.6/5, face="bold",margin=200/5), 154 | legend.text = element_text(size=50*1.6/5), 155 | legend.position = "bottom", 156 | legend.key.size = unit(30*1.6/5,"line"), 157 | legend.key.height = unit(7*1.6/5,"line"), 158 | legend.key = element_rect(fill = "transparent",color = "transparent"), 159 | legend.title = element_text(size=50*1.6/5, face="bold"), 160 | strip.text.x = element_text(size=50*1.6/5, face="bold", margin = margin(t=25/5,b=25/5) ), 161 | strip.background = element_rect(fill="transparent",color="transparent") 162 | ) + 163 | ylim(c(.2,1)) + 164 | geom_hline(yintercept = 0.5,linetype=2,size=1) + 165 | guides(col=guide_legend(title="Age category", 166 | override.aes = list(size = 10*1.6/5))) 167 | 168 | dev.off() 169 | 170 | 171 | 172 | # ar model sd 173 | ar_baseline_agecat_sd_comparison_melted = melt(ar_baseline_agecat_sd_comparison) 174 | colnames(ar_baseline_agecat_sd_comparison_melted) = c("p", "age_cat", "proportion") 175 | ar_baseline_agecat_sd_comparison_melted$age_cat_grouped = "" 176 | 177 | ar_baseline_agecat_sd_comparison_melted[ar_baseline_agecat_sd_comparison_melted$age_cat %in% 1:4, c("age_cat_grouped")] = "Age categories 1-4" 178 | ar_baseline_agecat_sd_comparison_melted[ar_baseline_agecat_sd_comparison_melted$age_cat %in% 5:8, c("age_cat_grouped")] = "Age categories 5-8" 179 | 
ar_baseline_agecat_sd_comparison_melted[ar_baseline_agecat_sd_comparison_melted$age_cat %in% 9:12, c("age_cat_grouped")] = "Age categories 9-12" 180 | 181 | 182 | # rw model sd 183 | rw_baseline_agecat_sd_comparison_melted = melt(rw_baseline_agecat_sd_comparison) 184 | colnames(rw_baseline_agecat_sd_comparison_melted) = c("p", "age_cat", "proportion") 185 | rw_baseline_agecat_sd_comparison_melted$age_cat_grouped = "" 186 | 187 | rw_baseline_agecat_sd_comparison_melted[rw_baseline_agecat_sd_comparison_melted$age_cat %in% 1:4, c("age_cat_grouped")] = "Age categories 1-4" 188 | rw_baseline_agecat_sd_comparison_melted[rw_baseline_agecat_sd_comparison_melted$age_cat %in% 5:8, c("age_cat_grouped")] = "Age categories 5-8" 189 | rw_baseline_agecat_sd_comparison_melted[rw_baseline_agecat_sd_comparison_melted$age_cat %in% 9:12, c("age_cat_grouped")] = "Age categories 9-12" 190 | 191 | rw_ar_baseline_agecat_sd_comparison_melted = rbind(cbind(ar_baseline_agecat_sd_comparison_melted, Model = "AR - Baseline priors comparison"), 192 | cbind(rw_baseline_agecat_sd_comparison_melted, Model = "RW - Baseline priors comparison")) 193 | 194 | rw_ar_baseline_agecat_sd_comparison_melted$prob_sampling_old = round(mapdf_prob_sampling_old_bias$new[match(rw_ar_baseline_agecat_sd_comparison_melted$p, 195 | mapdf_prob_sampling_old_bias$old)], 196 | digits = 2) 197 | 198 | 199 | saveRDS(rw_ar_baseline_agecat_sd_comparison_melted, paste0("proportion_sd_", 200 | runs, 201 | "_", 202 | age_grouping_multiplier, 203 | "_", 204 | income_multiplier, 205 | "_", 206 | sample_size,".rds")) 207 | 208 | png(filename = paste0("proportion_sd_", 209 | runs, 210 | "_", 211 | age_grouping_multiplier, 212 | "_", 213 | income_multiplier, 214 | "_", 215 | sample_size,"smaller.png"), 216 | width = 5400/5, height = 2700/5) 217 | 218 | ggplot(rw_ar_baseline_agecat_sd_comparison_melted, aes(x=prob_sampling_old, y=proportion, group = age_cat)) + 219 | geom_line(aes(col=as.factor(age_cat_grouped)), show.legend = TRUE, 
size = 1) + 220 | scale_colour_manual(values=c("Age categories 1-4"="#fdcc8a", 221 | "Age categories 5-8"="#b30000", 222 | "Age categories 9-12"="#006d2c")) + 223 | facet_wrap(.~Model,ncol=2,nrow=1,labeller = as_labeller(function(value){return(value)})) + 224 | xlab("\n Probability of response for the elderly \n") + 225 | ylab("\n Proportion sd \n") + 226 | theme_bw() + 227 | theme(plot.title = element_text(size = 50/5, face = "bold"), 228 | axis.text=element_text(size=50*1.6/5), 229 | axis.title=element_text(size=50*1.6/5, face="bold",margin=200/5), 230 | legend.text = element_text(size=50*1.6/5), 231 | legend.position = "bottom", 232 | legend.key.size = unit(30*1.6/5,"line"), 233 | legend.key.height = unit(7*1.6/5,"line"), 234 | legend.key = element_rect(fill = "transparent",color = "transparent"), 235 | legend.title = element_text(size=50*1.6/5, face="bold"), 236 | strip.text.x = element_text(size=50*1.6/5, face="bold", margin = margin(t=25/5,b=25/5) ), 237 | strip.background = element_rect(fill="transparent",color="transparent") 238 | ) + 239 | ylim(c(.1,1)) + 240 | geom_hline(yintercept = 0.5,linetype=2,size=1) + 241 | guides(col=guide_legend(title="Age category", 242 | override.aes = list(size = 10*1.6/5))) 243 | 244 | dev.off() 245 | 246 | print(colMeans(rw_baseline_agecat_bias_comparison)) 247 | print(colMeans(ar_baseline_agecat_bias_comparison)) 248 | 249 | print(colMeans(rw_baseline_agecat_sd_comparison)) 250 | print(colMeans(ar_baseline_agecat_sd_comparison)) 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | -------------------------------------------------------------------------------- /simulation/proposedN01_v3.stan: -------------------------------------------------------------------------------- 1 | data { 2 | int N; // number of samples 3 | int N_groups_age; // the number of groups for age 4 | int N_groups_income; // the number of groups for income 5 | int age[N]; // the column vector of design matrix X for age 6 | int income[N]; // the column 
vector of design matrix X for income 7 | 8 | int state_index[N]; // the index (from 1 to 51) representing the state for every datapoint 9 | int region_index[51]; // the index (from 1 to 5) representing the region the state is in for all 51 states 10 | real state_vs[51]; // the 2004 Republican vote share for every state. This is hard coded for now. 11 | real relig[51]; // the 2004 conservative religion percentage in every state 12 | 13 | int y[N]; // the response vector 14 | } 15 | parameters { 16 | vector[N_groups_age] U_age; // the random effect for age, not multiplied by sigma_age 17 | vector[N_groups_income] U_income; // the random effect for income, not multiplied by sigma_income 18 | 19 | vector[51] U_state; // the random effect for state, not multiplied by sigma_state 20 | vector[5] U_region; // the nested random effect for region, not multiplied by sigma_region 21 | 22 | real<lower=0> sigma_age; // sd of U_age (hyperparam). halfnormal prior put on this. 23 | real<lower=0> sigma_income; // sd of U_income (hyperparam). halfnormal prior put on this. 24 | 25 | real<lower=0> sigma_state; // sd of state 26 | real<lower=0> sigma_region; // sd of region 27 | real beta_state; // coeff. for Republican vote share in every state 28 | real beta_relig; // coeff.
for conservative religion share in every state 29 | 30 | real intercept; // the intercept (global fixed effect) 31 | 32 | } 33 | transformed parameters { 34 | vector[N_groups_age] U_age_transformed; 35 | vector[N_groups_income] U_income_transformed; 36 | 37 | vector[51] U_state_transformed; 38 | vector[5] U_region_transformed; 39 | 40 | vector[N] yhat; 41 | 42 | U_age_transformed = sigma_age * U_age; // the random effect for age 43 | U_income_transformed = sigma_income * U_income; // the random effect for income 44 | U_region_transformed = sigma_region * U_region; // the nested random effect for region 45 | 46 | // noncentered parameterization for U_state_transformed 47 | for (j in 1:51) { 48 | U_state_transformed[j] = (U_state[j] * sigma_state) + ( U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]) ); 49 | } 50 | 51 | for (i in 1:N) { 52 | yhat[i] = intercept + U_age_transformed[age[i]] + U_income_transformed[income[i]] + U_state_transformed[state_index[i]]; // the linear predictor at each point 53 | } 54 | 55 | } 56 | model { 57 | sigma_age ~ normal(0,1); 58 | sigma_income ~ normal(0,1); 59 | 60 | sigma_region ~ normal(0,1); 61 | sigma_state ~ normal(0, 1); 62 | 63 | // U_age[1] ~ normal(0,1); // the random walk prior puts no marginal on the first state 64 | for (j in 2:N_groups_age) { 65 | U_age[j] ~ normal(U_age[j-1], 1); 66 | } 67 | 68 | sum(U_age) ~ normal(0, 0.01 * N_groups_age); // soft sum-to-zero constraint that identifies the otherwise-improper RW(1) prior.
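A note on the soft sum-to-zero statement above: an RW(1) prior is invariant to adding a constant to every element of U_age, so without a constraint the intercept and the age effects are only jointly identified. An alternative sketch (not the parameterization used in this repo) enforces the constraint exactly by declaring one fewer free parameter:

```stan
// Hard sum-to-zero alternative (sketch): N_groups_age - 1 free values,
// with the last element defined so the vector sums to zero exactly.
parameters {
  vector[N_groups_age - 1] U_age_free;
}
transformed parameters {
  vector[N_groups_age] U_age = append_row(U_age_free, -sum(U_age_free));
}
```

The soft version used here keeps all N_groups_age parameters, which combines more directly with the non-centered scaling by sigma_age.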
69 | 70 | U_income ~ normal(0, 1); // random effect for income is normal 71 | 72 | U_region ~ normal(0,1); 73 | 74 | U_state ~ normal(0, 1); 75 | 76 | beta_state ~ normal(0, 1); 77 | beta_relig ~ normal(0, 1); 78 | intercept ~ normal(0, 1); 79 | 80 | 81 | for (i in 1:N) { 82 | y[i] ~ bernoulli(inv_logit(yhat[i])); // the response 83 | } 84 | 85 | } 86 | -------------------------------------------------------------------------------- /simulation/proposedarN01_v3.stan: -------------------------------------------------------------------------------- 1 | data { 2 | int N; // number of samples 3 | int N_groups_age; // the number of groups for age 4 | int N_groups_income; // the number of groups for income 5 | int age[N]; // the column vector of design matrix X for age 6 | int income[N]; // the column vector of design matrix X for income 7 | 8 | int state_index[N]; // the index (from 1 to 51) representing the state for every datapoint 9 | int region_index[51]; // the index (from 1 to 5) representing the region the state is in for all 51 states 10 | real state_vs[51]; // the 2004 Republican vote share for every state. This is hard coded for now. 11 | real relig[51]; // the 2004 conservative religion percentage in every state 12 | 13 | int y[N]; // the response vector 14 | } 15 | parameters { 16 | vector[N_groups_age] U_age; // the random effect for age, not multiplied by sigma_age 17 | vector[N_groups_income] U_income; // the random effect for income, not multiplied by sigma_income 18 | 19 | vector[51] U_state; // the random effect for state, not multiplied by sigma_state 20 | vector[5] U_region; // the nested random effect for region, not multiplied by sigma_region 21 | 22 | real<lower=0> sigma_age; // sd of U_age (hyperparam). halfnormal prior put on this. 23 | real<lower=0> sigma_income; // sd of U_income (hyperparam). halfnormal prior put on this. 24 | 25 | real<lower=0> sigma_state; // sd of state 26 | real<lower=0> sigma_region; // sd of region 27 | real beta_state; // coeff.
for Republican vote share in every state 28 | real beta_relig; // coeff. for conservative religion share in every state 29 | 30 | real intercept; // the intercept (global fixed effect) 31 | real<lower=0, upper=1> rho; // the autoregressive coefficient, untransformed (mapped to (-1, 1) below) 32 | } 33 | transformed parameters { 34 | vector[N_groups_age] U_age_transformed; 35 | vector[N_groups_income] U_income_transformed; 36 | 37 | vector[51] U_state_transformed; 38 | vector[5] U_region_transformed; 39 | 40 | vector[N] yhat; 41 | real rho_transformed; 42 | 43 | rho_transformed = (rho * 2) - 1; // the autoregressive coefficient in (-1, 1) 44 | 45 | U_age_transformed = sigma_age * U_age; // the random effect for age 46 | U_income_transformed = sigma_income * U_income; // the random effect for income 47 | U_region_transformed = sigma_region * U_region; // the random effect for region 48 | 49 | // noncentered parameterization for U_state_transformed 50 | for (j in 1:51) { 51 | U_state_transformed[j] = (U_state[j] * sigma_state) + ( U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]) ); 52 | } 53 | 54 | for (i in 1:N) { 55 | yhat[i] = intercept + U_age_transformed[age[i]] + U_income_transformed[income[i]] + U_state_transformed[state_index[i]]; // the linear predictor at each point 56 | } 57 | 58 | } 59 | model { 60 | sigma_age ~ normal(0, 1); // sigma_A ~ halfnormal(0,1) 61 | sigma_income ~ normal(0, 1); // sigma_I ~ halfnormal(0,1) 62 | rho ~ beta(0.5, 0.5); // prior on autoregressive coefficient 63 | 64 | sigma_region ~ normal(0, 1); 65 | sigma_state ~ normal(0, 1); 66 | 67 | U_age[1] ~ normal(0, 1/sqrt(1-rho_transformed^2)); // stationary initial condition: the marginal sd of an AR(1) with unit innovation sd is 1/sqrt(1 - rho^2), not 1 68 | for (j in 2:N_groups_age) { 69 | U_age[j] ~ normal(rho_transformed * U_age[j-1], 1); 70 | } 71 | 72 | U_income ~ normal(0, 1); // random effect for income is normal 73 | 74 | U_region ~ normal(0, 1); 75 | 76 | U_state ~ normal(0, 1); 77 | 78 | beta_state ~ normal(0, 1); 79 | beta_relig ~ normal(0, 1); 80 |
intercept ~ normal(0, 1); 81 | 82 | 83 | for (i in 1:N) { 84 | y[i] ~ bernoulli(inv_logit(yhat[i])); // the response 85 | } 86 | 87 | } 88 | -------------------------------------------------------------------------------- /simulation/threemodelwriteup_v3_posteriorvariance.R: -------------------------------------------------------------------------------- 1 | rm(list=ls()) 2 | 3 | options(bitmapType="cairo") 4 | # bias plots for new pipeline 5 | 6 | library(reshape2) 7 | library(rstan) 8 | library(dplyr) 9 | library(ggplot2) 10 | library(gridExtra) 11 | library(knitr) 12 | library(ggridges) 13 | library(matrixStats) 14 | 15 | # Global params ------------------------------------------- 16 | 17 | save_ridgeplots = TRUE 18 | sample_size = 500 19 | runs = 200 20 | r = 1:9/10 21 | income_multiplier = 1 # partitions income into more categories 22 | age_grouping_multiplier = 12 # how much we take the maximal poststratification. make sure this can divide 60 23 | number_of_states = 51 24 | 25 | response_tag = "binary" 26 | 27 | # colour scheme for biasfacet 28 | use_biascolourscheme = TRUE 29 | biascolourscheme = c("1"="#fbb4b9", 30 | "2"="#f768a1", 31 | "3"="#c51b8a", 32 | "4"="#7a0177", 33 | "5"="#fdcc8a", 34 | "6"="#fc8d59", 35 | "7"="#e34a33", 36 | "8"="#b30000", 37 | "9"="#b2e2e2", 38 | "10"="#66c2a4", 39 | "11"="#2ca25f", 40 | "12"="#006d2c") 41 | 42 | 43 | # --------------------------------------------------------- 44 | # df for 3 regression lines with their standard deviation bands : median of medians 45 | regression_df_all200 = c() 46 | 47 | # df for 3 regression lines with their standard deviation bands : median of means 48 | regression_df_mean_all200 = c() 49 | 50 | # bias for baseline model 51 | avg_bias_mat_baseline = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the mean of bias cell-wise (mean of mean bias) 52 | avg_bias_sd_mat_baseline = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * 
number_of_states) # the sd of the mean of bias cell-wise 53 | avg_value_mat_baseline = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # posterior values of 12 cells, for each value of p 54 | 55 | # bias for ar model 56 | avg_bias_mat_ar = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the mean of bias cell-wise (mean of mean bias) 57 | avg_bias_sd_mat_ar = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the sd of the mean of bias cell-wise 58 | avg_value_mat_ar = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # posterior values of 12 cells, for each value of p 59 | 60 | mean_of_medians_ar = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the mean of median bias cell-wise 61 | mean_of_medians_sd_ar = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) 62 | 63 | # bias for rw model 64 | avg_bias_mat_rw = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the mean of bias cell-wise (mean of mean bias) 65 | avg_bias_sd_mat_rw = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the sd of the mean of bias cell-wise 66 | avg_value_mat_rw = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # posterior values of 12 cells, for each value of p 67 | 68 | mean_of_medians_rw = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the mean of median bias cell-wise 69 | mean_of_medians_sd_rw = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) 70 | 71 | # df for 3 regression lines with their standard deviation bands : median of medians 72 | regression_df = c() 73 | 74 | # df for 3 regression lines with their standard deviation bands 
: median of means 75 | regression_df_mean = c() 76 | 77 | # df for majority vote for all three models 78 | majorityvote_df = c() 79 | 80 | lowerboundary = 0.05 81 | upperboundary = 0.95 82 | 83 | counter = 1 84 | prob_sampling_old = rep(0, length(r)) 85 | 86 | # --- 87 | cm_quantiles_mat_bl = matrix(0, length(r), age_grouping_multiplier) 88 | cm_quantiles_mat_ar = matrix(0, length(r), age_grouping_multiplier) 89 | cm_quantiles_mat_rw = matrix(0, length(r), age_grouping_multiplier) 90 | 91 | regression_df_all200_postsd = c() 92 | # --- 93 | 94 | for (p in r) { 95 | 96 | load(paste0( 97 | runs, 98 | "_", 99 | age_grouping_multiplier, 100 | "_", 101 | income_multiplier, 102 | "_", 103 | sample_size, 104 | "_", 105 | response_tag, 106 | "_oldest_", 107 | p * 10, 108 | ".RData")) 109 | 110 | # --- 111 | cm_quantiles_mat_bl[counter,] = colMeans(quantile90_list_baseline - quantile10_list_baseline) 112 | cm_quantiles_mat_ar[counter,] = colMeans(quantile90_list_ar - quantile10_list_ar) 113 | cm_quantiles_mat_rw[counter,] = colMeans(quantile90_list_rw - quantile10_list_rw) 114 | 115 | quantilediff_list_baseline_melted = melt(quantile90_list_baseline - quantile10_list_baseline) 116 | quantilediff_list_ar_melted = melt(quantile90_list_ar - quantile10_list_ar) 117 | quantilediff_list_rw_melted = melt(quantile90_list_rw - quantile10_list_rw) 118 | 119 | regression_df_all200_postsd = rbind(regression_df_all200_postsd, 120 | data.frame(cbind(quantilediff_list_baseline_melted, 121 | type="Baseline", 122 | p)), 123 | data.frame(cbind(quantilediff_list_ar_melted, 124 | type="Autoregressive", 125 | p)), 126 | data.frame(cbind(quantilediff_list_rw_melted, 127 | type="Random walk", 128 | p)) 129 | ) 130 | 131 | # --- 132 | 133 | # correct level order for the 4 ridgeplots below 134 | correct_levels = c() 135 | for (g in 1:age_grouping_multiplier) { 136 | correct_levels = c(correct_levels, 137 | seq(from = g, to = age_grouping_multiplier * 3, by = age_grouping_multiplier)) 138 | } 139 | 
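As a side note, the interleaving produced by the `correct_levels` loop above can be checked with a small self-contained sketch (using `age_grouping_multiplier = 4` purely for illustration; not part of the pipeline):

```r
# Standalone sketch (illustrative, not pipeline code): with
# age_grouping_multiplier = 4 there are 4 * 3 = 12 factor levels, and the
# loop reorders them age-group-first rather than variable-first.
demo_levels = c()
for (g in 1:4) {
  demo_levels = c(demo_levels, seq(from = g, to = 4 * 3, by = 4))
}
demo_levels  # 1 5 9 2 6 10 3 7 11 4 8 12
```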
140 | # Get the interpretable p which is the probability of sampling someone in ages 61 - 80 141 | poststrat_final_p = cbind(poststrat_final, 142 | (poststrat_final$p_response * poststrat_final$N)/sum((poststrat_final$p_response * poststrat_final$N))) # this is the probability of sampling someone in ages 61 - 80 143 | colnames(poststrat_final_p)[length(colnames(poststrat_final_p))] = "prob_sampling_old" 144 | prob_sampling_old[counter] = sum(poststrat_final_p[as.numeric(as.character(poststrat_final_p$age_cat)) >= (age_grouping_multiplier*2/3 + 1), 145 | c("prob_sampling_old")]) # very important 146 | 147 | # baseline 148 | avg_bias_mat_baseline[counter,] = colMeans(sweep(sample_cell_estimates_median_baseline, 149 | 2, 150 | poststrat_final_reduced$true_pref_grouped_final, 151 | "-")) 152 | 153 | avg_bias_sd_mat_baseline[counter,] = apply(sweep(sample_cell_estimates_median_baseline, 154 | 2, 155 | poststrat_final_reduced$true_pref_grouped_final, 156 | "-"), 157 | 2, 158 | sd) 159 | 160 | avg_value_mat_baseline[counter,] = colMeans(sample_cell_estimates_median_baseline) 161 | 162 | # autoregressive 163 | avg_bias_mat_ar[counter,] = colMeans(sweep(sample_cell_estimates_median_ar, 164 | 2, 165 | poststrat_final_reduced$true_pref_grouped_final, 166 | "-")) 167 | 168 | avg_bias_sd_mat_ar[counter,] = apply(sweep(sample_cell_estimates_median_ar, 169 | 2, 170 | poststrat_final_reduced$true_pref_grouped_final, "-"), 171 | 2, 172 | sd) 173 | 174 | avg_value_mat_ar[counter,] = colMeans(sample_cell_estimates_median_ar) 175 | 176 | # random walk 177 | avg_bias_mat_rw[counter,] = colMeans(sweep(sample_cell_estimates_median_rw, 178 | 2, 179 | poststrat_final_reduced$true_pref_grouped_final, 180 | "-")) 181 | 182 | avg_bias_sd_mat_rw[counter,] = apply(sweep(sample_cell_estimates_median_rw, 183 | 2, 184 | poststrat_final_reduced$true_pref_grouped_final, "-"), 185 | 2, 186 | sd) 187 | 188 | avg_value_mat_rw[counter,] = colMeans(sample_cell_estimates_median_rw) 189 | 190 | # --- 191 | 
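The cell-wise bias summaries above all follow the same `sweep()` pattern; a minimal standalone sketch with toy numbers (2 runs, 2 cells, hypothetical values, not pipeline data):

```r
# Standalone sketch (toy numbers): sweep() subtracts the vector of true cell
# preferences from each row of a runs x cells matrix of posterior medians;
# colMeans() and apply(..., 2, sd) then summarise the bias per cell.
est = matrix(c(0.52, 0.48,
               0.55, 0.45),
             nrow = 2, byrow = TRUE)  # 2 simulation runs x 2 cells
truth = c(0.50, 0.50)                 # true cell preferences
bias = sweep(est, 2, truth, "-")      # per-run, per-cell bias
colMeans(bias)                        # mean bias per cell: 0.035 -0.035
apply(bias, 2, sd)                    # sd of the bias per cell
```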
regression_df = rbind(regression_df, 192 | data.frame(cbind(1:age_grouping_multiplier, 193 | colMedians(median_quantile_list_baseline), 194 | "Baseline", 195 | p)), 196 | data.frame(cbind(1:age_grouping_multiplier, 197 | colMedians(median_quantile_list_ar), 198 | "Autoregressive", 199 | p)), 200 | data.frame(cbind(1:age_grouping_multiplier, 201 | colMedians(median_quantile_list_rw), 202 | "Random walk", 203 | p)) 204 | ) 205 | # --- 206 | 207 | regression_df_mean = rbind(regression_df_mean, 208 | data.frame(cbind(1:age_grouping_multiplier, 209 | colMedians(mean_list_baseline), 210 | "Baseline", 211 | p)), 212 | data.frame(cbind(1:age_grouping_multiplier, 213 | colMedians(mean_list_ar), 214 | "Autoregressive", 215 | p)), 216 | data.frame(cbind(1:age_grouping_multiplier, 217 | colMedians(mean_list_rw), 218 | "Random walk", 219 | p)) 220 | ) 221 | 222 | # --- 223 | 224 | median_quantile_list_baseline_melted = melt(median_quantile_list_baseline) 225 | median_quantile_list_ar_melted = melt(median_quantile_list_ar) 226 | median_quantile_list_rw_melted = melt(median_quantile_list_rw) 227 | 228 | regression_df_all200 = rbind(regression_df_all200, 229 | data.frame(cbind(median_quantile_list_baseline_melted, 230 | type="Baseline", 231 | p)), 232 | data.frame(cbind(median_quantile_list_ar_melted, 233 | type="Autoregressive", 234 | p)), 235 | data.frame(cbind(median_quantile_list_rw_melted, 236 | type="Random walk", 237 | p)) 238 | ) 239 | 240 | # --- 241 | 242 | mean_list_baseline_melted = melt(mean_list_baseline) 243 | mean_list_ar_melted = melt(mean_list_ar) 244 | mean_list_rw_melted = melt(mean_list_rw) 245 | 246 | regression_df_mean_all200 = rbind(regression_df_mean_all200, 247 | data.frame(cbind(mean_list_baseline_melted, 248 | type="Baseline", 249 | p)), 250 | data.frame(cbind(mean_list_ar_melted, 251 | type="Autoregressive", 252 | p)), 253 | data.frame(cbind(mean_list_rw_melted, 254 | type="Random walk", 255 | p)) 256 | ) 257 | 258 | # --- 259 | 260 | majorityvote_df = 
rbind(majorityvote_df, 261 | data.frame(majorityvote = model_popn_pref_mat_baseline, 262 | type = "Baseline", 263 | p), 264 | data.frame(majorityvote = model_popn_pref_mat_ar, 265 | type = "Autoregressive", 266 | p), 267 | data.frame(majorityvote = model_popn_pref_mat_rw, 268 | type = "Random walk", 269 | p) 270 | ) 271 | 272 | counter = counter + 1 273 | } 274 | 275 | nonboundary_indices = (poststrat_final_reduced$true_pref_grouped_final > lowerboundary) & (poststrat_final_reduced$true_pref_grouped_final < upperboundary) 276 | 277 | avg_value_mat_baseline = as.data.frame(avg_value_mat_baseline) 278 | avg_value_mat_baseline = avg_value_mat_baseline[,nonboundary_indices] 279 | df_val.melted_baseline = cbind( melt(avg_value_mat_baseline), rep(r, sum(nonboundary_indices)) ) 280 | colnames(df_val.melted_baseline) = c("cell", "value", "p_response_3") 281 | 282 | avg_value_mat_ar = as.data.frame(avg_value_mat_ar) 283 | avg_value_mat_ar = avg_value_mat_ar[,nonboundary_indices] 284 | df_val.melted_ar = cbind(melt(avg_value_mat_ar), rep(r, sum(nonboundary_indices)) ) 285 | colnames(df_val.melted_ar) = c("cell", "value", "p_response_3") 286 | 287 | avg_value_mat_rw = as.data.frame(avg_value_mat_rw) 288 | avg_value_mat_rw = avg_value_mat_rw[,nonboundary_indices] 289 | df_val.melted_rw = cbind(melt(avg_value_mat_rw), rep(r, sum(nonboundary_indices)) ) 290 | colnames(df_val.melted_rw) = c("cell", "value", "p_response_3") 291 | 292 | 293 | poststrat_final_V = cbind(paste0("V", rownames(poststrat_final_reduced)), 294 | poststrat_final_reduced, 295 | stringsAsFactors=FALSE) 296 | colnames(poststrat_final_V)[1] = "cell" 297 | 298 | df_val.melted_baseline$cell = as.character(df_val.melted_baseline$cell) 299 | df_val.melted_ar$cell = as.character(df_val.melted_ar$cell) 300 | df_val.melted_rw$cell = as.character(df_val.melted_rw$cell) 301 | 302 | df_val.melted_V_baseline = dplyr::inner_join(x = df_val.melted_baseline, y = poststrat_final_V, by="cell") 303 | 
df_val.melted_V_unique_baseline = unique(df_val.melted_V_baseline[,c("cell", "true_pref_grouped_final")]) 304 | df_val.melted_V_baseline$true_pref_grouped_final = factor(df_val.melted_V_baseline$true_pref_grouped_final) 305 | 306 | df_val.melted_V_ar = dplyr::inner_join(x = df_val.melted_ar, y = poststrat_final_V, by="cell") 307 | df_val.melted_V_unique_ar = unique(df_val.melted_V_ar[,c("cell", "true_pref_grouped_final")]) 308 | df_val.melted_V_ar$true_pref_grouped_final = factor(df_val.melted_V_ar$true_pref_grouped_final) 309 | 310 | df_val.melted_V_rw = dplyr::inner_join(x = df_val.melted_rw, y = poststrat_final_V, by="cell") 311 | df_val.melted_V_unique_rw = unique(df_val.melted_V_rw[,c("cell", "true_pref_grouped_final")]) 312 | df_val.melted_V_rw$true_pref_grouped_final = factor(df_val.melted_V_rw$true_pref_grouped_final) 313 | 314 | 315 | d_temp_baseline = ggplot(df_val.melted_V_baseline, aes(x=p_response_3, y=value, group=cell)) + 316 | geom_line(aes(col=as.factor(age_cat)), show.legend = TRUE, size = 1) + 317 | geom_hline(data = poststrat_final_V, mapping = aes(yintercept = true_pref_grouped_final, color = as.factor(age_cat)), linetype = "dashed", show.legend = FALSE) + 318 | xlab("Probability of response for oldest age group") + 319 | ylab("Baseline Avg. of avg. of PLD colour-coded by truth") + 320 | ylim(0, 1) + 321 | theme_minimal() + 322 | theme(legend.position="top", 323 | legend.title = element_blank()) 324 | 325 | d_temp_ar = ggplot(df_val.melted_V_ar, aes(x=p_response_3, y=value, group=cell)) + 326 | geom_line(aes(col=as.factor(age_cat)), show.legend = TRUE, size = 1) + 327 | geom_hline(data = poststrat_final_V, mapping = aes(yintercept = true_pref_grouped_final, color = as.factor(age_cat)), linetype = "dashed", show.legend = FALSE) + 328 | xlab("Probability of response for oldest age group") + 329 | ylab("AR Avg. of avg. 
of PLD colour-coded by truth") + 330 | ylim(0, 1) + 331 | theme_minimal() + 332 | theme(legend.position="top", 333 | legend.title = element_blank()) 334 | 335 | d_temp_rw = ggplot(df_val.melted_V_rw, aes(x=p_response_3, y=value, group=cell)) + 336 | geom_line(aes(col=as.factor(age_cat)), show.legend = TRUE, size = 1) + 337 | geom_hline(data = poststrat_final_V, mapping = aes(yintercept = true_pref_grouped_final, color = as.factor(age_cat)), linetype = "dashed", show.legend = FALSE) + 338 | xlab("Probability of response for oldest age group") + 339 | ylab("RW Avg. of avg. of PLD colour-coded by truth") + 340 | ylim(0, 1) + 341 | theme_minimal() + 342 | theme(legend.position="top", 343 | legend.title = element_blank()) 344 | 345 | 346 | 347 | # BIAS ------- 348 | 349 | avg_bias_mat_baseline = as.data.frame(avg_bias_mat_baseline) 350 | avg_bias_mat_baseline = avg_bias_mat_baseline[,nonboundary_indices] 351 | df_bias.melted_baseline = cbind( melt(avg_bias_mat_baseline), rep(r, sum(nonboundary_indices)) ) 352 | colnames(df_bias.melted_baseline) = c("cell", "value", "p_response_3") 353 | 354 | avg_bias_mat_ar = as.data.frame(avg_bias_mat_ar) 355 | avg_bias_mat_ar = avg_bias_mat_ar[,nonboundary_indices] 356 | df_bias.melted_ar = cbind(melt(avg_bias_mat_ar), rep(r, sum(nonboundary_indices)) ) 357 | colnames(df_bias.melted_ar) = c("cell", "value", "p_response_3") 358 | 359 | avg_bias_mat_rw = as.data.frame(avg_bias_mat_rw) 360 | avg_bias_mat_rw = avg_bias_mat_rw[,nonboundary_indices] 361 | df_bias.melted_rw = cbind(melt(avg_bias_mat_rw), rep(r, sum(nonboundary_indices)) ) 362 | colnames(df_bias.melted_rw) = c("cell", "value", "p_response_3") 363 | 364 | df_bias.melted_baseline$cell = as.character(df_bias.melted_baseline$cell) 365 | df_bias.melted_ar$cell = as.character(df_bias.melted_ar$cell) 366 | df_bias.melted_rw$cell = as.character(df_bias.melted_rw$cell) 367 | 368 | df_bias.melted_V_baseline = dplyr::inner_join(x = df_bias.melted_baseline, y = poststrat_final_V, 
by="cell") 369 | df_bias.melted_V_unique_baseline = unique(df_bias.melted_V_baseline[,c("cell", "true_pref_grouped_final")]) 370 | df_bias.melted_V_baseline$true_pref_grouped_final = factor(df_bias.melted_V_baseline$true_pref_grouped_final) 371 | 372 | df_bias.melted_V_ar = dplyr::inner_join(x = df_bias.melted_ar, y = poststrat_final_V, by="cell") 373 | df_bias.melted_V_unique_ar = unique(df_bias.melted_V_ar[,c("cell", "true_pref_grouped_final")]) 374 | df_bias.melted_V_ar$true_pref_grouped_final = factor(df_bias.melted_V_ar$true_pref_grouped_final) 375 | 376 | df_bias.melted_V_rw = dplyr::inner_join(x = df_bias.melted_rw, y = poststrat_final_V, by="cell") 377 | df_bias.melted_V_unique_rw = unique(df_bias.melted_V_rw[,c("cell", "true_pref_grouped_final")]) 378 | df_bias.melted_V_rw$true_pref_grouped_final = factor(df_bias.melted_V_rw$true_pref_grouped_final) 379 | 380 | 381 | bias_facet_df = rbind(cbind(df_bias.melted_V_baseline, Model = "Baseline"), 382 | cbind(df_bias.melted_V_ar, Model = "Autoregressive"), 383 | cbind(df_bias.melted_V_rw, Model = "Random walk")) 384 | 385 | mapdf_prob_sampling_old_bias = data.frame(old = sort(unique(bias_facet_df$p_response_3)), 386 | new = prob_sampling_old) 387 | bias_facet_df$prob_sampling_old = round(mapdf_prob_sampling_old_bias$new[match(bias_facet_df$p_response_3, 388 | mapdf_prob_sampling_old_bias$old)], 389 | digits = 2) 390 | 391 | bias_facet_df$age_cat = factor(bias_facet_df$age_cat) 392 | 393 | saveRDS(bias_facet_df, paste0("biasfacet_", 394 | runs, 395 | "_", 396 | age_grouping_multiplier, 397 | "_", 398 | income_multiplier, 399 | "_", 400 | sample_size,".rds")) 401 | 402 | png(filename = paste0("biasfacet_", 403 | runs, 404 | "_", 405 | age_grouping_multiplier, 406 | "_", 407 | income_multiplier, 408 | "_", 409 | sample_size,".png"), 410 | width = 4800, height = 2400) 411 | 412 | if (use_biascolourscheme==TRUE) { 413 | 414 | plot( 415 | ggplot(bias_facet_df, aes(x=prob_sampling_old, y=value, group=cell)) + 416 | 
geom_line(aes(col=as.factor(age_cat)), show.legend = TRUE, size = 0.5) + 417 | facet_wrap(. ~ Model,ncol=3,nrow=1, labeller = as_labeller( 418 | function(value) { 419 | return(value) # Lets you change the facet labels 420 | }) 421 | ) + 422 | xlab("\n Probability of response for the elderly \n") + 423 | ylab("\n Average Bias \n") + 424 | scale_colour_manual(values=c("1"="#fbb4b9", 425 | "2"="#f768a1", 426 | "3"="#c51b8a", 427 | "4"="#7a0177", 428 | "5"="#fdcc8a", 429 | "6"="#fc8d59", 430 | "7"="#e34a33", 431 | "8"="#b30000", 432 | "9"="#b2e2e2", 433 | "10"="#66c2a4", 434 | "11"="#2ca25f", 435 | "12"="#006d2c") 436 | ) + 437 | theme_bw() + 438 | theme(plot.title = element_text(size = 50, face = "bold"), 439 | axis.text=element_text(size=35*1.6), 440 | axis.title=element_text(size=50*1.6, face="bold",margin=200), 441 | legend.text = element_text(size=50*1.6), 442 | legend.position = "bottom", 443 | legend.key.size = unit(30*1.6,"line"), 444 | legend.key.height = unit(7*1.6,"line"), 445 | legend.key = element_rect(fill = "transparent",color = "transparent"), 446 | legend.title = element_text(size=50*1.6, face="bold"), 447 | strip.text.x = element_text(size=50*1.6, face="bold", margin = margin(t=25,b=25) ), 448 | strip.background = element_rect(fill="transparent",color="transparent") 449 | ) + 450 | guides(col=guide_legend(title="Age category", 451 | override.aes = list(size = 10*1.6))) 452 | 453 | ) 454 | 455 | }else { 456 | plot( 457 | ggplot(bias_facet_df, aes(x=prob_sampling_old, y=value, group=cell)) + 458 | geom_line(aes(col=as.factor(age_cat)), show.legend = TRUE, size = 0.5) + 459 | facet_wrap(. 
~ Model,ncol=3,nrow=1, labeller = as_labeller( 460 | function(value) { 461 | return(value) # Lets you change the facet labels 462 | }) 463 | ) + 464 | xlab("\n Probability of response for the elderly \n") + 465 | ylab("\n Average Bias \n") + 466 | theme_bw() + 467 | theme(plot.title = element_text(size = 50, face = "bold"), 468 | axis.text=element_text(size=35*1.6), 469 | axis.title=element_text(size=50*1.6, face="bold",margin=200), 470 | legend.text = element_text(size=50*1.6), 471 | legend.position = "bottom", 472 | legend.key.size = unit(30*1.6,"line"), 473 | legend.key.height = unit(7*1.6,"line"), 474 | legend.key = element_rect(fill = "transparent",color = "transparent"), 475 | legend.title = element_text(size=50*1.6, face="bold"), 476 | strip.text.x = element_text(size=50*1.6, face="bold", margin = margin(t=25,b=25) ), 477 | strip.background = element_rect(fill="transparent",color="transparent")) + 478 | guides(col=guide_legend(title="Age category", 479 | override.aes = list(size = 10*1.6))) 480 | ) 481 | } 482 | 483 | dev.off() 484 | 485 | 486 | # all medians of posteriors 487 | 488 | regression_df_all200$Var2 = factor(regression_df_all200$Var2) 489 | colnames(regression_df_all200) = c("n", "age_cat", "posterior_medians", "Model", "p") 490 | 491 | mapdf_prob_sampling_old = data.frame(old = sort(unique(regression_df_all200$p)), 492 | new = prob_sampling_old) 493 | regression_df_all200$prob_sampling_old = round(mapdf_prob_sampling_old$new[match(regression_df_all200$p, 494 | mapdf_prob_sampling_old$old)], 495 | digits = 2) 496 | 497 | regression_df_all200$age_cat = as.numeric(as.character(regression_df_all200$age_cat)) 498 | 499 | # save regression_df_all200 500 | saveRDS(regression_df_all200, 501 | paste0("allmedians_facet", 502 | "_", 503 | runs, 504 | "_", 505 | age_grouping_multiplier, 506 | "_", 507 | sample_size, 508 | "_", 509 | ".rds")) 510 | 511 | saveRDS(points_df_final[,c("age_cat", "mrp", "Type")], 512 | "points_df_final.rds") 513 | 514 | 
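The `saveRDS()` and `png()` calls in this script share one `paste0()` filename convention; a small sketch with hypothetical parameter values:

```r
# Standalone sketch (hypothetical values): configuration parameters are joined
# with underscores so each run configuration writes to a distinct file.
runs_demo = 200
age_grouping_demo = 12
sample_size_demo = 100
paste0("allmedians_facet", "_", runs_demo, "_", age_grouping_demo, "_",
       sample_size_demo, "_", ".rds")
# "allmedians_facet_200_12_100_.rds"
```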
png(paste0("allmedians_facet", 515 | "_", 516 | runs, 517 | "_", 518 | age_grouping_multiplier, 519 | "_", 520 | sample_size, 521 | "_", 522 | ".png"), 523 | width=3000, height=2400) 524 | 525 | 526 | plot( 527 | ggplot(regression_df_all200, aes(x=age_cat, y=posterior_medians, color=Model, linetype=Model)) + #geom_point(size=5)+ 528 | geom_rect(regression_df_all200, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2, 529 | xmax = Inf, 530 | ymin = -Inf, 531 | ymax = Inf), 532 | alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) + 533 | geom_jitter(alpha=0.1, show.legend=FALSE) + 534 | geom_point(aes(x = age_cat, y = mrp), 535 | size=7, 536 | colour = "black", 537 | fill="black", 538 | show.legend = F, 539 | data = points_df_final, 540 | inherit.aes = F) + 541 | geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 542 | facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller( 543 | function(value) { 544 | return(value) # Lets you change the facet labels 545 | }) 546 | ) + 547 | xlab("\n Age Category \n") + 548 | ylab(paste("\n Median of", runs,"Posteriors \n")) + 549 | scale_x_continuous(breaks= scales::pretty_breaks()) + 550 | theme_bw() + 551 | theme(plot.title = element_text(size = 50, face = "bold"), 552 | axis.text=element_text(size=50), 553 | axis.title=element_text(size=50, face="bold",margin=200), 554 | legend.text = element_text(size=50), 555 | legend.position = "bottom", 556 | legend.key.size = unit(25,"line"), 557 | legend.key.height = unit(7,"line"), 558 | legend.key = element_rect(fill = "transparent",color = "transparent"), 559 | legend.title = element_text(size=50, face="bold"), 560 | strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ), 561 | strip.background = element_rect(fill="transparent",color="transparent")) + 562 | scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 563 | guides(color = guide_legend(override.aes = list(size=20))) 564 | 565 | ) 566 | 567 
| dev.off() 568 | 569 | 570 | # plot of the posterior standard deviations 571 | regression_df_all200_postsd$Var2 = factor(regression_df_all200_postsd$Var2) 572 | colnames(regression_df_all200_postsd) = c("n", "age_cat", "quantildiff90_10", "Model", "p") 573 | 574 | # 575 | mapdf_prob_sampling_old_postsd = data.frame(old = sort(unique(regression_df_all200_postsd$p)), 576 | new = prob_sampling_old) 577 | regression_df_all200_postsd$prob_sampling_old = round(mapdf_prob_sampling_old_postsd$new[match(regression_df_all200_postsd$p, 578 | mapdf_prob_sampling_old_postsd$old)], 579 | digits = 2) 580 | 581 | regression_df_all200_postsd$age_cat = as.numeric(as.character(regression_df_all200_postsd$age_cat)) 582 | # 583 | 584 | saveRDS(regression_df_all200_postsd, 585 | paste0("allquantilediff_facet", 586 | "_", 587 | runs, 588 | "_", 589 | age_grouping_multiplier, 590 | "_", 591 | sample_size, 592 | "_", 593 | ".rds")) 594 | 595 | png(paste0("allquantilediff_facet", 596 | "_", 597 | runs, 598 | "_", 599 | age_grouping_multiplier, 600 | "_", 601 | sample_size, 602 | "_", 603 | ".png"), 604 | width=3000, height=2400) 605 | 606 | plot( 607 | ggplot(regression_df_all200_postsd, aes(x=age_cat, y=quantildiff90_10, color=Model, linetype=Model)) + #geom_point(size=5)+ 608 | geom_rect(regression_df_all200_postsd, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2, 609 | xmax = Inf, 610 | ymin = -Inf, 611 | ymax = Inf), 612 | alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) + 613 | geom_jitter(alpha=0.1, show.legend=FALSE) + 614 | geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 615 | facet_wrap(. 
~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller( 616 | function(value) { 617 | return(value) # Lets you change the facet labels 618 | }) 619 | ) + 620 | xlab("\n Age Category \n") + 621 | ylab(paste("\n 90th - 10th quantile of", runs,"Posteriors \n")) + 622 | scale_x_continuous(breaks= scales::pretty_breaks()) + 623 | theme_bw() + 624 | theme(plot.title = element_text(size = 50, face = "bold"), 625 | axis.text=element_text(size=50), 626 | axis.title=element_text(size=50, face="bold",margin=200), 627 | legend.text = element_text(size=50), 628 | legend.position = "bottom", 629 | legend.key.size = unit(25,"line"), 630 | legend.key.height = unit(7,"line"), 631 | legend.key = element_rect(fill = "transparent",color = "transparent"), 632 | legend.title = element_text(size=50, face="bold"), 633 | strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ), 634 | strip.background = element_rect(fill="transparent",color="transparent")) + 635 | scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 636 | guides(color = guide_legend(override.aes = list(size=20))) 637 | 638 | ) 639 | 640 | dev.off() 641 | 642 | # -------------------------------------------------------------------------- 643 | 644 | print("Warnings:") 645 | print(warnings()) 646 | 647 | 648 | 649 | -------------------------------------------------------------------------------- /simulation_spatialmrp/icar_mrp_simulation_bym2.R: -------------------------------------------------------------------------------- 1 | set.seed(21) 2 | 3 | 4 | rm(list=ls()) 5 | gc() 6 | 7 | library(dplyr) 8 | library(readr) 9 | library(data.table) 10 | library(sp) 11 | library(ggplot2) 12 | library(INLA) 13 | 14 | use_spatial_effect = TRUE # the default is TRUE. 
This will define the true_pref with coef_puma, the spatial effect 15 | 16 | poverty_poststrat = readRDS("poverty_poststrat.rds") 17 | us_pumas = readRDS("us_pumas.rds") 18 | ma_adj_matrix = readRDS("ma_adj_matrix.rds") 19 | ma_adj_matrix_sparse = readRDS("ma_adj_matrix_sparse.rds") 20 | x = readRDS("smooth_x.rds") 21 | 22 | num_education = length(unique(poverty_poststrat$education)) # number of strata for education 23 | num_race = length(unique(poverty_poststrat$race_x)) # number of strata for race 24 | num_puma = length(unique(poverty_poststrat$PUMA.x)) 25 | 26 | total_poverty = poverty_poststrat %>% group_by(PUMA.x) %>% summarise(total = sum(N), 27 | total_in_poverty = sum(y)) %>% arrange(PUMA.x) %>% as.data.frame() 28 | 29 | total_poverty$proportion_in_poverty = total_poverty$total_in_poverty/total_poverty$total 30 | 31 | 32 | ss = "massachusetts" 33 | ss_code = 25 # use data dictionary to get ss code 34 | continental_states = c("MA") 35 | 36 | r = 1:9/10 37 | runs = 200 38 | sample_size = 500 # 500, 1000, 2000, 4000 39 | 40 | puma_overundersample_index = c(3301, # "Boston City--Allston, Brighton & Fenway" 41 | 3305, # "Boston City--Hyde Park, Jamaica Plain, Roslindale & West Roxbury" 42 | 3303, # "Boston City--Dorchester & South Boston" 43 | 3302, # "Boston City--Back Bay, Beacon Hill, Charlestown, East Boston, Central & South End" 44 | 3304, # "Boston City--Mattapan & Roxbury" 45 | 3306, 46 | 503, 47 | 504, 48 | 505, 49 | 506, 50 | 507, 51 | 508, 52 | 3500, 53 | 3601, 54 | 3602, 55 | 3603, 56 | 3400 57 | ) 58 | 59 | # used to calculate posterior width 60 | lowerquantile = 0.1 61 | upperquantile = 0.9 62 | 63 | temp_x = cbind(x, us_pumas$PUMACE10) %>% as.data.frame() 64 | x_ordered = temp_x %>% arrange(V2) %>% as.data.frame() 65 | coef_puma = x_ordered$V1 # coef_puma is in alphabetical order 66 | 67 | coef_race = (1:num_race - mean(1:num_race))*0.1 68 | 69 | coef_education = (1:num_education - mean(1:num_education) - 2)*0.15 70 | 71 | intercept_term = 0 72 | 73 
| # http://www.paulamoraga.com/book-geospatial/sec-inla.html 74 | # https://mc-stan.org/users/documentation/case-studies/icar_stan.html 75 | 76 | icar_formula = n ~ 1 + 77 | f(education_f, 78 | model="iid", 79 | hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1))) 80 | ) + 81 | f(race_f, 82 | model="iid", 83 | hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1))) 84 | ) + 85 | f(ID, 86 | model = "bym2", 87 | graph = ma_adj_matrix_sparse, 88 | hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1))), 89 | adjust.for.con.comp = TRUE 90 | ) 91 | 92 | iid_formula = n ~ 1 + 93 | f(education_f, 94 | model="iid", 95 | hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1))) 96 | ) + 97 | f(race_f, 98 | model="iid", 99 | hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1))) 100 | ) + 101 | f(ID_f, 102 | model = "iid", 103 | hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1))) 104 | ) 105 | 106 | # ------------------------------------------------------------------------------------- 107 | # ------------------------------------------------------------------------------------- 108 | # ------------------------------------------------------------------------------------- 109 | 110 | us_pumas_df = as.data.frame(us_pumas) 111 | # --- 112 | 113 | a = as.numeric(us_pumas$PUMACE10) # converts string to numeric 114 | b = unique(poverty_poststrat$PUMA.x) 115 | 116 | sort(a) 117 | sort(b) 118 | 119 | sort(a) - sort(b) # great 120 | 121 | # https://www.census.gov/geographies/reference-maps/2010/geo/2010-pumas/massachusetts.html 122 | us_pumas$PUMACE10 = as.numeric(us_pumas$PUMACE10) # just run once 123 | 124 | us_pumas$proportion_in_poverty = total_poverty[base::match(us_pumas$PUMACE10, total_poverty$PUMA.x),c("proportion_in_poverty")] 125 | 126 | # fit simple ICAR model to MA w/ response being proportion_in_poverty 127 | us_pumas$ID = 1:nrow(us_pumas) 128 | 129 | # ----------------------------------------------------------------------- 130 | # simulation 
pipeline ---------------------------------------------------
131 | # -----------------------------------------------------------------------
132 | 
133 | num_education = length(unique(poverty_poststrat$education)) # number of strata for education
134 | num_race = length(unique(poverty_poststrat$race_x)) # number of strata for race
135 | num_puma = length(unique(poverty_poststrat$PUMA.x))
136 | 
137 | poststrat_final = expand.grid(puma_area = unique(poverty_poststrat$PUMA.x),
138 |                               race = unique(poverty_poststrat$race_x),
139 |                               education = unique(poverty_poststrat$education))
140 | 
141 | poststrat_final_joined = left_join(poststrat_final, poverty_poststrat, by=c("puma_area"="PUMA.x",
142 |                                                                            "race"="race_x",
143 |                                                                            "education"="education_x")) %>% as.data.frame()
144 | 
145 | # Order of factors for poststrat_final_joined
146 | levels_education = unique(poverty_poststrat$education)
147 | levels_race = unique(poverty_poststrat$race_x)
148 | levels_puma = unique(poverty_poststrat$PUMA.x)
149 | 
150 | 
151 | poststrat_final_join_numeric = expand.grid(puma_area = 1:length(unique(poverty_poststrat$PUMA.x)),
152 |                                            race = 1:length(unique(poverty_poststrat$race_x)),
153 |                                            education = 1:length(unique(poverty_poststrat$education)))
154 | 
155 | # https://stackoverflow.com/questions/18562680/replacing-nas-with-0s-in-r-dataframe
156 | # replace all NAs in the dataframe with zeros
157 | poststrat_final_joined[is.na(poststrat_final_joined)] = 0
158 | 
159 | true_pref = rep(NA, num_puma * num_race * num_education) # this contains the true preference probability of a group in the population
160 | if (use_spatial_effect==TRUE) {
161 | 
162 |   for (j in 1:(num_education * num_race * num_puma)) {
163 |     true_pref[j] = arm::invlogit(intercept_term +
164 |                                    coef_puma[poststrat_final_join_numeric[j, 1]] +
165 |                                    coef_race[poststrat_final_join_numeric[j, 2]] +
166 |                                    coef_education[poststrat_final_join_numeric[j, 3]]
167 |     )
168 |   }
169 | 
170 | }else {
171 | 
172 |   for (j in 1:(num_education * num_race * num_puma)) {
173 |     true_pref[j] = arm::invlogit(intercept_term +
174 |                                    coef_race[poststrat_final_join_numeric[j, 2]] +
175 |                                    coef_education[poststrat_final_join_numeric[j, 3]]
176 |     )
177 |   }
178 | 
179 | }
180 | 
181 | # y is the number of people in poverty weighted by PWGTP, N is the number of people in the PUMA area (weighted by PWGTP)
182 | poststrat_final_joined = cbind(poststrat_final_joined, true_pref)
183 | poststrat_puma_pref = poststrat_final_joined %>% group_by(puma_area) %>% summarise(N_times_theta = sum(N*true_pref)) %>% as.data.frame()
184 | poststrat_puma_pref = inner_join(poststrat_puma_pref,
185 |                                  poststrat_final_joined %>% group_by(puma_area) %>% summarise(N_sub = sum(N)) %>% as.data.frame(),
186 |                                  by = c("puma_area"="puma_area"))
187 | 
188 | # calculate poststratified true preference for every puma in MA
189 | poststrat_puma_pref$puma_ps = poststrat_puma_pref$N_times_theta/poststrat_puma_pref$N_sub
190 | summary(poststrat_puma_pref$puma_ps)
191 | 
192 | # true poststratified values based on the data-generating process
193 | us_pumas$puma_ps = poststrat_puma_pref$puma_ps[rank(us_pumas$PUMACE10)]
194 | 
195 | # Augment p in the simulation scheme
196 | cbind(us_pumas$NAME10, us_pumas$PUMACE10)
197 | 
198 | p_response_race = rep(1, num_race)
199 | p_response_education = rep(1, num_education)
200 | 
201 | N_sub = rep(0, num_puma) # the N for each PUMA based off the poststrat matrix
202 | g_counter = 1
203 | for (g in levels_puma) {
204 |   N_sub[g_counter] = sum(poststrat_final_joined[which(poststrat_final_joined$puma_area==g),]$N)
205 |   g_counter = g_counter + 1
206 | }
207 | 
208 | 
209 | for (p in r) {
210 | 
211 |   # storage matrices for posterior estimates -----------------------------------------
212 |   model_popn_pref_mat_icar = rep(NA, runs) # stores the preference estimate from MRP
213 |   model_popn_pref_sd_mat_icar = rep(NA, runs) # stores sd of above estimate
214 | 
215 |   model_popn_pref_mat_iid = rep(NA, runs) # stores the preference estimate from MRP
216 |   model_popn_pref_sd_mat_iid = rep(NA, 
runs) # stores sd of above estimate
217 | 
218 |   # medians of posterior linear predictors for all runs k
219 |   sample_cell_estimates_median_icar = matrix(NA, runs, dim(poststrat_final_joined)[1])
220 |   sample_cell_estimates_median_iid = matrix(NA, runs, dim(poststrat_final_joined)[1])
221 | 
222 |   # posterior width
223 |   sample_cell_estimates_width_icar = matrix(NA, runs, dim(poststrat_final_joined)[1])
224 |   sample_cell_estimates_width_iid = matrix(NA, runs, dim(poststrat_final_joined)[1])
225 | 
226 |   # the matrices below store PUMA-level posterior estimates; columns are ordered by increasing PUMA code
227 |   quantile10_puma_icar = matrix(NA, runs, num_puma)
228 |   quantile90_puma_icar = matrix(NA, runs, num_puma)
229 |   quantile10_puma_iid = matrix(NA, runs, num_puma)
230 |   quantile90_puma_iid = matrix(NA, runs, num_puma)
231 | 
232 |   median_puma_icar = matrix(NA, runs, num_puma)
233 |   median_puma_iid = matrix(NA, runs, num_puma)
234 | 
235 |   # we will under/oversample certain PUMAs
236 |   p_response_puma = rep(0, num_puma)
237 |   p_response_puma[which((unique(poverty_poststrat$PUMA.x) %in% puma_overundersample_index) == TRUE)] = p
238 |   p_response_puma[which((unique(poverty_poststrat$PUMA.x) %in% puma_overundersample_index) == FALSE)] = (1-p)
239 | 
240 |   p_response = rep(0, num_education * num_race * num_puma)
241 |   for (j in 1:(num_education * num_race * num_puma)) {
242 |     p_response[j] = p_response_puma[poststrat_final_join_numeric[j, 1]] *
243 |       p_response_race[poststrat_final_join_numeric[j, 2]] *
244 |       p_response_education[poststrat_final_join_numeric[j, 3]]
245 |   }
246 | 
247 |   poststrat_final_joined_p = cbind(poststrat_final_joined, p_response)
248 | 
249 | 
250 |   for (k in 1:runs) {
251 |     print(paste("On run", k, ", p =", p))
252 | 
253 |     sample_ = sample(num_education * num_race * num_puma,
254 |                      sample_size,
255 |                      replace = TRUE,
256 |                      prob = (poststrat_final_joined_p$p_response * poststrat_final_joined_p$N)/sum(poststrat_final_joined_p$p_response * 
poststrat_final_joined_p$N) 257 | ) 258 | 259 | y_sample_ = rbinom(sample_size, 1, poststrat_final_joined_p$true_pref[sample_]) 260 | 261 | # get covariates for every row of sample 262 | puma_sample = poststrat_final_joined_p[sample_,1] 263 | race_sample = poststrat_final_joined_p[sample_,2] 264 | education_sample = poststrat_final_joined_p[sample_,3] 265 | 266 | sample_final = data.frame(pref = y_sample_, 267 | puma = puma_sample, 268 | race = race_sample, 269 | education = education_sample) 270 | 271 | # with all the factors 272 | sample_final$race_f = factor(sample_final$race, levels = levels_race) 273 | sample_final$education_f = factor(sample_final$education, levels = levels_education) 274 | 275 | 276 | us_pumas_df = as.data.frame(us_pumas) 277 | sample_final_df = left_join(sample_final, us_pumas_df, by=c("puma"="PUMACE10")) 278 | 279 | sample_final_df_binomial = sample_final_df %>% 280 | group_by(puma, race_f,education_f) %>% 281 | summarise(n=sum(pref),N=n()) 282 | 283 | sample_final_df_binomial_df = left_join(sample_final_df_binomial, us_pumas_df, by=c("puma"="PUMACE10")) %>% 284 | as.data.frame() 285 | 286 | 287 | # ------------------------------------------------------------------------------------------------------------------------------------------- 288 | # ------------------------------------------------------------------------------------------------------------------------------------------- 289 | # ------------------------------------------------------------------------------------------------------------------------------------------- 290 | # prior specification inla 291 | # https://becarioprecario.bitbucket.io/inla-gitbook/ch-priors.html 292 | 293 | icar_model = inla(icar_formula, # pc prior: https://becarioprecario.bitbucket.io/inla-gitbook/ch-priors.html#sec:pcpriors 294 | data = sample_final_df_binomial_df, family ="binomial", 295 | Ntrials = N, 296 | control.predictor = list(compute = TRUE), 297 | control.compute = list(dic = TRUE, waic = 
TRUE, config =TRUE), 298 | control.fixed = list(prec.intercept=1) # makes precision of intercept fixed at 1 299 | ) 300 | 301 | # draw posterior samples for icar model 302 | icar_samples = inla.posterior.sample(n=4000, icar_model) 303 | 304 | # intercept ----------------------- 305 | icar_intercept_index_icar = grep("Intercept", rownames(icar_samples[[1]]$latent)) 306 | 307 | icar_intercept_samples_f = function(x) { #x = samples[[i]] 308 | return(x$latent[icar_intercept_index_icar]) 309 | } 310 | 311 | intercept_samples_icar = unlist(lapply(icar_samples, icar_intercept_samples_f)) 312 | # --------------------------------- 313 | 314 | # race ---------------------------- 315 | race_indices_icar = grep("race_f", rownames(icar_samples[[1]]$latent)) 316 | 317 | icar_race_samples_f = function(x) { 318 | return(x$latent[race_indices_icar]) 319 | } 320 | 321 | U_race_samples_icar = matrix(unlist(lapply(icar_samples, icar_race_samples_f)), 322 | ncol = num_race, 323 | byrow = TRUE) 324 | # --------------------------------- 325 | 326 | # education ----------------------- 327 | education_indices_icar = grep("education_f", rownames(icar_samples[[1]]$latent)) 328 | 329 | icar_education_samples_f = function(x) { 330 | return(x$latent[education_indices_icar]) 331 | } 332 | 333 | U_education_samples_icar = matrix(unlist(lapply(icar_samples, icar_education_samples_f)), 334 | ncol = num_education, 335 | byrow = TRUE) 336 | # --------------------------------- 337 | 338 | # puma area ----------------------- 339 | puma_indices_icar = grep("ID:", rownames(icar_samples[[1]]$latent))[1:num_puma] # https://inla.r-inla-download.org/r-inla.org/doc/latent/bym2.pdf 340 | 341 | icar_puma_samples_f = function(x) { 342 | return(x$latent[puma_indices_icar]) 343 | } 344 | 345 | U_puma_samples_icar = matrix(unlist(lapply(icar_samples, icar_puma_samples_f)), 346 | ncol = num_puma, 347 | byrow = TRUE) 348 | # --------------------------------- 349 | # get PUMA <-> ID match, where ID is used in 
f(ID,...) 350 | pumacode_id_equivalency = us_pumas_df[,c("PUMACE10","ID")] 351 | 352 | # the poststrat matrix 353 | poststrat_final_joined_p_ID = left_join(poststrat_final_joined_p, 354 | pumacode_id_equivalency, 355 | by = c("puma_area"="PUMACE10")) 356 | 357 | # this matrix stores the posterior linear predictors for each poststrat cell in poststrat_final_joined_p 358 | postpred_sim_icar = matrix(0, 359 | length(intercept_samples_icar), 360 | dim(poststrat_final_joined_p)[1]) 361 | 362 | for (i in 1:(num_education * num_puma * num_race)) { 363 | postpred_sim_icar[,i] = arm::invlogit(intercept_samples_icar + 364 | U_puma_samples_icar[, poststrat_final_joined_p_ID[i,c("ID")] ] + 365 | U_race_samples_icar[, which(poststrat_final_joined_p_ID[i,c("race")]==levels_race) ] + 366 | U_education_samples_icar[, which(poststrat_final_joined_p_ID[i,c("education")]==levels_education) ] 367 | ) 368 | } 369 | 370 | puma_mrp_samples_icar = matrix(0, 4000, num_puma) # this list stores posterior MRP samples for every puma 371 | 372 | g_counter = 1 373 | for (g in levels_puma) { 374 | #print(g) 375 | for (i in which(poststrat_final_joined_p_ID$puma_area==g)) { 376 | puma_mrp_samples_icar[,g_counter] = puma_mrp_samples_icar[,g_counter] + 377 | (postpred_sim_icar[,i] * poststrat_final_joined_p_ID$N[i]/N_sub[g_counter]) 378 | } 379 | g_counter = g_counter + 1 380 | } 381 | 382 | # ------------------------------------------------------------------------------------------------------------------------------------------- 383 | # ------------------------------------------------------------------------------------------------------------------------------------------- 384 | # ------------------------------------------------------------------------------------------------------------------------------------------- 385 | sample_final_df_binomial_df_iid = sample_final_df_binomial_df 386 | sample_final_df_binomial_df_iid$ID_f = factor(sample_final_df_binomial_df_iid$ID, levels = 1:num_puma) 387 
| 388 | iid_model = inla(iid_formula, # pc prior: https://becarioprecario.bitbucket.io/inla-gitbook/ch-priors.html#sec:pcpriors 389 | data = sample_final_df_binomial_df_iid, family ="binomial", 390 | Ntrials = N, 391 | control.predictor = list(compute = TRUE), 392 | control.compute = list(dic = TRUE, waic = TRUE, config =TRUE), 393 | control.fixed = list(prec.intercept=1) # makes precision of intercept fixed at 1 394 | ) 395 | 396 | 397 | # draw posterior samples for iid model 398 | iid_samples = inla.posterior.sample(n=4000, iid_model) 399 | 400 | # intercept ----------------------- 401 | iid_intercept_index_iid = grep("Intercept", rownames(iid_samples[[1]]$latent)) 402 | 403 | iid_intercept_samples_f = function(x) { #x = samples[[i]] 404 | return(x$latent[iid_intercept_index_iid]) 405 | } 406 | 407 | intercept_samples_iid = unlist(lapply(iid_samples, iid_intercept_samples_f)) 408 | # --------------------------------- 409 | 410 | # race ---------------------------- 411 | race_indices_iid = grep("race_f", rownames(iid_samples[[1]]$latent)) 412 | 413 | iid_race_samples_f = function(x) { 414 | return(x$latent[race_indices_iid]) 415 | } 416 | 417 | U_race_samples_iid = matrix(unlist(lapply(iid_samples, iid_race_samples_f)), 418 | ncol = num_race, 419 | byrow = TRUE) 420 | # --------------------------------- 421 | 422 | # education ----------------------- 423 | education_indices_iid = grep("education_f", rownames(iid_samples[[1]]$latent)) 424 | 425 | iid_education_samples_f = function(x) { 426 | return(x$latent[education_indices_iid]) 427 | } 428 | 429 | U_education_samples_iid = matrix(unlist(lapply(iid_samples, iid_education_samples_f)), 430 | ncol = num_education, 431 | byrow = TRUE) 432 | # --------------------------------- 433 | 434 | # puma area ----------------------- 435 | puma_indices_iid = grep("ID_f:", rownames(iid_samples[[1]]$latent)) 436 | 437 | iid_puma_samples_f = function(x) { 438 | return(x$latent[puma_indices_iid]) 439 | } 440 | 441 | 
U_puma_samples_iid = matrix(unlist(lapply(iid_samples, iid_puma_samples_f)), 442 | ncol = num_puma, 443 | byrow = TRUE) 444 | # --------------------------------- 445 | 446 | # this matrix stores the posterior linear predictors for each poststrat cell in poststrat_final_joined_p 447 | postpred_sim_iid = matrix(0, 448 | length(intercept_samples_iid), 449 | dim(poststrat_final_joined_p)[1]) 450 | 451 | for (i in 1:(num_education * num_puma * num_race)) { 452 | postpred_sim_iid[,i] = arm::invlogit(intercept_samples_iid + 453 | U_puma_samples_iid[, poststrat_final_joined_p_ID[i,c("ID")] ] + 454 | U_race_samples_iid[, which(poststrat_final_joined_p_ID[i,c("race")]==levels_race) ] + 455 | U_education_samples_iid[, which(poststrat_final_joined_p_ID[i,c("education")]==levels_education) ] 456 | ) 457 | } 458 | 459 | puma_mrp_samples_iid = matrix(0, 4000, num_puma) # this list stores posterior MRP samples for every puma 460 | 461 | g_counter = 1 462 | for (g in levels_puma) { 463 | #print(g) 464 | for (i in which(poststrat_final_joined_p_ID$puma_area==g)) { 465 | puma_mrp_samples_iid[,g_counter] = puma_mrp_samples_iid[,g_counter] + 466 | (postpred_sim_iid[,i] * poststrat_final_joined_p_ID$N[i]/N_sub[g_counter]) 467 | } 468 | g_counter = g_counter + 1 469 | } 470 | # ------------------------------------------------------------------------------------------------------------------------------------------- 471 | # ------------------------------------------------------------------------------------------------------------------------------------------- 472 | # ------------------------------------------------------------------------------------------------------------------------------------------- 473 | true_popn_pref = poststrat_final_joined_p$true_pref %*% poststrat_final_joined_p$N / sum(poststrat_final_joined_p$N) 474 | 475 | model_popn_pref_mat_icar[k] = mean(postpred_sim_icar %*% poststrat_final_joined_p$N / sum(poststrat_final_joined_p$N)) 476 | 
model_popn_pref_sd_mat_icar[k] = sd(postpred_sim_icar %*% poststrat_final_joined_p$N / sum(poststrat_final_joined_p$N)) 477 | 478 | model_popn_pref_mat_iid[k] = mean(postpred_sim_iid %*% poststrat_final_joined_p$N / sum(poststrat_final_joined_p$N)) 479 | model_popn_pref_sd_mat_iid[k] = sd(postpred_sim_iid %*% poststrat_final_joined_p$N / sum(poststrat_final_joined_p$N)) 480 | 481 | # medians of posterior linear predictors for all runs k 482 | sample_cell_estimates_median_icar[k,] = matrixStats::colMedians(postpred_sim_icar) 483 | sample_cell_estimates_median_iid[k,] = matrixStats::colMedians(postpred_sim_iid) 484 | 485 | # posterior width 486 | sample_cell_estimates_width_icar[k,] = apply(X = postpred_sim_icar, 487 | MARGIN = 2, 488 | FUN = quantile, 489 | probs = upperquantile) - 490 | apply(X = postpred_sim_icar, 491 | MARGIN = 2, 492 | FUN = quantile, 493 | probs = lowerquantile) 494 | 495 | 496 | sample_cell_estimates_width_iid[k,] = apply(X = postpred_sim_iid, 497 | MARGIN = 2, 498 | FUN = quantile, 499 | probs = upperquantile) - 500 | apply(X = postpred_sim_iid, 501 | MARGIN = 2, 502 | FUN = quantile, 503 | probs = lowerquantile) 504 | 505 | # the matrices below store PUMA posterior estimates; columns are in increasing order of PUMA code 506 | 507 | median_puma_icar[k,] = matrixStats::colMedians(puma_mrp_samples_icar) 508 | median_puma_iid[k,] = matrixStats::colMedians(puma_mrp_samples_iid) 509 | 510 | quantile10_puma_icar[k,] = apply(X = puma_mrp_samples_icar, 511 | MARGIN = 2, 512 | FUN = quantile, 513 | probs = lowerquantile) 514 | 515 | quantile90_puma_icar[k,] = apply(X = puma_mrp_samples_icar, 516 | MARGIN = 2, 517 | FUN = quantile, 518 | probs = upperquantile) 519 | 520 | quantile10_puma_iid[k,] = apply(X = puma_mrp_samples_iid, 521 | MARGIN = 2, 522 | FUN = quantile, 523 | probs = lowerquantile) 524 | 525 | quantile90_puma_iid[k,] = apply(X = puma_mrp_samples_iid, 526 | MARGIN = 2, 527 | FUN = quantile, 528 | probs = upperquantile) 529 | 530 | 531 | if ((k %% 10)==0) { 532 | print(k) 533 | save.image(paste0(runs, 534 | "_", 535 | sample_size, 536 | "_spatial_", 537 | p * 10, 538 | ".RData")) 539 | } 540 | 541 | } 542 | } 543 | 544 | 545 | 546 | 547 | 548 | 549 | 550 | 551 | -------------------------------------------------------------------------------- /simulation_spatialmrp/icar_mrp_viz.R: -------------------------------------------------------------------------------- 1 | # this script visualizes results from icar_mrp_simulation_bym2.R 2 | 3 | rm(list=ls()) 4 | gc() 5 | 6 | library(ggplot2) 7 | library(dplyr) 8 | library(sp) 9 | 10 | sample_size = 500 11 | runs = 200 12 | r = 1:9/10 13 | 14 | poverty_poststrat = readRDS("poverty_poststrat.rds") 15 | us_pumas = readRDS("us_pumas.rds") 16 | ma_adj_matrix = readRDS("ma_adj_matrix.rds") 17 | ma_adj_matrix_sparse = readRDS("ma_adj_matrix_sparse.rds") 18 | 19 | num_education = length(unique(poverty_poststrat$education)) # number of strata for education 20 | num_race = length(unique(poverty_poststrat$race_x)) # number of strata for race 21 | num_puma = length(unique(poverty_poststrat$PUMA.x)) 22 | 23 | 24 | # ---- 25 | avg_bias_mat_puma_icar = matrix(0, length(r), num_puma) 26 | avg_bias_mat_puma_iid
= matrix(0, length(r), num_puma) 27 | 28 | # bias of all strata 29 | avg_bias_mat_icar = matrix(0, length(r), num_education * num_race * num_puma) 30 | avg_bias_mat_iid = matrix(0, length(r), num_education * num_race * num_puma) 31 | 32 | # avg posterior width for 1872 poststratification cells 33 | avg_posterior_cell_width_icar = matrix(0, length(r), num_education * num_race * num_puma) 34 | avg_posterior_cell_width_iid = matrix(0, length(r), num_education * num_race * num_puma) 35 | 36 | # avg posterior width for all 52 PUMAs 37 | avg_posterior_puma_width_icar = matrix(0, length(r), 52) 38 | avg_posterior_puma_width_iid = matrix(0, length(r), 52) 39 | # ---- 40 | 41 | # df for majority vote for the ICAR and IID priors 42 | majorityvote_df = c() 43 | 44 | 45 | counter_ = 1 46 | prob_sampling_old = rep(0, length(r)) 47 | 48 | for (p in r) { 49 | load(paste0(runs, 50 | "_", 51 | sample_size, 52 | "_spatial_", 53 | p * 10, 54 | ".RData")) 55 | 56 | # get the interpretable p, which is the probability of sampling something in puma_overundersample_index 57 | poststrat_final_joined_p_interpretable = cbind(poststrat_final_joined_p, 58 | (poststrat_final_joined_p$p_response * poststrat_final_joined_p$N)/sum(poststrat_final_joined_p$p_response * poststrat_final_joined_p$N) 59 | ) 60 | # 61 | colnames(poststrat_final_joined_p_interpretable)[length(colnames(poststrat_final_joined_p_interpretable))] = "prob_sampling_g1" 62 | prob_sampling_old[counter_] = round(sum(poststrat_final_joined_p_interpretable[poststrat_final_joined_p_interpretable$puma_area %in% puma_overundersample_index, 63 | c("prob_sampling_g1")]), 64 | 2) # very important 65 | 66 | avg_bias_mat_icar[counter_,] = colMeans(sweep(sample_cell_estimates_median_icar, 67 | 2, 68 | poststrat_final_joined_p_ID$true_pref, 69 | "-")) 70 | 71 | avg_bias_mat_iid[counter_,] = colMeans(sweep(sample_cell_estimates_median_iid, 72 | 2, 73 | poststrat_final_joined_p_ID$true_pref, 74 | "-")) 75 | 76 | 77 |
avg_posterior_cell_width_icar[counter_,] = colMeans(sample_cell_estimates_width_icar) 78 | avg_posterior_cell_width_iid[counter_,] = colMeans(sample_cell_estimates_width_iid) 79 | 80 | avg_posterior_puma_width_icar[counter_,] = colMeans(quantile90_puma_icar - quantile10_puma_icar) 81 | avg_posterior_puma_width_iid[counter_,] = colMeans(quantile90_puma_iid - quantile10_puma_iid) 82 | 83 | 84 | avg_bias_mat_puma_icar[counter_,] = colMeans(sweep(median_puma_icar, 2, poststrat_puma_pref$puma_ps, "-")) 85 | avg_bias_mat_puma_iid[counter_,] = colMeans(sweep(median_puma_iid, 2, poststrat_puma_pref$puma_ps, "-")) 86 | 87 | # majorityvote ----------------------- 88 | 89 | majorityvote_df = rbind(majorityvote_df, 90 | data.frame(majorityvote = model_popn_pref_mat_icar, 91 | type = "ICAR", 92 | prob_sampling_old[counter_]), 93 | data.frame(majorityvote = model_popn_pref_mat_iid, 94 | type = "IID", 95 | prob_sampling_old[counter_]) 96 | ) 97 | 98 | counter_ = counter_ + 1 99 | } 100 | 101 | saveRDS(prob_sampling_old, "prob_sampling_old.rds") 102 | # negative values indicate ICAR has lower absolute cell-level bias than IID 103 | summary(colMeans(abs(avg_bias_mat_icar) - 104 | abs(avg_bias_mat_iid))) * 100 105 | 106 | # negative values indicate ICAR has lower absolute PUMA-level bias than IID 107 | summary(colMeans(abs(sweep(median_puma_icar,2,poststrat_puma_pref$puma_ps,"-")) - 108 | abs(sweep(median_puma_iid,2,poststrat_puma_pref$puma_ps,"-")))) * 100 109 | 110 | # negative values indicate ICAR yields narrower posterior cell widths than IID 111 | summary(colMeans(avg_posterior_cell_width_icar - avg_posterior_cell_width_iid)) 112 | 113 | # --- 114 | poststrat_final_V = cbind(paste0("V", rownames(poststrat_final_joined)), 115 | poststrat_final_joined, 116 | stringsAsFactors=FALSE) 117 | colnames(poststrat_final_V)[1] = "cell" 118 | # --- 119 | 120 | 121 | avg_bias_mat_icar = as.data.frame(avg_bias_mat_icar) 122 | avg_bias_mat_iid = as.data.frame(avg_bias_mat_iid) 123 | 124 | df_bias.melted_icar = cbind(reshape2::melt(avg_bias_mat_icar), 125 | rep(r, dim(poststrat_final_joined_p_ID)[1])) 126 |
colnames(df_bias.melted_icar) = c("cell", "value", "p") 127 | df_bias.melted_icar$cell = as.character(df_bias.melted_icar$cell) 128 | 129 | df_bias.melted_V_icar = dplyr::inner_join(x = df_bias.melted_icar, 130 | y = poststrat_final_V, 131 | by ="cell") 132 | 133 | 134 | 135 | df_bias.melted_iid = cbind(reshape2::melt(avg_bias_mat_iid), 136 | rep(r, dim(poststrat_final_joined_p_ID)[1])) 137 | colnames(df_bias.melted_iid) = c("cell", "value", "p") 138 | df_bias.melted_iid$cell = as.character(df_bias.melted_iid$cell) 139 | 140 | df_bias.melted_V_iid = dplyr::inner_join(x = df_bias.melted_iid, 141 | y = poststrat_final_V, 142 | by ="cell") 143 | 144 | bias_facet_df = rbind(cbind(df_bias.melted_V_icar, Model = "ICAR"), 145 | cbind(df_bias.melted_V_iid, Model = "IID")) 146 | 147 | bias_facet_df$sampling_group = ifelse(bias_facet_df$puma_area %in% puma_overundersample_index, "Group 1", "Group 2") 148 | 149 | png(filename = paste0("spatialbiasfacet_", 150 | runs, 151 | "_", 152 | sample_size, 153 | "_.png"), 154 | width = 4800, height = 2400) 155 | 156 | ggplot(bias_facet_df, 157 | aes(x=p, y=value, group = cell)) + 158 | geom_line(aes(col=as.factor(sampling_group)),show.legend = TRUE, size = 0.5) + 159 | facet_wrap(. 
~ Model,ncol=2,nrow=1, labeller = as_labeller( 160 | function(value) { 161 | return(value) # Lets you change the facet labels 162 | }) 163 | ) + 164 | xlab("\n Probability of response for Group 1 \n") + 165 | ylab("\n Average Bias of Posterior Medians \n") + 166 | theme_bw() + 167 | theme(plot.title = element_text(size = 50, face = "bold"), 168 | axis.text=element_text(size=35*1.6), 169 | axis.title=element_text(size=50*1.6, face="bold",margin=200), 170 | legend.text = element_text(size=50*1.6), 171 | legend.position = "bottom", 172 | legend.key.size = unit(30*1.6,"line"), 173 | legend.key.height = unit(7*1.6,"line"), 174 | legend.key = element_rect(fill = "transparent",color = "transparent"), 175 | legend.title = element_text(size=50*1.6, face="bold"), 176 | strip.text.x = element_text(size=50*1.6, face="bold", margin = margin(t=25,b=25) ), 177 | strip.background = element_rect(fill="transparent",color="transparent") 178 | ) + 179 | guides(col=guide_legend(title="Sampling Group", 180 | override.aes = list(size = 10*1.6))) 181 | 182 | dev.off() 183 | 184 | saveRDS(bias_facet_df, paste0("biasfacetspatial_", 185 | runs, 186 | "_", 187 | sample_size,".rds")) 188 | 189 | # Average bias of posterior medians for 52 PUMAs -------------------------------------------- 190 | avg_bias_mat_puma_icar = as.data.frame(avg_bias_mat_puma_icar) 191 | avg_bias_mat_puma_iid = as.data.frame(avg_bias_mat_puma_iid) 192 | 193 | 194 | colnames(avg_bias_mat_puma_icar) = levels_puma 195 | 196 | df_bias_puma.melted_icar = cbind(reshape2::melt(avg_bias_mat_puma_icar), 197 | rep(r, num_puma)) 198 | colnames(df_bias_puma.melted_icar) = c("puma_area", "value", "p") 199 | df_bias_puma.melted_icar$puma_area = as.numeric(as.character(df_bias_puma.melted_icar$puma_area)) 200 | 201 | df_bias_puma.melted_icar$sampling_group = ifelse(df_bias_puma.melted_icar$puma_area %in% puma_overundersample_index, 202 | "Near Boston", 203 | "Away from Boston") 204 | 205 | 206 | colnames(avg_bias_mat_puma_iid) = 
levels_puma 207 | 208 | df_bias_puma.melted_iid = cbind(reshape2::melt(avg_bias_mat_puma_iid), 209 | rep(r, num_puma)) 210 | colnames(df_bias_puma.melted_iid) = c("puma_area", "value", "p") 211 | df_bias_puma.melted_iid$puma_area = as.numeric(as.character(df_bias_puma.melted_iid$puma_area)) 212 | 213 | df_bias_puma.melted_iid$sampling_group = ifelse(df_bias_puma.melted_iid$puma_area %in% puma_overundersample_index, 214 | "Near Boston", 215 | "Away from Boston") 216 | 217 | 218 | bias_facet_puma_df = rbind(cbind(df_bias_puma.melted_icar, Model = "ICAR"), 219 | cbind(df_bias_puma.melted_iid, Model = "IID")) 220 | 221 | bias_facet_puma_df$sampling_group = factor(bias_facet_puma_df$sampling_group, 222 | levels = c("Near Boston", "Away from Boston")) 223 | 224 | png(filename = paste0("spatialbiasfacet_puma_", 225 | runs, 226 | "_", 227 | sample_size, 228 | "_.png"), 229 | width = 4800, height = 2400) 230 | 231 | ggplot(bias_facet_puma_df, 232 | aes(x=p, y=value, group = puma_area)) + 233 | geom_line(aes(col=as.factor(sampling_group)),show.legend = TRUE, size = 1.5) + 234 | facet_wrap(. 
~ Model,ncol=2,nrow=1, labeller = as_labeller( 235 | function(value) { 236 | return(value) # Lets you change the facet labels 237 | }) 238 | ) + 239 | xlab("\n Probability of response for Group 1 \n") + 240 | ylab("\n Average Bias of Posterior Medians for PUMA \n") + 241 | theme_bw() + 242 | theme(plot.title = element_text(size = 50, face = "bold"), 243 | axis.text=element_text(size=35*1.6), 244 | axis.title=element_text(size=50*1.6, face="bold",margin=200), 245 | legend.text = element_text(size=50*1.6), 246 | legend.position = "bottom", 247 | legend.key.size = unit(30*1.6,"line"), 248 | legend.key.height = unit(7*1.6,"line"), 249 | legend.key = element_rect(fill = "transparent",color = "transparent"), 250 | legend.title = element_text(size=50*1.6, face="bold"), 251 | strip.text.x = element_text(size=50*1.6, face="bold", margin = margin(t=25,b=25) ), 252 | strip.background = element_rect(fill="transparent",color="transparent")) + 253 | guides(col=guide_legend(title="Sampling Group", 254 | override.aes = list(size = 10*1.6))) 255 | 256 | 257 | dev.off() 258 | 259 | saveRDS(bias_facet_puma_df, paste0("biasfacetspatialpuma_", 260 | runs, 261 | "_", 262 | sample_size,".rds")) 263 | 264 | # Average posterior sd width for 1872 poststrat cell posteriors ---------------------------- 265 | 266 | avg_posterior_cell_width_icar = as.data.frame(avg_posterior_cell_width_icar) 267 | avg_posterior_cell_width_iid = as.data.frame(avg_posterior_cell_width_iid) 268 | 269 | 270 | df_avg_posterior_cell_width_icar = cbind(reshape2::melt(avg_posterior_cell_width_icar), 271 | rep(r, dim(poststrat_final_joined_p_ID)[1])) 272 | 273 | colnames(df_avg_posterior_cell_width_icar) = c("cell", "average_posterior_width", "p") 274 | df_avg_posterior_cell_width_icar$cell = as.character(df_avg_posterior_cell_width_icar$cell) 275 | 276 | df_avg_posterior_cell_width_icar_V = dplyr::inner_join(x = df_avg_posterior_cell_width_icar, 277 | y = poststrat_final_V, 278 | by = "cell") 279 | 280 | 281 | 
df_avg_posterior_cell_width_iid = cbind(reshape2::melt(avg_posterior_cell_width_iid), 282 | rep(r, dim(poststrat_final_joined_p_ID)[1])) 283 | 284 | colnames(df_avg_posterior_cell_width_iid) = c("cell", "average_posterior_width", "p") 285 | df_avg_posterior_cell_width_iid$cell = as.character(df_avg_posterior_cell_width_iid$cell) 286 | 287 | df_avg_posterior_cell_width_iid_V = dplyr::inner_join(x = df_avg_posterior_cell_width_iid, 288 | y = poststrat_final_V, 289 | by = "cell") 290 | 291 | posterior_cell_width_facet_df = rbind(cbind(df_avg_posterior_cell_width_icar_V, Model = "ICAR"), 292 | cbind(df_avg_posterior_cell_width_iid_V, Model = "IID")) 293 | 294 | posterior_cell_width_facet_df$sampling_group = ifelse(posterior_cell_width_facet_df$puma_area %in% puma_overundersample_index, "Group 1", "Group 2") 295 | 296 | posterior_cell_width_facet_df$sampling_group[posterior_cell_width_facet_df$sampling_group=="Group 1"] = "Near Boston" 297 | posterior_cell_width_facet_df$sampling_group[posterior_cell_width_facet_df$sampling_group=="Group 2"] = "Away from Boston" 298 | posterior_cell_width_facet_df$sampling_group = factor(posterior_cell_width_facet_df$sampling_group, 299 | levels = c("Near Boston", "Away from Boston")) 300 | 301 | png(filename = paste0("spatialbiasfacet_width_", 302 | runs, 303 | "_", 304 | sample_size, 305 | "_.png"), 306 | width = 4800, height = 2400) 307 | 308 | ggplot(posterior_cell_width_facet_df, 309 | aes(x=p, y=average_posterior_width, group = cell)) + 310 | geom_line(aes(col=as.factor(sampling_group)),show.legend = TRUE, size = 0.5) + 311 | facet_wrap(.
~ Model,ncol=2,nrow=1, labeller = as_labeller( 312 | function(value) { 313 | return(value) # Lets you change the facet labels 314 | }) 315 | ) + 316 | xlab("\n Probability of sampling for cluster of PUMA near Boston \n") + 317 | ylab(paste0("\n 90th - 10th quantile of ", runs, " Posteriors \n")) + 318 | theme_bw() + 319 | theme(plot.title = element_text(size = 50, face = "bold"), 320 | axis.text=element_text(size=35*1.6), 321 | axis.title=element_text(size=50*1.6, face="bold",margin=200), 322 | legend.text = element_text(size=50*1.6), 323 | legend.position = "bottom", 324 | legend.key.size = unit(30*1.6,"line"), 325 | legend.key.height = unit(7*1.6,"line"), 326 | legend.key = element_rect(fill = "transparent",color = "transparent"), 327 | legend.title = element_text(size=50*1.6, face="bold"), 328 | strip.text.x = element_text(size=50*1.6, face="bold", margin = margin(t=25,b=25) ), 329 | strip.background = element_rect(fill="transparent",color="transparent")) + 330 | guides(col=guide_legend(title="Sampling Group", 331 | override.aes = list(size = 10*1.6))) 332 | 333 | dev.off() 334 | 335 | saveRDS(posterior_cell_width_facet_df, paste0("spatialbiasfacet_width_", 336 | runs, 337 | "_", 338 | sample_size, 339 | ".rds")) 340 | 341 | 342 | # Average posterior sd of 52 PUMA ------------------------------------------------------------- 343 | 344 | avg_posterior_puma_width_icar = as.data.frame(avg_posterior_puma_width_icar) 345 | avg_posterior_puma_width_iid = as.data.frame(avg_posterior_puma_width_iid) 346 | 347 | colnames(avg_posterior_puma_width_icar) = levels_puma 348 | colnames(avg_posterior_puma_width_iid) = levels_puma 349 | 350 | df_avg_posterior_puma_width_icar = cbind(reshape2::melt(avg_posterior_puma_width_icar), 351 | rep(r, 52)) 352 | 353 | colnames(df_avg_posterior_puma_width_icar) = c("puma", "average_posterior_width", "p") 354 | df_avg_posterior_puma_width_icar$puma = as.character(df_avg_posterior_puma_width_icar$puma) 355 | 356 | 357 | 
df_avg_posterior_puma_width_iid = cbind(reshape2::melt(avg_posterior_puma_width_iid), 358 | rep(r, 52)) 359 | 360 | colnames(df_avg_posterior_puma_width_iid) = c("puma", "average_posterior_width", "p") 361 | df_avg_posterior_puma_width_iid$puma = as.character(df_avg_posterior_puma_width_iid$puma) 362 | 363 | posterior_puma_width_facet_df = rbind(cbind(df_avg_posterior_puma_width_icar, Model = "ICAR"), 364 | cbind(df_avg_posterior_puma_width_iid, Model = "IID")) 365 | 366 | posterior_puma_width_facet_df$sampling_group = ifelse(posterior_puma_width_facet_df$puma %in% puma_overundersample_index, "Group 1", "Group 2") 367 | 368 | posterior_puma_width_facet_df$sampling_group[posterior_puma_width_facet_df$sampling_group=="Group 1"] = "Near Boston" 369 | posterior_puma_width_facet_df$sampling_group[posterior_puma_width_facet_df$sampling_group=="Group 2"] = "Away from Boston" 370 | posterior_puma_width_facet_df$sampling_group = factor(posterior_puma_width_facet_df$sampling_group, 371 | levels = c("Near Boston", "Away from Boston")) 372 | 373 | png(filename = paste0("spatialbiasfacet_puma_width_", 374 | runs, 375 | "_", 376 | sample_size, 377 | "_.png"), 378 | width = 4800, height = 2400) 379 | 380 | ggplot(posterior_puma_width_facet_df, 381 | aes(x=p, y=average_posterior_width, group = puma)) + 382 | geom_line(aes(col=as.factor(sampling_group)),show.legend = TRUE, size = 1.5) + 383 | facet_wrap(. 
~ Model, ncol = 2, nrow = 1, labeller = as_labeller(
    function(value) {
      return(value) # identity labeller: placeholder for renaming the facet labels
    })
  ) +
  xlab("\n Probability of sampling for cluster of PUMA near Boston \n") +
  ylab(paste0("\n 90th - 10th quantile of ", runs, " Posteriors \n")) +
  theme_bw() +
  theme(plot.title = element_text(size = 50, face = "bold"),
        axis.text = element_text(size = 35*1.6),
        axis.title = element_text(size = 50*1.6, face = "bold", margin = 200),
        legend.text = element_text(size = 50*1.6),
        legend.position = "bottom",
        legend.key.size = unit(30*1.6, "line"),
        legend.key.height = unit(7*1.6, "line"),
        legend.key = element_rect(fill = "transparent", color = "transparent"),
        legend.title = element_text(size = 50*1.6, face = "bold"),
        strip.text.x = element_text(size = 50*1.6, face = "bold", margin = margin(t = 25, b = 25)),
        strip.background = element_rect(fill = "transparent", color = "transparent")) +
  guides(col = guide_legend(title = "Sampling Group",
                            override.aes = list(size = 10*1.6)))

dev.off()

saveRDS(posterior_puma_width_facet_df, paste0("spatialbiasfacet_puma_width_",
                                              runs,
                                              "_",
                                              sample_size,
                                              ".rds"))


# heatmap plot of average posterior medians for 52 pumas, across 9 probability indices --------

# join average posterior-median bias onto the PUMA map, one column per
# (model, probability index) pair
number_words = c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
for (i in seq_along(number_words)) {
  us_pumas[[paste0("ICAR_", number_words[i])]] = inner_join(
    x = data.frame(PUMACE10 = us_pumas$PUMACE10),
    y = data.frame(puma_code = levels_puma,
                   postmedian_bias = as.numeric(avg_bias_mat_puma_icar[i, ])),
    by = c("PUMACE10" = "puma_code"))[, 2]

  us_pumas[[paste0("IID_", number_words[i])]] = inner_join(
    x = data.frame(PUMACE10 = us_pumas$PUMACE10),
    y = data.frame(puma_code = levels_puma,
                   postmedian_bias = as.numeric(avg_bias_mat_puma_iid[i, ])),
    by = c("PUMACE10" = "puma_code"))[, 2]
}

all_cols = as.vector(rbind(paste0("ICAR_", number_words),
                           paste0("IID_", number_words)))

# https://edzer.github.io/sp/
png(filename = paste0("avg_bias_postmedian_puma",
                      runs,
                      "_",
                      sample_size,
                      "_.png"),
    width = 800, height = 2400)

# symmetric color limits around zero, so the diverging palette is centered
lim_all = round(max(abs(as.data.frame(us_pumas[, all_cols]))) + 0.01, 2)

spplot(us_pumas[, all_cols],

       names.attr = c("ICAR 0.05", "IID 0.05",
                      "ICAR 0.10", "IID 0.10",
                      "ICAR 0.17", "IID 0.17",
                      "ICAR 0.24", "IID 0.24",
                      "ICAR 0.32", "IID 0.32",
                      "ICAR 0.41", "IID 0.41",
                      "ICAR 0.52", "IID 0.52",
                      "ICAR 0.65", "IID 0.65",
                      "ICAR 0.81", "IID 0.81"),

       # pretty = TRUE
       at = seq(-lim_all, lim_all, length.out = 12),

       col.regions = RColorBrewer::brewer.pal(11, "RdBu")
)


dev.off()

png(filename = paste0("avg_bias_postmedian_puma_forpaper",
                      runs,
                      "_",
                      sample_size,
                      "_.png"),
    width = 800, height = 800)

paper_cols = c("ICAR_one", "IID_one",
               "ICAR_five", "IID_five",
               "ICAR_nine", "IID_nine")
lim_paper = round(max(abs(as.data.frame(us_pumas[, paper_cols]))) + 0.01, 2)

spplot(us_pumas[, paper_cols],

       names.attr = c("BYM2 prior, probability of sampling near Boston = 0.05", "IID prior, probability of sampling near Boston = 0.05",
                      "BYM2 prior, probability of sampling near Boston = 0.32", "IID prior, probability of sampling near Boston = 0.32",
                      "BYM2 prior, probability of sampling near Boston = 0.81", "IID prior, probability of sampling near Boston = 0.81"),

       # pretty = TRUE
       at = seq(-lim_paper, lim_paper, length.out = 12),

       col.regions = RColorBrewer::brewer.pal(11, "RdBu"),

       colorkey = list(space = "bottom")
)


dev.off()

us_pumas$puma_ps_group1 = rep(0, 52)
us_pumas$puma_ps_group1[us_pumas$PUMACE10 %in% puma_overundersample_index] = us_pumas$puma_ps[us_pumas$PUMACE10 %in% puma_overundersample_index]

png(filename = paste0("truth_ps_spatial.png"),
    width = 1000, height = 1000)

spplot(us_pumas[,], c("puma_ps", "puma_ps_group1"),
       at = seq(0, max(as.data.frame(us_pumas[, c("puma_ps")])) + 0.01, length.out = 40),
       colorkey = list(space = "bottom"),
       col = "transparent",
       names.attr = c("Poststratified true preferences", "17 PUMA over/undersampled near Boston"))

dev.off()

# majorityvote plot -------------------------------------------------

colnames(majorityvote_df)[2] = "Model"
colnames(majorityvote_df)[3] = "p"


png(filename = paste0("majorityvote_",
                      runs,
                      "_",
                      sample_size,
                      ".png"),
    width = 4000, height = 2400)

ggplot(majorityvote_df, aes(x = majorityvote, y = Model, fill = Model)) +
  ggridges::geom_density_ridges2(alpha = 0.7,
                                 quantile_lines = TRUE,
                                 quantiles = c(0.1, 0.5, 0.9),
                                 vline_size = 0.5,
                                 vline_color = "black",
                                 scale = 1) +
  xlab("\n Probability of voting yes \n") +
  ylab("\n Model \n") +
  facet_wrap(. ~ p, ncol = 3, nrow = 3, labeller = as_labeller(
    function(value) {
      return(value) # identity labeller: placeholder for renaming the facet labels
    })
  ) +
  geom_point(aes(x = as.numeric(true_popn_pref),
                 y = Model),
             size = 7,
             colour = "black",
             show.legend = F,
             inherit.aes = F) +
  theme_bw() +
  theme(plot.title = element_text(size = 50, face = "bold"),
        axis.text = element_text(size = 35),
        axis.title = element_text(size = 50, face = "bold", margin = 200),
        legend.text = element_text(size = 50),
        legend.position = "bottom",
        legend.key.size = unit(30, "line"),
        legend.key.height = unit(7, "line"),
        legend.key = element_rect(fill = "transparent", color = "transparent"),
        legend.title = element_text(size = 50, face = "bold"),
        strip.text.x = element_text(size = 50, face = "bold", margin = margin(t = 25, b = 25)),
        strip.background = element_rect(fill = "transparent", color = "transparent")) +
  scale_fill_manual(values = c("#4575b4", "#d73027", "#fdae61")) +
  guides(color = guide_legend(override.aes = list(size = 20)))

dev.off()


--------------------------------------------------------------------------------
/simulation_spatialmrp/ma_adj_matrix.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/ma_adj_matrix.rds
--------------------------------------------------------------------------------
/simulation_spatialmrp/ma_adj_matrix_sparse.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/ma_adj_matrix_sparse.rds
--------------------------------------------------------------------------------
/simulation_spatialmrp/poverty_poststrat.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/poverty_poststrat.rds
--------------------------------------------------------------------------------
/simulation_spatialmrp/prob_sampling_old.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/prob_sampling_old.rds
--------------------------------------------------------------------------------
/simulation_spatialmrp/proportion_puma.R:
--------------------------------------------------------------------------------
# This file calculates the proportion of simulation runs in which structured
# priors outperform baseline priors in MRP, for the spatial simulation.

rm(list = ls())
gc()


library(ggplot2)
library(dplyr)
library(sp)
library(reshape2)

sample_size = 1000
runs = 200
r = 1:9/10

poverty_poststrat = readRDS("poverty_poststrat.rds")
us_pumas = readRDS("us_pumas.rds")
ma_adj_matrix = readRDS("ma_adj_matrix.rds")
ma_adj_matrix_sparse = readRDS("ma_adj_matrix_sparse.rds")

num_education = length(unique(poverty_poststrat$education)) # number of strata for education
num_race = length(unique(poverty_poststrat$race_x)) # number of strata for race
num_puma = length(unique(poverty_poststrat$PUMA.x)) # number of PUMAs


icar_iid_puma_bias_comparison = matrix(0, length(r), num_puma)
icar_iid_puma_sd_comparison = matrix(0, length(r), num_puma)


counter_ = 1
prob_sampling_old = rep(0, length(r))

for (p in r) {
  load(paste0(runs,
              "_",
              sample_size,
              "_spatial_",
              p * 10,
              ".RData"))

  # get the interpretable p, i.e. the probability of sampling a respondent in
  # puma_overundersample_index
  poststrat_final_joined_p_interpretable = cbind(poststrat_final_joined_p,
    (poststrat_final_joined_p$p_response * poststrat_final_joined_p$N) /
      sum(poststrat_final_joined_p$p_response * poststrat_final_joined_p$N)
  )
  #
  colnames(poststrat_final_joined_p_interpretable)[length(colnames(poststrat_final_joined_p_interpretable))] = "prob_sampling_g1"
  # realized probability of sampling the over/undersampled PUMA cluster
  prob_sampling_old[counter_] = round(sum(poststrat_final_joined_p_interpretable[poststrat_final_joined_p_interpretable$puma_area %in% puma_overundersample_index,
                                                                                 c("prob_sampling_g1")]),
                                      2)

  # proportion of runs where the ICAR model's absolute bias is no larger than the IID model's
  icar_iid_puma_bias_comparison[counter_,] = colSums((abs(sweep(median_puma_icar,
                                                                2,
                                                                poststrat_puma_pref$puma_ps,
                                                                "-")) -
                                                        abs(sweep(median_puma_iid,
                                                                  2,
                                                                  poststrat_puma_pref$puma_ps,
                                                                  "-"))) <= 0) / dim(median_puma_icar)[1]

  # proportion of runs where the ICAR model's 90%-10% posterior interval is no wider than the IID model's
  icar_iid_puma_sd_comparison[counter_,] = colSums(((quantile90_puma_icar - quantile10_puma_icar) -
                                                      (quantile90_puma_iid - quantile10_puma_iid)) <= 0) / dim(median_puma_icar)[1]


  counter_ = counter_ + 1
}

colnames(icar_iid_puma_bias_comparison) = poststrat_puma_pref$puma_area
colnames(icar_iid_puma_sd_comparison) = poststrat_puma_pref$puma_area

icar_iid_puma_bias_comparison_melted = melt(icar_iid_puma_bias_comparison)
icar_iid_puma_sd_comparison_melted = melt(icar_iid_puma_sd_comparison)

colnames(icar_iid_puma_bias_comparison_melted) = c("p", "puma", "proportion")
colnames(icar_iid_puma_sd_comparison_melted) = c("p", "puma", "proportion")

icar_iid_puma_bias_comparison_melted$puma_group = ""
icar_iid_puma_sd_comparison_melted$puma_group = ""

icar_iid_puma_bias_comparison_melted[icar_iid_puma_bias_comparison_melted$puma %in% puma_overundersample_index, c("puma_group")] = "Near Boston"
icar_iid_puma_bias_comparison_melted[!(icar_iid_puma_bias_comparison_melted$puma %in% puma_overundersample_index), c("puma_group")] = "Away From Boston"

icar_iid_puma_sd_comparison_melted[icar_iid_puma_sd_comparison_melted$puma %in% puma_overundersample_index, c("puma_group")] = "Near Boston"
icar_iid_puma_sd_comparison_melted[!(icar_iid_puma_sd_comparison_melted$puma %in% puma_overundersample_index), c("puma_group")] = "Away From Boston"

icar_iid_puma_bias_comparison_melted$prob_sampling = rep(times = 52, x = prob_sampling_old)
icar_iid_puma_sd_comparison_melted$prob_sampling = rep(times = 52, x = prob_sampling_old)


# bias plot
saveRDS(icar_iid_puma_bias_comparison_melted,
        paste0("proportion_puma_", sample_size, ".rds"))

ggplot(icar_iid_puma_bias_comparison_melted,
       aes(x = prob_sampling, y = proportion, group = puma)) +
  geom_line(aes(col = as.factor(puma_group)), show.legend = TRUE, size = 0.5) +
  xlab("\n Probability of sampling for cluster of PUMA near Boston \n") +
  ylab("\n Improvement proportion for bias \n") +
  guides(col = guide_legend(title = "Sampling Group",
                            override.aes = list(size = 10*1.6))) +
  geom_hline(yintercept = 0.5, color = "black", size = 1, linetype = "dashed") +
  scale_color_viridis_d(begin = 0.25, end = 0.75)


# sd plot
saveRDS(icar_iid_puma_sd_comparison_melted,
        paste0("proportion_sd_puma_", sample_size, ".rds"))

ggplot(icar_iid_puma_sd_comparison_melted,
       aes(x = prob_sampling, y = proportion, group = puma)) +
  geom_line(aes(col = as.factor(puma_group)), show.legend = TRUE, size = 0.5) +
  xlab("\n Probability of sampling for cluster of PUMA near Boston \n") +
  ylab("\n Improvement proportion for sd \n") +
  guides(col = guide_legend(title = "Sampling Group",
                            override.aes = list(size = 10*1.6))) +
  geom_hline(yintercept = 0.5, color = "black", size = 1, linetype = "dashed") +
  scale_color_viridis_d(begin = 0.25, end = 0.75)


print(colMeans(icar_iid_puma_bias_comparison))
print(colMeans(icar_iid_puma_sd_comparison))

print(summary(colMeans(icar_iid_puma_bias_comparison)))
print(summary(colMeans(icar_iid_puma_sd_comparison)))

--------------------------------------------------------------------------------
/simulation_spatialmrp/smooth_x.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/smooth_x.rds
--------------------------------------------------------------------------------
/simulation_spatialmrp/us_pumas.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/us_pumas.rds
--------------------------------------------------------------------------------
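
The improvement-proportion calculation in proportion_puma.R (the `sweep`/`colSums` comparison of absolute biases) can be seen on a small example. This is only a sketch: the 3-run by 2-area posterior-median matrices and the `truth` vector below are made up for illustration, standing in for `median_puma_icar`, `median_puma_iid`, and `poststrat_puma_pref$puma_ps`; the formula itself is the one used in the script.

```r
# Hypothetical posterior medians: rows = simulation runs, columns = small areas
median_icar = matrix(c(0.50, 0.52, 0.48,
                       0.30, 0.29, 0.35), nrow = 3)
median_iid  = matrix(c(0.55, 0.40, 0.47,
                       0.31, 0.20, 0.33), nrow = 3)
truth = c(0.50, 0.30)  # true poststratified preference per area

# sweep() subtracts the truth from every row; the comparison then counts,
# column by column, the runs where the ICAR absolute bias is <= the IID one
improvement = colSums((abs(sweep(median_icar, 2, truth, "-")) -
                         abs(sweep(median_iid, 2, truth, "-"))) <= 0) /
  nrow(median_icar)

print(improvement)  # area 1: 1 (ICAR at least as close in all 3 runs); area 2: 2/3
```

A value of at least 0.5 for an area means the structured prior matched or beat the baseline in a majority of runs, which is what the dashed horizontal line at 0.5 in the bias and sd plots marks.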