├── LICENSE
├── README.md
├── preprocess
    └── ACS_PS_annenberg_fiveyear.R
├── realdata_annenberg
    ├── baselinemeanzeroN01_annenberg_v3_centered.stan
    ├── fitmodel_annenberg_v3.R
    ├── poststrat_annenberg_v3.R
    ├── proposedN01_annenberg_v3_centered.stan
    └── proposedarN01_annenberg_v3_centered.stan
├── simulation
    ├── baselinemeanzeroN01_v3.stan
    ├── biasfacet_combinedplots.R
    ├── poststrat_pipeline_testing_age3_v3_posteriorvariance.R
    ├── proportion_agecat.R
    ├── proposedN01_v3.stan
    ├── proposedarN01_v3.stan
    └── threemodelwriteup_v3_posteriorvariance.R
└── simulation_spatialmrp
    ├── icar_mrp_simulation_bym2.R
    ├── icar_mrp_viz.R
    ├── ma_adj_matrix.rds
    ├── ma_adj_matrix_sparse.rds
    ├── poverty_poststrat.rds
    ├── prob_sampling_old.rds
    ├── proportion_puma.R
    ├── smooth_x.rds
    └── us_pumas.rds


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2019 Yuxiang Gao
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Code for reproducing results in "Improving multilevel regression and poststratification with structured priors"
 2 | 
 3 | ## Authors
 4 | - Yuxiang Gao
 5 | - Lauren Kennedy
 6 | - Daniel Simpson
 7 | - Andrew Gelman
 8 | 
 9 | Accepted for publication in _Bayesian Analysis_. arXiv preprint: https://arxiv.org/abs/1908.06716
10 | 
11 | ## Steps to run simulation pipeline in simulation directory (directed structured priors example)
12 | 1. The two .dta files in poststrat_pipeline_testing_age3_v3_posteriorvariance.R are retrieved from the paper, Estimating State Public Opinion with Multi-level Regression and Poststratification using R. URL: https://scholar.princeton.edu/jkastellec/publications. Download these two files before running the simulation pipeline, and put them in the simulation directory.
13 | 
14 | 2. Choose age preference curve with coef_age in poststrat_pipeline_testing_age3_v3_posteriorvariance.R
15 | 
16 | 3. Choose sample size, simulation runs, probability indices and number of age categories with sample_size, runs, r, age_grouping_multiplier in poststrat_pipeline_testing_age3_v3_posteriorvariance.R
17 | 
18 | 4. Run poststrat_pipeline_testing_age3_v3_posteriorvariance.R
19 | 
20 | 5. Run threemodelwriteup_v3_posteriorvariance.R with the same configurations. This will produce the bias plots shown in the paper for a given configuration.
21 | 
22 | 6. Run proportion_agecat.R to visualize the share of simulations that the RW and AR priors outperformed the IID priors
23 | 
24 | ## Steps to run simulation pipeline in simulation_spatialmrp directory (undirected structured priors example)
25 | 
26 | 1. Choose sample_size in icar_mrp_simulation_bym2.R and then run the script. This conducts spatial MRP simulations, based off binomial regression
27 | 
28 | 2. Run icar_mrp_viz.R with the same configurations to visualize the spatial MRP simulations
29 | 
30 | 3. Run proportion_puma.R to visualize share of simulations that the BYM2 priors outperformed the IID priors
31 | 
32 | ## Steps to run data analysis on 2008 Annenberg phone survey in realdata_annenberg directory
33 | 1. Request access for the 2008 National Annenberg Election Survey (NAES) telephone from the Annenberg Public Policy Center and put the phone survey text file into the realdata_annenberg directory. Rename it to annenbergphone2008.txt
34 | 
35 | 2. acs_ps.rds is the 2006-2010 5-year American Community Survey (ACS) in a cleaned-up format. Create a new path ~/Desktop/annenbergdata/fiveyearacs/ and then download the 5-year ACS and unzip in this path. Finally, run ACS_PS_annenberg_fiveyear.R in the preprocess directory to get acs_ps.rds in ~/Desktop/annenbergdata/fiveyearacs/
36 | 
37 | 3. Make sure state_level_update.dta, gay_marriage_megapoll.dta, acs_ps.rds, annenbergphone2008.txt are in realdata_annenberg directory. The first two files are from Step 1 in the simulation pipeline.
38 | 
39 | 4. Run fitmodel_annenberg_v3.R for age_grouping_multiplier = 12,48,72 and for all three models in the variables m_file,m_name
40 | 
41 | 5. Run poststrat_annenberg_v3.R
42 | 
43 | ## Notes: 
44 | 
45 | 1. To download the 2006-2010 5-year ACS, download both United States Population Records and United States Housing Unit Records in the data set 2006-2010 ACS 5-year Public Use Microdata Samples (PUMS) - CSV format, which can be found on https://factfinder.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t#
46 | 
47 | 2. The US heatmap and proportion tables in the Annenberg data analysis are derived from annenbergphone2008.txt and acs_ps.rds. 
48 | 
49 | The covariate N in acs_ps.rds, which corresponds to a person's weight, is summed when calculating heatmap and proportion values. As an example, to get the proportions for the education covariate in the 2006-2010 ACS, we use:
50 | 
51 | ```R
52 | acs_ps %>% group_by(education) %>% 
53 |   summarise(percentage = sum(N)/sum(acs_ps$N)) %>% 
54 |   dplyr::mutate(percentage = round(100 * percentage, 2)) %>% data.frame()
55 | ```
56 | 
57 | 3. For a working simulation example that doesn't require downloading data files, see my StanCon 2019 repo.
58 | 
59 | 4. Noncentered parameterizations were used for the simulation studies whereas centered parameterizations were used for the real data analysis. NUTS sampling in Stan struggled with the noncentered parameterizations for the real data analysis but centering fixed this.
60 | 
61 | 5. In the simulation_spatialmrp directory: poverty_poststrat.rds contains the poststratification matrix that's derived from the 2013-2017 5-year ACS, ma_adj_matrix_sparse.rds contains the adjacency matrix of the 52 PUMAs in the state MA, smooth_x.rds contains a sample from a Gaussian Markov random field over the space of 52 PUMAs. ma_adj_matrix_sparse.rds is derived from the tigris package.
62 | 


--------------------------------------------------------------------------------
/preprocess/ACS_PS_annenberg_fiveyear.R:
--------------------------------------------------------------------------------
  1 | rm(list=ls())
  2 | 
  3 | # for 5 YEAR ACS 2006 - 2010
  4 | #load into the ACS and make PS table.
  5 | library(dplyr)
  6 | library(readr)
  7 | library(data.table)
  8 | 
  9 | # data dictionary below
 10 | # https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2006-2010.pdf?#
 11 | 
 12 | # load data
 13 | acs1 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_pus/ss10pusa.csv", sep = ",", header =TRUE)
 14 | acs1 = acs1[acs1$AGEP>17,]
 15 | acs1_tr = acs1[,c("serialno",
 16 |                   "AGEP", # age
 17 |                   "SEX", # sex
 18 |                   "PWGTP", # person's weight
 19 |                   "HISP", # hispanic ethnicity
 20 |                   "RAC1P", # race
 21 |                   "ST", # state
 22 |                   "SCHL", # Educational attainment,
 23 |                   "RT" # record type
 24 | )]
 25 | rm(acs1)
 26 | gc() # free up space
 27 | 
 28 | 
 29 | acs2 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_pus/ss10pusb.csv", sep = ",", header =TRUE)
 30 | acs2 = acs2[acs2$AGEP>17,]
 31 | acs2_tr = acs2[,c("serialno",
 32 |                   "AGEP", # age
 33 |                   "SEX", # sex
 34 |                   "PWGTP", # person's weight
 35 |                   "HISP", # hispanic ethnicity: FOR RACE
 36 |                   "RAC1P", # race: FOR RACE
 37 |                   "ST", # state
 38 |                   "SCHL", # Educational attainment
 39 |                   "RT" # record type
 40 | )] 
 41 | rm(acs2)
 42 | gc() # free up space
 43 | 
 44 | 
 45 | acs3 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_pus/ss10pusc.csv", sep = ",", header =TRUE)
 46 | acs3 = acs3[acs3$AGEP>17,]
 47 | acs3_tr = acs3[,c("serialno",
 48 |                   "AGEP", # age
 49 |                   "SEX", # sex
 50 |                   "PWGTP", # person's weight
 51 |                   "HISP", # hispanic ethnicity
 52 |                   "RAC1P", # race
 53 |                   "ST", # state
 54 |                   "SCHL", # Educational attainment
 55 |                   "RT"
 56 | )]
 57 | rm(acs3)
 58 | gc()
 59 | 
 60 | 
 61 | acs4 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_pus/ss10pusd.csv", sep = ",", header =TRUE)
 62 | acs4 = acs4[acs4$AGEP>17,]
 63 | acs4_tr = acs4[,c("serialno",
 64 |                   "AGEP", # age
 65 |                   "SEX", # sex
 66 |                   "PWGTP", # person's weight
 67 |                   "HISP", # hispanic ethnicity: FOR RACE
 68 |                   "RAC1P", # race: FOR RACE
 69 |                   "ST", # state
 70 |                   "SCHL", # Educational attainment,
 71 |                   "RT"
 72 | )] 
 73 | rm(acs4)
 74 | gc()
 75 | 
 76 | acs_tr = rbind(acs1_tr, acs2_tr, acs3_tr, acs4_tr)
 77 | 
 78 | 
 79 | 
 80 | acsh1 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_hus/ss10husa.csv", sep = ",", header =TRUE)
 81 | acsh1_tr = acsh1[,c("serialno", "HINCP")]
 82 | rm(acsh1)
 83 | gc()
 84 | 
 85 | 
 86 | acsh2 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_hus/ss10husb.csv", sep = ",", header =TRUE)
 87 | acsh2_tr = acsh2[,c("serialno","HINCP")]
 88 | rm(acsh2)
 89 | gc()
 90 | 
 91 | 
 92 | acsh3 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_hus/ss10husc.csv", sep = ",", header =TRUE)
 93 | acsh3_tr = acsh3[,c("serialno","HINCP")]
 94 | rm(acsh3)
 95 | gc()
 96 | 
 97 | 
 98 | acsh4 = fread("~/Desktop/annenbergdata/fiveyearacs/csv_hus/ss10husd.csv", sep = ",", header =TRUE)
 99 | acsh4_tr = acsh4[,c("serialno","HINCP")]
100 | rm(acsh4)
101 | gc()
102 | 
103 | 
104 | 
105 | #sum(names(acsh1)==names(acsh2))
106 | #sum(names(acsh2)==names(acsh3))
107 | #sum(names(acsh3)==names(acsh4))
108 | 
109 | 
110 | acsh_tr = rbind(acsh1_tr, acsh2_tr, acsh3_tr, acsh4_tr)
111 | 
112 | # check that there are no duplicate rows in acsh_tr
113 | print(length(unique(acsh_tr$serialno)))
114 | print(dim(acsh_tr))
115 | 
116 | acs_tr = dplyr::left_join(acs_tr, acsh_tr, by = c("serialno"="serialno"))
117 | 
118 | rm(acs1_tr, acs2_tr, acs3_tr, acs4_tr)
119 | rm(acsh1_tr, acsh2_tr, acsh3_tr, acsh4_tr)
120 | 
121 | acs_tr = acs_tr[complete.cases(acs_tr),] # remove NAs
122 | # PREPROCESS RACE
123 | 
124 | # Recoded detailed race code
125 | # 1 .White alone
126 | # 2 .Black or African American alone
127 | # 3 .American Indian alone
128 | # 4 .Alaska Native alone
129 | # 5 .American Indian and Alaska Native tribes specified; or American
130 | # .Indian or Alaska native, not specified and no other races
131 | # 6 .Asian alone
132 | # 7 .Native Hawaiian and Other Pacific Islander alone
133 | # 8 .Some other race alone
134 | # 9 .Two or more major race groups
135 | 
136 | acs_tr$race_x = ifelse(acs_tr$RAC1P==1 & acs_tr$HISP==1,"white",
137 |                        ifelse(acs_tr$RAC1P==2 & acs_tr$HISP==1,"black",
138 |                               ifelse(acs_tr$RAC1P==6 & acs_tr$HISP==1, "asian",
139 |                                      ifelse(acs_tr$RAC1P==3 & acs_tr$HISP==1,"americanindian",
140 |                                             ifelse(!(acs_tr$RAC1P %in% c(1,2,6,3)) & acs_tr$HISP %in% 2:24, "hisp",
141 |                                                    "other")
142 |                                      )
143 |                               )
144 |                        )
145 | )
146 | 
147 | 
148 | # PREPROCESS EDUCATION
149 | 
150 | # Educational attainment
151 | # bb .N/A (less than 3 years old)
152 | # 01 .No schooling completed
153 | # 02 .Nursery school to grade 4
154 | # 03 .Grade 5 or grade 6
155 | # 04 .Grade 7 or grade 8
156 | # 05 .Grade 9
157 | # 06 .Grade 10
158 | # 07 .Grade 11
159 | # 08 .12th grade, no diploma
160 | # 09 .High school graduate
161 | # 10 .Some college, but less than 1 year
162 | # 11 .One or more years of college, no degree
163 | # 12 .Associate's degree
164 | # 13 .Bachelor's degree
165 | # 14 .Master's degree
166 | # 15 .Professional school degree
167 | # 16 .Doctorate degree
168 | 
169 | acs_tr$education_x = ifelse(acs_tr$SCHL %in% 1:8, "nohighschool",
170 |                             ifelse(acs_tr$SCHL==9, "highschool",
171 |                                    ifelse(acs_tr$SCHL %in% c(10,11), "somecollege",
172 |                                           ifelse(acs_tr$SCHL==12, "twoyeardegree",
173 |                                                  ifelse(acs_tr$SCHL==13, "fouryeardegree",
174 |                                                         "fouryeardegreeplus")
175 |                                           )
176 |                                    )
177 |                             )
178 | )
179 | 
180 | # PREPROCESS STATE
181 | 
182 | # 01 .Alabama/AL
183 | # 02 .Alaska/AK
184 | # 04 .Arizona/AZ
185 | # 05 .Arkansas/AR
186 | # 06 .California/CA
187 | # 08 .Colorado/CO
188 | # 09 .Connecticut/CT
189 | # 10 .Delaware/DE
190 | # 11 .District of Columbia/DC
191 | # 12 .Florida/FL
192 | # 13 .Georgia/GA
193 | # 15 .Hawaii/HI
194 | # 16 .Idaho/ID
195 | # 17 .Illinois/IL
196 | # 18 .Indiana/IN
197 | # 19 .Iowa/IA
198 | # 20 .Kansas/KS
199 | # 21 .Kentucky/KY
200 | # 22 .Louisiana/LA
201 | # 23 .Maine/ME
202 | # 24 .Maryland/MD
203 | # 25 .Massachusetts/MA
204 | # 26 .Michigan/MI
205 | # 27 .Minnesota/MN
206 | # 28 .Mississippi/MS
207 | # 29 .Missouri/MO
208 | # 30 .Montana/MT
209 | # 31 .Nebraska/NE
210 | # 32 .Nevada/NV
211 | # 33 .New Hampshire/NH
212 | # 34 .New Jersey/NJ 
213 | # 35 .New Mexico/NM
214 | # 36 .New York/NY
215 | # 37 .North Carolina/NC
216 | # 38 .North Dakota/ND
217 | # 39 .Ohio/OH
218 | # 40 .Oklahoma/OK
219 | # 41 .Oregon/OR
220 | # 42 .Pennsylvania/PA
221 | # 44 .Rhode Island/RI
222 | # 45 .South Carolina/SC
223 | # 46 .South Dakota/SD
224 | # 47 .Tennessee/TN
225 | # 48 .Texas/TX
226 | # 49 .Utah/UT
227 | # 50 .Vermont/VT
228 | # 51 .Virginia/VA
229 | # 53 .Washington/WA
230 | # 54 .West Virginia/WV
231 | # 55 .Wisconsin/WI
232 | # 56 .Wyoming/WY
233 | # 72 .Puerto Rico/PR
234 | 
235 | # DOUBLE CHECK BELOW
236 | acs_tr$state_x = case_when(acs_tr$ST == 1 ~ "alabama",
237 |                            acs_tr$ST == 2 ~ "alaska",
238 |                            acs_tr$ST == 4 ~ "arizona",
239 |                            acs_tr$ST == 5 ~ "arkansas",
240 |                            acs_tr$ST == 6 ~ "california",
241 |                            acs_tr$ST == 8 ~ "colorado",
242 |                            acs_tr$ST == 9 ~ "connecticut",
243 |                            acs_tr$ST == 10 ~ "delaware",
244 |                            acs_tr$ST == 11 ~ "columbia",
245 |                            acs_tr$ST == 12 ~ "florida",
246 |                            acs_tr$ST == 13 ~ "georgia",
247 |                            acs_tr$ST == 15 ~ "hawaii",
248 |                            acs_tr$ST == 16 ~ "idaho",
249 |                            acs_tr$ST == 17 ~ "illinois",
250 |                            acs_tr$ST == 18 ~ "indiana",
251 |                            acs_tr$ST == 19 ~ "iowa",
252 |                            acs_tr$ST == 20 ~ "kansas",
253 |                            acs_tr$ST == 21 ~ "kentucky",
254 |                            acs_tr$ST == 22 ~ "louisiana",
255 |                            acs_tr$ST == 23 ~ "maine",
256 |                            acs_tr$ST == 24 ~ "maryland",
257 |                            acs_tr$ST == 25 ~ "massachusetts",
258 |                            acs_tr$ST == 26 ~ "michigan",
259 |                            acs_tr$ST == 27 ~ "minnesota",
260 |                            acs_tr$ST == 28 ~ "mississippi",
261 |                            acs_tr$ST == 29 ~ "missouri",
262 |                            acs_tr$ST == 30 ~ "montana",
263 |                            acs_tr$ST == 31 ~ "nebraska",
264 |                            acs_tr$ST == 32 ~ "nevada",
265 |                            acs_tr$ST == 33 ~ "newhampshire",
266 |                            acs_tr$ST == 34 ~ "newjersey",
267 |                            acs_tr$ST == 35 ~ "newmexico",
268 |                            acs_tr$ST == 36 ~ "newyork",
269 |                            acs_tr$ST == 37 ~ "northcarolina",
270 |                            acs_tr$ST == 38 ~ "northdakota",
271 |                            acs_tr$ST == 39 ~ "ohio",
272 |                            acs_tr$ST == 40 ~ "oklahoma",
273 |                            acs_tr$ST == 41 ~ "oregon",
274 |                            acs_tr$ST == 42 ~ "pennsylvania",
275 |                            acs_tr$ST == 44 ~ "rhodeisland",
276 |                            acs_tr$ST == 45 ~ "southcarolina",
277 |                            acs_tr$ST == 46 ~ "southdakota",
278 |                            acs_tr$ST == 47 ~ "tennessee",
279 |                            acs_tr$ST == 48 ~ "texas",
280 |                            acs_tr$ST == 49 ~ "utah",
281 |                            acs_tr$ST == 50 ~ "vermont",
282 |                            acs_tr$ST == 51 ~ "virginia",
283 |                            acs_tr$ST == 53 ~ "washington",
284 |                            acs_tr$ST == 54 ~ "westvirginia",
285 |                            acs_tr$ST == 55 ~ "wisconsin",
286 |                            acs_tr$ST == 56 ~ "wyoming",
287 |                            TRUE ~ as.character(acs_tr$ST))
288 | 
289 | 
290 | # PREPROCESS INCOME
291 | # match annenberg coding
292 | acs_tr$income_x = case_when(acs_tr$HINCP < 10000 ~ 1,
293 |                             acs_tr$HINCP >= 10000 & acs_tr$HINCP < 15000 ~ 2,
294 |                             acs_tr$HINCP >= 15000 & acs_tr$HINCP < 25000 ~ 3,
295 |                             acs_tr$HINCP >= 25000 & acs_tr$HINCP < 35000 ~ 4,
296 |                             acs_tr$HINCP >= 35000 & acs_tr$HINCP < 50000 ~ 5,
297 |                             acs_tr$HINCP >= 50000 & acs_tr$HINCP < 75000 ~ 6,
298 |                             acs_tr$HINCP >= 75000 & acs_tr$HINCP < 100000 ~ 7,
299 |                             acs_tr$HINCP >= 100000 & acs_tr$HINCP < 150000 ~ 8,
300 |                             acs_tr$HINCP >= 150000 ~ 9)
301 | 
302 | # PREPROCESS SEX
303 | # 1 .Male
304 | # 2 .Female
305 | 
306 | # PREPROCESS AGE
307 | # no need to preprocess age
308 | 
309 | # construct maximal poststratification matrix --------------------------
310 | 
311 | # 1. do dplyr grouping
312 | acs_ps = acs_tr %>%
313 |   group_by(AGEP, SEX, race_x, education_x, state_x, income_x) %>%
314 |   summarize(N = sum(PWGTP)) %>% # why sum over PWGTP?
315 |   ungroup()
316 | 
317 | # check that you don't have all subpopulation cells
318 | # prod(apply(acs_ps, 2, function(x) {length(unique(x))})[-7])
319 | 
320 | # 2. create empty maximal poststrat matrix
321 | acs_ps_maximal_x = expand.grid(age = unique(acs_tr$AGEP),
322 |                                sex = unique(acs_tr$SEX),
323 |                                race = unique(acs_tr$race_x),
324 |                                education = unique(acs_tr$education_x),
325 |                                state = unique(acs_tr$state_x),
326 |                                income = unique(acs_tr$income_x),
327 |                                stringsAsFactors = FALSE
328 | )
329 | 
330 | # 3. grab the groupings from acs_ps
331 | acs_ps_maximal = dplyr::left_join(acs_ps_maximal_x, acs_ps, by = c("age"="AGEP",
332 |                                                                    "sex"="SEX",
333 |                                                                    "race"="race_x",
334 |                                                                    "education"="education_x",
335 |                                                                    "state"="state_x",
336 |                                                                    "income"="income_x"))
337 | 
338 | 
339 | # 4. Save poststratification matrix
340 | saveRDS(data.frame(acs_ps),file='~/Desktop/annenbergdata/fiveyearacs/acs_ps.rds')
341 | 
342 | rm(list=ls())
343 | 
344 | 


--------------------------------------------------------------------------------
/realdata_annenberg/baselinemeanzeroN01_annenberg_v3_centered.stan:
--------------------------------------------------------------------------------
  1 | data {
  2 |   int<lower=0> N; // number of samples
  3 |   int<lower=0> N_groups_age; // the number of groups for age
  4 |   int<lower=0> N_groups_race; // the number of groups for race
  5 |   int<lower=0> N_groups_sex; // the number of groups for sex
  6 |   
  7 |   int<lower=0> N_groups_education; // the number of groups for education
  8 |   int<lower=0> N_groups_income; // the number of groups for income
  9 |   int<lower=0> N_groups_state; // the number of groups for state  
 10 |   
 11 |   int<lower=1,upper=N_groups_age> age[N]; // the column vector of design matrix X for age
 12 |   int<lower=1,upper=N_groups_race> race[N]; // the column vector of design matrix X for race
 13 |   int<lower=0,upper=N_groups_sex-1> sex[N]; // the column vector of design matrix X for sex
 14 |   
 15 |   int<lower=1,upper=N_groups_education> education[N]; // the column vector of design matrix X for education
 16 |   int<lower=1,upper=N_groups_income> income[N]; // the column vector of design matrix X for income
 17 |   int<lower=1,upper=N_groups_state> state[N]; // the column vector of design matrix X for state
 18 |   
 19 |   int<lower=1,upper=5> region_index[51]; // the index representing the region that a state is in according to jkastell 
 20 |   real<lower=0,upper=1> state_vs[51]; // the 2004 Republican vote share from jkastell
 21 |   real<lower=0,upper=1> relig[51]; // the 2004 conservative religion percentage
 22 |   
 23 |   int y[N]; // the response vector
 24 | }
 25 | parameters {
 26 |   vector[N_groups_age] U_age; // the random effect for age, not multiplied by sigma_age
 27 |   vector[N_groups_race] U_race; // the random effect for race, not multiplied by sigma_race
 28 |   vector[N_groups_income] U_income; 
 29 |   vector[N_groups_state] U_state; // we are using the centered parameterization for U_state now, so U_state_transformed = U_state now
 30 |   vector[N_groups_education] U_education;
 31 |   
 32 |   vector[5] U_region; // the nested random effect for region
 33 | 
 34 |   real<lower=0> sigma_age; // sd of U_age (hyperparam).
 35 |   real<lower=0> sigma_race; // sd of U_race (hyperparam). 
 36 |   real<lower=0> sigma_income;
 37 |   real<lower=0> sigma_state;
 38 |   real<lower=0> sigma_education;
 39 |   
 40 |   real<lower=0> sigma_region;
 41 |   real beta_state; // coef for Republican vote share in every state
 42 |   real beta_relig; // coef for conservative religion share in every state
 43 |   
 44 |   real intercept; // the intercept (global fixed effect)
 45 |   real beta_sex;
 46 | 
 47 | }
 48 | transformed parameters { 
 49 |   vector[N_groups_age] U_age_transformed;
 50 |   vector[N_groups_race] U_race_transformed;
 51 |   vector[N_groups_income] U_income_transformed;
 52 |   vector[N_groups_state] U_state_transformed;
 53 |   vector[N_groups_education] U_education_transformed;
 54 |   
 55 |   vector[5] U_region_transformed;
 56 |   
 57 |   vector[N] yhat;
 58 | 
 59 |   U_age_transformed = sigma_age * U_age; // the random effect for age
 60 |   U_race_transformed = sigma_race * U_race; // the random effect for race
 61 |   U_income_transformed = sigma_income * U_income;
 62 |   //U_state_transformed = sigma_state * U_state;
 63 |   U_education_transformed = sigma_education * U_education;
 64 |   
 65 |   U_region_transformed = sigma_region * U_region;
 66 |   
 67 |   U_state_transformed = U_state;
 68 |   
 69 |   // noncentered parameterization for U_state_transformed
 70 | //  for (j in 1:51) { // j iterates through the state index that I defined through levels_state in fitmodel_annenberg R file
 71 | //    U_state_transformed[j] = (U_state[j] * sigma_state) * ( U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]) );
 72 | //  }
 73 | 
 74 |   for (i in 1:N) {
 75 |       yhat[i] = intercept + U_age_transformed[age[i]] + U_race_transformed[race[i]] + U_income_transformed[income[i]] + U_state_transformed[state[i]] + U_education_transformed[education[i]] + beta_sex * sex[i]; // the linear predictor at each point
 76 |   }
 77 |   
 78 | }
 79 | model {
 80 |   sigma_age ~ normal(0,1); // sigma_A ~ lognormal(0,1). hyperparam.
 81 |   sigma_race ~ normal(0,1); // sigma_I ~ lognormal(0,1). hyperparam.
 82 |   sigma_income ~ normal(0,1);
 83 |   sigma_state ~ normal(0,1);
 84 |   sigma_education ~ normal(0,1);
 85 |   
 86 |   sigma_region ~ normal(0,1);
 87 |   
 88 |   U_age ~ normal(0, 1);
 89 |   U_race ~ normal(0, 1); // random effect for race is normal
 90 |   U_income ~ normal(0, 1); 
 91 |   //U_state ~ normal(0, 1);
 92 |   U_education ~ normal(0, 1);
 93 |   
 94 |   U_region ~ normal(0,1);
 95 |   beta_state ~ normal(0,1);
 96 |   beta_relig ~ normal(0,1);
 97 |   
 98 |   for (j in 1:51) {
 99 |     U_state[j] ~ normal(U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]), sigma_state);
100 |   }
101 | 
102 |   intercept ~ normal(0, 1); 
103 |   beta_sex ~ normal(0, 1);
104 |   
105 |   for (i in 1:N) {
106 |     y[i] ~ bernoulli(inv_logit(yhat[i])); // the response
107 |   }
108 |   
109 | }
110 | 


--------------------------------------------------------------------------------
/realdata_annenberg/fitmodel_annenberg_v3.R:
--------------------------------------------------------------------------------
  1 | rm(list=ls())
  2 | 
  3 | library(rstan)
  4 | library(dplyr)
  5 | 
  6 | options(mc.cores = parallel::detectCores()) # use multiple cores. default is 4.
  7 | rstan_options(auto_write = TRUE) # save compiled stan model to hard disk so no need to recompile
  8 | 
  9 | # ---------------------------------------------------
 10 | age_grouping_multiplier = 12 # the number of strata for age
 11 | 
 12 | iterations = 2000
 13 | num_chains = 4
 14 | mtreedepth = 15
 15 | ad = 0.99
 16 | 
 17 | use_state = TRUE # if this is true we use state as response
 18 | 
 19 | # m_file = "baselinemeanzeroN01_annenberg_v3_centered.stan"
 20 | # m_name = "baseline_model_annenberg_centered"
 21 | 
 22 | m_file = "proposedarN01_annenberg_v3_centered.stan"
 23 | m_name = "ar_model_annenberg_centered"
 24 | 
 25 | # m_file = "proposedN01_annenberg_v3_centered.stan"
 26 | # m_name = "rw_model_annenberg_centered"
 27 | 
 28 | covariates_list = c("CEc01_c", # FAVOR SAME-SEX MARRIAGE - THE RESPONSE
 29 |                     "WA01_c", # SEX
 30 |                     "WA02_c", # AGE
 31 |                     "WA03_c", # EDUCATION
 32 |                     "WC03_c", # RACE
 33 |                     "WA04_c", # INCOME QUESTION 1
 34 |                     "WA05_c", # INCOME QUESTION 2
 35 |                     "WB03_c", # LINE OF WORK
 36 |                     "WFc01_c", # STATE OF RESIDENCY
 37 |                     "WC06_c", # YEARS LIVED IN THE US
 38 |                     "WFa01_c", # ADULTS IN HOUSEHOLD
 39 |                     "WC01_c", # Are you of Hispanic or Latino origin or descent?
 40 |                     "WHb01_c") # SEXUAL ORIENTATION
 41 | 
 42 | 
 43 | m = stan_model(file = m_file, model_name = m_name)
 44 | 
 45 | acs_ps = readRDS("acs_ps.rds") # load poststratification matrix
 46 | # "AGEP"        "SEX"         "race_x"      "education_x" "state_x"     "income_x"    "N"   
 47 | colnames(acs_ps) = c("age", "sex", "race", "education", "state", "income", "N")
 48 | # ---------------------------------------------------
 49 | 
 50 | survey_dat = readr::read_tsv("annenbergphone2008.txt", col_names = TRUE)
 51 | 
 52 | # Preprocess -------------------------------------------------------------------------
 53 | 
 54 | # Load state data from jkastell -----------------------------------------------------
 55 | # Load state-level data from http://www.princeton.edu/~jkastell/mrp_primer.html
 56 | StateLevel = foreign::read.dta("state_level_update.dta",convert.underscore = TRUE)
 57 | StateLevel = StateLevel[order(StateLevel$sstate.initnum),]
 58 | 
 59 | # Load state-level data from http://www.princeton.edu/~jkastell/mrp_primer.html
 60 | MarriageDataSP = foreign::read.dta("gay_marriage_megapoll.dta", convert.underscore = TRUE) 
 61 | 
 62 | # check that row names of Statelevel are equal to sstate.initnum
 63 | sum(as.numeric(as.character(rownames(StateLevel))) == StateLevel$sstate.initnum)
 64 | 
 65 | stateregions = unique(MarriageDataSP[,c("statename", "state.initnum", "region.cat", "region")])
 66 | 
 67 | stateregions_final = dplyr::left_join(x = StateLevel, 
 68 |                                       y = stateregions,
 69 |                                       by = c("sstate.initnum"="state.initnum"))[,c("sstate.initnum", 
 70 |                                                                                    "sstate",
 71 |                                                                                    "sstatename",
 72 |                                                                                    "region",
 73 |                                                                                    "region.cat",
 74 |                                                                                    "kerry.04",
 75 |                                                                                    "p.evang",
 76 |                                                                                    "p.mormon")]
 77 | 
 78 | # Alaska and Hawaii doesn't have a region so we  set it to west
 79 | stateregions_final[stateregions_final$sstatename == "Alaska", c("region")] = "west"
 80 | stateregions_final[stateregions_final$sstatename == "Hawaii", c("region")] = "west"
 81 | stateregions_final[stateregions_final$sstatename == "Alaska", c("region.cat")] = 4
 82 | stateregions_final[stateregions_final$sstatename == "Hawaii", c("region.cat")] = 4
 83 | 
 84 | stateregions_final$lowercasename = tolower(stringr::str_replace_all(stateregions_final$sstatename, " ", ""))
 85 | 
 86 | stateregions_final$lowercasename[stateregions_final$lowercasename=="d.c."]="columbia"
 87 | 
 88 | # ------------------------------------------------------------------------------------
 89 | 
 90 | survey_dat_selected = survey_dat %>% 
 91 |   select(covariates_list) %>% 
 92 |   filter(WA02_c < 100, # Age less than 100
 93 |          CEc01_c %in% c(1,3), # 1 is vote yes for gay marriage, 3 is no
 94 |   ) #%>%
 95 |   # mutate(gayfavor = as.numeric(CEc01_c),
 96 |   #        sex = as.numeric(WA01_c),
 97 |   #        age = as.numeric(WA02_c),
 98 |   #        education = as.numeric(WA03_c),
 99 |   #        race = as.numeric(WC03_c),
100 |   #        income = as.numeric(WA04_c),
101 |   #        state = as.numeric(WFc01_c),
102 |   #        yearsinus = as.numeric(WC06_c),
103 |   #        adultsinhouse = as.numeric(WFa01_c),
104 |   #        orientation = as.numeric(WHb01_c)
105 |   # ) # create new variable names
106 | 
107 | survey_dat_selected$CEc01_c[survey_dat_selected$CEc01_c==3] = 0 # change 3 to 0. 0 is no to gay marriage, 1 is yes.
108 | 
109 | # Preprocess race
110 | survey_dat_selected = survey_dat_selected %>% filter(WC03_c < 10) # remove 998 and 999 from race
111 | 
112 | survey_dat_selected$race_x = ifelse(survey_dat_selected$WC03_c==1 & survey_dat_selected$WC01_c==2, "white",
113 |                                     ifelse(survey_dat_selected$WC03_c==2 & survey_dat_selected$WC01_c==2, "black",
114 |                                            ifelse(survey_dat_selected$WC03_c==3 & survey_dat_selected$WC01_c==2, "asian",
115 |                                                   ifelse(survey_dat_selected$WC03_c==4 & survey_dat_selected$WC01_c==2, "americanindian",
116 |                                                          ifelse(survey_dat_selected$WC03_c==5 & survey_dat_selected$WC01_c==1, "hisp",
117 |                                                                 "other"))
118 |                                            )
119 |                                     )
120 | )
121 | 
122 | # Preprocess education
123 | survey_dat_selected = survey_dat_selected %>% filter(WA03_c < 100) # remove 998 and 999
124 | 
125 | survey_dat_selected$education_x = ifelse(survey_dat_selected$WA03_c %in% c(1,2), "nohighschool",
126 |                                          ifelse(survey_dat_selected$WA03_c %in% c(3,4), "highschool",
127 |                                                 ifelse(survey_dat_selected$WA03_c==5, "somecollege",
128 |                                                        ifelse(survey_dat_selected$WA03_c==6, "twoyeardegree",
129 |                                                               ifelse(survey_dat_selected$WA03_c==7, "fouryeardegree",
130 |                                                                      "fouryeardegreeplus")
131 |                                                        )
132 |                                                 )
133 |                                          )
134 | )
135 | 
136 | 
137 | # Preprocess state
138 | survey_dat_selected$state_x = case_when(survey_dat_selected$WFc01_c == 1 ~ "alabama",
139 |                                         survey_dat_selected$WFc01_c == 4 ~ "arizona",
140 |                                         survey_dat_selected$WFc01_c == 5 ~ "arkansas",
141 |                                         survey_dat_selected$WFc01_c == 6 ~ "california",
142 |                                         survey_dat_selected$WFc01_c == 8 ~ "colorado",
143 |                                         survey_dat_selected$WFc01_c == 9 ~ "connecticut",
144 |                                         survey_dat_selected$WFc01_c == 10 ~ "delaware",
145 |                                         survey_dat_selected$WFc01_c == 11 ~ "columbia",
146 |                                         survey_dat_selected$WFc01_c == 12 ~ "florida",
147 |                                         survey_dat_selected$WFc01_c == 13 ~ "georgia",
148 |                                         survey_dat_selected$WFc01_c == 16 ~ "idaho",
149 |                                         survey_dat_selected$WFc01_c == 17 ~ "illinois",
150 |                                         survey_dat_selected$WFc01_c == 18 ~ "indiana",
151 |                                         survey_dat_selected$WFc01_c == 19 ~ "iowa",
152 |                                         survey_dat_selected$WFc01_c == 20 ~ "kansas",
153 |                                         survey_dat_selected$WFc01_c == 21 ~ "kentucky",
154 |                                         survey_dat_selected$WFc01_c == 22 ~ "louisiana",
155 |                                         survey_dat_selected$WFc01_c == 23 ~ "maine",
156 |                                         survey_dat_selected$WFc01_c == 24 ~ "maryland",
157 |                                         survey_dat_selected$WFc01_c == 25 ~ "massachusetts",
158 |                                         survey_dat_selected$WFc01_c == 26 ~ "michigan",
159 |                                         survey_dat_selected$WFc01_c == 27 ~ "minnesota",
160 |                                         survey_dat_selected$WFc01_c == 28 ~ "mississippi",
161 |                                         survey_dat_selected$WFc01_c == 29 ~ "missouri",
162 |                                         survey_dat_selected$WFc01_c == 30 ~ "montana",
163 |                                         survey_dat_selected$WFc01_c == 31 ~ "nebraska",
164 |                                         survey_dat_selected$WFc01_c == 32 ~ "nevada",
165 |                                         survey_dat_selected$WFc01_c == 33 ~ "newhampshire",
166 |                                         survey_dat_selected$WFc01_c == 34 ~ "newjersey",
167 |                                         survey_dat_selected$WFc01_c == 35 ~ "newmexico",
168 |                                         survey_dat_selected$WFc01_c == 36 ~ "newyork",
169 |                                         survey_dat_selected$WFc01_c == 37 ~ "northcarolina",
170 |                                         survey_dat_selected$WFc01_c == 38 ~ "northdakota",
171 |                                         survey_dat_selected$WFc01_c == 39 ~ "ohio",
172 |                                         survey_dat_selected$WFc01_c == 40 ~ "oklahoma",
173 |                                         survey_dat_selected$WFc01_c == 41 ~ "oregon",
174 |                                         survey_dat_selected$WFc01_c == 42 ~ "pennsylvania",
175 |                                         survey_dat_selected$WFc01_c == 44 ~ "rhodeisland",
176 |                                         survey_dat_selected$WFc01_c == 45 ~ "southcarolina",
177 |                                         survey_dat_selected$WFc01_c == 46 ~ "southdakota",
178 |                                         survey_dat_selected$WFc01_c == 47 ~ "tennessee",
179 |                                         survey_dat_selected$WFc01_c == 48 ~ "texas",
180 |                                         survey_dat_selected$WFc01_c == 49 ~ "utah",
181 |                                         survey_dat_selected$WFc01_c == 50 ~ "vermont",
182 |                                         survey_dat_selected$WFc01_c == 51 ~ "virginia",
183 |                                         survey_dat_selected$WFc01_c == 53 ~ "washington",
184 |                                         survey_dat_selected$WFc01_c == 54 ~ "westvirginia",
185 |                                         survey_dat_selected$WFc01_c == 55 ~ "wisconsin",
186 |                                         survey_dat_selected$WFc01_c == 56 ~ "wyoming",
187 |                                         TRUE ~ as.character(survey_dat_selected$WFc01_c))
188 | 
189 | 
190 | # Preprocess income
191 | survey_dat_selected = survey_dat_selected %>% filter(WA04_c < 10)
192 | 
193 | # no need to preprocess sex covariate
194 | 
195 | # no need to preprocess age covariate
196 | 
197 | # Select covariates that we're interested in now ----------------------------------------------------
198 | 
199 | gaymarriagedata = survey_dat_selected[,c("CEc01_c", # favors gay marriage
200 |                                          "WA01_c", # sex
201 |                                          "WA02_c", # age
202 |                                          "education_x", # education
203 |                                          "race_x", # race
204 |                                          "WA04_c", # household income
205 |                                          "state_x")] %>% data.frame() # state
206 | 
207 | # change to characters because levels in maximal poststrat matrix are in characters 
208 | gaymarriagedata$CEc01_c = as.character(gaymarriagedata$CEc01_c)
209 | gaymarriagedata$WA01_c = as.character(gaymarriagedata$WA01_c)
210 | gaymarriagedata$WA04_c = as.character(gaymarriagedata$WA04_c)
211 | 
212 | gaymarriagedata_rm = gaymarriagedata[complete.cases(gaymarriagedata),] # remove NAs
213 | 
214 | 
215 | acs_ps$race = factor(acs_ps$race)
216 | acs_ps$sex = factor(acs_ps$sex)
217 | acs_ps$education = factor(acs_ps$education)
218 | acs_ps$income = factor(acs_ps$income)
219 | acs_ps$state = factor(acs_ps$state)
220 | 
221 | # store original levels from maximal poststrat matrix
222 | levels_race = levels(acs_ps$race)
223 | levels_sex = levels(acs_ps$sex)
224 | levels_education = levels(acs_ps$education)
225 | levels_income = levels(acs_ps$income)
226 | levels_state = levels(acs_ps$state)
227 | 
228 | # do some checks to see that we didn't make a typo. the levels from the maximal poststrat matrix should be the superset
229 | unique(gaymarriagedata_rm$race_x) %in% levels_race
230 | unique(gaymarriagedata_rm$WA01_c) %in% levels_sex
231 | unique(gaymarriagedata_rm$education) %in% levels_education
232 | unique(gaymarriagedata_rm$WA04_c) %in% levels_income
233 | unique(gaymarriagedata_rm$state_x) %in% levels_state
234 | 
235 | # get state index from jkastell in this dataframe as well
236 | 
237 | # check that all state names in the annenberg survey are the same as in  jkastell's file
238 | sum(unique(gaymarriagedata_rm$state_x) %in% unique(stateregions_final$lowercasename)) # should equal 49
239 | 
240 | # do left_join 
241 | gaymarriagedata_rm = dplyr::left_join(x = gaymarriagedata_rm, y = stateregions_final[,c("sstate.initnum", "region.cat", "lowercasename")], 
242 |                  by = c("state_x"="lowercasename"))
243 | 
244 | 
245 | # use above levels to factor survey data
246 | gaymarriagedata_rm$race_x = factor(gaymarriagedata_rm$race_x, levels = levels_race) # factor race
247 | gaymarriagedata_rm$WA01_c = factor(gaymarriagedata_rm$WA01_c, levels = levels_sex) # factor sex
248 | gaymarriagedata_rm$education_x = factor(gaymarriagedata_rm$education_x, levels = levels_education) # factor education
249 | gaymarriagedata_rm$WA04_c = factor(gaymarriagedata_rm$WA04_c, levels = levels_income) # factor income
250 | gaymarriagedata_rm$state_x = factor(gaymarriagedata_rm$state_x, levels = levels_state) # factor state
251 | 
252 | 
253 | gaymarriagedata_rm_clean_mapped = gaymarriagedata_rm
254 | # rename levels in our survey data to numerical values so we can feed it into a stan file
255 | levels(gaymarriagedata_rm_clean_mapped$race_x) = c(1, 2, 3, 4, 5, 6) # americanindian=1, asian=2, black=3, hisp=4, other=5, white=6
256 | levels(gaymarriagedata_rm_clean_mapped$WA01_c) = c(0, 1) # men=0, women=1
257 | levels(gaymarriagedata_rm_clean_mapped$education_x) = c(1, 2, 3, 4, 5, 6) # fouryeardegree=1, fouryeardegreeplus=2, highschool=3, nohighschool=4, somecollege=5, twoyeardegree=6
258 | levels(gaymarriagedata_rm_clean_mapped$state_x) = 1:51
259 | 
260 | #levels(gaymarriagedata_rm_clean_mapped$gayFavorFederalMarriage) = c(0, 1) # no==0, yes==1
261 | 
262 | summary(gaymarriagedata_rm_clean_mapped)
263 | summary(gaymarriagedata_rm)
264 | # create age-category covariate i.e stratify age
265 | # 1. acs_ps$age ranges from 18 to 95 so we remove 96 and 97
266 | gaymarriagedata_rm_clean_mapped = gaymarriagedata_rm_clean_mapped %>% filter(WA02_c <= 95 & WA02_c >= 18)
267 | gaymarriagedata_rm_clean_mapped$age_cat = cut(gaymarriagedata_rm_clean_mapped$WA02_c,
268 |                                               age_grouping_multiplier)
269 | 
270 | # store levels of age_cat
271 | levels_age_cat = levels(gaymarriagedata_rm_clean_mapped$age_cat)
272 | 
273 | # change levels of age_cat to numeric
274 | levels(gaymarriagedata_rm_clean_mapped$age_cat) = 1:age_grouping_multiplier
275 | 
276 | 
277 | # -----------------------------------------------------------------------------
278 | 
279 | # rename columns 
280 | print(colnames(gaymarriagedata_rm_clean_mapped))
281 | # "CEc01_c"     "WA01_c"      "WA02_c"      "education_x" "race_x"      "WA04_c"      "state_x"     "age_cat"    
282 | colnames(gaymarriagedata_rm_clean_mapped) = c("gayfavor", "sex", "age", "education", "race", "income", "state", "jk_state_index", "jk_region_cat", "age_cat")
283 | 
284 | 
285 | # we do a left join because we're using the state index ordering that I defined
286 | stateregions_final_ = dplyr::left_join(x = data.frame(levels_state, stringsAsFactors = FALSE), y = stateregions_final, by = c("levels_state"="lowercasename"))
287 | 
288 | # we will use the sstate.initnum order from stateregions_final
289 | # coef_state comes from data used in http://www.princeton.edu/~jkastell/MRP_primer/mrp_primer.pdf
290 | coef_state = (stateregions_final_$kerry.04)/100 # value for 2004 Democratic vote share in every state. This is X_{\text{state-vs},j} for j \in [51]
291 | 
292 | # coef_relig comes from data used in http://www.princeton.edu/~jkastell/MRP_primer/mrp_primer.pdf. this is the percentage of conservative religions in every state
293 | coef_relig = (stateregions_final_$p.evang + stateregions_final_$p.mormon)/100
294 | 
295 | # unfactor stuff in gaymarriagedata_rm_clean_mapped
296 | X = gaymarriagedata_rm_clean_mapped[,!(names(gaymarriagedata_rm_clean_mapped) %in% c("gayfavor", "age"))]
297 | Y = as.numeric(as.character(gaymarriagedata_rm_clean_mapped[,names(gaymarriagedata_rm_clean_mapped)=="gayfavor"]))
298 | 
299 | # change covariates to numeric so we can feed into stan model
300 | X$sex = as.numeric(as.character(X$sex))
301 | X$age_cat = as.numeric(as.character(X$age_cat))
302 | X$education = as.numeric(as.character(X$education))
303 | X$race = as.numeric(as.character(X$race))
304 | X$income = as.numeric(as.character(X$income))
305 | X$state = as.numeric(as.character(X$state))
306 | 
307 | # note that states 2 and 12 (alaska and hawaii) are not present in the survey sample
308 | 
309 | fit_realdata = sampling(m, 
310 |                         data = list(N = dim(X)[1],
311 |                                     N_groups_age = age_grouping_multiplier, # dependent on number of age groups
312 |                                     N_groups_race = length(unique(acs_ps$race)), # there are 4 categories in race
313 |                                     N_groups_sex = length(unique(acs_ps$sex)), # there are 2 categories in gender
314 |                                     N_groups_education = length(unique(acs_ps$education)),
315 |                                     N_groups_income = length(unique(acs_ps$income)),
316 |                                     
317 |                                     state_vs = coef_state, # the 51 values for 2004 Democratic vote share based on the state index ordering I defined for levels_state
318 |                                     relig = coef_relig, # the 51 values for conservative religion based on the state index ordering I defined for levels_state
319 |                                     region_index =  stateregions_final_$region.cat, # the 51 values for state region based on the state index ordering I defined for levels_state
320 |                                     
321 |                                     N_groups_state = length(unique(acs_ps$state)),
322 |                                     age = X$age_cat,
323 |                                     race = X$race,
324 |                                     sex = X$sex,
325 |                                     education = X$education,
326 |                                     income = X$income,
327 |                                     state = X$state, # we are using the state index that I defined to fit the model
328 |                                     y = Y),
329 |                         iter=iterations, 
330 |                         chains=num_chains,
331 |                         seed=21,
332 |                         control=list(max_treedepth=mtreedepth, adapt_delta=ad))
333 | 
334 | saveRDS(fit_realdata,
335 |         paste0(m_name,
336 |                "_",
337 |                age_grouping_multiplier,
338 |                ".rds"))
339 | 
340 | print("Done script.")
341 | 


--------------------------------------------------------------------------------
/realdata_annenberg/proposedN01_annenberg_v3_centered.stan:
--------------------------------------------------------------------------------
  1 | data {
  2 |   int<lower=0> N; // number of samples
  3 |   int<lower=0> N_groups_age; // the number of groups for age
  4 |   int<lower=0> N_groups_race; // the number of groups for race
  5 |   int<lower=0> N_groups_sex; // the number of groups for sex
  6 |   
  7 |   int<lower=0> N_groups_education; // the number of groups for education
  8 |   int<lower=0> N_groups_income; // the number of groups for income
  9 |   int<lower=0> N_groups_state; // the number of groups for state
 10 |   
 11 |   int<lower=1,upper=N_groups_age> age[N]; // the column vector of design matrix X for age
 12 |   int<lower=1,upper=N_groups_race> race[N]; // the column vector of design matrix X for race
 13 |   int<lower=0,upper=N_groups_sex-1> sex[N]; // the column vector of design matrix X for sex. Vector of 0/1
 14 |   
 15 |   int<lower=1,upper=N_groups_education> education[N]; // the column vector of design matrix X for education
 16 |   int<lower=1,upper=N_groups_income> income[N]; // the column vector of design matrix X for income
 17 |   int<lower=1,upper=N_groups_state> state[N]; // the column vector of design matrix X for state
 18 |   
 19 |   int<lower=1,upper=5> region_index[51]; // the index representing the region that a state is in according to jkastell 
 20 |   real<lower=0,upper=1> state_vs[51]; // the 2004 Republican vote share from jkastell
 21 |   real<lower=0,upper=1> relig[51]; // the 2004 conservative religion percentage
 22 |   
 23 |   int y[N]; // the response vector
 24 | }
 25 | parameters {
 26 |   vector[N_groups_age] U_age; // the random effect for age, not multiplied by sigma_age
 27 |   vector[N_groups_race] U_race; // the random effect for race, not multiplied by sigma_race
 28 |   vector[N_groups_income] U_income; 
 29 |   vector[N_groups_state] U_state; // we are using the centered parameterization for U_state now, so U_state_transformed = U_state now
 30 |   vector[N_groups_education] U_education;
 31 |   
 32 |   vector[5] U_region; // the nested random effect for region
 33 |   
 34 |   real<lower=0> sigma_age; // sd of U_age (hyperparam).
 35 |   real<lower=0> sigma_race; // sd of U_race (hyperparam).
 36 |   real<lower=0> sigma_income;
 37 |   real<lower=0> sigma_state;
 38 |   real<lower=0> sigma_education;
 39 |   
 40 |   real<lower=0> sigma_region;
 41 |   real beta_state; // coef for Republican vote share in every state
 42 |   real beta_relig; // coef for conservative religion share in every state
 43 |   
 44 |   real intercept; // the intercept (global fixed effect)
 45 |   real beta_sex;
 46 |   
 47 |   
 48 | }
 49 | transformed parameters { 
 50 |   vector[N_groups_age] U_age_transformed;
 51 |   vector[N_groups_race] U_race_transformed;
 52 |   vector[N_groups_income] U_income_transformed;
 53 |   vector[N_groups_state] U_state_transformed;
 54 |   vector[N_groups_education] U_education_transformed;
 55 |   
 56 |   vector[5] U_region_transformed;
 57 |   
 58 |   vector[N] yhat;
 59 | 
 60 |   U_age_transformed = sigma_age * U_age; // the random effect for age
 61 |   U_race_transformed = sigma_race * U_race; // the random effect for race
 62 |   U_income_transformed = sigma_income * U_income;
 63 |   //U_state_transformed = sigma_state * U_state;
 64 |   U_education_transformed = sigma_education * U_education;
 65 |   
 66 |   U_region_transformed = sigma_region * U_region;
 67 |   
 68 |   U_state_transformed = U_state;
 69 | 
 70 |   for (i in 1:N) {
 71 |     yhat[i] = intercept + U_age_transformed[age[i]] + U_race_transformed[race[i]] + U_income_transformed[income[i]] + U_state_transformed[state[i]] + U_education_transformed[education[i]] + beta_sex * sex[i]; // the linear predictor at each point
 72 |   }
 73 |   
 74 | }
 75 | model {
 76 |   sigma_age ~ normal(0,1); // sigma_A ~ lognormal(0,1). hyperparam.
 77 |   sigma_race ~ normal(0,1); // sigma_I ~ lognormal(0,1). hyperparam.
 78 |   sigma_income ~ normal(0,1);
 79 |   sigma_state ~ normal(0,1);
 80 |   sigma_education ~ normal(0,1);
 81 | 
 82 |   sigma_region ~ normal(0,1);
 83 | 
 84 |   for (j in 2:N_groups_age) {
 85 |     U_age[j] ~normal(U_age[j-1],1);
 86 |   }
 87 | 
 88 |   sum(U_age) ~ normal(0, 0.01 * N_groups_age); // constraint so we can write likelihood for rw(1).
 89 |   
 90 |   U_race ~ normal(0, 1); // random effect for race is normal
 91 |   U_income ~ normal(0, 1); 
 92 |   //U_state ~ normal(0, 1);
 93 |   U_education ~ normal(0, 1);
 94 |   
 95 |   U_region ~ normal(0,1);
 96 |   beta_state ~ normal(0,1);
 97 |   beta_relig ~ normal(0,1);
 98 |   
 99 |   for (j in 1:51) {
100 |     U_state[j] ~ normal(U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]), sigma_state);
101 |   }
102 |   
103 |   intercept ~ normal(0, 1); 
104 |   beta_sex ~ normal(0, 1);
105 |   
106 |   
107 |   for (i in 1:N) {
108 |     y[i] ~ bernoulli(inv_logit(yhat[i])); // the response
109 |   }
110 |   
111 | }
112 | 


--------------------------------------------------------------------------------
/realdata_annenberg/proposedarN01_annenberg_v3_centered.stan:
--------------------------------------------------------------------------------
  1 | data {
  2 |   int<lower=0> N; // number of samples
  3 |   int<lower=0> N_groups_age; // the number of groups for age
  4 |   int<lower=0> N_groups_race; // the number of groups for race
  5 |   int<lower=0> N_groups_sex; // the number of groups for sex: 2
  6 |   
  7 |   int<lower=0> N_groups_education; // the number of groups for education
  8 |   int<lower=0> N_groups_income; // the number of groups for income
  9 |   int<lower=0> N_groups_state; // the number of groups for state
 10 |   
 11 |   int<lower=1,upper=N_groups_age> age[N]; // the column vector of design matrix X for age
 12 |   int<lower=1,upper=N_groups_race> race[N]; // the column vector of design matrix X for race
 13 |   int<lower=0,upper=N_groups_sex-1> sex[N]; // the column vector of design matrix X for sex
 14 |   
 15 |   int<lower=1,upper=N_groups_education> education[N]; // the column vector of design matrix X for education
 16 |   int<lower=1,upper=N_groups_income> income[N]; // the column vector of design matrix X for income
 17 |   int<lower=1,upper=N_groups_state> state[N]; // the column vector of design matrix X for state
 18 |   
 19 |   int<lower=1,upper=5> region_index[51]; // the index representing the region that a state is in according to jkastell 
 20 |   real<lower=0,upper=1> state_vs[51]; // the 2004 Republican vote share from jkastell
 21 |   real<lower=0,upper=1> relig[51]; // the 2004 conservative religion percentage
 22 |   
 23 |   int y[N]; // the response vector
 24 | }
 25 | parameters {
 26 |   vector[N_groups_age] U_age; // the random effect for age, not multiplied by sigma_age
 27 |   vector[N_groups_race] U_race; // the random effect for race, not multiplied by sigma_race
 28 |   vector[N_groups_income] U_income; 
 29 |   vector[N_groups_state] U_state; // we are using the centered parameterization for U_state now, so U_state_transformed = U_state now
 30 |   vector[N_groups_education] U_education;
 31 |   
 32 |   vector[5] U_region; // the nested random effect for region
 33 |   
 34 |   real<lower=0> sigma_age; // sd of U_age (hyperparam).
 35 |   real<lower=0> sigma_race; // sd of U_race (hyperparam).
 36 |   real<lower=0> sigma_income;
 37 |   real<lower=0> sigma_state;
 38 |   real<lower=0> sigma_education;
 39 |   
 40 |   real<lower=0> sigma_region;
 41 |   real beta_state; // coef for Republican vote share in every state
 42 |   real beta_relig; // coef for conservative religion share in every state
 43 |   
 44 |   real intercept; // the intercept (global fixed effect)
 45 |   real beta_sex; 
 46 |   
 47 |   real<lower=0,upper=1> rho; // the autoregressive coefficient untransformed
 48 | }
 49 | transformed parameters { 
 50 |   vector[N_groups_age] U_age_transformed;
 51 |   vector[N_groups_race] U_race_transformed;
 52 |   vector[N_groups_income] U_income_transformed;
 53 |   vector[N_groups_state] U_state_transformed;
 54 |   vector[N_groups_education] U_education_transformed;
 55 |   
 56 |   vector[5] U_region_transformed;
 57 |   
 58 |   vector[N] yhat;
 59 |   real<lower=-1,upper=1> rho_transformed;
 60 | 
 61 |   rho_transformed = (rho * 2) - 1; // the autoregressive coefficient
 62 | 
 63 |   U_age_transformed = sigma_age * U_age; // the random effect for age
 64 |   U_race_transformed = sigma_race * U_race; // the random effect for race
 65 |   U_income_transformed = sigma_income * U_income;
 66 |   //U_state_transformed = sigma_state * U_state;
 67 |   U_education_transformed = sigma_education * U_education;
 68 |   
 69 |   U_region_transformed = sigma_region * U_region;
 70 |   
 71 |   U_state_transformed = U_state;
 72 |   
 73 |   // noncentered parameterization for U_state_transformed
 74 | //  for (j in 1:51) { // j iterates through the state index that I defined through levels_state in fitmodel_annenberg R file
 75 | //    U_state_transformed[j] = (U_state[j] * sigma_state) * ( U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]) );
 76 | //  }
 77 | 
 78 |   for (i in 1:N) {
 79 |     yhat[i] = intercept + U_age_transformed[age[i]] + U_race_transformed[race[i]] + U_income_transformed[income[i]] + U_state_transformed[state[i]] + U_education_transformed[education[i]] + beta_sex * sex[i]; // the linear predictor at each point
 80 |   }
 81 |   
 82 | }
 83 | model {
 84 |   sigma_age ~ normal(0,1); // sigma_A ~ halfnormal(0,1). hyperparam.
 85 |   sigma_race ~ normal(0,1); // sigma_I ~ halfnormal(0,1). hyperparam.
 86 |   sigma_income ~ normal(0,1);
 87 |   sigma_state ~ normal(0,1);
 88 |   sigma_education ~ normal(0,1);
 89 |   
 90 |   sigma_region ~ normal(0,1);
 91 |   
 92 |   rho ~ beta(0.5, 0.5); // prior on autoregressive coefficient
 93 |   
 94 |   U_age[1] ~ normal(0, 1/sqrt(1-rho_transformed^2)); // before it was normal(0, 1) but this is wrong
 95 |   for (j in 2:N_groups_age) {
 96 |     U_age[j] ~normal(rho_transformed * U_age[j-1],1);
 97 |   }
 98 |   
 99 |   U_race ~ normal(0, 1); // random effect for race is normal
100 |   U_income ~ normal(0, 1); 
101 |   //U_state ~ normal(0, 1);
102 |   U_education ~ normal(0, 1);
103 |   
104 |   U_region ~ normal(0,1);
105 |   beta_state ~ normal(0,1);
106 |   beta_relig ~ normal(0,1);
107 |   
108 |   for (j in 1:51) {
109 |     U_state[j] ~ normal(U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]), sigma_state);
110 |   }
111 |   
112 |   intercept ~ normal(0, 1); 
113 |   beta_sex ~ normal(0, 1);
114 |   
115 |   for (i in 1:N) {
116 |     y[i] ~ bernoulli(inv_logit(yhat[i])); // the response
117 |   }
118 |   
119 | }
120 | 


--------------------------------------------------------------------------------
/simulation/baselinemeanzeroN01_v3.stan:
--------------------------------------------------------------------------------
 1 | data {
 2 |   int<lower=0> N; // number of samples
 3 |   int<lower=0> N_groups_age; // the number of groups for age
 4 |   int<lower=0> N_groups_income; // the number of groups for income
 5 |   int<lower=1,upper=N_groups_age> age[N]; // the column vector of design matrix X for age
 6 |   int<lower=1,upper=N_groups_income> income[N]; // the column vector of design matrix X for income
 7 |   
 8 |   int<lower=1,upper=51> state_index[N]; // the index (from 1 to 51) representing the state for every datapoint
 9 |   int<lower=1,upper=5> region_index[51]; // the index (from 1 to 5) representing the region the state is in for all 51 states
10 |   real<lower=0,upper=1> state_vs[51]; // the 2004 Republican vote share for every state. This is hard coded for now.
11 |   real<lower=0,upper=1> relig[51]; // the 2004 conservative religion percentage in every state
12 |   
13 |   int y[N]; // the response vector
14 | }
15 | parameters {
16 |   vector[N_groups_age] U_age; // the random effect for age, not multiplied by sigma_age
17 |   vector[N_groups_income] U_income; // the random effect for income, not multiplied by sigma_income
18 |   
19 |   vector[51] U_state; // the random effect for state, not multiplied by sigma_state
20 |   vector[5] U_region; // the nested random effect for region, not multiplied by sigma_region
21 |   
22 |   real<lower=0> sigma_age; // sd of U_age (hyperparam). halfnormal prior put on this.
23 |   real<lower=0> sigma_income; // sd of U_income (hyperparam). halfnormal prior put on this.
24 |   
25 |   real<lower=0> sigma_state; // sd of state
26 |   real<lower=0> sigma_region; // sd of region
27 |   real beta_state; // coeff. for Republican vote share in every state
28 |   real beta_relig; // coeff. for conservative religion share in every state
29 |   
30 |   real intercept; // the intercept (global fixed effect)
31 | 
32 | }
33 | transformed parameters { 
34 |   vector[N_groups_age] U_age_transformed;
35 |   vector[N_groups_income] U_income_transformed;
36 |   
37 |   vector[51] U_state_transformed;
38 |   vector[5] U_region_transformed;
39 |   
40 |   vector[N] yhat;
41 |   //real intercept;
42 | 
43 |   U_age_transformed = sigma_age * U_age; // the random effect for age
44 |   U_income_transformed = sigma_income * U_income; // the random effect for income
45 |   U_region_transformed = sigma_region * U_region; // the random effect for region
46 |   
47 |   // noncentered parameterization for U_state_transformed
48 |   for (j in 1:51) {
49 |     U_state_transformed[j] = (U_state[j] * sigma_state) + ( U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]) );
50 |   }
51 | 
52 |   for (i in 1:N) {
53 |     yhat[i] = intercept + U_age_transformed[age[i]] + U_income_transformed[income[i]] + U_state_transformed[state_index[i]]; // the linear predictor at each point
54 |   }
55 |   
56 | }
57 | model {
58 |   sigma_age ~ normal(0, 1);
59 |   sigma_income ~ normal(0, 1);
60 |   
61 |   sigma_region ~ normal(0,1);
62 |   sigma_state ~ normal(0, 1);
63 |   
64 |   U_age ~ normal(0, 1); // random effect for age is normal 
65 |   U_income ~ normal(0, 1); // random effect for income is normal
66 |   
67 |   U_region ~ normal(0,1);
68 |   
69 |   U_state ~ normal(0, 1);
70 | 
71 |   beta_state ~ normal(0, 1);
72 |   beta_relig ~ normal(0, 1);
73 |   intercept ~ normal(0, 1); 
74 |   
75 |   
76 |   for (i in 1:N) {
77 |     y[i] ~ bernoulli(inv_logit(yhat[i])); // the response
78 |   }
79 |   
80 | }
81 | 


--------------------------------------------------------------------------------
/simulation/biasfacet_combinedplots.R:
--------------------------------------------------------------------------------
  1 | # run this script after threemodelwriteup_v3_posteriorvariance.R to get the main plots in the paper
  2 | # this script corresponds to n=100, 500
  3 | rm(list=ls()) # clear previous workspace
  4 | gc()
  5 | 
  6 | library(ggplot2)
  7 | 
  8 | bias_facet_df_100 = readRDS("biasfacet_200_12_1_100.rds")
  9 | bias_facet_df_500 = readRDS("biasfacet_200_12_1_500.rds")
 10 | 
 11 | biasfacet_pngname = "biasfacet_200_12_100_500.png"
 12 | allmediansfacet_pngname = "allmediansfacet_200_12_100_500.png"
 13 | allquantilediff_pngname = "allquantilediff_facet_200_12_100_500.png"
 14 | 
 15 | runs = 200
 16 | age_grouping_multiplier = 12
 17 |   
 18 | # ---------------------------
 19 | bias_facet_df_100$age_cat_grouped = ifelse(bias_facet_df_100$age_cat %in% c(1,2,3,4), "Age categories 1-4",
 20 |                                            ifelse(bias_facet_df_100$age_cat %in% c(5,6,7,8), "Age categories 5-8",
 21 |                                                   "Age categories 9-12"))
 22 | 
 23 | bias_facet_df_500$age_cat_grouped = ifelse(bias_facet_df_500$age_cat %in% c(1,2,3,4), "Age categories 1-4",
 24 |                                            ifelse(bias_facet_df_500$age_cat %in% c(5,6,7,8), "Age categories 5-8",
 25 |                                                   "Age categories 9-12"))
 26 | 
 27 | p1 = cowplot::plot_grid(ggplot(bias_facet_df_100, aes(x=prob_sampling_old, y=value, group=cell)) + 
 28 |                          geom_line(aes(col=as.factor(age_cat_grouped)), show.legend = TRUE, size = 0.5) +
 29 |                          facet_wrap(. ~ Model,ncol=3,nrow=1, labeller = as_labeller(
 30 |                            function(value) {
 31 |                              return(value)  # Lets you change the facet labels
 32 |                            })
 33 |                          ) +  
 34 |                          xlab("\n Probability of response for the elderly \n") +
 35 |                          ylab("\n Average Bias \n") + 
 36 |                           ylim( -max(max(abs(bias_facet_df_500$value), abs(bias_facet_df_100$value))),
 37 |                                 max(max(abs(bias_facet_df_500$value), abs(bias_facet_df_100$value))) ) +
 38 |                          # scale_colour_manual(values=c("1"="#fbb4b9",
 39 |                          #                              "2"="#f768a1",
 40 |                          #                              "3"="#c51b8a",
 41 |                          #                              "4"="#7a0177",
 42 |                          #                              "5"="#fdcc8a",
 43 |                          #                              "6"="#fc8d59",
 44 |                          #                              "7"="#e34a33",
 45 |                          #                              "8"="#b30000",
 46 |                          #                              "9"="#b2e2e2",
 47 |                          #                              "10"="#66c2a4",
 48 |                          #                              "11"="#2ca25f",
 49 |                          #                              "12"="#006d2c")
 50 |                          # ) +
 51 |                         scale_colour_manual(values=c("Age categories 1-4"="#fdcc8a",
 52 |                                                      "Age categories 5-8"="#b30000",
 53 |                                                      "Age categories 9-12"="#006d2c")
 54 |                         ) +
 55 |                          theme_bw() +
 56 |                          theme(plot.title = element_text(size = 50, face = "bold"),
 57 |                                axis.text=element_text(size=80),
 58 |                                axis.title=element_text(size=80, face="bold",margin=200),
 59 |                                strip.text.x = element_text(size=80, face="bold", margin = margin(t=25,b=25) ),
 60 |                                strip.background = element_rect(fill="transparent",color="transparent"),
 61 |                                legend.position = "none"
 62 |                          ) + 
 63 |                           geom_hline(yintercept=0, color="black", size=5, linetype = "dashed") +
 64 |                          guides(col=guide_legend(title="Age category",
 65 |                                                  override.aes = list(size = 10*1.6))),
 66 |                        
 67 |                        ggplot(bias_facet_df_500, aes(x=prob_sampling_old, y=value, group=cell)) + 
 68 |                          geom_line(aes(col=as.factor(age_cat_grouped)), show.legend = TRUE, size = 0.5) +
 69 |                          facet_wrap(. ~ Model,ncol=3,nrow=1, labeller = as_labeller(
 70 |                            function(value) {
 71 |                              return(value)  # Lets you change the facet labels
 72 |                            })
 73 |                          ) +  
 74 |                          xlab("\n Probability of response for the elderly \n") +
 75 |                          ylab("\n Average Bias \n") + 
 76 |                          ylim( -max(max(abs(bias_facet_df_500$value), abs(bias_facet_df_100$value))),
 77 |                                max(max(abs(bias_facet_df_500$value), abs(bias_facet_df_100$value))) ) +
 78 |                          # scale_colour_manual(values=c("1"="#fbb4b9",
 79 |                          #                              "2"="#f768a1",
 80 |                          #                              "3"="#c51b8a",
 81 |                          #                              "4"="#7a0177",
 82 |                          #                              "5"="#fdcc8a",
 83 |                          #                              "6"="#fc8d59",
 84 |                          #                              "7"="#e34a33",
 85 |                          #                              "8"="#b30000",
 86 |                          #                              "9"="#b2e2e2",
 87 |                          #                              "10"="#66c2a4",
 88 |                          #                              "11"="#2ca25f",
 89 |                          #                              "12"="#006d2c")
 90 |                          # ) +
 91 |                        scale_colour_manual(values=c("Age categories 1-4"="#fdcc8a",
 92 |                                                     "Age categories 5-8"="#b30000",
 93 |                                                     "Age categories 9-12"="#006d2c")
 94 |                                            ) +
 95 |                          theme_bw() +
 96 |                          theme(plot.title = element_text(size = 50, face = "bold"),
 97 |                                axis.text=element_text(size=80),
 98 |                                axis.title=element_text(size=80, face="bold",margin=200),
 99 |                                strip.text.x = element_text(size=80, face="bold", margin = margin(t=25,b=25) ),
100 |                                strip.background = element_rect(fill="transparent",color="transparent"),
101 |                                legend.position = "none"
102 |                          ) + 
103 |                          geom_hline(yintercept=0, color="black", size=5, linetype = "dashed") +
104 |                          guides(col=guide_legend(title="Age category",
105 |                                                  override.aes = list(size = 10*1.6)))
106 | ,
107 | align="v",
108 | nrow=2,
109 | 
110 | labels = c("n = 100", "n = 500"),
111 | 
112 | label_size = 85,
113 | 
114 | vjust = 1.25,
115 | hjust = -0.25)
116 | 
117 | legend_p1 = cowplot::get_legend(ggplot(bias_facet_df_500, aes(x=prob_sampling_old, y=value, group=cell)) + 
118 |                                   geom_line(aes(col=as.factor(age_cat_grouped)), show.legend = TRUE, size = 0.5) +
119 |                                   facet_wrap(. ~ Model,ncol=3,nrow=1, labeller = as_labeller(
120 |                                     function(value) {
121 |                                       return(value)  # Lets you change the facet labels
122 |                                     })
123 |                                   ) +  
124 |                                   xlab("\n Probability of response for the elderly \n") +
125 |                                   ylab("\n Average Bias \n") + 
126 |                                   # scale_colour_manual(values=c("1"="#fbb4b9",
127 |                                   #                              "2"="#f768a1",
128 |                                   #                              "3"="#c51b8a",
129 |                                   #                              "4"="#7a0177",
130 |                                   #                              "5"="#fdcc8a",
131 |                                   #                              "6"="#fc8d59",
132 |                                   #                              "7"="#e34a33",
133 |                                   #                              "8"="#b30000",
134 |                                   #                              "9"="#b2e2e2",
135 |                                   #                              "10"="#66c2a4",
136 |                                   #                              "11"="#2ca25f",
137 |                                   #                              "12"="#006d2c")
138 |                                   # ) +
139 |                                 scale_colour_manual(values=c("Age categories 1-4"="#fdcc8a",
140 |                                                              "Age categories 5-8"="#b30000",
141 |                                                              "Age categories 9-12"="#006d2c")
142 |                                 ) +
143 |                                   theme_bw() +
144 |                                   theme(plot.title = element_text(size = 50, face = "bold"),
145 |                                         axis.text=element_text(size=50),
146 |                                         axis.title=element_text(size=80, face="bold",margin=200),
147 |                                         legend.text = element_text(size=80),
148 |                                         legend.position  = "bottom",
149 |                                         legend.key.size = unit(30*1.6,"line"),
150 |                                         legend.key.height = unit(7*1.6,"line"),
151 |                                         legend.key = element_rect(fill = "transparent",color = "transparent"),
152 |                                         legend.title = element_text(size=80, face="bold"),
153 |                                         strip.text.x = element_text(size=80, face="bold", margin = margin(t=25,b=25) ),
154 |                                         strip.background = element_rect(fill="transparent",color="transparent")
155 |                                   ) + 
156 |                                   guides(col=guide_legend(title="Age category",
157 |                                                           override.aes = list(size = 10*1.6))))
158 | 
159 | png(biasfacet_pngname,
160 |     width = 5200, 
161 |     height = 5000)
162 | 
163 | plot(cowplot::plot_grid(p1,
164 |                         legend_p1,
165 |                         ncol=1,
166 |                         rel_heights = c(3, 0.3)))
167 | dev.off()
168 | 
169 | 
170 | # plot combined plots for allmedians_facet
171 | allmedians_facet_df_100 = readRDS("allmedians_facet_200_12_100_.rds")
172 | allmedians_facet_df_500 = readRDS("allmedians_facet_200_12_500_.rds")
173 | points_df_final = readRDS("points_df_final.rds")
174 | 
175 | allmedians_facet_df_100_filtered = allmedians_facet_df_100[allmedians_facet_df_100$p %in% c(0.1, 0.5, 0.9),]
176 | 
177 | # labels for plots: Strongly undersampled, Representative sampling, Strongly oversampled
178 | allmedians_facet_df_100_filtered[allmedians_facet_df_100_filtered$p==0.1, c("prob_sampling_old")] = "Strongly undersampling age categories 9-12"
179 | allmedians_facet_df_100_filtered[allmedians_facet_df_100_filtered$p==0.5, c("prob_sampling_old")] = "Representatively sampling age categories 9-12"
180 | allmedians_facet_df_100_filtered[allmedians_facet_df_100_filtered$p==0.9, c("prob_sampling_old")] = "Strongly oversampling age categories 9-12"
181 | 
182 | allmedians_facet_df_100_filtered$prob_sampling_old = factor(allmedians_facet_df_100_filtered$prob_sampling_old, levels = c("Strongly undersampling age categories 9-12",
183 |                                                                                                                            "Representatively sampling age categories 9-12",
184 |                                                                                                                            "Strongly oversampling age categories 9-12"))
185 | 
186 | 
187 | allmedians_facet_df_500_filtered = allmedians_facet_df_500[allmedians_facet_df_500$p %in% c(0.1, 0.5, 0.9),]
188 | 
189 | # labels for plots: Strongly undersampled, Representative sampling, Strongly oversampled
190 | allmedians_facet_df_500_filtered[allmedians_facet_df_500_filtered$p==0.1, c("prob_sampling_old")] = "Strongly undersampling age categories 9-12"
191 | allmedians_facet_df_500_filtered[allmedians_facet_df_500_filtered$p==0.5, c("prob_sampling_old")] = "Representatively sampling age categories 9-12"
192 | allmedians_facet_df_500_filtered[allmedians_facet_df_500_filtered$p==0.9, c("prob_sampling_old")] = "Strongly oversampling age categories 9-12"
193 | 
194 | allmedians_facet_df_500_filtered$prob_sampling_old = factor(allmedians_facet_df_500_filtered$prob_sampling_old, levels = c("Strongly undersampling age categories 9-12",
195 |                                                                                                                            "Representatively sampling age categories 9-12",
196 |                                                                                                                            "Strongly oversampling age categories 9-12"))
197 | 
198 | 
199 | p2 = cowplot::plot_grid(ggplot(allmedians_facet_df_100_filtered, aes(x=age_cat, y=posterior_medians, color=Model, linetype=Model)) + #geom_point(size=5)+
200 |                           geom_rect(allmedians_facet_df_100_filtered, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2,
201 |                                                                                     xmax = Inf,
202 |                                                                                     ymin = -Inf,
203 |                                                                                     ymax = Inf),
204 |                                     alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) +
205 |                           geom_jitter(alpha=0.1, show.legend=FALSE) + 
206 |                           geom_point(aes(x = age_cat, y = mrp),
207 |                                      size=7,
208 |                                      colour = "black",
209 |                                      fill="black",
210 |                                      show.legend = F,
211 |                                      data = points_df_final,
212 |                                      inherit.aes = F) +
213 |                           geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 
214 |                           facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=1, labeller = as_labeller(
215 |                             function(value) {
216 |                               return(value)  # Lets you change the facet labels
217 |                             })
218 |                           ) +
219 |                           xlab("\n Age Category \n") +
220 |                           ylab(paste("\n Median of", runs,"Posteriors \n")) +
221 |                           scale_x_continuous(breaks= scales::pretty_breaks()) + 
222 |                           coord_cartesian(ylim=c(0,1), expand=FALSE) +
223 |                           theme_bw() +
224 |                           theme(plot.title = element_text(size = 50, face = "bold"),
225 |                                 axis.text=element_text(size=50),
226 |                                 axis.title=element_text(size=50, face="bold",margin=200),
227 |                                 legend.text = element_text(size=50),
228 |                                 legend.position  = "none",
229 |                                 legend.key.size = unit(25,"line"),
230 |                                 legend.key.height = unit(7,"line"),
231 |                                 legend.key = element_rect(fill = "transparent",color = "transparent"),
232 |                                 legend.title = element_text(size=50, face="bold"),
233 |                                 strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ),
234 |                                 strip.background = element_rect(fill="transparent",color="transparent")) +
235 |                           scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 
236 |                           guides(color = guide_legend(override.aes = list(size=20))),
237 |                         
238 |                         ggplot(allmedians_facet_df_500_filtered, aes(x=age_cat, y=posterior_medians, color=Model, linetype=Model)) + #geom_point(size=5)+
239 |                           geom_rect(allmedians_facet_df_500_filtered, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2,
240 |                                                                                     xmax = Inf,
241 |                                                                                     ymin = -Inf,
242 |                                                                                     ymax = Inf),
243 |                                     alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) +
244 |                           geom_jitter(alpha=0.1, show.legend=FALSE) + 
245 |                           geom_point(aes(x = age_cat, y = mrp),
246 |                                      size=7,
247 |                                      colour = "black",
248 |                                      fill="black",
249 |                                      show.legend = F,
250 |                                      data = points_df_final,
251 |                                      inherit.aes = F) +
252 |                           geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 
253 |                           facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller(
254 |                             function(value) {
255 |                               return(value)  # Lets you change the facet labels
256 |                             })
257 |                           ) +
258 |                           xlab("\n Age Category \n") +
259 |                           ylab(paste("\n Median of", runs,"Posteriors \n")) +
260 |                           scale_x_continuous(breaks= scales::pretty_breaks()) + 
261 |                           coord_cartesian(ylim=c(0,1), expand=FALSE) +
262 |                           theme_bw() +
263 |                           theme(plot.title = element_text(size = 50, face = "bold"),
264 |                                 axis.text=element_text(size=50),
265 |                                 axis.title=element_text(size=50, face="bold",margin=200),
266 |                                 legend.text = element_text(size=50),
267 |                                 legend.position  = "none",
268 |                                 legend.key.size = unit(25,"line"),
269 |                                 legend.key.height = unit(7,"line"),
270 |                                 legend.key = element_rect(fill = "transparent",color = "transparent"),
271 |                                 legend.title = element_text(size=50, face="bold"),
272 |                                 strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ),
273 |                                 strip.background = element_rect(fill="transparent",color="transparent")) +
274 |                           scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 
275 |                           guides(color = guide_legend(override.aes = list(size=20))),
276 |                         
277 |                         ncol = 1,
278 |                         
279 |                         labels = c("n = 100", "n = 500"),
280 |                         
281 |                         label_size = 50,
282 |                         
283 |                         vjust = 1.5,
284 |                         hjust = -0.25
285 | )
286 | 
287 | 
288 | legend_p2 = cowplot::get_legend(ggplot(allmedians_facet_df_500_filtered, aes(x=age_cat, y=posterior_medians, color=Model, linetype=Model)) + #geom_point(size=5)+
289 |                                   geom_rect(allmedians_facet_df_500_filtered, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2,
290 |                                                                                             xmax = Inf,
291 |                                                                                             ymin = -Inf,
292 |                                                                                             ymax = Inf),
293 |                                             alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) +
294 |                                   geom_jitter(alpha=0.1, show.legend=FALSE) + 
295 |                                   geom_point(aes(x = age_cat, y = mrp),
296 |                                              size=7,
297 |                                              colour = "black",
298 |                                              fill="black",
299 |                                              show.legend = F,
300 |                                              data = points_df_final,
301 |                                              inherit.aes = F) +
302 |                                   geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 
303 |                                   facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller(
304 |                                     function(value) {
305 |                                       return(value)  # Lets you change the facet labels
306 |                                     })
307 |                                   ) +
308 |                                   xlab("\n Age Category \n") +
309 |                                   ylab(paste("\n Median of", runs,"Posteriors \n")) +
310 |                                   scale_x_continuous(breaks= scales::pretty_breaks()) + 
311 |                                   theme_bw() +
312 |                                   theme(plot.title = element_text(size = 50, face = "bold"),
313 |                                         axis.text=element_text(size=35),
314 |                                         axis.title=element_text(size=50, face="bold",margin=200),
315 |                                         legend.text = element_text(size=50),
316 |                                         legend.position  = "bottom",
317 |                                         legend.key.size = unit(25,"line"),
318 |                                         legend.key.height = unit(7,"line"),
319 |                                         legend.key = element_rect(fill = "transparent",color = "transparent"),
320 |                                         legend.title = element_text(size=50, face="bold"),
321 |                                         strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ),
322 |                                         strip.background = element_rect(fill="transparent",color="transparent")) +
323 |                                   scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 
324 |                                   guides(color = guide_legend(override.aes = list(size=20))))
325 | 
326 | png(allmediansfacet_pngname,
327 |     width = 3625, 
328 |     height = 2200)
329 | 
330 | plot(cowplot::plot_grid(p2,
331 |      legend_p2,
332 |      ncol=1,
333 |      rel_heights = c(3, 0.3)))
334 | dev.off()
335 | 
336 | 
337 | # plot combined plots for allquantilediff_facet
338 | allquantilediff_facet_df_100 = readRDS("allquantilediff_facet_200_12_100_.rds")
339 | allquantilediff_facet_df_500 = readRDS("allquantilediff_facet_200_12_500_.rds")
340 | 
341 | allquantilediff_facet_df_100_filtered = allquantilediff_facet_df_100[allquantilediff_facet_df_100$p %in% c(0.1, 0.5, 0.9),]
342 | 
343 | # labels for plots: Strongly undersampled, Representative sampling, Strongly oversampled
344 | allquantilediff_facet_df_100_filtered[allquantilediff_facet_df_100_filtered$p==0.1, c("prob_sampling_old")] = "Strongly undersampling age categories 9-12"
345 | allquantilediff_facet_df_100_filtered[allquantilediff_facet_df_100_filtered$p==0.5, c("prob_sampling_old")] = "Representatively sampling age categories 9-12"
346 | allquantilediff_facet_df_100_filtered[allquantilediff_facet_df_100_filtered$p==0.9, c("prob_sampling_old")] = "Strongly oversampling age categories 9-12"
347 | 
348 | allquantilediff_facet_df_100_filtered$prob_sampling_old = factor(allquantilediff_facet_df_100_filtered$prob_sampling_old, levels = c("Strongly undersampling age categories 9-12",
349 |                                                                                                                                      "Representatively sampling age categories 9-12",
350 |                                                                                                                                      "Strongly oversampling age categories 9-12"))
351 | 
352 | 
353 | allquantilediff_facet_df_500_filtered = allquantilediff_facet_df_500[allquantilediff_facet_df_500$p %in% c(0.1, 0.5, 0.9),]
354 | 
355 | # labels for plots: Strongly undersampled, Representative sampling, Strongly oversampled
356 | allquantilediff_facet_df_500_filtered[allquantilediff_facet_df_500_filtered$p==0.1, c("prob_sampling_old")] = "Strongly undersampling age categories 9-12"
357 | allquantilediff_facet_df_500_filtered[allquantilediff_facet_df_500_filtered$p==0.5, c("prob_sampling_old")] = "Representatively sampling age categories 9-12"
358 | allquantilediff_facet_df_500_filtered[allquantilediff_facet_df_500_filtered$p==0.9, c("prob_sampling_old")] = "Strongly oversampling age categories 9-12"
359 | 
360 | allquantilediff_facet_df_500_filtered$prob_sampling_old = factor(allquantilediff_facet_df_500_filtered$prob_sampling_old, levels = c("Strongly undersampling age categories 9-12",
361 |                                                                                                                                      "Representatively sampling age categories 9-12",
362 |                                                                                                                                      "Strongly oversampling age categories 9-12"))
363 | 
364 | 
365 | 
366 | p3 = cowplot::plot_grid(ggplot(allquantilediff_facet_df_100_filtered, aes(x=age_cat, y=quantildiff90_10, color=Model, linetype=Model)) + #geom_point(size=5)+
367 |                           geom_rect(allquantilediff_facet_df_100_filtered, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2,
368 |                                                                                xmax = Inf,
369 |                                                                                ymin = -Inf,
370 |                                                                                ymax = Inf),
371 |                                     alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) +
372 |                           geom_jitter(alpha=0.1, show.legend=FALSE) + 
373 |                           geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 
374 |                           facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller(
375 |                             function(value) {
376 |                               return(value)  # Lets you change the facet labels
377 |                             })
378 |                           ) +
379 |                           xlab("\n Age Category \n") +
380 |                           ylab(paste("\n 90th - 10th quantile of Posteriors \n")) +
381 |                           scale_x_continuous(breaks= scales::pretty_breaks()) + 
382 |                           coord_cartesian(ylim=c(0, max(allquantilediff_facet_df_100_filtered$quantildiff90_10)), expand=FALSE) +
383 |                           theme_bw() +
384 |                           theme(plot.title = element_text(size = 50, face = "bold"),
385 |                                 axis.text=element_text(size=50),
386 |                                 axis.title=element_text(size=50, face="bold",margin=200),
387 |                                 legend.text = element_text(size=50),
388 |                                 legend.position  = "none",
389 |                                 legend.key.size = unit(25,"line"),
390 |                                 legend.key.height = unit(7,"line"),
391 |                                 legend.key = element_rect(fill = "transparent",color = "transparent"),
392 |                                 legend.title = element_text(size=50, face="bold"),
393 |                                 strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ),
394 |                                 strip.background = element_rect(fill="transparent",color="transparent")) +
395 |                           scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 
396 |                           guides(color = guide_legend(override.aes = list(size=20))),
397 |                         
398 |                         ggplot(allquantilediff_facet_df_500_filtered, aes(x=age_cat, y=quantildiff90_10, color=Model, linetype=Model)) + #geom_point(size=5)+
399 |                           geom_rect(allquantilediff_facet_df_500_filtered, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2,
400 |                                                                                          xmax = Inf,
401 |                                                                                          ymin = -Inf,
402 |                                                                                          ymax = Inf),
403 |                                     alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) +
404 |                           geom_jitter(alpha=0.1, show.legend=FALSE) + 
405 |                           geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 
406 |                           facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller(
407 |                             function(value) {
408 |                               return(value)  # Lets you change the facet labels
409 |                             })
410 |                           ) +
411 |                           xlab("\n Age Category \n") +
412 |                           ylab(paste("\n 90th - 10th quantile of Posteriors \n")) +
413 |                           scale_x_continuous(breaks= scales::pretty_breaks()) + 
414 |                           coord_cartesian(ylim=c(0, max(allquantilediff_facet_df_100_filtered$quantildiff90_10)), expand=FALSE) +
415 |                           theme_bw() +
416 |                           theme(plot.title = element_text(size = 50, face = "bold"),
417 |                                 axis.text=element_text(size=50),
418 |                                 axis.title=element_text(size=50, face="bold",margin=200),
419 |                                 legend.text = element_text(size=50),
420 |                                 legend.position  = "none",
421 |                                 legend.key.size = unit(25,"line"),
422 |                                 legend.key.height = unit(7,"line"),
423 |                                 legend.key = element_rect(fill = "transparent",color = "transparent"),
424 |                                 legend.title = element_text(size=50, face="bold"),
425 |                                 strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ),
426 |                                 strip.background = element_rect(fill="transparent",color="transparent")) +
427 |                           scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 
428 |                           guides(color = guide_legend(override.aes = list(size=20))),
429 |                         
430 |                         ncol=1,
431 |                         
432 |                         labels = c("n = 100", "n = 500"),
433 |                         
434 |                         label_size = 50,
435 |                         
436 |                         vjust = 1,
437 |                         hjust = -0.25
438 |                         )
439 | 
440 | legend_p3 = cowplot::get_legend(ggplot(allquantilediff_facet_df_500_filtered, aes(x=age_cat, y=quantildiff90_10, color=Model, linetype=Model)) + #geom_point(size=5)+
441 |                                   geom_rect(allquantilediff_facet_df_500_filtered, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2,
442 |                                                                                                  xmax = Inf,
443 |                                                                                                  ymin = -Inf,
444 |                                                                                                  ymax = Inf),
445 |                                             alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) +
446 |                                   geom_jitter(alpha=0.1, show.legend=FALSE) + 
447 |                                   geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 
448 |                                   facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller(
449 |                                     function(value) {
450 |                                       return(value)  # Lets you change the facet labels
451 |                                     })
452 |                                   ) +
453 |                                   xlab("\n Age Category \n") +
454 |                                   ylab(paste("\n 90th - 10th quantile of", runs,"Posteriors \n")) +
455 |                                   scale_x_continuous(breaks= scales::pretty_breaks()) + 
456 |                                   theme_bw() +
457 |                                   theme(plot.title = element_text(size = 50, face = "bold"),
458 |                                         axis.text=element_text(size=35),
459 |                                         axis.title=element_text(size=50, face="bold",margin=200),
460 |                                         legend.text = element_text(size=50),
461 |                                         legend.position  = "bottom",
462 |                                         legend.key.size = unit(25,"line"),
463 |                                         legend.key.height = unit(7,"line"),
464 |                                         legend.key = element_rect(fill = "transparent",color = "transparent"),
465 |                                         legend.title = element_text(size=50, face="bold"),
466 |                                         strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ),
467 |                                         strip.background = element_rect(fill="transparent",color="transparent")) +
468 |                                   scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 
469 |                                   guides(color = guide_legend(override.aes = list(size=20))))
470 | 
471 | png(allquantilediff_pngname,
472 |     width = 3625, 
473 |     height = 2200)
474 | 
475 | plot(cowplot::plot_grid(p3,
476 |                         legend_p3,
477 |                         ncol=1,
478 |                         rel_heights = c(3, 0.3)))
479 | dev.off()
480 | 
481 | print(warnings())
482 | 


--------------------------------------------------------------------------------
/simulation/proportion_agecat.R:
--------------------------------------------------------------------------------
  1 | rm(list=ls()) 
  2 | gc()
  3 | 
  4 | options(bitmapType="cairo")
  5 | 
  6 | library(reshape2)
  7 | library(rstan)
  8 | library(dplyr)
  9 | library(ggplot2)
 10 | library(gridExtra)
 11 | library(knitr)
 12 | library(matrixStats)
 13 | 
 14 | # Global params -------------------------------------------
 15 | 
 16 | save_ridgeplots = TRUE 
 17 | sample_size = 1000
 18 | runs = 100
 19 | r = 1:9/10
 20 | income_multiplier = 1 # partitions income into more categories
 21 | age_grouping_multiplier = 12 # how much we take the maximal poststratification. make sure this can divide 60
 22 | number_of_states = 51
 23 | 
 24 | response_tag = "binary"
 25 | 
 26 | # ---------------------------------------------------------
 27 | rw_baseline_agecat_bias_comparison = matrix(0, length(r), age_grouping_multiplier)
 28 | ar_baseline_agecat_bias_comparison = matrix(0, length(r), age_grouping_multiplier)
 29 | 
 30 | rw_baseline_agecat_sd_comparison = matrix(0, length(r), age_grouping_multiplier)
 31 | ar_baseline_agecat_sd_comparison = matrix(0, length(r), age_grouping_multiplier)
 32 | 
 33 | 
 34 | counter = 1
 35 | prob_sampling_old = rep(0, length(r))
 36 | 
 37 | for (p in r) {
 38 |   
 39 |   load(paste0(
 40 |     runs,
 41 |     "_",
 42 |     age_grouping_multiplier,
 43 |     "_",
 44 |     income_multiplier,
 45 |     "_",
 46 |     sample_size,
 47 |     "_",
 48 |     response_tag,
 49 |     "_oldest_",
 50 |     p * 10,
 51 |     ".RData"))
 52 |   
 53 |   points_df_final = unique(readRDS("points_df_final.rds")[,1:2]) %>% arrange(age_cat) # the true poststratified preferences for every age category
 54 |   
 55 |   # Get the interpretable p which is the probability of sampling someone in ages 61 - 80
 56 |   poststrat_final_p = cbind(poststrat_final, 
 57 |                             (poststrat_final$p_response * poststrat_final$N)/sum((poststrat_final$p_response * poststrat_final$N))) # this is the probability of sampling someone in ages  61 - 80
 58 |   colnames(poststrat_final_p)[length(colnames(poststrat_final_p))] = "prob_sampling_old" 
 59 |   prob_sampling_old[counter] = sum(poststrat_final_p[as.numeric(as.character(poststrat_final_p$age_cat)) >= (age_grouping_multiplier*2/3 + 1),
 60 |                                                      c("prob_sampling_old")]) # very important
 61 |   
 62 |   rw_baseline_agecat_bias_comparison[counter,] = colSums((abs(sweep(median_quantile_list_rw,
 63 |                                                                     2,
 64 |                                                                     points_df_final$mrp,
 65 |                                                                     "-")) - 
 66 |                                                             abs(sweep(median_quantile_list_baseline,
 67 |                                                                       2,
 68 |                                                                       points_df_final$mrp,
 69 |                                                                       "-"))) <= 0)/dim(median_quantile_list_baseline)[1]
 70 |   
 71 |   ar_baseline_agecat_bias_comparison[counter,] = colSums((abs(sweep(median_quantile_list_ar,
 72 |                                                                     2,
 73 |                                                                     points_df_final$mrp,
 74 |                                                                     "-")) - 
 75 |                                                             abs(sweep(median_quantile_list_baseline,
 76 |                                                                       2,
 77 |                                                                       points_df_final$mrp,
 78 |                                                                       "-"))) <= 0)/dim(median_quantile_list_baseline)[1]
 79 |   
 80 | 
 81 |   rw_baseline_agecat_sd_comparison[counter,] = colSums(((quantile90_list_rw - quantile10_list_rw) -
 82 |                                                           (quantile90_list_baseline - quantile10_list_baseline))<= 0)/dim(median_quantile_list_rw)[1]
 83 |   
 84 |   ar_baseline_agecat_sd_comparison[counter,] = colSums(((quantile90_list_ar - quantile10_list_ar) -
 85 |                                                           (quantile90_list_baseline - quantile10_list_baseline))<= 0)/dim(median_quantile_list_rw)[1]
 86 |   
 87 |   counter = counter + 1
 88 | }
 89 | 
 90 | 
 91 | # ar model
 92 | ar_baseline_agecat_bias_comparison_melted = melt(ar_baseline_agecat_bias_comparison)
 93 | colnames(ar_baseline_agecat_bias_comparison_melted) = c("p", "age_cat", "proportion")
 94 | ar_baseline_agecat_bias_comparison_melted$age_cat_grouped = ""
 95 | 
 96 | ar_baseline_agecat_bias_comparison_melted[ar_baseline_agecat_bias_comparison_melted$age_cat %in% 1:4, c("age_cat_grouped")] = "Age categories 1-4"
 97 | ar_baseline_agecat_bias_comparison_melted[ar_baseline_agecat_bias_comparison_melted$age_cat %in% 5:8, c("age_cat_grouped")] = "Age categories 5-8"
 98 | ar_baseline_agecat_bias_comparison_melted[ar_baseline_agecat_bias_comparison_melted$age_cat %in% 9:12, c("age_cat_grouped")] = "Age categories 9-12"
 99 | 
100 | 
101 | # rw model
102 | rw_baseline_agecat_bias_comparison_melted = melt(rw_baseline_agecat_bias_comparison)
103 | colnames(rw_baseline_agecat_bias_comparison_melted) = c("p", "age_cat", "proportion")
104 | rw_baseline_agecat_bias_comparison_melted$age_cat_grouped = ""
105 | 
106 | rw_baseline_agecat_bias_comparison_melted[rw_baseline_agecat_bias_comparison_melted$age_cat %in% 1:4, c("age_cat_grouped")] = "Age categories 1-4"
107 | rw_baseline_agecat_bias_comparison_melted[rw_baseline_agecat_bias_comparison_melted$age_cat %in% 5:8, c("age_cat_grouped")] = "Age categories 5-8"
108 | rw_baseline_agecat_bias_comparison_melted[rw_baseline_agecat_bias_comparison_melted$age_cat %in% 9:12, c("age_cat_grouped")] = "Age categories 9-12"
109 | 
110 | rw_ar_baseline_agecat_bias_comparison_melted = rbind(cbind(ar_baseline_agecat_bias_comparison_melted, Model = "AR - Baseline priors comparison"),
111 |                                                      cbind(rw_baseline_agecat_bias_comparison_melted, Model = "RW - Baseline priors comparison"))
112 | 
113 | mapdf_prob_sampling_old_bias = data.frame(old = sort(unique(rw_ar_baseline_agecat_bias_comparison_melted$p)), 
114 |                                           new = prob_sampling_old)
115 | rw_ar_baseline_agecat_bias_comparison_melted$prob_sampling_old = round(mapdf_prob_sampling_old_bias$new[match(rw_ar_baseline_agecat_bias_comparison_melted$p,
116 |                                                                                mapdf_prob_sampling_old_bias$old)],
117 |                                         digits = 2)
118 | 
119 | 
120 | saveRDS(prob_sampling_old, 
121 |         "prob_sampling_old.rds")
122 | 
123 | saveRDS(rw_ar_baseline_agecat_bias_comparison_melted, paste0("proportion_",
124 |                                                              runs,
125 |                                                              "_",
126 |                                                              age_grouping_multiplier,
127 |                                                              "_",
128 |                                                              income_multiplier,
129 |                                                              "_",
130 |                                                              sample_size,".rds"))
131 | 
132 | png(filename = paste0("proportion_",
133 |                       runs,
134 |                       "_",
135 |                       age_grouping_multiplier,
136 |                       "_",
137 |                       income_multiplier,
138 |                       "_",
139 |                       sample_size,"smaller.png"),
140 |     width = 5400/5, height = 2700/5)
141 | 
142 | ggplot(rw_ar_baseline_agecat_bias_comparison_melted, aes(x=prob_sampling_old, y=proportion, group = age_cat)) + 
143 |   geom_line(aes(col=as.factor(age_cat_grouped)), show.legend = TRUE, size = 1) + 
144 |   scale_colour_manual(values=c("Age categories 1-4"="#fdcc8a",
145 |                                "Age categories 5-8"="#b30000",
146 |                                "Age categories 9-12"="#006d2c")) + 
147 |   facet_wrap(.~Model,ncol=2,nrow=1,labeller = as_labeller(function(value){return(value)})) + 
148 |   xlab("\n Probability of response for the elderly \n") +
149 |   ylab("\n Proportion of the time that structured priors outperform \n") + 
150 |   theme_bw() +
151 |   theme(plot.title = element_text(size = 50/5, face = "bold"),
152 |         axis.text=element_text(size=50*1.6/5),
153 |         axis.title=element_text(size=50*1.6/5, face="bold",margin=200/5),
154 |         legend.text = element_text(size=50*1.6/5),
155 |         legend.position  = "bottom",
156 |         legend.key.size = unit(30*1.6/5,"line"),
157 |         legend.key.height = unit(7*1.6/5,"line"),
158 |         legend.key = element_rect(fill = "transparent",color = "transparent"),
159 |         legend.title = element_text(size=50*1.6/5, face="bold"),
160 |         strip.text.x = element_text(size=50*1.6/5, face="bold", margin = margin(t=25/5,b=25/5) ),
161 |         strip.background = element_rect(fill="transparent",color="transparent")
162 |   ) + 
163 |   ylim(c(.2,1)) +
164 |   geom_hline(yintercept = 0.5,linetype=2,size=1) +
165 |   guides(col=guide_legend(title="Age category",
166 |                           override.aes = list(size = 10*1.6/5)))
167 | 
168 | dev.off()
169 | 
170 | 
171 | 
172 | # ar model sd
173 | ar_baseline_agecat_sd_comparison_melted = melt(ar_baseline_agecat_sd_comparison)
174 | colnames(ar_baseline_agecat_sd_comparison_melted) = c("p", "age_cat", "proportion")
175 | ar_baseline_agecat_sd_comparison_melted$age_cat_grouped = ""
176 | 
177 | ar_baseline_agecat_sd_comparison_melted[ar_baseline_agecat_sd_comparison_melted$age_cat %in% 1:4, c("age_cat_grouped")] = "Age categories 1-4"
178 | ar_baseline_agecat_sd_comparison_melted[ar_baseline_agecat_sd_comparison_melted$age_cat %in% 5:8, c("age_cat_grouped")] = "Age categories 5-8"
179 | ar_baseline_agecat_sd_comparison_melted[ar_baseline_agecat_sd_comparison_melted$age_cat %in% 9:12, c("age_cat_grouped")] = "Age categories 9-12"
180 | 
181 | 
182 | # rw model sd
183 | rw_baseline_agecat_sd_comparison_melted = melt(rw_baseline_agecat_sd_comparison)
184 | colnames(rw_baseline_agecat_sd_comparison_melted) = c("p", "age_cat", "proportion")
185 | rw_baseline_agecat_sd_comparison_melted$age_cat_grouped = ""
186 | 
187 | rw_baseline_agecat_sd_comparison_melted[rw_baseline_agecat_sd_comparison_melted$age_cat %in% 1:4, c("age_cat_grouped")] = "Age categories 1-4"
188 | rw_baseline_agecat_sd_comparison_melted[rw_baseline_agecat_sd_comparison_melted$age_cat %in% 5:8, c("age_cat_grouped")] = "Age categories 5-8"
189 | rw_baseline_agecat_sd_comparison_melted[rw_baseline_agecat_sd_comparison_melted$age_cat %in% 9:12, c("age_cat_grouped")] = "Age categories 9-12"
190 | 
191 | rw_ar_baseline_agecat_sd_comparison_melted = rbind(cbind(ar_baseline_agecat_sd_comparison_melted, Model = "AR - Baseline priors comparison"),
192 |                                                      cbind(rw_baseline_agecat_sd_comparison_melted, Model = "RW - Baseline priors comparison"))
193 | 
194 | rw_ar_baseline_agecat_sd_comparison_melted$prob_sampling_old = round(mapdf_prob_sampling_old_bias$new[match(rw_ar_baseline_agecat_sd_comparison_melted$p,
195 |                                                                                                               mapdf_prob_sampling_old_bias$old)],
196 |                                                                        digits = 2)
197 | 
198 | 
199 | saveRDS(rw_ar_baseline_agecat_sd_comparison_melted, paste0("proportion_sd_",
200 |                                                              runs,
201 |                                                              "_",
202 |                                                              age_grouping_multiplier,
203 |                                                              "_",
204 |                                                              income_multiplier,
205 |                                                              "_",
206 |                                                              sample_size,".rds"))
207 | 
208 | png(filename = paste0("proportion_sd_",
209 |                       runs,
210 |                       "_",
211 |                       age_grouping_multiplier,
212 |                       "_",
213 |                       income_multiplier,
214 |                       "_",
215 |                       sample_size,"smaller.png"),
216 |     width = 5400/5, height = 2700/5)
217 | 
218 | ggplot(rw_ar_baseline_agecat_sd_comparison_melted, aes(x=prob_sampling_old, y=proportion, group = age_cat)) + 
219 |   geom_line(aes(col=as.factor(age_cat_grouped)), show.legend = TRUE, size = 1) + 
220 |   scale_colour_manual(values=c("Age categories 1-4"="#fdcc8a",
221 |                                "Age categories 5-8"="#b30000",
222 |                                "Age categories 9-12"="#006d2c")) + 
223 |   facet_wrap(.~Model,ncol=2,nrow=1,labeller = as_labeller(function(value){return(value)})) + 
224 |   xlab("\n Probability of response for the elderly \n") +
225 |   ylab("\n Proportion sd \n") + 
226 |   theme_bw() +
227 |   theme(plot.title = element_text(size = 50/5, face = "bold"),
228 |         axis.text=element_text(size=50*1.6/5),
229 |         axis.title=element_text(size=50*1.6/5, face="bold",margin=200/5),
230 |         legend.text = element_text(size=50*1.6/5),
231 |         legend.position  = "bottom",
232 |         legend.key.size = unit(30*1.6/5,"line"),
233 |         legend.key.height = unit(7*1.6/5,"line"),
234 |         legend.key = element_rect(fill = "transparent",color = "transparent"),
235 |         legend.title = element_text(size=50*1.6/5, face="bold"),
236 |         strip.text.x = element_text(size=50*1.6/5, face="bold", margin = margin(t=25/5,b=25/5) ),
237 |         strip.background = element_rect(fill="transparent",color="transparent")
238 |   ) + 
239 |   ylim(c(.1,1)) +
240 |   geom_hline(yintercept = 0.5,linetype=2,size=1) +
241 |   guides(col=guide_legend(title="Age category",
242 |                           override.aes = list(size = 10*1.6/5)))
243 | 
244 | dev.off()
245 | 
246 | print(colMeans(rw_baseline_agecat_bias_comparison))
247 | print(colMeans(ar_baseline_agecat_bias_comparison))
248 | 
249 | print(colMeans(rw_baseline_agecat_sd_comparison))
250 | print(colMeans(ar_baseline_agecat_sd_comparison))
251 | 
252 | 
253 | 
254 | 
255 | 
256 | 
257 | 
258 | 
259 | 


--------------------------------------------------------------------------------
/simulation/proposedN01_v3.stan:
--------------------------------------------------------------------------------
 1 | data {
 2 |   int<lower=0> N; // number of samples
 3 |   int<lower=0> N_groups_age; // the number of groups for age
 4 |   int<lower=0> N_groups_income; // the number of groups for income
 5 |   int<lower=1,upper=N_groups_age> age[N]; // the column vector of design matrix X for age
 6 |   int<lower=1,upper=N_groups_income> income[N]; // the column vector of design matrix X for income
 7 |   
 8 |   int<lower=1,upper=51> state_index[N]; // the index (from 1 to 51) representing the state for every datapoint
 9 |   int<lower=1,upper=5> region_index[51]; // the index (from 1 to 5) representing the region the state is in for all 51 states
10 |   real<lower=0,upper=1> state_vs[51]; // the 2004 Republican vote share for every state. This is hard coded for now.
11 |   real<lower=0,upper=1> relig[51]; // the 2004 conservative religion percentage in every state
12 |   
13 |   int y[N]; // the response vector
14 | }
15 | parameters {
16 |   vector[N_groups_age] U_age; // the random effect for age, not multiplied by sigma_age
17 |   vector[N_groups_income] U_income; // the random effect for income, not multiplied by sigma_income
18 |   
19 |   vector[51] U_state; // the random effect for state, not multiplied by sigma_state
20 |   vector[5] U_region; // the nested random effect for region, not multiplied by sigma_region
21 |   
22 |   real<lower=0> sigma_age; // sd of U_age (hyperparam). halfnormal prior put on this.
23 |   real<lower=0> sigma_income; // sd of U_income (hyperparam). halfnormal prior put on this.
24 |   
25 |   real<lower=0> sigma_state; // sd of state
26 |   real<lower=0> sigma_region; // sd of region
27 |   real beta_state; // coeff. for Republican vote share in every state
28 |   real beta_relig; // coeff. for conservative religion share in every state
29 |   
30 |   real intercept; // the intercept (global fixed effect)
31 |   
32 | }
33 | transformed parameters { 
34 |   vector[N_groups_age] U_age_transformed;
35 |   vector[N_groups_income] U_income_transformed;
36 |   
37 |   vector[51] U_state_transformed;
38 |   vector[5] U_region_transformed;
39 |     
40 |   vector[N] yhat;
41 | 
42 |   U_age_transformed = sigma_age * U_age; // the random effect for age
43 |   U_income_transformed = sigma_income * U_income; // the random effect for income
44 |   U_region_transformed = sigma_region * U_region; // the nested random effect for region
45 | 
46 |   // noncentered parameterization for U_state_transformed
47 |   for (j in 1:51) {
48 |     U_state_transformed[j] = (U_state[j] * sigma_state) + ( U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]) );
49 |   }
50 | 
51 |   for (i in 1:N) {
52 |     yhat[i] = intercept + U_age_transformed[age[i]] + U_income_transformed[income[i]] + U_state_transformed[state_index[i]]; // the linear predictor at each point
53 |   }
54 |   
55 | }
56 | model {
57 |   sigma_age ~ normal(0,1);
58 |   sigma_income ~ normal(0,1);
59 |   
60 |   sigma_region ~ normal(0,1);
61 |   sigma_state ~ normal(0, 1);
62 |   
63 |   // U_age[1] ~ normal(0,1); // random walk doesn't have this
64 |   for (j in 2:N_groups_age) {
65 |     U_age[j] ~normal(U_age[j-1],1);
66 |   }
67 | 
68 |   sum(U_age) ~ normal(0, 0.01 * N_groups_age); // constraint so we can write likelihood for rw(1).
69 |   
70 |   U_income ~ normal(0, 1); // random effect for income is normal
71 |   
72 |   U_region ~ normal(0,1);
73 |   
74 |   U_state ~ normal(0, 1);
75 | 
76 |   beta_state ~ normal(0, 1);
77 |   beta_relig ~ normal(0, 1);
78 |   intercept ~ normal(0, 1); 
79 |   
80 |   
81 |   for (i in 1:N) {
82 |     y[i] ~ bernoulli(inv_logit(yhat[i])); // the response
83 |   }
84 |   
85 | }
86 | 


--------------------------------------------------------------------------------
/simulation/proposedarN01_v3.stan:
--------------------------------------------------------------------------------
 1 | data {
 2 |   int<lower=0> N; // number of samples
 3 |   int<lower=0> N_groups_age; // the number of groups for age
 4 |   int<lower=0> N_groups_income; // the number of groups for income
 5 |   int<lower=1,upper=N_groups_age> age[N]; // the column vector of design matrix X for age
 6 |   int<lower=1,upper=N_groups_income> income[N]; // the column vector of design matrix X for income
 7 |   
 8 |   int<lower=1,upper=51> state_index[N]; // the index (from 1 to 51) representing the state for every datapoint
 9 |   int<lower=1,upper=5> region_index[51]; // the index (from 1 to 5) representing the region the state is in for all 51 states
10 |   real<lower=0,upper=1> state_vs[51]; // the 2004 Republican vote share for every state. This is hard coded for now.
11 |   real<lower=0,upper=1> relig[51]; // the 2004 conservative religion percentage in every state
12 |   
13 |   int y[N]; // the response vector
14 | }
15 | parameters {
16 |   vector[N_groups_age] U_age; // the random effect for age, not multiplied by sigma_age
17 |   vector[N_groups_income] U_income; // the random effect for income, not multiplied by sigma_income
18 |   
19 |   vector[51] U_state; // the random effect for state, not multiplied by sigma_state
20 |   vector[5] U_region; // the nested random effect for region, not multiplied by sigma_region
21 |   
22 |   real<lower=0> sigma_age; // sd of U_age (hyperparam). halfnormal prior put on this.
23 |   real<lower=0> sigma_income; // sd of U_income (hyperparam). halfnormal prior put on this.
24 |   
25 |   real<lower=0> sigma_state; // sd of state
26 |   real<lower=0> sigma_region; // sd of region
27 |   real beta_state; // coeff. for Republican vote share in every state
28 |   real beta_relig; // coeff. for conservative religion share in every state
29 |   
30 |   real intercept; // the intercept (global fixed effect)
31 |   real<lower=0,upper=1> rho; // the autoregressive coefficient untransformed
32 | }
33 | transformed parameters { 
34 |   vector[N_groups_age] U_age_transformed;
35 |   vector[N_groups_income] U_income_transformed;
36 |   
37 |   vector[51] U_state_transformed;
38 |   vector[5] U_region_transformed;
39 |   
40 |   vector[N] yhat;
41 |   real<lower=-1,upper=1> rho_transformed;
42 | 
43 |   rho_transformed = (rho * 2) - 1; // the autoregressive coefficient
44 | 
45 |   U_age_transformed = sigma_age * U_age; // the random effect for age
46 |   U_income_transformed = sigma_income * U_income; // the random effect for income
47 |   U_region_transformed = sigma_region * U_region; // the random effect for region
48 |   
49 |   // noncentered parameterization for U_state_transformed
50 |   for (j in 1:51) {
51 |     U_state_transformed[j] = (U_state[j] * sigma_state) + ( U_region_transformed[region_index[j]] + (beta_state * state_vs[j]) + (beta_relig * relig[j]) );
52 |   }
53 | 
54 |   for (i in 1:N) {
55 |     yhat[i] = intercept + U_age_transformed[age[i]] + U_income_transformed[income[i]] + U_state_transformed[state_index[i]]; // the linear predictor at each point
56 |   }
57 |   
58 | }
59 | model {
60 |   sigma_age ~ normal(0, 1); // sigma_A ~ halfnormal(0,1)
61 |   sigma_income ~ normal(0, 1); // sigma_I ~ halfnormal(0,1)
62 |   rho ~ beta(0.5, 0.5); // prior on autoregressive coefficient
63 |   
64 |   sigma_region ~ normal(0, 1);
65 |   sigma_state ~ normal(0, 1);
66 |   
67 |   U_age[1] ~ normal(0, 1/sqrt(1-rho_transformed^2)); // before it was normal(0, 1) but this is wrong
68 |   for (j in 2:N_groups_age) {
69 |     U_age[j] ~normal(rho_transformed * U_age[j-1],1);
70 |   }
71 |   
72 |   U_income ~ normal(0, 1); // random effect for income is normal
73 |   
74 |   U_region ~ normal(0, 1); 
75 |     
76 |   U_state ~ normal(0, 1);
77 | 
78 |   beta_state ~ normal(0, 1);
79 |   beta_relig ~ normal(0, 1);
80 |   intercept ~ normal(0, 1); 
81 | 
82 |   
83 |   for (i in 1:N) {
84 |     y[i] ~ bernoulli(inv_logit(yhat[i])); // the response
85 |   }
86 |   
87 | }
88 | 


--------------------------------------------------------------------------------
/simulation/threemodelwriteup_v3_posteriorvariance.R:
--------------------------------------------------------------------------------
  1 | rm(list=ls()) 
  2 | 
  3 | options(bitmapType="cairo")
  4 | # bias plots for new pipeline
  5 | 
  6 | library(reshape2)
  7 | library(rstan)
  8 | library(dplyr)
  9 | library(ggplot2)
 10 | library(gridExtra)
 11 | library(knitr)
 12 | library(ggridges)
 13 | library(matrixStats)
 14 | 
 15 | # Global params -------------------------------------------
 16 | 
 17 | save_ridgeplots = TRUE 
 18 | sample_size = 500
 19 | runs = 200
 20 | r = 1:9/10
 21 | income_multiplier = 1 # partitions income into more categories
 22 | age_grouping_multiplier = 12 # how much we take the maximal poststratification. make sure this can divide 60
 23 | number_of_states = 51
 24 | 
 25 | response_tag = "binary"
 26 | 
 27 | # colour scheme for biasfacet
 28 | use_biascolourscheme = TRUE
 29 | biascolourscheme = c("1"="#fbb4b9",
 30 |                      "2"="#f768a1",
 31 |                      "3"="#c51b8a",
 32 |                      "4"="#7a0177",
 33 |                      "5"="#fdcc8a",
 34 |                      "6"="#fc8d59",
 35 |                      "7"="#e34a33",
 36 |                      "8"="#b30000",
 37 |                      "9"="#b2e2e2",
 38 |                      "10"="#66c2a4",
 39 |                      "11"="#2ca25f",
 40 |                      "12"="#006d2c")
 41 | 
 42 | 
 43 | # ---------------------------------------------------------
 44 | # df for 3 regression lines with their standard deviation bands : median of medians
 45 | regression_df_all200 = c()
 46 | 
 47 | # df for 3 regression lines with their standard deviation bands : median of means
 48 | regression_df_mean_all200 = c()
 49 | 
 50 | # bias for baseline model
 51 | avg_bias_mat_baseline = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the mean of bias cell-wise (mean of mean bias)
 52 | avg_bias_sd_mat_baseline = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the sd of the mean of bias cell-wise
 53 | avg_value_mat_baseline = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # posterior values of 12 cells, for each value of p
 54 | 
 55 | # bias for ar model
 56 | avg_bias_mat_ar = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the mean of bias cell-wise (mean of mean bias)
 57 | avg_bias_sd_mat_ar = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the sd of the mean of bias cell-wise
 58 | avg_value_mat_ar = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier)  * number_of_states) # posterior values of 12 cells, for each value of p
 59 | 
 60 | mean_of_medians_ar = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the mean of median bias cell-wise
 61 | mean_of_medians_sd_ar = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states)
 62 | 
 63 | # bias for rw model
 64 | avg_bias_mat_rw = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the mean of bias cell-wise (mean of mean bias)
 65 | avg_bias_sd_mat_rw = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the sd of the mean of bias cell-wise
 66 | avg_value_mat_rw = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # posterior values of 12 cells, for each value of p
 67 | 
 68 | mean_of_medians_rw = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states) # the mean of median bias cell-wise
 69 | mean_of_medians_sd_rw = matrix(0, length(r), age_grouping_multiplier * (4* income_multiplier) * number_of_states)
 70 | 
 71 | # df for 3 regression lines with their standard deviation bands : median of medians
 72 | regression_df = c()
 73 | 
 74 | # df for 3 regression lines with their standard deviation bands : median of means
 75 | regression_df_mean = c()
 76 | 
 77 | # df for majority vote for all three models
 78 | majorityvote_df = c()
 79 | 
 80 | lowerboundary = 0.05
 81 | upperboundary = 0.95
 82 | 
 83 | counter = 1
 84 | prob_sampling_old = rep(0, length(r))
 85 | 
 86 | # ---
 87 | cm_quantiles_mat_bl = matrix(0, length(r), age_grouping_multiplier)
 88 | cm_quantiles_mat_ar = matrix(0, length(r), age_grouping_multiplier)
 89 | cm_quantiles_mat_rw = matrix(0, length(r), age_grouping_multiplier)
 90 | 
 91 | regression_df_all200_postsd = c()
 92 | # ---
 93 | 
 94 | for (p in r) {
 95 |   
 96 |   load(paste0(
 97 |     runs,
 98 |     "_",
 99 |     age_grouping_multiplier,
100 |     "_",
101 |     income_multiplier,
102 |     "_",
103 |     sample_size,
104 |     "_",
105 |     response_tag,
106 |     "_oldest_",
107 |     p * 10,
108 |     ".RData"))
109 |   
110 |   # ---
111 |   cm_quantiles_mat_bl[counter,] = colMeans(quantile90_list_baseline - quantile10_list_baseline)
112 |   cm_quantiles_mat_ar[counter,] = colMeans(quantile90_list_ar - quantile10_list_ar)
113 |   cm_quantiles_mat_rw[counter,] = colMeans(quantile90_list_rw - quantile10_list_rw)
114 |   
115 |   quantilediff_list_baseline_melted = melt(quantile90_list_baseline - quantile10_list_baseline)
116 |   quantilediff_list_ar_melted = melt(quantile90_list_ar - quantile10_list_ar)
117 |   quantilediff_list_rw_melted = melt(quantile90_list_rw - quantile10_list_rw)
118 |   
119 |   regression_df_all200_postsd = rbind(regression_df_all200_postsd,
120 |                                       data.frame(cbind(quantilediff_list_baseline_melted,
121 |                                                        type="Baseline",
122 |                                                        p)),
123 |                                       data.frame(cbind(quantilediff_list_ar_melted,
124 |                                                        type="Autoregressive",
125 |                                                        p)),
126 |                                       data.frame(cbind(quantilediff_list_rw_melted,
127 |                                                        type="Random walk",
128 |                                                        p))
129 |                                       )
130 | 
131 |   # ---
132 |   
133 |   # correct level order for the 4 ridgeplots below
134 |   correct_levels = c()
135 |   for (g in 1:age_grouping_multiplier) {
136 |     correct_levels = c(correct_levels, 
137 |                        seq(from = g, to = age_grouping_multiplier * 3, by = age_grouping_multiplier))
138 |   }
139 |   
140 |   # Get the interpretable p which is the probability of sampling someone in ages 61 - 80
141 |   poststrat_final_p = cbind(poststrat_final, 
142 |                             (poststrat_final$p_response * poststrat_final$N)/sum((poststrat_final$p_response * poststrat_final$N))) # this is the probability of sampling someone in ages  61 - 80
143 |   colnames(poststrat_final_p)[length(colnames(poststrat_final_p))] = "prob_sampling_old" 
144 |   prob_sampling_old[counter] = sum(poststrat_final_p[as.numeric(as.character(poststrat_final_p$age_cat)) >= (age_grouping_multiplier*2/3 + 1),
145 |                                                      c("prob_sampling_old")]) # very important
146 |   
147 |   # baseline   
148 |   avg_bias_mat_baseline[counter,] = colMeans(sweep(sample_cell_estimates_median_baseline,
149 |                                                    2,
150 |                                                    poststrat_final_reduced$true_pref_grouped_final, 
151 |                                                    "-"))
152 |   
153 |   avg_bias_sd_mat_baseline[counter,] = apply(sweep(sample_cell_estimates_median_baseline,
154 |                                                    2,
155 |                                                    poststrat_final_reduced$true_pref_grouped_final,
156 |                                                    "-"),
157 |                                              2,
158 |                                              sd)
159 |   
160 |   avg_value_mat_baseline[counter,] = colMeans(sample_cell_estimates_median_baseline)
161 |   
162 |   # autoregressive
163 |   avg_bias_mat_ar[counter,] = colMeans(sweep(sample_cell_estimates_median_ar,
164 |                                              2,
165 |                                              poststrat_final_reduced$true_pref_grouped_final, 
166 |                                              "-"))
167 |   
168 |   avg_bias_sd_mat_ar[counter,] = apply(sweep(sample_cell_estimates_median_ar,
169 |                                              2,
170 |                                              poststrat_final_reduced$true_pref_grouped_final, "-"),
171 |                                        2,
172 |                                        sd)
173 |   
174 |   avg_value_mat_ar[counter,] = colMeans(sample_cell_estimates_median_ar)
175 |   
176 |   # random walk
177 |   avg_bias_mat_rw[counter,] = colMeans(sweep(sample_cell_estimates_median_rw,
178 |                                              2,
179 |                                              poststrat_final_reduced$true_pref_grouped_final, 
180 |                                              "-"))
181 |   
182 |   avg_bias_sd_mat_rw[counter,] = apply(sweep(sample_cell_estimates_median_rw,
183 |                                              2,
184 |                                              poststrat_final_reduced$true_pref_grouped_final, "-"),
185 |                                        2,
186 |                                        sd)
187 |   
188 |   avg_value_mat_rw[counter,] = colMeans(sample_cell_estimates_median_rw)
189 |   
190 |   # ---
191 |   regression_df = rbind(regression_df, 
192 |                         data.frame(cbind(1:age_grouping_multiplier, 
193 |                               colMedians(median_quantile_list_baseline),
194 |                               "Baseline",
195 |                               p)),
196 |                         data.frame(cbind(1:age_grouping_multiplier, 
197 |                               colMedians(median_quantile_list_ar),
198 |                               "Autoregressive",
199 |                               p)),
200 |                         data.frame(cbind(1:age_grouping_multiplier, 
201 |                               colMedians(median_quantile_list_rw),
202 |                               "Random walk",
203 |                               p))
204 |                         )
205 |   # ---
206 |   
207 |   regression_df_mean = rbind(regression_df_mean, 
208 |                         data.frame(cbind(1:age_grouping_multiplier, 
209 |                                          colMedians(mean_list_baseline),
210 |                                          "Baseline",
211 |                                          p)),
212 |                         data.frame(cbind(1:age_grouping_multiplier, 
213 |                                          colMedians(mean_list_ar),
214 |                                          "Autoregressive",
215 |                                          p)),
216 |                         data.frame(cbind(1:age_grouping_multiplier, 
217 |                                          colMedians(mean_list_rw),
218 |                                          "Random walk",
219 |                                          p))
220 |   )
221 |   
222 |   # ---
223 |   
224 |   median_quantile_list_baseline_melted = melt(median_quantile_list_baseline)
225 |   median_quantile_list_ar_melted = melt(median_quantile_list_ar)
226 |   median_quantile_list_rw_melted = melt(median_quantile_list_rw)
227 |   
228 |   regression_df_all200 = rbind(regression_df_all200, 
229 |                                data.frame(cbind(median_quantile_list_baseline_melted,
230 |                                                 type="Baseline",  
231 |                                                 p)),
232 |                                data.frame(cbind(median_quantile_list_ar_melted,
233 |                                                 type="Autoregressive",
234 |                                                 p)),
235 |                                data.frame(cbind(median_quantile_list_rw_melted,
236 |                                                 type="Random walk",
237 |                                                 p))
238 |   )
239 |   
240 |   # ---
241 |   
242 |   mean_list_baseline_melted = melt(mean_list_baseline)
243 |   mean_list_ar_melted = melt(mean_list_ar)
244 |   mean_list_rw_melted = melt(mean_list_rw)
245 |   
246 |   regression_df_mean_all200 = rbind(regression_df_mean_all200, 
247 |                                data.frame(cbind(mean_list_baseline_melted,
248 |                                                 type="Baseline",  
249 |                                                 p)),
250 |                                data.frame(cbind(mean_list_ar_melted,
251 |                                                 type="Autoregressive",
252 |                                                 p)),
253 |                                data.frame(cbind(mean_list_rw_melted,
254 |                                                 type="Random walk",
255 |                                                 p))
256 |   )
257 |   
258 |   # ---
259 |   
260 |   majorityvote_df = rbind(majorityvote_df, 
261 |                           data.frame(majorityvote = model_popn_pref_mat_baseline, 
262 |                                      type = "Baseline",
263 |                                      p),
264 |                           data.frame(majorityvote = model_popn_pref_mat_ar, 
265 |                                      type = "Autoregressive",
266 |                                      p),
267 |                           data.frame(majorityvote = model_popn_pref_mat_rw, 
268 |                                      type = "Random walk",
269 |                                      p)
270 |                           )
271 |   
272 |   counter = counter + 1
273 | }
274 | 
275 | nonboundary_indices = (poststrat_final_reduced$true_pref_grouped_final > lowerboundary) & (poststrat_final_reduced$true_pref_grouped_final < upperboundary)
276 | 
277 | avg_value_mat_baseline = as.data.frame(avg_value_mat_baseline)
278 | avg_value_mat_baseline = avg_value_mat_baseline[,nonboundary_indices]
279 | df_val.melted_baseline = cbind( melt(avg_value_mat_baseline), rep(r, sum(nonboundary_indices)) ) 
280 | colnames(df_val.melted_baseline) = c("cell", "value", "p_response_3")
281 | 
282 | avg_value_mat_ar = as.data.frame(avg_value_mat_ar)
283 | avg_value_mat_ar = avg_value_mat_ar[,nonboundary_indices]
284 | df_val.melted_ar = cbind(melt(avg_value_mat_ar), rep(r, sum(nonboundary_indices)) )
285 | colnames(df_val.melted_ar) = c("cell", "value", "p_response_3")
286 | 
287 | avg_value_mat_rw = as.data.frame(avg_value_mat_rw)
288 | avg_value_mat_rw = avg_value_mat_rw[,nonboundary_indices]
289 | df_val.melted_rw = cbind(melt(avg_value_mat_rw), rep(r, sum(nonboundary_indices)) )
290 | colnames(df_val.melted_rw) = c("cell", "value", "p_response_3")
291 | 
292 | 
293 | poststrat_final_V = cbind(paste0("V", rownames(poststrat_final_reduced)),
294 |                           poststrat_final_reduced,
295 |                           stringsAsFactors=FALSE)
296 | colnames(poststrat_final_V)[1] = "cell"
297 | 
298 | df_val.melted_baseline$cell = as.character(df_val.melted_baseline$cell)
299 | df_val.melted_ar$cell = as.character(df_val.melted_ar$cell)
300 | df_val.melted_rw$cell = as.character(df_val.melted_rw$cell)
301 | 
302 | df_val.melted_V_baseline = dplyr::inner_join(x = df_val.melted_baseline, y = poststrat_final_V, by="cell")
303 | df_val.melted_V_unique_baseline = unique(df_val.melted_V_baseline[,c("cell", "true_pref_grouped_final")])
304 | df_val.melted_V_baseline$true_pref_grouped_final = factor(df_val.melted_V_baseline$true_pref_grouped_final)
305 | 
306 | df_val.melted_V_ar = dplyr::inner_join(x = df_val.melted_ar, y = poststrat_final_V, by="cell")
307 | df_val.melted_V_unique_ar = unique(df_val.melted_V_ar[,c("cell", "true_pref_grouped_final")])
308 | df_val.melted_V_ar$true_pref_grouped_final = factor(df_val.melted_V_ar$true_pref_grouped_final)
309 | 
310 | df_val.melted_V_rw = dplyr::inner_join(x = df_val.melted_rw, y = poststrat_final_V, by="cell")
311 | df_val.melted_V_unique_rw = unique(df_val.melted_V_rw[,c("cell", "true_pref_grouped_final")])
312 | df_val.melted_V_rw$true_pref_grouped_final = factor(df_val.melted_V_rw$true_pref_grouped_final)
313 | 
314 | 
315 | d_temp_baseline = ggplot(df_val.melted_V_baseline, aes(x=p_response_3, y=value, group=cell)) +
316 |   geom_line(aes(col=as.factor(age_cat)), show.legend = TRUE, size = 1) +
317 |   geom_hline(data = poststrat_final_V, mapping = aes(yintercept = true_pref_grouped_final, color = as.factor(age_cat)), linetype = "dashed", show.legend = FALSE) +
318 |   xlab("Probability of response for oldest age group") +
319 |   ylab("Baseline Avg. of avg. of PLD colour-coded by truth") + 
320 |   ylim(0, 1) +
321 |   theme_minimal() +
322 |   theme(legend.position="top",
323 |         legend.title = element_blank())
324 | 
325 | d_temp_ar = ggplot(df_val.melted_V_ar, aes(x=p_response_3, y=value, group=cell)) +
326 |   geom_line(aes(col=as.factor(age_cat)), show.legend = TRUE, size = 1) +
327 |   geom_hline(data = poststrat_final_V, mapping = aes(yintercept = true_pref_grouped_final, color = as.factor(age_cat)), linetype = "dashed", show.legend = FALSE) +
328 |   xlab("Probability of response for oldest age group") +
329 |   ylab("AR Avg. of avg. of PLD colour-coded by truth") + 
330 |   ylim(0, 1) +
331 |   theme_minimal() +
332 |   theme(legend.position="top",
333 |         legend.title = element_blank())
334 | 
335 | d_temp_rw = ggplot(df_val.melted_V_rw, aes(x=p_response_3, y=value, group=cell)) +
336 |   geom_line(aes(col=as.factor(age_cat)), show.legend = TRUE, size = 1) +
337 |   geom_hline(data = poststrat_final_V, mapping = aes(yintercept = true_pref_grouped_final, color = as.factor(age_cat)), linetype = "dashed", show.legend = FALSE) +
338 |   xlab("Probability of response for oldest age group") +
339 |   ylab("RW Avg. of avg. of PLD colour-coded by truth") + 
340 |   ylim(0, 1) +
341 |   theme_minimal() +
342 |   theme(legend.position="top",
343 |         legend.title = element_blank())
344 | 
345 | 
346 | 
347 | # BIAS -------
348 | 
349 | avg_bias_mat_baseline = as.data.frame(avg_bias_mat_baseline)
350 | avg_bias_mat_baseline = avg_bias_mat_baseline[,nonboundary_indices]
351 | df_bias.melted_baseline = cbind( melt(avg_bias_mat_baseline), rep(r, sum(nonboundary_indices)) ) 
352 | colnames(df_bias.melted_baseline) = c("cell", "value", "p_response_3")
353 | 
354 | avg_bias_mat_ar = as.data.frame(avg_bias_mat_ar)
355 | avg_bias_mat_ar = avg_bias_mat_ar[,nonboundary_indices]
356 | df_bias.melted_ar = cbind(melt(avg_bias_mat_ar), rep(r, sum(nonboundary_indices)) )
357 | colnames(df_bias.melted_ar) = c("cell", "value", "p_response_3")
358 | 
359 | avg_bias_mat_rw = as.data.frame(avg_bias_mat_rw)
360 | avg_bias_mat_rw = avg_bias_mat_rw[,nonboundary_indices]
361 | df_bias.melted_rw = cbind(melt(avg_bias_mat_rw), rep(r, sum(nonboundary_indices)) )
362 | colnames(df_bias.melted_rw) = c("cell", "value", "p_response_3")
363 | 
364 | df_bias.melted_baseline$cell = as.character(df_bias.melted_baseline$cell)
365 | df_bias.melted_ar$cell = as.character(df_bias.melted_ar$cell)
366 | df_bias.melted_rw$cell = as.character(df_bias.melted_rw$cell)
367 | 
368 | df_bias.melted_V_baseline = dplyr::inner_join(x = df_bias.melted_baseline, y = poststrat_final_V, by="cell")
369 | df_bias.melted_V_unique_baseline = unique(df_bias.melted_V_baseline[,c("cell", "true_pref_grouped_final")])
370 | df_bias.melted_V_baseline$true_pref_grouped_final = factor(df_bias.melted_V_baseline$true_pref_grouped_final)
371 | 
372 | df_bias.melted_V_ar = dplyr::inner_join(x = df_bias.melted_ar, y = poststrat_final_V, by="cell")
373 | df_bias.melted_V_unique_ar = unique(df_bias.melted_V_ar[,c("cell", "true_pref_grouped_final")])
374 | df_bias.melted_V_ar$true_pref_grouped_final = factor(df_bias.melted_V_ar$true_pref_grouped_final)
375 | 
376 | df_bias.melted_V_rw = dplyr::inner_join(x = df_bias.melted_rw, y = poststrat_final_V, by="cell")
377 | df_bias.melted_V_unique_rw = unique(df_bias.melted_V_rw[,c("cell", "true_pref_grouped_final")])
378 | df_bias.melted_V_rw$true_pref_grouped_final = factor(df_bias.melted_V_rw$true_pref_grouped_final)
379 | 
380 | 
381 | bias_facet_df = rbind(cbind(df_bias.melted_V_baseline, Model = "Baseline"),
382 |                       cbind(df_bias.melted_V_ar, Model = "Autoregressive"),
383 |                       cbind(df_bias.melted_V_rw, Model = "Random walk"))
384 | 
385 | mapdf_prob_sampling_old_bias = data.frame(old = sort(unique(bias_facet_df$p_response_3)), 
386 |                                      new = prob_sampling_old)
387 | bias_facet_df$prob_sampling_old = round(mapdf_prob_sampling_old_bias$new[match(bias_facet_df$p_response_3,
388 |                                                                                mapdf_prob_sampling_old_bias$old)],
389 |                                                digits = 2)
390 | 
391 | bias_facet_df$age_cat = factor(bias_facet_df$age_cat)
392 | 
393 | saveRDS(bias_facet_df, paste0("biasfacet_",
394 |                               runs,
395 |                               "_",
396 |                               age_grouping_multiplier,
397 |                               "_",
398 |                               income_multiplier,
399 |                               "_",
400 |                               sample_size,".rds"))
401 | 
402 | png(filename = paste0("biasfacet_",
403 |                       runs,
404 |                       "_",
405 |                       age_grouping_multiplier,
406 |                       "_",
407 |                       income_multiplier,
408 |                       "_",
409 |                       sample_size,".png"),
410 |     width = 4800, height = 2400)
411 | 
412 | if (use_biascolourscheme==TRUE) {
413 | 
414 |     plot(
415 |       ggplot(bias_facet_df, aes(x=prob_sampling_old, y=value, group=cell)) + 
416 |         geom_line(aes(col=as.factor(age_cat)), show.legend = TRUE, size = 0.5) +
417 |         facet_wrap(. ~ Model,ncol=3,nrow=1, labeller = as_labeller(
418 |           function(value) {
419 |             return(value)  # Lets you change the facet labels
420 |           })
421 |         ) +  
422 |         xlab("\n Probability of response for the elderly \n") +
423 |         ylab("\n Average Bias \n") + 
424 |         scale_colour_manual(values=c("1"="#fbb4b9",
425 |                                    "2"="#f768a1",
426 |                                    "3"="#c51b8a",
427 |                                    "4"="#7a0177",
428 |                                    "5"="#fdcc8a",
429 |                                    "6"="#fc8d59",
430 |                                    "7"="#e34a33",
431 |                                    "8"="#b30000",
432 |                                    "9"="#b2e2e2",
433 |                                    "10"="#66c2a4",
434 |                                    "11"="#2ca25f",
435 |                                    "12"="#006d2c")
436 |                             ) +
437 |         theme_bw() +
438 |         theme(plot.title = element_text(size = 50, face = "bold"),
439 |               axis.text=element_text(size=35*1.6),
440 |               axis.title=element_text(size=50*1.6, face="bold",margin=200),
441 |               legend.text = element_text(size=50*1.6),
442 |               legend.position  = "bottom",
443 |               legend.key.size = unit(30*1.6,"line"),
444 |               legend.key.height = unit(7*1.6,"line"),
445 |               legend.key = element_rect(fill = "transparent",color = "transparent"),
446 |               legend.title = element_text(size=50*1.6, face="bold"),
447 |               strip.text.x = element_text(size=50*1.6, face="bold", margin = margin(t=25,b=25) ),
448 |               strip.background = element_rect(fill="transparent",color="transparent")
449 |               ) + 
450 |         guides(col=guide_legend(title="Age category",
451 |                                 override.aes = list(size = 10*1.6)))
452 |   
453 |     )
454 |   
455 | }else {
456 |   plot(
457 |     ggplot(bias_facet_df, aes(x=prob_sampling_old, y=value, group=cell)) + 
458 |       geom_line(aes(col=as.factor(age_cat)), show.legend = TRUE, size = 0.5) +
459 |       facet_wrap(. ~ Model,ncol=3,nrow=1, labeller = as_labeller(
460 |         function(value) {
461 |           return(value)  # Lets you change the facet labels
462 |         })
463 |       ) +  
464 |       xlab("\n Probability of response for the elderly \n") +
465 |       ylab("\n Average Bias \n") + 
466 |       theme_bw() +
467 |       theme(plot.title = element_text(size = 50, face = "bold"),
468 |             axis.text=element_text(size=35*1.6),
469 |             axis.title=element_text(size=50*1.6, face="bold",margin=200),
470 |             legend.text = element_text(size=50*1.6),
471 |             legend.position  = "bottom",
472 |             legend.key.size = unit(30*1.6,"line"),
473 |             legend.key.height = unit(7*1.6,"line"),
474 |             legend.key = element_rect(fill = "transparent",color = "transparent"),
475 |             legend.title = element_text(size=50*1.6, face="bold"),
476 |             strip.text.x = element_text(size=50*1.6, face="bold", margin = margin(t=25,b=25) ),
477 |             strip.background = element_rect(fill="transparent",color="transparent")) + 
478 |       guides(col=guide_legend(title="Age category",
479 |                               override.aes = list(size = 10*1.6)))
480 |   )
481 | }
482 |   
483 | dev.off()
484 | 
485 |   
486 |   # all medians of posteriors
487 |   
488 | regression_df_all200$Var2 = factor(regression_df_all200$Var2)
489 | colnames(regression_df_all200) = c("n", "age_cat", "posterior_medians", "Model", "p")
490 | 
491 | mapdf_prob_sampling_old = data.frame(old = sort(unique(regression_df_all200$p)), 
492 |                                      new = prob_sampling_old)
493 | regression_df_all200$prob_sampling_old = round(mapdf_prob_sampling_old$new[match(regression_df_all200$p,
494 |                                                                                  mapdf_prob_sampling_old$old)],
495 |                                                digits = 2)
496 | 
497 | regression_df_all200$age_cat = as.numeric(as.character(regression_df_all200$age_cat))
498 | 
499 | # save regression_df_all200
500 | saveRDS(regression_df_all200, 
501 |         paste0("allmedians_facet",
502 |                "_",
503 |                runs,
504 |                "_",
505 |                age_grouping_multiplier,
506 |                "_",
507 |                sample_size,
508 |                "_",
509 |                ".rds"))
510 | 
511 | saveRDS(points_df_final[,c("age_cat", "mrp", "Type")], 
512 |         "points_df_final.rds")
513 | 
514 | png(paste0("allmedians_facet",
515 |            "_",
516 |            runs,
517 |            "_",
518 |            age_grouping_multiplier,
519 |            "_",
520 |            sample_size,
521 |            "_",
522 |            ".png"),
523 |     width=3000, height=2400)
524 | 
525 | 
526 | plot(
527 |   ggplot(regression_df_all200, aes(x=age_cat, y=posterior_medians, color=Model, linetype=Model)) + #geom_point(size=5)+
528 |     geom_rect(regression_df_all200, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2,
529 |                                                   xmax = Inf,
530 |                                                   ymin = -Inf,
531 |                                                   ymax = Inf),
532 |               alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) +
533 |     geom_jitter(alpha=0.1, show.legend=FALSE) + 
534 |     geom_point(aes(x = age_cat, y = mrp),
535 |                size=7,
536 |                colour = "black",
537 |                fill="black",
538 |                show.legend = F,
539 |                data = points_df_final,
540 |                inherit.aes = F) +
541 |     geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 
542 |     facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller(
543 |       function(value) {
544 |         return(value)  # Lets you change the facet labels
545 |       })
546 |     ) +
547 |     xlab("\n Age Category \n") +
548 |     ylab(paste("\n Median of", runs,"Posteriors \n")) +
549 |     scale_x_continuous(breaks= scales::pretty_breaks()) + 
550 |     theme_bw() +
551 |     theme(plot.title = element_text(size = 50, face = "bold"),
552 |           axis.text=element_text(size=50),
553 |           axis.title=element_text(size=50, face="bold",margin=200),
554 |           legend.text = element_text(size=50),
555 |           legend.position  = "bottom",
556 |           legend.key.size = unit(25,"line"),
557 |           legend.key.height = unit(7,"line"),
558 |           legend.key = element_rect(fill = "transparent",color = "transparent"),
559 |           legend.title = element_text(size=50, face="bold"),
560 |           strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ),
561 |           strip.background = element_rect(fill="transparent",color="transparent")) +
562 |     scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 
563 |     guides(color = guide_legend(override.aes = list(size=20)))
564 |   
565 | )
566 | 
567 | dev.off()
568 |   
569 | 
570 | # plot of the posterior standard deviations 
571 | regression_df_all200_postsd$Var2 = factor(regression_df_all200_postsd$Var2)
572 | colnames(regression_df_all200_postsd) = c("n", "age_cat", "quantildiff90_10", "Model", "p")
573 | 
574 | #
575 | mapdf_prob_sampling_old_postsd = data.frame(old = sort(unique(regression_df_all200_postsd$p)), 
576 |                                      new = prob_sampling_old)
577 | regression_df_all200_postsd$prob_sampling_old = round(mapdf_prob_sampling_old_postsd$new[match(regression_df_all200_postsd$p,
578 |                                                                                                mapdf_prob_sampling_old_postsd$old)],
579 |                                                       digits = 2)
580 | 
581 | regression_df_all200_postsd$age_cat = as.numeric(as.character(regression_df_all200_postsd$age_cat))
582 | #
583 | 
584 | saveRDS(regression_df_all200_postsd,
585 |         paste0("allquantilediff_facet",
586 |                "_",
587 |                runs,
588 |                "_",
589 |                age_grouping_multiplier,
590 |                "_",
591 |                sample_size,
592 |                "_",
593 |                ".rds"))
594 | 
595 | png(paste0("allquantilediff_facet",
596 |            "_",
597 |            runs,
598 |            "_",
599 |            age_grouping_multiplier,
600 |            "_",
601 |            sample_size,
602 |            "_",
603 |            ".png"),
604 |     width=3000, height=2400)
605 | 
606 | plot(
607 |   ggplot(regression_df_all200_postsd, aes(x=age_cat, y=quantildiff90_10, color=Model, linetype=Model)) + #geom_point(size=5)+
608 |     geom_rect(regression_df_all200_postsd, mapping = aes(xmin = (2 * (age_grouping_multiplier * 2/3) + 1)/2,
609 |                                                   xmax = Inf,
610 |                                                   ymin = -Inf,
611 |                                                   ymax = Inf),
612 |               alpha = 0.1,fill = "gray93", colour = NA, show.legend=FALSE) +
613 |     geom_jitter(alpha=0.1, show.legend=FALSE) + 
614 |     geom_smooth(aes(group=Model), method="loess", size=2.5, se=FALSE) + 
615 |     facet_wrap(. ~ prob_sampling_old,ncol=3,nrow=3, labeller = as_labeller(
616 |       function(value) {
617 |         return(value)  # Lets you change the facet labels
618 |       })
619 |     ) +
620 |     xlab("\n Age Category \n") +
621 |     ylab(paste("\n 90th - 10th quantile of", runs,"Posteriors \n")) +
622 |     scale_x_continuous(breaks= scales::pretty_breaks()) + 
623 |     theme_bw() +
624 |     theme(plot.title = element_text(size = 50, face = "bold"),
625 |           axis.text=element_text(size=50),
626 |           axis.title=element_text(size=50, face="bold",margin=200),
627 |           legend.text = element_text(size=50),
628 |           legend.position  = "bottom",
629 |           legend.key.size = unit(25,"line"),
630 |           legend.key.height = unit(7,"line"),
631 |           legend.key = element_rect(fill = "transparent",color = "transparent"),
632 |           legend.title = element_text(size=50, face="bold"),
633 |           strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ),
634 |           strip.background = element_rect(fill="transparent",color="transparent")) +
635 |     scale_color_manual(values=c("#4575b4", "#d73027","#fdae61")) + 
636 |     guides(color = guide_legend(override.aes = list(size=20)))
637 |   
638 | )
639 | 
640 | dev.off()
641 | 
642 | # --------------------------------------------------------------------------
643 | 
644 | print("Warnings:")
645 | print(warnings())
646 | 
647 |     
648 |     
649 | 


--------------------------------------------------------------------------------
/simulation_spatialmrp/icar_mrp_simulation_bym2.R:
--------------------------------------------------------------------------------
  1 | set.seed(21)
  2 | 
  3 | 
  4 | rm(list=ls())
  5 | gc()
  6 | 
  7 | library(dplyr)
  8 | library(readr)
  9 | library(data.table)
 10 | library(sp)
 11 | library(ggplot2)
 12 | library(INLA)
 13 | 
 14 | use_spatial_effect = TRUE # the default is TRUE. This will define the true_pref with coef_puma, the spatial effect
 15 | 
 16 | poverty_poststrat = readRDS("poverty_poststrat.rds")
 17 | us_pumas = readRDS("us_pumas.rds")
 18 | ma_adj_matrix = readRDS("ma_adj_matrix.rds")
 19 | ma_adj_matrix_sparse = readRDS("ma_adj_matrix_sparse.rds")
 20 | x = readRDS("smooth_x.rds")
 21 |   
 22 | num_education = length(unique(poverty_poststrat$education)) # number of strata for education
 23 | num_race = length(unique(poverty_poststrat$race_x)) # number of strata for race
 24 | num_puma = length(unique(poverty_poststrat$PUMA.x))
 25 | 
 26 | total_poverty = poverty_poststrat %>% group_by(PUMA.x) %>% summarise(total = sum(N),
 27 |                                                                      total_in_poverty = sum(y)) %>% arrange(PUMA.x) %>% as.data.frame()
 28 | 
 29 | total_poverty$proportion_in_poverty = total_poverty$total_in_poverty/total_poverty$total
 30 | 
 31 | 
 32 | ss = "massachusetts"
 33 | ss_code = 25 # use data dictionary to get ss code
 34 | continental_states = c("MA")
 35 | 
 36 | r = 1:9/10
 37 | runs = 200
 38 | sample_size = 500 # 500, 1000, 2000, 4000
 39 | 
 40 | puma_overundersample_index = c(3301, # "Boston City--Allston, Brighton & Fenway"
 41 |                                3305,  # "Boston City--Hyde Park, Jamaica Plain, Roslindale & West Roxbury"
 42 |                                3303, # "Boston City--Dorchester & South Boston"
 43 |                                3302, # "Boston City--Back Bay, Beacon Hill, Charlestown, East Boston, Central & South End" 
 44 |                                3304, # "Boston City--Mattapan & Roxbury" 
 45 |                                3306,
 46 |                                503,
 47 |                                504,
 48 |                                505,
 49 |                                506,
 50 |                                507,
 51 |                                508,
 52 |                                3500,
 53 |                                3601,
 54 |                                3602,
 55 |                                3603,
 56 |                                3400
 57 | )
 58 | 
 59 | # used to calculate posterior width
 60 | lowerquantile = 0.1
 61 | upperquantile = 0.9
 62 | 
 63 | temp_x = cbind(x, us_pumas$PUMACE10) %>% as.data.frame()
 64 | x_ordered = temp_x %>% arrange(V2) %>% as.data.frame()
 65 | coef_puma = x_ordered$V1 # coef_puma is in alphabetical order
 66 | 
 67 | coef_race = (1:num_race - mean(1:num_race))*0.1
 68 | 
 69 | coef_education = (1:num_education - mean(1:num_education) - 2)*0.15
 70 | 
 71 | intercept_term = 0
 72 | 
 73 | # http://www.paulamoraga.com/book-geospatial/sec-inla.html
 74 | # https://mc-stan.org/users/documentation/case-studies/icar_stan.html
 75 | 
 76 | icar_formula = n ~ 1 + 
 77 |   f(education_f,
 78 |     model="iid",
 79 |     hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1)))
 80 |     ) +
 81 |   f(race_f,
 82 |     model="iid",
 83 |     hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1)))
 84 |     ) +
 85 |   f(ID,
 86 |     model = "bym2",
 87 |     graph = ma_adj_matrix_sparse,
 88 |     hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1))),
 89 |     adjust.for.con.comp = TRUE
 90 |   )
 91 | 
 92 | iid_formula = n ~ 1 + 
 93 |   f(education_f,
 94 |     model="iid",
 95 |     hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1)))
 96 |     ) +
 97 |   f(race_f,
 98 |     model="iid",
 99 |     hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1)))
100 |     ) +
101 |   f(ID_f,
102 |     model = "iid",
103 |     hyper = list(prec = list(prior="pc.prec", param=c(1, 0.1)))
104 |   )
105 | 
106 | # -------------------------------------------------------------------------------------
107 | # -------------------------------------------------------------------------------------
108 | # -------------------------------------------------------------------------------------
109 | 
110 | us_pumas_df = as.data.frame(us_pumas)
111 | # ---
112 | 
113 | a = as.numeric(us_pumas$PUMACE10) # converts string to numeric
114 | b = unique(poverty_poststrat$PUMA.x)
115 | 
116 | sort(a)
117 | sort(b)
118 | 
119 | sort(a) - sort(b) # great
120 | 
121 | # https://www.census.gov/geographies/reference-maps/2010/geo/2010-pumas/massachusetts.html
122 | us_pumas$PUMACE10 = as.numeric(us_pumas$PUMACE10) # just run once
123 | 
124 | us_pumas$proportion_in_poverty = total_poverty[base::match(us_pumas$PUMACE10, total_poverty$PUMA.x),c("proportion_in_poverty")]
125 | 
126 | # fit simple ICAR model to MA w/ response being proportion_in_poverty
127 | us_pumas$ID = 1:nrow(us_pumas)
128 | 
129 | # -----------------------------------------------------------------------
130 | # simulation pipeline ---------------------------------------------------
131 | # -----------------------------------------------------------------------
132 | 
133 | num_education = length(unique(poverty_poststrat$education)) # number of strata for education
134 | num_race = length(unique(poverty_poststrat$race_x)) # number of strata for race
135 | num_puma = length(unique(poverty_poststrat$PUMA.x))
136 | 
137 | poststrat_final = expand.grid(puma_area = unique(poverty_poststrat$PUMA.x),
138 |                               race = unique(poverty_poststrat$race_x),
139 |                               education = unique(poverty_poststrat$education))
140 | 
141 | poststrat_final_joined = left_join(poststrat_final, poverty_poststrat, by=c("puma_area"="PUMA.x",
142 |                                                                             "race"="race_x",
143 |                                                                             "education"="education_x")) %>% as.data.frame()
144 | 
145 | # Order of factors for poststrat_final_joined
146 | levels_education = unique(poverty_poststrat$education)
147 | levels_race = unique(poverty_poststrat$race_x)
148 | levels_puma = unique(poverty_poststrat$PUMA.x)
149 | 
150 | 
151 | poststrat_final_join_numeric = expand.grid(puma_area = 1:length(unique(poverty_poststrat$PUMA.x)),
152 |                                            race = 1:length(unique(poverty_poststrat$race_x)),
153 |                                            education = 1:length(unique(poverty_poststrat$education)))
154 | 
155 | # https://stackoverflow.com/questions/18562680/replacing-nas-with-0s-in-r-dataframe
156 | # replace all zeros in dataframe 
157 | poststrat_final_joined[is.na(poststrat_final_joined)] = 0
158 | 
159 | true_pref = rep(NA, num_puma * num_race * num_education) # this contains the true preference probability of a group in the population
160 | if (use_spatial_effect==TRUE) {
161 |   
162 |   for (j in 1:(num_education * num_race * num_puma)) {
163 |     true_pref[j] = arm::invlogit(intercept_term +
164 |                                    coef_puma[poststrat_final_join_numeric[j, 1]] +
165 |                                    coef_race[poststrat_final_join_numeric[j, 2]] +
166 |                                    coef_education[poststrat_final_join_numeric[j, 3]]
167 |     )
168 |   }
169 |   
170 | }else {
171 |   
172 |   for (j in 1:(num_education * num_race * num_puma)) {
173 |     true_pref[j] = arm::invlogit(intercept_term +
174 |                                    coef_race[poststrat_final_join_numeric[j, 2]] +
175 |                                    coef_education[poststrat_final_join_numeric[j, 3]]
176 |     )
177 |   }
178 |   
179 | }
180 | 
181 | # y is number of people in poverty weighted by PWGTP, N is number of people in the PUMA area (weighted dby PWGTP)
182 | poststrat_final_joined = cbind(poststrat_final_joined, true_pref)
183 | poststrat_puma_pref = poststrat_final_joined %>% group_by(puma_area) %>% summarise(N_times_theta = sum(N*true_pref)) %>% as.data.frame()
184 | poststrat_puma_pref = inner_join(poststrat_puma_pref,
185 |                                  poststrat_final_joined %>% group_by(puma_area) %>% summarise(N_sub = sum(N)) %>% as.data.frame(),
186 |                                  by = c("puma_area"="puma_area"))
187 | 
188 | # calculate poststratified true preference for every puma in MA
189 | poststrat_puma_pref$puma_ps = poststrat_puma_pref$N_times_theta/poststrat_puma_pref$N_sub
190 | summary(poststrat_puma_pref$puma_ps)
191 | 
192 | # true poststratified values based on dg process
193 | us_pumas$puma_ps = poststrat_puma_pref$puma_ps[rank(us_pumas$PUMACE10)]
194 | 
195 | # Augment p in the simulation scheme
196 | cbind(us_pumas$NAME10, us_pumas$PUMACE10)
197 | 
198 | p_response_race = rep(1, num_race)
199 | p_response_education = rep(1, num_education)
200 | 
201 | N_sub = rep(0, num_puma) # the N for each PUMA based off poststrat matrix
202 | g_counter = 1
203 | for (g in levels_puma) {
204 |   N_sub[g_counter] = sum(poststrat_final_joined[which(poststrat_final_joined$puma_area==g),]$N)
205 |   g_counter = g_counter + 1
206 | }
207 | 
208 | 
209 | for (p in r) {
210 |   
211 |   # storage matrices for posterior estimates -----------------------------------------
212 |   model_popn_pref_mat_icar = rep(NA, runs) # stores the preference estimate from MRP
213 |   model_popn_pref_sd_mat_icar = rep(NA, runs) # stores sd of above estimate
214 |   
215 |   model_popn_pref_mat_iid = rep(NA, runs) # stores the preference estimate from MRP
216 |   model_popn_pref_sd_mat_iid = rep(NA, runs) # stores sd of above estimate
217 |   
218 |   # medians of posterior linear predictors for all runs k
219 |   sample_cell_estimates_median_icar = matrix(NA, runs, dim(poststrat_final_joined)[1])
220 |   sample_cell_estimates_median_iid = matrix(NA, runs, dim(poststrat_final_joined)[1])
221 |   
222 |   # posterior width
223 |   sample_cell_estimates_width_icar = matrix(NA, runs, dim(poststrat_final_joined)[1])
224 |   sample_cell_estimates_width_iid = matrix(NA, runs, dim(poststrat_final_joined)[1])
225 |   
226 |   # the below lists stores puma posterior estimates. columns go by increasing order of puma code
227 |   quantile10_puma_icar = matrix(NA, runs, num_puma)
228 |   quantile90_puma_icar = matrix(NA, runs, num_puma)
229 |   quantile10_puma_iid = matrix(NA, runs, num_puma)
230 |   quantile90_puma_iid = matrix(NA, runs, num_puma)
231 |   
232 |   median_puma_icar = matrix(NA, runs, num_puma)
233 |   median_puma_iid = matrix(NA, runs, num_puma)
234 |   
235 |   # we will under/oversample certain PUMA
236 |   p_response_puma = rep(0, num_puma)
237 |   p_response_puma[which((unique(poverty_poststrat$PUMA.x) %in% puma_overundersample_index) == TRUE)] = p
238 |   p_response_puma[which((unique(poverty_poststrat$PUMA.x) %in% puma_overundersample_index) == FALSE)] = (1-p)
239 |   
240 |   p_response = rep(0, num_education * num_race * num_puma)
241 |   for (j in 1:(num_education * num_race * num_puma)) {
242 |     p_response[j] = p_response_puma[poststrat_final_join_numeric[j, 1]] *
243 |       p_response_race[poststrat_final_join_numeric[j, 2]] *
244 |       p_response_education[poststrat_final_join_numeric[j, 3]]
245 |   }
246 |   
247 |   poststrat_final_joined_p = cbind(poststrat_final_joined, p_response)
248 | 
249 |   
250 |  for (k in 1:runs) {
251 |    print(paste("On run", k, ", p =", p))
252 | 
253 |     sample_ = sample(num_education * num_race * num_puma,
254 |                      sample_size,
255 |                      replace = TRUE,
256 |                      prob = (poststrat_final_joined_p$p_response * poststrat_final_joined_p$N)/sum(poststrat_final_joined_p$p_response * poststrat_final_joined_p$N)
257 |     )
258 |     
259 |     y_sample_ = rbinom(sample_size, 1, poststrat_final_joined_p$true_pref[sample_])
260 |     
261 |     # get covariates for every row of sample
262 |     puma_sample = poststrat_final_joined_p[sample_,1]
263 |     race_sample = poststrat_final_joined_p[sample_,2]
264 |     education_sample = poststrat_final_joined_p[sample_,3]
265 |     
266 |     sample_final = data.frame(pref = y_sample_,
267 |                               puma = puma_sample,
268 |                               race = race_sample,
269 |                               education = education_sample)
270 |     
271 |     # with all the factors
272 |     sample_final$race_f = factor(sample_final$race, levels = levels_race)
273 |     sample_final$education_f = factor(sample_final$education, levels = levels_education)
274 |     
275 |     
276 |     us_pumas_df = as.data.frame(us_pumas)
277 |     sample_final_df = left_join(sample_final, us_pumas_df, by=c("puma"="PUMACE10"))
278 |     
279 |     sample_final_df_binomial = sample_final_df %>% 
280 |       group_by(puma, race_f,education_f) %>% 
281 |       summarise(n=sum(pref),N=n())
282 |     
283 |     sample_final_df_binomial_df = left_join(sample_final_df_binomial, us_pumas_df, by=c("puma"="PUMACE10")) %>%
284 |       as.data.frame()
285 |     
286 |     
287 |     # -------------------------------------------------------------------------------------------------------------------------------------------
288 |     # -------------------------------------------------------------------------------------------------------------------------------------------
289 |     # -------------------------------------------------------------------------------------------------------------------------------------------
290 |     # prior specification inla 
291 |     # https://becarioprecario.bitbucket.io/inla-gitbook/ch-priors.html
292 |     
293 |     icar_model = inla(icar_formula, # pc prior: https://becarioprecario.bitbucket.io/inla-gitbook/ch-priors.html#sec:pcpriors
294 |                       data = sample_final_df_binomial_df, family ="binomial",
295 |                       Ntrials = N,
296 |                       control.predictor = list(compute = TRUE),
297 |                       control.compute = list(dic = TRUE, waic = TRUE, config =TRUE),
298 |                       control.fixed = list(prec.intercept=1) # makes precision of intercept fixed at 1
299 |                       )
300 |     
301 |     # draw posterior samples for icar model 
302 |     icar_samples = inla.posterior.sample(n=4000, icar_model)
303 |     
304 |     # intercept -----------------------
305 |     icar_intercept_index_icar = grep("Intercept", rownames(icar_samples[[1]]$latent))
306 |     
307 |     icar_intercept_samples_f = function(x) { #x = samples[[i]]
308 |       return(x$latent[icar_intercept_index_icar])
309 |     }
310 |     
311 |     intercept_samples_icar = unlist(lapply(icar_samples, icar_intercept_samples_f))
312 |     # ---------------------------------
313 |     
314 |     # race ----------------------------
315 |     race_indices_icar = grep("race_f", rownames(icar_samples[[1]]$latent))
316 |     
317 |     icar_race_samples_f = function(x) {
318 |       return(x$latent[race_indices_icar])
319 |     }
320 |     
321 |     U_race_samples_icar = matrix(unlist(lapply(icar_samples, icar_race_samples_f)), 
322 |                                  ncol = num_race,
323 |                                  byrow = TRUE)
324 |     # ---------------------------------
325 |     
326 |     # education -----------------------
327 |     education_indices_icar = grep("education_f", rownames(icar_samples[[1]]$latent))
328 |     
329 |     icar_education_samples_f = function(x) {
330 |       return(x$latent[education_indices_icar])
331 |     }
332 |     
333 |     U_education_samples_icar = matrix(unlist(lapply(icar_samples, icar_education_samples_f)),
334 |                                       ncol = num_education,
335 |                                       byrow = TRUE)
336 |     # ---------------------------------
337 |     
338 |     # puma area -----------------------
339 |     puma_indices_icar = grep("ID:", rownames(icar_samples[[1]]$latent))[1:num_puma] # https://inla.r-inla-download.org/r-inla.org/doc/latent/bym2.pdf
340 |     
341 |     icar_puma_samples_f = function(x) {
342 |       return(x$latent[puma_indices_icar])
343 |     }
344 |     
345 |     U_puma_samples_icar = matrix(unlist(lapply(icar_samples, icar_puma_samples_f)),
346 |                                  ncol = num_puma,
347 |                                  byrow = TRUE)
348 |     # ---------------------------------
349 |     # get PUMA <-> ID match, where ID is used in f(ID,...)
350 |     pumacode_id_equivalency = us_pumas_df[,c("PUMACE10","ID")]
351 |     
352 |     # the poststrat matrix 
353 |     poststrat_final_joined_p_ID = left_join(poststrat_final_joined_p,
354 |                                             pumacode_id_equivalency,
355 |                                             by = c("puma_area"="PUMACE10"))
356 |     
357 |     # this matrix stores the posterior linear predictors for each poststrat cell in poststrat_final_joined_p
358 |     postpred_sim_icar = matrix(0,
359 |                                length(intercept_samples_icar), 
360 |                                dim(poststrat_final_joined_p)[1])
361 |     
362 |     for (i in 1:(num_education * num_puma * num_race)) {
363 |       postpred_sim_icar[,i] = arm::invlogit(intercept_samples_icar +
364 |                                               U_puma_samples_icar[, poststrat_final_joined_p_ID[i,c("ID")] ] +
365 |                                               U_race_samples_icar[, which(poststrat_final_joined_p_ID[i,c("race")]==levels_race) ] +
366 |                                               U_education_samples_icar[, which(poststrat_final_joined_p_ID[i,c("education")]==levels_education) ]
367 |       )
368 |     }
369 |     
370 |     puma_mrp_samples_icar = matrix(0, 4000, num_puma) # this list stores posterior MRP samples for every puma
371 |     
372 |     g_counter = 1
373 |     for (g in levels_puma) {
374 |       #print(g)
375 |       for (i in which(poststrat_final_joined_p_ID$puma_area==g)) {
376 |         puma_mrp_samples_icar[,g_counter] = puma_mrp_samples_icar[,g_counter] + 
377 |           (postpred_sim_icar[,i] * poststrat_final_joined_p_ID$N[i]/N_sub[g_counter])
378 |       }
379 |       g_counter = g_counter + 1
380 |     }
381 |     
382 |     # -------------------------------------------------------------------------------------------------------------------------------------------
383 |     # -------------------------------------------------------------------------------------------------------------------------------------------
384 |     # -------------------------------------------------------------------------------------------------------------------------------------------
385 |     sample_final_df_binomial_df_iid = sample_final_df_binomial_df
386 |     sample_final_df_binomial_df_iid$ID_f = factor(sample_final_df_binomial_df_iid$ID, levels = 1:num_puma)
387 |     
388 |     iid_model = inla(iid_formula, # pc prior: https://becarioprecario.bitbucket.io/inla-gitbook/ch-priors.html#sec:pcpriors
389 |                      data = sample_final_df_binomial_df_iid, family ="binomial",
390 |                      Ntrials = N,
391 |                      control.predictor = list(compute = TRUE),
392 |                      control.compute = list(dic = TRUE, waic = TRUE, config =TRUE),
393 |                      control.fixed = list(prec.intercept=1) # makes precision of intercept fixed at 1
394 |                      )
395 |     
396 |     
397 |     # draw posterior samples for iid model 
398 |     iid_samples = inla.posterior.sample(n=4000, iid_model)
399 |     
400 |     # intercept -----------------------
401 |     iid_intercept_index_iid = grep("Intercept", rownames(iid_samples[[1]]$latent))
402 |     
403 |     iid_intercept_samples_f = function(x) { #x = samples[[i]]
404 |       return(x$latent[iid_intercept_index_iid])
405 |     }
406 |     
407 |     intercept_samples_iid = unlist(lapply(iid_samples, iid_intercept_samples_f))
408 |     # ---------------------------------
409 |     
410 |     # race ----------------------------
411 |     race_indices_iid = grep("race_f", rownames(iid_samples[[1]]$latent))
412 |     
413 |     iid_race_samples_f = function(x) {
414 |       return(x$latent[race_indices_iid])
415 |     }
416 |     
417 |     U_race_samples_iid = matrix(unlist(lapply(iid_samples, iid_race_samples_f)), 
418 |                                 ncol = num_race,
419 |                                 byrow = TRUE)
420 |     # ---------------------------------
421 |     
422 |     # education -----------------------
423 |     education_indices_iid = grep("education_f", rownames(iid_samples[[1]]$latent))
424 |     
425 |     iid_education_samples_f = function(x) {
426 |       return(x$latent[education_indices_iid])
427 |     }
428 |     
429 |     U_education_samples_iid = matrix(unlist(lapply(iid_samples, iid_education_samples_f)),
430 |                                      ncol = num_education,
431 |                                      byrow = TRUE)
432 |     # ---------------------------------
433 |     
434 |     # puma area -----------------------
435 |     puma_indices_iid = grep("ID_f:", rownames(iid_samples[[1]]$latent))
436 |     
437 |     iid_puma_samples_f = function(x) {
438 |       return(x$latent[puma_indices_iid])
439 |     }
440 |     
441 |     U_puma_samples_iid = matrix(unlist(lapply(iid_samples, iid_puma_samples_f)),
442 |                                 ncol = num_puma,
443 |                                 byrow = TRUE)
444 |     # ---------------------------------
445 |     
446 |     # this matrix stores the posterior linear predictors for each poststrat cell in poststrat_final_joined_p
447 |     postpred_sim_iid = matrix(0,
448 |                               length(intercept_samples_iid), 
449 |                               dim(poststrat_final_joined_p)[1])
450 |     
451 |     for (i in 1:(num_education * num_puma * num_race)) {
452 |       postpred_sim_iid[,i] = arm::invlogit(intercept_samples_iid +
453 |                                              U_puma_samples_iid[, poststrat_final_joined_p_ID[i,c("ID")] ] +
454 |                                              U_race_samples_iid[, which(poststrat_final_joined_p_ID[i,c("race")]==levels_race) ] +
455 |                                              U_education_samples_iid[, which(poststrat_final_joined_p_ID[i,c("education")]==levels_education) ]
456 |       )
457 |     }
458 |     
459 |     puma_mrp_samples_iid = matrix(0, 4000, num_puma) # this list stores posterior MRP samples for every puma
460 |     
461 |     g_counter = 1
462 |     for (g in levels_puma) {
463 |       #print(g)
464 |       for (i in which(poststrat_final_joined_p_ID$puma_area==g)) {
465 |         puma_mrp_samples_iid[,g_counter] = puma_mrp_samples_iid[,g_counter] + 
466 |           (postpred_sim_iid[,i] * poststrat_final_joined_p_ID$N[i]/N_sub[g_counter])
467 |       }
468 |       g_counter = g_counter + 1
469 |     }
470 |     # -------------------------------------------------------------------------------------------------------------------------------------------
471 |     # -------------------------------------------------------------------------------------------------------------------------------------------
472 |     # -------------------------------------------------------------------------------------------------------------------------------------------
473 |     true_popn_pref = poststrat_final_joined_p$true_pref %*% poststrat_final_joined_p$N / sum(poststrat_final_joined_p$N)
474 |     
475 |     model_popn_pref_mat_icar[k] = mean(postpred_sim_icar %*% poststrat_final_joined_p$N / sum(poststrat_final_joined_p$N))
476 |     model_popn_pref_sd_mat_icar[k] = sd(postpred_sim_icar %*% poststrat_final_joined_p$N / sum(poststrat_final_joined_p$N))
477 |     
478 |     model_popn_pref_mat_iid[k] = mean(postpred_sim_iid %*% poststrat_final_joined_p$N / sum(poststrat_final_joined_p$N))
479 |     model_popn_pref_sd_mat_iid[k] = sd(postpred_sim_iid %*% poststrat_final_joined_p$N / sum(poststrat_final_joined_p$N))
480 |     
481 |     # medians of posterior linear predictors for all runs k
482 |     sample_cell_estimates_median_icar[k,] = matrixStats::colMedians(postpred_sim_icar)
483 |     sample_cell_estimates_median_iid[k,] = matrixStats::colMedians(postpred_sim_iid)
484 |     
485 |     # posterior width
486 |     sample_cell_estimates_width_icar[k,] = apply(X = postpred_sim_icar,
487 |                                                  MARGIN = 2,
488 |                                                  FUN = quantile, 
489 |                                                  probs = upperquantile) -
490 |       apply(X = postpred_sim_icar, 
491 |             MARGIN = 2,
492 |             FUN = quantile,
493 |             probs = lowerquantile)
494 |     
495 |       
496 |     sample_cell_estimates_width_iid[k,] = apply(X = postpred_sim_iid,
497 |                                                 MARGIN = 2,
498 |                                                 FUN = quantile, 
499 |                                                 probs = upperquantile) -
500 |       apply(X = postpred_sim_iid, 
501 |             MARGIN = 2,
502 |             FUN = quantile,
503 |             probs = lowerquantile)
504 |     
505 |     # # the below lists stores puma posterior estimates. columns go by increasing order of puma code
506 | 
507 |     median_puma_icar[k,] = matrixStats::colMedians(puma_mrp_samples_icar)
508 |     median_puma_iid[k,] = matrixStats::colMedians(puma_mrp_samples_iid)
509 |     
510 |     quantile10_puma_icar[k,] = apply(X = puma_mrp_samples_icar,
511 |                                      MARGIN = 2,
512 |                                      FUN = quantile,
513 |                                      probs = lowerquantile)
514 |     
515 |     quantile90_puma_icar[k,] = apply(X = puma_mrp_samples_icar,
516 |                                      MARGIN = 2,
517 |                                      FUN = quantile,
518 |                                      probs = upperquantile)
519 |     
520 |     quantile10_puma_iid[k,] = apply(X = puma_mrp_samples_iid,
521 |                                     MARGIN = 2,
522 |                                     FUN = quantile,
523 |                                     probs = lowerquantile)
524 |     
525 |     quantile90_puma_iid[k,] = apply(X = puma_mrp_samples_iid,
526 |                                     MARGIN = 2,
527 |                                     FUN = quantile,
528 |                                     probs = upperquantile)
529 |     
530 |     
531 |     if ((k %% 10)==0) { 
532 |       print(k)
533 |       save.image(paste0(runs,
534 |                         "_",
535 |                         sample_size,
536 |                         "_spatial_",
537 |                         p * 10,
538 |                         ".RData"))
539 |     }
540 | 
541 |   }
542 | }
543 | 
544 | 
545 | 
546 | 
547 | 
548 | 
549 | 
550 | 
551 | 


--------------------------------------------------------------------------------
/simulation_spatialmrp/icar_mrp_viz.R:
--------------------------------------------------------------------------------
  1 | # this script visualizes results from icar_mrp_simulation_bym2.R
  2 | 
  3 | rm(list=ls())
  4 | gc()
  5 | 
  6 | library(ggplot2)
  7 | library(dplyr)
  8 | library(sp)
  9 | 
 10 | sample_size = 500
 11 | runs = 200
 12 | r = 1:9/10
 13 | 
 14 | poverty_poststrat = readRDS("poverty_poststrat.rds")
 15 | us_pumas = readRDS("us_pumas.rds")
 16 | ma_adj_matrix = readRDS("ma_adj_matrix.rds")
 17 | ma_adj_matrix_sparse = readRDS("ma_adj_matrix_sparse.rds")
 18 | 
 19 | num_education = length(unique(poverty_poststrat$education)) # number of strata for education
 20 | num_race = length(unique(poverty_poststrat$race_x)) # number of strata for race
 21 | num_puma = length(unique(poverty_poststrat$PUMA.x))
 22 | 
 23 | 
 24 | # ----
 25 | avg_bias_mat_puma_icar = matrix(0, length(r), num_puma)
 26 | avg_bias_mat_puma_iid = matrix(0, length(r), num_puma)
 27 | 
 28 | # bias of all strata
 29 | avg_bias_mat_icar = matrix(0, length(r), num_education * num_race * num_puma)
 30 | avg_bias_mat_iid = matrix(0, length(r), num_education * num_race * num_puma)
 31 | 
 32 | # avg posterior width for 1872 poststratification cells
 33 | avg_posterior_cell_width_icar = matrix(0, length(r), num_education * num_race * num_puma)
 34 | avg_posterior_cell_width_iid = matrix(0, length(r), num_education * num_race * num_puma)
 35 | 
 36 | # avg posterior width for all 52 puma
 37 | avg_posterior_puma_width_icar = matrix(0, length(r), 52)
 38 | avg_posterior_puma_width_iid = matrix(0, length(r), 52)
 39 | # ----
 40 | 
 41 | # df for majoityvote for icar and iid prior
 42 | majorityvote_df = c()
 43 | 
 44 | 
 45 | counter_ = 1
 46 | prob_sampling_old = rep(0, length(r))
 47 | 
 48 | for (p in r) {
 49 |   load(paste0(runs,
 50 |               "_",
 51 |               sample_size,
 52 |               "_spatial_",
 53 |               p * 10,
 54 |               ".RData"))
 55 |   
 56 |   # get the interpretable p which is the probability of sampling something in puma_overundersample_index
 57 |   poststrat_final_joined_p_interpretable = cbind(poststrat_final_joined_p,
 58 |                                                  (poststrat_final_joined_p$p_response * poststrat_final_joined_p$N)/sum(poststrat_final_joined_p$p_response * poststrat_final_joined_p$N)
 59 |                                                    )
 60 |   # 
 61 |   colnames(poststrat_final_joined_p_interpretable)[length(colnames(poststrat_final_joined_p_interpretable))] = "prob_sampling_g1" 
 62 |   prob_sampling_old[counter_] = round(sum(poststrat_final_joined_p_interpretable[poststrat_final_joined_p_interpretable$puma_area %in% puma_overundersample_index,
 63 |                                                                            c("prob_sampling_g1")]),
 64 |                                       2)# very important
 65 |   
 66 |   avg_bias_mat_icar[counter_,] = colMeans(sweep(sample_cell_estimates_median_icar,
 67 |                                                2,
 68 |                                                poststrat_final_joined_p_ID$true_pref,
 69 |                                                "-"))
 70 |   
 71 |   avg_bias_mat_iid[counter_,] = colMeans(sweep(sample_cell_estimates_median_iid,
 72 |                                                2,
 73 |                                                poststrat_final_joined_p_ID$true_pref,
 74 |                                                "-"))
 75 |   
 76 | 
 77 |   avg_posterior_cell_width_icar[counter_,] = colMeans(sample_cell_estimates_width_icar)
 78 |   avg_posterior_cell_width_iid[counter_,] = colMeans(sample_cell_estimates_width_iid)
 79 |   
 80 |   avg_posterior_puma_width_icar[counter_,] = colMeans(quantile90_puma_icar - quantile10_puma_icar)
 81 |   avg_posterior_puma_width_iid[counter_,] = colMeans(quantile90_puma_iid - quantile10_puma_iid)
 82 |   
 83 |   
 84 |   avg_bias_mat_puma_icar[counter_,] = colMeans(sweep(median_puma_icar, 2, poststrat_puma_pref$puma_ps, "-"))
 85 |   avg_bias_mat_puma_iid[counter_,] = colMeans(sweep(median_puma_iid, 2, poststrat_puma_pref$puma_ps, "-"))
 86 |   
 87 |   # majorityvote -----------------------
 88 |   
 89 |   majorityvote_df = rbind(majorityvote_df, 
 90 |                           data.frame(majorityvote = model_popn_pref_mat_icar, 
 91 |                                      type = "ICAR",
 92 |                                      prob_sampling_old[counter_]),
 93 |                           data.frame(majorityvote = model_popn_pref_mat_iid, 
 94 |                                      type = "IID",
 95 |                                      prob_sampling_old[counter_])
 96 |   )
 97 |   
 98 |   counter_ = counter_ + 1
 99 | }
100 | 
101 | saveRDS(prob_sampling_old, "prob_sampling_old.rds")
102 | # this being negative is good
103 | summary(colMeans(abs(avg_bias_mat_icar) - 
104 |                    abs(avg_bias_mat_iid))) * 100
105 | 
106 | # this being negative is good
107 | summary(colMeans(abs(sweep(median_puma_icar,2,poststrat_puma_pref$puma_ps,"-")) - 
108 |           abs(sweep(median_puma_iid,2,poststrat_puma_pref$puma_ps,"-")))) * 100
109 | 
110 | # this being negative is good
111 | summary(colMeans(avg_posterior_cell_width_icar - avg_posterior_cell_width_iid))
112 | 
113 | # ---
114 | poststrat_final_V = cbind(paste0("V", rownames(poststrat_final_joined)),
115 |                           poststrat_final_joined,
116 |                           stringsAsFactors=FALSE)
117 | colnames(poststrat_final_V)[1] = "cell"
118 | # ---
119 | 
120 | 
121 | avg_bias_mat_icar = as.data.frame(avg_bias_mat_icar)
122 | avg_bias_mat_iid = as.data.frame(avg_bias_mat_iid)
123 | 
124 | df_bias.melted_icar = cbind(reshape2::melt(avg_bias_mat_icar),
125 |                             rep(r, dim(poststrat_final_joined_p_ID)[1]))
126 | colnames(df_bias.melted_icar) = c("cell", "value", "p")
127 | df_bias.melted_icar$cell = as.character(df_bias.melted_icar$cell)
128 | 
129 | df_bias.melted_V_icar = dplyr::inner_join(x = df_bias.melted_icar,
130 |                                           y = poststrat_final_V,
131 |                                           by ="cell")
132 | 
133 | 
134 | 
135 | df_bias.melted_iid = cbind(reshape2::melt(avg_bias_mat_iid),
136 |                             rep(r, dim(poststrat_final_joined_p_ID)[1]))
137 | colnames(df_bias.melted_iid) = c("cell", "value", "p")
138 | df_bias.melted_iid$cell = as.character(df_bias.melted_iid$cell)
139 | 
140 | df_bias.melted_V_iid = dplyr::inner_join(x = df_bias.melted_iid,
141 |                                           y = poststrat_final_V,
142 |                                           by ="cell")
143 | 
144 | bias_facet_df = rbind(cbind(df_bias.melted_V_icar, Model = "ICAR"),
145 |                       cbind(df_bias.melted_V_iid, Model = "IID"))
146 | 
147 | bias_facet_df$sampling_group = ifelse(bias_facet_df$puma_area %in% puma_overundersample_index, "Group 1", "Group 2")
148 | 
149 | png(filename = paste0("spatialbiasfacet_",
150 |                       runs,
151 |                       "_",
152 |                       sample_size,
153 |                       "_.png"),
154 |     width = 4800, height = 2400)
155 | 
156 | ggplot(bias_facet_df,
157 |        aes(x=p, y=value, group = cell)) + 
158 |   geom_line(aes(col=as.factor(sampling_group)),show.legend = TRUE, size = 0.5) +
159 |   facet_wrap(. ~ Model,ncol=2,nrow=1, labeller = as_labeller(
160 |     function(value) {
161 |       return(value)  # Lets you change the facet labels
162 |     })
163 |   ) +
164 |   xlab("\n Probability of response for Group 1 \n") +
165 |   ylab("\n Average Bias of Posterior Medians \n") + 
166 |   theme_bw() +
167 |   theme(plot.title = element_text(size = 50, face = "bold"),
168 |         axis.text=element_text(size=35*1.6),
169 |         axis.title=element_text(size=50*1.6, face="bold",margin=200),
170 |         legend.text = element_text(size=50*1.6),
171 |         legend.position  = "bottom",
172 |         legend.key.size = unit(30*1.6,"line"),
173 |         legend.key.height = unit(7*1.6,"line"),
174 |         legend.key = element_rect(fill = "transparent",color = "transparent"),
175 |         legend.title = element_text(size=50*1.6, face="bold"),
176 |         strip.text.x = element_text(size=50*1.6, face="bold", margin = margin(t=25,b=25) ),
177 |         strip.background = element_rect(fill="transparent",color="transparent")
178 |   ) + 
179 |   guides(col=guide_legend(title="Sampling Group",
180 |                           override.aes = list(size = 10*1.6)))
181 | 
182 | dev.off()
183 | 
184 | saveRDS(bias_facet_df, paste0("biasfacetspatial_",
185 |                               runs,
186 |                               "_",
187 |                               sample_size,".rds"))
188 | 
189 | # Average bias of posterior medians for 52 PUMAs --------------------------------------------
190 | avg_bias_mat_puma_icar = as.data.frame(avg_bias_mat_puma_icar)
191 | avg_bias_mat_puma_iid = as.data.frame(avg_bias_mat_puma_iid)
192 | 
193 | 
194 | colnames(avg_bias_mat_puma_icar) = levels_puma
195 | 
196 | df_bias_puma.melted_icar = cbind(reshape2::melt(avg_bias_mat_puma_icar),
197 |                             rep(r, num_puma))
198 | colnames(df_bias_puma.melted_icar) = c("puma_area", "value", "p")
199 | df_bias_puma.melted_icar$puma_area = as.numeric(as.character(df_bias_puma.melted_icar$puma_area))
200 | 
201 | df_bias_puma.melted_icar$sampling_group = ifelse(df_bias_puma.melted_icar$puma_area %in% puma_overundersample_index, 
202 |                                                  "Near Boston",
203 |                                                  "Away from Boston")
204 | 
205 | 
206 | colnames(avg_bias_mat_puma_iid) = levels_puma
207 | 
208 | df_bias_puma.melted_iid = cbind(reshape2::melt(avg_bias_mat_puma_iid),
209 |                                  rep(r, num_puma))
210 | colnames(df_bias_puma.melted_iid) = c("puma_area", "value", "p")
211 | df_bias_puma.melted_iid$puma_area = as.numeric(as.character(df_bias_puma.melted_iid$puma_area))
212 | 
213 | df_bias_puma.melted_iid$sampling_group = ifelse(df_bias_puma.melted_iid$puma_area %in% puma_overundersample_index, 
214 |                                                  "Near Boston",
215 |                                                  "Away from Boston")
216 | 
217 | 
218 | bias_facet_puma_df = rbind(cbind(df_bias_puma.melted_icar, Model = "ICAR"),
219 |                       cbind(df_bias_puma.melted_iid, Model = "IID"))
220 | 
221 | bias_facet_puma_df$sampling_group = factor(bias_facet_puma_df$sampling_group,
222 |                                            levels = c("Near Boston", "Away from Boston"))
223 | 
224 | png(filename = paste0("spatialbiasfacet_puma_",
225 |                       runs,
226 |                       "_",
227 |                       sample_size,
228 |                       "_.png"),
229 |     width = 4800, height = 2400)
230 | 
231 | ggplot(bias_facet_puma_df,
232 |        aes(x=p, y=value, group = puma_area)) + 
233 |   geom_line(aes(col=as.factor(sampling_group)),show.legend = TRUE, size = 1.5) +
234 |   facet_wrap(. ~ Model,ncol=2,nrow=1, labeller = as_labeller(
235 |     function(value) {
236 |       return(value)  # Lets you change the facet labels
237 |     })
238 |   ) +
239 |   xlab("\n Probability of response for Group 1 \n") +
240 |   ylab("\n Average Bias of Posterior Medians for PUMA \n") + 
241 |   theme_bw() +
242 |   theme(plot.title = element_text(size = 50, face = "bold"),
243 |         axis.text=element_text(size=35*1.6),
244 |         axis.title=element_text(size=50*1.6, face="bold",margin=200),
245 |         legend.text = element_text(size=50*1.6),
246 |         legend.position  = "bottom",
247 |         legend.key.size = unit(30*1.6,"line"),
248 |         legend.key.height = unit(7*1.6,"line"),
249 |         legend.key = element_rect(fill = "transparent",color = "transparent"),
250 |         legend.title = element_text(size=50*1.6, face="bold"),
251 |         strip.text.x = element_text(size=50*1.6, face="bold", margin = margin(t=25,b=25) ),
252 |         strip.background = element_rect(fill="transparent",color="transparent")) + 
253 |   guides(col=guide_legend(title="Sampling Group",
254 |                           override.aes = list(size = 10*1.6)))
255 | 
256 | 
257 | dev.off()
258 | 
259 | saveRDS(bias_facet_puma_df, paste0("biasfacetspatialpuma_",
260 |                               runs,
261 |                               "_",
262 |                               sample_size,".rds"))
263 | 
264 | # Average posterior sd width for 1872 poststrat cell posteriors ----------------------------
265 | 
266 | avg_posterior_cell_width_icar = as.data.frame(avg_posterior_cell_width_icar)
267 | avg_posterior_cell_width_iid = as.data.frame(avg_posterior_cell_width_iid)
268 | 
269 | 
270 | df_avg_posterior_cell_width_icar = cbind(reshape2::melt(avg_posterior_cell_width_icar),
271 |                                          rep(r, dim(poststrat_final_joined_p_ID)[1]))
272 | 
273 | colnames(df_avg_posterior_cell_width_icar) = c("cell", "average_posterior_width", "p")
274 | df_avg_posterior_cell_width_icar$cell = as.character(df_avg_posterior_cell_width_icar$cell)
275 | 
276 | df_avg_posterior_cell_width_icar_V = dplyr::inner_join(x = df_avg_posterior_cell_width_icar,
277 |                                                        y = poststrat_final_V,
278 |                                                        by = "cell")
279 | 
280 | 
281 | df_avg_posterior_cell_width_iid = cbind(reshape2::melt(avg_posterior_cell_width_iid),
282 |                                          rep(r, dim(poststrat_final_joined_p_ID)[1]))
283 | 
284 | colnames(df_avg_posterior_cell_width_iid) = c("cell", "average_posterior_width", "p")
285 | df_avg_posterior_cell_width_iid$cell = as.character(df_avg_posterior_cell_width_iid$cell)
286 | 
287 | df_avg_posterior_cell_width_iid_V = dplyr::inner_join(x = df_avg_posterior_cell_width_iid,
288 |                                                        y = poststrat_final_V,
289 |                                                        by = "cell")
290 |        
291 | posterior_cell_width_facet_df = rbind(cbind(df_avg_posterior_cell_width_iid_V, Model = "ICAR"),
292 |                                       cbind(df_avg_posterior_cell_width_iid_V, Model = "IID"))
293 | 
294 | posterior_cell_width_facet_df$sampling_group = ifelse(posterior_cell_width_facet_df$puma_area %in% puma_overundersample_index, "Group 1", "Group 2")
295 | 
296 | posterior_cell_width_facet_df$sampling_group[posterior_cell_width_facet_df$sampling_group=="Group 1"] = "Near Boston"
297 | posterior_cell_width_facet_df$sampling_group[posterior_cell_width_facet_df$sampling_group=="Group 2"] = "Away from Boston"
298 | posterior_cell_width_facet_df$sampling_group = factor(posterior_cell_width_facet_df$sampling_group,
299 |                                                       levels = c("Near Boston", "Away from Boston"))
300 | 
301 | png(filename = paste0("spatialbiasfacet_width_",
302 |                       runs,
303 |                       "_",
304 |                       sample_size,
305 |                       "_.png"),
306 |     width = 4800, height = 2400)
307 | 
308 | ggplot(posterior_cell_width_facet_df,
309 |        aes(x=p, y=average_posterior_width, group = cell)) + 
310 |   geom_line(aes(col=as.factor(sampling_group)),show.legend = TRUE, size = 0.5) +
311 |   facet_wrap(. ~ Model,ncol=2,nrow=1, labeller = as_labeller(
312 |     function(value) {
313 |       return(value)  # Lets you change the facet labels
314 |     })
315 |   ) +
316 |   xlab("\n Probability of sampling for cluster of PUMA near Boston \n") +
317 |   ylab(paste0("\n 90th - 10th quantile of ", runs, " Posteriors \n")) + 
318 |   theme_bw() +
319 |   theme(plot.title = element_text(size = 50, face = "bold"),
320 |         axis.text=element_text(size=35*1.6),
321 |         axis.title=element_text(size=50*1.6, face="bold",margin=200),
322 |         legend.text = element_text(size=50*1.6),
323 |         legend.position  = "bottom",
324 |         legend.key.size = unit(30*1.6,"line"),
325 |         legend.key.height = unit(7*1.6,"line"),
326 |         legend.key = element_rect(fill = "transparent",color = "transparent"),
327 |         legend.title = element_text(size=50*1.6, face="bold"),
328 |         strip.text.x = element_text(size=50*1.6, face="bold", margin = margin(t=25,b=25) ),
329 |         strip.background = element_rect(fill="transparent",color="transparent")) + 
330 |   guides(col=guide_legend(title="Sampling Group",
331 |                           override.aes = list(size = 10*1.6)))
332 | 
333 | dev.off()
334 | 
335 | saveRDS(posterior_cell_width_facet_df, paste0("spatialbiasfacet_width_",
336 |                                               runs,
337 |                                               "_",
338 |                                               sample_size,
339 |                                               ".rds"))
340 | 
341 | 
342 | # Average posterior sd of 52 PUMA -------------------------------------------------------------
343 | 
344 | avg_posterior_puma_width_icar = as.data.frame(avg_posterior_puma_width_icar)
345 | avg_posterior_puma_width_iid = as.data.frame(avg_posterior_puma_width_iid)
346 | 
347 | colnames(avg_posterior_puma_width_icar) = levels_puma
348 | colnames(avg_posterior_puma_width_iid) = levels_puma
349 | 
350 | df_avg_posterior_puma_width_icar = cbind(reshape2::melt(avg_posterior_puma_width_icar),
351 |                                          rep(r, 52))
352 | 
353 | colnames(df_avg_posterior_puma_width_icar) = c("puma", "average_posterior_width", "p")
354 | df_avg_posterior_puma_width_icar$puma = as.character(df_avg_posterior_puma_width_icar$puma)
355 | 
356 | 
357 | df_avg_posterior_puma_width_iid = cbind(reshape2::melt(avg_posterior_puma_width_iid),
358 |                                         rep(r, 52))
359 | 
360 | colnames(df_avg_posterior_puma_width_iid) = c("puma", "average_posterior_width", "p")
361 | df_avg_posterior_puma_width_iid$puma = as.character(df_avg_posterior_puma_width_iid$puma)
362 | 
363 | posterior_puma_width_facet_df = rbind(cbind(df_avg_posterior_puma_width_icar, Model = "ICAR"),
364 |                                       cbind(df_avg_posterior_puma_width_iid, Model = "IID"))
365 | 
366 | posterior_puma_width_facet_df$sampling_group = ifelse(posterior_puma_width_facet_df$puma %in% puma_overundersample_index, "Group 1", "Group 2")
367 | 
368 | posterior_puma_width_facet_df$sampling_group[posterior_puma_width_facet_df$sampling_group=="Group 1"] = "Near Boston"
369 | posterior_puma_width_facet_df$sampling_group[posterior_puma_width_facet_df$sampling_group=="Group 2"] = "Away from Boston"
370 | posterior_puma_width_facet_df$sampling_group = factor(posterior_puma_width_facet_df$sampling_group,
371 |                                                       levels = c("Near Boston", "Away from Boston"))
372 | 
373 | png(filename = paste0("spatialbiasfacet_puma_width_",
374 |                       runs,
375 |                       "_",
376 |                       sample_size,
377 |                       "_.png"),
378 |     width = 4800, height = 2400)
379 | 
380 | ggplot(posterior_puma_width_facet_df,
381 |        aes(x=p, y=average_posterior_width, group = puma)) + 
382 |   geom_line(aes(col=as.factor(sampling_group)),show.legend = TRUE, size = 1.5) +
383 |   facet_wrap(. ~ Model,ncol=2,nrow=1, labeller = as_labeller(
384 |     function(value) {
385 |       return(value)  # Lets you change the facet labels
386 |     })
387 |   ) +
388 |   xlab("\n Probability of sampling for cluster of PUMA near Boston \n") +
389 |   ylab(paste0("\n 90th - 10th quantile of ", runs, " Posteriors \n")) + 
390 |   theme_bw() +
391 |   theme(plot.title = element_text(size = 50, face = "bold"),
392 |         axis.text=element_text(size=35*1.6),
393 |         axis.title=element_text(size=50*1.6, face="bold",margin=200),
394 |         legend.text = element_text(size=50*1.6),
395 |         legend.position  = "bottom",
396 |         legend.key.size = unit(30*1.6,"line"),
397 |         legend.key.height = unit(7*1.6,"line"),
398 |         legend.key = element_rect(fill = "transparent",color = "transparent"),
399 |         legend.title = element_text(size=50*1.6, face="bold"),
400 |         strip.text.x = element_text(size=50*1.6, face="bold", margin = margin(t=25,b=25) ),
401 |         strip.background = element_rect(fill="transparent",color="transparent")) + 
402 |   guides(col=guide_legend(title="Sampling Group",
403 |                           override.aes = list(size = 10*1.6)))
404 | 
405 | dev.off()
406 | 
407 | saveRDS(posterior_puma_width_facet_df, paste0("spatialbiasfacet_puma_width_",
408 |                                               runs,
409 |                                               "_",
410 |                                               sample_size,
411 |                                               ".rds"))
412 | 
413 | 
414 | # heatmap plot of average posterior medians for 52 pumas, across 9 probability indices --------
415 | 
416 | us_pumas$ICAR_one = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
417 |                             y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_icar[1,])),
418 |                             by = c("PUMACE10" = "puma_code"))[,2]
419 | 
420 | us_pumas$IID_one = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
421 |                                y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_iid[1,])),
422 |                                by = c("PUMACE10" = "puma_code"))[,2]
423 | 
424 | # ---
425 | us_pumas$ICAR_two = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
426 |                                y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_icar[2,])),
427 |                                by = c("PUMACE10" = "puma_code"))[,2]
428 | 
429 | us_pumas$IID_two = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
430 |                               y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_iid[2,])),
431 |                               by = c("PUMACE10" = "puma_code"))[,2]
432 | 
433 | # ---
434 | us_pumas$ICAR_three = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
435 |                                y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_icar[3,])),
436 |                                by = c("PUMACE10" = "puma_code"))[,2]
437 | 
438 | us_pumas$IID_three = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
439 |                               y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_iid[3,])),
440 |                               by = c("PUMACE10" = "puma_code"))[,2]
441 | 
442 | # ---
443 | us_pumas$ICAR_four = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
444 |                                  y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_icar[4,])),
445 |                                  by = c("PUMACE10" = "puma_code"))[,2]
446 | 
447 | us_pumas$IID_four = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
448 |                                 y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_iid[4,])),
449 |                                 by = c("PUMACE10" = "puma_code"))[,2]
450 | # ---
451 | us_pumas$ICAR_five = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
452 |                             y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_icar[5,])),
453 |                             by = c("PUMACE10" = "puma_code"))[,2]
454 | 
455 | us_pumas$IID_five = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
456 |                            y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_iid[5,])),
457 |                            by = c("PUMACE10" = "puma_code"))[,2]
458 | 
459 | # ---
460 | us_pumas$ICAR_six = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
461 |                                  y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_icar[6,])),
462 |                                  by = c("PUMACE10" = "puma_code"))[,2]
463 | 
464 | us_pumas$IID_six = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
465 |                                 y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_iid[6,])),
466 |                                 by = c("PUMACE10" = "puma_code"))[,2]
467 | 
468 | # ---
469 | us_pumas$ICAR_seven = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
470 |                                y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_icar[7,])),
471 |                                by = c("PUMACE10" = "puma_code"))[,2]
472 | 
473 | us_pumas$IID_seven = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
474 |                               y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_iid[7,])),
475 |                               by = c("PUMACE10" = "puma_code"))[,2]
476 | 
477 | # ---
478 | us_pumas$ICAR_eight = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
479 |                                  y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_icar[8,])),
480 |                                  by = c("PUMACE10" = "puma_code"))[,2]
481 | 
482 | us_pumas$IID_eight = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
483 |                                 y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_iid[8,])),
484 |                                 by = c("PUMACE10" = "puma_code"))[,2]
485 | 
486 | # ---
487 | us_pumas$ICAR_nine = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
488 |                              y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_icar[9,])),
489 |                              by = c("PUMACE10" = "puma_code"))[,2]
490 | 
491 | us_pumas$IID_nine = inner_join(x = data.frame(PUMACE10 = us_pumas$PUMACE10),
492 |                                 y = data.frame(puma_code = levels_puma, postmedian_bias = as.numeric(avg_bias_mat_puma_iid[9,])),
493 |                                 by = c("PUMACE10" = "puma_code"))[,2]
494 | 
495 | # https://edzer.github.io/sp/
496 | png(filename = paste0("avg_bias_postmedian_puma",
497 |                       runs,
498 |                       "_",
499 |                       sample_size,
500 |                       "_.png"),
501 |     width = 800, height = 2400)
502 | 
503 | spplot(us_pumas[,c("ICAR_one","IID_one",
504 |                    "ICAR_two","IID_two",
505 |                    "ICAR_three","IID_three",
506 |                    "ICAR_four","IID_four",
507 |                    "ICAR_five", "IID_five",
508 |                    "ICAR_six","IID_six",
509 |                    "ICAR_seven","IID_seven",
510 |                    "ICAR_eight","IID_eight",
511 |                    "ICAR_nine", "IID_nine")],
512 |        
513 |        names.attr = c("ICAR 0.05","IID 0.05",
514 |                       "ICAR 0.10","IID 0.10",
515 |                       "ICAR 0.17","IID 0.17",
516 |                       "ICAR 0.24","IID 0.24",
517 |                       "ICAR 0.32", "IID 0.32",
518 |                       "ICAR 0.41","IID 0.41",
519 |                       "ICAR 0.52","IID 0.52",
520 |                       "ICAR 0.65","IID 0.65",
521 |                       "ICAR 0.81", "IID 0.81"),
522 |        
523 |        #pretty = TRUE
524 |        at = seq(-round(max(abs(as.data.frame(us_pumas[,c("ICAR_one","IID_one",
525 |                                                     "ICAR_two","IID_two",
526 |                                                     "ICAR_three","IID_three",
527 |                                                     "ICAR_four","IID_four",
528 |                                                     "ICAR_five","IID_five",
529 |                                                     "ICAR_six","IID_six",
530 |                                                     "ICAR_seven","IID_seven",
531 |                                                     "ICAR_eight","IID_eight",
532 |                                                     "ICAR_nine", "IID_nine")])))+0.01, 2),
533 |                 round(max(abs(as.data.frame(us_pumas[,c("ICAR_one","IID_one",
534 |                                                     "ICAR_two","IID_two",
535 |                                                     "ICAR_three","IID_three",
536 |                                                     "ICAR_four","IID_four",
537 |                                                     "ICAR_five","IID_five",
538 |                                                     "ICAR_six","IID_six",
539 |                                                     "ICAR_seven","IID_seven",
540 |                                                     "ICAR_eight","IID_eight",
541 |                                                     "ICAR_nine", "IID_nine")])))+0.01, 2), 
542 |                 length.out = 12),
543 |        
544 |       col.regions = RColorBrewer::brewer.pal(11,"RdBu")
545 | ) 
546 |   
547 | 
548 | dev.off()
549 | 
550 | png(filename = paste0("avg_bias_postmedian_puma_forpaper",
551 |                       runs,
552 |                       "_",
553 |                       sample_size,
554 |                       "_.png"),
555 |     width = 800, height = 800)
556 | 
557 | spplot(us_pumas[,c("ICAR_one","IID_one",
558 |                    "ICAR_five", "IID_five",
559 |                    "ICAR_nine", "IID_nine")],
560 |        
561 |        names.attr = c("BYM2 prior, probability of sampling near Boston = 0.05","IID prior, probability of sampling near Boston = 0.05",
562 |                       "BYM2 prior, probability of sampling near Boston = 0.32", "IID prior, probability of sampling near Boston = 0.32",
563 |                       "BYM2 prior, probability of sampling near Boston = 0.81", "IID prior, probability of sampling near Boston = 0.81"),
564 |        
565 |        #pretty = TRUE
566 |        at = seq(-round(max(abs(as.data.frame(us_pumas[,c("ICAR_one","IID_one",
567 |                                                          "ICAR_five","IID_five",
568 |                                                          "ICAR_nine", "IID_nine")])))+0.01, 2),
569 |                 round(max(abs(as.data.frame(us_pumas[,c("ICAR_one","IID_one",
570 |                                                         "ICAR_five","IID_five",
571 |                                                         "ICAR_nine", "IID_nine")])))+0.01, 2), 
572 |                 length.out = 12),
573 |        
574 |        col.regions = RColorBrewer::brewer.pal(11,"RdBu"),
575 |        
576 |        colorkey = list(space="bottom")
577 | ) 
578 | 
579 | 
580 | dev.off()
581 | 
582 | us_pumas$puma_ps_group1 = rep(0, 52)
583 | us_pumas$puma_ps_group1[us_pumas$PUMACE10 %in% puma_overundersample_index] = us_pumas$puma_ps[us_pumas$PUMACE10 %in% puma_overundersample_index]
584 | 
585 | png(filename = paste0("truth_ps_spatial.png"),
586 |     width = 1000, height = 1000)
587 | 
588 | spplot(us_pumas[,],c("puma_ps", "puma_ps_group1"),
589 |        at = seq(0, max(as.data.frame(us_pumas[,c("puma_ps")]))+0.01, length.out=40),
590 |        colorkey = list(space="bottom"),
591 |        col="transparent",
592 |        names.attr = c("Poststratified true preferences", "17 PUMA over/undersampled near Boston"))
593 | 
594 | dev.off()
595 | 
596 | # majorityvote plot -------------------------------------------------
597 | 
598 | colnames(majorityvote_df)[2] = "Model"
599 | colnames(majorityvote_df)[3] = "p"
600 | 
601 | 
602 | png(filename = paste0("majorityvote_",
603 |                       runs,
604 |                       "_",
605 |                       sample_size,
606 |                       ".png"),
607 |     width = 4000, height = 2400)
608 | 
609 |   ggplot(majorityvote_df, aes(x = majorityvote, y = Model, fill = Model)) +
610 |     ggridges::geom_density_ridges2(alpha=0.7, 
611 |                          quantile_lines = TRUE, 
612 |                          quantiles = c(0.1, 0.5, 0.9), 
613 |                          vline_size = 0.5,
614 |                          vline_color = "black",
615 |                          scale = 1) + 
616 |     xlab("\n Probability of voting yes \n") +
617 |     ylab("\n Model \n") +
618 |     facet_wrap(. ~ p,ncol=3,nrow=3, labeller = as_labeller(
619 |       function(value) {
620 |         return(value)  # Lets you change the facet labels
621 |       })
622 |     ) +
623 |     geom_point(aes(x = as.numeric(true_popn_pref),
624 |                    y = Model),
625 |                size=7,
626 |                colour = "black",
627 |                show.legend = F,
628 |                inherit.aes = F) +
629 |     theme_bw() +
630 |     theme(plot.title = element_text(size = 50, face = "bold"),
631 |           axis.text=element_text(size=35),
632 |           axis.title=element_text(size=50, face="bold",margin=200),
633 |           legend.text = element_text(size=50),
634 |           legend.position  = "bottom",
635 |           legend.key.size = unit(30,"line"),
636 |           legend.key.height = unit(7,"line"),
637 |           legend.key = element_rect(fill = "transparent",color = "transparent"),
638 |           legend.title = element_text(size=50, face="bold"),
639 |           strip.text.x = element_text(size=50, face="bold", margin = margin(t=25,b=25) ),
640 |           strip.background = element_rect(fill="transparent",color="transparent")) +
641 |     scale_fill_manual(values=c("#4575b4", "#d73027","#fdae61")) + 
642 |     guides(color = guide_legend(override.aes = list(size=20)))
643 | 
644 |   dev.off()
645 | 
646 | 
647 | 
648 | 
649 | 
650 | 


--------------------------------------------------------------------------------
/simulation_spatialmrp/ma_adj_matrix.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/ma_adj_matrix.rds


--------------------------------------------------------------------------------
/simulation_spatialmrp/ma_adj_matrix_sparse.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/ma_adj_matrix_sparse.rds


--------------------------------------------------------------------------------
/simulation_spatialmrp/poverty_poststrat.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/poverty_poststrat.rds


--------------------------------------------------------------------------------
/simulation_spatialmrp/prob_sampling_old.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/prob_sampling_old.rds


--------------------------------------------------------------------------------
/simulation_spatialmrp/proportion_puma.R:
--------------------------------------------------------------------------------
  1 | # this file calculates the proportion that structured priors outperform baseline priors in MRP for spatial simulation
  2 | 
  3 | rm(list=ls())
  4 | gc()
  5 | 
  6 | 
  7 | library(ggplot2)
  8 | library(dplyr)
  9 | library(sp)
 10 | library(reshape2)
 11 | 
 12 | sample_size = 1000
 13 | runs = 200
 14 | r = 1:9/10
 15 | 
 16 | poverty_poststrat = readRDS("poverty_poststrat.rds")
 17 | us_pumas = readRDS("us_pumas.rds")
 18 | ma_adj_matrix = readRDS("ma_adj_matrix.rds")
 19 | ma_adj_matrix_sparse = readRDS("ma_adj_matrix_sparse.rds")
 20 | 
 21 | num_education = length(unique(poverty_poststrat$education)) # number of strata for education
 22 | num_race = length(unique(poverty_poststrat$race_x)) # number of strata for race
 23 | num_puma = length(unique(poverty_poststrat$PUMA.x))
 24 | 
 25 | 
 26 | icar_iid_puma_bias_comparison = matrix(0, length(r), num_puma)
 27 | icar_iid_puma_sd_comparison = matrix(0, length(r), num_puma)
 28 | 
 29 | 
 30 | counter_ = 1
 31 | prob_sampling_old = rep(0, length(r))
 32 | 
 33 | for (p in r) {
 34 |   load(paste0(runs,
 35 |               "_",
 36 |               sample_size,
 37 |               "_spatial_",
 38 |               p * 10,
 39 |               ".RData"))
 40 |   
 41 |   # get the interpretable p which is the probability of sampling something in puma_overundersample_index
 42 |   poststrat_final_joined_p_interpretable = cbind(poststrat_final_joined_p,
 43 |                                                  (poststrat_final_joined_p$p_response * poststrat_final_joined_p$N)/sum(poststrat_final_joined_p$p_response * poststrat_final_joined_p$N)
 44 |   )
 45 |   # 
 46 |   colnames(poststrat_final_joined_p_interpretable)[length(colnames(poststrat_final_joined_p_interpretable))] = "prob_sampling_g1" 
 47 |   prob_sampling_old[counter_] = round(sum(poststrat_final_joined_p_interpretable[poststrat_final_joined_p_interpretable$puma_area %in% puma_overundersample_index,
 48 |                                                                                  c("prob_sampling_g1")]),
 49 |                                       2) # very important
 50 |   
 51 |   
 52 |   icar_iid_puma_bias_comparison[counter_,] = colSums((abs(sweep(median_puma_icar, 
 53 |                                                                 2,
 54 |                                                                 poststrat_puma_pref$puma_ps,
 55 |                                                                 "-")) - 
 56 |                                                         abs(sweep(median_puma_iid,
 57 |                                                                   2, 
 58 |                                                                   poststrat_puma_pref$puma_ps,
 59 |                                                                   "-"))) <= 0)/dim(median_puma_icar)[1]
 60 |   
 61 |   
 62 |   icar_iid_puma_sd_comparison[counter_,] = colSums(((quantile90_puma_icar - quantile10_puma_icar) -
 63 |                                                       (quantile90_puma_iid - quantile10_puma_iid)) <= 0)/dim(median_puma_icar)[1]
 64 |   
 65 | 
 66 |   
 67 |   counter_ = counter_ + 1
 68 | }
 69 | 
 70 | colnames(icar_iid_puma_bias_comparison) = poststrat_puma_pref$puma_area
 71 | colnames(icar_iid_puma_sd_comparison) = poststrat_puma_pref$puma_area
 72 | 
 73 | icar_iid_puma_bias_comparison_melted = melt(icar_iid_puma_bias_comparison)
 74 | icar_iid_puma_sd_comparison_melted = melt(icar_iid_puma_sd_comparison)
 75 | 
 76 | colnames(icar_iid_puma_bias_comparison_melted) = c("p", "puma", "proportion")
 77 | colnames(icar_iid_puma_sd_comparison_melted) = c("p", "puma", "proportion")
 78 | 
 79 | icar_iid_puma_bias_comparison_melted$puma_group = ""
 80 | icar_iid_puma_sd_comparison_melted$puma_group = ""
 81 | 
 82 | icar_iid_puma_bias_comparison_melted[icar_iid_puma_bias_comparison_melted$puma %in% puma_overundersample_index,c("puma_group")] = "Near Boston"
 83 | icar_iid_puma_bias_comparison_melted[!(icar_iid_puma_bias_comparison_melted$puma %in% puma_overundersample_index),c("puma_group")] = "Away From Boston"
 84 | 
 85 | icar_iid_puma_sd_comparison_melted[icar_iid_puma_sd_comparison_melted$puma %in% puma_overundersample_index,c("puma_group")] = "Near Boston"
 86 | icar_iid_puma_sd_comparison_melted[!(icar_iid_puma_sd_comparison_melted$puma %in% puma_overundersample_index),c("puma_group")] = "Away From Boston"
 87 | 
 88 | icar_iid_puma_bias_comparison_melted$prob_sampling = rep(times = 52, x = prob_sampling_old)
 89 | icar_iid_puma_sd_comparison_melted$prob_sampling = rep(times = 52, x = prob_sampling_old)
 90 | 
 91 | 
 92 | # bias plot
 93 | saveRDS(icar_iid_puma_bias_comparison_melted,
 94 |         paste0("proportion_puma_", sample_size,".rds"))
 95 | 
 96 | ggplot(icar_iid_puma_bias_comparison_melted,
 97 |        aes(x=prob_sampling, y=proportion, group = puma)) + 
 98 |   geom_line(aes(col=as.factor(puma_group)),show.legend = TRUE, size = 0.5) +
 99 |   xlab("\n Probability of sampling for cluster of PUMA near Boston \n") +
100 |   ylab("\n Improvement proportion for bias \n") +
101 |   guides(col=guide_legend(title="Sampling Group",
102 |                           override.aes = list(size = 10*1.6))) +
103 |   geom_hline(yintercept=0.5, color="black", size=1, linetype = "dashed") + 
104 |   scale_color_viridis_d(begin=0.25,end=.75)
105 | 
106 | 
107 | # sd plot
108 | saveRDS(icar_iid_puma_sd_comparison_melted,
109 |         paste0("proportion_sd_puma_", sample_size,".rds"))
110 | 
111 | ggplot(icar_iid_puma_sd_comparison_melted,
112 |        aes(x=prob_sampling, y=proportion, group = puma)) + 
113 |   geom_line(aes(col=as.factor(puma_group)),show.legend = TRUE, size = 0.5) +
114 |   xlab("\n Probability of sampling for cluster of PUMA near Boston \n") +
115 |   ylab("\n Improvement proportion for sd \n") +
116 |   guides(col=guide_legend(title="Sampling Group",
117 |                           override.aes = list(size = 10*1.6))) +
118 |   geom_hline(yintercept=0.5, color="black", size=1, linetype = "dashed") + 
119 |   scale_color_viridis_d(begin=0.25,end=.75)
120 | 
121 | 
122 | print(colMeans(icar_iid_puma_bias_comparison))
123 | print(colMeans(icar_iid_puma_sd_comparison))
124 | 
125 | print(summary(colMeans(icar_iid_puma_bias_comparison)))
126 | print(summary(colMeans(icar_iid_puma_sd_comparison)))
127 | 
128 | 


--------------------------------------------------------------------------------
/simulation_spatialmrp/smooth_x.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/smooth_x.rds


--------------------------------------------------------------------------------
/simulation_spatialmrp/us_pumas.rds:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexgao09/structuredpriorsmrp_public/b7d29da9a8cec121a63ab28255f88cd41a73e46a/simulation_spatialmrp/us_pumas.rds


--------------------------------------------------------------------------------