├── .gitignore ├── causal_inference_r_workshop_solutions.Rproj ├── 11-tipr.qmd ├── 05-quartets-exercises.qmd ├── 09-outcome-model-exercises.qmd ├── 07-pscores-using-exercises.qmd ├── 08-pscores-diagnostics-exercises.qmd ├── 03-ci-with-group-by-and-summarise-exercises.qmd ├── 12-whole-game-2-exercises.qmd ├── 06-intro-pscores-exercises.qmd ├── 13-bonus-selection-bias-exercises.qmd ├── 04-dags-exercises.qmd ├── 14-bonus-continuous-pscores-exercises.qmd ├── 10-continuous-g-computation-exercises.qmd ├── 01-whole-game-exercises.qmd └── 15-bonus-ml-for-causal-exercises.qmd /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .Rdata 4 | .httr-oauth 5 | .DS_Store 6 | -------------------------------------------------------------------------------- /causal_inference_r_workshop_solutions.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | ProjectId: 73041e34-880b-48f3-b961-5e73736034a8 3 | 4 | RestoreWorkspace: No 5 | SaveWorkspace: No 6 | AlwaysSaveHistory: Default 7 | 8 | EnableCodeIndexing: Yes 9 | UseSpacesForTab: Yes 10 | NumSpacesForTab: 2 11 | Encoding: UTF-8 12 | 13 | RnwWeave: Sweave 14 | LaTeX: pdfLaTeX 15 | 16 | AutoAppendNewline: Yes 17 | StripTrailingWhitespace: Yes 18 | LineEndingConversion: Posix 19 | -------------------------------------------------------------------------------- /11-tipr.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Tipping Point Sensitivity Analyses" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tipr) 9 | ``` 10 | 11 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning (the **exposure**) and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am (the **outcome**). 12 | 13 | ## Your turn 14 | 15 | Use the `tip_coef()` function to conduct a sensitivity analysis for the estimate from your previous exercises. Use the lower bound of the confidence interval for the effect and `0.1` for the exposure-confounder effect. 
16 | 17 | ```{r} 18 | tip_coef(0.0009, exposure_confounder_effect = 0.1) 19 | ``` 20 | 21 | -------------------------------------------------------------------------------- /05-quartets-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Causal inference is not just a statistics problem" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(quartets) 10 | ``` 11 | 12 | ## Your turn 1 13 | 14 | For each of the following 4 datasets, look at the correlation between `exposure` and `covariate`: 15 | 16 | * `causal_collider` 17 | * `causal_confounding` 18 | * `causal_mediator` 19 | * `causal_m_bias` 20 | 21 | 22 | ```{r} 23 | causal_quartet |> 24 | group_by(dataset) |> 25 | summarize(correlation = cor(exposure, covariate)) 26 | ``` 27 | 28 | For each of the above 4 datasets, create a scatterplot looking at the relationship between `exposure` and `outcome`. 29 | 30 | ```{r} 31 | causal_quartet |> 32 | ggplot(aes(exposure, outcome)) + 33 | geom_point() + 34 | geom_smooth(method = "lm", formula = "y ~ x") + 35 | facet_wrap(~ dataset) 36 | ``` 37 | 38 | For each of the above 4 datasets, fit a linear model to examine the relationship between the `exposure` and the `outcome`. 39 | 40 | ```{r} 41 | causal_quartet |> 42 | group_by(dataset) |> 43 | summarize(exposure_coef = coef(lm(outcome ~ exposure))[["exposure"]]) 44 | ``` 45 | 46 | ## Your turn 2 47 | 48 | For each of the following 4 datasets, fit a linear model examining the relationship between `outcome_followup` and `exposure_baseline`, adjusting for `covariate_baseline`: 49 | 50 | * `causal_collider_time` 51 | * `causal_confounding_time` 52 | * `causal_mediator_time` 53 | * `causal_m_bias_time` 54 | 55 | ```{r} 56 | causal_quartet_time |> 57 | group_by(dataset) |> 58 | summarize(exposure_coef = coef(lm(outcome_followup ~ exposure_baseline + covariate_baseline))[["exposure_baseline"]]) 59 | ``` 60 | 61 | 62 | -------------------------------------------------------------------------------- /09-outcome-model-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Outcome Model" 3 | format: html 4 | --- 5 | 6 | 7 | ```{r} 8 | #| label: setup 9 | library(tidyverse) 10 | library(broom) 11 | library(touringplans) 12 | library(rsample) 13 | library(propensity) 14 | 15 | seven_dwarfs <- seven_dwarfs_train_2018 |> 16 | filter(wait_hour == 9) 17 | ``` 18 | 19 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning (the **exposure**) and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am (the **outcome**). 20 | 21 | ## Your turn 22 | 23 | *After updating the code chunks below, change `eval: true` before rendering* 24 | 25 | Create a function called `fit_ipw` that fits the propensity score model from the earlier exercises, incorporates the ATE weights you calculated, and fits a weighted outcome model. 26 | 27 | ```{r} 28 | #| eval: true 29 | fit_ipw <- function(split, ...) 
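# note: the whole pipeline (propensity model, weights, outcome model) lives
# inside this function so that each bootstrap resample re-estimates the weights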
{ 30 | .df <- analysis(split) 31 | 32 | # fit propensity score model 33 | propensity_model <- glm( 34 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 35 | data = .df, 36 | family = binomial() 37 | ) 38 | 39 | # calculate ATE weights 40 | .df <- propensity_model |> 41 | augment(type.predict = "response", data = .df) |> 42 | mutate(w_ate = wt_ate(.fitted, park_extra_magic_morning, exposure_type = "binary")) 43 | 44 | # fit correctly bootstrapped ipw model 45 | lm(wait_minutes_posted_avg ~ park_extra_magic_morning, data = .df, weights = w_ate) |> 46 | tidy() 47 | } 48 | ``` 49 | 50 | Bootstrap this result 1000 times. 51 | 52 | ```{r} 53 | #| eval: true 54 | set.seed(1234) 55 | 56 | ipw_results <- bootstraps(seven_dwarfs, 1000, apparent = TRUE) 57 | 58 | ipw_results <- ipw_results |> 59 | mutate(boot_fits = map(splits, fit_ipw)) 60 | ``` 61 | 62 | Check out the distribution of estimates (**no need to change this code**) 63 | 64 | ```{r} 65 | #| eval: true 66 | ipw_results |> 67 | mutate( 68 | estimate = map_dbl( 69 | boot_fits, 70 | # pull the `estimate` for `park_extra_magic_morning` for each fit 71 | \(.fit) .fit |> 72 | filter(term == "park_extra_magic_morning") |> 73 | pull(estimate) 74 | ) 75 | ) |> 76 | ggplot(aes(estimate)) + 77 | geom_histogram(fill = "#D55E00FF", color = "white", alpha = 0.8) + 78 | theme_minimal() 79 | ``` 80 | 81 | Calculate the confidence interval 82 | 83 | ```{r} 84 | #| eval: true 85 | boot_estimate <- int_t(ipw_results, boot_fits) |> 86 | filter(term == "park_extra_magic_morning") 87 | 88 | boot_estimate 89 | ``` 90 | 91 | 92 | Stretch goal: Do the same for a model using matching. 
-------------------------------------------------------------------------------- /07-pscores-using-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Using Propensity Scores" 3 | format: html 4 | --- 5 | 6 | 7 | ```{r} 8 | #| label: setup 9 | library(tidyverse) 10 | library(broom) 11 | library(touringplans) 12 | library(propensity) 13 | ``` 14 | 15 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning (the **exposure**) and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am (the **outcome**). 16 | 17 | Below is the propensity score model you created in the previous exercise. 18 | 19 | ```{r} 20 | seven_dwarfs <- seven_dwarfs_train_2018 |> 21 | filter(wait_hour == 9) 22 | 23 | propensity_model <- glm( 24 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 25 | data = seven_dwarfs, 26 | family = binomial() 27 | ) 28 | 29 | seven_dwarfs_prop <- propensity_model |> 30 | augment(type.predict = "response", data = seven_dwarfs) 31 | ``` 32 | 33 | ## Your Turn 1 (Matching) 34 | 35 | *After updating the code chunks below, change `eval: true` before rendering* 36 | 37 | Create a "matched" data set using the same propensity score model as above and a caliper of 0.2. 
38 | 39 | ```{r} 40 | #| eval: true 41 | library(MatchIt) 42 | matched_dwarfs <- matchit( 43 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 44 | data = seven_dwarfs, 45 | link = "linear.logit", 46 | caliper = 0.2 47 | ) 48 | 49 | matched_df <- get_matches(matched_dwarfs) 50 | ``` 51 | 52 | ## Your Turn 2 (Weighting) 53 | 54 | *After updating the code chunks below, change `eval: true` before rendering* 55 | 56 | Add the ATE weights to the data frame, `seven_dwarfs_prop`. 57 | 58 | ```{r} 59 | #| eval: true 60 | seven_dwarfs_prop <- seven_dwarfs_prop |> 61 | mutate(w_ate = wt_ate(.fitted, park_extra_magic_morning)) 62 | ``` 63 | 64 | 65 | Stretch Goal 1: 66 | 67 | Add ATM weights to the data frame, `seven_dwarfs_prop`. 68 | 69 | ```{r} 70 | #| eval: true 71 | seven_dwarfs_prop <- seven_dwarfs_prop |> 72 | mutate(w_atm = wt_atm(.fitted, park_extra_magic_morning)) 73 | ``` 74 | 75 | Stretch Goal 2: 76 | 77 | Update the code below to examine the distribution of the weighted sample. **HINT**: the part that needs to be updated is the `weight` aesthetic in the second of the two `geom_mirror_histogram()` calls. 78 | 79 | 80 | ```{r} 81 | #| eval: true 82 | #| warning: false 83 | library(halfmoon) 84 | ggplot( 85 | seven_dwarfs_prop, 86 | aes(.fitted, fill = factor(park_extra_magic_morning)) 87 | ) + 88 | geom_mirror_histogram(bins = 50, alpha = .5) + 89 | geom_mirror_histogram(aes(weight = w_ate), alpha = .5, bins = 50) + 90 | geom_hline(yintercept = 0, lwd = 0.5) + 91 | theme_minimal() + 92 | scale_y_continuous(labels = abs) + 93 | scale_fill_manual(values = c("blue", "green")) + 94 | labs(x = "p", fill = "Extra Magic Morning") + 95 | xlim(0, 1) 96 | ``` 97 | -------------------------------------------------------------------------------- /08-pscores-diagnostics-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Propensity Score Diagnostics" 3 | format: html 4 | --- 5 | 6 | 7 | ```{r} 8 | #| label: setup 9 | library(tidyverse) 10 | library(broom) 11 | library(touringplans) 12 | library(halfmoon) 13 | library(propensity) 14 | ``` 15 | 16 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning (the **exposure**) and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am (the **outcome**). 17 | 18 | Below is the propensity score model and weights you created in the previous exercise. 
19 | 20 | ```{r} 21 | seven_dwarfs <- seven_dwarfs_train_2018 |> 22 | filter(wait_hour == 9) 23 | 24 | propensity_model <- glm( 25 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 26 | data = seven_dwarfs, 27 | family = binomial() 28 | ) 29 | 30 | seven_dwarfs_ps <- propensity_model |> 31 | augment(type.predict = "response", data = seven_dwarfs) |> 32 | mutate(w_ate = wt_ate(.fitted, park_extra_magic_morning)) 33 | ``` 34 | 35 | ## Your Turn 1 36 | 37 | *After updating the code chunks below, change `eval: true` before rendering* 38 | 39 | Calculate the standardized mean differences with and without weights 40 | 41 | ```{r} 42 | smds <- seven_dwarfs_ps |> 43 | mutate(park_close = as.numeric(park_close)) |> 44 | tidy_smd( 45 | .vars = c(park_ticket_season, park_close, park_temperature_high), 46 | .group = park_extra_magic_morning, 47 | .wts = w_ate 48 | ) 49 | ``` 50 | 51 | Create the Love Plot using ggplot and halfmoon 52 | 53 | ```{r} 54 | #| eval: true 55 | ggplot( 56 | data = smds, 57 | aes(x = abs(smd), y = variable, group = method, color = method) 58 | ) + 59 | geom_love() 60 | ``` 61 | 62 | ## Your Turn 2 63 | 64 | Create an unweighted ECDF for `park_temperature_high` by whether or not the day had Extra Magic Hours. 65 | 66 | ```{r} 67 | #| eval: true 68 | ggplot(seven_dwarfs_ps, aes(x = park_temperature_high, color = factor(park_extra_magic_morning))) + 69 | geom_ecdf() + 70 | scale_color_manual( 71 | "Extra Magic Hours", 72 | values = c("#5154B8", "#5DB854"), 73 | labels = c("No", "Yes") 74 | ) + 75 | xlab("Historic Temperature") + 76 | ylab("Proportion <= x") 77 | ``` 78 | 79 | Create a weighted ECDF for `park_temperature_high` by whether or not the day had Extra Magic Hours. 80 | 81 | ```{r} 82 | #| eval: true 83 | ggplot(seven_dwarfs_ps, aes(x = park_temperature_high, color = factor(park_extra_magic_morning))) + 84 | geom_ecdf(aes(weights = w_ate)) + 85 | scale_color_manual( 86 | "Extra Magic Hours", 87 | values = c("#5154B8", "#5DB854"), 88 | labels = c("No", "Yes") 89 | ) + 90 | xlab("Historic Temperature") + 91 | ylab("Proportion <= x") 92 | ``` 93 | 94 | ## Bonus Your Turn: Weighted Tables 95 | 96 | Create a weighted table for the seven dwarfs dataset given your weights 97 | 98 | 99 | ```{r} 100 | library(survey) 101 | library(gtsummary) 102 | seven_dwarfs_ps |> 103 | select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high, w_ate) |> 104 | svydesign( 105 | ids = ~ 1, 106 | data = _, 107 | weights = ~ w_ate 108 | ) |> 109 | tbl_svysummary( 110 | by = park_extra_magic_morning, 111 | include = -w_ate 112 | ) |> 113 | add_difference(everything() ~ "smd") 114 | ``` 115 | -------------------------------------------------------------------------------- /03-ci-with-group-by-and-summarise-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Causal Inference with `group_by()` and `summarize()`" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | set.seed(1) 10 | ``` 11 | 12 | ## Your Turn 1 13 | 14 | Run this code to generate the simulated data set 15 | 16 | ```{r} 17 | n <- 1000 18 | sim <- tibble( 19 | confounder = rbinom(n, 1, 0.5), 20 | p_exposure = case_when( 21 | confounder == 1 ~ 0.75, 22 | confounder == 0 ~ 0.25 23 | ), 24 | exposure = rbinom(n, 1, p_exposure), 25 | outcome = confounder + rnorm(n) 26 | ) 27 | ``` 28 | 29 | 1. Group the dataset by `confounder` and `exposure` 30 | 2. 
Calculate the mean of the `outcome` for the groups 31 | 32 | ```{r} 33 | sim |> 34 | group_by(confounder, exposure) |> 35 | summarise(avg_y = mean(outcome)) |> 36 | # pivot the data so we can get the difference 37 | # between the exposure groups 38 | pivot_wider( 39 | names_from = exposure, 40 | values_from = avg_y, 41 | names_prefix = "x_" 42 | ) |> 43 | summarise(estimate = x_1 - x_0) |> 44 | summarise(estimate = mean(estimate)) # note, we would need to weight this if the confounder groups were not equal sized 45 | ``` 46 | 47 | ## Your Turn 2 48 | 49 | Run the following code to generate `sim2` 50 | 51 | ```{r} 52 | n <- 1000 53 | sim2 <- tibble( 54 | confounder_1 = rbinom(n, 1, 0.5), 55 | confounder_2 = rbinom(n, 1, 0.5), 56 | 57 | p_exposure = case_when( 58 | confounder_1 == 1 & confounder_2 == 1 ~ 0.75, 59 | confounder_1 == 0 & confounder_2 == 1 ~ 0.9, 60 | confounder_1 == 1 & confounder_2 == 0 ~ 0.2, 61 | confounder_1 == 0 & confounder_2 == 0 ~ 0.1, 62 | ), 63 | exposure = rbinom(n, 1, p_exposure), 64 | outcome = confounder_1 + confounder_2 + rnorm(n) 65 | ) 66 | ``` 67 | 68 | 1. Group the dataset by the confounders and exposure 69 | 2. Calculate the mean of the outcome for the groups 70 | 71 | ```{r} 72 | sim2 |> 73 | group_by(confounder_1, confounder_2, exposure) |> 74 | summarise(avg_y = mean(outcome)) |> 75 | pivot_wider( 76 | names_from = exposure, 77 | values_from = avg_y, 78 | names_prefix = "x_" 79 | ) |> 80 | summarise(estimate = x_1 - x_0, .groups = "drop") |> 81 | summarise(estimate = mean(estimate)) 82 | ``` 83 | 84 | ## Your Turn 3 85 | 86 | Run the following code to generate `sim3` 87 | 88 | ```{r} 89 | n <- 10000 90 | sim3 <- tibble( 91 | confounder = rnorm(n), 92 | p_exposure = exp(confounder) / (1 + exp(confounder)), 93 | exposure = rbinom(n, 1, p_exposure), 94 | outcome = confounder + rnorm(n) 95 | ) 96 | ``` 97 | 98 | 1. Use `ntile()` from dplyr to calculate a binned version of `confounder` called `confounder_q`. We'll create a variable with 5 bins. 99 | 2. Group the dataset by the binned variable you just created and exposure 100 | 3. Calculate the mean of the outcome for the groups 101 | 102 | ```{r} 103 | sim3 |> 104 | mutate(confounder_q = ntile(confounder, 5)) |> 105 | group_by(confounder_q, exposure) |> 106 | summarise(avg_y = mean(outcome)) |> 107 | pivot_wider( 108 | names_from = exposure, 109 | values_from = avg_y, 110 | names_prefix = "x_" 111 | ) |> 112 | summarise(estimate = x_1 - x_0) |> 113 | summarise(estimate = mean(estimate)) 114 | ``` 115 | 116 | # Take aways 117 | 118 | * Sometimes correlation *is* causation! 119 | * In simple cases, grouping by confounding variables can get us the right answer without a statistical model 120 | * Propensity scores generalize the idea of summarizing exposure effects to any number of confounders. Although we'll use models for this process, the foundations are the same. 121 | -------------------------------------------------------------------------------- /12-whole-game-2-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Whole Game 2: Malaria and Mosquito Nets" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(rsample) 11 | library(ggdag) 12 | library(tipr) 13 | library(propensity) 14 | library(halfmoon) 15 | ``` 16 | 17 | ## Whole Game 2: Malaria and Mosquito Nets 18 | 19 | In this exercise, we'll attempt to answer a causal question: does mosquito bed net use reduce malaria risk? 
20 | 21 | To answer this question, we'll use the causal inference methods we learned in this workshop. You may use any technique you prefer. If you have time, try more than one technique and compare the results. Remember the broad strokes we discussed today: 22 | 23 | 1. Specify causal question (we just did this!) 24 | 2. Draw assumptions (via a causal diagram, which we'll do below) 25 | 3. Model assumptions (e.g. with a propensity score) 26 | 4. Analyze propensities (diagnostics) 27 | 5. Estimate causal effects 28 | 6. Conduct a sensitivity analysis 29 | 30 | We'll use simulated data, `net_data`, from the {causalworkshop} package, which includes ten variables: 31 | `id` 32 | : an ID variable 33 | `net` and `net_num` 34 | : a binary variable indicating if the participant used a net (1) or didn't use a net (0) 35 | `malaria_risk` 36 | : a risk of malaria scale ranging from 0-100 37 | `income` 38 | : weekly income, measured in dollars 39 | `health` 40 | : a health score scale ranging from 0–100 41 | `household` 42 | : number of people living in the household 43 | `eligible` 44 | : a binary variable indicating if the household is eligible for the free net program. 45 | `temperature` 46 | : the average temperature at night, in Celsius 47 | `resistance` 48 | : Insecticide resistance of local mosquitoes. This is measured on a scale of 0–100, with higher values indicating higher resistance. 49 | 50 | Our thanks to Andrew Heiss for providing these data! 51 | 52 | ```{r} 53 | library(causalworkshop) 54 | net_data 55 | ``` 56 | 57 | Here's the DAG we are proposing. Make sure you confirm what the adjustment set is. 58 | 59 | ```{r} 60 | mosquito_dag <- dagify( 61 | malaria_risk ~ net + income + health + temperature + resistance, 62 | net ~ income + health + temperature + eligible + household, 63 | eligible ~ income + household, 64 | health ~ income, 65 | exposure = "net", 66 | outcome = "malaria_risk", 67 | coords = list( 68 | x = c( 69 | malaria_risk = 7, 70 | net = 3, 71 | income = 4, 72 | health = 5, 73 | temperature = 6, 74 | resistance = 8.5, 75 | eligible = 2, 76 | household = 1 77 | ), 78 | y = c( 79 | malaria_risk = 2, 80 | net = 2, 81 | income = 3, 82 | health = 1, 83 | temperature = 3, 84 | resistance = 2, 85 | eligible = 3, 86 | household = 2 87 | ) 88 | ), 89 | labels = c( 90 | malaria_risk = "Risk of malaria", 91 | net = "Mosquito net", 92 | income = "Income", 93 | health = "Health", 94 | temperature = "Nighttime temperatures", 95 | resistance = "Insecticide resistance", 96 | eligible = "Eligible for program", 97 | household = "Number in the household" 98 | ) 99 | ) 100 | 101 | mosquito_dag |> 102 | tidy_dagitty() |> 103 | node_status() |> 104 | ggplot( 105 | aes(x, y, xend = xend, yend = yend, color = status) 106 | ) + 107 | geom_dag_edges() + 108 | geom_dag_point() + 109 | geom_dag_label_repel( 110 | aes(x, y, label = label), 111 | box.padding = 3.5, 112 | inherit.aes = FALSE, 113 | max.overlaps = Inf, 114 | seed = 10, 115 | label.size = NA, 116 | label.padding = 0.1, 117 | size = 14 / 3 118 | ) + 119 | theme_dag(base_size = 14) + 120 | theme(legend.position = "none") + 121 | labs(caption = "Thanks to Andrew Heiss for the data!") + 122 | coord_cartesian(clip = "off") 123 | ``` 124 | 125 | # Your Turn 126 | 127 | Now, conduct the causal analysis as you see fit! Feel free to work in groups and to ask us questions. 128 | 129 | See [this chapter](https://www.r-causal.org/chapters/chapter-02.html) for a detailed example of this analysis. 
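If you would like a scaffold to start from, below is a minimal sketch of one possible approach: inverse probability weighting with ATE weights, assuming the adjustment set implied by the DAG above (`income`, `health`, and `temperature`). Treat it as a starting point under those assumptions, not the canonical solution; matching or g-computation would be equally valid choices here.

```{r}
# a hedged sketch of one possible IPW analysis
# the adjustment set (income, health, temperature) is read off the DAG above;
# verify it yourself with ggdag_adjustment_set(mosquito_dag)
propensity_model <- glm(
  net ~ income + health + temperature,
  data = net_data,
  family = binomial()
)

# add ATE weights (assumes `net` is a binary/logical exposure;
# use `net_num` instead if your version of the data codes it differently)
net_data_wts <- propensity_model |>
  augment(type.predict = "response", data = net_data) |>
  mutate(wts = wt_ate(.fitted, net))

# weighted outcome model for the average treatment effect
lm(malaria_risk ~ net, data = net_data_wts, weights = wts) |>
  tidy()
```

From there, check balance with {halfmoon}, bootstrap the whole pipeline with {rsample} for valid confidence intervals, and finish with a tipping point sensitivity analysis using {tipr}.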
130 | 131 | 132 | -------------------------------------------------------------------------------- /06-intro-pscores-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Propensity Scores" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(touringplans) 11 | library(ggdag) 12 | library(ggokabeito) 13 | ``` 14 | 15 | For Your Turn, we'll be looking at an example using Walt Disney World ride data from the touringplans package. 16 | 17 | Historically, guests who stayed in a Walt Disney World resort hotel were able to access the park during "Extra Magic Hours" during which the park was closed to all other guests. 18 | These extra hours could be in the morning or evening. 19 | The Seven Dwarfs Mine Train is a ride at Walt Disney World's Magic Kingdom. Typically, each day Magic Kingdom may or may not be selected to have these "Extra Magic Hours". 20 | 21 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am. 22 | 23 | Below is a proposed DAG for this question. 24 | 25 | *Render this document to see the DAG or refer to the slides*. 26 | 27 | ```{r} 28 | set.seed(1234) 29 | 30 | coord_dag <- list( 31 | x = c(Season = 0, close = 0, weather = -1, x = 1, y = 2), 32 | y = c(Season = -1, close = 1, weather = 0, x = 0, y = 0) 33 | ) 34 | 35 | labels <- c( 36 | x = "Extra Magic Morning", 37 | y = "Average wait", 38 | Season = "Ticket Season", 39 | weather = "Historic high temperature", 40 | close = "Time park closed" 41 | ) 42 | 43 | dagify( 44 | y ~ x + close + Season + weather, 45 | x ~ weather + close + Season, 46 | coords = coord_dag, 47 | labels = labels, 48 | exposure = "x", 49 | outcome = "y" 50 | ) |> 51 | tidy_dagitty() |> 52 | node_status() |> 53 | ggplot( 54 | aes(x, y, xend = xend, yend = yend, color = status) 55 | ) + 56 | geom_dag_edges_arc(curvature = c(rep(0, 5), .3, 0)) + 57 | geom_dag_point() + 58 | geom_dag_label_repel( 59 | aes(x, y, label = label), 60 | box.padding = 3.5, 61 | inherit.aes = FALSE, 62 | max.overlaps = Inf, 63 | family = "sans", 64 | seed = 1630, 65 | label.size = NA, 66 | label.padding = 0.1, 67 | size = 14 / 3 68 | ) + 69 | scale_color_okabe_ito(na.value = "grey90") + 70 | theme_dag() + 71 | theme( 72 | legend.position = "none", 73 | axis.text.x = element_text() 74 | ) + 75 | coord_cartesian(clip = "off") + 76 | scale_x_continuous( 77 | limits = c(-1.25, 2.25), 78 | breaks = c(-1, 0, 1, 2), 79 | labels = c( 80 | "\n(one year ago)", 81 | "\n(6 months ago)", 82 | "\n(3 months ago)", 83 | "5pm - 6pm\n(Today)" 84 | ) 85 | ) 86 | ``` 87 | 88 | Here we are proposing that there are three confounders: the historic high temperature on the day, the time the park closed, and the ticket season: value, regular, or peak. 89 | 90 | We can build a propensity score model using the `seven_dwarfs_train_2018` data set from the touringplans package. 91 | Each row of this dataset contains information about the Seven Dwarfs Mine Train during a certain hour on a given day. 92 | First we need to subset the data to only include average wait times between 9 and 10 am. 
92 | 93 | ```{r} 94 | seven_dwarfs <- seven_dwarfs_train_2018 |> 95 | filter(wait_hour == 9) 96 | ``` 97 | 98 | Here's a data dictionary of the variables we need in the `seven_dwarfs` data set: 99 | 100 | | Variable | Column in `seven_dwarfs` | 101 | |--------------------------------|--------------------------| 102 | | Posted Wait Time (outcome) | `wait_minutes_posted_avg` | 103 | | Extra Magic Morning (exposure) | `park_extra_magic_morning` | 104 | | Ticket Season | `park_ticket_season` | 105 | | Closing Time | `park_close` | 106 | | Historic Temperature | `park_temperature_high` | 107 | 108 | ## Your Turn 109 | 110 | *After updating the code chunks below, change `eval: true` before rendering* 111 | 112 | Now, fit a propensity score model for `park_extra_magic_morning` using the above proposed confounders. 113 | 114 | ```{r} 115 | #| eval: true 116 | propensity_model <- glm( 117 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 118 | data = seven_dwarfs, 119 | family = binomial() 120 | ) 121 | ``` 122 | 123 | Add the propensity scores to the `seven_dwarfs` data set, call this new dataset `df`. 124 | 125 | ```{r} 126 | #| eval: true 127 | df <- propensity_model |> 128 | augment(type.predict = "response", data = seven_dwarfs) 129 | ``` 130 | 131 | Stretch Goal 1: 132 | 133 | Examine two histograms of the propensity scores, one for days with Extra Magic Morning (`park_extra_magic_morning == 1`) and one for days without it (`park_extra_magic_morning == 0`). 134 | How do these compare? 135 | 136 | ```{r} 137 | ggplot( 138 | df, 139 | aes(x = .fitted, fill = factor(park_extra_magic_morning)) 140 | ) + 141 | geom_histogram(bins = 30) + 142 | scale_y_continuous("Count") + 143 | scale_x_continuous("Propensity Score") + 144 | scale_fill_manual(values = c("orange", "cornflowerblue")) 145 | ``` 146 | -------------------------------------------------------------------------------- /13-bonus-selection-bias-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Bonus: Selection bias and correcting for loss to follow-up" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(propensity) 11 | library(causaldata) 12 | library(rsample) 13 | ``` 14 | 15 | In this example, we'll consider loss to follow-up in the NHEFS study. We'll use the binary exposure we used earlier in the workshop: does quitting smoking (`qsmk`) increase weight (`wt82_71`)? This time, however, we'll adjust for loss to follow-up (people who dropped out of the study between observation periods) using inverse probability of censoring weights. 16 | 17 | # Your Turn 1 18 | 19 | 1. Take a look at how many participants were lost to follow-up in `nhefs`, called `censored` in this data set. You don't need to change anything in this code. 20 | 21 | ```{r} 22 | nhefs_censored <- nhefs |> 23 | drop_na( 24 | qsmk, sex, race, age, school, smokeintensity, smokeyrs, exercise, 25 | active, wt71 26 | ) 27 | 28 | nhefs_censored |> 29 | count(censored = as.factor(censored)) |> 30 | ggplot(aes(censored, n)) + 31 | geom_col() 32 | ``` 33 | 34 | 2. Create a logistic regression model that predicts whether or not someone is censored. 
34 | 35 | ```{r} 36 | cens_model <- glm( 37 | censored ~ qsmk + sex + race + age + I(age^2) + education + 38 | smokeintensity + I(smokeintensity^2) + 39 | smokeyrs + I(smokeyrs^2) + exercise + active + 40 | wt71 + I(wt71^2), 41 | data = nhefs_censored, 42 | family = binomial() 43 | ) 44 | 45 | ``` 46 | 47 | # Your Turn 2 48 | 49 | 1. Use the logistic model you just fit to create inverse probability of censoring weights 50 | 2. Calculate the weights using `.fitted` 51 | 3. Join `cens` to `nhefs_censored` so that you have the weights in your dataset 52 | 4. Fit a linear regression model of `wt82_71` weighted by `cens_wts`, saved as `outcome_model`. We'll use this model as the basis for our G-computation 53 | 54 | ```{r} 55 | cens <- cens_model |> 56 | augment(type.predict = "response", data = nhefs_censored) |> 57 | mutate(cens_wts = wt_ate(.fitted, censored)) |> 58 | select(id, cens_wts) 59 | 60 | # join all the weights data from above 61 | nhefs_censored_wts <- nhefs_censored |> 62 | left_join(cens, by = "id") 63 | 64 | # fit the weighted outcome model (saved separately so it doesn't overwrite 
the censoring model above) 65 | outcome_model <- lm( 66 | wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity + 67 | I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs + 68 | I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2), 69 | data = nhefs_censored_wts, 70 | weights = cens_wts 71 | ) 72 | ``` 73 | 74 | # Your Turn 3 75 | 76 | 1. Create the cloned data sets, called `kept_smoking` and `quit_smoking`, where one dataset has `qsmk` set to 0 (kept smoking) and the other has it set to 1 (quit smoking). 77 | 2. Use the outcome model, `outcome_model`, to make predictions for `kept_smoking` and `quit_smoking` 78 | 3. Calculate the differences between the mean values of `kept_smoking` and `quit_smoking` 79 | 80 | ```{r} 81 | predicted_kept_smoking <- outcome_model |> 82 | augment(newdata = nhefs_censored |> mutate(qsmk = 0)) |> 83 | select(kept_smoking = .fitted) 84 | 85 | predicted_quit_smoking <- outcome_model |> 86 | augment(newdata = nhefs_censored |> mutate(qsmk = 1)) |> 87 | select(quit_smoking = .fitted) 88 | 89 | # summarize the mean difference 90 | bind_cols(predicted_kept_smoking, predicted_quit_smoking) |> 91 | summarise( 92 | est = mean(quit_smoking - kept_smoking) 93 | ) 94 | ``` 95 | 96 | ## Stretch goal: Bootstrapped intervals 97 | 98 | Finish early? Try bootstrapping the G-computation model with censoring weights 99 | 100 | Remember, you need to bootstrap the entire modeling process, including fitting both regression models, cloning the data sets, and calculating the effects. 101 | 102 | ```{r} 103 | fit_gcomp_cens <- function(split, ...) { 104 | .df <- analysis(split) 105 | 106 | # fit the censoring model. remember to model using `.df` instead of `nhefs_censored` 107 | cens_model <- glm( 108 | censored ~ qsmk + sex + race + age + I(age^2) + education + 109 | smokeintensity + I(smokeintensity^2) + 110 | smokeyrs + I(smokeyrs^2) + exercise + active + 111 | wt71 + I(wt71^2), 112 | data = .df, 113 | family = binomial() 114 | ) 115 | # calculate the inverse probability of censoring weights. remember to predict `.df` instead of `nhefs_censored` 116 | cens <- cens_model |> 117 | augment(type.predict = "response", data = .df) |> 118 | mutate(cens_wts = 1 / ifelse(censored == 0, 1 - .fitted, 1)) |> 119 | select(id, cens_wts) 120 | 121 | # join all the weights data from above to .df 122 | nhefs_censored_wts <- .df |> 123 | left_join(cens, by = "id") 124 | 125 | # fit outcome model. 
remember to model using `nhefs_censored_wts` instead of `nhefs_censored` or `.df` 126 | standardized_model <- lm( 127 | wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity + 128 | I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs + 129 | I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2), 130 | data = nhefs_censored_wts, 131 | weights = cens_wts 132 | ) 133 | 134 | # clone datasets. remember to clone `.df` instead of `nhefs_censored` 135 | kept_smoking <- .df |> 136 | mutate(qsmk = 0) 137 | 138 | quit_smoking <- .df |> 139 | mutate(qsmk = 1) 140 | 141 | # predict change in weight for each cloned dataset 142 | predicted_kept_smoking <- standardized_model |> 143 | augment(newdata = kept_smoking) |> 144 | select(kept_smoking = .fitted) 145 | 146 | predicted_quit_smoking <- standardized_model |> 147 | augment(newdata = quit_smoking) |> 148 | select(quit_smoking = .fitted) 149 | 150 | # calculate ATE 151 | bind_cols(predicted_kept_smoking, predicted_quit_smoking) |> 152 | summarize( 153 | mean_quit_smoking = mean(quit_smoking), 154 | mean_kept_smoking = mean(kept_smoking), 155 | difference = mean_quit_smoking - mean_kept_smoking 156 | ) |> 157 | # rsample expects a `term` and `estimate` column 158 | pivot_longer(everything(), names_to = "term", values_to = "estimate") 159 | } 160 | 161 | gcomp_results <- bootstraps(nhefs_censored, 1000, apparent = TRUE) |> 162 | mutate(results = map(splits, fit_gcomp_cens)) 163 | 164 | # using bias-corrected confidence intervals 165 | boot_estimate <- int_bca(gcomp_results, results, .fn = fit_gcomp_cens) 166 | 167 | boot_estimate 168 | ``` 169 | 170 | *** 171 | 172 | # Take aways 173 | 174 | * If loss to follow-up is potentially related to your study question, inverse probability of censoring weights can help mitigate the bias. 175 | * You can use them in many types of models. If you're also using propensity score weights, simply multiply the weights together, then include the result as the weights for your outcome model. 176 | -------------------------------------------------------------------------------- /04-dags-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Causal Diagrams in R" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(ggdag) 10 | library(dagitty) 11 | set.seed(1234) 12 | ``` 13 | 14 | ## Your Turn 1 15 | 16 | Descriptively, drinking coffee is associated with lung cancer. Does drinking coffee cause lung cancer? 17 | 18 | 1. Specify a DAG with `dagify()`. Write your assumption that `smoking` causes `cancer` as a formula. 19 | 2. We're going to assume that coffee does *not* cause cancer, so there's no formula for that. But we still need to declare our causal question. Specify "coffee" as the exposure and "cancer" as the outcome (both in quotation marks). 20 | 3. Plot the DAG using `ggdag()` 21 | 22 | Stretch goal: Under the hood, `ggdag()` calls `tidy_dagitty()` on `dagitty` objects. Check the help page for `?tidy_dagitty()` and try using one of the layouts listed in `?layout_tbl_graph_igraph()`. Experiment with different layouts. Plot with `ggdag()` 23 | 24 | Stretch goal: We are assuming that coffee does *not* cause lung cancer. In causal diagram terminology, we want to know if these two factors are *D-connected* (as in, directionally connected). Since we are assuming that there's no causal relationship, any D-connection is caused by other factors. 
Try `ggdag_dconnected()` to see if coffee and lung cancer are D-connected. 25 | 26 | ```{r} 27 | coffee_cancer_dag <- dagify( 28 | cancer ~ smoking, 29 | smoking ~ addictive, 30 | coffee ~ addictive, 31 | exposure = "coffee", 32 | outcome = "cancer", 33 | labels = c( 34 | "coffee" = "Coffee", 35 | "cancer" = "Lung Cancer", 36 | "smoking" = "Smoking", 37 | "addictive" = "Addictive \nBehavior" 38 | ) 39 | ) 40 | 41 | ggdag(coffee_cancer_dag) 42 | ggdag_dconnected(coffee_cancer_dag) 43 | ``` 44 | 45 | ## Your Turn 2 46 | 47 | Most {ggdag} quick plotting functions are actually wrappers for functions that let you work with the DAG data directly. 48 | 49 | 1. Call `tidy_dagitty()` on `coffee_cancer_dag` to create a tidy DAG, then pass the results to `dag_paths()`. What's different about these data? 50 | 2. Plot the open paths with `ggdag_paths()`. (Just give it `coffee_cancer_dag` rather than using `dag_paths()`; the quick plot function will do that for you.) Remember, since we assume there is *no* causal path from coffee to lung cancer, any open paths must be confounding pathways. 51 | 52 | Stretch goal: ggdags are just ggplots! You can add themes, geoms, and other {ggplot2} elements with `+` like a normal ggplot. Try adding a theme ({ggdag} has several, or you could try `theme_void()`). 53 | 54 | Stretch goal: The variable names are a little hard to read. We specified some labels earlier, so let's use them. In the tidy DAG, these are called `label`. In `ggdag_paths()`, add the argument `use_labels = "label"` and remove the node text with `text = FALSE`. 55 | 56 | ```{r} 57 | coffee_cancer_dag |> 58 | tidy_dagitty() |> 59 | dag_paths() 60 | 61 | ggdag_paths(coffee_cancer_dag) 62 | 63 | # stretch goals 64 | ggdag_paths( 65 | coffee_cancer_dag, 66 | use_labels = "label", 67 | text = FALSE 68 | ) + 69 | theme_dag() 70 | ``` 71 | 72 | 73 | ## Your Turn 3 74 | 75 | Now that we know the open, confounding pathways (sometimes called "backdoor paths"), we need to know how to close them! First, we'll ask {ggdag} for adjustment sets, then we would need to do something in our analysis to account for at least one adjustment set (e.g. multivariable regression, weighting, or matching for the adjustment sets). 76 | 77 | 1. Use `ggdag_adjustment_set()` to visualize the adjustment sets. Add the arguments `use_labels = "label"` and `text = FALSE`. 78 | 2. Write an R formula for each adjustment set, as you might if you were fitting a model in `lm()` or `glm()`. 79 | 80 | Stretch goal: Use `dagitty::adjustmentSets()` to print the adjustment sets to the console. 81 | 82 | Stretch goal: Sometimes, we know a variable plays a vital role in a causal diagram but we can't measure it, or we simply don't have it in our data set. You can tell {ggdag} that a variable is unmeasured with the `latent` argument in `dagify()`. Re-run the `dagify()` call above, but set `latent = "addictive"` (meaning we can't or haven't measured this variable). Plot it with `ggdag_adjustment_set()`. What's different? Now, try setting `latent = c("addictive", "smoking")` and plotting the adjustment set. What do the results mean? 
82 | 83 | ```{r} 84 | ggdag_adjustment_set(coffee_cancer_dag, use_labels = "label", text = FALSE) 85 | 86 | # each adjustment set, written as a model formula with the exposure included 87 | cancer ~ coffee + addictive 88 | cancer ~ coffee + smoking 89 | ``` 90 | 91 | ```{r} 92 | # stretch goal 93 | coffee_cancer_dag2 <- dagify( 94 | cancer ~ smoking, 95 | smoking ~ addictive, 96 | coffee ~ addictive, 97 | exposure = "coffee", 98 | outcome = "cancer", 99 | # `addictive` is unmeasured in our data 100 | latent = "addictive", 101 | labels = c( 102 | "coffee" = "Coffee", 103 | "cancer" = "Lung Cancer", 104 | "smoking" = "Smoking", 105 | "addictive" = "Addictive \nBehavior" 106 | ) 107 | ) 108 | 109 | ggdag_adjustment_set(coffee_cancer_dag2) 110 | ``` 111 | 112 | ```{r} 113 | # stretch goal 114 | coffee_cancer_dag3 <- dagify( 115 | cancer ~ smoking, 116 | smoking ~ addictive, 117 | coffee ~ addictive, 118 | exposure = "coffee", 119 | outcome = "cancer", 120 | # `addictive` and `smoking` are unmeasured in our data 121 | latent = c("smoking", "addictive"), 122 | labels = c( 123 | "coffee" = "Coffee", 124 | "cancer" = "Lung Cancer", 125 | "smoking" = "Smoking", 126 | "addictive" = "Addictive \nBehavior" 127 | ) 128 | ) 129 | 130 | ggdag_adjustment_set(coffee_cancer_dag3) 131 | ``` 132 | 133 | ## Your Turn 4 134 | 135 | Time-ordering your DAGs is incredibly useful, because it makes it cleaner, easier to read, and easier to understand what to control for --- and *not* control for. 136 | 137 | ggdag includes several ways to manually specify coordinates, including `time_ordered_coords()`. This function automatically determines the time ordering of the DAG based on the causal relationships. After all, something that causes something else *must* come before it; otherwise, we'd be violating the time-space continuum. 138 | 139 | Recreate the DAG we've been working with using `time_ordered_coords()`, then visualize the DAG. You don't need to use any arguments for this function, so `coords = time_ordered_coords()` will do. 140 | 141 | ```{r} 142 | coffee_cancer_dag_to <- dagify( 143 | cancer ~ smoking, 144 | smoking ~ addictive, 145 | coffee ~ addictive, 146 | exposure = "coffee", 147 | outcome = "cancer", 148 | coords = time_ordered_coords(), 149 | labels = c( 150 | "coffee" = "Coffee", 151 | "cancer" = "Lung Cancer", 152 | "smoking" = "Smoking", 153 | "addictive" = "Addictive \nBehavior" 154 | ) 155 | ) 156 | 157 | ggdag(coffee_cancer_dag_to, use_labels = "label", text = FALSE) 158 | ``` 159 | 160 | # Take aways 161 | 162 | * Draw your assumptions with DAGs! Use `dagify()` to specify them and `ggdag()` and friends to draw them. 163 | * The main goal for many analyses is to close backdoor (non-causal) paths. {ggdag} and {dagitty} can help you identify them. 164 | * Adjustment sets are key for closing backdoor paths. Take a reasonable set and use it in your model to get a causal effect estimate. 165 | -------------------------------------------------------------------------------- /14-bonus-continuous-pscores-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Propensity scores for continuous exposures" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(touringplans) 11 | library(propensity) 12 | ``` 13 | 14 | For this set of exercises, we'll use propensity scores for continuous exposures. 15 | 16 | In the touringplans data set, we have information about the posted waiting times for rides. We also have a limited amount of data on the observed, actual times. 
The question that we will consider is this: Do posted wait times (`wait_minutes_posted_avg`) for the Seven Dwarfs Mine Train at 8 am affect actual wait times (`wait_minutes_actual_avg`) at 9 am? Here’s our DAG: 17 | 18 | ```{r} 19 | #| echo: false 20 | #| message: false 21 | #| warning: false 22 | library(ggdag) 23 | library(ggokabeito) 24 | 25 | coord_dag <- list( 26 | x = c(Season = -1, close = -1, weather = -2, extra = 0, x = 1, y = 2), 27 | y = c(Season = -1, close = 1, weather = 0, extra = 0, x = 0, y = 0) 28 | ) 29 | 30 | labels <- c( 31 | extra = "Extra Magic Morning", 32 | x = "Average posted wait", 33 | y = "Average actual wait", 34 | Season = "Ticket Season", 35 | weather = "Historic high temperature", 36 | close = "Time park closed" 37 | ) 38 | 39 | dagify( 40 | y ~ x + close + Season + weather + extra, 41 | x ~ weather + close + Season + extra, 42 | extra ~ weather + close + Season, 43 | coords = coord_dag, 44 | labels = labels, 45 | exposure = "x", 46 | outcome = "y" 47 | ) |> 48 | tidy_dagitty() |> 49 | node_status() |> 50 | ggplot( 51 | aes(x, y, xend = xend, yend = yend, color = status) 52 | ) + 53 | geom_dag_edges_arc(curvature = c(rep(0, 7), .2, 0, .2, .2, 0), edge_colour = "grey70") + 54 | geom_dag_point() + 55 | geom_dag_label_repel( 56 | aes(x, y, label = label), 57 | box.padding = 3.5, 58 | inherit.aes = FALSE, 59 | max.overlaps = Inf, 60 | family = "sans", 61 | seed = 1602, 62 | label.size = NA, 63 | label.padding = 0.1, 64 | size = 14 / 3 65 | ) + 66 | scale_color_okabe_ito(na.value = "grey90") + 67 | theme_dag() + 68 | theme( 69 | legend.position = "none", 70 | axis.text.x = element_text() 71 | ) + 72 | coord_cartesian(clip = "off") + 73 | scale_x_continuous( 74 | limits = c(-2.25, 2.25), 75 | breaks = c(-2, -1, 0, 1, 2), 76 | labels = c( 77 | "\n(one year ago)", 78 | "\n(6 months ago)", 79 | "\n(3 months ago)", 80 | "8am-9am\n(Today)", 81 | "9am-10am\n(Today)" 82 | ) 83 | ) 84 | ``` 85 | 86 | First, let’s wrangle our data to address our question: do posted wait times at 8 affect actual wait times at 9? We’ll join the baseline data (all covariates and posted wait time at 8) with the outcome (average actual time). We also have a lot of missingness for `wait_minutes_actual_avg`, so we’ll drop unobserved values for now. 87 | 88 | You don't need to update any code here, so just run this. 89 | 90 | ```{r} 91 | eight <- seven_dwarfs_train_2018 |> 92 | filter(wait_hour == 8) |> 93 | select(-wait_minutes_actual_avg) 94 | 95 | nine <- seven_dwarfs_train_2018 |> 96 | filter(wait_hour == 9) |> 97 | select(park_date, wait_minutes_actual_avg) 98 | 99 | wait_times <- eight |> 100 | left_join(nine, by = "park_date") |> 101 | drop_na(wait_minutes_actual_avg) 102 | ``` 103 | 104 | # Your Turn 1 105 | 106 | First, let’s calculate the propensity score model, which will be the denominator in our stabilized weights (more to come on that soon). We’ll fit a model using `lm()` for `wait_minutes_posted_avg` with our covariates, then use the fitted predictions (`.fitted`) and residual standard deviation (`.sigma`) to calculate the density using `dnorm()`. 107 | 108 | 1. Fit a model using `lm()` with `wait_minutes_posted_avg` as the outcome and the confounders identified in the DAG. 109 | 2. Use `augment()` to add model predictions to the data frame. 110 | 3. In `wt_ate()`, calculate the weights using `wait_minutes_posted_avg`, `.fitted`, and `.sigma`. 
110 | 111 | ```{r} 112 | post_time_model <- lm( 113 | wait_minutes_posted_avg ~ park_close + park_extra_magic_morning + park_temperature_high + park_ticket_season, 114 | data = wait_times 115 | ) 116 | 117 | wait_times_wts <- post_time_model |> 118 | augment(data = wait_times) |> 119 | mutate(wts = wt_ate( 120 | wait_minutes_posted_avg, .fitted, .sigma = .sigma 121 | )) 122 | ``` 123 | 124 | # Your Turn 2 125 | 126 | As with the example in the slides, we have a lot of extreme values for our weights 127 | 128 | ```{r} 129 | wait_times_wts |> 130 | ggplot(aes(wts)) + 131 | geom_density(col = "#E69F00", fill = "#E69F0095") + 132 | scale_x_log10() + 133 | theme_minimal(base_size = 20) + 134 | xlab("Weights") 135 | ``` 136 | 137 | Let’s now fit the marginal density to use for stabilized weights: 138 | 139 | 1. Re-fit the above using stabilized weights 140 | 141 | ```{r} 142 | wait_times_swts <- post_time_model |> 143 | augment(data = wait_times) |> 144 | mutate(swts = wt_ate( 145 | wait_minutes_posted_avg, 146 | .fitted, 147 | .sigma = .sigma, 148 | stabilize = TRUE 149 | )) 150 | ``` 151 | 152 | Take a look at the weights now that we've stabilized them: 153 | 154 | ```{r} 155 | ggplot(wait_times_swts, aes(swts)) + 156 | geom_density(col = "#E69F00", fill = "#E69F0095") + 157 | scale_x_log10() + 158 | theme_minimal(base_size = 20) + 159 | xlab("Stabilized Weights") 160 | ``` 161 | 162 | # Your Turn 3 163 | 164 | Now, let's fit the outcome model! 165 | 166 | 1. Estimate the relationship between posted wait times and actual wait times using the stabilized weights we just created. 167 | 168 | ```{r} 169 | lm(wait_minutes_actual_avg ~ wait_minutes_posted_avg, weights = swts, data = wait_times_swts) |> 170 | tidy() |> 171 | filter(term == "wait_minutes_posted_avg") |> 172 | mutate(estimate = estimate * 10) # rescale: effect of a 10-minute increase in posted wait 173 | ``` 174 | 175 | ## Stretch goal: Bootstrapped intervals 176 | 177 | Bootstrap confidence intervals for our estimate. 178 | 179 | There's nothing new here. Just remember, you need to bootstrap the entire modeling process! 180 | 181 | ```{r} 182 | set.seed(1234) 183 | library(rsample) 184 | 185 | fit_model <- function(split, ...) { 186 | .df <- analysis(split) 187 | 188 | # fill in the rest! 189 | 190 | post_time_model <- lm( 191 | wait_minutes_posted_avg ~ park_close + park_extra_magic_morning + park_temperature_high + park_ticket_season, 192 | data = .df 193 | ) 194 | 195 | .df_swts <- post_time_model |> 196 | augment(data = .df) |> 197 | mutate(swts = wt_ate( 198 | wait_minutes_posted_avg, 199 | .fitted, 200 | .sigma = .sigma, 201 | stabilize = TRUE 202 | )) 203 | 204 | lm(wait_minutes_actual_avg ~ wait_minutes_posted_avg, weights = swts, data = .df_swts) |> 205 | tidy() |> 206 | filter(term == "wait_minutes_posted_avg") |> 207 | mutate(estimate = estimate * 10) 208 | } 209 | 210 | model_estimate <- bootstraps(wait_times, 1000, apparent = TRUE) |> 211 | mutate(results = map(splits, fit_model)) 212 | 213 | # using bias-corrected confidence intervals 214 | boot_estimate <- int_bca(model_estimate, results, .fn = fit_model) 215 | 216 | boot_estimate 217 | ``` 218 | 219 | *** 220 | 221 | # Take aways 222 | 223 | * We can calculate propensity scores for continuous exposures. `wt_ate()` uses `dnorm()` to transform predictions to a propensity-like scale via the normal density; we need to give `wt_ate()` `.sigma` to do this. We can also use other approaches, like quantile binning of the exposure or calculating probability-based propensity scores using categorical regression models. 
* Continuous exposures are prone to misspecification and usually need to be stabilized. A simple stabilization multiplies the inverse probability weights by the marginal density of the exposure, which we can estimate with an intercept-only model such as `lm(exposure ~ 1)`. `wt_ate()` can do this for you automatically with `stabilize = TRUE`. This also applies to other types of exposures. 226 | * Stabilization is useful for any type of exposure where the weights are unbounded. Bounded weights, like ATO weights, are less susceptible to extreme values. 227 | * Using propensity scores for continuous exposures in outcome models is identical to using them with binary exposures. 228 | * Because propensity scores for continuous exposures are prone to positivity violations, check the bootstrap distribution of your estimate for skew and to see if the mean estimate is different from your regression model. If these problems are present, you may need to use another approach like g-computation. 229 | -------------------------------------------------------------------------------- /10-continuous-g-computation-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Continuous exposures and g-computation" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(touringplans) 11 | library(splines) 12 | ``` 13 | 14 | For this set of exercises, we'll use g-computation to calculate a causal effect for continuous exposures. 15 | 16 | In the touringplans data set, we have information about the posted waiting times for rides. We also have a limited amount of data on the observed, actual times. The question that we will consider is this: Do posted wait times (`wait_minutes_posted_avg`) for the Seven Dwarfs Mine Train at 8 am affect actual wait times (`wait_minutes_actual_avg`) at 9 am? 
Here’s our DAG: 17 | 18 | ```{r} 19 | #| echo: false 20 | #| message: false 21 | #| warning: false 22 | library(ggdag) 23 | library(ggokabeito) 24 | 25 | coord_dag <- list( 26 | x = c(Season = -1, close = -1, weather = -2, extra = 0, x = 1, y = 2), 27 | y = c(Season = -1, close = 1, weather = 0, extra = 0, x = 0, y = 0) 28 | ) 29 | 30 | labels <- c( 31 | extra = "Extra Magic Morning", 32 | x = "Average posted wait", 33 | y = "Average actual wait", 34 | Season = "Ticket Season", 35 | weather = "Historic high temperature", 36 | close = "Time park closed" 37 | ) 38 | 39 | dagify( 40 | y ~ x + close + Season + weather + extra, 41 | x ~ weather + close + Season + extra, 42 | extra ~ weather + close + Season, 43 | coords = coord_dag, 44 | labels = labels, 45 | exposure = "x", 46 | outcome = "y" 47 | ) |> 48 | tidy_dagitty() |> 49 | node_status() |> 50 | ggplot( 51 | aes(x, y, xend = xend, yend = yend, color = status) 52 | ) + 53 | geom_dag_edges_arc(curvature = c(rep(0, 7), .2, 0, .2, .2, 0), edge_colour = "grey70") + 54 | geom_dag_point() + 55 | geom_dag_label_repel( 56 | aes(x, y, label = label), 57 | box.padding = 3.5, 58 | inherit.aes = FALSE, 59 | max.overlaps = Inf, 60 | family = "sans", 61 | seed = 1602, 62 | label.size = NA, 63 | label.padding = 0.1, 64 | size = 14 / 3 65 | ) + 66 | scale_color_okabe_ito(na.value = "grey90") + 67 | theme_dag() + 68 | theme( 69 | legend.position = "none", 70 | axis.text.x = element_text() 71 | ) + 72 | coord_cartesian(clip = "off") + 73 | scale_x_continuous( 74 | limits = c(-2.25, 2.25), 75 | breaks = c(-2, -1, 0, 1, 2), 76 | labels = c( 77 | "\n(one year ago)", 78 | "\n(6 months ago)", 79 | "\n(3 months ago)", 80 | "8am-9am\n(Today)", 81 | "9am-10am\n(Today)" 82 | ) 83 | ) 84 | ``` 85 | 86 | First, let’s wrangle our data to address our question: do posted wait times at 8 affect actual wait times at 9? We’ll join the baseline data (all covariates and posted wait time at 8) with the outcome (average actual time). We also have a lot of missingness for `wait_minutes_actual_avg`, so we’ll drop unobserved values for now. 87 | 88 | You don't need to update any code here, so just run this. 89 | 90 | ```{r} 91 | eight <- seven_dwarfs_train_2018 |> 92 | filter(wait_hour == 8) |> 93 | select(-wait_minutes_actual_avg) 94 | 95 | nine <- seven_dwarfs_train_2018 |> 96 | filter(wait_hour == 9) |> 97 | select(park_date, wait_minutes_actual_avg) 98 | 99 | wait_times <- eight |> 100 | left_join(nine, by = "park_date") |> 101 | drop_na(wait_minutes_actual_avg) 102 | ``` 103 | 104 | # Your Turn 1 105 | 106 | For the parametric G-formula, we'll use a single model to fit a causal model of Posted Waiting Times (`wait_minutes_posted_avg`) on Actual Waiting Times (`wait_minutes_actual_avg`) where we include all covariates, much as we normally fit regression models. However, instead of interpreting the coefficients, we'll calculate the estimate by predicting on cloned data sets. 107 | 108 | Two additional differences in our model: we'll use a natural cubic spline on the exposure, `wait_minutes_posted_avg`, using `ns()` from the splines package, and we'll include an interaction term between `wait_minutes_posted_avg` and `park_extra_magic_morning`. These complicate the interpretation of the coefficient of the model in normal regression but have virtually no downside (as long as we have a reasonable sample size) in g-computation, because we still get an easily interpretable result. 109 | 110 | First, let's fit the model. 
110 | 111 | 1. Use `lm()` to create a model with the outcome, exposure, and confounders identified in the DAG. 112 | 2. Save the model as `standardized_model`. 113 | 114 | ```{r} 115 | 116 | standardized_model <- lm( 117 | wait_minutes_actual_avg ~ ns(wait_minutes_posted_avg, df = 2) * park_extra_magic_morning + park_temperature_high + park_ticket_season + park_close, 118 | data = wait_times 119 | ) 120 | ``` 121 | 122 | # Your Turn 2 123 | 124 | Now that we've fit a model, we need to clone our data set. To do this, we'll simply mutate it so that in one set, all participants have `wait_minutes_posted_avg` set to 30 minutes and in another, all participants have `wait_minutes_posted_avg` set to 60 minutes. 125 | 126 | 1. Create the cloned data sets, called `thirty` and `sixty`. 127 | 2. For both data sets, use `standardized_model` and `augment()` to get the predicted values. Use the `newdata` argument in `augment()` with the relevant cloned data set. Then, select only the fitted value. Rename `.fitted` to either `thirty_posted_minutes` or `sixty_posted_minutes` (use the pattern `select(new_name = old_name)`). 128 | 3. Save the predicted data sets as `predicted_thirty` and `predicted_sixty`. 129 | 130 | ```{r} 131 | thirty <- wait_times |> 132 | mutate(wait_minutes_posted_avg = 30) 133 | 134 | sixty <- wait_times |> 135 | mutate(wait_minutes_posted_avg = 60) 136 | 137 | predicted_thirty <- standardized_model |> 138 | augment(newdata = thirty) |> 139 | select(thirty_posted_minutes = .fitted) 140 | 141 | predicted_sixty <- standardized_model |> 142 | augment(newdata = sixty) |> 143 | select(sixty_posted_minutes = .fitted) 144 | ``` 145 | 146 | # Your Turn 3 147 | 148 | Finally, we'll get the mean differences between the values. 149 | 150 | 1. Bind `predicted_thirty` and `predicted_sixty` using `bind_cols()` 151 | 2. Summarize the predicted values to create three new variables: `mean_thirty`, `mean_sixty`, and `difference`. The first two should be the means of `thirty_posted_minutes` and `sixty_posted_minutes`. `difference` should be `mean_sixty` minus `mean_thirty`. 152 | 153 | ```{r} 154 | bind_cols(predicted_thirty, predicted_sixty) |> 155 | summarize( 156 | mean_thirty = mean(thirty_posted_minutes), 157 | mean_sixty = mean(sixty_posted_minutes), 158 | difference = mean_sixty - mean_thirty 159 | ) 160 | ``` 161 | 162 | That's it! `difference` is our effect estimate, marginalized over the spline terms, interaction effects, and confounders. 163 | 164 | ## Stretch goal: Bootstrapped intervals 165 | 166 | As with propensity-based models, we need to do a little more work to get correct standard errors and confidence intervals. In this stretch goal, use rsample to bootstrap the estimates we got from the G-computation model. 167 | 168 | Remember, you need to bootstrap the entire modeling process, including the regression model, cloning the data sets, and calculating the effects. 169 | 170 | ```{r} 171 | set.seed(1234) 172 | library(rsample) 173 | 174 | fit_gcomp <- function(split, ...) { 175 | .df <- analysis(split) 176 | 177 | # fit outcome model. remember to model using `.df` instead of `wait_times` 178 | standardized_model <- lm( 179 | wait_minutes_actual_avg ~ ns(wait_minutes_posted_avg, df = 2) * park_extra_magic_morning + park_temperature_high + park_ticket_season + park_close, 180 | data = .df 181 | ) 182 | 183 | # clone datasets. 
remember to clone `.df` instead of `wait_times`
184 |   thirty <- .df |>
185 |     mutate(wait_minutes_posted_avg = 30)
186 | 
187 |   sixty <- .df |>
188 |     mutate(wait_minutes_posted_avg = 60)
189 | 
190 |   # predict actual wait time for each cloned dataset
191 |   predicted_thirty <- standardized_model |>
192 |     augment(newdata = thirty) |>
193 |     select(thirty_posted_minutes = .fitted)
194 | 
195 |   predicted_sixty <- standardized_model |>
196 |     augment(newdata = sixty) |>
197 |     select(sixty_posted_minutes = .fitted)
198 | 
199 |   # calculate ATE
200 |   bind_cols(predicted_thirty, predicted_sixty) |>
201 |     summarize(
202 |       mean_thirty = mean(thirty_posted_minutes),
203 |       mean_sixty = mean(sixty_posted_minutes),
204 |       difference = mean_sixty - mean_thirty
205 |     ) |>
206 |     # rsample expects a `term` and `estimate` column
207 |     pivot_longer(everything(), names_to = "term", values_to = "estimate")
208 | }
209 | 
210 | gcomp_results <- bootstraps(wait_times, 1000, apparent = TRUE) |>
211 |   mutate(results = map(splits, fit_gcomp))
212 | 
213 | # using bias-corrected confidence intervals
214 | boot_estimate <- int_bca(gcomp_results, results, .fn = fit_gcomp)
215 | 
216 | boot_estimate
217 | ```
218 | 
219 | ***
220 | 
221 | # Take aways
222 | 
223 | * To fit the parametric G-formula, fit a standardized model with all covariates. Then, use cloned data sets with values set to each level of the exposure you want to study.
224 | * Use the model to predict the values for that level of the exposure and compute the effect estimate you want.
225 | 
-------------------------------------------------------------------------------- /01-whole-game-exercises.qmd: --------------------------------------------------------------------------------
1 | ---
2 | title: "Causal Modeling in R: Whole Game"
3 | format: html
4 | ---
5 | 
6 | ```{r}
7 | #| label: setup
8 | library(tidyverse)
9 | library(broom)
10 | library(rsample)
11 | library(ggdag)
12 | library(causaldata)
13 | library(halfmoon)
14 | library(propensity)
15 | library(ggokabeito)
16 | 
17 | set.seed(1234)
18 | ```
19 | 
20 | ## Causal Modeling: Whole Game
21 | 
22 | In this guided exercise, we'll attempt to answer a causal question: does quitting smoking make you gain weight? Causal modeling has a special place in the history of smoking research: the studies that demonstrated that smoking causes lung cancer were observational. Thanks to other studies, we also know that, if you're already a smoker, quitting smoking reduces your risk of lung cancer. However, some have observed that former smokers tend to gain weight. Is this the result of quitting smoking, or does something else explain this effect? In the book *Causal Inference* by Hernán and Robins, the authors analyze this question using several causal inference techniques.
23 | 
24 | To answer this question, we'll use causal inference methods to examine the relationship between quitting smoking and gaining weight. First, we'll draw our assumptions with a causal diagram (a directed acyclic graph, or DAG), which will guide our model. Then, we'll use a modeling approach called inverse probability weighting--one of many causal modeling techniques--to estimate the causal effect we're interested in.
25 | 
26 | We'll use data from NHEFS to try to estimate the causal effect of quitting smoking on weight gain. NHEFS is a longitudinal, observational study that has many of the variables we'll need. Take a look at `nhefs_codebook` if you want to know more about the variables in this data set. These data are included in the {causaldata} package. 
We'll use the `nhefs_complete` data set, but we'll remove people who were lost to follow-up.
27 | 
28 | ```{r}
29 | nhefs_complete_uc <- nhefs_complete |>
30 |   filter(censored == 0)
31 | nhefs_complete_uc
32 | ```
33 | 
34 | Let's look at the distribution of weight gain between the two groups.
35 | 
36 | ```{r}
37 | nhefs_complete_uc |>
38 |   ggplot(aes(wt82_71, fill = factor(qsmk))) +
39 |   geom_vline(xintercept = 0, color = "grey60", linewidth = 1) +
40 |   geom_density(color = "white", alpha = .75, linewidth = .5) +
41 |   # fill scale to match the fill aesthetic mapped above
42 |   scale_fill_okabe_ito(order = c(1, 5)) +
43 |   theme_minimal() +
44 |   theme(legend.position = "bottom") +
45 |   labs(
46 |     x = "change in weight (kg)",
47 |     fill = "quit smoking (1 = yes)"
48 |   )
49 | ```
50 | 
51 | There's a difference--former smokers do seem to have gained a bit more weight--but there's also a lot of variation. Let's look at the numeric summaries.
52 | 
53 | ```{r}
54 | # ~2.5 kg gained for quit vs. not quit
55 | nhefs_complete_uc |>
56 |   group_by(qsmk) |>
57 |   summarize(
58 |     mean_weight_change = mean(wt82_71),
59 |     sd = sd(wt82_71),
60 |     .groups = "drop"
61 |   )
62 | ```
63 | 
64 | Here, it looks like those who quit smoking gained, on average, 2.5 kg. But is there something else that could explain these results? There are many factors associated with both quitting smoking and gaining weight; could one of those factors explain away the results we're seeing here?
65 | 
66 | To truly answer this question, we need to specify a causal diagram based on domain knowledge. Sadly, for most circumstances, there is no data-driven approach that consistently identifies confounders. Only our causal assumptions can help us identify them. Causal diagrams are a visual expression of those assumptions linked to rigorous mathematics that allow us to understand what we need to account for in our model.
67 | 
68 | In R, we can visualize and analyze our DAGs with the {ggdag} package. {ggdag} uses {ggplot2} and {ggraph} to visualize diagrams and {dagitty} to analyze them. Let's set up our assumptions. The `dagify()` function takes formulas, much like `lm()` and friends, to express assumptions. We have two basic causal structures: the causes of quitting smoking and the causes of gaining weight. We're assuming that the same set of variables affects both. Additionally, we're adding `qsmk` as a cause of `wt82_71`, which is our causal question; we also identify these as our outcome and exposure. Finally, we'll add some labels so the diagram is easier to understand. The result is a `dagitty` object, and we can transform it to a `tidy_dagitty` data set with `tidy_dagitty()`. 
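To see the pattern in miniature before we build the full DAG, here's a toy example (hypothetical variables `x`, `y`, and `z`; this isn't our analysis):

```{r}
# z confounds the effect of x on y
toy_dag <- dagify(
  y ~ x + z,
  x ~ z,
  exposure = "x",
  outcome = "y"
)

toy_dag
```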
68 | 
69 | ```{r}
70 | # set up DAG
71 | smk_wt_dag <- dagify(
72 |   # specify causes of quitting smoking and weight gain:
73 |   qsmk ~ sex + race + age + education +
74 |     smokeintensity + smokeyrs + exercise + active + wt71,
75 |   wt82_71 ~ qsmk + sex + race + age + education +
76 |     smokeintensity + smokeyrs + exercise + active + wt71,
77 |   # specify causal question:
78 |   exposure = "qsmk",
79 |   outcome = "wt82_71",
80 |   coords = time_ordered_coords(),
81 |   # set up labels:
82 |   # here, I'll use the same variable names as the data set, but I'll label them
83 |   # with clearer names
84 |   labels = c(
85 |     # causal question
86 |     "qsmk" = "quit\nsmoking",
87 |     "wt82_71" = "change in\nweight",
88 | 
89 |     # demographics
90 |     "age" = "age",
91 |     "sex" = "sex",
92 |     "race" = "race",
93 |     "education" = "education",
94 | 
95 |     # health
96 |     "wt71" = "baseline\nweight",
97 |     "active" = "daily\nactivity\nlevel",
98 |     "exercise" = "exercise",
99 | 
100 |     # smoking history
101 |     "smokeintensity" = "smoking\nintensity",
102 |     "smokeyrs" = "yrs of\nsmoking"
103 |   )
104 | ) |>
105 |   tidy_dagitty()
106 | 
107 | smk_wt_dag
108 | ```
109 | 
110 | Let's visualize our assumptions with `ggdag()`.
111 | 
112 | ```{r}
113 | smk_wt_dag |>
114 |   ggdag(text = FALSE, use_labels = "label")
115 | ```
116 | 
117 | What do we need to control for to estimate an unbiased effect of quitting smoking on weight gain? In many DAGs, there will be many sets of variables--called adjustment sets--that will give us the right effect (assuming our DAG is correct--a big, unverifiable assumption!). `ggdag_adjustment_set()` can help you visualize them. Here, there's only one adjustment set: we need to control for everything! While we're at it, since a {ggdag} plot is just a {ggplot2} plot, let's clean it up a bit, too.
118 | 
119 | ```{r}
120 | smk_wt_dag |>
121 |   ggdag_adjustment_set(text = FALSE, use_labels = "label") +
122 |   theme_dag() +
123 |   scale_color_okabe_ito(order = c(1, 5)) +
124 |   scale_fill_okabe_ito(order = c(1, 5))
125 | ```
126 | 
127 | Let's fit a model with these variables. Note that we'll fit all continuous variables with squared terms, as well, to allow them a bit of flexibility.
128 | 
129 | ```{r}
130 | lm(
131 |   wt82_71 ~ qsmk + sex +
132 |     race + age + I(age^2) + education +
133 |     smokeintensity + I(smokeintensity^2) +
134 |     smokeyrs + I(smokeyrs^2) + exercise + active +
135 |     wt71 + I(wt71^2),
136 |   data = nhefs_complete_uc
137 | ) |>
138 |   tidy(conf.int = TRUE) |>
139 |   filter(term == "qsmk")
140 | ```
141 | 
142 | When we adjust for the variables in our DAG, we get an estimate of about 3.5 kg--people who quit smoking gained about this amount of weight. However, we are trying to answer a specific causal question: how much weight would a person gain if they quit smoking vs. if the same person did not quit smoking? Let's use an inverse probability weighting model to try to estimate that effect at the population level (what if *everyone* quit smoking vs what if *no one* quit smoking).
143 | 
144 | For a simple IPW model, we have two modeling steps. First, we fit a propensity score model, which predicts the probability that you received a treatment or exposure (here, that a participant quit smoking). We use this model to calculate inverse probability weights--1 / your probability of treatment. Then, in the second step, we use these weights in the outcome model, which estimates the effect of exposure on the outcome (here, the effect of quitting smoking on gaining weight). 
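Concretely, writing $e(X_i)$ for the propensity score and $A_i$ for the exposure, the ATE weights are (this is the standard formula, not anything specific to this data):

$$
w_i = \frac{A_i}{e(X_i)} + \frac{1 - A_i}{1 - e(X_i)}
$$

Participants who quit get $1 / e(X_i)$ and participants who didn't get $1 / (1 - e(X_i))$; this is what `wt_ate()` calculates for a binary exposure.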
145 | 
146 | For the propensity score model, we'll use logistic regression (since quitting smoking is a binary variable). The outcome is quitting smoking, and the variables in the model are all those included in our adjustment set. Then, we'll use `augment()` from {broom} (which calls `predict()` on the inside) to calculate our weights using `propensity::wt_ate()` and save them back into our data set.
147 | 
148 | 
149 | ```{r}
150 | propensity_model <- glm(
151 |   qsmk ~ sex +
152 |     race + age + I(age^2) + education +
153 |     smokeintensity + I(smokeintensity^2) +
154 |     smokeyrs + I(smokeyrs^2) + exercise + active +
155 |     wt71 + I(wt71^2),
156 |   family = binomial(),
157 |   data = nhefs_complete_uc
158 | )
159 | 
160 | nhefs_complete_uc <- propensity_model |>
161 |   # predict whether quit smoking
162 |   augment(type.predict = "response", data = nhefs_complete_uc) |>
163 |   # calculate inverse probability
164 |   mutate(wts = wt_ate(.fitted, qsmk))
165 | 
166 | nhefs_complete_uc |>
167 |   select(qsmk, .fitted, wts)
168 | ```
169 | 
170 | Let's look at the distribution of the weights.
171 | 
172 | ```{r}
173 | ggplot(nhefs_complete_uc, aes(wts)) +
174 |   geom_histogram(color = "white", fill = "#E69F00", bins = 50) +
175 |   # use a log scale for the x axis
176 |   scale_x_log10() +
177 |   theme_minimal(base_size = 20) +
178 |   xlab("Weights")
179 | ```
180 | 
181 | The distribution looks a little skewed; in particular, some participants have much higher weights than the rest. There are a few techniques for dealing with this--trimming weights and stabilizing weights--but we'll keep it simple for now and just use them as is.
182 | 
183 | The main goal here is to *break* the non-causal associations between quitting smoking and gaining weight--the other paths that might distort our results. In other words, if we succeed, there should be no differences in these variables between our two groups, those who quit smoking and those who didn't. This is where randomized trials shine; you can often assume that there are no baseline differences among potential confounders between your treatment groups (of course, no study is perfect, and there's a whole set of literature on dealing with this problem in randomized trials).
184 | 
185 | Standardized mean differences (SMD) are a simple measure of differences that works across variable types. In general, the closer to 0 we are, the better job we have done eliminating the non-causal relationships we drew in our DAG. Note that low SMDs for everything we adjust for do *not* mean that there is not something else that might confound our study. Unmeasured confounders or misspecified DAGs can still distort our effects, even if our SMDs look great!
186 | 
187 | We'll use the {halfmoon} package to calculate the SMDs, then visualize them.
188 | 
189 | ```{r}
190 | vars <- c(
191 |   "sex", "race", "age", "education",
192 |   "smokeintensity", "smokeyrs",
193 |   "exercise", "active", "wt71"
194 | )
195 | 
196 | plot_df <- tidy_smd(
197 |   nhefs_complete_uc,
198 |   all_of(vars),
199 |   qsmk,
200 |   wts
201 | )
202 | 
203 | ggplot(
204 |   data = plot_df,
205 |   mapping = aes(x = abs(smd), y = variable, group = method, color = method)
206 | ) +
207 |   geom_love()
208 | ```
209 | 
210 | These look pretty good! Some variables are better than others, but weighting appears to have done a much better job eliminating these differences than an unadjusted analysis.
211 | 
212 | We can also use halfmoon's `geom_mirror_histogram()` to visualize the impact that the weights are having on our population. 
213 | 
214 | ```{r}
215 | nhefs_complete_uc |>
216 |   mutate(qsmk = factor(qsmk)) |>
217 |   ggplot(aes(.fitted)) +
218 |   geom_mirror_histogram(
219 |     aes(group = qsmk),
220 |     bins = 50
221 |   ) +
222 |   geom_mirror_histogram(
223 |     aes(fill = qsmk, weight = wts),
224 |     bins = 50,
225 |     alpha = .5
226 |   ) +
227 |   scale_y_continuous(labels = abs) +
228 |   labs(x = "propensity score") +
229 |   theme_minimal(base_size = 20)
230 | ```
231 | 
232 | Both groups are being *upweighted* so that their distributions of propensity scores are much more similar.
233 | 
234 | We could do more here to analyze our assumptions, but let's move on to our second step: fitting the outcome model weighted by our inverse probabilities. Some researchers call these Marginal Structural Models, in part because the model is marginal; we only need to include our outcome (`wt82_71`) and exposure (`qsmk`). The other variables aren't in the model; they are accounted for with the IPWs!
235 | 
236 | ```{r}
237 | ipw_model <- lm(
238 |   wt82_71 ~ qsmk,
239 |   data = nhefs_complete_uc,
240 |   weights = wts # inverse probability weights
241 | )
242 | 
243 | ipw_estimate <- ipw_model |>
244 |   tidy(conf.int = TRUE) |>
245 |   filter(term == "qsmk")
246 | 
247 | ipw_estimate
248 | ```
249 | 
250 | This estimate is pretty similar to what we saw before, if a little smaller. In fact, for simple causal questions, this is often the case: adjusting for confounders directly in your regression model sometimes estimates the same effect as IPWs and other causal techniques. Causal techniques are special, though, in that they use counterfactual modeling, which allows you to deal with many circumstances, such as when you have selection bias or time-dependent confounding. They also often have favorable variance properties.
251 | 
252 | But we have another problem that we need to address. We're just using `lm()` to estimate our IPW model, but it doesn't properly account for the weights. That means our standard error is too small, which will artificially narrow confidence intervals and artificially shrink p-values. There are many ways to address this, including robust estimators. We'll focus on using the bootstrap via the {rsample} package in this workshop, but here's one way to do it with robust standard errors:
253 | 
254 | 
255 | ```{r}
256 | # also see robustbase, survey, gee, and others
257 | library(estimatr)
258 | ipw_model_robust <- lm_robust(
259 |   wt82_71 ~ qsmk,
260 |   data = nhefs_complete_uc,
261 |   weights = wts
262 | )
263 | 
264 | ipw_estimate_robust <- ipw_model_robust |>
265 |   tidy(conf.int = TRUE) |>
266 |   filter(term == "qsmk")
267 | 
268 | ipw_estimate_robust
269 | ```
270 | 
271 | Now let's try the bootstrap. First, we need to wrap our model in a function so we can call it many times on our bootstrapped data. A function like this might be your instinct; however, it's not quite right.
272 | 
273 | ```{r}
274 | # fit ipw model for a single bootstrap sample
275 | fit_ipw_not_quite_rightly <- function(split, ...) {
276 |   # get bootstrapped data sample with `rsample::analysis()`
277 |   .df <- analysis(split)
278 | 
279 |   # fit ipw model
280 |   lm(wt82_71 ~ qsmk, data = .df, weights = wts) |>
281 |     tidy()
282 | }
283 | ```
284 | 
285 | The problem is that we need to account for the *entire* modeling process, so we need to include the first step of our analysis -- fitting the inverse probability weights.
286 | 
287 | ```{r}
288 | fit_ipw <- function(split, ...) 
{
289 |   .df <- analysis(split)
290 | 
291 |   # fit propensity score model
292 |   propensity_model <- glm(
293 |     qsmk ~ sex +
294 |       race + age + I(age^2) + education +
295 |       smokeintensity + I(smokeintensity^2) +
296 |       smokeyrs + I(smokeyrs^2) + exercise + active +
297 |       wt71 + I(wt71^2),
298 |     family = binomial(),
299 |     data = .df
300 |   )
301 | 
302 |   # calculate inverse probability weights
303 |   .df <- propensity_model |>
304 |     augment(type.predict = "response", data = .df) |>
305 |     mutate(wts = wt_ate(
306 |       .fitted,
307 |       qsmk,
308 |       exposure_type = "binary"
309 |     ))
310 | 
311 |   # fit correctly bootstrapped ipw model
312 |   lm(wt82_71 ~ qsmk, data = .df, weights = wts) |>
313 |     tidy()
314 | }
315 | ```
316 | 
317 | {rsample} makes the rest easy for us: `bootstraps()` resamples our data 1000 times, then we can use `purrr::map()` to apply our function to each resampled set (`splits`). {rsample}'s `int_*()` functions help us get confidence intervals for our estimate.
318 | 
319 | ```{r}
320 | # fit ipw model to bootstrapped samples of the uncensored data
321 | ipw_results <- bootstraps(nhefs_complete_uc, 1000, apparent = TRUE) |>
322 |   mutate(results = map(splits, fit_ipw))
323 | 
324 | # get t-statistic-based CIs
325 | boot_estimate <- int_t(ipw_results, results) |>
326 |   filter(term == "qsmk")
327 | 
328 | boot_estimate
329 | ```
330 | 
331 | Let's compare to our naive weighted model that just used a single estimate from `lm()`.
332 | 
333 | ```{r}
334 | bind_rows(
335 |   ipw_estimate |>
336 |     select(estimate, conf.low, conf.high) |>
337 |     mutate(type = "ols"),
338 |   ipw_estimate_robust |>
339 |     select(estimate, conf.low, conf.high) |>
340 |     mutate(type = "robust"),
341 |   boot_estimate |>
342 |     select(estimate = .estimate, conf.low = .lower, conf.high = .upper) |>
343 |     mutate(type = "bootstrap")
344 | ) |>
345 |   # calculate CI width to sort by it
346 |   mutate(width = conf.high - conf.low) |>
347 |   arrange(width) |>
348 |   # fix the order of the model types for the plot
349 |   mutate(type = fct_inorder(type)) |>
350 |   ggplot(aes(x = type, y = estimate, ymin = conf.low, ymax = conf.high)) +
351 |   geom_pointrange(color = "#0172B1", size = 1, fatten = 3) +
352 |   coord_flip() +
353 |   theme_minimal(base_size = 20) +
354 |   theme(axis.title.y = element_blank())
355 | ```
356 | 
357 | Our bootstrapped confidence intervals are wider, which is expected; remember that they were artificially narrow in the naive OLS model!
358 | 
359 | So, we have a final estimate for our causal effect: on average, a person who quits smoking will gain 3.5 kg (95% CI 2.4 kg, 4.4 kg) versus if they had not quit smoking. What do you think? Is this estimate reliable? Did we do a good job addressing the assumptions we need to make for a causal effect, particularly that there is no confounding? How might you criticize this model, and what would you do differently?
360 | 
361 | ***
362 | 
363 | # Take aways
364 | * The broad strokes for a causal analysis are: 1) identify your causal question, 2) make your assumptions clear, 3) check your assumptions as best you can, and 4) use the right estimator for the question you're trying to ask. As scientists, we should be able to critique each of these steps, and that's a good thing!
365 | * To create marginal structural models, first fit a propensity model for the weights with the exposure as the outcome. Then, use the inverse of the predicted probabilities as weights in a model with just the outcome and exposure. 
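If it helps to see that whole pattern in one place, here's a compact, generic sketch (the names `outcome`, `exposure`, and `confounder` are hypothetical placeholders, not NHEFS variables):

```{r}
#| eval: false
library(dplyr)
library(broom)
library(propensity)

fit_msm <- function(df) {
  # step 1: propensity model with the exposure as the outcome
  ps_model <- glm(exposure ~ confounder, family = binomial(), data = df)

  # turn predicted probabilities into inverse probability weights
  df <- ps_model |>
    augment(type.predict = "response", data = df) |>
    mutate(wts = wt_ate(.fitted, exposure))

  # step 2: weighted outcome model with only the exposure
  lm(outcome ~ exposure, data = df, weights = wts) |>
    tidy()
}
```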
366 | -------------------------------------------------------------------------------- /15-bonus-ml-for-causal-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Machine Learning for Causal Inference" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(touringplans) 11 | library(propensity) 12 | library(SuperLearner) 13 | library(tmle) 14 | library(yardstick) 15 | library(ggdag) 16 | library(ggokabeito) 17 | ``` 18 | 19 | ## The Causal Question 20 | 21 | We'll be looking at an example using Walt Disney World ride data from the touringplans package. 22 | 23 | Historically, guests who stayed in a Walt Disney World resort hotel were able to access the park during "Extra Magic Hours" during which the park was closed to all other guests. 24 | These extra hours could be in the morning or evening. 25 | The Seven Dwarfs Mine Train is a ride at Walt Disney World's Magic Kingdom. Typically, each day Magic Kingdom may or may not be selected to have these "Extra Magic Hours". 26 | 27 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am. 28 | 29 | Below is a proposed DAG for this question. 30 | 31 | ```{r} 32 | set.seed(1234) 33 | 34 | coord_dag <- list( 35 | x = c(season = 0, close = 0, weather = -1, emm = 1, wait_posted = 2), 36 | y = c(season = -1, close = 1, weather = 0, emm = 0, wait_posted = 0) 37 | ) 38 | 39 | labels <- c( 40 | emm = "Extra Magic Morning", 41 | wait_posted = "Average wait", 42 | season = "Ticket Season", 43 | weather = "Historic high temperature", 44 | close = "Time park closed" 45 | ) 46 | 47 | dagify( 48 | wait_posted ~ emm + close + season + weather, 49 | emm ~ weather + close + season, 50 | coords = coord_dag, 51 | labels = labels, 52 | exposure = "emm", 53 | outcome = "wait_posted" 54 | ) |> 55 | tidy_dagitty() |> 56 | node_status() |> 57 | ggplot( 58 | aes(x, y, xend = xend, yend = yend, color = status) 59 | ) + 60 | geom_dag_edges_arc(curvature = c(rep(0, 6), .3)) + 61 | geom_dag_point() + 62 | geom_dag_label_repel( 63 | aes(x, y, label = label), 64 | box.padding = 3.5, 65 | inherit.aes = FALSE, 66 | max.overlaps = Inf, 67 | family = "sans", 68 | seed = 1630, 69 | label.size = NA, 70 | label.padding = 0.1, 71 | size = 14 / 3 72 | ) + 73 | scale_color_okabe_ito(na.value = "grey90") + 74 | theme_dag() + 75 | theme( 76 | legend.position = "none", 77 | axis.text.x = element_text() 78 | ) + 79 | coord_cartesian(clip = "off") + 80 | scale_x_continuous( 81 | limits = c(-1.25, 2.25), 82 | breaks = c(-1, 0, 1, 2), 83 | labels = c( 84 | "\n(one year ago)", 85 | "\n(6 months ago)", 86 | "\n(3 months ago)", 87 | "9am - 10am\n(Today)" 88 | ) 89 | ) 90 | ``` 91 | 92 | Here we are proposing that there are three confounders: the historic high temperature on the day, the time the park closed, and the ticket season: value, regular, or peak. 93 | 94 | ## Review: IPW and G-computation 95 | 96 | Before using machine learning methods, let's review how we implement IPW and G-computation with parametric models. IPW and G-computation are two estimators we can use to estimate our estimand, the average treatment effect (ATE), in the presence of confounding. 97 | 98 | ### Inverse Probability Weighting (IPW) 99 | 100 | In IPW, we first fit a propensity score model to estimate the probability of treatment given confounders. 
The goal is to create a pseudo-population where treatment assignment is independent of confounders. We then calculate weights based on these propensity scores and fit a weighted outcome model to estimate the average treatment effect (ATE).
101 | 
102 | The algorithm is:
103 | 1. Fit a propensity score model to estimate the probability of treatment given confounders.
104 | 2. Calculate weights (for the ATE, `1 / propensity score` for treated units and `1 / (1 - propensity score)` for control units); `propensity::wt_ate()` will calculate this for you.
105 | 3. Fit a weighted outcome model using the calculated weights. This model only includes the treatment variable and no confounders, as the weights adjust for confounding.
106 | 
107 | ```{r}
108 | # Prepare data: filter to 9am wait times
109 | seven_dwarfs <- seven_dwarfs_train_2018 |>
110 |   filter(wait_hour == 9)
111 | 
112 | # Step 1: Fit propensity score model (exposure model)
113 | propensity_model <- glm(
114 |   park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
115 |   data = seven_dwarfs,
116 |   family = binomial()
117 | )
118 | 
119 | # Get propensity scores
120 | propensity_scores <- predict(propensity_model, type = "response")
121 | 
122 | # Step 2: Calculate ATE weights
123 | # For treated: 1/PS, for control: 1/(1-PS)
124 | ate_weights <- wt_ate(
125 |   propensity_scores,
126 |   seven_dwarfs$park_extra_magic_morning
127 | )
128 | 
129 | # Step 3: Fit weighted outcome model
130 | ipw_model <- lm(
131 |   wait_minutes_posted_avg ~ park_extra_magic_morning,
132 |   data = seven_dwarfs,
133 |   weights = ate_weights
134 | )
135 | 
136 | # For proper inference, we need bootstrapping.
137 | # See the appendix at the bottom of this document.
138 | tidy(ipw_model)
139 | ```
140 | 
141 | ### G-computation
142 | 
143 | In G-computation, we fit an outcome model that includes the exposure and confounders. We then create two counterfactual datasets: one where everyone is treated and one where everyone is control. We use the fitted outcome model to predict outcomes under each scenario and calculate the ATE as the average difference in predicted outcomes.
144 | 
145 | The algorithm is:
146 | 1. Fit an outcome model with the exposure and confounders.
147 | 2. Clone the dataset to create two scenarios: one where everyone is treated and one where everyone is control.
148 | 3. Predict outcomes under each scenario using the fitted outcome model.
149 | 4. Calculate the average treatment effect (ATE) as the mean difference in predicted outcomes.
150 | 
151 | ```{r}
152 | # Step 1: Fit outcome model with exposure and all confounders
153 | outcome_model <- lm(
154 |   wait_minutes_posted_avg ~ park_extra_magic_morning + park_ticket_season +
155 |     park_close + park_temperature_high,
156 |   data = seven_dwarfs
157 | )
158 | 
159 | tidy(outcome_model)
160 | 
161 | # Step 2: Clone datasets with different exposure values
162 | # Create datasets where everyone is treated vs everyone is control
163 | data_all_treated <- seven_dwarfs |>
164 |   mutate(park_extra_magic_morning = 1)
165 | 
166 | data_all_control <- seven_dwarfs |>
167 |   mutate(park_extra_magic_morning = 0)
168 | 
169 | # Step 3: Predict outcomes under each scenario
170 | pred_treated <- predict(outcome_model, newdata = data_all_treated)
171 | pred_control <- predict(outcome_model, newdata = data_all_control)
172 | 
173 | # Step 4: Calculate average treatment effect
174 | # For proper inference, we need bootstrapping. 
# See the appendix at the bottom of this document.
175 | 
176 | ate_gcomp <- mean(pred_treated - pred_control)
177 | ate_gcomp
178 | ```
179 | 
180 | ## Your Turn 1
181 | 
182 | 1. First, create a character vector `sl_library` that specifies the following algorithms: "SL.glm", "SL.ranger", "SL.xgboost", "SL.gam". Then, fit a SuperLearner for the exposure model using the `SuperLearner` package. The predictors for this model should be the confounders identified in the DAG: `park_ticket_season`, `park_close`, and `park_temperature_high`. The outcome is `park_extra_magic_morning`.
183 | 2. Fit a SuperLearner for the outcome model using the `SuperLearner` package. The predictors for this model should be the confounders plus the exposure: `park_extra_magic_morning`, `park_ticket_season`, `park_close`, and `park_temperature_high`. The outcome is `wait_minutes_posted_avg`.
184 | 3. Inspect the fitted SuperLearner objects.
185 | 
186 | ```{r}
187 | set.seed(1234)
188 | 
189 | seven_dwarfs <- seven_dwarfs_train_2018 |>
190 |   filter(wait_hour == 9)
191 | 
192 | sl_library <- c("SL.glm", "SL.ranger", "SL.xgboost", "SL.gam")
193 | 
194 | exposure_sl <- SuperLearner(
195 |   Y = seven_dwarfs$park_extra_magic_morning,
196 |   X = seven_dwarfs |>
197 |     select(park_ticket_season, park_close, park_temperature_high) |>
198 |     mutate(park_close = as.numeric(park_close)),
199 |   family = binomial(),
200 |   SL.library = sl_library,
201 |   cvControl = list(V = 5)
202 | )
203 | 
204 | exposure_sl
205 | 
206 | outcome_sl <- SuperLearner(
207 |   Y = seven_dwarfs$wait_minutes_posted_avg,
208 |   X = seven_dwarfs |>
209 |     select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high) |>
210 |     mutate(park_close = as.numeric(park_close)),
211 |   family = gaussian(),
212 |   SL.library = sl_library,
213 |   cvControl = list(V = 5)
214 | )
215 | 
216 | outcome_sl
217 | ```
218 | 
219 | 4. Inspect the predictions from the fitted SuperLearner models.
220 | 5. Check AUC with the `yardstick` package for the exposure model. For this, we need a data frame with the true exposure values and predicted propensity scores to use with `roc_auc()`.
221 | 6. Check RMSE with the `yardstick` package for the outcome model. For this, we need a data frame with the true outcome values and predicted outcomes to use with `rmse()`.
222 | 
223 | ```{r}
224 | propensity_scores <- predict(exposure_sl, type = "response")$pred[, 1]
225 | propensity_scores
226 | 
227 | outcome_preds <- predict(outcome_sl)$pred[, 1]
228 | outcome_preds
229 | 
230 | exposure_results <- tibble(
231 |   truth = factor(seven_dwarfs$park_extra_magic_morning),
232 |   predicted = propensity_scores
233 | )
234 | 
235 | # Need event_level = "second" because yardstick treats first level ("0")
236 | # as event by default
237 | exposure_auc <- roc_auc(exposure_results, truth, predicted, event_level = "second")
238 | exposure_auc
239 | 
240 | outcome_results <- tibble(
241 |   truth = seven_dwarfs$wait_minutes_posted_avg,
242 |   predicted = outcome_preds
243 | )
244 | outcome_rmse <- rmse(outcome_results, truth, predicted)
245 | outcome_rmse
246 | ```
247 | 
248 | **Stretch goal**: Add more algorithms to the SuperLearner and fit the models using the new stack of algorithms. 
249 | 250 | ```{r} 251 | sl_library_extended <- c( 252 | "SL.glm", 253 | "SL.ranger", 254 | "SL.xgboost", 255 | "SL.earth", 256 | "SL.gam", 257 | "SL.glm.interaction", 258 | "SL.mean", 259 | "SL.glmnet" 260 | ) 261 | 262 | # fit the superlearner models again with the extended library 263 | exposure_sl_extended <- SuperLearner( 264 | Y = seven_dwarfs$park_extra_magic_morning, 265 | X = seven_dwarfs |> 266 | select(park_ticket_season, park_close, park_temperature_high) |> 267 | mutate(park_close = as.numeric(park_close)), 268 | family = binomial(), 269 | SL.library = sl_library_extended, 270 | cvControl = list(V = 5) 271 | ) 272 | 273 | exposure_sl_extended 274 | 275 | outcome_sl_extended <- SuperLearner( 276 | Y = seven_dwarfs$wait_minutes_posted_avg, 277 | X = seven_dwarfs |> 278 | select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high) |> 279 | mutate(park_close = as.numeric(park_close)), 280 | family = gaussian(), 281 | SL.library = sl_library_extended, 282 | cvControl = list(V = 5) 283 | ) 284 | 285 | outcome_sl_extended 286 | ``` 287 | 288 | ## Your Turn 2 289 | 290 | 1. Implement Inverse Probability Weighting (IPW) using the fitted SuperLearner for the exposure model. 291 | - First, calculate the ATE weights using `wt_ate()` with the predicted propensity scores from the fitted SuperLearner. 292 | - Then, fit a weighted outcome model using `lm()` with `wait_minutes_posted_avg` as the outcome and `park_extra_magic_morning` as the only predictor, using the ATE weights in the `weights` argument of `lm()`. 293 | 294 | ```{r} 295 | # IPW with SuperLearner propensity scores 296 | # Step 1: Use predicted propensity scores from SuperLearner 297 | # Already done! 298 | propensity_scores 299 | 300 | # Step 2: Calculate ATE weights using wt_ate() 301 | ate_weights <- wt_ate( 302 | # the propensity score 303 | .propensity = propensity_scores, 304 | # the actual exposure values 305 | .exposure = seven_dwarfs$park_extra_magic_morning 306 | ) 307 | 308 | # Step 3: Fit weighted outcome model 309 | # The formula should be: `wait_minutes_posted_avg` as the outcome and `park_extra_magic_morning` as the only predictor 310 | ipw_model <- lm( 311 | wait_minutes_posted_avg ~ park_extra_magic_morning, 312 | data = seven_dwarfs, 313 | weights = ate_weights 314 | ) 315 | 316 | # Extract ATE estimate 317 | tidy(ipw_model) |> 318 | filter(term == "park_extra_magic_morning") 319 | ``` 320 | 321 | 2. Implement G-computation using the fitted SuperLearner for the outcome model. 322 | - First, create two counterfactual datasets: one where everyone is treated (`park_extra_magic_morning` = 1) and one where everyone is control (`park_extra_magic_morning` = 0). 323 | - Then, predict outcomes under each scenario using the fitted SuperLearner for the outcome model. 324 | - Finally, calculate the average treatment effect (ATE) as the mean difference in predicted outcomes between the treated and control scenarios. 
325 | 
326 | ```{r}
327 | # G-computation with SuperLearner outcome model
328 | # Step 1: Create counterfactual datasets
329 | # For SuperLearner prediction, we need only the columns used in the model
330 | 
331 | # Dataset where everyone is treated, `park_extra_magic_morning` = 1
332 | data_all_treated <- seven_dwarfs |>
333 |   select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high) |>
334 |   mutate(
335 |     park_close = as.numeric(park_close),
336 |     park_extra_magic_morning = 1
337 |   )
338 | 
339 | # Dataset where everyone is control, `park_extra_magic_morning` = 0
340 | data_all_control <- seven_dwarfs |>
341 |   select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high) |>
342 |   mutate(
343 |     park_close = as.numeric(park_close),
344 |     park_extra_magic_morning = 0
345 |   )
346 | 
347 | # Step 2: Predict outcomes under each scenario using SuperLearner
348 | pred_treated <- predict(outcome_sl, newdata = data_all_treated)$pred[, 1]
349 | pred_control <- predict(outcome_sl, newdata = data_all_control)$pred[, 1]
350 | 
351 | # Step 3: Calculate average treatment effect
352 | gcomp_ate <- mean(pred_treated - pred_control)
353 | gcomp_ate
354 | ```
355 | 
356 | ## Your Turn 3
357 | 
358 | 1. First, bound the continuous outcome to [0,1] range for TMLE. Store the min and max values for later transformation back.
359 | 2. Fit a new SuperLearner on the bounded outcome specifically for TMLE.
360 | 3. Get initial predictions from this bounded SuperLearner (they will already be in [0,1] scale).
361 | 4. Create `initial_pred_observed`, which contains the predicted values for each observation based on their actual treatment assignment.
362 | 
363 | ```{r}
364 | # Step 1: Bound the continuous outcome to [0,1] for TMLE
365 | min_y <- min(seven_dwarfs$wait_minutes_posted_avg)
366 | max_y <- max(seven_dwarfs$wait_minutes_posted_avg)
367 | y_bounded <- (seven_dwarfs$wait_minutes_posted_avg - min_y) / (max_y - min_y)
368 | 
369 | # Step 2: Fit new SuperLearner on bounded outcome
370 | # For TMLE with continuous outcomes, we need to fit on the bounded Y
371 | outcome_sl_bounded <- SuperLearner(
372 |   Y = y_bounded,
373 |   X = seven_dwarfs |>
374 |     select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high) |>
375 |     mutate(park_close = as.numeric(park_close)),
376 |   family = binomial(),
377 |   SL.library = sl_library,
378 |   cvControl = list(V = 5)
379 | )
380 | 
381 | # Step 3: Get initial predictions (already in [0,1] scale)
382 | initial_pred_treated <- predict(outcome_sl_bounded, newdata = data_all_treated)$pred[, 1]
383 | initial_pred_control <- predict(outcome_sl_bounded, newdata = data_all_control)$pred[, 1]
384 | 
385 | # each observation gets the counterfactual prediction based on their actual treatment
386 | # this is the same as predicting on the original dataset, but since we already calculated these, we'll just put it together ourselves
387 | initial_pred_observed <- ifelse(
388 |   seven_dwarfs$park_extra_magic_morning == 1,
389 |   initial_pred_treated,
390 |   initial_pred_control
391 | )
392 | ```
393 | 
394 | 5. Create the "clever covariate" using the propensity scores from the fitted SuperLearner model. This will be used in the TMLE targeting step. For treated units, it should be `1 / propensity_scores`, and for control units, it should be `-1 / (1 - propensity_scores)` (see the formula below). 
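In standard TMLE notation, with $g(W_i)$ as the propensity score and $A_i$ as the exposure, the clever covariate is (a general formula, not specific to this exercise):

$$
H(A_i, W_i) = \frac{A_i}{g(W_i)} - \frac{1 - A_i}{1 - g(W_i)}
$$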
395 | 
396 | ```{r}
397 | # Step 4: Create the "clever covariate": this is the key to TMLE
398 | # It weights observations based on their propensity scores to achieve balance
399 | # For treated units: 1 / propensity_scores
400 | # For control units: -1 / (1 - propensity_scores)
401 | # This is NOT the ATE weights; it's a component of the efficient influence function
402 | # But it IS related, as it is also a consequence of targeting the ATE
403 | clever_covariate <- ifelse(
404 |   seven_dwarfs$park_extra_magic_morning == 1,
405 |   1 / propensity_scores,
406 |   -1 / (1 - propensity_scores)
407 | )
408 | ```
409 | 
410 | 6. Fit a fluctuation model with the bounded outcome, using `qlogis(initial_pred_observed)` as an offset and the clever covariate as a predictor (with no intercept). Use binomial family for the model.
411 | 7. Get the fluctuation parameter `epsilon` from the model coefficients; this is the coefficient for the clever covariate.
412 | 
413 | ```{r}
414 | # Step 5: The targeting step - a small parametric fluctuation of initial estimates
415 | # We're not re-fitting from scratch; we're making a targeted adjustment
416 | # The offset keeps our initial predictions on the logit scale
417 | # No intercept because we're adjusting around the initial predictions, not estimating a new model
418 | # Use binomial family for bounded continuous outcomes
419 | fluctuation_model <- glm(
420 |   y_bounded ~ -1 + offset(qlogis(initial_pred_observed)) + clever_covariate,
421 |   family = binomial()
422 | )
423 | 
424 | # Epsilon: the fluctuation parameter that solves the efficient influence function
425 | # It tells us how much our initial estimate needs to be adjusted to be unbiased
426 | # Small epsilon = initial estimate was already good; large = needed more correction
427 | # This is the coefficient for clever_covariate
428 | epsilon <- coef(fluctuation_model)
429 | epsilon
430 | ```
431 | 
432 | 8. Now that we've calculated the fluctuation parameter, we can update our predictions to obtain targeted predictions that optimize the bias-variance tradeoff for the average treatment effect.
433 | - For treated units, we add `epsilon * (1 / propensity_scores)`.
434 | - For control units, we add `epsilon * (-1 / (1 - propensity_scores))`.
435 | 
436 | ```{r}
437 | # Step 6: Update our predictions using the fluctuation parameter
438 | # This update ensures our estimate solves the efficient influence function equation
439 | # which helps us estimate: 1) unbiased estimate of ATE, 2) valid standard errors
440 | # We adjust the initial predictions (on the logit scale) by adding epsilon * clever_covariate
441 | # 1 / propensity_scores for treated (since everyone in initial_pred_treated is treated counterfactually)
442 | # -1 / (1 - propensity_scores) for controls
443 | logit_pred_treated <- qlogis(initial_pred_treated) + epsilon * (1 / propensity_scores)
444 | logit_pred_control <- qlogis(initial_pred_control) + epsilon * (-1 / (1 - propensity_scores))
445 | 
446 | # Transform back to probability scale
447 | targeted_pred_treated <- plogis(logit_pred_treated)
448 | targeted_pred_control <- plogis(logit_pred_control)
449 | 
450 | # we'll need this later for calculating the variance and confidence intervals
451 | targeted_pred_observed <- ifelse(
452 |   seven_dwarfs$park_extra_magic_morning == 1,
453 |   targeted_pred_treated,
454 |   targeted_pred_control
455 | )
456 | ```
457 | 
458 | 9. Let's visualize the initial vs targeted individual-level predictions for treated and control units. 
Set the x-axis to the initial predictions and the y-axis to the targeted predictions. For the first plot, use `initial_pred_treated` and `targeted_pred_treated`, and for the second plot, use `initial_pred_control` and `targeted_pred_control`.
459 | 
460 | ```{r}
461 | # plot the initial vs targeted individual-level predictions for treated units
462 | ggplot(seven_dwarfs, aes(x = initial_pred_treated, y = targeted_pred_treated)) +
463 |   geom_point() +
464 |   # perfect prediction line: y = x
465 |   geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
466 |   labs(
467 |     x = "Initial Predictions (Treated)",
468 |     y = "Targeted Predictions (Treated)"
469 |   ) +
470 |   theme_minimal()
471 | 
472 | # plot the initial vs targeted individual-level predictions for control units
473 | ggplot(seven_dwarfs, aes(x = initial_pred_control, y = targeted_pred_control)) +
474 |   geom_point() +
475 |   # perfect prediction line: y = x
476 |   geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
477 |   labs(
478 |     x = "Initial Predictions (Control)",
479 |     y = "Targeted Predictions (Control)"
480 |   ) +
481 |   theme_minimal()
482 | ```
483 | 
484 | ## Your Turn 4
485 | 
486 | 1. Let's compare the mean difference in **initial** predictions between treated and control units, and the mean difference in **targeted** predictions between treated and control units. The first calculation is a G-computation ATE estimate (here based on the bounded outcome model). The second calculation is the TMLE estimate. Remember to transform back to the original scale!
487 | 
488 | ```{r}
489 | # Compare: initial estimate (good prediction, possibly biased for causal effect)
490 | # vs targeted estimate (adjusted to be unbiased for the causal parameter)
491 | # the g-computation ATE is the mean difference in initial predictions
492 | # Transform back to original scale by multiplying by (max_y - min_y)
493 | initial_ate <- mean(initial_pred_treated - initial_pred_control) * (max_y - min_y)
494 | initial_ate
495 | 
496 | # the TMLE ATE is the mean difference in targeted predictions
497 | # Transform back to original scale
498 | targeted_ate <- mean(targeted_pred_treated - targeted_pred_control) * (max_y - min_y)
499 | targeted_ate
500 | ```
501 | 
502 | 2. Now let's calculate its standard error using the efficient influence curve (IC). The IC captures the uncertainty in our estimate and allows us to construct valid confidence intervals. There's no code to change here, just read through the comments to understand how the IC is constructed and how it relates to the TMLE estimate.
503 | 
504 | ```{r}
505 | # Calculate the efficient influence curve (IC) for each observation
506 | # The IC has two components:
507 | # 1. clever_covariate * residual: captures remaining uncertainty in the outcome model
508 | # 2. 
(targeted_pred_treated - targeted_pred_control - the bounded-scale ATE): captures uncertainty in the treatment effect
509 | # Each observation's IC value represents its contribution to the overall uncertainty
510 | # Note: IC uses bounded outcomes and predictions; targeted_ate / (max_y - min_y) is the ATE on the bounded scale
511 | ic <- clever_covariate * (y_bounded - targeted_pred_observed) +
512 |   targeted_pred_treated - targeted_pred_control - targeted_ate / (max_y - min_y)
513 | 
514 | # Standard error is the standard deviation of IC values divided by sqrt(n)
515 | # This works because TMLE constructs the estimate to behave like a sample mean of IC values
516 | # Even though we used ML, the targeting step ensures valid statistical inference
517 | # Transform SE back to original scale
518 | se_tmle <- sqrt(var(ic) / nrow(seven_dwarfs)) * (max_y - min_y)
519 | 
520 | # Calculate 95% confidence interval using normal approximation
521 | # Valid because TMLE is asymptotically normal with the IC-based variance
522 | ci_lower <- targeted_ate - 1.96 * se_tmle
523 | ci_upper <- targeted_ate + 1.96 * se_tmle
524 | 
525 | tibble(
526 |   targeted_ate,
527 |   se_tmle,
528 |   ci_lower,
529 |   ci_upper
530 | )
531 | ```
532 | 
533 | ## Bonus: Fit with `tmle::tmle()`
534 | 
535 | R has several packages for TMLE: tmle, ltmle, and tmle3, all with slightly different designs and capabilities. We'll use the `tmle` package here, which is quite simple and works with SuperLearner.
536 | 
537 | ```{r}
538 | confounders <- seven_dwarfs |>
539 |   select(park_ticket_season, park_close, park_temperature_high) |>
540 |   mutate(park_close = as.numeric(park_close))
541 | 
542 | tmle_result <- tmle(
543 |   # Y is the outcome
544 |   Y = seven_dwarfs$wait_minutes_posted_avg,
545 |   # A is the exposure
546 |   A = seven_dwarfs$park_extra_magic_morning,
547 |   W = confounders,
548 |   # Q is the outcome model
549 |   Q.SL.library = sl_library,
550 |   # g is the exposure model (propensity score)
551 |   g.SL.library = sl_library
552 | )
553 | 
554 | tmle_result
555 | summary(tmle_result)
556 | ```
557 | 
558 | ***
559 | 
560 | # Take aways
561 | 
562 | * Machine learning methods can improve causal inference by reducing bias and variance in treatment effect estimates.
563 | * SuperLearner provides a flexible framework for combining multiple machine learning algorithms.
564 | * TMLE provides a robust framework for estimating treatment effects while addressing bias and variance.
565 | * The targeting step in TMLE is crucial for improving estimates and reducing bias.
566 | 
567 | 
568 | # Appendix: Bootstrapping for IPW and G-computation
569 | 
570 | ## IPW
571 | 
572 | 
573 | ```{r}
574 | library(rsample)
575 | set.seed(1234)
576 | 
577 | fit_ipw <- function(split, ...) 
{ 578 | .df <- as.data.frame(split) 579 | 580 | # Fit propensity score model 581 | ps_model <- glm( 582 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 583 | data = .df, 584 | family = binomial() 585 | ) 586 | 587 | # Calculate weights 588 | ps <- augment(ps_model, type.predict = "response", data = .df)$.fitted 589 | weights <- wt_ate(ps, .df$park_extra_magic_morning, exposure_type = "binary") 590 | 591 | # Fit weighted outcome model 592 | lm(wait_minutes_posted_avg ~ park_extra_magic_morning, 593 | data = .df, 594 | weights = weights) |> 595 | tidy() 596 | } 597 | 598 | # Bootstrap confidence intervals 599 | ipw_boot <- bootstraps(seven_dwarfs, 1000, apparent = TRUE) |> 600 | mutate(results = map(splits, fit_ipw)) 601 | 602 | int_bca(ipw_boot, results, .fn = fit_ipw) |> 603 | filter(term == "park_extra_magic_morning") 604 | ``` 605 | 606 | ## G-computation 607 | 608 | ```{r} 609 | fit_gcomp <- function(split, ...) { 610 | .df <- as.data.frame(split) 611 | 612 | # Fit outcome model 613 | mod <- lm( 614 | wait_minutes_posted_avg ~ park_extra_magic_morning + park_ticket_season + 615 | park_close + park_temperature_high, 616 | data = .df 617 | ) 618 | 619 | # Clone and predict 620 | df_treat <- .df |> mutate(park_extra_magic_morning = 1) 621 | df_control <- .df |> mutate(park_extra_magic_morning = 0) 622 | 623 | pred_treat <- augment(mod, newdata = df_treat)$.fitted 624 | pred_control <- augment(mod, newdata = df_control)$.fitted 625 | 626 | # Return results 627 | tibble( 628 | term = "ate", 629 | estimate = mean(pred_treat - pred_control) 630 | ) 631 | } 632 | 633 | gcomp_boot <- bootstraps(seven_dwarfs, 1000, apparent = TRUE) |> 634 | mutate(results = map(splits, fit_gcomp)) 635 | 636 | int_bca(gcomp_boot, results, .fn = fit_gcomp) 637 | ``` 638 | 639 | --------------------------------------------------------------------------------