├── .gitignore ├── causal_inference_r_workshop_solutions.Rproj ├── 11-tipr.qmd ├── 05-quartets-exercises.qmd ├── 09-outcome-model-exercises.qmd ├── 07-pscores-using-exercises.qmd ├── 08-pscores-diagnostics-exercises.qmd ├── 03-ci-with-group-by-and-summarise-exercises.qmd ├── 12-whole-game-2-exercises.qmd ├── 06-intro-pscores-exercises.qmd ├── 13-bonus-selection-bias-exercises.qmd ├── 04-dags-exercises.qmd ├── 14-bonus-continuous-pscores-exercises.qmd ├── 10-continuous-g-computation-exercises.qmd ├── 01-whole-game-exercises.qmd └── 15-bonus-ml-for-causal-exercises.qmd /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .Rdata 4 | .httr-oauth 5 | .DS_Store 6 | -------------------------------------------------------------------------------- /causal_inference_r_workshop_solutions.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | ProjectId: 73041e34-880b-48f3-b961-5e73736034a8 3 | 4 | RestoreWorkspace: No 5 | SaveWorkspace: No 6 | AlwaysSaveHistory: Default 7 | 8 | EnableCodeIndexing: Yes 9 | UseSpacesForTab: Yes 10 | NumSpacesForTab: 2 11 | Encoding: UTF-8 12 | 13 | RnwWeave: Sweave 14 | LaTeX: pdfLaTeX 15 | 16 | AutoAppendNewline: Yes 17 | StripTrailingWhitespace: Yes 18 | LineEndingConversion: Posix 19 | -------------------------------------------------------------------------------- /11-tipr.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Tipping Point Sensitivity Analyses" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tipr) 9 | ``` 10 | 11 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning (the **exposure**) and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am (the **outcome**). 12 | 13 | ## Your turn 14 | 15 | Use the `tip_coef()` function to conduct a sensitivity analysis for the estimate from your previous exercises. Use the lower bound of the confidence interval for the effect and `0.1` for the exposure-confounder effect. 
16 | 17 | ```{r} 18 | tip_coef(0.0009, exposure_confounder_effect = 0.1) 19 | ``` 20 | 21 | -------------------------------------------------------------------------------- /05-quartets-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Causal inference is not just a statistics problem" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(quartets) 10 | ``` 11 | 12 | ## Your turn 1 13 | 14 | For each of the following 4 datasets, look at the correlation between `exposure` and `covariate`: 15 | 16 | * `causal_collider` 17 | * `causal_confounding` 18 | * `causal_mediator` 19 | * `causal_m_bias` 20 | 21 | 22 | ```{r} 23 | causal_quartet |> 24 | group_by(dataset) |> 25 | summarize(correlation = cor(exposure, covariate)) 26 | ``` 27 | 28 | For each of the above 4 datasets, create a scatterplot looking at the relationship between `exposure` and `outcome`. 29 | 30 | ```{r} 31 | causal_quartet |> 32 | ggplot(aes(exposure, outcome)) + 33 | geom_point() + 34 | geom_smooth(method = "lm", formula = "y ~ x") + 35 | facet_wrap(~ dataset) 36 | ``` 37 | 38 | For each of the above 4 datasets, fit a linear model to examine the relationship between the `exposure` and the `outcome`. 39 | 40 | ```{r} 41 | causal_quartet |> 42 | group_by(dataset) |> 43 | summarize(exposure_coef = coef(lm(outcome ~ exposure))[["exposure"]]) 44 | ``` 45 | 46 | ## Your turn 2 47 | 48 | For each of the following 4 datasets, fit a linear model examining the relationship between `outcome_followup` and `exposure_baseline`, adjusting for `covariate_baseline`: 49 | 50 | * `causal_collider_time` 51 | * `causal_confounding_time` 52 | * `causal_mediator_time` 53 | * `causal_m_bias_time` 54 | 55 | ```{r} 56 | causal_quartet_time |> 57 | group_by(dataset) |> 58 | summarize(exposure_coef = coef(lm(outcome_followup ~ exposure_baseline + covariate_baseline))[["exposure_baseline"]]) 59 | ``` 60 | 61 | 62 | -------------------------------------------------------------------------------- /09-outcome-model-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Outcome Model" 3 | format: html 4 | --- 5 | 6 | 7 | ```{r} 8 | #| label: setup 9 | library(tidyverse) 10 | library(broom) 11 | library(touringplans) 12 | library(rsample) 13 | library(propensity) 14 | 15 | seven_dwarfs <- seven_dwarfs_train_2018 |> 16 | filter(wait_hour == 9) 17 | ``` 18 | 19 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning (the **exposure**) and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am (the **outcome**). 20 | 21 | ## Your turn 22 | 23 | *After updating the code chunks below, change `eval: true` before rendering* 24 | 25 | Create a function called `fit_ipw` that fits the propensity score model from the earlier exercises, incorporates the ATE weights you calculated, and fits a weighted outcome model. 26 | 27 | ```{r} 28 | #| eval: true 29 | fit_ipw <- function(split, ...) 
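# note: the whole pipeline (propensity model, weights, outcome model) lives
# inside this function so that each bootstrap resample re-estimates the weights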
{ 30 | .df <- analysis(split) 31 | 32 | # fit propensity score model 33 | propensity_model <- glm( 34 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 35 | data = .df, 36 | family = binomial() 37 | ) 38 | 39 | # calculate ATE weights 40 | .df <- propensity_model |> 41 | augment(type.predict = "response", data = .df) |> 42 | mutate(w_ate = wt_ate(.fitted, park_extra_magic_morning, exposure_type = "binary")) 43 | 44 | # fit correctly bootstrapped ipw model 45 | lm(wait_minutes_posted_avg ~ park_extra_magic_morning, data = .df, weights = w_ate) |> 46 | tidy() 47 | } 48 | ``` 49 | 50 | Bootstrap this result 1000 times. 51 | 52 | ```{r} 53 | #| eval: true 54 | set.seed(1234) 55 | 56 | ipw_results <- bootstraps(seven_dwarfs, 1000, apparent = TRUE) 57 | 58 | ipw_results <- ipw_results |> 59 | mutate(boot_fits = map(splits, fit_ipw)) 60 | ``` 61 | 62 | Check out the distribution of estimates (**no need to change this code**) 63 | 64 | ```{r} 65 | #| eval: true 66 | ipw_results |> 67 | mutate( 68 | estimate = map_dbl( 69 | boot_fits, 70 | # pull the `estimate` for `park_extra_magic_morning` for each fit 71 | \(.fit) .fit |> 72 | filter(term == "park_extra_magic_morning") |> 73 | pull(estimate) 74 | ) 75 | ) |> 76 | ggplot(aes(estimate)) + 77 | geom_histogram(fill = "#D55E00FF", color = "white", alpha = 0.8) + 78 | theme_minimal() 79 | ``` 80 | 81 | Calculate the confidence interval 82 | 83 | ```{r} 84 | #| eval: true 85 | boot_estimate <- int_t(ipw_results, boot_fits) |> 86 | filter(term == "park_extra_magic_morning") 87 | 88 | boot_estimate 89 | ``` 90 | 91 | 92 | Stretch goal: Do the same for a model using matching. 
-------------------------------------------------------------------------------- /07-pscores-using-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Using Propensity Scores" 3 | format: html 4 | --- 5 | 6 | 7 | ```{r} 8 | #| label: setup 9 | library(tidyverse) 10 | library(broom) 11 | library(touringplans) 12 | library(propensity) 13 | ``` 14 | 15 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning (the **exposure**) and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am (the **outcome**). 16 | 17 | Below is the propensity score model you created in the previous exercise. 18 | 19 | ```{r} 20 | seven_dwarfs <- seven_dwarfs_train_2018 |> 21 | filter(wait_hour == 9) 22 | 23 | propensity_model <- glm( 24 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 25 | data = seven_dwarfs, 26 | family = binomial() 27 | ) 28 | 29 | seven_dwarfs_prop <- propensity_model |> 30 | augment(type.predict = "response", data = seven_dwarfs) 31 | ``` 32 | 33 | ## Your Turn 1 (Matching) 34 | 35 | *After updating the code chunks below, change `eval: true` before rendering* 36 | 37 | Create a "matched" data set using the same propensity score model as above and a caliper of 0.2. 
38 | 39 | ```{r} 40 | #| eval: true 41 | library(MatchIt) 42 | matched_dwarfs <- matchit( 43 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 44 | data = seven_dwarfs, 45 | link = "linear.logit", 46 | caliper = 0.2 47 | ) 48 | 49 | matched_df <- get_matches(matched_dwarfs) 50 | ``` 51 | 52 | ## Your Turn 2 (Weighting) 53 | 54 | *After updating the code chunks below, change `eval: true` before rendering* 55 | 56 | Add the ATE weights to the data frame, `seven_dwarfs_prop`. 57 | 58 | ```{r} 59 | #| eval: true 60 | seven_dwarfs_prop <- seven_dwarfs_prop |> 61 | mutate(w_ate = wt_ate(.fitted, park_extra_magic_morning)) 62 | ``` 63 | 64 | 65 | Stretch Goal 1: 66 | 67 | Add ATM weights to the data frame, `seven_dwarfs_prop`. 68 | 69 | ```{r} 70 | #| eval: true 71 | seven_dwarfs_prop <- seven_dwarfs_prop |> 72 | mutate(w_atm = wt_atm(.fitted, park_extra_magic_morning)) 73 | ``` 74 | 75 | Stretch Goal 2: 76 | 77 | Update the code below to examine the distribution of the weighted sample. **HINT**: the part that needs to be updated is the `weight` aesthetic in the second of the two `geom_mirror_histogram()` calls. 78 | 79 | 80 | ```{r} 81 | #| eval: true 82 | #| warning: false 83 | library(halfmoon) 84 | ggplot( 85 | seven_dwarfs_prop, 86 | aes(.fitted, fill = factor(park_extra_magic_morning)) 87 | ) + 88 | geom_mirror_histogram(bins = 50, alpha = .5) + 89 | geom_mirror_histogram(aes(weight = w_ate), alpha = .5, bins = 50) + 90 | geom_hline(yintercept = 0, lwd = 0.5) + 91 | theme_minimal() + 92 | scale_y_continuous(labels = abs) + 93 | scale_fill_manual(values = c("blue", "green")) + 94 | labs(x = "p", fill = "Extra Magic Morning") + 95 | xlim(0, 1) 96 | ``` 97 | -------------------------------------------------------------------------------- /08-pscores-diagnostics-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Propensity Score Diagnostics" 3 | format: html 4 | --- 5 | 6 | 7 | ```{r} 8 | #| label: setup 9 | library(tidyverse) 10 | library(broom) 11 | library(touringplans) 12 | library(halfmoon) 13 | library(propensity) 14 | ``` 15 | 16 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning (the **exposure**) and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am (the **outcome**). 17 | 18 | Below is the propensity score model and weights you created in the previous exercise. 
19 | 20 | ```{r} 21 | seven_dwarfs <- seven_dwarfs_train_2018 |> 22 | filter(wait_hour == 9) 23 | 24 | propensity_model <- glm( 25 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 26 | data = seven_dwarfs, 27 | family = binomial() 28 | ) 29 | 30 | seven_dwarfs_ps <- propensity_model |> 31 | augment(type.predict = "response", data = seven_dwarfs) |> 32 | mutate(w_ate = wt_ate(.fitted, park_extra_magic_morning)) 33 | ``` 34 | 35 | ## Your Turn 1 36 | 37 | *After updating the code chunks below, change `eval: true` before rendering* 38 | 39 | Calculate the standardized mean differences with and without weights 40 | 41 | ```{r} 42 | smds <- seven_dwarfs_ps |> 43 | mutate(park_close = as.numeric(park_close)) |> 44 | tidy_smd( 45 | .vars = c(park_ticket_season, park_close, park_temperature_high), 46 | .group = park_extra_magic_morning, 47 | .wts = w_ate 48 | ) 49 | ``` 50 | 51 | Create the Love Plot using ggplot and halfmoon 52 | 53 | ```{r} 54 | #| eval: true 55 | ggplot( 56 | data = smds, 57 | aes(x = abs(smd), y = variable, group = method, color = method) 58 | ) + 59 | geom_love() 60 | ``` 61 | 62 | ## Your Turn 2 63 | 64 | Create an unweighted ECDF for `park_temperature_high` by whether or not the day had Extra Magic Hours. 65 | 66 | ```{r} 67 | #| eval: true 68 | ggplot(seven_dwarfs_ps, aes(x = park_temperature_high, color = factor(park_extra_magic_morning))) + 69 | geom_ecdf() + 70 | scale_color_manual( 71 | "Extra Magic Hours", 72 | values = c("#5154B8", "#5DB854"), 73 | labels = c("No", "Yes") 74 | ) + 75 | xlab("Historic Temperature") + 76 | ylab("Proportion <= x") 77 | ``` 78 | 79 | Create a weighted ECDF for `park_temperature_high` by whether or not the day had Extra Magic Hours. 80 | 81 | ```{r} 82 | #| eval: true 83 | ggplot(seven_dwarfs_ps, aes(x = park_temperature_high, color = factor(park_extra_magic_morning))) + 84 | geom_ecdf(aes(weights = w_ate)) + 85 | scale_color_manual( 86 | "Extra Magic Hours", 87 | values = c("#5154B8", "#5DB854"), 88 | labels = c("No", "Yes") 89 | ) + 90 | xlab("Historic Temperature") + 91 | ylab("Proportion <= x") 92 | ``` 93 | 94 | ## Bonus Your Turn: Weighted Tables 95 | 96 | Create a weighted table for the seven dwarfs dataset given your weights 97 | 98 | 99 | ```{r} 100 | library(survey) 101 | library(gtsummary) 102 | seven_dwarfs_ps |> 103 | select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high, w_ate) |> 104 | svydesign( 105 | ids = ~ 1, 106 | data = _, 107 | weights = ~ w_ate 108 | ) |> 109 | tbl_svysummary( 110 | by = park_extra_magic_morning, 111 | include = -w_ate 112 | ) |> 113 | add_difference(everything() ~ "smd") 114 | ``` 115 | -------------------------------------------------------------------------------- /03-ci-with-group-by-and-summarise-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Causal Inference with `group_by()` and `summarize()`" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | set.seed(1) 10 | ``` 11 | 12 | ## Your Turn 1 13 | 14 | Run this code to generate the simulated data set 15 | 16 | ```{r} 17 | n <- 1000 18 | sim <- tibble( 19 | confounder = rbinom(n, 1, 0.5), 20 | p_exposure = case_when( 21 | confounder == 1 ~ 0.75, 22 | confounder == 0 ~ 0.25 23 | ), 24 | exposure = rbinom(n, 1, p_exposure), 25 | outcome = confounder + rnorm(n) 26 | ) 27 | ``` 28 | 29 | 1. Group the dataset by `confounder` and `exposure` 30 | 2. 
Calculate the mean of the `outcome` for the groups 31 | 32 | ```{r} 33 | sim |> 34 | group_by(confounder, exposure) |> 35 | summarise(avg_y = mean(outcome)) |> 36 | # pivot the data so we can get the difference 37 | # between the exposure groups 38 | pivot_wider( 39 | names_from = exposure, 40 | values_from = avg_y, 41 | names_prefix = "x_" 42 | ) |> 43 | summarise(estimate = x_1 - x_0) |> 44 | summarise(estimate = mean(estimate)) # note, we would need to weight this if the confounder groups were not equal sized 45 | ``` 46 | 47 | ## Your Turn 2 48 | 49 | Run the following code to generate `sim2` 50 | 51 | ```{r} 52 | n <- 1000 53 | sim2 <- tibble( 54 | confounder_1 = rbinom(n, 1, 0.5), 55 | confounder_2 = rbinom(n, 1, 0.5), 56 | 57 | p_exposure = case_when( 58 | confounder_1 == 1 & confounder_2 == 1 ~ 0.75, 59 | confounder_1 == 0 & confounder_2 == 1 ~ 0.9, 60 | confounder_1 == 1 & confounder_2 == 0 ~ 0.2, 61 | confounder_1 == 0 & confounder_2 == 0 ~ 0.1, 62 | ), 63 | exposure = rbinom(n, 1, p_exposure), 64 | outcome = confounder_1 + confounder_2 + rnorm(n) 65 | ) 66 | ``` 67 | 68 | 1. Group the dataset by the confounders and exposure 69 | 2. Calculate the mean of the outcome for the groups 70 | 71 | ```{r} 72 | sim2 |> 73 | group_by(confounder_1, confounder_2, exposure) |> 74 | summarise(avg_y = mean(outcome)) |> 75 | pivot_wider( 76 | names_from = exposure, 77 | values_from = avg_y, 78 | names_prefix = "x_" 79 | ) |> 80 | summarise(estimate = x_1 - x_0, .groups = "drop") |> 81 | summarise(estimate = mean(estimate)) 82 | ``` 83 | 84 | ## Your Turn 3 85 | 86 | Run the following code to generate `sim3` 87 | 88 | ```{r} 89 | n <- 10000 90 | sim3 <- tibble( 91 | confounder = rnorm(n), 92 | p_exposure = exp(confounder) / (1 + exp(confounder)), 93 | exposure = rbinom(n, 1, p_exposure), 94 | outcome = confounder + rnorm(n) 95 | ) 96 | ``` 97 | 98 | 1. Use `ntile()` from dplyr to calculate a binned version of `confounder` called `confounder_q`. We'll create a variable with 5 bins. 99 | 2. Group the dataset by the binned variable you just created and exposure 100 | 3. Calculate the mean of the outcome for the groups 101 | 102 | ```{r} 103 | sim3 |> 104 | mutate(confounder_q = ntile(confounder, 5)) |> 105 | group_by(confounder_q, exposure) |> 106 | summarise(avg_y = mean(outcome)) |> 107 | pivot_wider( 108 | names_from = exposure, 109 | values_from = avg_y, 110 | names_prefix = "x_" 111 | ) |> 112 | summarise(estimate = x_1 - x_0) |> 113 | summarise(estimate = mean(estimate)) 114 | ``` 115 | 116 | # Take aways 117 | 118 | * Sometimes correlation *is* causation! 119 | * In simple cases, grouping by confounding variables can get us the right answer without a statistical model 120 | * Propensity scores generalize the idea of summarizing exposure effects to any number of confounders. Although we'll use models for this process, the foundations are the same. 121 | -------------------------------------------------------------------------------- /12-whole-game-2-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Whole Game 2: Malaria and Mosquito Nets" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(rsample) 11 | library(ggdag) 12 | library(tipr) 13 | library(propensity) 14 | library(halfmoon) 15 | ``` 16 | 17 | ## Whole Game 2: Malaria and Mosquito Nets 18 | 19 | In this exercise, we'll attempt to answer a causal question: does mosquito bed net use reduce malaria risk? 
20 | 21 | To answer this question, we'll use the causal inference methods we learned in this workshop. You may use any technique you prefer. If you have time, try more than one technique and compare the results. Remember the broad strokes we discussed today: 22 | 23 | 1. Specify causal question (we just did this!) 24 | 2. Draw assumptions (via a causal diagram, which we'll do below) 25 | 3. Model assumptions (e.g. with a propensity score) 26 | 4. Analyze propensities (diagnostics) 27 | 5. Estimate causal effects 28 | 6. Conduct a sensitivity analysis 29 | 30 | We'll use simulated data, `net_data`, from the {causalworkshop} package, which includes ten variables: 31 | `id` 32 | : an ID variable 33 | `net` and `net_num` 34 | : a binary variable indicating if the participant used a net (1) or didn't use a net (0) 35 | `malaria_risk` 36 | : a risk of malaria scale ranging from 0-100 37 | `income` 38 | : weekly income, measured in dollars 39 | `health` 40 | : a health score scale ranging from 0–100 41 | `household` 42 | : number of people living in the household 43 | `eligible` 44 | : a binary variable indicating if the household is eligible for the free net program. 45 | `temperature` 46 | : the average temperature at night, in Celsius 47 | `resistance` 48 | : Insecticide resistance of local mosquitoes. This is measured on a scale of 0–100, with higher values indicating higher resistance. 49 | 50 | Our thanks to Andrew Heiss for providing these data! 51 | 52 | ```{r} 53 | library(causalworkshop) 54 | net_data 55 | ``` 56 | 57 | Here's the DAG we are proposing. Make sure you confirm what the adjustment set is. 58 | 59 | ```{r} 60 | mosquito_dag <- dagify( 61 | malaria_risk ~ net + income + health + temperature + resistance, 62 | net ~ income + health + temperature + eligible + household, 63 | eligible ~ income + household, 64 | health ~ income, 65 | exposure = "net", 66 | outcome = "malaria_risk", 67 | coords = list( 68 | x = c( 69 | malaria_risk = 7, 70 | net = 3, 71 | income = 4, 72 | health = 5, 73 | temperature = 6, 74 | resistance = 8.5, 75 | eligible = 2, 76 | household = 1 77 | ), 78 | y = c( 79 | malaria_risk = 2, 80 | net = 2, 81 | income = 3, 82 | health = 1, 83 | temperature = 3, 84 | resistance = 2, 85 | eligible = 3, 86 | household = 2 87 | ) 88 | ), 89 | labels = c( 90 | malaria_risk = "Risk of malaria", 91 | net = "Mosquito net", 92 | income = "Income", 93 | health = "Health", 94 | temperature = "Nighttime temperatures", 95 | resistance = "Insecticide resistance", 96 | eligible = "Eligible for program", 97 | household = "Number in the household" 98 | ) 99 | ) 100 | 101 | mosquito_dag |> 102 | tidy_dagitty() |> 103 | node_status() |> 104 | ggplot( 105 | aes(x, y, xend = xend, yend = yend, color = status) 106 | ) + 107 | geom_dag_edges() + 108 | geom_dag_point() + 109 | geom_dag_label_repel( 110 | aes(x, y, label = label), 111 | box.padding = 3.5, 112 | inherit.aes = FALSE, 113 | max.overlaps = Inf, 114 | seed = 10, 115 | label.size = NA, 116 | label.padding = 0.1, 117 | size = 14 / 3 118 | ) + 119 | theme_dag(base_size = 14) + 120 | theme(legend.position = "none") + 121 | labs(caption = "Thanks to Andrew Heiss for the data!") + 122 | coord_cartesian(clip = "off") 123 | ``` 124 | 125 | # Your Turn 126 | 127 | Now, conduct the causal analysis as you see fit! Feel free to work in groups and to ask us questions. 128 | 129 | See [this chapter](https://www.r-causal.org/chapters/chapter-02.html) for a detailed example of this analysis. 
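If you would like a scaffold to start from, below is a minimal sketch of one possible approach: inverse probability weighting with ATE weights, assuming the adjustment set implied by the DAG above (`income`, `health`, and `temperature`). Treat it as a starting point under those assumptions, not the canonical solution; matching or g-computation would be equally valid choices here.

```{r}
# a hedged sketch of one possible IPW analysis
# the adjustment set (income, health, temperature) is read off the DAG above;
# verify it yourself with ggdag_adjustment_set(mosquito_dag)
propensity_model <- glm(
  net ~ income + health + temperature,
  data = net_data,
  family = binomial()
)

# add ATE weights (assumes `net` is a binary/logical exposure;
# use `net_num` instead if your version of the data codes it differently)
net_data_wts <- propensity_model |>
  augment(type.predict = "response", data = net_data) |>
  mutate(wts = wt_ate(.fitted, net))

# weighted outcome model for the average treatment effect
lm(malaria_risk ~ net, data = net_data_wts, weights = wts) |>
  tidy()
```

From there, check balance with {halfmoon}, bootstrap the whole pipeline with {rsample} for valid confidence intervals, and finish with a tipping point sensitivity analysis using {tipr}.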
130 | 131 | 132 | -------------------------------------------------------------------------------- /06-intro-pscores-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Propensity Scores" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(touringplans) 11 | library(ggdag) 12 | library(ggokabeito) 13 | ``` 14 | 15 | For Your Turn, we'll be looking at an example using Walt Disney World ride data from the touringplans package. 16 | 17 | Historically, guests who stayed in a Walt Disney World resort hotel were able to access the park during "Extra Magic Hours" during which the park was closed to all other guests. 18 | These extra hours could be in the morning or evening. 19 | The Seven Dwarfs Mine Train is a ride at Walt Disney World's Magic Kingdom. Typically, each day Magic Kingdom may or may not be selected to have these "Extra Magic Hours". 20 | 21 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am. 22 | 23 | Below is a proposed DAG for this question. 24 | 25 | *Render this document to see the DAG or refer to the slides*. 26 | 27 | ```{r} 28 | set.seed(1234) 29 | 30 | coord_dag <- list( 31 | x = c(Season = 0, close = 0, weather = -1, x = 1, y = 2), 32 | y = c(Season = -1, close = 1, weather = 0, x = 0, y = 0) 33 | ) 34 | 35 | labels <- c( 36 | x = "Extra Magic Morning", 37 | y = "Average wait", 38 | Season = "Ticket Season", 39 | weather = "Historic high temperature", 40 | close = "Time park closed" 41 | ) 42 | 43 | dagify( 44 | y ~ x + close + Season + weather, 45 | x ~ weather + close + Season, 46 | coords = coord_dag, 47 | labels = labels, 48 | exposure = "x", 49 | outcome = "y" 50 | ) |> 51 | tidy_dagitty() |> 52 | node_status() |> 53 | ggplot( 54 | aes(x, y, xend = xend, yend = yend, color = status) 55 | ) + 56 | geom_dag_edges_arc(curvature = c(rep(0, 5), .3, 0)) + 57 | geom_dag_point() + 58 | geom_dag_label_repel( 59 | aes(x, y, label = label), 60 | box.padding = 3.5, 61 | inherit.aes = FALSE, 62 | max.overlaps = Inf, 63 | family = "sans", 64 | seed = 1630, 65 | label.size = NA, 66 | label.padding = 0.1, 67 | size = 14 / 3 68 | ) + 69 | scale_color_okabe_ito(na.value = "grey90") + 70 | theme_dag() + 71 | theme( 72 | legend.position = "none", 73 | axis.text.x = element_text() 74 | ) + 75 | coord_cartesian(clip = "off") + 76 | scale_x_continuous( 77 | limits = c(-1.25, 2.25), 78 | breaks = c(-1, 0, 1, 2), 79 | labels = c( 80 | "\n(one year ago)", 81 | "\n(6 months ago)", 82 | "\n(3 months ago)", 83 | "5pm - 6pm\n(Today)" 84 | ) 85 | ) 86 | ``` 87 | 88 | Here we are proposing that there are three confounders: the historic high temperature on the day, the time the park closed, and the ticket season: value, regular, or peak. 89 | 90 | We can build a propensity score model using the `seven_dwarfs_train_2018` data set from the touringplans package. 91 | Each row of this dataset contains information about the Seven Dwarfs Mine Train during a certain hour on a given day. 92 | First we need to subset the data to only include average wait times between 9 and 10 am. 
92 | 93 | ```{r} 94 | seven_dwarfs <- seven_dwarfs_train_2018 |> 95 | filter(wait_hour == 9) 96 | ``` 97 | 98 | Here's a data dictionary of the variables we need in the `seven_dwarfs` data set: 99 | 100 | | Variable | Column in `seven_dwarfs` | 101 | |--------------------------------|--------------------------| 102 | | Posted Wait Time (outcome) | `wait_minutes_posted_avg` | 103 | | Extra Magic Morning (exposure) | `park_extra_magic_morning` | 104 | | Ticket Season | `park_ticket_season` | 105 | | Closing Time | `park_close` | 106 | | Historic Temperature | `park_temperature_high` | 107 | 108 | ## Your Turn 109 | 110 | *After updating the code chunks below, change `eval: true` before rendering* 111 | 112 | Now, fit a propensity score model for `park_extra_magic_morning` using the above proposed confounders. 113 | 114 | ```{r} 115 | #| eval: true 116 | propensity_model <- glm( 117 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 118 | data = seven_dwarfs, 119 | family = binomial() 120 | ) 121 | ``` 122 | 123 | Add the propensity scores to the `seven_dwarfs` data set, call this new dataset `df`. 124 | 125 | ```{r} 126 | #| eval: true 127 | df <- propensity_model |> 128 | augment(type.predict = "response", data = seven_dwarfs) 129 | ``` 130 | 131 | Stretch Goal 1: 132 | 133 | Examine two histograms of the propensity scores, one for days with Extra Magic Morning (`park_extra_magic_morning == 1`) and one for days without it (`park_extra_magic_morning == 0`). 134 | How do these compare? 135 | 136 | ```{r} 137 | ggplot( 138 | df, 139 | aes(x = .fitted, fill = factor(park_extra_magic_morning)) 140 | ) + 141 | geom_histogram(bins = 30) + 142 | scale_y_continuous("Count") + 143 | scale_x_continuous("Propensity Score") + 144 | scale_fill_manual(values = c("orange", "cornflowerblue")) 145 | ``` 146 | -------------------------------------------------------------------------------- /13-bonus-selection-bias-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Bonus: Selection bias and correcting for loss to follow-up" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(propensity) 11 | library(causaldata) 12 | library(rsample) 13 | ``` 14 | 15 | In this example, we'll consider loss to follow-up in the NHEFS study. We'll use the binary exposure we used earlier in the workshop: does quitting smoking (`qsmk`) increase weight (`wt82_71`)? This time, however, we'll adjust for loss to follow-up (people who dropped out of the study between observation periods) using inverse probability of censoring weights. 16 | 17 | # Your Turn 1 18 | 19 | 1. Take a look at how many participants were lost to follow-up in `nhefs`, called `censored` in this data set. You don't need to change anything in this code. 20 | 21 | ```{r} 22 | nhefs_censored <- nhefs |> 23 | drop_na( 24 | qsmk, sex, race, age, school, smokeintensity, smokeyrs, exercise, 25 | active, wt71 26 | ) 27 | 28 | nhefs_censored |> 29 | count(censored = as.factor(censored)) |> 30 | ggplot(aes(censored, n)) + 31 | geom_col() 32 | ``` 33 | 34 | 2. Create a logistic regression model that predicts whether or not someone is censored. 
34 | 35 | ```{r} 36 | cens_model <- glm( 37 | censored ~ qsmk + sex + race + age + I(age^2) + education + 38 | smokeintensity + I(smokeintensity^2) + 39 | smokeyrs + I(smokeyrs^2) + exercise + active + 40 | wt71 + I(wt71^2), 41 | data = nhefs_censored, 42 | family = binomial() 43 | ) 44 | 45 | ``` 46 | 47 | # Your Turn 2 48 | 49 | 1. Use the logistic model you just fit to create inverse probability of censoring weights 50 | 2. Calculate the weights using `.fitted` 51 | 3. Join `cens` to `nhefs_censored` so that you have the weights in your dataset 52 | 4. Fit a linear regression model of `wt82_71` weighted by `cens_wts`, saved as `outcome_model`. We'll use this model as the basis for our G-computation 53 | 54 | ```{r} 55 | cens <- cens_model |> 56 | augment(type.predict = "response", data = nhefs_censored) |> 57 | mutate(cens_wts = wt_ate(.fitted, censored)) |> 58 | select(id, cens_wts) 59 | 60 | # join all the weights data from above 61 | nhefs_censored_wts <- nhefs_censored |> 62 | left_join(cens, by = "id") 63 | 64 | # fit the weighted outcome model (saved separately so it doesn't overwrite 
the censoring model above) 65 | outcome_model <- lm( 66 | wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity + 67 | I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs + 68 | I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2), 69 | data = nhefs_censored_wts, 70 | weights = cens_wts 71 | ) 72 | ``` 73 | 74 | # Your Turn 3 75 | 76 | 1. Create the cloned data sets, called `kept_smoking` and `quit_smoking`, where one dataset has `qsmk` set to 0 (kept smoking) and the other has it set to 1 (quit smoking). 77 | 2. Use the outcome model, `outcome_model`, to make predictions for `kept_smoking` and `quit_smoking` 78 | 3. Calculate the differences between the mean values of `kept_smoking` and `quit_smoking` 79 | 80 | ```{r} 81 | predicted_kept_smoking <- outcome_model |> 82 | augment(newdata = nhefs_censored |> mutate(qsmk = 0)) |> 83 | select(kept_smoking = .fitted) 84 | 85 | predicted_quit_smoking <- outcome_model |> 86 | augment(newdata = nhefs_censored |> mutate(qsmk = 1)) |> 87 | select(quit_smoking = .fitted) 88 | 89 | # summarize the mean difference 90 | bind_cols(predicted_kept_smoking, predicted_quit_smoking) |> 91 | summarise( 92 | est = mean(quit_smoking - kept_smoking) 93 | ) 94 | ``` 95 | 96 | ## Stretch goal: Bootstrapped intervals 97 | 98 | Finish early? Try bootstrapping the G-computation model with censoring weights 99 | 100 | Remember, you need to bootstrap the entire modeling process, including fitting both regression models, cloning the data sets, and calculating the effects. 101 | 102 | ```{r} 103 | fit_gcomp_cens <- function(split, ...) { 104 | .df <- analysis(split) 105 | 106 | # fit the censoring model. remember to model using `.df` instead of `nhefs_censored` 107 | cens_model <- glm( 108 | censored ~ qsmk + sex + race + age + I(age^2) + education + 109 | smokeintensity + I(smokeintensity^2) + 110 | smokeyrs + I(smokeyrs^2) + exercise + active + 111 | wt71 + I(wt71^2), 112 | data = .df, 113 | family = binomial() 114 | ) 115 | # calculate the inverse probability of censoring weights. remember to predict `.df` instead of `nhefs_censored` 116 | cens <- cens_model |> 117 | augment(type.predict = "response", data = .df) |> 118 | mutate(cens_wts = 1 / ifelse(censored == 0, 1 - .fitted, 1)) |> 119 | select(id, cens_wts) 120 | 121 | # join all the weights data from above to .df 122 | nhefs_censored_wts <- .df |> 123 | left_join(cens, by = "id") 124 | 125 | # fit outcome model. 
remember to model using `nhefs_censored_wts` instead of `nhefs_censored` or `.df` 126 | standardized_model <- lm( 127 | wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity + 128 | I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs + 129 | I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2), 130 | data = nhefs_censored_wts, 131 | weights = cens_wts 132 | ) 133 | 134 | # clone datasets. remember to clone `.df` instead of `nhefs_censored` 135 | kept_smoking <- .df |> 136 | mutate(qsmk = 0) 137 | 138 | quit_smoking <- .df |> 139 | mutate(qsmk = 1) 140 | 141 | # predict change in weight for each cloned dataset 142 | predicted_kept_smoking <- standardized_model |> 143 | augment(newdata = kept_smoking) |> 144 | select(kept_smoking = .fitted) 145 | 146 | predicted_quit_smoking <- standardized_model |> 147 | augment(newdata = quit_smoking) |> 148 | select(quit_smoking = .fitted) 149 | 150 | # calculate ATE 151 | bind_cols(predicted_kept_smoking, predicted_quit_smoking) |> 152 | summarize( 153 | mean_quit_smoking = mean(quit_smoking), 154 | mean_kept_smoking = mean(kept_smoking), 155 | difference = mean_quit_smoking - mean_kept_smoking 156 | ) |> 157 | # rsample expects a `term` and `estimate` column 158 | pivot_longer(everything(), names_to = "term", values_to = "estimate") 159 | } 160 | 161 | gcomp_results <- bootstraps(nhefs_censored, 1000, apparent = TRUE) |> 162 | mutate(results = map(splits, fit_gcomp_cens)) 163 | 164 | # using bias-corrected confidence intervals 165 | boot_estimate <- int_bca(gcomp_results, results, .fn = fit_gcomp_cens) 166 | 167 | boot_estimate 168 | ``` 169 | 170 | *** 171 | 172 | # Take aways 173 | 174 | * If loss to follow-up is potentially related to your study question, inverse probability of censoring weights can help mitigate the bias. 175 | * You can use them in many types of models. If you're also using propensity score weights, simply multiply the weights together, then include the result as the weights for your outcome model. 176 | -------------------------------------------------------------------------------- /04-dags-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Causal Diagrams in R" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(ggdag) 10 | library(dagitty) 11 | set.seed(1234) 12 | ``` 13 | 14 | ## Your Turn 1 15 | 16 | Descriptively, drinking coffee is associated with lung cancer. Does drinking coffee cause lung cancer? 17 | 18 | 1. Specify a DAG with `dagify()`. Write your assumption that `smoking` causes `cancer` as a formula. 19 | 2. We're going to assume that coffee does *not* cause cancer, so there's no formula for that. But we still need to declare our causal question. Specify "coffee" as the exposure and "cancer" as the outcome (both in quotation marks). 20 | 3. Plot the DAG using `ggdag()` 21 | 22 | Stretch goal: Under the hood, `ggdag()` calls `tidy_dagitty()` on `dagitty` objects. Check the help page for `?tidy_dagitty()` and try using one of the layouts listed in `?layout_tbl_graph_igraph()`. Experiment with different layouts. Plot with `ggdag()` 23 | 24 | Stretch goal: We are assuming that coffee does *not* cause lung cancer. In causal diagram terminology, we want to know if these two factors are *D-connected* (as in, directionally connected). Since we are assuming that there's no causal relationship, any D-connection is caused by other factors. 
Try `ggdag_dconnected()` to see if coffee and lung cancer are D-connected. 25 | 26 | ```{r} 27 | coffee_cancer_dag <- dagify( 28 | cancer ~ smoking, 29 | smoking ~ addictive, 30 | coffee ~ addictive, 31 | exposure = "coffee", 32 | outcome = "cancer", 33 | labels = c( 34 | "coffee" = "Coffee", 35 | "cancer" = "Lung Cancer", 36 | "smoking" = "Smoking", 37 | "addictive" = "Addictive \nBehavior" 38 | ) 39 | ) 40 | 41 | ggdag(coffee_cancer_dag) 42 | ggdag_dconnected(coffee_cancer_dag) 43 | ``` 44 | 45 | ## Your Turn 2 46 | 47 | Most {ggdag} quick plotting functions are actually wrappers for functions that let you work with the DAG data directly. 48 | 49 | 1. Call `tidy_dagitty()` on `coffee_cancer_dag` to create a tidy DAG, then pass the results to `dag_paths()`. What's different about these data? 50 | 2. Plot the open paths with `ggdag_paths()`. (Just give it `coffee_cancer_dag` rather than using `dag_paths()`; the quick plot function will do that for you.) Remember, since we assume there is *no* causal path from coffee to lung cancer, any open paths must be confounding pathways. 51 | 52 | Stretch goal: ggdags are just ggplots! You can add themes, geoms, and other {ggplot2} elements with `+` like a normal ggplot. Try adding a theme ({ggdag} has several, or you could try `theme_void()`). 53 | 54 | Stretch goal: The variable names are a little hard to read. We specified some labels earlier, so let's use them. In the tidy DAG, these are called `label`. In `ggdag_paths()`, add the argument `use_labels = "label"` and remove the node text with `text = FALSE`. 55 | 56 | ```{r} 57 | coffee_cancer_dag |> 58 | tidy_dagitty() |> 59 | dag_paths() 60 | 61 | ggdag_paths(coffee_cancer_dag) 62 | 63 | # stretch goals 64 | ggdag_paths( 65 | coffee_cancer_dag, 66 | use_labels = "label", 67 | text = FALSE 68 | ) + 69 | theme_dag() 70 | ``` 71 | 72 | 73 | ## Your Turn 3 74 | 75 | Now that we know the open, confounding pathways (sometimes called "backdoor paths"), we need to know how to close them! First, we'll ask {ggdag} for adjustment sets, then we would need to do something in our analysis to account for at least one adjustment set (e.g. multivariable regression, weighting, or matching for the adjustment sets). 76 | 77 | 1. Use `ggdag_adjustment_set()` to visualize the adjustment sets. Add the arguments `use_labels = "label"` and `text = FALSE`. 78 | 2. Write an R formula for each adjustment set, as you might if you were fitting a model in `lm()` or `glm()`. 79 | 80 | Stretch goal: Use `dagitty::adjustmentSets()` to print the adjustment sets to the console. 81 | 82 | Stretch goal: Sometimes, we know a variable plays a vital role in a causal diagram but we can't measure it, or we simply don't have it in our data set. You can tell {ggdag} that a variable is unmeasured with the `latent` argument in `dagify()`. Re-run the `dagify()` call above, but set `latent = "addictive"` (meaning we can't or haven't measured this variable). Plot it with `ggdag_adjustment_set()`. What's different? Now, try setting `latent = c("addictive", "smoking")` and plotting the adjustment set. What do the results mean? 
82 | 83 | ```{r} 84 | ggdag_adjustment_set(coffee_cancer_dag, use_labels = "label", text = FALSE) 85 | 86 | # each adjustment set, written as a model formula with the exposure included 87 | cancer ~ coffee + addictive 88 | cancer ~ coffee + smoking 89 | ``` 90 | 91 | ```{r} 92 | # stretch goal 93 | coffee_cancer_dag2 <- dagify( 94 | cancer ~ smoking, 95 | smoking ~ addictive, 96 | coffee ~ addictive, 97 | exposure = "coffee", 98 | outcome = "cancer", 99 | # `addictive` is unmeasured in our data 100 | latent = "addictive", 101 | labels = c( 102 | "coffee" = "Coffee", 103 | "cancer" = "Lung Cancer", 104 | "smoking" = "Smoking", 105 | "addictive" = "Addictive \nBehavior" 106 | ) 107 | ) 108 | 109 | ggdag_adjustment_set(coffee_cancer_dag2) 110 | ``` 111 | 112 | ```{r} 113 | # stretch goal 114 | coffee_cancer_dag3 <- dagify( 115 | cancer ~ smoking, 116 | smoking ~ addictive, 117 | coffee ~ addictive, 118 | exposure = "coffee", 119 | outcome = "cancer", 120 | # `addictive` and `smoking` are unmeasured in our data 121 | latent = c("smoking", "addictive"), 122 | labels = c( 123 | "coffee" = "Coffee", 124 | "cancer" = "Lung Cancer", 125 | "smoking" = "Smoking", 126 | "addictive" = "Addictive \nBehavior" 127 | ) 128 | ) 129 | 130 | ggdag_adjustment_set(coffee_cancer_dag3) 131 | ``` 132 | 133 | ## Your Turn 4 134 | 135 | Time-ordering your DAGs is incredibly useful, because it makes it cleaner, easier to read, and easier to understand what to control for --- and *not* control for. 136 | 137 | ggdag includes several ways to manually specify coordinates, including `time_ordered_coords()`. This function automatically determines the time ordering of the DAG based on the causal relationships. After all, something that causes something else *must* come before it; otherwise, we'd be violating the time-space continuum. 138 | 139 | Recreate the DAG we've been working with using `time_ordered_coords()`, then visualize the DAG. You don't need to use any arguments for this function, so `coords = time_ordered_coords()` will do. 140 | 141 | ```{r} 142 | coffee_cancer_dag_to <- dagify( 143 | cancer ~ smoking, 144 | smoking ~ addictive, 145 | coffee ~ addictive, 146 | exposure = "coffee", 147 | outcome = "cancer", 148 | coords = time_ordered_coords(), 149 | labels = c( 150 | "coffee" = "Coffee", 151 | "cancer" = "Lung Cancer", 152 | "smoking" = "Smoking", 153 | "addictive" = "Addictive \nBehavior" 154 | ) 155 | ) 156 | 157 | ggdag(coffee_cancer_dag_to, use_labels = "label", text = FALSE) 158 | ``` 159 | 160 | # Take aways 161 | 162 | * Draw your assumptions with DAGs! Use `dagify()` to specify them and `ggdag()` and friends to draw them. 163 | * The main goal for many analyses is to close backdoor (non-causal) paths. {ggdag} and {dagitty} can help you identify them. 164 | * Adjustment sets are key for closing backdoor paths. Take a reasonable set and use it in your model to get a causal effect estimate. 165 | -------------------------------------------------------------------------------- /14-bonus-continuous-pscores-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Propensity scores for continuous exposures" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(touringplans) 11 | library(propensity) 12 | ``` 13 | 14 | For this set of exercises, we'll use propensity scores for continuous exposures. 15 | 16 | In the touringplans data set, we have information about the posted waiting times for rides. We also have a limited amount of data on the observed, actual times. 
The question that we will consider is this: Do posted wait times (`wait_minutes_posted_avg`) for the Seven Dwarfs Mine Train at 8 am affect actual wait times (`wait_minutes_actual_avg`) at 9 am? Here’s our DAG: 17 | 18 | ```{r} 19 | #| echo: false 20 | #| message: false 21 | #| warning: false 22 | library(ggdag) 23 | library(ggokabeito) 24 | 25 | coord_dag <- list( 26 | x = c(Season = -1, close = -1, weather = -2, extra = 0, x = 1, y = 2), 27 | y = c(Season = -1, close = 1, weather = 0, extra = 0, x = 0, y = 0) 28 | ) 29 | 30 | labels <- c( 31 | extra = "Extra Magic Morning", 32 | x = "Average posted wait", 33 | y = "Average actual wait", 34 | Season = "Ticket Season", 35 | weather = "Historic high temperature", 36 | close = "Time park closed" 37 | ) 38 | 39 | dagify( 40 | y ~ x + close + Season + weather + extra, 41 | x ~ weather + close + Season + extra, 42 | extra ~ weather + close + Season, 43 | coords = coord_dag, 44 | labels = labels, 45 | exposure = "x", 46 | outcome = "y" 47 | ) |> 48 | tidy_dagitty() |> 49 | node_status() |> 50 | ggplot( 51 | aes(x, y, xend = xend, yend = yend, color = status) 52 | ) + 53 | geom_dag_edges_arc(curvature = c(rep(0, 7), .2, 0, .2, .2, 0), edge_colour = "grey70") + 54 | geom_dag_point() + 55 | geom_dag_label_repel( 56 | aes(x, y, label = label), 57 | box.padding = 3.5, 58 | inherit.aes = FALSE, 59 | max.overlaps = Inf, 60 | family = "sans", 61 | seed = 1602, 62 | label.size = NA, 63 | label.padding = 0.1, 64 | size = 14 / 3 65 | ) + 66 | scale_color_okabe_ito(na.value = "grey90") + 67 | theme_dag() + 68 | theme( 69 | legend.position = "none", 70 | axis.text.x = element_text() 71 | ) + 72 | coord_cartesian(clip = "off") + 73 | scale_x_continuous( 74 | limits = c(-2.25, 2.25), 75 | breaks = c(-2, -1, 0, 1, 2), 76 | labels = c( 77 | "\n(one year ago)", 78 | "\n(6 months ago)", 79 | "\n(3 months ago)", 80 | "8am-9am\n(Today)", 81 | "9am-10am\n(Today)" 82 | ) 83 | ) 84 | ``` 85 | 86 | First, let’s wrangle our data to address our question: do posted wait times at 8 affect actual wait times at 9? We’ll join the baseline data (all covariates and posted wait time at 8) with the outcome (average actual time). We also have a lot of missingness for `wait_minutes_actual_avg`, so we’ll drop unobserved values for now. 87 | 88 | You don't need to update any code here, so just run this. 89 | 90 | ```{r} 91 | eight <- seven_dwarfs_train_2018 |> 92 | filter(wait_hour == 8) |> 93 | select(-wait_minutes_actual_avg) 94 | 95 | nine <- seven_dwarfs_train_2018 |> 96 | filter(wait_hour == 9) |> 97 | select(park_date, wait_minutes_actual_avg) 98 | 99 | wait_times <- eight |> 100 | left_join(nine, by = "park_date") |> 101 | drop_na(wait_minutes_actual_avg) 102 | ``` 103 | 104 | # Your Turn 1 105 | 106 | First, let’s calculate the propensity score model, which will be the denominator in our stabilized weights (more to come on that soon). We’ll fit a model using `lm()` for `wait_minutes_posted_avg` with our covariates, then use the fitted predictions (`.fitted`) and residual standard deviation (`.sigma`) to calculate the density using `dnorm()`. 107 | 108 | 1. Fit a model using `lm()` with `wait_minutes_posted_avg` as the outcome and the confounders identified in the DAG. 109 | 2. Use `augment()` to add model predictions to the data frame. 110 | 3. In `wt_ate()`, calculate the weights using `wait_minutes_posted_avg`, `.fitted`, and `.sigma`. 
110 | 111 | ```{r} 112 | post_time_model <- lm( 113 | wait_minutes_posted_avg ~ park_close + park_extra_magic_morning + park_temperature_high + park_ticket_season, 114 | data = wait_times 115 | ) 116 | 117 | wait_times_wts <- post_time_model |> 118 | augment(data = wait_times) |> 119 | mutate(wts = wt_ate( 120 | wait_minutes_posted_avg, .fitted, .sigma = .sigma 121 | )) 122 | ``` 123 | 124 | # Your Turn 2 125 | 126 | As with the example in the slides, we have a lot of extreme values for our weights 127 | 128 | ```{r} 129 | wait_times_wts |> 130 | ggplot(aes(wts)) + 131 | geom_density(col = "#E69F00", fill = "#E69F0095") + 132 | scale_x_log10() + 133 | theme_minimal(base_size = 20) + 134 | xlab("Weights") 135 | ``` 136 | 137 | Let’s now fit the marginal density to use for stabilized weights: 138 | 139 | 1. Re-fit the above using stabilized weights 140 | 141 | ```{r} 142 | wait_times_swts <- post_time_model |> 143 | augment(data = wait_times) |> 144 | mutate(swts = wt_ate( 145 | wait_minutes_posted_avg, 146 | .fitted, 147 | .sigma = .sigma, 148 | stabilize = TRUE 149 | )) 150 | ``` 151 | 152 | Take a look at the weights now that we've stabilized them: 153 | 154 | ```{r} 155 | ggplot(wait_times_swts, aes(swts)) + 156 | geom_density(col = "#E69F00", fill = "#E69F0095") + 157 | scale_x_log10() + 158 | theme_minimal(base_size = 20) + 159 | xlab("Stabilized Weights") 160 | ``` 161 | 162 | # Your Turn 3 163 | 164 | Now, let's fit the outcome model! 165 | 166 | 1. Estimate the relationship between posted wait times and actual wait times using the stabilized weights we just created. 167 | 168 | ```{r} 169 | lm(wait_minutes_actual_avg ~ wait_minutes_posted_avg, weights = swts, data = wait_times_swts) |> 170 | tidy() |> 171 | filter(term == "wait_minutes_posted_avg") |> 172 | mutate(estimate = estimate * 10) # rescale: effect of a 10-minute increase in posted wait 173 | ``` 174 | 175 | ## Stretch goal: Bootstrapped intervals 176 | 177 | Bootstrap confidence intervals for our estimate. 178 | 179 | There's nothing new here. Just remember, you need to bootstrap the entire modeling process! 180 | 181 | ```{r} 182 | set.seed(1234) 183 | library(rsample) 184 | 185 | fit_model <- function(split, ...) { 186 | .df <- analysis(split) 187 | 188 | # fill in the rest! 189 | 190 | post_time_model <- lm( 191 | wait_minutes_posted_avg ~ park_close + park_extra_magic_morning + park_temperature_high + park_ticket_season, 192 | data = .df 193 | ) 194 | 195 | .df_swts <- post_time_model |> 196 | augment(data = .df) |> 197 | mutate(swts = wt_ate( 198 | wait_minutes_posted_avg, 199 | .fitted, 200 | .sigma = .sigma, 201 | stabilize = TRUE 202 | )) 203 | 204 | lm(wait_minutes_actual_avg ~ wait_minutes_posted_avg, weights = swts, data = .df_swts) |> 205 | tidy() |> 206 | filter(term == "wait_minutes_posted_avg") |> 207 | mutate(estimate = estimate * 10) 208 | } 209 | 210 | model_estimate <- bootstraps(wait_times, 1000, apparent = TRUE) |> 211 | mutate(results = map(splits, fit_model)) 212 | 213 | # using bias-corrected confidence intervals 214 | boot_estimate <- int_bca(model_estimate, results, .fn = fit_model) 215 | 216 | boot_estimate 217 | ``` 218 | 219 | *** 220 | 221 | # Take aways 222 | 223 | * We can calculate propensity scores for continuous exposures. `wt_ate()` uses `dnorm()` to transform predictions to a propensity-like scale via the normal density; we need to give `wt_ate()` `.sigma` to do this. We can also use other approaches, like quantile binning of the exposure or calculating probability-based propensity scores using categorical regression models. 
* Continuous exposures are prone to misspecification and usually need to be stabilized. A simple stabilization multiplies the inverse probability weights by the marginal density of the exposure, which we can estimate with an intercept-only model such as `lm(exposure ~ 1)`. `wt_ate()` can do this for you automatically with `stabilize = TRUE`. This also applies to other types of exposures. 226 | * Stabilization is useful for any type of exposure where the weights are unbounded. Bounded weights, like ATO weights, are less susceptible to extreme values. 227 | * Using propensity scores for continuous exposures in outcome models is identical to using them with binary exposures. 228 | * Because propensity scores for continuous exposures are prone to positivity violations, check the bootstrap distribution of your estimate for skew and to see if the mean estimate is different from your regression model. If these problems are present, you may need to use another approach like g-computation. 229 | -------------------------------------------------------------------------------- /10-continuous-g-computation-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Continuous exposures and g-computation" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(touringplans) 11 | library(splines) 12 | ``` 13 | 14 | For this set of exercises, we'll use g-computation to calculate a causal effect for continuous exposures. 15 | 16 | In the touringplans data set, we have information about the posted waiting times for rides. We also have a limited amount of data on the observed, actual times. The question that we will consider is this: Do posted wait times (`wait_minutes_posted_avg`) for the Seven Dwarfs Mine Train at 8 am affect actual wait times (`wait_minutes_actual_avg`) at 9 am? 
Here’s our DAG: 17 | 18 | ```{r} 19 | #| echo: false 20 | #| message: false 21 | #| warning: false 22 | library(ggdag) 23 | library(ggokabeito) 24 | 25 | coord_dag <- list( 26 | x = c(Season = -1, close = -1, weather = -2, extra = 0, x = 1, y = 2), 27 | y = c(Season = -1, close = 1, weather = 0, extra = 0, x = 0, y = 0) 28 | ) 29 | 30 | labels <- c( 31 | extra = "Extra Magic Morning", 32 | x = "Average posted wait", 33 | y = "Average actual wait", 34 | Season = "Ticket Season", 35 | weather = "Historic high temperature", 36 | close = "Time park closed" 37 | ) 38 | 39 | dagify( 40 | y ~ x + close + Season + weather + extra, 41 | x ~ weather + close + Season + extra, 42 | extra ~ weather + close + Season, 43 | coords = coord_dag, 44 | labels = labels, 45 | exposure = "x", 46 | outcome = "y" 47 | ) |> 48 | tidy_dagitty() |> 49 | node_status() |> 50 | ggplot( 51 | aes(x, y, xend = xend, yend = yend, color = status) 52 | ) + 53 | geom_dag_edges_arc(curvature = c(rep(0, 7), .2, 0, .2, .2, 0), edge_colour = "grey70") + 54 | geom_dag_point() + 55 | geom_dag_label_repel( 56 | aes(x, y, label = label), 57 | box.padding = 3.5, 58 | inherit.aes = FALSE, 59 | max.overlaps = Inf, 60 | family = "sans", 61 | seed = 1602, 62 | label.size = NA, 63 | label.padding = 0.1, 64 | size = 14 / 3 65 | ) + 66 | scale_color_okabe_ito(na.value = "grey90") + 67 | theme_dag() + 68 | theme( 69 | legend.position = "none", 70 | axis.text.x = element_text() 71 | ) + 72 | coord_cartesian(clip = "off") + 73 | scale_x_continuous( 74 | limits = c(-2.25, 2.25), 75 | breaks = c(-2, -1, 0, 1, 2), 76 | labels = c( 77 | "\n(one year ago)", 78 | "\n(6 months ago)", 79 | "\n(3 months ago)", 80 | "8am-9am\n(Today)", 81 | "9am-10am\n(Today)" 82 | ) 83 | ) 84 | ``` 85 | 86 | First, let’s wrangle our data to address our question: do posted wait times at 8 affect actual wait times at 9? We’ll join the baseline data (all covariates and posted wait time at 8) with the outcome (average actual time). We also have a lot of missingness for `wait_minutes_actual_avg`, so we’ll drop unobserved values for now. 87 | 88 | You don't need to update any code here, so just run this. 89 | 90 | ```{r} 91 | eight <- seven_dwarfs_train_2018 |> 92 | filter(wait_hour == 8) |> 93 | select(-wait_minutes_actual_avg) 94 | 95 | nine <- seven_dwarfs_train_2018 |> 96 | filter(wait_hour == 9) |> 97 | select(park_date, wait_minutes_actual_avg) 98 | 99 | wait_times <- eight |> 100 | left_join(nine, by = "park_date") |> 101 | drop_na(wait_minutes_actual_avg) 102 | ``` 103 | 104 | # Your Turn 1 105 | 106 | For the parametric G-formula, we'll use a single model to fit a causal model of Posted Waiting Times (`wait_minutes_posted_avg`) on Actual Waiting Times (`wait_minutes_actual_avg`) where we include all covariates, much as we normally fit regression models. However, instead of interpreting the coefficients, we'll calculate the estimate by predicting on cloned data sets. 107 | 108 | Two additional differences in our model: we'll use a natural cubic spline on the exposure, `wait_minutes_posted_avg`, using `ns()` from the splines package, and we'll include an interaction term between `wait_minutes_posted_avg` and `park_extra_magic_morning`. These complicate the interpretation of the coefficient of the model in normal regression but have virtually no downside (as long as we have a reasonable sample size) in g-computation, because we still get an easily interpretable result. 109 | 110 | First, let's fit the model. 
110 | 111 | 1. Use `lm()` to create a model with the outcome, exposure, and confounders identified in the DAG. 112 | 2. Save the model as `standardized_model`. 113 | 114 | ```{r} 115 | 116 | standardized_model <- lm( 117 | wait_minutes_actual_avg ~ ns(wait_minutes_posted_avg, df = 2) * park_extra_magic_morning + park_temperature_high + park_ticket_season + park_close, 118 | data = wait_times 119 | ) 120 | ``` 121 | 122 | # Your Turn 2 123 | 124 | Now that we've fit a model, we need to clone our data set. To do this, we'll simply mutate it so that in one set, all participants have `wait_minutes_posted_avg` set to 30 minutes and in another, all participants have `wait_minutes_posted_avg` set to 60 minutes. 125 | 126 | 1. Create the cloned data sets, called `thirty` and `sixty`. 127 | 2. For both data sets, use `standardized_model` and `augment()` to get the predicted values. Use the `newdata` argument in `augment()` with the relevant cloned data set. Then, select only the fitted value. Rename `.fitted` to either `thirty_posted_minutes` or `sixty_posted_minutes` (use the pattern `select(new_name = old_name)`). 128 | 3. Save the predicted data sets as `predicted_thirty` and `predicted_sixty`. 129 | 130 | ```{r} 131 | thirty <- wait_times |> 132 | mutate(wait_minutes_posted_avg = 30) 133 | 134 | sixty <- wait_times |> 135 | mutate(wait_minutes_posted_avg = 60) 136 | 137 | predicted_thirty <- standardized_model |> 138 | augment(newdata = thirty) |> 139 | select(thirty_posted_minutes = .fitted) 140 | 141 | predicted_sixty <- standardized_model |> 142 | augment(newdata = sixty) |> 143 | select(sixty_posted_minutes = .fitted) 144 | ``` 145 | 146 | # Your Turn 3 147 | 148 | Finally, we'll get the mean differences between the values. 149 | 150 | 1. Bind `predicted_thirty` and `predicted_sixty` using `bind_cols()` 151 | 2. Summarize the predicted values to create three new variables: `mean_thirty`, `mean_sixty`, and `difference`. The first two should be the means of `thirty_posted_minutes` and `sixty_posted_minutes`. `difference` should be `mean_sixty` minus `mean_thirty`. 152 | 153 | ```{r} 154 | bind_cols(predicted_thirty, predicted_sixty) |> 155 | summarize( 156 | mean_thirty = mean(thirty_posted_minutes), 157 | mean_sixty = mean(sixty_posted_minutes), 158 | difference = mean_sixty - mean_thirty 159 | ) 160 | ``` 161 | 162 | That's it! `difference` is our effect estimate, marginalized over the spline terms, interaction effects, and confounders. 163 | 164 | ## Stretch goal: Bootstrapped intervals 165 | 166 | As with propensity-based models, we need to do a little more work to get correct standard errors and confidence intervals. In this stretch goal, use rsample to bootstrap the estimates we got from the G-computation model. 167 | 168 | Remember, you need to bootstrap the entire modeling process, including the regression model, cloning the data sets, and calculating the effects. 169 | 170 | ```{r} 171 | set.seed(1234) 172 | library(rsample) 173 | 174 | fit_gcomp <- function(split, ...) { 175 | .df <- analysis(split) 176 | 177 | # fit outcome model. remember to model using `.df` instead of `wait_times` 178 | standardized_model <- lm( 179 | wait_minutes_actual_avg ~ ns(wait_minutes_posted_avg, df = 2) * park_extra_magic_morning + park_temperature_high + park_ticket_season + park_close, 180 | data = .df 181 | ) 182 | 183 | # clone datasets. 
remember to clone `.df` instead of `wait_times`
184 |   thirty <- .df |>
185 |     mutate(wait_minutes_posted_avg = 30)
186 | 
187 |   sixty <- .df |>
188 |     mutate(wait_minutes_posted_avg = 60)
189 | 
190 |   # predict actual wait time for each cloned dataset
191 |   predicted_thirty <- standardized_model |>
192 |     augment(newdata = thirty) |>
193 |     select(thirty_posted_minutes = .fitted)
194 | 
195 |   predicted_sixty <- standardized_model |>
196 |     augment(newdata = sixty) |>
197 |     select(sixty_posted_minutes = .fitted)
198 | 
199 |   # calculate ATE
200 |   bind_cols(predicted_thirty, predicted_sixty) |>
201 |     summarize(
202 |       mean_thirty = mean(thirty_posted_minutes),
203 |       mean_sixty = mean(sixty_posted_minutes),
204 |       difference = mean_sixty - mean_thirty
205 |     ) |>
206 |     # rsample expects a `term` and `estimate` column
207 |     pivot_longer(everything(), names_to = "term", values_to = "estimate")
208 | }
209 | 
210 | gcomp_results <- bootstraps(wait_times, 1000, apparent = TRUE) |>
211 |   mutate(results = map(splits, fit_gcomp))
212 | 
213 | # using bias-corrected confidence intervals
214 | boot_estimate <- int_bca(gcomp_results, results, .fn = fit_gcomp)
215 | 
216 | boot_estimate
217 | ```
218 | 
219 | ***
220 | 
221 | # Take aways
222 | 
223 | * To fit the parametric G-formula, fit a standardized model with all covariates. Then, use cloned data sets with values set to each level of the exposure you want to study.
224 | * Use the model to predict the values for that level of the exposure and compute the effect estimate you want.
225 | 
-------------------------------------------------------------------------------- /01-whole-game-exercises.qmd: --------------------------------------------------------------------------------
1 | ---
2 | title: "Causal Modeling in R: Whole Game"
3 | format: html
4 | ---
5 | 
6 | ```{r}
7 | #| label: setup
8 | library(tidyverse)
9 | library(broom)
10 | library(rsample)
11 | library(ggdag)
12 | library(causaldata)
13 | library(halfmoon)
14 | library(propensity)
15 | library(ggokabeito)
16 | 
17 | set.seed(1234)
18 | ```
19 | 
20 | ## Causal Modeling: Whole Game
21 | 
22 | In this guided exercise, we'll attempt to answer a causal question: does quitting smoking make you gain weight? Causal modeling has a special place in the history of smoking research: the studies that demonstrated that smoking causes lung cancer were observational. Thanks to other studies, we also know that, if you're already a smoker, quitting smoking reduces your risk of lung cancer. However, some have observed that former smokers tend to gain weight. Is this the result of quitting smoking, or does something else explain this effect? In the book *Causal Inference* by Hernán and Robins, the authors analyze this question using several causal inference techniques.
23 | 
24 | To answer this question, we'll use causal inference methods to examine the relationship between quitting smoking and gaining weight. First, we'll draw our assumptions with a causal diagram (a directed acyclic graph, or DAG), which will guide our model. Then, we'll use a modeling approach called inverse probability weighting--one of many causal modeling techniques--to estimate the causal effect we're interested in.
25 | 
26 | We'll use data from NHEFS to try to estimate the causal effect of quitting smoking on weight gain. NHEFS is a longitudinal, observational study that has many of the variables we'll need. Take a look at `nhefs_codebook` if you want to know more about the variables in this data set. These data are included in the {causaldata} package. 
We'll use the `nhefs_complete` data set, but we'll remove people who were lost to follow-up.
27 | 
28 | ```{r}
29 | nhefs_complete_uc <- nhefs_complete |>
30 |   filter(censored == 0)
31 | nhefs_complete_uc
32 | ```
33 | 
34 | Let's look at the distribution of weight gain between the two groups.
35 | 
36 | ```{r}
37 | nhefs_complete_uc |>
38 |   ggplot(aes(wt82_71, fill = factor(qsmk))) +
39 |   geom_vline(xintercept = 0, color = "grey60", linewidth = 1) +
40 |   geom_density(color = "white", alpha = .75, linewidth = .5) +
41 |   # fill scale to match the fill aesthetic mapped above
42 |   scale_fill_okabe_ito(order = c(1, 5)) +
43 |   theme_minimal() +
44 |   theme(legend.position = "bottom") +
45 |   labs(
46 |     x = "change in weight (kg)",
47 |     fill = "quit smoking (1 = yes)"
48 |   )
49 | ```
50 | 
51 | There's a difference--former smokers do seem to have gained a bit more weight--but there's also a lot of variation. Let's look at the numeric summaries.
52 | 
53 | ```{r}
54 | # ~2.5 kg gained for quit vs. not quit
55 | nhefs_complete_uc |>
56 |   group_by(qsmk) |>
57 |   summarize(
58 |     mean_weight_change = mean(wt82_71),
59 |     sd = sd(wt82_71),
60 |     .groups = "drop"
61 |   )
62 | ```
63 | 
64 | Here, it looks like those who quit smoking gained, on average, 2.5 kg. But is there something else that could explain these results? There are many factors associated with both quitting smoking and gaining weight; could one of those factors explain away the results we're seeing here?
65 | 
66 | To truly answer this question, we need to specify a causal diagram based on domain knowledge. Sadly, for most circumstances, there is no data-driven approach that consistently identifies confounders. Only our causal assumptions can help us identify them. Causal diagrams are a visual expression of those assumptions linked to rigorous mathematics that allow us to understand what we need to account for in our model.
67 | 
68 | In R, we can visualize and analyze our DAGs with the {ggdag} package. {ggdag} uses {ggplot2} and {ggraph} to visualize diagrams and {dagitty} to analyze them. Let's set up our assumptions. The `dagify()` function takes formulas, much like `lm()` and friends, to express assumptions. We have two basic causal structures: the causes of quitting smoking and the causes of gaining weight. We're assuming that the same set of variables affects both. Additionally, we're adding `qsmk` as a cause of `wt82_71`, which is our causal question; we also identify these as our outcome and exposure. Finally, we'll add some labels so the diagram is easier to understand. The result is a `dagitty` object, and we can transform it to a `tidy_dagitty` data set with `tidy_dagitty()`. 
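To see the pattern in miniature before we build the full DAG, here's a toy example (hypothetical variables `x`, `y`, and `z`; this isn't our analysis):

```{r}
# z confounds the effect of x on y
toy_dag <- dagify(
  y ~ x + z,
  x ~ z,
  exposure = "x",
  outcome = "y"
)

toy_dag
```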
68 | 
69 | ```{r}
70 | # set up DAG
71 | smk_wt_dag <- dagify(
72 |   # specify causes of quitting smoking and weight gain:
73 |   qsmk ~ sex + race + age + education +
74 |     smokeintensity + smokeyrs + exercise + active + wt71,
75 |   wt82_71 ~ qsmk + sex + race + age + education +
76 |     smokeintensity + smokeyrs + exercise + active + wt71,
77 |   # specify causal question:
78 |   exposure = "qsmk",
79 |   outcome = "wt82_71",
80 |   coords = time_ordered_coords(),
81 |   # set up labels:
82 |   # here, I'll use the same variable names as the data set, but I'll label them
83 |   # with clearer names
84 |   labels = c(
85 |     # causal question
86 |     "qsmk" = "quit\nsmoking",
87 |     "wt82_71" = "change in\nweight",
88 | 
89 |     # demographics
90 |     "age" = "age",
91 |     "sex" = "sex",
92 |     "race" = "race",
93 |     "education" = "education",
94 | 
95 |     # health
96 |     "wt71" = "baseline\nweight",
97 |     "active" = "daily\nactivity\nlevel",
98 |     "exercise" = "exercise",
99 | 
100 |     # smoking history
101 |     "smokeintensity" = "smoking\nintensity",
102 |     "smokeyrs" = "yrs of\nsmoking"
103 |   )
104 | ) |>
105 |   tidy_dagitty()
106 | 
107 | smk_wt_dag
108 | ```
109 | 
110 | Let's visualize our assumptions with `ggdag()`.
111 | 
112 | ```{r}
113 | smk_wt_dag |>
114 |   ggdag(text = FALSE, use_labels = "label")
115 | ```
116 | 
117 | What do we need to control for to estimate an unbiased effect of quitting smoking on weight gain? In many DAGs, there will be many sets of variables--called adjustment sets--that will give us the right effect (assuming our DAG is correct--a big, unverifiable assumption!). `ggdag_adjustment_set()` can help you visualize them. Here, there's only one adjustment set: we need to control for everything! While we're at it, since a {ggdag} plot is just a {ggplot2} plot, let's clean it up a bit, too.
118 | 
119 | ```{r}
120 | smk_wt_dag |>
121 |   ggdag_adjustment_set(text = FALSE, use_labels = "label") +
122 |   theme_dag() +
123 |   scale_color_okabe_ito(order = c(1, 5)) +
124 |   scale_fill_okabe_ito(order = c(1, 5))
125 | ```
126 | 
127 | Let's fit a model with these variables. Note that we'll fit all continuous variables with squared terms, as well, to allow them a bit of flexibility.
128 | 
129 | ```{r}
130 | lm(
131 |   wt82_71 ~ qsmk + sex +
132 |     race + age + I(age^2) + education +
133 |     smokeintensity + I(smokeintensity^2) +
134 |     smokeyrs + I(smokeyrs^2) + exercise + active +
135 |     wt71 + I(wt71^2),
136 |   data = nhefs_complete_uc
137 | ) |>
138 |   tidy(conf.int = TRUE) |>
139 |   filter(term == "qsmk")
140 | ```
141 | 
142 | When we adjust for the variables in our DAG, we get an estimate of about 3.5 kg--people who quit smoking gained about this amount of weight. However, we are trying to answer a specific causal question: how much weight would a person gain if they quit smoking vs. if the same person did not quit smoking? Let's use an inverse probability weighting model to try to estimate that effect at the population level (what if *everyone* quit smoking vs what if *no one* quit smoking).
143 | 
144 | For a simple IPW model, we have two modeling steps. First, we fit a propensity score model, which predicts the probability that you received a treatment or exposure (here, that a participant quit smoking). We use this model to calculate inverse probability weights--1 / your probability of treatment. Then, in the second step, we use these weights in the outcome model, which estimates the effect of exposure on the outcome (here, the effect of quitting smoking on gaining weight). 
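Concretely, writing $e(X_i)$ for the propensity score and $A_i$ for the exposure, the ATE weights are (this is the standard formula, not anything specific to this data):

$$
w_i = \frac{A_i}{e(X_i)} + \frac{1 - A_i}{1 - e(X_i)}
$$

Participants who quit get $1 / e(X_i)$ and participants who didn't get $1 / (1 - e(X_i))$; this is what `wt_ate()` calculates for a binary exposure.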
145 | 
146 | For the propensity score model, we'll use logistic regression (since quitting smoking is a binary variable). The outcome is quitting smoking, and the variables in the model are all those included in our adjustment set. Then, we'll use `augment()` from {broom} (which calls `predict()` on the inside) to calculate our weights using `propensity::wt_ate()` and save them back into our data set.
147 | 
148 | 
149 | ```{r}
150 | propensity_model <- glm(
151 |   qsmk ~ sex +
152 |     race + age + I(age^2) + education +
153 |     smokeintensity + I(smokeintensity^2) +
154 |     smokeyrs + I(smokeyrs^2) + exercise + active +
155 |     wt71 + I(wt71^2),
156 |   family = binomial(),
157 |   data = nhefs_complete_uc
158 | )
159 | 
160 | nhefs_complete_uc <- propensity_model |>
161 |   # predict whether quit smoking
162 |   augment(type.predict = "response", data = nhefs_complete_uc) |>
163 |   # calculate inverse probability
164 |   mutate(wts = wt_ate(.fitted, qsmk))
165 | 
166 | nhefs_complete_uc |>
167 |   select(qsmk, .fitted, wts)
168 | ```
169 | 
170 | Let's look at the distribution of the weights.
171 | 
172 | ```{r}
173 | ggplot(nhefs_complete_uc, aes(wts)) +
174 |   geom_histogram(color = "white", fill = "#E69F00", bins = 50) +
175 |   # use a log scale for the x axis
176 |   scale_x_log10() +
177 |   theme_minimal(base_size = 20) +
178 |   xlab("Weights")
179 | ```
180 | 
181 | The distribution looks a little skewed; in particular, some participants have much higher weights than the rest. There are a few techniques for dealing with this--trimming weights and stabilizing weights--but we'll keep it simple for now and just use them as is.
182 | 
183 | The main goal here is to *break* the non-causal associations between quitting smoking and gaining weight--the other paths that might distort our results. In other words, if we succeed, there should be no differences in these variables between our two groups, those who quit smoking and those who didn't. This is where randomized trials shine; you can often assume that there are no baseline differences among potential confounders between your treatment groups (of course, no study is perfect, and there's a whole set of literature on dealing with this problem in randomized trials).
184 | 
185 | Standardized mean differences (SMD) are a simple measure of differences that works across variable types. In general, the closer to 0 we are, the better job we have done eliminating the non-causal relationships we drew in our DAG. Note that low SMDs for everything we adjust for do *not* mean that there is not something else that might confound our study. Unmeasured confounders or misspecified DAGs can still distort our effects, even if our SMDs look great!
186 | 
187 | We'll use the {halfmoon} package to calculate the SMDs, then visualize them.
188 | 
189 | ```{r}
190 | vars <- c(
191 |   "sex", "race", "age", "education",
192 |   "smokeintensity", "smokeyrs",
193 |   "exercise", "active", "wt71"
194 | )
195 | 
196 | plot_df <- tidy_smd(
197 |   nhefs_complete_uc,
198 |   all_of(vars),
199 |   qsmk,
200 |   wts
201 | )
202 | 
203 | ggplot(
204 |   data = plot_df,
205 |   mapping = aes(x = abs(smd), y = variable, group = method, color = method)
206 | ) +
207 |   geom_love()
208 | ```
209 | 
210 | These look pretty good! Some variables are better than others, but weighting appears to have done a much better job eliminating these differences than an unadjusted analysis.
211 | 
212 | We can also use halfmoon's `geom_mirror_histogram()` to visualize the impact that the weights are having on our population. 
213 | 
214 | ```{r}
215 | nhefs_complete_uc |>
216 |   mutate(qsmk = factor(qsmk)) |>
217 |   ggplot(aes(.fitted)) +
218 |   geom_mirror_histogram(
219 |     aes(group = qsmk),
220 |     bins = 50
221 |   ) +
222 |   geom_mirror_histogram(
223 |     aes(fill = qsmk, weight = wts),
224 |     bins = 50,
225 |     alpha = .5
226 |   ) +
227 |   scale_y_continuous(labels = abs) +
228 |   labs(x = "propensity score") +
229 |   theme_minimal(base_size = 20)
230 | ```
231 | 
232 | Both groups are being *upweighted* so that their distributions of propensity scores are much more similar.
233 | 
234 | We could do more here to analyze our assumptions, but let's move on to our second step: fitting the outcome model weighted by our inverse probabilities. Some researchers call these Marginal Structural Models, in part because the model is marginal; we only need to include our outcome (`wt82_71`) and exposure (`qsmk`). The other variables aren't in the model; they are accounted for with the IPWs!
235 | 
236 | ```{r}
237 | ipw_model <- lm(
238 |   wt82_71 ~ qsmk,
239 |   data = nhefs_complete_uc,
240 |   weights = wts # inverse probability weights
241 | )
242 | 
243 | ipw_estimate <- ipw_model |>
244 |   tidy(conf.int = TRUE) |>
245 |   filter(term == "qsmk")
246 | 
247 | ipw_estimate
248 | ```
249 | 
250 | This estimate is pretty similar to what we saw before, if a little smaller. In fact, for simple causal questions, this is often the case: adjusting for confounders directly in your regression model sometimes estimates the same effect as IPWs and other causal techniques. Causal techniques are special, though, in that they use counterfactual modeling, which allows you to deal with many circumstances, such as when you have selection bias or time-dependent confounding. They also often have favorable variance properties.
251 | 
252 | But we have another problem that we need to address. We're just using `lm()` to estimate our IPW model, but it doesn't properly account for the weights. That means our standard error is too small, which will artificially narrow confidence intervals and artificially shrink p-values. There are many ways to address this, including robust estimators. We'll focus on using the bootstrap via the {rsample} package in this workshop, but here's one way to do it with robust standard errors:
253 | 
254 | 
255 | ```{r}
256 | # also see robustbase, survey, gee, and others
257 | library(estimatr)
258 | ipw_model_robust <- lm_robust(
259 |   wt82_71 ~ qsmk,
260 |   data = nhefs_complete_uc,
261 |   weights = wts
262 | )
263 | 
264 | ipw_estimate_robust <- ipw_model_robust |>
265 |   tidy(conf.int = TRUE) |>
266 |   filter(term == "qsmk")
267 | 
268 | ipw_estimate_robust
269 | ```
270 | 
271 | Now let's try the bootstrap. First, we need to wrap our model in a function so we can call it many times on our bootstrapped data. A function like this might be your instinct; however, it's not quite right.
272 | 
273 | ```{r}
274 | # fit ipw model for a single bootstrap sample
275 | fit_ipw_not_quite_rightly <- function(split, ...) {
276 |   # get bootstrapped data sample with `rsample::analysis()`
277 |   .df <- analysis(split)
278 | 
279 |   # fit ipw model
280 |   lm(wt82_71 ~ qsmk, data = .df, weights = wts) |>
281 |     tidy()
282 | }
283 | ```
284 | 
285 | The problem is that we need to account for the *entire* modeling process, so we need to include the first step of our analysis -- fitting the inverse probability weights.
286 | 
287 | ```{r}
288 | fit_ipw <- function(split, ...) 
{
289 |   .df <- analysis(split)
290 | 
291 |   # fit propensity score model
292 |   propensity_model <- glm(
293 |     qsmk ~ sex +
294 |       race + age + I(age^2) + education +
295 |       smokeintensity + I(smokeintensity^2) +
296 |       smokeyrs + I(smokeyrs^2) + exercise + active +
297 |       wt71 + I(wt71^2),
298 |     family = binomial(),
299 |     data = .df
300 |   )
301 | 
302 |   # calculate inverse probability weights
303 |   .df <- propensity_model |>
304 |     augment(type.predict = "response", data = .df) |>
305 |     mutate(wts = wt_ate(
306 |       .fitted,
307 |       qsmk,
308 |       exposure_type = "binary"
309 |     ))
310 | 
311 |   # fit correctly bootstrapped ipw model
312 |   lm(wt82_71 ~ qsmk, data = .df, weights = wts) |>
313 |     tidy()
314 | }
315 | ```
316 | 
317 | {rsample} makes the rest easy for us: `bootstraps()` resamples our data 1000 times, then we can use `purrr::map()` to apply our function to each resampled set (`splits`). {rsample}'s `int_*()` functions help us get confidence intervals for our estimate.
318 | 
319 | ```{r}
320 | # fit ipw model to bootstrapped samples of the uncensored data
321 | ipw_results <- bootstraps(nhefs_complete_uc, 1000, apparent = TRUE) |>
322 |   mutate(results = map(splits, fit_ipw))
323 | 
324 | # get t-statistic-based CIs
325 | boot_estimate <- int_t(ipw_results, results) |>
326 |   filter(term == "qsmk")
327 | 
328 | boot_estimate
329 | ```
330 | 
331 | Let's compare to our naive weighted model that just used a single estimate from `lm()`.
332 | 
333 | ```{r}
334 | bind_rows(
335 |   ipw_estimate |>
336 |     select(estimate, conf.low, conf.high) |>
337 |     mutate(type = "ols"),
338 |   ipw_estimate_robust |>
339 |     select(estimate, conf.low, conf.high) |>
340 |     mutate(type = "robust"),
341 |   boot_estimate |>
342 |     select(estimate = .estimate, conf.low = .lower, conf.high = .upper) |>
343 |     mutate(type = "bootstrap")
344 | ) |>
345 |   # calculate CI width to sort by it
346 |   mutate(width = conf.high - conf.low) |>
347 |   arrange(width) |>
348 |   # fix the order of the model types for the plot
349 |   mutate(type = fct_inorder(type)) |>
350 |   ggplot(aes(x = type, y = estimate, ymin = conf.low, ymax = conf.high)) +
351 |   geom_pointrange(color = "#0172B1", size = 1, fatten = 3) +
352 |   coord_flip() +
353 |   theme_minimal(base_size = 20) +
354 |   theme(axis.title.y = element_blank())
355 | ```
356 | 
357 | Our bootstrapped confidence intervals are wider, which is expected; remember that they were artificially narrow in the naive OLS model!
358 | 
359 | So, we have a final estimate for our causal effect: on average, a person who quits smoking will gain 3.5 kg (95% CI 2.4 kg, 4.4 kg) versus if they had not quit smoking. What do you think? Is this estimate reliable? Did we do a good job addressing the assumptions we need to make for a causal effect, particularly that there is no confounding? How might you criticize this model, and what would you do differently?
360 | 
361 | ***
362 | 
363 | # Take aways
364 | * The broad strokes for a causal analysis are: 1) identify your causal question, 2) make your assumptions clear, 3) check your assumptions as best you can, and 4) use the right estimator for the question you're trying to ask. As scientists, we should be able to critique each of these steps, and that's a good thing!
365 | * To create marginal structural models, first fit a propensity model for the weights with the exposure as the outcome. Then, use the inverse of the predicted probabilities as weights in a model with just the outcome and exposure. 
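If it helps to see that whole pattern in one place, here's a compact, generic sketch (the names `outcome`, `exposure`, and `confounder` are hypothetical placeholders, not NHEFS variables):

```{r}
#| eval: false
library(dplyr)
library(broom)
library(propensity)

fit_msm <- function(df) {
  # step 1: propensity model with the exposure as the outcome
  ps_model <- glm(exposure ~ confounder, family = binomial(), data = df)

  # turn predicted probabilities into inverse probability weights
  df <- ps_model |>
    augment(type.predict = "response", data = df) |>
    mutate(wts = wt_ate(.fitted, exposure))

  # step 2: weighted outcome model with only the exposure
  lm(outcome ~ exposure, data = df, weights = wts) |>
    tidy()
}
```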
366 | -------------------------------------------------------------------------------- /15-bonus-ml-for-causal-exercises.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Machine Learning for Causal Inference" 3 | format: html 4 | --- 5 | 6 | ```{r} 7 | #| label: setup 8 | library(tidyverse) 9 | library(broom) 10 | library(touringplans) 11 | library(propensity) 12 | library(SuperLearner) 13 | library(tmle) 14 | library(yardstick) 15 | library(ggdag) 16 | library(ggokabeito) 17 | ``` 18 | 19 | ## The Causal Question 20 | 21 | We'll be looking at an example using Walt Disney World ride data from the touringplans package. 22 | 23 | Historically, guests who stayed in a Walt Disney World resort hotel were able to access the park during "Extra Magic Hours" during which the park was closed to all other guests. 24 | These extra hours could be in the morning or evening. 25 | The Seven Dwarfs Mine Train is a ride at Walt Disney World's Magic Kingdom. Typically, each day Magic Kingdom may or may not be selected to have these "Extra Magic Hours". 26 | 27 | We are interested in examining the relationship between whether there were "Extra Magic Hours" in the morning and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am. 28 | 29 | Below is a proposed DAG for this question. 30 | 31 | ```{r} 32 | set.seed(1234) 33 | 34 | coord_dag <- list( 35 | x = c(season = 0, close = 0, weather = -1, emm = 1, wait_posted = 2), 36 | y = c(season = -1, close = 1, weather = 0, emm = 0, wait_posted = 0) 37 | ) 38 | 39 | labels <- c( 40 | emm = "Extra Magic Morning", 41 | wait_posted = "Average wait", 42 | season = "Ticket Season", 43 | weather = "Historic high temperature", 44 | close = "Time park closed" 45 | ) 46 | 47 | dagify( 48 | wait_posted ~ emm + close + season + weather, 49 | emm ~ weather + close + season, 50 | coords = coord_dag, 51 | labels = labels, 52 | exposure = "emm", 53 | outcome = "wait_posted" 54 | ) |> 55 | tidy_dagitty() |> 56 | node_status() |> 57 | ggplot( 58 | aes(x, y, xend = xend, yend = yend, color = status) 59 | ) + 60 | geom_dag_edges_arc(curvature = c(rep(0, 6), .3)) + 61 | geom_dag_point() + 62 | geom_dag_label_repel( 63 | aes(x, y, label = label), 64 | box.padding = 3.5, 65 | inherit.aes = FALSE, 66 | max.overlaps = Inf, 67 | family = "sans", 68 | seed = 1630, 69 | label.size = NA, 70 | label.padding = 0.1, 71 | size = 14 / 3 72 | ) + 73 | scale_color_okabe_ito(na.value = "grey90") + 74 | theme_dag() + 75 | theme( 76 | legend.position = "none", 77 | axis.text.x = element_text() 78 | ) + 79 | coord_cartesian(clip = "off") + 80 | scale_x_continuous( 81 | limits = c(-1.25, 2.25), 82 | breaks = c(-1, 0, 1, 2), 83 | labels = c( 84 | "\n(one year ago)", 85 | "\n(6 months ago)", 86 | "\n(3 months ago)", 87 | "9am - 10am\n(Today)" 88 | ) 89 | ) 90 | ``` 91 | 92 | Here we are proposing that there are three confounders: the historic high temperature on the day, the time the park closed, and the ticket season: value, regular, or peak. 93 | 94 | ## Review: IPW and G-computation 95 | 96 | Before using machine learning methods, let's review how we implement IPW and G-computation with parametric models. IPW and G-computation are two estimators we can use to estimate our estimand, the average treatment effect (ATE), in the presence of confounding. 97 | 98 | ### Inverse Probability Weighting (IPW) 99 | 100 | In IPW, we first fit a propensity score model to estimate the probability of treatment given confounders. 
The goal is to create a pseudo-population where treatment assignment is independent of confounders. We then calculate weights based on these propensity scores and fit a weighted outcome model to estimate the average treatment effect (ATE).
101 | 
102 | The algorithm is:
103 | 1. Fit a propensity score model to estimate the probability of treatment given confounders.
104 | 2. Calculate weights (for the ATE, `1 / propensity score` for treated units and `1 / (1 - propensity score)` for control units); `propensity::wt_ate()` will calculate this for you.
105 | 3. Fit a weighted outcome model using the calculated weights. This model only includes the treatment variable and no confounders, as the weights adjust for confounding.
106 | 
107 | ```{r}
108 | # Prepare data: filter to 9am wait times
109 | seven_dwarfs <- seven_dwarfs_train_2018 |>
110 |   filter(wait_hour == 9)
111 | 
112 | # Step 1: Fit propensity score model (exposure model)
113 | propensity_model <- glm(
114 |   park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high,
115 |   data = seven_dwarfs,
116 |   family = binomial()
117 | )
118 | 
119 | # Get propensity scores
120 | propensity_scores <- predict(propensity_model, type = "response")
121 | 
122 | # Step 2: Calculate ATE weights
123 | # For treated: 1/PS, for control: 1/(1-PS)
124 | ate_weights <- wt_ate(
125 |   propensity_scores,
126 |   seven_dwarfs$park_extra_magic_morning
127 | )
128 | 
129 | # Step 3: Fit weighted outcome model
130 | ipw_model <- lm(
131 |   wait_minutes_posted_avg ~ park_extra_magic_morning,
132 |   data = seven_dwarfs,
133 |   weights = ate_weights
134 | )
135 | 
136 | # For proper inference, we need bootstrapping.
137 | # See the appendix at the bottom of this document.
138 | tidy(ipw_model)
139 | ```
140 | 
141 | ### G-computation
142 | 
143 | In G-computation, we fit an outcome model that includes the exposure and confounders. We then create two counterfactual datasets: one where everyone is treated and one where everyone is control. We use the fitted outcome model to predict outcomes under each scenario and calculate the ATE as the average difference in predicted outcomes.
144 | 
145 | The algorithm is:
146 | 1. Fit an outcome model with the exposure and confounders.
147 | 2. Clone the dataset to create two scenarios: one where everyone is treated and one where everyone is control.
148 | 3. Predict outcomes under each scenario using the fitted outcome model.
149 | 4. Calculate the average treatment effect (ATE) as the mean difference in predicted outcomes.
150 | 
151 | ```{r}
152 | # Step 1: Fit outcome model with exposure and all confounders
153 | outcome_model <- lm(
154 |   wait_minutes_posted_avg ~ park_extra_magic_morning + park_ticket_season +
155 |     park_close + park_temperature_high,
156 |   data = seven_dwarfs
157 | )
158 | 
159 | tidy(outcome_model)
160 | 
161 | # Step 2: Clone datasets with different exposure values
162 | # Create datasets where everyone is treated vs everyone is control
163 | data_all_treated <- seven_dwarfs |>
164 |   mutate(park_extra_magic_morning = 1)
165 | 
166 | data_all_control <- seven_dwarfs |>
167 |   mutate(park_extra_magic_morning = 0)
168 | 
169 | # Step 3: Predict outcomes under each scenario
170 | pred_treated <- predict(outcome_model, newdata = data_all_treated)
171 | pred_control <- predict(outcome_model, newdata = data_all_control)
172 | 
173 | # Step 4: Calculate average treatment effect
174 | # For proper inference, we need bootstrapping. 
# See the appendix at the bottom of this document.
175 | 
176 | ate_gcomp <- mean(pred_treated - pred_control)
177 | ate_gcomp
178 | ```
179 | 
180 | ## Your Turn 1
181 | 
182 | 1. First, create a character vector `sl_library` that specifies the following algorithms: "SL.glm", "SL.ranger", "SL.xgboost", "SL.gam". Then, fit a SuperLearner for the exposure model using the `SuperLearner` package. The predictors for this model should be the confounders identified in the DAG: `park_ticket_season`, `park_close`, and `park_temperature_high`. The outcome is `park_extra_magic_morning`.
183 | 2. Fit a SuperLearner for the outcome model using the `SuperLearner` package. The predictors for this model should be the confounders plus the exposure: `park_extra_magic_morning`, `park_ticket_season`, `park_close`, and `park_temperature_high`. The outcome is `wait_minutes_posted_avg`.
184 | 3. Inspect the fitted SuperLearner objects.
185 | 
186 | ```{r}
187 | set.seed(1234)
188 | 
189 | seven_dwarfs <- seven_dwarfs_train_2018 |>
190 |   filter(wait_hour == 9)
191 | 
192 | sl_library <- c("SL.glm", "SL.ranger", "SL.xgboost", "SL.gam")
193 | 
194 | exposure_sl <- SuperLearner(
195 |   Y = seven_dwarfs$park_extra_magic_morning,
196 |   X = seven_dwarfs |>
197 |     select(park_ticket_season, park_close, park_temperature_high) |>
198 |     mutate(park_close = as.numeric(park_close)),
199 |   family = binomial(),
200 |   SL.library = sl_library,
201 |   cvControl = list(V = 5)
202 | )
203 | 
204 | exposure_sl
205 | 
206 | outcome_sl <- SuperLearner(
207 |   Y = seven_dwarfs$wait_minutes_posted_avg,
208 |   X = seven_dwarfs |>
209 |     select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high) |>
210 |     mutate(park_close = as.numeric(park_close)),
211 |   family = gaussian(),
212 |   SL.library = sl_library,
213 |   cvControl = list(V = 5)
214 | )
215 | 
216 | outcome_sl
217 | ```
218 | 
219 | 4. Inspect the predictions from the fitted SuperLearner models.
220 | 5. Check AUC with the `yardstick` package for the exposure model. For this, we need a data frame with the true exposure values and predicted propensity scores to use with `roc_auc()`.
221 | 6. Check RMSE with the `yardstick` package for the outcome model. For this, we need a data frame with the true outcome values and predicted outcomes to use with `rmse()`.
222 | 
223 | ```{r}
224 | propensity_scores <- predict(exposure_sl, type = "response")$pred[, 1]
225 | propensity_scores
226 | 
227 | outcome_preds <- predict(outcome_sl)$pred[, 1]
228 | outcome_preds
229 | 
230 | exposure_results <- tibble(
231 |   truth = factor(seven_dwarfs$park_extra_magic_morning),
232 |   predicted = propensity_scores
233 | )
234 | 
235 | # Need event_level = "second" because yardstick treats first level ("0")
236 | # as event by default
237 | exposure_auc <- roc_auc(exposure_results, truth, predicted, event_level = "second")
238 | exposure_auc
239 | 
240 | outcome_results <- tibble(
241 |   truth = seven_dwarfs$wait_minutes_posted_avg,
242 |   predicted = outcome_preds
243 | )
244 | outcome_rmse <- rmse(outcome_results, truth, predicted)
245 | outcome_rmse
246 | ```
247 | 
248 | **Stretch goal**: Add more algorithms to the SuperLearner and fit the models using the new stack of algorithms. 
249 | 250 | ```{r} 251 | sl_library_extended <- c( 252 | "SL.glm", 253 | "SL.ranger", 254 | "SL.xgboost", 255 | "SL.earth", 256 | "SL.gam", 257 | "SL.glm.interaction", 258 | "SL.mean", 259 | "SL.glmnet" 260 | ) 261 | 262 | # fit the superlearner models again with the extended library 263 | exposure_sl_extended <- SuperLearner( 264 | Y = seven_dwarfs$park_extra_magic_morning, 265 | X = seven_dwarfs |> 266 | select(park_ticket_season, park_close, park_temperature_high) |> 267 | mutate(park_close = as.numeric(park_close)), 268 | family = binomial(), 269 | SL.library = sl_library_extended, 270 | cvControl = list(V = 5) 271 | ) 272 | 273 | exposure_sl_extended 274 | 275 | outcome_sl_extended <- SuperLearner( 276 | Y = seven_dwarfs$wait_minutes_posted_avg, 277 | X = seven_dwarfs |> 278 | select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high) |> 279 | mutate(park_close = as.numeric(park_close)), 280 | family = gaussian(), 281 | SL.library = sl_library_extended, 282 | cvControl = list(V = 5) 283 | ) 284 | 285 | outcome_sl_extended 286 | ``` 287 | 288 | ## Your Turn 2 289 | 290 | 1. Implement Inverse Probability Weighting (IPW) using the fitted SuperLearner for the exposure model. 291 | - First, calculate the ATE weights using `wt_ate()` with the predicted propensity scores from the fitted SuperLearner. 292 | - Then, fit a weighted outcome model using `lm()` with `wait_minutes_posted_avg` as the outcome and `park_extra_magic_morning` as the only predictor, using the ATE weights in the `weights` argument of `lm()`. 293 | 294 | ```{r} 295 | # IPW with SuperLearner propensity scores 296 | # Step 1: Use predicted propensity scores from SuperLearner 297 | # Already done! 298 | propensity_scores 299 | 300 | # Step 2: Calculate ATE weights using wt_ate() 301 | ate_weights <- wt_ate( 302 | # the propensity score 303 | .propensity = propensity_scores, 304 | # the actual exposure values 305 | .exposure = seven_dwarfs$park_extra_magic_morning 306 | ) 307 | 308 | # Step 3: Fit weighted outcome model 309 | # The formula should be: `wait_minutes_posted_avg` as the outcome and `park_extra_magic_morning` as the only predictor 310 | ipw_model <- lm( 311 | wait_minutes_posted_avg ~ park_extra_magic_morning, 312 | data = seven_dwarfs, 313 | weights = ate_weights 314 | ) 315 | 316 | # Extract ATE estimate 317 | tidy(ipw_model) |> 318 | filter(term == "park_extra_magic_morning") 319 | ``` 320 | 321 | 2. Implement G-computation using the fitted SuperLearner for the outcome model. 322 | - First, create two counterfactual datasets: one where everyone is treated (`park_extra_magic_morning` = 1) and one where everyone is control (`park_extra_magic_morning` = 0). 323 | - Then, predict outcomes under each scenario using the fitted SuperLearner for the outcome model. 324 | - Finally, calculate the average treatment effect (ATE) as the mean difference in predicted outcomes between the treated and control scenarios. 
325 | 
326 | ```{r}
327 | # G-computation with SuperLearner outcome model
328 | # Step 1: Create counterfactual datasets
329 | # For SuperLearner prediction, we need only the columns used in the model
330 | 
331 | # Dataset where everyone is treated, `park_extra_magic_morning` = 1
332 | data_all_treated <- seven_dwarfs |>
333 |   select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high) |>
334 |   mutate(
335 |     park_close = as.numeric(park_close),
336 |     park_extra_magic_morning = 1
337 |   )
338 | 
339 | # Dataset where everyone is control, `park_extra_magic_morning` = 0
340 | data_all_control <- seven_dwarfs |>
341 |   select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high) |>
342 |   mutate(
343 |     park_close = as.numeric(park_close),
344 |     park_extra_magic_morning = 0
345 |   )
346 | 
347 | # Step 2: Predict outcomes under each scenario using SuperLearner
348 | pred_treated <- predict(outcome_sl, newdata = data_all_treated)$pred[, 1]
349 | pred_control <- predict(outcome_sl, newdata = data_all_control)$pred[, 1]
350 | 
351 | # Step 3: Calculate average treatment effect
352 | gcomp_ate <- mean(pred_treated - pred_control)
353 | gcomp_ate
354 | ```
355 | 
356 | ## Your Turn 3
357 | 
358 | 1. First, bound the continuous outcome to [0,1] range for TMLE. Store the min and max values for later transformation back.
359 | 2. Fit a new SuperLearner on the bounded outcome specifically for TMLE.
360 | 3. Get initial predictions from this bounded SuperLearner (they will already be in [0,1] scale).
361 | 4. Create `initial_pred_observed`, which contains the predicted values for each observation based on their actual treatment assignment.
362 | 
363 | ```{r}
364 | # Step 1: Bound the continuous outcome to [0,1] for TMLE
365 | min_y <- min(seven_dwarfs$wait_minutes_posted_avg)
366 | max_y <- max(seven_dwarfs$wait_minutes_posted_avg)
367 | y_bounded <- (seven_dwarfs$wait_minutes_posted_avg - min_y) / (max_y - min_y)
368 | 
369 | # Step 2: Fit new SuperLearner on bounded outcome
370 | # For TMLE with continuous outcomes, we need to fit on the bounded Y
371 | outcome_sl_bounded <- SuperLearner(
372 |   Y = y_bounded,
373 |   X = seven_dwarfs |>
374 |     select(park_extra_magic_morning, park_ticket_season, park_close, park_temperature_high) |>
375 |     mutate(park_close = as.numeric(park_close)),
376 |   family = binomial(),
377 |   SL.library = sl_library,
378 |   cvControl = list(V = 5)
379 | )
380 | 
381 | # Step 3: Get initial predictions (already in [0,1] scale)
382 | initial_pred_treated <- predict(outcome_sl_bounded, newdata = data_all_treated)$pred[, 1]
383 | initial_pred_control <- predict(outcome_sl_bounded, newdata = data_all_control)$pred[, 1]
384 | 
385 | # each observation gets the counterfactual prediction based on their actual treatment
386 | # this is the same as predicting on the original dataset, but since we already calculated these, we'll just put it together ourselves
387 | initial_pred_observed <- ifelse(
388 |   seven_dwarfs$park_extra_magic_morning == 1,
389 |   initial_pred_treated,
390 |   initial_pred_control
391 | )
392 | ```
393 | 
394 | 5. Create the "clever covariate" using the propensity scores from the fitted SuperLearner model. This will be used in the TMLE targeting step. For treated units, it should be `1 / propensity_scores`, and for control units, it should be `-1 / (1 - propensity_scores)` (see the formula below). 
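In standard TMLE notation, with $g(W_i)$ as the propensity score and $A_i$ as the exposure, the clever covariate is (a general formula, not specific to this exercise):

$$
H(A_i, W_i) = \frac{A_i}{g(W_i)} - \frac{1 - A_i}{1 - g(W_i)}
$$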
395 | 
396 | ```{r}
397 | # Step 4: Create the "clever covariate": this is the key to TMLE
398 | # It weights observations based on their propensity scores to achieve balance
399 | # For treated units: 1 / propensity_scores
400 | # For control units: -1 / (1 - propensity_scores)
401 | # This is NOT the ATE weights; it's a component of the efficient influence function
402 | # But it IS related, as it is also a consequence of targeting the ATE
403 | clever_covariate <- ifelse(
404 |   seven_dwarfs$park_extra_magic_morning == 1,
405 |   1 / propensity_scores,
406 |   -1 / (1 - propensity_scores)
407 | )
408 | ```
409 | 
410 | 6. Fit a fluctuation model with the bounded outcome, using `qlogis(initial_pred_observed)` as an offset and the clever covariate as a predictor (with no intercept). Use binomial family for the model.
411 | 7. Get the fluctuation parameter `epsilon` from the model coefficients; this is the coefficient for the clever covariate.
412 | 
413 | ```{r}
414 | # Step 5: The targeting step - a small parametric fluctuation of initial estimates
415 | # We're not re-fitting from scratch; we're making a targeted adjustment
416 | # The offset keeps our initial predictions on the logit scale
417 | # No intercept because we're adjusting around the initial predictions, not estimating a new model
418 | # Use binomial family for bounded continuous outcomes
419 | fluctuation_model <- glm(
420 |   y_bounded ~ -1 + offset(qlogis(initial_pred_observed)) + clever_covariate,
421 |   family = binomial()
422 | )
423 | 
424 | # Epsilon: the fluctuation parameter that solves the efficient influence function
425 | # It tells us how much our initial estimate needs to be adjusted to be unbiased
426 | # Small epsilon = initial estimate was already good; large = needed more correction
427 | # This is the coefficient for clever_covariate
428 | epsilon <- coef(fluctuation_model)
429 | epsilon
430 | ```
431 | 
432 | 8. Now that we've calculated the fluctuation parameter, we can update our predictions to obtain targeted predictions that optimize the bias-variance tradeoff for the average treatment effect.
433 | - For treated units, we add `epsilon * (1 / propensity_scores)`.
434 | - For control units, we add `epsilon * (-1 / (1 - propensity_scores))`.
435 | 
436 | ```{r}
437 | # Step 6: Update our predictions using the fluctuation parameter
438 | # This update ensures our estimate solves the efficient influence function equation
439 | # which helps us estimate: 1) unbiased estimate of ATE, 2) valid standard errors
440 | # We adjust the initial predictions (on the logit scale) by adding epsilon * clever_covariate
441 | # 1 / propensity_scores for treated (since everyone in initial_pred_treated is treated counterfactually)
442 | # -1 / (1 - propensity_scores) for controls
443 | logit_pred_treated <- qlogis(initial_pred_treated) + epsilon * (1 / propensity_scores)
444 | logit_pred_control <- qlogis(initial_pred_control) + epsilon * (-1 / (1 - propensity_scores))
445 | 
446 | # Transform back to probability scale
447 | targeted_pred_treated <- plogis(logit_pred_treated)
448 | targeted_pred_control <- plogis(logit_pred_control)
449 | 
450 | # we'll need this later for calculating the variance and confidence intervals
451 | targeted_pred_observed <- ifelse(
452 |   seven_dwarfs$park_extra_magic_morning == 1,
453 |   targeted_pred_treated,
454 |   targeted_pred_control
455 | )
456 | ```
457 | 
458 | 9. Let's visualize the initial vs targeted individual-level predictions for treated and control units. 
Set the x-axis to the initial predictions and the y-axis to the targeted predictions. For the first plot, use `initial_pred_treated` and `targeted_pred_treated`, and for the second plot, use `initial_pred_control` and `targeted_pred_control`.
459 | 
460 | ```{r}
461 | # plot the initial vs targeted individual-level predictions for treated units
462 | ggplot(seven_dwarfs, aes(x = initial_pred_treated, y = targeted_pred_treated)) +
463 |   geom_point() +
464 |   # perfect prediction line: y = x
465 |   geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
466 |   labs(
467 |     x = "Initial Predictions (Treated)",
468 |     y = "Targeted Predictions (Treated)"
469 |   ) +
470 |   theme_minimal()
471 | 
472 | # plot the initial vs targeted individual-level predictions for control units
473 | ggplot(seven_dwarfs, aes(x = initial_pred_control, y = targeted_pred_control)) +
474 |   geom_point() +
475 |   # perfect prediction line: y = x
476 |   geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
477 |   labs(
478 |     x = "Initial Predictions (Control)",
479 |     y = "Targeted Predictions (Control)"
480 |   ) +
481 |   theme_minimal()
482 | ```
483 | 
484 | ## Your Turn 4
485 | 
486 | 1. Let's compare the mean difference in **initial** predictions between treated and control units, and the mean difference in **targeted** predictions between treated and control units. The first calculation is a G-computation ATE estimate (here based on the bounded outcome model). The second calculation is the TMLE estimate. Remember to transform back to the original scale!
487 | 
488 | ```{r}
489 | # Compare: initial estimate (good prediction, possibly biased for causal effect)
490 | # vs targeted estimate (adjusted to be unbiased for the causal parameter)
491 | # the g-computation ATE is the mean difference in initial predictions
492 | # Transform back to original scale by multiplying by (max_y - min_y)
493 | initial_ate <- mean(initial_pred_treated - initial_pred_control) * (max_y - min_y)
494 | initial_ate
495 | 
496 | # the TMLE ATE is the mean difference in targeted predictions
497 | # Transform back to original scale
498 | targeted_ate <- mean(targeted_pred_treated - targeted_pred_control) * (max_y - min_y)
499 | targeted_ate
500 | ```
501 | 
502 | 2. Now let's calculate its standard error using the efficient influence curve (IC). The IC captures the uncertainty in our estimate and allows us to construct valid confidence intervals. There's no code to change here, just read through the comments to understand how the IC is constructed and how it relates to the TMLE estimate.
503 | 
504 | ```{r}
505 | # Calculate the efficient influence curve (IC) for each observation
506 | # The IC has two components:
507 | # 1. clever_covariate * residual: captures remaining uncertainty in the outcome model
508 | # 2. 
(targeted_pred_treated - targeted_pred_control - the bounded-scale ATE): captures uncertainty in the treatment effect
509 | # Each observation's IC value represents its contribution to the overall uncertainty
510 | # Note: IC uses bounded outcomes and predictions; targeted_ate / (max_y - min_y) is the ATE on the bounded scale
511 | ic <- clever_covariate * (y_bounded - targeted_pred_observed) +
512 |   targeted_pred_treated - targeted_pred_control - targeted_ate / (max_y - min_y)
513 | 
514 | # Standard error is the standard deviation of IC values divided by sqrt(n)
515 | # This works because TMLE constructs the estimate to behave like a sample mean of IC values
516 | # Even though we used ML, the targeting step ensures valid statistical inference
517 | # Transform SE back to original scale
518 | se_tmle <- sqrt(var(ic) / nrow(seven_dwarfs)) * (max_y - min_y)
519 | 
520 | # Calculate 95% confidence interval using normal approximation
521 | # Valid because TMLE is asymptotically normal with the IC-based variance
522 | ci_lower <- targeted_ate - 1.96 * se_tmle
523 | ci_upper <- targeted_ate + 1.96 * se_tmle
524 | 
525 | tibble(
526 |   targeted_ate,
527 |   se_tmle,
528 |   ci_lower,
529 |   ci_upper
530 | )
531 | ```
532 | 
533 | ## Bonus: Fit with `tmle::tmle()`
534 | 
535 | R has several packages for TMLE: tmle, ltmle, and tmle3, all with slightly different designs and capabilities. We'll use the `tmle` package here, which is quite simple and works with SuperLearner.
536 | 
537 | ```{r}
538 | confounders <- seven_dwarfs |>
539 |   select(park_ticket_season, park_close, park_temperature_high) |>
540 |   mutate(park_close = as.numeric(park_close))
541 | 
542 | tmle_result <- tmle(
543 |   # Y is the outcome
544 |   Y = seven_dwarfs$wait_minutes_posted_avg,
545 |   # A is the exposure
546 |   A = seven_dwarfs$park_extra_magic_morning,
547 |   W = confounders,
548 |   # Q is the outcome model
549 |   Q.SL.library = sl_library,
550 |   # g is the exposure model (propensity score)
551 |   g.SL.library = sl_library
552 | )
553 | 
554 | tmle_result
555 | summary(tmle_result)
556 | ```
557 | 
558 | ***
559 | 
560 | # Take aways
561 | 
562 | * Machine learning methods can improve causal inference by reducing bias and variance in treatment effect estimates.
563 | * SuperLearner provides a flexible framework for combining multiple machine learning algorithms.
564 | * TMLE provides a robust framework for estimating treatment effects while addressing bias and variance.
565 | * The targeting step in TMLE is crucial for improving estimates and reducing bias.
566 | 
567 | 
568 | # Appendix: Bootstrapping for IPW and G-computation
569 | 
570 | ## IPW
571 | 
572 | 
573 | ```{r}
574 | library(rsample)
575 | set.seed(1234)
576 | 
577 | fit_ipw <- function(split, ...) 
{ 578 | .df <- as.data.frame(split) 579 | 580 | # Fit propensity score model 581 | ps_model <- glm( 582 | park_extra_magic_morning ~ park_ticket_season + park_close + park_temperature_high, 583 | data = .df, 584 | family = binomial() 585 | ) 586 | 587 | # Calculate weights 588 | ps <- augment(ps_model, type.predict = "response", data = .df)$.fitted 589 | weights <- wt_ate(ps, .df$park_extra_magic_morning, exposure_type = "binary") 590 | 591 | # Fit weighted outcome model 592 | lm(wait_minutes_posted_avg ~ park_extra_magic_morning, 593 | data = .df, 594 | weights = weights) |> 595 | tidy() 596 | } 597 | 598 | # Bootstrap confidence intervals 599 | ipw_boot <- bootstraps(seven_dwarfs, 1000, apparent = TRUE) |> 600 | mutate(results = map(splits, fit_ipw)) 601 | 602 | int_bca(ipw_boot, results, .fn = fit_ipw) |> 603 | filter(term == "park_extra_magic_morning") 604 | ``` 605 | 606 | ## G-computation 607 | 608 | ```{r} 609 | fit_gcomp <- function(split, ...) { 610 | .df <- as.data.frame(split) 611 | 612 | # Fit outcome model 613 | mod <- lm( 614 | wait_minutes_posted_avg ~ park_extra_magic_morning + park_ticket_season + 615 | park_close + park_temperature_high, 616 | data = .df 617 | ) 618 | 619 | # Clone and predict 620 | df_treat <- .df |> mutate(park_extra_magic_morning = 1) 621 | df_control <- .df |> mutate(park_extra_magic_morning = 0) 622 | 623 | pred_treat <- augment(mod, newdata = df_treat)$.fitted 624 | pred_control <- augment(mod, newdata = df_control)$.fitted 625 | 626 | # Return results 627 | tibble( 628 | term = "ate", 629 | estimate = mean(pred_treat - pred_control) 630 | ) 631 | } 632 | 633 | gcomp_boot <- bootstraps(seven_dwarfs, 1000, apparent = TRUE) |> 634 | mutate(results = map(splits, fit_gcomp)) 635 | 636 | int_bca(gcomp_boot, results, .fn = fit_gcomp) 637 | ``` 638 | 639 | --------------------------------------------------------------------------------