├── .gitignore
├── Causal Inference in the News Article.docx
├── Cheat Sheets
├── Base R Cheat Sheet.pdf
├── Causal Diagrams Cheat Sheet.Rmd
├── Causal_Diagrams_Cheat_Sheet.pdf
├── Dagitty Cheat Sheet.Rmd
├── Dagitty-Cheat-Sheet.aux
├── Dagitty-Cheat-Sheet.out
├── Dagitty_Cheat_Sheet.pdf
├── Dagitty_Diagnostics.png
├── Dagitty_Intro.PNG
├── Dagitty_Model1.PNG
├── Dagitty_Model2.PNG
├── Dagitty_Screen.PNG
├── RStudio Cheat Cheet.pdf
├── Relationships Cheat Sheet.Rmd
├── Relationships_Cheat_Sheet.pdf
├── Simulation Cheat Sheet.Rmd
├── Simulation_Cheat_Sheet.pdf
├── dplyr Cheat Sheet.pdf
└── floatposition.tex
├── ECON 305 Syllabus Updated Template.docx
├── Instructor Guide
├── .Rhistory
├── Instructor_Guide.Rmd
├── Instructor_Guide.aux
└── Instructor_Guide.pdf
├── LICENSE.txt
├── Lectures
├── Animation_Local_Linear_Regression.gif
├── GDP.csv
├── Lecture_01-cloud-market-revenue.png
├── Lecture_01_Politics_Example.PNG
├── Lecture_01_World_of_Data.Rmd
├── Lecture_01_World_of_Data.html
├── Lecture_02_Understanding_Data.Rmd
├── Lecture_02_Understanding_Data.html
├── Lecture_03_Introduction_to_R.Rmd
├── Lecture_03_Introduction_to_R.html
├── Lecture_04_Objects_and_Functions_in_R.Rmd
├── Lecture_04_Objects_and_Functions_in_R.html
├── Lecture_05_Spreadsheet.PNG
├── Lecture_05_Working_with_Data_Part_1.Rmd
├── Lecture_05_Working_with_Data_Part_1.html
├── Lecture_05_data_frame.PNG
├── Lecture_06_Working_with_Data_Part_2.Rmd
├── Lecture_06_Working_with_Data_Part_2.html
├── Lecture_07_Summarizing_Data_Part_1.Rmd
├── Lecture_07_Summarizing_Data_Part_1.html
├── Lecture_08_Summarizing_Data_Part_2.Rmd
├── Lecture_08_Summarizing_Data_Part_2.html
├── Lecture_09_Relationships_Between_Variables_Part_1.Rmd
├── Lecture_09_Relationships_Between_Variables_Part_1.html
├── Lecture_10_Relationships_Between_Variables_Part_2.Rmd
├── Lecture_10_Relationships_Between_Variables_Part_2.html
├── Lecture_11_Simulations.Rmd
├── Lecture_11_Simulations.html
├── Lecture_12_Midterm_Review.Rmd
├── Lecture_12_Midterm_Review.html
├── Lecture_13_Causality.Rmd
├── Lecture_13_Causality.html
├── Lecture_14_Causal_Diagrams.Rmd
├── Lecture_14_Causal_Diagrams.html
├── Lecture_15_Drawing_Causal_Diagrams.Rmd
├── Lecture_15_Drawing_Causal_Diagrams.html
├── Lecture_16_Back_Doors.Rmd
├── Lecture_16_Back_Doors.html
├── Lecture_17_Article_Traffic.pdf
├── Lecture_17_Causal_Diagrams_Practice.Rmd
├── Lecture_17_Causal_Diagrams_Practice.html
├── Lecture_18_Controlling.Rmd
├── Lecture_18_Controlling.html
├── Lecture_19_Fixed_Effects.Rmd
├── Lecture_19_Fixed_Effects.html
├── Lecture_20_Untreated_Groups.Rmd
├── Lecture_20_Untreated_Groups.html
├── Lecture_21_Difference_in_Differences.Rmd
├── Lecture_21_Difference_in_Differences.html
├── Lecture_22_Flammer.PNG
├── Lecture_22_Regression_Discontinuity.Rmd
├── Lecture_22_Regression_Discontinuity.html
├── Lecture_23_Practice.Rmd
├── Lecture_23_Practice.html
├── Lecture_24_AutorDornHanson.PNG
├── Lecture_24_Instrumental_Variables.Rmd
├── Lecture_24_Instrumental_Variables.html
├── Lecture_25_Instrumental_Variables_in_Action.Rmd
├── Lecture_25_Instrumental_Variables_in_Action.html
├── Lecture_26_Causal_Midterm_Review.Rmd
├── Lecture_26_Causal_Midterm_Review.html
├── Lecture_27_Explaining_Better_Regression.Rmd
├── Lecture_27_Explaining_Better_Regression.html
├── Lecture_28_Explaining_Better_Regression_2.Rmd
├── Lecture_28_Explaining_Better_Regression_2.html
├── Mariel_Synthetic.png
├── ad_spend_and_gdp.csv
├── average-height-men-OWID.csv
├── mariel.RData
└── marvel_moody_sentencing.txt
├── Papers
├── BettingerLong2009.pdf
├── Flammer 2013 RDD Corporate Social responsibility.pdf
├── chang2014.pdf
├── desai2009.pdf
├── hanson2009.pdf
├── kowalski2016.pdf
└── marvel_moody_sentencing.pdf
├── README.md
└── Research Design Assignment.docx
/.gitignore:
--------------------------------------------------------------------------------
1 | Admin/
2 | Exams/
3 | Homework/
4 | Animations/
5 | Cheat Sheets/Causal_Diagrams_Cheat_Sheet_files/
6 | Lectures/Lecture_01_World_of_Data_files/
7 | Lectures/Lecture_02_Understanding_Data_files/
8 | Lectures/Lecture_03_Introduction_to_R_files/
9 | Lectures/Lecture_05_Working_with_Data_Part_1_files/
10 | Lectures/Lecture_06_Working_with_Data_Part_2_files/
11 | Lectures/Lecture_07_Summarizing_Data_Part_1_files/
12 | Lectures/Lecture_08_Summarizing_Data_Part_2_files/
13 | Lectures/Lecture_09_Relationships_Between_Variables_Part_1_files/
14 | Lectures/Lecture_10_Relationships_Between_Variables_Part_2_files/
15 | Lectures/Lecture_12_Midterm_Review_files
16 | Lectures/Lecture_15_Drawing_Causal_Diagrams_files/
17 | Lectures/Lecture_16_Back_Doors_files
18 | Lectures/Lecture_17_Causal_Diagrams_Practice_files/
19 | Lectures/Lecture_18_Controlling_files/
20 | Lectures/Lecture_20_Untreated_Groups_files/
21 | Lectures/Lecture_21_Difference_in_Differences_files/
22 | Lectures/Lecture_22_Regression_Discontinuity_files/
23 | Lectures/Lecture_25_Instrumental_Variables_in_Action_files/
24 | Lectures/Lecture_26_Causal_Midterm_Review_files/
25 | Lectures/Lecture_27_Explaining_Better_Regression_files/
26 | Lectures/Lecture_28_Explaining_Better_Regression_2_files/
27 | Lectures/libs/
28 | Lectures/Versions without DPLYR/
29 | Lectures/.Rhistory
30 |
31 |
--------------------------------------------------------------------------------
/Causal Inference in the News Article.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Causal Inference in the News Article.docx
--------------------------------------------------------------------------------
/Cheat Sheets/Base R Cheat Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Base R Cheat Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/Causal Diagrams Cheat Sheet.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Causal Diagrams Cheat Sheet"
3 | author: "Nick Huntington-Klein"
4 | date: "April 6, 2019"
5 | output: pdf_document
6 | ---
7 |
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
10 | library(tidyverse)
11 | library(dagitty)
12 | library(ggdag)
13 | library(Cairo)
14 | ```
15 |
16 | ## Causal Diagrams
17 |
18 | - Causal diagrams are a set of *variables* and a set of assumptions about *how those variables cause each other*
19 | - Causal diagrams are our way of representing how we think the *data was generated*. They are our *model of the world*
20 | - Once we have our diagram written down, it will tell us how we can *identify* the effect we're interested in
21 | - Writing a causal diagram requires us to make some assumptions - but without assumptions you literally can't get anywhere! Just try to make them reasonable assumptions
22 | - "`X` causes `Y`" means that if we could reach in and change the value of `X`, then the *distribution* of `Y` would change as a result. It doesn't necessarily mean that *all* `Y` is because of `X`, or that changing `X` will *always* change `Y`. It just means that a change in `X` will cause the distribution of `Y` to change.
23 |
24 | ## Components of a Causal Diagram
25 |
26 | A causal diagram consists of (a) a set of variables, and (b) causal arrows between those variables.
27 |
28 | An arrow from `X` to `Y` means that `X` causes `Y`. We can represent this in text as `X -> Y`. Arrows always go one direction. You can't have `X -> Y` and `X <- Y`. See "Tricky Situations" below if it feels like both variables cause each other.
29 |
30 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=1.5}
31 | dag <- dagify(Y~X,
32 | coords=list(
33 | x=c(X=1,Y=2),
34 | y=c(X=1,Y=1)
35 | )) %>% tidy_dagitty()
36 | ggdag(dag,node_size=10)
37 | ```
38 |
39 | If a variable is *not in* the causal diagram, that means that we're assuming that the variable is not important in the system, or at least not relevant to the effect we're trying to identify.
40 |
41 | If an arrow is *not in* the causal diagram, that means that we're assuming that neither variable causes the other. This is not a trivial assumption. In the below diagram, we assume that `X` does not cause `Z`, and that `Z` does not cause `X`.
42 |
43 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=1.5}
44 | dag <- dagify(Y~X+Z,
45 | coords=list(
46 | x=c(X=1,Y=2,Z=3),
47 | y=c(X=1,Y=1,Z=1)
48 | )) %>% tidy_dagitty()
49 | ggdag(dag,node_size=10)
50 | ```
51 |
52 | ## Building a Diagram
53 |
54 | Steps to building a diagram:
55 |
56 | 1. Consider all the variables that are likely to be important in the data generating process (this includes variables you can't observe)
57 | 2. For simplicity, combine them together or prune the ones least likely to be important
58 | 3. Consider which variables are likely to affect which other variables and draw arrows from one to the other
59 |
60 | ## Front and Back Doors
61 |
62 | Once you have a diagram and you know that you want to identify the effect of `X` on `Y`, it is helpful to make a list of all the front and back doors from `X` to `Y`, and determine whether they are open or closed.
63 |
64 | To do this:
65 |
66 | 1. Write down the list of *all paths* you can follow on your diagram to get from `X` to `Y`
67 | 2. If that path only contains arrows that point *from* `X` *towards* `Y`, like `X -> Y` or `X -> A -> Y`, that is a *front door path*
68 | 3. If that path contains any arrows that point *towards* `X`, like `X <- W -> Y` or `X -> Z <- A -> Y`, that is a *back door path*
69 |
70 | ## Open and Closed Paths and Colliders
71 |
72 | 1. If a path from `X` to `Y` contains any *colliders* where two arrows point at the same variable, like how both arrows point to `Z` in `X -> Z <- A -> Y`, then that path is *closed* by default since `Z` is a collider on that path
73 | 2. Otherwise, that path is *open* by default
74 | 3. If you control/adjust for a variable, then that will *close* any open paths it is on where that variable **isn't** a collider. So, for example, if you control for `W`, the open path `X <- W -> Y` will close
75 | 4. If you control/adjust for a variable, then that will *open* any closed back door paths it is on where that variable **is** a collider. So, for example, if you control for `Z`, the closed path `X -> Z <- A -> Y` will open back up. You can close it again by controlling for another non-collider on that path, like `A` in this case
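The collider rule in points 1 and 4 can be checked with a short simulation. This is a minimal sketch rather than part of the original cheat sheet: the variable names follow the path `X -> Z <- A` used above, but the coefficients, noise, seed, and sample size are arbitrary assumptions, and "controlling for" `Z` is done here by taking residuals from a linear regression.

```r
set.seed(305)   # arbitrary seed, assumed only for reproducibility
n <- 10000      # arbitrary sample size

# X and A are generated independently; Z is a collider caused by both
X <- rnorm(n)
A <- rnorm(n)
Z <- X + A + rnorm(n)

# The path X -> Z <- A is closed by default: X and A are (nearly) uncorrelated
cor_raw <- cor(X, A)

# Controlling for the collider Z opens the path:
# remove the part of X and A explained by Z, then correlate what is left
X_resid <- resid(lm(X ~ Z))
A_resid <- resid(lm(A ~ Z))
cor_given_Z <- cor(X_resid, A_resid)

round(c(raw = cor_raw, controlling_for_Z = cor_given_Z), 2)
```

Under these assumptions the raw correlation should be close to zero, while the correlation after controlling for `Z` should be clearly negative, even though `X` and `A` never affect each other.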
76 |
77 | ## Identifying with the Front Door Method
78 |
79 | Sometimes, you can identify the effect of `X` on `Y` by focusing on front doors.
80 |
81 | For example, in the below diagram, there are no open back doors from `X` to `A` (there's a back door `X <- W -> Y <- A`, but it is closed since `Y` is a collider on this path), and so you can identify `X -> A`.
82 |
83 | Similarly, the only back door from `A` to `Y` is `A <- X <- W -> Y`, and so if you control for `X`, you can identify `A -> Y`.
84 |
85 | By combining `X -> A` and `A -> Y` you can identify `X -> Y`.
86 |
87 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=2.5}
88 | dag <- dagify(Y~A+W,
89 | A~X,
90 | X~W,
91 | coords=list(
92 | x=c(X=1,A=2,Y=3,W=2),
93 | y=c(X=1,A=1,Y=1,W=1.5)
94 | )) %>% tidy_dagitty()
95 | ggdag(dag,node_size=10)
96 | ```
97 |
98 | A common issue with this method is finding places where you can apply it convincingly. Are there really no open back door paths from `X` to `A`, or from `A` to `Y` in the true data-generating process?
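The three front door steps above can be sketched with simulated data. This is an illustration under stated assumptions, not the author's own code: the arrows match the diagram, but the linear form, the coefficients (so that the true effect of `X` on `Y` is 1 * 2 = 2), the seed, and the sample size are all made up for the example.

```r
set.seed(305)   # arbitrary seed, assumed only for reproducibility
n <- 10000      # arbitrary sample size

# Assumed data-generating process matching the diagram:
# W -> X, X -> A, A -> Y, W -> Y
W <- rnorm(n)                    # treated as unobserved in the steps below
X <- W + rnorm(n)
A <- 1 * X + rnorm(n)
Y <- 2 * A + 3 * W + rnorm(n)

# Step 1: X -> A has no open back doors, so a simple regression identifies it
x_to_a <- coef(lm(A ~ X))["X"]

# Step 2: control for X to close A <- X <- W -> Y, which identifies A -> Y
a_to_y <- coef(lm(Y ~ A + X))["A"]

# Step 3: combine the two pieces to get the effect of X on Y, never using W
front_door <- unname(x_to_a * a_to_y)
front_door
```

With these assumed coefficients the combined estimate should land near 2, even though `W` is never used in any of the regressions.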
99 |
100 | ## Identifying with the Back Door Method
101 |
102 | Once you've identified the full list of front-door paths and back-door paths, if you control/adjust for a set of variables that *fully blocks* all back door paths (without blocking your front door paths), then you have identified the effect of `X` on `Y`.
103 |
104 | For example, in the below diagram, we have the front door path `X -> A -> Y` and the back door path `X <- W -> Y`. If we control for `W` but not `A`, then we have identified the effect of `X` on `Y`.
105 |
106 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=2.5}
107 | dag <- dagify(Y~A+W,
108 | A~X,
109 | X~W,
110 | coords=list(
111 | x=c(X=1,A=2,Y=3,W=2),
112 | y=c(X=1,A=1,Y=1,W=1.5)
113 | )) %>% tidy_dagitty()
114 | ggdag(dag,node_size=10)
115 | ```
116 |
117 | A common issue with this method is that often you will need to control for variables that you can't actually observe. If you don't have data on `W`, then you can't identify the effect of `X` on `Y` using the back door method.
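A small simulation can show what closing the back door does to an estimate. This is a hedged sketch, not taken from the course materials: the arrows follow the diagram above, while the linear relationships, the coefficients (true effect of `X` on `Y` is 2), the seed, and the sample size are assumptions chosen for illustration.

```r
set.seed(305)   # arbitrary seed, assumed only for reproducibility
n <- 10000      # arbitrary sample size

# Assumed data-generating process matching the diagram:
# W -> X, X -> A, A -> Y, W -> Y
W <- rnorm(n)
X <- W + rnorm(n)
A <- X + rnorm(n)
Y <- 2 * A + 3 * W + rnorm(n)

# Back door X <- W -> Y left open: the estimate mixes in the effect of W
naive <- unname(coef(lm(Y ~ X))["X"])

# Control for W but not A: back door closed, front door X -> A -> Y left open
adjusted <- unname(coef(lm(Y ~ X + W))["X"])

round(c(naive = naive, adjusted = adjusted), 2)
```

Here the naive estimate should come out well above 2, and the estimate that controls for `W` should be close to 2. If `W` were missing from the data, the second regression could not be run, which is the limitation described above.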
118 |
119 | ## Testing Implications
120 |
121 | Causal diagrams require assumptions to build, and they also imply certain relationships. Sometimes you can test these relationships, and if they're not right, that suggests that you might not have the right model.
122 |
123 | For example, in the diagram used in the previous two sections, `W` and `A` should be unrelated if you control for `X`. Also, `X` and `Y` should be unrelated if you control for `W` and `A`. You could test these predictions yourself.
124 |
125 | If you control for `X` and still find a strong relationship between `W` and `A`, or if you control for `W` and `A` and still find a strong relationship between `X` and `Y`, then your causal diagram is wrong.
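One way to run these checks is with regressions on data. The sketch below uses simulated data that follows the diagram by construction, so both implications should hold; with real data you would swap in your own variables. The coefficients, seed, and sample size are assumptions, and a regression coefficient only picks up linear relationships.

```r
set.seed(305)   # arbitrary seed, assumed only for reproducibility
n <- 10000      # arbitrary sample size

# Simulated data that follows the diagram: W -> X, X -> A, A -> Y, W -> Y
W <- rnorm(n)
X <- W + rnorm(n)
A <- X + rnorm(n)
Y <- 2 * A + 3 * W + rnorm(n)

# Implication 1: W and A should be unrelated once we control for X
w_given_x <- unname(coef(lm(A ~ X + W))["W"])

# Implication 2: X and Y should be unrelated once we control for W and A
x_given_wa <- unname(coef(lm(Y ~ W + A + X))["X"])

round(c(W_and_A_given_X = w_given_x, X_and_Y_given_W_A = x_given_wa), 3)
```

Both coefficients should be close to zero here. A coefficient far from zero in real data would be a sign that the diagram is missing an arrow or a variable.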
126 |
127 | ## Tricky Situations
128 |
129 | Some situations make it difficult to think about how to construct a causal diagram. We'll cover those here.
130 |
131 | (1) Lots of variables. In the social sciences, the true data-generating process may have lots and lots of variables, perhaps hundreds! It's important to remember to (a) remove any variables that are likely to only be of trivial importance, and (b) combine variables that lie on identical-looking paths. For example, you can often combine "race", "gender", "parental income", "birthplace" etc. into the single variable "background".
132 |
133 | (2) Variables that are related but don't cause each other. Often we have two variables that we know are *correlated*, but we don't think either causes the other. For example, people buy more ice cream on days they wear shorts. In these cases, we include a "common cause" on the graph. We don't necessarily need to be specific about what it is, or if it's measured or unmeasured. I usually label these "U1", "U2", etc.
134 |
135 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=2.5}
136 | dag <- dagify(A~U1,
137 | B~U1,
138 | coords=list(
139 | x=c(A=1,B=2,U1=1.5),
140 | y=c(A=1,B=1,U1=1.5)
141 | )) %>% tidy_dagitty()
142 | ggdag(dag,node_size=10)
143 | ```
144 |
145 | Note that some people will use a double headed arrow here instead, like `A <-> B`. It means the same thing. We won't be doing this in our class though.
146 |
147 | (3) Variables that cause each other. A variable should never be able to cause itself. So, a causal diagram should never have a *loop* in it. The simplest example of a loop is if `A -> B` and `B -> A`. If we mix these together we have `A -> B -> A`. `A` causes itself! That doesn't make sense. This is why causal diagrams are technically called "directed *acyclic* graphs".
148 |
149 | But surely in the real world there are plenty of variables with self-feedback loops, or variables which cause each other. So what do we do?
150 |
151 | In these cases, you want to think about *time* carefully. Instead of `A -> B` and `B -> A`, split `A` and `B` up into two variables each: `ANow` and `ALater`, and `BNow` and `BLater`. Obviously, a causal effect needs time to occur. Now we have `ANow -> BLater` and `BNow -> ALater` and the loop is gone.
--------------------------------------------------------------------------------
/Cheat Sheets/Causal_Diagrams_Cheat_Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Causal_Diagrams_Cheat_Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty Cheat Sheet.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Dagitty Cheat Sheet"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 | rmarkdown::pdf_document:
7 | fig_caption: yes
8 | includes:
9 | in_header: floatposition.tex
10 | ---
11 |
12 | ```{r setup, include=FALSE}
13 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
14 | library(tidyverse)
15 | ```
16 |
17 | ## Dagitty
18 |
19 | - [Dagitty.net](http://www.dagitty.net) is a tool for building, drawing, and working with causal diagrams
20 | - You can also work with dagitty in R directly with the `dagitty` and `ggdag` packages, although this is more difficult and we won't be covering it here
21 |
22 | ## Getting Started
23 |
24 | Go to [Dagitty.net](http://www.dagitty.net) and click on "Launch Dagitty Online in your Browser"
25 |
26 | There you'll see the main dagitty screen with three areas. On the left it has style options; we will be ignoring this for this basic introduction, although you may want to refer back to the Legend in the lower left-hand corner sometimes. In the middle you have the canvas to build your diagram on, and the menus. On the right it provides information about the causal diagram you've built.
27 |
28 | 
29 |
30 | \newpage
31 |
32 | ## Building a Diagram
33 |
34 | In "Model" select "New Model". You will be presented with a blank canvas.
35 |
36 | From there, you can add new variables by clicking (or tapping, on mobile) on an empty part of the diagram, and then giving the variable a name.
37 |
38 | You can add arrows from `A` to `B` by first clicking on `A` and then clicking on `B`. You can remove an existing arrow in the same way. Once you have an arrow from `A` to `B`, you can create a double-headed arrow by clicking on `B` and then `A`.
39 |
40 | If you click-and-drag on an arrow, you can make it curvy, which may help avoid overlaps in complex diagrams. Similarly, you can click-and-drag the variables themselves.
41 |
42 | 
43 |
44 | \newpage
45 |
46 | ## Variable Characteristics
47 |
48 | Once you have some variables and arrows on your graph, you may want to manipulate them.
49 |
50 | You can do this one of two ways:
51 |
52 | 1. Clicking (or tapping on mobile) on the variable, and then looking for the "Variable" tab on the left (or below, on mobile).
53 | 2. Hovering over the variable and hitting a keyboard hotkey.
54 |
55 | 
56 |
57 | One thing you will probably want to do in almost *any* dagitty graph is determine your **exposure/treatment** variable(s) (hotkey "e", or use the Variable tab) and your **outcome** variable (hotkey "o", or use the Variable tab).
58 |
59 | Often, you will be using dagitty to attempt to identify the effect of an exposure variable on an outcome variable. Setting the exposure and outcome variables properly lets anyone looking at your graph know what it's for, and also lets dagitty do some nice calculations for you.
60 |
61 | Other variable characteristics you might want to set:
62 |
63 | 1. Whether the variable is **observed in the data** (hotkey "u", or use the Variable tab). If a variable is not observed, you can't adjust for it. Let dagitty know it's not observed so it won't try to tell you to adjust for it!
64 | 2. Whether you've already **controlled/adjusted** for a variable (hotkey "a", or use the Variable tab).
65 |
66 | You may also want to re-work your diagram once you've made it. You can **delete** variables (hotkey "d", or use the Variable tab), or **rename** them (hotkey "r", or use the Variable tab).
67 |
68 | \newpage
69 |
70 | ## Using the Dagitty Diagnostics
71 |
72 | The right bar of the screen contains lots of information about your diagram.
73 |
74 | {width=90%}
75 |
76 | At the top you'll see that it gives you the Minimal Adjustment Sets. Given the exposure variable you've set, and the outcome variable, it will tell you what you could control for to close all back doors. Here, you can identify `X -> Y` by controlling for `A` OR by controlling for `B` (`A` and `B` together would also work, but these are *minimal* adjustment sets - no unnecessary controls).
77 |
78 | Remember that the exposure and outcome variables must be set properly for this to work. It will also pay attention to the variables you've set to be unobserved, or the variables you've said you've already adjusted for.
79 |
80 | If you select the "Adjustment (total effect)" dropdown you'll see that you can also set it to look for instrumental variables, or to look for the "direct effect". The direct effect looks *only* at `X -> Y`, not counting other front-door paths like `X -> C -> Y` (not pictured).
81 |
82 | The diagram itself also has information. Any lines that are **purple** represent arrows that are a part of an open back door. Lines that are **green** are on a front door path / the causal effect of interest. And lines that are **black** are neither. In the graph we have here, we can see the green causal path of interest from `X` to `Y`, and a purple open back door path along `X <- A <-> B -> Y`. If you forget the colors, check the legend in the bottom-left.
83 |
84 | The variables are marked too. The exposure and outcome variables are clearly marked. Green variables like `A` are ancestors of the exposure (they cause the exposure), and blue variables like `B` are ancestors of the outcome (they cause the outcome). Red variables are ancestors of both, and are likely on a back door path.
85 |
86 | In the second panel on the right you can see the testable implications of the model. This shows you some relationships that *should* be there if your model is right. Here, that upside-down T means "not related to" and the | means "adjusting for". So, adjusting for `A`, `X` and `B` should be unrelated. And adjusting for both `B` and `X`, `A` and `Y` should be unrelated. You could test these yourself in data, and if you found they were related, you'd know the model was wrong.
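If you want to cross-check these diagnostics outside the browser, the `dagitty` R package can report the same information. This is an optional sketch and not something this cheat sheet otherwise covers: the graph below is reconstructed from the description in this section (`X -> Y` plus the back door `X <- A <-> B -> Y`), so treat the exact set of arrows as an assumption about the pictured model.

```r
library(dagitty)

# Graph reconstructed from the text; <-> is a double-headed arrow
g <- dagitty("dag {
  A -> X
  A <-> B
  B -> Y
  X -> Y
}")

# Minimal adjustment sets for the total effect of X on Y
sets <- adjustmentSets(g, exposure = "X", outcome = "Y")
sets

# Testable implications (conditional independencies) of the model
implications <- impliedConditionalIndependencies(g)
implications
```

For this graph the adjustment sets should match the two options described above (control for `A`, or control for `B`), and the implications should correspond to the two statements in the second panel.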
87 |
88 | \newpage
89 |
90 | ## Saving/Exporting Your Diagram
91 |
92 | You can save your work using Model -> Publish. You'll need to save the code/URL you get so you can Model -> Load it back in later.
93 |
94 | You can also export your graph for viewing using Model -> Export. However, keep in mind that dagitty tends to like to show the whole canvas, even if your model is only on a part of it. So you may want to be prepared to open the result in a photo editor to crop it after you've exported it.
95 |
96 | Another option for getting images of your models is to take a screenshot, cropped down to the area of your diagram. In Windows you can use the Snipping Tool (Windows Accessories -> Snipping Tool). On a Mac you can use Grab.
97 |
98 |
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty-Cheat-Sheet.aux:
--------------------------------------------------------------------------------
1 | \relax
2 | \providecommand\hyper@newdestlabel[2]{}
3 | \providecommand\HyperFirstAtBeginDocument{\AtBeginDocument}
4 | \HyperFirstAtBeginDocument{\ifx\hyper@anchor\@undefined
5 | \global\let\oldcontentsline\contentsline
6 | \gdef\contentsline#1#2#3#4{\oldcontentsline{#1}{#2}{#3}}
7 | \global\let\oldnewlabel\newlabel
8 | \gdef\newlabel#1#2{\newlabelxx{#1}#2}
9 | \gdef\newlabelxx#1#2#3#4#5#6{\oldnewlabel{#1}{{#2}{#3}}}
10 | \AtEndDocument{\ifx\hyper@anchor\@undefined
11 | \let\contentsline\oldcontentsline
12 | \let\newlabel\oldnewlabel
13 | \fi}
14 | \fi}
15 | \global\let\hyper@last\relax
16 | \gdef\HyperFirstAtBeginDocument#1{#1}
17 | \providecommand\HyField@AuxAddToFields[1]{}
18 | \providecommand\HyField@AuxAddToCoFields[2]{}
19 | \@writefile{toc}{\contentsline {subsection}{Dagitty}{1}{section*.1}\protected@file@percent }
20 | \newlabel{dagitty}{{}{1}{Dagitty}{section*.1}{}}
21 | \@writefile{toc}{\contentsline {subsection}{Getting Started}{1}{section*.2}\protected@file@percent }
22 | \newlabel{getting-started}{{}{1}{Getting Started}{section*.2}{}}
23 | \@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces Dagitty Main Screen}}{1}{figure.1}\protected@file@percent }
24 | \@writefile{toc}{\contentsline {subsection}{Building a Diagram}{2}{section*.3}\protected@file@percent }
25 | \newlabel{building-a-diagram}{{}{2}{Building a Diagram}{section*.3}{}}
26 | \@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces Basic Dagitty Model}}{2}{figure.2}\protected@file@percent }
27 | \@writefile{toc}{\contentsline {subsection}{Variable Characteristics}{3}{section*.4}\protected@file@percent }
28 | \newlabel{variable-characteristics}{{}{3}{Variable Characteristics}{section*.4}{}}
29 | \@writefile{lof}{\contentsline {figure}{\numberline {3}{\ignorespaces Variable Tab}}{3}{figure.3}\protected@file@percent }
30 | \@writefile{toc}{\contentsline {subsection}{Using the Dagitty Diagnostics}{4}{section*.5}\protected@file@percent }
31 | \newlabel{using-the-dagitty-diagnostics}{{}{4}{Using the Dagitty Diagnostics}{section*.5}{}}
32 | \@writefile{lof}{\contentsline {figure}{\numberline {4}{\ignorespaces Dagitty Diagnostics}}{4}{figure.4}\protected@file@percent }
33 | \@writefile{toc}{\contentsline {subsection}{Saving/Exporting Your Diagram}{5}{section*.6}\protected@file@percent }
34 | \newlabel{savingexporting-your-diagram}{{}{5}{Saving/Exporting Your Diagram}{section*.6}{}}
35 |
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty-Cheat-Sheet.out:
--------------------------------------------------------------------------------
1 | \BOOKMARK [2][-]{section*.1}{\376\377\000D\000a\000g\000i\000t\000t\000y}{}% 1
2 | \BOOKMARK [2][-]{section*.2}{\376\377\000G\000e\000t\000t\000i\000n\000g\000\040\000S\000t\000a\000r\000t\000e\000d}{}% 2
3 | \BOOKMARK [2][-]{section*.3}{\376\377\000B\000u\000i\000l\000d\000i\000n\000g\000\040\000a\000\040\000D\000i\000a\000g\000r\000a\000m}{}% 3
4 | \BOOKMARK [2][-]{section*.4}{\376\377\000V\000a\000r\000i\000a\000b\000l\000e\000\040\000C\000h\000a\000r\000a\000c\000t\000e\000r\000i\000s\000t\000i\000c\000s}{}% 4
5 | \BOOKMARK [2][-]{section*.5}{\376\377\000U\000s\000i\000n\000g\000\040\000t\000h\000e\000\040\000D\000a\000g\000i\000t\000t\000y\000\040\000D\000i\000a\000g\000n\000o\000s\000t\000i\000c\000s}{}% 5
6 | \BOOKMARK [2][-]{section*.6}{\376\377\000S\000a\000v\000i\000n\000g\000/\000E\000x\000p\000o\000r\000t\000i\000n\000g\000\040\000Y\000o\000u\000r\000\040\000D\000i\000a\000g\000r\000a\000m}{}% 6
7 |
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Cheat_Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Cheat_Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Diagnostics.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Diagnostics.png
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Intro.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Intro.PNG
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Model1.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Model1.PNG
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Model2.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Model2.PNG
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Screen.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Screen.PNG
--------------------------------------------------------------------------------
/Cheat Sheets/RStudio Cheat Cheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/RStudio Cheat Cheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/Relationships Cheat Sheet.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Relationships Cheat Sheet"
3 | author: "Nick Huntington-Klein"
4 | date: "April 5, 2019"
5 | output: pdf_document
6 | ---
7 |
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
10 | ```
11 |
12 | ## Definitions
13 |
14 | - **Dependent**: `X` and `Y` are *dependent* if knowing something about `X` gives you information about what `Y` is likely to be, or vice versa
15 | - **Correlated**: `X` and `Y` are *correlated* if knowing that `X` is unusually high tells you whether `Y` is likely to be unusually high or unusually low
16 | - **Explaining**: *Explaining* `Y` using `X` means that we are predicting what `Y` is likely to be, given a value of `X`
17 |
18 | ## Tables
19 |
20 | `table(x)` will show us the full *distribution* of `x`. `table(x,y)` will show us the full *distribution* of `x` and `y` together (the "joint distribution"). Typically not used for continuous variables.
21 |
22 | ```{r, echo=TRUE}
23 | library(Ecdat)
24 | data(Benefits)
25 |
26 | table(Benefits$joblost)
27 |
28 | table(Benefits$joblost,Benefits$married)
29 | ```
30 |
31 | You can label the variable names using the confusingly-named *dimnames names* option, `dnn`
32 |
33 | ```{r, echo=TRUE}
34 | table(Benefits$joblost,Benefits$married,dnn=c('Job Loss Reason','Married'))
35 | ```
36 |
37 | Wrap `table()` in `prop.table()` to get proportions instead of counts. The `margin` option of `prop.table()` will give the proportion within each row (`margin=1`) or within each column (`margin=2`) instead of overall.
38 |
39 | ```{r, echo=TRUE}
40 | prop.table(table(Benefits$joblost,Benefits$married))
41 | prop.table(table(Benefits$joblost,Benefits$married),margin=1)
42 | prop.table(table(Benefits$joblost,Benefits$married),margin=2)
43 | ```
44 |
45 | ## Correlation
46 |
47 | We can calculate the correlation between two (numeric) variables using `cor(x,y)`
48 |
49 | ```{r, echo=TRUE}
50 | cor(Benefits$age,Benefits$tenure)
51 | ```
52 |
53 | ## Scatterplots
54 |
55 | You can plot one variable against another with `plot(xvar,yvar)`. Add `xlab`, `ylab`, and `main` options to title the axes and entire plot, respectively. Use `col` to assign a color.
56 |
57 | Use `points()` to add more points to a graph after you've made it, likely with a different `col`or.
58 |
59 | ```{r, echo=TRUE, fig.width=5, fig.height = 3.5}
60 | library(tidyverse)
61 | BenefitsM <- Benefits %>% filter(sex=='male')
62 | BenefitsF <- Benefits %>% filter(sex=='female')
63 | plot(BenefitsM$age,BenefitsM$tenure,xlab='Age',ylab='Tenure at Job',col='blue')
64 | points(BenefitsF$age,BenefitsF$tenure,col='red')
65 | ```
66 |
67 | ## Overlaid Densities
68 |
69 | You can show how the *distribution* of `Y` changes for different values of `X` by plotting the density separately for different values of `X`. Use `lines` to add the second density plot after you've done the first one.
70 |
71 | ```{r, echo=TRUE, fig.width=5, fig.height = 3.5}
72 | plot(density(BenefitsM$tenure),
73 | xlab='Tenure at Job',col='blue',main="Job Tenure by Gender")
74 | lines(density(BenefitsF$tenure),col='red')
75 | ```
76 |
77 | ## Means Within Groups and Explaining
78 |
79 | Part of looking at both *correlation* and *explanation* will require getting the mean of `Y` within values of `X`, which we can do with `group_by()` in `dplyr/tidyverse`.
80 |
81 | Using `summarize()` after `group_by()` will give us a table of means within each group. Using `mutate()` will add a new variable assigning that mean. Use `mutate()` with `mean(y)` to get the part of `y` explained by `x`, or with `y - mean(y)` to get the part not explained by `x` (the residual). Don't forget to `ungroup()`!
82 |
83 | ```{r, echo=TRUE}
84 | Benefits %>% group_by(joblost) %>%
85 | summarize(tenure = mean(tenure), age = mean(age))
86 |
87 | Benefits <- Benefits %>% group_by(joblost) %>%
88 | mutate(tenure.exp = mean(tenure),
89 | tenure.resid = tenure - mean(tenure)) %>% ungroup()
90 | head(Benefits %>% select(joblost,tenure,tenure.exp,tenure.resid))
91 | ```
92 |
93 | ## Explaining With a Continuous Variable
94 |
95 | If we want to explain `Y` using `X` but `X` is continuous, we need to break it up into bins first. We will do this with `cut()`, which has the `breaks` option for how many bins to split it up into.
96 |
97 | In this class, we will be choosing the number of breaks arbitrarily. I'll tell you what values to use.
98 |
99 | ```{r, echo=TRUE}
100 | Benefits <- Benefits %>% mutate(agebins = cut(age,breaks=5)) %>%
101 | group_by(agebins) %>%
102 | mutate(tenure.ageexp = mean(tenure),
103 | tenure.ageresid = tenure - mean(tenure)) %>% ungroup()
104 | head(Benefits %>% select(agebins,tenure,tenure.ageexp,tenure.ageresid))
105 | ```
106 |
107 | ```{r, echo=TRUE, fig.width=5, fig.height = 3.5}
108 | plot(Benefits$age,Benefits$tenure,xlab="Age",ylab="Tenure",col='black')
109 | points(Benefits$age,Benefits$tenure.ageexp,col='red',cex=1.5,bg='red',pch=21)
110 | ```
111 |
112 | ## Proportion of Variance Explained
113 |
114 | When `Y` is numeric, we can calculate its variance, and see how much of that variance is explained by `X`, and also how much is not. We do this by calculating the variance of the residuals, as this is the amount of variance in `Y` left over after taking out what `X` explains.
115 |
116 | ```{r, echo=TRUE}
117 | #Proportion of tenure NOT explained by age
118 | var(Benefits$tenure.ageresid)/var(Benefits$tenure)
119 |
120 | #Proportion of tenure explained by age
121 | 1 - var(Benefits$tenure.ageresid)/var(Benefits$tenure)
122 | var(Benefits$tenure.ageexp)/var(Benefits$tenure)
123 | ```
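
The two "explained" calculations above give the same number because, within each bin, the residuals average to zero, so the variance of `Y` splits exactly into an explained part and a residual part. The sketch below checks this on simulated data. The variables `x`, `y`, `bins`, `y.exp`, and `y.resid` are made up for illustration, and base R's `ave()` stands in for `group_by()` plus `mutate()` so the chunk runs without loading any packages.

```{r, echo=TRUE}
set.seed(305)
x <- runif(1000)
y <- 2*x + rnorm(1000)

# Split x into 5 bins, then take the mean of y within each bin
bins <- cut(x, breaks=5)
y.exp <- ave(y, bins)      # part of y explained by x
y.resid <- y - y.exp       # part of y not explained by x

# Two ways of getting the proportion explained
var(y.exp)/var(y)
1 - var(y.resid)/var(y)

# The explained and residual variances add up to the total
all.equal(var(y), var(y.exp) + var(y.resid))
```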
--------------------------------------------------------------------------------
/Cheat Sheets/Relationships_Cheat_Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Relationships_Cheat_Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/Simulation Cheat Sheet.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Simulation Cheat Sheet"
3 | author: "Nick Huntington-Klein"
4 | date: "April 5, 2019"
5 | output: pdf_document
6 | ---
7 |
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
10 | ```
11 |
12 | ## The Purpose of Simulations
13 |
14 | - Data is *simulated* if we randomly generate it, and then test its properties
15 | - We do this because, when we make our own data, *we know the process that generated the data*
16 | - That way, we can know whether the *method* we use is capable of getting the right result, because we know what the right result is
17 |
18 | ## Generating Random Data
19 |
20 | We can generate categorical or logical random data with `sample()` with the `replace=TRUE` option. This will choose, with equal probability (or unequal probabilities if you specify with the `prob` option), between the elements of the vector you give it.
21 |
22 | We can generate *normally* distributed data with `rnorm()`. Normal data is symmetrically distributed around a mean and has no maximum or minimum. By default, the `mean` is 0 and the standard deviation `sd` is 1.
23 |
24 | We can generate *uniformly* distributed data with `runif()`. Uniformly distributed data is equally likely to take any value between its `min`imum and `max`imum, which are by default 0 and 1.
25 |
26 | In each case, remember to tell it how many observations you want (1000 in the below example).
27 |
28 | ```{r, echo=TRUE}
29 | library(tidyverse)
30 | df <- tibble(category = sample(c('A','B','C'),1000,replace=TRUE),
31 |              normal = rnorm(1000),
32 |              uniform = runif(1000))
33 |
34 | table(df$category)
35 |
36 | library(stargazer)
37 | stargazer(as.data.frame(df %>% select(normal, uniform)),type='text')
38 | ```
39 |
40 | ## Determining the Data Generating Process
41 |
42 | You can make sure that `X -> Y` in the data by using values of `X` in the creation of `Y`. In the below example, `W -> X`, `X -> Y`, and `W -> Y`. You can tell because `W` is used in the creation of `X`, and both `X` and `W` are used in the creation of `Y`.
43 |
44 | You can incorporate categorical variables into this process by turning them into logicals.
45 |
46 | Remember that, if you want to use one variable to create another, you need to finish creating it, and then create the variable it causes in a later `mutate()` command.
47 |
48 | ```{r, echo=TRUE}
49 | data <- tibble(W = sample(c('C','D','E'),2000,replace=TRUE)) %>%
50 |   mutate(X = .1*(W == 'C') + runif(2000)) %>%
51 |   mutate(Y = 2*(W == 'C') - 2*X + rnorm(2000))
52 | ```
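
To see what "turning them into logicals" does, here is a minimal sketch using a short made-up vector `W`. The comparison `W == 'C'` is `TRUE` or `FALSE` for each observation, and multiplying it by a number treats `TRUE` as 1 and `FALSE` as 0.

```{r, echo=TRUE}
W <- c('C','D','E','C')

# TRUE wherever W is 'C', FALSE otherwise
W == 'C'

# Multiplying turns TRUE/FALSE into 1/0, so this adds 2 only when W is 'C'
2*(W == 'C')
```

This is why `2*(W == 'C')` in the chunk above raises `Y` by 2 for the `'C'` group and leaves the other groups unchanged.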
53 |
54 |
55 | ## For Loops
56 |
57 | A for loop executes a chunk of code over and over again, each time incrementing some index by 1. Remember that the parentheses and curly braces are required, and also that if you want it to show results while you're running a loop, you must use `print()`.
58 |
59 | If you are interested in getting further into R, I'd recommend trying to learn about the `apply` family of functions, which can often replace a loop with a single, more compact function call.
60 |
61 | ```{r, echo=TRUE}
62 | for (i in 1:5) {
63 |   print(i)
64 | }
65 | ```
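
As one small illustration of the `apply` family, `sapply()` runs a function once for each element of a vector and collects the results. The function here (squaring the index) and the name `squares` are just placeholders.

```{r, echo=TRUE}
# Same idea as a for loop over 1:5, but the results are stored for us
squares <- sapply(1:5, function(i) i^2)
squares
```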
66 |
67 | ## Putting the Simulation Together
68 |
69 | In a simulation, we want to:
70 |
71 | - Create a blank vector to store our results
72 | - (in loop) Simulate data
73 | - (in loop) Get a result
74 | - (in loop) Store the result
75 | - Show what we found
76 |
77 | We can create a blank vector to store our results with `c(NA)`. Then, in the loop, whatever calculation we do we can store using the index we're looping over.
78 |
79 | Note that, if you like, you can make a blank data frame or tibble instead of a vector, and use `add_row()` (from the `tibble` package, which loads with the tidyverse) to store results.
80 |
81 | ```{r, echo = TRUE}
82 | results <- c(NA)
83 |
84 | for (i in 1:1000) {
85 |   data <- tibble(W = sample(c('C','D','E'),2000,replace=TRUE)) %>%
86 |     mutate(X = 1.5*(W == 'C') + runif(2000)) %>%
87 |     mutate(Y = 2*(W == 'C') - .5*X + rnorm(2000))
88 |
89 |   results[i] <- cor(data$X,data$Y)
90 | }
91 | ```
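
If you would rather store results in a tibble than in a vector, one possible version looks like the sketch below. The column names `iteration` and `correlation`, the smaller sample sizes, and the simplified data (no `W`) are assumptions made to keep the example short.

```{r, echo=TRUE}
library(tibble)

# Start with an empty tibble that has the columns we want to fill in
results <- tibble(iteration = integer(), correlation = numeric())

for (i in 1:100) {
  X <- runif(200)
  Y <- -.5*X + rnorm(200)

  # Add one row per simulation
  results <- add_row(results, iteration = i, correlation = cor(X, Y))
}

nrow(results)
```

A tibble is handy when you want to store more than one result from each simulation, since each result can go in its own column.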
92 |
93 | After you're done you can show what happened using standard methods for analyzing a single variable, like creating a summary statistics table with `stargazer()` (don't forget to use `as.data.frame()`) or plotting the density. Make sure you're thinking about what the *true* answer is and how far you are from it. For example, here, we know that `X` negatively affects `Y`, and we might be interested in how often we find that.
94 |
95 | ```{r, echo=TRUE}
96 | library(stargazer)
97 | stargazer(as.data.frame(results),type='text')
98 | ```
99 |
100 | ```{r, echo=TRUE, fig.width=5, fig.height = 3.5}
101 | plot(density(results),xlab='Correlation Between X and Y',main='Simulation Results')
102 | abline(v=0)
103 | ```
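
To put a number on "how often we find that," we can take the share of simulations in which the correlation came out negative. The sketch below reruns a smaller version of the simulation in base R (200 repetitions and a fixed seed are arbitrary choices) so that it stands on its own.

```{r, echo=TRUE}
set.seed(305)
results <- rep(NA, 200)

for (i in 1:200) {
  W <- sample(c('C','D','E'), 2000, replace=TRUE)
  X <- 1.5*(W == 'C') + runif(2000)
  Y <- 2*(W == 'C') - .5*X + rnorm(2000)

  results[i] <- cor(X, Y)
}

# Share of simulations where the correlation has the "true" negative sign
mean(results < 0)
```

With this data generating process the share should be at or near zero: `W` raises both `X` and `Y`, which pushes the raw correlation positive even though the effect of `X` on `Y` is negative. That gap between the raw correlation and the true effect is exactly the kind of thing a simulation lets us see.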
104 |
--------------------------------------------------------------------------------
/Cheat Sheets/Simulation_Cheat_Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Simulation_Cheat_Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/dplyr Cheat Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/dplyr Cheat Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/floatposition.tex:
--------------------------------------------------------------------------------
1 | \usepackage{float}
2 | \let\origfigure\figure
3 | \let\endorigfigure\endfigure
4 | \renewenvironment{figure}[1][2] {
5 | \expandafter\origfigure\expandafter[H]
6 | } {
7 | \endorigfigure
8 | }
--------------------------------------------------------------------------------
/ECON 305 Syllabus Updated Template.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/ECON 305 Syllabus Updated Template.docx
--------------------------------------------------------------------------------
/Instructor Guide/.Rhistory:
--------------------------------------------------------------------------------
1 | help(install.packages)
2 | install.packages(c('tidyverse','dplyr','magrittr','devtools','roxygen2','ggplot2','stargazer','estimatr','vtable','haven','readxl','lubridate'))
3 | install.packages(c('AER','car','zoo','Ecdat'))
4 | install.packages('wooldridge')
5 | remotes::install_github('r-lib/vctrs')
6 | remotes::install_github('r-lib/vctrs')
7 | remotes::install_github('tidyverse/vctrs')
8 | install.packages('vctrs')
9 | remotes::install_github('tidyverse/tidyr')
10 | remotes::install_github('r-lib/vctrs')
11 | devtools::install_github('r-lib/vctrs')
12 | library(estimatr)
13 | library(tidyverse)
14 | install.packages('rlang')
15 | install.packages("rlang")
16 | install.packages("rlang")
17 | library(estimatr)
18 | library(tidyverse)
19 | help(tibble)
20 | df <- tibble(z = rnorm(300),
21 | gamma = c(rep(-1,100),rep(0,100),rep(1,100))) %>%
22 | mutate(x = gamma*z + rnorm(300)) %>%
23 | mutate(y = 1.5*x+rnorm(300))
24 | help("iv_robust")
25 | summary(iv_robust(y~x|z,data=df))
26 | summary(iv_robust(y~x|z,data=df,weights=gamma))
27 | summary(iv_robust(y~x|z,data=df,subset=gamma==1))
28 | df <- tibble(z = rnorm(300),
29 | gamma = c(rep(0,100),rep(0.1,100),rep(1,100))) %>%
30 | mutate(x = gamma*z + rnorm(300)) %>%
31 | mutate(y = 1.5*x+rnorm(300))
32 | summary(iv_robust(y~x|z,data=df))
33 | df <- tibble(z = rnorm(300),
34 | gamma = c(rep(0,100),rep(0.1,100),rep(1,100))) %>%
35 | mutate(x = gamma*z + rnorm(300)) %>%
36 | mutate(y = 1.5*x+rnorm(300))
37 | summary(iv_robust(y~x|z,data=df))
38 | df <- tibble(z = rnorm(300),
39 | gamma = c(rep(0,100),rep(0.1,100),rep(1,100))) %>%
40 | mutate(x = gamma*z + rnorm(300)) %>%
41 | mutate(y = 1.5*x+rnorm(300))
42 | summary(iv_robust(y~x|z,data=df))
43 | summary(iv_robust(y~x|z,data=df,weights=gamma))
44 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
45 | df <- tibble(z = rnorm(300),
46 | gamma = c(rep(0,100),rep(0.05,100),rep(.5,100))) %>%
47 | mutate(x = gamma*z + rnorm(300)) %>%
48 | mutate(y = 1.5*x+rnorm(300))
49 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
50 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
51 | df <- tibble(z = rnorm(300),
52 | gamma = c(rep(0,100),rep(0.05,150),rep(.5,50))) %>%
53 | mutate(x = gamma*z + rnorm(300)) %>%
54 | mutate(y = 1.5*x+rnorm(300))
55 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
56 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
57 | df <- tibble(z = rnorm(300),
58 | gamma = c(rep(0,100),rep(0.05,150),rep(.5,50))) %>%
59 | mutate(x = gamma*z + rnorm(300)) %>%
60 | mutate(y = 1.5*x+rnorm(300))
61 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
62 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
63 | df <- tibble(z = rnorm(300),
64 | gamma = c(rep(0,100),rep(0.05,150),rep(.5,50))) %>%
65 | mutate(x = gamma*z + rnorm(300)) %>%
66 | mutate(y = 1.5*x+rnorm(300))
67 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
68 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
69 | df <- tibble(z = rnorm(300),
70 | gamma = c(rep(0,100),rep(0.05,150),rep(.5,50))) %>%
71 | mutate(x = gamma*z + rnorm(300)) %>%
72 | mutate(y = (1.5*(1+gamma))*x+rnorm(300))
73 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
74 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
75 | df <- tibble(z = rnorm(30000),
76 | gamma = c(rep(0,10000),rep(0.05,15000),rep(.5,5000))) %>%
77 | mutate(x = gamma*z + rnorm(30000)) %>%
78 | mutate(y = (1.5*(1+gamma))*x+rnorm(30000))
79 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
80 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
81 | df <- tibble(z = rnorm(30000),
82 | gamma = c(rep(0,10000),rep(0.05,15000),rep(.5,5000))) %>%
83 | mutate(x = gamma*z + rnorm(30000)) %>%
84 | mutate(y = (1.5)*x+rnorm(30000))
85 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
86 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
87 | df <- tibble(z = rnorm(30000),
88 | gamma = c(rep(0,10000),rep(0.05,15000),rep(.5,5000))) %>%
89 | mutate(x = gamma*z + rnorm(30000)) %>%
90 | mutate(y = (1.5)*x+rnorm(30000))
91 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
92 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
93 | df <- tibble(z = rnorm(30000),
94 | gamma = c(rep(0,10000),rep(0.05,15000),rep(.5,5000))) %>%
95 | mutate(x = gamma*z + rnorm(30000)) %>%
96 | mutate(y = (1.5)*x+rnorm(30000))
97 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
98 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
99 | df <- tibble(z = rnorm(30000),
100 | gamma = c(rep(0,10000),rep(0.05,15000),rep(.5,5000))) %>%
101 | mutate(x = gamma*z + rnorm(30000)) %>%
102 | mutate(y = (1.5)*x+rnorm(30000))
103 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
104 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
105 | df <- tibble(z = rnorm(30000),
106 | gamma = c(rep(0,10000),rep(0.05,15000),rep(.5,5000))) %>%
107 | mutate(x = gamma*z + rnorm(30000)) %>%
108 | mutate(y = (1.5*(1-gamma))*x+rnorm(30000))
109 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
110 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
111 | df <- tibble(z = rnorm(30000),
112 | gamma = c(rep(0,10000),rep(0.05,15000),rep(.5,5000))) %>%
113 | mutate(x = gamma*z + rnorm(30000)) %>%
114 | mutate(y = (1.5*(1-gamma))*x+rnorm(30000))
115 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
116 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
117 | df <- tibble(z = rnorm(30000),
118 | gamma = c(rep(0,10000),rep(0.05,15000),rep(.5,5000))) %>%
119 | mutate(x = gamma*z + rnorm(30000)) %>%
120 | mutate(y = (1.5*(1-gamma))*x+rnorm(30000))
121 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
122 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
123 | df <- tibble(z = rnorm(30000),
124 | gamma = c(rep(0,10000),rep(0.05,15000),rep(1,5000))) %>%
125 | mutate(x = gamma*z + rnorm(30000)) %>%
126 | mutate(y = (1.5*(1-gamma))*x+rnorm(30000))
127 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
128 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
129 | df <- tibble(z = rnorm(30000),
130 | gamma = c(rep(0,10000),rep(0.00,15000),rep(1,5000))) %>%
131 | mutate(x = gamma*z + rnorm(30000)) %>%
132 | mutate(y = (1.5*(1-gamma))*x+rnorm(30000))
133 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
134 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
135 | rep(1,2)
136 | View(df)
137 | head(df)
138 | df <- tibble(z = rnorm(30000),
139 | gamma = c(rep(0,10000),rep(0.00,15000),rep(1,5000))) %>%
140 | mutate(x = gamma*z + rnorm(30000)) %>%
141 | mutate(y = (c(rep(.5,10000),rep(.25,15000),rep(.75,5000)))*x+rnorm(30000))
142 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
143 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
144 | df <- tibble(z = rnorm(30000),
145 | gamma = c(rep(0,10000),rep(0.05,15000),rep(1,5000))) %>%
146 | mutate(x = gamma*z + rnorm(30000)) %>%
147 | mutate(y = (c(rep(.5,10000),rep(.25,15000),rep(.75,5000)))*x+rnorm(30000))
148 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
149 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
150 | df <- tibble(z = rnorm(30000),
151 | gamma = c(rep(0,10000),rep(0.05,15000),rep(1,5000))) %>%
152 | mutate(x = gamma*z + rnorm(30000)) %>%
153 | mutate(y = (c(rep(.5,10000),rep(.25,15000),rep(.75,5000)))*x+rnorm(30000))
154 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
155 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
156 | df <- tibble(z = rnorm(300),
157 | gamma = c(rep(0,100),rep(0.05,150),rep(1,50))) %>%
158 | mutate(x = gamma*z + rnorm(300)) %>%
159 | mutate(y = (c(rep(.5,100),rep(.25,150),rep(.75,50)))*x+rnorm(300))
160 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
161 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
162 | df <- tibble(z = rnorm(300),
163 | gamma = c(rep(0,100),rep(0.05,150),rep(1,50))) %>%
164 | mutate(x = gamma*z + rnorm(300)) %>%
165 | mutate(y = (c(rep(.5,100),rep(.25,150),rep(.75,50)))*x+rnorm(300))
166 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
167 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
168 | df <- tibble(z = rnorm(300),
169 | gamma = c(rep(0,100),rep(0.05,150),rep(1,50))) %>%
170 | mutate(x = gamma*z + rnorm(300)) %>%
171 | mutate(y = (c(rep(.5,100),rep(.25,150),rep(.75,50)))*x+rnorm(300))
172 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
173 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
174 | setwd("C:/Users/nickc/Dropbox (CSU Fullerton)/Teaching/Econ 305/Instructor Guide")
175 | install.packages('tinytex')
176 | tinytex::install_tinytex()
177 |
--------------------------------------------------------------------------------
/Instructor Guide/Instructor_Guide.aux:
--------------------------------------------------------------------------------
1 | \relax
2 | \providecommand\hyper@newdestlabel[2]{}
3 | \providecommand\BKM@entry[2]{}
4 | \providecommand\HyperFirstAtBeginDocument{\AtBeginDocument}
5 | \HyperFirstAtBeginDocument{\ifx\hyper@anchor\@undefined
6 | \global\let\oldcontentsline\contentsline
7 | \gdef\contentsline#1#2#3#4{\oldcontentsline{#1}{#2}{#3}}
8 | \global\let\oldnewlabel\newlabel
9 | \gdef\newlabel#1#2{\newlabelxx{#1}#2}
10 | \gdef\newlabelxx#1#2#3#4#5#6{\oldnewlabel{#1}{{#2}{#3}}}
11 | \AtEndDocument{\ifx\hyper@anchor\@undefined
12 | \let\contentsline\oldcontentsline
13 | \let\newlabel\oldnewlabel
14 | \fi}
15 | \fi}
16 | \global\let\hyper@last\relax
17 | \gdef\HyperFirstAtBeginDocument#1{#1}
18 | \providecommand\HyField@AuxAddToFields[1]{}
19 | \providecommand\HyField@AuxAddToCoFields[2]{}
20 | \BKM@entry{id=1,dest={73656374696F6E2A2E31},srcline={126}}{5C3337365C3337375C303030575C303030685C303030615C303030745C3030305C3034305C303030595C3030306F5C303030755C3030305C3034305C3030304E5C303030655C303030655C303030645C3030305C3034305C303030745C3030306F5C3030305C3034305C3030304B5C3030306E5C3030306F5C30303077}
21 | \BKM@entry{id=2,dest={73656374696F6E2A2E32},srcline={174}}{5C3337365C3337375C303030495C3030306E5C3030305C3034305C303030545C303030685C303030695C303030735C3030305C3034305C303030445C3030306F5C303030635C303030755C3030306D5C303030655C3030306E5C30303074}
22 | \@writefile{toc}{\contentsline {subsection}{What You Need to Know}{1}{section*.1}\protected@file@percent }
23 | \newlabel{what-you-need-to-know}{{}{1}{What You Need to Know}{section*.1}{}}
24 | \@writefile{toc}{\contentsline {subsection}{In This Document}{1}{section*.2}\protected@file@percent }
25 | \newlabel{in-this-document}{{}{1}{In This Document}{section*.2}{}}
26 | \BKM@entry{id=3,dest={73656374696F6E2A2E33},srcline={230}}{}
27 | \@writefile{toc}{\contentsline {subsection}{What Distinguishes This Class from Other Econometrics Classes?}{2}{section*.3}\protected@file@percent }
28 | \newlabel{what-distinguishes-this-class-from-other-econometrics-classes}{{}{2}{What Distinguishes This Class from Other Econometrics Classes?}{section*.3}{}}
29 | \BKM@entry{id=4,dest={73656374696F6E2A2E34},srcline={325}}{5C3337365C3337375C303030525C303030735C303030745C303030755C303030645C303030695C3030306F5C3030302E5C303030635C3030306C5C3030306F5C303030755C30303064}
30 | \BKM@entry{id=5,dest={73656374696F6E2A2E35},srcline={351}}{5C3337365C3337375C303030525C3030304D5C303030615C303030725C3030306B5C303030645C3030306F5C303030775C3030306E}
31 | \@writefile{toc}{\contentsline {subsection}{Rstudio.cloud}{3}{section*.4}\protected@file@percent }
32 | \newlabel{rstudio.cloud}{{}{3}{Rstudio.cloud}{section*.4}{}}
33 | \@writefile{toc}{\contentsline {subsection}{RMarkdown}{3}{section*.5}\protected@file@percent }
34 | \newlabel{rmarkdown}{{}{3}{RMarkdown}{section*.5}{}}
35 | \BKM@entry{id=6,dest={73656374696F6E2A2E36},srcline={421}}{5C3337365C3337375C303030415C303030645C3030306A5C303030755C303030735C303030745C303030695C3030306E5C303030675C3030305C3034305C303030745C3030306F5C3030305C3034305C303030525C3030305C3034305C303030615C3030306E5C303030645C3030305C3034305C303030645C303030705C3030306C5C303030795C303030725C3030305C3034305C303030665C303030725C3030306F5C3030306D5C3030305C3034305C3030304F5C303030745C303030685C303030655C303030725C3030305C3034305C303030535C3030306F5C303030665C303030745C303030775C303030615C303030725C30303065}
36 | \@writefile{toc}{\contentsline {subsection}{Adjusting to R and dplyr from Other Software}{4}{section*.6}\protected@file@percent }
37 | \newlabel{adjusting-to-r-and-dplyr-from-other-software}{{}{4}{Adjusting to R and dplyr from Other Software}{section*.6}{}}
38 | \BKM@entry{id=7,dest={73656374696F6E2A2E37},srcline={816}}{5C3337365C3337375C303030645C303030615C303030675C303030695C303030745C303030745C303030795C3030305C3034305C303030615C3030306E5C303030645C3030305C3034305C303030675C303030675C303030645C303030615C30303067}
39 | \@writefile{toc}{\contentsline {subsection}{dagitty and ggdag}{8}{section*.7}\protected@file@percent }
40 | \newlabel{dagitty-and-ggdag}{{}{8}{dagitty and ggdag}{section*.7}{}}
41 | \BKM@entry{id=8,dest={73656374696F6E2A2E38},srcline={864}}{5C3337365C3337375C303030505C303030725C303030655C303030705C303030615C303030725C303030695C3030306E5C303030675C3030305C3034305C303030445C303030615C303030745C303030615C3030305C3034305C303030665C3030306F5C303030725C3030305C3034305C303030535C303030745C303030755C303030645C303030655C3030306E5C303030745C3030305C3034305C303030555C303030735C30303065}
42 | \BKM@entry{id=9,dest={73656374696F6E2A2E39},srcline={897}}{5C3337365C3337375C303030675C303030675C303030705C3030306C5C3030306F5C303030745C30303032}
43 | \@writefile{toc}{\contentsline {subsection}{Preparing Data for Student Use}{9}{section*.8}\protected@file@percent }
44 | \newlabel{preparing-data-for-student-use}{{}{9}{Preparing Data for Student Use}{section*.8}{}}
45 | \@writefile{toc}{\contentsline {subsection}{ggplot2}{9}{section*.9}\protected@file@percent }
46 | \newlabel{ggplot2}{{}{9}{ggplot2}{section*.9}{}}
47 | \BKM@entry{id=10,dest={73656374696F6E2A2E3130},srcline={952}}{5C3337365C3337375C303030435C303030615C303030755C303030735C303030615C3030306C5C3030305C3034305C303030445C303030695C303030615C303030675C303030725C303030615C3030306D5C30303073}
48 | \BKM@entry{id=11,dest={73656374696F6E2A2E3131},srcline={1040}}{5C3337365C3337375C303030545C303030685C303030655C3030305C3034305C303030545C3030306F5C3030306F5C3030306C5C303030625C3030306F5C303030785C3030305C3034305C303030575C303030695C303030745C303030685C3030306F5C303030755C303030745C3030305C3034305C303030525C303030655C303030675C303030725C303030655C303030735C303030735C303030695C3030306F5C3030306E}
49 | \BKM@entry{id=12,dest={73656374696F6E2A2E3132},srcline={1048}}{5C3337365C3337375C303030435C3030306F5C3030306E5C303030745C303030725C3030306F5C3030306C5C3030306C5C303030695C3030306E5C303030675C3030305C3034305C303030665C3030306F5C303030725C3030305C3034305C303030615C3030305C3034305C303030765C303030615C303030725C303030695C303030615C303030625C3030306C5C30303065}
50 | \@writefile{toc}{\contentsline {subsection}{Causal Diagrams}{11}{section*.10}\protected@file@percent }
51 | \newlabel{causal-diagrams}{{}{11}{Causal Diagrams}{section*.10}{}}
52 | \@writefile{toc}{\contentsline {subsection}{The Toolbox Without Regression}{11}{section*.11}\protected@file@percent }
53 | \newlabel{the-toolbox-without-regression}{{}{11}{The Toolbox Without Regression}{section*.11}{}}
54 | \BKM@entry{id=13,dest={73656374696F6E2A2E3133},srcline={1069}}{5C3337365C3337375C303030465C303030695C303030785C303030655C303030645C3030305C3034305C303030655C303030665C303030665C303030655C303030635C303030745C30303073}
55 | \BKM@entry{id=14,dest={73656374696F6E2A2E3134},srcline={1082}}{5C3337365C3337375C3030304D5C303030615C303030745C303030635C303030685C303030695C3030306E5C30303067}
56 | \BKM@entry{id=15,dest={73656374696F6E2A2E3135},srcline={1106}}{5C3337365C3337375C303030445C303030695C303030665C303030665C303030655C303030725C303030655C3030306E5C303030635C303030655C3030305C3034305C303030695C3030306E5C3030305C3034305C303030445C303030695C303030665C303030665C303030655C303030725C303030655C3030306E5C303030635C303030655C30303073}
57 | \BKM@entry{id=16,dest={73656374696F6E2A2E3136},srcline={1122}}{5C3337365C3337375C303030525C303030655C303030675C303030725C303030655C303030735C303030735C303030695C3030306F5C3030306E5C3030305C3034305C303030445C303030695C303030735C303030635C3030306F5C3030306E5C303030745C303030695C3030306E5C303030755C303030695C303030745C30303079}
58 | \@writefile{toc}{\contentsline {subsubsection}{Controlling for a variable}{12}{section*.12}\protected@file@percent }
59 | \newlabel{controlling-for-a-variable}{{}{12}{Controlling for a variable}{section*.12}{}}
60 | \@writefile{toc}{\contentsline {subsubsection}{Fixed effects}{12}{section*.13}\protected@file@percent }
61 | \newlabel{fixed-effects}{{}{12}{Fixed effects}{section*.13}{}}
62 | \@writefile{toc}{\contentsline {subsubsection}{Matching}{12}{section*.14}\protected@file@percent }
63 | \newlabel{matching}{{}{12}{Matching}{section*.14}{}}
64 | \@writefile{toc}{\contentsline {subsubsection}{Difference in Differences}{12}{section*.15}\protected@file@percent }
65 | \newlabel{difference-in-differences}{{}{12}{Difference in Differences}{section*.15}{}}
66 | \@writefile{toc}{\contentsline {subsubsection}{Regression Discontinuity}{12}{section*.16}\protected@file@percent }
67 | \newlabel{regression-discontinuity}{{}{12}{Regression Discontinuity}{section*.16}{}}
68 | \BKM@entry{id=17,dest={73656374696F6E2A2E3137},srcline={1138}}{5C3337365C3337375C303030495C3030306E5C303030735C303030745C303030725C303030755C3030306D5C303030655C3030306E5C303030745C303030615C3030306C5C3030305C3034305C303030565C303030615C303030725C303030695C303030615C303030625C3030306C5C303030655C30303073}
69 | \@writefile{toc}{\contentsline {subsubsection}{Instrumental Variables}{13}{section*.17}\protected@file@percent }
70 | \newlabel{instrumental-variables}{{}{13}{Instrumental Variables}{section*.17}{}}
71 |
--------------------------------------------------------------------------------
/Instructor Guide/Instructor_Guide.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Instructor Guide/Instructor_Guide.pdf
--------------------------------------------------------------------------------
/Lectures/Animation_Local_Linear_Regression.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Animation_Local_Linear_Regression.gif
--------------------------------------------------------------------------------
/Lectures/GDP.csv:
--------------------------------------------------------------------------------
1 | Year,GDP,Country
2 | 1960,5.43E+11,United States
3 | 1961,5.63E+11,United States
4 | 1962,6.05E+11,United States
5 | 1963,6.39E+11,United States
6 | 1964,6.86E+11,United States
7 | 1965,7.44E+11,United States
8 | 1966,8.15E+11,United States
9 | 1967,8.62E+11,United States
10 | 1968,9.43E+11,United States
11 | 1969,1.02E+12,United States
12 | 1970,1.08E+12,United States
13 | 1971,1.17E+12,United States
14 | 1972,1.28E+12,United States
15 | 1973,1.43E+12,United States
16 | 1974,1.55E+12,United States
17 | 1975,1.69E+12,United States
18 | 1976,1.88E+12,United States
19 | 1977,2.09E+12,United States
20 | 1978,2.36E+12,United States
21 | 1979,2.63E+12,United States
22 | 1980,2.86E+12,United States
23 | 1981,3.21E+12,United States
24 | 1982,3.34E+12,United States
25 | 1983,3.64E+12,United States
26 | 1984,4.04E+12,United States
27 | 1985,4.35E+12,United States
28 | 1986,4.59E+12,United States
29 | 1987,4.87E+12,United States
30 | 1988,5.25E+12,United States
31 | 1989,5.66E+12,United States
32 | 1990,5.98E+12,United States
33 | 1991,6.17E+12,United States
34 | 1992,6.54E+12,United States
35 | 1993,6.88E+12,United States
36 | 1994,7.31E+12,United States
37 | 1995,7.66E+12,United States
38 | 1996,8.10E+12,United States
39 | 1997,8.61E+12,United States
40 | 1998,9.09E+12,United States
41 | 1999,9.66E+12,United States
42 | 2000,1.03E+13,United States
43 | 2001,1.06E+13,United States
44 | 2002,1.10E+13,United States
45 | 2003,1.15E+13,United States
46 | 2004,1.23E+13,United States
47 | 2005,1.31E+13,United States
48 | 2006,1.39E+13,United States
49 | 2007,1.45E+13,United States
50 | 2008,1.47E+13,United States
51 | 2009,1.44E+13,United States
52 | 2010,1.50E+13,United States
53 | 2011,1.55E+13,United States
54 | 2012,1.62E+13,United States
55 | 2013,1.67E+13,United States
56 | 2014,1.74E+13,United States
57 | 2015,1.81E+13,United States
58 | 2016,1.86E+13,United States
59 | 2017,1.94E+13,United States
60 | 1960,1.37E+12,World
61 | 1961,1.42E+12,World
62 | 1962,1.53E+12,World
63 | 1963,1.64E+12,World
64 | 1964,1.80E+12,World
65 | 1965,1.96E+12,World
66 | 1966,2.13E+12,World
67 | 1967,2.26E+12,World
68 | 1968,2.44E+12,World
69 | 1969,2.69E+12,World
70 | 1970,2.96E+12,World
71 | 1971,3.27E+12,World
72 | 1972,3.77E+12,World
73 | 1973,4.59E+12,World
74 | 1974,5.30E+12,World
75 | 1975,5.90E+12,World
76 | 1976,6.42E+12,World
77 | 1977,7.26E+12,World
78 | 1978,8.54E+12,World
79 | 1979,9.93E+12,World
80 | 1980,1.12E+13,World
81 | 1981,1.15E+13,World
82 | 1982,1.14E+13,World
83 | 1983,1.16E+13,World
84 | 1984,1.21E+13,World
85 | 1985,1.27E+13,World
86 | 1986,1.50E+13,World
87 | 1987,1.71E+13,World
88 | 1988,1.92E+13,World
89 | 1989,2.01E+13,World
90 | 1990,2.26E+13,World
91 | 1991,2.39E+13,World
92 | 1992,2.54E+13,World
93 | 1993,2.58E+13,World
94 | 1994,2.77E+13,World
95 | 1995,3.08E+13,World
96 | 1996,3.15E+13,World
97 | 1997,3.14E+13,World
98 | 1998,3.13E+13,World
99 | 1999,3.25E+13,World
100 | 2000,3.36E+13,World
101 | 2001,3.34E+13,World
102 | 2002,3.46E+13,World
103 | 2003,3.89E+13,World
104 | 2004,4.38E+13,World
105 | 2005,4.74E+13,World
106 | 2006,5.13E+13,World
107 | 2007,5.78E+13,World
108 | 2008,6.34E+13,World
109 | 2009,6.01E+13,World
110 | 2010,6.60E+13,World
111 | 2011,7.33E+13,World
112 | 2012,7.50E+13,World
113 | 2013,7.71E+13,World
114 | 2014,7.91E+13,World
115 | 2015,7.48E+13,World
116 | 2016,7.59E+13,World
117 | 2017,8.07E+13,World
118 |
--------------------------------------------------------------------------------
/Lectures/Lecture_01-cloud-market-revenue.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_01-cloud-market-revenue.png
--------------------------------------------------------------------------------
/Lectures/Lecture_01_Politics_Example.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_01_Politics_Example.PNG
--------------------------------------------------------------------------------
/Lectures/Lecture_01_World_of_Data.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 1: A World of Data"
3 | author: "Nick Huntington-Klein"
4 | date: "December 1, 2018"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 | ---
15 |
16 | ```{r setup, include=FALSE}
17 | knitr::opts_chunk$set(echo = FALSE)
18 | library(tidyverse)
19 | theme_set(theme_gray(base_size = 15))
20 | ```
21 |
22 | ## A World of Data
23 |
24 | It's cliche to say that the world focuses more on data than ever before, but that's just because it's true
25 |
26 | Even more so than understanding *statistics and probability*, in order to understand the world around us we need to understand *data*, *how data is used*, and *what it means*
27 |
28 | Google and Facebook, among many others, have reams and reams of data on you and everybody else. What do they do with it? Why?
29 |
30 | ## Understanding the World
31 |
32 | Increasingly, understanding the world is going to require the ability to understand data
33 |
34 | And learning things about the world is going to require the ability to manipulate data
35 |
36 | ## Data Scientist Pay
37 |
38 | ```{r, echo = FALSE}
39 | #Read in data
40 | salary <- read.csv(text="Experience,Salary
41 | 0.11508951406649626, 89130.43478260869
42 | 1.877024722932652, 92521.73913043478
43 | 4.128303495311169, 96956.52173913045
44 | 5.890238704177326, 100347.82608695651
45 | 8.044117647058828, 104000
46 | 10.051577152600172, 106869.5652173913
47 | 12.205882352941181, 110000
48 | 14.654731457800512, 112608.6956521739
49 | 16.956308610400683, 115478.26086956522
50 | 19.01300085251492, 118086.95652173912
51 | 19.99168797953965, 120173.91304347827
52 | 21.8994032395567, 125130.4347826087
53 | 23.758312020460362, 129826.08695652173
54 | 25.665387894288155, 135565.21739130435
55 | 27.425618073316286, 141043.47826086957
56 | 29.67455242966753, 148347.82608695654")
57 | #Use ggplot to plot out data
58 | ggplot(salary,aes(x=Experience,y=Salary))+
59 | #with a smoothed line graph
60 | stat_smooth(method = lm, formula = y ~ poly(x, 10), se = FALSE)+
61 | #Have y-axis start at 50k
62 | expand_limits(y=50000)+
63 | #Add labels
64 | labs(title="Data Scientist Salary by Experience",subtitle="Data from Glassdoor; avg college grad starts $49,875")+
65 | xlab("Experience (Years)")
66 | ```
67 |
68 | ## Top Jobs for Economics Majors
69 |
70 | Data from The Balance Careers
71 |
72 | ```{r, echo = FALSE}
73 | #Read in data
74 | topjobs <- read.csv(text="Job,Salary,UsesData
75 | Market Research Analyst,71450,A Lot
76 | Economic Consultant,112650,A Lot
77 | Comp & Bfts Manager,130010,A Lot
78 | Actuary,114850,A Lot
79 | Credit Analyst,82900,A Little
80 | Financial Analyst,99430,A Little
81 | Policy Analyst,112030,A Lot
82 | Lawyer,141890,Not a Lot
83 | Management Consultant,93440,A Little
84 | Business Reporter,67500,Not a Lot")
85 | #Sort so it goes from lowest salary to highest
86 | topjobs$Job <- reorder(topjobs$Job, topjobs$Salary)
87 | #Reorder factor so it goes least to most
88 | topjobs$UsesData <- factor(topjobs$UsesData,levels=c("Not a Lot","A Little","A Lot"))
89 | #Plot out
90 | ggplot(topjobs,aes(x=Job,y=Salary/1000,fill=UsesData))+
91 | #With a bar graph
92 | geom_col()+
93 | #Label
94 | ylab("Avg. Salary (Thousands)")+xlab(element_blank())+
95 | labs(title="Do Top-Ten Econ Major Jobs Use Data?")+
96 | #Rotate job labels so they fit
97 | theme(axis.text.x = element_text(angle = 45, hjust = 1))
98 | ```
99 |
100 | ## We Use Data to Understand the Economy
101 |
102 | ```{r, echo = FALSE}
103 | #Read in data
104 | gdp <- read.csv('GDP.csv')
105 | #Get GDP in the first year of data using dplyr
106 | gdp <- gdp %>%
107 | group_by(Country) %>%
108 | mutate(firstGDP=GDP[1]) %>%
109 | mutate(gdprel=GDP/firstGDP)
110 | #Plot data
111 | ggplot(gdp,aes(x=Year,y=gdprel,color=Country))+
112 | #Line graph
113 | geom_line()+
114 | #Label
115 | xlab("Year")+ylab("GDP Relative to 1960")+
116 | theme(legend.title=element_blank())
117 |
118 | ```
119 |
120 | ## We Use Data to Understand Business
121 |
122 | 
123 |
124 | ## We Use Data to Understand Politics
125 |
126 |
127 | 
128 |
129 | ## We Use Data to Understand the World
130 |
131 | ```{r, echo = FALSE}
132 | #Read in data
133 | data(co2)
134 | #Plot, cex for bigger font
135 | plot(co2,xlab="Year",ylab="Atmospheric CO2 Concentration",cex=1.75)
136 | ```
137 |
138 | ## This Class
139 |
140 | In this class, we'll be accomplishing a few goals.
141 |
142 | - Learning how to use the statistical programming language R
143 | - Learning how to understand the data we see in the world
144 | - Learning how to figure out *what data actually tells us*
145 | - Learning about *causal inference* - the economist's comparative advantage!
146 |
147 | ## Why Programming?
148 |
149 | Why do we have to learn to code? Why not just use Excel?
150 |
151 | - Excel is great at being a spreadsheet. You should learn it. It's a pretty bad data analysis tool though
152 | - Learning a programming language is a very important skill
153 | - R is free, very flexible (heck, I wrote these slides in R), growing in popularity, and used in other econometrics courses, and it makes it easy to jump to something like Python if need be
154 |
155 | ## Don't Be Scared
156 |
157 | - Programming isn't all that hard
158 | - You're just telling the computer what to do
159 | - The computer will do exactly as you say
160 | - Just imagine it's like your bratty little sibling who would do what you said, *literally*
161 |
162 | ## Plus
163 |
164 | - As mentioned, once you know one language it's much easier to learn others
165 | - There will be plenty of resources and cheat sheets to help you
166 | - Ever get curious and have a question? Now you can just *answer it*. How cool is that?
167 |
168 | ## Causal Inference?
169 |
170 | What is causal inference?
171 |
172 | - It's easy to get data to tell us what happened, but not **why**. "Correlation does not equal causation"
173 | - Economists have been looking at causation for longer than most other fields. We're good at it!
174 | - Causal inference is often necessary to link data to *models* and actually *learn how the world works*
175 | - We'll be taking a special approach to causal inference, one that lets us avoid complex mathematical models
176 |
177 | ## Lucky You!
178 |
179 | This is a pretty unusual course. We're lucky enough to be able to start the econometrics sequence off this way.
180 |
181 | In most places, you have to learn programming *while* learning advanced methods, and similarly for causal inference!
182 |
183 | Here we have time to build these important skills and intuitions before sending you into the more mathematical world of other econometrics courses
184 |
185 | ## Structure of the Course
186 |
187 | 1. Programming and working with data
188 | 2. Causal Inference and learning from data
189 | 3. On to the next course!
190 |
191 | ## Admin
192 |
193 | - Syllabus
194 | - Homework (due Sundays, including this coming Sunday)
195 | - Short writing projects
196 | - Attendance
197 | - Midterms
198 | - Final
199 | - Extra Credit
200 |
201 | ## An Example
202 |
203 | - Let's look at a real-world application of data to an important economic problem
204 | - To look for: What data are they using?
205 | - How do they tell a story with it?
206 | - What can we learn from numbers alone?
207 | - How do they interpret the data? Can we trust it?
208 | - [Economic Lives of the Middle Class](https://www.nytimes.com/2018/11/03/upshot/how-the-economic-lives-of-the-middle-class-have-changed-since-2016-by-the-numbers.html)
209 |
--------------------------------------------------------------------------------
/Lectures/Lecture_02_Understanding_Data.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 2: Understanding Data"
3 | author: "Nick Huntington-Klein"
4 | date: "December 1, 2018"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE)
19 | library(tidyverse)
20 | theme_set(theme_gray(base_size = 15))
21 | ```
22 |
23 | ## What's the Point?
24 |
25 | What are we actually trying to DO when we use data?
26 |
27 | Contrary to popular opinion, the point isn't to make pretty graphs or to make a point, or justify something you've done.
28 |
29 | Those may be nice side effects!
30 |
31 | ## Uncovering Truths
32 |
33 | The cleanest way to think about data analysis is to remember that data *comes from somewhere*
34 |
35 | There was some process that *generated that data*
36 |
37 | Our goal, in all data analysis, is to get some idea of *what that process is*
38 |
39 | ## Example
40 |
41 | - Imagine a basic coin flip
42 | - Any time we flip it, there's a 50% chance of heads and a 50% chance of tails
43 | - The TRUE process that generates the data is that there's a coin that's heads half the time and tails half the time
44 | - If we analyze the data correctly, we should report back that the coin is heads half the time
45 | - Let's try calculating the *proportion* of heads
46 |
47 | ## Example
48 |
49 | ```{r, echo = TRUE}
50 | #Generate 500 heads and tails
51 | data <- sample(c("Heads","Tails"),500,replace=TRUE)
52 | #Calculate the proportion of heads
53 | mean(data=="Heads")
54 | ```
55 | ```{r fig.width=5, fig.height=4, echo = FALSE}
56 | df <- data.frame(Result=data)
57 | ggplot(df,aes(x=Result))+geom_bar()+ylab("Count")
58 | ```
59 |
60 | ## Example
61 |
62 | - Let's try out that code in R a few times and see what happens
63 | - First, what do we *want* to happen? What should we see if our data analysis method is good?
64 |
65 | ## How Good Was It?
66 |
67 | - Our data analysis consistently told us that the coin was generating heads about half the time - the true process!
68 | - Our data analysis lets us conclude the coin is fair
69 | - That is describing the true data generating process pretty well!
70 | - Let's think - what other approaches could we have taken? What would the pros and cons be?
71 | - Counting the heads instead of taking the proportion?
72 | - Taking the mean and adding .1?
73 | - Just saying it's 50%?
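
One way to weigh those alternatives is to rerun the coin flip many times and see how each candidate behaves. This is a sketch added for illustration, not part of the original analysis: the names `flip_once`, `n_flips`, and `props` are made up here, the seed is arbitrary, and `replicate()` is used simply to repeat the experiment.

```r
set.seed(305)  # arbitrary seed so the results can be reproduced
n_flips <- 500

# One experiment: flip the coin n_flips times, return the proportion of heads
flip_once <- function() {
  flips <- sample(c("Heads", "Tails"), n_flips, replace = TRUE)
  mean(flips == "Heads")
}

# Repeat the whole experiment 1000 times
props <- replicate(1000, flip_once())

# The proportion centers on the truth (0.5)
mean(props)
# Adding .1 misses the truth by .1 on average
mean(props + .1)
# The count of heads changes if we change the number of flips
mean(props * n_flips)
```

"Just saying it's 50%" happens to be right for this coin, but it never looks at the data, so it would not notice an unfair coin.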
74 |
75 | ## Another Example
76 |
77 | - People have different amounts of money in their wallet, from 0 to 10
78 | - We flip a coin and, if it's heads, give them a dollar
79 | - What's the data generating process here?
80 | - What should our data analysis uncover?
81 |
82 | ## Another Example
83 | ```{r, echo = TRUE}
84 | #Generate 1000 wallets and 1000 heads and tails
85 | data <- data.frame(wallets=sample(0:10,1000,replace=TRUE))
86 | data$coin <- sample(c("Heads","Tails"),1000,replace=TRUE)
87 | #Give a dollar whenever it's a heads, then get average money by coin
88 | data <- data %>% mutate(wallets = wallets + (coin=="Heads"))
89 | data %>% group_by(coin) %>% summarize(wallets = mean(wallets))
90 | ```
91 | ```{r fig.width=5, fig.height=3, echo = FALSE}
92 | aggdat <- data %>% group_by(coin) %>% summarize(wallets = mean(wallets))
93 | ggplot(aggdat,aes(x=coin,y=wallets))+geom_col()+ylab("Average in Wallet")+xlab("Flip")
94 | ```
95 |
96 | ## Conclusions
97 |
98 | - What does our data analysis tell us?
99 | - We *observe* a difference of `r abs(round(aggdat$wallets[2]-aggdat$wallets[1],2))` between the heads and tails
100 | - Because we know that nothing would have caused this difference other than the coin flip, we can conclude that the coin flip *is why* there's a difference of `r abs(round(aggdat$wallets[2]-aggdat$wallets[1],2))`
101 |
102 | ## But What If?
103 |
104 | - So far we've been cheating
105 | - We know exactly what process generated that data
106 | - So really, our data analysis doesn't matter
107 | - But what if we *don't* know that?
108 | - We want to make sure our *method* is good, so that when we draw a conclusion from our data, it's the right one
109 |
110 | ## Example 3
111 |
112 | - We're economists! So no big surprise we might be interested in demand curves
113 | - Demand curve says that as $P$ goes up, $Q_d$ goes down
114 | - Let's gather data on $P$ and $Q_d$
115 | - And determine things like the slope of demand, demand elasticity, etc.
116 | - Just so happens I have some data on 1879-1923 US food export prices and quantities!
117 |
118 | ## Example 3
119 |
120 | ```{r, echo = FALSE}
121 | #Bring in food data
122 | foodPQ <- read.csv(text='Year,FoodPrice,FoodQuantity
123 | 1879,93.4,148.5
124 | 1880,96.1,162.3
125 | 1881,102.6,117.7
126 | 1882,109,81.4
127 | 1883,104,82.7
128 | 1884,91.4,77.2
129 | 1885,85.4,70.9
130 | 1886,80.4,89
131 | 1887,81.6,85.3
132 | 1888,86,56.8
133 | 1889,75.4,85.9
134 | 1890,76.6,96.5
135 | 1891,100.1,119.1
136 | 1892,86.6,140.7
137 | 1893,77.7,107.8
138 | 1894,67.7,97.1
139 | 1895,69.4,92
140 | 1896,63.5,156.6
141 | 1897,71.3,197.7
142 | 1898,76.7,218.2
143 | 1899,74.2,186.5
144 | 1900,74.1,176.6
145 | 1901,77.4,182.7
146 | 1902,82,112.1
147 | 1903,81.4,123.1
148 | 1904,80.3,73.2
149 | 1905,82.5,108.8
150 | 1906,81.7,126.1
151 | 1907,95,118.2
152 | 1908,99.8,98.3
153 | 1909,104.2,65.1
154 | 1910,98.7,55.1
155 | 1911,97.9,68.5
156 | 1912,104.2,79.4
157 | 1913,100,100
158 | 1914,114.5,140
159 | 1915,133.8,200.6
160 | 1916,144.2,168.6
161 | 1917,214.9,134.5
162 | 1918,234.6,132.9
163 | 1919,241.7,156.9
164 | 1920,268.2,192.2
165 | 1921,155.7,252.8
166 | 1922,127.3,210.6
167 | 1923,129.2,116.9')
168 | #Plot
169 | ggplot(foodPQ,aes(x=FoodQuantity,y=FoodPrice))+
170 | #Add year labels
171 | geom_text(aes(label=Year))+
172 | #and axis labels
173 | ylab("Price")+xlab("Quantity")+ggtitle("Food Export Price v. Quantity, US, 1879-1923")
174 | ```
175 |
176 | ## Example 3
177 |
178 | We can calculate the correlation between $P$ and $Q$:
179 |
180 | ```{r, echo = TRUE}
181 | cor(foodPQ$FoodPrice,foodPQ$FoodQuantity)
182 | ```
183 |
184 | ## Example 3 Conclusions
185 |
186 | - We *observe* a POSITIVE correlation of `r round(cor(foodPQ$FoodPrice,foodPQ$FoodQuantity),2)`
187 | - But demand curves shouldn't slope upwards... huh?
188 | - Does demand really slope up? Does this tell us about the process that generated this data? Why or why not?
189 | - Why do we see the data we do? Let's try to think of some reasons.
190 |
191 | ## Getting Difficult
192 |
193 | - We need to be more careful to figure out what's actually going on
194 | - Plus, the more we know about the context and the underlying model, the more likely it is that we won't miss something important
195 |
196 | ## Getting Difficult
197 |
198 | - Our fake examples were easy because we knew perfectly where the data came from
199 | - But how much do you know about food prices in turn-of-the-century US?
200 | - Or for that matter how prices are set in the food industry?
201 | - There's more work to do in uncovering the process that made the data
202 |
203 |
204 | ## For the Record
205 |
206 | - Likely, one of the big problems with our food analysis was that we forgot to account for DEMAND shifting around. When demand shifts and supply stays put, the points we observe trace out what SUPPLY looks like. Supply *does* slope up!
207 | - Just because we loudly announced that we wanted the demand curve doesn't force the data to give it to us!
208 | - Let's imagine the process that might have generated this data by starting with the model itself and seeing how we can *generate* this data
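
A minimal sketch of that idea, assuming a made-up linear model (the numbers 200, 20, and 2 are invented for illustration and have nothing to do with the real food export data). Demand shifts a lot from year to year while supply barely moves, and we record the equilibrium price and quantity each year.

```r
set.seed(305)  # arbitrary seed for reproducibility
n_years <- 45

# Assumed (made-up) model:
#   Demand: Q = 200 - P + demand_shock
#   Supply: Q = P + supply_shock
# Demand moves around a lot; supply barely moves
demand_shock <- rnorm(n_years, mean = 0, sd = 20)
supply_shock <- rnorm(n_years, mean = 0, sd = 2)

# Equilibrium: set demand equal to supply and solve for P, then get Q
P <- (200 + demand_shock - supply_shock) / 2
Q <- P + supply_shock

# Every demand curve here slopes DOWN, yet the correlation is positive
cor(P, Q)
```

In this simulated world we know demand slopes down, and the data still shows a positive relationship, because the equilibrium points are strung out along the supply curve.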
209 |
210 | ## But...
211 |
212 | - If we can figure out what methods work well when we *do* know the right answer
213 | - And apply them when we *don't*...
214 | - We can figure out what those processes are
215 | - And that's the goal!
216 | - This is what we'll be figuring out how to do technically during the programming part of the class, and conceptually during causal inference
217 |
218 |
219 |
220 | ## R
221 |
222 | - We'll need good tools to do it
223 | - We'll be using R
224 | - Let's go through how R can be installed, in preparation for next week
225 | - [R-Project.org](http://www.r-project.org)
226 | - [RStudio.com](http://www.rstudio.com)
227 | - [RStudio.cloud](http://rstudio.cloud)
228 |
229 | ## Finish us out
230 |
231 | - Now that we're all prepared let's get a jump on homework
232 | - Go to New York Times' The Upshot or FiveThirtyEight and find an article that uses data
233 | - See Homework 1
234 | - Together let's start thinking about answers to these questions
235 |
--------------------------------------------------------------------------------
/Lectures/Lecture_03_Introduction_to_R.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 3: Introduction to R and RStudio"
3 | author: "Nick Huntington-Klein"
4 | date: "January 17, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE)
19 | library(tidyverse)
20 | theme_set(theme_gray(base_size = 15))
21 | ```
22 |
23 | ## Welcome to R
24 |
25 | The first half of this class will be dedicated to getting familiar with R.
26 |
27 | R is a language for working with data. RStudio is an environment for working with that language.
28 |
29 | By the time we're done, you should be comfortable manipulating and examining data.
30 |
31 | ## Why R?
32 |
33 | This is going to accomplish a few things for us.
34 |
35 | - As we mentioned last week: Excel/Sheets is a great tool for accountants, not for working with data.
36 | - Learning to program is a highly valuable skill
37 | - R is basically a big cool calculator, and we want to do big cool calculations
38 |
39 | ## A few notes
40 |
41 | - These slides are written in R; you can look up the code I'm using on Titanium if you want to see how I did something
42 | - The assigned videos are there to help
43 | - Other resources on the [Econometrics Help](http://nickchk.com/econometrics.html) list, available on Titanium
44 | - Don't be afraid of programming! It's a language like any other, just a language that requires you to be very precise. Tell the computer what you want!
45 | - If you're already a programmer, you may get a little bored. If you'd like to learn something more advanced, let me know.
46 | - Ask lots of questions!!! (!)
47 |
48 | ## Today
49 |
50 | - Get used to working in RStudio
51 | - Figure out how R conceptualizes working with stuff
52 | - Figure out the Help system <- Important!
53 |
54 | ## The RStudio Panes
55 |
56 | - Console
57 | - Environment Pane
58 | - Browser Pane
59 | - Source Editor
60 | - Li'l tip: Ctrl/Cmd + Shift + number will maximize one of these panes
61 |
62 | ## Console
63 |
64 | - Typically bottom-left
65 | - This is where you can type in code and have it run immediately
66 | - Or, when you run code from the Source Editor, it will show up here
67 | - It will also show any output or errors
68 |
69 | ## Console Example
70 |
71 | - Let's copy/paste some code in there to run
72 |
73 | ```{r, echo = TRUE, eval = FALSE}
74 | #Generate 500 heads and tails
75 | data <- sample(c("Heads","Tails"),500,replace=TRUE)
76 | #Calculate the proportion of heads
77 | mean(data=="Heads")
78 | #This line should give an error - it didn't work!
79 | data <- sample(c("Heads","Tails"),500,replace=BLUE)
80 | #This line should give a warning
81 | #It did SOMETHING but maybe not what you want
82 | mean(data)
83 | #This line won't give an error or a warning
84 | #But it's not what we want!
85 | mean(data=="heads")
86 | ```
87 |
88 | ## What We Get Back
89 |
90 | - We can see the code that we've run
91 | - We can see the output of that code, if any
92 | - We can see any errors or warnings (in red). Remember - errors mean it didn't work. Warnings mean it *maybe* didn't work.
93 | - Just because there's no error or warning doesn't mean it DID work! Always think carefully
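
The silent kind of mistake is the hardest to catch, so here is a small sketch of one way to check for it. The object names `flips`, `wrong`, and `right` are made up for this example, and looking at `unique()` is just one possible sanity check.

```r
set.seed(1)  # arbitrary seed for reproducibility
flips <- sample(c("Heads", "Tails"), 500, replace = TRUE)

# No error, no warning, but R is case-sensitive so nothing matches
wrong <- mean(flips == "heads")
wrong

# Look at the values that are actually in the data
unique(flips)

# Now match the spelling exactly
right <- mean(flips == "Heads")
right
```

A result of exactly 0 for a fair coin is a hint that the code ran fine but didn't do what we meant.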
94 |
95 | ## Environment Pane
96 |
97 | - The output of what we've done can also be seen in the *Environment pane*
98 | - This is on the top-right
99 | - Two important tabs: Environment and History
100 | - Mostly, Environment
101 |
102 | ## History Tab
103 |
104 | - History shows us the commands that we've run
105 | - To save yourself some typing, you can re-run commands by double-clicking them or hitting Enter
106 | - Send to console with double-click/enter, send to source pane with Shift+double-click/Enter
107 | - Or use "To Console" or "To Source" buttons
108 |
109 | ## Environment Tab
110 |
111 | - Environment tab shows us all the objects we have in memory
112 | - For example, we created the `data` object, so we can see that in Environment
113 | - It shows us lots of handy information about that object too
114 | - (we'll get to that later)
115 | - You can erase everything with that little broom button (technically this does `rm(list=ls())`)
116 |
117 | ## Browser Pane
118 |
119 | - Bottom-right
120 | - Lots of handy stuff here!
121 | - Mostly, the *outcome* of what you do will be seen here
122 |
123 | ## Files Tab
124 |
125 | - Basic file browser
126 | - Handy for opening up files
127 | - Can also help you set the working directory:
128 | - Go to folder
129 | - In menu bar, Session
130 | - Set Working Directory
131 | - To Files Pane Location
132 |
133 | ## Plots and Viewer tabs
134 |
135 | - When you create something that must be viewed, like a plot, it will show up here.
136 |
137 | ```{r, echo=TRUE, eval=FALSE}
138 | data(LifeCycleSavings)
139 | plot(density(LifeCycleSavings$pop75),main='Percent of Population over 75')
140 | ```
141 |
142 | ```{r, echo=FALSE, eval=FALSE}
143 | ## THE GGPLOT2 WAY
144 | data(LifeCycleSavings)
145 | ggplot(LifeCycleSavings,aes(x=pop75))+stat_density(geom='line')+
146 | ggtitle('Percent of Population over 75')
147 | ```
148 |
149 | - Note the "Export" button here - this is an easy way to save plots you've created
150 |
151 | ## Packages Tab
152 |
153 | - This is one way to install new packages and load them in
154 | - We'll be talking more about packages later
155 | - I generally avoid this tab; better to do this via code
156 | - Why? Replicability! A **VERY** important reason to use code and not the GUI or, say, Excel
157 | - You always want to make sure your future self (or someone else) knows how to use your code
158 | - One thing I do use this for is checking for package updates, note handy "Update" button
159 |
160 | ## Help Tab
161 |
162 | - This is where help files appear when you ask for them
163 | - You can use the search bar here, or `help()` in the console
164 | - We'll be going over this today a bit later
165 |
166 | ## Source Pane
167 |
168 | - You should be working with code FROM THIS PANE, not the console!
169 | - Why? Replicability!
170 | - Also, COMMENTS! USE THEM! PLEASE! `#` lets you write a comment.
171 | - Switch between tabs like a browser
172 | - Set working directory:
173 | - Select a source file tab
174 | - In menu bar: Session
175 | - Set Working Directory
176 | - To Source File Location
177 |
178 | ## Running Code from the Source Pane
179 |
180 | - Select a chunk of code and hit the "Run" button
181 | - Click on a line of code and do Ctrl/Cmd-Enter to run just that line and advance to the next <- Super handy!
182 | - Going one line at a time lets you check for errors more easily
183 | - Let's try some!
184 |
185 | ```{r, eval=FALSE, echo=TRUE}
186 | data(mtcars)
187 | mean(mtcars$mpg)
188 | mean(mtcars$wt)
189 | 372+565
190 | log(exp(1))
191 | 2^9
192 | (1+1)^9
193 | ```
194 |
195 | ## Autocomplete
196 |
197 | - RStudio comes with autocomplete!
198 | - Typing in the Source Pane or the Console, it will try to fill in things for you
199 | - Command names (shows the syntax of the function too!)
200 | - Object names from your environment
201 | - Variable names in your data
202 | - Let's try redoing the code we just did, typing it out
203 |
204 | ## Help
205 |
206 | - Autocomplete is one way that RStudio tries to help you out
207 | - The way that R helps you out is with the documentation
208 | - When you start doing anything serious with a computer, like programming, the most important skills are:
209 | - Knowing to read documentation
210 | - Knowing to search the internet for help
211 |
212 | ## help()
213 |
214 | - You can get the documentation on most R objects using the `help()` function
215 | - `help(mean)`, for example, will show you:
216 | - What the function is
217 | - The "syntax" for the function
218 | - The available options for the function
219 | - Other, related functions, like `weighted.mean`
220 | - Ideally, some examples of proper use
221 | - Not just for functions/commands - some data sets will work too! Try `help(mtcars)`
222 |
223 | ## Searching the Internet
224 |
225 | - Even professional programmers will tell you - they spend a huge amount of time just looking up how to do stuff online
226 | - Why not?
227 | - Granted, this won't come in handy for a midterm, but for actually doing work it's essential
228 |
229 | ## Searching the Internet
230 |
231 | - Just Google (or whatever) what you need! There will usually be a resource.
232 | - R-bloggers, Quick-R, StackOverflow
233 | - Or just Google. Try including "R" and the name of what you want in the search
234 | - If "R" isn't turning up anything, try Rstats
235 | - Ask on Twitter with `#rstats`
236 | - Ask on StackExchange or StackOverflow <- tread lightly!
237 |
238 | ## Example
239 |
240 | - We've picked up some data sets to use like LifeCycleSavings and mtcars with the data() function
241 | - What data sets can we get this way?
242 | - Let's try `help(data)`
243 | - Let's try searching the internet
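
As a starting point for that question, here is a hedged sketch that asks R directly. It assumes we only care about the built-in `datasets` package; the object name `available` is made up, and `%in%` (covered later in the course) checks whether a value appears in a list of values.

```r
# List the data sets that ship with the built-in 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Are the two data sets we've used so far on that list?
c("mtcars", "LifeCycleSavings") %in% available
```

Running `data()` with nothing in the parentheses opens a similar list for every package currently loaded.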
244 |
245 | ## That's it!
246 |
247 | See you next time!
--------------------------------------------------------------------------------
/Lectures/Lecture_04_Objects_and_Functions_in_R.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 4: Objects and Functions in R"
3 | author: "Nick Huntington-Klein"
4 | date: "January 17, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE)
19 | library(tidyverse)
20 | theme_set(theme_gray(base_size = 15))
21 | ```
22 |
23 | ## Working in R
24 |
25 | - Today we'll be starting to actually DO stuff in R
26 | - R is all about:
27 | - Creating objects
28 | - Looking at objects
29 | - Manipulating objects
30 | - That's, uh, it. That's all. That's what R does.
31 |
32 | ## Creating a Basic Object
33 |
34 | - Let's create an object. We're going to do this with the assignment operator `<-` (a.k.a. "gets")
35 |
36 | ```{r, echo=TRUE}
37 | a <- 4
38 | ```
39 | - This creates an object called `a`. What is that object? It's a number. Specifically, it's the number 4. We know that because we took that 4 and we shoved it into `a`.
40 | - Why store it as an object rather than just saying 4? Oh, plenty of reasons.
41 | - We can do more complex calculations before storing it, too.
42 | ```{r, echo=TRUE}
43 | b <- sqrt(16)+10
44 | ```
45 |
46 |
47 | ## Looking at Objects
48 |
49 | - We can see this object that we've created in the Environment pane
50 | - Putting that object on a line by itself in R will show us what it is.
51 |
52 | ```{r, echo=TRUE}
53 | a
54 | b
55 | ```
56 |
57 | ## Looking at Objects
58 |
59 | - We can even create an object and look at it in the console without storing it.
60 | - Let's think about what is really happening in these lines
61 | ```{r, echo=TRUE}
62 | 3
63 | a+b
64 | ```
65 |
66 | ## Looking at Objects Differently
67 |
68 | - We can run objects through **functions** to look at them in different ways
69 |
70 | ```{r, echo=TRUE}
71 | #What does a look like if we take the square root of it?
72 | sqrt(a)
73 | #What does it look like if we add 1 to it?
74 | a + 1
75 | #If we look at it, do we see a number?
76 | is.numeric(a)
77 | ```
78 |
79 | ## Manipulating Objects
80 |
81 | - We manipulated the object and LOOKED at it, we can also SAVE it, or UPDATE it.
82 | ```{r, echo=TRUE}
83 | #We looked at what a looked like with 1 added, but a itself wasn't changed
84 | a
85 | #Let's save a+1 as something else
86 | b <- a + 1
87 | #And let's overwrite a with its square root
88 | a <- sqrt(a)
89 | a
90 | b
91 | ```
92 |
93 | ## Some Notes
94 |
95 | - Even though we changed `a`, `b` was already set using the old value of `a`, and so was still `4+1=5`, not `2+1=3`.
96 | - `a` basically got *reassigned* with `<-`. That's how we got it to be `2`
97 | - This is a very simple example, but basically *everything* in R is just this process but with more complex objects and more complex functions!
98 |
99 | ## Types of Objects
100 |
101 | - We already determined that `a` was a number. But what else could it be? What other kinds of variables are there?
102 | - Some basic object types:
103 | - Numeric: A single number
104 | - Character: A string of letters, like `'hello'`
105 | - Logical: `TRUE` or `FALSE` (or `T` or `F`)
106 | - Factor: A category, like `'left handed', 'right handed', or 'ambidextrous'`
107 | - Vector: A collection of objects of the same type
108 |
109 | ## Characters
110 |
111 | - A character object is a piece of text, held in quotes like `''` or `""`
112 | - For example, maybe you have some data on people's addresses.
113 |
114 | ```{r, echo=TRUE}
115 | address <- '321 Fake St.'
116 | address
117 | is.character(address)
118 | ```
119 |
120 | ## Logical
121 |
122 | - Logicals are binary - TRUE or FALSE. They're extremely important because lots of data is binary.
123 |
124 | ```{r, echo=TRUE}
125 | c <- TRUE
126 | is.logical(c)
127 | is.character(a)
128 | is.logical(is.numeric(a))
129 | ```
130 |
131 | ## Logical
132 |
133 | - Logicals are used a lot in programming too, because you can evaluate whether conditions hold.
134 |
135 | ```{r, echo=TRUE}
136 | a > 100
137 | a > 100 | b == 5
138 | ```
139 |
140 | - `&` is AND, `|` is OR, and to check equality use `==`, not `=`. `>=` is greater than OR equal to, similarly for `<=`
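
To see each of those operators on its own, here is a short sketch. It sets `a` and `b` to the values they have at this point in the slides (2 and 5) so that it can be run by itself.

```r
a <- 2
b <- 5

# AND: both sides must be TRUE
a > 100 & b == 5
# OR: at least one side must be TRUE
a > 100 | b == 5
# Greater than OR equal to
a >= 2
# Less than OR equal to
b <= 4
```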
141 |
142 | ## Logical
143 |
144 | - They are also equivalent to `TRUE=1` and `FALSE=0` which comes in handy.
145 | - We can use `as` functions to change one object type to another (although for `T=1, F=0` it does it automatically)
146 |
147 | ```{r, echo=TRUE}
148 | as.numeric(FALSE)
149 | TRUE + 3
150 | ```
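
Here is a small sketch of why that comes in handy. The `passed` values are hypothetical, and `c()` (covered on the Vectors slides) just puts several values together.

```r
# Hypothetical data: did each of four students pass?
passed <- c(TRUE, FALSE, TRUE, TRUE)

# Adding up TRUEs counts them
sum(passed)   # 3
# Averaging TRUEs gives the proportion
mean(passed)  # 0.75
```

This is the same trick behind `mean(data=="Heads")` in the coin flip example.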
151 |
152 | ## Logicals Test
153 |
154 | - Let's stop and try to think about what these lines might do:
155 | ```{r, echo=TRUE, eval=FALSE}
156 | is.logical(is.numeric(FALSE))
157 | is.numeric(2) + is.character('hello')
158 | is.numeric(2) & is.character(3)
159 | TRUE | FALSE
160 | TRUE & FALSE
161 | ```
162 |
163 | ## Factors
164 |
165 | - Factors are categorical variables - mutually exclusive groups
166 | - They look like strings, but they act more like logicals with more than two levels (and with labels other than T and F)
167 | - Factors have *levels* showing the possible categories you can be in
168 | ```{r, echo=TRUE}
169 | e <- as.factor('left-handed')
170 | levels(e) <- c('left-handed','right-handed','ambidextrous')
171 | e
172 | ```
173 |
174 | ## Vectors
175 |
176 | - Data is basically a bunch of variables all put together
177 | - So unsurprisingly, a lot of R works with vectors, which are a bunch of objects all put together!
178 | - Use `c()` (concatenate) to put a bunch of objects of the same type together in a vector
179 | - Use square brackets to pick out parts of the vector
180 |
181 | ```{r, echo=TRUE}
182 | d <- c(5,6,7,8)
183 | c(is.numeric(d),is.vector(d))
184 | d[2]
185 | ```
186 |
187 | ## Vectors
188 |
189 | - Unsurprisingly, lots of statistical functions look at multiple objects (what is statistics but making sense of lots of different measurements of the same thing?)
190 |
191 | ```{r, echo=TRUE}
192 | mean(d)
193 | c(sum(d),sd(d),prod(d))
194 | ```
195 |
196 | ## Vectors
197 |
198 | - We can perform the same operation on all parts of the vector at once!
199 |
200 | ```{r, echo=TRUE}
201 | d + 1
202 | d + d
203 | d > 6
204 | ```
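
- A minimal sketch combining this with `TRUE=1`: summing a logical vector counts how many elements pass the test, and its mean gives the share that pass. `d` is redefined here only so the chunk stands alone.

```{r, echo=TRUE}
d <- c(5,6,7,8)
above6 <- d > 6     # FALSE FALSE TRUE TRUE
sum(above6)         # how many elements are above 6: 2
mean(above6)        # what share of elements are above 6: 0.5
```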
205 |
206 | ## Vectors
207 |
208 | - Factors make a lot more sense as a vector
209 |
210 | ```{r, echo=TRUE}
211 | continents <- as.factor(c('Asia','Asia','Asia',
212 | 'N America','Europe','Africa','Africa'))
213 | table(continents)
214 | continents[4]
215 | ```
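
- A sketch of declaring the possible categories up front, using the `levels` argument of `factor()`. The `hand` vector is made up for illustration; note that a level nobody falls into still shows up in `table()` with a count of zero.

```{r, echo=TRUE}
# Hypothetical data: nobody in this sample is ambidextrous
hand <- factor(c('left-handed','right-handed','right-handed'),
               levels = c('left-handed','right-handed','ambidextrous'))
levels(hand)
hand_counts <- table(hand)
hand_counts
```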
216 |
217 | ## Value Matching
218 |
219 | - It's easy to create logicals seeing if a value matches ANY value in a vector with `%in%`
220 |
221 | ```{r, echo=TRUE}
222 | 3 %in% c(3,4)
223 | c('Nick','James') %in% c('James','Andy','Sarah')
224 | ```
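
- Since `%in%` gives back a logical vector, it can plausibly be combined with the square brackets and `sum()` from earlier. A small sketch, with made-up names:

```{r, echo=TRUE}
runners <- c('Nick','James','Andy')     # hypothetical values
finished <- c('James','Andy','Sarah')   # hypothetical values
matched <- runners %in% finished        # FALSE TRUE TRUE
runners[matched]                        # keep only the runners who finished
sum(matched)                            # count them: 2
```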
225 |
226 | ## Basic Vectors
227 |
228 | - Generating basic vectors:
229 |
230 | ```{r, echo=TRUE, eval=FALSE}
231 | 1:8
232 | rep(4,3)
233 | rep(c('a','b'),4)
234 | numeric(5)
235 | character(6)
236 | sample(1:20,3)
237 | sample(c("Heads","Tails"),6,replace=TRUE)
238 | ```
239 | ```{r, echo=FALSE, eval=TRUE}
240 | 1:8
241 | rep(4,3)
242 | rep(c('a','b'),4)
243 | numeric(5)
244 | character(6)
245 | sample(1:20,3)
246 | sample(c("Heads","Tails"),6,replace=TRUE)
247 | ```
248 |
249 | ## Vector Test
250 | - If we do `f <- c(2,3,4,5)`, then what will the output of these be?
251 |
252 | ```{r, echo=TRUE, eval = FALSE}
253 | f^2
254 | f + c(1,2,3,4)
255 | c(f,6)
256 | is.numeric(f)
257 | mean(f >= 4)
258 | f*c(1,2,3)
259 | length(f)
260 | length(rep(1:4,3))
261 | f/2 == 2 | f < 3
262 | as.character(f)
263 | f[1]+f[4]
264 | c(f,f,f,f)
265 | f[f[1]]
266 | f[c(1,3)]
267 | f %in% (1:4*2)
268 | ```
269 |
270 | ## And Now You
271 |
272 | - Create a factor that randomly samples six `'Male'` or `'Female'` people.
273 | - Add up all the numbers from 18 to 763, then get the mean
274 | - What happens if you make a list with a logical, a numeric, AND a string in it?
275 | - Figure out how to use `paste0()` to turn `c('a','b')` into `'ab'`
276 | - Use `[]` to turn `h <- c(10,9,8,7)` into `c(7,8,10,9)` and call it `j`
277 | - (Several ways) Create a vector with eleven 0's, then a 5.
278 | - (Tough!) Use `floor()` or `%%` to count how many multiples of 4 there are between 433 and 899
--------------------------------------------------------------------------------
/Lectures/Lecture_05_Spreadsheet.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_05_Spreadsheet.PNG
--------------------------------------------------------------------------------
/Lectures/Lecture_05_Working_with_Data_Part_1.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 5: Working with Data Part 1"
3 | author: "Nick Huntington-Klein"
4 | date: "January 18, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE)
19 | library(tidyverse)
20 | theme_set(theme_gray(base_size = 15))
21 | options(warn=-1)
22 | ```
23 |
24 | ## Working with Data
25 |
26 | - R is all about working with data!
27 | - Today we're going to start going over the use of data.frames and tibbles
28 | - data.frames are an object type; tibbles are basically data.frames with some extra bells and whistles, from the *tidyverse* package
29 | - Most of the time, you'll be doing calculations using them
30 |
31 | ## The Basic Idea
32 |
33 | - Conceptually, data.frames are basically spreadsheets
34 | - Technically, they're a list of vectors
35 |
36 | |Spreadsheet | data.frame |
37 | |------------|-------------|
38 | | ||
39 |
40 |
41 | ## Example
42 |
43 | - It's a list of vectors... we can make one by listing some (same-length) vectors!
44 | - (Note the use of = here, not <-)
45 |
46 |
47 | ```{r, echo=TRUE}
48 | df <- data.frame(RacePosition = 1:5,
49 | WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
50 | NumberofKids = c(3,5,1,0,2))
51 | df <- tibble(RacePosition = 1:5,
52 | WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')),
53 | NumberofKids = c(3,5,1,0,2))
54 | df
55 | ```
56 |
57 | ## Looking Over Data
58 |
59 | - Now that we have our data, how can we take a look at it?
60 | - We can just name it in the Console and look at the whole thing, but that's usually too much data
61 | - We can look at the whole thing by clicking on it in Environment to open it up
62 |
63 | ## Glancing at Data
64 |
65 | - What if we just want a quick overview, rather than looking at the whole spreadsheet?
66 | - Down-arrow in the Environment tab
67 | - `head()` (look at the head of the data - first six rows)
68 | - `str()` (structure)
69 |
70 | ```{r, echo=TRUE}
71 | str(df)
72 | ```
73 |
74 | ## So What?
75 |
76 | - What do we want to know about our data?
77 | - What is this data OF? (won't get that with `str()`)
78 | - Data types
79 | - The kinds of values it takes
80 | - How many observations
81 | - Variable names.
82 | - Summary statistics and observation level (we'll get to that later)
83 |
84 | ## Getting at Data
85 |
86 | - Now we have a data frame, `df`. How do we use it?
87 | - One way is that we can pull those vectors back out with `$`! Note autocompletion of variable names.
88 | - We can treat it just like the vectors we had before
89 |
90 | ```{r, echo=TRUE}
91 | df$NumberofKids
92 | df$NumberofKids[2]
93 | df$NumberofKids >= 3
94 | ```
95 |
96 | ## Quick Note
97 |
98 | - There are actually many many ways to do this
99 | - (some of which I even go over in the videos)
100 | - For example, you can use `[row,column]` to get at data: `df$NumberofKids >= 3` makes the same comparison as `df[,3] >= 3` or `df[,'NumberofKids']>=3`
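
- A minimal sketch of the `[row,column]` approach. It uses a plain `data.frame` called `kids_df` (a made-up name) rather than the tibble `df`, since tibbles return a one-column tibble instead of a vector from `df[,3]`.

```{r, echo=TRUE}
kids_df <- data.frame(RacePosition = 1:5,
                      NumberofKids = c(3,5,1,0,2))
kids_df[2, 'NumberofKids']              # row 2 of one column: 5
many <- kids_df[kids_df$NumberofKids >= 3, ]   # a logical in the row slot keeps matching rows
many
kids_df[['NumberofKids']]               # the whole column as a vector, same as $
```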
101 |
102 | ## That Said!
103 |
104 | - We can run the same calculations on these vectors as we were doing before
105 |
106 | ```{r, echo=TRUE}
107 | mean(df$RacePosition)
108 | df$WayTheySayHi[4]
109 | sum(df$NumberofKids <= 1)
110 | ```
111 |
112 | ## Practice
113 |
114 | - Create `df2 <- data.frame(a = 1:20, b = 0:19*2,` `c = sample(101:200,20,replace=TRUE))`
115 | - What is the average of `c`?
116 | - What is the sum of `a` times `b`?
117 | - Did you get any values of `c` 103 or below? (make a logical)
118 | - What is on the 8th row of `b`?
119 | - How many rows have `b` above 10 AND `c` below 150?
120 |
121 | ## Practice Answers
122 |
123 | ```{r, echo = FALSE, eval = TRUE}
124 | df2 <- data.frame(a = 1:20, b = 0:19*2, c = sample(101:200,20,replace=TRUE))
125 | ```
126 | ```{r, echo=TRUE, eval=FALSE}
127 | mean(df2$c)
128 | sum(df2$a*df2$b)
129 | sum(df2$c <= 103) > 0
130 | df2$b[8]
131 | sum(df2$b > 10 & df2$c < 150)
132 | ```
133 |
134 | ```{r, echo=FALSE, eval=TRUE}
135 | mean(df2$c)
136 | sum(df2$a*df2$b)
137 | sum(df2$c <= 103) > 0
138 | df2$b[8]
139 | sum(df2$b > 10 & df2$c < 150)
140 | ```
141 |
142 | ## The Importance of Rows
143 |
144 | - So far we've basically just taken data frames and pulled the vectors (columns) back out
145 | - So... why not just stick with the vectors?
146 | - Because before long we're not just going to be interested in the columns one at a time
147 | - We'll want to keep track of each *row* - each row is an observation. The same observation!
148 |
149 | ## The Importance of Rows
150 |
151 | - Going back to `df`, that fourth row says that
152 | - The person in the fourth position...
153 | - Says hello by saying "Yo"
154 | - And has no kids
155 | - We're going to want to keep that straight when we want to, say, look at the relationship between having kids and your position in the race.
156 | - Or how the number of kids relates to how you say hello!
157 |
158 | ```{r, echo=FALSE, eval=TRUE}
159 | df
160 | ```
161 |
162 | ## Working With Data Frames
163 |
164 | - Not to mention, we can manipulate data frames and tibbles!
165 | - Let's figure out how we can:
166 | - Create new variables
167 | - Change variables
168 | - Rename variables
169 | - It's very common that you'll have to work with data a little before analyzing it
170 |
171 | ## Creating New Variables
172 |
173 | - Easy! data.frames are just lists of vectors
174 | - So create a vector and tell R where in that list to stick it!
175 | - Use descriptive names so you know what the variable is
176 |
177 | ```{r, echo=TRUE}
178 | df$State <- c('Alaska','California','California','Maine','Florida')
179 | df
180 | ```
181 |
182 | ## Our Approach - DPLYR and Tidyverse
183 |
184 | - That's the base-R way to do it, anyway
185 | - We're going to be using *dplyr* (think pliers) for data manipulation instead
186 | - dplyr syntax is inspired by SQL - so learning dplyr will give you a leg up if you want to learn SQL later. Plus it's just better.
187 |
188 | ## Packages
189 |
190 | - tidyverse isn't a part of base R. It's in a package, so we'll need to install it
191 | - We can install packages using `install.packages('nameofpackage')`
192 |
193 | ```{r, echo=TRUE, eval=FALSE}
194 | install.packages('tidyverse')
195 | ```
196 |
197 | - We can then check whether it's installed in the Packages tab
198 |
199 | ## Packages
200 |
201 | - Before we can use it we must then use the `library()` command to open it up
202 | - We'll need to run `library()` for it again every time we open up R if we want to use the package
203 |
204 | ```{r, echo=TRUE, eval=FALSE}
205 | library(tidyverse)
206 | ```
207 |
208 | - There are literally thousands of useful packages for R, and we're going to be using a few! Tidyverse will just be our first of many
209 | - Google R package X to look for packages that do X.
210 |
211 |
212 | ## Variable creation with dplyr
213 |
214 | - The *mutate* command will "mutate" our data frame to have a new column in it. We can then overwrite it.
215 | - The pipe `%>%` says "take df and send it to that mutate command to use"
216 | - Or we can stick the data frame itself in the `mutate` command
217 |
218 | ```{r, echo=TRUE, eval=FALSE}
219 | library(tidyverse)
220 | df <- df %>%
221 | mutate(State = c('Alaska','California','California','Maine','Florida'))
222 | df <- mutate(df,State = c('Alaska','California','California','Maine','Florida'))
223 | ```
224 |
225 | ```{r, echo=FALSE, eval=TRUE}
226 | df <- df %>%
227 | mutate(State = c('Alaska','California','California','Maine','Florida'))
228 | ```
229 |
230 | ## Creating New Variables
231 |
232 | - We can use all the tricks we already know about creating vectors
233 | - We can create multiple new variables in one mutate command
234 |
235 | ```{r, echo=TRUE}
236 | df <- df %>% mutate(MoreThanTwoKids = NumberofKids > 2,
237 | One = 1,
238 | KidsPlusPosition = NumberofKids + RacePosition)
239 | df
240 | ```
241 |
242 | ## Manipulating Variables
243 |
244 | - We can't really *change* variables, but we sure can overwrite them!
245 | - We can drop variables with `-` in the dplyr `select` command
246 | - Note we chain multiple dplyr commands with `%>%`
247 |
248 | ```{r, echo=TRUE}
249 | df <- df %>%
250 | select(-KidsPlusPosition,-WayTheySayHi,-One) %>%
251 | mutate(State = as.factor(State),
252 | RacePosition = RacePosition - 1)
253 | df$State[3] <- 'Alaska'
254 | str(df)
255 | ```
256 |
257 | ## Renaming Variables
258 |
259 | - Sometimes it will make sense to change the names of the variables we have.
260 | - Names are stored in `names(df)` which we can edit directly
261 | - Or the `rename()` command in dplyr has us covered
262 |
263 | ```{r, echo=TRUE}
264 | names(df)
265 | #names(df) <- c('Pos','Num.Kids','State','mt2Kids')
266 | df <- df %>% rename(Pos = RacePosition, Num.Kids=NumberofKids,
267 | mt2Kids = MoreThanTwoKids)
268 | names(df)
269 | ```
270 |
271 |
272 | ## tidylog
273 |
274 | - Protip: after loading the tidyverse, also load the `tidylog` package. This will tell you what each step of your dplyr command does!
275 |
276 | ```{r, echo=TRUE, eval=FALSE}
277 | library(tidyverse)
278 | library(tidylog)
279 | df <- df %>% mutate(Pos = Pos + 1,
280 | Num.Kids = 10)
281 | ```
282 | ```{r, echo=FALSE, eval=TRUE}
283 | library(tidylog, warn.conflicts=FALSE, quietly=TRUE)
284 | df <- df %>% mutate(Pos = Pos + 1,
285 | Num.Kids = 10)
286 | detach("package:tidylog", unload=TRUE)
287 | ```
288 |
289 | ## Practice
290 |
291 | - Create a data set `data` with three variables: `a` is all even numbers from 2 to 20, `b` is `c(0,1)` over and over, and `c` is any ten-element numeric vector of your choice.
292 | - Rename them to `EvenNumbers`, `Treatment`, `Outcome`.
293 | - Add a logical variable called Big that's true whenever EvenNumbers is greater than 15
294 | - Increase Outcome by 1 for all the rows where Treatment is 1.
295 | - Create a logical AboveMean that is true whenever Outcome is above the mean of Outcome.
296 | - Display the data structure
297 |
298 | ## Practice Answers
299 |
300 | ```{r, echo=TRUE, eval=FALSE}
301 | data <- data.frame(a = 1:10*2,
302 | b = c(0,1),
303 | c = sample(1:100,10,replace=FALSE)) %>%
304 | rename(EvenNumbers = a, Treatment = b, Outcome = c)
305 |
306 | data <- data %>%
307 | mutate(Big = EvenNumbers > 15,
308 | Outcome = Outcome + Treatment,
309 | AboveMean = Outcome > mean(Outcome))
310 | str(data)
311 | ```
312 |
313 | ## Other Ways to Get Data
314 |
315 | - Of course, most of the time we aren't making up data
316 | - We get it from the real world!
317 | - Two main ways to do this are the `data()` function in R
318 | - Or reading in files, usually with one of the `read` commands like `read.csv()`
319 |
320 | ## data()
321 |
322 | - R has many baked-in data sets, and more in packages!
323 | - Just type in `data(` and see what options it autocompletes
324 | - We can load in data and look at it
325 | - Many of these data sets have `help` files too
326 |
327 | ```{r, echo=TRUE, eval=FALSE}
328 | data(LifeCycleSavings)
329 | help(LifeCycleSavings)
330 | head(LifeCycleSavings)
331 | ```
332 |
333 | ```{r, echo=FALSE,eval=TRUE}
334 | data(LifeCycleSavings)
335 | head(LifeCycleSavings)
336 | ```
337 |
338 |
339 | ## read
340 |
341 | - Often there will be data files on the internet or your computer
342 | - You can read this in with one of the many `read` commands, like `read.csv`
343 | - CSV is a very basic spreadsheet format stored in a text file. You can create one from Excel or Sheets (or just write it by hand)
344 | - There are different `read` commands for different file types
345 | - Make sure your working directory is set to where the data is!
346 | - Documentation will usually be in a different file
347 |
348 | ```{r, echo=TRUE, eval=FALSE}
349 | datafromCSV <- read.csv('mydatafile.csv')
350 | ```
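
- Since `mydatafile.csv` is just a placeholder, here is a self-contained sketch that writes a tiny made-up data set to a temporary CSV file and reads it back. The `id` and `score` columns are invented for illustration.

```{r, echo=TRUE}
csv_path <- tempfile(fileext = '.csv')            # a throwaway file location
write.csv(data.frame(id = 1:3, score = c(90, 85, 77)),
          csv_path, row.names = FALSE)            # hypothetical data
datafromCSV <- read.csv(csv_path)
str(datafromCSV)
mean(datafromCSV$score)
```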
351 |
352 | ## Practice
353 |
354 | - Use `data()` to open up a data set - any data set (although it should be in `data.frame` or `tibble` form - try again if you get something else)
355 | - Use `str()` and `help()` to examine that data set
356 | - What is it data of (help file)? How was it collected and what do the variables represent?
357 | - What kinds of variables are in there and what kinds of values do they have (`str()` and `head()`)?
358 | - Create a new variable using the variables that are already in there
359 | - Take a mean of one of the variables
360 | - Rename a variable to be more descriptive based on what you saw in `help()`.
361 |
--------------------------------------------------------------------------------
/Lectures/Lecture_05_data_frame.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_05_data_frame.PNG
--------------------------------------------------------------------------------
/Lectures/Lecture_06_Working_with_Data_Part_2.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 6: Working with Data Part 2"
3 | author: "Nick Huntington-Klein"
4 | date: "January 23, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE)
19 | library(tidyverse)
20 | theme_set(theme_gray(base_size = 15))
21 | ```
22 |
23 | ## Recap
24 |
25 | - We can get `data.frame`s by making them with `data.frame()`, or reading in data with `data()` or `read.csv`
26 | - `data.frame`s are a list of vectors - we know vectors!
27 | - We can pull the vectors back out with `$`
28 | - We can assign new variables, or update them, using `$` as well
29 |
30 | ## Today
31 |
32 | - We are going to continue working with `data.frame`s/`tibble`s
33 | - And we're going to introduce an important aspect of data analysis: *splitting the data*
34 | - In other words, selecting only *part* of the data that we have
35 | - In other words, to *subset* the data
36 |
37 | ## Why?
38 |
39 | - Why would we want to do this?
40 | - Many statistical questions require us to!
41 | - We might be interested in how a variable *differs* for two different groups
42 | - Or how one variable *is related* to another (i.e. how A looks for different values of B)
43 | - Or how those relationships differ for different groups
44 |
45 | ## Example
46 |
47 | - Let's read in some data on male heights, from Our World in Data, and look at it
48 | - Always look at the data before you use it!
49 | - It has height in CM, let's change that to feet
50 |
51 | ```{r, echo=TRUE}
52 | df <- read.csv('http://www.nickchk.com/average-height-men-OWID.csv')
53 | str(df)
54 | df <- df %>% mutate(Heightft = Heightcm/30.48)
55 | ```
56 |
57 | ## Example
58 |
59 | - If we look at height overall we will see that mean height is `r mean(df$Heightcm)`
60 | - But if we look at the data we can see that some countries aren't present every year. So this isn't exactly representative
61 |
62 | ```{r, echo=TRUE}
63 | table(df$Year)
64 | ```
65 |
66 | - So let's just pick some countries that we DO have the full range of years on: the UK, Germany, France, the Congo, Gabon, and Nigeria!
67 |
68 | ## Example
69 |
70 | - If we limit the data just to those six countries, we can see that the average height in these six, which covers 1810s-1980s evenly, is `r mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'))$Heightcm)`
71 | - What if we want to compare the countries to each other? We need to split off each country by itself (let's convert to feet, too).
72 |
73 | ```{r, echo=FALSE, eval=TRUE}
74 | dfsub <- df %>% filter(Code %in% c('GBR','DEU','FRA','GAB','COD','NGA')) %>%
75 | mutate(Entity = as.character(Entity)) %>%
76 | mutate(Entity = ifelse(Code=='COD','Congo',
77 | ifelse(Code=='GBR','UK',Entity)))
78 | dfsub %>% group_by(Entity) %>%
79 | summarize(Heightft = mean(Heightft))
80 | ```
81 |
82 | ## Example
83 |
84 | - What questions does this answer?
85 | - What is the average height of men over this time period?
86 | - How does average height differ across countries?
87 | - What can't we answer yet?
88 | - How has height changed over time?
89 | - What causes these height differences [later!]
90 |
91 | ## Example
92 |
93 | - If we want to know how height changed over time, we need to evaluate each year separately too.
94 |
95 | ```{r, echo=FALSE, eval=TRUE, fig.width=8, fig.height=5}
96 | ggplot(filter(dfsub,Code=='GBR'),aes(y=Heightft,x=Year))+
97 | geom_line()+geom_point()+
98 | ylab("Height (Feet)")+
99 | ggtitle("Male Height over time in United Kingdom")
100 | ```
101 |
102 | ## Example
103 |
104 | - To compare the changes over time ACROSS countries, we evaluate separately by each year AND each country
105 |
106 | ```{r, echo=FALSE, eval=TRUE, fig.width=8, fig.height=5}
107 | ggplot(dfsub,aes(y=Heightft,x=Year,group=Entity,color=Entity))+
108 | geom_line()+geom_point()+
109 | ggtitle("Male Height over time")+
110 | ylab("Height (Feet)")+
111 | scale_colour_discrete(guide = 'none')+
112 | geom_text(data=filter(dfsub,Year==1980),aes(label=Entity,color=Entity,x=Year,y=Heightft),hjust=-.1)+
113 | scale_x_continuous(limits=c(min(dfsub$Year),max(dfsub$Year)+15))
114 | ```
115 |
116 | ## Example
117 |
118 | - You can see how subsetting can tell a much more complete story than looking at the aggregated data!
119 | - So how can we do this subsetting?
120 | - There are plenty of ways
121 | - For today we're going to focus on the `filter()` and `select()` commands.
122 | - We'll also learn to write a `for` loop.
123 |
124 | ## Subset
125 |
126 | - The `filter()` and `select()` commands will allow you to pick just certain parts of your data
127 | - You can select certain *rows*/observations using logicals with `filter()`
128 | - And you can select certain *columns*/variables with `select()`
129 | - The syntax is:
130 |
131 | ```{r, echo=TRUE, eval=FALSE}
132 | data.frame %>% filter(logical.for.rows)
133 | filter(data.frame, logical.for.rows)
134 | data.frame %>% select(variables,you,want)
135 | select(data.frame,variables,you,want)
136 | ```
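
- A minimal sketch of that syntax in action, on a tiny made-up data frame (the heights are invented, not the Our World in Data values):

```{r, echo=TRUE}
library(dplyr)
heights <- data.frame(Code = c('GBR','GBR','FRA','NGA'),
                      Year = c(1810, 1980, 1980, 1980),
                      Heightcm = c(169, 177, 176, 168))   # hypothetical numbers
recent <- heights %>% filter(Year == 1980)          # keep only the 1980 rows
recent_small <- recent %>% select(Code, Heightcm)   # keep only two columns
recent_small
```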
137 |
138 |
139 | ## Subset for observations
140 |
141 | - Let's start by selecting rows from our data
142 | - We do this by creating a logical - `filter` will choose all the observations for which that logical is true!
143 | - Here's the logical I used to pick those six countries: `Code %in% c('GBR','DEU','FRA','GAB','COD','NGA')`
144 | - This will be equal to `TRUE` if the `Code` variable is `%in%` that list of six I gave
145 |
146 | ## Subset for observations
147 |
148 | - We can use this logical with the `filter` command in dplyr - note I don't need to store the logical as a variable first
149 |
150 | ```{r, echo=TRUE}
151 | str(df)
152 | dfsubset <- df %>% filter(Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'))
153 | str(dfsubset)
154 | ```
155 |
156 | ## Subset for observations
157 |
158 | - Minor note: see that it still retains the original factor levels. Generally this is what we want. If not, you can "relevel" the factor by re-declaring it as a factor, using only the levels we have:
159 |
160 | ```{r, echo=TRUE}
161 | dfsubset <- dfsubset %>% mutate(Code=factor(Code))
162 | str(dfsubset)
163 | ```
164 |
165 | - While we're at it we might want `Entity` to be a character variable (why?)
166 |
167 | ```{r, echo=TRUE}
168 | dfsubset <- dfsubset %>% mutate(Entity = as.character(Entity))
169 | ```
170 |
171 | ## Subset for observations
172 |
173 | - Everything we know about constructing logicals can work here
174 | - We can also combine multiple variables when doing this
175 | - What if we want to see the dataset for 1980 only for these six?
176 |
177 | ```{r, echo=TRUE}
178 | #filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA') & Year == 1980)
179 | filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'),Year == 1980)
180 | ```
181 |
182 | ## Subset for observations
183 |
184 | - We can treat the subset like any normal data frame
185 | - Let's get the mean height in these countries in 1980 and 1810
186 | - Can do it directly like this, or assign to a new `data.frame` like we did with `dfsubset` and use that
187 |
188 | ```{r, echo=TRUE}
189 | mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'),
190 | Year == 1980)$Heightft)
191 | mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'),
192 | Year == 1810)$Heightft)
193 | ```
194 |
195 | ## Subset for variables
196 |
197 | - Subsetting for variables is easy! Just use `select()` with a vector or list of variables you want!
198 | - Or you can do `-` a vector of variables you DON'T want!
199 | - We don't need Heightcm, let's get rid of it
200 |
201 | ## Subset for variables
202 |
203 | ```{r, echo=TRUE}
204 | str(dfsubset %>% select(Entity,Code,Year,Heightft))
205 | str(dfsubset %>% select(-c(Heightcm)))
206 | ```
207 |
208 | ## Subset for both!
209 |
210 | - We can do both at the same time, chaining one to the other
211 |
212 | ```{r, echo=TRUE}
213 | dfsubset %>% filter(Year == 1980) %>%
214 | select(Entity,Heightft)
215 | ```
216 |
217 | ## Practice
218 |
219 | - Get the dataset `mtcars` using the `data()` function
220 | - Look at it with `str()` and `help()`
221 | - Limit the dataset to just the variables `mpg, cyl`, and `hp`
222 | - Get the mean `hp` for cars at or above the median value of `cyl`
223 | - Get the mean `hp` for cars below the median value of `cyl`
224 | - Do the same for `mpg` instead of `hp`
225 | - Calculate the difference between above-median and below-median `mpg` and `hp`
226 | - How do you interpret these differences?
227 |
228 | ## Practice answers
229 |
230 | ```{r, echo=TRUE, eval=FALSE}
231 | data(mtcars)
232 | help(mtcars)
233 | str(mtcars)
234 | mtcars <- mtcars %>% select(mpg,cyl,hp)
235 |
236 | mean(filter(mtcars,cyl >= median(cyl))$hp)
237 | mean(filter(mtcars,cyl < median(cyl))$hp)
238 | mean(filter(mtcars,cyl >= median(cyl))$hp) - mean(filter(mtcars,cyl < median(cyl))$hp)
239 |
240 | mean(filter(mtcars,cyl >= median(cyl))$mpg)
241 | mean(filter(mtcars,cyl < median(cyl))$mpg)
242 | mean(filter(mtcars,cyl >= median(cyl))$mpg) - mean(filter(mtcars,cyl < median(cyl))$mpg)
243 | ```
244 |
245 |
246 | ## For loops
247 |
248 | - Sometimes we want to subset things in many different ways
249 | - Typing everything out over and over is a waste of time!
250 | - You don't understand how powerful computers are until you've written a `for` loop
251 | - This is an incredibly standard programming tool
252 | - R has another way of writing loops using the `apply` family, but we're not going to go there
253 |
254 | ## For loops
255 |
256 | - The basic idea of a `for` loop is that you have a vector of values, and you have an "iterator" variable
257 | - You go through the vector one by one, setting that iterator variable to each value one at a time
258 | - Then you run a chunk of code with the iterator variable set to that value
259 |
260 | ## For loops
261 |
262 | - In R the syntax is
263 |
264 | ```{r, echo = TRUE, eval=FALSE}
265 | for (iteratorvariable in vector) {
266 | code chunk
267 | }
268 | ```
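
- A tiny runnable sketch of that template: the iterator `i` takes each value in `1:5` in turn, and the code chunk adds its square to a running total (`total` is just a name chosen for illustration).

```{r, echo=TRUE}
total <- 0
for (i in 1:5) {
  total <- total + i^2   # add the square of the current value
  print(total)           # print() is needed to show results inside a loop
}
total                    # 1 + 4 + 9 + 16 + 25 = 55
```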
269 |
270 | ## For loops
271 |
272 | - Let's rewrite our `hp` and `mpg` differences with a for loop. Heck, let's do a bunch more variables too. (don't forget to get `data(mtcars)` again - why?)
273 | - Note if we want it to display results inside a loop we need `print()`
274 | - Also, unfortunately, if we're looping over variables, `$` won't work - we need to use `[[]]`
275 |
276 | ```{r, echo = TRUE, eval=FALSE}
277 | data(mtcars)
278 | abovemed <- mtcars %>% filter(cyl >= median(cyl))
279 | belowmed <- mtcars %>% filter(cyl < median(cyl))
280 | for (i in c('mpg','disp','hp','wt')) {
281 | print(mean(abovemed[[i]])-mean(belowmed[[i]]))
282 | }
283 | ```
284 |
285 | -----
286 |
287 | ```{r, echo = TRUE}
288 | data(mtcars)
289 | abovemed <- mtcars %>% filter(cyl >= median(cyl))
290 | belowmed <- mtcars %>% filter(cyl < median(cyl))
291 | for (i in c('mpg','disp','hp','wt')) {
292 | print(mean(abovemed[[i]])-mean(belowmed[[i]]))
293 | }
294 | ```
295 |
296 | ## For loops
297 |
298 | - It's also sometimes useful to loop over different values
299 | - Let's get the average height by country, like before
300 | - `unique()` can give us the levels to loop over
301 |
302 | ```{r, echo = TRUE}
303 | unique(dfsubset$Entity)
304 | for (countryn in unique(dfsubset$Entity)) {
305 | print(countryn)
306 | print(mean(filter(dfsubset,Entity==countryn)$Heightft))
307 | }
308 | ```
309 |
310 | ## For loop practice
311 |
312 | - Get back the full `mtcars` again
313 | - Use `unique()` to see the different values of `cyl`
314 | - Use `unique` to loop over those values and get median `mpg` within each level of `cyl`
315 | - Use `:` to construct a vector to repeat the same loop
316 | - [Hard!] Use `paste0` to print out "The median mpg for cyl = # is X" where # is the iterator number and X is the answer.
317 |
318 | ## For loop practice answers
319 |
320 | ```{r, echo = TRUE, eval=FALSE}
321 | data(mtcars)
322 | unique(mtcars$cyl)
323 | for (c in unique(mtcars$cyl)) {
324 | print(median(filter(mtcars,cyl==c)$mpg))
325 | }
326 | for (c in 2:4*2) {
327 | print(median(filter(mtcars,cyl==c)$mpg))
328 | }
329 | for (c in unique(mtcars$cyl)) {
330 | print(paste0(c("The median mpg for cyl = ",c,
331 | " is ",median(filter(mtcars,cyl==c)$mpg)),
332 | collapse=''))
333 | }
334 | #Printing out just the last one:
335 | ```
336 |
337 | ```{r, echo = FALSE, eval=TRUE}
338 | data(mtcars)
339 | for (c in unique(mtcars$cyl)) {
340 | print(paste0(c("The median mpg for cyl = ",c,
341 | " is ",median(filter(mtcars,cyl==c)$mpg)),
342 | collapse=''))
343 | }
344 | ```
--------------------------------------------------------------------------------
/Lectures/Lecture_08_Summarizing_Data_Part_2.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 8: Summarizing Data Part 2: Plots"
3 | author: "Nick Huntington-Klein"
4 | date: "January 29, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE)
19 | library(tidyverse)
20 | library(stargazer)
21 | library(Ecdat)
22 | theme_set(theme_gray(base_size = 15))
23 | ```
24 |
25 | ## Recap
26 |
27 | - Summary statistics are ways of describing the *distribution* of a variable
28 | - Examples include mean, percentiles, SD, and specific percentiles of interest (median, min, max)
29 | - Another, very good way to describe a variable is by actually showing that distribution with a graph
30 | - I showed you plenty of graphs last time, this time I'm going to show you how to make them yourself!
31 |
32 | ## Note about ggplot
33 |
34 | - As I mentioned before, we're keeping things straightforward and doing our plotting with base R
35 | - Base R plotting is easier to use, but the results won't look as good as the alternative, `ggplot2`, which is part of the tidyverse.
36 | - `ggplot2` code for plots will be available in the slides. Just search for 'ggplot'
37 | - It has also been there for most previous slides - I've been making most graphs with it!
38 | - Don't forget extra credit opportunity for learning `ggplot2`
39 |
40 |
41 | ## Plot types we will cover:
42 |
43 | - Density plots (for continuous)
44 | - Histograms (for continuous)
45 | - Box plots (for continuous)
46 | - Bar plots (for categorical)
47 | - Plus:
48 | - Adding plots together
49 | - Putting lines on plots
50 | - Makin 'em look good!
51 |
52 | ## Plots we won't:
53 |
54 | - Lots and lots and lots of plot options
55 | - Mosaics, Sankey plots, pie graphs
56 | - Some aren't common in Econ but could be!
57 | - Others are just too tricky (like *maps*)
58 | - See [The R Graph Gallery](https://www.r-graph-gallery.com/)
59 |
60 | ## Density plots and histograms
61 |
62 | - Density plots and histograms will show you the full distribution of the variable
63 | - Values along the x-axis, and how often those values show up on y
64 | - The difference - density plots will present a smooth line by averaging nearby values
65 | - A histogram will create "bins" and tell you how many observations fall into each
66 | - (don't forget we can save plots using Export in the Plots pane)
67 |
68 | ## Density plots
69 |
70 | - To make a density plot, we take our variable and make a `density()` object out of it, then `plot()` that thing!
71 | - Let's play around with data from the `Ecdat` package
72 |
73 | ```{r, echo=TRUE, eval=FALSE}
74 | install.packages('Ecdat')
75 | library(Ecdat)
76 | data(MCAS)
77 | plot(density(MCAS$totsc4))
78 | ```
79 | ```{r fig.width=4.5, fig.height=3.5, echo=FALSE, eval=TRUE}
80 | data(MCAS)
81 | plot(density(MCAS$totsc4))
82 | ```
83 |
84 | ```{r, echo=FALSE, eval=FALSE}
85 | #THE GGPLOT2 WAY
86 | library(Ecdat)
87 | library(tidyverse)
88 | data(MCAS)
89 |
90 | ggplot(MCAS,aes(x=totsc4))+stat_density(geom='line')
91 | ```
92 |
93 | ## Titles and labels
94 |
95 | - Readability is super important in graphs
96 | - Add labels and titles! Titles with 'main' and axis labels with 'xlab' and 'ylab'
97 |
98 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE}
99 | plot(density(MCAS$totsc4),main='Massachusetts Test Scores',
100 | xlab='Fourth Grade Test Scores')
101 | ```
102 |
103 | ```{r, echo=FALSE, eval=FALSE}
104 | #THE GGPLOT2 WAY
105 | library(tidyverse)
106 |
107 | ggplot(MCAS,aes(x=totsc4))+stat_density(geom='line')+
108 | ggtitle('Massachusetts Test Scores')+
109 | xlab('Fourth Grade Test Scores')
110 | ```
111 |
112 |
113 | ## Histograms
114 |
115 | - Histograms require a little less typing - just `hist()`
116 |
117 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE}
118 | hist(MCAS$totsc4)
119 | ```
120 |
121 | ```{r, echo=FALSE, eval=FALSE}
122 | #THE GGPLOT2 WAY
123 | library(tidyverse)
124 |
125 | ggplot(MCAS,aes(x=totsc4))+geom_histogram()
126 | ```
127 |
128 | ## Histograms
129 |
130 | - These need labels too! Other important options:
131 | - Do proportions with `freq=FALSE`, or change how many bins there are, or where they are, with `breaks`
132 |
133 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE}
134 | hist(MCAS$totsc4,xlab="Fourth Grade Test Scores",
135 | main="Test Score Histogram",freq=FALSE,breaks=50)
136 | ```
137 |
138 | ```{r, echo=FALSE, eval=FALSE}
139 | #THE GGPLOT2 WAY
140 | library(tidyverse)
141 |
142 | #If we want proportions we need to calculate it ourselves
143 | ggplot(MCAS,aes(x=totsc4))+geom_histogram(aes(y=(..count..)/sum(..count..)),bins=50)+
144 | ggtitle("Test Score Histogram")+
145 | xlab("Fourth Grade Test Scores")
146 | ```
147 |
148 | ## Box plots
149 |
150 | - Less common in economics, but a good way to look at your data
151 | - And check for outliers!
152 | - Basically a summary table in graph form
153 | - Shows the 25th, 50th, and 75th percentiles
154 | - Plus lines (whiskers) that reach out as far as 1.5 IQR from the box (IQR = 75th minus 25th percentile)
155 | - And dots for anything outside that range
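
A rough sketch of the numbers behind a box plot, using a small made-up vector `x` (an assumption for illustration, not course data). In base R's `boxplot()`, the whiskers stop at the most extreme data point that is still within 1.5 IQR of the box. R's own calculation uses "hinges", which can differ slightly from `quantile()` on other data:

```r
# Made-up data: the integers 1 to 20, plus one unusually large value
x <- c(1:20, 60)

# The box: 25th, 50th, and 75th percentiles
box <- quantile(x, c(.25, .5, .75))
box

# IQR = 75th percentile minus 25th percentile
iqr <- IQR(x)

# Points more than 1.5 IQR beyond the box get drawn as dots
lower_fence <- box[1] - 1.5 * iqr
upper_fence <- box[3] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]
outliers

# Compare with what boxplot() itself would flag
boxplot.stats(x)$out
```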
156 |
157 | ## Box plots
158 |
159 | - Simple!
160 | - Note the negative outliers - not necessarily a problem but good to know!
161 |
162 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE}
163 | boxplot(MCAS$totsc4,main="Box Plot of 4th Grade Scores")
164 | ```
165 |
166 | ```{r, echo=FALSE, eval=FALSE}
167 | #THE GGPLOT2 WAY
168 | library(tidyverse)
169 |
170 | ggplot(MCAS,aes(y=totsc4))+geom_boxplot()+
171 | ggtitle("Box Plot of 4th Grade Scores")
172 | ```
173 |
174 | ## Box plots
175 |
176 | - Easy to look at multiple variables at once, too
177 |
178 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE}
179 | boxplot(select(MCAS,totsc4,totsc8),main="Box Plot of 4th and 8th Grade Scores")
180 | ```
181 |
182 | ```{r, echo=FALSE, eval=FALSE}
183 | #THE GGPLOT2 WAY
184 | library(tidyverse)
185 |
186 | #Unfortunately, for this we need to change the format of the data
187 | #putting it in "long" format by "melt"ing it (from the reshape2 package)
188 | #and then identifying which variable it is by "group"
189 | library(reshape2)
190 | scores <- melt(MCAS,id.vars="code",measure.vars=c("totsc4","totsc8"))
191 | ggplot(scores,aes(x=variable,y=value,group=variable))+geom_boxplot()+
192 | ggtitle("Box Plot of 4th and 8th Grade Scores")
193 | ```
194 |
195 | ## Bar plots
196 |
197 | - Last time we talked about the `table()` command, which shows us the whole distribution of a *categorical* variable
198 | - Still a great command! Let's look at majors of students taking advanced micro from a mid-1980s sample in the `Ecdat` package
199 |
200 | ```{r, echo=TRUE}
201 | data(Mathlevel)
202 | table(Mathlevel$major)
203 | ```
204 |
205 | ## Bar plots
206 |
207 | - Bar plots express this table graphically
208 | - Stick that table in there!
209 |
210 | ```{r fig.width=6, fig.height=5, echo=TRUE, eval=TRUE}
211 | barplot(table(Mathlevel$major),main="Majors of Students in Advanced Microeconomics Class")
212 | ```
213 |
214 | ```{r, echo=FALSE, eval=FALSE}
215 | #THE GGPLOT2 WAY
216 | library(tidyverse)
217 |
218 | ggplot(Mathlevel,aes(x=major))+geom_bar()+
219 | ggtitle("Majors of Students in Advanced Microeconomics Class")
220 | ```
221 |
222 | ## Bar plots
223 |
224 | - And, just like before, we can make it a proportion instead of a count by wrapping that `table()` in `prop.table()`
225 |
226 | ```{r fig.width=6, fig.height=5, echo=TRUE, eval=TRUE}
227 | barplot(prop.table(table(Mathlevel$major)),main="Majors of Students in Advanced Microeconomics Class")
228 | ```
229 |
230 | ```{r, echo=FALSE, eval=FALSE}
231 | #THE GGPLOT2 WAY
232 | library(tidyverse)
233 |
234 | #Note that to get a proportion here we sort of need to calculate it ourselves
235 | #in the aes() argument of geom_bar
236 | ggplot(Mathlevel,aes(x=major))+geom_bar(aes(y = (..count..)/sum(..count..)))+
237 | ggtitle("Majors of Students in Advanced Microeconomics Class")
238 | ```
239 |
240 | ## Adding Lines
241 |
242 | - It's pretty common for us to want to add some explanatory lines to our graphs
243 | - For example, adding mean/median/etc. to a density
244 | - Or showing where 0 is
245 | - We can do this with `abline()` which adds a line with a given intercept (a) and slope (b), after we make our plot.
246 |
247 | ## Adding Lines
248 |
249 | - After creating the plot, THEN add the `abline(intercept,slope)`, or `abline(h=horizontal)`, `abline(v=vertical)` for horizontal or vertical lines
250 |
251 | ```{r echo=TRUE, eval=FALSE}
252 | hist(MCAS$totsc4)
253 | passingscore <- 705
254 | abline(0,1)
255 | abline(v=passingscore,col='red')
256 | abline(h=50)
257 | ```
258 |
259 | ```{r, echo=FALSE, eval=FALSE}
260 | #THE GGPLOT2 WAY
261 | library(tidyverse)
262 |
263 | passingscore <- 705
264 | ggplot(MCAS,aes(x=totsc4))+geom_histogram()+
265 | geom_vline(aes(xintercept=passingscore),col='red')+
266 | geom_hline(aes(yintercept=50))+
267 | geom_abline(aes(intercept=0,slope=1))
268 | ```
269 |
270 | ## Adding Lines
271 |
272 | ```{r fig.width=6, fig.height=5, echo=TRUE, eval=TRUE}
273 | hist(MCAS$totsc4)
274 | passingscore <- 705
275 | abline(v=passingscore,col='red')
276 | ```
277 |
278 | ```{r, echo=FALSE, eval=FALSE}
279 | #THE GGPLOT2 WAY
280 | library(tidyverse)
281 |
282 | passingscore <- 705
283 | ggplot(MCAS,aes(x=totsc4))+geom_histogram()+
284 | geom_vline(aes(xintercept=passingscore),col='red')
285 | ```
286 |
287 | ## Adding Lines
288 |
289 | - A common use of adding lines is to show where the mean or median of the distribution is, as I did in the last lecture
290 |
291 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE}
292 | plot(density(MCAS$totsc4))
293 | abline(v=mean(MCAS$totsc4),col='red')
294 | abline(v=median(MCAS$totsc4),col='blue')
295 | ```
296 |
297 | ```{r, echo=FALSE, eval=FALSE}
298 | #THE GGPLOT2 WAY
299 | library(tidyverse)
300 |
301 | #Note that color OUTSIDE of aes() colors the line
302 | #color INSIDE of aes() acts like group
303 | ggplot(MCAS,aes(x=totsc4))+geom_histogram()+
304 | geom_vline(aes(xintercept=mean(totsc4)),color='red')+
305 | geom_vline(aes(xintercept=median(totsc4)),color='blue')
306 | ```
307 |
308 | ## Overlaying Densities
309 |
310 | - Sometimes it's nice to be able to compare two distributions
311 | - Because density plots are so simple, we can do this by overlaying them
312 | - The `lines()` function will, like `abline()`, add a line to your graph
313 | - But it can do this with a density! Be sure to set colors so you can tell them apart
314 |
315 | ## Overlaying Densities
316 |
317 | ```{r fig.width=6, fig.height=5, echo=TRUE, eval=TRUE}
318 | plot(density(filter(MCAS,tchratio < median(tchratio))$totsc4),
319 | col='red',main="Mass. Fourth-Grade Test Scores by Teacher Ratio")
320 | lines(density(filter(MCAS,tchratio >= median(tchratio))$totsc4),col='blue')
321 | ```
322 |
323 | ## Practice
324 |
325 | - Install `Ecdat`, load it, and get the `MCAS` data
326 | - Make a density plot for `totsc4`, and add a vertical green dashed (tough!) line at the median
327 | - Create a bar plot showing the *proportion* of observations that have nonzero values of `bilingua`
328 | - Create a box plot for totday, and add a horizontal line at the mean
329 | - Go back and add appropriate titles and/or axis labels to all graphs
330 |
331 | ## Practice answers
332 |
333 | ```{r echo = TRUE, eval = FALSE}
334 | install.packages('Ecdat')
335 | library(Ecdat)
336 | data(MCAS)
337 | plot(density(MCAS$totsc4),xlab="4th Grade Test Score")
338 | abline(v=median(MCAS$totsc4),col='green',lty='dashed')
339 |
340 | MCAS <- MCAS %>% mutate(nonzerobi = bilingua > 0)
341 | barplot(prop.table(table(MCAS$nonzerobi)),main="Nonzero Spending Per Bilingual Student")
342 |
343 | boxplot(MCAS$totday,main="Spending Per Pupil")
344 | abline(h=mean(MCAS$totday))
345 | ```
346 |
347 | ```{r, echo=FALSE, eval=FALSE}
348 | #THE GGPLOT2 WAY
349 | library(tidyverse)
350 |
351 | ggplot(MCAS,aes(x=totsc4))+stat_density(geom='line')+
352 | geom_vline(aes(xintercept=median(totsc4)),color='green',linetype='dashed')+
353 | xlab("4th Grade Test Score")
354 |
355 | MCAS <- MCAS %>%
356 | mutate(nonzerobi = MCAS$bilingua > 0)
357 | ggplot(MCAS,aes(x=nonzerobi))+geom_bar(aes(y = (..count..)/sum(..count..)))+
358 | ggtitle("Nonzero Spending Per Bilingual Student")
359 |
360 | ggplot(MCAS,aes(y=totday))+geom_boxplot()+
361 | geom_hline(aes(yintercept=mean(totday)))+
362 | ggtitle("Spending Per Pupil")
363 | ```
--------------------------------------------------------------------------------
/Lectures/Lecture_09_Relationships_Between_Variables_Part_1.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 9: Relationships Between Variables, Part 1"
3 | author: "Nick Huntington-Klein"
4 | date: "February 5, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE)
19 | library(tidyverse)
20 | library(stargazer)
21 | library(Ecdat)
22 | library(haven)
23 | theme_set(theme_gray(base_size = 15))
24 | ```
25 | ## Recap
26 |
27 | - Summary statistics are ways of describing the *distribution* of a variable
28 | - We can also just look at the variable directly
29 | - Understanding a variable's distribution is important if we want to use it
30 |
31 | ## This week
32 |
33 | - We aren't just interested in looking at variables by themselves!
34 | - We want to know how variables can be *related* to each other
35 | - When `X` is high, would we expect `Y` to also be high, or be low?
36 | - How are variables *correlated*?
37 | - How does one variable *explain* another?
38 | - How does one variable *cause* another? (later!)
39 |
40 | ## What Does it Mean to be Related?
41 |
42 | - We would consider two variables to be *related* if knowing something about *one* of them tells you something about the other
43 | - For example, consider the answer to two questions:
44 | - Are you a man?
45 | - Are you pregnant?
46 | - What do you think is the probability that a random person is pregnant?
47 | - What do you think is the probability that a random person *who is a man* is pregnant?
48 |
49 | ## What does it Mean to be Related?
50 |
51 | Some terms:
52 |
53 | - Variables are *dependent* on each other if telling you the value of one gives you information about the distribution of the other
54 | - Variables are *correlated* if knowing whether one of them is *unusually high* gives you information about whether the other is *unusually high* (positive correlation) or *unusually low* (negative correlation)
55 | - *Explaining* one variable `Y` with another `X` means predicting *your `Y`* by looking at the distribution of `Y` for *your* value of `X`
56 | - Let's look at two variables as an example
57 |
58 | ## An Example: Dependence
59 |
60 | ```{r, echo=FALSE, eval=TRUE}
61 | wage1 <- read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/wage1.dta")
62 | ```
63 |
64 | ```{r, echo=TRUE, eval=TRUE}
65 | table(wage1$numdep,wage1$smsa,dnn=c('Num. Dependents','Lives in Metropolitan Area'))
66 | ```
67 |
68 | ## An Example: Dependence
69 |
70 | - What are we looking for here?
71 | - For *dependence*, simply see if the distribution of one variable changes for the different values of the other.
72 | - Does the distribution of Number of Dependents differ based on your SMSA status?
73 |
74 | ```{r, echo=TRUE, eval=TRUE}
75 | prop.table(table(wage1$numdep,wage1$smsa,dnn=c('Num. Dependents','Lives in Metropolitan Area')),margin=2)
76 | ```
77 |
78 | ## An Example: Dependence
79 |
80 | - Does the distribution of SMSA differ based on your Number of Dependents Status?
81 |
82 | ```{r, echo=TRUE, eval=TRUE}
83 | prop.table(table(wage1$numdep,wage1$smsa,dnn=c('Number of Dependents','Lives in Metropolitan Area')),margin=1)
84 | ```
85 |
86 | - Looks like it!
87 | - What do these two results mean?
88 |
89 |
90 | ## An Example: Correlation
91 |
92 | - We are interested in whether two variables tend to *move together* (positive correlation) or *move apart* (negative correlation)
93 | - One basic way to do this is to see whether values tend to be *high* together
94 | - One way to check in dplyr is to use `group_by()` to organize the data into groups
95 | - Then `summarize()` the data within those groups
96 |
97 | ```{r, echo=TRUE}
98 | wage1 %>%
99 | group_by(smsa) %>%
100 | summarize(numdep=mean(numdep))
101 | ```
102 |
103 | - When `smsa` is high, `numdep` tends to be low - negative correlation!
104 |
105 | ## An Example: Correlation
106 |
107 | - There's also a summary statistic we can calculate *called* correlation; this is typically what we mean by "correlation"
108 | - Ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)
109 | - Basically "a one-standard-deviation increase in `X` is associated with a change in `Y` of `cor(X,Y)` standard deviations"
110 |
111 | ```{r, echo=TRUE, eval=TRUE}
112 | cor(wage1$numdep,wage1$smsa)
113 | cor(wage1$smsa,wage1$numdep)
114 | ```
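
To make the "standard deviation" reading of correlation concrete, here is a minimal sketch. It uses the built-in `mtcars` data purely as a stand-in (an assumption, not the `wage1` data above), and the variable names are just for illustration:

```r
# Built-in data, used only for illustration
x <- mtcars$wt
y <- mtcars$mpg

# The usual correlation
r <- cor(x, y)

# Same thing by hand: covariance, rescaled by the two standard deviations
r_byhand <- cov(x, y) / (sd(x) * sd(y))

# Put both variables in standard-deviation units...
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# ...and the average product of the two is the correlation
# (dividing by n - 1, to match how sd() works)
r_standardized <- sum(zx * zy) / (length(x) - 1)

c(r, r_byhand, r_standardized)
```

All three numbers should match, and the order of `x` and `y` doesn't matter.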
115 |
116 | ## An Example: Explanation
117 |
118 | Let's go back to those different means:
119 |
120 |
121 | ```{r, echo=FALSE, eval=TRUE}
122 | #THE DPLYR WAY
123 | wage1 %>%
124 | group_by(smsa) %>%
125 | summarize(numdep=mean(numdep))
126 | ```
127 |
128 | - Explanation would be saying that, based on this, if you're in an SMSA, I predict that you have `r mean(filter(wage1,smsa==1)$numdep)` dependents, and if you're not, you have `r mean(filter(wage1,smsa==0)$numdep)` dependents
129 | - If you are in an SMSA and have 2 dependents, then `r mean(filter(wage1,smsa==1)$numdep)` of those dependents are *explained by SMSA* and 2 - `r mean(filter(wage1,smsa==1)$numdep)` = `r 2-mean(filter(wage1,smsa==1)$numdep)` of them are *unexplained by SMSA*
130 | - We'll talk a lot more about this later
131 |
132 | ## Coding Recap
133 |
134 | - `table(df$var1,df$var2)` to look at two variables together
135 | - `prop.table(table(df$var1,df$var2))` for the proportion in each cell
136 | - `prop.table(table(df$var1,df$var2),margin=2)` to get proportions *within each column*
137 | - `prop.table(table(df$var1,df$var2),margin=1)` to get proportions *within each row*
138 | - `df %>% group_by(var1) %>% summarize(mean(var2))` to get mean of var2 for each value of var1
139 | - `cor(df$var1,df$var2)` to calculate correlation
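
The commands above can be tried out in one go. This is only a sketch: it uses the built-in `mtcars` data as a stand-in for `df`, with `am` and `cyl` standing in for `var1` and `var2` (assumptions for illustration):

```r
library(dplyr)

# Built-in data, standing in for df
df <- mtcars

# Two variables together: counts in each cell
tab <- table(df$am, df$cyl)
tab

# Proportion of all observations in each cell
prop.table(tab)

# Proportions within each column (each column sums to 1)
bycol <- prop.table(tab, margin = 2)
bycol

# Proportions within each row (each row sums to 1)
byrow <- prop.table(tab, margin = 1)
byrow

# Mean of mpg for each value of am
means <- df %>% group_by(am) %>% summarize(mpg = mean(mpg))
means

# Correlation
cor(df$am, df$mpg)
```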
140 |
141 | ## Graphing Relationships
142 |
143 | - Relationships between variables can be easier to see graphically
144 | - And graphs are extremely important to understanding relationships and the "shape" of those relationships
145 |
146 | ## Wage and Education
147 |
148 | - Let's use `plot(xvar,yvar)` with *two* variables
149 |
150 | ```{r, echo=TRUE, eval=TRUE, fig.width=7, fig.height=3.5}
151 | plot(wage1$educ,wage1$wage,xlab="Years of Education",ylab="Wage")
152 | ```
153 |
154 | ```{r, echo=FALSE, eval=FALSE}
155 | #THE GGPLOT2 WAY
156 | ggplot(wage1,aes(x=educ,y=wage))+geom_point()+
157 | xlab('Years of Education')+
158 | ylab('Wage')
159 | ```
160 |
161 | - As we look at different values of `educ`, what changes about the values of `wage` we see?
162 |
163 | ## Graphing Relationships
164 |
165 | - Try to picture the *shape* of the data
166 | - Should this be a straight line? A curved line? Positively sloped? Negatively?
167 |
168 | ```{r, echo=TRUE, eval=TRUE, fig.width=7, fig.height=3.5}
169 | plot(wage1$educ,wage1$wage,xlab="Years of Education",ylab="Wage")
170 | abline(-.9,.5,col='red')
171 | plot(function(x) 5.4-.6*x+.05*(x^2),0,18,add=TRUE,col='blue')
172 | ```
173 |
174 | ```{r, echo=FALSE, eval=FALSE}
175 | #THE GGPLOT2 WAY
176 | ggplot(wage1,aes(x=educ,y=wage))+geom_point()+
177 | xlab('Years of Education')+
178 | ylab('Wage')+
179 | geom_abline(aes(intercept=-.9,slope=.5),col='red')+
180 | stat_function(fun=function(x) 5.4-.6*x+.05*(x^2),col='blue')
181 | ```
182 |
183 | ## Graphing Relationships
184 |
185 | - `plot(xvar,yvar)` is extremely powerful, and will show you relationships at a glance
186 | - The previous graph showed a clear positive relationship, and indeed `cor(wage1$wage,wage1$educ)` = `r cor(wage1$wage,wage1$educ)`
187 | - Further, we don't only see a positive relationship, but we have some sense of *how* positive it is, what it looks like roughly
188 | - Let's look at some more
189 |
190 | ## Graphing Relationships
191 |
192 | - Let's compare clothing sales volume vs. profit margin for men's clothing firms
193 |
194 | ```{r, echo=TRUE, eval=FALSE}
195 | library(Ecdat)
196 | data(Clothing)
197 | plot(Clothing$sales,Clothing$margin,xlab="Gross Sales",ylab="Margin")
198 | ```
199 |
200 | ```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=3.5}
201 | data(Clothing)
202 | plot(Clothing$sales,Clothing$margin,xlab="Gross Sales",ylab="Margin")
203 | ```
204 |
205 | ```{r, echo=FALSE, eval=FALSE}
206 | #THE GGPLOT2 WAY
207 | library(Ecdat)
208 | data(Clothing)
209 | ggplot(Clothing,aes(x=sales,y=margin))+geom_point()+
210 | xlab('Gross Sales')+
211 | ylab('Margin')
212 | ```
213 |
214 | - No clear up-or-down relationship (although the correlation is `r cor(Clothing$sales,Clothing$margin)`!) but clearly the variance is higher for low sales
215 |
216 | ## Graphing Relationships
217 |
218 | - Comparing Singapore diamond prices vs. carats
219 |
220 | ```{r, echo=TRUE, eval=FALSE}
221 | library(Ecdat)
222 | data(Diamond)
223 | plot(Diamond$carat,Diamond$price,xlab="Number of Carats",ylab="Price")
224 | ```
225 |
226 | ```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=3.5}
227 | data(Diamond)
228 | plot(Diamond$carat,Diamond$price,xlab="Number of Carats",ylab="Price")
229 | ```
230 |
231 | ```{r, echo=FALSE, eval=FALSE}
232 | #THE GGPLOT2 WAY
233 | library(Ecdat)
234 | data(Diamond)
235 | ggplot(Diamond,aes(x=carat,y=price))+geom_point()+
236 | xlab('Number of Carats')+
237 | ylab('Price')
238 | ```
239 |
240 | ## Graphing Relationships
241 |
242 | - Another way to graph a relationship, especially when one of the variables only takes a few values, is to plot the `density()` function for different values
243 |
244 | ```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=3.5}
245 | plot(density(filter(wage1,married==0)$wage),col='blue',xlab="Wage",main="Wage Distribution; Blue = Unmarried, Red = Married")
246 | lines(density(filter(wage1,married==1)$wage),col='red')
247 | ```
248 |
249 | ```{r, echo=FALSE, eval=FALSE}
250 | #THE GGPLOT2 WAY
251 | ggplot(filter(wage1,married==0),aes(x=wage))+stat_density(geom='line',col='blue')+
252 | xlab('Wage')+
253 | ylab('Density')+
254 | ggtitle("Wage Distribution; Blue = Unmarried, Red = Married")+
255 | stat_density(data=filter(wage1,married==1),geom='line',col='red')
256 | ```
257 |
258 | - Clearly different distributions: married people earn more!
259 |
260 | ## Graphing relationships
261 |
262 | - We can back that up other ways
263 |
264 | ```{r, echo=TRUE,eval=TRUE}
265 | wage1 %>% group_by(married) %>% summarize(wage = mean(wage))
266 | cor(wage1$wage,wage1$married)
267 | ```
268 |
269 | ## Keep in mind!
270 |
271 | - Just because two variables are *related* doesn't mean we know *why*
272 | - If `cor(x,y)` is positive, it could be that `x` causes `y`... or that `y` causes `x`, or that something else causes both!
273 | - Or many other configurations... we'll talk about this after the midterm
274 | - Plus, even if we know the direction we may not know *why* that cause exists.
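
A small made-up simulation of the "something else causes both" case. The variables `x`, `y`, and `z` here are hypothetical, and in real data we usually can't subtract out `z` this cleanly. It's only a sketch of the idea:

```r
set.seed(1000)

# z is the "something else" that causes both x and y
z <- rnorm(1000)

# x and y each depend on z, but NOT on each other
x <- z + rnorm(1000)
y <- z + rnorm(1000)

# Positive correlation, even though neither causes the other
cor(x, y)

# Take out the part of each that comes from z, and the relationship goes away
cor(x - z, y - z)
```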
275 |
276 | ## For example
277 |
278 | ```{r, echo=TRUE, eval=TRUE, fig.width=7, fig.height=5}
279 | addata <- read.csv('http://www.nickchk.com/ad_spend_and_gdp.csv')
280 | plot(addata$AdSpending,addata$GDP,
281 | xlab='Ad Spend/Year (Mil.)',ylab='US GDP (Bil.)')
282 | ```
283 |
284 | ```{r, echo=FALSE, eval=FALSE}
285 | #THE GGPLOT2 WAY
286 | ggplot(addata,aes(x=AdSpending,y=GDP))+geom_point()+
287 | xlab('Ad Spend/Year (Mil.)')+
288 | ylab('US GDP (Bil.)')
289 | ```
290 |
291 | ## For example
292 |
293 | - The correlation between ad spend and GDP is `r cor(addata$AdSpending,addata$GDP)`
294 | - Does this mean that ads make GDP go up?
295 | - To some extent, yes (ad spending factors directly into GDP)
296 | - But that doesn't explain all of it!
297 | - Why else might this relationship exist?
298 |
299 | ## Practice
300 |
301 | - Install the `SMCRM` package, load it, get the `customerAcquisition` data. Rename it ca
302 | - Among `acquisition==1` observations, see if the size of first purchase is related to duration as a customer, with `cor` and (labeled) `plot`
303 | - See if `industry` and `acquisition` are dependent on each other using `prop.table` with the `margin` option
304 | - See if average revenues differ between industries using `aggregate`, then check the `cor`
305 | - Plot the density of revenues for `industry==0` in blue and, on the same graph, revenues for `industry==1` in red
306 | - In each case, think about what relationship is suggested
307 |
308 | ## Practice Answers
309 |
310 | ```{r, echo=TRUE, eval=FALSE}
311 | install.packages('SMCRM')
312 | library(SMCRM)
313 | data(customerAcquisition)
314 | ca <- customerAcquisition
315 | cor(filter(ca,acquisition==1)$first_purchase,filter(ca,acquisition==1)$duration)
316 | plot(filter(ca,acquisition==1)$first_purchase,filter(ca,acquisition==1)$duration,
317 | xlab="Value of First Purchase",ylab="Customer Duration")
318 | prop.table(table(ca$industry,ca$acquisition),margin=1)
319 | prop.table(table(ca$industry,ca$acquisition),margin=2)
320 | aggregate(revenue~industry,data=ca,FUN=mean)
321 | cor(ca$revenue,ca$industry)
322 | plot(density(filter(ca,industry==0)$revenue),col='blue',xlab="Revenues",main="Revenue Distribution")
323 | lines(density(filter(ca,industry==1)$revenue),col='red')
324 | ```
--------------------------------------------------------------------------------
/Lectures/Lecture_12_Midterm_Review.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 12: Midterm Review"
3 | author: "Nick Huntington-Klein"
4 | date: "February 14, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE)
19 | library(tidyverse)
20 | library(stargazer)
21 | theme_set(theme_gray(base_size = 15))
22 | ```
23 |
24 | ## Recap
25 |
26 | - We've been covering how to work with data in R
27 | - Building up multiple variables (numeric, character, logical, factor) into vectors
28 | - Joining vectors together into data.frames or tibbles
29 | - Making data, downloading data, getting data from packages
30 | - Manipulating (with dplyr), summarizing, and plotting that data
31 | - Looking at relationships between variables
32 |
33 | ## Working with Objects
34 |
35 | - Create a vector with `c()` or `1:4` or `sample()` or `numeric()` etc.
36 | - Create logicals to check conditions on a vector, i.e. `a < 5 & a > 1` or `c('A','B') %in% c('A','C','D')`
37 | - Check vector type with `is.` functions or change them with `as.`
38 | - Use `help()` to figure out how to use new functions you don't know yet!
39 |
40 | ## Working with Objects Practice
41 |
42 | - Use `sample()` to generate a vector of 1000 names from Jack, Jill, and Mary.
43 | - Use `%in%` to count how many are Jill or Mary.
44 | - Use `help()` to figure out how to use the `substr` function to get the first letter of the name. Then, use that to count how many names are Jack or Jill.
45 | - Change the vector to a factor.
46 | - Create a vector of all integers from 63 to 302. Then, count how many are below 99 or above 266.
47 |
48 | ## Answers
49 |
50 | ```{r, echo=TRUE, eval=FALSE}
51 | names <- sample(c('Jack','Jill','Mary'),1000,replace=T)
52 | sum(names %in% c('Jill','Mary'))
53 | firstletter <- substr(names,1,1)
54 | sum(firstletter == "J")
55 | names <- factor(names)
56 |
57 | numbers <- 63:302
58 | sum(numbers < 99 | numbers > 266)
59 | ```
60 |
61 | ## Working with Data
62 |
63 | - Get data with `read.csv()` or `data()`, or create it with `data.frame()` or `tibble()`
64 | - Use dplyr to manipulate it:
65 | - `filter()` to pick a subset of observations
66 | - `select()` to pick a subset of variables
67 | - `rename()` to rename variables
68 | - `mutate()` to create new variables
69 | - `%>%` to chain together commands
70 | - Automate things with a `for (i in 1:10) {}` loop
71 |
72 | ## Working with Data Practice
73 |
74 | - Load the `Ecdat` library and get the `Computers` data set
75 | - In one chain of commands,
76 | - create a logical `bigHD` if the `hd` is above median
77 | - remove the `ads` and `trend` variables
78 | - limit the data to only premium computers
79 | - Use a `for` loop to print out the median price for each level of `ram`
80 | - Loop over a vector, sometimes useful to use `unique()`
81 |
82 | ## Answers
83 |
84 | ```{r, echo=TRUE, eval=FALSE}
85 | library(Ecdat)
86 | data(Computers)
87 |
88 | Computers <- Computers %>%
89 | mutate(bigHD = hd > median(hd)) %>%
90 | select(-ads,-trend) %>%
91 | filter(premium == "yes")
92 |
93 | for (i in unique(Computers$ram)) {
94 | print(median(filter(Computers,ram==i)$price))
95 | }
96 | ```
97 |
98 | ## Summarizing Single Variables
99 |
100 | - Variables have a *distribution* and we are interested in describing that distribution
101 | - `table()`, `mean()`, `sd()`, `quantile()`, and functions for the 0th, 50th, and 100th percentiles: `min()`, `median()`, `max()`
102 | - `stargazer()` to get a bunch of summary stats at once
103 | - Plotting: `plot(density(x))`, `hist()`, `barplot(table())`
104 | - Adding to plots with `points()`, `lines()`, `abline()`
105 |
106 | ## Summarizing Single Variables Practice
107 |
108 | - Create a text stargazer table of Computers
109 | - Use `table` to look at the distribution of `ram`, then make a `barplot` of it
110 | - Create a density plot of price, and use a single `abline(v=)` to overlay the 0, 10, 20, ..., 100% percentiles on it as blue vertical lines
111 |
112 | ## Answers
113 |
114 | ```{r, echo=TRUE, eval=FALSE}
115 | library(stargazer)
116 | stargazer(Computers,type='text')
117 |
118 | table(Computers$ram)
119 | barplot(table(Computers$ram))
120 |
121 | plot(density(Computers$price),xlab='Price',main='Distribution of Computer Price')
122 | abline(v=quantile(Computers$price,0:10/10),col='blue')
123 | ```
124 |
125 | ## Relationships Between Variables
126 |
127 | - Looking at the distribution of one variable *at a given value of another variable*
128 | - Check for dependence with `prop.table(table(x,y),margin=)`
129 | - Correlation: are they large/small together? `cor()`
130 | - `group_by(x) %>% summarize(mean(y))` to get mean of y within values of x
131 | - `cut(x,breaks=10)` to put x into "bins" to explain y with
132 | - Mean of y within values of x gives part of y *explained* by x
133 | - Proportion of variance explained `1-var(residuals)/var(y)`
134 | - `plot(x,y)` or overlaid density plots
135 |
136 | ## Relationships Practice
137 |
138 | - Use `prop.table` with both margins to see if cd and multi look dependent
139 | - Use `cut` to make 10 bins of `hd`
140 | - Get average price by bin of `hd`, and residuals
141 | - Calculate proportion of variance in price explained by `hd`, and calculate correlation
142 | - Plot `price` (y-axis) against `hd` (x-axis)
143 |
144 | ## Answers
145 |
146 | ```{r, echo=TRUE, eval=FALSE}
147 | prop.table(table(Computers$cd,Computers$multi),margin=1)
148 | prop.table(table(Computers$cd,Computers$multi),margin=2)
149 |
150 | Computers <- Computers %>%
151 | mutate(hdbins = cut(hd,breaks=10)) %>%
152 | group_by(hdbins) %>%
153 | mutate(priceav = mean(price)) %>%
154 | mutate(res = price - priceav)
155 |
156 | #variance explained
157 | 1 - var(Computers$res)/var(Computers$price)
158 |
159 | plot(Computers$hd,Computers$price,xlab="Size of Hard Drive",ylab='Price')
160 | ```
161 |
162 | ## Simulation
163 |
164 | - There are *true models* that we can't see, but which generate data for us
165 | - We want to use methods that can work backwards to uncover true models
166 | - We can randomly generate data using a true model we decide, and see if our method uncovers it
167 | - `rnorm()`, `runif()`, `sample()`
168 | - Create a blank vector, then a `for` loop to make data and analyze. Store result in the vector
169 | - Analyze the vector to see what the results look like
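
The recipe above, in its simplest form. This is a sketch with a made-up true model (normal data with mean 5 and sd 2, chosen only for illustration), where the "method" being checked is just `mean()`:

```r
set.seed(305)

# Blank vector to store the result from each loop
results <- c()

for (i in 1:500) {
  # Make data from the true model we decided on: mean 5, sd 2
  fakedata <- rnorm(100, mean = 5, sd = 2)
  # Analyze it, and store the result
  results[i] <- mean(fakedata)
}

# Analyze the vector: do our estimates center on the truth of 5?
mean(results)
sd(results)
plot(density(results), main = 'Distribution of Estimated Means')
abline(v = 5, col = 'red')
```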
170 |
171 | ## Simulation Practice
172 |
173 | - Create a for loop that creates 500 obs of `At.War` (logical, equal to 1 10% of the time)
174 | - And `Net.Exports` (uniform, min -1, max 1, then subtract 3*At.War)
175 | - And `GDP.Growth` (normal, mean 0, sd 3, then add + Net.Exports + At.War)
176 | - Explains GDP.Growth with At.War and takes the residual
177 | - Calculates `cor()` between GDP.Growth and Net.Exports, and between GDP.Growth and residual
178 | - Stores the correlations in two separate vectors, and then compares their distributions after 1000 loops
179 |
180 | ## Answer
181 |
182 | ```{r, echo=TRUE, eval=FALSE}
183 | library(stargazer)
184 |
185 | GDPcor <- c()
186 | rescor <- c()
187 |
188 | for (i in 1:1000) {
189 | df <- tibble(At.War = sample(0:1,500,replace=T,prob=c(.9,.1))) %>%
190 | mutate(Net.Exports = runif(500,-1,1)-3*At.War) %>%
191 | mutate(GDP.Growth = rnorm(500,0,3)+Net.Exports+At.War) %>%
192 | group_by(At.War) %>%
193 | mutate(residual = GDP.Growth - mean(GDP.Growth))
194 |
195 | GDPcor[i] <- cor(df$GDP.Growth,df$Net.Exports)
196 | rescor[i] <- cor(df$residual,df$Net.Exports)
197 | }
198 |
199 | stargazer(data.frame(rescor,GDPcor),type='text')
200 | ```
201 |
202 | ## Midterm
203 |
204 | Reminders:
205 |
206 | - Midterm will allow the use of the R help files but not the rest of the internet
207 | - You will also have access to lecture slides. *I do not recommend relying on them as this would take you a lot of time*
208 | - Anything we've covered is fair game
209 | - There will be one question that requires you to learn about, and use, a function we haven't used yet
210 | - The answer key to the midterm contains 35 lines of code, many of which are dplyr
--------------------------------------------------------------------------------
/Lectures/Lecture_13_Causality.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 13: Causality"
3 | author: "Nick Huntington-Klein"
4 | date: "February 18, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE)
19 | library(tidyverse)
20 | theme_set(theme_gray(base_size = 15))
21 | ```
22 |
23 | ## Midterm
24 |
25 | - Let's review the midterm answers
26 |
27 | ## And now!
28 |
29 | - The second half of the class
30 | - The first half focused on how to program and some useful statistical concepts
31 | - The second half will focus on *causality*
32 | - In other words, "how do we know if X causes Y?"
33 |
34 | ## Why Causality?
35 |
36 | - Many of the interesting questions we might want to answer with data are causal
37 | - Some are non-causal, too - for example, "how can we predict whether this photo is of a dog or a cat" is vital to how Google Images works, but it doesn't care what *caused* the photo to be of a dog or a cat
38 | - Nearly every *why* question is causal
39 | - And when we're talking about people, *why* is often what we want to know!
40 |
41 | ## Also
42 |
43 | - This is economists' comparative advantage!
44 | - Plenty of fields do statistics. But very few make it standard training for their students to understand causality
45 | - This understanding of causality makes economists very useful! *This* is one big reason why tech companies have whole economics departments in them
46 |
47 | ## Bringing us to...
48 |
49 | - Part of this half of the class will be understanding what causality *is* and how we can find it
50 | - Another big part will be understanding common *research designs* for uncovering causality in data when we can't do an experiment
51 | - These, more than supply & demand, more than IS-LM, are the tools of the modern economist!
52 |
53 | ## So what is causality?
54 |
55 | - We say that `X` *causes* `Y` if...
56 | - were we to intervene and *change* the value of `X` without changing anything else...
57 | - then `Y` would also change as a result
58 |
59 | ## Some examples
60 |
61 | Examples of causal relationships!
62 |
63 | Some obvious:
64 |
65 | - A light switch being set to on causes the light to be on
66 | - Setting off fireworks raises the noise level
67 |
68 | Some less obvious:
69 |
70 | - Getting a college degree increases your earnings
71 | - Tariffs reduce the amount of trade
72 |
73 | ## Some examples
74 |
75 | Examples of non-zero *correlations* that are not *causal* (or may be causal in the wrong direction!)
76 |
77 | Some obvious:
78 |
79 | - People tend to wear shorts on days when ice cream trucks are out
80 | - Rooster crowing sounds are followed closely by sunrise*
81 |
82 | Some less obvious:
83 |
84 | - Colds tend to clear up a few days after you take Emergen-C
85 | - The performance of the economy tends to be lower or higher depending on the president's political party
86 |
87 | *This case of mistaken causality is the basis of the film Rock-a-Doodle which I remember being very entertaining when I was six.
88 |
89 | ## Important Note
90 |
91 | - "X causes Y" *doesn't* mean that X is necessarily the *only* thing that causes Y
92 | - And it *doesn't* mean that X always leads to Y, or that Y never happens without X
93 | - For example, using a light switch causes the light to go on
94 | - But not if the bulb is burned out (no Y, despite X), or if the light was already on (Y without X)
95 | - But still we'd say that using the switch causes the light to go on! The important thing is that X *changes the probability* that Y happens, not that it necessarily makes it happen for certain
96 |
97 | ## So How Can We Tell?
98 |
99 | - As just shown, there are plenty of *correlations* that aren't *causal*
100 | - So if we have a correlation, how can we tell if it *is*?
101 | - For this we're going to have to think hard about *causal inference*. That is, inferring causality from data
102 |
103 | ## The Problem of Causal Inference
104 |
105 | - Let's try to think about whether some `X` causes `Y`
106 | - That is, if we manipulated `X`, then `Y` would change as a result
107 | - For simplicity, let's assume that `X` is either 1 or 0, like "got a medical treatment" or "didn't"
108 |
109 | ## The Problem of Causal Inference
110 |
111 | - Now, how can we know *what would happen* if we manipulated `X`?
112 | - Let's consider just one person - Angela. We could just check what Angela's `Y` is when we make `X=0`, and then check what Angela's `Y` is again when we make `X=1`.
113 | - Are those two `Y`s different? If so, `X` causes `Y`!
114 | - Do that same process for everyone in your sample and you know in general what the effect of `X` on `Y` is
115 |
116 | ## The Problem of Causal Inference
117 |
118 | - You may have spotted the problem
119 | - Just like you can't be in two places at once, Angela can't exist both with `X=0` and with `X=1`. She either got that medical treatment or she didn't.
120 | - Let's say she did. So for Angela, `X=1` and, let's say, `Y=10`.
121 | - The other one, what `Y` *would have been* if we made `X=0`, is *missing*. We don't know what it is! Could also be `Y=10`. Could be `Y=9`. Could be `Y=1000`!
122 |
123 | ## The Problem of Causal Inference
124 |
125 | - Well, why don't we just take someone who actually DOES have `X=0` and compare their `Y`?
126 | - Because there are lots of reasons their `Y` could be different BESIDES `X`.
127 | - They're not Angela! A character flaw to be sure.
128 | - So if we find someone, Gareth, with `X=0` and they have `Y=9`, is that because `X` increases `Y`, or is that just because Angela and Gareth would have had different `Y`s anyway?
129 |
130 | ## The Problem of Causal Inference
131 |
132 | - The main goal we have in doing causal inference is in making *as good a guess as possible* as to what that `Y` *would have been* if `X` had been different
133 | - That "would have been" is called a *counterfactual* - counter to the fact of what actually happened
134 | - In doing so, we want to think about two people/firms/countries that are basically *exactly the same* except that one has `X=0` and one has `X=1`
135 |
136 | ## Experiments
137 |
138 | - A common way to do this in many fields is an *experiment*
139 | - If you can *randomly assign* `X`, then you know that the people with `X=0` are, on average, exactly the same as the people with `X=1`
140 | - So that's an easy comparison!
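
A quick sketch of why random assignment works, assuming a hypothetical background trait called `ability` that we never use when assigning `X`:

```r
set.seed(1000)
n <- 10000
# A background trait that differs across people (hypothetical)
ability <- rnorm(n)
# Randomly assign X, ignoring ability entirely
X <- sample(c(0, 1), n, replace = TRUE)

# Average ability is nearly identical in the two groups
balance <- tapply(ability, X, mean)
balance
```

Because `X` was assigned without looking at `ability`, the two groups end up with almost the same average `ability`. The same goes for any other trait, even ones we can't measure.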
141 |
142 | ## Experiments
143 |
144 | - When we're working with people/firms/countries, running experiments is often infeasible, impossible, or unethical
145 | - So we have to think hard about a *model* of what the world looks like
146 | - So that we can use our model to figure out what the *counterfactual* would be
147 |
148 | ## Models
149 |
150 | - In causal inference, the *model* is our idea of what we think the process is that *generated the data*
151 | - We have to make some assumptions about what this is!
152 | - We put together what we know about the world with assumptions and end up with our model
153 | - The model can then tell us what kinds of things could give us wrong results so we can fix them and get the right counterfactual
154 |
155 | ## Models
156 |
157 | - Wouldn't it be nice to not have to make assumptions?
158 | - Yeah, but it's impossible to skip!
159 | - We're trying to predict something that hasn't happened - a counterfactual
160 | - This is literally impossible to do if you don't have some model of how the data is generated
161 | - You can't even predict the sun will rise tomorrow without a model!
162 | - If you think you can, you just don't realize the model you're using - that's dangerous!
163 |
164 | ## An Example
165 |
166 | - Let's cheat again and know how our data is generated!
167 | - Let's say that getting `X` causes `Y` to increase by 1
168 | - And let's run a randomized experiment of who actually gets X
169 |
170 | ```{r, echo=TRUE, eval=TRUE}
171 | df <- data.frame(Y.without.X = rnorm(1000),X=sample(c(0,1),1000,replace=T)) %>%
172 | mutate(Y.with.X = Y.without.X + 1) %>%
173 | #Now assign who actually gets X
174 | mutate(Observed.Y = ifelse(X==1,Y.with.X,Y.without.X))
175 | #And see what effect our experiment suggests X has on Y
176 | df %>% group_by(X) %>% summarize(Y = mean(Observed.Y))
177 | ```
178 |
179 | ## An Example
180 |
181 | - Now this time we can't randomize X.
182 |
183 | ```{r, echo=TRUE, eval=TRUE}
184 | df <- data.frame(Z = runif(10000)) %>% mutate(Y.without.X = rnorm(10000) + Z, Y.with.X = Y.without.X + 1) %>%
185 | #Now assign who actually gets X
186 | mutate(X = Z > .7,Observed.Y = ifelse(X==1,Y.with.X,Y.without.X))
187 | df %>% group_by(X) %>% summarize(Y = mean(Observed.Y))
188 | #But if we properly model the process and compare apples to apples...
189 | df %>% filter(abs(Z-.7)<.01) %>% group_by(X) %>% summarize(Y = mean(Observed.Y))
190 | ```
191 |
192 | ## So!
193 |
194 | - So, as we move forward
195 | - We're going to be thinking about how to create models of the processes that generated the data
196 | - And, once we have those models, we'll figure out what methods we can use to generate plausible counterfactuals
197 | - Once we're really comparing apples to apples, we can figure out, using *only data we can actually observe*, how things would be different if we reached in and changed `X`, and how `Y` would change as a result.
--------------------------------------------------------------------------------
/Lectures/Lecture_14_Causal_Diagrams.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 14 Causal Diagrams"
3 | author: "Nick Huntington-Klein"
4 | date: "February 25, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
19 | library(tidyverse)
20 | library(dagitty)
21 | library(ggdag)
22 | library(gganimate)
23 | library(ggthemes)
24 | library(Cairo)
25 | theme_set(theme_gray(base_size = 15))
26 | ```
27 |
28 | ## Recap
29 |
30 | - Last time we talked about causality
31 | - The idea that if we could reach in and manipulate `X`, and as a result `Y` changes too, then `X` *causes* `Y`
32 | - We also talked about how we can identify causality in data
33 | - Part of that will necessarily require us to have a model
34 |
35 | ## Models
36 |
37 | - We *have to have a model* to get at causality
38 | - A model is our way of *understanding the world*. It's our idea of what we think the data-generating process is
39 | - Models can be informal or formal - "The sun rises every day because the earth spins" vs. super-complex astronomical models of the galaxy with thousands of equations
40 | - All models are wrong. Even quantum mechanics. But as long as models are right enough to be useful, we're good to go!
41 |
42 | ## Models
43 |
44 | - Once we *do* have a model, though, that model will tell us *exactly* how we can find a causal effect
45 | - (if it's possible; it's not always possible)
46 | - Sort of like how, last time, we knew how `X` was assigned, and using that information we were able to get a good estimate of the true treatment effect
47 |
48 | ## Example
49 |
50 | - Let's work through a familiar example from before, where we know the data generating process
51 |
52 | ```{r, echo=TRUE}
53 | # Is your company in tech? Let's say 30% of firms are
54 | df <- tibble(tech = sample(c(0,1),500,replace=T,prob=c(.7,.3))) %>%
55 | #Tech firms on average spend $3mil more defending IP lawsuits
56 | mutate(IP.spend = 3*tech+runif(500,min=0,max=4)) %>%
57 | #Tech firms also have higher profits. But IP lawsuits lower profits
58 | mutate(log.profit = 2*tech - .3*IP.spend + rnorm(500,mean=2))
59 | # Now let's check for how profit and IP.spend are correlated!
60 | cor(df$log.profit,df$IP.spend)
61 | ```
62 |
63 | - Uh-oh! The true relationship is negative, but the data says positive!!
64 |
65 | ## Example
66 |
67 | - Now we can ask: *what do we know* about this situation?
68 | - How do we suspect the data was generated? (ignoring for a moment that we know already)
69 | - We know that being a tech company leads you to have to spend more money on IP lawsuits
70 | - We know that being a tech company leads you to have higher profits
71 | - We know that IP lawsuits lower your profits
72 |
73 | ## Example
74 |
75 | - From this, we realize that part of what we get when we calculate `cor(df$log.profit,df$IP.spend)` is the influence of being a tech company
76 | - Meaning that if we remove that influence, what's left over should be the actual, negative, effect of IP lawsuits
77 | - Now, we can get to this intuitively, but it would be much more useful if we had a more formal model that could tell us what to do in *lots* of situations
78 |
79 | ## Causal Diagrams
80 |
81 | - Enter the causal diagram!
82 | - A causal diagram (aka a Directed Acyclic Graph) is a way of writing down your *model* that lets you figure out what you need to do to find your causal effect of interest
83 | - All you need to do to make a causal diagram is write down all the important features of the data generating process, and also write down what you think causes what!
84 |
85 | ## Example
86 |
87 | - We know that being a tech company leads you to have to spend more money on IP lawsuits
88 | - We know that being a tech company leads you to have higher profits
89 | - We know that IP lawsuits lower your profits
90 |
91 | ## Example
92 |
93 | - We know that being a tech company leads you to have to spend more money on IP lawsuits
94 | - We know that being a tech company leads you to have higher profits
95 | - We know that IP lawsuits lower your profits
96 |
97 | ```{r CairoPlot, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=3.5}
98 | dag <- dagify(IP.sp~tech) %>% tidy_dagitty()
99 | ggdag(dag,node_size=20)
100 | ```
101 |
102 | ## Example
103 |
104 | - We know that being a tech company leads you to have to spend more money on IP lawsuits
105 | - We know that being a tech company leads you to have higher profits
106 | - We know that IP lawsuits lower your profits
107 |
108 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=3.5}
109 | dag <- dagify(IP.sp~tech,
110 | profit~tech) %>% tidy_dagitty()
111 | ggdag(dag,node_size=20)
112 | ```
113 |
114 | ## Example
115 |
116 | - We know that being a tech company leads you to have to spend more money on IP lawsuits
117 | - We know that being a tech company leads you to have higher profits
118 | - We know that IP lawsuits lower your profits
119 |
120 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=3.5}
121 | dag <- dagify(IP.sp~tech,
122 | profit~tech+IP.sp) %>% tidy_dagitty()
123 | ggdag(dag,node_size=20)
124 | ```
125 |
126 | ## Voila
127 |
128 | - We have *encoded* everything we know about this particular little world in our diagram
129 | - (well, not everything, the diagram doesn't say whether we think these effects are positive or negative)
130 | - Not only can we see our assumptions, but we can see how they fit together
131 | - For example, if we were looking for the impact of *tech* on profit, we'd know that it happens directly, AND happens because tech affects `IP.spend`, which then affects profit.
132 |
133 | ## Identification
134 |
135 | - And if we want to isolate the effect of `IP.spend` on `profit`, we can figure that out too
136 | - We call this process - isolating just the causal effect we're interested in - "identification"
137 | - We're *identifying* just one of those arrows, the one `IP.spend -> profit`, and seeing what the effect is on that arrow!
138 |
139 | ## Identification
140 |
141 | - Based on this graph, we can see that part of the correlation between `IP.spend` and `profit` can be *explained by* how `tech` links the two.
142 |
143 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=3.5}
144 | dag <- dagify(IP.sp~tech,
145 | profit~tech+IP.sp) %>% tidy_dagitty()
146 | ggdag(dag,node_size=20)
147 | ```
148 |
149 | ## Identification
150 |
151 | - Since we can *explain* part of the correlation with `tech`, but we want to *identify* the part of the correlation that ISN'T explained by `tech` (the causal part), we will want to just use what isn't explained by tech!
152 | - Use `tech` to explain `profit`, and take the residual
153 | - Use `tech` to explain `IP.spend`, and take the residual
154 | - The relationship between the first residual and the second residual is *causal*!
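
One way to sketch those residual steps in base R. Using `lm()` and `resid()` to "explain" each variable is an assumption made here for illustration, and the sample size is bumped up to 5000 so the signs come out reliably:

```r
set.seed(1000)
n <- 5000
# Same data generating process as the tech example, written in base R
tech <- sample(c(0, 1), n, replace = TRUE, prob = c(.7, .3))
IP.spend <- 3 * tech + runif(n, min = 0, max = 4)
log.profit <- 2 * tech - .3 * IP.spend + rnorm(n, mean = 2)

# Use tech to explain each variable and keep what's left over
profit.resid <- resid(lm(log.profit ~ tech))
IP.resid <- resid(lm(IP.spend ~ tech))

raw_cor <- cor(log.profit, IP.spend)      # still contains the influence of tech
resid_cor <- cor(profit.resid, IP.resid)  # only what tech doesn't explain
c(raw = raw_cor, controlled = resid_cor)
```

With a binary variable like `tech`, the `lm()` residuals are the same thing as subtracting the mean within each `tech` group, so this is just another way of writing the same adjustment.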
155 |
156 | ## Controlling
157 |
158 | - This process is called "adjusting" or "controlling". We are "controlling for tech" and taking out the part of the relationship that is explained by it
159 | - In doing so, we're looking at the relationship between `IP.spend` and `profit` *just comparing firms that have the same level of `tech`*.
160 | - This is our "apples to apples" comparison that gives us an experiment-like result
161 |
162 | ## Controlling
163 |
164 | ```{r, echo=TRUE, eval=TRUE}
165 | df <- df %>% group_by(tech) %>%
166 | mutate(log.profit.resid = log.profit - mean(log.profit),
167 | IP.spend.resid = IP.spend - mean(IP.spend)) %>% ungroup()
168 | cor(df$log.profit.resid,df$IP.spend.resid)
169 | ```
170 |
171 | - Negative! Hooray
172 |
173 | ## Controlling
174 |
175 | - Imagine we're looking at that relationship *within color*
176 |
177 | ```{r, dev='CairoPNG', echo=FALSE,fig.height=5.5,fig.width=8}
178 | ggplot(mutate(df,tech=factor(tech,labels=c("Not Tech","Tech"))),
179 | aes(x=IP.spend,y=log.profit,color=tech))+geom_point()+ guides(color=guide_legend(title="Firm Type"))+
180 | scale_color_colorblind()
181 | ```
182 |
183 | ## LITERALLY
184 |
185 | ```{r, dev='CairoPNG', echo=FALSE,fig.height=5.5,fig.width=5.5}
186 | df <- df %>%
187 | group_by(tech) %>%
188 | mutate(mean_profit = mean(log.profit),
189 | mean_IP = mean(IP.spend)) %>% ungroup()
190 |
191 | before_cor <- paste("1. Raw data. Correlation between log.profit and IP.spend: ",round(cor(df$log.profit,df$IP.spend),3),sep='')
192 | after_cor <- paste("6. Analyze what's left! cor(log.profit,IP.spend) controlling for tech: ",round(cor(df$log.profit-df$mean_profit,df$IP.spend-df$mean_IP),3),sep='')
193 |
194 |
195 |
196 |
197 | #Add step 2 in which IP.spend is demeaned, and 3 in which both IP.spend and log.profit are, and 4 which just changes label
198 | dffull <- rbind(
199 | #Step 1: Raw data only
200 | df %>% mutate(mean_IP=NA,mean_profit=NA,time=before_cor),
201 | #Step 2: Add x-lines
202 | df %>% mutate(mean_profit=NA,time='2. Figure out what differences in IP.spend are explained by tech'),
203 | #Step 3: IP.spend de-meaned
204 | df %>% mutate(IP.spend = IP.spend - mean_IP,mean_IP=0,mean_profit=NA,time="3. Remove differences in IP.spend explained by tech"),
205 | #Step 4: Remove IP.spend lines, add log.profit
206 | df %>% mutate(IP.spend = IP.spend - mean_IP,mean_IP=NA,time="4. Figure out what differences in log.profit are explained by tech"),
207 | #Step 5: log.profit de-meaned
208 | df %>% mutate(IP.spend = IP.spend - mean_IP,log.profit = log.profit - mean_profit,mean_IP=NA,mean_profit=0,time="5. Remove differences in log.profit explained by tech"),
209 | #Step 6: Raw demeaned data only
210 | df %>% mutate(IP.spend = IP.spend - mean_IP,log.profit = log.profit - mean_profit,mean_IP=NA,mean_profit=NA,time=after_cor))
211 |
212 | p <- ggplot(dffull,aes(y=log.profit,x=IP.spend,color=as.factor(tech)))+geom_point()+
213 | geom_vline(aes(xintercept=mean_IP,color=as.factor(tech)))+
214 | geom_hline(aes(yintercept=mean_profit,color=as.factor(tech)))+
215 | guides(color=guide_legend(title="tech"))+
216 | scale_color_colorblind()+
217 | labs(title = 'The Relationship between log.profit and IP.spend, Controlling for tech \n{next_state}')+
218 | theme(plot.title = element_text(size=14))+
219 | transition_states(time,transition_length=c(12,32,12,32,12,12),state_length=c(160,100,75,100,75,160),wrap=FALSE)+
220 | ease_aes('sine-in-out')+
221 | exit_fade()+enter_fade()
222 |
223 | animate(p,nframes=200)
224 | ```
225 |
226 | ## Recap
227 |
228 | - By controlling for `tech` ("holding it constant") we got rid of the part of the `IP.spend`/`profit` relationship that was explained by `tech`, and so managed to *identify* the `IP.spend -> profit` arrow, the causal effect we're interested in!
229 | - We correctly found that it was negative
230 | - Remember, we made it truly negative when we created the data, all those slides ago
231 |
232 | ## Causal Diagrams
233 |
234 | - And it was the diagram that told us to control for `tech`
235 | - It's going to turn out that diagrams can tell us how to identify things in much more complex circumstances - we'll get to that soon
236 | - But you might have noticed that it was pretty obvious what to do just by looking at the graph
237 |
238 | ## Causal Diagrams
239 |
240 | - Can't we just look at the data to see what we need to control for?
241 | - After all, that would free us from having to make all those assumptions and figure out our model
242 | - No!!!
243 | - Why? Because for a given set of data that we see, there are *many* different data generating processes that could have made it
244 | - Each requiring different kinds of adjustments in order to get it right
245 |
246 | ## Causal Diagrams
247 |
248 | - We observe that `profit` (y), `IP.spend` (x), and `tech` (z) are all related... which is it?
249 |
250 | ```{r dev='CairoPNG', echo=FALSE, fig.width=8, fig.height=5}
251 | ggdag_equivalent_dags(confounder_triangle(x_y_associated=TRUE),node_size=12)
252 | ```
253 |
254 | ## Causal Diagrams
255 |
256 | - With only the data to work with we have *literally no way of knowing* which of those is true
257 | - Maybe `IP.spend` causes companies to be `tech` companies (in 2, 3, 6)
258 | - We know that's silly because we have an idea of what the model is
259 | - But that's what lets us know it's wrong - the model. With just the data we have no clue.
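
A stripped-down, two-variable sketch of the same problem. The coefficient of .5 and the "World A"/"World B" labels are made up for illustration:

```r
set.seed(1000)
n <- 100000
# Noise sized so that both variables end up with variance 1
noise_sd <- sqrt(1 - .5^2)

# World A: x causes y
xA <- rnorm(n)
yA <- .5 * xA + rnorm(n, sd = noise_sd)

# World B: y causes x
yB <- rnorm(n)
xB <- .5 * yB + rnorm(n, sd = noise_sd)

# The data look the same in both worlds
cor_A <- cor(xA, yA)
cor_B <- cor(xB, yB)
c(cor_A = cor_A, cor_B = cor_B, sd_xA = sd(xA), sd_xB = sd(xB))
```

Both worlds give a correlation of about .5 and the same spread in each variable, even though the arrow points in opposite directions. Nothing in the data alone can tell them apart.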
260 |
261 | ## Causal Diagrams
262 |
263 | - Next time we'll set about actually making one of these diagrams
264 | - And soon we'll be looking for what the diagrams tell us about how to identify an effect!
265 |
266 | ## Practice
267 |
268 | - Load in `data(swiss)` and use `help` to look at it
269 | - Get the correlation between Fertility and Education
270 | - Think about what direction the arrows might go on a diagram with Fertility, Education, and Agriculture
271 | - Get the correlation between Fertility and Education *controlling for Agriculture* (use `cut` with `breaks=3`)
272 |
273 | ## Practice Answers
274 |
275 | ```{r, echo=TRUE, eval=FALSE}
276 | data(swiss)
277 | help(swiss)
278 | cor(swiss$Fertility,swiss$Education)
279 |
280 | swiss <- swiss %>%
281 | group_by(cut(Agriculture,breaks=3)) %>%
282 | mutate(Fert.resid = Fertility - mean(Fertility),
283 | Ed.resid = Education - mean(Education))
284 |
285 | cor(swiss$Fert.resid,swiss$Ed.resid)
286 | ```
287 |
288 | ```{r, echo=FALSE, eval=TRUE}
289 | data(swiss)
290 | cor(swiss$Fertility,swiss$Education)
291 |
292 | swiss <- swiss %>%
293 | group_by(cut(Agriculture,breaks=3)) %>%
294 | mutate(Fert.resid = Fertility - mean(Fertility),
295 | Ed.resid = Education - mean(Education))
296 |
297 | cor(swiss$Fert.resid,swiss$Ed.resid)
298 | ```
--------------------------------------------------------------------------------
/Lectures/Lecture_15_Drawing_Causal_Diagrams.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 15 Drawing Causal Diagrams"
3 | author: "Nick Huntington-Klein"
4 | date: "February 27, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
19 | library(tidyverse)
20 | library(dagitty)
21 | library(ggdag)
22 | library(gganimate)
23 | library(ggthemes)
24 | library(Cairo)
25 | theme_set(theme_gray(base_size = 15))
26 | ```
27 |
28 | ## Recap
29 |
30 | - Last time we covered the concept of "controlling" or "adjusting" for a variable
31 | - And we knew what to control for because of our causal diagram
32 | - A causal diagram is your *model* of what you think the data-generating process is
33 | - Which you can use to figure out how to *identify* particular arrows of interest
34 |
35 | ## Today
36 |
37 | - Today we're going to be working through the process of how to *make* causal diagrams
38 | - This will require us to *understand* what is going on in the real world
39 | - (especially since we won't know the right answer, like when we simulate our own data!)
40 | - And then represent our understanding in the diagram
41 |
42 | ## Remember
43 |
44 | - Our goal is to *represent the underlying data-generating process*
45 | - This is going to require some common-sense thinking
46 | - As well as some economic intuition
47 | - In real life, we're not going to know what the right answer is
48 | - Our models will be wrong. We just need them to be useful
49 |
50 | ## Steps to a Causal Diagram
51 |
52 | 1. Consider all the variables that are likely to be important in the data generating process (this includes variables you can't observe)
53 | 2. For simplicity, combine them together or prune the ones least likely to be important
54 | 3. Consider which variables are likely to affect which other variables and draw arrows from one to the other
55 | 4. (Bonus: Test some implications of the model to see if you have the right one)
56 |
57 | ## Some notes
58 |
59 | - Drawing an arrow requires a *direction*. You're making a statement!
60 | - Omitting an arrow is a statement too - you're saying neither causes the other (directly)
61 | - If two variables are correlated but neither causes the other, that means some other (perhaps unobserved) variable causes both - add it!
62 | - There shouldn't be any *cycles* - You shouldn't be able to follow the arrows in one direction and end up where you started
63 | - If there *should* be a feedback loop, like "rich get richer", distinguish between the same variable at different points in time to avoid it
64 |
65 | ## So let's do it!
66 |
67 | - Let's start with an econometrics classic: what is the causal effect of an additional year of education on earnings?
68 | - That is, if we reached in and made someone get one more year of education than they already did, how much more money would they earn?
69 |
70 | ## 1. Listing Variables
71 |
72 | - We can start with our two main variables of interest:
73 | - Education [we call this the "treatment" or "exposure" variable]
74 | - Earnings [the "outcome"]
75 |
76 | ## 1. Listing Variables
77 |
78 | - Then, we can add other variables likely to be relevant
79 | - Focus on variables that are likely to cause or be caused by treatment
80 | - ESPECIALLY if they're related both to the treatment and the outcome
81 | - They don't have to be things you can actually observe/measure
82 | - Variables that affect the outcome but aren't related to anything else aren't really important (you'll see why next week)
83 |
84 | ## 1. Listing Variables
85 |
86 | - So what can we think of?
87 | - Ability
88 | - Socioeconomic status
89 | - Demographics
90 | - Phys. ed requirements
91 | - Year of birth
92 | - Location
93 | - Compulsory schooling laws
94 | - Job connections
95 |
96 | ## 2. Simplify
97 |
98 | - There's a lot going on - in any social science system, there are THOUSANDS of things that could plausibly be in your diagram
99 | - So we simplify. We ignore things that are likely to be only of trivial importance [so Phys. ed is out!]
100 | - And we might try to combine variables that are very similar or might overlap in what we measure [Socioeconomic status, Demographics -> Background]
101 | - Now: Education, Earnings, Background, Year of birth, Location, Compulsory schooling, and Job Connections
102 |
103 | ## 3. Arrows!
104 |
105 | - Consider which variables are likely to cause which others
106 | - And, importantly, *which arrows you can leave out*
107 | - The arrows you *leave out* are important to think about - you sure there's no effect? - and prized! You need those NON-arrows to be able to causally identify anything.
108 |
109 | ## 3. Arrows
110 |
111 | - Let's start with our effect of interest
112 | - Education causes earnings
113 |
114 | ```{r CairoPlot, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
115 | set.seed(1000)
116 | dag <- dagify(Earn~Ed) %>% tidy_dagitty()
117 | ggdag(dag,node_size=20)
118 | ```
119 |
120 | ## 3. Arrows
121 |
122 | - Remaining: Background, Year of birth, Location, Compulsory schooling, and Job Connections
123 | - All of these but Job Connections should cause Ed
124 |
125 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
126 | set.seed(1000)
127 | dag <- dagify(Earn~Ed,
128 | Ed~Bgrd+Year+Loc+Comp) %>% tidy_dagitty()
129 | ggdag(dag,node_size=20)
130 | ```
131 |
132 | ## 3. Arrows
133 |
134 | - Seems like Year of Birth, Location, and Background should ALSO affect earnings. Job Connections, too.
135 |
136 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
137 | set.seed(1000)
138 | dag <- dagify(Earn~Ed+Bgrd+Year+Loc+JobCx,
139 | Ed~Bgrd+Year+Loc+Comp) %>% tidy_dagitty()
140 | ggdag(dag,node_size=20)
141 | ```
142 |
143 | ## 3. Arrows
144 |
145 | - Job connections, in fact, seems like it should be caused by Education
146 |
147 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
148 | set.seed(1000)
149 | dag <- dagify(Earn~Ed+Bgrd+Year+Loc+JobCx,
150 | Ed~Bgrd+Year+Loc+Comp,
151 | JobCx~Ed) %>% tidy_dagitty()
152 | ggdag(dag,node_size=20)
153 | ```
154 |
155 | ## 3. Arrows
156 |
157 | - Location and Background are likely to be related, but neither really causes the other. Make unobservable U1 and have it cause both!
158 |
159 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
160 | set.seed(1000)
161 | dag <- dagify(Earn~Ed+Bgrd+Year+Loc+JobCx,
162 | Ed~Bgrd+Year+Loc+Comp,
163 | JobCx~Ed,
164 | Loc~U1,
165 | Bgrd~U1) %>% tidy_dagitty()
166 | ggdag(dag,node_size=20)
167 | ```
168 |
169 | ## Causal Diagrams
170 |
171 | - And there we have it!
172 | - Perhaps a little messy, but that can happen
173 | - We have modeled our idea of what the data generating process looks like
174 |
175 | ## Causal Diagrams
176 |
177 | - The nice thing about these diagrams is that we can test (some of) our assumptions!
178 | - We can't test assumptions about which direction the arrows go, but notice that the diagram implies that certain relationships WON'T be there!
179 | - For example, in our diagram, all relationships between `Comp` and `JobCx` go through `Ed`
180 | - So if we control for `Ed`, `cor(Comp,JobCx)` should be zero - we can test that! We'll do more of this later.
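
A sketch of that test on simulated data. The data generating process below is hypothetical (continuous variables, effects all equal to 1) and only keeps the `Comp -> Ed -> JobCx` part of the diagram. It controls for `Ed` with `lm()` residuals, which is an assumption that fits here because everything is linear:

```r
set.seed(1000)
n <- 10000
# Hypothetical data generating process: Comp -> Ed -> JobCx
Comp <- rnorm(n)
Ed <- Comp + rnorm(n)
JobCx <- Ed + rnorm(n)

# Without controlling for anything, Comp and JobCx are clearly related
raw_cor <- cor(Comp, JobCx)

# Control for Ed: take out the part of each variable explained by Ed
Comp.resid <- resid(lm(Comp ~ Ed))
JobCx.resid <- resid(lm(JobCx ~ Ed))
controlled_cor <- cor(Comp.resid, JobCx.resid)

c(raw = raw_cor, controlled = controlled_cor)
```

The raw correlation is large, but the controlled one is close to zero, just as the diagram implies. If real data showed a big controlled correlation instead, that would be a sign the diagram is missing an arrow.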
181 |
182 | ## Dagitty.net
183 |
184 | - These graphs can be drawn by hand
185 | - Or we can use a computer to help
186 | - We will be using dagitty.net to draw these graphs
187 | - (You can also draw them with R code - see the slides, but you won't need to know this)
188 |
189 | ## Dagitty.net
190 |
191 | - Go to dagitty.net and click on "Launch"
192 | - You will see an example of a causal diagram with nodes (variables) and arrows
193 | - Plus handy color-coding and symbols. Green triangle for exposure/treatment, and blue bar for outcome.
194 | - The green arrow is the "causal path" we'd like to *identify*
195 |
196 | ## Dagitty.net
197 |
198 | - Go to Model and New Model
199 | - Let's recreate our Education and Earnings diagram
200 | - Put in Education as the exposure and Earnings as the outcome (you can use longer variable names than we've used here)
201 |
202 | ## Dagitty.net
203 |
204 | - Double-click on blank space to add new variables.
205 | - Add all the variables we're interested in.
206 |
207 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
208 | set.seed(1000)
209 | dag <- dagify(Earn~Ed+Bgrd+Year+Loc+JobCx,
210 | Ed~Bgrd+Year+Loc+Comp,
211 | JobCx~Ed,
212 | Loc~U1,
213 | Bgrd~U1) %>% tidy_dagitty()
214 | ggdag(dag,node_size=20)
215 | ```
216 |
217 | ## Dagitty.net
218 |
219 | - Then, double click on some variable `X`, then once on another variable `Y` to get an arrow for `X -> Y`
220 | - Fill in all our arrows!
221 |
222 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
223 | set.seed(1000)
224 | dag <- dagify(Earn~Ed+Bgrd+Year+Loc+JobCx,
225 | Ed~Bgrd+Year+Loc+Comp,
226 | JobCx~Ed,
227 | Loc~U1,
228 | Bgrd~U1) %>% tidy_dagitty()
229 | ggdag(dag,node_size=20)
230 | ```
231 |
232 | ## Dagitty.net
233 |
234 | - Feel free to move the variables around with drag-and-drop to make it look a bit nicer
235 | - You can even drag the arrows to make them curvy
236 |
237 |
238 | ## Dagitty.net
239 |
240 | - Next, we can classify some of our variables.
241 | - Hover over `U1` and tap the 'u' key to make Dagitty register it as unobservable
242 |
243 | ## Dagitty.net
244 |
245 | - Now, look at all that red!
246 | - Dagitty uses red to show you that a variable is going to cause problems for us! Like `tech` did last lecture
247 | - We can tell Dagitty that we plan to adjust/control for these variables in our analysis by hovering over them and hitting the 'a' key
248 | - When there's no red, we've identified the green arrow we're interested in!
249 |
250 | ## Dagitty.net
251 |
252 | - Notice that all Dagitty knows is what we told it about our model
253 | - And that was enough for it to tell us what we needed to adjust for to identify our effect!
254 | - That's not super complex computer magic, it's actually a fairly simple set of rules, which we'll go over next time.
255 |
256 | ## Practice
257 |
258 | - Consider the causal question "does a longer night's sleep extend your lifespan?"
259 |
260 | 1. List variables [think especially hard about what things might lead you to get a longer night's sleep]
261 | 2. Simplify
262 | 3. Arrows
263 |
264 | - Then, when you're done, draw it in Dagitty.net!
265 |
--------------------------------------------------------------------------------
/Lectures/Lecture_16_Back_Doors.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 16 Back Doors"
3 | author: "Nick Huntington-Klein"
4 | date: "March 3, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
19 | library(tidyverse)
20 | library(dagitty)
21 | library(ggdag)
22 | library(gganimate)
23 | library(ggthemes)
24 | library(Cairo)
25 | theme_set(theme_gray(base_size = 15))
26 | ```
27 |
28 | ## Recap
29 |
30 | - We've now covered how to create causal diagrams
31 | - (aka Directed Acyclic Graphs, if you're curious what "dag"itty means)
32 | - We simply write out the list of the important variables, and draw causal arrows indicating what causes what
33 | - This allows us to figure out what we need to do to *identify* our effect of interest
34 |
35 | ## Today
36 |
37 | - But HOW? How does it know?
38 | - Today we'll be covering the *process* that lets you figure out whether you can identify your effect of interest, and how
39 | - It turns out, once we have our diagram, to be pretty straightforward
40 | - So easy a computer can do it!
41 |
42 | ## The Back Door and the Front Door
43 |
44 | - The basic way we're going to be thinking about this is with a metaphor
45 | - When you do data analysis, it's like observing that someone left their house for the day
46 | - When you do causal inference, it's like asking *how they left their house*
47 | - You want to *make sure* that they came out the *front door*, and not out the back door, not out the window, not out the chimney
48 |
49 | ## The Back Door and the Front Door
50 |
51 | - Let's go back to this example
52 |
53 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4.5}
54 | dag <- dagify(IP.sp~tech,
55 | profit~tech+IP.sp) %>% tidy_dagitty()
56 | ggdag(dag,node_size=20)
57 | ```
58 |
59 | ## The Back Door and the Front Door
60 |
61 | - We're interested in the effect of IP spend on profits. That means that our *front door* is the ways in which IP spend *causally affects* profits
62 | - Our *back door* is any other thing that might drive a correlation between the two - the way that tech affects both
63 |
64 | ## Paths
65 |
66 | - In order to formalize this a little more, we need to think about the various *paths*
67 | - We observe that you got out of your house, but we want to know the paths you might have walked to get there
68 | - So, what are the paths we can walk to get from IP.spend to profits?
69 |
70 | ## Paths
71 |
72 | - We can go `IP.spend -> profit`
73 | - Or `IP.spend <- tech -> profit`
74 |
75 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4.5}
76 | dag <- dagify(IP.sp~tech,
77 | profit~tech+IP.sp) %>% tidy_dagitty()
78 | ggdag(dag,node_size=20)
79 | ```
80 |
81 | ## The Back Door and the Front Door
82 |
83 | - One of these paths is the one we're interested in!
84 | - `IP.spend -> profit` is a *front door path*
85 | - One of them is not!
86 | - `IP.spend <- tech -> profit` is a *back door path*
87 |
88 | ## Now what?
89 |
90 | - Now, it's pretty simple!
91 | - In order to make sure you came through the front door...
92 | - We must *close the back door*
93 | - We can do this by *controlling/adjusting* for things that will block that door!
94 | - We can close `IP.spend <- tech -> profit` by adjusting for `tech`
95 |
96 | ## So?
97 |
98 | - We already knew that we could get our desired effect in this case by controlling for `tech`.
99 | - But this process lets us figure out what we need to do in a *much wider range of situations*
100 | - All we need to do is follow the steps!
101 | - List all the paths
102 | - See which are back doors
103 | - Adjust for a set of variables that closes all the back doors!
104 |
105 | ## Example
106 |
107 | - How does wine affect your lifespan?
108 |
109 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
110 | dag <- dagify(life~wine+drugs+health+income,
111 | drugs~wine,
112 | wine~health+income,
113 | health~U1,
114 | income~U1,
115 | coords=list(
116 | x=c(life=5,wine=2,drugs=3.5,health=3,income=4,U1=3.5),
117 | y=c(life=3,wine=3,drugs=2,health=4,income=4,U1=4.5)
118 | )) %>% tidy_dagitty()
119 | ggdag(dag,node_size=20)
120 | ```
121 |
122 | ## Paths
123 |
124 | - Paths from `wine` to `life`:
125 | - `wine -> life`
126 | - `wine -> drugs -> life`
127 | - `wine <- health -> life`
128 | - `wine <- income -> life`
129 | - `wine <- health <- U1 -> income -> life`
130 | - `wine <- income <- U1 -> health -> life`
131 | - Don't leave any out, even the ones that seem redundant!
132 |
133 | ## Paths
134 |
135 | - Front doors/Back doors
136 | - `wine -> life`
137 | - `wine -> drugs -> life`
138 | - `wine <- health -> life`
139 | - `wine <- income -> life`
140 | - `wine <- health <- U1 -> income -> life`
141 | - `wine <- income <- U1 -> health -> life`
142 |
143 | ## Adjusting
144 |
145 | - By adjusting for variables we close these back doors
146 | - If an adjusted variable appears anywhere along the path, we can close that path off
147 | - Once *ALL* the back door paths are closed, we have blocked all the other ways that a correlation COULD appear except through the front door! We've identified the causal effect!
148 | - This is "the back door method" for identifying the effect. There are other methods; we'll get to them.
149 |
150 | ## Adjusting for Health
151 |
152 | - Front doors/Open back doors/Closed back doors
153 | - `wine -> life`
154 | - `wine -> drugs -> life`
155 | - `wine <- health -> life`
156 | - `wine <- income -> life`
157 | - `wine <- health <- U1 -> income -> life`
158 | - `wine <- income <- U1 -> health -> life`
159 |
160 | ## Adjusting for Health
161 |
162 | - Clearly, adjusting for health isn't ENOUGH to identify
163 | - We need to adjust for health AND income
164 | - We haven't covered how to actually control for multiple variables
165 | - We won't be focusing on it, but it's important for us to be able to know what *needs* to be controlled for
166 |
167 | ## Adjusting for Health and Income
168 |
169 | - Front doors/Open back doors/Closed back doors
170 | - `wine -> life`
171 | - `wine -> drugs -> life`
172 | - `wine <- health -> life`
173 | - `wine <- income -> life`
174 | - `wine <- health <- U1 -> income -> life`
175 | - `wine <- income <- U1 -> health -> life`
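
The same list-the-paths-then-adjust steps can be sketched with the `dagitty` R package, assuming the wine diagram is re-entered by hand as below. `paths()` lists every path between two variables and `adjustmentSets()` reports what to control for to close the back doors.

```r
library(dagitty)

# The wine and lifespan diagram from these slides
g <- dagitty("dag {
  wine -> life
  wine -> drugs
  drugs -> life
  health -> life
  income -> life
  health -> wine
  income -> wine
  U1 -> health
  U1 -> income
}")
latents(g) <- "U1"  # we can't observe U1, so we can't adjust for it

# Every path from wine to life, front door or back door
wine.paths <- paths(g, from = "wine", to = "life")
wine.paths$paths

# Smallest set(s) of observed variables that close all back doors
sets <- adjustmentSets(g, exposure = "wine", outcome = "life")
sets
```

This should agree with what we found by hand: six paths, and `health` plus `income` as the set to adjust for.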
176 |
177 |
178 | ## How about Drugs?
179 |
180 | - Should we adjust for drugs?
181 | - No! This whole procedure makes that clear
182 | - It's on a *front door path*
183 | - If we adjusted for that, that's shutting out part of the way that `wine` *DOES* affect `life`
184 |
185 | ## The Front Door
186 |
187 | - In fact, remember, our real goal isn't necessarily to close the back doors
188 | - It's to make sure you came through the front door!
189 | - Sometimes (rarely), we can actually isolate the front door ourselves
190 |
191 | ## The Front Door
192 |
193 | - Imagine this version
194 |
195 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
196 | dag <- dagify(life~drugs+health+income,
197 | drugs~wine,
198 | wine~health+income,
199 | health~U1,
200 | income~U1,
201 | coords=list(
202 | x=c(life=5,wine=2,drugs=3.5,health=3,income=4,U1=3.5),
203 | y=c(life=3,wine=3,drugs=2,health=4,income=4,U1=4.5)
204 | )) %>% tidy_dagitty()
205 | ggdag(dag,node_size=20)
206 | ```
207 |
208 | ## The Front Door
209 |
210 | - This makes it real clear that you shouldn't control for drugs - that shuts the FRONT door! There's no way to get out of your house EXCEPT through the back door!
211 | - Note in this case that there's no back door from `wine` to `drugs`
212 | - And if we control for `wine`, no back door from `drugs` to `life` (let's check this by hand)
213 | - So we can identify `wine -> drugs` and we can identify `drugs -> life`, and combine them to get `wine -> life`!
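
Here is a small simulated sketch of that combine-the-two-pieces idea. Everything in it is made up for illustration: a single unobserved `U` stands in for `health`, `income`, and `U1`, the numbers are arbitrary, and it uses regression with `lm()` to do the controlling. The true effect of `wine` on `life` is built in as `0.5 * -2 = -1`.

```r
set.seed(1)
n <- 10000

# Hypothetical data generating process matching the front door diagram
U     <- rnorm(n)                      # unobserved back door variable
wine  <- U + rnorm(n)                  # U causes wine
drugs <- 0.5 * wine + rnorm(n)         # wine -> drugs, effect 0.5
life  <- -2 * drugs + 3 * U + rnorm(n) # drugs -> life, effect -2; U causes life

# Ignoring the back door gives the wrong answer
naive <- coef(lm(life ~ wine))["wine"]

# Piece 1: wine -> drugs has no back door
wine.to.drugs <- coef(lm(drugs ~ wine))["wine"]

# Piece 2: drugs -> life, controlling for wine to close the back door
drugs.to.life <- coef(lm(life ~ drugs + wine))["drugs"]

# Combine the two pieces to get wine -> life
front.door <- wine.to.drugs * drugs.to.life

c(naive = unname(naive), front.door = unname(front.door))
```

The front door estimate should land near the true `-1` even though `U` is never used in the analysis, while the naive estimate does not.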
214 |
215 | ## The Front Door
216 |
217 | - This is called the "front door method"
218 | - Much less common than the back door method, but actually older
219 | - So we'll only cover it briefly
220 | - Historical relevance! This is similar to how they proved that cigarettes cause cancer
221 |
222 | ## Cigarettes and Cancer
223 |
224 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
225 | dag <- dagify(tar~cigs,
226 | cigs~health+income,
227 | life~tar+health+income,
228 | coords=list(
229 | x=c(cigs=1,tar=2,health=2,income=2,life=3),
230 | y=c(cigs=2,tar=2,health=1,income=3,life=2)
231 | )) %>% tidy_dagitty()
232 | ggdag(dag,node_size=20)
233 | ```
234 |
235 | ## Paths
236 |
237 | - Front doors/Back doors
238 | - `cigs -> tar -> cancer`
239 | - `cigs <- income -> cancer`
240 | - `cigs <- health -> cancer`
241 |
242 | ## Paths
243 |
244 | - Closing these back doors is the problem that epidemiologists faced
245 | - They can't just run an experiment!
246 | - Problem: there are actually MANY back doors we're not listing
247 | - And sometimes we can't observe/control for these things
248 | - How can you possibly measure "health" well enough to actually control for it?
249 |
250 | ## The Front Door Method
251 |
252 | - So, noting that there's no back door from `cigs` to `tar`, and, once you control for `cigs`, no back door from `tar` to `cancer`, they combined these two effects to get the causal effect of `cigs` on `life`
253 | - This is how we established this causal effect!
254 | - Doctors had a decent idea that cigs caused cancer before, but some doctors disagreed
255 | - And they had good reason to disagree! The back doors were VERY plausible reasons to see a `cigs`/`cancer` correlation other than the front door
256 |
257 | ## Practice
258 |
259 | - We want to know how `X` affects `Y`. Find all paths and make a list of what to adjust for to close all back doors
260 |
261 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
262 | dag <- dagify(Y~X+A+B+C+E,
263 | X~A+B+D,
264 | E~X,
265 | A~U1+C,
266 | B~U1,
267 | coords=list(
268 | x=c(X=1,E=2.5,A=2,B=3.5,C=1.5,D=1,Y=4,U1=2.5),
269 | y=c(X=2,E=2.25,A=3,B=3,C=4,D=3,Y=2,U1=4)
270 | )) %>% tidy_dagitty()
271 | ggdag(dag,node_size=20)
272 | ```
273 |
274 | ## Practice Answers
275 |
276 | - Front door paths: `X -> Y`, `X -> E -> Y`
277 | - Back doors: `X <- A -> Y`, `X <- B -> Y`, `X <- A <- U1 -> B -> Y`, `X <- B <- U1 -> A -> Y`, `X <- A <- C -> Y`, `X <- B <- U1 -> A <- C -> Y`
278 | - (that last back door is actually pre-closed, we'll get to that later)
279 | - We can close all back doors by adjusting for `A` and `B`.
--------------------------------------------------------------------------------
/Lectures/Lecture_17_Article_Traffic.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_17_Article_Traffic.pdf
--------------------------------------------------------------------------------
/Lectures/Lecture_17_Causal_Diagrams_Practice.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 17 Causal Diagram Practice"
3 | author: "Nick Huntington-Klein"
4 | date: "March 3, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
19 | library(tidyverse)
20 | library(dagitty)
21 | library(ggdag)
22 | library(gganimate)
23 | library(ggthemes)
24 | library(Cairo)
25 | theme_set(theme_gray(base_size = 15))
26 | ```
27 |
28 | ## Recap
29 |
30 | - To make a diagram:
31 | - List all the relevant variables
32 | - Combine identical variables, eliminate unimportant ones
33 | - Draw arrows to indicate what you think causes what
34 | - (See if the model implies that any relationships SHOULDN'T exist and test that)
35 | - Think carefully!
36 |
37 | ## Recap
38 |
39 | - If we're interested in the effect of `X` on `Y`:
40 | - Write the list of all paths from `X` to `Y`
41 | - Figure out which are front-door paths (going from `X` to `Y`)
42 | - and which are back-door paths (other ways)
43 | - Then figure out what set of variables need to be controlled/adjusted for to close those back doors
44 |
45 | ## Testing Relationships
46 |
47 | - Just a little more detail on this "testing relationships" thing
48 | - Our use of front-door and back-door paths means that we can look at *any two variables* in our diagram and say "hmm, if I control for A, B, and C, then that closes all front and back door paths between D and E"
49 | - So, if we control for A, B, and C, then D and E should be unrelated!
50 | - If `cor(D,E)` controlling for A, B, C is big, our diagram is probably wrong!
51 |
52 | ## Testing Relationships
53 |
54 | - What are some relationships we can test?
55 |
56 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5}
57 | dag <- dagify(B~A+D,
58 | C~B+D,
59 | E~C) %>% tidy_dagitty()
60 | ggdag(dag,node_size=20)
61 | ```
62 |
63 | ## Testing Relationships
64 |
65 | - We should get no relationship between: A and E controlling for C, B and E controlling for C, D and E controlling for C
66 | - (also D and A controlling for nothing, but we haven't gotten to why that one works yet, and A and C controlling for B and D, but we haven't covered why we need D there)
67 | - We'll be looking out for opportunities to test our models as we move forward! (note: dagitty will give us a list of what we can test!)
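
A sketch of getting that list in R rather than on dagitty.net, assuming the diagram above is re-entered by hand. `impliedConditionalIndependencies()` prints the relationships the diagram says should NOT exist, and `dseparated()` checks one of them at a time.

```r
library(dagitty)

# The testing diagram from the previous slide
g <- dagitty("dag {
  A -> B
  D -> B
  B -> C
  D -> C
  C -> E
}")

# Everything the diagram says should be unrelated, and given which controls
tests <- impliedConditionalIndependencies(g)
tests

# Checking single claims: A and C need BOTH B and D controlled for
dseparated(g, "A", "C", c("B", "D"))
dseparated(g, "A", "C", "B")
```

Each line of the printed list reads as "these two should be unrelated once we control for whatever comes after the bar".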
68 |
69 |
70 | ## Today
71 |
72 | - In groups, read [the assigned article](https://www.nytimes.com/2019/01/21/upshot/stuck-and-stressed-the-health-costs-of-traffic.html)
73 | - Pick one of the causal claims in the article (there are a lot!)
74 | - [Hint: words like "improve" "affect" "reduces", ask if you're not sure]
75 | - Draw a diagram to investigate that causal question
76 | - Determine what needs to be controlled for to identify that effect
77 | - If there's a linked study to explain the claim, try to look at it and see if they use the appropriate controls
78 | - Extra time? Do another claim
79 |
80 | ## Causal Inference in the News
81 |
82 | - Not long: 1-2 pages single-spaced plus diagram.
83 | - Find a news article that makes a causal claim (like the one we just did, but not that one) and interpret that claim by drawing an appropriate diagram in dagitty
84 | - Doesn't necessarily need to be a claim backed by a study or evidence, but it makes the assignment easier if it is
85 | - Justify your diagram (both your choice of variables and your arrows)
86 | - Explain how you would identify the causal claim, and discuss whether you think the article did so or not.
--------------------------------------------------------------------------------
/Lectures/Lecture_22_Flammer.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_22_Flammer.PNG
--------------------------------------------------------------------------------
/Lectures/Lecture_24_AutorDornHanson.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_24_AutorDornHanson.PNG
--------------------------------------------------------------------------------
/Lectures/Lecture_25_Instrumental_Variables_in_Action.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 25 Instrumental Variables in Action"
3 | author: "Nick Huntington-Klein"
4 | date: "March 28, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
19 | library(tidyverse)
20 | library(dagitty)
21 | library(ggdag)
22 | library(gganimate)
23 | library(ggthemes)
24 | library(Cairo)
25 | theme_set(theme_gray(base_size = 15))
26 | ```
27 |
28 | ## Causal Inference Midterm
29 |
30 | - One week from today
31 | - Similar format to the homeworks we've been having
32 | - At least one question evaluating a research question and drawing a dagitty graph
33 | - At least one question identifying the right causal inference method to use
34 | - At least one question about the feature(s) of the methods
35 | - At least one question carrying out a method in R
36 |
37 | ## Causal Inference Midterm
38 |
39 | - Covers everything up to today (obviously, a focus on things since the Programming Midterm, but there is a little programming)
40 | - No internet (except dagitty) or slides available this time
41 | - One 3x5 index card, front and back
42 | - You'll have the whole class period so don't be late!
43 |
44 | ## Recap
45 |
46 | - Instrumental Variables is sort of like the opposite of controlling for a variable
47 | - You isolate *just* the parts of `X` and `Y` that you can explain with the IV `Z`
48 | - If `Z` is related to `X` but all effects of `Z` on `Y` go THROUGH `X`, you've isolated a causal effect of `X` on `Y` by isolating just the causal part of `X` and ignoring all the back doors!
49 | - If `Z` is binary, get difference in `Y` divided by difference in `X`
50 | - If not, get correlation between explained `Y` and explained `X`
51 |
52 | ## Recap
53 |
54 | - In macroeconomics, how does US income affect US expenditures ("marginal propensity to consume")?
55 | - We can instrument with investment from LAST year.
56 |
57 | ```{r, echo=TRUE}
58 | library(AER)
59 | #US income and consumption data 1950-1993
60 | data(USConsump1993)
61 | USC93 <- as.data.frame(USConsump1993)
62 |
63 | #lag() gets the observation above; here the observation above is last year
64 | IV <- USC93 %>% mutate(lastyr.invest = lag(income) - lag(expenditure)) %>%
65 | group_by(cut(lastyr.invest,breaks=10)) %>%
66 | summarize(exp = mean(expenditure),inc = mean(income))
67 | cor(IV$exp,IV$inc)
68 | ```
69 |
70 | ## Today
71 |
72 | - We're going to be looking at several implementations of IV in real studies
73 | - We'll be looking at what they did and also asking ourselves what their causal diagrams might be
74 |
75 | ## Today
76 | - And whether we believe them! What would the diagram be for that expenditure/income example? Do we believe that there's really no back door from last year's investment to this year's expenditure? Really?
77 | - Every identification implies a diagram... and diagrams come from assumptions. We *always* want to think about whether we believe those assumptions
78 | - Remember, in any case, each of these is *just one study*. I could cite you equally plausible studies on these topics that found different findings in different contexts
79 |
80 | ## College Remediation
81 |
82 | - In most colleges, if you come unprepared to take first-year courses, you must first take remedial courses
83 | - Do these classes help you persist in college?
84 | - On one hand they can help ease you into the first-year courses
85 | - On the other hand you might get discouraged and drop out
86 |
87 | ## College Remediation
88 |
89 | - What's our diagram? Include `Rem`ediation, `Pers`istence, other things.
90 | - Keep in mind that many of the things that would cause you to take remediation in the first place are the same things that might lead you to drop out (difficulty with material, dislike for school, etc.)
91 | - Sketch out a diagram
92 |
93 | ## College Remediation
94 |
95 | - Bettinger & Long (2009) use data from Ohio and notice that the policy determining who goes to remediation varies from college to college
96 | - And also, as is generally well-known, people tend to go to the college closest to them
97 | - So for each student and each college, they calculate whether that student would be in remediation at that college
98 | - And use "Would be in remediation at your closest college" (`RemC`) as an instrument for "Actually in remediation" (`RemA`)
99 |
100 | ## College Remediation
101 |
102 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=6}
103 | dag <- dagify(RemA~RemC+A+B+C,
104 | Pers~RemA+A+B+C,
105 | coords=list(
106 | x=c(RemA=1,RemC=0,A=1.5,B=2,C=2.5,Pers=3),
107 | y=c(RemA=1,RemC=1,A=2,B=2,C=2,Pers=1)
108 | )) %>% tidy_dagitty()
109 | ggdag(dag,node_size=20)
110 | ```
111 |
112 | ## College Remediation
113 |
114 | - Bettinger & Long find positive effects of college remediation on persistence
115 | - But do we believe this IV?
116 | - Let's think - any possible other ways from `RemC` to `Pers`?
117 | - Keep in mind, `RemC` is based on the `loc`ation where you live, `loc -> RemC`. What else might be related to where you live?
118 | - How could we test the diagram?
119 |
120 | ## Medical Care Expenditures
121 |
122 | - One part of the health care debate is how much health care should be paid for by the person *using it* and how much should be paid for by *society*
123 | - One concern of taking the burden off of the user is that people might use way more medical care than they need if they're not paying for it
124 | - How does *the price you pay* for health care affect *how much you use it*?
125 |
126 | ## Medical Care Expenditures
127 |
128 | - So how does `price` affect `use`?
129 | - Something to keep in mind is that because of many varied insurance and social safety net programs, the `price` for the same procedure varies wildly between people
130 | - And might be affected by `inc`ome, `empl`oyment, what else?
131 | - Draw a diagram!
132 | - Before I show it, can you think of an instrument?
133 |
134 | ## Medical Care Expenditures
135 |
136 | - Kowalski (2016) notices that in many family insurance plans, if your family member is injured, the cost-sharing in the plan means that if *you* then get injured, you'll pay less for your care
137 | - So `fam`ily injury is an instrument for price
138 |
139 | ## Medical Care Expenditures
140 |
141 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=6}
142 | dag <- dagify(price~fam+A+B+C,
143 | use~price+A+B+C,
144 | coords=list(
145 | x=c(price=1,fam=0,A=1.5,B=2,C=2.5,use=3),
146 | y=c(price=1,fam=1,A=2,B=2,C=2,use=1)
147 | )) %>% tidy_dagitty()
148 | ggdag(dag,node_size=20)
149 | ```
150 |
151 | ## Medical Care Expenditures
152 |
153 | - Kowalski (2016) finds that a 10% price reduction increases use by 7-11%.
154 | - So do we believe this one?
155 | - Can you imagine any ways in which a family member's use of medical care might affect your use of medical care except through the price you face?
156 | - Let's consider some that we might be able to control for (and she does) and some we might not
157 | - Any back doors we can imagine? What could we test?
158 |
159 | ## Stock Market Indexing
160 |
161 | - The Russell 1000 stock index indexes the top 1000 largest firms by market `cap`italization, and the Russell 2000 indexes the next top 2000
162 | - Both indices are value-weighted, so "big fish-small pond" stocks at the top of the 2000 have more money coming to them than "small fish-big pond" stocks at the bottom of the 1000
163 | - Even though the price shouldn't be affected by something as immaterial as what stocks you're being compared to... maybe it is!
164 |
165 | ## Stock Market Indexing
166 |
167 | - Does *where you're listed* affect your `price`?
168 | - Draw that diagram!
169 | - Keep in mind that your listing is based on your `cap` - just big enough for the 1000 and you're on the 1000, not quite there and you're on the `R2000`
170 | - All sorts of firm qualities may affect your `cap` and also your `price`
171 | - Any guesses as to what might be a good instrument?
172 |
173 | ## Stock Market Indexing
174 |
175 | - Sounds like a regression discontinuity, not an IV! What gives?
176 | - It's BOTH! This is called "fuzzy RD"
177 | - When the regression discontinuity isn't perfect - crossing the cutoff takes you from, say, 40% treated to 60% rather than 0% to 100% - it's sort of like the experiments with imperfect assignment we covered
178 | - And the fix is the same - an IV! Being `above` is an IV for being listed on `R2000`
179 |
180 | ## Stock Market Indexing
181 |
182 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=6}
183 | dag <- dagify(R2000~above+A+B+C,
184 | price~R2000+cap+A+B+C,
185 | above~cap,
186 | cap~A+B+C,
187 | coords=list(
188 | x=c(R2000=1,above=0,A=1.5,B=2,C=2.5,price=3,cap=0),
189 | y=c(R2000=1,above=1,A=2,B=2,C=2,price=1,cap=1.5)
190 | )) %>% tidy_dagitty()
191 | ggdag(dag,node_size=20)
192 | ```
193 |
194 | ## Company Tax Incentives
195 |
196 | - It's common for localities to offer tax incentives to bring companies to town
197 | - Economists tend not to like this, as that company probably would have been just as productive elsewhere, so it's a giveaway to them
198 | - But, selfishly, it's nice if that company is producing in *your* city rather than elsewhere, right? Jobs!
199 | - What is the impact of "Empowerment Zone (`EZ`)" tax incentives on `emp`loyment? Draw the diagram!
200 |
201 | ## Company Tax Incentives
202 |
203 | - This is looking pretty familiar by now
204 |
205 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=5.5}
206 | dag <- dagify(EZ~comm+A+B+C,
207 | emp~EZ+A+B+C+rep,
208 | comm~rep,
209 | coords=list(
210 | x=c(EZ=1,comm=0,A=1.5,B=2,C=2.5,emp=3,rep=0),
211 | y=c(EZ=1,comm=1,A=2,B=2,C=2,emp=1,rep=2)
212 | )) %>% tidy_dagitty()
213 | ggdag(dag,node_size=20)
214 | ```
215 |
216 | ## Company Tax Incentives
217 |
218 | - Hanson (2009) uses *the political power of your congressperson* as an IV for `EZ`
219 | - Did your representative make it onto a powerful `comm`ittee, increasing the chances of getting an EZ for their district? IV! (controlling for the `rep`, we're looking before/after here, like diff-in-diff)
220 | - Do we believe this? Any potential problems? What could we test?
221 |
222 | ## Others
223 |
224 | - Try to think of a decent IV
225 | - For *any* causal question
226 | - Remember, you must have:
227 | - `Z` is related to `X`
228 | - All back doors from `Z` to `Y` must be closed
229 | - All front doors from `Z` to `Y` must go through `X`
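
One way to check a candidate IV against a diagram is sketched below, assuming the college remediation diagram from earlier is re-entered by hand. The `dagitty` R package has an `instrumentalVariables()` function that lists variables satisfying these conditions in a given diagram. It can only tell us what the diagram implies, so whether we believe the diagram is still up to us.

```r
library(dagitty)

# The college remediation diagram from earlier in these slides
g <- dagitty("dag {
  RemC -> RemA
  A -> RemA
  B -> RemA
  C -> RemA
  RemA -> Pers
  A -> Pers
  B -> Pers
  C -> Pers
}")

# Which variables could serve as instruments for RemA -> Pers?
ivs <- instrumentalVariables(g, exposure = "RemA", outcome = "Pers")
ivs
```

If we add an arrow like `RemC -> Pers` or a shared cause of `RemC` and `Pers` to the diagram and rerun this, `RemC` should no longer qualify without extra controls.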
--------------------------------------------------------------------------------
/Lectures/Lecture_27_Explaining_Better_Regression.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 27 Explaining Better - Regression"
3 | author: "Nick Huntington-Klein"
4 | date: "April 3, 2019"
5 | output:
6 | revealjs::revealjs_presentation:
7 | theme: solarized
8 | transition: slide
9 | self_contained: true
10 | smart: true
11 | fig_caption: true
12 | reveal_options:
13 | slideNumber: true
14 |
15 | ---
16 |
17 | ```{r setup, include=FALSE}
18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
19 | library(tidyverse)
20 | library(dagitty)
21 | library(ggdag)
22 | library(ggthemes)
23 | library(Cairo)
24 | theme_set(theme_gray(base_size = 15))
25 | ```
26 |
27 | ## Final Exam
28 |
29 | - Combination of both programming and causal inference methods
30 | - Everything is fair game
31 | - There will also be a subjective question in which you take a causal question, develop a diagram, and perform the analysis with data I give you
32 | - Slides and dagitty will be available, no other internet
33 |
34 | ## This Week
35 |
36 | - We'll be doing a little bit of review this last week
37 | - And also talking about other ways of *explaining* data beyond what we've done
38 | - This last week of material *will not be on the final* but it will be great prep for any upcoming class you take on this, or if you want to apply the ideas you've learned in class in the real world
39 |
40 | ## Explaining Better
41 |
42 | - So far, all of our methods have had to do with *explaining* one variable with another
43 | - After all, causal inference is all about looking at the effect of one variable on another
44 | - If that explanation is causally identified, we're good to go
45 | - Or, if that explanation is on a back door of what we're interested in, we'll explain what we can and take it out
46 |
47 | ## Explaining Better
48 |
49 | - The way that we've been explaining `A` with `B` so far:
50 | - Take the different values of `B` (if it's continuous, use bins with `cut()`)
51 | - For observations with each of those different values, take the mean of `A`
52 | - That mean is the "explained" part, the rest is the "residual"
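
As a reminder of what those steps look like in code, here is a minimal sketch on invented data. The variables `A` and `B`, the number of bins, and the relationship between them are all made up for illustration.

```r
library(dplyr)

set.seed(3)
# Hypothetical data: A depends on a continuous B, plus noise
df <- data.frame(B = runif(500, 0, 10))
df$A <- 2 * df$B + rnorm(500)

df <- df %>%
  # Bin the continuous B
  group_by(bin = cut(B, breaks = 5)) %>%
  # The mean of A within each bin is the explained part; the rest is residual
  mutate(A.explained = mean(A),
         A.residual = A - A.explained) %>%
  ungroup()

# Share of the variance in A that B explains
share <- var(df$A.explained) / var(df$A)
share
```

The explained and residual parts always add back up to `A`, and the residual averages out to zero within each bin.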
53 |
54 | ## Explaining Better
55 |
56 | - Now, this is the basic idea of explaining - what value of `A` can we expect, given the value of `B` we're looking at?
57 | - But this isn't the only way to put that idea into action!
58 |
59 | ## Regression
60 |
61 | - The way that explaining is done most of the time is with a method called *regression*
62 | - You might be familiar with regression if you've taken ISDS 361A
63 | - But here we're going to go more into detail on what it actually is and how it relates to causal inference
64 |
65 | ## Regression
66 |
67 | - The idea of regression is the same as our approach to explaining - for different values of `B`, predict `A`
68 | - But what's different? In regression, you impose a little more structure on that prediction
69 | - Specifically, when `B` is continuous, you require that the relationship between `B` and `A` follows a *straight line*
70 |
71 | ## Regression
72 |
73 | - Let's look at wages and experience in Belgium
74 |
75 | ```{r, echo=TRUE}
76 | library(Ecdat)
77 | data(Bwages)
78 |
79 | #Explain wages with our normal method
80 | Bwages <- Bwages %>% group_by(cut(exper,breaks=8)) %>%
81 | mutate(wage.explained = mean(wage)) %>% ungroup()
82 |
83 | #Explain wages with regression
84 | #lm(wage~exper) regresses wage on exper, and predict() gets the explained values
85 | Bwages <- Bwages %>%
86 | mutate(wage.reg.explained = predict(lm(wage~exper)))
87 |
88 | #What's in a regression? An intercept and a slope! Like I said, it's a line.
89 | lm(wage~exper,data=Bwages)
90 | ```
91 |
92 | ## Regression
93 |
94 | ```{r, echo=FALSE, fig.width=8, fig.height=6}
95 | ggplot(filter(Bwages,wage<20),aes(x=exper,y=wage,color='Raw'))+geom_point(alpha=.1)+
96 | labs(x='Experience',y='Wage')+
97 | geom_point(aes(x=exper,y=wage.explained,color='Our Method'))+
98 | geom_point(aes(x=exper,y=wage.reg.explained,color='Regression'))+
99 | scale_color_manual(values=c('Raw'='black','Our Method'='red','Regression'='blue'))
100 | ```
101 |
102 | ## Regression
103 |
104 | - Okay, so it's the same thing but it's a straight line. Who cares?
105 | - Regression brings us some benefits, but also has some costs (there's always a tradeoff...)
106 |
107 | ## Regression Benefits
108 |
109 | - It boils down the relationship to be much simpler to explain
110 | - Instead of reporting eight different means, I can just give an intercept and a slope!
111 |
112 | ```{r, echo=FALSE}
113 | lm(wage~exper,data=Bwages)
114 | ```
115 |
116 | - We can interpret this easily as "one more year of `exper` is associated with `r round(coef(lm(wage~exper,data=Bwages))[2], 3)` higher wages"
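
To see that the regression's explained values really are just points on that line, here is a small sketch on simulated data (the numbers are made up for illustration and are not taken from `Bwages`). The output of `predict()` should match intercept + slope*`exper` computed by hand.

```{r, echo=TRUE}
set.seed(305)
# Made-up wage data with a true slope of 0.1
sim.wages <- data.frame(exper = runif(200, min = 0, max = 40))
sim.wages$wage <- 8 + 0.1*sim.wages$exper + rnorm(200)

sim.reg <- lm(wage ~ exper, data = sim.wages)
intercept <- coef(sim.reg)[1]
slope <- coef(sim.reg)[2]

# The explained values by hand: a point on the line for each observation
by.hand <- intercept + slope*sim.wages$exper

# Check that this is the same thing predict() gives us
all.equal(unname(predict(sim.reg)), unname(by.hand))
```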
117 |
118 | ## Regression Benefits
119 |
120 | - This makes it much easier to explain using multiple variables at once
121 | - This is important when we're doing causal inference. If we want to close multiple back doors, we have to control for multiple variables at once. This is unwieldy with our approach
122 | - With regression we just add another dimension to our line and add another slope! As many as we want
123 |
124 | ```{r, echo=FALSE}
125 | #Bwages appears to drop the sex variable in some versions? Let's make it up
126 | Bwages <- Bwages %>%
127 | mutate(male = runif(n()) + (wage > mean(wage))*.04 >= .5)
128 | ```
129 |
130 | ```{r, echo=TRUE}
131 | lm(wage~exper+educ+male,data=Bwages)
132 | ```
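
To illustrate why adding a control matters, here is a hedged sketch with simulated data (the variables `w`, `x`, and `y` and the true effect of 1 are assumptions of this example, not part of `Bwages`). `w` sits on a back door between `x` and `y`, and adding it to the regression closes that back door.

```{r, echo=TRUE}
set.seed(305)
n.obs <- 1000
# w causes both x and y, so x <- w -> y is a back door
w <- rnorm(n.obs)
x <- w + rnorm(n.obs)
# The true effect of x on y is 1
y <- 1*x + 2*w + rnorm(n.obs)

# Without controlling for w, the slope on x picks up the back door too
naive.slope <- coef(lm(y ~ x))[2]
# Adding w as another variable closes the back door
controlled.slope <- coef(lm(y ~ x + w))[2]

c(naive = unname(naive.slope), controlled = unname(controlled.slope))
```

The naive slope should land well above 1, while the controlled slope should be close to the true effect.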
133 |
134 | ## Regression Benefits
135 |
136 | - The fact that we're using a line means that we can use much more of the data
137 | - This increases our statistical power, and also reduces overfitting (remember that?)
138 | - For example, with regression discontinuity, we've been only looking just to the left and right of the cutoff
139 | - But this doesn't take into account the information we have about the trend of the variable *leading up to* the cutoff. Doing regression discontinuity with regression can!
140 |
141 | ## Regression Benefits
142 |
143 | - Take this example from regression discontinuity
144 |
145 | ```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=5.5}
146 | set.seed(1000)
147 | rdd <- tibble(test = runif(300)*100) %>%
148 | mutate(GATE = test >= 75,
149 | above = test >= 75) %>%
150 | mutate(earn = runif(300)*40+10*GATE+test/2)
151 |
152 | ggplot(rdd,aes(x=test,y=earn,color=GATE))+geom_point()+
153 | geom_vline(aes(xintercept=75),col='red')+
154 | labs(x='Test Score',
155 | y='Earnings')
156 | ```
157 |
158 | ## Regression Benefits
159 |
160 | - Look how much more information we can take advantage of with regression
161 |
162 | ```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=5}
163 | rdd <- rdd %>%
164 | mutate(above = test >= 75,
165 | zeroedtest = test-75)
166 |
167 | rdmeans <- rdd %>% filter(between(test,73,77)) %>%
168 | group_by(above) %>%
169 | summarize(E = mean(earn))
170 |
171 |
172 | ggplot(rdd,aes(x=test,y=earn,color='Raw'))+geom_point()+
173 | geom_vline(aes(xintercept=75),col='blue')+
174 | labs(x='Test Score',
175 | y='Earnings')+
176 | geom_smooth(aes(color='Regression'),method='lm',se=FALSE,formula=y~x+I(x>=75)+x*I(x>=75))+
177 | geom_segment(aes(x=73,xend=75,y=rdmeans$E[1],yend=rdmeans$E[1],color='Our Method'),size=2)+
178 | geom_segment(aes(x=75,xend=77,y=rdmeans$E[2],yend=rdmeans$E[2],color='Our Method'),size=2)+
179 | scale_color_manual(values=c('Raw'='black','Regression'='red','Our Method'='blue'))
180 | ```
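
One way the regression in the plot could be estimated is sketched below. It rebuilds simulated data in the same way as the slides (the name `rd.sketch` is just for this example), centers the test score at the cutoff, and lets the slope differ on each side. The coefficient on `aboveTRUE` is then the estimated jump at the cutoff, which is 10 in this simulation.

```{r, echo=TRUE}
set.seed(1000)
# Simulated data in the style of the GATE example: true jump of 10 at a score of 75
rd.sketch <- data.frame(test = runif(300)*100)
rd.sketch$above <- rd.sketch$test >= 75
rd.sketch$earn <- runif(300)*40 + 10*rd.sketch$above + rd.sketch$test/2
# Center the test score so the intercept is the prediction right at the cutoff
rd.sketch$zeroedtest <- rd.sketch$test - 75

# Our method: compare means just to the left and right of the cutoff
near <- subset(rd.sketch, test >= 73 & test <= 77)
our.jump <- mean(near$earn[near$above]) - mean(near$earn[!near$above])

# Regression: use all the data, with a separate line on each side
rd.reg <- lm(earn ~ zeroedtest*above, data = rd.sketch)
reg.jump <- coef(rd.reg)['aboveTRUE']

c(our.method = our.jump, regression = unname(reg.jump))
```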
181 |
182 | ## Regression Cons
183 |
184 | - One con is that it masks what it's doing a bit - you have to learn its inner workings to really know how it operates, and what it's sensitive to
185 | - There are also some statistical situations in which it can give strange results
186 | - For example, one problem is that it fits a straight line
187 | - That means that if your relationship DOESN'T follow a straight line, you'll get bad results!
188 |
189 | ## Regression Cons
190 |
191 | ```{r echo=FALSE, fig.width=8, fig.height=6}
192 | quaddata <- tibble(x = rnorm(1000)) %>%
193 | mutate(y = x + x^2 + rnorm(1000)) %>%
194 | group_by(cut(x,breaks=10)) %>%
195 | mutate(y.exp = mean(y)) %>%
196 | ungroup() %>%
197 | mutate(y.exp.reg = predict(lm(y~x)))
198 |
199 | ggplot(quaddata,aes(y=y,x=x,color='Raw'))+geom_point(alpha=.2)+
200 | geom_point(aes(x=x,y=y.exp,color='Our Method'))+
201 | geom_point(aes(x=x,y=y.exp.reg,color='Regression'))+
202 | scale_color_manual(values=c('Raw'='black','Our Method'='red','Regression'='blue'))
203 | ```
204 |
205 | ## Regression Cons
206 |
207 | - That said, we're fitting a line, but it doesn't *necessarily* need to be a straight line. Let's add a quadratic term!
208 |
209 | ```{r, echo=TRUE}
210 | lm(y~x,data=quaddata)
211 | lm(y~x+I(x^2),data=quaddata)
212 | ```
213 |
214 | ## Regression Cons
215 |
216 | ```{r echo=FALSE, fig.width=8, fig.height=6}
217 | quaddata <- quaddata %>%
218 | mutate(y.exp.reg = predict(lm(y~x+I(x^2))))
219 |
220 | ggplot(quaddata,aes(y=y,x=x,color='Raw'))+geom_point(alpha=.2)+
221 | geom_point(aes(x=x,y=y.exp,color='Our Method'))+
222 | geom_point(aes(x=x,y=y.exp.reg,color='Quadratic Reg.'))+
223 | scale_color_manual(values=c('Raw'='black','Our Method'='red','Quadratic Reg.'='blue'))
224 | ```
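
To put a number on how much the quadratic term helps, one option is to compare the leftover (residual) variation from each fit. This is a sketch with freshly simulated data in the same form as above (`quad.sketch` is a name made up for this example):

```{r, echo=TRUE}
set.seed(305)
# Same form as the data above: y depends on both x and x^2
quad.sketch <- data.frame(x = rnorm(1000))
quad.sketch$y <- quad.sketch$x + quad.sketch$x^2 + rnorm(1000)

line.reg <- lm(y ~ x, data = quad.sketch)
quad.reg <- lm(y ~ x + I(x^2), data = quad.sketch)

# Sum of squared residuals: smaller means less is left unexplained
c(straight.line = sum(resid(line.reg)^2),
  quadratic = sum(resid(quad.reg)^2))

# The quadratic fit should recover slopes near the true values of 1 and 1
coef(quad.reg)
```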
225 |
226 | ## Regression
227 |
228 | - In fact, for these reasons, despite the cons, all the methods we've done so far (except matching) are commonly done with regression
229 | - We'll talk about this more next time
230 | - For now, let's do a simulation exercise with regression
231 | - I want to reiterate that regression *will not be on the final*
232 | - But this will let us practice our simulation skills, which we haven't done in a minute, and we may as well do it with regression
233 | - Any remaining time we'll do other final prep
234 |
235 | ## Simulation Practice
236 |
237 | - Let's compare regression and our method in their ability to pick up a *tiny* positive effect of `X` on `Y`
238 | - We'll be counting how often each of them finds that positive result
239 | - Create blank vectors `reg.pos <- c(NA)` and `our.pos <- c(NA)` to store results
240 | - Create a `for (i in 1:10000) {` loop
241 | - Then, inside that loop...
242 |
243 | ## Simulation Practice
244 |
245 | - Create a tibble `df` with `x = runif(1000)` and `y = .01*x + rnorm(1000)`
246 | - Perform regression and get the slope on `x`: `coef(lm(y~x,data=df))[2]`. Store `0` in `reg.pos[i]` if this is negative, and `1` if it's positive
247 | - Explain `y` using `x` with our method and `cut(x,breaks=2)`. Use `summarize` and store the means in `our.df`
248 | - Store `0` in `our.pos[i]` if `our.df$y[2] - our.df$y[1]` is negative, and `1` if it's positive
249 | - See which method more often gets a positive result
250 |
251 | ## Simulation Practice Answers
252 |
253 | ```{r, echo=FALSE}
254 | set.seed(1234)
255 | ```
256 |
257 | ```{r, echo=TRUE}
258 | reg.pos <- c(NA)
259 | our.pos <- c(NA)
260 |
261 | for (i in 1:10000) {
262 | df <- tibble(x = runif(1000)) %>%
263 | mutate(y = .01*x + rnorm(1000))
264 |
265 | reg.pos[i] <- coef(lm(y~x,data=df))[2] > 0
266 |
267 | our.df <- df %>% group_by(cut(x,breaks=2)) %>%
268 | summarize(y = mean(y))
269 |
270 | our.pos[i] <- our.df$y[2] - our.df$y[1] > 0
271 | }
272 |
273 | mean(reg.pos)
274 | mean(our.pos)
275 | ```
--------------------------------------------------------------------------------
/Lectures/Mariel_Synthetic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Mariel_Synthetic.png
--------------------------------------------------------------------------------
/Lectures/ad_spend_and_gdp.csv:
--------------------------------------------------------------------------------
1 | Year,AdSpending,AdSpendingNewsMagRadio,GDP,AdsasPctofGDP
2 | 1919,1930,733.00,78.30,2.50%
3 | 1920,2480,830.00,88.40,2.80%
4 | 1921,1930,940.00,73.60,2.60%
5 | 1922,2200,1017.00,73.40,3.00%
6 | 1923,2400,1101.00,85.40,2.80%
7 | 1924,2480,1188.00,87.00,2.90%
8 | 1925,2600,1281.00,90.60,2.90%
9 | 1926,2700,1355.00,97.00,2.80%
10 | 1927,2720,1432.00,95.50,2.80%
11 | 1928,2760,1493.00,97.40,2.80%
12 | 1929,2850,1560.00,103.60,2.80%
13 | 1930,2450,1378.00,91.20,2.70%
14 | 1931,2100,1220.00,76.50,2.70%
15 | 1932,1620,1000.00,58.70,2.80%
16 | 1933,1325,832.00,56.40,2.30%
17 | 1934,1650,935.00,66.00,2.50%
18 | 1935,1720,1065.00,73.30,2.30%
19 | 1936,1930,1191.00,83.80,2.30%
20 | 1937,2100,1305.00,91.90,2.30%
21 | 1938,1930,1182.00,86.10,2.20%
22 | 1939,2010,1232.00,92.20,2.20%
23 | 1940,2110,1311.00,101.40,2.10%
24 | 1941,2250,1400.00,126.70,1.80%
25 | 1942,2160,1352.00,161.90,1.30%
26 | 1943,2490,1639.00,198.60,1.30%
27 | 1944,2700,1790.00,219.80,1.20%
28 | 1945,2840,1923.00,223.10,1.30%
29 | 1946,3340,2262.00,222.30,1.50%
30 | 1947,4260,2723.00,244.20,1.70%
31 | 1948,4870,3091.00,269.20,1.80%
32 | 1949,5210,3301.00,267.30,1.90%
33 | 1950,5700,3633.00,293.80,1.90%
34 | 1951,6420,4080.00,339.30,1.90%
35 | 1952,7140,4552.00,358.30,2.00%
36 | 1953,7740,4942.00,379.40,2.00%
37 | 1954,8150,5161.00,380.40,2.10%
38 | 1955,9150,5866.00,414.80,2.20%
39 | 1956,9910,6342.00,437.50,2.30%
40 | 1957,10270,6588.00,461.10,2.20%
41 | 1958,10310,6509.00,467.20,2.20%
42 | 1959,11270,7183.00,506.60,2.20%
43 | 1960,11960,7585.00,526.40,2.30%
44 | 1961,11860,7510.00,544.70,2.20%
45 | 1962,12430,7896.00,585.60,2.10%
46 | 1963,13100,8284.00,617.70,2.10%
47 | 1964,14150,9018.00,663.60,2.10%
48 | 1965,15250,9761.00,719.10,2.10%
49 | 1966,16630,10734.00,787.80,2.10%
50 | 1967,16870,10887.00,832.60,2.00%
51 | 1968,18090,11718.00,910.00,2.00%
52 | 1969,19420,12723.00,984.60,2.00%
53 | 1970,19550,12702.00,1038.50,1.90%
54 | 1971,20700,13293.00,1127.10,1.80%
55 | 1972,23210,14921.00,1238.30,1.90%
56 | 1973,24980,16042.00,1382.70,1.80%
57 | 1974,26620,17009.00,1500.00,1.80%
58 | 1975,27900,17935.00,1638.30,1.70%
59 | 1976,33300,21579.00,1825.30,1.80%
60 | 1977,37440,24470.00,2030.90,1.80%
61 | 1978,43330,28322.00,2294.70,1.90%
62 | 1979,48780,31954.00,2563.30,1.90%
63 | 1980,53570,34937.00,2789.50,1.90%
64 | 1981,60460,39167.00,3128.40,1.90%
65 | 1982,66670,42811.00,3255.00,2.00%
66 | 1983,76000,49057.00,3536.70,2.10%
67 | 1984,88010,56765.00,3933.20,2.20%
68 | 1985,94900,60663.00,4220.30,2.20%
69 | 1986,102370,65029.00,4462.80,2.30%
70 | 1987,110270,69141.00,4739.50,2.30%
71 | 1988,118750,74004.00,5103.80,2.30%
72 | 1989,124770,77841.00,5484.40,2.30%
73 | 1990,129968,79932.00,5803.10,2.20%
74 | 1991,128352,76997.00,5995.90,2.10%
75 | 1992,133750,80560.00,6337.70,2.10%
76 | 1993,140956,84570.00,6657.40,2.10%
77 | 1994,153024,92501.00,7072.20,2.20%
78 | 1995,162930,97622.00,7397.70,2.20%
79 | 1996,175230,105973.00,7816.90,2.20%
80 | 1997,187529,113221.00,8304.30,2.30%
81 | 1998,206697,123628.00,8747.00,2.40%
82 | 1999,222308,132151.00,9268.40,2.40%
83 | 2000,247472,145887.00,9817.00,2.50%
84 | 2001,231287,132296.00,10128.00,2.30%
85 | 2002,236875,136244.00,10470.00,2.30%
86 | 2003,245477,140128.00,10961.00,2.20%
87 | 2004,263766,150305.00,11713.00,2.30%
88 | 2005,271074,151939.00,12456.00,2.20%
89 | 2006,281653,155466.00,13178.00,2.10%
90 | 2007,279612,150023.00,13807.00,2.00%
91 |
--------------------------------------------------------------------------------
/Lectures/mariel.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/mariel.RData
--------------------------------------------------------------------------------
/Papers/BettingerLong2009.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/BettingerLong2009.pdf
--------------------------------------------------------------------------------
/Papers/Flammer 2013 RDD Corporate Social responsibility.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/Flammer 2013 RDD Corporate Social responsibility.pdf
--------------------------------------------------------------------------------
/Papers/chang2014.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/chang2014.pdf
--------------------------------------------------------------------------------
/Papers/desai2009.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/desai2009.pdf
--------------------------------------------------------------------------------
/Papers/hanson2009.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/hanson2009.pdf
--------------------------------------------------------------------------------
/Papers/kowalski2016.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/kowalski2016.pdf
--------------------------------------------------------------------------------
/Papers/marvel_moody_sentencing.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/marvel_moody_sentencing.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Economics, Causality, and Analytics
2 | =========
3 |
4 | These are the course materials for the upcoming CSU Fullerton class ECON 305: Economics, Causality, and Analytics (minus the exams and homeworks; instructors contact me directly if you'd like to see these).
5 |
6 | The syllabus and writing assignment sheets are in the main folder.
7 |
8 | Lectures can be found in the Lectures folder. These are in RMarkdown Presentation format. Compiled HTML files are there as well, along with any other files necessary to render the completed slides. The HTML files will be easier to view fully rendered; check [My Website](http://www.nickchk.com/econ305.html) for links.
9 |
10 | Papers includes the papers referenced in the lectures or other parts of the class.
11 |
12 | Cheat Sheets contains cheat sheets downloaded from standard sources like RStudio, as well as a few of my own for things like dagitty, causal diagrams, and relationships between variables.
13 |
14 | Code for the animations used in the lectures (or very nearly; it's been modified) can be found in the RMarkdown files, or at [this repo](https://github.com/NickCH-K/causalgraphs).
--------------------------------------------------------------------------------
/Research Design Assignment.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Research Design Assignment.docx
--------------------------------------------------------------------------------