├── .gitignore
├── Causal Inference in the News Article.docx
├── Cheat Sheets
│   ├── Base R Cheat Sheet.pdf
│   ├── Causal Diagrams Cheat Sheet.Rmd
│   ├── Causal_Diagrams_Cheat_Sheet.pdf
│   ├── Dagitty Cheat Sheet.Rmd
│   ├── Dagitty-Cheat-Sheet.aux
│   ├── Dagitty-Cheat-Sheet.out
│   ├── Dagitty_Cheat_Sheet.pdf
│   ├── Dagitty_Diagnostics.png
│   ├── Dagitty_Intro.PNG
│   ├── Dagitty_Model1.PNG
│   ├── Dagitty_Model2.PNG
│   ├── Dagitty_Screen.PNG
│   ├── RStudio Cheat Cheet.pdf
│   ├── Relationships Cheat Sheet.Rmd
│   ├── Relationships_Cheat_Sheet.pdf
│   ├── Simulation Cheat Sheet.Rmd
│   ├── Simulation_Cheat_Sheet.pdf
│   ├── dplyr Cheat Sheet.pdf
│   └── floatposition.tex
├── ECON 305 Syllabus Updated Template.docx
├── Instructor Guide
│   ├── .Rhistory
│   ├── Instructor_Guide.Rmd
│   ├── Instructor_Guide.aux
│   └── Instructor_Guide.pdf
├── LICENSE.txt
├── Lectures
│   ├── Animation_Local_Linear_Regression.gif
│   ├── GDP.csv
│   ├── Lecture_01-cloud-market-revenue.png
│   ├── Lecture_01_Politics_Example.PNG
│   ├── Lecture_01_World_of_Data.Rmd
│   ├── Lecture_01_World_of_Data.html
│   ├── Lecture_02_Understanding_Data.Rmd
│   ├── Lecture_02_Understanding_Data.html
│   ├── Lecture_03_Introduction_to_R.Rmd
│   ├── Lecture_03_Introduction_to_R.html
│   ├── Lecture_04_Objects_and_Functions_in_R.Rmd
│   ├── Lecture_04_Objects_and_Functions_in_R.html
│   ├── Lecture_05_Spreadsheet.PNG
│   ├── Lecture_05_Working_with_Data_Part_1.Rmd
│   ├── Lecture_05_Working_with_Data_Part_1.html
│   ├── Lecture_05_data_frame.PNG
│   ├── Lecture_06_Working_with_Data_Part_2.Rmd
│   ├── Lecture_06_Working_with_Data_Part_2.html
│   ├── Lecture_07_Summarizing_Data_Part_1.Rmd
│   ├── Lecture_07_Summarizing_Data_Part_1.html
│   ├── Lecture_08_Summarizing_Data_Part_2.Rmd
│   ├── Lecture_08_Summarizing_Data_Part_2.html
│   ├── Lecture_09_Relationships_Between_Variables_Part_1.Rmd
│   ├── Lecture_09_Relationships_Between_Variables_Part_1.html
│   ├── Lecture_10_Relationships_Between_Variables_Part_2.Rmd
│   ├── Lecture_10_Relationships_Between_Variables_Part_2.html
│   ├── Lecture_11_Simulations.Rmd
│   ├── Lecture_11_Simulations.html
│   ├── Lecture_12_Midterm_Review.Rmd
│   ├── Lecture_12_Midterm_Review.html
│   ├── Lecture_13_Causality.Rmd
│   ├── Lecture_13_Causality.html
│   ├── Lecture_14_Causal_Diagrams.Rmd
│   ├── Lecture_14_Causal_Diagrams.html
│   ├── Lecture_15_Drawing_Causal_Diagrams.Rmd
│   ├── Lecture_15_Drawing_Causal_Diagrams.html
│   ├── Lecture_16_Back_Doors.Rmd
│   ├── Lecture_16_Back_Doors.html
│   ├── Lecture_17_Article_Traffic.pdf
│   ├── Lecture_17_Causal_Diagrams_Practice.Rmd
│   ├── Lecture_17_Causal_Diagrams_Practice.html
│   ├── Lecture_18_Controlling.Rmd
│   ├── Lecture_18_Controlling.html
│   ├── Lecture_19_Fixed_Effects.Rmd
│   ├── Lecture_19_Fixed_Effects.html
│   ├── Lecture_20_Untreated_Groups.Rmd
│   ├── Lecture_20_Untreated_Groups.html
│   ├── Lecture_21_Difference_in_Differences.Rmd
│   ├── Lecture_21_Difference_in_Differences.html
│   ├── Lecture_22_Flammer.PNG
│   ├── Lecture_22_Regression_Discontinuity.Rmd
│   ├── Lecture_22_Regression_Discontinuity.html
│   ├── Lecture_23_Practice.Rmd
│   ├── Lecture_23_Practice.html
│   ├── Lecture_24_AutorDornHanson.PNG
│   ├── Lecture_24_Instrumental_Variables.Rmd
│   ├── Lecture_24_Instrumental_Variables.html
│   ├── Lecture_25_Instrumental_Variables_in_Action.Rmd
│   ├── Lecture_25_Instrumental_Variables_in_Action.html
│   ├── Lecture_26_Causal_Midterm_Review.Rmd
│   ├── Lecture_26_Causal_Midterm_Review.html
│   ├── Lecture_27_Explaining_Better_Regression.Rmd
│   ├── Lecture_27_Explaining_Better_Regression.html
│   ├── Lecture_28_Explaining_Better_Regression_2.Rmd
│   ├── Lecture_28_Explaining_Better_Regression_2.html
│   ├── Mariel_Synthetic.png
│   ├── ad_spend_and_gdp.csv
│   ├── average-height-men-OWID.csv
│   ├── mariel.RData
│   └── marvel_moody_sentencing.txt
├── Papers
│   ├── BettingerLong2009.pdf
│   ├── Flammer 2013 RDD Corporate Social responsibility.pdf
│   ├── chang2014.pdf
│   ├── desai2009.pdf
│   ├── hanson2009.pdf
│   ├── kowalski2016.pdf
│   └── marvel_moody_sentencing.pdf
├── README.md
└── Research Design Assignment.docx

/.gitignore:
--------------------------------------------------------------------------------
1 | Admin/
2 | Exams/
3 | Homework/
4 | Animations/
5 | Cheat Sheets/Causal_Diagrams_Cheat_Sheet_files/
6 | Lectures/Lecture_01_World_of_Data_files/
7 | Lectures/Lecture_02_Understanding_Data_files/
8 | Lectures/Lecture_03_Introduction_to_R_files/
9 | Lectures/Lecture_05_Working_with_Data_Part_1_files/
10 | Lectures/Lecture_06_Working_with_Data_Part_2_files/
11 | Lectures/Lecture_07_Summarizing_Data_Part_1_files/
12 | Lectures/Lecture_08_Summarizing_Data_Part_2_files/
13 | Lectures/Lecture_09_Relationships_Between_Variables_Part_1_files/
14 | Lectures/Lecture_10_Relationships_Between_Variables_Part_2_files/
15 | Lectures/Lecture_12_Midterm_Review_files
16 | Lectures/Lecture_15_Drawing_Causal_Diagrams_files/
17 | Lectures/Lecture_16_Back_Doors_files
18 | Lectures/Lecture_17_Causal_Diagrams_Practice_files/
19 | Lectures/Lecture_18_Controlling_files/
20 | Lectures/Lecture_20_Untreated_Groups_files/
21 | Lectures/Lecture_21_Difference_in_Differences_files/
22 | Lectures/Lecture_22_Regression_Discontinuity_files/
23 | Lectures/Lecture_25_Instrumental_Variables_in_Action_files/
24 | Lectures/Lecture_26_Causal_Midterm_Review_files/
25 | Lectures/Lecture_27_Explaining_Better_Regression_files/
26 | Lectures/Lecture_28_Explaining_Better_Regression_2_files/
27 | Lectures/libs/
28 | Lectures/Versions without DPLYR/
29 | Lectures/.Rhistory
30 | 
31 | 
--------------------------------------------------------------------------------
/Causal Inference in the News Article.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Causal Inference in the News Article.docx
--------------------------------------------------------------------------------
/Cheat Sheets/Base R Cheat Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Base R Cheat Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/Causal Diagrams Cheat Sheet.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Causal Diagrams Cheat Sheet"
3 | author: "Nick Huntington-Klein"
4 | date: "April 6, 2019"
5 | output: pdf_document
6 | ---
7 | 
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
10 | library(tidyverse)
11 | library(dagitty)
12 | library(ggdag)
13 | library(Cairo)
14 | ```
15 | 
16 | ## Causal Diagrams
17 | 
18 | - Causal diagrams are a set of *variables* and a set of assumptions about *how those variables cause each other*
19 | - Causal diagrams are our way of representing how we think the *data was generated*. They are our *model of the world*
20 | - Once we have our diagram written down, it will tell us how we can *identify* the effect we're interested in
21 | - Writing a causal diagram requires us to make some assumptions - but without assumptions you literally can't get anywhere! Just try to make them reasonable assumptions
22 | - "`X` causes `Y`" means that if we could reach in and change the value of `X`, then the *distribution* of `Y` would change as a result. It doesn't necessarily mean that *all* `Y` is because of `X`, or that changing `X` will *always* change `Y`. It just means that a change in `X` will cause the distribution of `Y` to change.
23 | 
24 | ## Components of a Causal Diagram
25 | 
26 | A causal diagram consists of (a) a set of variables, and (b) causal arrows between those variables.
27 | 
28 | An arrow from `X` to `Y` means that `X` causes `Y`. We can represent this in text as `X -> Y`. Arrows always go one direction. You can't have `X -> Y` and `X <- Y`. See "Tricky Situations" below if it feels like both variables cause each other.
29 | 
30 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=1.5}
31 | dag <- dagify(Y~X,
32 |               coords=list(
33 |                 x=c(X=1,Y=2),
34 |                 y=c(X=1,Y=1)
35 |               )) %>% tidy_dagitty()
36 | ggdag(dag,node_size=10)
37 | ```
38 | 
39 | If a variable is *not in* the causal diagram, that means that we're assuming that the variable is not important in the system, or at least not relevant to the effect we're trying to identify.
40 | 
41 | If an arrow is *not in* the causal diagram, that means that we're assuming that neither variable causes the other. This is not a trivial assumption. In the below diagram, we assume that `X` does not cause `Z`, and that `Z` does not cause `X`.
42 | 
43 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=1.5}
44 | dag <- dagify(Y~X+Z,
45 |               coords=list(
46 |                 x=c(X=1,Y=2,Z=3),
47 |                 y=c(X=1,Y=1,Z=1)
48 |               )) %>% tidy_dagitty()
49 | ggdag(dag,node_size=10)
50 | ```
51 | 
52 | ## Building a Diagram
53 | 
54 | Steps to building a diagram:
55 | 
56 | 1. Consider all the variables that are likely to be important in the data generating process (this includes variables you can't observe)
57 | 2. For simplicity, combine them together or prune the ones least likely to be important
58 | 3. Consider which variables are likely to affect which other variables and draw arrows from one to the other
59 | 
60 | ## Front and Back Doors
61 | 
62 | Once you have a diagram and you know that you want to identify the effect of `X` on `Y`, it is helpful to make a list of all the front and back doors from `X` to `Y`, and determine whether they are open or closed.
63 | 
64 | To do this:
65 | 
66 | 1. Write down the list of *all paths* you can follow on your diagram to get from `X` to `Y`
67 | 2. If that path only contains arrows that point *from* `X` *towards* `Y`, like `X -> Y` or `X -> A -> Y`, that is a *front door path*
68 | 3. If that path contains any arrows that point *towards* `X`, like `X <- W -> Y` or `X -> Z <- A -> Y`, that is a *back door path*
69 | 
70 | ## Open and Closed Paths and Colliders
71 | 
72 | 1. If a path from `X` to `Y` contains any *colliders* where two arrows point at the same variable, like how both arrows point to `Z` in `X -> Z <- A -> Y`, then that path is *closed* by default since `Z` is a collider on that path
73 | 2. Otherwise, that path is *open* by default
74 | 3. If you control/adjust for a variable, then that will *close* any open paths it is on where that variable **isn't** a collider. So, for example, if you control for `W`, the open path `X <- W -> Y` will close
75 | 4. If you control/adjust for a variable, then that will *open* any closed back door paths it is on where that variable **is** a collider. So, for example, if you control for `Z`, the closed path `X -> Z <- A -> Y` will open back up. You can close it again by controlling for another non-collider on that path, like `A` in this case (see the sketch below)
76 | 
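You can check a path list like this directly in R. Here's a minimal sketch using the `dagitty` package (loaded in the setup chunk above); the diagram is the one from these examples:

```{r, echo=TRUE}
library(dagitty)
paths_dag <- dagitty('dag { X -> Y ; X <- W -> Y ; X -> Z <- A -> Y }')
paths(paths_dag, "X", "Y")           # lists every path and whether it is open
paths(paths_dag, "X", "Y", Z = "Z")  # conditioning on the collider Z opens its path
```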
77 | ## Identifying with the Front Door Method
78 | 
79 | Sometimes, you can identify the effect of `X` on `Y` by focusing on front doors.
80 | 
81 | For example, in the below diagram, there are no open back doors from `X` to `A` (there's a back door `X <- W -> Y <- A`, but it is closed since `Y` is a collider on this path), and so you can identify `X -> A`.
82 | 
83 | Similarly, the only back door from `A` to `Y` is `A <- X <- W -> Y`, and so if you control for `X`, you can identify `A -> Y`.
84 | 
85 | By combining `X -> A` and `A -> Y` you can identify `X -> Y`.
86 | 
87 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=2.5}
88 | dag <- dagify(Y~A+W,
89 |               A~X,
90 |               X~W,
91 |               coords=list(
92 |                 x=c(X=1,A=2,Y=3,W=2),
93 |                 y=c(X=1,A=1,Y=1,W=1.5)
94 |               )) %>% tidy_dagitty()
95 | ggdag(dag,node_size=10)
96 | ```
97 | 
98 | A common issue with this method is finding places where you can apply it convincingly. Are there really no open back door paths from `X` to `A`, or from `A` to `Y` in the true data-generating process?
99 | 
100 | ## Identifying with the Back Door Method
101 | 
102 | Once you've identified the full list of front-door paths and back-door paths, if you control/adjust for a set of variables that *fully blocks* all back door paths (without blocking your front door paths), then you have identified the effect of `X` on `Y`.
103 | 
104 | For example, in the below diagram, we have the front door path `X -> A -> Y` and the back door path `X <- W -> Y`. If we control for `W` but not `A`, then we have identified the effect of `X` on `Y`.
105 | 
106 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=2.5}
107 | dag <- dagify(Y~A+W,
108 |               A~X,
109 |               X~W,
110 |               coords=list(
111 |                 x=c(X=1,A=2,Y=3,W=2),
112 |                 y=c(X=1,A=1,Y=1,W=1.5)
113 |               )) %>% tidy_dagitty()
114 | ggdag(dag,node_size=10)
115 | ```
116 | 
117 | A common issue with this method is that often you will need to control for variables that you can't actually observe. If you don't have data on `W`, then you can't identify the effect of `X` on `Y` using the back door method.
118 | 
119 | ## Testing Implications
120 | 
121 | Causal diagrams require assumptions to build, and they also imply certain relationships. Sometimes you can test these relationships, and if they're not right, that suggests that you might not have the right model.
122 | 
123 | For example, in the diagram used in the previous two sections, `W` and `A` should be unrelated if you control for `X`. Also, `X` and `Y` should be unrelated if you control for `W` and `A`. You could test these predictions yourself.
124 | 
125 | If you control for `X` and still find a strong relationship between `W` and `A`, or if you control for `W` and `A` and still find a strong relationship between `X` and `Y`, then your causal diagram is wrong.
126 | 
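One way to test these predictions is on simulated data, using the binned-means approach from the Relationships Cheat Sheet. Below is a minimal sketch; the data generating process and bin counts here are made up purely for illustration:

```{r, echo=TRUE}
library(tidyverse)
# Made-up data consistent with the diagram above: W -> X, X -> A, and A and W -> Y
testdf <- tibble(W = rnorm(5000)) %>%
  mutate(X = W + rnorm(5000)) %>%
  mutate(A = X + rnorm(5000)) %>%
  mutate(Y = A + W + rnorm(5000)) %>%
  mutate(Wbins = cut(W, breaks = 10),
         Abins = cut(A, breaks = 10)) %>%
  group_by(Wbins, Abins) %>%
  mutate(X.resid = X - mean(X),
         Y.resid = Y - mean(Y)) %>%
  ungroup()
# If the diagram is right, this correlation should be close to zero
cor(testdf$X.resid, testdf$Y.resid)
```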
127 | ## Tricky Situations
128 | 
129 | Some situations make it difficult to think about how to construct a causal diagram. We'll cover those here.
130 | 
131 | (1) Lots of variables. In the social sciences, the true data-generating process may have lots and lots of variables, perhaps hundreds! It's important to remember to (a) remove any variables that are likely to only be of trivial importance, and (b) if you have multiple variables that lie on identical-looking paths, you can combine them. For example, you can often combine "race", "gender", "parental income", "birthplace" etc. into the single variable "background"
132 | 
133 | (2) Variables that are related but don't cause each other. Often we have two variables that we know are *correlated*, but we don't think either causes the other. For example, people buy more ice cream on days they wear shorts. In these cases, we include a "common cause" on the graph. We don't necessarily need to be specific about what it is, or if it's measured or unmeasured. I usually label these "U1" "U2" etc.
134 | 
135 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=2.5}
136 | dag <- dagify(A~U1,
137 |               B~U1,
138 |               coords=list(
139 |                 x=c(A=1,B=2,U1=1.5),
140 |                 y=c(A=1,B=1,U1=1.5)
141 |               )) %>% tidy_dagitty()
142 | ggdag(dag,node_size=10)
143 | ```
144 | 
145 | Note that some people will use a double headed arrow here instead, like `A <-> B`. It means the same thing. We won't be doing this in our class though.
146 | 
147 | (3) Variables that cause each other. A variable should never be able to cause itself. So, a causal diagram should never have a *loop* in it. The simplest example of a loop is if `A -> B` and `B -> A`. If we mix these together we have `A -> B -> A`. `A` causes itself! That doesn't make sense. This is why causal diagrams are technically called "directed *acyclic* graphs".
148 | 
149 | But surely in the real world there are plenty of variables with self-feedback loops, or variables which cause each other. So what do we do?
150 | 
151 | In these cases, you want to think about *time* carefully. Instead of `A -> B` and `B -> A`, split `A` and `B` up into two variables each: `ANow` and `ALater`, and `BNow` and `BLater`. Obviously, a causal effect needs time to occur. Now we have `ANow -> BLater` and `BNow -> ALater` and the loop is gone.
--------------------------------------------------------------------------------
/Cheat Sheets/Causal_Diagrams_Cheat_Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Causal_Diagrams_Cheat_Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty Cheat Sheet.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Dagitty Cheat Sheet"
3 | author: "Nick Huntington-Klein"
4 | date: "`r format(Sys.time(), '%d %B, %Y')`"
5 | output:
6 |   rmarkdown::pdf_document:
7 |     fig_caption: yes
8 |     includes:
9 |       in_header: floatposition.tex
10 | ---
11 | 
12 | ```{r setup, include=FALSE}
13 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
14 | library(tidyverse)
15 | ```
16 | 
17 | ## Dagitty
18 | 
19 | - [Dagitty.net](http://www.dagitty.net) is a tool for building, drawing, and working with causal diagrams
20 | - You can also work with dagitty in R directly with the `dagitty` and `ggdag` packages, although this is more difficult and we won't be covering it here (though see the appendix at the end of this sheet for a taste)
21 | 
22 | ## Getting Started
23 | 
24 | Go to [Dagitty.net](http://www.dagitty.net) and click on "Launch Dagitty Online in your Browser"
25 | 
26 | There you'll see the main dagitty screen with three areas. On the left it has style options; we will be ignoring this for this basic introduction, although you may want to refer back to the Legend in the lower left-hand corner sometimes. In the middle you have the canvas to build your diagram on, and the menus. On the right it provides information about the causal diagram you've built.
27 | 
28 | ![Dagitty Main Screen](Dagitty_Screen.png)
29 | 
30 | \newpage
31 | 
32 | ## Building a Diagram
33 | 
34 | In "Model" select "New Model". You will be presented with a blank canvas.
35 | 
36 | From there, you can add new variables by clicking (or tapping, on mobile) on an empty part of the diagram, and then giving the variable a name.
37 | 
38 | You can add arrows from `A` to `B` by first clicking on `A` and then clicking on `B`. You can remove an existing arrow in the same way. Once you have an arrow from `A` to `B`, you can create a double-headed arrow by clicking on `B` and then `A`.
39 | 
40 | If you click-and-drag on an arrow, you can make it curvy, which may help avoid overlaps in complex diagrams. Similarly, you can click-and-drag the variables themselves.
41 | 
42 | ![Basic Dagitty Model](Dagitty_Model1.png)
43 | 
44 | \newpage
45 | 
46 | ## Variable Characteristics
47 | 
48 | Once you have some variables and arrows on your graph, you may want to manipulate them.
49 | 
50 | You can do this one of two ways:
51 | 
52 | 1. Clicking (or tapping on mobile) on the variable, and then looking for the "Variable" tab on the left (or below, on mobile).
53 | 2. Hovering over the variable and hitting a keyboard hotkey.
54 | 
55 | ![Variable Tab](Dagitty_Model2.png)
56 | 
57 | One thing you will probably want to do in almost *any* dagitty graph is determine your **exposure/treatment** variable(s) (hotkey "e", or use the Variable tab) and your **outcome** variable (hotkey "o", or use the Variable tab).
58 | 
59 | Often, you will be using dagitty to attempt to identify the effect of an exposure variable on an outcome variable. Setting the exposure and outcome variables properly lets anyone looking at your graph know what it's for, and also lets dagitty do some nice calculations for you.
60 | 
61 | Other variable characteristics you might want to set:
62 | 
63 | 1. Whether the variable is **observed in the data** (hotkey "u", or use the Variable tab). If a variable is not observed, you can't adjust for it. Let dagitty know it's not observed so it won't try to tell you to adjust for it!
64 | 2. Whether you've already **controlled/adjusted** for a variable (hotkey "a", or use the Variable tab).
65 | 
66 | You may also want to re-work your diagram once you've made it. You can also **delete** variables (hotkey "d", or use the Variable tab), or **rename** them (hotkey "r", or use the Variable tab).
67 | 
68 | \newpage
69 | 
70 | ## Using the Dagitty Diagnostics
71 | 
72 | The right bar of the screen contains lots of information about your diagram.
73 | 
74 | ![Dagitty Diagnostics](Dagitty_Diagnostics.png){width=90%}
75 | 
76 | At the top you'll see that it gives you the Minimal Adjustment Sets. Given the exposure variable you've set, and the outcome variable, it will tell you what you could control for to close all back doors. Here, you can identify `X -> Y` by controlling for `A` OR by controlling for `B` (`A` and `B` together would also work, but these are *minimal* adjustment sets - no unnecessary controls).
77 | 
78 | Remember that the exposure and outcome variables must be set properly for this to work. It will also pay attention to the variables you've set to be unobserved, or the variables you've said you've already adjusted for.
79 | 
80 | If you select the "Adjustment (total effect)" dropdown you'll see that you can also set it to look for instrumental variables, or to look for the "direct effect". The direct effect looks *only* for `X -> Y`, not counting other front-door paths like `X -> C -> Y` (not pictured).
81 | 
82 | The diagram itself also has information. Any lines that are **purple** represent arrows that are a part of an open back door. Lines that are **green** are on a front door path / the causal effect of interest. And lines that are **black** are neither. In the graph we have here, we can see the green causal path of interest from `X` to `Y`, and a purple open back door path along `X <- A <-> B -> Y`. If you forget the colors, check the legend in the bottom-left.
83 | 
84 | The variables are marked too. The exposure and outcome variables are clearly marked. Green variables like `A` are ancestors of the exposure (they cause the exposure), and blue variables like `B` are ancestors of the outcome (they cause the outcome). Red variables are ancestors of both, and are likely on a back door path.
85 | 
86 | In the second panel on the right you can see the testable implications of the model. This shows you some relationships that *should* be there if your model is right. Here, that upside-down T means "not related to" and the | means "adjusting for". So, adjusting for `A`, `X` and `B` should be unrelated. And adjusting for both `B` and `X`, `A` and `Y` should be unrelated. You could test these yourself in data, and if you found they were related, you'd know the model was wrong.
87 | 
88 | \newpage
89 | 
90 | ## Saving/Exporting Your Diagram
91 | 
92 | You can save your work using Model -> Publish. You'll need to save the code/URL you get so you can Model -> Load it back in later.
93 | 
94 | You can also export your graph for viewing using Model -> Export. However, keep in mind that dagitty tends to like to show the whole canvas, even if your model is only on a part of it. So you may want to be prepared to open the result in a photo editor to crop it after you've exported it.
95 | 
96 | Another option for getting images of your models is to take a screenshot, cropped down to the area of your diagram. In Windows you can use the Snipping Tool (Windows Accessories -> Snipping Tool). On a Mac you can use Grab.
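\newpage

## Appendix: The Same Workflow in R

As noted at the top, the `dagitty` package lets you do this work in R directly. We won't use it in class, but here is a minimal sketch for the curious, using the model from the Diagnostics section:

```{r, echo=TRUE, eval=FALSE}
library(dagitty)
g <- dagitty('dag {
  X [exposure]
  Y [outcome]
  A -> X
  B -> Y
  A <-> B
  X -> Y
}')
adjustmentSets(g)                    # the minimal adjustment sets, as in the web tool
impliedConditionalIndependencies(g)  # the testable implications panel
```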
97 | 
98 | 
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Cheat_Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Cheat_Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Diagnostics.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Diagnostics.png
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Intro.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Intro.PNG
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Model1.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Model1.PNG
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Model2.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Model2.PNG
--------------------------------------------------------------------------------
/Cheat Sheets/Dagitty_Screen.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Dagitty_Screen.PNG
--------------------------------------------------------------------------------
/Cheat Sheets/RStudio Cheat Cheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/RStudio Cheat Cheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/Relationships Cheat Sheet.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Relationships Cheat Sheet"
3 | author: "Nick Huntington-Klein"
4 | date: "April 5, 2019"
5 | output: pdf_document
6 | ---
7 | 
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
10 | ```
11 | 
12 | ## Definitions
13 | 
14 | - **Dependent**: `X` and `Y` are *dependent* if knowing something about `X` gives you information about what `Y` is likely to be, or vice versa
15 | - **Correlated**: `X` and `Y` are *correlated* if knowing that `X` is unusually high tells you whether `Y` is likely to be unusually high or unusually low
16 | - **Explaining**: *Explaining* `Y` using `X` means that we are predicting what `Y` is likely to be, given a value of `X`
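Note that *dependent* and *correlated* are not the same thing: two variables can be clearly dependent while having nearly zero correlation. A minimal sketch (the numbers here are arbitrary):

```{r, echo=TRUE}
x <- runif(1000, min = -1, max = 1)
y <- x^2   # knowing x tells you y exactly, so x and y are dependent
cor(x, y)  # but the correlation is close to zero
```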
17 | 
18 | ## Tables
19 | 
20 | `table(x)` will show us the full *distribution* of `x`. `table(x,y)` will show us the full *distribution* of `x` and `y` together (the "joint distribution"). Typically not used for continuous variables.
21 | 
22 | ```{r, echo=TRUE}
23 | library(Ecdat)
24 | data(Benefits)
25 | 
26 | table(Benefits$joblost)
27 | 
28 | table(Benefits$joblost,Benefits$married)
29 | ```
30 | 
31 | You can label the variable names using the confusingly-named *dimnames names* option, `dnn`
32 | 
33 | ```{r, echo=TRUE}
34 | table(Benefits$joblost,Benefits$married,dnn=c('Job Loss Reason','Married'))
35 | ```
36 | 
37 | Wrap `table()` in `prop.table()` to get proportions instead of counts. The `margin` option of `prop.table()` will give the proportion within each row (`margin=1`) or within each column (`margin=2`) instead of overall.
38 | 
39 | ```{r, echo=TRUE}
40 | prop.table(table(Benefits$joblost,Benefits$married))
41 | prop.table(table(Benefits$joblost,Benefits$married),margin=1)
42 | prop.table(table(Benefits$joblost,Benefits$married),margin=2)
43 | ```
44 | 
45 | ## Correlation
46 | 
47 | We can calculate the correlation between two (numeric) variables using `cor(x,y)`
48 | 
49 | ```{r, echo=TRUE}
50 | cor(Benefits$age,Benefits$tenure)
51 | ```
52 | 
53 | ## Scatterplots
54 | 
55 | You can plot one variable against another with `plot(xvar,yvar)`. Add `xlab`, `ylab`, and `main` options to title the axes and entire plot, respectively. Use `col` to assign a color.
56 | 
57 | Use `points()` to add more points to a graph after you've made it, likely with a different `col`or.
58 | 
59 | ```{r, echo=TRUE, fig.width=5, fig.height = 3.5}
60 | library(tidyverse)
61 | BenefitsM <- Benefits %>% filter(sex=='male')
62 | BenefitsF <- Benefits %>% filter(sex=='female')
63 | plot(BenefitsM$age,BenefitsM$tenure,xlab='Age',ylab='Tenure at Job',col='blue')
64 | points(BenefitsF$age,BenefitsF$tenure,col='red')
65 | ```
66 | 
67 | ## Overlaid Densities
68 | 
69 | You can show how the *distribution* of `Y` changes for different values of `X` by plotting the density separately for different values of `X`. Use `lines` to add the second density plot after you've done the first one.
70 | 
71 | ```{r, echo=TRUE, fig.width=5, fig.height = 3.5}
72 | plot(density(BenefitsM$tenure),
73 |      xlab='Tenure at Job',col='blue',main="Job Tenure by Gender")
74 | lines(density(BenefitsF$tenure),col='red')
75 | ```
76 | 
77 | ## Means Within Groups and Explaining
78 | 
79 | Part of looking at both *correlation* and *explanation* will require getting the mean of `Y` within values of `X`, which we can do with `group_by()` in `dplyr/tidyverse`.
80 | 
81 | Using `summarize()` after `group_by()` will give us a table of means within each group. Using `mutate()` will add a new variable assigning that mean. Use `mutate()` with `mean(y)` to get the part of `y` explained by `x`, or with `y - mean(y)` to get the part not explained by `x` (the residual). Don't forget to `ungroup()`!
82 | 
83 | ```{r, echo=TRUE}
84 | Benefits %>% group_by(joblost) %>%
85 |   summarize(tenure = mean(tenure), age = mean(age))
86 | 
87 | Benefits <- Benefits %>% group_by(joblost) %>%
88 |   mutate(tenure.exp = mean(tenure),
89 |          tenure.resid = tenure - mean(tenure)) %>% ungroup()
90 | head(Benefits %>% select(joblost,tenure,tenure.exp,tenure.resid))
91 | ```
92 | 
93 | ## Explaining With a Continuous Variable
94 | 
95 | If we want to explain `Y` using `X` but `X` is continuous, we need to break it up into bins first. We will do this with `cut()`, which has the `breaks` option for how many bins to split it up into.
96 | 
97 | In this class, we will be choosing the number of breaks arbitrarily. I'll tell you what values to use.
98 | 
99 | ```{r, echo=TRUE}
100 | Benefits <- Benefits %>% mutate(agebins = cut(age,breaks=5)) %>%
101 |   group_by(agebins) %>%
102 |   mutate(tenure.ageexp = mean(tenure),
103 |          tenure.ageresid = tenure - mean(tenure)) %>% ungroup()
104 | head(Benefits %>% select(agebins,tenure,tenure.ageexp,tenure.ageresid))
105 | ```
106 | 
107 | ```{r, echo=TRUE, fig.width=5, fig.height = 3.5}
108 | plot(Benefits$age,Benefits$tenure,xlab="Age",ylab="Tenure",col='black')
109 | points(Benefits$age,Benefits$tenure.ageexp,col='red',cex=1.5,bg='red',pch=21)
110 | ```
111 | 
112 | ## Proportion of Variance Explained
113 | 
114 | When `Y` is numeric, we can calculate its variance, and see how much of that variance is explained by `X`, and also how much is not. We do this by calculating the variance of the residuals, as this is the amount of variance in `Y` left over after taking out what `X` explains.
115 | 
116 | ```{r, echo=TRUE}
117 | #Proportion of tenure NOT explained by age
118 | var(Benefits$tenure.ageresid)/var(Benefits$tenure)
119 | 
120 | #Proportion of tenure explained by age
121 | 1 - var(Benefits$tenure.ageresid)/var(Benefits$tenure)
122 | var(Benefits$tenure.ageexp)/var(Benefits$tenure)
123 | ```
--------------------------------------------------------------------------------
/Cheat Sheets/Relationships_Cheat_Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Relationships_Cheat_Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/Simulation Cheat Sheet.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Simulation Cheat Sheet"
3 | author: "Nick Huntington-Klein"
4 | date: "April 5, 2019"
5 | output: pdf_document
6 | ---
7 | 
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)
10 | ```
11 | 
12 | ## The Purpose of Simulations
13 | 
14 | - Data is *simulated* if we randomly generate it, and then test its properties
15 | - We do this because, when we make our own data, *we know the process that generated the data*
16 | - That way, we can know whether the *method* we use is capable of getting the right result, because we know what the right result is
17 | 
18 | ## Generating Random Data
19 | 
20 | We can generate categorical or logical random data using `sample()` with the `replace=TRUE` option. This will choose, with equal probability (or unequal probabilities if you specify with the `prob` option), between the elements of the vector you give it.
21 | 
22 | We can generate *normally* distributed data with `rnorm()`. Normal data is symmetrically distributed around a mean and has no maximum or minimum. By default, the `mean` is 0 and the standard deviation `sd` is 1.
23 | 
24 | We can generate *uniformly* distributed data with `runif()`. Uniformly distributed data is equally likely to take any value between its `min`imum and `max`imum, which are by default 0 and 1.
25 | 
26 | In each case, remember to tell it how many observations you want (1000 in the below example).
27 | 
28 | ```{r, echo=TRUE}
29 | library(tidyverse)
30 | df <- tibble(category = sample(c('A','B','C'),1000,replace=T),
31 |              normal = rnorm(1000),
32 |              uniform = runif(1000))
33 | 
34 | table(df$category)
35 | 
36 | library(stargazer)
37 | stargazer(as.data.frame(df %>% select(normal, uniform)),type='text')
38 | ```
39 | 
40 | ## Determining the Data Generating Process
41 | 
42 | You can make sure that `X -> Y` in the data by using values of `X` in the creation of `Y`. In the below example, `W -> X`, `X -> Y`, and `W -> Y`. You can tell because `W` is used in the creation of `X`, and both `X` and `W` are used in the creation of `Y`.
43 | 
44 | You can incorporate categorical variables into this process by turning them into logicals.
45 | 
46 | Remember that, if you want to use one variable to create another, you need to finish creating it, and then create the variable it causes in a later `mutate()` command.
47 | 
48 | ```{r, echo=TRUE}
49 | data <- tibble(W = sample(c('C','D','E'),2000,replace=T)) %>%
50 |   mutate(X = .1*(W == 'C') + runif(2000)) %>%
51 |   mutate(Y = 2*(W == 'C') - 2*X + rnorm(2000))
52 | ```
53 | 
54 | 
55 | ## For Loops
56 | 
57 | A for loop executes a chunk of code over and over again, each time incrementing some index by 1. Remember that the parentheses and curly braces are required, and also that if you want it to show results while you're running a loop, you must use `print()`.
58 | 
59 | If you are interested in getting further into R, I'd recommend trying to learn about the `apply` family of functions, which perform loops more quickly (there's a small sketch after this section).
60 | 
61 | ```{r, echo=TRUE}
62 | for (i in 1:5) {
63 |   print(i)
64 | }
65 | ```
66 | 
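As a taste of that `apply` family, here is a minimal sketch that runs a small made-up simulation 1000 times and stores each result, with `sapply()` handling the index bookkeeping for you:

```{r, echo=TRUE}
results <- sapply(1:1000, function(i) {
  x <- rnorm(100)
  y <- 2*x + rnorm(100)  # a made-up data generating process
  cor(x, y)              # the result stored for iteration i
})
mean(results)
```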
67 | ## Putting the Simulation Together
68 | 
69 | In a simulation, we want to:
70 | 
71 | - Create a blank vector to store our results
72 | - (in loop) Simulate data
73 | - (in loop) Get a result
74 | - (in loop) Store the result
75 | - Show what we found
76 | 
77 | We can create a blank vector to store our results with `c(NA)`. Then, in the loop, whatever calculation we do we can store using the index we're looping over.
78 | 
79 | Note that, if you like, you can make a blank data frame or tibble instead of a vector, and use `add_row` in dplyr to store results.
80 | 
81 | ```{r, echo = TRUE}
82 | results <- c(NA)
83 | 
84 | for (i in 1:1000) {
85 |   data <- tibble(W = sample(c('C','D','E'),2000,replace=T)) %>%
86 |     mutate(X = 1.5*(W == 'C') + runif(2000)) %>%
87 |     mutate(Y = 2*(W == 'C') - .5*X + rnorm(2000))
88 | 
89 |   results[i] <- cor(data$X,data$Y)
90 | }
91 | ```
92 | 
93 | After you're done you can show what happened using standard methods for analyzing a single variable, like creating a summary statistics table with `stargazer()` (don't forget to use `as.data.frame()`) or plotting the density. Make sure you're thinking about what the *true* answer is and how far you are from it. For example, here, we know that `X` negatively affects `Y`, and we might be interested in how often we find that.
94 | 
95 | ```{r, echo=TRUE}
96 | library(stargazer)
97 | stargazer(as.data.frame(results),type='text')
98 | ```
99 | 
100 | ```{r, echo=TRUE, fig.width=5, fig.height = 3.5}
101 | plot(density(results),xlab='Correlation Between X and Y',main='Simulation Results')
102 | abline(v=0)
103 | ```
104 | 
--------------------------------------------------------------------------------
/Cheat Sheets/Simulation_Cheat_Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/Simulation_Cheat_Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/dplyr Cheat Sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Cheat Sheets/dplyr Cheat Sheet.pdf
--------------------------------------------------------------------------------
/Cheat Sheets/floatposition.tex:
--------------------------------------------------------------------------------
1 | \usepackage{float}
2 | \let\origfigure\figure
3 | \let\endorigfigure\endfigure
4 | \renewenvironment{figure}[1][2] {
5 |     \expandafter\origfigure\expandafter[H]
6 | } {
7 |     \endorigfigure
8 | }
--------------------------------------------------------------------------------
/ECON 305 Syllabus Updated Template.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/ECON 305 Syllabus Updated Template.docx
--------------------------------------------------------------------------------
/Instructor Guide/.Rhistory:
--------------------------------------------------------------------------------
1 | help(install.packages)
2 | install.packages(c('tidyverse','dplyr','magrittr','devtools','roxygen2','ggplot2','stargazer','estimatr','vtable','haven','readxl','lubridate'))
3 | install.packages(c('AER','car','zoo','Ecdat'))
4 | install.packages('wooldridge')
5 | remotes::install_github('r-lib/vctrs')
8 | remotes::install_github('tidyverse/vctrs')
9 | install.packages('vctrs')
10 | remotes::install_github('tidyverse/tidyr')
12 | devtools::install_github('r-lib/vctrs')
13 | library(estimatr)
14 | library(tidyverse)
15 | install.packages('rlang')
19 | help(tibble)
20 | df <- tibble(z = rnorm(300),
21 |   gamma = c(rep(-1,100),rep(0,100),rep(1,100))) %>%
22 |   mutate(x = gamma*z + rnorm(300)) %>%
23 |   mutate(y = 1.5*x+rnorm(300))
24 | help("iv_robust")
25 | summary(iv_robust(y~x|z,data=df))
26 | summary(iv_robust(y~x|z,data=df,weights=gamma))
27 | summary(iv_robust(y~x|z,data=df,subset=gamma==1))
28 | df <- tibble(z = rnorm(300),
29 |   gamma = c(rep(0,100),rep(0.1,100),rep(1,100))) %>%
30 |   mutate(x = gamma*z + rnorm(300)) %>%
31 |   mutate(y = 1.5*x+rnorm(300))
32 | summary(iv_robust(y~x|z,data=df))
44 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
45 | df <- tibble(z = rnorm(300),
46 |   gamma = c(rep(0,100),rep(0.05,100),rep(.5,100))) %>%
47 |   mutate(x = gamma*z + rnorm(300)) %>%
48 |   mutate(y = 1.5*x+rnorm(300))
49 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
50 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
51 | df <- tibble(z = rnorm(300),
52 |   gamma = c(rep(0,100),rep(0.05,150),rep(.5,50))) %>%
53 |   mutate(x = gamma*z + rnorm(300)) %>%
54 |   mutate(y = 1.5*x+rnorm(300))
55 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
56 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
69 | df <- tibble(z = rnorm(300),
70 |   gamma = c(rep(0,100),rep(0.05,150),rep(.5,50))) %>%
71 |   mutate(x = gamma*z + rnorm(300)) %>%
72 |   mutate(y = (1.5*(1+gamma))*x+rnorm(300))
73 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
74 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
75 | df <- tibble(z = rnorm(30000),
76 |   gamma = c(rep(0,10000),rep(0.05,15000),rep(.5,5000))) %>%
77 |   mutate(x = gamma*z + rnorm(30000)) %>%
78 |   mutate(y = (1.5*(1+gamma))*x+rnorm(30000))
79 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
80 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
81 | df <- tibble(z = rnorm(30000),
82 |   gamma = c(rep(0,10000),rep(0.05,15000),rep(.5,5000))) %>%
83 |   mutate(x = gamma*z + rnorm(30000)) %>%
84 |   mutate(y = (1.5)*x+rnorm(30000))
85 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
86 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
105 | df <- tibble(z = rnorm(30000),
106 |   gamma = c(rep(0,10000),rep(0.05,15000),rep(.5,5000))) %>%
107 |   mutate(x = gamma*z + rnorm(30000)) %>%
108 |   mutate(y = (1.5*(1-gamma))*x+rnorm(30000))
109 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
110 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
123 | df <- tibble(z = rnorm(30000),
124 |   gamma = c(rep(0,10000),rep(0.05,15000),rep(1,5000))) %>%
125 |   mutate(x = gamma*z + rnorm(30000)) %>%
126 |   mutate(y = (1.5*(1-gamma))*x+rnorm(30000))
127 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
128 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
129 | df <- tibble(z = rnorm(30000),
130 |   gamma = c(rep(0,10000),rep(0.00,15000),rep(1,5000))) %>%
131 |   mutate(x = gamma*z + rnorm(30000)) %>%
132 |   mutate(y = (1.5*(1-gamma))*x+rnorm(30000))
133 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
134 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
135 | rep(1,2)
136 | View(df)
137 | head(df)
138 | df <- tibble(z = rnorm(30000),
139 |   gamma = c(rep(0,10000),rep(0.00,15000),rep(1,5000))) %>%
140 |   mutate(x = gamma*z + rnorm(30000)) %>%
141 |   mutate(y = (c(rep(.5,10000),rep(.25,15000),rep(.75,5000)))*x+rnorm(30000))
142 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
143 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
144 | df <- tibble(z = rnorm(30000),
145 |   gamma = c(rep(0,10000),rep(0.05,15000),rep(1,5000))) %>%
146 |   mutate(x = gamma*z + rnorm(30000)) %>%
147 |   mutate(y = (c(rep(.5,10000),rep(.25,15000),rep(.75,5000)))*x+rnorm(30000))
148 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
149 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
156 | df <- tibble(z = rnorm(300),
157 |   gamma = c(rep(0,100),rep(0.05,150),rep(1,50))) %>%
158 |   mutate(x = gamma*z + rnorm(300)) %>%
159 |   mutate(y = (c(rep(.5,100),rep(.25,150),rep(.75,50)))*x+rnorm(300))
160 | summary(iv_robust(y~x|z,data=df,diagnostics=TRUE))
161 | summary(iv_robust(y~x|z,data=df,weights=gamma,diagnostics=TRUE))
174 | setwd("C:/Users/nickc/Dropbox (CSU Fullerton)/Teaching/Econ 305/Instructor Guide")
175 | install.packages('tinytex')
176 | tinytex::install_tinytex()
177 | 
--------------------------------------------------------------------------------
/Instructor Guide/Instructor_Guide.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Instructor Guide/Instructor_Guide.pdf
--------------------------------------------------------------------------------
/Lectures/Animation_Local_Linear_Regression.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Animation_Local_Linear_Regression.gif
-------------------------------------------------------------------------------- /Lectures/GDP.csv: -------------------------------------------------------------------------------- 1 | Year,GDP,Country 2 | 1960,5.43E+11,United States 3 | 1961,5.63E+11,United States 4 | 1962,6.05E+11,United States 5 | 1963,6.39E+11,United States 6 | 1964,6.86E+11,United States 7 | 1965,7.44E+11,United States 8 | 1966,8.15E+11,United States 9 | 1967,8.62E+11,United States 10 | 1968,9.43E+11,United States 11 | 1969,1.02E+12,United States 12 | 1970,1.08E+12,United States 13 | 1971,1.17E+12,United States 14 | 1972,1.28E+12,United States 15 | 1973,1.43E+12,United States 16 | 1974,1.55E+12,United States 17 | 1975,1.69E+12,United States 18 | 1976,1.88E+12,United States 19 | 1977,2.09E+12,United States 20 | 1978,2.36E+12,United States 21 | 1979,2.63E+12,United States 22 | 1980,2.86E+12,United States 23 | 1981,3.21E+12,United States 24 | 1982,3.34E+12,United States 25 | 1983,3.64E+12,United States 26 | 1984,4.04E+12,United States 27 | 1985,4.35E+12,United States 28 | 1986,4.59E+12,United States 29 | 1987,4.87E+12,United States 30 | 1988,5.25E+12,United States 31 | 1989,5.66E+12,United States 32 | 1990,5.98E+12,United States 33 | 1991,6.17E+12,United States 34 | 1992,6.54E+12,United States 35 | 1993,6.88E+12,United States 36 | 1994,7.31E+12,United States 37 | 1995,7.66E+12,United States 38 | 1996,8.10E+12,United States 39 | 1997,8.61E+12,United States 40 | 1998,9.09E+12,United States 41 | 1999,9.66E+12,United States 42 | 2000,1.03E+13,United States 43 | 2001,1.06E+13,United States 44 | 2002,1.10E+13,United States 45 | 2003,1.15E+13,United States 46 | 2004,1.23E+13,United States 47 | 2005,1.31E+13,United States 48 | 2006,1.39E+13,United States 49 | 2007,1.45E+13,United States 50 | 2008,1.47E+13,United States 51 | 2009,1.44E+13,United States 52 | 2010,1.50E+13,United States 53 | 2011,1.55E+13,United States 54 | 2012,1.62E+13,United States 55 | 2013,1.67E+13,United States 56 | 2014,1.74E+13,United States 57 | 2015,1.81E+13,United States 58 | 2016,1.86E+13,United States 59 | 2017,1.94E+13,United States 60 | 1960,1.37E+12,World 61 | 1961,1.42E+12,World 62 | 1962,1.53E+12,World 63 | 1963,1.64E+12,World 64 | 1964,1.80E+12,World 65 | 1965,1.96E+12,World 66 | 1966,2.13E+12,World 67 | 1967,2.26E+12,World 68 | 1968,2.44E+12,World 69 | 1969,2.69E+12,World 70 | 1970,2.96E+12,World 71 | 1971,3.27E+12,World 72 | 1972,3.77E+12,World 73 | 1973,4.59E+12,World 74 | 1974,5.30E+12,World 75 | 1975,5.90E+12,World 76 | 1976,6.42E+12,World 77 | 1977,7.26E+12,World 78 | 1978,8.54E+12,World 79 | 1979,9.93E+12,World 80 | 1980,1.12E+13,World 81 | 1981,1.15E+13,World 82 | 1982,1.14E+13,World 83 | 1983,1.16E+13,World 84 | 1984,1.21E+13,World 85 | 1985,1.27E+13,World 86 | 1986,1.50E+13,World 87 | 1987,1.71E+13,World 88 | 1988,1.92E+13,World 89 | 1989,2.01E+13,World 90 | 1990,2.26E+13,World 91 | 1991,2.39E+13,World 92 | 1992,2.54E+13,World 93 | 1993,2.58E+13,World 94 | 1994,2.77E+13,World 95 | 1995,3.08E+13,World 96 | 1996,3.15E+13,World 97 | 1997,3.14E+13,World 98 | 1998,3.13E+13,World 99 | 1999,3.25E+13,World 100 | 2000,3.36E+13,World 101 | 2001,3.34E+13,World 102 | 2002,3.46E+13,World 103 | 2003,3.89E+13,World 104 | 2004,4.38E+13,World 105 | 2005,4.74E+13,World 106 | 2006,5.13E+13,World 107 | 2007,5.78E+13,World 108 | 2008,6.34E+13,World 109 | 2009,6.01E+13,World 110 | 2010,6.60E+13,World 111 | 2011,7.33E+13,World 112 | 2012,7.50E+13,World 113 | 2013,7.71E+13,World 114 | 2014,7.91E+13,World 115 | 2015,7.48E+13,World 116 | 2016,7.59E+13,World 117 
| 2017,8.07E+13,World 118 | -------------------------------------------------------------------------------- /Lectures/Lecture_01-cloud-market-revenue.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_01-cloud-market-revenue.png -------------------------------------------------------------------------------- /Lectures/Lecture_01_Politics_Example.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_01_Politics_Example.PNG -------------------------------------------------------------------------------- /Lectures/Lecture_01_World_of_Data.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 1: A World of Data" 3 | author: "Nick Huntington-Klein" 4 | date: "December 1, 2018" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | --- 15 | 16 | ```{r setup, include=FALSE} 17 | knitr::opts_chunk$set(echo = FALSE) 18 | library(tidyverse) 19 | theme_set(theme_gray(base_size = 15)) 20 | ``` 21 | 22 | ## A World of Data 23 | 24 | It's cliche to say that the world focuses more on data than ever before, but that's just because it's true 25 | 26 | Even more so than understanding *statistics and probability*, in order to understand the world around us we need to understand *data*, *how data is used*, and *what it means* 27 | 28 | Google and Facebook, among many others, have reams and reams of data on you and everybody else. What do they do with it? Why?
29 | 30 | ## Understanding the World 31 | 32 | Increasingly, understanding the world is going to require the ability to understand data 33 | 34 | And learning things about the world is going to require the ability to manipulate data 35 | 36 | ## Data Scientist Pay 37 | 38 | ```{r, echo = FALSE} 39 | #Read in data 40 | salary <- read.csv(text="Experience,Salary 41 | 0.11508951406649626, 89130.43478260869 42 | 1.877024722932652, 92521.73913043478 43 | 4.128303495311169, 96956.52173913045 44 | 5.890238704177326, 100347.82608695651 45 | 8.044117647058828, 104000 46 | 10.051577152600172, 106869.5652173913 47 | 12.205882352941181, 110000 48 | 14.654731457800512, 112608.6956521739 49 | 16.956308610400683, 115478.26086956522 50 | 19.01300085251492, 118086.95652173912 51 | 19.99168797953965, 120173.91304347827 52 | 21.8994032395567, 125130.4347826087 53 | 23.758312020460362, 129826.08695652173 54 | 25.665387894288155, 135565.21739130435 55 | 27.425618073316286, 141043.47826086957 56 | 29.67455242966753, 148347.82608695654") 57 | #Use ggplot to plot out data 58 | ggplot(salary,aes(x=Experience,y=Salary))+ 59 | #with a smoothed line graph 60 | stat_smooth(method = lm, formula = y ~ poly(x, 10), se = FALSE)+ 61 | #Have y-axis start at 50k 62 | expand_limits(y=50000)+ 63 | #Add labels 64 | labs(title="Data Scientist Salary by Experience",subtitle="Data from Glassdoor; avg college grad starts $49,875")+ 65 | xlab("Experience (Years)") 66 | ``` 67 | 68 | ## Top Jobs for Economics Majors 69 | 70 | Data from The Balance Careers 71 | 72 | ```{r, echo = FALSE} 73 | #Read in data 74 | topjobs <- read.csv(text="Job,Salary,UsesData 75 | Market Research Analyst,71450,A Lot 76 | Economic Consultant,112650,A Lot 77 | Comp & Bfts Manager,130010,A Lot 78 | Actuary,114850,A Lot 79 | Credit Analyst,82900,A Little 80 | Financial Analyst,99430,A Little 81 | Policy Analyst,112030,A Lot 82 | Lawyer,141890,Not a Lot 83 | Management Consultant,93440,A Little 84 | Business Reporter,67500,Not a Lot") 85 | #Sort so it goes from lowest salary to highest 86 | topjobs$Job <- reorder(topjobs$Job, topjobs$Salary) 87 | #Reorder factor so it goes least to most 88 | topjobs$UsesData <- factor(topjobs$UsesData,levels=c("Not a Lot","A Little","A Lot")) 89 | #Plot out 90 | ggplot(topjobs,aes(x=Job,y=Salary/1000,fill=UsesData))+ 91 | #With a bar graph 92 | geom_col()+ 93 | #Label 94 | ylab("Avg. 
Salary (Thousands)")+xlab(element_blank())+ 95 | labs(title="Do Top-Ten Econ Major Jobs Use Data?")+ 96 | #Rotate job labels so they fit 97 | theme(axis.text.x = element_text(angle = 45, hjust = 1)) 98 | ``` 99 | 100 | ## We Use Data to Understand the Economy 101 | 102 | ```{r, echo = FALSE} 103 | #Read in data 104 | gdp <- read.csv('GDP.csv') 105 | #Get GDP in the first year of data using dplyr 106 | gdp <- gdp %>% 107 | group_by(Country) %>% 108 | mutate(firstGDP=GDP[1]) %>% 109 | mutate(gdprel=GDP/firstGDP) 110 | #Plot data 111 | ggplot(gdp,aes(x=Year,y=gdprel,color=Country))+ 112 | #Line graph 113 | geom_line()+ 114 | #Label 115 | xlab("Year")+ylab("GDP Relative to 1960")+ 116 | theme(legend.title=element_blank()) 117 | 118 | ``` 119 | 120 | ## We Use Data to Understand Business 121 | 122 | ![Data from SkyHighNetworks](Lecture_01-cloud-market-revenue.png) 123 | 124 | ## We Use Data to Understand Politics 125 | 126 | 127 | ![Data from FiveThirtyEight](Lecture_01_Politics_Example.PNG) 128 | 129 | ## We Use Data to Understand the World 130 | 131 | ```{r, echo = FALSE} 132 | #Read in data 133 | data(co2) 134 | #Plot, cex for bigger font 135 | plot(co2,xlab="Year",ylab="Atmospheric CO2 Concentration",cex=1.75) 136 | ``` 137 | 138 | ## This Class 139 | 140 | In this class, we'll be accomplishing a few goals. 141 | 142 | - Learning how to use the statistical programming language R 143 | - Learning how to understand the data we see in the world 144 | - Learning how to figure out *what data actually tells us* 145 | - Learning about *causal inference* - the economist's comparative advantage! 146 | 147 | ## Why Programming? 148 | 149 | Why do we have to learn to code? Why not just use Excel? 150 | 151 | - Excel is great at being a spreadsheet. You should learn it. It's a pretty bad data analysis tool though 152 | - Learning a programming language is a very important skill 153 | - R is free, very flexible (heck, I wrote these slides in R), is growing in popularity, will be used in other econometrics courses, and makes it easy to jump to something like Python if need be 154 | 155 | ## Don't Be Scared 156 | 157 | - Programming isn't all that hard 158 | - You're just telling the computer what to do 159 | - The computer will do exactly as you say 160 | - Just imagine it's like your bratty little sibling who would do what you said, *literally* 161 | 162 | ## Plus 163 | 164 | - As mentioned, once you know one language it's much easier to learn others 165 | - There will be plenty of resources and cheat sheets to help you 166 | - Ever get curious and have a question? Now you can just *answer it*. How cool is that? 167 | 168 | ## Causal Inference? 169 | 170 | What is causal inference? 171 | 172 | - It's easy to get data to tell us what happened, but not **why**. "Correlation does not equal causation" 173 | - Economists have been looking at causation for longer than most other fields. We're good at it! 174 | - Causal inference is often necessary to link data to *models* and actually *learn how the world works* 175 | - We'll be taking a special approach to causal inference, one that lets us avoid complex mathematical models 176 | 177 | ## Lucky You! 178 | 179 | This is a pretty unusual course. We're lucky enough to be able to start the econometrics sequence off this way. 180 | 181 | In most places, you have to learn programming *while* learning advanced methods, and similarly for causal inference!
182 | 183 | Here we have time to build these important skills and intuitions before sending you into the more mathematical world of other econometrics courses 184 | 185 | ## Structure of the Course 186 | 187 | 1. Programming and working with data 188 | 2. Causal Inference and learning from data 189 | 3. Onto the next course! 190 | 191 | ## Admin 192 | 193 | - Syllabus 194 | - Homework (due Sundays, including this coming Sunday) 195 | - Short writing projects 196 | - Attendance 197 | - Midterms 198 | - Final 199 | - Extra Credit 200 | 201 | ## An Example 202 | 203 | - Let's look at a real-world application of data to an important economic problem 204 | - To look for: What data are they using? 205 | - How do they tell a story with it? 206 | - What can we learn from numbers alone? 207 | - How do they interpret the data? Can we trust it? 208 | - [Economic Lives of the Middle Class](https://www.nytimes.com/2018/11/03/upshot/how-the-economic-lives-of-the-middle-class-have-changed-since-2016-by-the-numbers.html) 209 | -------------------------------------------------------------------------------- /Lectures/Lecture_02_Understanding_Data.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 2: Understanding Data" 3 | author: "Nick Huntington-Klein" 4 | date: "December 1, 2018" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE) 19 | library(tidyverse) 20 | theme_set(theme_gray(base_size = 15)) 21 | ``` 22 | 23 | ## What's the Point? 24 | 25 | What are we actually trying to DO when we use data? 26 | 27 | Contrary to popular opinion, the point isn't to make pretty graphs or to make a point, or justify something you've done. 28 | 29 | Those may be nice side effects! 30 | 31 | ## Uncovering Truths 32 | 33 | The cleanest way to think about data analysis is to remember that data *comes from somewhere* 34 | 35 | There was some process that *generated that data* 36 | 37 | Our goal, in all data analysis, is to get some idea of *what that process is* 38 | 39 | ## Example 40 | 41 | - Imagine a basic coin flip 42 | - Every time we flip, we get heads half the time and tails half the time 43 | - The TRUE process that generates the data is that there's a coin that's heads half the time and tails half the time 44 | - If we analyze the data correctly, we should report back that the coin is heads half the time 45 | - Let's try calculating the *proportion* of heads 46 | 47 | ## Example 48 | 49 | ```{r, echo = TRUE} 50 | #Generate 500 heads and tails 51 | data <- sample(c("Heads","Tails"),500,replace=TRUE) 52 | #Calculate the proportion of heads 53 | mean(data=="Heads") 54 | ``` 55 | ```{r fig.width=5, fig.height=4, echo = FALSE} 56 | df <- data.frame(Result=data) 57 | ggplot(df,aes(x=Result))+geom_bar()+ylab("Count") 58 | ``` 59 | 60 | ## Example 61 | 62 | - Let's try out that code in R a few times and see what happens 63 | - First, what do we *want* to happen? What should we see if our data analysis method is good? 64 | 65 | ## How Good Was It? 66 | 67 | - Our data analysis consistently told us that the coin was generating heads about half the time - the true process! 68 | - Our data analysis lets us conclude the coin is fair 69 | - That is describing the true data generating process pretty well! 
70 | - Let's think - what other approaches could we have taken? What would the pros and cons be? 71 | - Counting the heads instead of taking the proportion? 72 | - Taking the mean and adding .1? 73 | - Just saying it's 50%? 74 | 75 | ## Another Example 76 | 77 | - People have different amounts of money in their wallet, from 0 to 10 78 | - We flip a coin and, if it's heads, give them a dollar 79 | - What's the data generating process here? 80 | - What should our data analysis uncover? 81 | 82 | ## Another Example 83 | ```{r, echo = TRUE} 84 | #Generate 1000 wallets and 1000 heads and tails 85 | data <- data.frame(wallets=sample(0:10,1000,replace=TRUE)) 86 | data$coin <- sample(c("Heads","Tails"),1000,replace=TRUE) 87 | #Give a dollar whenever it's a heads, then get average money by coin 88 | data <- data %>% mutate(wallets = wallets + (coin=="Heads")) 89 | data %>% group_by(coin) %>% summarize(wallets = mean(wallets)) 90 | ``` 91 | ```{r fig.width=5, fig.height=3, echo = FALSE} 92 | aggdat <- data %>% group_by(coin) %>% summarize(wallets = mean(wallets)) 93 | ggplot(aggdat,aes(x=coin,y=wallets))+geom_col()+ylab("Average in Wallet")+xlab("Flip") 94 | ``` 95 | 96 | ## Conclusions 97 | 98 | - What does our data analysis tell us? 99 | - We *observe* a difference of `r abs(round(aggdat$wallets[2]-aggdat$wallets[1],2))` between the heads and tails 100 | - Because we know that nothing would have caused this difference other than the coin flip, we can conclude that the coin flip *is why* there's a difference of `r abs(round(aggdat$wallets[2]-aggdat$wallets[1],2))` 101 | 102 | ## But What If? 103 | 104 | - So far we've been cheating 105 | - We know exactly what process generated that data 106 | - So really, our data analysis doesn't matter 107 | - But what if we *don't* know that? 108 | - We want to make sure our *method* is good, so that when we draw a conclusion from our data, it's the right one 109 | 110 | ## Example 3 111 | 112 | - We're economists! So no big surprise we might be interested in demand curves 113 | - Demand curve says that as $P$ goes up, $Q_d$ goes down 114 | - Let's gather data on $P$ and $Q_d$ 115 | - And determine things like the slope of demand, demand elasticity, etc. 116 | - Just so happens I have some data on 1879-1923 US food export prices and quantities! 
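- Before we turn to the real data, here's a sketch of the same kind of check we just did with the coin and the wallets, applied to demand: simulate a market where we *know* demand slopes down, and see what a good method should find (the numbers below are made up purely for illustration)

```{r, echo=TRUE, eval=FALSE}
#A made-up data generating process where demand REALLY does slope down:
#quantity demanded falls as price rises, plus some random noise
price <- runif(45, min = 60, max = 270)
quantity <- 250 - 0.5*price + rnorm(45, sd = 10)
#If our analysis method is good, it should report a NEGATIVE relationship
cor(price, quantity)
```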
117 | 118 | ## Example 3 119 | 120 | ```{r, echo = FALSE} 121 | #Bring in food data 122 | foodPQ <- read.csv(text='Year,FoodPrice,FoodQuantity 123 | 1879,93.4,148.5 124 | 1880,96.1,162.3 125 | 1881,102.6,117.7 126 | 1882,109,81.4 127 | 1883,104,82.7 128 | 1884,91.4,77.2 129 | 1885,85.4,70.9 130 | 1886,80.4,89 131 | 1887,81.6,85.3 132 | 1888,86,56.8 133 | 1889,75.4,85.9 134 | 1890,76.6,96.5 135 | 1891,100.1,119.1 136 | 1892,86.6,140.7 137 | 1893,77.7,107.8 138 | 1894,67.7,97.1 139 | 1895,69.4,92 140 | 1896,63.5,156.6 141 | 1897,71.3,197.7 142 | 1898,76.7,218.2 143 | 1899,74.2,186.5 144 | 1900,74.1,176.6 145 | 1901,77.4,182.7 146 | 1902,82,112.1 147 | 1903,81.4,123.1 148 | 1904,80.3,73.2 149 | 1905,82.5,108.8 150 | 1906,81.7,126.1 151 | 1907,95,118.2 152 | 1908,99.8,98.3 153 | 1909,104.2,65.1 154 | 1910,98.7,55.1 155 | 1911,97.9,68.5 156 | 1912,104.2,79.4 157 | 1913,100,100 158 | 1914,114.5,140 159 | 1915,133.8,200.6 160 | 1916,144.2,168.6 161 | 1917,214.9,134.5 162 | 1918,234.6,132.9 163 | 1919,241.7,156.9 164 | 1920,268.2,192.2 165 | 1921,155.7,252.8 166 | 1922,127.3,210.6 167 | 1923,129.2,116.9') 168 | #Plot 169 | ggplot(foodPQ,aes(x=FoodQuantity,y=FoodPrice))+ 170 | #Add year labels 171 | geom_text(aes(label=Year))+ 172 | #and axis labels 173 | ylab("Price")+xlab("Quantity")+ggtitle("Food Export Price v. Quantity, US, 1879-1923") 174 | ``` 175 | 176 | ## Example 3 177 | 178 | We can calculate the correlation between $P$ and $Q$: 179 | 180 | ```{r, echo = TRUE} 181 | cor(foodPQ$FoodPrice,foodPQ$FoodQuantity) 182 | ``` 183 | 184 | ## Example 3 Conclusions 185 | 186 | - We *observe* a POSITIVE correlation of `r round(cor(foodPQ$FoodPrice,foodPQ$FoodQuantity),2)` 187 | - But demand curves shouldn't slope upwards... huh? 188 | - Does demand really slope up? Does this tell us about the process that generated this data? Why or why not? 189 | - Why do we see the data we do? Let's try to think of some reasons. 190 | 191 | ## Getting Difficult 192 | 193 | - We need to be more careful to figure out what's actually going on 194 | - Plus, the more we know about the context and the underlying model, the more likely it is that we won't miss something important 195 | 196 | ## Getting Difficult 197 | 198 | - Our fake examples were easy because we knew perfectly where the data came from 199 | - But how much do you know about food prices in turn-of-the-century US? 200 | - Or for that matter how prices are set in the food industry? 201 | - There's more work to do in uncovering the process that made the data 202 | 203 | 204 | ## For the Record 205 | 206 | - Likely, one of the big problems with our food analysis was that we forgot to account for DEMAND shifting around, showing us what SUPPLY looks like. Supply *does* slope up! 207 | - Just because we loudly announced that we wanted the demand curve doesn't force the data to give it to us! 208 | - Let's imagine the process that might have generated this data by starting with the model itself and seeing how we can *generate* this data 209 | 210 | ## But... 211 | 212 | - If we can figure out what methods work well when we *do* know the right answer 213 | - And apply them when we *don't*... 214 | - We can figure out what those processes are 215 | - And that's the goal!
216 | - This is what we'll be figuring out how to do technically during the programming part of the class, and conceptually during causal inference 217 | 218 | 219 | 220 | ## R 221 | 222 | - We'll need good tools to do it 223 | - We'll be using R 224 | - Let's go through how R can be installed, in preparation for next week 225 | - [R-Project.org](http://www.r-project.org) 226 | - [RStudio.com](http://www.rstudio.com) 227 | - [RStudio.cloud](http://rstudio.cloud) 228 | 229 | ## Finish us out 230 | 231 | - Now that we're all prepared let's get a jump on homework 232 | - Go to New York Times' The Upshot or FiveThirtyEight and find an article that uses data 233 | - See Homework 1 234 | - Together let's start thinking about answers to these questions 235 | -------------------------------------------------------------------------------- /Lectures/Lecture_03_Introduction_to_R.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 3: Introduction to R and RStudio" 3 | author: "Nick Huntington-Klein" 4 | date: "January 17, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE) 19 | library(tidyverse) 20 | theme_set(theme_gray(base_size = 15)) 21 | ``` 22 | 23 | ## Welcome to R 24 | 25 | The first half of this class will be dedicated to getting familiar with R. 26 | 27 | R is a language for working with data. RStudio is an environment for working with that language. 28 | 29 | By the time we're done, you should be comfortable manipulating and examining data. 30 | 31 | ## Why R? 32 | 33 | This is going to accomplish a few things for us. 34 | 35 | - As we mentioned last week: Excel/Sheets is a great tool for accountants, not for working with data. 36 | - Learning to program is a highly valuable skill 37 | - R is basically a big cool calculator, and we want to do big cool calculations 38 | 39 | ## A few notes 40 | 41 | - These slides are written in R; you can look up the code I'm using on Titanium if you want to see how I did something 42 | - The assigned videos are there to help 43 | - Other resources on the [Econometrics Help](http://nickchk.com/econometrics.html) list, available on Titanium 44 | - Don't be afraid of programming! It's a language like any other, just a language that requires you to be very precise. Tell the computer what you want! 45 | - If you're already a programmer, you may get a little bored. If you'd like to learn something more advanced, let me know. 46 | - Ask lots of questions!!! (!) 47 | 48 | ## Today 49 | 50 | - Get used to working in RStudio 51 | - Figure out how R conceptualizes working with stuff 52 | - Figure out the Help system <- Important! (quick preview below)
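As a quick taste of that last one, either of these typed into the console will pull up R's documentation for the `mean` function - just a preview, we'll walk through what the Help system actually shows us later today:

```{r, echo=TRUE, eval=FALSE}
help(mean) #Opens the documentation for mean() in the Help tab
?mean #Shorthand for exactly the same thing
```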
53 | 54 | ## The RStudio Panes 55 | 56 | - Console 57 | - Environment Pane 58 | - Browser Pane 59 | - Source Editor 60 | - Li'l tip: Ctrl/Cmd + Shift + number will maximize one of these panes 61 | 62 | ## Console 63 | 64 | - Typically bottom-left 65 | - This is where you can type in code and have it run immediately 66 | - Or, when you run code from the Source Editor, it will show up here 67 | - It will also show any output or errors 68 | 69 | ## Console Example 70 | 71 | - Let's copy/paste some code in there to run 72 | 73 | ```{r, echo = TRUE, eval = FALSE} 74 | #Generate 500 heads and tails 75 | data <- sample(c("Heads","Tails"),500,replace=TRUE) 76 | #Calculate the proportion of heads 77 | mean(data=="Heads") 78 | #This line should give an error - it didn't work! 79 | data <- sample(c("Heads","Tails"),500,replace=BLUE) 80 | #This line should give a warning 81 | #It did SOMETHING but maybe not what you want 82 | mean(data) 83 | #This line won't give an error or a warning 84 | #But it's not what we want! 85 | mean(data=="heads") 86 | ``` 87 | 88 | ## What We Get Back 89 | 90 | - We can see the code that we've run 91 | - We can see the output of that code, if any 92 | - We can see any errors or warnings (in red). Remember - errors mean it didn't work. Warnings mean it *maybe* didn't work. 93 | - Just because there's no error or warning doesn't mean it DID work! Always think carefully 94 | 95 | ## Environment Pane 96 | 97 | - The output of what we've done can also be seen in the *Environment pane* 98 | - This is on the top-right 99 | - Two important tabs: Environment and History 100 | - Mostly, Environment 101 | 102 | ## History Tab 103 | 104 | - History shows us the commands that we've run 105 | - To save yourself some typing, you can re-run commands by double-clicking them or hitting Enter 106 | - Send to console with double-click/enter, send to source pane with Shift+double-click/Enter 107 | - Or use "To Console" or "To Source" buttons 108 | 109 | ## Environment Tab 110 | 111 | - Environment tab shows us all the objects we have in memory 112 | - For example, we created the `data` object, so we can see that in Environment 113 | - It shows us lots of handy information about that object too 114 | - (we'll get to that later) 115 | - You can erase everything with that little broom button (technically this does `rm(list=ls())`) 116 | 117 | ## Browser Pane 118 | 119 | - Bottom-right 120 | - Lots of handy stuff here! 121 | - Mostly, the *outcome* of what you do will be seen here 122 | 123 | ## Files Tab 124 | 125 | - Basic file browser 126 | - Handy for opening up files 127 | - Can also help you set the working directory: 128 | - Go to folder 129 | - In menu bar, Session 130 | - Set Working Directory 131 | - To Files Pane Location 132 | 133 | ## Plots and Viewer tabs 134 | 135 | - When you create something that must be viewed, like a plot, it will show up here. 
136 | 137 | ```{r, echo=TRUE, eval=FALSE} 138 | data(LifeCycleSavings) 139 | plot(density(LifeCycleSavings$pop75),main='Percent of Population over 75') 140 | ``` 141 | 142 | ```{r, echo=FALSE, eval=FALSE} 143 | ## THE GGPLOT2 WAY 144 | data(LifeCycleSavings) 145 | ggplot(LifeCycleSavings,aes(x=pop75))+stat_density(geom='line')+ 146 | ggtitle('Percent of Population over 75') 147 | ``` 148 | 149 | - Note the "Export" button here - this is an easy way to save plots you've created 150 | 151 | ## Packages Tab 152 | 153 | - This is one way to install new packages and load them in 154 | - We'll be talking more about packages later 155 | - I generally avoid this tab; better to do this via code 156 | - Why? Replicability! A **VERY** important reason to use code and not the GUI or, say, Excel 157 | - You always want to make sure your future self (or someone else) knows how to use your code 158 | - One thing I do use this for is checking for package updates, note handy "Update" button 159 | 160 | ## Help Tab 161 | 162 | - This is where help files appear when you ask for them 163 | - You can use the search bar here, or `help()` in the console 164 | - We'll be going over this today a bit later 165 | 166 | ## Source Pane 167 | 168 | - You should be working with code FROM THIS PANE, not the console! 169 | - Why? Replicability! 170 | - Also, COMMENTS! USE THEM! PLEASE! `#` lets you write a comment. 171 | - Switch between tabs like a browser 172 | - Set working directory: 173 | - Select a source file tab 174 | - In menu bar: Session 175 | - Set Working Directory 176 | - To Source File Location 177 | 178 | ## Running Code from the Source Pane 179 | 180 | - Select a chunk of code and hit the "Run" button 181 | - Click on a line of code and do Ctrl/Cmd-Enter to run just that line and advance to the next <- Super handy! 182 | - Going one line at a time lets you check for errors more easily 183 | - Let's try some! 184 | 185 | ```{r, eval=FALSE, echo=TRUE} 186 | data(mtcars) 187 | mean(mtcars$mpg) 188 | mean(mtcars$wt) 189 | 372+565 190 | log(exp(1)) 191 | 2^9 192 | (1+1)^9 193 | ``` 194 | 195 | ## Autocomplete 196 | 197 | - RStudio comes with autocomplete! 198 | - Typing in the Source Pane or the Console, it will try to fill in things for you 199 | - Command names (shows the syntax of the function too!) 200 | - Object names from your environment 201 | - Variable names in your data 202 | - Let's try redoing the code we just did, typing it out 203 | 204 | ## Help 205 | 206 | - Autocomplete is one way that RStudio tries to help you out 207 | - The way that R helps you out is with the documentation 208 | - When you start doing anything serious with a computer, like programming, the most important skills are: 209 | - Knowing to read documentation 210 | - Knowing to search the internet for help 211 | 212 | ## help() 213 | 214 | - You can get the documentation on most R objects using the `help()` function 215 | - `help(mean)`, for example, will show you: 216 | - What the function is 217 | - The "syntax" for the function 218 | - The available options for the function 219 | - Other, related functions, like `weighted.mean` 220 | - Ideally, some examples of proper use 221 | - Not just for functions/commands - some data sets will work too! Try `help(mtcars)` 222 | 223 | ## Searching the Internet 224 | 225 | - Even professional programmers will tell you - they spend a huge amount of time just looking up how to do stuff online 226 | - Why not? 
227 | - Granted, this won't come in handy for a midterm, but for actually doing work it's essential 228 | 229 | ## Searching the Internet 230 | 231 | - Just Google (or whatever) what you need! There will usually be a resource. 232 | - R-bloggers, Quick-R, StackOverflow 233 | - Or just Google. Try including "R" and the name of what you want in the search 234 | - If "R" isn't turning up anything, try Rstats 235 | - Ask on Twitter with `#rstats` 236 | - Ask on StackExchange or StackOverflow <- tread lightly! 237 | 238 | ## Example 239 | 240 | - We've picked up some data sets to use like LifeCycleSavings and mtcars with the data() function 241 | - What data sets can we get this way? 242 | - Let's try `help(data)` 243 | - Let's try searching the internet 244 | 245 | ## That's it! 246 | 247 | See you next time! -------------------------------------------------------------------------------- /Lectures/Lecture_04_Objects_and_Functions_in_R.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 4: Objects and Functions in R" 3 | author: "Nick Huntington-Klein" 4 | date: "January 17, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE) 19 | library(tidyverse) 20 | theme_set(theme_gray(base_size = 15)) 21 | ``` 22 | 23 | ## Working in R 24 | 25 | - Today we'll be starting to actually DO stuff in R 26 | - R is all about: 27 | - Creating objects 28 | - Looking at objects 29 | - Manipulating objects 30 | - That's, uh, it. That's all. That's what R does. 31 | 32 | ## Creating a Basic Object 33 | 34 | - Let's create an object. We're going to do this with the assignment operator `<-` (a.k.a. "gets") 35 | 36 | ```{r, echo=TRUE} 37 | a <- 4 38 | ``` 39 | - This creates an object called `a`. What is that object? It's a number. Specifically, it's the number 4. We know that because we took that 4 and we shoved it into `a`. 40 | - Why store it as an object rather than just saying 4? Oh, plenty of reasons. 41 | - We can do more complex calculations before storing it, too. 42 | ```{r, echo=TRUE} 43 | b <- sqrt(16)+10 44 | ``` 45 | 46 | 47 | ## Looking at Objects 48 | 49 | - We can see this object that we've created in the Environment pane 50 | - Putting that object on a line by itself in R will show us what it is. 51 | 52 | ```{r, echo=TRUE} 53 | a 54 | b 55 | ``` 56 | 57 | ## Looking at Objects 58 | 59 | - We can even create an object and look at it in the console without storing it. 60 | - Let's think about what is really happening in these lines 61 | ```{r, echo=TRUE} 62 | 3 63 | a+b 64 | ``` 65 | 66 | ## Looking at Objects Differently 67 | 68 | - We can run objects through **functions** to look at them in different ways 69 | 70 | ```{r, echo=TRUE} 71 | #What does a look like if we take the square root of it? 72 | sqrt(a) 73 | #What does it look like if we add 1 to it? 74 | a + 1 75 | #If we look at it, do we see a number? 76 | is.numeric(a) 77 | ``` 78 | 79 | ## Manipulating Objects 80 | 81 | - We manipulated the object and LOOKED at it, we can also SAVE it, or UPDATE it. 
82 | ```{r, echo=TRUE} 83 | #We looked at what a looked like with 1 added, but a itself wasn't changed 84 | a 85 | #Let's save a+1 as something else 86 | b <- a + 1 87 | #And let's overwrite a with its square root 88 | a <- sqrt(a) 89 | a 90 | b 91 | ``` 92 | 93 | ## Some Notes 94 | 95 | - Even though we changed `a`, `b` was already set using the old value of `a`, and so was still `4+1=5`, not `2+1=3`. 96 | - `a` basically got *reassigned* with `<-`. That's how we got it to be `2` 97 | - This is a very simple example, but basically *everything* in R is just this process but with more complex objects and more complex functions! 98 | 99 | ## Types of Objects 100 | 101 | - We already determined that `a` was a number. But what else could it be? What other kinds of variables are there? 102 | - Some basic object types: 103 | - Numeric: A single number 104 | - Character: A string of letters, like `'hello'` 105 | - Logical: `TRUE` or `FALSE` (or `T` or `F`) 106 | - Factor: A category, like `'left handed', 'right handed', or 'ambidextrous'` 107 | - Vector: A collection of objects of the same type 108 | 109 | ## Characters 110 | 111 | - A character object is a piece of text, held in quotes like `''` or `""` 112 | - For example, maybe you have some data on people's addresses. 113 | 114 | ```{r, echo=TRUE} 115 | address <- '321 Fake St.' 116 | address 117 | is.character(address) 118 | ``` 119 | 120 | ## Logical 121 | 122 | - Logicals are binary - TRUE or FALSE. They're extremely important because lots of data is binary. 123 | 124 | ```{r, echo=TRUE} 125 | c <- TRUE 126 | is.logical(c) 127 | is.character(a) 128 | is.logical(is.numeric(a)) 129 | ``` 130 | 131 | ## Logical 132 | 133 | - Logicals are used a lot in programming too, because you can evaluate whether conditions hold. 134 | 135 | ```{r, echo=TRUE} 136 | a > 100 137 | a > 100 | b == 5 138 | ``` 139 | 140 | - `&` is AND, `|` is OR, and to check equality use `==`, not `=`. `>=` is greater than OR equal to, similarly for `<=` 141 | 142 | ## Logical 143 | 144 | - They are also equivalent to `TRUE=1` and `FALSE=0` which comes in handy. 145 | - We can use `as` functions to change one object type to another (although for `T=1, F=0` it does it automatically) 146 | 147 | ```{r, echo=TRUE} 148 | as.numeric(FALSE) 149 | TRUE + 3 150 | ``` 151 | 152 | ## Logicals Test 153 | 154 | - Let's stop and try to think about what these lines might do: 155 | ```{r, echo=TRUE, eval=FALSE} 156 | is.logical(is.numeric(FALSE)) 157 | is.numeric(2) + is.character('hello') 158 | is.numeric(2) & is.character(3) 159 | TRUE | FALSE 160 | TRUE & FALSE 161 | ``` 162 | 163 | ## Factors 164 | 165 | - Factors are categorical variables - mutually exclusive groups 166 | - They look like strings, but they're more like logicals, but with more than two levels (and labeled differently than T and F) 167 | - Factors have *levels* showing the possible categories you can be in 168 | ```{r, echo=TRUE} 169 | e <- as.factor('left-handed') 170 | levels(e) <- c('left-handed','right-handed','ambidextrous') 171 | e 172 | ``` 173 | 174 | ## Vectors 175 | 176 | - Data is basically a bunch of variables all put together 177 | - So unsurprisingly, a lot of R works with vectors, which are a bunch of objects all put together! 
178 | - Use `c()` (concatenate) to put a bunch of objects of the same type together in a vector 179 | - Use square brackets to pick out parts of the vector 180 | 181 | ```{r, echo=TRUE} 182 | d <- c(5,6,7,8) 183 | c(is.numeric(d),is.vector(d)) 184 | d[2] 185 | ``` 186 | 187 | ## Vectors 188 | 189 | - Unsurprisingly, lots of statistical functions look at multiple objects (what is statistics but making sense of lots of different measurements of the same thing?) 190 | 191 | ```{r, echo=TRUE} 192 | mean(d) 193 | c(sum(d),sd(d),prod(d)) 194 | ``` 195 | 196 | ## Vectors 197 | 198 | - We can perform the same operation on all parts of the vector at once! 199 | 200 | ```{r, echo=TRUE} 201 | d + 1 202 | d + d 203 | d > 6 204 | ``` 205 | 206 | ## Vectors 207 | 208 | - Factors make a lot more sense as a vector 209 | 210 | ```{r, echo=TRUE} 211 | continents <- as.factor(c('Asia','Asia','Asia', 212 | 'N America','Europe','Africa','Africa')) 213 | table(continents) 214 | continents[4] 215 | ``` 216 | 217 | ## Value Matching 218 | 219 | - It's easy to create logicals seeing if a value matches ANY value in a vector with %in% 220 | 221 | ```{r, echo=TRUE} 222 | 3 %in% c(3,4) 223 | c('Nick','James') %in% c('James','Andy','Sarah') 224 | ``` 225 | 226 | ## Basic Vectors 227 | 228 | - Generating basic vectors: 229 | 230 | ```{r, echo=TRUE, eval=FALSE} 231 | 1:8 232 | rep(4,3) 233 | rep(c('a','b'),4) 234 | numeric(5) 235 | character(6) 236 | sample(1:20,3) 237 | sample(c("Heads","Tails"),6,replace=TRUE) 238 | ``` 239 | ```{r, echo=FALSE, eval=TRUE} 240 | 1:8 241 | rep(4,3) 242 | rep(c('a','b'),4) 243 | numeric(5) 244 | character(6) 245 | sample(1:20,3) 246 | sample(c("Heads","Tails"),6,replace=TRUE) 247 | ``` 248 | 249 | ## Vector Test 250 | - If we do `f <- c(2,3,4,5)`, then what will the output of these be? 251 | 252 | ```{r, echo=TRUE, eval = FALSE} 253 | f^2 254 | f + c(1,2,3,4) 255 | c(f,6) 256 | is.numeric(f) 257 | mean(f >= 4) 258 | f*c(1,2,3) 259 | length(f) 260 | length(rep(1:4,3)) 261 | f/2 == 2 | f < 3 262 | as.character(f) 263 | f[1]+f[4] 264 | c(f,f,f,f) 265 | f[f[1]] 266 | f[c(1,3)] 267 | f %in% (1:4*2) 268 | ``` 269 | 270 | ## And Now You 271 | 272 | - Create a factor that randomly samples six `'Male'` or `'Female'` people. 273 | - Add up all the numbers from 18 to 763, then get the mean 274 | - What happens if you make a list with a logical, a numeric, AND a string in it? 275 | - Figure out how to use `paste0()` to turn `c('a','b')` into `'ab'` 276 | - Use `[]` to turn `h <- c(10,9,8,7)` into `c(7,8,10,9)` and call it `j` 277 | - (Several ways) Create a vector with eleven 0's, then a 5. 278 | - (Tough!) 
Use `floor()` or `%%` to count how many multiples of 4 there are between 433 and 899 -------------------------------------------------------------------------------- /Lectures/Lecture_05_Spreadsheet.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_05_Spreadsheet.PNG -------------------------------------------------------------------------------- /Lectures/Lecture_05_Working_with_Data_Part_1.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 5: Working with Data Part 1" 3 | author: "Nick Huntington-Klein" 4 | date: "January 18, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE) 19 | library(tidyverse) 20 | theme_set(theme_gray(base_size = 15)) 21 | options(warn=-1) 22 | ``` 23 | 24 | ## Working with Data 25 | 26 | - R is all about working with data! 27 | - Today we're going to start going over the use of data.frames and tibbles 28 | - data.frames are an object type; tibbles are basically data.frames with some extra bells and whistles, from the *tidyverse* package 29 | - Most of the time, you'll be doing calculations using them 30 | 31 | ## The Basic Idea 32 | 33 | - Conceptually, data.frames are basically spreadsheets 34 | - Technically, they're a list of vectors 35 | 36 | |Spreadsheet | data.frame | 37 | |------------|-------------| 38 | |![](Lecture_05_Spreadsheet.png) |![](Lecture_05_data_frame.png)| 39 | 40 | 41 | ## Example 42 | 43 | - It's a list of vectors... we can make one by listing some (same-length) vectors! 44 | - (Note the use of = here, not <-) 45 | 46 | 47 | ```{r, echo=TRUE} 48 | df <- data.frame(RacePosition = 1:5, 49 | WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')), 50 | NumberofKids = c(3,5,1,0,2)) 51 | df <- tibble(RacePosition = 1:5, 52 | WayTheySayHi = as.factor(c('Hi','Hello','Hey','Yo','Hi')), 53 | NumberofKids = c(3,5,1,0,2)) 54 | df 55 | ``` 56 | 57 | ## Looking Over Data 58 | 59 | - Now that we have our data, how can we take a look at it? 60 | - We can just name it in the Console and look at the whole thing, but that's usually too much data 61 | - We can look at the whole thing by clicking on it in Environment to open it up 62 | 63 | ## Glancing at Data 64 | 65 | - What if we just want a quick overview, rather than looking at the whole spreadsheet? 66 | - Down-arrow in the Environment tab 67 | - `head()` (look at the head of the data - first six rows) 68 | - `str()` (structure) 69 | 70 | ```{r, echo=TRUE} 71 | str(df) 72 | ``` 73 | 74 | ## So What? 75 | 76 | - What do we want to know about our data? 77 | - What is this data OF? (won't get that with `str()`) 78 | - Data types 79 | - The kinds of values it takes 80 | - How many observations 81 | - Variable names. 82 | - Summary statistics and observation level (we'll get to that later) 83 | 84 | ## Getting at Data 85 | 86 | - Now we have a data frame, `df`. How do we use it? 87 | - One way is that we can pull those vectors back out with `$`! Note autocompletion of variable names. 
88 | - We can treat it just like the vectors we had before 89 | 90 | ```{r, echo=TRUE} 91 | df$NumberofKids 92 | df$NumberofKids[2] 93 | df$NumberofKids >= 3 94 | ``` 95 | 96 | ## Quick Note 97 | 98 | - There are actually many, many ways to do this 99 | - (some of which I even go over in the videos) 100 | - For example, you can use `[row,column]` to get at data, for example `df$NumberofKids >= 3` is equivalent to `df[,3] >= 3` or `df[,'NumberofKids']>=3` 101 | 102 | ## That Said! 103 | 104 | - We can run the same calculations on these vectors as we were doing before 105 | 106 | ```{r, echo=TRUE} 107 | mean(df$RacePosition) 108 | df$WayTheySayHi[4] 109 | sum(df$NumberofKids <= 1) 110 | ``` 111 | 112 | ## Practice 113 | 114 | - Create `df2 <- data.frame(a = 1:20, b = 0:19*2,` `c = sample(101:200,20,replace=TRUE))` 115 | - What is the average of `c`? 116 | - What is the sum of `a` times `b`? 117 | - Did you get any values of `c` 103 or below? (make a logical) 118 | - What is on the 8th row of `b`? 119 | - How many rows have `b` above 10 AND `c` below 150? 120 | 121 | ## Practice Answers 122 | 123 | ```{r, echo = FALSE, eval = TRUE} 124 | df2 <- data.frame(a = 1:20, b = 0:19*2, c = sample(101:200,20,replace=TRUE)) 125 | ``` 126 | ```{r, echo=TRUE, eval=FALSE} 127 | mean(df2$c) 128 | sum(df2$a*df2$b) 129 | sum(df2$c <= 103) > 0 130 | df2$b[8] 131 | sum(df2$b > 10 & df2$c < 150) 132 | ``` 133 | 134 | ```{r, echo=FALSE, eval=TRUE} 135 | mean(df2$c) 136 | sum(df2$a*df2$b) 137 | sum(df2$c <= 103) > 0 138 | df2$b[8] 139 | sum(df2$b > 10 & df2$c < 150) 140 | ``` 141 | 142 | ## The Importance of Rows 143 | 144 | - So far we've basically just taken data frames and pulled the vectors (columns) back out 145 | - So... why not just stick with the vectors? 146 | - Because before long we're not just going to be interested in the columns one at a time 147 | - We'll want to keep track of each *row* - each row is an observation. The same observation! 148 | 149 | ## The Importance of Rows 150 | 151 | - Going back to `df`, that fourth row says that 152 | - The person in the fourth position... 153 | - Says hello by saying "Yo" 154 | - And has no kids 155 | - We're going to want to keep that straight when we want to, say, look at the relationship between having kids and your position in the race. 156 | - Or how the number of kids relates to how you say hello! 157 | 158 | ```{r, echo=FALSE, eval=TRUE} 159 | df 160 | ``` 161 | 162 | ## Working With Data Frames 163 | 164 | - Not to mention, we can manipulate data frames and tibbles! 165 | - Let's figure out how we can: 166 | - Create new variables 167 | - Change variables 168 | - Rename variables 169 | - It's very common that you'll have to work with data a little before analyzing it 170 | 171 | ## Creating New Variables 172 | 173 | - Easy! data.frames are just lists of vectors 174 | - So create a vector and tell R where in that list to stick it! 175 | - Use descriptive names so you know what the variable is 176 | 177 | ```{r, echo=TRUE} 178 | df$State <- c('Alaska','California','California','Maine','Florida') 179 | df 180 | ``` 181 | 182 | ## Our Approach - DPLYR and Tidyverse 183 | 184 | - That's the base-R way to do it, anyway 185 | - We're going to be using *dplyr* (think pliers) for data manipulation instead 186 | - dplyr syntax is inspired by SQL - so learning dplyr will give you a leg up if you want to learn SQL later. Plus it's just better. 187 | 188 | ## Packages 189 | 190 | - tidyverse isn't a part of base R.
It's in a package, so we'll need to install it 191 | - We can install packages using `install.packages('nameofpackage')` 192 | 193 | ```{r, echo=TRUE, eval=FALSE} 194 | install.packages('tidyverse') 195 | ``` 196 | 197 | - We can then check whether it's installed in the Packages tab 198 | 199 | ## Packages 200 | 201 | - Before we can use it we must then use the `library()` command to open it up 202 | - We'll need to run `library()` for it again every time we open up R if we want to use the package 203 | 204 | ```{r, echo=TRUE, eval=FALSE} 205 | library(tidyverse) 206 | ``` 207 | 208 | - There are literally thousands of useful packages for R, and we're going to be using a few! Tidyverse will just be our first of many 209 | - Google R package X to look for packages that do X. 210 | 211 | 212 | ## Variable creation with dplyr 213 | 214 | - The *mutate* command will "mutate" our data frame to have a new column in it. We can then overwrite it. 215 | - The pipe `%>%` says "take df and send it to that mutate command to use" 216 | - Or we can stick the data frame itself in the `mutate` command 217 | 218 | ```{r, echo=TRUE, eval=FALSE} 219 | library(tidyverse) 220 | df <- df %>% 221 | mutate(State = c('Alaska','California','California','Maine','Florida')) 222 | df <- mutate(df,State = c('Alaska','California','California','Maine','Florida')) 223 | ``` 224 | 225 | ```{r, echo=FALSE, eval=TRUE} 226 | df <- df %>% 227 | mutate(State = c('Alaska','California','California','Maine','Florida')) 228 | ``` 229 | 230 | ## Creating New Variables 231 | 232 | - We can use all the tricks we already know about creating vectors 233 | - We can create multiple new variables in one mutate command 234 | 235 | ```{r, echo=TRUE} 236 | df <- df %>% mutate(MoreThanTwoKids = NumberofKids > 2, 237 | One = 1, 238 | KidsPlusPosition = NumberofKids + RacePosition) 239 | df 240 | ``` 241 | 242 | ## Manipulating Variables 243 | 244 | - We can't really *change* variables, but we sure can overwrite them! 245 | - We can drop variables with `-` in the dplyr `select` command 246 | - Note we chain multiple dplyr commands with `%>%` 247 | 248 | ```{r, echo=TRUE} 249 | df <- df %>% 250 | select(-KidsPlusPosition,-WayTheySayHi,-One) %>% 251 | mutate(State = as.factor(State), 252 | RacePosition = RacePosition - 1) 253 | df$State[3] <- 'Alaska' 254 | str(df) 255 | ``` 256 | 257 | ## Renaming Variables 258 | 259 | - Sometimes it will make sense to change the names of the variables we have. 260 | - Names are stored in `names(df)` which we can edit directly 261 | - Or the `rename()` command in dplyr has us covered 262 | 263 | ```{r, echo=TRUE} 264 | names(df) 265 | #names(df) <- c('Pos','Num.Kids','State','mt2Kids') 266 | df <- df %>% rename(Pos = RacePosition, Num.Kids=NumberofKids, 267 | mt2Kids = MoreThanTwoKids) 268 | names(df) 269 | ``` 270 | 271 | 272 | ## tidylog 273 | 274 | - Protip: after loading the tidyverse, also load the `tidylog` package. This will tell you what each step of your dplyr command does! 
275 | 276 | ```{r, echo=TRUE, eval=FALSE} 277 | library(tidyverse) 278 | library(tidylog) 279 | df <- df %>% mutate(Pos = Pos + 1, 280 | Num.Kids = 10) 281 | ``` 282 | ```{r, echo=FALSE, eval=TRUE} 283 | library(tidylog, warn.conflicts=FALSE, quietly=TRUE) 284 | df <- df %>% mutate(Pos = Pos + 1, 285 | Num.Kids = 10) 286 | detach("package:tidylog", unload=TRUE) 287 | ``` 288 | 289 | ## Practice 290 | 291 | - Create a data set `data` with three variables: `a` is all even numbers from 2 to 20, `b` is `c(0,1)` over and over, and `c` is any ten-element numeric vector of your choice. 292 | - Rename them to `EvenNumbers`, `Treatment`, `Outcome`. 293 | - Add a logical variable called Big that's true whenever EvenNumbers is greater than 15 294 | - Increase Outcome by 1 for all the rows where Treatment is 1. 295 | - Create a logical AboveMean that is true whenever Outcome is above the mean of Outcome. 296 | - Display the data structure 297 | 298 | ## Practice Answers 299 | 300 | ```{r, echo=TRUE, eval=FALSE} 301 | data <- data.frame(a = 1:10*2, 302 | b = c(0,1), 303 | c = sample(1:100,10,replace=FALSE)) %>% 304 | rename(EvenNumbers = a, Treatment = b, Outcome = c) 305 | 306 | data <- data %>% 307 | mutate(Big = EvenNumbers > 15, 308 | Outcome = Outcome + Treatment, 309 | AboveMean = Outcome > mean(Outcome)) 310 | str(data) 311 | ``` 312 | 313 | ## Other Ways to Get Data 314 | 315 | - Of course, most of the time we aren't making up data 316 | - We get it from the real world! 317 | - Two main ways to do this are the `data()` function in R 318 | - Or reading in files, usually with one of the `read` commands like `read.csv()` 319 | 320 | ## data() 321 | 322 | - R has many baked-in data sets, and more in packages! 323 | - Just type in `data(` and see what options it autocompletes 324 | - We can load in data and look at it 325 | - Many of these data sets have `help` files too 326 | 327 | ```{r, echo=TRUE, eval=FALSE} 328 | data(LifeCycleSavings) 329 | help(LifeCycleSavings) 330 | head(LifeCycleSavings) 331 | ``` 332 | 333 | ```{r, echo=FALSE,eval=TRUE} 334 | data(LifeCycleSavings) 335 | head(LifeCycleSavings) 336 | ``` 337 | 338 | 339 | ## read 340 | 341 | - Often there will be data files on the internet or your computer 342 | - You can read this in with one of the many `read` commands, like `read.csv` 343 | - CSV is a very basic spreadsheet format stored in a text file, you can create it from Excel or Sheets (or just write it) 344 | - There are different `read` commands for different file types 345 | - Make sure your working directory is set to where the data is! 346 | - Documentation will usually be in a different file 347 | 348 | ```{r, echo=TRUE, eval=FALSE} 349 | datafromCSV <- read.csv('mydatafile.csv') 350 | ``` 351 | 352 | ## Practice 353 | 354 | - Use `data()` to open up a data set - any data set (although it should be in `data.frame` or `tibble` form - try again if you get something else) 355 | - Use `str()` and `help()` to examine that data set 356 | - What is it data of (help file)? How was it collected and what do the variables represent? 357 | - What kinds of variables are in there and what kinds of values do they have (`str()` and `head()`)? 358 | - Create a new variable using the variables that are already in there 359 | - Take a mean of one of the variables 360 | - Rename a variable to be more descriptive based on what you saw in `help()`. 
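## Practice Answers

- One possible walkthrough, using `mtcars` - this is just a sketch, since your chosen data set, new variable, and names will differ:

```{r, echo=TRUE, eval=FALSE}
data(mtcars)
help(mtcars)
str(mtcars)
head(mtcars)
#Create a new variable from variables already in there
mtcars <- mtcars %>% mutate(PowerToWeight = hp/wt)
#Take a mean of one of the variables
mean(mtcars$mpg)
#Rename a variable to be more descriptive, based on what help() told us
mtcars <- mtcars %>% rename(MilesPerGallon = mpg)
```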
361 | -------------------------------------------------------------------------------- /Lectures/Lecture_05_data_frame.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_05_data_frame.PNG -------------------------------------------------------------------------------- /Lectures/Lecture_06_Working_with_Data_Part_2.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 6: Working with Data Part 2" 3 | author: "Nick Huntington-Klein" 4 | date: "January 23, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE) 19 | library(tidyverse) 20 | theme_set(theme_gray(base_size = 15)) 21 | ``` 22 | 23 | ## Recap 24 | 25 | - We can get `data.frame`s by making them with `data.frame()`, or reading in data with `data()` or `read.csv` 26 | - `data.frame`s are a list of vectors - we know vectors! 27 | - We can pull the vectors back out with `$` 28 | - We can assign new variables, or update them, using `$` as well 29 | 30 | ## Today 31 | 32 | - We are going to continue working with `data.frame`s/`tibble`s 33 | - And we're going to introduce an important aspect of data analysis: *splitting the data* 34 | - In other words, selecting only *part* of the data that we have 35 | - In other words, to *subset* the data 36 | 37 | ## Why? 38 | 39 | - Why would we want to do this? 40 | - Many statistical questions require us to! 41 | - We might be interested in how a variable *differs* for two different groups 42 | - Or how one variable *is related* to another (i.e. how A looks for different values of B) 43 | - Or how those relationships differ for different groups 44 | 45 | ## Example 46 | 47 | - Let's read in some data on male heights, from Our World in Data, and look at it 48 | - Always look at the data before you use it! 49 | - It has height in CM, let's change that to feet 50 | 51 | ```{r, echo=TRUE} 52 | df <- read.csv('http://www.nickchk.com/average-height-men-OWID.csv') 53 | str(df) 54 | df <- df %>% mutate(Heightft = Heightcm/30.48) 55 | ``` 56 | 57 | ## Example 58 | 59 | - If we look at height overall we will see that mean height is `r mean(df$Heightcm)` 60 | - But if we look at the data we can see that some countries aren't present every year. So this isn't exactly representative 61 | 62 | ```{r, echo=TRUE} 63 | table(df$Year) 64 | ``` 65 | 66 | - So let's just pick some countries that we DO have the full range of years on: the UK, Germany, France, the Congo, Gabon, and Nigeria! 67 | 68 | ## Example 69 | 70 | - If we limit the data just to those six countries, we can see that the average height in these six, which covers 1810s-1980s evenly, is `r mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'))$Heightcm)` 71 | - What if we want to compare the countries to each other? We need to split off each country by itself (let's convert to feet, too).
72 | 73 | ```{r, echo=FALSE, eval=TRUE} 74 | dfsub <- df %>% filter(Code %in% c('GBR','DEU','FRA','GAB','COD','NGA')) %>% 75 | mutate(Entity = as.character(Entity)) %>% 76 | mutate(Entity = ifelse(Code=='COD','Congo', 77 | ifelse(Code=='GBR','UK',Entity))) 78 | dfsub %>% group_by(Entity) %>% 79 | summarize(Heightft = mean(Heightft)) 80 | ``` 81 | 82 | ## Example 83 | 84 | - What questions does this answer? 85 | - What is average height of men over this time period? 86 | - How does average height differ across countries? 87 | - What can't we answer yet? 88 | - How has height changed over time? 89 | - What causes these height differences [later!] 90 | 91 | ## Example 92 | 93 | - If we want to know how height changed over time, we need to evaluate each year separately too. 94 | 95 | ```{r, echo=FALSE, eval=TRUE, fig.width=8, fig.height=5} 96 | ggplot(filter(dfsub,Code=='GBR'),aes(y=Heightft,x=Year))+ 97 | geom_line()+geom_point()+ 98 | ylab("Height (Feet)")+ 99 | ggtitle("Male Height over time in United Kingdom") 100 | ``` 101 | 102 | ## Example 103 | 104 | - To compare the changes over time ACROSS countries, we evaluate separately by each year AND each country 105 | 106 | ```{r, echo=FALSE, eval=TRUE, fig.width=8, fig.height=5} 107 | ggplot(dfsub,aes(y=Heightft,x=Year,group=Entity,color=Entity))+ 108 | geom_line()+geom_point()+ 109 | ggtitle("Male Height over time")+ 110 | ylab("Height (Feet)")+ 111 | scale_colour_discrete(guide = 'none')+ 112 | geom_text(data=filter(dfsub,Year==1980),aes(label=Entity,color=Entity,x=Year,y=Heightft),hjust=-.1)+ 113 | scale_x_continuous(limits=c(min(dfsub$Year),max(dfsub$Year)+15)) 114 | ``` 115 | 116 | ## Example 117 | 118 | - You can see how subsetting can tell a much more complete story than looking at the aggregated data! 119 | - So how can we do this subsetting? 120 | - There are plenty of ways 121 | - For today we're going to focus on the `filter()` and `select()` commands. 122 | - We'll also learn to write a `for` loop. 123 | 124 | ## Subset 125 | 126 | - The `filter()` and `select()` commands will allow you to pick just certain parts of your data 127 | - You can select certain *rows*/observations using logicals with `filter()` 128 | - And you can select certain *columns*/variables with `select()` 129 | - The syntax is: 130 | 131 | ```{r, echo=TRUE, eval=FALSE} 132 | data.frame %>% filter(logical.for.rows) 133 | filter(data.frame, logical.for.rows) 134 | data.frame %>% select(variables,you,want) 135 | select(data.frame,variables,you,want) 136 | ``` 137 | 138 | 139 | ## Subset for observations 140 | 141 | - Let's start by selecting rows from our data 142 | - We do this by creating a logical - `filter` will choose all the observations for which that logical is true! 143 | - Here's the logical I used to pick those six countries: `Code %in% c('GBR','DEU','FRA','GAB','COD','NGA')` 144 | - This will be equal to `TRUE` if the `Code` variable is `%in%` that list of six I gave 145 | 146 | ## Subset for observations 147 | 148 | - We can use this logical with the `filter` command in dplyr - note I don't need to store the logical as a variable first 149 | 150 | ```{r, echo=TRUE} 151 | str(df) 152 | dfsubset <- df %>% filter(Code %in% c('GBR','DEU','FRA','GAB','COD','NGA')) 153 | str(dfsubset) 154 | ``` 155 | 156 | ## Subset for observations 157 | 158 | - Minor note: see that it still retains the original factor levels. Generally this is what we want. 
If not, you can "relevel" the factor by re-declaring it as a factor, using only the levels we have: 159 | 160 | ```{r, echo=TRUE} 161 | dfsubset <- dfsubset %>% mutate(Code=factor(dfsubset$Code)) 162 | str(dfsubset) 163 | ``` 164 | 165 | - While we're at it we might want `Entity` to be a character variable (why?) 166 | 167 | ```{r, echo=TRUE} 168 | dfsubset <- dfsubset %>% mutate(Entity = as.character(dfsubset$Entity)) 169 | ``` 170 | 171 | ## Subset for observations 172 | 173 | - Everything we know about constructing logicals can work here 174 | - We can also combine multiple variables when doing this 175 | - What if we want to see the dataset for 1980 only for these six? 176 | 177 | ```{r, echo=TRUE} 178 | #filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA') & Year == 1980) 179 | filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'),Year == 1980) 180 | ``` 181 | 182 | ## Subset for observations 183 | 184 | - We can treat the subset like any normal data frame 185 | - Let's get the mean height in these countries in 1980 and 1810 186 | - Can do it directly like this, or assign to a new `data.frame` like we did with `dfsubset` and use that 187 | 188 | ```{r, echo=TRUE} 189 | mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'), 190 | Year == 1980)$Heightft) 191 | mean(filter(df,Code %in% c('GBR','DEU','FRA','GAB','COD','NGA'), 192 | Year == 1810)$Heightft) 193 | ``` 194 | 195 | ## Subset for variables 196 | 197 | - Subsetting for variables is easy! Just use `select()` with a vector or list of variables you want! 198 | - Or you can do `-` a vector of variables you DON'T want! 199 | - We don't need Heightcm, let's get rid of it 200 | 201 | ## Subset for variables 202 | 203 | ```{r, echo=TRUE} 204 | str(dfsubset %>% select(Entity,Code,Year,Heightft)) 205 | str(dfsubset %>% select(-c(Heightcm))) 206 | ``` 207 | 208 | ## Subset for both! 209 | 210 | - We can do both at the same time, chaining one to the other 211 | 212 | ```{r, echo=TRUE} 213 | dfsubset %>% filter(Year == 1980) %>% 214 | select(Entity,Heightft) 215 | ``` 216 | 217 | ## Practice 218 | 219 | - Get the dataset `mtcars` using the `data()` function 220 | - Look at it with `str()` and `help()` 221 | - Limit the dataset to just the variables `mpg, cyl`, and `hp` 222 | - Get the mean `hp` for cars at or above the median value of `cyl` 223 | - Get the mean `hp` for cars below the median value of `cyl` 224 | - Do the same for `mpg` instead of `hp` 225 | - Calculate the difference between above-median and below-median `mpg` and `hp` 226 | - How do you interpret these differences? 227 | 228 | ## Practice answers 229 | 230 | ```{r, echo=TRUE, eval=FALSE} 231 | data(mtcars) 232 | help(mtcars) 233 | str(mtcars) 234 | mtcars <- mtcars %>% select(mpg,cyl,hp) 235 | 236 | mean(filter(mtcars,cyl >= median(cyl))$hp) 237 | mean(filter(mtcars,cyl < median(cyl))$hp) 238 | mean(filter(mtcars,cyl >= median(cyl))$hp) - mean(filter(mtcars,cyl < median(cyl))$hp) 239 | 240 | mean(filter(mtcars,cyl >= median(cyl))$mpg) 241 | mean(filter(mtcars,cyl < median(cyl))$mpg) 242 | mean(filter(mtcars,cyl >= median(cyl))$mpg) - mean(filter(mtcars,cyl < median(cyl))$mpg) 243 | ``` 244 | 245 | 246 | ## For loops 247 | 248 | - Sometimes we want to subset things in many different ways 249 | - Typing everything out over and over is a waste of time! 
250 | - You don't understand how powerful computers are until you've written a `for` loop 251 | - This is an incredibly standard programming tool 252 | - R has another way of writing loops using the `apply` family, but we're not going to go there 253 | 254 | ## For loops 255 | 256 | - The basic idea of a `for` loop is that you have a vector of values, and you have an "iterator" variable 257 | - You go through the vector one by one, setting that iterator variable to each value one at a time 258 | - Then you run a chunk of that code with the iterator variable set 259 | 260 | ## For loops 261 | 262 | - In R the syntax is 263 | 264 | ```{r, echo = TRUE, eval=FALSE} 265 | for (iteratorvariable in vector) { 266 | code chunk 267 | } 268 | ``` 269 | 270 | ## For loops 271 | 272 | - Let's rewrite our `hp` and `mpg` differences with a for loop. Heck, let's do a bunch more variables too. (don't forget to get `data(mtcars)` again - why?) 273 | - Note if we want it to display results inside a loop we need `print()` 274 | - Also, unfortunately, if we're looping over variables, `$` won't work - we need to use `[[]]` 275 | 276 | ```{r, echo = TRUE, eval=FALSE} 277 | data(mtcars) 278 | abovemed <- mtcars %>% filter(cyl >= median(cyl)) 279 | belowmed <- mtcars %>% filter(cyl < median(cyl)) 280 | for (i in c('mpg','disp','hp','wt')) { 281 | print(mean(abovemed[[i]])-mean(belowmed[[i]])) 282 | } 283 | ``` 284 | 285 | ----- 286 | 287 | ```{r, echo = TRUE} 288 | data(mtcars) 289 | abovemed <- mtcars %>% filter(cyl >= median(cyl)) 290 | belowmed <- mtcars %>% filter(cyl < median(cyl)) 291 | for (i in c('mpg','disp','hp','wt')) { 292 | print(mean(abovemed[[i]])-mean(belowmed[[i]])) 293 | } 294 | ``` 295 | 296 | ## For loops 297 | 298 | - It's also sometimes useful to loop over different values 299 | - Let's get the average height by country, like before 300 | - `unique()` can give us the levels to loop over 301 | 302 | ```{r, echo = TRUE} 303 | unique(dfsubset$Entity) 304 | for (countryn in unique(dfsubset$Entity)) { 305 | print(countryn) 306 | print(mean(filter(dfsubset,Entity==countryn)$Heightft)) 307 | } 308 | ``` 309 | 310 | ## For loop practice 311 | 312 | - Get back the full `mtcars` again 313 | - Use `unique()` to see the different values of `cyl` 314 | - Use `unique` to loop over those values and get median `mpg` within each level of `cyl` 315 | - Use `:` to construct a vector to repeat the same loop 316 | - [Hard!] Use `paste0` to print out "The median mpg for cyl = # is X" where # is the iterator number and X is the answer. 
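- Hint for the last one - a quick sketch of `paste0()`, which glues text and numbers together into a single string:

```{r, echo = TRUE, eval=FALSE}
# paste0() coerces numbers to text and pastes everything with no separator
paste0("The median mpg for cyl = ", 4, " is ", 22.8)
# [1] "The median mpg for cyl = 4 is 22.8"
```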
317 | 318 | ## For loop practice answers 319 | 320 | ```{r, echo = TRUE, eval=FALSE} 321 | data(mtcars) 322 | unique(mtcars$cyl) 323 | for (c in unique(mtcars$cyl)) { 324 | print(median(filter(mtcars,cyl==c)$mpg)) 325 | } 326 | for (c in 2:4*2) { 327 | print(median(filter(mtcars,cyl==c)$mpg)) 328 | } 329 | for (c in unique(mtcars$cyl)) { 330 | print(paste0(c("The median mpg for cyl = ",c, 331 | " is ",median(filter(mtcars,cyl==c)$mpg)), 332 | collapse='')) 333 | } 334 | #Printing out just the last one: 335 | ``` 336 | 337 | ```{r, echo = FALSE, eval=TRUE} 338 | data(mtcars) 339 | for (c in unique(mtcars$cyl)) { 340 | print(paste0(c("The median mpg for cyl = ",c, 341 | " is ",median(filter(mtcars,cyl==c)$mpg)), 342 | collapse='')) 343 | } 344 | ``` -------------------------------------------------------------------------------- /Lectures/Lecture_08_Summarizing_Data_Part_2.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 8: Summarizing Data Part 2: Plots" 3 | author: "Nick Huntington-Klein" 4 | date: "January 29, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE) 19 | library(tidyverse) 20 | library(stargazer) 21 | library(Ecdat) 22 | theme_set(theme_gray(base_size = 15)) 23 | ``` 24 | 25 | ## Recap 26 | 27 | - Summary statistics are ways of describing the *distribution* of a variable 28 | - Examples include mean, percentiles, SD, and specific percentiles of interest (median, min, max) 29 | - Another, very good way to describe a variable is by actually showing that distribution with a graph 30 | - I showed you plenty of graphs last time, this time I'm going to show you how to make them yourself! 31 | 32 | ## Note about ggplot 33 | 34 | - As I mentioned before, we're going straightforward, doing our plotting with base R 35 | - In the case of plotting, it's going to be easier to use base R, but also not look as good as the alternative, `ggplot2`, which is also in the tidyverse package. 36 | - `ggplot2` code for plots will be available in the slides. Just search for 'ggplot' 37 | - And have also been for most previous slides - I've been making most graphs with it! 38 | - Don't forget extra credit opportunity for learning `ggplot2` 39 | 40 | 41 | ## Plot types we will cover: 42 | 43 | - Density plots (for continuous) 44 | - Histograms (for continuous) 45 | - Box plots (for continuous) 46 | - Bar plots (for categorical) 47 | - Plus: 48 | - Adding plots together 49 | - Putting lines on plots 50 | - Makin 'em look good! 51 | 52 | ## Plots we won't: 53 | 54 | - Lots and lots and lots of plot options 55 | - Mosaics, Sankey plots, pie graphs 56 | - Some aren't common in Econ but could be! 
57 | - Others are just too tricky (like *maps*) 58 | - See [The R Graph Gallery](https://www.r-graph-gallery.com/) 59 | 60 | ## Density plots and histograms 61 | 62 | - Density plots and histograms will show you the full distribution of the variable 63 | - Values along the x-axis, and how often those values show up on y 64 | - The difference - density plots will present a smooth line by averaging nearby values 65 | - A histogram will create "bins" and tell you how many observations fall into each 66 | - (don't forget we can save plots using Export in the Plots pane) 67 | 68 | ## Density plots 69 | 70 | - To make a density plot, we take our variable and make a `density()` object out of it, then `plot()` that thing! 71 | - Let's play around with data from the `Ecdat` package 72 | 73 | ```{r, echo=TRUE, eval=FALSE} 74 | install.packages('Ecdat') 75 | library(Ecdat) 76 | data(MCAS) 77 | plot(density(MCAS$totsc4)) 78 | ``` 79 | ```{r fig.width=4.5, fig.height=3.5, echo=FALSE, eval=TRUE} 80 | data(MCAS) 81 | plot(density(MCAS$totsc4)) 82 | ``` 83 | 84 | ```{r, echo=FALSE, eval=FALSE} 85 | #THE GGPLOT2 WAY 86 | library(Ecdat) 87 | library(tidyverse) 88 | data(MCAS) 89 | 90 | ggplot(MCAS,aes(x=totsc4))+stat_density(geom='line') 91 | ``` 92 | 93 | ## Titles and labels 94 | 95 | - Readability is super important in graphs 96 | - Add labels and titles! Titles with 'main' and axis labels with 'xlab' and 'ylab' 97 | 98 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE} 99 | plot(density(MCAS$totsc4),main='Massachusetts Test Scores', 100 | xlab='Fourth Grade Test Scores') 101 | ``` 102 | 103 | ```{r, echo=FALSE, eval=FALSE} 104 | #THE GGPLOT2 WAY 105 | library(tidyverse) 106 | 107 | ggplot(MCAS,aes(x=totsc4))+stat_density(geom='line')+ 108 | ggtitle('Massachusetts Test Scores')+ 109 | xlab('Fourth Grade Test Scores') 110 | ``` 111 | 112 | 113 | ## Histograms 114 | 115 | - Histograms require a little less typing - just `hist()` 116 | 117 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE} 118 | hist(MCAS$totsc4) 119 | ``` 120 | 121 | ```{r, echo=FALSE, eval=FALSE} 122 | #THE GGPLOT2 WAY 123 | library(tidyverse) 124 | 125 | ggplot(MCAS,aes(x=totsc4))+geom_histogram() 126 | ``` 127 | 128 | ## Histograms 129 | 130 | - These need labels too! Other important options: 131 | - Do proportions with `freq=FALSE`, or change how many bins there are, or where they are, with `breaks` 132 | 133 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE} 134 | hist(MCAS$totsc4,xlab="Fourth Grade Test Scores", 135 | main="Test Score Histogram",freq=FALSE,breaks=50) 136 | ``` 137 | 138 | ```{r, echo=FALSE, eval=FALSE} 139 | #THE GGPLOT2 WAY 140 | library(tidyverse) 141 | 142 | #If we want proportions we need to calculate it ourselves 143 | ggplot(MCAS,aes(x=totsc4))+geom_histogram(aes(y=(..count..)/sum(..count..)),bins=50)+ 144 | ggtitle("Test Score Histogram")+ 145 | xlab("Fourth Grade Test Scores") 146 | ``` 147 | 148 | ## Box plots 149 | 150 | - Less common in economics, but a good way to look at your data 151 | - And check for outliers! 152 | - Basically a summary table in graph form 153 | - Shows the 25%, 50%, 75% percentiles 154 | - Plus "whiskers" that extend up to 1.5 times the IQR (75th percentile minus 25th) 155 | - And dots for anything outside that range 156 | 157 | ## Box plots 158 | 159 | - Simple! 160 | - Note the negative outliers - not necessarily a problem but good to know!
161 | 162 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE} 163 | boxplot(MCAS$totsc4,main="Box Plot of 4th Grade Scores") 164 | ``` 165 | 166 | ```{r, echo=FALSE, eval=FALSE} 167 | #THE GGPLOT2 WAY 168 | library(tidyverse) 169 | 170 | ggplot(MCAS,aes(y=totsc4))+geom_boxplot()+ 171 | ggtitle("Box Plot of 4th Grade Scores") 172 | ``` 173 | 174 | ## Box plots 175 | 176 | - Easy to look at multiple variables at once, too 177 | 178 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE} 179 | boxplot(select(MCAS,totsc4,totsc8),main="Box Plot of 4th and 8th Grade Scores") 180 | ``` 181 | 182 | ```{r, echo=FALSE, eval=FALSE} 183 | #THE GGPLOT2 WAY 184 | library(tidyverse) 185 | 186 | #Unfortunately, for this we need to change the format of the data 187 | #putting it in "long" format by "melt"ing it (from the reshape2 package) 188 | #and then identifying which variable it is by "group" 189 | library(reshape2) 190 | scores <- melt(MCAS,id.vars="code",measure.vars=c("totsc4","totsc8")) 191 | ggplot(scores,aes(x=variable,y=value,group=variable))+geom_boxplot()+ 192 | ggtitle("Box Plot of 4th and 8th Grade Scores") 193 | ``` 194 | 195 | ## Bar plots 196 | 197 | - Last time we talked about the `table()` command, which shows us the whole distribution of a *categorical* variable 198 | - Still a great command! Let's look at majors of students taking advanced micro from a mid-1980s sample in the `Ecdat` package 199 | 200 | ```{r, echo=TRUE} 201 | data(Mathlevel) 202 | table(Mathlevel$major) 203 | ``` 204 | 205 | ## Bar plots 206 | 207 | - Bar plots express this table graphically 208 | - Stick that table in there! 209 | 210 | ```{r fig.width=6, fig.height=5, echo=TRUE, eval=TRUE} 211 | barplot(table(Mathlevel$major),main="Majors of Students in Advanced Microeconomics Class") 212 | ``` 213 | 214 | ```{r, echo=FALSE, eval=FALSE} 215 | #THE GGPLOT2 WAY 216 | library(tidyverse) 217 | 218 | ggplot(Mathlevel,aes(x=major))+geom_bar()+ 219 | ggtitle("Majors of Students in Advanced Microeconomics Class") 220 | ``` 221 | 222 | ## Bar plots 223 | 224 | - And, just like before, we can make it a proportion instead of a count by wrapping that `table()` in `prop.table()` 225 | 226 | ```{r fig.width=6, fig.height=5, echo=TRUE, eval=TRUE} 227 | barplot(prop.table(table(Mathlevel$major)),main="Majors of Students in Advanced Microeconomics Class") 228 | ``` 229 | 230 | ```{r, echo=FALSE, eval=FALSE} 231 | #THE GGPLOT2 WAY 232 | library(tidyverse) 233 | 234 | #Note that to get a proportion here we sort of need to calculate it ourselves 235 | #in the aes() argument of geom_bar 236 | ggplot(Mathlevel,aes(x=major))+geom_bar(aes(y = (..count..)/sum(..count..)))+ 237 | ggtitle("Majors of Students in Advanced Microeconomics Class") 238 | ``` 239 | 240 | ## Adding Lines 241 | 242 | - It's pretty common for us to want to add some explanatory lines to our graphs 243 | - For example, adding mean/median/etc. to a density 244 | - Or showing where 0 is 245 | - We can do this with `abline()` which adds a line with a given intercept (a) and slope (b), after we make our plot.
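- For example, a minimal sketch (assuming the `MCAS` data from earlier is still loaded) using the intercept-and-slope form to draw the 45-degree line:

```{r, echo=TRUE, eval=FALSE}
# Districts above the line have a higher 8th-grade score than 4th-grade score
plot(MCAS$totsc4,MCAS$totsc8,xlab="4th Grade Test Scores",ylab="8th Grade Test Scores")
abline(0,1,col='red') # intercept 0, slope 1
```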
246 | 247 | ## Adding Lines 248 | 249 | - After creating the plot, THEN add the `abline(intercept,slope)`, or `abline(h=horizontal)`, `abline(v=vertical)` for horizontal or vertical numbers 250 | 251 | ```{r echo=TRUE, eval=FALSE} 252 | hist(MCAS$totsc4) 253 | passingscore <- 705 254 | abline(0,1) 255 | abline(v=passingscore,col='red') 256 | abline(h=50) 257 | ``` 258 | 259 | ```{r, echo=FALSE, eval=FALSE} 260 | #THE GGPLOT2 WAY 261 | library(tidyverse) 262 | 263 | passingscore <- 705 264 | ggplot(MCAS,aes(x=totsc4))+geom_histogram()+ 265 | geom_vline(aes(xintercept=passingscore),col='red')+ 266 | geom_hline(aes(yintercept=50))+ 267 | geom_abline(aes(intercept=0,slope=1)) 268 | ``` 269 | 270 | ## Adding Lines 271 | 272 | ```{r fig.width=6, fig.height=5, echo=TRUE, eval=TRUE} 273 | hist(MCAS$totsc4) 274 | passingscore <- 705 275 | abline(v=passingscore,col='red') 276 | ``` 277 | 278 | ```{r, echo=FALSE, eval=FALSE} 279 | #THE GGPLOT2 WAY 280 | library(tidyverse) 281 | 282 | passingscore <- 705 283 | ggplot(MCAS,aes(x=totsc4))+geom_histogram()+ 284 | geom_vline(aes(xintercept=passingscore),col='red') 285 | ``` 286 | 287 | ## Adding Lines 288 | 289 | - A common use of adding lines is to show where the mean or median of the distribution is, as I did in the last lecture 290 | 291 | ```{r fig.width=5, fig.height=4, echo=TRUE, eval=TRUE} 292 | plot(density(MCAS$totsc4)) 293 | abline(v=mean(MCAS$totsc4),col='red') 294 | abline(v=median(MCAS$totsc4),col='blue') 295 | ``` 296 | 297 | ```{r, echo=FALSE, eval=FALSE} 298 | #THE GGPLOT2 WAY 299 | library(tidyverse) 300 | 301 | #Note that color OUTSIDE of aes() colors the line 302 | #color INSIDE of aes() acts like group 303 | ggplot(MCAS,aes(x=totsc4))+geom_histogram()+ 304 | geom_vline(aes(xintercept=mean(totsc4)),color='red')+ 305 | geom_vline(aes(xintercept=median(totsc4)),color='blue') 306 | ``` 307 | 308 | ## Overlaying Densities 309 | 310 | - Sometimes it's nice to be able to compare two distributions 311 | - Because density plots are so simple, we can do this by overlaying them 312 | - The `lines()` function will, like `abline()`, add a line to your graph 313 | - But it can do this with a density! Be sure to set colors so you can tell them apart 314 | 315 | ## Overlaying Densities 316 | 317 | ```{r fig.width=6, fig.height=5, echo=TRUE, eval=TRUE} 318 | plot(density(filter(MCAS,tchratio < median(tchratio))$totsc4), 319 | col='red',main="Mass. Fourth-Grade Test Scores by Teacher Ratio") 320 | lines(density(filter(MCAS,tchratio >= median(tchratio))$totsc4),col='blue') 321 | ``` 322 | 323 | ## Practice 324 | 325 | - Install `Ecdat`, load it, and get the `MCAS` data 326 | - Make a density plot for `totsc4`, and add a vertical green dashed (tough!)
line at the median 327 | - Create a bar plot showing the *proportion* of observations that have nonzero values of `bilingua` 328 | - Create a box plot for `totday`, and add a horizontal line at the mean 329 | - Go back and add appropriate titles and/or axis labels to all graphs 330 | 331 | ## Practice answers 332 | 333 | ```{r echo = TRUE, eval = FALSE} 334 | install.packages('Ecdat') 335 | library(Ecdat) 336 | data(MCAS) 337 | plot(density(MCAS$totsc4),xlab="4th Grade Test Score") 338 | abline(v=median(MCAS$totsc4),col='green',lty='dashed') 339 | 340 | MCAS <- MCAS %>% mutate(nonzerobi = bilingua > 0) 341 | barplot(prop.table(table(MCAS$nonzerobi)),main="Nonzero Spending Per Bilingual Student") 342 | 343 | boxplot(MCAS$totday,main="Spending Per Pupil") 344 | abline(h=mean(MCAS$totday)) 345 | ``` 346 | 347 | ```{r, echo=FALSE, eval=FALSE} 348 | #THE GGPLOT2 WAY 349 | library(tidyverse) 350 | 351 | ggplot(MCAS,aes(x=totsc4))+stat_density(geom='line')+ 352 | geom_vline(aes(xintercept=median(totsc4)),color='green',linetype='dashed')+ 353 | xlab("4th Grade Test Score") 354 | 355 | MCAS <- MCAS %>% 356 | mutate(nonzerobi = bilingua > 0) 357 | ggplot(MCAS,aes(x=nonzerobi))+geom_bar()+ 358 | ggtitle("Nonzero Spending Per Bilingual Student") 359 | 360 | ggplot(MCAS,aes(y=totday))+geom_boxplot()+ 361 | geom_hline(aes(yintercept=mean(totday)))+ 362 | ggtitle("Spending Per Pupil") 363 | ``` -------------------------------------------------------------------------------- /Lectures/Lecture_09_Relationships_Between_Variables_Part_1.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 9: Relationships Between Variables, Part 1" 3 | author: "Nick Huntington-Klein" 4 | date: "February 5, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE) 19 | library(tidyverse) 20 | library(stargazer) 21 | library(Ecdat) 22 | library(haven) 23 | theme_set(theme_gray(base_size = 15)) 24 | ``` 25 | ## Recap 26 | 27 | - Summary statistics are ways of describing the *distribution* of a variable 28 | - We can also just look at the variable directly 29 | - Understanding a variable's distribution is important if we want to use it 30 | 31 | ## This week 32 | 33 | - We aren't just interested in looking at variables by themselves! 34 | - We want to know how variables can be *related* to each other 35 | - When `X` is high, would we expect `Y` to also be high, or be low? 36 | - How are variables *correlated*? 37 | - How does one variable *explain* another? 38 | - How does one variable *cause* another? (later!) 39 | 40 | ## What Does it Mean to be Related? 41 | 42 | - We would consider two variables to be *related* if knowing something about *one* of them tells you something about the other 43 | - For example, consider the answer to two questions: 44 | - Are you a man? 45 | - Are you pregnant? 46 | - What do you think is the probability that a random person is pregnant? 47 | - What do you think is the probability that a random person *who is a man* is pregnant? 48 | 49 | ## What Does it Mean to be Related?
50 | 51 | Some terms: 52 | 53 | - Variables are *dependent* on each other if telling you the value of one gives you information about the distribution of the other 54 | - Variables are *correlated* if knowing whether one of them is *unusually high* gives you information about whether the other is *unusually high* (positive correlation) or *unusually low* (negative correlation) 55 | - *Explaining* one variable `Y` with another `X` means predicting *your `Y`* by looking at the distribution of `Y` for *your* value of `X` 56 | - Let's look at two variables as an example 57 | 58 | ## An Example: Dependence 59 | 60 | ```{r, echo=FALSE, eval=TRUE} 61 | wage1 <- read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/wage1.dta") 62 | ``` 63 | 64 | ```{r, echo=TRUE, eval=TRUE} 65 | table(wage1$numdep,wage1$smsa,dnn=c('Num. Dependents','Lives in Metropolitan Area')) 66 | ``` 67 | 68 | ## An Example: Dependence 69 | 70 | - What are we looking for here? 71 | - For *dependence*, simply see if the distribution of one variable changes for the different values of the other. 72 | - Does the distribution of Number of Dependents differ based on your SMSA status? 73 | 74 | ```{r, echo=TRUE, eval=TRUE} 75 | prop.table(table(wage1$numdep,wage1$smsa,dnn=c('Num. Dependents','Lives in Metropolitan Area')),margin=2) 76 | ``` 77 | 78 | ## An Example: Dependence 79 | 80 | - Does the distribution of SMSA differ based on your Number of Dependents Status? 81 | 82 | ```{r, echo=TRUE, eval=TRUE} 83 | prop.table(table(wage1$numdep,wage1$smsa,dnn=c('Number of Dependents','Lives in Metropolitan Area')),margin=1) 84 | ``` 85 | 86 | - Looks like it! 87 | - What do these two results mean? 88 | 89 | 90 | ## An Example: Correlation 91 | 92 | - We are interested in whether two variables tend to *move together* (positive correlation) or *move apart* (negative correlation) 93 | - One basic way to do this is to see whether values tend to be *high* together 94 | - One way to check in dplyr is to use `group_by()` to organize the data into groups 95 | - Then `summarize()` the data within those groups 96 | 97 | ```{r, echo=TRUE} 98 | wage1 %>% 99 | group_by(smsa) %>% 100 | summarize(numdep=mean(numdep)) 101 | ``` 102 | 103 | - When `smsa` is high, `numdep` tends to be low - negative correlation! 
104 | 105 | ## An Example: Correlation 106 | 107 | - There's also a summary statistic we can calculate *called* correlation - this is typically what we mean by "correlation" 108 | - Ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation) 109 | - Basically "a one-standard deviation increase in `X` is associated with a correlation-standard-deviation increase in `Y`" 110 | 111 | ```{r, echo=TRUE, eval=TRUE} 112 | cor(wage1$numdep,wage1$smsa) 113 | cor(wage1$smsa,wage1$numdep) 114 | ``` 115 | 116 | ## An Example: Explanation 117 | 118 | Let's go back to those different means: 119 | 120 | 121 | ```{r, echo=FALSE, eval=TRUE} 122 | #THE DPLYR WAY 123 | wage1 %>% 124 | group_by(smsa) %>% 125 | summarize(numdep=mean(numdep)) 126 | ``` 127 | 128 | - Explanation would be saying that, based on this, if you're in an SMSA, I predict that you have `r mean(filter(wage1,smsa==1)$numdep)` dependents, and if you're not, you have `r mean(filter(wage1,smsa==0)$numdep)` dependents 129 | - If you are in an SMSA and have 2 dependents, then `r mean(filter(wage1,smsa==1)$numdep)` of those dependents are *explained by SMSA* and 2 - `r mean(filter(wage1,smsa==1)$numdep)` = `r 2-mean(filter(wage1,smsa==1)$numdep)` of them are *unexplained by SMSA* 130 | - We'll talk a lot more about this later 131 | 132 | ## Coding Recap 133 | 134 | - `table(df$var1,df$var2)` to look at two variables together 135 | - `prop.table(table(df$var1,df$var2))` for the proportion in each cell 136 | - `prop.table(table(df$var1,df$var2),margin=2)` to get proportions *within each column* 137 | - `prop.table(table(df$var1,df$var2),margin=1)` to get proportions *within each row* 138 | - `df %>% group_by(var1) %>% summarize(mean(var2))` to get mean of var2 for each value of var1 139 | - `cor(df$var1,df$var2)` to calculate correlation 140 | 141 | ## Graphing Relationships 142 | 143 | - Relationships between variables can be easier to see graphically 144 | - And graphs are extremely important to understanding relationships and the "shape" of those relationships 145 | 146 | ## Wage and Education 147 | 148 | - Let's use `plot(xvar,yvar)` with *two* variables 149 | 150 | ```{r, echo=TRUE, eval=TRUE, fig.width=7, fig.height=3.5} 151 | plot(wage1$educ,wage1$wage,xlab="Years of Education",ylab="Wage") 152 | ``` 153 | 154 | ```{r, echo=FALSE, eval=FALSE} 155 | #THE GGPLOT2 WAY 156 | ggplot(wage1,aes(x=educ,y=wage))+geom_point()+ 157 | xlab('Years of Education')+ 158 | ylab('Wage') 159 | ``` 160 | 161 | - As we look at different values of `educ`, what changes about the values of `wage` we see? 162 | 163 | ## Graphing Relationships 164 | 165 | - Try to picture the *shape* of the data 166 | - Should this be a straight line? A curved line? Positively sloped? Negatively?
167 | 168 | ```{r, echo=TRUE, eval=TRUE, fig.width=7, fig.height=3.5} 169 | plot(wage1$educ,wage1$wage,xlab="Years of Education",ylab="Wage") 170 | abline(-.9,.5,col='red') 171 | plot(function(x) 5.4-.6*x+.05*(x^2),0,18,add=TRUE,col='blue') 172 | ``` 173 | 174 | ```{r, echo=FALSE, eval=FALSE} 175 | #THE GGPLOT2 WAY 176 | ggplot(wage1,aes(x=educ,y=wage))+geom_point()+ 177 | xlab('Years of Education')+ 178 | ylab('Wage')+ 179 | geom_abline(aes(intercept=-.9,slope=.5),col='red')+ 180 | stat_function(fun=function(x) 5.4-.6*x+.05*(x^2),col='blue') 181 | ``` 182 | 183 | ## Graphing Relationships 184 | 185 | - `plot(xvar,yvar)` is extremely powerful, and will show you relationships at a glance 186 | - The previous graph showed a clear positive relationship, and indeed `cor(wage1$wage,wage1$educ)` = `r cor(wage1$wage,wage1$educ)` 187 | - Further, we don't only see a positive relationship, but we have some sense of *how* positive it is, what it looks like roughly 188 | - Let's look at some more 189 | 190 | ## Graphing Relationships 191 | 192 | - Let's compare clothing sales volume vs. profit margin for men's clothing firms 193 | 194 | ```{r, echo=TRUE, eval=FALSE} 195 | library(Ecdat) 196 | data(Clothing) 197 | plot(Clothing$sales,Clothing$margin,xlab="Gross Sales",ylab="Margin") 198 | ``` 199 | 200 | ```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=3.5} 201 | data(Clothing) 202 | plot(Clothing$sales,Clothing$margin,xlab="Gross Sales",ylab="Margin") 203 | ``` 204 | 205 | ```{r, echo=FALSE, eval=FALSE} 206 | #THE GGPLOT2 WAY 207 | library(Ecdat) 208 | data(Clothing) 209 | ggplot(Clothing,aes(x=sales,y=margin))+geom_point()+ 210 | xlab('Gross Sales')+ 211 | ylab('Margin') 212 | ``` 213 | 214 | - No clear up-or-down relationship (although the correlation is `r cor(Clothing$sales,Clothing$margin)`!) but clearly the variance is higher for low sales 215 | 216 | ## Graphing Relationships 217 | 218 | - Comparing Singapore diamond prices vs. carats 219 | 220 | ```{r, echo=TRUE, eval=FALSE} 221 | library(Ecdat) 222 | data(Diamond) 223 | plot(Diamond$carat,Diamond$price,xlab="Number of Carats",ylab="Price") 224 | ``` 225 | 226 | ```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=3.5} 227 | data(Diamond) 228 | plot(Diamond$carat,Diamond$price,xlab="Number of Carats",ylab="Price") 229 | ``` 230 | 231 | ```{r, echo=FALSE, eval=FALSE} 232 | #THE GGPLOT2 WAY 233 | library(Ecdat) 234 | data(Diamond) 235 | ggplot(Diamond,aes(x=carat,y=price))+geom_point()+ 236 | xlab('Number of Carats')+ 237 | ylab('Price') 238 | ``` 239 | 240 | ## Graphing Relationships 241 | 242 | - Another way to graph a relationship, especially when one of the variables only takes a few values, is to plot the `density()` function for different values 243 | 244 | ```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=3.5} 245 | plot(density(filter(wage1,married==0)$wage),col='blue',xlab="Wage",main="Wage Distribution; Blue = Unmarried, Red = Married") 246 | lines(density(filter(wage1,married==1)$wage),col='red') 247 | ``` 248 | 249 | ```{r, echo=FALSE, eval=FALSE} 250 | #THE GGPLOT2 WAY 251 | ggplot(filter(wage1,married==0),aes(x=wage))+stat_density(geom='line',col='blue')+ 252 | xlab('Wage')+ 253 | ylab('Density')+ 254 | ggtitle("Wage Distribution; Blue = Unmarried, Red = Married")+ 255 | stat_density(data=filter(wage1,married==1),geom='line',col='red') 256 | ``` 257 | 258 | - Clearly different distributions: married people earn more!
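- For reference, a sketch of the base R code behind a plot like the one above (using `wage1`, with the overlaying approach from last lecture):

```{r, echo=TRUE, eval=FALSE}
plot(density(filter(wage1,married==0)$wage),col='blue',
     xlab="Wage",main="Wage Distribution; Blue = Unmarried, Red = Married")
lines(density(filter(wage1,married==1)$wage),col='red')
```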
259 | 260 | ## Graphing Relationships 261 | 262 | - We can back that up other ways 263 | 264 | ```{r, echo=TRUE,eval=TRUE} 265 | wage1 %>% group_by(married) %>% summarize(wage = mean(wage)) 266 | cor(wage1$wage,wage1$married) 267 | ``` 268 | 269 | ## Keep in mind! 270 | 271 | - Just because two variables are *related* doesn't mean we know *why* 272 | - If `cor(x,y)` is positive, it could be that `x` causes `y`... or that `y` causes `x`, or that something else causes both! 273 | - Or many other configurations... we'll talk about this after the midterm 274 | - Plus, even if we know the direction we may not know *why* that cause exists. 275 | 276 | ## For example 277 | 278 | ```{r, echo=TRUE, eval=TRUE, fig.width=7, fig.height=5} 279 | addata <- read.csv('http://www.nickchk.com/ad_spend_and_gdp.csv') 280 | plot(addata$AdSpending,addata$GDP, 281 | xlab='Ad Spend/Year (Mil.)',ylab='US GDP (Bil.)') 282 | ``` 283 | 284 | ```{r, echo=FALSE, eval=FALSE} 285 | #THE GGPLOT2 WAY 286 | ggplot(addata,aes(x=AdSpending,y=GDP))+geom_point()+ 287 | xlab('Ad Spend/Year (Mil.)')+ 288 | ylab('US GDP (Bil.)') 289 | ``` 290 | 291 | ## For example 292 | 293 | - The correlation between ad spend and GDP is `r cor(addata$AdSpending,addata$GDP)` 294 | - Does this mean that ads make GDP go up? 295 | - To some extent, yes (ad spending factors directly into GDP) 296 | - But that doesn't explain all of it! 297 | - Why else might this relationship exist? 298 | 299 | ## Practice 300 | 301 | - Install the `SMCRM` package, load it, get the `customerAcquisition` data. Rename it `ca` 302 | - Among `acquisition==1` observations, see if the size of first purchase is related to duration as a customer, with `cor` and (labeled) `plot` 303 | - See if `industry` and `acquisition` are dependent on each other using `prop.table` with the `margin` option 304 | - See if average revenues differ between industries using `aggregate`, then check the `cor` 305 | - Plot the density of revenues for `industry==0` in blue and, on the same graph, revenues for `industry==1` in red 306 | - In each case, think about what relationship is suggested 307 | 308 | ## Practice Answers 309 | 310 | ```{r, echo=TRUE, eval=FALSE} 311 | install.packages('SMCRM') 312 | library(SMCRM) 313 | data(customerAcquisition) 314 | ca <- customerAcquisition 315 | cor(filter(ca,acquisition==1)$first_purchase,filter(ca,acquisition==1)$duration) 316 | plot(filter(ca,acquisition==1)$first_purchase,filter(ca,acquisition==1)$duration, 317 | xlab="Value of First Purchase",ylab="Customer Duration") 318 | prop.table(table(ca$industry,ca$acquisition),margin=1) 319 | prop.table(table(ca$industry,ca$acquisition),margin=2) 320 | aggregate(revenue~industry,data=ca,FUN=mean) 321 | cor(ca$revenue,ca$industry) 322 | plot(density(filter(ca,industry==0)$revenue),col='blue',xlab="Revenues",main="Revenue Distribution") 323 | lines(density(filter(ca,industry==1)$revenue),col='red') 324 | ``` -------------------------------------------------------------------------------- /Lectures/Lecture_12_Midterm_Review.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 12: Midterm Review" 3 | author: "Nick Huntington-Klein" 4 | date: "February 14, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE) 19 |
library(tidyverse) 20 | library(stargazer) 21 | theme_set(theme_gray(base_size = 15)) 22 | ``` 23 | 24 | ## Recap 25 | 26 | - We've been covering how to work with data in R 27 | - Building up multiple variables (numeric, character, logical, factor) into vectors 28 | - Joining vectors together into data.frames or tibbles 29 | - Making data, downloading data, getting data from packages 30 | - Manipulating (with dplyr), summarizing, and plotting that data 31 | - Looking at relationships between variables 32 | 33 | ## Working with Objects 34 | 35 | - Create a vector with `c()` or `1:4` or `sample()` or `numeric()` etc. 36 | - Create logicals to check conditions on a vector, i.e. `a < 5 & a > 1` or `c('A','B') %in% c('A','C','D')` 37 | - Check vector type with `is.` functions or change them with `as.` 38 | - Use `help()` to figure out how to use new functions you don't know yet! 39 | 40 | ## Working with Objects Practice 41 | 42 | - Use `sample()` to generate a vector of 1000 names from Jack, Jill, and Mary. 43 | - Use `%in%` to count how many are Jill or Mary. 44 | - Use `help()` to figure out how to use the `substr` function to get the first letter of the name. Then, use that to count how many names are Jack or Jill. 45 | - Change the vector to a factor. 46 | - Create a vector of all integers from 63 to 302. Then, count how many are below 99 or above 266. 47 | 48 | ## Answers 49 | 50 | ```{r, echo=TRUE, eval=FALSE} 51 | names <- sample(c('Jack','Jill','Mary'),1000,replace=T) 52 | sum(names %in% c('Jill','Mary')) 53 | firstletter <- substr(names,1,1) 54 | sum(firstletter == "J") 55 | names <- factor(names) 56 | 57 | numbers <- 63:302 58 | sum(numbers < 99 | numbers > 266) 59 | ``` 60 | 61 | ## Working with Data 62 | 63 | - Get data with `read.csv()` or `data()`, or create it with `data.frame()` or `tibble()` 64 | - Use dplyr to manipulate it: 65 | - `filter()` to pick a subset of observations 66 | - `select()` to pick a subset of variables 67 | - `rename()` to rename variables 68 | - `mutate()` to create new variables 69 | - `%>%` to chain together commands 70 | - Automate things with a `for (i in 1:10) {}` loop 71 | 72 | ## Working with Data Practice 73 | 74 | - Load the `Ecdat` library and get the `Computers` data set 75 | - In one chain of commands, 76 | - create a logical `bigHD` if the `hd` is above median 77 | - remove the `ads` and `trend` variables 78 | - limit the data to only premium computers 79 | - Use a `for` loop to print out the median price for each level of `ram` 80 | - Loop over a vector; it's often useful to use `unique()` to get that vector 81 | 82 | ## Answers 83 | 84 | ```{r, echo=TRUE, eval=FALSE} 85 | library(Ecdat) 86 | data(Computers) 87 | 88 | Computers <- Computers %>% 89 | mutate(bigHD = hd > median(hd)) %>% 90 | select(-ads,-trend) %>% 91 | filter(premium == "yes") 92 | 93 | for (i in unique(Computers$ram)) { 94 | print(median(filter(Computers,ram==i)$price)) 95 | } 96 | ``` 97 | 98 | ## Summarizing Single Variables 99 | 100 | - Variables have a *distribution* and we are interested in describing that distribution 101 | - `table()`, `mean()`, `sd()`, `quantile()` and functions for 0, 50, 100% percentiles `min()` `median()` `max()` 102 | - `stargazer()` to get a bunch of summary stats at once 103 | - Plotting: `plot(density(x))`, `hist()`, `barplot(table())` 104 | - Adding to plots with `points()`, `lines()`, `abline()` 105 | 106 | ## Summarizing Single Variables Practice 107 | 108 | - Create a text stargazer table of `Computers` 109 | - Use `table` to look at the distribution of `ram`,
then make a `barplot` of it 110 | - Create a density plot of price, and use a single `abline(v=)` to overlay the 0, 10, 20, ..., 100% percentiles on it as blue vertical lines 111 | 112 | ## Answers 113 | 114 | ```{r, echo=TRUE, eval=FALSE} 115 | library(stargazer) 116 | stargazer(Computers,type='text') 117 | 118 | table(Computers$ram) 119 | barplot(table(Computers$ram)) 120 | 121 | plot(density(Computers$price),xlab='Price',main='Distribution of Computer Price') 122 | abline(v=quantile(Computers$price,0:10/10),col='blue') 123 | ``` 124 | 125 | ## Relationships Between Variables 126 | 127 | - Looking at the distribution of one variable *at a given value of another variable* 128 | - Check for dependence with `prop.table(table(x,y),margin=)` 129 | - Correlation: are they large/small together? `cor()` 130 | - `group_by(x) %>% summarize(mean(y))` to get mean of y within values of x 131 | - `cut(x,breaks=10)` to put x into "bins" to explain y with 132 | - Mean of y within values of x gives part of y *explained* by x 133 | - Proportion of variance explained `1-var(residuals)/var(y)` 134 | - `plot(x,y)` or overlaid density plots 135 | 136 | ## Relationships Practice 137 | 138 | - Use `prop.table` with both margins to see if cd and multi look dependent 139 | - Use `cut` to make 10 bins of `hd` 140 | - Get average price by bin of `hd`, and residuals 141 | - Calculate proportion of variance in price explained by `hd`, and calculate correlation 142 | - Plot `price` (y-axis) against `hd` (x-axis) 143 | 144 | ## Answers 145 | 146 | ```{r, echo=TRUE, eval=FALSE} 147 | prop.table(table(Computers$cd,Computers$multi),margin=1) 148 | prop.table(table(Computers$cd,Computers$multi),margin=2) 149 | 150 | Computers <- Computers %>% 151 | mutate(hdbins = cut(hd,breaks=10)) %>% 152 | group_by(hdbins) %>% 153 | mutate(priceav = mean(price)) %>% 154 | mutate(res = price - priceav) 155 | 156 | #variance explained 157 | 1 - var(Computers$res)/var(Computers$price) 158 | 159 | plot(Computers$hd,Computers$price,xlab="Size of Hard Drive",ylab='Price') 160 | ``` 161 | 162 | ## Simulation 163 | 164 | - There are *true models* that we can't see, but which generate data for us 165 | - We want to use methods that can work backwards to uncover true models 166 | - We can randomly generate data using a true model we decide, and see if our method uncovers it 167 | - `rnorm()`, `runif()`, `sample()` 168 | - Create a blank vector, then a `for` loop to make data and analyze. 
Store the result in the vector 169 | - Analyze the vector to see what the results look like 170 | 171 | ## Simulation Practice 172 | 173 | - Create a for loop that creates 500 obs of `At.War` (equal to 1 10% of the time, 0 otherwise) 174 | - And `Net.Exports` (uniform, min -1, max 1, then subtract 3*At.War) 175 | - And `GDP.Growth` (normal, mean 0, sd 3, then add + Net.Exports + At.War) 176 | - Explains GDP.Growth with At.War and takes the residual 177 | - Calculates `cor()` between GDP.Growth and Net.Exports, and between GDP.Growth and residual 178 | - Stores the correlations in two separate vectors, and then compares their distributions after 1000 loops 179 | 180 | ## Answer 181 | 182 | ```{r, echo=TRUE, eval=FALSE} 183 | library(stargazer) 184 | 185 | GDPcor <- c() 186 | rescor <- c() 187 | 188 | for (i in 1:1000) { 189 | df <- tibble(At.War = sample(0:1,500,replace=T,prob=c(.9,.1))) %>% 190 | mutate(Net.Exports = runif(500,-1,1)-3*At.War) %>% 191 | mutate(GDP.Growth = rnorm(500,0,3)+Net.Exports+At.War) %>% 192 | group_by(At.War) %>% 193 | mutate(residual = GDP.Growth - mean(GDP.Growth)) 194 | 195 | GDPcor[i] <- cor(df$GDP.Growth,df$Net.Exports) 196 | rescor[i] <- cor(df$residual,df$Net.Exports) 197 | } 198 | 199 | stargazer(data.frame(rescor,GDPcor),type='text') 200 | ``` 201 | 202 | ## Midterm 203 | 204 | Reminders: 205 | 206 | - Midterm will allow the use of the R help files but not the rest of the internet 207 | - You will also have access to lecture slides. *I do not recommend relying on them as this would take you a lot of time* 208 | - Anything we've covered is fair game 209 | - There will be one question that requires you to learn about, and use, a function we haven't used yet 210 | - The answer key to the midterm contains 35 lines of code, many of which are dplyr -------------------------------------------------------------------------------- /Lectures/Lecture_13_Causality.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 13: Causality" 3 | author: "Nick Huntington-Klein" 4 | date: "February 18, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE) 19 | library(tidyverse) 20 | theme_set(theme_gray(base_size = 15)) 21 | ``` 22 | 23 | ## Midterm 24 | 25 | - Let's review the midterm answers 26 | 27 | ## And now! 28 | 29 | - The second half of the class 30 | - The first half focused on how to program and some useful statistical concepts 31 | - The second half will focus on *causality* 32 | - In other words, "how do we know if X causes Y?" 33 | 34 | ## Why Causality? 35 | 36 | - Many of the interesting questions we might want to answer with data are causal 37 | - Some are non-causal, too - for example, "how can we predict whether this photo is of a dog or a cat" is vital to how Google Images works, but it doesn't care what *caused* the photo to be of a dog or a cat 38 | - Nearly every *why* question is causal 39 | - And when we're talking about people, *why* is often what we want to know! 40 | 41 | ## Also 42 | 43 | - This is economists' comparative advantage! 44 | - Plenty of fields do statistics. But very few make it standard training for their students to understand causality 45 | - This understanding of causality makes economists very useful!
*This* is one big reason why tech companies have whole economics departments in them 46 | 47 | ## Bringing us to... 48 | 49 | - Part of this half of the class will be understanding what causality *is* and how we can find it 50 | - Another big part will be understanding common *research designs* for uncovering causality in data when we can't do an experiment 51 | - These, more than supply & demand, more than ISLM, are the tools of the modern economist! 52 | 53 | ## So what is causality? 54 | 55 | - We say that `X` *causes* `Y` if... 56 | - were we to intervene and *change* the value of `X` without changing anything else... 57 | - then `Y` would also change as a result 58 | 59 | ## Some examples 60 | 61 | Examples of causal relationships! 62 | 63 | Some obvious: 64 | 65 | - A light switch being set to on causes the light to be on 66 | - Setting off fireworks raises the noise level 67 | 68 | Some less obvious: 69 | 70 | - Getting a college degree increases your earnings 71 | - Tariffs reduce the amount of trade 72 | 73 | ## Some examples 74 | 75 | Examples of non-zero *correlations* that are not *causal* (or may be causal in the wrong direction!) 76 | 77 | Some obvious: 78 | 79 | - People tend to wear shorts on days when ice cream trucks are out 80 | - Rooster crowing sounds are followed closely by sunrise* 81 | 82 | Some less obvious: 83 | 84 | - Colds tend to clear up a few days after you take Emergen-C 85 | - The performance of the economy tends to be lower or higher depending on the president's political party 86 | 87 | *This case of mistaken causality is the basis of the film Rock-a-Doodle which I remember being very entertaining when I was six. 88 | 89 | ## Important Note 90 | 91 | - "X causes Y" *doesn't* mean that X is necessarily the *only* thing that causes Y 92 | - And it *doesn't* mean that all Y must be X 93 | - For example, using a light switch causes the light to go on 94 | - But not if the bulb is burned out (no Y, despite X), or if the light was already on (Y without X) 95 | - But still we'd say that using the switch causes the light! The important thing is that X *changes the probability* that Y happens, not that it necessarily makes it happen for certain 96 | 97 | ## So How Can We Tell? 98 | 99 | - As just shown, there are plenty of *correlations* that aren't *causal* 100 | - So if we have a correlation, how can we tell if it *is*? 101 | - For this we're going to have to think hard about *causal inference*. That is, inferring causality from data 102 | 103 | ## The Problem of Causal Inference 104 | 105 | - Let's try to think about whether some `X` causes `Y` 106 | - That is, if we manipulated `X`, then `Y` would change as a result 107 | - For simplicity, let's assume that `X` is either 1 or 0, like "got a medical treatment" or "didn't" 108 | 109 | ## The Problem of Causal Inference 110 | 111 | - Now, how can we know *what would happen* if we manipulated `X`? 112 | - Let's consider just one person - Angela. We could just check what Angela's `Y` is when we make `X=0`, and then check what Angela's `Y` is again when we make `X=1`. 113 | - Are those two `Y`s different? If so, `X` causes `Y`! 114 | - Do that same process for everyone in your sample and you know in general what the effect of `X` on `Y` is 115 | 116 | ## The Problem of Causal Inference 117 | 118 | - You may have spotted the problem 119 | - Just like you can't be in two places at once, Angela can't exist both with `X=0` and with `X=1`. She either got that medical treatment or she didn't. 120 | - Let's say she did. 
So for Angela, `X=1` and, let's say, `Y=10`. 121 | - The other one, what `Y` *would have been* if we made `X=0`, is *missing*. We don't know what it is! Could also be `Y=10`. Could be `Y=9`. Could be `Y=1000`! 122 | 123 | ## The Problem of Causal Inference 124 | 125 | - Well, why don't we just take someone who actually DOES have `X=0` and compare their `Y`? 126 | - Because there are lots of reasons their `Y` could be different BESIDES `X`. 127 | - They're not Angela! A character flaw to be sure. 128 | - So if we find someone, Gareth, with `X=0` and they have `Y=9`, is that because `X` increases `Y`, or is that just because Angela and Gareth would have had different `Y`s anyway? 129 | 130 | ## The Problem of Causal Inference 131 | 132 | - The main goal we have in doing causal inference is in making *as good a guess as possible* as to what that `Y` *would have been* if `X` had been different 133 | - That "would have been" is called a *counterfactual* - counter to the fact of what actually happened 134 | - In doing so, we want to think about two people/firms/countries that are basically *exactly the same* except that one has `X=0` and one has `X=1` 135 | 136 | ## Experiments 137 | 138 | - A common way to do this in many fields is an *experiment* 139 | - If you can *randomly assign* `X`, then you know that the people with `X=0` are, on average, exactly the same as the people with `X=1` 140 | - So that's an easy comparison! 141 | 142 | ## Experiments 143 | 144 | - When we're working with people/firms/countries, running experiments is often infeasible, impossible, or unethical 145 | - So we have to think hard about a *model* of what the world looks like 146 | - So that we can use our model to figure out what the *counterfactual* would be 147 | 148 | ## Models 149 | 150 | - In causal inference, the *model* is our idea of what we think the process is that *generated the data* 151 | - We have to make some assumptions about what this is! 152 | - We put together what we know about the world with assumptions and end up with our model 153 | - The model can then tell us what kinds of things could give us wrong results so we can fix them and get the right counterfactual 154 | 155 | ## Models 156 | 157 | - Wouldn't it be nice to not have to make assumptions? 158 | - Yeah, but it's impossible to skip! 159 | - We're trying to predict something that hasn't happened - a counterfactual 160 | - This is literally impossible to do if you don't have some model of how the data is generated 161 | - You can't even predict the sun will rise tomorrow without a model! 162 | - If you think you can, you just don't realize the model you're using - that's dangerous! 163 | 164 | ## An Example 165 | 166 | - Let's cheat again and know how our data is generated! 167 | - Let's say that getting `X` causes `Y` to increase by 1 168 | - And let's run a randomized experiment of who actually gets `X` 169 | 170 | ```{r, echo=TRUE, eval=TRUE} 171 | df <- data.frame(Y.without.X = rnorm(1000),X=sample(c(0,1),1000,replace=T)) %>% 172 | mutate(Y.with.X = Y.without.X + 1) %>% 173 | #Now assign who actually gets X 174 | mutate(Observed.Y = ifelse(X==1,Y.with.X,Y.without.X)) 175 | #And see what effect our experiment suggests X has on Y 176 | df %>% group_by(X) %>% summarize(Y = mean(Observed.Y)) 177 | ``` 178 | 179 | ## An Example 180 | 181 | - Now this time we can't randomize X.
182 | 183 | ```{r, echo=TRUE, eval=TRUE} 184 | df <- data.frame(Z = runif(10000)) %>% mutate(Y.without.X = rnorm(10000) + Z, Y.with.X = Y.without.X + 1) %>% 185 | #Now assign who actually gets X 186 | mutate(X = Z > .7,Observed.Y = ifelse(X==1,Y.with.X,Y.without.X)) 187 | df %>% group_by(X) %>% summarize(Y = mean(Observed.Y)) 188 | #But if we properly model the process and compare apples to apples... 189 | df %>% filter(abs(Z-.7)<.01) %>% group_by(X) %>% summarize(Y = mean(Observed.Y)) 190 | ``` 191 | 192 | ## So! 193 | 194 | - So, as we move forward 195 | - We're going to be thinking about how to create models of the processes that generated the data 196 | - And, once we have those models, we'll figure out what methods we can use to generate plausible counterfactuals 197 | - Once we're really comparing apples to apples, we can figure out, using *only data we can actually observe*, how things would be different if we reached in and changed `X`, and how `Y` would change as a result. -------------------------------------------------------------------------------- /Lectures/Lecture_14_Causal_Diagrams.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 14: Causal Diagrams" 3 | author: "Nick Huntington-Klein" 4 | date: "February 25, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE) 19 | library(tidyverse) 20 | library(dagitty) 21 | library(ggdag) 22 | library(gganimate) 23 | library(ggthemes) 24 | library(Cairo) 25 | theme_set(theme_gray(base_size = 15)) 26 | ``` 27 | 28 | ## Recap 29 | 30 | - Last time we talked about causality 31 | - The idea that if we could reach in and manipulate `X`, and as a result `Y` changes too, then `X` *causes* `Y` 32 | - We also talked about how we can identify causality in data 33 | - Part of that will necessarily require us to have a model 34 | 35 | ## Models 36 | 37 | - We *have to have a model* to get at causality 38 | - A model is our way of *understanding the world*. It's our idea of what we think the data-generating process is 39 | - Models can be informal or formal - "The sun rises every day because the earth spins" vs. super-complex astronomical models of the galaxy with thousands of equations 40 | - All models are wrong. Even quantum mechanics. But as long as models are right enough to be useful, we're good to go! 41 | 42 | ## Models 43 | 44 | - Once we *do* have a model, though, that model will tell us *exactly* how we can find a causal effect 45 | - (if it's possible; it's not always possible) 46 | - Sort of like how, last time, we knew how `X` was assigned, and using that information we were able to get a good estimate of the true treatment effect 47 | 48 | ## Example 49 | 50 | - Let's work through a familiar example from before, where we know the data generating process 51 | 52 | ```{r, echo=TRUE} 53 | # Is your company in tech? Let's say 30% of firms are 54 | df <- tibble(tech = sample(c(0,1),500,replace=T,prob=c(.7,.3))) %>% 55 | #Tech firms on average spend $3mil more defending IP lawsuits 56 | mutate(IP.spend = 3*tech+runif(500,min=0,max=4)) %>% 57 | #Tech firms also have higher profits.
But IP lawsuits lower profits 58 | mutate(log.profit = 2*tech - .3*IP.spend + rnorm(500,mean=2)) 59 | # Now let's check how profit and IP.spend are correlated! 60 | cor(df$log.profit,df$IP.spend) 61 | ``` 62 | 63 | - Uh-oh! Truth is negative relationship, but data says positive!! 64 | 65 | ## Example 66 | 67 | - Now we can ask: *what do we know* about this situation? 68 | - How do we suspect the data was generated? (ignoring for a moment that we know already) 69 | - We know that being a tech company leads you to have to spend more money on IP lawsuits 70 | - We know that being a tech company leads you to have higher profits 71 | - We know that IP lawsuits lower your profits 72 | 73 | ## Example 74 | 75 | - From this, we realize that part of what we get when we calculate `cor(df$log.profit,df$IP.spend)` is the influence of being a tech company 76 | - Meaning that if we remove that influence, what's left over should be the actual, negative, effect of IP lawsuits 77 | - Now, we can get to this intuitively, but it would be much more useful if we had a more formal model that could tell us what to do in *lots* of situations 78 | 79 | ## Causal Diagrams 80 | 81 | - Enter the causal diagram! 82 | - A causal diagram (aka a Directed Acyclic Graph) is a way of writing down your *model* that lets you figure out what you need to do to find your causal effect of interest 83 | - All you need to do to make a causal diagram is write down all the important features of the data generating process, and also write down what you think causes what! 84 | 85 | ## Example 86 | 87 | - We know that being a tech company leads you to have to spend more money on IP lawsuits 88 | - We know that being a tech company leads you to have higher profits 89 | - We know that IP lawsuits lower your profits 90 | 91 | ## Example 92 | 93 | - We know that being a tech company leads you to have to spend more money on IP lawsuits 94 | - We know that being a tech company leads you to have higher profits 95 | - We know that IP lawsuits lower your profits 96 | 97 | ```{r CairoPlot, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=3.5} 98 | dag <- dagify(IP.sp~tech) %>% tidy_dagitty() 99 | ggdag(dag,node_size=20) 100 | ``` 101 | 102 | ## Example 103 | 104 | - We know that being a tech company leads you to have to spend more money on IP lawsuits 105 | - We know that being a tech company leads you to have higher profits 106 | - We know that IP lawsuits lower your profits 107 | 108 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=3.5} 109 | dag <- dagify(IP.sp~tech, 110 | profit~tech) %>% tidy_dagitty() 111 | ggdag(dag,node_size=20) 112 | ``` 113 | 114 | ## Example 115 | 116 | - We know that being a tech company leads you to have to spend more money on IP lawsuits 117 | - We know that being a tech company leads you to have higher profits 118 | - We know that IP lawsuits lower your profits 119 | 120 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=3.5} 121 | dag <- dagify(IP.sp~tech, 122 | profit~tech+IP.sp) %>% tidy_dagitty() 123 | ggdag(dag,node_size=20) 124 | ``` 125 | 126 | ## Voilà 127 | 128 | - We have *encoded* everything we know about this particular little world in our diagram 129 | - (well, not everything, the diagram doesn't say whether we think these effects are positive or negative) 130 | - Not only can we see our assumptions, but we can see how they fit together 131 | - For example, if we were looking for the impact of *tech* on profit, we'd know that it happens directly, AND happens because tech
affects `IP.spend`, which then affects profit. 132 | 133 | ## Identification 134 | 135 | - And if we want to isolate the effect of `IP.spend` on `profit`, we can figure that out too 136 | - We call this process - isolating just the causal effect we're interested in - "identification" 137 | - We're *identifying* just one of those arrows, the one `IP.spend -> profit`, and seeing what the effect is on that arrow! 138 | 139 | ## Identification 140 | 141 | - Based on this graph, we can see that part of the correlation between `IP.Spend` and `profit` can be *explained by* how `tech` links the two. 142 | 143 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=3.5} 144 | dag <- dagify(IP.sp~tech, 145 | profit~tech+IP.sp) %>% tidy_dagitty() 146 | ggdag(dag,node_size=20) 147 | ``` 148 | 149 | ## Identification 150 | 151 | - Since we can *explain* part of the correlation with `tech`, but we want to *identify* the part of the correlation that ISN'T explained by `tech` (the causal part), we will want to just use what isn't explained by tech! 152 | - Use `tech` to explain `profit`, and take the residual 153 | - Use `tech` to explain `IP.spend`, and take the residual 154 | - The relationship between the first residual and the second residual is *causal*! 155 | 156 | ## Controlling 157 | 158 | - This process is called "adjusting" or "controlling". We are "controlling for tech" and taking out the part of the relationship that is explained by it 159 | - In doing so, we're looking at the relationship between `IP.spend` and `profit` *just comparing firms that have the same level of `tech`*. 160 | - This is our "apples to apples" comparison that gives us an experiment-like result 161 | 162 | ## Controlling 163 | 164 | ```{r, echo=TRUE, eval=TRUE} 165 | df <- df %>% group_by(tech) %>% 166 | mutate(log.profit.resid = log.profit - mean(log.profit), 167 | IP.spend.resid = IP.spend - mean(IP.spend)) %>% ungroup() 168 | cor(df$log.profit.resid,df$IP.spend.resid) 169 | ``` 170 | 171 | - Negative! Hooray 172 | 173 | ## Controlling 174 | 175 | - Imagine we're looking at that relationship *within color* 176 | 177 | ```{r, dev='CairoPNG', echo=FALSE,fig.height=5.5,fig.width=8} 178 | ggplot(mutate(df,tech=factor(tech,labels=c("Not Tech","Tech"))), 179 | aes(x=IP.spend,y=log.profit,color=tech))+geom_point()+ guides(color=guide_legend(title="Firm Type"))+ 180 | scale_color_colorblind() 181 | ``` 182 | 183 | ## LITERALLY 184 | 185 | ```{r, dev='CairoPNG', echo=FALSE,fig.height=5.5,fig.width=5.5} 186 | df <- df %>% 187 | group_by(tech) %>% 188 | mutate(mean_profit = mean(log.profit), 189 | mean_IP = mean(IP.spend)) %>% ungroup() 190 | 191 | before_cor <- paste("1. Raw data. Correlation between log.profit and IP.spend: ",round(cor(df$log.profit,df$IP.spend),3),sep='') 192 | after_cor <- paste("6. Analyze what's left! cor(log.profit,IP.spend) controlling for tech: ",round(cor(df$log.profit-df$mean_profit,df$IP.spend-df$mean_IP),3),sep='') 193 | 194 | 195 | 196 | 197 | #Add step 2 in which IP.spend is demeaned, and 3 in which both IP.spend and log.profit are, and 4 which just changes label 198 | dffull <- rbind( 199 | #Step 1: Raw data only 200 | df %>% mutate(mean_IP=NA,mean_profit=NA,time=before_cor), 201 | #Step 2: Add x-lines 202 | df %>% mutate(mean_profit=NA,time='2. Figure out what differences in IP.spend are explained by tech'), 203 | #Step 3: IP.spend de-meaned 204 | df %>% mutate(IP.spend = IP.spend - mean_IP,mean_IP=0,mean_profit=NA,time="3. 
Remove differences in IP.spend explained by tech"), 205 | #Step 4: Remove IP.spend lines, add log.profit 206 | df %>% mutate(IP.spend = IP.spend - mean_IP,mean_IP=NA,time="4. Figure out what differences in log.profit are explained by tech"), 207 | #Step 5: log.profit de-meaned 208 | df %>% mutate(IP.spend = IP.spend - mean_IP,log.profit = log.profit - mean_profit,mean_IP=NA,mean_profit=0,time="5. Remove differences in log.profit explained by tech"), 209 | #Step 6: Raw demeaned data only 210 | df %>% mutate(IP.spend = IP.spend - mean_IP,log.profit = log.profit - mean_profit,mean_IP=NA,mean_profit=NA,time=after_cor)) 211 | 212 | p <- ggplot(dffull,aes(y=log.profit,x=IP.spend,color=as.factor(tech)))+geom_point()+ 213 | geom_vline(aes(xintercept=mean_IP,color=as.factor(tech)))+ 214 | geom_hline(aes(yintercept=mean_profit,color=as.factor(tech)))+ 215 | guides(color=guide_legend(title="tech"))+ 216 | scale_color_colorblind()+ 217 | labs(title = 'The Relationship between log.profit and IP.spend, Controlling for tech \n{next_state}')+ 218 | theme(plot.title = element_text(size=14))+ 219 | transition_states(time,transition_length=c(12,32,12,32,12,12),state_length=c(160,100,75,100,75,160),wrap=FALSE)+ 220 | ease_aes('sine-in-out')+ 221 | exit_fade()+enter_fade() 222 | 223 | animate(p,nframes=200) 224 | ``` 225 | 226 | ## Recap 227 | 228 | - By controlling for `tech` ("holding it constant") we got rid of the part of the `IP.spend`/`profit` relationship that was explained by `tech`, and so managed to *identify* the `IP.spend -> profit` arrow, the causal effect we're interested in! 229 | - We correctly found that it was negative 230 | - Remember, we made it truly negative when we created the data, all those slides ago 231 | 232 | ## Causal Diagrams 233 | 234 | - And it was the diagram that told us to control for `tech` 235 | - It's going to turn out that diagrams can tell us how to identify things in much more complex circumstances - we'll get to that soon 236 | - But you might have noticed that it was pretty obvious what to do just by looking at the graph 237 | 238 | ## Causal Diagrams 239 | 240 | - Can't we just look at the data to see what we need to control for? 241 | - After all, that would free us from having to make all those assumptions and figure out our model 242 | - No!!! 243 | - Why? Because for a given set of data that we see, there are *many* different data generating processes that could have made it 244 | - Each requiring different kinds of adjustments in order to get it right 245 | 246 | ## Causal Diagrams 247 | 248 | - We observe that `profit` (y), `IP.spend` (x), and `tech` (z) are all related... which is it? 249 | 250 | ```{r dev='CairoPNG', echo=FALSE, fig.width=8, fig.height=5} 251 | ggdag_equivalent_dags(confounder_triangle(x_y_associated=TRUE),node_size=12) 252 | ``` 253 | 254 | ## Causal Diagrams 255 | 256 | - With only the data to work with we have *literally no way of knowing* which of those is true 257 | - Maybe `IP.spend` causes companies to be `tech` companies (in 2, 3, 6) 258 | - We know that's silly because we have an idea of what the model is 259 | - But that's what lets us know it's wrong - the model. With just the data we have no clue. 260 | 261 | ## Causal Diagrams 262 | 263 | - Next time we'll set about actually making one of these diagrams 264 | - And soon we'll be looking for what the diagrams tell us about how to identify an effect! 
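
## With Just the Data, We Have No Clue

- Before you practice, here's a quick simulated demonstration of that point (all numbers here are made up for this slide, not drawn from the tech data above)
- Two totally different data generating processes, one correlation:

```{r, echo=TRUE}
# World 1: Z causes both X and Y, and X does NOT cause Y
world1 <- tibble(Z = rnorm(5000)) %>%
  mutate(X = Z + rnorm(5000), Y = Z + rnorm(5000))
# World 2: X really DOES cause Y, and there's no Z anywhere
world2 <- tibble(X = sqrt(2)*rnorm(5000)) %>%
  mutate(Y = .5*X + sqrt(1.5)*rnorm(5000))
# Both correlations come out near .5!
cor(world1$X,world1$Y)
cor(world2$X,world2$Y)
```

- Same correlation, totally different causal stories - only a model can tell us whether we need to control for something like `Z`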
265 | 266 | ## Practice 267 | 268 | - Load in `data(swiss)` and use `help` to look at it 269 | - Get the correlation between Fertility and Education 270 | - Think about what direction the arrows might go on a diagram with Fertility, Education, and Agriculture 271 | - Get the correlation between Fertility and Education *controlling for Agriculture* (use `cut` with `breaks=3`) 272 | 273 | ## Practice Answers 274 | 275 | ```{r, echo=TRUE, eval=FALSE} 276 | data(swiss) 277 | help(swiss) 278 | cor(swiss$Fertility,swiss$Education) 279 | 280 | swiss <- swiss %>% 281 | group_by(cut(Agriculture,breaks=3)) %>% 282 | mutate(Fert.resid = Fertility - mean(Fertility), 283 | Ed.resid = Education - mean(Education)) 284 | 285 | cor(swiss$Fert.resid,swiss$Ed.resid) 286 | ``` 287 | 288 | ```{r, echo=FALSE, eval=TRUE} 289 | data(swiss) 290 | cor(swiss$Fertility,swiss$Education) 291 | 292 | swiss <- swiss %>% 293 | group_by(cut(Agriculture,breaks=3)) %>% 294 | mutate(Fert.resid = Fertility - mean(Fertility), 295 | Ed.resid = Education - mean(Education)) 296 | 297 | cor(swiss$Fert.resid,swiss$Ed.resid) 298 | ``` -------------------------------------------------------------------------------- /Lectures/Lecture_15_Drawing_Causal_Diagrams.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 15 Drawing Causal Diagrams" 3 | author: "Nick Huntington-Klein" 4 | date: "February 27, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE) 19 | library(tidyverse) 20 | library(dagitty) 21 | library(ggdag) 22 | library(gganimate) 23 | library(ggthemes) 24 | library(Cairo) 25 | theme_set(theme_gray(base_size = 15)) 26 | ``` 27 | 28 | ## Recap 29 | 30 | - Last time we covered the concept of "controlling" or "adjusting" for a variable 31 | - And we knew what to control for because of our causal diagram 32 | - A causal diagram is your *model* of what you think the data-generating process is 33 | - Which you can use to figure out how to *identify* particular arrows of interest 34 | 35 | ## Today 36 | 37 | - Today we're going to be working through the process of how to *make* causal diagrams 38 | - This will require us to *understand* what is going on in the real world 39 | - (especially since we won't know the right answer, like when we simulate our own data!) 40 | - And then represent our understanding in the diagram 41 | 42 | ## Remember 43 | 44 | - Our goal is to *represent the underlying data-generating process* 45 | - This is going to require some common-sense thinking 46 | - As well as some economic intuition 47 | - In real life, we're not going to know what the right answer is 48 | - Our models will be wrong. We just need them to be useful 49 | 50 | ## Steps to a Causal Diagram 51 | 52 | 1. Consider all the variables that are likely to be important in the data generating process (this includes variables you can't observe) 53 | 2. For simplicity, combine them together or prune the ones least likely to be important 54 | 3. Consider which variables are likely to affect which other variables and draw arrows from one to the other 55 | 4. (Bonus: Test some implications of the model to see if you have the right one) 56 | 57 | ## Some notes 58 | 59 | - Drawing an arrow requires a *direction*.
You're making a statement! 60 | - Omitting an arrow is a statement too - you're saying neither causes the other (directly) 61 | - If two variables are correlated but neither causes the other, that means they're both caused by some other (perhaps unobserved) variable that causes both - add it! 62 | - There shouldn't be any *cycles* - You shouldn't be able to follow the arrows in one direction and end up where you started 63 | - If there *should* be a feedback loop, like "rich get richer", distinguish between the same variable at different points in time to avoid it 64 | 65 | ## So let's do it! 66 | 67 | - Let's start with an econometrics classic: what is the causal effect of an additional year of education on earnings? 68 | - That is, if we reached in and made someone get one more year of education than they already did, how much more money would they earn? 69 | 70 | ## 1. Listing Variables 71 | 72 | - We can start with our two main variables of interest: 73 | - Education [we call this the "treatment" or "exposure" variable] 74 | - Earnings [the "outcome"] 75 | 76 | ## 1. Listing Variables 77 | 78 | - Then, we can add other variables likely to be relevant 79 | - Focus on variables that are likely to cause or be caused by treatment 80 | - ESPECIALLY if they're related both to the treatment and the outcome 81 | - They don't have to be things you can actually observe/measure 82 | - Variables that affect the outcome but aren't related to anything else aren't really important (you'll see why next week) 83 | 84 | ## 1. Listing Variables 85 | 86 | - So what can we think of? 87 | - Ability 88 | - Socioeconomic status 89 | - Demographics 90 | - Phys. ed requirements 91 | - Year of birth 92 | - Location 93 | - Compulsory schooling laws 94 | - Job connections 95 | 96 | ## 2. Simplify 97 | 98 | - There's a lot going on - in any social science system, there are THOUSANDS of things that could plausibly be in your diagram 99 | - So we simplify. We ignore things that are likely to be only of trivial importance [so Phys. ed is out!] 100 | - And we might try to combine variables that are very similar or might overlap in what we measure [Ability, Socioeconomic status, Demographics -> Background] 101 | - Now: Education, Earnings, Background, Year of birth, Location, Compulsory schooling, and Job Connections 102 | 103 | ## 3. Arrows! 104 | 105 | - Consider which variables are likely to cause which others 106 | - And, importantly, *which arrows you can leave out* 107 | - The arrows you *leave out* are important to think about - you sure there's no effect? - and prized! You need those NON-arrows to be able to causally identify anything. 108 | 109 | ## 3. Arrows 110 | 111 | - Let's start with our effect of interest 112 | - Education causes earnings 113 | 114 | ```{r CairoPlot, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 115 | set.seed(1000) 116 | dag <- dagify(Earn~Ed) %>% tidy_dagitty() 117 | ggdag(dag,node_size=20) 118 | ``` 119 | 120 | ## 3. Arrows 121 | 122 | - Remaining: Background, Year of birth, Location, Compulsory schooling, and Job Connections 123 | - All of these but Job Connections should cause Ed 124 | 125 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 126 | set.seed(1000) 127 | dag <- dagify(Earn~Ed, 128 | Ed~Bgrd+Year+Loc+Comp) %>% tidy_dagitty() 129 | ggdag(dag,node_size=20) 130 | ``` 131 | 132 | ## 3. Arrows 133 | 134 | - Seems like Year of Birth, Location, and Background should ALSO affect earnings. Job Connections, too.
135 | 136 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 137 | set.seed(1000) 138 | dag <- dagify(Earn~Ed+Bgrd+Year+Loc+JobCx, 139 | Ed~Bgrd+Year+Loc+Comp) %>% tidy_dagitty() 140 | ggdag(dag,node_size=20) 141 | ``` 142 | 143 | ## 3. Arrows 144 | 145 | - Job connections, in fact, seems like it should be caused by Education 146 | 147 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 148 | set.seed(1000) 149 | dag <- dagify(Earn~Ed+Bgrd+Year+Loc+JobCx, 150 | Ed~Bgrd+Year+Loc+Comp, 151 | JobCx~Ed) %>% tidy_dagitty() 152 | ggdag(dag,node_size=20) 153 | ``` 154 | 155 | ## 3. Arrows 156 | 157 | - Location and Background are likely to be related, but neither really causes the other. Make unobservable U1 and have it cause both! 158 | 159 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 160 | set.seed(1000) 161 | dag <- dagify(Earn~Ed+Bgrd+Year+Loc+JobCx, 162 | Ed~Bgrd+Year+Loc+Comp, 163 | JobCx~Ed, 164 | Loc~U1, 165 | Bgrd~U1) %>% tidy_dagitty() 166 | ggdag(dag,node_size=20) 167 | ``` 168 | 169 | ## Causal Diagrams 170 | 171 | - And there we have it! 172 | - Perhaps a little messy, but that can happen 173 | - We have modeled our idea of what the data generating process looks like 174 | 175 | ## Causal Diagrams 176 | 177 | - The nice thing about these diagrams is that we can test (some of) our assumptions! 178 | - We can't test assumptions about which direction the arrows go, but notice that the diagram implies that certain relationships WON'T be there! 179 | - For example, in our diagram, all relationships between `Comp` and `JobCx` go through `Ed` 180 | - So if we control for `Ed`, `cor(Comp,JobCx)` should be zero - we can test that! We'll do more of this later. 181 | 182 | ## Dagitty.net 183 | 184 | - These graphs can be drawn by hand 185 | - Or we can use a computer to help 186 | - We will be using dagitty.net to draw these graphs 187 | - (You can also draw them with R code - see the slides, but you won't need to know this) 188 | 189 | ## Dagitty.net 190 | 191 | - Go to dagitty.net and click on "Launch" 192 | - You will see an example of a causal diagram with nodes (variables) and arrows 193 | - Plus handy color-coding and symbols. Green triangle for exposure/treatment, and blue bar for outcome. 194 | - The green arrow is the "causal path" we'd like to *identify* 195 | 196 | ## Dagitty.net 197 | 198 | - Go to Model and New Model 199 | - Let's recreate our Education and Earnings diagram 200 | - Put in Education as the exposure and Earnings as the outcome (you can use longer variable names than we've used here) 201 | 202 | ## Dagitty.net 203 | 204 | - Double-click on blank space to add new variables. 205 | - Add all the variables we're interested in. 206 | 207 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 208 | set.seed(1000) 209 | dag <- dagify(Earn~Ed+Bgrd+Year+Loc+JobCx, 210 | Ed~Bgrd+Year+Loc+Comp, 211 | JobCx~Ed, 212 | Loc~U1, 213 | Bgrd~U1) %>% tidy_dagitty() 214 | ggdag(dag,node_size=20) 215 | ``` 216 | 217 | ## Dagitty.net 218 | 219 | - Then, double click on some variable `X`, then once on another variable `Y` to get an arrow for `X -> Y` 220 | - Fill in all our arrows! 
221 | 222 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 223 | set.seed(1000) 224 | dag <- dagify(Earn~Ed+Bgrd+Year+Loc+JobCx, 225 | Ed~Bgrd+Year+Loc+Comp, 226 | JobCx~Ed, 227 | Loc~U1, 228 | Bgrd~U1) %>% tidy_dagitty() 229 | ggdag(dag,node_size=20) 230 | ``` 231 | 232 | ## Dagitty.net 233 | 234 | - Feel free to move the variables around with drag-and-drop to make it look a bit nicer 235 | - You can even drag the arrows to make them curvy 236 | 237 | 238 | ## Dagitty.net 239 | 240 | - Next, we can classify some of our variables. 241 | - Hover over `U1` and tap the 'u' key to make Dagitty register it as unobservable 242 | 243 | ## Dagitty.net 244 | 245 | - Now, look at all that red! 246 | - Dagitty uses red to show you that a variable is going to cause problems for us! Like `tech` did last lecture 247 | - We can tell Dagitty that we plan to adjust/control for these variables in our analysis by hovering over them and hitting the 'a' key 248 | - When there's no red, we've identified the green arrow we're interested in! 249 | 250 | ## Dagitty.net 251 | 252 | - Notice that all Dagitty knows is what we told it about our model 253 | - And that was enough for it to tell us what we needed to adjust for to identify our effect! 254 | - That's not super complex computer magic, it's actually a fairly simple set of rules, which we'll go over next time. 255 | 256 | ## Practice 257 | 258 | - Consider the causal question "does a longer night's sleep extend your lifespan?" 259 | 260 | 1. List variables [think especially hard about what things might lead you to get a longer night's sleep] 261 | 2. Simplify 262 | 3. Arrows 263 | 264 | - Then, when you're done, draw it in Dagitty.net! 265 | -------------------------------------------------------------------------------- /Lectures/Lecture_16_Back_Doors.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 16 Back Doors" 3 | author: "Nick Huntington-Klein" 4 | date: "March 3, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE) 19 | library(tidyverse) 20 | library(dagitty) 21 | library(ggdag) 22 | library(gganimate) 23 | library(ggthemes) 24 | library(Cairo) 25 | theme_set(theme_gray(base_size = 15)) 26 | ``` 27 | 28 | ## Recap 29 | 30 | - We've now covered how to create causal diagrams 31 | - (aka Directed Acyclic Graphs, if you're curious what "dag"itty means) 32 | - We simply write out the list of the important variables, and draw causal arrows indicating what causes what 33 | - This allows us to figure out what we need to do to *identify* our effect of interest 34 | 35 | ## Today 36 | 37 | - But HOW? How does it know? 38 | - Today we'll be covering the *process* that lets you figure out whether you can identify your effect of interest, and how 39 | - It turns out, once we have our diagram, to be pretty straightforward 40 | - So easy a computer can do it! 
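
## So Easy, In Fact...

- ...that the `dagitty` package we've been loading can do it in R - here's a sketch using last lecture's tech example (`adjustmentSets()` is the dagitty function that reports what to control for)

```{r, echo=TRUE}
#Last lecture's diagram: tech affects IP.spend and profit, IP.spend affects profit
techdag <- dagify(IP.spend~tech,
                  profit~tech+IP.spend,
                  exposure='IP.spend',
                  outcome='profit')
#What do we need to adjust for to identify IP.spend -> profit?
adjustmentSets(techdag)
```

- It says `{ tech }` - adjust for `tech`, exactly what we did!
- By the end of today, you'll know how it figured that out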
41 | 42 | ## The Back Door and the Front Door 43 | 44 | - The basic way we're going to be thinking about this is with a metaphor 45 | - When you do data analysis, it's like observing that someone left their house for the day 46 | - When you do causal inference, it's like asking *how they left their house* 47 | - You want to *make sure* that they came out the *front door*, and not out the back door, not out the window, not out the chimney 48 | 49 | ## The Back Door and the Front Door 50 | 51 | - Let's go back to this example 52 | 53 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4.5} 54 | dag <- dagify(IP.sp~tech, 55 | profit~tech+IP.sp) %>% tidy_dagitty() 56 | ggdag(dag,node_size=20) 57 | ``` 58 | 59 | ## The Back Door and the Front Door 60 | 61 | - We're interested in the effect of IP spend on profits. That means that our *front door* is the ways in which IP spend *causally affects* profits 62 | - Our *back door* is any other thing that might drive a correlation between the two - the way that tech affects both 63 | 64 | ## Paths 65 | 66 | - In order to formalize this a little more, we need to think about the various *paths* 67 | - We observe that you got out of your house, but we want to know the paths you might have walked to get there 68 | - So, what are the paths we can walk to get from IP.spend to profits? 69 | 70 | ## Paths 71 | 72 | - We can go `IP.spend -> profit` 73 | - Or `IP.spend <- tech -> profit` 74 | 75 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=5, fig.height=4.5} 76 | dag <- dagify(IP.sp~tech, 77 | profit~tech+IP.sp) %>% tidy_dagitty() 78 | ggdag(dag,node_size=20) 79 | ``` 80 | 81 | ## The Back Door and the Front Door 82 | 83 | - One of these paths is the one we're interested in! 84 | - `IP.spend -> profit` is a *front door path* 85 | - One of them is not! 86 | - `IP.spend <- tech -> profit` is a *back door path* 87 | 88 | ## Now what? 89 | 90 | - Now, it's pretty simple! 91 | - In order to make sure you came through the front door... 92 | - We must *close the back door* 93 | - We can do this by *controlling/adjusting* for things that will block that door! 94 | - We can close `IP.spend <- tech -> profit` by adjusting for `tech` 95 | 96 | ## So? 97 | 98 | - We already knew that we could get our desired effect in this case by controlling for `tech`. 99 | - But this process lets us figure out what we need to do in a *much wider range of situations* 100 | - All we need to do is follow the steps! 101 | - List all the paths 102 | - See which are back doors 103 | - Adjust for a set of variables that closes all the back doors! 104 | 105 | ## Example 106 | 107 | - How does wine affect your lifespan? 108 | 109 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 110 | dag <- dagify(life~wine+drugs+health+income, 111 | drugs~wine, 112 | wine~health+income, 113 | health~U1, 114 | income~U1, 115 | coords=list( 116 | x=c(life=5,wine=2,drugs=3.5,health=3,income=4,U1=3.5), 117 | y=c(life=3,wine=3,drugs=2,health=4,income=4,U1=4.5) 118 | )) %>% tidy_dagitty() 119 | ggdag(dag,node_size=20) 120 | ``` 121 | 122 | ## Paths 123 | 124 | - Paths from `wine` to `life`: 125 | - `wine -> life` 126 | - `wine -> drugs -> life` 127 | - `wine <- health -> life` 128 | - `wine <- income -> life` 129 | - `wine <- health <- U1 -> income -> life` 130 | - `wine <- income <- U1 -> health -> life` 131 | - Don't leave any out, even the ones that seem redundant!
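
## Paths

- Worried you missed one? The `dagitty` function `paths()` will list them for you - a sketch, rebuilding the same wine diagram (without the plotting coordinates):

```{r, echo=TRUE}
winedag <- dagify(life~wine+drugs+health+income,
                  drugs~wine,
                  wine~health+income,
                  health~U1,
                  income~U1)
#Every path from wine to life - we should see all six from the last slide
paths(winedag, from='wine', to='life')$paths
```

- Handy for checking your work on bigger diagrams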
132 | 133 | ## Paths 134 | 135 | - Front doors/Back doors 136 | - `wine -> life` 137 | - `wine -> drugs -> life` 138 | - `wine <- health -> life` 139 | - `wine <- income -> life` 140 | - `wine <- health <- U1 -> income -> life` 141 | - `wine <- income <- U1 -> health -> life` 142 | 143 | ## Adjusting 144 | 145 | - By adjusting for variables we close these back doors 146 | - If an adjusted variable appears anywhere along the path, we can close that path off 147 | - Once *ALL* the back door paths are closed, we have blocked all the other ways that a correlation COULD appear except through the front door! We've identified the causal effect! 148 | - This is "the back door method" for identifying the effect. There are other methods; we'll get to them. 149 | 150 | ## Adjusting for Health 151 | 152 | - Front doors/Open back doors/Closed back doors 153 | - `wine -> life` 154 | - `wine -> drugs -> life` 155 | - `wine <- health -> life` 156 | - `wine <- income -> life` 157 | - `wine <- health <- U1 -> income -> life` 158 | - `wine <- income <- U1 -> health -> life` 159 | 160 | ## Adjusting for Health 161 | 162 | - Clearly, adjusting for health isn't ENOUGH to identify 163 | - We need to adjust for health AND income 164 | - We haven't covered how to actually control for multiple variables 165 | - We won't be focusing on it, but it's important for us to be able to know what *needs* to be controlled for 166 | 167 | ## Adjusting for Health and Income 168 | 169 | - Front doors/Open back doors/Closed back doors 170 | - `wine -> life` 171 | - `wine -> drugs -> life` 172 | - `wine <- health -> life` 173 | - `wine <- income -> life` 174 | - `wine <- health <- U1 -> income -> life` 175 | - `wine <- income <- U1 -> health -> life` 176 | 177 | 178 | ## How about Drugs? 179 | 180 | - Should we adjust for drugs? 181 | - No! This whole procedure makes that clear 182 | - It's on a *front door path* 183 | - If we adjusted for that, that's shutting out part of the way that `wine` *DOES* affect `life` 184 | 185 | ## The Front Door 186 | 187 | - In fact, remember, our real goal isn't necessarily to close the back doors 188 | - It's to make sure you came through the front door! 189 | - Sometimes (rarely), we can actually isolate the front door ourselves 190 | 191 | ## The Front Door 192 | 193 | - Imagine this version 194 | 195 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 196 | dag <- dagify(life~drugs+health+income, 197 | drugs~wine, 198 | wine~health+income, 199 | health~U1, 200 | income~U1, 201 | coords=list( 202 | x=c(life=5,wine=2,drugs=3.5,health=3,income=4,U1=3.5), 203 | y=c(life=3,wine=3,drugs=2,health=4,income=4,U1=4.5) 204 | )) %>% tidy_dagitty() 205 | ggdag(dag,node_size=20) 206 | ``` 207 | 208 | ## The Front Door 209 | 210 | - This makes it real clear that you shouldn't control for drugs - that shuts the FRONT door! There's no way to get out of your house EXCEPT through the back door! 211 | - Note in this case that there's no back door from `wine` to `drugs` 212 | - And if we control for `wine`, no back door from `drugs` to `life` (let's check this by hand) 213 | - So we can identify `wine -> drugs` and we can identify `drugs -> life`, and combine them to get `wine -> life`! 214 | 215 | ## The Front Door 216 | 217 | - This is called the "front door method" 218 | - Much less common than the back door method, but actually older 219 | - So we'll only cover it briefly 220 | - Historical relevance! 
This is similar to how they proved that cigarettes cause cancer 221 | 222 | ## Cigarettes and Cancer 223 | 224 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 225 | dag <- dagify(tar~cigs, 226 | cigs~health+income, 227 | cancer~tar+health+income, 228 | coords=list( 229 | x=c(cigs=1,tar=2,health=2,income=2,cancer=3), 230 | y=c(cigs=2,tar=2,health=1,income=3,cancer=2) 231 | )) %>% tidy_dagitty() 232 | ggdag(dag,node_size=20) 233 | ``` 234 | 235 | ## Paths 236 | 237 | - Front doors/Back doors 238 | - `cigs -> tar -> cancer` 239 | - `cigs <- income -> cancer` 240 | - `cigs <- health -> cancer` 241 | 242 | ## Paths 243 | 244 | - Closing these back doors is the problem that epidemiologists faced 245 | - They can't just run an experiment! 246 | - Problem: there are actually MANY back doors we're not listing 247 | - And sometimes we can't observe/control for these things 248 | - How can you possibly measure "health" well enough to actually control for it? 249 | 250 | ## The Front Door Method 251 | 252 | - So, noting that there's no back door from `cigs` to `tar`, and then controlling for `cigs` no back door from `tar` to `cancer`, they combined these two effects to get the causal effect of `cigs` on `cancer` 253 | - This is how we established this causal effect! 254 | - Doctors had a decent idea that cigs caused cancer before, but some doctors disagreed 255 | - And they had good reason to disagree! The back doors were VERY plausible reasons to see a `cigs`/`cancer` correlation other than the front door 256 | 257 | ## Practice 258 | 259 | - We want to know how `X` affects `Y`. Find all paths and make a list of what to adjust for to close all back doors 260 | 261 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 262 | dag <- dagify(Y~X+A+B+C+E, 263 | X~A+B+D, 264 | E~X, 265 | A~U1+C, 266 | B~U1, 267 | coords=list( 268 | x=c(X=1,E=2.5,A=2,B=3.5,C=1.5,D=1,Y=4,U1=2.5), 269 | y=c(X=2,E=2.25,A=3,B=3,C=4,D=3,Y=2,U1=4) 270 | )) %>% tidy_dagitty() 271 | ggdag(dag,node_size=20) 272 | ``` 273 | 274 | ## Practice Answers 275 | 276 | - Front door paths: `X -> Y`, `X -> E -> Y` 277 | - Back doors: `X <- A -> Y`, `X <- B -> Y`, `X <- A <- U1 -> B -> Y`, `X <- B <- U1 -> A -> Y`, `X <- A <- C -> Y`, `X <- B <- U1 -> A <- C -> Y` 278 | - (that last back door is actually pre-closed, we'll get to that later) 279 | - We can close all back doors by adjusting for `A` and `B`.
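
## Practice Answers

- And as a check on that answer, here's a sketch that rebuilds the practice diagram (without the plotting coordinates) and asks `adjustmentSets()` from the `dagitty` package:

```{r, echo=TRUE}
practicedag <- dagify(Y~X+A+B+C+E,
                      X~A+B+D,
                      E~X,
                      A~U1+C,
                      B~U1,
                      exposure='X',
                      outcome='Y')
#Should report { A, B }
adjustmentSets(practicedag)
```

- It agrees: adjusting for `A` and `B` closes every back door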
-------------------------------------------------------------------------------- /Lectures/Lecture_17_Article_Traffic.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_17_Article_Traffic.pdf -------------------------------------------------------------------------------- /Lectures/Lecture_17_Causal_Diagrams_Practice.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 17 Causal Diagram Practice" 3 | author: "Nick Huntington-Klein" 4 | date: "March 3, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE) 19 | library(tidyverse) 20 | library(dagitty) 21 | library(ggdag) 22 | library(gganimate) 23 | library(ggthemes) 24 | library(Cairo) 25 | theme_set(theme_gray(base_size = 15)) 26 | ``` 27 | 28 | ## Recap 29 | 30 | - To make a diagram: 31 | - List all the relevant variables 32 | - Combine identical variables, eliminate unimportant ones 33 | - Draw arrows to indicate what you think causes what 34 | - (See if the model implies that any relationships SHOULDN'T exist and test that) 35 | - Think carefully! 36 | 37 | ## Recap 38 | 39 | - If we're interested in the effect of `X` on `Y`: 40 | - Write the list of all paths from `X` to `Y` 41 | - Figure out which are front-door paths (going from `X` to `Y`) 42 | - and which are back-door paths (other ways) 43 | - Then figure out what set of variables need to be controlled/adjusted for to close those back doors 44 | 45 | ## Testing Relationships 46 | 47 | - Just a little more detail on this "testing relationships" thing 48 | - Our use of front-door and back-door paths means that we can look at *any two variables* in our diagram and say "hmm, if I control for A, B, and C, then that closes all front and back door paths between D and E" 49 | - So, if we control for A, B, and C, then D and E should be unrelated! 50 | - If `cor(D,E)` controlling for A, B, C is big, our diagram is probably wrong! 51 | 52 | ## Testing Relationships 53 | 54 | - What are some relationships we can test? 55 | 56 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=7, fig.height=5} 57 | dag <- dagify(B~A+D, 58 | C~B+D, 59 | E~C) %>% tidy_dagitty() 60 | ggdag(dag,node_size=20) 61 | ``` 62 | 63 | ## Testing Relationships 64 | 65 | - We should get no relationship between: A and E controlling for C, B and E controlling for C, D and E controlling for C 66 | - (also D and A controlling for nothing, but we haven't gotten to why that one works yet, and A and C controlling for B and D, but we haven't covered why we need D there) 67 | - We'll be looking out for opportunities to test our models as we move forward! (note: dagitty will give us a list of what we can test!) 68 | 69 | 70 | ## Today 71 | 72 | - In groups, read [the assigned article](https://www.nytimes.com/2019/01/21/upshot/stuck-and-stressed-the-health-costs-of-traffic.html) 73 | - Pick one of the causal claims in the article (there are a lot!) 
74 | - [Hint: words like "improve" "affect" "reduces", ask if you're not sure] 75 | - Draw a diagram to investigate that causal question 76 | - Determine what needs to be controlled for to identify that effect 77 | - If there's a linked study to explain the claim, try to look at it and see if they use the appropriate controls 78 | - Extra time? Do another claim 79 | 80 | ## Causal Inference in the News 81 | 82 | - Not long: 1-2 pages single-spaced plus diagram. 83 | - Find a news article that makes a causal claim (like the one we just did, but not that one) and interpret that claim by drawing an appropriate diagram in dagitty 84 | - Doesn't necessarily need to be a claim backed by a study or evidence, but it makes the assignment easier if it is 85 | - Justify your diagram (both your choice of variables and your arrows) 86 | - Explain how you would identify the causal claim, and discuss whether you think the article did so or not. -------------------------------------------------------------------------------- /Lectures/Lecture_22_Flammer.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_22_Flammer.PNG -------------------------------------------------------------------------------- /Lectures/Lecture_24_AutorDornHanson.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Lecture_24_AutorDornHanson.PNG -------------------------------------------------------------------------------- /Lectures/Lecture_25_Instrumental_Variables_in_Action.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 25 Instrumental Variables in Action" 3 | author: "Nick Huntington-Klein" 4 | date: "March 28, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE) 19 | library(tidyverse) 20 | library(dagitty) 21 | library(ggdag) 22 | library(gganimate) 23 | library(ggthemes) 24 | library(Cairo) 25 | theme_set(theme_gray(base_size = 15)) 26 | ``` 27 | 28 | ## Causal Inference Midterm 29 | 30 | - One week from today 31 | - Similar format to the homeworks we've been having 32 | - At least one question evaluating a research question and drawing a dagitty graph 33 | - At least one question identifying the right causal inference method to use 34 | - At least one question about the feature(s) of the methods 35 | - At least one question carrying out a method in R 36 | 37 | ## Causal Inference Midterm 38 | 39 | - Covers everything up to today (obviously, a focus on things since the Programming Midterm, but there is a little programming) 40 | - No internet (except dagitty) or slides available this time 41 | - One 3x5 index card, front and back 42 | - You'll have the whole class period so don't be late! 
43 | 44 | ## Recap 45 | 46 | - Instrumental Variables is sort of like the opposite of controlling for a variable 47 | - You isolate *just* the parts of `X` and `Y` that you can explain with the IV `Z` 48 | - If `Z` is related to `X` but all effects of `Z` on `Y` go THROUGH `X`, you've isolated a causal effect of `X` on `Y` by isolating just the causal part of `X` and ignoring all the back doors! 49 | - If `Z` is binary, get difference in `Y` divided by difference in `X` 50 | - If not, get correlation between explained `Y` and explained `X` 51 | 52 | ## Recap 53 | 54 | - In macroeconomics, how does US income affect US expenditures ("marginal propensity to consume")? 55 | - We can instrument with investment from LAST year. 56 | 57 | ```{r, echo=TRUE} 58 | library(AER) 59 | #US income and consumption data 1950-1993 60 | data(USConsump1993) 61 | USC93 <- as.data.frame(USConsump1993) 62 | 63 | #lag() gets the observation above; here the observation above is last year 64 | IV <- USC93 %>% mutate(lastyr.invest = lag(income) - lag(expenditure)) %>% 65 | group_by(cut(lastyr.invest,breaks=10)) %>% 66 | summarize(exp = mean(expenditure),inc = mean(income)) 67 | cor(IV$exp,IV$inc) 68 | ``` 69 | 70 | ## Today 71 | 72 | - We're going to be looking at several implementations of IV in real studies 73 | - We'll be looking at what they did and also asking ourselves what their causal diagrams might be 74 | 75 | ## Today 76 | - And whether we believe them! What would the diagram be for that expenditure/income example? Do we believe that there's really no back door from last year's investment to this year's expenditure? Really? 77 | - Every identification implies a diagram... and diagrams come from assumptions. We *always* want to think about whether we believe those assumptions 78 | - Remember, in any case, each of these is *just one study*. I could cite you equally plausible studies on these topics that found different findings in different contexts 79 | 80 | ## College Remediation 81 | 82 | - In most colleges, if you come unprepared to take first-year courses, you must first take remedial courses 83 | - Do these classes help you persist in college? 84 | - On one hand they can help ease you into the first-year courses 85 | - On the other hand you might get discouraged and drop out 86 | 87 | ## College Remediation 88 | 89 | - What's our diagram? Include `Rem`ediation, `Pers`istence, other things. 90 | - Keep in mind that many of the things that would cause you to take remediation in the first place are the same things that might lead you to drop out (difficulty with material, dislike for school, etc.) 
91 | - Sketch out a diagram 92 | 93 | ## College Remediation 94 | 95 | - Bettinger & Long (2009) use data from Ohio and notice that the policy determining who goes to remediation varies from college to college 96 | - And also, as is generally well-known, people tend to go to the college closest to them 97 | - So for each student and each college, they calculate whether that student would be in remediation at that college 98 | - And use "Would be in remediation at your closest college" (`RemC`) as an instrument for "Actually in remediation" (`RemA`) 99 | 100 | ## College Remediation 101 | 102 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=6} 103 | dag <- dagify(RemA~RemC+A+B+C, 104 | Pers~RemA+A+B+C, 105 | coords=list( 106 | x=c(RemA=1,RemC=0,A=1.5,B=2,C=2.5,Pers=3), 107 | y=c(RemA=1,RemC=1,A=2,B=2,C=2,Pers=1) 108 | )) %>% tidy_dagitty() 109 | ggdag(dag,node_size=20) 110 | ``` 111 | 112 | ## College Remediation 113 | 114 | - Bettinger & Long find positive effects of college remediation on persistence 115 | - But do we believe this IV? 116 | - Let's think - any possible other ways from `RemC` to `Pers`? 117 | - Keep in mind, `RemC` is based on the `loc`ation where you live, `loc -> RemC`. What else might be related to where you live? 118 | - How could we test the diagram? 119 | 120 | ## Medical Care Expenditures 121 | 122 | - One part of the health care debate is how much health care should be paid for by the person *using it* and how much should be paid for by *society* 123 | - One concern of taking the burden off of the user is that people might use way more medical care than they need if they're not paying for it 124 | - How does *the price you pay* for health care affect *how much you use it*? 125 | 126 | ## Medical Care Expenditures 127 | 128 | - So how does `price` affect `use`? 129 | - Something to keep in mind is that because of many varied insurance and social safety net programs, the `price` for the same procedure varies wildly between people 130 | - And might be affected by `inc`ome, `empl`oyment, what else? 131 | - Draw a diagram! 132 | - Before I show it, can you think of an instrument? 133 | 134 | ## Medical Care Expenditures 135 | 136 | - Kowalski (2016) notices that in many family insurance plans, if your family member is injured, the cost-sharing in the plan means that if *you* then get injured, you'll pay less for your care 137 | - So `fam`ily injury is an instrument for price 138 | 139 | ## Medical Care Expenditures 140 | 141 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=6} 142 | dag <- dagify(price~fam+A+B+C, 143 | use~price+A+B+C, 144 | coords=list( 145 | x=c(price=1,fam=0,A=1.5,B=2,C=2.5,use=3), 146 | y=c(price=1,fam=1,A=2,B=2,C=2,use=1) 147 | )) %>% tidy_dagitty() 148 | ggdag(dag,node_size=20) 149 | ``` 150 | 151 | ## Medical Care Expenditures 152 | 153 | - Kowalski (2016) finds that a 10% price reduction increases use by 7-11%. 154 | - So do we believe this one? 155 | - Can you imagine any ways in which a family member's use of medical care might affect your use of medical care except through the price you face? 156 | - Let's consider some that we might be able to control for (and she does) and some we might not 157 | - Any back doors we can imagine? What could we test?
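
## Medical Care Expenditures

- Just to see the IV mechanics at work, here's a simulation in the style of our earlier ones (every number here is invented for illustration; `U` stands in for all the unobserved back doors)

```{r, echo=TRUE}
famdf <- tibble(U = rnorm(5000),
                fam = sample(0:1,5000,replace=TRUE)) %>%
  mutate(price = 10 - 2*fam + U + rnorm(5000),
         use = 50 - 1.5*price - 3*U + rnorm(5000))
#fam is binary, so: difference in use divided by difference in price
famIV <- famdf %>% group_by(fam) %>%
  summarize(use = mean(use), price = mean(price))
(famIV$use[2]-famIV$use[1])/(famIV$price[2]-famIV$price[1])
```

- We recover (roughly) the true effect of -1.5, even though `U` is an open back door we never controlled for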
158 | 159 | ## Stock Market Indexing 160 | 161 | - The Russell 1000 stock index indexes the top 1000 largest firms by market `cap`italization, and the Russell 2000 indexes the next top 2000 162 | - Both indices are value-weighted, so "big fish-small pond" stocks at the top of the 2000 have more money coming to them than "small fish-big pond" stocks at the bottom of the 1000 163 | - Even though the price shouldn't be affected by something as immaterial as what stocks you're being compared to... maybe it is! 164 | 165 | ## Stock Market Indexing 166 | 167 | - Does *where you're listed* affect your `price`? 168 | - Draw that diagram! 169 | - Keep in mind that your listing is based on your `cap` - just big enough for the 1000 and you're on the 1000, not quite there and you're on the `R2000` 170 | - All sorts of firm qualities may affect your `cap` and also your `price` 171 | - Any guesses as to what might be a good instrument? 172 | 173 | ## Stock Market Indexing 174 | 175 | - Sounds like a regression discontinuity, not an IV! What gives? 176 | - It's BOTH! This is called "fuzzy RD" 177 | - When the regression discontinuity isn't perfect - crossing the cutoff takes you from, say, 40% treated to 60% rather than 0% to 100% - it's sort of like the experiments with imperfect assignment we covered 178 | - And the fix is the same - an IV! Being `above` is an IV for being listed on `R2000` 179 | 180 | ## Stock Market Indexing 181 | 182 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=6} 183 | dag <- dagify(R2000~above+A+B+C, 184 | price~R2000+cap+A+B+C, 185 | above~cap, 186 | cap~A+B+C, 187 | coords=list( 188 | x=c(R2000=1,above=0,A=1.5,B=2,C=2.5,price=3,cap=0), 189 | y=c(R2000=1,above=1,A=2,B=2,C=2,price=1,cap=1.5) 190 | )) %>% tidy_dagitty() 191 | ggdag(dag,node_size=20) 192 | ``` 193 | 194 | ## Company Tax Incentives 195 | 196 | - It's common for localities to offer tax incentives to bring companies to town 197 | - Economists tend not to like this, as that company probably would have been just as productive elsewhere, so it's a giveaway to them 198 | - But, selfishly, it's nice if that company is producing in *your* city rather than elsewhere, right? Jobs! 199 | - What is the impact of "Empowerment Zone (`EZ`)" tax incentives on `emp`loyment? Draw the diagram! 200 | 201 | ## Company Tax Incentives 202 | 203 | - This is looking pretty familiar by now 204 | 205 | ```{r, dev='CairoPNG', echo=FALSE, fig.width=6,fig.height=5.5} 206 | dag <- dagify(EZ~comm+A+B+C, 207 | emp~EZ+A+B+C+rep, 208 | comm~rep, 209 | coords=list( 210 | x=c(EZ=1,comm=0,A=1.5,B=2,C=2.5,emp=3,rep=0), 211 | y=c(EZ=1,comm=1,A=2,B=2,C=2,emp=1,rep=2) 212 | )) %>% tidy_dagitty() 213 | ggdag(dag,node_size=20) 214 | ``` 215 | 216 | ## Company Tax Incentives 217 | 218 | - Hanson (2009) uses *the political power of your congressperson* as an IV for `EZ` 219 | - Did your representative make it onto a powerful `comm`ittee, increasing the chances of getting an EZ for their district? IV! (controlling for the `rep`, we're looking before/after here, like diff-in-diff) 220 | - Do we believe this? Any potential problems? What could we test?
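
## Company Tax Incentives

- The `dagitty` package can even hunt for instruments for us with its `instrumentalVariables()` function - a sketch, rebuilding the diagram above (without the plotting coordinates):

```{r, echo=TRUE}
ezdag <- dagify(EZ~comm+A+B+C,
                emp~EZ+A+B+C+rep,
                comm~rep,
                exposure='EZ',
                outcome='emp')
instrumentalVariables(ezdag)
```

- It should report `comm | rep`: `comm` works as an instrument *conditional on* controlling for `rep`, which is just what Hanson does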
221 | 222 | ## Others 223 | 224 | - Try to think of a decent IV 225 | - For *any* causal question 226 | - Remember, you must have: 227 | - `Z` is related to `X` 228 | - All back doors from `Z` to `Y` must be closed 229 | - All front doors from `Z` to `Y` must go through `X` -------------------------------------------------------------------------------- /Lectures/Lecture_27_Explaining_Better_Regression.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 27 Explaining Better - Regression" 3 | author: "Nick Huntington-Klein" 4 | date: "April 3, 2019" 5 | output: 6 | revealjs::revealjs_presentation: 7 | theme: solarized 8 | transition: slide 9 | self_contained: true 10 | smart: true 11 | fig_caption: true 12 | reveal_options: 13 | slideNumber: true 14 | 15 | --- 16 | 17 | ```{r setup, include=FALSE} 18 | knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE) 19 | library(tidyverse) 20 | library(dagitty) 21 | library(ggdag) 22 | library(ggthemes) 23 | library(Cairo) 24 | theme_set(theme_gray(base_size = 15)) 25 | ``` 26 | 27 | ## Final Exam 28 | 29 | - Combination of both programming and causal inference methods 30 | - Everything is fair game 31 | - There will also be a subjective question in which you take a causal question, develop a diagram, and perform the analysis with data I give you 32 | - Slides and dagitty will be available, no other internet 33 | 34 | ## This Week 35 | 36 | - We'll be doing a little bit of review this last week 37 | - And also talking about other ways of *explaining* data beyond what we've done 38 | - This last week of material *will not be on the final* but it will be great prep for any upcoming class you take on this, or if you want to apply the ideas you've learned in class in the real world 39 | 40 | ## Explaining Better 41 | 42 | - So far, all of our methods have had to do with *explaining* one variable with another 43 | - After all, causal inference is all about looking at the effect of one variable on another 44 | - If that explanation is causally identified, we're good to go 45 | - Or, if that explanation is on a back door of what we're interested in, we'll explain what we can and take it out 46 | 47 | ## Explaining Better 48 | 49 | - The way that we've been explaining `A` with `B` so far: 50 | - Take the different values of `B` (if it's continuous, use bins with `cut()`) 51 | - For observations with each of those different values, take the mean of `A` 52 | - That mean is the "explained" part, the rest is the "residual" 53 | 54 | ## Explaining Better 55 | 56 | - Now, this is the basic idea of explaining - what value of `A` can we expect, given the value of `B` we're looking at? 57 | - But this isn't the only way to put that idea into action! 58 | 59 | ## Regression 60 | 61 | - The way that explaining is done most of the time is with a method called *regression* 62 | - You might be familiar with regression if you've taken ISDS 361A 63 | - But here we're going to go more into detail on what it actually is and how it relates to causal inference 64 | 65 | ## Regression 66 | 67 | - The idea of regression is the same as our approach to explaining - for different values of `B`, predict `A` 68 | - But what's different? 
In regression, you impose a little more structure on that prediction 69 | - Specifically, when `B` is continuous, you require that the relationship between `B` and `A` follows a *straight line* 70 | 71 | ## Regression 72 | 73 | - Let's look at wages and experience in Belgium 74 | 75 | ```{r, echo=TRUE} 76 | library(Ecdat) 77 | data(Bwages) 78 | 79 | #Explain wages with our normal method 80 | Bwages <- Bwages %>% group_by(cut(exper,breaks=8)) %>% 81 | mutate(wage.explained = mean(wage)) %>% ungroup() 82 | 83 | #Explain wages with regression 84 | #lm(wage~exper) regresses wage on exper, and predict() gets the explained values 85 | Bwages <- Bwages %>% 86 | mutate(wage.reg.explained = predict(lm(wage~exper))) 87 | 88 | #What's in a regression? An intercept and a slope! Like I said, it's a line. 89 | lm(wage~exper,data=Bwages) 90 | ``` 91 | 92 | ## Regression 93 | 94 | ```{r, echo=FALSE, fig.width=8, fig.height=6} 95 | ggplot(filter(Bwages,wage<20),aes(x=exper,y=wage,color='Raw'))+geom_point(alpha=.1)+ 96 | labs(x='Experience',y='Wage')+ 97 | geom_point(aes(x=exper,y=wage.explained,color='Our Method'))+ 98 | geom_point(aes(x=exper,y=wage.reg.explained,color='Regression'))+ 99 | scale_color_manual(values=c('Raw'='black','Our Method'='red','Regression'='blue')) 100 | ``` 101 | 102 | ## Regression 103 | 104 | - Okay, so it's the same thing but it's a straight line. Who cares? 105 | - Regression brings us some benefits, but also has some costs (there's always a tradeoff...) 106 | 107 | ## Regression Benefits 108 | 109 | - It boils down the relationship to be much simpler to explain 110 | - Instead of reporting eight different means, I can just give an intercept and a slope! 111 | 112 | ```{r, echo=FALSE} 113 | lm(wage~exper,data=Bwages) 114 | ``` 115 | 116 | - We can interpret this easily as "one more year of `exper` is associated with `r coef(lm(wage~exper,data=Bwages))[2]` higher wages" 117 | 118 | ## Regression Benefits 119 | 120 | - This makes it much easier to explain using multiple variables at once 121 | - This is important when we're doing causal inference. If we want to close multiple back doors, we have to control for multiple variables at once. This is unwieldy with our approach 122 | - With regression we just add another dimension to our line and add another slope! As many as we want 123 | 124 | ```{r, echo=FALSE} 125 | #Bwages appears to drop the sex variable in some versions? Let's make it up 126 | Bwages <- Bwages %>% 127 | mutate(male = runif(1472) + (wage > mean(wage))*.04 >= .5) 128 | ``` 129 | 130 | ```{r, echo=TRUE} 131 | lm(wage~exper+educ+male,data=Bwages) 132 | ``` 133 | 134 | ## Regression Benefits 135 | 136 | - The fact that we're using a line means that we can use much more of the data 137 | - This increases our statistical power, and also reduces overfitting (remember that?) 138 | - For example, with regression discontinuity, we've been only looking just to the left and right of the cutoff 139 | - But this doesn't take into account the information we have about the trend of the variable *leading up to* the cutoff. Doing regression discontinuity with regression can! 
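
## Regression Benefits

- As a quick sketch of how that works (made-up data for this slide; the names `test`, `above`, and `earn` just mirror our usual example), we can let the regression line *jump* at the cutoff:

```{r, echo=TRUE}
#Made-up data: the true jump at the cutoff is 10
fakerd <- tibble(test = runif(300)*100) %>%
  mutate(above = test >= 75,
         earn = 30 + test/2 + 10*above + 5*rnorm(300))
#Center test at the cutoff and let both the level and slope shift at 75
#The coefficient on aboveTRUE is the estimated jump - the RD effect
lm(earn ~ I(test-75)*above, data = fakerd)
```

- The `aboveTRUE` coefficient should come out near the true jump of 10, using *all* the data rather than just the points near the cutoff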
140 | 141 | ## Regression Benefits 142 | 143 | - Take this example from regression discontinuity 144 | 145 | ```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=5.5} 146 | set.seed(1000) 147 | rdd <- tibble(test = runif(300)*100) %>% 148 | mutate(GATE = test >= 75, 149 | above = test >= 75) %>% 150 | mutate(earn = runif(300)*40+10*GATE+test/2) 151 | 152 | ggplot(rdd,aes(x=test,y=earn,color=GATE))+geom_point()+ 153 | geom_vline(aes(xintercept=75),col='red')+ 154 | labs(x='Test Score', 155 | y='Earnings') 156 | ``` 157 | 158 | ## Regression Benefits 159 | 160 | - Look how much more information we can take advantage of with regression 161 | 162 | ```{r, echo=FALSE, eval=TRUE, fig.width=7, fig.height=5} 163 | rdd <- rdd %>% 164 | mutate(above = test >= 75, 165 | zeroedtest = test-75) 166 | 167 | rdmeans <- rdd %>% filter(between(test,73,77)) %>% 168 | group_by(above) %>% 169 | summarize(E = mean(earn)) 170 | 171 | 172 | ggplot(rdd,aes(x=test,y=earn,color='Raw'))+geom_point()+ 173 | geom_vline(aes(xintercept=75),col='blue')+ 174 | labs(x='Test Score', 175 | y='Earnings')+ 176 | geom_smooth(aes(color='Regression'),method='lm',se=FALSE,formula=y~x+I(x>=75)+x*I(x>=75))+ 177 | geom_segment(aes(x=73,xend=75,y=rdmeans$E[1],yend=rdmeans$E[1],color='Our Method'),size=2)+ 178 | geom_segment(aes(x=75,xend=77,y=rdmeans$E[2],yend=rdmeans$E[2],color='Our Method'),size=2)+ 179 | scale_color_manual(values=c('Raw'='black','Regression'='red','Our Method'='blue')) 180 | ``` 181 | 182 | ## Regression Cons 183 | 184 | - One con is that it masks what it's doing a bit - you have to learn its inner workings to really know how it operates, and what it's sensitive to 185 | - There are also some statistical situations in which it can give strange results 186 | - For example, one problem is that it fits a straight line 187 | - That means that if your relationship DOESN'T follow a straight line, you'll get bad results! 188 | 189 | ## Regression Cons 190 | 191 | ```{r echo=FALSE, fig.width=8, fig.height=6} 192 | quaddata <- tibble(x = rnorm(1000)) %>% 193 | mutate(y = x + x^2 + rnorm(1000)) %>% 194 | group_by(cut(x,breaks=10)) %>% 195 | mutate(y.exp = mean(y)) %>% 196 | ungroup() %>% 197 | mutate(y.exp.reg = predict(lm(y~x))) 198 | 199 | ggplot(quaddata,aes(y=y,x=x,color='Raw'))+geom_point(alpha=.2)+ 200 | geom_point(aes(x=x,y=y.exp,color='Our Method'))+ 201 | geom_point(aes(x=x,y=y.exp.reg,color='Regression'))+ 202 | scale_color_manual(values=c('Raw'='black','Our Method'='red','Regression'='blue')) 203 | ``` 204 | 205 | ## Regression Cons 206 | 207 | - That said, we're fitting a line, but it doesn't *necessarily* need to be a straight line. Let's add a quadratic term!
208 | 209 | ```{r, echo=TRUE} 210 | lm(y~x,data=quaddata) 211 | lm(y~x+I(x^2),data=quaddata) 212 | ``` 213 | 214 | ## Regression Cons 215 | 216 | ```{r echo=FALSE, fig.width=8, fig.height=6} 217 | quaddata <- quaddata %>% 218 | mutate(y.exp.reg = predict(lm(y~x+I(x^2)))) 219 | 220 | ggplot(quaddata,aes(y=y,x=x,color='Raw'))+geom_point(alpha=.2)+ 221 | geom_point(aes(x=x,y=y.exp,color='Our Method'))+ 222 | geom_point(aes(x=x,y=y.exp.reg,color='Quadratic Reg.'))+ 223 | scale_color_manual(values=c('Raw'='black','Our Method'='red','Quadratic Reg.'='blue')) 224 | ``` 225 | 226 | ## Regression 227 | 228 | - In fact, for these reasons, despite the cons, all the methods we've done so far (except matching) are commonly done with regression 229 | - We'll talk about this more next time 230 | - For now, let's do a simulation exercise with regression 231 | - I want to reiterate that regression *will not be on the final* 232 | - But this will let us practice our simulation skills, which we haven't done in a minute, and we may as well do it with regression 233 | - Any remaining time we'll do other final prep 234 | 235 | ## Simulation Practice 236 | 237 | - Let's compare regression and our method in their ability to pick up a *tiny* positive effect of `X` on `Y` 238 | - We'll be counting how often each of them finds that positive result 239 | - Create blank vectors `reg.pos <- c(NA)` and `our.pos <- c(NA)` to store results 240 | - Create a `for (i in 1:10000) {` loop 241 | - Then, inside that loop... 242 | 243 | ## Simulation Practice 244 | 245 | - Create a tibble `df` with `x = runif(1000)` and `y = .01*x + rnorm(1000)` 246 | - Perform regression and get the slope on `x`: `coef(lm(y~x,data=df))[2]`. Store `0` in `reg.pos[i]` if this is negative, and `1` if it's positive 247 | - Explain `y` using `x` with our method and `cut(x,breaks=2)`.
Use `summarize` and store the means in `our.df` 248 | - Store `0` in `our.pos[i]` if `our.df$y[2] - our.df$y[1]` is negative, and `1` if it's positive 249 | - See which method more often gets a positive result 250 | 251 | ## Simulation Practice Answers 252 | 253 | ```{r, echo=FALSE} 254 | set.seed(1234) 255 | ``` 256 | 257 | ```{r, echo=TRUE} 258 | reg.pos <- c(NA) 259 | our.pos <- c(NA) 260 | 261 | for (i in 1:10000) { 262 | df <- tibble(x = runif(1000)) %>% 263 | mutate(y = .01*x + rnorm(1000)) 264 | 265 | reg.pos[i] <- coef(lm(y~x,data=df))[2] >= 0 266 | 267 | our.df <- df %>% group_by(cut(x,breaks=2)) %>% 268 | summarize(y = mean(y)) 269 | 270 | our.pos[i] <- our.df$y[2] - our.df$y[1] > 0 271 | } 272 | 273 | mean(reg.pos) 274 | mean(our.pos) 275 | ``` -------------------------------------------------------------------------------- /Lectures/Mariel_Synthetic.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/Mariel_Synthetic.png -------------------------------------------------------------------------------- /Lectures/ad_spend_and_gdp.csv: -------------------------------------------------------------------------------- 1 | Year,AdSpending,AdSpendingNewsMagRadio,GDP,AdsasPctofGDP 2 | 1919,1930,733.00,78.30,2.50% 3 | 1920,2480,830.00,88.40,2.80% 4 | 1921,1930,940.00,73.60,2.60% 5 | 1922,2200,1017.00,73.40,3.00% 6 | 1923,2400,1101.00,85.40,2.80% 7 | 1924,2480,1188.00,87.00,2.90% 8 | 1925,2600,1281.00,90.60,2.90% 9 | 1926,2700,1355.00,97.00,2.80% 10 | 1927,2720,1432.00,95.50,2.80% 11 | 1928,2760,1493.00,97.40,2.80% 12 | 1929,2850,1560.00,103.60,2.80% 13 | 1930,2450,1378.00,91.20,2.70% 14 | 1931,2100,1220.00,76.50,2.70% 15 | 1932,1620,1000.00,58.70,2.80% 16 | 1933,1325,832.00,56.40,2.30% 17 | 1934,1650,935.00,66.00,2.50% 18 | 1935,1720,1065.00,73.30,2.30% 19 | 1936,1930,1191.00,83.80,2.30% 20 | 1937,2100,1305.00,91.90,2.30% 21 | 1938,1930,1182.00,86.10,2.20% 22 | 1939,2010,1232.00,92.20,2.20% 23 | 1940,2110,1311.00,101.40,2.10% 24 | 1941,2250,1400.00,126.70,1.80% 25 | 1942,2160,1352.00,161.90,1.30% 26 | 1943,2490,1639.00,198.60,1.30% 27 | 1944,2700,1790.00,219.80,1.20% 28 | 1945,2840,1923.00,223.10,1.30% 29 | 1946,3340,2262.00,222.30,1.50% 30 | 1947,4260,2723.00,244.20,1.70% 31 | 1948,4870,3091.00,269.20,1.80% 32 | 1949,5210,3301.00,267.30,1.90% 33 | 1950,5700,3633.00,293.80,1.90% 34 | 1951,6420,4080.00,339.30,1.90% 35 | 1952,7140,4552.00,358.30,2.00% 36 | 1953,7740,4942.00,379.40,2.00% 37 | 1954,8150,5161.00,380.40,2.10% 38 | 1955,9150,5866.00,414.80,2.20% 39 | 1956,9910,6342.00,437.50,2.30% 40 | 1957,10270,6588.00,461.10,2.20% 41 | 1958,10310,6509.00,467.20,2.20% 42 | 1959,11270,7183.00,506.60,2.20% 43 | 1960,11960,7585.00,526.40,2.30% 44 | 1961,11860,7510.00,544.70,2.20% 45 | 1962,12430,7896.00,585.60,2.10% 46 | 1963,13100,8284.00,617.70,2.10% 47 | 1964,14150,9018.00,663.60,2.10% 48 | 1965,15250,9761.00,719.10,2.10% 49 | 1966,16630,10734.00,787.80,2.10% 50 | 1967,16870,10887.00,832.60,2.00% 51 | 1968,18090,11718.00,910.00,2.00% 52 | 1969,19420,12723.00,984.60,2.00% 53 | 1970,19550,12702.00,1038.50,1.90% 54 | 1971,20700,13293.00,1127.10,1.80% 55 | 1972,23210,14921.00,1238.30,1.90% 56 | 1973,24980,16042.00,1382.70,1.80% 57 | 1974,26620,17009.00,1500.00,1.80% 58 | 1975,27900,17935.00,1638.30,1.70% 59 | 1976,33300,21579.00,1825.30,1.80% 60 | 1977,37440,24470.00,2030.90,1.80% 61 | 1978,43330,28322.00,2294.70,1.90% 62 | 1979,48780,31954.00,2563.30,1.90% 
63 | 1980,53570,34937.00,2789.50,1.90%
64 | 1981,60460,39167.00,3128.40,1.90%
65 | 1982,66670,42811.00,3255.00,2.00%
66 | 1983,76000,49057.00,3536.70,2.10%
67 | 1984,88010,56765.00,3933.20,2.20%
68 | 1985,94900,60663.00,4220.30,2.20%
69 | 1986,102370,65029.00,4462.80,2.30%
70 | 1987,110270,69141.00,4739.50,2.30%
71 | 1988,118750,74004.00,5103.80,2.30%
72 | 1989,124770,77841.00,5484.40,2.30%
73 | 1990,129968,79932.00,5803.10,2.20%
74 | 1991,128352,76997.00,5995.90,2.10%
75 | 1992,133750,80560.00,6337.70,2.10%
76 | 1993,140956,84570.00,6657.40,2.10%
77 | 1994,153024,92501.00,7072.20,2.20%
78 | 1995,162930,97622.00,7397.70,2.20%
79 | 1996,175230,105973.00,7816.90,2.20%
80 | 1997,187529,113221.00,8304.30,2.30%
81 | 1998,206697,123628.00,8747.00,2.40%
82 | 1999,222308,132151.00,9268.40,2.40%
83 | 2000,247472,145887.00,9817.00,2.50%
84 | 2001,231287,132296.00,10128.00,2.30%
85 | 2002,236875,136244.00,10470.00,2.30%
86 | 2003,245477,140128.00,10961.00,2.20%
87 | 2004,263766,150305.00,11713.00,2.30%
88 | 2005,271074,151939.00,12456.00,2.20%
89 | 2006,281653,155466.00,13178.00,2.10%
90 | 2007,279612,150023.00,13807.00,2.00%
91 | 
--------------------------------------------------------------------------------
/Lectures/mariel.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Lectures/mariel.RData
--------------------------------------------------------------------------------
/Papers/BettingerLong2009.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/BettingerLong2009.pdf
--------------------------------------------------------------------------------
/Papers/Flammer 2013 RDD Corporate Social responsibility.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/Flammer 2013 RDD Corporate Social responsibility.pdf
--------------------------------------------------------------------------------
/Papers/chang2014.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/chang2014.pdf
--------------------------------------------------------------------------------
/Papers/desai2009.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/desai2009.pdf
--------------------------------------------------------------------------------
/Papers/hanson2009.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/hanson2009.pdf
--------------------------------------------------------------------------------
/Papers/kowalski2016.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/kowalski2016.pdf
--------------------------------------------------------------------------------
/Papers/marvel_moody_sentencing.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Papers/marvel_moody_sentencing.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Economics, Causality, and Analytics
2 | =========
3 | 
4 | These are the course materials for the upcoming CSU Fullerton class ECON 305: Economics, Causality, and Analytics (minus the exams and homework; instructors, contact me directly if you'd like to see these).
5 | 
6 | The syllabus and writing assignment sheets are in the main folder.
7 | 
8 | Lectures can be found in the Lectures folder. These are in RMarkdown presentation format. Compiled HTML files are included as well, along with any other files necessary to render the completed slides; the compiled versions are easier to view fully rendered. Check [My Website](http://www.nickchk.com/econ305.html) for links.
9 | 
10 | The Papers folder includes the papers referenced in the lectures or in other parts of the class.
11 | 
12 | The Cheat Sheets folder contains cheat sheets downloaded from standard sources like RStudio, as well as a few of my own for things like dagitty, causal diagrams, and relationships between variables.
13 | 
14 | Code for the animations used in the lectures (or very nearly; it's been modified) can be found in the RMarkdown files, or at [this repo](https://github.com/NickCH-K/causalgraphs).
--------------------------------------------------------------------------------
/Research Design Assignment.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NickCH-K/introcausality/e03f75407d9727159052098b69e95989714ee726/Research Design Assignment.docx
--------------------------------------------------------------------------------