├── .github
│   └── workflows
│       └── render-rmd.yml
├── .gitignore
├── 01_intro_to_r
│   ├── img
│   │   └── r-interface-2020.png
│   ├── intro_to_r.Rmd
│   ├── intro_to_r.html
│   └── more
│       ├── arbuthnot-readme.txt
│       ├── arbuthnot.r
│       ├── present-readme.txt
│       ├── present-reference.pdf
│       └── present.R
├── 02_intro_to_data
│   ├── intro_to_data.Rmd
│   └── intro_to_data.html
├── 03_probability
│   ├── more
│   │   ├── calc_streak.R
│   │   ├── kobe-readme.txt
│   │   ├── kobe.RData
│   │   ├── kobe.csv
│   │   └── kobe_data.xls
│   ├── probability.Rmd
│   └── probability.html
├── 04_normal_distribution
│   ├── normal_distribution.Rmd
│   └── normal_distribution.html
├── 05a_sampling_distributions
│   ├── README.md
│   ├── more
│   │   ├── AmesHousing.csv
│   │   ├── AmesHousing.xls
│   │   ├── OLD_sampling_distributions.Rmd
│   │   ├── README.md
│   │   ├── ames-readme.txt
│   │   ├── ames.csv
│   │   └── ames_dataprep.R
│   ├── sampling_distributions.Rmd
│   └── www
│       └── lab.css
├── 05b_confidence_intervals
│   ├── README.md
│   ├── confidence_intervals.Rmd
│   └── confidence_intervals.html
├── 06_inf_for_categorical_data
│   ├── inf_for_categorical_data.Rmd
│   └── www
│       └── lab.css
├── 07_inf_for_numerical_data
│   ├── README.md
│   ├── inf_for_numerical_data.Rmd
│   └── inf_for_numerical_data.html
├── 08_simple_regression
│   ├── simple_regression.Rmd
│   └── simple_regression.html
├── 09_multiple_regression
│   ├── multiple_regression.Rmd
│   └── multiple_regression.html
├── CODE_OF_CONDUCT.md
├── LICENSE.md
├── README.md
├── _config.yml
├── lab.css
├── lab_source_style_guide.md
└── logo
    ├── logo-social.jpeg
    └── logo-square.png
/.github/workflows/render-rmd.yml:
--------------------------------------------------------------------------------
1 | on: push
2 |
3 | name: Render Rmd
4 |
5 | jobs:
6 |   render:
7 |     name: Render Rmd
8 |     runs-on: macOS-latest
9 |     steps:
10 |       - uses: actions/checkout@v2
11 |       - uses: r-lib/actions/setup-r@v1
12 |       - uses: r-lib/actions/setup-pandoc@v1
13 |       - name: Install rmarkdown and other required packages
14 |         run: |
15 |           install.packages("remotes")
16 |           remotes::install_cran(c("rmarkdown", "tidyverse", "openintro", "infer", "statsr", "GGally", "skimr"))
17 |         shell: Rscript {0}
18 |       - name: Render Lab 01
19 |         run: Rscript -e 'rmarkdown::render("01_intro_to_r/intro_to_r.Rmd")'
20 |       - name: Render Lab 02
21 |         run: Rscript -e 'rmarkdown::render("02_intro_to_data/intro_to_data.Rmd")'
22 |       - name: Render Lab 03
23 |         run: Rscript -e 'rmarkdown::render("03_probability/probability.Rmd")'
24 |       - name: Render Lab 04
25 |         run: Rscript -e 'rmarkdown::render("04_normal_distribution/normal_distribution.Rmd")'
26 |       - name: Render Lab 07
27 |         run: Rscript -e 'rmarkdown::render("07_inf_for_numerical_data/inf_for_numerical_data.Rmd")'
28 |       - name: Render Lab 08
29 |         run: Rscript -e 'rmarkdown::render("08_simple_regression/simple_regression.Rmd")'
30 |       - name: Render Lab 09
31 |         run: Rscript -e 'rmarkdown::render("09_multiple_regression/multiple_regression.Rmd")'
32 |       - name: Commit results
33 |         run: |
34 |           git config --local user.email "actions@github.com"
35 |           git config --local user.name "GitHub Actions"
36 |           git commit -a -m 'Re-build Rmd' || echo "No changes to commit"
37 |           git push origin || echo "No changes to push"
38 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .html
2 | .Rproj.user
3 | .Rhistory
4 | .RData
5 | *.Rproj
6 | .DS_Store
7 | news.html
8 | README.html
9 | *key.Rmd
10 | *key.html
11 | rsconnect/
12 | *.key
13 |
--------------------------------------------------------------------------------
/01_intro_to_r/img/r-interface-2020.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/OpenIntroStat/oilabs-tidy/5e5703a7ecf076f0de5f7d90b436977f0d6c4ae0/01_intro_to_r/img/r-interface-2020.png
--------------------------------------------------------------------------------
/01_intro_to_r/intro_to_r.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Introduction to R and RStudio"
3 | output:
4 | html_document:
5 | css: ../lab.css
6 | highlight: pygments
7 | theme: cerulean
8 | toc: true
9 | toc_float: true
10 | ---
11 |
12 | ```{r global_options, include = FALSE}
13 | knitr::opts_chunk$set(eval = TRUE, results = FALSE)
14 | library(tidyverse)
15 | library(openintro)
16 | ```
17 |
18 | ## The RStudio Interface
19 |
20 | The goal of this lab is to introduce you to R and RStudio, which you'll be using throughout the course both to learn statistical concepts and to analyze real data and come to informed conclusions. To clarify which is which: `R` is the name of the programming language itself and RStudio is a convenient interface for working with `R`.
21 |
22 | As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer! Before we get to that stage, however, you need to build some basic fluency in `R`. First, we will explore the fundamental building blocks of `R` and RStudio: the RStudio interface, reading in data, and basic commands for working with data in `R`.
23 |
24 | Go ahead and launch RStudio. You should see a window that looks like the image shown below.
25 |
26 | ```{r r-interface-2020, echo=FALSE, results="asis"}
27 | knitr::include_graphics("img/r-interface-2020.png")
28 | ```
29 |
30 | The panel on the lower left is where the action happens. This panel is called the *console*. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you're running. Below that information is the *prompt*, indicated by the `>` symbol. As its name suggests, this prompt is really a request: a request for a command. Initially, interacting with `R` is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.
31 |
32 | The panel in the upper right contains your *environment* as well as a history of the commands that you've previously entered.
33 |
34 | The panel in the lower right contains tabs for browsing the *files* in your project folder, accessing *help* files for `R` functions, installing and managing `R` *packages*, and inspecting visualizations. By default, all data visualizations you make will appear directly below the code you used to create them. If you would rather your plots appear in the *plots* tab, you will need to change your global options.
35 |
36 | ### R Packages
37 |
38 | `R` is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. For this lab, and many others in the future, we will use the following:
39 |
40 | - The **tidyverse** "umbrella" package: a suite of many different `R` packages for data wrangling and data visualization
41 | - The **openintro** `R` package: data sets and custom functions that accompany the OpenIntro resources
42 |
43 | In the lower right hand corner click on the *Packages* tab. Type the name of each of these packages (tidyverse, openintro) into the search box to see if they have been installed. If these packages do not appear when you type in their name, install them by copying and pasting or typing the following two lines of code into the console of your RStudio session. Be sure to press enter/return after each line of code.
44 |
45 | ```{r install-packages, message = FALSE, eval = FALSE}
46 | install.packages("tidyverse")
47 | install.packages("openintro")
48 | ```
49 |
50 | After pressing enter/return, a stream of text will begin, communicating the process `R` is going through to install the package from the location you selected when you installed `R`. If you were not prompted to select a server for downloading packages when you installed `R`, RStudio may prompt you to select a server from which to download; any of them will work.
51 |
52 | You only need to *install* packages once, but you need to *load* them each time you relaunch RStudio. We load packages with the `library` function. Copy and paste or type the following two lines in your console to load the tidyverse and openintro packages into your working environment.
53 |
54 | ```{r load-packages, message = FALSE}
55 | library(tidyverse)
56 | library(openintro)
57 | ```
58 |
59 | We are choosing to use the tidyverse package because it consists of a set of packages necessary for different aspects of working with data, anything from loading data to wrangling data to visualizing data to analyzing data. Additionally, these packages share common philosophies and are designed to work together. You can find more about the packages in the tidyverse at [tidyverse.org](http://tidyverse.org/).
60 |
61 | ### Creating a reproducible lab report
62 |
63 | We will be using R Markdown to create reproducible lab reports. See the following videos describing why and how:
64 |
65 | [**Why use R Markdown for Lab Reports?**](https://youtu.be/lNWVQ2oxNho)
66 |
67 | [**Using R Markdown for Lab Reports in RStudio**](https://youtu.be/o0h-eVABe9M)
68 |
69 | In a nutshell, in RStudio, go to New File -\> R Markdown... Then, choose "From Template" and then choose `Lab Report for OpenIntro Statistics Lab 1` from the list of templates.
70 |
71 | Going forward you should refrain from typing your code directly in the console, as this makes it very difficult to remember and reproduce the output you want to reference. Potentially the most important feature of R Markdown files is that they allow for us to nest our `R` code within a written report. In an R Markdown file, `R` code appears in gray boxes, which we call "code chunks." The R Markdown file knows that a gray box contains `R` code because it begins with three tick marks (\`\`\`), followed by two curly braces that contain a lowercase letter r ({r}). You've already seen this above!
72 |
73 | Instead of typing our `R` code into the console, we encourage you to type any code you produce (final correct answer, or anything you're just trying out) in the `R` code chunk associated with each problem. You can execute the `R` code you type in these code chunks similarly to how you typed code into the console and pressed enter/return. Within the code chunk there are two ways to execute a line of `R` code: (1) place your cursor on the line of code and press `Ctrl-Enter` or `Cmd-Enter` at the same time, or (2) place your cursor on the line and press the "Run" button in the upper right hand corner of the R Markdown file. Alternatively, if you wanted to run all of the `R` code in a given code chunk, you can click on the "Play" button in the upper right hand corner of the code chunk (green sideways triangle).
74 |
75 | If at any point you need to start over and run all of the code chunks before a specific code chunk, you can click on the "Fastforward" button in the upper right hand corner of that code chunk (gray upside down triangle with a bar below). This will run every code chunk that occurs *before* that code chunk, but *will not* execute the `R` code included in that code chunk.
76 |
77 | ## Dr. Arbuthnot's Baptism Records
78 |
79 | To get started, let's take a peek at the data.
80 |
81 | ```{r load-arbuthnot-data}
82 | arbuthnot
83 | ```
84 |
85 | Again, you can run the code above by:
86 |
87 | - placing your cursor on the line and pressing `Ctrl-Enter` or `Cmd-Enter`
88 | - placing your cursor on the line and pressing the "Run" button in the upper right hand corner of the R Markdown file, or
89 | - by clicking on the green arrow at the top right hand corner of the code chunk
90 |
91 | The single line of code included in this code chunk instructs `R` to load some data: the Arbuthnot baptism counts for boys and girls. You should see that the *Environment* tab in the upper right hand corner of the RStudio window now lists a data set called `arbuthnot` that has 82 observations on 3 variables. As you interact with `R`, you will create objects for a variety of purposes. Sometimes you load the objects into your workspace by loading a package, as we have done here, but sometimes you create objects yourself as a byproduct of a computation process, for an analysis you have performed, or for a visualization you have created.
92 |
93 | The Arbuthnot data set refers to the work of Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710. Once again, we can view the data by running the code below or by typing the name of the dataset into the console. Be careful about the spelling and capitalization you use! `R` is case sensitive, so if you accidentally type `Arbuthnot`, `R` will tell you that the object cannot be found.
94 |
95 | ```{r view-data}
96 | arbuthnot
97 | ```
98 |
99 | This command does display the data for us; however, printing the whole dataset in the console is not that useful. One advantage of RStudio is that it comes with a built-in data viewer. The *Environment* tab (in the upper right pane) lists the objects in your environment. Clicking on the name `arbuthnot` will open up a *Data Viewer* tab next to your R Markdown file, which provides an alternative display of the data set. This display should feel similar to viewing data in Excel, where you are able to scroll through the dataset to inspect it. However, unlike Excel, you **will not** be able to edit the data in this tab. Once you are done viewing the data, you can close this tab by clicking on the `x` in the upper left hand corner.
100 |
101 | When inspecting the data, you should see four columns of numbers and 82 rows. Each row represents a different year that Arbuthnot collected data. The first entry in each row is the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the data viewer window to examine the complete data set.
102 |
103 | Note that the row numbers in the first column are not part of Arbuthnot's data. `R` adds these row numbers as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison of the data to a spreadsheet will generally be helpful. `R` has stored Arbuthnot's data in an object similar to a spreadsheet or a table, which `R` calls a *data frame*.
104 |
105 | You can see the dimensions of this data frame as well as the names of the variables and the first few observations by inserting the name of the dataset into the `glimpse()` function, as seen below:
106 |
107 | ```{r glimpse-data}
108 | glimpse(arbuthnot)
109 | ```
110 |
111 | Although we previously said that it is best practice to type all of your `R` code into a code chunk, this command is an exception: it is better to type it into your console. Generally, you should type all of the code that is necessary for your solution into the code chunk. Because this command is used to explore the data, it is not necessary for your solution code and **should not** be included in your solution file.
112 |
113 | This command should output the following:
114 |
115 | ```{r glimpse-data-result, echo=FALSE, results = TRUE}
116 | glimpse(arbuthnot)
117 | ```
118 |
119 | We can see that there are 82 observations and 3 variables in this dataset. The variable names are `year`, `boys`, and `girls`. At this point, you might notice that many of the commands in `R` look a lot like functions from math class; that is, invoking `R` commands means supplying a function with some number of inputs (what are called arguments) which the function uses to produce an output. The `glimpse()` command, for example, took a single argument, the name of a data frame, and produced a display of the dataset as an output.
120 |
121 | ## Some Exploration
122 |
123 | Let's start to examine the data a little more closely. We can access the data in a single column of a data frame by extracting the column with a `$`. For example, the code below extracts the `boys` column from the `arbuthnot` data frame.
124 |
125 | ```{r view-boys}
126 | arbuthnot$boys
127 | ```
128 |
129 | This command will only show the number of boys baptized each year. `R` interprets the `$` as saying "go to the data frame that comes before me, and find the variable that comes after me."
130 |
131 | 1. What command would you use to extract just the counts of girls baptized? Try it out in the console!
132 |
133 | Notice that the way `R` has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data have been extracted from the data frame, so they are no longer structured in a table with other variables. Instead, these data are displayed one right after another. Objects that print out in this way are called *vectors*; similar to the vectors you have seen in mathematics courses, vectors represent a list of numbers. `R` has added numbers displayed in [brackets] along the left side of the printout to indicate each entry's location within the vector. For example, 5218 follows `[1]`, indicating that `5218` is the first entry in the vector. If `[43]` was displayed at the beginning of a line, that would indicate that the first number displayed on that line corresponds to the 43rd entry in that vector.
134 |
135 | ### Data visualization
136 |
137 | `R` has some powerful functions for making graphics. We can create a simple plot of the number of girls baptized per year with the following code:
138 |
139 | ```{r plot-girls-vs-year}
140 | ggplot(data = arbuthnot, aes(x = year, y = girls)) +
141 | geom_point()
142 | ```
143 |
144 | In this code, we use the `ggplot()` function to build a plot. If you run this code chunk, a plot will appear below the code chunk. The R Markdown document displays the plot below the code that was used to generate it, to give you an idea of what the plot would look like in a final report.
145 |
146 | The command above also looks like a mathematical function. This time, however, the function requires multiple inputs (arguments), which are separated by commas.
147 |
148 | With `ggplot()`:
149 |
150 | - The first argument is always the name of the dataset you wish to use for plotting.
151 | - Next, you provide the variables from the dataset to be assigned to different `aes`thetic elements of the plot, such as the x and the y axes.
152 |
153 | These commands will build a blank plot, with the variables you assigned to the x and y axes. Next, you need to tell `ggplot()` what type of visualization you would like to add to the blank template. You add another layer to the `ggplot()` by:
154 |
155 | - adding a `+` at the end of the line, to indicate that you are adding a layer
156 | - then specify the `geom`etric object to be used to create the plot.
157 |
158 | Since we want a scatterplot, we use `geom_point()`. This tells `ggplot()` that each data point should be represented by one point on the plot. If you wanted to visualize the above plot using a line graph instead of a scatterplot, you would replace `geom_point()` with `geom_line()`. This tells `ggplot()` to draw a line from each observation to the next observation (sequentially).
159 |
160 | ```{r plot-girls-vs-year-line}
161 | ggplot(data = arbuthnot, aes(x = year, y = girls)) +
162 | geom_line()
163 | ```
164 |
165 | Use the plot to address the following question:
166 |
167 | 2. Is there an apparent trend in the number of girls baptized over the years? How would you describe it? (To ensure that your lab report is comprehensive, be sure to include the code needed to make the plot as well as your written interpretation.)
168 |
169 | You might wonder how you are supposed to know the syntax for the `ggplot()` function. Thankfully, `R` documents all of its functions extensively. To learn what a function does and how to use it (e.g. the function's arguments), just type a question mark followed by the name of the function that you're interested in into the console. Type the following in your console:
170 |
171 | ```{r plot-help, tidy = FALSE}
172 | ?ggplot
173 | ```
174 |
175 | Notice that the help file comes to the forefront, replacing the plot in the lower right panel. You can toggle between the tabs by clicking on their names.
176 |
177 | ### R as a big calculator
178 |
179 | Now, suppose we want to plot the total number of baptisms. To compute this, we can use the fact that `R` works like a big calculator: we can type mathematical expressions, such as the calculation below, directly into the console.
180 |
181 | ```{r calc-total-bapt-numbers}
182 | 5218 + 4683
183 | ```
184 |
185 | This calculation would provide us with the total number of baptisms in 1629. We could then repeat this calculation once for each year. This would probably take us a while, but luckily there is a faster way! If we add the vector for baptisms for boys to that of girls, `R` can compute each of these sums simultaneously.
186 |
187 | ```{r calc-total-bapt-vars}
188 | arbuthnot$boys + arbuthnot$girls
189 | ```
190 |
191 | What you will see is a list of 82 numbers. These numbers appear as a list because we are working with vectors rather than a data frame. Each number represents the sum of how many boys and girls were baptized that year. You can take a look at the first few entries of the `boys` and `girls` columns to see if the calculation is right, as shown below.
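
For example, a quick spot check using `head()`, which returns the first few entries of a vector, might look like this (an illustration to try in your console, not part of your solution):

```{r check-totals-head, eval = FALSE}
head(arbuthnot$boys)
head(arbuthnot$girls)
head(arbuthnot$boys + arbuthnot$girls)
```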
192 |
193 | ### Adding a new variable to the data frame
194 |
195 | We are interested in using this new vector of the total number of baptisms to generate some plots, so we'll want to save it as a permanent column in our data frame. We can do this using the following code:
196 |
197 | ```{r calc-total-bapt-vars-save}
198 | arbuthnot <- arbuthnot %>%
199 | mutate(total = boys + girls)
200 | ```
201 |
202 | This code has a lot of new pieces to it, so let's break it down. In the first line we are doing two things: (1) adding a new `total` column to this updated data frame, and (2) overwriting the existing `arbuthnot` data frame with an updated data frame that includes the new `total` column. We are able to chain these two processes together using the **piping** (`%>%`) operator. The piping operator takes the output of the previous expression and "pipes it" into the first argument of the next expression.
203 |
204 | To continue our analogy with mathematical functions, `x %>% f(y)` is equivalent to `f(x, y)`. Connecting `arbuthnot` and `mutate(total = boys + girls)` with the pipe operator is the same as typing `mutate(arbuthnot, total = boys + girls)`, where `arbuthnot` becomes the first argument included in the `mutate()` function.
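
If you want to see this equivalence for yourself, run both forms in your console and confirm that they print the same result (a sketch for illustration; since nothing is assigned here, `arbuthnot` itself is not changed):

```{r pipe-equivalence, eval = FALSE}
arbuthnot %>% mutate(total = boys + girls)
mutate(arbuthnot, total = boys + girls)
```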
205 |
206 | ::: {#boxedtext}
207 | **A note on piping:** Note that we can read these two lines of code as the following:
208 |
209 | *"Take the `arbuthnot` dataset and **pipe** it into the `mutate` function. Mutate the `arbuthnot` data set by creating a new variable called `total` that is the sum of the variables called `boys` and `girls`. Then assign the resulting dataset to the object called `arbuthnot`, i.e. overwrite the old `arbuthnot` dataset with the new one containing the new variable."*
210 |
211 | This is equivalent to going through each row and adding up the `boys` and `girls` counts for that year and recording that value in a new column called `total`.
212 | :::
213 |
214 | ::: {#boxedtext}
215 | **Where is the new variable?** When you make changes to variables in your dataset, click on the name of the dataset again to update it in the data viewer.
216 | :::
217 |
220 | You'll see that there is now a new column called `total` that has been tacked onto the data frame. The special symbol `<-` performs an *assignment*, taking the output of the piping operations and saving it into an object in your environment. In this case, you already have an object called `arbuthnot` in your environment, so this command updates that data set with the new mutated column.
221 |
222 | You can make a line plot of the total number of baptisms per year with the following code:
223 |
224 | ```{r plot-total-vs-year}
225 | ggplot(data = arbuthnot, aes(x = year, y = total)) +
226 | geom_line()
227 | ```
228 |
229 | In a similar fashion, once you know the total number of baptisms for boys and girls in 1629, you can compute the ratio of the number of boys to the number of girls baptized with the following code:
230 |
231 | ```{r calc-prop-boys-to-girls-numbers}
232 | 5218 / 4683
233 | ```
234 |
235 | Alternatively, you could calculate this ratio for every year by acting on the complete `boys` and `girls` columns, and then save those calculations into a new variable named `boy_to_girl_ratio`:
236 |
237 | ```{r calc-prop-boys-to-girls-vars}
238 | arbuthnot <- arbuthnot %>%
239 | mutate(boy_to_girl_ratio = boys / girls)
240 | ```
241 |
242 | You can also compute the proportion of newborns that are boys in 1629 with the following code:
243 |
244 | ```{r calc-prop-boys-numbers}
245 | 5218 / (5218 + 4683)
246 | ```
247 |
248 | Or you can compute this for all years simultaneously and add it as a new variable named `boy_ratio` to the dataset:
249 |
250 | ```{r calc-prop-boys-vars}
251 | arbuthnot <- arbuthnot %>%
252 | mutate(boy_ratio = boys / total)
253 | ```
254 |
255 | Notice that rather than dividing by `boys + girls` we are using the `total` variable we created earlier in our calculations!
256 |
257 | 3. Now, generate a plot of the proportion of boys born over time. What do you see?
258 |
259 | ::: {#boxedtext}
260 | **Tip:** If you use the up and down arrow keys in the console, you can scroll through your previous commands, your so-called command history. You can also access your command history by clicking on the history tab in the upper right panel. This can save you a lot of typing in the future.
261 | :::
262 |
265 | Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, `>`, less than, `<`, and equality, `==`. For example, we can create a new variable called `more_boys` that tells us whether the number of births of boys outnumbered that of girls in each year with the following code:
266 |
267 | ```{r boys-more-than-girls}
268 | arbuthnot <- arbuthnot %>%
269 | mutate(more_boys = boys > girls)
270 | ```
271 |
272 | This command adds a new variable to the `arbuthnot` data frame containing the values of either `TRUE` if that year had more boys than girls, or `FALSE` if that year did not (the answer may surprise you). This variable contains a different kind of data than we have encountered so far. All other columns in the `arbuthnot` data frame have values that are numerical (the year, the number of boys and girls). Here, we've asked R to create *logical* data, data where the values are either `TRUE` or `FALSE`. In general, data analysis will involve many different kinds of data types, and one reason for using `R` is that it is able to represent and compute with many of them.
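
As a quick illustration you can try in your console, a comparison on its own evaluates to a logical value:

```{r comparison-example, eval = FALSE}
5218 > 4683
```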
273 |
274 | ## More Practice
275 |
276 | In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot's baptism data. Your assignment involves repeating these steps, but for present day birth records in the United States. The data are stored in a data frame called `present`.
277 |
278 | To find the minimum and maximum values of columns, you can use the functions `min()` and `max()` within a `summarize()` call, which you will learn more about in the following lab.
279 |
280 | Here's an example of how to find the minimum and maximum amount of boy births in a year:
281 |
282 | ```{r summarize-min-and-max}
283 | arbuthnot %>%
284 | summarize(min = min(boys),
285 | max = max(boys)
286 | )
287 | ```
288 |
289 | Answer the following questions with the `present` data frame:
290 |
291 | 1. What years are included in this data set? What are the dimensions of the data frame? What are the variable (column) names?
292 |
293 | 2. How do these counts compare to Arbuthnot's? Are they of a similar magnitude?
294 |
295 | 3. Make a plot that displays the proportion of boys born over time. What do you see? Does Arbuthnot's observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response. *Hint:* You should be able to reuse your code from Exercise 3 above, just replace the name of the data frame.
296 |
297 | 4. In what year did we see the greatest total number of births in the U.S.? *Hint:* First calculate the totals and save them as a new variable. Then, sort your dataset in descending order based on the `total` column. You can do this interactively in the data viewer by clicking on the arrows next to the variable names. To include the sorted result in your report you will need to use two new functions: `arrange()` sorts the data by a variable, and wrapping that variable in `desc()` sorts it in descending order. The sample code is provided below.
298 |
299 | ```{r sample-arrange, eval=FALSE}
300 | present %>%
301 | arrange(desc(total))
302 | ```
303 |
304 | These data come from reports by the Centers for Disease Control. You can learn more about them by bringing up the help file using the command `?present`.
305 |
306 | ## Resources for learning R and working in RStudio
307 |
308 | That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses.
309 |
310 | In this course we will be using the suite of R packages from the **tidyverse**. The book [R For Data Science](https://r4ds.had.co.nz/) by Grolemund and Wickham is a fantastic resource for data analysis in R with the tidyverse. If you are Googling for R code, make sure to also include these package names in your search query. For example, instead of Googling "scatterplot in R", Google "scatterplot in R with the tidyverse".
311 |
312 | These may come in handy throughout the semester:
313 |
314 | - [RMarkdown cheatsheet](https://github.com/rstudio/cheatsheets/raw/main/rmarkdown-2.0.pdf)
315 | - [Data transformation cheatsheet](https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf)
316 | - [Data visualization cheatsheet](https://github.com/rstudio/cheatsheets/raw/main/data-visualization-2.1.pdf)
317 |
318 | Note that some of the code on these cheatsheets may be too advanced for this course. However, the majority of it will become useful throughout the semester.
319 |
320 | ------------------------------------------------------------------------
321 |
322 | This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
323 |
--------------------------------------------------------------------------------
/01_intro_to_r/more/arbuthnot-readme.txt:
--------------------------------------------------------------------------------
1 | - arbuthnot:
2 |
3 | Arbuthnot’s data on male and female birth ratios in London from 1629-1710. (From HistData package in R, http://cran.r-project.org/web/packages/HistData/HistData.pdf)
4 |
5 | Description
6 | John Arbuthnot (1710) used these time series data on the ratios of male to female births in London from 1629-1710 to carry out the first known significance test, comparing observed data to a null hypothesis. The data for these 82 years showed that in every year there were more male than female christenings.
7 |
8 | On the assumption that male and female births were equally likely, he showed that the probability of observing 82 years with more males than females was vanishingly small (4.14x10^-25). He used this to argue that a nearly constant birth ratio > 1 could be interpreted to show the guiding hand of a divine being. The data set adds variables of deaths from the plague and total mortality obtained by Campbell and from Creighton (1965).
9 |
10 | Format
11 | A data frame with 82 observations on the following 3 variables.
12 | year: a numeric vector, 1629-1710
13 | boys: a numeric vector, number of male christenings
14 | girls: a numeric vector, number of female christenings
--------------------------------------------------------------------------------
/01_intro_to_r/more/arbuthnot.r:
--------------------------------------------------------------------------------
1 | arbuthnot <-
2 | structure(list(year = 1629:1710, boys = c(5218L, 4858L, 4422L,
3 | 4994L, 5158L, 5035L, 5106L, 4917L, 4703L, 5359L, 5366L, 5518L,
4 | 5470L, 5460L, 4793L, 4107L, 4047L, 3768L, 3796L, 3363L, 3079L,
5 | 2890L, 3231L, 3220L, 3196L, 3441L, 3655L, 3668L, 3396L, 3157L,
6 | 3209L, 3724L, 4748L, 5216L, 5411L, 6041L, 5114L, 4678L, 5616L,
7 | 6073L, 6506L, 6278L, 6449L, 6443L, 6073L, 6113L, 6058L, 6552L,
8 | 6423L, 6568L, 6247L, 6548L, 6822L, 6909L, 7577L, 7575L, 7484L,
9 | 7575L, 7737L, 7487L, 7604L, 7909L, 7662L, 7602L, 7676L, 6985L,
10 | 7263L, 7632L, 8062L, 8426L, 7911L, 7578L, 8102L, 8031L, 7765L,
11 | 6113L, 8366L, 7952L, 8379L, 8239L, 7840L, 7640L), girls = c(4683L,
12 | 4457L, 4102L, 4590L, 4839L, 4820L, 4928L, 4605L, 4457L, 4952L,
13 | 4784L, 5332L, 5200L, 4910L, 4617L, 3997L, 3919L, 3395L, 3536L,
14 | 3181L, 2746L, 2722L, 2840L, 2908L, 2959L, 3179L, 3349L, 3382L,
15 | 3289L, 3013L, 2781L, 3247L, 4107L, 4803L, 4881L, 5681L, 4858L,
16 | 4319L, 5322L, 5560L, 5829L, 5719L, 6061L, 6120L, 5822L, 5738L,
17 | 5717L, 5847L, 6203L, 6033L, 6041L, 6299L, 6533L, 6744L, 7158L,
18 | 7127L, 7246L, 7119L, 7214L, 7101L, 7167L, 7302L, 7392L, 7316L,
19 | 7483L, 6647L, 6713L, 7229L, 7767L, 7626L, 7452L, 7061L, 7514L,
20 | 7656L, 7683L, 5738L, 7779L, 7417L, 7687L, 7623L, 7380L, 7288L
21 | )), .Names = c("year", "boys", "girls"), class = "data.frame", row.names = c(NA,
22 | -82L))
23 |
--------------------------------------------------------------------------------
/01_intro_to_r/more/present-readme.txt:
--------------------------------------------------------------------------------
1 |
2 |
3 | - present:
4 |
5 | Description
6 | Number of male and female births, sex ratio at birth, and number of excess males: United States, 1940–2002. (Mathews, TJ and Hamilton, B.E. Trend analysis of the sex ratio at birth in the United States. National Vital Statistics Reports. vol 53, no. 20, pp 1 - 17, 2005. http://www.cdc.gov/nchs/data/nvsr/nvsr53/nvsr53_20.pdf)
7 |
8 | Format
9 | A data frame with 63 observations on the following 3 variables.
10 | year: a numeric vector, 1940-2002
11 | boys: a numeric vector, number of male births
12 | girls: a numeric vector, number of female births
--------------------------------------------------------------------------------
/01_intro_to_r/more/present-reference.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/OpenIntroStat/oilabs-tidy/5e5703a7ecf076f0de5f7d90b436977f0d6c4ae0/01_intro_to_r/more/present-reference.pdf
--------------------------------------------------------------------------------
/01_intro_to_r/more/present.R:
--------------------------------------------------------------------------------
1 | `present` <-
2 | structure(list(year = c(1940, 1941, 1942, 1943, 1944, 1945, 1946,
3 | 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957,
4 | 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968,
5 | 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979,
6 | 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990,
7 | 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
8 | 2002), boys = c(1211684, 1289734, 1444365, 1508959, 1435301,
9 | 1404587, 1691220, 1899876, 1813852, 1826352, 1823555, 1923020,
10 | 1971262, 2001798, 2059068, 2073719, 2133588, 2179960, 2152546,
11 | 2173638, 2179708, 2186274, 2132466, 2101632, 2060162, 1927054,
12 | 1845862, 1803388, 1796326, 1846572, 1915378, 1822910, 1669927,
13 | 1608326, 1622114, 1613135, 1624436, 1705916, 1709394, 1791267,
14 | 1852616, 1860272, 1885676, 1865553, 1879490, 1927983, 1924868,
15 | 1951153, 2002424, 2069490, 2129495, 2101518, 2082097, 2048861,
16 | 2022589, 1996355, 1990480, 1985596, 2016205, 2026854, 2076969,
17 | 2057922, 2057979), girls = c(1148715, 1223693, 1364631, 1427901,
18 | 1359499, 1330869, 1597452, 1800064, 1721216, 1733177, 1730594,
19 | 1827830, 1875724, 1900322, 1958294, 1973576, 2029502, 2074824,
20 | 2051266, 2071158, 2078142, 2082052, 2034896, 1996388, 1967328,
21 | 1833304, 1760412, 1717571, 1705238, 1753634, 1816008, 1733060,
22 | 1588484, 1528639, 1537844, 1531063, 1543352, 1620716, 1623885,
23 | 1703131, 1759642, 1768966, 1794861, 1773380, 1789651, 1832578,
24 | 1831679, 1858241, 1907086, 1971468, 2028717, 2009389, 1982917,
25 | 1951379, 1930178, 1903234, 1901014, 1895298, 1925348, 1932563,
26 | 1981845, 1968011, 1963747)), .Names = c("year", "boys", "girls"
27 | ), row.names = c(NA, 63L), class = "data.frame")
28 |
--------------------------------------------------------------------------------
/02_intro_to_data/intro_to_data.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Introduction to data"
3 | output:
4 | html_document:
5 | theme: cerulean
6 | highlight: pygments
7 | css: ../lab.css
8 | toc: true
9 | toc_float: true
10 | ---
11 |
12 | ```{r global-options, include=FALSE}
13 | knitr::opts_chunk$set(eval = TRUE, results = FALSE, fig.show = "hide", message = FALSE)
14 | library(tidyverse)
15 | library(openintro)
16 | ```
17 |
18 | Some define statistics as the field that focuses on turning information into knowledge.
19 | The first step in that process is to summarize and describe the raw information -- the data.
20 | In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013.
21 | We will generate simple graphical and numerical summaries of data on these flights and explore delay times.
22 | Since this is a large data set, along the way you'll also learn the indispensable skills of data processing and subsetting.
23 |
24 | ## Getting started
25 |
26 | ### Load packages
27 |
28 | In this lab, we will explore and visualize the data using the **tidyverse** suite of packages.
29 | The data can be found in the companion package for OpenIntro labs, **openintro**.
30 |
31 | Let's load the packages.
32 |
33 | ```{r load-packages, message=FALSE}
34 | library(tidyverse)
35 | library(openintro)
36 | ```
37 |
38 | ### Creating a reproducible lab report
39 |
40 | Remember that we will be using R Markdown to create reproducible lab reports.
41 | In RStudio, go to New File -\> R Markdown... Then, choose From Template and then choose `Lab Report for OpenIntro Statistics Labs` from the list of templates.
42 |
43 | See the following video describing how to get started with creating these reports for this lab, and all future labs:
44 |
45 | [**Basic R Markdown with an OpenIntro Lab**](https://www.youtube.com/watch?v=Pdc368lS2hk)
46 |
47 | ### The data
48 |
49 | The [Bureau of Transportation Statistics](http://www.rita.dot.gov/bts/about/) (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA).
50 | As its name implies, BTS collects and makes transportation data available, such as the flights data we will be working with in this lab.
51 |
52 | First, we'll view the `nycflights` data frame.
53 | Type the following in your console to load the data:
54 |
55 | ```{r load-data}
56 | data(nycflights)
57 | ```
58 |
59 | The data set `nycflights` that shows up in your workspace is a *data matrix*, with each row representing an *observation* and each column representing a *variable*.
60 | R calls this data format a **data frame**, which is a term that will be used throughout the labs.
61 | For this data set, each *observation* is a single flight.
62 |
63 | To view the names of the variables, type the command
64 |
65 | ```{r names}
66 | names(nycflights)
67 | ```
68 |
69 | This returns the names of the variables in this data frame.
70 | The **codebook** (description of the variables) can be accessed by pulling up the help file:
71 |
72 | ```{r help}
73 | ?nycflights
74 | ```
75 |
76 | One of the variables refers to the carrier (i.e. airline) of the flight, which is coded according to the following system.
77 |
78 | - `carrier`: Two letter carrier abbreviation.
79 |
80 | - `9E`: Endeavor Air Inc.
81 | - `AA`: American Airlines Inc.
82 | - `AS`: Alaska Airlines Inc.
83 | - `B6`: JetBlue Airways
84 | - `DL`: Delta Air Lines Inc.
85 | - `EV`: ExpressJet Airlines Inc.
86 | - `F9`: Frontier Airlines Inc.
87 | - `FL`: AirTran Airways Corporation
88 | - `HA`: Hawaiian Airlines Inc.
89 | - `MQ`: Envoy Air
90 | - `OO`: SkyWest Airlines Inc.
91 | - `UA`: United Air Lines Inc.
92 | - `US`: US Airways Inc.
93 | - `VX`: Virgin America
94 | - `WN`: Southwest Airlines Co.
95 | - `YV`: Mesa Airlines Inc.
96 |
97 | Remember that you can use `glimpse` to take a quick peek at your data to understand its contents better.
98 |
99 | ```{r glimpse}
100 | glimpse(nycflights)
101 | ```
102 |
103 | The `nycflights` data frame is a massive trove of information.
104 | Let's think about some questions we might want to answer with these data:
105 |
106 | - How delayed were flights that were headed to Los Angeles?
107 | - How do departure delays vary by month?
108 | - Which of the three major NYC airports has the best on time percentage for departing flights?
109 |
110 | ## Analysis
111 |
112 | ### Lab report
113 |
114 | To record your analysis in a reproducible format, you can adapt the general Lab Report template from the **openintro** package.
115 | Watch the video above to learn how.
116 |
117 | ### Departure delays
118 |
119 | Let's start by examining the distribution of departure delays of all flights with a histogram.
120 |
121 | ```{r hist-dep-delay}
122 | ggplot(data = nycflights, aes(x = dep_delay)) +
123 | geom_histogram()
124 | ```
125 |
126 | This function says to plot the `dep_delay` variable from the `nycflights` data frame on the x-axis.
127 | It also defines a `geom` (short for geometric object), which describes the type of plot you will produce.
128 |
129 | Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins.
130 | You can easily define the binwidth you want to use:
131 |
132 | ```{r hist-dep-delay-bins}
133 | ggplot(data = nycflights, aes(x = dep_delay)) +
134 | geom_histogram(binwidth = 15)
135 | ggplot(data = nycflights, aes(x = dep_delay)) +
136 | geom_histogram(binwidth = 150)
137 | ```
138 |
139 | 1. Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?
140 |
141 | If you want to visualize only delays of flights headed to Los Angeles, you need to first `filter` the data for flights with that destination (`dest == "LAX"`) and then make a histogram of the departure delays of only those flights.
142 |
143 | ```{r lax-flights-hist}
144 | lax_flights <- nycflights %>%
145 | filter(dest == "LAX")
146 | ggplot(data = lax_flights, aes(x = dep_delay)) +
147 | geom_histogram()
148 | ```
149 |
150 | Let's decipher these two commands (OK, so it might look like four lines, but the first two physical lines of code are actually part of the same command. It's common to add a break to a new line after `%>%` to help readability).
151 |
152 | - Command 1: Take the `nycflights` data frame, `filter` for flights headed to LAX, and save the result as a new data frame called `lax_flights`.
153 |
154 | - `==` means "if it's equal to".
155 | - `LAX` is in quotation marks since it is a character string.
156 |
157 | - Command 2: Basically the same `ggplot` call from earlier for making a histogram, except that it uses the smaller data frame for flights headed to LAX instead of all flights.
158 |
159 | ::: {#boxedtext}
160 | **Logical operators:** Filtering for certain observations (e.g. flights from a particular airport) is often of interest in data frames where we might want to examine observations with certain characteristics separately from the rest of the data.
161 | To do so, you can use the `filter` function and a series of **logical operators**.
162 | The most commonly used logical operators for data analysis are as follows:
163 |
164 | - `==` means "equal to"
165 | - `!=` means "not equal to"
166 | - `>` or `<` means "greater than" or "less than"
167 | - `>=` or `<=` means "greater than or equal to" or "less than or equal to"
168 | :::
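
For instance, a hypothetical use of `!=` (not needed for this lab) would be to keep only the flights that did **not** depart from JFK:

```{r filter-not-jfk, eval = FALSE}
non_jfk_flights <- nycflights %>%
  filter(origin != "JFK")
```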
169 |
170 | You can also obtain numerical summaries for these flights:
171 |
172 | ```{r lax-flights-summ}
173 | lax_flights %>%
174 | summarise(mean_dd = mean(dep_delay),
175 | median_dd = median(dep_delay),
176 | n = n())
177 | ```
178 |
179 | Note that in the `summarise` function you created a list of three different numerical summaries that you were interested in.
180 | The names of these elements are user defined, like `mean_dd`, `median_dd`, `n`, and you can customize these names as you like (just don't use spaces in your names).
181 | Calculating these summary statistics also requires that you know the function calls.
182 | Note that `n()` reports the sample size.
183 |
184 | ::: {#boxedtext}
185 | **Summary statistics:** Some useful function calls for summary statistics for a single numerical variable are as follows:
186 |
187 | - `mean`
188 | - `median`
189 | - `sd`
190 | - `var`
191 | - `IQR`
192 | - `min`
193 | - `max`
194 |
195 | Note that each of these functions takes a single vector as an argument and returns a single value.
196 | :::
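
For example, each of these functions can be applied to a single column; here is a quick sketch to try in your console:

```{r summary-function-example, eval = FALSE}
IQR(lax_flights$dep_delay)
```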
197 |
198 | You can also filter based on multiple criteria.
199 | Suppose you are interested in flights headed to San Francisco (SFO) in February:
200 |
201 | ```{r sfo-feb-flights}
202 | sfo_feb_flights <- nycflights %>%
203 | filter(dest == "SFO", month == 2)
204 | ```
205 |
206 | Note that you can separate the conditions using commas if you want flights that are both headed to SFO **and** in February.
207 | If you are interested in either flights headed to SFO **or** in February, you can use the `|` instead of the comma.
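
For example, a sketch of the **or** version, with a hypothetical name for the resulting data frame, would be:

```{r sfo-or-feb-flights, eval = FALSE}
sfo_or_feb_flights <- nycflights %>%
  filter(dest == "SFO" | month == 2)
```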
208 |
209 | 2. Create a new data frame that includes flights headed to SFO in February, and save this data frame as `sfo_feb_flights`.
210 | How many flights meet these criteria?
211 |
212 | 3. Describe the distribution of the **arrival** delays of these flights using a histogram and appropriate summary statistics.
213 | **Hint:** The summary statistics you use should depend on the shape of the distribution.
214 |
215 | Another useful technique is quickly calculating summary statistics for various groups in your data frame.
216 | For example, we can modify the above command using the `group_by` function to get the same summary stats for each origin airport:
217 |
218 | ```{r summary-custom-list-origin}
219 | sfo_feb_flights %>%
220 | group_by(origin) %>%
221 | summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
222 | ```
223 |
224 | Here, we first grouped the data by `origin` and then calculated the summary statistics.
225 |
226 | 4. Calculate the median and interquartile range for `arr_delay`s of flights in the `sfo_feb_flights` data frame, grouped by carrier. Which carrier has the most variable arrival delays?
227 |
228 | ### Departure delays by month
229 |
230 | Which month would you expect to have the highest average delay departing from an NYC airport?
231 |
232 | Let's think about how you could answer this question:
233 |
234 | - First, calculate monthly averages for departure delays. With the new language you are learning, you could
235 |
236 | - `group_by` months, then
237 | - `summarise` mean departure delays.
238 |
239 | - Then, you could `arrange` these average delays in `desc`ending order.
240 |
241 | ```{r mean-dep-delay-months}
242 | nycflights %>%
243 | group_by(month) %>%
244 | summarise(mean_dd = mean(dep_delay)) %>%
245 | arrange(desc(mean_dd))
246 | ```
247 |
248 | 5. Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?
249 |
258 | ### On time departure rate for NYC airports
259 |
260 | Suppose you will be flying out of NYC and want to know which of the three major NYC airports has the best on time departure rate of departing flights.
261 | Also suppose that for you, a flight that is delayed for less than 5 minutes is basically "on time". You consider any flight delayed for 5 minutes or more to be "delayed".
262 |
263 | In order to determine which airport has the best on time departure rate, you can
264 |
265 | - first classify each flight as "on time" or "delayed",
266 | - then group flights by origin airport,
267 | - then calculate on time departure rates for each origin airport,
268 | - and finally arrange the airports in descending order for on time departure percentage.
269 |
270 | Let's start with classifying each flight as "on time" or "delayed" by creating a new variable with the `mutate` function.
271 |
272 | ```{r dep-type}
273 | nycflights <- nycflights %>%
274 | mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
275 | ```
276 |
277 | The first argument in the `mutate` function is the name of the new variable we want to create, in this case `dep_type`.
278 | Then, if `dep_delay < 5`, we classify the flight as `"on time"`, and as `"delayed"` if not, i.e. if the flight is delayed for 5 or more minutes.
279 |
280 | Note that we are also overwriting the `nycflights` data frame with the new version of this data frame that includes the new `dep_type` variable.
281 |
282 | We can handle all of the remaining steps in one code chunk:
283 |
284 | ```{r ot-dep-rate}
285 | nycflights %>%
286 | group_by(origin) %>%
287 | summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
288 | arrange(desc(ot_dep_rate))
289 | ```
290 |
291 | 6. If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?
292 |
293 | You can also visualize the distribution of on-time departure rate across the three airports using a segmented bar plot.
294 |
295 | ```{r viz-origin-dep-type}
296 | ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
297 | geom_bar()
298 | ```
299 |
300 | ------------------------------------------------------------------------
301 |
302 | ## More Practice
303 |
304 | 1. Mutate the data frame so that it includes a new variable, `avg_speed`, that contains the average speed traveled by the plane for each flight (in mph).
305 | **Hint:** Average speed can be calculated as distance divided by number of hours of travel, and note that `air_time` is given in minutes.
306 |
307 | 2. Make a scatterplot of `avg_speed` vs. `distance`.
308 | Describe the relationship between average speed and distance.
309 | **Hint:** Use `geom_point()`.
310 |
311 | 3. Replicate the following plot.
312 | **Hint:** The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are `color`ed by `carrier`.
313 | Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
314 |
315 | ```{r plot-to-replicate, echo=FALSE, fig.show="asis", fig.width=7, fig.height=4}
316 | dl_aa_ua <- nycflights %>%
317 | filter(carrier == "AA" | carrier == "DL" | carrier == "UA")
318 | ggplot(data = dl_aa_ua, aes(x = dep_delay, y = arr_delay, color = carrier)) +
319 | geom_point()
320 | ```
321 |
322 | ------------------------------------------------------------------------
323 |
324 | This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
325 |
--------------------------------------------------------------------------------
/03_probability/more/calc_streak.R:
--------------------------------------------------------------------------------
1 | calc_streak <- function(x){
2 |   # Encode hits ("H") as 1 and misses as 0
3 |   y <- rep(0, length(x))
4 |   y[x == "H"] <- 1
5 |   # Pad both ends with a 0 so streaks at the start and end are counted
6 |   y <- c(0, y, 0)
7 |   # Positions of all 0s (misses plus the padding)
8 |   wz <- which(y == 0)
9 |   # The gap between consecutive 0s, minus 1, is the length of each hit streak
10 |   streak <- diff(wz) - 1
11 |   return(streak)
12 | }
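13 |
14 | # Example (hypothetical input, not part of the original script):
15 | # calc_streak(c("H", "M", "H", "H", "M")) returns 1 2 0, i.e. a streak of
16 | # one hit, a streak of two hits, and a zero-length streak after the final miss.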
--------------------------------------------------------------------------------
/03_probability/more/kobe-readme.txt:
--------------------------------------------------------------------------------
1 | - kobe:
2 |
3 | Data from the five games the Los Angeles Lakers played against the Orlando Magic in the 2009 NBA finals. Each row represents a shot Kobe Bryant took during these games. Kobe Bryant's performance against the Orlando Magic in the 2009 NBA finals earned him the title of Most Valuable Player and many spectators commented on how he appeared to show a hot hand.
4 |
5 | Format
6 | A data frame with 133 observations on the following 6 variables.
7 | vs: a categorical vector, ORL if the Los Angeles Lakers played against Orlando
8 | game: a numerical vector, game in the 2009 NBA finals
9 | quarter: a categorical vector, quarter in the game, OT stands for overtime
10 | time: a categorical vector, time at which Kobe took a shot
11 | description: a categorical vector, description of the shot
12 | basket: a categorical vector, H if the shot was a hit, M if the shot was a miss
--------------------------------------------------------------------------------
/03_probability/more/kobe.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/OpenIntroStat/oilabs-tidy/5e5703a7ecf076f0de5f7d90b436977f0d6c4ae0/03_probability/more/kobe.RData
--------------------------------------------------------------------------------
/03_probability/more/kobe.csv:
--------------------------------------------------------------------------------
1 | "vs","game","quarter","time","description","basket"
2 | "ORL",1,"1","9:47","Kobe Bryant makes 4-foot two point shot",1
3 | "ORL",1,"1","9:07","Kobe Bryant misses jumper",0
4 | "ORL",1,"1","8:11","Kobe Bryant misses 7-foot jumper",0
5 | "ORL",1,"1","7:41","Kobe Bryant makes 16-foot jumper (Derek Fisher assists)",1
6 | "ORL",1,"1","7:03","Kobe Bryant makes driving layup",1
7 | "ORL",1,"1","6:01","Kobe Bryant misses jumper",0
8 | "ORL",1,"1","4:07","Kobe Bryant misses 12-foot jumper",0
9 | "ORL",1,"1","0:52","Kobe Bryant misses 19-foot jumper",0
10 | "ORL",1,"1","0:00","Kobe Bryant misses layup",0
11 | "ORL",1,"2","6:35","Kobe Bryant makes jumper",1
12 | "ORL",1,"2","5:58","Kobe Bryant makes 20-foot jumper",1
13 | "ORL",1,"2","5:22","Kobe Bryant makes 14-foot jumper",1
14 | "ORL",1,"2","4:37","Kobe Bryant misses driving layup",0
15 | "ORL",1,"2","3:30","Kobe Bryant makes 9-foot two point shot",1
16 | "ORL",1,"2","2:55","Kobe Bryant makes 14-foot running jumper",1
17 | "ORL",1,"2","1:55","Kobe Bryant misses 19-foot jumper",0
18 | "ORL",1,"2","0:38","Kobe Bryant misses 27-foot three point jumper",0
19 | "ORL",1,"2","0:04","Kobe Bryant makes driving layup",1
20 | "ORL",1,"3","11:44","Kobe Bryant makes layup",1
21 | "ORL",1,"3","11:15","Kobe Bryant makes 11-foot two point shot",1
22 | "ORL",1,"3","10:14","Kobe Bryant misses 13-foot jumper",0
23 | "ORL",1,"3","9:15","Kobe Bryant misses 9-foot jumper",0
24 | "ORL",1,"3","6:43","Kobe Bryant makes 14-foot two point shot",1
25 | "ORL",1,"3","4:58","Kobe Bryant misses 16-foot jumper",0
26 | "ORL",1,"3","4:24","Kobe Bryant makes two point shot",1
27 | "ORL",1,"3","3:55","Kobe Bryant makes 17-foot running jumper",1
28 | "ORL",1,"3","3:16","Kobe Bryant makes 9-foot two point shot",1
29 | "ORL",1,"3","1:15","Kobe Bryant misses 20-foot jumper",0
30 | "ORL",1,"3","0:00","Kobe Bryant misses 6-foot running jumper",0
31 | "ORL",1,"4","6:48","Kobe Bryant misses 15-foot jumper",0
32 | "ORL",1,"4","6:16","Kobe Bryant misses 11-foot jumper",0
33 | "ORL",1,"4","5:48","Kobe Bryant misses layup",0
34 | "ORL",1,"4","2:54","Kobe Bryant misses 13-foot jumper",0
35 | "ORL",1,"4","1:59","Kobe Bryant makes 10-foot jumper",1
36 | "ORL",2,"1","11:32","Kobe Bryant misses 20-foot jumper",0
37 | "ORL",2,"1","5:09","Kobe Bryant makes 18-foot two point shot",1
38 | "ORL",2,"1","4:35","Kobe Bryant misses 21-foot jumper",0
39 | "ORL",2,"1","0:01","Kobe Bryant misses 32-foot three point jumper",0
40 | "ORL",2,"2","1:52","Kobe Bryant makes 29-foot three point jumper (Pau Gasol assists)",1
41 | "ORL",2,"3","11:18","Kobe Bryant makes jumper",1
42 | "ORL",2,"3","9:52","Kobe Bryant makes 15-foot two point shot",1
43 | "ORL",2,"3","9:23","Kobe Bryant makes 16-foot two point shot",1
44 | "ORL",2,"3","8:48","Kobe Bryant misses 22-foot jumper",0
45 | "ORL",2,"3","4:37","Kobe Bryant makes slam dunk (Trevor Ariza assists)",1
46 | "ORL",2,"3","4:07","Kobe Bryant misses 7-foot jumper",0
47 | "ORL",2,"3","3:29","Kobe Bryant misses 17-foot jumper",0
48 | "ORL",2,"3","1:20","Kobe Bryant makes 18-foot jumper",1
49 | "ORL",2,"3","0:45","Kobe Bryant misses 25-foot three point jumper",0
50 | "ORL",2,"3","0:00","Kobe Bryant misses 31-foot three point jumper",0
51 | "ORL",2,"4","7:09","Kobe Bryant makes 17-foot two point shot (Pau Gasol assists)",1
52 | "ORL",2,"4","5:37","Kobe Bryant misses 12-foot jumper",0
53 | "ORL",2,"4","1:54","Kobe Bryant misses 12-foot two point shot",0
54 | "ORL",2,"4","1:10","Kobe Bryant makes 11-foot two point shot",1
55 | "ORL",2,"1OT","4:13","Kobe Bryant misses 22-foot jumper",0
56 | "ORL",2,"1OT","2:17","Kobe Bryant makes 11-foot two point shot",1
57 | "ORL",4,"1","11:19","Kobe Bryant makes 20-foot jumper",1
58 | "ORL",4,"1","8:29","Kobe Bryant misses 19-foot jumper",0
59 | "ORL",4,"1","7:01","Kobe Bryant misses 13-foot two point shot",0
60 | "ORL",4,"1","6:03","Kobe Bryant makes driving layup",1
61 | "ORL",4,"1","5:16","Kobe Bryant misses 9-foot jumper",0
62 | "ORL",4,"1","3:02","Kobe Bryant makes 18-foot jumper",1
63 | "ORL",4,"1","0:19","Kobe Bryant makes 21-foot jumper",1
64 | "ORL",4,"2","5:37","Kobe Bryant misses 21-foot jumper",0
65 | "ORL",4,"2","4:01","Kobe Bryant makes 26-foot three point jumper (Trevor Ariza assists)",1
66 | "ORL",4,"2","3:15","Kobe Bryant misses 16-foot two point shot",0
67 | "ORL",4,"2","2:08","Kobe Bryant misses 20-foot jumper",0
68 | "ORL",4,"2","0:39","Kobe Bryant misses 26-foot three point jumper",0
69 | "ORL",4,"3","9:17","Kobe Bryant makes 25-foot three point jumper",1
70 | "ORL",4,"3","7:24","Kobe Bryant misses 20-foot jumper",0
71 | "ORL",4,"3","7:13","Kobe Bryant misses layup",0
72 | "ORL",4,"3","5:30","Kobe Bryant misses 16-foot jumper",0
73 | "ORL",4,"3","0:51","Kobe Bryant misses 26-foot three point jumper",0
74 | "ORL",4,"3","0:04","Kobe Bryant makes 16-foot jumper",1
75 | "ORL",4,"4","8:52","Kobe Bryant misses 11-foot two point shot",0
76 | "ORL",4,"4","7:24","Kobe Bryant makes 21-foot jumper",1
77 | "ORL",4,"4","6:26","Kobe Bryant misses 19-foot jumper",0
78 | "ORL",4,"4","5:20","Kobe Bryant misses 5-foot jumper",0
79 | "ORL",4,"4","4:48","Kobe Bryant makes 11-foot two point shot",1
80 | "ORL",4,"4","3:33","Kobe Bryant misses 27-foot three point jumper",0
81 | "ORL",4,"4","1:02","Kobe Bryant misses 28-foot three point jumper",0
82 | "ORL",4,"1OT","4:13","Kobe Bryant makes 11-foot jumper",1
83 | "ORL",4,"1OT","3:32","Kobe Bryant makes 19-foot jumper",1
84 | "ORL",4,"1OT","2:49","Kobe Bryant misses 10-foot jumper",0
85 | "ORL",4,"1OT","1:58","Kobe Bryant misses 18-foot jumper",0
86 | "ORL",4,"1OT","0:47","Kobe Bryant misses 15-foot jumper",0
87 | "ORL",5,"1","11:00","Kobe Bryant misses layup",0
88 | "ORL",5,"1","9:56","Kobe Bryant makes 18-foot jumper",1
89 | "ORL",5,"1","5:20","Kobe Bryant makes 20-foot jumper",1
90 | "ORL",5,"1","4:48","Kobe Bryant makes 25-foot three point jumper",1
91 | "ORL",5,"1","3:48","Kobe Bryant misses 25-foot three point jumper",0
92 | "ORL",5,"1","1:13","Kobe Bryant misses 17-foot jumper",0
93 | "ORL",5,"2","7:54","Kobe Bryant makes driving dunk",1
94 | "ORL",5,"2","6:48","Kobe Bryant misses 12-foot jumper",0
95 | "ORL",5,"2","6:28","Kobe Bryant misses layup",0
96 | "ORL",5,"2","4:43","Kobe Bryant makes 14-foot jumper",1
97 | "ORL",5,"2","0:25","Kobe Bryant misses 23-foot jumper",0
98 | "ORL",5,"3","9:00","Kobe Bryant makes 17-foot jumper",1
99 | "ORL",5,"3","5:52","Kobe Bryant makes 6-foot running jumper",1
100 | "ORL",5,"3","2:17","Kobe Bryant misses 15-foot jumper",0
101 | "ORL",5,"4","11:00","Kobe Bryant makes 20-foot jumper",1
102 | "ORL",5,"4","9:56","Kobe Bryant misses two point shot",0
103 | "ORL",5,"4","9:06","Kobe Bryant misses 14-foot two point shot",0
104 | "ORL",5,"4","8:18","Kobe Bryant makes 25-foot three point jumper",1
105 | "ORL",5,"4","6:22","Kobe Bryant misses 18-foot jumper",0
106 | "ORL",5,"4","4:26","Kobe Bryant misses 27-foot three point jumper",0
107 | "ORL",5,"4","3:12","Kobe Bryant misses 27-foot three point jumper",0
108 | "ORL",5,"4","2:38","Kobe Bryant makes 9-foot two point shot",1
109 | "ORL",5,"4","2:06","Kobe Bryant misses 13-foot jumper",0
110 | "ORL",3,"1","5:41","Bryant Jump Shot: Made (2 PTS) ",1
111 | "ORL",3,"1","5:09","Bryant 3pt Shot: Made (5 PTS) ",1
112 | "ORL",3,"1","4:42","Bryant Jump Shot: Made (7 PTS) Assist: Fisher (1 AST) ",1
113 | "ORL",3,"1","3:37","Bryant Reverse Layup Shot: Missed ",0
114 | "ORL",3,"1","3:07","Bryant Finger Roll Layup Shot: Made (9 PTS) ",1
115 | "ORL",3,"1","2:00","Bryant Jump Shot: Made (11 PTS) ",1
116 | "ORL",3,"1","1:18","Bryant Jump Shot: Made (13 PTS) ",1
117 | "ORL",3,"1","00:34.9","Bryant Layup Shot: Missed Block: Battie (1 BLK) ",0
118 | "ORL",3,"1","00:30.0","Bryant 3pt Shot: Made (16 PTS) Assist: Bynum (1 AST) ",1
119 | "ORL",3,"1","00:00.0","Bryant Jump Shot: Missed ",0
120 | "ORL",3,"2","6:55","Bryant 3pt Shot: Made (20 PTS) ",1
121 | "ORL",3,"2","3:13","Bryant Fade Away Jumper Shot: Missed ",0
122 | "ORL",3,"2","00:53.7","Bryant Jump Shot: Missed Block: Howard (1 BLK) ",0
123 | "ORL",3,"2","00:41.8","Bryant 3pt Shot: Missed ",0
124 | "ORL",3,"2","00:02.2","Bryant Jump Shot: Missed ",0
125 | "ORL",3,"3","9:29","Bryant Layup Shot: Missed ",0
126 | "ORL",3,"3","5:55","Bryant Fade Away Jumper Shot: Missed ",0
127 | "ORL",3,"3","1:20","Bryant 3pt Shot: Made (26 PTS) ",1
128 | "ORL",3,"3","00:01.9","Bryant 3pt Shot: Missed ",0
129 | "ORL",3,"4","3:57","Bryant Jump Shot: Made (28 PTS) ",1
130 | "ORL",3,"4","3:33","Bryant Layup Shot: Missed ",0
131 | "ORL",3,"4","2:02","Bryant 3pt Shot: Missed ",0
132 | "ORL",3,"4","00:23.9","Bryant 3pt Shot: Missed ",0
133 | "ORL",3,"4","00:06.9","Bryant 3pt Shot: Missed ",0
134 | "ORL",3,"4","00:00.5","Bryant Layup Shot: Made (31 PTS) ",1
135 |
--------------------------------------------------------------------------------
/03_probability/more/kobe_data.xls:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/OpenIntroStat/oilabs-tidy/5e5703a7ecf076f0de5f7d90b436977f0d6c4ae0/03_probability/more/kobe_data.xls
--------------------------------------------------------------------------------
/03_probability/probability.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Probability"
3 | output:
4 | html_document:
5 | css: ../lab.css
6 | highlight: pygments
7 | theme: cerulean
8 | toc: true
9 | toc_float: true
10 | ---
11 |
12 | ```{r global_options, include=FALSE}
13 | knitr::opts_chunk$set(eval = TRUE, results = FALSE, fig.show = "hide", message = FALSE)
14 | library(tidyverse)
15 | library(openintro)
16 | ```
17 |
18 | ## The Hot Hand
19 |
20 | Basketball players who make several baskets in succession are described as having a *hot hand*.
21 | Fans and players have long believed in the hot hand phenomenon, which contradicts the assumption that each shot is independent of the next.
22 | However, [a 1985 paper](http://www.sciencedirect.com/science/article/pii/0010028585900106) by Gilovich, Vallone, and Tversky collected evidence that contradicted this belief and showed that successive shots are independent events.
23 | This paper started a great controversy that continues to this day, as you can see by Googling *hot hand basketball*.
24 |
25 | We do not expect to resolve this controversy today.
26 | However, in this lab we'll apply one approach to answering questions like this.
27 | The goals for this lab are to (1) think about the effects of independent and dependent events, (2) learn how to simulate shooting streaks in R, and (3) compare a simulation to actual data in order to determine if the hot hand phenomenon appears to be real.
28 |
29 | ## Getting Started
30 |
31 | ### Load packages
32 |
33 | In this lab, we will explore and visualize the data using the `tidyverse` suite of packages.
34 | The data can be found in the companion package for OpenIntro labs, **openintro**.
35 |
36 | Let's load the packages.
37 |
38 | ```{r load-packages, message=FALSE}
39 | library(tidyverse)
40 | library(openintro)
41 | ```
42 |
43 | ### Creating a reproducible lab report
44 |
45 | To create your new lab report, in RStudio, go to New File -\> R Markdown... Then, choose From Template and then choose `Lab Report for OpenIntro Statistics Labs` from the list of templates.
46 |
47 | ### Data
48 |
49 | Your investigation will focus on the performance of one player: [Kobe Bryant](https://en.wikipedia.org/wiki/Kobe_Bryant) of the Los Angeles Lakers.
50 | His performance against the Orlando Magic in the [2009 NBA Finals](https://en.wikipedia.org/wiki/2009_NBA_Finals) earned him the title *Most Valuable Player* and many spectators commented on how he appeared to show a hot hand.
51 | The data file we'll use is called `kobe_basket`.
52 |
53 | ```{r glimpse-data}
54 | glimpse(kobe_basket)
55 | ```
56 |
57 | This data frame contains 133 observations and 6 variables, where every row records a shot taken by Kobe Bryant.
58 | The `shot` variable in this dataset indicates whether the shot was a hit (`H`) or a miss (`M`).
59 |
60 | Just looking at the string of hits and misses, it can be difficult to gauge whether or not it seems like Kobe was shooting with a hot hand.
61 | One way we can approach this is by considering the belief that hot hand shooters tend to go on shooting streaks.
62 | For this lab, we define the length of a shooting streak to be the *number of consecutive baskets made until a miss occurs*.
63 |
64 | For example, in Game 1 Kobe had the following sequence of hits and misses from his nine shot attempts in the first quarter:
65 |
66 | $$ \textrm{H M | M | H H M | M | M | M} $$
67 |
68 | You can verify this by viewing the first 9 rows of the data in the data viewer.
69 |
70 | Within the nine shot attempts, there are six streaks, which are separated by a "\|" above.
71 | Their lengths are one, zero, two, zero, zero, zero (in order of occurrence).
72 |
73 | 1. What does a streak length of 1 mean, i.e. how many hits and misses are in a streak of 1? What about a streak length of 0?
74 |
75 | Counting streak lengths manually for all 133 shots would get tedious, so we'll use the custom function `calc_streak` to calculate them, and store the results in a data frame called `kobe_streak` as the `length` variable.
76 |
77 | ```{r calc-streak-kobe}
78 | kobe_streak <- calc_streak(kobe_basket$shot)
79 | ```
80 |
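If you're curious how such a function might work, here is a minimal sketch of a streak-length calculator.
It illustrates the idea, but is not necessarily the exact implementation of `calc_streak` in the **openintro** package, and the name `calc_streak_sketch` is hypothetical.

```{r calc-streak-sketch, eval=FALSE}
# A sketch: encode hits as 1 and misses as 0, pad with a miss on each
# end, and read off streak lengths as the gaps between consecutive misses
calc_streak_sketch <- function(x) {
  y <- c(0, as.numeric(x == "H"), 0)
  miss_positions <- which(y == 0)
  streak <- diff(miss_positions) - 1
  # the trailing pad creates one spurious zero-length streak when the
  # sequence ends on a miss, so drop it in that case
  if (x[length(x)] == "M") streak <- streak[-length(streak)]
  data.frame(length = streak)
}

calc_streak_sketch(c("H", "M", "M", "H", "H", "M", "M", "M", "M"))
# lengths: 1, 0, 2, 0, 0, 0 -- the six streaks from Game 1 above
```
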
81 | We can then take a look at the distribution of these streak lengths.
82 |
83 | ```{r plot-streak-kobe}
84 | ggplot(data = kobe_streak, aes(x = length)) +
85 | geom_bar()
86 | ```
87 |
88 | 1. Describe the distribution of Kobe's streak lengths from the 2009 NBA finals. What was his typical streak length? How long was his longest streak of baskets? Make sure to include the accompanying plot in your answer.
89 |
90 | ## Compared to What?
91 |
92 | We've shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had a hot hand?
93 | What can we compare them to?
94 |
95 | To answer these questions, let's return to the idea of *independence*.
96 | Two processes are independent if the outcome of one process doesn't affect the outcome of the second.
97 | If each shot that a player takes is an independent process, having made or missed your first shot will not affect the probability that you will make or miss your second shot.
98 |
99 | A shooter with a hot hand will have shots that are *not* independent of one another.
100 | Specifically, if the shooter makes his first shot, the hot hand model says he will have a *higher* probability of making his second shot.
101 |
102 | Let's suppose for a moment that the hot hand model is valid for Kobe.
103 | During his career, the percentage of time Kobe makes a basket (i.e. his shooting percentage) is about 45%, or in probability notation,
104 |
105 | $$ P(\textrm{shot 1 = H}) = 0.45 $$
106 |
107 | If he makes the first shot and has a hot hand (*not* independent shots), then the probability that he makes his second shot would go up to, let's say, 60%,
108 |
109 | $$ P(\textrm{shot 2 = H} \, | \, \textrm{shot 1 = H}) = 0.60 $$
110 |
111 | As a result of these increased probabilities, you'd expect Kobe to have longer streaks.
112 | Compare this to the skeptical perspective where Kobe does *not* have a hot hand, where each shot is independent of the next.
113 | If he hit his first shot, the probability that he makes the second is still 0.45.
114 |
115 | $$ P(\textrm{shot 2 = H} \, | \, \textrm{shot 1 = H}) = 0.45 $$
116 |
117 | In other words, making the first shot did nothing to affect the probability that he'd make his second shot.
118 | If Kobe's shots are independent, then he'd have the same probability of hitting every shot regardless of his past shots: 45%.
119 |
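You can check this claim with a quick simulation, sketched below using the `sample` function that is introduced in the next section: for an independent shooter, the proportion of hits immediately following a hit should be close to the overall hit rate of 45%.

```{r independence-check, eval=FALSE}
# A sketch: simulate 10,000 independent shots with a 45% hit rate
shots <- sample(c("H", "M"), size = 10000, replace = TRUE, prob = c(0.45, 0.55))
prev <- shots[-length(shots)]   # shots 1 through 9,999
curr <- shots[-1]               # shots 2 through 10,000
mean(curr[prev == "H"] == "H")  # hit rate after a hit; should be near 0.45
```
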
120 | Now that we've phrased the situation in terms of independent shots, let's return to the question: how do we tell if Kobe's shooting streaks are long enough to indicate that he has a hot hand?
121 | We can compare his streak lengths to someone without a hot hand: an independent shooter.
122 |
123 | ## Simulations in R
124 |
125 | While we don't have any data from a shooter we know to have independent shots, that sort of data is very easy to simulate in R.
126 | In a simulation, you set the ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules.
127 | As a simple example, you can simulate flipping a fair coin with the following.
128 |
129 | ```{r head-tail}
130 | coin_outcomes <- c("heads", "tails")
131 | sample(coin_outcomes, size = 1, replace = TRUE)
132 | ```
133 |
134 | The vector `coin_outcomes` can be thought of as a hat with two slips of paper in it: one slip says `heads` and the other says `tails`.
135 | The function `sample` draws one slip from the hat and tells us if it was a head or a tail.
136 |
137 | Run the second command listed above several times.
138 | Just like when flipping a coin, sometimes you'll get a heads, sometimes you'll get a tails, but in the long run, you'd expect to get roughly equal numbers of each.
139 |
140 | If you wanted to simulate flipping a fair coin 100 times, you could either run the function 100 times or, more simply, adjust the `size` argument, which governs how many samples to draw (the `replace = TRUE` argument indicates we put the slip of paper back in the hat before drawing again).
141 | Save the resulting vector of heads and tails in a new object called `sim_fair_coin`.
142 |
143 | ```{r sim-fair-coin}
144 | sim_fair_coin <- sample(coin_outcomes, size = 100, replace = TRUE)
145 | ```
146 |
147 | To view the results of this simulation, type the name of the object and then use `table` to count up the number of heads and tails.
148 |
149 | ```{r table-sim-fair-coin}
150 | sim_fair_coin
151 | table(sim_fair_coin)
152 | ```
153 |
154 | Since there are only two elements in `coin_outcomes`, the probability that we "flip" a coin and it lands heads is 0.5.
155 | Say we're trying to simulate an unfair coin that we know only lands heads 20% of the time.
156 | We can adjust for this by adding an argument called `prob`, which provides a vector of two probability weights.
157 |
158 | ```{r sim-unfair-coin}
159 | sim_unfair_coin <- sample(coin_outcomes, size = 100, replace = TRUE,
160 | prob = c(0.2, 0.8))
161 | ```
162 |
163 | `prob = c(0.2, 0.8)` indicates that for the two elements in the `coin_outcomes` vector, we want to select the first one, `heads`, with probability 0.2 and the second one, `tails`, with probability 0.8.
164 | Another way of thinking about this is to think of the outcome space as a bag of 10 chips, where 2 chips are labeled "head" and 8 chips "tail".
165 | Therefore, at each draw, the probability of drawing a chip that says "head" is 20%, and "tail" is 80%.
166 |
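If it helps, you can make the bag-of-chips analogy concrete with a short sketch: drawing with equal weights from a vector of 10 labeled chips behaves the same as sampling the two outcomes with `prob = c(0.2, 0.8)`.

```{r chips-sketch, eval=FALSE}
# A sketch of the bag-of-chips view: 2 "heads" chips and 8 "tails" chips
chips <- c(rep("heads", 2), rep("tails", 8))
table(sample(chips, size = 100, replace = TRUE))
```
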
167 | 1. In your simulation of flipping the unfair coin 100 times, how many flips came up heads? Include the code for sampling the unfair coin in your response. Since the markdown file will run the code and generate a new sample each time you *Knit* it, you should also "set a seed" **before** you sample. Read more about setting a seed below.
168 |
169 | ::: {#boxedtext}
170 | **A note on setting a seed:** Setting a seed will cause R to select the same sample each time you knit your document.
171 | This will make sure your results don't change each time you knit, and it will also ensure reproducibility of your work (by setting the same seed it will be possible to reproduce your results).
172 | You can set a seed like this:
173 |
174 | ```{r set-seed}
175 | set.seed(35797) # make sure to change the seed
176 | ```
177 |
178 | The number above is completely arbitrary.
179 | If you need inspiration, you can use your ID, birthday, or just a random string of numbers.
180 | The important thing is that you use each seed only once in a document.
181 | Remember to do this **before** you sample in the exercise above.
182 | :::
183 |
184 | In a sense, we've shrunken the size of the slip of paper that says "heads", making it less likely to be drawn, and we've increased the size of the slip of paper saying "tails", making it more likely to be drawn.
185 | When you simulated the fair coin, both slips of paper were the same size.
186 | This happens by default if you don't provide a `prob` argument; all elements in the `coin_outcomes` vector have an equal probability of being drawn.
187 |
188 | If you want to learn more about `sample` or any other function, recall that you can always check out its help file.
189 |
190 | ```{r help-sample,tidy = FALSE}
191 | ?sample
192 | ```
193 |
194 | ## Simulating the Independent Shooter
195 |
196 | Simulating a basketball player who has independent shots uses the same mechanism that you used to simulate a coin flip.
197 | To simulate a single shot from an independent shooter with a shooting percentage of 50% you can type
198 |
199 | ```{r sim-basket}
200 | shot_outcomes <- c("H", "M")
201 | sim_basket <- sample(shot_outcomes, size = 1, replace = TRUE)
202 | ```
203 |
204 | To make a valid comparison between Kobe and your simulated independent shooter, you need to align both their shooting percentage and the number of attempted shots.
205 |
206 | 1. What change needs to be made to the `sample` function so that it reflects a shooting percentage of 45%? Make this adjustment, then run a simulation to sample 133 shots. Assign the output of this simulation to a new object called `sim_basket`.
207 |
208 | Note that we've named the new vector `sim_basket`, the same name that we gave to the previous vector reflecting a shooting percentage of 50%.
209 | In this situation, R overwrites the old object with the new one, so always make sure that you don't need the information in an old vector before reassigning its name.
210 |
211 | With the results of the simulation saved as `sim_basket`, you have the data necessary to compare Kobe to our independent shooter.
212 |
213 | Both data sets represent the results of 133 shot attempts, each with the same shooting percentage of 45%.
214 | We know that our simulated data is from a shooter that has independent shots.
215 | That is, we know the simulated shooter does not have a hot hand.
216 |
217 | ------------------------------------------------------------------------
218 |
219 | ## More Practice
220 |
221 | ### Comparing Kobe Bryant to the Independent Shooter
222 |
223 | 1. Using `calc_streak`, compute the streak lengths of `sim_basket`, and save the results in a data frame called `sim_streak`.
224 |
225 | 2. Describe the distribution of streak lengths.
226 | What is the typical streak length for this simulated independent shooter with a 45% shooting percentage?
227 | How long is the player's longest streak of baskets in 133 shots?
228 | Make sure to include a plot in your answer.
229 |
230 | 3. If you were to run the simulation of the independent shooter a second time, how would you expect its streak distribution to compare to the distribution from the question above?
231 | Exactly the same?
232 | Somewhat similar?
233 | Totally different?
234 | Explain your reasoning.
235 |
236 | 4. How does Kobe Bryant's distribution of streak lengths compare to the distribution of streak lengths for the simulated shooter?
237 | Using this comparison, do you have evidence that the hot hand model fits Kobe's shooting patterns?
238 | Explain.
239 |
240 | ------------------------------------------------------------------------
241 |
242 | {style="border-width:0"} This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
243 |
--------------------------------------------------------------------------------
/04_normal_distribution/normal_distribution.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "The normal distribution"
3 | output:
4 | html_document:
5 | css: ../lab.css
6 | highlight: pygments
7 | theme: cerulean
8 | toc: true
9 | toc_float: true
10 | ---
11 |
12 | ```{r echo = FALSE}
13 | knitr::opts_chunk$set(eval = TRUE, results = FALSE, fig.show = "hide", message = FALSE)
14 | ```
15 |
16 | In this lab, you'll investigate the probability distribution that is most central to statistics: the normal distribution.
17 | If you are confident that your data are nearly normal, that opens the door to many powerful statistical methods.
18 | Here we'll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.
19 |
20 | ## Getting Started
21 |
22 | ### Load packages
23 |
24 | In this lab, we will explore and visualize the data using the **tidyverse** suite of packages as well as the **openintro** package.
25 |
26 | Let's load the packages.
27 |
28 | ```{r load-packages, message=FALSE}
29 | library(tidyverse)
30 | library(openintro)
31 | ```
32 |
33 | ### Creating a reproducible lab report
34 |
35 | To create your new lab report, in RStudio, go to New File -\> R Markdown... Then, choose From Template and then choose `Lab Report for OpenIntro Statistics Labs` from the list of templates.
36 |
37 | ### The data
38 |
39 | This week you'll be working with fast food data.
40 | This data set contains data on 515 menu items from some of the most popular fast food restaurants worldwide.
41 | Let's take a quick peek at the first few rows of the data.
42 |
43 | You can use either `glimpse`, as before, or `head` to do this.
44 |
45 | ```{r load-data, results=TRUE}
46 | library(tidyverse)
47 | library(openintro)
48 | head(fastfood)
49 | ```
50 |
51 | You'll see that for every observation there are 17 measurements, many of which are nutritional facts.
52 |
53 | You'll be focusing on just three columns to get started: `restaurant`, `calories`, and `cal_fat` (calories from fat).
54 |
55 | Let's first focus on just products from McDonalds and Dairy Queen.
56 |
57 | ```{r mcdonalds-dairy-queen}
58 | mcdonalds <- fastfood %>%
59 | filter(restaurant == "Mcdonalds")
60 | dairy_queen <- fastfood %>%
61 | filter(restaurant == "Dairy Queen")
62 | ```
63 |
64 | 1. Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?
65 |
66 | ## The normal distribution
67 |
68 | In your description of the distributions, did you use words like *bell-shaped* or *normal*?
69 | It's tempting to say so when faced with a unimodal symmetric distribution.
70 |
71 | To see how accurate that description is, you can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution.
72 | This normal curve should have the same mean and standard deviation as the data.
73 | You'll be focusing on calories from fat from Dairy Queen products, so let's store them as a separate object and then calculate some statistics that will be referenced later.
74 |
75 | ```{r dq-mean-sd}
76 | dqmean <- mean(dairy_queen$cal_fat)
77 | dqsd <- sd(dairy_queen$cal_fat)
78 | ```
79 |
80 | Next, you make a density histogram to use as the backdrop and use the `lines` function to overlay a normal probability curve.
81 | The difference between a frequency histogram and a density histogram is that while in a frequency histogram the *heights* of the bars add up to the total number of observations, in a density histogram the *areas* of the bars add up to 1.
82 | The area of each bar can be calculated as simply the height *times* the width of the bar.
83 | Using a density histogram allows us to properly overlay a normal distribution curve over the histogram since the curve is a normal probability density function that also has area under the curve of 1.
84 | Frequency and density histograms both display the same exact shape; they only differ in their y-axis.
85 | You can verify this by comparing the frequency histogram you constructed earlier and the density histogram created by the commands below.
86 |
87 | ```{r hist-height}
88 | ggplot(data = dairy_queen, aes(x = cal_fat)) +
89 | geom_blank() +
90 | geom_histogram(aes(y = ..density..)) +
91 | stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
92 | ```
93 |
94 | After we initialize a blank plot with `geom_blank()`, the `ggplot2` package (within the `tidyverse`) allows us to add additional layers.
95 | The first layer is a density histogram.
96 | The second layer is a statistical function -- the density of the normal curve, `dnorm`.
97 | We specify that we want the curve to have the same mean and standard deviation as the column of calories from fat.
98 | The argument `col` simply sets the color for the line to be drawn.
99 | If we left it out, the line would be drawn in black.
100 |
101 | 2. Based on this plot, does it appear that the data follow a nearly normal distribution?
102 |
103 | ## Evaluating the normal distribution
104 |
105 | Eyeballing the shape of the histogram is one way to determine if the data appear to be nearly normally distributed, but it can be frustrating to decide just how close the histogram is to the curve.
106 | An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for "quantile-quantile".
107 |
108 | ```{r qq}
109 | ggplot(data = dairy_queen, aes(sample = cal_fat)) +
110 | geom_line(stat = "qq")
111 | ```
112 |
113 | This time, you can use the `geom_line()` layer, while specifying that you will be creating a Q-Q plot with the `stat` argument.
114 | It's important to note that here, instead of using `x` inside `aes()`, you need to use `sample`.
115 |
116 | The x-axis values correspond to the quantiles of a theoretically normal curve with mean 0 and standard deviation 1 (i.e., the standard normal distribution).
117 | The y-axis values correspond to the quantiles of the original unstandardized sample data.
118 | However, even if we were to standardize the sample data values, the Q-Q plot would look identical.
119 | A data set that is nearly normal will result in a probability plot where the points closely follow a diagonal line.
120 | Any deviations from normality lead to deviations of these points from that line.
121 |
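You can verify the claim about standardizing with a short sketch: Z-scoring `cal_fat` changes only the scale of the y-axis, not the shape of the plot.

```{r qq-standardized, eval=FALSE}
# A sketch: the Q-Q plot of the standardized data has the same shape
dairy_queen %>%
  mutate(cal_fat_std = (cal_fat - dqmean) / dqsd) %>%
  ggplot(aes(sample = cal_fat_std)) +
  geom_line(stat = "qq")
```
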
122 | The plot for Dairy Queen's calories from fat shows points that tend to follow the line but with some errant points towards the upper tail.
123 | You're left with the same problem that we encountered with the histogram above: how close is close enough?
124 |
125 | A useful way to address this question is to rephrase it as: what do probability plots look like for data that I *know* came from a normal distribution?
126 | We can answer this by simulating data from a normal distribution using `rnorm`.
127 |
128 | ```{r sim-norm}
129 | sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
130 | ```
131 |
132 | The first argument indicates how many numbers you'd like to generate, which we specify to be the same number of menu items in the `dairy_queen` data set using the `nrow()` function.
133 | The last two arguments determine the mean and standard deviation of the normal distribution from which the simulated sample will be generated.
134 | You can take a look at the shape of our simulated data set, `sim_norm`, as well as its normal probability plot.
135 |
136 | 3. Make a normal probability plot of `sim_norm`. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since `sim_norm` is not a dataframe, it can be put directly into the `sample` argument and the `data` argument can be dropped.)
137 |
138 | Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function.
139 | It shows the Q-Q plot corresponding to the original data in the top left corner, and the Q-Q plots of 8 different data sets simulated from a normal distribution.
140 | It may be helpful to click the zoom button in the plot window.
141 |
142 | ```{r qqnormsim}
143 | qqnormsim(sample = cal_fat, data = dairy_queen)
144 | ```
145 |
146 | 4. Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data?
147 | That is, do the plots provide evidence that the calories from fat are nearly normal?
148 |
149 | 5. Using the same technique, determine whether or not the calories from McDonald's menu appear to come from a normal distribution.
150 |
151 | ## Normal probabilities
152 |
153 | Okay, so now you have a slew of tools to judge whether or not a variable is normally distributed.
154 | Why should you care?
155 |
156 | It turns out that statisticians know a lot about the normal distribution.
157 | Once you decide that a random variable is approximately normal, you can answer all sorts of questions about that variable related to probability.
158 | Take, for example, the question of, "What is the probability that a randomly chosen Dairy Queen product has more than 600 calories from fat?"
159 |
160 | If we assume that the calories from fat from Dairy Queen's menu are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table).
161 | In R, this is done in one step with the function `pnorm()`.
162 |
163 | ```{r pnorm}
164 | 1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
165 | ```
166 |
167 | Note that the function `pnorm()` gives the area under the normal curve below a given value, `q`, with a given mean and standard deviation.
168 | Since we're interested in the probability that a Dairy Queen item has more than 600 calories from fat, we have to take one minus that probability.
169 |
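As a sketch of the equivalence, the Z-score route described above gives the same answer, and `pnorm()` can also return the upper-tail area directly through its `lower.tail` argument.

```{r pnorm-alternatives, eval=FALSE}
z <- (600 - dqmean) / dqsd  # Z score of 600 calories from fat
1 - pnorm(z)                # upper tail of the standard normal
pnorm(600, mean = dqmean, sd = dqsd, lower.tail = FALSE)  # upper tail directly
```
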
170 | Assuming a normal distribution has allowed us to calculate a theoretical probability.
171 | If we want to calculate the probability empirically, we simply need to determine how many observations fall above 600 then divide this number by the total sample size.
172 |
173 | ```{r probability}
174 | dairy_queen %>%
175 | filter(cal_fat > 600) %>%
176 | summarise(percent = n() / nrow(dairy_queen))
177 | ```
178 |
179 | Although the probabilities are not exactly the same, they are reasonably close.
180 | The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be.
181 |
182 | 6. Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?
183 |
184 | ------------------------------------------------------------------------
185 |
186 | ## More Practice
187 |
188 | 7. Now let's consider some of the other variables in the dataset.
189 | Out of all the different restaurants, which one's distribution is the closest to normal for sodium?
190 |
191 | 8. Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern.
192 | Why do you think this might be the case?
193 |
194 | 9. As you can see, normal probability plots can be used both to assess normality and visualize skewness.
195 | Make a normal probability plot for the total carbohydrates from a restaurant of your choice.
196 | Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed?\
197 | Use a histogram to confirm your findings.
198 |
199 | ------------------------------------------------------------------------
200 |
201 | {style="border-width:0"} This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
202 |
--------------------------------------------------------------------------------
/05a_sampling_distributions/README.md:
--------------------------------------------------------------------------------
1 | Deployed app at https://openintro.shinyapps.io/sampling_distributions/.
2 |
--------------------------------------------------------------------------------
/05a_sampling_distributions/more/AmesHousing.xls:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/OpenIntroStat/oilabs-tidy/5e5703a7ecf076f0de5f7d90b436977f0d6c4ae0/05a_sampling_distributions/more/AmesHousing.xls
--------------------------------------------------------------------------------
/05a_sampling_distributions/more/OLD_sampling_distributions.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Foundations for statistical inference - Sampling distributions"
3 | runtime: shiny
4 | output:
5 | html_document:
6 | css: www/lab.css
7 | highlight: pygments
8 | theme: cerulean
9 | toc: true
10 | toc_float: true
11 | ---
12 |
13 | ```{r global_options, include=FALSE}
14 | knitr::opts_chunk$set(eval = FALSE)
15 | library(tidyverse)
16 | library(openintro)
17 | library(infer)
18 | ```
19 |
20 | In this lab, you will investigate the ways in which the statistics from a random
21 | sample of data can serve as point estimates for population parameters. We're
22 | interested in formulating a *sampling distribution* of our estimate in order
23 | to learn about the properties of the estimate, such as its distribution.
24 |
25 |
26 | **Setting a seed:** We will take some random samples and build sampling distributions
27 | in this lab, which means you should set a seed at the start of your lab. If this
28 | concept is new to you, review the lab concerning probability.
29 |
30 |
31 | ## Getting Started
32 |
33 | ### Load packages
34 |
35 | In this lab, we will explore and visualize the data using the **tidyverse** suite of packages.
36 | The data can be found in the companion package for OpenIntro resources, **openintro**.
37 | Lastly, we will use the **infer** package for resampling.
38 |
39 | Let's load the packages.
40 |
41 | ```{r load-packages, message=FALSE}
42 | library(tidyverse)
43 | library(openintro)
44 | library(infer)
45 | ```
46 |
47 | ### Creating a reproducible lab report
48 |
49 | To create your new lab report, start by opening a new R Markdown document... From
50 | Template... then select Lab Report from the **openintro** package.
51 |
52 | ### The data
53 |
54 | You will be analyzing real estate data from the city of Ames, Iowa. The details of
55 | every real estate transaction in Ames are recorded by the City Assessor's
56 | office. Your particular focus for this lab will be all residential home sales
57 | in Ames between 2006 and 2010. This collection represents your population of
58 | interest. In this lab, you will learn about these home sales by taking
59 | smaller samples from the full population. Let's load the data.
60 |
61 | ```{r load-data}
62 | data(ames)
63 | ```
64 |
65 | You can see that there are quite a few variables in the data set, enough to do a
66 | very in-depth analysis. For this lab, you will focus on just two of the variables:
67 | the above ground living area of the house in square feet (`area`) and the sale
68 | price (`price`).
69 |
70 | You can explore the distribution of areas of homes in the population of home
71 | sales visually and with summary statistics. Let's first create a visualization,
72 | a histogram:
73 |
74 | ```{r area-hist}
75 | ggplot(data = ames, aes(x = area)) +
76 | geom_histogram(binwidth = 250)
77 | ```
78 |
79 | Let's also obtain some summary statistics. Note that you can do this using the
80 | `summarise` function. You can calculate as many statistics as you want using this
81 | function, and just combine the results. Some of the functions below should
82 | be self explanatory (like `mean`, `median`, `sd`, `IQR`, `min`, and `max`). A
83 | new function here is the `quantile` function, which you can use to calculate
84 | values corresponding to specific percentile cutoffs in the distribution. For
85 | example `quantile(x, 0.25)` will yield the cutoff value for the 25th percentile (Q1)
86 | in the distribution of `x`. Finding these values is useful for describing the
87 | distribution, as you can use them for descriptions like *"the middle 50% of the
88 | homes have areas between such and such square feet"*.
89 |
90 | ```{r area-stats}
91 | ames %>%
92 | summarise(mu = mean(area), pop_med = median(area),
93 | sigma = sd(area), pop_iqr = IQR(area),
94 | pop_min = min(area), pop_max = max(area),
95 | pop_q1 = quantile(area, 0.25), # first quartile, 25th percentile
96 | pop_q3 = quantile(area, 0.75)) # third quartile, 75th percentile
97 | ```
98 |
99 | 1. Describe this population distribution using a visualization and the summary
100 | statistics listed above. You don't have to use all of the summary statistics
101 | in your description, and you will need to decide which ones are relevant based
102 | on the shape of the distribution. Make sure to include the plot and the summary
103 | statistics output in your report along with your narrative.
104 |
105 | ## The unknown sampling distribution
106 |
107 | In this lab, you have access to the entire population, but this is rarely the
108 | case in real life. Gathering information on an entire population is often
109 | extremely costly or impossible. Because of this, we often take a sample of
110 | the population and use that to understand the properties of the population.
111 |
112 | If you were interested in estimating the mean living area in Ames based on a
113 | sample, you can use the `sample_n` command to survey the population.
114 |
115 | ```{r samp1}
116 | samp1 <- ames %>%
117 | sample_n(50)
118 | ```
119 |
120 | This command collects a simple random sample of size 50 from the `ames` dataset,
121 | and assigns the result to `samp1`. This is similar to going into the City
122 | Assessor's database and pulling up the files on 50 random home sales. Working
123 | with these 50 files is considerably simpler than working with all 2930 home sales.
124 |
125 | 1. Describe the distribution of area in this sample. How does it compare to the
126 | distribution of the population? **Hint:** Although the `sample_n` function
127 | takes a random sample of observations (i.e. rows) from the dataset, you can
128 | still refer to the variables in the dataset with the same names. Code you used
129 | in the previous exercise will also be helpful for visualizing and summarizing
130 | the sample, however be careful to not label values `mu` and `sigma` anymore
131 | since these are sample statistics, not population parameters. You can customize
132 | the labels of any of the statistics to indicate that these come from the sample.
133 |
134 | If you're interested in estimating the average living area in homes in Ames
135 | using the sample, your best single guess is the sample mean.
136 |
137 | ```{r mean-samp1}
138 | samp1 %>%
139 | summarise(x_bar = mean(area))
140 | ```
141 |
142 | Depending on which 50 homes you selected, your estimate could be a bit above
143 | or a bit below the true population mean of `r round(mean(ames$area),2)` square
144 | feet. In general, though, the sample mean turns out to be a pretty good estimate
145 | of the average living area, and you were able to get it by sampling less than 3\%
146 | of the population.
147 |
148 | 1. Would you expect the mean of your sample to match the mean of another team's
149 | sample? Why, or why not? If the answer is no, would you expect the means to
150 | just be somewhat different or very different? Ask a neighboring team to confirm
151 | your answer.
152 |
153 | 1. Take a second sample, also of size 50, and call it `samp2`. How does the
154 | mean of `samp2` compare with the mean of `samp1`? Suppose we took two
155 | more samples, one of size 100 and one of size 1000. Which would you think
156 | would provide a more accurate estimate of the population mean?
157 |
158 | Not surprisingly, every time you take another random sample, you get a different
159 | sample mean. It's useful to get a sense of just how much variability you
160 | should expect when estimating the population mean this way. The distribution
161 | of sample means, called the *sampling distribution (of the mean)*, can help you
162 | understand this variability. In this lab, because you have access to the population,
163 | you can build up the sampling distribution for the sample mean by repeating the
164 | above steps many times. Here, you will generate 15,000 samples and compute the
165 | sample mean of each. Note that we specify that `replace = TRUE` since sampling
166 | distributions are constructed by sampling with replacement.
167 |
168 | ```{r loop}
169 | sample_means50 <- ames %>%
170 | rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
171 | summarise(x_bar = mean(area))
172 |
173 | ggplot(data = sample_means50, aes(x = x_bar)) +
174 | geom_histogram(binwidth = 20)
175 | ```
176 |
177 | Here, we use R to take 15,000 different samples of size 50 from the population,
178 | calculate the mean of each sample, and store each result in a data frame called
179 | `sample_means50`. Next, you will review how this set of code works.
180 |
181 | 1. How many elements are there in `sample_means50`? Describe the sampling
182 | distribution, and be sure to specifically note its center. Make sure to include
183 | a plot of the distribution in your answer.
184 |
185 | ## Interlude: Sampling distributions
186 |
187 | The idea behind the `rep_sample_n` function is *repetition*. Earlier, you took
188 | a single sample of size `n` (50) from the population of all houses in Ames. With
189 | this new function, you can repeat this sampling procedure `reps` times in order
190 | to build a distribution of a series of sample statistics, which is called the
191 | **sampling distribution**.
192 |
193 | Note that in practice one rarely gets to build true sampling distributions,
194 | because one rarely has access to data from the entire population.
195 |
196 | Without the `rep_sample_n` function, this would be painful. We would have to
197 | manually run the following code 15,000 times
198 | ```{r sample-code, eval=FALSE}
199 | ames %>%
200 | sample_n(size = 50) %>%
201 | summarise(x_bar = mean(area))
202 | ```
203 | as well as store the resulting sample means each time in a separate vector.
204 |
205 | Note that for each of the 15,000 times we computed a mean, we did so from a
206 | **different** sample!
207 |
208 | 1. To make sure you understand how sampling distributions are built, and exactly
209 | what the `rep_sample_n` function does, try modifying the code to create a
210 | sampling distribution of **25 sample means** from **samples of size 10**,
211 | and put them in a data frame named `sample_means_small`. Print the output.
212 | How many observations are there in this object called `sample_means_small`?
213 | What does each observation represent?
214 |
215 | ## Sample size and the sampling distribution
216 |
217 | Mechanics aside, let's return to the reason we used the `rep_sample_n` function: to
218 | compute a sampling distribution, specifically, the sampling distribution of the
219 | mean home area for samples of 50 houses.
220 |
221 | ```{r hist}
222 | ggplot(data = sample_means50, aes(x = x_bar)) +
223 | geom_histogram(binwidth = 20)
224 | ```
225 |
226 | The sampling distribution that you computed tells you much about estimating
227 | the average living area in homes in Ames. Because the sample mean is an
228 | unbiased estimator, the sampling distribution is centered at the true average
229 | living area of the population, and the spread of the distribution
230 | indicates how much variability is incurred by sampling only 50 home sales.
231 |
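Since you have access to the population here, you can check the centering claim directly; a quick sketch:

```{r check-center}
mean(sample_means50$x_bar)  # center of the sampling distribution
mean(ames$area)             # population mean; the two should be very close
```
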
232 | In the remainder of this section, you will work on getting a sense of the effect
233 | that sample size has on your sampling distribution.
234 |
235 | 1. Use the app below to create sampling distributions of means of `area`s from
236 | samples of size 10, 50, and 100. Use 5,000 simulations. What does each
237 | observation in the sampling distribution represent? How do the mean, standard
238 | error, and shape of the sampling distribution change as the sample size
239 | increases? How (if at all) do these values change if you increase the number
240 | of simulations? (You do not need to include plots in your answer.)
241 |
242 | ```{r shiny, echo=FALSE, eval=TRUE}
243 | shinyApp(
244 | ui <- fluidPage(
245 |
246 | # Sidebar with a slider input for number of bins
247 | sidebarLayout(
248 | sidebarPanel(
249 |
250 | selectInput("selected_var",
251 | "Variable:",
252 | choices = list("area", "price"),
253 | selected = "area"),
254 |
255 | numericInput("n_samp",
256 | "Sample size:",
257 | min = 1,
258 | max = nrow(ames),
259 | value = 30),
260 |
261 | numericInput("n_sim",
262 | "Number of samples:",
263 | min = 1,
264 | max = 30000,
265 | value = 15000)
266 |
267 | ),
268 |
269 | # Show a plot of the generated distribution
270 | mainPanel(
271 | plotOutput("sampling_plot"),
272 | verbatimTextOutput("sampling_mean"),
273 | verbatimTextOutput("sampling_se")
274 | )
275 | )
276 | ),
277 |
278 | # Define server logic required to draw a histogram
279 | server <- function(input, output) {
280 |
281 | # create sampling distribution
282 | sampling_dist <- reactive({
283 | ames[[input$selected_var]] %>%
284 | sample(size = input$n_samp * input$n_sim, replace = TRUE) %>%
285 | matrix(ncol = input$n_samp) %>%
286 | rowMeans() %>%
287 | data.frame(x_bar = .)
288 | })
289 |
290 | # plot sampling distribution
291 | output$sampling_plot <- renderPlot({
292 | x_min <- quantile(ames[[input$selected_var]], 0.1)
293 | x_max <- quantile(ames[[input$selected_var]], 0.9)
294 |
295 | ggplot(sampling_dist(), aes(x = x_bar)) +
296 | geom_histogram() +
297 | xlim(x_min, x_max) +
298 | ylim(0, input$n_sim * 0.35) +
299 | ggtitle(paste0("Sampling distribution of mean ",
300 | input$selected_var, " (n = ", input$n_samp, ")")) +
301 | xlab(paste("mean", input$selected_var)) +
302 | theme(plot.title = element_text(face = "bold", size = 16))
303 | })
304 |
305 | # mean of sampling distribution
306 | output$sampling_mean <- renderText({
307 | paste0("mean of sampling distribution = ", round(mean(sampling_dist()$x_bar), 2))
308 | })
309 |
310 |   # SE of sampling distribution
311 | output$sampling_se <- renderText({
312 | paste0("SE of sampling distribution = ", round(sd(sampling_dist()$x_bar), 2))
313 | })
314 | },
315 |
316 | options = list(height = 500)
317 | )
318 | ```
319 |
320 |
321 | * * *
322 |
323 | ## More Practice
324 |
325 | So far, you have only focused on estimating the mean living area in homes in
326 | Ames. Now, you'll try to estimate the mean home price.
327 |
328 | Note that while you might be able to answer some of these questions using the app,
329 | you are expected to write the required code and produce the necessary plots and
330 | summary statistics. You are welcome to use the app for exploration.
331 |
332 | 1. Take a sample of size 15 from the population and calculate the mean `price`
333 | of the homes in this sample. Using this sample, what is your best point estimate
334 | of the population mean of prices of homes?
335 |
336 | 1. Since you have access to the population, simulate the sampling
337 | distribution of $\overline{price}$ for samples of size 15 by taking 2000
338 | samples from the population of size 15 and computing 2000 sample means.
339 | Store these means in a vector called `sample_means15`. Plot the data, then
340 | describe the shape of this sampling distribution. Based on this sampling
341 | distribution, what would you guess the mean home price of the population to
342 | be? Finally, calculate and report the population mean.
343 |
344 | 1. Change your sample size from 15 to 150, then compute the sampling
345 | distribution using the same method as above, and store these means in a
346 | new vector called `sample_means150`. Describe the shape of this sampling
347 | distribution and compare it to the sampling distribution for a sample
348 | size of 15. Based on this sampling distribution, what would you guess to
349 | be the mean sale price of homes in Ames?
350 |
351 | 1. Of the sampling distributions from 2 and 3, which has a smaller spread? If
352 | you're concerned with making estimates that are more often close to the
353 | true value, would you prefer a sampling distribution with a large or small spread?
354 |
355 |
356 |
357 | This is a product of OpenIntro that is released under a [Creative Commons
358 | Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0).
359 | This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
360 |
--------------------------------------------------------------------------------
/05a_sampling_distributions/more/README.md:
--------------------------------------------------------------------------------
1 | This folder contains a previous version of this lab that introduces the CLT via means.
--------------------------------------------------------------------------------
/05a_sampling_distributions/more/ames-readme.txt:
--------------------------------------------------------------------------------
1 | Data: ames
2 |
3 | To access the data in R, type
4 |
5 | read.csv("http://www.openintro.org/stat/data/ames.csv")
6 |
7 | Description
8 | Data on all residential home sales in Ames, Iowa between 2006 and 2010. The data set contains many explanatory variables on the quality and quantity of physical attributes of residential homes in Iowa sold between 2006 and 2010. Most of the variables describe information a typical home buyer would like to know about a property (square footage, number of bedrooms and bathrooms, size of lot, etc.). A detailed discussion of variables can be found in the original paper referenced below.
9 |
10 | Source: De Cock, Journal of Statistics Education Volume 19, Number 3(2011), www.amstat.org/publications/jse/v19n3/decock.pdf.
11 |
12 | Format
13 | A data frame with 2930 observations and 82 variables. A description of all variables can be found at http://www.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt.
--------------------------------------------------------------------------------
/05a_sampling_distributions/more/ames_dataprep.R:
--------------------------------------------------------------------------------
1 | ames <- read.delim("http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt")
2 |
3 | write.csv(ames, "ames.csv", row.names = FALSE)
4 |
5 | # rooms <- ames$TotRms.AbvGrd
6 | area <- ames$Gr.Liv.Area
7 | price <- ames$SalePrice
8 |
9 | ###############
10 |
11 | sample.means50 <- rep(0, 5000)
12 |
13 | for(i in 1:5000){
14 | samp <- sample(area, 50)
15 | sample.means50[i] <- mean(samp)
16 | }
17 |
18 |
19 | sample.means10 <- rep(0, 5000)
20 |
21 | for(i in 1:5000){
22 | samp <- sample(area, 10)
23 | sample.means10[i] <- mean(samp)
24 | }
25 |
26 |
27 | sample.means100 <- rep(0, 5000)
28 |
29 | for(i in 1:5000){
30 |   samp <- sample(area, 100)  # sample size 100, matching sample.means100
31 | sample.means100[i] <- mean(samp)
32 | }
33 |
34 | par(mfrow = c(3,1))
35 | hist(sample.means10, breaks = 20, xlim = range(sample.means10))
36 | hist(sample.means50, breaks = 20, xlim = range(sample.means10))
37 | hist(sample.means100, breaks = 20, xlim = range(sample.means10))
38 |
39 |
40 | sample.means10 <- rep(0, 5000)
41 |
42 | for(i in 1:5000){
43 | samp <- sample(price, 10)
44 | sample.means10[i] <- mean(samp)
45 | }
46 |
47 | sample.means50 <- rep(0, 5000)
48 |
49 | for(i in 1:5000){
50 | samp <- sample(price, 50)
51 | sample.means50[i] <- mean(samp)
52 | }
53 |
54 | sample.means150 <- rep(0, 5000)
55 |
56 | for(i in 1:5000){
57 | samp <- sample(price, 150)
58 | sample.means150[i] <- mean(samp)
59 | }
60 |
61 | par(mfrow = c(3,1))
62 | hist(sample.means10, breaks = 20, xlim = range(sample.means10))
63 | hist(sample.means50, breaks = 20, xlim = range(sample.means50))
64 | hist(sample.means150, breaks = 20, xlim = range(sample.means50))
--------------------------------------------------------------------------------
/05a_sampling_distributions/sampling_distributions.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Foundations for statistical inference - Sampling distributions"
3 | runtime: shiny
4 | output:
5 | html_document:
6 | css: www/lab.css
7 | highlight: pygments
8 | theme: cerulean
9 | toc: true
10 | toc_float: true
11 | ---
12 |
13 | ```{r global_options, include=FALSE}
14 | knitr::opts_chunk$set(eval = TRUE, results = FALSE, fig.show = "hide", message = FALSE)
15 | set.seed(1234)
16 | ```
17 |
18 | In this lab, you will investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters.
19 | We're interested in formulating a *sampling distribution* of our estimate in order to learn about the properties of the estimate, such as its distribution.
20 |
21 | ::: {#boxedtext}
22 | **Setting a seed:** We will take some random samples and build sampling distributions in this lab, which means you should set a seed at the start of your lab.
23 | If this concept is new to you, review the lab on probability.
24 | :::
25 |
26 | ## Getting Started
27 |
28 | ### Load packages
29 |
30 | In this lab, we will explore and visualize the data using the **tidyverse** suite of packages.
31 | We will also use the **infer** package for resampling.
32 |
33 | Let's load the packages.
34 |
35 | ```{r load-packages, message=FALSE}
36 | library(tidyverse)
37 | library(openintro)
38 | library(infer)
39 | ```
40 |
41 | ### Creating a reproducible lab report
42 |
43 | To create your new lab report, in RStudio, go to New File -\> R Markdown... Then, choose From Template and then choose `Lab Report for OpenIntro Statistics Labs` from the list of templates.
44 |
45 | ### The data
46 |
47 | A 2019 Gallup report states the following:
48 |
49 | > The premise that scientific progress benefits people has been embodied in discoveries throughout the ages -- from the development of vaccinations to the explosion of technology in the past few decades, resulting in billions of supercomputers now resting in the hands and pockets of people worldwide.
50 | > Still, not everyone around the world feels science benefits them personally.
51 | >
52 | > **Source:** [World Science Day: Is Knowledge Power?](https://news.gallup.com/opinion/gallup/268121/world-science-day-knowledge-power.aspx)
53 |
54 | The Wellcome Global Monitor finds that 20% of people globally do not believe that the work scientists do benefits people like them.
55 | In this lab, you will assume this 20% is a true population proportion and learn about how sample proportions can vary from sample to sample by taking smaller samples from the population.
56 | We will first create our population assuming a population size of 100,000.
57 | This means 20,000 (20%) of the population think the work scientists do does not benefit them personally and the remaining 80,000 think it does.
58 |
59 | ```{r}
60 | global_monitor <- tibble(
61 | scientist_work = c(rep("Benefits", 80000), rep("Doesn't benefit", 20000))
62 | )
63 | ```
64 |
65 | The name of the data frame is `global_monitor` and the name of the variable that contains responses to the question *"Do you believe that the work scientists do benefits people like you?"* is `scientist_work`.
66 |
67 | We can quickly visualize the distribution of these responses using a bar plot.
68 |
69 | ```{r bar-plot-pop, fig.height=2.5, fig.width=10}
70 | ggplot(global_monitor, aes(x = scientist_work)) +
71 | geom_bar() +
72 | labs(
73 | x = "", y = "",
74 | title = "Do you believe that the work scientists do benefit people like you?"
75 | ) +
76 | coord_flip()
77 | ```
78 |
79 | We can also obtain summary statistics to confirm we constructed the data frame correctly.
80 |
81 | ```{r summ-stat-pop, results = TRUE}
82 | global_monitor %>%
83 | count(scientist_work) %>%
84 | mutate(p = n /sum(n))
85 | ```
86 |
87 | ## The unknown sampling distribution
88 |
89 | In this lab, you have access to the entire population, but this is rarely the case in real life.
90 | Gathering information on an entire population is often extremely costly or impossible.
91 | Because of this, we often take a sample of the population and use that to understand the properties of the population.
92 |
93 | If you are interested in estimating the proportion of people who don't think the work scientists do benefits them, you can use the `sample_n` command to survey the population.
94 |
95 | ```{r samp1}
96 | samp1 <- global_monitor %>%
97 | sample_n(50)
98 | ```
99 |
100 | This command collects a simple random sample of size 50 from the `global_monitor` dataset, and assigns the result to `samp1`.
101 | This is similar to randomly drawing names from a hat that contains the names of everyone in the population.
102 | Working with these 50 names is considerably simpler than working with all 100,000 people in the population.
103 |
104 | 1. Describe the distribution of responses in this sample. How does it compare to the distribution of responses in the population? **Hint:** Although the `sample_n` function takes a random sample of observations (i.e. rows) from the dataset, you can still refer to the variables in the dataset with the same names. The code presented earlier for visualizing and summarizing the population data will still be useful for the sample; however, be careful not to label your proportion `p`, since you're now calculating a sample statistic, not a population parameter. You can customize the label of the statistic to indicate that it comes from the sample.
105 |
106 | If you're interested in estimating the proportion of all people who do not believe that the work scientists do benefits them, but you do not have access to the population data, your best single guess is the sample proportion.
107 |
108 | ```{r phat-samp1}
109 | samp1 %>%
110 | count(scientist_work) %>%
111 | mutate(p_hat = n /sum(n))
112 | ```
113 |
114 | ```{r inline-calc, include=FALSE}
115 | # For use inline below
116 | samp1_p_hat <- samp1 %>%
117 | count(scientist_work) %>%
118 | mutate(p_hat = n /sum(n)) %>%
119 | filter(scientist_work == "Doesn't benefit") %>%
120 | pull(p_hat) %>%
121 | round(2)
122 | ```
123 |
124 | Depending on which 50 people you selected, your estimate (`r samp1_p_hat` in this sample) could be a bit above or a bit below the true population proportion of 0.20.
125 | In general, though, the sample proportion turns out to be a pretty good estimate of the true population proportion, and you were able to get it by sampling less than 1% of the population.
126 |
127 | 1. Would you expect the sample proportion to match the sample proportion of another student's sample?
128 | Why, or why not?
129 | If the answer is no, would you expect the proportions to be somewhat different or very different?
130 |     Ask a classmate to confirm your answer.
131 |
132 | 2. Take a second sample, also of size 50, and call it `samp2`.
133 | How does the sample proportion of `samp2` compare with that of `samp1`?
134 | Suppose we took two more samples, one of size 100 and one of size 1000.
135 |     Which do you think would provide a more accurate estimate of the population proportion?
136 |
137 | Not surprisingly, every time you take another random sample, you might get a different sample proportion.
138 | It's useful to get a sense of just how much variability you should expect when estimating the population proportion this way.
139 | The distribution of sample proportions, called the *sampling distribution (of the proportion)*, can help you understand this variability.
140 | In this lab, because you have access to the population, you can build up the sampling distribution for the sample proportion by repeating the above steps many times.
141 | Here, we use R to take 15,000 different samples of size 50 from the population, calculate the proportion of responses in each sample, filter for only the *Doesn't benefit* responses, and store each result in a data frame called `sample_props50`.
142 | Note that we specify that `replace = TRUE` since sampling distributions are constructed by sampling with replacement.
143 |
144 | ```{r iterate}
145 | sample_props50 <- global_monitor %>%
146 | rep_sample_n(size = 50, reps = 15000, replace = TRUE) %>%
147 | count(scientist_work) %>%
148 | mutate(p_hat = n /sum(n)) %>%
149 | filter(scientist_work == "Doesn't benefit")
150 | ```
151 |
152 | And we can visualize the distribution of these proportions with a histogram.
153 |
154 | ```{r fig.show="hide"}
155 | ggplot(data = sample_props50, aes(x = p_hat)) +
156 | geom_histogram(binwidth = 0.02) +
157 | labs(
158 | x = "p_hat (Doesn't benefit)",
159 | title = "Sampling distribution of p_hat",
160 | subtitle = "Sample size = 50, Number of samples = 15000"
161 | )
162 | ```
163 |
164 | Next, you will review how this set of code works.
165 |
166 | 1. How many elements are there in `sample_props50`? Describe the sampling distribution, and be sure to specifically note its center. Make sure to include a plot of the distribution in your answer.
167 |
168 | ## Interlude: Sampling distributions
169 |
170 | The idea behind the `rep_sample_n` function is *repetition*.
171 | Earlier, you took a single sample of size `n` (50) from the population of all people.
172 | With this new function, you can repeat this sampling procedure `reps` times in order to build a distribution of a series of sample statistics, which is called the **sampling distribution**.
173 |
174 | Note that in practice one rarely gets to build true sampling distributions, because one rarely has access to data from the entire population.
175 |
176 | Without the `rep_sample_n` function, this would be painful.
177 | We would have to manually run the following code 15,000 times
178 |
179 | ```{r sample-code}
180 | global_monitor %>%
181 | sample_n(size = 50, replace = TRUE) %>%
182 | count(scientist_work) %>%
183 | mutate(p_hat = n /sum(n)) %>%
184 | filter(scientist_work == "Doesn't benefit")
185 | ```
186 |
187 | as well as store the resulting sample proportions each time in a separate vector.
188 |
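  | For illustration only, here is a sketch of that manual bookkeeping using a `for` loop (a hypothetical alternative to `rep_sample_n`, not part of the lab workflow, so the chunk is not evaluated):
  | 
  | ```{r manual-loop, eval=FALSE}
  | # Sketch: store 15,000 sample proportions one at a time in a vector
  | sample_props_manual <- numeric(15000)
  | for (i in 1:15000) {
  |   sample_props_manual[i] <- global_monitor %>%
  |     sample_n(size = 50, replace = TRUE) %>%
  |     summarise(p_hat = mean(scientist_work == "Doesn't benefit")) %>%
  |     pull(p_hat)
  | }
  | ```
  | 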
189 | Note that for each of the 15,000 times we computed a proportion, we did so from a **different** sample!
190 |
191 | 1. To make sure you understand how sampling distributions are built, and exactly what the `rep_sample_n` function does, try modifying the code to create a sampling distribution of **25 sample proportions** from **samples of size 10**, and put them in a data frame named `sample_props_small`. Print the output. How many observations are there in this object called `sample_props_small`? What does each observation represent?
192 |
193 | ## Sample size and the sampling distribution
194 |
195 | Mechanics aside, let's return to the reason we used the `rep_sample_n` function: to compute a sampling distribution, specifically, the sampling distribution of the proportions from samples of 50 people.
196 |
197 | ```{r hist, fig.show='hide'}
198 | ggplot(data = sample_props50, aes(x = p_hat)) +
199 | geom_histogram(binwidth = 0.02)
200 | ```
201 |
202 | The sampling distribution that you computed tells you much about estimating the true proportion of people who think that the work scientists do doesn't benefit them.
203 | Because the sample proportion is an unbiased estimator, the sampling distribution is centered at the true population proportion, and the spread of the distribution indicates how much variability is incurred by sampling only 50 people at a time from the population.
204 |
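  | If you want to check the center and spread numerically, here is a quick sketch (assuming the `sample_props50` object created above): the mean of the 15,000 sample proportions should be close to the true proportion of 0.20, and their standard deviation approximates the standard error.
  | 
  | ```{r props50-summary, eval=FALSE}
  | # Sketch: mean and standard deviation of the simulated sampling distribution
  | sample_props50 %>%
  |   ungroup() %>%
  |   summarise(mean_p_hat = mean(p_hat), se_p_hat = sd(p_hat))
  | ```
  | 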
205 | In the remainder of this section, you will work on getting a sense of the effect that sample size has on your sampling distribution.
206 |
207 | 1. Use the app below to create sampling distributions of proportions of *Doesn't benefit* from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent? How do the mean, standard error, and shape of the sampling distribution change as the sample size increases? How (if at all) do these values change if you increase the number of simulations? (You do not need to include plots in your answer.)
208 |
209 | ```{r shiny, echo=FALSE, eval=TRUE, results = TRUE}
210 | shinyApp(
211 | ui <- fluidPage(
212 |
213 | # Sidebar with a slider input for number of bins
214 | sidebarLayout(
215 | sidebarPanel(
216 |
217 | selectInput("outcome",
218 | "Outcome of interest:",
219 | choices = c("Benefits", "Doesn't benefit"),
220 | selected = "Doesn't benefit"),
221 |
222 | numericInput("n_samp",
223 | "Sample size:",
224 | min = 1,
225 | max = nrow(global_monitor),
226 | value = 30),
227 |
228 | numericInput("n_rep",
229 | "Number of samples:",
230 | min = 1,
231 | max = 30000,
232 | value = 15000),
233 |
234 | hr(),
235 |
236 | sliderInput("binwidth",
237 | "Binwidth:",
238 | min = 0, max = 0.5,
239 | value = 0.02,
240 | step = 0.005)
241 |
242 | ),
243 |
244 | # Show a plot of the generated distribution
245 | mainPanel(
246 | plotOutput("sampling_plot"),
247 | textOutput("sampling_mean"),
248 | textOutput("sampling_se")
249 | )
250 | )
251 | ),
252 |
253 | server <- function(input, output) {
254 |
255 | # create sampling distribution
256 | sampling_dist <- reactive({
257 | global_monitor %>%
258 | rep_sample_n(size = input$n_samp, reps = input$n_rep, replace = TRUE) %>%
259 | count(scientist_work) %>%
260 | mutate(p_hat = n /sum(n)) %>%
261 | filter(scientist_work == input$outcome)
262 | })
263 |
264 | # plot sampling distribution
265 | output$sampling_plot <- renderPlot({
266 |
267 | ggplot(sampling_dist(), aes(x = p_hat)) +
268 | geom_histogram(binwidth = input$binwidth) +
269 | xlim(0, 1) +
270 | labs(
271 | x = paste0("p_hat (", input$outcome, ")"),
272 | title = "Sampling distribution of p_hat",
273 | subtitle = paste0("Sample size = ", input$n_samp, " Number of samples = ", input$n_rep)
274 | ) +
275 | theme(plot.title = element_text(face = "bold", size = 16))
276 | })
277 |
286 | # mean of sampling distribution
287 | output$sampling_mean <- renderText({
288 | paste0("Mean of sampling distribution = ", round(mean(sampling_dist()$p_hat), 2))
289 | })
290 |
291 |   # SE of sampling distribution
292 | output$sampling_se <- renderText({
293 | paste0("SE of sampling distribution = ", round(sd(sampling_dist()$p_hat), 2))
294 | })
295 | },
296 |
297 | options = list(height = 900)
298 | )
299 | ```
300 |
301 | ------------------------------------------------------------------------
302 |
303 | ## More Practice
304 |
305 | So far, you have only focused on estimating the proportion of those who think the work scientists do doesn't benefit them.
306 | Now, you'll try to estimate the proportion of those who think it does.
307 |
308 | Note that while you might be able to answer some of these questions using the app, you are expected to write the required code and produce the necessary plots and summary statistics.
309 | You are welcome to use the app for exploration.
310 |
311 | 1. Take a sample of size 15 from the population and calculate the proportion of people in this sample who think the work scientists do enhances their lives.
312 |     Using this sample, what is your best point estimate of the population proportion of people who think the work scientists do enhances their lives?
313 |
314 | 2. Since you have access to the population, simulate the sampling distribution of the proportion of those who think the work scientists do enhances their lives for samples of size 15 by taking 2,000 samples of size 15 from the population and computing 2,000 sample proportions.
315 |     Store these proportions in an object called `sample_props15`.
316 | Plot the data, then describe the shape of this sampling distribution.
317 | Based on this sampling distribution, what would you guess the true proportion of those who think the work scientists do enhances their lives to be?
318 | Finally, calculate and report the population proportion.
319 |
320 | 3. Change your sample size from 15 to 150, then compute the sampling distribution using the same method as above, and store these proportions in a new object called `sample_props150`.
321 | Describe the shape of this sampling distribution and compare it to the sampling distribution for a sample size of 15.
322 | Based on this sampling distribution, what would you guess to be the true proportion of those who think the work scientists do enhances their lives?
323 |
324 | 4. Of the sampling distributions from 2 and 3, which has a smaller spread?
325 | If you're concerned with making estimates that are more often close to the true value, would you prefer a sampling distribution with a large or small spread?
326 |
327 | ------------------------------------------------------------------------
328 |
329 | {style="border-width:0"} This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
330 |
--------------------------------------------------------------------------------
/05a_sampling_distributions/www/lab.css:
--------------------------------------------------------------------------------
1 | body {
2 | counter-reset: li; /* initialize counter named li */
3 | }
4 |
5 | h1 {
6 | font-family:Arial, Helvetica, sans-serif;
7 | font-weight:bold;
8 | }
9 |
10 | h2 {
11 | font-family:Arial, Helvetica, sans-serif;
12 | font-weight:bold;
13 | margin-top: 24px;
14 | }
15 |
16 | ol {
17 | margin-left:0; /* Remove the default left margin */
18 | padding-left:0; /* Remove the default left padding */
19 | }
20 | ol > li {
21 | position:relative; /* Create a positioning context */
22 | margin:0 0 10px 2em; /* Give each list item a left margin to make room for the numbers */
23 | padding:10px 80px; /* Add some spacing around the content */
24 | list-style:none; /* Disable the normal item numbering */
25 | border-top:2px solid #317EAC;
26 | background:rgba(49, 126, 172, 0.1);
27 | }
28 | ol > li:before {
29 | content:"Exercise " counter(li); /* Use the counter as content */
30 | counter-increment:li; /* Increment the counter by 1 */
31 | /* Position and style the number */
32 | position:absolute;
33 | top:-2px;
34 | left:-2em;
35 | -moz-box-sizing:border-box;
36 | -webkit-box-sizing:border-box;
37 | box-sizing:border-box;
38 | width:7em;
39 | /* Some space between the number and the content in browsers that support
40 | generated content but not positioning it (Camino 2 is one example) */
41 | margin-right:8px;
42 | padding:4px;
43 | border-top:2px solid #317EAC;
44 | color:#fff;
45 | background:#317EAC;
46 | font-weight:bold;
47 | font-family:"Helvetica Neue", Arial, sans-serif;
48 | text-align:center;
49 | }
50 | li ol,
51 | li ul {margin-top:6px;}
52 | ol ol li:last-child {margin-bottom:0;}
53 |
54 | .oyo ul {
55 | list-style-type:decimal;
56 | }
57 |
58 | hr {
59 | border: 1px solid #357FAA;
60 | }
61 |
62 | div#boxedtext {
63 | background-color: rgba(86, 155, 189, 0.2);
64 | padding: 20px;
65 | margin-bottom: 20px;
66 | font-size: 10pt;
67 | }
68 |
69 | div#template {
70 | margin-top: 30px;
71 | margin-bottom: 30px;
72 | color: #808080;
73 | border:1px solid #808080;
74 | padding: 10px 10px;
75 | background-color: rgba(128, 128, 128, 0.2);
76 | border-radius: 5px;
77 | }
78 |
79 | div#license {
80 | margin-top: 30px;
81 | margin-bottom: 30px;
82 | color: #4C721D;
83 | border:1px solid #4C721D;
84 | padding: 10px 10px;
85 | background-color: rgba(76, 114, 29, 0.2);
86 | border-radius: 5px;
87 | }
--------------------------------------------------------------------------------
/05b_confidence_intervals/README.md:
--------------------------------------------------------------------------------
1 | Deployed app at https://openintro.shinyapps.io/confidence_intervals/.
2 |
--------------------------------------------------------------------------------
/05b_confidence_intervals/confidence_intervals.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: 'Foundations for statistical inference - Confidence intervals'
3 | runtime: shiny
4 | output:
5 | html_document:
6 | css: ../lab.css
7 | highlight: pygments
8 | theme: cerulean
9 | toc: true
10 | toc_float: true
11 | ---
12 |
13 | ```{r global_options, include=FALSE}
14 | knitr::opts_chunk$set(eval = TRUE, results = FALSE, fig.show = "hide", message = FALSE)
15 | ```
16 |
17 | If you have access to data on an entire population, say the opinion of every adult in the United States on whether or not they think climate change is affecting their local community, it's straightforward to answer questions like, "What percent of US adults think climate change is affecting their local community?".
18 | Similarly, if you had demographic information on the population you could examine how, if at all, this opinion varies among young and old adults and adults with different leanings.
19 | If you have access to only a sample of the population, as is often the case, the task becomes more complicated.
20 | What is your best guess for this proportion if you only have data from a small sample of adults?
21 | This type of situation requires that you use your sample to make inference on what your population looks like.
22 |
23 | ::: {#boxedtext}
24 | **Setting a seed:** You will take random samples and build sampling distributions in this lab, which means you should set a seed on top of your lab.
25 | If this concept is new to you, review the lab on probability.
26 | :::
27 |
28 | ## Getting Started
29 |
30 | ### Load packages
31 |
32 | In this lab, we will explore and visualize the data using the **tidyverse** suite of packages, and perform statistical inference using **infer**.
33 |
34 | Let's load the packages.
35 |
36 | ```{r load-packages, message=FALSE}
37 | library(tidyverse)
38 | library(openintro)
39 | library(infer)
40 | ```
41 |
42 | ### Creating a reproducible lab report
43 |
44 | To create your new lab report, in RStudio, go to New File -\> R Markdown... Then, choose From Template and then choose `Lab Report for OpenIntro Statistics Labs` from the list of templates.
45 |
46 | ### The data
47 |
48 | A 2019 Pew Research report states the following:
49 |
52 | > Roughly six-in-ten U.S. adults (62%) say climate change is currently affecting their local community either a great deal or some, according to a new Pew Research Center survey.
53 | >
54 | > **Source:** [Most Americans say climate change impacts their community, but effects vary by region](https://www.pewresearch.org/fact-tank/2019/12/02/most-americans-say-climate-change-impacts-their-community-but-effects-vary-by-region/)
55 |
56 | In this lab, you will assume this 62% is a true population proportion and learn about how sample proportions can vary from sample to sample by taking smaller samples from the population.
57 | To keep our computation simple, we will first create our population assuming a total population size of 100,000 (even though that's smaller than the population size of all US adults).
58 | This means 62,000 (62%) of the adults in this population think climate change impacts their community, and the remaining 38,000 do not.
59 |
60 | ```{r}
61 | us_adults <- tibble(
62 | climate_change_affects = c(rep("Yes", 62000), rep("No", 38000))
63 | )
64 | ```
65 |
66 | The name of the data frame is `us_adults` and the name of the variable that contains responses to the question *"Do you think climate change is affecting your local community?"* is `climate_change_affects`.
67 |
68 | We can quickly visualize the distribution of these responses using a bar plot.
69 |
70 | ```{r bar-plot-pop, fig.height=2.5, fig.width=10}
71 | ggplot(us_adults, aes(x = climate_change_affects)) +
72 | geom_bar() +
73 | labs(
74 | x = "", y = "",
75 | title = "Do you think climate change is affecting your local community?"
76 | ) +
77 | coord_flip()
78 | ```
79 |
80 | We can also obtain summary statistics to confirm we constructed the data frame correctly.
81 |
82 | ```{r summ-stat-pop, results = TRUE}
83 | us_adults %>%
84 | count(climate_change_affects) %>%
85 | mutate(p = n /sum(n))
86 | ```
87 |
88 | In this lab, you'll start with a simple random sample of size 60 from the population.
89 |
90 | ```{r sample}
91 | n <- 60
92 | samp <- us_adults %>%
93 | sample_n(size = n)
94 | ```
95 |
96 | 1. What percent of the adults in your sample think climate change affects their local community?
97 | **Hint:** Just like we did with the population, we can calculate the proportion of those **in this sample** who think climate change affects their local community.
98 |
99 | 2. Would you expect another student's sample proportion to be identical to yours?
100 | Would you expect it to be similar?
101 | Why or why not?
102 |
103 | ## Confidence intervals
104 |
105 | Return for a moment to the question that first motivated this lab: based on this sample, what can you infer about the population?
106 | With just one sample, the best estimate of the proportion of US adults who think climate change affects their local community would be the sample proportion, usually denoted as $\hat{p}$ (here we are calling it `p_hat`).
107 | That serves as a good **point estimate**, but it would be useful to also communicate how uncertain you are of that estimate.
108 | This uncertainty can be quantified using a **confidence interval**.
109 |
110 | One way of calculating a confidence interval for a population proportion is based on the Central Limit Theorem: $\hat{p} \pm z^\star SE_{\hat{p}}$, or more precisely, $$ \hat{p} \pm z^\star \sqrt{ \frac{\hat{p} (1-\hat{p})}{n} } $$
111 |
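  | As a point of comparison (a sketch assuming the `samp` and `n` objects defined above, not the method this lab focuses on), the CLT-based interval can be computed directly:
  | 
  | ```{r ci-clt-sketch, eval=FALSE}
  | # Sketch: CLT-based 95% confidence interval for a sample proportion
  | p_hat <- mean(samp$climate_change_affects == "Yes")
  | z_star <- qnorm(0.975)  # critical value for 95% confidence
  | p_hat + c(-1, 1) * z_star * sqrt(p_hat * (1 - p_hat) / n)
  | ```
  | 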
112 | Another way is using simulation, or to be more specific, using **bootstrapping**.
113 | The term **bootstrapping** comes from the phrase "pulling oneself up by one's bootstraps", which is a metaphor for accomplishing an impossible task without any outside help.
114 | In this case the impossible task is estimating a population parameter (the unknown population proportion), and we'll accomplish it using data from only the given sample.
115 | Note that this notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference; it is not limited to bootstrapping.
116 |
117 | In essence, bootstrapping assumes that the population is made up of more observations like the ones in the observed sample.
118 | So we "reconstruct" the population by resampling from our sample, with replacement.
119 | The bootstrapping scheme is as follows:
120 |
121 | - **Step 1.** Take a bootstrap sample - a random sample taken **with replacement** from the original sample, of the same size as the original sample.
122 | - **Step 2.** Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap samples.
123 | - **Step 3.** Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics.
124 | - **Step 4.** Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution.
125 |
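  | For intuition, here is a minimal sketch of these four steps coded by hand (assuming the `samp` object from above; the chunk is not evaluated):
  | 
  | ```{r bootstrap-by-hand, eval=FALSE}
  | boot_props <- replicate(1000, {                      # Step 3: repeat many times
  |   samp %>%
  |     sample_n(size = nrow(samp), replace = TRUE) %>%  # Step 1: resample with replacement
  |     summarise(p = mean(climate_change_affects == "Yes")) %>%
  |     pull(p)                                          # Step 2: bootstrap statistic
  | })
  | quantile(boot_props, c(0.025, 0.975))                # Step 4: middle 95%
  | ```
  | 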
126 | Instead of coding up each of these steps, we will construct confidence intervals using the **infer** package.
127 |
128 | Below is an overview of the functions we will use to construct this confidence interval:
129 |
130 | +-------------+--------------------------------------------------------------------------------------------------------------------------------------------+
131 | | Function | Purpose |
132 | +=============+============================================================================================================================================+
133 | | `specify` | Identify your variable of interest |
134 | +-------------+--------------------------------------------------------------------------------------------------------------------------------------------+
135 | | `generate` | The number of samples you want to generate |
136 | +-------------+--------------------------------------------------------------------------------------------------------------------------------------------+
137 | | `calculate` | The sample statistic you want to do inference with, or you can also think of this as the population parameter you want to do inference for |
138 | +-------------+--------------------------------------------------------------------------------------------------------------------------------------------+
139 | | `get_ci` | Find the confidence interval |
140 | +-------------+--------------------------------------------------------------------------------------------------------------------------------------------+
141 |
142 | This code will find the 95 percent confidence interval for the proportion of US adults who think climate change affects their local community.
143 |
144 | ```{r confidence-interval-infer}
145 | samp %>%
146 | specify(response = climate_change_affects, success = "Yes") %>%
147 | generate(reps = 1000, type = "bootstrap") %>%
148 | calculate(stat = "prop") %>%
149 | get_ci(level = 0.95)
150 | ```
151 |
152 | - In `specify` we specify the `response` variable and the level of that variable we are calling a `success`.
153 | - In `generate` we provide the number of resamples we want from the population in the `reps` argument (this should be a reasonably large number) as well as the type of resampling we want to do, which is `"bootstrap"` in the case of constructing a confidence interval.
154 | - Then, we `calculate` the sample statistic of interest for each of these resamples, which is `prop`ortion.
155 |
156 | Feel free to test out the rest of the arguments for these functions, since these commands will be used together to calculate confidence intervals and solve inference problems for the rest of the semester.
157 | But we will also walk you through more examples in future chapters.
158 |
159 | To recap: even though we don't know what the full population looks like, we're 95% confident that the true proportion of US adults who think climate change affects their local community is between the two bounds reported as a result of this pipeline.
160 |
161 | ## Confidence levels
162 |
163 | 1. In the interpretation above, we used the phrase "95% confident". What does "95% confidence" mean?
164 |
165 | In this case, you have the rare luxury of knowing the true population proportion (62%) since you have data on the entire population.
166 |
167 | 1. Does your confidence interval capture the true population proportion of US adults who think climate change affects their local community?
168 | If you are working on this lab in a classroom, does your neighbor's interval capture this value?
169 |
170 | 2. Each student should have gotten a slightly different confidence interval.
171 |     What proportion of those intervals would you expect to capture the true population proportion?
172 | Why?
173 |
174 | In the next part of the lab, you will collect many samples to learn more about how sample proportions and confidence intervals constructed based on those samples vary from one sample to another.
175 |
176 | - Obtain a random sample.
177 | - Calculate the sample proportion, and use these to calculate and store the lower and upper bounds of the confidence intervals.
178 | - Repeat these steps 50 times.
179 |
180 | Doing this would require learning programming concepts like iteration so that you can automate running the code you've developed so far many times to obtain many (50) confidence intervals.
181 | In order to keep the programming simpler, we are providing the interactive app below that does this for you and creates a plot similar to Figure 5.6 on [OpenIntro Statistics, 4th Edition (page 182)](https://www.openintro.org/os).
182 |
183 | ```{r shiny, echo=FALSE, eval=TRUE, results = TRUE}
184 | store_ci <- function(i, n, reps, conf_level, success) {
185 | us_adults %>%
186 | sample_n(size = n) %>%
187 | specify(response = climate_change_affects, success = success) %>%
188 | generate(reps, type = "bootstrap") %>%
189 | calculate(stat = "prop") %>%
190 | get_ci(level = conf_level) %>%
191 | rename(
192 | x_lower = names(.)[1],
193 | x_upper = names(.)[2]
194 | )
195 | }
196 |
197 | shinyApp(
198 | ui <- fluidPage(
199 | h4("Confidence intervals for the proportion of US adults who think
200 | climate change"),
201 |
202 | h4(selectInput("success", "",
203 | choices = c(
204 | "is affecting their local community" = "Yes",
205 | "is not affecting their local community" = "No"
206 | ),
207 | selected = "Yes", width = "50%"
208 | )),
209 |
210 | # Sidebar with a slider input for number of bins
211 | sidebarLayout(
212 | sidebarPanel(
213 | numericInput("n_samp",
214 | "Sample size for a single sample from the population:",
215 | min = 1,
216 | max = 1000,
217 | value = 60
218 | ),
219 |
220 | hr(),
221 |
222 | numericInput("n_rep",
223 | "Number of resamples for each bootstrap confidence interval:",
224 | min = 1,
225 | max = 15000,
226 | value = 1000
227 | ),
228 |
229 | numericInput("conf_level",
230 | "Confidence level",
231 | min = 0.01,
232 | max = 0.99,
233 | value = 0.95,
234 | step = 0.05
235 | ),
236 |
237 | hr(),
238 |
239 | radioButtons("n_ci",
240 | "Number of confidence intervals:",
241 | choices = c(10, 25, 50, 100),
242 | selected = 50, inline = TRUE
243 | ),
244 |
245 | actionButton("go", "Go")
246 | ),
247 |
248 | # Show a plot of the generated distribution
249 | mainPanel(
250 | plotOutput("ci_plot")
251 | )
252 | )
253 | ),
254 |
255 | server <- function(input, output) {
256 |
257 | # set true p
258 | p <- reactive(ifelse(input$success == "Yes", 0.62, 0.38))
259 |
260 | # create df_ci when go button is pushed
261 | df_ci <- eventReactive(input$go, {
262 | map_dfr(1:input$n_ci, store_ci,
263 | n = input$n_samp,
264 | reps = input$n_rep, conf_level = input$conf_level,
265 | success = input$success
266 | ) %>%
267 | mutate(
268 | y_lower = 1:input$n_ci,
269 | y_upper = 1:input$n_ci,
270 | capture_p = ifelse(x_lower < p() & x_upper > p(), "Yes", "No")
271 | )
272 | })
273 |
274 | # plot df_ci
275 | output$ci_plot <- renderPlot({
276 | ggplot(df_ci()) +
277 | geom_segment(aes(x = x_lower, y = y_lower, xend = x_upper, yend = y_upper, color = capture_p)) +
278 | geom_point(aes(x = x_lower, y = y_lower, color = capture_p)) +
279 | geom_point(aes(x = x_upper, y = y_upper, color = capture_p)) +
280 | geom_vline(xintercept = p(), color = "darkgray") +
281 | labs(
282 | y = "", x = "Bounds of the confidence interval",
283 | color = "Does the interval capture the true population proportion?"
284 | ) +
285 | theme(legend.position = "bottom")
286 | })
287 | },
288 | options = list(height = 700)
289 | )
290 | ```
291 |
292 | 1. Given a sample size of 60, 1000 bootstrap samples for each interval, and 50 confidence intervals constructed (the default values for the above app), what proportion of your confidence intervals include the true population proportion? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.
293 |
294 | ------------------------------------------------------------------------
295 |
296 | ## More Practice
297 |
298 | 1. Choose a different confidence level than 95%.
299 | Would you expect a confidence interval at this level to be wider or narrower than the confidence interval you calculated at the 95% confidence level?
300 | Explain your reasoning.
301 |
302 | 2. Using code from the **infer** package and data from the one sample you have (`samp`), find a confidence interval for the proportion of US adults who think climate change is affecting their local community with a confidence level of your choosing (other than 95%) and interpret it.
303 |
304 | 3. Using the app, calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population proportion.
305 | How does this percentage compare to the confidence level selected for the intervals?
306 |
307 | 4. Lastly, try one more (different) confidence level.
308 | First, state how you expect the width of this interval to compare to previous ones you calculated.
309 | Then, calculate the bounds of the interval using the **infer** package and data from `samp` and interpret it.
310 |     Finally, use the app to generate many intervals and calculate the proportion of intervals that capture the true population proportion.
311 |
312 | 5. Using the app, experiment with different sample sizes and comment on how the widths of intervals change as sample size changes (increases and decreases).
313 |
314 | 6. Finally, given a sample size (say, 60), how does the width of the interval change as you increase the number of bootstrap samples?
315 | **Hint:** Does changing the number of bootstrap samples affect the standard error?
316 |
317 | ------------------------------------------------------------------------
318 |
319 | {style="border-width:0"} This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
320 |
--------------------------------------------------------------------------------
/06_inf_for_categorical_data/inf_for_categorical_data.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Inference for categorical data"
3 | runtime: shiny
4 | output:
5 | html_document:
6 | css: www/lab.css
7 | highlight: pygments
8 | theme: cerulean
9 | toc: true
10 | toc_float: true
11 | ---
12 |
13 | ```{r global_options, include=FALSE}
14 | knitr::opts_chunk$set(eval = TRUE, results = FALSE, fig.show = "hide", message = FALSE)
15 | library(ggplot2)
16 | ```
17 |
18 | ## Getting Started
19 |
20 | ### Load packages
21 |
22 | In this lab, we will explore and visualize the data using the **tidyverse** suite of packages, and perform statistical inference using **infer**.
23 | The data can be found in the companion package for OpenIntro resources, **openintro**.
24 |
25 | Let's load the packages.
26 |
27 | ```{r load-packages}
28 | library(tidyverse)
29 | library(openintro)
30 | library(infer)
31 | ```
32 |
33 | ### Creating a reproducible lab report
34 |
35 | To create your new lab report, in RStudio, go to New File -\> R Markdown... Then, choose From Template and then choose `Lab Report for OpenIntro Statistics Labs` from the list of templates.
36 |
37 | ### The data
38 |
39 | You will be analyzing the same dataset as in the previous lab, where you delved into a sample from the Youth Risk Behavior Surveillance System (YRBSS) survey, which uses data from high schoolers to help discover health patterns.
40 | The dataset is called `yrbss`.
41 |
42 | 1. What are the counts within each category for the number of days these students have texted while driving within the past 30 days?
43 |
44 | 2. What is the proportion of people who have texted while driving every day in the past 30 days and never wear helmets?
45 |
46 | Remember that you can use `filter` to limit the dataset to just non-helmet wearers.
47 | Here, we will name the dataset `no_helmet`.
48 |
49 | ```{r no-helmet}
50 | no_helmet <- yrbss %>%
51 | filter(helmet_12m == "never")
52 | ```
53 |
54 | Also, it may be easier to calculate the proportion if you create a new variable that specifies whether the individual has texted every day while driving over the past 30 days or not.
55 | We will call this variable `text_ind`.
56 |
57 | ```{r indicator-texting}
58 | no_helmet <- no_helmet %>%
59 | mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))
60 | ```
61 |
62 | ## Inference on proportions
63 |
64 | When summarizing the YRBSS, the Centers for Disease Control and Prevention seeks insight into the population *parameters*.
65 | To do this, you can answer the question, "What proportion of people in your sample reported that they have texted while driving each day for the past 30 days?" with a statistic; while the question "What proportion of people on earth have texted while driving each day for the past 30 days?" is answered with an estimate of the parameter.
66 |
67 | The inferential tools for estimating population proportion are analogous to those used for means in the last chapter: the confidence interval and the hypothesis test.
68 |
69 | ```{r nohelmet-text-ci}
70 | no_helmet %>%
71 | specify(response = text_ind, success = "yes") %>%
72 | generate(reps = 1000, type = "bootstrap") %>%
73 | calculate(stat = "prop") %>%
74 | get_ci(level = 0.95)
75 | ```
76 |
77 | Note that since the goal is to construct an interval estimate for a proportion, you need to include the `success` argument within `specify`, which accounts for the proportion of non-helmet wearers that have consistently texted while driving in the past 30 days in this example, and to set `stat` within `calculate` to `"prop"`, signaling that you are doing inference on a proportion.
78 |
79 | 1. What is the margin of error for the estimate of the proportion of non-helmet wearers that have texted while driving each day for the past 30 days based on this survey?
80 |
81 | 2. Using the `infer` package, calculate confidence intervals for two other categorical variables (you'll need to decide which level to call "success"), and report the associated margins of error. Interpret the interval in context of the data. It may be helpful to create new data sets for each of the two variables first, and then use these data sets to construct the confidence intervals.
82 |
83 | ## How does the proportion affect the margin of error?
84 |
85 | Imagine you've set out to survey 1000 people on two questions: are you at least 6 feet tall?
86 | and are you left-handed?
87 | Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right?
88 | Wrong!
89 | While the margin of error does change with sample size, it is also affected by the proportion.
90 |
91 | Think back to the formula for the standard error: $SE = \sqrt{p(1-p)/n}$.
92 | This is then used in the formula for the margin of error for a 95% confidence interval: $$
93 | ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n} \,.
94 | $$ Since the population proportion $p$ is in this $ME$ formula, it should make sense that the margin of error is in some way dependent on the population proportion.
95 | We can visualize this relationship by creating a plot of $ME$ vs. $p$.
96 |
97 | Since sample size is irrelevant to this discussion, let's just set it to some value ($n = 1000$) and use this value in the following calculations:
98 |
99 | ```{r n-for-me-plot}
100 | n <- 1000
101 | ```
102 |
103 | The first step is to make a variable `p` that is a sequence from 0 to 1 with each number incremented by 0.01.
104 | You can then create a variable of the margin of error (`me`) associated with each of these values of `p` using the familiar approximate formula ($ME = 2 \times SE$).
105 |
106 | ```{r p-me}
107 | p <- seq(from = 0, to = 1, by = 0.01)
108 | me <- 2 * sqrt(p * (1 - p)/n)
109 | ```
110 |
111 | Lastly, you can plot the two variables against each other to reveal their relationship.
112 | To do so, we need to first put these variables in a data frame that you can call in the `ggplot` function.
113 |
114 | ```{r me-plot}
115 | dd <- data.frame(p = p, me = me)
116 | ggplot(data = dd, aes(x = p, y = me)) +
117 | geom_line() +
118 | labs(x = "Population Proportion", y = "Margin of Error")
119 | ```
120 |
121 | 1. Describe the relationship between `p` and `me`. Include the margin of error vs. population proportion plot you constructed in your answer. For a given sample size, for which value of `p` is margin of error maximized?
122 |
123 | ## Success-failure condition
124 |
125 | We have emphasized that you must always check conditions before making inference.
126 | For inference on proportions, the sample proportion can be assumed to be nearly normal if it is based upon a random sample of independent observations and if both $np \geq 10$ and $n(1 - p) \geq 10$.
127 | This rule of thumb is easy enough to follow, but it makes you wonder: what's so special about the number 10?
128 |
129 | The short answer is: nothing.
130 | You could argue that you would be fine with 9 or that you really should be using 11.
131 | The "best" value for such a rule of thumb is, at least to some degree, arbitrary.
132 | However, once $np$ and $n(1-p)$ reach 10, the sampling distribution is sufficiently normal to use confidence intervals and hypothesis tests that are based on that approximation.
133 |
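  | As a quick sanity check (a sketch with hypothetical values of the sample size and proportion), you can verify the success-failure condition directly:
  | 
  | ```{r sf-check, eval=FALSE}
  | # Sketch: check the success-failure condition for hypothetical n and p
  | n_check <- 300
  | p_check <- 0.1
  | n_check * p_check >= 10 & n_check * (1 - p_check) >= 10  # TRUE only if both counts are at least 10
  | ```
  | 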
134 | You can investigate the interplay between $n$ and $p$ and the shape of the sampling distribution by using simulations.
135 | Play around with the following app to investigate how the shape, center, and spread of the distribution of $\hat{p}$ change as $n$ and $p$ change.
136 |
137 | ```{r sf-app, echo=FALSE, eval=TRUE, results=TRUE}
138 | inputPanel(
139 | numericInput("n", label = "Sample size:", value = 300),
140 |
141 | sliderInput("p", label = "Population proportion:",
142 | min = 0, max = 1, value = 0.1, step = 0.01),
143 |
144 | numericInput("x_min", label = "Min for x-axis:", value = 0, min = 0, max = 1),
145 | numericInput("x_max", label = "Max for x-axis:", value = 1, min = 0, max = 1)
146 | )
147 |
148 | renderPlot({
149 | pp <- data.frame(p_hat = rep(0, 5000))
150 | for(i in 1:5000){
151 | samp <- sample(c(TRUE, FALSE), input$n, replace = TRUE,
152 | prob = c(input$p, 1 - input$p))
153 | pp$p_hat[i] <- sum(samp == TRUE) / input$n
154 | }
155 | bw <- diff(range(pp$p_hat)) / 30
156 | ggplot(data = pp, aes(x = p_hat)) +
157 | geom_histogram(binwidth = bw) +
158 | xlim(input$x_min, input$x_max) +
159 | ggtitle(paste0("Distribution of p_hats, drawn from p = ", input$p, ", n = ", input$n))
160 | })
161 | ```
162 |
163 | 1. Describe the sampling distribution of sample proportions at $n = 300$ and $p = 0.1$.
164 | Be sure to note the center, spread, and shape.
165 |
166 | 2. Keep $n$ constant and change $p$.
167 |     How does the shape, center, and spread of the sampling distribution vary as $p$ changes?
168 | You might want to adjust min and max for the $x$-axis for a better view of the distribution.
169 |
170 | 3. Now also change $n$.
171 | How does $n$ appear to affect the distribution of $\hat{p}$?
172 |
173 | ------------------------------------------------------------------------
174 |
175 | ## More Practice
176 |
177 | For some of the exercises below, you will conduct inference comparing two proportions.
178 | In such cases, you have a response variable that is categorical, and an explanatory variable that is also categorical, and you are comparing the proportions of success of the response variable across the levels of the explanatory variable.
179 | This means that when using `infer`, you need to include both variables within `specify`.
180 |
181 | 1. Is there convincing evidence that those who sleep 10+ hours per day are more likely to strength train every day of the week?
182 | As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference.
183 | If you find a significant difference, also quantify this difference with a confidence interval.
184 |
185 | 2. Suppose there is actually no difference in the likelihood of strength training every day of the week between those who sleep 10+ hours and those who don't.
186 |     What is the probability that you could detect a difference (at a significance level of 0.05) simply by chance?
187 | *Hint:* Review the definition of the Type 1 error.
188 |
189 | 3. Suppose you're hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis.
190 | According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence.
191 | You have no idea what to expect for $p$.
192 | How many people would you have to sample to ensure that you are within the guidelines?\
193 | *Hint:* Refer to your plot of the relationship between $p$ and margin of error.
194 | This question does not require using a dataset.
195 |
196 | ------------------------------------------------------------------------
197 |
198 | {style="border-width:0"} This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
199 |
--------------------------------------------------------------------------------
/06_inf_for_categorical_data/www/lab.css:
--------------------------------------------------------------------------------
1 | body {
2 | counter-reset: li; /* initialize counter named li */
3 | }
4 |
5 | h1 {
6 | font-family:Arial, Helvetica, sans-serif;
7 | font-weight:bold;
8 | }
9 |
10 | h2 {
11 | font-family:Arial, Helvetica, sans-serif;
12 | font-weight:bold;
13 | margin-top: 24px;
14 | }
15 |
16 | ol {
17 | margin-left:0; /* Remove the default left margin */
18 | padding-left:0; /* Remove the default left padding */
19 | }
20 | ol > li {
21 | position:relative; /* Create a positioning context */
22 | margin:0 0 10px 2em; /* Give each list item a left margin to make room for the numbers */
23 | padding:10px 80px; /* Add some spacing around the content */
24 | list-style:none; /* Disable the normal item numbering */
25 | border-top:2px solid #317EAC;
26 | background:rgba(49, 126, 172, 0.1);
27 | }
28 | ol > li:before {
29 | content:"Exercise " counter(li); /* Use the counter as content */
30 | counter-increment:li; /* Increment the counter by 1 */
31 | /* Position and style the number */
32 | position:absolute;
33 | top:-2px;
34 | left:-2em;
35 | -moz-box-sizing:border-box;
36 | -webkit-box-sizing:border-box;
37 | box-sizing:border-box;
38 | width:7em;
39 | /* Some space between the number and the content in browsers that support
40 | generated content but not positioning it (Camino 2 is one example) */
41 | margin-right:8px;
42 | padding:4px;
43 | border-top:2px solid #317EAC;
44 | color:#fff;
45 | background:#317EAC;
46 | font-weight:bold;
47 | font-family:"Helvetica Neue", Arial, sans-serif;
48 | text-align:center;
49 | }
50 | li ol,
51 | li ul {margin-top:6px;}
52 | ol ol li:last-child {margin-bottom:0;}
53 |
54 | .oyo ul {
55 | list-style-type:decimal;
56 | }
57 |
58 | hr {
59 | border: 1px solid #357FAA;
60 | }
61 |
62 | div#boxedtext {
63 | background-color: rgba(86, 155, 189, 0.2);
64 | padding: 20px;
65 | margin-bottom: 20px;
66 | font-size: 10pt;
67 | }
68 |
69 | div#template {
70 | margin-top: 30px;
71 | margin-bottom: 30px;
72 | color: #808080;
73 | border:1px solid #808080;
74 | padding: 10px 10px;
75 | background-color: rgba(128, 128, 128, 0.2);
76 | border-radius: 5px;
77 | }
78 |
79 | div#license {
80 | margin-top: 30px;
81 | margin-bottom: 30px;
82 | color: #4C721D;
83 | border:1px solid #4C721D;
84 | padding: 10px 10px;
85 | background-color: rgba(76, 114, 29, 0.2);
86 | border-radius: 5px;
87 | }
88 |
--------------------------------------------------------------------------------
/07_inf_for_numerical_data/README.md:
--------------------------------------------------------------------------------
1 | Deployed app at https://openintro.shinyapps.io/inf_for_categorical_data/.
2 |
--------------------------------------------------------------------------------
/07_inf_for_numerical_data/inf_for_numerical_data.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: 'Inference for numerical data'
3 | output:
4 | html_document:
5 | css: ../lab.css
6 | highlight: pygments
7 | theme: cerulean
8 | toc: true
9 | toc_float: true
10 | ---
11 |
12 | ```{r global_options, include=FALSE}
13 | knitr::opts_chunk$set(eval = TRUE, results = FALSE, fig.show = "hide", message = FALSE)
14 | ```
15 |
16 | ## Getting Started
17 |
18 | ### Load packages
19 |
20 | In this lab, we will explore and visualize the data using the **tidyverse** suite of packages, and perform statistical inference using **infer**.
21 | The data can be found in the companion package for OpenIntro resources, **openintro**.
22 |
23 | Let's load the packages.
24 |
25 | ```{r load-packages, message=FALSE}
26 | library(tidyverse)
27 | library(openintro)
28 | library(infer)
29 | library(skimr)
30 | ```
31 |
32 | ### Creating a reproducible lab report
33 |
34 | To create your new lab report, in RStudio, go to New File -\> R Markdown... Then, choose From Template and then choose `Lab Report for OpenIntro Statistics Labs` from the list of templates.
35 |
36 | ### The data
37 |
38 | Every two years, the Centers for Disease Control and Prevention conducts the Youth Risk Behavior Surveillance System (YRBSS) survey, which collects data from high schoolers (9th through 12th grade) to analyze health patterns.
39 | You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.
40 |
41 | Load the `yrbss` data set into your workspace.
42 |
43 | ```{r load-data}
44 | data(yrbss)
45 | ```
46 |
47 | There are observations on 13 different variables, some categorical and some numerical.
48 | The meaning of each variable can be found by bringing up the help file:
49 |
50 | ```{r help-nc}
51 | ?yrbss
52 | ```
53 |
54 | 1. What are the cases in this data set? How many cases are there in our sample?
55 |
56 | Remember that you can answer this question by viewing the data in the data viewer or by using the following command:
57 |
58 | ```{r str}
59 | glimpse(yrbss)
60 | ```
61 |
62 | ## Exploratory data analysis
63 |
64 | You will start by analyzing the weight of the participants in kilograms: `weight`.
65 |
66 | Using visualization and summary statistics, describe the distribution of weights.
67 | The `skim()` function from the **skimr** package produces nice summaries of the variables in the dataset, separating categorical (character) variables from quantitative variables.
68 |
69 | ```{r summary}
70 | yrbss %>%
71 | skim()
72 | ```
73 |
74 | 1. How many observations are we missing weights from?
75 |
76 | Next, consider the possible relationship between a high schooler's weight and their physical activity.
77 | Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
78 |
79 | First, let's create a new variable `physical_3plus`, which will be coded as "yes" if the student is physically active for *at least* 3 days a week, and "no" if not.
80 |
81 | ```{r create new var}
82 | yrbss <- yrbss %>%
83 | mutate(physical_3plus = if_else(physically_active_7d > 2, "yes", "no"))
84 | ```
85 |
86 | 2. Make side-by-side violin plots of `physical_3plus` and `weight`. Is there a relationship between these two variables? What did you expect and why?
87 |
88 | The plots show how the two distributions compare, but we can also compare the means of the distributions using the following code, which first groups the data by the `physical_3plus` variable and then calculates the mean `weight` in these groups using the `mean` function, ignoring missing values by setting the `na.rm` argument to `TRUE`.
89 |
90 | ```{r by-means}
91 | yrbss %>%
92 | group_by(physical_3plus) %>%
93 | summarise(mean_weight = mean(weight, na.rm = TRUE))
94 | ```
95 |
96 | There is an observed difference, but is this difference large enough to deem it "statistically significant"?
97 | In order to answer this question we will conduct a hypothesis test.
98 |
99 | ## Inference
100 |
101 | 3. Are all conditions necessary for inference satisfied?
102 | Comment on each.
103 | You can compute the group sizes with the `summarize` command above by defining a new variable with the definition `n()`.
104 |
105 | 4. Write the hypotheses for testing if the average weights are different for those who exercise at least three times a week and those who don't.
106 |
107 | Next, we will work through creating a permutation distribution using tools from the **infer** package.
108 |
109 | But first, we need to initialize the test, which we will save as `obs_diff`.
110 |
111 | ```{r inf-weight-habit-ht-initial, tidy=FALSE, warning = FALSE}
112 | obs_diff <- yrbss %>%
113 | specify(weight ~ physical_3plus) %>%
114 | calculate(stat = "diff in means", order = c("yes", "no"))
115 | ```
116 |
117 | Recall that the `specify()` function is used to specify the variables you are considering (notated `y ~ x`), and you can use the `calculate()` function to specify the `stat`istic you want to calculate and the `order` of subtraction you want to use.
118 | For this hypothesis, the statistic you are searching for is the difference in means, with the order being `yes - no`.
119 |
120 | After you have calculated your observed statistic, you need to create a permutation distribution.
121 | This is the distribution that is created by shuffling the observed weights into new `physical_3plus` groups, labeled "yes" and "no".
122 |
123 | We will save the permutation distribution as `null_dist`.
124 |
125 | ```{r inf-weight-habit-ht-null, tidy=FALSE, warning = FALSE}
126 | null_dist <- yrbss %>%
127 | specify(weight ~ physical_3plus) %>%
128 | hypothesize(null = "independence") %>%
129 | generate(reps = 1000, type = "permute") %>%
130 | calculate(stat = "diff in means", order = c("yes", "no"))
131 | ```
132 |
133 | The `hypothesize()` function is used to declare what the null hypothesis is.
134 | Here, we are assuming that a student's weight is independent of whether or not they exercise at least 3 days a week.
135 |
136 | We should also note that the `type` argument within `generate()` is set to `"permute"`.
137 | This ensures that the statistics calculated by the `calculate()` function come from a reshuffling of the data (not a resampling of the data)!
138 | Finally, the `specify()` and `calculate()` steps should look familiar, since they are the same as what we used to find the observed difference in means!
139 |
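  | To see what a single permutation does, here is a toy sketch (with hypothetical made-up values, not the `yrbss` data): the weights get shuffled while the group labels stay fixed, breaking any real association between the two.
  | 
  | ```{r permute-toy, eval=FALSE}
  | # Sketch: one permutation reshuffles the response across the groups
  | toy <- tibble(
  |   weight         = c(60, 70, 80, 90),
  |   physical_3plus = c("yes", "yes", "no", "no")
  | )
  | toy %>% mutate(weight = sample(weight))  # same weights, randomly reassigned
  | ```
  | 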
140 | We can visualize this null distribution with the following code:
141 |
142 | ```{r}
143 | ggplot(data = null_dist, aes(x = stat)) +
144 | geom_histogram()
145 | ```
146 |
147 | 5. Add a vertical red line to the plot above, demonstrating where the observed difference in means (`obs_diff`) falls on the distribution.
148 |
149 | 6. How many of these `null_dist` permutations have a difference at least as large (or larger) as `obs_diff`?
150 |
151 | Now that you have calculated the observed statistic and generated a permutation distribution, you can calculate the p-value for your hypothesis test using the function `get_p_value()` from the infer package.
152 |
153 | ```{r inf-weight-habit-ht-pvalue}
154 | null_dist %>%
155 | get_p_value(obs_stat = obs_diff, direction = "two_sided")
156 | ```
157 |
158 | 7. What warning message do you get?
159 | Why do you think you get this warning message?
160 |
161 | 8. Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don't, and interpret this interval in context of the data.
162 |
163 | ------------------------------------------------------------------------
164 |
165 | ## More Practice
166 |
167 | 9. Calculate a 95% confidence interval for the average height in meters (`height`) and interpret it in context.
168 |
169 | 10. Calculate a new confidence interval for the same parameter at the 90% confidence level.
170 | Comment on the width of this interval versus the one obtained in the previous exercise.
171 |
172 | 11. Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don't.
173 |
174 | 12. Now, a non-inference task: Determine how many different options there are in the dataset for the `hours_tv_per_school_day` variable.
175 |
176 | 13. Come up with a research question evaluating the relationship between height or weight and sleep.
177 | Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
178 | Report the statistical results, and also provide an explanation in plain language.
179 | Be sure to check all assumptions, state your $\alpha$ level, and conclude in context.
180 |
181 | ------------------------------------------------------------------------
182 |
183 | {style="border-width:0"} This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
184 |
--------------------------------------------------------------------------------
/08_simple_regression/simple_regression.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Introduction to linear regression"
3 | output:
4 | html_document:
5 | css: ../lab.css
6 | highlight: pygments
7 | theme: cerulean
8 | toc: true
9 | toc_float: true
10 | ---
11 |
12 | ```{r global_options, include=FALSE}
13 | knitr::opts_chunk$set(eval = TRUE,
14 | results = FALSE,
15 | fig.show = "hide",
16 | message = FALSE,
17 | warning = FALSE)
18 | ```
19 |
20 | The Human Freedom Index is a report that attempts to summarize the idea of "freedom" through a bunch of different variables for many countries around the globe.
21 | It serves as a rough objective measure for the relationships between the different types of freedom - whether it's political, religious, economic, or personal freedom - and other social and economic circumstances.
22 | The Human Freedom Index is an annually co-published report by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.
23 |
24 | In this lab, you'll be analysing data from the Human Freedom Index reports.
25 | Your aim will be to summarize a few of the relationships within the data both graphically and numerically in order to find which variables can help tell a story about freedom.
26 |
27 | ## Getting Started
28 |
29 | ### Load packages
30 |
31 | In this lab, you will explore and visualize the data using the **tidyverse** suite of packages.
32 | You will also use the **statsr** package to select a regression line that minimizes the sum of squared residuals and the **broom** package to tidy regression output.
33 | The data can be found in the **openintro** package, a companion package for OpenIntro resources.
34 |
35 | Let's load the packages.
36 |
37 | ```{r load-packages, message=FALSE}
38 | library(tidyverse)
39 | library(openintro)
40 | library(statsr)
41 | library(broom)
42 | ```
43 |
44 | ### Creating a reproducible lab report
45 |
46 | To create your new lab report, in RStudio, go to New File -\> R Markdown... Then, choose From Template and then choose `Lab Report for OpenIntro Statistics Labs` from the list of templates.
47 |
48 | ### The data
49 |
50 | The data we're working with is in the openintro package and it's called `hfi`, short for Human Freedom Index.
51 |
52 | 1. What are the dimensions of the dataset?
53 | What does each row represent?
54 |
55 | 2. The dataset spans a lot of years, but we are only interested in data from the year 2016.
56 | Filter the `hfi` data frame for the year 2016 and assign the result to a data frame named `hfi_2016`.
57 |
58 | ```{r include = FALSE}
59 | hfi_2016 <- hfi %>%
60 | filter(year == 2016)
61 | ```
62 |
63 | 3. What type of plot would you use to display the relationship between the personal freedom score, `pf_score`, and `pf_expression_control`? Plot this relationship using the variable `pf_expression_control` as the predictor. Does the relationship look linear? If you knew a country's `pf_expression_control` (its score, out of 10, of political pressures and controls on media content, with 0 being the most pressure and control), would you be comfortable using a linear model to predict the personal freedom score?
64 |
65 | If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
66 |
67 | ```{r cor}
68 | hfi_2016 %>%
69 | summarise(cor(pf_expression_control, pf_score))
70 | ```
71 |
72 | ## Sum of squared residuals
73 |
74 | ::: {#boxedtext}
75 | In this section, you will use an interactive function to investigate what we mean by "sum of squared residuals".
76 | You will need to run this function in your console, not in your markdown document.
77 | Running the function also requires that the `hfi_2016` data frame you created above exists in your environment.
78 | You will also need to make sure the Plots tab in the lower right-hand corner has enough space to make a plot.
79 | :::
80 |
81 | Think back to the way that we described the distribution of a single variable.
82 | Recall that we discussed characteristics such as center, spread, and shape.
83 | It's also useful to be able to describe the relationship of two numerical variables, such as `pf_expression_control` and `pf_score` above.
84 |
85 | 4. Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.
86 |
87 | Just as you've used the mean and standard deviation to summarize a single variable, you can summarize the relationship between these two variables by finding the line that best follows their association.
88 | Use the following interactive function to select the line that you think does the best job of going through the cloud of points.
89 |
90 | ```{r plotss-expression-score, eval=FALSE}
91 | plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016)
92 | ```
93 |
94 | After running this command, you'll be prompted to click two points on the plot to define a line.
95 | Once you've done that, the line you specified will be shown in black and the residuals in blue.
96 |
97 |
98 |
99 | If your plot is appearing below your code chunk and won't let you select points to make a line, take the following steps:
100 |
101 | - Go to the Tools bar at the top of RStudio
102 | - Click on "Global Options..."
103 | - Click on "R Markdown" (in the list on the left)
104 | - Uncheck the box that says "Show output inline for all R Markdown documents"
105 |
106 |
107 |
108 | Recall that the residuals are the difference between the observed values and the values predicted by the line:
109 |
110 | $$
111 | e_i = y_i - \hat{y}_i
112 | $$
113 |
114 | The most common way to do linear regression is to select the line that minimizes the sum of squared residuals.
115 | To visualize the squared residuals, you can rerun the plot command and add the argument `showSquares = TRUE`.
116 |
117 | ```{r plotss-expression-score-squares, eval=FALSE}
118 | plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)
119 | ```
120 |
121 | Note that the output from the `plot_ss` function provides you with the slope and intercept of your line as well as the sum of squares.
122 |
123 | 5. Using `plot_ss`, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbours?
124 |
125 | ## The linear model
126 |
127 | It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error.
128 | Instead, you can use the `lm` function in R to fit the linear model (a.k.a. regression line).
129 |
130 | ```{r m1}
131 | m1 <- lm(pf_score ~ pf_expression_control, data = hfi_2016)
132 | ```
133 |
134 | The first argument in the function `lm()` is a formula that takes the form `y ~ x`.
135 | Here it can be read that we want to make a linear model of `pf_score` as a function of `pf_expression_control`.
136 | The second argument specifies that R should look in the `hfi_2016` data frame to find the two variables.
137 |
138 | **Note:** Piping **will not** work here, as the data frame is not the first argument!
139 |
140 | The output of `lm()` is an object that contains all of the information we need about the linear model that was just fit.
141 | We can access this information using the `tidy()` function.
142 |
143 | ```{r summary-m1}
144 | tidy(m1)
145 | ```
146 |
147 | Let's consider this output piece by piece.
148 | The output of `tidy()` is a data frame with one row for each term in the model: the intercept and the slope of `pf_expression_control`.
149 | The `estimate` column is key; it displays the linear model's y-intercept and the coefficient of `pf_expression_control`.
150 | The remaining columns report the standard error, test statistic, and p-value for each estimate.
151 | With this table, we can write down the least squares regression line for the linear model:
152 |
153 | $$
154 | \hat{y} = 4.28 + 0.542 \times pf\_expression\_control
155 | $$
156 |
157 | This equation tells us two things:
158 |
159 | - For countries with a `pf_expression_control` of 0 (those with the largest amount of political pressure on media content), we expect their mean personal freedom score to be 4.28.
160 | - For every 1 unit increase in `pf_expression_control`, we expect a country's mean personal freedom score to increase by 0.542 units.
161 |
162 | We can assess model fit using $R^2$, the proportion of variability in the response variable that is explained by the explanatory variable.
163 | We use the `glance()` function to access this information.
164 |
165 | ```{r}
166 | glance(m1)
167 | ```
168 |
169 | For this model, 71.4% of the variability in `pf_score` is explained by `pf_expression_control`.
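170 | 
171 | If you just want that one number, a small sketch of pulling it straight out of the `glance()` output (using `pull()` from dplyr, which is loaded with the tidyverse):
172 | 
173 | ```{r r-squared-pull}
174 | # extract the R^2 value on its own
175 | glance(m1) %>%
176 |   pull(r.squared)
177 | ```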
170 |
171 | 6. Fit a new model that uses `pf_expression_control` to predict `hf_score`, or the total human freedom score. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between human freedom and the amount of political pressure on media content?
172 |
173 | ## Prediction and prediction errors
174 |
175 | Let's create a scatterplot with the least squares line for `m1` laid on top.
176 |
177 | ```{r reg-with-line}
178 | ggplot(data = hfi_2016, aes(x = pf_expression_control, y = pf_score)) +
179 | geom_point() +
180 | geom_smooth(method = "lm", se = FALSE)
181 | ```
182 |
183 | Here, we are literally adding a layer on top of our plot.
184 | `geom_smooth` creates the line by fitting a linear model.
185 | It can also show us the standard error `se` associated with our line, but we'll suppress that for now.
186 |
187 | This line can be used to predict $y$ at any value of $x$.
188 | When predictions are made for values of $x$ that are beyond the range of the observed data, it is referred to as *extrapolation* and is not usually recommended.
189 | However, predictions made within the range of the data are more reliable.
190 | They're also used to compute the residuals.
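191 | 
192 | As an illustration, here is a minimal sketch of making such a prediction with base R's `predict()`; the value 6.8 is an arbitrary choice within the observed range:
193 | 
194 | ```{r predict-sketch}
195 | # predicted pf_score for a hypothetical country with pf_expression_control = 6.8
196 | predict(m1, newdata = data.frame(pf_expression_control = 6.8))
197 | ```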
191 |
192 | 7. If someone saw the least squares regression line and not the actual data, how would they predict the personal freedom score of a country with a rating of 3 for `pf_expression_control`? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
193 |
194 | ## Model diagnostics
195 |
196 | To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
197 |
198 | In order to do these checks we need access to the fitted (predicted) values and the residuals.
199 | We can use the `augment()` function to calculate these.
200 |
201 | ```{r}
202 | m1_aug <- augment(m1)
203 | ```
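204 | 
205 | To see what `augment()` returns, take a quick peek at the result (a sketch; the columns you will use below are `.fitted` and `.resid`):
206 | 
207 | ```{r glimpse-m1-aug}
208 | # one row per observation, with fitted values and residuals appended
209 | glimpse(m1_aug)
210 | ```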
204 |
205 | **Linearity**: You already checked if the relationship between `pf_score` and `pf_expression_control` is linear using a scatterplot.
206 | We should also verify this condition with a plot of the residuals vs. fitted (predicted) values.
207 |
208 | ```{r residuals}
209 | ggplot(data = m1_aug, aes(x = .fitted, y = .resid)) +
210 | geom_point() +
211 | geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
212 | xlab("Fitted values") +
213 | ylab("Residuals")
214 | ```
215 |
216 | Notice here that `m1_aug`, the output of `augment()`, serves as the data set for this plot because stored within it are the fitted values ($\hat{y}$) and the residuals.
217 | Also note that we're getting fancy with the code here.
218 | After creating the scatterplot on the first layer (first line of code), we overlay a red horizontal dashed line at $y = 0$ (to help us check whether the residuals are distributed around 0), and we also rename the axis labels to be more informative.
219 |
220 | 8. Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between the two variables?
221 |
222 |
223 |
224 | **Nearly normal residuals**: To check this condition, we can look at a histogram of the residuals.
225 |
226 | ```{r hist-res}
227 | ggplot(data = m1_aug, aes(x = .resid)) +
228 | geom_histogram(binwidth = 0.25) +
229 | xlab("Residuals")
230 | ```
231 |
232 | 9. Based on the histogram, does the nearly normal residuals condition appear to be violated? Why or why not?
233 |
234 |
235 |
236 | **Constant variability**:
237 |
238 | 10. Based on the residuals vs. fitted plot, does the constant variability condition appear to be violated? Why or why not?
239 |
240 | ------------------------------------------------------------------------
241 |
242 | ## More Practice
243 |
244 |
245 |
246 | - Choose another variable that you think would strongly correlate with `pf_score`.
247 | Produce a scatterplot of the two variables and fit a linear model.
248 | At a glance, does there seem to be a linear relationship?
249 |
250 | - How does this relationship compare to the relationship between `pf_score` and `pf_expression_control`?
251 | Use the $R^2$ values from the two model summaries to compare.
252 | Does your independent variable seem to predict `pf_score` better?
253 | Why or why not?
254 |
255 | - Check the model diagnostics using appropriate visualizations and evaluate if the model conditions have been met.
256 |
257 | - Pick another pair of variables of interest and visualize the relationship between them.
258 | Do you find the relationship surprising or is it what you expected?
259 | Discuss why you were interested in these variables and why you were/were not surprised by the relationship you observed.
260 |
261 | ------------------------------------------------------------------------
262 |
263 | {style="border-width:0"} This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
264 |
--------------------------------------------------------------------------------
/09_multiple_regression/multiple_regression.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Multiple linear regression"
3 | output:
4 | html_document:
5 | css: ../lab.css
6 | highlight: pygments
7 | theme: cerulean
8 | ---
9 |
10 | ```{r global_options, include=FALSE}
11 | knitr::opts_chunk$set(eval = TRUE, results = FALSE, fig.show = "hide", message = FALSE)
12 | ```
13 |
14 | ## Grading the professor
15 |
16 | Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously.
17 | However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor.
18 | The article titled "Beauty in the classroom: instructors' pulchritude and putative pedagogical productivity" by Hamermesh and Parker found that instructors who are viewed as better looking receive higher instructional ratings.
19 |
20 | Here, you will analyze the data from this study in order to learn what goes into a positive professor evaluation.
21 |
22 | ## Getting Started
23 |
24 | ### Load packages
25 |
26 | In this lab, you will explore and visualize the data using the **tidyverse** suite of packages.
27 | You will also use the **GGally** package for visualisation of many variables at once and the **broom** package to tidy regression output.
28 | The data can be found in the companion package for OpenIntro resources, **openintro**.
29 |
30 | Let's load the packages.
31 |
32 | ```{r load-packages, message=FALSE}
33 | library(tidyverse)
34 | library(openintro)
35 | library(GGally)
36 | library(broom)
37 | ```
38 |
39 | This is the first time we're using the **GGally** package.
40 | You will be using the `ggpairs()` function from this package later in the lab.
41 |
42 | ### Creating a reproducible lab report
43 |
44 | To create your new lab report, in RStudio, go to New File -\> R Markdown... Then, choose From Template and then choose `Lab Report for OpenIntro Statistics Labs` from the list of templates.
45 |
46 | ### The data
47 |
48 | The data were gathered from end of semester student evaluations for a large sample of professors from the University of Texas at Austin.
49 | In addition, six students rated the professors' physical appearance.
50 | The result is a data frame where each row contains a different course and columns represent variables about the courses and professors.
51 | It's called `evals`.
52 |
53 | ```{r}
54 | glimpse(evals)
55 | ```
56 |
57 | We have observations on 21 different variables, some categorical and some numerical.
58 | The meaning of each variable can be found by bringing up the help file:
59 |
60 | ```{r help-evals}
61 | ?evals
62 | ```
63 |
64 | ## Exploring the data
65 |
66 | 1. Is this an observational study or an experiment?
67 | The original research question posed in the paper is whether beauty leads directly to the differences in course evaluations.
68 | Given the study design, is it possible to answer this question as it is phrased?
69 | If not, rephrase the question.
70 |
71 | 2. Describe the distribution of `score`.
72 | Is the distribution skewed?
73 | What does that tell you about how students rate courses?
74 | Is this what you expected to see?
75 | Why, or why not?
76 |
77 | 3. Excluding `score`, select two other variables and describe their relationship with each other using an appropriate visualization.
78 |
79 | ## Simple linear regression
80 |
81 | The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favourably.
82 | Let's create a scatterplot to see if this appears to be the case:
83 |
84 | ```{r scatter-score-bty_avg}
85 | ggplot(data = evals, aes(x = bty_avg, y = score)) +
86 | geom_point()
87 | ```
88 |
89 | Before you draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot.
90 | Is anything awry?
91 |
92 | 4. Re-plot the scatterplot, but this time use `geom_jitter` as your layer. What was misleading about the initial scatterplot?
93 |
94 | ```{r scatter-score-bty_avg-jitter}
95 | ggplot(data = evals, aes(x = bty_avg, y = score)) +
96 | geom_jitter()
97 | ```
98 |
99 | 5. Let's see if the apparent trend in the plot is something more than natural variation. Fit a linear model called `m_bty` to predict average professor score by average beauty rating. Write out the equation for the linear model and interpret the slope. Is average beauty score a statistically significant predictor? Does it appear to be a practically significant predictor?
100 |
101 | Add the line of the best fit model to your plot using the following:
102 |
103 | ```{r scatter-score-bty_avg-line-se}
104 | ggplot(data = evals, aes(x = bty_avg, y = score)) +
105 | geom_jitter() +
106 | geom_smooth(method = "lm")
107 | ```
108 |
109 | The blue line is the model.
110 | The shaded gray area around the line tells you about the variability you might expect in your predictions.
111 | To turn that off, use `se = FALSE`.
112 |
113 | ```{r scatter-score-bty_avg-line}
114 | ggplot(data = evals, aes(x = bty_avg, y = score)) +
115 | geom_jitter() +
116 | geom_smooth(method = "lm", se = FALSE)
117 | ```
118 |
119 | 6. Use residual plots to evaluate whether the conditions of least squares regression are reasonable. Provide plots and comments for each one (see the Simple Regression Lab for a reminder of how to make these).
120 |
121 | ## Multiple linear regression
122 |
123 | The data set contains several variables on the beauty score of the professor: individual ratings from each of the six students who were asked to score the physical appearance of the professors and the average of these six scores.
124 | Let's take a look at the relationship between one of these scores and the average beauty score.
125 |
126 | ```{r bty-rel}
127 | ggplot(data = evals, aes(x = bty_f1lower, y = bty_avg)) +
128 | geom_point()
129 |
130 | evals %>%
131 | summarise(cor(bty_avg, bty_f1lower))
132 | ```
133 |
134 | As expected, the relationship is quite strong---after all, the average score is calculated using the individual scores.
135 | You can actually look at the relationships between all beauty variables (columns 13 through 19) using the following command:
136 |
137 | ```{r bty-rels}
138 | evals %>%
139 | select(contains("bty")) %>%
140 | ggpairs()
141 | ```
142 |
143 | These variables are collinear (correlated), and adding more than one of them to the model would not add much value.
144 | In this application and with these highly-correlated predictors, it is reasonable to use the average beauty score as the single representative of these variables.
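145 | 
146 | If you prefer a numeric version of the same check, here is a small sketch of the pairwise correlations (rounded for readability):
147 | 
148 | ```{r bty-cor-matrix}
149 | # correlation matrix of all beauty-rating columns
150 | evals %>%
151 |   select(contains("bty")) %>%
152 |   cor() %>%
153 |   round(2)
154 | ```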
145 |
146 | In order to see if beauty is still a significant predictor of professor score after you've accounted for the professor's gender, you can add the gender term into the model.
147 |
148 | ```{r scatter-score-bty_avg_pic-color}
149 | m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
150 | tidy(m_bty_gen)
151 | ```
152 |
153 | 7. p-values and parameter estimates should only be trusted if the conditions for the regression are reasonable.
154 | Verify that the conditions for this model are reasonable using diagnostic plots.
155 |
156 | 8. Is `bty_avg` still a significant predictor of `score`?
157 | Has the addition of `gender` to the model changed the parameter estimate for `bty_avg`?
158 |
159 | Note that the estimate for `gender` is now called `gendermale`.
160 | You'll see this name change whenever you introduce a categorical variable.
161 | The reason is that R recodes `gender` from having the values of `male` and `female` to being an indicator variable called `gendermale` that takes a value of $0$ for female professors and a value of $1$ for male professors.
162 | (Such variables are often referred to as "dummy" variables.)
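163 | 
164 | If you want to see this coding directly, one way (a sketch using base R's `model.matrix()`) is to peek at the first few rows of the design matrix R builds for the model:
165 | 
166 | ```{r model-matrix-sketch}
167 | # each row shows the 0/1 value of gendermale alongside bty_avg
168 | model.matrix(score ~ bty_avg + gender, data = evals) %>%
169 |   head()
170 | ```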
163 |
164 | As a result, for female professors, the parameter estimate is multiplied by zero, leaving the intercept and slope form familiar from simple regression.
165 |
166 | $$
167 | \begin{aligned}
168 | \widehat{score} &= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg + \hat{\beta}_2 \times (0) \\
169 | &= \hat{\beta}_0 + \hat{\beta}_1 \times bty\_avg\end{aligned}
170 | $$
171 |
172 | 9. What is the equation of the line corresponding to male professors? (*Hint:* For male professors, the parameter estimate is multiplied by 1.) For two professors who received the same beauty rating, which gender tends to have the higher course evaluation score?
173 |
174 | The decision to call the indicator variable `gendermale` instead of `genderfemale` has no deeper meaning.
175 | R simply codes the category that comes first alphabetically as a $0$.
176 | (You can change the reference level of a categorical variable, which is the level that is coded as a 0, using the `relevel()` function.
177 | Use `?relevel` to learn more.)
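178 | 
179 | For instance, here is a minimal sketch of refitting with `male` as the reference level; this assumes `gender` is stored as a factor, as it is in `evals`:
180 | 
181 | ```{r relevel-sketch}
182 | # make "male" the reference level, then refit the same model
183 | evals_releveled <- evals %>%
184 |   mutate(gender = relevel(gender, ref = "male"))
185 | m_bty_gen_rl <- lm(score ~ bty_avg + gender, data = evals_releveled)
186 | tidy(m_bty_gen_rl)   # the indicator is now labeled genderfemale
187 | ```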
178 |
179 | 10. Create a new model called `m_bty_rank` with `gender` removed and `rank` added in. How does R appear to handle categorical variables that have more than two levels? Note that the rank variable has three levels: `teaching`, `tenure track`, `tenured`.
180 |
181 | The interpretation of the coefficients in multiple regression is slightly different from that of simple regression.
182 | The estimate for `bty_avg` reflects how much higher a group of professors is expected to score if they have a beauty rating that is one point higher *while holding all other variables constant*.
183 | In this case, that translates into considering only professors of the same rank with `bty_avg` scores that are one point apart.
184 |
185 | ## The search for the best model
186 |
187 | We will start with a full model that predicts professor score based on rank, gender, ethnicity, language of the university where they got their degree, age, proportion of students that filled out evaluations, class size, course level, number of professors, number of credits, and average beauty rating.
188 |
189 | 11. Which variable would you expect to have the highest p-value in this model? Why? *Hint:* Think about which variable you would expect not to have any association with the professor score.
190 |
191 | Let's run the model...
192 |
193 | ```{r m_full, tidy = FALSE}
194 | m_full <- lm(score ~ rank + gender + ethnicity + language + age + cls_perc_eval
195 | + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)
196 | tidy(m_full)
197 | ```
198 |
199 | 12. Check your suspicions from the previous exercise.
200 | Include the model output in your response.
201 |
202 | 13. Interpret the coefficient associated with the ethnicity variable.
203 |
204 | 14. Drop one variable at a time and peek at the adjusted $R^2$.
205 | Removing which variable increases adjusted $R^2$ the most?
206 | Drop the variable with the highest p-value and re-fit the model.
207 | Did the coefficients and significance of the other explanatory variables change with this variable removed?
208 | (One of the things that makes multiple regression interesting is that coefficient estimates depend on the other variables that are included in the model.) If not, what does this say about whether or not the dropped variable was collinear with the other explanatory variables?
209 |
210 | 15. Using backward-selection and adjusted $R^2$ as the selection criterion, determine the best model.
211 | You do not need to show all steps in your answer, just the output for the final model.
212 | Also, write out the linear model for predicting score based on the final model you settle on.
213 |
214 | 16. Verify that the conditions for this model are reasonable using diagnostic plots.
215 |
216 | 17. The original paper describes how these data were gathered by taking a sample of professors from the University of Texas at Austin and including all courses that they have taught.
217 | Considering that each row represents a course, could this new information have an impact on any of the conditions of linear regression?
218 |
219 | 18. Based on your final model, describe the characteristics of a professor and course at the University of Texas at Austin that would be associated with a high evaluation score.
220 |
221 | 19. Would you be comfortable generalizing your conclusions to apply to professors generally (at any university)?
222 | Why or why not?
223 |
224 | ## References
225 | 
226 | Hamermesh, Daniel S., and Amy Parker. (2005). "Beauty in the classroom: instructors' pulchritude and putative pedagogical productivity." *Economics of Education Review*, 24(4), 369-376.
226 | ------------------------------------------------------------------------
227 |
228 | {style="border-width:0"} This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
229 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Code of Conduct
2 |
3 | As contributors and maintainers of this project, we pledge to respect all people who
4 | contribute through reporting issues, posting feature requests, updating documentation,
5 | submitting pull requests or patches, and other activities.
6 |
7 | We are committed to making participation in this project a harassment-free experience for
8 | everyone, regardless of level of experience, gender, gender identity and expression,
9 | sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.
10 |
11 | Examples of unacceptable behavior by participants include the use of sexual language or
12 | imagery, derogatory comments or personal attacks, trolling, public or private harassment,
13 | insults, or other unprofessional conduct.
14 |
15 | Project maintainers have the right and responsibility to remove, edit, or reject comments,
16 | commits, code, wiki edits, issues, and other contributions that are not aligned to this
17 | Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed
18 | from the project team.
19 |
20 | Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by
21 | opening an issue or contacting one or more of the project maintainers.
22 |
23 | This Code of Conduct is adapted from the [Contributor Covenant](http://contributor-covenant.org), version 1.0.0,
24 | available [here](http://contributor-covenant.org/version/1/0/0/).
25 |
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | ## LICENSE
2 |
3 | # Attribution-ShareAlike 4.0 International
4 |
5 | Creative Commons Corporation (“Creative Commons”) is not a law firm and does not provide legal services or legal advice. Distribution of Creative Commons public licenses does not create a lawyer-client or other relationship. Creative Commons makes its licenses and related information available on an “as-is” basis. Creative Commons gives no warranties regarding its licenses, any material licensed under their terms and conditions, or any related information. Creative Commons disclaims all liability for damages resulting from their use to the fullest extent possible.
6 |
7 | ### Using Creative Commons Public Licenses
8 |
9 | Creative Commons public licenses provide a standard set of terms and conditions that creators and other rights holders may use to share original works of authorship and other material subject to copyright and certain other rights specified in the public license below. The following considerations are for informational purposes only, are not exhaustive, and do not form part of our licenses.
10 |
11 | * __Considerations for licensors:__ Our public licenses are intended for use by those authorized to give the public permission to use material in ways otherwise restricted by copyright and certain other rights. Our licenses are irrevocable. Licensors should read and understand the terms and conditions of the license they choose before applying it. Licensors should also secure all rights necessary before applying our licenses so that the public can reuse the material as expected. Licensors should clearly mark any material not subject to the license. This includes other CC-licensed material, or material used under an exception or limitation to copyright. [More considerations for licensors](http://wiki.creativecommons.org/Considerations_for_licensors_and_licensees#Considerations_for_licensors).
12 |
13 | * __Considerations for the public:__ By using one of our public licenses, a licensor grants the public permission to use the licensed material under specified terms and conditions. If the licensor’s permission is not necessary for any reason–for example, because of any applicable exception or limitation to copyright–then that use is not regulated by the license. Our licenses grant only permissions under copyright and certain other rights that a licensor has authority to grant. Use of the licensed material may still be restricted for other reasons, including because others have copyright or other rights in the material. A licensor may make special requests, such as asking that all changes be marked or described. Although not required by our licenses, you are encouraged to respect those requests where reasonable. [More considerations for the public](http://wiki.creativecommons.org/Considerations_for_licensors_and_licensees#Considerations_for_licensees).
14 |
15 | ## Creative Commons Attribution-ShareAlike 4.0 International Public License
16 |
17 | By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-ShareAlike 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.
18 |
19 | ### Section 1 – Definitions.
20 |
21 | a. __Adapted Material__ means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.
22 |
23 | b. __Adapter's License__ means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License.
24 |
25 | c. __BY-SA Compatible License__ means a license listed at [creativecommons.org/compatiblelicenses](http://creativecommons.org/compatiblelicenses), approved by Creative Commons as essentially the equivalent of this Public License.
26 |
27 | d. __Copyright and Similar Rights__ means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights.
28 |
29 | e. __Effective Technological Measures__ means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements.
30 |
31 | f. __Exceptions and Limitations__ means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material.
32 |
33 | g. __License Elements__ means the license attributes listed in the name of a Creative Commons Public License. The License Elements of this Public License are Attribution and ShareAlike.
34 |
35 | h. __Licensed Material__ means the artistic or literary work, database, or other material to which the Licensor applied this Public License.
36 |
37 | i. __Licensed Rights__ means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license.
38 |
39 | j. __Licensor__ means the individual(s) or entity(ies) granting rights under this Public License.
40 |
41 | k. __Share__ means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them.
42 |
43 | l. __Sui Generis Database Rights__ means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world.
44 |
45 | m. __You__ means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning.
46 |
47 | ### Section 2 – Scope.
48 |
49 | a. ___License grant.___
50 |
51 | 1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to:
52 |
53 | A. reproduce and Share the Licensed Material, in whole or in part; and
54 |
55 | B. produce, reproduce, and Share Adapted Material.
56 |
57 | 2. __Exceptions and Limitations.__ For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.
58 |
59 | 3. __Term.__ The term of this Public License is specified in Section 6(a).
60 |
61 | 4. __Media and formats; technical modifications allowed.__ The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material.
62 |
63 | 5. __Downstream recipients.__
64 |
65 | A. __Offer from the Licensor – Licensed Material.__ Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License.
66 |
67 | B. __Additional offer from the Licensor – Adapted Material.__ Every recipient of Adapted Material from You automatically receives an offer from the Licensor to exercise the Licensed Rights in the Adapted Material under the conditions of the Adapter’s License You apply.
68 |
69 | C. __No downstream restrictions.__ You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.
70 |
71 | 6. __No endorsement.__ Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i).
72 |
73 | b. ___Other rights.___
74 |
75 | 1. Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise.
76 |
77 | 2. Patent and trademark rights are not licensed under this Public License.
78 |
79 | 3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties.
80 |
81 | ### Section 3 – License Conditions.
82 |
83 | Your exercise of the Licensed Rights is expressly made subject to the following conditions.
84 |
85 | a. ___Attribution.___
86 |
87 | 1. If You Share the Licensed Material (including in modified form), You must:
88 |
89 | A. retain the following if it is supplied by the Licensor with the Licensed Material:
90 |
91 | i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);
92 |
93 | ii. a copyright notice;
94 |
95 | iii. a notice that refers to this Public License;
96 |
97 | iv. a notice that refers to the disclaimer of warranties;
98 |
99 | v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
100 |
101 | B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and
102 |
103 | C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License.
104 |
105 | 2. You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information.
106 |
107 | 3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable.
108 |
109 | b. ___ShareAlike.___
110 |
111 | In addition to the conditions in Section 3(a), if You Share Adapted Material You produce, the following conditions also apply.
112 |
113 | 1. The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License.
114 |
115 | 2. You must include the text of, or the URI or hyperlink to, the Adapter's License You apply. You may satisfy this condition in any reasonable manner based on the medium, means, and context in which You Share Adapted Material.
116 |
117 | 3. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply.
118 |
119 | ### Section 4 – Sui Generis Database Rights.
120 |
121 | Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material:
122 |
123 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database;
124 |
125 | b. if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material, including for purposes of Section 3(b); and
126 |
127 | c. You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database.
128 |
129 | For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights.
130 |
131 | ### Section 5 – Disclaimer of Warranties and Limitation of Liability.
132 |
133 | a. __Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You.__
134 |
135 | b. __To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You.__
136 |
137 | c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
138 |
139 | ### Section 6 – Term and Termination.
140 |
141 | a. This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically.
142 |
143 | b. Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates:
144 |
145 | 1. automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or
146 |
147 | 2. upon express reinstatement by the Licensor.
148 |
149 | For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License.
150 |
151 | c. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License.
152 |
153 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public License.
154 |
155 | ### Section 7 – Other Terms and Conditions.
156 |
157 | a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed.
158 |
159 | b. Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License.
160 |
161 | ### Section 8 – Interpretation.
162 |
163 | a. For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License.
164 |
165 | b. To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions.
166 |
167 | c. No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor.
168 |
169 | d. Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority.
170 |
171 | > Creative Commons is not a party to its public licenses. Notwithstanding, Creative Commons may elect to apply one of its public licenses to material it publishes and in those instances will be considered the “Licensor.” The text of the Creative Commons public licenses is dedicated to the public domain under the [CC0 Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/legalcode). Except for the limited purpose of indicating that material is shared under a Creative Commons public license or as otherwise permitted by the Creative Commons policies published at [creativecommons.org/policies](http://creativecommons.org/policies), Creative Commons does not authorize the use of the trademark “Creative Commons” or any other trademark or logo of Creative Commons without its prior written consent including, without limitation, in connection with any unauthorized modifications to any of its public licenses or any other arrangements, understandings, or agreements concerning use of licensed material. For the avoidance of doubt, this paragraph does not form part of the public licenses.
172 | >
173 | > Creative Commons may be contacted at creativecommons.org.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | OpenIntro Labs promote the understanding and application of statistics through applied data analysis.
2 | Labs are titled based on topic area, which corresponds to particular chapters in all three versions of OpenIntro Statistics, a free and open-source textbook.
3 | The textbook as well as the html version of the labs can be found at [http://www.openintro.org/stat/labs.php](http://www.openintro.org/stat/labs.php).
4 |
5 | This repository is a fork of the original base-R labs.
6 | It incorporates the tidyverse syntax from the `dplyr` package for data manipulation, the `ggplot2` package for graphics, and the `infer` package for statistical inference.
7 |
8 | ## Labs
9 |
10 | 1. [Intro to R](http://openintrostat.github.io/oilabs-tidy/01_intro_to_r/intro_to_r.html)
11 | 2. [Intro to data](http://openintrostat.github.io/oilabs-tidy/02_intro_to_data/intro_to_data.html)
12 | 3. [Probability](http://openintrostat.github.io/oilabs-tidy/03_probability/probability.html)
13 | 4. [Normal distribution](http://openintrostat.github.io/oilabs-tidy/04_normal_distribution/normal_distribution.html)
14 | 5. Foundations of inference
15 | a. [Sampling distributions](https://openintro.shinyapps.io/sampling_distributions/)
16 | b. [Confidence intervals](https://openintro.shinyapps.io/confidence_intervals/)
17 | 6. [Inference for categorical data](https://openintro.shinyapps.io/inf_for_categorical_data/)
18 | 7. [Inference for numerical data](http://openintrostat.github.io/oilabs-tidy/07_inf_for_numerical_data/inf_for_numerical_data.html)
19 | 8. [Simple linear regression](http://openintrostat.github.io/oilabs-tidy/08_simple_regression/simple_regression.html)
20 | 9. [Multiple linear regression](http://openintrostat.github.io/oilabs-tidy/09_multiple_regression/multiple_regression.html)
21 |
22 | ## Source code for labs
23 |
24 | We currently support our source files in the R Markdown (.Rmd) format, which can be output to html (output to pdf is also possible).
25 | The source files are processed using the [knitr](http://yihui.name/knitr/) package in R, and are easiest to use in [RStudio](https://www.rstudio.com/products/rstudio/download/).
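26 | 
27 | For example, to render a single lab to html from the R console (a sketch; it assumes the rmarkdown package is installed):
28 | 
29 | ```r
30 | # render one lab's source file to html
31 | rmarkdown::render("01_intro_to_r/intro_to_r.Rmd")
32 | ```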
26 |
27 | ## Feedback / collaboration
28 |
29 | Your feedback is most welcome! If you have suggestions for minor updates (fixing typos, etc.), please do not hesitate to issue a pull request.
30 | If you have ideas for a major revamp of a lab (replacing outdated code with a modern version, an overhaul in pedagogy, etc.), please create an issue to start the conversation.
31 |
32 | It is our hope that these materials are useful for instructors and students of statistics.
33 | If you end up developing some interesting variants of these labs or creating new ones, please let us know!
34 |
35 | ## Code of Conduct
36 |
37 | Please note that the oilabs-tidy project is released with a Contributor Code of Conduct.
38 | By contributing to this project, you agree to abide by its terms.
39 |
40 | * * *
41 |
42 | This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
43 |
44 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-minimal
2 | logo: logo/logo-square.png
3 | title: OpenIntro Labs for R and tidyverse
--------------------------------------------------------------------------------
/lab.css:
--------------------------------------------------------------------------------
1 | body {
2 | counter-reset: li; /* initialize counter named li */
3 | }
4 |
5 | h1 {
6 | font-family:Arial, Helvetica, sans-serif;
7 | font-weight:bold;
8 | }
9 |
10 | h2 {
11 | font-family:Arial, Helvetica, sans-serif;
12 | font-weight:bold;
13 | margin-top: 24px;
14 | }
15 |
16 | ol {
17 | margin-left:0; /* Remove the default left margin */
18 | padding-left:0; /* Remove the default left padding */
19 | }
20 | ol > li {
21 | position:relative; /* Create a positioning context */
22 | margin:0 0 10px 2em; /* Give each list item a left margin to make room for the numbers */
23 | padding:10px 80px; /* Add some spacing around the content */
24 | list-style:none; /* Disable the normal item numbering */
25 | border-top:2px solid #317EAC;
26 | background:rgba(49, 126, 172, 0.1);
27 | }
28 | ol > li:before {
29 | content:"Exercise " counter(li); /* Use the counter as content */
30 | counter-increment:li; /* Increment the counter by 1 */
31 | /* Position and style the number */
32 | position:absolute;
33 | top:-2px;
34 | left:-2em;
35 | -moz-box-sizing:border-box;
36 | -webkit-box-sizing:border-box;
37 | box-sizing:border-box;
38 | width:7em;
39 | /* Some space between the number and the content in browsers that support
40 | generated content but not positioning it (Camino 2 is one example) */
41 | margin-right:8px;
42 | padding:4px;
43 | border-top:2px solid #317EAC;
44 | color:#fff;
45 | background:#317EAC;
46 | font-weight:bold;
47 | font-family:"Helvetica Neue", Arial, sans-serif;
48 | text-align:center;
49 | }
50 | li ol,
51 | li ul {margin-top:6px;}
52 | ol ol li:last-child {margin-bottom:0;}
53 |
54 | .oyo ul {
55 | list-style-type:decimal;
56 | }
57 |
58 | hr {
59 | border: 1px solid #357FAA;
60 | }
61 |
62 | div#boxedtext {
63 | background-color: rgba(86, 155, 189, 0.2);
64 | padding: 20px;
65 | margin-bottom: 20px;
66 | font-size: 10pt;
67 | }
68 |
69 | div#template {
70 | margin-top: 30px;
71 | margin-bottom: 30px;
72 | color: #808080;
73 | border:1px solid #808080;
74 | padding: 10px 10px;
75 | background-color: rgba(128, 128, 128, 0.2);
76 | border-radius: 5px;
77 | }
78 |
79 | div#license {
80 | margin-top: 30px;
81 | margin-bottom: 30px;
82 | color: #4C721D;
83 | border:1px solid #4C721D;
84 | padding: 10px 10px;
85 | background-color: rgba(76, 114, 29, 0.2);
86 | border-radius: 5px;
87 | }
--------------------------------------------------------------------------------
/lab_source_style_guide.md:
--------------------------------------------------------------------------------
1 | Lab Style Guide
2 | ===============
3 |
4 | - Section titles are formatted as header 2, `##`, and followed by a blank
5 | line.
6 |
7 | - Exercises are ordered lists (`1.`, `2.`, `3.`, ...), not lazy lists
8 | (`1.`, `1.`, `1.`, ...) with two spaces after the dot (or one space for
9 | numbers \> 9, < 100). Subsequent lines are hanging-indented four spaces.
10 |
11 | - On Your Own questions are unordered lists (`-`, `-`, `-`, ...) with three
12 | spaces following each `-`.
13 |
14 | - For nested lists, see the OYO in lab 6 for an example of the preferred
15 | method.
16 |
17 | - All list items should be followed by a blank line.
18 |
19 | - Wrap all lines, text and code, at 80 characters. The counter can be found
20 | in the lower left corner of the editor. Long urls or strings that should
21 | not be segmented can overflow as can lines of code that become confusing
22 | when broken up.
23 |
24 | - In-text numbers and basic math operators are not marked up. Latex is used
25 | for in-line mathematical variables (e.g. `$n$`, `$p$`) and more complex
26 | symbols. Full equations should be displayed on a separate line with `\[`
27 | and `\]`.
28 |
29 | - Any in-text code is marked up with backticks.
30 |
31 | - Force a line break with a backslash `\\`.
32 |
33 | - The OYO section should be offset with a horizontal rule that precedes it.
34 | For this, use `* * *`.
35 |
36 | - Don't embed links in text. Do this: "The BRFSS Web site
37 | ([http://www.cdc.gov/brfss](http://www.cdc.gov/brfss))...". Not this:
38 | "The [BRFSS Web site](http://www.cdc.gov/brfss)...". If teachers print
39 | the lab out, we want the links to be visible.
40 |
41 | - For superscripts used in text, use HTML instead of Latex: e.g., 4<sup>th</sup>.
--------------------------------------------------------------------------------
/logo/logo-social.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/OpenIntroStat/oilabs-tidy/5e5703a7ecf076f0de5f7d90b436977f0d6c4ae0/logo/logo-social.jpeg
--------------------------------------------------------------------------------
/logo/logo-square.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/OpenIntroStat/oilabs-tidy/5e5703a7ecf076f0de5f7d90b436977f0d6c4ae0/logo/logo-square.png
--------------------------------------------------------------------------------