├── .DS_Store
├── README.md
├── homeworks
├── hw-1-us-murders
│ ├── hw-1-solutions.Rmd
│ ├── hw-1-solutions.html
│ └── hw-1-us-murders.Rmd
├── hw-2-vaccines
│ ├── hw-2-solutions.Rmd
│ ├── hw-2-solutions.html
│ └── hw-2-vaccines.Rmd
├── hw-3-casino
│ ├── hw-3-casino.Rmd
│ ├── hw-3-solutions.Rmd
│ └── hw-3-solutions.html
├── hw-4-elections
│ ├── elections_polls.RData
│ ├── hw-4-elections.Rmd
│ ├── hw-4-solutions.Rmd
│ └── hw-4-solutions.html
├── hw-5-moneyball
│ ├── hw-5-moneyball.Rmd
│ ├── hw-5-solutions.Rmd
│ └── hw-5-solutions.html
└── hw-6-netflix
│ └── hw-6-netflix.Rmd
└── lectures
├── .DS_Store
├── R
├── 00-motivation.Rmd
├── 01-data-types.Rmd
├── 02-vectors.Rmd
├── 03-sorting.Rmd
├── 04-vector-arithmetics.Rmd
├── 05-indexing.Rmd
├── 06-basic-data-wrangling.Rmd
├── 07-basic-plots.Rmd
├── 07-basic-plots.html
├── 08-importing-data.Rmd
├── 08-importing-data.html
├── 09-programming-basics.Rmd
├── 09-programming-basics.html
├── README.md
├── intro-to-rmarkdown.Rmd
└── murders.csv
├── course-intro
└── course-intro.pptx
├── dataviz
├── dataviz-principles-assessment.Rmd
├── dataviz-principles.Rmd
├── distributions.Rmd
├── gapminder-assessments.Rmd
├── gapminder.Rmd
└── intro-to-ggplot2.Rmd
├── git-and-github
├── git-command-line.Rmd
├── git-rstudio.Rmd
├── images
│ ├── clone_button.png
│ ├── directorysetup.png
│ ├── git-clone.png
│ ├── git_add.png
│ ├── git_clone.png
│ ├── git_commit.png
│ ├── git_fetch.png
│ ├── git_layout.png
│ ├── git_merge.png
│ ├── git_push.png
│ ├── git_status.png
│ ├── gitclean.png
│ ├── gitclone.png
│ ├── gitcommit.png
│ ├── github-https-clone.png
│ ├── github-ssh-clone.png
│ ├── github.png
│ ├── github_ssh.png
│ ├── gitpush.png
│ ├── gitstaged.png
│ ├── gituntracked.png
│ ├── mac-git-security.png
│ ├── mkdir-clone.png
│ ├── newproject.png
│ ├── rstudio_commit.png
│ ├── rstudio_screen.png
│ ├── sshkeygen.png
│ ├── wgi-defaultlines.png
│ ├── wgi-git-bash.png
│ ├── wgi-scarymessage.png
│ └── wgi-usemintty.png
└── setting-up-github.rmd
├── inference
├── association-tests.Rmd
├── bayes.Rmd
├── clt.Rmd
├── confidence-intervals-p-values-assessment.Rmd
├── confidence-intervals-p-values.Rmd
├── election-forecasting.Rmd
├── img
│ ├── pollster-2016-predictions.png
│ ├── popular-vote-538.png
│ ├── rcp-polls.png
│ └── urn.jpg
├── intro-to-inference.Rmd
├── models-assessment.Rmd
├── models.Rmd
├── parameters-estimates.Rmd
└── t-distribution.Rmd
├── ml
├── cross-validation-slides.pdf
├── cross-validation.Rmd
├── decision-trees.Rmd
├── img
│ ├── binsmoother1.gif
│ ├── binsmoother2.gif
│ ├── finishedEnvelope.jpg
│ ├── loess.gif
│ └── loesses.gif
├── intro-ml-assessment.Rmd
├── intro-ml.Rmd
├── lda.Rmd
├── matrices.Rmd
├── regularization.Rmd
└── rf.gif
├── prob
├── continuous-probability.Rmd
├── discrete-probability.Rmd
└── random-variables-sampling-models-clt.Rmd
├── regression
├── confounding.Rmd
├── intro-to-regression.Rmd
├── linear-models.Rmd
└── motivation-regression.Rmd
├── shiny
├── introToDataScience2017_shiny.pdf
├── introToDataScience2017_shiny.pptx
└── shiny_assessments.R
└── wrangling
├── .Rhistory
├── combining-tables.Rmd
├── data-import.Rmd
├── dates-and-times.Rmd
├── intro-to-wrangling.Rmd
├── reshaping-data.Rmd
├── string-processing.Rmd
├── tidy-data.Rmd
└── web-scraping.Rmd
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/.DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Welcome to Introduction to Data Science
2 |
3 | * Homeworks and lectures for Fall 2017 can be found here.
4 | * Official course webpage: [http://datasciencelabs.github.io/datasciencelabs.github.io-2017/](http://datasciencelabs.github.io/datasciencelabs.github.io-2017/)
5 |
--------------------------------------------------------------------------------
/homeworks/hw-1-us-murders/hw-1-solutions.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Homework 1 Solutions"
3 | date: "Due 9/17/2017"
4 | output: html_document
5 | editor_options:
6 | chunk_output_type: inline
7 | ---
8 |
9 | # Homework 1
10 |
11 | Several of your friends live in Europe and are offered jobs in a US company with many locations all across the country. The job offers are great but news with headlines such as [**US Gun Homicide Rate Higher Than Other Developed Countries**](http://abcnews.go.com/blogs/headlines/2012/12/us-gun-ownership-homicide-rate-higher-than-other-developed-countries/) have them worried. Charts like this make them worry even more:
12 |
13 | 
14 |
15 | You want to convince your friends that the US is a large and diverse country with 50 very different states as well as the District of Columbia (DC). You want to recommend some state for each friend knowing that some like hiking, others would like to be close to several large cosmopolitan cities. Use data from the US murders data set:
16 |
17 | ```{r}
18 | library(dslabs)
19 | data(murders)
20 | ```
21 |
22 | 1. What is the state with the most murders? Would you say this is the most dangerous state? Hint: Make a plot showing the relationship between population size and number of murders.
23 |
24 | **Solution:**
25 | ```{r}
26 | murders$state[which.max(murders$total)]
27 | ```
28 |
29 | California is the state with the most murders. However, this does not necessarily make California the most dangerous state. The following plot shows that the number of murders is strongly correlated with a state's population. California, the state with the largest population, also has the highest total number of murders.
30 |
31 | ```{r}
32 | plot(murders$population,
33 | murders$total,
34 | xlab = "Population",
35 | ylab = "Murders",
36 | main = "Population and Gun Murders Across US States")
37 | ```
38 |
39 | 2. Add a column to the murder data table called `murder_rate` with each state's murder rate.
40 |
41 | **Solution:**
42 | ```{r}
43 | library(dplyr)
44 | murders <- murders %>%
45 | mutate(murder_rate = total / (population / 100000))
46 |
47 | ## alternative approach (also per 100,000)
48 | ## murders$murder_rate <- murders$total / murders$population * 100000
49 | ```
50 |
51 | 3. Describe the distribution of murder rates across states. How similar are states? How much do murder rates vary by geographical regions?
52 |
53 | **Solution:** The distribution is heavily right skewed with a few clear outliers, i.e. states with notably higher murder rates than the rest. The South has a noticeably higher murder rate than the other regions.
54 | ```{r}
55 | ## can use histogram
56 | hist(murders$murder_rate,
57 | breaks = 15,
58 | xlab = "Murders per 100,000",
59 | ylab = "Number of States",
60 | main = "Murder Rates Across US States")
61 |
62 | ## can use boxplot by region
63 | boxplot(murder_rate ~ region,
64 | data = murders,
65 | xlab = "Region",
66 | ylab = "Murders per 100,000",
67 | main = "Murder Rates by Region")
68 | ```
69 |
70 | 4. Write a report for your friends reminding them that the US is a large and diverse country with 50 very different states as well as the District of Columbia (DC). Suppose one of your friends loves hiking, one wants to live in a warm climate, and another would like to be close to several large cosmopolitan cities. Recommend a desirable state for each friend.
71 |
72 | **Solution:** A complete response should include suggestions for each of the three friends based on both their interests (e.g. by considering states good for hiking) and the data analyzed in the previous three problems (e.g. by considering state-level murder rates).
73 |
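A minimal sketch of how one might back such recommendations with the data (the cutoff of 1 murder per 100,000 is an arbitrary choice for illustration, not part of the assignment):

```{r}
library(dplyr)
## candidate recommendations: states with low murder rates, with their regions
murders %>%
  filter(murder_rate < 1) %>%
  select(state, region, murder_rate) %>%
  arrange(murder_rate)
```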
--------------------------------------------------------------------------------
/homeworks/hw-1-us-murders/hw-1-us-murders.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Homework 1: US gun murders"
3 | date: "September 6, 2017"
4 | output: html_document
5 | ---
6 |
7 | # Homework 1
8 |
9 | Several of your friends live in Europe and are offered jobs in a US company with many locations all across the country. The job offers are great but news with headlines such as [**US Gun Homicide Rate Higher Than Other Developed Countries**](http://abcnews.go.com/blogs/headlines/2012/12/us-gun-ownership-homicide-rate-higher-than-other-developed-countries/) have them worried. Charts like this make them worry even more:
10 |
11 | 
12 |
13 | You want to convince your friends that the US is a large and diverse country with 50 very different states as well as the District of Columbia (DC). You want to recommend some state for each friend knowing that some like hiking, others would like to be close to several large cosmopolitan cities. Use data from the US murders data set:
14 |
15 | ```{r}
16 | library(dslabs)
17 | data(murders)
18 | ```
19 |
20 | 1. What is the state with the most murders? Would you say this is the
21 | most dangerous state? Hint: Make a plot showing the relationship between population size and number of murders.
22 |
23 | 2. Add a column to the murder data table called `murder_rate` with each state's murder rate.
24 |
25 | 3. Describe the distribution of murder rates across states. How similar are states? How much do murder rates vary by geographical regions?
26 |
27 | 4. Write a report for your friends reminding them that the US is a large and diverse country with 50 very different states as well as the District of Columbia (DC). Suppose one of your friends loves hiking, one wants to live in a warm climate, and another would like to be close to several large cosmopolitan cities. Recommend a desirable state for each friend.
28 |
29 |
30 |
31 |
32 |
--------------------------------------------------------------------------------
/homeworks/hw-2-vaccines/hw-2-vaccines.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Homework 2"
3 | date: "September 21, 2017"
4 | output: html_document
5 | ---
6 |
7 | # Homework 2
8 |
9 | Vaccines have helped save millions of lives. In the 19th century, before herd immunization was achieved through vaccination programs, deaths from infectious diseases, like smallpox and polio, were common. However, today, despite all the scientific evidence for their importance, vaccination programs have become somewhat controversial.
10 |
11 | The controversy started with a [paper](http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(97)11096-0/abstract) published in 1998 and led by [Andrew Wakefield](https://en.wikipedia.org/wiki/Andrew_Wakefield) claiming
12 | there was a link between the administration of the measles, mumps and rubella (MMR) vaccine, and the appearance of autism and bowel disease.
13 | Despite much science contradicting this finding, sensationalist media reports and fear mongering from conspiracy theorists led parts of the public to believe that vaccines were harmful. Some parents stopped vaccinating their children. This dangerous practice is potentially disastrous given that the Centers for Disease Control and Prevention (CDC) estimates that vaccinations will prevent more than 21 million hospitalizations and 732,000 deaths among children born in the last 20 years (see [Benefits from Immunization during the Vaccines for Children Program Era — United States, 1994-2013, MMWR](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6316a4.htm)).
14 |
15 | Effective communication of data is a strong antidote to misinformation and fear mongering. In this homework you are going to prepare a report to have ready in case you need to help a family member, friend or acquaintance that is not aware of the positive impact vaccines have had for public health.
16 |
17 | The data used in this homework were collected, organized and distributed by the [Tycho Project](http://www.tycho.pitt.edu/). They include weekly reported counts for seven diseases from 1928 to 2011, from all fifty states. We include the yearly totals in the `dslabs` package:
18 |
19 | ```{r}
20 | library(dslabs)
21 | data(us_contagious_diseases)
22 | ```
23 |
24 | 1. Use the `us_contagious_diseases` data and `dplyr` tools to create an object called `dat` that stores only the Measles data, includes a rate per 100,000 people, and removes Alaska and Hawaii since they only became states in the late 50s. Note that there is a `weeks_reporting` column. Take that into account when computing the rate.
25 |
26 | ```{r}
27 | ## Your code here
28 | ```
29 |
30 | 2. Plot the Measles disease rates per year for California. Find out when the Measles vaccine was introduced and add a vertical line to the plot to show this year.
31 |
32 | ```{r}
33 | ## Your code here
34 | ```
35 |
36 | 3. Note these rates start off as counts. For larger counts we can expect more variability. There are statistical explanations for this which we don't discuss here. But transforming the data might help stabilize the variability such that it is closer across levels. For 1950, 1960, and 1970, plot the histogram of the data across states with and without the square root transformation. Which seems to have more similar variability across years? Make sure to pick binwidths that result in informative plots.
37 |
38 | ```{r}
39 | ## Your code here
40 | ```
41 |
42 | 4. Plot the Measles disease rates per year for California. Use the square root transformation. Make sure that the numbers $0, 4, 16, 36, \dots, 100$ appear on the y-axis.
43 | Find out when the Measles vaccine was introduced and add a vertical line to the plot to show this year:
44 |
45 | ```{r}
46 | ## Your code here
47 | ```
48 |
49 | 5. Now, this is just California. Does the pattern hold for other states? Use boxplots to get an idea of the distribution of rates for each year, and see if the pattern holds across states.
50 |
51 | ```{r}
52 | ## Your code here
53 | ```
54 |
55 | 6. One problem with the boxplot is that it does not let us see state-specific trends. Make a plot showing the trends for all states. Add the US average to the plot. Hint: Note there are missing values in the data.
56 |
57 | ```{r}
58 | ## Your code here
59 | ```
60 |
61 | 7. One problem with the plot above is that we can't distinguish states from each other. There are just too many. We have three variables to show: year, state and rate. If we use the two dimensions to show year and state, then we need something other than vertical or horizontal position to show the rates. Try using color. Hint: Use the geometry `geom_tile` to tile the plot with colors representing disease rates.
62 |
63 | ```{r}
64 | ## Your code here
65 | ```
66 |
67 | 8. The plots above provide strong evidence showing the benefits of vaccines: as vaccines were introduced, disease rates were reduced. But did autism increase? Find yearly reported autism rates data and provide a plot that shows if it has increased and if the increase coincides with the introduction of vaccines.
68 |
69 | 9. Use data exploration to determine if other diseases (besides Measles) have enough data to explore the effects of vaccines. Prepare a two page report with as many plots as you think are necessary to provide a case for the benefit of vaccines.
70 |
71 |
--------------------------------------------------------------------------------
/homeworks/hw-3-casino/hw-3-casino.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Homework 3: Help the Casino"
3 | output: html_document
4 | ---
5 |
6 | # Problem 1
7 |
8 | In the game of [roulette](https://en.wikipedia.org/wiki/Roulette) you can bet on several things including black or red. On this bet, if you win, you double your earnings. In this problem we will look at how the casino makes money on this. If you look at the [possibilities](http://www.math.uah.edu/stat/games/Roulette.png), you realize that the chances of red and black are both slightly less than 1/2. There are two green spots, so the probability of landing on black (or red) is actually 18/38, or 9/19.
9 |
10 | ## Problem 1A
11 |
12 | Let's make a quick sampling model for this simple version of roulette. You are going to bet a dollar each time you play and always bet on black. Make a sampling model for this process using the `sample` function. Write a function `roulette` that takes as an argument the number of times you play, $n$, and returns your earnings, which here we denote with $S_n$.
13 |
14 | ## Problem 1B
15 |
16 | Use Monte Carlo simulation to study the distribution of total earnings $S_n$ for $n = 100, 250, 500, 1000$. That is, for each value of $n$, make one or more plots to examine the distribution of earnings. Examine the plots, and describe how the expected values and standard errors change with $n$. You do not need to show us the plots. Just the code you used to create them. Hints: It's OK to use a for-loop. Think about the possible values $S_n$ can take when deciding on the `geom_histogram` parameters such as `binwidth` and `center`.
17 |
18 | ## Problem 1C
19 |
20 | Repeat Problem 1B but for the means instead of the sums. After you answer, describe the mathematical results that you can use to answer this without making plots.
21 |
22 | ## Problem 1D
23 |
24 | Now think of a sampling model for our casino problem. What is the expected value of our sampling model? What is the standard deviation of our sampling model?
25 |
26 | ## Problem 1E
27 |
28 | Suppose you play 100 times. Use the Central Limit Theorem (CLT) to approximate the probability that the casino loses money. Then use a Monte Carlo simulation to corroborate your finding.
29 |
30 | ## Problem 1F
31 |
32 | In general, what is the probability that the casino loses money as a function of $n$? Make a plot for values ranging from 25 to 1,000. Why does the casino give you free drinks if you keep playing?
33 |
34 |
35 | # Problem 2
36 |
37 | The baseball playoffs are about to start. During the first round of the playoffs, teams play a best of five series. After the first round, they play seven game series.
38 |
39 | ## Problem 2A
40 |
41 | The Red Sox and Astros are playing a five game series. Assume they are equally good. This means each game is like a coin toss. Build a Monte Carlo simulation to determine the probability that the Red Sox win the series. (Hint: start by creating a function `series_outcome` similar to the `roulette` function from Problem 1A.)
42 |
43 | ## Problem 2B
44 |
45 | The answer to Problem 2A is not surprising. What if one of the teams is better? Compute the probability that the Red Sox win the series if the Astros are better and have a 60% chance of winning each game.
46 |
47 | ## Problem 2C
48 |
49 | How does this probability change if instead of five games, they play seven? How about three? What law did you learn that explains this?
50 |
51 | ## Problem 2D
52 |
53 | Now, assume again that the two teams are equally good. What is the probability that the Red Sox still win the series if they lose the first game? Do this for a five game and seven game series.
54 |
--------------------------------------------------------------------------------
/homeworks/hw-3-casino/hw-3-solutions.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Homework 3 Solutions"
3 | date: "Due 10/8/2017"
4 | output: html_document
5 | editor_options:
6 | chunk_output_type: inline
7 | ---
8 |
9 | # Homework 3
10 |
11 | ## Problem 1
12 |
13 | In the game of [roulette](https://en.wikipedia.org/wiki/Roulette) you can bet on several things including black or red. On this bet, if you win, you double your earnings. In this problem we will look at how the casino makes money on this. If you look at the [possibilities](http://www.math.uah.edu/stat/games/Roulette.png), you realize that the chances of red and black are both slightly less than 1/2. There are two green spots, so the probability of landing on black (or red) is actually 18/38, or 9/19.
14 |
15 | ### Problem 1A
16 |
17 | Let's make a quick sampling model for this simple version of roulette. You are going to bet a dollar each time you play and always bet on black. Make a sampling model for this process using the `sample` function. Write a function `roulette` that takes as an argument the number of times you play, $n$, and returns your earnings, which here we denote with $S_n$.
18 |
19 | **Solution:**
20 | ```{r}
21 | roulette <- function(n) {
22 | x <- sample(c(-1, 1), n, replace = TRUE, prob = c(10/19, 9/19))
23 | sum(x)
24 | }
25 | ```
26 |
27 | ### Problem 1B
28 |
29 | Use Monte Carlo simulation to study the distribution of total earnings $S_n$ for $n = 100, 250, 500, 1000$. That is, for each value of $n$, make one or more plots to examine the distribution of earnings. Examine the plots, and describe how the expected values and standard errors change with $n$. You do not need to show us the plots. Just the code you used to create them. Hints: It's OK to use a for-loop. Think about the possible values $S_n$ can take when deciding on the `geom_histogram` parameters such as `binwidth` and `center`.
30 |
31 | **Solution:**
32 | ```{r, eval=FALSE}
33 | library(tidyverse)
34 | B <- 10000
35 | ns <- c(100, 250, 500, 1000)
36 | for(n in ns) {
37 | winnings <- replicate(B, roulette(n))
38 | p <- data.frame(winnings = winnings) %>%
39 | ggplot(aes(x = winnings)) +
40 | geom_histogram(binwidth = 10, boundary = 0, color = "black") +
41 | ggtitle(n)
42 | print(p)
43 | }
44 | ```
45 |
46 | For the sums, the expected value decreases (becomes more negative) and the standard error increases with larger $n$.
47 |
48 | ### Problem 1C
49 |
50 | Repeat Problem 1B but for the means instead of the sums. After you answer, describe the mathematical results that you can use to answer this without making plots.
51 |
52 | **Solution:**
53 | ```{r, eval=FALSE}
54 | B <- 10000
55 | ns <- c(100, 250, 500, 1000)
56 | for(n in ns) {
57 | winnings <- replicate(B, roulette(n))
58 | p <- data.frame(average_winnings = winnings / n) %>%
59 | ggplot(aes(x = average_winnings)) +
60 | geom_histogram(bins = 15, center = 0, color = "black") +
61 | ggtitle(n)
62 | print(p)
63 | }
64 | ```
65 |
66 | For the means, the expected value does not change and the standard error decreases with larger $n$. The expected value does not change because the expected value of an average of independent identically distributed random variables is the expected value of any one of the random variables. The standard error, however, decreases because the standard error of the average of independent identically distributed random variables is the standard error of any one of the random variables divided by $\sqrt{n}$.
67 |
68 | ### Problem 1D
69 |
70 | Now think of a sampling model for our casino problem. What is the expected value of our sampling model? What is the standard deviation of our sampling model?
71 |
72 | **Solution:**
73 | The expectation is $\mu = -1 \times (1-p) + 1 \times p$, which with $p = 9/19$ is $-1/19 \approx -0.053$. The casino makes, on average, about 5 cents on each dollar bet. The standard deviation is $\sigma = |1 - (-1)|\sqrt{(9/19)(10/19)}$, which is approximately 0.998614.
74 |
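A quick numerical check of these two quantities:

```{r}
p <- 9/19                                  # probability of winning a single bet
mu <- -1 * (1 - p) + 1 * p                 # expected value of one bet
sigma <- abs(1 - -1) * sqrt(p * (1 - p))   # standard deviation of one bet
c(mu = mu, sigma = sigma)
```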
75 | ### Problem 1E
76 |
77 | Suppose you play 100 times. Use the Central Limit Theorem (CLT) to approximate the probability that the casino loses money. Then use a Monte Carlo simulation to corroborate your finding.
78 |
79 | **Solution:**
80 | By the CLT, the sum, $S_n$, is approximately normal with mean $\mu \times n$ and standard error $\sqrt{n} \sigma$. Since we play 100 times, $n = 100$. To calculate the probability that the casino loses (i.e. $S_n > 0$), we standardize $S_n$ and calculate the tail probability of a standard normal distribution.
81 |
82 | $$
83 | \begin{align}
84 | \mbox{Pr}( S_n > 0)
85 | &= \mbox{Pr}\left( \frac{S_n - \mu n}{\sigma \sqrt{n}} > \frac{ - \mu n}{\sigma \sqrt{n}}\right) \\
86 | &= 1 - \Phi\left( \sqrt{n}\,\frac{ - \mu }{\sigma} \right)
87 | \end{align}
88 | $$
89 |
90 | ```{r}
91 | 1 - pnorm(sqrt(100) * (1/19) / 0.998614)
92 | ```
93 |
94 | Next, we can compare the probability estimated using the approximation with the probability estimated using Monte Carlo simulations.
95 |
96 | ```{r}
97 | B <- 10^5
98 | winnings <- replicate(B, roulette(100))
99 | mean(winnings > 0)
100 | ```
101 |
102 | ### Problem 1F
103 |
104 | In general, what is the probability that the casino loses money as a function of $n$? Make a plot for values ranging from 25 to 1,000. Why does the casino give you free drinks if you keep playing?
105 |
106 | **Solution:**
107 | ```{r}
108 | n <- seq(25, 1000, len = 100)
109 | prob_of_casino_losing <- 1 - pnorm(sqrt(n) * (1/19) / 0.998614)
110 | plot(n, prob_of_casino_losing,
111 | xlab = "Games Played",
112 | ylab = "Probability of Casino Losing Money",
113 | main = "Why Casinos Give You Free Drinks")
114 | ```
115 |
116 | The probability that the casino loses money decreases as the number of games played, $n$, increases. By giving you free drinks so that you play more rounds of roulette, the casino is decreasing its probability of losing money.
117 |
118 | ## Problem 2
119 |
120 | The baseball playoffs are about to start. During the first round of the playoffs, teams play a best of five series. After the first round, they play seven game series.
121 |
122 | ### Problem 2A
123 |
124 | The Red Sox and Astros are playing a five game series. Assume they are equally good. This means each game is like a coin toss. Build a Monte Carlo simulation to determine the probability that the Red Sox win the series. (Hint: start by creating a function `series_outcome` similar to the `roulette` function from Problem 1A.)
125 |
126 | **Solution:**
127 | ```{r}
128 | series_outcome <- function(n) {
129 | x <- sample(c(0, 1), n, replace = TRUE)
130 | sum(x) >= (n + 1) / 2
131 | }
132 | ```
133 |
134 | We can now perform a Monte Carlo simulation to determine the probability of winning the series. We play the series 10,000 times.
135 |
136 | ```{r}
137 | results <- replicate(10000, series_outcome(n = 5))
138 | mean(results)
139 | ```
140 |
141 | ### Problem 2B
142 |
143 | The answer to Problem 2A is not surprising. What if one of the teams is better? Compute the probability that the Red Sox win the series if the Astros are better and have a 60% chance of winning each game.
144 |
145 | **Solution:**
146 | We first modify the `series_outcome` function to also take `p` (the probability of the Red Sox winning each game) as a parameter.
147 | ```{r}
148 | series_outcome <- function(n, p) {
149 | x <- sample(c(0,1), n, replace = TRUE, prob = c(1 - p, p))
150 | sum(x) >= (n + 1) / 2
151 | }
152 | ```
153 |
154 | We again replicate the series 10,000 times using Monte Carlo simulation.
155 |
156 | ```{r}
157 | results <- replicate(10000, series_outcome(n = 5, p = 0.4))
158 | mean(results)
159 | ```
160 |
161 | Here, since `n` is small, it is also possible to use `pbinom` to calculate the exact Binomial probability. If the Astros are better and have a 60% chance of winning each game, the probability of the Red Sox winning the series decreases.
162 |
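A sketch of that exact calculation (assuming, as in the simulation, that all five games are played and the Red Sox need at least three wins):

```{r}
1 - pbinom(2, size = 5, prob = 0.4)  # P(Red Sox win >= 3 of 5 games)
```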
163 | ### Problem 2C
164 |
165 | How does this probability change if instead of five games, they play seven? How about three? What law did you learn that explains this?
166 |
167 | **Solution:**
168 | ```{r}
169 | results <- replicate(10000, series_outcome(n = 7, p = 0.4))
170 | mean(results)
171 |
172 | results <- replicate(10000, series_outcome(n = 3, p = 0.4))
173 | mean(results)
174 | ```
175 |
176 | Again, since `n` is small, it is also possible to use `pbinom` to calculate the exact Binomial probabilities. If they play seven games instead of five, the probability of the Red Sox winning the series is smaller. If they play three games, the probability is greater. This can be explained by the law of large numbers, and more directly, by the fact that the standard error of the average (here, the proportion of games won by the Red Sox) decreases with increasing $n$.
177 |
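The corresponding exact calculations, under the same assumption that all games are played:

```{r}
1 - pbinom(3, size = 7, prob = 0.4)  # P(win >= 4 of 7)
1 - pbinom(1, size = 3, prob = 0.4)  # P(win >= 2 of 3)
```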
178 | ### Problem 2D
179 |
180 | Now, assume again that the two teams are equally good. What is the probability that the Red Sox still win the series if they lose the first game? Do this for a five game and seven game series.
181 |
182 | **Solution:**
183 | ```{r}
184 | after_one_loss <- function(n) {
185 | x <- sample(c(0, 1), n - 1, replace = TRUE)
186 | sum(x) >= (n + 1) / 2
187 | }
188 | results <- replicate(10000, after_one_loss(n = 5))
189 | mean(results)
190 |
191 | results <- replicate(10000, after_one_loss(n = 7))
192 | mean(results)
193 | ```
194 |
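As in the previous problems, these probabilities can be checked exactly with `pbinom`, counting the wins needed in the remaining games after the first loss:

```{r}
1 - pbinom(2, size = 4, prob = 0.5)  # five game series: need 3 of the remaining 4
1 - pbinom(3, size = 6, prob = 0.5)  # seven game series: need 4 of the remaining 6
```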
195 |
--------------------------------------------------------------------------------
/homeworks/hw-4-elections/elections_polls.RData:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/homeworks/hw-4-elections/elections_polls.RData
--------------------------------------------------------------------------------
/homeworks/hw-4-elections/hw-4-elections.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Homework 4: Election Forecasting"
3 | date: "10/11/2017"
4 | output: html_document
5 | ---
6 |
7 | ```{r setup, include=FALSE}
8 | knitr::opts_chunk$set(echo = TRUE)
9 | ```
10 |
11 | # How reliable is polling data?
12 |
13 | Leading up to the 2016 presidential election, many pollsters predicted that the Democratic candidate, Hillary Clinton, would win a ["decisive victory."][wapo] However, as we all know, the election was won by the Republican candidate, and current president, Donald Trump. During class we discussed how general biases, not accounted for by prediction models, often affect many pollsters in the same way. In this homework, you are going to further investigate these biases through comparisons across both national and state-level races.
14 |
15 | The repository for this homework includes an **.RData** file, `elections_polls.RData`, containing a `data.frame` (`polls`) with several years' worth of polling data (2008, 2010, 2012, 2014 and 2016). The polls cover federal elections for house representatives, senators and the president, and include polling data from up to a year before the election date. The Presidential election polls were collected from the [RealClearPolitics website][rcp] and the Congressional and Senatorial polls were collected from the [FiveThirtyEight GitHub repository][thirty].
16 |
17 | ```{r, warning=FALSE, message=FALSE}
18 | library(tidyverse)
19 | load("elections_polls.RData")
20 | ```
21 |
22 | The `polls` `data.frame` contains the following columns:
23 |
24 | - `race`: race identifier year_electiontype_location.
25 | - `race_state`: race identifier year_electiontype_state. In contrast to the previous column, this identifier ignores information about counties and only contains information at the state level.
26 | - `state`: abbreviation of state of the election
27 | - `state_long`: full name of the state
28 | - `type`: type of race. Can be a presidential (Pres), senatorial (Sen-G) or house representative (House-G) election.
29 | - `year`: election year
30 | - `pollster`: name of the pollster
31 | - `samplesize`: size of the sample used in the poll
32 | - `startdate`: start date of the poll. If this date was not available, this will be the same as `enddate`
33 | - `enddate`: end date of the poll
34 | - `democrat_name`: name of the democratic candidate
35 | - `democrat_poll`: percentage of people from the poll saying they would vote for the democratic candidate
36 | - `democrat_result`: actual percentage of people voting for the democratic candidate in the election
37 | - `republican_name`: name of the republican candidate
38 | - `republican_poll`: percentage of people from the poll saying they would vote for the republican candidate
39 | - `republican_result`: actual percentage of people voting for the republican candidate in the election
40 |
41 | ## Problem 1
42 | Subset the `polls` `data.frame` to only keep polls which ended within approximately 6 weeks preceding any [Election Day][election-day] (i.e. in October or November). You will be using this smaller data set for the remainder of this homework. Hint: you might need to extract the month from the `enddate`. The `strftime` function might be useful for this.
43 |
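As a quick illustration of the hinted function (using an arbitrary date, not the polls data):

```{r}
strftime(as.Date("2016-11-08"), "%m")  # returns "11", the month as a string
```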
44 |
45 | ## Problem 2
46 | For each poll, calculate the difference between the fraction of people saying they would vote for the Republican Party and the fraction of people saying they would vote for the Democratic Party. Add these values to your `data.frame` as a new column, `spread`. Similarly, calculate the true (actual) difference between the fraction of people who ended up voting for the Republican Party and the fraction of people who ended up voting for the Democratic Party. Again, add the true (actual) difference as a new column, `spread_act`, to your `data.frame`.
47 |
48 |
49 | ## Problem 3
50 | Now, we are going to collapse polls for each race. For this, we group polls by the type, year, and state of the corresponding election. There are several polls for each race, and each one provides an approximation of the real $\theta$ value. Generate a point estimate for each race, $\hat{\theta}$, that summarizes the polls for that race using the following steps: [1] use the column `race_state` to group polls by type, year, and state, and [2] use the `summarize` function to generate a new `data.frame` called `reduced_polls` with the following columns:
51 |
52 | 1. the mean `spread`,
53 | 2. the standard deviation of the `spread`,
54 | 3. the mean `spread_act`, and
55 | 4. the number of polls per race.
56 |
57 | Make sure you also keep information about the `year` and `state` of each race in this new `data.frame`.
58 |
59 |
60 | ## Problem 4
61 | Note that the previous question merges different congressional elections held in the same year across districts in a state. Thus, using the collapsed `data.frame` from the previous question, filter out races from congressional elections. Also, filter out races that had fewer than 3 polls. The `reduced_polls` `data.frame` should now contain only Presidential and Senatorial elections. For each remaining race, build a 95\% confidence interval for $\hat{\theta}$. Include the boundaries of these confidence intervals in the `reduced_polls` `data.frame`.
62 |
63 |
64 | ## Problem 5
65 | For each election type in each year, calculate the fraction of states where the actual result was **outside** of the 95% confidence interval. Which race was the most unpredictable (i.e. for which race was the polling data most inaccurate compared to the actual result)?
66 |
67 |
68 | ## Problem 6
69 | Using data from *only* the 2016 presidential election, make a plot of states ($x$-axis) and $\hat{\theta}$ estimates ($y$-axis). Using the `geom_errorbar` function, include the 95\% confidence intervals of $\hat{\theta}$ for each state. Finally, using a different color, include the actual results for each state. Describe the resulting plot.
70 |
71 |
72 | ## Problem 7
73 | Which states did Donald Trump win in the 2016 presidential election, despite the entire 95\% confidence intervals being in favor of his opponent, Hillary Clinton?
74 |
75 |
76 | ## Problem 8
77 | Looking again at all races, calculate the difference between $\theta$ and $\hat{\theta}$ (Hint: use the data for all races in the `reduced_polls` object created in Problem 4). We call this the bias term. Add these values as a column to `reduced_polls`.
78 |
79 |
80 | ## Problem 9
81 | Plot and compare the distribution of bias terms for races in each year. Describe the bias patterns. Are these centered around zero? Give possible explanations.
82 |
83 |
84 | ## Problem 10
85 | Using the [__fiftystater__](https://cran.r-project.org/web/packages/fiftystater/index.html) package, create a plot for each of the last three presidential elections showing the bias estimates for each state on a map of the United States. Describe any patterns or differences between the three elections.
86 |
87 |
88 |
89 | [wapo]:https://www.washingtonpost.com/news/monkey-cage/wp/2016/11/08/a-comprehensive-average-of-election-forecasts-points-to-a-decisive-clinton-victory/
90 | [election-day]:https://en.wikipedia.org/wiki/Election_Day_(United_States)
91 | [rcp]: https://www.realclearpolitics.com/
92 | [thirty]: https://github.com/fivethirtyeight/data
93 |
--------------------------------------------------------------------------------
/homeworks/hw-6-netflix/hw-6-netflix.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Netflix Challenge"
3 | output: html_document
4 | ---
5 |
6 | ```{r, echo = FALSE}
7 | library(knitr)
8 | opts_chunk$set(cache = TRUE, message = FALSE)
9 | ```
10 |
11 | Recommendation systems use rating data from many products and users to make recommendations for a specific user. Netflix uses a recommendation system to predict your ratings for a specific movie.
12 |
13 | In October 2006, Netflix offered a challenge to the data science community: _improve our recommendation algorithm by 10% and win a million dollars_. In September 2009, [the winners were announced](http://bits.blogs.nytimes.com/2009/09/21/netflix-awards-1-million-prize-and-starts-a-new-contest/). You can read a good summary of how the winning algorithm was put together [here](http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/), and a more detailed explanation [here](http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf).
14 |
15 | 
16 |
17 | In this homework, you will build your own recommendation system. You will submit predicted recommendations for a test data set where we have kept the actual recommendations hidden. We will then check your performance on these predictions and have our own Netflix challenge. The winning team, defined by the best root mean squared error (RMSE), will receive a prize. The set that you will have to predict is available on GitHub [here](https://github.com/datasciencelabs/data/blob/master/movielens-test.csv.gz).
18 |
19 | RMSE was the metric used to judge entries in the Netflix challenge. The lower the RMSE between the submitted rating predictions and the actual ratings on Netflix's quiz set, the better the method. We will be using RMSE to evaluate our machine learning models in this homework as well.
20 |
21 | $$\mbox{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N (\hat{Y}_i - Y_i)^2}$$
22 |
23 | Download the [large training data set (compressed)](https://github.com/datasciencelabs/data/blob/master/movielens-train.csv.gz) and load it into R. Train a machine learning model of your choice. You may wish to utilize a technique such as cross-validation to optimize any parameters associated with your model, and you may implement any modelling technique you feel comfortable with. This may include regression, regularization techniques, matrix decompositions (such as utilized by the winning team [here](http://www.netflixprize.com/assets/ProgressPrize2008_BellKor.pdf)), etc.
24 |
25 | **Hint 1**: You can read in the compressed file with `read_csv(gzfile(filename))`
26 |
27 | **Hint 2**: Use the `RMSE()` function below to check your accuracy.
28 | ```{r}
29 | RMSE <- function(true_ratings, predicted_ratings){
30 | sqrt(mean((true_ratings - predicted_ratings)^2))
31 | }
32 | ```
33 |
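As a rough starting point, a minimal baseline sketch (assuming the training file is named `movielens-train.csv.gz` and contains `movieId` and `rating` columns, which you should verify after loading) predicts every rating with the overall mean plus a per-movie effect; any model you train should beat this:

```{r, eval=FALSE}
library(tidyverse)
train <- read_csv(gzfile("movielens-train.csv.gz"))

## baseline: overall mean plus the average per-movie deviation from that mean
mu <- mean(train$rating)
movie_effects <- train %>%
  group_by(movieId) %>%
  summarize(b_i = mean(rating - mu))

preds <- train %>%
  left_join(movie_effects, by = "movieId") %>%
  mutate(pred = mu + b_i)

RMSE(preds$rating, preds$pred)
```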
34 | Download the test data set available on GitHub [here](https://github.com/datasciencelabs/data/blob/master/movielens-test.csv.gz). Make predictions to fill in the `NA`s and save a file in the same format, but with the ratings filled in, to your repo. Submit this as a `.csv` file with your name in the file name (the file does not need to be compressed), along with the code you used to train the model, as part of your homework.
35 |
--------------------------------------------------------------------------------
/lectures/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/.DS_Store
--------------------------------------------------------------------------------
/lectures/R/00-motivation.Rmd:
--------------------------------------------------------------------------------
1 | # Introduction to R
2 |
3 | In this book we will be using the
4 | [R software environment](https://cran.r-project.org/) for all our
5 | analysis. Throughout the book you will learn R and data analysis techniques simultaneously. However, we need to introduce basic R syntax to get you going. In this chapter, rather than cover every R skill you need, we introduce just enough so that you can follow along in the remaining chapters, where we provide more in-depth coverage, building upon what you learn here. We find that R knowledge is better retained when we learn it to solve a specific problem.
6 |
7 | In this chapter, as throughout the book, we will use a motivating case study. We ask a specific question related to crime in the United States and provide a relevant dataset. Some basic R skills will permit us to answer the motivating question.
8 |
9 |
10 | ## US gun murders
11 |
12 | Imagine you live in Europe and are offered a job in a US company with many locations across all states. It is a great job but news with headlines such as [**US Gun Homicide Rate Higher Than Other Developed Countries**](http://abcnews.go.com/blogs/headlines/2012/12/us-gun-ownership-homicide-rate-higher-than-other-developed-countries/) have you worried. Charts like this make you worry even more:
13 |
14 | 
15 |
16 | Or even worse, this version from [everytown.org](https://everytownresearch.org/us-gun-violence-trends/):
17 |
18 | 
19 |
20 | But then you are reminded that the US is a large and diverse country with 50 very different states as well as the District of Columbia (DC).
21 |
22 |
23 |
24 | California, for example, has a larger population than Canada, and 20 US states have populations larger than that of Norway. In some respects the variability across states in the US is akin to the variability across countries in Europe. Furthermore, although not in the charts above, the murder rates in Lithuania, Ukraine, and Russia are higher than 4 per 100,000. So perhaps the news reports that worried you are too superficial. You have options of where to live and want to find out how safe each state is. We will gain some insights by examining data related to gun homicides in the US using R.
25 |
26 | Now, before we get started with our example, we need to cover logistics as well as some of the very basic building blocks required to gain more advanced R skills. Be aware that for some of these, it is not immediately obvious how they are useful, but later in the book you will appreciate having the knowledge under your belt.
27 |
28 |
--------------------------------------------------------------------------------
/lectures/R/01-data-types.Rmd:
--------------------------------------------------------------------------------
1 | ## Data types
2 |
3 | Variables in R can be of different types. For example we need to distinguish numbers from character strings and tables from simple lists of numbers. The function `class` helps us determine what type of object we have:
4 |
5 | ```{r}
6 | a <- 2
7 | class(a)
8 | ```
9 |
10 | To work efficiently in R it is important to learn the different types of variables and what we can do with these.
11 |
12 | ### Data Frames
13 |
14 | Up to now, the variables we have defined are just one number. This is not very useful for storing data. The most common way of storing a dataset in R is in a _data frame_. Conceptually, we can think of a data frame as a table with rows representing observations and the different variables reported for each observation defining the columns. Data frames are particularly useful for datasets because we can combine different data types into one object.
15 |
16 | We stored the data for our motivating example in a data frame. You can access this dataset by loading the `dslabs` library and loading the `murders` dataset using the `data` function:
17 |
18 | ```{r}
19 | library(dslabs)
20 | data(murders)
21 | ```
22 |
23 | To see that this is in fact a data frame we type
24 |
25 | ```{r}
26 | class(murders)
27 | ```
28 |
29 | ### Examining an object
30 |
31 | The function `str` is useful to find out more about the structure of an object
32 |
33 | ```{r}
34 | str(murders)
35 | ```
36 |
37 | This tells us much more about the object. We see that the table has 51 rows (50 states plus DC) and five variables. We can show the first six lines using the function `head`:
38 |
39 | ```{r}
40 | head(murders)
41 | ```
42 |
43 | In this data set each state is considered an observation and five variables are reported for each state.
44 |
45 | Before we go any further in answering our original question about different states, let's get to know the components of this object better.
46 |
47 | ### The accessor
48 |
49 | For our analysis we will need to access the different variables, represented by columns, included in this data frame. To access these variables we use the accessor operator `$` in the following way:
50 |
51 | ```{r}
52 | murders$population
53 | ```
54 |
55 | But how did we know to use `population`? Above, by applying the function `str` to the object `murders`, we revealed the names of each of the five variables stored in this table. We can quickly access the variable names using:
56 |
57 | ```{r}
58 | names(murders)
59 | ```
60 |
61 | It is important to know that the order of the entries in `murders$population` preserve the order of the rows in our data table. This will later permit us to manipulate one variable based on the results of another, for example we will be able to order the state names by number of murders.
62 |
63 | **Tip**: R comes with a very nice auto-complete functionality that saves us the trouble of typing out full names. Try typing `murders$p` then hitting the _tab_ key on your keyboard. RStudio has many useful auto-complete options.
64 |
65 | ### Vectors: numerics, characters, and logical
66 |
67 | Note that the object `murders$population` is not one number but several. We call these types of objects _vectors_. A single number is technically a vector, but in general vectors refer to objects with several entries. The function `length` tells you how many entries are in the vector:
68 |
69 | ```{r}
70 | pop <- murders$population
71 | length(pop)
72 | ```
73 |
74 | This particular vector is _numeric_ since population sizes are numbers:
75 |
76 | ```{r}
77 | class(pop)
78 | ```
79 | In a numeric vector, every entry must be a number.
80 |
81 | To store character strings, vectors can also be of class _character_. For example the state names are characters:
82 |
83 | ```{r}
84 | class(murders$state)
85 | ```
86 |
87 | As with numeric vectors, all entries in a character vector need to be characters.
88 |
89 | Another important type are _logical vectors_. These must be either `TRUE` or `FALSE`.
90 |
91 | ```{r}
92 | z <- 3 == 2
93 | z
94 | class(z)
95 | ```
96 |
97 | Here the `==` is a relational operator asking if 3 is equal to 2. Remember that in R, if you just use one `=` you actually assign a value. You can see the other _relational operators_ by typing
98 |
99 | ```{r, eval=FALSE}
100 | ?Comparison
101 | ```
102 |
103 | In future sections you will see how useful relational operators can be.
104 |
105 |
106 | **Advanced**: Mathematically, the values in `pop` are integers and there is an integer class in R. However, by default, numbers are assigned class numeric even when they are round integers; for example, `class(1)` returns numeric. You can turn them into class integer with `as.integer(1)` or by adding an `L` like this: `1L`. Note the class by typing: `class(1L)`
107 |
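For example, comparing the classes directly:

```{r}
class(1)              # "numeric"
class(1L)             # "integer"
class(as.integer(1))  # "integer"
```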
108 |
109 | ### Factors
110 |
111 | In the `murders` dataset we might expect the region to also be a character vector. However it is not:
112 |
113 | ```{r}
114 | class(murders$region)
115 | ```
116 |
117 | it is a _factor_. Factors are useful for storing categorical data. Notice that there are only 4 regions:
118 |
119 |
120 | ```{r}
121 | levels(murders$region)
122 | ```
123 |
124 | So, in the background, R stores these _levels_ as integers and keeps a map to keep track of the labels. This is more memory efficient than storing all the characters. However, factors are also a source of confusion, as they can easily be mistaken for character vectors but behave differently in different contexts. We will see more of this later.
125 |
126 | In general, we recommend avoiding factors as much as possible, although they are sometimes necessary to fit models containing categorical data.
127 |
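One common gotcha, shown here as a brief illustration: applying `as.numeric` to a factor returns the underlying integer codes rather than the labels.

```{r}
as.character(murders$region)[1:3]  # the labels
as.numeric(murders$region)[1:3]    # the internal integer codes
```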
128 | ### Lists
129 |
130 | Data frames are a special case of _lists_. We will cover lists in more detail later, but know that they are useful because you can store any combination of other types. Here is an example of a list we created for you:
131 |
132 |
133 | ```{r, echo=FALSE}
134 | record <- list(name = "John Doe",
135 | student_id = 1234,
136 | grades = c(95, 82, 91, 97, 93),
137 | final_grade = "A")
138 | ```
139 |
140 | ```{r}
141 | record
142 | class(record)
143 | ```
144 |
145 | We won't be using lists until later but you might encounter one in your own exploration of R. Note that, as with data frames, you can extract the components with the accessor `$`. In fact, data frames are a type of list.
146 |
147 | ```{r}
148 | record$student_id
149 | ```
150 |
151 | We can also use double brackets like this:
152 |
153 | ```{r}
154 | record[["student_id"]]
155 | ```
156 |
157 | You should get used to the fact that in R there are several ways to do the same thing, in particular accessing entries.
158 |
159 |
160 |
--------------------------------------------------------------------------------
/lectures/R/02-vectors.Rmd:
--------------------------------------------------------------------------------
1 | ## Vectors
2 |
3 | The most basic unit available in R to store data are _vectors_. As we have seen complex datasets can usually be broken down into components that are vectors. For example in a data frame, each column is a vector. Here we learn more about this important class.
4 |
5 | ### Creating vectors
6 |
7 | We can create vectors using the function `c`, which stands for concatenate. We use `c` to _concatenate_ entries in the following way:
8 |
9 | ```{r}
10 | codes <- c(380, 124, 818)
11 | codes
12 | ```
13 |
14 | We can also create character vectors. We use the quotes to denote that the entries are characters rather than variable names.
15 |
16 | ```{r}
17 | country <- c("italy","canada","egypt")
18 | ```
19 |
20 | By now you should know that if you type
21 |
22 | ```{r, eval=FALSE}
23 | country <- c(italy, canada, egypt)
24 | ```
25 | you receive an error because the variables `italy`, `canada` and `egypt` are not defined: R looks for variables with those names and returns an error.
26 |
27 | ### Names
28 |
29 | Sometimes it is useful to name the entries of a vector. For example, when defining a vector of country codes we can use the names to connect the two:
30 |
31 | ```{r}
32 | codes <- c(italy = 380, canada = 124, egypt = 818)
33 | codes
34 | ```
35 |
36 | The object `codes` continues to be a numeric vector:
37 | ```{r}
38 | class(codes)
39 | ```
40 | but with names
41 | ```{r}
42 | names(codes)
43 | ```
44 |
45 | If the use of strings without quotes looks confusing, know that you can use the quotes as well
46 |
47 | ```{r}
48 | codes <- c("italy" = 380, "canada" = 124, "egypt" = 818)
49 | codes
50 | ```
51 |
52 | There is no difference between this call and the previous one: one of the many ways R is quirky compared to other languages.
53 |
54 | We can also assign names using the `names` function:
55 |
56 | ```{r}
57 | codes <- c(380, 124, 818)
58 | country <- c("italy","canada","egypt")
59 | names(codes) <- country
60 | codes
61 | ```
62 |
63 | ### Sequences
64 |
65 | Another useful function for creating vectors generates sequences
66 |
67 | ```{r}
68 | seq(1, 10)
69 | ```
70 |
71 | The first argument defines the start, and the second the end. The default is to go up in increments of 1, but a third argument lets us specify how much to jump by:
72 |
73 | ```{r}
74 | seq(1, 10, 2)
75 | ```
76 |
77 | If we want consecutive integers we can use the following shorthand
78 |
79 | ```{r}
80 | 1:10
81 | ```
82 |
83 | Note that when we use this shorthand, R produces integers, not numerics, because such sequences are typically used to index something:
84 |
85 | ```{r}
86 | class(1:10)
87 | ```
88 |
89 | However, note that as soon as the sequence contains a value that is not an integer, the class changes:
90 |
91 | ```{r}
92 | class(seq(1, 10))
93 | class(seq(1, 10, 0.5))
94 | ```
95 |
96 | ### Subsetting
97 |
98 | We use square brackets to access specific elements of a vector. For the vector `codes` we defined above, we can access the second element using
99 | ```{r}
100 | codes[2]
101 | ```
102 |
103 | You can get more than one entry by using a multi-entry vector as an index:
104 | ```{r}
105 | codes[c(1,3)]
106 | ```
107 |
108 | The sequences defined above are particularly useful if we want to access, say, the first two elements
109 |
110 | ```{r}
111 | codes[1:2]
112 | ```
113 |
114 | If the elements have names, we can also access the entries using these names. Here are two examples
115 |
116 | ```{r}
117 | codes["canada"]
118 | codes[c("egypt","italy")]
119 | ```
120 |
121 | ### Coercion
122 |
123 | In general, _coercion_ is an attempt by R to be flexible with data types. When an entry does not match the expected type, R tries to guess what we meant before throwing an error. This can also lead to confusion. Failing to understand _coercion_ can drive programmers crazy when attempting to code in R, since it behaves quite differently from most other languages in this regard. Let's learn about it with some examples.
124 |
125 | We said that vectors must be all of the same type. So if we try to combine, say, numbers and characters you might expect an error
126 |
127 | ```{r}
128 | x <- c(1, "canada", 3)
129 | ```
130 |
131 | But we don't get one, not even a warning! What happened? Look at `x` and its class:
132 |
133 | ```{r}
134 | x
135 | class(x)
136 | ```
137 |
138 | R _coerced_ the data into characters. It guessed that because you put a character string in the vector, you meant the 1 and 3 to actually be the character strings `"1"` and `"3"`. The fact that not even a warning is issued is an example of how coercion can cause many unnoticed errors in R.
139 |
140 | R also offers functions to force a specific coercion. For example you can turn numbers into characters with
141 |
142 | ```{r}
143 | x <- 1:5
144 | y <- as.character(x)
145 | y
146 | ```
147 |
148 | And you can turn it back with `as.numeric`.
149 |
150 | ```{r}
151 | as.numeric(y)
152 | ```
153 |
154 | This function is actually quite useful as datasets that include numbers as character strings are common.
155 |
156 | ### Not Availables (NA)
157 |
158 | When these coercion functions encounter an impossible case, they give us a warning and turn the entry into a special value called `NA`, for "not available". For example:
159 |
160 | ```{r}
161 | x <- c("1", "b", "3")
162 | as.numeric(x)
163 | ```
164 |
165 | R does not have any guesses for what number you want when you type `b` so it does not try.
166 |
167 | Note that as a data scientist you will encounter `NA`s often, as they are used for missing data, a common problem in real-life datasets.
168 |
169 |
170 |
171 |
--------------------------------------------------------------------------------
/lectures/R/03-sorting.Rmd:
--------------------------------------------------------------------------------
1 | ## Sorting
2 |
3 | Now that we have some basic R knowledge under our belt, let's try to gain some insights into the safety of different states in the context of gun murders.
4 |
5 | ### `sort`
6 |
7 | We want to rank the states from least to most gun murders. The function `sort` sorts a vector in increasing order. So we can see the largest number of gun murders by typing
8 |
9 | ```{r}
10 | library(dslabs)
11 | data(murders)
12 | sort(murders$total)
13 | ```
14 |
15 | However, this does not give us information about which states have which murder totals. For example, we don't know which state had `r max(murders$total)` murders in 2010.
16 |
17 | ### `order`
18 |
19 | The function `order` is closer to what we want. It takes a vector and returns the vector of indexes that sorts the input vector. This may sound confusing, so let's look at a simple example: we create a vector and sort it:
20 |
21 | ```{r}
22 | x <- c(31, 4, 15, 92, 65)
23 | sort(x)
24 | ```
25 |
26 | Rather than sort the vector, the function `order` gives us back the index that, if used to index the vector, will sort it:
27 |
28 | ```{r}
29 | index <- order(x)
30 | x[index]
31 | ```
32 |
33 | If we look at this index we see why it works:
34 | ```{r}
35 | x
36 | order(x)
37 | ```
38 |
39 | Note that the second entry of `x` is the smallest, so `order(x)` starts with `2`. The next smallest is the third entry, so the second entry of `order(x)` is `3`, and so on.
40 |
41 | How does this help us order the states by murders? First, remember that the entries of vectors you access with `$` follow the same order as the rows in the table. So, for example, these two vectors, containing the state names and abbreviations respectively, are matched by their order:
42 |
43 | ```{r}
44 | murders$state[1:10]
45 | murders$abb[1:10]
46 | ```
47 |
48 | This means we can now order the state names by their total murders
49 | by first obtaining the index that orders the vector according to murder totals, and then indexing the state name or abbreviation vector:
50 |
51 | ```{r}
52 | ind <- order(murders$total)
53 | murders$abb[ind]
54 | ```
55 |
56 | We see that California had the most murders.
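   | If we instead want to see the states from most to least murders, we can use the `decreasing` argument of `order` (a quick sketch of the idea):
   | 
   | ```{r}
   | ind <- order(murders$total, decreasing = TRUE)
   | murders$abb[ind]
   | ```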
57 |
58 | ### `max` and `which.max`
59 |
60 | If we are only interested in the entry with the largest value we can use `max` for the value
61 |
62 | ```{r}
63 | max(murders$total)
64 | ```
65 |
66 | and `which.max` for the index of the largest value
67 |
68 | ```{r}
69 | i_max <- which.max(murders$total)
70 | murders$state[i_max]
71 | ```
72 |
73 | For the minimum we can use `min` and `which.min` in the same way.
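   | For example, a quick sketch for the state with the fewest murders:
   | 
   | ```{r}
   | min(murders$total)
   | i_min <- which.min(murders$total)
   | murders$state[i_min]
   | ```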
74 |
75 | So is California the most dangerous state? In the next section we argue that we should be considering rates, not totals. Before doing that, we introduce one last order-related function: `rank`.
76 |
77 | ### `rank`
78 |
79 | Although less useful than `order` and `sort`, the function `rank` is also related to ordering.
80 | For any given vector it returns a vector with the rank of the first entry, second entry, etc. Here is a simple example:
81 |
82 | ```{r}
83 | x <- c(31, 4, 15, 92, 65)
84 | rank(x)
85 | ```
86 |
87 | To summarize, let's look at the results of the three functions we have introduced:
88 |
89 | ```{r, echo=FALSE}
90 | knitr::kable(data.frame(original=x, sort=sort(x), order=order(x), rank=rank(x)))
91 |
92 | ```
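   | As a quick sanity check of how these functions relate, indexing `x` with `order(x)` should reproduce `sort(x)`:
   | 
   | ```{r}
   | identical(x[order(x)], sort(x))
   | ```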
93 |
94 |
95 |
--------------------------------------------------------------------------------
/lectures/R/04-vector-arithmetics.Rmd:
--------------------------------------------------------------------------------
1 | ## Vector arithmetics
2 |
3 | ```{r}
4 | library(dslabs)
5 | data(murders)
6 | ```
7 |
8 | California had the most murders. But does this mean it is the most dangerous state? What if it just has many more people than any other state? We can very quickly confirm that, indeed, California has the largest population:
9 |
10 | ```{r}
11 | murders$state[which.max(murders$population)]
12 | ```
13 |
14 | with over `r floor(max(murders$population)/10^6)` million inhabitants! It is therefore unfair to compare the totals if we are interested in learning how safe the state is.
15 |
16 | What we really should be computing is the murders per capita. The reports we described in the motivating section used murders per 100,000 as the unit. To compute this quantity, the powerful vector arithmetic capabilities of R come in handy.
17 |
18 | ### Rescaling
19 |
20 | In R, arithmetic operations on vectors occur _element-wise_. For a quick example, suppose we have heights in inches:
21 |
22 | ```{r}
23 | heights <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
24 | ```
25 | and want to convert to centimeters. Note what happens when we multiply `heights` by 2.54:
26 |
27 | ```{r}
28 | heights * 2.54
29 | ```
30 |
31 | It multiplied each element by 2.54. Similarly, if we want to compute how many inches taller or shorter each person is than the average of 69 inches, we can subtract it from every entry like this:
32 |
33 | ```{r}
34 | heights - 69
35 | ```
36 |
37 |
38 | ### Two vectors
39 |
40 | If we have two vectors of the same length, and we sum them in R, they get added entry by entry like this
41 |
42 | $$
43 | \begin{pmatrix}
44 | a\\
45 | b\\
46 | c\\
47 | d
48 | \end{pmatrix}
49 | +
50 | \begin{pmatrix}
51 | e\\
52 | f\\
53 | g\\
54 | h
55 | \end{pmatrix}
56 | =
57 | \begin{pmatrix}
58 | a +e\\
59 | b + f\\
60 | c + g\\
61 | d + h
62 | \end{pmatrix}
63 | $$
64 |
65 | The same holds for other mathematical operations such as `-`, `*` and `/`.
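   | Here is a quick sketch with two short vectors to illustrate:
   | 
   | ```{r}
   | x <- c(1, 2, 3, 4)
   | y <- c(10, 20, 30, 40)
   | x + y # added entry by entry
   | x * y # multiplied entry by entry
   | ```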
66 |
67 | This implies that to compute the murder rates we can simply type
68 |
69 | ```{r}
70 | murder_rate <- murders$total / murders$population * 100000
71 | ```
72 |
73 | Once we do this, we notice that California is no longer near the top of the list. In fact, we can use what we have learned to order the states by murder rate:
74 |
75 | ```{r}
76 | murders$state[order(murder_rate)]
77 | ```
78 |
79 |
80 | ### Assessment
81 |
82 |
83 | 1. What is `1 + 1/2^2 + 1/3^2 + ... + 1/100^2`? Check if Euler was right: he showed that the infinite sum converges to `pi^2/6`.
84 |
85 |
86 | 2. Compute the per 100,000 murder rate for each state and store it in the object `murder_rate`. Then compute the average murder rate for the US using the function `mean`. What is the average?
87 |
88 | 3. Create this data frame:
89 |
90 | ```{r}
91 | temp <- c(35, 88, 42, 84, 81, 30)
92 | city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
93 | city_temps <- data.frame(name = city, temperature = temp)
94 | ```
95 |
96 | Remake the data frame using the code above, but add a line that converts the temperature from Fahrenheit to Celsius.
97 |
98 |
99 |
100 |
101 |
--------------------------------------------------------------------------------
/lectures/R/05-indexing.Rmd:
--------------------------------------------------------------------------------
1 | ## Indexing
2 |
3 | ```{r}
4 | library(dslabs)
5 | data(murders)
6 | ```
7 |
8 | R provides a powerful and convenient way of indexing vectors. We can, for example, subset a vector based on properties of another vector. We continue with our US murders example to demonstrate.
9 |
10 | ### Subsetting with logicals
11 |
12 | We have now calculated the murder rate using
13 |
14 | ```{r}
15 | murder_rate <- murders$total / murders$population * 100000
16 | ```
17 |
18 | Say you are moving from Italy where, according to the ABC news report, the murder rate is only 0.71 per 100,000. You would prefer to move to a state with a similar rate. Another powerful feature of R is that we can use logicals to index vectors.
19 | Note that if we compare a vector to a single number, the comparison is actually performed for each entry. Here is an example related to the question above:
20 |
21 | ```{r}
22 | ind <- murder_rate < 0.71
23 | ```
24 |
25 | Or if we want to know whether it is less than or equal, we can use:
26 |
27 | ```{r}
28 | ind <- murder_rate <= 0.71
29 | ind
30 | ```
31 |
32 | Note that we get back a logical vector with `TRUE` for each entry smaller than or equal to 0.71. To see which states these are, we can leverage the fact that vectors can be indexed with logicals.
33 |
34 | ```{r}
35 | murders$state[ind]
36 | ```
37 |
38 | To count how many entries are `TRUE`, we can use the function `sum`, which returns the sum of the entries of a vector; logical vectors get _coerced_ to numeric with `TRUE` coded as 1 and `FALSE` as 0. Thus we can count the states using:
39 |
40 | ```{r}
41 | sum(ind)
42 | ```
43 |
44 |
45 | ### Logical Operators
46 |
47 | Suppose we like the mountains and we want to move to a safe state in the West region of the country. We want the murder rate to be at most 1. So we want two different things to be true. Here we can use the logical operator _and_, which in R is `&`. This operation results in `TRUE` only when both logicals are `TRUE`. To see this, consider this example:
48 |
49 | ```{r}
50 | TRUE & TRUE
51 | TRUE & FALSE
52 | FALSE & FALSE
53 | ```
54 |
55 | We can form two logicals:
56 |
57 | ```{r}
58 | west <- murders$region=="West"
59 | safe <- murder_rate<=1
60 | ```
61 |
62 | and we can use the `&` to get a vector of logicals that tells us which states satisfy our condition
63 |
64 | ```{r}
65 | ind <- safe & west
66 | murders$state[ind]
67 | ```
68 |
69 | ### `which`
70 |
71 | Suppose we want to look up California's murder rate. For this type of operation, it is convenient to convert vectors of logicals into indexes instead of keeping long vectors of logicals. The function `which` tells us which entries of a logical vector are TRUE. So we can type:
72 |
73 | ```{r}
74 | ind <- which(murders$state == "California")
75 | ind ##this is the index that matches the California entry
76 | murder_rate[ind]
77 | ```
78 |
79 |
80 | ### `match`
81 |
82 | If instead of just one state we want to find out the murder rates for several, say New York, Florida, and Texas, we can use the function `match`. This function tells us which indexes of a second vector match each of the entries of a first vector:
83 |
84 | ```{r}
85 | ind <- match(c("New York", "Florida", "Texas"), murders$state)
86 | ind
87 | ```
88 |
89 | Now we can look at the murder rates:
90 |
91 | ```{r}
92 | murder_rate[ind]
93 | ```
94 |
95 | ### `%in%`
96 |
97 | If, rather than an index, we want a logical that tells us whether or not each element of a first vector is in a second, we can use the function `%in%`. So, say you are not sure if Boston, Dakota, and Washington are states, you can find out like this:
98 |
99 | ```{r}
100 | c("Boston", "Dakota", "Washington") %in% murders$state
101 | ```
102 |
103 |
104 | **Advanced**: Note that there is a connection between `match` and `%in%` through `which`: the following two lines of code give the same indexes, although not necessarily in the same order:
105 |
106 | ```{r}
107 | match(c("New York", "Florida", "Texas"), murders$state)
108 | which(murders$state%in%c("New York", "Florida", "Texas"))
109 | ```
110 |
111 |
112 | ### Assessment
113 |
114 | 1. Compute the per 100,000 murder rate for each state and store it in an object called `murder_rate`. Then use the logical operators to create a logical vector, name it `low`, that tells us which entries of `murder_rate` are lower than 1.
115 |
116 | 2. Now use the results from the previous exercise and the function `which` to determine the indexes of `murder_rate` associated with values lower than 1.
117 |
118 | 3. Use the results from the previous exercise to report the names of the states with murder rates lower than 1.
119 |
120 | 4. Now extend the code from exercises 2 and 3 to report the states in the Northeast with murder rates lower than 1. Hint: use the previously defined logical vector `low` and the logical operator `&`.
121 |
122 | 5. In a previous exercise we computed the murder rate for each state and the average of these numbers. How many states are below the average?
123 |
124 | 6. Use the match function to identify the states with abbreviations AK, MI, and IA. Hint: Start by defining an index of the entries of `murders$abb` that match the three abbreviations, then use the `[` operator to extract the states.
125 |
126 | 7. Use the `%in%` operator to create a logical vector that answers the question: which of the following are actual abbreviations: MA, ME, MI, MO, MU?
127 |
128 | 8. Extend the code you used in exercise seven to report the one entry that is **not** an actual abbreviation. Hint: Use the `!` operator, which turns `FALSE` into `TRUE` and vice-versa, then `which` to obtain an index.
129 |
130 |
--------------------------------------------------------------------------------
/lectures/R/06-basic-data-wrangling.Rmd:
--------------------------------------------------------------------------------
1 | ## Basic Data Wrangling
2 |
3 | ```{r}
4 | library(dslabs)
5 | data(murders)
6 | ```
7 |
8 | Up to now we have been changing vectors by reordering them and subsetting them through indexing. But once we start more advanced analyses, we will want to prepare data tables for data analysis. We refer to this task as data wrangling.
9 | For this purpose we will introduce the `dplyr` package, which provides intuitive functionality for working with tables.
10 |
11 | Once you install `dplyr` you can load it using
12 |
13 | ```{r}
14 | library(dplyr)
15 | ```
16 |
17 | This package introduces functions that perform the most common operations in data wrangling and uses names for these functions that are relatively easy to remember. For example, to change the data table by adding a new column we use `mutate`, to filter the data table to a subset of rows we use `filter`, and to subset the data by selecting specific columns we use `select`. We can also perform a series of operations, for example select and then filter, by sending the results of one function to another using what is called the _pipe operator_: `%>%`. Some details are included below.
18 |
19 | ### Adding a column with `mutate`
20 |
21 | We want all the necessary information for our analysis to be included in the data table. So the first task is to add the murder rate to our data frame. The function `mutate` takes the data frame as its first argument and the name and values of the new variable as the second, using the convention `name = values`. So to add the murder rate we use:
22 |
23 | ```{r,message=FALSE}
24 | murders <- mutate(murders, murder_rate = total / population * 100000)
25 | ```
26 |
27 | Note that here we used `total` and `population` in the function, which are objects that are **not** defined in our workspace. What is happening is that `mutate` knows to look for these variables in the `murders` data frame. So the intuitive line of code above does exactly what we want. We can see the new column is added:
28 |
29 | ```{r}
30 | head(murders)
31 | ```
32 |
33 | Also note that we have over-written the original `murders` object. However, this does *not* change the object that ships with `dslabs` and that we load with `data(murders)`: if we reload the dataset, the original will over-write our mutated version.
34 |
35 |
36 |
37 | ### Subsetting with `filter`
38 |
39 | Now suppose that we want to filter the data table to only show the entries for which the murder rate is at most 0.71. To do this we use the `filter` function, which takes the data table as the first argument and the conditional statement as the next. Like `mutate`, we can use the data table variable names inside the function and it will know we mean the columns and not objects in the workspace.
40 |
41 | ```{r}
42 | filter(murders, murder_rate <= 0.71)
43 | ```
44 |
45 |
46 | ### Selecting columns with `select`
47 |
48 | Although our data table only has six columns, some data tables include hundreds. If we want to view just a few we can use the `select` function. In the code below we select three columns, assign this to a new object and then filter the new object:
49 |
50 | ```{r}
51 | new_table <- select(murders, state, region, murder_rate)
52 | filter(new_table, murder_rate <= 0.71)
53 | ```
54 |
55 | Note that in the call to `select`, the first argument, `murders`, is an object, but `state`, `region`, and `murder_rate` are variable names.
56 |
57 | ### The pipe: `%>%`
58 |
59 | In the code above we want to show the three variables for states that have murder rates at or below 0.71. To do this we defined an intermediate object. In `dplyr` we can write code that looks more like our description of what we want to do:
60 |
61 | >> original data $\rightarrow$ select $\rightarrow$ filter
62 |
63 | For such operation, we can use the pipe `%>%`. The code looks like this:
64 |
65 | ```{r}
66 | murders %>% select(state, region, murder_rate) %>% filter(murder_rate <= 0.71)
67 | ```
68 |
69 | This line of code is equivalent to the two lines of code above. Note that when using the pipe we no longer need to specify the data table as the first argument, since the `dplyr` functions assume that whatever is being _piped_ is what should be operated on.
70 |
71 | ## Summarizing data with `dplyr`
72 |
73 | An important part of exploratory data analysis is summarizing data. It is sometimes useful to split data into groups before summarizing.
74 |
75 | ### Summarize
76 |
77 | The `summarize` function in `dplyr` provides a way to compute summary statistics with intuitive and readable code. We start by loading the `heights` dataset, which we will use in an example below:
78 |
79 | ```{r}
80 | library(dslabs)
81 | data(heights)
82 | ```
83 |
84 | We can compute the average of the state murder rates like this:
85 |
86 |
87 | ```{r}
88 | murders %>% summarize(avg = mean(murder_rate))
89 | ```
90 |
91 | However, note that the US murder rate is **not** the average of the state murder rates, because in this computation the small states are given the same weight as the large ones. The US murder rate is proportional to the total US murders divided by the total US population.
92 |
93 | ### Assessment
94 |
95 | Compute the country's murder rate using the `summarize` function, and name it `us_murder_rate`:
96 |
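   | One possible solution (a sketch) is to divide the total number of murders by the total population:
   | 
   | ```{r}
   | us_murder_rate <- murders %>%
   |   summarize(murder_rate = sum(total) / sum(population) * 100000)
   | us_murder_rate
   | ```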
97 |
98 | This computation counts larger states proportionally to their size, and the result is larger than the simple average of the state rates computed above.
99 |
100 | ### Using the dot to access the piped data
101 |
102 | The `us_murder_rate` object defined above represents just one number. Yet we are storing it in a data frame
103 |
104 | ```{r}
105 | class(us_murder_rate)
106 | ```
107 |
108 | since, like most `dplyr` functions, `summarize` always returns a data frame.
109 |
110 | This might be problematic if we want to use the result with functions that require a numeric value. Here we show a useful trick to access values stored in data piped via `%>%`: when a data object is piped it can be accessed using the dot `.`. To understand what we mean take a look at this line of code:
111 |
112 | ```{r}
113 | us_murder_rate %>% .$murder_rate
114 | ```
115 |
116 | Note that this returns the value in the `murder_rate` column of `us_murder_rate`, making it equivalent to `us_murder_rate$murder_rate`. To understand this line, you just need to think of `.` as a placeholder for the data that is being passed through the pipe. Because this data object is a data frame, we can access its columns with `$`.
117 |
118 | To get a number from the original data table with one line of code we can type:
119 |
120 | ```{r}
121 | us_murder_rate <- murders %>%
122 | summarize( murder_rate= sum(total) / sum(population) * 100000) %>%
123 | .$murder_rate
124 |
125 | us_murder_rate
126 | ```
127 |
128 | which is now a numeric:
129 |
130 | ```{r}
131 | class(us_murder_rate)
132 | ```
133 |
134 | We will see other instances in which using the `.` is useful. For now, we will only use it to produce numeric vectors from pipelines constructed with `dplyr`.
135 |
136 | ### Group then summarize
137 |
138 | A common operation in data exploration is to first split data into groups and then compute summaries for each group. For example, we may want to compute the average and standard deviation of men's and women's heights separately. The `group_by` function helps us do this.
139 |
140 | If we type this, we get the median murder rate for each of the four regions:
141 |
142 | ```{r}
143 | murders %>%
144 | group_by(region) %>%
145 | summarize(median_rate = median(murder_rate))
146 | ```
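   | As another example of this split-then-summarize pattern, here is a sketch using the `heights` dataset we loaded earlier, computing the average and standard deviation of height separately for each sex:
   | 
   | ```{r}
   | heights %>%
   |   group_by(sex) %>%
   |   summarize(average = mean(height), standard_deviation = sd(height))
   | ```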
147 |
148 | ### Sorting data tables
149 |
150 | When examining a dataset, it is often convenient to sort the table by the different columns. We know about the `order` and `sort` functions, but for ordering entire tables, the `dplyr` function `arrange` is useful. For example, to order the states by population size we type:
151 |
152 | ```{r}
153 | murders %>%
154 | arrange(population) %>%
155 | head()
156 | ```
157 |
158 | Note that we get to decide which column to sort by. To see the states by murder rate, from smallest to largest, we arrange by `murder_rate` instead:
159 |
160 | ```{r}
161 | murders %>%
162 | arrange(murder_rate) %>%
163 | head()
164 | ```
165 |
166 | Note that the default behavior is to order in ascending order. In `dplyr`, the function `desc` transforms a vector to be in descending order. So if we want to sort the table in descending order we can type
167 |
168 | ```{r}
169 | murders %>%
170 | arrange(desc(murder_rate)) %>%
171 | head()
172 | ```
173 |
174 | #### Nested Sorting
175 |
176 | If we are ordering by a column with ties we can use a second column to break the tie. Similarly, a third column can be used to break ties between first and second and so on. Here we order by `region` then within region we order by murder rate
177 |
178 | ```{r}
179 | murders %>%
180 | arrange(region, murder_rate) %>%
181 | head()
182 | ```
183 |
184 |
185 | #### The top $n$
186 | In the code above we have used the function `head` to avoid having the page fill with the entire data. If we want to see a larger proportion, we can use the `top_n` function. Here are the 10 states with the highest murder rates:
187 |
188 | ```{r}
189 | murders %>% top_n(10, murder_rate)
190 | ```
191 |
192 | Note that `top_n` picks the `n` rows with the highest values of the column given as a second argument. However, the rows are not sorted.
193 |
194 | If the second argument is left blank, `top_n` ranks by the last column of the data table. This means that to see the top 10 states ranked by murder rate, sorted by murder rate, we can type:
195 |
196 |
197 | ```{r}
198 | murders %>%
199 | arrange(desc(murder_rate)) %>%
200 | top_n(10)
201 | ```
202 |
203 |
204 |
205 | ### Creating a data frame
206 |
207 | It is sometimes useful for us to create our own data frames. You can do this using the `data.frame` function:
208 |
209 | ```{r}
210 | grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
211 | exam_1 = c(95, 80, 90, 85),
212 | exam_2 = c(90, 85, 85, 90))
213 | grades
214 | ```
215 |
216 | *Warning*: By default the function `data.frame` turns characters into factors:
217 | ```{r}
218 | class(grades$names)
219 | ```
220 |
221 | To avoid this we use the rather cumbersome argument `stringsAsFactors`:
222 | ```{r}
223 | grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
224 | exam_1 = c(95, 80, 90, 85),
225 | exam_2 = c(90, 85, 85, 90),
226 | stringsAsFactors = FALSE)
227 | class(grades$names)
228 | ```
229 |
230 |
231 |
--------------------------------------------------------------------------------
/lectures/R/07-basic-plots.Rmd:
--------------------------------------------------------------------------------
1 | ## Basic plots
2 |
3 | Exploratory data visualization is perhaps the greatest strength of R. One can quickly go from idea to data to plot with a unique balance of flexibility and ease. For example, Excel may be easier than R, but it is nowhere near as flexible. D3 may be more flexible and powerful than R, but it takes much longer to generate a plot. The next chapter is dedicated to this topic, but here we introduce some very basic plotting functions.
4 |
5 | ### Scatter plots
6 |
7 | Earlier we inferred that states with larger populations are likely to have more murders. This can be confirmed with an exploratory visualization that plots these two quantities against each other:
8 |
9 | ```{r, first-plot}
10 | library(dslabs)
11 | data("murders")
12 | population_in_millions <- murders$population/10^6
13 | total_gun_murders <- murders$total
14 | plot(population_in_millions, total_gun_murders)
15 | ```
16 |
17 | We can clearly see a relationship.
18 | **Advanced**: For a quick plot that avoids accessing variables twice, we can use the `with` function
19 | ```{r, eval=FALSE}
20 | with(murders, plot(population, total))
21 | ```
22 |
23 |
24 | ### Histograms
25 |
26 | We will describe histograms as they relate to distributions in the next chapter. Here we will simply note that histograms are a powerful graphical summary of a list of numbers that gives you a general overview of the types of values you have. We can make a histogram of our murder rates by simply typing:
27 |
28 | ```{r}
29 | library(dplyr)
30 | murders <- mutate(murders, murder_rate = total / population * 100000)
31 | hist(murders$murder_rate)
32 | ```
33 |
34 | We can see that there is a wide range of values with most of them between 2 and 3 and one very extreme case with a murder rate of more than 15:
35 |
36 | ```{r}
37 | murders$state[which.max(murders$murder_rate)]
38 | ```
39 |
40 | ### Boxplot
41 |
42 | Boxplots will be described in more detail in the next chapter as well. Here we simply note that they provide a terser summary than the histogram, but they are easier to stack against other boxplots. Here we use them to compare the different regions:
43 |
44 | ```{r}
45 | boxplot(murder_rate~region, data = murders)
46 | ```
47 |
48 | We can see that the South has higher murder rates than the other three regions.
49 |
50 |
51 | ### Assessment
52 |
53 |
54 | 1. We made a plot of total murders versus population and noted a strong relationship: not surprisingly, states with larger populations had more murders.
55 |
56 | ```{r, eval = FALSE}
57 | population_in_millions <- murders$population/10^6
58 | total_gun_murders <- murders$total
59 | plot(population_in_millions, total_gun_murders)
60 | ```
61 |
62 | Note that many states have populations below 5 million and are bunched up. We may gain further insights from making this plot in the log scale. Transform the variables using the `log10` transformation and then plot them.
63 |
64 |
65 | 2. Create a histogram of the state populations.
66 |
67 | 3. Generate boxplots of the state populations by region
68 |
69 |
--------------------------------------------------------------------------------
/lectures/R/08-importing-data.Rmd:
--------------------------------------------------------------------------------
1 | ## Importing Data
2 |
3 | In this chapter we used a data set already stored in an R object. A data scientist will rarely have such luck and will have to import data into R from either a file, a database, or other source. We cover this in more detail later on. But because it is so common to read data from a file, we will briefly describe the key approach and function, in case you want to use your new knowledge on one of your own data sets.
4 |
5 |
6 | Small datasets such as the one used in this chapter are commonly stored as Excel files. Although there
7 | are R packages designed to read Excel (xls) format, you generally want
8 | to avoid this format and save files as comma delimited (Comma-Separated
9 | Value/CSV) or tab delimited (Tab-Separated Value/TSV/TXT) files.
10 | These plain-text formats make it easier to share data, since commercial software is not required for working with the data.
11 |
12 |
13 | #### Paths and the Working Directory
14 |
15 | The first step is to find the file containing your data and know its *path*.
16 | When you are working in R it is useful to know your _working directory_. This is the folder in which R will save or look for files by default. You can see your working directory by typing:
17 |
18 | ```{r, eval=FALSE}
19 | getwd()
20 | ```
21 |
22 | You can also change your working directory using the function `setwd`. Or you can change it through RStudio by clicking on "Session".
23 |
24 | The functions that read and write files (there are several in R) assume you mean to look for files or write files in the working directory. Our recommended approach for beginners will have you reading and writing to the working directory. However, you can also type the [full path](http://www.computerhope.com/jargon/a/absopath.htm), which will work independently of the working directory.
25 |
26 | We have included the US murders data in a CSV file as part of the `dslabs` package. We recommend placing your data in your working directory.
27 |
28 | Because knowing where packages store files is rather advanced, we provide the following code that finds the directory and copies the file:
29 |
30 | ```{r, hide=TRUE, warning=FALSE}
31 | dir <- system.file(package="dslabs") #extracts the location of package
32 | filename <- file.path(dir,"extdata/murders.csv")
33 | file.copy(filename, "murders.csv")
34 | ```
35 |
36 | You should be able to see the file in your working directory and can check using
37 |
38 | ```{r}
39 | list.files()
40 | ```
41 |
42 | ### `read.csv`
43 |
44 | We are ready to read in the file. There are several functions for reading in tables. Here we introduce one included in base R:
45 |
46 | ```{r}
47 | dat <- read.csv("murders.csv")
48 | head(dat)
49 | ```
50 |
51 | We can see that we have read in the file.
52 |
53 | Warning: `read.csv` automatically converts characters to factors. Note for example that:
54 |
55 | ```{r}
56 | class(dat$state)
57 | ```
58 |
59 | You can avoid this using
60 | ```{r}
61 | dat <- read.csv("murders.csv", stringsAsFactors = FALSE)
62 | class(dat$state)
63 | ```
64 |
65 | With this call the region variable is no longer a factor, but we can easily change it back with:
66 |
67 | ```{r, warning=FALSE, message=FALSE}
68 | require(dplyr)
69 | dat <- mutate(dat, region = as.factor(region))
70 | ```
71 |
72 |
73 | ## Assessment
74 |
75 | Find or create a spreadsheet on your computer or the internet. Download it and open it in R.
76 |
77 |
78 |
--------------------------------------------------------------------------------
/lectures/R/09-programming-basics.Rmd:
--------------------------------------------------------------------------------
1 | ## Programming basics
2 |
3 | We teach R because it greatly facilitates data analysis, the main topic of this book. Coding in R, we can efficiently perform exploratory data analysis, build data analysis pipelines, and prepare data visualizations to communicate results. However, R is not just a data analysis environment but a programming language. Advanced R programmers can develop complex packages and even improve R itself, but we do not cover advanced programming in this book. However, in this section we introduce three key programming concepts: conditional expressions, for-loops, and functions. These are not just key building blocks for advanced programming, but occasionally come in handy during data analysis. We also provide a list of powerful functions that we do not cover in the book but are worth knowing about, as they are commonly used by expert data analysts.
4 |
5 | ### Conditional expressions
6 |
7 | Conditional expressions are one of the basic features of programming. The most common conditional expression is the if-else statement. In R, we can actually perform quite a bit of data analysis without conditionals. However, they do come up occasionally, and once you start writing your own functions and packages you will definitely need them.
8 |
9 | Here is a very simple example showing the general structure of an if-else statement. The basic idea is to print the reciprocal of `a` unless `a` is 0:
10 |
11 | ```{r}
12 | a <- 0
13 |
14 | if(a!=0){
15 | print(1/a)
16 | } else{
17 | print("No reciprocal for 0.")
18 | }
19 | ```
20 |
21 |
22 | Let's look at one more example using the US murders data frame.
23 |
24 | ```{r, echo=FALSE}
25 | library(dslabs)
26 | data(murders)
27 | murder_rate <- murders$total/murders$population*100000
28 | ```
29 |
30 |
31 | Here is a very simple example that tells us which states, if any, have a murder rate lower than 0.5 per 100,000. The if statement protects us from the case in which no state satisfies the condition.
32 |
33 | ```{r}
34 | ind <- which.min(murder_rate)
35 |
36 | if(murder_rate[ind] < 0.5){
37 | print(murders$state[ind])
38 | } else{
39 | print("No state has murder rate that low")
40 | }
41 | ```
42 |
43 | If we try it again with a rate of 0.25 we get a different answer:
44 |
45 | ```{r}
46 | if(murder_rate[ind] < 0.25){
47 | print(murders$state[ind])
48 | } else{
49 | print("No state has a murder rate that low.")
50 | }
51 | ```
52 |
53 | A related function that is very useful is `ifelse`. This function takes three arguments: a logical and two possible answers. If the logical is `TRUE` the first answer is returned and if `FALSE` the second. Here is an example
54 |
55 | ```{r}
56 | a <- 0
57 | ifelse(a > 0, 1/a, NA)
58 | ```
59 |
60 | The function is particularly useful because it works on vectors. It examines each element of the logical vector and returns the corresponding answer: the second argument when the entry is `TRUE` and the third when it is `FALSE`.
61 |
62 | ```{r}
63 | a <- c(0,1,2,-4,5)
64 | result <- ifelse(a > 0, 1/a, NA)
65 | ```
66 |
67 | This table helps us see what happened:
68 | ```{r, echo=FALSE}
69 | knitr::kable(data.frame(a = a, is_a_positive = a>0, answer1 = 1/a, answer2 = NA, result = result))
70 | ```
71 |
72 | Here is an example of how this function can be readily used to replace all the missing values in a vector with zeros:
73 |
74 | ```{r}
75 | data(na_example)
76 | no_nas <- ifelse(is.na(na_example), 0, na_example)
77 | sum(is.na(no_nas))
78 | ```
79 |
80 | Two other useful functions are `any` and `all`. The `any` function takes a vector of logicals and returns `TRUE` if any of the entries is `TRUE`. The `all` function takes a vector of logicals and returns `TRUE` if all of the entries are `TRUE`. Here is an example:
81 |
82 | ```{r}
83 | z <- c(TRUE, TRUE, FALSE)
84 | any(z)
85 | all(z)
86 | ```
87 |
88 | ### Defining Functions
89 |
90 | As you become more experienced, you will find yourself needing to perform the same operations over and over. A simple example is computing the average. We can compute the average of a vector `x` using the `sum` and `length` functions: `sum(x)/length(x)`. Because we do this so often, it is much more efficient to have a function that performs this operation, and thus someone already wrote the `mean` function. However, you will encounter situations in which a function you need does not already exist, so R permits you to write your own. A simple version of a function that computes the average can be defined like this:
91 |
92 | ```{r}
93 | avg <- function(x){
94 | s <- sum(x)
95 | n <- length(x)
96 | s/n
97 | }
98 | ```
99 |
100 | Now `avg` is a function that computes the mean:
101 |
102 | ```{r}
103 | x <- 1:100
104 | identical(mean(x), avg(x))
105 | ```
106 |
107 | Note that variables defined inside a function are not saved in the workspace. So while we use `s` and `n` when we call `avg`, their values are created and changed only during the call. Here is an illustrative example:
108 |
109 | ```{r}
110 | s <- 3
111 | avg(1:10)
112 | s
113 | ```
114 |
115 | Note how `s` is still `r s` after we call `avg`.
116 |
117 |
118 | In general, functions are objects, so we assign them to variable names with `<-`. The function `function` tells R you are about to define a function. The general form of a function definition looks like this
119 |
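   | (A sketch of that pattern; `VARIABLE_NAME` and `VALUE` are placeholders, not real objects, so we set `eval=FALSE`.)
   | 
   | ```{r, eval=FALSE}
   | my_function <- function(VARIABLE_NAME){
   |   ## perform operations on VARIABLE_NAME and calculate VALUE
   |   VALUE
   | }
   | ```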
120 |
121 |
122 | The functions you define can have multiple arguments as well as default values. For example, we can define a function that computes either the arithmetic or the geometric average depending on a user-defined variable like this:
123 |
124 | ```{r}
125 | avg <- function(x, arithmetic = TRUE){
126 | n <- length(x)
127 | ifelse(arithmetic, sum(x)/n, prod(x)^(1/n))
128 | }
129 | ```
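   | For example, the default gives the arithmetic average and setting `arithmetic = FALSE` gives the geometric average (a quick sketch):
   | 
   | ```{r}
   | x <- 1:100
   | avg(x)                     # arithmetic average
   | avg(x, arithmetic = FALSE) # geometric average
   | ```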
130 |
131 | We will learn more about how to create functions through experience as we face more complex tasks.
132 |
133 |
134 |
135 | ### For loops
136 |
137 | The formula for the sum $1+2+\dots+n$ is $n(n+1)/2$. What if we weren't sure that was the right formula, how could we check? Using what we learned about functions, we can create one that computes $S_n$ for any $n$:
138 |
139 | ```{r}
140 | compute_s_n <- function(n){
141 | x <- 1:n
142 | sum(x)
143 | }
144 | ```
145 |
146 | Now, how do we compute $S_n$ for various values of $n$, say $n=1,\dots,25$? Do we write 25 lines of code calling `compute_s_n`? No, that is what for loops are for in programming. Note that we are performing exactly the same task over and over, and that the only thing that is changing is the value of $n$. For loops let us define the range that our variable takes (in our example $n=1,\dots,25$), then change the value as we _loop_, evaluating an expression at each iteration.
147 |
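   | In pseudocode, the general form looks something like this (a sketch, not meant to be run):
   | 
   | ```{r, eval=FALSE}
   | for (i in range_of_values){
   |   ## operations that use i, which changes across the range of values
   | }
   | ```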
148 |
149 | Perhaps the simplest example of a for loop is this useless piece of code:
150 | ```{r}
151 | for(i in 1:5){
152 | print(i)
153 | }
154 | ```
155 |
156 | And here is the for loop we would write for our $S_n$ example:
157 |
158 | ```{r}
159 | m <- 25
160 | s_n <- vector(length = m) # create an empty vector
161 | for(n in 1:m){
162 | s_n[n] <- compute_s_n(n)
163 | }
164 | ```
165 | In each iteration $n=1$, $n=2$, etc..., we compute $S_n$ and store it in the $n$th entry of `s_n`.
166 |
167 | Now we can create a plot to search for a pattern:
168 |
169 | ```{r sum-of-consecutive-squares}
170 | n <- 1:m
171 | plot(n, s_n)
172 | ```
173 |
174 | If you noticed that it appears to be quadratic, you are on the right track, because the formula is $n(n+1)/2$, which we can confirm with a table:
175 |
176 | ```{r show_s_n_table, echo=FALSE}
177 | head(data.frame(s_n = s_n, formula = n*(n+1)/2))
178 | ```
179 |
180 | We can also overlay the two results by using the function `lines` to draw a line over the previously plotted points:
181 |
182 | ```{r}
183 | plot(n, s_n)
184 | lines(n, n*(n+1)/2)
185 | ```
186 |
187 |
188 | ### Other functions
189 |
190 | It turns out that we rarely use for loops in R. This is because there are usually more powerful ways to perform the same task. Functions that are typically used instead of for loops are the apply family: `apply`, `sapply`, `tapply`, and `mapply`. We do not cover these functions in this book but they are worth learning if you intend to go beyond this introduction. Other functions that are widely used are `split`, `cut`, and `Reduce`.
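   | For instance, the for loop we wrote above to compute $S_n$ for $n=1,\dots,25$ could be written in one line with `sapply` (a quick sketch):
   | 
   | ```{r}
   | s_n <- sapply(1:25, compute_s_n) # applies compute_s_n to each value of n
   | ```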
191 |
192 |
193 |
194 | ### Assessment
195 |
196 |
197 | 1. What will this conditional expression return?
198 |
199 | ```{r}
200 | x <- c(1,2,-3,4)
201 |
202 | if(all(x>0)){
203 |   print("All positives")
204 | } else{
205 | print("Not all positives")
206 | }
207 | ```
208 |
209 | 2. Which of the following expressions is always `FALSE` when at least one entry of a logical vector `x` is TRUE?
210 |
211 | A. `all(x)`
212 | B. `any(x)`
213 | C. `any(!x)`
214 | D. `all(!x)`
215 |
216 |
217 | 3. The function `nchar` tells you how many characters long each entry of a character vector is. For example:
218 |
219 |
220 | ```{r}
221 | library(dslabs)
222 | data(murders)
223 | char_len <- nchar(murders$state)
224 | char_len[1:5]
225 | ```
226 |
227 |
228 | Write a line of code that assigns to the object `new_names` the state abbreviation when the state name is longer than 8 characters.
229 |
230 |
231 | 4. Create a function `sum_n` that for any given value, say $n$, computes the sum of the integers from 1 to $n$ (inclusive). Use the function to determine the sum of the integers from 1 to 5,000.
232 |
233 | 5. Create a function `altman_plot` that takes two arguments `x` and `y` and plots the difference against the sum
234 |
235 | 6. After running the code below, what is the value of `x`?
236 |
237 | ```{r}
238 | x <- 3
239 | my_func <- function(y){
240 | x <- 5
241 | y+5
242 | }
243 | ```
244 |
245 |
246 |
247 | 7. Write a function `compute_s_n` that for any given $n$ computes the sum $S_n = 1^2 + 2^2 + 3^2 + \dots + n^2$. Report the value of the sum when $n=10$.
248 |
249 |
250 | 8. Now define an empty numerical vector `s_n` of size 25 using `s_n <- vector("numeric", 25)` and store in it the results of $S_1, S_2, \dots, S_{25}$.
251 |
252 |
253 | 9. Plot $S_n$ versus $n$. Use points defined by $n=1,\dots,25$.
254 |
255 | 10. Confirm that the formula for this sum is $S_n= n(n+1)(2n+1)/6$:
256 |
257 |
258 |
259 |
260 |
261 |
262 |
263 |
264 |
265 |
266 |
267 |
268 |
269 |
270 |
271 |
--------------------------------------------------------------------------------
/lectures/R/README.md:
--------------------------------------------------------------------------------
1 | This directory contains 10 Rmd tutorials that provide a basic
2 | introduction to using R. Becoming comfortable with these concepts
3 | will help you complete homework 1.
4 |
--------------------------------------------------------------------------------
/lectures/R/murders.csv:
--------------------------------------------------------------------------------
1 | state,abb,region,population,total
2 | Alabama,AL,South,4779736,135
3 | Alaska,AK,West,710231,19
4 | Arizona,AZ,West,6392017,232
5 | Arkansas,AR,South,2915918,93
6 | California,CA,West,37253956,1257
7 | Colorado,CO,West,5029196,65
8 | Connecticut,CT,Northeast,3574097,97
9 | Delaware,DE,South,897934,38
10 | District of Columbia,DC,South,601723,99
11 | Florida,FL,South,19687653,669
12 | Georgia,GA,South,9920000,376
13 | Hawaii,HI,West,1360301,7
14 | Idaho,ID,West,1567582,12
15 | Illinois,IL,North Central,12830632,364
16 | Indiana,IN,North Central,6483802,142
17 | Iowa,IA,North Central,3046355,21
18 | Kansas,KS,North Central,2853118,63
19 | Kentucky,KY,South,4339367,116
20 | Louisiana,LA,South,4533372,351
21 | Maine,ME,Northeast,1328361,11
22 | Maryland,MD,South,5773552,293
23 | Massachusetts,MA,Northeast,6547629,118
24 | Michigan,MI,North Central,9883640,413
25 | Minnesota,MN,North Central,5303925,53
26 | Mississippi,MS,South,2967297,120
27 | Missouri,MO,North Central,5988927,321
28 | Montana,MT,West,989415,12
29 | Nebraska,NE,North Central,1826341,32
30 | Nevada,NV,West,2700551,84
31 | New Hampshire,NH,Northeast,1316470,5
32 | New Jersey,NJ,Northeast,8791894,246
33 | New Mexico,NM,West,2059179,67
34 | New York,NY,Northeast,19378102,517
35 | North Carolina,NC,South,9535483,286
36 | North Dakota,ND,North Central,672591,4
37 | Ohio,OH,North Central,11536504,310
38 | Oklahoma,OK,South,3751351,111
39 | Oregon,OR,West,3831074,36
40 | Pennsylvania,PA,Northeast,12702379,457
41 | Rhode Island,RI,Northeast,1052567,16
42 | South Carolina,SC,South,4625364,207
43 | South Dakota,SD,North Central,814180,8
44 | Tennessee,TN,South,6346105,219
45 | Texas,TX,South,25145561,805
46 | Utah,UT,West,2763885,22
47 | Vermont,VT,Northeast,625741,2
48 | Virginia,VA,South,8001024,250
49 | Washington,WA,West,6724540,93
50 | West Virginia,WV,South,1852994,27
51 | Wisconsin,WI,North Central,5686986,97
52 | Wyoming,WY,West,563626,5
53 |
--------------------------------------------------------------------------------
/lectures/course-intro/course-intro.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/course-intro/course-intro.pptx
--------------------------------------------------------------------------------
/lectures/dataviz/dataviz-principles-assessment.Rmd:
--------------------------------------------------------------------------------
1 | # Data Visualization Principles
2 |
3 | 1. Pie charts are appropriate when:
4 |
5 | A. When we want to display percentages
6 | B. When `ggplot2` is not available
7 | C. When I am in a bakery
8 | D. Never. Barplots and tables are always better.
9 |
10 |
11 | 2. What is the problem with the plot below?
12 |
13 | ```{r echo=FALSE, message=FALSE}
   | library(dplyr)   # assumed imports so the chunks in this file run on their own
   | library(ggplot2)
   | library(dslabs)  # provides the datasets used later in this file
14 | data.frame(candidate=c("Clinton","Trump"), electoral_votes = c(232, 306)) %>%
15 | ggplot(aes(candidate, electoral_votes)) +
16 | geom_bar(stat = "identity", width=0.5, color =1, fill = c("Blue","Red")) +
17 | coord_cartesian(ylim=c(200,310)) +
18 | ylab("Electoral Votes") +
19 | xlab("") +
20 | ggtitle("Result of Presidential Election 2016")
21 | ```
22 |
23 | A. The values are wrong. The final vote was 306 to 232.
24 | B. The axis does not start at 0. Judging by the length, it appears Trump received 3 times as many votes when in fact it was about 30% more.
25 | C. The colors should be the same.
26 | D. Percentages should be shown as pie chart
27 |
28 |
29 | 3. Take a look at the following two plots. They show the same information: 1928 rates of Measles across the 50 states.
30 |
31 | ```{r}
32 | library(gridExtra)
33 | data("us_contagious_diseases")
34 | p1 <- us_contagious_diseases %>%
35 | filter(year == 1928 & disease=="Measles" & count>0 & !is.na(population)) %>%
36 | mutate(rate = count / population * 10000) %>%
37 | ggplot(aes(state, rate)) +
38 | geom_bar(stat="identity") +
39 | coord_flip() +
40 | xlab("")
41 | p2 <- us_contagious_diseases %>%
42 | filter(year == 1928 & disease=="Measles" & count>0 & !is.na(population)) %>%
43 | mutate(rate = count / population * 10000) %>%
44 | mutate(state = reorder(state, rate)) %>%
45 | ggplot(aes(state, rate)) +
46 | geom_bar(stat="identity") +
47 | coord_flip() +
48 | xlab("")
49 | grid.arrange(p1, p2, ncol = 2)
50 | ```
51 |
52 | Which plot is easier to read if you are interested in the best and worst states, and why?
53 |
54 | A. They provide the same information so they are just as good.
55 | B. The plot on the right is better because it orders the states alphabetically.
56 | C. The plot on the right is better because alphabetical order has nothing to do with the disease and, by ordering according to the actual rate, we quickly see the states with the highest and lowest rates.
57 | D. Both plots should be pie charts.
58 |
59 | 4. To make the plot on the right we have to reorder the levels of the state variable.
60 |
61 | ```{r}
62 | dat <- us_contagious_diseases %>% filter(year == 1928 & disease=="Measles" & count>0 & !is.na(population))
63 | ```
64 |
65 | Note what happens when we make a barplot:
66 |
67 | ```{r}
68 | dat %>% mutate(rate = count/population*10000) %>% ggplot(aes(state, rate)) +
69 | geom_bar(stat="identity") +
70 | coord_flip()
71 | ```
72 |
73 | Define these objects:
74 |
75 | ```{r}
76 | state <- dat$state
77 | rate <- dat$count/dat$population*10000
78 | ```
79 |
80 | Redefine the `state` object so that the levels are re-ordered by rate. Print the new object `state` and its levels so you can see that the vector itself is not re-ordered, only the levels.
81 |
82 | 5. Now, with one line of code, define the `dat` table as done above, but use `mutate` to create a rate variable and reorder the state variable so that the levels are reordered by this variable. Then make a barplot using the code above, but with this new `dat`.
83 |
84 | ```{r}
85 | dat <- us_contagious_diseases %>% filter(year == 1928 & disease=="Measles" & count>0 & !is.na(population)) %>%
86 |   mutate(rate = count/population*10000) %>%
87 |   mutate(state = reorder(state, rate))
88 |
89 | dat %>% ggplot(aes(state, rate)) +
90 | geom_bar(stat="identity") +
91 | coord_flip()
92 | ```
93 |
94 | 6. Say we are interested in comparing the gun homicide rates across regions of the US. We see this plot:
95 |
96 |
97 | ```{r}
98 | library(dslabs)
99 | data("murders")
100 | murders %>% mutate(rate = total/population*100000) %>%
101 | group_by(region) %>%
102 | summarize(avg = mean(rate)) %>%
103 | mutate(region = factor(region)) %>%
104 | ggplot(aes(region, avg)) +
105 | geom_bar(stat="identity") +
106 | ylab("Murder Rate Average")
107 | ```
108 |
109 | and decide to move to a state in the West region. What is the main problem with this interpretation?
110 |
111 | A. The categories are ordered alphabetically.
112 | B. The graph does not show standard errors.
113 | C. It does not show all the data. We do not see the variability within regions and it is possible that the safest states are not in the West.
114 | D. The Northeast has the lowest average.
115 |
116 | 7. Make a boxplot of the murder rates
117 |
118 | ```{r}
119 | data("murders")
120 | murders %>% mutate(rate = total/population*100000)
121 | ```
122 |
123 | by region, showing all the points and ordering the regions by their median rate.
124 |
125 |
126 | 8. The plots below show data for three continuous variables.
127 |
128 | ```{r}
129 | library(scatterplot3d)
130 | library(RColorBrewer)
131 | rafalib::mypar(1,1)
132 | set.seed(1)
133 | n <- 25
134 | group <- rep(1,n)
135 | group[1:(round(n/2))] <- 2
136 | x <- rnorm(n, group, .33)
137 | y <- rnorm(n, group, .33)
138 | z <- rnorm(n)
139 |
140 | scatterplot3d(x,y,z, color = group, pch=16)
141 | abline(v=4, col=3)
142 | ```
143 |
144 | The line $x=2$ appears to separate the points. But this is actually not the case, which we can see by plotting the data in a couple of two-dimensional plots.
145 | ```{r}
146 | rafalib::mypar(1,2)
147 | plot(x,y, col=group, pch =16)
148 | ##abline(3,-1,col=4)
149 | abline(v=2, col=3)
150 | plot(x,z,col=group, pch=16)
151 | abline(v=2, col=3)
152 | ```
153 |
154 | Why is this happening?
155 |
156 | A. Humans are not good at reading pseudo-3D plots.
157 | B. There must be an error in the code
158 | C. The colors confuse us.
159 | D. Scatter-plots should not be used to compare two variables when we have access to 3.
160 |
--------------------------------------------------------------------------------
/lectures/dataviz/gapminder-assessments.Rmd:
--------------------------------------------------------------------------------
1 | ### Assessments 1
2 |
3 |
4 | Start by loading the necessary packages and data:
5 |
6 | ```{r}
7 | library(tidyverse)
8 | library(dslabs)
9 | data(gapminder)
10 | ```
11 |
12 | 1. Using ggplot and the points layer, create a scatter plot of life expectancy versus fertility for the African continent in 2012.
13 |
14 | 2. Note that there is quite a bit of variability, with some African countries doing quite well, and there appear to be three clusters. Use color to distinguish the different regions to see if this explains the clusters.
15 |
16 |
17 | 3. While most of the countries in the healthier cluster are from Northern Africa, three countries are not. Write code that creates a table showing the country and region for the African countries that in 2012 had fertility rates below 3 and life expectancies above 70. Hint: use filter then select.
18 |
19 |
20 | 4. The Vietnam War lasted from 1955 to 1975. Does data support the negative effects of war? Create a time series plot from 1960 to 2010 of life expectancy that includes Vietnam and the United States. Use color to distinguish the two countries.
21 |
22 |
23 | 5. Cambodia was also involved in this conflict, and after the war Pol Pot and his communist Khmer Rouge took control and ruled Cambodia from 1975 to 1979. He is considered one of the most brutal dictators in history. Does data support this claim? Create a time series plot from 1960 to 2010 of life expectancy for Cambodia.
24 |
25 |
26 | ### Assessment 2
27 |
28 | 6. Create a smooth density for the dollars per day summary for African countries in 2010. Use a log (base 2) scale for the x-axis. Use mutate to create `dollars_per_day` variable, defined as `gdp/population`, then `ggplot` and `geom_density`.
29 |
30 | 7. Edit the code above but now use `facet_grid` to show a different density for 1970 and 2010.
31 |
32 |
33 | 8. Edit the code from the previous exercise to show stacked smooth densities for each region in Africa. Make sure the densities are smooth by using `bw = 0.5`. Hint: use the `fill` and `position` arguments.
34 |
35 |
36 | 9. For 2010, make a scatter plot of infant mortality rates (`infant_mortality`) versus `dollars_per_day` for countries in the African continent. Use color to denote the regions.
37 |
38 |
39 | 10. Edit the code from the previous answer to make the x-axis be in the log (base 2) scale.
40 |
41 | 11. Note that there is a pretty large variation between African countries. In the extreme cases we have one country with mortality rates of less than 20 per 1000 and an average income of 16 dollars per day, and another making about $1 a day with mortality rates above 10%. To find out which countries these are, remake the plot above but this time show the country name instead of a point. Hint: use `geom_text`.
42 |
43 | 12. Edit the code above to see how this changed between 1970 and 2010. Hint: Add a `facet_grid`
44 |
45 |
--------------------------------------------------------------------------------
/lectures/git-and-github/git-command-line.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: Homework with Git and GitHub Using the Command Line
3 | output: html_document
4 | ---
5 |
6 | We will be using git and GitHub to get your homework assignments, work on your homework,
7 | and submit your homework solutions. This tutorial will walk you through
8 | that process using git and GitHub through the command line (see the other [tutorial](https://github.com/datasciencelabs/2017/blob/master/lectures/git-and-github/git-rstudio.Rmd) for using git and GitHub with RStudio).
9 |
10 | ## Getting and Working on Homework
11 |
12 | ### Cloning your Homework repository
13 | Each of you will be made members of the [`datasciencelabs-students`
14 | organization on GitHub](https://github.com/datasciencelabs-students).
15 | This means that your homework repositories all technically belong to us.
16 | But you will be granted unlimited access throughout the course!
17 |
18 |
19 |
20 |
21 | You will notice when you visit the Data Science Labs'
22 | [Github page](https://github.com/datasciencelabs-students) that you can
23 | only see repositories with your GitHub username on them. You will get
24 | one repository for each homework throughout the semester. When a new
25 | homework is released you can go to the corresponding repository to
26 | see what is in store. The work flow will be pretty simple.
27 |
28 | 1. Go to: https://github.com/datasciencelabs-students
29 |
30 | 2. Click on the repository you want to work on. For
31 | example `-2017HW1` for Homework 1.
32 |
33 | 3. Copy the link near the top of the page that is revealed after clicking 'clone or download'.
34 |
35 |
36 |
37 | 4. Go to your `Terminal` (on Mac) or `git bash` (on Windows),
38 | change directories into your BST260 folder.
39 |
40 | 5. Use `git clone` to clone the repository using the link from step 3. For example:
41 |
42 | > `$ git clone https://github.com/datasciencelabs-students/-2017HW1.git`
43 |
44 |
45 |
46 | 6. You should now see a new directory called `-2017HW1`.
47 | Move into that directory with `cd` (shown in the last line of the previous image).
48 |
49 | 7. If you type `git status` it will give you the current status of your
50 | directory. It should look something like this:
51 |
52 |
53 |
54 |
55 | ### Working on your homework
56 |
57 | Once you have a local copy of your repository, it's time to get to work!
58 |
59 | After writing some of your homework in an `Rmd` file, knitting it,
60 | making pretty plots, and finding out some cool stuff about the dataset, it's
61 | time to `add/commit/push`. After some work, if you head back to `Terminal`
62 | you will see that something has changed when you type `git status`:
63 |
64 |
65 |
66 | You will notice that there is one `untracked file`.
67 | In order to get git to track changes in this file we need to
68 | add it. So we type:
69 |
70 | > `$ git add HW2_Problems.html `
71 |
72 | We also need to add the .Rmd file in order to `stage` it (so that it
73 | will be included in the next commit). So we type :
74 |
75 | > `$ git add HW2_Problems.Rmd `
76 |
77 |
78 |
79 | Now you will notice that the files have turned green and are now
80 | labeled as changes to be committed, so it's time to commit.
81 | This is equivalent to `save` in most programs. But what is special
82 | about `git` and other version control software is that we can track
83 | and revert changes! We also need to give what's called a `commit message`,
84 | which will help us keep track of the changes we made when we look at
85 | this in the future. Leave detailed messages so that future you will
86 | know what you did. Future you will thank you. We will get to this
87 | part later. Notice the `-am` flag, the `a` stands for *all*,
88 | as in all tracked files, and the `m` stands for *message*.
89 |
90 | We do that by typing:
91 |
92 | > `$ git commit -am "This is my commit message, it is very detailed."`
95 |
96 |
97 |
98 | Cool! Now that we've saved our work in our local repository, we can push
99 | our work to GitHub. Note, we can (and should) do this as many times as
100 | we want before the homework deadline. What is great about this is that
101 | it will make getting help from your TA easier as well as keeping a
102 | copy of your work in the cloud in case your computer crashes, or you
103 | accidentally delete something.
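   | For example, from inside the repository folder the push itself is simply:
   | 
   | > `$ git push`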
104 |
105 |
106 |
107 | ### Summary
108 | To summarize, it is important to do the following
109 | steps whenever you finish working on your homework to make full
110 | use of `git` and Github as well as generally having the best
111 | experience in this class.
112 |
113 | 1. Work on your homework
114 | 2. Add changes to track with: `git add`
115 | 3. Commit changes to your local repository: `git commit`
116 | 4. Push the changes to your github repo: `git push`
117 |
118 | Generally, keep this loop in mind whenever you finish some work: it is
119 | important to only add changed files you care about
120 | and nothing you do not care about. If certain files that you will never
121 | want to add, e.g. `.Rhistory`, keep popping up in your `git status`,
122 | add them to your `.gitignore` to simplify your life; this will keep
123 | those files from showing up there. For more info on this, see
124 | `version_control.Rmd`.
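   | For example, a minimal `.gitignore` (just a sketch; adjust it to your own needs) could contain:
   | 
   | ```
   | .Rhistory
   | .RData
   | .DS_Store
   | ```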
125 |
126 | 
127 | 
128 | 
129 |
130 | # Late Day Policy
131 | From the course web-page:
132 |
133 | > Each student is given six late days for homework at the beginning of the semester. A late day extends the individual homework deadline by 24 hours without penalty. No more than two late days may be used on any one assignment. Assignments handed in more than 48 hours after the original deadline will not be graded. We do not accept any homework under any circumstances more than 48 hours after the original deadline. Late days are intended to give you flexibility: you can use them for any reason no questions asked. You don't get any bonus points for not using your late days. Also, you can only use late days for the individual homework deadlines all other deadlines (e.g., project milestones) are hard.
134 |
135 | We made this policy because we understand that you are all busy
136 | and things happen. We hope that this added flexibility gives you
137 | the freedom to enjoy the course and engage with the material fully.
138 |
139 | ## Some unsolicited advice
140 |
141 | To be fair to all the students we have to enforce this late day policy,
142 | so we have put together a list of things to consider near the deadline.
143 |
144 | Say the homework is due Sunday at 11:59 pm.
145 |
146 | 1. If we do not see any more `commit`s after the deadline we will take
147 | the last `commit` as your final submission.
148 | 2. Check that the final `commit` is showing on your Github repo page.
149 | "I forgot to `push`" is not an acceptable excuse for late work.
150 | 3. It may help to add a message like "This is my final version of the
151 | homework please grade this" but that's up to you.
152 | 4. If there are `commit`s after the deadline **we will take the last `commit`**
153 | up to Sunday at 11:59 pm as the final version.
154 | 5. We will assess the number of late days you used and keep track.
155 | 6. You **do not** need to tell us that you will take extra days, we will
156 | be able to see the time stamp of your last `commit`.
157 | 7. When you are done with the homework, do not `commit` or `push` any more.
158 | If you `commit` and `push` after the deadline you will be charged a late day.
159 | This is strict.
160 |
161 | # Happy `git`-ing
162 |
--------------------------------------------------------------------------------
/lectures/git-and-github/git-rstudio.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: Homework with Git and GitHub inside RStudio
3 | output: html_document
4 | ---
5 |
6 | We will be using
7 | git and GitHub to get your homework assignments, work on your homework,
8 | and submit your homework solutions. This tutorial will walk you through
9 | that process using git and GitHub through RStudio (see the other [tutorial](https://github.com/datasciencelabs/2017/blob/master/lectures/git-and-github/git-command-line.Rmd) for using git and GitHub through the command line).
10 |
11 | ## Getting and Working on Homework
12 |
13 | ### Cloning your Homework repository
14 | Each of you will be made members of the [`datasciencelabs-students`
15 | organization on GitHub](https://github.com/datasciencelabs-students).
16 | This means that your homework repositories all technically belong to us.
17 | But you will be granted unlimited access throughout the course!
18 |
19 |
20 |
21 |
22 | You will notice when you visit the Data Science Labs'
23 | [Github page](https://github.com/datasciencelabs-students) that you can
24 | only see repositories with your GitHub username on them. You will get
25 | one repository for each homework throughout the semester. When a new
26 | homework is released you can go to the corresponding repository to
27 | see what is in store. The work flow will be pretty simple.
28 |
29 | 1. Go to: https://github.com/datasciencelabs-students
30 |
31 | 2. Click on the repository you want to work on. For
32 | example `-2017HW1` for Homework 1.
33 |
34 | 3. Copy the link near the top of the page that is revealed after clicking 'clone or download'.
35 |
36 |
37 |
38 | 4. In RStudio, start a new project: File > New Project > Version Control > Git. In the "repository URL" paste the URL of the homework repository you just copied. Take charge of – or at least notice! – the local directory for the Project. A common rookie mistake is to have no idea where you are saving files or what your working directory is. Pay attention. Be intentional. Personally, I suggest you check “Open in new session” and keep
39 | all your homework repositories in a 'BST260' folder.
40 |
41 |
42 |
43 |
44 |
45 | 5. Click "Create Project". You should now see the files in the repository in the lower right window in RStudio. Also notice the Git tab in the upper right window.
46 |
47 |
48 |
49 |
50 | ### Working on your homework
51 |
52 | Once you have a local copy of your repository, it's time to get to work!
53 |
54 | After writing some of your homework in an `Rmd` file, knitting it,
55 | making pretty plots, and finding out some cool stuff about the dataset, it's
56 | time to `commit/push`. After some work, save your changes and click the `commit` button in the Git tab window. This is equivalent to `save` in most programs. But what is special
57 | about `git` and other version control software is that we can track
58 | and revert changes! We also need to give what's called a `commit message`,
59 | which will help us keep track of the changes we made when we look at
60 | this in the future. Leave detailed messages so that future you will
61 | know what you did. Future you will thank you. We will get to this
62 | part later.
63 |
64 |
65 |
66 | Cool! Now that we've saved our work on our local directory, we can `push`
67 | our work to Github by clicking the *green up-arrow* in the Git tab window. If you are challenged for username and password, provide them. Note, we can (and should) do this as many times as we want before the homework deadline. What is great about this is that
68 | it makes getting help from your TA easier and keeps a
69 | copy of your work in the cloud in case your computer crashes, or you
70 | accidentally delete something.
71 |
72 |
73 | ### Summary
74 | To summarize, it is important to do the following
75 | steps whenever you finish working on your homework to make full
76 | use of `git` and Github and, generally, to have the best
77 | experience in this class.
78 |
79 | 1. Work on your homework
80 | 2. Commit changes to your local repository: `commit` button in Git tab in RStudio
81 | 3. Push the changes to your github repo: `push` (green arrow) button in Git tab in RStudio
82 |
83 | Generally, keep this picture in mind whenever you want to do this
84 | loop. It is important to only add changed files you care about
85 | and nothing you do not care about. If certain files keep popping
86 | up in your `git status` that you will never want to add, e.g. `.Rhistory`,
87 | add them to your `.gitignore` to simplify your life; this will keep
88 | those files from showing up here. For more info on this, see
89 | `version_control.Rmd`.
90 |
91 | 
92 | 
93 | 
94 |
95 | # Late Day Policy
96 | From the course web-page:
97 |
98 | > Each student is given six late days for homework at the beginning of the semester. A late day extends the individual homework deadline by 24 hours without penalty. No more than two late days may be used on any one assignment. Assignments handed in more than 48 hours after the original deadline will not be graded. We do not accept any homework under any circumstances more than 48 hours after the original deadline. Late days are intended to give you flexibility: you can use them for any reason, no questions asked. You don't get any bonus points for not using your late days. Also, you can only use late days for the individual homework deadlines; all other deadlines (e.g., project milestones) are hard.
99 |
100 | We made this policy because we understand that you are all busy
101 | and things happen. We hope that this added flexibility gives you
102 | the freedom to enjoy the course and engage with the material fully.
103 |
104 | ## Some unsolicited advice
105 |
106 | To be fair to all the students we have to enforce this late day policy,
107 | so we have put together a list of things to consider near the deadline.
108 |
109 | Say the homework is due Sunday at 11:59 pm.
110 |
111 | 1. If we do not see any more `commit`s after the deadline we will take
112 | the last `commit` as your final submission.
113 | 2. Check that the final `commit` is showing on your Github repo page.
114 | "I forgot to `push`" is not an acceptable excuse for late work.
115 | 3. It may help to add a message like "This is my final version of the
116 | homework please grade this" but that's up to you.
117 | 4. If there are `commit`s after the deadline **we will take the last `commit`**
118 | up to the due date at 11:59 pm as the final version.
119 | 5. We will assess the number of late days you used and keep track.
120 | 6. You **do not** need to tell us that you will take extra days, we will
121 | be able to see the time stamp of your last `commit`.
122 | 7. When you are done with the homework, do not `commit` or `push` any more.
123 | If you `commit` and `push` after the deadline you will be charged a late day.
124 | This is strict.
125 |
126 | # Happy `git`-ing
127 |
--------------------------------------------------------------------------------
/lectures/git-and-github/images/clone_button.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/clone_button.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/directorysetup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/directorysetup.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/git-clone.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/git-clone.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/git_add.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/git_add.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/git_clone.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/git_clone.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/git_commit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/git_commit.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/git_fetch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/git_fetch.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/git_layout.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/git_layout.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/git_merge.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/git_merge.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/git_push.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/git_push.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/git_status.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/git_status.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/gitclean.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/gitclean.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/gitclone.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/gitclone.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/gitcommit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/gitcommit.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/github-https-clone.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/github-https-clone.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/github-ssh-clone.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/github-ssh-clone.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/github.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/github.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/github_ssh.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/github_ssh.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/gitpush.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/gitpush.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/gitstaged.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/gitstaged.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/gituntracked.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/gituntracked.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/mac-git-security.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/mac-git-security.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/mkdir-clone.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/mkdir-clone.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/newproject.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/newproject.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/rstudio_commit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/rstudio_commit.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/rstudio_screen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/rstudio_screen.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/sshkeygen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/sshkeygen.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/wgi-defaultlines.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/wgi-defaultlines.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/wgi-git-bash.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/wgi-git-bash.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/wgi-scarymessage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/wgi-scarymessage.png
--------------------------------------------------------------------------------
/lectures/git-and-github/images/wgi-usemintty.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/git-and-github/images/wgi-usemintty.png
--------------------------------------------------------------------------------
/lectures/git-and-github/setting-up-github.rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: Setting up Git and GitHub
3 | output: html_document
4 | ---
5 |
6 | Starting with homework 1, we will be using git and GitHub, rather than Canvas, to submit homework assignments. This is a tutorial that will help you install git on your computer and create a GitHub account.
7 |
8 |
9 | ## Create your GitHub account
10 |
11 | The first week of class we asked each of you to set up your GitHub account
12 | and submit your GitHub username in a survey.
13 |
14 | To sign up for an account, just go to [github](https://github.com)
15 | and pick a unique username, an email address, and a password.
16 | Once you've done that, your github page will be at
17 | `https://github.com/<your-username>`.
18 |
19 | Github also provides a student
20 | [developer package](https://education.github.com/pack).
21 | This is something that might be nice to have, but it is not
22 | necessary for the course. Github may take some time to approve
23 | your application for the package. Please note that this is
24 | optional and you do not have to have the package
25 | approved to fill out the survey.
26 |
27 |
28 | #### Programming expectations
29 |
30 | All the lecture material and homework for this class will use R and
31 | R Markdown files. Knowledge of R is not a prerequisite for this course,
32 | **provided you are comfortable learning on your own as needed**.
33 | Basically, you should feel comfortable with:
34 |
35 | * How to look up R syntax on Google and StackOverflow.
36 | * Basic programming concepts like functions, loops, arrays, dictionaries, strings, and if statements.
37 | * How to learn new libraries by reading documentation.
38 | * Asking questions on Slack and StackOverflow.
39 |
40 |
41 | ## Setting up your git environment
42 |
43 | ### 1. Installing git
44 |
45 | We will be using the [command line version of git](http://git-scm.com/docs/gittutorial).
46 |
47 | On Linux, install git using your system package manager (yum, apt-get, etc).
48 |
49 | On the Mac, if you ever installed [Xcode](https://developer.apple.com/xcode/),
50 | you should have git installed. Alternatively, you may have installed
51 | it using `homebrew`. Either of these are fine as long as the
52 | git version is greater than 2.0. To determine the version of git
53 | that is installed on your computer, open a terminal window and type:
54 |
55 | > $ `git --version`
56 |
57 | If git is installed, you should see a version number. Check to see if it
58 | is greater than version 2.0. If it is not, please update your version
59 | of git.
60 |
61 | If git is not installed on your Mac or Windows machine, go to http://git-scm.com.
62 | Accept all defaults in the installation process.
63 | On Windows, installing git will also install for you a minimal
64 | unix environment with a "bash" shell and terminal window.
65 | Voila, your windows computer is transformed into a unixy form.
66 |
67 | #### Windows specific notes
68 |
69 | There will be an installer `.exe` file you need to click. Accept all the defaults.
70 |
71 | Here is a screen shot from one of the defaults. It makes sure you will have the "bash" tool talked about earlier.
72 |
73 | 
74 |
75 | Choose the default line-encoding conversion:
76 |
77 | 
78 |
79 | Use the terminal emulator they provide; it's better than the one shipped with Windows.
80 |
81 | 
82 |
83 | Towards the end, you might see a message like this. It looks scary, but all you need to do is click "Continue".
84 |
85 | 
86 |
87 |
88 | At this point you will have git installed. You can bring up "git bash"
89 | either from your start menu, or from the right click menu on any
90 | folder background. When you do so, a terminal window will open.
91 | This terminal is where you will issue further git setup commands,
92 | and git commands in general.
93 |
94 | Get familiar with the terminal. It opens in your home folder, and
95 | maps `\\` paths on Windows to more web/unix-like paths with `/`.
96 | Try issuing the commands `ls`, `pwd`, and `cd folder` where folder
97 | is one of the folders you see when you do an `ls`. You can do
98 | a `cd ..` to come back up.
99 |
100 |
101 | #### Mac specific notes
102 |
103 | As mentioned earlier, if you ever installed Xcode or the
104 | "Command Line Developer tools", you may already have git.
105 | Make sure it's version 2.0 or higher (`git --version`).
106 |
107 | Or if you use **Homebrew**, you can install it from there.
108 | The current version on homebrew is 2.4.3.
109 | You don't need to do anything more in this section.
110 |
111 | -----
112 |
113 | First click on the `.mpkg` file that comes when you open the
114 | downloaded `.dmg` file.
115 |
116 | When I tried to install git on my mac, I got a warning saying my
117 | security preferences wouldn't allow it to be installed. So I opened
118 | my system preferences and went to "Security".
119 |
120 | 
121 |
122 | Here you must click "Open Anyway", and the installer will run.
123 |
124 | The installer puts git as `/usr/local/git/bin/git`.
125 | That's not a particularly useful spot. Open up `Terminal.app`.
126 | It's usually in `/Applications/Utilities`. Once the terminal opens up, issue
127 |
128 | > $ `sudo ln -s /usr/local/git/bin/git /usr/local/bin/git`
129 |
130 | Keep the Terminal application handy in your dock. (You could also
131 | download and use iTerm.app, which is a nicer terminal, if you are into
132 | terminal geek-ery). We'll be using the terminal extensively for git.
133 |
134 | Try issuing the commands `ls`, `pwd`, and `cd folder` where
135 | folder is one of the folders you see when you do a ls. You
136 | can do a `cd ..` to come back up.
137 |
138 | ### 2. Optional: Creating ssh keys on your machine
139 |
140 | This is an optional step. But it makes things much easier so
141 | it's highly recommended.
142 |
143 | There are two ways git talks to github: https, which is a
144 | web based protocol
145 |
146 | 
147 |
148 | or over ssh
149 |
150 | 
151 |
152 | Which one you use is your choice. I recommend ssh, and the
153 | github urls in this homework and in labs will be ssh urls.
154 | Every time you contact your upstream repository (hosted on github),
155 | you need to prove you're you. You *can* do this with passwords over
156 | HTTPS, but it gets old quickly. By providing an ssh public key to
157 | github, your ssh-agent will handle all of that for you,
158 | and you won't have to put in any passwords.
159 |
160 | At your terminal, issue the command (skip this if you are a
161 | seasoned ssh user and already have keys):
162 |
163 | `ssh-keygen -t rsa`
164 |
165 | It will look like this:
166 | 
167 |
168 | Accept the defaults. When it asks for a passphrase for your keys,
169 | put in none. (You can put in one if you know how to set up an ssh-agent.)
170 |
171 | This will create two files for you, in your home folder if
172 | you accepted the defaults.
173 |
174 | `id_rsa` is your PRIVATE key. NEVER NEVER NEVER give that to anyone.
175 | `id_rsa.pub` is your public key. You must supply this to github.
176 |
177 | ----
178 |
179 | ### 3. Optional: Uploading ssh keys and Authentication
180 |
181 | To upload an ssh key, log in to github and click on the gear icon
182 | in the top right corner (settings). Once you're there, click on
183 | "SSH keys" on the left. This page will contain all your ssh
184 | keys once you upload any.
185 |
186 | Click on "add ssh key" in the top right. You should see this box:
187 |
188 |
189 |
190 | The title field should be the name of your computer or some other
191 | way to identify this particular ssh key.
192 |
193 | In the key field, you'll need to copy and paste
194 | your *public* key. **Do not paste your private ssh key here.**
195 |
196 | When you hit "Add key", you should see the key name and some
197 | hexadecimal characters show up in the list. You're set.
198 |
199 | Now, whenever you clone a repository using this form:
200 |
201 | `$ git clone git@github.com:rdadolf/ac297r-git-demo.git`,
202 |
203 | you'll be connecting over ssh, and will not be asked for your github password
204 |
205 | You will need to repeat steps 2 and 3 of the setup for each computer you wish to use with github.
206 |
207 | ### 4. Setting global config for git
208 |
209 | Again, from the terminal, issue the command
210 |
211 | `git config --global user.name "YOUR NAME"`
212 |
213 | This sets up a name for you. Then do
214 |
215 | `git config --global user.email "YOUR EMAIL ADDRESS"`
216 |
217 | Use the **SAME** email address you used in setting up your github account.
218 |
219 | These commands set up your global configuration. On my Mac,
220 | these are stored in the text file `.gitconfig` in my home folder.
221 |
222 |
223 |
224 |
225 |
226 |
227 |
--------------------------------------------------------------------------------
/lectures/inference/association-tests.Rmd:
--------------------------------------------------------------------------------
1 | ## Association Tests
2 |
3 | ```{r,include=FALSE}
4 | library(tidyverse)
4 | library(dslabs)
4 | set.seed(1)
5 | ```
6 |
7 | The statistical tests we have covered up to now leave out a
8 | substantial portion of data types. Specifically, we have not discussed inference for binary, categorical and ordinal data. To give a
9 | very specific example, consider the following case study.
10 |
11 |
12 | A [2014 PNAS paper](http://www.pnas.org/content/112/40/12349.abstract) analyzed success rates from funding agencies in the Netherlands and concluded that their
13 |
14 | > results reveal gender bias favoring male applicants over female applicants in the prioritization of their “quality of researcher” (but not "quality of proposal") evaluations and success rates, as well as in the language use in instructional and evaluation materials.
15 |
16 | The main evidence for this conclusion comes down to a comparison of the percentages. Table S1 in the paper includes the information we need:
17 |
18 | ```{r,echo=FALSE}
19 | data("research_funding_rates")
20 | research_funding_rates
21 | ```
22 |
23 | Let's compare the percentages of successful applications for men and women.
24 |
25 |
26 | We can compute the totals that were successful and the totals that were not like this:
27 |
28 | ```{r}
29 | totals <- research_funding_rates %>%
30 | select(-discipline) %>%
31 | summarize_all(funs(sum)) %>%
32 | summarize(yes_men = awards_men,
33 | no_men = applications_men - awards_men,
34 | yes_women = awards_women,
35 | no_women = applications_women - awards_women)
36 | ```
37 |
38 | So we see that a larger percentage of men than women received awards:
39 |
40 | ```{r}
41 | totals %>% summarize(percent_men = yes_men/(yes_men+no_men),
42 | percent_women = yes_women/(yes_women+no_women))
43 | ```
44 |
45 | But could this be due just to random variability?
46 | Here we learn how to perform inference for this type of data.
47 |
48 |
49 | #### Lady Tasting Tea
50 |
51 |
52 | [R.A. Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) was one of the first to formalize hypothesis testing. The "Lady Tasting Tea" is one of the most famous examples.
53 |
54 | The story goes like this. An acquaintance of Fisher's claimed that she could tell if milk was added
55 | before or after tea was poured. Fisher was skeptical. He designed an experiment to test this claim. He gave her four pairs of
56 | cups of tea: one with milk poured first, the other after. The order
57 | was randomized. The null hypothesis here is that she is guessing. Fisher derived the distribution for the number of correct picks on the assumption that the choices were random and independent.
58 |
59 | As an example, suppose she picked 3 out of 4 correctly; do we believe
60 | she has a special ability? The basic question we ask is: if the tester is actually guessing, what
61 | are the chances that she gets 3 or more correct? Just as we have done
62 | before, we can compute a probability under the null hypothesis that she
63 | is guessing 4 of each. Under this null hypothesis, we can
64 | think of this particular example as picking 4 balls out of an urn
65 | with 4 blue (correct answer) and 4 red (incorrect answer) balls. Remember she knows that there are four before tea and four after.
66 |
67 | Under the null hypothesis that she is simply guessing, each ball
68 | has the same chance of being picked. We can then use combinations to
69 | figure out each probability. The probability of picking 3 is
70 | ${4 \choose 3} {4 \choose 1} / {8 \choose 4} = 16/70$. The probability of
71 | picking all 4 correct is
72 | ${4 \choose 4} {4 \choose 0}/{8 \choose 4}= 1/70$.
73 | Thus, the chance of observing a 3 or something more extreme,
74 | under the null hypothesis, is $\approx 0.24$. This is the p-value. The
75 | procedure that produced this p-value is called _Fisher's exact test_ and
76 | it uses the *hypergeometric distribution*.
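
We can do this calculation in R with the hypergeometric probability function (a quick check of the numbers above, not part of the original derivation):

```{r}
# P(3 or more correct picks) under the null hypothesis of guessing:
# 4 "milk first" cups, 4 "milk after" cups, and 4 cups picked
sum(dhyper(3:4, m = 4, n = 4, k = 4))
```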
77 |
78 | #### Two By Two Tables
79 |
80 | The data from the experiment is usually summarized by a table like this:
81 |
82 | ```{r}
83 | tab <- matrix(c(3,1,1,3),2,2)
84 | rownames(tab)<-c("Poured Before","Poured After")
85 | colnames(tab)<-c("Guessed before","Guessed after")
86 | tab
87 | ```
88 |
89 | This is referred to as a two-by-two table. It shows, for each of the four combinations one can get with a pair of binary variables, the observed counts for each occurrence.
90 |
91 | The function `fisher.test` performs the inference calculations above; the results can be obtained like this:
92 |
93 | ```{r}
94 | fisher.test(tab, alternative="greater")
95 | ```
96 |
97 | #### Chi-square Test
98 |
99 | Note that, in a way, our funding rates example is similar to the Lady Tasting Tea. However, in the Tasting Tea example the
100 | number of blue and red beads is experimentally fixed and the number
101 | of answers given for each category is also fixed. This is because Fisher made sure there were four before tea and four after tea and the lady knew this, so the answers would also have four and four. If this is the case the sum of the rows and the sum of the columns are
102 | fixed. This defines constraints on the possible ways we can fill the two
103 | by two table and also permits us to use the hypergeometric
104 | distribution. In general, this is not the case. Nonetheless, there is
105 | another approach, the Chi-squared test, which is described below.
106 |
107 |
108 | Imagine we have `r sum(totals)` applicants: some are men and some are women, and some get funded while others don't. We saw that the success rates for men and women were:
109 |
110 | ```{r}
111 | totals %>% summarize(percent_men = yes_men/(yes_men+no_men),
112 | percent_women = yes_women/(yes_women+no_women))
113 | ```
114 |
115 | respectively. Would we see a difference like this if we randomly assigned funding at the overall rate:
116 |
117 | ```{r}
118 | funding_rate <- totals %>%
119 | summarize(percent_total =
120 | (yes_men + yes_women)/
121 | (yes_men + no_men +yes_women + no_women)) %>%
122 | .$percent_total
123 | funding_rate
124 | ```
125 |
126 | The Chi-square test answers this question. The first step is to create the two-by-two data:
127 |
128 | ```{r}
129 | two_by_two <- tibble(awarded = c("no", "yes"),
130 | men = c(totals$no_men, totals$yes_men),
131 | women = c(totals$no_women, totals$yes_women))
132 | two_by_two
133 | ```
134 |
135 | The general idea of the Chi-square test is to compare this two-by-two table to what you would expect to see under the null hypothesis, which would be:
136 |
137 | ```{r}
138 | tibble(awarded = c("no", "yes"),
139 | men = (totals$no_men + totals$yes_men) *
140 | c(1 - funding_rate, funding_rate),
141 | women = (totals$no_women + totals$yes_women)*
142 | c(1 - funding_rate, funding_rate))
143 |
144 | ```
145 |
146 | We can see that more men than expected and fewer women than expected received funding. However, under the null hypothesis these observations are random variables. The Chi-square test tells us how likely it is to see
147 | a deviation this large or larger. This test uses an asymptotic result, similar to the CLT, related to the sums of independent binary outcomes.
148 | The R function `chisq.test` takes a two by two table and returns the results from the test:
149 |
150 | ```{r}
151 | two_by_two %>%
152 | select(-awarded) %>%
153 | chisq.test()
154 | ```
155 |
156 | We see that the p-value is 0.051.
157 |
158 |
159 | ### The Odds Ratio
160 |
161 | An informative summary statistic associated with two-by-two tables is the odds ratio. Define the two variables as $X = 1$ if you are a male and 0 otherwise, and $Y=1$ if you are funded and 0 otherwise. The odds of getting funded if you are a man are defined as
162 |
163 | $$\mbox{Pr}(Y=1 \mid X=1) / \mbox{Pr}(Y=0 \mid X=1)$$
164 |
165 | and can be computed like this:
166 | ```{r}
167 | odds_men <- (two_by_two$men[2] / sum(two_by_two$men)) /
168 | (two_by_two$men[1] / sum(two_by_two$men))
169 | ```
170 |
171 | And the odds of being funded if you are a woman are
172 |
173 |
174 | $$\mbox{Pr}(Y=1 \mid X=0) / \mbox{Pr}(Y=0 \mid X=0)$$
175 |
176 |
177 | and can be computed like this:
178 | ```{r}
179 | odds_women <- (two_by_two$women[2] / sum(two_by_two$women)) /
180 | (two_by_two$women[1] / sum(two_by_two$women))
181 | ```
182 |
183 | The odds ratio is the ratio of these two odds: how many times larger are the odds for men than for women:
184 |
185 | ```{r}
186 | odds_men / odds_women
187 | ```
188 |
189 | #### Large Samples, Small p-values
190 |
191 | As mentioned earlier, reporting only p-values is not an appropriate
192 | way to report the results of data analysis. In scientific journals, for example,
193 | some studies seem to overemphasize p-values. Some of these studies have large sample sizes
194 | and report impressively small p-values. Yet when one looks closely at
195 | the results, we realize odds ratios are quite modest: barely bigger
196 | than 1. In this case the difference may not be *practically significant* or *scientifically significant*.
197 |
198 | Note that the relationship between the odds ratio and the p-value is not one-to-one. It depends on the sample size. So a very small p-value does not necessarily mean a very large odds ratio.
199 |
200 | Notice what happens to the p-value if we multiply our two by two table by 10:
201 |
202 | ```{r}
203 | two_by_two %>%
204 | select(-awarded) %>%
205 | mutate(men = men*10, women = women*10) %>%
206 | chisq.test()
207 | ```
208 |
209 | Yet the odds ratio is unchanged.
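
We can verify this directly (a quick check, not in the original text):

```{r}
# Sketch: multiplying every count by 10 leaves the odds ratio unchanged
two_by_two_10 <- two_by_two %>% mutate(men = men * 10, women = women * 10)
with(two_by_two_10, (men[2] / men[1]) / (women[2] / women[1]))
```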
210 |
211 | #### Confidence Intervals for the Odds Ratio
212 |
213 | Computing confidence intervals for the odds ratio is not mathematically
214 | straightforward. Unlike other statistics, for which we can derive
215 | useful approximations of their distributions, the odds ratio is not only a
216 | ratio, but a ratio of ratios. Therefore, there is no simple way of
217 | using, for example, the CLT.
218 |
219 | One approach is to use the theory of *generalized linear models*. You can learn more about these in this book:
220 | [McCullagh and Nelder, 1989](https://books.google.com/books?hl=en&lr=&id=h9kFH2_FfBkC).
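
As a rough illustration of that approach (a sketch that assumes the `totals` object defined earlier is available; it is not part of the original lecture), we can fit a logistic regression to the award counts and form a Wald-type interval for the log odds ratio:

```{r}
# Sketch: approximate 95% confidence interval for the odds ratio via a GLM
is_man <- c(1, 0)
fit <- glm(cbind(c(totals$yes_men, totals$yes_women),
                 c(totals$no_men, totals$no_women)) ~ is_man,
           family = binomial)
log_or <- coef(fit)[["is_man"]]              # estimated log odds ratio (men vs women)
se_log_or <- summary(fit)$coefficients["is_man", 2]
exp(c(estimate = log_or,
      lower = log_or - 1.96 * se_log_or,
      upper = log_or + 1.96 * se_log_or))
```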
221 |
222 |
--------------------------------------------------------------------------------
/lectures/inference/clt.Rmd:
--------------------------------------------------------------------------------
1 | ## Central Limit Theorem in Practice
2 |
3 | ```{r, echo=FALSE, message=FALSE}
4 | library(tidyverse)
5 | library(dslabs)
6 | ds_theme_set()
7 | ```
8 |
9 | The CLT tells us that the distribution function for a sum of draws is approximately normal. We also learned that dividing a normally distributed random variable by a constant results in a normally distributed variable. This implies that the distribution of $\bar{X}$ is approximately normal.
10 |
11 | So, in summary, $\bar{X}$ has an approximately normal distribution with expected value $p$ and standard error $\sqrt{p(1-p)/N}$.
12 |
13 | Now how does this help us? Suppose we want to know the probability that we are within 1% of $p$. We are basically asking for
14 |
15 | $$
16 | \mbox{Pr}(| \bar{X} - p| \leq .01)
17 | $$
18 | which is the same as:
19 |
20 | $$
21 | \mbox{Pr}(\bar{X}\leq p + .01) - \mbox{Pr}(\bar{X} \leq p - .01)
22 | $$
23 |
24 | Can we answer this question? Note that we can use the mathematical trick we learned in the previous chapter. Subtract the expected value and divide by the standard error to get a standard normal random variable, call it $Z$, on the left. Since $p$ is the expected value and $\mbox{SE}(\bar{X}) = \sqrt{p(1-p)/N}$ is the standard error we get:
25 |
26 | $$
27 | \mbox{Pr}\left(Z \leq \,.01 / \mbox{SE}(\bar{X}) \right) -
28 | \mbox{Pr}\left(Z \leq - \,.01 / \mbox{SE}(\bar{X}) \right)
29 | $$
30 |
31 | A problem is that we don't know $p$, so we don't know $\mbox{SE}(\bar{X})$. But it turns out that the CLT still works if we estimate the standard error by using $\bar{X}$ in place of $p$. We say that we _plug-in_ the estimate. Our estimate of the standard error is therefore:
32 |
33 | $$
34 | \hat{\mbox{SE}}(\bar{X})=\sqrt{\bar{X}(1-\bar{X})/N}
35 | $$
36 | In statistics textbooks, we use a little hat to denote estimates. Note that the estimate can be constructed using the observed data and $N$.
37 |
38 | Now we continue with our calculation, but dividing by $\hat{\mbox{SE}}(\bar{X})=\sqrt{\bar{X}(1-\bar{X})/N}$ instead. In our first sample we had 12 blue and 13 red, so $\bar{X} = 0.48$ and our estimate of the standard error is
39 |
40 | ```{r}
41 | X_hat <- 0.48
42 | se <- sqrt(X_hat*(1-X_hat)/25)
43 | se
44 | ```
45 |
46 | And now we can answer the question of the probability of being close to $p$. The answer is
47 |
48 | ```{r}
49 | pnorm(0.01/se) - pnorm(-0.01/se)
50 | ```
51 |
52 | So there is a small chance that we will be close. A poll of only $N=25$ people is not really very useful, at least not for a close election.
53 |
54 | Earlier we mentioned the _margin of error_. Now we can define it: it is simply two times the standard error, which we can now estimate. In our case it is:
55 |
56 | ```{r}
57 | 2*se
58 | ```
59 |
60 | Why do we multiply by 2? Because if you ask for the probability that we are within two standard errors of $p$ we get:
61 |
62 | $$
63 | \mbox{Pr}\left(Z \leq \, 2\mbox{SE}(\bar{X}) / \mbox{SE}(\bar{X}) \right) -
64 | \mbox{Pr}\left(Z \leq - 2 \mbox{SE}(\bar{X}) / \mbox{SE}(\bar{X}) \right)
65 | $$
66 | which is
67 |
68 | $$
69 | \mbox{Pr}\left(Z \leq 2 \right) -
70 | \mbox{Pr}\left(Z \leq - 2\right)
71 | $$
72 |
73 | which we know is about 95\%:
74 |
75 | ```{r}
76 | pnorm(2)-pnorm(-2)
77 | ```
78 |
79 | So there is a 95% chance that $\bar{X}$ will be within $2\times \hat{SE}(\bar{X})$, in our case `r round(2*se, 2)`, of $p$. Note that 95% is somewhat of an arbitrary choice and sometimes other percentages are used, but it is the most commonly used value to define _margin of error_.
80 |
81 | In summary, the CLT tells us that our poll based on a sample size of $25$ is not very useful. We don't really learn much when the margin of error is this large. All we can really say is that the popular vote will not be won by a large margin. This is why pollsters tend to use larger sample sizes.
82 |
83 | From the table above we see that typical sample sizes range from 700 to 3500. To see how this gives us a much more practical result, note that if we had obtained a $\bar{X}=0.48$ with a sample size of 2,000, our standard error $\hat{\mbox{SE}}(\bar{X})$ would have been `r n<-2000;se<-sqrt(0.48*(1-0.48)/n);se`. So our result is an estimate of `48`% with a margin of error of `r round(2*se*100)`%. In this case, the result is much more informative and would make us think that there are more red balls than blue. But keep in mind, this is hypothetical. We did not take a poll of 2,000 since we don't want to ruin the competition.
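
As a quick worked check of that hypothetical calculation (a sketch, not part of the original text):

```{r}
# Sketch: estimate and margin of error for a hypothetical poll with N = 2000
N <- 2000
X_hat <- 0.48
se_hat <- sqrt(X_hat * (1 - X_hat) / N)
c(estimate = X_hat, margin_of_error = 2 * se_hat)
```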
84 |
85 |
86 | ### A Monte Carlo simulation
87 |
88 |
89 | Suppose we want to use a Monte Carlo simulation to corroborate the tools we have built using probability theory. To create the simulation we would write code like this:
90 |
91 | ```{r, eval=FALSE}
92 | B <- 10000
93 | N <- 1000
94 | Xhat <- replicate(B, {
95 | X <- sample(c(0,1), size=N, replace=TRUE, prob=c(1-p, p))
96 | mean(X)
97 | })
98 | ```
99 |
100 | The problem is, of course, we don't know `p`. We could
101 | construct an urn like the one pictured above and run an analog (without a computer) simulation. It would take a long time, but you could take 10,000 samples, count the beads and keep track of the
102 | proportions of blue. We can use the function `take_poll(n=1000)` instead of drawing from an actual urn, but it would still take time to count the beads and enter the results.
103 |
104 | So, one thing we do to corroborate theoretical results is to pick one, or several values of `p`, and run the simulations. Let's set `p=0.45`. We can then simulate a poll
105 |
106 | ```{r}
107 | p <- 0.45
108 | N <- 1000
109 |
110 | X <- sample(c(0,1), size=N, replace=TRUE, prob=c(1-p, p))
111 | Xhat <- mean(X)
112 | ```
113 |
114 | In this particular sample our estimate is `Xhat`. We can use that code to run a Monte Carlo simulation:
115 |
116 | ```{r}
117 | B <- 10000
118 | Xhat <- replicate(B, {
119 | X <- sample(c(0,1), size=N, replace=TRUE, prob=c(1-p, p))
120 | mean(X)
121 | })
122 | ```
123 |
124 | To review, the theory tells us that $\bar{X}$ is approximately normally distributed, with expected value $p=$`r p` and standard error $\sqrt{p(1-p)/N}$ = `r sqrt(p*(1-p)/N)`. The simulation confirms this:
125 |
126 | ```{r}
127 | mean(Xhat)
128 | sd(Xhat)
129 | ```
130 |
131 | A histogram and qq-plot confirm that the normal approximation is accurate as well:
132 |
133 | ```{r,echo=FALSE, warning=FALSE, message=FALSE}
134 | library(gridExtra)
135 | p1 <- data.frame(Xhat=Xhat) %>%
136 | ggplot(aes(Xhat)) +
137 | geom_histogram(binwidth = 0.005, color="black")
138 | p2 <- data.frame(Xhat=Xhat) %>%
139 | ggplot(aes(sample=Xhat)) +
140 | stat_qq(dparams = list(mean=mean(Xhat), sd=sd(Xhat))) + geom_abline() + ylab("Xhat") + xlab("Theoretical normal")
141 | grid.arrange(p1,p2, nrow=1)
142 | ```
143 |
144 | Again, note that in real life we would never be able to run such an experiment because we don't know $p$. But we could run it for various values of $p$ and $N$ and see that the theory does indeed work well for most values. You can easily do this by re-running the code above after changing `p` and `N`.
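
For convenience, here is one way to wrap the simulation in a function so that different values of `p` and `N` are easy to try (a small sketch, not part of the original code):

```{r}
# Sketch: compare the Monte Carlo mean and sd of Xhat to the theoretical values
# for any choice of p and N
simulate_xhat <- function(p, N, B = 10000) {
  Xhat <- replicate(B, {
    X <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
    mean(X)
  })
  c(mc_mean = mean(Xhat), mc_sd = sd(Xhat), theory_se = sqrt(p * (1 - p) / N))
}
simulate_xhat(p = 0.35, N = 500)
```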
145 |
146 | ### The spread
147 |
148 | The competition is to predict the spread, not the proportion $p$. However, because we are assuming there are only two parties, we know that the spread is $p - (1-p) = 2p - 1$. So everything we have done can easily be adapted to an estimate of $2p - 1$. Once we have our estimate $\bar{X}$ and $\hat{\mbox{SE}}(\bar{X})$, we estimate the spread with $2\bar{X} - 1$ and, since we are multiplying by 2, the standard error is $2\hat{\mbox{SE}}(\bar{X})$. Note that subtracting 1 does not add any variability so it does not affect the standard error.
149 |
150 | So for our 25-bead sample above, our estimate of $p$ is `.48` with margin of error `.20`, so our estimate of the spread is `-0.04` with margin of error `.40`. Again, not a very useful sample size. But the point is that once we have an estimate and standard error for $p$, we have them for the spread $2p-1$.
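
As a quick check of those numbers (a sketch, not in the original lecture):

```{r}
# Sketch: spread estimate and margin of error for the 25-bead sample
X_hat <- 0.48
se_hat <- sqrt(X_hat * (1 - X_hat) / 25)
c(spread = 2 * X_hat - 1, margin_of_error = 2 * 2 * se_hat)
```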
151 |
152 |
153 | ### Bias: Why not run a very large poll?
154 |
155 | Note that for realistic values of $p$, say from 0.35 to 0.65, if we run a very large poll with 100,000 people, theory would tell us that we would predict the election perfectly, since the largest possible margin of error is around 0.3\%. Here are the calculations:
156 |
157 | ```{r}
158 | N <- 100000
159 | p <- seq(0.35, 0.65, length = 100)
160 | SE <- sapply(p, function(x) 2*sqrt(x*(1-x)/N))
161 | data.frame(p=p, SE = SE) %>% ggplot(aes(p, SE)) +
162 | geom_line()
163 | ```
164 |
165 | One reason is that running such a poll is very expensive. But perhaps a more important reason is that theory has its limitations. Polling is much more complicated than picking beads from an urn. People might lie to you and others might not have phones. But perhaps the most important way an actual poll differs from an urn model is that we actually don't know for sure who is in our population and who is not. How do we know who is going to vote? Are we reaching all possible voters? So even if our margin of error is very small, it might not be exactly right that our expected value is $p$. We call this bias. Historically, we observe that polls are indeed biased, although not by that much. The typical bias appears to be about 1-2%. This makes election forecasting a bit more interesting and we will talk about how to model this in a later chapter.
166 |
167 |
168 |
169 |
170 |
171 |
172 |
173 |
--------------------------------------------------------------------------------
/lectures/inference/confidence-intervals-p-values-assessment.Rmd:
--------------------------------------------------------------------------------
1 | ### Confidence Intervals and p-values
2 |
3 | For these exercises we will use actual polls for the 2016 election. You can load the data from the `dslabs` package
4 |
5 | ```{r}
6 | library(dslabs)
7 | data("polls_us_election_2016")
8 | ```
9 |
10 | Specifically, we will use all the national polls that ended within two weeks before the election.
11 |
12 | ```{r, message=FALSE, comment=FALSE}
13 | library(tidyverse)
14 | polls <- polls_us_election_2016 %>% filter(enddate >= "2016-10-31" & state == "U.S.")
15 | ```
16 |
17 | 1. For the first poll you can obtain the samples size and estimated Clinton percentage with
18 |
19 | ```{r}
20 | N <- polls$samplesize[1]
21 | X_hat <- polls$rawpoll_clinton[1]/100
22 | ```
23 |
24 | Assume there are only two candidates and construct a 95% confidence interval for the election night proportion $p$.
25 |
26 |
27 | 2. Now use `dplyr` to add the confidence interval as two columns, call them `lower` and `upper`, to the object `polls`. Then show the enddate, pollster, the estimated proportion, and confidence interval columns. Hint: define temporary columns `X_hat` and `se_hat`.
28 |
29 |
30 | 3. The final tally for the popular vote was Clinton 48.2% and Trump 46.1%. Add a column, call it `hit`, to the previous table stating if the confidence interval included the true proportion $p=0.482$ or not.
31 |
32 |
33 |
34 | 4. For the table you just created, what proportion of confidence intervals included $p$?
35 |
36 |
37 |
38 | 5. If these confidence intervals are constructed correctly, and the theory holds up, what proportion should include $p$?
39 |
40 |
41 | 6. Note that a much smaller proportion of the polls than expected produce confidence intervals containing $p$. If you look closely at the table, you will see that most polls that fail to include $p$ are underestimating. The reason for this is undecided voters, individuals polled that do not yet know who they will vote for or do not want to say. Because, historically, undecideds divide evenly between the two main candidates on election day, it is more informative to estimate the spread or the difference between the proportions of the two candidates $d$, which in this election was $0.482 - 0.461 = 0.021$. Assume that there are only two parties and that $d = 2p - 1$. Define
42 |
43 | ```{r, message=FALSE, comment=FALSE}
44 | polls <- polls_us_election_2016 %>% filter(enddate >= "2016-10-31" & state == "U.S.") %>%
45 | mutate(d_hat = rawpoll_clinton/100 - rawpoll_trump/100)
46 | ```
47 |
48 | and re-do exercise 1 but for the difference.
49 |
50 |
51 | 7. Now repeat exercise 3, but for the difference.
52 |
53 |
54 | 8. Now repeat exercise 4, but for the difference
55 |
56 |
57 | 9. Note that although the proportion of confidence intervals goes up substantially, it is still lower than 0.95. In the next lecture we learn why this is. To motivate this, make a plot of the error, the difference between each poll's estimate and the actual $d=0.021$. Stratify by pollster. Hint: use `theme(axis.text.x = element_text(angle = 90, hjust = 1))`
58 |
59 |
60 | 10. Re-do the plot you made for 9 but only for pollsters that took five or more polls.
61 |
62 |
--------------------------------------------------------------------------------
/lectures/inference/confidence-intervals-p-values.Rmd:
--------------------------------------------------------------------------------
1 | ## Confidence Intervals
2 |
3 | ```{r, echo=FALSE, message=FALSE}
4 | library(tidyverse)
5 | library(dslabs)
6 | ds_theme_set()
7 | ```
8 |
9 | Confidence intervals are a very useful concept that is widely used by data scientists. A version of these that is very commonly seen comes from the `ggplot` geometry `geom_smooth`. Here is an example using a temperature dataset available in R:
10 |
11 | ```{r, message=FALSE}
12 | data("nhtemp")
13 | data.frame(year = as.numeric(time(nhtemp)), temperature=as.numeric(nhtemp)) %>%
14 | ggplot(aes(year, temperature)) +
15 | geom_point() +
16 | geom_smooth() +
17 | ggtitle("Average Yearly Temperatures in New Haven")
18 | ```
19 |
20 |
21 | We will later learn how the curve is formed, but note the shaded area around it; this area is created using the concept of confidence intervals.
22 |
23 | In our competition we were asked to give an interval. If the interval you submit includes $p$, you get half the money you spent on your "poll" back and pass to the next stage of the competition. One way to pass to the second round is to report a very large interval. For example, the interval $[0,1]$ is guaranteed to include $p$. However, with an interval this big, we have no chance of winning the competition. Similarly, if you are an election forecaster and predict the spread will be between -100% and 100% you will be ridiculed for stating the obvious. Even a smaller interval, such as saying the spread will be between -10% and 10%, will not be considered serious.
24 |
25 | On the other hand, the smaller the interval we report, the smaller our chances of winning the prize. Similarly, a bold pollster that reports very small intervals and misses the mark most of the time will not be considered a good pollster. We want to be somewhere in between.
26 |
27 | We can use the statistical theory we have learned to compute the probability of any given interval including $p$. Similarly, if we are asked to create an interval with, say, a 95\% chance of including $p$, we can do that as well. These are called 95\% confidence intervals.
28 |
29 | Note that when pollsters report an estimate and a margin of error, they are, in a way, reporting a 95\% confidence interval. Let's show how this works mathematically.
30 |
31 | We want to know the probability that the interval $[\bar{X} - 2\hat{\mbox{SE}}(\bar{X}), \bar{X} + 2\hat{\mbox{SE}}(\bar{X})]$ contains the true proportion $p$. First, note that the start and end of this interval are random variables: every time we take a sample they change. To illustrate this, run the Monte Carlo simulation above twice. We use the same parameters as above:
32 |
33 | ```{r}
34 | p <- 0.45
35 | N <- 1000
36 | ```
37 |
38 | And note that the interval here
39 |
40 | ```{r}
41 | X <- sample(c(0,1), size=N, replace=TRUE, prob=c(1-p, p))
42 | X_hat <- mean(X)
43 | SE_hat <- sqrt(X_hat*(1-X_hat)/N)
44 | c(X_hat - 2*SE_hat, X_hat + 2*SE_hat)
45 | ```
46 |
47 | is different from this one:
48 |
49 | ```{r}
50 | X <- sample(c(0,1), size=N, replace=TRUE, prob=c(1-p, p))
51 | X_hat <- mean(X)
52 | SE_hat <- sqrt(X_hat*(1-X_hat)/N)
53 | c(X_hat - 2*SE_hat, X_hat + 2*SE_hat)
54 | ```
55 |
56 | Keep sampling and creating intervals and you will see the random variation.
57 |
58 | To determine the probability that the interval includes $p$ we need to compute this:
59 | $$
60 | \mbox{Pr}\left(\bar{X} - 2\hat{\mbox{SE}}(\bar{X}) \leq p \leq \bar{X} + 2\hat{\mbox{SE}}(\bar{X})\right)
61 | $$
62 |
63 | By subtracting and dividing the same quantities in all parts of the equation we
64 | get that the above is equivalent to:
65 |
66 | $$
67 | \mbox{Pr}\left(-2 \leq \frac{\bar{X}- p}{\hat{\mbox{SE}}(\bar{X})} \leq 2\right)
68 | $$
69 |
70 |
71 | The term in the middle is an approximately normal random variable with expected value 0 and standard error 1, which we have been denoting with $Z$, so we have
72 |
73 | $$
74 | \mbox{Pr}\left(-2 \leq Z \leq 2\right)
75 | $$
76 |
77 | which we can quickly compute using
78 |
79 | ```{r}
80 | pnorm(2) - pnorm(-2)
81 | ```
82 |
83 | proving that we have a 95\% probability.
84 |
85 | Note that if we want to have a larger probability, say 99\%, we need to multiply by whatever `z` satisfies the following:
86 |
87 |
88 | $$
89 | \mbox{Pr}\left(-z \leq Z \leq z\right) = 0.99
90 | $$
91 |
92 | Note that using
93 |
94 | ```{r}
95 | z <- qnorm(0.995)
96 | z
97 | ```
98 |
99 | will do it, because by definition `pnorm(qnorm(0.995))` is 0.995 and, by symmetry, `pnorm(-qnorm(0.995))` is 1 - 0.995. We therefore have that
100 |
101 |
102 | ```{r}
103 | pnorm(z)-pnorm(-z)
104 | ```
105 |
106 | is `0.995 - 0.005 = 0.99`. We can use this approach for any desired probability $q$: we use the quantile $1 - (1 - q)/2$. Why this number? Because the probability between the two cutoffs is $\left(1 - (1 - q)/2\right) - (1 - q)/2 = q$.
107 |
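For example, for $q = 0.99$ this formula gives back the multiplier we just used (a quick check, not in the original text):

```{r}
# The multiplier z for a 99% confidence interval using the general formula
q <- 0.99
qnorm(1 - (1 - q) / 2)
```
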
108 | Note that to get exactly a 0.95 confidence interval, we actually use a slightly smaller number than 2:
109 |
110 | ```{r}
111 | qnorm(0.975)
112 | ```
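
As a sketch of the general recipe, a small helper (the name `ci_z` is ours, not from the lecture code) returns the multiplier for any confidence level `q`:

```{r}
# multiplier z for confidence level q, using the quantile 1 - (1 - q)/2
ci_z <- function(q) qnorm(1 - (1 - q)/2)
ci_z(0.95)  # about 1.96
ci_z(0.99)  # about 2.58
```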
113 |
114 | ### A Monte Carlo Simulation
115 |
116 | We can run a Monte Carlo simulation to confirm that, in fact, a 95\% confidence interval includes $p$ about 95\% of the time.
117 |
118 | ```{r, eval=FALSE}
119 | set.seed(1)
120 | ```
121 |
122 |
123 | ```{r}
124 | B <- 10000
125 | inside <- replicate(B, {
126 | X <- sample(c(0,1), size=N, replace=TRUE, prob=c(1-p, p))
127 | X_hat <- mean(X)
128 | SE_hat <- sqrt(X_hat*(1-X_hat)/N)
129 | between(p, X_hat - 2*SE_hat, X_hat + 2*SE_hat)
130 | })
131 | mean(inside)
132 | ```
133 |
134 |
135 | The following plot shows the first 100 confidence intervals. In this case we created the simulation so that the black line denotes the parameter $p$ we are trying to estimate.
136 |
137 | ```{r, message=FALSE, echo=FALSE}
138 | set.seed(1)
139 | tab <- replicate(100, {
140 | X <- sample(c(0,1), size=N, replace=TRUE, prob=c(1-p, p))
141 | X_hat <- mean(X)
142 | SE_hat <- sqrt(X_hat*(1-X_hat)/N)
143 | hit <- between(p, X_hat - 2.58*SE_hat, X_hat + 2.58*SE_hat)
144 | c(X_hat, X_hat - 2.58*SE_hat, X_hat + 2.58*SE_hat, hit)
145 | })
146 |
147 | tab <- data.frame(poll=1:ncol(tab), t(tab))
148 | names(tab)<-c("poll", "estimate", "low", "high", "hit")
149 | tab <- mutate(tab, p_inside = ifelse(hit, "Yes", "No") )
150 | ggplot(tab, aes(poll, estimate, ymin=low, ymax=high, col = p_inside)) +
151 | geom_point()+
152 | geom_errorbar() +
153 | coord_flip() +
154 | geom_hline(yintercept = p)
155 | ```
156 |
157 |
158 | ### The Correct Language
159 |
160 | When using the theory we described above, it is important to remember that it is the intervals that are random, not $p$. In the plot above we can see the random intervals moving around while $p$, represented with the vertical line, stays in the same place. The proportion of blue beads in the urn, $p$, is not random. So the 95\% relates to the probability that the random interval falls on top of $p$. Saying that $p$ has a 95\% chance of being between this and that value is technically incorrect because, again, $p$ is not random.
161 |
162 |
163 | ### Power
164 |
165 | Pollsters are not considered successful for providing correct confidence intervals, but rather for predicting who will win. When we took a sample of 25 beads, the confidence interval for the spread:
166 |
167 | ```{r}
168 | N <- 25
169 | X_hat <- 0.48
170 | (2*X_hat - 1) + c(-2,2)*2*sqrt(X_hat*(1-X_hat)/N)
171 | ```
172 |
173 | includes 0. If this were a poll and we were forced to make a declaration, we would have to say it was a "toss-up".
174 |
175 | A problem with our poll results is that, given the sample size and the value of $p$, we would have to accept a higher probability of an incorrect call in order to create an interval that does not include 0.
176 |
177 | This does not mean that the election is close. It only means that we have a small sample size. In statistical textbooks this is called lack of _power_. In the context of polls, _power_ is the probability of detecting spreads different from 0.
178 |
179 | By increasing our sample size, we lower our standard error and therefore have a much better chance of detecting the direction of the spread.
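
As a rough sketch of this effect, assuming the estimate stays around $\bar{X} = 0.48$, we can see how the margin of error for the spread shrinks as $N$ grows:

```{r}
# margin of error for the spread, 2 * 2*SE(X_hat), at several sample sizes
# (a sketch assuming the estimate stays around X_hat = 0.48)
X_hat <- 0.48
Ns <- c(25, 100, 1000, 4000)
data.frame(N = Ns, moe = 2*2*sqrt(X_hat*(1 - X_hat)/Ns))
```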
180 |
181 |
182 | ## p-values
183 |
184 | p-values are ubiquitous in the scientific literature. They are closely related to confidence intervals, so we introduce the concept here.
185 |
186 | Let's consider the blue and red beads. Suppose that rather than wanting an estimate of the spread or the proportion of blue, I am interested only in the question: are there more blue beads or red beads? I want to know if the spread $2p-1 > 0$.
187 |
188 | Suppose we take a random sample of $N=100$ and we observe $52$ blue beads, which gives us $2\bar{X}-1=0.04$. This seems to be pointing to there being more blue than red since 0.04 is larger than 0. However, as data scientists we need to be skeptical. We know there is chance involved in this process and we could get a 52 even when the actual spread is 0. The hypothesis that the spread is $2p-1=0$ is called the _null hypothesis_: it is the skeptic's hypothesis. We have observed a random variable $2\bar{X}-1 = 0.04$, and the p-value is the answer to the question: how likely is it to see a value this large when the null hypothesis is true? So we write
189 |
190 | $$\mbox{Pr}(\mid \bar{X} - 0.5 \mid > 0.02 ) $$
191 |
192 | assuming that $2p-1=0$, or equivalently $p=0.5$. Under the null hypothesis we know that
193 |
194 | $$
195 | \sqrt{N}\frac{\bar{X} - 0.5}{\sqrt{0.5(1-0.5)}}
196 | $$
197 |
198 | is approximately standard normal. So we can compute the probability above, which is the p-value.
199 |
200 | $$\mbox{Pr}\left(\sqrt{N}\frac{\mid \bar{X} - 0.5\mid}{\sqrt{0.5(1-0.5)}} > \sqrt{N} \frac{0.02}{ \sqrt{0.5(1-0.5)}}\right)$$
201 |
202 |
203 | ```{r}
204 | N <- 100
205 | z <- sqrt(N)*0.02/0.5
206 | 1 - (pnorm(z) - pnorm(-z))
207 | ```
208 |
209 | This is the p-value. In this case there is actually a large chance of seeing 52 or larger under the null hypothesis.
210 |
211 | Note that there is a close connection between p-values and confidence intervals. If a 95% confidence interval of the spread does not include 0, we know that the p-value must be smaller than 0.05.
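
As a quick sketch of this connection, using the same numbers as above ($N=100$, 52 blue beads), the 95\% confidence interval for the spread includes 0 and, consistent with that, the p-value is well above 0.05:

```{r}
N <- 100
X_hat <- 0.52
se_spread <- 2*sqrt(X_hat*(1 - X_hat)/N)
# 95% confidence interval for the spread: includes 0
(2*X_hat - 1) + c(-1, 1)*qnorm(0.975)*se_spread
# two-sided p-value under the null p = 0.5: larger than 0.05
2*(1 - pnorm((2*X_hat - 1)/(2*sqrt(0.5*0.5/N))))
```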
212 |
213 | To learn more about p-values you can consult any statistics textbook. However, in general we prefer reporting confidence intervals over p-values since they give us an idea of the size of the estimate. The p-value simply reports a probability and says nothing about the significance of the finding in the context of the problem.
214 |
215 |
216 |
217 |
--------------------------------------------------------------------------------
/lectures/inference/img/pollster-2016-predictions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/inference/img/pollster-2016-predictions.png
--------------------------------------------------------------------------------
/lectures/inference/img/popular-vote-538.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/inference/img/popular-vote-538.png
--------------------------------------------------------------------------------
/lectures/inference/img/rcp-polls.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/inference/img/rcp-polls.png
--------------------------------------------------------------------------------
/lectures/inference/img/urn.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/inference/img/urn.jpg
--------------------------------------------------------------------------------
/lectures/inference/intro-to-inference.Rmd:
--------------------------------------------------------------------------------
1 | # Inference and Modeling
2 |
3 | The day before the 2008 presidential election, Nate Silver's Fivethirtyeight stated that "Barack Obama appears poised for a decisive electoral victory". They went further and predicted that Obama would win the election with 349 electoral votes to 189 and the popular vote by a margin of 6.1%. Fivethirtyeight also attached a probabilistic statement to their prediction, claiming that Obama had a 91% chance of winning the election. The predictions were quite accurate: in the final tally, Obama won the electoral college 365 to 173 and the popular vote by a 7.2% difference. Their performance in the 2008 election brought Fivethirtyeight to the attention of political pundits and TV personalities. The week before the 2012 election, Fivethirtyeight's Nate Silver was giving Obama a 90% chance of winning. Despite many of the experts thinking the race was close, political commentator Joe Scarborough said [during his show](https://www.youtube.com/watch?v=TbKkjm-gheY):
4 |
5 | >> Anybody that thinks that this race is anything but a tossup right now is such an ideologue ... they're jokes.
6 |
7 | To which Nate Silver responded via Twitter:
8 |
9 | >> If you think it's a toss-up, let's bet. If Obama wins, you donate $1,000 to the American Red Cross. If Romney wins, I do. Deal?
10 |
11 | How was Mr. Silver so confident? We will demonstrate how _poll aggregators_, such as Fivethirtyeight, collected and combined data reported by different pollsters to produce improved predictions. The two main statistical tools used by the aggregators are the topic of this chapter: inference and modeling. To begin to understand how election forecasting works, we need to understand the basic data point they use: poll results.
12 |
13 |
14 | ## Polls
15 |
16 | Opinion polls have been conducted since the 19th century. Their general goal is to describe the opinions held by a specific population on a given set of topics. In recent times, these polls have been pervasive during presidential elections. Polls are useful when asking everybody in the population is logistically impossible. The general strategy is to ask a smaller group, chosen at random, and then infer the opinions of the entire population from the opinions of the smaller group. Statistical theory is used to justify the process. This theory is referred to as _inference_ and is the main topic of this section.
17 |
18 | Perhaps the best known opinion polls are those conducted to determine which candidate is preferred by voters in a given election. Political strategists make extensive use of polls to decide, for example, how to invest resources, such as which geographical locations to focus their get-out-the-vote efforts on.
19 |
20 | Elections are a particularly interesting case of opinion polls because the actual opinion of the entire population is revealed on election day. Of course, it costs millions of dollars to run an actual election, which makes polling a cost-effective strategy for those that want to forecast the results.
21 |
22 | Although typically the results of these polls are kept private, similar polls are conducted by news organizations because results tend to be of interest to the general public and are often made public. We will eventually be looking at such data.
23 |
24 | [Real Clear Politics](http://www.realclearpolitics.com) is an example of a news aggregator that organizes and publishes poll results. For example, [here](http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html) are examples of polls reporting estimates of the popular vote for the 2016 presidential election:
25 |
26 | ```{r, echo=FALSE}
27 | knitr::include_graphics("img/rcp-polls.png")
28 | ```
29 |
30 | Although, in the United States, the popular vote does not determine the result of the election, we will use it here as an illustrative and simple example of how well polls work. Forecasting the election is a more complex process since it involves combining results from 50 states and DC.
31 |
32 | Let's make some observations about the table above. First, note that different polls, all taken days before the election, report a different _spread_: the estimated difference between support for the two candidates. Note also that the reported spreads hover around what ended up being the actual result: Clinton won the popular vote by 2.1%. We also see a column titled **MoE**, which stands for _margin of error_.
33 |
34 | In this section we show how the probability concepts we learned in the previous chapter can be applied to develop the statistical approaches that make polls an effective tool. We will learn the statistical concepts necessary to define _estimates_ and _margins of error_, and show how we can use these to forecast final results relatively well and also provide an estimate of the precision of our forecast. Once we learn this we will be able to understand two concepts that are
35 | ubiquitous in data science: _confidence intervals_ and _p-values_. Finally, to understand probabilistic statements about the probability of a candidate winning, we will have to learn about Bayesian modeling. In the final sections we put it all together to recreate a simplified version of the Fivethirtyeight model and apply it to the 2016 election.
36 |
37 | We start by connecting probability theory to the task of using polls to learn about a population.
38 |
39 |
--------------------------------------------------------------------------------
/lectures/inference/models-assessment.Rmd:
--------------------------------------------------------------------------------
1 | ## Models
2 |
3 | 1. We have been using urn models to motivate the use of probability models. Most data science applications are not related to data obtained from urns. More common are data that come from individuals. The reason probability plays a role here is because the data come from a random sample. The random sample is taken from a population. The urn serves as an analogy for the population.
4 |
5 | Let's revisit the heights dataset. Suppose we consider the males in our course the population.
6 |
7 | ```{r}
8 | library(dplyr)
9 | library(dslabs)
10 | data(heights)
11 | x <- heights %>% filter(sex == "Male") %>%
12 | .$height
13 | ```
14 |
15 | Mathematically speaking, `x` is our population. Using the urn analogy, we have an urn with the values of `x` in it. What are the population average and standard deviation of our population?
16 |
17 |
18 | 2. Call the population average computed above $\mu$ and the standard deviation $\sigma$. Now take a sample of size 50, with replacement, and construct an estimate for $\mu$ and $\sigma$. Set the seed at 1 based on what has been described in this section.
19 |
20 |
21 | 3. What does the theory tell us about the sample average $\bar{X}$ and how it relates to $\mu$?
22 |
23 | A. It is practically identical to $\mu$.
24 | B. It is a random variable with expected value $\mu$ and standard error $\sigma/\sqrt{N}$
25 | C. It is a random variable with expected value $\mu$ and standard error $\sigma$.
26 | D. Contains no information
27 |
28 |
29 | 4. So how is this useful? We are going to use an over-simplified yet illustrative example. Suppose we want to know the average height of our male students but we only get to measure 50 of the 708. We will use $\bar{X}$ as our estimate. We know from the answer to 3 that the standard error of our estimate $\bar{X}-\mu$ is $\sigma/\sqrt{N}$. We want to know what this is, but we don't know $\sigma$. Based on what is described in this section, compute your estimate of $\sigma$.
30 |
31 |
32 | 5. Now that we have an estimate of $\sigma$, let's call our estimate $s$. Construct a 95% confidence interval for $\mu$.
33 |
34 |
35 |
36 | 6. Now run a Monte Carlo simulation in which you
37 | compute 10,000 confidence intervals as you have just done. What proportion of these intervals include $\mu$? Set the seed to 1.
38 |
39 |
40 | 7. In this section we talked about pollster bias. We used visualization to motivate the presence of such bias. Here we will give it a more rigorous treatment. We will consider two pollsters that conducted daily polls. We will look at national polls for the month before the election.
41 |
42 | ```{r}
43 | data(polls_us_election_2016)
44 | polls <- polls_us_election_2016 %>%
45 | filter(pollster %in% c("Rasmussen Reports/Pulse Opinion Research","The Times-Picayune/Lucid") &
46 | enddate >= "2016-10-15" &
47 | state == "U.S.") %>%
48 | mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)
49 | ```
50 |
51 | We want to answer the question "Is there a pollster bias?". Make a plot showing the spreads for each poll.
52 |
53 |
54 |
55 | 8. The data does seem to suggest there is a difference. But these
56 | data are subject to variability. Maybe the differences we observe are due to chance.
57 |
58 | The urn model theory says nothing about a pollster effect. Under the urn model, both pollsters have the same expected value: the election day difference, which we call $d$.
59 |
60 | To answer this question (is there an urn model that explains these data?), we will model the observed data $Y_{ij}$ in the following way:
61 |
62 | $$
63 | Y_{ij} = d + b_i + \varepsilon_{ij}
64 | $$
65 | with $i=1,2$ indexing the two pollsters, $b_i$ the bias for pollster $i$, and $\varepsilon_{ij}$ the poll-to-poll chance variability. We assume the $\varepsilon_{ij}$ are independent from each other, have expected value $0$, and have standard deviation $\sigma_i$ regardless of $j$.
66 |
67 | Which of the following best represents our question?
68 |
69 | A. Is $\varepsilon_{ij} = 0$?
70 | B. How close are the $Y_{ij}$ to $d$?
71 | C. Is $b_1 \neq b_2$?
72 | D. Are $b_1 = 0$ and $b_2 = 0$ ?
73 |
74 | 9. Note that on the right side of this model only $\varepsilon_{ij}$ is a random variable. The other two terms are constants. What is the expected value of $Y_{1j}$?
75 |
76 |
77 | 10. Suppose we define $\bar{Y}_1$ as the average of poll results from the first poll, $Y_{11},\dots,Y_{1N_1}$ with $N_1$ the number of polls conducted by the first pollster:
78 |
79 | ```{r}
80 | polls %>%
81 | filter(pollster=="Rasmussen Reports/Pulse Opinion Research") %>%
82 | summarize(N_1 = n())
83 | ```
84 |
85 | What is the expected value of $\bar{Y}_1$?
86 |
87 |
88 | 11. What is the standard error of $\bar{Y}_1$?
89 |
90 |
91 | 12. What is the expected value of $\bar{Y}_2$?
92 |
93 |
94 | 13. What is the standard error of $\bar{Y}_2$?
95 |
96 |
97 | 14. Using what we learned in answering the questions above what is the expected value of $\bar{Y}_{2} - \bar{Y}_1$?
98 |
99 |
100 | 15. Using what we learned in answering the questions above what is the standard error of $\bar{Y}_{2} - \bar{Y}_1$?
101 |
102 |
--------------------------------------------------------------------------------
/lectures/inference/t-distribution.Rmd:
--------------------------------------------------------------------------------
1 | ## The t-distribution
2 |
3 | ```{r, echo=FALSE}
4 | library(tidyverse)
5 | library(dslabs)
6 | ds_theme_set()
7 | polls <- polls_us_election_2016 %>%
8 | filter(state == "U.S." & enddate >= "2016-10-31" &
9 | (grade %in% c("A+","A","A-","B+") | is.na(grade))) %>%
10 | mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)
11 |
12 | one_poll_per_pollster <- polls %>% group_by(pollster) %>%
13 | filter(enddate == max(enddate)) %>%
14 | ungroup()
15 | ```
16 |
17 | Above we made use of the CLT with a sample size of 15. Because we are estimating a second parameter, $\sigma$, further variability is introduced into our confidence interval, which results in intervals that are too small. For very large sample sizes this extra variability is negligible, but, in general, for sample sizes smaller than 30 we need to be cautious about using the CLT.
18 |
19 | However, if the data in the urn is known to follow a normal distribution, then we actually have mathematical theory that tells us how much bigger we need to make the intervals to account for the estimation of $\sigma$. Using this theory, we can construct confidence intervals for any $N$. But again only if **the data in the urn is known to follow a normal distribution**. So for the 0,1 data of our previous urn model, this theory definitely does not apply.
20 |
21 | The statistic on which confidence intervals for $d$ are based is
22 |
23 | $$
24 | Z = \frac{\bar{X} - d}{\sigma/\sqrt{N}}
25 | $$
26 |
27 | The CLT tells us that $Z$ is approximately normally distributed with expected value 0 and standard error 1. But in practice we don't know $\sigma$, so we use:
28 |
29 | $$
30 | Z = \frac{\bar{X} - d}{s/\sqrt{N}}
31 | $$
32 |
33 |
34 | By substituting $\sigma$ with $s$ we introduce some variability. The theory tells us that $Z$ follows a t-distribution with $N-1$ _degrees of freedom_. The degrees of freedom is a parameter that controls the variability via fatter tails:
35 |
36 | ```{r, echo=FALSE}
37 | x <- seq(-5,5, len=100)
38 | data.frame(x=x, Normal = dnorm(x, 0, 1), t_03 = dt(x,3), t_05 = dt(x,5), t_15=dt(x,15)) %>% gather(distribution, f, -x) %>% ggplot(aes(x,f, color = distribution)) + geom_line() +ylab("f(x)")
39 | ```
40 |
41 | If we are willing to assume the pollster effect data is normally distributed, which we can check with a qq-plot of the sample data $X_1, \dots, X_N$,
42 | ```{r}
43 | one_poll_per_pollster %>%
44 | ggplot(aes(sample=spread)) + stat_qq()
45 | ```
46 | then $Z$ follows a t-distribution with $N-1$ degrees of freedom. So perhaps a better confidence interval for $d$ is:
47 |
48 |
49 | ```{r}
50 | z <- qt(0.975, nrow(one_poll_per_pollster)-1)
51 | one_poll_per_pollster %>%
52 | summarize(avg = mean(spread), moe = z*sd(spread)/sqrt(length(spread))) %>%
53 | mutate(start = avg - moe, end = avg + moe)
54 | ```
55 |
56 | This confidence interval is a bit larger than the one based on the normal distribution, because the t quantile
57 |
58 | ```{r}
59 | qt(0.975, 14)
60 | ```
61 |
62 | is bigger than
63 |
64 | ```{r}
65 | qnorm(0.975)
66 | ```
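
For contrast, here is the same computation using the normal quantile instead of the t quantile (a sketch, not part of the original code); it yields a slightly narrower interval:

```{r}
z_norm <- qnorm(0.975)
one_poll_per_pollster %>%
  summarize(avg = mean(spread),
            moe = z_norm*sd(spread)/sqrt(length(spread))) %>%
  mutate(start = avg - moe, end = avg + moe)
```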
67 |
68 | Fivethirtyeight uses the t-distribution to generate errors that better model the deviations we see in election data. For example, the deviation we saw in Wisconsin between the polls and the actual result, where Trump won by 0.7%, is more in line with t-distributed data than with normally distributed data.
69 |
70 | ```{r}
71 | results %>% filter(state == "Wisconsin")
72 | ```
73 |
74 |
75 |
--------------------------------------------------------------------------------
/lectures/ml/cross-validation-slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/ml/cross-validation-slides.pdf
--------------------------------------------------------------------------------
/lectures/ml/cross-validation.Rmd:
--------------------------------------------------------------------------------
1 | ## Cross validation
2 | ```{r, message=FALSE, warning=FALSE}
3 | library(readr)
4 | library(dplyr)
5 | library(tidyr)
6 | library(ggplot2)
7 | library(dslabs)
8 | ds_theme_set()
9 | ```
10 |
11 | ```{r}
12 | plotit <- function(dat, i, n=sqrt(ncol(dat)-1)){
13 | dat <- slice(dat,i)
14 | tmp <- expand.grid(Row=1:n, Column=1:n) %>%
15 | mutate(id=i, label=dat$label,
16 | value = unlist(dat[,-1]))
17 | tmp%>%ggplot(aes(Row, Column, fill=value)) +
18 | geom_raster() +
19 | scale_y_reverse() +
20 | scale_fill_gradient(low="white", high="black") +
21 | ggtitle(tmp$label[1])
22 | }
23 | ```
24 |
25 |
26 | ```{r}
27 | url <- "https://raw.githubusercontent.com/datasciencelabs/data/master/hand-written-digits-train.csv"
28 | original_dat <- read_csv(url)
29 | original_dat <- mutate(original_dat, label = as.factor(label))
30 | ```
31 |
32 | There is a test set with no labels given:
33 |
34 | ```{r}
35 | url <- "https://raw.githubusercontent.com/datasciencelabs/data/master/hand-written-digits-test.csv"
36 | original_test<- read_csv(url)
37 | #View(original_test)
38 | ```
39 |
40 | ## Data Exploration
41 |
42 | ```{r}
43 | X <- sample_n(original_dat,200) %>%
44 | arrange(label)
45 |
46 | d <- dist(as.matrix(X[,-1]))
47 | image(as.matrix(d))
48 |
49 | plot(hclust(d),labels=as.character(X$label))
50 | ```
51 |
52 | 784 features are too many to handle in a demo, so we will compress the predictors by averaging groups of 16 pixels (4 x 4 blocks).
53 |
54 | ```{r}
55 | tmp <- slice(original_dat,1:100)
56 | names(tmp) <- gsub("pixel","",names(tmp))
57 | tmp <- tmp %>% mutate(obs = 1:nrow(tmp))
58 | tmp <- tmp %>% gather(feature, value, `0`:`783`)
59 | tmp <- tmp %>% mutate(feature = as.numeric(feature))
60 | tmp <- tmp %>% mutate(row = feature%%28, col =floor(feature/28))
61 | tmp <- tmp %>% mutate(row = floor(row/4), col = floor(col/4))
62 | tmp <- tmp %>% group_by(obs, row, col)
63 | tmp <- tmp %>% summarize(label = label[1], value = mean(value))
64 | tmp <- tmp %>% ungroup
65 | tmp <- tmp %>% mutate(feature = sprintf("X_%02d_%02d",col,row))
66 | tmp <- tmp %>% select(-row, -col)
67 | tmp <- tmp %>% group_by(obs) %>% spread(feature, value) %>% ungroup %>% select(-obs)
68 | ```
69 |
70 | Let's write a function to perform the compression.
71 |
72 | ```{r}
73 | compress <- function(tbl, n=4){
74 | names(tbl) <- gsub("pixel","",names(tbl))
75 | tbl %>% mutate(obs = 1:nrow(tbl)) %>%
76 | gather(feature, value, `0`:`783`) %>%
77 | mutate(feature = as.numeric(feature)) %>%
78 | mutate(row = feature%%28, col =floor(feature/28)) %>%
79 | mutate(row = floor(row/n), col = floor(col/n)) %>%
80 | group_by(obs, row, col) %>%
81 | summarize(label = label[1], value = mean(value)) %>%
82 | ungroup %>%
83 | mutate(feature = sprintf("X_%02d_%02d",col,row)) %>%
84 | select(-row, -col) %>%
85 | group_by(obs) %>% spread(feature, value) %>%
86 | ungroup %>%
87 | select(-obs)
88 | }
89 | ```
90 |
91 | Now, we compress the entire dataset. This will take a bit of time:
92 |
93 | ```{r}
94 | dat <- compress(original_dat)
95 | ```
96 |
97 | Note that some features are almost always 0:
98 |
99 | ```{r}
100 | library(caret)
101 | set.seed(1)
102 | inTrain <- createDataPartition(y = dat$label,
103 | p=0.9)$Resample
104 | X <- dat %>% select(-label) %>% slice(inTrain) %>% as.matrix
105 | column_means <- colMeans(X)
106 | plot(table(round(column_means)))
107 | ```
108 |
109 | Let's remove these "low information" features:
110 |
111 | ```{r}
112 | keep_columns <- which(column_means>10)
113 | ```
114 |
115 | Let's next define the training data and test data:
116 |
117 | ```{r}
118 | train_set <- slice(dat, inTrain) %>%
119 | select(label, keep_columns+1)
120 | test_set <- slice(dat, -inTrain) %>%
121 | select(label, keep_columns+1)
122 | ```
123 |
124 | Note that the distances look a bit cleaner:
125 |
126 | ```{r}
127 | X <- sample_n(train_set,200) %>%
128 | arrange(label)
129 |
130 | d <- dist(as.matrix(X[,-1]))
131 | image(as.matrix(d))
132 | plot(hclust(d),labels=as.character(X$label))
133 | ```
134 |
135 |
136 | ```{r}
137 | tmp <- sample_n(train_set, 5000)
138 |
139 | control <- trainControl(method='cv', number=20)
140 | res <- train(label ~ .,
141 | data = tmp,
142 | method = "knn",
143 | trControl = control,
144 | tuneGrid=data.frame(k=seq(1,15,2)),
145 | metric="Accuracy")
146 |
147 | plot(res)
148 |
149 | fit <- knn3(label~., train_set, k=3)
150 | pred <- predict(fit, newdata = test_set, type="class")
151 |
152 | tab <- table(pred, test_set$label)
153 | confusionMatrix(tab)
154 | ```
155 |
156 | Compete in Kaggle?
157 |
158 | ```{r}
159 | original_test <- mutate(original_test, label=NA)
160 | test <- compress(original_test)
161 | test <- test %>% select(label, keep_columns+1)
162 | pred <- predict(fit, newdata = test, type="class")
163 |
164 | i=11
165 | pred[i]
166 | plotit(original_test,i)
167 |
168 | res <- data.frame(ImageId=1:nrow(test),Label=as.character(pred))
169 | write.csv(res, file="test.csv", row.names=FALSE)
170 | ```
171 |
172 |
--------------------------------------------------------------------------------
/lectures/ml/img/binsmoother1.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/ml/img/binsmoother1.gif
--------------------------------------------------------------------------------
/lectures/ml/img/binsmoother2.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/ml/img/binsmoother2.gif
--------------------------------------------------------------------------------
/lectures/ml/img/loess.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/ml/img/loess.gif
--------------------------------------------------------------------------------
/lectures/ml/img/loesses.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/ml/img/loesses.gif
--------------------------------------------------------------------------------
/lectures/ml/intro-ml-assessment.Rmd:
--------------------------------------------------------------------------------
1 | #### Assessment
2 |
3 | 1. For each of the following, determine if the outcome is continuous or categorical:
4 |
5 | a. Digit reader
6 | b. Movie recommendations
7 | c. Spam filter
8 | d. Hospitalizations
9 | e. Siri
10 |
11 |
12 | 2. How many features are available to us for prediction in the digits dataset?
13 |
14 |
15 | 3. Create a predictor by rounding the heights to the nearest inch. What is the conditional probability of being Male if you are 70 inches tall?
16 |
17 | ```{r}
18 | heights %>% mutate(height = round(height)) %>%
19 | filter(height==70) %>%
20 | summarize(mean(sex=="Male"))
21 |
22 | ```
23 |
24 | 4. Define the following predictor
25 |
26 | ```{r}
27 | X = round(heights$height)
28 | ```
29 |
30 | Estimate $p(x) = \mbox{Pr}( y = 1 | X=x)$ for each $x$ and plot it against $x$.
31 |
32 | ```{r}
33 | heights %>% mutate(X = round(height)) %>%
34 | group_by(X) %>%
35 | summarize(pr = mean(sex=="Male")) %>%
36 | ggplot(aes(X, pr)) +
37 | geom_point()
38 | ```
39 |
40 |
--------------------------------------------------------------------------------
/lectures/ml/rf.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/ml/rf.gif
--------------------------------------------------------------------------------
/lectures/shiny/introToDataScience2017_shiny.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/shiny/introToDataScience2017_shiny.pdf
--------------------------------------------------------------------------------
/lectures/shiny/introToDataScience2017_shiny.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datasciencelabs/2017/20caca62a7ac20f1cdee9012b64ff19680c37137/lectures/shiny/introToDataScience2017_shiny.pptx
--------------------------------------------------------------------------------
/lectures/shiny/shiny_assessments.R:
--------------------------------------------------------------------------------
1 | # Here are the assessment templates for each of the assessments we will perform today
2 |
3 | # Assessment 1
4 | library(shiny)
5 | ui <- fluidPage(
6 | # Put code here for a slider with n as the id,
7 | # 'This is a slider' as the label, and {value, min, max} = {1, 0, 100}
8 | )
9 | server <- function(input,output){ }
10 | shinyApp(ui=ui,server=server)
11 |
12 |
13 | # Assessment 2
14 | library(shiny)
15 | ui <- fluidPage(
16 | # Place code here
17 | # hint you need a 'numericInput' with labelId = n
18 | # hint you also need to 'plot' the 'Output'
19 | )
20 | server <- function(input,output){
21 |
22 | output$hist <- renderPlot({
23 | hist(rnorm(input$n))
24 | })
25 |
26 | }
27 | shinyApp(ui=ui,server=server)
28 |
29 | # Assessment 3
30 | library(shiny)
31 | ui <- fluidPage(
32 | plotOutput("plot")
33 | )
34 | server <- function(input,output){
35 | # place code here to render a plot of the iris dataset
36 | # hint: plot(iris) and ?renderPlot
37 | }
38 |
39 | shinyApp(ui=ui,server=server)
40 |
41 | # Assessment 4
42 | library(shiny)
43 | ui <- fluidPage(
44 | selectInput("dataset","Which dataset?",choices=c("iris","cars")),
45 | plotOutput("plot")
46 | )
47 | server <- function(input,output){
48 | # place code here to create a 'dat' reactive function
49 | output$plot <- renderPlot({
50 | plot(dat())
51 | })
52 | }
53 | shinyApp(ui=ui,server=server)
54 |
55 |
56 | # Assessment 5
57 | library(shiny)
58 | ui <- fluidPage(
59 | tabsetPanel(
60 | # put code here
61 | )
62 | )
63 | server <- function(input,output){
64 |
65 | output$hist <- renderPlot({
66 | hist(rnorm(input$n))
67 | })
68 | }
69 |
70 | shinyApp(ui=ui,server=server)
71 |
--------------------------------------------------------------------------------
/lectures/wrangling/combining-tables.Rmd:
--------------------------------------------------------------------------------
1 | ## Combining tables
2 |
3 | ```{r, message=FALSE, warning=FALSE}
4 | library(tidyverse)
5 | library(ggrepel)
6 | library(dslabs)
7 | ds_theme_set()
8 | ```
9 |
10 | The information we need for a given analysis may not be in just one table. For example, when forecasting elections we used the function `left_join` to combine the information from two tables. Here we use a simpler example to illustrate the general challenge of combining tables.
11 |
12 | Suppose we want to explore the relationship between the population size of US states, which we have in this table:
13 |
14 | ```{r}
15 | data(murders)
16 | head(murders)
17 | ```
18 |
19 | and electoral votes, which we have in this one:
20 |
21 | ```{r}
22 | data(polls_us_election_2016)
23 | head(results_us_election_2016)
24 | ```
25 |
26 | Notice that just joining these two tables together will not work since the order of the states is not quite the same:
27 |
28 | ```{r}
29 | identical(results_us_election_2016$state, murders$state)
30 | ```
31 |
32 | The _join_ functions, described below, are designed to handle this challenge.
33 |
34 | ### Joins
35 |
36 | The `join` functions in the `dplyr` package, which are based on SQL joins, make sure that the tables are combined so that matching rows are together.
37 | The general idea is that one needs to identify one or more columns that will serve to match the two tables. Then a new table with the combined information is returned. Note what happens if we join the two tables above by state using `left_join`:
38 |
39 | ```{r}
40 | tab <- left_join(murders, results_us_election_2016, by = "state")
41 | tab %>% select(state, population, electoral_votes) %>% head()
42 | ```
43 |
44 | The data has been successfully joined and we can now, for example, make a plot to explore the relationship:
45 |
46 | ```{r}
47 | tab %>% ggplot(aes(population/10^6, electoral_votes, label = abb)) +
48 | geom_point() +
49 | geom_text_repel() +
50 | scale_x_continuous(trans = "log2") +
51 | scale_y_continuous(trans = "log2") +
52 | geom_smooth(method = "lm", se = FALSE)
53 | ```
54 |
55 | We see the relationship is close to linear, with about 2 electoral votes for every million persons, but with smaller states getting higher ratios.
56 |
57 |
58 | In practice, it is not always the case that each row in one table has a matching row in the other. For this reason we have several different ways to join. To illustrate this challenge, we take subsets of the tables above
59 |
60 | ```{r}
61 | tab1 <- slice(murders, 1:6) %>% select(state, population)
62 | tab1
63 | ```
64 |
65 | so that we no longer have the same states in the two tables:
66 | ```{r}
67 | tab2 <- slice(results_us_election_2016, c(1:3, 5, 7:8)) %>% select(state, electoral_votes)
68 | tab2
69 | ```
70 |
71 | We will use these two tables as examples.
72 |
73 | #### Left join
74 |
75 | Suppose we want a table like `tab1` but adding electoral votes to whatever states we have available. For this we use left join with `tab1` as the first argument.
76 |
77 | ```{r}
78 | left_join(tab1, tab2)
79 | ```
80 |
81 | Note that `NA`s are added to the two states not appearing in `tab2`. Also note that this function, as well as all the other joins, can receive the first argument through the pipe:
82 |
83 | ```{r}
84 | tab1 %>% left_join(tab2)
85 | ```
86 |
87 |
88 | #### Right join
89 |
90 | If instead of a table like `tab1` we want one like `tab2` we can use `right_join`:
91 |
92 | ```{r}
93 | tab1 %>% right_join(tab2)
94 | ```
95 |
96 | Notice that now the NAs are in the column coming from `tab1`.
97 |
98 | #### inner join
99 |
100 | If we want to keep only the rows that have information in both tables, we use `inner_join`. You can think of this as an intersection:
101 |
102 | ```{r}
103 | inner_join(tab1, tab2)
104 | ```
105 |
106 | #### full join
107 |
108 | And if we want to keep all the rows, and fill the missing parts with NAs, we can use a full join. You can think of this as a union:
109 |
110 | ```{r}
111 | full_join(tab1, tab2)
112 | ```
113 |
114 | #### semi join
115 |
116 | The `semi_join` function lets us keep the part of the first table for which we have information in the second. It does not add the columns of the second:
117 |
118 | ```{r}
119 | semi_join(tab1, tab2)
120 | ```
121 |
122 |
123 | #### anti join
124 |
125 | The function `anti_join` is the opposite of `semi_join`. It keeps the elements of the first table for which there is no information in the second:
126 |
127 | ```{r}
128 | anti_join(tab1, tab2)
129 | ```
130 |
131 | The cheat sheet contains a diagram that summarizes the above joins.
132 |
133 | ### Binding
134 |
135 | Although we have yet to use it in this book, another common way in which datasets are combined is by _binding_ them. Unlike the join functions, the binding functions do not try to match by a variable, but rather just combine datasets. If the datasets don't match by the appropriate dimensions, we obtain an error.
136 |
137 | #### Columns
138 |
139 | The `dplyr` function `bind_cols` binds two objects by making them columns in a tibble. For example, if we quickly want to make a data frame consisting of numbers, we can use:
140 |
141 | ```{r}
142 | bind_cols(a = 1:3, b = 4:6)
143 | ```
144 |
145 | This function requires that we assign names to the columns. Here we chose `a` and `b`.
146 |
147 | Note there is an R-base function, `cbind`, that performs the same function but creates objects other than tibbles.
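
A quick check illustrates the difference (a small sketch, not from the original lecture):

```{r}
class(cbind(a = 1:3, b = 4:6))      # a matrix
class(bind_cols(a = 1:3, b = 4:6))  # a tibble (tbl_df)
```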
148 |
149 | `bind_cols` can also bind data frames. For example here we break up the `tab` data frame and then bind them back together:
150 |
151 | ```{r}
152 | tab1 <- tab[, 1:3]
153 | tab2 <- tab[, 4:6]
154 | tab3 <- tab[, 7:9]
155 | new_tab <- bind_cols(tab1, tab2, tab3)
156 | head(new_tab)
157 | ```
158 |
159 |
160 | #### rows
161 |
162 | The `bind_rows` function is similar but binds rows instead of columns:
163 |
164 | ```{r}
165 | tab1 <- tab[1:2,]
166 | tab2 <- tab[3:4,]
167 | bind_rows(tab1, tab2)
168 | ```
169 |
170 | This is based on an R-base function `rbind`.
171 |
172 | ### Set Operators
173 |
174 | Another set of commands useful for combining datasets are the set operators. When applied to vectors, these behave as their names suggest. However, if the `tidyverse`, or more specifically `dplyr`, is loaded, these functions can be used on data frames as opposed to just on vectors.
175 |
176 | #### Intersect
177 |
178 | You can take intersections of vectors:
179 |
180 | ```{r}
181 | intersect(1:10, 6:15)
182 | ```
183 |
184 | ```{r}
185 | intersect(c("a","b","c"), c("b","c","d"))
186 | ```
187 |
188 | But with `dplyr` loaded we can also do this for tables having the same column names:
189 |
190 | ```{r}
191 | tab1 <- tab[1:5,]
192 | tab2 <- tab[3:7,]
193 | intersect(tab1, tab2)
194 | ```
195 |
196 |
197 | #### Union
198 |
199 | Similarly _union_ takes the union:
200 |
201 | ```{r}
202 | union(1:10, 6:15)
203 | ```
204 |
205 | ```{r}
206 | union(c("a","b","c"), c("b","c","d"))
207 | ```
208 |
209 | But with `dplyr` loaded we can also do this for tables having the same column names:
210 |
211 | ```{r}
212 | tab1 <- tab[1:5,]
213 | tab2 <- tab[3:7,]
214 | union(tab1, tab2)
215 | ```
216 |
217 |
218 | #### Set differrence
219 |
220 | The set difference between a first and second argument can be obtained with `setdiff`. Unlike `intersect` and `union`, this function is not symmetric:
221 |
222 |
223 | ```{r}
224 | setdiff(1:10, 6:15)
225 | setdiff(6:15, 1:10)
226 | ```
227 |
228 | As with the others, we can apply it to data frames:
229 | ```{r}
230 | tab1 <- tab[1:5,]
231 | tab2 <- tab[3:7,]
232 | setdiff(tab1, tab2)
233 | ```
234 |
235 | #### `setequal`
236 |
237 | Finally, the function `setequal` tells us if two sets are the same, regardless of order. So this returns `FALSE`:
238 |
239 | ```{r}
240 | setequal(1:5, 1:6)
241 | ```
242 |
243 | but this returns `TRUE`:
244 |
245 | ```{r}
246 | setequal(1:5, 5:1)
247 | ```
248 |
249 | When applied to data frames that are not equal regardless of order, it provides a useful message letting us know how the sets are different:
250 |
251 | ```{r}
252 | setequal(tab1, tab2)
253 | ```
254 |
255 |
--------------------------------------------------------------------------------
/lectures/wrangling/dates-and-times.Rmd:
--------------------------------------------------------------------------------
1 | ## Parsing Dates and Times
2 |
3 | We have described three main types of vectors: numeric, character, and logical. In data science projects we very often encounter variables that are dates. Although we can represent a date with a string, for example `November 2, 2017`, once we pick a reference day, referred to as the _epoch_, dates can be converted to numbers. Computer languages usually use January 1, 1970 as the epoch. So November 2, 2017 is day 17,472.
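
We can check this directly in R:

```{r}
# days since the epoch (January 1, 1970)
as.numeric(as.Date("2017-11-02"))
```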
4 |
5 | Now how should we represent dates and times when analyzing data in R? We could just use days since the epoch, but then it is almost impossible to interpret. If I tell you it's November 2, 2017, you know what this means immediately. If I tell you it's day 17,472, you will be quite confused. Similar problems arise with times, and in that case it gets even more complicated due to time zones.
6 |
7 | For this reason, R defines a data type just for dates and times. We saw an example in the polls data:
8 |
9 | ```{r}
10 | library(dslabs)
11 | data("polls_us_election_2016")
12 | polls_us_election_2016$startdate %>% head
13 | ```
14 |
15 | These look like strings. But they are not:
16 |
17 | ```{r}
18 | class(polls_us_election_2016$startdate)
19 | ```
20 |
21 | Look at what happens when we convert them to numbers:
22 |
23 | ```{r}
24 | as.numeric(polls_us_election_2016$startdate) %>% head
25 | ```
26 |
27 | It turns them into days since the epoch.
28 |
29 | Plotting functions, such as those in ggplot, are aware of dates. This means that, for example, a scatter plot can use the numeric representation to decide on the position of the point, but include the string in the labels:
30 |
31 | ```{r}
32 | polls_us_election_2016 %>% filter(pollster == "Ipsos" & state =="U.S.") %>%
33 | ggplot(aes(startdate, rawpoll_trump)) +
34 | geom_line()
35 | ```
36 |
37 | Note in particular that the month names are displayed. The tidyverse includes functionality for dealing with dates through the `lubridate` package.
38 |
39 | ```{r}
40 | library(lubridate)
41 | ```
42 |
43 | We will take a random sample of dates to show some of the useful things one can do:
44 | ```{r}
45 | set.seed(2)
46 | dates <- sample(polls_us_election_2016$startdate, 10) %>% sort
47 | dates
48 | ```
49 |
50 | The functions `year`, `month` and `day` extract those values:
51 |
52 | ```{r}
53 | data.frame(date = dates,
54 | month = month(dates),
55 | day = day(dates),
56 | year = year(dates))
57 | ```
58 |
59 | We can also extract the month labels:
60 |
61 | ```{r}
62 | month(dates, label = TRUE)
63 | ```
64 |
65 |
66 | Another useful set of functions are the _parsers_, which convert strings into dates.
67 |
68 | ```{r}
69 | x <- c(20090101, "2009-01-02", "2009 01 03", "2009-1-4",
70 | "2009-1, 5", "Created on 2009 1 6", "200901 !!! 07")
71 | ymd(x)
72 | ```
73 |
74 |
75 | A further complication comes from the fact that dates often come in formats in which the order of year, month, and day differs. The preferred format, called ISO 8601, is to show the year (with all four digits), then the month (two digits), and then the day. Specifically, we use YYYY-MM-DD, so that if we sort the strings they will be ordered by date. You can see the function `ymd` returns them in this format.
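
As a quick illustration of why this format is convenient, sorting the strings sorts the dates:

```{r}
# alphabetical order equals chronological order for ISO 8601 strings
sort(c("2016-11-08", "2008-11-04", "2012-11-06"))
```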
76 |
77 | What if you encounter dates such as "09/01/02"? This could be September 1, 2002 or January 2, 2009 or January 9, 2002.
78 | In these cases, examining the entire vector of dates will help you determine the format by process of elimination. Once you know, you can use one of the many parsers provided by lubridate.
79 |
80 | For example if the string is
81 |
82 | ```{r}
83 | x <- "09/01/02"
84 | ```
85 |
86 | The `ymd` function assumes the first entry is the year, the second the month, and the third the day, so it converts it to:
87 |
88 | ```{r}
89 | ymd(x)
90 | ```
91 |
92 | The `mdy` function assumes the first entry is the month then the day then the year:
93 |
94 | ```{r}
95 | mdy(x)
96 | ```
97 |
98 | Lubridate provides a function for every possibility:
99 | ```{r}
100 | ydm(x)
101 | myd(x)
102 | dmy(x)
103 | dym(x)
104 | ```
105 |
106 |
107 |
108 | Lubridate is also useful for dealing with times. In R, you can get the current time by typing `Sys.time()`. Lubridate provides a slightly more advanced function, `now`, that permits you to define the time zone:
109 |
110 | ```{r}
111 | now()
112 | now("GMT")
113 | ```
114 |
115 | You can see all the available time zones with the `OlsonNames()` function.
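
For instance (a quick sketch; the chosen time zone is just an example):

```{r}
head(OlsonNames())
now("America/New_York")
```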
116 |
117 | Lubridate also has functions to extract hours, minutes, and seconds:
118 |
119 | ```{r}
120 | now() %>% hour()
121 | now() %>% minute()
122 | now() %>% second()
123 | ```
124 |
125 | as well as a function to parse strings into times:
126 |
127 | ```{r}
128 | x <- c("12:34:56")
129 | hms(x)
130 | ```
131 |
132 | as well as parsers for time objects that include dates:
133 |
134 | ```{r}
135 | x <- "Nov/2/2012 12:34:56"
136 | mdy_hms(x)
137 | ```
138 |
139 |
140 |
141 |
142 |
143 |
144 |
145 |
146 |
147 |
148 |
149 |
150 |
151 |
152 |
153 |
154 |
155 |
156 |
--------------------------------------------------------------------------------
/lectures/wrangling/intro-to-wrangling.Rmd:
--------------------------------------------------------------------------------
1 | # Data Wrangling
2 |
3 | The datasets used in this book have been made available to you as R objects, specifically as data frames. The US murders data, the reported heights data, the Gapminder data, and the poll data are all examples. These datasets come included in the `dslabs` package and we loaded them using the `data` function. Furthermore, we have made the data available in what is referred to as `tidy` form, a concept we define later in this chapter. The _tidyverse_ packages and functions assume that the data is `tidy` and this assumption is a big part of the reason these packages work so well together.
4 |
5 | However, very rarely in a data science project is data easily available as part of a package. We did quite a bit of work "behind the scenes" to get the original raw data into the _tidy_ tables you worked with. Much more typical is for the data to be in a file, a database, or extracted from a document including web pages, tweets, or PDFs. In these cases, the first step is to import the data into R and, when using the _tidyverse_, tidy the data. The first step in the data analysis process usually involves several, often complicated, steps to convert data from its raw form to the _tidy_ form that greatly facilitates the rest of the analysis. We refer to this process as `data wrangling`.
6 |
7 | Here we cover several common steps of the data wrangling process including importing data into R from files, tidying data, string processing, html parsing, working with dates and times, and text mining. Rarely are all these wrangling steps necessary in a single analysis, but data scientists will likely face them all at some point. Some of the examples we use to demonstrate data wrangling techniques are based on the work we did to convert raw data into the tidy datasets provided by the `dslabs` package and used in the book as examples.
8 |
9 |
10 |
--------------------------------------------------------------------------------
/lectures/wrangling/reshaping-data.Rmd:
--------------------------------------------------------------------------------
1 | ## Reshaping data
2 |
3 | ```{r, echo=FALSE, message=FALSE}
4 | library(tidyverse)
5 | path <- system.file("extdata", package="dslabs")
6 | filename <- file.path(path, "fertility-two-countries-example.csv")
7 | wide_data <- read_csv(filename)
8 | ```
9 |
10 | As we have seen, having data in `tidy` format is what makes the `tidyverse` flow. After the first step in the data analysis process, importing data,
11 | a common next step is to reshape the data into a form that facilitates the rest of the analysis. The `tidyr` package includes several functions that are useful for tidying data.
12 |
13 | ### gather
14 |
15 | One of the most used functions in this package is `gather`, which converts wide data into tidy data.
16 |
17 | In the third argument of the `gather` function you specify the columns that will be _gathered_. The default is to gather all columns, so in most cases we have to specify them. Here we want columns `1960`, `1961` up to `2015`. The first argument sets the column/variable name that will hold the variable currently kept in the wide data column names. In our case it makes sense to name it `year`, but we can name it anything. The second argument sets the column/variable name that will hold the values in the column cells. In this case we call it `fertility`, since this is what is stored in this file. Note that nowhere in this file does it tell us this is fertility data. Instead, this information was kept in the file name.
18 |
19 | The gathering code looks like this:
20 |
21 | ```{r}
22 | new_tidy_data <- wide_data %>%
23 | gather(my_year, my_fertility, `1960`:`2015`)
24 | ```
25 |
26 | We can see that the data have been converted to tidy format with columns `my_year` and `my_fertility`:
27 |
28 | ```{r}
29 | head(new_tidy_data)
30 | ```
31 |
32 | However, each year resulted in two rows since we have two countries and this column was not gathered.
33 |
34 | A somewhat quicker way to write this code is to specify which columns will **not** be gathered, rather than all the columns that will be gathered:
35 |
36 | ```{r}
37 | new_tidy_data <- wide_data %>%
38 | gather(year, fertility, -country)
39 | ```
40 |
41 | This data looks a lot like the original `tidy_data` we used. There is just one minor difference. Can you spot it? Look at the data type of the year column:
42 |
43 | ```{r}
44 | library(dslabs)
45 | data("gapminder")
46 | tidy_data <- gapminder %>%
47 | filter(country %in% c("South Korea", "Germany")) %>%
48 | select(country, year, fertility)
49 |
50 | class(tidy_data$year)
51 | class(new_tidy_data$year)
52 | ```
53 |
54 | The `gather` function assumes that column names are characters. So we need a bit more wrangling before we are ready to make a plot. We need to convert the column to numbers. The `gather` function has an argument for that, the `convert` argument:
55 |
56 | ```{r}
57 | new_tidy_data <- wide_data %>%
58 | gather(year, fertility, -country, convert = TRUE)
59 | class(new_tidy_data$year)
60 | ```
61 |
62 | We could have also used `mutate` and `as.numeric`.
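
For completeness, here is a sketch of that alternative, equivalent to using `convert = TRUE`:

```{r, eval=FALSE}
# convert the year column after gathering
new_tidy_data <- wide_data %>%
  gather(year, fertility, -country) %>%
  mutate(year = as.numeric(year))
```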
63 |
64 | Now that the data is tidy, we can use the same ggplot code as before:
65 |
66 | ```{r}
67 | new_tidy_data %>% ggplot(aes(year, fertility, color = country)) +
68 | geom_point()
69 | ```
70 |
71 | ### spread
72 |
73 | As we will see in later examples, it is sometimes useful for data wrangling purposes to convert tidy data into wide data. We often use this as an intermediate step in tidying up data. The `spread` function is basically the inverse of `gather`. The first argument tells `spread` which variable will be used as the column names. The second argument specifies which variable to use to fill out the cells:
74 |
75 | ```{r}
76 | new_wide_data <- new_tidy_data %>% spread(year, fertility)
77 | select(new_wide_data, country, `1960`:`1967`)
78 | ```
79 |
80 |
81 | The diagrams in the cheat sheet can help remind you how these two functions work.
82 |
83 |
84 | ### `separate`
85 |
86 | The data wrangling shown above was simple compared to what is usually required. In our example spreadsheet files we include an example that is slightly more complicated. It includes two variables: life expectancy as well as fertility. However, the way it is stored is not tidy and, as we will explain, not optimal.
87 |
88 | ```{r}
89 | path <- system.file("extdata", package = "dslabs")
90 | filename <- file.path(path, "life-expectancy-and-fertility-two-countries-example.csv")
91 |
92 | raw_dat <- read_csv(filename)
93 | select(raw_dat, 1:5)
94 | ```
95 |
96 | First note that the data is in wide format. Second, note that now there are values for two variables, with the column names encoding which column represents which variable. We can start the data wrangling with the `gather` function, but we should no longer use the column name `year` for the new column since it also contains the variable type. We will call it `key`, the default, for now:
97 |
98 | ```{r}
99 | dat <- raw_dat %>% gather(key, value, -country)
100 | head(dat)
101 | ```
102 |
103 | The result is not exactly what we refer to as tidy since each observation is associated with two rows, not one. We want to have the values from the two variables, fertility and life expectancy, in two separate columns. The first challenge to achieve this is to separate the `key` column into the year and the variable type. Note that the entries in this column separate the year from the variable name with an underscore:
104 |
105 | ```{r}
106 | dat$key[1:5]
107 | ```
108 |
109 | Encoding multiple variables in a column name is such a common problem that the `tidyr` package includes a function, `separate`, to split such columns into two or more. Apart from the data, the `separate` function takes three arguments: the name of the column to be separated, the names to be used for the new columns, and the character that separates the variables. So a first attempt at this is:
110 |
111 | ```{r, eval=FALSE}
112 | dat %>% separate(key, c("year", "variable_name"), "_")
113 | ```
114 |
115 | Because "_" is the default separator we actually can simply write:
116 |
117 | ```{r}
118 | dat %>% separate(key, c("year", "variable_name"))
119 | ```
120 |
121 | However, we run into a problem. Note that we receive the warning `Too many values at 112 locations:` and that the `life_expectancy` variable is truncated to `life`. This is because the `_` is used to separate `life` and `expectancy`, not just the year and variable name. We could add a third column to catch this and let the `separate` function know which column to _fill in_ with missing values, `NA`, when there is no third value. Here we tell it to fill the column on the right:
122 |
123 | ```{r}
124 | dat %>% separate(key,
125 | c("year", "first_variable_name", "second_variable_name"),
126 | fill = "right")
127 | ```
128 |
129 | However, if we read the `separate` help file we find that a better approach is to merge the last two variables when there is an extra separation:
130 |
131 | ```{r}
132 | dat %>% separate(key, c("year", "variable_name"), sep = "_", extra = "merge")
133 | ```
134 |
135 | This achieves the separation we wanted. However, we are not done yet. We need to create a column for each variable. As we learned, the `spread` function can do this:
136 |
137 | ```{r}
138 | dat %>% separate(key, c("year", "variable_name"), sep = "_", extra = "merge") %>%
139 | spread(variable_name, value)
140 | ```
141 |
142 | The data is now in tidy format, with one row for each observation and three variables: year, fertility, and life expectancy.
143 |
144 | ### `unite`
145 |
146 | It is sometimes useful to do the inverse of `separate`: unite two columns into one. So, although this is *not* an optimal approach, had we used this command to separate:
147 |
148 | ```{r}
149 | dat %>%
150 | separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right")
151 | ```
152 |
153 | We can achieve the same final result by uniting the second and third columns like this:
154 |
155 | ```{r}
156 | dat %>%
157 | separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>%
158 | unite(variable_name, first_variable_name, second_variable_name, sep="_")
159 | ```
160 |
161 | Then spreading the columns:
162 |
163 | ```{r}
164 | dat %>%
165 | separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>%
166 | unite(variable_name, first_variable_name, second_variable_name, sep="_") %>%
167 | spread(variable_name, value) %>%
168 |   rename(fertility = fertility_NA)
169 | ```
170 |
171 |
172 |
173 |
174 |
175 |
--------------------------------------------------------------------------------
/lectures/wrangling/tidy-data.Rmd:
--------------------------------------------------------------------------------
1 | ## Tidy data
2 |
3 | ```{r, message=FALSE}
4 | library(tidyverse)
5 | library(dslabs)
6 | ds_theme_set()
7 | ```
8 |
9 | To help define tidy data we go back to an example we showed in the data visualization chapter in which we plotted fertility data across time for two countries: South Korea and Germany. To make the plot we used this subset of the data
10 |
11 | ```{r}
12 | data("gapminder")
13 | tidy_data <- gapminder %>%
14 | filter(country %in% c("South Korea", "Germany")) %>%
15 | select(country, year, fertility)
16 | head(tidy_data)
17 | ```
18 |
19 | With the data in this format we could quickly make the desired plot:
20 |
21 | ```{r}
22 | tidy_data %>%
23 | ggplot(aes(year, fertility, color = country)) +
24 | geom_point()
25 | ```
26 |
27 | One reason this code works seamlessly is that the data is _tidy_: each point is represented in a row. This brings us to the definition of _tidy data_: each row represents one observation, and the columns represent the different variables that we have data on for those observations.
28 |
29 | If we go back to the original data provided by Gapminder, we see that it does not start out _tidy_. We include an example file with the data shown in this graph, mimicking the way it was originally saved in a spreadsheet:
30 |
31 | ```{r}
32 | path <- system.file("extdata", package="dslabs")
33 | filename <- file.path(path, "fertility-two-countries-example.csv")
34 | wide_data <- read_csv(filename)
35 | ```
36 |
37 | The object `wide_data` includes the same information as the object `tidy_data` except it is in a different format: a `wide` format. Here are the first nine columns:
38 |
39 | ```{r}
40 | select(wide_data, country, `1960`:`1967`)
41 | ```
42 |
43 | There are two important differences between the wide and tidy formats. First, in the wide format, each row includes several observations. Second, one of the variables, year, is stored in the header.
44 |
45 | The `ggplot` code we introduced earlier no longer works here; for one thing, there is no `year` variable. So to use the `tidyverse`, we need to wrangle this data into `tidy` format, as sketched below.
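
One way to do that wrangling, shown here only as a brief preview of the reshaping tools covered in the reshaping lecture, is with `gather`, using `convert = TRUE` so the years stored in the header become integers rather than characters:

```{r}
new_tidy_data <- wide_data %>%
  gather(year, fertility, -country, convert = TRUE)
head(new_tidy_data)
```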
46 |
47 |
--------------------------------------------------------------------------------
/lectures/wrangling/web-scraping.Rmd:
--------------------------------------------------------------------------------
1 | ## Web Scraping
2 |
3 | The data we need to answer a question is not always in a spreadsheet, ready for us to read. For example, the US murders dataset we used in the R Basics chapter originally comes from this Wikipedia page: [https://en.wikipedia.org/wiki/Murder_in_the_United_States_by_state](https://en.wikipedia.org/wiki/Murder_in_the_United_States_by_state). You can see the data table when you visit the web page.
4 |
5 | But, unfortunately, there is no link to a data file. To create the data frame we loaded using `data(murders)`, or the csv file made available through `dslabs`, we had to do some _web scraping_.
6 |
7 | _Web scraping_, or _web harvesting_, are the terms we use to describe the process of extracting data from a website. The reason we can do this is that the information a browser uses to render a web page is received as **text** from a server. The text is computer code written in hypertext markup language (HTML). To see the code for a web page, visit the page in your browser and use the _View Source_ tool.
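
To convince yourself that a page really is just text, you can read a few lines of it directly into R. This is only a quick illustration (not evaluated here, since the output is raw HTML):

```{r, eval=FALSE}
readLines("https://en.wikipedia.org/wiki/Murder_in_the_United_States_by_state", n = 10)
```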
8 |
9 |
10 | Because this code is accessible, we can download the HTML file, import it into R, and then write programs to extract the information we need from the page. Once we look at HTML code, this might seem like a daunting task, but we will show you some convenient tools to facilitate the process. To get an idea of how it works, here are a few lines of the code from the Wikipedia page that provides the US murders data:
11 |
12 | ```{r, eval = FALSE}
13 | <p>The 2015 U.S. population total was 320.9 million. The 2015 U.S. overall murder rate per 100,000 inhabitants was 4.89.</p>
14 | <table class="wikitable sortable">
15 | <tr>
16 | <th>State</th>
17 | <th>Population<br />
18 | <small>(total inhabitants)</small><br />
19 | <small>(2015)</small> <sup><a href="#cite_note-1">[1]</a></sup></th>
20 | <th>Murders and Nonnegligent
21 | <p>Manslaughter<br />
22 | <small>(total deaths)</small><br />
23 | <small>(2015)</small> <sup><a href="#cite_note-2">[2]</a></sup></p>
24 | </th>
25 | <th>Murder and Nonnegligent
26 | <p>Manslaughter Rate<br />
27 | <small>(per 100,000 inhabitants)</small><br />
28 | <small>(2015)</small></p>
29 | </th>
30 | </tr>
31 | <tr>
32 | <td><a href="/wiki/Alabama" title="Alabama">Alabama</a></td>
33 | <td>4,853,875</td>
34 | <td>348</td>
35 | <td>7.2</td>
36 | </tr>
37 | <tr>
38 | <td><a href="/wiki/Alaska" title="Alaska">Alaska</a></td>
39 | <td>737,709</td>
40 | <td>59</td>
41 | <td>8.0</td>
42 | </tr>
43 | 
44 | 
45 | ```
46 |
47 | You can actually see the data! We can also see a pattern in how it is stored. If you know HTML, you can write programs that leverage knowledge of these patterns to extract what you want. We also take advantage of a language widely used to make web pages look "pretty" called Cascading Style Sheets (CSS). We say more about this in the CSS Selectors section below.
48 |
49 | Although we provide tools that make it possible to scrape data without knowing HTML,
50 | it is quite useful for data scientists to learn some HTML and CSS. Not only does this improve your scraping skills, but it might come in handy if you are creating a web page to showcase your work. There are plenty of online courses and tutorials for learning these; two examples are [Codecademy](https://www.codecademy.com/learn/learn-html) and [W3Schools](https://www.w3schools.com/).
51 |
52 | ### The `rvest` package
53 |
54 | The `tidyverse` provides a web harvesting package called `rvest`. The first step in using this package is to import the web page into R. The package makes this quite simple:
55 |
56 | ```{r}
57 | library(rvest)
58 | url <- "https://en.wikipedia.org/wiki/Murder_in_the_United_States_by_state"
59 | h <- read_html(url)
60 | ```
61 |
62 | Note that the entire Murders in the US Wikipedia web page is now contained in `h`. The class of this object is:
63 |
64 | ```{r}
65 | class(h)
66 | ```
67 |
68 | The `rvest` package is actually more general: it handles XML documents. XML is a general markup language (that's what the ML stands for) that can be used to represent any kind of data. HTML is a specific type of XML, developed specifically for representing web pages. Here we focus on HTML documents.
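
As a small illustration of this generality, the `xml2` package that `rvest` builds on can parse arbitrary XML. The toy XML string below is made up for this example, not part of the lecture data:

```{r}
library(xml2)
x <- read_xml("<states><state name='Alabama' rate='7.2'/><state name='Alaska' rate='8.0'/></states>")
# extract the rate attribute from each child node
xml_attr(xml_children(x), "rate")
```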
69 |
70 | Now, how do we extract the table from the object `h`? If we print `h`, we don't really see much:
71 |
72 | ```{r}
73 | h
74 | ```
75 |
76 | We know that the information is stored in an HTML table: you can see this in the `<table>` line of the HTML code shown above. The different parts of an HTML document, defined with a tag between `<` and `>`, are referred to as _nodes_. The `rvest` package includes functions to extract nodes from an HTML document: `html_nodes` extracts all nodes of a given type and `html_node` extracts the first one. To extract the first table we use:
77 |
78 | ```{r}
79 | tab <- h %>% html_node("table")
80 | ```
81 |
82 | Now, instead of the entire web page, we just have the HTML code for that table:
83 |
84 | ```{r}
85 | tab
86 | ```
87 |
88 |
89 | We are not quite there yet because this is clearly not a tidy dataset, not even a data frame. In the code above you can definitely see a pattern and writing code to extract just the data is very doable. In fact, `rvest` includes a function just for converting HTML tables into data frames:
90 |
91 |
92 | ```{r}
93 | tab <- tab %>% html_table
94 | class(tab)
95 | ```
96 |
97 | We are now much closer to having a usable data table:
98 |
99 | ```{r}
100 | tab <- tab %>% setNames(c("state", "population", "total", "murder_rate"))
101 | head(tab)
102 | ```
103 |
104 | We still have some wrangling to do. For example, we need to remove the commas and turn the characters into numbers, as sketched below. Before continuing with this, though, we will learn a more general approach to extracting information from web sites.
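
Here is one possible way to do that cleanup, shown only as a sketch: `parse_number` from the `readr` package strips the commas and converts to numeric. The `as.character` calls are a guard in case `html_table` already converted some columns:

```{r}
library(dplyr)
library(readr)
tab %>%
  mutate(population = parse_number(as.character(population)),
         total = parse_number(as.character(total))) %>%
  head()
```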
105 |
106 |
107 | ### CSS Selectors
108 |
109 | The default look of a web page made with the most basic HTML is quite unattractive. The aesthetically pleasing pages we see today are made using CSS, which is used to add style to web pages. The fact that all the pages for a company share the same style usually results from their using the same CSS file. The general way these CSS files work is by defining how each element of a web page will look. Titles, headings, itemized lists, tables, and links, for example, each receive their own style, including font, color, size, and distance from the margin. To do this, CSS leverages patterns used to define these elements, referred to as _selectors_. An example of a pattern we used above is `table`, but there are many, many more.
110 |
111 | So if we want to grab data from a web page and we happen to know a selector that is unique to the part of the page containing it, we can use the `html_nodes` function. However, knowing which selector to use can be quite complicated. To demonstrate this, we will try to extract the recipe name, total preparation time, and list of ingredients from [this](http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe-1940609) Guacamole recipe. Looking at the code for this page, the task seems impossibly complex. However, a tool called SelectorGadget actually makes it feasible.
112 |
113 | [SelectorGadget](http://selectorgadget.com/) is a piece of software that allows you to interactively determine which CSS selector you need to extract specific components from a web page. If you plan on scraping data other than tables, we highly recommend you install it. A Chrome extension is available that lets you turn on the gadget; as you click through the page, it highlights parts and shows you the selector you need to extract them. There are various demos of how to do this, including ADD, ADD and ADD.
114 |
115 | For the Guacamole recipe page we already have done this and determined that we need the following selectors:
116 |
117 | ```{r}
118 | h <- read_html("http://www.foodnetwork.com/recipes/alton-brown/guacamole-recipe-1940609")
119 | recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
120 | prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
121 | ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
122 | ```
123 |
124 | You can see how complex the selectors are. In any case we are now ready to extract what we want and create a list:
125 |
126 | ```{r}
127 | guacamole <- list(recipe, prep_time, ingredients)
128 | guacamole
129 | ```
130 |
131 | Since recipe pages from this website follow this general layout, we can use this code to create a function that extracts the information:
132 |
133 | ```{r}
134 | get_recipe <- function(url){
135 | h <- read_html(url)
136 | recipe <- h %>% html_node(".o-AssetTitle__a-HeadlineText") %>% html_text()
137 | prep_time <- h %>% html_node(".o-RecipeInfo__a-Description--Total") %>% html_text()
138 | ingredients <- h %>% html_nodes(".o-Ingredients__a-ListItemText") %>% html_text()
139 | return(list(recipe = recipe, prep_time = prep_time, ingredients = ingredients))
140 | }
141 | ```
142 |
143 | and then use it on any of their webpages:
144 |
145 | ```{r}
146 | get_recipe("http://www.foodnetwork.com/recipes/food-network-kitchen/pancakes-recipe-1913844")
147 | ```
148 |
149 |
150 | There are several other powerful tools provided by `rvest`. For example, the functions `html_form`, `set_values`, and `submit_form` permit you to fill in and submit forms on a web page from R. This is a more advanced topic not covered in detail here, but a small sketch is shown below.
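
Here is what that might look like, purely as a sketch (not evaluated): the form index and the `q` field name are assumptions about the target page rather than something verified here.

```{r, eval=FALSE}
# Sketch only: filling in and submitting a form with rvest
session <- html_session("https://www.google.com")        # start a browsing session
form <- html_form(session)[[1]]                          # assume the first form is the search box
form <- set_values(form, q = "web scraping with rvest")  # assume the query field is named "q"
results <- submit_form(session, form)                    # submit and retrieve the resulting page
```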
151 |
152 |
153 |
154 |
155 |
156 |
157 |
158 |
159 |
160 |
161 |
162 |
163 |
164 |
165 |
--------------------------------------------------------------------------------