├── .gitignore ├── 00-Getting-started.Rmd ├── 01-Visualize.Rmd ├── 02-Transform.Rmd ├── 03-Tidy.Rmd ├── 04-Case-Study.Rmd ├── 05-Data-Types.Rmd ├── 06-Iterate.Rmd ├── 07-Model.Rmd ├── 08-Organize.Rmd ├── 99-Setup.md ├── README.md ├── data-science-in-the-tidyverse.Rproj ├── email-to-participants.md ├── resources ├── 01-setup-login.png ├── 02-setup-temp-project.png ├── 04-setup-rproj-file.png ├── 05-setup-open-project.png ├── 06-setup-inside-project.png ├── 07-setup-all-done.png └── bialik-fridaythe13th-2.png ├── slides ├── 00-Introduction.pdf ├── 01-Visualize.pdf ├── 02-Transform.pdf ├── 03-Tidy.pdf ├── 04-Case-Study.pdf ├── 05-Data-Types.pdf ├── 06-Iteration.pdf ├── 07-Model.pdf ├── 08-Organize.pdf └── 09-Wrapping-Up.pdf └── solutions ├── 01-Visualize-solutions.Rmd ├── 02-Transform-Solutions.Rmd ├── 03-Tidy-Solutions.Rmd ├── 04-Case-Study-Solutions.Rmd ├── 05-Data-Types-Solutions.Rmd ├── 06-Iterate-solutions.Rmd ├── 07-Model-Solutions.Rmd └── 08-Organize-Solutions.Rmd /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | *.html 6 | /keynotes 7 | -------------------------------------------------------------------------------- /00-Getting-started.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "R Notebook" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(tidyverse) 12 | ``` 13 | 14 | ## R notebooks 15 | 16 | This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 17 | 18 | R code goes in **code chunks**, denoted by three backticks. Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Crtl+Shift+Enter* (Windows) or *Cmd+Shift+Enter* (Mac). 19 | 20 | ```{r} 21 | ggplot(data = mpg) + 22 | geom_point(mapping = aes(x = displ, y = hwy)) 23 | ``` 24 | 25 | Add a new chunk by clicking the *Insert* button on the toolbar, then selecting *R* or by pressing *Ctrl+Alt+I* (Windows) or *Cmd+Option+I* (Mac). 26 | 27 | When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Ctrl+Shift+K* (Windows) or *Cmd+Shift+K* (Mac) to preview the HTML file). 28 | 29 | The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed. 30 | 31 | -------------------------------------------------------------------------------- /01-Visualize.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Visualization" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ## Setup 11 | 12 | The first chunk in an R Notebook is usually titled "setup," and by convention includes the R packages you want to load. Remember, in order to use an R package you have to run some `library()` code every session. Execute these lines of code to load the packages. 
13 | 14 | ```{r setup} 15 | library(ggplot2) 16 | library(fivethirtyeight) 17 | ``` 18 | 19 | ## Bechdel test data 20 | 21 | We're going to start by playing with data collected by the website FiveThirtyEight on movies and [the Bechdel test](https://en.wikipedia.org/wiki/Bechdel_test). 22 | 23 | To begin, let's just preview our data. There are a couple ways to do that. One is just to type the name of the data and execute it like a piece of code. 24 | 25 | ```{r} 26 | bechdel 27 | ``` 28 | 29 | Notice that you can page through to see more of the dataset. 30 | 31 | Sometimes, people prefer to see their data in a more spreadsheet-like format, and RStudio provides a way to do that. Go to the Console and type `View(bechdel)` to see the data preview. 32 | 33 | (An aside-- `View` is a special function. Since it makes something happen in the RStudio interface, it doesn't work properly in R Notebooks. Most R functions have names that start with lowercase letters, so the uppercase "V" is there to remind you of its special status.) 34 | 35 | 36 | 37 | ## Consider 38 | What relationship do you expect to see between movie budget (budget) and domestic gross(domgross)? 39 | 40 | ## Your Turn 1 41 | 42 | Run the code on the slide to make a graph. Pay strict attention to spelling, capitalization, and parentheses! 43 | 44 | ```{r} 45 | 46 | ``` 47 | 48 | ## Your Turn 2 49 | 50 | Add `color`, `size`, `alpha`, and `shape` aesthetics to your graph. Experiment. 51 | 52 | ```{r} 53 | ggplot(data = bechdel) + 54 | geom_point(mapping = aes(x = budget, y = domgross)) 55 | ``` 56 | 57 | ## Set vs map 58 | 59 | ```{r} 60 | ggplot(bechdel) + 61 | geom_point(mapping = aes(x = budget, y = domgross), color="blue") 62 | ``` 63 | 64 | ## Your Turn 3 65 | 66 | Replace this scatterplot with one that draws boxplots. Use the cheatsheet. Try your best guess. 67 | 68 | ```{r} 69 | ggplot(data = bechdel) + geom_point(aes(x = clean_test, y = budget)) 70 | ``` 71 | 72 | ## Your Turn 4 73 | 74 | Make a histogram of the `budget` variable from `bechdel`. 75 | 76 | ```{r} 77 | 78 | ``` 79 | 80 | ## Your Turn 5 81 | Try to find a better `binwidth` for `budget`. 82 | 83 | ```{r} 84 | 85 | ``` 86 | 87 | ## Your Turn 6 88 | 89 | Make a density plot of `budget` colored by `clean_test`. 90 | 91 | ```{r} 92 | 93 | ``` 94 | 95 | ## Your Turn 7 96 | 97 | Make a barchart of `clean_test` colored by `clean_test`. 98 | 99 | ```{r} 100 | 101 | ``` 102 | 103 | 104 | ## Your Turn 8 105 | 106 | Predict what this code will do. Then run it. 107 | 108 | ```{r} 109 | ggplot(data = bechdel) + 110 | geom_point(mapping = aes(x = budget, y = domgross)) + 111 | geom_smooth(mapping = aes(x = budget, y = domgross)) 112 | ``` 113 | 114 | ## global vs local 115 | 116 | ```{r} 117 | ggplot(data = bechdel, mapping = aes(x = budget, y = domgross)) + 118 | geom_point(mapping = aes(color = clean_test)) + 119 | geom_smooth() 120 | ``` 121 | 122 | ```{r} 123 | ggplot(data = bechdel, mapping = aes(x = budget, y = domgross)) + 124 | geom_point(mapping = aes(color = clean_test)) + 125 | geom_smooth(data = filter(bechdel, clean_test == "ok")) 126 | ``` 127 | 128 | 129 | 130 | ## Your Turn 131 | 132 | What does `getwd()` return? 133 | 134 | ```{r} 135 | 136 | ``` 137 | 138 | ## Your Turn 9 139 | 140 | Save the last plot and then locate it in the files pane. If you run your `ggsave()` code inside this notebook, the image will be saved in the same directory as your .Rmd file (likely, project -> code), but if you run `ggsave()` in the Console it will be in your working directory. 
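For reference, a minimal call looks something like this (the filename is just a placeholder — use whatever name you like):

```{r eval = FALSE}
# Saves the most recently displayed plot to the given file;
# the name and extension here are only an example
ggsave("my-plot.png")
```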
141 | 142 | ```{r} 143 | 144 | ``` 145 | 146 | *** 147 | 148 | # Take aways 149 | 150 | You can use this code template to make thousands of graphs with **ggplot2**. 151 | 152 | ```{r eval = FALSE} 153 | ggplot(data = ) + 154 | (mapping = aes()) 155 | ``` -------------------------------------------------------------------------------- /02-Transform.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Transform Data" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(dplyr) 12 | library(babynames) 13 | library(nycflights13) 14 | library(skimr) 15 | ``` 16 | 17 | ## Babynames 18 | 19 | ```{r} 20 | babynames 21 | skim(babynames) 22 | skim_with(integer = list(p25 = NULL, p75=NULL)) 23 | ``` 24 | 25 | 26 | ## Your Turn 1 27 | Run the skim_with() command, and then try skimming babynames again to see how the output is different 28 | ```{r} 29 | 30 | ``` 31 | 32 | ## Select 33 | 34 | ```{r} 35 | select(babynames, name, prop) 36 | ``` 37 | 38 | ## Your Turn 2 39 | 40 | Alter the code to select just the `n` column: 41 | 42 | ```{r} 43 | select(babynames, name, prop) 44 | ``` 45 | 46 | 47 | ## Consider 48 | 49 | Which of these is NOT a way to select the `name` and `n` columns together? 50 | 51 | ```{r} 52 | select(babynames, -c(year, sex, prop)) 53 | select(babynames, name:n) 54 | select(babynames, starts_with("n")) 55 | select(babynames, ends_with("n")) 56 | ``` 57 | 58 | ## Filter 59 | 60 | ```{r} 61 | filter(babynames, name == "Amelia") 62 | ``` 63 | 64 | ## Your Turn 3 65 | 66 | Show: 67 | 68 | * All of the names where prop is greater than or equal to 0.08 69 | * All of the children named "Sea" 70 | * All of the names that have a missing value for `n` 71 | 72 | ```{r} 73 | filter(babynames, is.na(n)) 74 | 75 | ``` 76 | 77 | ## Your Turn 4 78 | 79 | Use Boolean operators to alter the code below to return only the rows that contain: 80 | 81 | * Girls named Sea 82 | * Names that were used by exactly 5 or 6 children in 1880 83 | * Names that are one of Acura, Lexus, or Yugo 84 | 85 | ```{r} 86 | filter(babynames, name == "Sea" | name == "Anemone") 87 | ``` 88 | 89 | ## Arrange 90 | 91 | ```{r} 92 | arrange(babynames, n) 93 | ``` 94 | 95 | ## Your Turn 5 96 | 97 | Arrange babynames by `n`. Add `prop` as a second (tie breaking) variable to arrange on. Can you tell what the smallest value of `n` is? 98 | 99 | ```{r} 100 | 101 | ``` 102 | 103 | ## desc 104 | 105 | ```{r} 106 | arrange(babynames, desc(n)) 107 | ``` 108 | 109 | ## Your Turn 6 110 | 111 | Use `desc()` to find the names with the highest prop. 112 | Then, use `desc()` to find the names with the highest n. 113 | 114 | ```{r} 115 | 116 | ``` 117 | 118 | ## Steps and the pipe 119 | 120 | ```{r} 121 | babynames %>% 122 | filter(year == 2015, sex == "M") %>% 123 | select(name, n) %>% 124 | arrange(desc(n)) 125 | ``` 126 | 127 | ## Your Turn 7 128 | 129 | Use `%>%` to write a sequence of functions that: 130 | 131 | 1. Filter babynames to just the girls that were born in 2015 132 | 2. Select the `name` and `n` columns 133 | 3. Arrange the results so that the most popular names are near the top. 134 | 135 | ```{r} 136 | 137 | ``` 138 | 139 | ## Your Turn 8 140 | 141 | 1. Trim `babynames` to just the rows that contain your `name` and your `sex` 142 | 2. Trim the result to just the columns that will appear in your graph (not strictly necessary, but useful practice) 143 | 3. 
Plot the results as a line graph with `year` on the x axis and `prop` on the y axis 144 | 145 | ```{r} 146 | 147 | ``` 148 | 149 | ## Your Turn 9 150 | 151 | Use summarise() to compute three statistics about the data: 152 | 153 | 1. The first (minimum) year in the dataset 154 | 2. The last (maximum) year in the dataset 155 | 3. The total number of children represented in the data 156 | 157 | ```{r} 158 | 159 | ``` 160 | 161 | ## Your Turn 10 162 | 163 | Extract the rows where `name == "Khaleesi"`. Then use `summarise()` and a summary functions to find: 164 | 165 | 1. The total number of children named Khaleesi 166 | 2. The first year Khaleesi appeared in the data 167 | 168 | ```{r} 169 | 170 | ``` 171 | 172 | ## Toy data for transforming 173 | 174 | ```{r} 175 | # Toy dataset to use 176 | pollution <- tribble( 177 | ~city, ~size, ~amount, 178 | "New York", "large", 23, 179 | "New York", "small", 14, 180 | "London", "large", 22, 181 | "London", "small", 16, 182 | "Beijing", "large", 121, 183 | "Beijing", "small", 56 184 | ) 185 | ``` 186 | 187 | ## Summarize 188 | 189 | ```{r} 190 | pollution %>% 191 | summarise(mean = mean(amount), sum = sum(amount), n = n()) 192 | ``` 193 | 194 | ```{r} 195 | pollution %>% 196 | group_by(city) %>% 197 | summarise(mean = mean(amount), sum = sum(amount), n = n()) 198 | ``` 199 | 200 | 201 | ## Your Turn 11 202 | 203 | Use `group_by()`, `summarise()`, and `arrange()` to display the ten most popular baby names. Compute popularity as the total number of children of a single gender given a name. 204 | 205 | ```{r} 206 | 207 | ``` 208 | 209 | ## Your Turn 12 210 | 211 | Use grouping to calculate and then plot the number of children born each year over time. 212 | 213 | ```{r} 214 | 215 | ``` 216 | 217 | ## Ungroup 218 | 219 | ```{r} 220 | babynames %>% 221 | group_by(name, sex) %>% 222 | summarise(total = sum(n)) %>% 223 | arrange(desc(total)) 224 | ``` 225 | 226 | ## Mutate 227 | 228 | ```{r} 229 | babynames %>% 230 | mutate(percent = round(prop*100, 2)) 231 | ``` 232 | 233 | ## Your Turn 13 234 | 235 | Use `min_rank()` and `mutate()` to rank each row in `babynames` from largest `n` to lowest `n`. 236 | 237 | ```{r} 238 | 239 | ``` 240 | 241 | ## Your Turn 14 242 | 243 | Compute each name's rank _within its year and sex_. 244 | Then compute the median rank _for each combination of name and sex_, and arrange the results from highest median rank to lowest. 245 | 246 | ```{r} 247 | 248 | ``` 249 | 250 | ## Flights data 251 | ```{r} 252 | flights 253 | skim(flights) 254 | ``` 255 | 256 | ## Toy data 257 | 258 | ```{r} 259 | band <- tribble( 260 | ~name, ~band, 261 | "Mick", "Stones", 262 | "John", "Beatles", 263 | "Paul", "Beatles" 264 | ) 265 | 266 | instrument <- tribble( 267 | ~name, ~plays, 268 | "John", "guitar", 269 | "Paul", "bass", 270 | "Keith", "guitar" 271 | ) 272 | 273 | instrument2 <- tribble( 274 | ~artist, ~plays, 275 | "John", "guitar", 276 | "Paul", "bass", 277 | "Keith", "guitar" 278 | ) 279 | ``` 280 | 281 | ## Mutating joins 282 | 283 | ```{r} 284 | band %>% left_join(instrument, by = "name") 285 | ``` 286 | 287 | ## Your Turn 15 288 | 289 | Which airlines had the largest arrival delays? Complete the code below. 290 | 291 | 1. Join `airlines` to `flights` 292 | 2. Compute and order the average arrival delays by airline. Display full names, no codes. 
293 | 294 | ```{r} 295 | flights %>% 296 | drop_na(arr_delay) %>% 297 | %>% 298 | group_by( ) %>% 299 | %>% 300 | arrange( ) 301 | ``` 302 | 303 | ## Different names 304 | 305 | ```{r} 306 | band %>% left_join(instrument2, by = c("name" = "artist")) 307 | ``` 308 | 309 | ## Your Turn 16 310 | 311 | How many airports in `airports` are serviced by flights originating in New York (i.e. flights in our dataset?) Notice that the column to join on is named `faa` in the **airports** data set and `dest` in the **flights** data set. 312 | 313 | 314 | ```{r} 315 | __________ %>% 316 | _________(_________, by = ___________) %>% 317 | distinct(faa) 318 | ``` 319 | 320 | 321 | 322 | *** 323 | 324 | # Take aways 325 | 326 | * Extract variables with `select()` 327 | * Extract cases with `filter()` 328 | * Arrange cases, with `arrange()` 329 | 330 | * Make tables of summaries with `summarise()` 331 | * Make new variables, with `mutate()` 332 | * Do groupwise operations with `group_by()` 333 | 334 | * Connect operations with `%>%` 335 | 336 | * Use `left_join()`, `right_join()`, `full_join()`, or `inner_join()` to join datasets 337 | * Use `semi_join()` or `anti_join()` to filter datasets against each other 338 | 339 | 340 | 341 | 342 | ## Joining data 343 | 344 | ```{r} 345 | library(nycflights13) 346 | ``` 347 | 348 | ## Your turn 349 | Read in the toy datasets band and instrument 350 | 351 | 352 | ## Types of joins 353 | 354 | ```{r} 355 | band %>% left_join(instrument, by = "name") 356 | band %>% right_join(instrument, by = "name") 357 | band %>% full_join(instrument, by = "name") 358 | band %>% inner_join(instrument, by = "name") 359 | ``` 360 | 361 | ## Your turn 362 | Which airlines had the largest arrival delays? Work in groups to complete the code below. 363 | 364 | ```{r} 365 | flights %>% 366 | drop_na(arr_delay) %>% 367 | #something! %>% 368 | group_by( #something! ) %>% 369 | #something! %>% 370 | arrange( #something! ) 371 | ``` 372 | 373 | ## Your turn 374 | Read in the toy dataset instrument2 375 | 376 | ## What if the names don't match? 
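When the key columns have different names in the two tables, give `by` a named vector: the name on the left-hand side is the column in the first (left) table and the value is the matching column in the second table, as in the chunk below.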
377 | 378 | ```{r} 379 | band %>% left_join(instrument2, by = c("name" = "artist")) 380 | ``` 381 | 382 | ```{r} 383 | airports %>% left_join(flights, by = c("faa" = "dest")) 384 | ``` 385 | 386 | 387 | # Take aways 388 | 389 | * Extract variables with `select()` 390 | * Extract cases with `filter()` 391 | * Arrange cases, with `arrange()` 392 | 393 | * Make tables of summaries with `summarise()` 394 | * Make new variables, with `mutate()` 395 | * Do groupwise operations with `group_by()` 396 | 397 | * Connect operations with `%>%` 398 | 399 | * Use `left_join()`, `right_join()`, `full_join()`, or `inner_join()` to join datasets 400 | * Use `semi_join()` or `anti_join()` to filter datasets against each other 401 | 402 | 403 | 404 | -------------------------------------------------------------------------------- /03-Tidy.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Tidy Data" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(tidyverse) 12 | library(babynames) 13 | 14 | # Toy data 15 | cases <- tribble( 16 | ~Country, ~"2011", ~"2012", ~"2013", 17 | "FR", 7000, 6900, 7000, 18 | "DE", 5800, 6000, 6200, 19 | "US", 15000, 14000, 13000 20 | ) 21 | 22 | pollution <- tribble( 23 | ~city, ~size, ~amount, 24 | "New York", "large", 23, 25 | "New York", "small", 14, 26 | "London", "large", 22, 27 | "London", "small", 16, 28 | "Beijing", "large", 121, 29 | "Beijing", "small", 56 30 | ) 31 | 32 | 33 | bp_systolic <- tribble( 34 | ~ subject_id, ~ time_1, ~ time_2, ~ time_3, 35 | 1, 120, 118, 121, 36 | 2, 125, 131, NA, 37 | 3, 141, NA, NA 38 | ) 39 | 40 | bp_systolic2 <- tribble( 41 | ~ subject_id, ~ time, ~ systolic, 42 | 1, 1, 120, 43 | 1, 2, 118, 44 | 1, 3, 121, 45 | 2, 1, 125, 46 | 2, 2, 131, 47 | 3, 1, 141 48 | ) 49 | 50 | ``` 51 | 52 | ## Tidy and untidy data 53 | 54 | `table1` is tidy: 55 | ```{r} 56 | table1 57 | ``` 58 | 59 | For example, it's easy to add a rate column with `mutate()`: 60 | ```{r} 61 | table1 %>% 62 | mutate(rate = cases/population) 63 | ``` 64 | 65 | `table2` isn't tidy, the count column really contains two variables: 66 | ```{r} 67 | table2 68 | ``` 69 | 70 | It makes it very hard to manipulate. 71 | 72 | 73 | ## Your Turn 1 74 | 75 | Is `bp_systolic` tidy? 76 | 77 | ```{r} 78 | bp_systolic2 79 | ``` 80 | 81 | ## Your Turn 2 82 | 83 | Using `bp_systolic2` with `group_by()`, and `summarise()`: 84 | 85 | * Find the average systolic blood pressure for each subject 86 | * Find the last time each subject was measured 87 | 88 | ```{r} 89 | bp_systolic2 90 | ``` 91 | 92 | ## Your Turn 3 93 | 94 | On a sheet of paper, draw how the cases data set would look if it had the same values grouped into three columns: **country**, **year**, **n** 95 | 96 | ## Your Turn 4 97 | 98 | Use `gather()` to reorganize `table4a` into three columns: **country**, **year**, and **cases**. 99 | 100 | ```{r} 101 | table4a 102 | ``` 103 | 104 | ## Your Turn 5 105 | 106 | On a sheet of paper, draw how this data set would look if it had the same values grouped into three columns: **city**, **large**, **small** 107 | 108 | ## Your Turn 6 109 | 110 | Use `spread()` to reorganize `table2` into four columns: **country**, **year**, **cases**, and **population**. 111 | 112 | ```{r} 113 | table2 114 | ``` 115 | 116 | *** 117 | 118 | # Take Aways 119 | 120 | Data comes in many formats but R prefers just one: _tidy data_. 121 | 122 | A data set is tidy if and only if: 123 | 124 | 1. 
Every variable is in its own column 125 | 2. Every observation is in its own row 126 | 3. Every value is in its own cell (which follows from the above) 127 | 128 | What is a variable and an observation may depend on your immediate goal. 129 | -------------------------------------------------------------------------------- /04-Case-Study.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Case Study: Friday the 13th Effect" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(fivethirtyeight) 12 | library(tidyverse) 13 | ``` 14 | 15 | ## Task 16 | 17 | Reproduce this figure from fivethirtyeight's article [*Some People Are Too Superstitious To Have A Baby On Friday The 13th*](https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/): 18 | 19 | ![](resources/bialik-fridaythe13th-2.png) 20 | 21 | ## Data 22 | 23 | In the `fivethiryeight` package there are two datasets containing birth data, but for now let's just work with one, `US_births_1994_2003`. Note that since we have data from 1994-2003, our results may differ somewhat from the figure based on 1994-2014. 24 | 25 | ## Your Turn 1 26 | 27 | With your neighbour, brainstorm the steps needed to get the data in a form ready to make the plot. 28 | 29 | ```{r} 30 | US_births_1994_2003 31 | ``` 32 | 33 | ## Some overviews of the data 34 | 35 | Whole time series: 36 | ```{r} 37 | ggplot(US_births_1994_2003, aes(x = date, y = births)) + 38 | geom_line() 39 | ``` 40 | There is so much fluctuation it's really hard to see what is going on. 41 | 42 | Let's try just looking at one year: 43 | ```{r} 44 | US_births_1994_2003 %>% 45 | filter(year == 1994) %>% 46 | ggplot(mapping = aes(x = date, y = births)) + 47 | geom_line() 48 | ``` 49 | Strong weekly pattern accounts for most variation. 50 | 51 | ## Strategy 52 | 53 | Use the figure as a guide for what the data should like to make the final plot. We want to end up with something like: 54 | 55 | --------------------------- 56 | day_of_week avg_diff_13 57 | ------------- ------------- 58 | Mon -2.686 59 | 60 | Tues -1.378 61 | 62 | Wed -3.274 63 | 64 | ... ... 65 | 66 | --------------------------- 67 | 68 | 69 | ## Your Turn 2 70 | 71 | Extract just the 6th, 13th and 20th of each month: 72 | 73 | ```{r} 74 | US_births_1994_2003 %>% 75 | select(-date) 76 | 77 | ``` 78 | 79 | ## Your Turn 3 80 | 81 | Which arrangement is tidy? 82 | 83 | **Option 1:** 84 | 85 | ----------------------------------------------------- 86 | year month date_of_month day_of_week births 87 | ------ ------- --------------- ------------- -------- 88 | 1994 1 6 Thurs 11406 89 | 90 | 1994 1 13 Thurs 11212 91 | 92 | 1994 1 20 Thurs 11682 93 | ----------------------------------------------------- 94 | 95 | **Option 2:** 96 | 97 | ---------------------------------------------------- 98 | year month day_of_week 6 13 20 99 | ------ ------- ------------- ------- ------- ------- 100 | 1994 1 Thurs 11406 11212 11682 101 | ---------------------------------------------------- 102 | 103 | (**Hint:** think about our next step *"Find the percent difference between the 13th and the average of the 6th and 12th"*. In which layout will this be easier using our tidy tools?) 104 | 105 | ## Your Turn 4 106 | 107 | Tidy the filtered data to have the days in columns. 
108 | 109 | ```{r} 110 | US_births_1994_2003 %>% 111 | select(-date) %>% 112 | filter(date_of_month %in% c(6, 13, 20)) 113 | ``` 114 | 115 | ## Your Turn 5 116 | 117 | Now use `mutate()` to add columns for: 118 | 119 | * The average of the births on the 6th and 20th 120 | * The percentage difference between the number of births on the 13th and the average of the 6th and 20th 121 | 122 | ```{r} 123 | US_births_1994_2003 %>% 124 | select(-date) %>% 125 | filter(date_of_month %in% c(6, 13, 20)) %>% 126 | spread(date_of_month, births) 127 | ``` 128 | 129 | ## A little additional exploring 130 | 131 | Now we have a percent difference between the 13th and the 6th and 20th of each month, it's probably worth exploring a little (at the very least to check our calculations seem reasonable). 132 | 133 | To make it a little easier let's assign our current data to a variable 134 | ```{r} 135 | births_diff_13 <- US_births_1994_2003 %>% 136 | select(-date) %>% 137 | filter(date_of_month %in% c(6, 13, 20)) %>% 138 | spread(date_of_month, births) %>% 139 | mutate( 140 | avg_6_20 = (`6` + `20`)/2, 141 | diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 142 | ) 143 | ``` 144 | 145 | Then take a look 146 | ```{r} 147 | births_diff_13 %>% 148 | ggplot(mapping = aes(day_of_week, diff_13)) + 149 | geom_point() 150 | ``` 151 | 152 | Looks like we are on the right path. There's a big outlier one Monday 153 | ```{r} 154 | births_diff_13 %>% 155 | filter(day_of_week == "Mon", diff_13 > 10) 156 | ``` 157 | 158 | Seem's to be driven but a particularly low number of births on the 6th of Sep 1999. Maybe a holiday effect? Labour Day was of the 6th of Sep that year. 159 | 160 | ## Your Turn 6 161 | 162 | Summarize each day of the week to have mean of diff_13. 163 | 164 | Then, recreate the fivethirtyeight plot. 165 | 166 | ```{r} 167 | US_births_1994_2003 %>% 168 | select(-date) %>% 169 | filter(date_of_month %in% c(6, 13, 20)) %>% 170 | spread(date_of_month, births) %>% 171 | mutate( 172 | avg_6_20 = (`6` + `20`)/2, 173 | diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 174 | ) 175 | ``` 176 | 177 | ## Extra Challenges 178 | 179 | * If you wanted to use the `US_births_2000_2014` data instead, what would you need to change in the pipeline? How about using both `US_births_1994_2003` and `US_births_2000_2014`? 180 | 181 | * Try not removing the `date` column. At what point in the pipeline does it cause problems? Why? 182 | 183 | * Can you come up with an alternative way to investigate the Friday the 13th effect? Try it out! 184 | 185 | ## Takeaways 186 | 187 | The power of the tidyverse comes from being able to easily combine functions that do simple things well. 188 | 189 | -------------------------------------------------------------------------------- /05-Data-Types.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Types" 3 | output: html_notebook 4 | --- 5 | 6 | ```{r setup} 7 | library(tidyverse) 8 | library(babynames) 9 | library(nycflights13) 10 | library(stringr) 11 | library(forcats) 12 | library(lubridate) 13 | library(hms) 14 | ``` 15 | 16 | ## Your Turn 1 17 | 18 | Use `flights` to create `delayed`, the variable that displays whether a flight was delayed (`arr_delay > 0`). 19 | 20 | Then, remove all rows that contain an NA in `delayed`. 21 | 22 | Finally, create a summary table that shows: 23 | 24 | 1. How many flights were delayed 25 | 2. 
What proportion of flights were delayed 26 | 27 | ```{r} 28 | 29 | ``` 30 | 31 | 32 | ## Your Turn 2 33 | 34 | In your group, fill in the blanks to: 35 | 36 | 1. Isolate the last letter of every name and create a logical variable that displays whether the last letter is one of "a", "e", "i", "o", "u", or "y". 37 | 2. Use a weighted mean to calculate the proportion of children whose name ends in a vowel (by `year` and `sex`) 38 | 3. and then display the results as a line plot. 39 | 40 | ```{r} 41 | babynames %>% 42 | _______(last = _________, 43 | vowel = __________) %>% 44 | group_by(__________) %>% 45 | _________(p_vowel = weighted.mean(vowel, n)) %>% 46 | _________ + 47 | __________ 48 | ``` 49 | 50 | ## Your Turn 3 51 | 52 | Repeat the previous exercise, some of whose code is below, to make a sensible graph of average TV consumption by marital status. 53 | 54 | ```{r} 55 | gss_cat %>% 56 | drop_na(________) %>% 57 | group_by(________) %>% 58 | summarise(_________________) %>% 59 | ggplot() + 60 | geom_point(mapping = aes(x = _______, y = _________________________)) 61 | ``` 62 | 63 | ## Your Turn 4 64 | 65 | Do you think liberals or conservatives watch more TV? 66 | Compute average tv hours by party ID an then plot the results. 67 | 68 | ```{r} 69 | 70 | ``` 71 | 72 | ## Your Turn 5 73 | 74 | What is the best time of day to fly? 75 | 76 | Use the `hour` and `minute` variables in `flights` to compute the time of day for each flight as an hms. Then use a smooth line to plot the relationship between time of day and `arr_delay`. 77 | 78 | ```{r} 79 | 80 | ``` 81 | 82 | ## Your Turn 6 83 | 84 | Fill in the blanks to: 85 | 86 | Extract the day of the week of each flight (as a full name) from `time_hour`. 87 | 88 | Calculate the average `arr_delay` by day of the week. 89 | 90 | Plot the results as a column chart (bar chart) with `geom_col()`. 91 | 92 | ```{r} 93 | flights %>% 94 | mutate(weekday = _______________________________) %>% 95 | __________________ %>% 96 | drop_na(arr_delay) %>% 97 | summarise(avg_delay = _______________) %>% 98 | ggplot() + 99 | ___________(mapping = aes(x = weekday, y = avg_delay)) 100 | ``` 101 | 102 | *** 103 | 104 | # Take Aways 105 | 106 | Dplyr gives you three _general_ functions for manipulating data: `mutate()`, `summarise()`, and `group_by()`. Augment these with functions from the packages below, which focus on specific types of data. 107 | 108 | Package | Data Type 109 | --------- | -------- 110 | stringr | strings 111 | forcats | factors 112 | hms | times 113 | lubridate | dates and times 114 | 115 | -------------------------------------------------------------------------------- /06-Iterate.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Iteration" 3 | output: html_document 4 | --- 5 | 6 | 7 | 8 | ```{r setup} 9 | library(tidyverse) 10 | 11 | # Toy data 12 | set.seed(1000) 13 | exams <- list( 14 | student1 = round(runif(10, 50, 100)), 15 | student2 = round(runif(10, 50, 100)), 16 | student3 = round(runif(10, 50, 100)), 17 | student4 = round(runif(10, 50, 100)), 18 | student5 = round(runif(10, 50, 100)) 19 | ) 20 | 21 | extra_credit <- list(0, 0, 10, 10, 15) 22 | ``` 23 | 24 | ## Your Turn 1 25 | 26 | What kind of object is `mod`? Why are models stored as this kind of object? 27 | 28 | ```{r} 29 | mod <- lm(price ~ carat + cut + color + clarity, data = diamonds) 30 | View(mod) 31 | ``` 32 | 33 | ## Consider 34 | 35 | What's the difference between a list and an **atomic** vector? 
36 | 37 | Atomic vectors are: "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw" vectors. 38 | 39 | ## Your Turn 2 40 | 41 | Here is a list: 42 | 43 | ```{r} 44 | a_list <- list(nums = c(8, 9), 45 | log = TRUE, 46 | cha = c("a", "b", "c")) 47 | ``` 48 | 49 | Here are two subsetting commands. Do they return the same values? Run the code chunk above, _and then_ run the code chunks below to confirm 50 | 51 | ```{r} 52 | a_list["nums"] 53 | ``` 54 | 55 | ```{r} 56 | a_list$nums 57 | ``` 58 | 59 | ## Your Turn 3 60 | 61 | What will each of these return? Run the code chunks to confirm. 62 | 63 | ```{r} 64 | vec <- c(-2, -1, 0, 1, 2) 65 | abs(vec) 66 | ``` 67 | 68 | ```{r, error = TRUE} 69 | lst <- list(-2, -1, 0, 1, 2) 70 | abs(lst) 71 | ``` 72 | 73 | ## Your Turn 4 74 | 75 | Run the code in the chunks. What does it return? 76 | 77 | ```{r} 78 | list(student1 = mean(exams$student1), 79 | student2 = mean(exams$student2), 80 | student3 = mean(exams$student3), 81 | student4 = mean(exams$student4), 82 | student5 = mean(exams$student5)) 83 | ``` 84 | 85 | ```{r} 86 | library(purrr) 87 | map(exams, mean) 88 | ``` 89 | 90 | ## Your Turn 5 91 | 92 | Calculate the variance (`var()`) of each student’s exam grades. 93 | 94 | ```{r} 95 | exams 96 | ``` 97 | 98 | ## Your Turn 6 99 | 100 | Calculate the max grade (`max()`)for each student. Return the result as a vector. 101 | 102 | ```{r} 103 | exams 104 | ``` 105 | 106 | ## Your Turn 7 107 | 108 | Write a function that counts the best exam twice and then takes the average. Use it to grade all of the students. 109 | 110 | 1. Write code that solves the problem for a real object 111 | 2. Wrap the code in `function(){}` to save it 112 | 3. Add the name of the real object as the function argument 113 | 114 | ```{r} 115 | vec <- exams[[1]] 116 | 117 | 118 | ``` 119 | 120 | ### Your Turn 8 121 | 122 | Compute a final grade for each student, where the final grade is the average test score plus any `extra_credit` assigned to the student. Return the results as a double (i.e. numeric) vector. 123 | 124 | ```{r} 125 | 126 | ``` 127 | 128 | 129 | *** 130 | 131 | # Take Aways 132 | 133 | Lists are a useful way to organize data, but you need to arrange manually for functions to iterate over the elements of a list. 134 | 135 | You can do this with the `map()` family of functions in the purrr package. 136 | 137 | To write a function, 138 | 139 | 1. Write code that solves the problem for a real object 140 | 2. Wrap the code in `function(){}` to save it 141 | 3. Add the name of the real object as the function argument 142 | 143 | This sequence will help prevent bugs in your code (and reduce the time you spend correcting bugs). 144 | -------------------------------------------------------------------------------- /07-Model.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Model" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup, message=FALSE} 11 | library(tidyverse) 12 | library(modelr) 13 | library(broom) 14 | 15 | wages <- heights %>% filter(income > 0) 16 | ``` 17 | 18 | ## Your Turn 1 19 | 20 | Fit the model on the slide and then examine the output. What does it look like? 21 | 22 | ```{r} 23 | mod_e <- lm(log(income) ~ education, data = wages) 24 | mod_e 25 | ``` 26 | 27 | ## Your Turn 2 28 | 29 | Use a pipe to model `log(income)` against `height`. Then use broom and dplyr functions to extract: 30 | 31 | 1. 
The **coefficient estimates** and their related statistics 32 | 2. The **adj.r.squared** and **p.value** for the overall model 33 | 34 | ```{r, error = TRUE} 35 | mod_h <- wages %>% lm( ) 36 | 37 | 38 | ``` 39 | 40 | ## Your Turn 3 41 | 42 | Model `log(income)` against `education` _and_ `height`. Do the coefficients change? 43 | 44 | ```{r, error = TRUE} 45 | mod_eh <- wages %>% lm( ) 46 | 47 | ``` 48 | 49 | ## Your Turn 4 50 | 51 | Model `log(income)` against `education` and `height` and `sex`. Can you interpret the coefficients? 52 | 53 | ```{r, error = TRUE} 54 | mod_ehs <- wages %>% lm( ) 55 | ``` 56 | 57 | ## Your Turn 5 58 | 59 | Use a broom function and ggplot2 to make a line graph of `height` vs `.fitted` for our heights model, `mod_h`. 60 | 61 | _Bonus: Overlay the plot on the original data points._ 62 | 63 | ```{r} 64 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 65 | 66 | ``` 67 | 68 | ## Your Turn 6 69 | 70 | Repeat the process to make a line graph of `height` vs `.fitted` colored by `sex` for model `mod_ehs`. Are the results interpretable? Add `+ facet_wrap(~education)` to the end of your code. What happens? 71 | 72 | ```{r} 73 | mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) 74 | 75 | ``` 76 | 77 | ## Your Turn 7 78 | 79 | Use one of `spread_predictions()` or `gather_predictions()` to make a line graph of `height` vs `pred` colored by `model` for each of mod_h, mod_eh, and mod_ehs. Are the results interpretable? 80 | 81 | Add `+ facet_grid(sex ~ education)` to the end of your code. What happens? 82 | 83 | ```{r warning = FALSE, message = FALSE} 84 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 85 | mod_eh <- wages %>% lm(log(income) ~ education + height, data = .) 86 | mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) 87 | 88 | 89 | ``` 90 | 91 | *** 92 | 93 | # Take Aways 94 | 95 | * Use `glance()`, `tidy()`, and `augment()` from the **broom** package to return model values in a data frame. 96 | 97 | * Use `add_predictions()` or `gather_predictions()` or `spread_predictions()` from the **modelr** package to visualize predictions. 98 | 99 | * Use `add_residuals()` or `gather_residuals()` or `spread_residuals()` from the **modelr** package to visualize residuals. 100 | 101 | -------------------------------------------------------------------------------- /08-Organize.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Organize with List Columns" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(tidyverse) 12 | library(gapminder) 13 | library(broom) 14 | 15 | nz <- gapminder %>% 16 | filter(country == "New Zealand") 17 | us <- gapminder %>% 18 | filter(country == "United States") 19 | ``` 20 | 21 | ## Your turn 1 22 | 23 | How has life expectancy changed over time? 24 | Make a line plot of lifeExp vs. year grouped by country. 25 | Set alpha to 0.2, to see the results better. 26 | 27 | ```{r} 28 | gapminder 29 | 30 | 31 | ``` 32 | 33 | ## Consider 34 | 35 | How is a data frame/tibble similar to a list? 36 | 37 | ## Consider 38 | 39 | If one of the elements of a list can be another list, 40 | can one of the columns of a data frame be another list? 
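Yes — here is a tiny sketch (toy values, not part of the workshop data) showing that a tibble column can itself be a list, holding a different kind of object in each row:

```{r}
# A two-row tibble whose `stuff` column is a list:
# a numeric vector in one row, a character vector in the other
tibble(
  id    = c(1, 2),
  stuff = list(1:3, c("a", "b"))
)
```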
41 | 42 | ## Your turn 2 43 | 44 | Run this chunk: 45 | ```{r} 46 | gapminder_nested <- gapminder %>% 47 | group_by(country) %>% 48 | nest() 49 | 50 | fit_model <- function(df) lm(lifeExp ~ year, data = df) 51 | 52 | gapminder_nested <- gapminder_nested %>% 53 | mutate(model = map(data, fit_model)) 54 | 55 | get_rsq <- function(mod) glance(mod)$r.squared 56 | 57 | gapminder_nested <- gapminder_nested %>% 58 | mutate(r.squared = map_dbl(model, get_rsq)) 59 | ``` 60 | 61 | Then filter `gapminder_nested` to find the countries with r.squared less than 0.5. 62 | 63 | ```{r} 64 | 65 | ``` 66 | 67 | ## Your Turn 3 68 | 69 | Edit the code in the chunk provided to instead find and plot countries with a slope above 0.6 years/year. 70 | 71 | ```{r} 72 | get_slope <- function(mod) { 73 | tidy(mod) %>% filter(term == "year") %>% pull(estimate) 74 | } 75 | 76 | # Add new column with r-sqaured 77 | gapminder_nested <- gapminder_nested %>% 78 | mutate(r.squared = map_dbl(model, get_rsq)) 79 | 80 | # filter out low r-squared countries 81 | poor_fit <- gapminder_nested %>% 82 | filter(r.squared < 0.5) 83 | 84 | # unnest and plot result 85 | unnest(poor_fit, data) %>% 86 | ggplot(aes(x = year, y = lifeExp)) + 87 | geom_line(aes(color = country)) 88 | ``` 89 | 90 | ## Your Turn 4 91 | 92 | **Challenge:** 93 | 94 | 1. Create your own copy of `gapminder_nested` and then add one more list column: `output` which contains the output of `augment()` for each model. 95 | 96 | 97 | ```{r} 98 | 99 | ``` 100 | 101 | # Take away 102 | 103 | -------------------------------------------------------------------------------- /99-Setup.md: -------------------------------------------------------------------------------- 1 | # Getting Set Up 2 | 3 | During the workshop you'll do your work on [rstudio.cloud](https://rstudio.cloud/). This provides an easy way for me to share all the materials with you, and removes the hassle of getting the right versions of R, RStudio or any packages. 4 | 5 | ## To get started: 6 | 7 | To get set up follow these steps: 8 | 9 | 1. Visit the project at https://rstudio.cloud/project/163983 10 | 11 | 2. Log in using google, github, shinyapps.io or "Sign Up". 12 | 13 | ![](resources/01-setup-login.png) 14 | 15 | 3. The "Data Science in the tidyverse" project will open, but it's a *Temporary copy*. Click *Save a copy*. 16 | 17 | ![](resources/02-setup-temp-project.png) 18 | 19 | 4. Now the "Data Science in the tidyverse" project will open again, but this time it is your own copy. Navigate to the "data-science-in-the-tidyverse.Rproj" file and click it. 20 | 21 | ![](resources/04-setup-rproj-file.png) 22 | 23 | 6. You'll be asked if you want to open the project, hit Yes. 24 | 25 | ![](resources/05-setup-open-project.png) 26 | 27 | 7. All going well, you should now see your project looking like this. Now, open "00-Getting-started.Rmd" 28 | 29 | ![](resources/06-setup-inside-project.png) 30 | 31 | 8. You're all set! You might like to read through "00-Getting-started.Rmd" and do what it tells you. 32 | 33 | ![](resources/07-setup-all-done.png) 34 | 35 | ## Once you are set up 36 | 37 | You can access your copy of the project from *Your Workspace* on [rstudio.cloud](https://rstudio.cloud/). -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | This is the repo for *"Data Science in the tidyverse"* given at `rstudio::conf(2019)` in Jan 2019. 
2 | 3 | ## Description 4 | 5 | This is a two-day hands on workshop based on the book [R for Data Science](http://r4ds.had.co.nz/). You will learn how to visualize, transform, and model data in R and work with date-times, character strings, and untidy data formats. Along the way, you will learn and use many packages from the tidyverse including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, lubridate, and forcats. 6 | 7 | ## Software Requirements 8 | 9 | You'll be using RStudio Cloud, so (all going well) on the day of the workshop all you'll need is **a laptop that can access the internet** (wifi will be available). 10 | 11 | In the unlikely event that there are problems with the conference internet connection, you may want to have a local installation on your computer as a backup. If you'd like, install the following: 12 | 13 | 1. A recent version of R (~3.5.2), which is available for free at [cran.r-project.org](http://www.cran.r-project.org) 14 | 2. A recent version of RStudio IDE (~1.1.463), available for free at [www.rstudio.com/download](http://www.rstudio.com/download) 15 | 3. The set of relevant R packages, which you can install by connecting to the internet, opening RStudio, and running: 16 | 17 | install.packages(c("babynames", "fivethirtyeight", "formatR", "gapminder", "hexbin", "mgcv", "maps", "mapproj", "nycflights13", "rmarkdown", "skimr", "tidyverse", "viridis")) 18 | 19 | Don't forget to bring your power cord! 20 | 21 | ## Instructor Info 22 | 23 | Amelia McNamara 24 | 25 | - [amelia.mn](http://www.amelia.mn) 26 | - @[AmeliaMN](http://www.twitter.com/AmeliaMN) 27 | 28 | ## License 29 | 30 | Creative Commons License 31 | 32 | *Data Science in the tidyverse* by Amelia McNamara is licensed under a Creative Commons Attribution 4.0 International License. Based on a work at https://github.com/rstudio-education/master-the-tidyverse, [https://github.com/cwickham/data-science-in-tidyverse](https://github.com/cwickham/data-science-in-tidyverse), and [https://github.com/AmeliaMN/IntroToR/](https://github.com/AmeliaMN/IntroToR/) 33 | -------------------------------------------------------------------------------- /data-science-in-the-tidyverse.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Yes 4 | SaveWorkspace: Ask 5 | AlwaysSaveHistory: Yes 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /email-to-participants.md: -------------------------------------------------------------------------------- 1 | *(This will be sent to registered particpants by email, but I'm also posting here as a convenient place to field any questions/issues.)* 2 | 3 | Thank you for enrolling in Data Science in the Tidyverse. 4 | 5 | During class, we will be using RStudio Cloud, a hosted version of R and RStudio in the cloud. The only thing you need to do to prepare for class is sign up for a free RStudio Cloud account at , and plan to bring your laptop with you. On the day of class, we'll provide you with an RStudio Cloud project that contains all of the course materials. 6 | 7 | In the unlikely event that there are problems with the conference internet connection, you may want to have a local installation on your computer as a backup. If you'd like, install the following: 8 | 9 | 1. 
A recent version of R (~3.5.2), which is available for free at 10 | 2. A recent version of RStudio IDE (~1.1.463), available for free at 11 | 3. The set of relevant R packages, which you can install by connecting to the internet, opening RStudio, and running: 12 | 13 | install.packages(c("babynames", "fivethirtyeight", "formatR", "gapminder", "hexbin", "mgcv", "maps", "mapproj", "nycflights13", "rmarkdown", "skimr", "tidyverse", "viridis")) 14 | 15 | If you're a new R user or working on a government or corporate laptop, it's possible that installing R will be challenging. In that case, feel free to ignore the backup instructions and just count on RStudio Cloud. We'll talk about local installation on the second day of the workshop, and we'll have TAs there to help troubleshoot. 16 | 17 | Whatever you do, don't forget your power cord! 18 | 19 | We look forward to meeting you, 20 | 21 | Amelia and Hadley -------------------------------------------------------------------------------- /resources/01-setup-login.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/01-setup-login.png -------------------------------------------------------------------------------- /resources/02-setup-temp-project.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/02-setup-temp-project.png -------------------------------------------------------------------------------- /resources/04-setup-rproj-file.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/04-setup-rproj-file.png -------------------------------------------------------------------------------- /resources/05-setup-open-project.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/05-setup-open-project.png -------------------------------------------------------------------------------- /resources/06-setup-inside-project.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/06-setup-inside-project.png -------------------------------------------------------------------------------- /resources/07-setup-all-done.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/07-setup-all-done.png -------------------------------------------------------------------------------- /resources/bialik-fridaythe13th-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/bialik-fridaythe13th-2.png -------------------------------------------------------------------------------- /slides/00-Introduction.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/00-Introduction.pdf -------------------------------------------------------------------------------- /slides/01-Visualize.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/01-Visualize.pdf -------------------------------------------------------------------------------- /slides/02-Transform.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/02-Transform.pdf -------------------------------------------------------------------------------- /slides/03-Tidy.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/03-Tidy.pdf -------------------------------------------------------------------------------- /slides/04-Case-Study.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/04-Case-Study.pdf -------------------------------------------------------------------------------- /slides/05-Data-Types.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/05-Data-Types.pdf -------------------------------------------------------------------------------- /slides/06-Iteration.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/06-Iteration.pdf -------------------------------------------------------------------------------- /slides/07-Model.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/07-Model.pdf -------------------------------------------------------------------------------- /slides/08-Organize.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/08-Organize.pdf -------------------------------------------------------------------------------- /slides/09-Wrapping-Up.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/09-Wrapping-Up.pdf -------------------------------------------------------------------------------- /solutions/01-Visualize-solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Visualization - solutions" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ## Setup 11 | 12 | The first chunk in an R Notebook is usually titled "setup," and by convention includes the R packages you want to load. 
Remember, in order to use an R package you have to run some `library()` code every session. Execute these lines of code to load the packages. 13 | 14 | ```{r setup} 15 | library(ggplot2) 16 | library(fivethirtyeight) 17 | ``` 18 | 19 | ## Bechdel test data 20 | 21 | We're going to start by playing with data collected by the website FiveThirtyEight on movies and [the Bechdel test](https://en.wikipedia.org/wiki/Bechdel_test). 22 | 23 | To begin, let's just preview our data. There are a couple ways to do that. One is just to type the name of the data and execute it like a piece of code. 24 | 25 | ```{r} 26 | bechdel 27 | ``` 28 | 29 | Notice that you can page through to see more of the dataset. 30 | 31 | Sometimes, people prefer to see their data in a more spreadsheet-like format, and RStudio provides a way to do that. Go to the Console and type `View(bechdel)` to see the data preview. 32 | 33 | (An aside-- `View` is a special function. Since it makes something happen in the RStudio interface, it doesn't work properly in R Notebooks. Most R functions have names that start with lowercase letters, so the uppercase "V" is there to remind you of its special status.) 34 | 35 | 36 | 37 | ## Consider 38 | What relationship do you expect to see between movie budget (budget) and domestic gross(domgross)? 39 | 40 | ## Your Turn 1 41 | 42 | Run the code on the slide to make a graph. Pay strict attention to spelling, capitalization, and parentheses! 43 | 44 | ```{r} 45 | ggplot(data = bechdel) + 46 | geom_point(mapping = aes(x = budget, y = domgross)) 47 | ``` 48 | 49 | ## Your Turn 2 50 | 51 | Add `color`, `size`, `alpha`, and `shape` aesthetics to your graph. Experiment. 52 | 53 | ```{r} 54 | ggplot(data = bechdel) + 55 | geom_point(mapping = aes(x = budget, y = domgross, color=clean_test)) 56 | 57 | ggplot(bechdel) + 58 | geom_point(mapping = aes(x = budget, y = domgross, size=clean_test)) 59 | ggplot(bechdel) + 60 | geom_point(mapping = aes(x = budget, y = domgross, shape=clean_test)) 61 | ggplot(bechdel) + 62 | geom_point(mapping = aes(x = budget, y = domgross, alpha=clean_test)) 63 | 64 | ``` 65 | 66 | ## Set vs map 67 | 68 | ```{r} 69 | ggplot(bechdel) + 70 | geom_point(mapping = aes(x = budget, y = domgross), color="blue") 71 | ``` 72 | 73 | ## Your Turn 3 74 | 75 | Replace this scatterplot with one that draws boxplots. Use the cheatsheet. Try your best guess. 76 | 77 | ```{r} 78 | ggplot(data = bechdel) + geom_point(aes(x = clean_test, y = budget)) 79 | 80 | ggplot(data = bechdel) + geom_boxplot(aes(x = clean_test, y = budget)) 81 | ``` 82 | 83 | ## Your Turn 4 84 | 85 | Make a histogram of the `budget` variable from `bechdel`. 86 | 87 | ```{r} 88 | ggplot(bechdel) + 89 | geom_histogram(aes(x=budget)) 90 | ``` 91 | 92 | ## Your Turn 5 93 | Try to find a better binwidth for `budget`. 94 | 95 | ```{r} 96 | ggplot(data = bechdel) + 97 | geom_histogram(mapping = aes(x = budget), binwidth=10000000) 98 | ``` 99 | 100 | ## Your Turn 6 101 | 102 | Make a density plot of `budget` colored by `clean_test`. 103 | 104 | ```{r} 105 | ggplot(data = bechdel) + 106 | geom_density(mapping = aes(x = budget)) 107 | 108 | ggplot(data = bechdel) + 109 | geom_density(mapping = aes(x = budget, color=clean_test)) 110 | ``` 111 | 112 | 113 | ## Your Turn 7 114 | 115 | Make a barchart of `clean_test` colored by `clean_test`. 116 | 117 | ```{r} 118 | ggplot(data=bechdel) + 119 | geom_bar(mapping = aes(x = clean_test, fill = clean_test)) 120 | ``` 121 | 122 | 123 | ## Your Turn 8 124 | 125 | Predict what this code will do. 
Then run it. 126 | 127 | ```{r} 128 | ggplot(bechdel) + 129 | geom_point(aes(budget, domgross)) + 130 | geom_smooth(aes(budget, domgross)) 131 | ``` 132 | 133 | ## global vs local 134 | 135 | ```{r} 136 | ggplot(data = bechdel, mapping = aes(x = budget, y = domgross)) + 137 | geom_point(mapping = aes(color = clean_test)) + 138 | geom_smooth() 139 | ``` 140 | 141 | ```{r} 142 | ggplot(data = bechdel, mapping = aes(x = budget, y = domgross)) + 143 | geom_point(mapping = aes(color = clean_test)) + 144 | geom_smooth(data = filter(bechdel, clean_test == "ok")) 145 | ``` 146 | 147 | ## Your Turn 148 | 149 | What does `getwd()` return? 150 | 151 | ```{r} 152 | getwd() 153 | ``` 154 | 155 | ## Your Turn 9 156 | 157 | Save the last plot and then locate it in the files pane. If you run your `ggsave()` code inside this notebook, the image will be saved in the same directory as your .Rmd file (likely, project -> code), but if you run `ggsave()` in the Console it will be in your working directory. 158 | 159 | ```{r} 160 | ggsave("my-plot.png") 161 | ``` 162 | 163 | *** 164 | 165 | # Take aways 166 | 167 | You can use this code template to make thousands of graphs with **ggplot2**. 168 | 169 | ```{r eval = FALSE} 170 | ggplot(data = ) + 171 | (mapping = aes()) 172 | ``` -------------------------------------------------------------------------------- /solutions/02-Transform-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Transform Data - solutions" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(dplyr) 12 | library(babynames) 13 | library(nycflights13) 14 | library(skimr) 15 | ``` 16 | 17 | ## Babynames 18 | 19 | ```{r} 20 | babynames 21 | skim(babynames) 22 | skim_with(integer = list(p25 = NULL, p75=NULL)) 23 | ``` 24 | 25 | 26 | ## Your Turn 1 27 | Run the skim_with() command, and then try skimming babynames again to see how the output is different 28 | ```{r} 29 | skim(babynames) 30 | ``` 31 | 32 | ## Select 33 | 34 | ```{r} 35 | select(babynames, name, prop) 36 | ``` 37 | 38 | ## Your Turn 2 39 | 40 | Alter the code to select just the `n` column: 41 | 42 | ```{r} 43 | select(babynames, n) 44 | ``` 45 | 46 | ## Consider 47 | 48 | Which of these is NOT a way to select the `name` and `n` columns together? 49 | 50 | ```{r} 51 | select(babynames, -c(year, sex, prop)) 52 | select(babynames, name:n) 53 | select(babynames, starts_with("n")) 54 | select(babynames, ends_with("n")) 55 | ``` 56 | 57 | ## Your Turn 3 58 | 59 | Show: 60 | 61 | * All of the names where prop is greater than or equal to 0.08 62 | * All of the children named "Sea" 63 | * All of the names that have a missing value for `n` 64 | 65 | ```{r} 66 | filter(babynames, prop >= 0.08) 67 | filter(babynames, name == "Sea") 68 | filter(babynames, is.na(n)) 69 | ``` 70 | 71 | ## Your Turn 4 72 | 73 | Use Boolean operators to alter the code below to return only the rows that contain: 74 | 75 | * Girls named Sea 76 | * Names that were used by exactly 5 or 6 children in 1880 77 | * Names that are one of Acura, Lexus, or Yugo 78 | 79 | ```{r} 80 | filter(babynames, name == "Sea", sex == "F") 81 | filter(babynames, n == 5 | n == 6, year == 1880) 82 | filter(babynames, name %in% c("Acura", "Lexus", "Yugo")) 83 | ``` 84 | 85 | ## Arrange 86 | 87 | ```{r} 88 | arrange(babynames, n) 89 | ``` 90 | 91 | ## Your Turn 5 92 | 93 | Arrange babynames by `n`. Add `prop` as a second (tie breaking) variable to arrange on. 
Can you tell what the smallest value of `n` is? 94 | 95 | ```{r} 96 | arrange(babynames, n, prop) 97 | ``` 98 | 99 | ## desc 100 | 101 | ```{r} 102 | arrange(babynames, desc(n)) 103 | ``` 104 | 105 | ## Your Turn 6 106 | 107 | Use `desc()` to find the names with the highest prop. 108 | Then, use `desc()` to find the names with the highest n. 109 | 110 | ```{r} 111 | arrange(babynames, desc(prop)) 112 | arrange(babynames, desc(n)) 113 | ``` 114 | 115 | ## Steps and the pipe 116 | 117 | ```{r} 118 | babynames %>% 119 | filter(year == 2015, sex == "M") %>% 120 | select(name, n) %>% 121 | arrange(desc(n)) 122 | ``` 123 | 124 | ## Your Turn 7 125 | 126 | Use `%>%` to write a sequence of functions that: 127 | 128 | 1. Filter babynames to just the girls that were born in 2015 129 | 2. Select the `name` and `n` columns 130 | 3. Arrange the results so that the most popular names are near the top. 131 | 132 | ```{r} 133 | babynames %>% 134 | filter(year == 2015, sex == "F") %>% 135 | select(name, n) %>% 136 | arrange(desc(n)) 137 | ``` 138 | 139 | ## Your Turn 8 140 | 141 | 1. Trim `babynames` to just the rows that contain your `name` and your `sex` 142 | 2. Trim the result to just the columns that will appear in your graph (not strictly necessary, but useful practice) 143 | 3. Plot the results as a line graph with `year` on the x axis and `prop` on the y axis 144 | 145 | ```{r} 146 | babynames %>% 147 | filter(name == "Amelia", sex == "F") %>% 148 | select(year, prop) %>% 149 | ggplot() + 150 | geom_line(mapping = aes(year, prop)) 151 | ``` 152 | 153 | ## Your Turn 9 154 | 155 | Use summarise() to compute three statistics about the data: 156 | 157 | 1. The first (minimum) year in the dataset 158 | 2. The last (maximum) year in the dataset 159 | 3. The total number of children represented in the data 160 | 161 | ```{r} 162 | babynames %>% 163 | summarise(first = min(year), 164 | last = max(year), 165 | total = sum(n)) 166 | ``` 167 | 168 | ## Your Turn 10 169 | 170 | Extract the rows where `name == "Khaleesi"`. Then use `summarise()` and a summary functions to find: 171 | 172 | 1. The total number of children named Khaleesi 173 | 2. The first year Khaleesi appeared in the data 174 | 175 | ```{r} 176 | babynames %>% 177 | filter(name == "Khaleesi") %>% 178 | summarise(total = sum(n), first = min(year)) 179 | ``` 180 | 181 | ## Toy data for transforming 182 | 183 | ```{r} 184 | # Toy dataset to use 185 | pollution <- tribble( 186 | ~city, ~size, ~amount, 187 | "New York", "large", 23, 188 | "New York", "small", 14, 189 | "London", "large", 22, 190 | "London", "small", 16, 191 | "Beijing", "large", 121, 192 | "Beijing", "small", 56 193 | ) 194 | ``` 195 | 196 | ## Summarize 197 | 198 | ```{r} 199 | pollution %>% 200 | summarise(mean = mean(amount), sum = sum(amount), n = n()) 201 | ``` 202 | 203 | ```{r} 204 | pollution %>% 205 | group_by(city) %>% 206 | summarise(mean = mean(amount), sum = sum(amount), n = n()) 207 | ``` 208 | 209 | 210 | ## Your Turn 11 211 | 212 | Use `group_by()`, `summarise()`, and `arrange()` to display the ten most popular names. Compute popularity as the total number of children of a single gender given a name. 213 | 214 | ```{r} 215 | babynames %>% 216 | group_by(name, sex) %>% 217 | summarise(total = sum(n)) %>% 218 | arrange(desc(total)) 219 | ``` 220 | 221 | ## Your Turn 12 222 | 223 | Use grouping to calculate and then plot the number of children born each year over time. 
224 | 225 | ```{r} 226 | babynames %>% 227 | group_by(year) %>% 228 | summarise(n_children = sum(n)) %>% 229 | ggplot() + 230 | geom_line(mapping = aes(x = year, y = n_children)) 231 | ``` 232 | 233 | ## Ungroup 234 | 235 | ```{r} 236 | babynames %>% 237 | group_by(name, sex) %>% 238 | summarise(total = sum(n)) %>% 239 | arrange(desc(total)) 240 | ``` 241 | 242 | ## Mutate 243 | 244 | ```{r} 245 | babynames %>% 246 | mutate(percent = round(prop*100, 2)) 247 | ``` 248 | 249 | ## Your Turn 13 250 | 251 | Use `min_rank()` and `mutate()` to rank each row in `babynames` from largest `n` to lowest `n`. 252 | 253 | ```{r} 254 | babynames %>% 255 | mutate(rank = min_rank(desc(prop))) 256 | ``` 257 | 258 | ## Your Turn 14 259 | 260 | Compute each name's rank _within its year and sex_. 261 | Then compute the median rank _for each combination of name and sex_, and arrange the results from highest median rank to lowest. 262 | 263 | ```{r} 264 | babynames %>% 265 | group_by(year, sex) %>% 266 | mutate(rank = min_rank(desc(prop))) %>% 267 | group_by(name, sex) %>% 268 | summarise(score = median(rank)) %>% 269 | arrange(score) 270 | ``` 271 | 272 | ## Flights data 273 | ```{r} 274 | flights 275 | skim(flights) 276 | ``` 277 | 278 | ## Toy data 279 | 280 | ```{r} 281 | band <- tribble( 282 | ~name, ~band, 283 | "Mick", "Stones", 284 | "John", "Beatles", 285 | "Paul", "Beatles" 286 | ) 287 | 288 | instrument <- tribble( 289 | ~name, ~plays, 290 | "John", "guitar", 291 | "Paul", "bass", 292 | "Keith", "guitar" 293 | ) 294 | 295 | instrument2 <- tribble( 296 | ~artist, ~plays, 297 | "John", "guitar", 298 | "Paul", "bass", 299 | "Keith", "guitar" 300 | ) 301 | ``` 302 | 303 | ## Mutating joins 304 | 305 | ```{r} 306 | band %>% left_join(instrument, by = "name") 307 | ``` 308 | 309 | ## Your Turn 15 310 | 311 | Which airlines had the largest arrival delays? Complete the code below. 312 | 313 | 1. Join `airlines` to `flights` 314 | 2. Compute and order the average arrival delays by airline. Display full names, no codes. 315 | 316 | ```{r} 317 | flights %>% 318 | drop_na(arr_delay) %>% 319 | left_join(airlines, by = "carrier") %>% 320 | group_by(name) %>% 321 | summarise(delay = mean(arr_delay)) %>% 322 | arrange(delay) 323 | ``` 324 | 325 | ## Different names 326 | 327 | ```{r} 328 | band %>% left_join(instrument2, by = c("name" = "artist")) 329 | ``` 330 | 331 | ## Your Turn 16 332 | 333 | How many airports in `airports` are serviced by flights originating in New York (i.e. flights in our dataset?) Notice that the column to join on is named `faa` in the **airports** data set and `dest` in the **flights** data set. 
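Before the solution, a quick reminder (an aside, not part of the exercise): `semi_join()` filters the first table down to the rows that have a match in the second table, and it never adds columns from the second table. With the toy data defined above:

```{r}
# Keep only the band members who also appear in the instrument table
band %>% semi_join(instrument, by = "name")
```

That is why it is the right tool here: we want to keep airports, not add flight columns to them.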
334 | 335 | 336 | ```{r} 337 | airports %>% 338 | semi_join(flights, by = c("faa" = "dest")) %>% 339 | distinct(faa) 340 | ``` 341 | 342 | *** 343 | 344 | # Take aways 345 | 346 | * Extract variables with `select()` 347 | * Extract cases with `filter()` 348 | * Arrange cases, with `arrange()` 349 | 350 | * Make tables of summaries with `summarise()` 351 | * Make new variables, with `mutate()` 352 | * Do groupwise operations with `group_by()` 353 | 354 | * Connect operations with `%>%` 355 | 356 | * Use `left_join()`, `right_join()`, `full_join()`, or `inner_join()` to join datasets 357 | * Use `semi_join()` or `anti_join()` to filter datasets against each other 358 | -------------------------------------------------------------------------------- /solutions/03-Tidy-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Tidy -- Solutions" 3 | output: 4 | github_document: 5 | df_print: tibble 6 | html_document: 7 | df_print: paged 8 | --- 9 | 10 | 11 | 12 | 13 | ```{r setup} 14 | library(tidyverse) 15 | library(babynames) 16 | 17 | # Toy data 18 | cases <- tribble( 19 | ~Country, ~"2011", ~"2012", ~"2013", 20 | "FR", 7000, 6900, 7000, 21 | "DE", 5800, 6000, 6200, 22 | "US", 15000, 14000, 13000 23 | ) 24 | pollution <- tribble( 25 | ~city, ~size, ~amount, 26 | "New York", "large", 23, 27 | "New York", "small", 14, 28 | "London", "large", 22, 29 | "London", "small", 16, 30 | "Beijing", "large", 121, 31 | "Beijing", "small", 121 32 | ) 33 | bp_systolic <- tribble( 34 | ~ subject_id, ~ time_1, ~ time_2, ~ time_3, 35 | 1, 120, 118, 121, 36 | 2, 125, 131, NA, 37 | 3, 141, NA, NA 38 | ) 39 | bp_systolic2 <- tribble( 40 | ~ subject_id, ~ time, ~ systolic, 41 | 1, 1, 120, 42 | 1, 2, 118, 43 | 1, 3, 121, 44 | 2, 1, 125, 45 | 2, 2, 131, 46 | 3, 1, 141 47 | ) 48 | ``` 49 | 50 | ## Tidy and untidy data 51 | 52 | `table1` is tidy: 53 | ```{r} 54 | table1 55 | ``` 56 | 57 | For example, it's easy to add a rate column with `mutate()`: 58 | ```{r} 59 | table1 %>% 60 | mutate(rate = cases/population) 61 | ``` 62 | 63 | `table2` isn't tidy, the count column really contains two variables: 64 | ```{r} 65 | table2 66 | ``` 67 | 68 | It makes it very hard to manipulate. 69 | 70 | ## Your Turn 1 71 | 72 | Is `bp_systolic` tidy? 73 | 74 | ```{r} 75 | bp_systolic2 76 | ``` 77 | 78 | ## Your Turn 2 79 | 80 | Using `bp_systolic2` with `group_by()`, and `summarise()`: 81 | 82 | * Find the average systolic blood pressure for each subject 83 | * Find the last time each subject was measured 84 | 85 | ```{r} 86 | bp_systolic2 %>% 87 | group_by(subject_id) %>% 88 | summarise(avg_bp = mean(systolic), 89 | last_time = max(time)) 90 | ``` 91 | 92 | ## Your Turn 3 93 | 94 | On a sheet of paper, draw how the cases data set would look if it had the same values grouped into three columns: **country**, **year**, **n** 95 | 96 | ----------------------------- 97 | country year cases 98 | ------------- ------ -------- 99 | Afghanistan 1999 745 100 | 101 | Afghanistan 2000 2666 102 | 103 | Brazil 1999 37737 104 | 105 | Brazil 2000 80488 106 | 107 | China 1999 212258 108 | 109 | China 2000 213766 110 | ----------------------------- 111 | 112 | ## Your Turn 4 113 | 114 | Use `gather()` to reorganize `table4a` into three columns: **country**, **year**, and **cases**. 
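As a warm-up (a small aside, not part of the exercise), here is the same reshaping on the toy `cases` data from the setup chunk. `gather()` takes the name of the new key column, the name of the new value column, and then the columns to collapse:

```{r}
# Collapse the three year columns into a key column (year) and a value column (n)
cases %>%
  gather(key = "year", value = "n", -Country)
```

The real exercise works the same way, just with `table4a`.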
115 | 116 | ```{r} 117 | table4a %>% 118 | gather(key = "year", 119 | value = "cases", -country) %>% 120 | arrange(country) 121 | ``` 122 | 123 | ## Your Turn 5 124 | 125 | On a sheet of paper, draw how `pollution` would look if it had the same values grouped into three columns: **city**, **large**, **small** 126 | 127 | -------------------------- 128 | city large small 129 | ---------- ------- ------- 130 | Beijing 121 121 131 | 132 | London 22 16 133 | 134 | New York 23 14 135 | -------------------------- 136 | 137 | ## Your Turn 6 138 | 139 | Use `spread()` to reorganize `table2` into four columns: **country**, **year**, **cases**, and **population**. 140 | 141 | ```{r} 142 | table2 %>% 143 | spread(key = type, value = count) 144 | ``` 145 | 146 | *** 147 | 148 | # Take Aways 149 | 150 | Data comes in many formats but R prefers just one: _tidy data_. 151 | 152 | A data set is tidy if and only if: 153 | 154 | 1. Every variable is in its own column 155 | 2. Every observation is in its own row 156 | 3. Every value is in its own cell (which follows from the above) 157 | 158 | What is a variable and an observation may depend on your immediate goal. -------------------------------------------------------------------------------- /solutions/04-Case-Study-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Case Study: Friday the 13th Effect (Solution)' 3 | output: 4 | html_document: 5 | df_print: paged 6 | github_document: 7 | df_print: tibble 8 | --- 9 | 10 | 11 | 12 | ```{r setup} 13 | library(fivethirtyeight) 14 | library(tidyverse) 15 | ``` 16 | 17 | ## Task 18 | 19 | Reproduce this figure from fivethirtyeight's article [*Some People Are Too Superstitious To Have A Baby On Friday The 13th*](https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/): 20 | 21 | ![](resources/bialik-fridaythe13th-2.png) 22 | 23 | ## Data 24 | 25 | In the `fivethiryeight` package there are two datasets containing birth data, but for now let's just work with one `US_births_1994_2003`. Note that since we have data from 1994-2003, our results may differ somewhat from the figure based on 1994-2014. 26 | 27 | ## Your Turn 1 28 | 29 | With your neighbour, brainstorm the steps needed to get the data in a form ready to make the plot. 30 | 31 | ```{r} 32 | US_births_1994_2003 33 | ``` 34 | 35 | ## Some overviews of the data 36 | 37 | Whole time series: 38 | ```{r} 39 | ggplot(US_births_1994_2003, aes(x = date, y = births)) + 40 | geom_line() 41 | ``` 42 | There is so much fluctuation it's really hard to see what is going on. 43 | 44 | Let's try just looking at one year: 45 | ```{r} 46 | US_births_1994_2003 %>% 47 | filter(year == 1994) %>% 48 | ggplot(mapping = aes(x = date, y = births)) + 49 | geom_line() 50 | ``` 51 | Strong weekly pattern accounts for most variation. 52 | 53 | ## Strategy 54 | 55 | Use the figure as a guide for what the data should like to make the final plot. We want to end up with something like: 56 | 57 | --------------------------- 58 | day_of_week avg_diff_13 59 | ------------- ------------- 60 | Mon -2.686 61 | 62 | Tues -1.378 63 | 64 | Wed -3.274 65 | 66 | ... ... 
67 | 68 | --------------------------- 69 | 70 | There is more than one way to get there, but we 71 | ll roughly follow this strategy: 72 | 73 | * Get just the data for the 6th, 13th, and 20th 74 | * Calculate variable of interest: 75 | * (For each month/year): 76 | * Find average births on 6th and 20th 77 | * Find percentage difference between births on 13th and average births on 6th and 20th 78 | 79 | * Average percent difference by day of the week 80 | * Create plot 81 | 82 | ## Your Turn 2 83 | 84 | Extract just the 6th, 13th and 20th of each month: 85 | 86 | ```{r} 87 | US_births_1994_2003 %>% 88 | select(-date) %>% 89 | filter(date_of_month %in% c(6, 13, 20)) 90 | ``` 91 | 92 | ## Your Turn 3 93 | 94 | Which arrangement is tidy? 95 | 96 | **Option 1:** 97 | 98 | ----------------------------------------------------- 99 | year month date_of_month day_of_week births 100 | ------ ------- --------------- ------------- -------- 101 | 1994 1 6 Thurs 11406 102 | 103 | 1994 1 13 Thurs 11212 104 | 105 | 1994 1 20 Thurs 11682 106 | ----------------------------------------------------- 107 | 108 | **Option 2:** 109 | 110 | ---------------------------------------------------- 111 | year month day_of_week 6 13 20 112 | ------ ------- ------------- ------- ------- ------- 113 | 1994 1 Thurs 11406 11212 11682 114 | ---------------------------------------------------- 115 | 116 | (**Hint:** think about our next step *"Find the percent difference between the 13th and the average of the 6th and 12th"*. In which layout will this be easier using our tidy tools?) 117 | 118 | **Solution**: Option 2, since then we can easily use `mutate()`. 119 | 120 | ## Your Turn 4 121 | 122 | Tidy the filtered data to have the days in columns. 123 | 124 | ```{r} 125 | US_births_1994_2003 %>% 126 | select(-date) %>% 127 | filter(date_of_month %in% c(6, 13, 20)) %>% 128 | spread(date_of_month, births) 129 | ``` 130 | 131 | ## Your Turn 5 132 | 133 | Now use `mutate()` to add columns for: 134 | 135 | * The average of the births on the 6th and 20th 136 | * The percentage difference between the number of births on the 13th and the average of the 6th and 20th 137 | 138 | ```{r} 139 | US_births_1994_2003 %>% 140 | select(-date) %>% 141 | filter(date_of_month %in% c(6, 13, 20)) %>% 142 | spread(date_of_month, births) %>% 143 | mutate( 144 | avg_6_20 = (`6` + `20`)/2, 145 | diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 146 | ) 147 | ``` 148 | 149 | ## A little additional exploring 150 | 151 | Now we have a percent difference between the 13th and the 6th and 20th of each month, it's probably worth exploring a little (at the very least to check our calculations seem reasonable). 152 | 153 | To make it a little easier let's assign our current data to a variable 154 | ```{r} 155 | births_diff_13 <- US_births_1994_2003 %>% 156 | select(-date) %>% 157 | filter(date_of_month %in% c(6, 13, 20)) %>% 158 | spread(date_of_month, births) %>% 159 | mutate( 160 | avg_6_20 = (`6` + `20`)/2, 161 | diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 162 | ) 163 | ``` 164 | 165 | Then take a look 166 | ```{r} 167 | births_diff_13 %>% 168 | ggplot(mapping = aes(day_of_week, diff_13)) + 169 | geom_point() 170 | ``` 171 | 172 | Looks like we are on the right path. There's a big outlier one Monday 173 | ```{r} 174 | births_diff_13 %>% 175 | filter(day_of_week == "Mon", diff_13 > 10) 176 | ``` 177 | 178 | Seem's to be driven but a particularly low number of births on the 6th of Sep 1999. Maybe a holiday effect? Labour Day was of the 6th of Sep that year. 
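One quick way to check that hunch (an extra aside, not in the original write-up) is to pull out the 6th of September for every year and compare:

```{r}
# Births on the 6th of September, 1994-2003; 1999, when it fell on Labor Day,
# should stand out as unusually low
US_births_1994_2003 %>%
  filter(month == 9, date_of_month == 6) %>%
  select(year, day_of_week, births)
```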
179 | 180 | ## Your Turn 6 181 | 182 | Summarize each day of the week to have mean of diff_13. 183 | 184 | Then, recreate the fivethirtyeight plot. 185 | 186 | ```{r} 187 | US_births_1994_2003 %>% 188 | select(-date) %>% 189 | filter(date_of_month %in% c(6, 13, 20)) %>% 190 | spread(date_of_month, births) %>% 191 | mutate( 192 | avg_6_20 = (`6` + `20`)/2, 193 | diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 194 | ) %>% 195 | group_by(day_of_week) %>% 196 | summarise(avg_diff_13 = mean(diff_13)) %>% 197 | ggplot(aes(x = day_of_week, y = avg_diff_13)) + 198 | geom_bar(stat = "identity") 199 | ``` 200 | 201 | ## Extra Challenges 202 | 203 | * If you wanted to use the `US_births_2000_2014` data instead, what would you need to change in the pipeline? How about using both `US_births_1994_2003` and `US_births_2000_2014`? 204 | 205 | * Try not removing the `date` column. At what point in the pipeline does it cause problems? Why? 206 | 207 | * Can you come up with an alternative way to investigate the Friday the 13th effect? Try it out! 208 | 209 | ## Takeaways 210 | 211 | The power of the tidyverse comes from being able to easily combine functions that do simple things well. -------------------------------------------------------------------------------- /solutions/05-Data-Types-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Types (solutions)" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup} 7 | library(tidyverse) 8 | library(babynames) 9 | library(nycflights13) 10 | library(stringr) 11 | library(forcats) 12 | library(lubridate) 13 | library(hms) 14 | ``` 15 | 16 | ## Your Turn 1 17 | 18 | Use `flights` to create `delayed`, the variable that displays whether a flight was delayed (`arr_delay > 0`). 19 | 20 | Then, remove all rows that contain an NA in `delayed`. 21 | 22 | Finally, create a summary table that shows: 23 | 24 | 1. How many flights were delayed 25 | 2. What proportion of flights were delayed 26 | 27 | ```{r} 28 | flights %>% 29 | mutate(delayed = arr_delay > 0) %>% 30 | drop_na(delayed) %>% 31 | summarise(total = sum(delayed), prop = mean(delayed)) 32 | ``` 33 | 34 | 35 | ## Your Turn 2 36 | 37 | In your group, fill in the blanks to: 38 | 39 | 1. Isolate the last letter of every name and create a logical variable that displays whether the last letter is one of "a", "e", "i", "o", "u", or "y". 40 | 2. Use a weighted mean to calculate the proportion of children whose name ends in a vowel (by `year` and `sex`) 41 | 3. and then display the results as a line plot. 42 | 43 | ```{r} 44 | babynames %>% 45 | mutate(last = str_sub(name, -1), 46 | vowel = last %in% c("a", "e", "i", "o", "u", "y")) %>% 47 | group_by(year, sex) %>% 48 | summarise(p_vowel = weighted.mean(vowel, n)) %>% 49 | ggplot() + 50 | geom_line(mapping = aes(year, p_vowel, color = sex)) 51 | ``` 52 | 53 | ## Your Turn 3 54 | 55 | Repeat the previous exercise, some of whose code is below, to make a sensible graph of average TV consumption by marital status. 56 | 57 | ```{r} 58 | gss_cat %>% 59 | drop_na(tvhours) %>% 60 | group_by(marital) %>% 61 | summarise(tvhours = mean(tvhours)) %>% 62 | ggplot(aes(tvhours, fct_reorder(marital, tvhours))) + 63 | geom_point() 64 | ``` 65 | 66 | ## Your Turn 4 67 | 68 | Do you think liberals or conservatives watch more TV? 69 | Compute average tv hours by party ID an then plot the results. 
70 | 71 | ```{r} 72 | gss_cat %>% 73 | drop_na(tvhours) %>% 74 | group_by(partyid) %>% 75 | summarise(tvhours = mean(tvhours)) %>% 76 | ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) + 77 | geom_point() + 78 | labs(y = "partyid") 79 | ``` 80 | 81 | ## Your Turn 5 82 | 83 | What is the best time of day to fly? 84 | 85 | Use the `hour` and `minute` variables in `flights` to compute the time of day for each flight as an hms. Then use a smooth line to plot the relationship between time of day and `arr_delay`. 86 | 87 | ```{r} 88 | flights %>% 89 | mutate(time = hms(hour = hour, minute = minute)) %>% 90 | ggplot(aes(time, arr_delay)) + 91 | geom_point(alpha = 0.2) + geom_smooth() 92 | ``` 93 | 94 | ## Your Turn 6 95 | 96 | Fill in the blanks to: 97 | 98 | Extract the day of the week of each flight (as a full name) from `time_hour`. 99 | 100 | Calculate the average `arr_delay` by day of the week. 101 | 102 | Plot the results as a column chart (bar chart) with `geom_col()`. 103 | 104 | ```{r} 105 | flights %>% 106 | mutate(weekday = wday(time_hour, label = TRUE, abbr = FALSE)) %>% 107 | group_by(weekday) %>% 108 | drop_na(arr_delay) %>% 109 | summarise(avg_delay = mean(arr_delay)) %>% 110 | ggplot() + 111 | geom_col(mapping = aes(x = weekday, y = avg_delay)) 112 | ``` 113 | 114 | *** 115 | 116 | # Take Aways 117 | 118 | Dplyr gives you three _general_ functions for manipulating data: `mutate()`, `summarise()`, and `group_by()`. Augment these with functions from the packages below, which focus on specific types of data. 119 | 120 | Package | Data Type 121 | --------- | -------- 122 | stringr | strings 123 | forcats | factors 124 | hms | times 125 | lubridate | dates and times 126 | 127 | -------------------------------------------------------------------------------- /solutions/06-Iterate-solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Iteration (solutions)" 3 | output: 4 | html_document: 5 | df_print: paged 6 | github_document: 7 | df_print: tibble 8 | --- 9 | 10 | 11 | 12 | ```{r setup} 13 | library(tidyverse) 14 | 15 | # Toy data 16 | set.seed(1000) 17 | exams <- list( 18 | student1 = round(runif(10, 50, 100)), 19 | student2 = round(runif(10, 50, 100)), 20 | student3 = round(runif(10, 50, 100)), 21 | student4 = round(runif(10, 50, 100)), 22 | student5 = round(runif(10, 50, 100)) 23 | ) 24 | 25 | extra_credit <- list(0, 0, 10, 10, 15) 26 | ``` 27 | 28 | ## Your Turn 1 29 | 30 | What kind of object is `mod`? Why are models stored as this kind of object? 31 | 32 | ```{r} 33 | mod <- lm(price ~ carat + cut + color + clarity, data = diamonds) 34 | # View(mod) 35 | ``` 36 | 37 | `mod` is a list. A list is used because we need to store lots of heterogeneous information. 38 | 39 | ## Quiz 40 | 41 | What's the difference between a list and an **atomic** vector? 42 | 43 | Atomic vectors are: "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw" vectors. 44 | 45 | Lists can hold data of different types and different lengths, we can even put lists inside other lists. 46 | 47 | ## Your Turn 2 48 | 49 | Here is a list: 50 | 51 | ```{r} 52 | a_list <- list(num = c(8, 9), 53 | log = TRUE, 54 | cha = c("a", "b", "c")) 55 | ``` 56 | 57 | Here are two subsetting commands. Do they return the same values? 
Run the code chunk above, _and then_ run the code chunks below to confirm 58 | 59 | ```{r} 60 | a_list["num"] 61 | ``` 62 | 63 | ```{r} 64 | a_list$num 65 | ``` 66 | 67 | ## Your Turn 3 68 | 69 | What will each of these return? Run the code chunks to confirm. 70 | 71 | ```{r} 72 | vec <- c(-2, -1, 0, 1, 2) 73 | abs(vec) 74 | ``` 75 | 76 | `abs()` returns the absolute value of each element. 77 | 78 | ```{r, error = TRUE} 79 | lst <- list(-2, -1, 0, 1, 2) 80 | abs(lst) 81 | ``` 82 | 83 | Out intent might be to take the absolute value of each element, but we get an error, because `abs()` doens't know how to handle a list. 84 | 85 | ## Your Turn 4 86 | 87 | Run the code in the chunks. What does it return? 88 | 89 | ```{r} 90 | list(student1 = mean(exams$student1), 91 | student2 = mean(exams$student2), 92 | student3 = mean(exams$student3), 93 | student4 = mean(exams$student4), 94 | student5 = mean(exams$student5)) 95 | ``` 96 | 97 | This chunk manually iterates over the elements of `exams` taking the mean of each element, and returning the results in a list. 98 | 99 | ```{r} 100 | library(purrr) 101 | map(exams, mean) 102 | ``` 103 | 104 | This does the exact same thing, but automatically. 105 | 106 | 107 | ## Your Turn 5 108 | 109 | Calculate the variance (`var()`) of each student’s exam grades. 110 | 111 | ```{r} 112 | exams %>% map(var) 113 | ``` 114 | 115 | ## Your Turn 6 116 | 117 | Calculate the max grade (max())for each student. Return the result as a vector. 118 | 119 | ```{r} 120 | exams %>% map_dbl(max) 121 | ``` 122 | 123 | ## Your Turn 7 124 | 125 | Write a function that counts the best exam twice and then takes the average. Use it to grade all of the students. 126 | 127 | 1. Write code that solves the problem for a real object 128 | 2. Wrap the code in `function(){}` to save it 129 | 3. Add the name of the real object as the function argument 130 | 131 | ```{r} 132 | double_best <- function(x) { 133 | (sum(x) + max(x)) / (length(x) + 1) 134 | } 135 | 136 | exams %>% 137 | map_dbl(double_best) 138 | ``` 139 | 140 | ### Your Turn 8 141 | 142 | Compute a final grade for each student, where the final grade is the average test score plus any `extra_credit` assigned to the student. Return the results as a double (i.e. numeric) vector. 143 | 144 | ```{r} 145 | exams %>% 146 | map2_dbl(extra_credit, function(x, y) mean(x) + y) 147 | ``` 148 | 149 | 150 | *** 151 | 152 | # Take Aways 153 | 154 | Lists are a useful way to organize data, but you need to arrange manually for functions to iterate over the elements of a list. 155 | 156 | You can do this with the `map()` family of functions in the purrr package. 157 | 158 | To write a function, 159 | 160 | 1. Write code that solves the problem for a real object 161 | 2. Wrap the code in `function(){}` to save it 162 | 3. Add the name of the real object as the function argument 163 | 164 | This sequence will help prevent bugs in your code (and reduce the time you spend correcting bugs). 
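As a minimal sketch of that three-step pattern (using a made-up `curve_grade()` helper, not one of the exercises above):

```{r}
# Step 1: write code that solves the problem for one real object
mean(exams$student1) + 5

# Steps 2 and 3: wrap the code in function(){} and make the real object
# the function's argument
curve_grade <- function(scores) {
  mean(scores) + 5
}

# The new function drops straight into map_dbl()
exams %>% map_dbl(curve_grade)
```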
165 | -------------------------------------------------------------------------------- /solutions/07-Model-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Model (solutions)" 3 | output: 4 | html_document: 5 | df_print: paged 6 | github_document: 7 | df_print: tibble 8 | --- 9 | 10 | ```{r setup, message=FALSE} 11 | library(tidyverse) 12 | library(modelr) 13 | library(broom) 14 | 15 | wages <- heights %>% filter(income > 0) 16 | ``` 17 | 18 | ## Your Turn 1 19 | 20 | Fit the model on the slide and then examine the output. What does it look like? 21 | 22 | ```{r} 23 | mod_e <- lm(log(income) ~ education, data = wages) 24 | mod_e 25 | ``` 26 | 27 | ## Your Turn 2 28 | 29 | Use a pipe to model `log(income)` against `height`. Then use broom and dplyr functions to extract: 30 | 31 | 1. The **coefficient estimates** and their related statistics 32 | 2. The **adj.r.squared** and **p.value** for the overall model 33 | 34 | ```{r} 35 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 36 | mod_h %>% 37 | tidy() 38 | 39 | mod_h %>% 40 | glance() %>% 41 | select(adj.r.squared, p.value) 42 | ``` 43 | 44 | ## Your Turn 3 45 | 46 | Model `log(income)` against `education` _and_ `height`. Do the coefficients change? 47 | 48 | ```{r} 49 | mod_eh <- wages %>% 50 | lm(log(income) ~ education + height, data = .) 51 | 52 | mod_eh %>% 53 | tidy() 54 | ``` 55 | 56 | ## Your Turn 4 57 | 58 | Model `log(income)` against `education` and `height` and `sex`. Can you interpret the coefficients? 59 | 60 | ```{r} 61 | mod_ehs <- wages %>% 62 | lm(log(income) ~ education + height + sex, data = .) 63 | 64 | mod_ehs %>% 65 | tidy() 66 | ``` 67 | 68 | ## Your Turn 5 69 | 70 | Use a broom function and ggplot2 to make a line graph of `height` vs `.fitted` for our heights model, `mod_h`. 71 | 72 | _Bonus: Overlay the plot on the original data points._ 73 | 74 | ```{r} 75 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 76 | 77 | mod_h %>% 78 | augment(data = wages) %>% 79 | ggplot(mapping = aes(x = height, y = .fitted)) + 80 | geom_point(mapping = aes(y = log(income)), alpha = 0.1) + 81 | geom_line(color = "blue") 82 | ``` 83 | 84 | ## Your Turn 6 85 | 86 | Repeat the process to make a line graph of `height` vs `.fitted` colored by `sex` for model mod_ehs. Are the results interpretable? Add `+ facet_wrap(~education)` to the end of your code. What happens? 87 | 88 | ```{r} 89 | mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) 90 | 91 | mod_ehs %>% 92 | augment(data = wages) %>% 93 | ggplot(mapping = aes(x = height, y = .fitted, color = sex)) + 94 | geom_line() + 95 | facet_wrap(~ education) 96 | ``` 97 | 98 | ## Your Turn 7 99 | 100 | Use one of `spread_predictions()` or `gather_predictions()` to make a line graph of `height` vs `pred` colored by `model` for each of mod_h, mod_eh, and mod_ehs. Are the results interpretable? 101 | 102 | Add `+ facet_grid(sex ~ education)` to the end of your code. What happens? 103 | 104 | ```{r warning = FALSE, message = FALSE} 105 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 106 | mod_eh <- wages %>% lm(log(income) ~ education + height, data = .) 107 | mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) 
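# gather_predictions() repeats the data once per model and adds two columns:
# `model` (which model made the prediction) and `pred` (the predicted value).
# That long format is what lets ggplot2 draw one colored line per model below.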
108 | 109 | wages %>% 110 | gather_predictions(mod_h, mod_eh, mod_ehs) %>% 111 | ggplot(mapping = aes(x = height, y = pred, color = model)) + 112 | geom_line() + 113 | facet_grid(sex ~ education) 114 | ``` 115 | 116 | *** 117 | 118 | # Take Aways 119 | 120 | * Use `glance()`, `tidy()`, and `augment()` from the **broom** package to return model values in a data frame. 121 | 122 | * Use `add_predictions()` or `gather_predictions()` or `spread_predictions()` from the **modelr** package to visualize predictions. 123 | 124 | * Use `add_residuals()` or `gather_residuals()` or `spread_residuals()` from the **modelr** package to visualize residuals. 125 | 126 | -------------------------------------------------------------------------------- /solutions/08-Organize-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Organize with List Columns" 3 | output: 4 | html_document: 5 | df_print: paged 6 | github_document: 7 | df_print: tibble 8 | --- 9 | 10 | 11 | 12 | ```{r setup} 13 | library(tidyverse) 14 | library(gapminder) 15 | library(broom) 16 | 17 | nz <- gapminder %>% 18 | filter(country == "New Zealand") 19 | us <- gapminder %>% 20 | filter(country == "United States") 21 | ``` 22 | 23 | ## Your turn 1 24 | 25 | How has life expectancy changed in other countries? 26 | Make a line plot of lifeExp vs. year grouped by country. 27 | Set alpha to 0.2, to see the results better. 28 | 29 | ```{r} 30 | gapminder %>% 31 | ggplot(mapping = aes(x = year, y = lifeExp, group = country)) + 32 | geom_line(alpha = 0.2) 33 | ``` 34 | 35 | ## Quiz 36 | 37 | How is a data frame/tibble similar to a list? 38 | 39 | ```{r} 40 | gapminder_sm <- gapminder[1:5, ] 41 | ``` 42 | 43 | It is a list! Columns are like elements of a list 44 | 45 | You can extract them with `$` of `[[` 46 | ```{r} 47 | gapminder_sm$country 48 | gapminder_sm[["country"]] 49 | ``` 50 | 51 | Or get a new smaller list with `[`: 52 | ```{r} 53 | gapminder_sm["country"] 54 | ``` 55 | 56 | ## Quiz 57 | 58 | If one of the elements of a list can be another list, 59 | can one of the columns of a data frame be another list? 60 | 61 | **Yes!**. 62 | 63 | ```{r} 64 | tibble( 65 | num = c(1, 2, 3), 66 | cha = c("one", "two", "three"), 67 | listcol = list(1, c("1", "two", "FALSE"), FALSE) 68 | ) 69 | ``` 70 | 71 | And we call it a **list column**. 72 | 73 | ## Your turn 2 74 | 75 | Run this chunk: 76 | ```{r} 77 | gapminder_nested <- gapminder %>% 78 | group_by(country) %>% 79 | nest() 80 | 81 | fit_model <- function(df) lm(lifeExp ~ year, data = df) 82 | 83 | gapminder_nested <- gapminder_nested %>% 84 | mutate(model = map(data, fit_model)) 85 | 86 | get_rsq <- function(mod) glance(mod)$r.squared 87 | 88 | gapminder_nested <- gapminder_nested %>% 89 | mutate(r.squared = map_dbl(model, get_rsq)) 90 | ``` 91 | 92 | Then filter `gapminder_nested` to find the countries with r.squared less than 0.5. 93 | 94 | ```{r} 95 | gapminder_nested %>% 96 | filter(r.squared < 0.5) 97 | ``` 98 | 99 | ## Your Turn 3 100 | 101 | Edit the code in the chunk provided to instead find and plot countries with a slope above 0.6 years/year. 
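The solution below leans on `unnest()`, which undoes `nest()`: it expands a list column so that each nested data frame gets its rows back in the top-level table. A tiny sketch with a throwaway tibble (not part of the exercise):

```{r}
# A two-row tibble with a list column holding a small data frame per group...
nested_toy <- tibble(
  g = c("a", "b"),
  data = list(tibble(x = 1:2), tibble(x = 3:4))
)

# ...unnest() expands it back to one row per x value, repeating g as needed
nested_toy %>% unnest(data)
```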
102 | 103 | ```{r} 104 | get_slope <- function(mod) { 105 | tidy(mod) %>% filter(term == "year") %>% pull(estimate) 106 | } 107 | 108 | # Add a new column with each model's slope 109 | gapminder_nested <- gapminder_nested %>% 110 | mutate(slope = map_dbl(model, get_slope)) 111 | 112 | # Keep only the countries with a slope above 0.6 113 | big_slope <- gapminder_nested %>% 114 | filter(slope > 0.6) 115 | 116 | # unnest and plot the result 117 | unnest(big_slope, data) %>% 118 | ggplot(aes(x = year, y = lifeExp)) + 119 | geom_line(aes(color = country)) 120 | ``` 121 | 122 | ## Your Turn 4 123 | 124 | **Challenge:** 125 | 126 | 1. Create your own copy of `gapminder_nested` and then add one more list column: `output`, which contains the output of `augment()` for each model. 127 | 128 | 2. Plot the residuals against time for the countries with small r-squared. 129 | 130 | ```{r} 131 | charlotte_gapminder <- gapminder_nested 132 | 133 | charlotte_gapminder %>% 134 | mutate(output = model %>% map(augment)) %>% 135 | unnest(output) %>% 136 | filter(r.squared < 0.5) %>% 137 | ggplot() + 138 | geom_line(aes(year, .resid, color = country)) 139 | 140 | ``` 141 | 142 | # Take away 143 | 144 | --------------------------------------------------------------------------------