├── .gitignore ├── 00-Getting-started.Rmd ├── 01-Visualize.Rmd ├── 02-Transform.Rmd ├── 03-Tidy.Rmd ├── 04-Case-Study.Rmd ├── 05-Data-Types.Rmd ├── 06-Iterate.Rmd ├── 07-Model.Rmd ├── 08-Organize.Rmd ├── 99-Setup.md ├── README.md ├── data-science-in-the-tidyverse.Rproj ├── email-to-participants.md ├── resources ├── 01-setup-login.png ├── 02-setup-temp-project.png ├── 04-setup-rproj-file.png ├── 05-setup-open-project.png ├── 06-setup-inside-project.png ├── 07-setup-all-done.png └── bialik-fridaythe13th-2.png ├── slides ├── 00-Introduction.pdf ├── 01-Visualize.pdf ├── 02-Transform.pdf ├── 03-Tidy.pdf ├── 04-Case-Study.pdf ├── 05-Data-Types.pdf ├── 06-Iteration.pdf ├── 07-Model.pdf ├── 08-Organize.pdf └── 09-Wrapping-Up.pdf └── solutions ├── 01-Visualize-solutions.Rmd ├── 02-Transform-Solutions.Rmd ├── 03-Tidy-Solutions.Rmd ├── 04-Case-Study-Solutions.Rmd ├── 05-Data-Types-Solutions.Rmd ├── 06-Iterate-solutions.Rmd ├── 07-Model-Solutions.Rmd └── 08-Organize-Solutions.Rmd /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | *.html 6 | /keynotes 7 | -------------------------------------------------------------------------------- /00-Getting-started.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "R Notebook" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(tidyverse) 12 | ``` 13 | 14 | ## R notebooks 15 | 16 | This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 17 | 18 | R code goes in **code chunks**, denoted by three backticks. Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Crtl+Shift+Enter* (Windows) or *Cmd+Shift+Enter* (Mac). 19 | 20 | ```{r} 21 | ggplot(data = mpg) + 22 | geom_point(mapping = aes(x = displ, y = hwy)) 23 | ``` 24 | 25 | Add a new chunk by clicking the *Insert* button on the toolbar, then selecting *R* or by pressing *Ctrl+Alt+I* (Windows) or *Cmd+Option+I* (Mac). 26 | 27 | When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Ctrl+Shift+K* (Windows) or *Cmd+Shift+K* (Mac) to preview the HTML file). 28 | 29 | The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed. 30 | 31 | -------------------------------------------------------------------------------- /01-Visualize.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Visualization" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ## Setup 11 | 12 | The first chunk in an R Notebook is usually titled "setup," and by convention includes the R packages you want to load. Remember, in order to use an R package you have to run some `library()` code every session. Execute these lines of code to load the packages. 
13 | 14 | ```{r setup} 15 | library(ggplot2) 16 | library(fivethirtyeight) 17 | ``` 18 | 19 | ## Bechdel test data 20 | 21 | We're going to start by playing with data collected by the website FiveThirtyEight on movies and [the Bechdel test](https://en.wikipedia.org/wiki/Bechdel_test). 22 | 23 | To begin, let's just preview our data. There are a couple ways to do that. One is just to type the name of the data and execute it like a piece of code. 24 | 25 | ```{r} 26 | bechdel 27 | ``` 28 | 29 | Notice that you can page through to see more of the dataset. 30 | 31 | Sometimes, people prefer to see their data in a more spreadsheet-like format, and RStudio provides a way to do that. Go to the Console and type `View(bechdel)` to see the data preview. 32 | 33 | (An aside-- `View` is a special function. Since it makes something happen in the RStudio interface, it doesn't work properly in R Notebooks. Most R functions have names that start with lowercase letters, so the uppercase "V" is there to remind you of its special status.) 34 | 35 | 36 | 37 | ## Consider 38 | What relationship do you expect to see between movie budget (budget) and domestic gross(domgross)? 39 | 40 | ## Your Turn 1 41 | 42 | Run the code on the slide to make a graph. Pay strict attention to spelling, capitalization, and parentheses! 43 | 44 | ```{r} 45 | 46 | ``` 47 | 48 | ## Your Turn 2 49 | 50 | Add `color`, `size`, `alpha`, and `shape` aesthetics to your graph. Experiment. 51 | 52 | ```{r} 53 | ggplot(data = bechdel) + 54 | geom_point(mapping = aes(x = budget, y = domgross)) 55 | ``` 56 | 57 | ## Set vs map 58 | 59 | ```{r} 60 | ggplot(bechdel) + 61 | geom_point(mapping = aes(x = budget, y = domgross), color="blue") 62 | ``` 63 | 64 | ## Your Turn 3 65 | 66 | Replace this scatterplot with one that draws boxplots. Use the cheatsheet. Try your best guess. 67 | 68 | ```{r} 69 | ggplot(data = bechdel) + geom_point(aes(x = clean_test, y = budget)) 70 | ``` 71 | 72 | ## Your Turn 4 73 | 74 | Make a histogram of the `budget` variable from `bechdel`. 75 | 76 | ```{r} 77 | 78 | ``` 79 | 80 | ## Your Turn 5 81 | Try to find a better `binwidth` for `budget`. 82 | 83 | ```{r} 84 | 85 | ``` 86 | 87 | ## Your Turn 6 88 | 89 | Make a density plot of `budget` colored by `clean_test`. 90 | 91 | ```{r} 92 | 93 | ``` 94 | 95 | ## Your Turn 7 96 | 97 | Make a barchart of `clean_test` colored by `clean_test`. 98 | 99 | ```{r} 100 | 101 | ``` 102 | 103 | 104 | ## Your Turn 8 105 | 106 | Predict what this code will do. Then run it. 107 | 108 | ```{r} 109 | ggplot(data = bechdel) + 110 | geom_point(mapping = aes(x = budget, y = domgross)) + 111 | geom_smooth(mapping = aes(x = budget, y = domgross)) 112 | ``` 113 | 114 | ## global vs local 115 | 116 | ```{r} 117 | ggplot(data = bechdel, mapping = aes(x = budget, y = domgross)) + 118 | geom_point(mapping = aes(color = clean_test)) + 119 | geom_smooth() 120 | ``` 121 | 122 | ```{r} 123 | ggplot(data = bechdel, mapping = aes(x = budget, y = domgross)) + 124 | geom_point(mapping = aes(color = clean_test)) + 125 | geom_smooth(data = filter(bechdel, clean_test == "ok")) 126 | ``` 127 | 128 | 129 | 130 | ## Your Turn 131 | 132 | What does `getwd()` return? 133 | 134 | ```{r} 135 | 136 | ``` 137 | 138 | ## Your Turn 9 139 | 140 | Save the last plot and then locate it in the files pane. If you run your `ggsave()` code inside this notebook, the image will be saved in the same directory as your .Rmd file (likely, project -> code), but if you run `ggsave()` in the Console it will be in your working directory. 
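For reference, a minimal call looks something like this (the filename is just a placeholder — use whatever name you like):

```{r eval = FALSE}
# Saves the most recently displayed plot to the given file;
# the name and extension here are only an example
ggsave("my-plot.png")
```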
141 | 142 | ```{r} 143 | 144 | ``` 145 | 146 | *** 147 | 148 | # Take aways 149 | 150 | You can use this code template to make thousands of graphs with **ggplot2**. 151 | 152 | ```{r eval = FALSE} 153 | ggplot(data = ) + 154 | (mapping = aes()) 155 | ``` -------------------------------------------------------------------------------- /02-Transform.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Transform Data" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(dplyr) 12 | library(babynames) 13 | library(nycflights13) 14 | library(skimr) 15 | ``` 16 | 17 | ## Babynames 18 | 19 | ```{r} 20 | babynames 21 | skim(babynames) 22 | skim_with(integer = list(p25 = NULL, p75=NULL)) 23 | ``` 24 | 25 | 26 | ## Your Turn 1 27 | Run the skim_with() command, and then try skimming babynames again to see how the output is different 28 | ```{r} 29 | 30 | ``` 31 | 32 | ## Select 33 | 34 | ```{r} 35 | select(babynames, name, prop) 36 | ``` 37 | 38 | ## Your Turn 2 39 | 40 | Alter the code to select just the `n` column: 41 | 42 | ```{r} 43 | select(babynames, name, prop) 44 | ``` 45 | 46 | 47 | ## Consider 48 | 49 | Which of these is NOT a way to select the `name` and `n` columns together? 50 | 51 | ```{r} 52 | select(babynames, -c(year, sex, prop)) 53 | select(babynames, name:n) 54 | select(babynames, starts_with("n")) 55 | select(babynames, ends_with("n")) 56 | ``` 57 | 58 | ## Filter 59 | 60 | ```{r} 61 | filter(babynames, name == "Amelia") 62 | ``` 63 | 64 | ## Your Turn 3 65 | 66 | Show: 67 | 68 | * All of the names where prop is greater than or equal to 0.08 69 | * All of the children named "Sea" 70 | * All of the names that have a missing value for `n` 71 | 72 | ```{r} 73 | filter(babynames, is.na(n)) 74 | 75 | ``` 76 | 77 | ## Your Turn 4 78 | 79 | Use Boolean operators to alter the code below to return only the rows that contain: 80 | 81 | * Girls named Sea 82 | * Names that were used by exactly 5 or 6 children in 1880 83 | * Names that are one of Acura, Lexus, or Yugo 84 | 85 | ```{r} 86 | filter(babynames, name == "Sea" | name == "Anemone") 87 | ``` 88 | 89 | ## Arrange 90 | 91 | ```{r} 92 | arrange(babynames, n) 93 | ``` 94 | 95 | ## Your Turn 5 96 | 97 | Arrange babynames by `n`. Add `prop` as a second (tie breaking) variable to arrange on. Can you tell what the smallest value of `n` is? 98 | 99 | ```{r} 100 | 101 | ``` 102 | 103 | ## desc 104 | 105 | ```{r} 106 | arrange(babynames, desc(n)) 107 | ``` 108 | 109 | ## Your Turn 6 110 | 111 | Use `desc()` to find the names with the highest prop. 112 | Then, use `desc()` to find the names with the highest n. 113 | 114 | ```{r} 115 | 116 | ``` 117 | 118 | ## Steps and the pipe 119 | 120 | ```{r} 121 | babynames %>% 122 | filter(year == 2015, sex == "M") %>% 123 | select(name, n) %>% 124 | arrange(desc(n)) 125 | ``` 126 | 127 | ## Your Turn 7 128 | 129 | Use `%>%` to write a sequence of functions that: 130 | 131 | 1. Filter babynames to just the girls that were born in 2015 132 | 2. Select the `name` and `n` columns 133 | 3. Arrange the results so that the most popular names are near the top. 134 | 135 | ```{r} 136 | 137 | ``` 138 | 139 | ## Your Turn 8 140 | 141 | 1. Trim `babynames` to just the rows that contain your `name` and your `sex` 142 | 2. Trim the result to just the columns that will appear in your graph (not strictly necessary, but useful practice) 143 | 3. 
Plot the results as a line graph with `year` on the x axis and `prop` on the y axis 144 | 145 | ```{r} 146 | 147 | ``` 148 | 149 | ## Your Turn 9 150 | 151 | Use summarise() to compute three statistics about the data: 152 | 153 | 1. The first (minimum) year in the dataset 154 | 2. The last (maximum) year in the dataset 155 | 3. The total number of children represented in the data 156 | 157 | ```{r} 158 | 159 | ``` 160 | 161 | ## Your Turn 10 162 | 163 | Extract the rows where `name == "Khaleesi"`. Then use `summarise()` and a summary functions to find: 164 | 165 | 1. The total number of children named Khaleesi 166 | 2. The first year Khaleesi appeared in the data 167 | 168 | ```{r} 169 | 170 | ``` 171 | 172 | ## Toy data for transforming 173 | 174 | ```{r} 175 | # Toy dataset to use 176 | pollution <- tribble( 177 | ~city, ~size, ~amount, 178 | "New York", "large", 23, 179 | "New York", "small", 14, 180 | "London", "large", 22, 181 | "London", "small", 16, 182 | "Beijing", "large", 121, 183 | "Beijing", "small", 56 184 | ) 185 | ``` 186 | 187 | ## Summarize 188 | 189 | ```{r} 190 | pollution %>% 191 | summarise(mean = mean(amount), sum = sum(amount), n = n()) 192 | ``` 193 | 194 | ```{r} 195 | pollution %>% 196 | group_by(city) %>% 197 | summarise(mean = mean(amount), sum = sum(amount), n = n()) 198 | ``` 199 | 200 | 201 | ## Your Turn 11 202 | 203 | Use `group_by()`, `summarise()`, and `arrange()` to display the ten most popular baby names. Compute popularity as the total number of children of a single gender given a name. 204 | 205 | ```{r} 206 | 207 | ``` 208 | 209 | ## Your Turn 12 210 | 211 | Use grouping to calculate and then plot the number of children born each year over time. 212 | 213 | ```{r} 214 | 215 | ``` 216 | 217 | ## Ungroup 218 | 219 | ```{r} 220 | babynames %>% 221 | group_by(name, sex) %>% 222 | summarise(total = sum(n)) %>% 223 | arrange(desc(total)) 224 | ``` 225 | 226 | ## Mutate 227 | 228 | ```{r} 229 | babynames %>% 230 | mutate(percent = round(prop*100, 2)) 231 | ``` 232 | 233 | ## Your Turn 13 234 | 235 | Use `min_rank()` and `mutate()` to rank each row in `babynames` from largest `n` to lowest `n`. 236 | 237 | ```{r} 238 | 239 | ``` 240 | 241 | ## Your Turn 14 242 | 243 | Compute each name's rank _within its year and sex_. 244 | Then compute the median rank _for each combination of name and sex_, and arrange the results from highest median rank to lowest. 245 | 246 | ```{r} 247 | 248 | ``` 249 | 250 | ## Flights data 251 | ```{r} 252 | flights 253 | skim(flights) 254 | ``` 255 | 256 | ## Toy data 257 | 258 | ```{r} 259 | band <- tribble( 260 | ~name, ~band, 261 | "Mick", "Stones", 262 | "John", "Beatles", 263 | "Paul", "Beatles" 264 | ) 265 | 266 | instrument <- tribble( 267 | ~name, ~plays, 268 | "John", "guitar", 269 | "Paul", "bass", 270 | "Keith", "guitar" 271 | ) 272 | 273 | instrument2 <- tribble( 274 | ~artist, ~plays, 275 | "John", "guitar", 276 | "Paul", "bass", 277 | "Keith", "guitar" 278 | ) 279 | ``` 280 | 281 | ## Mutating joins 282 | 283 | ```{r} 284 | band %>% left_join(instrument, by = "name") 285 | ``` 286 | 287 | ## Your Turn 15 288 | 289 | Which airlines had the largest arrival delays? Complete the code below. 290 | 291 | 1. Join `airlines` to `flights` 292 | 2. Compute and order the average arrival delays by airline. Display full names, no codes. 
293 | 294 | ```{r} 295 | flights %>% 296 | drop_na(arr_delay) %>% 297 | %>% 298 | group_by( ) %>% 299 | %>% 300 | arrange( ) 301 | ``` 302 | 303 | ## Different names 304 | 305 | ```{r} 306 | band %>% left_join(instrument2, by = c("name" = "artist")) 307 | ``` 308 | 309 | ## Your Turn 16 310 | 311 | How many airports in `airports` are serviced by flights originating in New York (i.e. flights in our dataset?) Notice that the column to join on is named `faa` in the **airports** data set and `dest` in the **flights** data set. 312 | 313 | 314 | ```{r} 315 | __________ %>% 316 | _________(_________, by = ___________) %>% 317 | distinct(faa) 318 | ``` 319 | 320 | 321 | 322 | *** 323 | 324 | # Take aways 325 | 326 | * Extract variables with `select()` 327 | * Extract cases with `filter()` 328 | * Arrange cases, with `arrange()` 329 | 330 | * Make tables of summaries with `summarise()` 331 | * Make new variables, with `mutate()` 332 | * Do groupwise operations with `group_by()` 333 | 334 | * Connect operations with `%>%` 335 | 336 | * Use `left_join()`, `right_join()`, `full_join()`, or `inner_join()` to join datasets 337 | * Use `semi_join()` or `anti_join()` to filter datasets against each other 338 | 339 | 340 | 341 | 342 | ## Joining data 343 | 344 | ```{r} 345 | library(nycflights13) 346 | ``` 347 | 348 | ## Your turn 349 | Read in the toy datasets band and instrument 350 | 351 | 352 | ## Types of joins 353 | 354 | ```{r} 355 | band %>% left_join(instrument, by = "name") 356 | band %>% right_join(instrument, by = "name") 357 | band %>% full_join(instrument, by = "name") 358 | band %>% inner_join(instrument, by = "name") 359 | ``` 360 | 361 | ## Your turn 362 | Which airlines had the largest arrival delays? Work in groups to complete the code below. 363 | 364 | ```{r} 365 | flights %>% 366 | drop_na(arr_delay) %>% 367 | #something! %>% 368 | group_by( #something! ) %>% 369 | #something! %>% 370 | arrange( #something! ) 371 | ``` 372 | 373 | ## Your turn 374 | Read in the toy dataset instrument2 375 | 376 | ## What if the names don't match? 
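When the key columns have different names in the two tables, give `by` a named vector: the name on the left-hand side is the column in the first (left) table and the value is the matching column in the second table, as in the chunk below.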
377 | 378 | ```{r} 379 | band %>% left_join(instrument2, by = c("name" = "artist")) 380 | ``` 381 | 382 | ```{r} 383 | airports %>% left_join(flights, by = c("faa" = "dest")) 384 | ``` 385 | 386 | 387 | # Take aways 388 | 389 | * Extract variables with `select()` 390 | * Extract cases with `filter()` 391 | * Arrange cases, with `arrange()` 392 | 393 | * Make tables of summaries with `summarise()` 394 | * Make new variables, with `mutate()` 395 | * Do groupwise operations with `group_by()` 396 | 397 | * Connect operations with `%>%` 398 | 399 | * Use `left_join()`, `right_join()`, `full_join()`, or `inner_join()` to join datasets 400 | * Use `semi_join()` or `anti_join()` to filter datasets against each other 401 | 402 | 403 | 404 | -------------------------------------------------------------------------------- /03-Tidy.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Tidy Data" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(tidyverse) 12 | library(babynames) 13 | 14 | # Toy data 15 | cases <- tribble( 16 | ~Country, ~"2011", ~"2012", ~"2013", 17 | "FR", 7000, 6900, 7000, 18 | "DE", 5800, 6000, 6200, 19 | "US", 15000, 14000, 13000 20 | ) 21 | 22 | pollution <- tribble( 23 | ~city, ~size, ~amount, 24 | "New York", "large", 23, 25 | "New York", "small", 14, 26 | "London", "large", 22, 27 | "London", "small", 16, 28 | "Beijing", "large", 121, 29 | "Beijing", "small", 56 30 | ) 31 | 32 | 33 | bp_systolic <- tribble( 34 | ~ subject_id, ~ time_1, ~ time_2, ~ time_3, 35 | 1, 120, 118, 121, 36 | 2, 125, 131, NA, 37 | 3, 141, NA, NA 38 | ) 39 | 40 | bp_systolic2 <- tribble( 41 | ~ subject_id, ~ time, ~ systolic, 42 | 1, 1, 120, 43 | 1, 2, 118, 44 | 1, 3, 121, 45 | 2, 1, 125, 46 | 2, 2, 131, 47 | 3, 1, 141 48 | ) 49 | 50 | ``` 51 | 52 | ## Tidy and untidy data 53 | 54 | `table1` is tidy: 55 | ```{r} 56 | table1 57 | ``` 58 | 59 | For example, it's easy to add a rate column with `mutate()`: 60 | ```{r} 61 | table1 %>% 62 | mutate(rate = cases/population) 63 | ``` 64 | 65 | `table2` isn't tidy, the count column really contains two variables: 66 | ```{r} 67 | table2 68 | ``` 69 | 70 | It makes it very hard to manipulate. 71 | 72 | 73 | ## Your Turn 1 74 | 75 | Is `bp_systolic` tidy? 76 | 77 | ```{r} 78 | bp_systolic2 79 | ``` 80 | 81 | ## Your Turn 2 82 | 83 | Using `bp_systolic2` with `group_by()`, and `summarise()`: 84 | 85 | * Find the average systolic blood pressure for each subject 86 | * Find the last time each subject was measured 87 | 88 | ```{r} 89 | bp_systolic2 90 | ``` 91 | 92 | ## Your Turn 3 93 | 94 | On a sheet of paper, draw how the cases data set would look if it had the same values grouped into three columns: **country**, **year**, **n** 95 | 96 | ## Your Turn 4 97 | 98 | Use `gather()` to reorganize `table4a` into three columns: **country**, **year**, and **cases**. 99 | 100 | ```{r} 101 | table4a 102 | ``` 103 | 104 | ## Your Turn 5 105 | 106 | On a sheet of paper, draw how this data set would look if it had the same values grouped into three columns: **city**, **large**, **small** 107 | 108 | ## Your Turn 6 109 | 110 | Use `spread()` to reorganize `table2` into four columns: **country**, **year**, **cases**, and **population**. 111 | 112 | ```{r} 113 | table2 114 | ``` 115 | 116 | *** 117 | 118 | # Take Aways 119 | 120 | Data comes in many formats but R prefers just one: _tidy data_. 121 | 122 | A data set is tidy if and only if: 123 | 124 | 1. 
Every variable is in its own column 125 | 2. Every observation is in its own row 126 | 3. Every value is in its own cell (which follows from the above) 127 | 128 | What is a variable and an observation may depend on your immediate goal. 129 | -------------------------------------------------------------------------------- /04-Case-Study.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Case Study: Friday the 13th Effect" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(fivethirtyeight) 12 | library(tidyverse) 13 | ``` 14 | 15 | ## Task 16 | 17 | Reproduce this figure from fivethirtyeight's article [*Some People Are Too Superstitious To Have A Baby On Friday The 13th*](https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/): 18 | 19 | ![](resources/bialik-fridaythe13th-2.png) 20 | 21 | ## Data 22 | 23 | In the `fivethiryeight` package there are two datasets containing birth data, but for now let's just work with one, `US_births_1994_2003`. Note that since we have data from 1994-2003, our results may differ somewhat from the figure based on 1994-2014. 24 | 25 | ## Your Turn 1 26 | 27 | With your neighbour, brainstorm the steps needed to get the data in a form ready to make the plot. 28 | 29 | ```{r} 30 | US_births_1994_2003 31 | ``` 32 | 33 | ## Some overviews of the data 34 | 35 | Whole time series: 36 | ```{r} 37 | ggplot(US_births_1994_2003, aes(x = date, y = births)) + 38 | geom_line() 39 | ``` 40 | There is so much fluctuation it's really hard to see what is going on. 41 | 42 | Let's try just looking at one year: 43 | ```{r} 44 | US_births_1994_2003 %>% 45 | filter(year == 1994) %>% 46 | ggplot(mapping = aes(x = date, y = births)) + 47 | geom_line() 48 | ``` 49 | Strong weekly pattern accounts for most variation. 50 | 51 | ## Strategy 52 | 53 | Use the figure as a guide for what the data should like to make the final plot. We want to end up with something like: 54 | 55 | --------------------------- 56 | day_of_week avg_diff_13 57 | ------------- ------------- 58 | Mon -2.686 59 | 60 | Tues -1.378 61 | 62 | Wed -3.274 63 | 64 | ... ... 65 | 66 | --------------------------- 67 | 68 | 69 | ## Your Turn 2 70 | 71 | Extract just the 6th, 13th and 20th of each month: 72 | 73 | ```{r} 74 | US_births_1994_2003 %>% 75 | select(-date) 76 | 77 | ``` 78 | 79 | ## Your Turn 3 80 | 81 | Which arrangement is tidy? 82 | 83 | **Option 1:** 84 | 85 | ----------------------------------------------------- 86 | year month date_of_month day_of_week births 87 | ------ ------- --------------- ------------- -------- 88 | 1994 1 6 Thurs 11406 89 | 90 | 1994 1 13 Thurs 11212 91 | 92 | 1994 1 20 Thurs 11682 93 | ----------------------------------------------------- 94 | 95 | **Option 2:** 96 | 97 | ---------------------------------------------------- 98 | year month day_of_week 6 13 20 99 | ------ ------- ------------- ------- ------- ------- 100 | 1994 1 Thurs 11406 11212 11682 101 | ---------------------------------------------------- 102 | 103 | (**Hint:** think about our next step *"Find the percent difference between the 13th and the average of the 6th and 12th"*. In which layout will this be easier using our tidy tools?) 104 | 105 | ## Your Turn 4 106 | 107 | Tidy the filtered data to have the days in columns. 
108 | 109 | ```{r} 110 | US_births_1994_2003 %>% 111 | select(-date) %>% 112 | filter(date_of_month %in% c(6, 13, 20)) 113 | ``` 114 | 115 | ## Your Turn 5 116 | 117 | Now use `mutate()` to add columns for: 118 | 119 | * The average of the births on the 6th and 20th 120 | * The percentage difference between the number of births on the 13th and the average of the 6th and 20th 121 | 122 | ```{r} 123 | US_births_1994_2003 %>% 124 | select(-date) %>% 125 | filter(date_of_month %in% c(6, 13, 20)) %>% 126 | spread(date_of_month, births) 127 | ``` 128 | 129 | ## A little additional exploring 130 | 131 | Now we have a percent difference between the 13th and the 6th and 20th of each month, it's probably worth exploring a little (at the very least to check our calculations seem reasonable). 132 | 133 | To make it a little easier let's assign our current data to a variable 134 | ```{r} 135 | births_diff_13 <- US_births_1994_2003 %>% 136 | select(-date) %>% 137 | filter(date_of_month %in% c(6, 13, 20)) %>% 138 | spread(date_of_month, births) %>% 139 | mutate( 140 | avg_6_20 = (`6` + `20`)/2, 141 | diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 142 | ) 143 | ``` 144 | 145 | Then take a look 146 | ```{r} 147 | births_diff_13 %>% 148 | ggplot(mapping = aes(day_of_week, diff_13)) + 149 | geom_point() 150 | ``` 151 | 152 | Looks like we are on the right path. There's a big outlier one Monday 153 | ```{r} 154 | births_diff_13 %>% 155 | filter(day_of_week == "Mon", diff_13 > 10) 156 | ``` 157 | 158 | Seem's to be driven but a particularly low number of births on the 6th of Sep 1999. Maybe a holiday effect? Labour Day was of the 6th of Sep that year. 159 | 160 | ## Your Turn 6 161 | 162 | Summarize each day of the week to have mean of diff_13. 163 | 164 | Then, recreate the fivethirtyeight plot. 165 | 166 | ```{r} 167 | US_births_1994_2003 %>% 168 | select(-date) %>% 169 | filter(date_of_month %in% c(6, 13, 20)) %>% 170 | spread(date_of_month, births) %>% 171 | mutate( 172 | avg_6_20 = (`6` + `20`)/2, 173 | diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 174 | ) 175 | ``` 176 | 177 | ## Extra Challenges 178 | 179 | * If you wanted to use the `US_births_2000_2014` data instead, what would you need to change in the pipeline? How about using both `US_births_1994_2003` and `US_births_2000_2014`? 180 | 181 | * Try not removing the `date` column. At what point in the pipeline does it cause problems? Why? 182 | 183 | * Can you come up with an alternative way to investigate the Friday the 13th effect? Try it out! 184 | 185 | ## Takeaways 186 | 187 | The power of the tidyverse comes from being able to easily combine functions that do simple things well. 188 | 189 | -------------------------------------------------------------------------------- /05-Data-Types.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Types" 3 | output: html_notebook 4 | --- 5 | 6 | ```{r setup} 7 | library(tidyverse) 8 | library(babynames) 9 | library(nycflights13) 10 | library(stringr) 11 | library(forcats) 12 | library(lubridate) 13 | library(hms) 14 | ``` 15 | 16 | ## Your Turn 1 17 | 18 | Use `flights` to create `delayed`, the variable that displays whether a flight was delayed (`arr_delay > 0`). 19 | 20 | Then, remove all rows that contain an NA in `delayed`. 21 | 22 | Finally, create a summary table that shows: 23 | 24 | 1. How many flights were delayed 25 | 2. 
What proportion of flights were delayed 26 | 27 | ```{r} 28 | 29 | ``` 30 | 31 | 32 | ## Your Turn 2 33 | 34 | In your group, fill in the blanks to: 35 | 36 | 1. Isolate the last letter of every name and create a logical variable that displays whether the last letter is one of "a", "e", "i", "o", "u", or "y". 37 | 2. Use a weighted mean to calculate the proportion of children whose name ends in a vowel (by `year` and `sex`) 38 | 3. and then display the results as a line plot. 39 | 40 | ```{r} 41 | babynames %>% 42 | _______(last = _________, 43 | vowel = __________) %>% 44 | group_by(__________) %>% 45 | _________(p_vowel = weighted.mean(vowel, n)) %>% 46 | _________ + 47 | __________ 48 | ``` 49 | 50 | ## Your Turn 3 51 | 52 | Repeat the previous exercise, some of whose code is below, to make a sensible graph of average TV consumption by marital status. 53 | 54 | ```{r} 55 | gss_cat %>% 56 | drop_na(________) %>% 57 | group_by(________) %>% 58 | summarise(_________________) %>% 59 | ggplot() + 60 | geom_point(mapping = aes(x = _______, y = _________________________)) 61 | ``` 62 | 63 | ## Your Turn 4 64 | 65 | Do you think liberals or conservatives watch more TV? 66 | Compute average tv hours by party ID an then plot the results. 67 | 68 | ```{r} 69 | 70 | ``` 71 | 72 | ## Your Turn 5 73 | 74 | What is the best time of day to fly? 75 | 76 | Use the `hour` and `minute` variables in `flights` to compute the time of day for each flight as an hms. Then use a smooth line to plot the relationship between time of day and `arr_delay`. 77 | 78 | ```{r} 79 | 80 | ``` 81 | 82 | ## Your Turn 6 83 | 84 | Fill in the blanks to: 85 | 86 | Extract the day of the week of each flight (as a full name) from `time_hour`. 87 | 88 | Calculate the average `arr_delay` by day of the week. 89 | 90 | Plot the results as a column chart (bar chart) with `geom_col()`. 91 | 92 | ```{r} 93 | flights %>% 94 | mutate(weekday = _______________________________) %>% 95 | __________________ %>% 96 | drop_na(arr_delay) %>% 97 | summarise(avg_delay = _______________) %>% 98 | ggplot() + 99 | ___________(mapping = aes(x = weekday, y = avg_delay)) 100 | ``` 101 | 102 | *** 103 | 104 | # Take Aways 105 | 106 | Dplyr gives you three _general_ functions for manipulating data: `mutate()`, `summarise()`, and `group_by()`. Augment these with functions from the packages below, which focus on specific types of data. 107 | 108 | Package | Data Type 109 | --------- | -------- 110 | stringr | strings 111 | forcats | factors 112 | hms | times 113 | lubridate | dates and times 114 | 115 | -------------------------------------------------------------------------------- /06-Iterate.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Iteration" 3 | output: html_document 4 | --- 5 | 6 | 7 | 8 | ```{r setup} 9 | library(tidyverse) 10 | 11 | # Toy data 12 | set.seed(1000) 13 | exams <- list( 14 | student1 = round(runif(10, 50, 100)), 15 | student2 = round(runif(10, 50, 100)), 16 | student3 = round(runif(10, 50, 100)), 17 | student4 = round(runif(10, 50, 100)), 18 | student5 = round(runif(10, 50, 100)) 19 | ) 20 | 21 | extra_credit <- list(0, 0, 10, 10, 15) 22 | ``` 23 | 24 | ## Your Turn 1 25 | 26 | What kind of object is `mod`? Why are models stored as this kind of object? 27 | 28 | ```{r} 29 | mod <- lm(price ~ carat + cut + color + clarity, data = diamonds) 30 | View(mod) 31 | ``` 32 | 33 | ## Consider 34 | 35 | What's the difference between a list and an **atomic** vector? 
36 | 37 | Atomic vectors are: "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw" vectors. 38 | 39 | ## Your Turn 2 40 | 41 | Here is a list: 42 | 43 | ```{r} 44 | a_list <- list(nums = c(8, 9), 45 | log = TRUE, 46 | cha = c("a", "b", "c")) 47 | ``` 48 | 49 | Here are two subsetting commands. Do they return the same values? Run the code chunk above, _and then_ run the code chunks below to confirm 50 | 51 | ```{r} 52 | a_list["nums"] 53 | ``` 54 | 55 | ```{r} 56 | a_list$nums 57 | ``` 58 | 59 | ## Your Turn 3 60 | 61 | What will each of these return? Run the code chunks to confirm. 62 | 63 | ```{r} 64 | vec <- c(-2, -1, 0, 1, 2) 65 | abs(vec) 66 | ``` 67 | 68 | ```{r, error = TRUE} 69 | lst <- list(-2, -1, 0, 1, 2) 70 | abs(lst) 71 | ``` 72 | 73 | ## Your Turn 4 74 | 75 | Run the code in the chunks. What does it return? 76 | 77 | ```{r} 78 | list(student1 = mean(exams$student1), 79 | student2 = mean(exams$student2), 80 | student3 = mean(exams$student3), 81 | student4 = mean(exams$student4), 82 | student5 = mean(exams$student5)) 83 | ``` 84 | 85 | ```{r} 86 | library(purrr) 87 | map(exams, mean) 88 | ``` 89 | 90 | ## Your Turn 5 91 | 92 | Calculate the variance (`var()`) of each student’s exam grades. 93 | 94 | ```{r} 95 | exams 96 | ``` 97 | 98 | ## Your Turn 6 99 | 100 | Calculate the max grade (`max()`)for each student. Return the result as a vector. 101 | 102 | ```{r} 103 | exams 104 | ``` 105 | 106 | ## Your Turn 7 107 | 108 | Write a function that counts the best exam twice and then takes the average. Use it to grade all of the students. 109 | 110 | 1. Write code that solves the problem for a real object 111 | 2. Wrap the code in `function(){}` to save it 112 | 3. Add the name of the real object as the function argument 113 | 114 | ```{r} 115 | vec <- exams[[1]] 116 | 117 | 118 | ``` 119 | 120 | ### Your Turn 8 121 | 122 | Compute a final grade for each student, where the final grade is the average test score plus any `extra_credit` assigned to the student. Return the results as a double (i.e. numeric) vector. 123 | 124 | ```{r} 125 | 126 | ``` 127 | 128 | 129 | *** 130 | 131 | # Take Aways 132 | 133 | Lists are a useful way to organize data, but you need to arrange manually for functions to iterate over the elements of a list. 134 | 135 | You can do this with the `map()` family of functions in the purrr package. 136 | 137 | To write a function, 138 | 139 | 1. Write code that solves the problem for a real object 140 | 2. Wrap the code in `function(){}` to save it 141 | 3. Add the name of the real object as the function argument 142 | 143 | This sequence will help prevent bugs in your code (and reduce the time you spend correcting bugs). 144 | -------------------------------------------------------------------------------- /07-Model.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Model" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup, message=FALSE} 11 | library(tidyverse) 12 | library(modelr) 13 | library(broom) 14 | 15 | wages <- heights %>% filter(income > 0) 16 | ``` 17 | 18 | ## Your Turn 1 19 | 20 | Fit the model on the slide and then examine the output. What does it look like? 21 | 22 | ```{r} 23 | mod_e <- lm(log(income) ~ education, data = wages) 24 | mod_e 25 | ``` 26 | 27 | ## Your Turn 2 28 | 29 | Use a pipe to model `log(income)` against `height`. Then use broom and dplyr functions to extract: 30 | 31 | 1. 
The **coefficient estimates** and their related statistics 32 | 2. The **adj.r.squared** and **p.value** for the overall model 33 | 34 | ```{r, error = TRUE} 35 | mod_h <- wages %>% lm( ) 36 | 37 | 38 | ``` 39 | 40 | ## Your Turn 3 41 | 42 | Model `log(income)` against `education` _and_ `height`. Do the coefficients change? 43 | 44 | ```{r, error = TRUE} 45 | mod_eh <- wages %>% lm( ) 46 | 47 | ``` 48 | 49 | ## Your Turn 4 50 | 51 | Model `log(income)` against `education` and `height` and `sex`. Can you interpret the coefficients? 52 | 53 | ```{r, error = TRUE} 54 | mod_ehs <- wages %>% lm( ) 55 | ``` 56 | 57 | ## Your Turn 5 58 | 59 | Use a broom function and ggplot2 to make a line graph of `height` vs `.fitted` for our heights model, `mod_h`. 60 | 61 | _Bonus: Overlay the plot on the original data points._ 62 | 63 | ```{r} 64 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 65 | 66 | ``` 67 | 68 | ## Your Turn 6 69 | 70 | Repeat the process to make a line graph of `height` vs `.fitted` colored by `sex` for model `mod_ehs`. Are the results interpretable? Add `+ facet_wrap(~education)` to the end of your code. What happens? 71 | 72 | ```{r} 73 | mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) 74 | 75 | ``` 76 | 77 | ## Your Turn 7 78 | 79 | Use one of `spread_predictions()` or `gather_predictions()` to make a line graph of `height` vs `pred` colored by `model` for each of mod_h, mod_eh, and mod_ehs. Are the results interpretable? 80 | 81 | Add `+ facet_grid(sex ~ education)` to the end of your code. What happens? 82 | 83 | ```{r warning = FALSE, message = FALSE} 84 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 85 | mod_eh <- wages %>% lm(log(income) ~ education + height, data = .) 86 | mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) 87 | 88 | 89 | ``` 90 | 91 | *** 92 | 93 | # Take Aways 94 | 95 | * Use `glance()`, `tidy()`, and `augment()` from the **broom** package to return model values in a data frame. 96 | 97 | * Use `add_predictions()` or `gather_predictions()` or `spread_predictions()` from the **modelr** package to visualize predictions. 98 | 99 | * Use `add_residuals()` or `gather_residuals()` or `spread_residuals()` from the **modelr** package to visualize residuals. 100 | 101 | -------------------------------------------------------------------------------- /08-Organize.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Organize with List Columns" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(tidyverse) 12 | library(gapminder) 13 | library(broom) 14 | 15 | nz <- gapminder %>% 16 | filter(country == "New Zealand") 17 | us <- gapminder %>% 18 | filter(country == "United States") 19 | ``` 20 | 21 | ## Your turn 1 22 | 23 | How has life expectancy changed over time? 24 | Make a line plot of lifeExp vs. year grouped by country. 25 | Set alpha to 0.2, to see the results better. 26 | 27 | ```{r} 28 | gapminder 29 | 30 | 31 | ``` 32 | 33 | ## Consider 34 | 35 | How is a data frame/tibble similar to a list? 36 | 37 | ## Consider 38 | 39 | If one of the elements of a list can be another list, 40 | can one of the columns of a data frame be another list? 
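Yes — here is a tiny sketch (toy values, not part of the workshop data) showing that a tibble column can itself be a list, holding a different kind of object in each row:

```{r}
# A two-row tibble whose `stuff` column is a list:
# a numeric vector in one row, a character vector in the other
tibble(
  id    = c(1, 2),
  stuff = list(1:3, c("a", "b"))
)
```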
41 | 42 | ## Your turn 2 43 | 44 | Run this chunk: 45 | ```{r} 46 | gapminder_nested <- gapminder %>% 47 | group_by(country) %>% 48 | nest() 49 | 50 | fit_model <- function(df) lm(lifeExp ~ year, data = df) 51 | 52 | gapminder_nested <- gapminder_nested %>% 53 | mutate(model = map(data, fit_model)) 54 | 55 | get_rsq <- function(mod) glance(mod)$r.squared 56 | 57 | gapminder_nested <- gapminder_nested %>% 58 | mutate(r.squared = map_dbl(model, get_rsq)) 59 | ``` 60 | 61 | Then filter `gapminder_nested` to find the countries with r.squared less than 0.5. 62 | 63 | ```{r} 64 | 65 | ``` 66 | 67 | ## Your Turn 3 68 | 69 | Edit the code in the chunk provided to instead find and plot countries with a slope above 0.6 years/year. 70 | 71 | ```{r} 72 | get_slope <- function(mod) { 73 | tidy(mod) %>% filter(term == "year") %>% pull(estimate) 74 | } 75 | 76 | # Add new column with r-sqaured 77 | gapminder_nested <- gapminder_nested %>% 78 | mutate(r.squared = map_dbl(model, get_rsq)) 79 | 80 | # filter out low r-squared countries 81 | poor_fit <- gapminder_nested %>% 82 | filter(r.squared < 0.5) 83 | 84 | # unnest and plot result 85 | unnest(poor_fit, data) %>% 86 | ggplot(aes(x = year, y = lifeExp)) + 87 | geom_line(aes(color = country)) 88 | ``` 89 | 90 | ## Your Turn 4 91 | 92 | **Challenge:** 93 | 94 | 1. Create your own copy of `gapminder_nested` and then add one more list column: `output` which contains the output of `augment()` for each model. 95 | 96 | 97 | ```{r} 98 | 99 | ``` 100 | 101 | # Take away 102 | 103 | -------------------------------------------------------------------------------- /99-Setup.md: -------------------------------------------------------------------------------- 1 | # Getting Set Up 2 | 3 | During the workshop you'll do your work on [rstudio.cloud](https://rstudio.cloud/). This provides an easy way for me to share all the materials with you, and removes the hassle of getting the right versions of R, RStudio or any packages. 4 | 5 | ## To get started: 6 | 7 | To get set up follow these steps: 8 | 9 | 1. Visit the project at https://rstudio.cloud/project/163983 10 | 11 | 2. Log in using google, github, shinyapps.io or "Sign Up". 12 | 13 | ![](resources/01-setup-login.png) 14 | 15 | 3. The "Data Science in the tidyverse" project will open, but it's a *Temporary copy*. Click *Save a copy*. 16 | 17 | ![](resources/02-setup-temp-project.png) 18 | 19 | 4. Now the "Data Science in the tidyverse" project will open again, but this time it is your own copy. Navigate to the "data-science-in-the-tidyverse.Rproj" file and click it. 20 | 21 | ![](resources/04-setup-rproj-file.png) 22 | 23 | 6. You'll be asked if you want to open the project, hit Yes. 24 | 25 | ![](resources/05-setup-open-project.png) 26 | 27 | 7. All going well, you should now see your project looking like this. Now, open "00-Getting-started.Rmd" 28 | 29 | ![](resources/06-setup-inside-project.png) 30 | 31 | 8. You're all set! You might like to read through "00-Getting-started.Rmd" and do what it tells you. 32 | 33 | ![](resources/07-setup-all-done.png) 34 | 35 | ## Once you are set up 36 | 37 | You can access your copy of the project from *Your Workspace* on [rstudio.cloud](https://rstudio.cloud/). -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | This is the repo for *"Data Science in the tidyverse"* given at `rstudio::conf(2019)` in Jan 2019. 
2 | 3 | ## Description 4 | 5 | This is a two-day hands on workshop based on the book [R for Data Science](http://r4ds.had.co.nz/). You will learn how to visualize, transform, and model data in R and work with date-times, character strings, and untidy data formats. Along the way, you will learn and use many packages from the tidyverse including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, lubridate, and forcats. 6 | 7 | ## Software Requirements 8 | 9 | You'll be using RStudio Cloud, so (all going well) on the day of the workshop all you'll need is **a laptop that can access the internet** (wifi will be available). 10 | 11 | In the unlikely event that there are problems with the conference internet connection, you may want to have a local installation on your computer as a backup. If you'd like, install the following: 12 | 13 | 1. A recent version of R (~3.5.2), which is available for free at [cran.r-project.org](http://www.cran.r-project.org) 14 | 2. A recent version of RStudio IDE (~1.1.463), available for free at [www.rstudio.com/download](http://www.rstudio.com/download) 15 | 3. The set of relevant R packages, which you can install by connecting to the internet, opening RStudio, and running: 16 | 17 | install.packages(c("babynames", "fivethirtyeight", "formatR", "gapminder", "hexbin", "mgcv", "maps", "mapproj", "nycflights13", "rmarkdown", "skimr", "tidyverse", "viridis")) 18 | 19 | Don't forget to bring your power cord! 20 | 21 | ## Instructor Info 22 | 23 | Amelia McNamara 24 | 25 | - [amelia.mn](http://www.amelia.mn) 26 | - @[AmeliaMN](http://www.twitter.com/AmeliaMN) 27 | 28 | ## License 29 | 30 | Creative Commons License 31 | 32 | *Data Science in the tidyverse* by Amelia McNamara is licensed under a Creative Commons Attribution 4.0 International License. Based on a work at https://github.com/rstudio-education/master-the-tidyverse, [https://github.com/cwickham/data-science-in-tidyverse](https://github.com/cwickham/data-science-in-tidyverse), and [https://github.com/AmeliaMN/IntroToR/](https://github.com/AmeliaMN/IntroToR/) 33 | -------------------------------------------------------------------------------- /data-science-in-the-tidyverse.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Yes 4 | SaveWorkspace: Ask 5 | AlwaysSaveHistory: Yes 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /email-to-participants.md: -------------------------------------------------------------------------------- 1 | *(This will be sent to registered particpants by email, but I'm also posting here as a convenient place to field any questions/issues.)* 2 | 3 | Thank you for enrolling in Data Science in the Tidyverse. 4 | 5 | During class, we will be using RStudio Cloud, a hosted version of R and RStudio in the cloud. The only thing you need to do to prepare for class is sign up for a free RStudio Cloud account at , and plan to bring your laptop with you. On the day of class, we'll provide you with an RStudio Cloud project that contains all of the course materials. 6 | 7 | In the unlikely event that there are problems with the conference internet connection, you may want to have a local installation on your computer as a backup. If you'd like, install the following: 8 | 9 | 1. 
A recent version of R (~3.5.2), which is available for free at 10 | 2. A recent version of RStudio IDE (~1.1.463), available for free at 11 | 3. The set of relevant R packages, which you can install by connecting to the internet, opening RStudio, and running: 12 | 13 | install.packages(c("babynames", "fivethirtyeight", "formatR", "gapminder", "hexbin", "mgcv", "maps", "mapproj", "nycflights13", "rmarkdown", "skimr", "tidyverse", "viridis")) 14 | 15 | If you're a new R user or working on a government or corporate laptop, it's possible that installing R will be challenging. In that case, feel free to ignore the backup instructions and just count on RStudio Cloud. We'll talk about local installation on the second day of the workshop, and we'll have TAs there to help troubleshoot. 16 | 17 | Whatever you do, don't forget your power cord! 18 | 19 | We look forward to meeting you, 20 | 21 | Amelia and Hadley -------------------------------------------------------------------------------- /resources/01-setup-login.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/01-setup-login.png -------------------------------------------------------------------------------- /resources/02-setup-temp-project.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/02-setup-temp-project.png -------------------------------------------------------------------------------- /resources/04-setup-rproj-file.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/04-setup-rproj-file.png -------------------------------------------------------------------------------- /resources/05-setup-open-project.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/05-setup-open-project.png -------------------------------------------------------------------------------- /resources/06-setup-inside-project.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/06-setup-inside-project.png -------------------------------------------------------------------------------- /resources/07-setup-all-done.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/07-setup-all-done.png -------------------------------------------------------------------------------- /resources/bialik-fridaythe13th-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/resources/bialik-fridaythe13th-2.png -------------------------------------------------------------------------------- /slides/00-Introduction.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/00-Introduction.pdf -------------------------------------------------------------------------------- /slides/01-Visualize.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/01-Visualize.pdf -------------------------------------------------------------------------------- /slides/02-Transform.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/02-Transform.pdf -------------------------------------------------------------------------------- /slides/03-Tidy.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/03-Tidy.pdf -------------------------------------------------------------------------------- /slides/04-Case-Study.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/04-Case-Study.pdf -------------------------------------------------------------------------------- /slides/05-Data-Types.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/05-Data-Types.pdf -------------------------------------------------------------------------------- /slides/06-Iteration.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/06-Iteration.pdf -------------------------------------------------------------------------------- /slides/07-Model.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/07-Model.pdf -------------------------------------------------------------------------------- /slides/08-Organize.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/08-Organize.pdf -------------------------------------------------------------------------------- /slides/09-Wrapping-Up.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmeliaMN/data-science-in-tidyverse/fc565700be3da555becbf52e7a39a1ecfc006401/slides/09-Wrapping-Up.pdf -------------------------------------------------------------------------------- /solutions/01-Visualize-solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Visualization - solutions" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ## Setup 11 | 12 | The first chunk in an R Notebook is usually titled "setup," and by convention includes the R packages you want to load. 
Remember, in order to use an R package you have to run some `library()` code every session. Execute these lines of code to load the packages. 13 | 14 | ```{r setup} 15 | library(ggplot2) 16 | library(fivethirtyeight) 17 | ``` 18 | 19 | ## Bechdel test data 20 | 21 | We're going to start by playing with data collected by the website FiveThirtyEight on movies and [the Bechdel test](https://en.wikipedia.org/wiki/Bechdel_test). 22 | 23 | To begin, let's just preview our data. There are a couple ways to do that. One is just to type the name of the data and execute it like a piece of code. 24 | 25 | ```{r} 26 | bechdel 27 | ``` 28 | 29 | Notice that you can page through to see more of the dataset. 30 | 31 | Sometimes, people prefer to see their data in a more spreadsheet-like format, and RStudio provides a way to do that. Go to the Console and type `View(bechdel)` to see the data preview. 32 | 33 | (An aside-- `View` is a special function. Since it makes something happen in the RStudio interface, it doesn't work properly in R Notebooks. Most R functions have names that start with lowercase letters, so the uppercase "V" is there to remind you of its special status.) 34 | 35 | 36 | 37 | ## Consider 38 | What relationship do you expect to see between movie budget (budget) and domestic gross(domgross)? 39 | 40 | ## Your Turn 1 41 | 42 | Run the code on the slide to make a graph. Pay strict attention to spelling, capitalization, and parentheses! 43 | 44 | ```{r} 45 | ggplot(data = bechdel) + 46 | geom_point(mapping = aes(x = budget, y = domgross)) 47 | ``` 48 | 49 | ## Your Turn 2 50 | 51 | Add `color`, `size`, `alpha`, and `shape` aesthetics to your graph. Experiment. 52 | 53 | ```{r} 54 | ggplot(data = bechdel) + 55 | geom_point(mapping = aes(x = budget, y = domgross, color=clean_test)) 56 | 57 | ggplot(bechdel) + 58 | geom_point(mapping = aes(x = budget, y = domgross, size=clean_test)) 59 | ggplot(bechdel) + 60 | geom_point(mapping = aes(x = budget, y = domgross, shape=clean_test)) 61 | ggplot(bechdel) + 62 | geom_point(mapping = aes(x = budget, y = domgross, alpha=clean_test)) 63 | 64 | ``` 65 | 66 | ## Set vs map 67 | 68 | ```{r} 69 | ggplot(bechdel) + 70 | geom_point(mapping = aes(x = budget, y = domgross), color="blue") 71 | ``` 72 | 73 | ## Your Turn 3 74 | 75 | Replace this scatterplot with one that draws boxplots. Use the cheatsheet. Try your best guess. 76 | 77 | ```{r} 78 | ggplot(data = bechdel) + geom_point(aes(x = clean_test, y = budget)) 79 | 80 | ggplot(data = bechdel) + geom_boxplot(aes(x = clean_test, y = budget)) 81 | ``` 82 | 83 | ## Your Turn 4 84 | 85 | Make a histogram of the `budget` variable from `bechdel`. 86 | 87 | ```{r} 88 | ggplot(bechdel) + 89 | geom_histogram(aes(x=budget)) 90 | ``` 91 | 92 | ## Your Turn 5 93 | Try to find a better binwidth for `budget`. 94 | 95 | ```{r} 96 | ggplot(data = bechdel) + 97 | geom_histogram(mapping = aes(x = budget), binwidth=10000000) 98 | ``` 99 | 100 | ## Your Turn 6 101 | 102 | Make a density plot of `budget` colored by `clean_test`. 103 | 104 | ```{r} 105 | ggplot(data = bechdel) + 106 | geom_density(mapping = aes(x = budget)) 107 | 108 | ggplot(data = bechdel) + 109 | geom_density(mapping = aes(x = budget, color=clean_test)) 110 | ``` 111 | 112 | 113 | ## Your Turn 7 114 | 115 | Make a barchart of `clean_test` colored by `clean_test`. 116 | 117 | ```{r} 118 | ggplot(data=bechdel) + 119 | geom_bar(mapping = aes(x = clean_test, fill = clean_test)) 120 | ``` 121 | 122 | 123 | ## Your Turn 8 124 | 125 | Predict what this code will do. 
Then run it. 126 | 127 | ```{r} 128 | ggplot(bechdel) + 129 | geom_point(aes(budget, domgross)) + 130 | geom_smooth(aes(budget, domgross)) 131 | ``` 132 | 133 | ## global vs local 134 | 135 | ```{r} 136 | ggplot(data = bechdel, mapping = aes(x = budget, y = domgross)) + 137 | geom_point(mapping = aes(color = clean_test)) + 138 | geom_smooth() 139 | ``` 140 | 141 | ```{r} 142 | ggplot(data = bechdel, mapping = aes(x = budget, y = domgross)) + 143 | geom_point(mapping = aes(color = clean_test)) + 144 | geom_smooth(data = filter(bechdel, clean_test == "ok")) 145 | ``` 146 | 147 | ## Your Turn 148 | 149 | What does `getwd()` return? 150 | 151 | ```{r} 152 | getwd() 153 | ``` 154 | 155 | ## Your Turn 9 156 | 157 | Save the last plot and then locate it in the files pane. If you run your `ggsave()` code inside this notebook, the image will be saved in the same directory as your .Rmd file (likely, project -> code), but if you run `ggsave()` in the Console it will be in your working directory. 158 | 159 | ```{r} 160 | ggsave("my-plot.png") 161 | ``` 162 | 163 | *** 164 | 165 | # Take aways 166 | 167 | You can use this code template to make thousands of graphs with **ggplot2**. 168 | 169 | ```{r eval = FALSE} 170 | ggplot(data = ) + 171 | (mapping = aes()) 172 | ``` -------------------------------------------------------------------------------- /solutions/02-Transform-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Transform Data - solutions" 3 | output: html_notebook 4 | editor_options: 5 | chunk_output_type: inline 6 | --- 7 | 8 | 9 | 10 | ```{r setup} 11 | library(dplyr) 12 | library(babynames) 13 | library(nycflights13) 14 | library(skimr) 15 | ``` 16 | 17 | ## Babynames 18 | 19 | ```{r} 20 | babynames 21 | skim(babynames) 22 | skim_with(integer = list(p25 = NULL, p75=NULL)) 23 | ``` 24 | 25 | 26 | ## Your Turn 1 27 | Run the skim_with() command, and then try skimming babynames again to see how the output is different 28 | ```{r} 29 | skim(babynames) 30 | ``` 31 | 32 | ## Select 33 | 34 | ```{r} 35 | select(babynames, name, prop) 36 | ``` 37 | 38 | ## Your Turn 2 39 | 40 | Alter the code to select just the `n` column: 41 | 42 | ```{r} 43 | select(babynames, n) 44 | ``` 45 | 46 | ## Consider 47 | 48 | Which of these is NOT a way to select the `name` and `n` columns together? 49 | 50 | ```{r} 51 | select(babynames, -c(year, sex, prop)) 52 | select(babynames, name:n) 53 | select(babynames, starts_with("n")) 54 | select(babynames, ends_with("n")) 55 | ``` 56 | 57 | ## Your Turn 3 58 | 59 | Show: 60 | 61 | * All of the names where prop is greater than or equal to 0.08 62 | * All of the children named "Sea" 63 | * All of the names that have a missing value for `n` 64 | 65 | ```{r} 66 | filter(babynames, prop >= 0.08) 67 | filter(babynames, name == "Sea") 68 | filter(babynames, is.na(n)) 69 | ``` 70 | 71 | ## Your Turn 4 72 | 73 | Use Boolean operators to alter the code below to return only the rows that contain: 74 | 75 | * Girls named Sea 76 | * Names that were used by exactly 5 or 6 children in 1880 77 | * Names that are one of Acura, Lexus, or Yugo 78 | 79 | ```{r} 80 | filter(babynames, name == "Sea", sex == "F") 81 | filter(babynames, n == 5 | n == 6, year == 1880) 82 | filter(babynames, name %in% c("Acura", "Lexus", "Yugo")) 83 | ``` 84 | 85 | ## Arrange 86 | 87 | ```{r} 88 | arrange(babynames, n) 89 | ``` 90 | 91 | ## Your Turn 5 92 | 93 | Arrange babynames by `n`. Add `prop` as a second (tie breaking) variable to arrange on. 
Can you tell what the smallest value of `n` is? 94 | 95 | ```{r} 96 | arrange(babynames, n, prop) 97 | ``` 98 | 99 | ## desc 100 | 101 | ```{r} 102 | arrange(babynames, desc(n)) 103 | ``` 104 | 105 | ## Your Turn 6 106 | 107 | Use `desc()` to find the names with the highest prop. 108 | Then, use `desc()` to find the names with the highest n. 109 | 110 | ```{r} 111 | arrange(babynames, desc(prop)) 112 | arrange(babynames, desc(n)) 113 | ``` 114 | 115 | ## Steps and the pipe 116 | 117 | ```{r} 118 | babynames %>% 119 | filter(year == 2015, sex == "M") %>% 120 | select(name, n) %>% 121 | arrange(desc(n)) 122 | ``` 123 | 124 | ## Your Turn 7 125 | 126 | Use `%>%` to write a sequence of functions that: 127 | 128 | 1. Filter babynames to just the girls that were born in 2015 129 | 2. Select the `name` and `n` columns 130 | 3. Arrange the results so that the most popular names are near the top. 131 | 132 | ```{r} 133 | babynames %>% 134 | filter(year == 2015, sex == "F") %>% 135 | select(name, n) %>% 136 | arrange(desc(n)) 137 | ``` 138 | 139 | ## Your Turn 8 140 | 141 | 1. Trim `babynames` to just the rows that contain your `name` and your `sex` 142 | 2. Trim the result to just the columns that will appear in your graph (not strictly necessary, but useful practice) 143 | 3. Plot the results as a line graph with `year` on the x axis and `prop` on the y axis 144 | 145 | ```{r} 146 | babynames %>% 147 | filter(name == "Amelia", sex == "F") %>% 148 | select(year, prop) %>% 149 | ggplot() + 150 | geom_line(mapping = aes(year, prop)) 151 | ``` 152 | 153 | ## Your Turn 9 154 | 155 | Use summarise() to compute three statistics about the data: 156 | 157 | 1. The first (minimum) year in the dataset 158 | 2. The last (maximum) year in the dataset 159 | 3. The total number of children represented in the data 160 | 161 | ```{r} 162 | babynames %>% 163 | summarise(first = min(year), 164 | last = max(year), 165 | total = sum(n)) 166 | ``` 167 | 168 | ## Your Turn 10 169 | 170 | Extract the rows where `name == "Khaleesi"`. Then use `summarise()` and a summary functions to find: 171 | 172 | 1. The total number of children named Khaleesi 173 | 2. The first year Khaleesi appeared in the data 174 | 175 | ```{r} 176 | babynames %>% 177 | filter(name == "Khaleesi") %>% 178 | summarise(total = sum(n), first = min(year)) 179 | ``` 180 | 181 | ## Toy data for transforming 182 | 183 | ```{r} 184 | # Toy dataset to use 185 | pollution <- tribble( 186 | ~city, ~size, ~amount, 187 | "New York", "large", 23, 188 | "New York", "small", 14, 189 | "London", "large", 22, 190 | "London", "small", 16, 191 | "Beijing", "large", 121, 192 | "Beijing", "small", 56 193 | ) 194 | ``` 195 | 196 | ## Summarize 197 | 198 | ```{r} 199 | pollution %>% 200 | summarise(mean = mean(amount), sum = sum(amount), n = n()) 201 | ``` 202 | 203 | ```{r} 204 | pollution %>% 205 | group_by(city) %>% 206 | summarise(mean = mean(amount), sum = sum(amount), n = n()) 207 | ``` 208 | 209 | 210 | ## Your Turn 11 211 | 212 | Use `group_by()`, `summarise()`, and `arrange()` to display the ten most popular names. Compute popularity as the total number of children of a single gender given a name. 213 | 214 | ```{r} 215 | babynames %>% 216 | group_by(name, sex) %>% 217 | summarise(total = sum(n)) %>% 218 | arrange(desc(total)) 219 | ``` 220 | 221 | ## Your Turn 12 222 | 223 | Use grouping to calculate and then plot the number of children born each year over time. 
224 | 225 | ```{r} 226 | babynames %>% 227 | group_by(year) %>% 228 | summarise(n_children = sum(n)) %>% 229 | ggplot() + 230 | geom_line(mapping = aes(x = year, y = n_children)) 231 | ``` 232 | 233 | ## Ungroup 234 | 235 | ```{r} 236 | babynames %>% 237 | group_by(name, sex) %>% 238 | summarise(total = sum(n)) %>% 239 | arrange(desc(total)) 240 | ``` 241 | 242 | ## Mutate 243 | 244 | ```{r} 245 | babynames %>% 246 | mutate(percent = round(prop*100, 2)) 247 | ``` 248 | 249 | ## Your Turn 13 250 | 251 | Use `min_rank()` and `mutate()` to rank each row in `babynames` from largest `n` to lowest `n`. 252 | 253 | ```{r} 254 | babynames %>% 255 | mutate(rank = min_rank(desc(prop))) 256 | ``` 257 | 258 | ## Your Turn 14 259 | 260 | Compute each name's rank _within its year and sex_. 261 | Then compute the median rank _for each combination of name and sex_, and arrange the results from highest median rank to lowest. 262 | 263 | ```{r} 264 | babynames %>% 265 | group_by(year, sex) %>% 266 | mutate(rank = min_rank(desc(prop))) %>% 267 | group_by(name, sex) %>% 268 | summarise(score = median(rank)) %>% 269 | arrange(score) 270 | ``` 271 | 272 | ## Flights data 273 | ```{r} 274 | flights 275 | skim(flights) 276 | ``` 277 | 278 | ## Toy data 279 | 280 | ```{r} 281 | band <- tribble( 282 | ~name, ~band, 283 | "Mick", "Stones", 284 | "John", "Beatles", 285 | "Paul", "Beatles" 286 | ) 287 | 288 | instrument <- tribble( 289 | ~name, ~plays, 290 | "John", "guitar", 291 | "Paul", "bass", 292 | "Keith", "guitar" 293 | ) 294 | 295 | instrument2 <- tribble( 296 | ~artist, ~plays, 297 | "John", "guitar", 298 | "Paul", "bass", 299 | "Keith", "guitar" 300 | ) 301 | ``` 302 | 303 | ## Mutating joins 304 | 305 | ```{r} 306 | band %>% left_join(instrument, by = "name") 307 | ``` 308 | 309 | ## Your Turn 15 310 | 311 | Which airlines had the largest arrival delays? Complete the code below. 312 | 313 | 1. Join `airlines` to `flights` 314 | 2. Compute and order the average arrival delays by airline. Display full names, no codes. 315 | 316 | ```{r} 317 | flights %>% 318 | drop_na(arr_delay) %>% 319 | left_join(airlines, by = "carrier") %>% 320 | group_by(name) %>% 321 | summarise(delay = mean(arr_delay)) %>% 322 | arrange(delay) 323 | ``` 324 | 325 | ## Different names 326 | 327 | ```{r} 328 | band %>% left_join(instrument2, by = c("name" = "artist")) 329 | ``` 330 | 331 | ## Your Turn 16 332 | 333 | How many airports in `airports` are serviced by flights originating in New York (i.e. flights in our dataset?) Notice that the column to join on is named `faa` in the **airports** data set and `dest` in the **flights** data set. 
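Before the solution, a quick reminder (an aside, not part of the exercise): `semi_join()` filters the first table down to the rows that have a match in the second table, and it never adds columns from the second table. With the toy data defined above:

```{r}
# Keep only the band members who also appear in the instrument table
band %>% semi_join(instrument, by = "name")
```

That is why it is the right tool here: we want to keep airports, not add flight columns to them.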
334 | 335 | 336 | ```{r} 337 | airports %>% 338 | semi_join(flights, by = c("faa" = "dest")) %>% 339 | distinct(faa) 340 | ``` 341 | 342 | *** 343 | 344 | # Take aways 345 | 346 | * Extract variables with `select()` 347 | * Extract cases with `filter()` 348 | * Arrange cases, with `arrange()` 349 | 350 | * Make tables of summaries with `summarise()` 351 | * Make new variables, with `mutate()` 352 | * Do groupwise operations with `group_by()` 353 | 354 | * Connect operations with `%>%` 355 | 356 | * Use `left_join()`, `right_join()`, `full_join()`, or `inner_join()` to join datasets 357 | * Use `semi_join()` or `anti_join()` to filter datasets against each other 358 | -------------------------------------------------------------------------------- /solutions/03-Tidy-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Tidy -- Solutions" 3 | output: 4 | github_document: 5 | df_print: tibble 6 | html_document: 7 | df_print: paged 8 | --- 9 | 10 | 11 | 12 | 13 | ```{r setup} 14 | library(tidyverse) 15 | library(babynames) 16 | 17 | # Toy data 18 | cases <- tribble( 19 | ~Country, ~"2011", ~"2012", ~"2013", 20 | "FR", 7000, 6900, 7000, 21 | "DE", 5800, 6000, 6200, 22 | "US", 15000, 14000, 13000 23 | ) 24 | pollution <- tribble( 25 | ~city, ~size, ~amount, 26 | "New York", "large", 23, 27 | "New York", "small", 14, 28 | "London", "large", 22, 29 | "London", "small", 16, 30 | "Beijing", "large", 121, 31 | "Beijing", "small", 121 32 | ) 33 | bp_systolic <- tribble( 34 | ~ subject_id, ~ time_1, ~ time_2, ~ time_3, 35 | 1, 120, 118, 121, 36 | 2, 125, 131, NA, 37 | 3, 141, NA, NA 38 | ) 39 | bp_systolic2 <- tribble( 40 | ~ subject_id, ~ time, ~ systolic, 41 | 1, 1, 120, 42 | 1, 2, 118, 43 | 1, 3, 121, 44 | 2, 1, 125, 45 | 2, 2, 131, 46 | 3, 1, 141 47 | ) 48 | ``` 49 | 50 | ## Tidy and untidy data 51 | 52 | `table1` is tidy: 53 | ```{r} 54 | table1 55 | ``` 56 | 57 | For example, it's easy to add a rate column with `mutate()`: 58 | ```{r} 59 | table1 %>% 60 | mutate(rate = cases/population) 61 | ``` 62 | 63 | `table2` isn't tidy, the count column really contains two variables: 64 | ```{r} 65 | table2 66 | ``` 67 | 68 | It makes it very hard to manipulate. 69 | 70 | ## Your Turn 1 71 | 72 | Is `bp_systolic` tidy? 73 | 74 | ```{r} 75 | bp_systolic2 76 | ``` 77 | 78 | ## Your Turn 2 79 | 80 | Using `bp_systolic2` with `group_by()`, and `summarise()`: 81 | 82 | * Find the average systolic blood pressure for each subject 83 | * Find the last time each subject was measured 84 | 85 | ```{r} 86 | bp_systolic2 %>% 87 | group_by(subject_id) %>% 88 | summarise(avg_bp = mean(systolic), 89 | last_time = max(time)) 90 | ``` 91 | 92 | ## Your Turn 3 93 | 94 | On a sheet of paper, draw how the cases data set would look if it had the same values grouped into three columns: **country**, **year**, **n** 95 | 96 | ----------------------------- 97 | country year cases 98 | ------------- ------ -------- 99 | Afghanistan 1999 745 100 | 101 | Afghanistan 2000 2666 102 | 103 | Brazil 1999 37737 104 | 105 | Brazil 2000 80488 106 | 107 | China 1999 212258 108 | 109 | China 2000 213766 110 | ----------------------------- 111 | 112 | ## Your Turn 4 113 | 114 | Use `gather()` to reorganize `table4a` into three columns: **country**, **year**, and **cases**. 
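As a warm-up (a small aside, not part of the exercise), here is the same reshaping on the toy `cases` data from the setup chunk. `gather()` takes the name of the new key column, the name of the new value column, and then the columns to collapse:

```{r}
# Collapse the three year columns into a key column (year) and a value column (n)
cases %>%
  gather(key = "year", value = "n", -Country)
```

The real exercise works the same way, just with `table4a`.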
115 | 116 | ```{r} 117 | table4a %>% 118 | gather(key = "year", 119 | value = "cases", -country) %>% 120 | arrange(country) 121 | ``` 122 | 123 | ## Your Turn 5 124 | 125 | On a sheet of paper, draw how `pollution` would look if it had the same values grouped into three columns: **city**, **large**, **small** 126 | 127 | -------------------------- 128 | city large small 129 | ---------- ------- ------- 130 | Beijing 121 121 131 | 132 | London 22 16 133 | 134 | New York 23 14 135 | -------------------------- 136 | 137 | ## Your Turn 6 138 | 139 | Use `spread()` to reorganize `table2` into four columns: **country**, **year**, **cases**, and **population**. 140 | 141 | ```{r} 142 | table2 %>% 143 | spread(key = type, value = count) 144 | ``` 145 | 146 | *** 147 | 148 | # Take Aways 149 | 150 | Data comes in many formats but R prefers just one: _tidy data_. 151 | 152 | A data set is tidy if and only if: 153 | 154 | 1. Every variable is in its own column 155 | 2. Every observation is in its own row 156 | 3. Every value is in its own cell (which follows from the above) 157 | 158 | What is a variable and an observation may depend on your immediate goal. -------------------------------------------------------------------------------- /solutions/04-Case-Study-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Case Study: Friday the 13th Effect (Solution)' 3 | output: 4 | html_document: 5 | df_print: paged 6 | github_document: 7 | df_print: tibble 8 | --- 9 | 10 | 11 | 12 | ```{r setup} 13 | library(fivethirtyeight) 14 | library(tidyverse) 15 | ``` 16 | 17 | ## Task 18 | 19 | Reproduce this figure from fivethirtyeight's article [*Some People Are Too Superstitious To Have A Baby On Friday The 13th*](https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/): 20 | 21 | ![](resources/bialik-fridaythe13th-2.png) 22 | 23 | ## Data 24 | 25 | In the `fivethiryeight` package there are two datasets containing birth data, but for now let's just work with one `US_births_1994_2003`. Note that since we have data from 1994-2003, our results may differ somewhat from the figure based on 1994-2014. 26 | 27 | ## Your Turn 1 28 | 29 | With your neighbour, brainstorm the steps needed to get the data in a form ready to make the plot. 30 | 31 | ```{r} 32 | US_births_1994_2003 33 | ``` 34 | 35 | ## Some overviews of the data 36 | 37 | Whole time series: 38 | ```{r} 39 | ggplot(US_births_1994_2003, aes(x = date, y = births)) + 40 | geom_line() 41 | ``` 42 | There is so much fluctuation it's really hard to see what is going on. 43 | 44 | Let's try just looking at one year: 45 | ```{r} 46 | US_births_1994_2003 %>% 47 | filter(year == 1994) %>% 48 | ggplot(mapping = aes(x = date, y = births)) + 49 | geom_line() 50 | ``` 51 | Strong weekly pattern accounts for most variation. 52 | 53 | ## Strategy 54 | 55 | Use the figure as a guide for what the data should like to make the final plot. We want to end up with something like: 56 | 57 | --------------------------- 58 | day_of_week avg_diff_13 59 | ------------- ------------- 60 | Mon -2.686 61 | 62 | Tues -1.378 63 | 64 | Wed -3.274 65 | 66 | ... ... 
67 | 68 | --------------------------- 69 | 70 | There is more than one way to get there, but we 71 | ll roughly follow this strategy: 72 | 73 | * Get just the data for the 6th, 13th, and 20th 74 | * Calculate variable of interest: 75 | * (For each month/year): 76 | * Find average births on 6th and 20th 77 | * Find percentage difference between births on 13th and average births on 6th and 20th 78 | 79 | * Average percent difference by day of the week 80 | * Create plot 81 | 82 | ## Your Turn 2 83 | 84 | Extract just the 6th, 13th and 20th of each month: 85 | 86 | ```{r} 87 | US_births_1994_2003 %>% 88 | select(-date) %>% 89 | filter(date_of_month %in% c(6, 13, 20)) 90 | ``` 91 | 92 | ## Your Turn 3 93 | 94 | Which arrangement is tidy? 95 | 96 | **Option 1:** 97 | 98 | ----------------------------------------------------- 99 | year month date_of_month day_of_week births 100 | ------ ------- --------------- ------------- -------- 101 | 1994 1 6 Thurs 11406 102 | 103 | 1994 1 13 Thurs 11212 104 | 105 | 1994 1 20 Thurs 11682 106 | ----------------------------------------------------- 107 | 108 | **Option 2:** 109 | 110 | ---------------------------------------------------- 111 | year month day_of_week 6 13 20 112 | ------ ------- ------------- ------- ------- ------- 113 | 1994 1 Thurs 11406 11212 11682 114 | ---------------------------------------------------- 115 | 116 | (**Hint:** think about our next step *"Find the percent difference between the 13th and the average of the 6th and 12th"*. In which layout will this be easier using our tidy tools?) 117 | 118 | **Solution**: Option 2, since then we can easily use `mutate()`. 119 | 120 | ## Your Turn 4 121 | 122 | Tidy the filtered data to have the days in columns. 123 | 124 | ```{r} 125 | US_births_1994_2003 %>% 126 | select(-date) %>% 127 | filter(date_of_month %in% c(6, 13, 20)) %>% 128 | spread(date_of_month, births) 129 | ``` 130 | 131 | ## Your Turn 5 132 | 133 | Now use `mutate()` to add columns for: 134 | 135 | * The average of the births on the 6th and 20th 136 | * The percentage difference between the number of births on the 13th and the average of the 6th and 20th 137 | 138 | ```{r} 139 | US_births_1994_2003 %>% 140 | select(-date) %>% 141 | filter(date_of_month %in% c(6, 13, 20)) %>% 142 | spread(date_of_month, births) %>% 143 | mutate( 144 | avg_6_20 = (`6` + `20`)/2, 145 | diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 146 | ) 147 | ``` 148 | 149 | ## A little additional exploring 150 | 151 | Now we have a percent difference between the 13th and the 6th and 20th of each month, it's probably worth exploring a little (at the very least to check our calculations seem reasonable). 152 | 153 | To make it a little easier let's assign our current data to a variable 154 | ```{r} 155 | births_diff_13 <- US_births_1994_2003 %>% 156 | select(-date) %>% 157 | filter(date_of_month %in% c(6, 13, 20)) %>% 158 | spread(date_of_month, births) %>% 159 | mutate( 160 | avg_6_20 = (`6` + `20`)/2, 161 | diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 162 | ) 163 | ``` 164 | 165 | Then take a look 166 | ```{r} 167 | births_diff_13 %>% 168 | ggplot(mapping = aes(day_of_week, diff_13)) + 169 | geom_point() 170 | ``` 171 | 172 | Looks like we are on the right path. There's a big outlier one Monday 173 | ```{r} 174 | births_diff_13 %>% 175 | filter(day_of_week == "Mon", diff_13 > 10) 176 | ``` 177 | 178 | Seem's to be driven but a particularly low number of births on the 6th of Sep 1999. Maybe a holiday effect? Labour Day was of the 6th of Sep that year. 
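One quick way to check that hunch (an extra aside, not in the original write-up) is to pull out the 6th of September for every year and compare:

```{r}
# Births on the 6th of September, 1994-2003; 1999, when it fell on Labor Day,
# should stand out as unusually low
US_births_1994_2003 %>%
  filter(month == 9, date_of_month == 6) %>%
  select(year, day_of_week, births)
```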
179 | 180 | ## Your Turn 6 181 | 182 | Summarize each day of the week to have mean of diff_13. 183 | 184 | Then, recreate the fivethirtyeight plot. 185 | 186 | ```{r} 187 | US_births_1994_2003 %>% 188 | select(-date) %>% 189 | filter(date_of_month %in% c(6, 13, 20)) %>% 190 | spread(date_of_month, births) %>% 191 | mutate( 192 | avg_6_20 = (`6` + `20`)/2, 193 | diff_13 = (`13` - avg_6_20) / avg_6_20 * 100 194 | ) %>% 195 | group_by(day_of_week) %>% 196 | summarise(avg_diff_13 = mean(diff_13)) %>% 197 | ggplot(aes(x = day_of_week, y = avg_diff_13)) + 198 | geom_bar(stat = "identity") 199 | ``` 200 | 201 | ## Extra Challenges 202 | 203 | * If you wanted to use the `US_births_2000_2014` data instead, what would you need to change in the pipeline? How about using both `US_births_1994_2003` and `US_births_2000_2014`? 204 | 205 | * Try not removing the `date` column. At what point in the pipeline does it cause problems? Why? 206 | 207 | * Can you come up with an alternative way to investigate the Friday the 13th effect? Try it out! 208 | 209 | ## Takeaways 210 | 211 | The power of the tidyverse comes from being able to easily combine functions that do simple things well. -------------------------------------------------------------------------------- /solutions/05-Data-Types-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Types (solutions)" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup} 7 | library(tidyverse) 8 | library(babynames) 9 | library(nycflights13) 10 | library(stringr) 11 | library(forcats) 12 | library(lubridate) 13 | library(hms) 14 | ``` 15 | 16 | ## Your Turn 1 17 | 18 | Use `flights` to create `delayed`, the variable that displays whether a flight was delayed (`arr_delay > 0`). 19 | 20 | Then, remove all rows that contain an NA in `delayed`. 21 | 22 | Finally, create a summary table that shows: 23 | 24 | 1. How many flights were delayed 25 | 2. What proportion of flights were delayed 26 | 27 | ```{r} 28 | flights %>% 29 | mutate(delayed = arr_delay > 0) %>% 30 | drop_na(delayed) %>% 31 | summarise(total = sum(delayed), prop = mean(delayed)) 32 | ``` 33 | 34 | 35 | ## Your Turn 2 36 | 37 | In your group, fill in the blanks to: 38 | 39 | 1. Isolate the last letter of every name and create a logical variable that displays whether the last letter is one of "a", "e", "i", "o", "u", or "y". 40 | 2. Use a weighted mean to calculate the proportion of children whose name ends in a vowel (by `year` and `sex`) 41 | 3. and then display the results as a line plot. 42 | 43 | ```{r} 44 | babynames %>% 45 | mutate(last = str_sub(name, -1), 46 | vowel = last %in% c("a", "e", "i", "o", "u", "y")) %>% 47 | group_by(year, sex) %>% 48 | summarise(p_vowel = weighted.mean(vowel, n)) %>% 49 | ggplot() + 50 | geom_line(mapping = aes(year, p_vowel, color = sex)) 51 | ``` 52 | 53 | ## Your Turn 3 54 | 55 | Repeat the previous exercise, some of whose code is below, to make a sensible graph of average TV consumption by marital status. 56 | 57 | ```{r} 58 | gss_cat %>% 59 | drop_na(tvhours) %>% 60 | group_by(marital) %>% 61 | summarise(tvhours = mean(tvhours)) %>% 62 | ggplot(aes(tvhours, fct_reorder(marital, tvhours))) + 63 | geom_point() 64 | ``` 65 | 66 | ## Your Turn 4 67 | 68 | Do you think liberals or conservatives watch more TV? 69 | Compute average tv hours by party ID an then plot the results. 
70 | 71 | ```{r} 72 | gss_cat %>% 73 | drop_na(tvhours) %>% 74 | group_by(partyid) %>% 75 | summarise(tvhours = mean(tvhours)) %>% 76 | ggplot(aes(tvhours, fct_reorder(partyid, tvhours))) + 77 | geom_point() + 78 | labs(y = "partyid") 79 | ``` 80 | 81 | ## Your Turn 5 82 | 83 | What is the best time of day to fly? 84 | 85 | Use the `hour` and `minute` variables in `flights` to compute the time of day for each flight as an hms. Then use a smooth line to plot the relationship between time of day and `arr_delay`. 86 | 87 | ```{r} 88 | flights %>% 89 | mutate(time = hms(hour = hour, minute = minute)) %>% 90 | ggplot(aes(time, arr_delay)) + 91 | geom_point(alpha = 0.2) + geom_smooth() 92 | ``` 93 | 94 | ## Your Turn 6 95 | 96 | Fill in the blanks to: 97 | 98 | Extract the day of the week of each flight (as a full name) from `time_hour`. 99 | 100 | Calculate the average `arr_delay` by day of the week. 101 | 102 | Plot the results as a column chart (bar chart) with `geom_col()`. 103 | 104 | ```{r} 105 | flights %>% 106 | mutate(weekday = wday(time_hour, label = TRUE, abbr = FALSE)) %>% 107 | group_by(weekday) %>% 108 | drop_na(arr_delay) %>% 109 | summarise(avg_delay = mean(arr_delay)) %>% 110 | ggplot() + 111 | geom_col(mapping = aes(x = weekday, y = avg_delay)) 112 | ``` 113 | 114 | *** 115 | 116 | # Take Aways 117 | 118 | Dplyr gives you three _general_ functions for manipulating data: `mutate()`, `summarise()`, and `group_by()`. Augment these with functions from the packages below, which focus on specific types of data. 119 | 120 | Package | Data Type 121 | --------- | -------- 122 | stringr | strings 123 | forcats | factors 124 | hms | times 125 | lubridate | dates and times 126 | 127 | -------------------------------------------------------------------------------- /solutions/06-Iterate-solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Iteration (solutions)" 3 | output: 4 | html_document: 5 | df_print: paged 6 | github_document: 7 | df_print: tibble 8 | --- 9 | 10 | 11 | 12 | ```{r setup} 13 | library(tidyverse) 14 | 15 | # Toy data 16 | set.seed(1000) 17 | exams <- list( 18 | student1 = round(runif(10, 50, 100)), 19 | student2 = round(runif(10, 50, 100)), 20 | student3 = round(runif(10, 50, 100)), 21 | student4 = round(runif(10, 50, 100)), 22 | student5 = round(runif(10, 50, 100)) 23 | ) 24 | 25 | extra_credit <- list(0, 0, 10, 10, 15) 26 | ``` 27 | 28 | ## Your Turn 1 29 | 30 | What kind of object is `mod`? Why are models stored as this kind of object? 31 | 32 | ```{r} 33 | mod <- lm(price ~ carat + cut + color + clarity, data = diamonds) 34 | # View(mod) 35 | ``` 36 | 37 | `mod` is a list. A list is used because we need to store lots of heterogeneous information. 38 | 39 | ## Quiz 40 | 41 | What's the difference between a list and an **atomic** vector? 42 | 43 | Atomic vectors are: "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw" vectors. 44 | 45 | Lists can hold data of different types and different lengths, we can even put lists inside other lists. 46 | 47 | ## Your Turn 2 48 | 49 | Here is a list: 50 | 51 | ```{r} 52 | a_list <- list(num = c(8, 9), 53 | log = TRUE, 54 | cha = c("a", "b", "c")) 55 | ``` 56 | 57 | Here are two subsetting commands. Do they return the same values? 
Run the code chunk above, _and then_ run the code chunks below to confirm 58 | 59 | ```{r} 60 | a_list["num"] 61 | ``` 62 | 63 | ```{r} 64 | a_list$num 65 | ``` 66 | 67 | ## Your Turn 3 68 | 69 | What will each of these return? Run the code chunks to confirm. 70 | 71 | ```{r} 72 | vec <- c(-2, -1, 0, 1, 2) 73 | abs(vec) 74 | ``` 75 | 76 | `abs()` returns the absolute value of each element. 77 | 78 | ```{r, error = TRUE} 79 | lst <- list(-2, -1, 0, 1, 2) 80 | abs(lst) 81 | ``` 82 | 83 | Out intent might be to take the absolute value of each element, but we get an error, because `abs()` doens't know how to handle a list. 84 | 85 | ## Your Turn 4 86 | 87 | Run the code in the chunks. What does it return? 88 | 89 | ```{r} 90 | list(student1 = mean(exams$student1), 91 | student2 = mean(exams$student2), 92 | student3 = mean(exams$student3), 93 | student4 = mean(exams$student4), 94 | student5 = mean(exams$student5)) 95 | ``` 96 | 97 | This chunk manually iterates over the elements of `exams` taking the mean of each element, and returning the results in a list. 98 | 99 | ```{r} 100 | library(purrr) 101 | map(exams, mean) 102 | ``` 103 | 104 | This does the exact same thing, but automatically. 105 | 106 | 107 | ## Your Turn 5 108 | 109 | Calculate the variance (`var()`) of each student’s exam grades. 110 | 111 | ```{r} 112 | exams %>% map(var) 113 | ``` 114 | 115 | ## Your Turn 6 116 | 117 | Calculate the max grade (max())for each student. Return the result as a vector. 118 | 119 | ```{r} 120 | exams %>% map_dbl(max) 121 | ``` 122 | 123 | ## Your Turn 7 124 | 125 | Write a function that counts the best exam twice and then takes the average. Use it to grade all of the students. 126 | 127 | 1. Write code that solves the problem for a real object 128 | 2. Wrap the code in `function(){}` to save it 129 | 3. Add the name of the real object as the function argument 130 | 131 | ```{r} 132 | double_best <- function(x) { 133 | (sum(x) + max(x)) / (length(x) + 1) 134 | } 135 | 136 | exams %>% 137 | map_dbl(double_best) 138 | ``` 139 | 140 | ### Your Turn 8 141 | 142 | Compute a final grade for each student, where the final grade is the average test score plus any `extra_credit` assigned to the student. Return the results as a double (i.e. numeric) vector. 143 | 144 | ```{r} 145 | exams %>% 146 | map2_dbl(extra_credit, function(x, y) mean(x) + y) 147 | ``` 148 | 149 | 150 | *** 151 | 152 | # Take Aways 153 | 154 | Lists are a useful way to organize data, but you need to arrange manually for functions to iterate over the elements of a list. 155 | 156 | You can do this with the `map()` family of functions in the purrr package. 157 | 158 | To write a function, 159 | 160 | 1. Write code that solves the problem for a real object 161 | 2. Wrap the code in `function(){}` to save it 162 | 3. Add the name of the real object as the function argument 163 | 164 | This sequence will help prevent bugs in your code (and reduce the time you spend correcting bugs). 
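As a minimal sketch of that three-step pattern (using a made-up `curve_grade()` helper, not one of the exercises above):

```{r}
# Step 1: write code that solves the problem for one real object
mean(exams$student1) + 5

# Steps 2 and 3: wrap the code in function(){} and make the real object
# the function's argument
curve_grade <- function(scores) {
  mean(scores) + 5
}

# The new function drops straight into map_dbl()
exams %>% map_dbl(curve_grade)
```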
165 | -------------------------------------------------------------------------------- /solutions/07-Model-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Model (solutions)" 3 | output: 4 | html_document: 5 | df_print: paged 6 | github_document: 7 | df_print: tibble 8 | --- 9 | 10 | ```{r setup, message=FALSE} 11 | library(tidyverse) 12 | library(modelr) 13 | library(broom) 14 | 15 | wages <- heights %>% filter(income > 0) 16 | ``` 17 | 18 | ## Your Turn 1 19 | 20 | Fit the model on the slide and then examine the output. What does it look like? 21 | 22 | ```{r} 23 | mod_e <- lm(log(income) ~ education, data = wages) 24 | mod_e 25 | ``` 26 | 27 | ## Your Turn 2 28 | 29 | Use a pipe to model `log(income)` against `height`. Then use broom and dplyr functions to extract: 30 | 31 | 1. The **coefficient estimates** and their related statistics 32 | 2. The **adj.r.squared** and **p.value** for the overall model 33 | 34 | ```{r} 35 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 36 | mod_h %>% 37 | tidy() 38 | 39 | mod_h %>% 40 | glance() %>% 41 | select(adj.r.squared, p.value) 42 | ``` 43 | 44 | ## Your Turn 3 45 | 46 | Model `log(income)` against `education` _and_ `height`. Do the coefficients change? 47 | 48 | ```{r} 49 | mod_eh <- wages %>% 50 | lm(log(income) ~ education + height, data = .) 51 | 52 | mod_eh %>% 53 | tidy() 54 | ``` 55 | 56 | ## Your Turn 4 57 | 58 | Model `log(income)` against `education` and `height` and `sex`. Can you interpret the coefficients? 59 | 60 | ```{r} 61 | mod_ehs <- wages %>% 62 | lm(log(income) ~ education + height + sex, data = .) 63 | 64 | mod_ehs %>% 65 | tidy() 66 | ``` 67 | 68 | ## Your Turn 5 69 | 70 | Use a broom function and ggplot2 to make a line graph of `height` vs `.fitted` for our heights model, `mod_h`. 71 | 72 | _Bonus: Overlay the plot on the original data points._ 73 | 74 | ```{r} 75 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 76 | 77 | mod_h %>% 78 | augment(data = wages) %>% 79 | ggplot(mapping = aes(x = height, y = .fitted)) + 80 | geom_point(mapping = aes(y = log(income)), alpha = 0.1) + 81 | geom_line(color = "blue") 82 | ``` 83 | 84 | ## Your Turn 6 85 | 86 | Repeat the process to make a line graph of `height` vs `.fitted` colored by `sex` for model mod_ehs. Are the results interpretable? Add `+ facet_wrap(~education)` to the end of your code. What happens? 87 | 88 | ```{r} 89 | mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) 90 | 91 | mod_ehs %>% 92 | augment(data = wages) %>% 93 | ggplot(mapping = aes(x = height, y = .fitted, color = sex)) + 94 | geom_line() + 95 | facet_wrap(~ education) 96 | ``` 97 | 98 | ## Your Turn 7 99 | 100 | Use one of `spread_predictions()` or `gather_predictions()` to make a line graph of `height` vs `pred` colored by `model` for each of mod_h, mod_eh, and mod_ehs. Are the results interpretable? 101 | 102 | Add `+ facet_grid(sex ~ education)` to the end of your code. What happens? 103 | 104 | ```{r warning = FALSE, message = FALSE} 105 | mod_h <- wages %>% lm(log(income) ~ height, data = .) 106 | mod_eh <- wages %>% lm(log(income) ~ education + height, data = .) 107 | mod_ehs <- wages %>% lm(log(income) ~ education + height + sex, data = .) 
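# gather_predictions() repeats the data once per model and adds two columns:
# `model` (which model made the prediction) and `pred` (the predicted value).
# That long format is what lets ggplot2 draw one colored line per model below.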
108 | 109 | wages %>% 110 | gather_predictions(mod_h, mod_eh, mod_ehs) %>% 111 | ggplot(mapping = aes(x = height, y = pred, color = model)) + 112 | geom_line() + 113 | facet_grid(sex ~ education) 114 | ``` 115 | 116 | *** 117 | 118 | # Take Aways 119 | 120 | * Use `glance()`, `tidy()`, and `augment()` from the **broom** package to return model values in a data frame. 121 | 122 | * Use `add_predictions()` or `gather_predictions()` or `spread_predictions()` from the **modelr** package to visualize predictions. 123 | 124 | * Use `add_residuals()` or `gather_residuals()` or `spread_residuals()` from the **modelr** package to visualize residuals. 125 | 126 | -------------------------------------------------------------------------------- /solutions/08-Organize-Solutions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Organize with List Columns" 3 | output: 4 | html_document: 5 | df_print: paged 6 | github_document: 7 | df_print: tibble 8 | --- 9 | 10 | 11 | 12 | ```{r setup} 13 | library(tidyverse) 14 | library(gapminder) 15 | library(broom) 16 | 17 | nz <- gapminder %>% 18 | filter(country == "New Zealand") 19 | us <- gapminder %>% 20 | filter(country == "United States") 21 | ``` 22 | 23 | ## Your turn 1 24 | 25 | How has life expectancy changed in other countries? 26 | Make a line plot of lifeExp vs. year grouped by country. 27 | Set alpha to 0.2, to see the results better. 28 | 29 | ```{r} 30 | gapminder %>% 31 | ggplot(mapping = aes(x = year, y = lifeExp, group = country)) + 32 | geom_line(alpha = 0.2) 33 | ``` 34 | 35 | ## Quiz 36 | 37 | How is a data frame/tibble similar to a list? 38 | 39 | ```{r} 40 | gapminder_sm <- gapminder[1:5, ] 41 | ``` 42 | 43 | It is a list! Columns are like elements of a list 44 | 45 | You can extract them with `$` of `[[` 46 | ```{r} 47 | gapminder_sm$country 48 | gapminder_sm[["country"]] 49 | ``` 50 | 51 | Or get a new smaller list with `[`: 52 | ```{r} 53 | gapminder_sm["country"] 54 | ``` 55 | 56 | ## Quiz 57 | 58 | If one of the elements of a list can be another list, 59 | can one of the columns of a data frame be another list? 60 | 61 | **Yes!**. 62 | 63 | ```{r} 64 | tibble( 65 | num = c(1, 2, 3), 66 | cha = c("one", "two", "three"), 67 | listcol = list(1, c("1", "two", "FALSE"), FALSE) 68 | ) 69 | ``` 70 | 71 | And we call it a **list column**. 72 | 73 | ## Your turn 2 74 | 75 | Run this chunk: 76 | ```{r} 77 | gapminder_nested <- gapminder %>% 78 | group_by(country) %>% 79 | nest() 80 | 81 | fit_model <- function(df) lm(lifeExp ~ year, data = df) 82 | 83 | gapminder_nested <- gapminder_nested %>% 84 | mutate(model = map(data, fit_model)) 85 | 86 | get_rsq <- function(mod) glance(mod)$r.squared 87 | 88 | gapminder_nested <- gapminder_nested %>% 89 | mutate(r.squared = map_dbl(model, get_rsq)) 90 | ``` 91 | 92 | Then filter `gapminder_nested` to find the countries with r.squared less than 0.5. 93 | 94 | ```{r} 95 | gapminder_nested %>% 96 | filter(r.squared < 0.5) 97 | ``` 98 | 99 | ## Your Turn 3 100 | 101 | Edit the code in the chunk provided to instead find and plot countries with a slope above 0.6 years/year. 
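The solution below leans on `unnest()`, which undoes `nest()`: it expands a list column so that each nested data frame gets its rows back in the top-level table. A tiny sketch with a throwaway tibble (not part of the exercise):

```{r}
# A two-row tibble with a list column holding a small data frame per group...
nested_toy <- tibble(
  g = c("a", "b"),
  data = list(tibble(x = 1:2), tibble(x = 3:4))
)

# ...unnest() expands it back to one row per x value, repeating g as needed
nested_toy %>% unnest(data)
```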
102 | 103 | ```{r} 104 | get_slope <- function(mod) { 105 | tidy(mod) %>% filter(term == "year") %>% pull(estimate) 106 | } 107 | 108 | # Add a new column with each model's slope 109 | gapminder_nested <- gapminder_nested %>% 110 | mutate(slope = map_dbl(model, get_slope)) 111 | 112 | # Keep only the countries with a slope above 0.6 113 | big_slope <- gapminder_nested %>% 114 | filter(slope > 0.6) 115 | 116 | # unnest and plot the result 117 | unnest(big_slope, data) %>% 118 | ggplot(aes(x = year, y = lifeExp)) + 119 | geom_line(aes(color = country)) 120 | ``` 121 | 122 | ## Your Turn 4 123 | 124 | **Challenge:** 125 | 126 | 1. Create your own copy of `gapminder_nested` and then add one more list column: `output`, which contains the output of `augment()` for each model. 127 | 128 | 2. Plot the residuals against time for the countries with small r-squared. 129 | 130 | ```{r} 131 | charlotte_gapminder <- gapminder_nested 132 | 133 | charlotte_gapminder %>% 134 | mutate(output = model %>% map(augment)) %>% 135 | unnest(output) %>% 136 | filter(r.squared < 0.5) %>% 137 | ggplot() + 138 | geom_line(aes(year, .resid, color = country)) 139 | 140 | ``` 141 | 142 | # Take away 143 | 144 | --------------------------------------------------------------------------------