├── .gitignore
├── README.md
├── tidyverse.Rmd
├── tidyverse.md
├── tidyverse.nb.html
└── tidyverse_files
    └── figure-markdown_github
        ├── dplyr-tidyr-ggplot-1.png
        ├── make model data-1.png
        ├── plot cooksd-1.png
        ├── plot resid-1.png
        ├── unnamed-chunk-11-1.png
        ├── unnamed-chunk-14-1.png
        ├── unnamed-chunk-15-1.png
        ├── unnamed-chunk-26-1.png
        ├── unnamed-chunk-4-1.png
        ├── unnamed-chunk-6-1.png
        └── unnamed-chunk-8-1.png


/.gitignore:
--------------------------------------------------------------------------------
1 | *.csv
2 | tidyverse_cache/
3 | .DS_Store


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | `tidyverse` talk for the Davis R-Users' Group, October 13, 2016.
2 |
3 | You can watch the talk here: https://www.youtube.com/watch?v=_rPhSAVhs1A


--------------------------------------------------------------------------------
/tidyverse.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "tidyverse"
3 | author: "Michael Levy, Prepared for the Davis R-Users' Group"
4 | date: "October 13, 2016"
5 | output:
6 |   github_document: default
7 |   html_notebook: default
8 | ---
9 |
10 | ```{r setup, include = FALSE}
11 | knitr::opts_chunk$set(cache = TRUE, error = TRUE, fig.width = 4, fig.asp = 1)
12 | ```
13 |
14 |
15 | ## What is the tidyverse?
16 |
17 | ~~Hadleyverse~~
18 |
19 | The tidyverse is a suite of R tools that follow a tidy philosophy:
20 |
21 | ### Tidy data
22 |
23 | Put data in data frames
24 |
25 | - Each type of observation gets a data frame
26 | - Each variable gets a column
27 | - Each observation gets a row
28 |
29 | ### Tidy APIs
30 |
31 | Functions should be consistent and easily (human) readable
32 |
33 | - Take one step at a time
34 | - Connect simple steps with the pipe
35 | - Referential transparency
36 |
37 |
38 | ### Okay but really, what is it?
39 |
40 | Suite of ~20 packages that provide consistent, user-friendly, smart-default tools to do most of what most people do in R.
41 |
42 | - Core packages: ggplot2, dplyr, tidyr, readr, purrr, tibble
43 | - Specialized data manipulation: hms, stringr, lubridate, forcats
44 | - Data import: DBI, haven, httr, jsonlite, readxl, rvest, xml2
45 | - Modeling: modelr, broom
46 |
47 | `install.packages("tidyverse")` installs all of the above packages.
48 |
49 | `library(tidyverse)` attaches only the core packages.
50 |
51 |
52 | ## Why tidyverse?
53 |
54 | - Consistency
55 |     - e.g. All `stringr` functions take string first
56 |     - e.g. Many functions take data.frame first -> piping
57 | - Faster to write
58 | - Easier to read
59 | - Tidy data: Imposes good practices
60 | - Type specificity
61 | - You probably use some of it already. Synergize.
62 | - Implements simple solutions to common problems (e.g. `purrr::transpose`)
63 | - Smarter defaults
64 |     - e.g. `utils::write.csv(row.names = FALSE)` = `readr::write_csv()`
65 | - Runs fast (thanks to `Rcpp`)
66 | - Interfaces well with other tools (e.g. Spark with `dplyr` via `sparklyr`)
67 |
68 | ## `tibble`
69 |
70 | > A modern reimagining of data frames.
71 |
72 | ```{r Attach core packages}
73 | library(tidyverse)
74 | ```
75 |
76 | ```{r class tbl}
77 | tdf = tibble(x = 1:1e4, y = rnorm(1e4)) # == data_frame(x = 1:1e4, y = rnorm(1e4))
78 | class(tdf)
79 | ```
80 |
81 |
82 | Tibbles print politely.
83 |
84 | ```{r print tbl}
85 | tdf
86 | ```
87 |
88 |
89 | - Can customize print methods with `print(tdf, n = rows, width = cols)`
90 |
91 | - Set default with `options(tibble.print_max = rows, tibble.width = cols)`
92 |
93 | Tibbles have some convenient and consistent defaults that are different from base R data.frames.
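
Going back to the two printing controls listed above, here is a minimal sketch of how they might be used. The row counts (3 and 5) are arbitrary illustration values, and `tibble.print_min` is set alongside `tibble.print_max` on the assumption that you want the same number of rows shown once the threshold is exceeded.

```r
library(tibble)

tdf <- tibble(x = 1:1e4, y = rnorm(1e4))

# One-off: show only the first 3 rows for this call
print(tdf, n = 3)

# Session-wide: tables longer than print_max rows show print_min rows.
# options() returns the previous values so they can be restored later.
old <- options(tibble.print_max = 5, tibble.print_min = 5)
tdf
options(old)  # put the previous defaults back
```

Keeping the old values around and restoring them is optional, but it avoids surprising whoever runs the next chunk.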
94 |
95 | #### strings as factors
96 |
97 | ```{r strings as factors}
98 | dfs = list(
99 |   df = data.frame(abc = letters[1:3], xyz = letters[24:26]),
100 |   tbl = data_frame(abc = letters[1:3], xyz = letters[24:26])
101 | )
102 | sapply(dfs, function(d) class(d$abc))
103 | ```
104 |
105 |
106 | #### partial matching of names
107 |
108 | ```{r partial matching}
109 | sapply(dfs, function(d) d$a)
110 | ```
111 |
112 | #### type consistency
113 |
114 | ```{r single bracket excision}
115 | sapply(dfs, function(d) class(d[, "abc"]))
116 | ```
117 |
118 | Note that tidyverse import functions (e.g. `readr::read_csv`) default to tibbles and that *this can break existing code*.
119 |
120 | #### List-columns!
121 |
122 | ```{r list columns}
123 | tibble(ints = 1:5,
124 |        powers = lapply(1:5, function(x) x^(1:x)))
125 | ```
126 |
127 |
128 | ## The pipe `%>%`
129 |
130 | Sends the output of the LHS function to the first argument of the RHS function.
131 |
132 | ```{r pipe}
133 | sum(1:8) %>%
134 |   sqrt()
135 | ```
136 |
137 |
138 | ## `dplyr`
139 |
140 | Common data(frame) manipulation tasks.
141 |
142 | Four core "verbs": filter, select, arrange, group_by + summarize, plus many more convenience functions.
143 |
144 |
145 | ```{r load movies}
146 | library(ggplot2movies)
147 | str(movies)
148 | ```
149 |
150 | ```{r filter}
151 | filter(movies, length > 360)
152 | ```
153 |
154 | ```{r select}
155 | filter(movies, length > 360) %>%
156 |   select(title, rating, votes)
157 | ```
158 |
159 | ```{r arrange}
160 | filter(movies, Animation == 1, votes > 1000) %>%
161 |   select(title, rating) %>%
162 |   arrange(desc(rating))
163 | ```
164 |
165 | `summarize` makes `aggregate` and `tapply` functionality easier, and the output is always a data frame.
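
For comparison, a small sketch of the same grouped mean done three ways. It uses the built-in `mtcars` data rather than `movies` so it runs without extra packages; that choice of dataset and the object names are illustrative assumptions, not part of the talk.

```r
library(dplyr)

# Base R, formula interface: returns a data frame
base_agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Base R, tapply: returns a named vector, not a data frame
base_tap <- tapply(mtcars$mpg, mtcars$cyl, mean)

# dplyr: always a data frame (tibble), one row per group
dplyr_res <- mtcars %>%
  group_by(cyl) %>%
  summarize(mpg = mean(mpg))

# The numbers agree; only the shape of the result differs
all.equal(base_agg$mpg, dplyr_res$mpg)
```

The point is the return type: `tapply` hands back a vector or array depending on the inputs, while `summarize` keeps you in a data frame that can go straight into the next pipe step.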
166 |
167 | ```{r summarize}
168 | filter(movies, mpaa != "") %>%
169 |   group_by(year, mpaa) %>%
170 |   summarize(avg_budget = mean(budget, na.rm = TRUE),
171 |             avg_rating = mean(rating, na.rm = TRUE)) %>%
172 |   arrange(desc(year), mpaa)
173 | ```
174 |
175 |
176 | `count` for frequency tables. Note the consistent API and easy readability vs. `table`.
177 |
178 | ```{r count}
179 | filter(movies, mpaa != "") %>%
180 |   count(year, mpaa, Animation, sort = TRUE)
181 | ```
182 |
183 |
184 | ```{r table}
185 | basetab = with(movies[movies$mpaa != "", ], table(year, mpaa, Animation))
186 | basetab[1:5, , ]
187 | ```
188 |
189 |
190 | ### joins
191 |
192 | `dplyr` also does multi-table joins and can connect to various types of databases.
193 |
194 | ```{r full join}
195 | t1 = data_frame(alpha = letters[1:6], num = 1:6)
196 | t2 = data_frame(alpha = letters[4:10], num = 4:10)
197 | full_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))
198 | ```
199 |
200 |
201 | Super-secret pro-tip: You can `group_by` %>% `mutate` to accomplish a summarize + join
202 |
203 | ```{r group mutate}
204 | data_frame(group = sample(letters[1:3], 10, replace = TRUE),
205 |            value = rnorm(10)) %>%
206 |   group_by(group) %>%
207 |   mutate(group_average = mean(value))
208 | ```
209 |
210 |
211 |
212 |
213 | ## `tidyr`
214 |
215 | Latest generation of `reshape`. `gather` to make wide table long, `spread` to make long tables wide.
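
A tiny round trip may make the wide/long distinction concrete. This is a sketch on a made-up three-column data frame (`wide`, `long`, and `back` are hypothetical names). Note that in tidyr 1.0 and later, `gather` and `spread` still work but are superseded by `pivot_longer` and `pivot_wider`.

```r
library(tidyr)

# Wide: one column per measurement
wide <- data.frame(id = 1:2, a = c(10, 20), b = c(30, 40))

# Long: one row per id/measurement pair; `key` holds the old column names
long <- gather(wide, key, value, -id)

# And back to wide: one column per distinct value of `key`
back <- spread(long, key, value)
```

`long` has four rows (two ids times two measurements), and `back` has the same columns and values as `wide`.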
216 |
217 | ```{r who}
218 | who # Tuberculosis data from the WHO
219 | ```
220 |
221 | ```{r gather}
222 | who %>%
223 |   gather(group, cases, -country, -iso2, -iso3, -year)
224 | ```
225 |
226 |
227 | ## `ggplot2`
228 |
229 | If you don't already know and love it, check out [one of](https://d-rug.github.io/blog/2012/ggplot-introduction) [our](https://d-rug.github.io/blog/2013/xtsmarkdown) [previous](https://d-rug.github.io/blog/2013/formatting-plots-for-pubs) [talks](https://d-rug.github.io/blog/2015/ggplot-tutorial-johnston) on ggplot or any of the excellent resources on the internet.
230 |
231 | Note that the pipe and consistent API make it easy to combine functions from different packages, and the whole thing is quite readable.
232 |
233 | ```{r dplyr-tidyr-ggplot}
234 | who %>%
235 |   select(-iso2, -iso3) %>%
236 |   gather(group, cases, -country, -year) %>%
237 |   count(country, year, wt = cases) %>%
238 |   ggplot(aes(x = year, y = n, group = country)) +
239 |   geom_line(size = .2)
240 | ```
241 |
242 |
243 | ## `readr`
244 |
245 | For reading flat files. Faster than base with smarter defaults.
246 |
247 | ```{r make big df}
248 | bigdf = data_frame(int = 1:1e6,
249 |                    squares = int^2,
250 |                    letters = sample(letters, 1e6, replace = TRUE))
251 | ```
252 |
253 | ```{r base write}
254 | system.time(
255 |   write.csv(bigdf, "base-write.csv")
256 | )
257 | ```
258 |
259 | ```{r readr write}
260 | system.time(
261 |   write_csv(bigdf, "readr-write.csv")
262 | )
263 | ```
264 |
265 | ```{r base read}
266 | read.csv("base-write.csv", nrows = 3)
267 | ```
268 |
269 | ```{r readr read}
270 | read_csv("readr-write.csv", n_max = 3)
271 | ```
272 |
273 | ## `broom`
274 |
275 | `broom` is a convenient little package to work with model results. Two functions I find useful are `tidy` to extract model results and `augment` to add residuals, predictions, etc. to a data.frame.
276 |
277 | ```{r make model data}
278 | d = data_frame(x = runif(20, 0, 10),
279 |                y = 2 * x + rnorm(20))
280 | qplot(x, y, data = d)
281 | ```
282 |
283 | ### `tidy`
284 |
285 | ```{r tidy}
286 | library(broom) # Not attached with tidyverse
287 | model = lm(y ~ x, d)
288 | tidy(model)
289 | ```
290 |
291 | ### `augment`
292 |
293 | i.e. The function formerly known as `fortify`.
294 |
295 | ```{r augment}
296 | aug = augment(model)
297 | aug
298 | ```
299 |
300 | ```{r plot resid}
301 | ggplot(aug, aes(x = x)) +
302 |   geom_point(aes(y = y, color = .resid)) +
303 |   geom_line(aes(y = .fitted)) +
304 |   viridis::scale_color_viridis() +
305 |   theme(legend.position = c(0, 1), legend.justification = c(0, 1))
306 | ```
307 |
308 | ```{r plot cooksd}
309 | ggplot(aug, aes(.fitted, .resid, size = .cooksd)) +
310 |   geom_point()
311 | ```
312 |
313 |
314 |
315 | ## `purrr`
316 |
317 | `purrr` is kind of like `dplyr` for lists. It helps you repeatedly apply functions. Like the rest of the tidyverse, nothing you can't do in base R, but `purrr` makes the API consistent, encourages type specificity, and provides some nice shortcuts and speed ups.
318 |
319 | ```{r intro and speedtest}
320 | df = data_frame(fun = rep(c(lapply, map), 2),
321 |                 n = rep(c(1e5, 1e7), each = 2),
322 |                 comp_time = map2(fun, n, ~system.time(.x(1:.y, sqrt))))
323 | df$comp_time
324 | ```
325 |
326 |
327 | ### `map`
328 |
329 | Vanilla `map` is a slightly improved version of `lapply`. Do a function on each item in a list.
330 |
331 | ```{r map}
332 | map(1:4, log)
333 | ```
334 |
335 | Can supply additional arguments as with `(x)apply`
336 |
337 | ```{r map arg}
338 | map(1:4, log, base = 2)
339 | ```
340 |
341 | Can compose anonymous functions like `(x)apply`, either the old way or with a new formula shorthand.
342 |
343 | ```{r map formula}
344 | map(1:4, ~ log(4, base = .x)) # == map(1:4, function(x) log(4, base = x))
345 | ```
346 |
347 | `map` always returns a list.
`map_xxx` type-specifies the output type and simplifies the list to a vector.
348 |
349 | ```{r map_type}
350 | map_dbl(1:4, log, base = 2)
351 | ```
352 |
353 | And throws an error if any output isn't of the expected type (which is a good thing!).
354 |
355 | ```{r map_type error}
356 | map_int(1:4, log, base = 2)
357 | ```
358 |
359 |
360 | `map2` is like `mapply` -- apply a function over two lists in parallel. `pmap` generalizes to any number of lists.
361 |
362 | ```{r map2}
363 | fwd = 1:10
364 | bck = 10:1
365 | map2_dbl(fwd, bck, `^`)
366 | ```
367 |
368 | `map_if` tests each element on a function and if true applies the second function, if false returns the original element.
369 |
370 | ```{r map_if}
371 | data_frame(ints = 1:5, lets = letters[1:5], sqrts = ints^.5) %>%
372 |   map_if(is.numeric, ~ .x^2)
373 | ```
374 |
375 | ### Putting `map` to work
376 |
377 | Split the movies data frame by mpaa rating, fit a linear model to each data frame, and organize the model results in a data frame.
378 |
379 | ```{r movies split models}
380 | movies %>%
381 |   filter(mpaa != "") %>%
382 |   split(.$mpaa) %>%
383 |   map(~ lm(rating ~ budget, data = .)) %>%
384 |   map_df(tidy, .id = "mpaa-rating") %>%
385 |   arrange(term)
386 | ```
387 |
388 | List-columns make it easier to organize complex datasets. Can `map` over list-columns right in `data_frame`/`tibble` creation. And if you later want to calculate something else, everything is nicely organized in the data frame.
389 |
390 | ```{r list columns + map}
391 | d =
392 |   data_frame(
393 |     dist = c("normal", "poisson", "chi-square"),
394 |     funs = list(rnorm, rpois, rchisq),
395 |     samples = map(funs, ~.(100, 5)),
396 |     mean = map_dbl(samples, mean),
397 |     var = map_dbl(samples, var)
398 |   )
399 | d$median = map_dbl(d$samples, median)
400 | d
401 | ```
402 |
403 | Let's see if we can really make this purrr...
Fit a linear model of diamond price by every combination of two predictors in the dataset and see which two predict best.
404 |
405 | ```{r diamonds predictors}
406 | train = sample(nrow(diamonds), floor(nrow(diamonds) * .67))
407 | setdiff(names(diamonds), "price") %>%
408 |   combn(2, paste, collapse = " + ") %>%
409 |   structure(., names = .) %>%
410 |   map(~ formula(paste("price ~ ", .x))) %>%
411 |   map(lm, data = diamonds[train, ]) %>%
412 |   map_df(augment, newdata = diamonds[-train, ], .id = "predictors") %>%
413 |   group_by(predictors) %>%
414 |   summarize(rmse = sqrt(mean((price - .fitted)^2))) %>%
415 |   arrange(rmse)
416 | ```
417 |
418 |
419 | ### Type-stability
420 |
421 | We have seen that we can use `map_lgl` to ensure we get a logical vector, `map_chr` to ensure we get a character vector back, etc. Type stability is like a little built-in unit test. You make sure you're getting what you think you are, even in the middle of a pipeline or function. Here are two more type-stable functions implemented in `purrr`.
422 |
423 | #### `flatten`
424 |
425 | Like `unlist` but can specify output type, and never recurses.
426 |
427 | ```{r flatten}
428 | map(-1:3, ~.x ^ seq(-.5, .5, .5)) %>%
429 |   flatten_dbl()
430 | ```
431 |
432 | #### `safely`
433 |
434 | ```{r error}
435 | junk = list(letters, 1:20, median)
436 | map(junk, ~ log(.x))
437 | ```
438 |
439 | - `safely` "catches" errors and always "succeeds".
440 | - `try` does the same, but either returns the value or a try-error object.
441 | - `safely` is type-stable. It always returns a length-two list with one object NULL.
442 |
443 | ```{r safely}
444 | safe = map(junk, ~ safely(log)(.x)) # Note the different syntax from try(log(.x)). `safely(log)` creates a new function.
445 | safe
446 | ```
447 |
448 | #### `transpose` a list!
449 |
450 | Now we could conveniently move on where the function succeeded, particularly using `map_if`.
To get that logical vector for the `map_if` test, we can use the `transpose` function, which inverts a list.
451 |
452 | ```{r}
453 | transpose(safe)
454 | ```
455 |
456 | ```{r}
457 | map_if(transpose(safe)$result, ~!is.null(.x), median)
458 | ```
459 |
460 | ## `stringr`
461 |
462 | All your string manipulation and regex functions with a consistent API.
463 |
464 | ```{r}
465 | library(stringr) # not attached with tidyverse
466 | fishes <- c("one fish", "two fish", "red fish", "blue fish")
467 | str_detect(fishes, "two")
468 | ```
469 |
470 | ```{r}
471 | str_replace_all(fishes, "fish", "banana")
472 | ```
473 |
474 | ```{r}
475 | str_extract(fishes, "[a-z]\\s")
476 | ```
477 |
478 | Let's put that string manipulation engine to work. Remember the annoying column names in the WHO data? They look like this `r stringr::str_c(colnames(tidyr::who)[5:7], collapse = ", ")`, where "new" or "new_" doesn't mean anything, the following 2-3 letters indicate the test used, the following letter indicates the gender, and the final 2-4 digits indicate the age-class. A string-handling challenge if ever there was one. Let's separate it out and plot the cases by year, gender, age-class, and test-method.
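
Before running the parsing on the whole table, it can help to try the same regular expressions on a handful of names. A minimal sketch, assuming three column names taken from `who` (`new_sp_m014`, `new_ep_f65`, `newrel_m1524`); the intermediate variable names are illustrative only.

```r
library(stringr)

nms <- c("new_sp_m014", "new_ep_f65", "newrel_m1524")

# Drop the meaningless "new" / "new_" prefix
stripped <- str_replace(nms, "new_*", "")

# First run of letters is the test method
method <- str_extract(stripped, "[a-z]+")

# The letter right after the underscore is the gender
gender <- str_sub(str_extract(stripped, "_[a-z]"), 2, 2)

# Trailing digits are the age class: "014" -> "0-14", "65" -> "65+"
age_raw <- str_extract(stripped, "[0-9]+")
age <- ifelse(str_length(age_raw) > 2,
              str_c(str_sub(age_raw, 1, -3), str_sub(age_raw, -2, -1), sep = "-"),
              str_c(age_raw, "+"))

data.frame(nms, method, gender, age)
```

Checking the pieces on three strings first makes it much easier to spot a regex that is too greedy before it silently mangles 400,000 rows.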
479 |
480 | ```{r, fig.width = 8, fig.asp = .6}
481 | who %>%
482 |   select(-iso2, -iso3) %>%
483 |   gather(group, cases, -country, -year) %>%
484 |   mutate(group = str_replace(group, "new_*", ""),
485 |          method = str_extract(group, "[a-z]+"),
486 |          gender = str_sub(str_extract(group, "_[a-z]"), 2, 2),
487 |          age = str_extract(group, "[0-9]+"),
488 |          age = ifelse(str_length(age) > 2,
489 |                       str_c(str_sub(age, 1, -3), str_sub(age, -2, -1), sep = "-"),
490 |                       str_c(age, "+"))) %>%
491 |   group_by(year, gender, age, method) %>%
492 |   summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
493 |   ggplot(aes(x = year, y = total_cases, linetype = gender)) +
494 |   geom_line() +
495 |   facet_grid(method ~ age,
496 |              labeller = labeller(.rows = label_both, .cols = label_both)) +
497 |   scale_y_log10() +
498 |   theme_light() +
499 |   theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
500 | ```
501 |
502 |
503 | ## Post-talk debugging improvisation
504 |
505 | ```{r}
506 | pipe_stopifnot = function(df, test){
507 |   stopifnot(test)
508 |   return(df)
509 | }
510 | ```
511 |
512 | ```{r}
513 | print_and_go = function(df, what_to_print) {
514 |   cat(what_to_print)
515 |   return(df)
516 | }
517 | ```
518 |
519 |


--------------------------------------------------------------------------------
/tidyverse.md:
--------------------------------------------------------------------------------
1 | tidyverse
2 | ================
3 | Michael Levy, Prepared for the Davis R-Users' Group
4 | October 13, 2016
5 |
6 | What is the tidyverse?
7 | ----------------------
8 |
9 | ~~Hadleyverse~~
10 |
11 | The tidyverse is a suite of R tools that follow a tidy philosophy:
12 |
13 | ### Tidy data
14 |
15 | Put data in data frames
16 |
17 | - Each type of observation gets a data frame
18 | - Each variable gets a column
19 | - Each observation gets a row
20 |
21 | ### Tidy APIs
22 |
23 | Functions should be consistent and easily (human) readable
24 |
25 | - Take one step at a time
26 | - Connect simple steps with the pipe
27 | - Referential transparency
28 |
29 | ### Okay but really, what is it?
30 |
31 | Suite of ~20 packages that provide consistent, user-friendly, smart-default tools to do most of what most people do in R.
32 |
33 | - Core packages: ggplot2, dplyr, tidyr, readr, purrr, tibble
34 | - Specialized data manipulation: hms, stringr, lubridate, forcats
35 | - Data import: DBI, haven, httr, jsonlite, readxl, rvest, xml2
36 | - Modeling: modelr, broom
37 |
38 | `install.packages("tidyverse")` installs all of the above packages.
39 |
40 | `library(tidyverse)` attaches only the core packages.
41 |
42 | Why tidyverse?
43 | --------------
44 |
45 | - Consistency
46 |     - e.g. All `stringr` functions take string first
47 |     - e.g. Many functions take data.frame first -> piping
48 | - Faster to write
49 | - Easier to read
50 | - Tidy data: Imposes good practices
51 | - Type specificity
52 | - You probably use some of it already. Synergize.
53 | - Implements simple solutions to common problems (e.g. `purrr::transpose`)
54 | - Smarter defaults
55 |     - e.g. `utils::write.csv(row.names = FALSE)` = `readr::write_csv()`
56 | - Runs fast (thanks to `Rcpp`)
57 | - Interfaces well with other tools (e.g. Spark with `dplyr` via `sparklyr`)
58 |
59 | `tibble`
60 | --------
61 |
62 | > A modern reimagining of data frames.
63 |
64 | ``` r
65 | library(tidyverse)
66 | ```
67 |
68 | ``` r
69 | tdf = tibble(x = 1:1e4, y = rnorm(1e4)) # == data_frame(x = 1:1e4, y = rnorm(1e4))
70 | class(tdf)
71 | ```
72 |
73 |     ## [1] "tbl_df" "tbl" "data.frame"
74 |
75 | Tibbles print politely.
76 |
77 | ``` r
78 | tdf
79 | ```
80 |
81 |     ## # A tibble: 10,000 × 2
82 |     ## x y
83 |     ##
84 |     ## 1 1 1.7307583
85 |     ## 2 2 1.4246209
86 |     ## 3 3 0.2762850
87 |     ## 4 4 1.9267297
88 |     ## 5 5 1.8189041
89 |     ## 6 6 1.1574624
90 |     ## 7 7 0.1248573
91 |     ## 8 8 -0.1066158
92 |     ## 9 9 -0.7412011
93 |     ## 10 10 -0.9383221
94 |     ## # ... with 9,990 more rows
95 |
96 | - Can customize print methods with `print(tdf, n = rows, width = cols)`
97 |
98 | - Set default with `options(tibble.print_max = rows, tibble.width = cols)`
99 |
100 | Tibbles have some convenient and consistent defaults that are different from base R data.frames.
101 |
102 | #### strings as factors
103 |
104 | ``` r
105 | dfs = list(
106 |   df = data.frame(abc = letters[1:3], xyz = letters[24:26]),
107 |   tbl = data_frame(abc = letters[1:3], xyz = letters[24:26])
108 | )
109 | sapply(dfs, function(d) class(d$abc))
110 | ```
111 |
112 |     ## df tbl
113 |     ## "factor" "character"
114 |
115 | #### partial matching of names
116 |
117 | ``` r
118 | sapply(dfs, function(d) d$a)
119 | ```
120 |
121 |     ## Warning: Unknown column 'a'
122 |
123 |     ## $df
124 |     ## [1] a b c
125 |     ## Levels: a b c
126 |     ##
127 |     ## $tbl
128 |     ## NULL
129 |
130 | #### type consistency
131 |
132 | ``` r
133 | sapply(dfs, function(d) class(d[, "abc"]))
134 | ```
135 |
136 |     ## $df
137 |     ## [1] "factor"
138 |     ##
139 |     ## $tbl
140 |     ## [1] "tbl_df" "tbl" "data.frame"
141 |
142 | Note that tidyverse import functions (e.g. `readr::read_csv`) default to tibbles and that *this can break existing code*.
143 |
144 | #### List-columns!
145 |
146 | ``` r
147 | tibble(ints = 1:5,
148 |        powers = lapply(1:5, function(x) x^(1:x)))
149 | ```
150 |
151 |     ## # A tibble: 5 × 2
152 |     ## ints powers
153 |     ##
154 |     ## 1 1
155 |     ## 2 2
156 |     ## 3 3
157 |     ## 4 4
158 |     ## 5 5
159 |
160 | The pipe `%>%`
161 | --------------
162 |
163 | Sends the output of the LHS function to the first argument of the RHS function.
164 |
165 | ``` r
166 | sum(1:8) %>%
167 |   sqrt()
168 | ```
169 |
170 |     ## [1] 6
171 |
172 | `dplyr`
173 | -------
174 |
175 | Common data(frame) manipulation tasks.
176 |
177 | Four core "verbs": filter, select, arrange, group\_by + summarize, plus many more convenience functions.
178 |
179 | ``` r
180 | library(ggplot2movies)
181 | str(movies)
182 | ```
183 |
184 |     ## Classes 'tbl_df', 'tbl' and 'data.frame': 58788 obs. of 24 variables:
185 |     ## $ title : chr "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
186 |     ## $ year : int 1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
187 |     ## $ length : int 121 71 7 70 71 91 93 25 97 61 ...
188 |     ## $ budget : int NA NA NA NA NA NA NA NA NA NA ...
189 |     ## $ rating : num 6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
190 |     ## $ votes : int 348 20 5 6 17 45 200 24 18 51 ...
191 |     ## $ r1 : num 4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
192 |     ## $ r2 : num 4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
193 |     ## $ r3 : num 4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
194 |     ## $ r4 : num 4.5 24.5 0 0 14.5 14.5 4.5 4.5 0 4.5 ...
195 |     ## $ r5 : num 14.5 14.5 0 0 14.5 14.5 24.5 4.5 0 4.5 ...
196 |     ## $ r6 : num 24.5 14.5 24.5 0 4.5 14.5 24.5 14.5 0 44.5 ...
197 |     ## $ r7 : num 24.5 14.5 0 0 0 4.5 14.5 14.5 34.5 14.5 ...
198 |     ## $ r8 : num 14.5 4.5 44.5 0 0 4.5 4.5 14.5 14.5 4.5 ...
199 |     ## $ r9 : num 4.5 4.5 24.5 34.5 0 14.5 4.5 4.5 4.5 4.5 ...
200 |     ## $ r10 : num 4.5 14.5 24.5 45.5 24.5 14.5 14.5 14.5 24.5 4.5 ...
201 |     ## $ mpaa : chr "" "" "" "" ...
202 |     ## $ Action : int 0 0 0 0 0 0 1 0 0 0 ...
203 |     ## $ Animation : int 0 0 1 0 0 0 0 0 0 0 ...
204 |     ## $ Comedy : int 1 1 0 1 0 0 0 0 0 0 ...
205 |     ## $ Drama : int 1 0 0 0 0 1 1 0 1 0 ...
206 |     ## $ Documentary: int 0 0 0 0 0 0 0 1 0 0 ...
207 |     ## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
208 |     ## $ Short : int 0 0 1 0 0 0 0 1 0 0 ...
209 |
210 | ``` r
211 | filter(movies, length > 360)
212 | ```
213 |
214 |     ## # A tibble: 21 × 24
215 |     ## title year length budget
216 |     ##
217 |     ## 1 Commune (Paris, 1871), La 2000 555 NA
218 |     ## 2 Cure for Insomnia, The 1987 5220 NA
219 |     ## 3 Ebolusyon ng isang pamilyang pilipino 2004 647 NA
220 |     ## 4 Empire 1964 485 NA
221 |     ## 5 Farmer's Wife, The 1998 390 NA
222 |     ## 6 Foolish Wives 1922 384 1100000
223 |     ## 7 Four Stars 1967 1100 NA
224 |     ## 8 Hitler - ein Film aus Deutschland 1978 407 NA
225 |     ## 9 Imitation of Christ 1967 480 NA
226 |     ## 10 Longest Most Meaningless Movie in the World, The 1970 2880 NA
227 |     ## # ... with 11 more rows, and 20 more variables: rating , votes ,
228 |     ## # r1 , r2 , r3 , r4 , r5 , r6 , r7 ,
229 |     ## # r8 , r9 , r10 , mpaa , Action ,
230 |     ## # Animation , Comedy , Drama , Documentary ,
231 |     ## # Romance , Short
232 |
233 | ``` r
234 | filter(movies, length > 360) %>%
235 |   select(title, rating, votes)
236 | ```
237 |
238 |     ## # A tibble: 21 × 3
239 |     ## title rating votes
240 |     ##
241 |     ## 1 Commune (Paris, 1871), La 7.8 33
242 |     ## 2 Cure for Insomnia, The 3.8 59
243 |     ## 3 Ebolusyon ng isang pamilyang pilipino 8.4 5
244 |     ## 4 Empire 5.5 46
245 |     ## 5 Farmer's Wife, The 8.5 52
246 |     ## 6 Foolish Wives 7.6 191
247 |     ## 7 Four Stars 3.0 12
248 |     ## 8 Hitler - ein Film aus Deutschland 9.0 70
249 |     ## 9 Imitation of Christ 4.4 5
250 |     ## 10 Longest Most Meaningless Movie in the World, The 6.4 15
251 |     ## # ... with 11 more rows
252 |
253 | ``` r
254 | filter(movies, Animation == 1, votes > 1000) %>%
255 |   select(title, rating) %>%
256 |   arrange(desc(rating))
257 | ```
258 |
259 |     ## # A tibble: 135 × 2
260 |     ## title rating
261 |     ##
262 |     ## 1 Sen to Chihiro no kamikakushi 8.6
263 |     ## 2 Duck Amuck 8.4
264 |     ## 3 Wallace & Gromit: The Wrong Trousers 8.4
265 |     ## 4 Finding Nemo 8.3
266 |     ## 5 Hotaru no haka 8.3
267 |     ## 6 Incredibles, The 8.3
268 |     ## 7 Mononoke-hime 8.3
269 |     ## 8 What's Opera, Doc? 8.3
270 |     ## 9 Vincent 8.2
271 |     ## 10 Wallace & Gromit: A Close Shave 8.2
272 |     ## # ... with 125 more rows
273 |
274 | `summarize` makes `aggregate` and `tapply` functionality easier, and the output is always a data frame.
275 |
276 | ``` r
277 | filter(movies, mpaa != "") %>%
278 |   group_by(year, mpaa) %>%
279 |   summarize(avg_budget = mean(budget, na.rm = TRUE),
280 |             avg_rating = mean(rating, na.rm = TRUE)) %>%
281 |   arrange(desc(year), mpaa)
282 | ```
283 |
284 |     ## Source: local data frame [128 x 4]
285 |     ## Groups: year [54]
286 |     ##
287 |     ## year mpaa avg_budget avg_rating
288 |     ##
289 |     ## 1 2005 NC-17 NaN 6.700000
290 |     ## 2 2005 PG 45857143 5.733333
291 |     ## 3 2005 PG-13 42269333 5.326087
292 |     ## 4 2005 R 24305882 4.595833
293 |     ## 5 2004 PG 45126852 5.847619
294 |     ## 6 2004 PG-13 46288254 6.080180
295 |     ## 7 2004 R 19548519 5.848469
296 |     ## 8 2003 PG 37057692 5.897674
297 |     ## 9 2003 PG-13 46269491 5.949038
298 |     ## 10 2003 R 21915505 5.702273
299 |     ## # ... with 118 more rows
300 |
301 | `count` for frequency tables. Note the consistent API and easy readability vs. `table`.
302 |
303 | ``` r
304 | filter(movies, mpaa != "") %>%
305 |   count(year, mpaa, Animation, sort = TRUE)
306 | ```
307 |
308 |     ## Source: local data frame [156 x 4]
309 |     ## Groups: year, mpaa [128]
310 |     ##
311 |     ## year mpaa Animation n
312 |     ##
313 |     ## 1 1999 R 0 366
314 |     ## 2 2001 R 0 355
315 |     ## 3 2002 R 0 343
316 |     ## 4 2000 R 0 341
317 |     ## 5 1998 R 0 335
318 |     ## 6 1997 R 0 325
319 |     ## 7 1996 R 0 310
320 |     ## 8 1995 R 0 293
321 |     ## 9 2003 R 0 264
322 |     ## 10 2004 R 0 196
323 |     ## # ... with 146 more rows
324 |
325 | ``` r
326 | basetab = with(movies[movies$mpaa != "", ], table(year, mpaa, Animation))
327 | basetab[1:5, , ]
328 | ```
329 |
330 |     ## , , Animation = 0
331 |     ##
332 |     ## mpaa
333 |     ## year NC-17 PG PG-13 R
334 |     ## 1934 0 1 0 0
335 |     ## 1938 0 1 0 0
336 |     ## 1945 0 0 1 0
337 |     ## 1946 0 1 0 0
338 |     ## 1951 0 2 0 0
339 |     ##
340 |     ## , , Animation = 1
341 |     ##
342 |     ## mpaa
343 |     ## year NC-17 PG PG-13 R
344 |     ## 1934 0 0 0 0
345 |     ## 1938 0 0 0 0
346 |     ## 1945 0 0 0 0
347 |     ## 1946 0 0 0 0
348 |     ## 1951 0 0 0 0
349 |
350 | ### joins
351 |
352 | `dplyr` also does multi-table joins and can connect to various types of databases.
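
Besides `full_join`, the other join verbs follow the same calling pattern. A short sketch on the same two toy tables, rebuilt here with `tibble()` so the block is self-contained; which verbs to show is my choice, not the talk's.

```r
library(dplyr)

t1 <- tibble(alpha = letters[1:6],  num = 1:6)
t2 <- tibble(alpha = letters[4:10], num = 4:10)

# Keep only keys present in both tables (d, e, f)
both <- inner_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))

# Keep every row of t1; unmatched rows get NA for t2's columns
left <- left_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))

# Filtering join: rows of t1 with no match in t2 (a, b, c); adds no columns
only_t1 <- anti_join(t1, t2, by = "alpha")
```

Every verb takes the two tables first and `by` next, which is the consistency the talk keeps pointing at.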
353 |
354 | ``` r
355 | t1 = data_frame(alpha = letters[1:6], num = 1:6)
356 | t2 = data_frame(alpha = letters[4:10], num = 4:10)
357 | full_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))
358 | ```
359 |
360 |     ## # A tibble: 10 × 3
361 |     ## alpha num_t1 num_t2
362 |     ##
363 |     ## 1 a 1 NA
364 |     ## 2 b 2 NA
365 |     ## 3 c 3 NA
366 |     ## 4 d 4 4
367 |     ## 5 e 5 5
368 |     ## 6 f 6 6
369 |     ## 7 g NA 7
370 |     ## 8 h NA 8
371 |     ## 9 i NA 9
372 |     ## 10 j NA 10
373 |
374 | Super-secret pro-tip: You can `group_by` %>% `mutate` to accomplish a summarize + join
375 |
376 | ``` r
377 | data_frame(group = sample(letters[1:3], 10, replace = TRUE),
378 |            value = rnorm(10)) %>%
379 |   group_by(group) %>%
380 |   mutate(group_average = mean(value))
381 | ```
382 |
383 |     ## Source: local data frame [10 x 3]
384 |     ## Groups: group [3]
385 |     ##
386 |     ## group value group_average
387 |     ##
388 |     ## 1 b -0.07744649 -0.61076927
389 |     ## 2 c 0.11825771 0.08865786
390 |     ## 3 c 0.58866540 0.08865786
391 |     ## 4 c 0.27584554 0.08865786
392 |     ## 5 b -0.80187845 -0.61076927
393 |     ## 6 c -0.33398635 0.08865786
394 |     ## 7 c -0.20549302 0.08865786
395 |     ## 8 b -0.95298286 -0.61076927
396 |     ## 9 a -0.48253785 -0.68985472
397 |     ## 10 a -0.89717159 -0.68985472
398 |
399 | `tidyr`
400 | -------
401 |
402 | Latest generation of `reshape`. `gather` to make wide table long, `spread` to make long tables wide.
403 |
404 | ``` r
405 | who # Tuberculosis data from the WHO
406 | ```
407 |
408 |     ## # A tibble: 7,240 × 60
409 |     ## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534
410 |     ##
411 |     ## 1 Afghanistan AF AFG 1980 NA NA NA
412 |     ## 2 Afghanistan AF AFG 1981 NA NA NA
413 |     ## 3 Afghanistan AF AFG 1982 NA NA NA
414 |     ## 4 Afghanistan AF AFG 1983 NA NA NA
415 |     ## 5 Afghanistan AF AFG 1984 NA NA NA
416 |     ## 6 Afghanistan AF AFG 1985 NA NA NA
417 |     ## 7 Afghanistan AF AFG 1986 NA NA NA
418 |     ## 8 Afghanistan AF AFG 1987 NA NA NA
419 |     ## 9 Afghanistan AF AFG 1988 NA NA NA
420 |     ## 10 Afghanistan AF AFG 1989 NA NA NA
421 |     ## # ... with 7,230 more rows, and 53 more variables: new_sp_m3544 ,
422 |     ## # new_sp_m4554 , new_sp_m5564 , new_sp_m65 ,
423 |     ## # new_sp_f014 , new_sp_f1524 , new_sp_f2534 ,
424 |     ## # new_sp_f3544 , new_sp_f4554 , new_sp_f5564 ,
425 |     ## # new_sp_f65 , new_sn_m014 , new_sn_m1524 ,
426 |     ## # new_sn_m2534 , new_sn_m3544 , new_sn_m4554 ,
427 |     ## # new_sn_m5564 , new_sn_m65 , new_sn_f014 ,
428 |     ## # new_sn_f1524 , new_sn_f2534 , new_sn_f3544 ,
429 |     ## # new_sn_f4554 , new_sn_f5564 , new_sn_f65 ,
430 |     ## # new_ep_m014 , new_ep_m1524 , new_ep_m2534 ,
431 |     ## # new_ep_m3544 , new_ep_m4554 , new_ep_m5564 ,
432 |     ## # new_ep_m65 , new_ep_f014 , new_ep_f1524 ,
433 |     ## # new_ep_f2534 , new_ep_f3544 , new_ep_f4554 ,
434 |     ## # new_ep_f5564 , new_ep_f65 , newrel_m014 ,
435 |     ## # newrel_m1524 , newrel_m2534 , newrel_m3544 ,
436 |     ## # newrel_m4554 , newrel_m5564 , newrel_m65 ,
437 |     ## # newrel_f014 , newrel_f1524 , newrel_f2534 ,
438 |     ## # newrel_f3544 , newrel_f4554 , newrel_f5564 ,
439 |     ## # newrel_f65
440 |
441 | ``` r
442 | who %>%
443 |   gather(group, cases, -country, -iso2, -iso3, -year)
444 | ```
445 |
446 |     ## # A tibble: 405,440 × 6
447 |     ## country iso2 iso3 year group cases
448 |     ##
449 |     ## 1 Afghanistan AF AFG 1980 new_sp_m014 NA
450 |     ## 2 Afghanistan AF AFG 1981 new_sp_m014 NA
451 |     ## 3 Afghanistan AF AFG 1982 new_sp_m014 NA
452 |     ## 4 Afghanistan AF AFG 1983 new_sp_m014 NA
453 |     ## 5 Afghanistan AF AFG 1984 new_sp_m014 NA
454 |     ## 6 Afghanistan AF AFG 1985 new_sp_m014 NA
455 |     ## 7 Afghanistan AF AFG 1986 new_sp_m014 NA
456 |     ## 8 Afghanistan AF AFG 1987 new_sp_m014 NA
457 |     ## 9 Afghanistan AF AFG 1988 new_sp_m014 NA
458 |     ## 10 Afghanistan AF AFG 1989 new_sp_m014 NA
459 |     ## # ... with 405,430 more rows
460 |
461 | `ggplot2`
462 | ---------
463 |
464 | If you don't already know and love it, check out [one of](https://d-rug.github.io/blog/2012/ggplot-introduction) [our](https://d-rug.github.io/blog/2013/xtsmarkdown) [previous](https://d-rug.github.io/blog/2013/formatting-plots-for-pubs) [talks](https://d-rug.github.io/blog/2015/ggplot-tutorial-johnston) on ggplot or any of the excellent resources on the internet.
465 |
466 | Note that the pipe and consistent API make it easy to combine functions from different packages, and the whole thing is quite readable.
467 |
468 | ``` r
469 | who %>%
470 |   select(-iso2, -iso3) %>%
471 |   gather(group, cases, -country, -year) %>%
472 |   count(country, year, wt = cases) %>%
473 |   ggplot(aes(x = year, y = n, group = country)) +
474 |   geom_line(size = .2)
475 | ```
476 |
477 | ![](tidyverse_files/figure-markdown_github/dplyr-tidyr-ggplot-1.png)
478 |
479 | `readr`
480 | -------
481 |
482 | For reading flat files. Faster than base with smarter defaults.

``` r
bigdf = data_frame(int = 1:1e6,
                   squares = int^2,
                   letters = sample(letters, 1e6, replace = TRUE))
```

``` r
system.time(
  write.csv(bigdf, "base-write.csv")
)
```

    ##    user  system elapsed
    ##   2.579   0.084   2.704

``` r
system.time(
  write_csv(bigdf, "readr-write.csv")
)
```

    ##    user  system elapsed
    ##   0.819   0.069   0.940

``` r
read.csv("base-write.csv", nrows = 3)
```

    ##   X int squares letters
    ## 1 1   1       1       h
    ## 2 2   2       4       d
    ## 3 3   3       9       m

``` r
read_csv("readr-write.csv", n_max = 3)
```

    ## Parsed with column specification:
    ## cols(
    ##   int = col_integer(),
    ##   squares = col_double(),
    ##   letters = col_character()
    ## )

    ## # A tibble: 3 × 3
    ##     int squares letters
    ##
    ## 1     1       1       h
    ## 2     2       4       d
    ## 3     3       9       m

`broom`
-------

`broom` is a convenient little package to work with model results. Two functions I find useful are `tidy` to extract model results and `augment` to add residuals, predictions, etc. to a data.frame.

``` r
d = data_frame(x = runif(20, 0, 10),
               y = 2 * x + rnorm(20))  # one noise draw per x; rnorm(10) would be silently recycled
qplot(x, y, data = d)
```

![](tidyverse_files/figure-markdown_github/make%20model%20data-1.png)

### `tidy`

``` r
library(broom)  # Not attached with tidyverse
model = lm(y ~ x, d)
tidy(model)
```

    ##          term    estimate  std.error   statistic      p.value
    ## 1 (Intercept) -0.02212952 0.32815614 -0.06743595 9.469781e-01
    ## 2           x  2.04880361 0.06368259 32.17211622 2.328477e-17

### `augment`

i.e. The function formerly known as `fortify`.

``` r
aug = augment(model)
aug
```

    ##            y         x   .fitted   .se.fit      .resid       .hat
    ## 1   4.365310 2.2167792  4.519616 0.2190258 -0.15430550 0.08531087
    ## 2  10.183616 5.4172613 11.076775 0.1790893 -0.89315878 0.05703654
    ## 3   7.215538 3.2282729  6.591968 0.1843041  0.62357006 0.06040652
    ## 4  16.104626 7.9968916 16.361931 0.2823602 -0.25730454 0.14178182
    ## 5  15.469541 8.0496637 16.470051 0.2850711 -1.00050940 0.14451735
    ## 6   1.777862 0.5917886  1.190329 0.2963871  0.58753261 0.15621841
    ## 7  14.792232 7.4687935 15.279961 0.2560816 -0.48772985 0.11661935
    ## 8   7.947098 3.9047317  7.977899 0.1709766 -0.03080099 0.05198605
    ## 9   6.652823 3.3824637  6.907874 0.1804498 -0.25505137 0.05790640
    ## 10 16.466676 7.2607843 14.853792 0.2462225  1.61288435 0.10781254
    ## 11  9.147981 4.6081148  9.418993 0.1680642 -0.27101132 0.05023009
    ## 12  7.045629 3.8482676  7.862215 0.1717156 -0.81658622 0.05243644
    ## 13 15.521358 7.3811829 15.100465 0.2518912  0.42089305 0.11283397
    ## 14  2.047400 0.9682785  1.961683 0.2769494  0.08571719 0.13640005
    ## 15  1.924992 1.2773890  2.594990 0.2615541 -0.66999792 0.12165693
    ## 16  6.737163 3.0714391  6.270646 0.1886685  0.46651671 0.06330129
    ## 17  1.835538 0.9904465  2.007101 0.2758271 -0.17156311 0.13529686
    ## 18  3.904365 1.8833655  3.836516 0.2332531  0.06784899 0.09675390
    ## 19 16.870169 8.4911367 17.374542 0.3082513 -0.50437308 0.16897542
    ## 20 15.051012 6.5529524 13.403583 0.2154124  1.64742911 0.08251922
    ##       .sigma      .cooksd  .std.resid
    ## 1  0.7706297 2.158758e-03 -0.21515503
    ## 2  0.7386729 4.549928e-02 -1.22655805
    ## 3  0.7556838 2.365691e-02  0.85787128
    ## 4  0.7686765 1.133193e-02 -0.37038676
    ## 5  0.7256519 1.757615e-01 -1.44252197
    ## 6  0.7558680 6.734724e-02  0.85295049
    ## 7  0.7612891 3.160947e-02 -0.69200983
    ## 8  0.7715844 4.879446e-05 -0.04218560
    ## 9  0.7689861 3.773788e-03 -0.35041888
    ## 10 0.6510658 3.132905e-01  2.27709965
    ## 11 0.7686693 3.636518e-03 -0.37083874
    ## 12 0.7443161 3.462616e-02 -1.11867707
    ## 13 0.7639734 2.258173e-02  0.59590381
    ## 14 0.7712982 1.194837e-03  0.12300378
    ## 15 0.7518898 6.294179e-02 -0.95334091
    ## 16 0.7627149 1.396146e-02  0.64279740
    ## 17 0.7703240 4.735711e-03 -0.24603520
    ## 18 0.7714283 4.854302e-04  0.09520224
    ## 19 0.7598647 5.534563e-02 -0.73782239
    ## 20 0.6491487 2.365693e-01  2.29358645

``` r
ggplot(aug, aes(x = x)) +
  geom_point(aes(y = y, color = .resid)) +
  geom_line(aes(y = .fitted)) +
  viridis::scale_color_viridis() +
  theme(legend.position = c(0, 1), legend.justification = c(0, 1))
```

![](tidyverse_files/figure-markdown_github/plot%20resid-1.png)

``` r
ggplot(aug, aes(.fitted, .resid, size = .cooksd)) +
  geom_point()
```

![](tidyverse_files/figure-markdown_github/plot%20cooksd-1.png)

`purrr`
-------

`purrr` is kind of like `dplyr` for lists. It helps you repeatedly apply functions. Like the rest of the tidyverse, it does nothing you can't do in base R, but `purrr` makes the API consistent, encourages type specificity, and provides some nice shortcuts and speedups.

``` r
df = data_frame(fun = rep(c(lapply, map), 2),
                n = rep(c(1e5, 1e7), each = 2),
                comp_time = map2(fun, n, ~system.time(.x(1:.y, sqrt))))
df$comp_time
```

    ## [[1]]
    ##    user  system elapsed
    ##   0.042   0.003   0.046
    ##
    ## [[2]]
    ##    user  system elapsed
    ##   0.055   0.001   0.059
    ##
    ## [[3]]
    ##    user  system elapsed
    ##  12.366   0.218  12.612
    ##
    ## [[4]]
    ##    user  system elapsed
    ##   8.572   0.117   8.742

### `map`

Vanilla `map` is a slightly improved version of `lapply`.
It applies a function to each item in a list.

``` r
map(1:4, log)
```

    ## [[1]]
    ## [1] 0
    ##
    ## [[2]]
    ## [1] 0.6931472
    ##
    ## [[3]]
    ## [1] 1.098612
    ##
    ## [[4]]
    ## [1] 1.386294

Can supply additional arguments as with `(x)apply`.

``` r
map(1:4, log, base = 2)
```

    ## [[1]]
    ## [1] 0
    ##
    ## [[2]]
    ## [1] 1
    ##
    ## [[3]]
    ## [1] 1.584963
    ##
    ## [[4]]
    ## [1] 2

Can compose anonymous functions like `(x)apply`, either the old way or with a new formula shorthand.

``` r
map(1:4, ~ log(4, base = .x))  # == map(1:4, function(x) log(4, base = x))
```

    ## [[1]]
    ## [1] Inf
    ##
    ## [[2]]
    ## [1] 2
    ##
    ## [[3]]
    ## [1] 1.26186
    ##
    ## [[4]]
    ## [1] 1

`map` always returns a list. `map_xxx` type-specifies the output type and simplifies the list to a vector.

``` r
map_dbl(1:4, log, base = 2)
```

    ## [1] 0.000000 1.000000 1.584963 2.000000

And throws an error if any output isn't of the expected type (which is a good thing!).

``` r
map_int(1:4, log, base = 2)
```

    ## Error: Can't coerce element 1 from a double to a integer

`map2` is like `mapply` -- apply a function over two lists in parallel. `pmap` (formerly `map_n`) generalizes to any number of lists.

``` r
fwd = 1:10
bck = 10:1
map2_dbl(fwd, bck, `^`)
```

    ##  [1]     1   512  6561 16384 15625  7776  2401   512    81    10

`map_if` tests each element with a predicate function: where the predicate is true it applies the second function, and where it is false it returns the original element unchanged.
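
As a rough sketch of the last two ideas on toy inputs (the values and the names `sides`, `hyp`, `mixed`, and `squared` are made up for illustration, not from the talk): `pmap_dbl` walks several equal-length inputs in parallel, and `map_if` only touches the elements that pass the predicate.

``` r
library(purrr)

# pmap_*() takes a list of equal-length inputs and calls the function once per position,
# matching list names to argument names
sides = list(a = c(3, 5, 8), b = c(4, 12, 15))
hyp = pmap_dbl(sides, function(a, b) sqrt(a^2 + b^2))
hyp

# map_if(): square the numeric elements, return everything else as-is
mixed = list(2, "b", 4)
squared = map_if(mixed, is.numeric, ~ .x^2)
str(squared)
```

Note that `map_if` returns a list here, so the character element can sit alongside the squared numbers without any coercion.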

``` r
data_frame(ints = 1:5, lets = letters[1:5], sqrts = ints^.5) %>%
  map_if(is.numeric, ~ .x^2)
```

    ## $ints
    ## [1]  1  4  9 16 25
    ##
    ## $lets
    ## [1] "a" "b" "c" "d" "e"
    ##
    ## $sqrts
    ## [1] 1 2 3 4 5

### Putting `map` to work

Split the movies data frame by mpaa rating, fit a linear model to each data frame, and organize the model results in a data frame.

``` r
movies %>%
  filter(mpaa != "") %>%
  split(.$mpaa) %>%
  map(~ lm(rating ~ budget, data = .)) %>%
  map_df(tidy, .id = "mpaa-rating") %>%
  arrange(term)
```

    ##   mpaa-rating        term      estimate    std.error   statistic
    ## 1       NC-17 (Intercept)  6.505809e+00 3.124604e-01  20.8212250
    ## 2          PG (Intercept)  5.768036e+00 1.368008e-01  42.1637635
    ## 3       PG-13 (Intercept)  5.749256e+00 7.809921e-02  73.6147735
    ## 4           R (Intercept)  5.814014e+00 5.238965e-02 110.9764024
    ## 5       NC-17      budget -6.046148e-08 1.835425e-08  -3.2941404
    ## 6          PG      budget  2.028426e-09 2.745805e-09   0.7387365
    ## 7       PG-13      budget  3.493136e-09 1.443405e-09   2.4200674
    ## 8           R      budget  7.732453e-09 1.658841e-09   4.6613582
    ##         p.value
    ## 1  4.732587e-06
    ## 2 1.856102e-104
    ## 3 8.291365e-280
    ## 4  0.000000e+00
    ## 5  2.161461e-02
    ## 6  4.608918e-01
    ## 7  1.585419e-02
    ## 8  3.540387e-06

List-columns make it easier to organize complex datasets. Can `map` over list-columns right in `data_frame`/`tibble` creation. And if you later want to calculate something else, everything is nicely organized in the data frame.
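
A smaller, deterministic sketch of the same idea may help before the random-sampling version. The column names and values here (`n`, `draws`, `avg`, `toy`) are invented for illustration, and it assumes `tibble()` evaluates its columns in order so that later columns can refer to earlier ones.

``` r
library(tibble)
library(purrr)

toy = tibble(
  n = c(2, 5),
  draws = map(n, ~ seq_len(.x)),  # list-column: one integer vector per row
  avg = map_dbl(draws, mean)      # summary computed from the list-column in the same call
)

# Adding another summary later only needs the stored list-column
toy$longest = map_int(toy$draws, length)
toy
```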

``` r
d =
  data_frame(
    dist = c("normal", "poisson", "chi-square"),
    funs = list(rnorm, rpois, rchisq),
    samples = map(funs, ~.(100, 5)),
    mean = map_dbl(samples, mean),
    var = map_dbl(samples, var)
  )
d$median = map_dbl(d$samples, median)
d
```

    ## # A tibble: 3 × 6
    ##         dist   funs samples     mean        var   median
    ##
    ## 1     normal                4.897684  0.9952718 4.910766
    ## 2    poisson                4.990000  4.1716162 5.000000
    ## 3 chi-square                5.466018 12.3804613 4.923235

Let's see if we can really make this purrr... Fit a linear model of diamond price by every combination of two predictors in the dataset and see which two predict best.

``` r
train = sample(nrow(diamonds), floor(nrow(diamonds) * .67))
setdiff(names(diamonds), "price") %>%
  combn(2, paste, collapse = " + ") %>%
  structure(., names = .) %>%
  map(~ formula(paste("price ~ ", .x))) %>%
  map(lm, data = diamonds[train, ]) %>%
  map_df(augment, newdata = diamonds[-train, ], .id = "predictors") %>%
  group_by(predictors) %>%
  summarize(rmse = sqrt(mean((price - .fitted)^2))) %>%
  arrange(rmse)
```

    ## # A tibble: 36 × 2
    ##         predictors     rmse
    ##
    ## 1  carat + clarity 1296.010
    ## 2    carat + color 1474.577
    ## 3      carat + cut 1518.669
    ## 4        carat + x 1530.131
    ## 5        carat + y 1545.970
    ## 6    carat + depth 1546.579
    ## 7    carat + table 1549.821
    ## 8        carat + z 1557.959
    ## 9      clarity + x 1672.964
    ## 10     clarity + y 1689.942
    ## # ... with 26 more rows

### Type-stability

We have seen that we can use `map_lgl` to ensure we get a logical vector, `map_chr` to ensure we get a character vector back, etc. Type stability is like a little built-in unit test. You make sure you're getting what you think you are, even in the middle of a pipeline or function.
Here are two more type-stable functions implemented in `purrr`.

#### `flatten`

Like `unlist` but can specify output type, and never recurses.

``` r
map(-1:3, ~.x ^ seq(-.5, .5, .5)) %>%
  flatten_dbl()
```

    ##  [1]       NaN 1.0000000       NaN       Inf 1.0000000 0.0000000 1.0000000
    ##  [8] 1.0000000 1.0000000 0.7071068 1.0000000 1.4142136 0.5773503 1.0000000
    ## [15] 1.7320508

#### `safely`

``` r
junk = list(letters, 1:20, median)
map(junk, ~ log(.x))
```

    ## Error in log(.x): non-numeric argument to mathematical function

-   `safely` "catches" errors and always "succeeds".
-   `try` does the same, but either returns the value or a try-error object.
-   `safely` is type-stable. It always returns a length-two list with one object NULL.

``` r
safe = map(junk, ~ safely(log)(.x))  # Note the different syntax from try(log(.x)). `safely(log)` creates a new function.
safe
```

    ## [[1]]
    ## [[1]]$result
    ## NULL
    ##
    ## [[1]]$error
    ##
    ##
    ##
    ## [[2]]
    ## [[2]]$result
    ##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
    ##  [8] 2.0794415 2.1972246 2.3025851 2.3978953 2.4849066 2.5649494 2.6390573
    ## [15] 2.7080502 2.7725887 2.8332133 2.8903718 2.9444390 2.9957323
    ##
    ## [[2]]$error
    ## NULL
    ##
    ##
    ## [[3]]
    ## [[3]]$result
    ## NULL
    ##
    ## [[3]]$error
    ##

#### `transpose` a list!

Now we could conveniently carry on with the elements where the function succeeded, particularly using `map_if`. To get that logical vector for the `map_if` test, we can use the `transpose` function, which inverts a list.
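
To make "inverts" concrete, here is a minimal sketch on a hand-built list shaped like `safely` output. The list `results` and its contents are hypothetical stand-ins, with a plain string in place of a real error object.

``` r
library(purrr)

# One element per call, each with the same two fields
results = list(
  list(result = 1, error = NULL),
  list(result = NULL, error = "boom")
)

# transpose() turns the list inside out: one element per field,
# each holding one entry per call
flipped = transpose(results)
names(flipped)

# A logical vector marking which calls succeeded
ok = map_lgl(flipped$error, is.null)
ok
```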

``` r
transpose(safe)
```

    ## $result
    ## $result[[1]]
    ## NULL
    ##
    ## $result[[2]]
    ##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
    ##  [8] 2.0794415 2.1972246 2.3025851 2.3978953 2.4849066 2.5649494 2.6390573
    ## [15] 2.7080502 2.7725887 2.8332133 2.8903718 2.9444390 2.9957323
    ##
    ## $result[[3]]
    ## NULL
    ##
    ##
    ## $error
    ## $error[[1]]
    ##
    ##
    ## $error[[2]]
    ## NULL
    ##
    ## $error[[3]]
    ##

``` r
map_if(transpose(safe)$result, ~!is.null(.x), median)
```

    ## [[1]]
    ## NULL
    ##
    ## [[2]]
    ## [1] 2.35024
    ##
    ## [[3]]
    ## NULL

`stringr`
---------

All your string manipulation and regex functions with a consistent API.

``` r
library(stringr)  # not attached with tidyverse
fishes <- c("one fish", "two fish", "red fish", "blue fish")
str_detect(fishes, "two")
```

    ## [1] FALSE  TRUE FALSE FALSE

``` r
str_replace_all(fishes, "fish", "banana")
```

    ## [1] "one banana"  "two banana"  "red banana"  "blue banana"

``` r
str_extract(fishes, "[a-z]\\s")
```

    ## [1] "e " "o " "d " "e "

Let's put that string manipulation engine to work. Remember the annoying column names in the WHO data? They look like `new_sp_m014`, `new_sp_m1524`, `new_sp_m2534`, where "new" or "new\_" doesn't mean anything, the following 2-3 letters indicate the test used, the next letter indicates the gender, and the final 2-4 digits indicate the age class. A string-handling challenge if ever there was one. Let's separate it out and plot the cases by year, gender, age-class, and test-method.
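
Before the full pipeline, it may help to see the string steps in isolation. This is a small sketch that applies the same `stringr` calls to three column names picked by hand from the `who` output shown earlier. The vector `groups` and the intermediate names are just for illustration.

``` r
library(stringr)

groups = c("new_sp_m014", "new_sn_f65", "newrel_m2534")

# Drop the meaningless prefix: "new" followed by zero or more underscores
stripped = str_replace(groups, "new_*", "")

method = str_extract(stripped, "[a-z]+")                  # first run of letters: the test used
gender = str_sub(str_extract(stripped, "_[a-z]"), 2, 2)   # the letter right after the underscore
digits = str_extract(stripped, "[0-9]+")                  # the raw age code

# Three or four digits encode a range (last two digits are the upper bound);
# two digits mean "this age and older"
age = ifelse(str_length(digits) > 2,
             str_c(str_sub(digits, 1, -3), str_sub(digits, -2, -1), sep = "-"),
             str_c(digits, "+"))

data.frame(groups, method, gender, age)
```

Working on a handful of names like this makes it easy to check each regular expression before it runs over the full gathered data.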

``` r
who %>%
  select(-iso2, -iso3) %>%
  gather(group, cases, -country, -year) %>%
  mutate(group = str_replace(group, "new_*", ""),
         method = str_extract(group, "[a-z]+"),
         gender = str_sub(str_extract(group, "_[a-z]"), 2, 2),
         age = str_extract(group, "[0-9]+"),
         age = ifelse(str_length(age) > 2,
                      str_c(str_sub(age, 1, -3), str_sub(age, -2, -1), sep = "-"),
                      str_c(age, "+"))) %>%
  group_by(year, gender, age, method) %>%
  summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
  ggplot(aes(x = year, y = total_cases, linetype = gender)) +
  geom_line() +
  facet_grid(method ~ age,
             labeller = labeller(.rows = label_both, .cols = label_both)) +
  scale_y_log10() +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
```

![](tidyverse_files/figure-markdown_github/unnamed-chunk-6-1.png)

Post-talk debugging improvisation
---------------------------------

``` r
# Check a condition mid-pipeline, then pass the data frame through unchanged
pipe_stopifnot = function(df, test){
  stopifnot(test)
  return(df)
}
```

``` r
# Print a message mid-pipeline, then pass the data frame through unchanged
print_and_go = function(df, what_to_print) {
  cat(what_to_print)
  return(df)
}
```

--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/dplyr-tidyr-ggplot-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/dplyr-tidyr-ggplot-1.png
--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/make model data-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/make model data-1.png
--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/plot cooksd-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/plot cooksd-1.png
--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/plot resid-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/plot resid-1.png
--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-11-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-11-1.png
--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-14-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-14-1.png
--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-15-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-15-1.png
--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-26-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-26-1.png
--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-4-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-4-1.png
--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-6-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-6-1.png
--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-8-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-8-1.png
--------------------------------------------------------------------------------