├── .gitignore
├── README.md
├── tidyverse.Rmd
├── tidyverse.md
├── tidyverse.nb.html
└── tidyverse_files
    └── figure-markdown_github
        ├── dplyr-tidyr-ggplot-1.png
        ├── make model data-1.png
        ├── plot cooksd-1.png
        ├── plot resid-1.png
        ├── unnamed-chunk-11-1.png
        ├── unnamed-chunk-14-1.png
        ├── unnamed-chunk-15-1.png
        ├── unnamed-chunk-26-1.png
        ├── unnamed-chunk-4-1.png
        ├── unnamed-chunk-6-1.png
        └── unnamed-chunk-8-1.png


/.gitignore:
--------------------------------------------------------------------------------
1 | *.csv
2 | tidyverse_cache/
3 | .DS_Store


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | `tidyverse` talk for the Davis R-Users' Group, October 13, 2016.
2 | 
3 | You can watch the talk here: https://www.youtube.com/watch?v=_rPhSAVhs1A


--------------------------------------------------------------------------------
/tidyverse.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "tidyverse"
  3 | author: "Michael Levy, Prepared for the Davis R-Users' Group"
  4 | date: "October 13, 2016"
  5 | output:
  6 |   github_document: default
  7 |   html_notebook: default
  8 | ---
  9 | 
 10 | ```{r setup, include = FALSE}
 11 | knitr::opts_chunk$set(cache = TRUE, error = TRUE, fig.width = 4, fig.asp = 1)
 12 | ```
 13 | 
 14 | 
 15 | ## What is the tidyverse?
 16 | 
 17 | ~~Hadleyverse~~
 18 | 
 19 | The tidyverse is a suite of R tools that follow a tidy philosophy:
 20 | 
 21 | ### Tidy data
 22 | 
 23 | Put data in data frames  
 24 | 
 25 | - Each type of observation gets a data frame
 26 | - Each variable gets a column
 27 | - Each observation gets a row
 28 | 
 29 | ### Tidy APIs
 30 | 
 31 | Functions should be consistent and easily (human) readable
 32 | 
 33 | - Take one step at a time
 34 | - Connect simple steps with the pipe
 35 | - Referential transparency
 36 | 
 37 | 
 38 | ### Okay but really, what is it? 
 39 | 
 40 | Suite of ~20 packages that provide consistent, user-friendly, smart-default tools to do most of what most people do in R.
 41 | 
 42 | - Core packages: ggplot2, dplyr, tidyr, readr, purrr, tibble
 43 | - Specialized data manipulation: hms, stringr, lubridate, forcats
 44 | - Data import: DBI, haven, httr, jsonlite, readxl, rvest, xml2
 45 | - Modeling: modelr, broom
 46 | 
 47 | `install.packages("tidyverse")` installs all of the above packages.
 48 | 
 49 | `library(tidyverse)` attaches only the core packages.
 50 | 
 51 | 
 52 | ## Why tidyverse?
 53 | 
 54 | - Consistency  
 55 |     - e.g. All `stringr` functions take string first  
 56 |     - e.g. Many functions take data.frame first -> piping
 57 |         - Faster to write
 58 |         - Easier to read
 59 |     - Tidy data: Imposes good practices
 60 |     - Type specificity
 61 | - You probably use some of it already. Synergize.
 62 | - Implements simple solutions to common problems (e.g. `purrr::transpose`)
 63 | - Smarter defaults 
 64 |     - e.g. `utils::write.csv(row.names = FALSE)` = `readr::write_csv()` 
 65 | - Runs fast (thanks to `Rcpp`)
 66 | - Interfaces well with other tools (e.g. Spark with `dplyr` via `sparklyr`)
 67 | 
 68 | ## `tibble`
 69 | 
 70 | > A modern reimagining of data frames.
 71 | 
 72 | ```{r Attach core packages}
 73 | library(tidyverse)
 74 | ```
 75 | 
 76 | ```{r class tbl}
 77 | tdf = tibble(x = 1:1e4, y = rnorm(1e4))  # == data_frame(x = 1:1e4, y = rnorm(1e4))
 78 | class(tdf)
 79 | ```
 80 | 
 81 | 
 82 | Tibbles print politely. 
 83 | 
 84 | ```{r print tbl}
 85 | tdf
 86 | ```
 87 | 
 88 | 
 89 | - Can customize print methods with `print(tdf, n = rows, width = cols)`
 90 | 
 91 | - Set default with `options(tibble.print_max = rows, tibble.width = cols)`
 92 | 
 93 | Tibbles have some convenient and consistent defaults that are different from base R data.frames.
 94 | 
 95 | #### strings as factors
 96 | 
 97 | ```{r strings as factors}
 98 | dfs = list(
 99 |   df = data.frame(abc = letters[1:3], xyz = letters[24:26]),
100 |   tbl = data_frame(abc = letters[1:3], xyz = letters[24:26])
101 | )
102 | sapply(dfs, function(d) class(d$abc))
103 | ```
104 | 
105 | 
106 | #### partial matching of names
107 | 
108 | ```{r partial matching}
109 | sapply(dfs, function(d) d$a)
110 | ```
111 | 
112 | #### type consistency
113 | 
114 | ```{r single bracket excision}
115 | sapply(dfs, function(d) class(d[, "abc"]))
116 | ```
117 | 
118 | Note that tidyverse import functions (e.g. `readr::read_csv`) default to tibbles and that *this can break existing code*.
119 | 
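A minimal sketch of the kind of breakage meant here, using a made-up two-column tibble: code that relied on `d[, "abc"]` dropping to a vector now gets a one-column tibble, whereas `[[` and `$` return the vector for both classes.

```r
library(tibble)

tbl <- tibble(abc = letters[1:3], xyz = letters[24:26])

still_tbl  <- tbl[, "abc"]   # one-column tibble, not a vector
as_vector  <- tbl[["abc"]]   # character vector; same for a data.frame
via_dollar <- tbl$abc        # also a vector, but the full name is required

# Fallback for legacy code: convert back to a plain data.frame first
legacy <- as.data.frame(tbl)[, "abc"]
```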
120 | #### List-columns!
121 | 
122 | ```{r list columns}
123 | tibble(ints = 1:5,
124 |        powers = lapply(1:5, function(x) x^(1:x)))
125 | ```
126 | 
127 | 
128 | ## The pipe `%>%`
129 | 
130 | Sends the output of the LHS function to the first argument of the RHS function.
131 | 
132 | ```{r pipe}
133 | sum(1:8) %>%
134 |   sqrt()
135 | ```
136 | 
137 | 
138 | ## `dplyr`
139 | 
140 | Common data(frame) manipulation tasks. 
141 | 
142 | Four core "verbs": filter, select, arrange, group_by + summarize, plus many more convenience functions. 
143 | 
144 | 
145 | ```{r load movies}
146 | library(ggplot2movies)
147 | str(movies)
148 | ```
149 | 
150 | ```{r filter}
151 | filter(movies, length > 360)
152 | ```
153 | 
154 | ```{r select}
155 | filter(movies, length > 360) %>%
156 |   select(title, rating, votes)
157 | ```
158 | 
159 | ```{r arrange}
160 | filter(movies, Animation == 1, votes > 1000) %>%
161 |   select(title, rating) %>%
162 |   arrange(desc(rating))
163 | ```
164 | 
165 | `summarize` makes `aggregate` and `tapply` functionality easier, and the output is always a data frame.
166 | 
167 | ```{r summarize}
168 | filter(movies, mpaa != "") %>%
169 |   group_by(year, mpaa) %>%
170 |   summarize(avg_budget = mean(budget, na.rm = TRUE),
171 |             avg_rating = mean(rating, na.rm = TRUE)) %>%
172 |   arrange(desc(year), mpaa)
173 | ```
174 | 
175 | 
176 | `count` for frequency tables. Note the consistent API and easy readability vs. `table`.
177 | 
178 | ```{r count}
179 | filter(movies, mpaa != "") %>%
180 |   count(year, mpaa, Animation, sort = TRUE)
181 | ```
182 | 
183 | 
184 | ```{r table}
185 | basetab = with(movies[movies$mpaa != "", ], table(year, mpaa, Animation))
186 | basetab[1:5, , ]
187 | ```
188 | 
189 | 
190 | ### joins
191 | 
192 | `dplyr` also does multi-table joins and can connect to various types of databases.
193 | 
194 | ```{r full join}
195 | t1 = data_frame(alpha = letters[1:6], num = 1:6)
196 | t2 = data_frame(alpha = letters[4:10], num = 4:10)
197 | full_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))
198 | ```
199 | 
200 | 
201 | Super-secret pro-tip: You can `group_by` %>% `mutate` to accomplish a summarize + join
202 | 
203 | ```{r group mutate}
204 | data_frame(group = sample(letters[1:3], 10, replace = TRUE),
205 |            value = rnorm(10)) %>%
206 |   group_by(group) %>%
207 |   mutate(group_average = mean(value))
208 | ```
209 | 
210 | 
211 | 
212 | 
213 | ## `tidyr`
214 | 
 215 | Latest generation of `reshape`. `gather` makes wide tables long; `spread` makes long tables wide.
216 | 
217 | ```{r who}
218 | who  # Tuberculosis data from the WHO
219 | ```
220 | 
221 | ```{r gather}
222 | who %>%
223 |   gather(group, cases, -country, -iso2, -iso3, -year)
224 | ```
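`spread` is not demonstrated above, so here is a minimal round-trip sketch on a small made-up table (the `wide` data and its column names are invented for illustration): `gather` stacks the year columns into key/value pairs and `spread` unstacks them again.

```r
library(tidyr)

wide <- data.frame(country = c("A", "B"),
                   y1999 = c(1, 3),
                   y2000 = c(2, 4),
                   stringsAsFactors = FALSE)

# Wide to long: one row per country-year combination
long <- gather(wide, year, cases, -country)

# Long to wide: one column per distinct value of `year`
back <- spread(long, year, cases)
```

Newer tidyr releases also offer `pivot_longer()` and `pivot_wider()` for the same reshaping.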
225 | 
226 | 
227 | ## `ggplot2`
228 | 
229 | If you don't already know and love it, check out [one of](https://d-rug.github.io/blog/2012/ggplot-introduction) [our](https://d-rug.github.io/blog/2013/xtsmarkdown) [previous](https://d-rug.github.io/blog/2013/formatting-plots-for-pubs) [talks](https://d-rug.github.io/blog/2015/ggplot-tutorial-johnston) on ggplot or any of the excellent resources on the internet. 
230 | 
231 | Note that the pipe and consistent API make it easy to combine functions from different packages, and the whole thing is quite readable.
232 | 
233 | ```{r dplyr-tidyr-ggplot}
234 | who %>%
235 |   select(-iso2, -iso3) %>%
236 |   gather(group, cases, -country, -year) %>%
237 |   count(country, year, wt = cases) %>%
238 |   ggplot(aes(x = year, y = n, group = country)) +
239 |   geom_line(size = .2) 
240 | ```
241 | 
242 | 
243 | ## `readr`
244 | 
245 | For reading flat files. Faster than base with smarter defaults.
246 | 
247 | ```{r make big df}
248 | bigdf = data_frame(int = 1:1e6, 
249 |                    squares = int^2, 
250 |                    letters = sample(letters, 1e6, replace = TRUE))
251 | ```
252 | 
253 | ```{r base write}
254 | system.time(
255 |   write.csv(bigdf, "base-write.csv")
256 | )
257 | ```
258 | 
259 | ```{r readr write}
260 | system.time(
261 |   write_csv(bigdf, "readr-write.csv")
262 | )
263 | ```
264 | 
265 | ```{r base read}
266 | read.csv("base-write.csv", nrows = 3)
267 | ```
268 | 
269 | ```{r readr read}
270 | read_csv("readr-write.csv", n_max = 3)
271 | ```
272 | 
273 | ## `broom` 
274 | 
275 | `broom` is a convenient little package to work with model results. Two functions I find useful are `tidy` to extract model results and `augment` to add residuals, predictions, etc. to a data.frame.
276 | 
277 | ```{r make model data}
278 | d = data_frame(x = runif(20, 0, 10), 
 279 |                y = 2 * x + rnorm(20))  # one noise draw per row; rnorm(10) would be silently recycled
280 | qplot(x, y, data = d)
281 | ```
282 | 
283 | ### `tidy`
284 | 
285 | ```{r tidy}
286 | library(broom)  # Not attached with tidyverse
287 | model = lm(y ~ x, d)
288 | tidy(model)
289 | ```
290 | 
291 | ### `augment`
292 | 
293 | i.e. The function formerly known as `fortify`.
294 | 
295 | ```{r augment}
296 | aug = augment(model)
297 | aug
298 | ```
299 | 
300 | ```{r plot resid}
301 | ggplot(aug, aes(x = x)) +
302 |   geom_point(aes(y = y, color = .resid)) + 
303 |   geom_line(aes(y = .fitted)) +
304 |   viridis::scale_color_viridis() +
305 |   theme(legend.position = c(0, 1), legend.justification = c(0, 1))
306 | ```
307 | 
308 | ```{r plot cooksd}
309 | ggplot(aug, aes(.fitted, .resid, size = .cooksd)) + 
310 |   geom_point()
311 | ```
312 | 
313 | 
314 | 
315 | ## `purrr`
316 | 
317 | `purrr` is kind of like `dplyr` for lists. It helps you repeatedly apply functions. Like the rest of the tidyverse, nothing you can't do in base R, but `purrr` makes the API consistent, encourages type specificity, and provides some nice shortcuts and speed ups.
318 | 
319 | ```{r intro and speedtest}
320 | df = data_frame(fun = rep(c(lapply, map), 2),
321 |                 n = rep(c(1e5, 1e7), each = 2),
322 |                 comp_time = map2(fun, n, ~system.time(.x(1:.y, sqrt))))
323 | df$comp_time
324 | ```
325 | 
326 | 
327 | ### `map`
328 | 
 329 | Vanilla `map` is a slightly improved version of `lapply`. It applies a function to each item in a list.
330 | 
331 | ```{r map}
332 | map(1:4, log)
333 | ```
334 | 
335 | Can supply additional arguments as with `(x)apply`
336 | 
337 | ```{r map arg}
338 | map(1:4, log, base = 2)
339 | ```
340 | 
341 | Can compose anonymous functions like `(x)apply`, either the old way or with a new formula shorthand. 
342 | 
343 | ```{r map formula}
344 | map(1:4, ~ log(4, base = .x))  # == map(1:4, function(x) log(4, base = x))
345 | ```
346 | 
347 | `map` always returns a list. `map_xxx` type-specifies the output type and simplifies the list to a vector.
348 | 
349 | ```{r map_type}
350 | map_dbl(1:4, log, base = 2)
351 | ```
352 | 
353 | And throws an error if any output isn't of the expected type (which is a good thing!).
354 | 
355 | ```{r map_type error}
356 | map_int(1:4, log, base = 2)
357 | ```
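The other typed variants follow the same pattern. A small sketch on a made-up list: each call either returns an atomic vector of the promised type or fails.

```r
library(purrr)

x <- list(1:3, letters, NULL)

empties <- map_lgl(x, is.null)  # logical vector
classes <- map_chr(x, class)    # character vector
sizes   <- map_int(x, length)   # integer vector
```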
358 | 
359 | 
 360 | `map2` is like `mapply` -- apply a function over two lists in parallel. `pmap` generalizes to any number of lists.
361 | 
362 | ```{r map2}
363 | fwd = 1:10
364 | bck = 10:1
365 | map2_dbl(fwd, bck, `^`)
366 | ```
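For more than two inputs, current purrr releases provide `pmap` (rather than `map_n`) and its typed variants. A minimal sketch, with invented input names: the list elements are matched to the function's arguments by name and iterated in parallel.

```r
library(purrr)

# Hypothetical inputs: three parallel vectors, matched to arguments by name
inputs <- list(base = c(2, 3, 4), exponent = c(3, 2, 1), offset = c(1, 1, 1))

result <- pmap_dbl(inputs, function(base, exponent, offset) base^exponent + offset)
result
```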
367 | 
 368 | `map_if` tests each element with a predicate function: where the predicate is true it applies the second function, otherwise it returns the original element unchanged.
369 | 
370 | ```{r map_if}
371 | data_frame(ints = 1:5, lets = letters[1:5], sqrts = ints^.5) %>%
372 |   map_if(is.numeric, ~ .x^2) 
373 | ```
374 | 
375 | ### Putting `map` to work
376 | 
377 | Split the movies data frame by mpaa rating, fit a linear model to each data frame, and organize the model results in a data frame.
378 | 
379 | ```{r movies split models}
380 | movies %>% 
381 |   filter(mpaa != "") %>%
382 |   split(.$mpaa) %>%
383 |   map(~ lm(rating ~ budget, data = .)) %>%
384 |   map_df(tidy, .id = "mpaa-rating") %>%
385 |   arrange(term)
386 | ```
387 | 
388 | List-columns make it easier to organize complex datasets. Can `map` over list-columns right in `data_frame`/`tibble` creation. And if you later want to calculate something else, everything is nicely organized in the data frame.
389 | 
390 | ```{r list columns + map}
391 | d = 
392 |   data_frame(
393 |     dist = c("normal", "poisson", "chi-square"),
394 |     funs = list(rnorm, rpois, rchisq),
395 |     samples = map(funs, ~.(100, 5)),
396 |     mean = map_dbl(samples, mean),
397 |     var = map_dbl(samples, var)
398 |   )
399 | d$median = map_dbl(d$samples, median)
400 | d
401 | ```
402 | 
403 | Let's see if we can really make this purrr... Fit a linear model of diamond price by every combination of two predictors in the dataset and see which two predict best.
404 | 
405 | ```{r diamonds predictors}
406 | train = sample(nrow(diamonds), floor(nrow(diamonds) * .67))
407 | setdiff(names(diamonds), "price") %>%
408 |   combn(2, paste, collapse = " + ") %>%
409 |   structure(., names = .) %>%
410 |   map(~ formula(paste("price ~ ", .x))) %>%
411 |   map(lm, data = diamonds[train, ]) %>%
412 |   map_df(augment, newdata = diamonds[-train, ], .id = "predictors") %>%
413 |   group_by(predictors) %>%
414 |   summarize(rmse = sqrt(mean((price - .fitted)^2))) %>%
415 |   arrange(rmse)
416 | ```
417 | 
418 | 
419 | ### Type-stability
420 | 
 421 | We have seen that we can use `map_lgl` to ensure we get a logical vector, `map_chr` to ensure we get a character vector back, etc. Type stability is like a little built-in unit test. You make sure you're getting what you think you are, even in the middle of a pipeline or function. Here are two more type-stable functions implemented in `purrr`.
422 | 
423 | #### `flatten`
424 | 
425 | Like `unlist` but can specify output type, and never recurses.
426 | 
427 | ```{r flatten}
428 | map(-1:3, ~.x ^ seq(-.5, .5, .5)) %>%
429 |   flatten_dbl()
430 | ```
431 | 
432 | #### `safely`
433 | 
434 | ```{r error}
435 | junk = list(letters, 1:20, median)
436 | map(junk, ~ log(.x))
437 | ```
438 | 
439 | - `safely` "catches" errors and always "succeeds". 
440 | - `try` does the same, but either returns the value or a try-error object.
441 | - `safely` is type-stable. It always returns a length-two list with one object NULL.
442 | 
443 | ```{r safely}
444 | safe = map(junk, ~ safely(log)(.x))  # Note the different syntax from try(log(.x)). `safely(log)` creates a new function.
445 | safe
446 | ```
447 | 
448 | #### `transpose` a list!
449 | 
450 | Now we could conveniently move on where the function succeeded, particularly using `map_if`. To get that logical vector for the `map_if` test, we can use the `transpose` function, which inverts a list.
451 | 
452 | ```{r}
453 | transpose(safe)
454 | ```
455 | 
456 | ```{r}
457 | map_if(transpose(safe)$result, ~!is.null(.x), median)
458 | ```
459 | 
460 | ## `stringr`
461 | 
462 | All your string manipulation and regex functions with a consistent API. 
463 | 
464 | ```{r}
465 | library(stringr)  # not attached with tidyverse
466 | fishes <- c("one fish", "two fish", "red fish", "blue fish")
467 | str_detect(fishes, "two")
468 | ```
469 | 
470 | ```{r}
471 | str_replace_all(fishes, "fish", "banana")
472 | ```
473 | 
474 | ```{r}
475 | str_extract(fishes, "[a-z]\\s")
476 | ```
477 | 
 478 | Let's put that string manipulation engine to work. Remember the annoying column names in the WHO data? They look like this: `r stringr::str_c(colnames(tidyr::who)[5:7], collapse = ", ")`, where "new" or "new_" doesn't mean anything, the following 2-3 letters indicate the test used, the following letter indicates the gender, and the final 2-4 digits indicate the age-class. A string-handling challenge if ever there was one. Let's separate it out and plot the cases by year, gender, age-class, and test-method.
479 | 
480 | ```{r, fig.width = 8, fig.asp = .6}
481 | who %>%
482 |   select(-iso2, -iso3) %>%
483 |   gather(group, cases, -country, -year) %>%
484 |   mutate(group = str_replace(group, "new_*", ""),
485 |          method = str_extract(group, "[a-z]+"),
486 |          gender = str_sub(str_extract(group, "_[a-z]"), 2, 2),
487 |          age = str_extract(group, "[0-9]+"),
488 |          age = ifelse(str_length(age) > 2,
489 |                       str_c(str_sub(age, 1, -3), str_sub(age, -2, -1), sep = "-"),
490 |                       str_c(age, "+"))) %>%
491 |   group_by(year, gender, age, method) %>%
492 |   summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
493 |   ggplot(aes(x = year, y = total_cases, linetype = gender)) +
494 |   geom_line() +
495 |   facet_grid(method ~ age,
496 |              labeller = labeller(.rows = label_both, .cols = label_both)) +
497 |   scale_y_log10() +
498 |   theme_light() +
499 |   theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
500 | ```
501 |          
502 | 
503 | ## Post-talk debugging improvisation
504 | 
505 | ```{r}
506 | pipe_stopifnot = function(df, test){
507 |   stopifnot(test)
508 |   return(df)
509 | }
510 | ```
511 | 
512 | ```{r}
513 | print_and_go = function(df, what_to_print) {
514 |   cat(what_to_print)
515 |   return(df)
516 | }
517 | ```
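One possible way to use these helpers mid-pipeline (the data and message are invented for illustration). The sketch leans on magrittr's rule that a `.` nested inside another call, as in `nrow(.)`, does not stop the left-hand side from also being passed as the first argument.

```r
library(magrittr)

# Same helpers as above, repeated so the example runs on its own
pipe_stopifnot <- function(df, test) {
  stopifnot(test)
  df
}

print_and_go <- function(df, what_to_print) {
  cat(what_to_print)
  df
}

result <- data.frame(x = 1:3) %>%
  pipe_stopifnot(nrow(.) == 3) %>%         # abort the pipeline if the check fails
  print_and_go("row count checked\n") %>%  # log a message and pass the data on
  transform(y = x^2)
```

Both helpers return their input unchanged, which is what lets them sit between any two pipeline steps.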
518 | 
519 | 


--------------------------------------------------------------------------------
/tidyverse.md:
--------------------------------------------------------------------------------
   1 | tidyverse
   2 | ================
   3 | Michael Levy, Prepared for the Davis R-Users' Group
   4 | October 13, 2016
   5 | 
   6 | What is the tidyverse?
   7 | ----------------------
   8 | 
   9 | ~~Hadleyverse~~
  10 | 
  11 | The tidyverse is a suite of R tools that follow a tidy philosophy:
  12 | 
  13 | ### Tidy data
  14 | 
  15 | Put data in data frames
  16 | 
  17 | -   Each type of observation gets a data frame
  18 | -   Each variable gets a column
  19 | -   Each observation gets a row
  20 | 
  21 | ### Tidy APIs
  22 | 
  23 | Functions should be consistent and easily (human) readable
  24 | 
  25 | -   Take one step at a time
  26 | -   Connect simple steps with the pipe
  27 | -   Referential transparency
  28 | 
  29 | ### Okay but really, what is it?
  30 | 
  31 | Suite of ~20 packages that provide consistent, user-friendly, smart-default tools to do most of what most people do in R.
  32 | 
  33 | -   Core packages: ggplot2, dplyr, tidyr, readr, purrr, tibble
  34 | -   Specialized data manipulation: hms, stringr, lubridate, forcats
  35 | -   Data import: DBI, haven, httr, jsonlite, readxl, rvest, xml2
  36 | -   Modeling: modelr, broom
  37 | 
  38 | `install.packages("tidyverse")` installs all of the above packages.
  39 | 
  40 | `library(tidyverse)` attaches only the core packages.
  41 | 
  42 | Why tidyverse?
  43 | --------------
  44 | 
  45 | -   Consistency
  46 |     -   e.g. All `stringr` functions take string first
  47 |     -   e.g. Many functions take data.frame first -&gt; piping
  48 |         -   Faster to write
  49 |         -   Easier to read
  50 |     -   Tidy data: Imposes good practices
  51 |     -   Type specificity
  52 | -   You probably use some of it already. Synergize.
  53 | -   Implements simple solutions to common problems (e.g. `purrr::transpose`)
  54 | -   Smarter defaults
  55 |     -   e.g. `utils::write.csv(row.names = FALSE)` = `readr::write_csv()`
  56 | -   Runs fast (thanks to `Rcpp`)
  57 | -   Interfaces well with other tools (e.g. Spark with `dplyr` via `sparklyr`)
  58 | 
  59 | `tibble`
  60 | --------
  61 | 
  62 | > A modern reimagining of data frames.
  63 | 
  64 | ``` r
  65 | library(tidyverse)
  66 | ```
  67 | 
  68 | ``` r
  69 | tdf = tibble(x = 1:1e4, y = rnorm(1e4))  # == data_frame(x = 1:1e4, y = rnorm(1e4))
  70 | class(tdf)
  71 | ```
  72 | 
  73 |     ## [1] "tbl_df"     "tbl"        "data.frame"
  74 | 
  75 | Tibbles print politely.
  76 | 
  77 | ``` r
  78 | tdf
  79 | ```
  80 | 
  81 |     ## # A tibble: 10,000 × 2
  82 |     ##        x          y
  83 |     ##    <int>      <dbl>
  84 |     ## 1      1  1.7307583
  85 |     ## 2      2  1.4246209
  86 |     ## 3      3  0.2762850
  87 |     ## 4      4  1.9267297
  88 |     ## 5      5  1.8189041
  89 |     ## 6      6  1.1574624
  90 |     ## 7      7  0.1248573
  91 |     ## 8      8 -0.1066158
  92 |     ## 9      9 -0.7412011
  93 |     ## 10    10 -0.9383221
  94 |     ## # ... with 9,990 more rows
  95 | 
  96 | -   Can customize print methods with `print(tdf, n = rows, width = cols)`
  97 | 
  98 | -   Set default with `options(tibble.print_max = rows, tibble.width = cols)`
  99 | 
 100 | Tibbles have some convenient and consistent defaults that are different from base R data.frames.
 101 | 
 102 | #### strings as factors
 103 | 
 104 | ``` r
 105 | dfs = list(
 106 |   df = data.frame(abc = letters[1:3], xyz = letters[24:26]),
 107 |   tbl = data_frame(abc = letters[1:3], xyz = letters[24:26])
 108 | )
 109 | sapply(dfs, function(d) class(d$abc))
 110 | ```
 111 | 
 112 |     ##          df         tbl 
 113 |     ##    "factor" "character"
 114 | 
 115 | #### partial matching of names
 116 | 
 117 | ``` r
 118 | sapply(dfs, function(d) d$a)
 119 | ```
 120 | 
 121 |     ## Warning: Unknown column 'a'
 122 | 
 123 |     ## $df
 124 |     ## [1] a b c
 125 |     ## Levels: a b c
 126 |     ## 
 127 |     ## $tbl
 128 |     ## NULL
 129 | 
 130 | #### type consistency
 131 | 
 132 | ``` r
 133 | sapply(dfs, function(d) class(d[, "abc"]))
 134 | ```
 135 | 
 136 |     ## $df
 137 |     ## [1] "factor"
 138 |     ## 
 139 |     ## $tbl
 140 |     ## [1] "tbl_df"     "tbl"        "data.frame"
 141 | 
 142 | Note that tidyverse import functions (e.g. `readr::read_csv`) default to tibbles and that *this can break existing code*.
 143 | 
 144 | #### List-columns!
 145 | 
 146 | ``` r
 147 | tibble(ints = 1:5,
 148 |        powers = lapply(1:5, function(x) x^(1:x)))
 149 | ```
 150 | 
 151 |     ## # A tibble: 5 × 2
 152 |     ##    ints    powers
 153 |     ##   <int>    <list>
 154 |     ## 1     1 <dbl [1]>
 155 |     ## 2     2 <dbl [2]>
 156 |     ## 3     3 <dbl [3]>
 157 |     ## 4     4 <dbl [4]>
 158 |     ## 5     5 <dbl [5]>
 159 | 
 160 | The pipe `%>%`
 161 | --------------
 162 | 
 163 | Sends the output of the LHS function to the first argument of the RHS function.
 164 | 
 165 | ``` r
 166 | sum(1:8) %>%
 167 |   sqrt()
 168 | ```
 169 | 
 170 |     ## [1] 6
 171 | 
 172 | `dplyr`
 173 | -------
 174 | 
 175 | Common data(frame) manipulation tasks.
 176 | 
 177 | Four core "verbs": filter, select, arrange, group\_by + summarize, plus many more convenience functions.
 178 | 
 179 | ``` r
 180 | library(ggplot2movies)
 181 | str(movies)
 182 | ```
 183 | 
 184 |     ## Classes 'tbl_df', 'tbl' and 'data.frame':    58788 obs. of  24 variables:
 185 |     ##  $ title      : chr  "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
 186 |     ##  $ year       : int  1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
 187 |     ##  $ length     : int  121 71 7 70 71 91 93 25 97 61 ...
 188 |     ##  $ budget     : int  NA NA NA NA NA NA NA NA NA NA ...
 189 |     ##  $ rating     : num  6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
 190 |     ##  $ votes      : int  348 20 5 6 17 45 200 24 18 51 ...
 191 |     ##  $ r1         : num  4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
 192 |     ##  $ r2         : num  4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
 193 |     ##  $ r3         : num  4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
 194 |     ##  $ r4         : num  4.5 24.5 0 0 14.5 14.5 4.5 4.5 0 4.5 ...
 195 |     ##  $ r5         : num  14.5 14.5 0 0 14.5 14.5 24.5 4.5 0 4.5 ...
 196 |     ##  $ r6         : num  24.5 14.5 24.5 0 4.5 14.5 24.5 14.5 0 44.5 ...
 197 |     ##  $ r7         : num  24.5 14.5 0 0 0 4.5 14.5 14.5 34.5 14.5 ...
 198 |     ##  $ r8         : num  14.5 4.5 44.5 0 0 4.5 4.5 14.5 14.5 4.5 ...
 199 |     ##  $ r9         : num  4.5 4.5 24.5 34.5 0 14.5 4.5 4.5 4.5 4.5 ...
 200 |     ##  $ r10        : num  4.5 14.5 24.5 45.5 24.5 14.5 14.5 14.5 24.5 4.5 ...
 201 |     ##  $ mpaa       : chr  "" "" "" "" ...
 202 |     ##  $ Action     : int  0 0 0 0 0 0 1 0 0 0 ...
 203 |     ##  $ Animation  : int  0 0 1 0 0 0 0 0 0 0 ...
 204 |     ##  $ Comedy     : int  1 1 0 1 0 0 0 0 0 0 ...
 205 |     ##  $ Drama      : int  1 0 0 0 0 1 1 0 1 0 ...
 206 |     ##  $ Documentary: int  0 0 0 0 0 0 0 1 0 0 ...
 207 |     ##  $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
 208 |     ##  $ Short      : int  0 0 1 0 0 0 0 1 0 0 ...
 209 | 
 210 | ``` r
 211 | filter(movies, length > 360)
 212 | ```
 213 | 
 214 |     ## # A tibble: 21 × 24
 215 |     ##                                               title  year length  budget
 216 |     ##                                               <chr> <int>  <int>   <int>
 217 |     ## 1                         Commune (Paris, 1871), La  2000    555      NA
 218 |     ## 2                            Cure for Insomnia, The  1987   5220      NA
 219 |     ## 3             Ebolusyon ng isang pamilyang pilipino  2004    647      NA
 220 |     ## 4                                            Empire  1964    485      NA
 221 |     ## 5                                Farmer's Wife, The  1998    390      NA
 222 |     ## 6                                     Foolish Wives  1922    384 1100000
 223 |     ## 7                                        Four Stars  1967   1100      NA
 224 |     ## 8                 Hitler - ein Film aus Deutschland  1978    407      NA
 225 |     ## 9                               Imitation of Christ  1967    480      NA
 226 |     ## 10 Longest Most Meaningless Movie in the World, The  1970   2880      NA
 227 |     ## # ... with 11 more rows, and 20 more variables: rating <dbl>, votes <int>,
 228 |     ## #   r1 <dbl>, r2 <dbl>, r3 <dbl>, r4 <dbl>, r5 <dbl>, r6 <dbl>, r7 <dbl>,
 229 |     ## #   r8 <dbl>, r9 <dbl>, r10 <dbl>, mpaa <chr>, Action <int>,
 230 |     ## #   Animation <int>, Comedy <int>, Drama <int>, Documentary <int>,
 231 |     ## #   Romance <int>, Short <int>
 232 | 
 233 | ``` r
 234 | filter(movies, length > 360) %>%
 235 |   select(title, rating, votes)
 236 | ```
 237 | 
 238 |     ## # A tibble: 21 × 3
 239 |     ##                                               title rating votes
 240 |     ##                                               <chr>  <dbl> <int>
 241 |     ## 1                         Commune (Paris, 1871), La    7.8    33
 242 |     ## 2                            Cure for Insomnia, The    3.8    59
 243 |     ## 3             Ebolusyon ng isang pamilyang pilipino    8.4     5
 244 |     ## 4                                            Empire    5.5    46
 245 |     ## 5                                Farmer's Wife, The    8.5    52
 246 |     ## 6                                     Foolish Wives    7.6   191
 247 |     ## 7                                        Four Stars    3.0    12
 248 |     ## 8                 Hitler - ein Film aus Deutschland    9.0    70
 249 |     ## 9                               Imitation of Christ    4.4     5
 250 |     ## 10 Longest Most Meaningless Movie in the World, The    6.4    15
 251 |     ## # ... with 11 more rows
 252 | 
 253 | ``` r
 254 | filter(movies, Animation == 1, votes > 1000) %>%
 255 |   select(title, rating) %>%
 256 |   arrange(desc(rating))
 257 | ```
 258 | 
 259 |     ## # A tibble: 135 × 2
 260 |     ##                                   title rating
 261 |     ##                                   <chr>  <dbl>
 262 |     ## 1         Sen to Chihiro no kamikakushi    8.6
 263 |     ## 2                            Duck Amuck    8.4
 264 |     ## 3  Wallace & Gromit: The Wrong Trousers    8.4
 265 |     ## 4                          Finding Nemo    8.3
 266 |     ## 5                        Hotaru no haka    8.3
 267 |     ## 6                      Incredibles, The    8.3
 268 |     ## 7                         Mononoke-hime    8.3
 269 |     ## 8                    What's Opera, Doc?    8.3
 270 |     ## 9                               Vincent    8.2
 271 |     ## 10      Wallace & Gromit: A Close Shave    8.2
 272 |     ## # ... with 125 more rows
 273 | 
 274 | `summarize` makes `aggregate` and `tapply` functionality easier, and the output is always a data frame.
 275 | 
 276 | ``` r
 277 | filter(movies, mpaa != "") %>%
 278 |   group_by(year, mpaa) %>%
 279 |   summarize(avg_budget = mean(budget, na.rm = TRUE),
 280 |             avg_rating = mean(rating, na.rm = TRUE)) %>%
 281 |   arrange(desc(year), mpaa)
 282 | ```
 283 | 
 284 |     ## Source: local data frame [128 x 4]
 285 |     ## Groups: year [54]
 286 |     ## 
 287 |     ##     year  mpaa avg_budget avg_rating
 288 |     ##    <int> <chr>      <dbl>      <dbl>
 289 |     ## 1   2005 NC-17        NaN   6.700000
 290 |     ## 2   2005    PG   45857143   5.733333
 291 |     ## 3   2005 PG-13   42269333   5.326087
 292 |     ## 4   2005     R   24305882   4.595833
 293 |     ## 5   2004    PG   45126852   5.847619
 294 |     ## 6   2004 PG-13   46288254   6.080180
 295 |     ## 7   2004     R   19548519   5.848469
 296 |     ## 8   2003    PG   37057692   5.897674
 297 |     ## 9   2003 PG-13   46269491   5.949038
 298 |     ## 10  2003     R   21915505   5.702273
 299 |     ## # ... with 118 more rows
 300 | 
 301 | `count` for frequency tables. Note the consistent API and easy readability vs. `table`.
 302 | 
 303 | ``` r
 304 | filter(movies, mpaa != "") %>%
 305 |   count(year, mpaa, Animation, sort = TRUE)
 306 | ```
 307 | 
 308 |     ## Source: local data frame [156 x 4]
 309 |     ## Groups: year, mpaa [128]
 310 |     ## 
 311 |     ##     year  mpaa Animation     n
 312 |     ##    <int> <chr>     <int> <int>
 313 |     ## 1   1999     R         0   366
 314 |     ## 2   2001     R         0   355
 315 |     ## 3   2002     R         0   343
 316 |     ## 4   2000     R         0   341
 317 |     ## 5   1998     R         0   335
 318 |     ## 6   1997     R         0   325
 319 |     ## 7   1996     R         0   310
 320 |     ## 8   1995     R         0   293
 321 |     ## 9   2003     R         0   264
 322 |     ## 10  2004     R         0   196
 323 |     ## # ... with 146 more rows
 324 | 
 325 | ``` r
 326 | basetab = with(movies[movies$mpaa != "", ], table(year, mpaa, Animation))
 327 | basetab[1:5, , ]
 328 | ```
 329 | 
 330 |     ## , , Animation = 0
 331 |     ## 
 332 |     ##       mpaa
 333 |     ## year   NC-17 PG PG-13 R
 334 |     ##   1934     0  1     0 0
 335 |     ##   1938     0  1     0 0
 336 |     ##   1945     0  0     1 0
 337 |     ##   1946     0  1     0 0
 338 |     ##   1951     0  2     0 0
 339 |     ## 
 340 |     ## , , Animation = 1
 341 |     ## 
 342 |     ##       mpaa
 343 |     ## year   NC-17 PG PG-13 R
 344 |     ##   1934     0  0     0 0
 345 |     ##   1938     0  0     0 0
 346 |     ##   1945     0  0     0 0
 347 |     ##   1946     0  0     0 0
 348 |     ##   1951     0  0     0 0
 349 | 
 350 | ### joins
 351 | 
 352 | `dplyr` also does multi-table joins and can connect to various types of databases.
 353 | 
 354 | ``` r
 355 | t1 = data_frame(alpha = letters[1:6], num = 1:6)
 356 | t2 = data_frame(alpha = letters[4:10], num = 4:10)
 357 | full_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))
 358 | ```
 359 | 
 360 |     ## # A tibble: 10 × 3
 361 |     ##    alpha num_t1 num_t2
 362 |     ##    <chr>  <int>  <int>
 363 |     ## 1      a      1     NA
 364 |     ## 2      b      2     NA
 365 |     ## 3      c      3     NA
 366 |     ## 4      d      4      4
 367 |     ## 5      e      5      5
 368 |     ## 6      f      6      6
 369 |     ## 7      g     NA      7
 370 |     ## 8      h     NA      8
 371 |     ## 9      i     NA      9
 372 |     ## 10     j     NA     10
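
The other join verbs take the same arguments as `full_join`. Here is a minimal sketch on the same two tables, assuming a dplyr version that exports `tibble()` (used in place of `data_frame()`, which newer releases deprecate). The result names `inner`, `left`, and `only_t1` are made up for illustration.

```r
library(dplyr)

t1 = tibble(alpha = letters[1:6], num = 1:6)
t2 = tibble(alpha = letters[4:10], num = 4:10)

# Keep only the keys present in both tables (d, e, f)
inner = inner_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))

# Keep every row of t1; rows without a match get NA in num_t2
left = left_join(t1, t2, by = "alpha", suffix = c("_t1", "_t2"))

# Keep the rows of t1 that have no match in t2 (a, b, c); no columns are added
only_t1 = anti_join(t1, t2, by = "alpha")
```

Which verb to use depends on which table's rows must survive: `full_join` keeps everything, `inner_join` keeps only matches, and `anti_join` is handy for finding what failed to match.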
 373 | 
 374 | Super-secret pro-tip: `group_by() %>% mutate()` accomplishes a summarize + join in one step. Each row keeps its own values and gains its group's summary.
 375 | 
 376 | ``` r
 377 | data_frame(group = sample(letters[1:3], 10, replace = TRUE),
 378 |            value = rnorm(10)) %>%
 379 |   group_by(group) %>%
 380 |   mutate(group_average = mean(value))
 381 | ```
 382 | 
 383 |     ## Source: local data frame [10 x 3]
 384 |     ## Groups: group [3]
 385 |     ## 
 386 |     ##    group       value group_average
 387 |     ##    <chr>       <dbl>         <dbl>
 388 |     ## 1      b -0.07744649   -0.61076927
 389 |     ## 2      c  0.11825771    0.08865786
 390 |     ## 3      c  0.58866540    0.08865786
 391 |     ## 4      c  0.27584554    0.08865786
 392 |     ## 5      b -0.80187845   -0.61076927
 393 |     ## 6      c -0.33398635    0.08865786
 394 |     ## 7      c -0.20549302    0.08865786
 395 |     ## 8      b -0.95298286   -0.61076927
 396 |     ## 9      a -0.48253785   -0.68985472
 397 |     ## 10     a -0.89717159   -0.68985472
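
The equivalence can be checked directly. The sketch below uses a small fixed data set (invented here so the numbers are easy to verify by hand) and computes the group averages both ways; `via_mutate`, `averages`, and `via_join` are illustrative names.

```r
library(dplyr)

d = tibble(group = c("a", "a", "b", "b", "b"),
           value = c(1, 3, 2, 4, 6))

# One step: grouped mutate adds the group mean to every row
via_mutate = d %>%
  group_by(group) %>%
  mutate(group_average = mean(value)) %>%
  ungroup()

# Two steps: summarize to one row per group, then join back onto the data
averages = d %>%
  group_by(group) %>%
  summarize(group_average = mean(value))
via_join = left_join(d, averages, by = "group")
```

Both routes give a `group_average` of 2 for group `a` and 4 for group `b` on every row; the grouped `mutate` just skips the intermediate table.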
 398 | 
 399 | `tidyr`
 400 | -------
 401 | 
 402 | Latest generation of `reshape`. `gather` makes wide tables long; `spread` makes long tables wide.
 403 | 
 404 | ``` r
 405 | who  # Tuberculosis data from the WHO
 406 | ```
 407 | 
 408 |     ## # A tibble: 7,240 × 60
 409 |     ##        country  iso2  iso3  year new_sp_m014 new_sp_m1524 new_sp_m2534
 410 |     ##          <chr> <chr> <chr> <int>       <int>        <int>        <int>
 411 |     ## 1  Afghanistan    AF   AFG  1980          NA           NA           NA
 412 |     ## 2  Afghanistan    AF   AFG  1981          NA           NA           NA
 413 |     ## 3  Afghanistan    AF   AFG  1982          NA           NA           NA
 414 |     ## 4  Afghanistan    AF   AFG  1983          NA           NA           NA
 415 |     ## 5  Afghanistan    AF   AFG  1984          NA           NA           NA
 416 |     ## 6  Afghanistan    AF   AFG  1985          NA           NA           NA
 417 |     ## 7  Afghanistan    AF   AFG  1986          NA           NA           NA
 418 |     ## 8  Afghanistan    AF   AFG  1987          NA           NA           NA
 419 |     ## 9  Afghanistan    AF   AFG  1988          NA           NA           NA
 420 |     ## 10 Afghanistan    AF   AFG  1989          NA           NA           NA
 421 |     ## # ... with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
 422 |     ## #   new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
 423 |     ## #   new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
 424 |     ## #   new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
 425 |     ## #   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
 426 |     ## #   new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
 427 |     ## #   new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
 428 |     ## #   new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
 429 |     ## #   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
 430 |     ## #   new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
 431 |     ## #   new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
 432 |     ## #   new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
 433 |     ## #   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
 434 |     ## #   new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
 435 |     ## #   newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
 436 |     ## #   newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
 437 |     ## #   newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
 438 |     ## #   newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
 439 |     ## #   newrel_f65 <int>
 440 | 
 441 | ``` r
 442 | who %>%
 443 |   gather(group, cases, -country, -iso2, -iso3, -year)
 444 | ```
 445 | 
 446 |     ## # A tibble: 405,440 × 6
 447 |     ##        country  iso2  iso3  year       group cases
 448 |     ##          <chr> <chr> <chr> <int>       <chr> <int>
 449 |     ## 1  Afghanistan    AF   AFG  1980 new_sp_m014    NA
 450 |     ## 2  Afghanistan    AF   AFG  1981 new_sp_m014    NA
 451 |     ## 3  Afghanistan    AF   AFG  1982 new_sp_m014    NA
 452 |     ## 4  Afghanistan    AF   AFG  1983 new_sp_m014    NA
 453 |     ## 5  Afghanistan    AF   AFG  1984 new_sp_m014    NA
 454 |     ## 6  Afghanistan    AF   AFG  1985 new_sp_m014    NA
 455 |     ## 7  Afghanistan    AF   AFG  1986 new_sp_m014    NA
 456 |     ## 8  Afghanistan    AF   AFG  1987 new_sp_m014    NA
 457 |     ## 9  Afghanistan    AF   AFG  1988 new_sp_m014    NA
 458 |     ## 10 Afghanistan    AF   AFG  1989 new_sp_m014    NA
 459 |     ## # ... with 405,430 more rows
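
`spread` is the inverse operation. A minimal round-trip sketch on a tiny made-up table (the `wide` data and its column names are invented for illustration, not taken from `who`):

```r
library(dplyr)
library(tidyr)

wide = tibble(country = c("A", "B"),
              y1999 = c(10, 20),
              y2000 = c(11, 21))

# Wide to long: one row per country-year combination
long = gather(wide, year, cases, -country)

# Long to wide: the values of `year` become column names again
back = spread(long, year, cases)
```

Here `long` has four rows (two countries times two years), and `back` recovers the original three columns.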
 460 | 
 461 | `ggplot2`
 462 | ---------
 463 | 
 464 | If you don't already know and love it, check out [one of](https://d-rug.github.io/blog/2012/ggplot-introduction) [our](https://d-rug.github.io/blog/2013/xtsmarkdown) [previous](https://d-rug.github.io/blog/2013/formatting-plots-for-pubs) [talks](https://d-rug.github.io/blog/2015/ggplot-tutorial-johnston) on ggplot or any of the excellent resources on the internet.
 465 | 
 466 | Note that the pipe and consistent API make it easy to combine functions from different packages, and the whole thing is quite readable.
 467 | 
 468 | ``` r
 469 | who %>%
 470 |   select(-iso2, -iso3) %>%
 471 |   gather(group, cases, -country, -year) %>%
 472 |   count(country, year, wt = cases) %>%
 473 |   ggplot(aes(x = year, y = n, group = country)) +
 474 |   geom_line(size = .2) 
 475 | ```
 476 | 
 477 | ![](tidyverse_files/figure-markdown_github/dplyr-tidyr-ggplot-1.png)
 478 | 
 479 | `readr`
 480 | -------
 481 | 
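 482 | For reading and writing flat files. Faster than base, with smarter defaults: as the examples below show, `write_csv` does not add a row-names column and `read_csv` reports the column types it guessed.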
 483 | 
 484 | ``` r
 485 | bigdf = data_frame(int = 1:1e6, 
 486 |                    squares = int^2, 
 487 |                    letters = sample(letters, 1e6, replace = TRUE))
 488 | ```
 489 | 
 490 | ``` r
 491 | system.time(
 492 |   write.csv(bigdf, "base-write.csv")
 493 | )
 494 | ```
 495 | 
 496 |     ##    user  system elapsed 
 497 |     ##   2.579   0.084   2.704
 498 | 
 499 | ``` r
 500 | system.time(
 501 |   write_csv(bigdf, "readr-write.csv")
 502 | )
 503 | ```
 504 | 
 505 |     ##    user  system elapsed 
 506 |     ##   0.819   0.069   0.940
 507 | 
 508 | ``` r
 509 | read.csv("base-write.csv", nrows = 3)
 510 | ```
 511 | 
 512 |     ##   X int squares letters
 513 |     ## 1 1   1       1       h
 514 |     ## 2 2   2       4       d
 515 |     ## 3 3   3       9       m
 516 | 
 517 | ``` r
 518 | read_csv("readr-write.csv", n_max = 3)
 519 | ```
 520 | 
 521 |     ## Parsed with column specification:
 522 |     ## cols(
 523 |     ##   int = col_integer(),
 524 |     ##   squares = col_double(),
 525 |     ##   letters = col_character()
 526 |     ## )
 527 | 
 528 |     ## # A tibble: 3 × 3
 529 |     ##     int squares letters
 530 |     ##   <int>   <dbl>   <chr>
 531 |     ## 1     1       1       h
 532 |     ## 2     2       4       d
 533 |     ## 3     3       9       m
 534 | 
 535 | `broom`
 536 | -------
 537 | 
 538 | `broom` is a convenient little package to work with model results. Two functions I find useful are `tidy` to extract model results and `augment` to add residuals, predictions, etc. to a data.frame.
 539 | 
 540 | ``` r
 541 | d = data_frame(x = runif(20, 0, 10), 
 542 |                y = 2 * x + rnorm(20))  # one noise draw per row; rnorm(10) would be silently recycled
 543 | qplot(x, y, data = d)
 544 | ```
 545 | 
 546 | ![](tidyverse_files/figure-markdown_github/make%20model%20data-1.png)
 547 | 
 548 | ### `tidy`
 549 | 
 550 | ``` r
 551 | library(broom)  # Not attached with tidyverse
 552 | model = lm(y ~ x, d)
 553 | tidy(model)
 554 | ```
 555 | 
 556 |     ##          term    estimate  std.error   statistic      p.value
 557 |     ## 1 (Intercept) -0.02212952 0.32815614 -0.06743595 9.469781e-01
 558 |     ## 2           x  2.04880361 0.06368259 32.17211622 2.328477e-17
 559 | 
 560 | ### `augment`
 561 | 
 562 | I.e., the function formerly known as `fortify`. It returns the model's data with fitted values, residuals, and diagnostics added as columns.
 563 | 
 564 | ``` r
 565 | aug = augment(model)
 566 | aug
 567 | ```
 568 | 
 569 |     ##            y         x   .fitted   .se.fit      .resid       .hat
 570 |     ## 1   4.365310 2.2167792  4.519616 0.2190258 -0.15430550 0.08531087
 571 |     ## 2  10.183616 5.4172613 11.076775 0.1790893 -0.89315878 0.05703654
 572 |     ## 3   7.215538 3.2282729  6.591968 0.1843041  0.62357006 0.06040652
 573 |     ## 4  16.104626 7.9968916 16.361931 0.2823602 -0.25730454 0.14178182
 574 |     ## 5  15.469541 8.0496637 16.470051 0.2850711 -1.00050940 0.14451735
 575 |     ## 6   1.777862 0.5917886  1.190329 0.2963871  0.58753261 0.15621841
 576 |     ## 7  14.792232 7.4687935 15.279961 0.2560816 -0.48772985 0.11661935
 577 |     ## 8   7.947098 3.9047317  7.977899 0.1709766 -0.03080099 0.05198605
 578 |     ## 9   6.652823 3.3824637  6.907874 0.1804498 -0.25505137 0.05790640
 579 |     ## 10 16.466676 7.2607843 14.853792 0.2462225  1.61288435 0.10781254
 580 |     ## 11  9.147981 4.6081148  9.418993 0.1680642 -0.27101132 0.05023009
 581 |     ## 12  7.045629 3.8482676  7.862215 0.1717156 -0.81658622 0.05243644
 582 |     ## 13 15.521358 7.3811829 15.100465 0.2518912  0.42089305 0.11283397
 583 |     ## 14  2.047400 0.9682785  1.961683 0.2769494  0.08571719 0.13640005
 584 |     ## 15  1.924992 1.2773890  2.594990 0.2615541 -0.66999792 0.12165693
 585 |     ## 16  6.737163 3.0714391  6.270646 0.1886685  0.46651671 0.06330129
 586 |     ## 17  1.835538 0.9904465  2.007101 0.2758271 -0.17156311 0.13529686
 587 |     ## 18  3.904365 1.8833655  3.836516 0.2332531  0.06784899 0.09675390
 588 |     ## 19 16.870169 8.4911367 17.374542 0.3082513 -0.50437308 0.16897542
 589 |     ## 20 15.051012 6.5529524 13.403583 0.2154124  1.64742911 0.08251922
 590 |     ##       .sigma      .cooksd  .std.resid
 591 |     ## 1  0.7706297 2.158758e-03 -0.21515503
 592 |     ## 2  0.7386729 4.549928e-02 -1.22655805
 593 |     ## 3  0.7556838 2.365691e-02  0.85787128
 594 |     ## 4  0.7686765 1.133193e-02 -0.37038676
 595 |     ## 5  0.7256519 1.757615e-01 -1.44252197
 596 |     ## 6  0.7558680 6.734724e-02  0.85295049
 597 |     ## 7  0.7612891 3.160947e-02 -0.69200983
 598 |     ## 8  0.7715844 4.879446e-05 -0.04218560
 599 |     ## 9  0.7689861 3.773788e-03 -0.35041888
 600 |     ## 10 0.6510658 3.132905e-01  2.27709965
 601 |     ## 11 0.7686693 3.636518e-03 -0.37083874
 602 |     ## 12 0.7443161 3.462616e-02 -1.11867707
 603 |     ## 13 0.7639734 2.258173e-02  0.59590381
 604 |     ## 14 0.7712982 1.194837e-03  0.12300378
 605 |     ## 15 0.7518898 6.294179e-02 -0.95334091
 606 |     ## 16 0.7627149 1.396146e-02  0.64279740
 607 |     ## 17 0.7703240 4.735711e-03 -0.24603520
 608 |     ## 18 0.7714283 4.854302e-04  0.09520224
 609 |     ## 19 0.7598647 5.534563e-02 -0.73782239
 610 |     ## 20 0.6491487 2.365693e-01  2.29358645
 611 | 
 612 | ``` r
 613 | ggplot(aug, aes(x = x)) +
 614 |   geom_point(aes(y = y, color = .resid)) + 
 615 |   geom_line(aes(y = .fitted)) +
 616 |   viridis::scale_color_viridis() +
 617 |   theme(legend.position = c(0, 1), legend.justification = c(0, 1))
 618 | ```
 619 | 
 620 | ![](tidyverse_files/figure-markdown_github/plot%20resid-1.png)
 621 | 
 622 | ``` r
 623 | ggplot(aug, aes(.fitted, .resid, size = .cooksd)) + 
 624 |   geom_point()
 625 | ```
 626 | 
 627 | ![](tidyverse_files/figure-markdown_github/plot%20cooksd-1.png)
 628 | 
 629 | `purrr`
 630 | -------
 631 | 
 632 | `purrr` is kind of like `dplyr` for lists. It helps you repeatedly apply functions. Like the rest of the tidyverse, it does nothing you can't do in base R, but `purrr` makes the API consistent, encourages type specificity, and provides some nice shortcuts and speed-ups.
 633 | 
 634 | ``` r
 635 | df = data_frame(fun = rep(c(lapply, map), 2),
 636 |                 n = rep(c(1e5, 1e7), each = 2),
 637 |                 comp_time = map2(fun, n, ~system.time(.x(1:.y, sqrt))))
 638 | df$comp_time
 639 | ```
 640 | 
 641 |     ## [[1]]
 642 |     ##    user  system elapsed 
 643 |     ##   0.042   0.003   0.046 
 644 |     ## 
 645 |     ## [[2]]
 646 |     ##    user  system elapsed 
 647 |     ##   0.055   0.001   0.059 
 648 |     ## 
 649 |     ## [[3]]
 650 |     ##    user  system elapsed 
 651 |     ##  12.366   0.218  12.612 
 652 |     ## 
 653 |     ## [[4]]
 654 |     ##    user  system elapsed 
 655 |     ##   8.572   0.117   8.742
 656 | 
 657 | ### `map`
 658 | 
 659 | Vanilla `map` is a slightly improved version of `lapply`: it applies a function to each item in a list.
 660 | 
 661 | ``` r
 662 | map(1:4, log)
 663 | ```
 664 | 
 665 |     ## [[1]]
 666 |     ## [1] 0
 667 |     ## 
 668 |     ## [[2]]
 669 |     ## [1] 0.6931472
 670 |     ## 
 671 |     ## [[3]]
 672 |     ## [1] 1.098612
 673 |     ## 
 674 |     ## [[4]]
 675 |     ## [1] 1.386294
 676 | 
 677 | Can supply additional arguments as with `(x)apply`
 678 | 
 679 | ``` r
 680 | map(1:4, log, base = 2)
 681 | ```
 682 | 
 683 |     ## [[1]]
 684 |     ## [1] 0
 685 |     ## 
 686 |     ## [[2]]
 687 |     ## [1] 1
 688 |     ## 
 689 |     ## [[3]]
 690 |     ## [1] 1.584963
 691 |     ## 
 692 |     ## [[4]]
 693 |     ## [1] 2
 694 | 
 695 | Anonymous functions work as with `(x)apply`, written either the old way or with a formula shorthand in which `.x` stands for the current element.
 696 | 
 697 | ``` r
 698 | map(1:4, ~ log(4, base = .x))  # == map(1:4, function(x) log(4, base = x))
 699 | ```
 700 | 
 701 |     ## [[1]]
 702 |     ## [1] Inf
 703 |     ## 
 704 |     ## [[2]]
 705 |     ## [1] 2
 706 |     ## 
 707 |     ## [[3]]
 708 |     ## [1] 1.26186
 709 |     ## 
 710 |     ## [[4]]
 711 |     ## [1] 1
 712 | 
 713 | `map` always returns a list. The `map_xxx` variants (`map_dbl`, `map_int`, `map_chr`, `map_lgl`) fix the output type and simplify the list to a vector of that type.
 714 | 
 715 | ``` r
 716 | map_dbl(1:4, log, base = 2)
 717 | ```
 718 | 
 719 |     ## [1] 0.000000 1.000000 1.584963 2.000000
 720 | 
 721 | And throws an error if any output isn't of the expected type (which is a good thing!).
 722 | 
 723 | ``` r
 724 | map_int(1:4, log, base = 2)
 725 | ```
 726 | 
 727 |     ## Error: Can't coerce element 1 from a double to a integer
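
For comparison, here is a short sketch of the other typed variants succeeding. The inputs and the names `evens`, `labels`, and `doubled` are made up for illustration.

```r
library(purrr)

# Each variant returns an atomic vector of the named type
evens   = map_lgl(1:4, ~ .x %% 2 == 0)   # logical vector
labels  = map_chr(1:4, ~ letters[.x])    # character vector
doubled = map_int(1:4, ~ .x * 2L)        # integer vector: integer * integer stays integer
```

Note the `2L` in the last line: multiplying by the double `2` would produce doubles, and `map_int` would refuse them just as it refused the output of `log` above.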
 728 | 
 729 | `map2` is like `mapply` -- it applies a function over two lists in parallel. `pmap` (called `map_n` in early versions of `purrr`) generalizes to any number of lists.
 730 | 
 731 | ``` r
 732 | fwd = 1:10
 733 | bck = 10:1
 734 | map2_dbl(fwd, bck, `^`)
 735 | ```
 736 | 
 737 |     ##  [1]     1   512  6561 16384 15625  7776  2401   512    81    10
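
A minimal sketch of the many-list case using `pmap`, which takes a single list of inputs. The vectors `x`, `y`, `z` and the result names are invented for illustration.

```r
library(purrr)

x = c(1, 2, 3)
y = c(4, 5, 6)
z = c(7, 8, 9)

# Unnamed inputs are matched to the function's arguments by position
combined = pmap_dbl(list(x, y, z), function(a, b, c) a + b * c)

# Named inputs are matched to the function's arguments by name;
# strrep(x, times) is a base R function
repeated = pmap_chr(list(x = c("ab", "cd"), times = c(2, 1)), strrep)
```

The typed suffixes work the same way as for `map`: `pmap_dbl` insists on doubles and `pmap_chr` on strings.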
 738 | 
 739 | `map_if` tests each element with a predicate function. Where the predicate is true it applies the second function; where it is false it returns the original element unchanged.
 740 | 
 741 | ``` r
 742 | data_frame(ints = 1:5, lets = letters[1:5], sqrts = ints^.5) %>%
 743 |   map_if(is.numeric, ~ .x^2) 
 744 | ```
 745 | 
 746 |     ## $ints
 747 |     ## [1]  1  4  9 16 25
 748 |     ## 
 749 |     ## $lets
 750 |     ## [1] "a" "b" "c" "d" "e"
 751 |     ## 
 752 |     ## $sqrts
 753 |     ## [1] 1 2 3 4 5
 754 | 
 755 | ### Putting `map` to work
 756 | 
 757 | Split the movies data frame by mpaa rating, fit a linear model to each data frame, and organize the model results in a data frame.
 758 | 
 759 | ``` r
 760 | movies %>% 
 761 |   filter(mpaa != "") %>%
 762 |   split(.$mpaa) %>%
 763 |   map(~ lm(rating ~ budget, data = .)) %>%
 764 |   map_df(tidy, .id = "mpaa-rating") %>%
 765 |   arrange(term)
 766 | ```
 767 | 
 768 |     ##   mpaa-rating        term      estimate    std.error   statistic
 769 |     ## 1       NC-17 (Intercept)  6.505809e+00 3.124604e-01  20.8212250
 770 |     ## 2          PG (Intercept)  5.768036e+00 1.368008e-01  42.1637635
 771 |     ## 3       PG-13 (Intercept)  5.749256e+00 7.809921e-02  73.6147735
 772 |     ## 4           R (Intercept)  5.814014e+00 5.238965e-02 110.9764024
 773 |     ## 5       NC-17      budget -6.046148e-08 1.835425e-08  -3.2941404
 774 |     ## 6          PG      budget  2.028426e-09 2.745805e-09   0.7387365
 775 |     ## 7       PG-13      budget  3.493136e-09 1.443405e-09   2.4200674
 776 |     ## 8           R      budget  7.732453e-09 1.658841e-09   4.6613582
 777 |     ##         p.value
 778 |     ## 1  4.732587e-06
 779 |     ## 2 1.856102e-104
 780 |     ## 3 8.291365e-280
 781 |     ## 4  0.000000e+00
 782 |     ## 5  2.161461e-02
 783 |     ## 6  4.608918e-01
 784 |     ## 7  1.585419e-02
 785 |     ## 8  3.540387e-06
 786 | 
 787 | List-columns make it easier to organize complex datasets. You can `map` over list-columns right in `data_frame`/`tibble` creation. And if you later want to calculate something else, everything is nicely organized in the data frame.
 788 | 
 789 | ``` r
 790 | d = 
 791 |   data_frame(
 792 |     dist = c("normal", "poisson", "chi-square"),
 793 |     funs = list(rnorm, rpois, rchisq),
 794 |     samples = map(funs, ~.(100, 5)),
 795 |     mean = map_dbl(samples, mean),
 796 |     var = map_dbl(samples, var)
 797 |   )
 798 | d$median = map_dbl(d$samples, median)
 799 | d
 800 | ```
 801 | 
 802 |     ## # A tibble: 3 × 6
 803 |     ##         dist   funs     samples     mean        var   median
 804 |     ##        <chr> <list>      <list>    <dbl>      <dbl>    <dbl>
 805 |     ## 1     normal  <fun> <dbl [100]> 4.897684  0.9952718 4.910766
 806 |     ## 2    poisson  <fun> <int [100]> 4.990000  4.1716162 5.000000
 807 |     ## 3 chi-square  <fun> <dbl [100]> 5.466018 12.3804613 4.923235
 808 | 
 809 | Let's see if we can really make this purrr... Fit a linear model of diamond price by every combination of two predictors in the dataset and see which two predict best.
 810 | 
 811 | ``` r
 812 | train = sample(nrow(diamonds), floor(nrow(diamonds) * .67))
 813 | setdiff(names(diamonds), "price") %>%
 814 |   combn(2, paste, collapse = " + ") %>%
 815 |   structure(., names = .) %>%
 816 |   map(~ formula(paste("price ~ ", .x))) %>%
 817 |   map(lm, data = diamonds[train, ]) %>%
 818 |   map_df(augment, newdata = diamonds[-train, ], .id = "predictors") %>%
 819 |   group_by(predictors) %>%
 820 |   summarize(rmse = sqrt(mean((price - .fitted)^2))) %>%
 821 |   arrange(rmse)
 822 | ```
 823 | 
 824 |     ## # A tibble: 36 × 2
 825 |     ##         predictors     rmse
 826 |     ##              <chr>    <dbl>
 827 |     ## 1  carat + clarity 1296.010
 828 |     ## 2    carat + color 1474.577
 829 |     ## 3      carat + cut 1518.669
 830 |     ## 4        carat + x 1530.131
 831 |     ## 5        carat + y 1545.970
 832 |     ## 6    carat + depth 1546.579
 833 |     ## 7    carat + table 1549.821
 834 |     ## 8        carat + z 1557.959
 835 |     ## 9      clarity + x 1672.964
 836 |     ## 10     clarity + y 1689.942
 837 |     ## # ... with 26 more rows
 838 | 
 839 | ### Type-stability
 840 | 
 841 | We have seen that `map_dbl` guarantees a double vector and `map_int` an integer vector; likewise, `map_lgl` returns a logical vector, `map_chr` a character vector, etc. Type stability is like a little built-in unit test: you make sure you're getting what you think you are, even in the middle of a pipeline or function. Here are two more type-stable functions implemented in `purrr`.
 842 | 
 843 | #### `flatten`
 844 | 
 845 | Like `unlist`, but you can specify the output type, and it never recurses: it removes exactly one level of nesting.
 846 | 
 847 | ``` r
 848 | map(-1:3, ~.x ^ seq(-.5, .5, .5)) %>%
 849 |   flatten_dbl()
 850 | ```
 851 | 
 852 |     ##  [1]       NaN 1.0000000       NaN       Inf 1.0000000 0.0000000 1.0000000
 853 |     ##  [8] 1.0000000 1.0000000 0.7071068 1.0000000 1.4142136 0.5773503 1.0000000
 854 |     ## [15] 1.7320508
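
The "never recurses" point is easiest to see on a list with more than one level of nesting. This is a small sketch with a made-up list. It assumes `flatten()` is available in the installed purrr; as far as I know, newer releases steer users toward `list_flatten()` instead.

```r
library(purrr)

nested = list(list(1, 2), list(3, list(4)))

# flatten removes one level only: the inner list(4) survives as a list
one_level = flatten(nested)

# unlist recurses all the way down to a plain vector
all_levels = unlist(nested)
```

So `one_level` is still a list (of length four), while `all_levels` is the numeric vector `c(1, 2, 3, 4)`.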
 855 | 
 856 | #### `safely`
 857 | 
 858 | ``` r
 859 | junk = list(letters, 1:20, median)
 860 | map(junk, ~ log(.x))
 861 | ```
 862 | 
 863 |     ## Error in log(.x): non-numeric argument to mathematical function
 864 | 
 865 | -   `safely` "catches" errors and always "succeeds".
 866 | -   `try` also catches errors, but it returns either the value or a `try-error` object, so the type of its output varies.
 867 | -   `safely` is type-stable. It always returns a length-two list with elements `result` and `error`, exactly one of which is `NULL`.
 868 | 
 869 | ``` r
 870 | safe = map(junk, ~ safely(log)(.x))  # Note the different syntax from try(log(.x)). `safely(log)` creates a new function.
 871 | safe
 872 | ```
 873 | 
 874 |     ## [[1]]
 875 |     ## [[1]]$result
 876 |     ## NULL
 877 |     ## 
 878 |     ## [[1]]$error
 879 |     ## <simpleError in .f(...): non-numeric argument to mathematical function>
 880 |     ## 
 881 |     ## 
 882 |     ## [[2]]
 883 |     ## [[2]]$result
 884 |     ##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
 885 |     ##  [8] 2.0794415 2.1972246 2.3025851 2.3978953 2.4849066 2.5649494 2.6390573
 886 |     ## [15] 2.7080502 2.7725887 2.8332133 2.8903718 2.9444390 2.9957323
 887 |     ## 
 888 |     ## [[2]]$error
 889 |     ## NULL
 890 |     ## 
 891 |     ## 
 892 |     ## [[3]]
 893 |     ## [[3]]$result
 894 |     ## NULL
 895 |     ## 
 896 |     ## [[3]]$error
 897 |     ## <simpleError in .f(...): non-numeric argument to mathematical function>
 898 | 
 899 | #### `transpose` a list!
 900 | 
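 901 | Now we can conveniently carry on with the elements where the function succeeded, for example using `map_if`. To get at all the results at once for the `map_if` test, we can use the `transpose` function, which turns a list inside out: a list of `result`/`error` pairs becomes a pair of lists, `result` and `error`.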
 902 | 
 903 | ``` r
 904 | transpose(safe)
 905 | ```
 906 | 
 907 |     ## $result
 908 |     ## $result[[1]]
 909 |     ## NULL
 910 |     ## 
 911 |     ## $result[[2]]
 912 |     ##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
 913 |     ##  [8] 2.0794415 2.1972246 2.3025851 2.3978953 2.4849066 2.5649494 2.6390573
 914 |     ## [15] 2.7080502 2.7725887 2.8332133 2.8903718 2.9444390 2.9957323
 915 |     ## 
 916 |     ## $result[[3]]
 917 |     ## NULL
 918 |     ## 
 919 |     ## 
 920 |     ## $error
 921 |     ## $error[[1]]
 922 |     ## <simpleError in .f(...): non-numeric argument to mathematical function>
 923 |     ## 
 924 |     ## $error[[2]]
 925 |     ## NULL
 926 |     ## 
 927 |     ## $error[[3]]
 928 |     ## <simpleError in .f(...): non-numeric argument to mathematical function>
 929 | 
 930 | ``` r
 931 | map_if(transpose(safe)$result, ~!is.null(.x), median)
 932 | ```
 933 | 
 934 |     ## [[1]]
 935 |     ## NULL
 936 |     ## 
 937 |     ## [[2]]
 938 |     ## [1] 2.35024
 939 |     ## 
 940 |     ## [[3]]
 941 |     ## NULL
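
An alternative sketch that builds an explicit logical vector of successes and uses it to subset. The names `ok` and `good_results` are illustrative, and `safely(log)` is passed to `map` directly rather than through a formula.

```r
library(purrr)

junk = list(letters, 1:20, median)
safe = map(junk, safely(log))

# TRUE where the call succeeded, i.e. where no error was captured
ok = map_lgl(safe, ~ is.null(.x$error))

# Extract the results of the successful calls only
good_results = map(safe[ok], "result")
```

Testing `error` rather than `result` is the safer choice, since a function could legitimately return `NULL` as its result.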
 942 | 
 943 | `stringr`
 944 | ---------
 945 | 
 946 | All your string manipulation and regex functions with a consistent API.
 947 | 
 948 | ``` r
 949 | library(stringr)  # not attached with tidyverse
 950 | fishes <- c("one fish", "two fish", "red fish", "blue fish")
 951 | str_detect(fishes, "two")
 952 | ```
 953 | 
 954 |     ## [1] FALSE  TRUE FALSE FALSE
 955 | 
 956 | ``` r
 957 | str_replace_all(fishes, "fish", "banana")
 958 | ```
 959 | 
 960 |     ## [1] "one banana"  "two banana"  "red banana"  "blue banana"
 961 | 
 962 | ``` r
 963 | str_extract(fishes, "[a-z]\\s")
 964 | ```
 965 | 
 966 |     ## [1] "e " "o " "d " "e "
 967 | 
 968 | Let's put that string manipulation engine to work. Remember the annoying column names in the WHO data? They look like this: `new_sp_m014`, `new_sp_m1524`, `new_sp_m2534`, where "new" or "new\_" doesn't mean anything, the following 2-3 letters indicate the test used, the next letter indicates the gender, and the final 2-4 digits indicate the age class. A string-handling challenge if ever there was one. Let's separate these pieces out and plot the cases by year, gender, age class, and test method.
 969 | 
 970 | ``` r
 971 | who %>%
 972 |   select(-iso2, -iso3) %>%
 973 |   gather(group, cases, -country, -year) %>%
 974 |   mutate(group = str_replace(group, "new_*", ""),
 975 |          method = str_extract(group, "[a-z]+"),
 976 |          gender = str_sub(str_extract(group, "_[a-z]"), 2, 2),
 977 |          age = str_extract(group, "[0-9]+"),
 978 |          age = ifelse(str_length(age) > 2,
 979 |                       str_c(str_sub(age, 1, -3), str_sub(age, -2, -1), sep = "-"),
 980 |                       str_c(age, "+"))) %>%
 981 |   group_by(year, gender, age, method) %>%
 982 |   summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
 983 |   ggplot(aes(x = year, y = total_cases, linetype = gender)) +
 984 |   geom_line() +
 985 |   facet_grid(method ~ age,
 986 |              labeller = labeller(.rows = label_both, .cols = label_both)) +
 987 |   scale_y_log10() +
 988 |   theme_light() +
 989 |   theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
 990 | ```
 991 | 
 992 | ![](tidyverse_files/figure-markdown_github/unnamed-chunk-6-1.png)
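
To see what the string steps in that pipeline produce, here is the same parsing applied to three of the WHO column names outside of `mutate`. The intermediate names `codes` and `digits` are introduced only for this sketch.

```r
library(stringr)

codes = c("new_sp_m014", "new_sn_f1524", "newrel_m65")

# Drop the meaningless prefix: "new" followed by zero or more underscores
group = str_replace(codes, "new_*", "")

# First run of letters is the test method
method = str_extract(group, "[a-z]+")

# The letter right after the underscore is the gender
gender = str_sub(str_extract(group, "_[a-z]"), 2, 2)

# The digits encode the age class; split off the last two digits as the upper bound,
# or append "+" when there are only two digits (open-ended class)
digits = str_extract(group, "[0-9]+")
age = ifelse(str_length(digits) > 2,
             str_c(str_sub(digits, 1, -3), str_sub(digits, -2, -1), sep = "-"),
             str_c(digits, "+"))
```

For these three names the age classes come out as `"0-14"`, `"15-24"`, and `"65+"`. The negative indices in `str_sub` count from the end of the string, which is what lets one expression handle both three- and four-digit codes.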
 993 | 
 994 | Post-talk debugging improvisation
 995 | ---------------------------------
 996 | 
 997 | ``` r
 998 | pipe_stopifnot = function(df, test){
 999 |   stopifnot(test)
1000 |   return(df)
1001 | }
1002 | ```
1003 | 
1004 | ``` r
1005 | print_and_go = function(df, what_to_print) {
1006 |   cat(what_to_print)
1007 |   return(df)
1008 | }
1009 | ```
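
A hedged usage sketch for the two helpers, which are repeated here so the block runs on its own. The `mtcars` pipeline and the message text are made up for illustration. Inside a pipe step, `.` refers to the data frame being passed along, so `nrow(.)` checks the output of the preceding `filter`.

```r
library(dplyr)

pipe_stopifnot = function(df, test) {
  stopifnot(test)
  return(df)
}

print_and_go = function(df, what_to_print) {
  cat(what_to_print)
  return(df)
}

four_cyl = mtcars %>%
  filter(cyl == 4) %>%
  pipe_stopifnot(nrow(.) > 0) %>%            # stop here if the filter removed every row
  print_and_go("filter step passed\n") %>%   # progress message; data passes through untouched
  summarize(mean_mpg = mean(mpg))
```

Both helpers return their input unchanged, so they can be dropped into or removed from a pipeline without affecting the result.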
1010 | 


--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/dplyr-tidyr-ggplot-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/dplyr-tidyr-ggplot-1.png


--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/make model data-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/make model data-1.png


--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/plot cooksd-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/plot cooksd-1.png


--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/plot resid-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/plot resid-1.png


--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-11-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-11-1.png


--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-14-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-14-1.png


--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-15-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-15-1.png


--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-26-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-26-1.png


--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-4-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-4-1.png


--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-6-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-6-1.png


--------------------------------------------------------------------------------
/tidyverse_files/figure-markdown_github/unnamed-chunk-8-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michaellevy/tidyverse_talk/633d10ad9d691db328635f6d4e2beca83fd11eb4/tidyverse_files/figure-markdown_github/unnamed-chunk-8-1.png


--------------------------------------------------------------------------------