├── .gitignore
├── LICENSE
├── README.md
├── col-benchmark.csv
├── col-benchmark.png
├── ex01_leave-it-in-the-data-frame.R
├── ex01_leave-it-in-the-data-frame.md
├── ex01_leave-it-in-the-data-frame_files
│   └── figure-gfm
│       ├── unnamed-chunk-2-1.png
│       ├── unnamed-chunk-3-1.png
│       ├── unnamed-chunk-3-2.png
│       ├── unnamed-chunk-4-1.png
│       ├── unnamed-chunk-5-1.png
│       └── unnamed-chunk-6-1.png
├── ex02_create-or-mutate-in-place.R
├── ex02_create-or-mutate-in-place.md
├── ex03_row-wise-iteration-are-you-sure.R
├── ex03_row-wise-iteration-are-you-sure.md
├── ex04_map-example.R
├── ex04_map-example.md
├── ex05_attack-via-rows-or-columns.R
├── ex05_attack-via-rows-or-columns.md
├── ex06_runif-via-pmap.R
├── ex06_runif-via-pmap.md
├── ex07_group-by-summarise.R
├── ex07_group-by-summarise.md
├── ex08_nesting-is-good.R
├── ex08_nesting-is-good.md
├── ex08_nesting-is-good_files
│   └── figure-gfm
│       ├── alpha-order-1.png
│       ├── principled-order-1.png
│       ├── principled-order-coef-ests-1.png
│       ├── principled-order-coef-ests-2.png
│       └── revert-to-alphabetical-1.png
├── ex09_row-summaries.R
├── ex09_row-summaries.md
├── iterate-over-rows.R
├── iterate-over-rows.md
├── iterate-over-rows_files
│   └── figure-gfm
│       ├── col-benchmark-1.png
│       └── row-benchmark-1.png
├── row-benchmark.csv
├── row-benchmark.png
├── row-oriented-workflows.Rproj
├── wch.Rmd
├── wch.md
└── wch_files
    └── figure-html
        └── plot-1.png
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | wch.html
5 | wch_cache
6 | iterate-over-rows.html
7 | iterate-over-rows_cache
8 | *.key
9 | *.pdf
10 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | CC BY-SA 4.0
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Row-oriented workflows in R with the tidyverse
2 |
3 | Materials for [RStudio webinar](https://resources.rstudio.com/webinars/thinking-inside-the-box-you-can-do-that-inside-a-data-frame-april-jenny-bryan) *recording available at this link!*:
4 |
5 | Thinking inside the box: you can do that inside a data frame?!
6 | Jenny Bryan
7 | Wednesday, April 11 at 1:00pm ET / 10:00am PT
8 | [rstd.io/row-work](https://rstd.io/row-work) *<-- shortlink to this repo*
9 | Slides available [on SpeakerDeck](https://speakerdeck.com/jennybc/row-oriented-workflows-in-r-with-the-tidyverse)
10 |
11 | ## Abstract
12 |
13 | The data frame is a crucial data structure in R and, especially, in the tidyverse. Working on a column or a variable is a very natural operation, which is great. But what about row-oriented work? That also comes up frequently and is more awkward. In this webinar I’ll work through concrete code examples, exploring patterns that arise in data analysis. We’ll discuss the general notion of "split-apply-combine", row-wise work in a data frame, splitting vs. nesting, and list-columns.
14 |
15 | ## Code examples
16 |
17 | Beginner --> intermediate --> advanced
18 | Not all are used in the webinar
19 |
20 | * **Leave your data in that big, beautiful data frame.** [`ex01_leave-it-in-the-data-frame`](ex01_leave-it-in-the-data-frame.md) Show the evil of creating copies of certain rows of certain variables, using Magic Numbers and cryptic names, just to save some typing.
21 | * **Adding or modifying variables.** [`ex02_create-or-mutate-in-place`](ex02_create-or-mutate-in-place.md) `df$var <- ...` versus `dplyr::mutate()`. Recycling/safety, `df`'s as data mask, aesthetics.
22 | * **Are you SURE you need to iterate over rows?** [`ex03_row-wise-iteration-are-you-sure`](ex03_row-wise-iteration-are-you-sure.md) Don't fixate on the most obvious generalization of your pilot example and risk overlooking a vectorized solution. Features a `paste()` example, then goes out with some glue glory.
23 | * **Working with non-vectorized functions.** [`ex04_map-example`](ex04_map-example.md) Small example using `purrr::map()` to apply `nrow()` to list of data frames.
24 | * **Row-wise thinking vs. column-wise thinking.** [`ex05_attack-via-rows-or-columns`](ex05_attack-via-rows-or-columns.md) Data rectangling example. Both are possible, but I find building a tibble column-by-column is less aggravating than building rows, then row binding.
25 | * **Iterate over rows of a data frame.** [`iterate-over-rows`](iterate-over-rows.md) Empirical study of reshaping a data frame into this form: a list with one component per row. Revisiting a study originally done by Winston Chang. Run times for different numbers of [rows](row-benchmark.png) or [columns](col-benchmark.png).
26 | * **Generate data from different distributions via `purrr::pmap()`.** [`ex06_runif-via-pmap`](ex06_runif-via-pmap.md) Use `purrr::pmap()` to generate U[min, max] data for various combinations of (n, min, max), stored as rows of a data frame.
27 | * **Are you SURE you need to iterate over groups?** [`ex07_group-by-summarise`](ex07_group-by-summarise.md) Use `dplyr::group_by()` and `dplyr::summarise()` to compute group-wise summaries, without explicitly splitting up the data frame and re-combining the results. Use `list()` to package multivariate summaries into something `summarise()` can handle, creating a list-column.
28 | * **Group-and-nest.** [`ex08_nesting-is-good`](ex08_nesting-is-good.md) How to explicitly work on groups of rows via nesting (our recommendation) vs splitting.
29 | * **Row-wise mean or sum.** [`ex09_row-summaries`](ex09_row-summaries.md) How to do `rowSums()`-y and `rowMeans()`-y work inside a data frame.
30 |
31 | ## More tips and links
32 |
33 | Big thanks to everyone who weighed in on the related [twitter thread](https://twitter.com/JennyBryan/status/980905136468910080). This was very helpful for planning content.
34 |
35 | 45 minutes is not enough! A few notes about more special functions and patterns for row-driven work. Maybe we need to do a follow up ...
36 |
37 | `tibble::enframe()` and `deframe()` are handy for getting into and out of the data frame state.
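A minimal sketch (the named vector `x` and the column names here are made up):

```r
library(tibble)

# A named vector becomes a two-column tibble...
x <- c(a = 1, b = 2, c = 3)
df <- enframe(x, name = "key", value = "val")

# ...and deframe() round-trips it back to a named vector
y <- deframe(df)
```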
38 |
39 | `map()` and `map2()` are useful for working with list-columns inside `mutate()`.
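A minimal sketch, with a made-up list-column:

```r
library(tidyverse)

df <- tibble(x = list(1:3, 4:5))

# map_int() visits each element of the list-column and returns one integer per row
out <- df %>%
  mutate(n = map_int(x, length))
```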
40 |
41 | `tibble::add_row()` is handy for adding a single row at an arbitrary position in a data frame.
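A minimal sketch (the data and insertion point are invented):

```r
library(tibble)

df <- tibble(x = 1:3)

# Insert one new row immediately after the first row
out <- add_row(df, x = 99L, .after = 1)
```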
42 |
43 | `imap()` is handy for iterating over something and its names or integer indices at the same time.
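A minimal sketch (the input vector is made up); inside the formula, `.x` is the element and `.y` is its name (or its index, if the input is unnamed):

```r
library(purrr)

res <- imap_chr(c(a = 1, b = 2), ~ paste0(.y, "=", .x))
```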
44 |
45 | `dplyr::case_when()` helps you get rid of hairy, nested `if () {...} else {...}` statements.
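A minimal sketch (the age cutoffs are invented):

```r
library(dplyr)

ages <- c(1, 12, 14, 30)

# Conditions are evaluated in order; the first TRUE one wins,
# and TRUE ~ ... acts as the catch-all "else"
stage <- case_when(
  ages < 2  ~ "toddler",
  ages < 13 ~ "kid",
  ages < 18 ~ "teen",
  TRUE      ~ "adult"
)
```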
46 |
47 | Great resource on the "why?" of functional programming approaches (such as `map()`):
48 |
--------------------------------------------------------------------------------
/col-benchmark.csv:
--------------------------------------------------------------------------------
1 | ncol,method,time
2 | 10,transpose,0
3 | 10,transpose,0
4 | 10,transpose,0
5 | 10,transpose,0
6 | 10,transpose,0
7 | 10,pmap,0
8 | 10,pmap,0
9 | 10,pmap,9.999999997489795e-4
10 | 10,pmap,0
11 | 10,pmap,0
12 | 10,split_lapply,0
13 | 10,split_lapply,0
14 | 10,split_lapply,0.0010000000002037268
15 | 10,split_lapply,0
16 | 10,split_lapply,9.999999997489795e-4
17 | 10,lapply_row,0
18 | 10,lapply_row,9.999999997489795e-4
19 | 10,lapply_row,0
20 | 10,lapply_row,0
21 | 10,lapply_row,0.0010000000002037268
22 | 10,for_loop,0.0010000000002037268
23 | 10,for_loop,0
24 | 10,for_loop,9.999999997489795e-4
25 | 10,for_loop,0
26 | 10,for_loop,0
27 | 100,transpose,0
28 | 100,transpose,0
29 | 100,transpose,0
30 | 100,transpose,0
31 | 100,transpose,0
32 | 100,pmap,0
33 | 100,pmap,9.999999997489795e-4
34 | 100,pmap,0.0010000000002037268
35 | 100,pmap,0
36 | 100,pmap,9.999999997489795e-4
37 | 100,split_lapply,0.0019999999999527063
38 | 100,split_lapply,0.0019999999999527063
39 | 100,split_lapply,0.0029999999997016857
40 | 100,split_lapply,0.0019999999999527063
41 | 100,split_lapply,0.003000000000156433
42 | 100,lapply_row,0.0019999999999527063
43 | 100,lapply_row,0.0019999999999527063
44 | 100,lapply_row,0.0020000000004074536
45 | 100,lapply_row,0.0019999999999527063
46 | 100,lapply_row,0.0019999999999527063
47 | 100,for_loop,0.0029999999997016857
48 | 100,for_loop,0.0020000000004074536
49 | 100,for_loop,0.0019999999999527063
50 | 100,for_loop,0.0019999999999527063
51 | 100,for_loop,0.0029999999997016857
52 | 1e3,transpose,0
53 | 1e3,transpose,0
54 | 1e3,transpose,0.0010000000002037268
55 | 1e3,transpose,0
56 | 1e3,transpose,0
57 | 1e3,pmap,0.0020000000004074536
58 | 1e3,pmap,0.0019999999999527063
59 | 1e3,pmap,0.0019999999999527063
60 | 1e3,pmap,0.0019999999999527063
61 | 1e3,pmap,0.0019999999999527063
62 | 1e3,split_lapply,0.022000000000389264
63 | 1e3,split_lapply,0.02599999999983993
64 | 1e3,split_lapply,0.023999999999887223
65 | 1e3,split_lapply,0.028999999999996362
66 | 1e3,split_lapply,0.02500000000009095
67 | 1e3,lapply_row,0.023000000000138243
68 | 1e3,lapply_row,0.021999999999934516
69 | 1e3,lapply_row,0.021000000000185537
70 | 1e3,lapply_row,0.027000000000043656
71 | 1e3,lapply_row,0.023000000000138243
72 | 1e3,for_loop,0.02099999999973079
73 | 1e3,for_loop,0.021000000000185537
74 | 1e3,for_loop,0.02099999999973079
75 | 1e3,for_loop,0.021000000000185537
76 | 1e3,for_loop,0.027000000000043656
77 | 1e4,transpose,0.0010000000002037268
78 | 1e4,transpose,9.999999997489795e-4
79 | 1e4,transpose,0.0010000000002037268
80 | 1e4,transpose,9.999999997489795e-4
81 | 1e4,transpose,0.0020000000004074536
82 | 1e4,pmap,0.02500000000009095
83 | 1e4,pmap,0.024999999999636202
84 | 1e4,pmap,0.027000000000043656
85 | 1e4,pmap,0.026000000000294676
86 | 1e4,pmap,0.03099999999994907
87 | 1e4,split_lapply,0.24499999999989086
88 | 1e4,split_lapply,0.23900000000003274
89 | 1e4,split_lapply,0.24899999999979627
90 | 1e4,split_lapply,0.2680000000000291
91 | 1e4,split_lapply,0.24499999999989086
92 | 1e4,lapply_row,0.22000000000025466
93 | 1e4,lapply_row,0.2369999999996253
94 | 1e4,lapply_row,0.23400000000037835
95 | 1e4,lapply_row,0.2339999999999236
96 | 1e4,lapply_row,0.22600000000011278
97 | 1e4,for_loop,0.24899999999979627
98 | 1e4,for_loop,0.23800000000028376
99 | 1e4,for_loop,0.2519999999999527
100 | 1e4,for_loop,0.26499999999987267
101 | 1e4,for_loop,0.25700000000006185
102 | 1e5,transpose,0.01499999999987267
103 | 1e5,transpose,0.016999999999825377
104 | 1e5,transpose,0.016000000000076398
105 | 1e5,transpose,0.016999999999825377
106 | 1e5,transpose,0.027000000000043656
107 | 1e5,pmap,0.5749999999998181
108 | 1e5,pmap,0.6639999999997599
109 | 1e5,pmap,0.6190000000001419
110 | 1e5,pmap,0.7470000000002983
111 | 1e5,pmap,0.6419999999998254
112 | 1e5,split_lapply,3.2729999999996835
113 | 1e5,split_lapply,3.624000000000251
114 | 1e5,split_lapply,3.9329999999999927
115 | 1e5,split_lapply,3.380000000000109
116 | 1e5,split_lapply,3.4890000000000327
117 | 1e5,lapply_row,3.199000000000069
118 | 1e5,lapply_row,3.630000000000109
119 | 1e5,lapply_row,3.9980000000000473
120 | 1e5,lapply_row,3.5589999999997417
121 | 1e5,lapply_row,3.6010000000001128
122 | 1e5,for_loop,3.212999999999738
123 | 1e5,for_loop,3.66800000000012
124 | 1e5,for_loop,4.114000000000033
125 | 1e5,for_loop,3.882000000000062
126 | 1e5,for_loop,3.5149999999998727
127 |
--------------------------------------------------------------------------------
/col-benchmark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/col-benchmark.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Leave your data in that big, beautiful data frame"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | #' ## Don't create odd little excerpts and copies of your data.
19 | #'
20 | #' Code style that results from (I speculate) minimizing the number of key
21 | #' presses.
22 |
23 | ## :(
24 | sl <- iris[51:100,1]
25 | pw <- iris[51:100,4]
26 | plot(sl ~ pw)
27 |
28 | #' This clutters the workspace with "loose parts", `sl` and `pw`. Very soon, you
29 | #' are likely to forget what they are, which `Species` of `iris` they represent,
30 | #' and what the relationship between them is.
31 |
32 | # ----
33 | #' ## Leave the data *in situ* and reveal intent in your code
34 | #'
35 | #' More verbose code conveys intent. Eliminating the Magic Numbers makes the
36 | #' code less likely to be, or become, wrong.
37 | #'
38 | #' Here's one way to do the same in a tidyverse style:
39 | library(tidyverse)
40 |
41 | ggplot(
42 | filter(iris, Species == "versicolor"),
43 | aes(x = Petal.Width, y = Sepal.Length)
44 | ) + geom_point()
45 |
46 | #' Another tidyverse approach, this time using the pipe operator, `%>%`
47 | iris %>%
48 | filter(Species == "versicolor") %>%
49 | ggplot(aes(x = Petal.Width, y = Sepal.Length)) + ## <--- NOTE the `+` sign!!
50 | geom_point()
51 |
52 | #' A base solution that still follows the principles of
53 | #'
54 | #' * leave the data in the data frame
55 | #' * convey intent
56 | plot(
57 | Sepal.Length ~ Petal.Width,
58 | data = subset(iris, subset = Species == "versicolor")
59 | )
60 |
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame.md:
--------------------------------------------------------------------------------
1 | Leave your data in that big, beautiful data frame
2 | ================
3 | Jenny Bryan
4 | 2018-04-02
5 |
6 | ## Don’t create odd little excerpts and copies of your data.
7 |
8 | Code style that results from (I speculate) minimizing the number of key
9 | presses.
10 |
11 | ``` r
12 | ## :(
13 | sl <- iris[51:100,1]
14 | pw <- iris[51:100,4]
15 | plot(sl ~ pw)
16 | ```
17 |
18 | 
19 |
20 | This clutters the workspace with “loose parts”, `sl` and `pw`. Very
21 | soon, you are likely to forget what they are, which `Species` of `iris`
22 | they represent, and what the relationship between them is.
23 |
24 | ## Leave the data *in situ* and reveal intent in your code
25 |
26 | More verbose code conveys intent. Eliminating the Magic Numbers makes
27 | the code less likely to be, or become, wrong.
28 |
29 | Here’s one way to do the same in a tidyverse style:
30 |
31 | ``` r
32 | library(tidyverse)
33 |
34 | ggplot(
35 | filter(iris, Species == "versicolor"),
36 | aes(x = Petal.Width, y = Sepal.Length)
37 | ) + geom_point()
38 | ```
39 |
40 | 
41 |
42 | Another tidyverse approach, this time using the pipe operator, `%>%`
43 |
44 | ``` r
45 | iris %>%
46 | filter(Species == "versicolor") %>%
47 | ggplot(aes(x = Petal.Width, y = Sepal.Length)) + ## <--- NOTE the `+` sign!!
48 | geom_point()
49 | ```
50 |
51 | 
52 |
53 | A base solution that still follows the principles of
54 |
55 | - leave the data in the data frame
56 | - convey intent
57 |
58 |
59 |
60 | ``` r
61 | plot(
62 | Sepal.Length ~ Petal.Width,
63 | data = subset(iris, subset = Species == "versicolor")
64 | )
65 | ```
66 |
67 | 
68 |
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-2-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-2-1.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-1.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-2.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-4-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-4-1.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-5-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-5-1.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-6-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-6-1.png
--------------------------------------------------------------------------------
/ex02_create-or-mutate-in-place.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Add or modify a variable"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | library(tidyverse)
19 |
20 | # ----
21 | #' ### Function to produce a fresh example data frame
22 | new_df <- function() {
23 | tribble(
24 | ~ name, ~ age,
25 | "Reed", 14L,
26 | "Wesley", 12L,
27 | "Eli", 12L,
28 | "Toby", 1L
29 | )
30 | }
31 |
32 | # ----
33 | #' ## The `df$var <- ...` syntax
34 |
35 | #' How to create or modify a variable is a fairly low stakes matter, i.e. really
36 | #' a matter of taste. This is not a hill I plan to die on. But here's my two
37 | #' cents.
38 | #'
39 | #' Of course, `df$var <- ...` absolutely works for creating new variables or
40 | #' modifying existing ones. But there are downsides:
41 | #'
42 | #' * Silent recycling is a risk.
43 | #' * `df` is not special. It's not the implied place to look first for things,
44 | #' so you must be explicit. This can be a drag.
45 | #' * I have aesthetic concerns. YMMV.
46 | df <- new_df()
47 | df$eyes <- 2L
48 | df$snack <- c("chips", "cheese")
49 | df$uname <- toupper(df$name)
50 | df
51 |
52 | # ----
53 | #' ## `dplyr::mutate()` works "inside the box"
54 |
55 | #' `dplyr::mutate()` is the tidyverse way to work on a variable. If I'm working
56 | #' in a script-y style and the tidyverse packages are already available, I
57 | #' generally prefer this method of adding or modifying a variable.
58 | #'
59 | #' * Only a length one input can be recycled.
60 | #' * `df` is the first place to look for things. It turns out that making a
61 | #' new variable out of existing variables is very, very common, so it's nice
62 | #' when this is easy.
63 | #' * This is pipe-friendly, so I can easily combine with a few other logical
64 | #' data manipulations that need to happen around the same point.
65 | #' * I like the way this looks. YMMV.
66 |
67 | new_df() %>%
68 | mutate(
69 | eyes = 2L,
70 | snack = c("chips", "cheese"),
71 | uname = toupper(name)
72 | )
73 |
74 | #' Oops! I did not provide enough snacks!
75 |
76 | new_df() %>%
77 | mutate(
78 | eyes = 2L,
79 | snack = c("chips", "cheese", "mixed nuts", "nerf bullets"),
80 | uname = toupper(name)
81 | )
82 |
--------------------------------------------------------------------------------
/ex02_create-or-mutate-in-place.md:
--------------------------------------------------------------------------------
1 | Add or modify a variable
2 | ================
3 | Jenny Bryan
4 | 2018-04-10
5 |
6 | ``` r
7 | library(tidyverse)
8 | ```
9 |
10 | ### Function to produce a fresh example data frame
11 |
12 | ``` r
13 | new_df <- function() {
14 | tribble(
15 | ~ name, ~ age,
16 | "Reed", 14L,
17 | "Wesley", 12L,
18 | "Eli", 12L,
19 | "Toby", 1L
20 | )
21 | }
22 | ```
23 |
24 | ## The `df$var <- ...` syntax
25 |
26 | How to create or modify a variable is a fairly low stakes matter,
27 | i.e. really a matter of taste. This is not a hill I plan to die on. But
28 | here’s my two cents.
29 |
30 | Of course, `df$var <- ...` absolutely works for creating new variables
31 | or modifying existing ones. But there are downsides:
32 |
33 | - Silent recycling is a risk.
34 | - `df` is not special. It’s not the implied place to look first for
35 | things, so you must be explicit. This can be a drag.
36 | - I have aesthetic concerns. YMMV.
37 |
38 |
39 |
40 | ``` r
41 | df <- new_df()
42 | df$eyes <- 2L
43 | df$snack <- c("chips", "cheese")
44 | df$uname <- toupper(df$name)
45 | df
46 | #> # A tibble: 4 x 5
47 | #> name age eyes snack uname
48 | #> <chr> <int> <int> <chr> <chr>
49 | #> 1 Reed 14 2 chips REED
50 | #> 2 Wesley 12 2 cheese WESLEY
51 | #> 3 Eli 12 2 chips ELI
52 | #> 4 Toby 1 2 cheese TOBY
53 | ```
54 |
55 | ## `dplyr::mutate()` works “inside the box”
56 |
57 | `dplyr::mutate()` is the tidyverse way to work on a variable. If I’m
58 | working in a script-y style and the tidyverse packages are already
59 | available, I generally prefer this method of adding or modifying a
60 | variable.
61 |
62 | - Only a length one input can be recycled.
63 | - `df` is the first place to look for things. It turns out that making
64 | a new variable out of existing variables is very, very common, so
65 | it’s nice when this is easy.
66 | - This is pipe-friendly, so I can easily combine with a few other
67 | logical data manipulations that need to happen around the same
68 | point.
69 | - I like the way this looks. YMMV.
70 |
71 |
72 |
73 | ``` r
74 | new_df() %>%
75 | mutate(
76 | eyes = 2L,
77 | snack = c("chips", "cheese"),
78 | uname = toupper(name)
79 | )
80 | #> Error in mutate_impl(.data, dots): Column `snack` must be length 4 (the number of rows) or one, not 2
81 | ```
82 |
83 | Oops\! I did not provide enough snacks\!
84 |
85 | ``` r
86 | new_df() %>%
87 | mutate(
88 | eyes = 2L,
89 | snack = c("chips", "cheese", "mixed nuts", "nerf bullets"),
90 | uname = toupper(name)
91 | )
92 | #> # A tibble: 4 x 5
93 | #> name age eyes snack uname
94 | #> <chr> <int> <int> <chr> <chr>
95 | #> 1 Reed 14 2 chips REED
96 | #> 2 Wesley 12 2 cheese WESLEY
97 | #> 3 Eli 12 2 mixed nuts ELI
98 | #> 4 Toby 1 2 nerf bullets TOBY
99 | ```
100 |
--------------------------------------------------------------------------------
/ex03_row-wise-iteration-are-you-sure.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Are you absolutely sure that you, personally, need to iterate over rows?"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | library(tidyverse)
19 |
20 | # ----
21 | #' ## Function to give my example data frame
22 | new_df <- function() {
23 | tribble(
24 | ~ name, ~ age,
25 | "Reed", 14,
26 | "Wesley", 12,
27 | "Eli", 12,
28 | "Toby", 1
29 | )
30 | }
31 |
32 | # ----
33 | #' ## Single-row example can cause tunnel vision
34 | #'
35 | #' Sometimes it's easy to fixate on one (unfavorable) way of accomplishing
36 | #' something, because it feels like a natural extension of a successful
37 | #' small-scale experiment.
38 | #'
39 | #' Let's create a string from row 1 of the data frame.
40 | df <- new_df()
41 | paste(df$name[1], "is", df$age[1], "years old")
42 |
43 | #' I want to scale up, therefore I obviously must ... loop over all rows!
44 | n <- nrow(df)
45 | s <- vector(mode = "character", length = n)
46 | for (i in seq_len(n)) {
47 | s[i] <- paste(df$name[i], "is", df$age[i], "years old")
48 | }
49 | s
50 |
51 | #' HOLD ON. What if I told you `paste()` is already vectorized over its
52 | #' arguments?
53 | paste(df$name, "is", df$age, "years old")
54 |
55 | #' A surprising number of "iterate over rows" problems can be eliminated by
56 | #' exploiting functions that are already vectorized and by making your own
57 | #' functions vectorized over the primary argument.
58 | #'
59 | #' Writing an explicit loop in your code is not necessarily bad, but it should
60 | #' always give you pause. Has someone already written this loop for you? Ideally
61 | #' in C or C++ and inside a package that's being regularly checked, with high
62 | #' test coverage. That is usually the better choice.
63 |
64 | # ----
65 | #' ## Don't forget to work "inside the box"
66 | #'
67 |
68 | #' For this string interpolation task, we can even work with a vectorized
69 | #' function that is happy to do lookup inside a data frame. The [glue
70 | #' package](https://glue.tidyverse.org) is doing the work under the hood here,
71 | #' but its Greatest Functions are now re-exported by stringr, which we already
72 | #' attached via `library(tidyverse)`.
73 |
74 | str_glue_data(df, "{name} is {age} years old")
75 |
76 | #' You can use the simpler form, `str_glue()`, inside `dplyr::mutate()`, because
77 | #' the other variables in `df` are automatically available for use.
78 |
79 | df %>%
80 | mutate(sentence = str_glue("{name} is {age} years old"))
81 |
82 | #' The tidyverse style is to manage data holistically in a data frame and
83 | #' provide a user interface that encourages self-explaining code with low
84 | #' "syntactical noise".
85 |
--------------------------------------------------------------------------------
/ex03_row-wise-iteration-are-you-sure.md:
--------------------------------------------------------------------------------
1 | Are you absolutely sure that you, personally, need to iterate over rows?
2 | ================
3 | Jenny Bryan
4 | 2018-04-02
5 |
6 | ``` r
7 | library(tidyverse)
8 | ```
9 |
10 | ## Function to give my example data frame
11 |
12 | ``` r
13 | new_df <- function() {
14 | tribble(
15 | ~ name, ~ age,
16 | "Reed", 14,
17 | "Wesley", 12,
18 | "Eli", 12,
19 | "Toby", 1
20 | )
21 | }
22 | ```
23 |
24 | ## Single-row example can cause tunnel vision
25 |
26 | Sometimes it’s easy to fixate on one (unfavorable) way of accomplishing
27 | something, because it feels like a natural extension of a successful
28 | small-scale experiment.
29 |
30 | Let’s create a string from row 1 of the data frame.
31 |
32 | ``` r
33 | df <- new_df()
34 | paste(df$name[1], "is", df$age[1], "years old")
35 | #> [1] "Reed is 14 years old"
36 | ```
37 |
38 | I want to scale up, therefore I obviously must … loop over all rows\!
39 |
40 | ``` r
41 | n <- nrow(df)
42 | s <- vector(mode = "character", length = n)
43 | for (i in seq_len(n)) {
44 | s[i] <- paste(df$name[i], "is", df$age[i], "years old")
45 | }
46 | s
47 | #> [1] "Reed is 14 years old" "Wesley is 12 years old"
48 | #> [3] "Eli is 12 years old" "Toby is 1 years old"
49 | ```
50 |
51 | HOLD ON. What if I told you `paste()` is already vectorized over its
52 | arguments?
53 |
54 | ``` r
55 | paste(df$name, "is", df$age, "years old")
56 | #> [1] "Reed is 14 years old" "Wesley is 12 years old"
57 | #> [3] "Eli is 12 years old" "Toby is 1 years old"
58 | ```
59 |
60 | A surprising number of “iterate over rows” problems can be eliminated by
61 | exploiting functions that are already vectorized and by making your own
62 | functions vectorized over the primary argument.
63 |
64 | Writing an explicit loop in your code is not necessarily bad, but it
65 | should always give you pause. Has someone already written this loop for
66 | you? Ideally in C or C++ and inside a package that’s being regularly
67 | checked, with high test coverage. That is usually the better choice.
68 |
69 | ## Don’t forget to work “inside the box”
70 |
71 | For this string interpolation task, we can even work with a vectorized
72 | function that is happy to do lookup inside a data frame. The [glue
73 | package](https://glue.tidyverse.org) is doing the work under the hood
74 | here, but its Greatest Functions are now re-exported by stringr, which
75 | we already attached via `library(tidyverse)`.
76 |
77 | ``` r
78 | str_glue_data(df, "{name} is {age} years old")
79 | #> Reed is 14 years old
80 | #> Wesley is 12 years old
81 | #> Eli is 12 years old
82 | #> Toby is 1 years old
83 | ```
84 |
85 | You can use the simpler form, `str_glue()`, inside `dplyr::mutate()`,
86 | because the other variables in `df` are automatically available for use.
87 |
88 | ``` r
89 | df %>%
90 | mutate(sentence = str_glue("{name} is {age} years old"))
91 | #> # A tibble: 4 x 3
92 | #> name age sentence
93 | #> <chr> <dbl> <S3: glue>
94 | #> 1 Reed 14. Reed is 14 years old
95 | #> 2 Wesley 12. Wesley is 12 years old
96 | #> 3 Eli 12. Eli is 12 years old
97 | #> 4 Toby 1. Toby is 1 years old
98 | ```
99 |
100 | The tidyverse style is to manage data holistically in a data frame and
101 | provide a user interface that encourages self-explaining code with low
102 | “syntactical noise”.
103 |
--------------------------------------------------------------------------------
/ex04_map-example.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Small demo of purrr::map()"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | #' ## `purrr::map()` can be used to work with functions that aren't vectorized.
19 |
20 | df_list <- list(
21 | iris = head(iris, 2),
22 | mtcars = head(mtcars, 3)
23 | )
24 | df_list
25 |
26 | #' This does not work. `nrow()` expects a single data frame as input.
27 | nrow(df_list)
28 |
29 | #' `purrr::map()` applies `nrow()` to each element of `df_list`.
30 | library(purrr)
31 |
32 | map(df_list, nrow)
33 |
34 | #' Different calling styles make sense in more complicated situations. Hard to
35 | #' justify in this simple example.
36 | map(df_list, ~ nrow(.x))
37 |
38 | df_list %>%
39 | map(nrow)
40 |
41 | #' If you know what the return type is (or *should* be), use a type-specific
42 | #' variant of `map()`.
43 |
44 | map_int(df_list, ~ nrow(.x))
45 |
46 | #' More on coverage of `map()` and friends: .
47 |
--------------------------------------------------------------------------------
/ex04_map-example.md:
--------------------------------------------------------------------------------
1 | Small demo of purrr::map()
2 | ================
3 | Jenny Bryan
4 | 2018-04-10
5 |
6 | ## `purrr::map()` can be used to work with functions that aren’t vectorized.
7 |
8 | ``` r
9 | df_list <- list(
10 | iris = head(iris, 2),
11 | mtcars = head(mtcars, 3)
12 | )
13 | df_list
14 | #> $iris
15 | #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
16 | #> 1 5.1 3.5 1.4 0.2 setosa
17 | #> 2 4.9 3.0 1.4 0.2 setosa
18 | #>
19 | #> $mtcars
20 | #> mpg cyl disp hp drat wt qsec vs am gear carb
21 | #> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
22 | #> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
23 | #> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
24 | ```
25 |
26 | This does not work. `nrow()` expects a single data frame as input.
27 |
28 | ``` r
29 | nrow(df_list)
30 | #> NULL
31 | ```
32 |
33 | `purrr::map()` applies `nrow()` to each element of `df_list`.
34 |
35 | ``` r
36 | library(purrr)
37 |
38 | map(df_list, nrow)
39 | #> $iris
40 | #> [1] 2
41 | #>
42 | #> $mtcars
43 | #> [1] 3
44 | ```
45 |
46 | Different calling styles make sense in more complicated situations. Hard
47 | to justify in this simple example.
48 |
49 | ``` r
50 | map(df_list, ~ nrow(.x))
51 | #> $iris
52 | #> [1] 2
53 | #>
54 | #> $mtcars
55 | #> [1] 3
56 |
57 | df_list %>%
58 | map(nrow)
59 | #> $iris
60 | #> [1] 2
61 | #>
62 | #> $mtcars
63 | #> [1] 3
64 | ```
65 |
66 | If you know what the return type is (or *should* be), use a
67 | type-specific variant of `map()`.
68 |
69 | ``` r
70 | map_int(df_list, ~ nrow(.x))
71 | #> iris mtcars
72 | #> 2 3
73 | ```
74 |
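The other typed variants follow the same pattern; a quick sketch (reusing the
`df_list` defined above):

``` r
map_chr(df_list, ~ class(.x)[1])
map_lgl(df_list, ~ nrow(.x) > 2)
```

`map_chr()` insists on a character result and `map_lgl()` on a logical one,
failing loudly if an element returns anything else.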
75 | More coverage of `map()` and friends is available in the purrr
76 | documentation.
77 |
--------------------------------------------------------------------------------
/ex05_attack-via-rows-or-columns.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Attack via rows or columns?"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #' **WARNING: half-baked**
17 |
18 | #+ body
19 | # ----
20 | library(tidyverse)
21 |
22 | # ----
23 | #' ## If you must sweat, compare row-wise work vs. column-wise work
24 | #'
25 | #' The approach that works in a first, small example is not always the one that
26 | #' scales up best.
27 |
28 | x <- list(
29 | list(name = "sue", number = 1, veg = c("onion", "carrot")),
30 | list(name = "doug", number = 2, veg = c("potato", "beet"))
31 | )
32 |
33 | # row binding
34 |
35 | # frustrating base attempts
36 | rbind(x)
37 | do.call(rbind, x)
38 | do.call(rbind, x) %>% str()
39 |
40 | # tidyverse fail
41 | bind_rows(x)
42 | map_dfr(x, ~ .x)
43 |
44 | map_dfr(x, ~ .x[c("name", "number")])
45 |
46 | tibble(
47 | name = map_chr(x, "name"),
48 | number = map_dbl(x, "number"),
49 | veg = map(x, "veg")
50 | )
51 |
--------------------------------------------------------------------------------
/ex05_attack-via-rows-or-columns.md:
--------------------------------------------------------------------------------
1 | Attack via rows or columns?
2 | ================
3 | Jenny Bryan
4 | 2018-04-02
5 |
6 | **WARNING: half-baked**
7 |
8 | ``` r
9 | library(tidyverse)
10 | ```
11 |
12 | ## If you must sweat, compare row-wise work vs. column-wise work
13 |
14 | The approach that works in a first, small example is not always the one
15 | that scales up best.
16 |
17 | ``` r
18 | x <- list(
19 | list(name = "sue", number = 1, veg = c("onion", "carrot")),
20 | list(name = "doug", number = 2, veg = c("potato", "beet"))
21 | )
22 |
23 | # row binding
24 |
25 | # frustrating base attempts
26 | rbind(x)
27 | #> [,1] [,2]
28 | #> x List,3 List,3
29 | do.call(rbind, x)
30 | #> name number veg
31 | #> [1,] "sue" 1 Character,2
32 | #> [2,] "doug" 2 Character,2
33 | do.call(rbind, x) %>% str()
34 | #> List of 6
35 | #> $ : chr "sue"
36 | #> $ : chr "doug"
37 | #> $ : num 1
38 | #> $ : num 2
39 | #> $ : chr [1:2] "onion" "carrot"
40 | #> $ : chr [1:2] "potato" "beet"
41 | #> - attr(*, "dim")= int [1:2] 2 3
42 | #> - attr(*, "dimnames")=List of 2
43 | #> ..$ : NULL
44 | #> ..$ : chr [1:3] "name" "number" "veg"
45 |
46 | # tidyverse fail
47 | bind_rows(x)
48 | #> Error in bind_rows_(x, .id): Argument 3 must be length 1, not 2
49 | map_dfr(x, ~ .x)
50 | #> Error in bind_rows_(x, .id): Argument 3 must be length 1, not 2
51 |
52 | map_dfr(x, ~ .x[c("name", "number")])
53 | #> # A tibble: 2 x 2
54 | #> name number
55 | #>   <chr>  <dbl>
56 | #> 1 sue 1.
57 | #> 2 doug 2.
58 |
59 | tibble(
60 | name = map_chr(x, "name"),
61 | number = map_dbl(x, "number"),
62 | veg = map(x, "veg")
63 | )
64 | #> # A tibble: 2 x 3
65 | #> name number veg
66 | #>   <chr>  <dbl> <list>
67 | #> 1 sue       1. <chr [2]>
68 | #> 2 doug      2. <chr [2]>
69 | ```
70 |
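A row-wise sketch that also succeeds: wrap `veg` in `list()` so each element
of `x` becomes a one-row tibble with a list-column, which `map_dfr()` can then
row bind.

``` r
map_dfr(x, ~ tibble(name = .x$name, number = .x$number, veg = list(.x$veg)))
```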
--------------------------------------------------------------------------------
/ex06_runif-via-pmap.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Generate data from different distributions via pmap()"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | #' ## Uniform[min, max] via `runif()`
19 | #'
20 | #' CONSIDER:
21 | #' ```
22 | #' runif(n, min = 0, max = 1)
23 | #' ```
24 | #'
25 | #' Want to do this for several triples of (n, min, max).
26 | #'
27 | #' Store each triple as a row in a data frame.
28 | #'
29 | #' Now iterate over the rows.
30 |
31 | library(tidyverse)
32 |
33 | #' Notice how df's variable names are the same as runif's argument names. Do
34 | #' this when you can!
35 | df <- tribble(
36 | ~ n, ~ min, ~ max,
37 | 1L, 0, 1,
38 | 2L, 10, 100,
39 | 3L, 100, 1000
40 | )
41 | df
42 |
43 | #' Set the seed so the "random" numbers are reproducible.
44 | #'
45 | #' Practice on single rows.
46 | set.seed(123)
47 | (x <- df[1, ])
48 | runif(n = x$n, min = x$min, max = x$max)
49 |
50 | x <- df[2, ]
51 | runif(n = x$n, min = x$min, max = x$max)
52 |
53 | x <- df[3, ]
54 | runif(n = x$n, min = x$min, max = x$max)
55 |
56 | #' Think out loud in pseudo-code.
57 |
58 | ## x <- df[i, ]
59 | ## runif(n = x$n, min = x$min, max = x$max)
60 |
61 | ## runif(n = df$n[i], min = df$min[i], max = df$max[i])
62 | ## runif with all args from the i-th row of df
63 |
64 | #' Just. Do. It. with `pmap()`.
65 | set.seed(123)
66 | pmap(df, runif)
67 |
68 | #' ## Finessing variable and argument names
69 | #'
70 | #' Q: What if you can't arrange it so that variable names and arg names are
71 | #' the same?
72 | foofy <- tibble(
73 | alpha = 1:3, ## was: n
74 | beta = c(0, 10, 100), ## was: min
75 | gamma = c(1, 100, 1000) ## was: max
76 | )
77 | foofy
78 |
79 | #' A: Rename the variables on-the-fly, on the way in.
80 | set.seed(123)
81 | foofy %>%
82 | rename(n = alpha, min = beta, max = gamma) %>%
83 | pmap(runif)
84 |
85 | #' A: Write a wrapper around `runif()` to say how df vars <--> runif args.
86 |
87 | ## wrapper option #1:
88 | ## ARGNAME = l$VARNAME
89 | my_runif <- function(...) {
90 | l <- list(...)
91 | runif(n = l$alpha, min = l$beta, max = l$gamma)
92 | }
93 | set.seed(123)
94 | pmap(foofy, my_runif)
95 |
96 | ## wrapper option #2:
97 | my_runif <- function(alpha, beta, gamma, ...) {
98 | runif(n = alpha, min = beta, max = gamma)
99 | }
100 | set.seed(123)
101 | pmap(foofy, my_runif)
102 |
103 | #' You can use `..i` to refer to input by position.
104 | set.seed(123)
105 | pmap(foofy, ~ runif(n = ..1, min = ..2, max = ..3))
106 | #' Use this with *extreme caution*. Easy to shoot yourself in the foot.
107 | #'
108 | #' ## Extra variables in the data frame
109 | #'
110 | #' What if data frame includes variables that should not be passed to `.f()`?
111 | df_oops <- tibble(
112 | n = 1:3,
113 | min = c(0, 10, 100),
114 | max = c(1, 100, 1000),
115 | oops = c("please", "ignore", "me")
116 | )
117 | df_oops
118 |
119 | #' This will not work!
120 | set.seed(123)
121 | pmap(df_oops, runif)
122 |
123 | #' A: use `dplyr::select()` to limit the variables passed to `pmap()`.
124 | set.seed(123)
125 | df_oops %>%
126 | select(n, min, max) %>% ## if it's easier to say what to keep
127 | pmap(runif)
128 |
129 | set.seed(123)
130 | df_oops %>%
131 | select(-oops) %>% ## if it's easier to say what to omit
132 | pmap(runif)
133 |
134 | #' A: Use a custom wrapper and absorb extra variables with `...`.
135 | my_runif <- function(n, min, max, ...) runif(n, min, max)
136 |
137 | set.seed(123)
138 | pmap(df_oops, my_runif)
139 |
140 | #' ## Add the generated data to the data frame as a list-column
141 | set.seed(123)
142 | (df_aug <- df %>%
143 | mutate(data = pmap(., runif)))
144 | #View(df_aug)
145 |
146 | #' What about computing within a data frame, in the presence of the
147 | #' complications discussed above? Use `list()` in the place of the `.`
148 | #' placeholder above to select the target variables and, if necessary, map
149 | #' variable names to argument names. *Thanks @hadley for [sharing this
150 | #' trick](https://community.rstudio.com/t/dplyr-alternatives-to-rowwise/8071/29).*
151 | #'
152 | #' How to address variable names != argument names:
153 | foofy <- tibble(
154 | alpha = 1:3, ## was: n
155 | beta = c(0, 10, 100), ## was: min
156 | gamma = c(1, 100, 1000) ## was: max
157 | )
158 |
159 | set.seed(123)
160 | foofy %>%
161 | mutate(data = pmap(list(n = alpha, min = beta, max = gamma), runif))
162 |
163 | #' How to address the presence of 'extra variables', with either an inclusion
164 | #' or exclusion mentality:
165 | df_oops <- tibble(
166 | n = 1:3,
167 | min = c(0, 10, 100),
168 | max = c(1, 100, 1000),
169 | oops = c("please", "ignore", "me")
170 | )
171 |
172 | set.seed(123)
173 | df_oops %>%
174 | mutate(data = pmap(list(n, min, max), runif))
175 |
176 | df_oops %>%
177 | mutate(data = pmap(select(., -oops), runif))
178 |
179 | #' ## Review
180 | #'
181 | #' What have we done?
182 | #'
183 | #' * Arranged inputs as rows in a data frame
184 | #' * Used `pmap()` to implement a loop over the rows.
185 | #' * Used dplyr verbs `rename()` and `select()` to manipulate data on the way
186 | #' into `pmap()`.
187 | #' * Wrote custom wrappers around `runif()` to deal with:
188 | #' - df var names != `.f()` arg names
189 | #' - df vars that aren't formal args of `.f()`
190 | #' * Demonstrated all of the above when working inside a data frame and adding
191 | #' generated data as a list-column
192 |
--------------------------------------------------------------------------------
/ex06_runif-via-pmap.md:
--------------------------------------------------------------------------------
1 | Generate data from different distributions via pmap()
2 | ================
3 | Jenny Bryan
4 | 2018-05-08
5 |
6 | ## Uniform\[min, max\] via `runif()`
7 |
8 | CONSIDER:
9 |
10 | runif(n, min = 0, max = 1)
11 |
12 | Want to do this for several triples of (n, min, max).
13 |
14 | Store each triple as a row in a data frame.
15 |
16 | Now iterate over the rows.
17 |
18 | ``` r
19 | library(tidyverse)
20 | ```
21 |
22 | Notice how df’s variable names are the same as runif’s argument names. Do
23 | this when you can\!
24 |
25 | ``` r
26 | df <- tribble(
27 | ~ n, ~ min, ~ max,
28 | 1L, 0, 1,
29 | 2L, 10, 100,
30 | 3L, 100, 1000
31 | )
32 | df
33 | #> # A tibble: 3 x 3
34 | #> n min max
35 | #>   <int> <dbl> <dbl>
36 | #> 1 1 0 1
37 | #> 2 2 10 100
38 | #> 3 3 100 1000
39 | ```
40 |
41 | Set the seed so the “random” numbers are reproducible.
42 |
43 | Practice on single rows.
44 |
45 | ``` r
46 | set.seed(123)
47 | (x <- df[1, ])
48 | #> # A tibble: 1 x 3
49 | #> n min max
50 | #>   <int> <dbl> <dbl>
51 | #> 1 1 0 1
52 | runif(n = x$n, min = x$min, max = x$max)
53 | #> [1] 0.2875775
54 |
55 | x <- df[2, ]
56 | runif(n = x$n, min = x$min, max = x$max)
57 | #> [1] 80.94746 46.80792
58 |
59 | x <- df[3, ]
60 | runif(n = x$n, min = x$min, max = x$max)
61 | #> [1] 894.7157 946.4206 141.0008
62 | ```
63 |
64 | Think out loud in pseudo-code.
65 |
66 | ``` r
67 | ## x <- df[i, ]
68 | ## runif(n = x$n, min = x$min, max = x$max)
69 |
70 | ## runif(n = df$n[i], min = df$min[i], max = df$max[i])
71 | ## runif with all args from the i-th row of df
72 | ```
73 |
74 | Just. Do. It. with `pmap()`.
75 |
76 | ``` r
77 | set.seed(123)
78 | pmap(df, runif)
79 | #> [[1]]
80 | #> [1] 0.2875775
81 | #>
82 | #> [[2]]
83 | #> [1] 80.94746 46.80792
84 | #>
85 | #> [[3]]
86 | #> [1] 894.7157 946.4206 141.0008
87 | ```
88 |
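For intuition, here is a sketch of the explicit loop that `pmap(df, runif)`
replaces; with the same seed it produces the same list.

``` r
set.seed(123)
out <- vector(mode = "list", length = nrow(df))
for (i in seq_len(nrow(df))) {
  out[[i]] <- runif(n = df$n[i], min = df$min[i], max = df$max[i])
}
out
```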
89 | ## Finessing variable and argument names
90 |
91 | Q: What if you can’t arrange it so that variable names and arg names are
92 | the same?
93 |
94 | ``` r
95 | foofy <- tibble(
96 | alpha = 1:3, ## was: n
97 | beta = c(0, 10, 100), ## was: min
98 | gamma = c(1, 100, 1000) ## was: max
99 | )
100 | foofy
101 | #> # A tibble: 3 x 3
102 | #> alpha beta gamma
103 | #>   <int> <dbl> <dbl>
104 | #> 1 1 0 1
105 | #> 2 2 10 100
106 | #> 3 3 100 1000
107 | ```
108 |
109 | A: Rename the variables on-the-fly, on the way in.
110 |
111 | ``` r
112 | set.seed(123)
113 | foofy %>%
114 | rename(n = alpha, min = beta, max = gamma) %>%
115 | pmap(runif)
116 | #> [[1]]
117 | #> [1] 0.2875775
118 | #>
119 | #> [[2]]
120 | #> [1] 80.94746 46.80792
121 | #>
122 | #> [[3]]
123 | #> [1] 894.7157 946.4206 141.0008
124 | ```
125 |
126 | A: Write a wrapper around `runif()` to say how df vars `<-->` runif args.
127 |
128 | ``` r
129 | ## wrapper option #1:
130 | ## ARGNAME = l$VARNAME
131 | my_runif <- function(...) {
132 | l <- list(...)
133 | runif(n = l$alpha, min = l$beta, max = l$gamma)
134 | }
135 | set.seed(123)
136 | pmap(foofy, my_runif)
137 | #> [[1]]
138 | #> [1] 0.2875775
139 | #>
140 | #> [[2]]
141 | #> [1] 80.94746 46.80792
142 | #>
143 | #> [[3]]
144 | #> [1] 894.7157 946.4206 141.0008
145 |
146 | ## wrapper option #2:
147 | my_runif <- function(alpha, beta, gamma, ...) {
148 | runif(n = alpha, min = beta, max = gamma)
149 | }
150 | set.seed(123)
151 | pmap(foofy, my_runif)
152 | #> [[1]]
153 | #> [1] 0.2875775
154 | #>
155 | #> [[2]]
156 | #> [1] 80.94746 46.80792
157 | #>
158 | #> [[3]]
159 | #> [1] 894.7157 946.4206 141.0008
160 | ```
161 |
162 | You can use `..i` to refer to input by position.
163 |
164 | ``` r
165 | set.seed(123)
166 | pmap(foofy, ~ runif(n = ..1, min = ..2, max = ..3))
167 | #> [[1]]
168 | #> [1] 0.2875775
169 | #>
170 | #> [[2]]
171 | #> [1] 80.94746 46.80792
172 | #>
173 | #> [[3]]
174 | #> [1] 894.7157 946.4206 141.0008
175 | ```
176 |
177 | Use this with *extreme caution*. Easy to shoot yourself in the foot.
178 |
179 | ## Extra variables in the data frame
180 |
181 | What if data frame includes variables that should not be passed to
182 | `.f()`?
183 |
184 | ``` r
185 | df_oops <- tibble(
186 | n = 1:3,
187 | min = c(0, 10, 100),
188 | max = c(1, 100, 1000),
189 | oops = c("please", "ignore", "me")
190 | )
191 | df_oops
192 | #> # A tibble: 3 x 4
193 | #> n min max oops
194 | #>   <int> <dbl> <dbl> <chr>
195 | #> 1 1 0 1 please
196 | #> 2 2 10 100 ignore
197 | #> 3 3 100 1000 me
198 | ```
199 |
200 | This will not work\!
201 |
202 | ``` r
203 | set.seed(123)
204 | pmap(df_oops, runif)
205 | #> Error in .f(n = .l[[c(1L, i)]], min = .l[[c(2L, i)]], max = .l[[c(3L, : unused argument (oops = .l[[c(4, i)]])
206 | ```
207 |
208 | A: use `dplyr::select()` to limit the variables passed to `pmap()`.
209 |
210 | ``` r
211 | set.seed(123)
212 | df_oops %>%
213 | select(n, min, max) %>% ## if it's easier to say what to keep
214 | pmap(runif)
215 | #> [[1]]
216 | #> [1] 0.2875775
217 | #>
218 | #> [[2]]
219 | #> [1] 80.94746 46.80792
220 | #>
221 | #> [[3]]
222 | #> [1] 894.7157 946.4206 141.0008
223 |
224 | set.seed(123)
225 | df_oops %>%
226 | select(-oops) %>% ## if it's easier to say what to omit
227 | pmap(runif)
228 | #> [[1]]
229 | #> [1] 0.2875775
230 | #>
231 | #> [[2]]
232 | #> [1] 80.94746 46.80792
233 | #>
234 | #> [[3]]
235 | #> [1] 894.7157 946.4206 141.0008
236 | ```
237 |
238 | A: Use a custom wrapper and absorb extra variables with `...`.
239 |
240 | ``` r
241 | my_runif <- function(n, min, max, ...) runif(n, min, max)
242 |
243 | set.seed(123)
244 | pmap(df_oops, my_runif)
245 | #> [[1]]
246 | #> [1] 0.2875775
247 | #>
248 | #> [[2]]
249 | #> [1] 80.94746 46.80792
250 | #>
251 | #> [[3]]
252 | #> [1] 894.7157 946.4206 141.0008
253 | ```
254 |
255 | ## Add the generated data to the data frame as a list-column
256 |
257 | ``` r
258 | set.seed(123)
259 | (df_aug <- df %>%
260 | mutate(data = pmap(., runif)))
261 | #> # A tibble: 3 x 4
262 | #> n min max data
263 | #>   <int> <dbl> <dbl> <list>
264 | #> 1     1     0     1 <dbl [1]>
265 | #> 2     2    10   100 <dbl [2]>
266 | #> 3     3   100  1000 <dbl [3]>
267 | #View(df_aug)
268 | ```
269 |
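A follow-on sketch: `tidyr::unnest()` flattens the `data` list-column back
out, one row per generated value.

``` r
df_aug %>%
  unnest()
```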
270 | What about computing within a data frame, in the presence of the
271 | complications discussed above? Use `list()` in the place of the `.`
272 | placeholder above to select the target variables and, if necessary, map
273 | variable names to argument names. *Thanks @hadley for [sharing this
274 | trick](https://community.rstudio.com/t/dplyr-alternatives-to-rowwise/8071/29).*
275 |
276 | How to address variable names \!= argument names:
277 |
278 | ``` r
279 | foofy <- tibble(
280 | alpha = 1:3, ## was: n
281 | beta = c(0, 10, 100), ## was: min
282 | gamma = c(1, 100, 1000) ## was: max
283 | )
284 |
285 | set.seed(123)
286 | foofy %>%
287 | mutate(data = pmap(list(n = alpha, min = beta, max = gamma), runif))
288 | #> # A tibble: 3 x 4
289 | #> alpha beta gamma data
290 | #>   <int> <dbl> <dbl> <list>
291 | #> 1     1     0     1 <dbl [1]>
292 | #> 2     2    10   100 <dbl [2]>
293 | #> 3     3   100  1000 <dbl [3]>
294 | ```
295 |
296 | How to address the presence of ‘extra variables’, with either an inclusion
297 | or exclusion mentality:
298 |
299 | ``` r
300 | df_oops <- tibble(
301 | n = 1:3,
302 | min = c(0, 10, 100),
303 | max = c(1, 100, 1000),
304 | oops = c("please", "ignore", "me")
305 | )
306 |
307 | set.seed(123)
308 | df_oops %>%
309 | mutate(data = pmap(list(n, min, max), runif))
310 | #> # A tibble: 3 x 5
311 | #> n min max oops data
312 | #>   <int> <dbl> <dbl> <chr>  <list>
313 | #> 1     1     0     1 please <dbl [1]>
314 | #> 2     2    10   100 ignore <dbl [2]>
315 | #> 3     3   100  1000 me     <dbl [3]>
316 |
317 | df_oops %>%
318 | mutate(data = pmap(select(., -oops), runif))
319 | #> # A tibble: 3 x 5
320 | #> n min max oops data
321 | #>   <int> <dbl> <dbl> <chr>  <list>
322 | #> 1     1     0     1 please <dbl [1]>
323 | #> 2     2    10   100 ignore <dbl [2]>
324 | #> 3     3   100  1000 me     <dbl [3]>
325 | ```
326 |
327 | ## Review
328 |
329 | What have we done?
330 |
331 | - Arranged inputs as rows in a data frame
332 | - Used `pmap()` to implement a loop over the rows.
333 | - Used dplyr verbs `rename()` and `select()` to manipulate data on the
334 | way into `pmap()`.
335 | - Wrote custom wrappers around `runif()` to deal with:
336 | - df var names \!= `.f()` arg names
337 | - df vars that aren’t formal args of `.f()`
338 | - Demonstrated all of the above when working inside a data frame and
339 | adding generated data as a list-column
340 |
--------------------------------------------------------------------------------
/ex07_group-by-summarise.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Work on groups of rows via dplyr::group_by() + summarise()"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 |
19 | #' What if you need to work on groups of rows, such as the groups induced by
20 | #' the levels of a factor?
21 | #'
22 | #' You do not need to ... split the data frame into mini-data-frames, loop over
23 | #' them, and glue it all back together.
24 | #'
25 | #' Instead, use `dplyr::group_by()`, followed by `dplyr::summarize()`, to
26 | #' compute group-wise summaries.
27 |
28 | library(tidyverse)
29 |
30 | iris %>%
31 | group_by(Species) %>%
32 | summarise(pl_avg = mean(Petal.Length), pw_avg = mean(Petal.Width))
33 |
34 | #' What if you want to return summaries that are not just a single number?
35 | #'
36 | #' This does not "just work".
37 | iris %>%
38 | group_by(Species) %>%
39 | summarise(pl_qtile = quantile(Petal.Length, c(0.25, 0.5, 0.75)))
40 |
41 | #' Solution: package as a length-1 list that contains 3 values, creating a
42 | #' list-column.
43 | iris %>%
44 | group_by(Species) %>%
45 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75))))
46 |
47 | #' Q from
48 | #' [\@jcpsantiago](https://twitter.com/jcpsantiago/status/983997363298717696) via
49 | #' Twitter: How would you unnest so the final output is a data frame with a
50 | #' factor column `quantile` with levels "25%", "50%", and "75%"?
51 | #'
52 | #' A: I would `map()` `tibble::enframe()` on the new list column, to convert
53 | #' each entry from a named list to a two-column data frame. Then use
54 | #' `tidyr::unnest()` to get rid of the list column and return to a simple data
55 | #' frame and, if you like, convert `quantile` into a factor.
56 |
57 | iris %>%
58 | group_by(Species) %>%
59 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75)))) %>%
60 | mutate(pl_qtile = map(pl_qtile, enframe, name = "quantile")) %>%
61 | unnest() %>%
62 | mutate(quantile = factor(quantile))
63 |
64 | #' If something like this comes up a lot in an analysis, you could package the
65 | #' key "moves" in a function, like so:
66 | enquantile <- function(x, ...) {
67 | qtile <- enframe(quantile(x, ...), name = "quantile")
68 | qtile$quantile <- factor(qtile$quantile)
69 | list(qtile)
70 | }
71 |
72 | #' This makes repeated downstream usage more concise.
73 | iris %>%
74 | group_by(Species) %>%
75 | summarise(pl_qtile = enquantile(Petal.Length, c(0.25, 0.5, 0.75))) %>%
76 | unnest()
77 |
78 |
--------------------------------------------------------------------------------
/ex07_group-by-summarise.md:
--------------------------------------------------------------------------------
1 | Work on groups of rows via dplyr::group\_by() + summarise()
2 | ================
3 | Jenny Bryan
4 | 2018-04-11
5 |
6 | What if you need to work on groups of rows, such as the groups induced
7 | by the levels of a factor?
8 |
9 | You do not need to … split the data frame into mini-data-frames, loop
10 | over them, and glue it all back together.
11 |
12 | Instead, use `dplyr::group_by()`, followed by `dplyr::summarize()`, to
13 | compute group-wise summaries.
14 |
15 | ``` r
16 | library(tidyverse)
17 |
18 | iris %>%
19 | group_by(Species) %>%
20 | summarise(pl_avg = mean(Petal.Length), pw_avg = mean(Petal.Width))
21 | #> # A tibble: 3 x 3
22 | #> Species pl_avg pw_avg
23 | #>   <fct>       <dbl>  <dbl>
24 | #> 1 setosa 1.46 0.246
25 | #> 2 versicolor 4.26 1.33
26 | #> 3 virginica 5.55 2.03
27 | ```
28 |
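`summarise()` can compute several group-wise summaries at once; a small
sketch that also records the group size with `n()`:

``` r
iris %>%
  group_by(Species) %>%
  summarise(n = n(), pl_min = min(Petal.Length), pl_max = max(Petal.Length))
```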
29 | What if you want to return summaries that are not just a single number?
30 |
31 | This does not “just work”.
32 |
33 | ``` r
34 | iris %>%
35 | group_by(Species) %>%
36 | summarise(pl_qtile = quantile(Petal.Length, c(0.25, 0.5, 0.75)))
37 | #> Error in summarise_impl(.data, dots): Column `pl_qtile` must be length 1 (a summary value), not 3
38 | ```
39 |
40 | Solution: package as a length-1 list that contains 3 values, creating a
41 | list-column.
42 |
43 | ``` r
44 | iris %>%
45 | group_by(Species) %>%
46 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75))))
47 | #> # A tibble: 3 x 2
48 | #> Species pl_qtile
49 | #>   <fct>      <list>
50 | #> 1 setosa     <dbl [3]>
51 | #> 2 versicolor <dbl [3]>
52 | #> 3 virginica  <dbl [3]>
53 | ```
54 |
55 | Q from
56 | [@jcpsantiago](https://twitter.com/jcpsantiago/status/983997363298717696)
57 | via Twitter: How would you unnest so the final output is a data frame
58 | with a factor column `quantile` with levels “25%”, “50%”, and “75%”?
59 |
60 | A: I would `map()` `tibble::enframe()` on the new list column, to
61 | convert each entry from named list to a two-column data frame. Then use
62 | `tidyr::unnest()` to get rid of the list column and return to a simple
63 | data frame and, if you like, convert `quantile` into a factor.
64 |
65 | ``` r
66 | iris %>%
67 | group_by(Species) %>%
68 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75)))) %>%
69 | mutate(pl_qtile = map(pl_qtile, enframe, name = "quantile")) %>%
70 | unnest() %>%
71 | mutate(quantile = factor(quantile))
72 | #> # A tibble: 9 x 3
73 | #> Species quantile value
74 | #>   <fct>      <fct>    <dbl>
75 | #> 1 setosa 25% 1.40
76 | #> 2 setosa 50% 1.50
77 | #> 3 setosa 75% 1.58
78 | #> 4 versicolor 25% 4.00
79 | #> 5 versicolor 50% 4.35
80 | #> 6 versicolor 75% 4.60
81 | #> 7 virginica 25% 5.10
82 | #> 8 virginica 50% 5.55
83 | #> 9 virginica 75% 5.88
84 | ```
85 |
86 | If something like this comes up a lot in an analysis, you could package
87 | the key “moves” in a function, like so:
88 |
89 | ``` r
90 | enquantile <- function(x, ...) {
91 | qtile <- enframe(quantile(x, ...), name = "quantile")
92 | qtile$quantile <- factor(qtile$quantile)
93 | list(qtile)
94 | }
95 | ```
96 |
97 | This makes repeated downstream usage more concise.
98 |
99 | ``` r
100 | iris %>%
101 | group_by(Species) %>%
102 | summarise(pl_qtile = enquantile(Petal.Length, c(0.25, 0.5, 0.75))) %>%
103 | unnest()
104 | #> # A tibble: 9 x 3
105 | #> Species quantile value
106 | #>   <fct>      <fct>    <dbl>
107 | #> 1 setosa 25% 1.40
108 | #> 2 setosa 50% 1.50
109 | #> 3 setosa 75% 1.58
110 | #> 4 versicolor 25% 4.00
111 | #> 5 versicolor 50% 4.35
112 | #> 6 versicolor 75% 4.60
113 | #> 7 virginica 25% 5.10
114 | #> 8 virginica 50% 5.55
115 | #> 9 virginica 75% 5.88
116 | ```
117 |
--------------------------------------------------------------------------------
/ex08_nesting-is-good.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Why nesting is worth the awkwardness"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | library(gapminder)
19 | library(tidyverse)
20 |
21 | # ----
22 | #' gapminder data for Asia only
23 | gap <- gapminder %>%
24 | filter(continent == "Asia") %>%
25 | mutate(yr1952 = year - 1952)
26 |
27 | #+ alpha-order
28 | ggplot(gap, aes(x = lifeExp, y = country)) +
29 | geom_point()
30 |
31 | #' Countries are in alphabetical order.
32 | #'
33 | #' Set factor levels with intent. Example: order based on life expectancy in
34 | #' 2007, the last year in this dataset. Imagine you want this to persist across
35 | #' an entire analysis.
36 | gap <- gap %>%
37 | mutate(country = fct_reorder2(country, .x = year, .y = lifeExp))
38 |
39 | #+ principled-order
40 | ggplot(gap, aes(x = lifeExp, y = country)) +
41 | geom_point()
42 |
43 |
44 | #' Much better!
45 | #'
46 | #' Now imagine we want to fit a model to each country and look at dot plots of
47 | #' slope and intercept.
48 | #'
49 | #' `dplyr::group_by()` + `tidyr::nest()` creates a *nested data frame* and is an
50 | #' alternative to splitting into country-specific data frames. Those data frames
51 | #' end up, instead, in a list-column. The `country` variable remains as a normal
52 | #' factor.
53 | gap_nested <- gap %>%
54 | group_by(country) %>%
55 | nest()
56 |
57 | gap_nested
58 | gap_nested$data[[1]]
59 |
60 | gap_fitted <- gap_nested %>%
61 | mutate(fit = map(data, ~ lm(lifeExp ~ yr1952, data = .x)))
62 | gap_fitted
63 | gap_fitted$fit[[1]]
64 |
65 | gap_fitted <- gap_fitted %>%
66 | mutate(
67 | intercept = map_dbl(fit, ~ coef(.x)[["(Intercept)"]]),
68 | slope = map_dbl(fit, ~ coef(.x)[["yr1952"]])
69 | )
70 | gap_fitted
71 |
72 | #+ principled-order-coef-ests
73 | ggplot(gap_fitted, aes(x = intercept, y = country)) +
74 | geom_point()
75 |
76 | ggplot(gap_fitted, aes(x = slope, y = country)) +
77 | geom_point()
78 |
79 | #' The `split()` + `lapply()` + `do.call(rbind, ...)` approach.
80 | #'
81 | #' Split gap into many data frames, one per country.
82 | gap_split <- split(gap, gap$country)
83 |
84 | #' Fit a model to each country.
85 | gap_split_fits <- lapply(
86 | gap_split,
87 | function(df) {
88 | lm(lifeExp ~ yr1952, data = df)
89 | }
90 | )
91 | #' Oops ... the unused levels of country are a problem (empty data frames in our
92 | #' list).
93 | #'
94 | #' Drop unused levels in country and split.
95 | gap_split <- split(droplevels(gap), droplevels(gap)$country)
96 | head(gap_split, 2)
97 |
98 | #' Fit a model to each country and get coefficients with `coef()`.
99 | gap_split_coefs <- lapply(
100 | gap_split,
101 | function(df) {
102 | coef(lm(lifeExp ~ yr1952, data = df))
103 | }
104 | )
105 | head(gap_split_coefs, 2)
106 |
107 | #' Now we need to put everything back together. Row bind the list of coefs.
108 | #' Coerce from matrix back to data frame.
109 | gap_split_coefs <- as.data.frame(do.call(rbind, gap_split_coefs))
110 |
111 | #' Restore `country` variable from row names.
112 | gap_split_coefs$country <- rownames(gap_split_coefs)
113 | str(gap_split_coefs)
114 |
115 | #+ revert-to-alphabetical
116 | ggplot(gap_split_coefs, aes(x = `(Intercept)`, y = country)) +
117 | geom_point()
118 | #' Uh-oh, we lost the order of the `country` factor, due to coercion from factor
119 | #' to character (list and then row names).
120 | #'
121 | #' The `nest()` approach lets you keep data as data, rather than in attributes
122 | #' such as list names or row names. It preserves factors and their levels, and
123 | #' integer variables. It designs away various opportunities for different
124 | #' pieces of the dataset to get "out of sync" with each other, by leaving them
125 | #' in a data frame at all times.
126 | #'
127 | #' First in an interesting series of blog posts exploring these patterns and
128 | #' asking whether the tidyverse still needs a way to include the nesting
129 | #' variable in the nested data:
130 | #'
131 |
--------------------------------------------------------------------------------
/ex08_nesting-is-good.md:
--------------------------------------------------------------------------------
1 | Why nesting is worth the awkwardness
2 | ================
3 | Jenny Bryan
4 | 2018-04-12
5 |
6 | ``` r
7 | library(gapminder)
8 | library(tidyverse)
9 | ```
10 |
11 | gapminder data for Asia only
12 |
13 | ``` r
14 | gap <- gapminder %>%
15 | filter(continent == "Asia") %>%
16 | mutate(yr1952 = year - 1952)
17 | ```
18 |
19 | ``` r
20 | ggplot(gap, aes(x = lifeExp, y = country)) +
21 | geom_point()
22 | ```
23 |
24 | 
25 |
26 | Countries are in alphabetical order.
27 |
28 | Set factor levels with intent. Example: order based on life expectancy
29 | in 2007, the last year in this dataset. Imagine you want this to persist
30 | across an entire analysis.
31 |
32 | ``` r
33 | gap <- gap %>%
34 | mutate(country = fct_reorder2(country, .x = year, .y = lifeExp))
35 | ```
36 |
37 | ``` r
38 | ggplot(gap, aes(x = lifeExp, y = country)) +
39 | geom_point()
40 | ```
41 |
42 | 
43 |
44 | Much better\!
45 |
46 | Now imagine we want to fit a model to each country and look at dot plots
47 | of slope and intercept.
48 |
49 | `dplyr::group_by()` + `tidyr::nest()` creates a *nested data frame* and
50 | is an alternative to splitting into country-specific data frames. Those
51 | data frames end up, instead, in a list-column. The `country` variable
52 | remains as a normal factor.
53 |
54 | ``` r
55 | gap_nested <- gap %>%
56 | group_by(country) %>%
57 | nest()
58 |
59 | gap_nested
60 | #> # A tibble: 33 x 2
61 | #> country data
62 | #>   <fct>            <list>
63 | #> 1 Afghanistan      <tibble [12 × 6]>
64 | #> 2 Bahrain          <tibble [12 × 6]>
65 | #> 3 Bangladesh       <tibble [12 × 6]>
66 | #> 4 Cambodia         <tibble [12 × 6]>
67 | #> 5 China            <tibble [12 × 6]>
68 | #> 6 Hong Kong, China <tibble [12 × 6]>
69 | #> 7 India            <tibble [12 × 6]>
70 | #> 8 Indonesia        <tibble [12 × 6]>
71 | #> 9 Iran             <tibble [12 × 6]>
72 | #> 10 Iraq            <tibble [12 × 6]>
73 | #> # ... with 23 more rows
74 | gap_nested$data[[1]]
75 | #> # A tibble: 12 x 6
76 | #> continent year lifeExp pop gdpPercap yr1952
77 | #>   <fct>     <int>   <dbl>    <int>     <dbl>  <dbl>
78 | #> 1 Asia 1952 28.8 8425333 779. 0.
79 | #> 2 Asia 1957 30.3 9240934 821. 5.
80 | #> 3 Asia 1962 32.0 10267083 853. 10.
81 | #> 4 Asia 1967 34.0 11537966 836. 15.
82 | #> 5 Asia 1972 36.1 13079460 740. 20.
83 | #> 6 Asia 1977 38.4 14880372 786. 25.
84 | #> 7 Asia 1982 39.9 12881816 978. 30.
85 | #> 8 Asia 1987 40.8 13867957 852. 35.
86 | #> 9 Asia 1992 41.7 16317921 649. 40.
87 | #> 10 Asia 1997 41.8 22227415 635. 45.
88 | #> 11 Asia 2002 42.1 25268405 727. 50.
89 | #> 12 Asia 2007 43.8 31889923 975. 55.
90 |
91 | gap_fitted <- gap_nested %>%
92 | mutate(fit = map(data, ~ lm(lifeExp ~ yr1952, data = .x)))
93 | gap_fitted
94 | #> # A tibble: 33 x 3
95 | #>    country          data              fit
96 | #>    <fct>            <list>            <list>
97 | #>  1 Afghanistan      <tibble [12 × 6]> <S3: lm>
98 | #>  2 Bahrain          <tibble [12 × 6]> <S3: lm>
99 | #>  3 Bangladesh       <tibble [12 × 6]> <S3: lm>
100 | #>  4 Cambodia         <tibble [12 × 6]> <S3: lm>
101 | #>  5 China            <tibble [12 × 6]> <S3: lm>
102 | #>  6 Hong Kong, China <tibble [12 × 6]> <S3: lm>
103 | #>  7 India            <tibble [12 × 6]> <S3: lm>
104 | #>  8 Indonesia        <tibble [12 × 6]> <S3: lm>
105 | #>  9 Iran             <tibble [12 × 6]> <S3: lm>
106 | #> 10 Iraq             <tibble [12 × 6]> <S3: lm>
107 | #> # ... with 23 more rows
108 | gap_fitted$fit[[1]]
109 | #>
110 | #> Call:
111 | #> lm(formula = lifeExp ~ yr1952, data = .x)
112 | #>
113 | #> Coefficients:
114 | #> (Intercept) yr1952
115 | #> 29.9073 0.2753
116 |
117 | gap_fitted <- gap_fitted %>%
118 | mutate(
119 | intercept = map_dbl(fit, ~ coef(.x)[["(Intercept)"]]),
120 | slope = map_dbl(fit, ~ coef(.x)[["yr1952"]])
121 | )
122 | gap_fitted
123 | #> # A tibble: 33 x 5
124 | #>    country          data              fit      intercept slope
125 | #>    <fct>            <list>            <list>       <dbl> <dbl>
126 | #>  1 Afghanistan      <tibble [12 × 6]> <S3: lm>      29.9 0.275
127 | #>  2 Bahrain          <tibble [12 × 6]> <S3: lm>      52.7 0.468
128 | #>  3 Bangladesh       <tibble [12 × 6]> <S3: lm>      36.1 0.498
129 | #>  4 Cambodia         <tibble [12 × 6]> <S3: lm>      37.0 0.396
130 | #>  5 China            <tibble [12 × 6]> <S3: lm>      47.2 0.531
131 | #>  6 Hong Kong, China <tibble [12 × 6]> <S3: lm>      63.4 0.366
132 | #>  7 India            <tibble [12 × 6]> <S3: lm>      39.3 0.505
133 | #>  8 Indonesia        <tibble [12 × 6]> <S3: lm>      36.9 0.635
134 | #>  9 Iran             <tibble [12 × 6]> <S3: lm>      45.0 0.497
135 | #> 10 Iraq             <tibble [12 × 6]> <S3: lm>      50.1 0.235
136 | #> # ... with 23 more rows
137 | ```
138 |
139 | ``` r
140 | ggplot(gap_fitted, aes(x = intercept, y = country)) +
141 | geom_point()
142 | ```
143 |
144 | 
145 |
146 | ``` r
147 |
148 | ggplot(gap_fitted, aes(x = slope, y = country)) +
149 | geom_point()
150 | ```
151 |
152 | 
153 |
154 | The `split()` + `lapply()` + `do.call(rbind, ...)` approach.
155 |
156 | Split gap into many data frames, one per country.
157 |
158 | ``` r
159 | gap_split <- split(gap, gap$country)
160 | ```
161 |
162 | Fit a model to each country.
163 |
164 | ``` r
165 | gap_split_fits <- lapply(
166 | gap_split,
167 | function(df) {
168 | lm(lifeExp ~ yr1952, data = df)
169 | }
170 | )
171 | #> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
172 | ```
173 |
174 | Oops … the unused levels of country are a problem (empty data frames in
175 | our list).
176 |
177 | Drop unused levels in country and split.
178 |
179 | ``` r
180 | gap_split <- split(droplevels(gap), droplevels(gap)$country)
181 | head(gap_split, 2)
182 | #> $Japan
183 | #> # A tibble: 12 x 7
184 | #> country continent year lifeExp pop gdpPercap yr1952
185 | #> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
186 | #> 1 Japan Asia 1952 63.0 86459025 3217. 0.
187 | #> 2 Japan Asia 1957 65.5 91563009 4318. 5.
188 | #> 3 Japan Asia 1962 68.7 95831757 6577. 10.
189 | #> 4 Japan Asia 1967 71.4 100825279 9848. 15.
190 | #> 5 Japan Asia 1972 73.4 107188273 14779. 20.
191 | #> 6 Japan Asia 1977 75.4 113872473 16610. 25.
192 | #> 7 Japan Asia 1982 77.1 118454974 19384. 30.
193 | #> 8 Japan Asia 1987 78.7 122091325 22376. 35.
194 | #> 9 Japan Asia 1992 79.4 124329269 26825. 40.
195 | #> 10 Japan Asia 1997 80.7 125956499 28817. 45.
196 | #> 11 Japan Asia 2002 82.0 127065841 28605. 50.
197 | #> 12 Japan Asia 2007 82.6 127467972 31656. 55.
198 | #>
199 | #> $`Hong Kong, China`
200 | #> # A tibble: 12 x 7
201 | #> country continent year lifeExp pop gdpPercap yr1952
202 | #> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
203 | #> 1 Hong Kong, China Asia 1952 61.0 2125900 3054. 0.
204 | #> 2 Hong Kong, China Asia 1957 64.8 2736300 3629. 5.
205 | #> 3 Hong Kong, China Asia 1962 67.6 3305200 4693. 10.
206 | #> 4 Hong Kong, China Asia 1967 70.0 3722800 6198. 15.
207 | #> 5 Hong Kong, China Asia 1972 72.0 4115700 8316. 20.
208 | #> 6 Hong Kong, China Asia 1977 73.6 4583700 11186. 25.
209 | #> 7 Hong Kong, China Asia 1982 75.4 5264500 14561. 30.
210 | #> 8 Hong Kong, China Asia 1987 76.2 5584510 20038. 35.
211 | #> 9 Hong Kong, China Asia 1992 77.6 5829696 24758. 40.
212 | #> 10 Hong Kong, China Asia 1997 80.0 6495918 28378. 45.
213 | #> 11 Hong Kong, China Asia 2002 81.5 6762476 30209. 50.
214 | #> 12 Hong Kong, China Asia 2007 82.2 6980412 39725. 55.
215 | ```
216 |
217 | Fit a model to each country and extract the coefficients with `coef()`.
218 |
219 | ``` r
220 | gap_split_coefs <- lapply(
221 | gap_split,
222 | function(df) {
223 | coef(lm(lifeExp ~ yr1952, data = df))
224 | }
225 | )
226 | head(gap_split_coefs, 2)
227 | #> $Japan
228 | #> (Intercept) yr1952
229 | #> 65.1220513 0.3529042
230 | #>
231 | #> $`Hong Kong, China`
232 | #> (Intercept) yr1952
233 | #> 63.4286410 0.3659706
234 | ```
235 |
236 | Now we need to put everything back together: row-bind the list of
237 | coefficients, then coerce from matrix back to data frame.
238 |
239 | ``` r
240 | gap_split_coefs <- as.data.frame(do.call(rbind, gap_split_coefs))
241 | ```
242 |
243 | Restore `country` variable from row names.
244 |
245 | ``` r
246 | gap_split_coefs$country <- rownames(gap_split_coefs)
247 | str(gap_split_coefs)
248 | #> 'data.frame': 33 obs. of 3 variables:
249 | #> $ (Intercept): num 65.1 63.4 66.3 61.8 49.7 ...
250 | #> $ yr1952 : num 0.353 0.366 0.267 0.341 0.555 ...
251 | #> $ country : chr "Japan" "Hong Kong, China" "Israel" "Singapore" ...
252 | ```
253 |
254 | ``` r
255 | ggplot(gap_split_coefs, aes(x = `(Intercept)`, y = country)) +
256 | geom_point()
257 | ```
258 |
259 | 
260 |
261 | Uh-oh, we lost the order of the `country` factor, due to coercion from
262 | factor to character (list and then row names).
263 |
264 | The `nest()` approach lets you keep data as data, rather than stashing
265 | it in attributes such as list or row names. It preserves factors (and
266 | their levels) and integer variables. It also designs away various
267 | opportunities for different pieces of the dataset to get “out of sync”
268 | with each other, by keeping them in a data frame at all times.
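A minimal base R sketch of that failure mode (toy objects, invented for illustration): once a factor makes a round trip through row names, it has been silently coerced to character, and re-factoring reverts to alphabetical level order.

```r
# factor level order does not survive a trip through row names
f <- factor(c("b", "a"), levels = c("b", "a"))  # deliberate, non-alphabetical levels
df <- data.frame(x = 1:2)
rownames(df) <- f             # coerced to character here, levels discarded
is.character(rownames(df))    # TRUE
levels(factor(rownames(df)))  # "a" "b" -- back to alphabetical
```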
269 |
270 | First in an interesting series of blog posts exploring these patterns
271 | and asking whether the tidyverse still needs a way to include the
272 | nesting variable in the nested data:
273 |
274 |
--------------------------------------------------------------------------------
/ex08_nesting-is-good_files/figure-gfm/alpha-order-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/alpha-order-1.png
--------------------------------------------------------------------------------
/ex08_nesting-is-good_files/figure-gfm/principled-order-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/principled-order-1.png
--------------------------------------------------------------------------------
/ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-1.png
--------------------------------------------------------------------------------
/ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-2.png
--------------------------------------------------------------------------------
/ex08_nesting-is-good_files/figure-gfm/revert-to-alphabetical-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/revert-to-alphabetical-1.png
--------------------------------------------------------------------------------
/ex09_row-summaries.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Row-wise Summaries"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #' > For rowSums, mtcars %>% mutate(rowsum = pmap_dbl(., sum)) works but what is
17 | #' > a tidy oneliner for mean or sd per row?
18 | #' > I'm looking for a tidy version of rowSums, rowMeans and similarly rowSDs...
19 | #'
20 | #' [Two](https://twitter.com/vrnijs/status/995129678284255233)
21 | #' [tweets](https://twitter.com/vrnijs/status/995193240864178177) from Vincent
22 | #' Nijs [github](https://github.com/vnijs),
23 | #' [twitter](https://twitter.com/vrnijs)
24 | #'
25 |
26 | #' Good question! This also came up when I was originally casting about for
27 | #' genuine row-wise operations, but I never worked it up. I will do so now!
28 | #' First I set up my example.
29 | #'
30 | #+ body
31 | # ----
32 | library(tidyverse)
33 |
34 | df <- tribble(
35 | ~ name, ~ t1, ~t2, ~t3,
36 | "Abby", 1, 2, 3,
37 | "Bess", 4, 5, 6,
38 | "Carl", 7, 8, 9
39 | )
40 |
41 | #' ## Use `rowSums()` and `rowMeans()` inside `dplyr::mutate()`
42 | #'
43 | #' One "tidy version" of `rowSums()` is to ... just stick `rowSums()` inside a
44 | #' tidyverse pipeline. You can use `rowSums()` and `rowMeans()` inside
45 | #' `mutate()`, because they have a method for `data.frame`:
46 | df %>%
47 | mutate(t_sum = rowSums(select_if(., is.numeric)))
48 |
49 | df %>%
50 | mutate(t_avg = rowMeans(select(., -name)))
51 |
52 | #' Above I also demonstrate the use of `select(., SOME_EXPRESSION)` to express
53 | #' which variables should be computed on. This comes up a lot in row-wise work
54 | #' with a data frame, because, almost by definition, your variables are of mixed
55 | #' type. These are just a few examples of the different ways to say "use `t1`,
56 | #' `t2`, and `t3`", so we don't try to sum or average `name`. I'll continue to
57 | #' mix these in as we go. They are equally useful when expressing which
58 | #' variables should be forwarded to `.f` inside `pmap_*()`.
59 | #'
60 | #' ## Devil's Advocate: can't you just use `rowMeans()` and `rowSums()` alone?
61 | #'
62 | #' This is a great point [raised by Diogo
63 | #' Camacho](https://twitter.com/DiogoMCamacho/status/996178967647412224). If
64 | #' `rowSums()` and `rowMeans()` get the job done, why put yourself through the
65 | #' pain of using `pmap()`, especially inside `mutate()`?
66 | #'
67 | #' There are a few reasons:
68 | #'
69 | #' * You might want to take the median or standard deviation instead of a mean
70 | #' or a sum. You can't assume that base R or an add-on package offers a row-wise
71 | #' `data.frame` method for every function you might need.
72 | #' * You might have several variables besides `name` that need to be retained,
73 | #' but that should not be forwarded to `rowSums()` or `rowMeans()`. A
74 | #' matrix-with-row-names grants you a reprieve for exactly one variable, and that
75 | #' variable had best not be integer, factor, date, or datetime, because you must
76 | #' store it as character. It's not a general solution.
77 | #' * Correctness. If you extract the numeric columns or the variables whose
78 | #' names start with `"t"`, compute `rowMeans()` on them, and then column-bind
79 | #' the result back to the data, you are responsible for making sure that the two
80 | #' objects are absolutely, positively row-aligned.
81 | #'
82 | #' I think it's important to have a general strategy for row-wise computation on
83 | #' a subset of the columns in a data frame.
84 | #'
85 | #' ## How to use an arbitrary function inside `pmap()`
86 | #'
87 | #' What if you need to apply `foo()` to rows and the universe has not provided a
88 | #' special-purpose `rowFoos()` function? Now you do need to use `pmap()` or a
89 | #' type-stable variant, with `foo()` playing the role of `.f`.
90 | #'
91 | #' This works especially well with `sum()`.
92 |
93 | df %>%
94 | mutate(t_sum = pmap_dbl(list(t1, t2, t3), sum))
95 |
96 | df %>%
97 | mutate(t_sum = pmap_dbl(select(., starts_with("t")), sum))
98 |
99 | #' But the original question was about means and standard deviations! Why is
100 | #' that any different? Look at the signature of `sum()` versus a few other
101 | #' numerical summaries:
102 | #'
103 | #+ eval = FALSE
104 | sum(..., na.rm = FALSE)
105 | mean(x, trim = 0, na.rm = FALSE, ...)
106 | median(x, na.rm = FALSE, ...)
107 | var(x, y = NULL, na.rm = FALSE, use)
108 |
109 | #' `sum()` is especially `pmap()`-friendly because it takes `...` as its primary
110 | #' argument. In contrast, `mean()` takes a vector `x` as primary argument, which
111 | #' makes it harder to just drop into `pmap()`. This is something you might never
112 | #' think about if you're used to using special-purpose helpers like
113 | #' `rowMeans()`.
114 | #'
115 | #' purrr has a family of `lift_*()` functions that help you convert between
116 | #' these forms. Here I apply `purrr::lift_vd()` to `mean()`, so I can use it
117 | #' inside `pmap()`. The "vd" says I want to convert a function that takes a
118 | #' "**v**ector" into one that takes "**d**ots".
119 | df %>%
120 | mutate(t_avg = pmap_dbl(list(t1, t2, t3), lift_vd(mean)))
121 |
122 | #' ## Strategies that use reshaping and joins
123 | #'
124 | #' Data frames simply aren't a convenient storage format if you have a frequent
125 | #' need to compute summaries, row-wise, on a subset of columns. It is highly
126 | #' suggestive that your data is in the wrong shape, i.e. it's not tidy. Here we
127 | #' explore some approaches that rely on reshaping and/or joining. They are more
128 | #' transparent than using `lift_*()` with `pmap()` inside `mutate()` and,
129 | #' consequently, more verbose.
130 | #'
131 | #' They all rely on forming row-wise summaries, then joining back to the data.
132 | #'
133 | #' ### Gather, group, summarize
134 | (s <- df %>%
135 | gather("time", "val", starts_with("t")) %>%
136 | group_by(name) %>%
137 | summarize(t_avg = mean(val), t_sum = sum(val)))
138 | df %>%
139 | left_join(s)
140 |
141 | #' ### Group then summarise, with explicit `c()`
142 | (s <- df %>%
143 | group_by(name) %>%
144 | summarise(t_avg = mean(c(t1, t2, t3))))
145 | df %>%
146 | left_join(s)
147 |
148 | #' ### Nesting
149 | #'
150 | #' Let's revisit a pattern from
151 | #' [`ex08_nesting-is-good`](ex08_nesting-is-good.md). This is another way to
152 | #' "package" up the values of `t1`, `t2`, and `t3` in a way that makes both
153 | #' `mean()` and `sum()` happy. *thanks @krlmlr*
154 | (s <- df %>%
155 | gather("key", "value", -name) %>%
156 | nest(-name) %>%
157 | mutate(
158 | sum = map(data, "value") %>% map_dbl(sum),
159 | mean = map(data, "value") %>% map_dbl(mean)
160 | ) %>%
161 | select(-data))
162 | df %>%
163 | left_join(s)
164 |
165 | #' ### Yet another way to use `rowMeans()`
166 | (s <- df %>%
167 | column_to_rownames("name") %>%
168 | rowMeans() %>%
169 | enframe())
170 | df %>%
171 | left_join(s)
172 |
173 | #' ## Maybe you should use a matrix
174 | #'
175 | #' If you truly have data where each row is:
176 | #'
177 | #' * Identifier for this observational unit
178 | #' * Homogeneous vector of length n for the unit
179 | #'
180 | #' then you do want to use a matrix with rownames. I used to do this a lot but
181 | #' found that practically none of my data analysis problems live in this simple
182 | #' world for more than a couple of hours. Eventually I always get back to a
183 | #' setting where a data frame is the most favorable receptacle, overall. YMMV.
184 | m <- matrix(
185 | 1:9,
186 | byrow = TRUE, nrow = 3,
187 | dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
188 | )
189 |
190 | cbind(m, rowsum = rowSums(m))
191 | cbind(m, rowmean = rowMeans(m))
192 |
--------------------------------------------------------------------------------
/ex09_row-summaries.md:
--------------------------------------------------------------------------------
1 | Row-wise Summaries
2 | ================
3 | Jenny Bryan
4 | 2018-05-14
5 |
6 | > For rowSums, mtcars %\>% mutate(rowsum = pmap\_dbl(., sum)) works but
7 | > what is a tidy oneliner for mean or sd per row? I’m looking for a tidy
8 | > version of rowSums, rowMeans and similarly rowSDs…
9 |
10 | [Two](https://twitter.com/vrnijs/status/995129678284255233)
11 | [tweets](https://twitter.com/vrnijs/status/995193240864178177) from
12 | Vincent Nijs [github](https://github.com/vnijs),
13 | [twitter](https://twitter.com/vrnijs)
14 |
15 | Good question\! This also came up when I was originally casting about
16 | for genuine row-wise operations, but I never worked it up. I will do so
17 | now\! First I set up my example.
18 |
19 | ``` r
20 | library(tidyverse)
21 |
22 | df <- tribble(
23 | ~ name, ~ t1, ~t2, ~t3,
24 | "Abby", 1, 2, 3,
25 | "Bess", 4, 5, 6,
26 | "Carl", 7, 8, 9
27 | )
28 | ```
29 |
30 | ## Use `rowSums()` and `rowMeans()` inside `dplyr::mutate()`
31 |
32 | One “tidy version” of `rowSums()` is to … just stick `rowSums()` inside
33 | a tidyverse pipeline. You can use `rowSums()` and `rowMeans()` inside
34 | `mutate()`, because they have a method for `data.frame`:
35 |
36 | ``` r
37 | df %>%
38 | mutate(t_sum = rowSums(select_if(., is.numeric)))
39 | #> Warning: package 'bindrcpp' was built under R version 3.4.4
40 | #> # A tibble: 3 x 5
41 | #> name t1 t2 t3 t_sum
42 | #> <chr> <dbl> <dbl> <dbl> <dbl>
43 | #> 1 Abby 1 2 3 6
44 | #> 2 Bess 4 5 6 15
45 | #> 3 Carl 7 8 9 24
46 |
47 | df %>%
48 | mutate(t_avg = rowMeans(select(., -name)))
49 | #> # A tibble: 3 x 5
50 | #> name t1 t2 t3 t_avg
51 | #> <chr> <dbl> <dbl> <dbl> <dbl>
52 | #> 1 Abby 1 2 3 2
53 | #> 2 Bess 4 5 6 5
54 | #> 3 Carl 7 8 9 8
55 | ```
56 |
57 | Above I also demonstrate the use of `select(., SOME_EXPRESSION)` to
58 | express which variables should be computed on. This comes up a lot in
59 | row-wise work with a data frame, because, almost by definition, your
60 | variables are of mixed type. These are just a few examples of the
61 | different ways to say “use `t1`, `t2`, and `t3`”, so we don’t try to sum
62 | or average `name`. I’ll continue to mix these in as we go. They are
63 | equally useful when expressing which variables should be forwarded to
64 | `.f` inside
65 | `pmap_*()`.
66 |
67 | ## Devil’s Advocate: can’t you just use `rowMeans()` and `rowSums()` alone?
68 |
69 | This is a great point [raised by Diogo
70 | Camacho](https://twitter.com/DiogoMCamacho/status/996178967647412224).
71 | If `rowSums()` and `rowMeans()` get the job done, why put yourself
72 | through the pain of using `pmap()`, especially inside `mutate()`?
73 |
74 | There are a few reasons:
75 |
76 | - You might want to take the median or standard deviation instead of a
77 | mean or a sum. You can’t assume that base R or an add-on package
78 | offers a row-wise `data.frame` method for every function you might
79 | need.
80 | - You might have several variables besides `name` that need to be
81 | retained, but that should not be forwarded to `rowSums()` or
82 | `rowMeans()`. A matrix-with-row-names grants you a reprieve for
83 | exactly one variable, and that variable had best not be integer,
84 | factor, date, or datetime, because you must store it as character.
85 | It’s not a general solution.
86 | - Correctness. If you extract the numeric columns or the variables
87 | whose names start with `"t"`, compute `rowMeans()` on them, and then
88 | column-bind the result back to the data, you are responsible for
89 | making sure that the two objects are absolutely, positively
90 | row-aligned.
91 |
92 | I think it’s important to have a general strategy for row-wise
93 | computation on a subset of the columns in a data frame.
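To make the correctness point concrete, here is a base R sketch (toy data, invented for illustration): if the row summary is computed before the rows are reordered, a later column-bind misaligns without any warning.

```r
df <- data.frame(name = c("Abby", "Bess", "Carl"),
                 t1 = c(1, 4, 7), t2 = c(2, 5, 8),
                 stringsAsFactors = FALSE)
t_avg <- rowMeans(df[c("t1", "t2")])            # computed in the original row order
df2 <- df[order(df$name, decreasing = TRUE), ]  # rows get reordered later
cbind(df2, t_avg)  # no error, but Carl's row now carries Abby's mean (1.5, not 7.5)
```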
94 |
95 | ## How to use an arbitrary function inside `pmap()`
96 |
97 | What if you need to apply `foo()` to rows and the universe has not
98 | provided a special-purpose `rowFoos()` function? Now you do need to use
99 | `pmap()` or a type-stable variant, with `foo()` playing the role of
100 | `.f`.
101 |
102 | This works especially well with `sum()`.
103 |
104 | ``` r
105 | df %>%
106 | mutate(t_sum = pmap_dbl(list(t1, t2, t3), sum))
107 | #> # A tibble: 3 x 5
108 | #> name t1 t2 t3 t_sum
109 | #> <chr> <dbl> <dbl> <dbl> <dbl>
110 | #> 1 Abby 1 2 3 6
111 | #> 2 Bess 4 5 6 15
112 | #> 3 Carl 7 8 9 24
113 |
114 | df %>%
115 | mutate(t_sum = pmap_dbl(select(., starts_with("t")), sum))
116 | #> # A tibble: 3 x 5
117 | #> name t1 t2 t3 t_sum
118 | #> <chr> <dbl> <dbl> <dbl> <dbl>
119 | #> 1 Abby 1 2 3 6
120 | #> 2 Bess 4 5 6 15
121 | #> 3 Carl 7 8 9 24
122 | ```
123 |
124 | But the original question was about means and standard deviations\! Why
125 | is that any different? Look at the signature of `sum()` versus a few
126 | other numerical summaries:
127 |
128 | ``` r
129 | sum(..., na.rm = FALSE)
130 | mean(x, trim = 0, na.rm = FALSE, ...)
131 | median(x, na.rm = FALSE, ...)
132 | var(x, y = NULL, na.rm = FALSE, use)
133 | ```
134 |
135 | `sum()` is especially `pmap()`-friendly because it takes `...` as its
136 | primary argument. In contrast, `mean()` takes a vector `x` as primary
137 | argument, which makes it harder to just drop into `pmap()`. This is
138 | something you might never think about if you’re used to using
139 | special-purpose helpers like `rowMeans()`.
140 |
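You can see the hazard directly (toy data, invented for illustration): dropping a bare `mean()` into `pmap_dbl()` does not error. The second and third inputs are silently absorbed as `trim` and `na.rm`, so you just get the first column back.

```r
library(purrr)

t1 <- c(1, 4, 7)
t2 <- c(2, 5, 8)
t3 <- c(3, 6, 9)

pmap_dbl(list(t1, t2, t3), mean)  # 1 4 7 -- t2 and t3 were never averaged
```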
141 | purrr has a family of `lift_*()` functions that help you convert between
142 | these forms. Here I apply `purrr::lift_vd()` to `mean()`, so I can use
143 | it inside `pmap()`. The “vd” says I want to convert a function that
144 | takes a “**v**ector” into one that takes “**d**ots”.
145 |
146 | ``` r
147 | df %>%
148 | mutate(t_avg = pmap_dbl(list(t1, t2, t3), lift_vd(mean)))
149 | #> # A tibble: 3 x 5
150 | #> name t1 t2 t3 t_avg
151 | #> <chr> <dbl> <dbl> <dbl> <dbl>
152 | #> 1 Abby 1 2 3 2
153 | #> 2 Bess 4 5 6 5
154 | #> 3 Carl 7 8 9 8
155 | ```
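If `lift_vd()` feels like overkill, purrr’s formula shorthand gets you to the same place, because the lambda it builds accepts all of its inputs via `...` (a sketch on the same toy data; assumes dplyr, purrr, and tibble are installed):

```r
library(dplyr)
library(purrr)

df <- tibble::tibble(name = c("Abby", "Bess", "Carl"),
                     t1 = c(1, 4, 7), t2 = c(2, 5, 8), t3 = c(3, 6, 9))

# c(...) gathers the three row values into one vector for mean()
df %>%
  mutate(t_avg = pmap_dbl(list(t1, t2, t3), ~ mean(c(...))))
```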
156 |
157 | ## Strategies that use reshaping and joins
158 |
159 | Data frames simply aren’t a convenient storage format if you have a
160 | frequent need to compute summaries, row-wise, on a subset of columns. It
161 | is highly suggestive that your data is in the wrong shape, i.e. it’s not
162 | tidy. Here we explore some approaches that rely on reshaping and/or
163 | joining. They are more transparent than using `lift_*()` with `pmap()`
164 | inside `mutate()` and, consequently, more verbose.
165 |
166 | They all rely on forming row-wise summaries, then joining back to the
167 | data.
168 |
169 | ### Gather, group, summarize
170 |
171 | ``` r
172 | (s <- df %>%
173 | gather("time", "val", starts_with("t")) %>%
174 | group_by(name) %>%
175 | summarize(t_avg = mean(val), t_sum = sum(val)))
176 | #> # A tibble: 3 x 3
177 | #> name t_avg t_sum
178 | #> <chr> <dbl> <dbl>
179 | #> 1 Abby 2 6
180 | #> 2 Bess 5 15
181 | #> 3 Carl 8 24
182 | df %>%
183 | left_join(s)
184 | #> Joining, by = "name"
185 | #> # A tibble: 3 x 6
186 | #> name t1 t2 t3 t_avg t_sum
187 | #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
188 | #> 1 Abby 1 2 3 2 6
189 | #> 2 Bess 4 5 6 5 15
190 | #> 3 Carl 7 8 9 8 24
191 | ```
192 |
193 | ### Group then summarise, with explicit `c()`
194 |
195 | ``` r
196 | (s <- df %>%
197 | group_by(name) %>%
198 | summarise(t_avg = mean(c(t1, t2, t3))))
199 | #> # A tibble: 3 x 2
200 | #> name t_avg
201 | #>
202 | #> 1 Abby 2
203 | #> 2 Bess 5
204 | #> 3 Carl 8
205 | df %>%
206 | left_join(s)
207 | #> Joining, by = "name"
208 | #> # A tibble: 3 x 5
209 | #> name t1 t2 t3 t_avg
210 | #> <chr> <dbl> <dbl> <dbl> <dbl>
211 | #> 1 Abby 1 2 3 2
212 | #> 2 Bess 4 5 6 5
213 | #> 3 Carl 7 8 9 8
214 | ```
215 |
216 | ### Nesting
217 |
218 | Let’s revisit a pattern from
219 | [`ex08_nesting-is-good`](ex08_nesting-is-good.md). This is another way
220 | to “package” up the values of `t1`, `t2`, and `t3` in a way that makes
221 | both `mean()` and `sum()` happy. *thanks @krlmlr*
222 |
223 | ``` r
224 | (s <- df %>%
225 | gather("key", "value", -name) %>%
226 | nest(-name) %>%
227 | mutate(
228 | sum = map(data, "value") %>% map_dbl(sum),
229 | mean = map(data, "value") %>% map_dbl(mean)
230 | ) %>%
231 | select(-data))
232 | #> # A tibble: 3 x 3
233 | #> name sum mean
234 | #> <chr> <dbl> <dbl>
235 | #> 1 Abby 6 2
236 | #> 2 Bess 15 5
237 | #> 3 Carl 24 8
238 | df %>%
239 | left_join(s)
240 | #> Joining, by = "name"
241 | #> # A tibble: 3 x 6
242 | #> name t1 t2 t3 sum mean
243 | #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
244 | #> 1 Abby 1 2 3 6 2
245 | #> 2 Bess 4 5 6 15 5
246 | #> 3 Carl 7 8 9 24 8
247 | ```
248 |
249 | ### Yet another way to use `rowMeans()`
250 |
251 | ``` r
252 | (s <- df %>%
253 | column_to_rownames("name") %>%
254 | rowMeans() %>%
255 | enframe())
256 | #> Warning: Setting row names on a tibble is deprecated.
257 | #> # A tibble: 3 x 2
258 | #> name value
259 | #> <chr> <dbl>
260 | #> 1 Abby 2
261 | #> 2 Bess 5
262 | #> 3 Carl 8
263 | df %>%
264 | left_join(s)
265 | #> Joining, by = "name"
266 | #> # A tibble: 3 x 5
267 | #> name t1 t2 t3 value
268 | #> <chr> <dbl> <dbl> <dbl> <dbl>
269 | #> 1 Abby 1 2 3 2
270 | #> 2 Bess 4 5 6 5
271 | #> 3 Carl 7 8 9 8
272 | ```
273 |
274 | ## Maybe you should use a matrix
275 |
276 | If you truly have data where each row is:
277 |
278 | - Identifier for this observational unit
279 | - Homogeneous vector of length n for the unit
280 |
281 | then you do want to use a matrix with rownames. I used to do this a lot
282 | but found that practically none of my data analysis problems live in
283 | this simple world for more than a couple of hours. Eventually I always
284 | get back to a setting where a data frame is the most favorable
285 | receptacle, overall. YMMV.
286 |
287 | ``` r
288 | m <- matrix(
289 | 1:9,
290 | byrow = TRUE, nrow = 3,
291 | dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
292 | )
293 |
294 | cbind(m, rowsum = rowSums(m))
295 | #> t1 t2 t3 rowsum
296 | #> Abby 1 2 3 6
297 | #> Bess 4 5 6 15
298 | #> Carl 7 8 9 24
299 | cbind(m, rowmean = rowMeans(m))
300 | #> t1 t2 t3 rowmean
301 | #> Abby 1 2 3 2
302 | #> Bess 4 5 6 5
303 | #> Carl 7 8 9 8
304 | ```
305 |
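And for the `rowSDs` part of the original question: on a matrix, base R’s `apply()` over `MARGIN = 1` handles any row-wise summary, not just sums and means (a sketch on the same toy matrix):

```r
m <- matrix(
  1:9,
  byrow = TRUE, nrow = 3,
  dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
)

apply(m, 1, sd)      # row-wise standard deviations
apply(m, 1, median)  # row-wise medians
```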
--------------------------------------------------------------------------------
/iterate-over-rows.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Turn data frame into a list, one component per row"
3 | #' author: "Jenny Bryan, updating work of Winston Chang"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 | #'
8 | #' Update of .
9 | #'
10 | #' * Added some methods, removed some methods.
11 | #' * Run every combination of problem size & method multiple times.
12 | #' * Explore different number of rows and columns, with mixed col types.
13 |
14 | library(scales)
15 | library(tidyverse)
16 |
17 | # for loop over row index
18 | f_for_loop <- function(df) {
19 | out <- vector(mode = "list", length = nrow(df))
20 | for (i in seq_along(out)) {
21 | out[[i]] <- as.list(df[i, , drop = FALSE])
22 | }
23 | out
24 | }
25 |
26 | # split into single-row data frames, then lapply over them
27 | f_split_lapply <- function(df) {
28 | df <- split(df, seq_len(nrow(df)))
29 | lapply(df, function(row) as.list(row))
30 | }
31 |
32 | # lapply over the vector of row numbers
33 | f_lapply_row <- function(df) {
34 | lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop = FALSE]))
35 | }
36 |
37 | # purrr::pmap
38 | f_pmap <- function(df) {
39 | pmap(df, list)
40 | }
41 |
42 | # purrr::transpose (happens to be exactly what's needed here)
43 | f_transpose <- function(df) {
44 | transpose(df)
45 | }
46 |
47 | ## explicit gc, then execute `expr` `n` times w/o explicit gc, return timings
48 | benchmark <- function(n = 1, expr, envir = parent.frame()) {
49 | expr <- substitute(expr)
50 | gc()
51 | map(seq_len(n), ~ system.time(eval(expr, envir), gcFirst = FALSE))
52 | }
53 |
54 | run_row_benchmark <- function(nrow, times = 5) {
55 | df <- data.frame(
56 | x = rep_len(letters, length.out = nrow),
57 | y = runif(nrow),
58 | z = seq_len(nrow)
59 | )
60 | res <- list(
61 | transpose = benchmark(times, f_transpose(df)),
62 | pmap = benchmark(times, f_pmap(df)),
63 | split_lapply = benchmark(times, f_split_lapply(df)),
64 | lapply_row = benchmark(times, f_lapply_row(df)),
65 | for_loop = benchmark(times, f_for_loop(df))
66 | )
67 | res <- map(res, ~ map_dbl(.x, "elapsed"))
68 | tibble(
69 | nrow = nrow,
70 | method = rep(names(res), lengths(res)),
71 | time = flatten_dbl(res)
72 | )
73 | }
74 |
75 | run_col_benchmark <- function(ncol, times = 5) {
76 | nrow <- 3
77 | template <- data.frame(
78 | x = letters[seq_len(nrow)],
79 | y = runif(nrow),
80 | z = seq_len(nrow)
81 | )
82 | df <- template[rep_len(seq_len(ncol(template)), length.out = ncol)]
83 | res <- list(
84 | transpose = benchmark(times, f_transpose(df)),
85 | pmap = benchmark(times, f_pmap(df)),
86 | split_lapply = benchmark(times, f_split_lapply(df)),
87 | lapply_row = benchmark(times, f_lapply_row(df)),
88 | for_loop = benchmark(times, f_for_loop(df))
89 | )
90 | res <- map(res, ~ map_dbl(.x, "elapsed"))
91 | tibble(
92 | ncol = ncol,
93 | method = rep(names(res), lengths(res)),
94 | time = flatten_dbl(res)
95 | )
96 | }
97 |
98 | ## force figs to present methods in order of time
99 | flevels <- function(df) {
100 | mutate(df, method = fct_reorder(method, .x = desc(time)))
101 | }
102 |
103 | plot_it <- function(df, what = "nrow") {
104 | log10_breaks <- trans_breaks("log10", function(x) 10 ^ x)
105 | log10_mbreaks <- function(x) {
106 | limits <- c(floor(log10(x[1])), ceiling(log10(x[2])))
107 | breaks <- 10 ^ seq(limits[1], limits[2])
108 |
109 | unlist(lapply(breaks, function(x) x * seq(0.1, 0.9, by = 0.1)))
110 | }
111 | log10_labels <- trans_format("log10", math_format(10 ^ .x))
112 |
113 | ggplot(
114 | df %>% dplyr::filter(time > 0),
115 | aes_string(x = what, y = "time", colour = "method")
116 | ) +
117 | geom_point() +
118 | stat_summary(aes(group = method), fun.y = mean, geom = "line") +
119 | scale_y_log10(
120 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks
121 | ) +
122 | scale_x_log10(
123 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks
124 | ) +
125 | labs(
126 | x = paste0("Number of ", if (what == "nrow") "rows" else "columns"),
127 | y = "Time (s)"
128 | ) +
129 | theme_bw() +
130 | theme(aspect.ratio = 1, legend.justification = "top")
131 | }
132 |
133 | ## dry runs
134 | # df_test <- run_row_benchmark(nrow = 10000) %>% flevels()
135 | # df_test <- run_col_benchmark(ncol = 10000) %>% flevels()
136 | # ggplot(df_test, aes(x = method, y = time)) +
137 | # geom_jitter(width = 0.25, height = 0) +
138 | # scale_y_log10()
139 |
140 | ## The Real Thing
141 | ## fairly fast up to 10^4, go get a coffee at 10^5 (row case only)
142 | #df_r <- map_df(10 ^ (1:5), run_row_benchmark) %>% flevels()
143 | #write_csv(df_r, "row-benchmark.csv")
144 | df_r <- read_csv("row-benchmark.csv") %>% flevels()
145 |
146 | #+ row-benchmark
147 | plot_it(df_r, "nrow")
148 | #ggsave("row-benchmark.png")
149 |
150 | #df_c <- map_df(10 ^ (1:5), run_col_benchmark) %>% flevels()
151 | #write_csv(df_c, "col-benchmark.csv")
152 | df_c <- read_csv("col-benchmark.csv") %>% flevels()
153 |
154 | #+ col-benchmark
155 | plot_it(df_c, "ncol")
156 | #ggsave("col-benchmark.png")
157 |
158 | ## used at first, but saw same dramatic gc artefacts as described here
159 | ## in my plots
160 | ## https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/
161 | ## went for a DIY solution where I control gc
162 | # library(microbenchmark)
163 | # run_row_microbenchmark <- function(nrow, times = 5) {
164 | # df <- data.frame(x = rnorm(nrow), y = runif(nrow), z = runif(nrow))
165 | # microbenchmark(
166 | # for_loop = f_for_loop(df),
167 | # split_lapply = f_split_lapply(df),
168 | # lapply_row = f_lapply_row(df),
169 | # pmap = f_pmap(df),
170 | # transpose = f_transpose(df),
171 | # times = times
172 | # ) %>%
173 | # as_tibble() %>%
174 | # rename(method = expr) %>%
175 | # mutate(method = as.character(method)) %>%
176 | # add_column(nrow = nrow, .before = 1)
177 | # }
178 |
--------------------------------------------------------------------------------
/iterate-over-rows.md:
--------------------------------------------------------------------------------
1 | Turn data frame into a list, one component per row
2 | ================
3 | Jenny Bryan, updating work of Winston Chang
4 | 2018-09-05
5 |
6 | Update of <https://gist.github.com/wch/0e564def155d976c04dd28a876dc04b4>.
7 |
8 | - Added some methods, removed some methods.
9 | - Run every combination of problem size & method multiple times.
10 | - Explore different number of rows and columns, with mixed col types.
11 |
12 |
13 |
14 | ``` r
15 | library(scales)
16 | library(tidyverse)
17 | ```
18 |
19 | ## ── Attaching packages ──────────────────────────────────── tidyverse 1.2.1 ──
20 |
21 | ## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
22 | ## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
23 | ## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
24 | ## ✔ readr 1.2.0 ✔ forcats 0.3.0
25 |
26 | ## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
27 | ## ✖ readr::col_factor() masks scales::col_factor()
28 | ## ✖ purrr::discard() masks scales::discard()
29 | ## ✖ dplyr::filter() masks stats::filter()
30 | ## ✖ dplyr::lag() masks stats::lag()
31 |
32 | ``` r
33 | # for loop over row index
34 | f_for_loop <- function(df) {
35 | out <- vector(mode = "list", length = nrow(df))
36 | for (i in seq_along(out)) {
37 | out[[i]] <- as.list(df[i, , drop = FALSE])
38 | }
39 | out
40 | }
41 |
42 | # split into single-row data frames, then lapply over them
43 | f_split_lapply <- function(df) {
44 | df <- split(df, seq_len(nrow(df)))
45 | lapply(df, function(row) as.list(row))
46 | }
47 |
48 | # lapply over the vector of row numbers
49 | f_lapply_row <- function(df) {
50 | lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop = FALSE]))
51 | }
52 |
53 | # purrr::pmap
54 | f_pmap <- function(df) {
55 | pmap(df, list)
56 | }
57 |
58 | # purrr::transpose (happens to be exactly what's needed here)
59 | f_transpose <- function(df) {
60 | transpose(df)
61 | }
62 |
63 | ## explicit gc, then execute `expr` `n` times w/o explicit gc, return timings
64 | benchmark <- function(n = 1, expr, envir = parent.frame()) {
65 | expr <- substitute(expr)
66 | gc()
67 | map(seq_len(n), ~ system.time(eval(expr, envir), gcFirst = FALSE))
68 | }
69 |
70 | run_row_benchmark <- function(nrow, times = 5) {
71 | df <- data.frame(
72 | x = rep_len(letters, length.out = nrow),
73 | y = runif(nrow),
74 | z = seq_len(nrow)
75 | )
76 | res <- list(
77 | transpose = benchmark(times, f_transpose(df)),
78 | pmap = benchmark(times, f_pmap(df)),
79 | split_lapply = benchmark(times, f_split_lapply(df)),
80 | lapply_row = benchmark(times, f_lapply_row(df)),
81 | for_loop = benchmark(times, f_for_loop(df))
82 | )
83 | res <- map(res, ~ map_dbl(.x, "elapsed"))
84 | tibble(
85 | nrow = nrow,
86 | method = rep(names(res), lengths(res)),
87 | time = flatten_dbl(res)
88 | )
89 | }
90 |
91 | run_col_benchmark <- function(ncol, times = 5) {
92 | nrow <- 3
93 | template <- data.frame(
94 | x = letters[seq_len(nrow)],
95 | y = runif(nrow),
96 | z = seq_len(nrow)
97 | )
98 | df <- template[rep_len(seq_len(ncol(template)), length.out = ncol)]
99 | res <- list(
100 | transpose = benchmark(times, f_transpose(df)),
101 | pmap = benchmark(times, f_pmap(df)),
102 | split_lapply = benchmark(times, f_split_lapply(df)),
103 | lapply_row = benchmark(times, f_lapply_row(df)),
104 | for_loop = benchmark(times, f_for_loop(df))
105 | )
106 | res <- map(res, ~ map_dbl(.x, "elapsed"))
107 | tibble(
108 | ncol = ncol,
109 | method = rep(names(res), lengths(res)),
110 | time = flatten_dbl(res)
111 | )
112 | }
113 |
114 | ## force figs to present methods in order of time
115 | flevels <- function(df) {
116 | mutate(df, method = fct_reorder(method, .x = desc(time)))
117 | }
118 |
119 | plot_it <- function(df, what = "nrow") {
120 | log10_breaks <- trans_breaks("log10", function(x) 10 ^ x)
121 | log10_mbreaks <- function(x) {
122 | limits <- c(floor(log10(x[1])), ceiling(log10(x[2])))
123 | breaks <- 10 ^ seq(limits[1], limits[2])
124 |
125 | unlist(lapply(breaks, function(x) x * seq(0.1, 0.9, by = 0.1)))
126 | }
127 | log10_labels <- trans_format("log10", math_format(10 ^ .x))
128 |
129 | ggplot(
130 | df %>% dplyr::filter(time > 0),
131 | aes_string(x = what, y = "time", colour = "method")
132 | ) +
133 | geom_point() +
134 | stat_summary(aes(group = method), fun.y = mean, geom = "line") +
135 | scale_y_log10(
136 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks
137 | ) +
138 | scale_x_log10(
139 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks
140 | ) +
141 | labs(
142 | x = paste0("Number of ", if (what == "nrow") "rows" else "columns"),
143 | y = "Time (s)"
144 | ) +
145 | theme_bw() +
146 | theme(aspect.ratio = 1, legend.justification = "top")
147 | }
148 |
149 | ## dry runs
150 | # df_test <- run_row_benchmark(nrow = 10000) %>% flevels()
151 | # df_test <- run_col_benchmark(ncol = 10000) %>% flevels()
152 | # ggplot(df_test, aes(x = method, y = time)) +
153 | # geom_jitter(width = 0.25, height = 0) +
154 | # scale_y_log10()
155 |
156 | ## The Real Thing
157 | ## fairly fast up to 10^4, go get a coffee at 10^5 (row case only)
158 | #df_r <- map_df(10 ^ (1:5), run_row_benchmark) %>% flevels()
159 | #write_csv(df_r, "row-benchmark.csv")
160 | df_r <- read_csv("row-benchmark.csv") %>% flevels()
161 | ```
162 |
163 | ## Parsed with column specification:
164 | ## cols(
165 | ## nrow = col_double(),
166 | ## method = col_character(),
167 | ## time = col_double()
168 | ## )
169 |
170 | ``` r
171 | plot_it(df_r, "nrow")
172 | ```
173 |
174 | 
175 |
176 | ``` r
177 | #ggsave("row-benchmark.png")
178 |
179 | #df_c <- map_df(10 ^ (1:5), run_col_benchmark) %>% flevels()
180 | #write_csv(df_c, "col-benchmark.csv")
181 | df_c <- read_csv("col-benchmark.csv") %>% flevels()
182 | ```
183 |
184 | ## Parsed with column specification:
185 | ## cols(
186 | ## ncol = col_double(),
187 | ## method = col_character(),
188 | ## time = col_double()
189 | ## )
190 |
191 | ``` r
192 | plot_it(df_c, "ncol")
193 | ```
194 |
195 | 
196 |
197 | ``` r
198 | #ggsave("col-benchmark.png")
199 |
200 | ## used microbenchmark at first, but my plots showed the same dramatic
201 | ## gc artefacts as described here:
202 | ## https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/
203 | ## so I went for a DIY solution where I control gc explicitly
204 | # library(microbenchmark)
205 | # run_row_microbenchmark <- function(nrow, times = 5) {
206 | # df <- data.frame(x = rnorm(nrow), y = runif(nrow), z = runif(nrow))
207 | # microbenchmark(
208 | # for_loop = f_for_loop(df),
209 | # split_lapply = f_split_lapply(df),
210 | # lapply_row = f_lapply_row(df),
211 | # pmap = f_pmap(df),
212 | # transpose = f_transpose(df),
213 | # times = times
214 | # ) %>%
215 | # as_tibble() %>%
216 | # rename(method = expr) %>%
217 | # mutate(method = as.character(method)) %>%
218 | # add_column(nrow = nrow, .before = 1)
219 | # }
220 | ```
221 |
--------------------------------------------------------------------------------
/iterate-over-rows_files/figure-gfm/col-benchmark-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/iterate-over-rows_files/figure-gfm/col-benchmark-1.png
--------------------------------------------------------------------------------
/iterate-over-rows_files/figure-gfm/row-benchmark-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/iterate-over-rows_files/figure-gfm/row-benchmark-1.png
--------------------------------------------------------------------------------
/row-benchmark.csv:
--------------------------------------------------------------------------------
1 | nrow,method,time
2 | 10,transpose,0
3 | 10,transpose,0
4 | 10,transpose,0
5 | 10,transpose,0
6 | 10,transpose,0
7 | 10,pmap,9.999999997489795e-4
8 | 10,pmap,0
9 | 10,pmap,0
10 | 10,pmap,0.0010000000002037268
11 | 10,pmap,0
12 | 10,split_lapply,0.0010000000002037268
13 | 10,split_lapply,0.0010000000002037268
14 | 10,split_lapply,9.999999997489795e-4
15 | 10,split_lapply,0.0010000000002037268
16 | 10,split_lapply,0
17 | 10,lapply_row,9.999999997489795e-4
18 | 10,lapply_row,0.0010000000002037268
19 | 10,lapply_row,9.999999997489795e-4
20 | 10,lapply_row,0
21 | 10,lapply_row,0.0010000000002037268
22 | 10,for_loop,0.0010000000002037268
23 | 10,for_loop,0.0010000000002037268
24 | 10,for_loop,9.999999997489795e-4
25 | 10,for_loop,0.0010000000002037268
26 | 10,for_loop,9.999999997489795e-4
27 | 100,transpose,0
28 | 100,transpose,0
29 | 100,transpose,0
30 | 100,transpose,0
31 | 100,transpose,0
32 | 100,pmap,9.999999997489795e-4
33 | 100,pmap,0
34 | 100,pmap,0.0010000000002037268
35 | 100,pmap,0
36 | 100,pmap,9.999999997489795e-4
37 | 100,split_lapply,0.005999999999858119
38 | 100,split_lapply,0.007000000000061846
39 | 100,split_lapply,0.007000000000061846
40 | 100,split_lapply,0.005999999999858119
41 | 100,split_lapply,0.007999999999810825
42 | 100,lapply_row,0.007000000000061846
43 | 100,lapply_row,0.007000000000061846
44 | 100,lapply_row,0.006999999999607098
45 | 100,lapply_row,0.007000000000061846
46 | 100,lapply_row,0.007000000000061846
47 | 100,for_loop,0.008000000000265572
48 | 100,for_loop,0.006999999999607098
49 | 100,for_loop,0.008000000000265572
50 | 100,for_loop,0.007999999999810825
51 | 100,for_loop,0.007000000000061846
52 | 1e3,transpose,9.999999997489795e-4
53 | 1e3,transpose,0
54 | 1e3,transpose,0
55 | 1e3,transpose,0
56 | 1e3,transpose,0
57 | 1e3,pmap,0.0019999999999527063
58 | 1e3,pmap,0.0019999999999527063
59 | 1e3,pmap,0.003000000000156433
60 | 1e3,pmap,0.0019999999999527063
61 | 1e3,pmap,0.0019999999999527063
62 | 1e3,split_lapply,0.0749999999998181
63 | 1e3,split_lapply,0.07000000000016371
64 | 1e3,split_lapply,0.07699999999977081
65 | 1e3,split_lapply,0.07700000000022555
66 | 1e3,split_lapply,0.08199999999987995
67 | 1e3,lapply_row,0.07099999999991269
68 | 1e3,lapply_row,0.07200000000011642
69 | 1e3,lapply_row,0.0749999999998181
70 | 1e3,lapply_row,0.068000000000211
71 | 1e3,lapply_row,0.08399999999983265
72 | 1e3,for_loop,0.06599999999980355
73 | 1e3,for_loop,0.06500000000005457
74 | 1e3,for_loop,0.07100000000036744
75 | 1e3,for_loop,0.07699999999977081
76 | 1e3,for_loop,0.0729999999998654
77 | 1e4,transpose,9.999999997489795e-4
78 | 1e4,transpose,0.0019999999999527063
79 | 1e4,transpose,0.0010000000002037268
80 | 1e4,transpose,0.0010000000002037268
81 | 1e4,transpose,0.0019999999999527063
82 | 1e4,pmap,0.018999999999778083
83 | 1e4,pmap,0.023000000000138243
84 | 1e4,pmap,0.02099999999973079
85 | 1e4,pmap,0.028999999999996362
86 | 1e4,pmap,0.023000000000138243
87 | 1e4,split_lapply,1.0340000000001055
88 | 1e4,split_lapply,1.074999999999818
89 | 1e4,split_lapply,1.0900000000001455
90 | 1e4,split_lapply,1.0859999999997854
91 | 1e4,split_lapply,1.1520000000000437
92 | 1e4,lapply_row,1.0160000000000764
93 | 1e4,lapply_row,1.0669999999995525
94 | 1e4,lapply_row,1.1410000000000764
95 | 1e4,lapply_row,1.2590000000000146
96 | 1e4,lapply_row,1.0799999999999272
97 | 1e4,for_loop,1.031999999999698
98 | 1e4,for_loop,1.0419999999999163
99 | 1e4,for_loop,1.1170000000001892
100 | 1e4,for_loop,1.1039999999998145
101 | 1e4,for_loop,1.0979999999999563
102 | 1e5,transpose,0.016999999999825377
103 | 1e5,transpose,0.01900000000023283
104 | 1e5,transpose,0.021000000000185537
105 | 1e5,transpose,0.02099999999973079
106 | 1e5,transpose,0.021000000000185537
107 | 1e5,pmap,0.23700000000008004
108 | 1e5,pmap,0.25700000000006185
109 | 1e5,pmap,0.3690000000001419
110 | 1e5,pmap,0.29300000000012005
111 | 1e5,pmap,0.3819999999996071
112 | 1e5,split_lapply,35.738000000000284
113 | 1e5,split_lapply,35.86400000000003
114 | 1e5,split_lapply,35.68899999999985
115 | 1e5,split_lapply,35.559999999999945
116 | 1e5,split_lapply,35.922000000000025
117 | 1e5,lapply_row,33.54099999999971
118 | 1e5,lapply_row,34.87699999999995
119 | 1e5,lapply_row,35.669000000000324
120 | 1e5,lapply_row,34.465999999999894
121 | 1e5,lapply_row,35.59400000000005
122 | 1e5,for_loop,35.01800000000003
123 | 1e5,for_loop,35.29099999999971
124 | 1e5,for_loop,34.46300000000019
125 | 1e5,for_loop,35.26099999999997
126 | 1e5,for_loop,34.33699999999999
127 |
--------------------------------------------------------------------------------
/row-benchmark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/row-benchmark.png
--------------------------------------------------------------------------------
/row-oriented-workflows.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: No
4 | SaveWorkspace: No
5 | AlwaysSaveHistory: Default
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 |
15 | AutoAppendNewline: Yes
16 | StripTrailingWhitespace: Yes
17 |
18 | BuildType: Package
19 | PackageUseDevtools: Yes
20 | PackageInstallArgs: --no-multiarch --with-keep.source
21 | PackageRoxygenize: rd,collate,namespace
22 |
--------------------------------------------------------------------------------
/wch.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Applying a function over rows of a data frame"
3 | author: "Winston Chang"
4 | output:
5 | html_document:
6 | keep_md: TRUE
7 | ---
8 |
9 | ```{r setup, include=FALSE}
10 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>", cache = TRUE)
11 | ```
12 |
13 | [Source](https://gist.github.com/wch/0e564def155d976c04dd28a876dc04b4) for this document.
14 |
15 | [RPubs](https://rpubs.com/wch/200398) for this document.
16 |
17 | @dattali [asked](https://twitter.com/daattali/status/761058049859518464), "what's a safe way to iterate over rows of a data frame?" The example was to convert each row into a list and return a list of lists, indexed first by row, then by column.
18 |
19 | A number of people gave suggestions on Twitter, which I've collected here. I've benchmarked these methods with data of various sizes; scroll down to see a plot of times.
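For concreteness, the target structure for a hypothetical two-row data frame looks like this (toy data, not part of the benchmarks below):

```{r target-shape}
toy <- data.frame(x = c("a", "b"), y = c(1.5, 2.5), stringsAsFactors = FALSE)
# Desired result: one sub-list per row, each element named by column
lapply(seq_len(nrow(toy)), function(i) as.list(toy[i, , drop = FALSE]))
```

Each method below aims at this shape, though some (notably `apply()`) coerce column types along the way.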
20 |
21 | ```{r load-packages, cache = FALSE}
22 | library(purrr)
23 | library(dplyr)
24 | library(tidyr)
25 | ```
26 |
27 |
28 | ```{r define-approaches, message=FALSE}
29 | # @dattali
30 | # Using apply (only safe when all cols are same type)
31 | f_apply <- function(df) {
32 | apply(df, 1, function(row) as.list(row))
33 | }
34 |
35 | # @drob
36 | # split + lapply
37 | f_split_lapply <- function(df) {
38 | df <- split(df, seq_len(nrow(df)))
39 | lapply(df, function(row) as.list(row))
40 | }
41 |
42 | # @winston_chang
43 | # lapply over row indices
44 | f_lapply_row <- function(df) {
45 | lapply(seq_len(nrow(df)), function(i) as.list(df[i,,drop=FALSE]))
46 | }
47 |
48 | # @winston_chang
49 | # lapply + lapply: treat data frame as a list of columns, then slice out one list per row
50 | f_lapply_lapply <- function(df) {
51 | cols <- seq_len(length(df))
52 | names(cols) <- names(df)
53 |
54 | lapply(seq_len(nrow(df)), function(row) {
55 | lapply(cols, function(col) {
56 | df[[col]][[row]]
57 | })
58 | })
59 | }
60 |
61 | # @winston_chang
62 | # purrr::by_row
63 | # 2018-03-31 Jenny Bryan: by_row() no longer exists in purrr
64 | # f_by_row <- function(df) {
65 | # res <- by_row(df, function(row) as.list(row))
66 | # res$.out
67 | # }
68 |
69 | # @JennyBryan
70 | # purrr::pmap
71 | f_pmap <- function(df) {
72 | pmap(df, list)
73 | }
74 |
75 | # purrr::pmap, but coerce df to a list first
76 | f_pmap_aslist <- function(df) {
77 | pmap(as.list(df), list)
78 | }
79 |
80 | # @krlmlr
81 | # dplyr::rowwise
82 | f_rowwise <- function(df) {
83 | df %>% rowwise %>% do(row = as.list(.))
84 | }
85 |
86 | # @JennyBryan
87 | # purrr::transpose (only works for this specific task, i.e. one sub-list per row)
88 | f_transpose <- function(df) {
89 | transpose(df)
90 | }
91 | ```
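The caveats noted in the comments above are easy to demonstrate: `apply()` pushes the data frame through `as.matrix()`, so with mixed column types everything is coerced to character, whereas `purrr::transpose()` operates on the underlying list of columns and preserves types. A minimal sketch with hypothetical toy data:

```{r type-coercion-caveat}
toy <- data.frame(x = c("a", "b"), y = c(1.5, 2.5), stringsAsFactors = FALSE)

# apply() detours through a character matrix: y comes back as character
str(apply(toy, 1, as.list)[[1]])

# transpose() slices the column list directly: y stays numeric
str(purrr::transpose(toy)[[1]])
```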
92 |
93 |
94 | Benchmark each of them, using data sets with varying numbers of rows:
95 |
96 | ```{r run-benchmark}
97 | run_benchmark <- function(nrow) {
98 | # Make some data
99 | df <- data.frame(
100 | x = rnorm(nrow),
101 | y = runif(nrow),
102 | z = runif(nrow)
103 | )
104 |
105 | res <- list(
106 | apply = system.time(f_apply(df)),
107 | split_lapply = system.time(f_split_lapply(df)),
108 | lapply_row = system.time(f_lapply_row(df)),
109 | lapply_lapply = system.time(f_lapply_lapply(df)),
110 | #by_row = system.time(f_by_row(df)),
111 | pmap = system.time(f_pmap(df)),
112 | pmap_aslist = system.time(f_pmap_aslist(df)),
113 | rowwise = system.time(f_rowwise(df)),
114 | transpose = system.time(f_transpose(df))
115 | )
116 |
117 | # Get elapsed times
118 | res <- lapply(res, `[[`, "elapsed")
119 |
120 | # Add nrow to front
121 | res <- c(nrow = nrow, res)
122 | res
123 | }
124 |
125 | # Run the benchmarks for various size data
126 | all_times <- lapply(1:5, function(n) {
127 | run_benchmark(10^n)
128 | })
129 |
130 | # Convert to data frame
131 | times <- lapply(all_times, as.data.frame)
132 | times <- do.call(rbind, times)
133 |
134 | knitr::kable(times)
135 | ```
136 |
137 |
138 | ## Plot times
139 |
140 | This plot shows the number of seconds needed to process n rows, for each method. Both the x and y axes use log scales, so each step along the x axis represents a 10x increase in the number of rows, and each step along the y axis represents a 10x increase in time.
141 |
142 | ```{r plot, message=FALSE, cache = FALSE}
143 | library(ggplot2)
144 | library(scales)
145 | library(forcats)
146 |
147 | # Convert to long format
148 | times_long <- gather(times, method, seconds, -nrow)
149 |
150 | # Set order of methods, for plots
151 | times_long$method <- fct_reorder2(
152 | times_long$method,
153 | x = times_long$nrow,
154 | y = times_long$seconds
155 | )
156 |
157 | # Plot with log-log axes
158 | ggplot(times_long, aes(x = nrow, y = seconds, colour = method)) +
159 | geom_point() +
160 | geom_line() +
161 | annotation_logticks(sides = "trbl") +
162 | theme_bw() +
163 | scale_y_continuous(trans = log10_trans(),
164 | breaks = trans_breaks("log10", function(x) 10^x),
165 | labels = trans_format("log10", math_format(10^.x)),
166 | minor_breaks = NULL) +
167 | scale_x_continuous(trans = log10_trans(),
168 | breaks = trans_breaks("log10", function(x) 10^x),
169 | labels = trans_format("log10", math_format(10^.x)),
170 | minor_breaks = NULL)
171 | ```
172 |
--------------------------------------------------------------------------------
/wch.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Applying a function over rows of a data frame"
3 | author: "Winston Chang"
4 | output:
5 | html_document:
6 | keep_md: TRUE
7 | ---
8 |
9 |
10 |
11 | [Source](https://gist.github.com/wch/0e564def155d976c04dd28a876dc04b4) for this document.
12 |
13 | [RPubs](https://rpubs.com/wch/200398) for this document.
14 |
15 | @dattali [asked](https://twitter.com/daattali/status/761058049859518464), "what's a safe way to iterate over rows of a data frame?" The example was to convert each row into a list and return a list of lists, indexed first by row, then by column.
16 |
17 | A number of people gave suggestions on Twitter, which I've collected here. I've benchmarked these methods with data of various sizes; scroll down to see a plot of times.
18 |
19 |
20 | ```r
21 | library(purrr)
22 | library(dplyr)
23 | #>
24 | #> Attaching package: 'dplyr'
25 | #> The following objects are masked from 'package:stats':
26 | #>
27 | #> filter, lag
28 | #> The following objects are masked from 'package:base':
29 | #>
30 | #> intersect, setdiff, setequal, union
31 | library(tidyr)
32 | ```
33 |
34 |
35 |
36 | ```r
37 | # @dattali
38 | # Using apply (only safe when all cols are same type)
39 | f_apply <- function(df) {
40 | apply(df, 1, function(row) as.list(row))
41 | }
42 |
43 | # @drob
44 | # split + lapply
45 | f_split_lapply <- function(df) {
46 | df <- split(df, seq_len(nrow(df)))
47 | lapply(df, function(row) as.list(row))
48 | }
49 |
50 | # @winston_chang
51 | # lapply over row indices
52 | f_lapply_row <- function(df) {
53 | lapply(seq_len(nrow(df)), function(i) as.list(df[i,,drop=FALSE]))
54 | }
55 |
56 | # @winston_chang
57 | # lapply + lapply: treat data frame as a list of columns, then slice out one list per row
58 | f_lapply_lapply <- function(df) {
59 | cols <- seq_len(length(df))
60 | names(cols) <- names(df)
61 |
62 | lapply(seq_len(nrow(df)), function(row) {
63 | lapply(cols, function(col) {
64 | df[[col]][[row]]
65 | })
66 | })
67 | }
68 |
69 | # @winston_chang
70 | # purrr::by_row
71 | # 2018-03-31 Jenny Bryan: by_row() no longer exists in purrr
72 | # f_by_row <- function(df) {
73 | # res <- by_row(df, function(row) as.list(row))
74 | # res$.out
75 | # }
76 |
77 | # @JennyBryan
78 | # purrr::pmap
79 | f_pmap <- function(df) {
80 | pmap(df, list)
81 | }
82 |
83 | # purrr::pmap, but coerce df to a list first
84 | f_pmap_aslist <- function(df) {
85 | pmap(as.list(df), list)
86 | }
87 |
88 | # @krlmlr
89 | # dplyr::rowwise
90 | f_rowwise <- function(df) {
91 | df %>% rowwise %>% do(row = as.list(.))
92 | }
93 |
94 | # @JennyBryan
95 | # purrr::transpose (only works for this specific task, i.e. one sub-list per row)
96 | f_transpose <- function(df) {
97 | transpose(df)
98 | }
99 | ```
100 |
101 |
102 | Benchmark each of them, using data sets with varying numbers of rows:
103 |
104 |
105 | ```r
106 | run_benchmark <- function(nrow) {
107 | # Make some data
108 | df <- data.frame(
109 | x = rnorm(nrow),
110 | y = runif(nrow),
111 | z = runif(nrow)
112 | )
113 |
114 | res <- list(
115 | apply = system.time(f_apply(df)),
116 | split_lapply = system.time(f_split_lapply(df)),
117 | lapply_row = system.time(f_lapply_row(df)),
118 | lapply_lapply = system.time(f_lapply_lapply(df)),
119 | #by_row = system.time(f_by_row(df)),
120 | pmap = system.time(f_pmap(df)),
121 | pmap_aslist = system.time(f_pmap_aslist(df)),
122 | rowwise = system.time(f_rowwise(df)),
123 | transpose = system.time(f_transpose(df))
124 | )
125 |
126 | # Get elapsed times
127 | res <- lapply(res, `[[`, "elapsed")
128 |
129 | # Add nrow to front
130 | res <- c(nrow = nrow, res)
131 | res
132 | }
133 |
134 | # Run the benchmarks for various size data
135 | all_times <- lapply(1:5, function(n) {
136 | run_benchmark(10^n)
137 | })
138 |
139 | # Convert to data frame
140 | times <- lapply(all_times, as.data.frame)
141 | times <- do.call(rbind, times)
142 |
143 | knitr::kable(times)
144 | ```
145 |
146 |
147 |
148 | nrow apply split_lapply lapply_row lapply_lapply pmap pmap_aslist rowwise transpose
149 | ------ ------ ------------- ----------- -------------- ------ ------------ -------- ----------
150 | 1e+01 0.000 0.000 0.001 0.000 0.001 0.001 0.044 0.000
151 | 1e+02 0.002 0.005 0.005 0.005 0.002 0.002 0.054 0.002
152 | 1e+03 0.004 0.036 0.034 0.015 0.002 0.002 0.056 0.001
153 | 1e+04 0.033 0.422 0.339 0.163 0.017 0.016 0.504 0.002
154 | 1e+05 0.527 24.720 23.743 1.808 0.201 0.220 5.322 0.017
155 |
156 |
157 | ## Plot times
158 |
159 | This plot shows the number of seconds needed to process n rows, for each method. Both the x and y axes use log scales, so each step along the x axis represents a 10x increase in the number of rows, and each step along the y axis represents a 10x increase in time.
160 |
161 |
162 | ```r
163 | library(ggplot2)
164 | library(scales)
165 | library(forcats)
166 |
167 | # Convert to long format
168 | times_long <- gather(times, method, seconds, -nrow)
169 |
170 | # Set order of methods, for plots
171 | times_long$method <- fct_reorder2(
172 | times_long$method,
173 | x = times_long$nrow,
174 | y = times_long$seconds
175 | )
176 |
177 | # Plot with log-log axes
178 | ggplot(times_long, aes(x = nrow, y = seconds, colour = method)) +
179 | geom_point() +
180 | geom_line() +
181 | annotation_logticks(sides = "trbl") +
182 | theme_bw() +
183 | scale_y_continuous(trans = log10_trans(),
184 | breaks = trans_breaks("log10", function(x) 10^x),
185 | labels = trans_format("log10", math_format(10^.x)),
186 | minor_breaks = NULL) +
187 | scale_x_continuous(trans = log10_trans(),
188 | breaks = trans_breaks("log10", function(x) 10^x),
189 | labels = trans_format("log10", math_format(10^.x)),
190 | minor_breaks = NULL)
191 | #> Warning: Transformation introduced infinite values in continuous y-axis
192 |
193 | #> Warning: Transformation introduced infinite values in continuous y-axis
194 | ```
195 |
196 | 
197 |
--------------------------------------------------------------------------------
/wch_files/figure-html/plot-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/wch_files/figure-html/plot-1.png
--------------------------------------------------------------------------------