├── .gitignore
├── LICENSE
├── README.md
├── col-benchmark.csv
├── col-benchmark.png
├── ex01_leave-it-in-the-data-frame.R
├── ex01_leave-it-in-the-data-frame.md
├── ex01_leave-it-in-the-data-frame_files
│   └── figure-gfm
│       ├── unnamed-chunk-2-1.png
│       ├── unnamed-chunk-3-1.png
│       ├── unnamed-chunk-3-2.png
│       ├── unnamed-chunk-4-1.png
│       ├── unnamed-chunk-5-1.png
│       └── unnamed-chunk-6-1.png
├── ex02_create-or-mutate-in-place.R
├── ex02_create-or-mutate-in-place.md
├── ex03_row-wise-iteration-are-you-sure.R
├── ex03_row-wise-iteration-are-you-sure.md
├── ex04_map-example.R
├── ex04_map-example.md
├── ex05_attack-via-rows-or-columns.R
├── ex05_attack-via-rows-or-columns.md
├── ex06_runif-via-pmap.R
├── ex06_runif-via-pmap.md
├── ex07_group-by-summarise.R
├── ex07_group-by-summarise.md
├── ex08_nesting-is-good.R
├── ex08_nesting-is-good.md
├── ex08_nesting-is-good_files
│   └── figure-gfm
│       ├── alpha-order-1.png
│       ├── principled-order-1.png
│       ├── principled-order-coef-ests-1.png
│       ├── principled-order-coef-ests-2.png
│       └── revert-to-alphabetical-1.png
├── ex09_row-summaries.R
├── ex09_row-summaries.md
├── iterate-over-rows.R
├── iterate-over-rows.md
├── iterate-over-rows_files
│   └── figure-gfm
│       ├── col-benchmark-1.png
│       └── row-benchmark-1.png
├── row-benchmark.csv
├── row-benchmark.png
├── row-oriented-workflows.Rproj
├── wch.Rmd
├── wch.md
└── wch_files
    └── figure-html
        └── plot-1.png
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | wch.html
5 | wch_cache
6 | iterate-over-rows.html
7 | iterate-over-rows_cache
8 | *.key
9 | *.pdf
10 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | CC BY-SA 4.0
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Row-oriented workflows in R with the tidyverse
2 |
3 | Materials for [RStudio webinar](https://resources.rstudio.com/webinars/thinking-inside-the-box-you-can-do-that-inside-a-data-frame-april-jenny-bryan) *recording available at this link!*:
4 |
5 | Thinking inside the box: you can do that inside a data frame?!
6 | Jenny Bryan
7 | Wednesday, April 11 at 1:00pm ET / 10:00am PT
8 | [rstd.io/row-work](https://rstd.io/row-work) *<-- shortlink to this repo*
9 | Slides available [on SpeakerDeck](https://speakerdeck.com/jennybc/row-oriented-workflows-in-r-with-the-tidyverse)
10 |
11 | ## Abstract
12 |
13 | The data frame is a crucial data structure in R and, especially, in the tidyverse. Working on a column or a variable is a very natural operation, which is great. But what about row-oriented work? That also comes up frequently and is more awkward. In this webinar I’ll work through concrete code examples, exploring patterns that arise in data analysis. We’ll discuss the general notion of "split-apply-combine", row-wise work in a data frame, splitting vs. nesting, and list-columns.
14 |
15 | ## Code examples
16 |
17 | Beginner --> intermediate --> advanced
18 | Not all are used in the webinar
19 |
20 | * **Leave your data in that big, beautiful data frame.** [`ex01_leave-it-in-the-data-frame`](ex01_leave-it-in-the-data-frame.md) Show the evil of creating copies of certain rows of certain variables, using Magic Numbers and cryptic names, just to save some typing.
21 | * **Adding or modifying variables.** [`ex02_create-or-mutate-in-place`](ex02_create-or-mutate-in-place.md) `df$var <- ...` versus `dplyr::mutate()`. Recycling/safety, `df`'s as data mask, aesthetics.
22 | * **Are you SURE you need to iterate over rows?** [`ex03_row-wise-iteration-are-you-sure`](ex03_row-wise-iteration-are-you-sure.md) Don't fixate on the most obvious generalization of your pilot example and risk overlooking a vectorized solution. Features a `paste()` example, then goes out with some glue glory.
23 | * **Working with non-vectorized functions.** [`ex04_map-example`](ex04_map-example.md) Small example using `purrr::map()` to apply `nrow()` to list of data frames.
24 | * **Row-wise thinking vs. column-wise thinking.** [`ex05_attack-via-rows-or-columns`](ex05_attack-via-rows-or-columns.md) Data rectangling example. Both are possible, but I find building a tibble column-by-column is less aggravating than building rows, then row binding.
25 | * **Iterate over rows of a data frame.** [`iterate-over-rows`](iterate-over-rows.md) Empirical study of reshaping a data frame into this form: a list with one component per row. Revisiting a study originally done by Winston Chang. Run times for different numbers of [rows](row-benchmark.png) or [columns](col-benchmark.png).
26 | * **Generate data from different distributions via `purrr::pmap()`.** [`ex06_runif-via-pmap`](ex06_runif-via-pmap.md) Use `purrr::pmap()` to generate U[min, max] data for various combinations of (n, min, max), stored as rows of a data frame.
27 | * **Are you SURE you need to iterate over groups?** [`ex07_group-by-summarise`](ex07_group-by-summarise.md) Use `dplyr::group_by()` and `dplyr::summarise()` to compute group-wise summaries, without explicitly splitting up the data frame and re-combining the results. Use `list()` to package multivariate summaries into something `summarise()` can handle, creating a list-column.
28 | * **Group-and-nest.** [`ex08_nesting-is-good`](ex08_nesting-is-good.md) How to explicitly work on groups of rows via nesting (our recommendation) vs splitting.
29 | * **Row-wise mean or sum.** [`ex09_row-summaries`](ex09_row-summaries.md) How to do `rowSums()`-y and `rowMeans()`-y work inside a data frame.
30 |
31 | ## More tips and links
32 |
33 | Big thanks to everyone who weighed in on the related [twitter thread](https://twitter.com/JennyBryan/status/980905136468910080). This was very helpful for planning content.
34 |
35 | 45 minutes is not enough! A few notes about more special functions and patterns for row-driven work. Maybe we need to do a follow up ...
36 |
37 | `tibble::enframe()` and `deframe()` are handy for getting into and out of the data frame state.
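A minimal sketch (the named vector `x` and the column names here are made up):

```r
library(tibble)

# A named vector becomes a two-column tibble...
x <- c(a = 1, b = 2, c = 3)
df <- enframe(x, name = "key", value = "val")

# ...and deframe() round-trips it back to a named vector
y <- deframe(df)
```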
38 |
39 | `map()` and `map2()` are useful for working with list-columns inside `mutate()`.
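A minimal sketch, with a made-up list-column:

```r
library(tidyverse)

df <- tibble(x = list(1:3, 4:5))

# map_int() visits each element of the list-column and returns one integer per row
out <- df %>%
  mutate(n = map_int(x, length))
```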
40 |
41 | `tibble::add_row()` is handy for adding a single row at an arbitrary position in a data frame.
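A minimal sketch (the data and insertion point are invented):

```r
library(tibble)

df <- tibble(x = 1:3)

# Insert one new row immediately after the first row
out <- add_row(df, x = 99L, .after = 1)
```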
42 |
43 | `imap()` is handy for iterating over something and its names or integer indices at the same time.
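A minimal sketch (the input vector is made up); inside the formula, `.x` is the element and `.y` is its name (or its index, if the input is unnamed):

```r
library(purrr)

res <- imap_chr(c(a = 1, b = 2), ~ paste0(.y, "=", .x))
```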
44 |
45 | `dplyr::case_when()` helps you get rid of hairy, nested `if () {...} else {...}` statements.
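A minimal sketch (the age cutoffs are invented):

```r
library(dplyr)

ages <- c(1, 12, 14, 30)

# Conditions are evaluated in order; the first TRUE one wins,
# and TRUE ~ ... acts as the catch-all "else"
stage <- case_when(
  ages < 2  ~ "toddler",
  ages < 13 ~ "kid",
  ages < 18 ~ "teen",
  TRUE      ~ "adult"
)
```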
46 |
47 | Great resource on the "why?" of functional programming approaches (such as `map()`):
48 |
--------------------------------------------------------------------------------
/col-benchmark.csv:
--------------------------------------------------------------------------------
1 | ncol,method,time
2 | 10,transpose,0
3 | 10,transpose,0
4 | 10,transpose,0
5 | 10,transpose,0
6 | 10,transpose,0
7 | 10,pmap,0
8 | 10,pmap,0
9 | 10,pmap,9.999999997489795e-4
10 | 10,pmap,0
11 | 10,pmap,0
12 | 10,split_lapply,0
13 | 10,split_lapply,0
14 | 10,split_lapply,0.0010000000002037268
15 | 10,split_lapply,0
16 | 10,split_lapply,9.999999997489795e-4
17 | 10,lapply_row,0
18 | 10,lapply_row,9.999999997489795e-4
19 | 10,lapply_row,0
20 | 10,lapply_row,0
21 | 10,lapply_row,0.0010000000002037268
22 | 10,for_loop,0.0010000000002037268
23 | 10,for_loop,0
24 | 10,for_loop,9.999999997489795e-4
25 | 10,for_loop,0
26 | 10,for_loop,0
27 | 100,transpose,0
28 | 100,transpose,0
29 | 100,transpose,0
30 | 100,transpose,0
31 | 100,transpose,0
32 | 100,pmap,0
33 | 100,pmap,9.999999997489795e-4
34 | 100,pmap,0.0010000000002037268
35 | 100,pmap,0
36 | 100,pmap,9.999999997489795e-4
37 | 100,split_lapply,0.0019999999999527063
38 | 100,split_lapply,0.0019999999999527063
39 | 100,split_lapply,0.0029999999997016857
40 | 100,split_lapply,0.0019999999999527063
41 | 100,split_lapply,0.003000000000156433
42 | 100,lapply_row,0.0019999999999527063
43 | 100,lapply_row,0.0019999999999527063
44 | 100,lapply_row,0.0020000000004074536
45 | 100,lapply_row,0.0019999999999527063
46 | 100,lapply_row,0.0019999999999527063
47 | 100,for_loop,0.0029999999997016857
48 | 100,for_loop,0.0020000000004074536
49 | 100,for_loop,0.0019999999999527063
50 | 100,for_loop,0.0019999999999527063
51 | 100,for_loop,0.0029999999997016857
52 | 1e3,transpose,0
53 | 1e3,transpose,0
54 | 1e3,transpose,0.0010000000002037268
55 | 1e3,transpose,0
56 | 1e3,transpose,0
57 | 1e3,pmap,0.0020000000004074536
58 | 1e3,pmap,0.0019999999999527063
59 | 1e3,pmap,0.0019999999999527063
60 | 1e3,pmap,0.0019999999999527063
61 | 1e3,pmap,0.0019999999999527063
62 | 1e3,split_lapply,0.022000000000389264
63 | 1e3,split_lapply,0.02599999999983993
64 | 1e3,split_lapply,0.023999999999887223
65 | 1e3,split_lapply,0.028999999999996362
66 | 1e3,split_lapply,0.02500000000009095
67 | 1e3,lapply_row,0.023000000000138243
68 | 1e3,lapply_row,0.021999999999934516
69 | 1e3,lapply_row,0.021000000000185537
70 | 1e3,lapply_row,0.027000000000043656
71 | 1e3,lapply_row,0.023000000000138243
72 | 1e3,for_loop,0.02099999999973079
73 | 1e3,for_loop,0.021000000000185537
74 | 1e3,for_loop,0.02099999999973079
75 | 1e3,for_loop,0.021000000000185537
76 | 1e3,for_loop,0.027000000000043656
77 | 1e4,transpose,0.0010000000002037268
78 | 1e4,transpose,9.999999997489795e-4
79 | 1e4,transpose,0.0010000000002037268
80 | 1e4,transpose,9.999999997489795e-4
81 | 1e4,transpose,0.0020000000004074536
82 | 1e4,pmap,0.02500000000009095
83 | 1e4,pmap,0.024999999999636202
84 | 1e4,pmap,0.027000000000043656
85 | 1e4,pmap,0.026000000000294676
86 | 1e4,pmap,0.03099999999994907
87 | 1e4,split_lapply,0.24499999999989086
88 | 1e4,split_lapply,0.23900000000003274
89 | 1e4,split_lapply,0.24899999999979627
90 | 1e4,split_lapply,0.2680000000000291
91 | 1e4,split_lapply,0.24499999999989086
92 | 1e4,lapply_row,0.22000000000025466
93 | 1e4,lapply_row,0.2369999999996253
94 | 1e4,lapply_row,0.23400000000037835
95 | 1e4,lapply_row,0.2339999999999236
96 | 1e4,lapply_row,0.22600000000011278
97 | 1e4,for_loop,0.24899999999979627
98 | 1e4,for_loop,0.23800000000028376
99 | 1e4,for_loop,0.2519999999999527
100 | 1e4,for_loop,0.26499999999987267
101 | 1e4,for_loop,0.25700000000006185
102 | 1e5,transpose,0.01499999999987267
103 | 1e5,transpose,0.016999999999825377
104 | 1e5,transpose,0.016000000000076398
105 | 1e5,transpose,0.016999999999825377
106 | 1e5,transpose,0.027000000000043656
107 | 1e5,pmap,0.5749999999998181
108 | 1e5,pmap,0.6639999999997599
109 | 1e5,pmap,0.6190000000001419
110 | 1e5,pmap,0.7470000000002983
111 | 1e5,pmap,0.6419999999998254
112 | 1e5,split_lapply,3.2729999999996835
113 | 1e5,split_lapply,3.624000000000251
114 | 1e5,split_lapply,3.9329999999999927
115 | 1e5,split_lapply,3.380000000000109
116 | 1e5,split_lapply,3.4890000000000327
117 | 1e5,lapply_row,3.199000000000069
118 | 1e5,lapply_row,3.630000000000109
119 | 1e5,lapply_row,3.9980000000000473
120 | 1e5,lapply_row,3.5589999999997417
121 | 1e5,lapply_row,3.6010000000001128
122 | 1e5,for_loop,3.212999999999738
123 | 1e5,for_loop,3.66800000000012
124 | 1e5,for_loop,4.114000000000033
125 | 1e5,for_loop,3.882000000000062
126 | 1e5,for_loop,3.5149999999998727
127 |
--------------------------------------------------------------------------------
/col-benchmark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/col-benchmark.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Leave your data in that big, beautiful data frame"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | #' ## Don't create odd little excerpts and copies of your data.
19 | #'
20 | #' Code style that results from (I speculate) minimizing the number of key
21 | #' presses.
22 |
23 | ## :(
24 | sl <- iris[51:100,1]
25 | pw <- iris[51:100,4]
26 | plot(sl ~ pw)
27 |
28 | #' This clutters the workspace with "loose parts", `sl` and `pw`. Very soon, you
29 | #' are likely to forget what they are, which `Species` of `iris` they represent,
30 | #' and what the relationship between them is.
31 |
32 | # ----
33 | #' ## Leave the data *in situ* and reveal intent in your code
34 | #'
35 | #' More verbose code conveys intent. Eliminating the Magic Numbers makes the
36 | #' code less likely to be, or become, wrong.
37 | #'
38 | #' Here's one way to do the same in a tidyverse style:
39 | library(tidyverse)
40 |
41 | ggplot(
42 | filter(iris, Species == "versicolor"),
43 | aes(x = Petal.Width, y = Sepal.Length)
44 | ) + geom_point()
45 |
46 | #' Another tidyverse approach, this time using the pipe operator, `%>%`
47 | iris %>%
48 | filter(Species == "versicolor") %>%
49 | ggplot(aes(x = Petal.Width, y = Sepal.Length)) + ## <--- NOTE the `+` sign!!
50 | geom_point()
51 |
52 | #' A base solution that still follows the principles of
53 | #'
54 | #' * leave the data in the data frame
55 | #' * convey intent
56 | plot(
57 | Sepal.Length ~ Petal.Width,
58 | data = subset(iris, subset = Species == "versicolor")
59 | )
60 |
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame.md:
--------------------------------------------------------------------------------
1 | Leave your data in that big, beautiful data frame
2 | ================
3 | Jenny Bryan
4 | 2018-04-02
5 |
6 | ## Don’t create odd little excerpts and copies of your data.
7 |
8 | Code style that results from (I speculate) minimizing the number of key
9 | presses.
10 |
11 | ``` r
12 | ## :(
13 | sl <- iris[51:100,1]
14 | pw <- iris[51:100,4]
15 | plot(sl ~ pw)
16 | ```
17 |
18 | 
19 |
20 | This clutters the workspace with “loose parts”, `sl` and `pw`. Very
21 | soon, you are likely to forget what they are, which `Species` of `iris`
22 | they represent, and what the relationship between them is.
23 |
24 | ## Leave the data *in situ* and reveal intent in your code
25 |
26 | More verbose code conveys intent. Eliminating the Magic Numbers makes
27 | the code less likely to be, or become, wrong.
28 |
29 | Here’s one way to do the same in a tidyverse style:
30 |
31 | ``` r
32 | library(tidyverse)
33 |
34 | ggplot(
35 | filter(iris, Species == "versicolor"),
36 | aes(x = Petal.Width, y = Sepal.Length)
37 | ) + geom_point()
38 | ```
39 |
40 | 
41 |
42 | Another tidyverse approach, this time using the pipe operator, `%>%`
43 |
44 | ``` r
45 | iris %>%
46 | filter(Species == "versicolor") %>%
47 | ggplot(aes(x = Petal.Width, y = Sepal.Length)) + ## <--- NOTE the `+` sign!!
48 | geom_point()
49 | ```
50 |
51 | 
52 |
53 | A base solution that still follows the principles of
54 |
55 | - leave the data in the data frame
56 | - convey intent
57 |
58 |
59 |
60 | ``` r
61 | plot(
62 | Sepal.Length ~ Petal.Width,
63 | data = subset(iris, subset = Species == "versicolor")
64 | )
65 | ```
66 |
67 | 
68 |
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-2-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-2-1.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-1.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-2.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-4-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-4-1.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-5-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-5-1.png
--------------------------------------------------------------------------------
/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-6-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-6-1.png
--------------------------------------------------------------------------------
/ex02_create-or-mutate-in-place.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Add or modify a variable"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | library(tidyverse)
19 |
20 | # ----
21 | #' ### Function to produce a fresh example data frame
22 | new_df <- function() {
23 | tribble(
24 | ~ name, ~ age,
25 | "Reed", 14L,
26 | "Wesley", 12L,
27 | "Eli", 12L,
28 | "Toby", 1L
29 | )
30 | }
31 |
32 | # ----
33 | #' ## The `df$var <- ...` syntax
34 |
35 | #' How to create or modify a variable is a fairly low stakes matter, i.e. really
36 | #' a matter of taste. This is not a hill I plan to die on. But here's my two
37 | #' cents.
38 | #'
39 | #' Of course, `df$var <- ...` absolutely works for creating new variables or
40 | #' modifying existing ones. But there are downsides:
41 | #'
42 | #' * Silent recycling is a risk.
43 | #' * `df` is not special. It's not the implied place to look first for things,
44 | #' so you must be explicit. This can be a drag.
45 | #' * I have aesthetic concerns. YMMV.
46 | df <- new_df()
47 | df$eyes <- 2L
48 | df$snack <- c("chips", "cheese")
49 | df$uname <- toupper(df$name)
50 | df
51 |
52 | # ----
53 | #' ## `dplyr::mutate()` works "inside the box"
54 |
55 | #' `dplyr::mutate()` is the tidyverse way to work on a variable. If I'm working
56 | #' in a script-y style and the tidyverse packages are already available, I
57 | #' generally prefer this method of adding or modifying a variable.
58 | #'
59 | #' * Only a length one input can be recycled.
60 | #' * `df` is the first place to look for things. It turns out that making a
61 | #' new variable out of existing variables is very, very common, so it's nice
62 | #' when this is easy.
63 | #' * This is pipe-friendly, so I can easily combine with a few other logical
64 | #' data manipulations that need to happen around the same point.
65 | #' * I like the way this looks. YMMV.
66 |
67 | new_df() %>%
68 | mutate(
69 | eyes = 2L,
70 | snack = c("chips", "cheese"),
71 | uname = toupper(name)
72 | )
73 |
74 | #' Oops! I did not provide enough snacks!
75 |
76 | new_df() %>%
77 | mutate(
78 | eyes = 2L,
79 | snack = c("chips", "cheese", "mixed nuts", "nerf bullets"),
80 | uname = toupper(name)
81 | )
82 |
--------------------------------------------------------------------------------
/ex02_create-or-mutate-in-place.md:
--------------------------------------------------------------------------------
1 | Add or modify a variable
2 | ================
3 | Jenny Bryan
4 | 2018-04-10
5 |
6 | ``` r
7 | library(tidyverse)
8 | ```
9 |
10 | ### Function to produce a fresh example data frame
11 |
12 | ``` r
13 | new_df <- function() {
14 | tribble(
15 | ~ name, ~ age,
16 | "Reed", 14L,
17 | "Wesley", 12L,
18 | "Eli", 12L,
19 | "Toby", 1L
20 | )
21 | }
22 | ```
23 |
24 | ## The `df$var <- ...` syntax
25 |
26 | How to create or modify a variable is a fairly low stakes matter,
27 | i.e. really a matter of taste. This is not a hill I plan to die on. But
28 | here’s my two cents.
29 |
30 | Of course, `df$var <- ...` absolutely works for creating new variables
31 | or modifying existing ones. But there are downsides:
32 |
33 | - Silent recycling is a risk.
34 | - `df` is not special. It’s not the implied place to look first for
35 | things, so you must be explicit. This can be a drag.
36 | - I have aesthetic concerns. YMMV.
37 |
38 |
39 |
40 | ``` r
41 | df <- new_df()
42 | df$eyes <- 2L
43 | df$snack <- c("chips", "cheese")
44 | df$uname <- toupper(df$name)
45 | df
46 | #> # A tibble: 4 x 5
47 | #> name age eyes snack uname
48 | #> <chr> <int> <int> <chr> <chr>
49 | #> 1 Reed 14 2 chips REED
50 | #> 2 Wesley 12 2 cheese WESLEY
51 | #> 3 Eli 12 2 chips ELI
52 | #> 4 Toby 1 2 cheese TOBY
53 | ```
54 |
55 | ## `dplyr::mutate()` works “inside the box”
56 |
57 | `dplyr::mutate()` is the tidyverse way to work on a variable. If I’m
58 | working in a script-y style and the tidyverse packages are already
59 | available, I generally prefer this method of adding or modifying a
60 | variable.
61 |
62 | - Only a length one input can be recycled.
63 | - `df` is the first place to look for things. It turns out that making
64 | a new variable out of existing variables is very, very common, so
65 | it’s nice when this is easy.
66 | - This is pipe-friendly, so I can easily combine with a few other
67 | logical data manipulations that need to happen around the same
68 | point.
69 | - I like the way this looks. YMMV.
70 |
71 |
72 |
73 | ``` r
74 | new_df() %>%
75 | mutate(
76 | eyes = 2L,
77 | snack = c("chips", "cheese"),
78 | uname = toupper(name)
79 | )
80 | #> Error in mutate_impl(.data, dots): Column `snack` must be length 4 (the number of rows) or one, not 2
81 | ```
82 |
83 | Oops\! I did not provide enough snacks\!
84 |
85 | ``` r
86 | new_df() %>%
87 | mutate(
88 | eyes = 2L,
89 | snack = c("chips", "cheese", "mixed nuts", "nerf bullets"),
90 | uname = toupper(name)
91 | )
92 | #> # A tibble: 4 x 5
93 | #> name age eyes snack uname
94 | #> <chr> <int> <int> <chr> <chr>
95 | #> 1 Reed 14 2 chips REED
96 | #> 2 Wesley 12 2 cheese WESLEY
97 | #> 3 Eli 12 2 mixed nuts ELI
98 | #> 4 Toby 1 2 nerf bullets TOBY
99 | ```
100 |
--------------------------------------------------------------------------------
/ex03_row-wise-iteration-are-you-sure.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Are you absolutely sure that you, personally, need to iterate over rows?"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | library(tidyverse)
19 |
20 | # ----
21 | #' ## Function to give my example data frame
22 | new_df <- function() {
23 | tribble(
24 | ~ name, ~ age,
25 | "Reed", 14,
26 | "Wesley", 12,
27 | "Eli", 12,
28 | "Toby", 1
29 | )
30 | }
31 |
32 | # ----
33 | #' ## Single-row example can cause tunnel vision
34 | #'
35 | #' Sometimes it's easy to fixate on one (unfavorable) way of accomplishing
36 | #' something, because it feels like a natural extension of a successful
37 | #' small-scale experiment.
38 | #'
39 | #' Let's create a string from row 1 of the data frame.
40 | df <- new_df()
41 | paste(df$name[1], "is", df$age[1], "years old")
42 |
43 | #' I want to scale up, therefore I obviously must ... loop over all rows!
44 | n <- nrow(df)
45 | s <- vector(mode = "character", length = n)
46 | for (i in seq_len(n)) {
47 | s[i] <- paste(df$name[i], "is", df$age[i], "years old")
48 | }
49 | s
50 |
51 | #' HOLD ON. What if I told you `paste()` is already vectorized over its
52 | #' arguments?
53 | paste(df$name, "is", df$age, "years old")
54 |
55 | #' A surprising number of "iterate over rows" problems can be eliminated by
56 | #' exploiting functions that are already vectorized and by making your own
57 | #' functions vectorized over the primary argument.
58 | #'
59 | #' Writing an explicit loop in your code is not necessarily bad, but it should
60 | #' always give you pause. Has someone already written this loop for you? Ideally
61 | #' in C or C++ and inside a package that's being regularly checked, with high
62 | #' test coverage. That is usually the better choice.
63 |
64 | # ----
65 | #' ## Don't forget to work "inside the box"
66 | #'
67 |
68 | #' For this string interpolation task, we can even work with a vectorized
69 | #' function that is happy to do lookup inside a data frame. The [glue
70 | #' package](https://glue.tidyverse.org) is doing the work under the hood here,
71 | #' but its Greatest Functions are now re-exported by stringr, which we already
72 | #' attached via `library(tidyverse)`.
73 |
74 | str_glue_data(df, "{name} is {age} years old")
75 |
76 | #' You can use the simpler form, `str_glue()`, inside `dplyr::mutate()`, because
77 | #' the other variables in `df` are automatically available for use.
78 |
79 | df %>%
80 | mutate(sentence = str_glue("{name} is {age} years old"))
81 |
82 | #' The tidyverse style is to manage data holistically in a data frame and
83 | #' provide a user interface that encourages self-explaining code with low
84 | #' "syntactical noise".
85 |
--------------------------------------------------------------------------------
/ex03_row-wise-iteration-are-you-sure.md:
--------------------------------------------------------------------------------
1 | Are you absolutely sure that you, personally, need to iterate over rows?
2 | ================
3 | Jenny Bryan
4 | 2018-04-02
5 |
6 | ``` r
7 | library(tidyverse)
8 | ```
9 |
10 | ## Function to give my example data frame
11 |
12 | ``` r
13 | new_df <- function() {
14 | tribble(
15 | ~ name, ~ age,
16 | "Reed", 14,
17 | "Wesley", 12,
18 | "Eli", 12,
19 | "Toby", 1
20 | )
21 | }
22 | ```
23 |
24 | ## Single-row example can cause tunnel vision
25 |
26 | Sometimes it’s easy to fixate on one (unfavorable) way of accomplishing
27 | something, because it feels like a natural extension of a successful
28 | small-scale experiment.
29 |
30 | Let’s create a string from row 1 of the data frame.
31 |
32 | ``` r
33 | df <- new_df()
34 | paste(df$name[1], "is", df$age[1], "years old")
35 | #> [1] "Reed is 14 years old"
36 | ```
37 |
38 | I want to scale up, therefore I obviously must … loop over all rows\!
39 |
40 | ``` r
41 | n <- nrow(df)
42 | s <- vector(mode = "character", length = n)
43 | for (i in seq_len(n)) {
44 | s[i] <- paste(df$name[i], "is", df$age[i], "years old")
45 | }
46 | s
47 | #> [1] "Reed is 14 years old" "Wesley is 12 years old"
48 | #> [3] "Eli is 12 years old" "Toby is 1 years old"
49 | ```
50 |
51 | HOLD ON. What if I told you `paste()` is already vectorized over its
52 | arguments?
53 |
54 | ``` r
55 | paste(df$name, "is", df$age, "years old")
56 | #> [1] "Reed is 14 years old" "Wesley is 12 years old"
57 | #> [3] "Eli is 12 years old" "Toby is 1 years old"
58 | ```
59 |
60 | A surprising number of “iterate over rows” problems can be eliminated by
61 | exploiting functions that are already vectorized and by making your own
62 | functions vectorized over the primary argument.
63 |
64 | Writing an explicit loop in your code is not necessarily bad, but it
65 | should always give you pause. Has someone already written this loop for
66 | you? Ideally in C or C++ and inside a package that’s being regularly
67 | checked, with high test coverage. That is usually the better choice.
68 |
69 | ## Don’t forget to work “inside the box”
70 |
71 | For this string interpolation task, we can even work with a vectorized
72 | function that is happy to do lookup inside a data frame. The [glue
73 | package](https://glue.tidyverse.org) is doing the work under the hood
74 | here, but its Greatest Functions are now re-exported by stringr, which
75 | we already attached via `library(tidyverse)`.
76 |
77 | ``` r
78 | str_glue_data(df, "{name} is {age} years old")
79 | #> Reed is 14 years old
80 | #> Wesley is 12 years old
81 | #> Eli is 12 years old
82 | #> Toby is 1 years old
83 | ```
84 |
85 | You can use the simpler form, `str_glue()`, inside `dplyr::mutate()`,
86 | because the other variables in `df` are automatically available for use.
87 |
88 | ``` r
89 | df %>%
90 | mutate(sentence = str_glue("{name} is {age} years old"))
91 | #> # A tibble: 4 x 3
92 | #> name age sentence
93 | #> <chr> <dbl> <S3: glue>
94 | #> 1 Reed 14. Reed is 14 years old
95 | #> 2 Wesley 12. Wesley is 12 years old
96 | #> 3 Eli 12. Eli is 12 years old
97 | #> 4 Toby 1. Toby is 1 years old
98 | ```
99 |
100 | The tidyverse style is to manage data holistically in a data frame and
101 | provide a user interface that encourages self-explaining code with low
102 | “syntactical noise”.
103 |
--------------------------------------------------------------------------------
/ex04_map-example.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Small demo of purrr::map()"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | #' ## `purrr::map()` can be used to work with functions that aren't vectorized.
19 |
20 | df_list <- list(
21 | iris = head(iris, 2),
22 | mtcars = head(mtcars, 3)
23 | )
24 | df_list
25 |
26 | #' This does not work. `nrow()` expects a single data frame as input.
27 | nrow(df_list)
28 |
29 | #' `purrr::map()` applies `nrow()` to each element of `df_list`.
30 | library(purrr)
31 |
32 | map(df_list, nrow)
33 |
34 | #' Different calling styles make sense in more complicated situations. Hard to
35 | #' justify in this simple example.
36 | map(df_list, ~ nrow(.x))
37 |
38 | df_list %>%
39 | map(nrow)
40 |
41 | #' If you know what the return type is (or *should* be), use a type-specific
42 | #' variant of `map()`.
43 |
44 | map_int(df_list, ~ nrow(.x))
45 |
46 | #' More on coverage of `map()` and friends: .
47 |
--------------------------------------------------------------------------------
/ex04_map-example.md:
--------------------------------------------------------------------------------
1 | Small demo of purrr::map()
2 | ================
3 | Jenny Bryan
4 | 2018-04-10
5 |
6 | ## `purrr::map()` can be used to work with functions that aren’t vectorized.
7 |
8 | ``` r
9 | df_list <- list(
10 | iris = head(iris, 2),
11 | mtcars = head(mtcars, 3)
12 | )
13 | df_list
14 | #> $iris
15 | #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
16 | #> 1 5.1 3.5 1.4 0.2 setosa
17 | #> 2 4.9 3.0 1.4 0.2 setosa
18 | #>
19 | #> $mtcars
20 | #> mpg cyl disp hp drat wt qsec vs am gear carb
21 | #> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
22 | #> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
23 | #> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
24 | ```
25 |
26 | This does not work. `nrow()` expects a single data frame as input.
27 |
28 | ``` r
29 | nrow(df_list)
30 | #> NULL
31 | ```
32 |
33 | `purrr::map()` applies `nrow()` to each element of `df_list`.
34 |
35 | ``` r
36 | library(purrr)
37 |
38 | map(df_list, nrow)
39 | #> $iris
40 | #> [1] 2
41 | #>
42 | #> $mtcars
43 | #> [1] 3
44 | ```
45 |
46 | Different calling styles make sense in more complicated situations. Hard
47 | to justify in this simple example.
48 |
49 | ``` r
50 | map(df_list, ~ nrow(.x))
51 | #> $iris
52 | #> [1] 2
53 | #>
54 | #> $mtcars
55 | #> [1] 3
56 |
57 | df_list %>%
58 | map(nrow)
59 | #> $iris
60 | #> [1] 2
61 | #>
62 | #> $mtcars
63 | #> [1] 3
64 | ```
65 |
66 | If you know what the return type is (or *should* be), use a
67 | type-specific variant of `map()`.
68 |
69 | ``` r
70 | map_int(df_list, ~ nrow(.x))
71 | #> iris mtcars
72 | #> 2 3
73 | ```
74 |
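The other typed variants follow the same pattern; a quick sketch (reusing the
`df_list` defined above):

``` r
map_chr(df_list, ~ class(.x)[1])
map_lgl(df_list, ~ nrow(.x) > 2)
```

`map_chr()` insists on a character result and `map_lgl()` on a logical one,
failing loudly if an element returns anything else.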
75 | More coverage of `map()` and friends is available in the purrr
76 | documentation.
77 |
--------------------------------------------------------------------------------
/ex05_attack-via-rows-or-columns.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Attack via rows or columns?"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #' **WARNING: half-baked**
17 |
18 | #+ body
19 | # ----
20 | library(tidyverse)
21 |
22 | # ----
23 | #' ## If you must sweat, compare row-wise work vs. column-wise work
24 | #'
25 | #' The approach that works in a first, small example is not always the one that
26 | #' scales up best.
27 |
28 | x <- list(
29 | list(name = "sue", number = 1, veg = c("onion", "carrot")),
30 | list(name = "doug", number = 2, veg = c("potato", "beet"))
31 | )
32 |
33 | # row binding
34 |
35 | # frustrating base attempts
36 | rbind(x)
37 | do.call(rbind, x)
38 | do.call(rbind, x) %>% str()
39 |
40 | # tidyverse fail
41 | bind_rows(x)
42 | map_dfr(x, ~ .x)
43 |
44 | map_dfr(x, ~ .x[c("name", "number")])
45 |
46 | tibble(
47 | name = map_chr(x, "name"),
48 | number = map_dbl(x, "number"),
49 | veg = map(x, "veg")
50 | )
51 |
--------------------------------------------------------------------------------
/ex05_attack-via-rows-or-columns.md:
--------------------------------------------------------------------------------
1 | Attack via rows or columns?
2 | ================
3 | Jenny Bryan
4 | 2018-04-02
5 |
6 | **WARNING: half-baked**
7 |
8 | ``` r
9 | library(tidyverse)
10 | ```
11 |
12 | ## If you must sweat, compare row-wise work vs. column-wise work
13 |
14 | The approach that works in a first, small example is not always the one
15 | that scales up best.
16 |
17 | ``` r
18 | x <- list(
19 | list(name = "sue", number = 1, veg = c("onion", "carrot")),
20 | list(name = "doug", number = 2, veg = c("potato", "beet"))
21 | )
22 |
23 | # row binding
24 |
25 | # frustrating base attempts
26 | rbind(x)
27 | #> [,1] [,2]
28 | #> x List,3 List,3
29 | do.call(rbind, x)
30 | #> name number veg
31 | #> [1,] "sue" 1 Character,2
32 | #> [2,] "doug" 2 Character,2
33 | do.call(rbind, x) %>% str()
34 | #> List of 6
35 | #> $ : chr "sue"
36 | #> $ : chr "doug"
37 | #> $ : num 1
38 | #> $ : num 2
39 | #> $ : chr [1:2] "onion" "carrot"
40 | #> $ : chr [1:2] "potato" "beet"
41 | #> - attr(*, "dim")= int [1:2] 2 3
42 | #> - attr(*, "dimnames")=List of 2
43 | #> ..$ : NULL
44 | #> ..$ : chr [1:3] "name" "number" "veg"
45 |
46 | # tidyverse fail
47 | bind_rows(x)
48 | #> Error in bind_rows_(x, .id): Argument 3 must be length 1, not 2
49 | map_dfr(x, ~ .x)
50 | #> Error in bind_rows_(x, .id): Argument 3 must be length 1, not 2
51 |
52 | map_dfr(x, ~ .x[c("name", "number")])
53 | #> # A tibble: 2 x 2
54 | #> name number
55 | #>   <chr>  <dbl>
56 | #> 1 sue 1.
57 | #> 2 doug 2.
58 |
59 | tibble(
60 | name = map_chr(x, "name"),
61 | number = map_dbl(x, "number"),
62 | veg = map(x, "veg")
63 | )
64 | #> # A tibble: 2 x 3
65 | #> name number veg
66 | #>   <chr>  <dbl> <list>
67 | #> 1 sue       1. <chr [2]>
68 | #> 2 doug      2. <chr [2]>
69 | ```
70 |
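A row-wise sketch that also succeeds: wrap `veg` in `list()` so each element
of `x` becomes a one-row tibble with a list-column, which `map_dfr()` can then
row bind.

``` r
map_dfr(x, ~ tibble(name = .x$name, number = .x$number, veg = list(.x$veg)))
```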
--------------------------------------------------------------------------------
/ex06_runif-via-pmap.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Generate data from different distributions via pmap()"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | #' ## Uniform[min, max] via `runif()`
19 | #'
20 | #' CONSIDER:
21 | #' ```
22 | #' runif(n, min = 0, max = 1)
23 | #' ```
24 | #'
25 | #' Want to do this for several triples of (n, min, max).
26 | #'
27 | #' Store each triple as a row in a data frame.
28 | #'
29 | #' Now iterate over the rows.
30 |
31 | library(tidyverse)
32 |
33 | #' Notice how df's variable names are the same as runif's argument names. Do
34 | #' this when you can!
35 | df <- tribble(
36 | ~ n, ~ min, ~ max,
37 | 1L, 0, 1,
38 | 2L, 10, 100,
39 | 3L, 100, 1000
40 | )
41 | df
42 |
43 | #' Set the seed so the "random" numbers are reproducible.
44 | #'
45 | #' Practice on single rows.
46 | set.seed(123)
47 | (x <- df[1, ])
48 | runif(n = x$n, min = x$min, max = x$max)
49 |
50 | x <- df[2, ]
51 | runif(n = x$n, min = x$min, max = x$max)
52 |
53 | x <- df[3, ]
54 | runif(n = x$n, min = x$min, max = x$max)
55 |
56 | #' Think out loud in pseudo-code.
57 |
58 | ## x <- df[i, ]
59 | ## runif(n = x$n, min = x$min, max = x$max)
60 |
61 | ## runif(n = df$n[i], min = df$min[i], max = df$max[i])
62 | ## runif with all args from the i-th row of df
63 |
64 | #' Just. Do. It. with `pmap()`.
65 | set.seed(123)
66 | pmap(df, runif)
67 |
68 | #' ## Finessing variable and argument names
69 | #'
70 | #' Q: What if you can't arrange it so that variable names and arg names are
71 | #' the same?
72 | foofy <- tibble(
73 | alpha = 1:3, ## was: n
74 | beta = c(0, 10, 100), ## was: min
75 | gamma = c(1, 100, 1000) ## was: max
76 | )
77 | foofy
78 |
79 | #' A: Rename the variables on-the-fly, on the way in.
80 | set.seed(123)
81 | foofy %>%
82 | rename(n = alpha, min = beta, max = gamma) %>%
83 | pmap(runif)
84 |
85 | #' A: Write a wrapper around `runif()` to say how df vars <--> runif args.
86 |
87 | ## wrapper option #1:
88 | ## ARGNAME = l$VARNAME
89 | my_runif <- function(...) {
90 | l <- list(...)
91 | runif(n = l$alpha, min = l$beta, max = l$gamma)
92 | }
93 | set.seed(123)
94 | pmap(foofy, my_runif)
95 |
96 | ## wrapper option #2:
97 | my_runif <- function(alpha, beta, gamma, ...) {
98 | runif(n = alpha, min = beta, max = gamma)
99 | }
100 | set.seed(123)
101 | pmap(foofy, my_runif)
102 |
103 | #' You can use `..i` to refer to input by position.
104 | set.seed(123)
105 | pmap(foofy, ~ runif(n = ..1, min = ..2, max = ..3))
106 | #' Use this with *extreme caution*. Easy to shoot yourself in the foot.
107 | #'
108 | #' ## Extra variables in the data frame
109 | #'
110 | #' What if data frame includes variables that should not be passed to `.f()`?
111 | df_oops <- tibble(
112 | n = 1:3,
113 | min = c(0, 10, 100),
114 | max = c(1, 100, 1000),
115 | oops = c("please", "ignore", "me")
116 | )
117 | df_oops
118 |
119 | #' This will not work!
120 | set.seed(123)
121 | pmap(df_oops, runif)
122 |
123 | #' A: use `dplyr::select()` to limit the variables passed to `pmap()`.
124 | set.seed(123)
125 | df_oops %>%
126 | select(n, min, max) %>% ## if it's easier to say what to keep
127 | pmap(runif)
128 |
129 | set.seed(123)
130 | df_oops %>%
131 | select(-oops) %>% ## if it's easier to say what to omit
132 | pmap(runif)
133 |
134 | #' A: Use a custom wrapper and absorb extra variables with `...`.
135 | my_runif <- function(n, min, max, ...) runif(n, min, max)
136 |
137 | set.seed(123)
138 | pmap(df_oops, my_runif)
139 |
140 | #' ## Add the generated data to the data frame as a list-column
141 | set.seed(123)
142 | (df_aug <- df %>%
143 | mutate(data = pmap(., runif)))
144 | #View(df_aug)
145 |
146 | #' What about computing within a data frame, in the presence of the
147 | #' complications discussed above? Use `list()` in the place of the `.`
148 | #' placeholder above to select the target variables and, if necessary, map
149 | #' variable names to argument names. *Thanks @hadley for [sharing this
150 | #' trick](https://community.rstudio.com/t/dplyr-alternatives-to-rowwise/8071/29).*
151 | #'
152 | #' How to address variable names != argument names:
153 | foofy <- tibble(
154 | alpha = 1:3, ## was: n
155 | beta = c(0, 10, 100), ## was: min
156 | gamma = c(1, 100, 1000) ## was: max
157 | )
158 |
159 | set.seed(123)
160 | foofy %>%
161 | mutate(data = pmap(list(n = alpha, min = beta, max = gamma), runif))
162 |
163 | #' How to address the presence of 'extra variables', with either an inclusion
164 | #' or exclusion mentality:
165 | df_oops <- tibble(
166 | n = 1:3,
167 | min = c(0, 10, 100),
168 | max = c(1, 100, 1000),
169 | oops = c("please", "ignore", "me")
170 | )
171 |
172 | set.seed(123)
173 | df_oops %>%
174 | mutate(data = pmap(list(n, min, max), runif))
175 |
176 | df_oops %>%
177 | mutate(data = pmap(select(., -oops), runif))
178 |
179 | #' ## Review
180 | #'
181 | #' What have we done?
182 | #'
183 | #' * Arranged inputs as rows in a data frame
184 | #' * Used `pmap()` to implement a loop over the rows.
185 | #' * Used dplyr verbs `rename()` and `select()` to manipulate data on the way
186 | #' into `pmap()`.
187 | #' * Wrote custom wrappers around `runif()` to deal with:
188 | #' - df var names != `.f()` arg names
189 | #' - df vars that aren't formal args of `.f()`
190 | #' * Demonstrated all of the above when working inside a data frame and adding
191 | #' generated data as a list-column
192 |
--------------------------------------------------------------------------------
/ex06_runif-via-pmap.md:
--------------------------------------------------------------------------------
1 | Generate data from different distributions via pmap()
2 | ================
3 | Jenny Bryan
4 | 2018-05-08
5 |
6 | ## Uniform\[min, max\] via `runif()`
7 |
8 | CONSIDER:
9 |
10 | runif(n, min = 0, max = 1)
11 |
12 | Want to do this for several triples of (n, min, max).
13 |
14 | Store each triple as a row in a data frame.
15 |
16 | Now iterate over the rows.
17 |
18 | ``` r
19 | library(tidyverse)
20 | ```
21 |
22 | Notice how df’s variable names are the same as runif’s argument names. Do
23 | this when you can\!
24 |
25 | ``` r
26 | df <- tribble(
27 | ~ n, ~ min, ~ max,
28 | 1L, 0, 1,
29 | 2L, 10, 100,
30 | 3L, 100, 1000
31 | )
32 | df
33 | #> # A tibble: 3 x 3
34 | #> n min max
35 | #>   <int> <dbl> <dbl>
36 | #> 1 1 0 1
37 | #> 2 2 10 100
38 | #> 3 3 100 1000
39 | ```
40 |
41 | Set the seed so the “random” numbers are reproducible.
42 |
43 | Practice on single rows.
44 |
45 | ``` r
46 | set.seed(123)
47 | (x <- df[1, ])
48 | #> # A tibble: 1 x 3
49 | #> n min max
50 | #>   <int> <dbl> <dbl>
51 | #> 1 1 0 1
52 | runif(n = x$n, min = x$min, max = x$max)
53 | #> [1] 0.2875775
54 |
55 | x <- df[2, ]
56 | runif(n = x$n, min = x$min, max = x$max)
57 | #> [1] 80.94746 46.80792
58 |
59 | x <- df[3, ]
60 | runif(n = x$n, min = x$min, max = x$max)
61 | #> [1] 894.7157 946.4206 141.0008
62 | ```
63 |
64 | Think out loud in pseudo-code.
65 |
66 | ``` r
67 | ## x <- df[i, ]
68 | ## runif(n = x$n, min = x$min, max = x$max)
69 |
70 | ## runif(n = df$n[i], min = df$min[i], max = df$max[i])
71 | ## runif with all args from the i-th row of df
72 | ```
73 |
74 | Just. Do. It. with `pmap()`.
75 |
76 | ``` r
77 | set.seed(123)
78 | pmap(df, runif)
79 | #> [[1]]
80 | #> [1] 0.2875775
81 | #>
82 | #> [[2]]
83 | #> [1] 80.94746 46.80792
84 | #>
85 | #> [[3]]
86 | #> [1] 894.7157 946.4206 141.0008
87 | ```
88 |
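For intuition, here is a sketch of the explicit loop that `pmap(df, runif)`
replaces; with the same seed it produces the same list.

``` r
set.seed(123)
out <- vector(mode = "list", length = nrow(df))
for (i in seq_len(nrow(df))) {
  out[[i]] <- runif(n = df$n[i], min = df$min[i], max = df$max[i])
}
out
```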
89 | ## Finessing variable and argument names
90 |
91 | Q: What if you can’t arrange it so that variable names and arg names are
92 | the same?
93 |
94 | ``` r
95 | foofy <- tibble(
96 | alpha = 1:3, ## was: n
97 | beta = c(0, 10, 100), ## was: min
98 | gamma = c(1, 100, 1000) ## was: max
99 | )
100 | foofy
101 | #> # A tibble: 3 x 3
102 | #> alpha beta gamma
103 | #>   <int> <dbl> <dbl>
104 | #> 1 1 0 1
105 | #> 2 2 10 100
106 | #> 3 3 100 1000
107 | ```
108 |
109 | A: Rename the variables on-the-fly, on the way in.
110 |
111 | ``` r
112 | set.seed(123)
113 | foofy %>%
114 | rename(n = alpha, min = beta, max = gamma) %>%
115 | pmap(runif)
116 | #> [[1]]
117 | #> [1] 0.2875775
118 | #>
119 | #> [[2]]
120 | #> [1] 80.94746 46.80792
121 | #>
122 | #> [[3]]
123 | #> [1] 894.7157 946.4206 141.0008
124 | ```
125 |
126 | A: Write a wrapper around `runif()` to say how df vars `<-->` runif args.
127 |
128 | ``` r
129 | ## wrapper option #1:
130 | ## ARGNAME = l$VARNAME
131 | my_runif <- function(...) {
132 | l <- list(...)
133 | runif(n = l$alpha, min = l$beta, max = l$gamma)
134 | }
135 | set.seed(123)
136 | pmap(foofy, my_runif)
137 | #> [[1]]
138 | #> [1] 0.2875775
139 | #>
140 | #> [[2]]
141 | #> [1] 80.94746 46.80792
142 | #>
143 | #> [[3]]
144 | #> [1] 894.7157 946.4206 141.0008
145 |
146 | ## wrapper option #2:
147 | my_runif <- function(alpha, beta, gamma, ...) {
148 | runif(n = alpha, min = beta, max = gamma)
149 | }
150 | set.seed(123)
151 | pmap(foofy, my_runif)
152 | #> [[1]]
153 | #> [1] 0.2875775
154 | #>
155 | #> [[2]]
156 | #> [1] 80.94746 46.80792
157 | #>
158 | #> [[3]]
159 | #> [1] 894.7157 946.4206 141.0008
160 | ```
161 |
162 | You can use `..i` to refer to input by position.
163 |
164 | ``` r
165 | set.seed(123)
166 | pmap(foofy, ~ runif(n = ..1, min = ..2, max = ..3))
167 | #> [[1]]
168 | #> [1] 0.2875775
169 | #>
170 | #> [[2]]
171 | #> [1] 80.94746 46.80792
172 | #>
173 | #> [[3]]
174 | #> [1] 894.7157 946.4206 141.0008
175 | ```
176 |
177 | Use this with *extreme caution*. Easy to shoot yourself in the foot.
178 |
179 | ## Extra variables in the data frame
180 |
181 | What if data frame includes variables that should not be passed to
182 | `.f()`?
183 |
184 | ``` r
185 | df_oops <- tibble(
186 | n = 1:3,
187 | min = c(0, 10, 100),
188 | max = c(1, 100, 1000),
189 | oops = c("please", "ignore", "me")
190 | )
191 | df_oops
192 | #> # A tibble: 3 x 4
193 | #> n min max oops
194 | #>   <int> <dbl> <dbl> <chr>
195 | #> 1 1 0 1 please
196 | #> 2 2 10 100 ignore
197 | #> 3 3 100 1000 me
198 | ```
199 |
200 | This will not work\!
201 |
202 | ``` r
203 | set.seed(123)
204 | pmap(df_oops, runif)
205 | #> Error in .f(n = .l[[c(1L, i)]], min = .l[[c(2L, i)]], max = .l[[c(3L, : unused argument (oops = .l[[c(4, i)]])
206 | ```
207 |
208 | A: use `dplyr::select()` to limit the variables passed to `pmap()`.
209 |
210 | ``` r
211 | set.seed(123)
212 | df_oops %>%
213 | select(n, min, max) %>% ## if it's easier to say what to keep
214 | pmap(runif)
215 | #> [[1]]
216 | #> [1] 0.2875775
217 | #>
218 | #> [[2]]
219 | #> [1] 80.94746 46.80792
220 | #>
221 | #> [[3]]
222 | #> [1] 894.7157 946.4206 141.0008
223 |
224 | set.seed(123)
225 | df_oops %>%
226 | select(-oops) %>% ## if it's easier to say what to omit
227 | pmap(runif)
228 | #> [[1]]
229 | #> [1] 0.2875775
230 | #>
231 | #> [[2]]
232 | #> [1] 80.94746 46.80792
233 | #>
234 | #> [[3]]
235 | #> [1] 894.7157 946.4206 141.0008
236 | ```
237 |
238 | A: Use a custom wrapper and absorb extra variables with `...`.
239 |
240 | ``` r
241 | my_runif <- function(n, min, max, ...) runif(n, min, max)
242 |
243 | set.seed(123)
244 | pmap(df_oops, my_runif)
245 | #> [[1]]
246 | #> [1] 0.2875775
247 | #>
248 | #> [[2]]
249 | #> [1] 80.94746 46.80792
250 | #>
251 | #> [[3]]
252 | #> [1] 894.7157 946.4206 141.0008
253 | ```
254 |
255 | ## Add the generated data to the data frame as a list-column
256 |
257 | ``` r
258 | set.seed(123)
259 | (df_aug <- df %>%
260 | mutate(data = pmap(., runif)))
261 | #> # A tibble: 3 x 4
262 | #> n min max data
263 | #>   <int> <dbl> <dbl> <list>
264 | #> 1     1     0     1 <dbl [1]>
265 | #> 2     2    10   100 <dbl [2]>
266 | #> 3     3   100  1000 <dbl [3]>
267 | #View(df_aug)
268 | ```
269 |
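A follow-on sketch: `tidyr::unnest()` flattens the `data` list-column back
out, one row per generated value.

``` r
df_aug %>%
  unnest()
```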
270 | What about computing within a data frame, in the presence of the
271 | complications discussed above? Use `list()` in the place of the `.`
272 | placeholder above to select the target variables and, if necessary, map
273 | variable names to argument names. *Thanks @hadley for [sharing this
274 | trick](https://community.rstudio.com/t/dplyr-alternatives-to-rowwise/8071/29).*
275 |
276 | How to address variable names \!= argument names:
277 |
278 | ``` r
279 | foofy <- tibble(
280 | alpha = 1:3, ## was: n
281 | beta = c(0, 10, 100), ## was: min
282 | gamma = c(1, 100, 1000) ## was: max
283 | )
284 |
285 | set.seed(123)
286 | foofy %>%
287 | mutate(data = pmap(list(n = alpha, min = beta, max = gamma), runif))
288 | #> # A tibble: 3 x 4
289 | #> alpha beta gamma data
290 | #>   <int> <dbl> <dbl> <list>
291 | #> 1     1     0     1 <dbl [1]>
292 | #> 2     2    10   100 <dbl [2]>
293 | #> 3     3   100  1000 <dbl [3]>
294 | ```
295 |
296 | How to address the presence of ‘extra variables’, with either an inclusion
297 | or exclusion mentality:
298 |
299 | ``` r
300 | df_oops <- tibble(
301 | n = 1:3,
302 | min = c(0, 10, 100),
303 | max = c(1, 100, 1000),
304 | oops = c("please", "ignore", "me")
305 | )
306 |
307 | set.seed(123)
308 | df_oops %>%
309 | mutate(data = pmap(list(n, min, max), runif))
310 | #> # A tibble: 3 x 5
311 | #> n min max oops data
312 | #>   <int> <dbl> <dbl> <chr>  <list>
313 | #> 1     1     0     1 please <dbl [1]>
314 | #> 2     2    10   100 ignore <dbl [2]>
315 | #> 3     3   100  1000 me     <dbl [3]>
316 |
317 | df_oops %>%
318 | mutate(data = pmap(select(., -oops), runif))
319 | #> # A tibble: 3 x 5
320 | #> n min max oops data
321 | #>   <int> <dbl> <dbl> <chr>  <list>
322 | #> 1     1     0     1 please <dbl [1]>
323 | #> 2     2    10   100 ignore <dbl [2]>
324 | #> 3     3   100  1000 me     <dbl [3]>
325 | ```
326 |
327 | ## Review
328 |
329 | What have we done?
330 |
331 | - Arranged inputs as rows in a data frame
332 | - Used `pmap()` to implement a loop over the rows.
333 | - Used dplyr verbs `rename()` and `select()` to manipulate data on the
334 | way into `pmap()`.
335 | - Wrote custom wrappers around `runif()` to deal with:
336 | - df var names \!= `.f()` arg names
337 | - df vars that aren’t formal args of `.f()`
338 | - Demonstrated all of the above when working inside a data frame and
339 | adding generated data as a list-column
340 |
--------------------------------------------------------------------------------
/ex07_group-by-summarise.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Work on groups of rows via dplyr::group_by() + summarise()"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 |
19 | #' What if you need to work on groups of rows, such as the groups induced by
20 | #' the levels of a factor?
21 | #'
22 | #' You do not need to ... split the data frame into mini-data-frames, loop over
23 | #' them, and glue it all back together.
24 | #'
25 | #' Instead, use `dplyr::group_by()`, followed by `dplyr::summarize()`, to
26 | #' compute group-wise summaries.
27 |
28 | library(tidyverse)
29 |
30 | iris %>%
31 | group_by(Species) %>%
32 | summarise(pl_avg = mean(Petal.Length), pw_avg = mean(Petal.Width))
33 |
34 | #' What if you want to return summaries that are not just a single number?
35 | #'
36 | #' This does not "just work".
37 | iris %>%
38 | group_by(Species) %>%
39 | summarise(pl_qtile = quantile(Petal.Length, c(0.25, 0.5, 0.75)))
40 |
41 | #' Solution: package as a length-1 list that contains 3 values, creating a
42 | #' list-column.
43 | iris %>%
44 | group_by(Species) %>%
45 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75))))
46 |
47 | #' Q from
48 | #' [\@jcpsantiago](https://twitter.com/jcpsantiago/status/983997363298717696) via
49 | #' Twitter: How would you unnest so the final output is a data frame with a
50 | #' factor column `quantile` with levels "25%", "50%", and "75%"?
51 | #'
52 | #' A: I would `map()` `tibble::enframe()` on the new list column, to convert
53 | #' each entry from a named list to a two-column data frame. Then use
54 | #' `tidyr::unnest()` to get rid of the list column and return to a simple data
55 | #' frame and, if you like, convert `quantile` into a factor.
56 |
57 | iris %>%
58 | group_by(Species) %>%
59 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75)))) %>%
60 | mutate(pl_qtile = map(pl_qtile, enframe, name = "quantile")) %>%
61 | unnest() %>%
62 | mutate(quantile = factor(quantile))
63 |
64 | #' If something like this comes up a lot in an analysis, you could package the
65 | #' key "moves" in a function, like so:
66 | enquantile <- function(x, ...) {
67 | qtile <- enframe(quantile(x, ...), name = "quantile")
68 | qtile$quantile <- factor(qtile$quantile)
69 | list(qtile)
70 | }
71 |
72 | #' This makes repeated downstream usage more concise.
73 | iris %>%
74 | group_by(Species) %>%
75 | summarise(pl_qtile = enquantile(Petal.Length, c(0.25, 0.5, 0.75))) %>%
76 | unnest()
77 |
78 |
--------------------------------------------------------------------------------
/ex07_group-by-summarise.md:
--------------------------------------------------------------------------------
1 | Work on groups of rows via dplyr::group\_by() + summarise()
2 | ================
3 | Jenny Bryan
4 | 2018-04-11
5 |
6 | What if you need to work on groups of rows, such as the groups induced
7 | by the levels of a factor?
8 |
9 | You do not need to … split the data frame into mini-data-frames, loop
10 | over them, and glue it all back together.
11 |
12 | Instead, use `dplyr::group_by()`, followed by `dplyr::summarize()`, to
13 | compute group-wise summaries.
14 |
15 | ``` r
16 | library(tidyverse)
17 |
18 | iris %>%
19 | group_by(Species) %>%
20 | summarise(pl_avg = mean(Petal.Length), pw_avg = mean(Petal.Width))
21 | #> # A tibble: 3 x 3
22 | #> Species pl_avg pw_avg
23 | #>   <fct>       <dbl>  <dbl>
24 | #> 1 setosa 1.46 0.246
25 | #> 2 versicolor 4.26 1.33
26 | #> 3 virginica 5.55 2.03
27 | ```
28 |
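`summarise()` can compute several group-wise summaries at once; a small
sketch that also records the group size with `n()`:

``` r
iris %>%
  group_by(Species) %>%
  summarise(n = n(), pl_min = min(Petal.Length), pl_max = max(Petal.Length))
```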
29 | What if you want to return summaries that are not just a single number?
30 |
31 | This does not “just work”.
32 |
33 | ``` r
34 | iris %>%
35 | group_by(Species) %>%
36 | summarise(pl_qtile = quantile(Petal.Length, c(0.25, 0.5, 0.75)))
37 | #> Error in summarise_impl(.data, dots): Column `pl_qtile` must be length 1 (a summary value), not 3
38 | ```
39 |
40 | Solution: package as a length-1 list that contains 3 values, creating a
41 | list-column.
42 |
43 | ``` r
44 | iris %>%
45 | group_by(Species) %>%
46 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75))))
47 | #> # A tibble: 3 x 2
48 | #> Species pl_qtile
49 | #>   <fct>      <list>
50 | #> 1 setosa     <dbl [3]>
51 | #> 2 versicolor <dbl [3]>
52 | #> 3 virginica  <dbl [3]>
53 | ```
54 |
55 | Q from
56 | [@jcpsantiago](https://twitter.com/jcpsantiago/status/983997363298717696)
57 | via Twitter: How would you unnest so the final output is a data frame
58 | with a factor column `quantile` with levels “25%”, “50%”, and “75%”?
59 |
60 | A: I would `map()` `tibble::enframe()` on the new list column, to
61 | convert each entry from named list to a two-column data frame. Then use
62 | `tidyr::unnest()` to get rid of the list column and return to a simple
63 | data frame and, if you like, convert `quantile` into a factor.
64 |
65 | ``` r
66 | iris %>%
67 | group_by(Species) %>%
68 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75)))) %>%
69 | mutate(pl_qtile = map(pl_qtile, enframe, name = "quantile")) %>%
70 | unnest() %>%
71 | mutate(quantile = factor(quantile))
72 | #> # A tibble: 9 x 3
73 | #> Species quantile value
74 | #>   <fct>      <fct>    <dbl>
75 | #> 1 setosa 25% 1.40
76 | #> 2 setosa 50% 1.50
77 | #> 3 setosa 75% 1.58
78 | #> 4 versicolor 25% 4.00
79 | #> 5 versicolor 50% 4.35
80 | #> 6 versicolor 75% 4.60
81 | #> 7 virginica 25% 5.10
82 | #> 8 virginica 50% 5.55
83 | #> 9 virginica 75% 5.88
84 | ```
85 |
86 | If something like this comes up a lot in an analysis, you could package
87 | the key “moves” in a function, like so:
88 |
89 | ``` r
90 | enquantile <- function(x, ...) {
91 | qtile <- enframe(quantile(x, ...), name = "quantile")
92 | qtile$quantile <- factor(qtile$quantile)
93 | list(qtile)
94 | }
95 | ```
96 |
97 | This makes repeated downstream usage more concise.
98 |
99 | ``` r
100 | iris %>%
101 | group_by(Species) %>%
102 | summarise(pl_qtile = enquantile(Petal.Length, c(0.25, 0.5, 0.75))) %>%
103 | unnest()
104 | #> # A tibble: 9 x 3
105 | #> Species quantile value
106 | #>   <fct>      <fct>    <dbl>
107 | #> 1 setosa 25% 1.40
108 | #> 2 setosa 50% 1.50
109 | #> 3 setosa 75% 1.58
110 | #> 4 versicolor 25% 4.00
111 | #> 5 versicolor 50% 4.35
112 | #> 6 versicolor 75% 4.60
113 | #> 7 virginica 25% 5.10
114 | #> 8 virginica 50% 5.55
115 | #> 9 virginica 75% 5.88
116 | ```
117 |
--------------------------------------------------------------------------------
/ex08_nesting-is-good.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Why nesting is worth the awkwardness"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #+ body
17 | # ----
18 | library(gapminder)
19 | library(tidyverse)
20 |
21 | # ----
22 | #' gapminder data for Asia only
23 | gap <- gapminder %>%
24 | filter(continent == "Asia") %>%
25 | mutate(yr1952 = year - 1952)
26 |
27 | #+ alpha-order
28 | ggplot(gap, aes(x = lifeExp, y = country)) +
29 | geom_point()
30 |
31 | #' Countries are in alphabetical order.
32 | #'
33 | #' Set factor levels with intent. Example: order based on life expectancy in
34 | #' 2007, the last year in this dataset. Imagine you want this to persist across
35 | #' an entire analysis.
36 | gap <- gap %>%
37 | mutate(country = fct_reorder2(country, .x = year, .y = lifeExp))
38 |
39 | #+ principled-order
40 | ggplot(gap, aes(x = lifeExp, y = country)) +
41 | geom_point()
42 |
43 |
44 | #' Much better!
45 | #'
46 | #' Now imagine we want to fit a model to each country and look at dot plots of
47 | #' slope and intercept.
48 | #'
49 | #' `dplyr::group_by()` + `tidyr::nest()` creates a *nested data frame* and is an
50 | #' alternative to splitting into country-specific data frames. Those data frames
51 | #' end up, instead, in a list-column. The `country` variable remains as a normal
52 | #' factor.
53 | gap_nested <- gap %>%
54 | group_by(country) %>%
55 | nest()
56 |
57 | gap_nested
58 | gap_nested$data[[1]]
59 |
60 | gap_fitted <- gap_nested %>%
61 | mutate(fit = map(data, ~ lm(lifeExp ~ yr1952, data = .x)))
62 | gap_fitted
63 | gap_fitted$fit[[1]]
64 |
65 | gap_fitted <- gap_fitted %>%
66 | mutate(
67 | intercept = map_dbl(fit, ~ coef(.x)[["(Intercept)"]]),
68 | slope = map_dbl(fit, ~ coef(.x)[["yr1952"]])
69 | )
70 | gap_fitted
71 |
72 | #+ principled-order-coef-ests
73 | ggplot(gap_fitted, aes(x = intercept, y = country)) +
74 | geom_point()
75 |
76 | ggplot(gap_fitted, aes(x = slope, y = country)) +
77 | geom_point()
78 |
79 | #' The `split()` + `lapply()` + `do.call(rbind, ...)` approach.
80 | #'
81 | #' Split gap into many data frames, one per country.
82 | gap_split <- split(gap, gap$country)
83 |
84 | #' Fit a model to each country.
85 | gap_split_fits <- lapply(
86 | gap_split,
87 | function(df) {
88 | lm(lifeExp ~ yr1952, data = df)
89 | }
90 | )
91 | #' Oops ... the unused levels of country are a problem (empty data frames in our
92 | #' list).
93 | #'
94 | #' Drop unused levels in country and split.
95 | gap_split <- split(droplevels(gap), droplevels(gap)$country)
96 | head(gap_split, 2)
97 |
98 | #' Fit a model to each country and get coefficients with `coef()`.
99 | gap_split_coefs <- lapply(
100 | gap_split,
101 | function(df) {
102 | coef(lm(lifeExp ~ yr1952, data = df))
103 | }
104 | )
105 | head(gap_split_coefs, 2)
106 |
107 | #' Now we need to put everything back together. Row bind the list of coefs.
108 | #' Coerce from matrix back to data frame.
109 | gap_split_coefs <- as.data.frame(do.call(rbind, gap_split_coefs))
110 |
111 | #' Restore `country` variable from row names.
112 | gap_split_coefs$country <- rownames(gap_split_coefs)
113 | str(gap_split_coefs)
114 |
115 | #+ revert-to-alphabetical
116 | ggplot(gap_split_coefs, aes(x = `(Intercept)`, y = country)) +
117 | geom_point()
118 | #' Uh-oh, we lost the order of the `country` factor, due to coercion from factor
119 | #' to character (list and then row names).
120 | #'
121 | #' The `nest()` approach lets you keep data as data, rather than in attributes
122 | #' such as list names or row names. It preserves factors and their levels, and
123 | #' integer variables. It designs away various opportunities for different
124 | #' pieces of the dataset to get "out of sync" with each other, by leaving them
125 | #' in a data frame at all times.
126 | #'
127 | #' First in an interesting series of blog posts exploring these patterns and
128 | #' asking whether the tidyverse still needs a way to include the nesting
129 | #' variable in the nested data:
130 | #'
131 |
--------------------------------------------------------------------------------
/ex08_nesting-is-good.md:
--------------------------------------------------------------------------------
1 | Why nesting is worth the awkwardness
2 | ================
3 | Jenny Bryan
4 | 2018-04-12
5 |
6 | ``` r
7 | library(gapminder)
8 | library(tidyverse)
9 | ```
10 |
11 | gapminder data for Asia only
12 |
13 | ``` r
14 | gap <- gapminder %>%
15 | filter(continent == "Asia") %>%
16 | mutate(yr1952 = year - 1952)
17 | ```
18 |
19 | ``` r
20 | ggplot(gap, aes(x = lifeExp, y = country)) +
21 | geom_point()
22 | ```
23 |
24 | 
25 |
26 | Countries are in alphabetical order.
27 |
28 | Set factor levels with intent. Example: order based on life expectancy
29 | in 2007, the last year in this dataset. Imagine you want this to persist
30 | across an entire analysis.
31 |
32 | ``` r
33 | gap <- gap %>%
34 | mutate(country = fct_reorder2(country, .x = year, .y = lifeExp))
35 | ```
36 |
37 | ``` r
38 | ggplot(gap, aes(x = lifeExp, y = country)) +
39 | geom_point()
40 | ```
41 |
42 | 
43 |
44 | Much better\!
45 |
46 | Now imagine we want to fit a model to each country and look at dot plots
47 | of slope and intercept.
48 |
49 | `dplyr::group_by()` + `tidyr::nest()` creates a *nested data frame* and
50 | is an alternative to splitting into country-specific data frames. Those
51 | data frames end up, instead, in a list-column. The `country` variable
52 | remains as a normal factor.
53 |
54 | ``` r
55 | gap_nested <- gap %>%
56 | group_by(country) %>%
57 | nest()
58 |
59 | gap_nested
60 | #> # A tibble: 33 x 2
61 | #> country data
62 | #>   <fct>            <list>
63 | #> 1 Afghanistan      <tibble [12 × 6]>
64 | #> 2 Bahrain          <tibble [12 × 6]>
65 | #> 3 Bangladesh       <tibble [12 × 6]>
66 | #> 4 Cambodia         <tibble [12 × 6]>
67 | #> 5 China            <tibble [12 × 6]>
68 | #> 6 Hong Kong, China <tibble [12 × 6]>
69 | #> 7 India            <tibble [12 × 6]>
70 | #> 8 Indonesia        <tibble [12 × 6]>
71 | #> 9 Iran             <tibble [12 × 6]>
72 | #> 10 Iraq            <tibble [12 × 6]>
73 | #> # ... with 23 more rows
74 | gap_nested$data[[1]]
75 | #> # A tibble: 12 x 6
76 | #> continent year lifeExp pop gdpPercap yr1952
77 | #>   <fct>     <int>   <dbl>    <int>     <dbl>  <dbl>
78 | #> 1 Asia 1952 28.8 8425333 779. 0.
79 | #> 2 Asia 1957 30.3 9240934 821. 5.
80 | #> 3 Asia 1962 32.0 10267083 853. 10.
81 | #> 4 Asia 1967 34.0 11537966 836. 15.
82 | #> 5 Asia 1972 36.1 13079460 740. 20.
83 | #> 6 Asia 1977 38.4 14880372 786. 25.
84 | #> 7 Asia 1982 39.9 12881816 978. 30.
85 | #> 8 Asia 1987 40.8 13867957 852. 35.
86 | #> 9 Asia 1992 41.7 16317921 649. 40.
87 | #> 10 Asia 1997 41.8 22227415 635. 45.
88 | #> 11 Asia 2002 42.1 25268405 727. 50.
89 | #> 12 Asia 2007 43.8 31889923 975. 55.
90 |
91 | gap_fitted <- gap_nested %>%
92 | mutate(fit = map(data, ~ lm(lifeExp ~ yr1952, data = .x)))
93 | gap_fitted
94 | #> # A tibble: 33 x 3
95 | #>    country          data              fit
96 | #>    <fct>            <list>            <list>
97 | #>  1 Afghanistan      <tibble [12 × 6]> <S3: lm>
98 | #>  2 Bahrain          <tibble [12 × 6]> <S3: lm>
99 | #>  3 Bangladesh       <tibble [12 × 6]> <S3: lm>
100 | #>  4 Cambodia         <tibble [12 × 6]> <S3: lm>
101 | #>  5 China            <tibble [12 × 6]> <S3: lm>
102 | #>  6 Hong Kong, China <tibble [12 × 6]> <S3: lm>
103 | #>  7 India            <tibble [12 × 6]> <S3: lm>
104 | #>  8 Indonesia        <tibble [12 × 6]> <S3: lm>
105 | #>  9 Iran             <tibble [12 × 6]> <S3: lm>
106 | #> 10 Iraq             <tibble [12 × 6]> <S3: lm>
107 | #> # ... with 23 more rows
108 | gap_fitted$fit[[1]]
109 | #>
110 | #> Call:
111 | #> lm(formula = lifeExp ~ yr1952, data = .x)
112 | #>
113 | #> Coefficients:
114 | #> (Intercept) yr1952
115 | #> 29.9073 0.2753
116 |
117 | gap_fitted <- gap_fitted %>%
118 | mutate(
119 | intercept = map_dbl(fit, ~ coef(.x)[["(Intercept)"]]),
120 | slope = map_dbl(fit, ~ coef(.x)[["yr1952"]])
121 | )
122 | gap_fitted
123 | #> # A tibble: 33 x 5
124 | #>    country          data              fit      intercept slope
125 | #>    <fct>            <list>            <list>       <dbl> <dbl>
126 | #>  1 Afghanistan      <tibble [12 × 6]> <S3: lm>      29.9 0.275
127 | #>  2 Bahrain          <tibble [12 × 6]> <S3: lm>      52.7 0.468
128 | #>  3 Bangladesh       <tibble [12 × 6]> <S3: lm>      36.1 0.498
129 | #>  4 Cambodia         <tibble [12 × 6]> <S3: lm>      37.0 0.396
130 | #>  5 China            <tibble [12 × 6]> <S3: lm>      47.2 0.531
131 | #>  6 Hong Kong, China <tibble [12 × 6]> <S3: lm>      63.4 0.366
132 | #>  7 India            <tibble [12 × 6]> <S3: lm>      39.3 0.505
133 | #>  8 Indonesia        <tibble [12 × 6]> <S3: lm>      36.9 0.635
134 | #>  9 Iran             <tibble [12 × 6]> <S3: lm>      45.0 0.497
135 | #> 10 Iraq             <tibble [12 × 6]> <S3: lm>      50.1 0.235
136 | #> # ... with 23 more rows
137 | ```
138 |
139 | ``` r
140 | ggplot(gap_fitted, aes(x = intercept, y = country)) +
141 | geom_point()
142 | ```
143 |
144 | 
145 |
146 | ``` r
147 |
148 | ggplot(gap_fitted, aes(x = slope, y = country)) +
149 | geom_point()
150 | ```
151 |
152 | 
153 |
154 | The `split()` + `lapply()` + `do.call(rbind, ...)` approach.
155 |
156 | Split gap into many data frames, one per country.
157 |
158 | ``` r
159 | gap_split <- split(gap, gap$country)
160 | ```
161 |
162 | Fit a model to each country.
163 |
164 | ``` r
165 | gap_split_fits <- lapply(
166 | gap_split,
167 | function(df) {
168 | lm(lifeExp ~ yr1952, data = df)
169 | }
170 | )
171 | #> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
172 | ```
173 |
174 | Oops … the unused levels of country are a problem (empty data frames in
175 | our list).
176 |
177 | Drop unused levels in country and split.
178 |
179 | ``` r
180 | gap_split <- split(droplevels(gap), droplevels(gap)$country)
181 | head(gap_split, 2)
182 | #> $Japan
183 | #> # A tibble: 12 x 7
184 | #> country continent year lifeExp pop gdpPercap yr1952
185 | #> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
186 | #> 1 Japan Asia 1952 63.0 86459025 3217. 0.
187 | #> 2 Japan Asia 1957 65.5 91563009 4318. 5.
188 | #> 3 Japan Asia 1962 68.7 95831757 6577. 10.
189 | #> 4 Japan Asia 1967 71.4 100825279 9848. 15.
190 | #> 5 Japan Asia 1972 73.4 107188273 14779. 20.
191 | #> 6 Japan Asia 1977 75.4 113872473 16610. 25.
192 | #> 7 Japan Asia 1982 77.1 118454974 19384. 30.
193 | #> 8 Japan Asia 1987 78.7 122091325 22376. 35.
194 | #> 9 Japan Asia 1992 79.4 124329269 26825. 40.
195 | #> 10 Japan Asia 1997 80.7 125956499 28817. 45.
196 | #> 11 Japan Asia 2002 82.0 127065841 28605. 50.
197 | #> 12 Japan Asia 2007 82.6 127467972 31656. 55.
198 | #>
199 | #> $`Hong Kong, China`
200 | #> # A tibble: 12 x 7
201 | #> country continent year lifeExp pop gdpPercap yr1952
202 | #> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
203 | #> 1 Hong Kong, China Asia 1952 61.0 2125900 3054. 0.
204 | #> 2 Hong Kong, China Asia 1957 64.8 2736300 3629. 5.
205 | #> 3 Hong Kong, China Asia 1962 67.6 3305200 4693. 10.
206 | #> 4 Hong Kong, China Asia 1967 70.0 3722800 6198. 15.
207 | #> 5 Hong Kong, China Asia 1972 72.0 4115700 8316. 20.
208 | #> 6 Hong Kong, China Asia 1977 73.6 4583700 11186. 25.
209 | #> 7 Hong Kong, China Asia 1982 75.4 5264500 14561. 30.
210 | #> 8 Hong Kong, China Asia 1987 76.2 5584510 20038. 35.
211 | #> 9 Hong Kong, China Asia 1992 77.6 5829696 24758. 40.
212 | #> 10 Hong Kong, China Asia 1997 80.0 6495918 28378. 45.
213 | #> 11 Hong Kong, China Asia 2002 81.5 6762476 30209. 50.
214 | #> 12 Hong Kong, China Asia 2007 82.2 6980412 39725. 55.
215 | ```
216 |
217 | Fit a model to each country and extract the coefficients with `coef()`.
218 |
219 | ``` r
220 | gap_split_coefs <- lapply(
221 | gap_split,
222 | function(df) {
223 | coef(lm(lifeExp ~ yr1952, data = df))
224 | }
225 | )
226 | head(gap_split_coefs, 2)
227 | #> $Japan
228 | #> (Intercept) yr1952
229 | #> 65.1220513 0.3529042
230 | #>
231 | #> $`Hong Kong, China`
232 | #> (Intercept) yr1952
233 | #> 63.4286410 0.3659706
234 | ```
235 |
236 | Now we need to put everything back together: row-bind the list of
237 | coefficients, then coerce from matrix back to data frame.
238 |
239 | ``` r
240 | gap_split_coefs <- as.data.frame(do.call(rbind, gap_split_coefs))
241 | ```
242 |
243 | Restore `country` variable from row names.
244 |
245 | ``` r
246 | gap_split_coefs$country <- rownames(gap_split_coefs)
247 | str(gap_split_coefs)
248 | #> 'data.frame': 33 obs. of 3 variables:
249 | #> $ (Intercept): num 65.1 63.4 66.3 61.8 49.7 ...
250 | #> $ yr1952 : num 0.353 0.366 0.267 0.341 0.555 ...
251 | #> $ country : chr "Japan" "Hong Kong, China" "Israel" "Singapore" ...
252 | ```
253 |
254 | ``` r
255 | ggplot(gap_split_coefs, aes(x = `(Intercept)`, y = country)) +
256 | geom_point()
257 | ```
258 |
259 | 
260 |
261 | Uh-oh, we lost the order of the `country` factor, due to coercion from
262 | factor to character (list and then row names).
263 |
264 | The `nest()` approach lets you keep data as data, rather than stashing
265 | it in attributes such as list or row names. It preserves factors (and
266 | their levels) and integer variables. It also designs away various
267 | opportunities for different pieces of the dataset to get “out of sync”
268 | with each other, by keeping them in a data frame at all times.
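A minimal base R sketch of that failure mode (toy objects, invented for illustration): once a factor makes a round trip through row names, it has been silently coerced to character, and re-factoring reverts to alphabetical level order.

```r
# factor level order does not survive a trip through row names
f <- factor(c("b", "a"), levels = c("b", "a"))  # deliberate, non-alphabetical levels
df <- data.frame(x = 1:2)
rownames(df) <- f             # coerced to character here, levels discarded
is.character(rownames(df))    # TRUE
levels(factor(rownames(df)))  # "a" "b" -- back to alphabetical
```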
269 |
270 | First in an interesting series of blog posts exploring these patterns
271 | and asking whether the tidyverse still needs a way to include the
272 | nesting variable in the nested data:
273 |
274 |
--------------------------------------------------------------------------------
/ex08_nesting-is-good_files/figure-gfm/alpha-order-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/alpha-order-1.png
--------------------------------------------------------------------------------
/ex08_nesting-is-good_files/figure-gfm/principled-order-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/principled-order-1.png
--------------------------------------------------------------------------------
/ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-1.png
--------------------------------------------------------------------------------
/ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-2.png
--------------------------------------------------------------------------------
/ex08_nesting-is-good_files/figure-gfm/revert-to-alphabetical-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/revert-to-alphabetical-1.png
--------------------------------------------------------------------------------
/ex09_row-summaries.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Row-wise Summaries"
3 | #' author: "Jenny Bryan"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 |
8 | #+ setup, include = FALSE, cache = FALSE
9 | knitr::opts_chunk$set(
10 | collapse = TRUE,
11 | comment = "#>",
12 | error = TRUE
13 | )
14 | options(tidyverse.quiet = TRUE)
15 |
16 | #' > For rowSums, mtcars %>% mutate(rowsum = pmap_dbl(., sum)) works but what is
17 | #' > a tidy oneliner for mean or sd per row?
18 | #' > I'm looking for a tidy version of rowSums, rowMeans and similarly rowSDs...
19 | #'
20 | #' [Two](https://twitter.com/vrnijs/status/995129678284255233)
21 | #' [tweets](https://twitter.com/vrnijs/status/995193240864178177) from Vincent
22 | #' Nijs [github](https://github.com/vnijs),
23 | #' [twitter](https://twitter.com/vrnijs)
24 | #'
25 |
26 | #' Good question! This also came up when I was originally casting about for
27 | #' genuine row-wise operations, but I never worked it up. I will do so now!
28 | #' First I set up my example.
29 | #'
30 | #+ body
31 | # ----
32 | library(tidyverse)
33 |
34 | df <- tribble(
35 | ~ name, ~ t1, ~t2, ~t3,
36 | "Abby", 1, 2, 3,
37 | "Bess", 4, 5, 6,
38 | "Carl", 7, 8, 9
39 | )
40 |
41 | #' ## Use `rowSums()` and `rowMeans()` inside `dplyr::mutate()`
42 | #'
43 | #' One "tidy version" of `rowSums()` is to ... just stick `rowSums()` inside a
44 | #' tidyverse pipeline. You can use `rowSums()` and `rowMeans()` inside
45 | #' `mutate()`, because they have a method for `data.frame`:
46 | df %>%
47 | mutate(t_sum = rowSums(select_if(., is.numeric)))
48 |
49 | df %>%
50 | mutate(t_avg = rowMeans(select(., -name)))
51 |
52 | #' Above I also demonstrate the use of `select(., SOME_EXPRESSION)` to express
53 | #' which variables should be computed on. This comes up a lot in row-wise work
54 | #' with a data frame, because, almost by definition, your variables are of mixed
55 | #' type. These are just a few examples of the different ways to say "use `t1`,
56 | #' `t2`, and `t3`", so we don't try to sum or average `name`. I'll continue to
57 | #' mix these in as we go. They are equally useful when expressing which
58 | #' variables should be forwarded to `.f` inside `pmap_*()`.
59 | #'
60 | #' ## Devil's Advocate: can't you just use `rowMeans()` and `rowSums()` alone?
61 | #'
62 | #' This is a great point [raised by Diogo
63 | #' Camacho](https://twitter.com/DiogoMCamacho/status/996178967647412224). If
64 | #' `rowSums()` and `rowMeans()` get the job done, why put yourself through the
65 | #' pain of using `pmap()`, especially inside `mutate()`?
66 | #'
67 | #' There are a few reasons:
68 | #'
69 | #' * You might want to take the median or standard deviation instead of a mean
70 | #' or a sum. You can't assume that base R or an add-on package offers a row-wise
71 | #' `data.frame` method for every function you might need.
72 | #' * You might have several variables besides `name` that need to be retained,
73 | #' but that should not be forwarded to `rowSums()` or `rowMeans()`. A
74 | #' matrix-with-row-names grants you a reprieve for exactly one variable, and that
75 | #' variable had best not be integer, factor, date, or datetime, because you must
76 | #' store it as character. It's not a general solution.
77 | #' * Correctness. If you extract the numeric columns or the variables whose
78 | #' names start with `"t"`, compute `rowMeans()` on them, and then column-bind
79 | #' the result back to the data, you are responsible for making sure that the two
80 | #' objects are absolutely, positively row-aligned.
81 | #'
82 | #' I think it's important to have a general strategy for row-wise computation on
83 | #' a subset of the columns in a data frame.
84 | #'
85 | #' ## How to use an arbitrary function inside `pmap()`
86 | #'
87 | #' What if you need to apply `foo()` to rows and the universe has not provided a
88 | #' special-purpose `rowFoos()` function? Now you do need to use `pmap()` or a
89 | #' type-stable variant, with `foo()` playing the role of `.f`.
90 | #'
91 | #' This works especially well with `sum()`.
92 |
93 | df %>%
94 | mutate(t_sum = pmap_dbl(list(t1, t2, t3), sum))
95 |
96 | df %>%
97 | mutate(t_sum = pmap_dbl(select(., starts_with("t")), sum))
98 |
99 | #' But the original question was about means and standard deviations! Why is
100 | #' that any different? Look at the signature of `sum()` versus a few other
101 | #' numerical summaries:
102 | #'
103 | #+ eval = FALSE
104 | sum(..., na.rm = FALSE)
105 | mean(x, trim = 0, na.rm = FALSE, ...)
106 | median(x, na.rm = FALSE, ...)
107 | var(x, y = NULL, na.rm = FALSE, use)
108 |
109 | #' `sum()` is especially `pmap()`-friendly because it takes `...` as its primary
110 | #' argument. In contrast, `mean()` takes a vector `x` as primary argument, which
111 | #' makes it harder to just drop into `pmap()`. This is something you might never
112 | #' think about if you're used to using special-purpose helpers like
113 | #' `rowMeans()`.
114 | #'
115 | #' purrr has a family of `lift_*()` functions that help you convert between
116 | #' these forms. Here I apply `purrr::lift_vd()` to `mean()`, so I can use it
117 | #' inside `pmap()`. The "vd" says I want to convert a function that takes a
118 | #' "**v**ector" into one that takes "**d**ots".
119 | df %>%
120 | mutate(t_avg = pmap_dbl(list(t1, t2, t3), lift_vd(mean)))
121 |
122 | #' ## Strategies that use reshaping and joins
123 | #'
124 | #' Data frames simply aren't a convenient storage format if you have a frequent
125 | #' need to compute summaries, row-wise, on a subset of columns. It is highly
126 | #' suggestive that your data is in the wrong shape, i.e. it's not tidy. Here we
127 | #' explore some approaches that rely on reshaping and/or joining. They are more
128 | #' transparent than using `lift_*()` with `pmap()` inside `mutate()` and,
129 | #' consequently, more verbose.
130 | #'
131 | #' They all rely on forming row-wise summaries, then joining back to the data.
132 | #'
133 | #' ### Gather, group, summarize
134 | (s <- df %>%
135 | gather("time", "val", starts_with("t")) %>%
136 | group_by(name) %>%
137 | summarize(t_avg = mean(val), t_sum = sum(val)))
138 | df %>%
139 | left_join(s)
140 |
141 | #' ### Group then summarise, with explicit `c()`
142 | (s <- df %>%
143 | group_by(name) %>%
144 | summarise(t_avg = mean(c(t1, t2, t3))))
145 | df %>%
146 | left_join(s)
147 |
148 | #' ### Nesting
149 | #'
150 | #' Let's revisit a pattern from
151 | #' [`ex08_nesting-is-good`](ex08_nesting-is-good.md). This is another way to
152 | #' "package" up the values of `t1`, `t2`, and `t3` in a way that makes both
153 | #' `mean()` and `sum()` happy. *thanks @krlmlr*
154 | (s <- df %>%
155 | gather("key", "value", -name) %>%
156 | nest(-name) %>%
157 | mutate(
158 | sum = map(data, "value") %>% map_dbl(sum),
159 | mean = map(data, "value") %>% map_dbl(mean)
160 | ) %>%
161 | select(-data))
162 | df %>%
163 | left_join(s)
164 |
165 | #' ### Yet another way to use `rowMeans()`
166 | (s <- df %>%
167 | column_to_rownames("name") %>%
168 | rowMeans() %>%
169 | enframe())
170 | df %>%
171 | left_join(s)
172 |
173 | #' ## Maybe you should use a matrix
174 | #'
175 | #' If you truly have data where each row is:
176 | #'
177 | #' * Identifier for this observational unit
178 | #' * Homogeneous vector of length n for the unit
179 | #'
180 | #' then you do want to use a matrix with rownames. I used to do this a lot but
181 | #' found that practically none of my data analysis problems live in this simple
182 | #' world for more than a couple of hours. Eventually I always get back to a
183 | #' setting where a data frame is the most favorable receptacle, overall. YMMV.
184 | m <- matrix(
185 | 1:9,
186 | byrow = TRUE, nrow = 3,
187 | dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
188 | )
189 |
190 | cbind(m, rowsum = rowSums(m))
191 | cbind(m, rowmean = rowMeans(m))
192 |
--------------------------------------------------------------------------------
/ex09_row-summaries.md:
--------------------------------------------------------------------------------
1 | Row-wise Summaries
2 | ================
3 | Jenny Bryan
4 | 2018-05-14
5 |
6 | > For rowSums, mtcars %\>% mutate(rowsum = pmap\_dbl(., sum)) works but
7 | > what is a tidy oneliner for mean or sd per row? I’m looking for a tidy
8 | > version of rowSums, rowMeans and similarly rowSDs…
9 |
10 | [Two](https://twitter.com/vrnijs/status/995129678284255233)
11 | [tweets](https://twitter.com/vrnijs/status/995193240864178177) from
12 | Vincent Nijs [github](https://github.com/vnijs),
13 | [twitter](https://twitter.com/vrnijs)
14 |
15 | Good question\! This also came up when I was originally casting about
16 | for genuine row-wise operations, but I never worked it up. I will do so
17 | now\! First I set up my example.
18 |
19 | ``` r
20 | library(tidyverse)
21 |
22 | df <- tribble(
23 | ~ name, ~ t1, ~t2, ~t3,
24 | "Abby", 1, 2, 3,
25 | "Bess", 4, 5, 6,
26 | "Carl", 7, 8, 9
27 | )
28 | ```
29 |
30 | ## Use `rowSums()` and `rowMeans()` inside `dplyr::mutate()`
31 |
32 | One “tidy version” of `rowSums()` is to … just stick `rowSums()` inside
33 | a tidyverse pipeline. You can use `rowSums()` and `rowMeans()` inside
34 | `mutate()`, because they have a method for `data.frame`:
35 |
36 | ``` r
37 | df %>%
38 | mutate(t_sum = rowSums(select_if(., is.numeric)))
39 | #> Warning: package 'bindrcpp' was built under R version 3.4.4
40 | #> # A tibble: 3 x 5
41 | #> name t1 t2 t3 t_sum
42 | #> <chr> <dbl> <dbl> <dbl> <dbl>
43 | #> 1 Abby 1 2 3 6
44 | #> 2 Bess 4 5 6 15
45 | #> 3 Carl 7 8 9 24
46 |
47 | df %>%
48 | mutate(t_avg = rowMeans(select(., -name)))
49 | #> # A tibble: 3 x 5
50 | #> name t1 t2 t3 t_avg
51 | #> <chr> <dbl> <dbl> <dbl> <dbl>
52 | #> 1 Abby 1 2 3 2
53 | #> 2 Bess 4 5 6 5
54 | #> 3 Carl 7 8 9 8
55 | ```
56 |
57 | Above I also demonstrate the use of `select(., SOME_EXPRESSION)` to
58 | express which variables should be computed on. This comes up a lot in
59 | row-wise work with a data frame, because, almost by definition, your
60 | variables are of mixed type. These are just a few examples of the
61 | different ways to say “use `t1`, `t2`, and `t3`”, so we don’t try to sum
62 | or average `name`. I’ll continue to mix these in as we go. They are
63 | equally useful when expressing which variables should be forwarded to
64 | `.f` inside
65 | `pmap_*()`.
66 |
67 | ## Devil’s Advocate: can’t you just use `rowMeans()` and `rowSums()` alone?
68 |
69 | This is a great point [raised by Diogo
70 | Camacho](https://twitter.com/DiogoMCamacho/status/996178967647412224).
71 | If `rowSums()` and `rowMeans()` get the job done, why put yourself
72 | through the pain of using `pmap()`, especially inside `mutate()`?
73 |
74 | There are a few reasons:
75 |
76 | - You might want to take the median or standard deviation instead of a
77 | mean or a sum. You can’t assume that base R or an add-on package
78 | offers a row-wise `data.frame` method for every function you might
79 | need.
80 | - You might have several variables besides `name` that need to be
81 | retained, but that should not be forwarded to `rowSums()` or
82 | `rowMeans()`. A matrix-with-row-names grants you a reprieve for
83 | exactly one variable, and that variable had best not be integer,
84 | factor, date, or datetime, because you must store it as character.
85 | It’s not a general solution.
86 | - Correctness. If you extract the numeric columns or the variables
87 | whose names start with `"t"`, compute `rowMeans()` on them, and then
88 | column-bind the result back to the data, you are responsible for
89 | making sure that the two objects are absolutely, positively
90 | row-aligned.
91 |
92 | I think it’s important to have a general strategy for row-wise
93 | computation on a subset of the columns in a data frame.
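To make the correctness point concrete, here is a base R sketch (toy data, invented for illustration): if the row summary is computed before the rows are reordered, a later column-bind misaligns without any warning.

```r
df <- data.frame(name = c("Abby", "Bess", "Carl"),
                 t1 = c(1, 4, 7), t2 = c(2, 5, 8),
                 stringsAsFactors = FALSE)
t_avg <- rowMeans(df[c("t1", "t2")])            # computed in the original row order
df2 <- df[order(df$name, decreasing = TRUE), ]  # rows get reordered later
cbind(df2, t_avg)  # no error, but Carl's row now carries Abby's mean (1.5, not 7.5)
```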
94 |
95 | ## How to use an arbitrary function inside `pmap()`
96 |
97 | What if you need to apply `foo()` to rows and the universe has not
98 | provided a special-purpose `rowFoos()` function? Now you do need to use
99 | `pmap()` or a type-stable variant, with `foo()` playing the role of
100 | `.f`.
101 |
102 | This works especially well with `sum()`.
103 |
104 | ``` r
105 | df %>%
106 | mutate(t_sum = pmap_dbl(list(t1, t2, t3), sum))
107 | #> # A tibble: 3 x 5
108 | #> name t1 t2 t3 t_sum
109 | #> <chr> <dbl> <dbl> <dbl> <dbl>
110 | #> 1 Abby 1 2 3 6
111 | #> 2 Bess 4 5 6 15
112 | #> 3 Carl 7 8 9 24
113 |
114 | df %>%
115 | mutate(t_sum = pmap_dbl(select(., starts_with("t")), sum))
116 | #> # A tibble: 3 x 5
117 | #> name t1 t2 t3 t_sum
118 | #> <chr> <dbl> <dbl> <dbl> <dbl>
119 | #> 1 Abby 1 2 3 6
120 | #> 2 Bess 4 5 6 15
121 | #> 3 Carl 7 8 9 24
122 | ```
123 |
124 | But the original question was about means and standard deviations\! Why
125 | is that any different? Look at the signature of `sum()` versus a few
126 | other numerical summaries:
127 |
128 | ``` r
129 | sum(..., na.rm = FALSE)
130 | mean(x, trim = 0, na.rm = FALSE, ...)
131 | median(x, na.rm = FALSE, ...)
132 | var(x, y = NULL, na.rm = FALSE, use)
133 | ```
134 |
135 | `sum()` is especially `pmap()`-friendly because it takes `...` as its
136 | primary argument. In contrast, `mean()` takes a vector `x` as primary
137 | argument, which makes it harder to just drop into `pmap()`. This is
138 | something you might never think about if you’re used to using
139 | special-purpose helpers like `rowMeans()`.
140 |
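You can see the hazard directly (toy data, invented for illustration): dropping a bare `mean()` into `pmap_dbl()` does not error. The second and third inputs are silently absorbed as `trim` and `na.rm`, so you just get the first column back.

```r
library(purrr)

t1 <- c(1, 4, 7)
t2 <- c(2, 5, 8)
t3 <- c(3, 6, 9)

pmap_dbl(list(t1, t2, t3), mean)  # 1 4 7 -- t2 and t3 were never averaged
```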
141 | purrr has a family of `lift_*()` functions that help you convert between
142 | these forms. Here I apply `purrr::lift_vd()` to `mean()`, so I can use
143 | it inside `pmap()`. The “vd” says I want to convert a function that
144 | takes a “**v**ector” into one that takes “**d**ots”.
145 |
146 | ``` r
147 | df %>%
148 | mutate(t_avg = pmap_dbl(list(t1, t2, t3), lift_vd(mean)))
149 | #> # A tibble: 3 x 5
150 | #> name t1 t2 t3 t_avg
151 | #> <chr> <dbl> <dbl> <dbl> <dbl>
152 | #> 1 Abby 1 2 3 2
153 | #> 2 Bess 4 5 6 5
154 | #> 3 Carl 7 8 9 8
155 | ```
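If `lift_vd()` feels like overkill, purrr’s formula shorthand gets you to the same place, because the lambda it builds accepts all of its inputs via `...` (a sketch on the same toy data; assumes dplyr, purrr, and tibble are installed):

```r
library(dplyr)
library(purrr)

df <- tibble::tibble(name = c("Abby", "Bess", "Carl"),
                     t1 = c(1, 4, 7), t2 = c(2, 5, 8), t3 = c(3, 6, 9))

# c(...) gathers the three row values into one vector for mean()
df %>%
  mutate(t_avg = pmap_dbl(list(t1, t2, t3), ~ mean(c(...))))
```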
156 |
157 | ## Strategies that use reshaping and joins
158 |
159 | Data frames simply aren’t a convenient storage format if you have a
160 | frequent need to compute summaries, row-wise, on a subset of columns. It
161 | is highly suggestive that your data is in the wrong shape, i.e. it’s not
162 | tidy. Here we explore some approaches that rely on reshaping and/or
163 | joining. They are more transparent than using `lift_*()` with `pmap()`
164 | inside `mutate()` and, consequently, more verbose.
165 |
166 | They all rely on forming row-wise summaries, then joining back to the
167 | data.
168 |
169 | ### Gather, group, summarize
170 |
171 | ``` r
172 | (s <- df %>%
173 | gather("time", "val", starts_with("t")) %>%
174 | group_by(name) %>%
175 | summarize(t_avg = mean(val), t_sum = sum(val)))
176 | #> # A tibble: 3 x 3
177 | #> name t_avg t_sum
178 | #> <chr> <dbl> <dbl>
179 | #> 1 Abby 2 6
180 | #> 2 Bess 5 15
181 | #> 3 Carl 8 24
182 | df %>%
183 | left_join(s)
184 | #> Joining, by = "name"
185 | #> # A tibble: 3 x 6
186 | #> name t1 t2 t3 t_avg t_sum
187 | #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
188 | #> 1 Abby 1 2 3 2 6
189 | #> 2 Bess 4 5 6 5 15
190 | #> 3 Carl 7 8 9 8 24
191 | ```
192 |
193 | ### Group then summarise, with explicit `c()`
194 |
195 | ``` r
196 | (s <- df %>%
197 | group_by(name) %>%
198 | summarise(t_avg = mean(c(t1, t2, t3))))
199 | #> # A tibble: 3 x 2
200 | #> name t_avg
201 | #>
202 | #> 1 Abby 2
203 | #> 2 Bess 5
204 | #> 3 Carl 8
205 | df %>%
206 | left_join(s)
207 | #> Joining, by = "name"
208 | #> # A tibble: 3 x 5
209 | #> name t1 t2 t3 t_avg
210 | #> <chr> <dbl> <dbl> <dbl> <dbl>
211 | #> 1 Abby 1 2 3 2
212 | #> 2 Bess 4 5 6 5
213 | #> 3 Carl 7 8 9 8
214 | ```
215 |
216 | ### Nesting
217 |
218 | Let’s revisit a pattern from
219 | [`ex08_nesting-is-good`](ex08_nesting-is-good.md). This is another way
220 | to “package” up the values of `t1`, `t2`, and `t3` in a way that makes
221 | both `mean()` and `sum()` happy. *thanks @krlmlr*
222 |
223 | ``` r
224 | (s <- df %>%
225 | gather("key", "value", -name) %>%
226 | nest(-name) %>%
227 | mutate(
228 | sum = map(data, "value") %>% map_dbl(sum),
229 | mean = map(data, "value") %>% map_dbl(mean)
230 | ) %>%
231 | select(-data))
232 | #> # A tibble: 3 x 3
233 | #> name sum mean
234 | #> <chr> <dbl> <dbl>
235 | #> 1 Abby 6 2
236 | #> 2 Bess 15 5
237 | #> 3 Carl 24 8
238 | df %>%
239 | left_join(s)
240 | #> Joining, by = "name"
241 | #> # A tibble: 3 x 6
242 | #> name t1 t2 t3 sum mean
243 | #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
244 | #> 1 Abby 1 2 3 6 2
245 | #> 2 Bess 4 5 6 15 5
246 | #> 3 Carl 7 8 9 24 8
247 | ```
248 |
249 | ### Yet another way to use `rowMeans()`
250 |
251 | ``` r
252 | (s <- df %>%
253 | column_to_rownames("name") %>%
254 | rowMeans() %>%
255 | enframe())
256 | #> Warning: Setting row names on a tibble is deprecated.
257 | #> # A tibble: 3 x 2
258 | #> name value
259 | #> <chr> <dbl>
260 | #> 1 Abby 2
261 | #> 2 Bess 5
262 | #> 3 Carl 8
263 | df %>%
264 | left_join(s)
265 | #> Joining, by = "name"
266 | #> # A tibble: 3 x 5
267 | #> name t1 t2 t3 value
268 | #> <chr> <dbl> <dbl> <dbl> <dbl>
269 | #> 1 Abby 1 2 3 2
270 | #> 2 Bess 4 5 6 5
271 | #> 3 Carl 7 8 9 8
272 | ```
273 |
274 | ## Maybe you should use a matrix
275 |
276 | If you truly have data where each row is:
277 |
278 | - Identifier for this observational unit
279 | - Homogeneous vector of length n for the unit
280 |
281 | then you do want to use a matrix with rownames. I used to do this a lot
282 | but found that practically none of my data analysis problems live in
283 | this simple world for more than a couple of hours. Eventually I always
284 | get back to a setting where a data frame is the most favorable
285 | receptacle, overall. YMMV.
286 |
287 | ``` r
288 | m <- matrix(
289 | 1:9,
290 | byrow = TRUE, nrow = 3,
291 | dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
292 | )
293 |
294 | cbind(m, rowsum = rowSums(m))
295 | #> t1 t2 t3 rowsum
296 | #> Abby 1 2 3 6
297 | #> Bess 4 5 6 15
298 | #> Carl 7 8 9 24
299 | cbind(m, rowmean = rowMeans(m))
300 | #> t1 t2 t3 rowmean
301 | #> Abby 1 2 3 2
302 | #> Bess 4 5 6 5
303 | #> Carl 7 8 9 8
304 | ```
305 |
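And for the `rowSDs` part of the original question: on a matrix, base R’s `apply()` over `MARGIN = 1` handles any row-wise summary, not just sums and means (a sketch on the same toy matrix):

```r
m <- matrix(
  1:9,
  byrow = TRUE, nrow = 3,
  dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3))
)

apply(m, 1, sd)      # row-wise standard deviations
apply(m, 1, median)  # row-wise medians
```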
--------------------------------------------------------------------------------
/iterate-over-rows.R:
--------------------------------------------------------------------------------
1 | #' ---
2 | #' title: "Turn data frame into a list, one component per row"
3 | #' author: "Jenny Bryan, updating work of Winston Chang"
4 | #' date: "`r format(Sys.Date())`"
5 | #' output: github_document
6 | #' ---
7 | #'
8 | #' Update of .
9 | #'
10 | #' * Added some methods, removed some methods.
11 | #' * Run every combination of problem size & method multiple times.
12 | #' * Explore different number of rows and columns, with mixed col types.
13 |
14 | library(scales)
15 | library(tidyverse)
16 |
17 | # for loop over row index
18 | f_for_loop <- function(df) {
19 | out <- vector(mode = "list", length = nrow(df))
20 | for (i in seq_along(out)) {
21 | out[[i]] <- as.list(df[i, , drop = FALSE])
22 | }
23 | out
24 | }
25 |
26 | # split into single-row data frames, then lapply over them
27 | f_split_lapply <- function(df) {
28 | df <- split(df, seq_len(nrow(df)))
29 | lapply(df, function(row) as.list(row))
30 | }
31 |
32 | # lapply over the vector of row numbers
33 | f_lapply_row <- function(df) {
34 | lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop = FALSE]))
35 | }
36 |
37 | # purrr::pmap
38 | f_pmap <- function(df) {
39 | pmap(df, list)
40 | }
41 |
42 | # purrr::transpose (happens to be exactly what's needed here)
43 | f_transpose <- function(df) {
44 | transpose(df)
45 | }
46 |
47 | ## explicit gc, then execute `expr` `n` times w/o explicit gc, return timings
48 | benchmark <- function(n = 1, expr, envir = parent.frame()) {
49 | expr <- substitute(expr)
50 | gc()
51 | map(seq_len(n), ~ system.time(eval(expr, envir), gcFirst = FALSE))
52 | }
53 |
54 | run_row_benchmark <- function(nrow, times = 5) {
55 | df <- data.frame(
56 | x = rep_len(letters, length.out = nrow),
57 | y = runif(nrow),
58 | z = seq_len(nrow)
59 | )
60 | res <- list(
61 | transpose = benchmark(times, f_transpose(df)),
62 | pmap = benchmark(times, f_pmap(df)),
63 | split_lapply = benchmark(times, f_split_lapply(df)),
64 | lapply_row = benchmark(times, f_lapply_row(df)),
65 | for_loop = benchmark(times, f_for_loop(df))
66 | )
67 | res <- map(res, ~ map_dbl(.x, "elapsed"))
68 | tibble(
69 | nrow = nrow,
70 | method = rep(names(res), lengths(res)),
71 | time = flatten_dbl(res)
72 | )
73 | }
74 |
75 | run_col_benchmark <- function(ncol, times = 5) {
76 | nrow <- 3
77 | template <- data.frame(
78 | x = letters[seq_len(nrow)],
79 | y = runif(nrow),
80 | z = seq_len(nrow)
81 | )
82 | df <- template[rep_len(seq_len(ncol(template)), length.out = ncol)]
83 | res <- list(
84 | transpose = benchmark(times, f_transpose(df)),
85 | pmap = benchmark(times, f_pmap(df)),
86 | split_lapply = benchmark(times, f_split_lapply(df)),
87 | lapply_row = benchmark(times, f_lapply_row(df)),
88 | for_loop = benchmark(times, f_for_loop(df))
89 | )
90 | res <- map(res, ~ map_dbl(.x, "elapsed"))
91 | tibble(
92 | ncol = ncol,
93 | method = rep(names(res), lengths(res)),
94 | time = flatten_dbl(res)
95 | )
96 | }
97 |
98 | ## force figs to present methods in order of time
99 | flevels <- function(df) {
100 | mutate(df, method = fct_reorder(method, .x = desc(time)))
101 | }
102 |
103 | plot_it <- function(df, what = "nrow") {
104 | log10_breaks <- trans_breaks("log10", function(x) 10 ^ x)
105 | log10_mbreaks <- function(x) {
106 | limits <- c(floor(log10(x[1])), ceiling(log10(x[2])))
107 | breaks <- 10 ^ seq(limits[1], limits[2])
108 |
109 | unlist(lapply(breaks, function(x) x * seq(0.1, 0.9, by = 0.1)))
110 | }
111 | log10_labels <- trans_format("log10", math_format(10 ^ .x))
112 |
113 | ggplot(
114 | df %>% dplyr::filter(time > 0),
115 | aes_string(x = what, y = "time", colour = "method")
116 | ) +
117 | geom_point() +
118 | stat_summary(aes(group = method), fun.y = mean, geom = "line") +
119 | scale_y_log10(
120 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks
121 | ) +
122 | scale_x_log10(
123 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks
124 | ) +
125 | labs(
126 | x = paste0("Number of ", if (what == "nrow") "rows" else "columns"),
127 | y = "Time (s)"
128 | ) +
129 | theme_bw() +
130 | theme(aspect.ratio = 1, legend.justification = "top")
131 | }
132 |
133 | ## dry runs
134 | # df_test <- run_row_benchmark(nrow = 10000) %>% flevels()
135 | # df_test <- run_col_benchmark(ncol = 10000) %>% flevels()
136 | # ggplot(df_test, aes(x = method, y = time)) +
137 | # geom_jitter(width = 0.25, height = 0) +
138 | # scale_y_log10()
139 |
140 | ## The Real Thing
141 | ## fairly fast up to 10^4, go get a coffee at 10^5 (row case only)
142 | #df_r <- map_df(10 ^ (1:5), run_row_benchmark) %>% flevels()
143 | #write_csv(df_r, "row-benchmark.csv")
144 | df_r <- read_csv("row-benchmark.csv") %>% flevels()
145 |
146 | #+ row-benchmark
147 | plot_it(df_r, "nrow")
148 | #ggsave("row-benchmark.png")
149 |
150 | #df_c <- map_df(10 ^ (1:5), run_col_benchmark) %>% flevels()
151 | #write_csv(df_c, "col-benchmark.csv")
152 | df_c <- read_csv("col-benchmark.csv") %>% flevels()
153 |
154 | #+ col-benchmark
155 | plot_it(df_c, "ncol")
156 | #ggsave("col-benchmark.png")
157 |
158 | ## used at first, but saw same dramatic gc artefacts as described here
159 | ## in my plots
160 | ## https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/
161 | ## went for a DIY solution where I control gc
162 | # library(microbenchmark)
163 | # run_row_microbenchmark <- function(nrow, times = 5) {
164 | # df <- data.frame(x = rnorm(nrow), y = runif(nrow), z = runif(nrow))
165 | # microbenchmark(
166 | # for_loop = f_for_loop(df),
167 | # split_lapply = f_split_lapply(df),
168 | # lapply_row = f_lapply_row(df),
169 | # pmap = f_pmap(df),
170 | # transpose = f_transpose(df),
171 | # times = times
172 | # ) %>%
173 | # as_tibble() %>%
174 | # rename(method = expr) %>%
175 | # mutate(method = as.character(method)) %>%
176 | # add_column(nrow = nrow, .before = 1)
177 | # }
178 |
--------------------------------------------------------------------------------
/iterate-over-rows.md:
--------------------------------------------------------------------------------
1 | Turn data frame into a list, one component per row
2 | ================
3 | Jenny Bryan, updating work of Winston Chang
4 | 2018-09-05
5 |
6 | Update of <https://gist.github.com/wch/0e564def155d976c04dd28a876dc04b4>.
7 |
8 | - Added some methods, removed some methods.
9 | - Run every combination of problem size & method multiple times.
10 | - Explore different number of rows and columns, with mixed col types.
11 |
12 |
13 |
14 | ``` r
15 | library(scales)
16 | library(tidyverse)
17 | ```
18 |
19 | ## ── Attaching packages ──────────────────────────────────── tidyverse 1.2.1 ──
20 |
21 | ## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
22 | ## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
23 | ## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
24 | ## ✔ readr 1.2.0 ✔ forcats 0.3.0
25 |
26 | ## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
27 | ## ✖ readr::col_factor() masks scales::col_factor()
28 | ## ✖ purrr::discard() masks scales::discard()
29 | ## ✖ dplyr::filter() masks stats::filter()
30 | ## ✖ dplyr::lag() masks stats::lag()
31 |
32 | ``` r
33 | # for loop over row index
34 | f_for_loop <- function(df) {
35 | out <- vector(mode = "list", length = nrow(df))
36 | for (i in seq_along(out)) {
37 | out[[i]] <- as.list(df[i, , drop = FALSE])
38 | }
39 | out
40 | }
41 |
42 | # split into single-row data frames, then lapply over them
43 | f_split_lapply <- function(df) {
44 | df <- split(df, seq_len(nrow(df)))
45 | lapply(df, function(row) as.list(row))
46 | }
47 |
48 | # lapply over the vector of row numbers
49 | f_lapply_row <- function(df) {
50 | lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop = FALSE]))
51 | }
52 |
53 | # purrr::pmap
54 | f_pmap <- function(df) {
55 | pmap(df, list)
56 | }
57 |
58 | # purrr::transpose (happens to be exactly what's needed here)
59 | f_transpose <- function(df) {
60 | transpose(df)
61 | }
62 |
63 | ## explicit gc, then execute `expr` `n` times w/o explicit gc, return timings
64 | benchmark <- function(n = 1, expr, envir = parent.frame()) {
65 | expr <- substitute(expr)
66 | gc()
67 | map(seq_len(n), ~ system.time(eval(expr, envir), gcFirst = FALSE))
68 | }
69 |
70 | run_row_benchmark <- function(nrow, times = 5) {
71 | df <- data.frame(
72 | x = rep_len(letters, length.out = nrow),
73 | y = runif(nrow),
74 | z = seq_len(nrow)
75 | )
76 | res <- list(
77 | transpose = benchmark(times, f_transpose(df)),
78 | pmap = benchmark(times, f_pmap(df)),
79 | split_lapply = benchmark(times, f_split_lapply(df)),
80 | lapply_row = benchmark(times, f_lapply_row(df)),
81 | for_loop = benchmark(times, f_for_loop(df))
82 | )
83 | res <- map(res, ~ map_dbl(.x, "elapsed"))
84 | tibble(
85 | nrow = nrow,
86 | method = rep(names(res), lengths(res)),
87 | time = flatten_dbl(res)
88 | )
89 | }
90 |
91 | run_col_benchmark <- function(ncol, times = 5) {
92 | nrow <- 3
93 | template <- data.frame(
94 | x = letters[seq_len(nrow)],
95 | y = runif(nrow),
96 | z = seq_len(nrow)
97 | )
98 | df <- template[rep_len(seq_len(ncol(template)), length.out = ncol)]
99 | res <- list(
100 | transpose = benchmark(times, f_transpose(df)),
101 | pmap = benchmark(times, f_pmap(df)),
102 | split_lapply = benchmark(times, f_split_lapply(df)),
103 | lapply_row = benchmark(times, f_lapply_row(df)),
104 | for_loop = benchmark(times, f_for_loop(df))
105 | )
106 | res <- map(res, ~ map_dbl(.x, "elapsed"))
107 | tibble(
108 | ncol = ncol,
109 | method = rep(names(res), lengths(res)),
110 | time = flatten_dbl(res)
111 | )
112 | }
113 |
114 | ## force figs to present methods in order of time
115 | flevels <- function(df) {
116 | mutate(df, method = fct_reorder(method, .x = desc(time)))
117 | }
118 |
119 | plot_it <- function(df, what = "nrow") {
120 | log10_breaks <- trans_breaks("log10", function(x) 10 ^ x)
121 | log10_mbreaks <- function(x) {
122 | limits <- c(floor(log10(x[1])), ceiling(log10(x[2])))
123 | breaks <- 10 ^ seq(limits[1], limits[2])
124 |
125 | unlist(lapply(breaks, function(x) x * seq(0.1, 0.9, by = 0.1)))
126 | }
127 | log10_labels <- trans_format("log10", math_format(10 ^ .x))
128 |
129 | ggplot(
130 | df %>% dplyr::filter(time > 0),
131 | aes_string(x = what, y = "time", colour = "method")
132 | ) +
133 | geom_point() +
134 | stat_summary(aes(group = method), fun.y = mean, geom = "line") +
135 | scale_y_log10(
136 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks
137 | ) +
138 | scale_x_log10(
139 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks
140 | ) +
141 | labs(
142 | x = paste0("Number of ", if (what == "nrow") "rows" else "columns"),
143 | y = "Time (s)"
144 | ) +
145 | theme_bw() +
146 | theme(aspect.ratio = 1, legend.justification = "top")
147 | }
148 |
149 | ## dry runs
150 | # df_test <- run_row_benchmark(nrow = 10000) %>% flevels()
151 | # df_test <- run_col_benchmark(ncol = 10000) %>% flevels()
152 | # ggplot(df_test, aes(x = method, y = time)) +
153 | # geom_jitter(width = 0.25, height = 0) +
154 | # scale_y_log10()
155 |
156 | ## The Real Thing
157 | ## fairly fast up to 10^4, go get a coffee at 10^5 (row case only)
158 | #df_r <- map_df(10 ^ (1:5), run_row_benchmark) %>% flevels()
159 | #write_csv(df_r, "row-benchmark.csv")
160 | df_r <- read_csv("row-benchmark.csv") %>% flevels()
161 | ```
162 |
163 | ## Parsed with column specification:
164 | ## cols(
165 | ## nrow = col_double(),
166 | ## method = col_character(),
167 | ## time = col_double()
168 | ## )
169 |
170 | ``` r
171 | plot_it(df_r, "nrow")
172 | ```
173 |
174 | 
175 |
176 | ``` r
177 | #ggsave("row-benchmark.png")
178 |
179 | #df_c <- map_df(10 ^ (1:5), run_col_benchmark) %>% flevels()
180 | #write_csv(df_c, "col-benchmark.csv")
181 | df_c <- read_csv("col-benchmark.csv") %>% flevels()
182 | ```
183 |
184 | ## Parsed with column specification:
185 | ## cols(
186 | ## ncol = col_double(),
187 | ## method = col_character(),
188 | ## time = col_double()
189 | ## )
190 |
191 | ``` r
192 | plot_it(df_c, "ncol")
193 | ```
194 |
195 | 
196 |
197 | ``` r
198 | #ggsave("col-benchmark.png")
199 |
200 | ## used microbenchmark at first, but my plots showed the same dramatic
201 | ## gc artefacts as described here:
202 | ## https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/
203 | ## so I went for a DIY solution where I control gc explicitly
204 | # library(microbenchmark)
205 | # run_row_microbenchmark <- function(nrow, times = 5) {
206 | # df <- data.frame(x = rnorm(nrow), y = runif(nrow), z = runif(nrow))
207 | # microbenchmark(
208 | # for_loop = f_for_loop(df),
209 | # split_lapply = f_split_lapply(df),
210 | # lapply_row = f_lapply_row(df),
211 | # pmap = f_pmap(df),
212 | # transpose = f_transpose(df),
213 | # times = times
214 | # ) %>%
215 | # as_tibble() %>%
216 | # rename(method = expr) %>%
217 | # mutate(method = as.character(method)) %>%
218 | # add_column(nrow = nrow, .before = 1)
219 | # }
220 | ```
221 |
--------------------------------------------------------------------------------
/iterate-over-rows_files/figure-gfm/col-benchmark-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/iterate-over-rows_files/figure-gfm/col-benchmark-1.png
--------------------------------------------------------------------------------
/iterate-over-rows_files/figure-gfm/row-benchmark-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/iterate-over-rows_files/figure-gfm/row-benchmark-1.png
--------------------------------------------------------------------------------
/row-benchmark.csv:
--------------------------------------------------------------------------------
1 | nrow,method,time
2 | 10,transpose,0
3 | 10,transpose,0
4 | 10,transpose,0
5 | 10,transpose,0
6 | 10,transpose,0
7 | 10,pmap,9.999999997489795e-4
8 | 10,pmap,0
9 | 10,pmap,0
10 | 10,pmap,0.0010000000002037268
11 | 10,pmap,0
12 | 10,split_lapply,0.0010000000002037268
13 | 10,split_lapply,0.0010000000002037268
14 | 10,split_lapply,9.999999997489795e-4
15 | 10,split_lapply,0.0010000000002037268
16 | 10,split_lapply,0
17 | 10,lapply_row,9.999999997489795e-4
18 | 10,lapply_row,0.0010000000002037268
19 | 10,lapply_row,9.999999997489795e-4
20 | 10,lapply_row,0
21 | 10,lapply_row,0.0010000000002037268
22 | 10,for_loop,0.0010000000002037268
23 | 10,for_loop,0.0010000000002037268
24 | 10,for_loop,9.999999997489795e-4
25 | 10,for_loop,0.0010000000002037268
26 | 10,for_loop,9.999999997489795e-4
27 | 100,transpose,0
28 | 100,transpose,0
29 | 100,transpose,0
30 | 100,transpose,0
31 | 100,transpose,0
32 | 100,pmap,9.999999997489795e-4
33 | 100,pmap,0
34 | 100,pmap,0.0010000000002037268
35 | 100,pmap,0
36 | 100,pmap,9.999999997489795e-4
37 | 100,split_lapply,0.005999999999858119
38 | 100,split_lapply,0.007000000000061846
39 | 100,split_lapply,0.007000000000061846
40 | 100,split_lapply,0.005999999999858119
41 | 100,split_lapply,0.007999999999810825
42 | 100,lapply_row,0.007000000000061846
43 | 100,lapply_row,0.007000000000061846
44 | 100,lapply_row,0.006999999999607098
45 | 100,lapply_row,0.007000000000061846
46 | 100,lapply_row,0.007000000000061846
47 | 100,for_loop,0.008000000000265572
48 | 100,for_loop,0.006999999999607098
49 | 100,for_loop,0.008000000000265572
50 | 100,for_loop,0.007999999999810825
51 | 100,for_loop,0.007000000000061846
52 | 1e3,transpose,9.999999997489795e-4
53 | 1e3,transpose,0
54 | 1e3,transpose,0
55 | 1e3,transpose,0
56 | 1e3,transpose,0
57 | 1e3,pmap,0.0019999999999527063
58 | 1e3,pmap,0.0019999999999527063
59 | 1e3,pmap,0.003000000000156433
60 | 1e3,pmap,0.0019999999999527063
61 | 1e3,pmap,0.0019999999999527063
62 | 1e3,split_lapply,0.0749999999998181
63 | 1e3,split_lapply,0.07000000000016371
64 | 1e3,split_lapply,0.07699999999977081
65 | 1e3,split_lapply,0.07700000000022555
66 | 1e3,split_lapply,0.08199999999987995
67 | 1e3,lapply_row,0.07099999999991269
68 | 1e3,lapply_row,0.07200000000011642
69 | 1e3,lapply_row,0.0749999999998181
70 | 1e3,lapply_row,0.068000000000211
71 | 1e3,lapply_row,0.08399999999983265
72 | 1e3,for_loop,0.06599999999980355
73 | 1e3,for_loop,0.06500000000005457
74 | 1e3,for_loop,0.07100000000036744
75 | 1e3,for_loop,0.07699999999977081
76 | 1e3,for_loop,0.0729999999998654
77 | 1e4,transpose,9.999999997489795e-4
78 | 1e4,transpose,0.0019999999999527063
79 | 1e4,transpose,0.0010000000002037268
80 | 1e4,transpose,0.0010000000002037268
81 | 1e4,transpose,0.0019999999999527063
82 | 1e4,pmap,0.018999999999778083
83 | 1e4,pmap,0.023000000000138243
84 | 1e4,pmap,0.02099999999973079
85 | 1e4,pmap,0.028999999999996362
86 | 1e4,pmap,0.023000000000138243
87 | 1e4,split_lapply,1.0340000000001055
88 | 1e4,split_lapply,1.074999999999818
89 | 1e4,split_lapply,1.0900000000001455
90 | 1e4,split_lapply,1.0859999999997854
91 | 1e4,split_lapply,1.1520000000000437
92 | 1e4,lapply_row,1.0160000000000764
93 | 1e4,lapply_row,1.0669999999995525
94 | 1e4,lapply_row,1.1410000000000764
95 | 1e4,lapply_row,1.2590000000000146
96 | 1e4,lapply_row,1.0799999999999272
97 | 1e4,for_loop,1.031999999999698
98 | 1e4,for_loop,1.0419999999999163
99 | 1e4,for_loop,1.1170000000001892
100 | 1e4,for_loop,1.1039999999998145
101 | 1e4,for_loop,1.0979999999999563
102 | 1e5,transpose,0.016999999999825377
103 | 1e5,transpose,0.01900000000023283
104 | 1e5,transpose,0.021000000000185537
105 | 1e5,transpose,0.02099999999973079
106 | 1e5,transpose,0.021000000000185537
107 | 1e5,pmap,0.23700000000008004
108 | 1e5,pmap,0.25700000000006185
109 | 1e5,pmap,0.3690000000001419
110 | 1e5,pmap,0.29300000000012005
111 | 1e5,pmap,0.3819999999996071
112 | 1e5,split_lapply,35.738000000000284
113 | 1e5,split_lapply,35.86400000000003
114 | 1e5,split_lapply,35.68899999999985
115 | 1e5,split_lapply,35.559999999999945
116 | 1e5,split_lapply,35.922000000000025
117 | 1e5,lapply_row,33.54099999999971
118 | 1e5,lapply_row,34.87699999999995
119 | 1e5,lapply_row,35.669000000000324
120 | 1e5,lapply_row,34.465999999999894
121 | 1e5,lapply_row,35.59400000000005
122 | 1e5,for_loop,35.01800000000003
123 | 1e5,for_loop,35.29099999999971
124 | 1e5,for_loop,34.46300000000019
125 | 1e5,for_loop,35.26099999999997
126 | 1e5,for_loop,34.33699999999999
127 |
--------------------------------------------------------------------------------
/row-benchmark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/row-benchmark.png
--------------------------------------------------------------------------------
/row-oriented-workflows.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: No
4 | SaveWorkspace: No
5 | AlwaysSaveHistory: Default
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 |
15 | AutoAppendNewline: Yes
16 | StripTrailingWhitespace: Yes
17 |
18 | BuildType: Package
19 | PackageUseDevtools: Yes
20 | PackageInstallArgs: --no-multiarch --with-keep.source
21 | PackageRoxygenize: rd,collate,namespace
22 |
--------------------------------------------------------------------------------
/wch.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Applying a function over rows of a data frame"
3 | author: "Winston Chang"
4 | output:
5 | html_document:
6 | keep_md: TRUE
7 | ---
8 |
9 | ```{r setup, include=FALSE}
10 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>", cache = TRUE)
11 | ```
12 |
13 | [Source](https://gist.github.com/wch/0e564def155d976c04dd28a876dc04b4) for this document.
14 |
15 | [RPubs](https://rpubs.com/wch/200398) for this document.
16 |
17 | @dattali [asked](https://twitter.com/daattali/status/761058049859518464), "what's a safe way to iterate over rows of a data frame?" The example was to convert each row into a list and return a list of lists, indexed first by row, then by column.
18 |
19 | A number of people gave suggestions on Twitter, which I've collected here. I've benchmarked these methods with data of various sizes; scroll down to see a plot of times.
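For concreteness, the target structure for a hypothetical two-row data frame looks like this (toy data, not part of the benchmarks below):

```{r target-shape}
toy <- data.frame(x = c("a", "b"), y = c(1.5, 2.5), stringsAsFactors = FALSE)
# Desired result: one sub-list per row, each element named by column
lapply(seq_len(nrow(toy)), function(i) as.list(toy[i, , drop = FALSE]))
```

Each method below aims at this shape, though some (notably `apply()`) coerce column types along the way.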
20 |
21 | ```{r load-packages, cache = FALSE}
22 | library(purrr)
23 | library(dplyr)
24 | library(tidyr)
25 | ```
26 |
27 |
28 | ```{r define-approaches, message=FALSE}
29 | # @dattali
30 | # Using apply (only safe when all cols are same type)
31 | f_apply <- function(df) {
32 | apply(df, 1, function(row) as.list(row))
33 | }
34 |
35 | # @drob
36 | # split + lapply
37 | f_split_lapply <- function(df) {
38 | df <- split(df, seq_len(nrow(df)))
39 | lapply(df, function(row) as.list(row))
40 | }
41 |
42 | # @winston_chang
43 | # lapply over row indices
44 | f_lapply_row <- function(df) {
45 | lapply(seq_len(nrow(df)), function(i) as.list(df[i,,drop=FALSE]))
46 | }
47 |
48 | # @winston_chang
49 | # lapply + lapply: treat data frame as a list of columns, then slice out one list per row
50 | f_lapply_lapply <- function(df) {
51 | cols <- seq_len(length(df))
52 | names(cols) <- names(df)
53 |
54 | lapply(seq_len(nrow(df)), function(row) {
55 | lapply(cols, function(col) {
56 | df[[col]][[row]]
57 | })
58 | })
59 | }
60 |
61 | # @winston_chang
62 | # purrr::by_row
63 | # 2018-03-31 Jenny Bryan: by_row() no longer exists in purrr
64 | # f_by_row <- function(df) {
65 | # res <- by_row(df, function(row) as.list(row))
66 | # res$.out
67 | # }
68 |
69 | # @JennyBryan
70 | # purrr::pmap
71 | f_pmap <- function(df) {
72 | pmap(df, list)
73 | }
74 |
75 | # purrr::pmap, but coerce df to a list first
76 | f_pmap_aslist <- function(df) {
77 | pmap(as.list(df), list)
78 | }
79 |
80 | # @krlmlr
81 | # dplyr::rowwise
82 | f_rowwise <- function(df) {
83 | df %>% rowwise %>% do(row = as.list(.))
84 | }
85 |
86 | # @JennyBryan
87 | # purrr::transpose (only works for this specific task, i.e. one sub-list per row)
88 | f_transpose <- function(df) {
89 | transpose(df)
90 | }
91 | ```
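The caveats noted in the comments above are easy to demonstrate: `apply()` pushes the data frame through `as.matrix()`, so with mixed column types everything is coerced to character, whereas `purrr::transpose()` operates on the underlying list of columns and preserves types. A minimal sketch with hypothetical toy data:

```{r type-coercion-caveat}
toy <- data.frame(x = c("a", "b"), y = c(1.5, 2.5), stringsAsFactors = FALSE)

# apply() detours through a character matrix: y comes back as character
str(apply(toy, 1, as.list)[[1]])

# transpose() slices the column list directly: y stays numeric
str(purrr::transpose(toy)[[1]])
```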
92 |
93 |
94 | Benchmark each of them, using data sets with varying numbers of rows:
95 |
96 | ```{r run-benchmark}
97 | run_benchmark <- function(nrow) {
98 | # Make some data
99 | df <- data.frame(
100 | x = rnorm(nrow),
101 | y = runif(nrow),
102 | z = runif(nrow)
103 | )
104 |
105 | res <- list(
106 | apply = system.time(f_apply(df)),
107 | split_lapply = system.time(f_split_lapply(df)),
108 | lapply_row = system.time(f_lapply_row(df)),
109 | lapply_lapply = system.time(f_lapply_lapply(df)),
110 | #by_row = system.time(f_by_row(df)),
111 | pmap = system.time(f_pmap(df)),
112 | pmap_aslist = system.time(f_pmap_aslist(df)),
113 | rowwise = system.time(f_rowwise(df)),
114 | transpose = system.time(f_transpose(df))
115 | )
116 |
117 | # Get elapsed times
118 | res <- lapply(res, `[[`, "elapsed")
119 |
120 | # Add nrow to front
121 | res <- c(nrow = nrow, res)
122 | res
123 | }
124 |
125 | # Run the benchmarks for various size data
126 | all_times <- lapply(1:5, function(n) {
127 | run_benchmark(10^n)
128 | })
129 |
130 | # Convert to data frame
131 | times <- lapply(all_times, as.data.frame)
132 | times <- do.call(rbind, times)
133 |
134 | knitr::kable(times)
135 | ```
136 |
137 |
138 | ## Plot times
139 |
140 | This plot shows the number of seconds needed to process n rows, for each method. Both the x and y axes use log scales, so each step along the x axis represents a 10x increase in the number of rows, and each step along the y axis represents a 10x increase in time.
141 |
142 | ```{r plot, message=FALSE, cache = FALSE}
143 | library(ggplot2)
144 | library(scales)
145 | library(forcats)
146 |
147 | # Convert to long format
148 | times_long <- gather(times, method, seconds, -nrow)
149 |
150 | # Set order of methods, for plots
151 | times_long$method <- fct_reorder2(
152 | times_long$method,
153 | x = times_long$nrow,
154 | y = times_long$seconds
155 | )
156 |
157 | # Plot with log-log axes
158 | ggplot(times_long, aes(x = nrow, y = seconds, colour = method)) +
159 | geom_point() +
160 | geom_line() +
161 | annotation_logticks(sides = "trbl") +
162 | theme_bw() +
163 | scale_y_continuous(trans = log10_trans(),
164 | breaks = trans_breaks("log10", function(x) 10^x),
165 | labels = trans_format("log10", math_format(10^.x)),
166 | minor_breaks = NULL) +
167 | scale_x_continuous(trans = log10_trans(),
168 | breaks = trans_breaks("log10", function(x) 10^x),
169 | labels = trans_format("log10", math_format(10^.x)),
170 | minor_breaks = NULL)
171 | ```
172 |
--------------------------------------------------------------------------------
/wch.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Applying a function over rows of a data frame"
3 | author: "Winston Chang"
4 | output:
5 | html_document:
6 | keep_md: TRUE
7 | ---
8 |
9 |
10 |
11 | [Source](https://gist.github.com/wch/0e564def155d976c04dd28a876dc04b4) for this document.
12 |
13 | [RPubs](https://rpubs.com/wch/200398) for this document.
14 |
15 | @dattali [asked](https://twitter.com/daattali/status/761058049859518464), "what's a safe way to iterate over rows of a data frame?" The example was to convert each row into a list and return a list of lists, indexed first by row, then by column.
16 |
17 | A number of people gave suggestions on Twitter, which I've collected here. I've benchmarked these methods with data of various sizes; scroll down to see a plot of times.
18 |
19 |
20 | ```r
21 | library(purrr)
22 | library(dplyr)
23 | #>
24 | #> Attaching package: 'dplyr'
25 | #> The following objects are masked from 'package:stats':
26 | #>
27 | #> filter, lag
28 | #> The following objects are masked from 'package:base':
29 | #>
30 | #> intersect, setdiff, setequal, union
31 | library(tidyr)
32 | ```
33 |
34 |
35 |
36 | ```r
37 | # @dattali
38 | # Using apply (only safe when all cols are same type)
39 | f_apply <- function(df) {
40 | apply(df, 1, function(row) as.list(row))
41 | }
42 |
43 | # @drob
44 | # split + lapply
45 | f_split_lapply <- function(df) {
46 | df <- split(df, seq_len(nrow(df)))
47 | lapply(df, function(row) as.list(row))
48 | }
49 |
50 | # @winston_chang
51 | # lapply over row indices
52 | f_lapply_row <- function(df) {
53 | lapply(seq_len(nrow(df)), function(i) as.list(df[i,,drop=FALSE]))
54 | }
55 |
56 | # @winston_chang
57 | # lapply + lapply: treat data frame as a list of columns, then slice out one list per row
58 | f_lapply_lapply <- function(df) {
59 | cols <- seq_len(length(df))
60 | names(cols) <- names(df)
61 |
62 | lapply(seq_len(nrow(df)), function(row) {
63 | lapply(cols, function(col) {
64 | df[[col]][[row]]
65 | })
66 | })
67 | }
68 |
69 | # @winston_chang
70 | # purrr::by_row
71 | # 2018-03-31 Jenny Bryan: by_row() no longer exists in purrr
72 | # f_by_row <- function(df) {
73 | # res <- by_row(df, function(row) as.list(row))
74 | # res$.out
75 | # }
76 |
77 | # @JennyBryan
78 | # purrr::pmap
79 | f_pmap <- function(df) {
80 | pmap(df, list)
81 | }
82 |
83 | # purrr::pmap, but coerce df to a list first
84 | f_pmap_aslist <- function(df) {
85 | pmap(as.list(df), list)
86 | }
87 |
88 | # @krlmlr
89 | # dplyr::rowwise
90 | f_rowwise <- function(df) {
91 | df %>% rowwise %>% do(row = as.list(.))
92 | }
93 |
94 | # @JennyBryan
95 | # purrr::transpose (only works for this specific task, i.e. one sub-list per row)
96 | f_transpose <- function(df) {
97 | transpose(df)
98 | }
99 | ```
100 |
101 |
102 | Benchmark each of them, using data sets with varying numbers of rows:
103 |
104 |
105 | ```r
106 | run_benchmark <- function(nrow) {
107 | # Make some data
108 | df <- data.frame(
109 | x = rnorm(nrow),
110 | y = runif(nrow),
111 | z = runif(nrow)
112 | )
113 |
114 | res <- list(
115 | apply = system.time(f_apply(df)),
116 | split_lapply = system.time(f_split_lapply(df)),
117 | lapply_row = system.time(f_lapply_row(df)),
118 | lapply_lapply = system.time(f_lapply_lapply(df)),
119 | #by_row = system.time(f_by_row(df)),
120 | pmap = system.time(f_pmap(df)),
121 | pmap_aslist = system.time(f_pmap_aslist(df)),
122 | rowwise = system.time(f_rowwise(df)),
123 | transpose = system.time(f_transpose(df))
124 | )
125 |
126 | # Get elapsed times
127 | res <- lapply(res, `[[`, "elapsed")
128 |
129 | # Add nrow to front
130 | res <- c(nrow = nrow, res)
131 | res
132 | }
133 |
134 | # Run the benchmarks for various size data
135 | all_times <- lapply(1:5, function(n) {
136 | run_benchmark(10^n)
137 | })
138 |
139 | # Convert to data frame
140 | times <- lapply(all_times, as.data.frame)
141 | times <- do.call(rbind, times)
142 |
143 | knitr::kable(times)
144 | ```
145 |
146 |
147 |
148 | nrow apply split_lapply lapply_row lapply_lapply pmap pmap_aslist rowwise transpose
149 | ------ ------ ------------- ----------- -------------- ------ ------------ -------- ----------
150 | 1e+01 0.000 0.000 0.001 0.000 0.001 0.001 0.044 0.000
151 | 1e+02 0.002 0.005 0.005 0.005 0.002 0.002 0.054 0.002
152 | 1e+03 0.004 0.036 0.034 0.015 0.002 0.002 0.056 0.001
153 | 1e+04 0.033 0.422 0.339 0.163 0.017 0.016 0.504 0.002
154 | 1e+05 0.527 24.720 23.743 1.808 0.201 0.220 5.322 0.017
155 |
156 |
157 | ## Plot times
158 |
159 | This plot shows the number of seconds needed to process n rows, for each method. Both the x and y axes use log scales, so each step along the x axis represents a 10x increase in the number of rows, and each step along the y axis represents a 10x increase in time.
160 |
161 |
162 | ```r
163 | library(ggplot2)
164 | library(scales)
165 | library(forcats)
166 |
167 | # Convert to long format
168 | times_long <- gather(times, method, seconds, -nrow)
169 |
170 | # Set order of methods, for plots
171 | times_long$method <- fct_reorder2(
172 | times_long$method,
173 | x = times_long$nrow,
174 | y = times_long$seconds
175 | )
176 |
177 | # Plot with log-log axes
178 | ggplot(times_long, aes(x = nrow, y = seconds, colour = method)) +
179 | geom_point() +
180 | geom_line() +
181 | annotation_logticks(sides = "trbl") +
182 | theme_bw() +
183 | scale_y_continuous(trans = log10_trans(),
184 | breaks = trans_breaks("log10", function(x) 10^x),
185 | labels = trans_format("log10", math_format(10^.x)),
186 | minor_breaks = NULL) +
187 | scale_x_continuous(trans = log10_trans(),
188 | breaks = trans_breaks("log10", function(x) 10^x),
189 | labels = trans_format("log10", math_format(10^.x)),
190 | minor_breaks = NULL)
191 | #> Warning: Transformation introduced infinite values in continuous y-axis
192 |
193 | #> Warning: Transformation introduced infinite values in continuous y-axis
194 | ```
195 |
196 | 
197 |
--------------------------------------------------------------------------------
/wch_files/figure-html/plot-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/wch_files/figure-html/plot-1.png
--------------------------------------------------------------------------------