├── .gitignore ├── LICENSE ├── README.md ├── col-benchmark.csv ├── col-benchmark.png ├── ex01_leave-it-in-the-data-frame.R ├── ex01_leave-it-in-the-data-frame.md ├── ex01_leave-it-in-the-data-frame_files └── figure-gfm │ ├── unnamed-chunk-2-1.png │ ├── unnamed-chunk-3-1.png │ ├── unnamed-chunk-3-2.png │ ├── unnamed-chunk-4-1.png │ ├── unnamed-chunk-5-1.png │ └── unnamed-chunk-6-1.png ├── ex02_create-or-mutate-in-place.R ├── ex02_create-or-mutate-in-place.md ├── ex03_row-wise-iteration-are-you-sure.R ├── ex03_row-wise-iteration-are-you-sure.md ├── ex04_map-example.R ├── ex04_map-example.md ├── ex05_attack-via-rows-or-columns.R ├── ex05_attack-via-rows-or-columns.md ├── ex06_runif-via-pmap.R ├── ex06_runif-via-pmap.md ├── ex07_group-by-summarise.R ├── ex07_group-by-summarise.md ├── ex08_nesting-is-good.R ├── ex08_nesting-is-good.md ├── ex08_nesting-is-good_files └── figure-gfm │ ├── alpha-order-1.png │ ├── principled-order-1.png │ ├── principled-order-coef-ests-1.png │ ├── principled-order-coef-ests-2.png │ └── revert-to-alphabetical-1.png ├── ex09_row-summaries.R ├── ex09_row-summaries.md ├── iterate-over-rows.R ├── iterate-over-rows.md ├── iterate-over-rows_files └── figure-gfm │ ├── col-benchmark-1.png │ └── row-benchmark-1.png ├── row-benchmark.csv ├── row-benchmark.png ├── row-oriented-workflows.Rproj ├── wch.Rmd ├── wch.md └── wch_files └── figure-html └── plot-1.png /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | wch.html 5 | wch_cache 6 | iterate-over-rows.html 7 | iterate-over-rows_cache 8 | *.key 9 | *.pdf 10 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | CC BY-SA 4.0 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Row-oriented workflows in R with the tidyverse 2 | 3 | Materials for [RStudio webinar](https://resources.rstudio.com/webinars/thinking-inside-the-box-you-can-do-that-inside-a-data-frame-april-jenny-bryan) *recording available at this link!*: 4 | 5 | Thinking inside the box: you can do that inside a data frame?! 6 | Jenny Bryan 7 | Wednesday, April 11 at 1:00pm ET / 10:00am PT 8 | [rstd.io/row-work](https://rstd.io/row-work) *<-- shortlink to this repo* 9 | Slides available [on SpeakerDeck](https://speakerdeck.com/jennybc/row-oriented-workflows-in-r-with-the-tidyverse) 10 | 11 | ## Abstract 12 | 13 | The data frame is a crucial data structure in R and, especially, in the tidyverse. Working on a column or a variable is a very natural operation, which is great. But what about row-oriented work? That also comes up frequently and is more awkward. In this webinar I’ll work through concrete code examples, exploring patterns that arise in data analysis. We’ll discuss the general notion of "split-apply-combine", row-wise work in a data frame, splitting vs. nesting, and list-columns. 14 | 15 | ## Code examples 16 | 17 | Beginner --> intermediate --> advanced 18 | Not all are used in webinar 19 | 20 | * **Leave your data in that big, beautiful data frame.** [`ex01_leave-it-in-the-data-frame`](ex01_leave-it-in-the-data-frame.md) Show the evil of creating copies of certain rows of certain variables, using Magic Numbers and cryptic names, just to save some typing. 
21 | * **Adding or modifying variables.** [`ex02_create-or-mutate-in-place`](ex02_create-or-mutate-in-place.md) `df$var <- ...` versus `dplyr::mutate()`. Recycling/safety, `df`'s as data mask, aesthetics. 22 | * **Are you SURE you need to iterate over rows?** [`ex03_row-wise-iteration-are-you-sure`](ex03_row-wise-iteration-are-you-sure.md) Don't fixate on most obvious generalization of your pilot example and risk overlooking a vectorized solution. Features a `paste()` example, then goes out with some glue glory. 23 | * **Working with non-vectorized functions.** [`ex04_map-example`](ex04_map-example.md) Small example using `purrr::map()` to apply `nrow()` to list of data frames. 24 | * **Row-wise thinking vs. column-wise thinking.** [`ex05_attack-via-rows-or-columns`](ex05_attack-via-rows-or-columns.md) Data rectangling example. Both are possible, but I find building a tibble column-by-column is less aggravating than building rows, then row binding. 25 | * **Iterate over rows of a data frame.** [`iterate-over-rows`](iterate-over-rows.md) Empirical study of reshaping a data frame into this form: a list with one component per row. Revisiting a study originally done by Winston Chang. Run times for different number of [rows](row-benchmark.png) or [columns](col-benchmark.png). 26 | * **Generate data from different distributions via `purrr::pmap()`.** [`ex06_runif-via-pmap`](ex06_runif-via-pmap.md) Use `purrr::pmap()` to generate U[min, max] data for various combinations of (n, min, max), stored as rows of a data frame. 27 | * **Are you SURE you need to iterate over groups?** [`ex07_group-by-summarise`](ex07_group-by-summarise.md) Use `dplyr::group_by()` and `dplyr::summarise()` to compute group-wise summaries, without explicitly splitting up the data frame and re-combining the results. Use `list()` to package multivariate summaries into something `summarise()` can handle, creating a list-column. 28 | * **Group-and-nest.** [`ex08_nesting-is-good`](ex08_nesting-is-good.md) How to explicitly work on groups of rows via nesting (our recommendation) vs splitting. 29 | * **Row-wise mean or sum.** [`ex09_row-summaries`](ex09_row-summaries.md) How to do `rowSums()`-y and `rowMeans()`-y work inside a data frame. 30 | 31 | ## More tips and links 32 | 33 | Big thanks to everyone who weighed in on the related [twitter thread](https://twitter.com/JennyBryan/status/980905136468910080). This was very helpful for planning content. 34 | 35 | 45 minutes is not enough! A few notes about more special functions and patterns for row-driven work. Maybe we need to do a follow up ... 36 | 37 | `tibble::enframe()` and `deframe()` are handy for getting into and out of the data frame state. 38 | 39 | `map()` and `map2()` are useful for working with list-columns inside `mutate()`. 40 | 41 | `tibble::add_row()` handy for adding a single row at an arbitrary position in data frame. 42 | 43 | `imap()` handy for iterating over something and its names or integer indices at the same time. 44 | 45 | `dplyr::case_when()` helps you get rid of hairy, nested `if () {...} else {...}` statements. 46 | 47 | Great resource on the "why?" 
of functional programming approaches (such as `map()`): 48 | -------------------------------------------------------------------------------- /col-benchmark.csv: -------------------------------------------------------------------------------- 1 | ncol,method,time 2 | 10,transpose,0 3 | 10,transpose,0 4 | 10,transpose,0 5 | 10,transpose,0 6 | 10,transpose,0 7 | 10,pmap,0 8 | 10,pmap,0 9 | 10,pmap,9.999999997489795e-4 10 | 10,pmap,0 11 | 10,pmap,0 12 | 10,split_lapply,0 13 | 10,split_lapply,0 14 | 10,split_lapply,0.0010000000002037268 15 | 10,split_lapply,0 16 | 10,split_lapply,9.999999997489795e-4 17 | 10,lapply_row,0 18 | 10,lapply_row,9.999999997489795e-4 19 | 10,lapply_row,0 20 | 10,lapply_row,0 21 | 10,lapply_row,0.0010000000002037268 22 | 10,for_loop,0.0010000000002037268 23 | 10,for_loop,0 24 | 10,for_loop,9.999999997489795e-4 25 | 10,for_loop,0 26 | 10,for_loop,0 27 | 100,transpose,0 28 | 100,transpose,0 29 | 100,transpose,0 30 | 100,transpose,0 31 | 100,transpose,0 32 | 100,pmap,0 33 | 100,pmap,9.999999997489795e-4 34 | 100,pmap,0.0010000000002037268 35 | 100,pmap,0 36 | 100,pmap,9.999999997489795e-4 37 | 100,split_lapply,0.0019999999999527063 38 | 100,split_lapply,0.0019999999999527063 39 | 100,split_lapply,0.0029999999997016857 40 | 100,split_lapply,0.0019999999999527063 41 | 100,split_lapply,0.003000000000156433 42 | 100,lapply_row,0.0019999999999527063 43 | 100,lapply_row,0.0019999999999527063 44 | 100,lapply_row,0.0020000000004074536 45 | 100,lapply_row,0.0019999999999527063 46 | 100,lapply_row,0.0019999999999527063 47 | 100,for_loop,0.0029999999997016857 48 | 100,for_loop,0.0020000000004074536 49 | 100,for_loop,0.0019999999999527063 50 | 100,for_loop,0.0019999999999527063 51 | 100,for_loop,0.0029999999997016857 52 | 1e3,transpose,0 53 | 1e3,transpose,0 54 | 1e3,transpose,0.0010000000002037268 55 | 1e3,transpose,0 56 | 1e3,transpose,0 57 | 1e3,pmap,0.0020000000004074536 58 | 1e3,pmap,0.0019999999999527063 59 | 1e3,pmap,0.0019999999999527063 60 | 1e3,pmap,0.0019999999999527063 61 | 1e3,pmap,0.0019999999999527063 62 | 1e3,split_lapply,0.022000000000389264 63 | 1e3,split_lapply,0.02599999999983993 64 | 1e3,split_lapply,0.023999999999887223 65 | 1e3,split_lapply,0.028999999999996362 66 | 1e3,split_lapply,0.02500000000009095 67 | 1e3,lapply_row,0.023000000000138243 68 | 1e3,lapply_row,0.021999999999934516 69 | 1e3,lapply_row,0.021000000000185537 70 | 1e3,lapply_row,0.027000000000043656 71 | 1e3,lapply_row,0.023000000000138243 72 | 1e3,for_loop,0.02099999999973079 73 | 1e3,for_loop,0.021000000000185537 74 | 1e3,for_loop,0.02099999999973079 75 | 1e3,for_loop,0.021000000000185537 76 | 1e3,for_loop,0.027000000000043656 77 | 1e4,transpose,0.0010000000002037268 78 | 1e4,transpose,9.999999997489795e-4 79 | 1e4,transpose,0.0010000000002037268 80 | 1e4,transpose,9.999999997489795e-4 81 | 1e4,transpose,0.0020000000004074536 82 | 1e4,pmap,0.02500000000009095 83 | 1e4,pmap,0.024999999999636202 84 | 1e4,pmap,0.027000000000043656 85 | 1e4,pmap,0.026000000000294676 86 | 1e4,pmap,0.03099999999994907 87 | 1e4,split_lapply,0.24499999999989086 88 | 1e4,split_lapply,0.23900000000003274 89 | 1e4,split_lapply,0.24899999999979627 90 | 1e4,split_lapply,0.2680000000000291 91 | 1e4,split_lapply,0.24499999999989086 92 | 1e4,lapply_row,0.22000000000025466 93 | 1e4,lapply_row,0.2369999999996253 94 | 1e4,lapply_row,0.23400000000037835 95 | 1e4,lapply_row,0.2339999999999236 96 | 1e4,lapply_row,0.22600000000011278 97 | 1e4,for_loop,0.24899999999979627 98 | 1e4,for_loop,0.23800000000028376 99 | 
1e4,for_loop,0.2519999999999527 100 | 1e4,for_loop,0.26499999999987267 101 | 1e4,for_loop,0.25700000000006185 102 | 1e5,transpose,0.01499999999987267 103 | 1e5,transpose,0.016999999999825377 104 | 1e5,transpose,0.016000000000076398 105 | 1e5,transpose,0.016999999999825377 106 | 1e5,transpose,0.027000000000043656 107 | 1e5,pmap,0.5749999999998181 108 | 1e5,pmap,0.6639999999997599 109 | 1e5,pmap,0.6190000000001419 110 | 1e5,pmap,0.7470000000002983 111 | 1e5,pmap,0.6419999999998254 112 | 1e5,split_lapply,3.2729999999996835 113 | 1e5,split_lapply,3.624000000000251 114 | 1e5,split_lapply,3.9329999999999927 115 | 1e5,split_lapply,3.380000000000109 116 | 1e5,split_lapply,3.4890000000000327 117 | 1e5,lapply_row,3.199000000000069 118 | 1e5,lapply_row,3.630000000000109 119 | 1e5,lapply_row,3.9980000000000473 120 | 1e5,lapply_row,3.5589999999997417 121 | 1e5,lapply_row,3.6010000000001128 122 | 1e5,for_loop,3.212999999999738 123 | 1e5,for_loop,3.66800000000012 124 | 1e5,for_loop,4.114000000000033 125 | 1e5,for_loop,3.882000000000062 126 | 1e5,for_loop,3.5149999999998727 127 | -------------------------------------------------------------------------------- /col-benchmark.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/col-benchmark.png -------------------------------------------------------------------------------- /ex01_leave-it-in-the-data-frame.R: -------------------------------------------------------------------------------- 1 | #' --- 2 | #' title: "Leave your data in that big, beautiful data frame" 3 | #' author: "Jenny Bryan" 4 | #' date: "`r format(Sys.Date())`" 5 | #' output: github_document 6 | #' --- 7 | 8 | #+ setup, include = FALSE, cache = FALSE 9 | knitr::opts_chunk$set( 10 | collapse = TRUE, 11 | comment = "#>", 12 | error = TRUE 13 | ) 14 | options(tidyverse.quiet = TRUE) 15 | 16 | #+ body 17 | # ---- 18 | #' ## Don't create odd little excerpts and copies of your data. 19 | #' 20 | #' Code style that results from (I speculate) minimizing the number of key 21 | #' presses. 22 | 23 | ## :( 24 | sl <- iris[51:100,1] 25 | pw <- iris[51:100,4] 26 | plot(sl ~ pw) 27 | 28 | #' This clutters the workspace with "loose parts", `sl` and `pw`. Very soon, you 29 | #' are likely to forget what they are, which `Species` of `iris` they represent, 30 | #' and what the relationship between them is. 31 | 32 | # ---- 33 | #' ## Leave the data *in situ* and reveal intent in your code 34 | #' 35 | #' More verbose code conveys intent. Eliminating the Magic Numbers makes the 36 | #' code less likely to be, or become, wrong. 37 | #' 38 | #' Here's one way to do same in a tidyverse style: 39 | library(tidyverse) 40 | 41 | ggplot( 42 | filter(iris, Species == "versicolor"), 43 | aes(x = Petal.Width, y = Sepal.Length) 44 | ) + geom_point() 45 | 46 | #' Another tidyverse approach, this time using the pipe operator, `%>%` 47 | iris %>% 48 | filter(Species == "versicolor") %>% 49 | ggplot(aes(x = Petal.Width, y = Sepal.Length)) + ## <--- NOTE the `+` sign!! 
50 | geom_point() 51 | 52 | #' A base solution that still follows the principles of 53 | #' 54 | #' * leave the data in data frame 55 | #' * convey intent 56 | plot( 57 | Sepal.Length ~ Petal.Width, 58 | data = subset(iris, subset = Species == "versicolor") 59 | ) 60 | -------------------------------------------------------------------------------- /ex01_leave-it-in-the-data-frame.md: -------------------------------------------------------------------------------- 1 | Leave your data in that big, beautiful data frame 2 | ================ 3 | Jenny Bryan 4 | 2018-04-02 5 | 6 | ## Don’t create odd little excerpts and copies of your data. 7 | 8 | Code style that results from (I speculate) minimizing the number of key 9 | presses. 10 | 11 | ``` r 12 | ## :( 13 | sl <- iris[51:100,1] 14 | pw <- iris[51:100,4] 15 | plot(sl ~ pw) 16 | ``` 17 | 18 | ![](ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-2-1.png) 19 | 20 | This clutters the workspace with “loose parts”, `sl` and `pw`. Very 21 | soon, you are likely to forget what they are, which `Species` of `iris` 22 | they represent, and what the relationship between them is. 23 | 24 | ## Leave the data *in situ* and reveal intent in your code 25 | 26 | More verbose code conveys intent. Eliminating the Magic Numbers makes 27 | the code less likely to be, or become, wrong. 28 | 29 | Here’s one way to do same in a tidyverse style: 30 | 31 | ``` r 32 | library(tidyverse) 33 | 34 | ggplot( 35 | filter(iris, Species == "versicolor"), 36 | aes(x = Petal.Width, y = Sepal.Length) 37 | ) + geom_point() 38 | ``` 39 | 40 | ![](ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-4-1.png) 41 | 42 | Another tidyverse approach, this time using the pipe operator, `%>%` 43 | 44 | ``` r 45 | iris %>% 46 | filter(Species == "versicolor") %>% 47 | ggplot(aes(x = Petal.Width, y = Sepal.Length)) + ## <--- NOTE the `+` sign!! 
48 | geom_point() 49 | ``` 50 | 51 | ![](ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-5-1.png) 52 | 53 | A base solution that still follows the principles of 54 | 55 | - leave the data in data frame 56 | - convey intent 57 | 58 | 59 | 60 | ``` r 61 | plot( 62 | Sepal.Length ~ Petal.Width, 63 | data = subset(iris, subset = Species == "versicolor") 64 | ) 65 | ``` 66 | 67 | ![](ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-6-1.png) 68 | -------------------------------------------------------------------------------- /ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-2-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-2-1.png -------------------------------------------------------------------------------- /ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-1.png -------------------------------------------------------------------------------- /ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-3-2.png -------------------------------------------------------------------------------- /ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-4-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-4-1.png -------------------------------------------------------------------------------- /ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-5-1.png -------------------------------------------------------------------------------- /ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-6-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex01_leave-it-in-the-data-frame_files/figure-gfm/unnamed-chunk-6-1.png -------------------------------------------------------------------------------- /ex02_create-or-mutate-in-place.R: -------------------------------------------------------------------------------- 1 | #' --- 2 | #' title: "Add or modify a variable" 3 | #' author: "Jenny Bryan" 4 | #' date: "`r format(Sys.Date())`" 5 | #' output: github_document 6 | #' --- 7 | 8 | #+ setup, include = FALSE, cache = FALSE 9 | knitr::opts_chunk$set( 10 | collapse = TRUE, 11 | comment = "#>", 12 | error = TRUE 13 | ) 14 | options(tidyverse.quiet = TRUE) 15 | 16 | #+ body 17 | # ---- 18 | library(tidyverse) 19 | 20 | # 
---- 21 | #' ### Function to produce a fresh example data frame 22 | new_df <- function() { 23 | tribble( 24 | ~ name, ~ age, 25 | "Reed", 14L, 26 | "Wesley", 12L, 27 | "Eli", 12L, 28 | "Toby", 1L 29 | ) 30 | } 31 | 32 | # ---- 33 | #' ## The `df$var <- ...` syntax 34 | 35 | #' How to create or modify a variable is a fairly low stakes matter, i.e. really 36 | #' a matter of taste. This is not a hill I plan to die on. But here's my two 37 | #' cents. 38 | #' 39 | #' Of course, `df$var <- ...` absolutely works for creating new variables or 40 | #' modifying existing ones. But there are downsides: 41 | #' 42 | #' * Silent recycling is a risk. 43 | #' * `df` is not special. It's not the implied place to look first for things, 44 | #' so you must be explicit. This can be a drag. 45 | #' * I have aesthetic concerns. YMMV. 46 | df <- new_df() 47 | df$eyes <- 2L 48 | df$snack <- c("chips", "cheese") 49 | df$uname <- toupper(df$name) 50 | df 51 | 52 | # ---- 53 | #' ## `dplyr::mutate()` works "inside the box" 54 | 55 | #' `dplyr::mutate()` is the tidyverse way to work on a variable. If I'm working 56 | #' in a script-y style and the tidyverse packages are already available, I 57 | #' generally prefer this method of adding or modifying a variable. 58 | #' 59 | #' * Only a length one input can be recycled. 60 | #' * `df` is the first place to look for things. It turns out that making a 61 | #' new variable out of existing variables is very, very common, so it's nice 62 | #' when this is easy. 63 | #' * This is pipe-friendly, so I can easily combine with a few other logical 64 | #' data manipuluations that need to happen around the same point. 65 | #' * I like the way this looks. YMMV. 66 | 67 | new_df() %>% 68 | mutate( 69 | eyes = 2L, 70 | snack = c("chips", "cheese"), 71 | uname = toupper(name) 72 | ) 73 | 74 | #' Oops! I did not provide enough snacks! 75 | 76 | new_df() %>% 77 | mutate( 78 | eyes = 2L, 79 | snack = c("chips", "cheese", "mixed nuts", "nerf bullets"), 80 | uname = toupper(name) 81 | ) 82 | -------------------------------------------------------------------------------- /ex02_create-or-mutate-in-place.md: -------------------------------------------------------------------------------- 1 | Add or modify a variable 2 | ================ 3 | Jenny Bryan 4 | 2018-04-10 5 | 6 | ``` r 7 | library(tidyverse) 8 | ``` 9 | 10 | ### Function to produce a fresh example data frame 11 | 12 | ``` r 13 | new_df <- function() { 14 | tribble( 15 | ~ name, ~ age, 16 | "Reed", 14L, 17 | "Wesley", 12L, 18 | "Eli", 12L, 19 | "Toby", 1L 20 | ) 21 | } 22 | ``` 23 | 24 | ## The `df$var <- ...` syntax 25 | 26 | How to create or modify a variable is a fairly low stakes matter, 27 | i.e. really a matter of taste. This is not a hill I plan to die on. But 28 | here’s my two cents. 29 | 30 | Of course, `df$var <- ...` absolutely works for creating new variables 31 | or modifying existing ones. But there are downsides: 32 | 33 | - Silent recycling is a risk. 34 | - `df` is not special. It’s not the implied place to look first for 35 | things, so you must be explicit. This can be a drag. 36 | - I have aesthetic concerns. YMMV. 
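To see why "silent recycling" makes me nervous, here's a minimal sketch (it uses a plain base `data.frame` rather than the tibble from `new_df()`, and isn't part of the original example): base assignment quietly recycles whenever the replacement length divides the number of rows, and only errors otherwise.

``` r
## sketch: silent recycling with base `$<-` on a data.frame
dat <- data.frame(name = c("Reed", "Wesley", "Eli", "Toby"))
dat$snack <- c("chips", "cheese")  # 2 values into 4 rows: recycled, no warning
dat$snack
#> [1] "chips"  "cheese" "chips"  "cheese"
```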
37 | 38 | 39 | 40 | ``` r 41 | df <- new_df() 42 | df$eyes <- 2L 43 | df$snack <- c("chips", "cheese") 44 | df$uname <- toupper(df$name) 45 | df 46 | #> # A tibble: 4 x 5 47 | #> name age eyes snack uname 48 | #> 49 | #> 1 Reed 14 2 chips REED 50 | #> 2 Wesley 12 2 cheese WESLEY 51 | #> 3 Eli 12 2 chips ELI 52 | #> 4 Toby 1 2 cheese TOBY 53 | ``` 54 | 55 | ## `dplyr::mutate()` works “inside the box” 56 | 57 | `dplyr::mutate()` is the tidyverse way to work on a variable. If I’m 58 | working in a script-y style and the tidyverse packages are already 59 | available, I generally prefer this method of adding or modifying a 60 | variable. 61 | 62 | - Only a length one input can be recycled. 63 | - `df` is the first place to look for things. It turns out that making 64 | a new variable out of existing variables is very, very common, so 65 | it’s nice when this is easy. 66 | - This is pipe-friendly, so I can easily combine with a few other 67 | logical data manipuluations that need to happen around the same 68 | point. 69 | - I like the way this looks. YMMV. 70 | 71 | 72 | 73 | ``` r 74 | new_df() %>% 75 | mutate( 76 | eyes = 2L, 77 | snack = c("chips", "cheese"), 78 | uname = toupper(name) 79 | ) 80 | #> Error in mutate_impl(.data, dots): Column `snack` must be length 4 (the number of rows) or one, not 2 81 | ``` 82 | 83 | Oops\! I did not provide enough snacks\! 84 | 85 | ``` r 86 | new_df() %>% 87 | mutate( 88 | eyes = 2L, 89 | snack = c("chips", "cheese", "mixed nuts", "nerf bullets"), 90 | uname = toupper(name) 91 | ) 92 | #> # A tibble: 4 x 5 93 | #> name age eyes snack uname 94 | #> 95 | #> 1 Reed 14 2 chips REED 96 | #> 2 Wesley 12 2 cheese WESLEY 97 | #> 3 Eli 12 2 mixed nuts ELI 98 | #> 4 Toby 1 2 nerf bullets TOBY 99 | ``` 100 | -------------------------------------------------------------------------------- /ex03_row-wise-iteration-are-you-sure.R: -------------------------------------------------------------------------------- 1 | #' --- 2 | #' title: "Are you absolutely sure that you, personally, need to iterate over rows?" 3 | #' author: "Jenny Bryan" 4 | #' date: "`r format(Sys.Date())`" 5 | #' output: github_document 6 | #' --- 7 | 8 | #+ setup, include = FALSE, cache = FALSE 9 | knitr::opts_chunk$set( 10 | collapse = TRUE, 11 | comment = "#>", 12 | error = TRUE 13 | ) 14 | options(tidyverse.quiet = TRUE) 15 | 16 | #+ body 17 | # ---- 18 | library(tidyverse) 19 | 20 | # ---- 21 | #' ## Function to give my example data frame 22 | new_df <- function() { 23 | tribble( 24 | ~ name, ~ age, 25 | "Reed", 14, 26 | "Wesley", 12, 27 | "Eli", 12, 28 | "Toby", 1 29 | ) 30 | } 31 | 32 | # ---- 33 | #' ## Single-row example can cause tunnel vision 34 | #' 35 | #' Sometimes it's easy to fixate on one (unfavorable) way of accomplishing 36 | #' something, because it feels like a natural extension of a successful 37 | #' small-scale experiment. 38 | #' 39 | #' Let's create a string from row 1 of the data frame. 40 | df <- new_df() 41 | paste(df$name[1], "is", df$age[1], "years old") 42 | 43 | #' I want to scale up, therefore I obviously must ... loop over all rows! 44 | n <- nrow(df) 45 | s <- vector(mode = "character", length = n) 46 | for (i in seq_len(n)) { 47 | s[i] <- paste(df$name[i], "is", df$age[i], "years old") 48 | } 49 | s 50 | 51 | #' HOLD ON. What if I told you `paste()` is already vectorized over its 52 | #' arguments? 
53 | paste(df$name, "is", df$age, "years old") 54 | 55 | #' A surprising number of "iterate over rows" problems can be eliminated by 56 | #' exploiting functions that are already vectorized and by making your own 57 | #' functions vectorized over the primary argument. 58 | #' 59 | #' Writing an explicit loop in your code is not necessarily bad, but it should 60 | #' always give you pause. Has someone already written this loop for you? Ideally 61 | #' in C or C++ and inside a package that's being regularly checked, with high 62 | #' test coverage. That is usually the better choice. 63 | 64 | # ---- 65 | #' ## Don't forget to work "inside the box" 66 | #' 67 | 68 | #' For this string interpolation task, we can even work with a vectorized 69 | #' function that is happy to do lookup inside a data frame. The [glue 70 | #' package](https://glue.tidyverse.org) is doing the work under the hood here, 71 | #' but its Greatest Functions are now re-exported by stringr, which we already 72 | #' attached via `library(tidyverse)`. 73 | 74 | str_glue_data(df, "{name} is {age} years old") 75 | 76 | #' You can use the simpler form, `str_glue()`, inside `dplyr::mutate()`, because 77 | #' the other variables in `df` are automatically available for use. 78 | 79 | df %>% 80 | mutate(sentence = str_glue("{name} is {age} years old")) 81 | 82 | #' The tidyverse style is to manage data holistically in a data frame and 83 | #' provide a user interface that encourages self-explaining code with low 84 | #' "syntactical noise". 85 | -------------------------------------------------------------------------------- /ex03_row-wise-iteration-are-you-sure.md: -------------------------------------------------------------------------------- 1 | Are you absolutely sure that you, personally, need to iterate over rows? 2 | ================ 3 | Jenny Bryan 4 | 2018-04-02 5 | 6 | ``` r 7 | library(tidyverse) 8 | ``` 9 | 10 | ## Function to give my example data frame 11 | 12 | ``` r 13 | new_df <- function() { 14 | tribble( 15 | ~ name, ~ age, 16 | "Reed", 14, 17 | "Wesley", 12, 18 | "Eli", 12, 19 | "Toby", 1 20 | ) 21 | } 22 | ``` 23 | 24 | ## Single-row example can cause tunnel vision 25 | 26 | Sometimes it’s easy to fixate on one (unfavorable) way of accomplishing 27 | something, because it feels like a natural extension of a successful 28 | small-scale experiment. 29 | 30 | Let’s create a string from row 1 of the data frame. 31 | 32 | ``` r 33 | df <- new_df() 34 | paste(df$name[1], "is", df$age[1], "years old") 35 | #> [1] "Reed is 14 years old" 36 | ``` 37 | 38 | I want to scale up, therefore I obviously must … loop over all rows\! 39 | 40 | ``` r 41 | n <- nrow(df) 42 | s <- vector(mode = "character", length = n) 43 | for (i in seq_len(n)) { 44 | s[i] <- paste(df$name[i], "is", df$age[i], "years old") 45 | } 46 | s 47 | #> [1] "Reed is 14 years old" "Wesley is 12 years old" 48 | #> [3] "Eli is 12 years old" "Toby is 1 years old" 49 | ``` 50 | 51 | HOLD ON. What if I told you `paste()` is already vectorized over its 52 | arguments? 53 | 54 | ``` r 55 | paste(df$name, "is", df$age, "years old") 56 | #> [1] "Reed is 14 years old" "Wesley is 12 years old" 57 | #> [3] "Eli is 12 years old" "Toby is 1 years old" 58 | ``` 59 | 60 | A surprising number of “iterate over rows” problems can be eliminated by 61 | exploiting functions that are already vectorized and by making your own 62 | functions vectorized over the primary argument. 
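The second half of that claim deserves a quick illustration. Here's a sketch (the `describe()` helper is invented for this aside, not part of the original example): build your function out of vectorized pieces and it is vectorized over its primary arguments for free, so it works on whole columns without a loop.

``` r
describe <- function(name, age) {
  ## paste() does the element-wise work, so describe() is vectorized too
  paste(name, "is", age, "years old")
}
describe(df$name, df$age)
#> [1] "Reed is 14 years old"   "Wesley is 12 years old"
#> [3] "Eli is 12 years old"    "Toby is 1 years old"
```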
63 | 64 | Writing an explicit loop in your code is not necessarily bad, but it 65 | should always give you pause. Has someone already written this loop for 66 | you? Ideally in C or C++ and inside a package that’s being regularly 67 | checked, with high test coverage. That is usually the better choice. 68 | 69 | ## Don’t forget to work “inside the box” 70 | 71 | For this string interpolation task, we can even work with a vectorized 72 | function that is happy to do lookup inside a data frame. The [glue 73 | package](https://glue.tidyverse.org) is doing the work under the hood 74 | here, but its Greatest Functions are now re-exported by stringr, which 75 | we already attached via `library(tidyverse)`. 76 | 77 | ``` r 78 | str_glue_data(df, "{name} is {age} years old") 79 | #> Reed is 14 years old 80 | #> Wesley is 12 years old 81 | #> Eli is 12 years old 82 | #> Toby is 1 years old 83 | ``` 84 | 85 | You can use the simpler form, `str_glue()`, inside `dplyr::mutate()`, 86 | because the other variables in `df` are automatically available for use. 87 | 88 | ``` r 89 | df %>% 90 | mutate(sentence = str_glue("{name} is {age} years old")) 91 | #> # A tibble: 4 x 3 92 | #> name age sentence 93 | #> 94 | #> 1 Reed 14. Reed is 14 years old 95 | #> 2 Wesley 12. Wesley is 12 years old 96 | #> 3 Eli 12. Eli is 12 years old 97 | #> 4 Toby 1. Toby is 1 years old 98 | ``` 99 | 100 | The tidyverse style is to manage data holistically in a data frame and 101 | provide a user interface that encourages self-explaining code with low 102 | “syntactical noise”. 103 | -------------------------------------------------------------------------------- /ex04_map-example.R: -------------------------------------------------------------------------------- 1 | #' --- 2 | #' title: "Small demo of purrr::map()" 3 | #' author: "Jenny Bryan" 4 | #' date: "`r format(Sys.Date())`" 5 | #' output: github_document 6 | #' --- 7 | 8 | #+ setup, include = FALSE, cache = FALSE 9 | knitr::opts_chunk$set( 10 | collapse = TRUE, 11 | comment = "#>", 12 | error = TRUE 13 | ) 14 | options(tidyverse.quiet = TRUE) 15 | 16 | #+ body 17 | # ---- 18 | #' ## `purrr::map()` can be used to work with functions that aren't vectorized. 19 | 20 | df_list <- list( 21 | iris = head(iris, 2), 22 | mtcars = head(mtcars, 3) 23 | ) 24 | df_list 25 | 26 | #' This does not work. `nrow()` expects a single data frame as input. 27 | nrow(df_list) 28 | 29 | #' `purrr::map()` applies `nrow()` to each element of `df_list`. 30 | library(purrr) 31 | 32 | map(df_list, nrow) 33 | 34 | #' Different calling styles make sense in more complicated situations. Hard to 35 | #' justify in this simple example. 36 | map(df_list, ~ nrow(.x)) 37 | 38 | df_list %>% 39 | map(nrow) 40 | 41 | #' If you know what the return type is (or *should* be), use a type-specific 42 | #' variant of `map()`. 43 | 44 | map_int(df_list, ~ nrow(.x)) 45 | 46 | #' More on coverage of `map()` and friends: . 47 | -------------------------------------------------------------------------------- /ex04_map-example.md: -------------------------------------------------------------------------------- 1 | Small demo of purrr::map() 2 | ================ 3 | Jenny Bryan 4 | 2018-04-10 5 | 6 | ## `purrr::map()` can be used to work with functions that aren’t vectorized. 
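(A quick orienting sketch, added as an aside and not part of the original demo: a vectorized function such as `nchar()` loops over its input for you, whereas `nrow()` only understands a single data frame; `map()` supplies the loop that `nrow()` lacks.)

``` r
nchar(c("iris", "mtcars"))  # vectorized: one call, one result per element
#> [1] 4 6
```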
7 | 8 | ``` r 9 | df_list <- list( 10 | iris = head(iris, 2), 11 | mtcars = head(mtcars, 3) 12 | ) 13 | df_list 14 | #> $iris 15 | #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species 16 | #> 1 5.1 3.5 1.4 0.2 setosa 17 | #> 2 4.9 3.0 1.4 0.2 setosa 18 | #> 19 | #> $mtcars 20 | #> mpg cyl disp hp drat wt qsec vs am gear carb 21 | #> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 22 | #> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 23 | #> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 24 | ``` 25 | 26 | This does not work. `nrow()` expects a single data frame as input. 27 | 28 | ``` r 29 | nrow(df_list) 30 | #> NULL 31 | ``` 32 | 33 | `purrr::map()` applies `nrow()` to each element of `df_list`. 34 | 35 | ``` r 36 | library(purrr) 37 | 38 | map(df_list, nrow) 39 | #> $iris 40 | #> [1] 2 41 | #> 42 | #> $mtcars 43 | #> [1] 3 44 | ``` 45 | 46 | Different calling styles make sense in more complicated situations. Hard 47 | to justify in this simple example. 48 | 49 | ``` r 50 | map(df_list, ~ nrow(.x)) 51 | #> $iris 52 | #> [1] 2 53 | #> 54 | #> $mtcars 55 | #> [1] 3 56 | 57 | df_list %>% 58 | map(nrow) 59 | #> $iris 60 | #> [1] 2 61 | #> 62 | #> $mtcars 63 | #> [1] 3 64 | ``` 65 | 66 | If you know what the return type is (or *should* be), use a 67 | type-specific variant of `map()`. 68 | 69 | ``` r 70 | map_int(df_list, ~ nrow(.x)) 71 | #> iris mtcars 72 | #> 2 3 73 | ``` 74 | 75 | More on coverage of `map()` and friends: 76 | . 77 | -------------------------------------------------------------------------------- /ex05_attack-via-rows-or-columns.R: -------------------------------------------------------------------------------- 1 | #' --- 2 | #' title: "Attack via rows or columns?" 3 | #' author: "Jenny Bryan" 4 | #' date: "`r format(Sys.Date())`" 5 | #' output: github_document 6 | #' --- 7 | 8 | #+ setup, include = FALSE, cache = FALSE 9 | knitr::opts_chunk$set( 10 | collapse = TRUE, 11 | comment = "#>", 12 | error = TRUE 13 | ) 14 | options(tidyverse.quiet = TRUE) 15 | 16 | #' **WARNING: half-baked** 17 | 18 | #+ body 19 | # ---- 20 | library(tidyverse) 21 | 22 | # ---- 23 | #' ## If you must sweat, compare row-wise work vs. column-wise work 24 | #' 25 | #' The approach you use in that first example is not always the one that scales 26 | #' up the best. 27 | 28 | x <- list( 29 | list(name = "sue", number = 1, veg = c("onion", "carrot")), 30 | list(name = "doug", number = 2, veg = c("potato", "beet")) 31 | ) 32 | 33 | # row binding 34 | 35 | # frustrating base attempts 36 | rbind(x) 37 | do.call(rbind, x) 38 | do.call(rbind, x) %>% str() 39 | 40 | # tidyverse fail 41 | bind_rows(x) 42 | map_dfr(x, ~ .x) 43 | 44 | map_dfr(x, ~ .x[c("name", "number")]) 45 | 46 | tibble( 47 | name = map_chr(x, "name"), 48 | number = map_dbl(x, "number"), 49 | veg = map(x, "veg") 50 | ) 51 | -------------------------------------------------------------------------------- /ex05_attack-via-rows-or-columns.md: -------------------------------------------------------------------------------- 1 | Attack via rows or columns? 2 | ================ 3 | Jenny Bryan 4 | 2018-04-02 5 | 6 | **WARNING: half-baked** 7 | 8 | ``` r 9 | library(tidyverse) 10 | ``` 11 | 12 | ## If you must sweat, compare row-wise work vs. column-wise work 13 | 14 | The approach you use in that first example is not always the one that 15 | scales up the best. 
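Here's a compact sketch of that comparison (written for this note; the chunk below defines the same toy list and walks through the attempts step by step). The row-wise attack builds one tibble per element and row-binds them, and only works once the ragged `veg` field is wrapped in `list()`; the column-wise attack simply pulls each field across all elements.

``` r
x <- list(
  list(name = "sue",  number = 1, veg = c("onion", "carrot")),
  list(name = "doug", number = 2, veg = c("potato", "beet"))
)

## row-wise: one tibble per element, then bind; veg must become a list-column
map_dfr(x, ~ tibble(name = .x$name, number = .x$number, veg = list(.x$veg)))

## column-wise: build each variable across all elements
tibble(
  name   = map_chr(x, "name"),
  number = map_dbl(x, "number"),
  veg    = map(x, "veg")
)
```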
16 | 17 | ``` r 18 | x <- list( 19 | list(name = "sue", number = 1, veg = c("onion", "carrot")), 20 | list(name = "doug", number = 2, veg = c("potato", "beet")) 21 | ) 22 | 23 | # row binding 24 | 25 | # frustrating base attempts 26 | rbind(x) 27 | #> [,1] [,2] 28 | #> x List,3 List,3 29 | do.call(rbind, x) 30 | #> name number veg 31 | #> [1,] "sue" 1 Character,2 32 | #> [2,] "doug" 2 Character,2 33 | do.call(rbind, x) %>% str() 34 | #> List of 6 35 | #> $ : chr "sue" 36 | #> $ : chr "doug" 37 | #> $ : num 1 38 | #> $ : num 2 39 | #> $ : chr [1:2] "onion" "carrot" 40 | #> $ : chr [1:2] "potato" "beet" 41 | #> - attr(*, "dim")= int [1:2] 2 3 42 | #> - attr(*, "dimnames")=List of 2 43 | #> ..$ : NULL 44 | #> ..$ : chr [1:3] "name" "number" "veg" 45 | 46 | # tidyverse fail 47 | bind_rows(x) 48 | #> Error in bind_rows_(x, .id): Argument 3 must be length 1, not 2 49 | map_dfr(x, ~ .x) 50 | #> Error in bind_rows_(x, .id): Argument 3 must be length 1, not 2 51 | 52 | map_dfr(x, ~ .x[c("name", "number")]) 53 | #> # A tibble: 2 x 2 54 | #> name number 55 | #> 56 | #> 1 sue 1. 57 | #> 2 doug 2. 58 | 59 | tibble( 60 | name = map_chr(x, "name"), 61 | number = map_dbl(x, "number"), 62 | veg = map(x, "veg") 63 | ) 64 | #> # A tibble: 2 x 3 65 | #> name number veg 66 | #> 67 | #> 1 sue 1. 68 | #> 2 doug 2. 69 | ``` 70 | -------------------------------------------------------------------------------- /ex06_runif-via-pmap.R: -------------------------------------------------------------------------------- 1 | #' --- 2 | #' title: "Generate data from different distributions via pmap()" 3 | #' author: "Jenny Bryan" 4 | #' date: "`r format(Sys.Date())`" 5 | #' output: github_document 6 | #' --- 7 | 8 | #+ setup, include = FALSE, cache = FALSE 9 | knitr::opts_chunk$set( 10 | collapse = TRUE, 11 | comment = "#>", 12 | error = TRUE 13 | ) 14 | options(tidyverse.quiet = TRUE) 15 | 16 | #+ body 17 | # ---- 18 | #' ## Uniform[min, max] via `runif()` 19 | #' 20 | #' CONSIDER: 21 | #' ``` 22 | #' runif(n, min = 0, max = 1) 23 | #' ``` 24 | #' 25 | #' Want to do this for several triples of (n, min, max). 26 | #' 27 | #' Store each triple as a row in a data frame. 28 | #' 29 | #' Now iterate over the rows. 30 | 31 | library(tidyverse) 32 | 33 | #' Notice how df's variable names are same as runif's argument names. Do this 34 | #' when you can! 35 | df <- tribble( 36 | ~ n, ~ min, ~ max, 37 | 1L, 0, 1, 38 | 2L, 10, 100, 39 | 3L, 100, 1000 40 | ) 41 | df 42 | 43 | #' Set seed to make this repeatedly random. 44 | #' 45 | #' Practice on single rows. 46 | set.seed(123) 47 | (x <- df[1, ]) 48 | runif(n = x$n, min = x$min, max = x$max) 49 | 50 | x <- df[2, ] 51 | runif(n = x$n, min = x$min, max = x$max) 52 | 53 | x <- df[3, ] 54 | runif(n = x$n, min = x$min, max = x$max) 55 | 56 | #' Think out loud in pseudo-code. 57 | 58 | ## x <- df[i, ] 59 | ## runif(n = x$n, min = x$min, max = x$max) 60 | 61 | ## runif(n = df$n[i], min = df$min[i], max = df$max[i]) 62 | ## runif with all args from the i-th row of df 63 | 64 | #' Just. Do. It. with `pmap()`. 65 | set.seed(123) 66 | pmap(df, runif) 67 | 68 | #' ## Finessing variable and argument names 69 | #' 70 | #' Q: What if you can't arrange it so that variable names and arg names are 71 | #' same? 72 | foofy <- tibble( 73 | alpha = 1:3, ## was: n 74 | beta = c(0, 10, 100), ## was: min 75 | gamma = c(1, 100, 1000) ## was: max 76 | ) 77 | foofy 78 | 79 | #' A: Rename the variables on-the-fly, on the way in. 
80 | set.seed(123) 81 | foofy %>% 82 | rename(n = alpha, min = beta, max = gamma) %>% 83 | pmap(runif) 84 | 85 | #' A: Write a wrapper around `runif()` to say how df vars <--> runif args. 86 | 87 | ## wrapper option #1: 88 | ## ARGNAME = l$VARNAME 89 | my_runif <- function(...) { 90 | l <- list(...) 91 | runif(n = l$alpha, min = l$beta, max = l$gamma) 92 | } 93 | set.seed(123) 94 | pmap(foofy, my_runif) 95 | 96 | ## wrapper option #2: 97 | my_runif <- function(alpha, beta, gamma, ...) { 98 | runif(n = alpha, min = beta, max = gamma) 99 | } 100 | set.seed(123) 101 | pmap(foofy, my_runif) 102 | 103 | #' You can use `..i` to refer to input by position. 104 | set.seed(123) 105 | pmap(foofy, ~ runif(n = ..1, min = ..2, max = ..3)) 106 | #' Use this with *extreme caution*. Easy to shoot yourself in the foot. 107 | #' 108 | #' ## Extra variables in the data frame 109 | #' 110 | #' What if data frame includes variables that should not be passed to `.f()`? 111 | df_oops <- tibble( 112 | n = 1:3, 113 | min = c(0, 10, 100), 114 | max = c(1, 100, 1000), 115 | oops = c("please", "ignore", "me") 116 | ) 117 | df_oops 118 | 119 | #' This will not work! 120 | set.seed(123) 121 | pmap(df_oops, runif) 122 | 123 | #' A: use `dplyr::select()` to limit the variables passed to `pmap()`. 124 | set.seed(123) 125 | df_oops %>% 126 | select(n, min, max) %>% ## if it's easier to say what to keep 127 | pmap(runif) 128 | 129 | set.seed(123) 130 | df_oops %>% 131 | select(-oops) %>% ## if it's easier to say what to omit 132 | pmap(runif) 133 | 134 | #' A: Use a custom wrapper and absorb extra variables with `...`. 135 | my_runif <- function(n, min, max, ...) runif(n, min, max) 136 | 137 | set.seed(123) 138 | pmap(df_oops, my_runif) 139 | 140 | #' ## Add the generated data to the data frame as a list-column 141 | set.seed(123) 142 | (df_aug <- df %>% 143 | mutate(data = pmap(., runif))) 144 | #View(df_aug) 145 | 146 | #' What about computing within a data frame, in the presence of the 147 | #' complications discussed above? Use `list()` in the place of the `.` 148 | #' placeholder above to select the target variables and, if necessary, map 149 | #' variable names to argument names. *Thanks @hadley for [sharing this 150 | #' trick](https://community.rstudio.com/t/dplyr-alternatives-to-rowwise/8071/29).* 151 | #' 152 | #' How to address variable names != argument names: 153 | foofy <- tibble( 154 | alpha = 1:3, ## was: n 155 | beta = c(0, 10, 100), ## was: min 156 | gamma = c(1, 100, 1000) ## was: max 157 | ) 158 | 159 | set.seed(123) 160 | foofy %>% 161 | mutate(data = pmap(list(n = alpha, min = beta, max = gamma), runif)) 162 | 163 | #' How to address presence of 'extra variables' with either an inclusion or 164 | #' exclusion mentality 165 | df_oops <- tibble( 166 | n = 1:3, 167 | min = c(0, 10, 100), 168 | max = c(1, 100, 1000), 169 | oops = c("please", "ignore", "me") 170 | ) 171 | 172 | set.seed(123) 173 | df_oops %>% 174 | mutate(data = pmap(list(n, min, max), runif)) 175 | 176 | df_oops %>% 177 | mutate(data = pmap(select(., -oops), runif)) 178 | 179 | #' ## Review 180 | #' 181 | #' What have we done? 182 | #' 183 | #' * Arranged inputs as rows in a data frame 184 | #' * Used `pmap()` to implement a loop over the rows. 185 | #' * Used dplyr verbs `rename()` and `select()` to manipulate data on the way 186 | #' into `pmap()`. 
187 | #' * Wrote custom wrappers around `runif()` to deal with: 188 | #' - df var names != `.f()` arg names 189 | #' - df vars that aren't formal args of `.f()` 190 | #' * Demonstrated all of the above when working inside a data frame and adding 191 | #' generated data as a list-column 192 | -------------------------------------------------------------------------------- /ex06_runif-via-pmap.md: -------------------------------------------------------------------------------- 1 | Generate data from different distributions via pmap() 2 | ================ 3 | Jenny Bryan 4 | 2018-05-08 5 | 6 | ## Uniform\[min, max\] via `runif()` 7 | 8 | CONSIDER: 9 | 10 | runif(n, min = 0, max = 1) 11 | 12 | Want to do this for several triples of (n, min, max). 13 | 14 | Store each triple as a row in a data frame. 15 | 16 | Now iterate over the rows. 17 | 18 | ``` r 19 | library(tidyverse) 20 | ``` 21 | 22 | Notice how df’s variable names are same as runif’s argument names. Do 23 | this when you can\! 24 | 25 | ``` r 26 | df <- tribble( 27 | ~ n, ~ min, ~ max, 28 | 1L, 0, 1, 29 | 2L, 10, 100, 30 | 3L, 100, 1000 31 | ) 32 | df 33 | #> # A tibble: 3 x 3 34 | #> n min max 35 | #> 36 | #> 1 1 0 1 37 | #> 2 2 10 100 38 | #> 3 3 100 1000 39 | ``` 40 | 41 | Set seed to make this repeatedly random. 42 | 43 | Practice on single rows. 44 | 45 | ``` r 46 | set.seed(123) 47 | (x <- df[1, ]) 48 | #> # A tibble: 1 x 3 49 | #> n min max 50 | #> 51 | #> 1 1 0 1 52 | runif(n = x$n, min = x$min, max = x$max) 53 | #> [1] 0.2875775 54 | 55 | x <- df[2, ] 56 | runif(n = x$n, min = x$min, max = x$max) 57 | #> [1] 80.94746 46.80792 58 | 59 | x <- df[3, ] 60 | runif(n = x$n, min = x$min, max = x$max) 61 | #> [1] 894.7157 946.4206 141.0008 62 | ``` 63 | 64 | Think out loud in pseudo-code. 65 | 66 | ``` r 67 | ## x <- df[i, ] 68 | ## runif(n = x$n, min = x$min, max = x$max) 69 | 70 | ## runif(n = df$n[i], min = df$min[i], max = df$max[i]) 71 | ## runif with all args from the i-th row of df 72 | ``` 73 | 74 | Just. Do. It. with `pmap()`. 75 | 76 | ``` r 77 | set.seed(123) 78 | pmap(df, runif) 79 | #> [[1]] 80 | #> [1] 0.2875775 81 | #> 82 | #> [[2]] 83 | #> [1] 80.94746 46.80792 84 | #> 85 | #> [[3]] 86 | #> [1] 894.7157 946.4206 141.0008 87 | ``` 88 | 89 | ## Finessing variable and argument names 90 | 91 | Q: What if you can’t arrange it so that variable names and arg names are 92 | same? 93 | 94 | ``` r 95 | foofy <- tibble( 96 | alpha = 1:3, ## was: n 97 | beta = c(0, 10, 100), ## was: min 98 | gamma = c(1, 100, 1000) ## was: max 99 | ) 100 | foofy 101 | #> # A tibble: 3 x 3 102 | #> alpha beta gamma 103 | #> 104 | #> 1 1 0 1 105 | #> 2 2 10 100 106 | #> 3 3 100 1000 107 | ``` 108 | 109 | A: Rename the variables on-the-fly, on the way in. 110 | 111 | ``` r 112 | set.seed(123) 113 | foofy %>% 114 | rename(n = alpha, min = beta, max = gamma) %>% 115 | pmap(runif) 116 | #> [[1]] 117 | #> [1] 0.2875775 118 | #> 119 | #> [[2]] 120 | #> [1] 80.94746 46.80792 121 | #> 122 | #> [[3]] 123 | #> [1] 894.7157 946.4206 141.0008 124 | ``` 125 | 126 | A: Write a wrapper around `runif()` to say how df vars \<–\> runif args. 127 | 128 | ``` r 129 | ## wrapper option #1: 130 | ## ARGNAME = l$VARNAME 131 | my_runif <- function(...) { 132 | l <- list(...) 
133 | runif(n = l$alpha, min = l$beta, max = l$gamma) 134 | } 135 | set.seed(123) 136 | pmap(foofy, my_runif) 137 | #> [[1]] 138 | #> [1] 0.2875775 139 | #> 140 | #> [[2]] 141 | #> [1] 80.94746 46.80792 142 | #> 143 | #> [[3]] 144 | #> [1] 894.7157 946.4206 141.0008 145 | 146 | ## wrapper option #2: 147 | my_runif <- function(alpha, beta, gamma, ...) { 148 | runif(n = alpha, min = beta, max = gamma) 149 | } 150 | set.seed(123) 151 | pmap(foofy, my_runif) 152 | #> [[1]] 153 | #> [1] 0.2875775 154 | #> 155 | #> [[2]] 156 | #> [1] 80.94746 46.80792 157 | #> 158 | #> [[3]] 159 | #> [1] 894.7157 946.4206 141.0008 160 | ``` 161 | 162 | You can use `..i` to refer to input by position. 163 | 164 | ``` r 165 | set.seed(123) 166 | pmap(foofy, ~ runif(n = ..1, min = ..2, max = ..3)) 167 | #> [[1]] 168 | #> [1] 0.2875775 169 | #> 170 | #> [[2]] 171 | #> [1] 80.94746 46.80792 172 | #> 173 | #> [[3]] 174 | #> [1] 894.7157 946.4206 141.0008 175 | ``` 176 | 177 | Use this with *extreme caution*. Easy to shoot yourself in the foot. 178 | 179 | ## Extra variables in the data frame 180 | 181 | What if data frame includes variables that should not be passed to 182 | `.f()`? 183 | 184 | ``` r 185 | df_oops <- tibble( 186 | n = 1:3, 187 | min = c(0, 10, 100), 188 | max = c(1, 100, 1000), 189 | oops = c("please", "ignore", "me") 190 | ) 191 | df_oops 192 | #> # A tibble: 3 x 4 193 | #> n min max oops 194 | #> 195 | #> 1 1 0 1 please 196 | #> 2 2 10 100 ignore 197 | #> 3 3 100 1000 me 198 | ``` 199 | 200 | This will not work\! 201 | 202 | ``` r 203 | set.seed(123) 204 | pmap(df_oops, runif) 205 | #> Error in .f(n = .l[[c(1L, i)]], min = .l[[c(2L, i)]], max = .l[[c(3L, : unused argument (oops = .l[[c(4, i)]]) 206 | ``` 207 | 208 | A: use `dplyr::select()` to limit the variables passed to `pmap()`. 209 | 210 | ``` r 211 | set.seed(123) 212 | df_oops %>% 213 | select(n, min, max) %>% ## if it's easier to say what to keep 214 | pmap(runif) 215 | #> [[1]] 216 | #> [1] 0.2875775 217 | #> 218 | #> [[2]] 219 | #> [1] 80.94746 46.80792 220 | #> 221 | #> [[3]] 222 | #> [1] 894.7157 946.4206 141.0008 223 | 224 | set.seed(123) 225 | df_oops %>% 226 | select(-oops) %>% ## if it's easier to say what to omit 227 | pmap(runif) 228 | #> [[1]] 229 | #> [1] 0.2875775 230 | #> 231 | #> [[2]] 232 | #> [1] 80.94746 46.80792 233 | #> 234 | #> [[3]] 235 | #> [1] 894.7157 946.4206 141.0008 236 | ``` 237 | 238 | A: Use a custom wrapper and absorb extra variables with `...`. 239 | 240 | ``` r 241 | my_runif <- function(n, min, max, ...) runif(n, min, max) 242 | 243 | set.seed(123) 244 | pmap(df_oops, my_runif) 245 | #> [[1]] 246 | #> [1] 0.2875775 247 | #> 248 | #> [[2]] 249 | #> [1] 80.94746 46.80792 250 | #> 251 | #> [[3]] 252 | #> [1] 894.7157 946.4206 141.0008 253 | ``` 254 | 255 | ## Add the generated data to the data frame as a list-column 256 | 257 | ``` r 258 | set.seed(123) 259 | (df_aug <- df %>% 260 | mutate(data = pmap(., runif))) 261 | #> # A tibble: 3 x 4 262 | #> n min max data 263 | #> 264 | #> 1 1 0 1 265 | #> 2 2 10 100 266 | #> 3 3 100 1000 267 | #View(df_aug) 268 | ``` 269 | 270 | What about computing within a data frame, in the presence of the 271 | complications discussed above? Use `list()` in the place of the `.` 272 | placeholder above to select the target variables and, if necessary, map 273 | variable names to argument names. 
*Thanks @hadley for [sharing this 274 | trick](https://community.rstudio.com/t/dplyr-alternatives-to-rowwise/8071/29).* 275 | 276 | How to address variable names \!= argument names: 277 | 278 | ``` r 279 | foofy <- tibble( 280 | alpha = 1:3, ## was: n 281 | beta = c(0, 10, 100), ## was: min 282 | gamma = c(1, 100, 1000) ## was: max 283 | ) 284 | 285 | set.seed(123) 286 | foofy %>% 287 | mutate(data = pmap(list(n = alpha, min = beta, max = gamma), runif)) 288 | #> # A tibble: 3 x 4 289 | #> alpha beta gamma data 290 | #> 291 | #> 1 1 0 1 292 | #> 2 2 10 100 293 | #> 3 3 100 1000 294 | ``` 295 | 296 | How to address presence of ‘extra variables’ with either an inclusion or 297 | exclusion mentality 298 | 299 | ``` r 300 | df_oops <- tibble( 301 | n = 1:3, 302 | min = c(0, 10, 100), 303 | max = c(1, 100, 1000), 304 | oops = c("please", "ignore", "me") 305 | ) 306 | 307 | set.seed(123) 308 | df_oops %>% 309 | mutate(data = pmap(list(n, min, max), runif)) 310 | #> # A tibble: 3 x 5 311 | #> n min max oops data 312 | #> 313 | #> 1 1 0 1 please 314 | #> 2 2 10 100 ignore 315 | #> 3 3 100 1000 me 316 | 317 | df_oops %>% 318 | mutate(data = pmap(select(., -oops), runif)) 319 | #> # A tibble: 3 x 5 320 | #> n min max oops data 321 | #> 322 | #> 1 1 0 1 please 323 | #> 2 2 10 100 ignore 324 | #> 3 3 100 1000 me 325 | ``` 326 | 327 | ## Review 328 | 329 | What have we done? 330 | 331 | - Arranged inputs as rows in a data frame 332 | - Used `pmap()` to implement a loop over the rows. 333 | - Used dplyr verbs `rename()` and `select()` to manipulate data on the 334 | way into `pmap()`. 335 | - Wrote custom wrappers around `runif()` to deal with: 336 | - df var names \!= `.f()` arg names 337 | - df vars that aren’t formal args of `.f()` 338 | - Demonstrated all of the above when working inside a data frame and 339 | adding generated data as a list-column 340 | -------------------------------------------------------------------------------- /ex07_group-by-summarise.R: -------------------------------------------------------------------------------- 1 | #' --- 2 | #' title: "Work on groups of rows via dplyr::group_by() + summarise()" 3 | #' author: "Jenny Bryan" 4 | #' date: "`r format(Sys.Date())`" 5 | #' output: github_document 6 | #' --- 7 | 8 | #+ setup, include = FALSE, cache = FALSE 9 | knitr::opts_chunk$set( 10 | collapse = TRUE, 11 | comment = "#>", 12 | error = TRUE 13 | ) 14 | options(tidyverse.quiet = TRUE) 15 | 16 | #+ body 17 | # ---- 18 | 19 | #' What if you need to work on groups of rows? Such as the groups induced by 20 | #' the levels of a factor. 21 | #' 22 | #' You do not need to ... split the data frame into mini-data-frames, loop over 23 | #' them, and glue it all back together. 24 | #' 25 | #' Instead, use `dplyr::group_by()`, followed by `dplyr::summarize()`, to 26 | #' compute group-wise summaries. 27 | 28 | library(tidyverse) 29 | 30 | iris %>% 31 | group_by(Species) %>% 32 | summarise(pl_avg = mean(Petal.Length), pw_avg = mean(Petal.Width)) 33 | 34 | #' What if you want to return summaries that are not just a single number? 35 | #' 36 | #' This does not "just work". 37 | iris %>% 38 | group_by(Species) %>% 39 | summarise(pl_qtile = quantile(Petal.Length, c(0.25, 0.5, 0.75))) 40 | 41 | #' Solution: package as a length-1 list that contains 3 values, creating a 42 | #' list-column. 
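#' (An aside, not in the original script: why does wrapping in `list()` satisfy
#' `summarise()`? Because `summarise()` only requires each result to have length
#' 1, and a list holding the three quantiles *is* length 1.)
length(list(quantile(iris$Petal.Length, c(0.25, 0.5, 0.75))))
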
43 | iris %>% 44 | group_by(Species) %>% 45 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75)))) 46 | 47 | #' Q from 48 | #' [\@jcpsantiago](https://twitter.com/jcpsantiago/status/983997363298717696) via 49 | #' Twitter: How would you unnest so the final output is a data frame with a 50 | #' factor column `quantile` with levels "25%", "50%", and "75%"? 51 | #' 52 | #' A: I would `map()` `tibble::enframe()` on the new list column, to convert 53 | #' each entry from named list to a two-column data frame. Then use 54 | #' `tidyr::unnest()` to get rid of the list column and return to a simple data 55 | #' frame and, if you like, convert `quantile` into a factor. 56 | 57 | iris %>% 58 | group_by(Species) %>% 59 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75)))) %>% 60 | mutate(pl_qtile = map(pl_qtile, enframe, name = "quantile")) %>% 61 | unnest() %>% 62 | mutate(quantile = factor(quantile)) 63 | 64 | #' If something like this comes up a lot in an analysis, you could package the 65 | #' key "moves" in a function, like so: 66 | enquantile <- function(x, ...) { 67 | qtile <- enframe(quantile(x, ...), name = "quantile") 68 | qtile$quantile <- factor(qtile$quantile) 69 | list(qtile) 70 | } 71 | 72 | #' This makes repeated downstream usage more concise. 73 | iris %>% 74 | group_by(Species) %>% 75 | summarise(pl_qtile = enquantile(Petal.Length, c(0.25, 0.5, 0.75))) %>% 76 | unnest() 77 | 78 | -------------------------------------------------------------------------------- /ex07_group-by-summarise.md: -------------------------------------------------------------------------------- 1 | Work on groups of rows via dplyr::group\_by() + summarise() 2 | ================ 3 | Jenny Bryan 4 | 2018-04-11 5 | 6 | What if you need to work on groups of rows? Such as the groups induced 7 | by the levels of a factor. 8 | 9 | You do not need to … split the data frame into mini-data-frames, loop 10 | over them, and glue it all back together. 11 | 12 | Instead, use `dplyr::group_by()`, followed by `dplyr::summarize()`, to 13 | compute group-wise summaries. 14 | 15 | ``` r 16 | library(tidyverse) 17 | 18 | iris %>% 19 | group_by(Species) %>% 20 | summarise(pl_avg = mean(Petal.Length), pw_avg = mean(Petal.Width)) 21 | #> # A tibble: 3 x 3 22 | #> Species pl_avg pw_avg 23 | #> 24 | #> 1 setosa 1.46 0.246 25 | #> 2 versicolor 4.26 1.33 26 | #> 3 virginica 5.55 2.03 27 | ``` 28 | 29 | What if you want to return summaries that are not just a single number? 30 | 31 | This does not “just work”. 32 | 33 | ``` r 34 | iris %>% 35 | group_by(Species) %>% 36 | summarise(pl_qtile = quantile(Petal.Length, c(0.25, 0.5, 0.75))) 37 | #> Error in summarise_impl(.data, dots): Column `pl_qtile` must be length 1 (a summary value), not 3 38 | ``` 39 | 40 | Solution: package as a length-1 list that contains 3 values, creating a 41 | list-column. 42 | 43 | ``` r 44 | iris %>% 45 | group_by(Species) %>% 46 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75)))) 47 | #> # A tibble: 3 x 2 48 | #> Species pl_qtile 49 | #> 50 | #> 1 setosa 51 | #> 2 versicolor 52 | #> 3 virginica 53 | ``` 54 | 55 | Q from 56 | [@jcpsantiago](https://twitter.com/jcpsantiago/status/983997363298717696) 57 | via Twitter: How would you unnest so the final output is a data frame 58 | with a factor column `quantile` with levels “25%”, “50%”, and “75%”? 59 | 60 | A: I would `map()` `tibble::enframe()` on the new list column, to 61 | convert each entry from named list to a two-column data frame. 
Then use 62 | `tidyr::unnest()` to get rid of the list column and return to a simple 63 | data frame and, if you like, convert `quantile` into a factor. 64 | 65 | ``` r 66 | iris %>% 67 | group_by(Species) %>% 68 | summarise(pl_qtile = list(quantile(Petal.Length, c(0.25, 0.5, 0.75)))) %>% 69 | mutate(pl_qtile = map(pl_qtile, enframe, name = "quantile")) %>% 70 | unnest() %>% 71 | mutate(quantile = factor(quantile)) 72 | #> # A tibble: 9 x 3 73 | #> Species quantile value 74 | #> 75 | #> 1 setosa 25% 1.40 76 | #> 2 setosa 50% 1.50 77 | #> 3 setosa 75% 1.58 78 | #> 4 versicolor 25% 4.00 79 | #> 5 versicolor 50% 4.35 80 | #> 6 versicolor 75% 4.60 81 | #> 7 virginica 25% 5.10 82 | #> 8 virginica 50% 5.55 83 | #> 9 virginica 75% 5.88 84 | ``` 85 | 86 | If something like this comes up a lot in an analysis, you could package 87 | the key “moves” in a function, like so: 88 | 89 | ``` r 90 | enquantile <- function(x, ...) { 91 | qtile <- enframe(quantile(x, ...), name = "quantile") 92 | qtile$quantile <- factor(qtile$quantile) 93 | list(qtile) 94 | } 95 | ``` 96 | 97 | This makes repeated downstream usage more concise. 98 | 99 | ``` r 100 | iris %>% 101 | group_by(Species) %>% 102 | summarise(pl_qtile = enquantile(Petal.Length, c(0.25, 0.5, 0.75))) %>% 103 | unnest() 104 | #> # A tibble: 9 x 3 105 | #> Species quantile value 106 | #> 107 | #> 1 setosa 25% 1.40 108 | #> 2 setosa 50% 1.50 109 | #> 3 setosa 75% 1.58 110 | #> 4 versicolor 25% 4.00 111 | #> 5 versicolor 50% 4.35 112 | #> 6 versicolor 75% 4.60 113 | #> 7 virginica 25% 5.10 114 | #> 8 virginica 50% 5.55 115 | #> 9 virginica 75% 5.88 116 | ``` 117 | -------------------------------------------------------------------------------- /ex08_nesting-is-good.R: -------------------------------------------------------------------------------- 1 | #' --- 2 | #' title: "Why nesting is worth the awkwardness" 3 | #' author: "Jenny Bryan" 4 | #' date: "`r format(Sys.Date())`" 5 | #' output: github_document 6 | #' --- 7 | 8 | #+ setup, include = FALSE, cache = FALSE 9 | knitr::opts_chunk$set( 10 | collapse = TRUE, 11 | comment = "#>", 12 | error = TRUE 13 | ) 14 | options(tidyverse.quiet = TRUE) 15 | 16 | #+ body 17 | # ---- 18 | library(gapminder) 19 | library(tidyverse) 20 | 21 | # ---- 22 | #' gapminder data for Asia only 23 | gap <- gapminder %>% 24 | filter(continent == "Asia") %>% 25 | mutate(yr1952 = year - 1952) 26 | 27 | #+ alpha-order 28 | ggplot(gap, aes(x = lifeExp, y = country)) + 29 | geom_point() 30 | 31 | #' Countries are in alphabetical order. 32 | #' 33 | #' Set factor levels with intent. Example: order based on life expectancy in 34 | #' 2007, the last year in this dataset. Imagine you want this to persist across 35 | #' an entire analysis. 36 | gap <- gap %>% 37 | mutate(country = fct_reorder2(country, .x = year, .y = lifeExp)) 38 | 39 | #+ principled-order 40 | ggplot(gap, aes(x = lifeExp, y = country)) + 41 | geom_point() 42 | 43 | 44 | #' Much better! 45 | #' 46 | #' Now imagine we want to fit a model to each country and look at dot plots of 47 | #' slope and intercept. 48 | #' 49 | #' `dplyr::group_by()` + `tidyr::nest()` created a *nested data frame* and is an 50 | #' alternative to splitting into country-specific data frames. Those data frames 51 | #' end up, instead, in a list-column. The `country` variable remains as a normal 52 | #' factor. 
53 | gap_nested <- gap %>% 54 | group_by(country) %>% 55 | nest() 56 | 57 | gap_nested 58 | gap_nested$data[[1]] 59 | 60 | gap_fitted <- gap_nested %>% 61 | mutate(fit = map(data, ~ lm(lifeExp ~ yr1952, data = .x))) 62 | gap_fitted 63 | gap_fitted$fit[[1]] 64 | 65 | gap_fitted <- gap_fitted %>% 66 | mutate( 67 | intercept = map_dbl(fit, ~ coef(.x)[["(Intercept)"]]), 68 | slope = map_dbl(fit, ~ coef(.x)[["yr1952"]]) 69 | ) 70 | gap_fitted 71 | 72 | #+ principled-order-coef-ests 73 | ggplot(gap_fitted, aes(x = intercept, y = country)) + 74 | geom_point() 75 | 76 | ggplot(gap_fitted, aes(x = slope, y = country)) + 77 | geom_point() 78 | 79 | #' The `split()` + `lapply()` + `do.call(rbind, ...)` approach. 80 | #' 81 | #' Split gap into many data frames, one per country. 82 | gap_split <- split(gap, gap$country) 83 | 84 | #' Fit a model to each country. 85 | gap_split_fits <- lapply( 86 | gap_split, 87 | function(df) { 88 | lm(lifeExp ~ yr1952, data = df) 89 | } 90 | ) 91 | #' Oops ... the unused levels of country are a problem (empty data frames in our 92 | #' list). 93 | #' 94 | #' Drop unused levels in country and split. 95 | gap_split <- split(droplevels(gap), droplevels(gap)$country) 96 | head(gap_split, 2) 97 | 98 | #' Fit model to each country and get `coefs()`. 99 | gap_split_coefs <- lapply( 100 | gap_split, 101 | function(df) { 102 | coef(lm(lifeExp ~ yr1952, data = df)) 103 | } 104 | ) 105 | head(gap_split_coefs, 2) 106 | 107 | #' Now we need to put everything back togethers. Row bind the list of coefs. 108 | #' Coerce from matrix back to data frame. 109 | gap_split_coefs <- as.data.frame(do.call(rbind, gap_split_coefs)) 110 | 111 | #' Restore `country` variable from row names. 112 | gap_split_coefs$country <- rownames(gap_split_coefs) 113 | str(gap_split_coefs) 114 | 115 | #+ revert-to-alphabetical 116 | ggplot(gap_split_coefs, aes(x = `(Intercept)`, y = country)) + 117 | geom_point() 118 | #' Uh-oh, we lost the order of the `country` factor, due to coercion from factor 119 | #' to character (list and then row names). 120 | #' 121 | #' The `nest()` approach allows you to keep data as data vs. in attributes, such 122 | #' as list or row names. Preserves factors and their levels or integer 123 | #' variables. Designs away various opportunities for different pieces of the 124 | #' dataset to get "out of sync" with each other, by leaving them in a data frame 125 | #' at all times. 126 | #' 127 | #' First in an interesting series of blog posts exploring these patterns and 128 | #' asking whether the tidyverse still needs a way to include the nesting 129 | #' variable in the nested data: 130 | #' 131 | -------------------------------------------------------------------------------- /ex08_nesting-is-good.md: -------------------------------------------------------------------------------- 1 | Why nesting is worth the awkwardness 2 | ================ 3 | Jenny Bryan 4 | 2018-04-12 5 | 6 | ``` r 7 | library(gapminder) 8 | library(tidyverse) 9 | ``` 10 | 11 | gapminder data for Asia only 12 | 13 | ``` r 14 | gap <- gapminder %>% 15 | filter(continent == "Asia") %>% 16 | mutate(yr1952 = year - 1952) 17 | ``` 18 | 19 | ``` r 20 | ggplot(gap, aes(x = lifeExp, y = country)) + 21 | geom_point() 22 | ``` 23 | 24 | ![](ex08_nesting-is-good_files/figure-gfm/alpha-order-1.png) 25 | 26 | Countries are in alphabetical order. 27 | 28 | Set factor levels with intent. Example: order based on life expectancy 29 | in 2007, the last year in this dataset. 
Imagine you want this to persist 30 | across an entire analysis. 31 | 32 | ``` r 33 | gap <- gap %>% 34 | mutate(country = fct_reorder2(country, .x = year, .y = lifeExp)) 35 | ``` 36 | 37 | ``` r 38 | ggplot(gap, aes(x = lifeExp, y = country)) + 39 | geom_point() 40 | ``` 41 | 42 | ![](ex08_nesting-is-good_files/figure-gfm/principled-order-1.png) 43 | 44 | Much better\! 45 | 46 | Now imagine we want to fit a model to each country and look at dot plots 47 | of slope and intercept. 48 | 49 | `dplyr::group_by()` + `tidyr::nest()` created a *nested data frame* and 50 | is an alternative to splitting into country-specific data frames. Those 51 | data frames end up, instead, in a list-column. The `country` variable 52 | remains as a normal factor. 53 | 54 | ``` r 55 | gap_nested <- gap %>% 56 | group_by(country) %>% 57 | nest() 58 | 59 | gap_nested 60 | #> # A tibble: 33 x 2 61 | #> country data 62 | #> 63 | #> 1 Afghanistan 64 | #> 2 Bahrain 65 | #> 3 Bangladesh 66 | #> 4 Cambodia 67 | #> 5 China 68 | #> 6 Hong Kong, China 69 | #> 7 India 70 | #> 8 Indonesia 71 | #> 9 Iran 72 | #> 10 Iraq 73 | #> # ... with 23 more rows 74 | gap_nested$data[[1]] 75 | #> # A tibble: 12 x 6 76 | #> continent year lifeExp pop gdpPercap yr1952 77 | #> 78 | #> 1 Asia 1952 28.8 8425333 779. 0. 79 | #> 2 Asia 1957 30.3 9240934 821. 5. 80 | #> 3 Asia 1962 32.0 10267083 853. 10. 81 | #> 4 Asia 1967 34.0 11537966 836. 15. 82 | #> 5 Asia 1972 36.1 13079460 740. 20. 83 | #> 6 Asia 1977 38.4 14880372 786. 25. 84 | #> 7 Asia 1982 39.9 12881816 978. 30. 85 | #> 8 Asia 1987 40.8 13867957 852. 35. 86 | #> 9 Asia 1992 41.7 16317921 649. 40. 87 | #> 10 Asia 1997 41.8 22227415 635. 45. 88 | #> 11 Asia 2002 42.1 25268405 727. 50. 89 | #> 12 Asia 2007 43.8 31889923 975. 55. 90 | 91 | gap_fitted <- gap_nested %>% 92 | mutate(fit = map(data, ~ lm(lifeExp ~ yr1952, data = .x))) 93 | gap_fitted 94 | #> # A tibble: 33 x 3 95 | #> country data fit 96 | #> 97 | #> 1 Afghanistan 98 | #> 2 Bahrain 99 | #> 3 Bangladesh 100 | #> 4 Cambodia 101 | #> 5 China 102 | #> 6 Hong Kong, China 103 | #> 7 India 104 | #> 8 Indonesia 105 | #> 9 Iran 106 | #> 10 Iraq 107 | #> # ... with 23 more rows 108 | gap_fitted$fit[[1]] 109 | #> 110 | #> Call: 111 | #> lm(formula = lifeExp ~ yr1952, data = .x) 112 | #> 113 | #> Coefficients: 114 | #> (Intercept) yr1952 115 | #> 29.9073 0.2753 116 | 117 | gap_fitted <- gap_fitted %>% 118 | mutate( 119 | intercept = map_dbl(fit, ~ coef(.x)[["(Intercept)"]]), 120 | slope = map_dbl(fit, ~ coef(.x)[["yr1952"]]) 121 | ) 122 | gap_fitted 123 | #> # A tibble: 33 x 5 124 | #> country data fit intercept slope 125 | #> 126 | #> 1 Afghanistan 29.9 0.275 127 | #> 2 Bahrain 52.7 0.468 128 | #> 3 Bangladesh 36.1 0.498 129 | #> 4 Cambodia 37.0 0.396 130 | #> 5 China 47.2 0.531 131 | #> 6 Hong Kong, China 63.4 0.366 132 | #> 7 India 39.3 0.505 133 | #> 8 Indonesia 36.9 0.635 134 | #> 9 Iran 45.0 0.497 135 | #> 10 Iraq 50.1 0.235 136 | #> # ... with 23 more rows 137 | ``` 138 | 139 | ``` r 140 | ggplot(gap_fitted, aes(x = intercept, y = country)) + 141 | geom_point() 142 | ``` 143 | 144 | ![](ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-1.png) 145 | 146 | ``` r 147 | 148 | ggplot(gap_fitted, aes(x = slope, y = country)) + 149 | geom_point() 150 | ``` 151 | 152 | ![](ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-2.png) 153 | 154 | The `split()` + `lapply()` + `do.call(rbind, ...)` approach. 155 | 156 | Split gap into many data frames, one per country. 
157 | 158 | ``` r 159 | gap_split <- split(gap, gap$country) 160 | ``` 161 | 162 | Fit a model to each country. 163 | 164 | ``` r 165 | gap_split_fits <- lapply( 166 | gap_split, 167 | function(df) { 168 | lm(lifeExp ~ yr1952, data = df) 169 | } 170 | ) 171 | #> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases 172 | ``` 173 | 174 | Oops … the unused levels of country are a problem (empty data frames in 175 | our list). 176 | 177 | Drop unused levels in country and split. 178 | 179 | ``` r 180 | gap_split <- split(droplevels(gap), droplevels(gap)$country) 181 | head(gap_split, 2) 182 | #> $Japan 183 | #> # A tibble: 12 x 7 184 | #> country continent year lifeExp pop gdpPercap yr1952 185 | #> 186 | #> 1 Japan Asia 1952 63.0 86459025 3217. 0. 187 | #> 2 Japan Asia 1957 65.5 91563009 4318. 5. 188 | #> 3 Japan Asia 1962 68.7 95831757 6577. 10. 189 | #> 4 Japan Asia 1967 71.4 100825279 9848. 15. 190 | #> 5 Japan Asia 1972 73.4 107188273 14779. 20. 191 | #> 6 Japan Asia 1977 75.4 113872473 16610. 25. 192 | #> 7 Japan Asia 1982 77.1 118454974 19384. 30. 193 | #> 8 Japan Asia 1987 78.7 122091325 22376. 35. 194 | #> 9 Japan Asia 1992 79.4 124329269 26825. 40. 195 | #> 10 Japan Asia 1997 80.7 125956499 28817. 45. 196 | #> 11 Japan Asia 2002 82.0 127065841 28605. 50. 197 | #> 12 Japan Asia 2007 82.6 127467972 31656. 55. 198 | #> 199 | #> $`Hong Kong, China` 200 | #> # A tibble: 12 x 7 201 | #> country continent year lifeExp pop gdpPercap yr1952 202 | #> 203 | #> 1 Hong Kong, China Asia 1952 61.0 2125900 3054. 0. 204 | #> 2 Hong Kong, China Asia 1957 64.8 2736300 3629. 5. 205 | #> 3 Hong Kong, China Asia 1962 67.6 3305200 4693. 10. 206 | #> 4 Hong Kong, China Asia 1967 70.0 3722800 6198. 15. 207 | #> 5 Hong Kong, China Asia 1972 72.0 4115700 8316. 20. 208 | #> 6 Hong Kong, China Asia 1977 73.6 4583700 11186. 25. 209 | #> 7 Hong Kong, China Asia 1982 75.4 5264500 14561. 30. 210 | #> 8 Hong Kong, China Asia 1987 76.2 5584510 20038. 35. 211 | #> 9 Hong Kong, China Asia 1992 77.6 5829696 24758. 40. 212 | #> 10 Hong Kong, China Asia 1997 80.0 6495918 28378. 45. 213 | #> 11 Hong Kong, China Asia 2002 81.5 6762476 30209. 50. 214 | #> 12 Hong Kong, China Asia 2007 82.2 6980412 39725. 55. 215 | ``` 216 | 217 | Fit model to each country and get `coefs()`. 218 | 219 | ``` r 220 | gap_split_coefs <- lapply( 221 | gap_split, 222 | function(df) { 223 | coef(lm(lifeExp ~ yr1952, data = df)) 224 | } 225 | ) 226 | head(gap_split_coefs, 2) 227 | #> $Japan 228 | #> (Intercept) yr1952 229 | #> 65.1220513 0.3529042 230 | #> 231 | #> $`Hong Kong, China` 232 | #> (Intercept) yr1952 233 | #> 63.4286410 0.3659706 234 | ``` 235 | 236 | Now we need to put everything back togethers. Row bind the list of 237 | coefs. Coerce from matrix back to data frame. 238 | 239 | ``` r 240 | gap_split_coefs <- as.data.frame(do.call(rbind, gap_split_coefs)) 241 | ``` 242 | 243 | Restore `country` variable from row names. 244 | 245 | ``` r 246 | gap_split_coefs$country <- rownames(gap_split_coefs) 247 | str(gap_split_coefs) 248 | #> 'data.frame': 33 obs. of 3 variables: 249 | #> $ (Intercept): num 65.1 63.4 66.3 61.8 49.7 ... 250 | #> $ yr1952 : num 0.353 0.366 0.267 0.341 0.555 ... 251 | #> $ country : chr "Japan" "Hong Kong, China" "Israel" "Singapore" ... 
252 | ``` 253 | 254 | ``` r 255 | ggplot(gap_split_coefs, aes(x = `(Intercept)`, y = country)) + 256 | geom_point() 257 | ``` 258 | 259 | ![](ex08_nesting-is-good_files/figure-gfm/revert-to-alphabetical-1.png) 260 | 261 | Uh-oh, we lost the order of the `country` factor, due to coercion from 262 | factor to character (list and then row names). 263 | 264 | The `nest()` approach allows you to keep data as data vs. in attributes, 265 | such as list or row names. Preserves factors and their levels or integer 266 | variables. Designs away various opportunities for different pieces of 267 | the dataset to get “out of sync” with each other, by leaving them in a 268 | data frame at all times. 269 | 270 | First in an interesting series of blog posts exploring these patterns 271 | and asking whether the tidyverse still needs a way to include the 272 | nesting variable in the nested data: 273 | 274 | -------------------------------------------------------------------------------- /ex08_nesting-is-good_files/figure-gfm/alpha-order-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/alpha-order-1.png -------------------------------------------------------------------------------- /ex08_nesting-is-good_files/figure-gfm/principled-order-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/principled-order-1.png -------------------------------------------------------------------------------- /ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-1.png -------------------------------------------------------------------------------- /ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/principled-order-coef-ests-2.png -------------------------------------------------------------------------------- /ex08_nesting-is-good_files/figure-gfm/revert-to-alphabetical-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/ex08_nesting-is-good_files/figure-gfm/revert-to-alphabetical-1.png -------------------------------------------------------------------------------- /ex09_row-summaries.R: -------------------------------------------------------------------------------- 1 | #' --- 2 | #' title: "Row-wise Summaries" 3 | #' author: "Jenny Bryan" 4 | #' date: "`r format(Sys.Date())`" 5 | #' output: github_document 6 | #' --- 7 | 8 | #+ setup, include = FALSE, cache = FALSE 9 | knitr::opts_chunk$set( 10 | collapse = TRUE, 11 | comment = "#>", 12 | error = TRUE 13 | ) 14 | options(tidyverse.quiet = TRUE) 15 | 16 | #' > For rowSums, mtcars %>% mutate(rowsum = pmap_dbl(., sum)) works but is 17 | #' > a tidy oneliner for mean or sd 
per row? 18 | #' > I'm looking for a tidy version of rowSums, rowMeans and similarly rowSDs... 19 | #' 20 | #' [Two](https://twitter.com/vrnijs/status/995129678284255233) 21 | #' [tweets](https://twitter.com/vrnijs/status/995193240864178177) from Vincent 22 | #' Nijs [github](https://github.com/vnijs), 23 | #' [twitter](https://twitter.com/vrnijs) 24 | #' 25 | 26 | #' Good question! This also came up when I was originally casting about for 27 | #' genuine row-wise operations, but I never worked it up. I will do so now! 28 | #' First I set up my example. 29 | #' 30 | #+ body 31 | # ---- 32 | library(tidyverse) 33 | 34 | df <- tribble( 35 | ~ name, ~ t1, ~t2, ~t3, 36 | "Abby", 1, 2, 3, 37 | "Bess", 4, 5, 6, 38 | "Carl", 7, 8, 9 39 | ) 40 | 41 | #' ## Use `rowSums()` and `rowMeans()` inside `dplyr::mutate()` 42 | #' 43 | #' One "tidy version" of `rowSums()` is to ... just stick `rowSums()` inside a 44 | #' tidyverse pipeline. You can use `rowSums()` and `rowMeans()` inside 45 | #' `mutate()`, because they have a method for `data.frame`: 46 | df %>% 47 | mutate(t_sum = rowSums(select_if(., is.numeric))) 48 | 49 | df %>% 50 | mutate(t_avg = rowMeans(select(., -name))) 51 | 52 | #' Above I also demonstrate the use of `select(., SOME_EXPRESSION)` to express 53 | #' which variables should be computed on. This comes up a lot in row-wise work 54 | #' with a data frame, because, almost by definition, your variables are of mixed 55 | #' type. These are just a few examples of the different ways to say "use `t1`, 56 | #' `t2`, and `t3`", so we don't try to sum or average `name`. I'll continue to 57 | #' mix these in as we go. They are equally useful when expressing which 58 | #' variables should be forwarded to `.f` inside `pmap_*().` 59 | #' 60 | #' ## Devil's Advocate: can't you just use `rowMeans()` and `rowSums()` alone? 61 | #' 62 | #' This is a great point [raised by Diogo 63 | #' Camacho](https://twitter.com/DiogoMCamacho/status/996178967647412224). If 64 | #' `rowSums()` and `rowMeans()` get the job done, why put yourself through the 65 | #' pain of using `pmap()`, especially inside `mutate()`? 66 | #' 67 | #' There are a few reasons: 68 | #' 69 | #' * You might want to take the median or standard deviation instead of a mean 70 | #' or a sum. You can't assume that base R or an add-on package offers a row-wise 71 | #' `data.frame` method for every function you might need. 72 | #' * You might have several variables besides `name` that need to be retained, 73 | #' but that should not be forwarded to `rowSums()` or `rowMeans()`. A 74 | #' matrix-with-row-names grants you a reprieve for exactly one variable and that 75 | #' variable best not be integer, factor, date, or datetime. Because you must 76 | #' store it as character. It's not a general solution. 77 | #' * Correctness. If you extract the numeric columns or the variables whose 78 | #' names start with `"t"`, compute `rowMeans()` on them, and then column-bind 79 | #' the result back to the data, you are responsible for making sure that the two 80 | #' objects are absolutely, positively row-aligned. 81 | #' 82 | #' I think it's important to have a general strategy for row-wise computation on 83 | #' a subset of the columns in a data frame. 84 | #' 85 | #' ## How to use an arbitrary function inside `pmap()` 86 | #' 87 | #' What if you need to apply `foo()` to rows and the universe has not provided a 88 | #' special-purpose `rowFoos()` function? 
Now you do need to use `pmap()` or a 89 | #' type-stable variant, with `foo()` playing the role of `.f`. 90 | #' 91 | #' This works especially well with `sum()`. 92 | 93 | df %>% 94 | mutate(t_sum = pmap_dbl(list(t1, t2, t3), sum)) 95 | 96 | df %>% 97 | mutate(t_sum = pmap_dbl(select(., starts_with("t")), sum)) 98 | 99 | #' But the original question was about means and standard deviations! Why is 100 | #' that any different? Look at the signature of `sum()` versus a few other 101 | #' numerical summaries: 102 | #' 103 | #+ eval = FALSE 104 | sum(..., na.rm = FALSE) 105 | mean(x, trim = 0, na.rm = FALSE, ...) 106 | median(x, na.rm = FALSE, ...) 107 | var(x, y = NULL, na.rm = FALSE, use) 108 | 109 | #' `sum()` is especially `pmap()`-friendly because it takes `...` as its primary 110 | #' argument. In contrast, `mean()` takes a vector `x` as primary argument, which 111 | #' makes it harder to just drop into `pmap()`. This is something you might never 112 | #' think about if you're used to using special-purpose helpers like 113 | #' `rowMeans()`. 114 | #' 115 | #' purrr has a family of `lift_*()` functions that help you convert between 116 | #' these forms. Here I apply `purrr::lift_vd()` to `mean()`, so I can use it 117 | #' inside `pmap()`. The "vd" says I want to convert a function that takes a 118 | #' "**v**ector" into one that takes "**d**ots". 119 | df %>% 120 | mutate(t_avg = pmap_dbl(list(t1, t2, t3), lift_vd(mean))) 121 | 122 | #' ## Strategies that use reshaping and joins 123 | #' 124 | #' Data frames simply aren't a convenient storage format if you have a frequent 125 | #' need to compute summaries, row-wise, on a subset of columns. It is highly 126 | #' suggestive that your data is in the wrong shape, i.e. it's not tidy. Here we 127 | #' explore some approaches that rely on reshaping and/or joining. They are more 128 | #' transparent than using `lift_*()` with `pmap()` inside `mutate()` and, 129 | #' consequently, more verbose. 130 | #' 131 | #' They all rely on forming row-wise summaries, then joining back to the data. 132 | #' 133 | #' ### Gather, group, summarize 134 | (s <- df %>% 135 | gather("time", "val", starts_with("t")) %>% 136 | group_by(name) %>% 137 | summarize(t_avg = mean(val), t_sum = sum(val))) 138 | df %>% 139 | left_join(s) 140 | 141 | #' ### Group then summarise, with explicit `c()` 142 | (s <- df %>% 143 | group_by(name) %>% 144 | summarise(t_avg = mean(c(t1, t2, t3)))) 145 | df %>% 146 | left_join(s) 147 | 148 | #' ### Nesting 149 | #' 150 | #' Let's revisit a pattern from 151 | #' [`ex08_nesting-is-good`](ex08_nesting-is-good.md). This is another way to 152 | #' "package" up the values of `t1`, `t2`, and `t3` in a way that make both 153 | #' `mean()` and `sum()` happy. *thanks @krlmlr* 154 | (s <- df %>% 155 | gather("key", "value", -name) %>% 156 | nest(-name) %>% 157 | mutate( 158 | sum = map(data, "value") %>% map_dbl(sum), 159 | mean = map(data, "value") %>% map_dbl(mean) 160 | ) %>% 161 | select(-data)) 162 | df %>% 163 | left_join(s) 164 | 165 | #' ### Yet another way to use `rowMeans()` 166 | (s <- df %>% 167 | column_to_rownames("name") %>% 168 | rowMeans() %>% 169 | enframe()) 170 | df %>% 171 | left_join(s) 172 | 173 | #' ## Maybe you should use a matrix 174 | #' 175 | #' If you truly have data where each row is: 176 | #' 177 | #' * Identifier for this observational unit 178 | #' * Homogeneous vector of length n for the unit 179 | #' 180 | #' then you do want to use a matrix with rownames. 
I used to do this alot but 181 | #' found that practically none of my data analysis problems live in this simple 182 | #' world for more than a couple of hours. Eventually I always get back to a 183 | #' setting where a data frame is the most favorable receptacle, overall. YMMV. 184 | m <- matrix( 185 | 1:9, 186 | byrow = TRUE, nrow = 3, 187 | dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3)) 188 | ) 189 | 190 | cbind(m, rowsum = rowSums(m)) 191 | cbind(m, rowmean = rowMeans(m)) 192 | -------------------------------------------------------------------------------- /ex09_row-summaries.md: -------------------------------------------------------------------------------- 1 | Row-wise Summaries 2 | ================ 3 | Jenny Bryan 4 | 2018-05-14 5 | 6 | > For rowSums, mtcars %\>% mutate(rowsum = pmap\_dbl(., sum)) works but 7 | > is a tidy oneliner for mean or sd per row? I’m looking for a tidy 8 | > version of rowSums, rowMeans and similarly rowSDs… 9 | 10 | [Two](https://twitter.com/vrnijs/status/995129678284255233) 11 | [tweets](https://twitter.com/vrnijs/status/995193240864178177) from 12 | Vincent Nijs [github](https://github.com/vnijs), 13 | [twitter](https://twitter.com/vrnijs) 14 | 15 | Good question\! This also came up when I was originally casting about 16 | for genuine row-wise operations, but I never worked it up. I will do so 17 | now\! First I set up my example. 18 | 19 | ``` r 20 | library(tidyverse) 21 | 22 | df <- tribble( 23 | ~ name, ~ t1, ~t2, ~t3, 24 | "Abby", 1, 2, 3, 25 | "Bess", 4, 5, 6, 26 | "Carl", 7, 8, 9 27 | ) 28 | ``` 29 | 30 | ## Use `rowSums()` and `rowMeans()` inside `dplyr::mutate()` 31 | 32 | One “tidy version” of `rowSums()` is to … just stick `rowSums()` inside 33 | a tidyverse pipeline. You can use `rowSums()` and `rowMeans()` inside 34 | `mutate()`, because they have a method for `data.frame`: 35 | 36 | ``` r 37 | df %>% 38 | mutate(t_sum = rowSums(select_if(., is.numeric))) 39 | #> Warning: package 'bindrcpp' was built under R version 3.4.4 40 | #> # A tibble: 3 x 5 41 | #> name t1 t2 t3 t_sum 42 | #> 43 | #> 1 Abby 1 2 3 6 44 | #> 2 Bess 4 5 6 15 45 | #> 3 Carl 7 8 9 24 46 | 47 | df %>% 48 | mutate(t_avg = rowMeans(select(., -name))) 49 | #> # A tibble: 3 x 5 50 | #> name t1 t2 t3 t_avg 51 | #> 52 | #> 1 Abby 1 2 3 2 53 | #> 2 Bess 4 5 6 5 54 | #> 3 Carl 7 8 9 8 55 | ``` 56 | 57 | Above I also demonstrate the use of `select(., SOME_EXPRESSION)` to 58 | express which variables should be computed on. This comes up a lot in 59 | row-wise work with a data frame, because, almost by definition, your 60 | variables are of mixed type. These are just a few examples of the 61 | different ways to say “use `t1`, `t2`, and `t3`”, so we don’t try to sum 62 | or average `name`. I’ll continue to mix these in as we go. They are 63 | equally useful when expressing which variables should be forwarded to 64 | `.f` inside 65 | `pmap_*().` 66 | 67 | ## Devil’s Advocate: can’t you just use `rowMeans()` and `rowSums()` alone? 68 | 69 | This is a great point [raised by Diogo 70 | Camacho](https://twitter.com/DiogoMCamacho/status/996178967647412224). 71 | If `rowSums()` and `rowMeans()` get the job done, why put yourself 72 | through the pain of using `pmap()`, especially inside `mutate()`? 73 | 74 | There are a few reasons: 75 | 76 | - You might want to take the median or standard deviation instead of a 77 | mean or a sum. You can’t assume that base R or an add-on package 78 | offers a row-wise `data.frame` method for every function you might 79 | need. 
80 | - You might have several variables besides `name` that need to be 81 | retained, but that should not be forwarded to `rowSums()` or 82 | `rowMeans()`. A matrix-with-row-names grants you a reprieve for 83 | exactly one variable and that variable best not be integer, factor, 84 | date, or datetime. Because you must store it as character. It’s not 85 | a general solution. 86 | - Correctness. If you extract the numeric columns or the variables 87 | whose names start with `"t"`, compute `rowMeans()` on them, and then 88 | column-bind the result back to the data, you are responsible for 89 | making sure that the two objects are absolutely, positively 90 | row-aligned. 91 | 92 | I think it’s important to have a general strategy for row-wise 93 | computation on a subset of the columns in a data frame. 94 | 95 | ## How to use an arbitrary function inside `pmap()` 96 | 97 | What if you need to apply `foo()` to rows and the universe has not 98 | provided a special-purpose `rowFoos()` function? Now you do need to use 99 | `pmap()` or a type-stable variant, with `foo()` playing the role of 100 | `.f`. 101 | 102 | This works especially well with `sum()`. 103 | 104 | ``` r 105 | df %>% 106 | mutate(t_sum = pmap_dbl(list(t1, t2, t3), sum)) 107 | #> # A tibble: 3 x 5 108 | #> name t1 t2 t3 t_sum 109 | #> 110 | #> 1 Abby 1 2 3 6 111 | #> 2 Bess 4 5 6 15 112 | #> 3 Carl 7 8 9 24 113 | 114 | df %>% 115 | mutate(t_sum = pmap_dbl(select(., starts_with("t")), sum)) 116 | #> # A tibble: 3 x 5 117 | #> name t1 t2 t3 t_sum 118 | #> 119 | #> 1 Abby 1 2 3 6 120 | #> 2 Bess 4 5 6 15 121 | #> 3 Carl 7 8 9 24 122 | ``` 123 | 124 | But the original question was about means and standard deviations\! Why 125 | is that any different? Look at the signature of `sum()` versus a few 126 | other numerical summaries: 127 | 128 | ``` r 129 | sum(..., na.rm = FALSE) 130 | mean(x, trim = 0, na.rm = FALSE, ...) 131 | median(x, na.rm = FALSE, ...) 132 | var(x, y = NULL, na.rm = FALSE, use) 133 | ``` 134 | 135 | `sum()` is especially `pmap()`-friendly because it takes `...` as its 136 | primary argument. In contrast, `mean()` takes a vector `x` as primary 137 | argument, which makes it harder to just drop into `pmap()`. This is 138 | something you might never think about if you’re used to using 139 | special-purpose helpers like `rowMeans()`. 140 | 141 | purrr has a family of `lift_*()` functions that help you convert between 142 | these forms. Here I apply `purrr::lift_vd()` to `mean()`, so I can use 143 | it inside `pmap()`. The “vd” says I want to convert a function that 144 | takes a “**v**ector” into one that takes “**d**ots”. 145 | 146 | ``` r 147 | df %>% 148 | mutate(t_avg = pmap_dbl(list(t1, t2, t3), lift_vd(mean))) 149 | #> # A tibble: 3 x 5 150 | #> name t1 t2 t3 t_avg 151 | #> 152 | #> 1 Abby 1 2 3 2 153 | #> 2 Bess 4 5 6 5 154 | #> 3 Carl 7 8 9 8 155 | ``` 156 | 157 | ## Strategies that use reshaping and joins 158 | 159 | Data frames simply aren’t a convenient storage format if you have a 160 | frequent need to compute summaries, row-wise, on a subset of columns. It 161 | is highly suggestive that your data is in the wrong shape, i.e. it’s not 162 | tidy. Here we explore some approaches that rely on reshaping and/or 163 | joining. They are more transparent than using `lift_*()` with `pmap()` 164 | inside `mutate()` and, consequently, more verbose. 165 | 166 | They all rely on forming row-wise summaries, then joining back to the 167 | data. 
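The original question also asked about row-wise standard deviations, so here is a minimal sketch (an added example, not from the original exchange, reusing `df` from above) of how `sd()` slots into this same join-back pattern; the subsections below then work through the mean and sum variants in more detail.

``` r
(s_sd <- df %>%
  gather("time", "val", starts_with("t")) %>%
  group_by(name) %>%
  summarise(t_sd = sd(val)))
df %>%
  left_join(s_sd)
```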
168 | 169 | ### Gather, group, summarize 170 | 171 | ``` r 172 | (s <- df %>% 173 | gather("time", "val", starts_with("t")) %>% 174 | group_by(name) %>% 175 | summarize(t_avg = mean(val), t_sum = sum(val))) 176 | #> # A tibble: 3 x 3 177 | #> name t_avg t_sum 178 | #> 179 | #> 1 Abby 2 6 180 | #> 2 Bess 5 15 181 | #> 3 Carl 8 24 182 | df %>% 183 | left_join(s) 184 | #> Joining, by = "name" 185 | #> # A tibble: 3 x 6 186 | #> name t1 t2 t3 t_avg t_sum 187 | #> 188 | #> 1 Abby 1 2 3 2 6 189 | #> 2 Bess 4 5 6 5 15 190 | #> 3 Carl 7 8 9 8 24 191 | ``` 192 | 193 | ### Group then summarise, with explicit `c()` 194 | 195 | ``` r 196 | (s <- df %>% 197 | group_by(name) %>% 198 | summarise(t_avg = mean(c(t1, t2, t3)))) 199 | #> # A tibble: 3 x 2 200 | #> name t_avg 201 | #> 202 | #> 1 Abby 2 203 | #> 2 Bess 5 204 | #> 3 Carl 8 205 | df %>% 206 | left_join(s) 207 | #> Joining, by = "name" 208 | #> # A tibble: 3 x 5 209 | #> name t1 t2 t3 t_avg 210 | #> 211 | #> 1 Abby 1 2 3 2 212 | #> 2 Bess 4 5 6 5 213 | #> 3 Carl 7 8 9 8 214 | ``` 215 | 216 | ### Nesting 217 | 218 | Let’s revisit a pattern from 219 | [`ex08_nesting-is-good`](ex08_nesting-is-good.md). This is another way 220 | to “package” up the values of `t1`, `t2`, and `t3` in a way that make 221 | both `mean()` and `sum()` happy. *thanks @krlmlr* 222 | 223 | ``` r 224 | (s <- df %>% 225 | gather("key", "value", -name) %>% 226 | nest(-name) %>% 227 | mutate( 228 | sum = map(data, "value") %>% map_dbl(sum), 229 | mean = map(data, "value") %>% map_dbl(mean) 230 | ) %>% 231 | select(-data)) 232 | #> # A tibble: 3 x 3 233 | #> name sum mean 234 | #> 235 | #> 1 Abby 6 2 236 | #> 2 Bess 15 5 237 | #> 3 Carl 24 8 238 | df %>% 239 | left_join(s) 240 | #> Joining, by = "name" 241 | #> # A tibble: 3 x 6 242 | #> name t1 t2 t3 sum mean 243 | #> 244 | #> 1 Abby 1 2 3 6 2 245 | #> 2 Bess 4 5 6 15 5 246 | #> 3 Carl 7 8 9 24 8 247 | ``` 248 | 249 | ### Yet another way to use `rowMeans()` 250 | 251 | ``` r 252 | (s <- df %>% 253 | column_to_rownames("name") %>% 254 | rowMeans() %>% 255 | enframe()) 256 | #> Warning: Setting row names on a tibble is deprecated. 257 | #> # A tibble: 3 x 2 258 | #> name value 259 | #> 260 | #> 1 Abby 2 261 | #> 2 Bess 5 262 | #> 3 Carl 8 263 | df %>% 264 | left_join(s) 265 | #> Joining, by = "name" 266 | #> # A tibble: 3 x 5 267 | #> name t1 t2 t3 value 268 | #> 269 | #> 1 Abby 1 2 3 2 270 | #> 2 Bess 4 5 6 5 271 | #> 3 Carl 7 8 9 8 272 | ``` 273 | 274 | ## Maybe you should use a matrix 275 | 276 | If you truly have data where each row is: 277 | 278 | - Identifier for this observational unit 279 | - Homogeneous vector of length n for the unit 280 | 281 | then you do want to use a matrix with rownames. I used to do this alot 282 | but found that practically none of my data analysis problems live in 283 | this simple world for more than a couple of hours. Eventually I always 284 | get back to a setting where a data frame is the most favorable 285 | receptacle, overall. YMMV. 
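If you do commit to the matrix route, here is an added sketch (not in the original) of deriving such a matrix from `df` rather than typing it in by hand; the chunk below builds the equivalent matrix directly.

``` r
m_from_df <- as.matrix(df[, -1])  # drop the `name` column; everything left is numeric
rownames(m_from_df) <- df$name    # keep the identifier as row names
m_from_df
```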
286 | 287 | ``` r 288 | m <- matrix( 289 | 1:9, 290 | byrow = TRUE, nrow = 3, 291 | dimnames = list(c("Abby", "Bess", "Carl"), paste0("t", 1:3)) 292 | ) 293 | 294 | cbind(m, rowsum = rowSums(m)) 295 | #> t1 t2 t3 rowsum 296 | #> Abby 1 2 3 6 297 | #> Bess 4 5 6 15 298 | #> Carl 7 8 9 24 299 | cbind(m, rowmean = rowMeans(m)) 300 | #> t1 t2 t3 rowmean 301 | #> Abby 1 2 3 2 302 | #> Bess 4 5 6 5 303 | #> Carl 7 8 9 8 304 | ``` 305 | -------------------------------------------------------------------------------- /iterate-over-rows.R: -------------------------------------------------------------------------------- 1 | #' --- 2 | #' title: "Turn data frame into a list, one component per row" 3 | #' author: "Jenny Bryan, updating work of Winston Chang" 4 | #' date: "`r format(Sys.Date())`" 5 | #' output: github_document 6 | #' --- 7 | #' 8 | #' Update of . 9 | #' 10 | #' * Added some methods, removed some methods. 11 | #' * Run every combination of problem size & method multiple times. 12 | #' * Explore different number of rows and columns, with mixed col types. 13 | 14 | library(scales) 15 | library(tidyverse) 16 | 17 | # for loop over row index 18 | f_for_loop <- function(df) { 19 | out <- vector(mode = "list", length = nrow(df)) 20 | for (i in seq_along(out)) { 21 | out[[i]] <- as.list(df[i, , drop = FALSE]) 22 | } 23 | out 24 | } 25 | 26 | # split into single row data frames then + lapply 27 | f_split_lapply <- function(df) { 28 | df <- split(df, seq_len(nrow(df))) 29 | lapply(df, function(row) as.list(row)) 30 | } 31 | 32 | # lapply over the vector of row numbers 33 | f_lapply_row <- function(df) { 34 | lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop = FALSE])) 35 | } 36 | 37 | # purrr::pmap 38 | f_pmap <- function(df) { 39 | pmap(df, list) 40 | } 41 | 42 | # purrr::transpose (happens to be exactly what's needed here) 43 | f_transpose <- function(df) { 44 | transpose(df) 45 | } 46 | 47 | ## explicit gc, then execute `expr` `n` times w/o explicit gc, return timings 48 | benchmark <- function(n = 1, expr, envir = parent.frame()) { 49 | expr <- substitute(expr) 50 | gc() 51 | map(seq_len(n), ~ system.time(eval(expr, envir), gcFirst = FALSE)) 52 | } 53 | 54 | run_row_benchmark <- function(nrow, times = 5) { 55 | df <- data.frame( 56 | x = rep_len(letters, length.out = nrow), 57 | y = runif(nrow), 58 | z = seq_len(nrow) 59 | ) 60 | res <- list( 61 | transpose = benchmark(times, f_transpose(df)), 62 | pmap = benchmark(times, f_pmap(df)), 63 | split_lapply = benchmark(times, f_split_lapply(df)), 64 | lapply_row = benchmark(times, f_lapply_row(df)), 65 | for_loop = benchmark(times, f_for_loop(df)) 66 | ) 67 | res <- map(res, ~ map_dbl(.x, "elapsed")) 68 | tibble( 69 | nrow = nrow, 70 | method = rep(names(res), lengths(res)), 71 | time = flatten_dbl(res) 72 | ) 73 | } 74 | 75 | run_col_benchmark <- function(ncol, times = 5) { 76 | nrow <- 3 77 | template <- data.frame( 78 | x = letters[seq_len(nrow)], 79 | y = runif(nrow), 80 | z = seq_len(nrow) 81 | ) 82 | df <- template[rep_len(seq_len(ncol(template)), length.out = ncol)] 83 | res <- list( 84 | transpose = benchmark(times, f_transpose(df)), 85 | pmap = benchmark(times, f_pmap(df)), 86 | split_lapply = benchmark(times, f_split_lapply(df)), 87 | lapply_row = benchmark(times, f_lapply_row(df)), 88 | for_loop = benchmark(times, f_for_loop(df)) 89 | ) 90 | res <- map(res, ~ map_dbl(.x, "elapsed")) 91 | tibble( 92 | ncol = ncol, 93 | method = rep(names(res), lengths(res)), 94 | time = flatten_dbl(res) 95 | ) 96 | } 97 | 98 | ## force figs to 
present methods in order of time 99 | flevels <- function(df) { 100 | mutate(df, method = fct_reorder(method, .x = desc(time))) 101 | } 102 | 103 | plot_it <- function(df, what = "nrow") { 104 | log10_breaks <- trans_breaks("log10", function(x) 10 ^ x) 105 | log10_mbreaks <- function(x) { 106 | limits <- c(floor(log10(x[1])), ceiling(log10(x[2]))) 107 | breaks <- 10 ^ seq(limits[1], limits[2]) 108 | 109 | unlist(lapply(breaks, function(x) x * seq(0.1, 0.9, by = 0.1))) 110 | } 111 | log10_labels <- trans_format("log10", math_format(10 ^ .x)) 112 | 113 | ggplot( 114 | df %>% dplyr::filter(time > 0), 115 | aes_string(x = what, y = "time", colour = "method") 116 | ) + 117 | geom_point() + 118 | stat_summary(aes(group = method), fun.y = mean, geom = "line") + 119 | scale_y_log10( 120 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks 121 | ) + 122 | scale_x_log10( 123 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks 124 | ) + 125 | labs( 126 | x = paste0("Number of ", if (what == "nrow") "rows" else "columns"), 127 | y = "Time (s)" 128 | ) + 129 | theme_bw() + 130 | theme(aspect.ratio = 1, legend.justification = "top") 131 | } 132 | 133 | ## dry runs 134 | # df_test <- run_row_benchmark(nrow = 10000) %>% flevels() 135 | # df_test <- run_col_benchmark(ncol = 10000) %>% flevels() 136 | # ggplot(df_test, aes(x = method, y = time)) + 137 | # geom_jitter(width = 0.25, height = 0) + 138 | # scale_y_log10() 139 | 140 | ## The Real Thing 141 | ## fairly fast up to 10^4, go get a coffee at 10^5 (row case only) 142 | #df_r <- map_df(10 ^ (1:5), run_row_benchmark) %>% flevels() 143 | #write_csv(df_r, "row-benchmark.csv") 144 | df_r <- read_csv("row-benchmark.csv") %>% flevels() 145 | 146 | #+ row-benchmark 147 | plot_it(df_r, "nrow") 148 | #ggsave("row-benchmark.png") 149 | 150 | #df_c <- map_df(10 ^ (1:5), run_col_benchmark) %>% flevels() 151 | #write_csv(df_c, "col-benchmark.csv") 152 | df_c <- read_csv("col-benchmark.csv") %>% flevels() 153 | 154 | #+ col-benchmark 155 | plot_it(df_c, "ncol") 156 | #ggsave("col-benchmark.png") 157 | 158 | ## used at first, but saw same dramatic gc artefacts as described here 159 | ## in my plots 160 | ## https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/ 161 | ## went for a DIY solution where I control gc 162 | # library(microbenchmark) 163 | # run_row_microbenchmark <- function(nrow, times = 5) { 164 | # df <- data.frame(x = rnorm(nrow), y = runif(nrow), z = runif(nrow)) 165 | # microbenchmark( 166 | # for_loop = f_for_loop(df), 167 | # split_lapply = f_split_lapply(df), 168 | # lapply_row = f_lapply_row(df), 169 | # pmap = f_pmap(df), 170 | # transpose = f_transpose(df), 171 | # times = times 172 | # ) %>% 173 | # as_tibble() %>% 174 | # rename(method = expr) %>% 175 | # mutate(method = as.character(method)) %>% 176 | # add_column(nrow = nrow, .before = 1) 177 | # } 178 | -------------------------------------------------------------------------------- /iterate-over-rows.md: -------------------------------------------------------------------------------- 1 | Turn data frame into a list, one component per row 2 | ================ 3 | Jenny Bryan, updating work of Winston Chang 4 | 2018-09-05 5 | 6 | Update of . 7 | 8 | - Added some methods, removed some methods. 9 | - Run every combination of problem size & method multiple times. 10 | - Explore different number of rows and columns, with mixed col types. 
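For concreteness, this is the target shape every method below produces (a small added illustration, not part of the benchmark code): a list with one element per row, where each element is a named list of that row's values.

``` r
library(purrr)
df_demo <- data.frame(x = c("a", "b"), y = 1:2, stringsAsFactors = FALSE)
transpose(df_demo)
# returns list(list(x = "a", y = 1L), list(x = "b", y = 2L))
```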
11 | 12 | 13 | 14 | ``` r 15 | library(scales) 16 | library(tidyverse) 17 | ``` 18 | 19 | ## ── Attaching packages ──────────────────────────────────── tidyverse 1.2.1 ── 20 | 21 | ## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5 22 | ## ✔ tibble 1.4.2 ✔ dplyr 0.7.6 23 | ## ✔ tidyr 0.8.1 ✔ stringr 1.3.1 24 | ## ✔ readr 1.2.0 ✔ forcats 0.3.0 25 | 26 | ## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ── 27 | ## ✖ readr::col_factor() masks scales::col_factor() 28 | ## ✖ purrr::discard() masks scales::discard() 29 | ## ✖ dplyr::filter() masks stats::filter() 30 | ## ✖ dplyr::lag() masks stats::lag() 31 | 32 | ``` r 33 | # for loop over row index 34 | f_for_loop <- function(df) { 35 | out <- vector(mode = "list", length = nrow(df)) 36 | for (i in seq_along(out)) { 37 | out[[i]] <- as.list(df[i, , drop = FALSE]) 38 | } 39 | out 40 | } 41 | 42 | # split into single row data frames then + lapply 43 | f_split_lapply <- function(df) { 44 | df <- split(df, seq_len(nrow(df))) 45 | lapply(df, function(row) as.list(row)) 46 | } 47 | 48 | # lapply over the vector of row numbers 49 | f_lapply_row <- function(df) { 50 | lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop = FALSE])) 51 | } 52 | 53 | # purrr::pmap 54 | f_pmap <- function(df) { 55 | pmap(df, list) 56 | } 57 | 58 | # purrr::transpose (happens to be exactly what's needed here) 59 | f_transpose <- function(df) { 60 | transpose(df) 61 | } 62 | 63 | ## explicit gc, then execute `expr` `n` times w/o explicit gc, return timings 64 | benchmark <- function(n = 1, expr, envir = parent.frame()) { 65 | expr <- substitute(expr) 66 | gc() 67 | map(seq_len(n), ~ system.time(eval(expr, envir), gcFirst = FALSE)) 68 | } 69 | 70 | run_row_benchmark <- function(nrow, times = 5) { 71 | df <- data.frame( 72 | x = rep_len(letters, length.out = nrow), 73 | y = runif(nrow), 74 | z = seq_len(nrow) 75 | ) 76 | res <- list( 77 | transpose = benchmark(times, f_transpose(df)), 78 | pmap = benchmark(times, f_pmap(df)), 79 | split_lapply = benchmark(times, f_split_lapply(df)), 80 | lapply_row = benchmark(times, f_lapply_row(df)), 81 | for_loop = benchmark(times, f_for_loop(df)) 82 | ) 83 | res <- map(res, ~ map_dbl(.x, "elapsed")) 84 | tibble( 85 | nrow = nrow, 86 | method = rep(names(res), lengths(res)), 87 | time = flatten_dbl(res) 88 | ) 89 | } 90 | 91 | run_col_benchmark <- function(ncol, times = 5) { 92 | nrow <- 3 93 | template <- data.frame( 94 | x = letters[seq_len(nrow)], 95 | y = runif(nrow), 96 | z = seq_len(nrow) 97 | ) 98 | df <- template[rep_len(seq_len(ncol(template)), length.out = ncol)] 99 | res <- list( 100 | transpose = benchmark(times, f_transpose(df)), 101 | pmap = benchmark(times, f_pmap(df)), 102 | split_lapply = benchmark(times, f_split_lapply(df)), 103 | lapply_row = benchmark(times, f_lapply_row(df)), 104 | for_loop = benchmark(times, f_for_loop(df)) 105 | ) 106 | res <- map(res, ~ map_dbl(.x, "elapsed")) 107 | tibble( 108 | ncol = ncol, 109 | method = rep(names(res), lengths(res)), 110 | time = flatten_dbl(res) 111 | ) 112 | } 113 | 114 | ## force figs to present methods in order of time 115 | flevels <- function(df) { 116 | mutate(df, method = fct_reorder(method, .x = desc(time))) 117 | } 118 | 119 | plot_it <- function(df, what = "nrow") { 120 | log10_breaks <- trans_breaks("log10", function(x) 10 ^ x) 121 | log10_mbreaks <- function(x) { 122 | limits <- c(floor(log10(x[1])), ceiling(log10(x[2]))) 123 | breaks <- 10 ^ seq(limits[1], limits[2]) 124 | 125 | unlist(lapply(breaks, function(x) x * seq(0.1, 0.9, by = 0.1))) 126 | } 
127 | log10_labels <- trans_format("log10", math_format(10 ^ .x)) 128 | 129 | ggplot( 130 | df %>% dplyr::filter(time > 0), 131 | aes_string(x = what, y = "time", colour = "method") 132 | ) + 133 | geom_point() + 134 | stat_summary(aes(group = method), fun.y = mean, geom = "line") + 135 | scale_y_log10( 136 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks 137 | ) + 138 | scale_x_log10( 139 | breaks = log10_breaks, labels = log10_labels, minor_breaks = log10_mbreaks 140 | ) + 141 | labs( 142 | x = paste0("Number of ", if (what == "nrow") "rows" else "columns"), 143 | y = "Time (s)" 144 | ) + 145 | theme_bw() + 146 | theme(aspect.ratio = 1, legend.justification = "top") 147 | } 148 | 149 | ## dry runs 150 | # df_test <- run_row_benchmark(nrow = 10000) %>% flevels() 151 | # df_test <- run_col_benchmark(ncol = 10000) %>% flevels() 152 | # ggplot(df_test, aes(x = method, y = time)) + 153 | # geom_jitter(width = 0.25, height = 0) + 154 | # scale_y_log10() 155 | 156 | ## The Real Thing 157 | ## fairly fast up to 10^4, go get a coffee at 10^5 (row case only) 158 | #df_r <- map_df(10 ^ (1:5), run_row_benchmark) %>% flevels() 159 | #write_csv(df_r, "row-benchmark.csv") 160 | df_r <- read_csv("row-benchmark.csv") %>% flevels() 161 | ``` 162 | 163 | ## Parsed with column specification: 164 | ## cols( 165 | ## nrow = col_double(), 166 | ## method = col_character(), 167 | ## time = col_double() 168 | ## ) 169 | 170 | ``` r 171 | plot_it(df_r, "nrow") 172 | ``` 173 | 174 | ![](iterate-over-rows_files/figure-gfm/row-benchmark-1.png) 175 | 176 | ``` r 177 | #ggsave("row-benchmark.png") 178 | 179 | #df_c <- map_df(10 ^ (1:5), run_col_benchmark) %>% flevels() 180 | #write_csv(df_c, "col-benchmark.csv") 181 | df_c <- read_csv("col-benchmark.csv") %>% flevels() 182 | ``` 183 | 184 | ## Parsed with column specification: 185 | ## cols( 186 | ## ncol = col_double(), 187 | ## method = col_character(), 188 | ## time = col_double() 189 | ## ) 190 | 191 | ``` r 192 | plot_it(df_c, "ncol") 193 | ``` 194 | 195 | ![](iterate-over-rows_files/figure-gfm/col-benchmark-1.png) 196 | 197 | ``` r 198 | #ggsave("col-benchmark.png") 199 | 200 | ## used at first, but saw same dramatic gc artefacts as described here 201 | ## in my plots 202 | ## https://radfordneal.wordpress.com/2014/02/02/inaccurate-results-from-microbenchmark/ 203 | ## went for a DIY solution where I control gc 204 | # library(microbenchmark) 205 | # run_row_microbenchmark <- function(nrow, times = 5) { 206 | # df <- data.frame(x = rnorm(nrow), y = runif(nrow), z = runif(nrow)) 207 | # microbenchmark( 208 | # for_loop = f_for_loop(df), 209 | # split_lapply = f_split_lapply(df), 210 | # lapply_row = f_lapply_row(df), 211 | # pmap = f_pmap(df), 212 | # transpose = f_transpose(df), 213 | # times = times 214 | # ) %>% 215 | # as_tibble() %>% 216 | # rename(method = expr) %>% 217 | # mutate(method = as.character(method)) %>% 218 | # add_column(nrow = nrow, .before = 1) 219 | # } 220 | ``` 221 | -------------------------------------------------------------------------------- /iterate-over-rows_files/figure-gfm/col-benchmark-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/iterate-over-rows_files/figure-gfm/col-benchmark-1.png -------------------------------------------------------------------------------- /iterate-over-rows_files/figure-gfm/row-benchmark-1.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/iterate-over-rows_files/figure-gfm/row-benchmark-1.png -------------------------------------------------------------------------------- /row-benchmark.csv: -------------------------------------------------------------------------------- 1 | nrow,method,time 2 | 10,transpose,0 3 | 10,transpose,0 4 | 10,transpose,0 5 | 10,transpose,0 6 | 10,transpose,0 7 | 10,pmap,9.999999997489795e-4 8 | 10,pmap,0 9 | 10,pmap,0 10 | 10,pmap,0.0010000000002037268 11 | 10,pmap,0 12 | 10,split_lapply,0.0010000000002037268 13 | 10,split_lapply,0.0010000000002037268 14 | 10,split_lapply,9.999999997489795e-4 15 | 10,split_lapply,0.0010000000002037268 16 | 10,split_lapply,0 17 | 10,lapply_row,9.999999997489795e-4 18 | 10,lapply_row,0.0010000000002037268 19 | 10,lapply_row,9.999999997489795e-4 20 | 10,lapply_row,0 21 | 10,lapply_row,0.0010000000002037268 22 | 10,for_loop,0.0010000000002037268 23 | 10,for_loop,0.0010000000002037268 24 | 10,for_loop,9.999999997489795e-4 25 | 10,for_loop,0.0010000000002037268 26 | 10,for_loop,9.999999997489795e-4 27 | 100,transpose,0 28 | 100,transpose,0 29 | 100,transpose,0 30 | 100,transpose,0 31 | 100,transpose,0 32 | 100,pmap,9.999999997489795e-4 33 | 100,pmap,0 34 | 100,pmap,0.0010000000002037268 35 | 100,pmap,0 36 | 100,pmap,9.999999997489795e-4 37 | 100,split_lapply,0.005999999999858119 38 | 100,split_lapply,0.007000000000061846 39 | 100,split_lapply,0.007000000000061846 40 | 100,split_lapply,0.005999999999858119 41 | 100,split_lapply,0.007999999999810825 42 | 100,lapply_row,0.007000000000061846 43 | 100,lapply_row,0.007000000000061846 44 | 100,lapply_row,0.006999999999607098 45 | 100,lapply_row,0.007000000000061846 46 | 100,lapply_row,0.007000000000061846 47 | 100,for_loop,0.008000000000265572 48 | 100,for_loop,0.006999999999607098 49 | 100,for_loop,0.008000000000265572 50 | 100,for_loop,0.007999999999810825 51 | 100,for_loop,0.007000000000061846 52 | 1e3,transpose,9.999999997489795e-4 53 | 1e3,transpose,0 54 | 1e3,transpose,0 55 | 1e3,transpose,0 56 | 1e3,transpose,0 57 | 1e3,pmap,0.0019999999999527063 58 | 1e3,pmap,0.0019999999999527063 59 | 1e3,pmap,0.003000000000156433 60 | 1e3,pmap,0.0019999999999527063 61 | 1e3,pmap,0.0019999999999527063 62 | 1e3,split_lapply,0.0749999999998181 63 | 1e3,split_lapply,0.07000000000016371 64 | 1e3,split_lapply,0.07699999999977081 65 | 1e3,split_lapply,0.07700000000022555 66 | 1e3,split_lapply,0.08199999999987995 67 | 1e3,lapply_row,0.07099999999991269 68 | 1e3,lapply_row,0.07200000000011642 69 | 1e3,lapply_row,0.0749999999998181 70 | 1e3,lapply_row,0.068000000000211 71 | 1e3,lapply_row,0.08399999999983265 72 | 1e3,for_loop,0.06599999999980355 73 | 1e3,for_loop,0.06500000000005457 74 | 1e3,for_loop,0.07100000000036744 75 | 1e3,for_loop,0.07699999999977081 76 | 1e3,for_loop,0.0729999999998654 77 | 1e4,transpose,9.999999997489795e-4 78 | 1e4,transpose,0.0019999999999527063 79 | 1e4,transpose,0.0010000000002037268 80 | 1e4,transpose,0.0010000000002037268 81 | 1e4,transpose,0.0019999999999527063 82 | 1e4,pmap,0.018999999999778083 83 | 1e4,pmap,0.023000000000138243 84 | 1e4,pmap,0.02099999999973079 85 | 1e4,pmap,0.028999999999996362 86 | 1e4,pmap,0.023000000000138243 87 | 1e4,split_lapply,1.0340000000001055 88 | 1e4,split_lapply,1.074999999999818 89 | 1e4,split_lapply,1.0900000000001455 90 | 1e4,split_lapply,1.0859999999997854 91 | 
1e4,split_lapply,1.1520000000000437 92 | 1e4,lapply_row,1.0160000000000764 93 | 1e4,lapply_row,1.0669999999995525 94 | 1e4,lapply_row,1.1410000000000764 95 | 1e4,lapply_row,1.2590000000000146 96 | 1e4,lapply_row,1.0799999999999272 97 | 1e4,for_loop,1.031999999999698 98 | 1e4,for_loop,1.0419999999999163 99 | 1e4,for_loop,1.1170000000001892 100 | 1e4,for_loop,1.1039999999998145 101 | 1e4,for_loop,1.0979999999999563 102 | 1e5,transpose,0.016999999999825377 103 | 1e5,transpose,0.01900000000023283 104 | 1e5,transpose,0.021000000000185537 105 | 1e5,transpose,0.02099999999973079 106 | 1e5,transpose,0.021000000000185537 107 | 1e5,pmap,0.23700000000008004 108 | 1e5,pmap,0.25700000000006185 109 | 1e5,pmap,0.3690000000001419 110 | 1e5,pmap,0.29300000000012005 111 | 1e5,pmap,0.3819999999996071 112 | 1e5,split_lapply,35.738000000000284 113 | 1e5,split_lapply,35.86400000000003 114 | 1e5,split_lapply,35.68899999999985 115 | 1e5,split_lapply,35.559999999999945 116 | 1e5,split_lapply,35.922000000000025 117 | 1e5,lapply_row,33.54099999999971 118 | 1e5,lapply_row,34.87699999999995 119 | 1e5,lapply_row,35.669000000000324 120 | 1e5,lapply_row,34.465999999999894 121 | 1e5,lapply_row,35.59400000000005 122 | 1e5,for_loop,35.01800000000003 123 | 1e5,for_loop,35.29099999999971 124 | 1e5,for_loop,34.46300000000019 125 | 1e5,for_loop,35.26099999999997 126 | 1e5,for_loop,34.33699999999999 127 | -------------------------------------------------------------------------------- /row-benchmark.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/row-benchmark.png -------------------------------------------------------------------------------- /row-oriented-workflows.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: No 4 | SaveWorkspace: No 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | PackageRoxygenize: rd,collate,namespace 22 | -------------------------------------------------------------------------------- /wch.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Applying a function over rows of a data frame" 3 | author: "Winston Chang" 4 | output: 5 | html_document: 6 | keep_md: TRUE 7 | --- 8 | 9 | ```{r setup, include=FALSE} 10 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>", cache = TRUE) 11 | ``` 12 | 13 | [Source](https://gist.github.com/wch/0e564def155d976c04dd28a876dc04b4) for this document. 14 | 15 | [RPub](https://rpubs.com/wch/200398) for this document. 16 | 17 | @dattali [asked](https://twitter.com/daattali/status/761058049859518464), "what's a safe way to iterate over rows of a data frame?" The example was to convert each row into a list and return a list of lists, indexed first by column, then by row. 18 | 19 | A number of people gave suggestions on Twitter, which I've collected here. I've benchmarked these methods with data of various sizes; scroll down to see a plot of times. 
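One caveat worth making explicit up front (an added note, not from the original thread): the `apply()` suggestion is only safe when all columns share a type, because `apply()` first coerces the data frame to a matrix. A minimal sketch of what that coercion does to mixed columns:

```{r apply-caveat}
# Added illustration: apply() routes through as.matrix(), so a mixed-type
# data frame becomes a character matrix before any row-wise work happens.
df_mixed <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
apply(df_mixed, 1, as.list)
# every value is now character, e.g. the first row comes back as list(x = "1", y = "a")
```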
20 | 21 | ```{r load-packages, cache = FALSE} 22 | library(purrr) 23 | library(dplyr) 24 | library(tidyr) 25 | ``` 26 | 27 | 28 | ```{r define-approaches, message=FALSE} 29 | # @dattali 30 | # Using apply (only safe when all cols are same type) 31 | f_apply <- function(df) { 32 | apply(df, 1, function(row) as.list(row)) 33 | } 34 | 35 | # @drob 36 | # split + lapply 37 | f_split_lapply <- function(df) { 38 | df <- split(df, seq_len(nrow(df))) 39 | lapply(df, function(row) as.list(row)) 40 | } 41 | 42 | # @winston_chang 43 | # lapply over row indices 44 | f_lapply_row <- function(df) { 45 | lapply(seq_len(nrow(df)), function(i) as.list(df[i,,drop=FALSE])) 46 | } 47 | 48 | # @winston_chang 49 | # lapply + lapply: Treat data frame as list, and the slice out lists 50 | f_lapply_lapply <- function(df) { 51 | cols <- seq_len(length(df)) 52 | names(cols) <- names(df) 53 | 54 | lapply(seq_len(nrow(df)), function(row) { 55 | lapply(cols, function(col) { 56 | df[[col]][[row]] 57 | }) 58 | }) 59 | } 60 | 61 | # @winston_chang 62 | # purrr::by_row 63 | # 2018-03-31 Jenny Bryan: by_row() no longer exists in purrr 64 | # f_by_row <- function(df) { 65 | # res <- by_row(df, function(row) as.list(row)) 66 | # res$.out 67 | # } 68 | 69 | # @JennyBryan 70 | # purrr::pmap 71 | f_pmap <- function(df) { 72 | pmap(df, list) 73 | } 74 | 75 | # purrr::pmap, but coerce df to a list first 76 | f_pmap_aslist <- function(df) { 77 | pmap(as.list(df), list) 78 | } 79 | 80 | # @krlmlr 81 | # dplyr::rowwise 82 | f_rowwise <- function(df) { 83 | df %>% rowwise %>% do(row = as.list(.)) 84 | } 85 | 86 | # @JennyBryan 87 | # purrr::transpose (only works for this specific task, i.e. one sub-list per row) 88 | f_transpose <- function(df) { 89 | transpose(df) 90 | } 91 | ``` 92 | 93 | 94 | Benchmark each of them, using data sets with varying numbers of rows: 95 | 96 | ```{r run-benchmark} 97 | run_benchmark <- function(nrow) { 98 | # Make some data 99 | df <- data.frame( 100 | x = rnorm(nrow), 101 | y = runif(nrow), 102 | z = runif(nrow) 103 | ) 104 | 105 | res <- list( 106 | apply = system.time(f_apply(df)), 107 | split_lapply = system.time(f_split_lapply(df)), 108 | lapply_row = system.time(f_lapply_row(df)), 109 | lapply_lapply = system.time(f_lapply_lapply(df)), 110 | #by_row = system.time(f_by_row(df)), 111 | pmap = system.time(f_pmap(df)), 112 | pmap_aslist = system.time(f_pmap_aslist(df)), 113 | rowwise = system.time(f_rowwise(df)), 114 | transpose = system.time(f_transpose(df)) 115 | ) 116 | 117 | # Get elapsed times 118 | res <- lapply(res, `[[`, "elapsed") 119 | 120 | # Add nrow to front 121 | res <- c(nrow = nrow, res) 122 | res 123 | } 124 | 125 | # Run the benchmarks for various size data 126 | all_times <- lapply(1:5, function(n) { 127 | run_benchmark(10^n) 128 | }) 129 | 130 | # Convert to data frame 131 | times <- lapply(all_times, as.data.frame) 132 | times <- do.call(rbind, times) 133 | 134 | knitr::kable(times) 135 | ``` 136 | 137 | 138 | ## Plot times 139 | 140 | This plot shows the number of seconds needed to process n rows, for each method. Both the x and y use log scales, so each step along the x scale represents a 10x increase in number of rows, and each step along the y scale represents a 10x increase in time. 
141 | 142 | ```{r plot, message=FALSE, cache = FALSE} 143 | library(ggplot2) 144 | library(scales) 145 | library(forcats) 146 | 147 | # Convert to long format 148 | times_long <- gather(times, method, seconds, -nrow) 149 | 150 | # Set order of methods, for plots 151 | times_long$method <- fct_reorder2( 152 | times_long$method, 153 | x = times_long$nrow, 154 | y = times_long$seconds 155 | ) 156 | 157 | # Plot with log-log axes 158 | ggplot(times_long, aes(x = nrow, y = seconds, colour = method)) + 159 | geom_point() + 160 | geom_line() + 161 | annotation_logticks(sides = "trbl") + 162 | theme_bw() + 163 | scale_y_continuous(trans = log10_trans(), 164 | breaks = trans_breaks("log10", function(x) 10^x), 165 | labels = trans_format("log10", math_format(10^.x)), 166 | minor_breaks = NULL) + 167 | scale_x_continuous(trans = log10_trans(), 168 | breaks = trans_breaks("log10", function(x) 10^x), 169 | labels = trans_format("log10", math_format(10^.x)), 170 | minor_breaks = NULL) 171 | ``` 172 | -------------------------------------------------------------------------------- /wch.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Applying a function over rows of a data frame" 3 | author: "Winston Chang" 4 | output: 5 | html_document: 6 | keep_md: TRUE 7 | --- 8 | 9 | 10 | 11 | [Source](https://gist.github.com/wch/0e564def155d976c04dd28a876dc04b4) for this document. 12 | 13 | [RPub](https://rpubs.com/wch/200398) for this document. 14 | 15 | @dattali [asked](https://twitter.com/daattali/status/761058049859518464), "what's a safe way to iterate over rows of a data frame?" The example was to convert each row into a list and return a list of lists, indexed first by column, then by row. 16 | 17 | A number of people gave suggestions on Twitter, which I've collected here. I've benchmarked these methods with data of various sizes; scroll down to see a plot of times. 
18 | 
19 | 
20 | ```r
21 | library(purrr)
22 | library(dplyr)
23 | #> 
24 | #> Attaching package: 'dplyr'
25 | #> The following objects are masked from 'package:stats':
26 | #> 
27 | #>     filter, lag
28 | #> The following objects are masked from 'package:base':
29 | #> 
30 | #>     intersect, setdiff, setequal, union
31 | library(tidyr)
32 | ```
33 | 
34 | 
35 | 
36 | ```r
37 | # @dattali
38 | # Using apply (only safe when all cols are the same type)
39 | f_apply <- function(df) {
40 |   apply(df, 1, function(row) as.list(row))
41 | }
42 | 
43 | # @drob
44 | # split + lapply
45 | f_split_lapply <- function(df) {
46 |   df <- split(df, seq_len(nrow(df)))
47 |   lapply(df, function(row) as.list(row))
48 | }
49 | 
50 | # @winston_chang
51 | # lapply over row indices
52 | f_lapply_row <- function(df) {
53 |   lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop = FALSE]))
54 | }
55 | 
56 | # @winston_chang
57 | # lapply + lapply: treat the data frame as a list, then slice out lists
58 | f_lapply_lapply <- function(df) {
59 |   cols <- seq_len(length(df))
60 |   names(cols) <- names(df)
61 | 
62 |   lapply(seq_len(nrow(df)), function(row) {
63 |     lapply(cols, function(col) {
64 |       df[[col]][[row]]
65 |     })
66 |   })
67 | }
68 | 
69 | # @winston_chang
70 | # purrr::by_row
71 | # 2018-03-31 Jenny Bryan: by_row() no longer exists in purrr
72 | # f_by_row <- function(df) {
73 | #   res <- by_row(df, function(row) as.list(row))
74 | #   res$.out
75 | # }
76 | 
77 | # @JennyBryan
78 | # purrr::pmap
79 | f_pmap <- function(df) {
80 |   pmap(df, list)
81 | }
82 | 
83 | # purrr::pmap, but coerce df to a list first
84 | f_pmap_aslist <- function(df) {
85 |   pmap(as.list(df), list)
86 | }
87 | 
88 | # @krlmlr
89 | # dplyr::rowwise
90 | f_rowwise <- function(df) {
91 |   df %>% rowwise %>% do(row = as.list(.))
92 | }
93 | 
94 | # @JennyBryan
95 | # purrr::transpose (only works for this specific task, i.e. one sub-list per row)
96 | f_transpose <- function(df) {
97 |   transpose(df)
98 | }
99 | ```
100 | 
101 | 
102 | Benchmark each of them, using data sets with varying numbers of rows:
103 | 
104 | 
105 | ```r
106 | run_benchmark <- function(nrow) {
107 |   # Make some data
108 |   df <- data.frame(
109 |     x = rnorm(nrow),
110 |     y = runif(nrow),
111 |     z = runif(nrow)
112 |   )
113 | 
114 |   res <- list(
115 |     apply = system.time(f_apply(df)),
116 |     split_lapply = system.time(f_split_lapply(df)),
117 |     lapply_row = system.time(f_lapply_row(df)),
118 |     lapply_lapply = system.time(f_lapply_lapply(df)),
119 |     #by_row = system.time(f_by_row(df)),
120 |     pmap = system.time(f_pmap(df)),
121 |     pmap_aslist = system.time(f_pmap_aslist(df)),
122 |     rowwise = system.time(f_rowwise(df)),
123 |     transpose = system.time(f_transpose(df))
124 |   )
125 | 
126 |   # Get elapsed times
127 |   res <- lapply(res, `[[`, "elapsed")
128 | 
129 |   # Add nrow to front
130 |   res <- c(nrow = nrow, res)
131 |   res
132 | }
133 | 
134 | # Run the benchmarks for data of various sizes
135 | all_times <- lapply(1:5, function(n) {
136 |   run_benchmark(10^n)
137 | })
138 | 
139 | # Convert to data frame
140 | times <- lapply(all_times, as.data.frame)
141 | times <- do.call(rbind, times)
142 | 
143 | knitr::kable(times)
144 | ```
145 | 
146 | 
147 | 
148 |   nrow  apply  split_lapply  lapply_row  lapply_lapply   pmap  pmap_aslist  rowwise  transpose
149 | ------ ------ ------------- ----------- -------------- ------ ------------ -------- ----------
150 |  1e+01  0.000         0.000       0.001          0.000  0.001        0.001    0.044      0.000
151 |  1e+02  0.002         0.005       0.005          0.005  0.002        0.002    0.054      0.002
152 |  1e+03  0.004         0.036       0.034          0.015  0.002        0.002    0.056      0.001
153 |  1e+04  0.033         0.422       0.339          0.163  0.017        0.016    0.504      0.002
154 |  1e+05  0.527        24.720      23.743          1.808  0.201        0.220    5.322      0.017
155 | 
156 | 
157 | ## Plot times
158 | 
159 | This plot shows the number of seconds each method needs to process n rows. Both the x and y axes use log scales, so each step along the x axis represents a 10x increase in the number of rows, and each step along the y axis represents a 10x increase in time.
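Before the plot code below, here is a rough editorial sketch (not part of the original analysis) that quantifies the same idea numerically: the least-squares slope of log10(seconds) on log10(nrow) is each method's empirical scaling exponent, so a slope near 1 suggests run time grows roughly linearly with the number of rows. It assumes the `times` data frame built above and sticks with `gather()` to match the tidyr idiom used in this document.

```r
# Estimate each method's empirical scaling exponent: the least-squares slope of
# log10(seconds) on log10(nrow). Zero timings are dropped because log10(0) is -Inf.
library(dplyr)
library(tidyr)

times %>%
  gather(method, seconds, -nrow) %>%
  filter(seconds > 0) %>%
  mutate(log_n = log10(nrow), log_s = log10(seconds)) %>%
  group_by(method) %>%
  summarise(slope = cov(log_n, log_s) / var(log_n))  # OLS slope of log_s on log_n
```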
160 | 
161 | 
162 | ```r
163 | library(ggplot2)
164 | library(scales)
165 | library(forcats)
166 | 
167 | # Convert to long format
168 | times_long <- gather(times, method, seconds, -nrow)
169 | 
170 | # Set order of methods, for plots
171 | times_long$method <- fct_reorder2(
172 |   times_long$method,
173 |   x = times_long$nrow,
174 |   y = times_long$seconds
175 | )
176 | 
177 | # Plot with log-log axes
178 | ggplot(times_long, aes(x = nrow, y = seconds, colour = method)) +
179 |   geom_point() +
180 |   geom_line() +
181 |   annotation_logticks(sides = "trbl") +
182 |   theme_bw() +
183 |   scale_y_continuous(trans = log10_trans(),
184 |     breaks = trans_breaks("log10", function(x) 10^x),
185 |     labels = trans_format("log10", math_format(10^.x)),
186 |     minor_breaks = NULL) +
187 |   scale_x_continuous(trans = log10_trans(),
188 |     breaks = trans_breaks("log10", function(x) 10^x),
189 |     labels = trans_format("log10", math_format(10^.x)),
190 |     minor_breaks = NULL)
191 | #> Warning: Transformation introduced infinite values in continuous y-axis
192 | 
193 | #> Warning: Transformation introduced infinite values in continuous y-axis
194 | ```
195 | 
196 | ![](wch_files/figure-html/plot-1.png)
197 | 
--------------------------------------------------------------------------------
/wch_files/figure-html/plot-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jennybc/row-oriented-workflows/3a35465849edd2b98a0b63e2661b3c48911aff79/wch_files/figure-html/plot-1.png
--------------------------------------------------------------------------------