├── .Rbuildignore ├── .gitignore ├── .travis.yml ├── DESCRIPTION ├── README.md ├── _bookdown.yml ├── _output.yml ├── adv-r.css ├── dotdotdot.Rmd ├── dplyr.Rmd ├── ga_script.html ├── ggplot.Rmd ├── glossary.Rmd ├── grammar.Rmd ├── index.Rmd ├── introduction.Rmd ├── modify.Rmd ├── setup.R ├── tidyeval.Rproj └── toolbox.Rmd /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^tidyeval\.Rproj$ 2 | ^\.Rproj\.user$ 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | *_cache 3 | *_files 4 | _book 5 | _main.* 6 | *.md 7 | *.rds 8 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | # R for travis: see documentation at https://docs.travis-ci.com/user/languages/r 2 | 3 | language: R 4 | sudo: false 5 | cache: 6 | packages: true 7 | directories: 8 | - _bookdown_files 9 | - $HOME/.npm 10 | 11 | before_install: 12 | - nvm install stable 13 | - npm install netlify-cli -g 14 | 15 | script: 16 | - Rscript -e 'bookdown::render_book("index.Rmd")' 17 | 18 | deploy: 19 | provider: script 20 | script: netlify deploy --prod --dir _book 21 | skip_cleanup: true 22 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: tidyeval 2 | Title: Tidy evaluation 3 | Version: 0.0.0.9000 4 | Authors@R: c( 5 | person("Lionel", "Henry", ,"lionel@rstudio.com", c("aut", "cre")), 6 | person("Hadley", "Wickham", ,"hadley@rstudio.com", "aut"), 7 | person("RStudio", role = "cph")) 8 | Depends: 9 | R (>= 3.1.0) 10 | Imports: 11 | bookdown, 12 | dplyr (>= 0.8.2), 13 | ggplot2 (>= 3.0.0), 14 | rlang (>= 0.4.0) 15 | URL: http://tidyeval.tidyverse.org, 
https://github.com/tidyverse/tidyeval 16 | BugReports: https://github.com/tidyverse/tidyeval/issues 17 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | ![Lifecycle Status](https://img.shields.io/badge/lifecycle-superseded-orange.svg) 3 | 4 | This guide is now superseded by more recent efforts at documenting tidy evaluation in a user-friendly way. We now recommend reading: 5 | 6 | - The new [Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html) vignette. 7 | 8 | - The [Using ggplot2 in packages](https://ggplot2.tidyverse.org/articles/ggplot2-in-packages.html) vignette. 9 | 10 | We are keeping this bookdown guide online for posterity, but please know that it is missing a lot of advances that make tidy eval more palatable, such as the embracing operator `{{ arg }}` and [glue](https://glue.tidyverse.org/) support for custom names. 11 | -------------------------------------------------------------------------------- /_bookdown.yml: -------------------------------------------------------------------------------- 1 | 2 | new_session: yes 3 | 4 | rmd_files: 5 | [ 6 | "index.Rmd", 7 | "introduction.Rmd", 8 | "dotdotdot.Rmd", 9 | "modify.Rmd", 10 | "glossary.Rmd", 11 | "dplyr.Rmd", 12 | "ggplot.Rmd", 13 | "toolbox.Rmd", 14 | "grammar.Rmd", 15 | ] 16 | -------------------------------------------------------------------------------- /_output.yml: -------------------------------------------------------------------------------- 1 | bookdown::gitbook: 2 | includes: 3 | in_header: [ga_script.html] 4 | config: 5 | sharing: 6 | github: yes 7 | facebook: no 8 | twitter: no 9 | toc: 10 | collapse: section 11 | before: | 12 |
  • Tidy evaluation
  • 13 | edit: https://github.com/tidyverse/tidyeval/edit/master/%s 14 | css: adv-r.css 15 | -------------------------------------------------------------------------------- /adv-r.css: -------------------------------------------------------------------------------- 1 | .book .book-header h1 { 2 | opacity: 1; 3 | text-align: left; 4 | } 5 | 6 | #header .title { 7 | margin-bottom: 0em; 8 | } 9 | #header h4.author { 10 | margin: 0; 11 | color: #666; 12 | } 13 | #header h4.author em { 14 | font-style: normal; 15 | } 16 | 17 | /* Sidebar formating --------------------------------------------*/ 18 | 19 | div.sidebar, div.base { 20 | border: 1px solid #ccc; 21 | border-left-width: 5px; 22 | border-radius: 5px; 23 | padding: 1em; 24 | margin: 1em 0; 25 | } 26 | 27 | /* .book .book-body .page-wrapper .page-inner section.normal is needed 28 | to override the styles produced by gitbook, which are ridiculously 29 | overspecified. Goal of the selectors is to ensure internal "margins" 30 | controlled only by padding of container */ 31 | 32 | .book .book-body .page-wrapper .page-inner section.normal div.sidebar > :first-child, 33 | .book .book-body .page-wrapper .page-inner section.normal div.base > :first-child { 34 | margin-top: 0; 35 | } 36 | 37 | .book .book-body .page-wrapper .page-inner section.normal div.sidebar > :last-child, 38 | .book .book-body .page-wrapper .page-inner section.normal div.base > :last-child { 39 | margin-bottom: 0; 40 | } 41 | 42 | div.base::before { 43 | display: block; 44 | content: "In base R"; 45 | } 46 | 47 | div.base::before, 48 | .book .book-body .page-wrapper .page-inner section.normal .sidebar h3 { 49 | font-size: 1.1em; 50 | font-weight: 700; 51 | margin-bottom: 0.25em; 52 | color: #333; 53 | } 54 | 55 | .todo { 56 | display: block; 57 | border: 1px solid red; 58 | border-left-width: 5px; 59 | border-radius: 5px; 60 | padding: 0.5em 1em; 61 | margin: 1em 0; 62 | } 63 | 64 | .todo::before { 65 | content: "TO DO: "; 66 | font-weight: bold; 67 
| color: red; 68 | } 69 | 70 | /* Resolve gitbook and pandoc codeblock margins -------------------------- */ 71 | 72 | div.sourceCode { 73 | margin-top: 0; 74 | margin-bottom: 0.85em; 75 | } 76 | .book .book-body .page-wrapper .page-inner section.normal pre { 77 | margin-bottom: 0; 78 | } 79 | 80 | /* Other gitbook tweaks -------------------------------------------------- */ 81 | 82 | .book .book-body .page-wrapper .page-inner section.normal code { 83 | padding: 2px 0; 84 | } 85 | -------------------------------------------------------------------------------- /dotdotdot.Rmd: -------------------------------------------------------------------------------- 1 | 2 | ```{r setup, include = FALSE} 3 | source("setup.R") 4 | library("dplyr") 5 | ``` 6 | 7 | # Dealing with multiple arguments {#multiple} 8 | 9 | In the first chapter we have created `grouped_mean()`, a function that takes one grouping variable and one summary variable and computes the grouped average. It would make sense to take multiple grouping variables instead of just one. Quoting and unquoting multiple variables is pretty much the same process as for single arguments: 10 | 11 | * Unquoting multiple arguments requires a variant of `!!`, the big bang operator `!!!`. 12 | 13 | * Quoting multiple arguments can be done in two ways: internal quoting with the plural variant `enquos()` and external quoting with `vars()`. 14 | 15 | 16 | ## The `...` argument 17 | 18 | The dot-dot-dot argument is one of the nicest aspects of the R language. A function that takes `...` accepts any number of arguments, named or unnamed. As a programmer you can do three things with `...`: 19 | 20 | 1. **Evaluate** the arguments contained in the dots and materialise them in a list by forwarding the dots to `list()`: 21 | 22 | ```{r} 23 | materialise <- function(data, ...) { 24 | dots <- list(...) 
25 | dots 26 | } 27 | ``` 28 | 29 | The dots names conveniently become the names of the list: 30 | 31 | ```{r} 32 | materialise(mtcars, 1 + 2, important_name = letters) 33 | ``` 34 | 35 | 1. **Quote** the arguments in the dots with `enquos()`: 36 | 37 | ```{r} 38 | capture <- function(data, ...) { 39 | dots <- enquos(...) 40 | dots 41 | } 42 | ``` 43 | 44 | All arguments passed to `...` are automatically quoted and returned as a list. The names of the arguments become the names of that list: 45 | 46 | ```{r} 47 | capture(mtcars, 1 + 2, important_name = letters) 48 | ``` 49 | 50 | 1. **Forward** the dots to another function: 51 | 52 | ```{r} 53 | forward <- function(data, ...) { 54 | forwardee(...) 55 | } 56 | ``` 57 | 58 | When dots are forwarded the names of arguments in `...` are matched to the arguments of the forwardee: 59 | 60 | ```{r} 61 | forwardee <- function(foo, bar, ...) { 62 | list(foo = foo, bar = bar, ...) 63 | } 64 | ``` 65 | 66 | Let's call the forwarding function with a bunch of named and unnamed arguments: 67 | 68 | ```{r} 69 | forward(mtcars, bar = 100, 1, 2, 3) 70 | ``` 71 | 72 | The unnamed argument `1` was matched to `foo` positionally. The named argument `bar` was matched to `bar`. The remaining arguments were passed in order. 73 | 74 | For the purpose of writing tidy eval functions the last two techniques are important. There are two distinct situations: 75 | 76 | 1. You don't need to modify the arguments in any way, just passing them through. Then simply forward `...` to other quoting functions in the ordinary way. 77 | 78 | 1. You'd like to change the argument names (which become column names in `dplyr::mutate()` calls) or modify the arguments themselves (for instance negate a `dplyr::select()`ion). In that case you'll need to use `enquos()` to *quote* the arguments in the dots. You'll then pass the quoted arguments to other quoting functions by *forwarding* them with the help of `!!!`. 
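The second situation can be sketched as follows. This is only a minimal illustration, assuming dplyr and rlang (>= 0.4.0) are installed; `mutate_prefixed()` and its `.prefix` argument are hypothetical names made up for this example and are not part of any package. The function quotes the dots with `enquos()`, changes the names of the resulting list, then forwards the modified list to `mutate()` with `!!!`:

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): forwards `...` to mutate()
# but adds a prefix to the name of every new column.
mutate_prefixed <- function(.data, ..., .prefix = "new_") {
  # Quote the dots; `.named = TRUE` gives default names to unnamed arguments
  dots <- enquos(..., .named = TRUE)

  # Modify the names of the list of quoted arguments
  names(dots) <- paste0(.prefix, names(dots))

  # Forward the modified list by unquote-splicing it
  mutate(.data, !!!dots)
}

out <- mutate_prefixed(mtcars, kpl = mpg * 0.425, cyl2 = cyl * 2)
names(out)
```

Simple forwarding of `...` could not do this, because the names have to be changed before the arguments reach `mutate()`.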
 79 | 80 | 81 | ## Simple forwarding of `...` 82 | 83 | If you are not modifying the arguments in `...` in any way and just want to pass them to another quoting function, just forward `...` like usual! There is no need for quoting and unquoting because of the magic of forwarding. The arguments in `...` are transported to their final destination where they will be quoted. 84 | 85 | The function `grouped_mean()` is still going to need some remodelling because it is good practice to take all important named arguments before the dots. Let's start by swapping `group_var` and `summary_var`: 86 | 87 | ```{r} 88 | grouped_mean <- function(data, summary_var, group_var) { 89 | summary_var <- enquo(summary_var) 90 | group_var <- enquo(group_var) 91 | 92 | data %>% 93 | group_by(!!group_var) %>% 94 | summarise(mean = mean(!!summary_var)) 95 | } 96 | ``` 97 | 98 | Then we replace `group_var` with `...` and pass it to `group_by()`: 99 | 100 | ```{r} 101 | grouped_mean <- function(data, summary_var, ...) { 102 | summary_var <- enquo(summary_var) 103 | 104 | data %>% 105 | group_by(...) %>% 106 | summarise(mean = mean(!!summary_var)) 107 | } 108 | ``` 109 | 110 | It is good practice to make one final adjustment. Because arguments in `...` can have arbitrary names, we don't want to "use up" valid names. In tidyverse packages we use the convention of prefixing named arguments with a dot so that conflicts are less likely: 111 | 112 | ```{r} 113 | grouped_mean <- function(.data, .summary_var, ...) { 114 | .summary_var <- enquo(.summary_var) 115 | 116 | .data %>% 117 | group_by(...) 
%>% 118 | summarise(mean = mean(!!.summary_var)) 119 | } 120 | ``` 121 | 122 | Let's check this function now works with any number of grouping variables: 123 | 124 | ```{r} 125 | grouped_mean(mtcars, disp, cyl, am) 126 | 127 | grouped_mean(mtcars, disp, cyl, am, vs) 128 | ``` 129 | 130 | 131 | ## Quote multiple arguments 132 | 133 | When we need to modify the arguments or their names, we can't simply forward the dots. We'll have to quote and unquote with the plural variants of `enquo()` and `!!`. 134 | 135 | - We'll quote the dots with `enquos()`. 136 | - We'll unquote-splice the quoted dots with `!!!`. 137 | 138 | While the singular `enquo()` returns a single quoted argument, the plural variant `enquos()` returns a list of quoted arguments. Let's use it to quote the dots: 139 | 140 | ```{r} 141 | grouped_mean2 <- function(.data, .summary_var, ...) { 142 | .summary_var <- enquo(.summary_var) 143 | .group_vars <- enquos(...) 144 | 145 | .data %>% 146 | group_by(!!.group_vars) %>% 147 | summarise(mean = mean(!!.summary_var)) 148 | } 149 | ``` 150 | 151 | `grouped_mean2()` now accepts and automatically quotes any number of grouping variables. However it doesn't work quite yet: 152 | 153 | **FIXME**: Depend on dev rlang to get a better error message. 154 | 155 | ```{r, error = TRUE } 156 | grouped_mean2(mtcars, disp, cyl, am) 157 | ``` 158 | 159 | Instead of *forwarding* the individual arguments to `group_by()` we have passed the list of arguments itself! Unquoting is not the right operation here. Fortunately tidy eval provides a special operator that makes it easy to forward a list of arguments. 160 | 161 | 162 | ## Unquote multiple arguments 163 | 164 | The **unquote-splice** operator `!!!` takes each element of a list and unquotes it as an independent argument to the surrounding function call. The arguments are *spliced* in the function call. This is just what we need for forwarding multiple quoted arguments. 
165 | 166 | Let's use `qq_show()` to observe the difference between `!!` and `!!!` in a `group_by()` expression. We can only use `enquos()` within a function so let's create a list of quoted names for the purpose of experimenting: 167 | 168 | ```{r} 169 | vars <- list( 170 | quote(cyl), 171 | quote(am) 172 | ) 173 | ``` 174 | 175 | `qq_show()` shows the difference between unquoting a list and unquote-splicing a list: 176 | 177 | ```{r} 178 | rlang::qq_show(group_by(!!vars)) 179 | 180 | rlang::qq_show(group_by(!!!vars)) 181 | ``` 182 | 183 | When we use the unquote operator `!!`, `group_by()` gets a list of expressions. When we unquote-splice with `!!!`, the expressions are forwarded as individual arguments to `group_by()`. Let's use the latter to fix `grouped_mean2()`: 184 | 185 | ```{r} 186 | grouped_mean2 <- function(.data, .summary_var, ...) { 187 | .summary_var <- enquo(.summary_var) 188 | .group_vars <- enquos(...) 189 | 190 | .data %>% 191 | group_by(!!!.group_vars) %>% 192 | summarise(mean = mean(!!.summary_var)) 193 | } 194 | ``` 195 | 196 | The quote and unquote version of `grouped_mean()` does a bit more work but is functionally identical to the forwarding version: 197 | 198 | ```{r} 199 | grouped_mean(mtcars, disp, cyl, am) 200 | 201 | grouped_mean2(mtcars, disp, cyl, am) 202 | ``` 203 | 204 | When does it become useful to do all this extra work? Whenever you need to modify the arguments or their names. 205 | 206 | Up to now we have used the quote-and-unquote pattern to pass quoted arguments to other quoting functions "as is". With this simple and powerful pattern you can extract complex combinations of quoting verbs into reusable functions. 207 | 208 | However tidy eval provides much more flexibility. It is a general purpose meta-programming framework that makes it easy to modify quoted arguments before evaluation. 
In the next section you'll learn about basic metaprogramming patterns that will allow you to modify expressions before passing them on to other functions. 209 | -------------------------------------------------------------------------------- /dplyr.Rmd: -------------------------------------------------------------------------------- 1 | 2 | # (PART) Cookbooks {-} 3 | 4 | ```{r setup, include = FALSE} 5 | source("setup.R") 6 | library("dplyr") 7 | ``` 8 | 9 | # dplyr 10 | 11 | In the introductory vignette we learned that creating tidy eval functions boils down to a single pattern: quote and unquote. In this vignette we'll apply this pattern in a series of recipes for dplyr. 12 | 13 | This vignette is organised so that you can quickly find your way to a copy-paste solution when you face an immediate problem. 14 | 15 | 16 | ## Patterns for single arguments 17 | 18 | ### `enquo()` and `!!` - Quote and unquote arguments 19 | 20 | We start with a quick recap of the introductory vignette. Creating a function around dplyr pipelines involves three steps: abstraction, quoting, and unquoting. 21 | 22 | 23 | * **Abstraction step** 24 | 25 | First identify the varying parts: 26 | 27 | ```{r, eval = FALSE} 28 | df1 %>% group_by(x1) %>% summarise(mean = mean(y1)) 29 | df2 %>% group_by(x2) %>% summarise(mean = mean(y2)) 30 | df3 %>% group_by(x3) %>% summarise(mean = mean(y3)) 31 | df4 %>% group_by(x4) %>% summarise(mean = mean(y4)) 32 | ``` 33 | 34 | And abstract those away with informative argument names: 35 | 36 | ```{r, eval = FALSE} 37 | data %>% group_by(group_var) %>% summarise(mean = mean(summary_var)) 38 | ``` 39 | 40 | And wrap in a function: 41 | 42 | ```{r} 43 | grouped_mean <- function(data, group_var, summary_var) { 44 | data %>% 45 | group_by(group_var) %>% 46 | summarise(mean = mean(summary_var)) 47 | } 48 | ``` 49 | 50 | 51 | * **Quoting step** 52 | 53 | Identify all the arguments where the user is allowed to refer to data frame columns directly. 
The function can't evaluate these arguments right away. Instead they should be automatically quoted. Apply `enquo()` to these arguments 54 | 55 | ```{r, eval = FALSE} 56 | group_var <- enquo(group_var) 57 | summary_var <- enquo(summary_var) 58 | ``` 59 | 60 | 61 | * **Unquoting step** 62 | 63 | Identify where these variables are passed to other quoting functions and unquote with `!!`. In this case we pass `group_var` to `group_by()` and `summary_var` to `summarise()`: 64 | 65 | ```{r, eval = FALSE} 66 | data %>% 67 | group_by(!!group_var) %>% 68 | summarise(mean = mean(!!summary_var)) 69 | ``` 70 | 71 | We end up with a function that automatically quotes its arguments `group_var` and `summary_var` and unquotes them when they are passed to other quoting functions: 72 | 73 | ```{r} 74 | grouped_mean <- function(data, group_var, summary_var) { 75 | group_var <- enquo(group_var) 76 | summary_var <- enquo(summary_var) 77 | 78 | data %>% 79 | group_by(!!group_var) %>% 80 | summarise(mean = mean(!!summary_var)) 81 | } 82 | 83 | grouped_mean(mtcars, cyl, mpg) 84 | ``` 85 | 86 | 87 | ### `as_label()` - Create default column names 88 | 89 | Use `as_label()` to transform a quoted expression to a column name: 90 | 91 | ```{r} 92 | simple_var <- quote(height) 93 | as_label(simple_var) 94 | ``` 95 | 96 | These names are only a default stopgap. For more complex uses, you'll probably want to let the user override the default. Here is a case where the default name is clearly suboptimal: 97 | 98 | ```{r} 99 | complex_var <- quote(mean(height, na.rm = TRUE)) 100 | as_label(complex_var) 101 | ``` 102 | 103 | 104 | ### `:=` and `!!` - Unquote column names 105 | 106 | In expressions like `c(name = NA)`, the argument name is quoted. 
Because of the quoting it's not possible to make an indirect reference to a variable that contains a name: 107 | 108 | ```{r} 109 | name <- "the real name" 110 | c(name = NA) 111 | ``` 112 | 113 | In tidy eval functions it is possible to unquote argument names with `!!`. However you need the special `:=` operator: 114 | 115 | ```{r} 116 | rlang::qq_show(c(!!name := NA)) 117 | ``` 118 | 119 | This unusual operator is needed because using `!` on the left-hand side of `=` is not valid R code: 120 | 121 | ```{r, error = TRUE} 122 | rlang::qq_show(c(!!name = NA)) 123 | ``` 124 | 125 | Let's use this `!!` technique to pass custom column names to `group_by()` and `summarise()`: 126 | 127 | ```{r} 128 | grouped_mean <- function(data, group_var, summary_var) { 129 | group_var <- enquo(group_var) 130 | summary_var <- enquo(summary_var) 131 | 132 | # Create default column names 133 | group_nm <- as_label(group_var) 134 | summary_nm <- as_label(summary_var) 135 | 136 | # Prepend with an informative prefix 137 | group_nm <- paste0("group_", group_nm) 138 | summary_nm <- paste0("mean_", summary_nm) 139 | 140 | data %>% 141 | group_by(!!group_nm := !!group_var) %>% 142 | summarise(!!summary_nm := mean(!!summary_var)) 143 | } 144 | 145 | grouped_mean(mtcars, cyl, mpg) 146 | ``` 147 | 148 | 149 | ## Patterns for multiple arguments 150 | 151 | ### `...` - Forward multiple arguments 152 | 153 | We have created a function that takes one grouping variable and one summary variable. It would make sense to take multiple grouping variables instead of just one. Let's adjust our function with a `...` argument. 154 | 155 | 1. Replace `group_var` with `...`: 156 | 157 | ```{r, eval = FALSE} 158 | function(data, ..., summary_var) 159 | ``` 160 | 161 | 1. Swap `...` and `summary_var` because arguments on the right-hand side of `...` are harder to pass. 
They can only be passed with their full name explicitly specified, while arguments on the left-hand side can be passed without a name: 162 | 163 | ```{r, eval = FALSE} 164 | function(data, summary_var, ...) 165 | ``` 166 | 167 | 1. It's good practice to prefix named arguments with a `.` to reduce the risk of conflicts between your arguments and the arguments passed to `...`: 168 | 169 | ```{r, eval = FALSE} 170 | function(.data, .summary_var, ...) 171 | ``` 172 | 173 | Because of the magic of dots forwarding we don't have to use the quote-and-unquote pattern. We can just pass `...` to other quoting functions like `group_by()`: 174 | 175 | ```{r} 176 | grouped_mean <- function(.data, .summary_var, ...) { 177 | summary_var <- enquo(.summary_var) 178 | 179 | .data %>% 180 | group_by(...) %>% # Forward `...` 181 | summarise(mean = mean(!!summary_var)) 182 | } 183 | 184 | grouped_mean(mtcars, disp, cyl, am) 185 | ``` 186 | 187 | Forwarding `...` is straightforward but has the downside that you can't modify the arguments or their names. 188 | 189 | 190 | ### `enquos()` and `!!!` - Quote and unquote multiple arguments 191 | 192 | Quoting and unquoting multiple variables with `...` is pretty much the same process as for single arguments: 193 | 194 | * Quoting multiple arguments can be done in two ways: internal quoting with the plural variant `enquos()` and external quoting with `vars()`. Use internal quoting when your function takes expressions with `...` and external quoting when your function takes a list of expressions. 195 | 196 | * Unquoting multiple arguments requires a variant of `!!`, the unquote-splice operator `!!!` which unquotes each element of a list as an independent argument in the surrounding function call. 197 | 198 | Quote the dots with `enquos()` and unquote-splice them with `!!!`: 199 | 200 | ```{r} 201 | grouped_mean2 <- function(.data, .summary_var, ...) { 202 | summary_var <- enquo(.summary_var) 203 | group_vars <- enquos(...) 
# Get a list of quoted dots 204 | 205 | .data %>% 206 | group_by(!!!group_vars) %>% # Unquote-splice the list 207 | summarise(mean = mean(!!summary_var)) 208 | } 209 | 210 | grouped_mean2(mtcars, disp, cyl, am) 211 | ``` 212 | 213 | The quote-and-unquote pattern does more work than simple forwarding of `...` and is functionally identical. Don't do this extra work unless you need to modify the arguments or their names. 214 | 215 | 216 | ### `expr()` - Modify quoted arguments 217 | 218 | Modifying quoted expressions is often necessary when dealing with multiple arguments. Say we'd like a `grouped_mean()` variant that takes multiple summary variables rather than multiple grouping variables. We need to somehow take the `mean()` of each summary variable. 219 | 220 | One easy way is to use the quote-and-unquote pattern with `expr()`. This function is just like `quote()` from base R. It plainly returns your argument, quoted: 221 | 222 | ```{r} 223 | quote(height) 224 | 225 | expr(height) 226 | 227 | 228 | quote(mean(height)) 229 | 230 | expr(mean(height)) 231 | ``` 232 | 233 | But `expr()` has a twist, it has full unquoting support: 234 | 235 | ```{r} 236 | vars <- list(quote(height), quote(mass)) 237 | 238 | expr(mean(!!vars[[1]])) 239 | 240 | expr(group_by(!!!vars)) 241 | ``` 242 | 243 | You can loop over a list of arguments and modify each of them: 244 | 245 | ```{r} 246 | purrr::map(vars, function(var) expr(mean(!!var, na.rm = TRUE))) 247 | ``` 248 | 249 | This makes it easy to take multiple summary variables, wrap them in a call to `mean()`, before unquote-splicing within `summarise()`: 250 | 251 | ```{r} 252 | grouped_mean3 <- function(.data, .group_var, ...) { 253 | group_var <- enquo(.group_var) 254 | summary_vars <- enquos(...) 
# Get a list of quoted summary variables 255 | 256 | summary_vars <- purrr::map(summary_vars, function(var) { 257 | expr(mean(!!var, na.rm = TRUE)) 258 | }) 259 | 260 | .data %>% 261 | group_by(!!group_var) %>% 262 | summarise(!!!summary_vars) # Unquote-splice the list 263 | } 264 | ``` 265 | 266 | 267 | ### `vars()` - Quote multiple arguments externally {#sec:external-quoting} 268 | 269 | How could we take multiple summary variables in addition to multiple grouping variables? Internal quoting with `...` has a major disadvantage: the arguments in `...` can only have one purpose. If you need to quote multiple sets of variables you have to delegate the quoting to another function. That's the purpose of `vars()` which quotes its arguments and returns a list: 270 | 271 | ```{r} 272 | vars(species, gender) 273 | ``` 274 | 275 | The arguments can be complex expressions and have names: 276 | 277 | ```{r} 278 | vars(h = height, m = mass / 100) 279 | ``` 280 | 281 | When the quoting is external you don't use `enquos()`. 
Simply take lists of expressions in your function and forward the lists to other quoting functions with `!!!`: 282 | 283 | ```{r, error = TRUE} 284 | grouped_mean3 <- function(data, group_vars, summary_vars) { 285 | stopifnot( 286 | is.list(group_vars), 287 | is.list(summary_vars) 288 | ) 289 | 290 | summary_vars <- purrr::map(summary_vars, function(var) { 291 | expr(mean(!!var, na.rm = TRUE)) 292 | }) 293 | 294 | data %>% 295 | group_by(!!!group_vars) %>% 296 | summarise(n = n(), !!!summary_vars) 297 | } 298 | 299 | grouped_mean3(starwars, vars(species, gender), vars(height)) 300 | 301 | grouped_mean3(starwars, vars(gender), vars(height, mass)) 302 | ``` 303 | 304 | One advantage of `vars()` is that it lets users specify their own names: 305 | 306 | ```{r} 307 | grouped_mean3(starwars, vars(gender), vars(h = height, m = mass)) 308 | ``` 309 | 310 | 311 | ### `enquos(.named = TRUE)` - Automatically add default names 312 | 313 | If you pass `.named = TRUE` to `enquos()` the unnamed expressions are automatically given default names: 314 | 315 | ```{r} 316 | f <- function(...) names(enquos(..., .named = TRUE)) 317 | 318 | f(height, mean(mass)) 319 | ``` 320 | 321 | User-supplied names are never overridden: 322 | 323 | ```{r} 324 | f(height, m = mean(mass)) 325 | ``` 326 | 327 | This is handy when you need to modify the names of quoted expressions. In this example we'll ensure the list is named before adding a prefix: 328 | 329 | ```{r} 330 | grouped_mean2 <- function(.data, .summary_var, ...) 
{ 331 | summary_var <- enquo(.summary_var) 332 | group_vars <- enquos(..., .named = TRUE) # Ensure quoted dots are named 333 | 334 | # Prefix the names of the list of quoted dots 335 | names(group_vars) <- paste0("group_", names(group_vars)) 336 | 337 | .data %>% 338 | group_by(!!!group_vars) %>% # Unquote-splice the list 339 | summarise(mean = mean(!!summary_var)) 340 | } 341 | 342 | grouped_mean2(mtcars, disp, cyl, am) 343 | ``` 344 | 345 | One big downside of this technique is that all arguments get a prefix, including the arguments that were given specific names by the user: 346 | 347 | ```{r} 348 | grouped_mean2(mtcars, disp, c = cyl, a = am) 349 | ``` 350 | 351 | In general it's better to preserve the names explicitly passed by the user. To do that we can't automatically add default names with `enquos()` because once the list is fully named we don't have any way of detecting which arguments were passed with an explicit name. We'll have to add default names manually with `quos_auto_name()`. 352 | 353 | 354 | ### `quos_auto_name()` - Manually add default names 355 | 356 | It can be helpful to add default names to the list of quoted dots manually: 357 | 358 | - We can detect which arguments were explicitly named by the user. 359 | - The default names can be applied to lists returned by `vars()`. 360 | 361 | Let's add default names manually with `quos_auto_name()` to lists of externally quoted variables. We'll detect unnamed arguments and only add a prefix to this subset of arguments. 
This way we preserve user-supplied names: 362 | 363 | ```{r} 364 | grouped_mean3 <- function(data, group_vars, summary_vars) { 365 | stopifnot( 366 | is.list(group_vars), 367 | is.list(summary_vars) 368 | ) 369 | 370 | # Detect and prefix unnamed arguments: 371 | unnamed <- names(summary_vars) == "" 372 | 373 | # Add the default names: 374 | summary_vars <- rlang::quos_auto_name(summary_vars) 375 | 376 | prefixed_nms <- paste0("mean_", names(summary_vars)[unnamed]) 377 | names(summary_vars)[unnamed] <- prefixed_nms 378 | 379 | # Expand the argument _after_ giving the list its default names 380 | summary_vars <- purrr::map(summary_vars, function(var) { 381 | expr(mean(!!var, na.rm = TRUE)) 382 | }) 383 | 384 | data %>% 385 | group_by(!!!group_vars) %>% 386 | summarise(n = n(), !!!summary_vars) # Unquote-splice the renamed list 387 | } 388 | ``` 389 | 390 | Note how we add the default names *before* wrapping the arguments in a `mean()` call. This way we avoid including `mean()` in the name: 391 | 392 | ```{r} 393 | as_label(quote(mass)) 394 | 395 | as_label(quote(mean(mass, na.rm = TRUE))) 396 | ``` 397 | 398 | We get nicely prefixed default names: 399 | 400 | ```{r} 401 | grouped_mean3(starwars, vars(gender), vars(height, mass)) 402 | ``` 403 | 404 | And the user is able to fully override the names: 405 | 406 | ```{r} 407 | grouped_mean3(starwars, vars(gender), vars(h = height, m = mass)) 408 | ``` 409 | 410 | 411 | ## `select()` 412 | 413 | TODO 414 | 415 | 416 | ## `filter()` 417 | 418 | TODO 419 | 420 | 421 | ## `case_when()` 422 | 423 | TODO 424 | 425 | 426 | ## Gotchas 427 | 428 | ### Nested quoting functions 429 | 430 | https://stackoverflow.com/questions/51902438/rlangsym-in-anonymous-functions 431 | -------------------------------------------------------------------------------- /ga_script.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 10 | 
-------------------------------------------------------------------------------- /ggplot.Rmd: -------------------------------------------------------------------------------- 1 | 2 | ```{r setup, include = FALSE} 3 | source("setup.R") 4 | library("ggplot2") 5 | ``` 6 | 7 | # ggplot2 8 | 9 | -------------------------------------------------------------------------------- /glossary.Rmd: -------------------------------------------------------------------------------- 1 | 2 | ```{r setup, include = FALSE} 3 | source("setup.R") 4 | library("dplyr") 5 | ``` 6 | 7 | # Glossary 8 | 9 | This glossary contains the vocabulary necessary to work with tidy evaluation and, more generally, with [expressions](#glossary-expr). The definitions in rlang are generally consistent with base R. When they differ, both definitions are presented so you can navigate between these two worlds more easily. 10 | 11 | 12 | ## Data structures 13 | 14 | ### TODO Data mask {#glossary-data-mask} 15 | 16 | ### Expression {#glossary-expr} 17 | 18 | An expression is a piece of R code that represents a value or a computation: 19 | 20 | ```{r, eval = FALSE} 21 | 12 # Value 22 | 12 / 3 # Computation 23 | 12 / (1 + 2) # Nested computations 24 | ``` 25 | 26 | Expressions are normally transient. They are computed (or [evaluated](#glossary-eval)) when you source a file or call a function. You can only observe: 27 | 28 | * The final value of the outermost expression. 29 | 30 | * Their side effects, such as the console output of a `print()` expression inside a loop. 31 | 32 | In R however, it is possible to suspend the normal evaluation of expressions with the [quotation](#glossary-quotation) mechanism. 
In a way, quotation causes expressions to freeze in place: 33 | 34 | ```{r} 35 | # Evaluated expression 36 | 12 / 3 37 | 38 | # Quoted expression 39 | expr(12 / 3) 40 | ``` 41 | 42 | The technical definition of expressions is any R object that is created by [parsing](#glossary-parse) R code: 43 | 44 | * [Constants](#glossary-constant-symbolic) like `NULL`, `1`, `"foo"`, `TRUE`, `NA`, etc. 45 | * [Symbols](#glossary-sym) like `height` or `weight` 46 | * [Calls](#glossary-call) like `c()` or `list()` 47 | 48 | Unlike constants, symbols and calls are [symbolic](#glossary-constant-symbolic) objects: their [value](#glossary-eval) depends on the [environment](#glossary-env). 49 | 50 | 51 | ### Expression (base) 52 | 53 | In base R, "expression" refers to a special type of vector that contains quoted expressions in the rlang sense: 54 | 55 | ```{r} 56 | base::expression(key <- "foo", toupper(key)) 57 | ``` 58 | 59 | You'll most likely encounter this rare data structure as the return value of `base::parse()`: 60 | 61 | ```{r} 62 | code <- "key <- 'foo'; toupper(key)" 63 | parse(text = code) 64 | ``` 65 | 66 | The only advantage of expression vectors compared to lists is that they include **source references**. Expression vectors with source references are printed with whitespace and comments preserved: 67 | 68 | ```{r} 69 | code <- "{ 70 | # Interesting comment 71 | weird <- whitespace 72 | }" 73 | parse(text = code, keep.source = TRUE) 74 | ``` 75 | 76 | Source references are mostly useful for debugging and development tools. They don't play any computational role and tidy evaluation doesn't make use of references. 
Consequently the parsing tools in rlang return normal lists of expressions (in the rlang sense) instead of expression vectors: 77 | 78 | ```{r} 79 | rlang::parse_exprs(code) 80 | ``` 81 | 82 | 83 | ### TODO Symbol {#glossary-sym} 84 | 85 | 86 | ## Programming Concepts 87 | 88 | ### Constant versus symbolic {#glossary-constant-symbolic} 89 | 90 | Constants, also called "literals", always have the same value no matter the context. On the other hand, symbols and calls are [symbolic](#glossary-constant-symbolic) expressions: their value depends on an [environment](#glossary-env) and what kind of objects are defined there. 91 | 92 | For instance the string `"mickey"` always represents the same string no matter the environment and what objects are defined there: 93 | 94 | ```{r} 95 | # Here's a string: 96 | "mickey" 97 | 98 | mickey <- "mouse" 99 | 100 | # Still the same string: 101 | "mickey" 102 | ``` 103 | 104 | In contrast, [symbols](#glossary-sym) depend on current definitions: 105 | 106 | ```{r} 107 | # We've defined `mickey` as "mouse" 108 | mickey 109 | 110 | mickey <- "mickey" 111 | 112 | # Now `mickey` is "mickey" 113 | mickey 114 | ``` 115 | 116 | One source of problems when you're working with quoted expressions is that they might be evaluated in arbitrary places, where objects have potentially been redefined to something different than expected. This is a common issue with tidyverse grammars because they evaluate quoted expressions in a [data mask](#glossary-data-mask). Say you'd like to divide a column by a factor defined in the current environment: 117 | 118 | ```{r} 119 | factor <- 100 120 | 121 | starwars %>% mutate(height / factor) %>% pull() 122 | ``` 123 | 124 | This works fine but what if the data frame contains a column called `factor`?
The expression will be evaluated with the parasite definition: 125 | 126 | ```{r} 127 | # Derive a data frame that contains a `factor` column 128 | starwars2 <- starwars %>% mutate(factor = 1:n()) 129 | 130 | # Oh no! We're now dividing `height` by the new column! 131 | starwars2 %>% mutate(height / factor) %>% pull() 132 | ``` 133 | 134 | Masking is generally not a problem in scripts because you know what columns are inside your data frame. However as soon as your code is getting more general, for instance if you create a reusable function, you can no longer make assumptions about what's in the data. 135 | 136 | Fortunately with [quasiquotation](#glossary-qq) it is easy to solve masking issues by replacing symbols with constants. The unquoting operator `!!` allows you to inline constant values deep inside expressions. With [`qq_show()`](#toolbox-qq-show) we can observe the inlining: 137 | 138 | ```{r} 139 | vector <- 1:3 140 | 141 | # Without inlining, the expression depends on the value of `vector`: 142 | rlang::qq_show(list(vector)) 143 | 144 | # Let's inline the current value of `vector` by unquoting it: 145 | rlang::qq_show(list(!!vector)) 146 | ``` 147 | 148 | Because constants have the same value in any environment, the data mask can never take over with parasite definitions: 149 | 150 | ```{r} 151 | rlang::qq_show(starwars2 %>% mutate(height / !!factor) %>% pull()) 152 | 153 | starwars2 %>% mutate(height / !!factor) %>% pull() 154 | ``` 155 | 156 | 157 | ### TODO Non-Standard Evaluation (NSE) {#glossary-nse} 158 | 159 | ### TODO Quotation versus Evaluation {#glossary-quotation-evaluation} 160 | 161 | ### TODO Quasiquotation {#glossary-qq} 162 | 163 | ### TODO Parsing {#glossary-parse} 164 | 165 | ### TODO Metaprogramming {#glossary-metaprogramming} 166 | -------------------------------------------------------------------------------- /grammar.Rmd: -------------------------------------------------------------------------------- 1 | 2 | ```{r setup, include = 
FALSE} 3 | source("setup.R") 4 | ``` 5 | 6 | # Creating grammars 7 | -------------------------------------------------------------------------------- /index.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | knit: "bookdown::render_book" 3 | title: "Tidy evaluation" 4 | author: ["Lionel Henry", "Hadley Wickham"] 5 | 6 | description: "The primary goal of this book is to get you up to speed with tidy evaluation and how to write functions around tidyverse pipelines and grammars." 7 | 8 | github-repo: tidyverse/tidyeval 9 | site: bookdown::bookdown_site 10 | documentclass: book 11 | --- 12 | 13 | # Notice {-} 14 | 15 | ![Lifecycle Status](https://img.shields.io/badge/lifecycle-superseded-orange.svg) 16 | 17 | This guide is now superseded by more recent efforts at documenting tidy evaluation in a user-friendly way. We now recommend reading: 18 | 19 | - The new [Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html) vignette. 20 | 21 | - The [Using ggplot2 in packages](https://ggplot2.tidyverse.org/articles/ggplot2-in-packages.html) vignette. 22 | 23 | We are keeping this bookdown guide online for posterity, but please know that it is missing a lot of advances that make tidy eval more palatable, such as the embracing operator `{{ arg }}` and [glue](https://glue.tidyverse.org/) support for custom names. 24 | 25 | 26 | # Welcome {-} 27 | 28 | The primary goal of this book is to get you up to speed with tidy evaluation by showing you how to write functions using tidyverse pipelines and grammars. The book is written and organised so that you can quickly find the information you need to solve real world problems without having to "get" tidy eval first: 29 | 30 | * The first chapter *Getting up to speed* is a quick introduction to the main pattern used in all tidy eval functions: **quote and unquote**. 31 | 32 | * The *Cookbooks* sections are organised by common tasks for the dplyr and ggplot2 packages. 
33 | 34 | Though this is a work in progress, we hope you'll find this bookdown valuable for programming with tidyverse interfaces. 35 | 36 | 37 | ## Other resources {-} 38 | 39 | You may also be interested in: 40 | 41 | * __"[Tidy eval in 5 minutes](https://www.youtube.com/watch?v=nERXS3ssntw)"__ is a quick 5-minute video that explains the big ideas behind tidy evaluation. It's a great way to get an overview of concepts before diving into other tutorials. 42 | 43 | * __"[Tidy eval webinar](https://resources.rstudio.com/webinars/tidyeval)"__ is a one-hour tutorial video on tidy evaluation. 44 | 45 | * __"[The second edition of Advanced R](http://adv-r.hadley.nz/)"__ which includes a whole chapter on metaprogramming with tidy eval. 46 | -------------------------------------------------------------------------------- /introduction.Rmd: -------------------------------------------------------------------------------- 1 | 2 | # (PART) Principles {-} 3 | 4 | ```{r setup, include = FALSE} 5 | source("setup.R") 6 | library("dplyr") 7 | ``` 8 | 9 | 10 | # Introduction 11 | 12 | Tidyverse grammars, like dplyr, have a distinctive look and feel. This is in part because they are designed to follow a set of [principles](https://principles.tidyverse.org/). The most important feature of tidyverse grammars is that they let you work with your data as if they were actual objects in your workspace. In a way, the data frame itself becomes a (temporary) workspace. 13 | 14 | Data masking makes it easy and natural to read and write data manipulation code, but it has a flip side. It is easier to refer to objects in the mask when you know their names in advance, but it is harder when the names are unknown at the time of writing. In particular, it is harder to make _indirect references_ with column names stored in variables or passed as function arguments.
15 | 16 | Tidy evaluation is a set of concepts and tools that make it possible to use tidyverse grammars when columns are specified indirectly. In particular, you will need to learn some tidy eval to extract a tidyverse pipeline in a reusable function. 17 | 18 | The first chapter [Why and how](#sec:why-how) provides the motivation for tidy eval, presents the problems that it poses in day-to-day programming, and the general theory and tools for solving those. If you are in a hurry, you can jump straight to [Do you need tidy eval?](#sec:do-you-need). A lot can be done without writing a single line of tidy eval! If you are positive you need it to solve your problem, [Getting up to speed](#sec:up-to-speed) is a self-contained chapter that will teach you the basic workflow of wrapping a tidyverse pipeline in a reusable function. The book ends with a series of recipes and idioms for solving various dplyr and ggplot2 problems. 19 | 20 | 21 | # Why and how {#sec:why-how} 22 | 23 | Tidy evaluation is a framework for metaprogramming in R, used throughout the tidyverse to implement data masking. Metaprogramming is about using a programming language to manipulate or modify its own code. This idea is used throughout the tidyverse to change the context of computation of certain pieces of R code. 24 | 25 | Changing the context of evaluation is useful for four main purposes: 26 | 27 | * To promote data frames to **full blown scopes**, where columns are exposed as named objects. 28 | 29 | * To execute your R code in a **foreign environment**. For instance, dbplyr translates ordinary dplyr pipelines to SQL queries. 30 | 31 | * To execute your R code with a more performant **compiled language**. For instance, the dplyr package uses C++ implementations for a certain set of mathematical expressions to avoid executing slower R code when possible[^perf]. 32 | 33 | * To implement **special rules** for ordinary R operators. 
For instance, selection functions such as `dplyr::select()` or `tidyr::gather()` implement specific behaviours for `c()`, `:` and `-`. 34 | 35 | [^perf]: The data.table package uses different metaprogramming tools than tidy eval for the same purpose. Certain expressions are executed in C to perform efficient data transformations. 36 | 37 | ## Data masking 38 | 39 | Of these goals, the promotion of data frames is the most important because data is often the most relevant context for data analysts. We believe that R and the tidyverse are [human-centered](https://principles.tidyverse.org/unifying-principles.html#human-centered) in big part because the data frame is available for direct use in computations, without syntax and boilerplate getting in the way. Formulas for statistical models are a prime example of human-centered syntax in R. Data masking and special operator rules make model formulas an intuitive interface for model specification. 40 | 41 | When the contents of the data frame are temporarily promoted as first class objects, we say the data **masks** the workspace: 42 | 43 | ```{r, eval = FALSE} 44 | library("dplyr") 45 | 46 | starwars %>% filter( 47 | height < 200, 48 | gender == "male" 49 | ) 50 | ``` 51 | 52 | Data masking is natural in R because it reduces boilerplate and results in code that maps more directly to how users think about data manipulation problems. Compare to the equivalent subsetting code where it is necessary to be explicit about where the columns come from: 53 | 54 | ```{r, eval = FALSE} 55 | starwars[starwars$height < 200 & starwars$gender == "male", ] 56 | ``` 57 | 58 | Data masking is only possible because R allows suspending the normal flow of evaluation. If code was evaluated in the normal way, R would not be able to find the relevant columns for the computation. 
For instance, a normal function like `list()`, which has no concept of data masking, will give an error about object not found: 59 | 60 | ```{r, error = TRUE} 61 | list( 62 | height < 200, 63 | gender == "male" 64 | ) 65 | ``` 66 | 67 | 68 | ## Quoting code 69 | 70 | In order to change the context, evaluation must first be suspended before being resumed in a different environment. The technical term for delaying code in this way is **quoting**. Tidyverse grammars quote the code supplied by users as arguments. They don't get results of code but the quoted code itself, whose evaluation can be resumed later on in a data context. In a way, quoted code is like a blueprint for R computations. One important quoting function in dplyr is `vars()`. This function does nothing but return its arguments as blueprints to be interpreted later on by verbs like `summarise_at()`: 71 | 72 | ```{r} 73 | starwars %>% summarise_at(vars(ends_with("color")), n_distinct) 74 | ``` 75 | 76 | If you call `vars()` alone, you get to see the blueprints! [^3] 77 | 78 | [^3]: As you can see, these blueprints are also called **quosures**. These are special types of [expressions](#glossary-expr) that keep track of the current context, or [environment](#glossary-env). 79 | 80 | ```{r} 81 | vars( 82 | ends_with("color"), 83 | height:mass 84 | ) 85 | ``` 86 | 87 | The evaluation of an expression captured as a blueprint can be resumed at any time, possibly in a different context: 88 | 89 | ```{r, error = TRUE} 90 | exprs <- vars(height / 100, mass + 50) 91 | 92 | rlang::eval_tidy(exprs[[1]]) 93 | 94 | rlang::eval_tidy(exprs[[1]], data = starwars) 95 | ``` 96 | 97 | To sum up, the distinctive look and feel of data masking UIs requires suspending the normal evaluation of R code. Once captured as quoted code, it can be resumed in a different context. Unfortunately, the delaying of code makes it harder to program with data masking functions, and requires learning a bit of new theory and some new tools. 
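As a minimal sketch of the quote-then-resume idea summarised above, the snippet below quotes an expression with `rlang::expr()` and evaluates it later against a small data frame. The data frame `heights` and the names `blueprint` and `in_metres` are illustrative assumptions made up for this example:

```r
library(rlang)

# Quote the code: nothing is computed yet, we only get the blueprint
blueprint <- expr(height / 100)

# Hypothetical data, made up for this illustration
heights <- data.frame(height = c(150, 180, 200))

# Resume evaluation with the data frame acting as a data mask,
# so that `height` is looked up among its columns
in_metres <- eval_tidy(blueprint, data = heights)
in_metres
```

Evaluating `blueprint` without the `data` argument would fail unless an object called `height` happens to exist in the calling environment, which mirrors the first `eval_tidy()` call in the `vars()` example.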
98 | 99 | 100 | ## Unquoting code 101 | 102 | Data masking functions prevent the normal evaluation of their arguments by quoting them. Once in possession of the blueprints of their arguments, a data mask is created and the evaluation is resumed in this new context. Unfortunately, delaying code in this way has a flip side. While it is natural to substitute _values_ when you're programming with normal functions using regular evaluation, it is harder to substitute _column names_ in data masking functions that delay evaluation of your code. To make indirect references to columns, it is necessary to modify the quoted code _before_ it gets evaluated. This is exactly what the `!!` operator, pronounced bang bang, is all about. It is a surgery operator for blueprints of R code. 103 | 104 | In the world of normal functions, making indirect references to values is easy. Expressions that yield the same values can be freely interchanged, a property that is sometimes called [referential transparency](https://en.wikipedia.org/wiki/Referential_transparency). The following calls to `my_function()` all yield the same results because they were given the same value as inputs: 105 | 106 | ```{r} 107 | my_function <- function(x) x * 100 108 | 109 | my_function(6) 110 | 111 | my_function(2 * 3) 112 | 113 | a <- 2 114 | b <- 3 115 | my_function(a * b) 116 | ``` 117 | 118 | Because data masking functions evaluate their quoted arguments in a different context, they do not have this property: 119 | 120 | ```{r, error = TRUE} 121 | starwars %>% summarise(avg = mean(height, na.rm = TRUE)) 122 | 123 | value <- mean(height, na.rm = TRUE) 124 | starwars %>% summarise(avg = value) 125 | ``` 126 | 127 | Storing a column name in a variable or passing one as a function argument requires the tidy eval operator `!!`. This special operator, only available in quoting functions, acts like a surgical operator for modifying blueprints. To understand what it does, it is best to see it in action. 
The `qq_show()` helper from rlang processes `!!` and prints the resulting blueprint of the computation. As you can observe, `!!` modifies the quoted code by inlining the value of its operand right into the blueprint: 128 | 129 | ```{r} 130 | x <- 1 131 | 132 | rlang::qq_show( 133 | starwars %>% summarise(out = x) 134 | ) 135 | 136 | rlang::qq_show( 137 | starwars %>% summarise(out = !!x) 138 | ) 139 | ``` 140 | 141 | What would it take to create an indirect reference to a column name? Inlining the name as a string in the blueprint will not produce what you expect: 142 | 143 | ```{r} 144 | col <- "height" 145 | 146 | rlang::qq_show( 147 | starwars %>% summarise(out = sum(!!col, na.rm = TRUE)) 148 | ) 149 | ``` 150 | 151 | This code amounts to taking the sum of a string, something that R will not be happy about: 152 | 153 | ```{r, error = TRUE} 154 | starwars %>% summarise(out = sum("height", na.rm = TRUE)) 155 | ``` 156 | 157 | To refer to column names inside a blueprint, we need to inline blueprint material. We need **symbols**: 158 | 159 | ```{r} 160 | sym(col) 161 | ``` 162 | 163 | Symbols are a special type of string that represent other objects. When a piece of R code is evaluated, every bare variable name is actually a symbol that represents some value, as defined in the current context. Let's see what the modified blueprint looks like when we inline a symbol: 164 | 165 | ```{r} 166 | rlang::qq_show( 167 | starwars %>% summarise(out = sum(!!sym(col), na.rm = TRUE)) 168 | ) 169 | ``` 170 | 171 | Looks good! We're now ready to actually run the dplyr pipeline with an indirect reference: 172 | 173 | ```{r} 174 | starwars %>% summarise(out = sum(!!sym(col), na.rm = TRUE)) 175 | ``` 176 | 177 | There were two necessary steps to create an indirect reference and properly modify the summarising code: 178 | 179 | * We first created a piece of blueprint (a symbol) with `sym()`. 180 | * We used `!!` to insert it in the blueprint captured by `summarise()`. 
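The two steps above can be sketched inside a small wrapper function. This is only an illustration under stated assumptions: `sum_column()` and its arguments are hypothetical names, and `mtcars` is used simply because it ships with R:

```r
library(dplyr)
library(rlang)

# Hypothetical helper: `col` is a column name supplied as a string
sum_column <- function(data, col) {
  # Step 1: create a piece of blueprint (a symbol) from the string
  col_sym <- sym(col)

  # Step 2: unquote the symbol into the code quoted by summarise()
  data %>% summarise(out = sum(!!col_sym, na.rm = TRUE))
}

result <- sum_column(mtcars, "cyl")
result
```

Because the symbol is built from a string at run time, the same function works for any column name the caller supplies, as long as that column exists in `data`.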
181 | 182 | We call the combination of these two steps the __quote and unquote__ pattern. This pattern is the heart of programming with tidy eval functions. We quote an expression and unquote it in another quoted expression. In other words, we create or capture a piece of blueprint, and insert it in another blueprint just before it's captured by a data masking function. This process is also called __interpolation__. 183 | 184 | Most of the time though, we don't need to create blueprints manually. We'll get them by quoting the arguments supplied by users. This gives your functions the same usage and feel as tidyverse verbs. 185 | 186 | 187 | # Do you need tidy eval? {#sec:do-you-need} 188 | 189 | In computer science, frameworks like tidy evaluation are known as metaprogramming. Modifying the blueprints of computations amounts to programming the program, i.e. metaprogramming. In other languages, this type of approach is often seen as a last resort because it requires new skills and might make your code harder to read. Things are different in R because of the importance of data masking functions, but it is still good advice to consider other options before turning to tidy evaluation. In this section, we review several strategies for solving programming problems with tidyverse packages. 190 | 191 | Before diving into tidy eval, make sure to know about the fundamentals of programming with the tidyverse. These are likely to have a better return on investment of time and will also be useful to solve problems outside the tidyverse. 192 | 193 | * [Fixed column names](#sec:fixed-colnames). A solid function taking data frames with fixed column names is better than a brittle function that uses tidy eval. 194 | 195 | * [Automating loops](#sec:automating-loops). dplyr excels at automating loops. Acquiring a good command of rowwise vectorisation and columnwise mapping may prove very useful. 
196 | 197 | Tidy evaluation is not all-or-nothing, it encompasses a wide range of features and techniques. Here are a few techniques that are easy to pick up in your workflow: 198 | 199 | * Passing expressions through `{{` and `...`. 200 | * Passing column names to `.data[[` and `one_of()`. 201 | 202 | All these techniques make it possible to reuse existing components of tidyverse grammars and compose them into new functions. 203 | 204 | 205 | ## Fixed column names {#sec:fixed-colnames} 206 | 207 | A simple solution is to write functions that expect data frames containing specific column names. If the computation always operates on the same columns and nothing varies, you don't need any tidy eval. On the other hand, your users must ensure the existence of these columns as part of their data cleaning process. This is why this technique primarily makes sense when you're writing functions tailored to your own data analysis uses, or perhaps in functions that interface with a specific web API for retrieving data. In general, fixed column names are task specific. 208 | 209 | Say we have a simple pipeline that computes the body mass index for each observation in a tibble: 210 | 211 | ```{r} 212 | starwars %>% transmute(bmi = mass / (height / 100)^2) 213 | ``` 214 | 215 | We could extract this code in a function that takes data frames with columns `mass` and `height`: 216 | 217 | ```{r} 218 | compute_bmi <- function(data) { 219 | data %>% transmute(bmi = mass / height^2) 220 | } 221 | ``` 222 | 223 | It's always a good idea to check the inputs of your functions and fail early with an informative error message when their assumptions are not met. 
In this case, we should validate the data frame and throw an error when it does not contain the expected columns: 224 | 225 | ```{r, error = TRUE} 226 | compute_bmi <- function(data) { 227 | if (!all(c("mass", "height") %in% names(data))) { 228 | stop("`data` must contain `mass` and `height` columns") 229 | } 230 | 231 | data %>% transmute(bmi = mass / height^2) 232 | } 233 | 234 | iris %>% compute_bmi() 235 | ``` 236 | 237 | In fact, we could go even further and validate the contents of the columns in addition to their names: 238 | 239 | ```{r} 240 | compute_bmi <- function(data) { 241 | if (!all(c("mass", "height") %in% names(data))) { 242 | stop("`data` must contain `mass` and `height` columns") 243 | } 244 | 245 | mean_height <- round(mean(data$height, na.rm = TRUE), 1) 246 | if (mean_height > 3) { 247 | warning(glue::glue( 248 | "Average height is { mean_height }, is it scaled in meters?" 249 | )) 250 | } 251 | 252 | data %>% transmute(bmi = mass / height^2) 253 | } 254 | 255 | starwars %>% compute_bmi() 256 | 257 | starwars %>% mutate(height = height / 100) %>% compute_bmi() 258 | ``` 259 | 260 | Spending your programming time on the domain logic of your function, such as input and scale validation, may have a greater payoff than learning tidy eval just to improve its syntax. It makes your function more robust to faulty data and reduces the risks of erroneous analyses. 261 | 262 | 263 | ## Automating loops {#sec:automating-loops} 264 | 265 | Most programming problems involve __iteration__ because data transformations are typically achieved element by element, by applying the same recipe over and over again. There are two main ways of automating iteration in R, __vectorisation__ and __mapping__. Learning how to juggle with the different ways of expressing loops is not only an important step towards acquiring a good command of R and the tidyverse, it will also make you more proficient at solving programming problems. 
266 | 267 | 268 | ### Vectorisation in dplyr 269 | 270 | dplyr is designed to optimise iteration by taking advantage of the vectorisation of many R functions. Rowwise vectorisation is achieved through normal R rules, which dplyr augments with groupwise vectorisation. 271 | 272 | 273 | #### Rowwise vectorisation 274 | 275 | Rowwise vectorisation in dplyr is a consequence of normal R rules for vectorisation. A vectorised function is a function that works the same way with vectors of 1 element as with vectors of _n_ elements. The operation is applied elementwise (often at the machine code level, which makes it very efficient). We have already mentioned the vectorisation of `toupper()`, and many other functions in R are vectorised. One important class of vectorised functions is the arithmetic operators: 276 | 277 | ```{r} 278 | # Dividing 1 element 279 | 1 / 10 280 | 281 | # Dividing 5 elements 282 | 1:5 / 10 283 | ``` 284 | 285 | Technically, a function is vectorised when: 286 | 287 | * It returns a vector as long as the input. 288 | * Applying the function to a single element yields the same result as applying it to the whole vector and then subsetting the element. 289 | 290 | In other words, a vectorised function `fn` fulfills the following identity: 291 | 292 | ```{r, eval = FALSE} 293 | fn(x[[i]]) == fn(x)[[i]] 294 | ``` 295 | 296 | When you mix vectorised and non-vectorised operations, the combined operation is itself vectorised when the last operation to run is vectorised. Here we'll combine the vectorised `/` function with the summary function `mean()`. The result of this operation is a vector that has the same length as the LHS of `/`: 297 | 298 | ```{r} 299 | x <- 1:5 300 | x / mean(x) 301 | ``` 302 | 303 | Note that the other combination of operations is not vectorised because in that case the summary operation has the last word: 304 | 305 | ```{r} 306 | mean(x / 10) 307 | ``` 308 | 309 | The dplyr verb `mutate()` expects vector semantics.
The operations defining new columns typically return vectors as long as their inputs: 310 | 311 | ```{r} 312 | data <- tibble(x = rnorm(5, sd = 10)) 313 | 314 | data %>% 315 | mutate(rescaled = x / sd(x)) 316 | ``` 317 | 318 | In fact, `mutate()` enforces vectorisation. Returning a smaller vector is an error unless it has size 1. If the result of a mutate expression has size 1, it is automatically recycled to the tibble or group size. This ensures that all columns have the same length and fit within the tibble constraints of rectangular data: 319 | 320 | ```{r} 321 | data %>% 322 | mutate(sigma = sd(x)) 323 | ``` 324 | 325 | In contrast to `mutate()`, the dplyr verb `summarise()` expects summary operations that return a single value: 326 | 327 | ```{r} 328 | data %>% 329 | summarise(sd(x)) 330 | ``` 331 | 332 | 333 | #### Groupwise vectorisation 334 | 335 | Things get interesting with grouped tibbles. dplyr augments the vectorisation of normal R functions with groupwise vectorisation. If your tibble has `ngroup` groups, the operations are repeated `ngroup` times. 336 | 337 | ```{r} 338 | my_division <- function(x, y) { 339 | message("I was just called") 340 | x / y 341 | } 342 | 343 | # Called 1 time 344 | data %>% 345 | mutate(new = my_division(x, 10)) 346 | 347 | gdata <- data %>% group_by(g = c("a", "a", "b", "b", "c")) 348 | 349 | # Called 3 times 350 | gdata %>% 351 | mutate(new = my_division(x, 10)) 352 | ``` 353 | 354 | If the operation is entirely vectorised, the result will be the same whether the tibble is grouped or not, since elementwise computations are not affected by the values of other elements. But as soon as summary operations are involved, the result depends on the grouping structure because the summaries are computed from group sections instead of whole columns. 
355 | 356 | ```{r} 357 | # Marginal rescaling 358 | data %>% 359 | mutate(new = x / sd(x)) 360 | 361 | # Conditional rescaling 362 | gdata %>% 363 | mutate(new = x / sd(x)) 364 | ``` 365 | 366 | Whereas rowwise vectorisation automates loops over the elements of a column, groupwise vectorisation automates loops over the levels of a grouping specification. The combination of these is very powerful. 367 | 368 | 369 | ### Looping over columns 370 | 371 | Rowwise and groupwise vectorisations are means of looping in the direction of rows, applying the same operation to each group and each element. What if you'd like to apply an operation in the direction of columns? This is possible in dplyr by __mapping__ functions over columns. 372 | 373 | Mapping functions is part of the [functional programming](https://adv-r.hadley.nz/fp.html) approach. If you're going to spend some time learning new programming concepts, acquiring functional programming skills is likely to have a higher payoff than learning about the metaprogramming concepts of tidy evaluation. Functional programming is inherent to R as it underlies the `apply()` family of functions in base R and the `map()` family from the [purrr package](https://purrr.tidyverse.org/). It is a powerful tool to add to your quiver. 374 | 375 | 376 | #### Mapping functions 377 | 378 | Everything that exists in R is an object, including functions. If you type the name of a function without parentheses, you get the function object instead of the result of calling the function: 379 | 380 | ```{r} 381 | toupper 382 | ``` 383 | 384 | In its simplest form, functional programming is about passing a function object as argument to another function called a __mapper__ function, that iterates over a vector to apply the function on each element, and returns all results in a new vector. In other words, a mapper function writes loops so you don't have to.
Here is a manual loop that applies `toupper()` over all elements of a character vector and returns a new vector: 385 | 386 | ```{r} 387 | new <- character(length(letters)) 388 | 389 | for (i in seq_along(letters)) { 390 | new[[i]] <- toupper(letters[[i]]) 391 | } 392 | 393 | new 394 | ``` 395 | 396 | Using a mapper function results in much leaner code. Here we apply `toupper()` over all elements of `letters` and return the results as a character vector, as indicated by the suffix `_chr`: 397 | 398 | ```{r} 399 | new <- purrr::map_chr(letters, toupper) 400 | new 401 | ``` 402 | 403 | In practice, functional programming is all about hiding `for` loops, which are abstracted away by the mapper functions that automate the iteration. 404 | 405 | Mapping is an elegant way of transforming data element by element, but it's not the only one. For instance, `toupper()` is actually a vectorised function that already operates on whole vectors element by element. The fastest and leanest code is just: 406 | 407 | ```{r} 408 | toupper(letters) 409 | ``` 410 | 411 | Mapping functions are more useful with functions that are not vectorised or for computations over lists and data frame columns where the vectorisation occurs within the elements or columns themselves. In the following example, we apply a summarising function over all columns of a data frame: 412 | 413 | ```{r} 414 | purrr::map_int(mtcars, n_distinct) 415 | ``` 416 | 417 | 418 | #### Scoped dplyr variants 419 | 420 | dplyr provides variants of the main data manipulation verbs that map functions over a selection of columns. These verbs are known as the [scoped variants](https://dplyr.tidyverse.org/reference/scoped.html) and are recognizable from their `_at`, `_if` and `_all` suffixes. 421 | 422 | Scoped verbs support three sorts of selection: 423 | 424 | 1. `_all` verbs operate on all columns of the data frame. 
You can summarise all columns of a data frame within groups with `summarise_all()`: 425 | 426 | ```{r} 427 | iris %>% group_by(Species) %>% summarise_all(mean) 428 | ``` 429 | 430 | 1. `_if` verbs operate conditionally, on all columns for which a predicate returns `TRUE`. If you are familiar with purrr, the idea is similar to the conditional mapper `purrr::map_if()`. Promoting all character columns of a data frame as grouping variables is as simple as: 431 | 432 | ```{r} 433 | starwars %>% group_by_if(is.character) 434 | ``` 435 | 436 | 1. `_at` verbs operate on a selection of columns. You can supply integer vectors of column positions or character vectors of column names. 437 | 438 | ```{r} 439 | mtcars %>% summarise_at(1:2, mean) 440 | 441 | mtcars %>% summarise_at(c("disp", "drat"), median) 442 | ``` 443 | 444 | More interestingly, you can use `vars()`[^fn:vars] to supply the same sort of expressions you would pass to `select()`! The selection helpers make it very convenient to craft a selection of columns to map over. 445 | 446 | ```{r} 447 | starwars %>% summarise_at(vars(height:mass), mean) 448 | 449 | starwars %>% summarise_at(vars(ends_with("_color")), n_distinct) 450 | ``` 451 | 452 | The scoped variants of `mutate()` and `summarise()` are the closest analogue to `base::lapply()` and `purrr::map()`.
Unlike pure list mappers, the scoped verbs fully implement the dplyr semantics, such as groupwise vectorisation or the summary constraints: 453 | 454 | ```{r, include = FALSE} 455 | # For printing 456 | mtcars <- as_tibble(mtcars) 457 | ``` 458 | 459 | ```{r} 460 | # map() returns a simple list with the results 461 | mtcars[1:5] %>% purrr::map(mean) 462 | 463 | # `mutate_` variants recycle to group size 464 | mtcars[1:5] %>% mutate_all(mean) 465 | 466 | # `summarise_` variants enforce a size 1 constraint 467 | mtcars[1:5] %>% summarise_all(mean) 468 | 469 | # All scoped verbs know about groups 470 | mtcars[1:5] %>% group_by(cyl) %>% summarise_all(mean) 471 | ``` 472 | 473 | The other scoped variants also accept optional functions to map over the selection of columns. For instance, you could group by a selection of variables and transform them on the fly: 474 | 475 | ```{r} 476 | iris %>% group_by_if(is.factor, as.character) 477 | ``` 478 | 479 | or transform the column names of selected variables: 480 | 481 | ```{r} 482 | storms %>% select_at(vars(name:hour), toupper) 483 | ``` 484 | 485 | The scoped variants lie at the intersection of purrr and dplyr and combine the rowwise looping mechanisms of dplyr with the columnwise mapping of purrr. This is a powerful combination. 486 | 487 | [^fn:vars]: `vars()` is the function that does the quoting of your expressions, and returns blueprints to its caller. This pattern of letting an external helper quote the arguments is called [external quoting](#sec:external-quoting). 488 | 489 | 490 | # Getting up to speed {#sec:up-to-speed} 491 | 492 | While tidyverse grammars are easy to write in scripts and at the console, they make it a bit harder to reduce code duplication. Writing functions around dplyr pipelines and other tidyeval APIs requires a bit of special knowledge because these APIs use a special type of functions called **quoting functions** in order to make data first class. 
493 | 494 | While one-off code is often reasonable for common data analysis tasks, it is good practice to write reusable functions to reduce code duplication. In this introduction, you will learn about quoting functions, what challenges they pose for programming, and the solutions that **tidy evaluation** provides for those problems. 495 | 496 | 497 | ## Writing functions 498 | 499 | ### Reducing duplication 500 | 501 | Writing functions is essential for the clarity and robustness of your code. Functions have several advantages: 502 | 503 | 1. They prevent inconsistencies because they force multiple computations to follow a single recipe. 504 | 505 | 1. They emphasise what varies (the arguments) and what is constant (every other component of the computation). 506 | 507 | 1. They make change easier because you only need to modify the code in one place. 508 | 509 | 1. They make your code clearer if you give the function and its arguments informative names. 510 | 511 | The process for creating a function is straightforward. First, recognise duplication in your code. A good rule of thumb is to create a function when you have copy-pasted a piece of code three times. Can you spot the copy-paste mistake in this duplicated code? 512 | 513 | ```{r, eval = FALSE} 514 | (df$a - min(df$a)) / (max(df$a) - min(df$a)) 515 | (df$b - min(df$b)) / (max(df$b) - min(df$b)) 516 | (df$c - min(df$c)) / (max(df$c) - min(df$c)) 517 | (df$d - min(df$d)) / (max(df$d) - min(df$c)) 518 | ``` 519 | 520 | Now identify the varying parts of the expression and give each a name. `x` is an easy choice, but it is often a good idea to reflect the type of argument expected in the name.
In our case we expect a numeric vector: 521 | 522 | ```{r, eval = FALSE} 523 | (num - min(num)) / (max(num) - min(num)) 524 | (num - min(num)) / (max(num) - min(num)) 525 | (num - min(num)) / (max(num) - min(num)) 526 | (num - min(num)) / (max(num) - min(num)) 527 | ``` 528 | 529 | We can now create a function with a relevant name: 530 | 531 | ```{r, eval = FALSE} 532 | rescale01 <- function(num) { 533 | 534 | } 535 | ``` 536 | 537 | Fill it with our deduplicated code: 538 | 539 | ```{r, eval = FALSE} 540 | rescale01 <- function(num) { 541 | (num - min(num)) / (max(num) - min(num)) 542 | } 543 | ``` 544 | 545 | And refactor a little to reduce duplication further and handle more cases: 546 | 547 | ```{r, eval = FALSE} 548 | rescale01 <- function(num) { 549 | rng <- range(num, na.rm = TRUE, finite = TRUE) 550 | (num - rng[[1]]) / (rng[[2]] - rng[[1]]) 551 | } 552 | ``` 553 | 554 | Now you can reuse your function any place you need it: 555 | 556 | ```{r, eval = FALSE} 557 | rescale01(df$a) 558 | rescale01(df$b) 559 | rescale01(df$c) 560 | rescale01(df$d) 561 | ``` 562 | 563 | Reducing code duplication is as much needed with tidyverse grammars as with ordinary computations. Unfortunately, the straightforward process to create functions breaks down with grammars like dplyr, which we attach now. 
564 | 565 | ```{r} 566 | library("dplyr") 567 | ``` 568 | 569 | To see the problem, let's use the same function-writing process with a duplicated dplyr pipeline: 570 | 571 | ```{r, eval = FALSE} 572 | df1 %>% group_by(x1) %>% summarise(mean = mean(y1)) 573 | df2 %>% group_by(x2) %>% summarise(mean = mean(y2)) 574 | df3 %>% group_by(x3) %>% summarise(mean = mean(y3)) 575 | df4 %>% group_by(x4) %>% summarise(mean = mean(y4)) 576 | ``` 577 | 578 | We first abstract out the varying parts by giving them informative names: 579 | 580 | ```{r, eval = FALSE} 581 | data %>% group_by(group_var) %>% summarise(mean = mean(summary_var)) 582 | ``` 583 | 584 | And wrap the pipeline with a function taking these argument names: 585 | 586 | ```{r} 587 | grouped_mean <- function(data, group_var, summary_var) { 588 | data %>% 589 | group_by(group_var) %>% 590 | summarise(mean = mean(summary_var)) 591 | } 592 | ``` 593 | 594 | Unfortunately this function doesn't actually work. When you call it dplyr complains that the variable `group_var` is unknown: 595 | 596 | ```{r, error = TRUE} 597 | grouped_mean(mtcars, cyl, mpg) 598 | ``` 599 | 600 | Here is the proper way of defining this function: 601 | 602 | ```{r} 603 | grouped_mean <- function(data, group_var, summary_var) { 604 | group_var <- enquo(group_var) 605 | summary_var <- enquo(summary_var) 606 | 607 | data %>% 608 | group_by(!!group_var) %>% 609 | summarise(mean = mean(!!summary_var)) 610 | } 611 | ``` 612 | 613 | ```{r} 614 | grouped_mean(mtcars, cyl, mpg) 615 | ``` 616 | 617 | To understand how that works, we need to learn about quoting functions and what special steps are needed to be effective at programming with them. Really we only need two new concepts forming together a single pattern: quoting and unquoting. This introduction will get you up to speed with this pattern. 618 | 619 | 620 | ### What's special about quoting functions? 
621 | 622 | R functions can be categorised in two broad categories: evaluating functions and quoting functions [^1]. These functions differ in the way they get their arguments. Evaluating functions take arguments as **values**. It does not matter what the expression supplied as argument is or which objects it contains. R computes the argument value following the standard rules of evaluation which the function receives passively [^2]. 623 | 624 | The simplest regular function is `identity()`. It evaluates its single argument and returns the value. Because only the final value of the argument matters, all of these statements are completely equivalent: 625 | 626 | ```{r} 627 | identity(6) 628 | 629 | identity(2 * 3) 630 | 631 | a <- 2 632 | b <- 3 633 | identity(a * b) 634 | ``` 635 | 636 | On the other hand, a quoting function is not passed the value of an expression, it is passed the *expression itself*. We say the argument has been automatically quoted. The quoted expression might be evaluated a bit later or might not be evaluated at all. The simplest quoting function is `quote()`. It automatically quotes its argument and returns the quoted expression without any evaluation. Because only the expression passed as argument matters, none of these statements are equivalent: 637 | 638 | ```{r} 639 | quote(6) 640 | 641 | quote(2 * 3) 642 | 643 | quote(a * b) 644 | ``` 645 | 646 | Other familiar quoting operators are `""` and `~`. The `""` operator quotes a piece of text at parsing time and returns a string. This prevents the text from being interpreted as some R code to evaluate. The tilde operator is similar to the `quote()` function in that it prevents R code from being automatically evaluated and returns a quoted expression in the form of a formula. The expression is then used to define a statistical model in modelling functions. 
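To make the formula case concrete, here is a small sketch in base R. It only assumes the built-in `mtcars` data set; the names `f` and `fit` are chosen for illustration:

```r
# The tilde quotes both sides: neither `mpg` nor `cyl` is looked up yet
f <- mpg ~ cyl
class(f)
all.vars(f)

# The modelling function decides where the quoted variables are
# evaluated, here among the columns of `mtcars`
fit <- lm(f, data = mtcars)
names(coef(fit))
```

Creating `f` succeeds even though no `mpg` or `cyl` object exists in the workspace, which is a sign that the tilde quoted its input rather than evaluating it.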
The three following expressions are doing something similar, they are quoting their input: 647 | 648 | ```{r} 649 | "a * b" 650 | 651 | ~ a * b 652 | 653 | quote(a * b) 654 | ``` 655 | 656 | The first statement returns a quoted string and the other two return quoted code in a formula or as a bare expression. 657 | 658 | 659 | [^1]: In practice this is a bit more complex because most quoting functions evaluate at least one argument, usually the data argument. 660 | 661 | [^2]: This is why regular functions are said to use standard evaluation unlike quoting functions which use non-standard evaluation (NSE). Note that the function is not entirely passive. Because arguments are lazily evaluated, the function gets to decide when an argument is evaluated, if at all. 662 | 663 | 664 | #### Quoting and evaluating in mundane R code 665 | 666 | As an R programmer, you are probably already familiar with the distinction between quoting and evaluating functions. Take the case of subsetting a data frame column by name. The `[[` and `$` operators are both standard for this task but they are used in very different situations. The former supports indirect references like variables or expressions that represent a column name while the latter takes a column name directly: 667 | 668 | ```{r} 669 | df <- data.frame( 670 | y = 1, 671 | var = 2 672 | ) 673 | 674 | var <- "y" 675 | df[[var]] 676 | 677 | df$y 678 | ``` 679 | 680 | Technically, `[[` is an evaluating function while `$` is a quoting function. You can indirectly refer to columns with `[[` because the subsetting index is evaluated, allowing indirect references. 
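As a rough sketch of what an evaluated index makes possible, the index can be any expression that produces a column name, such as a loop variable or a `paste0()` call. The `scores` data frame and its column names are made up for illustration:

```r
# Hypothetical data frame, used only for illustration
scores <- data.frame(test_1 = c(10, 20), test_2 = c(30, 50))

# The index is evaluated, so it can be a loop variable...
for (nm in names(scores)) {
  print(mean(scores[[nm]]))
}

# ...or any expression that returns a column name
i <- 2
scores[[paste0("test_", i)]]
```

Neither form has a counterpart with `$`, which always takes the name exactly as typed.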
The following expressions are completely equivalent: 681 | 682 | ```{r} 683 | df[[var]] # Indirect 684 | 685 | df[["y"]] # Direct 686 | ``` 687 | 688 | But these are not: 689 | 690 | ```{r} 691 | df$var # Direct 692 | 693 | df$y # Direct 694 | ``` 695 | 696 | The following table summarises the fundamental asymmetry between the two subsetting methods: 697 | 698 | | | Quoted | Evaluated | 699 | | -------- |:------:|:-----------:| 700 | | Direct | `df$y` | `df[["y"]]` | 701 | | Indirect | ??? | `df[[var]]` | 702 | 703 | 704 | #### Detecting quoting functions 705 | 706 | Because they work so differently to standard R code, it is important to recognise auto-quoted arguments. The documentation of the quoting function should normally tell you if an argument is quoted and evaluated in a special way. You can also detect quoted arguments by yourself with some experimentation. Let's take the following expressions involving a mix of quoting and evaluating functions: 707 | 708 | ```{r} 709 | library(MASS) 710 | 711 | mtcars2 <- subset(mtcars, cyl == 4) 712 | 713 | sum(mtcars2$am) 714 | 715 | rm(mtcars2) 716 | ``` 717 | 718 | A good indication that an argument is auto-quoted and evaluated in a special way is that the argument will not work correctly outside of its original context. Let's try to break down each of these expressions in two steps by storing the arguments in an intermediary variable: 719 | 720 | 1. `library(MASS)` 721 | ```{r, error = TRUE} 722 | temp <- MASS 723 | 724 | temp <- "MASS" 725 | library(temp) 726 | ``` 727 | 728 | We get these errors because there is no `MASS` object for R to find, and `temp` is interpreted by `library()` directly as a package name rather than as an indirect reference. Let's try to break down the `subset()` expression: 729 | 730 | 2. `mtcars2 <- subset(mtcars, cyl == 4)` 731 | ```{r, error = TRUE} 732 | temp <- cyl == 4 733 | ``` 734 | 735 | R cannot find `cyl` because we haven't specified where to find it. 
This object exists only inside the `mtcars` data frame. 736 | 737 | 3. `sum(mtcars2$am)` 738 | ```{r, error = TRUE} 739 | temp <- mtcars$am 740 | sum(temp) 741 | ``` 742 | 743 | It worked! `sum()` is an evaluating function and the indirect reference was resolved in the ordinary way. 744 | 745 | 4. `rm(mtcars2)` 746 | ```{r, error = TRUE} 747 | mtcars2 <- mtcars 748 | temp <- "mtcars2" 749 | rm(temp) 750 | 751 | exists("mtcars2") 752 | exists("temp") 753 | ``` 754 | 755 | This time there was no error, but we have accidentally removed the variable `temp` instead of the variable it was referring to. This is because `rm()` auto-quotes its arguments. 756 | 757 | 758 | ### Unquotation 759 | 760 | In practice, functions that evaluate their arguments are easier to program with because they support both direct and indirect references. For quoting functions, a piece of syntax is missing. We need the ability to **unquote** arguments. 761 | 762 | 763 | #### Unquoting in base R 764 | 765 | Base R provides three different ways of allowing indirect references: 766 | 767 | * An extra function that evaluates its arguments. For instance the evaluating variant of the `$` operator is `[[`. 768 | 769 | 770 | * An extra parameter that switches off auto-quoting. For instance `library()` evaluates its first argument if you set `character.only` to `TRUE`: 771 | 772 | ```{r} 773 | temp <- "MASS" 774 | library(temp, character.only = TRUE) 775 | ``` 776 | 777 | * An extra parameter that evaluates its argument. If you have a character vector of object names to pass to `rm()`, use the `list` argument: 778 | 779 | ```{r} 780 | temp <- "mtcars2" 781 | rm(list = temp) 782 | 783 | exists("mtcars2") 784 | ``` 785 | 786 | There is no general unquoting convention in base R so you have to read the documentation to figure out how to unquote an argument. Many functions like `subset()` or `transform()` do not provide any unquoting option at all. 787 | 788 | 789 | #### Unquoting in the tidyverse!!
790 | 791 | All quoting functions in the tidyverse support a single unquotation mechanism, the `!!` operator (pronounced **bang-bang**). You can use `!!` to cancel the automatic quotation and supply indirect references everywhere an argument is automatically quoted. In other words, unquoting lets you open a variable and use what's inside instead. 792 | 793 | First let's create a couple of variables that hold references to columns from the `mtcars` data frame. A simple way of creating these references is to use the fundamental quoting function `quote()`: 794 | 795 | ```{r} 796 | # Variables referring to columns `cyl` and `mpg` 797 | x_var <- quote(cyl) 798 | y_var <- quote(mpg) 799 | 800 | x_var 801 | 802 | y_var 803 | ``` 804 | 805 | Here are a few examples of how `!!` can be used in tidyverse functions to unquote these variables, i.e. open them and use their contents. 806 | 807 | * In dplyr most verbs quote their arguments: 808 | 809 | ```{r} 810 | library("dplyr") 811 | 812 | by_cyl <- mtcars %>% 813 | group_by(!!x_var) %>% # Open x_var 814 | summarise(mean = mean(!!y_var)) # Open y_var 815 | ``` 816 | 817 | * In ggplot2 `aes()` is the main quoting function: 818 | 819 | ```{r} 820 | library("ggplot2") 821 | 822 | ggplot(mtcars, aes(!!x_var, !!y_var)) + # Open x_var and y_var 823 | geom_point() 824 | ``` 825 | 826 | ggplot2 also features `vars()` which is useful for facetting: 827 | 828 | ```{r} 829 | ggplot(mtcars, aes(disp, drat)) + 830 | geom_point() + 831 | facet_grid(vars(!!x_var)) # Open x_var 832 | ``` 833 | 834 | Being able to make indirect references by opening variables with `!!` is rarely useful in scripts but is invaluable for writing functions. With `!!` we can now easily fix our wrapper function, as we'll see in the following section. 835 | 836 | 837 | ### Understanding `!!` with `qq_show()` 838 | 839 | At this point it is normal if the concept of unquoting still feels nebulous. 
A good way of practicing this operation is to see for yourself what it is really doing. To that end the `qq_show()` function from the rlang package performs unquoting and prints the result to the screen. Here is what `!!` is really doing in the dplyr example (I've broken the pipeline into two steps for readability): 840 | 841 | ```{r} 842 | rlang::qq_show(mtcars %>% group_by(!!x_var)) 843 | 844 | rlang::qq_show(data %>% summarise(mean = mean(!!y_var))) 845 | ``` 846 | 847 | Similarly for the ggplot2 pipeline: 848 | 849 | ```{r} 850 | rlang::qq_show(ggplot(mtcars, aes(!!x_var, !!y_var))) 851 | 852 | rlang::qq_show(facet_grid(vars(!!x_var))) 853 | ``` 854 | 855 | As you can see, unquoting a variable that contains a reference to the column `cyl` is equivalent to directly supplying `cyl` to the dplyr function. 856 | 857 | 858 | ## Quote and unquote 859 | 860 | The basic process for creating tidyeval functions requires thinking a bit differently but is straightforward: quote and unquote. 861 | 862 | 1. Use `enquo()` to make a function automatically quote its argument. 863 | 1. Use `!!` to unquote the argument. 864 | 865 | Apart from these additional two steps, the process is the same. 866 | 867 | 868 | ### The abstraction step 869 | 870 | We start as usual by identifying the varying parts of a computation and giving them informative names. These names become the arguments to the function. 871 | 872 | ```{r, eval = FALSE} 873 | grouped_mean <- function(data, group_var, summary_var) { 874 | data %>% 875 | group_by(group_var) %>% 876 | summarise(mean = mean(summary_var)) 877 | } 878 | ``` 879 | 880 | As we have seen earlier this function does not quite work yet so let's fix it by applying the two new steps. 881 | 882 | 883 | ### The quoting step 884 | 885 | The quoting step is about making our ordinary function a quoting function. Not all parameters should be automatically quoted though. 
For instance the `data` argument refers to a real data frame that is passed around in the ordinary way. It is crucial to identify which parameters of your function should be automatically quoted: those that are allowed to refer to columns of the data frame. In the example, `group_var` and `summary_var` are the parameters that refer to the data. 886 | 887 | We know that the fundamental quoting function is `quote()` but how do we go about creating other quoting functions? This is the job of `enquo()`. While `quote()` quotes what *you* typed, `enquo()` quotes what *your user* typed. In other words it makes an argument automatically quote its input. This is exactly how dplyr verbs are created! Here is how to apply `enquo()` to the `group_var` and `summary_var` arguments: 888 | 889 | ```{r, eval = FALSE} 890 | group_var <- enquo(group_var) 891 | summary_var <- enquo(summary_var) 892 | ``` 893 | 894 | 895 | ### The unquoting step 896 | 897 | Finally we identify any place where these variables are passed to other quoting functions. That's where we need to unquote with `!!`. In this case we pass `group_var` to `group_by()` and `summary_var` to `summarise()`: 898 | 899 | ```{r, eval = FALSE} 900 | data %>% 901 | group_by(!!group_var) %>% 902 | summarise(mean = mean(!!summary_var)) 903 | ``` 904 | 905 | 906 | ### Result 907 | 908 | The finished function looks like this: 909 | 910 | ```{r} 911 | grouped_mean <- function(data, group_var, summary_var) { 912 | group_var <- enquo(group_var) 913 | summary_var <- enquo(summary_var) 914 | 915 | data %>% 916 | group_by(!!group_var) %>% 917 | summarise(mean = mean(!!summary_var)) 918 | } 919 | ``` 920 | 921 | And voilà! 922 | 923 | ```{r} 924 | grouped_mean(mtcars, cyl, mpg) 925 | 926 | grouped_mean(mtcars, cyl, disp) 927 | 928 | grouped_mean(mtcars, am, disp) 929 | ``` 930 | 931 | This simple quote-and-unquote pattern will get you a long way.
It makes it possible to abstract complex combinations of quoting functions into a new quoting function. However this gets us in a sort of loop: quoting functions unquote inside other quoting functions and so on. At the start of the loop is the user typing expressions that are automatically quoted. But what if we can't or don't want to start with expressions typed by the user? What if we'd like to start with a character vector of column names? 932 | 933 | 934 | ## Strings instead of quotes 935 | 936 | So far we have created a quoting function that wraps around other quoting functions. How can we break this chain of quoting? How can we go from the evaluating world to the quoting universe? The most common way this transition occurs is when you start with a character vector of column names and somehow need to pass the corresponding columns to quoting functions like `dplyr::mutate()`, `dplyr::select()`, or `ggplot2::aes()`. We need a way of bridging evaluating and quoting functions. 937 | 938 | First let's see why simply unquoting strings does not work: 939 | 940 | ```{r, error = TRUE} 941 | var <- "height" 942 | mutate(starwars, rescaled = !!var * 100) 943 | ``` 944 | 945 | We get a type error. Observing the result of unquoting with `qq_show()` will shed some light on this: 946 | 947 | ```{r} 948 | rlang::qq_show(mutate(starwars, rescaled = !!var * 100)) 949 | ``` 950 | 951 | We have unquoted a string, and now dplyr tried to multiply that string by 100! 952 | 953 | 954 | ### Strings 955 | 956 | There is a fundamental difference between these two objects: 957 | 958 | ```{r} 959 | "height" 960 | 961 | quote(height) 962 | ``` 963 | 964 | `"height"` is a string and `quote(height)` is a **symbol**, or variable name. A symbol is much more than a string, it is a reference to an R object. That's why you have to use symbols to refer to data frame columns. 
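One way to see the difference, sketched here with base R's `eval()` on a small made-up data frame (`people` is hypothetical), is to evaluate both objects in the context of the data:

```r
# Hypothetical data frame, used only for illustration
people <- data.frame(height = c(150, 180))

# A symbol is looked up in the data and returns the column
eval(quote(height), people)

# A string is just a value, so it evaluates to itself
eval("height", people)
```

The symbol resolves to the column values whereas the string comes back unchanged, which is why unquoting a string into `mutate()` ends up multiplying text by a number.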
Fortunately transforming strings to symbols is straightforward with the tidy eval `sym()` function: 965 | 966 | ```{r} 967 | sym("height") 968 | ``` 969 | 970 | If you use `sym()` instead of `enquo()`, you end up with an evaluating function that transforms its inputs into symbols that can suitably be unquoted: 971 | 972 | ```{r} 973 | grouped_mean2 <- function(data, group_var, summary_var) { 974 | group_var <- sym(group_var) 975 | summary_var <- sym(summary_var) 976 | 977 | data %>% 978 | group_by(!!group_var) %>% 979 | summarise(mean = mean(!!summary_var)) 980 | } 981 | ``` 982 | 983 | With this simple change we now have an *evaluating* wrapper which can be used in the same way as `[[`. You can call `grouped_mean2()` with direct references: 984 | 985 | ```{r} 986 | grouped_mean2(starwars, "gender", "mass") 987 | ``` 988 | 989 | Or indirect references: 990 | 991 | ```{r} 992 | grp_var <- "gender" 993 | sum_var <- "mass" 994 | grouped_mean2(starwars, grp_var, sum_var) 995 | ``` 996 | 997 | 998 | ### Character vectors of column names 999 | 1000 | What if you have a whole character vector of column names? You can transform vectors to a list of symbols with the plural variant `syms()`: 1001 | 1002 | ```{r} 1003 | cols <- syms(c("species", "gender")) 1004 | 1005 | cols 1006 | ``` 1007 | 1008 | But now we have a list. Can we just unquote a list of symbols with `!!`? 1009 | 1010 | ```{r, error = TRUE} 1011 | group_by(starwars, !!cols) 1012 | ``` 1013 | 1014 | Something's wrong. Using `qq_show()`, we see that `group_by()` gets a list instead of the individual symbols: 1015 | 1016 | ```{r} 1017 | rlang::qq_show(group_by(starwars, !!cols)) 1018 | ``` 1019 | 1020 | We should unquote each symbol in the list as a separate argument. 
The big bang operator `!!!` makes this easy: 1021 | 1022 | ```{r} 1023 | rlang::qq_show(group_by(starwars, !!cols[[1]], !!cols[[2]])) 1024 | 1025 | rlang::qq_show(group_by(starwars, !!!cols)) 1026 | ``` 1027 | 1028 | Working with multiple arguments and lists of expressions requires specific techniques such as using `!!!`. These techniques are covered in the next chapter. 1029 | -------------------------------------------------------------------------------- /modify.Rmd: -------------------------------------------------------------------------------- 1 | 2 | ```{r setup, include = FALSE} 3 | source("setup.R") 4 | library("dplyr") 5 | ``` 6 | 7 | # Modifying inputs 8 | 9 | With the quote-and-unquote pattern, quoted arguments are passed to other functions as is. In many cases you'll find this to be too restrictive. This chapter will guide you through the steps required to pass custom argument names and custom quoted expressions. 10 | 11 | 12 | ## Modifying names 13 | 14 | When your function creates new columns in a data frame it's often a good idea to give them names that reflect the meaning of those columns. In this section you'll learn how to: 15 | 16 | * Create default names for quoted arguments. 17 | * Unquote names. 18 | 19 | 20 | ### Default argument names 21 | 22 | If you are familiar with dplyr you have probably noticed that new columns are given default names when you don't supply one explicitly to `mutate()` or `summarise()`.
These default names are not practical for further manipulation but they are helpful to remind rushed users what their new column is about: 23 | 24 | ```{r} 25 | starwars %>% summarise(average = mean(height, na.rm = TRUE)) 26 | 27 | starwars %>% summarise(mean(height, na.rm = TRUE)) 28 | ``` 29 | 30 | You can create default names by applying `as_label()` to any expression: 31 | 32 | ```{r} 33 | var1 <- quote(height) 34 | var2 <- quote(mean(height)) 35 | 36 | as_label(var1) 37 | as_label(var2) 38 | ``` 39 | 40 | Including automatically quoted arguments: 41 | 42 | ```{r} 43 | arg_name <- function(var) { 44 | var <- enquo(var) 45 | 46 | as_label(var) 47 | } 48 | 49 | arg_name(height) 50 | 51 | arg_name(mean(height)) 52 | ``` 53 | 54 | Lists of quoted expressions require a different approach because we don't want to override user-supplied names. The easiest way is to call `enquos()` with `.named = TRUE`. With this option, all unnamed arguments get a default name: 55 | 56 | ```{r} 57 | args_names <- function(...) { 58 | vars <- enquos(..., .named = TRUE) 59 | names(vars) 60 | } 61 | 62 | args_names(mean(height), weight) 63 | 64 | args_names(avg = mean(height), weight) 65 | ``` 66 | 67 | 68 | ### Unquoting argument names 69 | 70 | Argument names are one of the most common occurrences of quotation in R. There is no fundamental difference between these two ways of creating a `"Mickey"` string: 71 | 72 | ```{r} 73 | names(c(Mickey = NA)) 74 | 75 | as_label(quote(Mickey)) 76 | ``` 77 | 78 | Where there is quotation it is natural to have unquotation. For this reason, tidy eval makes it possible to use `!!` to unquote names. Unfortunately we'll have to use a somewhat peculiar syntax to unquote names because using complex expressions on the left-hand side of `=` is not valid R code: 79 | 80 | ```{r, error = TRUE} 81 | nm <- "Mickey" 82 | args_names(!!nm = 1) 83 | ``` 84 | 85 | Instead you'll have to unquote on the LHS of `:=`.
This vestigial operator is interpreted by tidy eval functions in exactly the same way as `=` but with `!!` support: 86 | 87 | ```{r} 88 | nm <- "Mickey" 89 | args_names(!!nm := 1) 90 | ``` 91 | 92 | Another way of achieving the same result is to splice a named list of arguments: 93 | 94 | ```{r} 95 | args <- setNames(list(1), nm) 96 | args_names(!!!args) 97 | ``` 98 | 99 | This works because `!!!` uses the names of the list as argument names. This is a great pattern when you are dealing with multiple arguments: 100 | 101 | ```{r} 102 | nms <- c("Mickey", "Minnie") 103 | args <- setNames(list(1, 2), nms) 104 | args_names(!!!args) 105 | ``` 106 | 107 | 108 | ### Prefixing quoted arguments 109 | 110 | Now that we know how to unquote argument names, let's apply informative prefixes to the names of the columns created in `grouped_mean()`. We'll start with the summary variable: 111 | 112 | 1. Get the default name of the quoted summary variable. 113 | 2. Prepend a prefix to it. 114 | 3. Unquote it with `!!` and `:=`. 115 | 116 | ```{r} 117 | grouped_mean2 <- function(.data, .summary_var, ...) { 118 | summary_var <- enquo(.summary_var) 119 | group_vars <- enquos(...) 120 | 121 | # Get and modify the default name 122 | summary_nm <- as_label(summary_var) 123 | summary_nm <- paste0("avg_", summary_nm) 124 | 125 | .data %>% 126 | group_by(!!!group_vars) %>% 127 | summarise(!!summary_nm := mean(!!summary_var)) # Unquote the name 128 | } 129 | 130 | grouped_mean2(mtcars, disp, cyl, am) 131 | 132 | names(grouped_mean2(mtcars, disp, cyl, am)) 133 | ``` 134 | 135 | Regarding the grouping variables, this is a case where explicitly quoting and unquoting `...` pays off because we need to change the names of the list of quoted dots: 136 | 137 | - Give default names to quoted dots with `.named = TRUE`. 138 | - Prepend a prefix to the names of the list. 139 | - Unquote-splice the list of quoted arguments as usual. 140 | 141 | ```{r} 142 | grouped_mean2 <- function(.data, .summary_var, ...) 
{ 143 | summary_var <- enquo(.summary_var) 144 | 145 | # Quote the dots with default names 146 | group_vars <- enquos(..., .named = TRUE) 147 | 148 | summary_nm <- as_label(summary_var) 149 | summary_nm <- paste0("avg_", summary_nm) 150 | 151 | # Modify the names of the list of quoted dots 152 | names(group_vars) <- paste0("groups_", names(group_vars)) 153 | 154 | .data %>% 155 | group_by(!!!group_vars) %>% # Unquote-splice as usual 156 | summarise(!!summary_nm := mean(!!summary_var)) 157 | } 158 | 159 | grouped_mean2(mtcars, disp, cyl, am) 160 | 161 | names(grouped_mean2(mtcars, disp, cyl, am)) 162 | ``` 163 | 164 | 165 | ## Modifying quoted expressions 166 | 167 | The quote-and-unquote pattern is a powerful and versatile technique. In this section we'll use it for modifying quoted arguments. 168 | 169 | In [dealing with multiple arguments](#multiple), we have created a version of `grouped_mean()` that takes multiple grouping variables. Say we would like to take multiple summary variables instead. We could start by replacing `summary_var` with the `...` argument: 170 | 171 | ```{r} 172 | grouped_mean3 <- function(.data, .group_var, ...) { 173 | group_var <- enquo(.group_var) 174 | summary_vars <- enquos(..., .named = TRUE) 175 | 176 | .data %>% 177 | group_by(!!group_var) %>% 178 | summarise(!!!summary_vars) # How do we take the mean? 179 | } 180 | ``` 181 | 182 | The quoting part is easy. But how do we go about taking the average of each argument before passing them on to `summarise()`? We'll have to modify the list of summary variables. 183 | 184 | 185 | ### Expanding quoted expressions with `expr()` 186 | 187 | Quoting and unquoting is an effective technique for modifying quoted expressions. But we'll need to add one more function to our toolbox to work around the lack of unquoting support in `quote()`. 188 | 189 | As we saw, the fundamental quoting function in R is `quote()`. 
All it does is return its quoted argument: 190 | 191 | ```{r} 192 | quote(mean(mass)) 193 | ``` 194 | 195 | `quote()` does not support quasiquotation but tidy eval provides a variant that does. With `expr()`, you can quote expressions with full unquoting support: 196 | 197 | ```{r} 198 | vars <- list(quote(mass), quote(height)) 199 | 200 | expr(mean(!!vars[[1]])) 201 | 202 | expr(group_by(!!!vars)) 203 | ``` 204 | 205 | Note what just happened: by quoting-and-unquoting, we have expanded existing quoted expressions! This is the key to modifying expressions before passing them on to other quoting functions. For instance we could loop over the summary variables and unquote each of them in a `mean()` expression: 206 | 207 | ```{r} 208 | purrr::map(vars, function(var) expr(mean(!!var, na.rm = TRUE))) 209 | ``` 210 | 211 | Let's fix `grouped_mean3()` using this pattern: 212 | 213 | ```{r} 214 | grouped_mean3 <- function(.data, .group_var, ...) { 215 | group_var <- enquo(.group_var) 216 | summary_vars <- enquos(..., .named = TRUE) 217 | 218 | # Wrap the summary variables with mean() 219 | summary_vars <- purrr::map(summary_vars, function(var) { 220 | expr(mean(!!var, na.rm = TRUE)) 221 | }) 222 | 223 | # Prefix the names with `avg_` 224 | names(summary_vars) <- paste0("avg_", names(summary_vars)) 225 | 226 | .data %>% 227 | group_by(!!group_var) %>% 228 | summarise(!!!summary_vars) 229 | } 230 | ``` 231 | 232 | ```{r} 233 | grouped_mean3(starwars, species, height) 234 | 235 | grouped_mean3(starwars, species, height, mass) 236 | ``` 237 | -------------------------------------------------------------------------------- /setup.R: -------------------------------------------------------------------------------- 1 | 2 | knitr::opts_chunk$set(collapse = T, comment = "#>") 3 | options( 4 | tibble.print_min = 5L, 5 | tibble.print_max = 5L, 6 | digits = 2 7 | ) 8 | 9 | set.seed(1014) 10 | -------------------------------------------------------------------------------- 
/tidyeval.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: No 4 | SaveWorkspace: No 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | PackageRoxygenize: rd,collate,namespace 22 | -------------------------------------------------------------------------------- /toolbox.Rmd: -------------------------------------------------------------------------------- 1 | 2 | # (PART) Going further {-} 3 | 4 | ```{r setup, include = FALSE} 5 | source("setup.R") 6 | ``` 7 | 8 | # A rich toolbox 9 | 10 | ## TODO `quote()`, `expr()` and `enexpr()` 11 | 12 | `case_when()` example? 13 | 14 | 15 | ## TODO `quo()` and `enquo()` 16 | 17 | 18 | ## TODO `vars()`, `quos()` and `enquos()` 19 | 20 | Is it confusing that `vars()` is an alias of `quos()`? 21 | 22 | 23 | ## TODO `qq_show()` {#toolbox-qq-show} 24 | 25 | 26 | ## TODO `sym()` and `syms()` 27 | --------------------------------------------------------------------------------