├── .Rbuildignore
├── .github
│   ├── .gitignore
│   └── workflows
│       └── bookdown.yaml
├── .gitignore
├── DESCRIPTION
├── LICENSE.md
├── README.md
├── _quarto.yml
├── _redirects
├── argument-clutter.qmd
├── boolean-strategies.qmd
├── call-data-details.qmd
├── changes-multivers.qmd
├── common.R
├── consistent-argument-names.qmd
├── cs-mapply-pmap.qmd
├── cs-rep.qmd
├── cs-rvest.qmd
├── cs-setNames.qmd
├── cs-stringr.qmd
├── def-inform.qmd
├── def-magical.qmd
├── def-user.qmd
├── defaults-short-and-sweet.qmd
├── design.Rproj
├── dots-after-required.qmd
├── dots-data.qmd
├── dots-inspect.qmd
├── dots-prefix.qmd
├── enumerate-options.qmd
├── err-call.qmd
├── err-constructor.qmd
├── explicit-strategies.qmd
├── fun_def.R
├── function-names.qmd
├── glossary.qmd
├── identity-strategy.qmd
├── implicit-strategies.qmd
├── important-args-first.qmd
├── independent-meaning.qmd
├── index.qmd
├── inputs-explicit.qmd
├── names.qmd
├── out-invisible.qmd
├── out-multi.qmd
├── out-type-stability.qmd
├── out-vectorisation.qmd
├── plausible.html
├── r4ds.scss
├── required-no-defaults.qmd
├── side-effects.qmd
├── spooky-action.qmd
├── spooky-action.rds
├── strategy-functions.qmd
├── strategy-objects.qmd
├── substack
│   ├── 2023-07-28.qmd
│   ├── 2023-08-04.qmd
│   ├── 2023-08-11.qmd
│   ├── 2023-09-29.qmd
│   └── 2023-10-27.qmd
└── unifying.qmd

/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^\.travis\.yml$
2 | ^.*\.Rproj$
3 | ^\.Rproj\.user$
4 | ^\.github$
5 |
--------------------------------------------------------------------------------
/.github/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 |
--------------------------------------------------------------------------------
/.github/workflows/bookdown.yaml:
--------------------------------------------------------------------------------
1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
2 | # Need help debugging build
failures? Start at https://github.com/r-lib/actions#where-to-find-help
3 | on:
4 |   push:
5 |     branches: [main, master]
6 |   pull_request:
7 |     branches: [main, master]
8 |   workflow_dispatch:
9 |
10 | name: bookdown
11 |
12 | jobs:
13 |   bookdown:
14 |     runs-on: ubuntu-latest
15 |     # Only restrict concurrency for non-PR jobs
16 |     concurrency:
17 |       group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }}
18 |     env:
19 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
20 |     steps:
21 |       - uses: actions/checkout@v2
22 |
23 |       - name: Install Quarto
24 |         uses: quarto-dev/quarto-actions/install-quarto@v1
25 |
26 |       - uses: r-lib/actions/setup-r@v2
27 |         with:
28 |           use-public-rspm: true
29 |
30 |       - uses: r-lib/actions/setup-r-dependencies@v2
31 |
32 |       - name: Render book
33 |         run: |
34 |           quarto render
35 |
36 |       - name: Deploy to GitHub pages 🚀
37 |         if: github.event_name != 'pull_request'
38 |         uses: JamesIves/github-pages-deploy-action@v4
39 |         with:
40 |           folder: _book
41 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 | _main.rds
6 | _book
7 | _bookdown_files
8 | /.quarto/
9 | site_libs
10 | /*_cache
11 | *_files
12 | *.html
13 |
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: design
2 | Title: Tidyverse design
3 | Version: 1.0.0
4 | Authors@R: c(
5 |     person("Tidyverse", "Team", , "tidyverse@rstudio.com", c("aut", "cre"))
6 |   )
7 | Depends:
8 |     R (>= 3.1.0)
9 | Imports:
10 |     bench,
11 |     bookdown,
12 |     downlit,
13 |     dplyr,
14 |     glue,
15 |     nycflights13,
16 |     rmarkdown,
17 |     testthat,
18 |     tibble (>= 2.0.1),
19 |     tidyverse,
20 |     vctrs
21 | Remotes: r-lib/testthat
22 | URL: https://github.com/tidyverse/design
23 | BugReports:
https://github.com/tidyverse/design/issues
24 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | A place to document (and hash out) development principles for packages in the [tidyverse](http://tidyverse.org).
2 |
3 | A complement to .
4 |
5 | ## Structure
6 |
7 | Title should be a command.
8 | Keep it as short as possible, and frame it positively (what you should do, not what you shouldn't do).
9 |
10 | Sections:
11 |
12 | - What's the pattern?
13 |   Brief description and why it's important.
14 |
15 | - What are some examples?
16 |   Bulleted list of existing functions.
17 |   Can be both positive and negative examples.
18 |   Show results of code where useful.
19 |   Goal is to include enough variety that everyone recognises at least one function, and can look up the docs for the details of the others.
20 |
21 | - What are the exceptions?
22 |
23 | - How to avoid/remediate/use it?
24 |   Detailed explanation (with example) of how to prevent the problem, fix the problem, and/or use the pattern.
25 |
26 | - See also.
27 |   Include related problems as bulleted list.
28 |
29 | Case studies are useful for functions that need more explanation, have multiple problems, or need greater discussion of different trade-offs.
30 |
--------------------------------------------------------------------------------
/_quarto.yml:
--------------------------------------------------------------------------------
1 | project:
2 |   type: book
3 |   output-dir: _book
4 |
5 | book:
6 |   title: "Tidy design principles"
7 |   reader-mode: true
8 |
9 |   page-footer:
10 |     left: |
11 |       Tidy design principles was written by Hadley Wickham
12 |     right: |
13 |       This book was built with Quarto.
14 |   site-url: https://design.tidyverse.org
15 |   repo-url: https://github.com/tidyverse/design
16 |   repo-branch: main
17 |   repo-actions: [edit, issue]
18 |
19 |   chapters:
20 |     - index.qmd
21 |     - unifying.qmd
22 |
23 |     - part: Implementation
24 |       chapters:
25 |         - names.qmd
26 |         - call-data-details.qmd
27 |         - function-names.qmd
28 |
29 |     - part: Scannable specs
30 |       chapters:
31 |         - inputs-explicit.qmd
32 |         - important-args-first.qmd
33 |         - required-no-defaults.qmd
34 |         - dots-after-required.qmd
35 |         - defaults-short-and-sweet.qmd
36 |         - enumerate-options.qmd
37 |         - argument-clutter.qmd
38 |         - independent-meaning.qmd
39 |
40 |     - part: explicit-strategies.qmd
41 |       chapters:
42 |         - boolean-strategies.qmd
43 |         - strategy-objects.qmd
44 |         - strategy-functions.qmd
45 |         - cs-rep.qmd
46 |         - implicit-strategies.qmd
47 |         - cs-rvest.qmd
48 |         - identity-strategy.qmd
49 |
50 |     - part: Function arguments
51 |       chapters:
52 |         - def-magical.qmd
53 |         - def-inform.qmd
54 |         - def-user.qmd
55 |         - dots-data.qmd
56 |         - dots-prefix.qmd
57 |         - dots-inspect.qmd
58 |         - cs-mapply-pmap.qmd
59 |         - cs-setNames.qmd
60 |
61 |     - part: Outputs
62 |       chapters:
63 |         - out-multi.qmd
64 |         - out-type-stability.qmd
65 |         - out-vectorisation.qmd
66 |         - out-invisible.qmd
67 |
68 |     - part: Evolution
69 |       chapters:
70 |         - changes-multivers.qmd
71 |
72 |     - part: Side effects
73 |       chapters:
74 |         - side-effects.qmd
75 |         - spooky-action.qmd
76 |
77 |     - part: Errors
78 |       chapters:
79 |         - err-call.qmd
80 |         - err-constructor.qmd
81 |
82 |   appendices:
83 |     - glossary.qmd
84 |
85 | format:
86 |   html:
87 |     theme:
88 |       - cosmo
89 |       - r4ds.scss
90 |     code-link: true
91 |
92 |     number-sections: false
93 |     author-meta: "Hadley Wickham"
94 |     include-in-header: "plausible.html"
95 |     callout-appearance: simple
96 |
97 | editor: visual
98 |
99 |
--------------------------------------------------------------------------------
/_redirects:
--------------------------------------------------------------------------------
1 |
https://principles.tidyverse.org/* https://design.tidyverse.org/:splat 301!
2 |
3 | # Optional: Redirect default Netlify subdomain to primary domain
4 | https://gallant-minsky-15c9e2.netlify.com/* https://design.tidyverse.org/:splat 301!
--------------------------------------------------------------------------------
/argument-clutter.qmd:
--------------------------------------------------------------------------------
1 | # Reduce argument clutter with an options object {#sec-argument-clutter}
2 |
3 | ```{r}
4 | #| include = FALSE
5 | source("common.R")
6 | ```
7 |
8 | ## What's the problem?
9 |
10 | If you have a large number of optional arguments that control the fine details of the operation of a function, it might be worth lumping them all together into a separate "options" object created by a helper function.
11 |
12 | Having a large number of less important arguments makes it harder to see the most important ones.
13 | By moving rarely used and less important arguments to a secondary function, you can more easily draw attention to what is most important.
14 |
15 | ## What are some examples?
16 |
17 | - Many base R modelling functions like `loess()`, `glm()`, and `nls()` have a `control` argument that is paired with a function like `loess.control()`, `glm.control()`, and `nls.control()`.
18 |   These allow you to modify rarely used defaults, including the number of iterations, the stopping criteria, and some debugging options.
19 |
20 |   `optim()` uses a less formal version of this structure --- while it has a `control` argument, it doesn't have a matching `optim.control()` helper.
21 |   Instead, you supply a named list with components described in `?optim`.
22 |   A helper function is more convenient than a named list because it checks the argument names for free and gives nicer autocomplete to the user.
23 |
24 | - This pattern is common in other modelling packages, e.g.
`tune::fit_resamples()` + `tune::control_resamples()`, `tune::control_bayes()`, `tune::control_grid()`, and `caret::train()` + `caret::trainControl()`.
25 |
26 | - `readr::read_delim()` and friends take a `locale` argument which is paired with the `readr::locale()` helper.
27 |   This object bundles together a bunch of options related to parsing numbers, dates, and times that vary from country to country.
28 |
29 | - `readr::locale()` itself has a `date_names` argument that's paired with the `readr::date_names()` and `readr::date_names_lang()` helpers.
30 |   You typically use the argument by supplying a two-letter locale (which `date_names_lang()` uses to look up common languages), but if your language isn't supported you can use `readr::date_names()` to individually supply full and abbreviated month and day-of-week names.
31 |
32 | On the other hand, some functions with many arguments that would benefit from this technique include:
33 |
34 | - `readr::read_delim()` has a lot of options that control rarely needed details of file parsing (e.g. `escape_backslash`, `escape_double`, `quoted_na`, `comment`, `trim_ws`).
35 |   These make the function specification very long and might well be better in a details object.
36 |
37 | - `ggplot2::geom_smooth()` fits a smooth line to your data.
38 |   Most of the time you only want to pick the `method` and `formula` used, but `geom_smooth()` (via `ggplot2::stat_smooth()`) also provides `n`, `fullrange`, `span`, `level`, and `method.args` to control details of the fit.
39 |   I think these would be better in their own details object.
40 |
41 | ## How do I use this pattern?
42 |
43 | The simplest implementation is just to write a helper function that returns a list:
44 |
45 | ```{r}
46 | my_fun_opts <- function(opt1 = 1, opt2 = 2) {
47 |   list(
48 |     opt1 = opt1,
49 |     opt2 = opt2
50 |   )
51 | }
52 | ```
53 |
54 | This alone is nice because you can document the individual arguments, you get name checking for free, and auto-complete will remind the user what these less important options include.
55 |
56 | ### Better error messages
57 |
58 | An optional extra is to add a unique class to the list:
59 |
60 | ```{r}
61 | my_fun_opts <- function(opt1 = 1, opt2 = 2) {
62 |   structure(
63 |     list(
64 |       opt1 = opt1,
65 |       opt2 = opt2
66 |     ),
67 |     class = "mypackage_my_fun_opts"
68 |   )
69 | }
70 |
71 | ```
72 |
73 | This then allows you to create more informative error messages:
74 |
75 | ```{r}
76 | #| error: true
77 |
78 | my_fun <- function(..., opts = my_fun_opts()) {
79 |   if (!inherits(opts, "mypackage_my_fun_opts")) {
80 |     cli::cli_abort("{.arg opts} must be created by {.fun my_fun_opts}.")
81 |   }
82 | }
83 |
84 | my_fun(opts = 1)
85 | ```
86 |
87 | If you use this option in many places, you should consider pulling out the repeated code into a `check_my_fun_opts()` function.
88 |
89 | ## How do I remediate past mistakes?
90 |
91 | Typically you notice this problem only after you have created too many options, so you'll need to carefully remediate by introducing a new options argument and paired helper function.
92 | For example, if your existing function looks like this:
93 |
94 | ```{r}
95 | my_fun <- function(x, y, opt1 = 1, opt2 = 2) {
96 |
97 | }
98 | ```
99 |
100 | If you want to keep the existing function specification you could add a new `opts` argument that uses the values of `opt1` and `opt2`:
101 |
102 | ```{r}
103 | my_fun <- function(x, y, opts = NULL, opt1 = 1, opt2 = 2) {
104 |
105 |   opts <- opts %||% my_fun_opts(opt1 = opt1, opt2 = opt2)
106 | }
107 | ```
108 |
109 | However, that introduces a dependency between the arguments: if you specify both `opts` and `opt1`/`opt2`, `opts` will win.
110 | You could certainly add extra code to pick up on this problem and warn the user, but I think it's just cleaner to deprecate the old arguments so that you can eventually remove them:
111 |
112 | ```{r}
113 | my_fun <- function(x, y, opts = my_fun_opts(), opt1 = deprecated(), opt2 = deprecated()) {
114 |
115 |   if (lifecycle::is_present(opt1)) {
116 |     lifecycle::deprecate_warn("1.0.0", "my_fun(opt1)", "my_fun_opts(opt1)")
117 |     opts$opt1 <- opt1
118 |   }
119 |   if (lifecycle::is_present(opt2)) {
120 |     lifecycle::deprecate_warn("1.0.0", "my_fun(opt2)", "my_fun_opts(opt2)")
121 |     opts$opt2 <- opt2
122 |   }
123 | }
124 | ```
125 |
126 | Then you can remove the old arguments in a future release.
127 |
128 | ## See also
129 |
130 | - @sec-strategy-objects is a similar pattern for when you have multiple options functions that each encapsulate a different strategy.
131 |
--------------------------------------------------------------------------------
/boolean-strategies.qmd:
--------------------------------------------------------------------------------
1 | # Prefer an enum, even if only two choices {#sec-boolean-strategies}
2 |
3 | ```{r}
4 | #| include = FALSE
5 | source("common.R")
6 | ```
7 |
8 | ## What's the pattern?
9 |
10 | If your function implements two strategies, it's tempting to distinguish between them using an argument that takes either `TRUE` or `FALSE`.
11 | However, I recommend that you use an enumeration unless:
12 |
13 | - You're **really sure** there won't ever be another strategy. If you do discover a third (or fourth, or fifth, or ...) strategy, you'll need to change the interface of your function.
14 | - It's very clear what both `TRUE` and `FALSE` options mean just from the name of the argument. Generally the `TRUE` value tends to be easier to understand because `something = TRUE` tells you what will happen, but `something = FALSE` only tells you what won't happen.
15 |
16 | ## What are some examples?
17 |
18 | There are quite a few examples of the problem in the tidyverse, because this is a pattern that we only discovered relatively recently:
19 |
20 | - By default, `stringr::str_subset(string, pattern)` returns the elements of `string` that match the `pattern`. You can use `negate = TRUE` to instead return the elements that don't match the pattern, but I now wonder if it would be clearer as `return = c("matches", "non-matches")`.
21 | - `httr2::multi_req_perform()` allows you to perform a bunch of HTTP requests in parallel. It has an argument called `cancel_on_error` that can take `TRUE` or `FALSE`. It's fairly clear what `cancel_on_error = TRUE` means; but it's not so obvious what `cancel_on_error = FALSE` does. Additionally, it seems likely that I'll come up with other possible error handling strategies in the future, and even though I don't know what they are now, it would be better to plan for the future with an argument specification like `error = c("cancel", "continue")`.
22 |
23 | - `cut()` has an argument called `right` which is used to pick between right-closed left-open intervals (`TRUE`) and right-open left-closed intervals (`FALSE`).
24 |   I think it's hard to remember which is which and a clearer specification might be `open_side = c("right", "left")` or maybe `bounds = c("[)", "(]")`.
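To sketch what an enum version might look like, here's a hypothetical wrapper around `cut()` (the name `cut2` and the `bounds` argument are invented for illustration; base R's `match.arg()` stands in for a fancier checker like `rlang::arg_match()`):

```r
# Hypothetical enum-based interface: `bounds` replaces `right = TRUE/FALSE`
cut2 <- function(x, breaks, bounds = c("(]", "[)")) {
  bounds <- match.arg(bounds)
  # Translate the enum back to the logical flag cut() expects
  cut(x, breaks, right = (bounds == "(]"))
}

as.character(cut2(2, c(0, 2, 10)))                # "(0,2]"
as.character(cut2(2, c(0, 2, 10), bounds = "[)")) # "[2,10)"
```

A misspelled value like `bounds = "]["` now fails immediately with an informative error, and a third option could be added later without changing the interface.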
25 | Another interesting case in base R is `sort()`, which has two arguments that take a single logical value: `decreasing` and `na.last`:
26 |
27 | - The `decreasing` argument is used to pick between sorting in ascending or descending order.
28 |   It's easy to understand what `decreasing = TRUE` does, but slightly less clear what `decreasing = FALSE`, the default, means because it feels like a double negative:
29 |
30 |   ```{r}
31 |   #| results: false
32 |
33 |   x <- sample(10)
34 |   sort(x, decreasing = TRUE)
35 |   sort(x, decreasing = FALSE)
36 |   ```
37 |
38 |   Compare this with `vctrs::vec_sort()`, which uses an enum:
39 |
40 |   ```{r}
41 |   #| results: false
42 |   vctrs::vec_sort(x, direction = "desc")
43 |   vctrs::vec_sort(x, direction = "asc")
44 |   ```
45 |
46 |   I think this is a mild improvement because the two options are spelled out explicitly.
47 |
48 | - The `na.last` argument is used to control the location of missing values in the result.
49 |   It takes three possible values: `TRUE` (put `NA`s at the end), `FALSE` (put `NA`s at the beginning), or `NA` (drop `NA`s from the result).
50 |   This is an interesting way to support three strategies, but as we'll see later I think this would be clearer if the argument specification were `na = c("drop", "first", "last")`.
51 |
52 | ## How do you remediate past mistakes?
53 |
54 | There are two possible ways to switch to using a strategy instead of `TRUE`/`FALSE` depending on whether the old argument name makes sense with the new argument values.
55 | The sections below show what you'll need to do if you need a new argument (most cases) or if you're lucky enough to be able to reuse the existing argument.
56 |
57 | ### Create a new argument
58 |
59 | Imagine we wanted to remediate the `na.last` argument to `sort()`.
60 | Currently:
61 |
62 | - `na.last = TRUE` means put `NA`s last.
63 | - `na.last = FALSE` means put `NA`s first.
64 | - `na.last = NA` means drop them.
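The three behaviours are easy to demonstrate (plain base R, shown here to make the bullets above concrete):

```r
x <- c(3, 1, NA, 2)
sort(x, na.last = TRUE)  # 1 2 3 NA
sort(x, na.last = FALSE) # NA 1 2 3
sort(x, na.last = NA)    # 1 2 3 -- the default silently drops the NA
```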
65 |
66 | I think we could make this function clearer by changing the argument name to `na` and accepting one of three values: `last`, `first`, or `drop`.
67 |
68 | Changing an argument name is equivalent to removing the old name and adding the new name.
69 | This way of thinking about the change makes it easier to see how you do it in a backward compatible way: you need to deprecate the old argument in favour of the new one.
70 |
71 | ```{r}
72 | sort <- function(x,
73 |                  na.last = lifecycle::deprecated(),
74 |                  na = c("drop", "first", "last")) {
75 |   if (lifecycle::is_present(na.last)) {
76 |     lifecycle::deprecate_warn("1.0.0", "sort(na.last)", "sort(na)")
77 |
78 |     if (!is.logical(na.last) || length(na.last) != 1) {
79 |       cli::cli_abort("{.arg na.last} must be a single TRUE, FALSE, or NA.")
80 |     }
81 |
82 |     if (isTRUE(na.last)) {
83 |       na <- "last"
84 |     } else if (isFALSE(na.last)) {
85 |       na <- "first"
86 |     } else {
87 |       na <- "drop"
88 |     }
89 |   } else {
90 |     na <- rlang::arg_match(na)
91 |   }
92 |
93 |   ...
94 | }
95 | ```
96 |
97 | ::: callout-note
98 | Note that because `na` is a prefix of `na.last` and `sort()` puts `na.last` before `...`, not after it (see @sec-dots-after-required), this introduces a very subtle behaviour change.
99 | Previously, `sort(x, n = TRUE)` would have worked and been equivalent to `sort(x, na.last = TRUE)`.
100 | But it will now fail because `n` is a prefix of two arguments (`na` and `na.last`).
101 | This is unlikely to affect much code, but is worth being aware of.
102 |
103 | It would also be nice to make the default value `"last"` to match `order()`, especially since it's very unusual for a function to silently remove missing values.
104 | However, that's likely to affect a lot of existing code, making it unlikely to be worthwhile.
105 | :::
106 |
107 | ### Re-use an existing name
108 |
109 | Originally `haven::write_sav(compress)` could either be `TRUE` (compress the file) or `FALSE` (don't compress it).
110 | But then SPSS version 21.0 introduced a new way of compressing files, leading to three possible options: compress the new way (zsav), compress the old way (byte), or don't compress.
111 | In this case we got lucky because we can continue to use the same argument name: `compress = c("byte", "zsav", "none")`.
112 | We allowed existing code to keep working by special-casing the behaviour of `TRUE` and `FALSE`:
113 |
114 | ```{r}
115 | write_sav <- function(data, path, compress = c("byte", "zsav", "none"), adjust_tz = TRUE) {
116 |   if (isTRUE(compress)) {
117 |     compress <- "zsav"
118 |   } else if (isFALSE(compress)) {
119 |     compress <- "none"
120 |   } else {
121 |     compress <- arg_match(compress)
122 |   }
123 |
124 |   ...
125 | }
126 | ```
127 |
128 | You could choose to deprecate `TRUE` and `FALSE`, but here we chose to keep them since it's only a small amount of extra code in haven, and it means that existing users don't need to think about it.
129 | See `?haven::read_sav` for how we communicated the change in the docs.
130 |
131 | In a future version of haven we might change the order of the enum so that the `zsav` compression method becomes the default.
132 | This generally yields smaller files but can't be read by older versions of SPSS.
133 | Now that v21 is over 5 years old[^boolean-strategies-1], it's reasonable to make the smaller format the default.
134 |
135 | [^boolean-strategies-1]: Five years is the general threshold for support across the tidyverse.
136 |
--------------------------------------------------------------------------------
/call-data-details.qmd:
--------------------------------------------------------------------------------
1 | # Name all but the most important arguments {#sec-call-data-details}
2 |
3 | ```{r}
4 | #| include = FALSE
5 | source("common.R")
6 | ```
7 |
8 | ## What's the pattern?
9 |
10 | When calling a function, you should name all but the most important arguments.
11 | For example:
12 |
13 | ```{r}
14 | y <- c(1:10, NA)
15 | mean(y, na.rm = TRUE)
16 | ```
17 |
18 | Never use partial matching, like below.
19 | Partial matching was useful in the early days of R because when you were doing a quick and dirty interactive analysis you could save a little time by shortening argument names.
20 | However, today, most R editing environments support autocomplete so partial matching only saves you a single keystroke, and it makes code substantially harder to read.
21 |
22 | ```{r}
23 | mean(y, n = TRUE)
24 | ```
25 |
26 | Avoid relying on position matching with empty arguments:
27 |
28 | ```{r}
29 | mean(y, , TRUE)
30 | ```
31 |
32 | And don't name arguments that you can expect users to be familiar with:
33 |
34 | ```{r}
35 | mean(x = y)
36 | ```
37 |
38 | You can make R give you a warning when you use a partially matched argument by setting a special option.
39 | Call `usethis::use_partial_warnings()` to make this the default for all R sessions.
40 |
41 | ```{r}
42 | options(warnPartialMatchArgs = TRUE)
43 | mean(x = 1:10, n = FALSE)
44 | ```
45 |
46 | ## Why is this useful?
47 |
48 | I think it's reasonable to assume that if the reader knows what a function does, then they know what the one or two most important arguments are, and repeating their names just takes up space without aiding communication.
49 | For example, it's reasonable to assume that people can remember that the first argument to `log()` is `x` and the first two arguments to `dplyr::left_join()` are `x` and `y`.
50 |
51 | However, I don't think that most people will remember more than the one or two most important arguments, so you should name the rest.
52 | For example, I don't think that most people know that the second argument to `mean()` is `trim` or that the second argument to `median()` is `na.rm` even though I expect most people to know what the first arguments are.
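For example, both of these calls return the same value, but only the second communicates its intent (a small illustration reusing the `y` vector from above):

```r
y <- c(1:10, NA)
median(y, TRUE)         # what does TRUE control here?
median(y, na.rm = TRUE) # clearly removes missing values first
```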
53 | Spelling out the names makes it easier to understand when others (including future you) are reading the code.
54 |
55 | ## What are the exceptions?
56 |
57 | There are two main exceptions to this principle: when teaching functions and when one argument is particularly long.
58 |
59 | When teaching a function for the first time, you can't expect people to know what the arguments are, so it makes sense to supply all names to help people understand exactly what's going on.
60 | For example, in [R for Data Science](https://r4ds.had.co.nz/data-visualisation.html) when we introduce ggplot2 we write code like:
61 |
62 | ```{r}
63 | #| eval = FALSE
64 | ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
65 |   geom_point()
66 | ```
67 |
68 | At the end of the chapter, we assume that the reader is familiar with the basic structure and so the rest of the book uses the style recommended here:
69 |
70 | ```{r}
71 | #| eval = FALSE
72 | ggplot(mpg, aes(displ, hwy)) +
73 |   geom_point()
74 | ```
75 |
76 | There is also the occasional case where the first argument is quite long and there are a couple of short options that you also want to set.
77 | If the long argument comes first, you may have to re-interpret what the function is doing when you finally hit the options.
78 | I think this comes up most often when an argument usually receives code inside of `{}`, but it can crop up when manually generating data too.
79 |
80 | ```{r}
81 | #| eval = FALSE
82 |
83 | writeLines(con = "test.txt", c(
84 |   "line1",
85 |   "line2",
86 |   "line3"
87 | ))
88 |
89 | expect_snapshot(error = TRUE, {
90 |   line1
91 |   line2
92 |   line3
93 | })
94 | ```
95 |
--------------------------------------------------------------------------------
/changes-multivers.qmd:
--------------------------------------------------------------------------------
1 | # Work with multiple dependency versions {#sec-changes-multivers}
2 |
3 | ```{r}
4 | #| include = FALSE
5 | source("common.R")
6 | ```
7 |
8 | ## What's the pattern?
9 |
10 | In an ideal world, when a dependency of your package changes its interface, you want your package to work with both versions.
11 | This is more work but it has two significant advantages:
12 |
13 | - The CRAN submission process is decoupled.
14 |   If your package only works with the development version of a dependency, you'll need to carefully coordinate your CRAN submission with the dependency's CRAN submission.
15 |   If your package works with both versions, you can submit first, making life easier for CRAN and for the maintainer of the dependency.
16 |
17 | - User code is less likely to be affected.
18 |   If your package only works with the latest version of the dependency, then when a user upgrades your package, the dependency must also be updated.
19 |   Upgrading multiple packages is more likely to affect user code than updating a single package.
20 |
21 | In this pattern, you'll learn how to write code designed to work with multiple versions of a dependency, and you'll see how to adapt your existing Travis configuration to test that you've got it right.
22 |
23 | ## Writing code
24 |
25 | Sometimes there will be an easy way to change the code to work with both old and new versions of the package; do this if you can!
26 | However, in most cases, you can't, and you'll need an `if` statement that runs different code for new and old versions of the package:
27 |
28 | ```{r}
29 | #| eval = FALSE
30 | if (dependency_has_new_interface()) {
31 |   # freshly written code that works with the in-development dependency
32 | } else {
33 |   # existing code that works with the currently released dependency
34 | }
35 | ```
36 |
37 | (If your freshly written code uses functions that don't exist in the CRAN version, this will generate an R CMD check `NOTE` when you submit it to CRAN. This is one of the few NOTEs that you can explain: just mention that it's needed for forward/backward compatibility in your submission notes.)
38 |
39 | We recommend always pulling the check out into a function so that the logic lives in one place.
40 | This will make it much easier to remove when it's no longer needed, and provides a good place to document why it's needed.
41 |
42 | There are three basic approaches to implement `dependency_has_new_interface()`:
43 |
44 | - Check the version of the package.
45 |   This is recommended in most cases, but requires that the dependency author use a specific version convention.
46 |
47 | - Check for existence of a function.
48 |
49 | - Check for a specific argument value, or otherwise detect that the interface has changed.
50 |
51 | ### Case study: tidyr
52 |
53 | To make the problem concrete so we can show off some real code, let's imagine we have a package that uses `tidyr::nest()`.
54 | `tidyr::nest()` changed substantially between 0.8.3 and 1.0.0, and so we need to write code like this:
55 |
56 | ```{r}
57 | #| eval = FALSE
58 | if (tidyr_new_interface()) {
59 |   out <- tidyr::nest_legacy(df, x, y, z)
60 | } else {
61 |   out <- tidyr::nest(df, c(x, y, z))
62 | }
63 | ```
64 |
65 | (As described above, when submitted to CRAN this will generate a note about the missing `tidyr::nest_legacy()`, which can be explained in the submission comments.)
66 |
67 | To implement `tidyr_new_interface()`, we need to think about three versions of tidyr:
68 |
69 | - 0.8.3: the version currently on CRAN with the old interface.
70 |
71 | - 0.8.99.9000: the development version with the new interface.
72 |   As usual, the fourth component is \>= 9000 to indicate that it's a development version.
73 |   Note, however, that the patch version is 99; this indicates that the release includes breaking changes.
74 |
75 | - 1.0.0: the future CRAN version with the new interface; this is the version that will be submitted to CRAN.
76 |
77 | The main question is how to write `tidyr_new_interface()`.
78 | There are three options:
79 |
80 | - Check that the version is greater than the development version:
81 |
82 |   ```{r}
83 |   tidyr_new_interface <- function() {
84 |     packageVersion("tidyr") > "0.8.99"
85 |   }
86 |   ```
87 |
88 |   This technique works because tidyr uses the convention that development versions preceding a backward incompatible release contain `99` in the third (patch) component.
89 |
90 | - If tidyr hadn't adopted this naming convention, we could test for the existence of `unnest_legacy()`:
91 |
92 |   ```{r}
93 |   tidyr_new_interface1 <- function() {
94 |     exists("unnest_legacy", asNamespace("tidyr"))
95 |   }
96 |   ```
97 |
98 | - If the interface change was more subtle, you might have to think more creatively.
99 |   If the package uses the [lifecycle](http://lifecycle.r-lib.org/) system, one approach would be to test for the presence of `deprecated()` in the function arguments:
100 |
101 |   ```{r}
102 |   tidyr_new_interface2 <- function() {
103 |     identical(formals(tidyr::unnest)$.drop, quote(deprecated()))
104 |   }
105 |   ```
106 |
107 | All these approaches are reasonably fast, so it's unlikely they'll have any impact on performance unless called in a very tight loop.
108 |
109 | ```{r}
110 | bench::mark(
111 |   version = tidyr_new_interface(),
112 |   exists = tidyr_new_interface1(),
113 |   formals = tidyr_new_interface2()
114 | )[1:5]
115 | ```
116 |
117 | If you do need to use `packageVersion()` inside a performance-sensitive function, I recommend caching the result in `.onLoad()` (which, by convention, lives in `zzz.R`).
118 | There are a few ways to do this, but the following block shows one approach that matches the function interface I used above:
119 |
120 | ```{r}
121 | tidyr_new_interface <- function() FALSE
122 | .onLoad <- function(...) {
123 |   if (utils::packageVersion("tidyr") > "0.8.99") {
124 |     tidyr_new_interface <<- function() TRUE
125 |   }
126 | }
127 | ```
128 |
129 | ## Testing with multiple package versions
130 |
131 | It's good practice to test both old and new versions of the code, but this is challenging because you can't run both sets of tests in the same R session.
132 | The easiest way to make sure that both versions work and stay working is to use Travis.
133 |
134 | Before the dependency is released, you can manually install the development version using `remotes::install_github()`:
135 |
136 | ``` yaml
137 | matrix:
138 |   include:
139 |     - r: release
140 |       name: tidyr-devel
141 |       before_script: Rscript -e "remotes::install_github('tidyverse/tidyr')"
142 | ```
143 |
144 | It's not generally that important to check that your code continues to work with an older version of the package, but if you want to you can use `remotes::install_version()`:
145 |
146 | ``` yaml
147 | matrix:
148 |   include:
149 |     - r: release
150 |       name: tidyr-0.8
151 |       before_script: Rscript -e "remotes::install_version('tidyr', '0.8.3')"
152 | ```
153 |
154 | ## Using only the new version
155 |
156 | At some point in the future, you'll decide that the old version of the package is no longer widely used and you want to simplify your package by only depending on the new version.
157 | There are three steps: 158 | 159 | - In the DESCRIPTION, bump the required version of the dependency. 160 | 161 | - Search for `tidyr_new_interface()`; remove the function definition and all uses (retaining the code used with the new version). 162 | 163 | - Remove the additional build in `.travis.yml`. 164 | -------------------------------------------------------------------------------- /common.R: -------------------------------------------------------------------------------- 1 | knitr::opts_chunk$set( 2 | comment = "#>", 3 | collapse = TRUE 4 | ) 5 | 6 | options( 7 | rlang_trace_top_env = rlang::current_env(), 8 | rlang_backtrace_on_error = "none" 9 | ) 10 | 11 | rename <- function(old, new) { 12 | old_path <- fs::path_ext_set(old, "qmd") 13 | new_path <- fs::path_ext_set(new, "qmd") 14 | 15 | if (file.exists(old_path)) 16 | fs::file_move(old_path, new_path) 17 | quarto <- readLines("_quarto.yml") 18 | quarto <- gsub(old_path, new_path, quarto, fixed = TRUE) 19 | writeLines(quarto, "_quarto.yml") 20 | 21 | old_slug <- paste0("sec-", old) 22 | new_slug <- paste0("sec-", new) 23 | 24 | qmd_paths <- dir(pattern = ".qmd$") 25 | qmds <- lapply(qmd_paths, readLines) 26 | qmds <- lapply(qmds, \(lines) gsub(old_slug, new_slug, lines, fixed = TRUE)) 27 | purrr::map2(qmds, qmd_paths, \(lines, path) writeLines(lines, path)) 28 | 29 | invisible() 30 | } -------------------------------------------------------------------------------- /consistent-argument-names.qmd: -------------------------------------------------------------------------------- 1 | # Use consistent argument names {#sec-consistent-argument-names} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the problem? 9 | 10 | Strive to keep your argument names consistent. 11 | Consistency is particularly important within your package, but where possible you should also use argument names that the user is likely to have encountered elsewhere.
12 | 13 | If you see a familiar argument name, you can draw on what you already know about it rather than having to learn something new. 14 | 15 | ## What are some examples? 16 | 17 | - Many base R functions use `x` to indicate the primary vector input and `data` to indicate a primary data frame input. 18 | Most tidyverse functions also adopt this convention. 19 | 20 | - I think `install.packages()` wins the prize for most varying argument styles. 21 | It includes arguments `Ncpus`, `configure.vars`, `keep_outputs`, `INSTALL_opts`, `contriburl`, and (via `...` to `download.file()`) `cacheOK`. 22 | You can avoid this problem with your team by developing a style guide (e.g. ) that everyone agrees to stick to. 23 | 24 | - Base R and stringr functions all use `pattern` to refer to the regular expression pattern. 25 | 26 | ## How do I apply this pattern? 27 | 28 | The biggest challenge when applying this pattern is picking what family of functions you want to be consistent with. 29 | Base R is not 100% consistent and the package ecosystem introduces even more variability. 30 | Since you can't be consistent with everything, you'll need to pick the most important or closely related packages and functions to be consistent with. 31 | 32 | There's an additional challenge if you want to be consistent with an argument name that doesn't match your style guide. 33 | A big challenge for the tidyverse is `na.rm`: is it more important to be consistent with base R and use `na.rm`, or is it more important to be consistent with our snake case naming conventions and call it `na_rm`? 34 | Different tidyverse packages have adopted different conventions (e.g. ggplot2 uses `na.rm` and dplyr mostly uses `na_rm`). 35 | This is meta-inconsistency! 36 | 37 | A similar challenge arises for words that vary between UK and US English. 38 | Should you be consistent with the dialect that you use in the documentation, or with base R, which mostly uses US English?
39 | ggplot2 has the particularly egregious `scale_color_grey()` which manages to combine the US spelling of color with the UK spelling of grey. 40 | This is one of my greatest regrets with ggplot2 and I'd highly recommend avoiding this problem by not using argument names that vary between UK and US English! 41 | -------------------------------------------------------------------------------- /cs-mapply-pmap.qmd: -------------------------------------------------------------------------------- 1 | # Case study: `mapply()` vs `pmap()` {#sec-cs-mapply-pmap} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ```{r} 9 | library(purrr) 10 | ``` 11 | 12 | It's useful to compare `mapply()` to `purrr::pmap()`. 13 | They are both attempts to solve the same problem: extending `lapply()`/`map()` to handle iterating over any number of arguments. 14 | 15 | ```{r} 16 | args(mapply) 17 | args(pmap) 18 | ``` 19 | 20 | ```{r} 21 | x <- c("apple", "banana", "cherry") 22 | pattern <- c("p", "n", "h") 23 | replacement <- c("x", "f", "q") 24 | 25 | mapply(gsub, pattern, replacement, x) 26 | 27 | purrr::pmap_chr(list(pattern, replacement, x), gsub) 28 | 29 | ``` 30 | 31 | Here we'll ignore `simplify = TRUE`, which makes `mapply()` type-unstable by default. 32 | I'll also ignore `USE.NAMES = TRUE`, which isn't just about using names, but about using character vector input as names for output. 33 | I think it's reused from `sapply()` without too much thought, as it's only the names of the first argument that matter. 34 | 35 | ```{r} 36 | mapply(toupper, letters[1:3]) 37 | mapply(toupper, letters[1:3], USE.NAMES = FALSE) 38 | mapply(toupper, setNames(letters[1:3], c("X", "Y", "Z"))) 39 | 40 | pmap_chr(list(letters[1:3]), toupper) 41 | ``` 42 | 43 | `mapply()` takes the function to apply as the first argument, followed by an arbitrary number of arguments to pass to the function.
44 | This makes it different to the other `apply()` functions (including `lapply()`, `sapply()` and `tapply()`), which take the data as the first argument. 45 | `mapply()` could take `...` as the first argument, but that would force `FUN` to always be named, which would also make it inconsistent with the other `apply()` functions. 46 | 47 | `pmap()` avoids this problem by taking a list of vectors, rather than individual vectors in `...`. 48 | This allows `pmap()` to use `...` for another purpose: instead of a `MoreArgs` argument (a list), `pmap()` passes `...` on to `.f`. 49 | 50 | ```{r} 51 | mapply(gsub, pattern, replacement, x, fixed = TRUE) 52 | purrr::pmap_chr(list(pattern, replacement, x), gsub, fixed = TRUE) 53 | ``` 54 | 55 | There's a subtle difference here that doesn't matter in most cases: in `mapply()`, `fixed` is recycled to the same length as `pattern`, whereas it is not in `pmap()`. 56 | TODO: figure out example where that's more clear. 57 | 58 | (Also note that `pmap()` uses the `.` prefix to avoid the problem described in @sec-dots-prefix.) 59 | -------------------------------------------------------------------------------- /cs-rep.qmd: -------------------------------------------------------------------------------- 1 | # Case study: `rep()` {#sec-cs-rep} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What does `rep()` do? 9 | 10 | `rep()` is an extremely useful base R function that repeats a vector `x` in various ways. 11 | It takes a vector of data in `x` and has arguments (`times`, `each`, and `length.out`[^cs-rep-1]) that control how `x` is repeated. 12 | Let's start by exploring the basics: 13 | 14 | [^cs-rep-1]: Note that the function specification is `rep(x, ...)`, and `times`, `each`, and `length.out` do not appear explicitly. 15 | You have to read the documentation to discover these arguments.
16 | 17 | ```{r} 18 | x <- c(1, 2, 4) 19 | 20 | rep(x, times = 3) 21 | rep(x, length.out = 10) 22 | ``` 23 | 24 | `times` and `length.out` replicate the vector in the same way, but `length.out` allows you to specify a non-integer number of replications. 25 | 26 | The `each` argument repeats individual components of the vector rather than the whole vector: 27 | 28 | ```{r} 29 | rep(x, each = 3) 30 | ``` 31 | 32 | And you can combine that with `times`: 33 | 34 | ```{r} 35 | rep(x, each = 3, times = 2) 36 | ``` 37 | 38 | If you supply a vector to `times` it works in a similar way to `each`, repeating each component the specified number of times: 39 | 40 | ```{r} 41 | rep(x, times = x) 42 | ``` 43 | 44 | ## What makes this function hard to understand? 45 | 46 | - `times` and `length.out` both control the same underlying variable in different ways, and if you set them both then `length.out` silently wins: 47 | 48 | ```{r} 49 | rep(1:3, times = 2, length.out = 3) 50 | ``` 51 | 52 | - `times` and `each` are usually independent: 53 | 54 | ```{r} 55 | rep(1:3, times = 2, each = 2) 56 | ``` 57 | 58 | But if you specify a vector for `times` you can't use `each`: 59 | 60 | ```{r} 61 | #| error = TRUE 62 | rep(1:3, times = c(2, 2, 2), each = 2) 63 | ``` 64 | 65 | - I think using `times` with a vector is confusing because it switches from replicating the whole vector to replicating individual values, like `each` usually does. 66 | 67 | ```{r} 68 | rep(1:3, each = 2) 69 | rep(1:3, times = 2) 70 | rep(1:3, times = c(2, 2, 2)) 71 | ``` 72 | 73 | ## How might we improve the situation? 74 | 75 | I think these problems have the same underlying cause: `rep()` is trying to do too much in a single function. 76 | `rep()` is really two functions in a trench coat (@sec-strategy-functions) and it would be better served by a pair of functions, one which replicates element-by-element, and one which replicates the whole vector.
77 | 78 | The following sections consider how we might do so, starting with what we should call the functions, then what arguments they'll need, then what an implementation might look like, before finally considering the downsides of this approach. 79 | 80 | ### Function names 81 | 82 | To create the new functions, we need to first come up with names: I like `rep_each()` and `rep_full()`. 83 | `rep_each()` was a fairly easy name to come up with because it repeats each element. 84 | `rep_full()` was a little harder and took a few iterations: I like that `full` has the same number of letters as `each`, which makes the two functions look like they belong together. 85 | 86 | Some other possibilities I considered: 87 | 88 | - `rep_each()` + `rep_every()`: each and every form a natural pair, but to me at least, repeating "every" element doesn't feel very different to repeating each element. 89 | - `rep_element()` and `rep_whole()`: I like how these capture the differences precisely, but they are maybe too long for such commonly used functions. 90 | 91 | ### Arguments 92 | 93 | Next, we need to think about their arguments. 94 | They both will start with `x`, the vector to repeat. 95 | Then their arguments differ: 96 | 97 | - `rep_each()` needs an argument that specifies the number of times to repeat each element, which can either be a single number, or a vector the same length as `x`. 98 | - `rep_full()` has two mutually exclusive arguments (@sec-implicit-strategies), either the number of times to repeat the whole vector or the desired length of the output. 99 | 100 | What should we call the arguments? 101 | We've already captured the different replication strategies in the function name, so I think the argument that specifies the number of times to replicate can be the same for both functions, and `times` seems reasonable. 102 | 103 | What about the second argument to `rep_full()`, which specifies the desired length of the output vector?
104 | I draw inspiration from `rep()`, which uses `length.out`. 105 | But I think it's obvious that the argument controls the output length, so `length` is adequate. 106 | 107 | ### Implementation 108 | 109 | We can combine these specifications with a simple implementation that uses the existing `rep()` function.[^cs-rep-2] 110 | 111 | [^cs-rep-2]: In real code I'd want to turn these into explicit unit tests so we can run them repeatedly as we make changes. 112 | 113 | ```{r} 114 | rep_full <- function(x, times, length) { 115 | rlang::check_exclusive(times, length) 116 | 117 | if (!missing(length)) { 118 | rep(x, length.out = length) 119 | } else { 120 | rep(x, times = times) 121 | } 122 | } 123 | 124 | rep_each <- function(x, times) { 125 | if (length(times) == 1) { 126 | rep(x, each = times) 127 | } else if (length(times) == length(x)) { 128 | rep(x, times = times) 129 | } else { 130 | stop("`times` must be length 1 or the same length as `x`") 131 | } 132 | } 133 | ``` 134 | 135 | We can quickly check that the functions behave as we expect: 136 | 137 | ```{r} 138 | x <- c(1, 2, 4) 139 | 140 | # First the common times argument 141 | rep_each(x, times = 3) 142 | rep_full(x, times = 3) 143 | 144 | # Then a vector times argument to rep_each: 145 | rep_each(x, times = x) 146 | 147 | # Then the length argument to rep_full 148 | rep_full(x, length = 5) 149 | ``` 150 | 151 | ### Downsides 152 | 153 | One downside of this approach is that if you want to replicate both each component *and* the entire vector, you have to use two function calls, which you might expect to be more verbose.
154 | However, I don't think this is a terribly common use case, and if we follow the usual convention of leaving required arguments unnamed, the new call is the same length: 155 | 156 | ```{r} 157 | rep(x, each = 2, times = 3) 158 | rep_full(rep_each(x, 2), 3) 159 | ``` 160 | 161 | And it's only slightly longer if you use the pipe, which is maybe slightly more readable: 162 | 163 | ```{r} 164 | x |> rep_each(2) |> rep_full(3) 165 | ``` 166 | 167 | ::: callout-caution 168 | Note that this implementation lacks any input checking, so invalid inputs might work, warn, or throw an unhelpful error. 169 | For example, since we're not checking that the `times` and `length` arguments to `rep_full()` are single integers, the following calls give suboptimal results: 170 | 171 | ```{r} 172 | #| error: true 173 | rep_full(1:3, 1:3) 174 | rep_full(1:3, "x") 175 | ``` 176 | 177 | We'll come back to input checking later in the book. 178 | ::: 179 | -------------------------------------------------------------------------------- /cs-rvest.qmd: -------------------------------------------------------------------------------- 1 | # Case study: `html_element()` {#sec-cs-rvest} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What does the function do? 9 | 10 | `rvest::html_element()` is used to extract matching HTML elements/nodes from a web page. 11 | You can select nodes using one of two languages: CSS selectors or XPath expressions. 12 | These are both mini-languages for describing how to find the node you want. 13 | You can think of them like regular expressions, but instead of being designed to find patterns in strings, they are designed to find patterns in trees (since HTML nodes form a tree). 14 | 15 | This is an interesting case because CSS selectors are much, much simpler and likely to be used the majority of the time. 16 | XPath is a much richer and more powerful language, but most of the time that complexity is not required and just adds unneeded overhead.
17 | (One interesting wrinkle is that CSS selectors are actually translated to XPath under the hood, using the selectr package by Simon Potter.) 18 | 19 | `html_element()` implements these two strategies using mutually exclusive `css` and `xpath` arguments. 20 | 21 | Other approaches: 22 | 23 | - `html_element(x, selector, type = c("css", "xpath"))` 24 | - `html_element(x, css(pattern))` vs `html_element(x, xpath(pattern))` 25 | - `html_element_css(x, pattern)` vs `html_element_xpath(x, pattern)` 26 | 27 | | Common case | Rare case | 28 | |----------------------------|--------------------------------------------| 29 | | `x |> html_element("sel")` | `x |> html_element("sel", type = "xpath")` | 30 | | `x |> html_element("sel")` | `x |> html_element(xpath = "sel")` | 31 | | `x |> html_element("sel")` | `x |> html_element(xpath("sel"))` | 32 | | `x |> html_element("sel")` | `x |> html_element_xpath("sel")` | 33 | -------------------------------------------------------------------------------- /cs-setNames.qmd: -------------------------------------------------------------------------------- 1 | # Case study: `setNames()` 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What does `setNames()` do? 9 | 10 | `stats::setNames()` is a shorthand that allows you to set vector names inline (it's a little surprising that it lives in the stats package). 11 | It has a simple definition: 12 | 13 | ```{r} 14 | setNames <- function(object = nm, nm) { 15 | names(object) <- nm 16 | object 17 | } 18 | ``` 19 | 20 | And is easy to use: 21 | 22 | ```{r} 23 | # Instead of 24 | x <- 1:3 25 | names(x) <- c("a", "b", "c") 26 | 27 | # Can write 28 | x <- setNames(1:3, c("a", "b", "c")) 29 | x 30 | ``` 31 | 32 | This function is short (just two lines of code!) but yields a surprisingly rich analysis. 33 | 34 | ## How can we improve the names? 35 | 36 | Firstly, I prefer snake_case to camelCase, so I'd call the function `set_names()`.
37 | Then we need to consider the arguments: 38 | 39 | - I think the first argument, `object`, would be better called `x` in order to emphasise that this function only works with vectors (because only vectors have names). 40 | 41 | - The second argument, `nm` is rather terse, and I don't see any disadvantage in calling it `names`. 42 | I think you could also argue that it should be called `y` since its meaning should be obvious from the function name. 43 | 44 | This yields: 45 | 46 | ```{r} 47 | set_names <- function(x = names, names) { 48 | names(x) <- names 49 | x 50 | } 51 | ``` 52 | 53 | ## What about the default values? 54 | 55 | The default values of `setNames()` are a little hard to understand, because the default value of the first argument is the second argument. 56 | It was defined this way to make it possible to name a character vector with itself: 57 | 58 | ```{r} 59 | setNames(nm = c("apple", "banana", "cake")) 60 | ``` 61 | 62 | But that decision leads to a function signature that violates one of the principles of @sec-important-args-first: a required argument comes after an optional argument. 63 | Fortunately, we can fix this easily and still preserve the useful ability to name a vector with itself: 64 | 65 | ```{r} 66 | set_names <- function(x, names = x) { 67 | names(x) <- names 68 | x 69 | } 70 | 71 | set_names(c("apple", "banana", "cake")) 72 | ``` 73 | 74 | This helps to emphasise that `x` is the primary argument. 75 | 76 | ## What about bad inputs? 77 | 78 | Now that we've considered how the function works with correct inputs, it's time to consider how it should work with malformed inputs. 
79 | The current function checks neither the length nor the type: 80 | 81 | ```{r} 82 | set_names(1:3, "a") 83 | 84 | set_names(1:3, list(letters[1:3], letters[4], letters[5:6])) 85 | ``` 86 | 87 | We can resolve this by asserting that the names should always be a character vector, and should have the same length as `x`: 88 | 89 | ```{r} 90 | #| error = TRUE 91 | set_names <- function(x, names = x) { 92 | if (!is.character(names) || length(names) != length(x)) { 93 | stop("`names` must be a character vector the same length as `x`.", call. = FALSE) 94 | } 95 | 96 | names(x) <- names 97 | x 98 | } 99 | 100 | set_names(1:3, "a") 101 | set_names(1:3, list(letters[1:3], letters[4], letters[5:6])) 102 | ``` 103 | 104 | You could also frame this test using vctrs assertions: 105 | 106 | ```{r} 107 | library(vctrs) 108 | 109 | set_names <- function(x, names = x) { 110 | vec_assert(x) 111 | vec_assert(names, ptype = character(), size = length(x)) 112 | 113 | names(x) <- names 114 | x 115 | } 116 | ``` 117 | 118 | Note that I slipped in an assertion that `x` should be a vector. 119 | This slightly improves the error message if you accidentally supply the wrong sort of input to `set_names()`: 120 | 121 | ```{r} 122 | #| error = TRUE 123 | setNames(mean, 1:3) 124 | set_names(mean, 1:3) 125 | ``` 126 | 127 | Note that we're simply checking the length of `names` here, rather than recycling it, i.e. the invariant is that `vec_size(set_names(x, y))` is `vec_size(x)`, not `vec_size_common(x, y)`. 128 | I think this is the correct behaviour because you usually add names to a vector to create a lookup table, and a lookup table is not useful if there are duplicated names. 129 | This makes `set_names()` less general in return for better error messages when you do something suspicious (and you can always use an explicit `rep_along()` if you do want this behaviour). 130 | 131 | ## How could we extend this function?
132 | 133 | Now that we've modified the function so it doesn't violate the principles in this book, we can think about how we might extend it. 134 | Currently the function is only useful for setting names to a constant. 135 | Maybe we could extend it to also make it easier to change existing names? 136 | One way to do that would be to allow `names` to be a function: 137 | 138 | ```{r} 139 | set_names <- function(x, names = x) { 140 | vec_assert(x) 141 | 142 | if (is.function(names)) { 143 | names <- names(base::names(x)) # apply the function to the existing names 144 | } 145 | vec_assert(names, ptype = character(), size = length(x)) 146 | 147 | names(x) <- names 148 | x 149 | } 150 | 151 | x <- c(a = 1, b = 2, c = 3) 152 | set_names(x, toupper) 153 | ``` 154 | 155 | We could also support the anonymous-function formula shortcut used in many places in the tidyverse: 156 | 157 | ```{r} 158 | set_names <- function(x, names = x) { 159 | vec_assert(x) 160 | 161 | if (is.function(names) || rlang::is_formula(names)) { 162 | fun <- rlang::as_function(names) 163 | names <- fun(base::names(x)) 164 | } 165 | vec_assert(names, ptype = character(), size = length(x)) 166 | 167 | names(x) <- names 168 | x 169 | } 170 | 171 | x <- c(a = 1, b = 2, c = 3) 172 | set_names(x, ~ paste0("x-", .)) 173 | 174 | ``` 175 | 176 | Now `set_names()` supports overriding and modifying names. 177 | What about removing them?
178 | It turns out that `setNames()` supported this, but our stricter checks prohibit it: 179 | 180 | ```{r} 181 | #| error = TRUE 182 | x <- c(a = 1, b = 2, c = 3) 183 | setNames(x, NULL) 184 | set_names(x, NULL) 185 | ``` 186 | 187 | We can fix this with another clause: 188 | 189 | ```{r} 190 | set_names <- function(x, names = x) { 191 | vec_assert(x) 192 | 193 | if (!is.null(names)) { 194 | if (is.function(names) || rlang::is_formula(names)) { 195 | fun <- rlang::as_function(names) 196 | names <- fun(base::names(x)) 197 | } 198 | 199 | } 200 | 201 | names(x) <- names 202 | x 203 | } 204 | 205 | x <- c(a = 1, b = 2, c = 3) 206 | set_names(x, NULL) 207 | ``` 208 | 209 | However, I think this has muddied the logic. 210 | To resolve it, I think we should pull out the checking code into a separate function. 211 | After trying out a [few approaches](https://github.com/tidyverse/design/issues/79), I ended up with: 212 | 213 | ```{r} 214 | check_names <- function(names, x) { 215 | if (is.null(names)) { 216 | names 217 | } else if (vec_is(names)) { 218 | vec_assert(names, ptype = character(), size = length(x)) 219 | } else if (is.function(names)) { 220 | check_names(names(base::names(x)), x) 221 | } else if (rlang::is_formula(names)) { 222 | check_names(rlang::as_function(names), x) 223 | } else { 224 | rlang::abort("`names` must be NULL, a function or formula, or a vector") 225 | } 226 | } 227 | ``` 228 | 229 | This then replaces `vec_assert()` in `set_names()`. 230 | I separate the input checking and implementation with a blank line to help visually group the parts of the function.
231 | 232 | ```{r} 233 | set_names <- function(x, names = x) { 234 | vec_assert(x) 235 | names <- check_names(names, x) 236 | 237 | names(x) <- names 238 | x 239 | } 240 | ``` 241 | 242 | We *could* simplify the function even further, but I think this is a bad idea because it mingles input validation with implementation: 243 | 244 | ```{r} 245 | # Don't do this 246 | set_names <- function(x, names = x) { 247 | vec_assert(x) 248 | names(x) <- check_names(names, x) 249 | x 250 | } 251 | 252 | # Or even 253 | set_names <- function(x, names = x) { 254 | `names<-`(vec_assert(x), check_names(names, x)) 255 | } 256 | ``` 257 | 258 | ## Compared to `rlang::set_names()` 259 | 260 | If you're familiar with rlang, you might notice that we've ended up with something rather similar to `rlang::set_names()`. 261 | However, the careful analysis in this chapter has led to a few differences. 262 | `rlang::set_names()`: 263 | 264 | - Calls the second argument `nm`, instead of something more descriptive. 265 | I think this is simply because we never sat down and fully considered the interface. 266 | 267 | - Coerces `nm` to a character vector. 268 | This allows `rlang::set_names(1:4)` to automatically name the vector, but this seems a relatively weak new feature in return for the cost of not throwing an error message if you provide an unusual vector type. 269 | (Both lists and data frames have `as.character()` methods so this will work for basically any type of vector, even if completely inappropriate.) 270 | 271 | - Passes `...` on to the function `nm`. 272 | I now think that decision was a mistake: it substantially complicates the interface in return for a relatively small gain.
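The coercion difference is easy to see concretely. Here's a small sketch (assuming rlang is installed) of the behaviour described above:

```{r}
# rlang::set_names() coerces `nm` to character, so an integer vector is
# silently accepted and used to name itself:
rlang::set_names(1:4)

# The stricter set_names() developed above would instead throw an error,
# because `names` must be a character vector the same length as `x`.
```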
273 | -------------------------------------------------------------------------------- /cs-stringr.qmd: -------------------------------------------------------------------------------- 1 | # Case study: stringr {#sec-cs-stringr} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ```{=html} 9 | 11 | ``` 12 | This chapter explores some of the considerations in designing the stringr package. 13 | 14 | ```{r} 15 | library(stringr) 16 | ``` 17 | 18 | ## Function names 19 | 20 | When the base regular expression functions were written, most R users were familiar with the command line and tools like grep. 21 | This made naming R's string manipulation functions after these tools seem natural. 22 | When I started work on stringr, the majority of R users were not familiar with Linux or the command line, so it made more sense to start afresh. 23 | 24 | I think there were successes and failures here. 25 | On the whole, I think `str_replace_all()`, `str_locate()`, and `str_detect()` are easier to remember than `gsub()`, `regexpr()`, and `grepl()`. 26 | However, it's harder to remember what makes `str_subset()` and `str_which()` different. 27 | If I were to do stringr again, I would make more of an effort to distinguish between functions that operate on individual matches and functions that operate on whole strings. For example, `str_locate()` and `str_which()` sound like they should be closely related, but `str_locate()` returns the location of the match within each string, while `str_which()` returns the positions of the matching strings within a vector. 28 | 29 | ## Argument order and names 30 | 31 | Base R string functions mostly have `pattern` as the first argument, with the chief exception being `strsplit()`. 32 | stringr functions always have `string` as the first argument. 33 | 34 | I regret using `string`; I now think `x` would be a more appropriate name. 35 | 36 | ## `str_flatten()` 37 | 38 | `str_flatten()` was a relatively recent addition to stringr.
39 | It took me a long time to realise that one of the challenges of understanding `paste()` was that depending on the presence or absence of the `collapse` argument it could either transform a vector of strings (i.e. return something the same length) or summarise it (i.e. always return a single string). 40 | 41 | Once `str_flatten()` existed it became clear that it would be useful to have `str_flatten_comma()`, which made it easier to use the Oxford comma (which seems to be something that's only needed for English, and ironically the Oxford comma is more common in US English than UK English). 42 | 43 | ## Recycling rules 44 | 45 | stringr implements recycling rules so that you can either supply a vector of strings or a vector of patterns: 46 | 47 | ```{r} 48 | alphabet <- str_flatten(letters, collapse = "") 49 | vowels <- c("a", "e", "i", "o", "u") 50 | grepl(vowels, alphabet) 51 | str_detect(alphabet, vowels) 52 | ``` 53 | 54 | On the whole I regret this. 55 | It's generally not that useful (since you typically have more than one string, not more than one pattern), most people don't use it, and now it feels overly clever. 56 | 57 | ## Redundant functions 58 | 59 | There are a couple of stringr functions that were very useful at the time, but are now less important. 60 | 61 | - `nchar(NA)` used to return 2, and `nchar(factor("abc"))` used to return 1. `str_length()` fixed both of these problems, but those fixes also migrated to base R, leaving `str_length()` as less useful. 62 | - `paste0()` did not exist so `str_c()` was very useful. But now `str_c()` is primarily useful only for its recycling logic.
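As a quick illustration of the recycling logic that keeps `str_c()` useful:

```{r}
library(stringr)

# str_c() recycles length-1 inputs along the longest vector:
str_c("x-", 1:3)

# and, unlike paste0(), it propagates missing values rather than
# converting them to the string "NA":
str_c("prefix-", c("a", NA))
paste0("prefix-", c("a", NA))
```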
63 | -------------------------------------------------------------------------------- /def-inform.qmd: -------------------------------------------------------------------------------- 1 | # Explain important defaults {#sec-def-inform} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | library(dplyr, warn.conflicts = FALSE) 7 | ``` 8 | 9 | ## What's the pattern? 10 | 11 | If a default value is important, and the computation is non-trivial, inform the user what value was used. 12 | This is particularly important when the default value is an educated guess, and you want the user to change it. 13 | It is also important when descriptor arguments (@sec-important-args-first) have defaults. 14 | 15 | ## What are some examples? 16 | 17 | - `dplyr::left_join()` and friends automatically compute the variables to join `by` as the variables that occur in both `x` and `y` (this is called a natural join in SQL). 18 | This is convenient, but it's a heuristic so doesn't always work. 19 | 20 | ```{r} 21 | #| error = TRUE 22 | library(nycflights13) 23 | library(dplyr) 24 | 25 | # Correct 26 | out <- left_join(flights, airlines) 27 | 28 | # Incorrect 29 | out <- left_join(flights, planes) 30 | 31 | # Error 32 | out <- left_join(flights, airports) 33 | ``` 34 | 35 | - `readr::read_csv()` reads a csv file into a data frame. 36 | Because csv files don't store the type of each variable, readr must guess the types. 37 | In order to be fast, `read_csv()` uses some heuristics, so it might guess wrong. 38 | Or maybe it guesses correctly today, but when your automated script runs in two months' time, after the data format has changed, it might guess incorrectly and give weird downstream errors. 39 | For this reason, `read_csv()` prints the column specification in a way that you can copy and paste into your code.
40 | 41 | ```{r} 42 | library(readr) 43 | mtcars <- read_csv(readr_example("mtcars.csv")) 44 | ``` 45 | 46 | - In `ggplot2::geom_histogram()`, the `binwidth` is an important parameter that you should always experiment with. 47 | This suggests it should be a required argument, but it's hard to know what values to try until you've seen a plot. 48 | For this reason, ggplot2 provides a suboptimal default of 30 bins: this gets you started, and then a message tells you how to modify it. 49 | 50 | ```{r} 51 | library(ggplot2) 52 | ggplot(diamonds, aes(carat)) + geom_histogram() 53 | ``` 54 | 55 | - When installing packages, `install.packages()` informs you of the value of the `lib` argument, which defaults to `.libPaths()[[1]]`: 56 | 57 | ```{r} 58 | #| eval = FALSE 59 | install.packages("forcats") 60 | # Installing package into ‘/Users/hadley/R’ 61 | # (as ‘lib’ is unspecified) 62 | ``` 63 | 64 | This message, however, is not terribly important (most people only use one library); it's easy to ignore amongst the other output, and it doesn't refer to the mechanism that controls the default (`.libPaths()`). 65 | 66 | ## Why is it important? 67 | 68 | > There are two ways to fire a machine gun in the dark. 69 | > You can find out exactly where your target is (range, elevation, and azimuth). 70 | > You can determine the environmental conditions (temperature, humidity, air pressure, wind, and so on). 71 | > You can determine the precise specifications of the cartridges and bullets you are using, and their interactions with the actual gun you are firing. 72 | > You can then use tables or a firing computer to calculate the exact bearing and elevation of the barrel. 73 | > If everything works exactly as specified, your tables are correct, and the environment doesn't change, your bullets should land close to their target. 74 | > 75 | > Or you could use tracer bullets. 76 | > 77 | > Tracer bullets are loaded at intervals on the ammo belt alongside regular ammunition.
78 | > When they're fired, their phosphorus ignites and leaves a pyrotechnic trail from the gun to whatever they hit. 79 | > If the tracers are hitting the target, then so are the regular bullets. 80 | > 81 | > --- [The Pragmatic Programmer](https://www.amazon.com/dp/B003GCTQAE) 82 | 83 | I think this is a valuable pattern because it helps balance two tensions in function design: 84 | 85 | - Forcing the function user to really think about what they want to do. 86 | 87 | - Trying to be helpful, so the user of the function can achieve their goal as quickly as possible. 88 | 89 | Often your thoughts about a problem will be aided by a first attempt, even if that attempt is wrong. 90 | This helps facilitate iteration: you don't sit down and contemplate for an hour and then write one perfectly formed line of R code. 91 | You take a stab at it, look at the result, and then tweak. 92 | 93 | Taking a default that the user really should carefully think about and make a decision on, and turning it into a heuristic or educated guess, and reporting the value, is like a tracer bullet. 94 | 95 | The counterpoint to this pattern is that people don't read repeated output. 96 | For example, do you know how to cite R in a paper? 97 | It's mentioned every time that you start R. 98 | Human brains are extremely good at filtering out unchanging signals, which means that you must use this technique with caution. 99 | If every argument tells you the default it uses, it's effectively the same as doing nothing: the most important signals will get buried in the noise. 100 | This is why you'll see the technique used in only a handful of places in the tidyverse. 101 | 102 | ## How can I use it? 103 | 104 | To use this pattern you need to generate a message when computing the default value. 105 | The easiest way to do this is to write a small helper function.
106 | It should compute the default value given some inputs and generate a `message()` that gives the code that you could copy and paste into the function call. 107 | 108 | Take the dplyr join functions, for example. 109 | They use a function like this: 110 | 111 | ```{r} 112 | common_by <- function(x, y) { 113 | common <- intersect(names(x), names(y)) 114 | if (length(common) == 0) { 115 | stop("Must specify `by` when no common variables in `x` and `y`", call. = FALSE) 116 | } 117 | 118 | message("Computing common variables: `by = ", rlang::expr_text(common), "`") 119 | common 120 | } 121 | 122 | common_by(data.frame(x = 1), data.frame(x = 1)) 123 | common_by(flights, planes) 124 | ``` 125 | 126 | The technique you use to generate the code will vary from function to function. 127 | `rlang::expr_text()` is useful here because it automatically creates the code you'd use to build the character vector. 128 | 129 | To avoid creating a magical default (@sec-def-magical), either export and document the function, or use one of the techniques in @sec-defaults-short-and-sweet: 130 | 131 | ```{r} 132 | left_join <- function(x, y, by = NULL) { 133 | by <- by %||% common_by(x, y) 134 | } 135 | ``` 136 | -------------------------------------------------------------------------------- /def-magical.qmd: -------------------------------------------------------------------------------- 1 | # Avoid magical defaults {#sec-def-magical} 2 | 3 | ```{r} 4 | #| include = FALSE, 5 | #| cache = FALSE 6 | source("common.R") 7 | ``` 8 | 9 | ```{r} 10 | #| eval = FALSE, 11 | #| include = FALSE 12 | source("fun_def.R") 13 | funs <- c(pkg_funs("base"), pkg_funs("stats"), pkg_funs("utils")) 14 | funs %>% funs_formals_keep(~ is_symbol(.x) && !is_missing(.x)) 15 | pkg_funs("base") %>% funs_body_keep(has_call, "missing") 16 | ``` 17 | 18 | ## What's the problem? 19 | 20 | If a function behaves differently when the default value is supplied explicitly, we say it has a **magical default**. 
21 | Magical defaults are best avoided because they make it harder to interpret the function specification. 22 | 23 | ## What are some examples? 24 | 25 | - In `data.frame()`, the default argument for `row.names` is `NULL`, but if you supply it directly you get a different result: 26 | 27 | ```{r} 28 | args(data.frame) 29 | 30 | x <- setNames(nm = letters[1:2]) 31 | x 32 | 33 | data.frame(x) 34 | 35 | data.frame(x, row.names = NULL) 36 | ``` 37 | 38 | - In `hist()`, the default value of `xlim` is `range(breaks)`, and the default value for `breaks` is `"Sturges"`. 39 | `range("Sturges")` returns `c("Sturges", "Sturges")` which doesn't work when supplied explicitly: 40 | 41 | ```{r} 42 | #| error = TRUE, 43 | #| fig.show = "hide" 44 | args(hist.default) 45 | 46 | hist(1:10, xlim = c("Sturges", "Sturges")) 47 | ``` 48 | 49 | - `readr::read_csv()` has `progress = show_progress()`, but until version 1.3.1, `show_progress()` was not exported from the package. 50 | That means if you attempted to run it yourself, you'd see an error message: 51 | 52 | ```{r} 53 | #| error = TRUE 54 | show_progress() 55 | ``` 56 | 57 | ## What are the exceptions? 58 | 59 | It's ok to use this behaviour when you want the default value of one argument to be the same as another. 60 | For example, take `rlang::set_names()`, which allows you to create a named vector from two inputs: 61 | 62 | ```{r} 63 | library(rlang) 64 | args(set_names) 65 | 66 | set_names(1:3, letters[1:3]) 67 | ``` 68 | 69 | The default value for the names is the vector itself. 70 | This provides a convenient shortcut for naming a vector with itself: 71 | 72 | ```{r} 73 | set_names(letters[1:3]) 74 | ``` 75 | 76 | You can see this same technique in `merge()`, where `all.x` and `all.y` default to the same value as `all`, and in `factor()` where `labels` defaults to the same value as `levels`. 77 | 78 | If you use this technique, make sure that you never use the value of an argument that comes later in the argument list. 
79 | For example, in `file.copy()` `overwrite` defaults to the same value as `recursive`, but the `recursive` argument is defined after `overwrite`: 80 | 81 | ```{r} 82 | args(file.copy) 83 | ``` 84 | 85 | This makes the default arguments harder to understand because you can't just read from left-to-right. 86 | 87 | ## What causes the problem? 88 | 89 | This problem is generally easy to avoid for new functions: 90 | 91 | - Don't use default values that depend on variables defined inside the function. 92 | The default values of function arguments are lazily evaluated in the environment of the function when they are first used, as described in [Advanced R](https://adv-r.hadley.nz/functions.html#default-arguments). Here's a simple example: 93 | 94 | ```{r} 95 | f1 <- function(x = y) { 96 | y <- 2 97 | x 98 | } 99 | 100 | y <- 1 101 | f1() 102 | f1(y) 103 | ``` 104 | 105 | When `x` takes the value `y` from its default, it's evaluated inside the function, yielding `2`. 106 | When `y` is supplied explicitly, it is evaluated in the caller environment, yielding `1`. 107 | 108 | - Don't use `missing()`\[\^def-magical-1\]. 109 | 110 | ```{r} 111 | f2 <- function(x = 1) { 112 | if (missing(x)) { 113 | 2 114 | } else { 115 | x 116 | } 117 | } 118 | 119 | f2() 120 | f2(1) 121 | ``` 122 | 123 | - Don't use unexported functions. 124 | In packages, it's easy to use a non-exported function without thinking about it. 125 | This function is available to you, the package author, but not the user of the package, which makes it harder for them to understand how a package works. 126 | 127 | ## How do I remediate the problem? 128 | 129 | If you have made a mistake in an older function you can remediate it by using a `NULL` default, as described in @sec-defaults-short-and-sweet. 130 | If the problem is caused by an unexported function, you can also choose to document and export it.
131 | Remediating this problem shouldn't break existing code, because it expands the function interface: all previous code will continue to work, and the function will now also work when the argument is explicitly passed `NULL` (which probably didn't work previously). 132 | -------------------------------------------------------------------------------- /def-user.qmd: -------------------------------------------------------------------------------- 1 | # User settable defaults {#sec-def-user} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the pattern? 9 | 10 | It's sometimes useful to give the user control over default values, so that they can set them once per session, or once for every session in their `.Rprofile`. 11 | To do so, use `getOption()` in the default value. 12 | 13 | Note that this pattern should generally only be used to control the side-effects of a function, not the value it computes. 14 | The two primary uses are for controlling the appearance of output, particularly in `print()` methods, and for setting default values in generated templates. 15 | 16 | Related patterns: 17 | 18 | - If a global option affects the results of the computation (not just its side-effects), you have an example of @sec-inputs-explicit. 19 | 20 | ## What are some examples? 21 | 22 | ## Why is it important? 23 | 24 | ## What are the exceptions? 25 | 26 | ## How do I use it?
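A minimal sketch of the pattern, assuming a hypothetical package `mypkg` with a `print()` method whose display length users may want to control (the option name `mypkg.print_max` and the class `my_tbl` are invented for illustration):

```{r}
# Users can run options(mypkg.print_max = 25) once per session, or put it
# in their .Rprofile to set it for every session. The option only affects
# what's printed, never the value that's computed.
print.my_tbl <- function(x, ..., n = getOption("mypkg.print_max", 10)) {
  cat("<my_tbl: showing", n, "rows>\n")
  invisible(x)
}
```

Note that `getOption()` takes the fallback as its second argument, so the function still works when the user has set nothing.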
27 | -------------------------------------------------------------------------------- /defaults-short-and-sweet.qmd: -------------------------------------------------------------------------------- 1 | # Keep defaults short and sweet {#sec-defaults-short-and-sweet} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ```{r} 9 | #| eval = FALSE, 10 | #| include = FALSE 11 | source("fun_def.R") 12 | pkg_funs("base") |> keep(\(f) some(f$formals, is.null)) 13 | pkg_funs("base") |> keep(\(f) some(f$formals, \(arg) is_call(arg) && !is_call(arg, c("c", "getOption", "c", "if", "topenv", "parent.frame")))) 14 | 15 | 16 | funs <- c(pkg_funs("base"), pkg_funs("stats")) 17 | arg_length <- function(x) map_int(x$formals, ~ nchar(expr_text(.x))) 18 | args <- map(funs, arg_length) 19 | args_max <- map_dbl(args, ~ if (length(.x) == 0) 0 else max(.x)) 20 | 21 | funs[args_max > 50] %>% discard(~ grepl("as.data.frame", .x$name, fixed = TRUE)) 22 | ``` 23 | 24 | ## What's the pattern? 25 | 26 | Default values should be short and sweet. 27 | Avoid large or complex calculations in default values; instead, use `NULL` or a helper function when the default requires non-trivial computation. 28 | This keeps the function specification focussed on the big picture (i.e. what are the arguments and are they required or not) rather than the details of the defaults. 29 | 30 | ## What are some examples? 31 | 32 | It's common for functions to use `NULL` to mean that the argument is optional, but the computation of the default is non-trivial: 33 | 34 | - The default `label` in `cut()` yields labels in the form `[a, b)`. 35 | - The default `pattern` in `dir()` means match all files. 36 | - The default `by` in `dplyr::left_join()` means join using the common variables between the two data frames (the so-called natural join). 37 | - The default `mapping` in `ggplot2::geom_point()` (and friends) means use the mapping from the overall plot.
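The shape all of these functions share can be sketched as follows; this is a simplified illustration of the join case above, not dplyr's actual implementation:

```{r}
# A minimal sketch of the NULL-as-optional pattern: `by = NULL` signals
# "optional with a non-trivial default".
left_join_sketch <- function(x, y, by = NULL) {
  if (is.null(by)) {
    # Non-trivial default: the common variables (a "natural join")
    by <- intersect(names(x), names(y))
  }
  by
}

left_join_sketch(data.frame(a = 1, b = 2), data.frame(b = 3, c = 4))
#> [1] "b"
```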
38 | 39 | In other cases, we encapsulate default values into a function: 40 | 41 | - readr functions use a family of functions including `readr::show_progress()`, `readr::should_show_col_types()` and `readr::should_show_lazy()` that make it easier for users to override various defaults. 42 | 43 | It's also worth looking at a couple of counterexamples that come from base R: 44 | 45 | - The default value for `by` in `seq()` is `((to - from)/(length.out - 1))`. 46 | 47 | - `reshape()` has a very long default argument: the `split` argument is one of two possible lists depending on the value of the `sep` argument: 48 | 49 | ```{r} 50 | #| eval = FALSE 51 | reshape <- function( 52 | ..., 53 | split = if (sep == "") { 54 | list(regexp = "[A-Za-z][0-9]", include = TRUE) 55 | } else { 56 | list(regexp = sep, include = FALSE, fixed = TRUE) 57 | } 58 | ) {} 59 | ``` 60 | 61 | - `sample.int()` uses a complicated rule to determine whether or not to use a faster hash-based method that's only applicable in some circumstances: `useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e+07)`. 62 | 63 | ## How do I use it? 64 | 65 | So what should you do if a default requires some complex calculation? 66 | We have two recommended approaches: using `NULL` or creating a helper function. 67 | I'll also show you two other alternatives which we don't generally recommend, but which you'll see in a handful of places in the tidyverse and can be useful in limited circumstances. 68 | 69 | ### `NULL` default 70 | 71 | The simplest, and most common, way to indicate that an argument is optional, but has a complex default is to use `NULL` as the default. 72 | Then in the body of the function you perform the actual calculation only if the argument is `NULL`.
73 | For example, if we were to use this approach in `sample.int()`, it might look something like this: 74 | 75 | ```{r} 76 | sample.int <- function (n, size = n, replace = FALSE, prob = NULL, useHash = NULL) { 77 | if (is.null(useHash)) { 78 | useHash <- n > 1e+07 && !replace && is.null(prob) && size <= n/2 79 | } 80 | } 81 | ``` 82 | 83 | This pattern is made more elegant with the infix `%||%` operator, which is built into R since version 4.4.0. 84 | If you need it in an older version of R you can import it from rlang or copy and paste it into your `utils.R`. 85 | Note that `%||%` binds more tightly than `&&`, so the right-hand side needs parentheses: 86 | 87 | ```{r} 88 | `%||%` <- function(x, y) if (is.null(x)) y else x 89 | 90 | sample.int <- function (n, size = n, replace = FALSE, prob = NULL, useHash = NULL) { 91 | useHash <- useHash %||% (n > 1e+07 && !replace && is.null(prob) && size <= n/2) 92 | } 93 | ``` 94 | 95 | `%||%` is particularly well suited to arguments where the default value is found through a cascading system of fallbacks. 96 | For example, this code from `ggplot2::geom_bar()` finds the width by first looking at the data, then in the parameters, finally falling back to computing it from the resolution of the `x` variable: 97 | 98 | ```{r} 99 | #| eval = FALSE 100 | width <- data$width %||% params$width %||% (resolution(data$x, FALSE) * 0.9) 101 | ``` 102 | 103 | Don't use `%||%` for more complex examples where the individual clauses can't fit on their own line. 104 | For example, in `reshape()`, I wouldn't write: 105 | 106 | ```{r} 107 | #| eval: false 108 | reshape <- function(..., sep = ".", split = NULL) { 109 | split <- split %||% if (sep == "") { 110 | list(regexp = "[A-Za-z][0-9]", include = TRUE) 111 | } else { 112 | list(regexp = sep, include = FALSE, fixed = TRUE) 113 | } 114 | ...
114 | } 115 | ``` 116 | 117 | I would instead use `is.null()` and assign `split` inside each branch: 118 | 119 | ```{r} 120 | #| eval: false 121 | reshape <- function(..., sep = ".", split = NULL) { 122 | if (is.null(split)) { 123 | if (sep == "") { 124 | split <- list(regexp = "[A-Za-z][0-9]", include = TRUE) 125 | } else { 126 | split <- list(regexp = sep, include = FALSE, fixed = TRUE) 127 | } 128 | } 129 | ... 130 | } 131 | ``` 132 | 133 | Or alternatively you might pull the code out into a helper function: 134 | 135 | ```{r} 136 | split_default <- function(sep = ".") { 137 | if (sep == "") { 138 | list(regexp = "[A-Za-z][0-9]", include = TRUE) 139 | } else { 140 | list(regexp = sep, include = FALSE, fixed = TRUE) 141 | } 142 | } 143 | 144 | reshape <- function(..., sep = ".", split = NULL) { 145 | split <- split %||% split_default(sep) 146 | ... 147 | } 148 | ``` 149 | 150 | That makes it very clear exactly which other arguments the default for `split` depends on. 151 | 152 | ### Exported helper function 153 | 154 | If you have created a helper function for your own use, you might consider using it as the default: 155 | 156 | ```{r} 157 | reshape <- function(..., sep = ".", split = split_default(sep)) { 158 | ... 159 | } 160 | ``` 161 | 162 | The problem with using an internal function as the default is that the user can't easily run this function to see what it does, making the default a bit magical (@sec-def-magical). 163 | So we recommend that if you want to do this you export and document that function. 164 | This is the main downside of this approach: you have to think carefully about the name of the function because it's user-facing. 165 | 166 | A good example of this pattern is `readr::show_progress()`: it's used in every `read_` function in readr to determine whether or not a progress bar should be shown.
167 | Because it has a relatively complex explanation, it's nice to be able to document it in its own file, rather than cluttering up file reading functions with incidental details. 168 | 169 | ### Alternatives 170 | 171 | If the above techniques don't work for your case, there are two other alternatives that we don't generally recommend but can be useful in limited situations. 172 | 173 | ::: {.callout-note collapse="true"} 174 | #### Sentinel value 175 | 176 | Sometimes you'd like to use the `NULL` approach defined above, but `NULL` already has a specific meaning that you want to preserve. 177 | For example, this comes up in ggplot2 scales functions which allow you to set the `name` of the scale which is displayed on the axis or legend. 178 | The default value should just preserve whatever existing label is present so that if you're providing a scale to customise (e.g.) the breaks or labels, you don't need to re-type the scale name. 179 | However, `NULL` is also a meaningful value because it means eliminate the scale label altogether[^defaults-short-and-sweet-1]. 180 | For that reason the default value for `name` is `ggplot2::waiver()`, a ggplot2-specific convention that means "inherit from the existing value". 181 | 182 | If you look at `ggplot2::waiver()` you'll see it's just a very lightweight S3 class[^defaults-short-and-sweet-2]: 183 | 184 | ```{r} 185 | ggplot2::waiver 186 | ``` 187 | 188 | And then ggplot2 also provides the internal `is.waive()`[^defaults-short-and-sweet-3] function which allows you to work with it in the same way we might work with a `NULL`: 189 | 190 | ```{r} 191 | is.waive <- function(x) { 192 | inherits(x, "waiver") 193 | } 194 | ``` 195 | 196 | The primary downside of this technique is that it requires substantial infrastructure to set up, so it's only really worth it for very important functions or if you're going to use it in multiple places.
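Putting the pieces together, a roll-your-own sentinel needs only a constructor, a predicate, and a default that uses it; the names `inherit()` and `is_inherit()` below are invented for illustration:

```{r}
# A zero-field S3 object plus a predicate, mirroring the
# waiver()/is.waive() pair above.
inherit <- function() structure(list(), class = "inherit_sentinel")
is_inherit <- function(x) inherits(x, "inherit_sentinel")

set_label <- function(label = inherit()) {
  if (is_inherit(label)) {
    "<keep existing label>"  # NULL remains free to mean "no label"
  } else {
    label
  }
}

set_label()
#> [1] "<keep existing label>"
set_label(NULL)
#> NULL
```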
197 | ::: 198 | 199 | [^defaults-short-and-sweet-1]: Unlike `name = ""` which doesn't show the label, but preserves the space where it would appear (sometimes useful for aligning multiple plots), `name = NULL` also eliminates the space normally allocated for the label. 200 | 201 | [^defaults-short-and-sweet-2]: If I was to write this code today I'd use `ggplot2_waiver` as the class name. 202 | 203 | [^defaults-short-and-sweet-3]: If I wrote this code today, I'd call it `is_waiver()`. 204 | 205 | ::: {.callout-warning collapse="true"} 206 | #### No default 207 | 208 | The final alternative is to condition on the absence of an argument using `missing()`. It works something like this: 209 | 210 | ```{r} 211 | reshape <- function(..., sep = ".", split) { 212 | if (missing(split)) { 213 | split <- split_default(sep) 214 | } 215 | ... 216 | } 217 | ``` 218 | 219 | I mention this technique because we used it in `purrr::reduce()` for the `.init` argument. 220 | This argument is mostly optional: 221 | 222 | ```{r} 223 | library(purrr) 224 | reduce(letters[1:3], paste) 225 | reduce(letters[1:2], paste) 226 | reduce(letters[1], paste) 227 | ``` 228 | 229 | But it is required when `.x` (the first argument) is empty, and it's good practice to supply it when wrapping `reduce()` inside another function because it ensures that you get the right type of output for all inputs: 230 | 231 | ```{r} 232 | #| error: true 233 | reduce(letters[0], paste) 234 | reduce(letters[0], paste, .init = "") 235 | ``` 236 | 237 | Why use this approach? 238 | `NULL` is a potentially valid option for `.init`, so we can't use that approach. 239 | And we only needed it for a single, not terribly important function, so creating a sentinel didn't seem worth it. 240 | `.init` is "semi" required, so this seemed to be the least worst solution to the problem.
241 | 242 | The major drawback to this technique is that it makes it look like an argument is required (in direct conflict with @sec-required-no-defaults). 243 | ::: 244 | 245 | ## How do I remediate existing problems? 246 | 247 | If you have a function with a long default, you can remediate it with any of the approaches above. 248 | It won't be a breaking change unless you accidentally change the computation of the default, so make sure you have a test for that before you begin. 249 | 250 | ## See also 251 | 252 | - See @sec-argument-clutter for a technique to simplify your function spec if it's long because it has many less-important optional arguments. 253 | -------------------------------------------------------------------------------- /design.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | BuildType: None 16 | 17 | MarkdownWrap: Sentence 18 | MarkdownCanonical: Yes 19 | -------------------------------------------------------------------------------- /dots-after-required.qmd: -------------------------------------------------------------------------------- 1 | # Put `…` after required arguments {#sec-dots-after-required} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the pattern? 9 | 10 | If you use `…` in a function, put it after the required arguments and before the optional arguments. 11 | 12 | This has two positive impacts: 13 | 14 | - It forces the user of your function to fully name optional arguments, because arguments that come after `...` are never matched by position or by partial name. 15 | We believe that using full names for optional arguments is good practice because it makes code easier to read.
16 | 17 | - This in turn means that you can easily add new optional arguments or change the order of existing arguments without affecting existing code. 18 | 19 | ## What are some examples? 20 | 21 | The arguments to `mean()` are `x`, `trim`, `na.rm` and `…`. 22 | This means that you can write code like this: 23 | 24 | ```{r} 25 | x <- c(1, 2, 10, NA) 26 | mean(x, , TRUE) 27 | mean(x, n = TRUE, t = 0.1) 28 | ``` 29 | 30 | Not only does this allow for confusing code[^dots-after-required-1], it also makes it hard to later change the order of these arguments, or introduce new arguments that might be more important. 31 | 32 | [^dots-after-required-1]: As much as we recommend people don't write code like this, you know someone will! 33 | 34 | If `mean()` instead placed `…` before `trim` and `na.rm`, like `mean2()`[^dots-after-required-2] below, then you must fully name each argument: 35 | 36 | [^dots-after-required-2]: Note that I moved `na.rm = TRUE` in front of `trim` because I believe `na.rm` is the more important argument: it's used vastly more often than `trim`, and I'm following @sec-important-args-first. 37 | 38 | ```{r} 39 | mean2 <- function(x, ..., na.rm = FALSE, trim = 0) { 40 | mean(x, ..., na.rm = na.rm, trim = trim) 41 | } 42 | 43 | mean2(x, na.rm = TRUE) 44 | mean2(x, na.rm = TRUE, trim = 0.1) 45 | ``` 46 | 47 | ## How do I remediate past mistakes? 48 | 49 | It's straightforward to fix a function where you've put `...` in the wrong place: you just need to change the argument order and use `rlang::check_dots_used()` to check that no arguments are lost (learn more in @sec-dots-inspect). 50 | This is a breaking change, but it tends to affect relatively little code because most people do fully name optional arguments.
51 | 52 | We can use this approach to make a safer version of `mean()`: 53 | 54 | ```{r} 55 | #| error = TRUE 56 | mean3 <- function(x, ..., na.rm = FALSE, trim = 0) { 57 | rlang::check_dots_used() 58 | mean(x, ..., na.rm = na.rm, trim = trim) 59 | } 60 | 61 | mean3(x, , TRUE) 62 | 63 | mean3(x, n = TRUE, t = 0.1) 64 | ``` 65 | 66 | ::: {.callout-note collapse="true"} 67 | ## Base R 68 | 69 | In base R you can use `base::chkDots()`, but it uses a slightly simpler technique which means it's not suitable for use in S3 methods. 70 | ::: 71 | 72 | ## See also 73 | 74 | - @sec-dots-data: if `…` is a required argument because it's used to combine an arbitrary number of objects in a data structure. 75 | - @sec-dots-inspect: to ensure that arguments to `…` never go silently missing. 76 | -------------------------------------------------------------------------------- /dots-data.qmd: -------------------------------------------------------------------------------- 1 | # Making data with ... {#sec-dots-data} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | library(tidyverse) 7 | ``` 8 | 9 | ## What's the problem? 10 | 11 | A number of functions take `...` to save the user from having to create a vector themselves. 12 | 13 | ## What are some examples? 14 | 15 | ```{r} 16 | sum(c(1, 1, 1)) 17 | # can be shortened to: 18 | sum(1, 1, 1) 19 | 20 | f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a")) 21 | f 22 | fct_relevel(f, c("b", "a")) 23 | # can be shortened to: 24 | fct_relevel(f, "b", "a") 25 | 26 | ``` 27 | 28 | - `mapply()` 29 | 30 | ## Why is it important? 31 | 32 | In general, I think it is best to avoid using `...` for this purpose because it has a relatively small benefit, only saving the user from typing `c()`, but has a number of costs: 33 | 34 | - It can give the misleading impression that other functions in the same family work the same way.
35 | For example, if you've internalised how `sum()` works, you might predict that `mean()` works the same way, but it does not: 36 | 37 | ```{r} 38 | mean(c(1, 2, 3)) 39 | mean(1, 2, 3) 40 | ``` 41 | 42 | (See @sec-dots-after-required to learn why this doesn't give an error message.) 43 | 44 | - It makes it harder to adapt the function for new uses. 45 | For example, `fct_relevel()` can also be called with a function: 46 | 47 | ```{r} 48 | fct_relevel(f, sort) 49 | ``` 50 | 51 | If `fct_relevel()` took its input as a single vector, you could easily extend it to also work with functions: 52 | 53 | ```{r} 54 | fct_relevel <- function(f, x) { 55 | if (is.function(x)) { 56 | x <- x(levels(f)) 57 | } 58 | } 59 | ``` 60 | 61 | However, because `fct_relevel()` uses dots, the implementation needs to be more complicated: 62 | 63 | ```{r} 64 | #| eval = FALSE 65 | fct_relevel <- function(f, ...) { 66 | if (dots_n(...) == 1L && is.function(..1)) { 67 | levels <- ..1(levels(f)) 68 | } else { 69 | levels <- c(...) 70 | } 71 | } 72 | ``` 73 | 74 | ## What are the exceptions? 75 | 76 | Note that in all the examples above, the `...` are used to collect a single details argument. 77 | It's ok to use `...` to collect *data*, as in `paste()`, `data.frame()`, or `list()`. 78 | 79 | ## How can I remediate it? 80 | 81 | If you've already published a function where you've used `...` for this purpose you can change the interface by adding a new argument in front of `...`, and then warning if anything ends up in `...`. 82 | 83 | ```{r} 84 | old_foo <- function(x, ...) { 85 | } 86 | 87 | new_foo <- function(x, y, ...) { 88 | if (rlang::dots_n(...) > 0) { 89 | warning("Use of `...` is now deprecated. Please put all arguments in `y`") 90 | y <- c(y, ...) 91 | } 92 | } 93 | ``` 94 | 95 | Because this is an interface change, it should be prominently advertised in packages. 96 | 97 | ## How can I protect myself? 98 | 99 | If you do feel that the tradeoff is worth it (i.e.
it's an extremely frequently used function and the savings over time will be considerable), you need to take some steps to minimise the downsides. 100 | 101 | This is easiest if you're constructing a vector that shouldn't have names. 102 | In this case, you can call `rlang::check_dots_unnamed()` to ensure that no named arguments have been accidentally passed to `...`. 103 | This protects you against the following undesirable behaviour of `sum()`: 104 | 105 | ```{r} 106 | #| error = TRUE 107 | sum(1, 1, 1, na.omit = TRUE) 108 | 109 | safe_sum <- function(..., na.rm = TRUE) { 110 | rlang::check_dots_unnamed() 111 | sum(c(...), na.rm = na.rm) 112 | } 113 | safe_sum(1, 1, 1, na.omit = TRUE) 114 | ``` 115 | 116 | If you want your vector to have names, the problem is harder, and there's relatively little that you can do. 117 | You'll need to ensure that all other arguments get a `.` prefix (to minimise the chances of a mismatch) and then think carefully about how you might detect problems by thinking about the expected type of `c(...)`. 118 | As far as I know, there are no general techniques, and you'll have to think about the problem on a case-by-case basis. 119 | 120 | ## Selecting variables 121 | 122 | A number of functions in the tidyverse use `...` for selecting variables. 123 | For example, `tidyr::fill()` lets you fill in missing values based on the previous row: 124 | 125 | ```{r} 126 | df <- tribble( 127 | ~year, ~month, ~day, 128 | 2020, 1, 1, 129 | NA, NA, 2, 130 | NA, NA, 3, 131 | NA, 2, 1 132 | ) 133 | df %>% fill(year, month) 134 | ``` 135 | 136 | All functions that work like this include a call to `tidyselect::vars_select()` that looks something like this: 137 | 138 | ```{r} 139 | find_vars <- function(data, ...) { 140 | tidyselect::vars_select(names(data), ...)
141 | } 142 | 143 | find_vars(df, year, month) 144 | ``` 145 | 146 | I now think that this interface is a mistake because it suffers from the same problem as `sum()`: we're using `...` to only save a little typing. 147 | We can eliminate the use of dots by requiring the user to use `c()`. 148 | (This change also requires explicit quoting and unquoting of `vars` since we're no longer using `...`.) 149 | 150 | ```{r} 151 | foo <- function(data, vars) { 152 | tidyselect::vars_select(names(data), !!enquo(vars)) 153 | } 154 | 155 | foo(df, c(year, month)) 156 | ``` 157 | 158 | In other words, I believe that a better interface to `fill()` would be: 159 | 160 | ```{r} 161 | #| eval = FALSE 162 | df %>% fill(c(year, month)) 163 | ``` 164 | 165 | Other tidyverse functions like dplyr's scoped verbs and `ggplot2::facet_grid()` require the user to explicitly quote the input. 166 | I now believe that this is also a suboptimal interface because it is more typing (`vars()` is longer than `c()`, and you must quote even single variables), and arguments that require their inputs to be explicitly quoted are rare in the tidyverse. 167 | 168 | ```{r} 169 | #| eval = FALSE 170 | # existing interface 171 | dplyr::mutate_at(mtcars, vars(cyl:vs), mean) 172 | # what I would create today 173 | dplyr::mutate_at(mtcars, c(cyl:vs), mean) 174 | 175 | # existing interface 176 | ggplot2::facet_grid(rows = vars(drv), cols = vars(vs, am)) 177 | # what I would create today 178 | ggplot2::facet_grid(rows = drv, cols = c(vs, am)) 179 | ``` 180 | 181 | That said, it is unlikely we will ever change these functions, because the benefit is smaller (primarily improved consistency) and the costs are high, as it is impossible to switch from an evaluated argument to a quoted argument without breaking backward compatibility in some small percentage of cases.
182 | -------------------------------------------------------------------------------- /dots-inspect.qmd: -------------------------------------------------------------------------------- 1 | # Inspect the dots {#sec-dots-inspect} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | 9 | 10 | ## What's the pattern? 11 | 12 | Whenever you use `...` in an S3 generic to allow methods to add custom arguments, you should inspect the dots to make sure that every argument is used. 13 | You can also use this same approach when passing `...` to an overly permissive function. 14 | 15 | ## What are some examples? 16 | 17 | If you don't use this technique it is easy to end up with functions that silently return the incorrect result when argument names are misspelled. 18 | 19 | ```{r} 20 | # Misspelled 21 | weighted.mean(c(1, 0, -1), wt = c(10, 0, 0)) 22 | mean(c(1:9, 100), tirm = 0.1) 23 | 24 | # Correct 25 | weighted.mean(c(1, 0, -1), w = c(10, 0, 0)) 26 | mean(c(1:9, 100), trim = 0.1) 27 | ``` 28 | 29 | ## How do I do it? 30 | 31 | Add a call to `rlang::check_dots_used()` in the generic before the call to `UseMethod()`. 32 | This automatically adds an on-exit handler, which checks that every element of `...` has been evaluated just prior to the function returning. 33 | 34 | You can see this in action by creating a safe wrapper around `cut()`, which has different arguments for its numeric and date methods. 35 | 36 | ```{r} 37 | safe_cut <- function(x, breaks, ..., right = TRUE) { 38 | rlang::check_dots_used() 39 | UseMethod("safe_cut") 40 | } 41 | 42 | safe_cut.numeric <- function(x, breaks, ..., right = TRUE, include.lowest = FALSE) { 43 | cut(x, breaks = breaks, right = right, include.lowest = include.lowest) 44 | } 45 | 46 | safe_cut.Date <- function(x, breaks, ..., right = TRUE, start.on.monday = TRUE) { 47 | cut(x, breaks = breaks, right = right, start.on.monday = start.on.monday) 48 | } 49 | ``` 50 | 51 | ### What are the limitations?
52 | 53 | Accurately detecting this problem is hard because no one place has all the information needed to tell if an argument is superfluous or not (the precise details are beyond the scope of this text). 54 | Instead, rlang takes advantage of R's [lazy evaluation](https://adv-r.hadley.nz/functions.html#lazy-evaluation) and inspects the internal components of `...` to see if their evaluation has been forced. 55 | 56 | If a function is called primarily for its side-effects, the error will occur after the side-effect has happened, making for a confusing result. 57 | Here the best we can do is a warning, generated by `rlang::check_dots_used(error = function(e) warn(e))`. 58 | 59 | If a function captures the components of `...` using `enquo()` or `match.call()`, you cannot use this technique. 60 | This also means that if you use `check_dots_used()`, the method author cannot choose to add a quoted argument. 61 | I think this is ok because quoting vs. evaluating is part of the interface of the generic, so methods should not change this interface, and it's fine for the author of the generic to make that decision for all method authors. 62 | 63 | ### What are other uses? 64 | 65 | This same technique can also be used when you are wrapping other functions. 66 | For example, `stringr::str_sort()` takes `...` and passes it on to `stringi::stri_opts_collator()`. 67 | As of March 2019, `str_sort()` looked like this: 68 | 69 | ```{r} 70 | str_sort <- function(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...) 71 | { 72 | stringi::stri_sort(x, 73 | decreasing = decreasing, 74 | na_last = na_last, 75 | opts_collator = stringi::stri_opts_collator( 76 | locale, 77 | numeric = numeric, 78 | ...
79 | ) 80 | ) 81 | } 82 | ``` 83 | 84 | ```{r} 85 | x <- c("x1", "x100", "x2") 86 | str_sort(x) 87 | str_sort(x, numeric = TRUE) 88 | ``` 89 | 90 | This wrapper is useful because it decouples `str_sort()` from `stri_opts_collator()`, meaning that if `stri_opts_collator()` gains new arguments, users of `str_sort()` can take advantage of them immediately. 91 | But most of the arguments in `stri_opts_collator()` are sufficiently arcane that they don't need to be exposed directly in stringr, which is designed to minimise the cognitive load of the user by hiding some of the full complexity of string handling. 92 | 93 | (The importance of the `locale` argument comes up in "hidden inputs", @sec-inputs-explicit.) 94 | 95 | TODO: Update this! 96 | It's now wrong! 97 | 98 | However, `stri_opts_collator()` deliberately ignores any arguments in `...`. 99 | This means that misspellings are silently ignored: 100 | 101 | ```{r} 102 | #| eval: false 103 | str_sort(x, numric = TRUE) 104 | ``` 105 | 106 | We can work around this behaviour by adding `check_dots_used()` to `str_sort()`: 107 | 108 | ```{r} 109 | #| error = TRUE 110 | str_sort <- function(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...) 111 | { 112 | rlang::check_dots_used() 113 | 114 | stringi::stri_sort(x, 115 | decreasing = decreasing, 116 | na_last = na_last, 117 | opts_collator = stringi::stri_opts_collator( 118 | locale, 119 | numeric = numeric, 120 | ... 121 | ) 122 | ) 123 | } 124 | 125 | str_sort(x, numric = TRUE) 126 | ``` 127 | 128 | Note, however, that it's better to figure out why `stri_opts_collator()` ignores `...` in the first place. 129 | You can see that discussion at . 130 | 131 | See for a discussion about using this technique in `devtools::install_github()`, which is a similar situation, but with a more complicated chain of calls: `devtools::install_github()` -\> `install.packages()` -\> `download.file()`.
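The warning variant mentioned above can be sketched as follows. This is a hypothetical example (the function name `log_value()` is invented for illustration), mirroring the `check_dots_used(error = ...)` call shown earlier:

```{r}
#| eval: false
# Hypothetical function called for its side-effect: by the time the dots
# are checked, the message has already been emitted, so we downgrade the
# error to a warning rather than fail after the fact.
log_value <- function(x, ...) {
  rlang::check_dots_used(error = function(e) rlang::warn(conditionMessage(e)))
  message("Logging: ", x)
  invisible(x)
}
```

With this sketch, a misspelled argument still produces the side-effect, but the user at least gets a warning pointing at the unused dots.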
132 | -------------------------------------------------------------------------------- /dots-prefix.qmd: 1 | # Dot prefix {#sec-dots-prefix} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the pattern? 9 | 10 | When using `...` to create a data structure, or when passing `...` to a user-supplied function, add a `.` prefix to all named arguments. 11 | This reduces (but does not eliminate) the chances of matching an argument at the wrong level. 12 | Additionally, you should always provide some mechanism that allows you to escape and use that name if needed. 13 | 14 | ```{r} 15 | #| label = "setup" 16 | library(purrr) 17 | ``` 18 | 19 | (Not important if you ignore names: e.g. `cat()`.) 20 | 21 | ## What are some examples? 22 | 23 | Look at the arguments to some functions in purrr: 24 | 25 | ```{r} 26 | args(map) 27 | args(reduce) 28 | args(detect) 29 | ``` 30 | 31 | Notice that all named arguments start with `.`. 32 | This reduces the chance that you will incorrectly match an argument to `map()`, rather than to an argument of `.f`. 33 | Obviously it can't eliminate it. 34 | 35 | The escape mechanism is the anonymous function. 36 | It's a little easier to access in `purrr::map()` since you can create one with `~`, which is much less typing than `function() {}`. 37 | For example, imagine you want to... 38 | 39 | Example: https://jennybc.github.io/purrr-tutorial/ls02_map-extraction-advanced.html#list_inside_a_data_frame 40 | 41 | ## Case study: dplyr verbs 42 | 43 | ```{r} 44 | args(dplyr::filter) 45 | args(dplyr::group_by) 46 | ``` 47 | 48 | The escape hatch is `:=`. 49 | 50 | Oops: 51 | 52 | ```{r} 53 | args(dplyr::left_join) 54 | ``` 55 | 56 | ## Other approaches in base R 57 | 58 | Base R uses two alternative methods: uppercase and `_` prefix. 59 | 60 | The apply family tends to use uppercase argument names for the same reason.
61 | Unfortunately the functions are a little inconsistent, which makes it hard to see this pattern. 62 | I think a dot prefix is better because it's easier to type (you don't have to hold down the shift key). 63 | 64 | ```{r} 65 | args(lapply) 66 | args(sapply) 67 | args(apply) 68 | args(mapply) 69 | args(tapply) 70 | ``` 71 | 72 | `Reduce()` and friends avoid the problem altogether by not accepting `...`, and requiring that the user create anonymous functions. 73 | But this is verbose, particularly without shortcuts to create functions. 74 | 75 | `transform()` goes a step further and uses a non-syntactic variable name. 76 | 77 | ```{r} 78 | args(transform) 79 | ``` 80 | 81 | Using a non-syntactic variable name means that it must always be surrounded in `` ` ``. 82 | This means that a user is even less likely to use it than with `.`, but it increases friction when writing the function. 83 | In my opinion, this trade-off is not worth it. 84 | 85 | ## What are the exceptions? 86 | 87 | - `tryCatch()`: the names give condition classes so, as long as you don't create a condition class called `expr` or `finally` (which would be weird!), you don't need to worry about matches. 88 | -------------------------------------------------------------------------------- /enumerate-options.qmd: 1 | # Enumerate possible options {#sec-enumerate-options} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ```{r} 9 | #| eval = FALSE, 10 | #| include = FALSE 11 | source("fun_def.R") 12 | pkg_funs("base") %>% funs_formals_keep(~ is_call(.x, "c")) 13 | 14 | has_several_ok <- function(x) { 15 | if (is_call(x, "match.arg")) { 16 | x <- call_standardise(x) 17 | isTRUE(x$several.ok) 18 | } else if (is_call(x)) { 19 | some(x[-1], has_several_ok) 20 | } else { 21 | FALSE 22 | } 23 | } 24 | pkg_funs("utils") %>% funs_body_keep(has_several_ok) 25 | ``` 26 | 27 | ## What's the pattern?
28 | 29 | If the possible values of an argument are a small set of strings, set the default argument to the set of possible values, and then use `match.arg()` or `rlang::arg_match()` in the function body. 30 | This convention advertises to the knowledgeable user[^enumerate-options-1] what the possible values are, and makes it easy to generate an informative error message for inappropriate inputs. 31 | This interface is often coupled with an implementation that uses `switch()`. 32 | 33 | [^enumerate-options-1]: The main downside of this technique is that many users aren't aware of this convention and that the first value of the vector will be used as a default. 34 | 35 | This convention makes it possible to advertise the possible set of values for an argument. 36 | The advertisement happens in the function specification, so you see it in tooltips and autocomplete, without having to look at the documentation. 37 | 38 | ## What are some examples? 39 | 40 | - In `difftime()`, `units` can be any one of "auto", "secs", "mins", "hours", "days", or "weeks". 41 | 42 | - In `format()`, `justify` can be "left", "right", "center", or "none". 43 | 44 | - In `trimws()`, you can choose `which` side to remove whitespace from: "both", "left", or "right". 45 | 46 | - `rank()` exposes six different methods for handling ties with the `ties.method` argument: "average", "first", "last", "random", "max", or "min". 47 | 48 | - `quantile()` exposes nine different approaches to computing a quantile through the `type` argument. 49 | 50 | - `p.adjust()` exposes eight strategies for adjusting P values to account for multiple comparisons through the `method` argument, whose default is the `p.adjust.methods` vector. 51 | 52 | ## How do I use this pattern? 53 | 54 | To use this technique, set the default value to a character vector, where the first value is the default.
57 | Inside the function, use `match.arg()` or `rlang::arg_match()` to check that the value comes from the known good set, and pick the default if none is supplied. 58 | 59 | Take `rank()`, for example. 60 | The heart of its implementation looks like this: 61 | 62 | ```{r} 63 | rank <- function( 64 | x, 65 | ties.method = c("average", "first", "last", "random", "max", "min") 66 | ) { 67 | 68 | ties.method <- match.arg(ties.method) 69 | 70 | switch(ties.method, 71 | average = , 72 | min = , 73 | max = .Internal(rank(x, length(x), ties.method)), 74 | first = sort.list(sort.list(x)), 75 | last = sort.list(rev.default(sort.list(x, decreasing = TRUE))), 76 | random = sort.list(order(x, stats::runif(length(x)))) 77 | ) 78 | } 79 | 80 | x <- c(1, 2, 2, 3, 3, 3) 81 | 82 | rank(x) 83 | rank(x, ties.method = "first") 84 | rank(x, ties.method = "min") 85 | ``` 86 | 87 | Note that `match.arg()` will automatically throw an error if the value is not in the set: 88 | 89 | ```{r} 90 | #| error = TRUE 91 | rank(x, ties.method = "middle") 92 | ``` 93 | 94 | It also supports partial matching so that the following code is shorthand for `ties.method = "random"`: 95 | 96 | ```{r} 97 | rank(x, ties.method = "r") 98 | ``` 99 | 100 | We prefer to avoid partial matching because while it saves a little time writing the code, it makes reading the code less clear. 101 | `rlang::arg_match()` is an alternative to `match.arg()` that doesn't support partial matching. 
102 | It instead provides a helpful error message: 103 | 104 | ```{r} 105 | #| error = TRUE 106 | rank2 <- function( 107 | x, 108 | ties.method = c("average", "first", "last", "random", "max", "min") 109 | ) { 110 | ties.method <- rlang::arg_match(ties.method) 111 | rank(x, ties.method = ties.method) 112 | } 113 | 114 | rank2(x, ties.method = "r") 115 | 116 | # It also provides a suggestion if you misspell the value 117 | rank2(x, ties.method = "avarage") 118 | ``` 119 | 120 | ### Escape hatch 121 | 122 | It's sometimes useful to build in an escape hatch from canned strategies. 123 | This allows users to access alternative strategies, and allows for experimentation that can later turn into official strategies. 124 | One example of such an escape hatch is in name repair, which occurs in many places throughout the tidyverse. 125 | One place you might encounter it is in `tibble()`: 126 | 127 | ```{r} 128 | #| error: true 129 | tibble::tibble(a = 1, a = 2) 130 | ``` 131 | 132 | Beneath the surface, all tidyverse functions that expose some sort of name repair eventually end up calling `vctrs::vec_as_names()`: 133 | 134 | ```{r} 135 | #| error: true 136 | vctrs::vec_as_names(c("a", "a"), repair = "check_unique") 137 | vctrs::vec_as_names(c("a", "a"), repair = "unique") 138 | vctrs::vec_as_names(c("a", "a"), repair = "unique_quiet") 139 | ``` 140 | 141 | `vec_as_names()` exposes six strategies, but it also allows you to supply a function: 142 | 143 | ```{r} 144 | vctrs::vec_as_names(c("a", "a"), repair = toupper) 145 | ``` 146 | 147 | ### How do I keep defaults short? 148 | 149 | This technique is best used when the set of possible values is short, as otherwise you run the risk of dominating the function spec with this one argument (@sec-defaults-short-and-sweet).
150 | If you have a long list of possibilities, there are three possible solutions: 151 | 152 | - Set a single default and supply the possible values to `match.arg()`/`arg_match()`: 153 | 154 | ```{r} 155 | rank2 <- function(x, ties.method = "average") { 156 | ties.method <- rlang::arg_match( 157 | ties.method, 158 | c("average", "first", "last", "random", "max", "min") 159 | ) 160 | } 161 | ``` 162 | 163 | - If the values are used by many functions, you can store the options in an exported vector: 164 | 165 | ```{r} 166 | ties.methods <- c("average", "first", "last", "random", "max", "min") 167 | 168 | rank2 <- function(x, ties.method = ties.methods) { 169 | ties.method <- rlang::arg_match(ties.method) 170 | } 171 | ``` 172 | 173 | For example, `stats::p.adjust()` uses `method = p.adjust.methods`, and `stats::pairwise.prop.test()`, `stats::pairwise.t.test()`, and `stats::pairwise.wilcox.test()` all use `p.adjust.method = p.adjust.methods`. 174 | 175 | - You can store the options in an exported named list[^enumerate-options-2]. 176 | That has the advantage that you can advertise both the source of the values and the default, and the user gets a nice auto-complete of the possible values. 177 | 178 | ```{r} 179 | library(rlang) 180 | ties <- as.list(set_names(c("average", "first", "last", "random", "max", "min"))) 181 | 182 | rank2 <- function(x, ties.method = ties$average) { 183 | ties.method <- arg_match(ties.method, names(ties)) 184 | } 185 | ``` 186 | 187 | [^enumerate-options-2]: Thanks to Brandon Loudermilk 188 | -------------------------------------------------------------------------------- /err-call.qmd: 1 | # Error call {#sec-err-call} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | Don't display the call when generating an error message. 9 | Either use `stop(call. = FALSE)` or `rlang::abort()` to avoid it. 10 | 11 | Why not?
12 | Typically doesn't display enough information to find the source of the call (since most errors are not from top-level function calls), and you can expect most people to either use RStudio or know how to call `traceback()`. 13 | -------------------------------------------------------------------------------- /err-constructor.qmd: 1 | # Error constructors {#sec-err-constructor} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the pattern? 9 | 10 | Following the rule of three, whenever you generate the same error in three or more places, you should extract it out into a common function, called an **error constructor**. 11 | This function should create a [custom condition](https://adv-r.hadley.nz/conditions.html#custom-conditions) that contains components that can easily be tested and a `conditionMessage()` method that generates user-friendly error messages. 12 | 13 | (This is a new pattern that we are currently rolling out across the tidyverse; it's currently found in few packages.) 14 | 15 | ```{r} 16 | #| label = "setup" 17 | library(rlang) 18 | ``` 19 | 20 | ## Why is this important? 21 | 22 | - If you don't use a custom condition, you can only check that your function has generated the correct error by matching the text of the error message with a regular expression. 23 | This is fragile because the text of error messages changes relatively frequently, causing spurious test failures. 24 | 25 | - You *can* use custom conditions for one-off errors, but generally the extra implementation work is not worth the payoff. 26 | That's why we recommend only using an error constructor for repeated errors. 27 | 28 | - It gives more precise control over error handling with `tryCatch()`. 29 | This is particularly useful in packages because you may be able to give more useful high-level error messages by wrapping a specific low-level error.
30 | 31 | - As you start using this technique for more error messages, you can create a hierarchy of errors that allows you to borrow behaviour, reducing the amount of code you need to write. 32 | 33 | - Once you have identified all the errors that can be thrown by a function, you can add a `@section Throws:` to the documentation that precisely describes the possible failure modes. 34 | 35 | ## What does an error constructor do? 36 | 37 | An error constructor is very similar to an [S3 constructor](https://adv-r.hadley.nz/s3.html#s3-constructor), as its job is to extract out repeated code and generate a rich object that can easily be computed with. 38 | The primary difference is that instead of creating and returning a new object, it creates a custom error and immediately throws it with `abort()`. 39 | 40 | Here's a simple imaginary error that might be thrown by [fs](http://fs.r-lib.org/) if it couldn't find a file: 41 | 42 | ```{r} 43 | stop_not_found <- function(path) { 44 | abort( 45 | .subclass = "fs_error_not_found", 46 | path = path 47 | ) 48 | } 49 | ``` 50 | 51 | Note the naming scheme: 52 | 53 | - The function should be called `stop_{error_type}`. 54 | 55 | - The error class should be `{package}_error_{error_type}`. 56 | 57 | The function should have one argument for each varying part of the error, and these arguments should be passed on to `abort()` to be stored in the condition object.
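Because the varying parts of the error are stored in the condition object, callers can handle it precisely by class with `tryCatch()`, without matching message text. A minimal sketch using the constructor above:

```{r}
#| eval: false
tryCatch(
  stop_not_found("a.csv"),
  fs_error_not_found = function(cnd) {
    # The structured `path` component is available to the handler
    message("Couldn't read ", cnd$path, "; falling back to defaults")
  }
)
```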
58 | 59 | To generate the error message shown to the user, provide a `conditionMessage()` method: 60 | 61 | ```{r} 62 | #' @export 63 | conditionMessage.fs_error_not_found <- function(c) { 64 | glue::glue_data(c, "'{path}' not found") 65 | } 66 | ``` 67 | 68 | ```{r} 69 | #| include = FALSE 70 | vctrs::s3_register("base::conditionMessage", "fs_error_not_found") 71 | ``` 72 | 73 | ```{r} 74 | #| eval = FALSE 75 | stop_not_found("a.csv") 76 | #> Error: 'a.csv' not found 77 | ``` 78 | 79 | This method must be exported, because you are defining a method for a generic in another package, and it will often use `glue::glue_data()` to assemble the components of the condition into a string. 80 | See for advice on writing the error message. 81 | 82 | ## How do I test? 83 | 84 | ```{r} 85 | library(testthat) 86 | ``` 87 | 88 | ### Test the constructor 89 | 90 | Firstly, you should test the error constructor. 91 | The primary goal of this test is to ensure that the error constructor generates a message that is useful to humans, which you cannot automate. 92 | This means that you cannot use a unit test (because the desired output is not known) and instead you need to use a regression test, so you can ensure that the message does not change unexpectedly. 93 | For that reason the best approach is usually to use [`verify_output()`](https://testthat.r-lib.org/reference/verify_output.html), e.g.: 94 | 95 | ```{r} 96 | #| eval = FALSE 97 | test_that("stop_not_found() generates useful error message", { 98 | verify_output(test_path("test-stop-not-found.txt"), { 99 | stop_not_found("a.csv") 100 | }) 101 | }) 102 | ``` 103 | 104 | This is useful for pull requests because `verify_output()` generates the complete error messages in a text file that can easily be read and reviewed. 105 | 106 | If your error has multiple arguments, or your `conditionMessage()` method contains `if` statements, you should generally attempt to cover them all in a test case.
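For example, suppose a hypothetical variant of `stop_not_found()` also took a `reason` argument that its `conditionMessage()` method used in an `if` statement (both the argument and the second call are invented for illustration); the regression test should then exercise each branch:

```{r}
#| eval = FALSE
test_that("stop_not_found() generates useful error messages", {
  verify_output(test_path("test-stop-not-found.txt"), {
    stop_not_found("a.csv")
    stop_not_found("a.csv", reason = "permission denied")
  })
})
```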
107 | 108 | ### Test usage 109 | 110 | Now that you have an error constructor, you'll need to slightly change how you test your functions that use the error constructor. 111 | For example, take this imaginary example for reading a file into a single string: 112 | 113 | ```{r} 114 | read_lines <- function(x) { 115 | if (!file.exists(x)) { 116 | stop_not_found(x) 117 | } 118 | paste0(readLines(x), collapse = "\n") 119 | } 120 | ``` 121 | 122 | Previously, you might have written: 123 | 124 | ```{r} 125 | expect_error(read_lines("missing-file.txt"), "not found") 126 | ``` 127 | 128 | But now, as you see, testthat gives you a warning that suggests you need to use the `class` argument instead: 129 | 130 | ```{r} 131 | expect_error(read_lines("missing-file.txt"), class = "fs_error_not_found") 132 | ``` 133 | 134 | This is less fragile because you can now change the error message without having to worry about breaking existing tests. 135 | 136 | If you also want to check components of the error object, note that `expect_error()` returns it: 137 | 138 | ```{r} 139 | cnd <- expect_error(read_lines("missing-file.txt"), class = "fs_error_not_found") 140 | expect_equal(cnd$path, "missing-file.txt") 141 | ``` 142 | 143 | I don't think this level of testing is generally important, so you should only use it when the error generation code is complex, or you have identified a bug. 144 | 145 | ## Error hierarchies 146 | 147 | As you start writing more and more error constructors, you may notice that you are starting to share code between them because the errors form a natural hierarchy.
148 | To take advantage of this hierarchy and reduce the amount of code you need to write, you can make the errors subclassable by adding `...` and `class` arguments: 149 | 150 | ```{r} 151 | stop_not_found <- function(path, ..., class = character()) { 152 | abort( 153 | .subclass = c(class, "fs_error_not_found"), 154 | path = path, 155 | ... 156 | ) 157 | } 158 | ``` 159 | 160 | Then the subclasses can call this constructor, and the problem becomes one of S3 class design. 161 | We currently have little experience with this, so use with caution. -------------------------------------------------------------------------------- /explicit-strategies.qmd: 1 | # Strategies {#sec-strategies-explicit} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | library(stringr) 7 | ``` 8 | 9 | If your function exposes multiple **implementation strategies**, make those explicit through a single argument that takes an [enumeration](#sec-enumerate-options). 10 | This makes it clear how to control the operation of your function and extends gracefully if you discover new strategies in the future. 11 | 12 | This part of the book goes into some of the details of and variations on this pattern: 13 | 14 | - You should consider this pattern even if there are only two variations and you're tempted to use `TRUE` and `FALSE` instead. @sec-boolean-strategies discusses why. 15 | - Sometimes, different strategies need different arguments and @sec-strategy-objects shows a useful pattern to achieve this. 16 | - Other times, the need for different arguments might suggest that you actually need different functions. That's the topic of @sec-strategy-functions, and then we dive into a big case study of that problem by looking at `rep()` in @sec-cs-rep.
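A minimal sketch of the pattern (the function and its strategies are invented for illustration; the enumeration technique itself is covered in @sec-enumerate-options):

```{r}
center <- function(x, method = c("mean", "median", "trimmed")) {
  method <- rlang::arg_match(method)
  switch(method,
    mean = mean(x),
    median = stats::median(x),
    trimmed = mean(x, trim = 0.1)
  )
}

# Defaults to the first strategy; others are selected by name
center(c(1, 2, 100))
center(c(1, 2, 100), method = "median")
```

If a new strategy is discovered later, it simply becomes another element of the enumeration, and no existing calls break.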
17 | 18 | ## See also 19 | 20 | - The original [strategy pattern](https://en.wikipedia.org/wiki/Strategy_pattern) defined in [Design Patterns](https://en.wikipedia.org/wiki/Design_Patterns). This pattern has a rather different implementation in a classic OOP language. 21 | -------------------------------------------------------------------------------- /fun_def.R: -------------------------------------------------------------------------------- 1 | library(purrr) 2 | library(rlang) 3 | 4 | # TODO: 5 | # * way to print function with body 6 | # * way to highlight arguments 7 | 8 | pkg_funs <- function(pkg) { 9 | env <- pkg_env(pkg) 10 | funs <- keep(as.list(env, sorted = TRUE), is_closure) 11 | 12 | map2(names(funs), funs, fun_def, pkg = pkg) 13 | } 14 | 15 | fun_call <- function(fun, ...) { 16 | call <- enexpr(fun) 17 | 18 | if (is.symbol(call)) { 19 | name <- as.character(call) 20 | env_name <- env_name(environment(fun)) 21 | pkg <- if (grepl("namespace:", env_name)) gsub("namespace:", "", env_name) else NULL 22 | } else if (is_call(call, "::", n = 2)) { 23 | name <- as.character(call[[3]]) 24 | pkg <- as.character(call[[2]]) 25 | } else { 26 | abort("Invalid input") 27 | } 28 | 29 | fun_def(name, fun, pkg = pkg, ...) 30 | } 31 | 32 | fun_def <- function(name, fun, pkg = NULL, highlight = NULL) { 33 | stopifnot(is_string(name)) 34 | stopifnot(is.function(fun)) 35 | 36 | new_fun_def( 37 | name = name, 38 | formals = as.list(formals(fun)), 39 | body = body(fun), 40 | pkg = pkg, 41 | highlight = highlight 42 | ) 43 | } 44 | 45 | new_fun_def <- function(name, formals, body, pkg = NULL, highlight = NULL) { 46 | structure( 47 | list( 48 | name = name, 49 | formals = formals, 50 | body = body, 51 | pkg = pkg, 52 | highlight = highlight 53 | ), 54 | class = "fun_def" 55 | ) 56 | } 57 | format.fun_def <- function(x, ...) 
{ 58 | if (is.null(x$pkg)) { 59 | call <- sym(x$name) 60 | } else { 61 | call <- call2("::", sym(x$pkg), sym(x$name)) 62 | } 63 | 64 | # Replace missing args with symbol 65 | formals <- x$formals 66 | if (!is.null(formals)) { 67 | is_missing <- map_lgl(formals, is_missing) 68 | formals[is_missing] <- syms(names(formals)[is_missing]) 69 | names(formals)[is_missing] <- "" 70 | } 71 | 72 | # Doesn't work because format escapes 73 | if (!is.null(x$highlight)) { 74 | embold <- names(formals) %in% x$highlight 75 | names(formals)[embold] <- cli::style_bold(names(formals)[embold]) 76 | } 77 | 78 | paste0(format(call2(call, !!!formals)), collapse = "\n") 79 | } 80 | 81 | print.fun_def <- function(x, ...) { 82 | cat(format(x, ...), "\n", sep = "") 83 | } 84 | 85 | funs_formals_keep <- function(x, .p) { 86 | keep(x, function(fn) some(fn$formals, .p)) 87 | } 88 | 89 | funs_body_keep <- function(.x, .p, ...) { 90 | keep(.x, function(fn) .p(fn$body, ...)) 91 | } 92 | has_call <- function(x, name) { 93 | if (is_call(x, name)) { 94 | TRUE 95 | } else if (is_call(x)) { 96 | some(x[-1], has_call, name = name) 97 | } else { 98 | FALSE 99 | } 100 | } 101 | -------------------------------------------------------------------------------- /function-names.qmd: 1 | # Function names 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | Follow the style guide (i.e. use `snake_case`). 9 | 10 | ## Nouns vs verbs 11 | 12 | In general, prefer verbs. 13 | Use imperative mood: `mutate()` not `mutated()`, `mutates()`, or `mutating()`; `do()` not `did()`, `does()`, or `doing()`; `hide()` not `hid()`, `hides()`, or `hiding()`. 14 | 15 | Exception: noun-y interfaces where you're building up a complex object like ggplot2 or recipes (verb-y interface in ggvis was a mistake).
16 | 17 | Nouns should be singular (`geom_point()` not `geom_points()`), simply because the pluralisation rules in English are complex. 18 | 19 | ## Function families 20 | 21 | Use prefixes to group functions together based on common input or common purpose. 22 | Prefixes are better than suffixes because of auto-complete. 23 | Examples: ggplot2, purrr. 24 | Counter example: shiny. 25 | 26 | Not sure about common prefixes for a package. 27 | Works well for stringr (esp. with stringi), forcats, xml2, and rvest. 28 | But there's only a limited number of short prefixes and I think it would break down if every package did it. 29 | 30 | Use suffixes for variations on a theme (e.g. `map_int()`, `map_lgl()`, `map_dbl()`; `str_locate()`, `str_locate_all()`). 31 | 32 | Strive for thematic unity in related functions. 33 | Can you make related functions rhyme? 34 | Or have the same number of letters? 35 | Or similar background (i.e. all Germanic origins vs. French)? 36 | 37 | ## Length 38 | 39 | Err on the side of too long rather than too short (reading is generally more important than writing). 40 | Autocomplete will mostly take care of the nuisance and you can always shorten later if you come up with a better name. 41 | (But it's hard to make a name longer later, and you may take up a good word that is a lot of work to reclaim.) 42 | 43 | The length of a name should be inversely proportional to its frequency of usage. 44 | Reserve very short words for functions that are likely to be used very frequently. 45 | 46 | ## Conflicts 47 | 48 | You can't expect to avoid conflicts with every existing CRAN package, but you should strive to avoid conflicts with "nearby" packages (i.e. packages that are commonly used with your package). 49 | 50 | ## Techniques 51 | 52 | - Thesaurus 53 | - List of common verbs 54 | - Rhyming dictionary 55 | 56 | ## Other good advice 57 | 58 | - [I Shall Call It..
SomethingManager](https://blog.codinghorror.com/i-shall-call-it-somethingmanager/) 59 | - [The Poetry of Function Naming](http://blog.stephenwolfram.com/2010/10/the-poetry-of-function-naming/) 60 | -------------------------------------------------------------------------------- /glossary.qmd: -------------------------------------------------------------------------------- 1 | # Glossary 2 | 3 | 4 | -------------------------------------------------------------------------------- /identity-strategy.qmd: -------------------------------------------------------------------------------- 1 | # The I()dentity strategy {#sec-identity-strategy} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | 7 | # Code search: 8 | # 9 | ``` 10 | 11 | ## What's the pattern? 12 | 13 | One simple, but convenient, strategy is to use the base `I()` function to create objects of class `AsIs`. 14 | These are useful for representing values that should remain as is, when the default might be to change them in some way. 15 | 16 | ## What are some examples? 17 | 18 | There are two places that you can use `I()` in base R: 19 | 20 | - When creating a data frame, you can use it to request that `data.frame()` not transform the column in any way. 
21 | It's one way you can create a list-column in base R: 22 | 23 | ```{r} 24 | #| error: true 25 | x <- list(1, 2:3, 4:6) 26 | 27 | # By default, if you give a data frame a list it will try to make 28 | # each element a column: 29 | data.frame(x = x) 30 | 31 | # But if you wrap it in `I()` it will become a list-column 32 | data.frame(x = I(x)) 33 | ``` 34 | 35 | - When fitting a linear model, `I()` allows you to escape the usual Wilkinson-Rogers interpretation of addition and multiplication and instead create a transformed input: 36 | 37 | ```{r} 38 | #| eval: false 39 | 40 | # fit a model with three terms: x, y, x:y 41 | lm(z ~ x * y) 42 | 43 | # fit a model with one term: x * y 44 | lm(z ~ I(x * y)) 45 | ``` 46 | 47 | You'll see `I()` used in a variety of places in the tidyverse: 48 | 49 | - In readr, you can use `I()` to indicate that you are supplying a string containing the literal data, rather than a path giving where to find the data: 50 | 51 | ```{r} 52 | #| message: false 53 | readr::read_csv(I("x,y\n1,2")) 54 | ``` 55 | 56 | - In ggplot2, you can use it to indicate that the values don't need to be transformed; they're the literal aesthetic values already. 57 | For example, compare the following two plots: 58 | 59 | ```{r} 60 | #| layout-ncol: 2 61 | #| fig-width: 3 62 | #| fig-height: 3 63 | #| fig-alt: > 64 | #| Two bar plots. In the first plot the bars are coloured blue-green and 65 | #| a pinkish red, using the default ggplot2 colour scale. In the 66 | #| second plot, the bars are coloured red and bright green, using 67 | #| the literal R "red" and "green" colours. The first plot has a 68 | #| legend; the second does not.
69 | #| 70 | library(ggplot2) 71 | 72 | df <- data.frame(x = 1:2, colour = c("red", "green")) 73 | 74 | df |> ggplot(aes(x, fill = colour)) + geom_bar() 75 | df |> ggplot(aes(x, fill = I(colour))) + geom_bar() 76 | ``` 77 | 78 | - httr2, a tool for generating HTTP requests, will automatically escape special characters when constructing a URL. 79 | You can use `I()` to say that you've already escaped the string and it doesn't need further escaping. 80 | 81 | ### How can I use it? 82 | 83 | `I()` adds the `"AsIs"` class to the object it wraps, so you can detect if `I()` has been used by checking for `inherits(x, "AsIs")`. 84 | 85 | It's best used for simple cases where there are two possible interpretations for an argument and one of them corresponds to leaving the input untransformed or unaltered in some way. 86 | For example, you could imagine using it instead of `fixed()` in stringr if the only choice were between regular expressions and fixed strings, but it's not quite powerful enough. 87 | 88 | If you're using `I()` for escaping, it's good practice to wrap any escaped values in `I()` to indicate that you've escaped them. 89 | That ensures that you never accidentally double-escape an input. 90 | For example, this is how you might write code like what httr2 uses to escape query parameters. 91 | 92 | ```{r} 93 | escape_params <- function(x) { 94 | if (inherits(x, "AsIs")) { 95 | x 96 | } else { 97 | I(curl::curl_escape(x)) 98 | } 99 | } 100 | 101 | x <- escape_params("Good morning") 102 | x 103 | ``` 104 | 105 | Wrapping the output in `I()` ensures that no matter how many times we call `escape_params()` the string is only escaped once. 106 | This is a particularly useful property as your code starts to get more complicated.
107 | 108 | ```{r} 109 | escape_params(x) 110 | ``` 111 | 112 | You can see here one of the downsides of using `I()`: the printed output of a wrapped object is no different from the object itself, leading to potentially confusing behaviour when two seemingly identical inputs yield different outputs. 113 | -------------------------------------------------------------------------------- /implicit-strategies.qmd: -------------------------------------------------------------------------------- 1 | # Implicit strategies {#sec-implicit-strategies} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the pattern? 9 | 10 | There are two implicit strategies that are sometimes useful. 11 | I call them implicit because you don't select them explicitly with a single argument, but instead select between them based on the presence and absence of different arguments. 12 | As you might guess, this can make for a confusing interface, but it is occasionally the best option. 13 | 14 | - With **mutually exclusive arguments** you select between two strategies based on whether you supply argument `a` or argument `b`. 15 | - With **compound objects** you select between two strategies based on whether you supply one complex object (e.g. a data frame) or multiple simple objects (e.g. vectors). I think the most compelling reason to use this pattern is when a function might be called directly by a user (who will supply individual arguments) or with the output from another function (which delivers everything in a single object). 16 | 17 | The main challenge with using these patterns is that you can't make them clear from the function signature alone, so you need to carefully document and check inputs yourself. 18 | They are also likely to be surprising to the user as they are relatively rare patterns.
19 | So before using either of these techniques you should try using an explicit strategy via an enum (@sec-enumerate-options), using separate functions (@sec-strategy-functions), or using strategy objects (@sec-strategy-objects). 20 | @sec-cs-rvest explores these options from the perspective of `rvest::read_html()`. 21 | 22 | ## What are some examples? 23 | 24 | ### Mutually exclusive arguments 25 | 26 | - `cutree()` is an example where I think mutually exclusive arguments shine: it's so simple that it's easy to remember that you supply exactly one of `k` (the number of groups) or `h` (the height at which to cut the tree). 27 | 28 | - In `ggplot2::scale_x_date()` and friends you can specify the breaks and labels either with `breaks` and `labels` (like all other scale functions) or with `date_breaks` and `date_labels`. 29 | If you set both values in a pair, the `date_` version wins. 30 | 31 | - `forcats::fct_other()` allows you to either `keep` or `drop` specified factor values. 32 | If you supply neither, or both, you get an error. 33 | 34 | - `dplyr::relocate()` has optional `.before` and `.after` arguments. 35 | 36 | ### Compound objects 37 | 38 | - For example, it seems reasonable that you should be able to feed the output of `str_locate()` directly into `str_sub()`: 39 | 40 | ```{r} 41 | library(stringr) 42 | 43 | x <- c("aaaaab", "aaab", "ccccb") 44 | loc <- str_locate(x, "a+b") 45 | 46 | str_sub(x, loc) 47 | ``` 48 | 49 | But equally, it's nice to be able to supply individual start and end values when calling it directly: 50 | 51 | ```{r} 52 | str_sub("Hadley", start = 2, end = 4) 53 | ``` 54 | 55 | So `str_sub()` allows either individual vectors supplied to `start` and `end`, or a two-column matrix supplied to `start`. 56 | 57 | - `options(list(a = 1, b = 2))` is equivalent to `options(a = 1, b = 2)`. 58 | This is half of a very useful pattern. 59 | The other half of that pattern is that `options()` returns the previous value of any options that you set. 60 | That means you can do `old <- options(…); options(old)` to temporarily set options within a function.
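    Here's a minimal sketch of that idiom inside a function (illustrative only):

    ```{r}
    show_pi <- function() {
      old <- options(digits = 3) # returns the *previous* value of digits
      on.exit(options(old))      # restore it, even if an error occurs
      print(pi)
    }
    show_pi()
    pi # the global option is untouched
    ```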
61 | 62 | `withr::local_options()` and `withr::local_envvar()` work similarly: you can either supply a single list of values, or individually named values. 63 | But they do it with different arguments. 64 | 65 | - Another place that this pattern crops up is in `dplyr::bind_rows()`. 66 | When binding rows together, it's equally useful to bind a few named data frames as it is to bind a list of data frames that comes from `map()` or similar. 67 | In base R you need to know about `do.call(rbind, mylist)`, which is a relatively sophisticated pattern. 68 | So in dplyr we tried to make `bind_rows()` automatically figure out which of the two situations you were in. 69 | Unfortunately, it turns out to be really hard to tell which of the situations you are in, so dplyr implemented heuristics that work most of the time but occasionally fail in surprising ways. 70 | 71 | Now we have generally steered away from interfaces that try to automatically "unsplice" their inputs and instead require that you use `!!!` to explicitly unsplice. 72 | This has some advantages and disadvantages: it's an interface that's becoming increasingly common in the tidyverse (and we have a good convention for documenting it with the `` tag), but it's still relatively rare and is an advanced technique that we don't expect everyone to learn. 73 | That's why for this important case, we also have `purrr::list_rbind()`. 74 | 75 | But it means that functions like `purrr::hoist()`, `forcats::fct_cross()`, and `rvest::html_form()`, which are less commonly given lists, have a clearly documented escape hatch that doesn't require another different function. 76 | (And of course if you understand the `do.call()` pattern you can still use that too.) 77 | 78 | ## How do you use this pattern? 79 | 80 | ### Mutually exclusive arguments 81 | 82 | If a function needs to have mutually exclusive arguments (i.e.
you must supply only one of them) make sure you check that only one is supplied in order to give a clear error message. 83 | Avoid implementing some precedence order where if both `a` and `b` are supplied, `b` silently wins. 84 | The easiest way to do this is to use `rlang::check_exclusive()`. 85 | 86 | (In the case of required arguments, you might want to consider putting them after `…`. This violates @sec-dots-after-required, but forces the user to name the arguments, which will make the code easier to read.) 87 | 88 | If you must pick one of the two mutually exclusive arguments, make their defaults empty. 89 | Otherwise, if they're optional, give them `NULL` defaults. 90 | 91 | ```{r} 92 | #| error: true 93 | 94 | fct_drop <- function(f, drop, keep) { 95 | rlang::check_exclusive(drop, keep) 96 | } 97 | 98 | fct_drop(factor()) 99 | 100 | fct_drop(factor(), keep = "a", drop = "b") 101 | ``` 102 | 103 | (If the arguments are optional, you'll need `.require = FALSE` until ) 104 | 105 | ::: {.callout-note collapse="true"} 106 | ## With base R 107 | 108 | If you don't want to use rlang, you can implement this yourself with `xor()` and `missing()`: 109 | 110 | ```{r} 111 | #| eval: false 112 | 113 | fct_drop <- function(f, drop, keep) { 114 | if (!xor(missing(keep), missing(drop))) { 115 | stop("Exactly one of `keep` and `drop` must be supplied") 116 | } 117 | } 118 | fct_drop(factor()) 119 | 120 | fct_drop(factor(), keep = "a", drop = "b") 121 | ``` 122 | ::: 123 | 124 | In the documentation, document the pair of arguments together, and make it clear that only one of the pair can be supplied: 125 | 126 | ```{r} 127 | #' @param keep,drop Pick one of `keep` and `drop`: 128 | #' * `keep` will preserve listed levels, replacing all others with 129 | #' `other_level`. 130 | #' * `drop` will replace listed levels with `other_level`, keeping all 131 | #' others as is.
132 | ``` 133 | 134 | ### Compound arguments 135 | 136 | To implement this pattern in your own functions, you should branch on the type of the first argument and then check that the others aren't supplied. 137 | 138 | ```{r} 139 | str_sub <- function(string, start, end) { 140 | if (is.matrix(start)) { 141 | if (!missing(end)) { 142 | rlang::abort("`end` must be missing when `start` is a matrix") 143 | } 144 | if (ncol(start) != 2) { 145 | rlang::abort("Matrix `start` must have exactly two columns") 146 | } 147 | stringi::stri_sub(string, from = start[, 1], to = start[, 2]) 148 | } else { 149 | stringi::stri_sub(string, from = start, to = end) 150 | } 151 | } 152 | ``` 153 | 154 | And make it clear in the documentation: 155 | 156 | ```{r} 157 | #' @param start,end Integer vectors giving the `start` (default: first) 158 | #' and `end` (default: last) positions, inclusively. 159 | #' 160 | #' Alternatively, you can pass a two-column matrix to `start`, i.e. 161 | #' `str_sub(x, start, end)` is equivalent to 162 | #' `str_sub(x, cbind(start, end))` 163 | ``` 164 | 165 | (If you look at `stringr::str_sub()` you'll notice that `start` and `end` do have defaults; I think this is a mistake because `start` and `end` are important enough that the user should always be forced to supply them.) -------------------------------------------------------------------------------- /important-args-first.qmd: -------------------------------------------------------------------------------- 1 | # Put the most important arguments first {#sec-important-args-first} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the pattern? 9 | 10 | In a function call, the most important arguments should come first. 11 | As a general rule, the most important arguments will be the ones that are used most often, but that's often hard to tell until your function has existed in the wild for a while.
12 | Fortunately, there are a few rules of thumb that can help: 13 | 14 | - If the output is a transformation of an input (e.g. `log()`, `stringr::str_replace()`, `dplyr::left_join()`) then that argument is the most important. 15 | - Other arguments that determine the type or shape of the output are typically very important. 16 | - Optional arguments (i.e. arguments with a default) are the least important, and should come last. 17 | 18 | This convention makes it easy to understand the structure of a function at a glance: the more important an argument is, the earlier you'll see it. 19 | When the output is very strongly tied to an input, putting that argument first also ensures that your function works well with the pipe, leading to code that focuses on the transformations rather than the object being transformed. 20 | 21 | ## What are some examples? 22 | 23 | The vast majority of functions get this right, so we'll pick on a few examples which I think get it wrong: 24 | 25 | - I think the arguments to base R string functions (`grepl()`, `gsub()`, etc) are in the wrong order because they consistently make the regular expression (`pattern`) the first argument, rather than the character vector being manipulated (`x`). 26 | 27 | - The first two arguments to `lm()` are `formula` and `data`. 28 | I'd argue that `data` should be the first argument; while it doesn't affect the shape of the output, which is always an `lm` S3 object, it does affect the shape of the output of many important functions like `predict()`. 29 | However, the designers of `lm()` wanted `data` to be optional, so you could still fit models even if you hadn't collected the individual variables into a data frame. 30 | Because `formula` is required and `data` is not, this means that `formula` had to come first. 31 | 32 | - The first two arguments to `ggplot()` are `data` and `mapping`. 33 | Both data and mapping are required for every plot, so why make `data` first?
34 | I picked this ordering because in most plots there's one dataset shared across all layers and only the mapping changes. 35 | 36 | On the other hand, the layer functions, like `geom_point()`, flip the order of these arguments because in an individual layer you're more likely to specify `mapping` than `data`, and in many cases if you do specify `data` you'll want `mapping` as well. 37 | This makes the argument order inconsistent with `ggplot()`, but overall supports the most common use cases. 38 | 39 | - ggplot2 functions work by creating an object that is then added on to a plot, so the plot, which is really the most important argument, is not obvious at all. 40 | ggplot2 works this way in part because it was written before the pipe was discovered, and the best way I came up with to define plots from left to right was to rely on `+` (so-called operator overloading). 41 | As an interesting historical fact, ggplot (the precursor to ggplot2) actually works great with the pipe, and a couple of years ago I brought it back to life as [ggplot1](https://github.com/hadley/ggplot1). 42 | 43 | ## How do I remediate past mistakes? 44 | 45 | Generally, it is not possible to change the order of the first few arguments because it will break existing code (since these are the arguments that are most likely to be used unnamed). 46 | This means that the only real solution is to deprecate the entire function and replace it with a new one. 47 | Because this is invasive to the user, it's best to do this sparingly: if the mistake is minor, you're better off waiting until you've collected other problems before fixing it. 48 | For example, take `tidyr::gather()`. 49 | It has a number of problems with its design, including the argument order, that make it harder to use. 50 | Because it wasn't possible to easily fix this mistake, we accumulated other `gather()` problems for several years before fixing them all at once in `pivot_longer()`.
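A minimal sketch of what that remediation can look like (the function and argument names here are hypothetical, and this is not how `pivot_longer()` was actually introduced): keep the old function around as a thin wrapper that warns and then forwards to the replacement.

```{r}
#| eval: false
# Hypothetical: `reshape_old()` put `spec` before `data`; its
# replacement `reshape_new()` puts `data` first. The old function
# warns, then forwards all of its arguments to the new one.
reshape_new <- function(data, spec, names_to = "name") {
  # ... new implementation ...
}

reshape_old <- function(spec, data, names_to = "name") {
  lifecycle::deprecate_warn("1.0.0", "reshape_old()", "reshape_new()")
  reshape_new(data, spec, names_to = names_to)
}
```

Existing callers keep working (with a warning) while new code can adopt the better argument order.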
51 | 52 | ## See also 53 | 54 | - @sec-dots-after-required: If the function uses `…`, it should come in between the required and optional arguments. 55 | -------------------------------------------------------------------------------- /independent-meaning.qmd: -------------------------------------------------------------------------------- 1 | # Argument meaning should be independent {#sec-independent-meaning} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the problem? 9 | 10 | Avoid having one argument change the interpretation of another argument. 11 | This makes it harder to understand a call, because as you read it, you might need to go back and re-interpret an earlier argument. 12 | This sort of call can lead to code that reads like a [garden path sentence](https://en.wikipedia.org/wiki/Garden-path_sentence), a human language problem where you need to re-parse a sentence when you get to the end of it. 13 | For example, in "the horse raced past the barn fell", your initial understanding of "raced" needs to be modified when you get to the end of the sentence in order for it to make sense. 14 | 15 | Another way you will come across this problem is when only certain combinations of arguments are allowed or when one argument is ignored if another argument has a certain value. 16 | 17 | ## What are some examples? 18 | 19 | There aren't too many examples of one argument changing the meaning of another argument, but here are a few I dug up: 20 | 21 | - In `library()` the `character.only` argument changes how the `package` argument is interpreted: 22 | 23 | ```{r} 24 | #| eval = FALSE 25 | package <- "dplyr" 26 | 27 | # Loads a package called "package" 28 | library(package) 29 | 30 | # Loads dplyr 31 | library(package, character.only = TRUE) 32 | ``` 33 | 34 | - In `ggplot2::geom_text()` setting `parse = TRUE` causes the contents of the label aesthetic to be interpreted as mathematical equations, rather than simple text.
35 | 36 | - In `install.packages()` setting `repos = NULL` changes the interpretation of `pkgs` from being a vector of package names to a vector of file paths. 37 | 38 | - In `findInterval()` if you set `left.open = TRUE` then the `rightmost.closed` argument actually controls whether or not the *left*most interval is closed. 39 | 40 | A subtler example of this problem also arises in `grepl()` and friends where you can't fully interpret the pattern until you have seen if the `fixed` argument is set. 41 | This is one of the patterns that heavily influenced the design of stringr, and is discussed more in @sec-strategy-objects. 42 | 43 | There are quite a few functions that only allow certain combinations of arguments: 44 | 45 | - `read.table()` allows you to supply data with either a path to a `file`, or with inline `text`. If you supply both, `file` wins. 46 | 47 | ## How do I remediate past mistakes? 48 | 49 | There isn't a single solution to this problem, and remediating it will require a situation-dependent technique. 50 | For example, each of the cases above requires a different technique: 51 | 52 | - `library()` could use the same mechanism as `help()`, where `help((topic))`[^independent-meaning-1] will always look for the topic recorded in the `topic` variable, rather than the topic literally called "topic". 53 | 54 | - Instead of `geom_text(parse = TRUE)`, maybe it would be better to have `geom_equation()`. 55 | However, this change would be challenging because ggplot2 has another function with the parse argument: `geom_label()`, which is like `geom_text()` but draws a rectangle behind the text, often making it easier to read. 56 | Maybe it would be better to make this an argument (`background = TRUE`) but that would leave three arguments (`label.r`, `label.padding`, `label.size`) that only make sense when `background = TRUE`.
57 | So maybe we could do something like `background = label()`, where the new `label()` function would have `r`, `padding`, and `size` arguments. 58 | This would also make it possible to specify different types of backgrounds. 59 | 60 | - `install.packages()` feels like a function that has grown organically: it started out simple, but gained more and more features over time. 61 | I suspect improving the design would involve recognising @sec-strategy-functions and breaking it apart into multiple functions, possibly using @sec-argument-clutter for the common, less important arguments. 62 | 63 | - `findInterval()` could be fixed by using an argument name that isn't direction specific. 64 | One possible option would be `extremum.closed`. 65 | Extremum is a technical term that most people probably aren't familiar with, but it's also a fairly uncommon argument in a rarely used function, so it's probably fine. 66 | 67 | [^independent-meaning-1]: If `library()` were a tidyverse function it would use tidyeval, and so you'd write `library(!!package)` if you wanted to refer to the package name stored in the `package` variable. 68 | 69 | Cases where arguments have complex dependencies often require techniques from the "Strategies" part of the book. -------------------------------------------------------------------------------- /index.qmd: -------------------------------------------------------------------------------- 1 | # Welcome {.unnumbered} 2 | 3 | The goal of this book is to help you write better R code. 4 | It has four main components: 5 | 6 | - Identifying design **challenges** that often lead to suboptimal outcomes. 7 | 8 | - Introducing useful **patterns** that help solve common problems. 9 | 10 | - Defining key **principles** that help you balance conflicting patterns. 11 | 12 | - Discussing **case studies** that help you see how all the pieces fit together with real code.
13 | 14 | While I've called these principles "tidy" and they're used extensively by the tidyverse team to promote consistency across our packages, they're not exclusive to the tidyverse. 15 | Think tidy in the sense of tidy data (broadly useful regardless of what tool you're using) not tidyverse (a collection of functions designed with a singular point of view in order to facilitate learning and use). 16 | 17 | This book will be under heavy development for quite some time; currently we are loosely aiming for completion in 2025. 18 | You'll find many chapters contain disjointed text that mostly serves as placeholders for the authors, and I do not recommend attempting to systematically read the book at this time. 19 | If you'd like to follow along with my journey writing this book, and learn which chapters are ready to read, please sign up for my [tidy design substack mailing list](http://tidydesign.substack.com/). -------------------------------------------------------------------------------- /inputs-explicit.qmd: -------------------------------------------------------------------------------- 1 | # Make inputs explicit {#sec-inputs-explicit} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the problem? 9 | 10 | A function is easier to understand if its output depends only on its inputs (i.e. its arguments). 11 | If a function returns different results with the same inputs, then some inputs must be implicit, typically because the function relies on an option or some locale setting. 12 | Implicit inputs are not always bad, as some functions, like `Sys.time()`, `read.csv()`, and the random number generators, fundamentally depend on them. 13 | But they should be used as sparingly as possible, and never when they're not related to the core purpose of the function. 14 | 15 | Explicit arguments make code easier to understand because you can see what will affect the outputs just by reading the code; you don't need to run it.
16 | Implicit arguments can lead to code that returns different results on different computers, and the differences are usually hard to track down. 17 | 18 | ## What are some examples? 19 | 20 | One common source of hidden arguments is the use of global options: 21 | 22 | - Historically, the worst offender was the `stringsAsFactors` option, which changed how a number of functions[^inputs-explicit-1] treated character vectors. 23 | This option was part of a multi-year process to move R toward character vectors and away from factors. 24 | You can learn more in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) by Roger Peng and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. 25 | 26 | - `lm()`'s handling of missing values depends on the global `na.action` option. 27 | The default is `na.omit`, which drops the missing values prior to fitting the model (which is inconvenient because then the results of `predict()` don't line up with the input data). 28 | 29 | [^inputs-explicit-1]: Such as `data.frame()`, `as.data.frame()`, and `read.csv()`. 30 | 31 | Another common source of subtle bugs is relying on the system **locale**, i.e. the country and language specific settings controlled by your operating system. 32 | Relying on the system locale is always done with the best of intentions (you want your code to respect the user's preferences) but can lead to subtle differences when the same code is run by different people. 33 | Here are a few examples: 34 | 35 | - `strptime()` relies on the names of weekdays and months in the current locale. 36 | That means `strptime("1 Jan 2020", "%d %b %Y")` will work on computers with an English locale, and fail elsewhere. 37 | 38 | - `as.POSIXct()` depends on the current timezone.
39 | The following code returns different underlying times when run on different computers: 40 | 41 | ```{r} 42 | as.POSIXct("2020-01-01 09:00") 43 | ``` 44 | 45 | - `toupper()` and `tolower()` depend on the current locale. 46 | It is fairly uncommon for this to cause problems because most languages either use their own character set, or use the same rules for capitalisation as English. 47 | However, this behaviour did cause a bug in ggplot2 because internally it takes `geom = "identity"` and turns it into `GeomIdentity` to find the object that actually does computation. 48 | In Turkish, however, the upper case version of i is İ, and `Geomİdentity` does not exist. 49 | This meant that for some time ggplot2 did not work on Turkish computers. 50 | 51 | - `sort()` and `order()` rely on the lexicographic order (i.e. how different alphabets sort their letters) defined by the current locale. 52 | `lm()` automatically converts character vectors to factors with `factor()`, which uses `order()`, which means that it's possible for the coefficients to vary[^inputs-explicit-2] if your code is run in a different country! 53 | 54 | [^inputs-explicit-2]: Predictions and other diagnostics won't be affected, but you're likely to be surprised that your coefficients are different. 55 | 56 | ## How can I remediate the problem? 57 | 58 | At some level, implicit inputs are easy to avoid when creating new functions: just don't use the locale or global options! 59 | But it's easy for such problems to creep in indirectly, when you call a function not knowing that it has hidden inputs. 60 | The best way to prevent that is to consult the list of common offenders provided above. 61 | 62 | ### Make an option explicit 63 | 64 | If you want to depend on an option or locale, make sure it's an explicit argument. 65 | Such arguments generally should not affect computation (@sec-def-user), just side-effects like printed output or status messages.
66 | If they do affect results, follow @sec-def-inform to make sure the user knows what's happening. 67 | For example, let's take `as.POSIXct()`, which basically looks something like this: 68 | 69 | ```{r} 70 | as.POSIXct <- function(x, tz = "") { 71 | base::as.POSIXct(x, tz = tz) 72 | } 73 | as.POSIXct("2020-01-01 09:00") 74 | ``` 75 | 76 | The `tz` argument is present, but it's not obvious that `""` means the current time zone. 77 | Let's first make that explicit: 78 | 79 | ```{r} 80 | as.POSIXct <- function(x, tz = Sys.timezone()) { 81 | base::as.POSIXct(x, tz = tz) 82 | } 83 | as.POSIXct("2020-01-01 09:00") 84 | ``` 85 | 86 | Since this is an important default whose value can change, we also print it out if the user hasn't explicitly set it: 87 | 88 | ```{r} 89 | as.POSIXct <- function(x, tz = Sys.timezone()) { 90 | if (missing(tz)) { 91 | message("Using `tz = \"", tz, "\"`") 92 | } 93 | base::as.POSIXct(x, tz = tz) 94 | } 95 | as.POSIXct("2020-01-01 09:00") 96 | ``` 97 | 98 | Since most people don't like lots of random output this provides a subtle incentive to supply the timezone: 99 | 100 | ```{r} 101 | as.POSIXct("2020-01-01 09:00", tz = "America/Chicago") 102 | ``` 103 | 104 | ### Temporarily adjust global state 105 | 106 | If you're calling a function with implicit arguments and those implicit arguments are causing problems with your code, you can always work around them by temporarily changing the global state which it uses. 107 | The easiest way to do so is to use the [withr](https://withr.r-lib.org) package, which provides a variety of tools to temporarily change global state. 108 | 109 | ## See also 110 | 111 | - @sec-def-user and @sec-def-inform: how to make an option as explicit as possible. 112 | - @sec-spooky-action: where a function changes global state in a surprising way.
113 | -------------------------------------------------------------------------------- /out-invisible.qmd: -------------------------------------------------------------------------------- 1 | # Side-effect functions should return invisibly {#sec-out-invisible} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the pattern? 9 | 10 | If a function is called primarily for its side-effects, it should invisibly return a useful output. 11 | If there's no obvious output, return the first argument. 12 | This makes it possible to use the function within a pipeline. 13 | 14 | ## What are some examples? 15 | 16 | ```{r} 17 | #| eval = FALSE, 18 | #| include = FALSE 19 | source("fun_def.R") 20 | pkg_funs("base") %>% 21 | funs_body_keep(has_call, "invisible") %>% 22 | discard(~ grepl("print", .x$name)) 23 | ``` 24 | 25 | - `print(x)` invisibly returns the printed object. 26 | 27 | - `x <- y` invisibly returns `y`. 28 | This is what makes it possible to chain together multiple assignments: `x <- y <- z <- 1`. 29 | 30 | - `readr::write_csv()` invisibly returns the data frame that was saved. 31 | 32 | - `purrr::walk()` invisibly returns the vector iterated over. 33 | 34 | - `fs::file_copy(from, to)` returns `to`. 35 | 36 | - `options()` and `par()` invisibly return the previous values so you can reset with `on.exit()`. 37 | 38 | ## Why is it important? 39 | 40 | Invisibly returning the first argument allows you to call the function mid-pipe for its side-effects while allowing the primary data to continue flowing through the pipe. 41 | This is useful for generating intermediate diagnostics, or for saving multiple output formats.
42 | 43 | ```{r} 44 | library(dplyr, warn.conflicts = FALSE) 45 | library(tibble) 46 | 47 | mtcars %>% 48 | as_tibble() %>% 49 | dplyr::filter(cyl == 6) %>% 50 | print() %>% 51 | group_by(vs) %>% 52 | summarise(mpg = mean(mpg)) 53 | ``` 54 | 55 | ```{r} 56 | library(readr) 57 | 58 | mtcars %>% 59 | write_csv("mtcars.csv") %>% 60 | write_tsv("mtcars.tsv") 61 | 62 | unlink(c("mtcars.csv", "mtcars.tsv")) 63 | ``` 64 | 65 | ```{r} 66 | library(fs) 67 | 68 | paths <- file_temp() %>% 69 | dir_create() %>% 70 | path(letters[1:5]) %>% 71 | file_create() 72 | paths 73 | ``` 74 | 75 | Functions that modify some global state, like `options()` or `par()`, should return the *previous* value of the variables. 76 | This, in combination with the compound argument pattern from @sec-implicit-strategies, makes it possible to easily reset the effect of the change: 77 | 78 | ```{r} 79 | x <- runif(1) 80 | old <- options(digits = 3) 81 | x 82 | 83 | options(old) 84 | x 85 | ``` 86 | -------------------------------------------------------------------------------- /out-multi.qmd: -------------------------------------------------------------------------------- 1 | # Returning multiple values {#sec-out-multi} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## Different sizes 9 | 10 | Use a list. 11 | Name it. 12 | 13 | If you return the same type of output from multiple functions, you should create a helper function that consistently creates exactly the same format (to avoid accidental inconsistency), and consider making it an S3 class (so you can have a custom print method). 14 | 15 | ## Same size 16 | 17 | When a function returns two vectors of the same size, as a general rule you should return a tibble: 18 | 19 | - A matrix would only work if the vectors were the same type (and not factor or Date), doesn't make it easy to extract the individual values, and is not easily input to other tidyverse functions.
20 | 21 | - A list doesn't capture the constraint that both vectors are the same length. 22 | 23 | - A data frame is ok if you don't want to take a dependency on tibble, but you need to remember the drawbacks: if the columns are character vectors you'll need to remember to use `stringsAsFactors = FALSE`, and the print method is confusing for list- and df-cols (and you have to create them by modifying an existing data frame, not by calling `data.frame()`). 24 | (Example: it would be weird if glue returned tibbles from a function.) 25 | 26 | ## Case study: `str_locate()` 27 | 28 | e.g. `str_locate()`, `str_locate_all()` 29 | 30 | Interaction with `str_sub()`. 31 | -------------------------------------------------------------------------------- /out-type-stability.qmd: -------------------------------------------------------------------------------- 1 | # Type-stability {#sec-out-type-stability} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | The less you need to know about a function's inputs to predict the type of its output, the better. 9 | Ideally, a function should either always return the same type of thing, or return something that can be trivially computed from its inputs. 10 | 11 | If a function is **type-stable** it satisfies two conditions: 12 | 13 | - You can predict the output type based only on the input types (not their values). 14 | 15 | - If the function uses `...`, the order of arguments in `...` does not affect the output type. 16 | 17 | ```{r} 18 | #| label = "setup" 19 | library(vctrs) 20 | ``` 21 | 22 | ## Simple examples 23 | 24 | - `purrr::map()` and `base::lapply()` are trivially type-stable because they always return lists. 25 | 26 | - `paste()` is type-stable because it always returns a character vector. 27 | 28 | ```{r} 29 | vec_ptype(paste(1)) 30 | vec_ptype(paste("x")) 31 | ``` 32 | 33 | - `base::mean(x)` almost always returns the same type of output as `x`.
34 | For example, the mean of a numeric vector is a numeric vector, and the mean of a date-time is a date-time. 35 | 36 | ```{r} 37 | vec_ptype(mean(1)) 38 | vec_ptype(mean(Sys.time())) 39 | ``` 40 | 41 | - `ifelse()` is not type-stable because the output type depends on the value: 42 | 43 | ```{r} 44 | vec_ptype(ifelse(NA, 1L, 2)) 45 | vec_ptype(ifelse(FALSE, 1L, 2)) 46 | vec_ptype(ifelse(TRUE, 1L, 2)) 47 | ``` 48 | 49 | ## More complicated examples 50 | 51 | Some functions are more complex because they take multiple input types and have to return a single output type. 52 | This includes functions like `c()` and `ifelse()`. 53 | The rules governing base R functions are idiosyncratic, and each function tends to apply its own slightly different set of rules. 54 | Tidy functions should use the consistent set of rules provided by the [vctrs](https://vctrs.r-lib.org) package. 55 | 56 | ## Challenge: the median 57 | 58 | A more challenging example is `median()`. 59 | The median of a vector is a value that (as evenly as possible) splits the vector into a lower half and an upper half. 60 | In the absence of ties, `mean(x < median(x))` and `mean(x > median(x))` are both as close to 0.5 as possible. 61 | The median is straightforward to compute for odd lengths: you simply order the vector and pick the value in the middle, i.e. `sort(x)[(length(x) + 1) / 2]`. 62 | It's clear that the type of the output should be the same type as `x`, and this algorithm can be applied to any vector that can be ordered. 63 | 64 | But what if the vector has an even length? 65 | In this case, there's no longer a unique median, and by convention we usually take the mean of the middle two numbers. 66 | 67 | In R, this makes `median()` not type-stable: 68 | 69 | ```{r} 70 | typeof(median(1:3)) 71 | typeof(median(1:4)) 72 | ``` 73 | 74 | Base R doesn't appear to follow a consistent principle when computing the median of a vector of length 2.
75 | Factors throw an error, but dates do not (even though there's no date halfway between two days that differ by an odd number of days). 76 | 77 | ```{r} 78 | #| error = TRUE 79 | median(factor(1:2)) 80 | median(Sys.Date() + 0:1) 81 | ``` 82 | 83 | To be clear, the problems caused by this behaviour are quite small in practice, but it makes the analysis of `median()` more complex, and it makes it difficult to decide what principle you should adhere to when creating `median` methods for new vector classes. 84 | 85 | ```{r} 86 | #| error = TRUE 87 | median("foo") 88 | median(c("foo", "bar")) 89 | ``` 90 | 91 | ## Exercises 92 | 93 | 1. How is a date like an integer? 94 | Why is this inconsistent? 95 | 96 | ```{r} 97 | vec_ptype(mean(Sys.Date())) 98 | vec_ptype(mean(1L)) 99 | ``` 100 | -------------------------------------------------------------------------------- /out-vectorisation.qmd: -------------------------------------------------------------------------------- 1 | # Vectorisation {#sec-out-vectorisation} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | Vectorisation has two meanings: it can refer to either the interface of a function, or its implementation. 9 | We can make a precise statement about what a vectorised interface is. 10 | A function, `f`, is vectorised over a vector argument, `x`, iff `f(x)[[i]]` equals `f(x[[i]])`, i.e. we can exchange the order of subsetting and function application. 11 | This generalises naturally to more arguments: we say `f` is vectorised over `x` and `y` if `f(x[[i]], y[[i]])` equals `f(x, y)[[i]]`. 12 | A function can also have some arguments that are vectorised and some that are not: in that case `f(x, ...)[[i]]` equals `f(x[[i]], ...)`. 13 | 14 | It is harder to define a vectorised implementation. 15 | It's necessary for a function with a vectorised implementation to have a vectorised interface, but it must also be computationally efficient.
16 | It's hard to make this precise, but generally it means that if there is an explicit loop, that loop is written in C or C++, not in R. 17 | -------------------------------------------------------------------------------- /plausible.html: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /r4ds.scss: -------------------------------------------------------------------------------- 1 | /*-- scss:defaults --*/ 2 | 3 | $primary: #637238 !default; 4 | $font-size-root: 18px !default; 5 | 6 | /*-- scss:rules --*/ 7 | 8 | .sidebar-title { 9 | color: #637238; 10 | } 11 | 12 | div.sidebar-item-container .active { 13 | font-weight: bold; 14 | } 15 | 16 | .sidebar nav[role=doc-toc] ul>li>a.active, .sidebar nav[role=doc-toc] ul>li>ul>li>a.active{ 17 | font-weight: bold; 18 | } 19 | 20 | img.quarto-cover-image { 21 | box-shadow: 0 .5rem 1rem rgba(0,0,0,.15); 22 | } 23 | 24 | /* Headings ------------------------------------------------------ */ 25 | 26 | #title-block-header.quarto-title-block.default .quarto-title h1.title { 27 | margin-bottom: 0.5rem; 28 | } 29 | 30 | h2 { 31 | margin-top: 2rem; 32 | margin-bottom: 1rem; 33 | font-size: 1.4rem; 34 | font-weight: 600; 35 | } 36 | h3 { margin-top: 1.5em; font-size: 1.2rem; font-weight: 500; } 37 | h4 { margin-top: 1.5em; font-size: 1.1rem; } 38 | h5 { margin-top: 1.5em; font-size: 1rem; } 39 | 40 | .quarto-section-identifier { 41 | color: #6C6C6C; 42 | font-weight: normal; 43 | } 44 | 45 | #quarto-sidebar { 46 | .menu-text { 47 | font-weight: bold; 48 | } 49 | 50 | .chapter-title { 51 | line-height: 1; 52 | } 53 | 54 | .sidebar-item { 55 | margin-bottom: 5px; 56 | } 57 | } 58 | 59 | 60 | /* Code ------------------------------------------------ */ 61 | 62 | code { 63 | color: #373a3c; 64 | } 65 | 66 | code a:any-link { 67 | text-decoration: underline; 68 | text-decoration-color: #ccc; 69 | } 70 | 71 | pre { 72
| background-image: linear-gradient(160deg,#f8f8f8 0,#f1f1f1 100%); 73 | } 74 | -------------------------------------------------------------------------------- /required-no-defaults.qmd: -------------------------------------------------------------------------------- 1 | # Required args shouldn't have defaults {#sec-required-no-defaults} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the pattern? 9 | 10 | Required arguments shouldn't have defaults; optional arguments should have defaults. 11 | In other words, an argument should have a default if and only if it's optional. 12 | 13 | This simple convention ensures that you can tell which arguments are optional and which arguments are required from a glance at the function signature. 14 | Otherwise you need to rely on a careful reading of the documentation. 15 | Additionally, if you don't follow this convention and want to provide helpful error messages, you'll need to implement them yourself rather than relying on R's defaults. 16 | 17 | ::: {.callout-note collapse="true"} 18 | ## When should an argument be required? 19 | 20 | This pattern raises the question of when an argument should be required, and when you should provide a default. 21 | I think this usually seems "obvious" but I wanted to discuss a few functions that might get it wrong: 22 | 23 | - `rnorm()` and `runif()` are interesting cases as they set default values for `mean`/`sd` and `min`/`max`. 24 | Giving them defaults makes them feel less important, and is inconsistent with the other RNGs, which generally require that you specify the parameters of the distribution. 25 | But both the normal and uniform distributions have very high-profile "standard" versions that make sense as defaults.
26 | 27 | - You can use `predict()` directly on a model and it gives predictions for the data used to fit the model: 28 | 29 | ```{r} 30 | mod <- lm(Employed ~ ., data = longley) 31 | head(predict(mod)) 32 | ``` 33 | 34 | In my opinion, `predict()` should always require a dataset because prediction is primarily about applying the model to new situations. 35 | 36 | - `stringr::str_sub()` has default values for `start` and `end`. 37 | This allows you to do clever things like `str_sub(x, end = 3)` or `str_sub(x, -3)` to select the first or last three characters, but I now believe that leads to code that is harder to read, and it would have been better to make `start` and `end` required arguments. 38 | ::: 39 | 40 | ## What are some examples? 41 | 42 | This is a straightforward convention that the vast majority of functions follow. 43 | There are a few exceptions that exist in base R, mostly for historical reasons. 44 | Here are a couple of examples: 45 | 46 | - In `sample()` neither `x` nor `size` has a default value: 47 | 48 | ```{r} 49 | args(sample) 50 | ``` 51 | 52 | This suggests that `size` is required, but it's actually optional: 53 | 54 | ```{r} 55 | sample(1:4) 56 | sample(4) 57 | ``` 58 | 59 | - `lm()` does not have defaults for `formula`, `data`, `subset`, `weights`, `na.action`, or `offset`. 60 | 61 | ```{r} 62 | args(lm) 63 | ``` 64 | 65 | But only `formula` is actually required: 66 | 67 | ```{r} 68 | x <- 1:5 69 | y <- 2 * x + 1 + rnorm(length(x)) 70 | lm(y ~ x) 71 | ``` 72 | 73 | In the tidyverse, one function that fails to follow this pattern is `ggplot2::geom_abline()`: `slope` and `intercept` don't have defaults but are not required. 74 | If you don't supply them they default to `slope = 1` and `intercept = 0`, *or* are taken from `aes()` if they're provided there. 75 | This is a mistake caused by trying to have `geom_abline()` do too much --- it can either be used as an annotation (i.e.
with a single `slope` and `intercept`) or used to draw multiple lines from data (i.e. with one line for each row). 76 | 77 | ## How do I use the pattern? 78 | 79 | This pattern is generally easy to follow: if you don't use `missing()`, it's very hard to break this convention by mistake. 80 | 81 | ## How do I remediate past mistakes? 82 | 83 | If an argument is required, remove its default. 84 | If an argument is optional, give it a default value; if computing that default is complicated, set it to `NULL` and compute it inside the body of the function. 85 | -------------------------------------------------------------------------------- /side-effects.qmd: -------------------------------------------------------------------------------- 1 | # Side-effect soup {#sec-side-effect-soup} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | Side-effect soup occurs when you mix side-effects and regular computation within the same function. 9 | 10 | ## What is a side-effect? 11 | 12 | There are two main types of side-effect: 13 | 14 | - those that give feedback to the user. 15 | - those that change some global state. 16 | 17 | ### User feedback 18 | 19 | - Signalling a condition, with `message()`, `warning()`, or `stop()`. 20 | 21 | - Printing to the console with `cat()`. 22 | 23 | - Drawing to the current graphics device with base graphics or grid. 24 | 25 | ### Global state 26 | 27 | - Creating (or modifying) a binding with `<-`. 28 | 29 | - Modifying the search path by attaching a package with `library()`. 30 | 31 | - Changing the working directory with `setwd()`. 32 | 33 | - Modifying a file on disk with (e.g.) `write.csv()`. 34 | 35 | - Changing a global option with `options()` or a base graphics parameter with `par()`. 36 | 37 | - Setting the random seed with `set.seed()`. 38 | 39 | - Installing a package. 40 | 41 | - Changing environment variables with `Sys.setenv()`, or indirectly via a function like `Sys.setlocale()`.
42 | 43 | - Modifying a variable in an enclosing environment with `assign()` or `<<-`. 44 | 45 | - Modifying an object with reference semantics (like R6 or data.table). 46 | 47 | More esoteric side-effects include: 48 | 49 | - Detaching a package from the search path with `detach()`. 50 | 51 | - Changing the library path, where R looks for packages, with `.libPaths()`. 52 | 53 | - Changing the active graphics device with (e.g.) `png()` or `dev.off()`. 54 | 55 | - Registering an S4 class, method, or generic with `methods::setGeneric()`. 56 | 57 | - Modifying the internal `.Random.seed`. 58 | 59 | ## What are some examples? 60 | 61 | - The summary of a linear model includes a p-value for the overall 62 | regression. 63 | This value is only computed when the summary is printed: you can see it but you can't touch it. 64 | 65 | ```{r} 66 | mod <- lm(mpg ~ wt, data = mtcars) 67 | summary(mod) 68 | ``` 69 | 70 | ## Why is it bad? 71 | 72 | Side-effect soup is bad because: 73 | 74 | - If a function does some computation and has side-effects, it can be challenging to extract the results of the computation. 75 | 76 | - It makes code harder to analyse because it may have non-local effects. 77 | Take this code: 78 | 79 | ```{r} 80 | #| eval = FALSE 81 | x <- 1 82 | y <- compute(x) 83 | z <- calculate(x, y) 84 | 85 | df <- data.frame(x = "x") 86 | ``` 87 | 88 | If `compute()` or `calculate()` don't have side-effects then you can predict what `df` will be. 89 | But if `compute()` did `options(stringsAsFactors = FALSE)` then `df` would now contain a character vector rather than a factor (in versions of R before 4.0, where `stringsAsFactors = TRUE` was the default). 90 | 91 | Side-effect soup increases the cognitive load of a function, so side-effects should be used deliberately, and you should be especially cautious when combining them with other techniques that increase cognitive load like tidy-evaluation and type-instability. 92 | 93 | ## How do I avoid it?
94 | 95 | ### Localise side-effects 96 | 97 | Constrain the side-effects to as small a scope as possible, and clean up automatically when that scope ends. 98 | The [withr](http://withr.r-lib.org) package provides tools for exactly this. 99 | 100 | ### Extract side-effects 101 | 102 | It's not side-effects that are bad, so much as mixing them with non-side-effect code. 103 | 104 | Put them in a function that is specifically focussed on the side-effect. 105 | 106 | If your function is called primarily for its side-effects, it should return the primary data structure (which should be the first argument), invisibly. 107 | This allows you to call it mid-pipe for its side-effects while allowing the primary data to continue flowing through the pipe. 108 | 109 | ### Make side-effects noisy 110 | 111 | The primary purpose of the entire usethis package is side-effects: modifying files on disk to support package and project development. 112 | usethis functions are also designed to be noisy: as well as doing its job, each usethis function tells you what it's doing. 113 | 114 | But some usethis functions are building blocks for other more complex tasks. 115 | 116 | ### Provide an argument to suppress 117 | 118 | You've probably used `base::hist()` for its side-effect of drawing a histogram: 119 | 120 | ```{r} 121 | x <- rnorm(1e5) 122 | hist(x) 123 | ``` 124 | 125 | But you might not know that `hist()` also returns the result of the computation. 126 | If you set `plot = FALSE`, it will simply return the results of the computation: 127 | 128 | ```{r} 129 | xhist <- hist(x, plot = FALSE) 130 | str(xhist) 131 | ``` 132 | 133 | This is a good approach for retro-fitting older functions while making minimal API changes. 134 | However, I think it dilutes a function to be used both for plotting and for computing, so it's best avoided in newer code. 135 | 136 | ### Use the `print()` method 137 | 138 | An alternative approach is to always return the computation, and instead perform the output in the `print()` method.
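As a hedged sketch of this separation (the `freq_table` class and function below are hypothetical, invented for illustration): the function always returns the computed object, and all console output lives in the `print()` method, so the results stay accessible.

```{r}
# Hypothetical example: the computation always returns a classed object;
# the console display happens only in the print() method.
freq_table <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  structure(list(counts = counts), class = "freq_table")
}

print.freq_table <- function(x, ...) {
  cat("Frequency table (", length(x$counts), " distinct values)\n", sep = "")
  print(unclass(x$counts))
  invisible(x)
}

ft <- freq_table(c("a", "b", "a", "c", "a"))
ft$counts["a"]  # the results are always accessible, not just visible
```

Unlike the p-value in `summary.lm()`, nothing here is computed only at print time: the print method is pure display.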
139 | 140 | Of course ggplot2 isn't perfect: it creates an object that specifies the plot, but there's no easy way to extract the underlying computation, so if you've used `geom_smooth()` to add lines of best fit, there's no way to extract the values. 141 | Again, you can see the results, but you can't touch them, which is very frustrating! 142 | 143 | ### Make it easy to undo 144 | 145 | If all of the above techniques fail, you should at least make the side-effect easy to undo. 146 | A useful technique is to make sure that the function returns the *previous* values, and that it can accept its own output as input. 147 | 148 | This is how `options()` and `par()` work. 149 | You obviously can't eliminate those functions because their whole purpose is to make global changes! 150 | But they are designed in such a way that you can easily undo their operation, making it possible to apply them on a local basis. 151 | 152 | There are two key ideas that make these functions easy to undo: 153 | 154 | 1. They [invisibly return](https://adv-r.hadley.nz/functions.html#invisible) the previous values as a list: 155 | 156 | ```{r} 157 | options(my_option = 1) 158 | old <- options(my_option = 2) 159 | str(old) 160 | ``` 161 | 162 | 2. Instead of `n` named arguments, they can take a single named list: 163 | 164 | ```{r} 165 | old <- options(list(my_option1 = 1, my_option2 = 2)) 166 | ``` 167 | 168 | (I wouldn't recommend copying this dual interface; instead I'd recommend always taking a single named list.
This makes the function simpler because there's a single way to call it, and makes it easy to extend the API in the future, as discussed in @sec-dots-data) 169 | 170 | Together, this means that you can easily set options temporarily: 171 | 172 | ```{r} 173 | getOption("my_option1") 174 | 175 | old <- options(my_option1 = 10) 176 | getOption("my_option1") 177 | options(old) 178 | 179 | getOption("my_option1") 180 | ``` 181 | 182 | If temporarily setting options in a function, you should always restore the previous values using `on.exit()`: this ensures that the code is run regardless of how the function exits. 183 | 184 | ## Package considerations 185 | 186 | Code in a package is executed at build-time, i.e. 187 | if you have: 188 | 189 | ```{r} 190 | x <- Sys.time() 191 | ``` 192 | 193 | On macOS and Windows, this will record when CRAN built the binary. 194 | On Linux, when the package was installed. 195 | 196 | Beware copying functions from other packages: 197 | 198 | ```{r} 199 | #| eval = FALSE 200 | foofy <- barfy::foofy 201 | ``` 202 | 203 | The version of barfy might differ between build-time and run-time. 204 | 205 | This also introduces a build-time dependency. 206 | 207 | 208 | -------------------------------------------------------------------------------- /spooky-action.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tidyverse/design/a1eedc8d4e3f0a1301444856ec9b3cf8e70dc26f/spooky-action.rds -------------------------------------------------------------------------------- /strategy-functions.qmd: -------------------------------------------------------------------------------- 1 | # Three functions in a trench coat {#sec-strategy-functions} 2 | 3 | ```{r} 4 | #| include = FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the problem? 9 | 10 | Sometimes a function that implements multiple strategies might be better off as independent functions.
11 | Two signs that this might be the case: 12 | 13 | - You're struggling to document how the arguments interact. Maybe you can set `a` and `b`, or `a` and `c`, but not `b` and `c`. 14 | - The implementation of your function has a couple of big if branches that share relatively little code. 15 | 16 | Splitting one complex multi-strategy function into multiple simpler functions can make maintenance and testing easier, while also improving the user experience. 17 | I think of this as the three (or more generally $n$) functions in a trench coat[^strategy-functions-1] problem, because the reality of the separate functions tends to become very obvious in time. 18 | 19 | [^strategy-functions-1]: 20 | 21 | ## What are some examples? 22 | 23 | - `forcats::fct_lump()` chooses between one of three lumping strategies depending on whether you supply just `n`, just `prop`, or neither, while supplying both `n` and `prop` is an error. 24 | Additionally, the `ties.method` argument only does anything if you supply only `n`. 25 | `fct_lump()` is hard to understand and document because it's really three smaller functions. 26 | 27 | - Depending on the arguments used, `library()` can load a package, list all installed packages, or display the help for a package. 28 | 29 | - `diag()` is used to both extract the diagonal of a matrix and construct a matrix from a vector of diagonal values. 30 | This combination of purposes makes its arguments hard to understand: if `x` is a matrix, you can use `names` but not `nrow` and `ncol`; if `x` is a vector, you can use `nrow` and `ncol` but not `names`. 31 | 32 | - `sample()` is used to both randomly reorder a vector and generate a random vector of specified length. 33 | This function is particularly troublesome because it picks between the two strategies based on the length of the first argument. 34 | 35 | - `rep()` is used to both repeat each element of a vector and to repeat the complete vector. 36 | I discuss this more in @sec-cs-rep.
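The `sample()` problem is easy to demonstrate: the strategy is picked implicitly from the length of the first argument, so shrinking a vector silently changes what a call means:

```{r}
set.seed(42)
sample(c(2, 7))  # length-2 input: reorders the vector
sample(7)        # length-1 input: silently generates a permutation of 1:7
sample(c(7))     # identical to sample(7) -- c() offers no protection
```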
37 | 38 | These functions all also suffer from the problem that the strategies are implicit, not explicit. 39 | That's because they use either the presence or absence of different arguments or the type of an argument to pick strategies. 40 | This combination tends to produce particularly opaque code. 41 | 42 | ## How do I identify the problem? 43 | 44 | Typically this problem arises as the scope of your function grows over time, and because the growth tends to be gradual it's hard to notice exactly when it becomes an issue. 45 | One way to spot this problem is to notice that your function consists of a big if statement where the branches share very little code. 46 | An extreme example of this is the base `sample()` function. 47 | As of R 4.3.0 it looks something like this: 48 | 49 | ```{r} 50 | sample <- function(x, size, replace = FALSE, prob = NULL) { 51 | if (length(x) == 1L && is.numeric(x) && is.finite(x) && x >= 1) { 52 | if (missing(size)) 53 | size <- x 54 | sample.int(x, size, replace, prob) 55 | } else { 56 | if (missing(size)) 57 | size <- length(x) 58 | x[sample.int(length(x), size, replace, prob)] 59 | } 60 | } 61 | ``` 62 | 63 | You can see that there are two branches that share very little code, and each branch uses a different default value for `size`. 64 | This suggests it might be better to have two functions: 65 | 66 | ```{r} 67 | sample_vec <- function(x, size = length(x), replace = FALSE, prob = NULL) { 68 | # check_vector(x) 69 | # check_number_whole(size) 70 | 71 | x[sample.int(length(x), size, replace, prob)] 72 | } 73 | 74 | sample_int <- function(x, size = x, replace = FALSE, prob = NULL) { 75 | # check_number_whole(x) 76 | # check_number_whole(size) 77 | 78 | sample.int(x, size, replace, prob) 79 | } 80 | ``` 81 | 82 | In other cases you might spot the problem because you're having trouble explaining the arguments in the documentation. 83 | If it feels like you're describing several different functions on one documentation page, you probably are. 84 | 85 | ## How do I remediate past mistakes?
86 | 87 | Remediating past mistakes is straightforward: define, document, and export one function for each strategy. 88 | Then rewrite the original function to use those strategies, deprecating that entire function if desired. 89 | For example, this is what `fct_lump()` looked like after we realised it was really the combination of three simpler functions: 90 | 91 | ```{r} 92 | fct_lump <- function(f, 93 | n, 94 | prop, 95 | w = NULL, 96 | other_level = "Other", 97 | ties.method = c("min", "average", "first", "last", "random", "max")) { 98 | if (missing(n) && missing(prop)) { 99 | fct_lump_lowfreq(f, w = w, other_level = other_level) 100 | } else if (missing(prop)) { 101 | fct_lump_n(f, n, w = w, other_level = other_level, ties.method = ties.method) 102 | } else if (missing(n)) { 103 | fct_lump_prop(f, prop, w = w, other_level = other_level) 104 | } else { 105 | cli::cli_abort("Must supply only one of {.arg n} and {.arg prop}.") 106 | } 107 | } 108 | ``` 109 | 110 | We decided to supersede `fct_lump()` rather than deprecating it, so we kept the old function around and working.
111 | If we wanted to deprecate it, we'd need to add one deprecation for each branch: 112 | 113 | ```{r} 114 | fct_lump <- function(f, 115 | n, 116 | prop, 117 | w = NULL, 118 | other_level = "Other", 119 | ties.method = c("min", "average", "first", "last", "random", "max")) { 120 | if (missing(n) && missing(prop)) { 121 | lifecycle::deprecate_warn("0.5.0", "fct_lump()", "fct_lump_lowfreq()") 122 | fct_lump_lowfreq(f, w = w, other_level = other_level) 123 | } else if (missing(prop)) { 124 | lifecycle::deprecate_warn("0.5.0", "fct_lump()", "fct_lump_n()") 125 | fct_lump_n(f, n, w = w, other_level = other_level, ties.method = ties.method) 126 | } else if (missing(n)) { 127 | lifecycle::deprecate_warn("0.5.0", "fct_lump()", "fct_lump_prop()") 128 | fct_lump_prop(f, prop, w = w, other_level = other_level) 129 | } else { 130 | cli::cli_abort("Must supply only one of {.arg n} and {.arg prop}.") 131 | } 132 | } 133 | ``` 134 | -------------------------------------------------------------------------------- /strategy-objects.qmd: -------------------------------------------------------------------------------- 1 | # Extract strategies into objects {#sec-strategy-objects} 2 | 3 | ```{r} 4 | #| include: FALSE 5 | source("common.R") 6 | ``` 7 | 8 | ## What's the problem? 9 | 10 | Sometimes different strategies need different arguments. 11 | In this case, instead of using an enum, you'll need to use richer objects capable of storing optional values as well as the strategy name. 12 | 13 | This pattern is similar to combining @sec-argument-clutter and @sec-enumerate-options. 14 | 15 | ## What are some examples? 16 | 17 | - `grepl()` has Boolean `perl` and `fixed` arguments, but you're not really toggling two independent settings; you're picking from one of three regular expression engines (the default, the engine used by Perl, and fixed matches). 18 | Additionally, the `ignore.case` argument only applies to two of the strategies.
19 | 20 | In stringr, however, you use helper functions like `regex()` and `fixed()` to wrap around the pattern, and supply optional arguments that only apply to that strategy. 21 | 22 | - `ggplot2::geom_histogram()` has three main strategies for defining the bins: you can supply the number of `bins`, the width of each bin (the `binwidth`), or the exact `breaks`. 23 | But it's currently difficult to derive this from the function specification, and there are complex argument dependencies (e.g. you can only supply one of `boundary` and `center`, and neither applies if you use `breaks`). 24 | 25 | - `dplyr::left_join()` uses an advanced form of this pattern where the different strategies for joining two data frames together are expressed in a mini-DSL provided by `dplyr::join_by()`. 26 | 27 | ## How do you use the pattern? 28 | 29 | In more complicated cases, different strategies will require different arguments, so you'll need a bit more infrastructure. 30 | The basic idea is to build on the options object described in @sec-argument-clutter, but instead of providing just one helper function, you'll provide one function per strategy. 31 | This is the way stringr works: you can select a different matching engine by wrapping the `pattern` in one of `regex()`, `boundary()`, `coll()`, or `fixed()`. 32 | We'll explore how stringr ended up with this design and how you can implement something similar yourself by looking at the base regular expression functions. 33 | 34 | ### Selecting a pattern engine 35 | 36 | The basic regular expression functions (`grep()`, `grepl()`, `sub()`, `gsub()`, `regexpr()`, `gregexpr()`, `regexec()`, and `gregexec()`) all have `fixed` and `perl` arguments that allow you to select the regular expression engine that's used: 37 | 38 | - `perl = FALSE`, `fixed = FALSE`, the default, uses POSIX 1003.2 extended regular expressions. 39 | - `perl = TRUE`, `fixed = FALSE` uses Perl-style regular expressions. 40 | - `perl = FALSE`, `fixed = TRUE` uses fixed matching.
41 | - `perl = TRUE`, `fixed = TRUE` is an error. 42 | 43 | You could make this choice more clear by using an enumeration (@sec-enumerate-options), maybe something like `engine = c("POSIX", "perl", "fixed")`. 44 | That might look something like this: 45 | 46 | ```{r} 47 | #| eval: false 48 | grepl(pattern, string, engine = "regex") 49 | grepl(pattern, string, engine = "fixed") 50 | grepl(pattern, string, engine = "perl") 51 | ``` 52 | 53 | But there's an additional argument that throws a spanner in the works: `ignore.case = TRUE` only works with two of the three engines: POSIX and perl. 54 | Additionally, it's a bit unfortunate that the `engine` argument, which is likely to come later in the call, affects the `pattern`, the first argument. 55 | That means you have to read the call until you see the `engine` argument before you can understand precisely what the `pattern` means. 56 | 57 | An alternative approach, as used by stringr, is to provide some helper functions that encode the engine as an attribute of the pattern: 58 | 59 | ```{r} 60 | #| eval: FALSE 61 | grepl(regex(pattern), string) 62 | grepl(fixed(pattern), string) 63 | grepl(perl(pattern), string) 64 | ``` 65 | 66 | And because these are separate functions, they can take different arguments: 67 | 68 | ```{r} 69 | regex <- function(pattern, ignore.case = FALSE) {} 70 | perl <- function(pattern, ignore.case = FALSE) {} 71 | fixed <- function(pattern) {} 72 | ``` 73 | 74 | This gives a very flexible interface which is particularly nice in stringr because it means there's an easy way to support boundary matching, which doesn't even take a pattern: 75 | 76 | ```{r} 77 | #| message: false 78 | library(stringr) 79 | str_view("This is a sentence.", boundary("word")) 80 | str_view("This is a sentence.", boundary("sentence")) 81 | ``` 82 | 83 | ### Implementation 84 | 85 | Let's turn this interface into an implementation. 86 | First we flesh out the pattern engine wrappers.
87 | These need to return an object that has the name of the engine, the pattern, and any other arguments: 88 | 89 | ```{r} 90 | regex <- function(pattern, ignore.case = FALSE) { 91 | list(pattern = pattern, engine = "regex", ignore.case = ignore.case) 92 | } 93 | perl <- function(pattern, ignore.case = FALSE) { 94 | list(pattern = pattern, engine = "perl", ignore.case = ignore.case) 95 | } 96 | fixed <- function(pattern) { 97 | list(pattern = pattern, engine = "fixed") 98 | } 99 | ``` 100 | 101 | Then you could create a new `grepl()` variant that might look something like this: 102 | 103 | ```{r} 104 | my_grepl <- function(pattern, x, useBytes = FALSE) { 105 | switch(pattern$engine, 106 | regex = grepl(pattern$pattern, x, ignore.case = pattern$ignore.case, useBytes = useBytes), 107 | perl = grepl(pattern$pattern, x, perl = TRUE, ignore.case = pattern$ignore.case, useBytes = useBytes), 108 | fixed = grepl(pattern$pattern, x, fixed = TRUE, useBytes = useBytes) 109 | ) 110 | } 111 | ``` 112 | 113 | Or if you wanted to make it more clear how the engines differ, you could factor out a helper function that contains the repeated code: 114 | 115 | ```{r} 116 | my_grepl <- function(pattern, x, useBytes = FALSE) { 117 | grepl_wrapper <- function(...) { 118 | grepl(pattern$pattern, x, ..., useBytes = useBytes) 119 | } 120 | 121 | switch(pattern$engine, 122 | regex = grepl_wrapper(ignore.case = pattern$ignore.case), 123 | perl = grepl_wrapper(perl = TRUE, ignore.case = pattern$ignore.case), 124 | fixed = grepl_wrapper(fixed = TRUE) 125 | ) 126 | } 127 | ``` 128 | 129 | Here I'm just wrapping the existing `grepl()` because I don't want to go into the details of its implementation; for your own code you'd probably inline the implementation. 130 | 131 | I particularly like the `switch` pattern here and in stringr because it keeps the function calls close together, which makes it easier to keep them in sync.
132 | You could also implement the same strategy using `if` or S7 generic functions, depending on your needs.
133 | 
134 | This implementation is a sketch that gives you the basic ideas.
135 | For a real implementation you'd also need to consider:
136 | 
137 | - Are `fixed()`, `perl()`, and `regex()` the right names? Would it be useful to give them a common prefix?
138 | - It would be better for the engines to return an S7 object instead of a list, so we could provide a print method to make them display more nicely.
139 | - `grepl()` needs some error checking to ensure that `pattern` is generated by one of the engines, and probably should have a default path to handle bare character vectors as regular expressions (the current default).
140 | 
141 | You can see these details worked out in the stringr package if you look at the source code, particularly that of `fixed()`, `type()`, `opts()`, then `str_detect()`.
142 | 
143 | ## How do I remediate past problems?
144 | 
145 | Changing from a complex dependency of individual arguments to a stra
146 | --------------------------------------------------------------------------------
/substack/2023-07-28.qmd:
--------------------------------------------------------------------------------
1 | The goal of this newsletter is to get feedback as I work on a new book called "Tidy design principles".
2 | But what's the point of that book?
3 | R has a very rich literature on statistics and data science, but there are relatively few books that focus on programming.
4 | I've written a couple ([Advanced R](https://adv-r.hadley.nz/) and [R Packages](https://r-pkgs.org/)) but neither really talks about how to write good R code (or even discusses what good code means).
5 | 
6 | That's what I want to focus on in "[Tidy design principles](https://design.tidyverse.org/)": how do you write high-quality R code that's easy to understand, unlikely to fail in unexpected ways, and flexible enough to grow with your needs.
7 | This book will be organised around the idea of "design patterns".
8 | This was an idea that I encountered early in my programming journey and I found it very impactful.
9 | 
10 | The idea of a design pattern is to come up with a catchy name that maps a common programming challenge to an effective solution.
11 | The catchy name is important because it serves as a handle for your memory and a convenient shorthand when discussing code with others.
12 | I first heard about design patterns when I was a CS undergrad learning Java (a much less flexible language than R) and I read the popular [Design Patterns: Elements of Reusable Object-Oriented Software](https://en.wikipedia.org/wiki/Design_Patterns) book.
13 | 
14 | I later learned that the idea of design patterns originated not from computer science, but from architecture, particularly [A Pattern Language: Towns, Buildings, Construction](https://en.wikipedia.org/wiki/A_Pattern_Language) by Christopher Alexander.
15 | This book resonated with me even more strongly than the CS patterns, and if you have any interest in architecture, I highly recommend reading it.
16 | I particularly liked that the patterns spanned many levels of detail, all the way from organising entire communities to how you might select chairs for a single room.
17 | 
18 | That's the spirit in which I write "Tidy design principles".
19 | I want to name common problem-solving patterns in R, and write them up so that others can easily use them.
20 | That means that this book will have rather a different structure to my previous books.
21 | It will have a large number of relatively short chapters, each of which describes a pattern: what it is, why it's important, where you can see it in the wild, and how you can apply it to your code.
22 | You might skim the whole book once, but you'll generally use it by referring to the patterns that apply specifically to your current problem.
23 | 
24 | I also want to include some bigger principles that help you weigh conflicting patterns as well as case studies that illuminate some of our thinking when we've designed various parts of the tidyverse (particularly parts that we now regret).
25 | 
26 | ------------------------------------------------------------------------
27 | 
28 | This week I've identified one bigger principle: the definition of a function should be scannable, giving you useful information at a glance.
29 | This principle is important because you see the function definition in lots of places, like in autocomplete and at the top of the documentation.
30 | It's super useful if that glance can give you real insight into the function.
31 | 
32 | So far, I've gathered six patterns related to this principle:
33 | 
34 | - You should be able to tell what affects the output of the function because [all inputs are explicit arguments](https://design.tidyverse.org/inputs-explicit.html).
35 | 
36 | - You know which arguments are most important because [they come first](https://design.tidyverse.org/important-args-first.html).
37 | 
38 | - You can tell if an argument is required or optional based on the [presence or absence of a default](https://design.tidyverse.org/required-no-defaults.html) and whether it [comes before or after](https://design.tidyverse.org/dots-after-required.html) `…`.
39 | 
40 | - You can easily figure out the defaults because [they are short and sweet](https://design.tidyverse.org/defaults-short-and-sweet.html).
41 | 
42 | - And if an argument has a small set of valid inputs, they are [explicitly enumerated in the default](https://design.tidyverse.org/enumerate-options.html).
43 | 
44 | What do you think of this principle and the various patterns that make it up?
45 | Does it resonate with you?
46 | Do you think I've missed something important?
47 | 
48 | One other pattern I've been noodling on is the idea that one argument shouldn't affect the meaning of another argument.
49 | This seems like an important principle, but so far I've only been able to come up with one example: `library()`, where the `character.only` argument affects the meaning of the `package` argument:
50 | 
51 | ```
52 | ggplot2 <- "dplyr"
53 | 
54 | # Loads ggplot2
55 | library(ggplot2)
56 | 
57 | # Loads dplyr
58 | library(ggplot2, character.only = TRUE)
59 | ```
60 | 
61 | Given that I only have one example, it doesn't seem worthwhile to write it up as a pattern, but maybe you've encountered other examples of this problem.
62 | If so, please let me know in the comments!
63 | --------------------------------------------------------------------------------
/substack/2023-08-04.qmd:
--------------------------------------------------------------------------------
1 | ```{r}
2 | #| include: FALSE
3 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
4 | ```
5 | 
6 | # Reducing clutter with an options object
7 | 
8 | New this week is a chapter on [reducing argument clutter by adding an options object](https://design.tidyverse.org/argument-clutter.html).
9 | Sometimes you have a set of "second class" arguments that you don't expect people to use very commonly, so you don't want them cluttering up the function specification.
10 | If you want to give the user the ability to control them when needed, you can lump them all together into an "options" object.
11 | 
12 | These are used in base R modelling functions (e.g. `glm()`, `loess()`) to control the details of the underlying numerical algorithm.
13 | For example, take this model from the glm docs:
14 | 
15 | ```{r}
16 | data(anorexia, package = "MASS")
17 | 
18 | mod <- glm(
19 |   Postwt ~ Prewt + Treat + offset(Prewt),
20 |   family = gaussian,
21 |   data = anorexia
22 | )
23 | ```
24 | 
25 | If you want to understand how model convergence is going, you can set `trace = TRUE` in `glm.control()`:
26 | 
27 | ```{r}
28 | mod <- glm(
29 |   Postwt ~ Prewt + Treat + offset(Prewt),
30 |   family = gaussian,
31 |   data = anorexia,
32 |   control = glm.control(trace = TRUE)
33 | )
34 | ```
35 | 
36 | 99% of the time you don't need to know these arguments exist, but they are available if you ever need to debug a convergence failure.
37 | 
38 | ------------------------------------------------------------------------
39 | 
40 | You can see the same pattern in `readr::locale()` and `readr::date_names()`.
41 | When parsing dates, you often need to know the names of the months, and that obviously varies by location.
42 | `locale()` allows you to set `date_names` to a two-letter language code to use the common languages that are baked into readr, but what happens if you want to parse dates from an unsupported language?
43 | 
44 | For example, take Austrian, which came up in a [recent readr issue](https://github.com/tidyverse/readr/issues/1467).
45 | Austrian month names are mostly the same as German but use Jänner instead of Januar and Feber instead of Februar.
46 | We can parse Austrian date times by first taking the German date names structure and modifying it:
47 | 
48 | ```{r}
49 | library(readr)
50 | 
51 | au <- readr::date_names_lang("de")
52 | au$mon[1:2] <- c("Jänner", "Feber")
53 | au
54 | ```
55 | 
56 | Now we can pass this object to `locale()`, and the locale object to a parsing function:
57 | 
58 | ```{r}
59 | parse_date("15. Jänner 2015", "%d. %B %Y", locale = locale(date_names = au))
60 | ```
61 | 
62 | I like how this hierarchy of option arguments buries something that you rarely need but still makes it accessible.
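The same pattern is easy to apply in your own code. Here's a minimal sketch, where `my_fit()` and `my_fit_control()` are hypothetical names: a `*_control()` constructor validates and bundles the rarely-used knobs, so the main function's signature stays uncluttered:

```{r}
# Hypothetical options-object sketch: bundle rarely-used arguments together
my_fit_control <- function(max_iter = 100L, tol = 1e-6, trace = FALSE) {
  stopifnot(max_iter >= 1, tol > 0, is.logical(trace))
  list(max_iter = max_iter, tol = tol, trace = trace)
}

my_fit <- function(x, y, control = my_fit_control()) {
  # the main interface stays short; the details live in `control`
  if (control$trace) message("max_iter: ", control$max_iter)
  lm(y ~ x)
}

my_fit(mtcars$wt, mtcars$mpg, control = my_fit_control(trace = TRUE))
```

One nice property of this design is that `my_fit_control()` can validate its inputs in one place, and its defaults are documented once rather than repeated in every function that accepts a `control` argument.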
63 | 64 | Where else have you seen this pattern? 65 | Have you written functions where it would be useful? 66 | Are there places in the tidyverse that you think should use this pattern but don't? 67 | Please let me know in the comments! 68 | 69 | ------------------------------------------------------------------------ 70 | 71 | Thanks to everyone who contributed in the comments last week! 72 | The results aren't ready to read yet, but have really helped my thinking for two new chapters "make strategies explicit" and "argument meaning should be independent". 73 | -------------------------------------------------------------------------------- /substack/2023-08-11.qmd: -------------------------------------------------------------------------------- 1 | # Dot-dot-dot, bang-bang-bang, and `do.call()` 2 | 3 | ```{r} 4 | #| include: FALSE 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | ``` 7 | 8 | This week I wanted to talk through a bit of tidyverse design that I have mixed feelings about: `!!!`. 9 | What is `!!!` and when might you need it? 10 | To understand it, you'll need a little back story... 11 | 12 | Some functions want to work with both individual values and a list of values. 13 | For example, take `rbind()`. 14 | Sometimes you've created a couple of data frames by hand and want to join them together: 15 | 16 | ```{r} 17 | df1 <- data.frame(x = 1) 18 | df2 <- data.frame(x = 2) 19 | both <- rbind(df1, df2) 20 | ``` 21 | 22 | But other times you have created an entire list of data frames, typically through the application of `lapply()` or friends: 23 | 24 | ```{r} 25 | xs <- 1:5 26 | dfs <- lapply(xs, \(x) data.frame(x = x)) 27 | ``` 28 | 29 | How can you join these into a single data frame? 
30 | Just calling `rbind()` doesn't do what you want:
31 | 
32 | ```{r}
33 | rbind(dfs)
34 | ```
35 | 
36 | And while you certainly *could* index them by hand, you lose much of the advantage of using `lapply()` in the first place:
37 | 
38 | ```{r}
39 | #| results: false
40 | rbind(dfs[[1]], dfs[[2]], dfs[[3]], dfs[[4]], dfs[[5]])
41 | ```
42 | 
43 | This problem is sometimes called splicing or splatting and occurs whenever you have a single object that contains elements that you want to be individual arguments.
44 | 
45 | The recommended solution for this problem in base R is to use `do.call()`.
46 | `do.call(rbind, dfs)` generates the call `rbind(dfs[[1]], dfs[[2]], …, dfs[[5]])` for you:
47 | 
48 | ```{r}
49 | do.call(rbind, dfs)
50 | ```
51 | 
52 | `do.call()` is an effective, if advanced, technique but it gets a little tricky if you want to supply additional arguments.
53 | For example, historically, it used to be important to set `stringsAsFactors = FALSE`, which requires gymnastics like this:
54 | 
55 | ```{r}
56 | #| results: false
57 | do.call(rbind, c(dfs, list(stringsAsFactors = FALSE)))
58 | ```
59 | 
60 | This was one of the challenges I wanted to tackle in dplyr, so I came up with `bind_rows()`, which tries to automatically figure out if you have a list of data frames or you're supplying them individually:
61 | 
62 | ```{r}
63 | library(dplyr, warn.conflicts = FALSE)
64 | 
65 | bind_rows(df1, df2)
66 | 
67 | bind_rows(dfs)
68 | ```
69 | 
70 | Unfortunately the heuristic we used to decide whether we were in the first case or the second case grew progressively more complicated over time, as people found problems or asked for new functionality.
71 | Now, while `bind_rows()` works correctly 99% of the time, it has some weird special cases, like below where the inputs can become columns, rather than rows.
72 | 
73 | ```{r}
74 | bind_rows(x = 1, y = 2)
75 | ```
76 | 
77 | These problems soured us on the idea of "automatic" splicing so we started looking for other solutions:
78 | 
79 | - We could have a pair of functions, one that takes `…` and one that takes a list. This works for `bind_rows()`, but it turns out there are a lot of functions that take `…` where it would be nice to also take a list of objects, so this approach would lead to a substantial amount of duplication.
80 | - We could have a pair of arguments, `…` and `.dots`, where you can supply individual arguments to `…` or a list of arguments to `.dots`. (I think I first recall seeing this approach in the RCurl package by Duncan Temple Lang.) But this would require adding an additional argument to every function that uses `…`, and wouldn't it be nice if we didn't have to do that?
81 | 
82 | Instead we found inspiration from tidy evaluation, where we had recently solved a similar problem with `!!!`:
83 | 
84 | ```{r}
85 | library(rlang)
86 | 
87 | args <- exprs(a, b, c + d)
88 | expr(f(!!!args))
89 | ```
90 | 
91 | So we introduced the idea of "dynamic dots", an extension to `…` that incorporates some features we thought were useful from tidy evaluation: splicing with `!!!` and dynamic names with `:=`.
92 | Dynamic dots is implemented via `rlang::list2()` and it's easy to use in your own functions if you find the idea appealing.
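To illustrate, here's a minimal sketch of what adopting dynamic dots might look like in your own code (`my_paste()` is a hypothetical example, not a real tidyverse function):

```{r}
library(rlang)

# Hypothetical function: collecting arguments with list2() instead of list()
# gives callers `!!!` splicing (and `:=` names) for free
my_paste <- function(..., sep = " ") {
  args <- list2(...)
  paste(unlist(args), collapse = sep)
}

my_paste("a", "b", "c")    # supply arguments individually

words <- list("x", "y", "z")
my_paste(!!!words)         # or splice in an existing list
```

The only change from a conventional implementation is swapping `list(...)` for `rlang::list2(...)`.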
93 | 
94 | One place you can see this idea in use is `forcats::fct_cross()`, which creates a factor that contains all combinations of its inputs:
95 | 
96 | ```{r}
97 | library(forcats)
98 | 
99 | fruit <- factor(c("apple", "kiwi", "apple", "apple"))
100 | colour <- factor(c("green", "green", "red", "green"))
101 | fct_cross(fruit, colour)
102 | ```
103 | 
104 | Because `fct_cross()` uses dynamic dots (which you can find out by looking at the docs), if you happen to have a list of values, you can use `!!!` to splice them in:
105 | 
106 | ```{r}
107 | x <- list(fruit = fruit, colour = colour)
108 | fct_cross(!!!x)
109 | ```
110 | 
111 | (`fct_cross()` does a similar job to `interaction()`, which interestingly takes the automatic approach, so you can just call `interaction(x)` here.
112 | I don't love the approach it takes because if `interaction(list(f1, f2))` works, you might expect `interaction(list(f1, f2), f3)` to work, but it does not, and it doesn't give you a particularly useful error message.)
113 | 
114 | We have yet to figure out a way to use dynamic dots in `dplyr::bind_rows()` without breaking existing usage, but it's provided by the function that now does most of the work: `vctrs::vec_rbind()`.
115 | (vctrs is the package where we stick low-level operations on vectors and data frames that we use in multiple packages. It's designed to be programmer-friendly rather than analyst-friendly, so we don't talk about it that much.)
116 | 
117 | ```{r}
118 | vctrs::vec_rbind(!!!dfs)
119 | ```
120 | 
121 | Because binding lists of data frames together is such a common operation, we also provide `purrr::list_rbind()` and `purrr::list_cbind()`.
122 | If you look you'll see their implementations are very simple!
123 | 
124 | Overall, I have mixed feelings about `!!!`.
125 | I love the elegance of it: it makes it easy to splice lists into `…` and it has a beautiful connection to tidy evaluation.
126 | But I worry that it feels like magic to most R users, and because it's only supported by some functions, it's not super clear when you can use it, and you still also have to learn `do.call()` or similar.
127 | And, at least for `bind_rows()`, we've still ended up with two functions!
128 | 
129 | All that said, `!!!` still feels like the "least worst" solution to me, although looking back I wonder if we might have been better off using the more explicit `.dots` argument.
130 | What do you think?
131 | Had you heard of `!!!` before reading this post?
132 | Have you ever used it to successfully solve a problem?
133 | Do you prefer `do.call()` or are there other approaches used by other packages that you think are better?
134 | --------------------------------------------------------------------------------
/substack/2023-09-29.qmd:
--------------------------------------------------------------------------------
1 | My dad, Brian Walter Wickham, passed away peacefully last month.
2 | We've known this was coming for a while, but it's still a big blow: he was a very important part of my life.
3 | You can get a sense of how he influenced the world through [this article in the Irish Farmer's Journal](https://www.farmersjournal.ie/former-icbf-chief-executive-brian-wickham-passes-away-779918), but here I wanted to reflect particularly on how he influenced my work.
4 | 
5 | I was lucky to have access to computers from a very young age (I don't remember when exactly, but I think around 10), thanks to Dad having a laptop for work.
6 | This was in the era where laptops were extremely expensive, heavy, and barely transportable, but I still have fond memories of using Lotus 1-2-3 (an early spreadsheet tool), learning DOS, and playing Dune 2.
7 | In my early teens, I remember being greatly excited to find a documentation manual for DOS in a computer shop and convincing Dad to buy it for me.
8 | 
9 | Databases have been a big part of Dad's work throughout his life, and he gave me "the talk" about [Codd's third normal form](https://en.wikipedia.org/wiki/Third_normal_form) when I was around 15.
10 | This sparked my interest in MS Access, which led to part-time jobs creating and documenting databases in high school and university.
11 | One project involved creating a database for the library at his work, and I still remember the feeling after I accidentally deleted a file that contained a week's worth of data entry.
12 | This led to "the talk" about backups, the principles of which I have followed ever since.
13 | Dad's knowledge about and use of databases had a deep impact on my life, leading many years later to the idea of tidy data, a framing of Codd's rules that was easier to understand and apply to statistical data.
14 | 
15 | Much of Dad's work involved collecting data about cows (hence the databases), and one of his strong beliefs was that farmers should own their own data, and it should live together in a central database that could be used for the good of all.
16 | This made open source seem natural to me: why not build a community where developers collaboratively owned their code, and could work together to solve problems that were hard to tackle individually?
17 | Dad's belief in sharing his work for the betterment of all made adopting the principles of open source software seem obvious to me when I started producing software of my own.
18 | 
19 | Dad did his PhD at Cornell, so growing up, doing a PhD overseas seemed like a totally normal and reasonable thing to do.
20 | So when I got interested in statistics and computer science in my undergrad, applying to universities in the US seemed like the obvious choice.
21 | Unfortunately Dad had given me an unrealistic sense of how long a PhD would take, since he finished his in only two years!
22 | 
23 | One of the things that most impressed me about Dad was his commitment to lifelong learning.
24 | He loved to embrace new technologies, and was a fluent user of FaceTime, AirBnB, and Uber (although he also loved to strike up conversations with strangers in a way that is very foreign to me).
25 | In his early 70s, he learned how to use GitHub and markdown so that he could edit "[People and Places of Clonakilty](https://www.lulu.com/shop/alison-wickham/people-and-places-of-clonakilty/paperback/product-q9qk6e.html)", a book that my mum wrote and that Charlotte and I produced with Quarto.
26 | He's a great role model as I get older; I want to continue to learn new things and embrace new technologies.
27 | 
28 | Dad taught me how to chair a meeting, how to grill a steak, and how to change a tire.
29 | I admire his optimism, calm and thoughtful manner, and endless patience, and hope I can live up to his legacy.
30 | He will be greatly missed.
31 | --------------------------------------------------------------------------------
/substack/2023-10-27.qmd:
--------------------------------------------------------------------------------
1 | # Strategies
2 | 
3 | ```{r}
4 | #| include: FALSE
5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
6 | ```
7 | 
8 | Over the last couple of weeks I've been noodling on the idea of strategies: what happens if your function contains a few different approaches to the same problem, and you want to let the user select which one to use?
9 | I believe that it's best to expose these explicitly, like the `ties.method` argument to `rank()`, the `method` argument to `p.adjust()`, or the `.keep` argument to `dplyr::mutate()`.
10 | 
11 | In functions that expose a strategy, it's common to see a character vector in the function interface.
12 | For example, `rank()` looks like this:
13 | 
14 | ```{r}
15 | #| eval: false
16 | rank(
17 |   x,
18 |   na.last = TRUE,
19 |   ties.method = c("average", "first", "last", "random", "max", "min")
20 | )
21 | ```
22 | 
23 | I call this character vector an **enumeration** and discuss it in "[Enumerate possible options](https://design.tidyverse.org/enumerate-options.html)".
24 | This vector enumerates (itemises) the possible options (six here) and it also tells you the default value, the first value in the vector (`"average"`).
25 | This type of default value is usually paired with either `match.arg()` or `rlang::arg_match()` to give an informative error if the user supplies an unsupported value.
26 | The chief difference between the base and rlang versions of this function is that the base version supports partial matching (e.g. you can write `rank(x, ties.method = "r")` for short), which we believe is no longer a good idea.
27 | 
28 | One of the reasons it's useful to understand this pattern is that you might want to apply it even when there are only two options.
29 | In such a case, it's tempting to expose the option as a Boolean argument, accepting either `TRUE` or `FALSE`.
30 | But this has two problems:
31 | 
32 | - You might later discover that there's a third option. Now you're going to need to make more radical changes to your function interface to allow this.
33 | - It's often trickier to understand a negative. For example, I recently discovered the `cancel_on_error` argument in an [httr2](https://httr2.r-lib.org) function. I think it's pretty clear what `cancel_on_error = TRUE` does (it cancels if there's an error), but what does `cancel_on_error = FALSE` do? I wrote this code and now I couldn't tell you what it actually does.
34 | 
35 | I explore this idea in more detail in "[Prefer an enum, even if only two choices](https://design.tidyverse.org/boolean-strategies.html)", including a deeper look at the `decreasing` and `na.last` arguments to `sort()`.
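To make the enumeration pattern concrete, here's a minimal sketch (`trim_side()` is a hypothetical function): the default is the full vector of options, and `match.arg()` reduces it to the first element, or errors informatively on an unsupported value:

```{r}
# Hypothetical function illustrating the enumeration + match.arg() pattern
trim_side <- function(x, side = c("both", "left", "right")) {
  side <- match.arg(side)  # defaults to "both"; errors on unknown values
  trimws(x, which = side)
}

trim_side("  hi  ")           # -> "hi"
trim_side("  hi  ", "right")  # -> "  hi"
```

`rlang::arg_match()` is a drop-in alternative here if you want to disallow partial matching.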
36 | 
37 | A more complicated example of the strategy pattern comes about when different strategies require different arguments.
38 | The best example of this sort of pattern is stringr, which uses the functions `regex()`, `fixed()`, `boundary()`, and `coll()` to define the pattern matching engine:
39 | 
40 | ```{r}
41 | library(stringr)
42 | x <- "The quick brown fox jumped over the lazy dog."
43 | 
44 | str_view(x, regex("[aeiou]+", ignore_case = TRUE))
45 | str_view(x, fixed("."))
46 | str_view(x, boundary("word"))
47 | ```
48 | 
49 | I explore this idea more in "[Extract strategies into objects](https://design.tidyverse.org/strategy-objects.html)" (which really needs a catchier name), motivated by the `perl`, `fixed`, and `ignore.case` arguments to `grepl()` and friends.
50 | 
51 | Finally, sometimes exposing multiple strategies in one function isn't the right move, and you're better off creating several simpler functions.
52 | I think of this problem as "[three functions in a trench coat](https://www.reddit.com/r/comics/comments/hzqw80/sheep_in_human_clothing/)" because it can feel like three or more functions crammed into one, breaking apart at the seams.
53 | `forcats::fct_lump()` is a good example of this problem: it started off simple and then gained new strategies over time.
54 | Eventually it got so hard to explain that we decided to split it apart into three simpler functions.
55 | Another good example of this problem is the `rep()` function: I think it's actually two functions in a trench coat, and it gets easier to understand if you pull them apart.
56 | See the [rep() case study](https://design.tidyverse.org/cs-rep.html) for a full exploration including some of my thoughts about what you might name the functions and arguments.
57 | 
58 | Do these patterns resonate with you?
59 | Are there other functions in the tidyverse that you think do a particularly good or bad job of exposing a strategy?
60 | Please let me know in the comments!
61 | -------------------------------------------------------------------------------- /unifying.qmd: -------------------------------------------------------------------------------- 1 | # Unifying principles 2 | 3 | The tidyverse is a language for solving data science challenges with R code. 4 | Its primary goal is to facilitate the conversation that a human has with a dataset, and we want to help dig a "pit of success" where the least-effort path trends towards a positive outcome. 5 | The primary tool to dig the pit is API design: by carefully considering the external interface to a function, we can help guide the user towards success. 6 | But it's also necessary to have some high level principles that guide how we think broadly about APIs, principles that we can use to "break ties" when other factors are balanced. 7 | 8 | The tidyverse has four guiding principles: 9 | 10 | - It is **human centered**, i.e. the tidyverse is designed specifically to support the activities of a human data analyst. 11 | 12 | - It is **consistent**, so that what you learn about one function or package can be applied to another, and the number of special cases that you need to remember is as small as possible. 13 | 14 | - It is **composable**, allowing you to solve complex problems by breaking them down into small pieces, supporting a rapid cycle of exploratory iteration to find the best solution. 15 | 16 | - It is **inclusive**, because the tidyverse is not just the collection of packages, but it is also the community of people who use them. 17 | 18 | These guiding principles are aspirational; they're not always fully realised in current tidyverse packages, but we strive to make them so. 
19 | 
20 | ### Related work {.unnumbered}
21 | 
22 | These principles are inspired by writings about the design of other systems, such as:
23 | 
24 | - [The Unix philosophy](https://homepage.cs.uri.edu/~thenry/resources/unix_art/ch01s06.html)
25 | - [The Zen of Python](https://www.python.org/dev/peps/pep-0020/)
26 | - [Design Principles Behind Smalltalk](https://refs.devinmcgloin.com/smalltalk/Design-Principles-Behind-Smalltalk.pdf)
27 | 
28 | ## Human centered
29 | 
30 | > Programs must be written for people to read, and only incidentally for machines to execute.
31 | >
32 | > --- Hal Abelson
33 | 
34 | Programming is a task performed by humans.
35 | To create effective programming tools we must explicitly recognise and acknowledge the role played by cognitive psychology.
36 | This is particularly important for R, because it's a language that's used primarily by non-programmers, and we want to make it as easy as possible for first-time and end-user programmers to learn the tidyverse.
37 | 
38 | A particularly useful tool from cognitive psychology is "cognitive load theory"[^unifying-1]: we have a limited working memory, and anything we can do to reduce extraneous cognitive load helps the learner and user of the tidyverse.
39 | This motivates the next two principles:
40 | 
41 | [^unifying-1]: A good practical introduction is [Cognitive load theory in practice](https://www.cese.nsw.gov.au/images/stories/PDF/Cognitive_load_theory_practice_guide_AA.pdf) (PDF).
42 | 
43 | - By being **consistent** you only need to learn and internalise one expression of an idea, and then you can apply that many times.
44 | 
45 | - By being **composable** you can break down complex problems into bite-sized pieces that you can easily hold in your head.
46 | 
47 | The idea of "chunking" is also important.
48 | There is some setup cost to learning a new chunk, but once you've internalised it, it only takes up one spot in your working memory.
49 | In some sense the goal of the tidyverse is to discover the minimal set of chunks needed to do data science, and to have some sense of the priority of the remainder.
50 | 
51 | Other useful ideas come from design.
52 | One particularly powerful idea is that of "affordance": the exterior of a tool should suggest how to use it.
53 | We want to avoid ["Norman doors"](https://99percentinvisible.org/article/norman-doors/), where the exterior clues and cues point you in the wrong direction.
54 | 
55 | This principle is deeply connected to our beliefs about performance.
56 | Most importantly, the performance of code depends not only on how long it takes to run, but also on how long it takes to *write* and *read*.
57 | Human brains are typically slower than computers, so we spend a lot of time thinking about how to create intuitive interfaces, focussing on writing and reading speed.
58 | Intuitive interfaces are sometimes at odds with running speed, because writing the fastest code for a problem often requires designing the interface for performance rather than usability.
59 | Generally, we optimise first for humans, then use profiling to discover bottlenecks that cause friction in data analysis.
60 | Once we have identified an important bottleneck, performance becomes a priority and we rewrite the existing code.
61 | Generally, we'll attempt to preserve the existing interface, only changing it when the performance implications are significant.
72 | 
73 | (Another framing of this principle is [Less Volume, More Creativity](http://www.calvin.edu/~rpruim/talks/LessVolume/2015-06-24-AKL/LessVolume-2015-06-24.html#1), which comes from Mike McCarthy, the head coach of the Green Bay Packers, and was popularised in statistics education by [Randall Pruim](https://www.calvin.edu/~rpruim/).)
74 | 
75 | This is related to one of my favourite sayings from the Python community:
76 | 
77 | > There should be one---and preferably only one---obvious way to do it.
78 | >
79 | > --- Zen of Python
80 | 
81 | The tidyverse aspires to put this philosophy into practice.
82 | However, because the tidyverse is embedded within the larger R ecosystem, applying this principle never needs to be 100% comprehensive.
83 | If you can't solve a problem from within the tidyverse, you can always step outside and do so with base R or another package.
84 | This also means that we don't have to rush to cover every possible use case; we can take our time to develop the best new solutions.
85 | 
86 | The principle of consistency reveals itself in two primary ways: in function APIs and in data structures.
87 | The API of a function defines its external interface (independent of its internal implementation).
88 | Having consistent APIs means that each time you learn a function, learning the next function is a little easier; once you've mastered one package, mastering the next is easier.
89 | 
90 | There are two ways that we make functions consistent that are so important that they're explicitly pulled out as high-level principles below:
91 | 
92 | - Functions should be composable: each individual function should tackle one well-contained problem, and you solve complex real-world problems by composing many individual functions.
- Overall, the API should feel "functional", which is a technical term for the programming paradigm favoured by the tidyverse.

But consistency also applies to data structures: we want to ensure we use the same data structures again and again and again.
Principally, we expect data to be stored in [tidy](https://www.jstatsoft.org/article/view/v059i10) data frames or [tibbles](https://github.com/hadley/tibble/).
This means that tools for converting other formats can be centralised in one place, and that package development is simplified by assuming that data is already in a standard format.

Valuing consistency is a trade-off, and we explicitly value it over performance.
There are cases where a different data structure or a different interface might make a solution simpler to express or much faster.
However, one-off solutions create a much higher cognitive load.

## Composable

> No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system.
>
> --- Hal Abelson

A powerful strategy for solving complex problems is to combine many simple pieces.
Each piece should be easily understood in isolation, and have a standard way of combining with other pieces.

Within the tidyverse, we prefer to compose functions using a single tool: the pipe, `%>%`.
There are two notable exceptions to this principle: ggplot2 composes graphical elements with `+`, and httr composes requests primarily through `...`.
These are not bad techniques in isolation, and they are well suited to the domains in which they are used, but the disadvantages of inconsistency outweigh any local advantages.

For smaller domains, this means carefully designing functions so that the inputs and outputs align (e.g. the output of `stringr::str_locate()` can easily be fed into `str_sub()`).
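As a sketch of this alignment (our own example, assuming stringr is available): `str_locate()` returns a two-column matrix of start and end positions, and `str_sub()` accepts exactly such a matrix as its `start` argument, so the two compose without any glue code.

```r
library(stringr)

x <- c("banana", "papaya")

# str_locate() returns a two-column matrix of start/end positions...
loc <- str_locate(x, "a.a")

# ...which str_sub() accepts directly as its start argument
str_sub(x, loc)
#> [1] "ana" "apa"
```

Designing the output of one function to match the input of the next is what lets such small pieces chain together.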
For middling domains, this means drawing many [feature matrices](https://www.evanmiller.org/feature-matrix.html) and ensuring that they are dense (e.g. consider the map family in purrr).
For larger domains, this means carefully thinking about algebras and grammars, identifying the atoms of a problem and the ways in which they might be composed to solve bigger problems.

We decompose large problems into smaller, more tractable ones by creating and combining functions that transform data, rather than by creating objects whose state changes over time.

Other techniques that tend to facilitate composability:

- Functions are data: this leads to some of the most impactful techniques from functional programming, which allow you to reduce code duplication.

- Immutable objects: these enforce independence between components.

- Partitioned side-effects.

- Type stability.

## Inclusive

We value not just the interface between the human and the computer, but also the interface between humans.
We want the tidyverse to be a diverse, inclusive, and welcoming community.

- We develop educational materials that are accessible to people with many different skill levels.

- We prefer explicit codes of conduct.

- We create safe and friendly communities.
  We believe that kindness should be a core value of communities.

- We think about how we can help others who are not like us (for example, they may be visually impaired or may not speak English).

We also appreciate the paradox of tolerance: the only people we do not welcome are the intolerant.