├── .github ├── ISSUE_TEMPLATE.md ├── PULL_REQUEST_TEMPLATE.md └── workflows │ └── render.yaml ├── .gitignore ├── .netlify ├── CODE_OF_CONDUCT.md ├── COPYRIGHT ├── DESCRIPTION ├── LICENSE ├── LICENSE-APACHE2 ├── NEWS.md ├── README.md ├── _bookdown.yml ├── _output.yml ├── appendix.Rmd ├── build.R ├── churn.Rmd ├── debugging.Rmd ├── deploy.sh ├── design.Rmd ├── drake-manual.Rproj ├── dynamic.Rmd ├── examples.Rmd ├── extras.Rmd ├── faq.R ├── faq.Rmd ├── footer.html ├── google_analytics.html ├── gsp.Rmd ├── hpc.Rmd ├── images ├── apple-touch-icon-120x120.png ├── apple-touch-icon-152x152.png ├── apple-touch-icon-180x180.png ├── apple-touch-icon-60x60.png ├── apple-touch-icon-76x76.png ├── apple-touch-icon.png ├── favicon-16x16.png ├── favicon-32x32.png ├── favicon.ico ├── logo.png └── logo.svg ├── index.Rmd ├── memory.Rmd ├── packages.Rmd ├── plans.Rmd ├── preamble.tex ├── projects.Rmd ├── resources.Rmd ├── scripts.Rmd ├── stan.Rmd ├── start.Rmd ├── static.Rmd ├── storage.Rmd ├── style.css ├── time.Rmd ├── triggers.Rmd ├── visuals.Rmd └── walkthrough.Rmd /.github/ISSUE_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | # Before posting 2 | 3 | The environment for collaboration should be friendly, inclusive, respectful, and safe for everyone, so all participants must follow [this repository's code of conduct](https://github.com/ropenscilabs/drake-manual/blob/main/CODE_OF_CONDUCT.md). 4 | -------------------------------------------------------------------------------- /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | # Summary 2 | 3 | Please explain the context and purpose of your contribution and list the changes you made to the code base or documentation. 
4 | 5 | # Related GitHub issues and pull requests 6 | 7 | - Ref: # 8 | 9 | # Checklist 10 | 11 | - [ ] I understand and agree to this repository's [code of conduct](https://github.com/ropensci/drake-manual/blob/main/CODE_OF_CONDUCT.md). 12 | - [ ] I have listed any substantial changes in the [development news](https://github.com/ropenscilabs/drake-manual/blob/main/NEWS.md). 13 | - [ ] This pull request is not a [draft](https://github.blog/2019-02-14-introducing-draft-pull-requests). 14 | -------------------------------------------------------------------------------- /.github/workflows/render.yaml: -------------------------------------------------------------------------------- 1 | on: 2 | push: 3 | branches: main 4 | 5 | name: bookdown 6 | 7 | jobs: 8 | build: 9 | runs-on: macOS-latest 10 | env: 11 | GITHUB_PAT: ${{ secrets.GITHUBPAT }} 12 | R_REMOTES_NO_ERRORS_FROM_WARNINGS: true 13 | PIP_NO_WARN_SCRIPT_LOCATION: false 14 | RETICULATE_AUTOCONFIGURE: 'FALSE' 15 | 16 | steps: 17 | - name: Checkout repo 18 | uses: actions/checkout@v2 19 | 20 | - name: Setup R 21 | uses: r-lib/actions/setup-r@master 22 | 23 | - name: Install system requirements 24 | run: | 25 | brew install pandoc 26 | 27 | - name: Cache bookdown results 28 | uses: actions/cache@v1 29 | with: 30 | path: _bookdown_files 31 | key: bookdown-${{ hashFiles('**/*Rmd') }} 32 | restore-keys: bookdown- 33 | 34 | - name: Install Python 35 | run: | 36 | Rscript -e "install.packages('remotes')" 37 | Rscript -e "remotes::install_github('rstudio/reticulate')" 38 | Rscript -e "reticulate::install_miniconda()" 39 | 40 | - name: Install TensorFlow 41 | run: | 42 | Rscript -e "install.packages(c('keras', 'tensorflow'))" 43 | Rscript -e "keras::install_keras()" 44 | Rscript -e "tensorflow::tf_config()" 45 | 46 | - name: Install packages 47 | run: | 48 | Rscript -e "install.packages('remotes')" 49 | Rscript -e "remotes::install_deps(dependencies = TRUE)" 50 | Rscript -e "install.packages('yardstick')" 51 | 52 | - name: 
Build manual 53 | run: | 54 | Rscript faq.R 55 | Rscript build.R 56 | ./deploy.sh 57 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .drake 2 | .Rproj.user 3 | .Rhistory 4 | .RData 5 | .Ruserdata 6 | _book 7 | _bookdown_files 8 | drake-manual.Rmd 9 | rsconnect 10 | Thumbs.db 11 | -------------------------------------------------------------------------------- /.netlify: -------------------------------------------------------------------------------- 1 | {"site_id":"8ae9dc68-59a5-41b3-8e0d-bdccf2ef6568","path":"_book"} 2 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Code of Conduct 2 | 3 | As contributors and maintainers of this project, we pledge to respect all people who 4 | contribute through reporting issues, posting feature requests, updating documentation, 5 | submitting pull requests or patches, and other activities. 6 | 7 | We are committed to making participation in this project a harassment-free experience for 8 | everyone, regardless of level of experience, gender, gender identity and expression, 9 | sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. 10 | 11 | Examples of unacceptable behavior by participants include the use of sexual language or 12 | imagery, derogatory comments or personal attacks, trolling, public or private harassment, 13 | insults, or other unprofessional conduct. 14 | 15 | Project maintainers have the right and responsibility to remove, edit, or reject comments, 16 | commits, code, wiki edits, issues, and other contributions that are not aligned to this 17 | Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed 18 | from the project team. 
19 | 20 | Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by 21 | opening an issue or contacting one or more of the project maintainers. 22 | 23 | This Code of Conduct is adapted from the Contributor Covenant 24 | (http://contributor-covenant.org), version 1.0.0, available at 25 | http://contributor-covenant.org/version/1/0/0/ 26 | -------------------------------------------------------------------------------- /COPYRIGHT: -------------------------------------------------------------------------------- 1 | - The chapter on deep learning and customer churn (churn.Rmd) was taken directly from the notebook at https://github.com/sol-eng/tensorflow-w-r/blob/master/workflow/tensorflow-drake.Rmd and modified under the Apache 2.0 license (copyright RStudio). See LICENSE-APACHE2 for a copy of the license. 2 | - All videos are indirectly embedded and copyright remains with the respective owners of the original content. 3 | - Other material was released under GPL version 3, copyright Eli Lilly and Company. See LICENSE for a copy of the GPL 3 license. 
4 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: drake.manual 2 | Type: Book 3 | Title: The drake R Package User Manual 4 | Version: 7.12.2.9000 5 | License: Files LICENSE and LICENSE-APACHE2 6 | URL: https://github.com/ropensci-books/drake 7 | BugReports: https://github.com/ropensci-books/drake/issues 8 | Authors@R: c( 9 | person( 10 | family = "Landau", 11 | given = c("William", "Michael"), 12 | email = "will.landau@gmail.com", 13 | role = c("aut", "cre"), 14 | comment = c(ORCID = "0000-0003-1878-3253") 15 | ), 16 | person( 17 | family = "Axthelm", 18 | given = "Alex", 19 | email = "aaxthelm@che.IN.gov", 20 | role = "ctb" 21 | ), 22 | person( 23 | family = "Clarkberg", 24 | given = "Jasper", 25 | email = "jasper@clarkberg.org", 26 | role = "ctb" 27 | ), 28 | person( 29 | family = "Müller", 30 | given = "Kirill", 31 | email = "krlmlr+r@mailbox.org", 32 | role = "ctb" 33 | ), 34 | person( 35 | family = "Walthert", 36 | given = "Lorenz", 37 | email = "lorenz.walthertr@icloud.com", 38 | role = "ctb" 39 | ), 40 | person( 41 | family = "Hughes", 42 | given = "Ellis", 43 | email = "ellishughes@live.com", 44 | role = "aut" 45 | ), 46 | person( 47 | given = c("Matthew", "Mark"), 48 | family = "Strasiotto", 49 | role = "ctb", 50 | email = c( 51 | "matthew.strasiotto@gmail.com", 52 | "mstr3336@uni.sydney.edu.au" 53 | ) 54 | ), 55 | person( 56 | family = "Marwick", 57 | given = "Ben", 58 | email = "bmarwick@uw.edu", 59 | role = "rev" 60 | ), 61 | person( 62 | family = "Slaughter", 63 | given = "Peter", 64 | email = "slaughter@nceas.ucsb.edu", 65 | role = "rev" 66 | ), 67 | person( 68 | family = "Eli Lilly and Company", 69 | role = "cph" 70 | )) 71 | SystemRequirements: 72 | Python (>= 2.7.0), 73 | TensorFlow (https://www.tensorflow.org/), 74 | Keras >= 2.0 (https://keras.io) 75 | Depends: 76 | R (>= 3.2.0), 77 | biglm, 78 | bookdown, 
79 | broom, 80 | cranlogs, 81 | curl, 82 | DBI, 83 | drake, 84 | downloader, 85 | Ecdat, 86 | forcats, 87 | fs, 88 | future, 89 | gapminder, 90 | ggplot2, 91 | ggraph, 92 | gh, 93 | glue, 94 | httr, 95 | keras, 96 | knitr, 97 | lubridate, 98 | microbenchmark, 99 | networkD3, 100 | pryr, 101 | purrr, 102 | R.utils, 103 | RSQLite, 104 | readxl, 105 | recipes, 106 | rlang, 107 | rmarkdown, 108 | rsample, 109 | rvest, 110 | styler, 111 | tensorflow, 112 | tidyverse, 113 | txtplot, 114 | txtq (>= 0.2.3), 115 | visNetwork, 116 | yardstick 117 | Remotes: 118 | ropensci/drake 119 | -------------------------------------------------------------------------------- /LICENSE-APACHE2: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 
25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. 
You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. 
(Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. -------------------------------------------------------------------------------- /NEWS.md: -------------------------------------------------------------------------------- 1 | ## Version 7.12.2.9000 2 | 3 | * Replace the `iris` dataset with the `airquality` dataset. 4 | 5 | ## Version 7.7.0.9002 6 | 7 | - Add a chapter called `scripts` describing `drake`'s approach to script-based workflows and using `code_to_function()`. (#41, @thebioengineer) 8 | - Document new `on_select` behaviour of `vis_drake_graph()`, `drake_graph_info()`, and `render_drake_graph()`. 9 | - Migrate the manual to https://github.com/ropensci-books/drake and https://books.ropensci.org/drake/ (@jeroen). 10 | 11 | ## Version 6.0.0.9000 12 | 13 | - Add a link to the presentation at https://sinarueeger.github.io/20181004-geneve-rug. 14 | 15 | ## Version 0.0.1 16 | 17 | - First draft of the manual. 
18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Consider targets 2 | 3 | superseded lifecycle 4 | 5 | `drake` is [superseded](https://www.tidyverse.org/lifecycle/#superseded). The [`targets`](https://docs.ropensci.org/targets) R package is the long-term successor of `drake`, and it is more robust and easier to use. Please visit for full context and advice on transitioning. 6 | 7 | # The drake R package user manual 8 | 9 | This is the development repository of the [`drake` R package](https://github.com/ropensci/drake) user manual, hosted [here](https://books.ropensci.org/drake/). Please feel free to discuss on the [issue tracker](https://github.com/ropensci-books/drake/issues) and submit [pull requests](https://github.com/ropensci-books/drake/pulls) to add new examples and update old ones. The environment for collaboration should be friendly, inclusive, respectful, and safe for everyone, so all participants must obey [this repository's code of conduct](https://github.com/ropensci-books/drake/blob/main/CODE_OF_CONDUCT.md). 
10 | -------------------------------------------------------------------------------- /_bookdown.yml: -------------------------------------------------------------------------------- 1 | book_filename: "drake-manual" 2 | repo: https://books.ropensci.org/drake/ 3 | language: 4 | ui: 5 | chapter_name: "Chapter " 6 | delete_merged_file: true 7 | rmd_files: [ 8 | "index.Rmd", 9 | "start.Rmd", 10 | "walkthrough.Rmd", 11 | "plans.Rmd", 12 | "static.Rmd", 13 | "dynamic.Rmd", 14 | "projects.Rmd", 15 | "scripts.Rmd", 16 | "examples.Rmd", 17 | "churn.Rmd", 18 | "stan.Rmd", 19 | "packages.Rmd", 20 | "gsp.Rmd", 21 | "resources.Rmd", 22 | "hpc.Rmd", 23 | "time.Rmd", 24 | "memory.Rmd", 25 | "storage.Rmd", 26 | "extras.Rmd", 27 | "visuals.Rmd", 28 | "debugging.Rmd", 29 | "triggers.Rmd", 30 | "appendix.Rmd", 31 | "faq.Rmd", 32 | "design.Rmd" 33 | ] 34 | -------------------------------------------------------------------------------- /_output.yml: -------------------------------------------------------------------------------- 1 | bookdown::gitbook: 2 | css: style.css 3 | config: 4 | sharing: 5 | github: yes 6 | facebook: false 7 | toc: 8 | collapse: subsection 9 | before: | 10 |
  • The drake R Package User Manual
  • 11 | after: | 12 |
  • Published with bookdown
  • 13 | edit: 14 | link: https://github.com/ropensci-books/drake/edit/main/%s 15 | text: "Edit this chapter" 16 | history: 17 | link: https://github.com/ropensci-books/drake/commits/main/%s 18 | text: "Chapter edit history" 19 | -------------------------------------------------------------------------------- /appendix.Rmd: -------------------------------------------------------------------------------- 1 | # (APPENDIX) Appendix {-} 2 | -------------------------------------------------------------------------------- /build.R: -------------------------------------------------------------------------------- 1 | library(fs) 2 | dir <- "_book" 3 | if (file.exists(dir)){ 4 | dir_delete(dir) 5 | } 6 | dir_create(dir) 7 | dir_copy("images", dir) 8 | options(drake_make_menu = FALSE) 9 | drake::clean(destroy = TRUE) 10 | bookdown::render_book( 11 | input = "index.Rmd", 12 | output_format = "bookdown::gitbook" 13 | ) 14 | -------------------------------------------------------------------------------- /churn.Rmd: -------------------------------------------------------------------------------- 1 | # Customer churn and deep learning {#churn} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | library(drake) 5 | library(glue) 6 | library(purrr) 7 | library(rlang) 8 | library(tidyverse) 9 | tmp <- suppressWarnings(drake_plan(x = 1, y = 2)) 10 | ``` 11 | 12 | 13 | ```{r, include = FALSE} 14 | library(drake) 15 | library(keras) 16 | library(tidyverse) 17 | library(rsample) 18 | library(recipes) 19 | library(yardstick) 20 | options( 21 | drake_make_menu = FALSE, 22 | drake_clean_menu = FALSE, 23 | warnPartialMatchArgs = FALSE, 24 | crayon.enabled = FALSE, 25 | readr.show_progress = FALSE, 26 | tidyverse.quiet = TRUE 27 | ) 28 | knitr::opts_chunk$set( 29 | collapse = TRUE, 30 | comment = "#>" 31 | ) 32 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 33 | ``` 34 | 35 | ```{r, include = FALSE} 36 | invisible(drake_example("customer-churn", overwrite = 
TRUE)) 37 | invisible(file.copy("customer-churn/data/customer_churn.csv", ".", overwrite = TRUE)) 38 | ``` 39 | 40 | [`drake`](https://github.com/ropensci/drake) is designed for workflows with long runtimes, and a major use case is deep learning. This chapter demonstrates how to leverage [`drake`](https://github.com/ropensci/drake) to manage a deep learning workflow. The original example comes from a [blog post by Matt Dancho](https://blogs.rstudio.com/tensorflow/posts/2018-01-11-keras-customer-churn/), and the chapter's content itself comes directly from [this R notebook](https://github.com/sol-eng/tensorflow-w-r/blob/master/workflow/tensorflow-drake.Rmd), part of an [RStudio Solutions Engineering example demonstrating TensorFlow in R](https://github.com/sol-eng/tensorflow-w-r). The notebook is modified and redistributed under the terms of the [Apache 2.0 license](https://github.com/sol-eng/tensorflow-w-r/blob/master/LICENSE), copyright RStudio ([details here](https://github.com/ropensci-books/drake/blob/main/COPYRIGHT)). 41 | 42 | ## Churn packages 43 | 44 | First, we load our packages into a fresh R session. 45 | 46 | ```{r, keraspkgs} 47 | library(drake) 48 | library(keras) 49 | library(tidyverse) 50 | library(rsample) 51 | library(recipes) 52 | library(yardstick) 53 | ``` 54 | 55 | ## Churn functions 56 | 57 | [`drake`](https://github.com/ropensci/drake) is R-focused and function-oriented. We create functions to [preprocess the data](https://github.com/tidymodels/recipes), 58 | 59 | ```{r} 60 | prepare_recipe <- function(data) { 61 | data %>% 62 | training() %>% 63 | recipe(Churn ~ .) 
%>% 64 | step_rm(customerID) %>% 65 | step_naomit(all_outcomes(), all_predictors()) %>% 66 | step_discretize(tenure, options = list(cuts = 6)) %>% 67 | step_log(TotalCharges) %>% 68 | step_mutate(Churn = ifelse(Churn == "Yes", 1, 0)) %>% 69 | step_dummy(all_nominal(), -all_outcomes()) %>% 70 | step_center(all_predictors(), -all_outcomes()) %>% 71 | step_scale(all_predictors(), -all_outcomes()) %>% 72 | prep() 73 | } 74 | ``` 75 | 76 | define a [`keras`](https://github.com/rstudio/keras) model, exposing arguments to set the dimensionality and activation functions of the layers, 77 | 78 | ```{r, kerasmodel} 79 | define_model <- function(rec, units1, units2, act1, act2, act3) { 80 | input_shape <- ncol( 81 | juice(rec, all_predictors(), composition = "matrix") 82 | ) 83 | keras_model_sequential() %>% 84 | layer_dense( 85 | units = units1, 86 | kernel_initializer = "uniform", 87 | activation = act1, 88 | input_shape = input_shape 89 | ) %>% 90 | layer_dropout(rate = 0.1) %>% 91 | layer_dense( 92 | units = units2, 93 | kernel_initializer = "uniform", 94 | activation = act2 95 | ) %>% 96 | layer_dropout(rate = 0.1) %>% 97 | layer_dense( 98 | units = 1, 99 | kernel_initializer = "uniform", 100 | activation = act3 101 | ) 102 | } 103 | ``` 104 | 105 | train a model, 106 | 107 | ```{r, kerastrain} 108 | train_model <- function( 109 | rec, 110 | units1 = 16, 111 | units2 = 16, 112 | act1 = "relu", 113 | act2 = "relu", 114 | act3 = "sigmoid" 115 | ) { 116 | model <- define_model( 117 | rec = rec, 118 | units1 = units1, 119 | units2 = units2, 120 | act1 = act1, 121 | act2 = act2, 122 | act3 = act3 123 | ) 124 | compile( 125 | model, 126 | optimizer = "adam", 127 | loss = "binary_crossentropy", 128 | metrics = c("accuracy") 129 | ) 130 | x_train_tbl <- juice( 131 | rec, 132 | all_predictors(), 133 | composition = "matrix" 134 | ) 135 | y_train_vec <- juice(rec, all_outcomes()) %>% 136 | pull() 137 | fit( 138 | object = model, 139 | x = x_train_tbl, 140 | y = y_train_vec, 141 | 
batch_size = 32, 142 | epochs = 32, 143 | validation_split = 0.3, 144 | verbose = 0 145 | ) 146 | model 147 | } 148 | ``` 149 | 150 | compare predictions against reality, 151 | 152 | ```{r, kerasconf} 153 | confusion_matrix <- function(data, rec, model) { 154 | testing_data <- bake(rec, testing(data)) 155 | x_test_tbl <- testing_data %>% 156 | select(-Churn) %>% 157 | as.matrix() 158 | y_test_vec <- testing_data %>% 159 | select(Churn) %>% 160 | pull() 161 | yhat_keras_class_vec <- model %>% 162 | predict_classes(x_test_tbl) %>% 163 | as.factor() %>% 164 | fct_recode(yes = "1", no = "0") 165 | yhat_keras_prob_vec <- 166 | model %>% 167 | predict_proba(x_test_tbl) %>% 168 | as.vector() 169 | test_truth <- y_test_vec %>% 170 | as.factor() %>% 171 | fct_recode(yes = "1", no = "0") 172 | estimates_keras_tbl <- tibble( 173 | truth = test_truth, 174 | estimate = yhat_keras_class_vec, 175 | class_prob = yhat_keras_prob_vec 176 | ) 177 | estimates_keras_tbl %>% 178 | conf_mat(truth, estimate) 179 | } 180 | ``` 181 | 182 | and compare the performance of multiple models. 183 | 184 | ```{r, kerascompare} 185 | compare_models <- function(...) { 186 | name <- match.call()[-1] %>% 187 | as.character() 188 | df <- map_df(list(...), summary) %>% 189 | filter(.metric %in% c("accuracy", "sens", "spec")) %>% 190 | mutate(name = rep(name, each = n() / length(name))) %>% 191 | rename(metric = .metric, estimate = .estimate) 192 | ggplot(df) + 193 | geom_line(aes(x = metric, y = estimate, color = name, group = name)) + 194 | theme_gray(24) 195 | } 196 | ``` 197 | 198 | ## Churn plan 199 | 200 | Next, we define our workflow in a [`drake` plan](https://books.ropensci.org/drake/plans.html). We will prepare the data, train different models with different activation functions, and compare the models in terms of performance. 
201 | 202 | ```{r, kerasplan1} 203 | activations <- c("relu", "sigmoid") 204 | 205 | plan <- drake_plan( 206 | data = read_csv(file_in("customer_churn.csv"), col_types = cols()) %>% 207 | initial_split(prop = 0.3), 208 | rec = prepare_recipe(data), 209 | model = target( 210 | train_model(rec, act1 = act), 211 | format = "keras", # Supported in drake > 7.5.2 to store models properly. 212 | transform = map(act = !!activations) 213 | ), 214 | conf = target( 215 | confusion_matrix(data, rec, model), 216 | transform = map(model, .id = act) 217 | ), 218 | metrics = target( 219 | compare_models(conf), 220 | transform = combine(conf) 221 | ) 222 | ) 223 | ``` 224 | 225 | The plan is a data frame with one row per step of the workflow. 226 | 227 | ```{r, paged.print = FALSE, warning = FALSE} 228 | plan 229 | ``` 230 | 231 | ## Churn dependency graph 232 | 233 | The graph visualizes the dependency relationships among the steps of the workflow. 234 | 235 | ```{r, message = FALSE} 236 | vis_drake_graph(plan) 237 | ``` 238 | 239 | ## Run the Keras models 240 | 241 | Call [`make()`](https://docs.ropensci.org/drake/reference/make.html) to actually run the workflow. 242 | 243 | ```{r} 244 | make(plan) 245 | ``` 246 | 247 | ## Inspect the Keras results 248 | 249 | The two models performed about the same. 250 | 251 | ```{r} 252 | readd(metrics) # see also loadd() 253 | ``` 254 | 255 | ## Add Keras models 256 | 257 | Let's try the softmax activation function. 258 | 259 | ```{r} 260 | activations <- c("relu", "sigmoid", "softmax") 261 | 262 | plan <- drake_plan( 263 | data = read_csv(file_in("customer_churn.csv"), col_types = cols()) %>% 264 | initial_split(prop = 0.3), 265 | rec = prepare_recipe(data), 266 | model = target( 267 | train_model(rec, act1 = act), 268 | format = "keras", # Supported in drake > 7.5.2 to store models properly. 
269 | transform = map(act = !!activations) 270 | ), 271 | conf = target( 272 | confusion_matrix(data, rec, model), 273 | transform = map(model, .id = act) 274 | ), 275 | metrics = target( 276 | compare_models(conf), 277 | transform = combine(conf) 278 | ) 279 | ) 280 | ``` 281 | 282 | ```{r, message = FALSE} 283 | vis_drake_graph(plan) # see also outdated() and predict_runtime() 284 | ``` 285 | 286 | [`make()`](https://docs.ropensci.org/drake/reference/make.html) skips the relu and sigmoid models because they are already up to date. (Their dependencies did not change.) Only the softmax model needs to run. 287 | 288 | ```{r} 289 | make(plan) 290 | ``` 291 | 292 | ## Inspect the Churn results again 293 | 294 | ```{r} 295 | readd(metrics) # see also loadd() 296 | ``` 297 | 298 | ## Update the Churn code 299 | 300 | If you change upstream functions, even nested ones, `drake` automatically refits the affected models. Let's increase dropout in both layers. 301 | 302 | ```{r} 303 | define_model <- function(rec, units1, units2, act1, act2, act3) { 304 | input_shape <- ncol( 305 | juice(rec, all_predictors(), composition = "matrix") 306 | ) 307 | keras_model_sequential() %>% 308 | layer_dense( 309 | units = units1, 310 | kernel_initializer = "uniform", 311 | activation = act1, 312 | input_shape = input_shape 313 | ) %>% 314 | layer_dropout(rate = 0.15) %>% # Changed from 0.1 to 0.15. 315 | layer_dense( 316 | units = units2, 317 | kernel_initializer = "uniform", 318 | activation = act2 319 | ) %>% 320 | layer_dropout(rate = 0.15) %>% # Changed from 0.1 to 0.15. 321 | layer_dense( 322 | units = 1, 323 | kernel_initializer = "uniform", 324 | activation = act3 325 | ) 326 | } 327 | ``` 328 | 329 | All the models and downstream results are affected. 330 | 331 | ```{r} 332 | make(plan) 333 | ``` 334 | 335 | ## Churn history and provenance 336 | 337 | `drake` tracks history and provenance. 
You can see which models you ran, when you ran them, how long they took, and which settings you tried (i.e. named arguments to function calls in your commands). 338 | 339 | ```{r} 340 | history <- drake_history() 341 | history 342 | ``` 343 | 344 | And as long as you did not run `clean(garbage_collection = TRUE)`, you can get the old data back. Let's find the oldest run of the relu model. 345 | 346 | ```{r} 347 | hash <- history %>% 348 | filter(act1 == "relu") %>% 349 | pull(hash) %>% 350 | head(n = 1) 351 | drake_cache()$get_value(hash) 352 | ``` 353 | 354 | ```{r, echo = FALSE} 355 | clean(destroy = TRUE) 356 | ``` 357 | -------------------------------------------------------------------------------- /debugging.Rmd: -------------------------------------------------------------------------------- 1 | # Debugging and testing drake projects {#debugging} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set( 6 | collapse = TRUE, 7 | comment = "#>", 8 | warning = TRUE 9 | ) 10 | options( 11 | drake_make_menu = FALSE, 12 | drake_clean_menu = FALSE, 13 | warnPartialMatchArgs = FALSE, 14 | crayon.enabled = FALSE, 15 | readr.show_progress = FALSE, 16 | tidyverse.quiet = TRUE 17 | ) 18 | ``` 19 | 20 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 21 | library(drake) 22 | library(tidyverse) 23 | ``` 24 | 25 | This chapter aims to help users detect and diagnose problems with large complex workflows. 26 | 27 | ## Debugging failed targets 28 | 29 | ### Diagnosing errors 30 | 31 | When a target fails, `drake` tries to tell you. 32 | 33 | ```{r, error = TRUE} 34 | large_dataset <- function() { 35 | data.frame(x = rnorm(1e6), y = rnorm(1e6)) 36 | } 37 | 38 | expensive_analysis <- function(data) { 39 | # More operations go here. 40 | tricky_operation(data) 41 | } 42 | 43 | tricky_operation <- function(data) { 44 | # Expensive code here. 
45 | stop("there is a bug somewhere.") 46 | } 47 | 48 | plan <- drake_plan( 49 | data = large_dataset(), 50 | analysis = expensive_analysis(data) 51 | ) 52 | 53 | make(plan) 54 | ``` 55 | 56 | `diagnose()` recovers the metadata on targets. For failed targets, this includes an error object. 57 | 58 | ```{r} 59 | error <- diagnose(analysis)$error 60 | error 61 | 62 | names(error) 63 | ``` 64 | 65 | Using the call stack, you can trace back the location of the error. Once you know roughly where to find the bug, you can troubleshoot interactively. 66 | 67 | ```{r} 68 | invisible(lapply(tail(error$calls, 3), print)) 69 | ``` 70 | 71 | ### Interactive debugging 72 | 73 | The clues from `diagnose()` help us go back and inspect the failing code. `debug()` is an interactive debugging tool which helps you verify exactly what is going wrong. Below, `make(plan)` pauses execution and turns interactive control over to you inside `tricky_operation()`. 74 | 75 | ```{r, eval = FALSE} 76 | debug(tricky_operation) 77 | make(plan) # Pauses at tricky_operation(data). 78 | undebug(tricky_operation) # Undoes debug(). 79 | ``` 80 | 81 | `drake`'s own `drake_debug()` function is nearly equivalent. 82 | 83 | ```{r, eval = FALSE} 84 | drake_debug(analysis, plan) # Pauses at the command expensive_analysis(data). 85 | ``` 86 | 87 | `browser()` is similar, but it affords you finer control over where execution pauses. 88 | 89 | ```{r, eval = FALSE} 90 | tricky_operation <- function(data) { 91 | # Expensive code here. 92 | browser() # Pauses right here to give you control. 93 | stop("there is a bug somewhere.") 94 | } 95 | 96 | make(plan) 97 | ``` 98 | 99 | ### Efficient trial and error 100 | 101 | If you are using `drake`, then chances are your targets are computationally expensive and the long runtimes make debugging difficult. To speed up trial and error, run the plan on a small dataset when you debug and repair things.
102 | 103 | ```{r, eval = FALSE} 104 | plan <- drake_plan( 105 | data = head(large_dataset()), # Just work with the first few rows. 106 | analysis = expensive_analysis(data) # Runs faster now. 107 | ) 108 | ``` 109 | 110 | ```{r, eval = FALSE} 111 | tricky_operation <- ... # Try to fix the function. 112 | 113 | debug(tricky_operation) # Set up to debug interactively. 114 | 115 | make(plan) # Try to run the workflow. 116 | ``` 117 | 118 | After a lot of quick trial and error, we finally fix the function and run it on the small data. 119 | 120 | ```{r} 121 | tricky_operation <- function(data) { 122 | # Good code goes here. 123 | } 124 | 125 | make(plan) 126 | ``` 127 | 128 | Now that the code works, it is time to scale back up to the large data. Use `make(plan, recover = TRUE)` to salvage old targets from before the debugging process. 129 | 130 | ```{r} 131 | plan <- drake_plan( 132 | data = large_dataset(), # Use the large data again. 133 | analysis = expensive_analysis(data) # Should be repaired now. 134 | ) 135 | 136 | make(plan, recover = TRUE) 137 | ``` 138 | 139 | 140 | ## Why do my targets keep rerunning? 141 | 142 | Consider the following completed workflow. 143 | 144 | ```{r} 145 | load_mtcars_example() 146 | make(my_plan) 147 | ``` 148 | 149 | At this point, if you change the `reg1()` function, then `make()` will automatically detect and rerun downstream targets such as `regression1_large`. 150 | 151 | ```{r} 152 | reg1 <- function (d) { 153 | lm(y ~ 1 + x, data = d) 154 | } 155 | 156 | make(my_plan) 157 | ``` 158 | 159 | ```{r, echo = FALSE} 160 | reg1 <- function (d) { 161 | lm(y ~ 2 + x, data = d) 162 | } 163 | ``` 164 | 165 | 166 | In general, targets are "outdated" or "invalidated" when they are out of sync with their dependencies. If a target is outdated, the next `make()` automatically detects discrepancies and rebuilds the affected targets. Usually, this automation adds convenience, saves time, and ensures reproducibility in the face of long runtimes.
167 | 168 | However, it can be frustrating when `drake` detects outdated targets even though you think everything is up to date. If this happens, it is important to understand 169 | 170 | 1. How your workflow fits together. 171 | 2. Which targets are outdated. 172 | 3. Why your targets are outdated. 173 | 4. Strategies to prevent unexpected changes in the future. 174 | 175 | `drake`'s utility functions offer clues to guide you. 176 | 177 | ### How your workflow fits together 178 | 179 | `drake` automatically analyzes your plan and functions to understand how your targets depend on each other. It assembles this information in a directed acyclic graph (DAG) which you can visualize and explore. 180 | 181 | ```{r} 182 | vis_drake_graph(my_plan) 183 | ``` 184 | 185 | To get a more localized version of the graph, use `deps_target()`. Unlike `vis_drake_graph()`, `deps_target()` gives you a more granular view of the dependencies of an individual target. 186 | 187 | ```{r} 188 | deps_target(regression1_large, my_plan) 189 | 190 | deps_target(report, my_plan) 191 | ``` 192 | 193 | To understand how `drake` detects dependencies in the first place, use `deps_code()`. This is what `drake` first sees when it reads your plan and functions to understand the dependencies. 194 | 195 | ```{r} 196 | deps_code(quote( 197 | suppressWarnings(summary(regression1_large$residuals)) 198 | )) 199 | 200 | deps_code(quote( 201 | knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE) 202 | )) 203 | ``` 204 | 205 | 206 | If `drake` detects new dependencies you were unaware of, that could be a reason why your targets are out of date. 207 | 208 | ### Which targets are outdated 209 | 210 | Graphing utilities like `vis_drake_graph()` label the outdated targets, but sometimes it is helpful to get a more programmatic view. 211 | 212 | ```{r} 213 | outdated(my_plan) 214 | ``` 215 | 216 | ### Why your targets are outdated 217 | 218 | The `deps_profile()` function offers clues.
219 | 220 | ```{r} 221 | deps_profile(regression1_small, my_plan) 222 | ``` 223 | 224 | From the data frame above, `regression1_small` is outdated because an R object dependency changed since the last `make()`. `drake` does not hold on to enough information to tell you precisely which object is the culprit, but functions like `vis_drake_graph()`, `deps_target()`, and `deps_code()` can help narrow down the possibilities. 225 | 226 | 227 | ### Strategies to prevent unexpected changes in the future 228 | 229 | `drake` is sensitive to changing functions in your global environment, and this sensitivity can invalidate targets unexpectedly. Whenever you plan to run `make()`, it is always best to restart your R session and load your packages and functions into a fresh clean workspace. [`r_make()`](https://docs.ropensci.org/drake/reference/r_make.html) does all this cleaning and prep work for you automatically, and it is more robust and dependable (and childproofed) than ordinary `make()`. 230 | 231 | ## More help 232 | 233 | The [GitHub issue tracker](https://github.com/ropensci/drake/issues) is the best place to request help with your specific use case.
234 | -------------------------------------------------------------------------------- /deploy.sh: -------------------------------------------------------------------------------- 1 | git config --global user.email "will.landau@gmail.com" 2 | git config --global user.name "wlandau" 3 | git clone -b gh-pages https://${GITHUB_PAT}@github.com/ropensci-books/drake.git gh-pages 4 | cd gh-pages 5 | ls -a | grep -Ev "^\.$|^..$|^\.git$" | xargs rm -rf 6 | cp -r ../_book/* ./ 7 | git add * 8 | git commit -am "Update the manual" || true 9 | git push origin gh-pages 10 | -------------------------------------------------------------------------------- /design.Rmd: -------------------------------------------------------------------------------- 1 | # Design {#design} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 17 | library(drake) 18 | library(tidyverse) 19 | ``` 20 | 21 | This chapter explains `drake`'s internal design and architecture. Goals: 22 | 23 | 1. Help developers and enthusiastic users contribute to the [code base](https://github.com/ropensci/drake). 24 | 2. [Invite high-level advice and discussion](https://github.com/ropensci/drake/issues) about potential improvements to the overall design. 25 | 26 | ## Principles 27 | 28 | ### Functions first 29 | 30 | From the user's point of view, `drake` is a [style of programming](https://books.ropensci.org/drake/plans.html#intro-to-plans) in its own right, and that style is [zealously and irrevocably function-oriented](https://books.ropensci.org/drake/plans.html#functions). 
It harmonizes with statistics and data science, where most methodology naturally takes the form of data transformations, and it embraces the natively function-oriented design of the R language. Functions are first-class citizens in `drake`, and they dominate the internal design at the highest levels. 31 | 32 | ### Light use of traditional OOP 33 | 34 | Most of a `drake` workflow happens inside the `make()` function. `make()` accepts a data frame of function calls (the [`drake` plan](#plans)), caches some targets, and then drops its internal state when it terminates. The state does not need to persist, and the user does not need to interact with it. This is a major reason why traditional object-oriented programming plays such a small, supporting role. 35 | 36 | In `drake`, full OOP classes and objects are small, simple, and extremely specialized. For example, the [decorated `storr`](https://github.com/ropensci/drake/blob/main/R/decorate_storr.R), [priority queue](https://github.com/ropensci/drake/blob/main/R/priority_queue.R), and [logger](https://github.com/ropensci/drake/blob/main/R/logger.R) [reference classes](http://adv-r.had.co.nz/R5.html) are narrowly defined and fit for purpose. The [S3 system](http://adv-r.had.co.nz/S3.html) appears far more frequently, often as a mechanism of [function overloading](https://en.wikipedia.org/wiki/Function_overloading) to streamline control flow, and also as a means of adding structure and validation to small target-specific objects optimized for performance. 37 | 38 | In future development, tactical reference classes will arise as needed to encapsulate low-level patterns into natural abstractions. However, `drake`'s design places greater importance on maximizing runtime efficiency. 39 | 40 | ### High-performance small objects 41 | 42 | `drake` maintains several small list-like objects for each target, such as the local spec, the target data, triggers, and the code analysis results.
`drake` workflows with thousands of targets have thousands of these objects, and as [profiling](https://github.com/r-prof/proffer) studies have shown, we need these objects to perform as efficiently as possible. Instantiation and field access need to be fast, and the memory footprint needs to be low. For these reasons, we choose simple lists with S3 class attributes, which outclass S4 and reference classes when it comes to instantiation speed. 43 | 44 | ### Fast iteration along aggregated data 45 | 46 | Each of the large data structures aggregates a single type of information across all targets to help `drake` run fast. Examples include the [whole workflow specification](https://github.com/ropensci/drake/blob/main/R/create_drake_spec.R) (`config$spec`) and the in-memory target metadata cache (`config$meta`). These objects are hash-table-powered environments to make field access as fast as possible. 47 | 48 | ### Access to information across targets 49 | 50 | `drake` aggressively analyzes dependency relationships among targets. Even while `make()` builds a single target, it needs to stay aware of the other targets, not only to build the [dependency graph](https://github.com/ropensci/drake/blob/main/R/create_drake_graph.R), but also for other tasks like [dynamic branching](#dynamic). This is a major reason why the workflow specification, dependency graph, priority queue, and metadata are all stored in environments that most functions can reach. 51 | 52 | ## Specific classes 53 | 54 | This section describes `drake`'s primary internal data structures at a high level. It is not exhaustive, but it does cover most of the architecture. 55 | 56 | ### Config 57 | 58 | `make()`, `outdated()`, `vis_drake_graph()`, and related utilities keep track of a [`drake_config()`](https://docs.ropensci.org/drake/reference/drake_config.html) object. A `drake_config()` object is a list of class `"drake_config"`. 
Its purpose is to keep track of the state of a `drake` workflow and avoid long parameter lists in functions. Future development will focus on refactoring and formalizing `drake_config()` objects. 59 | 60 | ### Settings 61 | 62 | Static runtime parameters such as `keep_going` and `log_build_times` live in a list of class `drake_settings`, which is part of each `drake_config` object. 63 | 64 | ### Plan 65 | 66 | The `drake` plan is a simple data frame of class `"drake_plan"`, and it is `drake`'s version of a Makefile. The manual has a [whole chapter](#plans) on plans. 67 | 68 | ### Specification 69 | 70 | A `drake` plan is an *implicit* representation of targets and their immediate dependencies. Before `make()` starts to build targets, `drake` makes all these local dependency structures *explicit* and machine-readable in a [workflow `specification`](https://github.com/ropensci/drake/blob/main/R/create_drake_spec.R). The overall specification (`config$spec`) is an R environment with the local specification of each individual target and each imported object/function. Each local specification is a list of class `"drake_spec"`, and it contains the names of objects referenced from the command, the files declared with `file_in()` and friends, the dependencies of the `condition` and `change` triggers, etc. 71 | 72 | ### Graph 73 | 74 | Whereas the specification tracks the *local* dependency structures, the graph (an `igraph` object) represents the *global* dependency structure of the whole workflow. It is less granular than the specification, and `make()` uses it to run the correct targets in the correct order. 75 | 76 | ### Priority queue 77 | 78 | In high-performance computing settings (e.g. `parallelism = "clustermq"` and `parallelism = "future"`), `drake` creates a [priority queue](https://github.com/ropensci/drake/blob/main/R/priority_queue.R) to schedule targets.
For the sake of convenience, the underlying algorithms are different from those of a classical [priority queue](https://en.wikipedia.org/wiki/Priority_queue), but this does not seem to decrease performance in practice. 79 | 80 | ### Metadata 81 | 82 | `config$meta` is an environment, and each element is a list of class `"drake_meta"`. Whereas the workflow specification identifies the *names* of dependencies, each `"drake_meta"` list contains *hashes* (and supporting information). `drake` uses the hashes to decide if the target is up to date. Metadata lists are stored in the `"meta"` namespace of the decorated `storr`. 83 | 84 | `config$meta_old` is similar to `config$meta` and exists for performance purposes. 85 | 86 | ### Cache 87 | 88 | #### API 89 | 90 | `drake`'s cache API is a [decorated `storr`](https://github.com/ropensci/drake/blob/main/R/decorate_storr.R), a reference class that wraps around a [`storr`](https://github.com/richfitz/storr) object. `drake` relies heavily on `storr` namespaces (e.g. for metadata and recovery keys). `drake`'s custom wrapper around the `storr` class (i.e. the "decorated" part) has extra methods that power history (a [`txtq`](https://github.com/wlandau/txtq)) and [specialized data formats](https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets), as well as hash tables that only the cache needs. 91 | 92 | The `new_cache()` and `drake_cache()` functions create and reload `drake` caches, respectively, and they are equivalent to `storr::storr_rds()` plus `drake:::decorate_storr()`. 93 | 94 | #### Data 95 | 96 | Usually, the persistent data values live in a hidden `.drake/` folder. Most of the files come from [`storr_rds()`](http://richfitz.github.io/storr/reference/storr_rds.html) methods. Other files include the history [`txtq`](https://github.com/wlandau/txtq) and the values of targets with [specialized data formats](https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets).
The files are structured so they can be used with either `storr::storr_rds()` or `drake::drake_cache()`. 97 | 98 | Other `storr` backends like `storr_environment()` and `storr_dbi()` are also compatible with this approach. In these non-standard cases, `.drake/` does not contain the files of the inner `storr`, but it still has files supporting history and specialized target formats. 99 | 100 | ### Code analysis lists 101 | 102 | `drake` performs static code analysis on functions and commands in order to resolve the dependency structure of a workflow. Lists of class `drake_deps` and `drake_deps_ht` store the results of static code analysis on a single code chunk. Each element of a `drake_deps` list is a character vector of static dependencies of a certain type (e.g. global variables or `file_in()` files). The elements of `drake_deps_ht` lists are hash tables (which increase performance when the static code analysis is running). 103 | 104 | ### Environments 105 | 106 | `drake` has [memory management strategies](#memory) to make sure a target's dependencies are loaded when `make()` runs its command. Internally, [memory management](https://github.com/ropensci/drake/blob/main/R/manage_memory.R) works with a layered system of environments. This system helps `make()` protect the user's calling environment and perform dynamic branching without the need for static code analysis or metaprogramming. 107 | 108 | 1. `config$envir`: the calling environment of `make()`, which contains the user's functions and other imported objects. `make()` tries to leave this environment alone (and temporarily locks it when `lock_envir` is `TRUE`). 109 | 2. `config$envir_targets`: contains static targets. Its parent is `config$envir`. 110 | 3. `config$envir_dynamic`: contains entire aggregated dynamic targets when `drake` needs them. Its parent is `config$envir_targets`. 111 | 4. `config$envir_subtargets`: contains individual sub-targets. Its parent is `config$envir_dynamic`.
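The parent-chain lookup behind this layered system can be sketched with plain base R environments. This is only an illustration of the idea, not `drake`'s actual implementation, and the object names (`imported_fn`, `static_target`) are hypothetical.

```{r, eval = FALSE}
# Illustrative sketch: nested environments mimic drake's memory layers.
envir <- new.env(parent = globalenv())               # (1) user's imports
envir_targets <- new.env(parent = envir)             # (2) static targets
envir_dynamic <- new.env(parent = envir_targets)     # (3) whole dynamic targets
envir_subtargets <- new.env(parent = envir_dynamic)  # (4) individual sub-targets

assign("imported_fn", function(x) x + 1, envir = envir)
assign("static_target", 1, envir = envir_targets)

# Code evaluated in the innermost layer still sees the outer layers
# because R resolves symbols through the chain of parent environments.
eval(quote(imported_fn(static_target)), envir = envir_subtargets)
#> [1] 2
```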
112 | 113 | In addition, `config$envir_loaded` keeps track of which targets are loaded in (2), (3), and (4) above. 114 | 115 | These environments form a known [data clump](https://refactoring.guru/smells/data-clumps), and future development will encapsulate them. 116 | 117 | ### Hash tables 118 | 119 | The `drake_config()` object and decorated `storr` keep track of multiple hash tables to cache data in memory and boost speed while iterating over large collections of targets. They are simply R environments with `hash = TRUE`, and `drake` has [internal interface functions](https://github.com/ropensci/drake/blob/main/R/hash_tables.R) for working with them. Examples in `drake_config()` objects: 120 | 121 | * `ht_is_dynamic`: keeps track of names of dynamic targets. Makes `is_dynamic()` faster. 122 | * `ht_is_subtarget`: same as above, but for `is_subtarget()`. 123 | * `ht_dynamic_deps`: names of dynamic dependencies of dynamic targets. Powers `is_dynamic_dep()`. 124 | * `ht_target_exists`: tracks targets that already exist at the beginning of `make()`. 125 | * `ht_subtarget_parents`: keeps track of the parent of each sub-target. 126 | 127 | Examples in the decorated `storr`: 128 | 129 | * `ht_encode_path` and `ht_decode_path`: `drake` uses Base32 encoding to store references to static file paths. These hash tables avoid redundant encoding/decoding operations and increase performance for large collections of targets. 130 | * `ht_encode_namespaced` and `ht_decode_namespaced`: same for imported namespaced functions. 131 | * `ht_hash`: powers `memo_hash()`, which helps us avoid redundant calls to `input_file_hash()`, `output_file_hash()`, `static_dependency_hash()`, and `dynamic_dependency_hash()`. 132 | * `ht_keys`: a small hash table that powers the `set_progress` method. This progress information is stored in the cache by default, and the user can retrieve it with `drake_progress()`.
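The underlying idiom is plain base R: an environment created with `hash = TRUE` acts as a key-value store with fast assignment and lookup by name. Below is a minimal memoization sketch in that style; `memo_hash_sketch()` is a hypothetical stand-in for illustration, not `drake`'s internal `memo_hash()`.

```{r, eval = FALSE}
# An environment with hash = TRUE acts as a fast in-memory key-value store.
ht <- new.env(hash = TRUE, parent = emptyenv())

memo_hash_sketch <- function(key, compute) {
  if (!exists(key, envir = ht, inherits = FALSE)) {
    assign(key, compute(), envir = ht)  # compute and cache on first use
  }
  get(key, envir = ht, inherits = FALSE)
}

memo_hash_sketch("abc123", function() 42)                     # computes 42
memo_hash_sketch("abc123", function() stop("never reached"))  # cached: 42
```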
133 | 134 | ### Logger 135 | 136 | The [logger](https://github.com/ropensci/drake/blob/main/R/logger.R) (`config$logger`) is a reference class that controls messages to the console and a custom log file (if applicable). Logging messages help users informally monitor the progress of `make()`. -------------------------------------------------------------------------------- /drake-manual.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | BuildType: Website 16 | -------------------------------------------------------------------------------- /dynamic.Rmd: -------------------------------------------------------------------------------- 1 | # Dynamic branching {#dynamic} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, echo = FALSE, message = FALSE} 17 | library(broom) 18 | library(drake) 19 | library(gapminder) 20 | library(tidyverse) 21 | ``` 22 | 23 | ## A note about versions 24 | 25 | The first release of dynamic branching was in `drake` version 7.8.0. In subsequent versions, dynamic branching behaves differently. This manual describes how dynamic branching works in development `drake` (to become version 7.9.0 in early January 2020).
If you are using version 7.8.0, please refer to [this version of the chapter](https://github.com/ropensci-books/drake/blob/c4dfa6dd71b5ffa4c6027633ae048d2ab0513c6d/dynamic.Rmd) instead. 26 | 27 | ## Motivation 28 | 29 | In large workflows, you may need more targets than you can easily type in a plan, and you may not be able to fully specify all targets in advance. Dynamic branching is an interface to declare new targets while `make()` is running. It lets you create more compact plans and graphs, it is easier to use than [static branching](#static), and it improves the startup speed of `make()` and friends. 30 | 31 | ## Which kind of branching should I use? 32 | 33 | With dynamic branching, `make()` is faster to initialize, and you have far more flexibility. With [static branching](#static), you have meaningful target names, and it is easier to predict what the plan is going to do in advance. There is a ton of room for overlap and personal judgement, and you can even use both kinds of branching together. 34 | 35 | ## Dynamic targets 36 | 37 | A dynamic target is a [vector](https://vctrs.r-lib.org/) of *sub-targets*. We let `make()` figure out which sub-targets to create and how to aggregate them. 38 | 39 | As an example, let's fit a regression model to each continent in [Gapminder data](https://github.com/jennybc/gapminder). To activate dynamic branching, use the `dynamic` argument of `target()`. 40 | 41 | ```{r} 42 | library(broom) 43 | library(drake) 44 | library(gapminder) 45 | library(tidyverse) 46 | 47 | # Split the Gapminder data by continent. 48 | gapminder_continents <- function() { 49 | gapminder %>% 50 | mutate(gdpPercap = scale(gdpPercap)) %>% 51 | split(f = .$continent) 52 | } 53 | 54 | # Fit a model to a continent. 
55 | fit_model <- function(continent_data) { 56 | data <- continent_data[[1]] 57 | data %>% 58 | lm(formula = gdpPercap ~ year) %>% 59 | tidy() %>% 60 | mutate(continent = data$continent[1]) %>% 61 | select(continent, term, statistic, p.value) 62 | } 63 | 64 | plan <- drake_plan( 65 | continents = gapminder_continents(), 66 | model = target(fit_model(continents), dynamic = map(continents)) 67 | ) 68 | 69 | make(plan) 70 | ``` 71 | 72 | The data type of every sub-target is the same as the dynamic target it belongs to. In other words, `model` and `model_23022788` are both data frames, and `readd(model)` and friends automatically concatenate all the `model_*` sub-targets. 73 | 74 | ```{r} 75 | readd(model) 76 | ``` 77 | 78 | This behavior is powered by the [`vctrs`](https://vctrs.r-lib.org/) package. A dynamic target like `model` above is really a "`vctr`" of sub-targets. Under the hood, the aggregated value of `model` is what you get from calling `vec_c()` on all the `model_*` sub-targets. When you dynamically `map()` over a non-dynamic object, you are taking slices with `vec_slice()`. (When you `map()` over a dynamic target, each element is a sub-target and `vec_slice()` is not necessary.) 79 | 80 | ```{r} 81 | library(vctrs) 82 | 83 | # same as readd(model) 84 | s <- subtargets(model) 85 | vec_c( 86 | readd(s[1], character_only = TRUE), 87 | readd(s[2], character_only = TRUE), 88 | readd(s[3], character_only = TRUE), 89 | readd(s[4], character_only = TRUE), 90 | readd(s[5], character_only = TRUE) 91 | ) 92 | 93 | loadd(model) 94 | 95 | # Second slice if you were to map() over mtcars. 96 | vec_slice(mtcars, 2) 97 | 98 | # Fifth slice if you were to map() over letters. 99 | vec_slice(letters, 5) 100 | ``` 101 | 102 | You can use `vec_c()` and `vec_slice()` to anticipate edge cases in dynamic branching. 103 | 104 | ```{r} 105 | # If you map() over a list, each sub-target is a single-element list.
106 | vec_slice(list(1, 2), 1) 107 | ``` 108 | 109 | ```{r} 110 | # If each sub-target has multiple elements, 111 | # the aggregated target (e.g. from readd()) 112 | # will have more elements than sub-targets. 113 | subtarget1 <- c(1, 2) 114 | subtarget2 <- c(3, 4) 115 | vec_c(subtarget1, subtarget2) 116 | ``` 117 | 118 | Back in our plan, `target(fit_model(continents), dynamic = map(continents))` is equivalent to commands `fit_model(continents[1])` through `fit_model(continents[5])`. Since `continents` is really a list of data frames, `continents[1]` through `continents[5]` are also lists of data frames, which is why we need the line `data <- continent_data[[1]]` in `fit_model()`. 119 | 120 | To post-process our models, we can work with either the individual sub-targets or the whole vector of all the models. Below, `year` uses the former and `intercept` uses the latter. 121 | 122 | ```{r} 123 | plan <- drake_plan( 124 | continents = gapminder_continents(), 125 | model = target(fit_model(continents), dynamic = map(continents)), 126 | # Filter each model individually: 127 | year = target(filter(model, term == "year"), dynamic = map(model)), 128 | # Aggregate all the models, then filter the whole vector: 129 | intercept = filter(model, term != "year") 130 | ) 131 | 132 | make(plan) 133 | ``` 134 | 135 | ```{r} 136 | readd(year) 137 | ``` 138 | 139 | ```{r} 140 | readd(intercept) 141 | ``` 142 | 143 | 144 | If automatic concatenation of sub-targets is confusing (e.g. if some sub-targets are `NULL`), you can read the dynamic target as a named list (only in `drake` version 7.10.0 and above). 145 | 146 | ```{r} 147 | readd(model, subtarget_list = TRUE) # Requires drake >= 7.10.0. 148 | ``` 149 | 150 | Alternatively, you can identify an individual sub-target by its index.
151 | 152 | ```{r} 153 | subtargets(model) 154 | 155 | readd(model, subtargets = 2) # equivalent to readd() on a single model_* sub-target 156 | ``` 157 | 158 | If you don't know the index offhand, you can find out using the sub-target's name. 159 | 160 | ```{r, echo = FALSE} 161 | subtarget <- subtargets(model)[2] 162 | ``` 163 | 164 | ```{r} 165 | print(subtarget) 166 | 167 | which(subtarget == subtargets(model)) 168 | ``` 169 | 170 | If the sub-target errored out and `subtargets()` fails, the individual sub-target metadata will have a `subtarget_index` field. 171 | 172 | ```{r, eval = FALSE} 173 | diagnose(subtarget, character_only = TRUE)$subtarget_index 174 | #> [1] 2 175 | ``` 176 | 177 | Either way, once you have the sub-target's index, you can retrieve the section of data that the sub-target took as input. Below, we load the part of `continents` that the second sub-target of `model` used during `make()`. 178 | 179 | ```{r} 180 | vctrs::vec_slice(readd(continents), 2) 181 | ``` 182 | 183 | If `continents` were dynamic, we could have just used `readd(continents, subtargets = 2)`. But `continents` was a static target, so we needed to replicate `drake`'s dynamic branching behavior using `vctrs`. 184 | 185 | ## Dynamic transformations 186 | 187 | Dynamic branching supports transformations `map()`, `cross()`, and `group()`. These transformations tell `drake` how to create sub-targets. 188 | 189 | ### `map()` 190 | 191 | `map()` iterates over the [vector slices](https://vctrs.r-lib.org/reference/vec_slice.html) of the targets you supply as arguments. We saw above how `map()` iterates over lists. If you give it a data frame, it will map over the rows. 192 | 193 | ```{r} 194 | plan <- drake_plan( 195 | subset = head(gapminder), 196 | row = target(subset, dynamic = map(subset)) 197 | ) 198 | 199 | make(plan) 200 | ``` 201 | 202 | ```{r} 203 | readd(row_9939cae3) 204 | ``` 205 | 206 | If you supply multiple targets, `map()` iterates over the slices of each.
207 | 208 | ```{r} 209 | plan <- drake_plan( 210 | numbers = seq_len(2), 211 | letters = c("a", "b"), 212 | zipped = target(paste0(numbers, letters), dynamic = map(numbers, letters)) 213 | ) 214 | 215 | make(plan) 216 | ``` 217 | 218 | ```{r} 219 | readd(zipped) 220 | ``` 221 | 222 | ### `cross()` 223 | 224 | `cross()` creates a new sub-target for each combination of targets you supply as arguments. 225 | 226 | ```{r} 227 | plan <- drake_plan( 228 | numbers = seq_len(2), 229 | letters = c("a", "b"), 230 | combo = target(paste0(numbers, letters), dynamic = cross(numbers, letters)) 231 | ) 232 | 233 | make(plan) 234 | ``` 235 | 236 | ```{r} 237 | readd(combo) 238 | ``` 239 | 240 | ### `group()` 241 | 242 | With `group()`, you can create multiple aggregates of a given target. Use the `.by` argument to set a grouping variable. 243 | 244 | ```{r} 245 | plan <- drake_plan( 246 | data = gapminder, 247 | by = data$continent, 248 | gdp = target( 249 | tibble(median = median(data$gdpPercap), continent = by[1]), 250 | dynamic = group(data, .by = by) 251 | ) 252 | ) 253 | 254 | make(plan) 255 | ``` 256 | 257 | ```{r} 258 | readd(gdp) 259 | ``` 260 | 261 | ## Trace 262 | 263 | All dynamic transforms have a `.trace` argument to record optional metadata for each sub-target. In the example from `group()`, the trace is another way to keep track of the continent of each median GDP value. 264 | 265 | ```{r} 266 | plan <- drake_plan( 267 | data = gapminder, 268 | by = data$continent, 269 | gdp = target( 270 | median(data$gdpPercap), 271 | dynamic = group(data, .by = by, .trace = by) 272 | ) 273 | ) 274 | 275 | make(plan) 276 | ``` 277 | 278 | The `gdp` target no longer contains any explicit reference to continent. 279 | 280 | ```{r} 281 | readd(gdp) 282 | ``` 283 | 284 | However, we can look up the continents in the trace. 285 | 286 | ```{r} 287 | read_trace("by", gdp) 288 | ``` 289 | 290 | 291 | ## `max_expand` 292 | 293 | Suppose we want a model for each *country*. 
294 | 295 | ```{r} 296 | gapminder_countries <- function() { 297 | gapminder %>% 298 | mutate(gdpPercap = scale(gdpPercap)) %>% 299 | split(f = .$country) 300 | } 301 | 302 | plan <- drake_plan( 303 | countries = gapminder_countries(), 304 | model = target(fit_model(countries), dynamic = map(countries)) 305 | ) 306 | ``` 307 | 308 | The Gapminder dataset has 142 countries, which can get overwhelming. In the early stages of the workflow when we are still debugging and testing, we can limit the number of sub-targets using the `max_expand` argument of `make()`. 309 | 310 | ```{r} 311 | make(plan, max_expand = 2) 312 | ``` 313 | 314 | ```{r} 315 | readd(model) 316 | ``` 317 | 318 | Then, when we are confident and ready, we can scale up to the full number of models. 319 | 320 | ```{r, eval = FALSE} 321 | make(plan) 322 | ``` 323 | -------------------------------------------------------------------------------- /examples.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Examples {-} 2 | -------------------------------------------------------------------------------- /extras.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Extra features {-} 2 | -------------------------------------------------------------------------------- /faq.R: -------------------------------------------------------------------------------- 1 | is_faq <- function(label){ 2 | identical(label$name, "type: faq") 3 | } 4 | 5 | any_faq_label <- function(issue){ 6 | any(vapply(issue$labels, is_faq, logical(1))) 7 | } 8 | 9 | combine_fields <- function(lst, field){ 10 | map_chr(lst, function(x){ 11 | x[[field]] 12 | }) 13 | } 14 | 15 | build_faq <- function(){ 16 | library(tidyverse) 17 | library(gh) 18 | faq <- gh( 19 | "GET /repos/ropensci/drake/issues?state=all", 20 | .limit = Inf 21 | ) %>% 22 | Filter(f = any_faq_label) 23 | titles <- combine_fields(faq, "title") 24 | urls <- combine_fields(faq, 
"html_url") 25 | links <- paste0("- [", titles, "](", urls, ")") 26 | con <- file("faq.Rmd", "a") 27 | writeLines(c("", links), con) 28 | close(con) 29 | } 30 | 31 | if (nzchar(Sys.getenv("GITHUB_PAT"))){ 32 | build_faq() 33 | } 34 | -------------------------------------------------------------------------------- /faq.Rmd: -------------------------------------------------------------------------------- 1 | # Frequently-asked questions {#faq} 2 | 3 | This FAQ is a compendium of [pedagogically useful issues tagged on GitHub](https://github.com/ropensci/drake/issues?q=is%3Aissue+is%3Aopen+label%3A%22frequently+asked+question%22). To contribute, please [submit a new issue](https://github.com/ropensci/drake/issues/new) and ask that it be labeled a frequently asked question. 4 | -------------------------------------------------------------------------------- /footer.html: -------------------------------------------------------------------------------- 1 |
    2 | Copyright Eli Lilly and Company 3 |
    4 | -------------------------------------------------------------------------------- /google_analytics.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 10 | -------------------------------------------------------------------------------- /gsp.Rmd: -------------------------------------------------------------------------------- 1 | # Finding the best model of gross state product {#gsp} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 17 | library(biglm) 18 | library(drake) 19 | library(Ecdat) 20 | library(ggplot2) 21 | library(knitr) 22 | library(purrr) 23 | library(tidyverse) 24 | ``` 25 | 26 | The following data analysis workflow shows off `drake`'s ability to generate lots of reproducibly-tracked tasks with ease. The same technique would be cumbersome, even intractable, with [GNU Make](https://www.gnu.org/software/make/). 27 | 28 | ## Get the code. 29 | 30 | Write the code files to your workspace. 31 | 32 | ```{r, eval = FALSE} 33 | drake_example("gsp") 34 | ``` 35 | 36 | The new `gsp` folder now includes a file structure of a serious `drake` project, plus an `interactive-tutorial.R` to narrate the example. The code is also [online here](https://github.com/wlandau/drake-examples/tree/main/gsp). 37 | 38 | ## Objective and methods 39 | 40 | The goal is to search for factors closely associated with the productivity of states in the USA around the 1970s and 1980s. 
For the sake of simplicity, we use gross state product as a metric of productivity, and we restrict ourselves to multiple linear regression models with three variables. For each of the 84 possible models, we fit the data and then evaluate the root mean squared prediction error (RMSPE). 41 | 42 | $$ 43 | \begin{aligned} 44 | \text{RMSPE} = \sqrt{\frac{1}{n}(y - \widehat{y})^T(y - \widehat{y})} 45 | \end{aligned} 46 | $$ 47 | Here, $y$ is the vector of the $n$ observed gross state products in the data, and $\widehat{y}$ is the vector of predicted gross state products under one of the models. We take the best variables to be the triplet in the model with the lowest RMSPE. 48 | 49 | ## Data 50 | 51 | The `Produc` dataset from the [Ecdat package](https://cran.r-project.org/package=Ecdat) contains data on the Gross State Product from 1970 to 1986. Each row is a single observation on a single state for a single year. The dataset has the following variables as columns. See the references later in this report for more details. 52 | 53 | - `gsp`: gross state product. 54 | - `state`: the state. 55 | - `year`: the year. 56 | - `pcap`: private capital stock. 57 | - `hwy`: highway and streets. 58 | - `water`: water and sewer facilities. 59 | - `util`: other public buildings and structures. 60 | - `pc`: public capital. 61 | - `emp`: labor input measured by the employment in non-agricultural payrolls. 62 | - `unemp`: state unemployment rate. 63 | 64 | ```{r} 65 | library(Ecdat) 66 | data(Produc) 67 | head(Produc) 68 | ``` 69 | 70 | ## Analysis 71 | 72 | First, we load the required packages. `drake` is aware of all the packages you load with `library()` or `require()`. 73 | 74 | ```{r} 75 | library(biglm) # lightweight models, easier to store than with lm() 76 | library(drake) 77 | library(Ecdat) # econometrics datasets 78 | library(ggplot2) 79 | library(knitr) 80 | library(purrr) 81 | library(tidyverse) 82 | ``` 83 | 84 | Next, we construct our plan. 
The following code uses `drake`'s special new language for generating plans (learn more [here](#plans)). 85 | 86 | ```{r} 87 | predictors <- setdiff(colnames(Produc), "gsp") 88 | 89 | # We will try all combinations of three covariates. 90 | combos <- combn(predictors, 3) %>% 91 | t() %>% 92 | as.data.frame(stringsAsFactors = FALSE) %>% 93 | setNames(c("x1", "x2", "x3")) 94 | 95 | head(combos) 96 | 97 | # We need to list each covariate as a symbol. 98 | for (col in colnames(combos)) { 99 | combos[[col]] <- rlang::syms(combos[[col]]) 100 | } 101 | 102 | # Requires drake >= 7.0.0 or the development version 103 | # at github.com/ropensci/drake. 104 | # Install with remotes::install_github("ropensci/drake"). 105 | plan <- drake_plan( 106 | model = target( 107 | biglm(gsp ~ x1 + x2 + x3, data = Ecdat::Produc), 108 | transform = map(.data = !!combos) # Remember the bang-bang!! 109 | ), 110 | rmspe_i = target( 111 | get_rmspe(model, Ecdat::Produc), 112 | transform = map(model) 113 | ), 114 | rmspe = target( 115 | bind_rows(rmspe_i, .id = "model"), 116 | transform = combine(rmspe_i) 117 | ), 118 | plot = ggsave( 119 | filename = file_out("rmspe.pdf"), 120 | plot = plot_rmspe(rmspe), 121 | width = 8, 122 | height = 8 123 | ), 124 | report = knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE) 125 | ) 126 | 127 | plan 128 | ``` 129 | 130 | We also need to define functions for summaries and plots. 
131 | 132 | ```{r} 133 | get_rmspe <- function(model_fit, data){ 134 | y <- data$gsp 135 | yhat <- as.numeric(predict(model_fit, newdata = data)) 136 | terms <- attr(model_fit$terms, "term.labels") 137 | tibble( 138 | rmspe = sqrt(mean((y - yhat)^2)), # nolint 139 | X1 = terms[1], 140 | X2 = terms[2], 141 | X3 = terms[3] 142 | ) 143 | } 144 | 145 | plot_rmspe <- function(rmspe){ 146 | ggplot(rmspe) + 147 | geom_histogram(aes(x = rmspe), bins = 15) 148 | } 149 | ``` 150 | 151 | We have a [`report.Rmd` file ](https://github.com/wlandau/drake-examples/blob/main/gsp/report.Rmd) to summarize our results at the end. 152 | 153 | ```{r} 154 | drake_example("gsp") 155 | file.copy(from = "gsp/report.Rmd", to = ".", overwrite = TRUE) 156 | ``` 157 | 158 | We can inspect the project before we run it. 159 | 160 | ```{r} 161 | vis_drake_graph(plan) 162 | ``` 163 | 164 | Now, we can run the project. 165 | 166 | ```{r} 167 | make(plan, verbose = 0L) 168 | ``` 169 | 170 | ## Results 171 | 172 | Here are the root mean squared prediction errors of all the models. 173 | 174 | ```{r} 175 | results <- readd(rmspe) 176 | library(ggplot2) 177 | plot_rmspe(rmspe = results) 178 | ``` 179 | 180 | And here are the best models. The best variables are in the top row under `X1`, `X2`, and `X3`. 181 | 182 | ```{r} 183 | head(results[order(results$rmspe, decreasing = FALSE), ]) 184 | ``` 185 | 186 | ## Comparison with GNU Make 187 | 188 | If we were using [Make](https://www.gnu.org/software/make/) instead of `drake` with the same set of targets, the analogous [Makefile](https://www.gnu.org/software/make/) would look something like this pseudo-code sketch. 189 | 190 |
    models = model_state_year_pcap.rds model_state_year_hwy.rds ... # 84 of these
    191 | 
    192 | model_%:
    193 |     Rscript -e 'saveRDS(lm(...), ...)'
    194 | 
    195 | rmspe_%: model_%
    196 |     Rscript -e 'saveRDS(get_rmspe(...), ...)'
    197 | 
    198 | rmspe.rds: rmspe_%
    199 |     Rscript -e 'saveRDS(dplyr::bind_rows(...), ...)'
    200 | 
    201 | rmspe.pdf: rmspe.rds
    202 |     Rscript -e 'ggplot2::ggsave(plot_rmspe(readRDS("rmspe.rds")), "rmspe.pdf")'
    203 | 
    204 | report.md: report.Rmd
    205 |     Rscript -e 'knitr::knit("report.Rmd")'
    206 | 
    207 | 208 | There are three main disadvantages to this approach. 209 | 210 | 1. Every target requires a new call to `Rscript`, which means that more time is spent initializing R sessions than doing the actual work. 211 | 2. The user must micromanage nearly one hundred output files (in this case, `*.rds` files), which is cumbersome, messy, and inconvenient. `drake`, on the other hand, automatically manages storage using a [storr cache](https://github.com/richfitz/storr). 212 | 3. The user needs to write the names of the 84 `models` near the top of the `Makefile`, which is less convenient than maintaining a data frame in R. 213 | 214 | ## References 215 | 216 | - Baltagi, Badi H (2003). Econometric analysis of panel data, John Wiley and sons, http://www.wiley.com/legacy/wileychi/baltagi/. 217 | - Baltagi, B. H. and N. Pinnoi (1995). "Public capital stock and state productivity growth: further evidence", Empirical Economics, 20, 351-359. 218 | - Munnell, A. (1990). "Why has productivity growth declined? Productivity and public investment"", New England Economic Review, 3-22. 219 | - Yves Croissant (2016). Ecdat: Data Sets for Econometrics. R package version 0.3-1. https://CRAN.R-project.org/package=Ecdat. 
220 | -------------------------------------------------------------------------------- /images/apple-touch-icon-120x120.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ropensci-books/drake/e5e4fb2f15e84645f804f52bc54f7926541c8eac/images/apple-touch-icon-120x120.png -------------------------------------------------------------------------------- /images/apple-touch-icon-152x152.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ropensci-books/drake/e5e4fb2f15e84645f804f52bc54f7926541c8eac/images/apple-touch-icon-152x152.png -------------------------------------------------------------------------------- /images/apple-touch-icon-180x180.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ropensci-books/drake/e5e4fb2f15e84645f804f52bc54f7926541c8eac/images/apple-touch-icon-180x180.png -------------------------------------------------------------------------------- /images/apple-touch-icon-60x60.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ropensci-books/drake/e5e4fb2f15e84645f804f52bc54f7926541c8eac/images/apple-touch-icon-60x60.png -------------------------------------------------------------------------------- /images/apple-touch-icon-76x76.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ropensci-books/drake/e5e4fb2f15e84645f804f52bc54f7926541c8eac/images/apple-touch-icon-76x76.png -------------------------------------------------------------------------------- /images/apple-touch-icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ropensci-books/drake/e5e4fb2f15e84645f804f52bc54f7926541c8eac/images/apple-touch-icon.png 
-------------------------------------------------------------------------------- /images/favicon-16x16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ropensci-books/drake/e5e4fb2f15e84645f804f52bc54f7926541c8eac/images/favicon-16x16.png -------------------------------------------------------------------------------- /images/favicon-32x32.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ropensci-books/drake/e5e4fb2f15e84645f804f52bc54f7926541c8eac/images/favicon-32x32.png -------------------------------------------------------------------------------- /images/favicon.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ropensci-books/drake/e5e4fb2f15e84645f804f52bc54f7926541c8eac/images/favicon.ico -------------------------------------------------------------------------------- /images/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ropensci-books/drake/e5e4fb2f15e84645f804f52bc54f7926541c8eac/images/logo.png -------------------------------------------------------------------------------- /memory.Rmd: -------------------------------------------------------------------------------- 1 | # Memory management {#memory} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 17 | library(biglm) 18 | library(drake) 19 | library(tidyverse) 20 | ``` 21 | 22 | The default settings of `drake` 
prioritize speed over memory efficiency. For projects with large data, this default behavior can cause problems. Consider the following hypothetical workflow, where we simulate several large datasets and summarize them. 23 | 24 | ```{r, paged.print = FALSE} 25 | reps <- 10 # Serious workflows may have several times more. 26 | 27 | # Reduce `n` to lighten the load if you want to try this workflow yourself. 28 | # It is super high in this chapter to motivate the memory issues. 29 | generate_large_data <- function(rep, n = 1e8) { 30 | tibble(x = rnorm(n), y = rnorm(n), rep = rep) 31 | } 32 | 33 | get_means <- function(...) { 34 | out <- NULL 35 | for (dataset in list(...)) { 36 | out <- bind_rows(out, colMeans(dataset)) 37 | } 38 | out 39 | } 40 | 41 | plan <- drake_plan( 42 | large_data = target( 43 | generate_large_data(rep), 44 | transform = map(rep = !!seq_len(reps), .id = FALSE) 45 | ), 46 | means = target( 47 | get_means(large_data), 48 | transform = combine(large_data) 49 | ), 50 | summ = summary(means) 51 | ) 52 | 53 | print(plan) 54 | ``` 55 | 56 | ```{r} 57 | vis_drake_graph(plan) 58 | ``` 59 | 60 | If you call `make(plan)` with no additional arguments, `drake` will try to load all the datasets into the same R session. Each dataset from `generate_large_data(n = 1e8)` occupies about 2.4 GB of memory, and most machines cannot handle all the data at once. We should use memory more wisely. 61 | 62 | ## Garbage collection and custom files 63 | 64 | `make()` has a `garbage_collection` argument, which tells `drake` to periodically unload data objects that no longer belong to variables. You can also run garbage collection manually with the `gc()` function. For more on garbage collection, please refer to the [memory usage chapter of Advanced R](http://adv-r.had.co.nz/memory.html#gc). 65 | 66 | Let's reduce the memory consumption of our example workflow: 67 | 68 | 1. Call `gc()` after every loop iteration of `get_means()`. 69 | 2. 
Avoid `drake`'s caching system with custom `file_out()` files in the plan. 70 | 3. Call `make(plan, garbage_collection = TRUE)`. 71 | 72 | ```{r, paged.print = FALSE} 73 | reps <- 10 # Serious workflows may have several times more. 74 | files <- paste0(seq_len(reps), ".rds") 75 | 76 | generate_large_data <- function(file, n = 1e8) { 77 | out <- tibble(x = rnorm(n), y = rnorm(n)) # 1e8 = 100 million rows 78 | saveRDS(out, file) 79 | } 80 | 81 | get_means <- function(files) { 82 | out <- NULL 83 | for (file in files) { 84 | x <- colMeans(readRDS(file)) 85 | out <- bind_rows(out, x) 86 | gc() # Use the gc() function here to make sure each x gets unloaded. 87 | } 88 | out 89 | } 90 | 91 | plan <- drake_plan( 92 | large_data = target( 93 | generate_large_data(file = file_out(file)), 94 | transform = map(file = !!files, .id = FALSE) 95 | ), 96 | means = get_means(file_in(!!files)), 97 | summ = summary(means) 98 | ) 99 | 100 | print(plan) 101 | ``` 102 | 103 | ```{r} 104 | vis_drake_graph(plan) 105 | ``` 106 | 107 | ```{r, eval = FALSE} 108 | make(plan, garbage_collection = TRUE) 109 | ``` 110 | 111 | ## Memory strategies 112 | 113 | `make()` has a `memory_strategy` argument to customize how `drake` loads and unloads targets. With the right memory strategy, you can rely on `drake`'s built-in caching system without having to bother with messy `file_out()` files. 114 | 115 | Each memory strategy follows three stages for each target: 116 | 117 | 1. Initial discard: before building the target, optionally discard some other targets from the R session. The choice of discards depends on the memory strategy. (Note: we do not actually get the memory back until we call `gc()`.) 118 | 2. Initial load: before building the target, optionally load any dependencies that are not already in memory. 119 | 3. Final discard: optionally discard or keep the return value after the target finishes building. 
Either way, the return value is still stored in the cache, so you can load it with `loadd()` and `readd()`. 120 | 121 | The implementation of these steps varies from strategy to strategy. 122 | 123 | Memory strategy | Initial discard | Initial load | Final discard 124 | ---|---|---|--- 125 | "speed" | Discard nothing | Load any missing dependencies. | Keep the return value loaded. 126 | "autoclean"[^1] | Discard all targets which are not dependencies of the current target. | Load any missing dependencies. | Discard the return value. 127 | "preclean" | Discard all targets which are not dependencies of the current target. | Load any missing dependencies. | Keep the return value loaded. 128 | "lookahead" | Discard all targets which are not dependencies of either (1) the current target or (2) other targets waiting to be checked or built. | Load any missing dependencies. | Keep the return value loaded. 129 | "unload"[^2] | Unload all targets. | Load nothing. | Discard the return value. 130 | "none"[^2] | Unload nothing. | Load nothing. | Discard the return value. 131 | 132 | [^1]: Only supported in `drake` version 7.5.0 and above. 133 | [^2]: Only supported in `drake` version 7.4.0 and above. 134 | 135 | With the `"speed"`, `"autoclean"`, `"preclean"`, and `"lookahead"` strategies, you can simply call `make(plan, memory_strategy = YOUR_CHOICE, garbage_collection = TRUE)` and trust that your targets will build normally. For the `"unload"` and `"none"` strategies, there is extra work to do: you will need to manually load each target's dependencies with `loadd()` or `readd()`. This manual bookkeeping lets you aggressively optimize your workflow, and it is less cumbersome than swarms of `file_out()` files. It is particularly useful when you have a large `combine()` step. 136 | 137 | Let's redesign the workflow to reap the benefits of `make(plan, memory_strategy = "none", garbage_collection = TRUE)`. 
The trick is to use [`match.call()`](https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/match.call) inside `get_means()` so we can load and unload dependencies one at a time instead of all at once. 138 | 139 | ```{r, paged.print = FALSE} 140 | reps <- 10 # Serious workflows may have several times more. 141 | 142 | generate_large_data <- function(rep, n = 1e8) { 143 | tibble(x = rnorm(n), y = rnorm(n), rep = rep) 144 | } 145 | 146 | # Load targets one at a time 147 | get_means <- function(...) { 148 | arg_symbols <- match.call(expand.dots = FALSE)$... 149 | arg_names <- as.character(arg_symbols) 150 | out <- NULL 151 | for (arg_name in arg_names) { 152 | dataset <- readd(arg_name, character_only = TRUE) 153 | out <- bind_rows(out, colMeans(dataset)) 154 | gc() # Run garbage collection. 155 | } 156 | out 157 | } 158 | 159 | plan <- drake_plan( 160 | large_data = target( 161 | generate_large_data(rep), 162 | transform = map(rep = !!seq_len(reps), .id = FALSE) 163 | ), 164 | means = target( 165 | get_means(large_data), 166 | transform = combine(large_data) 167 | ), 168 | summ = { 169 | loadd(means) # Annoying, but necessary with the "none" strategy. 170 | summary(means) 171 | } 172 | ) 173 | ``` 174 | 175 | Now, we can build our targets. 176 | 177 | ```{r, eval = FALSE} 178 | make(plan, memory_strategy = "none", garbage_collection = TRUE) 179 | ``` 180 | 181 | But there is a snag: we needed to manually load `means` in the command for `summ` (notice the call to `loadd()`). This is annoying, especially because `means` is quite small. Fortunately, `drake` lets you define different memory strategies for different targets in the plan. The target-specific memory strategies override the global one (i.e. the `memory_strategy` argument of `make()`). 
182 | 183 | ```{r} 184 | plan <- drake_plan( 185 | large_data = target( 186 | generate_large_data(rep), 187 | transform = map(rep = !!seq_len(reps), .id = FALSE), 188 | memory_strategy = "none" 189 | ), 190 | means = target( 191 | get_means(large_data), 192 | transform = combine(large_data), 193 | memory_strategy = "unload" # Be careful with this one. 194 | ), 195 | summ = summary(means) 196 | ) 197 | 198 | print(plan) 199 | ``` 200 | 201 | In fact, now you can run `make()` without setting a global memory strategy at all. 202 | 203 | ```{r, eval = FALSE} 204 | make(plan, garbage_collection = TRUE) 205 | ``` 206 | 207 | ## Data splitting 208 | 209 | The [`split()` transformation](https://books.ropensci.org/drake/plans.html#split) breaks up a dataset into smaller targets. The ordinary use of `split()` is to partition an in-memory dataset into slices. 210 | 211 | ```{r, paged.print = FALSE} 212 | drake_plan( 213 | data = get_large_data(), 214 | x = target( 215 | data %>% 216 | analyze_data(), 217 | transform = split(data, slices = 4) 218 | ) 219 | ) 220 | ``` 221 | 222 | However, you can also use it to load individual pieces of a large file, thus conserving memory. The trick is to break up an index set instead of the data itself. In the following sketch, `get_number_of_rows()` and `read_selected_rows()` are user-defined functions, and `%>%` is the [`magrittr`](https://magrittr.tidyverse.org) pipe. 223 | 224 | ```{r, paged.print = FALSE} 225 | get_number_of_rows <- function(file) { 226 | # ... 227 | } 228 | 229 | read_selected_rows <- function(which_rows, file) { 230 | # ... 
231 | } 232 | 233 | plan <- drake_plan( 234 | row_indices = file_in("large_file.csv") %>% 235 | get_number_of_rows() %>% 236 | seq_len(), 237 | subset = target( 238 | row_indices %>% 239 | read_selected_rows(file = file_in("large_file.csv")), 240 | transform = split(row_indices, slices = 4) 241 | ) 242 | ) 243 | 244 | plan 245 | 246 | drake_plan_source(plan) 247 | ``` 248 | -------------------------------------------------------------------------------- /packages.Rmd: -------------------------------------------------------------------------------- 1 | # An analysis of R package download trends {#packages} 2 | 3 | ```{r, message = FALSE, warning = FALSE, include = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, include = FALSE} 17 | library(cranlogs) 18 | library(curl) 19 | library(drake) 20 | library(dplyr) 21 | library(ggplot2) 22 | library(httr) 23 | library(knitr) 24 | library(magrittr) 25 | library(R.utils) 26 | library(rvest) 27 | invisible(drake_example("packages", overwrite = TRUE)) 28 | invisible(file.copy("packages/report.Rmd", ".", overwrite = TRUE)) 29 | ``` 30 | 31 | This chapter explores R package download trends using the `cranlogs` package, and it shows how `drake`'s custom triggers can help with workflows with remote data sources. 32 | 33 | ## Get the code. 34 | 35 | Write the code files to your workspace. 36 | 37 | ```{r, eval = FALSE} 38 | drake_example("packages") 39 | ``` 40 | 41 | The new `packages` folder now includes a file structure of a serious `drake` project, plus an `interactive-tutorial.R` to narrate the example. 
The code is also [online here](https://github.com/wlandau/drake-examples/tree/main/packages). 42 | 43 | ## Overview 44 | 45 | This small data analysis project explores some trends in R package downloads over time. The datasets are downloaded using the [cranlogs package](https://github.com/metacran/cranlogs). 46 | 47 | ```{r} 48 | library(cranlogs) 49 | cran_downloads(packages = "dplyr", when = "last-week") 50 | ``` 51 | 52 | Above, each count is the number of times `dplyr` was downloaded from the RStudio CRAN mirror on the given day. To stay up to date with the latest download statistics, we need to refresh the data frequently. With `drake`, we can bring all our work up to date without restarting everything from scratch. 53 | 54 | ## Analysis 55 | 56 | First, we load the required packages. `drake` detects the packages you install and load. 57 | 58 | ```{r} 59 | library(cranlogs) 60 | library(drake) 61 | library(dplyr) 62 | library(ggplot2) 63 | library(knitr) 64 | library(rvest) 65 | ``` 66 | 67 | We will want custom functions to summarize the CRAN logs we download. 68 | 69 | ```{r} 70 | make_my_table <- function(downloads){ 71 | group_by(downloads, package) %>% 72 | summarize(mean_downloads = mean(count)) 73 | } 74 | 75 | make_my_plot <- function(downloads){ 76 | ggplot(downloads) + 77 | geom_line(aes(x = date, y = count, group = package, color = package)) 78 | } 79 | ``` 80 | 81 | Next, we generate the plan. 82 | 83 | We want to explore the daily downloads from the `knitr`, `Rcpp`, and `ggplot2` packages. We will use the [`cranlogs` package](https://github.com/metacran/cranlogs) to get daily logs of package downloads from RStudio's CRAN mirror. In our `drake_plan()`, we declare targets `older` and `recent` to contain snapshots of the logs. 84 | 85 | The following `drake_plan()` syntax, supported in `drake` 7.0.0 and above, is described [here](https://books.ropensci.org/drake/static.html). 
86 | 87 | ```{r} 88 | plan <- drake_plan( 89 | older = cran_downloads( 90 | packages = c("knitr", "Rcpp", "ggplot2"), 91 | from = "2016-11-01", 92 | to = "2016-12-01" 93 | ), 94 | recent = target( 95 | command = cran_downloads( 96 | packages = c("knitr", "Rcpp", "ggplot2"), 97 | when = "last-month" 98 | ), 99 | trigger = trigger(change = latest_log_date()) 100 | ), 101 | averages = target( 102 | make_my_table(data), 103 | transform = map(data = c(older, recent)) 104 | ), 105 | plot = target( 106 | make_my_plot(data), 107 | transform = map(data) 108 | ), 109 | report = knit( 110 | knitr_in("report.Rmd"), 111 | file_out("report.md"), 112 | quiet = TRUE 113 | ) 114 | ) 115 | ``` 116 | 117 | Notice the custom trigger for the target `recent`. Here, we are telling `drake` to rebuild `recent` whenever a new day's log is uploaded to [http://cran-logs.rstudio.com](http://cran-logs.rstudio.com). In other words, `drake` keeps track of the return value of `latest_log_date()` and recomputes `recent` (during `make()`) if that value changed since the last `make()`. Here, `latest_log_date()` is one of our custom imported functions. We use it to scrape [http://cran-logs.rstudio.com](http://cran-logs.rstudio.com) using the [`rvest`](https://github.com/hadley/rvest) package. 118 | 119 | ```{r} 120 | latest_log_date <- function(){ 121 | read_html("http://cran-logs.rstudio.com/") %>% 122 | html_nodes("li:last-of-type") %>% 123 | html_nodes("a:last-of-type") %>% 124 | html_text() %>% 125 | max 126 | } 127 | ``` 128 | 129 | Now, we run the project to download the data and analyze it. 130 | The results will be summarized in the knitted report, `report.md`, 131 | but you can also read the results directly from the cache. 
132 | 133 | ```{r, fig.width = 7, fig.height = 4} 134 | make(plan) 135 | 136 | readd(averages_recent) 137 | 138 | readd(averages_older) 139 | 140 | readd(plot_recent) 141 | 142 | readd(plot_older) 143 | ``` 144 | 145 | If we run `make()` again right away, we see that everything is up to date. But if we wait until a new day's log is uploaded, `make()` will update `recent` and everything that depends on it. 146 | 147 | ```{r} 148 | make(plan) 149 | ``` 150 | 151 | To visualize the build behavior, you can plot the dependency network. 152 | 153 | ```{r} 154 | vis_drake_graph(plan) 155 | ``` 156 | 157 | ## Other ways to trigger downloads 158 | 159 | Sometimes, our remote data sources get revised, and web scraping may not be the best way to detect changes. We may want to look at our remote dataset's modification time or HTTP ETag. To see how this works, consider the CRAN log file from February 9, 2018. 160 | 161 | ```{r} 162 | url <- "http://cran-logs.rstudio.com/2018/2018-02-09-r.csv.gz" 163 | ``` 164 | 165 | We can track the modification date using the [`httr`](https://github.com/r-lib/httr) package. 166 | 167 | ```{r} 168 | library(httr) # For querying websites. 169 | HEAD(url)$headers[["last-modified"]] 170 | ``` 171 | 172 | In our `drake` plan, we can track this timestamp and trigger a download whenever it changes. 173 | 174 | ```{r} 175 | plan <- drake_plan( 176 | logs = target( 177 | get_logs(url), 178 | trigger = trigger(change = HEAD(url)$headers[["last-modified"]]) 179 | ) 180 | ) 181 | plan 182 | ``` 183 | 184 | where 185 | 186 | ```{r} 187 | library(R.utils) # For unzipping the files we download. 188 | library(curl) # For downloading data. 189 | get_logs <- function(url){ 190 | curl_download(url, "logs.csv.gz") # Get a big file. 191 | gunzip("logs.csv.gz", overwrite = TRUE) # Unzip it. 192 | out <- read.csv("logs.csv", nrows = 4) # Extract the data you need. 193 | unlink(c("logs.csv.gz", "logs.csv")) # Remove the big files 194 | out # Value of the target. 
195 | } 196 | ``` 197 | 198 | When we are ready, we run the workflow. 199 | 200 | ```{r} 201 | make(plan) 202 | 203 | readd(logs) 204 | ``` 205 | 206 | If the log file at the `url` ever changes, the timestamp will update remotely, and `make()` will download the file again. 207 | -------------------------------------------------------------------------------- /plans.Rmd: -------------------------------------------------------------------------------- 1 | # `drake` plans {#plans} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 17 | library(drake) 18 | library(glue) 19 | library(purrr) 20 | library(rlang) 21 | library(tidyverse) 22 | invisible(drake_example("main", overwrite = TRUE)) 23 | invisible(file.copy("main/raw_data.xlsx", ".", overwrite = TRUE)) 24 | invisible(file.copy("main/report.Rmd", ".", overwrite = TRUE)) 25 | tmp <- suppressWarnings(drake_plan(x = 1, y = 2)) 26 | ``` 27 | 28 | Most data analysis workflows consist of several steps, such as data cleaning, model fitting, visualization, and reporting. A `drake` plan is the high-level catalog of all these steps for a single workflow. It is the centerpiece of every `drake`-powered project, and it is always required. However, the plan is almost never the first thing we write. A typical plan rests on a foundation of carefully-crafted custom functions. 29 | 30 | ## Functions 31 | 32 | A function is a reusable instruction that accepts some inputs and returns a single output. After we define a function once, we can easily call it any number of times. 
33 | 34 | ```{r} 35 | root_square_term <- function(l, w, h) { 36 | half_w <- w / 2 37 | l * sqrt(half_w ^ 2 + h ^ 2) 38 | } 39 | 40 | root_square_term(1, 2, 3) 41 | 42 | root_square_term(4, 5, 6) 43 | ``` 44 | 45 | In practice, functions are vocabulary. They are concise references to complicated ideas, and they help us write instructions of ever increasing complexity. 46 | 47 | ```{r} 48 | # right rectangular pyramid 49 | area_pyramid <- function(length_base, width_base, height) { 50 | area_base <- length_base * width_base 51 | term1 <- root_square_term(length_base, width_base, height) 52 | term2 <- root_square_term(width_base, length_base, height) 53 | area_base + term1 + term2 54 | } 55 | 56 | area_pyramid(3, 5, 7) 57 | ``` 58 | 59 | The `root_square_term()` function is custom shorthand that makes `area_pyramid()` easier to write and understand. `area_pyramid()`, in turn, helps us crudely approximate the total square meters of stone eroded from the Great Pyramid of Giza ([dimensions from Wikipedia](https://en.m.wikipedia.org/wiki/Great_Pyramid_of_Giza)). 60 | 61 | ```{r} 62 | area_original <- area_pyramid(230.4, 230.4, 146.5) 63 | area_current <- area_pyramid(230.4, 230.4, 138.8) 64 | area_original - area_current # area eroded 65 | ``` 66 | 67 | This function-oriented code is concise and clear. Contrast it with the cumbersome mountain of [imperative](https://en.wikipedia.org/wiki/Imperative_programming) arithmetic that would have otherwise daunted us. 68 | 69 | ```{r} 70 | # Don't try this at home! 71 | width_original <- 230.4 72 | length_original <- 230.4 73 | height_original <- 146.5 74 | 75 | # We supply the same lengths and widths, 76 | # but we use different variable names 77 | # to illustrate the general case. 78 | width_current <- 230.4 79 | length_current <- 230.4 80 | height_current <- 138.8 81 | 82 | area_base_original <- length_original * width_original 83 | term1_original <- length_original * 84 | sqrt((width_original / 2) ^ 2 + height_original ^ 2) 85 | term2_original <- width_original * 86 | sqrt((length_original / 2) ^ 2 + height_original ^ 2) 87 | area_original <- area_base_original + term1_original + term2_original 88 | 89 | area_base_current <- length_current * width_current 90 | term1_current <- length_current * 91 | sqrt((width_current / 2) ^ 2 + height_current ^ 2) 92 | term2_current <- width_current * 93 | sqrt((length_current / 2) ^ 2 + height_current ^ 2) 94 | area_current <- area_base_current + term1_current + term2_current 95 | 96 | area_original - area_current # area eroded 97 | ``` 98 | 99 | Unlike [imperative](https://en.wikipedia.org/wiki/Imperative_programming) scripts, functions break down complex ideas into manageable pieces, and they gradually build up bigger and bigger pieces until an elegant solution materializes. This process of building up functions helps us think clearly, understand what we are doing, and explain our methods to others. 100 | 101 | ## Intro to plans 102 | 103 | A `drake` plan is a data frame with columns named `target` and `command`. Each row represents a step in the workflow. Each command is a concise [expression](http://adv-r.had.co.nz/Expressions.html) that makes use of our functions, and each target is the return value of the command. (The `target` column has the *names* of the targets, not the values. These names must not conflict with the names of your functions or other global objects.) 104 | 105 | We create plans with the `drake_plan()` function.
106 | 107 | ```{r} 108 | plan <- drake_plan( 109 | raw_data = readxl::read_excel(file_in("raw_data.xlsx")), 110 | data = raw_data %>% 111 | mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))), 112 | hist = create_plot(data), 113 | fit = lm(Ozone ~ Wind + Temp, data), 114 | report = rmarkdown::render( 115 | knitr_in("report.Rmd"), 116 | output_file = file_out("report.html"), 117 | quiet = TRUE 118 | ) 119 | ) 120 | 121 | plan 122 | ``` 123 | 124 | The plan makes use of a custom `create_plot()` function to produce target `hist`. Functions make the plan more concise and easier to read. 125 | 126 | ```{r} 127 | create_plot <- function(data) { 128 | ggplot(data) + 129 | geom_histogram(aes(x = Ozone)) + 130 | theme_gray(24) 131 | } 132 | ``` 133 | 134 | `drake` automatically understands the relationships among targets in the plan. It knows `data` depends on `raw_data` because the symbol `raw_data` is mentioned in the command for `data`. `drake` represents this dependency relationship with an arrow from `raw_data` to `data` in the graph. 135 | 136 | ```{r} 137 | vis_drake_graph(plan) 138 | ``` 139 | 140 | We can write the targets in any order and `drake` still understands the dependency relationships. 141 | 142 | ```{r} 143 | plan <- drake_plan( 144 | raw_data = readxl::read_excel(file_in("raw_data.xlsx")), 145 | data = raw_data %>% 146 | mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))), 147 | hist = create_plot(data), 148 | fit = lm(Ozone ~ Wind + Temp, data), 149 | report = rmarkdown::render( 150 | knitr_in("report.Rmd"), 151 | output_file = file_out("report.html"), 152 | quiet = TRUE 153 | ) 154 | ) 155 | 156 | vis_drake_graph(plan) 157 | ``` 158 | 159 | The `make()` function runs the correct targets in the correct order and stores the results in a hidden cache. 
160 | 161 | ```{r} 162 | library(drake) 163 | library(glue) 164 | library(purrr) 165 | library(rlang) 166 | library(tidyverse) 167 | 168 | make(plan) 169 | 170 | readd(hist) 171 | ``` 172 | 173 | The purpose of the plan is to identify steps we can skip in our workflow. If we change some code or data, `drake` saves time by running some steps and skipping others. 174 | 175 | 176 | ```{r} 177 | create_plot <- function(data) { 178 | ggplot(data) + 179 | geom_histogram(aes(x = Ozone), binwidth = 10) + # new bin width 180 | theme_gray(24) 181 | } 182 | 183 | vis_drake_graph(plan) 184 | ``` 185 | 186 | 187 | ```{r} 188 | make(plan) 189 | 190 | readd(hist) 191 | ``` 192 | 193 | ## A strategy for building up plans 194 | 195 | Building a `drake` plan is a gradual process. You do not need to write out every single target to start with. Instead, start with just one or two targets: for example, `raw_data` in the plan above. Then, `make()` the plan and inspect the results with `readd()`. If the target's return value seems correct to you, go ahead and write another target in the plan (`data`), `make()` the bigger plan, and repeat. These repetitive `make()`s should skip previous work each time, and you will have an intuitive sense of the results as you go. 196 | 197 | ## How to choose good targets 198 | 199 | Defining good targets is more of an art than a science, and it requires personal judgement and context specific to your use case. Generally speaking, a good target is 200 | 201 | 1. Long enough to eat up a decent chunk of runtime, and 202 | 2. Small enough that `make()` frequently skips it, and 203 | 3. Meaningful to your project, and 204 | 4. A well-behaved R object compatible with `saveRDS()`. 
For example, data frames behave better than database connection objects (discussions [here](https://github.com/ropensci/drake/issues/1046) and [here](https://github.com/ropensci-books/drake/issues/46)), [R6 classes](https://github.com/ropensci/drake/issues/345), and [`xgboost` matrices](https://github.com/ropensci/drake/issues/1038). 205 | 206 | Above, "long" and "small" refer to computational runtime, not the size of the target's value. The more data you return to the targets, the more data `drake` puts in storage, and the slower your workflow becomes. If you have a large dataset, it may not be wise to copy it over several targets. 207 | 208 | ```{r} 209 | bad_plan <- drake_plan( 210 | raw = get_big_raw_dataset(), # We write this ourselves. 211 | selection = select(raw, column1, column2), 212 | filtered = filter(selection, column3 == "abc"), 213 | analysis = my_analysis_function(filtered) # Same here. 214 | ) 215 | ``` 216 | 217 | In the above sketch, the dataset is super large, and selection and filtering are fast by comparison. It is much better to wrap up these steps in a data cleaning function and reduce the number of targets. 218 | 219 | ```{r} 220 | munged_dataset <- function() { 221 | get_big_raw_dataset() %>% 222 | select(column1, column2) %>% 223 | filter(column3 == "abc") 224 | } 225 | 226 | good_plan <- drake_plan( 227 | dataset = munged_dataset(), 228 | analysis = my_analysis_function(dataset) 229 | ) 230 | ``` 231 | 232 | ## Special data formats for targets 233 | 234 | `drake` supports custom formats for saving and loading large objects and highly specialized objects. For example, the `"fst"` and `"fst_tbl"` formats use the [`fst`](https://github.com/fstpackage/fst) package to save `data.frame` and `tibble` targets faster. Simply enclose the command and the format together with the `target()` function. 235 | 236 | ```{r, eval = FALSE} 237 | library(drake) 238 | n <- 1e8 # Each target is 1.6 GB in memory.
239 | plan <- drake_plan( 240 | data_fst = target( 241 | data.frame(x = runif(n), y = runif(n)), 242 | format = "fst" 243 | ), 244 | data_old = data.frame(x = runif(n), y = runif(n)) 245 | ) 246 | make(plan) 247 | #> target data_fst 248 | #> target data_old 249 | build_times(type = "build") 250 | #> # A tibble: 2 x 4 251 | #> target elapsed user system 252 | #> <chr> <Duration> <Duration> <Duration> 253 | #> 1 data_fst 13.93s 37.562s 7.954s 254 | #> 2 data_old 184s (~3.07 minutes) 177s (~2.95 minutes) 4.157s 255 | ``` 256 | 257 | There are several formats, each with its own system requirements. These system requirements, such as the `fst` R package for the `"fst"` format, do not come pre-installed with `drake`. You will need to install them manually. 258 | 259 | - `"file"`: Dynamic files. To use this format, simply create local files and directories yourself and then return a character vector of paths as the target's value. Then, `drake` will watch for changes to those files in subsequent calls to `make()`. This is a more flexible alternative to `file_in()` and `file_out()`, and it is compatible with dynamic branching. See the section on dynamic files later in this chapter for an example. 260 | - `"fst"`: save big data frames fast. Requires the `fst` package. 261 | Note: this format strips away non-data-frame attributes. 262 | - `"fst_tbl"`: Like `"fst"`, but for `tibble` objects. 263 | Requires the `fst` and `tibble` packages. 264 | Strips away non-data-frame non-tibble attributes. 265 | - `"fst_dt"`: Like `"fst"` format, but for `data.table` objects. 266 | Requires the `fst` and `data.table` packages. 267 | Strips away non-data-frame non-data-table attributes. 268 | - `"diskframe"`: 269 | Stores `disk.frame` objects, which could potentially be 270 | larger than memory. Requires the `fst` and `disk.frame` packages. 271 | Coerces objects to `disk.frame`s. 272 | Note: `disk.frame` objects get moved to the `drake` cache 273 | (a subfolder of `.drake/` for most workflows).
274 | To ensure this data transfer is fast, it is best to 275 | save your `disk.frame` objects to the same physical storage 276 | drive as the `drake` cache, e.g. 277 | `as.disk.frame(your_dataset, outdir = drake_tempfile())`. 278 | - `"keras"`: save Keras models as HDF5 files. 279 | Requires the `keras` package. 280 | - `"qs"`: save any R object that can be properly serialized 281 | with the `qs` package. Requires the `qs` package. 282 | Uses `qsave()` and `qread()`. 283 | Uses the default settings in `qs` version 0.20.2. 284 | - `"rds"`: save any R object that can be properly serialized. 285 | Requires R version >= 3.5.0 due to ALTREP. 286 | Note: the `"rds"` format uses gzip compression, which is slow. 287 | `"qs"` is a superior format. 288 | 289 | ## Special columns 290 | 291 | With `target()`, you can define any kind of special column in the plan. 292 | 293 | ```{r} 294 | drake_plan( 295 | x = target((1 + sqrt(5)) / 2, golden = "ratio"), 296 | y = target(pi * 3 ^ 2, area = "circle") 297 | ) 298 | ``` 299 | 300 | The following columns have special meanings, and `make()` reads and interprets them. 301 | 302 | - `format`: already described above. 303 | - `dynamic`: See the [chapter on dynamic branching](#dynamic). 304 | - `transform`: Automatically processed by `drake_plan()` except for `drake_plan(transform = FALSE)`. See the [chapter on static branching](#static). 305 | - `trigger`: rule to decide whether a target needs to run. See the [trigger chapter](#triggers) to learn more. 306 | - `elapsed` and `cpu`: number of seconds to wait for the target to build before timing out (`elapsed` for elapsed time and `cpu` for CPU time). 307 | - `hpc`: logical values (`TRUE`/`FALSE`/`NA`) indicating whether to send each target to parallel workers. [Click here](https://books.ropensci.org/drake/hpc.html#selectivity) to learn more. 308 | - `resources`: target-specific lists of resources for a computing cluster. See the advanced options in the [parallel computing](#hpc) chapter for details.
309 | - `caching`: overrides the `caching` argument of `make()` for each target individually. Only supported in `drake` version 7.6.1.9000 and above. Possible values: 310 | - `"main"`: tell the main process to store the target in the cache. 311 | - `"worker"`: tell the HPC worker to store the target in the cache. 312 | - `NA`: default to the `caching` argument of `make()`. 313 | - `retries`: number of times to retry building a target in the event of an error. 314 | - `seed`: For statistical reproducibility, `drake` automatically assigns a unique pseudo-random number generator (RNG) seed to each target based on the target name and the global `seed` argument to `make()`. With the `seed` column of the plan, you can override these default seeds and set your own. Any non-missing seeds in the `seed` column override `drake`'s default target seeds. 315 | - `max_expand`: for dynamic branching only. Same as the `max_expand` argument of `make()`, but on a target-by-target basis. Limits the number of sub-targets created for a given target. Only supported in `drake` >= 7.11.0. 316 | 317 | ## Static files 318 | 319 | `drake` has special functions to declare relationships between targets and external storage on disk. [`file_in()`](https://docs.ropensci.org/drake/reference/file_in.html) is for input files and directories, [`file_out()`](https://docs.ropensci.org/drake/reference/file_out.html) is for output files and directories, and [`knitr_in()`](https://docs.ropensci.org/drake/reference/knitr_in.html) is for R Markdown reports and `knitr` source files. If you use one of these functions inline in the plan, it tells `drake` to rerun a target when a file changes (or any of the files in a directory). 320 | 321 | All three functions appear in this plan. 322 | 323 | ```{r} 324 | plan 325 | ``` 326 | 327 | If we break the `file_out()` file, `drake` automatically repairs it.
328 | 329 | ```{r} 330 | unlink("report.html") 331 | 332 | make(plan) 333 | 334 | file.exists("report.html") 335 | ``` 336 | 337 | As for `knitr_in()`, recall what happened when we changed `create_plot()`. Not only did `hist` rerun, but `report` reran as well. Why? Because `knitr_in()` is special. It tells `drake` to look for mentions of `loadd()` and `readd()` in the code chunks. `drake` finds the targets you mention in those `loadd()` and `readd()` calls and treats them as dependencies of the report. This lets you choose to run the report either inside or outside a `drake` pipeline. 338 | 339 | ```{r} 340 | cat(readLines("report.Rmd"), sep = "\n") 341 | ``` 342 | 343 | That is why we have an arrow from `hist` to `report` in the graph. 344 | 345 | ```{r} 346 | vis_drake_graph(plan) 347 | ``` 348 | 349 | ### URLs 350 | 351 | `file_in()` understands URLs. If you supply a string beginning with `http://`, `https://`, or `ftp://`, `drake` watches the HTTP ETag, file size, and timestamp for changes. 352 | 353 | ```{r} 354 | drake_plan( 355 | external_data = download.file(file_in("http://example.com/file.zip"), destfile = "file.zip") 356 | ) 357 | ``` 358 | 359 | ### Limitations of static files 360 | 361 | #### Paths must be literal strings 362 | 363 | `file_in()`, `file_out()`, and `knitr_in()` require you to mention file and directory names *explicitly*. You cannot use a variable containing the name of a file. The reason is that `drake` detects dependency relationships with [static code analysis](http://adv-r.had.co.nz/Expressions.html#ast-funs). In other words, `drake` needs to know the names of all your files ahead of time (before we start building targets in `make()`). Here is an example of a bad plan.
364 | 365 | ```{r} 366 | prefix <- "eco_" 367 | bad_plan <- drake_plan( 368 | data = read_csv(file_in(paste0(prefix, "data.csv"))) 369 | ) 370 | 371 | vis_drake_graph(bad_plan) 372 | ``` 373 | 374 | ```{r, include = FALSE} 375 | file.create("eco_data.csv") 376 | ``` 377 | 378 | Instead, write this: 379 | 380 | ```{r} 381 | good_plan <- drake_plan( 382 | file = read_csv(file_in("eco_data.csv")) 383 | ) 384 | 385 | vis_drake_graph(good_plan) 386 | ``` 387 | 388 | Or even the one below, which uses the `!!` ("bang-bang") tidy evaluation unquoting operator. 389 | 390 | ```{r} 391 | prefix <- "eco_" 392 | drake_plan( 393 | file = read_csv(file_in(!!paste0(prefix, "data.csv"))) 394 | ) 395 | ``` 396 | 397 | #### Do not use inside functions 398 | 399 | `file_out()` and `knitr_in()` should not be used inside imported functions because `drake` does not know how to deal with functions that depend on targets. Instead of this: 400 | 401 | ```{r, eval = FALSE} 402 | f <- function() { 403 | render(knitr_in("report.Rmd"), output_file = file_out("report.html")) 404 | } 405 | 406 | plan <- drake_plan( 407 | y = f() 408 | ) 409 | ``` 410 | 411 | Write this: 412 | 413 | ```{r, eval = FALSE} 414 | plan <- drake_plan( 415 | y = render(knitr_in("report.Rmd"), output_file = file_out("report.html")) 416 | ) 417 | ``` 418 | 419 | Or this: 420 | 421 | ```{r, eval = FALSE} 422 | f <- function(input, output) { 423 | render(input, output_file = output) 424 | } 425 | 426 | plan <- drake_plan( 427 | y = f(input = knitr_in("report.Rmd"), output = file_out("report.html")) 428 | ) 429 | ``` 430 | 431 | `file_in()` can be used inside functions, but only for files that exist before you call `make()`. 432 | 433 | #### Incompatible with dynamic branching 434 | 435 | `file_out()` and `knitr_in()` deal with static output files, so they must not be used with [dynamic branching](#dynamic). As an alternative, consider dynamic files (described below). 
You can still use `file_in()`, but only for files that *all* the dynamic sub-targets depend on. (Changing a static input file dependency will invalidate all the sub-targets.) 436 | 437 | #### Database connections 438 | 439 | `file_in()` and friends do not help us manage database connections. If you work with a database, the most general best practice is to always [trigger](#triggers) a snapshot to make sure you have the latest data. 440 | 441 | ```{r, eval = FALSE} 442 | plan <- drake_plan( 443 | data = target( 444 | get_data_from_db("my_table"), # Define yourself. 445 | trigger = trigger(condition = TRUE) # Always runs. 446 | ), 447 | preprocess = my_preprocessing(data) # Runs when the data change. 448 | ) 449 | ``` 450 | 451 | In some use cases, you may be able to watch database metadata for changes, but the details depend on your database. 452 | 453 | ```{r, eval = FALSE} 454 | library(DBI) 455 | 456 | # Connection objects are brittle, so they should not be targets. 457 | # We define them up front, and we use ignore() to prevent 458 | # drake from rerunning targets when the connection object changes. 459 | con <- dbConnect(...) 460 | 461 | plan <- drake_plan( 462 | data = target( 463 | dbReadTable(ignore(con), "my_table"), # Use ignore() for db connection objects. 464 | trigger = trigger(change = somehow_get_db_timestamp()) # Define yourself. 465 | ), 466 | preprocess = my_preprocessing(data) # Runs when the data change. 467 | ) 468 | ``` 469 | 470 | ## Dynamic files 471 | 472 | `drake` >= 7.11.0 supports dynamic files through a specialized format. With dynamic files, `drake` can watch local files without knowing them in advance. This is a more flexible alternative to `file_out()` and `file_in()`, and it is fully compatible with [dynamic branching](#dynamic). 473 | 474 | ### How to use dynamic files 475 | 476 | 1. Set `format = "file"` in `target()` within `drake_plan()`. 477 | 2. Return the paths to local files from the target. 478 | 3.
To link targets together in dependency relationships, reference target names and not literal character strings. 479 | 480 | ### Example of dynamic files 481 | 482 | ```{r} 483 | bad_plan <- drake_plan( 484 | upstream = target({ 485 | writeLines("one line", "my file") # Make sure the file exists. 486 | "my file" # Must return the file path. 487 | }, 488 | format = "file" # Necessary for dynamic files 489 | ), 490 | downstream = readLines("my file") # Oops! 491 | ) 492 | 493 | plot(bad_plan) 494 | 495 | good_plan <- drake_plan( 496 | upstream = target({ 497 | writeLines("one line", "my file") # Make sure the file exists. 498 | "my file" # Must return the file path. 499 | }, 500 | format = "file" # Necessary for dynamic files 501 | ), 502 | downstream = readLines(upstream) # Use the target name. 503 | ) 504 | 505 | plot(good_plan) 506 | 507 | make(good_plan) 508 | 509 | # Change how the file is generated. 510 | good_plan <- drake_plan( 511 | upstream = target({ 512 | writeLines("different line", "my file") # Change the file. 513 | "my file" 514 | }, 515 | format = "file" 516 | ), 517 | downstream = readLines(upstream) 518 | ) 519 | 520 | # The downstream target automatically reruns. 521 | make(good_plan) 522 | ``` 523 | 524 | ### Limitations of dynamic files 525 | 526 | Unlike `file_in()`, dynamic files cannot handle URLs. All files and directories must have valid local paths. 527 | 528 | ## Large plans 529 | 530 | `drake` has special interfaces to concisely define large numbers of targets. See the chapters on [static branching](#static) and [dynamic branching](#dynamic) for details. 
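Those chapters have the details, but a minimal static branching sketch gives the flavor. Here, `simulate()` and `summarize_sim()` are hypothetical user-defined functions (not part of `drake`), and `transform = map()` declares one target per sample size.

```{r, eval = FALSE}
# Sketch only: simulate() and summarize_sim() are hypothetical
# user-defined functions.
plan <- drake_plan(
  sim = target(
    simulate(n),
    transform = map(n = c(100, 1000)) # One target per sample size.
  ),
  summary = target(
    summarize_sim(sim),
    transform = map(sim) # Follows the branching of sim.
  )
)
plan # A plan with targets such as sim_100 and sim_1000, plus their summaries.
```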
531 | -------------------------------------------------------------------------------- /preamble.tex: -------------------------------------------------------------------------------- 1 | \usepackage{booktabs} 2 | \usepackage{mdframed} 3 | \usepackage{xcolor} 4 | \usepackage{hyperref} 5 | \usepackage[default]{sourcesanspro} 6 | \definecolor{roblue}{HTML}{6FAEF5} 7 | \definecolor{rolink}{HTML}{3D98FF} 8 | \hypersetup 9 | {colorlinks=true, 10 | linkcolor=rolink, 11 | urlcolor=rolink, 12 | filecolor=rolink, 13 | citecolor=rolink, 14 | allcolors=rolink 15 | } 16 | \newenvironment{summaryblock} 17 | {\begin{mdframed}[linecolor=roblue,linewidth=2pt]} 18 | {\end{mdframed}} 19 | -------------------------------------------------------------------------------- /projects.Rmd: -------------------------------------------------------------------------------- 1 | # `drake` projects {#projects} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 17 | library(drake) 18 | library(dplyr) 19 | library(ggplot2) 20 | drake_example("main") 21 | tmp <- file.copy("main/R", ".", recursive = TRUE) 22 | tmp <- file.copy("main/_drake.R", ".") 23 | tmp <- file.copy("main/raw_data.xlsx", ".") 25 | tmp <- file.copy("main/report.Rmd", ".") 28 | 29 | ``` 30 | 31 | `drake`'s design philosophy is extremely R-focused. It embraces in-memory configuration, in-memory dependencies, interactivity, and flexibility.
This scope leaves project setup and file management decisions mostly up to the user. This chapter tries to fill in the blanks and address practical hurdles when it comes to setting up projects. 32 | 33 | ## External resources 34 | 35 | - [Miles McBain](https://github.com/MilesMcBain)'s [excellent blog post](https://milesmcbain.xyz/the-drake-post/) explains the practical issues {drake} solves for most projects, how to set up a project as quickly and painlessly as possible, and how to overcome common obstacles. 36 | - Miles' [`dflow`](https://github.com/MilesMcBain/dflow) package generates the file structure for a boilerplate `drake` project. It is a more thorough alternative to `drake::use_drake()`. 37 | - `drake` is heavily function-oriented by design, and Miles' [`fnmate`](https://github.com/MilesMcBain/fnmate) package automatically generates boilerplate code and docstrings for functions you mention in `drake` plans. 38 | 39 | ## Code files 40 | 41 | The names and locations of the files are entirely up to you, but this pattern is particularly useful to start with. 42 | 43 | ``` 44 | make.R 45 | R/ 46 | ├── packages.R 47 | ├── functions.R 48 | └── plan.R 49 | ``` 50 | 51 | Here, `make.R` is a main top-level script that 52 | 53 | 1. Loads your packages, functions, and other in-memory data. 54 | 2. Creates the `drake` plan. 55 | 3. Calls `make()`. 56 | 57 | Let's consider the [main example](https://github.com/wlandau/drake-examples/tree/main/main), which you can download with `drake_example("main")`. Here, our main script is called `make.R`: 58 | 59 | ```{r, eval = FALSE} 60 | source("R/packages.R") # loads packages 61 | source("R/functions.R") # defines the create_plot() function 62 | source("R/plan.R") # creates the drake plan 63 | # options(clustermq.scheduler = "multicore") # optional parallel computing. 
Also needs parallelism = "clustermq" 64 | make( 65 | plan, # defined in R/plan.R 66 | verbose = 2 67 | ) 68 | ``` 69 | 70 | We have an `R` folder containing our supporting files. `packages.R` typically includes all the packages you will use in the workflow. 71 | 72 | ```{r, eval = FALSE} 73 | # packages.R 74 | library(drake) 75 | library(dplyr) 76 | library(ggplot2) 77 | ``` 78 | 79 | Your `functions.R` typically has the supporting custom functions you write for the workflow. If there are many functions, you could split them up into multiple files. 80 | 81 | ```{r, eval = FALSE} 82 | # functions.R 83 | create_plot <- function(data) { 84 | ggplot(data) + 85 | geom_histogram(aes(x = Ozone), binwidth = 10) + 86 | theme_gray(24) 87 | } 88 | ``` 89 | 90 | Finally, it is good practice to define the plan in its own `plan.R` script. 91 | 92 | ```{r, eval = FALSE} 93 | # plan.R 94 | plan <- drake_plan( 95 | raw_data = readxl::read_excel(file_in("raw_data.xlsx")), 96 | data = raw_data %>% 97 | mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))), 98 | hist = create_plot(data), 99 | fit = lm(Ozone ~ Wind + Temp, data), 100 | report = rmarkdown::render( 101 | knitr_in("report.Rmd"), 102 | output_file = file_out("report.html"), 103 | quiet = TRUE 104 | ) 105 | ) 106 | ``` 107 | 108 | To run the example project above, 109 | 110 | 1. Start a clean new R session. 111 | 2. Run the `make.R` script. 112 | 113 | On Mac and Linux, you can do this by opening a terminal and entering `R CMD BATCH --no-save make.R`. On Windows, restart your R session and call `source("make.R")` in the R console. 114 | 115 | Note: `drake` attaches no special meaning to these script files. There is nothing magical about the names `make.R`, `packages.R`, `functions.R`, or `plan.R`. Different projects may require different file structures. 116 | 117 | `drake` has other functions to inspect your results and examine your workflow.
Before invoking them interactively, it is best to start with a clean new R session. 118 | 119 | ```{r, eval = FALSE} 120 | # Restart R. 121 | interactive() 122 | #> [1] TRUE 123 | source("R/packages.R") 124 | source("R/functions.R") 125 | source("R/plan.R") 126 | vis_drake_graph(plan) 127 | ``` 128 | 129 | ## Safer interactivity 130 | 131 | ### Motivation 132 | 133 | A serious [`drake`](https://github.com/ropensci/drake) workflow should be consistent and reliable, ideally with the help of a [main top-level R script](https://github.com/wlandau/drake-examples/blob/main/gsp/make.R). Before it builds your targets, this script should begin in a fresh R session and load your packages and functions in a [dependable manner](https://github.com/wlandau/drake-examples/blob/d9417547a05aec416afbbda913eaf2d44a552d5b/gsp/make.R#L4-L6). [Batch mode](https://www.statmethods.net/interface/batch.html) makes sure all this goes according to plan. 134 | 135 | If you use a single persistent [interactive R session](https://stat.ethz.ch/R-manual/R-devel/library/base/html/interactive.html) to repeatedly invoke `make()` while you develop the workflow, then over time, your session could grow stale and accidentally invalidate targets. For example, if you interactively tinker with a new version of `create_plot()`, targets `hist` and `report` will fall out of date without warning, and the next `make()` will build them again. Even worse, the outputs from `hist` and `report` will be wrong if they depend on a half-finished `create_plot()`. 136 | 137 | The quickest workaround is to restart R and `source()` your setup scripts all over again. However, a better solution is to use [`r_make()`](https://docs.ropensci.org/drake/reference/r_make.html) and friends. [`r_make()`](https://docs.ropensci.org/drake/reference/r_make.html) runs `make()` in a new transient R session so that accidental changes to your interactive environment do not break your workflow. 
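To make the hazard concrete, here is a sketch of the failure mode, reusing the `create_plot()` function and the `hist` and `report` targets from the example plan above.

```{r, eval = FALSE}
# In a long-lived interactive session, a half-finished tweak...
create_plot <- function(data) {
  ggplot(data) +
    geom_density(aes(x = Ozone)) # Experimental change, not ready yet.
}

# ...silently invalidates the downstream targets.
outdated(plan) # hist and report are now out of date.

# The next make() rebuilds them with the half-finished function.
make(plan)
```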
138 | 139 | ### Usage 140 | 141 | To use [`r_make()`](https://docs.ropensci.org/drake/reference/r_make.html), you need a configuration R script. Unless you supply a custom file path (e.g. `r_make(source = "your_file.R")` or `options(drake_source = "your_file.R")`), `drake` assumes this configuration script is called `_drake.R`. (So the file name really *is* magical in this case.) The suggested file structure becomes: 142 | 143 | ``` 144 | _drake.R 145 | R/ 146 | ├── packages.R 147 | ├── functions.R 148 | └── plan.R 149 | ``` 150 | 151 | Like our previous `make.R` script, `_drake.R` runs all our pre-`make()` setup steps. But this time, rather than calling `make()`, it ends with a call to `drake_config()`. `drake_config()` is the initial preprocessing stage of `make()`, and it accepts all the same arguments as `make()`. 152 | 153 | Example `_drake.R`: 154 | 155 | ```{r, eval = FALSE} 156 | source("R/packages.R") 157 | source("R/functions.R") 158 | source("R/plan.R") 159 | # options(clustermq.scheduler = "multicore") # optional parallel computing 160 | drake_config(plan, verbose = 2) 161 | ``` 162 | 163 | Here is what happens when you call `r_make()`. 164 | 165 | 1. `drake` launches a new transient R session using [`callr::r()`](https://github.com/r-lib/callr). The remaining steps all happen within this transient session. 166 | 2. Run the configuration script (e.g. `_drake.R`) to 167 |     a. Load the packages, functions, global options, `drake` plan, etc. into the session's environment, and 168 |     b. Run the call to `drake_config()` and store the results in a variable called `config`. 169 | 3. Execute `make_impl(config = config)`, an internal `drake` function. 170 | 171 | The purpose of `drake_config()` is to collect and sanitize all the parameters and settings that `make()` needs to do its job. In fact, if you do not set the `config` argument explicitly, then `make()` invokes `drake_config()` behind the scenes.
`make(plan, verbose = 2)` is equivalent to 172 | 173 | ```{r, eval = FALSE} 174 | config <- drake_config(plan, verbose = 2) 175 | make_impl(config = config) 176 | ``` 177 | 178 | There are many more `r_*()` functions besides `r_make()`, each of which launches a fresh session and runs an inner `drake` function on the `config` object from `_drake.R`. 179 | 180 | Outer function call | Inner function call 181 | --- | --- 182 | `r_make()` | `make_impl(config = config)` 183 | `r_drake_build(...)` | `drake_build_impl(config, ...)` 184 | `r_outdated(...)` | `outdated_impl(config, ...)` 185 | `r_missed(...)` | `missed_impl(config, ...)` 186 | `r_vis_drake_graph(...)` | `vis_drake_graph_impl(config, ...)` 187 | `r_sankey_drake_graph(...)` | `sankey_drake_graph_impl(config, ...)` 188 | `r_drake_ggraph(...)` | `drake_ggraph_impl(config, ...)` 189 | `r_drake_graph_info(...)` | `drake_graph_info_impl(config, ...)` 190 | `r_predict_runtime(...)` | `predict_runtime_impl(config, ...)` 191 | `r_predict_workers(...)` | `predict_workers_impl(config, ...)` 192 | 193 | 194 | ```{r} 195 | clean() 196 | r_outdated(r_args = list(show = FALSE)) 197 | 198 | r_make() 199 | r_outdated(r_args = list(show = FALSE)) 200 | 201 | r_vis_drake_graph(targets_only = TRUE, r_args = list(show = FALSE)) 202 | ``` 203 | 204 | Remarks: 205 | 206 | - You can run `r_make()` in an interactive session, but the transient process it launches will not be interactive. Thus, any `browser()` statements in the commands in your `drake` plan will be ignored. 207 | - You can select and configure the underlying [`callr`](https://github.com/r-lib/callr) function using arguments `r_fn` and `r_args`, respectively.
208 | - For example code, you can download the updated [main example](https://github.com/wlandau/drake-examples/tree/main/main) (`drake_example("main")`) and experiment with files [`_drake.R`](https://github.com/wlandau/drake-examples/blob/main/main/_drake.R) and [`interactive.R`](https://github.com/wlandau/drake-examples/blob/main/main/interactive.R). 209 | 210 | ## Script file pitfalls 211 | 212 | Despite the above discussion of R scripts, `drake` plans rely more on in-memory functions. You might be tempted to write a plan like the following, but then `drake` cannot tell that `my_analysis` depends on `my_data`. 213 | 214 | ```{r} 215 | bad_plan <- drake_plan( 216 |   my_data = source(file_in("get_data.R")), 217 |   my_analysis = source(file_in("analyze_data.R")), 218 |   my_summaries = source(file_in("summarize_data.R")) 219 | ) 220 | 221 | vis_drake_graph(bad_plan, targets_only = TRUE) 222 | ``` 223 | 224 | When it comes to plans, use *functions* instead. 225 | 226 | ```{r, eval = FALSE} 227 | source("my_functions.R") # defines get_data(), analyze_data(), etc. 228 | good_plan <- drake_plan( 229 |   my_data = get_data(file_in("data.csv")), # External files need to be in commands explicitly. # nolint 230 |   my_analysis = analyze_data(my_data), 231 |   my_summaries = summarize_results(my_data, my_analysis) 232 | ) 233 | 234 | vis_drake_graph(good_plan, targets_only = TRUE) 235 | ``` 236 | 237 | In `drake` >= 7.6.2.9000, `code_to_function()` leverages existing imperative scripts for use in a `drake` plan.
238 | 239 | ```{r, eval=FALSE} 240 | get_data <- code_to_function("get_data.R") 241 | do_analysis <- code_to_function("analyze_data.R") 242 | do_summary <- code_to_function("summarize_data.R") 243 | 244 | good_plan <- drake_plan( 245 |   my_data = get_data(), 246 |   my_analysis = do_analysis(my_data), 247 |   my_summaries = do_summary(my_data, my_analysis) 248 | ) 249 | 250 | vis_drake_graph(good_plan, targets_only = TRUE) 251 | 252 | ``` 253 | 254 | ```{r, echo = FALSE} 255 | good_plan <- drake_plan( 256 |   my_data = get_data(file_in("data.csv")), # External files need to be in commands explicitly. # nolint 257 |   my_analysis = analyze_data(my_data), 258 |   my_summaries = summarize_results(my_data, my_analysis) 259 | ) 260 | 261 | vis_drake_graph(good_plan, targets_only = TRUE) 262 | ``` 263 | 264 | 265 | ## Workflows as R packages 266 | 267 | The R package structure is a great way to organize and quality-control a data analysis project. If you write a `drake` workflow as a package, you will need to 268 | 269 | 1. Supply the namespace of your package to the `envir` argument of `make()` or `drake_config()` (e.g. `make(envir = getNamespace("yourPackage"))`) so `drake` can watch your package's functions for changes and rebuild downstream targets accordingly. 270 | 2. If you load the package with `devtools::load_all()`, set the `prework` argument of `make()` (e.g. `make(prework = "devtools::load_all()")`) and adjust the `packages` argument so your package name is not included. (Everything in `packages` is loaded with `library()`.) 271 | 272 | For a minimal example, see [Tiernan Martin](https://github.com/tiernanmartin)'s [`drakepkg`](https://github.com/tiernanmartin/drakepkg). 273 | 274 | ## Other tools 275 | 276 | [`drake`](https://github.com/ropensci/drake) enhances reproducibility, but not in all respects.
[Local library managers](https://rstudio.github.io/packrat), [containerization](https://www.docker.com), and [session management tools](https://github.com/tidyverse/reprex) offer more robust solutions in their respective domains. Reproducibility encompasses a [wide variety of tools and techniques](https://github.com/karthik/rstudio2019) all working together. Comprehensive overviews: 277 | 278 | - [PLOS article](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510) by Wilson et al. 279 | - [RStudio Conference 2019 presentation ](https://github.com/karthik/rstudio2019) by [Karthik Ram](https://github.com/karthik). 280 | - [`rrtools`](https://github.com/benmarwick/rrtools) by [Ben Marwick](https://github.com/benmarwick). 281 | -------------------------------------------------------------------------------- /resources.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Resource management {-} 2 | -------------------------------------------------------------------------------- /scripts.Rmd: -------------------------------------------------------------------------------- 1 | # Script-based workflows {#scripts} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 17 | library(drake) 18 | library(tidyverse) 19 | invisible(drake_example("script-based-workflows", overwrite = TRUE)) 20 | files <- list.files( 21 | "script-based-workflows/R/", 22 | pattern = "*.R", 23 | full.names = TRUE 24 | ) 25 | tmp <- file.copy(files, ".", recursive = TRUE) 26 | tmp <- 
file.copy("script-based-workflows/raw_data.xlsx", ".") 27 | tmp <- file.copy("script-based-workflows/report.Rmd", ".") 28 | dir.create("data") 29 | ``` 30 | 31 | ## Function-oriented workflows 32 | 33 | `drake` works best when you write functions for data analysis. Functions break down complicated ideas into manageable pieces. 34 | 35 | ```{r} 36 | # R/functions.R 37 | get_data <- function(file){ 38 | readxl::read_excel(file) 39 | } 40 | 41 | munge_data <- function(raw_data){ 42 | raw_data %>% 43 | mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))) 44 | } 45 | 46 | fit_model <- function(munged_data){ 47 | lm(Ozone ~ Wind + Temp, munged_data) 48 | } 49 | ``` 50 | 51 | When we express computational steps as functions like `get_data()`, `munge_data()`, and `fit_model()`, we create special shorthand to make the rest of our code easier to read and understand. 52 | 53 | ```{r} 54 | # R/plan.R 55 | plan <- drake_plan( 56 | raw_data = get_data(file_in("raw_data.xlsx")), 57 | munged_data = munge_data(raw_data), 58 | model = fit_model(munged_data) 59 | ) 60 | ``` 61 | 62 | This function-oriented approach is elegant, powerful, testable, scalable, and maintainable. However, it can be challenging to convert pre-existing traditional script-based analyses to function-oriented `drake`-powered workflows. This chapter describes a stopgap to retrofit `drake` to existing projects. Custom functions are still better in the long run, but the following workaround is quick and painless, and it does not require you to change your original scripts. 63 | 64 | ## Traditional and legacy workflows 65 | 66 | It is common to express data analysis tasks as numbered scripts. 67 | 68 | ``` 69 | 01_data.R 70 | 02_munge.R 71 | 03_histogram.R 72 | 04_regression.R 73 | 05_report.R 74 | ``` 75 | 76 | The numeric prefixes indicate the order in which these scripts need to run. 
77 | 78 | ```{r, eval=FALSE} 79 | # run_everything.R 80 | source("01_data.R") 81 | source("02_munge.R") 82 | source("03_histogram.R") 83 | source("04_regression.R") 84 | source("05_report.R") # Calls rmarkdown::render() on report.Rmd. 85 | ``` 86 | 87 | ## Overcoming technical debt 88 | 89 | `code_to_function()` creates `drake_plan()`-ready functions from scripts like these. 90 | 91 | ```{r} 92 | # R/functions.R 93 | load_data <- code_to_function("01_data.R") 94 | munge_data <- code_to_function("02_munge.R") 95 | make_histogram <- code_to_function("03_histogram.R") 96 | do_regression <- code_to_function("04_regression.R") 97 | generate_report <- code_to_function("05_report.R") 98 | ``` 99 | 100 | Each function contains all the code from its corresponding script, along with a special final line to make sure we never return the same value twice. 101 | 102 | ```{r} 103 | print(load_data) 104 | ``` 105 | 106 | ## Dependencies 107 | 108 | `drake` pays close attention to dependencies. In `drake`, a target's dependencies are the things it needs in order to build. Dependencies can include functions, files, and other targets upstream. Any time a dependency changes, the target is no longer valid. The `make()` function automatically detects when dependencies change, and it rebuilds the targets that need updating. 109 | 110 | To leverage `drake`'s dependency-watching capabilities, we create a `drake` plan. This plan should include all the steps of the analysis, from loading the data to generating a report. 111 | 112 | To write the plan, we plug in the functions we created from `code_to_function()`. 113 | 114 | ```{r} 115 | simple_plan <- drake_plan( 116 |   data = load_data(), 117 |   munged_data = munge_data(), 118 |   hist = make_histogram(), 119 |   fit = do_regression(), 120 |   report = generate_report() 121 | ) 122 | ``` 123 | 124 | It's a start, but right now, `drake` has no idea which targets to run first and which need to wait for dependencies!
In the following graph, there are no edges (arrows) connecting the targets! 125 | 126 | ```{r} 127 | vis_drake_graph(simple_plan) 128 | ``` 129 | 130 | 131 | ## Building the connections 132 | 133 | Just as our original scripts had to run in a certain order, so do our targets now. 134 | We pass targets as function arguments to express this execution order. 135 | 136 | For example, when we write `munged_data = munge_data(data)`, we are signaling to 137 | `drake` that the `munged_data` target depends on the function `munge_data()` and 138 | the target `data`. 139 | 140 | ```{r} 141 | script_based_plan <- drake_plan( 142 |   data = load_data(), 143 |   munged_data = munge_data(data), 144 |   hist = make_histogram(munged_data), 145 |   fit = do_regression(munged_data), 146 |   report = generate_report(hist, fit) 147 | ) 148 | ``` 149 | 150 | ```{r} 151 | vis_drake_graph(script_based_plan) 152 | ``` 153 | 154 | ## Run the workflow 155 | 156 | We can now run the workflow with the `make()` function. The first call to `make()` runs all the data analysis tasks we got from the scripts. 157 | 158 | ```{r} 159 | make(script_based_plan) 160 | ``` 161 | 162 | ## Keeping the results up to date 163 | 164 | Any time we change a script, we need to run `code_to_function()` again to keep our function up to date. `drake` notices when this function changes, and `make()` reruns the updated function and all the downstream targets that rely on its output. 165 | 166 | For example, let's fine-tune our histogram. We open `03_histogram.R`, change the `binwidth` argument, and call `code_to_function("03_histogram.R")` all over again.
167 | 168 | ```{r echo = FALSE} 169 | writeLines( 170 | c( 171 | "munged_data <- readRDS(\"data/munged_data.RDS\")", 172 | "gg <- ggplot(munged_data) +", 173 | " geom_histogram(aes(x = Ozone)) +", 174 | " theme_gray(20)", 175 | "ggsave(", 176 | " filename = \"data/ozone.PNG\",", 177 | " plot = gg,", 178 | " width = 6,", 179 | " height = 6", 180 | ")", 181 | "saveRDS(gg, \"data/ozone.RDS\")" 182 | ), 183 | "03_histogram.R" 184 | ) 185 | ``` 186 | 187 | ```{r} 188 | # We need to rerun code_to_function() to tell drake that the script changed. 189 | make_histogram <- code_to_function("03_histogram.R") 190 | ``` 191 | 192 | Targets `hist` and `report` depend on the code we modified, so `drake` marks 193 | those targets as outdated. 194 | 195 | ```{r message = FALSE} 196 | outdated(script_based_plan) 197 | 198 | vis_drake_graph(script_based_plan, targets_only = TRUE) 199 | ``` 200 | 201 | When you call `make()`, `drake` runs `make_histogram()` because the underlying 202 | script changed, and it runs `generate_report()` because the report depends on 203 | `hist`. 204 | 205 | ```{r} 206 | make(script_based_plan) 207 | ``` 208 | 209 | All the targets are now up to date! 210 | 211 | ```{r} 212 | vis_drake_graph(script_based_plan, targets_only = TRUE) 213 | ``` 214 | 215 | ## Final thoughts 216 | 217 | Countless data science workflows consist of numbered imperative scripts, and `code_to_function()` lets `drake` accommodate script-based projects too big and cumbersome to refactor. 218 | 219 | However, for new projects, we strongly recommend that you write functions. Functions help organize your thoughts, and they improve portability, readability, and compatibility with `drake`. For a deeper discussion of functions and their 220 | role in `drake`, consider watching the [webinar recording of the 2019-09-23 rOpenSci Community Call](https://ropensci.org/commcalls/2019-09-24). 
221 | 222 | Even old projects are sometimes pliable enough to refactor into functions, especially with the new [`Rclean`](https://github.com/provtools/rclean) package. 223 | -------------------------------------------------------------------------------- /stan.Rmd: -------------------------------------------------------------------------------- 1 | # Validating a small hierarchical model with Stan {#stan} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE) 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, echo = FALSE, eval = TRUE} 17 | options(tidyverse.quiet = TRUE) 18 | library(drake) 19 | library(tidyverse) 20 | tmp <- suppressWarnings(drake_plan(x = 1, y = 2)) 21 | ``` 22 | 23 | The goal of this example workflow is to validate a small Bayesian hierarchical model. 24 | 25 | ```{r, eval = FALSE} 26 | y_i ~ iid Normal(alpha + x_i * beta, sigma^2) 27 | alpha ~ Normal(0, 1) 28 | beta ~ Normal(0, 1) 29 | sigma ~ Uniform(0, 1) 30 | ``` 31 | 32 | We simulate multiple datasets from the model and fit the model on each dataset. For each model fit, we determine if the 50% credible interval of the regression coefficient `beta` contains the true value of `beta` used to generate the data. If we implemented the model correctly, roughly 50% of the models should recapture the true `beta` in 50% credible intervals. 33 | 34 | ## The `drake` project 35 | 36 | Because of the long computation time involved, this chapter of the manual does not actually run the analysis code. The complete code can be found at and downloaded with `drake::drake_example("stan")`, and we encourage you to try out the code yourself. 
This chapter serves to walk through the functions and plan and explain the overall thought process. 37 | 38 | The file structure is that of a [typical `drake` project](#projects) with some additions to allow optional [high-performance computing](#hpc) on a cluster. 39 | 40 | ```{r, eval = FALSE} 41 | ├── run.sh 42 | ├── run.R 43 | ├── _drake.R 44 | ├── sge.tmpl 45 | ├── R/ 46 | ├──── packages.R 47 | ├──── functions.R 48 | ├──── plan.R 49 | ├── stan/ 50 | ├──── model.stan 51 | └── report.Rmd 52 | ``` 53 | 54 | File | Purpose 55 | ---|--- 56 | `run.sh` | Shell script to run `run.R` in a persistent background process. Works on Unix-like systems. Helpful for long computations on servers. 57 | `run.R` | R script to run `r_make()`. 58 | `_drake.R` | The special R script that powers functions `r_make()` and friends ([details in the projects chapter](#projects)). 59 | `sge.tmpl` | A [`clustermq`](https://github.com/mschubert/clustermq) template file to deploy targets in parallel to a Sun Grid Engine cluster. 60 | `R/packages.R` | A custom R script loading the packages we need. 61 | `R/functions.R` | A custom R script with user-defined functions. 62 | `R/plan.R` | A custom R script that defines the `drake` plan. 63 | `stan/model.stan` | The specification of our Stan model. 64 | `report.Rmd` | An R Markdown report summarizing the results of the analysis. 65 | 66 | The following sections walk through the functions and plan. 67 | 68 | ## Functions 69 | 70 | Good functions have meaningful inputs and outputs that are easy to generate. For data analysis, good inputs and outputs are typically datasets, models, and summaries of fitted models. The functions below for our Stan workflow follow this pattern. 71 | 72 | First, we need a function to compile the model. It accepts a Stan model specification file (a `*.stan` text file) and returns the paths to both the model file and the compiled RDS file. (We need to set `rstan_options(auto_write = TRUE)` to make sure `stan_model()` generates the RDS file.)
We return the file paths because the target that uses this function will be a [dynamic file target](https://books.ropensci.org/drake/plans.html#dynamic-files). `drake` will reproducibly watch these files for changes and automatically recompile and run the model if changes are detected. 73 | 74 | ```{r} 75 | compile_model <- function(model_file) { 76 | stan_model(model_file, auto_write = TRUE, save_dso = TRUE) 77 | c(model_file, path_ext_set(model_file, "rds")) 78 | } 79 | ``` 80 | 81 | Next, we need a function to simulate a dataset from the hierarchical model. 82 | 83 | ```{r} 84 | simulate_data <- function() { 85 | alpha <- rnorm(1, 0, 1) 86 | beta <- rnorm(1, 0, 1) 87 | sigma <- runif(1, 0, 1) 88 | x <- rbinom(100, 1, 0.5) 89 | y <- rnorm(100, alpha + x * beta, sigma) 90 | tibble(x = x, y = y, beta_true = beta) 91 | } 92 | ``` 93 | 94 | Lastly, we write a function to fit the compiled model to a simulated dataset. We pass in the `*.stan` model specification file, but `rstan` is smart enough to use the compiled RDS model file instead if available. 95 | 96 | In Bayesian data analysis workflows with many runs of the same model, we need to make a conscious effort to conserve computing resources. That means we should not save all the posterior samples from every single model fit. Instead, we compute summary statistics on the chains such as posterior quantiles, coverage in credible intervals, and convergence diagnostics. 97 | 98 | ```{r} 99 | fit_model <- function(model_file, data) { 100 | # From https://github.com/stan-dev/rstan/issues/444#issuecomment-445108185, 101 | # we need each stanfit object to have its own unique name, so we create 102 | # a special new environment for it. 
103 |   tmp_envir <- new.env(parent = baseenv()) 104 |   tmp_envir$model <- stan_model(model_file, auto_write = TRUE, save_dso = TRUE) 105 |   tmp_envir$fit <- sampling( 106 |     object = tmp_envir$model, 107 |     data = list(x = data$x, y = data$y, n = nrow(data)), 108 |     refresh = 0 109 |   ) 110 |   mcmc_list <- As.mcmc.list(tmp_envir$fit) 111 |   samples <- as.data.frame(as.matrix(mcmc_list)) 112 |   beta_25 <- quantile(samples$beta, 0.25) 113 |   beta_median <- quantile(samples$beta, 0.5) 114 |   beta_75 <- quantile(samples$beta, 0.75) 115 |   beta_true <- data$beta_true[1] 116 |   beta_cover <- beta_25 < beta_true && beta_true < beta_75 117 |   psrf <- max(gelman.diag(mcmc_list, multivariate = FALSE)$psrf[, 1]) 118 |   ess <- min(effectiveSize(mcmc_list)) 119 |   tibble( 120 |     beta_cover = beta_cover, 121 |     beta_true = beta_true, 122 |     beta_25 = beta_25, 123 |     beta_median = beta_median, 124 |     beta_75 = beta_75, 125 |     psrf = psrf, 126 |     ess = ess 127 |   ) 128 | } 129 | ``` 130 | 131 | ## Plan 132 | 133 | Our [`drake` plan](#plans) is defined in the `R/plan.R` script. 134 | 135 | ```{r} 136 | plan <- drake_plan( 137 |   model_file = target( 138 |     compile_model("stan/model.stan"), 139 |     format = "file", 140 |     hpc = FALSE 141 |   ), 142 |   index = target( 143 |     seq_len(1000), # Change the number of simulations here. 144 |     hpc = FALSE 145 |   ), 146 |   data = target( 147 |     simulate_data(), 148 |     dynamic = map(index), 149 |     format = "fst_tbl" 150 |   ), 151 |   fit = target( 152 |     fit_model(model_file[1], data), 153 |     dynamic = map(data), 154 |     format = "fst_tbl" 155 |   ), 156 |   report = target( 157 |     render( 158 |       knitr_in("report.Rmd"), 159 |       output_file = file_out("report.html"), 160 |       quiet = TRUE 161 |     ), 162 |     hpc = FALSE 163 |   ) 164 | ) 165 | ``` 166 | 167 | The following subsections describe the strategy and practical adjustments behind each target.
168 | 169 | ### Model 170 | 171 | The `model_file` target is a [dynamic file target](https://books.ropensci.org/drake/plans.html#dynamic-files) to reproducibly track our Stan model specification file (`stan/model.stan`) and compiled model file (`stan/model.rds`). Below, `format = "file"` indicates that the target is a dynamic file target, and `hpc = FALSE` tells `drake` not to run the target on a parallel worker in [high-performance computing](#hpc) scenarios. 172 | 173 | ```{r} 174 | model_file = target( 175 |   compile_model("stan/model.stan"), 176 |   format = "file", 177 |   hpc = FALSE 178 | ), 179 | ``` 180 | 181 | ### Index 182 | 183 | The `index` target is simply a numeric vector from 1 to the number of simulations. To fit our model multiple times, we are going to [dynamically map](#dynamic) over `index`. This is a small target and we do not want to waste expensive computing resources on it, so we set `hpc = FALSE`. 184 | 185 | ```{r} 186 | index = target( 187 |   seq_len(1000), # Change the number of simulations here. 188 |   hpc = FALSE 189 | ) 190 | ``` 191 | 192 | ### Data 193 | 194 | `data` is a [dynamic target](#dynamic) with one sub-target per simulated dataset, so we write `dynamic = map(index)` below. In addition, these datasets are data frames, so we choose `format = "fst_tbl"` below to increase read/write speeds and conserve storage space. [Read here](https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets) for more on specialized storage formats. 195 | 196 | ```{r} 197 | data = target( 198 |   simulate_data(), 199 |   dynamic = map(index), 200 |   format = "fst_tbl" 201 | ) 202 | ``` 203 | 204 | ### Fit 205 | 206 | We want to fit our model once for each simulated dataset, so our `fit` target dynamically maps over the datasets with `dynamic = map(data)`. We also pass in the path to the `stan/model.stan` specification file, but `rstan` is smart enough to use the compiled model `stan/model.rds` instead if available.
And since `fit_model()` returns a data frame, we also choose `format = "fst_tbl"` here. 207 | 208 | ```{r} 209 | fit = target( 210 |   fit_model(model_file[1], data), 211 |   dynamic = map(data), 212 |   format = "fst_tbl" 213 | ) 214 | ``` 215 | 216 | ### Report 217 | 218 | R Markdown reports should never do any heavy lifting in `drake` pipelines. They should simply leverage the computationally expensive work done in the previous targets. If we follow this good practice and our report renders quickly, we should not need heavy computing resources to process it, and we can set `hpc = FALSE` below. 219 | 220 | The [`report.Rmd` file itself](https://github.com/wlandau/drake-examples/blob/main/stan/report.Rmd) has `loadd()` and `readd()` statements to refer to these targets, and with the `knitr_in()` keyword below, `drake` knows that it needs to update the report when the models or datasets change. Similarly, `file_out("report.html")` tells `drake` to rerun the report if the output file gets corrupted. 221 | 222 | ```{r} 223 | report = target( 224 |   render( 225 |     knitr_in("report.Rmd"), 226 |     output_file = file_out("report.html"), 227 |     quiet = TRUE 228 |   ), 229 |   hpc = FALSE 230 | ) 231 | ``` 232 | 233 | ## Try it out! 234 | 235 | The complete code can be found at and downloaded with `drake::drake_example("stan")`, and we encourage you to try out the code yourself.
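Once the pipeline has run, the calibration check at the heart of this exercise is short. Below is a sketch (it assumes the `fit` target above has been built; reading a dynamic target back returns the sub-targets combined, so `readd(fit)` yields one row per simulation):

```{r, eval = FALSE}
results <- readd(fit)
mean(results$beta_cover) # Should be near 0.5 if the model is implemented correctly.
```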
236 | -------------------------------------------------------------------------------- /start.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Getting started {-} 2 | -------------------------------------------------------------------------------- /static.Rmd: -------------------------------------------------------------------------------- 1 | # Static branching {#static} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 17 | library(drake) 18 | library(glue) 19 | library(purrr) 20 | library(rlang) 21 | library(tidyverse) 22 | invisible(drake_example("main", overwrite = TRUE)) 23 | invisible(file.copy("main/raw_data.xlsx", ".", overwrite = TRUE)) 24 | invisible(file.copy("main/report.Rmd", ".", overwrite = TRUE)) 25 | tmp <- suppressWarnings(drake_plan(x = 1, y = 2)) 26 | ``` 27 | 28 | ## Why static branching? 29 | 30 | Static branching helps us write large plans compactly. Instead of typing out every single target by hand, we use a special shorthand to declare entire batches of similar targets. To practice static branching in a controlled setting, try the interactive exercises at (from the workshop at ). 31 | 32 | Without static branching, plans like this one become too cumbersome to type by hand. 
33 | 34 | ```{r, eval = FALSE} 35 | # Without static branching: 36 | 37 | drake_plan( 38 | data = get_data(), 39 | analysis_fast_1_main = main(data, mean = 1, tuning = "fast"), 40 | analysis_slow_1_main = main(data, mean = 1, tuning = "slow"), 41 | analysis_fast_2_main = main(data, mean = 2, tuning = "fast"), 42 | analysis_slow_2_main = main(data, mean = 2, tuning = "slow"), 43 | analysis_fast_3_main = main(data, mean = 3, tuning = "fast"), 44 | analysis_slow_3_main = main(data, mean = 3, tuning = "slow"), 45 | analysis_fast_4_main = main(data, mean = 4, tuning = "fast"), 46 | analysis_slow_4_main = main(data, mean = 4, tuning = "slow"), 47 | analysis_fast_1_altv = altv(data, mean = 1, tuning = "fast"), 48 | analysis_slow_1_altv = altv(data, mean = 1, tuning = "slow"), 49 | analysis_fast_2_altv = altv(data, mean = 2, tuning = "fast"), 50 | analysis_slow_2_altv = altv(data, mean = 2, tuning = "slow"), 51 | analysis_fast_3_altv = altv(data, mean = 3, tuning = "fast"), 52 | analysis_slow_3_altv = altv(data, mean = 3, tuning = "slow"), 53 | analysis_fast_4_altv = altv(data, mean = 4, tuning = "fast"), 54 | analysis_slow_4_altv = altv(data, mean = 4, tuning = "slow"), 55 | summary_analysis_fast_1_main = summarize_model(analysis_fast_1_main), 56 | summary_analysis_slow_1_main = summarize_model(analysis_slow_1_main), 57 | summary_analysis_fast_2_main = summarize_model(analysis_fast_2_main), 58 | summary_analysis_slow_2_main = summarize_model(analysis_slow_2_main), 59 | summary_analysis_fast_3_main = summarize_model(analysis_fast_3_main), 60 | summary_analysis_slow_3_main = summarize_model(analysis_slow_3_main), 61 | summary_analysis_fast_4_main = summarize_model(analysis_fast_4_main), 62 | summary_analysis_slow_4_main = summarize_model(analysis_slow_4_main), 63 | summary_analysis_fast_1_altv = summarize_model(analysis_fast_1_altv), 64 | summary_analysis_slow_1_altv = summarize_model(analysis_slow_1_altv), 65 | summary_analysis_fast_2_altv = 
summarize_model(analysis_fast_2_altv), 66 | summary_analysis_slow_2_altv = summarize_model(analysis_slow_2_altv), 67 | summary_analysis_fast_3_altv = summarize_model(analysis_fast_3_altv), 68 | summary_analysis_slow_3_altv = summarize_model(analysis_slow_3_altv), 69 | summary_analysis_fast_4_altv = summarize_model(analysis_fast_4_altv), 70 | summary_analysis_slow_4_altv = summarize_model(analysis_slow_4_altv), 71 | model_summary_altv = dplyr::bind_rows( 72 | summary_analysis_fast_1_altv, 73 | summary_analysis_slow_1_altv, 74 | summary_analysis_fast_2_altv, 75 | summary_analysis_slow_2_altv, 76 | summary_analysis_fast_3_altv, 77 | summary_analysis_slow_3_altv, 78 | summary_analysis_fast_4_altv, 79 | summary_analysis_slow_4_altv 80 | ), 81 | model_summary_main = dplyr::bind_rows( 82 | summary_analysis_fast_1_main, 83 | summary_analysis_slow_1_main, 84 | summary_analysis_fast_2_main, 85 | summary_analysis_slow_2_main, 86 | summary_analysis_fast_3_main, 87 | summary_analysis_slow_3_main, 88 | summary_analysis_fast_4_main, 89 | summary_analysis_slow_4_main 90 | ) 91 | ) 92 | ``` 93 | 94 | Static branching makes it easier to write and understand plans. To activate static branching, use the `transform` argument of `target()`. 95 | 96 | ```{r} 97 | # With static branching: 98 | 99 | model_functions <- rlang::syms(c("main", "altv")) # We need symbols. 100 | 101 | model_functions # List of symbols. 102 | 103 | plan <- drake_plan( 104 | data = get_data(), 105 | analysis = target( 106 | model_function(data, mean = mean_value, tuning = tuning_setting), 107 | # Define an analysis target for each combination of 108 | # tuning_setting, mean_value, and model_function. 109 | transform = cross( 110 | tuning_setting = c("fast", "slow"), 111 | mean_value = !!(1:4), # Why `!!`? See "Tidy Evaluation" below. 112 | model_function = !!model_functions # Why `!!`? See "Tidy Evaluation" below. 113 | ) 114 | ), 115 | # Define a new summary target for each analysis target defined previously. 
116 | summary = target( 117 | summarize_model(analysis), 118 | transform = map(analysis) 119 | ), 120 | # Group together the summary targets by the corresponding value 121 | # of model_function. 122 | model_summary = target( 123 | dplyr::bind_rows(summary), 124 | transform = combine(summary, .by = model_function) 125 | ) 126 | ) 127 | 128 | plan 129 | ``` 130 | 131 | *Always* check the graph to make sure the plan makes sense. 132 | 133 | ```{r} 134 | plot(plan) # a quick and dirty alternative to vis_drake_graph() 135 | ``` 136 | 137 | 138 | If the graph is too complicated to look at or too slow to load, downsize the plan with `max_expand`. Then, when you are done debugging and testing, remove `max_expand` to scale back up to the full plan. 139 | 140 | ```{r} 141 | model_functions <- rlang::syms(c("main", "altv")) 142 | 143 | plan <- drake_plan( 144 | max_expand = 2, 145 | data = get_data(), 146 | analysis = target( 147 | model_function(data, mean = mean_value, tuning = tuning_setting), 148 | transform = cross( 149 | tuning_setting = c("fast", "slow"), 150 | mean_value = !!(1:4), # Why `!!`? See "Tidy Evaluation" below. 151 | model_function = !!model_functions # Why `!!`? See "Tidy Evaluation" below. 152 | ) 153 | ), 154 | summary = target( 155 | summarize_model(analysis), 156 | transform = map(analysis) 157 | ), 158 | model_summary = target( 159 | dplyr::bind_rows(summary), 160 | transform = combine(summary, .by = model_function) # defined in "analysis" 161 | ) 162 | ) 163 | 164 | # Click and drag the nodes in the graph to improve the view. 165 | plot(plan) 166 | ``` 167 | 168 | ## Grouping variables 169 | 170 | A *grouping variable* contains iterated values for a single instance of `map()` or `cross()`. `mean_value` and `tuning_par` are grouping variables below. Notice how they are defined inside `cross()`. Grouping variables are not targets, and they must be declared inside static transformations. 
171 | 172 | ```{r} 173 | drake_plan( 174 | data = get_data(), 175 | model = target( 176 | fit_model(data, mean_value, tuning_par), 177 | transform = cross( 178 | mean_value = c(1, 2), 179 | tuning_par = c("fast", "slow") 180 | ) 181 | ) 182 | ) 183 | ``` 184 | 185 | Each model has its own `mean_value` and `tuning_par`. To see this correspondence, set `trace = TRUE`. 186 | 187 | ```{r} 188 | drake_plan( 189 | trace = TRUE, 190 | data = get_data(), 191 | model = target( 192 | fit_model(data, mean_value, tuning_par), 193 | transform = cross( 194 | mean_value = c(1, 2), 195 | tuning_par = c("fast", "slow") 196 | ) 197 | ) 198 | ) 199 | ``` 200 | 201 | If we summarize those models, each *summary* has its own `mean_value` and `tuning_par`. In other words, grouping variables have a natural nesting, and they propagate forward so we can use them in downstream targets. Notice how `mean_value` and `tuning_par` appear in `summarize_model()` and `combine()` below. 202 | 203 | ```{r} 204 | plan <- drake_plan( 205 | trace = TRUE, 206 | data = get_data(), 207 | model = target( 208 | fit_model(data, mean_value, tuning_par), 209 | transform = cross( 210 | mean_value = c(1, 2), 211 | tuning_par = c("fast", "slow") 212 | ) 213 | ), 214 | summary = target( 215 | # mean_value and tuning_par are old grouping variables from the models 216 | summarize_model(model, mean_value, tuning_par), 217 | transform = map(model) 218 | ), 219 | summary_by_tuning = target( 220 | dplyr::bind_rows(summary), 221 | # tuning_par is an old grouping variable from the models. 222 | transform = combine(summary, .by = tuning_par) 223 | ) 224 | ) 225 | 226 | plot(plan) 227 | ``` 228 | 229 | 230 | ### Limitations of grouping variables 231 | 232 | Each grouping variable should be defined only once. In the plan below, there are multiple conflicting definitions of `a1`, `a2`, and `a3` in the dependencies of `c1`, so `drake` does not know which definitions to use. 
233 | 234 | ```{r, error = TRUE} 235 | drake_plan( 236 | b1 = target(1, transform = map(a1 = 1, a2 = 1, .id = FALSE)), 237 | b2 = target(1, transform = map(a1 = 1, a3 = 1, .id = FALSE)), 238 | b3 = target(1, transform = map(a2 = 1, a3 = 1, .id = FALSE)), 239 | c1 = target(1, transform = map(a1, a2, a3, .id = FALSE)), 240 | trace = TRUE 241 | ) 242 | ``` 243 | 244 | Workarounds include `bind_plans()` (on separate sub-plans) and [dynamic branching](#dynamic). Always check your plans before you run them (`vis_drake_graph()` etc.). 245 | 246 | ## Tidy evaluation 247 | 248 | In earlier plans, we used the "bang-bang" operator `!!` from [tidy evaluation](https://tidyeval.tidyverse.org/), e.g. `model_function = !!model_functions` in `cross()`. But why? Why not just type `model_function = model_functions`? Consider the following incorrect plan. 249 | 250 | ```{r} 251 | model_functions <- rlang::syms(c("main", "altv")) 252 | 253 | plan <- drake_plan( 254 | data = get_data(), 255 | analysis = target( 256 | model_function(data, mean = mean_value, tuning = tuning_setting), 257 | transform = cross( 258 | tuning_setting = c("fast", "slow"), 259 | mean_value = 1:4, # without !! 260 | model_function = model_functions # without !! 261 | ) 262 | ) 263 | ) 264 | 265 | drake_plan_source(plan) 266 | ``` 267 | 268 | Because we omit `!!`, we create two problems: 269 | 270 | 1. The commands use `model_functions()` instead of the desired `main()` and `altv()`. 271 | 2. We are missing the targets with `mean = 2` and `mean = 3`. 272 | 273 | Why? To make static branching work properly, `drake` does not actually evaluate the arguments to `cross()`. It just uses the raw symbols and expressions. To force `drake` to use the *values* instead, we need `!!`.
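For intuition, `!!` plays the same role in ordinary tidy evaluation outside of `drake`: it forces a captured expression to contain the *value* of a variable rather than the variable's name. A minimal `rlang` sketch (not part of the plan above; `f()` is just a placeholder function name):

```{r, eval = FALSE}
library(rlang)
mean_value <- 1:4
quo(f(mean_value))   # Captures the symbol `mean_value` itself.
quo(f(!!mean_value)) # `!!` splices in the underlying value instead.
```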
274 | 275 | 276 | ```{r} 277 | model_functions <- rlang::syms(c("main", "altv")) 278 | 279 | plan <- drake_plan( 280 | data = get_data(), 281 | analysis = target( 282 | model_function(data, mean = mean_value, tuning = tuning_setting), 283 | transform = cross( 284 | tuning_setting = c("fast", "slow"), 285 | mean_value = !!(1:4), # with !! 286 | model_function = !!model_functions # with !! 287 | ) 288 | ) 289 | ) 290 | 291 | drake_plan_source(plan) 292 | ``` 293 | 294 | ## Static transformations 295 | 296 | There are four transformations in static branching: `map()`, `cross()`, `split()`, and `combine()`. They are not actual functions, just special language to supply to the `transform` argument of `target()` in `drake_plan()`. Each transformation is similar to a function from the [Tidyverse](https://www.tidyverse.org/). 297 | 298 | | `drake` | Tidyverse analogue | 299 | |-------------|-----------------------------| 300 | | `map()` | `pmap()` from `purrr` | 301 | | `cross()` | `crossing()` from `tidyr` | 302 | | `split()` | `group_map()` from `dplyr` | 303 | | `combine()` | `summarize()` from `dplyr` | 304 | 305 | ### `map()` 306 | 307 | `map()` creates a new target for each row in a grid. 308 | 309 | ```{r} 310 | drake_plan( 311 | x = target( 312 | simulate_data(center, scale), 313 | transform = map(center = c(2, 1, 0), scale = c(3, 2, 1)) 314 | ) 315 | ) 316 | ``` 317 | 318 | You can supply the grid directly with the `.data` argument. Note the use of `!!` below. (See the tidy evaluation section.) 
319 | 320 | ```{r} 321 | my_grid <- tibble( 322 | sim_function = c("rnorm", "rt", "rcauchy"), 323 | title = c("Normal", "Student t", "Cauchy") 324 | ) 325 | my_grid$sim_function <- rlang::syms(my_grid$sim_function) 326 | 327 | drake_plan( 328 | x = target( 329 | simulate_data(sim_function, title, center, scale), 330 | transform = map( 331 | center = c(2, 1, 0), 332 | scale = c(3, 2, 1), 333 | .data = !!my_grid, 334 | # In `.id`, you can select one or more grouping variables 335 | # for pretty target names. 336 | # Set to FALSE to use short numeric suffixes. 337 | .id = sim_function # Try `.id = c(sim_function, center)` yourself. 338 | ) 339 | ) 340 | ) 341 | ``` 342 | 343 | ### `cross()` 344 | 345 | `cross()` creates a new target for each combination of argument values. 346 | 347 | ```{r} 348 | drake_plan( 349 | x = target( 350 | simulate_data(nrow, ncol), 351 | transform = cross(nrow = c(1, 2, 3), ncol = c(4, 5)) 352 | ) 353 | ) 354 | ``` 355 | 356 | ### `split()` 357 | 358 | The `split()` transformation distributes a dataset as uniformly as possible across multiple targets. 359 | 360 | ```{r, split1} 361 | plan <- drake_plan( 362 | large_data = get_data(), 363 | slice_analysis = target( 364 | large_data %>% 365 | analyze(), 366 | transform = split(large_data, slices = 4) 367 | ), 368 | results = target( 369 | dplyr::bind_rows(slice_analysis), 370 | transform = combine(slice_analysis) 371 | ) 372 | ) 373 | 374 | plan 375 | ``` 376 | 377 | ```{r} 378 | plot(plan) 379 | ``` 380 | 381 | At runtime, `drake_slice()` takes a single subset of the data. It supports data frames, matrices, and arbitrary arrays. 382 | 383 | ```{r} 384 | drake_slice(mtcars, slices = 32, index = 1) 385 | 386 | drake_slice(mtcars, slices = 32, index = 2) 387 | ``` 388 | 389 | 390 | ### `combine()` 391 | 392 | `combine()` aggregates targets. The closest comparison is the unquote-splice operator `!!!` from tidy evaluation. 
393 | 394 | ```{r} 395 | plan <- drake_plan( 396 | data_group1 = target( 397 | sim_data(mean = x, sd = y), 398 | transform = map(x = c(1, 2), y = c(3, 4)) 399 | ), 400 | data_group2 = target( 401 | pull_data(url), 402 | transform = map(url = c("example1.com", "example2.com")) 403 | ), 404 | larger = target( 405 | bind_rows(data_group1, data_group2, .id = "id") %>% 406 | arrange(sd) %>% 407 | head(n = 400), 408 | transform = combine(data_group1, data_group2) 409 | ) 410 | ) 411 | 412 | drake_plan_source(plan) 413 | ``` 414 | 415 | To create multiple combined groups, use the `.by` argument. 416 | 417 | ```{r} 418 | plan <- drake_plan( 419 | data = target( 420 | sim_data(mean = x, sd = y, skew = z), 421 | transform = cross(x = c(1, 2), y = c(3, 4), z = c(5, 6)) 422 | ), 423 | combined = target( 424 | bind_rows(data, .id = "id") %>% 425 | arrange(sd) %>% 426 | head(n = 400), 427 | transform = combine(data, .by = c(x, y)) 428 | ) 429 | ) 430 | 431 | drake_plan_source(plan) 432 | ``` 433 | 434 | ## Target names 435 | 436 | `drake` releases after 7.12.0 let you define your own custom names with the optional `.names` argument of transformations. 437 | 438 | ```{r} 439 | analysis_names <- c("experimental", "thorough", "minimal", "naive") 440 | 441 | plan <- drake_plan( 442 | dataset = target( 443 | get_dataset(data_index), 444 | transform = map(data_index = !!seq_len(2), .names = c("new", "old")) 445 | ), 446 | analysis = target( 447 | apply_method(method_name, dataset), 448 | transform = cross( 449 | method_name = c("method1", "method2"), 450 | dataset, 451 | .names = !!analysis_names 452 | ) 453 | ), 454 | summary = target( 455 | summarize(analysis), 456 | transform = combine(analysis, .by = dataset, .names = c("table1", "table2")) 457 | ) 458 | ) 459 | 460 | plan 461 | 462 | plot(plan) 463 | ``` 464 | 465 | The disadvantage of `.names` is you need to know in advance the number of targets a transformation will generate. 
As an alternative, all transformations have an optional `.id` argument to control the names of targets. Use it to select the grouping variables that go into the names, as well as the order they appear in the suffixes. 466 | 467 | ```{r} 468 | drake_plan( 469 | data = target( 470 | get_data(param1, param2), 471 | transform = map( 472 | param1 = c(123, 456), 473 | param2 = c(7, 9), 474 | param3 = c("abc", "xyz"), 475 | .id = param2 476 | ) 477 | ) 478 | ) 479 | ``` 480 | 481 | ```{r} 482 | drake_plan( 483 | data = target( 484 | get_data(param1, param2), 485 | transform = map( 486 | param1 = c(123, 456), 487 | param2 = c(7, 9), 488 | param3 = c("abc", "xyz"), 489 | .id = c(param2, param1) 490 | ) 491 | ) 492 | ) 493 | ``` 494 | 495 | ```{r} 496 | drake_plan( 497 | data = target( 498 | get_data(param1, param2), 499 | transform = map( 500 | param1 = c(123, 456), 501 | param2 = c(7, 9), 502 | param3 = c("abc", "xyz"), 503 | .id = c(param1, param2) 504 | ) 505 | ) 506 | ) 507 | ``` 508 | 509 | Set `.id` to `FALSE` to ignore the grouping variables altogether. 510 | 511 | ```{r} 512 | drake_plan( 513 | data = target( 514 | get_data(param1, param2), 515 | transform = map( 516 | param1 = c(123, 456), 517 | param2 = c(7, 9), 518 | param3 = c("abc", "xyz"), 519 | .id = FALSE 520 | ) 521 | ) 522 | ) 523 | ``` 524 | 525 | Finally, `drake` supports a special `.id_chr` symbol in commands to let you refer to the name of the current target as a character string.
526 | 527 | ```{r} 528 | as_chr <- function(x) { 529 | deparse(substitute(x)) 530 | } 531 | plan <- drake_plan( 532 | data = target( 533 | get_data(param), 534 | transform = map(param = c(123, 456)) 535 | ), 536 | keras_model = target( 537 | save_model_hdf5(fit_model(data), file_out(!!sprintf("%s.h5", .id_chr))), 538 | transform = map(data, .id = param) 539 | ), 540 | result = target( 541 | predict(load_model_hdf5(file_in(!!sprintf("%s.h5", as_chr(keras_model))))), 542 | transform = map(keras_model, .id = param) 543 | ) 544 | ) 545 | 546 | plan 547 | ``` 548 | 549 | ```{r} 550 | drake_plan_source(plan) 551 | ``` 552 | 553 | ## Tags 554 | 555 | A tag is a custom grouping variable for a transformation. There are two kinds of tags: 556 | 557 | 1. In-tags, which contain the target name you start with, and 558 | 2. Out-tags, which contain the target names generated by the transformations. 559 | 560 | ```{r} 561 | drake_plan( 562 | x = target( 563 | command, 564 | transform = map(y = c(1, 2), .tag_in = from, .tag_out = c(to, out)) 565 | ), 566 | trace = TRUE 567 | ) 568 | ``` 569 | 570 | Subsequent transformations can use tags as grouping variables and add to existing tags. 
571 | 572 | ```{r} 573 | plan <- drake_plan( 574 | prep_work = do_prep_work(), 575 | local = target( 576 | get_local_data(n, prep_work), 577 | transform = map(n = c(1, 2), .tag_in = data_source, .tag_out = data) 578 | ), 579 | online = target( 580 | get_online_data(n, prep_work, port = "8080"), 581 | transform = map(n = c(1, 2), .tag_in = data_source, .tag_out = data) 582 | ), 583 | summary = target( 584 | summarize(bind_rows(data, .id = "data")), 585 | transform = combine(data, .by = data_source) 586 | ), 587 | munged = target( 588 | munge(bind_rows(data, .id = "data")), 589 | transform = combine(data, .by = n) 590 | ) 591 | ) 592 | 593 | drake_plan_source(plan) 594 | 595 | plot(plan) 596 | ``` 597 | -------------------------------------------------------------------------------- /storage.Rmd: -------------------------------------------------------------------------------- 1 | # Storage {#storage} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 17 | library(drake) 18 | library(tidyverse) 19 | ``` 20 | 21 | ## `drake`'s cache 22 | 23 | When you run `make()`, `drake` stores your targets in a hidden storage cache. 24 | 25 | ```{r} 26 | library(drake) 27 | load_mtcars_example() # from https://github.com/wlandau/drake-examples/tree/main/mtcars 28 | make(my_plan, verbose = 0L) 29 | ``` 30 | 31 | The default cache is a hidden `.drake` folder. 32 | 33 | ```{r, eval = FALSE} 34 | find_cache() 35 | ### [1] "/home/you/project/.drake" 36 | ``` 37 | 38 | `drake`'s `loadd()` and `readd()` functions load targets into memory. 
39 | 40 | ```{r} 41 | loadd(large) 42 | head(large) 43 | 44 | head(readd(small)) 45 | ``` 46 | 47 | 48 | ## Efficient target storage 49 | 50 | `drake` supports custom formats for large and specialized targets. For example, the `"fst"` format uses the [`fst`](https://github.com/fstpackage/fst) package to save data frames faster. Simply enclose the command and the format together with the `target()` function. 51 | 52 | ```{r, eval = FALSE} 53 | library(drake) 54 | n <- 1e8 # Each target is 1.6 GB in memory. 55 | plan <- drake_plan( 56 | data_fst = target( 57 | data.frame(x = runif(n), y = runif(n)), 58 | format = "fst" 59 | ), 60 | data_old = data.frame(x = runif(n), y = runif(n)) 61 | ) 62 | make(plan) 63 | #> target data_fst 64 | #> target data_old 65 | build_times(type = "build") 66 | #> # A tibble: 2 x 4 67 | #> target elapsed user system 68 | #> <chr> <Duration> <Duration> <Duration> 69 | #> 1 data_fst 13.93s 37.562s 7.954s 70 | #> 2 data_old 184s (~3.07 minutes) 177s (~2.95 minutes) 4.157s 71 | ``` 72 | 73 | For more details and a complete list of formats, see [the plans chapter](https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets). 74 | 75 | 76 | ## Why is my cache so big? 77 | 78 | ### Old targets 79 | 80 | By default, `drake` holds on to all your targets from all your runs of `make()`. Even if you run `clean()`, the data stays in the cache in case you need to recover it. 81 | 82 | ```{r} 83 | clean() 84 | 85 | make(my_plan, recover = TRUE) 86 | ``` 87 | 88 | If you really want to remove old historical values of targets, run `drake_gc()` or `drake_cache()$gc()`. 89 | 90 | ```{r} 91 | drake_gc() 92 | ``` 93 | 94 | `clean()` also has a `garbage_collection` argument for this purpose. Here is a slick way to remove historical targets and targets no longer in your plan. 95 | 96 | ```{r} 97 | clean(list = cached_unplanned(my_plan), garbage_collection = TRUE) 98 | ``` 99 | 100 | ### Garbage from interrupted builds 101 | 102 | If `make()` crashes or gets interrupted, old files can accumulate in `.drake/scratch/` and `.drake/drake/tmp/`.
As long as `make()` is no longer running, you can safely remove the files in those folders (but keep the folders themselves). 103 | 104 | ## Interfaces to the cache 105 | 106 | `drake` uses the [storr](https://github.com/richfitz/storr) package to create and modify caches. 107 | 108 | ```{r} 109 | library(storr) 110 | cache <- storr_rds(".drake") 111 | 112 | head(cache$list()) 113 | 114 | head(cache$get("small")) 115 | ``` 116 | 117 | `drake` has its own interface on top of [storr](https://github.com/richfitz/storr) to make it easier to work with the default `.drake/` cache. The `loadd()`, `readd()`, and `cached()` functions explore saved targets. 118 | 119 | ```{r} 120 | head(cached()) 121 | 122 | head(readd(small)) 123 | 124 | loadd(large) 125 | 126 | head(large) 127 | 128 | rm(large) # Does not remove `large` from the cache. 129 | ``` 130 | 131 | `new_cache()` creates caches and `drake_cache()` recovers existing ones. (`drake_cache()` is only supported in `drake` version 7.4.0 and above.) 132 | 133 | ```{r} 134 | cache <- drake_cache() 135 | cache$driver$path 136 | 137 | cache <- drake_cache(path = ".drake") # File path to drake's cache. 138 | cache$driver$path 139 | ``` 140 | 141 | You can supply your own cache to `make()` and friends (including specialized `storr` caches like [`storr_dbi()`](http://richfitz.github.io/storr/reference/storr_dbi.html)). 142 | 143 | ```{r} 144 | plan <- drake_plan(x = 1, y = sqrt(x)) 145 | make(plan, cache = cache) 146 | 147 | vis_drake_graph(plan, cache = cache) 148 | ``` 149 | 150 | Destroy caches to remove them from your file system.
151 | 152 | ```{r} 153 | cache$destroy() 154 | 155 | file.exists(".drake") 156 | ``` 157 | -------------------------------------------------------------------------------- /style.css: -------------------------------------------------------------------------------- 1 | p.caption { 2 | color: #777; 3 | margin-top: 10px; 4 | } 5 | p code { 6 | white-space: inherit; 7 | } 8 | pre { 9 | word-break: normal; 10 | word-wrap: normal; 11 | } 12 | pre code { 13 | white-space: inherit; 14 | } 15 | 16 | .book .book-summary ul.summary li span { 17 | 18 | cursor: not-allowed; 19 | opacity: 1; 20 | 21 | } 22 | 23 | .book .book-summary ul.summary li.active > a { 24 | color: #1f58a3; 25 | font-weight: bold; 26 | } 27 | 28 | .book .book-summary ul.summary li > a:hover { 29 | color: #1f58a3; 30 | font-weight: bold; 31 | } 32 | 33 | .book .book-body .page-wrapper .page-inner section.normal a { 34 | color: #1f58a3; 35 | font-weight: bold; 36 | } 37 | 38 | .summaryblock{ 39 | padding: 1em 1em 1em 4em; 40 | margin-bottom: 10px; 41 | color: #333E50; 42 | border-style: solid; 43 | border-width: 5px; 44 | border-color: #6faef5; 45 | } 46 | 47 | .summaryblock > a { 48 | color: #d8dfee; 49 | } 50 | 51 | .part > span { 52 | color: #5c677e; 53 | } 54 | 55 | h1, h2, h3, h4, h5, h6 { 56 | color: #5c677e; 57 | } 58 | 59 | body { 60 | color: #333E50; 61 | } 62 | 63 | .book .book-body .page-wrapper .page-inner section.normal code { 64 | background: #dfe3eb; 65 | } 66 | 67 | div.book-header.fixed[style] { 68 | background-color: #6faef5 !important; 69 | } 70 | 71 | .book .book-header .btn { 72 | color: #c6deff; 73 | } 74 | 75 | .book .book-header .btn:hover { 76 | position: relative; 77 | text-decoration: none; 78 | color: #fff; 79 | background: 0 0; 80 | transform:scale(1.3,1.3); 81 | } 82 | 83 | .book .book-header h1 a, .book .book-header h1 a:hover { 84 | color: #fff; 85 | } 86 | -------------------------------------------------------------------------------- /time.Rmd: 
-------------------------------------------------------------------------------- 1 | # Time: speed, time logging, prediction, and strategy {#time} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set( 6 | collapse = TRUE, 7 | comment = "#>", 8 | fig.width = 6, 9 | fig.height = 6, 10 | fig.align = "center" 11 | ) 12 | options( 13 | drake_make_menu = FALSE, 14 | drake_clean_menu = FALSE, 15 | warnPartialMatchArgs = FALSE, 16 | crayon.enabled = FALSE, 17 | readr.show_progress = FALSE, 18 | tidyverse.quiet = TRUE 19 | ) 20 | ``` 21 | 22 | ## Why is `drake` so slow? 23 | 24 | ### Help us find out! 25 | 26 | If you encounter slowness, please report it to the [`drake` issue tracker](https://github.com/ropensci/drake/issues) and we will do our best to speed up `drake` for your use case. Please include a [reproducible example](https://github.com/tidyverse/reprex) and tell us about your operating system and version of R. In addition, flame graphs from the [`proffer`](https://github.com/r-prof/proffer) package really help us identify bottlenecks. 27 | 28 | ### Too many targets? 29 | 30 | `make()` and friends tend to slow down if you have a huge number of targets. There are unavoidable overhead costs from storing every single target and checking if it is up to date, so please read [this advice on choosing good targets](https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets) and consider dividing your work into a manageably small number of meaningful targets. [Dynamic branching](#dynamic) can also help in many cases. 31 | 32 | ### Big data? 33 | 34 | `drake` saves the return value of each target to [on-disk storage](#storage). So in addition to [dividing your work into a smaller number of targets](https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets), [specialized storage formats](https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets) can help speed things up.
It may also be worth reflecting on how much data you really need to store. And if the cache is too big, the [storage chapter](#storage) has advice for downsizing it. 35 | 36 | ### Aggressive shortcuts 37 | 38 | If your plan *still* needs tens of thousands of targets, you can take aggressive shortcuts to make things run faster. 39 | 40 | ```{r, eval = FALSE} 41 | make( 42 | plan, 43 | verbose = 0L, # Console messages can add to runtime. 44 | log_progress = FALSE, # drake_progress() will be useless. 45 | log_build_times = FALSE, # build_times() will be useless. 46 | recoverable = FALSE, # make(recover = TRUE) cannot be used later. 47 | history = FALSE, # drake_history() cannot be used later. 48 | session_info = FALSE, # drake_get_session_info() cannot be used later. 49 | lock_envir = FALSE # See https://docs.ropensci.org/drake/reference/make.html#self-invalidation. 50 | ) 51 | ``` 52 | 53 | ## Time logging 54 | 55 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 56 | library(drake) 57 | ``` 58 | 59 | Thanks to [Jasper Clarkberg](https://github.com/dapperjapper), `drake` records how long it takes to build each target. For large projects that take hours or days to run, this feature becomes important for planning and execution. 60 | 61 | ```{r} 62 | library(drake) 63 | load_mtcars_example() # from https://github.com/wlandau/drake-examples/tree/main/mtcars 64 | make(my_plan) 65 | 66 | build_times(digits = 8) # From the cache. 67 | 68 | ## `dplyr`-style `tidyselect` commands 69 | build_times(starts_with("coef"), digits = 8) 70 | ``` 71 | 72 | ## Predict total runtime 73 | 74 | `drake` uses these times to predict the runtime of the next `make()`. At this moment, everything is up to date in the current example, so the next `make()` should ideally take no time at all (except for preprocessing overhead). 75 | 76 | ```{r} 77 | predict_runtime(my_plan) 78 | ``` 79 | 80 | Suppose we change a dependency to make some targets out of date.
Now, the next `make()` should take longer to run. 81 | 82 | ```{r} 83 | reg2 <- function(d){ 84 | d$x3 <- d$x ^ 3 85 | lm(y ~ x3, data = d) 86 | } 87 | 88 | predict_runtime(my_plan) 89 | ``` 90 | 91 | And what if you plan to delete the cache and build all the targets from scratch? 92 | 93 | ```{r} 94 | predict_runtime(my_plan, from_scratch = TRUE) 95 | ``` 96 | 97 | ## Strategize your high-performance computing 98 | 99 | Let's say you are scaling up your workflow. You just put bigger data and heavier computation in your custom code, and the next time you run `make()`, your targets will take much longer to build. In fact, you estimate that every target except for your R Markdown report will take two hours to complete. Let's write down these known times in seconds. 100 | 101 | ```{r} 102 | known_times <- rep(7200, nrow(my_plan)) 103 | names(known_times) <- my_plan$target 104 | known_times["report"] <- 5 105 | known_times 106 | ``` 107 | 108 | How many parallel jobs should you use in the next `make()`? The `predict_runtime()` function can help you decide. `predict_runtime(jobs_predict = n)` simulates persistent parallel workers and reports the estimated total runtime of `make(jobs = n)`. (See also `predict_workers()`.) 109 | 110 | ```{r} 111 | time <- c() 112 | for (jobs in 1:12){ 113 | time[jobs] <- predict_runtime( 114 | my_plan, 115 | jobs_predict = jobs, 116 | from_scratch = TRUE, 117 | known_times = known_times 118 | ) 119 | } 120 | library(ggplot2) 121 | ggplot(data.frame(time = time / 3600, jobs = ordered(1:12), group = 1)) + 122 | geom_line(aes(x = jobs, y = time, group = group)) + 123 | scale_y_continuous(breaks = 0:10 * 4, limits = c(0, 29)) + 124 | theme_gray(16) + 125 | xlab("jobs argument of make()") + 126 | ylab("Predicted runtime of make() (hours)") 127 | ``` 128 | 129 | We see serious potential speed gains up to 4 jobs, but beyond that point, we have to double the jobs to shave off another 2 hours.
Your choice of `jobs` for `make()` ultimately depends on the runtime you can tolerate and the computing resources at your disposal. 130 | 131 | A final note on predicting runtime: the output of `predict_runtime()` and `predict_workers()` also depends on the optional `workers` column of your `drake_plan()`. If you micromanage which workers are allowed to build which targets, you may minimize reads from disk, but you could also slow down your workflow if you are not careful. See the [high-performance computing guide](#hpc) for more. 132 | -------------------------------------------------------------------------------- /triggers.Rmd: -------------------------------------------------------------------------------- 1 | # Triggers: decision rules for building targets {#triggers} 2 | 3 | ```{r, message = FALSE, warning = FALSE, include = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, include = FALSE} 17 | library(drake) 18 | library(tidyverse) 19 | invisible(drake_example("main", overwrite = TRUE)) 20 | invisible(file.copy("main/raw_data.xlsx", ".", overwrite = TRUE)) 21 | invisible(file.copy("main/report.Rmd", ".", overwrite = TRUE)) 22 | ``` 23 | 24 | When you call `make()`, `drake` tries to skip as many targets as possible. If it thinks a command will return the same value as last time, it does not bother running it. In other words, `drake` is lazy, and laziness saves you time. 25 | 26 | ## What are triggers? 27 | 28 | To figure out whether it can skip a target, `drake` goes through an intricate checklist of **triggers**: 29 | 30 | 1. The **missing** trigger: Do we lack a return value from a previous `make()`?
Maybe you are building the target for the first time or you removed it from the cache with `clean()`. 31 | 2. The **command** trigger: Did the command in the `drake` plan change nontrivially since the last `make()`? Changes to spacing, formatting, and comments are ignored. 32 | 3. The **depend** trigger: Did any non-file dependencies change since the last `make()`? These could be: 33 | - Other targets. 34 | - Imported objects. 35 | - Imported functions. To track changes to a function, `drake` removes any code enclosed in `ignore()`, deparses the literal code so that whitespace is standardized and comments are removed, and then hashes the resulting string. In some cases, `drake` makes special adjustments for strange edge cases like [`Rcpp` functions with pointers](https://github.com/ropensci/drake/issues/806) and functions defined with `Vectorize()`. However, edge cases like [this one](https://github.com/ropensci-books/drake/issues/130#issue-522436582) are inevitable because of the flexibility of R. 36 | - Any dependencies of imported functions. 37 | - Any dependencies of dependencies of imported functions, and so on. 38 | 4. The **file** trigger: Did any file inputs or file outputs change since the last `make()`? These files are the ones explicitly declared in the command with `file_in()`, `knitr_in()`, and `file_out()`. 39 | 5. The **seed** trigger: for statistical reproducibility, `drake` assigns a unique seed to each target based on the target's name and the global `seed` argument to `make()`. If you change the target's pseudo-random number generator seed either with the `seed` argument or the custom `seed` column in the plan, this change will cause a rebuild if the `seed` trigger is turned on. 40 | 6. The **format** trigger: Did you add or change the target's storage format since the last build? Details: [special data formats for targets](https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets). 41 | 7. The **condition** trigger: an optional user-defined piece of code that evaluates to a `TRUE`/`FALSE` value. The target builds if the value is `TRUE`.
42 | 8. The **change** trigger: an optional user-defined piece of code that evaluates to any value (preferably small and quick to compute). The target builds if the value changed since the last `make()`. 43 | 44 | If *any* trigger detects something wrong or different with the target or its dependencies, the next `make()` will run the command and (re)build the target. 45 | 46 | ## Customization 47 | 48 | With the `trigger()` function, you can create your own customized checklist of triggers. Let's run a simple workflow with just the **missing** trigger. We deactivate the **command**, **depend**, and **file** triggers by setting the respective `command`, `depend`, and `file` arguments to `FALSE`. 49 | 50 | ```{r} 51 | plan <- drake_plan( 52 | psi_1 = (sqrt(5) + 1) / 2, 53 | psi_2 = (sqrt(5) - 1) / 2 54 | ) 55 | make(plan, trigger = trigger(command = FALSE, depend = FALSE, file = FALSE)) 56 | ``` 57 | 58 | Now, even if you wreck all the commands, nothing rebuilds. 59 | 60 | ```{r} 61 | plan <- drake_plan( 62 | psi_1 = (sqrt(5) + 1) / 2 + 9999999999999, 63 | psi_2 = (sqrt(5) - 1) / 2 - 9999999999999 64 | ) 65 | make(plan, trigger = trigger(command = FALSE, depend = FALSE, file = FALSE)) 66 | ``` 67 | 68 | You can also give different triggers to different targets. Triggers in the `drake` plan override the `trigger` argument to `make()`. Below, `psi_2` always builds, but `psi_1` only builds if it has never been built before. 69 | 70 | ```{r} 71 | plan <- drake_plan( 72 | psi_1 = (sqrt(5) + 1) / 2 + 9999999999999, 73 | psi_2 = target( 74 | command = (sqrt(5) - 1) / 2 - 9999999999999, 75 | trigger = trigger(condition = psi_1 > 0) 76 | ) 77 | ) 78 | plan 79 | make(plan, trigger = trigger(command = FALSE, depend = FALSE, file = FALSE)) 80 | make(plan, trigger = trigger(command = FALSE, depend = FALSE, file = FALSE)) 81 | ``` 82 | 83 | Interestingly, `psi_2` now depends on `psi_1`.
Because `psi_1` appears in the **condition** trigger of `psi_2`, it needs to be up to date before we attempt `psi_2`. However, since `psi_1` is not part of the command, changing it will not trip the other triggers such as **depend**. 84 | 85 | ```{r} 86 | vis_drake_graph(plan) 87 | ``` 88 | 89 | In the toy example below, `drake` reads from a file to decide whether to build `x`. Try it out. 90 | 91 | ```{r} 92 | plan <- drake_plan( 93 | x = target( 94 | 1 + 1, 95 | trigger = trigger(condition = readRDS(file_in("file.rds"))) 96 | ) 97 | ) 98 | saveRDS(TRUE, "file.rds") 99 | make(plan) 100 | make(plan) 101 | make(plan) 102 | saveRDS(FALSE, "file.rds") 103 | make(plan) 104 | make(plan) 105 | make(plan) 106 | ``` 107 | 108 | In a real project with remote data sources, you may want to use the **condition** trigger to limit your builds to times when enough bandwidth is available for a large download. For example, 109 | 110 | ```{r, eval = FALSE} 111 | drake_plan( 112 | x = target( 113 | command = download_large_dataset(), 114 | trigger = trigger(condition = is_enough_bandwidth()) 115 | ) 116 | ) 117 | ``` 118 | 119 | Since the **change** trigger can return any value, it is often easier to use than the **condition** trigger. 120 | 121 | ```{r} 122 | clean(destroy = TRUE) 123 | plan <- drake_plan( 124 | x = target( 125 | command = 1 + 1, 126 | trigger = trigger(change = sqrt(y)) 127 | ) 128 | ) 129 | y <- 1 130 | make(plan) 131 | make(plan) 132 | y <- 2 133 | make(plan) 134 | ``` 135 | 136 | In practice, you may want to use the **change** trigger to check a large remote dataset before downloading it.
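A lighter-weight variant of the same idea is to watch a small local file that a separate sync job updates, rather than querying the remote directly. This is only a sketch: `manifest.txt` is a hypothetical file, and `file.mtime()` is base R.

```{r, eval = FALSE}
drake_plan(
  x = target(
    command = download_large_dataset(),
    # Hypothetical setup: some other process touches manifest.txt
    # whenever the remote data changes, so the modification time
    # acts as a cheap fingerprint of the remote state.
    trigger = trigger(change = file.mtime("manifest.txt"))
  )
)
```

The example below combines both kinds of triggers in a single target.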
137 | 138 | ```{r, eval = FALSE} 139 | drake_plan( 140 | x = target( 141 | command = download_large_dataset(), 142 | trigger = trigger( 143 | condition = is_enough_bandwidth(), 144 | change = date_last_modified() 145 | ) 146 | ) 147 | ) 148 | ``` 149 | 150 | A word of caution: every non-`NULL` **change** trigger is always evaluated, and its value is carried around in memory throughout `make()`. So if you are not careful, heavy use of the **change** trigger could slow down your workflow and consume extra resources. The **change** trigger should return small values (and should ideally be quick to evaluate). To reduce memory consumption, you may want to return a fingerprint of your trigger value rather than the value itself. See the [`digest`](https://github.com/eddelbuettel/digest) package for more information on computing hashes/fingerprints. 151 | 152 | ```{r, eval = FALSE} 153 | library(digest) 154 | drake_plan( 155 | x = target( 156 | command = download_large_dataset(), 157 | trigger = trigger( 158 | change = digest(download_medium_dataset()) 159 | ) 160 | ) 161 | ) 162 | ``` 163 | 164 | ## Alternative trigger modes 165 | 166 | Sometimes, you may want to suppress a target without having to worry about turning off every single trigger. That is why the `trigger()` function has a `mode` argument, which controls the role of the **condition** trigger in the decision to build or skip a target. The available trigger modes are `"whitelist"` (default), `"blacklist"`, and `"condition"`. 167 | 168 | - `trigger(mode = "whitelist")`: we *rebuild* the target whenever `condition` evaluates to `TRUE`. Otherwise, we defer to the other triggers. This is the default behavior described above in this chapter. 169 | - `trigger(mode = "blacklist")`: we *skip* the target whenever `condition` evaluates to `FALSE`. Otherwise, we defer to the other triggers. 170 | - `trigger(mode = "condition")`: here, the `condition` trigger is the only decider, and we ignore all the other triggers. 
We *rebuild* the target whenever `condition` evaluates to `TRUE` and *skip* it whenever `condition` evaluates to `FALSE`. 171 | 172 | ## A more practical example 173 | 174 | See the ["packages" example](#packages) for a more practical demonstration of triggers and their usefulness. 175 | -------------------------------------------------------------------------------- /visuals.Rmd: -------------------------------------------------------------------------------- 1 | # Visualization with drake {#visuals} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE, include = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 6 | options( 7 | drake_make_menu = FALSE, 8 | drake_clean_menu = FALSE, 9 | warnPartialMatchArgs = FALSE, 10 | crayon.enabled = FALSE, 11 | readr.show_progress = FALSE, 12 | tidyverse.quiet = TRUE 13 | ) 14 | ``` 15 | 16 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 17 | library(drake) 18 | library(visNetwork) 19 | ``` 20 | 21 | Data analysis projects have complicated networks of dependencies, and `drake` can help you visualize them with `vis_drake_graph()`, `sankey_drake_graph()`, and `drake_ggraph()` (note the two g's). 22 | 23 | ## Plotting plans 24 | 25 | Unless you are using `drake` 7.7.0 or below, you can simply `plot()` the plan to show the targets and their dependency relationships. 26 | 27 | ```{r, eval = TRUE} 28 | library(drake) 29 | # from https://github.com/wlandau/drake-examples/tree/main/mtcars 30 | load_mtcars_example() 31 | 32 | my_plan 33 | 34 | plot(my_plan) 35 | ``` 36 | 37 | 38 | ### `vis_drake_graph()` 39 | 40 | Powered by [`visNetwork`](http://datastorm-open.github.io/visNetwork/). Colors represent target status, and shapes represent data type. These graphs are interactive, so you can click, drag, zoom, and pan to adjust the size and position. Double-click on nodes to contract neighborhoods into clusters or expand them back out again.
If you hover over a node, you will see text in a tooltip showing the first few lines of 41 | 42 | - The command of a target, or 43 | - The body of an imported function, or 44 | - The content of an imported text file. 45 | 46 | ```{r, eval = TRUE} 47 | vis_drake_graph(my_plan) 48 | ``` 49 | 50 | To save this interactive widget for later, just supply the name of an HTML file. 51 | 52 | ```{r, eval = FALSE} 53 | vis_drake_graph(my_plan, file = "graph.html") 54 | ``` 55 | 56 | To save a static image file, supply a file name that ends in `".png"`, `".pdf"`, `".jpeg"`, or `".jpg"`. 57 | 58 | ```{r, eval = FALSE} 59 | vis_drake_graph(my_plan, file = "graph.png") 60 | ``` 61 | 62 | ### `sankey_drake_graph()` 63 | 64 | These interactive [`networkD3`](https://github.com/christophergandrud/networkD3) [Sankey diagrams](https://en.wikipedia.org/wiki/Sankey_diagram) have more nuance: the height of each node is proportional to its number of connections. Nodes with many incoming connections tend to fall out of date more often, and nodes with many outgoing connections can invalidate bigger chunks of the downstream pipeline. 65 | 66 | ```{r, eval = TRUE} 67 | sankey_drake_graph(my_plan) 68 | ``` 69 | 70 | Saving the graphs is the same as before. 71 | 72 | ```{r, eval = FALSE} 73 | sankey_drake_graph(my_plan, file = "graph.html") # Interactive HTML widget 74 | sankey_drake_graph(my_plan, file = "graph.png") # Static image file 75 | ``` 76 | 77 | Unfortunately, a legend is [not yet available for Sankey diagrams](https://github.com/ropensci/drake/pull/467), but `drake` exposes a separate legend for the colors and shapes. 78 | 79 | ```{r, eval = TRUE} 80 | library(visNetwork) 81 | legend_nodes() 82 | visNetwork(nodes = legend_nodes()) 83 | ``` 84 | 85 | ### `drake_ggraph()` 86 | 87 | `drake_ggraph()` can handle larger workflows than the other graphing functions.
If your project has thousands of targets and `vis_drake_graph()`/`sankey_drake_graph()` does not render properly, consider `drake_ggraph()`. Powered by [`ggraph`](https://github.com/thomasp85/ggraph), `drake_ggraph()`s are static [`ggplot2`](https://github.com/tidyverse/ggplot2) objects, and you can save them with `ggsave()`. 88 | 89 | ```{r, eval = TRUE} 90 | drake_ggraph(my_plan) 91 | ``` 92 | 93 | ### `text_drake_graph()` 94 | 95 | If you are running R in a terminal without [X Window](https://en.wikipedia.org/wiki/X_Window_System) support, the usual visualizations will not show up interactively in your session. Instead, you can use `text_drake_graph()` to see a text display in your terminal window. Terminal colors are deactivated in this manual, but you will see color in your console. 96 | 97 | ```{r, eval = TRUE} 98 | # Use nchar = 0 or nchar = 1 for better results. 99 | # The color display is better in your own terminal. 100 | text_drake_graph(my_plan, nchar = 3) 101 | ``` 102 | 103 | 104 | ## Underlying graph data: node and edge data frames 105 | 106 | `drake_graph_info()` is used behind the scenes in `vis_drake_graph()`, `sankey_drake_graph()`, and `drake_ggraph()` to get the graph information ready for rendering. To save time, you can call `drake_graph_info()` to get these internals and then call `render_drake_graph()`, `render_sankey_drake_graph()`, or `render_drake_ggraph()`. 107 | 108 | ```{r, eval = TRUE} 109 | str(drake_graph_info(my_plan)) 110 | ``` 111 | 112 | ## Visualizing target status 113 | 114 | `drake`'s visuals tell you which targets are up to date and which are outdated. 115 | 116 | ```{r, eval = TRUE} 117 | make(my_plan, verbose = 0L) 118 | outdated(my_plan) 119 | 120 | sankey_drake_graph(my_plan) 121 | ``` 122 | 123 | When you change a dependency, some targets fall out of date (black nodes).
124 | 125 | ```{r, eval = TRUE} 126 | reg2 <- function(d){ 127 | d$x3 <- d$x ^ 3 128 | lm(y ~ x3, data = d) 129 | } 130 | sankey_drake_graph(my_plan) 131 | ``` 132 | 133 | ## Subgraphs 134 | 135 | Graphs can grow enormous for serious projects, so there are multiple ways to focus on a manageable subgraph. The most brute-force way is to just pick a manual `subset` of nodes. However, with the `subset` argument, the graphing functions can drop intermediate nodes and edges. 136 | 137 | ```{r, eval = TRUE} 138 | vis_drake_graph( 139 | my_plan, 140 | subset = c("regression2_small", "large") 141 | ) 142 | ``` 143 | 144 | The rest of the subgraph functionality preserves connectedness. Use `targets_only` to ignore the imports. 145 | 146 | ```{r, eval = TRUE} 147 | vis_drake_graph(my_plan, targets_only = TRUE) 148 | ``` 149 | 150 | Similarly, you can just show downstream nodes. 151 | 152 | ```{r, eval = TRUE} 153 | vis_drake_graph(my_plan, from = c("regression2_small", "regression2_large")) 154 | ``` 155 | 156 | Or upstream ones. 157 | 158 | ```{r, eval = TRUE} 159 | vis_drake_graph(my_plan, from = "small", mode = "in") 160 | ``` 161 | 162 | In fact, let us just take a small neighborhood around a target in both directions. For the graph below, the given `order` is 1, but all the custom `file_out()` output files of the neighborhood's targets appear as well. This ensures consistent behavior between `show_output_files = TRUE` and `show_output_files = FALSE` (more on that later). 163 | 164 | ```{r, eval = TRUE} 165 | vis_drake_graph(my_plan, from = "small", mode = "all", order = 1) 166 | ``` 167 | 168 | ## Control the `vis_drake_graph()` legend. 169 | 170 | Some arguments to `vis_drake_graph()` control the legend. 171 | 172 | ```{r, eval = TRUE} 173 | vis_drake_graph(my_plan, full_legend = TRUE, ncol_legend = 2) 174 | ``` 175 | 176 | To remove the legend altogether, set the `ncol_legend` argument to `0`.
177 | 178 | ```{r, eval = TRUE} 179 | vis_drake_graph(my_plan, ncol_legend = 0) 180 | ``` 181 | 182 | ## Clusters 183 | 184 | With the `group` and `clusters` arguments to the graphing functions, you can condense nodes into clusters. This is handy for workflows with lots of targets. Take the schools scenario from the [`drake` plan guide](#plans). Our plan was generated with `drake_plan(trace = TRUE)`, so it has wildcard columns that group nodes into natural clusters already. You can manually add such columns if you wish. 185 | 186 | ```{r, eval = TRUE} 187 | # Visit https://books.ropensci.org/drake/static.html 188 | # to learn about the syntax with target(transform = ...). 189 | plan <- drake_plan( 190 | school = target( 191 | get_school_data(id), 192 | transform = map(id = c(1, 2, 3)) 193 | ), 194 | credits = target( 195 | fun(school), 196 | transform = cross( 197 | school, 198 | fun = c(check_credit_hours, check_students, check_graduations) 199 | ) 200 | ), 201 | public_funds_school = target( 202 | command = check_public_funding(school), 203 | transform = map(school = c(school_1, school_2)) 204 | ), 205 | trace = TRUE 206 | ) 207 | plan 208 | ``` 209 | 210 | Ordinarily, the workflow graph gives a separate node to each individual import object or target. 211 | 212 | ```{r, echo = FALSE} 213 | check_credit_hours <- check_students <- check_graduations <- 214 | check_public_funding <- get_school_data <- function(){} 215 | ``` 216 | 217 | ```{r, eval = TRUE} 218 | vis_drake_graph(plan) 219 | ``` 220 | 221 | For large projects with hundreds of nodes, this can get quite cumbersome. But here, we can choose a wildcard column (or any other column in the plan, even custom columns) to condense nodes into natural clusters. For the `group` argument to the graphing functions, choose the name of a column in `plan` or a column you know will be in `drake_graph_info(my_plan)$nodes`. 
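If you are unsure which columns are available for `group`, you can inspect the node metadata directly. A quick sketch using the schools `plan` from this section:

```{r, eval = FALSE}
# Columns of the node data frame, including any
# wildcard trace columns from drake_plan(trace = TRUE).
colnames(drake_graph_info(plan)$nodes)
```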
Then for `clusters`, choose the values in your `group` column that correspond to nodes you want to bunch together. The new graph is not as cumbersome. 222 | 223 | ```{r, eval = TRUE} 224 | vis_drake_graph(plan, 225 | group = "school", 226 | clusters = c("school_1", "school_2", "school_3") 227 | ) 228 | ``` 229 | 230 | As previously mentioned, you can group on any column in `drake_graph_info(my_plan)$nodes`. Let's return to the `mtcars` project for demonstration. 231 | 232 | ```{r, eval = TRUE} 233 | vis_drake_graph(my_plan) 234 | ``` 235 | 236 | Let's condense all the imports into one node and all the up-to-date targets into another. That way, the outdated targets stand out. 237 | 238 | ```{r, eval = TRUE} 239 | vis_drake_graph( 240 | my_plan, 241 | group = "status", 242 | clusters = c("imported", "up to date") 243 | ) 244 | ``` 245 | 246 | 247 | ## Output files 248 | 249 | `drake` can reproducibly track multiple output files per target and show them in the graph. 250 | 251 | ```{r, eval = TRUE} 252 | plan <- drake_plan( 253 | target1 = { 254 | file.copy(file_in("in1.txt"), file_out("out1.txt")) 255 | file.copy(file_in("in2.txt"), file_out("out2.txt")) 256 | }, 257 | target2 = { 258 | file.copy(file_in("out1.txt"), file_out("out3.txt")) 259 | file.copy(file_in("out2.txt"), file_out("out4.txt")) 260 | } 261 | ) 262 | writeLines("in1", "in1.txt") 263 | writeLines("in2", "in2.txt") 264 | make(plan) 265 | 266 | writeLines("abcdefg", "out3.txt") 267 | vis_drake_graph(plan, targets_only = TRUE) 268 | ``` 269 | 270 | If your graph is too busy, you can hide the output files with `show_output_files = FALSE`. 271 | 272 | ```{r, eval = TRUE} 273 | vis_drake_graph(plan, show_output_files = FALSE, targets_only = TRUE) 274 | ``` 275 | 276 | 277 | ## Node Selection 278 | 279 | *(Supported in drake > 7.7.0 only)* 280 | 281 | First, we define our plan, adding a custom column named "link". 
282 | 283 | ```{r, eval = TRUE} 284 | mtcars_link <- 285 | "https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html" 286 | 287 | plan <- drake_plan( 288 | mtc = target( 289 | mtcars, 290 | link = !!mtcars_link 291 | ), 292 | mtc2 = target( 293 | mtc, 294 | link = !!mtcars_link 295 | ), 296 | mtc3 = target( 297 | modify_mtc2(mtc2, number), 298 | transform = map(number = !!c(1:3), .tag_in = cluster_id), 299 | link = !!mtcars_link 300 | ), 301 | trace = TRUE 302 | ) 303 | ``` 304 | 305 | ```{r, eval = TRUE} 306 | unique_stems <- unique(plan$cluster_id) 307 | ``` 308 | 309 | 310 | ### Perform the default action on select 311 | 312 | Supplying `vis_drake_graph(on_select = TRUE, on_select_col = "my_column")` 313 | treats the values in the column named `"my_column"` as hyperlinks. Click on a node in the graph to navigate to the corresponding link in your browser. 314 | 315 | ```{r, eval = TRUE} 316 | vis_drake_graph( 317 | plan, 318 | clusters = unique_stems, 319 | group = "cluster_id", 320 | on_select_col = "link", 321 | on_select = TRUE 322 | ) 323 | ``` 324 | 325 | ### Perform no action on select 326 | 327 | No action will be taken if any of the following are given to 328 | `vis_drake_graph()`: 329 | 330 | - `on_select = NULL`, 331 | - `on_select = FALSE`, 332 | - `on_select_col = NULL` 333 | 334 | This is the default behaviour. 335 | 336 | ```{r, eval = TRUE} 337 | vis_drake_graph( 338 | plan, 339 | clusters = unique_stems, 340 | group = "cluster_id", 341 | on_select_col = "link", 342 | on_select = NULL 343 | ) 344 | ``` 345 | 346 | 347 | ### Customize the onSelect event behaviour 348 | 349 | What if we instead wanted the browser to display an alert when a node is 350 | clicked?
351 | 352 | ```{r, eval = TRUE} 353 | alert_behaviour <- function(){ 354 | js <- " 355 | function(props) { 356 | alert('selected node with on_select_col: \\r\\n' + 357 | this.body.data.nodes.get(props.nodes[0]).on_select_col); 358 | }" 359 | } 360 | 361 | vis_drake_graph( 362 | plan, 363 | on_select_col = "link", 364 | on_select = alert_behaviour() 365 | ) 366 | ``` 367 | 368 | ## Enhanced interactivity 369 | 370 | For enhanced interactivity, including custom interactive target documentation, see the [`mandrake`](https://mstr3336.github.io/mandrake) R package. For a taste of the functionality, visit [this vignette page](https://mstr3336.github.io/mandrake/articles/Test_Usecase.html#graph) and click the `mtcars` node in the graph. 371 | 372 | -------------------------------------------------------------------------------- /walkthrough.Rmd: -------------------------------------------------------------------------------- 1 | # Walkthrough {#walkthrough} 2 | 3 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 4 | knitr::opts_knit$set(root.dir = fs::dir_create(tempfile())) 5 | knitr::opts_chunk$set( 6 | collapse = TRUE, 7 | comment = "#>", 8 | fig.width = 6, 9 | fig.height = 6, 10 | fig.align = "center" 11 | ) 12 | options( 13 | drake_make_menu = FALSE, 14 | drake_clean_menu = FALSE, 15 | warnPartialMatchArgs = FALSE, 16 | crayon.enabled = FALSE, 17 | readr.show_progress = FALSE, 18 | tidyverse.quiet = TRUE 19 | ) 20 | ``` 21 | 22 | ```{r, message = FALSE, warning = FALSE, echo = FALSE} 23 | library(drake) 24 | library(dplyr) 25 | library(ggplot2) 26 | invisible(drake_example("main", overwrite = TRUE)) 27 | invisible(file.copy("main/raw_data.xlsx", ".", overwrite = TRUE)) 28 | invisible(file.copy("main/report.Rmd", ".", overwrite = TRUE)) 29 | ``` 30 | 31 | A typical data analysis workflow is a sequence of data transformations. Raw data becomes tidy data, then turns into fitted models, summaries, and reports.
Other analyses are usually variations of this pattern, and `drake` can easily accommodate them. 32 | 33 | ## Set the stage. 34 | 35 | To set up a project, load your packages, 36 | 37 | ```{r} 38 | library(drake) 39 | library(dplyr) 40 | library(ggplot2) 41 | library(tidyr) 42 | ``` 43 | 44 | load your custom functions, 45 | 46 | ```{r} 47 | create_plot <- function(data) { 48 | ggplot(data) + 49 | geom_histogram(aes(x = Ozone)) + 50 | theme_gray(24) 51 | } 52 | ``` 53 | 54 | check any supporting files (optional), 55 | 56 | ```{r} 57 | ## Get the files with drake_example("main"). 58 | file.exists("raw_data.xlsx") 59 | file.exists("report.Rmd") 60 | ``` 61 | 62 | and plan what you are going to do. 63 | 64 | ```{r} 65 | plan <- drake_plan( 66 | raw_data = readxl::read_excel(file_in("raw_data.xlsx")), 67 | data = raw_data %>% 68 | mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))), 69 | hist = create_plot(data), 70 | fit = lm(Ozone ~ Wind + Temp, data), 71 | report = rmarkdown::render( 72 | knitr_in("report.Rmd"), 73 | output_file = file_out("report.html"), 74 | quiet = TRUE 75 | ) 76 | ) 77 | 78 | plan 79 | ``` 80 | 81 | Optionally, visualize your workflow to make sure you set it up correctly. The graph is interactive, so you can click, drag, hover, zoom, and explore. 82 | 83 | ```{r} 84 | vis_drake_graph(plan) 85 | ``` 86 | 87 | ## Make your results. 88 | 89 | So far, we have just been setting the stage. Use `make()` or [`r_make()`](https://books.ropensci.org/drake/projects.html#safer-interactivity) to do the real work. Targets are built in the correct order regardless of the row order of `plan`. 90 | 91 | ```{r} 92 | make(plan) # See also r_make(). 93 | ``` 94 | 95 | Except for output files like `report.html`, your output is stored in a hidden `.drake/` folder. Reading it back is easy. 96 | 97 | ```{r} 98 | readd(data) %>% # See also loadd(). 99 | head() 100 | ``` 101 | 102 | The graph shows everything up to date. 
103 | 104 | ```{r} 105 | vis_drake_graph(plan) # See also r_vis_drake_graph(). 106 | ``` 107 | 108 | ## Go back and fix things. 109 | 110 | You may look back on your work and see room for improvement, but it's all good! The whole point of `drake` is to help you go back and change things quickly and painlessly. For example, we forgot to give our histogram a bin width. 111 | 112 | ```{r} 113 | readd(hist) 114 | ``` 115 | 116 | So let's fix the plotting function. 117 | 118 | ```{r} 119 | create_plot <- function(data) { 120 | ggplot(data) + 121 | geom_histogram(aes(x = Ozone), binwidth = 10) + 122 | theme_gray(24) 123 | } 124 | ``` 125 | 126 | `drake` knows which results are affected. 127 | 128 | ```{r} 129 | vis_drake_graph(plan) # See also r_vis_drake_graph(). 130 | ``` 131 | 132 | The next `make()` just builds `hist` and `report`. No point in wasting time on the data or model. 133 | 134 | ```{r} 135 | make(plan) # See also r_make(). 136 | ``` 137 | 138 | ```{r} 139 | loadd(hist) 140 | hist 141 | ``` 142 | 143 | ## History and provenance 144 | 145 | As of version 7.5.2, `drake` tracks the history and provenance of your targets: 146 | what you built, when you built it, how you built it, the arguments you 147 | used in your function calls, and how to get the data back. 148 | 149 | ```{r} 150 | history <- drake_history(analyze = TRUE) 151 | history 152 | ``` 153 | 154 | Remarks: 155 | 156 | - The `quiet` column appears above because one of the `drake_plan()` commands has `rmarkdown::render(quiet = TRUE)`. 157 | - The `hash` column identifies all the previous versions of your targets. As long as `exists` is `TRUE`, you can recover old data. 158 | - Advanced: if you use `make(cache_log_file = TRUE)` and put the cache log file under version control, you can match the hashes from `drake_history()` with the `git` commit history of your code. 159 | 160 | Let's use the history to recover the oldest histogram.
161 | 162 | ```{r} 163 | hash <- history %>% 164 | filter(target == "hist") %>% 165 | pull(hash) %>% 166 | head(n = 1) 167 | cache <- drake_cache() 168 | cache$get_value(hash) 169 | ``` 170 | 171 | ## Reproducible data recovery and renaming 172 | 173 | Remember how we made that change to our histogram? What if we want to change it back? If we revert `create_plot()`, `make(plan, recover = TRUE)` restores the original plot. 174 | 175 | ```{r} 176 | create_plot <- function(data) { 177 | ggplot(data) + 178 | geom_histogram(aes(x = Ozone)) + 179 | theme_gray(24) 180 | } 181 | 182 | # The report still needs to run in order to restore report.html. 183 | make(plan, recover = TRUE) 184 | 185 | readd(hist) # old histogram 186 | ``` 187 | 188 | `drake`'s data recovery feature is another way to avoid rerunning commands. It is useful if: 189 | 190 | - You want to revert to your old code, maybe with `git reset`. 191 | - You accidentally `clean()`ed a target and you want to get it back. 192 | - You want to rename an expensive target. 193 | 194 | In version 7.5.2 and above, `make(recover = TRUE)` can salvage the values of old targets. Before building a target, `drake` checks if you have ever built something else with the same command, dependencies, seed, etc. that you have right now. If appropriate, `drake` assigns the old value to the new target instead of rerunning the command. 195 | 196 | Caveats: 197 | 198 | 1. This feature is still experimental. 199 | 2. Recovery may not be a good idea if your external dependencies have changed a lot over time (R version, package environment, etc.). 200 | 201 | ### Undoing `clean()` 202 | 203 | ```{r} 204 | # Is the data really gone? 205 | clean() # garbage_collection = FALSE 206 | 207 | # Nope! 208 | make(plan, recover = TRUE) # The report still builds since report.html is gone. 209 | 210 | # When was the raw data *really* first built?
211 | diagnose(raw_data)$date 212 | ``` 213 | 214 | ### Renaming 215 | 216 | You can use recovery to rename a target. The trick is to supply the random number generator seed that `drake` used with the old target name. Also, renaming a target unavoidably invalidates downstream targets. 217 | 218 | ```{r} 219 | # Get the old seed. 220 | old_seed <- diagnose(data)$seed 221 | 222 | # Now rename the data and supply the old seed. 223 | plan <- drake_plan( 224 | raw_data = readxl::read_excel(file_in("raw_data.xlsx")), 225 | 226 | # Previously just named "data". 227 | airquality_data = target( 228 | raw_data %>% 229 | mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE))), 230 | seed = !!old_seed 231 | ), 232 | 233 | # `airquality_data` will be recovered from `data`, 234 | # but `hist` and `fit` have changed commands, 235 | # so they will build from scratch. 236 | hist = create_plot(airquality_data), 237 | fit = lm(Ozone ~ Wind + Temp, airquality_data), 238 | report = rmarkdown::render( 239 | knitr_in("report.Rmd"), 240 | output_file = file_out("report.html"), 241 | quiet = TRUE 242 | ) 243 | ) 244 | 245 | make(plan, recover = TRUE) 246 | ``` 247 | 248 | ## Try the code yourself! 249 | 250 | Use `drake_example("main")` to download the [code files](#projects) for this example. 251 | 252 | ## Thanks 253 | 254 | Thanks to [Kirill Müller](https://github.com/krlmlr) for originally providing this example. 255 | --------------------------------------------------------------------------------