├── .Rbuildignore
├── .covrignore
├── .github
│   ├── .gitignore
│   ├── CODE_OF_CONDUCT.md
│   └── workflows
│       ├── R-CMD-check.yaml
│       ├── pkgdown.yaml
│       ├── pr-commands.yaml
│       └── test-coverage.yaml
├── .gitignore
├── DESCRIPTION
├── LICENSE
├── LICENSE.md
├── NAMESPACE
├── NEWS.md
├── R
│   ├── cluster-call.R
│   ├── cluster-utils.R
│   ├── cluster.R
│   ├── dplyr-dual.R
│   ├── dplyr-single.R
│   ├── multidplyr-package.R
│   ├── partydf.R
│   ├── utils.R
│   └── zzz.R
├── README.Rmd
├── README.md
├── _pkgdown.yml
├── codecov.yml
├── cran-comments.md
├── man
│   ├── cluster_call.Rd
│   ├── cluster_utils.Rd
│   ├── default_cluster.Rd
│   ├── multidplyr-package.Rd
│   ├── new_cluster.Rd
│   ├── partition.Rd
│   ├── party_df.Rd
│   └── reexports.Rd
├── multidplyr.Rproj
├── tests
│   ├── testthat.R
│   └── testthat
│       ├── _snaps
│       │   ├── cluster-call.md
│       │   ├── cluster.md
│       │   ├── dplyr-dual.md
│       │   └── partydf.md
│       ├── setup.R
│       ├── test-cluster-call.R
│       ├── test-cluster-utils.R
│       ├── test-cluster.R
│       ├── test-dplyr-dual.R
│       ├── test-dplyr-single.R
│       └── test-partydf.R
└── vignettes
    ├── .gitignore
    └── multidplyr.Rmd

/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^.*\.Rproj$
2 | ^\.Rproj\.user$
3 | ^\.travis\.yml$
4 | ^codecov\.yml$
5 | ^\.covrignore$
6 | ^cran-comments\.md$
7 | ^_pkgdown\.yml$
8 | ^docs$
9 | ^pkgdown$
10 | ^README\.Rmd$
11 | ^\.github$
12 | ^LICENSE\.md$
13 | ^CRAN-RELEASE$
14 | ^CRAN-SUBMISSION$
15 | 
--------------------------------------------------------------------------------
/.covrignore:
--------------------------------------------------------------------------------
1 | R/deprec-*.R
2 | R/compat-*.R
3 | 
--------------------------------------------------------------------------------
/.github/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 | 
--------------------------------------------------------------------------------
/.github/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant Code of Conduct
2 | 
3 | ## Our Pledge
4 | 
5 | We as members, contributors, and leaders pledge to make participation in our
6 | community a harassment-free experience for everyone, regardless of age, body
7 | size, visible or invisible disability, ethnicity, sex characteristics, gender
8 | identity and expression, level of experience, education, socio-economic status,
9 | nationality, personal appearance, race, caste, color, religion, or sexual
10 | identity and orientation.
11 | 
12 | We pledge to act and interact in ways that contribute to an open, welcoming,
13 | diverse, inclusive, and healthy community.
14 | 
15 | ## Our Standards
16 | 
17 | Examples of behavior that contributes to a positive environment for our
18 | community include:
19 | 
20 | * Demonstrating empathy and kindness toward other people
21 | * Being respectful of differing opinions, viewpoints, and experiences
22 | * Giving and gracefully accepting constructive feedback
23 | * Accepting responsibility and apologizing to those affected by our mistakes,
24 |   and learning from the experience
25 | * Focusing on what is best not just for us as individuals, but for the overall
26 |   community
27 | 
28 | Examples of unacceptable behavior include:
29 | 
30 | * The use of sexualized language or imagery, and sexual attention or advances of
31 |   any kind
32 | * Trolling, insulting or derogatory comments, and personal or political attacks
33 | * Public or private harassment
34 | * Publishing others' private information, such as a physical or email address,
35 |   without their explicit permission
36 | * Other conduct which could reasonably be considered inappropriate in a
37 |   professional setting
38 | 
39 | ## Enforcement Responsibilities
40 | 
41 | Community leaders are responsible for clarifying and enforcing our standards of
42 | acceptable behavior and will take appropriate and fair corrective action in
43 | response to any behavior that they deem inappropriate, threatening, offensive,
44 | or harmful.
45 | 
46 | Community leaders have the right and responsibility to remove, edit, or reject
47 | comments, commits, code, wiki edits, issues, and other contributions that are
48 | not aligned to this Code of Conduct, and will communicate reasons for moderation
49 | decisions when appropriate.
50 | 
51 | ## Scope
52 | 
53 | This Code of Conduct applies within all community spaces, and also applies when
54 | an individual is officially representing the community in public spaces.
55 | Examples of representing our community include using an official e-mail address,
56 | posting via an official social media account, or acting as an appointed
57 | representative at an online or offline event.
58 | 
59 | ## Enforcement
60 | 
61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
62 | reported to the community leaders responsible for enforcement at codeofconduct@posit.co.
63 | All complaints will be reviewed and investigated promptly and fairly.
64 | 
65 | All community leaders are obligated to respect the privacy and security of the
66 | reporter of any incident.
67 | 
68 | ## Enforcement Guidelines
69 | 
70 | Community leaders will follow these Community Impact Guidelines in determining
71 | the consequences for any action they deem in violation of this Code of Conduct:
72 | 
73 | ### 1. Correction
74 | 
75 | **Community Impact**: Use of inappropriate language or other behavior deemed
76 | unprofessional or unwelcome in the community.
77 | 
78 | **Consequence**: A private, written warning from community leaders, providing
79 | clarity around the nature of the violation and an explanation of why the
80 | behavior was inappropriate. A public apology may be requested.
81 | 
82 | ### 2. Warning
83 | 
84 | **Community Impact**: A violation through a single incident or series of
85 | actions.
86 | 
87 | **Consequence**: A warning with consequences for continued behavior. No
88 | interaction with the people involved, including unsolicited interaction with
89 | those enforcing the Code of Conduct, for a specified period of time. This
90 | includes avoiding interactions in community spaces as well as external channels
91 | like social media.
Violating these terms may lead to a temporary or permanent
92 | ban.
93 | 
94 | ### 3. Temporary Ban
95 | 
96 | **Community Impact**: A serious violation of community standards, including
97 | sustained inappropriate behavior.
98 | 
99 | **Consequence**: A temporary ban from any sort of interaction or public
100 | communication with the community for a specified period of time. No public or
101 | private interaction with the people involved, including unsolicited interaction
102 | with those enforcing the Code of Conduct, is allowed during this period.
103 | Violating these terms may lead to a permanent ban.
104 | 
105 | ### 4. Permanent Ban
106 | 
107 | **Community Impact**: Demonstrating a pattern of violation of community
108 | standards, including sustained inappropriate behavior, harassment of an
109 | individual, or aggression toward or disparagement of classes of individuals.
110 | 
111 | **Consequence**: A permanent ban from any sort of public interaction within the
112 | community.
113 | 
114 | ## Attribution
115 | 
116 | This Code of Conduct is adapted from the [Contributor Covenant][homepage],
117 | version 2.1, available at
118 | <https://www.contributor-covenant.org/version/2/1/code_of_conduct.html>.
119 | 
120 | Community Impact Guidelines were inspired by
121 | [Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/inclusion).
122 | 
123 | For answers to common questions about this code of conduct, see the FAQ at
124 | <https://www.contributor-covenant.org/faq>. Translations are available at <https://www.contributor-covenant.org/translations>.
125 | 
126 | [homepage]: https://www.contributor-covenant.org
127 | 
--------------------------------------------------------------------------------
/.github/workflows/R-CMD-check.yaml:
--------------------------------------------------------------------------------
1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
3 | #
4 | # NOTE: This workflow is overkill for most R packages and
5 | # check-standard.yaml is likely a better choice.
6 | # usethis::use_github_action("check-standard") will install it.
7 | on:
8 |   push:
9 |     branches: [main, master]
10 |   pull_request:
11 |     branches: [main, master]
12 | 
13 | name: R-CMD-check
14 | 
15 | permissions: read-all
16 | 
17 | jobs:
18 |   R-CMD-check:
19 |     runs-on: ${{ matrix.config.os }}
20 | 
21 |     name: ${{ matrix.config.os }} (${{ matrix.config.r }})
22 | 
23 |     strategy:
24 |       fail-fast: false
25 |       matrix:
26 |         config:
27 |           - {os: macos-latest, r: 'release'}
28 | 
29 |           - {os: windows-latest, r: 'release'}
30 |           # use 4.1 to check with rtools40's older compiler
31 |           - {os: windows-latest, r: '4.1'}
32 | 
33 |           - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'}
34 |           - {os: ubuntu-latest, r: 'release'}
35 |           - {os: ubuntu-latest, r: 'oldrel-1'}
36 |           - {os: ubuntu-latest, r: 'oldrel-2'}
37 |           - {os: ubuntu-latest, r: 'oldrel-3'}
38 |           - {os: ubuntu-latest, r: 'oldrel-4'}
39 | 
40 |     env:
41 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
42 |       R_KEEP_PKG_SOURCE: yes
43 | 
44 |     steps:
45 |       - uses: actions/checkout@v4
46 | 
47 |       - uses: r-lib/actions/setup-pandoc@v2
48 | 
49 |       - uses: r-lib/actions/setup-r@v2
50 |         with:
51 |           r-version: ${{ matrix.config.r }}
52 |           http-user-agent: ${{ matrix.config.http-user-agent }}
53 |           use-public-rspm: true
54 | 
55 |       - uses: r-lib/actions/setup-r-dependencies@v2
56 |         with:
57 |           extra-packages: any::rcmdcheck
58 |           needs: check
59 | 
60 |       - uses: r-lib/actions/check-r-package@v2
61 |         with:
62 |           upload-snapshots: true
63 |           build_args: 'c("--no-manual","--compact-vignettes=gs+qpdf")'
--------------------------------------------------------------------------------
/.github/workflows/pkgdown.yaml:
--------------------------------------------------------------------------------
1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
3 | on:
4 |   push:
5 |     branches: [main, master]
6 |   pull_request:
7 |     branches: [main, master]
8 |   release:
9 |     types: [published]
10 |   workflow_dispatch:
11 | 
12 | name: pkgdown
13 | 
14 | permissions: read-all
15 | 
16 | jobs:
17 |   pkgdown:
18 |     runs-on: ubuntu-latest
19 |     # Only restrict concurrency for non-PR jobs
20 |     concurrency:
21 |       group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }}
22 |     env:
23 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
24 |     permissions:
25 |       contents: write
26 |     steps:
27 |       - uses: actions/checkout@v4
28 | 
29 |       - uses: r-lib/actions/setup-pandoc@v2
30 | 
31 |       - uses: r-lib/actions/setup-r@v2
32 |         with:
33 |           use-public-rspm: true
34 | 
35 |       - uses: r-lib/actions/setup-r-dependencies@v2
36 |         with:
37 |           extra-packages: any::pkgdown, local::.
38 |           needs: website
39 | 
40 |       - name: Build site
41 |         run: pkgdown::build_site_github_pages(new_process = FALSE, install = FALSE)
42 |         shell: Rscript {0}
43 | 
44 |       - name: Deploy to GitHub pages 🚀
45 |         if: github.event_name != 'pull_request'
46 |         uses: JamesIves/github-pages-deploy-action@v4.5.0
47 |         with:
48 |           clean: false
49 |           branch: gh-pages
50 |           folder: docs
51 | 
--------------------------------------------------------------------------------
/.github/workflows/pr-commands.yaml:
--------------------------------------------------------------------------------
1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
3 | on:
4 |   issue_comment:
5 |     types: [created]
6 | 
7 | name: Commands
8 | 
9 | permissions: read-all
10 | 
11 | jobs:
12 |   document:
13 |     if: ${{ github.event.issue.pull_request && (github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'OWNER') && startsWith(github.event.comment.body, '/document') }}
14 |     name: document
15 |     runs-on: ubuntu-latest
16 |     env:
17 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
18 |     steps:
19 |       - uses: actions/checkout@v4
20 | 
21 |       - uses: r-lib/actions/pr-fetch@v2
22 |         with:
23 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
24 | 
25 |       - uses: r-lib/actions/setup-r@v2
26 |         with:
27 |           use-public-rspm: true
28 | 
29 |       - uses: r-lib/actions/setup-r-dependencies@v2
30 |         with:
31 |           extra-packages: any::roxygen2
32 |           needs: pr-document
33 | 
34 |       - name: Document
35 |         run: roxygen2::roxygenise()
36 |         shell: Rscript {0}
37 | 
38 |       - name: commit
39 |         run: |
40 |           git config --local user.name "$GITHUB_ACTOR"
41 |           git config --local user.email "$GITHUB_ACTOR@users.noreply.github.com"
42 |           git add man/\* NAMESPACE
43 |           git commit -m 'Document'
44 | 
45 |       - uses: r-lib/actions/pr-push@v2
46 |         with:
47 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
48 | 
49 |   style:
50 |     if: ${{ github.event.issue.pull_request && (github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'OWNER') && startsWith(github.event.comment.body, '/style') }}
51 |     name: style
52 |     runs-on: ubuntu-latest
53 |     env:
54 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
55 |     steps:
56 |       - uses: actions/checkout@v4
57 | 
58 |       - uses: r-lib/actions/pr-fetch@v2
59 |         with:
60 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
61 | 
62 |       - uses: r-lib/actions/setup-r@v2
63 | 
64 |       - name: Install dependencies
65 |         run: install.packages("styler")
66 |         shell: Rscript {0}
67 | 
68 |       - name: Style
69 |         run: styler::style_pkg()
70 |         shell: Rscript {0}
71 | 
72 |       - name: commit
73 |         run: |
74 |           git config --local user.name "$GITHUB_ACTOR"
75 |           git config --local user.email "$GITHUB_ACTOR@users.noreply.github.com"
76 |           git add \*.R
77 |           git commit -m 'Style'
78 | 
79 |       - uses: r-lib/actions/pr-push@v2
80 |         with:
81 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
82 | 
--------------------------------------------------------------------------------
/.github/workflows/test-coverage.yaml:
--------------------------------------------------------------------------------
1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
3 | on:
4 |   push:
5 |     branches: [main, master]
6 |   pull_request:
7 |     branches: [main, master]
8 | 
9 | name: test-coverage
10 | 
11 | permissions: read-all
12 | 
13 | jobs:
14 |   test-coverage:
15 |     runs-on: ubuntu-latest
16 |     env:
17 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
18 | 
19 |     steps:
20 |       - uses: actions/checkout@v4
21 | 
22 |       - uses: r-lib/actions/setup-r@v2
23 |         with:
24 |           use-public-rspm: true
25 | 
26 |       - uses: r-lib/actions/setup-r-dependencies@v2
27 |         with:
28 |           extra-packages: any::covr, any::xml2
29 |           needs: coverage
30 | 
31 |       - name: Test coverage
32 |         run: |
33 |           cov <- covr::package_coverage(
34 |             quiet = FALSE,
35 |             clean = FALSE,
36 |             install_path = file.path(normalizePath(Sys.getenv("RUNNER_TEMP"), winslash = "/"), "package")
37 |           )
38 |           covr::to_cobertura(cov)
39 |         shell: Rscript {0}
40 | 
41 |       - uses: codecov/codecov-action@v4
42 |         with:
43 |           fail_ci_if_error: ${{ github.event_name != 'pull_request' && true || false }}
44 |           file: ./cobertura.xml
45 |           plugin: noop
46 |           disable_search: true
47 |           token: ${{ secrets.CODECOV_TOKEN }}
48 | 
49 |       - name: Show testthat output
50 |         if: always()
51 |         run: |
52 |           ## --------------------------------------------------------------------
53 |           find '${{ runner.temp }}/package' -name 'testthat.Rout*' -exec cat '{}' \; || true
54 |         shell: bash
55 | 
56 |       - name: Upload test results
57 |         if: failure()
58 |         uses: actions/upload-artifact@v4
59 |         with:
60 |           name: coverage-test-failures
61 |           path: ${{ runner.temp }}/package
62 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | inst/doc
5 | docs/
6 | 
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: multidplyr
2 | Title: A Multi-Process 'dplyr' Backend
3 | Version: 0.1.3.9000
4 | Authors@R: c(
5 |     person("Hadley", "Wickham", , "hadley@posit.co", role = c("aut", "cre")),
6 |     person("Posit Software, PBC", role = c("cph", "fnd"))
7 |   )
8 | Description: Partition a data frame across multiple worker processes to
9 |     provide simple multicore parallelism.
10 | License: MIT + file LICENSE
11 | URL: https://multidplyr.tidyverse.org,
12 |     https://github.com/tidyverse/multidplyr
13 | BugReports: https://github.com/tidyverse/multidplyr/issues
14 | Depends: 
15 |     R (>= 3.6)
16 | Imports: 
17 |     callr (>= 3.5.1),
18 |     cli,
19 |     crayon,
20 |     dplyr (>= 1.0.0),
21 |     magrittr,
22 |     qs (>= 0.24.1),
23 |     R6,
24 |     rlang (>= 1.0.6),
25 |     tibble,
26 |     tidyselect,
27 |     vctrs (>= 0.3.6)
28 | Suggests: 
29 |     covr,
30 |     knitr,
31 |     lubridate,
32 |     mgcv,
33 |     nycflights13,
34 |     rmarkdown,
35 |     testthat (>= 3.0.2),
36 |     vroom,
37 |     withr
38 | VignetteBuilder: 
39 |     knitr
40 | Config/Needs/website: tidyverse/tidytemplate
41 | Config/testthat/edition: 3
42 | Encoding: UTF-8
43 | RoxygenNote: 7.3.2
44 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | YEAR: 2023
2 | COPYRIGHT HOLDER: multidplyr authors
3 | 
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | # MIT License
2 | 
3 | Copyright (c) 2023 multidplyr authors
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 | 
3 | S3method("[",multidplyr_cluster)
4 | S3method(anti_join,multidplyr_party_df)
5 | S3method(arrange,multidplyr_party_df)
6 | S3method(as.data.frame,multidplyr_party_df)
7 | S3method(auto_copy,multidplyr_party_df)
8 | S3method(collect,multidplyr_party_df)
9 | S3method(dim,multidplyr_party_df)
10 | S3method(do,multidplyr_party_df)
11 | S3method(full_join,multidplyr_party_df)
12 | S3method(group_by,multidplyr_party_df)
13 | S3method(group_vars,multidplyr_party_df)
14 | S3method(head,multidplyr_party_df)
15 | S3method(inner_join,multidplyr_party_df)
16 | S3method(left_join,multidplyr_party_df)
17 | S3method(mutate,multidplyr_party_df)
18 | S3method(print,multidplyr_cluster)
19 | S3method(print,multidplyr_party_df)
20 | S3method(pull,multidplyr_party_df)
21 | S3method(rename,multidplyr_party_df)
22 | S3method(right_join,multidplyr_party_df)
23 | S3method(same_src,multidplyr_party_df)
24 | S3method(select,multidplyr_party_df)
25 | S3method(semi_join,multidplyr_party_df)
26 | S3method(slice,multidplyr_party_df)
27 | S3method(summarise,multidplyr_party_df)
28 | S3method(tbl_sum,multidplyr_party_df)
29 | S3method(tbl_vars,multidplyr_party_df)
30 | S3method(ungroup,multidplyr_party_df)
31 | S3method(union_all,multidplyr_party_df)
32 | export("%>%")
33 | export(cluster_assign)
34 | export(cluster_assign_each)
35 | export(cluster_assign_partition)
36 | export(cluster_call)
37 | export(cluster_copy)
38 | export(cluster_library)
39 | export(cluster_rm)
40 | export(cluster_send)
41 | export(default_cluster)
42 | export(new_cluster)
43 | export(partition)
44 | export(party_df)
45 | import(rlang)
46 | importFrom(R6,R6Class)
47 | importFrom(dplyr,anti_join)
48 | importFrom(dplyr,arrange)
49 | importFrom(dplyr,auto_copy)
50 | importFrom(dplyr,collect)
51 | importFrom(dplyr,do)
52 | importFrom(dplyr,filter)
53 | importFrom(dplyr,full_join)
54 | importFrom(dplyr,group_by)
55 | importFrom(dplyr,group_vars)
56 | importFrom(dplyr,inner_join)
57 | importFrom(dplyr,intersect)
58 | importFrom(dplyr,left_join)
59 | importFrom(dplyr,mutate)
60 | importFrom(dplyr,pull)
61 | importFrom(dplyr,rename)
62 | importFrom(dplyr,right_join)
63 | importFrom(dplyr,same_src)
64 | importFrom(dplyr,select)
65 | importFrom(dplyr,semi_join)
66 | importFrom(dplyr,setdiff)
67 | importFrom(dplyr,slice)
68 | importFrom(dplyr,summarise)
69 | importFrom(dplyr,tbl_vars)
70 | importFrom(dplyr,ungroup)
71 | importFrom(dplyr,union)
72 | importFrom(dplyr,union_all)
73 | importFrom(magrittr,"%>%")
74 | importFrom(tibble,tbl_sum)
75 | importFrom(tibble,tibble)
76 | importFrom(utils,head)
77 | 
--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
1 | # multidplyr (development version)
2 | 
3 | # multidplyr 0.1.3
4 | 
5 | * Fix for R CMD check failure.
6 | 
7 | # multidplyr 0.1.2
8 | 
9 | * `cluster_call()` gains a `simplify` argument - use this to request that
10 |   the result should be simplified (#136).
11 | 
12 | # multidplyr 0.1.1
13 | 
14 | * Fixed problems identified as part of working on dplyr 1.0.8.
15 | 
16 | # multidplyr 0.1.0
17 | 
18 | * Added a `NEWS.md` file to track changes to the package.
19 | 
--------------------------------------------------------------------------------
/R/cluster-call.R:
--------------------------------------------------------------------------------
1 | #' Call a function on each node of a cluster
2 | #'
3 | #' `cluster_call()` executes the code on each worker and returns the results;
4 | #' `cluster_send()` executes the code ignoring the result. Jobs are submitted
5 | #' to workers in parallel, and then we wait until they're complete.
6 | #'
7 | #' @param cluster A cluster.
8 | #' @param code An expression to execute on each worker.
9 | #' @param simplify Should the results be simplified from a list?
10 | #'   * `TRUE`: simplify or die trying.
11 | #'   * `NA`: simplify if possible.
12 | #'   * `FALSE`: never try to simplify, always leaving as a list.
13 | #'
14 | #'   `code` must return a vector of length one in order for simplification
15 | #'   to succeed.
16 | #' @param ptype If `simplify` is `TRUE`, use `ptype` to enforce the desired
17 | #'   output type.
18 | #' @export
19 | #' @return A list of results with one element for each worker in `cluster`.
20 | #' @examples
21 | #' cl <- default_cluster()
22 | #'
23 | #' # Run code on each worker and retrieve results
24 | #' cluster_call(cl, Sys.getpid())
25 | #' cluster_call(cl, runif(1))
26 | #'
27 | #' # use simplify = TRUE to get a vector back
28 | #' cluster_call(cl, runif(1), simplify = TRUE)
29 | #'
30 | #' # use cluster_send() to ignore results
31 | #' cluster_send(cl, x <- runif(1))
32 | #' cluster_call(cl, x, simplify = TRUE)
33 | cluster_call <- function(cluster, code, simplify = FALSE, ptype = NULL) {
34 |   stopifnot(is_cluster(cluster))
35 |   code <- enexpr(code)
36 |   to_rm <- attr(cluster, "cleaner")$reset()
37 | 
38 |   if (length(simplify) > 1 || !is.logical(simplify)) {
39 |     cli::cli_abort("{.arg simplify} must be `TRUE`, `FALSE`, or `NA`.")
40 |   }
41 |   if (!isTRUE(simplify) && !is.null(ptype)) {
42 |     # 0.1.2
43 |     warn("Must now set `simplify = TRUE` when supplying ptype")
44 |     simplify <- TRUE
45 |   }
46 | 
47 |   # nocov start
48 |   f <- function(code, to_rm) {
49 |     rm(list = to_rm, envir = globalenv())
50 |     eval(code, globalenv())
51 |   }
52 |   # nocov end
53 |   lapply(cluster, function(x) x$call(f, list(code = code, to_rm = to_rm)))
54 |   lapply(cluster, function(x) x$poll_process(-1))
55 | 
56 |   results <- lapply(cluster, function(x) x$read())
57 | 
58 |   errs <- lapply(results, "[[", "error")
59 |   failed <- !vapply(errs, is.null, logical(1))
60 |   if (any(failed)) {
61 |     worker <- which(failed)[[1]]
62 |     err <- errs[[worker]]$parent
63 |     message <- paste0("Remote computation failed in worker ", worker)
64 |     abort(message, parent = err)
65 |   }
66 | 
67 |   out <- lapply(results, "[[", "result")
68 |   if (!isFALSE(simplify)) {
69 |     out <- simplify_impl(out, strict = !is.na(simplify), ptype = ptype)
70 |   }
71 |   out
72 | }
73 | 
74 | #' @rdname cluster_call
75 | #' @export
76 | cluster_send <- function(cluster, code) {
77 |   stopifnot(is_cluster(cluster))
78 |   code <- call2("{", enexpr(code), NULL)
79 |   cluster_call(cluster, !!code)
80 | 
81 |   invisible(cluster)
82 | }
83 | 
84 | # TODO: replace with purrr::list_simplify() when purrr 1.0.0 is out
85 | simplify_impl <- function(x,
86 |                           strict = TRUE,
87 |                           ptype = NULL,
88 |                           error_arg = caller_arg(x),
89 |                           error_call = caller_env()) {
90 |   vctrs::vec_check_list(x, arg = error_arg, call = error_call)
91 | 
92 |   # Handle the cases where we definitely can't simplify
93 |   if (strict) {
94 |     vctrs::list_check_all_vectors(x, arg = error_arg, call = error_call)
95 |     size_one <- vctrs::list_sizes(x) == 1L
96 |     can_simplify <- all(size_one)
97 | 
98 |     if (!can_simplify) {
99 |       bad <- which(!size_one)[[1]]
100 |       cli::cli_abort(
101 |         c(
102 |           "All elements must be size 1.",
103 |           i = "`{error_arg}[[{bad}]]` is size {vec_size(x[[bad]])}."
104 |         ),
105 |         call = error_call
106 |       )
107 |     }
108 |   } else {
109 |     can_simplify <- vctrs::list_all_vectors(x) && all(vctrs::list_sizes(x) == 1L)
110 | 
111 |     if (!can_simplify) {
112 |       return(x)
113 |     }
114 |   }
115 | 
116 |   names <- vctrs::vec_names(x)
117 |   x <- vctrs::vec_set_names(x, NULL)
118 | 
119 |   out <- tryCatch(
120 |     vctrs::vec_c(!!!x, ptype = ptype),
121 |     vctrs_error_incompatible_type = function(err) {
122 |       if (strict || !is.null(ptype)) {
123 |         cnd_signal(err)
124 |       } else {
125 |         x
126 |       }
127 |     }
128 |   )
129 |   vctrs::vec_set_names(out, names)
130 | }
131 | 
--------------------------------------------------------------------------------
/R/cluster-utils.R:
--------------------------------------------------------------------------------
1 | #' Cluster utility functions
2 | #'
3 | #' These functions provide useful helpers for performing common operations.
4 | #' `cluster_assign()` assigns the same value on each worker;
5 | #' `cluster_assign_each()` assigns different values on each worker;
6 | #' `cluster_assign_partition()` partitions vectors so that each worker gets
7 | #' (approximately) the same number of pieces.
8 | #'
9 | #' @param cluster,.cluster Cluster to work on
10 | #' @param ... Name-value pairs
11 | #' @param packages Character vector of packages to load
12 | #' @name cluster_utils
13 | #' @return Functions that modify the worker environment invisibly return
14 | #'   `cluster` so calls can be piped together. The other functions return
15 | #'   lists with one element for each worker.
16 | #' @examples
17 | #' cl <- default_cluster()
18 | #' cluster_assign(cl, a = runif(1))
19 | #' cluster_call(cl, a)
20 | #'
21 | #' # Assign different values on each worker
22 | #' cluster_assign_each(cl, b = c(1, 10))
23 | #' cluster_call(cl, b)
24 | #'
25 | #' # Partition a vector so that each worker gets approximately the
26 | #' # same amount of it
27 | #' cluster_assign_partition(cl, c = 1:11)
28 | #' cluster_call(cl, c)
29 | #'
30 | #' # If you want to compute different values on each
31 | #' # worker, use `cluster_call()` directly:
32 | #' cluster_call(cl, d <- runif(1))
33 | #' cluster_call(cl, d)
34 | #'
35 | #' # cluster_copy() is a useful shortcut
36 | #' e <- 10
37 | #' cluster_copy(cl, "e")
38 | #'
39 | #' cluster_call(cl, ls())
40 | #' cluster_rm(cl, letters[1:5])
41 | #' cluster_call(cl, ls())
42 | #'
43 | #' # Use cluster_library() to load packages
44 | #' cluster_call(cl, search())
45 | #' cluster_library(cl, "magrittr")
46 | #' cluster_call(cl, search())
47 | NULL
48 | 
49 | #' @rdname cluster_utils
50 | #' @export
51 | cluster_assign <- function(.cluster, ...) {
52 |   stopifnot(is_cluster(.cluster))
53 |   values <- list2(...)
54 |   stopifnot(is_named(values))
55 | 
56 |   path <- tempfile()
57 |   on.exit(unlink(path))
58 | 
59 |   qs::qsave(values, path, preset = "fast", check_hash = FALSE, nthreads = 2)
60 |   cluster_send(.cluster, list2env(qs::qread(!!path), globalenv()))
61 | 
62 |   invisible(.cluster)
63 | }
64 | 
65 | #' @rdname cluster_utils
66 | #' @export
67 | cluster_assign_each <- function(.cluster, ...) {
68 |   stopifnot(is_cluster(.cluster))
69 |   values <- tibble(..., .rows = length(.cluster))
70 | 
71 |   for (i in seq_len(nrow(values))) {
72 |     cluster_assign(.cluster[i], !!!lapply(values, "[[", i))
73 |   }
74 | 
75 |   invisible(.cluster)
76 | }
77 | 
78 | #' @export
79 | #' @rdname cluster_utils
80 | cluster_assign_partition <- function(.cluster, ...) {
81 |   stopifnot(is_cluster(.cluster))
82 |   values <- list(...)
83 | 
84 |   m <- length(.cluster)
85 |   values_split <- lapply(values, function(x) {
86 |     vctrs::vec_split(x, cut(vctrs::vec_seq_along(x), m, labels = FALSE))$val
87 |   })
88 | 
89 |   cluster_assign_each(.cluster, !!!values_split)
90 |   invisible(.cluster)
91 | }
92 | 
93 | #' @rdname cluster_utils
94 | #' @param names Name of variables to copy.
95 | #' @param env Environment in which to look for variables to copy.
96 | #' @export
97 | cluster_copy <- function(cluster, names, env = caller_env()) {
98 |   stopifnot(is_cluster(cluster))
99 |   stopifnot(is.character(names))
100 | 
101 |   cluster_assign(cluster, !!!env_get_list(env, names, inherit = TRUE))
102 | }
103 | 
104 | # Always succeeds: only gives a warning if an object is not found, and
105 | # warnings are not relayed back to the master
106 | #' @rdname cluster_utils
107 | #' @export
108 | cluster_rm <- function(cluster, names) {
109 |   stopifnot(is_cluster(cluster))
110 |   stopifnot(is.character(names))
111 | 
112 |   cluster_send(cluster, rm(list = !!names, envir = globalenv()))
113 |   invisible(cluster)
114 | }
115 | 
116 | #' @rdname cluster_utils
117 | #' @export
118 | cluster_library <- function(cluster, packages) {
119 |   lapply(packages, library, character.only = TRUE)
120 | 
121 |   cluster_send(cluster, lapply(!!packages, library, character.only = TRUE))
122 |   invisible(cluster)
123 | }
124 | 
--------------------------------------------------------------------------------
/R/cluster.R:
--------------------------------------------------------------------------------
1 | #' Create a new cluster with sensible defaults.
2 | #'
3 | #' Clusters created with this function will automatically clean up after
4 | #' themselves.
5 | #'
6 | #' @param n Number of workers to create. Avoid setting this higher than the
7 | #'   number of cores in your computer as it will degrade performance.
8 | #' @export
9 | #' @returns A `multidplyr_cluster` object.
10 | #' @examples
11 | #' cluster <- new_cluster(2)
12 | #' cluster
13 | new_cluster <- function(n) {
14 |   sessions <- replicate(n, callr::r_session$new(wait_timeout = 15 * 1000))
15 |   structure(
16 |     sessions,
17 |     cleaner = Cleaner$new(),
18 |     class = "multidplyr_cluster"
19 |   )
20 | }
21 | 
22 | #' @importFrom R6 R6Class
23 | Cleaner <- R6Class("Cleaner", list(
24 |   names = character(),
25 |   add = function(x) {
26 |     self$names <- union(self$names, x)
27 |     invisible(self)
28 |   },
29 |   reset = function(x) {
30 |     old <- self$names
31 |     self$names <- character()
32 |     old
33 |   }
34 | ))
35 | 
36 | #' @export
37 | print.multidplyr_cluster <- function(x, ...) {
38 |   n <- length(x)
39 | 
40 |   state <- vapply(x, function(x) x$get_state(), character(1))
41 |   state_abbr <- c(
42 |     "starting" = "S",
43 |     "idle" = crayon::green("."),
44 |     "busy" = crayon::red("*"),
45 |     "finished" = "F"
46 |   )[state]
47 | 
48 |   cat_line(n, " session cluster [", paste(state_abbr, collapse = ""), "]")
49 | }
50 | 
51 | is_cluster <- function(x) inherits(x, "multidplyr_cluster")
52 | 
53 | #' @export
54 | `[.multidplyr_cluster` <- function(x, i, ...) {
55 |   structure(NextMethod(),
56 |     cleaner = attr(x, "cleaner"),
57 |     class = "multidplyr_cluster"
58 |   )
59 | }
60 | 
61 | #' Default cluster
62 | #'
63 | #' Setting up a cluster is relatively expensive, so it's best to use a single
64 | #' cluster throughout a session. This function lazily creates a 2-worker
65 | #' cluster for use in examples and tests.
66 | #'
67 | #' @param n Number of workers to use; defaults to 2 because this is the maximum
68 | #'   allowed by CRAN.
69 | #' @keywords internal
70 | #' @returns A cached cluster of workers.
71 | #' @export
72 | #' @examples
73 | #' default_cluster()
74 | default_cluster <- function(n = 2) {
75 |   if (!env_has(cluster_env, "cluster")) {
76 |     message("Initialising default cluster of size ", n)
77 |     env_bind(cluster_env, cluster = new_cluster(n))
78 |   } else {
79 |     if (!missing(n)) {
80 |       abort("Can not supply `n` when cluster has already been initiated")
81 |     }
82 |   }
83 | 
84 |   env_get(cluster_env, "cluster")
85 | }
86 | 
87 | cluster_env <- env()
88 | 
89 | 
--------------------------------------------------------------------------------
/R/dplyr-dual.R:
--------------------------------------------------------------------------------
1 | #' @importFrom dplyr same_src
2 | #' @export
3 | same_src.multidplyr_party_df <- function(x, y) {
4 |   is_party_df(y) && identical(x$cluster, y$cluster)
5 | }
6 | 
7 | #' @importFrom dplyr auto_copy
8 | #' @export
9 | auto_copy.multidplyr_party_df <- function(x, y, copy = FALSE, ...) {
10 |   name <- table_name()
11 |   cluster_assign(x$cluster, !!name := y)
12 |   party_df(x$cluster, name)
13 | }
14 | 
15 | # joins -------------------------------------------------------------------
16 | 
17 | #' @importFrom dplyr left_join
18 | #' @export
19 | left_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE, suffix = c(".x", ".y")) {
20 |   y <- auto_copy(x, y, copy = copy)
21 |   shard_call_dual("left_join", x, y, ..., by = by, suffix = suffix)
22 | }
23 | 
24 | #' @importFrom dplyr right_join
25 | #' @export
26 | right_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE, suffix = c(".x", ".y")) {
27 |   y <- auto_copy(x, y, copy = copy)
28 |   shard_call_dual("right_join", x, y, ..., by = by, suffix = suffix)
29 | }
30 | 
31 | #' @importFrom dplyr inner_join
32 | #' @export
33 | inner_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE, suffix = c(".x", ".y")) {
34 |   y <- auto_copy(x, y, copy = copy)
35 |   shard_call_dual("inner_join", x, y, ..., by = by, suffix = suffix)
36 | }
37 | 
38 | #' @importFrom dplyr full_join
39 | #' @export
40 | full_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE, suffix = c(".x", ".y")) {
41 |   y <- auto_copy(x, y, copy = copy)
42 |   shard_call_dual("full_join", x, y, ..., by = by, suffix = suffix)
43 | }
44 | 
45 | #' @importFrom dplyr anti_join
46 | #' @export
47 | anti_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE) {
48 |   y <- auto_copy(x, y, copy = copy)
49 |   shard_call_dual("anti_join", x, y, ..., by = by)
50 | }
51 | 
52 | #' @importFrom dplyr semi_join
53 | #' @export
54 | semi_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE) {
55 |   y <- auto_copy(x, y, copy = copy)
56 |   shard_call_dual("semi_join", x, y, ..., by = by)
57 | }
58 | 
59 | # setops ------------------------------------------------------------------
60 | 
61 | #' @importFrom dplyr intersect
62 | # Exported on load
63 | intersect.multidplyr_party_df <- function(x, y, ..., copy = FALSE) {
64 |   y <- auto_copy(x, y, copy = copy)
65 |   shard_call_dual("intersect", x, y, ...)
66 | }
67 | 
68 | #' @importFrom dplyr union
69 | # Exported on load
70 | union.multidplyr_party_df <- function(x, y, ..., copy = FALSE) {
71 |   y <- auto_copy(x, y, copy = copy)
72 |   shard_call_dual("union", x, y, ...)
73 | }
74 | 
75 | #' @importFrom dplyr union_all
76 | #' @export
77 | union_all.multidplyr_party_df <- function(x, y, ..., copy = FALSE) {
78 |   y <- auto_copy(x, y, copy = copy)
79 |   shard_call_dual("union_all", x, y, ...)
80 | }
81 | 
82 | #' @importFrom dplyr setdiff
83 | # Exported on load
84 | setdiff.multidplyr_party_df <- function(x, y, ..., copy = FALSE) {
85 |   y <- auto_copy(x, y, copy = copy)
86 |   shard_call_dual("setdiff", x, y, ...)
87 | }
88 | 
89 | 
90 | # helpers -----------------------------------------------------------------
91 | 
92 | shard_call_dual <- function(.verb, .x, .y, ...) {
93 |   new_name <- table_name()
94 |   call <- call2(.verb, .x$name, .y$name, ..., .ns = "dplyr")
95 | 
96 |   cluster_send(.x$cluster, !!call2("<-", new_name, call))
97 |   new_party_df(.x$cluster, new_name, auto_rm = TRUE)
98 | }
99 | 
--------------------------------------------------------------------------------
/R/dplyr-single.R:
--------------------------------------------------------------------------------
1 | 
2 | #' @importFrom dplyr arrange
3 | #' @export
4 | arrange.multidplyr_party_df <- function(.data, ..., .by_group = FALSE) {
5 |   shard_call(.data, "arrange", enquos(...), .by_group = .by_group)
6 | }
7 | 
8 | #' @importFrom dplyr filter
9 | # exported on load
10 | filter.multidplyr_party_df <- function(.data, ...) {
11 |   shard_call(.data, "filter", enquos(...))
12 | }
13 | 
14 | #' @importFrom dplyr group_by
15 | #' @export
16 | group_by.multidplyr_party_df <- function(.data, ..., .add = FALSE) {
17 |   shard_call(.data, "group_by", enquos(...), .add = .add)
18 | }
19 | 
20 | #' @importFrom dplyr ungroup
21 | #' @export
22 | ungroup.multidplyr_party_df <- function(x, ...) {
23 |   shard_call(x, "ungroup", enquos(...))
24 | }
25 | 
26 | #' @importFrom dplyr mutate
27 | #' @export
28 | mutate.multidplyr_party_df <- function(.data, ...) {
29 |   shard_call(.data, "mutate", enquos(...))
30 | }
31 | 
32 | #' @importFrom dplyr rename
33 | #' @export
34 | rename.multidplyr_party_df <- function(.data, ...) {
35 |   shard_call(.data, "rename", enquos(...))
36 | }
37 | 
38 | #' @importFrom dplyr select
39 | #' @export
40 | select.multidplyr_party_df <- function(.data, ...) {
41 |   shard_call(.data, "select", enquos(...))
42 | }
43 | 
44 | #' @importFrom dplyr slice
45 | #' @export
46 | slice.multidplyr_party_df <- function(.data, ...) {
47 |   shard_call(.data, "slice", enquos(...))
48 | }
49 | 
50 | #' @importFrom dplyr summarise
51 | #' @export
52 | summarise.multidplyr_party_df <- function(.data, ...) {
53 |   shard_call(.data, "summarise", enquos(...))
54 | }
55 | 
56 | #' @importFrom dplyr do
57 | #' @export
58 | do.multidplyr_party_df <- function(.data, ...) {
59 |   shard_call(.data, "do", enquos(...))
60 | }
61 | 
62 | 
63 | # helpers -----------------------------------------------------------------
64 | 
65 | shard_call <- function(.data, .verb, dots, ...) {
66 |   call <- call2(.verb, .data$name, !!!dots, ..., .ns = "dplyr")
67 | 
68 |   new_name <- table_name()
69 |   cluster_send(.data$cluster, !!call2("<-", new_name, call))
70 |   new_party_df(.data$cluster, new_name, auto_rm = TRUE)
71 | }
72 | 
--------------------------------------------------------------------------------
/R/multidplyr-package.R:
--------------------------------------------------------------------------------
1 | #' @keywords internal
2 | #' @import rlang
3 | #' @importFrom tibble tibble
4 | "_PACKAGE"
5 | 
6 | # The following block is used by usethis to automatically manage
7 | # roxygen namespace tags. Modify with care!
8 | ## usethis namespace: start
9 | ## usethis namespace: end
10 | NULL
11 | 
--------------------------------------------------------------------------------
/R/partydf.R:
--------------------------------------------------------------------------------
1 | #' Partition data across workers in a cluster
2 | #'
3 | #' Partitioning ensures that all observations in a group end up on the same
4 | #' worker. To try and keep the observations on each worker balanced,
5 | #' `partition()` uses a greedy algorithm that iteratively assigns each group to
6 | #' the worker that currently has the fewest rows.
7 | #'
8 | #' @param data Dataset to partition, typically grouped. When grouped, all
9 | #'   observations in a group will be assigned to the same worker.
10 | #' @param cluster Cluster to use.
11 | #' @returns A [party_df].
12 | #' @export
13 | #' @examples
14 | #' library(dplyr)
15 | #' cl <- default_cluster()
16 | #' cluster_library(cl, "dplyr")
17 | #'
18 | #' mtcars2 <- partition(mtcars, cl)
19 | #' mtcars2 %>% mutate(cyl2 = 2 * cyl)
20 | #' mtcars2 %>% filter(vs == 1)
21 | #' mtcars2 %>% group_by(cyl) %>% summarise(n())
22 | #' mtcars2 %>% select(-cyl)
23 | partition <- function(data, cluster) {
24 |   worker_id <- worker_id(data, cluster)
25 |   worker_rows <- split(seq_along(worker_id), worker_id)
26 | 
27 |   if (length(worker_rows) < length(cluster)) {
28 |     message("Using partial cluster of size ", length(worker_rows))
29 |     cluster <- cluster[seq_along(worker_rows)]
30 |   }
31 |   shards <- lapply(worker_rows, function(i) data[i, , drop = FALSE])
32 | 
33 |   name <- table_name()
34 |   cluster_assign_each(cluster, !!name := shards)
35 |   new_party_df(cluster, name, auto_rm = TRUE)
36 | }
37 | 
38 | worker_id <- function(data, cluster) {
39 |   n <- nrow(data)
40 |   m <- length(cluster)
41 | 
42 |   if (!dplyr::is_grouped_df(data)) {
43 |     # Assign sequentially
44 |     (seq_len(n) - 1) %% m + 1
45 |   } else {
46 |     # Assign each new group to the session with fewest rows
47 |     group_id <- dplyr::group_indices(data)
48 |     counts <- tabulate(group_id)
49 | 
50 |     rows <- integer(m)
51 |     group_worker_id <- integer(length(counts))
52 | 
53 |     for (i in seq_along(counts)) {
54 |       j <- which.min(rows)
55 |       group_worker_id[[i]] <- j
56 |       rows[[j]] <- rows[[j]] + counts[[i]]
57 |     }
58 | 
59 |     group_worker_id[group_id]
60 |   }
61 | }
62 | 
63 | # Constructor ---------------------------------------------------------------
64 | 
65 | #' A `party_df` partitioned data frame
66 | #'
67 | #' This S3 class represents a data frame partitioned across workers in a
68 | #' cluster. You can use this constructor if you already have
69 | #' data frames spread across a cluster. If not, start with [partition()]
70 | #' instead.
71 | #'
72 | #' @export
73 | #' @param cluster A cluster
74 | #' @param name Name of data frame variable. Must exist on every worker,
75 | #'   be a data frame, and have the same names.
76 | #' @param auto_rm If `TRUE`, will automatically `rm()` the data frame on
77 | #'   the workers when this object is deleted.
78 | #' @return An S3 object with class `multidplyr_party_df`.
79 | #' @export
80 | #' @examples
81 | #' # In a real example, you might spread file names across the workers
82 | #' # and read in using data.table::fread()/vroom::vroom()/qs::qread().
83 | #' cl <- default_cluster()
84 | #' cluster_send(cl[1], n <- 10)
85 | #' cluster_send(cl[2], n <- 15)
86 | #' cluster_send(cl, df <- data.frame(x = runif(n)))
87 | #'
88 | #' df <- party_df(cl, "df")
89 | #' df
90 | party_df <- function(cluster, name, auto_rm = FALSE) {
91 |   stopifnot(is_cluster(cluster))
92 |   stopifnot(is_string(name))
93 | 
94 |   # Check that variable exists, is data frame, and has same names
95 |   exists <- unlist(cluster_call(cluster, exists(!!name)))
96 |   if (!all(exists)) {
97 |     abort(paste0("`", name, "` does not exist on all workers"))
98 |   }
99 | 
100 |   is_df <- unlist(cluster_call(cluster, is.data.frame(!!sym(name))))
101 |   if (!all(is_df)) {
102 |     abort(paste0("`", name, "` is not a data frame on all workers"))
103 |   }
104 | 
105 |   names <- cluster_call(cluster, names(!!sym(name)))
106 |   if (length(unique(names)) != 1) {
107 |     abort(paste0("`", name, "` does not have the same names on all workers"))
108 |   }
109 | 
110 |   new_party_df(cluster, name, auto_rm = auto_rm)
111 | }
112 | 
113 | new_party_df <- function(cluster, name, auto_rm) {
114 |   stopifnot(is_cluster(cluster))
115 |   stopifnot(is_string(name))
116 | 
117 |   structure(
118 |     list(
119 |       cluster = cluster,
120 |       name = sym(name),
121 |       .auto_clean = shard_deleter(auto_rm, name, cluster)
122 |     ),
123 |     class = "multidplyr_party_df"
124 |   )
125 | }
126 | 
127 | is_party_df <- function(x) inherits(x, "multidplyr_party_df")
128 | 
129 | shard_deleter <- function(auto_rm, name, cluster) {
130 |   if (!auto_rm) {
131 |     return(NULL)
132 |   }
133 | 
134 |   env <- new_environment()
135 |   reg.finalizer(env, function(...) attr(cluster, "cleaner")$add(name))
136 |   env
137 | }
138 | 
139 | shard_rows <- function(x) {
140 |   nrows <- cluster_call(x$cluster, nrow(!!x$name))
141 |   unlist(nrows)
142 | }
143 | 
144 | shard_cols <- function(x) {
145 |   cluster_call(x$cluster[1], ncol(!!x$name))[[1]]
146 | }
147 | 
148 | #' @importFrom tibble tbl_sum
149 | #' @export
150 | tbl_sum.multidplyr_party_df <- function(x) {
151 |   character()
152 | }
153 | 
154 | #' @importFrom dplyr tbl_vars
155 | #' @export
156 | tbl_vars.multidplyr_party_df <- function(x) {
157 |   cluster_call(x$cluster[1], dplyr::tbl_vars(!!x$name))[[1]]
158 | }
159 | 
160 | #' @importFrom dplyr group_vars
161 | #' @export
162 | group_vars.multidplyr_party_df <- function(x) {
163 |   cluster_call(x$cluster[1], dplyr::group_vars(!!x$name))[[1]]
164 | }
165 | 
166 | #' @export
167 | dim.multidplyr_party_df <- function(x) {
168 |   c(sum(shard_rows(x)), shard_cols(x))
169 | }
170 | 
171 | #' @importFrom utils head
172 | #' @export
173 | head.multidplyr_party_df <- function(x, n = 6L, ...) {
174 |   pieces <- vector("list", length(x$cluster))
175 |   left <- n
176 | 
177 |   # Work worker by worker until we have enough rows
178 |   for (i in seq_along(x$cluster)) {
179 |     head_i <- cluster_call(x$cluster[i], head(!!x$name, n = !!left))[[1]]
180 | 
181 |     pieces[[i]] <- head_i
182 |     left <- left - nrow(head_i)
183 |     if (left == 0)
184 |       break
185 |   }
186 | 
187 |   dplyr::bind_rows(pieces)
188 | }
189 | 
190 | #' @export
191 | print.multidplyr_party_df <- function(x, ..., n = NULL, width = NULL) {
192 |   cat("Source: party_df ", dplyr::dim_desc(x), "\n", sep = "")
193 | 
194 |   groups <- group_vars(x)
195 |   if (length(groups) > 0) {
196 |     cat("Groups: ", paste0(groups, collapse = ", "), "\n", sep = "")
197 |   }
198 | 
199 |   shards <- shard_rows(x)
200 |   cat("Shards: ", length(shards),
201 |     " [", big_mark(min(shards)), "--", big_mark(max(shards)), " rows]\n",
202 |     sep = "")
203 |   cat("\n")
204 |   print(tibble::trunc_mat(x, n = n %||% 6, width = width)) # honour a user-supplied `n`, defaulting to 6 rows
205 | 
206 |   invisible(x)
207 | }
208 | 
209 | #' @export
210 | as.data.frame.multidplyr_party_df <- function(x, row.names, optional, ...) {
211 |   dplyr::bind_rows(cluster_call(x$cluster, !!x$name))
212 | }
213 | 
214 | #' @importFrom dplyr collect
215 | #' @export
216 | collect.multidplyr_party_df <- function(x, ...) {
217 |   out <- as.data.frame(x)
218 |   dplyr::grouped_df(out, group_vars(x))
219 | }
220 | 
221 | #' @importFrom dplyr pull
222 | #' @export
223 | pull.multidplyr_party_df <- function(.data, var = -1, ...) {
224 |   var <- tidyselect::vars_pull(tbl_vars(.data), {{ var }})
225 | 
226 |   .data <- ungroup(.data)
227 |   .data <- select(.data, !!sym(var))
228 |   .data <- collect(.data)
229 |   .data[[1]]
230 | }
231 | 
--------------------------------------------------------------------------------
/R/utils.R:
--------------------------------------------------------------------------------
1 | #' @importFrom magrittr %>%
2 | #' @export
3 | magrittr::`%>%`
4 | 
5 | table_name <- local({
6 |   i <- 0
7 |   function() {
8 |     i <<- i + 1
9 |     paste0("_DF", i)
10 |   }
11 | })
12 | 
13 | big_mark <- function(x, ...) {
14 |   mark <- if (identical(getOption("OutDec"), ",")) "." else ","
15 |   formatC(x, big.mark = mark, ...)
16 | }
17 | 
18 | cat_line <- function(...) {
19 |   cat(paste(..., "\n", collapse = "", sep = ""))
20 | }
21 | 
22 | 
--------------------------------------------------------------------------------
/R/zzz.R:
--------------------------------------------------------------------------------
1 | # nocov start
2 | .onLoad <- function(...) {
3 |   register_s3_method("dplyr", "filter", "multidplyr_party_df")
4 |   register_s3_method("dplyr", "intersect", "multidplyr_party_df")
5 |   register_s3_method("dplyr", "union", "multidplyr_party_df")
6 |   register_s3_method("dplyr", "setdiff", "multidplyr_party_df")
7 | }
8 | 
9 | register_s3_method <- function(pkg, generic, class, fun = NULL) {
10 |   stopifnot(is.character(pkg), length(pkg) == 1)
11 |   stopifnot(is.character(generic), length(generic) == 1)
12 |   stopifnot(is.character(class), length(class) == 1)
13 | 
14 |   if (is.null(fun)) {
15 |     fun <- get(paste0(generic, ".", class), envir = parent.frame())
16 |   } else {
17 |     stopifnot(is.function(fun))
18 |   }
19 | 
20 |   if (pkg %in% loadedNamespaces()) {
21 |     registerS3method(generic, class, fun, envir = asNamespace(pkg))
22 |   }
23 | 
24 |   # Always register hook in case package is later unloaded & reloaded
25 |   setHook(
26 |     packageEvent(pkg, "onLoad"),
27 |     function(...) {
28 |       registerS3method(generic, class, fun, envir = asNamespace(pkg))
29 |     }
30 |   )
31 | }
32 | # nocov end
33 | 
--------------------------------------------------------------------------------
/README.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | output: github_document
3 | ---
4 | 
5 | <!-- README.md is generated from README.Rmd. Please edit that file -->
6 | 
7 | ```{r, include = FALSE}
8 | knitr::opts_chunk$set(
9 |   collapse = TRUE,
10 |   comment = "#>",
11 |   fig.path = "man/figures/README-",
12 |   out.width = "100%"
13 | )
14 | ```
15 | 
16 | # multidplyr
17 | 
18 | <!-- badges: start -->
19 | [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
20 | [![R-CMD-check](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml)
21 | [![Codecov test coverage](https://codecov.io/gh/tidyverse/multidplyr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/multidplyr?branch=main)
22 | [![CRAN status](https://www.r-pkg.org/badges/version/multidplyr)](https://cran.r-project.org/package=multidplyr)
23 | <!-- badges: end -->
24 | 
25 | ## Overview
26 | 
27 | multidplyr is a backend for dplyr that partitions a data frame across multiple cores. You tell multidplyr how to split the data up with `partition()` and then the data stays on each node until you explicitly retrieve it with `collect()`. This minimises the amount of time spent moving data around, and maximises parallel performance. This idea is inspired by [partools](https://github.com/matloff/partools) by Norm Matloff and [distributedR](https://github.com/vertica/DistributedR) by the Vertica Analytics team.
28 | 
29 | Due to the overhead associated with communicating between the nodes, you won't see much performance improvement with simple operations on less than ~10 million observations, and you may want to instead try [dtplyr](https://dtplyr.tidyverse.org/), which uses [data.table](https://R-datatable.com/). multidplyr's strength is found in parallelising calls to slower and more complex functions.
30 | 
31 | (Note that unlike other packages in the tidyverse, multidplyr requires R 3.6 or greater. We hope to relax this requirement [in the future](https://github.com/traversc/qs/issues/11).)
32 | 
33 | ## Installation
34 | 
35 | You can install the released version of multidplyr from [CRAN](https://CRAN.R-project.org) with:
36 | 
37 | ``` r
38 | install.packages("multidplyr")
39 | ```
40 | 
41 | And the development version from [GitHub](https://github.com/) with:
42 | 
43 | ``` r
44 | # install.packages("pak")
45 | pak::pak("tidyverse/multidplyr")
46 | ```
47 | 
48 | ## Usage
49 | 
50 | To use multidplyr, you first create a cluster of the desired number of workers. Each one of these workers is a separate R process, and the operating system will spread their execution across multiple cores:
51 | 
52 | ```{r setup}
53 | library(multidplyr)
54 | 
55 | cluster <- new_cluster(4)
56 | cluster_library(cluster, "dplyr")
57 | ```
58 | 
59 | There are two primary ways to use multidplyr.
The first, and most efficient, way is to read different files on each worker:
60 | 
61 | ```{r, eval = FALSE}
62 | # Create a filename vector containing different values on each worker
63 | cluster_assign_each(cluster, filename = c("a.csv", "b.csv", "c.csv", "d.csv"))
64 | 
65 | # Use vroom to quickly load the csvs
66 | cluster_send(cluster, my_data <- vroom::vroom(filename))
67 | 
68 | # Create a party_df using the my_data variable on each worker
69 | my_data <- party_df(cluster, "my_data")
70 | ```
71 | 
72 | Alternatively, if you already have the data loaded in the main session, you can use `partition()` to automatically spread it across the workers. Before calling `partition()`, it's a good idea to call `group_by()` to ensure that all of the observations belonging to a group end up on the same worker.
73 | 
74 | ```{r}
75 | library(nycflights13)
76 | 
77 | flight_dest <- flights %>% group_by(dest) %>% partition(cluster)
78 | flight_dest
79 | ```
80 | 
81 | Now you can work with it like a regular data frame, but the computations will be spread across multiple cores. Once you've finished computation, use `collect()` to bring the data back to the host session:
82 | 
83 | ```{r}
84 | flight_dest %>%
85 |   summarise(delay = mean(dep_delay, na.rm = TRUE), n = n()) %>%
86 |   collect()
87 | ```
88 | 
89 | Note that there is some overhead associated with copying data from the worker nodes back to the host node (and vice versa), so you're best off using multidplyr with more complex operations. See `vignette("multidplyr")` for more details.
90 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | <!-- README.md is generated from README.Rmd. Please edit that file -->
3 | 
4 | # multidplyr
5 | 
6 | <!-- badges: start -->
7 | 
8 | [![Lifecycle:
9 | experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
10 | [![R-CMD-check](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml)
11 | [![Codecov test
12 | coverage](https://codecov.io/gh/tidyverse/multidplyr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/multidplyr?branch=main)
13 | [![CRAN
14 | status](https://www.r-pkg.org/badges/version/multidplyr)](https://cran.r-project.org/package=multidplyr)
15 | <!-- badges: end -->
16 | 
17 | ## Overview
18 | 
19 | multidplyr is a backend for dplyr that partitions a data frame across
20 | multiple cores. You tell multidplyr how to split the data up with
21 | `partition()` and then the data stays on each node until you explicitly
22 | retrieve it with `collect()`. This minimises the amount of time spent
23 | moving data around, and maximises parallel performance. This idea is
24 | inspired by [partools](https://github.com/matloff/partools) by Norm
25 | Matloff and [distributedR](https://github.com/vertica/DistributedR) by
26 | the Vertica Analytics team.
27 | 
28 | Due to the overhead associated with communicating between the nodes, you
29 | won’t see much performance improvement with simple operations on less
30 | than ~10 million observations, and you may want to instead try
31 | [dtplyr](https://dtplyr.tidyverse.org/), which uses
32 | [data.table](https://R-datatable.com/). multidplyr’s strength is found in
33 | parallelising calls to slower and more complex functions.
34 | 
35 | (Note that unlike other packages in the tidyverse, multidplyr requires R
36 | 3.6 or greater.
We hope to relax this requirement [in the
37 | future](https://github.com/traversc/qs/issues/11).)
38 | 
39 | ## Installation
40 | 
41 | You can install the released version of multidplyr from
42 | [CRAN](https://CRAN.R-project.org) with:
43 | 
44 | ``` r
45 | install.packages("multidplyr")
46 | ```
47 | 
48 | And the development version from [GitHub](https://github.com/) with:
49 | 
50 | ``` r
51 | # install.packages("pak")
52 | pak::pak("tidyverse/multidplyr")
53 | ```
54 | 
55 | ## Usage
56 | 
57 | To use multidplyr, you first create a cluster of the desired number of
58 | workers. Each one of these workers is a separate R process, and the
59 | operating system will spread their execution across multiple cores:
60 | 
61 | ``` r
62 | library(multidplyr)
63 | 
64 | cluster <- new_cluster(4)
65 | cluster_library(cluster, "dplyr")
66 | #> 
67 | #> Attaching package: 'dplyr'
68 | #> The following objects are masked from 'package:stats':
69 | #> 
70 | #>     filter, lag
71 | #> The following objects are masked from 'package:base':
72 | #> 
73 | #>     intersect, setdiff, setequal, union
74 | ```
75 | 
76 | There are two primary ways to use multidplyr. The first, and most
77 | efficient, way is to read different files on each worker:
78 | 
79 | ``` r
80 | # Create a filename vector containing different values on each worker
81 | cluster_assign_each(cluster, filename = c("a.csv", "b.csv", "c.csv", "d.csv"))
82 | 
83 | # Use vroom to quickly load the csvs
84 | cluster_send(cluster, my_data <- vroom::vroom(filename))
85 | 
86 | # Create a party_df using the my_data variable on each worker
87 | my_data <- party_df(cluster, "my_data")
88 | ```
89 | 
90 | Alternatively, if you already have the data loaded in the main session,
91 | you can use `partition()` to automatically spread it across the workers.
92 | Before calling `partition()`, it’s a good idea to call `group_by()` to
93 | ensure that all of the observations belonging to a group end up on the
94 | same worker.
95 | 
96 | ``` r
97 | library(nycflights13)
98 | 
99 | flight_dest <- flights %>% group_by(dest) %>% partition(cluster)
100 | flight_dest
101 | #> Source: party_df [336,776 x 19]
102 | #> Groups: dest
103 | #> Shards: 4 [81,594--86,548 rows]
104 | #> 
105 | #> # A data frame: 336,776 × 19
106 | #>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
107 | #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
108 | #> 1  2013     1     1      544            545        -1     1004           1022
109 | #> 2  2013     1     1      558            600        -2      923            937
110 | #> 3  2013     1     1      559            600        -1      854            902
111 | #> 4  2013     1     1      602            610        -8      812            820
112 | #> 5  2013     1     1      602            605        -3      821            805
113 | #> 6  2013     1     1      611            600        11      945            931
114 | #> # ℹ 336,770 more rows
115 | #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
116 | #> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
117 | #> #   hour <dbl>, minute <dbl>, time_hour <dttm>
118 | ```
119 | 
120 | Now you can work with it like a regular data frame, but the computations
121 | will be spread across multiple cores.
Once you’ve finished computation, 122 | use `collect()` to bring the data back to the host session: 123 | 124 | ``` r 125 | flight_dest %>% 126 | summarise(delay = mean(dep_delay, na.rm = TRUE), n = n()) %>% 127 | collect() 128 | #> # A tibble: 105 × 3 129 | #> dest delay n 130 | #> 131 | #> 1 ABQ 13.7 254 132 | #> 2 AUS 13.0 2439 133 | #> 3 BQN 12.4 896 134 | #> 4 BTV 13.6 2589 135 | #> 5 BUF 13.4 4681 136 | #> 6 CLE 13.4 4573 137 | #> 7 CMH 12.2 3524 138 | #> 8 DEN 15.2 7266 139 | #> 9 DSM 26.2 569 140 | #> 10 DTW 11.8 9384 141 | #> # ℹ 95 more rows 142 | ``` 143 | 144 | Note that there is some overhead associated with copying data from the 145 | worker nodes back to the host node (and vice versa), so you’re best off 146 | using multidplyr with more complex operations. See 147 | `vignette("multidplyr")` for more details. 148 | -------------------------------------------------------------------------------- /_pkgdown.yml: -------------------------------------------------------------------------------- 1 | url: https://multidplyr.tidyverse.org 2 | 3 | template: 4 | package: tidytemplate 5 | bootstrap: 5 6 | 7 | includes: 8 | in_header: | 9 | 10 | 11 | development: 12 | mode: auto 13 | 14 | news: 15 | releases: 16 | - text: "Version 0.1.0" 17 | href: https://www.tidyverse.org/blog/2021/02/dplyr-backends/ 18 | -------------------------------------------------------------------------------- /codecov.yml: -------------------------------------------------------------------------------- 1 | comment: false 2 | 3 | coverage: 4 | status: 5 | project: 6 | default: 7 | target: auto 8 | threshold: 1% 9 | informational: true 10 | patch: 11 | default: 12 | target: auto 13 | threshold: 1% 14 | informational: true 15 | -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## R CMD check results 2 | 3 | 0 errors | 0 warnings | 0 note 4 | 5 | ## revdepcheck results 6 | 7 | This was a patch release to fix an R CMD check failure; I did not check revdeps. 8 | -------------------------------------------------------------------------------- /man/cluster_call.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/cluster-call.R 3 | \name{cluster_call} 4 | \alias{cluster_call} 5 | \alias{cluster_send} 6 | \title{Call a function on each node of a cluster} 7 | \usage{ 8 | cluster_call(cluster, code, simplify = FALSE, ptype = NULL) 9 | 10 | cluster_send(cluster, code) 11 | } 12 | \arguments{ 13 | \item{cluster}{A cluster.} 14 | 15 | \item{code}{An expression to execute on each worker.} 16 | 17 | \item{simplify}{Should the results be simplified from a list? 18 | * `TRUE`: simplify or die trying. 19 | * `NA`: simplify if possible. 20 | * `FALSE`: never try to simplify, always leaving as a list. 21 | 22 | `code` must return a vector of length one in order for simplification 23 | to succeed.} 24 | 25 | \item{ptype}{If `simplify` is `TRUE`, use `ptype` to enforce the desired 26 | output type.} 27 | } 28 | \value{ 29 | A list of results with one element for each worker in `cluster`. 30 | } 31 | \description{ 32 | `cluster_call()` executes the code on each worker and returns the results; 33 | `cluster_send()` executes the code ignoring the result. Jobs are submitted 34 | to workers in parallel, and then we wait until they're complete. 
35 | } 36 | \examples{ 37 | cl <- default_cluster() 38 | 39 | # Run code on each worker and retrieve results 40 | cluster_call(cl, Sys.getpid()) 41 | cluster_call(cl, runif(1)) 42 | 43 | # use simplify = TRUE to reduce the list to a vector 44 | cluster_call(cl, runif(1), simplify = TRUE) 45 | 46 | # use cluster_send() to ignore results 47 | cluster_send(cl, x <- runif(1)) 48 | cluster_call(cl, x, simplify = TRUE) 49 | } 50 | -------------------------------------------------------------------------------- /man/cluster_utils.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/cluster-utils.R 3 | \name{cluster_utils} 4 | \alias{cluster_utils} 5 | \alias{cluster_assign} 6 | \alias{cluster_assign_each} 7 | \alias{cluster_assign_partition} 8 | \alias{cluster_copy} 9 | \alias{cluster_rm} 10 | \alias{cluster_library} 11 | \title{Cluster utility functions} 12 | \usage{ 13 | cluster_assign(.cluster, ...) 14 | 15 | cluster_assign_each(.cluster, ...) 16 | 17 | cluster_assign_partition(.cluster, ...) 18 | 19 | cluster_copy(cluster, names, env = caller_env()) 20 | 21 | cluster_rm(cluster, names) 22 | 23 | cluster_library(cluster, packages) 24 | } 25 | \arguments{ 26 | \item{...}{Name-value pairs} 27 | 28 | \item{cluster, .cluster}{Cluster to work on} 29 | 30 | \item{names}{Names of variables to copy.} 31 | 32 | \item{env}{Environment in which to look for variables to copy.} 33 | 34 | \item{packages}{Character vector of packages to load.} 35 | } 36 | \value{ 37 | Functions that modify the worker environment invisibly return 38 | `cluster` so calls can be piped together. The other functions return 39 | lists with one element for each worker. 40 | } 41 | \description{ 42 | These functions provide useful helpers for performing common operations. 43 | `cluster_assign()` assigns the same value on each worker; 44 | `cluster_assign_each()` assigns different values on each worker; 45 | `cluster_assign_partition()` partitions vectors so that each worker gets 46 | (approximately) the same number of pieces.
47 | } 48 | \examples{ 49 | cl <- default_cluster() 50 | cluster_assign(cl, a = runif(1)) 51 | cluster_call(cl, a) 52 | 53 | # Assign different values on each worker 54 | cluster_assign_each(cl, b = c(1, 10)) 55 | cluster_call(cl, b) 56 | 57 | # Partition a vector so that each worker gets approximately the 58 | # same amount of it 59 | cluster_assign_partition(cl, c = 1:11) 60 | cluster_call(cl, c) 61 | 62 | # If you want to compute different values on each 63 | # worker, use `cluster_call()` directly: 64 | cluster_call(cl, d <- runif(1)) 65 | cluster_call(cl, d) 66 | 67 | # cluster_copy() is a useful shortcut 68 | e <- 10 69 | cluster_copy(cl, "e") 70 | 71 | cluster_call(cl, ls()) 72 | cluster_rm(cl, letters[1:5]) 73 | cluster_call(cl, ls()) 74 | 75 | # Use cluster_library() to load packages 76 | cluster_call(cl, search()) 77 | cluster_library(cl, "magrittr") 78 | cluster_call(cl, search()) 79 | } 80 | -------------------------------------------------------------------------------- /man/default_cluster.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/cluster.R 3 | \name{default_cluster} 4 | \alias{default_cluster} 5 | \title{Default cluster} 6 | \usage{ 7 | default_cluster(n = 2) 8 | } 9 | \arguments{ 10 | \item{n}{Number of workers to use; defaults to 2 because this is the maximum 11 | allowed by CRAN.} 12 | } 13 | \value{ 14 | A cached cluster of workers. 15 | } 16 | \description{ 17 | Setting up a cluster is relatively expensive, so it's best to use a single 18 | cluster throughout a session. This function lazily creates a 2-worker 19 | cluster for use in examples and tests. 20 | } 21 | \examples{ 22 | default_cluster() 23 | } 24 | \keyword{internal} 25 | -------------------------------------------------------------------------------- /man/multidplyr-package.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/multidplyr-package.R 3 | \docType{package} 4 | \name{multidplyr-package} 5 | \alias{multidplyr} 6 | \alias{multidplyr-package} 7 | \title{multidplyr: A Multi-Process 'dplyr' Backend} 8 | \description{ 9 | Partition a data frame across multiple worker processes to provide simple multicore parallelism. 10 | } 11 | \seealso{ 12 | Useful links: 13 | \itemize{ 14 | \item \url{https://multidplyr.tidyverse.org} 15 | \item \url{https://github.com/tidyverse/multidplyr} 16 | \item Report bugs at \url{https://github.com/tidyverse/multidplyr/issues} 17 | } 18 | 19 | } 20 | \author{ 21 | \strong{Maintainer}: Hadley Wickham \email{hadley@posit.co} 22 | 23 | Other contributors: 24 | \itemize{ 25 | \item Posit Software, PBC [copyright holder, funder] 26 | } 27 | 28 | } 29 | \keyword{internal} 30 | -------------------------------------------------------------------------------- /man/new_cluster.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/cluster.R 3 | \name{new_cluster} 4 | \alias{new_cluster} 5 | \title{Create a new cluster with sensible defaults.} 6 | \usage{ 7 | new_cluster(n) 8 | } 9 | \arguments{ 10 | \item{n}{Number of workers to create. Avoid setting this higher than the 11 | number of cores in your computer as it will degrade performance.} 12 | } 13 | \value{ 14
15 | } 16 | \description{ 17 | Clusters created with this function will automatically clean up after 18 | themselves. 19 | } 20 | \examples{ 21 | cluster <- new_cluster(2) 22 | cluster 23 | } 24 | -------------------------------------------------------------------------------- /man/partition.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/partydf.R 3 | \name{partition} 4 | \alias{partition} 5 | \title{Partition data across workers in a cluster} 6 | \usage{ 7 | partition(data, cluster) 8 | } 9 | \arguments{ 10 | \item{data}{Dataset to partition, typically grouped. When grouped, all 11 | observations in a group will be assigned to the same cluster.} 12 | 13 | \item{cluster}{Cluster to use.} 14 | } 15 | \value{ 16 | A [party_df]. 17 | } 18 | \description{ 19 | Partitioning ensures that all observations in a group end up on the same 20 | worker. To try and keep the observations on each worker balanced, 21 | `partition()` uses a greedy algorithm that iteratively assigns each group to 22 | the worker that currently has the fewest rows. 23 | } 24 | \examples{ 25 | library(dplyr) 26 | cl <- default_cluster() 27 | cluster_library(cl, "dplyr") 28 | 29 | mtcars2 <- partition(mtcars, cl) 30 | mtcars2 \%>\% mutate(cyl2 = 2 * cyl) 31 | mtcars2 \%>\% filter(vs == 1) 32 | mtcars2 \%>\% group_by(cyl) \%>\% summarise(n()) 33 | mtcars2 \%>\% select(-cyl) 34 | } 35 | -------------------------------------------------------------------------------- /man/party_df.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/partydf.R 3 | \name{party_df} 4 | \alias{party_df} 5 | \title{A `party_df` partitioned data frame} 6 | \usage{ 7 | party_df(cluster, name, auto_rm = FALSE) 8 | } 9 | \arguments{ 10 | \item{cluster}{A cluster} 11 | 12 | \item{name}{Name of data frame variable. Must exist on every worker, 13 | be a data frame, and have the same names.} 14 | 15 | \item{auto_rm}{If `TRUE`, will automatically `rm()` the data frame on 16 | the workers when this object is created.} 17 | } 18 | \value{ 19 | An S3 object with class `multidplyr_party_df`. 20 | } 21 | \description{ 22 | This S3 class represents a data frame partitioned across workers in a 23 | cluster. You can use this constructor if you have already spread 24 | data frames spread across a cluster. If not, start with [partition()] 25 | instead. 26 | } 27 | \examples{ 28 | # If a real example, you might spread file names across the clusters 29 | # and read in using data.table::fread()/vroom::vroom()/qs::qread(). 30 | cl <- default_cluster() 31 | cluster_send(cl[1], n <- 10) 32 | cluster_send(cl[2], n <- 15) 33 | cluster_send(cl, df <- data.frame(x = runif(n))) 34 | 35 | df <- party_df(cl, "df") 36 | df 37 | } 38 | -------------------------------------------------------------------------------- /man/reexports.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/utils.R 3 | \docType{import} 4 | \name{reexports} 5 | \alias{reexports} 6 | \alias{\%>\%} 7 | \title{Objects exported from other packages} 8 | \keyword{internal} 9 | \description{ 10 | These objects are imported from other packages. Follow the links 11 | below to see their documentation. 
12 | 13 | \describe{ 14 | \item{magrittr}{\code{\link[magrittr:pipe]{\%>\%}}} 15 | }} 16 | 17 | -------------------------------------------------------------------------------- /multidplyr.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: No 4 | SaveWorkspace: No 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: knitr 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | PackageRoxygenize: rd,collate,namespace 22 | -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | library(multidplyr) 3 | 4 | test_check("multidplyr") 5 | -------------------------------------------------------------------------------- /tests/testthat/_snaps/cluster-call.md: -------------------------------------------------------------------------------- 1 | # validates inputs 2 | 3 | Code 4 | cluster_call(cl, 1, simplify = "x") 5 | Condition 6 | Error in `cluster_call()`: 7 | ! `simplify` must be `TRUE`, `FALSE`, or `NA`. 8 | 9 | # old ptype interface works with warning 10 | 11 | Code 12 | out <- cluster_call(cl, 1, ptype = double()) 13 | Condition 14 | Warning: 15 | Must now set `simplify = TRUE` when supplying ptype 16 | 17 | # errors are propagated 18 | 19 | Code 20 | cluster_call(cl, stop("!!")) 21 | Condition 22 | Error in `cluster_call()`: 23 | ! Remote computation failed in worker 1 24 | Caused by error: 25 | ! !! 26 | 27 | # errors capture worker id 28 | 29 | Code 30 | cluster_call(cl, f(x)) 31 | Condition 32 | Error in `cluster_call()`: 33 | ! Remote computation failed in worker 2 34 | Caused by error: 35 | ! Computation failed 36 | 37 | -------------------------------------------------------------------------------- /tests/testthat/_snaps/cluster.md: -------------------------------------------------------------------------------- 1 | # default cluster only creates once 2 | 3 | Can not supply `n` when cluster has already been initiated 4 | 5 | # cluster has useful print method 6 | 7 | Code 8 | cl <- default_cluster() 9 | cl 10 | Output 11 | 2 session cluster [..] 12 | 13 | -------------------------------------------------------------------------------- /tests/testthat/_snaps/dplyr-dual.md: -------------------------------------------------------------------------------- 1 | # joining data frame requires explicit copy 2 | 3 | Code 4 | left_join(pf, df) 5 | Condition 6 | Error in `auto_copy()`: 7 | ! `x` and `y` must share the same src. 8 | i `x` is a object. 9 | i `y` is a object. 10 | i Set `copy = TRUE` if `y` can be copied to the same source as `x` (may be slow). 
11 | 12 | -------------------------------------------------------------------------------- /tests/testthat/_snaps/partydf.md: -------------------------------------------------------------------------------- 1 | # can construct and print partydf 2 | 3 | Code 4 | df 5 | Output 6 | Source: party_df [20 x 1] 7 | Shards: 2 [10--10 rows] 8 | 9 | # A data frame: 20 x 1 10 | y 11 | <int> 12 | 1 1 13 | 2 2 14 | 3 3 15 | 4 4 16 | 5 5 17 | 6 6 18 | # i 14 more rows 19 | 20 | --- 21 | 22 | Code 23 | group_by(df, y) 24 | Output 25 | Source: party_df [20 x 1] 26 | Groups: y 27 | Shards: 2 [10--10 rows] 28 | 29 | # A data frame: 20 x 1 30 | y 31 | <int> 32 | 1 1 33 | 2 2 34 | 3 3 35 | 4 4 36 | 5 5 37 | 6 6 38 | # i 14 more rows 39 | 40 | # name must be data frame with same names 41 | 42 | Code 43 | party_df(cl, "x") 44 | Condition 45 | Error in `party_df()`: 46 | ! `x` does not exist on all workers 47 | 48 | --- 49 | 50 | Code 51 | party_df(cl, "x") 52 | Condition 53 | Error in `party_df()`: 54 | ! `x` is not a data frame on all workers 55 | 56 | --- 57 | 58 | Code 59 | party_df(cl, "x") 60 | Condition 61 | Error in `party_df()`: 62 | ! `x` does not have the same names on all workers 63 | 64 | -------------------------------------------------------------------------------- /tests/testthat/setup.R: -------------------------------------------------------------------------------- 1 | # Ensure the default cluster is initiated silently 2 | suppressMessages(default_cluster()) 3 | -------------------------------------------------------------------------------- /tests/testthat/test-cluster-call.R: -------------------------------------------------------------------------------- 1 | test_that("calls submitted in parallel", { 2 | cl <- default_cluster() 3 | out <- cluster_call(cl, {x <- Sys.time(); Sys.sleep(0.2); x}) 4 | 5 | rng <- range(as.double(out)) 6 | expect_true(rng[2] - rng[1] < 0.2) 7 | }) 8 | 9 | test_that("calls submitted to each node", { 10 | cl <- default_cluster() 11 | pid <- cluster_call(cl, Sys.getpid()) 12 | expect_equal(length(unique(pid)), length(cl)) 13 | }) 14 | 15 | test_that("can simplify results", { 16 | cl <- default_cluster() 17 | out <- cluster_call(cl, 1, simplify = TRUE) 18 | expect_identical(out, c(1, 1)) 19 | }) 20 | 21 | test_that("validates inputs", { 22 | cl <- default_cluster() 23 | expect_snapshot(error = TRUE, { 24 | cluster_call(cl, 1, simplify = "x") 25 | }) 26 | }) 27 | 28 | test_that("old ptype interface works with warning", { 29 | cl <- default_cluster() 30 | expect_snapshot({ 31 | out <- cluster_call(cl, 1, ptype = double()) 32 | }) 33 | expect_identical(out, c(1, 1)) 34 | }) 35 | 36 | 37 | test_that("errors are propagated", { 38 | cl <- default_cluster() 39 | expect_snapshot(cluster_call(cl, stop("!!")), error = TRUE) 40 | }) 41 | 42 | test_that("errors capture worker id", { 43 | cl <- default_cluster() 44 | 45 | f <- function(x) { 46 | if (x == 2) { 47 | rlang::abort("Computation failed") 48 | } 49 | } 50 | cluster_assign(cl, f = f) 51 | cluster_assign_each(cl, x = seq_along(cl)) 52 | 53 | expect_snapshot(cluster_call(cl, f(x)), error = TRUE) 54 | }) 55 | 56 | 57 | test_that("cluster_send() returns cluster", { 58 | cl <- default_cluster() 59 | expect_equal(cluster_send(cl, 10), cl) 60 | }) 61 | -------------------------------------------------------------------------------- /tests/testthat/test-cluster-utils.R: -------------------------------------------------------------------------------- 1 | test_that("can set/get/remove/list", { 2 | cl <- default_cluster()[1] 3 | cluster_assign(cl, x = 1)
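# x should now exist on the worker, holding the assigned value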
4 | 5 | expect_equal(cluster_call(cl, "x" %in% ls()), list(TRUE)) 6 | expect_equal(cluster_call(cl, x), list(1)) 7 | 8 | cluster_rm(cl, "x") 9 | expect_equal(cluster_call(cl, "x" %in% ls()), list(FALSE)) 10 | }) 11 | 12 | test_that("can assign different values to different clusters", { 13 | cl <- default_cluster() 14 | 15 | cluster_assign_each(cl, x = 1:2) 16 | expect_equal(cluster_call(cl, x), list(1L, 2L)) 17 | cluster_rm(cl, "x") 18 | }) 19 | 20 | test_that("can partition vectors across clusters", { 21 | cl <- default_cluster() 22 | 23 | cluster_assign_partition(cl, x = 1:4, y = 1:5) 24 | expect_equal(cluster_call(cl, x), list(1:2, 3:4)) 25 | expect_equal(cluster_call(cl, y), list(1:3, 4:5)) 26 | 27 | cluster_rm(cl, c("x", "y")) 28 | }) 29 | 30 | test_that("can copy from objects to cluster", { 31 | cl <- default_cluster()[1] 32 | x <- 1 33 | y <- 2 34 | 35 | cluster_copy(cl, c("x", "y")) 36 | expect_equal(cluster_call(cl, x)[[1]], x) 37 | expect_equal(cluster_call(cl, y)[[1]], y) 38 | 39 | cluster_rm(cl, c("x", "y")) 40 | }) 41 | 42 | test_that("can load package", { 43 | cl <- new_cluster(1) 44 | 45 | cluster_library(cl, "covr") 46 | expect_true("package:covr" %in% cluster_call(cl, search())[[1]]) 47 | }) 48 | -------------------------------------------------------------------------------- /tests/testthat/test-cluster.R: -------------------------------------------------------------------------------- 1 | test_that("default cluster only creates once", { 2 | cl1 <- default_cluster() 3 | cl2 <- default_cluster() 4 | 5 | expect_identical(cl1, cl2) 6 | 7 | # and errors if you attempt to set n the second time 8 | expect_snapshot_error(default_cluster(10)) 9 | }) 10 | 11 | test_that("subsetting cluster returns cluster", { 12 | expect_s3_class(default_cluster()[1], "multidplyr_cluster") 13 | }) 14 | 15 | test_that("cluster has useful print method", { 16 | expect_snapshot({ 17 | cl <- default_cluster() 18 | cl 19 | }) 20 | }) 21 | -------------------------------------------------------------------------------- /tests/testthat/test-dplyr-dual.R: -------------------------------------------------------------------------------- 1 | test_that("joining data frame requires explicit copy", { 2 | pf <- partition(data.frame(x = 1:6), default_cluster()) 3 | df <- data.frame(x = 1:3, y = 3:1) 4 | 5 | expect_snapshot(error = TRUE, { 6 | left_join(pf, df) 7 | }) 8 | expect_error(left_join(pf, df, copy = TRUE), NA) 9 | }) 10 | 11 | test_that("joins match local results", { 12 | pf1 <- partition(tibble(x = c(1, 2)), default_cluster()) 13 | pf2 <- partition(tibble(x = c(1, 3)), default_cluster()) 14 | 15 | # primarily testing that basic infrastructure works and that 16 | # I haven't accidentally typed the wrong verb name somewhere 17 | expect_equal(pf1 %>% inner_join(pf2, by = "x") %>% pull(x), 1) 18 | expect_equal(pf1 %>% left_join(pf2, by = "x") %>% pull(x), c(1, 2)) 19 | expect_equal(pf1 %>% right_join(pf2, by = "x") %>% pull(x), c(1, 3)) 20 | expect_equal(pf1 %>% full_join(pf2, by = "x") %>% pull(x), c(1, 2, 3)) 21 | 22 | expect_equal(pf1 %>% semi_join(pf2, by = "x") %>% pull(x), 1) 23 | expect_equal(pf1 %>% anti_join(pf2, by = "x") %>% pull(x), 2) 24 | }) 25 | 26 | test_that("set operations match local results", { 27 | pf1 <- partition(tibble(x = c(1, 2)), default_cluster()) 28 | pf2 <- partition(tibble(x = c(1, 3)), default_cluster()) 29 | 30 | expect_equal(pf1 %>% intersect(pf2) %>% pull(), 1) 31 | expect_setequal(pf1 %>% union(pf2) %>% pull(), c(1, 2, 3)) 32 | expect_equal(pf1 %>% union_all(pf2) %>% 
pull() %>% sort(), c(1, 1, 2, 3)) 33 | expect_equal(pf1 %>% setdiff(pf2) %>% pull(), 2) 34 | }) 35 | -------------------------------------------------------------------------------- /tests/testthat/test-dplyr-single.R: -------------------------------------------------------------------------------- 1 | test_that("pass through works", { 2 | cl <- default_cluster() 3 | cluster_assign_each(cl, x = list( 4 | tibble(g = 1, x = 1:2), 5 | tibble(g = 2, x = 3:4) 6 | )) 7 | on.exit(cluster_rm(cl, "x")) 8 | pf <- party_df(cl, "x") 9 | 10 | expect_equal(pf %>% arrange(desc(x)) %>% collect() %>% pull(x), c(2, 1, 4, 3)) 11 | expect_equal(pf %>% filter(x %% 2 == 0) %>% collect() %>% pull(x), c(2, 4)) 12 | expect_equal(pf %>% mutate(y = x + 1) %>% collect() %>% pull(y), 2:5) 13 | expect_equal(pf %>% rename(z = x) %>% collect() %>% pull(z), 1:4) 14 | expect_equal(pf %>% slice(1) %>% collect() %>% pull(x), c(1, 3)) 15 | 16 | expect_equal(pf %>% select(x) %>% collect(), tibble(x = 1:4)) 17 | 18 | expect_equal(pf %>% do(y = 1) %>% collect(), tibble(y = list(1, 1))) 19 | 20 | expect_equal( 21 | pf %>% group_by(g) %>% summarise(x = sum(x)) %>% collect(), 22 | tibble(g = c(1, 2), x = c(3L, 7L)) 23 | ) 24 | }) 25 | -------------------------------------------------------------------------------- /tests/testthat/test-partydf.R: -------------------------------------------------------------------------------- 1 | test_that("can construct and print partydf", { 2 | withr::local_options(lifecycle_verbosity = "quiet") 3 | 4 | cl <- default_cluster() 5 | cl <- cluster_assign(cl, x = data.frame(y = 1:10)) 6 | 7 | df <- party_df(cl, "x") 8 | on.exit(cluster_rm(cl, "x")) 9 | 10 | expect_s3_class(df, "multidplyr_party_df") 11 | expect_true(is_party_df(df)) 12 | expect_equal(df$cluster, cl) 13 | 14 | expect_snapshot(df) 15 | expect_snapshot(group_by(df, y)) 16 | }) 17 | 18 | test_that("name must be data frame with same names", { 19 | cl <- default_cluster() 20 | expect_snapshot(error = TRUE, party_df(cl, "x")) 21 | 22 | cluster_assign(cl, x = 1) 23 | on.exit(cluster_rm(cl, "x")) 24 | expect_snapshot(error = TRUE, party_df(cl, "x")) 25 | 26 | cluster_assign_each(cl, x = list(tibble(x = 1), tibble(y = 2))) 27 | expect_snapshot(error = TRUE, party_df(cl, "x")) 28 | }) 29 | 30 | test_that("can automatically delete on gc() + cluster_call()", { 31 | cl <- default_cluster() 32 | cl <- cluster_assign(cl, x = data.frame(y = 1:10)) 33 | 34 | df <- party_df(cl, "x", auto_rm = TRUE) 35 | rm(df); gc() 36 | expect_equal(cluster_call(cl, exists("x")), list(FALSE, FALSE)) 37 | }) 38 | 39 | # partitioning ------------------------------------------------------------ 40 | 41 | test_that("can partition and re-collect", { 42 | cl <- default_cluster() 43 | 44 | df1 <- tibble::tibble(x = 1:2) 45 | df2 <- partition(df1, cl) 46 | 47 | expect_s3_class(df2, "multidplyr_party_df") 48 | expect_equal(df2$cluster, cl) 49 | 50 | expect_equal(collect(df2), df1) 51 | expect_equal(pull(df2, x), df1$x) 52 | }) 53 | 54 | test_that("can partition by group", { 55 | cl <- default_cluster() 56 | df1 <- tibble(x = c(rep(1, 2), rep(2, 1), rep(3, 1))) 57 | df2 <- df1 %>% group_by(x) %>% partition(cl) 58 | 59 | dfs <- cluster_call(cl, !!df2$name) 60 | expect_equal(dfs, list( 61 | group_by(tibble(x = c(1, 1)), x), 62 | group_by(tibble(x = c(2, 3)), x) 63 | )) 64 | }) 65 | 66 | test_that("reduce cluster size if needed", { 67 | cl <- default_cluster() 68 | df1 <- tibble(x = c(rep(1, 2))) 69 | expect_message( 70 | df2 <- df1 %>% group_by(x) %>% partition(cl), 71
| "partial cluster" 72 | ) 73 | 74 | expect_equal(length(df2$cluster), 1) 75 | }) 76 | -------------------------------------------------------------------------------- /vignettes/.gitignore: -------------------------------------------------------------------------------- 1 | multidplyr.html 2 | multidplyr.R 3 | -------------------------------------------------------------------------------- /vignettes/multidplyr.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "An introduction to multidplyr" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{An introduction to multidplyr} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | %\VignetteEncoding{UTF-8} 8 | --- 9 | 10 | ```{r, include = FALSE} 11 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 12 | set.seed(1014) 13 | ``` 14 | 15 | multidplyr is a backend for dplyr that spreads work across multiple processes. Like all dplyr backends, it allows you to use the dplyr verbs that you're already familiar with, but alters the underlying computational model to transparently support multi-process parallelism. 16 | 17 | This vignette will show you the basics of multidplyr using the [nycflights13](https://CRAN.R-project.org/package=nycflights13) dataset. 18 | 19 | ```{r setup} 20 | library(multidplyr) 21 | library(dplyr, warn.conflicts = FALSE) 22 | library(nycflights13) 23 | ``` 24 | 25 | ## Creating a cluster 26 | 27 | To start using multidplyr you must create a cluster. Here I use two cores because it's the maximum permitted by CRAN, but I suggest that you use more. For best performance, I recommend using 1 or 2 fewer than the total number of cores on your computer, which you can detect with `parallel::detectCores()` (leaving at least 1 core free means that you should still be able to use your computer for other tasks while your computation is running). 28 | 29 | ```{r} 30 | cluster <- new_cluster(2) 31 | cluster 32 | ``` 33 | 34 | (In the examples, you'll also see the use of `default_cluster()`; this is designed specifically for the constraints of R CMD check, so I don't recommend using it in your own code.) 35 | 36 | A cluster consists of multiple R processes created by [callr](https://callr.r-lib.org/). When multiple processes are running at the same time, your operating system will take care of spreading the work across multiple cores. 37 | 38 | ## Add data 39 | 40 | There are two ways to get data to the workers in the cluster: 41 | 42 | * `partition()` a data frame that is already loaded *in the interactive process*. 43 | * Load a different subset of the data *in each worker*. 44 | 45 | ### `partition()` 46 | 47 | `partition()` is useful if you have a single in-memory data frame. For example, take `nycflights13::flights`. This dataset contains information on about 300,000 flights departing New York City in 2013. We group it by destination, then partition it: 48 | 49 | ```{r} 50 | flights1 <- flights %>% group_by(dest) %>% partition(cluster) 51 | flights1 52 | ``` 53 | 54 | `partition()` splits `flights1` into roughly equal subsets on each worker, ensuring that all rows in a group are transferred to the same worker. The result is a `party_df`, or partitioned data frame. 55 | 56 | ### Direct loading 57 | 58 | `partition()` is simple to call, but it's relatively expensive because it copies a lot of data between processes. An alternative strategy is for each worker to load the data (from files) it needs directly.
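In outline, this strategy needs just two steps: use `cluster_send()` to run the file-reading code on every worker, then wrap the per-worker variable with `party_df()`. Here's a minimal sketch, using hypothetical file names (one per worker in our two-worker cluster); the chunk isn't evaluated:

```{r, eval = FALSE}
# Give each worker its own (hypothetical) file name...
cluster_assign_each(cluster, file = c("shard-1.csv", "shard-2.csv"))

# ...have every worker read its file into a variable of the same name...
cluster_send(cluster, my_shard <- vroom::vroom(file))

# ...and stitch the per-worker data frames into one partitioned data frame
my_shard <- party_df(cluster, "my_shard")
```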
59 | 60 | To show how that works in practice, I'll first split flights up by month and save them as csv files: 61 | 62 | ```{r} 63 | path <- tempfile() 64 | dir.create(path) 65 | 66 | flights %>% 67 | group_by(month) %>% 68 | group_walk(~ vroom::vroom_write(.x, sprintf("%s/month-%02i.csv", path, .y$month))) 69 | ``` 70 | 71 | Now we find all the files in the directory, and divide them up so that each worker gets (approximately) the same number of pieces: 72 | 73 | ```{r} 74 | files <- dir(path, full.names = TRUE) 75 | cluster_assign_partition(cluster, files = files) 76 | ``` 77 | 78 | Then we read in the files on each worker and use `party_df()` to create a partitioned data frame: 79 | 80 | ```{r} 81 | cluster_send(cluster, flights2 <- vroom::vroom(files)) 82 | 83 | flights2 <- party_df(cluster, "flights2") 84 | flights2 85 | ``` 86 | 87 | ## dplyr verbs 88 | 89 | Once you have a partitioned data frame, you can operate on it with the usual dplyr verbs. To bring the data back to the interactive process, use `collect()`: 90 | 91 | ```{r} 92 | flights1 %>% 93 | summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>% 94 | collect() 95 | ``` 96 | 97 | For this size of data and a simple transformation, using a local cluster actually makes performance much worse! 98 | 99 | ```{r} 100 | by_dest <- flights %>% group_by(dest) 101 | 102 | # Local computation 103 | system.time(by_dest %>% summarise(mean(dep_delay, na.rm = TRUE))) 104 | 105 | # Remote: partitioning 106 | system.time(flights2 <- flights %>% partition(cluster)) 107 | # Remote: computation 108 | system.time(flights3 <- flights2 %>% summarise(mean(dep_delay, na.rm = TRUE))) 109 | # Remote: retrieve results 110 | system.time(flights3 %>% collect()) 111 | ``` 112 | 113 | That's because of the overhead associated with sending the data to each worker and retrieving the results at the end. For basic dplyr verbs, multidplyr is unlikely to give you significant speed-ups unless you have 10s or 100s of millions of data points (and in that scenario you should first try [dtplyr](https://dtplyr.tidyverse.org/), which uses [data.table](https://R-datatable.com/)). 114 | 115 | multidplyr might help, however, if you're doing more complex things. Let's see how that plays out when fitting a moderately complex model. We'll start by selecting the flights whose destination has at least 365 flights over the year (roughly one a day), and we'll compute the day of the year from the date: 116 | 117 | ```{r} 118 | daily_flights <- flights %>% 119 | count(dest) %>% 120 | filter(n >= 365) 121 | 122 | common_dest <- flights %>% 123 | semi_join(daily_flights, by = "dest") %>% 124 | mutate(yday = lubridate::yday(ISOdate(year, month, day))) %>% 125 | group_by(dest) 126 | 127 | nrow(common_dest) 128 | ``` 129 | 130 | That leaves us with ~332,000 observations. Let's partition this smaller dataset: 131 | 132 | ```{r} 133 | by_dest <- common_dest %>% partition(cluster) 134 | by_dest 135 | ``` 136 | 137 | Let's fit a smoothed generalised additive model to each destination, estimating how delays vary over the course of the year and within a day. Note that we need to use `cluster_library()` to load the mgcv package on every node.
That takes around 3s: 138 | 139 | ```{r, message = FALSE} 140 | cluster_library(cluster, "mgcv") 141 | system.time({ 142 | models <- by_dest %>% 143 | do(mod = gam(dep_delay ~ s(yday) + s(dep_time), data = .)) 144 | }) 145 | ``` 146 | 147 | Compared with around 5s doing it locally: 148 | 149 | ```{r} 150 | system.time({ 151 | models <- common_dest %>% 152 | group_by(dest) %>% 153 | do(mod = gam(dep_delay ~ s(yday) + s(dep_time), data = .)) 154 | }) 155 | ``` 156 | 157 | The cost of transmitting messages to the nodes is roughly constant, so the longer the task you're parallelising, the closer you'll get to a linear speed-up. 158 | --------------------------------------------------------------------------------