├── .Rbuildignore
├── .covrignore
├── .github
│   ├── .gitignore
│   ├── CODE_OF_CONDUCT.md
│   └── workflows
│       ├── R-CMD-check.yaml
│       ├── pkgdown.yaml
│       ├── pr-commands.yaml
│       └── test-coverage.yaml
├── .gitignore
├── DESCRIPTION
├── LICENSE
├── LICENSE.md
├── NAMESPACE
├── NEWS.md
├── R
│   ├── cluster-call.R
│   ├── cluster-utils.R
│   ├── cluster.R
│   ├── dplyr-dual.R
│   ├── dplyr-single.R
│   ├── multidplyr-package.R
│   ├── partydf.R
│   ├── utils.R
│   └── zzz.R
├── README.Rmd
├── README.md
├── _pkgdown.yml
├── codecov.yml
├── cran-comments.md
├── man
│   ├── cluster_call.Rd
│   ├── cluster_utils.Rd
│   ├── default_cluster.Rd
│   ├── multidplyr-package.Rd
│   ├── new_cluster.Rd
│   ├── partition.Rd
│   ├── party_df.Rd
│   └── reexports.Rd
├── multidplyr.Rproj
├── tests
│   ├── testthat.R
│   └── testthat
│       ├── _snaps
│       │   ├── cluster-call.md
│       │   ├── cluster.md
│       │   ├── dplyr-dual.md
│       │   └── partydf.md
│       ├── setup.R
│       ├── test-cluster-call.R
│       ├── test-cluster-utils.R
│       ├── test-cluster.R
│       ├── test-dplyr-dual.R
│       ├── test-dplyr-single.R
│       └── test-partydf.R
└── vignettes
    ├── .gitignore
    └── multidplyr.Rmd

/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^.*\.Rproj$
2 | ^\.Rproj\.user$
3 | ^\.travis\.yml$
4 | ^codecov\.yml$
5 | ^\.covrignore$
6 | ^cran-comments\.md$
7 | ^_pkgdown\.yml$
8 | ^docs$
9 | ^pkgdown$
10 | ^README\.Rmd$
11 | ^\.github$
12 | ^LICENSE\.md$
13 | ^CRAN-RELEASE$
14 | ^CRAN-SUBMISSION$
15 | 
--------------------------------------------------------------------------------
/.covrignore:
--------------------------------------------------------------------------------
1 | R/deprec-*.R
2 | R/compat-*.R
3 | 
--------------------------------------------------------------------------------
/.github/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 | 
--------------------------------------------------------------------------------
/.github/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant Code of Conduct
2 | 
3 | ## Our Pledge
4 | 
5 | We as members, contributors, and leaders pledge to make participation in our
6 | community a harassment-free experience for everyone, regardless of age, body
7 | size, visible or invisible disability, ethnicity, sex characteristics, gender
8 | identity and expression, level of experience, education, socio-economic status,
9 | nationality, personal appearance, race, caste, color, religion, or sexual
10 | identity and orientation.
11 | 
12 | We pledge to act and interact in ways that contribute to an open, welcoming,
13 | diverse, inclusive, and healthy community.
14 | 
15 | ## Our Standards
16 | 
17 | Examples of behavior that contributes to a positive environment for our
18 | community include:
19 | 
20 | * Demonstrating empathy and kindness toward other people
21 | * Being respectful of differing opinions, viewpoints, and experiences
22 | * Giving and gracefully accepting constructive feedback
23 | * Accepting responsibility and apologizing to those affected by our mistakes,
24 |   and learning from the experience
25 | * Focusing on what is best not just for us as individuals, but for the overall
26 |   community
27 | 
28 | Examples of unacceptable behavior include:
29 | 
30 | * The use of sexualized language or imagery, and sexual attention or advances of
31 |   any kind
32 | * Trolling, insulting or derogatory comments, and personal or political attacks
33 | * Public or private harassment
34 | * Publishing others' private information, such as a physical or email address,
35 |   without their explicit permission
36 | * Other conduct which could reasonably be considered inappropriate in a
37 |   professional setting
38 | 
39 | ## Enforcement Responsibilities
40 | 
41 | Community leaders are responsible for clarifying and enforcing our standards of
42 | acceptable behavior and will take appropriate and fair corrective action in
43 | response to any behavior that they deem inappropriate, threatening, offensive,
44 | or harmful.
45 | 
46 | Community leaders have the right and responsibility to remove, edit, or reject
47 | comments, commits, code, wiki edits, issues, and other contributions that are
48 | not aligned to this Code of Conduct, and will communicate reasons for moderation
49 | decisions when appropriate.
50 | 
51 | ## Scope
52 | 
53 | This Code of Conduct applies within all community spaces, and also applies when
54 | an individual is officially representing the community in public spaces.
55 | Examples of representing our community include using an official e-mail address,
56 | posting via an official social media account, or acting as an appointed
57 | representative at an online or offline event.
58 | 
59 | ## Enforcement
60 | 
61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
62 | reported to the community leaders responsible for enforcement at codeofconduct@posit.co.
63 | All complaints will be reviewed and investigated promptly and fairly.
64 | 
65 | All community leaders are obligated to respect the privacy and security of the
66 | reporter of any incident.
67 | 
68 | ## Enforcement Guidelines
69 | 
70 | Community leaders will follow these Community Impact Guidelines in determining
71 | the consequences for any action they deem in violation of this Code of Conduct:
72 | 
73 | ### 1. Correction
74 | 
75 | **Community Impact**: Use of inappropriate language or other behavior deemed
76 | unprofessional or unwelcome in the community.
77 | 
78 | **Consequence**: A private, written warning from community leaders, providing
79 | clarity around the nature of the violation and an explanation of why the
80 | behavior was inappropriate. A public apology may be requested.
81 | 
82 | ### 2. Warning
83 | 
84 | **Community Impact**: A violation through a single incident or series of
85 | actions.
86 | 
87 | **Consequence**: A warning with consequences for continued behavior. No
88 | interaction with the people involved, including unsolicited interaction with
89 | those enforcing the Code of Conduct, for a specified period of time. This
90 | includes avoiding interactions in community spaces as well as external channels
91 | like social media.
Violating these terms may lead to a temporary or permanent
92 | ban.
93 | 
94 | ### 3. Temporary Ban
95 | 
96 | **Community Impact**: A serious violation of community standards, including
97 | sustained inappropriate behavior.
98 | 
99 | **Consequence**: A temporary ban from any sort of interaction or public
100 | communication with the community for a specified period of time. No public or
101 | private interaction with the people involved, including unsolicited interaction
102 | with those enforcing the Code of Conduct, is allowed during this period.
103 | Violating these terms may lead to a permanent ban.
104 | 
105 | ### 4. Permanent Ban
106 | 
107 | **Community Impact**: Demonstrating a pattern of violation of community
108 | standards, including sustained inappropriate behavior, harassment of an
109 | individual, or aggression toward or disparagement of classes of individuals.
110 | 
111 | **Consequence**: A permanent ban from any sort of public interaction within the
112 | community.
113 | 
114 | ## Attribution
115 | 
116 | This Code of Conduct is adapted from the [Contributor Covenant][homepage],
117 | version 2.1, available at
118 | <https://www.contributor-covenant.org/version/2/1/code_of_conduct.html>.
119 | 
120 | Community Impact Guidelines were inspired by
121 | [Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/inclusion).
122 | 
123 | For answers to common questions about this code of conduct, see the FAQ at
124 | <https://www.contributor-covenant.org/faq>. Translations are available at <https://www.contributor-covenant.org/translations>.
125 | 
126 | [homepage]: https://www.contributor-covenant.org
127 | 
--------------------------------------------------------------------------------
/.github/workflows/R-CMD-check.yaml:
--------------------------------------------------------------------------------
1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
3 | #
4 | # NOTE: This workflow is overkill for most R packages and
5 | # check-standard.yaml is likely a better choice.
6 | # usethis::use_github_action("check-standard") will install it.
7 | on:
8 |   push:
9 |     branches: [main, master]
10 |   pull_request:
11 |     branches: [main, master]
12 | 
13 | name: R-CMD-check
14 | 
15 | permissions: read-all
16 | 
17 | jobs:
18 |   R-CMD-check:
19 |     runs-on: ${{ matrix.config.os }}
20 | 
21 |     name: ${{ matrix.config.os }} (${{ matrix.config.r }})
22 | 
23 |     strategy:
24 |       fail-fast: false
25 |       matrix:
26 |         config:
27 |           - {os: macos-latest, r: 'release'}
28 | 
29 |           - {os: windows-latest, r: 'release'}
30 |           # use 4.1 to check with rtools40's older compiler
31 |           - {os: windows-latest, r: '4.1'}
32 | 
33 |           - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'}
34 |           - {os: ubuntu-latest, r: 'release'}
35 |           - {os: ubuntu-latest, r: 'oldrel-1'}
36 |           - {os: ubuntu-latest, r: 'oldrel-2'}
37 |           - {os: ubuntu-latest, r: 'oldrel-3'}
38 |           - {os: ubuntu-latest, r: 'oldrel-4'}
39 | 
40 |     env:
41 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
42 |       R_KEEP_PKG_SOURCE: yes
43 | 
44 |     steps:
45 |       - uses: actions/checkout@v4
46 | 
47 |       - uses: r-lib/actions/setup-pandoc@v2
48 | 
49 |       - uses: r-lib/actions/setup-r@v2
50 |         with:
51 |           r-version: ${{ matrix.config.r }}
52 |           http-user-agent: ${{ matrix.config.http-user-agent }}
53 |           use-public-rspm: true
54 | 
55 |       - uses: r-lib/actions/setup-r-dependencies@v2
56 |         with:
57 |           extra-packages: any::rcmdcheck
58 |           needs: check
59 | 
60 |       - uses: r-lib/actions/check-r-package@v2
61 |         with:
62 |           upload-snapshots: true
63 |           build_args: 'c("--no-manual","--compact-vignettes=gs+qpdf")'
--------------------------------------------------------------------------------
/.github/workflows/pkgdown.yaml:
--------------------------------------------------------------------------------
1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
3 | on:
4 |   push:
5 |     branches: [main, master]
6 |   pull_request:
7 |     branches: [main, master]
8 |   release:
9 |     types: [published]
10 |   workflow_dispatch:
11 | 
12 | name: pkgdown
13 | 
14 | permissions: read-all
15 | 
16 | jobs:
17 |   pkgdown:
18 |     runs-on: ubuntu-latest
19 |     # Only restrict concurrency for non-PR jobs
20 |     concurrency:
21 |       group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }}
22 |     env:
23 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
24 |     permissions:
25 |       contents: write
26 |     steps:
27 |       - uses: actions/checkout@v4
28 | 
29 |       - uses: r-lib/actions/setup-pandoc@v2
30 | 
31 |       - uses: r-lib/actions/setup-r@v2
32 |         with:
33 |           use-public-rspm: true
34 | 
35 |       - uses: r-lib/actions/setup-r-dependencies@v2
36 |         with:
37 |           extra-packages: any::pkgdown, local::.
38 |           needs: website
39 | 
40 |       - name: Build site
41 |         run: pkgdown::build_site_github_pages(new_process = FALSE, install = FALSE)
42 |         shell: Rscript {0}
43 | 
44 |       - name: Deploy to GitHub pages 🚀
45 |         if: github.event_name != 'pull_request'
46 |         uses: JamesIves/github-pages-deploy-action@v4.5.0
47 |         with:
48 |           clean: false
49 |           branch: gh-pages
50 |           folder: docs
51 | 
--------------------------------------------------------------------------------
/.github/workflows/pr-commands.yaml:
--------------------------------------------------------------------------------
1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
3 | on:
4 |   issue_comment:
5 |     types: [created]
6 | 
7 | name: Commands
8 | 
9 | permissions: read-all
10 | 
11 | jobs:
12 |   document:
13 |     if: ${{ github.event.issue.pull_request && (github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'OWNER') && startsWith(github.event.comment.body, '/document') }}
14 |     name: document
15 |     runs-on: ubuntu-latest
16 |     env:
17 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
18 |     steps:
19 |       - uses: actions/checkout@v4
20 | 
21 |       - uses: r-lib/actions/pr-fetch@v2
22 |         with:
23 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
24 | 
25 |       - uses: r-lib/actions/setup-r@v2
26 |         with:
27 |           use-public-rspm: true
28 | 
29 |       - uses: r-lib/actions/setup-r-dependencies@v2
30 |         with:
31 |           extra-packages: any::roxygen2
32 |           needs: pr-document
33 | 
34 |       - name: Document
35 |         run: roxygen2::roxygenise()
36 |         shell: Rscript {0}
37 | 
38 |       - name: commit
39 |         run: |
40 |           git config --local user.name "$GITHUB_ACTOR"
41 |           git config --local user.email "$GITHUB_ACTOR@users.noreply.github.com"
42 |           git add man/\* NAMESPACE
43 |           git commit -m 'Document'
44 | 
45 |       - uses: r-lib/actions/pr-push@v2
46 |         with:
47 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
48 | 
49 |   style:
50 |     if: ${{ github.event.issue.pull_request && (github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'OWNER') && startsWith(github.event.comment.body, '/style') }}
51 |     name: style
52 |     runs-on: ubuntu-latest
53 |     env:
54 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
55 |     steps:
56 |       - uses: actions/checkout@v4
57 | 
58 |       - uses: r-lib/actions/pr-fetch@v2
59 |         with:
60 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
61 | 
62 |       - uses: r-lib/actions/setup-r@v2
63 | 
64 |       - name: Install dependencies
65 |         run: install.packages("styler")
66 |         shell: Rscript {0}
67 | 
68 |       - name: Style
69 |         run: styler::style_pkg()
70 |         shell: Rscript {0}
71 | 
72 |       - name: commit
73 |         run: |
74 |           git config --local user.name "$GITHUB_ACTOR"
75 |           git config --local user.email "$GITHUB_ACTOR@users.noreply.github.com"
76 |           git add \*.R
77 |           git commit -m 'Style'
78 | 
79 |       - uses: r-lib/actions/pr-push@v2
80 |         with:
81 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
82 | 
--------------------------------------------------------------------------------
/.github/workflows/test-coverage.yaml:
--------------------------------------------------------------------------------
1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
3 | on:
4 |   push:
5 |     branches: [main, master]
6 |   pull_request:
7 |     branches: [main, master]
8 | 
9 | name: test-coverage
10 | 
11 | permissions: read-all
12 | 
13 | jobs:
14 |   test-coverage:
15 |     runs-on: ubuntu-latest
16 |     env:
17 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
18 | 
19 |     steps:
20 |       - uses: actions/checkout@v4
21 | 
22 |       - uses: r-lib/actions/setup-r@v2
23 |         with:
24 |           use-public-rspm: true
25 | 
26 |       - uses: r-lib/actions/setup-r-dependencies@v2
27 |         with:
28 |           extra-packages: any::covr, any::xml2
29 |           needs: coverage
30 | 
31 |       - name: Test coverage
32 |         run: |
33 |           cov <- covr::package_coverage(
34 |             quiet = FALSE,
35 |             clean = FALSE,
36 |             install_path = file.path(normalizePath(Sys.getenv("RUNNER_TEMP"), winslash = "/"), "package")
37 |           )
38 |           covr::to_cobertura(cov)
39 |         shell: Rscript {0}
40 | 
41 |       - uses: codecov/codecov-action@v4
42 |         with:
43 |           fail_ci_if_error: ${{ github.event_name != 'pull_request' && true || false }}
44 |           file: ./cobertura.xml
45 |           plugin: noop
46 |           disable_search: true
47 |           token: ${{ secrets.CODECOV_TOKEN }}
48 | 
49 |       - name: Show testthat output
50 |         if: always()
51 |         run: |
52 |           ## --------------------------------------------------------------------
53 |           find '${{ runner.temp }}/package' -name 'testthat.Rout*' -exec cat '{}' \; || true
54 |         shell: bash
55 | 
56 |       - name: Upload test results
57 |         if: failure()
58 |         uses: actions/upload-artifact@v4
59 |         with:
60 |           name: coverage-test-failures
61 |           path: ${{ runner.temp }}/package
62 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | inst/doc
5 | docs/
6 | 
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: multidplyr
2 | Title: A Multi-Process 'dplyr' Backend
3 | Version: 0.1.3.9000
4 | Authors@R: c(
5 |     person("Hadley", "Wickham", , "hadley@posit.co", role = c("aut", "cre")),
6 |     person("Posit Software, PBC", role = c("cph", "fnd"))
7 |   )
8 | Description: Partition a data frame across multiple worker processes to
9 |     provide simple multicore parallelism.
10 | License: MIT + file LICENSE
11 | URL: https://multidplyr.tidyverse.org,
12 |     https://github.com/tidyverse/multidplyr
13 | BugReports: https://github.com/tidyverse/multidplyr/issues
14 | Depends: 
15 |     R (>= 3.6)
16 | Imports: 
17 |     callr (>= 3.5.1),
18 |     cli,
19 |     crayon,
20 |     dplyr (>= 1.0.0),
21 |     magrittr,
22 |     qs (>= 0.24.1),
23 |     R6,
24 |     rlang (>= 1.0.6),
25 |     tibble,
26 |     tidyselect,
27 |     vctrs (>= 0.3.6)
28 | Suggests: 
29 |     covr,
30 |     knitr,
31 |     lubridate,
32 |     mgcv,
33 |     nycflights13,
34 |     rmarkdown,
35 |     testthat (>= 3.0.2),
36 |     vroom,
37 |     withr
38 | VignetteBuilder: 
39 |     knitr
40 | Config/Needs/website: tidyverse/tidytemplate
41 | Config/testthat/edition: 3
42 | Encoding: UTF-8
43 | RoxygenNote: 7.3.2
44 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | YEAR: 2023
2 | COPYRIGHT HOLDER: multidplyr authors
3 | 
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | # MIT License
2 | 
3 | Copyright (c) 2023 multidplyr authors
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 | 
3 | S3method("[",multidplyr_cluster)
4 | S3method(anti_join,multidplyr_party_df)
5 | S3method(arrange,multidplyr_party_df)
6 | S3method(as.data.frame,multidplyr_party_df)
7 | S3method(auto_copy,multidplyr_party_df)
8 | S3method(collect,multidplyr_party_df)
9 | S3method(dim,multidplyr_party_df)
10 | S3method(do,multidplyr_party_df)
11 | S3method(full_join,multidplyr_party_df)
12 | S3method(group_by,multidplyr_party_df)
13 | S3method(group_vars,multidplyr_party_df)
14 | S3method(head,multidplyr_party_df)
15 | S3method(inner_join,multidplyr_party_df)
16 | S3method(left_join,multidplyr_party_df)
17 | S3method(mutate,multidplyr_party_df)
18 | S3method(print,multidplyr_cluster)
19 | S3method(print,multidplyr_party_df)
20 | S3method(pull,multidplyr_party_df)
21 | S3method(rename,multidplyr_party_df)
22 | S3method(right_join,multidplyr_party_df)
23 | S3method(same_src,multidplyr_party_df)
24 | S3method(select,multidplyr_party_df)
25 | S3method(semi_join,multidplyr_party_df)
26 | S3method(slice,multidplyr_party_df)
27 | S3method(summarise,multidplyr_party_df)
28 | S3method(tbl_sum,multidplyr_party_df)
29 | S3method(tbl_vars,multidplyr_party_df)
30 | S3method(ungroup,multidplyr_party_df)
31 | S3method(union_all,multidplyr_party_df)
32 | export("%>%")
33 | export(cluster_assign)
34 | export(cluster_assign_each)
35 | export(cluster_assign_partition)
36 | export(cluster_call)
37 | export(cluster_copy)
38 | export(cluster_library)
39 | export(cluster_rm)
40 | export(cluster_send)
41 | export(default_cluster)
42 | export(new_cluster)
43 | export(partition)
44 | export(party_df)
45 | import(rlang)
46 | importFrom(R6,R6Class)
47 | importFrom(dplyr,anti_join)
48 | importFrom(dplyr,arrange)
49 | importFrom(dplyr,auto_copy)
50 | importFrom(dplyr,collect)
51 | importFrom(dplyr,do)
52 | importFrom(dplyr,filter)
53 | importFrom(dplyr,full_join)
54 | importFrom(dplyr,group_by)
55 | importFrom(dplyr,group_vars)
56 | importFrom(dplyr,inner_join)
57 | importFrom(dplyr,intersect)
58 | importFrom(dplyr,left_join)
59 | importFrom(dplyr,mutate)
60 | importFrom(dplyr,pull)
61 | importFrom(dplyr,rename)
62 | importFrom(dplyr,right_join)
63 | importFrom(dplyr,same_src)
64 | importFrom(dplyr,select)
65 | importFrom(dplyr,semi_join)
66 | importFrom(dplyr,setdiff)
67 | importFrom(dplyr,slice)
68 | importFrom(dplyr,summarise)
69 | importFrom(dplyr,tbl_vars)
70 | importFrom(dplyr,ungroup)
71 | importFrom(dplyr,union)
72 | importFrom(dplyr,union_all)
73 | importFrom(magrittr,"%>%")
74 | importFrom(tibble,tbl_sum)
75 | importFrom(tibble,tibble)
76 | importFrom(utils,head)
77 | 
--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
1 | # multidplyr (development version)
2 | 
3 | # multidplyr 0.1.3
4 | 
5 | * Fix for R CMD check failure.
6 | 
7 | # multidplyr 0.1.2
8 | 
9 | * `cluster_call()` gains a `simplify` argument - use this to request that
10 |   the result should be simplified (#136).
11 | 
12 | # multidplyr 0.1.1
13 | 
14 | * Fixed problems identified as part of working on dplyr 1.0.8.
15 | 
16 | # multidplyr 0.1.0
17 | 
18 | * Added a `NEWS.md` file to track changes to the package.
19 | 
--------------------------------------------------------------------------------
/R/cluster-call.R:
--------------------------------------------------------------------------------
1 | #' Call a function on each node of a cluster
2 | #'
3 | #' `cluster_call()` executes the code on each worker and returns the results;
4 | #' `cluster_send()` executes the code ignoring the result. Jobs are submitted
5 | #' to workers in parallel, and then we wait until they're complete.
6 | #'
7 | #' @param cluster A cluster.
8 | #' @param code An expression to execute on each worker.
9 | #' @param simplify Should the results be simplified from a list?
10 | #'   * `TRUE`: simplify or die trying.
11 | #'   * `NA`: simplify if possible.
12 | #'   * `FALSE`: never try to simplify, always leaving as a list.
13 | #'
14 | #'   `code` must return a vector of length one in order for simplification
15 | #'   to succeed.
16 | #' @param ptype If `simplify` is `TRUE`, use `ptype` to enforce the desired
17 | #'   output type.
18 | #' @export
19 | #' @return A list of results with one element for each worker in `cluster`.
20 | #' @examples
21 | #' cl <- default_cluster()
22 | #'
23 | #' # Run code on each worker and retrieve results
24 | #' cluster_call(cl, Sys.getpid())
25 | #' cluster_call(cl, runif(1))
26 | #'
27 | #' # use simplify = TRUE to get a vector back
28 | #' cluster_call(cl, runif(1), simplify = TRUE)
29 | #'
30 | #' # use cluster_send() to ignore results
31 | #' cluster_send(cl, x <- runif(1))
32 | #' cluster_call(cl, x, simplify = TRUE)
33 | cluster_call <- function(cluster, code, simplify = FALSE, ptype = NULL) {
34 |   stopifnot(is_cluster(cluster))
35 |   code <- enexpr(code)
36 |   to_rm <- attr(cluster, "cleaner")$reset()
37 | 
38 |   if (length(simplify) > 1 || !is.logical(simplify)) {
39 |     cli::cli_abort("{.arg simplify} must be `TRUE`, `FALSE`, or `NA`.")
40 |   }
41 |   if (!isTRUE(simplify) && !is.null(ptype)) {
42 |     # 0.1.2
43 |     warn("Must now set `simplify = TRUE` when supplying ptype")
44 |     simplify <- TRUE
45 |   }
46 | 
47 |   # nocov start
48 |   f <- function(code, to_rm) {
49 |     rm(list = to_rm, envir = globalenv())
50 |     eval(code, globalenv())
51 |   }
52 |   # nocov end
53 |   lapply(cluster, function(x) x$call(f, list(code = code, to_rm = to_rm)))
54 |   lapply(cluster, function(x) x$poll_process(-1))
55 | 
56 |   results <- lapply(cluster, function(x) x$read())
57 | 
58 |   errs <- lapply(results, "[[", "error")
59 |   failed <- !vapply(errs, is.null, logical(1))
60 |   if (any(failed)) {
61 |     worker <- which(failed)[[1]]
62 |     err <- errs[[worker]]$parent
63 |     message <- paste0("Remote computation failed in worker ", worker)
64 |     abort(message, parent = err)
65 |   }
66 | 
67 |   out <- lapply(results, "[[", "result")
68 |   if (!isFALSE(simplify)) {
69 |     out <- simplify_impl(out, strict = !is.na(simplify), ptype = ptype)
70 |   }
71 |   out
72 | }
73 | 
74 | #' @rdname cluster_call
75 | #' @export
76 | cluster_send <- function(cluster, code) {
77 |   stopifnot(is_cluster(cluster))
78 |   code <- call2("{", enexpr(code), NULL)
79 |   cluster_call(cluster, !!code)
80 | 
81 |   invisible(cluster)
82 | }
83 | 
84 | # TODO: replace with purrr::list_simplify() when purrr 1.0.0 is out
85 | simplify_impl <- function(x,
86 |                           strict = TRUE,
87 |                           ptype = NULL,
88 |                           error_arg = caller_arg(x),
89 |                           error_call = caller_env()) {
90 |   vctrs::vec_check_list(x, arg = error_arg, call = error_call)
91 | 
92 |   # Handle the cases where we definitely can't simplify
93 |   if (strict) {
94 |     vctrs::list_check_all_vectors(x, arg = error_arg, call = error_call)
95 |     size_one <- vctrs::list_sizes(x) == 1L
96 |     can_simplify <- all(size_one)
97 | 
98 |     if (!can_simplify) {
99 |       bad <- which(!size_one)[[1]]
100 |       cli::cli_abort(
101 |         c(
102 |           "All elements must be size 1.",
103 |           i = "`{error_arg}[[{bad}]]` is size {vec_size(x[[bad]])}."
104 |         ),
105 |         call = error_call
106 |       )
107 |     }
108 |   } else {
109 |     can_simplify <- vctrs::list_all_vectors(x) && all(vctrs::list_sizes(x) == 1L)
110 | 
111 |     if (!can_simplify) {
112 |       return(x)
113 |     }
114 |   }
115 | 
116 |   names <- vctrs::vec_names(x)
117 |   x <- vctrs::vec_set_names(x, NULL)
118 | 
119 |   out <- tryCatch(
120 |     vctrs::vec_c(!!!x, ptype = ptype),
121 |     vctrs_error_incompatible_type = function(err) {
122 |       if (strict || !is.null(ptype)) {
123 |         cnd_signal(err)
124 |       } else {
125 |         x
126 |       }
127 |     }
128 |   )
129 |   vctrs::vec_set_names(out, names)
130 | }
131 | 
--------------------------------------------------------------------------------
/R/cluster-utils.R:
--------------------------------------------------------------------------------
1 | #' Cluster utility functions
2 | #'
3 | #' These functions provide useful helpers for performing common operations.
4 | #' `cluster_assign()` assigns the same value on each worker;
5 | #' `cluster_assign_each()` assigns different values on each worker;
6 | #' `cluster_assign_partition()` partitions vectors so that each worker gets
7 | #' (approximately) the same number of pieces.
8 | #'
9 | #' @param cluster,.cluster Cluster to work on
10 | #' @param ... Name-value pairs
11 | #' @param packages Character vector of packages to load
12 | #' @name cluster_utils
13 | #' @return Functions that modify the worker environment invisibly return
14 | #'   `cluster` so calls can be piped together. The other functions return
15 | #'   lists with one element for each worker.
16 | #' @examples
17 | #' cl <- default_cluster()
18 | #' cluster_assign(cl, a = runif(1))
19 | #' cluster_call(cl, a)
20 | #'
21 | #' # Assign different values on each worker
22 | #' cluster_assign_each(cl, b = c(1, 10))
23 | #' cluster_call(cl, b)
24 | #'
25 | #' # Partition a vector so that each worker gets approximately the
26 | #' # same amount of it
27 | #' cluster_assign_partition(cl, c = 1:11)
28 | #' cluster_call(cl, c)
29 | #'
30 | #' # If you want to compute different values on each
31 | #' # worker, use `cluster_call()` directly:
32 | #' cluster_call(cl, d <- runif(1))
33 | #' cluster_call(cl, d)
34 | #'
35 | #' # cluster_copy() is a useful shortcut
36 | #' e <- 10
37 | #' cluster_copy(cl, "e")
38 | #'
39 | #' cluster_call(cl, ls())
40 | #' cluster_rm(cl, letters[1:5])
41 | #' cluster_call(cl, ls())
42 | #'
43 | #' # Use cluster_library() to load packages
44 | #' cluster_call(cl, search())
45 | #' cluster_library(cl, "magrittr")
46 | #' cluster_call(cl, search())
47 | NULL
48 | 
49 | #' @rdname cluster_utils
50 | #' @export
51 | cluster_assign <- function(.cluster, ...) {
52 |   stopifnot(is_cluster(.cluster))
53 |   values <- list2(...)
54 |   stopifnot(is_named(values))
55 | 
56 |   path <- tempfile()
57 |   on.exit(unlink(path))
58 | 
59 |   qs::qsave(values, path, preset = "fast", check_hash = FALSE, nthreads = 2)
60 |   cluster_send(.cluster, list2env(qs::qread(!!path), globalenv()))
61 | 
62 |   invisible(.cluster)
63 | }
64 | 
65 | #' @rdname cluster_utils
66 | #' @export
67 | cluster_assign_each <- function(.cluster, ...) {
68 |   stopifnot(is_cluster(.cluster))
69 |   values <- tibble(..., .rows = length(.cluster))
70 | 
71 |   for (i in seq_len(nrow(values))) {
72 |     cluster_assign(.cluster[i], !!!lapply(values, "[[", i))
73 |   }
74 | 
75 |   invisible(.cluster)
76 | }
77 | 
78 | #' @export
79 | #' @rdname cluster_utils
80 | cluster_assign_partition <- function(.cluster, ...) {
81 |   stopifnot(is_cluster(.cluster))
82 |   values <- list(...)
83 | 
84 |   m <- length(.cluster)
85 |   values_split <- lapply(values, function(x) {
86 |     vctrs::vec_split(x, cut(vctrs::vec_seq_along(x), m, labels = FALSE))$val
87 |   })
88 | 
89 |   cluster_assign_each(.cluster, !!!values_split)
90 |   invisible(.cluster)
91 | }
92 | 
93 | #' @rdname cluster_utils
94 | #' @param names Name of variables to copy.
95 | #' @param env Environment in which to look for variables to copy.
96 | #' @export
97 | cluster_copy <- function(cluster, names, env = caller_env()) {
98 |   stopifnot(is_cluster(cluster))
99 |   stopifnot(is.character(names))
100 | 
101 |   cluster_assign(cluster, !!!env_get_list(env, names, inherit = TRUE))
102 | }
103 | 
104 | # Always succeeds: only gives a warning if an object is not found, and
105 | # warnings are not relayed back to the master
106 | #' @rdname cluster_utils
107 | #' @export
108 | cluster_rm <- function(cluster, names) {
109 |   stopifnot(is_cluster(cluster))
110 |   stopifnot(is.character(names))
111 | 
112 |   cluster_send(cluster, rm(list = !!names, envir = globalenv()))
113 |   invisible(cluster)
114 | }
115 | 
116 | #' @rdname cluster_utils
117 | #' @export
118 | cluster_library <- function(cluster, packages) {
119 |   lapply(packages, library, character.only = TRUE)
120 | 
121 |   cluster_send(cluster, lapply(!!packages, library, character.only = TRUE))
122 |   invisible(cluster)
123 | }
124 | 
--------------------------------------------------------------------------------
/R/cluster.R:
--------------------------------------------------------------------------------
1 | #' Create a new cluster with sensible defaults.
2 | #'
3 | #' Clusters created with this function will automatically clean up after
4 | #' themselves.
5 | #'
6 | #' @param n Number of workers to create. Avoid setting this higher than the
7 | #'   number of cores in your computer as it will degrade performance.
8 | #' @export
9 | #' @returns A `multidplyr_cluster` object.
10 | #' @examples
11 | #' cluster <- new_cluster(2)
12 | #' cluster
13 | new_cluster <- function(n) {
14 |   sessions <- replicate(n, callr::r_session$new(wait_timeout = 15 * 1000))
15 |   structure(
16 |     sessions,
17 |     cleaner = Cleaner$new(),
18 |     class = "multidplyr_cluster"
19 |   )
20 | }
21 | 
22 | #' @importFrom R6 R6Class
23 | Cleaner <- R6Class("Cleaner", list(
24 |   names = character(),
25 |   add = function(x) {
26 |     self$names <- union(self$names, x)
27 |     invisible(self)
28 |   },
29 |   reset = function(x) {
30 |     old <- self$names
31 |     self$names <- character()
32 |     old
33 |   }
34 | ))
35 | 
36 | #' @export
37 | print.multidplyr_cluster <- function(x, ...) {
38 |   n <- length(x)
39 | 
40 |   state <- vapply(x, function(x) x$get_state(), character(1))
41 |   state_abbr <- c(
42 |     "starting" = "S",
43 |     "idle" = crayon::green("."),
44 |     "busy" = crayon::red("*"),
45 |     "finished" = "F"
46 |   )[state]
47 | 
48 |   cat_line(n, " session cluster [", paste(state_abbr, collapse = ""), "]")
49 | }
50 | 
51 | is_cluster <- function(x) inherits(x, "multidplyr_cluster")
52 | 
53 | #' @export
54 | `[.multidplyr_cluster` <- function(x, i, ...) {
55 |   structure(NextMethod(),
56 |     cleaner = attr(x, "cleaner"),
57 |     class = "multidplyr_cluster"
58 |   )
59 | }
60 | 
61 | #' Default cluster
62 | #'
63 | #' Setting up a cluster is relatively expensive, so it's best to use a single
64 | #' cluster throughout a session. This function lazily creates a 2-worker
65 | #' cluster for use in examples and tests.
66 | #'
67 | #' @param n Number of workers to use; defaults to 2 because this is the maximum
68 | #'   allowed by CRAN.
69 | #' @keywords internal
70 | #' @returns A cached cluster of workers.
71 | #' @export
72 | #' @examples
73 | #' default_cluster()
74 | default_cluster <- function(n = 2) {
75 |   if (!env_has(cluster_env, "cluster")) {
76 |     message("Initialising default cluster of size ", n)
77 |     env_bind(cluster_env, cluster = new_cluster(n))
78 |   } else {
79 |     if (!missing(n)) {
80 |       abort("Can not supply `n` when cluster has already been initiated")
81 |     }
82 |   }
83 | 
84 |   env_get(cluster_env, "cluster")
85 | }
86 | 
87 | cluster_env <- env()
88 | 
89 | 
--------------------------------------------------------------------------------
/R/dplyr-dual.R:
--------------------------------------------------------------------------------
1 | #' @importFrom dplyr same_src
2 | #' @export
3 | same_src.multidplyr_party_df <- function(x, y) {
4 |   is_party_df(y) && identical(x$cluster, y$cluster)
5 | }
6 | 
7 | #' @importFrom dplyr auto_copy
8 | #' @export
9 | auto_copy.multidplyr_party_df <- function(x, y, copy = FALSE, ...) {
10 |   name <- table_name()
11 |   cluster_assign(x$cluster, !!name := y)
12 |   party_df(x$cluster, name)
13 | }
14 | 
15 | # joins -------------------------------------------------------------------
16 | 
17 | #' @importFrom dplyr left_join
18 | #' @export
19 | left_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE, suffix = c(".x", ".y")) {
20 |   y <- auto_copy(x, y, copy = copy)
21 |   shard_call_dual("left_join", x, y, ..., by = by, suffix = suffix)
22 | }
23 | 
24 | #' @importFrom dplyr right_join
25 | #' @export
26 | right_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE, suffix = c(".x", ".y")) {
27 |   y <- auto_copy(x, y, copy = copy)
28 |   shard_call_dual("right_join", x, y, ..., by = by, suffix = suffix)
29 | }
30 | 
31 | #' @importFrom dplyr inner_join
32 | #' @export
33 | inner_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE, suffix = c(".x", ".y")) {
34 |   y <- auto_copy(x, y, copy = copy)
35 |   shard_call_dual("inner_join", x, y, ..., by = by, suffix = suffix)
36 | }
37 | 
38 | #' @importFrom dplyr full_join
39 | #' @export
40 | full_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE, suffix = c(".x", ".y")) {
41 |   y <- auto_copy(x, y, copy = copy)
42 |   shard_call_dual("full_join", x, y, ..., by = by, suffix = suffix)
43 | }
44 | 
45 | #' @importFrom dplyr anti_join
46 | #' @export
47 | anti_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE) {
48 |   y <- auto_copy(x, y, copy = copy)
49 |   shard_call_dual("anti_join", x, y, ..., by = by)
50 | }
51 | 
52 | #' @importFrom dplyr semi_join
53 | #' @export
54 | semi_join.multidplyr_party_df <- function(x, y, ..., by = NULL, copy = FALSE) {
55 |   y <- auto_copy(x, y, copy = copy)
56 |   shard_call_dual("semi_join", x, y, ..., by = by)
57 | }
58 | 
59 | # setops ------------------------------------------------------------------
60 | 
61 | #' @importFrom dplyr intersect
62 | # Exported on load
63 | intersect.multidplyr_party_df <- function(x, y, ..., copy = FALSE) {
64 |   y <- auto_copy(x, y, copy = copy)
65 |   shard_call_dual("intersect", x, y, ...)
66 | }
67 | 
68 | #' @importFrom dplyr union
69 | # Exported on load
70 | union.multidplyr_party_df <- function(x, y, ..., copy = FALSE) {
71 |   y <- auto_copy(x, y, copy = copy)
72 |   shard_call_dual("union", x, y, ...)
73 | }
74 | 
75 | #' @importFrom dplyr union_all
76 | #' @export
77 | union_all.multidplyr_party_df <- function(x, y, ..., copy = FALSE) {
78 |   y <- auto_copy(x, y, copy = copy)
79 |   shard_call_dual("union_all", x, y, ...)
80 | }
81 | 
82 | #' @importFrom dplyr setdiff
83 | # Exported on load
84 | setdiff.multidplyr_party_df <- function(x, y, ..., copy = FALSE) {
85 |   y <- auto_copy(x, y, copy = copy)
86 |   shard_call_dual("setdiff", x, y, ...)
87 | }
88 | 
89 | 
90 | # helpers -----------------------------------------------------------------
91 | 
92 | shard_call_dual <- function(.verb, .x, .y, ...) {
93 |   new_name <- table_name()
94 |   call <- call2(.verb, .x$name, .y$name, ..., .ns = "dplyr")
95 | 
96 |   cluster_send(.x$cluster, !!call2("<-", new_name, call))
97 |   new_party_df(.x$cluster, new_name, auto_rm = TRUE)
98 | }
99 | 
--------------------------------------------------------------------------------
/R/dplyr-single.R:
--------------------------------------------------------------------------------
1 | 
2 | #' @importFrom dplyr arrange
3 | #' @export
4 | arrange.multidplyr_party_df <- function(.data, ..., .by_group = FALSE) {
5 |   shard_call(.data, "arrange", enquos(...), .by_group = .by_group)
6 | }
7 | 
8 | #' @importFrom dplyr filter
9 | # exported on load
10 | filter.multidplyr_party_df <- function(.data, ...) {
11 |   shard_call(.data, "filter", enquos(...))
12 | }
13 | 
14 | #' @importFrom dplyr group_by
15 | #' @export
16 | group_by.multidplyr_party_df <- function(.data, ..., .add = FALSE) {
17 |   shard_call(.data, "group_by", enquos(...), .add = .add)
18 | }
19 | 
20 | #' @importFrom dplyr ungroup
21 | #' @export
22 | ungroup.multidplyr_party_df <- function(x, ...) {
23 |   shard_call(x, "ungroup", enquos(...))
24 | }
25 | 
26 | #' @importFrom dplyr mutate
27 | #' @export
28 | mutate.multidplyr_party_df <- function(.data, ...) {
29 |   shard_call(.data, "mutate", enquos(...))
30 | }
31 | 
32 | #' @importFrom dplyr rename
33 | #' @export
34 | rename.multidplyr_party_df <- function(.data, ...) {
35 |   shard_call(.data, "rename", enquos(...))
36 | }
37 | 
38 | #' @importFrom dplyr select
39 | #' @export
40 | select.multidplyr_party_df <- function(.data, ...) {
41 |   shard_call(.data, "select", enquos(...))
42 | }
43 | 
44 | #' @importFrom dplyr slice
45 | #' @export
46 | slice.multidplyr_party_df <- function(.data, ...) {
47 |   shard_call(.data, "slice", enquos(...))
48 | }
49 | 
50 | #' @importFrom dplyr summarise
51 | #' @export
52 | summarise.multidplyr_party_df <- function(.data, ...) {
53 |   shard_call(.data, "summarise", enquos(...))
54 | }
55 | 
56 | #' @importFrom dplyr do
57 | #' @export
58 | do.multidplyr_party_df <- function(.data, ...) {
59 |   shard_call(.data, "do", enquos(...))
60 | }
61 | 
62 | 
63 | # helpers -----------------------------------------------------------------
64 | 
65 | shard_call <- function(.data, .verb, dots, ...) {
66 |   call <- call2(.verb, .data$name, !!!dots, ..., .ns = "dplyr")
67 | 
68 |   new_name <- table_name()
69 |   cluster_send(.data$cluster, !!call2("<-", new_name, call))
70 |   new_party_df(.data$cluster, new_name, auto_rm = TRUE)
71 | }
72 | 
--------------------------------------------------------------------------------
/R/multidplyr-package.R:
--------------------------------------------------------------------------------
1 | #' @keywords internal
2 | #' @import rlang
3 | #' @importFrom tibble tibble
4 | "_PACKAGE"
5 | 
6 | # The following block is used by usethis to automatically manage
7 | # roxygen namespace tags. Modify with care!
8 | ## usethis namespace: start
9 | ## usethis namespace: end
10 | NULL
11 | 
--------------------------------------------------------------------------------
/R/partydf.R:
--------------------------------------------------------------------------------
1 | #' Partition data across workers in a cluster
2 | #'
3 | #' Partitioning ensures that all observations in a group end up on the same
4 | #' worker. To try and keep the observations on each worker balanced,
5 | #' `partition()` uses a greedy algorithm that iteratively assigns each group to
6 | #' the worker that currently has the fewest rows.
7 | #'
8 | #' @param data Dataset to partition, typically grouped. When grouped, all
9 | #'   observations in a group will be assigned to the same worker.
10 | #' @param cluster Cluster to use.
11 | #' @returns A [party_df].
12 | #' @export
13 | #' @examples
14 | #' library(dplyr)
15 | #' cl <- default_cluster()
16 | #' cluster_library(cl, "dplyr")
17 | #'
18 | #' mtcars2 <- partition(mtcars, cl)
19 | #' mtcars2 %>% mutate(cyl2 = 2 * cyl)
20 | #' mtcars2 %>% filter(vs == 1)
21 | #' mtcars2 %>% group_by(cyl) %>% summarise(n())
22 | #' mtcars2 %>% select(-cyl)
23 | partition <- function(data, cluster) {
24 |   worker_id <- worker_id(data, cluster)
25 |   worker_rows <- split(seq_along(worker_id), worker_id)
26 | 
27 |   if (length(worker_rows) < length(cluster)) {
28 |     message("Using partial cluster of size ", length(worker_rows))
29 |     cluster <- cluster[seq_along(worker_rows)]
30 |   }
31 |   shards <- lapply(worker_rows, function(i) data[i, , drop = FALSE])
32 | 
33 |   name <- table_name()
34 |   cluster_assign_each(cluster, !!name := shards)
35 |   new_party_df(cluster, name, auto_rm = TRUE)
36 | }
37 | 
38 | worker_id <- function(data, cluster) {
39 |   n <- nrow(data)
40 |   m <- length(cluster)
41 | 
42 |   if (!dplyr::is_grouped_df(data)) {
43 |     # Assign sequentially
44 |     (seq_len(n) - 1) %% m + 1
45 |   } else {
46 |     # Assign each new group to the session with fewest rows
47 |     group_id <- dplyr::group_indices(data)
48 |     counts <- tabulate(group_id)
49 | 
50 |     rows <- integer(m)
51 |     group_worker_id <- integer(length(counts))
52 | 
53 |     for (i in seq_along(counts)) {
54 |       j <- which.min(rows)
55 |       group_worker_id[[i]] <- j
56 |       rows[[j]] <- rows[[j]] + counts[[i]]
57 |     }
58 | 
59 |     group_worker_id[group_id]
60 |   }
61 | }
62 | 
63 | # Constructor ---------------------------------------------------------------
64 | 
65 | #' A `party_df` partitioned data frame
66 | #'
67 | #' This S3 class represents a data frame partitioned across workers in a
68 | #' cluster. You can use this constructor if you already have
69 | #' data frames spread across a cluster. If not, start with [partition()]
70 | #' instead.
71 | #'
72 | #' @export
73 | #' @param cluster A cluster
74 | #' @param name Name of data frame variable. Must exist on every worker,
75 | #'   be a data frame, and have the same names.
76 | #' @param auto_rm If `TRUE`, will automatically `rm()` the data frame on
77 | #'   the workers when this object is deleted.
78 | #' @return An S3 object with class `multidplyr_party_df`.
79 | #' @export
80 | #' @examples
81 | #' # In a real example, you might spread file names across the workers
82 | #' # and read in using data.table::fread()/vroom::vroom()/qs::qread().
83 | #' cl <- default_cluster()
84 | #' cluster_send(cl[1], n <- 10)
85 | #' cluster_send(cl[2], n <- 15)
86 | #' cluster_send(cl, df <- data.frame(x = runif(n)))
87 | #'
88 | #' df <- party_df(cl, "df")
89 | #' df
90 | party_df <- function(cluster, name, auto_rm = FALSE) {
91 |   stopifnot(is_cluster(cluster))
92 |   stopifnot(is_string(name))
93 | 
94 |   # Check that variable exists, is data frame, and has same names
95 |   exists <- unlist(cluster_call(cluster, exists(!!name)))
96 |   if (!all(exists)) {
97 |     abort(paste0("`", name, "` does not exist on all workers"))
98 |   }
99 | 
100 |   is_df <- unlist(cluster_call(cluster, is.data.frame(!!sym(name))))
101 |   if (!all(is_df)) {
102 |     abort(paste0("`", name, "` is not a data frame on all workers"))
103 |   }
104 | 
105 |   names <- cluster_call(cluster, names(!!sym(name)))
106 |   if (length(unique(names)) != 1) {
107 |     abort(paste0("`", name, "` does not have the same names on all workers"))
108 |   }
109 | 
110 |   new_party_df(cluster, name, auto_rm = auto_rm)
111 | }
112 | 
113 | new_party_df <- function(cluster, name, auto_rm) {
114 |   stopifnot(is_cluster(cluster))
115 |   stopifnot(is_string(name))
116 | 
117 |   structure(
118 |     list(
119 |       cluster = cluster,
120 |       name = sym(name),
121 |       .auto_clean = shard_deleter(auto_rm, name, cluster)
122 |     ),
123 |     class = "multidplyr_party_df"
124 |   )
125 | }
126 | 
127 | is_party_df <- function(x) inherits(x, "multidplyr_party_df")
128 | 
129 | shard_deleter <- function(auto_rm, name, cluster) {
130 |   if (!auto_rm) {
131 |     return(NULL)
132 |   }
133 | 
134 |   env <- new_environment()
135 |   reg.finalizer(env, function(...) attr(cluster, "cleaner")$add(name))
136 |   env
137 | }
138 | 
139 | shard_rows <- function(x) {
140 |   nrows <- cluster_call(x$cluster, nrow(!!x$name))
141 |   unlist(nrows)
142 | }
143 | 
144 | shard_cols <- function(x) {
145 |   cluster_call(x$cluster[1], ncol(!!x$name))[[1]]
146 | }
147 | 
148 | #' @importFrom tibble tbl_sum
149 | #' @export
150 | tbl_sum.multidplyr_party_df <- function(x) {
151 |   character()
152 | }
153 | 
154 | #' @importFrom dplyr tbl_vars
155 | #' @export
156 | tbl_vars.multidplyr_party_df <- function(x) {
157 |   cluster_call(x$cluster[1], dplyr::tbl_vars(!!x$name))[[1]]
158 | }
159 | 
160 | #' @importFrom dplyr group_vars
161 | #' @export
162 | group_vars.multidplyr_party_df <- function(x) {
163 |   cluster_call(x$cluster[1], dplyr::group_vars(!!x$name))[[1]]
164 | }
165 | 
166 | #' @export
167 | dim.multidplyr_party_df <- function(x) {
168 |   c(sum(shard_rows(x)), shard_cols(x))
169 | }
170 | 
171 | #' @importFrom utils head
172 | #' @export
173 | head.multidplyr_party_df <- function(x, n = 6L, ...) {
174 |   pieces <- vector("list", length(x$cluster))
175 |   left <- n
176 | 
177 |   # Work worker by worker until we have enough rows
178 |   for (i in seq_along(x$cluster)) {
179 |     head_i <- cluster_call(x$cluster[i], head(!!x$name, n = !!left))[[1]]
180 | 
181 |     pieces[[i]] <- head_i
182 |     left <- left - nrow(head_i)
183 |     if (left == 0)
184 |       break
185 |   }
186 | 
187 |   dplyr::bind_rows(pieces)
188 | }
189 | 
190 | #' @export
191 | print.multidplyr_party_df <- function(x, ..., n = NULL, width = NULL) {
192 |   cat("Source: party_df ", dplyr::dim_desc(x), "\n", sep = "")
193 | 
194 |   groups <- group_vars(x)
195 |   if (length(groups) > 0) {
196 |     cat("Groups: ", paste0(groups, collapse = ", "), "\n", sep = "")
197 |   }
198 | 
199 |   shards <- shard_rows(x)
200 |   cat("Shards: ", length(shards),
201 |     " [", big_mark(min(shards)), "--", big_mark(max(shards)), " rows]\n",
202 |     sep = "")
203 |   cat("\n")
204 |   print(tibble::trunc_mat(x, n = n %||% 6, width = width)) # honour a user-supplied `n`, defaulting to 6 rows
205 | 
206 |   invisible(x)
207 | }
208 | 
209 | #' @export
210 | as.data.frame.multidplyr_party_df <- function(x, row.names, optional, ...) {
211 |   dplyr::bind_rows(cluster_call(x$cluster, !!x$name))
212 | }
213 | 
214 | #' @importFrom dplyr collect
215 | #' @export
216 | collect.multidplyr_party_df <- function(x, ...) {
217 |   out <- as.data.frame(x)
218 |   dplyr::grouped_df(out, group_vars(x))
219 | }
220 | 
221 | #' @importFrom dplyr pull
222 | #' @export
223 | pull.multidplyr_party_df <- function(.data, var = -1, ...) {
224 |   var <- tidyselect::vars_pull(tbl_vars(.data), {{ var }})
225 | 
226 |   .data <- ungroup(.data)
227 |   .data <- select(.data, !!sym(var))
228 |   .data <- collect(.data)
229 |   .data[[1]]
230 | }
231 | 
--------------------------------------------------------------------------------
/R/utils.R:
--------------------------------------------------------------------------------
1 | #' @importFrom magrittr %>%
2 | #' @export
3 | magrittr::`%>%`
4 | 
5 | table_name <- local({
6 |   i <- 0
7 |   function() {
8 |     i <<- i + 1
9 |     paste0("_DF", i)
10 |   }
11 | })
12 | 
13 | big_mark <- function(x, ...) {
14 |   mark <- if (identical(getOption("OutDec"), ",")) "." else ","
15 |   formatC(x, big.mark = mark, ...)
16 | }
17 | 
18 | cat_line <- function(...) {
19 |   cat(paste(..., "\n", collapse = "", sep = ""))
20 | }
21 | 
22 | 
--------------------------------------------------------------------------------
/R/zzz.R:
--------------------------------------------------------------------------------
1 | # nocov start
2 | .onLoad <- function(...) {
3 |   register_s3_method("dplyr", "filter", "multidplyr_party_df")
4 |   register_s3_method("dplyr", "intersect", "multidplyr_party_df")
5 |   register_s3_method("dplyr", "union", "multidplyr_party_df")
6 |   register_s3_method("dplyr", "setdiff", "multidplyr_party_df")
7 | }
8 | 
9 | register_s3_method <- function(pkg, generic, class, fun = NULL) {
10 |   stopifnot(is.character(pkg), length(pkg) == 1)
11 |   stopifnot(is.character(generic), length(generic) == 1)
12 |   stopifnot(is.character(class), length(class) == 1)
13 | 
14 |   if (is.null(fun)) {
15 |     fun <- get(paste0(generic, ".", class), envir = parent.frame())
16 |   } else {
17 |     stopifnot(is.function(fun))
18 |   }
19 | 
20 |   if (pkg %in% loadedNamespaces()) {
21 |     registerS3method(generic, class, fun, envir = asNamespace(pkg))
22 |   }
23 | 
24 |   # Always register hook in case package is later unloaded & reloaded
25 |   setHook(
26 |     packageEvent(pkg, "onLoad"),
27 |     function(...) {
28 |       registerS3method(generic, class, fun, envir = asNamespace(pkg))
29 |     }
30 |   )
31 | }
32 | # nocov end
33 | 
--------------------------------------------------------------------------------
/README.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | output: github_document
3 | ---
4 | 
5 | <!-- README.md is generated from README.Rmd. Please edit that file -->
6 | 
7 | ```{r, include = FALSE}
8 | knitr::opts_chunk$set(
9 |   collapse = TRUE,
10 |   comment = "#>",
11 |   fig.path = "man/figures/README-",
12 |   out.width = "100%"
13 | )
14 | ```
15 | 
16 | # multidplyr
17 | 
18 | <!-- badges: start -->
19 | [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
20 | [![R-CMD-check](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml)
21 | [![Codecov test coverage](https://codecov.io/gh/tidyverse/multidplyr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/multidplyr?branch=main)
22 | [![CRAN status](https://www.r-pkg.org/badges/version/multidplyr)](https://cran.r-project.org/package=multidplyr)
23 | <!-- badges: end -->
24 | 
25 | ## Overview
26 | 
27 | multidplyr is a backend for dplyr that partitions a data frame across multiple cores. You tell multidplyr how to split the data up with `partition()` and then the data stays on each node until you explicitly retrieve it with `collect()`. This minimises the amount of time spent moving data around, and maximises parallel performance. This idea is inspired by [partools](https://github.com/matloff/partools) by Norm Matloff and [distributedR](https://github.com/vertica/DistributedR) by the Vertica Analytics team.
28 | 
29 | Due to the overhead associated with communicating between the nodes, you won't see much performance improvement with simple operations on less than ~10 million observations, and you may want to instead try [dtplyr](https://dtplyr.tidyverse.org/), which uses [data.table](https://R-datatable.com/). multidplyr's strength is found in parallelising calls to slower and more complex functions.
30 | 
31 | (Note that unlike other packages in the tidyverse, multidplyr requires R 3.6 or greater. We hope to relax this requirement [in the future](https://github.com/traversc/qs/issues/11).)
32 | 
33 | ## Installation
34 | 
35 | You can install the released version of multidplyr from [CRAN](https://CRAN.R-project.org) with:
36 | 
37 | ``` r
38 | install.packages("multidplyr")
39 | ```
40 | 
41 | And the development version from [GitHub](https://github.com/) with:
42 | 
43 | ``` r
44 | # install.packages("pak")
45 | pak::pak("tidyverse/multidplyr")
46 | ```
47 | 
48 | ## Usage
49 | 
50 | To use multidplyr, you first create a cluster of the desired number of workers. Each one of these workers is a separate R process, and the operating system will spread their execution across multiple cores:
51 | 
52 | ```{r setup}
53 | library(multidplyr)
54 | 
55 | cluster <- new_cluster(4)
56 | cluster_library(cluster, "dplyr")
57 | ```
58 | 
59 | There are two primary ways to use multidplyr.
The first, and most efficient, way is to read different files on each worker:
60 | 
61 | ```{r, eval = FALSE}
62 | # Create a filename vector containing different values on each worker
63 | cluster_assign_each(cluster, filename = c("a.csv", "b.csv", "c.csv", "d.csv"))
64 | 
65 | # Use vroom to quickly load the csvs
66 | cluster_send(cluster, my_data <- vroom::vroom(filename))
67 | 
68 | # Create a party_df using the my_data variable on each worker
69 | my_data <- party_df(cluster, "my_data")
70 | ```
71 | 
72 | Alternatively, if you already have the data loaded in the main session, you can use `partition()` to automatically spread it across the workers. Before calling `partition()`, it's a good idea to call `group_by()` to ensure that all of the observations belonging to a group end up on the same worker.
73 | 
74 | ```{r}
75 | library(nycflights13)
76 | 
77 | flight_dest <- flights %>% group_by(dest) %>% partition(cluster)
78 | flight_dest
79 | ```
80 | 
81 | Now you can work with it like a regular data frame, but the computations will be spread across multiple cores. Once you've finished computation, use `collect()` to bring the data back to the host session:
82 | 
83 | ```{r}
84 | flight_dest %>%
85 |   summarise(delay = mean(dep_delay, na.rm = TRUE), n = n()) %>%
86 |   collect()
87 | ```
88 | 
89 | Note that there is some overhead associated with copying data from the worker nodes back to the host node (and vice versa), so you're best off using multidplyr with more complex operations. See `vignette("multidplyr")` for more details.
90 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | <!-- README.md is generated from README.Rmd. Please edit that file -->
3 | 
4 | # multidplyr
5 | 
6 | <!-- badges: start -->
7 | 
8 | [![Lifecycle:
9 | experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
10 | [![R-CMD-check](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml)
11 | [![Codecov test
12 | coverage](https://codecov.io/gh/tidyverse/multidplyr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/multidplyr?branch=main)
13 | [![CRAN
14 | status](https://www.r-pkg.org/badges/version/multidplyr)](https://cran.r-project.org/package=multidplyr)
15 | <!-- badges: end -->
16 | 
17 | ## Overview
18 | 
19 | multidplyr is a backend for dplyr that partitions a data frame across
20 | multiple cores. You tell multidplyr how to split the data up with
21 | `partition()` and then the data stays on each node until you explicitly
22 | retrieve it with `collect()`. This minimises the amount of time spent
23 | moving data around, and maximises parallel performance. This idea is
24 | inspired by [partools](https://github.com/matloff/partools) by Norm
25 | Matloff and [distributedR](https://github.com/vertica/DistributedR) by
26 | the Vertica Analytics team.
27 | 
28 | Due to the overhead associated with communicating between the nodes, you
29 | won’t see much performance improvement with simple operations on less
30 | than ~10 million observations, and you may want to instead try
31 | [dtplyr](https://dtplyr.tidyverse.org/), which uses
32 | [data.table](https://R-datatable.com/). multidplyr’s strength is found in
33 | parallelising calls to slower and more complex functions.
34 | 
35 | (Note that unlike other packages in the tidyverse, multidplyr requires R
36 | 3.6 or greater.
We hope to relax this requirement [in the
37 | future](https://github.com/traversc/qs/issues/11).)
38 | 
39 | ## Installation
40 | 
41 | You can install the released version of multidplyr from
42 | [CRAN](https://CRAN.R-project.org) with:
43 | 
44 | ``` r
45 | install.packages("multidplyr")
46 | ```
47 | 
48 | And the development version from [GitHub](https://github.com/) with:
49 | 
50 | ``` r
51 | # install.packages("pak")
52 | pak::pak("tidyverse/multidplyr")
53 | ```
54 | 
55 | ## Usage
56 | 
57 | To use multidplyr, you first create a cluster of the desired number of
58 | workers. Each one of these workers is a separate R process, and the
59 | operating system will spread their execution across multiple cores:
60 | 
61 | ``` r
62 | library(multidplyr)
63 | 
64 | cluster <- new_cluster(4)
65 | cluster_library(cluster, "dplyr")
66 | #> 
67 | #> Attaching package: 'dplyr'
68 | #> The following objects are masked from 'package:stats':
69 | #> 
70 | #>     filter, lag
71 | #> The following objects are masked from 'package:base':
72 | #> 
73 | #>     intersect, setdiff, setequal, union
74 | ```
75 | 
76 | There are two primary ways to use multidplyr. The first, and most
77 | efficient, way is to read different files on each worker:
78 | 
79 | ``` r
80 | # Create a filename vector containing different values on each worker
81 | cluster_assign_each(cluster, filename = c("a.csv", "b.csv", "c.csv", "d.csv"))
82 | 
83 | # Use vroom to quickly load the csvs
84 | cluster_send(cluster, my_data <- vroom::vroom(filename))
85 | 
86 | # Create a party_df using the my_data variable on each worker
87 | my_data <- party_df(cluster, "my_data")
88 | ```
89 | 
90 | Alternatively, if you already have the data loaded in the main session,
91 | you can use `partition()` to automatically spread it across the workers.
92 | Before calling `partition()`, it’s a good idea to call `group_by()` to
93 | ensure that all of the observations belonging to a group end up on the
94 | same worker.
95 | 
96 | ``` r
97 | library(nycflights13)
98 | 
99 | flight_dest <- flights %>% group_by(dest) %>% partition(cluster)
100 | flight_dest
101 | #> Source: party_df [336,776 x 19]
102 | #> Groups: dest
103 | #> Shards: 4 [81,594--86,548 rows]
104 | #> 
105 | #> # A data frame: 336,776 × 19
106 | #>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
107 | #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
108 | #> 1  2013     1     1      544            545        -1     1004           1022
109 | #> 2  2013     1     1      558            600        -2      923            937
110 | #> 3  2013     1     1      559            600        -1      854            902
111 | #> 4  2013     1     1      602            610        -8      812            820
112 | #> 5  2013     1     1      602            605        -3      821            805
113 | #> 6  2013     1     1      611            600        11      945            931
114 | #> # ℹ 336,770 more rows
115 | #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
116 | #> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
117 | #> #   hour <dbl>, minute <dbl>, time_hour <dttm>
118 | ```
119 | 
120 | Now you can work with it like a regular data frame, but the computations
121 | will be spread across multiple cores.
Once you’ve finished computation, 122 | use `collect()` to bring the data back to the host session: 123 | 124 | ``` r 125 | flight_dest %>% 126 | summarise(delay = mean(dep_delay, na.rm = TRUE), n = n()) %>% 127 | collect() 128 | #> # A tibble: 105 × 3 129 | #> dest delay n 130 | #> 131 | #> 1 ABQ 13.7 254 132 | #> 2 AUS 13.0 2439 133 | #> 3 BQN 12.4 896 134 | #> 4 BTV 13.6 2589 135 | #> 5 BUF 13.4 4681 136 | #> 6 CLE 13.4 4573 137 | #> 7 CMH 12.2 3524 138 | #> 8 DEN 15.2 7266 139 | #> 9 DSM 26.2 569 140 | #> 10 DTW 11.8 9384 141 | #> # ℹ 95 more rows 142 | ``` 143 | 144 | Note that there is some overhead associated with copying data from the 145 | worker nodes back to the host node (and vice versa), so you’re best off 146 | using multidplyr with more complex operations. See 147 | `vignette("multidplyr")` for more details. 148 | -------------------------------------------------------------------------------- /_pkgdown.yml: -------------------------------------------------------------------------------- 1 | url: https://multidplyr.tidyverse.org 2 | 3 | template: 4 | package: tidytemplate 5 | bootstrap: 5 6 | 7 | includes: 8 | in_header: | 9 | 10 | 11 | development: 12 | mode: auto 13 | 14 | news: 15 | releases: 16 | - text: "Version 0.1.0" 17 | href: https://www.tidyverse.org/blog/2021/02/dplyr-backends/ 18 | -------------------------------------------------------------------------------- /codecov.yml: -------------------------------------------------------------------------------- 1 | comment: false 2 | 3 | coverage: 4 | status: 5 | project: 6 | default: 7 | target: auto 8 | threshold: 1% 9 | informational: true 10 | patch: 11 | default: 12 | target: auto 13 | threshold: 1% 14 | informational: true 15 | -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## R CMD check results 2 | 3 | 0 errors | 0 warnings | 0 note 4 | 5 | ## revdepcheck results 6 | 7 | This was a patch release to fix an R CMD check failure; I did not check revdeps. 8 | -------------------------------------------------------------------------------- /man/cluster_call.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/cluster-call.R 3 | \name{cluster_call} 4 | \alias{cluster_call} 5 | \alias{cluster_send} 6 | \title{Call a function on each node of a cluster} 7 | \usage{ 8 | cluster_call(cluster, code, simplify = FALSE, ptype = NULL) 9 | 10 | cluster_send(cluster, code) 11 | } 12 | \arguments{ 13 | \item{cluster}{A cluster.} 14 | 15 | \item{code}{An expression to execute on each worker.} 16 | 17 | \item{simplify}{Should the results be simplified from a list? 18 | * `TRUE`: simplify or die trying. 19 | * `NA`: simplify if possible. 20 | * `FALSE`: never try to simplify, always leaving as a list. 21 | 22 | `code` must return a vector of length one in order for simplification 23 | to succeed.} 24 | 25 | \item{ptype}{If `simplify` is `TRUE`, use `ptype` to enforce the desired 26 | output type.} 27 | } 28 | \value{ 29 | A list of results with one element for each worker in `cluster`. 30 | } 31 | \description{ 32 | `cluster_call()` executes the code on each worker and returns the results; 33 | `cluster_send()` executes the code ignoring the result. Jobs are submitted 34 | to workers in parallel, and then we wait until they're complete. 
35 | } 36 | \examples{ 37 | cl <- default_cluster() 38 | 39 | # Run code on each worker and retrieve results 40 | cluster_call(cl, Sys.getpid()) 41 | cluster_call(cl, runif(1)) 42 | 43 | # use simplify = TRUE to reduce the list to a vector 44 | cluster_call(cl, runif(1), simplify = TRUE) 45 | 46 | # use cluster_send() to ignore results 47 | cluster_send(cl, x <- runif(1)) 48 | cluster_call(cl, x, simplify = TRUE) 49 | } 50 | -------------------------------------------------------------------------------- /man/cluster_utils.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/cluster-utils.R 3 | \name{cluster_utils} 4 | \alias{cluster_utils} 5 | \alias{cluster_assign} 6 | \alias{cluster_assign_each} 7 | \alias{cluster_assign_partition} 8 | \alias{cluster_copy} 9 | \alias{cluster_rm} 10 | \alias{cluster_library} 11 | \title{Cluster utility functions} 12 | \usage{ 13 | cluster_assign(.cluster, ...) 14 | 15 | cluster_assign_each(.cluster, ...) 16 | 17 | cluster_assign_partition(.cluster, ...) 18 | 19 | cluster_copy(cluster, names, env = caller_env()) 20 | 21 | cluster_rm(cluster, names) 22 | 23 | cluster_library(cluster, packages) 24 | } 25 | \arguments{ 26 | \item{...}{Name-value pairs} 27 | 28 | \item{cluster, .cluster}{Cluster to work on} 29 | 30 | \item{names}{Names of variables to copy.} 31 | 32 | \item{env}{Environment in which to look for variables to copy.} 33 | 34 | \item{packages}{Character vector of packages to load.} 35 | } 36 | \value{ 37 | Functions that modify the worker environment invisibly return 38 | `cluster` so calls can be piped together. The other functions return 39 | lists with one element for each worker. 40 | } 41 | \description{ 42 | These functions provide useful helpers for performing common operations. 43 | `cluster_assign()` assigns the same value on each worker; 44 | `cluster_assign_each()` assigns different values on each worker; 45 | `cluster_assign_partition()` partitions vectors so that each worker gets 46 | (approximately) the same number of pieces.
47 | } 48 | \examples{ 49 | cl <- default_cluster() 50 | cluster_assign(cl, a = runif(1)) 51 | cluster_call(cl, a) 52 | 53 | # Assign different values on each worker 54 | cluster_assign_each(cl, b = c(1, 10)) 55 | cluster_call(cl, b) 56 | 57 | # Partition a vector so that each worker gets approximately the 58 | # same amount of it 59 | cluster_assign_partition(cl, c = 1:11) 60 | cluster_call(cl, c) 61 | 62 | # If you want to compute different values on each 63 | # worker, use `cluster_call()` directly: 64 | cluster_call(cl, d <- runif(1)) 65 | cluster_call(cl, d) 66 | 67 | # cluster_copy() is a useful shortcut 68 | e <- 10 69 | cluster_copy(cl, "e") 70 | 71 | cluster_call(cl, ls()) 72 | cluster_rm(cl, letters[1:5]) 73 | cluster_call(cl, ls()) 74 | 75 | # Use cluster_library() to load packages 76 | cluster_call(cl, search()) 77 | cluster_library(cl, "magrittr") 78 | cluster_call(cl, search()) 79 | } 80 | -------------------------------------------------------------------------------- /man/default_cluster.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/cluster.R 3 | \name{default_cluster} 4 | \alias{default_cluster} 5 | \title{Default cluster} 6 | \usage{ 7 | default_cluster(n = 2) 8 | } 9 | \arguments{ 10 | \item{n}{Number of workers to use; defaults to 2 because this is the maximum 11 | allowed by CRAN.} 12 | } 13 | \value{ 14 | A cached cluster of workers. 15 | } 16 | \description{ 17 | Setting up a cluster is relatively expensive, so it's best to use a single 18 | cluster throughout a session. This function lazily creates a 2-worker 19 | cluster for use in examples and tests. 20 | } 21 | \examples{ 22 | default_cluster() 23 | } 24 | \keyword{internal} 25 | -------------------------------------------------------------------------------- /man/multidplyr-package.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/multidplyr-package.R 3 | \docType{package} 4 | \name{multidplyr-package} 5 | \alias{multidplyr} 6 | \alias{multidplyr-package} 7 | \title{multidplyr: A Multi-Process 'dplyr' Backend} 8 | \description{ 9 | Partition a data frame across multiple worker processes to provide simple multicore parallelism. 10 | } 11 | \seealso{ 12 | Useful links: 13 | \itemize{ 14 | \item \url{https://multidplyr.tidyverse.org} 15 | \item \url{https://github.com/tidyverse/multidplyr} 16 | \item Report bugs at \url{https://github.com/tidyverse/multidplyr/issues} 17 | } 18 | 19 | } 20 | \author{ 21 | \strong{Maintainer}: Hadley Wickham \email{hadley@posit.co} 22 | 23 | Other contributors: 24 | \itemize{ 25 | \item Posit Software, PBC [copyright holder, funder] 26 | } 27 | 28 | } 29 | \keyword{internal} 30 | -------------------------------------------------------------------------------- /man/new_cluster.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/cluster.R 3 | \name{new_cluster} 4 | \alias{new_cluster} 5 | \title{Create a new cluster with sensible defaults.} 6 | \usage{ 7 | new_cluster(n) 8 | } 9 | \arguments{ 10 | \item{n}{Number of workers to create. Avoid setting this higher than the 11 | number of cores in your computer as it will degrade performance.} 12 | } 13 | \value{ 14
15 | } 16 | \description{ 17 | Clusters created with this function will automatically clean up after 18 | themselves. 19 | } 20 | \examples{ 21 | cluster <- new_cluster(2) 22 | cluster 23 | } 24 | -------------------------------------------------------------------------------- /man/partition.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/partydf.R 3 | \name{partition} 4 | \alias{partition} 5 | \title{Partition data across workers in a cluster} 6 | \usage{ 7 | partition(data, cluster) 8 | } 9 | \arguments{ 10 | \item{data}{Dataset to partition, typically grouped. When grouped, all 11 | observations in a group will be assigned to the same cluster.} 12 | 13 | \item{cluster}{Cluster to use.} 14 | } 15 | \value{ 16 | A [party_df]. 17 | } 18 | \description{ 19 | Partitioning ensures that all observations in a group end up on the same 20 | worker. To try and keep the observations on each worker balanced, 21 | `partition()` uses a greedy algorithm that iteratively assigns each group to 22 | the worker that currently has the fewest rows. 23 | } 24 | \examples{ 25 | library(dplyr) 26 | cl <- default_cluster() 27 | cluster_library(cl, "dplyr") 28 | 29 | mtcars2 <- partition(mtcars, cl) 30 | mtcars2 \%>\% mutate(cyl2 = 2 * cyl) 31 | mtcars2 \%>\% filter(vs == 1) 32 | mtcars2 \%>\% group_by(cyl) \%>\% summarise(n()) 33 | mtcars2 \%>\% select(-cyl) 34 | } 35 | -------------------------------------------------------------------------------- /man/party_df.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/partydf.R 3 | \name{party_df} 4 | \alias{party_df} 5 | \title{A `party_df` partitioned data frame} 6 | \usage{ 7 | party_df(cluster, name, auto_rm = FALSE) 8 | } 9 | \arguments{ 10 | \item{cluster}{A cluster} 11 | 12 | \item{name}{Name of data frame variable. Must exist on every worker, 13 | be a data frame, and have the same names.} 14 | 15 | \item{auto_rm}{If `TRUE`, will automatically `rm()` the data frame on 16 | the workers when this object is created.} 17 | } 18 | \value{ 19 | An S3 object with class `multidplyr_party_df`. 20 | } 21 | \description{ 22 | This S3 class represents a data frame partitioned across workers in a 23 | cluster. You can use this constructor if you have already spread 24 | data frames spread across a cluster. If not, start with [partition()] 25 | instead. 26 | } 27 | \examples{ 28 | # If a real example, you might spread file names across the clusters 29 | # and read in using data.table::fread()/vroom::vroom()/qs::qread(). 30 | cl <- default_cluster() 31 | cluster_send(cl[1], n <- 10) 32 | cluster_send(cl[2], n <- 15) 33 | cluster_send(cl, df <- data.frame(x = runif(n))) 34 | 35 | df <- party_df(cl, "df") 36 | df 37 | } 38 | -------------------------------------------------------------------------------- /man/reexports.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/utils.R 3 | \docType{import} 4 | \name{reexports} 5 | \alias{reexports} 6 | \alias{\%>\%} 7 | \title{Objects exported from other packages} 8 | \keyword{internal} 9 | \description{ 10 | These objects are imported from other packages. Follow the links 11 | below to see their documentation. 
12 | 13 | \describe{ 14 | \item{magrittr}{\code{\link[magrittr:pipe]{\%>\%}}} 15 | }} 16 | 17 | -------------------------------------------------------------------------------- /multidplyr.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: No 4 | SaveWorkspace: No 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: knitr 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | PackageRoxygenize: rd,collate,namespace 22 | -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | library(multidplyr) 3 | 4 | test_check("multidplyr") 5 | -------------------------------------------------------------------------------- /tests/testthat/_snaps/cluster-call.md: -------------------------------------------------------------------------------- 1 | # validates inputs 2 | 3 | Code 4 | cluster_call(cl, 1, simplify = "x") 5 | Condition 6 | Error in `cluster_call()`: 7 | ! `simplify` must be `TRUE`, `FALSE`, or `NA`. 8 | 9 | # old ptype interface works with warning 10 | 11 | Code 12 | out <- cluster_call(cl, 1, ptype = double()) 13 | Condition 14 | Warning: 15 | Must now set `simplify = TRUE` when supplying ptype 16 | 17 | # errors are propagated 18 | 19 | Code 20 | cluster_call(cl, stop("!!")) 21 | Condition 22 | Error in `cluster_call()`: 23 | ! Remote computation failed in worker 1 24 | Caused by error: 25 | ! !! 26 | 27 | # errors capture worker id 28 | 29 | Code 30 | cluster_call(cl, f(x)) 31 | Condition 32 | Error in `cluster_call()`: 33 | ! Remote computation failed in worker 2 34 | Caused by error: 35 | ! Computation failed 36 | 37 | -------------------------------------------------------------------------------- /tests/testthat/_snaps/cluster.md: -------------------------------------------------------------------------------- 1 | # default cluster only creates once 2 | 3 | Can not supply `n` when cluster has already been initiated 4 | 5 | # cluster has useful print method 6 | 7 | Code 8 | cl <- default_cluster() 9 | cl 10 | Output 11 | 2 session cluster [..] 12 | 13 | -------------------------------------------------------------------------------- /tests/testthat/_snaps/dplyr-dual.md: -------------------------------------------------------------------------------- 1 | # joining data frame requires explicit copy 2 | 3 | Code 4 | left_join(pf, df) 5 | Condition 6 | Error in `auto_copy()`: 7 | ! `x` and `y` must share the same src. 8 | i `x` is a object. 9 | i `y` is a object. 10 | i Set `copy = TRUE` if `y` can be copied to the same source as `x` (may be slow). 
11 | 12 | -------------------------------------------------------------------------------- /tests/testthat/_snaps/partydf.md: -------------------------------------------------------------------------------- 1 | # can construct and print partydf 2 | 3 | Code 4 | df 5 | Output 6 | Source: party_df [20 x 1] 7 | Shards: 2 [10--10 rows] 8 | 9 | # A data frame: 20 x 1 10 | y 11 | <int> 12 | 1 1 13 | 2 2 14 | 3 3 15 | 4 4 16 | 5 5 17 | 6 6 18 | # i 14 more rows 19 | 20 | --- 21 | 22 | Code 23 | group_by(df, y) 24 | Output 25 | Source: party_df [20 x 1] 26 | Groups: y 27 | Shards: 2 [10--10 rows] 28 | 29 | # A data frame: 20 x 1 30 | y 31 | <int> 32 | 1 1 33 | 2 2 34 | 3 3 35 | 4 4 36 | 5 5 37 | 6 6 38 | # i 14 more rows 39 | 40 | # name must be data frame with same names 41 | 42 | Code 43 | party_df(cl, "x") 44 | Condition 45 | Error in `party_df()`: 46 | ! `x` does not exist on all workers 47 | 48 | --- 49 | 50 | Code 51 | party_df(cl, "x") 52 | Condition 53 | Error in `party_df()`: 54 | ! `x` is not a data frame on all workers 55 | 56 | --- 57 | 58 | Code 59 | party_df(cl, "x") 60 | Condition 61 | Error in `party_df()`: 62 | ! `x` does not have the same names on all workers 63 | 64 | -------------------------------------------------------------------------------- /tests/testthat/setup.R: -------------------------------------------------------------------------------- 1 | # Ensure the default cluster is initiated silently 2 | suppressMessages(default_cluster()) 3 | -------------------------------------------------------------------------------- /tests/testthat/test-cluster-call.R: -------------------------------------------------------------------------------- 1 | test_that("calls submitted in parallel", { 2 | cl <- default_cluster() 3 | out <- cluster_call(cl, {x <- Sys.time(); Sys.sleep(0.2); x}) 4 | 5 | rng <- range(as.double(out)) 6 | expect_true(rng[2] - rng[1] < 0.2) 7 | }) 8 | 9 | test_that("calls submitted to each node", { 10 | cl <- default_cluster() 11 | pid <- cluster_call(cl, Sys.getpid()) 12 | expect_equal(length(unique(pid)), length(cl)) 13 | }) 14 | 15 | test_that("can simplify results", { 16 | cl <- default_cluster() 17 | out <- cluster_call(cl, 1, simplify = TRUE) 18 | expect_identical(out, c(1, 1)) 19 | }) 20 | 21 | test_that("validates inputs", { 22 | cl <- default_cluster() 23 | expect_snapshot(error = TRUE, { 24 | cluster_call(cl, 1, simplify = "x") 25 | }) 26 | }) 27 | 28 | test_that("old ptype interface works with warning", { 29 | cl <- default_cluster() 30 | expect_snapshot({ 31 | out <- cluster_call(cl, 1, ptype = double()) 32 | }) 33 | expect_identical(out, c(1, 1)) 34 | }) 35 | 36 | 37 | test_that("errors are propagated", { 38 | cl <- default_cluster() 39 | expect_snapshot(cluster_call(cl, stop("!!")), error = TRUE) 40 | }) 41 | 42 | test_that("errors capture worker id", { 43 | cl <- default_cluster() 44 | 45 | f <- function(x) { 46 | if (x == 2) { 47 | rlang::abort("Computation failed") 48 | } 49 | } 50 | cluster_assign(cl, f = f) 51 | cluster_assign_each(cl, x = seq_along(cl)) 52 | 53 | expect_snapshot(cluster_call(cl, f(x)), error = TRUE) 54 | }) 55 | 56 | 57 | test_that("cluster_send() returns cluster", { 58 | cl <- default_cluster() 59 | expect_equal(cluster_send(cl, 10), cl) 60 | }) 61 | -------------------------------------------------------------------------------- /tests/testthat/test-cluster-utils.R: -------------------------------------------------------------------------------- 1 | test_that("can set/get/remove/list", { 2 | cl <- default_cluster()[1] 3 | cluster_assign(cl, x = 1)
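# x should now exist on the worker, holding the assigned value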
4 | 5 | expect_equal(cluster_call(cl, "x" %in% ls()), list(TRUE)) 6 | expect_equal(cluster_call(cl, x), list(1)) 7 | 8 | cluster_rm(cl, "x") 9 | expect_equal(cluster_call(cl, "x" %in% ls()), list(FALSE)) 10 | }) 11 | 12 | test_that("can assign different values to different clusters", { 13 | cl <- default_cluster() 14 | 15 | cluster_assign_each(cl, x = 1:2) 16 | expect_equal(cluster_call(cl, x), list(1L, 2L)) 17 | cluster_rm(cl, "x") 18 | }) 19 | 20 | test_that("can partition vectors across clusters", { 21 | cl <- default_cluster() 22 | 23 | cluster_assign_partition(cl, x = 1:4, y = 1:5) 24 | expect_equal(cluster_call(cl, x), list(1:2, 3:4)) 25 | expect_equal(cluster_call(cl, y), list(1:3, 4:5)) 26 | 27 | cluster_rm(cl, c("x", "y")) 28 | }) 29 | 30 | test_that("can copy from objects to cluster", { 31 | cl <- default_cluster()[1] 32 | x <- 1 33 | y <- 2 34 | 35 | cluster_copy(cl, c("x", "y")) 36 | expect_equal(cluster_call(cl, x)[[1]], x) 37 | expect_equal(cluster_call(cl, y)[[1]], y) 38 | 39 | cluster_rm(cl, c("x", "y")) 40 | }) 41 | 42 | test_that("can load package", { 43 | cl <- new_cluster(1) 44 | 45 | cluster_library(cl, "covr") 46 | expect_true("package:covr" %in% cluster_call(cl, search())[[1]]) 47 | }) 48 | -------------------------------------------------------------------------------- /tests/testthat/test-cluster.R: -------------------------------------------------------------------------------- 1 | test_that("default cluster only creates once", { 2 | cl1 <- default_cluster() 3 | cl2 <- default_cluster() 4 | 5 | expect_identical(cl1, cl2) 6 | 7 | # and errors if you attempt to set n the second time 8 | expect_snapshot_error(default_cluster(10)) 9 | }) 10 | 11 | test_that("subsetting cluster returns cluster", { 12 | expect_s3_class(default_cluster()[1], "multidplyr_cluster") 13 | }) 14 | 15 | test_that("cluster has useful print method", { 16 | expect_snapshot({ 17 | cl <- default_cluster() 18 | cl 19 | }) 20 | }) 21 | -------------------------------------------------------------------------------- /tests/testthat/test-dplyr-dual.R: -------------------------------------------------------------------------------- 1 | test_that("joining data frame requires explicit copy", { 2 | pf <- partition(data.frame(x = 1:6), default_cluster()) 3 | df <- data.frame(x = 1:3, y = 3:1) 4 | 5 | expect_snapshot(error = TRUE, { 6 | left_join(pf, df) 7 | }) 8 | expect_error(left_join(pf, df, copy = TRUE), NA) 9 | }) 10 | 11 | test_that("joins match local results", { 12 | pf1 <- partition(tibble(x = c(1, 2)), default_cluster()) 13 | pf2 <- partition(tibble(x = c(1, 3)), default_cluster()) 14 | 15 | # primarily testing that basic infrastructure works and that 16 | # I haven't accidentally typed the wrong verb name somewhere 17 | expect_equal(pf1 %>% inner_join(pf2, by = "x") %>% pull(x), 1) 18 | expect_equal(pf1 %>% left_join(pf2, by = "x") %>% pull(x), c(1, 2)) 19 | expect_equal(pf1 %>% right_join(pf2, by = "x") %>% pull(x), c(1, 3)) 20 | expect_equal(pf1 %>% full_join(pf2, by = "x") %>% pull(x), c(1, 2, 3)) 21 | 22 | expect_equal(pf1 %>% semi_join(pf2, by = "x") %>% pull(x), 1) 23 | expect_equal(pf1 %>% anti_join(pf2, by = "x") %>% pull(x), 2) 24 | }) 25 | 26 | test_that("set operations match local results", { 27 | pf1 <- partition(tibble(x = c(1, 2)), default_cluster()) 28 | pf2 <- partition(tibble(x = c(1, 3)), default_cluster()) 29 | 30 | expect_equal(pf1 %>% intersect(pf2) %>% pull(), 1) 31 | expect_setequal(pf1 %>% union(pf2) %>% pull(), c(1, 2, 3)) 32 | expect_equal(pf1 %>% union_all(pf2) %>% 
pull() %>% sort(), c(1, 1, 2, 3)) 33 | expect_equal(pf1 %>% setdiff(pf2) %>% pull(), 2) 34 | }) 35 | -------------------------------------------------------------------------------- /tests/testthat/test-dplyr-single.R: -------------------------------------------------------------------------------- 1 | test_that("pass through works", { 2 | cl <- default_cluster() 3 | cluster_assign_each(cl, x = list( 4 | tibble(g = 1, x = 1:2), 5 | tibble(g = 2, x = 3:4) 6 | )) 7 | on.exit(cluster_rm(cl, "x")) 8 | pf <- party_df(cl, "x") 9 | 10 | expect_equal(pf %>% arrange(desc(x)) %>% collect() %>% pull(x), c(2, 1, 4, 3)) 11 | expect_equal(pf %>% filter(x %% 2 == 0) %>% collect() %>% pull(x), c(2, 4)) 12 | expect_equal(pf %>% mutate(y = x + 1) %>% collect() %>% pull(y), 2:5) 13 | expect_equal(pf %>% rename(z = x) %>% collect() %>% pull(z), 1:4) 14 | expect_equal(pf %>% slice(1) %>% collect() %>% pull(x), c(1, 3)) 15 | 16 | expect_equal(pf %>% select(x) %>% collect(), tibble(x = 1:4)) 17 | 18 | expect_equal(pf %>% do(y = 1) %>% collect(), tibble(y = list(1, 1))) 19 | 20 | expect_equal( 21 | pf %>% group_by(g) %>% summarise(x = sum(x)) %>% collect(), 22 | tibble(g = c(1, 2), x = c(3L, 7L)) 23 | ) 24 | }) 25 | -------------------------------------------------------------------------------- /tests/testthat/test-partydf.R: -------------------------------------------------------------------------------- 1 | test_that("can construct and print partydf", { 2 | withr::local_options(lifecycle_verbosity = "quiet") 3 | 4 | cl <- default_cluster() 5 | cl <- cluster_assign(cl, x = data.frame(y = 1:10)) 6 | 7 | df <- party_df(cl, "x") 8 | on.exit(cluster_rm(cl, "x")) 9 | 10 | expect_s3_class(df, "multidplyr_party_df") 11 | expect_true(is_party_df(df)) 12 | expect_equal(df$cluster, cl) 13 | 14 | expect_snapshot(df) 15 | expect_snapshot(group_by(df, y)) 16 | }) 17 | 18 | test_that("name must be data frame with same names", { 19 | cl <- default_cluster() 20 | expect_snapshot(error = TRUE, party_df(cl, "x")) 21 | 22 | cluster_assign(cl, x = 1) 23 | on.exit(cluster_rm(cl, "x")) 24 | expect_snapshot(error = TRUE, party_df(cl, "x")) 25 | 26 | cluster_assign_each(cl, x = list(tibble(x = 1), tibble(y = 2))) 27 | expect_snapshot(error = TRUE, party_df(cl, "x")) 28 | }) 29 | 30 | test_that("can automatically delete on gc() + cluster_call()", { 31 | cl <- default_cluster() 32 | cl <- cluster_assign(cl, x = data.frame(y = 1:10)) 33 | 34 | df <- party_df(cl, "x", auto_rm = TRUE) 35 | rm(df); gc() 36 | expect_equal(cluster_call(cl, exists("x")), list(FALSE, FALSE)) 37 | }) 38 | 39 | # partitioning ------------------------------------------------------------ 40 | 41 | test_that("can partition and re-collect", { 42 | cl <- default_cluster() 43 | 44 | df1 <- tibble::tibble(x = 1:2) 45 | df2 <- partition(df1, cl) 46 | 47 | expect_s3_class(df2, "multidplyr_party_df") 48 | expect_equal(df2$cluster, cl) 49 | 50 | expect_equal(collect(df2), df1) 51 | expect_equal(pull(df2, x), df1$x) 52 | }) 53 | 54 | test_that("can partition by group", { 55 | cl <- default_cluster() 56 | df1 <- tibble(x = c(rep(1, 2), rep(2, 1), rep(3, 1))) 57 | df2 <- df1 %>% group_by(x) %>% partition(cl) 58 | 59 | dfs <- cluster_call(cl, !!df2$name) 60 | expect_equal(dfs, list( 61 | group_by(tibble(x = c(1, 1)), x), 62 | group_by(tibble(x = c(2, 3)), x) 63 | )) 64 | }) 65 | 66 | test_that("reduce cluster size if needed", { 67 | cl <- default_cluster() 68 | df1 <- tibble(x = c(rep(1, 2))) 69 | expect_message( 70 | df2 <- df1 %>% group_by(x) %>% partition(cl), 71
| "partial cluster" 72 | ) 73 | 74 | expect_equal(length(df2$cluster), 1) 75 | }) 76 | -------------------------------------------------------------------------------- /vignettes/.gitignore: -------------------------------------------------------------------------------- 1 | multidplyr.html 2 | multidplyr.R 3 | -------------------------------------------------------------------------------- /vignettes/multidplyr.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "An introduction to multidplyr" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{An introduction to multidplyr} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | %\VignetteEncoding{UTF-8} 8 | --- 9 | 10 | ```{r, include = FALSE} 11 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") 12 | set.seed(1014) 13 | ``` 14 | 15 | multidplyr is a backend for dplyr that spreads work across multiple processes. Like all dplyr backends, it allows you to use the dplyr verbs that you're already familiar with, but alters the underlying computational model to transparently support multi-process parallelism. 16 | 17 | This vignette will show you the basics of multidplyr using the [nycflights13](https://CRAN.R-project.org/package=nycflights13) dataset. 18 | 19 | ```{r setup} 20 | library(multidplyr) 21 | library(dplyr, warn.conflicts = FALSE) 22 | library(nycflights13) 23 | ``` 24 | 25 | ## Creating a cluster 26 | 27 | To start using multidplyr you must create a cluster. Here I use two cores because it's the maximum permitted by CRAN, but I suggest that you use more. For best performance, I recommend using 1 or 2 fewer than the total number of cores on your computer, which you can detect with `parallel::detectCores()` (leaving at least 1 core free means that you should still be able to use your computer for other tasks while your computation is running). 28 | 29 | ```{r} 30 | cluster <- new_cluster(2) 31 | cluster 32 | ``` 33 | 34 | (In the examples, you'll also see the use of `default_cluster()`; this is designed specifically for the constraints of R CMD check, so I don't recommend using it in your own code.) 35 | 36 | A cluster consists of multiple R processes created by [callr](https://callr.r-lib.org/). When multiple processes are running at the same time, your operating system will take care of spreading the work across multiple cores. 37 | 38 | ## Add data 39 | 40 | There are two ways to get data to the workers in the cluster: 41 | 42 | * `partition()` a data frame that is already loaded *in the interactive process*. 43 | * Load a different subset of the data *in each worker*. 44 | 45 | ### `partition()` 46 | 47 | `partition()` is useful if you have a single in-memory data frame. For example, take `nycflights13::flights`. This dataset contains information on about 300,000 flights departing New York City in 2013. We group it by destination, then partition it: 48 | 49 | ```{r} 50 | flights1 <- flights %>% group_by(dest) %>% partition(cluster) 51 | flights1 52 | ``` 53 | 54 | `partition()` splits `flights1` into roughly equal subsets on each worker, ensuring that all rows in a group are transferred to the same worker. The result is a `party_df`, or partitioned data frame. 55 | 56 | ### Direct loading 57 | 58 | `partition()` is simple to call, but it's relatively expensive because it copies a lot of data between processes. An alternative strategy is for each worker to load the data (from files) it needs directly.
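In outline, this strategy needs just two steps: use `cluster_send()` to run the file-reading code on every worker, then wrap the per-worker variable with `party_df()`. Here's a minimal sketch, using hypothetical file names (one per worker in our two-worker cluster); the chunk isn't evaluated:

```{r, eval = FALSE}
# Give each worker its own (hypothetical) file name...
cluster_assign_each(cluster, file = c("shard-1.csv", "shard-2.csv"))

# ...have every worker read its file into a variable of the same name...
cluster_send(cluster, my_shard <- vroom::vroom(file))

# ...and stitch the per-worker data frames into one partitioned data frame
my_shard <- party_df(cluster, "my_shard")
```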
59 | 60 | To show how that works in practice, I'll first split flights up by month and save them as csv files: 61 | 62 | ```{r} 63 | path <- tempfile() 64 | dir.create(path) 65 | 66 | flights %>% 67 | group_by(month) %>% 68 | group_walk(~ vroom::vroom_write(.x, sprintf("%s/month-%02i.csv", path, .y$month))) 69 | ``` 70 | 71 | Now we find all the files in the directory, and divide them up so that each worker gets (approximately) the same number of pieces: 72 | 73 | ```{r} 74 | files <- dir(path, full.names = TRUE) 75 | cluster_assign_partition(cluster, files = files) 76 | ``` 77 | 78 | Then we read in the files on each worker and use `party_df()` to create a partitioned data frame: 79 | 80 | ```{r} 81 | cluster_send(cluster, flights2 <- vroom::vroom(files)) 82 | 83 | flights2 <- party_df(cluster, "flights2") 84 | flights2 85 | ``` 86 | 87 | ## dplyr verbs 88 | 89 | Once you have a partitioned data frame, you can operate on it with the usual dplyr verbs. To bring the data back to the interactive process, use `collect()`: 90 | 91 | ```{r} 92 | flights1 %>% 93 | summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>% 94 | collect() 95 | ``` 96 | 97 | For this size of data and a simple transformation, using a local cluster actually makes performance much worse! 98 | 99 | ```{r} 100 | by_dest <- flights %>% group_by(dest) 101 | 102 | # Local computation 103 | system.time(by_dest %>% summarise(mean(dep_delay, na.rm = TRUE))) 104 | 105 | # Remote: partitioning 106 | system.time(flights2 <- flights %>% partition(cluster)) 107 | # Remote: computation 108 | system.time(flights3 <- flights2 %>% summarise(mean(dep_delay, na.rm = TRUE))) 109 | # Remote: retrieve results 110 | system.time(flights3 %>% collect()) 111 | ``` 112 | 113 | That's because of the overhead associated with sending the data to each worker and retrieving the results at the end. For basic dplyr verbs, multidplyr is unlikely to give you significant speed-ups unless you have 10s or 100s of millions of data points (and in that scenario you should first try [dtplyr](https://dtplyr.tidyverse.org/), which uses [data.table](https://R-datatable.com/)). 114 | 115 | multidplyr might help, however, if you're doing more complex things. Let's see how that plays out when fitting a moderately complex model. We'll start by selecting the flights whose destination has at least 365 flights over the year (roughly one a day), and we'll compute the day of the year from the date: 116 | 117 | ```{r} 118 | daily_flights <- flights %>% 119 | count(dest) %>% 120 | filter(n >= 365) 121 | 122 | common_dest <- flights %>% 123 | semi_join(daily_flights, by = "dest") %>% 124 | mutate(yday = lubridate::yday(ISOdate(year, month, day))) %>% 125 | group_by(dest) 126 | 127 | nrow(common_dest) 128 | ``` 129 | 130 | That leaves us with ~332,000 observations. Let's partition this smaller dataset: 131 | 132 | ```{r} 133 | by_dest <- common_dest %>% partition(cluster) 134 | by_dest 135 | ``` 136 | 137 | Let's fit a smoothed generalised additive model to each destination, estimating how delays vary over the course of the year and within a day. Note that we need to use `cluster_library()` to load the mgcv package on every node.
That takes around 3s: 138 | 139 | ```{r, message = FALSE} 140 | cluster_library(cluster, "mgcv") 141 | system.time({ 142 | models <- by_dest %>% 143 | do(mod = gam(dep_delay ~ s(yday) + s(dep_time), data = .)) 144 | }) 145 | ``` 146 | 147 | Compared with around 5s doing it locally: 148 | 149 | ```{r} 150 | system.time({ 151 | models <- common_dest %>% 152 | group_by(dest) %>% 153 | do(mod = gam(dep_delay ~ s(yday) + s(dep_time), data = .)) 154 | }) 155 | ``` 156 | 157 | The cost of transmitting messages to the nodes is roughly constant, so the longer the task you're parallelising, the closer you'll get to a linear speed-up. 158 | --------------------------------------------------------------------------------