├── .Rbuildignore
├── .github
    ├── .gitignore
    ├── CODE_OF_CONDUCT.md
    └── workflows
    │   ├── R-CMD-check.yaml
    │   ├── lock.yaml
    │   ├── pkgdown.yaml
    │   ├── pr-commands.yaml
    │   └── test-coverage.yaml
├── .gitignore
├── DESCRIPTION
├── LICENSE
├── LICENSE.md
├── NAMESPACE
├── NEWS.md
├── R
    ├── as-parsed-model.R
    ├── kmeans-viz.R
    ├── kmeans.R
    ├── lm.R
    ├── model_prep.R
    └── modeldb-package.R
├── README.Rmd
├── README.md
├── _pkgdown.yml
├── codecov.yml
├── cran-comments.md
├── man
    ├── add_dummy_variables.Rd
    ├── as_parsed_model.modeldb_lm.Rd
    ├── figures
    │   └── logo.png
    ├── linear_regression_db.Rd
    ├── modeldb-package.Rd
    ├── plot_kmeans.Rd
    ├── reexports.Rd
    └── simple_kmeans_db.Rd
├── modeldb.Rproj
├── revdep
    ├── .gitignore
    ├── README.md
    ├── check.R
    ├── failures.md
    └── problems.md
├── tests
    ├── testthat.R
    └── testthat
    │   ├── test-as-parsed-model.R
    │   ├── test_dummy_var.R
    │   ├── test_kmeans.R
    │   ├── test_kmeans_viz.R
    │   └── test_lr.R
└── vignettes
    ├── .gitignore
    ├── kmeans.Rmd
    └── linear-regression.Rmd


/.Rbuildignore:
--------------------------------------------------------------------------------
 1 | ^pkgdown$
 2 | ^docs$
 3 | ^_pkgdown\.yml$
 4 | ^CRAN-RELEASE$
 5 | ^.*\.Rproj$
 6 | ^\.Rproj\.user$
 7 | ^README\.Rmd$
 8 | ^README-.*\.png$
 9 | ^README_files$
10 | travis.yml
11 | config.toml
12 | ^codecov\.yml$
13 | ^revdep$
14 | cran-comments.md
15 | ^tools
16 | ^CRAN-RELEASE
17 | ^\.httr-oauth$
18 | ^\.github$
19 | ^CODE_OF_CONDUCT\.md$
20 | ^LICENSE\.md$
21 | 


--------------------------------------------------------------------------------
/.github/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 | 


--------------------------------------------------------------------------------
/.github/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
  1 | # Contributor Covenant Code of Conduct
  2 | 
  3 | ## Our Pledge
  4 | 
  5 | We as members, contributors, and leaders pledge to make participation in our
  6 | community a harassment-free experience for everyone, regardless of age, body
  7 | size, visible or invisible disability, ethnicity, sex characteristics, gender
  8 | identity and expression, level of experience, education, socio-economic status,
  9 | nationality, personal appearance, race, caste, color, religion, or sexual
 10 | identity and orientation.
 11 | 
 12 | We pledge to act and interact in ways that contribute to an open, welcoming,
 13 | diverse, inclusive, and healthy community.
 14 | 
 15 | ## Our Standards
 16 | 
 17 | Examples of behavior that contributes to a positive environment for our
 18 | community include:
 19 | 
 20 | * Demonstrating empathy and kindness toward other people
 21 | * Being respectful of differing opinions, viewpoints, and experiences
 22 | * Giving and gracefully accepting constructive feedback
 23 | * Accepting responsibility and apologizing to those affected by our mistakes,
 24 |   and learning from the experience
 25 | * Focusing on what is best not just for us as individuals, but for the overall
 26 |   community
 27 | 
 28 | Examples of unacceptable behavior include:
 29 | 
 30 | * The use of sexualized language or imagery, and sexual attention or advances of
 31 |   any kind
 32 | * Trolling, insulting or derogatory comments, and personal or political attacks
 33 | * Public or private harassment
 34 | * Publishing others' private information, such as a physical or email address,
 35 |   without their explicit permission
 36 | * Other conduct which could reasonably be considered inappropriate in a
 37 |   professional setting
 38 | 
 39 | ## Enforcement Responsibilities
 40 | 
 41 | Community leaders are responsible for clarifying and enforcing our standards of
 42 | acceptable behavior and will take appropriate and fair corrective action in
 43 | response to any behavior that they deem inappropriate, threatening, offensive,
 44 | or harmful.
 45 | 
 46 | Community leaders have the right and responsibility to remove, edit, or reject
 47 | comments, commits, code, wiki edits, issues, and other contributions that are
 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation
 49 | decisions when appropriate.
 50 | 
 51 | ## Scope
 52 | 
 53 | This Code of Conduct applies within all community spaces, and also applies when
 54 | an individual is officially representing the community in public spaces.
 55 | Examples of representing our community include using an official e-mail address,
 56 | posting via an official social media account, or acting as an appointed
 57 | representative at an online or offline event.
 58 | 
 59 | ## Enforcement
 60 | 
 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
 62 | reported to the community leaders responsible for enforcement at codeofconduct@posit.co. 
 63 | All complaints will be reviewed and investigated promptly and fairly.
 64 | 
 65 | All community leaders are obligated to respect the privacy and security of the
 66 | reporter of any incident.
 67 | 
 68 | ## Enforcement Guidelines
 69 | 
 70 | Community leaders will follow these Community Impact Guidelines in determining
 71 | the consequences for any action they deem in violation of this Code of Conduct:
 72 | 
 73 | ### 1. Correction
 74 | 
 75 | **Community Impact**: Use of inappropriate language or other behavior deemed
 76 | unprofessional or unwelcome in the community.
 77 | 
 78 | **Consequence**: A private, written warning from community leaders, providing
 79 | clarity around the nature of the violation and an explanation of why the
 80 | behavior was inappropriate. A public apology may be requested.
 81 | 
 82 | ### 2. Warning
 83 | 
 84 | **Community Impact**: A violation through a single incident or series of
 85 | actions.
 86 | 
 87 | **Consequence**: A warning with consequences for continued behavior. No
 88 | interaction with the people involved, including unsolicited interaction with
 89 | those enforcing the Code of Conduct, for a specified period of time. This
 90 | includes avoiding interactions in community spaces as well as external channels
 91 | like social media. Violating these terms may lead to a temporary or permanent
 92 | ban.
 93 | 
 94 | ### 3. Temporary Ban
 95 | 
 96 | **Community Impact**: A serious violation of community standards, including
 97 | sustained inappropriate behavior.
 98 | 
 99 | **Consequence**: A temporary ban from any sort of interaction or public
100 | communication with the community for a specified period of time. No public or
101 | private interaction with the people involved, including unsolicited interaction
102 | with those enforcing the Code of Conduct, is allowed during this period.
103 | Violating these terms may lead to a permanent ban.
104 | 
105 | ### 4. Permanent Ban
106 | 
107 | **Community Impact**: Demonstrating a pattern of violation of community
108 | standards, including sustained inappropriate behavior, harassment of an
109 | individual, or aggression toward or disparagement of classes of individuals.
110 | 
111 | **Consequence**: A permanent ban from any sort of public interaction within the
112 | community.
113 | 
114 | ## Attribution
115 | 
116 | This Code of Conduct is adapted from the [Contributor Covenant][homepage],
117 | version 2.1, available at
118 | <https://www.contributor-covenant.org/version/2/1/code_of_conduct.html>.
119 | 
120 | Community Impact Guidelines were inspired by
121 | [Mozilla's code of conduct enforcement ladder][https://github.com/mozilla/inclusion].
122 | 
123 | For answers to common questions about this code of conduct, see the FAQ at
124 | <https://www.contributor-covenant.org/faq>. Translations are available at <https://www.contributor-covenant.org/translations>.
125 | 
126 | [homepage]: https://www.contributor-covenant.org
127 | 


--------------------------------------------------------------------------------
/.github/workflows/R-CMD-check.yaml:
--------------------------------------------------------------------------------
 1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
 2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
 3 | #
 4 | # NOTE: This workflow is overkill for most R packages and
 5 | # check-standard.yaml is likely a better choice.
 6 | # usethis::use_github_action("check-standard") will install it.
 7 | on:
 8 |   push:
 9 |     branches: [main, master]
10 |   pull_request:
11 |     branches: [main, master]
12 | 
13 | name: R-CMD-check
14 | 
15 | jobs:
16 |   R-CMD-check:
17 |     runs-on: ${{ matrix.config.os }}
18 | 
19 |     name: ${{ matrix.config.os }} (${{ matrix.config.r }})
20 | 
21 |     strategy:
22 |       fail-fast: false
23 |       matrix:
24 |         config:
25 |           - {os: macos-latest,   r: 'release'}
26 | 
27 |           - {os: windows-latest, r: 'release'}
28 |           # Use 3.6 to trigger usage of RTools35
29 |           - {os: windows-latest, r: '3.6'}
30 |           # use 4.1 to check with rtools40's older compiler
31 |           - {os: windows-latest, r: '4.1'}
32 | 
33 |           - {os: ubuntu-latest,   r: 'devel', http-user-agent: 'release'}
34 |           - {os: ubuntu-latest,   r: 'release'}
35 |           - {os: ubuntu-latest,   r: 'oldrel-1'}
36 |           - {os: ubuntu-latest,   r: 'oldrel-2'}
37 |           - {os: ubuntu-latest,   r: 'oldrel-3'}
38 | 
39 |     env:
40 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
41 |       R_KEEP_PKG_SOURCE: yes
42 | 
43 |     steps:
44 |       - uses: actions/checkout@v3
45 | 
46 |       - uses: r-lib/actions/setup-pandoc@v2
47 | 
48 |       - uses: r-lib/actions/setup-r@v2
49 |         with:
50 |           r-version: ${{ matrix.config.r }}
51 |           http-user-agent: ${{ matrix.config.http-user-agent }}
52 |           use-public-rspm: true
53 | 
54 |       - uses: r-lib/actions/setup-r-dependencies@v2
55 |         with:
56 |           extra-packages: any::rcmdcheck
57 |           needs: check
58 | 
59 |       - uses: r-lib/actions/check-r-package@v2
60 |         with:
61 |           upload-snapshots: true
62 | 


--------------------------------------------------------------------------------
/.github/workflows/lock.yaml:
--------------------------------------------------------------------------------
 1 | name: 'Lock Threads'
 2 | 
 3 | on:
 4 |   schedule:
 5 |     - cron: '0 0 * * *'
 6 | 
 7 | jobs:
 8 |   lock:
 9 |     runs-on: ubuntu-latest
10 |     steps:
11 |       - uses: dessant/lock-threads@v2
12 |         with:
13 |           github-token: ${{ github.token }}
14 |           issue-lock-inactive-days: '14'
15 | #          issue-exclude-labels: ''
16 | #          issue-lock-labels: 'outdated'
17 |           issue-lock-comment: >
18 |             This issue has been automatically locked. If you believe you have
19 |             found a related problem, please file a new issue (with a reprex:
20 |             <https://reprex.tidyverse.org>) and link to this issue.
21 |           issue-lock-reason: ''
22 |           pr-lock-inactive-days: '14'
23 | #          pr-exclude-labels: 'wip'
24 |           pr-lock-labels: ''
25 |           pr-lock-comment: >
26 |             This pull request has been automatically locked. If you believe you
27 |             have found a related problem, please file a new issue (with a reprex:
28 |             <https://reprex.tidyverse.org>) and link to this issue.
29 |           pr-lock-reason: ''
30 | #          process-only: 'issues'
31 | 


--------------------------------------------------------------------------------
/.github/workflows/pkgdown.yaml:
--------------------------------------------------------------------------------
 1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
 2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
 3 | on:
 4 |   push:
 5 |     branches: [main, master]
 6 |   pull_request:
 7 |     branches: [main, master]
 8 |   release:
 9 |     types: [published]
10 |   workflow_dispatch:
11 | 
12 | name: pkgdown
13 | 
14 | jobs:
15 |   pkgdown:
16 |     runs-on: ubuntu-latest
17 |     # Only restrict concurrency for non-PR jobs
18 |     concurrency:
19 |       group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }}
20 |     env:
21 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
22 |     permissions:
23 |       contents: write
24 |     steps:
25 |       - uses: actions/checkout@v3
26 | 
27 |       - uses: r-lib/actions/setup-pandoc@v2
28 | 
29 |       - uses: r-lib/actions/setup-r@v2
30 |         with:
31 |           use-public-rspm: true
32 | 
33 |       - uses: r-lib/actions/setup-r-dependencies@v2
34 |         with:
35 |           extra-packages: any::pkgdown, local::.
36 |           needs: website
37 | 
38 |       - name: Build site
39 |         run: pkgdown::build_site_github_pages(new_process = FALSE, install = FALSE)
40 |         shell: Rscript {0}
41 | 
42 |       - name: Deploy to GitHub pages 🚀
43 |         if: github.event_name != 'pull_request'
44 |         uses: JamesIves/github-pages-deploy-action@v4.4.1
45 |         with:
46 |           clean: false
47 |           branch: gh-pages
48 |           folder: docs
49 | 


--------------------------------------------------------------------------------
/.github/workflows/pr-commands.yaml:
--------------------------------------------------------------------------------
 1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
 2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
 3 | on:
 4 |   issue_comment:
 5 |     types: [created]
 6 | 
 7 | name: Commands
 8 | 
 9 | jobs:
10 |   document:
11 |     if: ${{ github.event.issue.pull_request && (github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'OWNER') && startsWith(github.event.comment.body, '/document') }}
12 |     name: document
13 |     runs-on: ubuntu-latest
14 |     env:
15 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
16 |     steps:
17 |       - uses: actions/checkout@v3
18 | 
19 |       - uses: r-lib/actions/pr-fetch@v2
20 |         with:
21 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
22 | 
23 |       - uses: r-lib/actions/setup-r@v2
24 |         with:
25 |           use-public-rspm: true
26 | 
27 |       - uses: r-lib/actions/setup-r-dependencies@v2
28 |         with:
29 |           extra-packages: any::roxygen2
30 |           needs: pr-document
31 | 
32 |       - name: Document
33 |         run: roxygen2::roxygenise()
34 |         shell: Rscript {0}
35 | 
36 |       - name: commit
37 |         run: |
38 |           git config --local user.name "$GITHUB_ACTOR"
39 |           git config --local user.email "$GITHUB_ACTOR@users.noreply.github.com"
40 |           git add man/\* NAMESPACE
41 |           git commit -m 'Document'
42 | 
43 |       - uses: r-lib/actions/pr-push@v2
44 |         with:
45 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
46 | 
47 |   style:
48 |     if: ${{ github.event.issue.pull_request && (github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'OWNER') && startsWith(github.event.comment.body, '/style') }}
49 |     name: style
50 |     runs-on: ubuntu-latest
51 |     env:
52 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
53 |     steps:
54 |       - uses: actions/checkout@v3
55 | 
56 |       - uses: r-lib/actions/pr-fetch@v2
57 |         with:
58 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
59 | 
60 |       - uses: r-lib/actions/setup-r@v2
61 | 
62 |       - name: Install dependencies
63 |         run: install.packages("styler")
64 |         shell: Rscript {0}
65 | 
66 |       - name: Style
67 |         run: styler::style_pkg()
68 |         shell: Rscript {0}
69 | 
70 |       - name: commit
71 |         run: |
72 |           git config --local user.name "$GITHUB_ACTOR"
73 |           git config --local user.email "$GITHUB_ACTOR@users.noreply.github.com"
74 |           git add \*.R
75 |           git commit -m 'Style'
76 | 
77 |       - uses: r-lib/actions/pr-push@v2
78 |         with:
79 |           repo-token: ${{ secrets.GITHUB_TOKEN }}
80 | 


--------------------------------------------------------------------------------
/.github/workflows/test-coverage.yaml:
--------------------------------------------------------------------------------
 1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
 2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
 3 | on:
 4 |   push:
 5 |     branches: [main, master]
 6 |   pull_request:
 7 |     branches: [main, master]
 8 | 
 9 | name: test-coverage
10 | 
11 | jobs:
12 |   test-coverage:
13 |     runs-on: ubuntu-latest
14 |     env:
15 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
16 | 
17 |     steps:
18 |       - uses: actions/checkout@v3
19 | 
20 |       - uses: r-lib/actions/setup-r@v2
21 |         with:
22 |           use-public-rspm: true
23 | 
24 |       - uses: r-lib/actions/setup-r-dependencies@v2
25 |         with:
26 |           extra-packages: any::covr
27 |           needs: coverage
28 | 
29 |       - name: Test coverage
30 |         run: |
31 |           covr::codecov(
32 |             quiet = FALSE,
33 |             clean = FALSE,
34 |             install_path = file.path(normalizePath(Sys.getenv("RUNNER_TEMP"), winslash = "/"), "package")
35 |           )
36 |         shell: Rscript {0}
37 | 
38 |       - name: Show testthat output
39 |         if: always()
40 |         run: |
41 |           ## --------------------------------------------------------------------
42 |           find ${{ runner.temp }}/package -name 'testthat.Rout*' -exec cat '{}' \; || true
43 |         shell: bash
44 | 
45 |       - name: Upload test results
46 |         if: failure()
47 |         uses: actions/upload-artifact@v3
48 |         with:
49 |           name: coverage-test-failures
50 |           path: ${{ runner.temp }}/package
51 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | inst/doc
 2 | .Rproj.user
 3 | .Rhistory
 4 | .RData
 5 | .Ruserdata
 6 | kmeans.csv
 7 | /content
 8 | /public
 9 | /layouts
10 | /static
11 | CRAN-RELEASE
12 | .httr-oauth
13 | 


--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: modeldb
 2 | Title: Fits Models Inside the Database
 3 | Version: 0.3.0.9000
 4 | Authors@R: c(
 5 |     person("Edgar", "Ruiz", , "edgar@posit.co", role = "aut"),
 6 |     person("Max", "Kuhn", , "max@posit.co", role = c("aut", "cre"))
 7 |   )
 8 | Description: Uses 'dplyr' and 'tidyeval' to fit statistical models inside
 9 |     the database. It currently supports KMeans and linear regression
10 |     models.
11 | License: MIT + file LICENSE
12 | URL: https://modeldb.tidymodels.org, https://github.com/tidymodels/modeldb
13 | BugReports: https://github.com/tidymodels/modeldb/issues
14 | Depends:
15 |     R (>= 3.6)
16 | Imports:
17 |     cli,
18 |     dplyr (>= 0.7),
19 |     ggplot2,
20 |     progress,
21 |     purrr,
22 |     rlang (>= 1.1.1),
23 |     tibble,
24 |     tidypredict
25 | Suggests: 
26 |     covr,
27 |     DBI,
28 |     dbplyr,
29 |     knitr,
30 |     methods,
31 |     nycflights13,
32 |     rmarkdown,
33 |     RSQLite,
34 |     testthat (>= 3.2.0)
35 | VignetteBuilder: 
36 |     knitr
37 | Config/Needs/website: tidyverse/tidytemplate
38 | Config/testthat/edition: 3
39 | Encoding: UTF-8
40 | Roxygen: list(markdown = TRUE)
41 | RoxygenNote: 7.2.3
42 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | YEAR: 2021
2 | COPYRIGHT HOLDER: modeldb authors
3 | 


--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
 1 | # MIT License
 2 | 
 3 | Copyright (c) 2021 modeldb authors
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
 1 | # Generated by roxygen2: do not edit by hand
 2 | 
 3 | S3method(as_parsed_model,modeldb_lm)
 4 | export(add_dummy_variables)
 5 | export(as_parsed_model)
 6 | export(db_calculate_squares)
 7 | export(linear_regression_db)
 8 | export(plot_kmeans)
 9 | export(simple_kmeans_db)
10 | import(ggplot2)
11 | import(rlang)
12 | import(tibble)
13 | importFrom(dplyr,all_vars)
14 | importFrom(dplyr,arrange)
15 | importFrom(dplyr,as_tibble)
16 | importFrom(dplyr,bind_cols)
17 | importFrom(dplyr,case_when)
18 | importFrom(dplyr,collect)
19 | importFrom(dplyr,contains)
20 | importFrom(dplyr,everything)
21 | importFrom(dplyr,filter)
22 | importFrom(dplyr,filter_all)
23 | importFrom(dplyr,funs)
24 | importFrom(dplyr,group_by)
25 | importFrom(dplyr,group_vars)
26 | importFrom(dplyr,lag)
27 | importFrom(dplyr,left_join)
28 | importFrom(dplyr,mutate)
29 | importFrom(dplyr,n)
30 | importFrom(dplyr,pull)
31 | importFrom(dplyr,rename)
32 | importFrom(dplyr,rename_all)
33 | importFrom(dplyr,select)
34 | importFrom(dplyr,summarise)
35 | importFrom(dplyr,summarise_all)
36 | importFrom(dplyr,tally)
37 | importFrom(dplyr,tbl_vars)
38 | importFrom(dplyr,ungroup)
39 | importFrom(purrr,imap)
40 | importFrom(purrr,map)
41 | importFrom(purrr,map2)
42 | importFrom(purrr,map_df)
43 | importFrom(purrr,pluck)
44 | importFrom(purrr,reduce)
45 | importFrom(purrr,transpose)
46 | importFrom(tidypredict,as_parsed_model)
47 | importFrom(utils,head)
48 | importFrom(utils,write.csv)
49 | 


--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
 1 | # modeldb (development version)
 2 | 
 3 | # modeldb 0.3.0
 4 | 
 5 | - Fixes compatability issues with `dplyr` 
 6 | 
 7 | # modeldb 0.2.3
 8 | 
 9 | * Re-licensed package from GPL-3 to MIT. All copyright holders are RStudio employees and give consent.
10 | 
11 | # modeldb 0.2.2
12 | 
13 | - Switches maintainer to Max Kuhn
14 | 
15 | # modeldb 0.2.1
16 | 
17 | - Uses `dplyr::tbl_vars()` for column name extraction to instead of `colnames()`
18 | 
19 | # modeldb 0.2.0
20 | 
21 | - Adds `as_parsed_model()` support for regression models
22 | 
23 | - Fixes compatability issues with `rlang` and `dplyr` 
24 | 
25 | # modeldb 0.1.2
26 | 
27 | - Removes pipes and other dplyr dependencies from internal `mlr()` function
28 | 
29 | - Consolidates duplicated database operations in `mlr()`
30 | 
31 | - Fixes an issue in `simple_kmeans_db()` when specifying variables
32 | 
33 | # modeldb 0.1.1
34 | 
35 | ## Bug fixes
36 | 
37 | - Fixes dependency issue with `tidypredict` by removing `as_parsed_model()`.  The function will be moved to `tidypredict` in its next version.
38 | 


--------------------------------------------------------------------------------
/R/as-parsed-model.R:
--------------------------------------------------------------------------------
 1 | 
 2 | #' @export
 3 | tidypredict::as_parsed_model
 4 | 
 5 | #' Prepares parsed model object
 6 | #'
 7 | #' @param x A parsed model object
 8 | #'
 9 | #' @export
10 | as_parsed_model.modeldb_lm <- function(x) {
11 |   terms <- imap(
12 |     x,
13 |     ~ {
14 |       list(
15 |         label = .y,
16 |         coef = .x,
17 |         is_intercept = ifelse(.y == "(Intercept)", 1, 0),
18 |         fields = list(list(type = "ordinary", col = .y))
19 |       )
20 |     }
21 |   )
22 |   pm <- list(
23 |     general = list(
24 |       model = "modeldb_lm",
25 |       version = 2,
26 |       type = "regression",
27 |       is_glm = 0
28 |     ),
29 |     terms = terms
30 |   )
31 |   as_parsed_model(pm)
32 | }
33 | 


--------------------------------------------------------------------------------
/R/kmeans-viz.R:
--------------------------------------------------------------------------------
  1 | #' Visualize a KMeans Cluster with lots of data
  2 | #'
  3 | #' It uses 'ggplot2' to display the results of a KMeans routine. Instead
  4 | #' of a scatterplot, it uses a square grid that displays the concentration
  5 | #' of intersections per square.  The number of squares in the grid can
  6 | #' be customized for more or less fine grain.
  7 | #'
  8 | #' @param df A Local or remote data frame with results of KMeans clustering
  9 | #' @param x A numeric variable for the x axis
 10 | #' @param y A numeric variable for the y axis
 11 | #' @param resolution The number of squares in the grid. Defaults to 50.
 12 | #' Meaning a 50 x 50 grid.
 13 | #' @param group A discrete variable containing the grouping for the KMeans. It defaults to 'center'
 14 | #'
 15 | #' @details
 16 | #' For large result-sets in remote sources, downloading every intersection will
 17 | #' be a long running, costly operation.  The approach of this function is to
 18 | #' devide the x and y plane in a grid and have the remote source figure the
 19 | #' total number of intersections, returned as a single number.  This reduces the
 20 | #' granularity of the visualization, but it speeds up the results.
 21 | #'
 22 | #' @examples
 23 | #' plot_kmeans(mtcars, mpg, wt, group = am)
 24 | #' @export
 25 | plot_kmeans <- function(df, x, y, resolution = 50, group = center) {
 26 |   x <- enquo(x)
 27 |   y <- enquo(y)
 28 |   group <- enquo(group)
 29 | 
 30 |   squares <- db_calculate_squares(
 31 |     df = df,
 32 |     x = !!x,
 33 |     y = !!y,
 34 |     group = !!group,
 35 |     resolution = resolution
 36 |   )
 37 | 
 38 | 
 39 |   d <- mutate(
 40 |     squares,
 41 |     x = !!x, y = !!y, Center = !!group, Count = n
 42 |   )
 43 | 
 44 |   ggplot(d) +
 45 |     geom_rect(
 46 |       aes(
 47 |         xmin = x,
 48 |         ymin = y,
 49 |         xmax = xend,
 50 |         ymax = yend,
 51 |         color = Center
 52 |       ),
 53 |       fill = "transparent"
 54 |     ) +
 55 |     geom_rect(aes(
 56 |       xmin = x,
 57 |       ymin = y,
 58 |       xmax = xend,
 59 |       ymax = yend,
 60 |       fill = Center,
 61 |       alpha = Count
 62 |     )) +
 63 |     theme_minimal() +
 64 |     theme(
 65 |       legend.position = "bottom",
 66 |       axis.line = element_blank(),
 67 |       panel.border = element_blank(),
 68 |       panel.grid.minor = element_blank()
 69 |     ) +
 70 |     labs(
 71 |       title = "Kmeans Clusters",
 72 |       subtitle = paste0(expr_text(y), " ~ ", expr_text(x)),
 73 |       x = expr_text(x),
 74 |       y = expr_text(y)
 75 |     )
 76 | }
 77 | 
 78 | #' @export
 79 | #' @rdname plot_kmeans
 80 | db_calculate_squares <- function(df, x, y, group, resolution = 50) {
 81 |   x <- enquo(x)
 82 |   y <- enquo(y)
 83 |   group <- enquo(group)
 84 |   segs <- group_by(
 85 |     df,
 86 |     !!x := !!db_bin(!!x, bins = resolution),
 87 |     !!y := !!db_bin(!!y, bins = resolution),
 88 |     !!group
 89 |   )
 90 |   segs <- tally(segs)
 91 |   segs <- ungroup(segs)
 92 |   segs <- collect(segs)
 93 |   mutate(segs,
 94 |     xend = !!x + min_dif(segs, !!x),
 95 |     yend = !!y + min_dif(segs, !!y)
 96 |   )
 97 | }
 98 | 
 99 | min_dif <- function(df, field) {
100 |   field <- enquo(field)
101 |   df <- mutate(df,
102 |     x1 = lag(!!field, 1),
103 |     dif = !!field - x1
104 |   )
105 |   df <- filter(df, dif != 0, dif > 0)
106 |   df <- summarise(df, min(dif))
107 |   df <- pull(df)
108 | }
109 | 
110 | db_bin <- function(var, bins = 30) {
111 |   var <- enquo(var)
112 |   range <- expr(max(!!var, na.rm = TRUE) - min(!!var, na.rm = TRUE))
113 |   binwidth <- expr((!!range) / (!!bins))
114 |   bin_number <- expr(as.integer(floor(((!!var) - min(!!var, na.rm = TRUE)) / (!!binwidth))))
115 |   expr(((!!binwidth) *
116 |     ifelse((!!bin_number) == (!!bins), (!!bin_number) - 1,
117 |       (!!bin_number)
118 |     )) + min(!!var, na.rm = TRUE))
119 | }
120 | 


--------------------------------------------------------------------------------
/R/kmeans.R:
--------------------------------------------------------------------------------
  1 | #' Simple kmeans routine that works in-database
  2 | #'
  3 | #' It uses 'tidyeval' and 'dplyr' to run multiple cycles of kmean
  4 | #' calculations, expressed in dplyr formulas until an the optimal
  5 | #' centers are found.
  6 | #'
  7 | #' @param df A Local or remote data frame
  8 | #' @param ... A list of variables to be used in the kmeans algorithm
  9 | #' @param centers The number of centers. Defaults to 3.
 10 | #' @param max_repeats The maximum number of cycles to run. Defaults to 100.
 11 | #' @param initial_kmeans A local dataframe with initial centroid values. Defaults to NULL.
 12 | #' @param safeguard_file Each cycle will update a file specified in this argument
 13 | #' with the current centers.  Defaults to 'kmeans.csv'. Pass NULL if no file is
 14 | #' desired.
 15 | #' @param verbose Indicates if the progress bar will be displayed during the model's fitting.
 16 | #'
 17 | #' @details
 18 | #' Because each cycle is an independent 'dplyr' operation, or SQL operation if using a remote source,
 19 | #' the latest centroid data frame is saved to the parent environment in case the process needs to be
 20 | #' canceled and then restarted at a later point.  Passing the `current_kmeans` as the `initial_kmeans`
 21 | #' will allow the operation to pick up where it left off.
 22 | #'
 23 | #' @examples
 24 | #' library(dplyr)
 25 | #'
 26 | #' mtcars %>%
 27 | #'   simple_kmeans_db(mpg, qsec, wt) %>%
 28 | #'   glimpse()
 29 | #'
 30 | #' @export
 31 | simple_kmeans_db <- function(df,
 32 |                              ...,
 33 |                              centers = 3,
 34 |                              max_repeats = 100,
 35 |                              initial_kmeans = NULL,
 36 |                              safeguard_file = "kmeans.csv",
 37 |                              verbose = TRUE) {
 38 |   vars <- enquos(...)
 39 | 
 40 |   if (length(vars) > 0) {
 41 |     f_df <- select(df, !!!vars)
 42 |   } else {
 43 |     vars <- syms(colnames(df))
 44 |     f_df <- df
 45 |   }
 46 | 
 47 |   f_df <- filter_all(f_df, all_vars(!is.na(.)))
 48 | 
 49 |   if (!is.null(initial_kmeans)) {
 50 |     centroids <- initial_kmeans
 51 |   } else {
 52 |     centroids <- head(f_df, centers)
 53 |     centroids <- collect(centroids)
 54 |   }
 55 | 
 56 |   if (verbose) {
 57 |     pb <- progress::progress_bar$new(
 58 |       format = paste0(
 59 |         " Cycle :current of ", max_repeats, " max. [:bar] [:var][:elapsed]"
 60 |       ),
 61 |       total = max_repeats, clear = TRUE, width = 80
 62 |     )
 63 |   }
 64 | 
 65 |   for (iteration in 1:max_repeats) {
 66 |     prev_centroids <- centroids
 67 |     new_centroids <- calculate_centers(df, centroids, centers, vars)
 68 | 
 69 |     centroids_db <- select(new_centroids, !!!vars, center)
 70 |     centroids_db <- group_by(centroids_db, center)
 71 |     centroids_db <- summarise_all(centroids_db, "mean", na.rm = TRUE)
 72 | 
 73 |     centroids <- select(centroids_db, -center)
 74 |     centroids <- collect(centroids)
 75 | 
 76 |     if (!is.null(safeguard_file)) {
 77 |       sfg <- file.path(tempdir(), safeguard_file)
 78 |       write.csv(centroids, sfg, row.names = FALSE)
 79 |     }
 80 |     variance <- (
 81 |       round(
 82 |         abs(sum(prev_centroids) - sum(centroids)) / sum(prev_centroids),
 83 |         digits = 4
 84 |       ) * 100
 85 |     )
 86 |     if (verbose) pb$tick(tokens = list(var = variance))
 87 |     if (all(prev_centroids == centroids)) break()
 88 |   }
 89 |   centroids_db <- rename_all(centroids_db, ~ paste0("k_", .))
 90 |   joined <- left_join(
 91 |     rename(new_centroids, k_center = center),
 92 |     centroids_db,
 93 |     by = "k_center"
 94 |   )
 95 |   select(joined, contains("k_"), everything())
 96 | }
 97 | 
 98 | calculate_centers <- function(df, center_df, centers, vars) {
 99 |   center_names <- paste0("center_", 1:centers)
100 | 
101 |   fields <- length(vars)
102 | 
103 |   f_dist <- imap(
104 |     center_df, ~ {
105 |       map2(
106 |         .x, .y,
107 |         function(x, y) expr((!!x) - (!!sym(y)))
108 |       )
109 |     }
110 |   )
111 | 
112 |   f_inside <- function(curr_center) {
113 |     fi <- map(1:fields, ~ {
114 |       f <- pluck(f_dist, .x, curr_center)
115 |       expr((((!!f)) * ((!!f))))
116 |     })
117 |     reduce(fi, function(l, r) expr((!!l) + (!!r)))
118 |   }
119 | 
120 |   km <- map(1:centers, ~ expr(sqrt(!!f_inside(.x))))
121 |   km <- set_names(km, center_names)
122 | 
123 |   all <- map(center_names, ~ {
124 |     comp <- map2(.x, center_names, function(x, y) {
125 |       if (x != y) {
126 |         expr((!!sym(x)) < (!!sym(y)))
127 |       } else {
128 |         expr((!!sym(x)) >= (!!sym(y)))
129 |       }
130 |     })
131 |     comp <- reduce(comp, function(l, r) expr((!!l) & (!!r)))
132 |     reduce(c(comp, .x), function(l, r) expr((!!l) ~ !!(r)))
133 |   })
134 |   all <- flatten(all)
135 | 
136 |   comp <- expr(case_when(!!!all))
137 | 
138 |   res <- mutate(df, !!!km)
139 |   res <- mutate(res, center = !!comp)
140 |   res <- filter(res, !is.na(center))
141 |   select(res, -contains("center_"))
142 | }
143 | 


--------------------------------------------------------------------------------
/R/lm.R:
--------------------------------------------------------------------------------
  1 | #' Fits a Linear Regression model
  2 | #'
  3 | #' It uses 'tidyeval' and 'dplyr' to create a linear
  4 | #' regression model.
  5 | #'
  6 | #' @param df A Local or remote data frame
  7 | #' @param y_var Dependent variable
  8 | #' @param sample_size Prevents a table count. It is only used for models
  9 | #' with three or more independent variables
 10 | #' @param auto_count Serves as a safeguard in case sample_size is not
 11 | #' passed inadvertently.  Defaults to FALSE.  If it is ok for the
 12 | #' function to count how many records are in the sample, then set to
 13 | #' TRUE.  It is only used for models with three or more independent variables
 14 | #'
 15 | #' @details
 16 | #'
 17 | #' The linear_regression_db() function only calls one of three unexported functions.
 18 | #' The function used is determined by the number of independent variables.  This is
 19 | #' so any model of one or two variables can use a simpler formula, which in turn
 20 | #' will have less SQL overhead.
 21 | #'
 22 | #' @examples
 23 | #' library(dplyr)
 24 | #'
 25 | #' mtcars %>%
 26 | #'   select(mpg, wt, qsec) %>%
 27 | #'   linear_regression_db(mpg)
 28 | #'
 29 | #' @export
 30 | linear_regression_db <- function(df, y_var = NULL, sample_size = NULL, auto_count = FALSE) {
 31 |   y_var <- enexpr(y_var)
 32 | 
 33 |   col_names <- tbl_vars(df)
 34 |   grouped_count <- length(group_vars(df))
 35 |   n_cols <- length(col_names) - grouped_count
 36 | 
 37 |   x_vars <- col_names[col_names != expr_text(y_var)]
 38 |   if (grouped_count > 0) x_vars <- setdiff(x_vars, group_vars(df))
 39 | 
 40 |   if (n_cols == 2) {
 41 |     m <- simple_linear_regression_db(
 42 |       df = df,
 43 |       x = !!sym(x_vars[1]),
 44 |       y = !!y_var
 45 |     )
 46 |   }
 47 |   if (n_cols == 3) {
 48 |     m <- two_variable_regression(
 49 |       df = df,
 50 |       y = !!y_var,
 51 |       x1 = !!sym(x_vars[1]),
 52 |       x2 = !!sym(x_vars[2])
 53 |     )
 54 |   }
 55 |   if (n_cols > 3) {
 56 |     m <- mlr(
 57 |       df = df,
 58 |       y_var = !!y_var,
 59 |       sample_size = sample_size,
 60 |       auto_count = auto_count
 61 |     )
 62 |   }
 63 |   class(m) <- c("modeldb_lm", class(m))
 64 |   m
 65 | }
 66 | 
 67 | two_variable_regression <- function(df, y, x1, x2) {
 68 |   y <- enquo(y)
 69 |   x1 <- enquo(x1)
 70 |   x2 <- enquo(x2)
 71 | 
 72 |   vars <- group_vars(df)
 73 | 
 74 |   m <- summarise(
 75 |     df,
 76 |     x1y = sum(!!x1 * !!y, na.rm = TRUE) - (sum(!!x1, na.rm = TRUE) * sum(!!y, na.rm = TRUE) / n()),
 77 |     x2y = sum(!!x2 * !!y, na.rm = TRUE) - (sum(!!x2, na.rm = TRUE) * sum(!!y, na.rm = TRUE) / n()),
 78 |     x2x = sum(!!x2 * !!x2, na.rm = TRUE) - (sum(!!x2, na.rm = TRUE) * sum(!!x2, na.rm = TRUE) / n()),
 79 |     x1x = sum(!!x1 * !!x1, na.rm = TRUE) - (sum(!!x1, na.rm = TRUE) * sum(!!x1, na.rm = TRUE) / n()),
 80 |     all = sum(!!x1 * !!x2, na.rm = TRUE) - (sum(!!x1, na.rm = TRUE) * sum(!!x2, na.rm = TRUE) / n()),
 81 |     my  = mean(!!y, na.rm = TRUE),
 82 |     mx1 = mean(!!x1, na.rm = TRUE),
 83 |     mx2 = mean(!!x2, na.rm = TRUE)
 84 |   )
 85 |   m <- mutate(
 86 |     m,
 87 |     !!x1 := ((x2x * x1y) - (all * x2y)) / ((x1x * x2x) - (all * all)),
 88 |     !!x2 := ((x1x * x2y) - (all * x1y)) / ((x1x * x2x) - (all * all))
 89 |   )
 90 |   m <- mutate(m, Intercept = my - (!!x1 * mx1) - (!!x2 * mx2))
 91 |   m <- select(m, !!vars, Intercept, !!x1, !!x2)
 92 |   m <- collect(m)
 93 |   m <- as_tibble(m)
 94 |   rename(m, "(Intercept)" = Intercept)
 95 | }
 96 | 
 97 | simple_linear_regression_db <- function(df, x, y) {
 98 |   x <- enquo(x)
 99 |   y <- enquo(y)
100 | 
101 |   vars <- group_vars(df)
102 | 
103 |   m <- summarise(
104 |     df,
105 |     sx = sum(!!x, na.rm = TRUE),
106 |     sy = sum(!!y, na.rm = TRUE),
107 |     sxx = sum(!!x * !!x, na.rm = TRUE),
108 |     syy = sum(!!y * !!y, na.rm = TRUE),
109 |     sxy = sum(!!x * !!y, na.rm = TRUE),
110 |     n = n()
111 |   )
112 |   m <- mutate(m, !!x := ((n * sxy) - (sx * sy)) / ((n * sxx) - (sx * sx)))
113 |   m <- mutate(m, Intercept = ((1 / n) * sy) - (!!x * (1 / n) * sx))
114 |   m <- select(m, !!vars, Intercept, !!x)
115 |   m <- collect(m)
116 |   m <- as_tibble(m)
117 |   rename(m, "(Intercept)" = Intercept)
118 | }
119 | 
120 | mlr <- function(df, ..., y_var, sample_size = NULL, auto_count = FALSE) {
121 |   if (is.null(sample_size)) {
122 |     if (auto_count) {
123 |       sample_size <- pull(tally(df))
124 |     } else {
125 |       cli::cli_abort("No sample size provided, and auto_count is set to FALSE")
126 |     }
127 |   }
128 | 
129 |   y_var <- enquo(y_var)
130 |   y_text <- as_label(y_var)
131 | 
132 |   grouping_vars <- group_vars(df)
133 |   vars_count <- length(grouping_vars)
134 | 
135 |   x_vars <- tbl_vars(df)
136 |   x_vars <- x_vars[x_vars != y_text]
137 |   if (vars_count > 0) x_vars <- map(grouping_vars, ~ x_vars[x_vars != .x])[[1]]
138 |   x_vars <- syms(x_vars)
139 | 
140 |   all_vars <- c(x_vars, ensym(y_var))
141 | 
142 |   all_f_mapped <- map(
143 |     all_vars, ~ {
144 |       y <- .x
145 |       map(
146 |         all_vars, ~ {
147 |           xy <- c(as_label(.x), as_label(y))
148 |           pop <- ifelse(auto_count, expr(n()), sample_size)
149 |           list(
150 |             # f = ind_f(!!.x, !!y, sample_size, vars_count),
151 |             f = expr(
152 |               sum(!!.x * !!y, na.rm = TRUE) - ((sum(!!.x, na.rm = TRUE) * sum(!!y, na.rm = TRUE)) / !!pop)
153 |             ),
154 |             name = paste0(xy[order(xy)], collapse = "_")
155 |           )
156 |         }
157 |       )
158 |     }
159 |   )
160 |   all_f <- flatten(all_f_mapped)
161 |   all_f <- set_names(
162 |     map(all_f, ~ .x$f),
163 |     map(all_f, ~ .x$name)
164 |   )
165 | 
166 |   # Deduping field combos, decreases number of calcs inside DB
167 |   unique_f <- map(
168 |     unique(names(all_f)),
169 |     ~ all_f[names(all_f) == .x][[1]]
170 |   )
171 |   unique_f <- set_names(unique_f, unique(names(all_f)))
172 | 
173 |   all_means <- map(all_vars, ~ expr(mean(!!.x, na.rm = TRUE)))
174 |   all_means <- set_names(all_means, ~ paste0("mean_", all_vars))
175 | 
176 |   all_fm <- c(unique_f, all_means)
177 | 
178 |   # Send all operations to the DB simultaneously
179 |   ests_df <- summarise(df, !!!all_fm)
180 |   ests_df <- collect(ests_df)
181 | 
182 |   ests_list <- as.list(ests_df)
183 | 
184 |   xm_names <- names(all_f)[!grepl(y_text, names(all_f))]
185 |   xm <- prepare_matrix(ests_list, xm_names, length(x_vars))
186 | 
187 |   ym_names <- names(all_f)[grepl(y_text, names(all_f))]
188 |   ym_names <- unique(ym_names)[1:length(x_vars)]
189 |   ym <- prepare_matrix(ests_list, ym_names, length(x_vars))
190 | 
191 |   coefs <- map(
192 |     seq_len(vars_count + 1),
193 |     ~ as.numeric(solve(xm[[.x]], ym[[.x]]))
194 |   )
195 | 
196 |   intercept <- map(
197 |     seq_len(vars_count + 1), ~ {
198 |       y <- .x
199 |       x_f <- map(
200 |         seq_len(length(x_vars)), ~ {
201 |           x_name <- paste0("mean_", x_vars[.x])
202 |           x_mean <- ests_list[names(ests_list) == x_name][[1]][y]
203 |           expr((!!coefs[[y]][.x] * !!x_mean))
204 |         }
205 |       )
206 |       y_name <- paste0("mean_", y_text)
207 |       y_mean <- ests_list[names(ests_list) == y_name][[1]][y]
208 |       int_f <- reduce(
209 |         c(y_mean, x_f),
210 |         function(l, r) expr(!!l - !!r)
211 |       )
212 |       eval(int_f)
213 |     }
214 |   )
215 | 
216 |   res <- transpose(coefs)
217 |   res <- set_names(res, x_vars)
218 |   res <- c(list("(Intercept)" = intercept), res)
219 |   res <- map_df(transpose(res), ~.x)
220 |   bind_cols(ests_df[, grouping_vars], res)
221 | }
222 | 
223 | prepare_matrix <- function(estimates, field_names, matrix_size) {
224 |   m <- map(field_names, ~ estimates[.x])
225 |   m <- flatten(m)
226 |   m <- transpose(m)
227 |   map(m, ~ matrix(as.numeric(.x), nrow = matrix_size))
228 | }
229 | 


--------------------------------------------------------------------------------
/R/model_prep.R:
--------------------------------------------------------------------------------
 1 | #' Creates dummy variables
 2 | #'
 3 | #' It uses 'tidyeval' and 'dplyr' to create dummy variables based for
 4 | #' categorical variables.
 5 | #'
 6 | #' @param df A Local or remote data frame
 7 | #' @param x Categorical variable
 8 | #' @param values Possible known values of the categorical variable. If not passed
 9 | #' then the function will take an additional step to figure the unique values of
10 | #' the variable.
11 | #' @param auto_values Safeguard argument to prevent the function from figuring the
12 | #' unique values if the values argument is empty.  If it is ok for this function
13 | #' to obtain the unique values, set to TRUE.  Defaults to FALSE.
14 | #' @param remove_original It removes the original variable from the returned table.
15 | #' Defaults to TRUE.
16 | #'
17 | #' @examples
18 | #' library(dplyr)
19 | #'
20 | #' mtcars %>%
21 | #'   add_dummy_variables(cyl, values = c(4, 6, 8))
22 | #'
23 | #' mtcars %>%
24 | #'   add_dummy_variables(cyl, auto_values = TRUE)
25 | #' @export
26 | add_dummy_variables <- function(df, x, values = c(),
27 |                                 auto_values = FALSE, remove_original = TRUE) {
28 |   x <- enquo(x)
29 |   var_found <- as_label(x) %in% tbl_vars(df)
30 |   if (!var_found) cli::cli_abort("Variable not found")
31 |   if (length(values) == 0) {
32 |     if (auto_values == TRUE) {
33 |       values <- group_by(df, !!x)
34 |       values <- summarise(values)
35 |       values <- pull(values)
36 |     } else {
37 |       cli::cli_abort("No values provided and auto_values is set to FALSE")
38 |     }
39 |   }
40 |   vals <- map(values, ~ expr(ifelse(!!x == !!.x, 1, 0)))
41 |   names <- map(values, ~ paste0(as_label(x), "_", .x))
42 |   vals <- set_names(vals, names)
43 |   vals <- vals[2:length(vals)]
44 |   df <- mutate(df, !!!vals)
45 |   if (remove_original) df <- select(df, -!!x)
46 |   df
47 | }
48 | 


--------------------------------------------------------------------------------
/R/modeldb-package.R:
--------------------------------------------------------------------------------
 1 | #' @import rlang
 2 | #' @import ggplot2
 3 | #' @import tibble
 4 | #' @importFrom dplyr mutate summarise left_join
 5 | #' @importFrom dplyr summarise_all filter_all
 6 | #' @importFrom dplyr pull collect arrange
 7 | #' @importFrom dplyr contains tally ungroup
 8 | #' @importFrom dplyr group_vars lag bind_cols
 9 | #' @importFrom dplyr all_vars group_by funs
10 | #' @importFrom dplyr n as_tibble filter select
11 | #' @importFrom dplyr case_when rename rename_all
12 | #' @importFrom dplyr tbl_vars everything
13 | #' @importFrom purrr map map2 map_df transpose
14 | #' @importFrom purrr reduce imap pluck
15 | #' @importFrom tidypredict as_parsed_model
16 | #' @importFrom utils head
17 | #' @importFrom utils write.csv
18 | #' @keywords internal
19 | #'
20 | "_PACKAGE"
21 | NULL
22 | utils::globalVariables(c(
23 |   ".", "Center", "Count", "Intercept", "center", "dif", "mx1",
24 |   "mx2", "my", "sx", "sxx", "sxy", "sy", "val", "var", "x1",
25 |   "x1x", "x1y", "x2x", "x2y", "xend", "yend"
26 | ))
27 | 


--------------------------------------------------------------------------------
/README.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | output: github_document
 3 | ---
 4 | 
 5 | # modeldb <img src="man/figures/logo.png" align="right" alt="" width="120" />
 6 | 
 7 | ```{r setup, include=FALSE}
 8 | library(dplyr)
 9 | library(modeldb)
10 | ```
11 | 
12 | [![R-CMD-check](https://github.com/tidymodels/modeldb/workflows/R-CMD-check/badge.svg)](https://github.com/tidymodels/modeldb/actions)
13 | [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/modeldb)](https://CRAN.R-project.org/package=modeldb)
14 | [![Codecov test coverage](https://codecov.io/gh/tidymodels/modeldb/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/modeldb?branch=main)
15 | [![Downloads](http://cranlogs.r-pkg.org/badges/modeldb)](https://CRAN.R-project.org/package=modeldb)
16 | 
17 | Fit models inside the database! **modeldb works with most database back-ends** because it leverages [dplyr](https://dplyr.tidyverse.org/) and [dbplyr](https://dbplyr.tidyverse.org/) for the final SQL translation of the algorithm. It currently supports:
18 | 
19 | - K-means clustering
20 | 
21 | - Linear regression
22 | 
23 | ## Installation
24 | 
25 | Install the CRAN version with:
26 | 
27 | ```{r, eval = FALSE}
28 | install.packages("modeldb")
29 | ```
30 | 
31 | The development version is available from GitHub using remotes:
32 | 
33 | ```{r, eval = FALSE}
34 | # install.packages("remotes")
35 | remotes::install_github("tidymodels/modeldb")
36 | ```
37 | 
38 | ## Linear regression
39 | 
40 | An easy way to try out the package is by creating a temporary SQLite database, and loading `mtcars` to it.
41 | 
42 | ```{r}
43 | con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
44 | RSQLite::initExtension(con)
45 | dplyr::copy_to(con, mtcars)
46 | ```
47 | 
48 | ```{r}
49 | library(dplyr)
50 | 
51 | tbl(con, "mtcars") %>%
52 |   select(wt, mpg, qsec) %>%
53 |   linear_regression_db(wt)
54 | ```
55 | 
56 | The model output can be parsed by [tidypredict](https://tidypredict.tidymodels.org/) to run the predictions in the database. Please see the "Linear Regression" article to learn more about how to use `linear_regression_db()`
57 | 
58 | ## K Means clustering
59 | 
60 | To use the `simple_kmeans_db()` function, simply pipe the database back end table to the function.  This returns a list object that contains two items:
61 | 
62 | - A sql query table with the final center assignment
63 | - A local table with the information about the centers
64 | 
65 | ```{r}
66 | km <- tbl(con, "mtcars") %>%
67 |   simple_kmeans_db(mpg, wt)
68 | 
69 | colnames(km)
70 | ```
71 | 
72 | The SQL statement from `tbl` can be extracted using dbplyr's `remote_query()`
73 | 
74 | ```{r}
75 | dbplyr::remote_query(km)
76 | ```
77 | 
78 | ## Contributing
79 | 
80 | This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
81 | 
82 | - For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on Posit Community](https://community.rstudio.com/new-topic?category_id=15&tags=tidymodels,question).
83 | 
84 | - If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/modeldb/issues).
85 | 
86 | - Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code. Check out [this helpful article on how to create reprexes](https://dbplyr.tidyverse.org/articles/reprex.html) for problems involving a database.
87 | 
88 | - Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).
89 | 
90 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | 
  2 | # modeldb <img src="man/figures/logo.png" align="right" alt="" width="120" />
  3 | 
  4 | [![R-CMD-check](https://github.com/tidymodels/modeldb/workflows/R-CMD-check/badge.svg)](https://github.com/tidymodels/modeldb/actions)
  5 | [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/modeldb)](https://CRAN.R-project.org/package=modeldb)
  6 | [![Codecov test
  7 | coverage](https://codecov.io/gh/tidymodels/modeldb/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/modeldb?branch=main)
  8 | [![Downloads](http://cranlogs.r-pkg.org/badges/modeldb)](https://CRAN.R-project.org/package=modeldb)
  9 | 
 10 | Fit models inside the database! **modeldb works with most database
 11 | back-ends** because it leverages [dplyr](https://dplyr.tidyverse.org/)
 12 | and [dbplyr](https://dbplyr.tidyverse.org/) for the final SQL
 13 | translation of the algorithm. It currently supports:
 14 | 
 15 | -   K-means clustering
 16 | 
 17 | -   Linear regression
 18 | 
 19 | ## Installation
 20 | 
 21 | Install the CRAN version with:
 22 | 
 23 | ``` r
 24 | install.packages("modeldb")
 25 | ```
 26 | 
 27 | The development version is available from GitHub using remotes:
 28 | 
 29 | ``` r
 30 | # install.packages("remotes")
 31 | remotes::install_github("tidymodels/modeldb")
 32 | ```
 33 | 
 34 | ## Linear regression
 35 | 
 36 | An easy way to try out the package is by creating a temporary SQLite
 37 | database, and loading `mtcars` to it.
 38 | 
 39 | ``` r
 40 | con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
 41 | RSQLite::initExtension(con)
 42 | dplyr::copy_to(con, mtcars)
 43 | ```
 44 | 
 45 | ``` r
 46 | library(dplyr)
 47 | 
 48 | tbl(con, "mtcars") %>%
 49 |   select(wt, mpg, qsec) %>%
 50 |   linear_regression_db(wt)
 51 | ```
 52 | 
 53 |     ## # A tibble: 1 × 3
 54 |     ##   `(Intercept)`    mpg  qsec
 55 |     ##           <dbl>  <dbl> <dbl>
 56 |     ## 1          4.12 -0.156 0.125
 57 | 
 58 | The model output can be parsed by
 59 | [tidypredict](https://tidypredict.tidymodels.org/) to run the
 60 | predictions in the database. Please see the “Linear Regression” article
 61 | to learn more about how to use `linear_regression_db()`
 62 | 
 63 | ## K Means clustering
 64 | 
 65 | To use the `simple_kmeans_db()` function, simply pipe the database back
 66 | end table to the function. This returns a list object that contains two
 67 | items:
 68 | 
 69 | -   A sql query table with the final center assignment
 70 | -   A local table with the information about the centers
 71 | 
 72 | ``` r
 73 | km <- tbl(con, "mtcars") %>%
 74 |   simple_kmeans_db(mpg, wt)
 75 | 
 76 | colnames(km)
 77 | ```
 78 | 
 79 |     ##  [1] "k_center" "k_mpg"    "k_wt"     "mpg"      "cyl"      "disp"    
 80 |     ##  [7] "hp"       "drat"     "wt"       "qsec"     "vs"       "am"      
 81 |     ## [13] "gear"     "carb"
 82 | 
 83 | The SQL statement from `tbl` can be extracted using dbplyr’s
 84 | `remote_query()`
 85 | 
 86 | ``` r
 87 | dbplyr::remote_query(km)
 88 | ```
 89 | 
 90 |     ## <SQL> SELECT `k_center`, `k_mpg`, `k_wt`, `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`
 91 |     ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `LHS`.`k_center` AS `k_center`, `k_mpg`, `k_wt`
 92 |     ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `center` AS `k_center`
 93 |     ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `center_1`, `center_2`, `center_3`, CASE
 94 |     ## WHEN (`center_1` >= `center_1` AND `center_1` < `center_2` AND `center_1` < `center_3`) THEN ('center_1')
 95 |     ## WHEN (`center_2` < `center_1` AND `center_2` >= `center_2` AND `center_2` < `center_3`) THEN ('center_2')
 96 |     ## WHEN (`center_3` < `center_1` AND `center_3` < `center_2` AND `center_3` >= `center_3`) THEN ('center_3')
 97 |     ## END AS `center`
 98 |     ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, SQRT(((20.6428571428571 - `mpg`) * (20.6428571428571 - `mpg`)) + ((3.07214285714286 - `wt`) * (3.07214285714286 - `wt`))) AS `center_1`, SQRT(((14.4583333333333 - `mpg`) * (14.4583333333333 - `mpg`)) + ((4.05866666666667 - `wt`) * (4.05866666666667 - `wt`))) AS `center_2`, SQRT(((30.0666666666667 - `mpg`) * (30.0666666666667 - `mpg`)) + ((1.873 - `wt`) * (1.873 - `wt`))) AS `center_3`
 99 |     ## FROM `mtcars`))
100 |     ## WHERE (NOT(((`center`) IS NULL)))) AS `LHS`
101 |     ## LEFT JOIN (SELECT `center` AS `k_center`, `mpg` AS `k_mpg`, `wt` AS `k_wt`
102 |     ## FROM (SELECT `center`, AVG(`mpg`) AS `mpg`, AVG(`wt`) AS `wt`
103 |     ## FROM (SELECT `mpg`, `wt`, `center`
104 |     ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `center_1`, `center_2`, `center_3`, CASE
105 |     ## WHEN (`center_1` >= `center_1` AND `center_1` < `center_2` AND `center_1` < `center_3`) THEN ('center_1')
106 |     ## WHEN (`center_2` < `center_1` AND `center_2` >= `center_2` AND `center_2` < `center_3`) THEN ('center_2')
107 |     ## WHEN (`center_3` < `center_1` AND `center_3` < `center_2` AND `center_3` >= `center_3`) THEN ('center_3')
108 |     ## END AS `center`
109 |     ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, SQRT(((20.6428571428571 - `mpg`) * (20.6428571428571 - `mpg`)) + ((3.07214285714286 - `wt`) * (3.07214285714286 - `wt`))) AS `center_1`, SQRT(((14.4583333333333 - `mpg`) * (14.4583333333333 - `mpg`)) + ((4.05866666666667 - `wt`) * (4.05866666666667 - `wt`))) AS `center_2`, SQRT(((30.0666666666667 - `mpg`) * (30.0666666666667 - `mpg`)) + ((1.873 - `wt`) * (1.873 - `wt`))) AS `center_3`
110 |     ## FROM `mtcars`))
111 |     ## WHERE (NOT(((`center`) IS NULL))))
112 |     ## GROUP BY `center`)) AS `RHS`
113 |     ## ON (`LHS`.`k_center` = `RHS`.`k_center`)
114 |     ## )
115 | 
116 | ## Contributing
117 | 
118 | This project is released with a [Contributor Code of
119 | Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).
120 | By contributing to this project, you agree to abide by its terms.
121 | 
122 | -   For questions and discussions about tidymodels packages, modeling,
123 |     and machine learning, please [post on Posit
124 |     Community](https://community.rstudio.com/new-topic?category_id=15&tags=tidymodels,question).
125 | 
126 | -   If you think you have encountered a bug, please [submit an
127 |     issue](https://github.com/tidymodels/modeldb/issues).
128 | 
129 | -   Either way, learn how to create and share a
130 |     [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html)
131 |     (a minimal, reproducible example), to clearly communicate about your
132 |     code. Check out [this helpful article on how to create
133 |     reprexes](https://dbplyr.tidyverse.org/articles/reprex.html) for
134 |     problems involving a database.
135 | 
136 | -   Check out further details on [contributing guidelines for tidymodels
137 |     packages](https://www.tidymodels.org/contribute/) and [how to get
138 |     help](https://www.tidymodels.org/help/).
139 | 


--------------------------------------------------------------------------------
/_pkgdown.yml:
--------------------------------------------------------------------------------
 1 | url: https://modeldb.tidymodels.org
 2 | 
 3 | template:
 4 |   package: tidytemplate
 5 |   bootstrap: 5
 6 |   bslib:
 7 |     danger: "#CA225E"
 8 |     primary: "#CA225E"
 9 |   includes:
10 |       in_header: |
11 |         <script defer data-domain="modeldb.tidymodels.org,all.tidymodels.org" src="https://plausible.io/js/plausible.js"></script>
12 | 
13 | development:
14 |   mode: auto
15 | 


--------------------------------------------------------------------------------
/codecov.yml:
--------------------------------------------------------------------------------
 1 | comment: false
 2 | 
 3 | coverage:
 4 |   status:
 5 |     project:
 6 |       default:
 7 |         target: auto
 8 |         threshold: 1%
 9 |         informational: true
10 |     patch:
11 |       default:
12 |         target: auto
13 |         threshold: 1%
14 |         informational: true
15 | 


--------------------------------------------------------------------------------
/cran-comments.md:
--------------------------------------------------------------------------------
 1 | ## Release summary
 2 | 
 3 | * Switches to the Max Kuhn as package maintainer
 4 | 
 5 | ## Comments
 6 | 
 7 | * Having Max as the maintainer will guarantee good follow through for any communications with CRAN.  This is the reason why there is another update is such short order from the previous update in CRAN.  Thank you for your understanding.
 8 | 
 9 | ## Test environments
10 | 
11 | * Ubuntu 18.04.2 LTS, R 3.6.2
12 | 
13 | * Ubuntu 14.04 (on travis-ci)
14 | 
15 | ## R CMD check results
16 | 
17 | * 0 errors | 0 warnings | 0 notes
18 | 
19 | 
20 | 
21 | 


--------------------------------------------------------------------------------
/man/add_dummy_variables.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/model_prep.R
 3 | \name{add_dummy_variables}
 4 | \alias{add_dummy_variables}
 5 | \title{Creates dummy variables}
 6 | \usage{
 7 | add_dummy_variables(
 8 |   df,
 9 |   x,
10 |   values = c(),
11 |   auto_values = FALSE,
12 |   remove_original = TRUE
13 | )
14 | }
15 | \arguments{
16 | \item{df}{A Local or remote data frame}
17 | 
18 | \item{x}{Categorical variable}
19 | 
20 | \item{values}{Possible known values of the categorical variable. If not passed
21 | then the function will take an additional step to figure the unique values of
22 | the variable.}
23 | 
24 | \item{auto_values}{Safeguard argument to prevent the function from figuring the
25 | unique values if the values argument is empty.  If it is ok for this function
26 | to obtain the unique values, set to TRUE.  Defaults to FALSE.}
27 | 
28 | \item{remove_original}{It removes the original variable from the returned table.
29 | Defaults to TRUE.}
30 | }
31 | \description{
32 | It uses 'tidyeval' and 'dplyr' to create dummy variables based for
33 | categorical variables.
34 | }
35 | \examples{
36 | library(dplyr)
37 | 
38 | mtcars \%>\%
39 |   add_dummy_variables(cyl, values = c(4, 6, 8))
40 | 
41 | mtcars \%>\%
42 |   add_dummy_variables(cyl, auto_values = TRUE)
43 | }
44 | 


--------------------------------------------------------------------------------
/man/as_parsed_model.modeldb_lm.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/as-parsed-model.R
 3 | \name{as_parsed_model.modeldb_lm}
 4 | \alias{as_parsed_model.modeldb_lm}
 5 | \title{Prepares parsed model object}
 6 | \usage{
 7 | \method{as_parsed_model}{modeldb_lm}(x)
 8 | }
 9 | \arguments{
10 | \item{x}{A parsed model object}
11 | }
12 | \description{
13 | Prepares parsed model object
14 | }
15 | 


--------------------------------------------------------------------------------
/man/figures/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tidymodels/modeldb/e93e4025f3e08407c33062bfd8f69ecb55111b3c/man/figures/logo.png


--------------------------------------------------------------------------------
/man/linear_regression_db.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/lm.R
 3 | \name{linear_regression_db}
 4 | \alias{linear_regression_db}
 5 | \title{Fits a Linear Regression model}
 6 | \usage{
 7 | linear_regression_db(df, y_var = NULL, sample_size = NULL, auto_count = FALSE)
 8 | }
 9 | \arguments{
10 | \item{df}{A Local or remote data frame}
11 | 
12 | \item{y_var}{Dependent variable}
13 | 
14 | \item{sample_size}{Prevents a table count. It is only used for models
15 | with three or more independent variables}
16 | 
17 | \item{auto_count}{Serves as a safeguard in case sample_size is not
18 | passed inadvertently.  Defaults to FALSE.  If it is ok for the
19 | function to count how many records are in the sample, then set to
20 | TRUE.  It is only used for models with three or more independent variables}
21 | }
22 | \description{
23 | It uses 'tidyeval' and 'dplyr' to create a linear
24 | regression model.
25 | }
26 | \details{
27 | The linear_regression_db() function only calls one of three unexported functions.
28 | The function used is determined by the number of independent variables.  This is
29 | so any model of one or two variables can use a simpler formula, which in turn
30 | will have less SQL overhead.
31 | }
32 | \examples{
33 | library(dplyr)
34 | 
35 | mtcars \%>\%
36 |   select(mpg, wt, qsec) \%>\%
37 |   linear_regression_db(mpg)
38 | 
39 | }
40 | 


--------------------------------------------------------------------------------
/man/modeldb-package.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/modeldb-package.R
 3 | \docType{package}
 4 | \name{modeldb-package}
 5 | \alias{modeldb}
 6 | \alias{modeldb-package}
 7 | \title{modeldb: Fits Models Inside the Database}
 8 | \description{
 9 | \if{html}{\figure{logo.png}{options: style='float: right' alt='logo' width='120'}}
10 | 
11 | Uses 'dplyr' and 'tidyeval' to fit statistical models inside the database. It currently supports KMeans and linear regression models.
12 | }
13 | \seealso{
14 | Useful links:
15 | \itemize{
16 |   \item \url{https://modeldb.tidymodels.org}
17 |   \item \url{https://github.com/tidymodels/modeldb}
18 |   \item Report bugs at \url{https://github.com/tidymodels/modeldb/issues}
19 | }
20 | 
21 | }
22 | \author{
23 | \strong{Maintainer}: Max Kuhn \email{max@posit.co}
24 | 
25 | }
26 | \keyword{internal}
27 | 


--------------------------------------------------------------------------------
/man/plot_kmeans.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/kmeans-viz.R
 3 | \name{plot_kmeans}
 4 | \alias{plot_kmeans}
 5 | \alias{db_calculate_squares}
 6 | \title{Visualize a KMeans Cluster with lots of data}
 7 | \usage{
 8 | plot_kmeans(df, x, y, resolution = 50, group = center)
 9 | 
10 | db_calculate_squares(df, x, y, group, resolution = 50)
11 | }
12 | \arguments{
13 | \item{df}{A Local or remote data frame with results of KMeans clustering}
14 | 
15 | \item{x}{A numeric variable for the x axis}
16 | 
17 | \item{y}{A numeric variable for the y axis}
18 | 
19 | \item{resolution}{The number of squares in the grid. Defaults to 50.
20 | Meaning a 50 x 50 grid.}
21 | 
22 | \item{group}{A discrete variable containing the grouping for the KMeans. It defaults to 'center'}
23 | }
24 | \description{
25 | It uses 'ggplot2' to display the results of a KMeans routine. Instead
26 | of a scatterplot, it uses a square grid that displays the concentration
27 | of intersections per square.  The number of squares in the grid can
28 | be customized for more or less fine grain.
29 | }
30 | \details{
31 | For large result-sets in remote sources, downloading every intersection will
32 | be a long running, costly operation.  The approach of this function is to
33 | devide the x and y plane in a grid and have the remote source figure the
34 | total number of intersections, returned as a single number.  This reduces the
35 | granularity of the visualization, but it speeds up the results.
36 | }
37 | \examples{
38 | plot_kmeans(mtcars, mpg, wt, group = am)
39 | }
40 | 


--------------------------------------------------------------------------------
/man/reexports.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/as-parsed-model.R
 3 | \docType{import}
 4 | \name{reexports}
 5 | \alias{reexports}
 6 | \alias{as_parsed_model}
 7 | \title{Objects exported from other packages}
 8 | \keyword{internal}
 9 | \description{
10 | These objects are imported from other packages. Follow the links
11 | below to see their documentation.
12 | 
13 | \describe{
14 |   \item{tidypredict}{\code{\link[tidypredict]{as_parsed_model}}}
15 | }}
16 | 
17 | 


--------------------------------------------------------------------------------
/man/simple_kmeans_db.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/kmeans.R
 3 | \name{simple_kmeans_db}
 4 | \alias{simple_kmeans_db}
 5 | \title{Simple kmeans routine that works in-database}
 6 | \usage{
 7 | simple_kmeans_db(
 8 |   df,
 9 |   ...,
10 |   centers = 3,
11 |   max_repeats = 100,
12 |   initial_kmeans = NULL,
13 |   safeguard_file = "kmeans.csv",
14 |   verbose = TRUE
15 | )
16 | }
17 | \arguments{
18 | \item{df}{A Local or remote data frame}
19 | 
20 | \item{...}{A list of variables to be used in the kmeans algorithm}
21 | 
22 | \item{centers}{The number of centers. Defaults to 3.}
23 | 
24 | \item{max_repeats}{The maximum number of cycles to run. Defaults to 100.}
25 | 
26 | \item{initial_kmeans}{A local dataframe with initial centroid values. Defaults to NULL.}
27 | 
28 | \item{safeguard_file}{Each cycle will update a file specified in this argument
29 | with the current centers.  Defaults to 'kmeans.csv'. Pass NULL if no file is
30 | desired.}
31 | 
32 | \item{verbose}{Indicates if the progress bar will be displayed during the model's fitting.}
33 | }
34 | \description{
35 | It uses 'tidyeval' and 'dplyr' to run multiple cycles of kmean
36 | calculations, expressed in dplyr formulas until an the optimal
37 | centers are found.
38 | }
39 | \details{
40 | Because each cycle is an independent 'dplyr' operation, or SQL operation if using a remote source,
41 | the latest centroid data frame is saved to the parent environment in case the process needs to be
42 | canceled and then restarted at a later point.  Passing the \code{current_kmeans} as the \code{initial_kmeans}
43 | will allow the operation to pick up where it left off.
44 | }
45 | \examples{
46 | library(dplyr)
47 | 
48 | mtcars \%>\%
49 |   simple_kmeans_db(mpg, qsec, wt) \%>\%
50 |   glimpse()
51 | 
52 | }
53 | 


--------------------------------------------------------------------------------
/modeldb.Rproj:
--------------------------------------------------------------------------------
 1 | Version: 1.0
 2 | 
 3 | RestoreWorkspace: Default
 4 | SaveWorkspace: Default
 5 | AlwaysSaveHistory: Default
 6 | 
 7 | EnableCodeIndexing: Yes
 8 | UseSpacesForTab: Yes
 9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 | 
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 | 
15 | BuildType: Package
16 | PackageUseDevtools: Yes
17 | PackageInstallArgs: --no-multiarch --with-keep.source
18 | PackageRoxygenize: rd,collate,namespace,vignette
19 | 


--------------------------------------------------------------------------------
/revdep/.gitignore:
--------------------------------------------------------------------------------
1 | checks
2 | data.sqlite
3 | library
4 | 


--------------------------------------------------------------------------------
/revdep/README.md:
--------------------------------------------------------------------------------
 1 | # Platform
 2 | 
 3 | |field    |value                        |
 4 | |:--------|:----------------------------|
 5 | |version  |R version 3.6.0 (2019-04-26) |
 6 | |os       |Windows 10 x64               |
 7 | |system   |x86_64, mingw32              |
 8 | |ui       |RStudio                      |
 9 | |language |(EN)                         |
10 | |collate  |English_United States.1252   |
11 | |ctype    |English_United States.1252   |
12 | |tz       |America/Chicago              |
13 | |date     |2019-07-19                   |
14 | 
15 | # Dependencies
16 | 
17 | |package      |old      |new      |<U+0394>  |
18 | |:------------|:--------|:--------|:--|
19 | |modeldb      |0.1.2    |0.2.0    |*  |
20 | |assertthat   |0.2.1    |0.2.1    |   |
21 | |backports    |1.1.4    |1.1.4    |   |
22 | |BH           |1.69.0-1 |1.69.0-1 |   |
23 | |cli          |1.1.0    |1.1.0    |   |
24 | |colorspace   |1.4-1    |1.4-1    |   |
25 | |crayon       |1.3.4    |1.3.4    |   |
26 | |digest       |0.6.20   |0.6.20   |   |
27 | |dplyr        |0.8.3    |0.8.3    |   |
28 | |ellipsis     |0.2.0.1  |0.2.0.1  |   |
29 | |evaluate     |NA       |0.14     |*  |
30 | |fansi        |0.4.0    |0.4.0    |   |
31 | |generics     |NA       |0.0.2    |*  |
32 | |ggplot2      |3.2.0    |3.2.0    |   |
33 | |glue         |1.3.1    |1.3.1    |   |
34 | |gtable       |0.3.0    |0.3.0    |   |
35 | |highr        |NA       |0.8      |*  |
36 | |hms          |0.5.0    |0.5.0    |   |
37 | |knitr        |NA       |1.23     |*  |
38 | |labeling     |0.3      |0.3      |   |
39 | |lazyeval     |0.2.2    |0.2.2    |   |
40 | |magrittr     |1.5      |1.5      |   |
41 | |markdown     |NA       |1.0      |*  |
42 | |mime         |NA       |0.7      |*  |
43 | |munsell      |0.5.0    |0.5.0    |   |
44 | |pillar       |1.4.2    |1.4.2    |   |
45 | |pkgconfig    |2.0.2    |2.0.2    |   |
46 | |plogr        |0.2.0    |0.2.0    |   |
47 | |plyr         |1.8.4    |1.8.4    |   |
48 | |prettyunits  |1.0.2    |1.0.2    |   |
49 | |progress     |1.2.2    |1.2.2    |   |
50 | |purrr        |0.3.2    |0.3.2    |   |
51 | |R6           |2.4.0    |2.4.0    |   |
52 | |RColorBrewer |1.1-2    |1.1-2    |   |
53 | |Rcpp         |1.0.1    |1.0.1    |   |
54 | |reshape2     |1.4.3    |1.4.3    |   |
55 | |rlang        |0.4.0    |0.4.0    |   |
56 | |scales       |1.0.0    |1.0.0    |   |
57 | |stringi      |1.4.3    |1.4.3    |   |
58 | |stringr      |1.4.0    |1.4.0    |   |
59 | |tibble       |2.1.3    |2.1.3    |   |
60 | |tidypredict  |NA       |0.4.2    |*  |
61 | |tidyselect   |0.2.5    |0.2.5    |   |
62 | |utf8         |1.1.4    |1.1.4    |   |
63 | |vctrs        |0.2.0    |0.2.0    |   |
64 | |viridisLite  |0.3.0    |0.3.0    |   |
65 | |withr        |2.1.2    |2.1.2    |   |
66 | |xfun         |NA       |0.8      |*  |
67 | |yaml         |NA       |2.2.0    |*  |
68 | |zeallot      |0.1.0    |0.1.0    |   |
69 | 
70 | # Revdeps
71 | 
72 | 


--------------------------------------------------------------------------------
/revdep/check.R:
--------------------------------------------------------------------------------
1 | #devtools::install_github("r-lib/revdepcheck")
2 | library(revdepcheck)
3 | 
4 | #revdepcheck::revdep_reset()
5 | revdep_check()
6 | 


--------------------------------------------------------------------------------
/revdep/failures.md:
--------------------------------------------------------------------------------
1 | *Wow, no problems at all. :)*


--------------------------------------------------------------------------------
/revdep/problems.md:
--------------------------------------------------------------------------------
1 | *Wow, no problems at all. :)*


--------------------------------------------------------------------------------
/tests/testthat.R:
--------------------------------------------------------------------------------
 1 | # This file is part of the standard setup for testthat.
 2 | # It is recommended that you do not modify it.
 3 | #
 4 | # Where should you do additional test configuration?
 5 | # Learn more about the roles of various files in:
 6 | # * https://r-pkgs.org/testing-design.html#sec-tests-files-overview
 7 | # * https://testthat.r-lib.org/articles/special-files.html
 8 | 
 9 | library(testthat)
10 | library(dplyr)
11 | library(purrr)
12 | library(modeldb)
13 | 
14 | test_check("modeldb")
15 | 


--------------------------------------------------------------------------------
/tests/testthat/test-as-parsed-model.R:
--------------------------------------------------------------------------------
 1 | context("parsed model")
 2 | 
 3 | test_that("Simple linear regression matches lm()", {
 4 |   expect_is(
 5 |     as_parsed_model(
 6 |       linear_regression_db(mtcars, wt, sample_size = 32)
 7 |     ),
 8 |     "parsed_model"
 9 |   )
10 | })
11 | 


--------------------------------------------------------------------------------
/tests/testthat/test_dummy_var.R:
--------------------------------------------------------------------------------
 1 | context("add_dummy_variables")
 2 | 
 3 | test_that("Function create the correct columns", {
 4 |   cols_expected <- c("mpg", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "cyl_6", "cyl_8")
 5 |   expect_equal(
 6 |     colnames(add_dummy_variables(mtcars, cyl, c(4, 6, 8))),
 7 |     cols_expected
 8 |   )
 9 |   expect_equal(
10 |     colnames(add_dummy_variables(mtcars, cyl, auto_values = TRUE)),
11 |     cols_expected
12 |   )
13 | })
14 | 
15 | test_that("Function fails when no values are passed and auto_values is FALSE", {
16 |   expect_error(add_dummy_variables(mtcars, cyl))
17 |   expect_error(add_dummy_variables(mtcars, cyl, auto_values = FALSE))
18 | })
19 | 
20 | test_that("Has error when variable is missing", {
21 |   expect_error(add_dummy_variables(mtcars, error))
22 | })
23 | 


--------------------------------------------------------------------------------
/tests/testthat/test_kmeans.R:
--------------------------------------------------------------------------------
 1 | context("kmeans")
 2 | 
 3 | test_that("Specifying variables works", {
 4 |   expect_is(
 5 |     simple_kmeans_db(mtcars, wt, mpg),
 6 |     "data.frame"
 7 |   )
 8 | })
 9 | 
10 | test_that("Not specifying variables works", {
11 |   expect_silent(
12 |     select(mtcars, wt, mpg) %>%
13 |       simple_kmeans_db()
14 |   )
15 | })
16 | 
17 | test_that("Centroid argument is accepted", {
18 |   ik <- data.frame(
19 |     wt = c(3.072143, 4.058667, 1.873000),
20 |     mpg = c(20.64286, 14.45833, 30.06667)
21 |   )
22 | 
23 |   expect_silent(
24 |     simple_kmeans_db(mtcars, mpg, wt, initial_kmeans = ik)
25 |   )
26 | })
27 | 


--------------------------------------------------------------------------------
/tests/testthat/test_kmeans_viz.R:
--------------------------------------------------------------------------------
 1 | context("kmeans_viz")
 2 | 
 3 | 
 4 | test_that("plot_kmeans() returns a ggplot2 object", {
 5 |   expect_equal(
 6 |     class(plot_kmeans(mtcars, mpg, wt, group = am)),
 7 |     c("gg", "ggplot")
 8 |   )
 9 | })
10 | 
11 | test_that("plot_kmeans() returns error when no group is passed", {
12 |   expect_error(
13 |     plot_kmeans(mtcars, mpg, wt)
14 |   )
15 | })
16 | 
17 | test_that("Updating the resolution argument impacts the results", {
18 |   expect_false(
19 |     nrow(db_calculate_squares(mtcars,
20 |       mpg, wt,
21 |       group = am,
22 |       resolution = 50
23 |     )) ==
24 |       nrow(db_calculate_squares(mtcars,
25 |         mpg, wt,
26 |         group = am,
27 |         resolution = 30
28 |       ))
29 |   )
30 | })
31 | 


--------------------------------------------------------------------------------
/tests/testthat/test_lr.R:
--------------------------------------------------------------------------------
 1 | context("Linear Regression")
 2 | 
 3 | test_that("Simple linear regression matches lm()", {
 4 |   expect_equal(
 5 |     lm(wt ~ mpg, data = mtcars) %>%
 6 |       coef() %>%
 7 |       as.numeric(),
 8 |     mtcars %>%
 9 |       select(wt, mpg) %>%
10 |       linear_regression_db(wt) %>%
11 |       as.numeric()
12 |   )
13 | })
14 | 
15 | test_that("Two variable linear regression matches lm()", {
16 |   expect_equal(
17 |     lm(wt ~ mpg + qsec, data = mtcars) %>%
18 |       coef() %>%
19 |       as.numeric(),
20 |     mtcars %>%
21 |       select(wt, mpg, qsec) %>%
22 |       linear_regression_db(wt) %>%
23 |       as.numeric()
24 |   )
25 | })
26 | 
27 | test_that("Multiple variable linear regression matches lm()", {
28 |   expect_equal(
29 |     lm(wt ~ mpg + qsec + hp, data = mtcars) %>%
30 |       coef() %>%
31 |       as.numeric(),
32 |     mtcars %>%
33 |       select(wt, mpg, qsec, hp) %>%
34 |       linear_regression_db(wt, sample_size = 32) %>%
35 |       as.numeric()
36 |   )
37 | })
38 | 
39 | test_that("MLR matches lm() with auto_count set to TRUE", {
40 |   expect_equal(
41 |     lm(wt ~ mpg + qsec + hp, data = mtcars) %>%
42 |       coef() %>%
43 |       as.numeric(),
44 |     mtcars %>%
45 |       select(wt, mpg, qsec, hp) %>%
46 |       linear_regression_db(wt, auto_count = TRUE) %>%
47 |       as.numeric()
48 |   )
49 | })
50 | 
51 | test_that("MLR failes when auto_count set to FALSE and no sample_size is passed", {
52 |   expect_error(
53 |     mtcars %>%
54 |       select(wt, mpg, qsec, hp) %>%
55 |       linear_regression_db(wt, auto_count = FALSE)
56 |   )
57 |   expect_error(
58 |     mtcars %>%
59 |       select(wt, mpg, qsec, hp) %>%
60 |       linear_regression_db(wt)
61 |   )
62 | })
63 | 
64 | test_that("mlr with grouping matches lm()", {
65 |   expect_equal(
66 |     mtcars %>%
67 |       select(wt, mpg, qsec, hp, am) %>%
68 |       group_by(am) %>%
69 |       linear_regression_db(wt, auto_count = TRUE) %>%
70 |       transpose() %>%
71 |       map(~ as.numeric(.x)),
72 |     0:1 %>%
73 |       map(~ {
74 |         mtcars %>%
75 |           filter(am == .x) %>%
76 |           lm(wt ~ mpg + qsec + hp, data = .) %>%
77 |           coef() %>%
78 |           as.numeric() %>%
79 |           c(.x, .)
80 |       })
81 |   )
82 | })
83 | 


--------------------------------------------------------------------------------
/vignettes/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 | *.R
3 | 


--------------------------------------------------------------------------------
/vignettes/kmeans.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | title: "K Means models"
 3 | author: "Edgar Ruiz"
 4 | date: "`r Sys.Date()`"
 5 | output: rmarkdown::html_vignette
 6 | vignette: >
 7 |   %\VignetteIndexEntry{K Means models}
 8 |   %\VignetteEngine{knitr::rmarkdown}
 9 |   %\VignetteEncoding{UTF-8}
10 | ---
11 | 
12 | ```{r setup, include = FALSE}
13 | knitr::opts_chunk$set(
14 |   collapse = TRUE,
15 |   comment = "#>"
16 | )
17 | library(dplyr)
18 | library(purrr)
19 | library(rlang)
20 | library(nycflights13)
21 | ```
22 | 
23 | ## Intro
24 | 
25 | The `simple_kmeans_db()` function enables running the KMeans model inside the database.  It uses `dplyr` programming to abstract the steps needed produce a model, so that it can then be translated into SQL statements in the background.
26 | 
27 | ## Example setup
28 | 
29 | In this example, a simple `RSQlite` database will be use to load the `flights` data from the `nycflights13` library.
30 | 
31 | ```{r}
32 | library(dplyr)
33 | 
34 | con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
35 | RSQLite::initExtension(con)
36 | 
37 | db_flights <- copy_to(con, nycflights13::flights, "flights")
38 | ```
39 | 
40 | ## Running Kmeans clustering
41 | 
42 | The function `simple_kmeans_db()` can use with local data, or a remote table, such as the `db_flights` variable that is a pointer to the "flights" table inside the SQLite database.  When piping to the function, the only other required arguments are two or more fields separated by comma.  Because it uses 'tidyeval', the variable name auto-completion will work.
43 | 
44 | ```{r}
45 | library(modeldb)
46 | 
47 | km <- db_flights %>%
48 |   simple_kmeans_db(dep_time, distance)
49 | ```
50 | 
51 | The `simple_kmeans_db()` function uses a progress bar to show you the current cycle, the maximum cycles it's expected to run, the current difference between the previous cycle and the current cycle, and the running time.  The loop will stop once it wither has two matching consecutive cycles, or if it reaches the maximum number of cycles, as determined by the `max_repeats` argument.
52 | 
53 | The final **centers** are are stored in the `k_center` variable of the returned object
54 | ```{r}
55 | km %>% pull(k_center)
56 | ```
57 | 
58 | The latest results are stored in the `tbl` variable of the returned object.  The type of the returned table will match the type of the original source, so if it is a remote source, such as database table, then `tbl` will be a class `tbl_sql`.  This will allow us to do two thing:
59 | 
60 | - View the SQL statement that was used to find the final centers:
61 | ```{r}
62 | dbplyr::remote_query(km)
63 | ```
64 | 
65 | ## Under the hood
66 | 
67 | The `simple_kmeans_db()` function uses `dplyr` and 'tidyeval' to run the KMeans algorithm.  This means that when combined with `dbplyr`, the routines can be run inside a database.  
68 | 
69 | Unlike other packages that use this same methodology, such as `dbplot` and `tidypredict`, `simple_kmeans_db()` does not create a single `dplyr` code that can be extracted as SQL.  The function produces multiple, serial and dependent SQL statements that run individually inside the database. Each statement uses the current *centroids*, or centers, to estimate new centroids, and then it uses those centroids in a consecutive SQL statement to see if there was any variance.  Effectively, this approach uses R not only as translation layer, but also as an orchestration layer.
70 | 
71 | ## Safeguards for long running jobs
72 | 
73 | The`simple_kmeans_db()` approach of using multiple and consecutive SQL queries to find the optimal centers, additionally, in KMeans clustering, it matters the order in which the each set of centers is passed.  This creates an imperative to find a way to cache the current centers used in a long running job, in case the job is canceled or fails.  Starting from the centers that were calculated last, will mean that re-starting the job will not being from "0", but from a more advanced, read closer, set of centers.  
74 | 
75 | The safeguard implemented in this function is trough a file, called *kmeans.csv*.  Each cycle will update the file.  The file name can be changed by modifying the `safeguard_file` argument. Setting the argument to NULL will turn off the safeguard.  The file will be saved to the temporary directory of the R session.. 
76 | 
77 | In this example we will set the `max_repats` to 10, so as to artificially avoid finding the optimal means
78 | ```{r}
79 | km <- db_flights %>%
80 |   simple_kmeans_db(dep_time, distance, max_repeats = 10)
81 | ```
82 | 
83 | In the next run, the "kmeans.csv" file is passed as the `initial_kmeans` argument.  This will make `simple_kmeans_db()` use those centers as the starting point:
84 | 
85 | ```{r}
86 | km <- db_flights %>%
87 |   simple_kmeans_db(dep_time, distance, initial_kmeans = read.csv(file.path(tempdir(), "kmeans.csv")))
88 | ```
89 | 
90 | The second run took 7 cycles to complete, which adds up to the 17 cycles that it initially took in the first example at the top of this article.
91 | 
92 | 


--------------------------------------------------------------------------------
/vignettes/linear-regression.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "Linear regression models"
  3 | author: "Edgar Ruiz"
  4 | date: "`r Sys.Date()`"
  5 | output: rmarkdown::html_vignette
  6 | vignette: >
  7 |   %\VignetteIndexEntry{Linear regression models}
  8 |   %\VignetteEngine{knitr::rmarkdown}
  9 |   %\VignetteEncoding{UTF-8}
 10 | ---
 11 | 
 12 | ```{r setup, include = FALSE}
 13 | knitr::opts_chunk$set(
 14 |   collapse = TRUE,
 15 |   comment = "#>"
 16 | )
 17 | 
 18 | knitr::opts_chunk$set(echo = TRUE)
 19 | library(dplyr)
 20 | library(dbplyr)
 21 | library(nycflights13)
 22 | library(DBI)
 23 | library(modeldb)
 24 | ```
 25 | 
 26 | ## Intro
 27 | 
 28 | The `linear_regression_db()` function can be used to fit this kind of model inside a database.  It uses `dplyr` programming to abstract the steps needed produce a model, so that it can then be translated into SQL statements in the background.
 29 | 
 30 | ## Example setup
 31 | 
 32 | A lightweight SQLite database will be used for this article.  Additionally, a sample data set is created.
 33 | 
 34 | ```{r}
 35 | # Open a database connection
 36 | con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
 37 | RSQLite::initExtension(con)
 38 | 
 39 | library(dplyr)
 40 | # Copy data to the database
 41 | db_flights <- copy_to(con, nycflights13::flights, "flights")
 42 | # Create a simple sample
 43 | db_sample <- db_flights %>%
 44 |   filter(!is.na(arr_time)) %>%
 45 |   head(20000)
 46 | ```
 47 | 
 48 | 
 49 | ## Model inside the database
 50 | 
 51 | The `linear_regression_db()` function does not use a formula.  It uses a table, and a named dependent variable.  This means data preparation is needed prior to running the model.  The best way to prepare the data for modeling will be using piped `dplyr` operations.
 52 | 
 53 | ```{r}
 54 | db_sample %>%
 55 |   select(arr_delay, dep_delay, distance) %>%
 56 |   linear_regression_db(arr_delay)
 57 | ```
 58 | 
 59 | ## Categorical variables
 60 | 
 61 | Adding a categorical a variable to a model requires prior data transformation  The `add_dummy_variables()` appends a set of boolean variables, one for each discrete value.  This function creates one-less discrete variable than the possible values.  For example: if the categorical variable has three possible values, the function will append two variables.  By default, `add_dummy_variables()` removes the original variable.
 62 | 
 63 | The reason for this approach is to reduce the number of database operations.  Without this step, then a fitting function would have to request all of the unique values every time a new model run, which creates unnecessary processing. 
 64 | 
 65 | ```{r}
 66 | db_sample %>%
 67 |   select(arr_delay, origin) %>%
 68 |   add_dummy_variables(origin, values = c("EWR", "JFK", "LGA"))
 69 | ```
 70 | 
 71 | In a real world scenario, the possible values are usually not known at the beginning of the analysis. So it is a good idea to load them into a vector variable so that it can be used any time that variable is added to a model.  This can be easily done using the `pull()` command from `dplyr`:
 72 | 
 73 | ```{r}
 74 | origins <- db_flights %>%
 75 |   group_by(origin) %>%
 76 |   summarise() %>%
 77 |   pull()
 78 | 
 79 | origins
 80 | ```
 81 | 
 82 | The `add_dummy_variables()` can be used as part of the piped code that terminates in the modeling function.
 83 | 
 84 | ```{r}
 85 | db_sample %>%
 86 |   select(arr_delay, origin) %>%
 87 |   add_dummy_variables(origin, values = origins) %>%
 88 |   linear_regression_db(arr_delay)
 89 | ```
 90 | 
 91 | ## Multiple linear regression
 92 | 
 93 | One of two arguments is needed to be set when fitting a model with three or more independent variables.  The both relate to the size of the data set used for the model.  So either the `sample_size` argument is passed, or `auto_count` is set to `TRUE`.  When `auto_count` is set to `TRUE`, and no sample size is passed, then the function will do a table count as part of the model fitting.  This is done in order to prevent unnecessary database operations, especially for cases when multiple models will be tested on top of the same sample data. 
 94 | 
 95 | ```{r}
 96 | db_sample %>%
 97 |   select(arr_delay, arr_time, dep_delay, dep_time) %>%
 98 |   linear_regression_db(arr_delay, sample_size = 20000)
 99 | ```
100 | 
101 | ## Interactions
102 | 
103 | Interactions have to be handled manually prior the modeling step.  
104 | 
105 | ```{r}
106 | db_sample %>%
107 |   mutate(distanceXarr_time = distance * arr_time) %>%
108 |   select(arr_delay, distanceXarr_time) %>%
109 |   linear_regression_db(arr_delay, sample_size = 20000)
110 | ```
111 | 
112 | A more typical model would also include the two original variables:
113 | 
114 | ```{r}
115 | db_sample %>%
116 |   mutate(distanceXarr_time = distance * arr_time) %>%
117 |   select(arr_delay, distance, arr_time, distanceXarr_time) %>%
118 |   linear_regression_db(arr_delay, sample_size = 20000)
119 | ```
120 | 
121 | ## Full example
122 | 
123 | Fitting a model with regular, categorical and interaction variables will look like this:
124 | 
125 | ```{r}
126 | remote_model <- db_sample %>%
127 |   mutate(distanceXarr_time = distance * arr_time) %>%
128 |   select(arr_delay, dep_time, distanceXarr_time, origin) %>%
129 |   add_dummy_variables(origin, values = origins) %>%
130 |   linear_regression_db(y_var = arr_delay, sample_size = 20000)
131 | 
132 | remote_model
133 | ```
134 | 
135 | ```{r, echo = FALSE}
136 | dbDisconnect(con)
137 | ```
138 | 


--------------------------------------------------------------------------------