├── .Rbuildignore ├── .github ├── .gitignore ├── CODE_OF_CONDUCT.md └── workflows │ ├── R-CMD-check.yaml │ ├── lock.yaml │ ├── pkgdown.yaml │ ├── pr-commands.yaml │ └── test-coverage.yaml ├── .gitignore ├── DESCRIPTION ├── LICENSE ├── LICENSE.md ├── NAMESPACE ├── NEWS.md ├── R ├── as-parsed-model.R ├── kmeans-viz.R ├── kmeans.R ├── lm.R ├── model_prep.R └── modeldb-package.R ├── README.Rmd ├── README.md ├── _pkgdown.yml ├── codecov.yml ├── cran-comments.md ├── man ├── add_dummy_variables.Rd ├── as_parsed_model.modeldb_lm.Rd ├── figures │ └── logo.png ├── linear_regression_db.Rd ├── modeldb-package.Rd ├── plot_kmeans.Rd ├── reexports.Rd └── simple_kmeans_db.Rd ├── modeldb.Rproj ├── revdep ├── .gitignore ├── README.md ├── check.R ├── failures.md └── problems.md ├── tests ├── testthat.R └── testthat │ ├── test-as-parsed-model.R │ ├── test_dummy_var.R │ ├── test_kmeans.R │ ├── test_kmeans_viz.R │ └── test_lr.R └── vignettes ├── .gitignore ├── kmeans.Rmd └── linear-regression.Rmd /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^pkgdown$ 2 | ^docs$ 3 | ^_pkgdown\.yml$ 4 | ^CRAN-RELEASE$ 5 | ^.*\.Rproj$ 6 | ^\.Rproj\.user$ 7 | ^README\.Rmd$ 8 | ^README-.*\.png$ 9 | ^README_files$ 10 | travis.yml 11 | config.toml 12 | ^codecov\.yml$ 13 | ^revdep$ 14 | cran-comments.md 15 | ^tools 16 | ^CRAN-RELEASE 17 | ^\.httr-oauth$ 18 | ^\.github$ 19 | ^CODE_OF_CONDUCT\.md$ 20 | ^LICENSE\.md$ 21 | -------------------------------------------------------------------------------- /.github/.gitignore: -------------------------------------------------------------------------------- 1 | *.html 2 | -------------------------------------------------------------------------------- /.github/CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, caste, color, religion, or sexual 10 | identity and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the overall 26 | community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or advances of 31 | any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email address, 35 | without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement at codeofconduct@posit.co. 63 | All complaints will be reviewed and investigated promptly and fairly. 64 | 65 | All community leaders are obligated to respect the privacy and security of the 66 | reporter of any incident. 67 | 68 | ## Enforcement Guidelines 69 | 70 | Community leaders will follow these Community Impact Guidelines in determining 71 | the consequences for any action they deem in violation of this Code of Conduct: 72 | 73 | ### 1. Correction 74 | 75 | **Community Impact**: Use of inappropriate language or other behavior deemed 76 | unprofessional or unwelcome in the community. 77 | 78 | **Consequence**: A private, written warning from community leaders, providing 79 | clarity around the nature of the violation and an explanation of why the 80 | behavior was inappropriate. A public apology may be requested. 81 | 82 | ### 2. Warning 83 | 84 | **Community Impact**: A violation through a single incident or series of 85 | actions. 86 | 87 | **Consequence**: A warning with consequences for continued behavior. No 88 | interaction with the people involved, including unsolicited interaction with 89 | those enforcing the Code of Conduct, for a specified period of time. This 90 | includes avoiding interactions in community spaces as well as external channels 91 | like social media. Violating these terms may lead to a temporary or permanent 92 | ban. 93 | 94 | ### 3. Temporary Ban 95 | 96 | **Community Impact**: A serious violation of community standards, including 97 | sustained inappropriate behavior. 98 | 99 | **Consequence**: A temporary ban from any sort of interaction or public 100 | communication with the community for a specified period of time. No public or 101 | private interaction with the people involved, including unsolicited interaction 102 | with those enforcing the Code of Conduct, is allowed during this period. 103 | Violating these terms may lead to a permanent ban. 104 | 105 | ### 4. Permanent Ban 106 | 107 | **Community Impact**: Demonstrating a pattern of violation of community 108 | standards, including sustained inappropriate behavior, harassment of an 109 | individual, or aggression toward or disparagement of classes of individuals. 110 | 111 | **Consequence**: A permanent ban from any sort of public interaction within the 112 | community. 113 | 114 | ## Attribution 115 | 116 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 117 | version 2.1, available at 118 | . 119 | 120 | Community Impact Guidelines were inspired by 121 | [Mozilla's code of conduct enforcement ladder][https://github.com/mozilla/inclusion]. 122 | 123 | For answers to common questions about this code of conduct, see the FAQ at 124 | . Translations are available at . 125 | 126 | [homepage]: https://www.contributor-covenant.org 127 | -------------------------------------------------------------------------------- /.github/workflows/R-CMD-check.yaml: -------------------------------------------------------------------------------- 1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples 2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help 3 | # 4 | # NOTE: This workflow is overkill for most R packages and 5 | # check-standard.yaml is likely a better choice. 6 | # usethis::use_github_action("check-standard") will install it. 7 | on: 8 | push: 9 | branches: [main, master] 10 | pull_request: 11 | branches: [main, master] 12 | 13 | name: R-CMD-check 14 | 15 | jobs: 16 | R-CMD-check: 17 | runs-on: ${{ matrix.config.os }} 18 | 19 | name: ${{ matrix.config.os }} (${{ matrix.config.r }}) 20 | 21 | strategy: 22 | fail-fast: false 23 | matrix: 24 | config: 25 | - {os: macos-latest, r: 'release'} 26 | 27 | - {os: windows-latest, r: 'release'} 28 | # Use 3.6 to trigger usage of RTools35 29 | - {os: windows-latest, r: '3.6'} 30 | # use 4.1 to check with rtools40's older compiler 31 | - {os: windows-latest, r: '4.1'} 32 | 33 | - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'} 34 | - {os: ubuntu-latest, r: 'release'} 35 | - {os: ubuntu-latest, r: 'oldrel-1'} 36 | - {os: ubuntu-latest, r: 'oldrel-2'} 37 | - {os: ubuntu-latest, r: 'oldrel-3'} 38 | 39 | env: 40 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 41 | R_KEEP_PKG_SOURCE: yes 42 | 43 | steps: 44 | - uses: actions/checkout@v3 45 | 46 | - uses: r-lib/actions/setup-pandoc@v2 47 | 48 | - uses: r-lib/actions/setup-r@v2 49 | with: 50 | r-version: ${{ matrix.config.r }} 51 | http-user-agent: ${{ matrix.config.http-user-agent }} 52 | use-public-rspm: true 53 | 54 | - uses: r-lib/actions/setup-r-dependencies@v2 55 | with: 56 | extra-packages: any::rcmdcheck 57 | needs: check 58 | 59 | - uses: r-lib/actions/check-r-package@v2 60 | with: 61 | upload-snapshots: true 62 | -------------------------------------------------------------------------------- /.github/workflows/lock.yaml: -------------------------------------------------------------------------------- 1 | name: 'Lock Threads' 2 | 3 | on: 4 | schedule: 5 | - cron: '0 0 * * *' 6 | 7 | jobs: 8 | lock: 9 | runs-on: ubuntu-latest 10 | steps: 11 | - uses: dessant/lock-threads@v2 12 | with: 13 | github-token: ${{ github.token }} 14 | issue-lock-inactive-days: '14' 15 | # issue-exclude-labels: '' 16 | # issue-lock-labels: 'outdated' 17 | issue-lock-comment: > 18 | This issue has been automatically locked. If you believe you have 19 | found a related problem, please file a new issue (with a reprex: 20 | ) and link to this issue. 21 | issue-lock-reason: '' 22 | pr-lock-inactive-days: '14' 23 | # pr-exclude-labels: 'wip' 24 | pr-lock-labels: '' 25 | pr-lock-comment: > 26 | This pull request has been automatically locked. If you believe you 27 | have found a related problem, please file a new issue (with a reprex: 28 | ) and link to this issue. 29 | pr-lock-reason: '' 30 | # process-only: 'issues' 31 | -------------------------------------------------------------------------------- /.github/workflows/pkgdown.yaml: -------------------------------------------------------------------------------- 1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples 2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help 3 | on: 4 | push: 5 | branches: [main, master] 6 | pull_request: 7 | branches: [main, master] 8 | release: 9 | types: [published] 10 | workflow_dispatch: 11 | 12 | name: pkgdown 13 | 14 | jobs: 15 | pkgdown: 16 | runs-on: ubuntu-latest 17 | # Only restrict concurrency for non-PR jobs 18 | concurrency: 19 | group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }} 20 | env: 21 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 22 | permissions: 23 | contents: write 24 | steps: 25 | - uses: actions/checkout@v3 26 | 27 | - uses: r-lib/actions/setup-pandoc@v2 28 | 29 | - uses: r-lib/actions/setup-r@v2 30 | with: 31 | use-public-rspm: true 32 | 33 | - uses: r-lib/actions/setup-r-dependencies@v2 34 | with: 35 | extra-packages: any::pkgdown, local::. 36 | needs: website 37 | 38 | - name: Build site 39 | run: pkgdown::build_site_github_pages(new_process = FALSE, install = FALSE) 40 | shell: Rscript {0} 41 | 42 | - name: Deploy to GitHub pages 🚀 43 | if: github.event_name != 'pull_request' 44 | uses: JamesIves/github-pages-deploy-action@v4.4.1 45 | with: 46 | clean: false 47 | branch: gh-pages 48 | folder: docs 49 | -------------------------------------------------------------------------------- /.github/workflows/pr-commands.yaml: -------------------------------------------------------------------------------- 1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples 2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help 3 | on: 4 | issue_comment: 5 | types: [created] 6 | 7 | name: Commands 8 | 9 | jobs: 10 | document: 11 | if: ${{ github.event.issue.pull_request && (github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'OWNER') && startsWith(github.event.comment.body, '/document') }} 12 | name: document 13 | runs-on: ubuntu-latest 14 | env: 15 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 16 | steps: 17 | - uses: actions/checkout@v3 18 | 19 | - uses: r-lib/actions/pr-fetch@v2 20 | with: 21 | repo-token: ${{ secrets.GITHUB_TOKEN }} 22 | 23 | - uses: r-lib/actions/setup-r@v2 24 | with: 25 | use-public-rspm: true 26 | 27 | - uses: r-lib/actions/setup-r-dependencies@v2 28 | with: 29 | extra-packages: any::roxygen2 30 | needs: pr-document 31 | 32 | - name: Document 33 | run: roxygen2::roxygenise() 34 | shell: Rscript {0} 35 | 36 | - name: commit 37 | run: | 38 | git config --local user.name "$GITHUB_ACTOR" 39 | git config --local user.email "$GITHUB_ACTOR@users.noreply.github.com" 40 | git add man/\* NAMESPACE 41 | git commit -m 'Document' 42 | 43 | - uses: r-lib/actions/pr-push@v2 44 | with: 45 | repo-token: ${{ secrets.GITHUB_TOKEN }} 46 | 47 | style: 48 | if: ${{ github.event.issue.pull_request && (github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'OWNER') && startsWith(github.event.comment.body, '/style') }} 49 | name: style 50 | runs-on: ubuntu-latest 51 | env: 52 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 53 | steps: 54 | - uses: actions/checkout@v3 55 | 56 | - uses: r-lib/actions/pr-fetch@v2 57 | with: 58 | repo-token: ${{ secrets.GITHUB_TOKEN }} 59 | 60 | - uses: r-lib/actions/setup-r@v2 61 | 62 | - name: Install dependencies 63 | run: install.packages("styler") 64 | shell: Rscript {0} 65 | 66 | - name: Style 67 | run: styler::style_pkg() 68 | shell: Rscript {0} 69 | 70 | - name: commit 71 | run: | 72 | git config --local user.name "$GITHUB_ACTOR" 73 | git config --local user.email "$GITHUB_ACTOR@users.noreply.github.com" 74 | git add \*.R 75 | git commit -m 'Style' 76 | 77 | - uses: r-lib/actions/pr-push@v2 78 | with: 79 | repo-token: ${{ secrets.GITHUB_TOKEN }} 80 | -------------------------------------------------------------------------------- /.github/workflows/test-coverage.yaml: -------------------------------------------------------------------------------- 1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples 2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help 3 | on: 4 | push: 5 | branches: [main, master] 6 | pull_request: 7 | branches: [main, master] 8 | 9 | name: test-coverage 10 | 11 | jobs: 12 | test-coverage: 13 | runs-on: ubuntu-latest 14 | env: 15 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 16 | 17 | steps: 18 | - uses: actions/checkout@v3 19 | 20 | - uses: r-lib/actions/setup-r@v2 21 | with: 22 | use-public-rspm: true 23 | 24 | - uses: r-lib/actions/setup-r-dependencies@v2 25 | with: 26 | extra-packages: any::covr 27 | needs: coverage 28 | 29 | - name: Test coverage 30 | run: | 31 | covr::codecov( 32 | quiet = FALSE, 33 | clean = FALSE, 34 | install_path = file.path(normalizePath(Sys.getenv("RUNNER_TEMP"), winslash = "/"), "package") 35 | ) 36 | shell: Rscript {0} 37 | 38 | - name: Show testthat output 39 | if: always() 40 | run: | 41 | ## -------------------------------------------------------------------- 42 | find ${{ runner.temp }}/package -name 'testthat.Rout*' -exec cat '{}' \; || true 43 | shell: bash 44 | 45 | - name: Upload test results 46 | if: failure() 47 | uses: actions/upload-artifact@v3 48 | with: 49 | name: coverage-test-failures 50 | path: ${{ runner.temp }}/package 51 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | inst/doc 2 | .Rproj.user 3 | .Rhistory 4 | .RData 5 | .Ruserdata 6 | kmeans.csv 7 | /content 8 | /public 9 | /layouts 10 | /static 11 | CRAN-RELEASE 12 | .httr-oauth 13 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: modeldb 2 | Title: Fits Models Inside the Database 3 | Version: 0.3.0.9000 4 | Authors@R: c( 5 | person("Edgar", "Ruiz", , "edgar@posit.co", role = "aut"), 6 | person("Max", "Kuhn", , "max@posit.co", role = c("aut", "cre")) 7 | ) 8 | Description: Uses 'dplyr' and 'tidyeval' to fit statistical models inside 9 | the database. It currently supports KMeans and linear regression 10 | models. 11 | License: MIT + file LICENSE 12 | URL: https://modeldb.tidymodels.org, https://github.com/tidymodels/modeldb 13 | BugReports: https://github.com/tidymodels/modeldb/issues 14 | Depends: 15 | R (>= 3.6) 16 | Imports: 17 | cli, 18 | dplyr (>= 0.7), 19 | ggplot2, 20 | progress, 21 | purrr, 22 | rlang (>= 1.1.1), 23 | tibble, 24 | tidypredict 25 | Suggests: 26 | covr, 27 | DBI, 28 | dbplyr, 29 | knitr, 30 | methods, 31 | nycflights13, 32 | rmarkdown, 33 | RSQLite, 34 | testthat (>= 3.2.0) 35 | VignetteBuilder: 36 | knitr 37 | Config/Needs/website: tidyverse/tidytemplate 38 | Config/testthat/edition: 3 39 | Encoding: UTF-8 40 | Roxygen: list(markdown = TRUE) 41 | RoxygenNote: 7.2.3 42 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | YEAR: 2021 2 | COPYRIGHT HOLDER: modeldb authors 3 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | # MIT License 2 | 3 | Copyright (c) 2021 modeldb authors 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | S3method(as_parsed_model,modeldb_lm) 4 | export(add_dummy_variables) 5 | export(as_parsed_model) 6 | export(db_calculate_squares) 7 | export(linear_regression_db) 8 | export(plot_kmeans) 9 | export(simple_kmeans_db) 10 | import(ggplot2) 11 | import(rlang) 12 | import(tibble) 13 | importFrom(dplyr,all_vars) 14 | importFrom(dplyr,arrange) 15 | importFrom(dplyr,as_tibble) 16 | importFrom(dplyr,bind_cols) 17 | importFrom(dplyr,case_when) 18 | importFrom(dplyr,collect) 19 | importFrom(dplyr,contains) 20 | importFrom(dplyr,everything) 21 | importFrom(dplyr,filter) 22 | importFrom(dplyr,filter_all) 23 | importFrom(dplyr,funs) 24 | importFrom(dplyr,group_by) 25 | importFrom(dplyr,group_vars) 26 | importFrom(dplyr,lag) 27 | importFrom(dplyr,left_join) 28 | importFrom(dplyr,mutate) 29 | importFrom(dplyr,n) 30 | importFrom(dplyr,pull) 31 | importFrom(dplyr,rename) 32 | importFrom(dplyr,rename_all) 33 | importFrom(dplyr,select) 34 | importFrom(dplyr,summarise) 35 | importFrom(dplyr,summarise_all) 36 | importFrom(dplyr,tally) 37 | importFrom(dplyr,tbl_vars) 38 | importFrom(dplyr,ungroup) 39 | importFrom(purrr,imap) 40 | importFrom(purrr,map) 41 | importFrom(purrr,map2) 42 | importFrom(purrr,map_df) 43 | importFrom(purrr,pluck) 44 | importFrom(purrr,reduce) 45 | importFrom(purrr,transpose) 46 | importFrom(tidypredict,as_parsed_model) 47 | importFrom(utils,head) 48 | importFrom(utils,write.csv) 49 | -------------------------------------------------------------------------------- /NEWS.md: -------------------------------------------------------------------------------- 1 | # modeldb (development version) 2 | 3 | # modeldb 0.3.0 4 | 5 | - Fixes compatability issues with `dplyr` 6 | 7 | # modeldb 0.2.3 8 | 9 | * Re-licensed package from GPL-3 to MIT. All copyright holders are RStudio employees and give consent. 10 | 11 | # modeldb 0.2.2 12 | 13 | - Switches maintainer to Max Kuhn 14 | 15 | # modeldb 0.2.1 16 | 17 | - Uses `dplyr::tbl_vars()` for column name extraction to instead of `colnames()` 18 | 19 | # modeldb 0.2.0 20 | 21 | - Adds `as_parsed_model()` support for regression models 22 | 23 | - Fixes compatability issues with `rlang` and `dplyr` 24 | 25 | # modeldb 0.1.2 26 | 27 | - Removes pipes and other dplyr dependencies from internal `mlr()` function 28 | 29 | - Consolidates duplicated database operations in `mlr()` 30 | 31 | - Fixes an issue in `simple_kmeans_db()` when specifying variables 32 | 33 | # modeldb 0.1.1 34 | 35 | ## Bug fixes 36 | 37 | - Fixes dependency issue with `tidypredict` by removing `as_parsed_model()`. The function will be moved to `tidypredict` in its next version. 38 | -------------------------------------------------------------------------------- /R/as-parsed-model.R: -------------------------------------------------------------------------------- 1 | 2 | #' @export 3 | tidypredict::as_parsed_model 4 | 5 | #' Prepares parsed model object 6 | #' 7 | #' @param x A parsed model object 8 | #' 9 | #' @export 10 | as_parsed_model.modeldb_lm <- function(x) { 11 | terms <- imap( 12 | x, 13 | ~ { 14 | list( 15 | label = .y, 16 | coef = .x, 17 | is_intercept = ifelse(.y == "(Intercept)", 1, 0), 18 | fields = list(list(type = "ordinary", col = .y)) 19 | ) 20 | } 21 | ) 22 | pm <- list( 23 | general = list( 24 | model = "modeldb_lm", 25 | version = 2, 26 | type = "regression", 27 | is_glm = 0 28 | ), 29 | terms = terms 30 | ) 31 | as_parsed_model(pm) 32 | } 33 | -------------------------------------------------------------------------------- /R/kmeans-viz.R: -------------------------------------------------------------------------------- 1 | #' Visualize a KMeans Cluster with lots of data 2 | #' 3 | #' It uses 'ggplot2' to display the results of a KMeans routine. Instead 4 | #' of a scatterplot, it uses a square grid that displays the concentration 5 | #' of intersections per square. The number of squares in the grid can 6 | #' be customized for more or less fine grain. 7 | #' 8 | #' @param df A Local or remote data frame with results of KMeans clustering 9 | #' @param x A numeric variable for the x axis 10 | #' @param y A numeric variable for the y axis 11 | #' @param resolution The number of squares in the grid. Defaults to 50. 12 | #' Meaning a 50 x 50 grid. 13 | #' @param group A discrete variable containing the grouping for the KMeans. It defaults to 'center' 14 | #' 15 | #' @details 16 | #' For large result-sets in remote sources, downloading every intersection will 17 | #' be a long running, costly operation. The approach of this function is to 18 | #' devide the x and y plane in a grid and have the remote source figure the 19 | #' total number of intersections, returned as a single number. This reduces the 20 | #' granularity of the visualization, but it speeds up the results. 21 | #' 22 | #' @examples 23 | #' plot_kmeans(mtcars, mpg, wt, group = am) 24 | #' @export 25 | plot_kmeans <- function(df, x, y, resolution = 50, group = center) { 26 | x <- enquo(x) 27 | y <- enquo(y) 28 | group <- enquo(group) 29 | 30 | squares <- db_calculate_squares( 31 | df = df, 32 | x = !!x, 33 | y = !!y, 34 | group = !!group, 35 | resolution = resolution 36 | ) 37 | 38 | 39 | d <- mutate( 40 | squares, 41 | x = !!x, y = !!y, Center = !!group, Count = n 42 | ) 43 | 44 | ggplot(d) + 45 | geom_rect( 46 | aes( 47 | xmin = x, 48 | ymin = y, 49 | xmax = xend, 50 | ymax = yend, 51 | color = Center 52 | ), 53 | fill = "transparent" 54 | ) + 55 | geom_rect(aes( 56 | xmin = x, 57 | ymin = y, 58 | xmax = xend, 59 | ymax = yend, 60 | fill = Center, 61 | alpha = Count 62 | )) + 63 | theme_minimal() + 64 | theme( 65 | legend.position = "bottom", 66 | axis.line = element_blank(), 67 | panel.border = element_blank(), 68 | panel.grid.minor = element_blank() 69 | ) + 70 | labs( 71 | title = "Kmeans Clusters", 72 | subtitle = paste0(expr_text(y), " ~ ", expr_text(x)), 73 | x = expr_text(x), 74 | y = expr_text(y) 75 | ) 76 | } 77 | 78 | #' @export 79 | #' @rdname plot_kmeans 80 | db_calculate_squares <- function(df, x, y, group, resolution = 50) { 81 | x <- enquo(x) 82 | y <- enquo(y) 83 | group <- enquo(group) 84 | segs <- group_by( 85 | df, 86 | !!x := !!db_bin(!!x, bins = resolution), 87 | !!y := !!db_bin(!!y, bins = resolution), 88 | !!group 89 | ) 90 | segs <- tally(segs) 91 | segs <- ungroup(segs) 92 | segs <- collect(segs) 93 | mutate(segs, 94 | xend = !!x + min_dif(segs, !!x), 95 | yend = !!y + min_dif(segs, !!y) 96 | ) 97 | } 98 | 99 | min_dif <- function(df, field) { 100 | field <- enquo(field) 101 | df <- mutate(df, 102 | x1 = lag(!!field, 1), 103 | dif = !!field - x1 104 | ) 105 | df <- filter(df, dif != 0, dif > 0) 106 | df <- summarise(df, min(dif)) 107 | df <- pull(df) 108 | } 109 | 110 | db_bin <- function(var, bins = 30) { 111 | var <- enquo(var) 112 | range <- expr(max(!!var, na.rm = TRUE) - min(!!var, na.rm = TRUE)) 113 | binwidth <- expr((!!range) / (!!bins)) 114 | bin_number <- expr(as.integer(floor(((!!var) - min(!!var, na.rm = TRUE)) / (!!binwidth)))) 115 | expr(((!!binwidth) * 116 | ifelse((!!bin_number) == (!!bins), (!!bin_number) - 1, 117 | (!!bin_number) 118 | )) + min(!!var, na.rm = TRUE)) 119 | } 120 | -------------------------------------------------------------------------------- /R/kmeans.R: -------------------------------------------------------------------------------- 1 | #' Simple kmeans routine that works in-database 2 | #' 3 | #' It uses 'tidyeval' and 'dplyr' to run multiple cycles of kmean 4 | #' calculations, expressed in dplyr formulas until an the optimal 5 | #' centers are found. 6 | #' 7 | #' @param df A Local or remote data frame 8 | #' @param ... A list of variables to be used in the kmeans algorithm 9 | #' @param centers The number of centers. Defaults to 3. 10 | #' @param max_repeats The maximum number of cycles to run. Defaults to 100. 11 | #' @param initial_kmeans A local dataframe with initial centroid values. Defaults to NULL. 12 | #' @param safeguard_file Each cycle will update a file specified in this argument 13 | #' with the current centers. Defaults to 'kmeans.csv'. Pass NULL if no file is 14 | #' desired. 15 | #' @param verbose Indicates if the progress bar will be displayed during the model's fitting. 16 | #' 17 | #' @details 18 | #' Because each cycle is an independent 'dplyr' operation, or SQL operation if using a remote source, 19 | #' the latest centroid data frame is saved to the parent environment in case the process needs to be 20 | #' canceled and then restarted at a later point. Passing the `current_kmeans` as the `initial_kmeans` 21 | #' will allow the operation to pick up where it left off. 22 | #' 23 | #' @examples 24 | #' library(dplyr) 25 | #' 26 | #' mtcars %>% 27 | #' simple_kmeans_db(mpg, qsec, wt) %>% 28 | #' glimpse() 29 | #' 30 | #' @export 31 | simple_kmeans_db <- function(df, 32 | ..., 33 | centers = 3, 34 | max_repeats = 100, 35 | initial_kmeans = NULL, 36 | safeguard_file = "kmeans.csv", 37 | verbose = TRUE) { 38 | vars <- enquos(...) 39 | 40 | if (length(vars) > 0) { 41 | f_df <- select(df, !!!vars) 42 | } else { 43 | vars <- syms(colnames(df)) 44 | f_df <- df 45 | } 46 | 47 | f_df <- filter_all(f_df, all_vars(!is.na(.))) 48 | 49 | if (!is.null(initial_kmeans)) { 50 | centroids <- initial_kmeans 51 | } else { 52 | centroids <- head(f_df, centers) 53 | centroids <- collect(centroids) 54 | } 55 | 56 | if (verbose) { 57 | pb <- progress::progress_bar$new( 58 | format = paste0( 59 | " Cycle :current of ", max_repeats, " max. [:bar] [:var][:elapsed]" 60 | ), 61 | total = max_repeats, clear = TRUE, width = 80 62 | ) 63 | } 64 | 65 | for (iteration in 1:max_repeats) { 66 | prev_centroids <- centroids 67 | new_centroids <- calculate_centers(df, centroids, centers, vars) 68 | 69 | centroids_db <- select(new_centroids, !!!vars, center) 70 | centroids_db <- group_by(centroids_db, center) 71 | centroids_db <- summarise_all(centroids_db, "mean", na.rm = TRUE) 72 | 73 | centroids <- select(centroids_db, -center) 74 | centroids <- collect(centroids) 75 | 76 | if (!is.null(safeguard_file)) { 77 | sfg <- file.path(tempdir(), safeguard_file) 78 | write.csv(centroids, sfg, row.names = FALSE) 79 | } 80 | variance <- ( 81 | round( 82 | abs(sum(prev_centroids) - sum(centroids)) / sum(prev_centroids), 83 | digits = 4 84 | ) * 100 85 | ) 86 | if (verbose) pb$tick(tokens = list(var = variance)) 87 | if (all(prev_centroids == centroids)) break() 88 | } 89 | centroids_db <- rename_all(centroids_db, ~ paste0("k_", .)) 90 | joined <- left_join( 91 | rename(new_centroids, k_center = center), 92 | centroids_db, 93 | by = "k_center" 94 | ) 95 | select(joined, contains("k_"), everything()) 96 | } 97 | 98 | calculate_centers <- function(df, center_df, centers, vars) { 99 | center_names <- paste0("center_", 1:centers) 100 | 101 | fields <- length(vars) 102 | 103 | f_dist <- imap( 104 | center_df, ~ { 105 | map2( 106 | .x, .y, 107 | function(x, y) expr((!!x) - (!!sym(y))) 108 | ) 109 | } 110 | ) 111 | 112 | f_inside <- function(curr_center) { 113 | fi <- map(1:fields, ~ { 114 | f <- pluck(f_dist, .x, curr_center) 115 | expr((((!!f)) * ((!!f)))) 116 | }) 117 | reduce(fi, function(l, r) expr((!!l) + (!!r))) 118 | } 119 | 120 | km <- map(1:centers, ~ expr(sqrt(!!f_inside(.x)))) 121 | km <- set_names(km, center_names) 122 | 123 | all <- map(center_names, ~ { 124 | comp <- map2(.x, center_names, function(x, y) { 125 | if (x != y) { 126 | expr((!!sym(x)) < (!!sym(y))) 127 | } else { 128 | expr((!!sym(x)) >= (!!sym(y))) 129 | } 130 | }) 131 | comp <- reduce(comp, function(l, r) expr((!!l) & (!!r))) 132 | reduce(c(comp, .x), function(l, r) expr((!!l) ~ !!(r))) 133 | }) 134 | all <- flatten(all) 135 | 136 | comp <- expr(case_when(!!!all)) 137 | 138 | res <- mutate(df, !!!km) 139 | res <- mutate(res, center = !!comp) 140 | res <- filter(res, !is.na(center)) 141 | select(res, -contains("center_")) 142 | } 143 | -------------------------------------------------------------------------------- /R/lm.R: -------------------------------------------------------------------------------- 1 | #' Fits a Linear Regression model 2 | #' 3 | #' It uses 'tidyeval' and 'dplyr' to create a linear 4 | #' regression model. 5 | #' 6 | #' @param df A Local or remote data frame 7 | #' @param y_var Dependent variable 8 | #' @param sample_size Prevents a table count. It is only used for models 9 | #' with three or more independent variables 10 | #' @param auto_count Serves as a safeguard in case sample_size is not 11 | #' passed inadvertently. Defaults to FALSE. If it is ok for the 12 | #' function to count how many records are in the sample, then set to 13 | #' TRUE. It is only used for models with three or more independent variables 14 | #' 15 | #' @details 16 | #' 17 | #' The linear_regression_db() function only calls one of three unexported functions. 18 | #' The function used is determined by the number of independent variables. This is 19 | #' so any model of one or two variables can use a simpler formula, which in turn 20 | #' will have less SQL overhead. 21 | #' 22 | #' @examples 23 | #' library(dplyr) 24 | #' 25 | #' mtcars %>% 26 | #' select(mpg, wt, qsec) %>% 27 | #' linear_regression_db(mpg) 28 | #' 29 | #' @export 30 | linear_regression_db <- function(df, y_var = NULL, sample_size = NULL, auto_count = FALSE) { 31 | y_var <- enexpr(y_var) 32 | 33 | col_names <- tbl_vars(df) 34 | grouped_count <- length(group_vars(df)) 35 | n_cols <- length(col_names) - grouped_count 36 | 37 | x_vars <- col_names[col_names != expr_text(y_var)] 38 | if (grouped_count > 0) x_vars <- setdiff(x_vars, group_vars(df)) 39 | 40 | if (n_cols == 2) { 41 | m <- simple_linear_regression_db( 42 | df = df, 43 | x = !!sym(x_vars[1]), 44 | y = !!y_var 45 | ) 46 | } 47 | if (n_cols == 3) { 48 | m <- two_variable_regression( 49 | df = df, 50 | y = !!y_var, 51 | x1 = !!sym(x_vars[1]), 52 | x2 = !!sym(x_vars[2]) 53 | ) 54 | } 55 | if (n_cols > 3) { 56 | m <- mlr( 57 | df = df, 58 | y_var = !!y_var, 59 | sample_size = sample_size, 60 | auto_count = auto_count 61 | ) 62 | } 63 | class(m) <- c("modeldb_lm", class(m)) 64 | m 65 | } 66 | 67 | two_variable_regression <- function(df, y, x1, x2) { 68 | y <- enquo(y) 69 | x1 <- enquo(x1) 70 | x2 <- enquo(x2) 71 | 72 | vars <- group_vars(df) 73 | 74 | m <- summarise( 75 | df, 76 | x1y = sum(!!x1 * !!y, na.rm = TRUE) - (sum(!!x1, na.rm = TRUE) * sum(!!y, na.rm = TRUE) / n()), 77 | x2y = sum(!!x2 * !!y, na.rm = TRUE) - (sum(!!x2, na.rm = TRUE) * sum(!!y, na.rm = TRUE) / n()), 78 | x2x = sum(!!x2 * !!x2, na.rm = TRUE) - (sum(!!x2, na.rm = TRUE) * sum(!!x2, na.rm = TRUE) / n()), 79 | x1x = sum(!!x1 * !!x1, na.rm = TRUE) - (sum(!!x1, na.rm = TRUE) * sum(!!x1, na.rm = TRUE) / n()), 80 | all = sum(!!x1 * !!x2, na.rm = TRUE) - (sum(!!x1, na.rm = TRUE) * sum(!!x2, na.rm = TRUE) / n()), 81 | my = mean(!!y, na.rm = TRUE), 82 | mx1 = mean(!!x1, na.rm = TRUE), 83 | mx2 = mean(!!x2, na.rm = TRUE) 84 | ) 85 | m <- mutate( 86 | m, 87 | !!x1 := ((x2x * x1y) - (all * x2y)) / ((x1x * x2x) - (all * all)), 88 | !!x2 := ((x1x * x2y) - (all * x1y)) / ((x1x * x2x) - (all * all)) 89 | ) 90 | m <- mutate(m, Intercept = my - (!!x1 * mx1) - (!!x2 * mx2)) 91 | m <- select(m, !!vars, Intercept, !!x1, !!x2) 92 | m <- collect(m) 93 | m <- as_tibble(m) 94 | rename(m, "(Intercept)" = Intercept) 95 | } 96 | 97 | simple_linear_regression_db <- function(df, x, y) { 98 | x <- enquo(x) 99 | y <- enquo(y) 100 | 101 | vars <- group_vars(df) 102 | 103 | m <- summarise( 104 | df, 105 | sx = sum(!!x, na.rm = TRUE), 106 | sy = sum(!!y, na.rm = TRUE), 107 | sxx = sum(!!x * !!x, na.rm = TRUE), 108 | syy = sum(!!y * !!y, na.rm = TRUE), 109 | sxy = sum(!!x * !!y, na.rm = TRUE), 110 | n = n() 111 | ) 112 | m <- mutate(m, !!x := ((n * sxy) - (sx * sy)) / ((n * sxx) - (sx * sx))) 113 | m <- mutate(m, Intercept = ((1 / n) * sy) - (!!x * (1 / n) * sx)) 114 | m <- select(m, !!vars, Intercept, !!x) 115 | m <- collect(m) 116 | m <- as_tibble(m) 117 | rename(m, "(Intercept)" = Intercept) 118 | } 119 | 120 | mlr <- function(df, ..., y_var, sample_size = NULL, auto_count = FALSE) { 121 | if (is.null(sample_size)) { 122 | if (auto_count) { 123 | sample_size <- pull(tally(df)) 124 | } else { 125 | cli::cli_abort("No sample size provided, and auto_count is set to FALSE") 126 | } 127 | } 128 | 129 | y_var <- enquo(y_var) 130 | y_text <- as_label(y_var) 131 | 132 | grouping_vars <- group_vars(df) 133 | vars_count <- length(grouping_vars) 134 | 135 | x_vars <- tbl_vars(df) 136 | x_vars <- x_vars[x_vars != y_text] 137 | if (vars_count > 0) x_vars <- map(grouping_vars, ~ x_vars[x_vars != .x])[[1]] 138 | x_vars <- syms(x_vars) 139 | 140 | all_vars <- c(x_vars, ensym(y_var)) 141 | 142 | all_f_mapped <- map( 143 | all_vars, ~ { 144 | y <- .x 145 | map( 146 | all_vars, ~ { 147 | xy <- c(as_label(.x), as_label(y)) 148 | pop <- ifelse(auto_count, expr(n()), sample_size) 149 | list( 150 | # f = ind_f(!!.x, !!y, sample_size, vars_count), 151 | f = expr( 152 | sum(!!.x * !!y, na.rm = TRUE) - ((sum(!!.x, na.rm = TRUE) * sum(!!y, na.rm = TRUE)) / !!pop) 153 | ), 154 | name = paste0(xy[order(xy)], collapse = "_") 155 | ) 156 | } 157 | ) 158 | } 159 | ) 160 | all_f <- flatten(all_f_mapped) 161 | all_f <- set_names( 162 | map(all_f, ~ .x$f), 163 | map(all_f, ~ .x$name) 164 | ) 165 | 166 | # Deduping field combos, decreases number of calcs inside DB 167 | unique_f <- map( 168 | unique(names(all_f)), 169 | ~ all_f[names(all_f) == .x][[1]] 170 | ) 171 | unique_f <- set_names(unique_f, unique(names(all_f))) 172 | 173 | all_means <- map(all_vars, ~ expr(mean(!!.x, na.rm = TRUE))) 174 | all_means <- set_names(all_means, ~ paste0("mean_", all_vars)) 175 | 176 | all_fm <- c(unique_f, all_means) 177 | 178 | # Send all operations to the DB simultaneously 179 | ests_df <- summarise(df, !!!all_fm) 180 | ests_df <- collect(ests_df) 181 | 182 | ests_list <- as.list(ests_df) 183 | 184 | xm_names <- names(all_f)[!grepl(y_text, names(all_f))] 185 | xm <- prepare_matrix(ests_list, xm_names, length(x_vars)) 186 | 187 | ym_names <- names(all_f)[grepl(y_text, names(all_f))] 188 | ym_names <- unique(ym_names)[1:length(x_vars)] 189 | ym <- prepare_matrix(ests_list, ym_names, length(x_vars)) 190 | 191 | coefs <- map( 192 | seq_len(vars_count + 1), 193 | ~ as.numeric(solve(xm[[.x]], ym[[.x]])) 194 | ) 195 | 196 | intercept <- map( 197 | seq_len(vars_count + 1), ~ { 198 | y <- .x 199 | x_f <- map( 200 | seq_len(length(x_vars)), ~ { 201 | x_name <- paste0("mean_", x_vars[.x]) 202 | x_mean <- ests_list[names(ests_list) == x_name][[1]][y] 203 | expr((!!coefs[[y]][.x] * !!x_mean)) 204 | } 205 | ) 206 | y_name <- paste0("mean_", y_text) 207 | y_mean <- ests_list[names(ests_list) == y_name][[1]][y] 208 | int_f <- reduce( 209 | c(y_mean, x_f), 210 | function(l, r) expr(!!l - !!r) 211 | ) 212 | eval(int_f) 213 | } 214 | ) 215 | 216 | res <- transpose(coefs) 217 | res <- set_names(res, x_vars) 218 | res <- c(list("(Intercept)" = intercept), res) 219 | res <- map_df(transpose(res), ~.x) 220 | bind_cols(ests_df[, grouping_vars], res) 221 | } 222 | 223 | prepare_matrix <- function(estimates, field_names, matrix_size) { 224 | m <- map(field_names, ~ estimates[.x]) 225 | m <- flatten(m) 226 | m <- transpose(m) 227 | map(m, ~ matrix(as.numeric(.x), nrow = matrix_size)) 228 | } 229 | -------------------------------------------------------------------------------- /R/model_prep.R: -------------------------------------------------------------------------------- 1 | #' Creates dummy variables 2 | #' 3 | #' It uses 'tidyeval' and 'dplyr' to create dummy variables based for 4 | #' categorical variables. 5 | #' 6 | #' @param df A Local or remote data frame 7 | #' @param x Categorical variable 8 | #' @param values Possible known values of the categorical variable. If not passed 9 | #' then the function will take an additional step to figure the unique values of 10 | #' the variable. 11 | #' @param auto_values Safeguard argument to prevent the function from figuring the 12 | #' unique values if the values argument is empty. If it is ok for this function 13 | #' to obtain the unique values, set to TRUE. Defaults to FALSE. 14 | #' @param remove_original It removes the original variable from the returned table. 15 | #' Defaults to TRUE. 16 | #' 17 | #' @examples 18 | #' library(dplyr) 19 | #' 20 | #' mtcars %>% 21 | #' add_dummy_variables(cyl, values = c(4, 6, 8)) 22 | #' 23 | #' mtcars %>% 24 | #' add_dummy_variables(cyl, auto_values = TRUE) 25 | #' @export 26 | add_dummy_variables <- function(df, x, values = c(), 27 | auto_values = FALSE, remove_original = TRUE) { 28 | x <- enquo(x) 29 | var_found <- as_label(x) %in% tbl_vars(df) 30 | if (!var_found) cli::cli_abort("Variable not found") 31 | if (length(values) == 0) { 32 | if (auto_values == TRUE) { 33 | values <- group_by(df, !!x) 34 | values <- summarise(values) 35 | values <- pull(values) 36 | } else { 37 | cli::cli_abort("No values provided and auto_values is set to FALSE") 38 | } 39 | } 40 | vals <- map(values, ~ expr(ifelse(!!x == !!.x, 1, 0))) 41 | names <- map(values, ~ paste0(as_label(x), "_", .x)) 42 | vals <- set_names(vals, names) 43 | vals <- vals[2:length(vals)] 44 | df <- mutate(df, !!!vals) 45 | if (remove_original) df <- select(df, -!!x) 46 | df 47 | } 48 | -------------------------------------------------------------------------------- /R/modeldb-package.R: -------------------------------------------------------------------------------- 1 | #' @import rlang 2 | #' @import ggplot2 3 | #' @import tibble 4 | #' @importFrom dplyr mutate summarise left_join 5 | #' @importFrom dplyr summarise_all filter_all 6 | #' @importFrom dplyr pull collect arrange 7 | #' @importFrom dplyr contains tally ungroup 8 | #' @importFrom dplyr group_vars lag bind_cols 9 | #' @importFrom dplyr all_vars group_by funs 10 | #' @importFrom dplyr n as_tibble filter select 11 | #' @importFrom dplyr case_when rename rename_all 12 | #' @importFrom dplyr tbl_vars everything 13 | #' @importFrom purrr map map2 map_df transpose 14 | #' @importFrom purrr reduce imap pluck 15 | #' @importFrom tidypredict as_parsed_model 16 | #' @importFrom utils head 17 | #' @importFrom utils write.csv 18 | #' @keywords internal 19 | #' 20 | "_PACKAGE" 21 | NULL 22 | utils::globalVariables(c( 23 | ".", "Center", "Count", "Intercept", "center", "dif", "mx1", 24 | "mx2", "my", "sx", "sxx", "sxy", "sy", "val", "var", "x1", 25 | "x1x", "x1y", "x2x", "x2y", "xend", "yend" 26 | )) 27 | -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: github_document 3 | --- 4 | 5 | # modeldb 6 | 7 | ```{r setup, include=FALSE} 8 | library(dplyr) 9 | library(modeldb) 10 | ``` 11 | 12 | [![R-CMD-check](https://github.com/tidymodels/modeldb/workflows/R-CMD-check/badge.svg)](https://github.com/tidymodels/modeldb/actions) 13 | [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/modeldb)](https://CRAN.R-project.org/package=modeldb) 14 | [![Codecov test coverage](https://codecov.io/gh/tidymodels/modeldb/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/modeldb?branch=main) 15 | [![Downloads](http://cranlogs.r-pkg.org/badges/modeldb)](https://CRAN.R-project.org/package=modeldb) 16 | 17 | Fit models inside the database! **modeldb works with most database back-ends** because it leverages [dplyr](https://dplyr.tidyverse.org/) and [dbplyr](https://dbplyr.tidyverse.org/) for the final SQL translation of the algorithm. It currently supports: 18 | 19 | - K-means clustering 20 | 21 | - Linear regression 22 | 23 | ## Installation 24 | 25 | Install the CRAN version with: 26 | 27 | ```{r, eval = FALSE} 28 | install.packages("modeldb") 29 | ``` 30 | 31 | The development version is available from GitHub using remotes: 32 | 33 | ```{r, eval = FALSE} 34 | # install.packages("remotes") 35 | remotes::install_github("tidymodels/modeldb") 36 | ``` 37 | 38 | ## Linear regression 39 | 40 | An easy way to try out the package is by creating a temporary SQLite database, and loading `mtcars` to it. 41 | 42 | ```{r} 43 | con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:") 44 | RSQLite::initExtension(con) 45 | dplyr::copy_to(con, mtcars) 46 | ``` 47 | 48 | ```{r} 49 | library(dplyr) 50 | 51 | tbl(con, "mtcars") %>% 52 | select(wt, mpg, qsec) %>% 53 | linear_regression_db(wt) 54 | ``` 55 | 56 | The model output can be parsed by [tidypredict](https://tidypredict.tidymodels.org/) to run the predictions in the database. Please see the "Linear Regression" article to learn more about how to use `linear_regression_db()` 57 | 58 | ## K Means clustering 59 | 60 | To use the `simple_kmeans_db()` function, simply pipe the database back end table to the function. This returns a list object that contains two items: 61 | 62 | - A sql query table with the final center assignment 63 | - A local table with the information about the centers 64 | 65 | ```{r} 66 | km <- tbl(con, "mtcars") %>% 67 | simple_kmeans_db(mpg, wt) 68 | 69 | colnames(km) 70 | ``` 71 | 72 | The SQL statement from `tbl` can be extracted using dbplyr's `remote_query()` 73 | 74 | ```{r} 75 | dbplyr::remote_query(km) 76 | ``` 77 | 78 | ## Contributing 79 | 80 | This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms. 81 | 82 | - For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on Posit Community](https://community.rstudio.com/new-topic?category_id=15&tags=tidymodels,question). 83 | 84 | - If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/modeldb/issues). 85 | 86 | - Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code. Check out [this helpful article on how to create reprexes](https://dbplyr.tidyverse.org/articles/reprex.html) for problems involving a database. 87 | 88 | - Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/). 89 | 90 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # modeldb 3 | 4 | [![R-CMD-check](https://github.com/tidymodels/modeldb/workflows/R-CMD-check/badge.svg)](https://github.com/tidymodels/modeldb/actions) 5 | [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/modeldb)](https://CRAN.R-project.org/package=modeldb) 6 | [![Codecov test 7 | coverage](https://codecov.io/gh/tidymodels/modeldb/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/modeldb?branch=main) 8 | [![Downloads](http://cranlogs.r-pkg.org/badges/modeldb)](https://CRAN.R-project.org/package=modeldb) 9 | 10 | Fit models inside the database! **modeldb works with most database 11 | back-ends** because it leverages [dplyr](https://dplyr.tidyverse.org/) 12 | and [dbplyr](https://dbplyr.tidyverse.org/) for the final SQL 13 | translation of the algorithm. It currently supports: 14 | 15 | - K-means clustering 16 | 17 | - Linear regression 18 | 19 | ## Installation 20 | 21 | Install the CRAN version with: 22 | 23 | ``` r 24 | install.packages("modeldb") 25 | ``` 26 | 27 | The development version is available from GitHub using remotes: 28 | 29 | ``` r 30 | # install.packages("remotes") 31 | remotes::install_github("tidymodels/modeldb") 32 | ``` 33 | 34 | ## Linear regression 35 | 36 | An easy way to try out the package is by creating a temporary SQLite 37 | database, and loading `mtcars` to it. 38 | 39 | ``` r 40 | con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:") 41 | RSQLite::initExtension(con) 42 | dplyr::copy_to(con, mtcars) 43 | ``` 44 | 45 | ``` r 46 | library(dplyr) 47 | 48 | tbl(con, "mtcars") %>% 49 | select(wt, mpg, qsec) %>% 50 | linear_regression_db(wt) 51 | ``` 52 | 53 | ## # A tibble: 1 × 3 54 | ## `(Intercept)` mpg qsec 55 | ## 56 | ## 1 4.12 -0.156 0.125 57 | 58 | The model output can be parsed by 59 | [tidypredict](https://tidypredict.tidymodels.org/) to run the 60 | predictions in the database. Please see the “Linear Regression” article 61 | to learn more about how to use `linear_regression_db()` 62 | 63 | ## K Means clustering 64 | 65 | To use the `simple_kmeans_db()` function, simply pipe the database back 66 | end table to the function. This returns a list object that contains two 67 | items: 68 | 69 | - A sql query table with the final center assignment 70 | - A local table with the information about the centers 71 | 72 | ``` r 73 | km <- tbl(con, "mtcars") %>% 74 | simple_kmeans_db(mpg, wt) 75 | 76 | colnames(km) 77 | ``` 78 | 79 | ## [1] "k_center" "k_mpg" "k_wt" "mpg" "cyl" "disp" 80 | ## [7] "hp" "drat" "wt" "qsec" "vs" "am" 81 | ## [13] "gear" "carb" 82 | 83 | The SQL statement from `tbl` can be extracted using dbplyr’s 84 | `remote_query()` 85 | 86 | ``` r 87 | dbplyr::remote_query(km) 88 | ``` 89 | 90 | ## SELECT `k_center`, `k_mpg`, `k_wt`, `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb` 91 | ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `LHS`.`k_center` AS `k_center`, `k_mpg`, `k_wt` 92 | ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `center` AS `k_center` 93 | ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `center_1`, `center_2`, `center_3`, CASE 94 | ## WHEN (`center_1` >= `center_1` AND `center_1` < `center_2` AND `center_1` < `center_3`) THEN ('center_1') 95 | ## WHEN (`center_2` < `center_1` AND `center_2` >= `center_2` AND `center_2` < `center_3`) THEN ('center_2') 96 | ## WHEN (`center_3` < `center_1` AND `center_3` < `center_2` AND `center_3` >= `center_3`) THEN ('center_3') 97 | ## END AS `center` 98 | ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, SQRT(((20.6428571428571 - `mpg`) * (20.6428571428571 - `mpg`)) + ((3.07214285714286 - `wt`) * (3.07214285714286 - `wt`))) AS `center_1`, SQRT(((14.4583333333333 - `mpg`) * (14.4583333333333 - `mpg`)) + ((4.05866666666667 - `wt`) * (4.05866666666667 - `wt`))) AS `center_2`, SQRT(((30.0666666666667 - `mpg`) * (30.0666666666667 - `mpg`)) + ((1.873 - `wt`) * (1.873 - `wt`))) AS `center_3` 99 | ## FROM `mtcars`)) 100 | ## WHERE (NOT(((`center`) IS NULL)))) AS `LHS` 101 | ## LEFT JOIN (SELECT `center` AS `k_center`, `mpg` AS `k_mpg`, `wt` AS `k_wt` 102 | ## FROM (SELECT `center`, AVG(`mpg`) AS `mpg`, AVG(`wt`) AS `wt` 103 | ## FROM (SELECT `mpg`, `wt`, `center` 104 | ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, `center_1`, `center_2`, `center_3`, CASE 105 | ## WHEN (`center_1` >= `center_1` AND `center_1` < `center_2` AND `center_1` < `center_3`) THEN ('center_1') 106 | ## WHEN (`center_2` < `center_1` AND `center_2` >= `center_2` AND `center_2` < `center_3`) THEN ('center_2') 107 | ## WHEN (`center_3` < `center_1` AND `center_3` < `center_2` AND `center_3` >= `center_3`) THEN ('center_3') 108 | ## END AS `center` 109 | ## FROM (SELECT `mpg`, `cyl`, `disp`, `hp`, `drat`, `wt`, `qsec`, `vs`, `am`, `gear`, `carb`, SQRT(((20.6428571428571 - `mpg`) * (20.6428571428571 - `mpg`)) + ((3.07214285714286 - `wt`) * (3.07214285714286 - `wt`))) AS `center_1`, SQRT(((14.4583333333333 - `mpg`) * (14.4583333333333 - `mpg`)) + ((4.05866666666667 - `wt`) * (4.05866666666667 - `wt`))) AS `center_2`, SQRT(((30.0666666666667 - `mpg`) * (30.0666666666667 - `mpg`)) + ((1.873 - `wt`) * (1.873 - `wt`))) AS `center_3` 110 | ## FROM `mtcars`)) 111 | ## WHERE (NOT(((`center`) IS NULL)))) 112 | ## GROUP BY `center`)) AS `RHS` 113 | ## ON (`LHS`.`k_center` = `RHS`.`k_center`) 114 | ## ) 115 | 116 | ## Contributing 117 | 118 | This project is released with a [Contributor Code of 119 | Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). 120 | By contributing to this project, you agree to abide by its terms. 121 | 122 | - For questions and discussions about tidymodels packages, modeling, 123 | and machine learning, please [post on Posit 124 | Community](https://community.rstudio.com/new-topic?category_id=15&tags=tidymodels,question). 125 | 126 | - If you think you have encountered a bug, please [submit an 127 | issue](https://github.com/tidymodels/modeldb/issues). 128 | 129 | - Either way, learn how to create and share a 130 | [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) 131 | (a minimal, reproducible example), to clearly communicate about your 132 | code. Check out [this helpful article on how to create 133 | reprexes](https://dbplyr.tidyverse.org/articles/reprex.html) for 134 | problems involving a database. 135 | 136 | - Check out further details on [contributing guidelines for tidymodels 137 | packages](https://www.tidymodels.org/contribute/) and [how to get 138 | help](https://www.tidymodels.org/help/). 139 | -------------------------------------------------------------------------------- /_pkgdown.yml: -------------------------------------------------------------------------------- 1 | url: https://modeldb.tidymodels.org 2 | 3 | template: 4 | package: tidytemplate 5 | bootstrap: 5 6 | bslib: 7 | danger: "#CA225E" 8 | primary: "#CA225E" 9 | includes: 10 | in_header: | 11 | 12 | 13 | development: 14 | mode: auto 15 | -------------------------------------------------------------------------------- /codecov.yml: -------------------------------------------------------------------------------- 1 | comment: false 2 | 3 | coverage: 4 | status: 5 | project: 6 | default: 7 | target: auto 8 | threshold: 1% 9 | informational: true 10 | patch: 11 | default: 12 | target: auto 13 | threshold: 1% 14 | informational: true 15 | -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## Release summary 2 | 3 | * Switches to the Max Kuhn as package maintainer 4 | 5 | ## Comments 6 | 7 | * Having Max as the maintainer will guarantee good follow through for any communications with CRAN. This is the reason why there is another update is such short order from the previous update in CRAN. Thank you for your understanding. 8 | 9 | ## Test environments 10 | 11 | * Ubuntu 18.04.2 LTS, R 3.6.2 12 | 13 | * Ubuntu 14.04 (on travis-ci) 14 | 15 | ## R CMD check results 16 | 17 | * 0 errors | 0 warnings | 0 notes 18 | 19 | 20 | 21 | -------------------------------------------------------------------------------- /man/add_dummy_variables.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/model_prep.R 3 | \name{add_dummy_variables} 4 | \alias{add_dummy_variables} 5 | \title{Creates dummy variables} 6 | \usage{ 7 | add_dummy_variables( 8 | df, 9 | x, 10 | values = c(), 11 | auto_values = FALSE, 12 | remove_original = TRUE 13 | ) 14 | } 15 | \arguments{ 16 | \item{df}{A Local or remote data frame} 17 | 18 | \item{x}{Categorical variable} 19 | 20 | \item{values}{Possible known values of the categorical variable. If not passed 21 | then the function will take an additional step to figure the unique values of 22 | the variable.} 23 | 24 | \item{auto_values}{Safeguard argument to prevent the function from figuring the 25 | unique values if the values argument is empty. If it is ok for this function 26 | to obtain the unique values, set to TRUE. Defaults to FALSE.} 27 | 28 | \item{remove_original}{It removes the original variable from the returned table. 29 | Defaults to TRUE.} 30 | } 31 | \description{ 32 | It uses 'tidyeval' and 'dplyr' to create dummy variables based for 33 | categorical variables. 34 | } 35 | \examples{ 36 | library(dplyr) 37 | 38 | mtcars \%>\% 39 | add_dummy_variables(cyl, values = c(4, 6, 8)) 40 | 41 | mtcars \%>\% 42 | add_dummy_variables(cyl, auto_values = TRUE) 43 | } 44 | -------------------------------------------------------------------------------- /man/as_parsed_model.modeldb_lm.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/as-parsed-model.R 3 | \name{as_parsed_model.modeldb_lm} 4 | \alias{as_parsed_model.modeldb_lm} 5 | \title{Prepares parsed model object} 6 | \usage{ 7 | \method{as_parsed_model}{modeldb_lm}(x) 8 | } 9 | \arguments{ 10 | \item{x}{A parsed model object} 11 | } 12 | \description{ 13 | Prepares parsed model object 14 | } 15 | -------------------------------------------------------------------------------- /man/figures/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tidymodels/modeldb/e93e4025f3e08407c33062bfd8f69ecb55111b3c/man/figures/logo.png -------------------------------------------------------------------------------- /man/linear_regression_db.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/lm.R 3 | \name{linear_regression_db} 4 | \alias{linear_regression_db} 5 | \title{Fits a Linear Regression model} 6 | \usage{ 7 | linear_regression_db(df, y_var = NULL, sample_size = NULL, auto_count = FALSE) 8 | } 9 | \arguments{ 10 | \item{df}{A Local or remote data frame} 11 | 12 | \item{y_var}{Dependent variable} 13 | 14 | \item{sample_size}{Prevents a table count. It is only used for models 15 | with three or more independent variables} 16 | 17 | \item{auto_count}{Serves as a safeguard in case sample_size is not 18 | passed inadvertently. Defaults to FALSE. If it is ok for the 19 | function to count how many records are in the sample, then set to 20 | TRUE. It is only used for models with three or more independent variables} 21 | } 22 | \description{ 23 | It uses 'tidyeval' and 'dplyr' to create a linear 24 | regression model. 25 | } 26 | \details{ 27 | The linear_regression_db() function only calls one of three unexported functions. 28 | The function used is determined by the number of independent variables. This is 29 | so any model of one or two variables can use a simpler formula, which in turn 30 | will have less SQL overhead. 31 | } 32 | \examples{ 33 | library(dplyr) 34 | 35 | mtcars \%>\% 36 | select(mpg, wt, qsec) \%>\% 37 | linear_regression_db(mpg) 38 | 39 | } 40 | -------------------------------------------------------------------------------- /man/modeldb-package.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/modeldb-package.R 3 | \docType{package} 4 | \name{modeldb-package} 5 | \alias{modeldb} 6 | \alias{modeldb-package} 7 | \title{modeldb: Fits Models Inside the Database} 8 | \description{ 9 | \if{html}{\figure{logo.png}{options: style='float: right' alt='logo' width='120'}} 10 | 11 | Uses 'dplyr' and 'tidyeval' to fit statistical models inside the database. It currently supports KMeans and linear regression models. 12 | } 13 | \seealso{ 14 | Useful links: 15 | \itemize{ 16 | \item \url{https://modeldb.tidymodels.org} 17 | \item \url{https://github.com/tidymodels/modeldb} 18 | \item Report bugs at \url{https://github.com/tidymodels/modeldb/issues} 19 | } 20 | 21 | } 22 | \author{ 23 | \strong{Maintainer}: Max Kuhn \email{max@posit.co} 24 | 25 | } 26 | \keyword{internal} 27 | -------------------------------------------------------------------------------- /man/plot_kmeans.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/kmeans-viz.R 3 | \name{plot_kmeans} 4 | \alias{plot_kmeans} 5 | \alias{db_calculate_squares} 6 | \title{Visualize a KMeans Cluster with lots of data} 7 | \usage{ 8 | plot_kmeans(df, x, y, resolution = 50, group = center) 9 | 10 | db_calculate_squares(df, x, y, group, resolution = 50) 11 | } 12 | \arguments{ 13 | \item{df}{A Local or remote data frame with results of KMeans clustering} 14 | 15 | \item{x}{A numeric variable for the x axis} 16 | 17 | \item{y}{A numeric variable for the y axis} 18 | 19 | \item{resolution}{The number of squares in the grid. Defaults to 50. 20 | Meaning a 50 x 50 grid.} 21 | 22 | \item{group}{A discrete variable containing the grouping for the KMeans. It defaults to 'center'} 23 | } 24 | \description{ 25 | It uses 'ggplot2' to display the results of a KMeans routine. Instead 26 | of a scatterplot, it uses a square grid that displays the concentration 27 | of intersections per square. The number of squares in the grid can 28 | be customized for more or less fine grain. 29 | } 30 | \details{ 31 | For large result-sets in remote sources, downloading every intersection will 32 | be a long running, costly operation. The approach of this function is to 33 | devide the x and y plane in a grid and have the remote source figure the 34 | total number of intersections, returned as a single number. This reduces the 35 | granularity of the visualization, but it speeds up the results. 36 | } 37 | \examples{ 38 | plot_kmeans(mtcars, mpg, wt, group = am) 39 | } 40 | -------------------------------------------------------------------------------- /man/reexports.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/as-parsed-model.R 3 | \docType{import} 4 | \name{reexports} 5 | \alias{reexports} 6 | \alias{as_parsed_model} 7 | \title{Objects exported from other packages} 8 | \keyword{internal} 9 | \description{ 10 | These objects are imported from other packages. Follow the links 11 | below to see their documentation. 12 | 13 | \describe{ 14 | \item{tidypredict}{\code{\link[tidypredict]{as_parsed_model}}} 15 | }} 16 | 17 | -------------------------------------------------------------------------------- /man/simple_kmeans_db.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/kmeans.R 3 | \name{simple_kmeans_db} 4 | \alias{simple_kmeans_db} 5 | \title{Simple kmeans routine that works in-database} 6 | \usage{ 7 | simple_kmeans_db( 8 | df, 9 | ..., 10 | centers = 3, 11 | max_repeats = 100, 12 | initial_kmeans = NULL, 13 | safeguard_file = "kmeans.csv", 14 | verbose = TRUE 15 | ) 16 | } 17 | \arguments{ 18 | \item{df}{A Local or remote data frame} 19 | 20 | \item{...}{A list of variables to be used in the kmeans algorithm} 21 | 22 | \item{centers}{The number of centers. Defaults to 3.} 23 | 24 | \item{max_repeats}{The maximum number of cycles to run. Defaults to 100.} 25 | 26 | \item{initial_kmeans}{A local dataframe with initial centroid values. Defaults to NULL.} 27 | 28 | \item{safeguard_file}{Each cycle will update a file specified in this argument 29 | with the current centers. Defaults to 'kmeans.csv'. Pass NULL if no file is 30 | desired.} 31 | 32 | \item{verbose}{Indicates if the progress bar will be displayed during the model's fitting.} 33 | } 34 | \description{ 35 | It uses 'tidyeval' and 'dplyr' to run multiple cycles of kmean 36 | calculations, expressed in dplyr formulas until an the optimal 37 | centers are found. 38 | } 39 | \details{ 40 | Because each cycle is an independent 'dplyr' operation, or SQL operation if using a remote source, 41 | the latest centroid data frame is saved to the parent environment in case the process needs to be 42 | canceled and then restarted at a later point. Passing the \code{current_kmeans} as the \code{initial_kmeans} 43 | will allow the operation to pick up where it left off. 44 | } 45 | \examples{ 46 | library(dplyr) 47 | 48 | mtcars \%>\% 49 | simple_kmeans_db(mpg, qsec, wt) \%>\% 50 | glimpse() 51 | 52 | } 53 | -------------------------------------------------------------------------------- /modeldb.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | BuildType: Package 16 | PackageUseDevtools: Yes 17 | PackageInstallArgs: --no-multiarch --with-keep.source 18 | PackageRoxygenize: rd,collate,namespace,vignette 19 | -------------------------------------------------------------------------------- /revdep/.gitignore: -------------------------------------------------------------------------------- 1 | checks 2 | data.sqlite 3 | library 4 | -------------------------------------------------------------------------------- /revdep/README.md: -------------------------------------------------------------------------------- 1 | # Platform 2 | 3 | |field |value | 4 | |:--------|:----------------------------| 5 | |version |R version 3.6.0 (2019-04-26) | 6 | |os |Windows 10 x64 | 7 | |system |x86_64, mingw32 | 8 | |ui |RStudio | 9 | |language |(EN) | 10 | |collate |English_United States.1252 | 11 | |ctype |English_United States.1252 | 12 | |tz |America/Chicago | 13 | |date |2019-07-19 | 14 | 15 | # Dependencies 16 | 17 | |package |old |new | | 18 | |:------------|:--------|:--------|:--| 19 | |modeldb |0.1.2 |0.2.0 |* | 20 | |assertthat |0.2.1 |0.2.1 | | 21 | |backports |1.1.4 |1.1.4 | | 22 | |BH |1.69.0-1 |1.69.0-1 | | 23 | |cli |1.1.0 |1.1.0 | | 24 | |colorspace |1.4-1 |1.4-1 | | 25 | |crayon |1.3.4 |1.3.4 | | 26 | |digest |0.6.20 |0.6.20 | | 27 | |dplyr |0.8.3 |0.8.3 | | 28 | |ellipsis |0.2.0.1 |0.2.0.1 | | 29 | |evaluate |NA |0.14 |* | 30 | |fansi |0.4.0 |0.4.0 | | 31 | |generics |NA |0.0.2 |* | 32 | |ggplot2 |3.2.0 |3.2.0 | | 33 | |glue |1.3.1 |1.3.1 | | 34 | |gtable |0.3.0 |0.3.0 | | 35 | |highr |NA |0.8 |* | 36 | |hms |0.5.0 |0.5.0 | | 37 | |knitr |NA |1.23 |* | 38 | |labeling |0.3 |0.3 | | 39 | |lazyeval |0.2.2 |0.2.2 | | 40 | |magrittr |1.5 |1.5 | | 41 | |markdown |NA |1.0 |* | 42 | |mime |NA |0.7 |* | 43 | |munsell |0.5.0 |0.5.0 | | 44 | |pillar |1.4.2 |1.4.2 | | 45 | |pkgconfig |2.0.2 |2.0.2 | | 46 | |plogr |0.2.0 |0.2.0 | | 47 | |plyr |1.8.4 |1.8.4 | | 48 | |prettyunits |1.0.2 |1.0.2 | | 49 | |progress |1.2.2 |1.2.2 | | 50 | |purrr |0.3.2 |0.3.2 | | 51 | |R6 |2.4.0 |2.4.0 | | 52 | |RColorBrewer |1.1-2 |1.1-2 | | 53 | |Rcpp |1.0.1 |1.0.1 | | 54 | |reshape2 |1.4.3 |1.4.3 | | 55 | |rlang |0.4.0 |0.4.0 | | 56 | |scales |1.0.0 |1.0.0 | | 57 | |stringi |1.4.3 |1.4.3 | | 58 | |stringr |1.4.0 |1.4.0 | | 59 | |tibble |2.1.3 |2.1.3 | | 60 | |tidypredict |NA |0.4.2 |* | 61 | |tidyselect |0.2.5 |0.2.5 | | 62 | |utf8 |1.1.4 |1.1.4 | | 63 | |vctrs |0.2.0 |0.2.0 | | 64 | |viridisLite |0.3.0 |0.3.0 | | 65 | |withr |2.1.2 |2.1.2 | | 66 | |xfun |NA |0.8 |* | 67 | |yaml |NA |2.2.0 |* | 68 | |zeallot |0.1.0 |0.1.0 | | 69 | 70 | # Revdeps 71 | 72 | -------------------------------------------------------------------------------- /revdep/check.R: -------------------------------------------------------------------------------- 1 | #devtools::install_github("r-lib/revdepcheck") 2 | library(revdepcheck) 3 | 4 | #revdepcheck::revdep_reset() 5 | revdep_check() 6 | -------------------------------------------------------------------------------- /revdep/failures.md: -------------------------------------------------------------------------------- 1 | *Wow, no problems at all. :)* -------------------------------------------------------------------------------- /revdep/problems.md: -------------------------------------------------------------------------------- 1 | *Wow, no problems at all. :)* -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | # This file is part of the standard setup for testthat. 2 | # It is recommended that you do not modify it. 3 | # 4 | # Where should you do additional test configuration? 5 | # Learn more about the roles of various files in: 6 | # * https://r-pkgs.org/testing-design.html#sec-tests-files-overview 7 | # * https://testthat.r-lib.org/articles/special-files.html 8 | 9 | library(testthat) 10 | library(dplyr) 11 | library(purrr) 12 | library(modeldb) 13 | 14 | test_check("modeldb") 15 | -------------------------------------------------------------------------------- /tests/testthat/test-as-parsed-model.R: -------------------------------------------------------------------------------- 1 | context("parsed model") 2 | 3 | test_that("Simple linear regression matches lm()", { 4 | expect_is( 5 | as_parsed_model( 6 | linear_regression_db(mtcars, wt, sample_size = 32) 7 | ), 8 | "parsed_model" 9 | ) 10 | }) 11 | -------------------------------------------------------------------------------- /tests/testthat/test_dummy_var.R: -------------------------------------------------------------------------------- 1 | context("add_dummy_variables") 2 | 3 | test_that("Function create the correct columns", { 4 | cols_expected <- c("mpg", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "cyl_6", "cyl_8") 5 | expect_equal( 6 | colnames(add_dummy_variables(mtcars, cyl, c(4, 6, 8))), 7 | cols_expected 8 | ) 9 | expect_equal( 10 | colnames(add_dummy_variables(mtcars, cyl, auto_values = TRUE)), 11 | cols_expected 12 | ) 13 | }) 14 | 15 | test_that("Function fails when no values are passed and auto_values is FALSE", { 16 | expect_error(add_dummy_variables(mtcars, cyl)) 17 | expect_error(add_dummy_variables(mtcars, cyl, auto_values = FALSE)) 18 | }) 19 | 20 | test_that("Has error when variable is missing", { 21 | expect_error(add_dummy_variables(mtcars, error)) 22 | }) 23 | -------------------------------------------------------------------------------- /tests/testthat/test_kmeans.R: -------------------------------------------------------------------------------- 1 | context("kmeans") 2 | 3 | test_that("Specifying variables works", { 4 | expect_is( 5 | simple_kmeans_db(mtcars, wt, mpg), 6 | "data.frame" 7 | ) 8 | }) 9 | 10 | test_that("Not specifying variables works", { 11 | expect_silent( 12 | select(mtcars, wt, mpg) %>% 13 | simple_kmeans_db() 14 | ) 15 | }) 16 | 17 | test_that("Centroid argument is accepted", { 18 | ik <- data.frame( 19 | wt = c(3.072143, 4.058667, 1.873000), 20 | mpg = c(20.64286, 14.45833, 30.06667) 21 | ) 22 | 23 | expect_silent( 24 | simple_kmeans_db(mtcars, mpg, wt, initial_kmeans = ik) 25 | ) 26 | }) 27 | -------------------------------------------------------------------------------- /tests/testthat/test_kmeans_viz.R: -------------------------------------------------------------------------------- 1 | context("kmeans_viz") 2 | 3 | 4 | test_that("plot_kmeans() returns a ggplot2 object", { 5 | expect_equal( 6 | class(plot_kmeans(mtcars, mpg, wt, group = am)), 7 | c("gg", "ggplot") 8 | ) 9 | }) 10 | 11 | test_that("plot_kmeans() returns error when no group is passed", { 12 | expect_error( 13 | plot_kmeans(mtcars, mpg, wt) 14 | ) 15 | }) 16 | 17 | test_that("Updating the resolution argument impacts the results", { 18 | expect_false( 19 | nrow(db_calculate_squares(mtcars, 20 | mpg, wt, 21 | group = am, 22 | resolution = 50 23 | )) == 24 | nrow(db_calculate_squares(mtcars, 25 | mpg, wt, 26 | group = am, 27 | resolution = 30 28 | )) 29 | ) 30 | }) 31 | -------------------------------------------------------------------------------- /tests/testthat/test_lr.R: -------------------------------------------------------------------------------- 1 | context("Linear Regression") 2 | 3 | test_that("Simple linear regression matches lm()", { 4 | expect_equal( 5 | lm(wt ~ mpg, data = mtcars) %>% 6 | coef() %>% 7 | as.numeric(), 8 | mtcars %>% 9 | select(wt, mpg) %>% 10 | linear_regression_db(wt) %>% 11 | as.numeric() 12 | ) 13 | }) 14 | 15 | test_that("Two variable linear regression matches lm()", { 16 | expect_equal( 17 | lm(wt ~ mpg + qsec, data = mtcars) %>% 18 | coef() %>% 19 | as.numeric(), 20 | mtcars %>% 21 | select(wt, mpg, qsec) %>% 22 | linear_regression_db(wt) %>% 23 | as.numeric() 24 | ) 25 | }) 26 | 27 | test_that("Multiple variable linear regression matches lm()", { 28 | expect_equal( 29 | lm(wt ~ mpg + qsec + hp, data = mtcars) %>% 30 | coef() %>% 31 | as.numeric(), 32 | mtcars %>% 33 | select(wt, mpg, qsec, hp) %>% 34 | linear_regression_db(wt, sample_size = 32) %>% 35 | as.numeric() 36 | ) 37 | }) 38 | 39 | test_that("MLR matches lm() with auto_count set to TRUE", { 40 | expect_equal( 41 | lm(wt ~ mpg + qsec + hp, data = mtcars) %>% 42 | coef() %>% 43 | as.numeric(), 44 | mtcars %>% 45 | select(wt, mpg, qsec, hp) %>% 46 | linear_regression_db(wt, auto_count = TRUE) %>% 47 | as.numeric() 48 | ) 49 | }) 50 | 51 | test_that("MLR failes when auto_count set to FALSE and no sample_size is passed", { 52 | expect_error( 53 | mtcars %>% 54 | select(wt, mpg, qsec, hp) %>% 55 | linear_regression_db(wt, auto_count = FALSE) 56 | ) 57 | expect_error( 58 | mtcars %>% 59 | select(wt, mpg, qsec, hp) %>% 60 | linear_regression_db(wt) 61 | ) 62 | }) 63 | 64 | test_that("mlr with grouping matches lm()", { 65 | expect_equal( 66 | mtcars %>% 67 | select(wt, mpg, qsec, hp, am) %>% 68 | group_by(am) %>% 69 | linear_regression_db(wt, auto_count = TRUE) %>% 70 | transpose() %>% 71 | map(~ as.numeric(.x)), 72 | 0:1 %>% 73 | map(~ { 74 | mtcars %>% 75 | filter(am == .x) %>% 76 | lm(wt ~ mpg + qsec + hp, data = .) %>% 77 | coef() %>% 78 | as.numeric() %>% 79 | c(.x, .) 80 | }) 81 | ) 82 | }) 83 | -------------------------------------------------------------------------------- /vignettes/.gitignore: -------------------------------------------------------------------------------- 1 | *.html 2 | *.R 3 | -------------------------------------------------------------------------------- /vignettes/kmeans.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "K Means models" 3 | author: "Edgar Ruiz" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{K Means models} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\VignetteEncoding{UTF-8} 10 | --- 11 | 12 | ```{r setup, include = FALSE} 13 | knitr::opts_chunk$set( 14 | collapse = TRUE, 15 | comment = "#>" 16 | ) 17 | library(dplyr) 18 | library(purrr) 19 | library(rlang) 20 | library(nycflights13) 21 | ``` 22 | 23 | ## Intro 24 | 25 | The `simple_kmeans_db()` function enables running the KMeans model inside the database. It uses `dplyr` programming to abstract the steps needed produce a model, so that it can then be translated into SQL statements in the background. 26 | 27 | ## Example setup 28 | 29 | In this example, a simple `RSQlite` database will be use to load the `flights` data from the `nycflights13` library. 30 | 31 | ```{r} 32 | library(dplyr) 33 | 34 | con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:") 35 | RSQLite::initExtension(con) 36 | 37 | db_flights <- copy_to(con, nycflights13::flights, "flights") 38 | ``` 39 | 40 | ## Running Kmeans clustering 41 | 42 | The function `simple_kmeans_db()` can use with local data, or a remote table, such as the `db_flights` variable that is a pointer to the "flights" table inside the SQLite database. When piping to the function, the only other required arguments are two or more fields separated by comma. Because it uses 'tidyeval', the variable name auto-completion will work. 43 | 44 | ```{r} 45 | library(modeldb) 46 | 47 | km <- db_flights %>% 48 | simple_kmeans_db(dep_time, distance) 49 | ``` 50 | 51 | The `simple_kmeans_db()` function uses a progress bar to show you the current cycle, the maximum cycles it's expected to run, the current difference between the previous cycle and the current cycle, and the running time. The loop will stop once it wither has two matching consecutive cycles, or if it reaches the maximum number of cycles, as determined by the `max_repeats` argument. 52 | 53 | The final **centers** are are stored in the `k_center` variable of the returned object 54 | ```{r} 55 | km %>% pull(k_center) 56 | ``` 57 | 58 | The latest results are stored in the `tbl` variable of the returned object. The type of the returned table will match the type of the original source, so if it is a remote source, such as database table, then `tbl` will be a class `tbl_sql`. This will allow us to do two thing: 59 | 60 | - View the SQL statement that was used to find the final centers: 61 | ```{r} 62 | dbplyr::remote_query(km) 63 | ``` 64 | 65 | ## Under the hood 66 | 67 | The `simple_kmeans_db()` function uses `dplyr` and 'tidyeval' to run the KMeans algorithm. This means that when combined with `dbplyr`, the routines can be run inside a database. 68 | 69 | Unlike other packages that use this same methodology, such as `dbplot` and `tidypredict`, `simple_kmeans_db()` does not create a single `dplyr` code that can be extracted as SQL. The function produces multiple, serial and dependent SQL statements that run individually inside the database. Each statement uses the current *centroids*, or centers, to estimate new centroids, and then it uses those centroids in a consecutive SQL statement to see if there was any variance. Effectively, this approach uses R not only as translation layer, but also as an orchestration layer. 70 | 71 | ## Safeguards for long running jobs 72 | 73 | The`simple_kmeans_db()` approach of using multiple and consecutive SQL queries to find the optimal centers, additionally, in KMeans clustering, it matters the order in which the each set of centers is passed. This creates an imperative to find a way to cache the current centers used in a long running job, in case the job is canceled or fails. Starting from the centers that were calculated last, will mean that re-starting the job will not being from "0", but from a more advanced, read closer, set of centers. 74 | 75 | The safeguard implemented in this function is trough a file, called *kmeans.csv*. Each cycle will update the file. The file name can be changed by modifying the `safeguard_file` argument. Setting the argument to NULL will turn off the safeguard. The file will be saved to the temporary directory of the R session.. 76 | 77 | In this example we will set the `max_repats` to 10, so as to artificially avoid finding the optimal means 78 | ```{r} 79 | km <- db_flights %>% 80 | simple_kmeans_db(dep_time, distance, max_repeats = 10) 81 | ``` 82 | 83 | In the next run, the "kmeans.csv" file is passed as the `initial_kmeans` argument. This will make `simple_kmeans_db()` use those centers as the starting point: 84 | 85 | ```{r} 86 | km <- db_flights %>% 87 | simple_kmeans_db(dep_time, distance, initial_kmeans = read.csv(file.path(tempdir(), "kmeans.csv"))) 88 | ``` 89 | 90 | The second run took 7 cycles to complete, which adds up to the 17 cycles that it initially took in the first example at the top of this article. 91 | 92 | -------------------------------------------------------------------------------- /vignettes/linear-regression.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Linear regression models" 3 | author: "Edgar Ruiz" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{Linear regression models} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\VignetteEncoding{UTF-8} 10 | --- 11 | 12 | ```{r setup, include = FALSE} 13 | knitr::opts_chunk$set( 14 | collapse = TRUE, 15 | comment = "#>" 16 | ) 17 | 18 | knitr::opts_chunk$set(echo = TRUE) 19 | library(dplyr) 20 | library(dbplyr) 21 | library(nycflights13) 22 | library(DBI) 23 | library(modeldb) 24 | ``` 25 | 26 | ## Intro 27 | 28 | The `linear_regression_db()` function can be used to fit this kind of model inside a database. It uses `dplyr` programming to abstract the steps needed produce a model, so that it can then be translated into SQL statements in the background. 29 | 30 | ## Example setup 31 | 32 | A lightweight SQLite database will be used for this article. Additionally, a sample data set is created. 33 | 34 | ```{r} 35 | # Open a database connection 36 | con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:") 37 | RSQLite::initExtension(con) 38 | 39 | library(dplyr) 40 | # Copy data to the database 41 | db_flights <- copy_to(con, nycflights13::flights, "flights") 42 | # Create a simple sample 43 | db_sample <- db_flights %>% 44 | filter(!is.na(arr_time)) %>% 45 | head(20000) 46 | ``` 47 | 48 | 49 | ## Model inside the database 50 | 51 | The `linear_regression_db()` function does not use a formula. It uses a table, and a named dependent variable. This means data preparation is needed prior to running the model. The best way to prepare the data for modeling will be using piped `dplyr` operations. 52 | 53 | ```{r} 54 | db_sample %>% 55 | select(arr_delay, dep_delay, distance) %>% 56 | linear_regression_db(arr_delay) 57 | ``` 58 | 59 | ## Categorical variables 60 | 61 | Adding a categorical a variable to a model requires prior data transformation The `add_dummy_variables()` appends a set of boolean variables, one for each discrete value. This function creates one-less discrete variable than the possible values. For example: if the categorical variable has three possible values, the function will append two variables. By default, `add_dummy_variables()` removes the original variable. 62 | 63 | The reason for this approach is to reduce the number of database operations. Without this step, then a fitting function would have to request all of the unique values every time a new model run, which creates unnecessary processing. 64 | 65 | ```{r} 66 | db_sample %>% 67 | select(arr_delay, origin) %>% 68 | add_dummy_variables(origin, values = c("EWR", "JFK", "LGA")) 69 | ``` 70 | 71 | In a real world scenario, the possible values are usually not known at the beginning of the analysis. So it is a good idea to load them into a vector variable so that it can be used any time that variable is added to a model. This can be easily done using the `pull()` command from `dplyr`: 72 | 73 | ```{r} 74 | origins <- db_flights %>% 75 | group_by(origin) %>% 76 | summarise() %>% 77 | pull() 78 | 79 | origins 80 | ``` 81 | 82 | The `add_dummy_variables()` can be used as part of the piped code that terminates in the modeling function. 83 | 84 | ```{r} 85 | db_sample %>% 86 | select(arr_delay, origin) %>% 87 | add_dummy_variables(origin, values = origins) %>% 88 | linear_regression_db(arr_delay) 89 | ``` 90 | 91 | ## Multiple linear regression 92 | 93 | One of two arguments is needed to be set when fitting a model with three or more independent variables. The both relate to the size of the data set used for the model. So either the `sample_size` argument is passed, or `auto_count` is set to `TRUE`. When `auto_count` is set to `TRUE`, and no sample size is passed, then the function will do a table count as part of the model fitting. This is done in order to prevent unnecessary database operations, especially for cases when multiple models will be tested on top of the same sample data. 94 | 95 | ```{r} 96 | db_sample %>% 97 | select(arr_delay, arr_time, dep_delay, dep_time) %>% 98 | linear_regression_db(arr_delay, sample_size = 20000) 99 | ``` 100 | 101 | ## Interactions 102 | 103 | Interactions have to be handled manually prior the modeling step. 104 | 105 | ```{r} 106 | db_sample %>% 107 | mutate(distanceXarr_time = distance * arr_time) %>% 108 | select(arr_delay, distanceXarr_time) %>% 109 | linear_regression_db(arr_delay, sample_size = 20000) 110 | ``` 111 | 112 | A more typical model would also include the two original variables: 113 | 114 | ```{r} 115 | db_sample %>% 116 | mutate(distanceXarr_time = distance * arr_time) %>% 117 | select(arr_delay, distance, arr_time, distanceXarr_time) %>% 118 | linear_regression_db(arr_delay, sample_size = 20000) 119 | ``` 120 | 121 | ## Full example 122 | 123 | Fitting a model with regular, categorical and interaction variables will look like this: 124 | 125 | ```{r} 126 | remote_model <- db_sample %>% 127 | mutate(distanceXarr_time = distance * arr_time) %>% 128 | select(arr_delay, dep_time, distanceXarr_time, origin) %>% 129 | add_dummy_variables(origin, values = origins) %>% 130 | linear_regression_db(y_var = arr_delay, sample_size = 20000) 131 | 132 | remote_model 133 | ``` 134 | 135 | ```{r, echo = FALSE} 136 | dbDisconnect(con) 137 | ``` 138 | --------------------------------------------------------------------------------