├── .github ├── .gitignore └── workflows │ ├── R-CMD-check.yaml │ └── render-readme.yaml ├── .gitignore ├── tests ├── testthat.R └── testthat │ ├── Rplots.pdf │ ├── test-visuals.R │ ├── test-type-identification.R │ └── test-score.R ├── CRAN-SUBMISSION ├── man ├── figures │ ├── README-PPS-barplot-1.png │ ├── README-PPS-heatmap-1.png │ ├── README-custom-plot-1.png │ ├── README-sbs-heatmap-1.png │ └── README-correlation-heatmap-1.png ├── available_algorithms.Rd ├── ppsr.Rd ├── available_evaluation_metrics.Rd ├── score_correlations.Rd ├── normalize_score.Rd ├── score_matrix.Rd ├── score_model.Rd ├── visualize_correlations.Rd ├── visualize_both.Rd ├── score_naive.Rd ├── visualize_pps.Rd ├── score_df.Rd ├── score.Rd └── score_predictors.Rd ├── .Rbuildignore ├── NAMESPACE ├── ppsr.Rproj ├── R ├── ppsr.R ├── checks.R ├── utils.R ├── algorithms.R ├── metrics.R ├── visualize.R └── score.R ├── cran-comments.md ├── DESCRIPTION ├── NEWS.md ├── README.Rmd ├── README.md └── LICENSE.md /.github/.gitignore: -------------------------------------------------------------------------------- 1 | *.html 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | library(ppsr) 3 | 4 | test_check("ppsr") 5 | -------------------------------------------------------------------------------- /tests/testthat/Rplots.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/paulvanderlaken/ppsr/HEAD/tests/testthat/Rplots.pdf -------------------------------------------------------------------------------- /CRAN-SUBMISSION: 
-------------------------------------------------------------------------------- 1 | Version: 0.0.5 2 | Date: 2024-02-18 11:57:39 UTC 3 | SHA: 7a9dbecb5e7f39b96a7d15cb54429379bc907696 4 | -------------------------------------------------------------------------------- /man/figures/README-PPS-barplot-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/paulvanderlaken/ppsr/HEAD/man/figures/README-PPS-barplot-1.png -------------------------------------------------------------------------------- /man/figures/README-PPS-heatmap-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/paulvanderlaken/ppsr/HEAD/man/figures/README-PPS-heatmap-1.png -------------------------------------------------------------------------------- /man/figures/README-custom-plot-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/paulvanderlaken/ppsr/HEAD/man/figures/README-custom-plot-1.png -------------------------------------------------------------------------------- /man/figures/README-sbs-heatmap-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/paulvanderlaken/ppsr/HEAD/man/figures/README-sbs-heatmap-1.png -------------------------------------------------------------------------------- /man/figures/README-correlation-heatmap-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/paulvanderlaken/ppsr/HEAD/man/figures/README-correlation-heatmap-1.png -------------------------------------------------------------------------------- /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | ^LICENSE\.md$ 4 | ^README\.Rmd$ 5 | ^\.github$ 6 | ^cran-comments\.md$ 7 | 
^CRAN-RELEASE$ 8 | ^CRAN-SUBMISSION$ 9 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(available_algorithms) 4 | export(available_evaluation_metrics) 5 | export(score) 6 | export(score_correlations) 7 | export(score_df) 8 | export(score_matrix) 9 | export(score_predictors) 10 | export(visualize_both) 11 | export(visualize_correlations) 12 | export(visualize_pps) 13 | -------------------------------------------------------------------------------- /tests/testthat/test-visuals.R: -------------------------------------------------------------------------------- 1 | test_that("Functions produce ggplot2 lists", { 2 | skip_on_cran() 3 | 4 | plot_pred = visualize_pps(iris, y = 'Species') 5 | plot_mat = visualize_pps(iris) 6 | plot_cor = visualize_correlations(iris) 7 | plot_both = visualize_both(iris) 8 | 9 | expect_true(is.list(plot_pred)) 10 | expect_true(is.list(plot_mat)) 11 | expect_true(is.list(plot_cor)) 12 | expect_true(is.list(plot_both)) 13 | }) 14 | -------------------------------------------------------------------------------- /ppsr.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | -------------------------------------------------------------------------------- /R/ppsr.R: -------------------------------------------------------------------------------- 1 | #' ppsr: An R implementation of the 
Predictive Power Score (PPS) 2 | #' 3 | #' The PPS is an asymmetric, data-type-agnostic score that can detect linear or 4 | #' non-linear relationships between two columns. The score ranges from 0 5 | #' (no predictive power) to 1 (perfect predictive power). It can be used as an 6 | #' alternative to the correlation (matrix). 7 | #' 8 | #' @docType package 9 | #' @name ppsr 10 | #' @aliases ppsr-package 11 | NULL 12 | -------------------------------------------------------------------------------- /man/available_algorithms.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/algorithms.R 3 | \name{available_algorithms} 4 | \alias{available_algorithms} 5 | \title{Lists all algorithms currently supported} 6 | \usage{ 7 | available_algorithms() 8 | } 9 | \value{ 10 | a list of all available parsnip engines 11 | } 12 | \description{ 13 | Lists all algorithms currently supported 14 | } 15 | \examples{ 16 | available_algorithms() 17 | } 18 | -------------------------------------------------------------------------------- /R/checks.R: -------------------------------------------------------------------------------- 1 | 2 | is_id = function(x) { 3 | return(is.character(x) & length(unique(x)) == length(x)) 4 | } 5 | 6 | is_constant = function(x) { 7 | return(length(unique(x)) == 1) 8 | } 9 | 10 | is_same = function(x, y) { 11 | if(is.factor(x) & is.factor(y)) { 12 | return(all(as.character(x) == as.character(y))) 13 | } 14 | return(all(x == y)) 15 | } 16 | 17 | is_binary = function(x) { 18 | return(length(unique(x)) == 2) 19 | } 20 | 21 | is_binary_numeric = function(x) { 22 | return(is_binary(x) & is.numeric(x)) 23 | } 24 | -------------------------------------------------------------------------------- /man/ppsr.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please 
edit documentation in R/ppsr.R 3 | \docType{package} 4 | \name{ppsr} 5 | \alias{ppsr} 6 | \alias{ppsr-package} 7 | \title{ppsr: An R implementation of the Predictive Power Score (PPS)} 8 | \description{ 9 | The PPS is an asymmetric, data-type-agnostic score that can detect linear or 10 | non-linear relationships between two columns. The score ranges from 0 11 | (no predictive power) to 1 (perfect predictive power). It can be used as an 12 | alternative to the correlation (matrix). 13 | } 14 | -------------------------------------------------------------------------------- /R/utils.R: -------------------------------------------------------------------------------- 1 | format_score = function(x, digits = 2) { 2 | return(formatC(x, format = 'f', digits = digits)) 3 | } 4 | 5 | modal_value = function(x) { 6 | uq = unique(x) 7 | m = uq[which.max(tabulate(match(x, uq)))] 8 | return(m) 9 | } 10 | 11 | 12 | fill_blanks_in_list = function(ll) { 13 | elements_uq = unique(unlist(lapply(ll, names))) 14 | for (i in seq_along(ll)) { 15 | elements_filled = names(ll[[i]]) 16 | elements_missing = setdiff(elements_uq, elements_filled) 17 | ll[[i]][elements_missing] = NA 18 | } 19 | return(ll) 20 | } 21 | -------------------------------------------------------------------------------- /man/available_evaluation_metrics.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/metrics.R 3 | \name{available_evaluation_metrics} 4 | \alias{available_evaluation_metrics} 5 | \title{Lists all evaluation metrics currently supported} 6 | \usage{ 7 | available_evaluation_metrics() 8 | } 9 | \value{ 10 | a list of all available evaluation metrics and their implementation in functional form 11 | } 12 | \description{ 13 | Lists all evaluation metrics currently supported 14 | } 15 | \examples{ 16 | available_evaluation_metrics() 17 | } 18 | 
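The helper functions in `R/utils.R` above are plain base R and can be exercised in isolation. A minimal sketch — the function bodies are copied from the file, but the example inputs are mine:

```r
# modal_value() returns the most frequent value in a vector;
# fill_blanks_in_list() pads every sub-list so all share the same element names.
# Bodies copied from R/utils.R; the example inputs below are illustrative only.
modal_value <- function(x) {
  uq <- unique(x)
  uq[which.max(tabulate(match(x, uq)))]
}

fill_blanks_in_list <- function(ll) {
  elements_uq <- unique(unlist(lapply(ll, names)))
  for (i in seq_along(ll)) {
    elements_missing <- setdiff(elements_uq, names(ll[[i]]))
    ll[[i]][elements_missing] <- NA
  }
  ll
}

modal_value(c('a', 'b', 'b', 'c'))  # "b"
ll <- fill_blanks_in_list(list(list(x = 1), list(y = 2)))
names(ll[[2]])  # "y" "x" -- each sub-list now carries both names
```

`fill_blanks_in_list()` is what lets the package row-bind per-column score results that do not all report the same fields.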
-------------------------------------------------------------------------------- /man/score_correlations.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/score.R 3 | \name{score_correlations} 4 | \alias{score_correlations} 5 | \title{Calculate correlation coefficients for whole dataframe} 6 | \usage{ 7 | score_correlations(df, ...) 8 | } 9 | \arguments{ 10 | \item{df}{data.frame containing columns for x and y} 11 | 12 | \item{...}{arguments to pass to \code{stats::cor()}} 13 | } 14 | \value{ 15 | a data.frame with x-y correlation coefficients 16 | } 17 | \description{ 18 | Calculate correlation coefficients for whole dataframe 19 | } 20 | \examples{ 21 | score_correlations(iris) 22 | } 23 | -------------------------------------------------------------------------------- /R/algorithms.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | engine_tree = function(type) { 5 | return(parsnip::set_engine(parsnip::decision_tree(), "rpart")) 6 | } 7 | 8 | engine_glm = function(type) { 9 | if (type == 'regression') { 10 | return(parsnip::set_engine(parsnip::linear_reg(), "lm")) 11 | } else if (type == 'classification') { 12 | return(parsnip::set_engine(parsnip::logistic_reg(), "glm")) 13 | } 14 | } 15 | 16 | #' Lists all algorithms currently supported 17 | #' 18 | #' @return a list of all available parsnip engines 19 | #' @export 20 | #' 21 | #' @examples 22 | #' available_algorithms() 23 | available_algorithms = function() { 24 | return(list( 25 | 'tree' = engine_tree, 26 | 'glm' = engine_glm 27 | )) 28 | } 29 | -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## winbuilder test results 2 | There were no ERRORs or WARNINGs. 3 | There were 3 NOTEs. 
4 | 5 | N checking dependencies in R code 6 | Namespaces in Imports field not imported from: 7 | 'rpart' 'withr' 8 | All declared Imports should be used. 9 | 10 | > These packages are actually used, so I do not understand why they are flagged. 11 | > Once removed, their absence causes check failures. 12 | 13 | N checking for detritus in the temp directory 14 | Found the following files/directories: 15 | 'lastMiKTeXException' 16 | 17 | > I cannot find this file and do not know where it comes from. 18 | 19 | 20 | ## R CMD check results 21 | 22 | 0 errors v | 0 warnings v | 0 notes v 23 | 24 | 25 | ## R-hub builder results 26 | There were no ERRORs or WARNINGs. 27 | There were no NOTEs. 28 | -------------------------------------------------------------------------------- /man/normalize_score.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/score.R 3 | \name{normalize_score} 4 | \alias{normalize_score} 5 | \title{Normalizes the original score compared to a naive baseline score 6 | The calculation that's being performed depends on the type of model} 7 | \usage{ 8 | normalize_score(baseline_score, model_score, type) 9 | } 10 | \arguments{ 11 | \item{baseline_score}{float, the evaluation metric score for a naive baseline (model)} 12 | 13 | \item{model_score}{float, the evaluation metric score for a statistical model} 14 | 15 | \item{type}{character, type of model} 16 | } 17 | \value{ 18 | numeric vector of length one, normalized score 19 | } 20 | \description{ 21 | Normalizes the original score compared to a naive baseline score 22 | The calculation that's being performed depends on the type of model 23 | } 24 | -------------------------------------------------------------------------------- /man/score_matrix.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit
documentation in R/score.R 3 | \name{score_matrix} 4 | \alias{score_matrix} 5 | \title{Calculate predictive power score matrix 6 | Iterates through the columns of the dataset, calculating the predictive power 7 | score for every possible combination of \code{x} and \code{y}.} 8 | \usage{ 9 | score_matrix(df, ...) 10 | } 11 | \arguments{ 12 | \item{df}{data.frame containing columns for x and y} 13 | 14 | \item{...}{any arguments passed to \code{\link{score_df}}, 15 | some of which will be passed on to \code{\link{score}}} 16 | } 17 | \value{ 18 | a matrix of numeric values, representing predictive power scores 19 | } 20 | \description{ 21 | Note that the targets are on the rows, and the features on the columns. 22 | } 23 | \examples{ 24 | \donttest{score_matrix(iris)} 25 | \donttest{score_matrix(mtcars, do_parallel = TRUE, n_cores=2)} 26 | } 27 | -------------------------------------------------------------------------------- /man/score_model.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/score.R 3 | \name{score_model} 4 | \alias{score_model} 5 | \title{Calculates out-of-sample model performance of a statistical model} 6 | \usage{ 7 | score_model(train, test, model, x, y, metric) 8 | } 9 | \arguments{ 10 | \item{train}{df, training data, containing variable y} 11 | 12 | \item{test}{df, test data, containing variable y} 13 | 14 | \item{model}{parsnip model object, with mode preset} 15 | 16 | \item{x}{character, column name of predictor variable} 17 | 18 | \item{y}{character, column name of target variable} 19 | 20 | \item{metric}{character, name of evaluation metric being used, see \code{available_evaluation_metrics()}} 21 | } 22 | \value{ 23 | numeric vector of length one, evaluation score for predictions using the statistical model 24 | } 25 | \description{ 26 | Calculates out-of-sample model performance of a statistical model 27 | } 28 |
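`normalize_score.Rd` above describes how the model score and the naive baseline score are combined into a PPS, but not the formula itself. A sketch of that normalization — the exact formulas are an assumption based on the common PPS definition, not copied from `R/score.R`:

```r
# Hypothetical sketch of normalize_score(): for regression, where the metric
# is an error (MAE/RMSE, lower is better), the PPS is the relative error
# reduction versus the baseline; for classification, where the metric is an
# F1 score (higher is better), it is the relative improvement over the
# baseline. Both are clamped to [0, 1]. Assumed formulas, not the package's.
normalize_score_sketch <- function(baseline_score, model_score, type) {
  if (type == 'regression') {
    pps <- 1 - model_score / baseline_score
  } else {
    pps <- (model_score - baseline_score) / (1 - baseline_score)
  }
  max(0, min(1, pps))
}

# A model with MAE 0.5 against a naive baseline MAE of 2.0:
normalize_score_sketch(baseline_score = 2, model_score = 0.5, type = 'regression')  # 0.75
```

Under this reading, a model no better than the baseline gets a PPS of 0, and a perfect model gets 1, matching the 0-to-1 range stated in the package description.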
-------------------------------------------------------------------------------- /.github/workflows/R-CMD-check.yaml: -------------------------------------------------------------------------------- 1 | # For help debugging build failures open an issue on the RStudio community with the 'github-actions' tag. 2 | # https://community.rstudio.com/new-topic?category=Package%20development&tags=github-actions 3 | on: 4 | push: 5 | branches: 6 | - main 7 | - master 8 | pull_request: 9 | branches: 10 | - main 11 | - master 12 | 13 | name: R-CMD-check 14 | 15 | jobs: 16 | R-CMD-check: 17 | runs-on: macOS-latest 18 | env: 19 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 20 | steps: 21 | - uses: actions/checkout@v2 22 | - uses: r-lib/actions/setup-r@v2 23 | - name: Install dependencies 24 | run: | 25 | install.packages(c("remotes", "rcmdcheck")) 26 | remotes::install_deps(dependencies = TRUE) 27 | shell: Rscript {0} 28 | - name: Check 29 | run: rcmdcheck::rcmdcheck(args = "--no-manual", error_on = "error") 30 | shell: Rscript {0} 31 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: ppsr 2 | Type: Package 3 | Title: Predictive Power Score 4 | Version: 0.0.5 5 | Authors@R: 6 | person("Paul", "van der Laken", email = "paulvanderlaken@gmail.com", role = c("aut", "cre", "cph")) 7 | Description: The Predictive Power Score (PPS) is an asymmetric, data-type-agnostic score 8 | that can detect linear or non-linear relationships between two variables. 9 | The score ranges from 0 (no predictive power) to 1 (perfect predictive power). 10 | PPS can be useful for data exploration purposes, in the same way correlation analysis is. 11 | For more information on PPS, see . 
12 | License: GPL (>= 3) 13 | Encoding: UTF-8 14 | LazyData: true 15 | Suggests: 16 | testthat (>= 2.0.0) 17 | Config/testthat/edition: 3 18 | Config/testthat/parallel: true 19 | RoxygenNote: 7.2.3 20 | Imports: 21 | ggplot2 (>= 3.3.3), 22 | parsnip (>= 0.1.5), 23 | rpart (>= 4.1.15), 24 | withr (>= 2.4.1), 25 | gridExtra (>= 2.3) 26 | -------------------------------------------------------------------------------- /.github/workflows/render-readme.yaml: -------------------------------------------------------------------------------- 1 | # Controls when the action will run 2 | on: 3 | push: 4 | branches: master 5 | 6 | name: render README 7 | 8 | jobs: 9 | render: 10 | # The type of runner that the job will run on 11 | runs-on: windows-latest 12 | 13 | steps: 14 | # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it 15 | - uses: actions/checkout@v2 16 | - uses: r-lib/actions/setup-r@v2 17 | - uses: r-lib/actions/setup-pandoc@v2 18 | 19 | # install packages needed 20 | - name: install required packages 21 | run: Rscript -e 'install.packages(c("rmarkdown"))' 22 | 23 | # Render README.md using rmarkdown 24 | - name: render README 25 | run: Rscript -e 'rmarkdown::render("README.Rmd", output_format = "md_document")' 26 | 27 | - name: commit rendered README 28 | run: | 29 | git add README.md man/figures/README-* 30 | git commit -m "Re-build README.md" || echo "No changes to commit" 31 | git push origin master || echo "No changes to push" 32 | -------------------------------------------------------------------------------- /tests/testthat/test-type-identification.R: -------------------------------------------------------------------------------- 1 | test_that("ID columns are identified", { 2 | 3 | df = mtcars 4 | df[['id']] = row.names(mtcars) 5 | 6 | expect_true(is_id(df[['id']])) 7 | expect_equal(score(df, x = 'id', y = 'mpg')[['pps']], 0) 8 | expect_equal(score(df, x = 'mpg', y = 'id')[['pps']], 0) 9 | }) 10 | 11 | test_that("Constants are
identified", { 12 | 13 | df = mtcars 14 | df$constant1 = 1 15 | df$constant2 = 'a' 16 | 17 | expect_true(is_constant(df$constant1)) 18 | expect_true(is_constant(df$constant2)) 19 | expect_false(is_constant(df$mpg)) 20 | expect_equal(score(df, x = 'constant1', y = 'constant2')[['pps']], 0) 21 | expect_equal(score(df, x = 'constant2', y = 'constant1')[['pps']], 0) 22 | }) 23 | 24 | test_that("Identify when target and features are the same", { 25 | 26 | df = mtcars 27 | df$mpg2 = df$mpg 28 | 29 | expect_true(is_same(df$mpg, df$mpg2)) 30 | expect_false(is_same(df$mpg, df$cyl)) 31 | expect_equal(score(df, x = 'mpg', y = 'mpg')[['pps']], 1) 32 | expect_equal(score(df, x = 'mpg', y = 'mpg2')[['pps']], 1) 33 | }) 34 | -------------------------------------------------------------------------------- /man/visualize_correlations.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/visualize.R 3 | \name{visualize_correlations} 4 | \alias{visualize_correlations} 5 | \title{Visualize the correlation matrix} 6 | \usage{ 7 | visualize_correlations( 8 | df, 9 | color_value_positive = "#08306B", 10 | color_value_negative = "#8b0000", 11 | color_text = "#FFFFFF", 12 | include_missings = FALSE, 13 | ... 
14 | ) 15 | } 16 | \arguments{ 17 | \item{df}{data.frame containing columns for x and y} 18 | 19 | \item{color_value_positive}{color used for upper limit of gradient (high positive correlation)} 20 | 21 | \item{color_value_negative}{color used for lower limit of gradient (high negative correlation)} 22 | 23 | \item{color_text}{color used for text, best to pick high contrast with \code{color_value_positive}} 24 | 25 | \item{include_missings}{bool, whether to include the variables without correlation values in the plot} 26 | 27 | \item{...}{arguments to pass to \code{stats::cor()}} 28 | } 29 | \value{ 30 | a ggplot object, a heatmap visualization 31 | } 32 | \description{ 33 | Visualize the correlation matrix 34 | } 35 | \examples{ 36 | visualize_correlations(iris) 37 | } 38 | -------------------------------------------------------------------------------- /man/visualize_both.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/visualize.R 3 | \name{visualize_both} 4 | \alias{visualize_both} 5 | \title{Visualize the PPS & correlation matrices} 6 | \usage{ 7 | visualize_both( 8 | df, 9 | color_value_positive = "#08306B", 10 | color_value_negative = "#8b0000", 11 | color_text = "#FFFFFF", 12 | include_missings = TRUE, 13 | nrow = 1, 14 | ...
15 | ) 16 | } 17 | \arguments{ 18 | \item{df}{data.frame containing columns for x and y} 19 | 20 | \item{color_value_positive}{color used for upper limit of gradient (high positive correlation)} 21 | 22 | \item{color_value_negative}{color used for lower limit of gradient (high negative correlation)} 23 | 24 | \item{color_text}{string, hex value or color name used for text, best to pick high contrast with \code{color_value_positive}} 25 | 26 | \item{include_missings}{bool, whether to include the variables without correlation values in the plot} 27 | 28 | \item{nrow}{numeric, number of rows, either 1 or 2} 29 | 30 | \item{...}{any arguments passed to \code{\link{score}}} 31 | } 32 | \value{ 33 | a grob object, a grid with two ggplot2 heatmap visualizations 34 | } 35 | \description{ 36 | Visualize the PPS & correlation matrices 37 | } 38 | \examples{ 39 | \donttest{visualize_both(iris)} 40 | 41 | \donttest{visualize_both(mtcars, do_parallel = TRUE, n_cores = 2)} 42 | } 43 | -------------------------------------------------------------------------------- /man/score_naive.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/score.R 3 | \name{score_naive} 4 | \alias{score_naive} 5 | \title{Calculate out-of-sample model performance of naive baseline model 6 | The calculation that's being performed depends on the type of model 7 | For regression models, the mean is used as prediction 8 | For classification, a model predicting random values and 9 | a model predicting modal values are used and 10 | the best model is taken as baseline score} 11 | \usage{ 12 | score_naive(train, test, x, y, type, metric) 13 | } 14 | \arguments{ 15 | \item{train}{df, training data, containing variable y} 16 | 17 | \item{test}{df, test data, containing variable y} 18 | 19 | \item{x}{character, column name of predictor variable} 20 | 21 | \item{y}{character, column name of target
variable} 22 | 23 | \item{type}{character, type of model} 24 | 25 | \item{metric}{character, evaluation metric being used} 26 | } 27 | \value{ 28 | numeric vector of length one, evaluation score for predictions using naive model 29 | } 30 | \description{ 31 | Calculate out-of-sample model performance of naive baseline model 32 | The calculation that's being performed depends on the type of model 33 | For regression models, the mean is used as prediction 34 | For classification, a model predicting random values and 35 | a model predicting modal values are used and 36 | the best model is taken as baseline score 37 | } 38 | -------------------------------------------------------------------------------- /R/metrics.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | .eval = function(y, yhat) { 4 | return(NULL) 5 | } 6 | 7 | eval_mae = function(y, yhat) { 8 | return(mean(abs(y - yhat))) 9 | } 10 | 11 | eval_rmse = function(y, yhat) { 12 | return(mean((y - yhat)^2)^0.5) 13 | } 14 | 15 | 16 | calculate_f1_scores = function(y, yhat) { 17 | yhat <- factor(as.character(yhat), levels=unique(as.character(y))) 18 | y <- as.factor(y) 19 | cm = as.matrix(table(y, yhat)) 20 | 21 | precision <- diag(cm) / colSums(cm) 22 | recall <- diag(cm) / rowSums(cm) 23 | f1 <- ifelse(precision + recall == 0, 0, 2 * precision * recall / (precision + recall)) 24 | 25 | f1[is.na(f1)] <- 0 # assume F1 is zero when it cannot be computed 26 | 27 | return(f1) 28 | } 29 | 30 | 31 | eval_macro_f1_score = function(y, yhat) { 32 | f1 = calculate_f1_scores(y, yhat) 33 | return(mean(f1)) 34 | } 35 | 36 | eval_weighted_f1_score = function(y, yhat) { 37 | f1 = calculate_f1_scores(y, yhat) 38 | w = prop.table(table(y)) 39 | return(stats::weighted.mean(f1, w)) 40 | } 41 | 42 | #' Lists all evaluation metrics currently supported 43 | #' 44 | #' @return a list of all available evaluation metrics and their implementation in functional form 45 | #' @export
46 | 47 | #' @examples 48 | #' available_evaluation_metrics() 49 | available_evaluation_metrics = function() { 50 | return(list( 51 | 'MAE' = eval_mae, 52 | 'RMSE' = eval_rmse, 53 | 'F1_macro' = eval_macro_f1_score, 54 | 'F1_weighted' = eval_weighted_f1_score 55 | )) 56 | } 57 | -------------------------------------------------------------------------------- /NEWS.md: -------------------------------------------------------------------------------- 1 | # ppsr 0.0.5 2 | 3 | Fixed the package alias breaking due to the `roxygen` dependency. 4 | 5 | # ppsr 0.0.4 6 | 7 | Removed the parallel package from imports to fix a bug on macOS (#32) 8 | 9 | 10 | # ppsr 0.0.3 11 | 12 | No major changes. 13 | 14 | 15 | # ppsr 0.0.2.1 16 | 17 | ## Major changes: 18 | * Update to ggplot2 theme, removed irrelevant panel grid lines 19 | 20 | ## Bug fixes: 21 | * #19 Parallelization should now work: the package is attached via a library call instead of exporting functions by name. 22 | 23 | 24 | # ppsr 0.0.2 25 | 26 | ## Major changes: 27 | 28 | * Improved documentation, extended README, and more detailed Usage sections for functions 29 | * Several user-irrelevant functions are no longer exported 30 | * Support for (generalized) linear models for regression and binary classification (not multinomial) 31 | * `available_algorithms()` and `available_evaluation_metrics()` to inspect options 32 | * `visualize_pps()` now includes an `include_target` parameter to remove the target variable from the barplot visualization 33 | 34 | ## Bug fixes: 35 | 36 | * Customizable axis labeling in the visualizations 37 | 38 | 39 | # ppsr 0.0.1 40 | 41 | ## Major changes: 42 | 43 | * Adds a NEWS file 44 | * Changed the default color of negative correlations to dark red 45 | 46 | ## New features: 47 | 48 | * Parallel computing now optional in all `score_*()` and `visualize_*()` functions 49 | 50 | ## Bug fixes: 51 | 52 | * related to classification models (#11 and #16) 53 | * related to inconsistent text color (#21) 54 | 55 | 56 |
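The F1 machinery in `R/metrics.R` above can be exercised in isolation. The function bodies below are copied from the file; the toy label vectors are mine:

```r
# calculate_f1_scores() builds a confusion matrix and returns per-class F1;
# eval_weighted_f1_score() averages those F1 values, weighted by class
# frequency. Bodies copied from R/metrics.R; toy labels are illustrative.
calculate_f1_scores <- function(y, yhat) {
  yhat <- factor(as.character(yhat), levels = unique(as.character(y)))
  y <- as.factor(y)
  cm <- as.matrix(table(y, yhat))
  precision <- diag(cm) / colSums(cm)
  recall <- diag(cm) / rowSums(cm)
  f1 <- ifelse(precision + recall == 0, 0,
               2 * precision * recall / (precision + recall))
  f1[is.na(f1)] <- 0  # assume F1 is zero when it cannot be computed
  f1
}

eval_weighted_f1_score <- function(y, yhat) {
  f1 <- calculate_f1_scores(y, yhat)
  w <- prop.table(table(y))
  stats::weighted.mean(f1, w)
}

# Class 'a': precision 1, recall 0.5 -> F1 = 2/3.
# Class 'b': precision 2/3, recall 1 -> F1 = 0.8.
y    <- c('a', 'a', 'b', 'b')
yhat <- c('a', 'b', 'b', 'b')
round(eval_weighted_f1_score(y, yhat), 3)  # 0.733
```

With balanced classes the weighted and macro averages coincide; they diverge as soon as the class frequencies are unequal, which is why the package exposes both `F1_macro` and `F1_weighted`.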
-------------------------------------------------------------------------------- /man/visualize_pps.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/visualize.R 3 | \name{visualize_pps} 4 | \alias{visualize_pps} 5 | \title{Visualize the Predictive Power scores of the entire dataframe, or given a target} 6 | \usage{ 7 | visualize_pps( 8 | df, 9 | y = NULL, 10 | color_value_high = "#08306B", 11 | color_value_low = "#FFFFFF", 12 | color_text = "#FFFFFF", 13 | include_target = TRUE, 14 | ... 15 | ) 16 | } 17 | \arguments{ 18 | \item{df}{data.frame containing columns for x and y} 19 | 20 | \item{y}{string, column name of target variable, 21 | can be left \code{NULL} to visualize all X-Y PPS} 22 | 23 | \item{color_value_high}{string, hex value or color name used for upper limit of PPS gradient (high PPS)} 24 | 25 | \item{color_value_low}{string, hex value or color name used for lower limit of PPS gradient (low PPS)} 26 | 27 | \item{color_text}{string, hex value or color name used for text, best to pick high contrast with \code{color_value_high}} 28 | 29 | \item{include_target}{boolean, whether to include the target variable in the barplot} 30 | 31 | \item{...}{any arguments passed to \code{\link{score}}} 32 | } 33 | \value{ 34 | a ggplot object, a vertical barplot or heatmap visualization 35 | } 36 | \description{ 37 | If \code{y} is specified, \code{visualize_pps} returns a barplot of the PPS of 38 | every predictor on the specified target variable. 39 | If \code{y} is not specified, \code{visualize_pps} returns a heatmap visualization 40 | of the PPS for all X-Y combinations in a dataframe. 
41 | } 42 | \examples{ 43 | visualize_pps(iris, y = 'Species') 44 | 45 | \donttest{visualize_pps(iris)} 46 | 47 | \donttest{visualize_pps(mtcars, do_parallel = TRUE, n_cores = 2)} 48 | } 49 | -------------------------------------------------------------------------------- /man/score_df.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/score.R 3 | \name{score_df} 4 | \alias{score_df} 5 | \title{Calculate predictive power scores for whole dataframe 6 | Iterates through the columns of the dataframe, calculating the predictive power 7 | score for every possible combination of \code{x} and \code{y}.} 8 | \usage{ 9 | score_df(df, ..., do_parallel = FALSE, n_cores = -1) 10 | } 11 | \arguments{ 12 | \item{df}{data.frame containing columns for x and y} 13 | 14 | \item{...}{any arguments passed to \code{\link{score}}} 15 | 16 | \item{do_parallel}{bool, whether to perform \code{\link{score}} calls in parallel} 17 | 18 | \item{n_cores}{numeric, number of cores to use, defaults to maximum minus 1} 19 | } 20 | \value{ 21 | a data.frame containing \describe{ 22 | \item{x}{the name of the predictor variable} 23 | \item{y}{the name of the target variable} 24 | \item{result_type}{text showing how to interpret the resulting score} 25 | \item{pps}{the predictive power score} 26 | \item{metric}{the evaluation metric used to compute the PPS} 27 | \item{baseline_score}{the score of a naive model on the evaluation metric} 28 | \item{model_score}{the score of the predictive model on the evaluation metric} 29 | \item{cv_folds}{how many cross-validation folds were used} 30 | \item{seed}{the seed that was set} 31 | \item{algorithm}{text showing what algorithm was used} 32 | \item{model_type}{text showing whether classification or regression was used} 33 | } 34 | } 35 | \description{ 36 | Calculate predictive power scores for whole dataframe 37 | Iterates through the columns
of the dataframe, calculating the predictive power 38 | score for every possible combination of \code{x} and \code{y}. 39 | } 40 | \examples{ 41 | \donttest{score_df(iris)} 42 | \donttest{score_df(mtcars, do_parallel = TRUE, n_cores = 2)} 43 | } 44 | -------------------------------------------------------------------------------- /man/score.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/score.R 3 | \name{score} 4 | \alias{score} 5 | \title{Calculate predictive power score for x on y} 6 | \usage{ 7 | score( 8 | df, 9 | x, 10 | y, 11 | algorithm = "tree", 12 | metrics = list(regression = "MAE", classification = "F1_weighted"), 13 | cv_folds = 5, 14 | seed = 1, 15 | verbose = TRUE 16 | ) 17 | } 18 | \arguments{ 19 | \item{df}{data.frame containing columns for x and y} 20 | 21 | \item{x}{string, column name of predictor variable} 22 | 23 | \item{y}{string, column name of target variable} 24 | 25 | \item{algorithm}{string, see \code{available_algorithms()}} 26 | 27 | \item{metrics}{named list of \code{eval_*} functions used for 28 | regression and classification problems, see \code{available_evaluation_metrics()}} 29 | 30 | \item{cv_folds}{float, number of cross-validation folds} 31 | 32 | \item{seed}{float, seed to ensure reproducibility/stability} 33 | 34 | \item{verbose}{boolean, whether to print notifications} 35 | } 36 | \value{ 37 | a named list, potentially containing \describe{ 38 | \item{x}{the name of the predictor variable} 39 | \item{y}{the name of the target variable} 40 | \item{result_type}{text showing how to interpret the resulting score} 41 | \item{pps}{the predictive power score} 42 | \item{metric}{the evaluation metric used to compute the PPS} 43 | \item{baseline_score}{the score of a naive model on the evaluation metric} 44 | \item{model_score}{the score of the predictive model on the evaluation metric} 45 | \item{cv_folds}{how 
many cross-validation folds were used} 46 | \item{seed}{the seed that was set} 47 | \item{algorithm}{text showing what algorithm was used} 48 | \item{model_type}{text showing whether classification or regression was used} 49 | } 50 | } 51 | \description{ 52 | Calculate predictive power score for x on y 53 | } 54 | \examples{ 55 | score(iris, x = 'Petal.Length', y = 'Species') 56 | } 57 | -------------------------------------------------------------------------------- /man/score_predictors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/score.R 3 | \name{score_predictors} 4 | \alias{score_predictors} 5 | \title{Calculate predictive power scores for y 6 | Calculates the predictive power scores for the specified \code{y} variable 7 | using every column in the dataset as \code{x}, including itself.} 8 | \usage{ 9 | score_predictors(df, y, ..., do_parallel = FALSE, n_cores = -1) 10 | } 11 | \arguments{ 12 | \item{df}{data.frame containing columns for x and y} 13 | 14 | \item{y}{string, column name of target variable} 15 | 16 | \item{...}{any arguments passed to \code{\link{score}}} 17 | 18 | \item{do_parallel}{bool, whether to perform \code{\link{score}} calls in parallel} 19 | 20 | \item{n_cores}{numeric, number of cores to use, defaults to maximum minus 1} 21 | } 22 | \value{ 23 | a data.frame containing \describe{ 24 | \item{x}{the name of the predictor variable} 25 | \item{y}{the name of the target variable} 26 | \item{result_type}{text showing how to interpret the resulting score} 27 | \item{pps}{the predictive power score} 28 | \item{metric}{the evaluation metric used to compute the PPS} 29 | \item{baseline_score}{the score of a naive model on the evaluation metric} 30 | \item{model_score}{the score of the predictive model on the evaluation metric} 31 | \item{cv_folds}{how many cross-validation folds were used} 32 | \item{seed}{the seed that
was set} 33 | \item{algorithm}{text showing what algorithm was used} 34 | \item{model_type}{text showing whether classification or regression was used} 35 | } 36 | } 37 | \description{ 38 | Calculate predictive power scores for y 39 | Calculates the predictive power scores for the specified \code{y} variable 40 | using every column in the dataset as \code{x}, including itself. 41 | } 42 | \examples{ 43 | \donttest{score_predictors(df = iris, y = 'Species')} 44 | \donttest{score_predictors(df = mtcars, y = 'mpg', do_parallel = TRUE, n_cores = 2)} 45 | } 46 | -------------------------------------------------------------------------------- /tests/testthat/test-score.R: -------------------------------------------------------------------------------- 1 | test_that("PPS is between zero and one", { 2 | set.seed(1) 3 | n = 100 4 | x = rnorm(n = n) 5 | df = data.frame( 6 | x = x, 7 | y1 = rnorm(n = n), 8 | y2 = runif(n = n), 9 | y3 = x, 10 | y4 = x ^ 2 + rnorm(n = n), 11 | y5 = as.numeric(x > 0) 12 | ) 13 | 14 | pps1 = score(df, 'x', 'y1', verbose = FALSE)[['pps']] 15 | pps2 = score(df, 'x', 'y2', verbose = FALSE)[['pps']] 16 | pps3 = score(df, 'x', 'y3', verbose = FALSE)[['pps']] 17 | pps4 = score(df, 'x', 'y4', verbose = FALSE)[['pps']] 18 | pps5 = score(df, 'x', 'y5', verbose = FALSE)[['pps']] 19 | 20 | min_pps = 0 21 | max_pps = 1 22 | 23 | expect_true(pps1 >= min_pps & pps1 <= max_pps) 24 | expect_true(pps2 >= min_pps & pps2 <= max_pps) 25 | expect_true(pps3 >= min_pps & pps3 <= max_pps) 26 | expect_true(pps4 >= min_pps & pps4 <= max_pps) 27 | expect_true(pps5 >= min_pps & pps5 <= max_pps) 28 | }) 29 | 30 | 31 | test_that("The calculated PPS is stable", { 32 | set.seed(1) 33 | n = 100 34 | x = rnorm(n = n) 35 | df = data.frame( 36 | x = x, 37 | y1 = x ^ 2 + rnorm(n = n), 38 | y2 = as.numeric(x > 0) 39 | ) 40 | 41 | pps1.1 = score(df, 'x', 'y1', verbose = FALSE)[['pps']] 42 | pps2.1 = score(df, 'x', 'y2', verbose = FALSE)[['pps']] 43 | 44 | pps1.2 = score(df, 'x', 'y1',
verbose = FALSE)[['pps']] 45 | pps2.2 = score(df, 'x', 'y2', verbose = FALSE)[['pps']] 46 | 47 | expect_equal(pps1.1, pps1.2) 48 | expect_equal(pps2.1, pps2.2) 49 | }) 50 | 51 | 52 | test_that("Classification works for characters, booleans, and binary numerics", { 53 | set.seed(1) 54 | n = 100 55 | x = rnorm(n = n) 56 | df = data.frame( 57 | x = x, 58 | y1 = as.logical(x > 0), 59 | y2 = as.numeric(x > 0), 60 | y3 = sample(c('a', 'b', 'c'), size = n, replace = TRUE) 61 | ) 62 | 63 | result1 = score(df, 'x', 'y1', verbose = FALSE) 64 | result2 = score(df, 'x', 'y2', verbose = FALSE) 65 | result3 = score(df, 'x', 'y3', verbose = FALSE) 66 | 67 | 68 | expect_equal(result1$model_type, 'classification') 69 | expect_equal(result2$model_type, 'classification') 70 | expect_equal(result3$model_type, 'classification') 71 | }) 72 | 73 | 74 | test_that("Regression works for doubles and integers", { 75 | set.seed(1) 76 | n = 100 77 | x = rnorm(n = n) 78 | df = data.frame( 79 | x = x, 80 | y1 = as.integer(seq_along(x)), 81 | y2 = as.numeric(x + rnorm(n = n)) 82 | ) 83 | 84 | result1 = score(df, 'x', 'y1', verbose = FALSE) 85 | result2 = score(df, 'x', 'y2', verbose = FALSE) 86 | 87 | expect_equal(result1$model_type, 'regression') 88 | expect_equal(result2$model_type, 'regression') 89 | }) 90 | 91 | 92 | 93 | test_that("Scoring functions work as expected", { 94 | set.seed(1) 95 | n = 100 96 | x = rnorm(n = n) 97 | df = data.frame( 98 | x = x, 99 | y1 = as.integer(seq_along(x)) 100 | ) 101 | result = score(df, 'x', 'y1') 102 | result_predictors = score_predictors(df, 'y1') 103 | result_df = score_df(df) 104 | expect_true(is.list(result)) 105 | expect_true(is.data.frame(result_predictors)) 106 | expect_equal(nrow(result_predictors), ncol(df)) 107 | expect_true(is.data.frame(result_df)) 108 | expect_equal(nrow(result_df), ncol(df) ^ 2) 109 | }) 110 | 111 | 112 | 113 | test_that("Algorithms work as expected", { 114 | set.seed(1) 115 | n = 100 116 | x = rnorm(n = n) 117 | df =
data.frame( 118 | x = x, 119 | y1 = as.integer(seq_along(x)), 120 | y2 = x + rnorm(n = n) 121 | ) 122 | for (algo in names(available_algorithms())){ 123 | expect_true(is.list(score(df, 'x', 'y1', algorithm = algo))) 124 | expect_true(is.list(score(df, 'x', 'y2', algorithm = algo))) 125 | } 126 | }) 127 | 128 | 129 | test_that("Evaluation metrics work as expected", { 130 | set.seed(1) 131 | n = 100 132 | x = rnorm(n = n) 133 | df = data.frame( 134 | x = x, 135 | y1 = as.integer(seq_along(x)), 136 | y2 = x + rnorm(n = n) 137 | ) 138 | for (eval in names(available_evaluation_metrics())){ 139 | metrics = list(regression = eval, classification = eval) 140 | expect_true(is.list(score(df, 'x', 'y1', metrics = metrics))) 141 | expect_true(is.list(score(df, 'x', 'y2', metrics = metrics))) 142 | } 143 | }) 144 | -------------------------------------------------------------------------------- /R/visualize.R: -------------------------------------------------------------------------------- 1 | # To prevent warnings: no visible binding for global variable 2 | x <- y <- pps <- correlation <- NULL 3 | 4 | 5 | 6 | pps_break_interval = function() { 7 | return(0.2) 8 | } 9 | 10 | pps_breaks = function() { 11 | return(seq(0, 1, pps_break_interval())) 12 | } 13 | 14 | pps_minor_breaks = function() { 15 | minor_breaks = seq(0, 1, pps_break_interval() / 2) 16 | return(setdiff(minor_breaks, pps_breaks())) 17 | } 18 | 19 | correlation_breaks = function() { 20 | # range twice as long so interval twice as long 21 | return(seq(-1, 1, pps_break_interval() * 2)) 22 | } 23 | 24 | theme_ppsr = function(){ 25 | ggplot2::theme_minimal() + 26 | ggplot2::theme( 27 | panel.grid.major = ggplot2::element_blank(), 28 | panel.grid.minor = ggplot2::element_blank(), 29 | NULL 30 | ) 31 | } 32 | 33 | 34 | #' Visualize the Predictive Power scores of the entire dataframe, or given a target 35 | #' 36 | #' If \code{y} is specified, \code{visualize_pps} returns a barplot of the PPS of 37 | #' every predictor on 
the specified target variable. 38 | #' If \code{y} is not specified, \code{visualize_pps} returns a heatmap visualization 39 | #' of the PPS for all X-Y combinations in a dataframe. 40 | #' 41 | #' @inheritParams score_predictors 42 | #' @param y string, column name of target variable, 43 | #' can be left \code{NULL} to visualize all X-Y PPS 44 | #' @param color_value_high string, hex value or color name used for upper limit of PPS gradient (high PPS) 45 | #' @param color_value_low string, hex value or color name used for lower limit of PPS gradient (low PPS) 46 | #' @param color_text string, hex value or color name used for text, best to pick high contrast with \code{color_value_high} 47 | #' @param include_target boolean, whether to include the target variable in the barplot 48 | #' 49 | #' @return a ggplot object, a vertical barplot or heatmap visualization 50 | #' @export 51 | #' 52 | #' @examples 53 | #' visualize_pps(iris, y = 'Species') 54 | #' 55 | #' \donttest{visualize_pps(iris)} 56 | #' 57 | #' \donttest{visualize_pps(mtcars, do_parallel = TRUE, n_cores = 2)} 58 | visualize_pps = function(df, 59 | y = NULL, 60 | color_value_high = '#08306B', 61 | color_value_low = '#FFFFFF', 62 | color_text = '#FFFFFF', 63 | include_target = TRUE, 64 | ...) { 65 | if (is.null(y)) { 66 | p = ggplot2::ggplot(score_df(df, ...), ggplot2::aes(x = x, y = y)) + 67 | ggplot2::geom_tile(ggplot2::aes(fill = pps)) + 68 | ggplot2::geom_text(ggplot2::aes(label = format_score(pps)), col = color_text) + 69 | ggplot2::scale_x_discrete(limits = colnames(df)) + 70 | ggplot2::scale_y_discrete(limits = rev(colnames(df))) + 71 | ggplot2::labs(x = 'predictor', y = 'target') + 72 | theme_ppsr() 73 | } else { 74 | res = score_predictors(df, y, ...) 
75 | if (!include_target) { 76 | res = res[res$x != y, ] 77 | } 78 | p = ggplot2::ggplot(res, 79 | ggplot2::aes(x = pps, y = stats::reorder(x, pps))) + 80 | ggplot2::geom_col(ggplot2::aes(fill = pps)) + 81 | ggplot2::geom_text(ggplot2::aes(label = format_score(pps)), hjust = 0) + 82 | ggplot2::scale_x_continuous(breaks = pps_breaks(), minor_breaks = pps_minor_breaks(), limits = c(0, 1.05)) + 83 | ggplot2::labs(y = 'feature') + 84 | theme_ppsr() + 85 | ggplot2::theme(panel.grid.major.x = ggplot2::element_line(colour = "grey92"), 86 | panel.grid.minor.x = ggplot2::element_line(size = ggplot2::rel(0.5), colour = "grey92") 87 | ) 88 | } 89 | p = p + 90 | ggplot2::scale_fill_gradient(low = color_value_low, high = color_value_high, 91 | limits = range(pps_breaks()), breaks = pps_breaks()) + 92 | ggplot2::expand_limits(fill = range(pps_breaks())) 93 | return(p) 94 | } 95 | 96 | 97 | 98 | #' Visualize the correlation matrix 99 | #' 100 | #' @inheritParams score_correlations 101 | #' @param color_value_positive color used for upper limit of gradient (high positive correlation) 102 | #' @param color_value_negative color used for lower limit of gradient (high negative correlation) 103 | #' @param color_text color used for text, best to pick high contrast with \code{color_value_positive} 104 | #' @param include_missings bool, whether to include the variables without correlation values in the plot 105 | #' 106 | #' @return a ggplot object, a heatmap visualization 107 | #' @export 108 | #' 109 | #' @examples 110 | #' visualize_correlations(iris) 111 | visualize_correlations = function(df, 112 | color_value_positive = '#08306B', 113 | color_value_negative = '#8b0000', 114 | color_text = '#FFFFFF', 115 | include_missings = FALSE, 116 | ...) { 117 | df_correlations = score_correlations(df, ...)
118 | 119 | if (include_missings) { 120 | cnames = colnames(df) 121 | } else { 122 | cnames = unique(df_correlations[['x']]) 123 | } 124 | 125 | # TODO standardize in heatmap function 126 | p = ggplot2::ggplot(df_correlations, ggplot2::aes(x = x, y = y)) + 127 | ggplot2::geom_tile(ggplot2::aes(fill = correlation)) + 128 | ggplot2::geom_text(ggplot2::aes(label = format_score(correlation)), col = color_text) + 129 | ggplot2::scale_x_discrete(limits = cnames) + 130 | ggplot2::scale_y_discrete(limits = rev(cnames)) + 131 | ggplot2::scale_fill_gradient2(low = color_value_negative, 132 | mid = '#FFFFFF', 133 | high = color_value_positive, 134 | limits = range(correlation_breaks()), 135 | breaks = correlation_breaks()) + 136 | ggplot2::expand_limits(fill = range(correlation_breaks())) + 137 | theme_ppsr() 138 | return(p) 139 | } 140 | 141 | 142 | #' Visualize the PPS & correlation matrices 143 | #' 144 | #' @inheritParams visualize_pps 145 | #' @inheritParams visualize_correlations 146 | #' @param nrow numeric, number of rows, either 1 or 2 147 | #' 148 | #' @return a grob object, a grid with two ggplot2 heatmap visualizations 149 | #' @export 150 | #' 151 | #' @examples 152 | #' \donttest{visualize_both(iris)} 153 | #' 154 | #' \donttest{visualize_both(mtcars, do_parallel = TRUE, n_cores = 2)} 155 | visualize_both = function(df, 156 | color_value_positive = '#08306B', 157 | color_value_negative = '#8b0000', 158 | color_text = '#FFFFFF', 159 | include_missings = TRUE, 160 | nrow = 1, 161 | ...) { 162 | plot_pps = visualize_pps(df, 163 | color_value_high = color_value_positive, 164 | color_text = color_text, 165 | ...) 
166 | plot_cor = visualize_correlations(df, 167 | color_value_positive = color_value_positive, 168 | color_value_negative = color_value_negative, 169 | color_text = color_text, 170 | include_missings = include_missings) 171 | return(gridExtra::grid.arrange(plot_pps, plot_cor, nrow = nrow)) 172 | } 173 | 174 | -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: github_document 3 | --- 4 | 5 | 6 | 7 | ```{r, include = FALSE} 8 | knitr::opts_chunk$set( 9 | collapse = TRUE, 10 | comment = "#>", 11 | fig.path = "man/figures/README-", 12 | fig.width = 7, 13 | fig.height = 5, 14 | dpi = 300) 15 | ``` 16 | 17 | # `ppsr` - Predictive Power Score 18 | 19 | 20 | [![R-CMD-check](https://github.com/paulvanderlaken/ppsr/workflows/R-CMD-check/badge.svg)](https://github.com/paulvanderlaken/ppsr/actions) 21 | [![CRAN status](https://www.r-pkg.org/badges/version/ppsr)](https://cran.r-project.org/package=ppsr) 22 | [![CRAN_Downloads_Total](http://cranlogs.r-pkg.org/badges/grand-total/ppsr)](https://cran.r-project.org/package=ppsr) 23 | 24 | 25 | 26 | `ppsr` is the R implementation of the **Predictive Power Score** (PPS). 27 | 28 | The PPS is an asymmetric, data-type-agnostic score that can detect linear or 29 | non-linear relationships between two variables. 30 | The score ranges from 0 (no predictive power) to 1 (perfect predictive power). 31 | 32 | The general concept of PPS is useful for data exploration purposes, 33 | in the same way correlation analysis is. 34 | You can read more about the (dis)advantages of using PPS in [this blog post](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598). 
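Because the PPS is asymmetric, the score of `x` predicting `y` generally differs from the score of `y` predicting `x` — unlike a correlation coefficient, which is identical in both directions. A quick sketch of this property (the exact values depend on the default algorithm and settings):

```{r, eval = FALSE}
# The PPS is directional: swapping predictor and target changes the score
ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length')[['pps']]
ppsr::score(iris, x = 'Petal.Length', y = 'Sepal.Length')[['pps']]
```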
35 | 36 | 37 | 38 | ## Installation 39 | 40 | 41 | You can install the latest stable version of `ppsr` from CRAN: 42 | 43 | ```{r, eval = FALSE, echo = TRUE} 44 | install.packages('ppsr') 45 | ``` 46 | 47 | Not all recent features and bugfixes may be included in the CRAN release. 48 | 49 | Instead, you might want to download the most recent development version of `ppsr` from GitHub: 50 | 51 | ```{r, eval = FALSE, echo = TRUE} 52 | # install.packages('devtools') # Install devtools if needed 53 | devtools::install_github('https://github.com/paulvanderlaken/ppsr') 54 | ``` 55 | 56 | 57 | 58 | ## Computing PPS 59 | 60 | **PPS** represents a **framework for evaluating predictive validity**. 61 | 62 | There is _not one single way_ of computing a predictive power score, but rather there are _many different ways_. 63 | 64 | You can select different machine learning algorithms, their associated parameters, 65 | cross-validation schemes, and/or model evaluation metrics. 66 | Each of these design decisions will affect your model's predictive performance and, 67 | in turn, affect the resulting predictive power score you compute. 68 | 69 | Hence, you can compute many different PPS for any given predictor and target variable. 70 | 71 | For example, the PPS computed with a _decision tree_ regression model... 72 | 73 | ```{r} 74 | ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length', algorithm = 'tree')[['pps']] 75 | ``` 76 | 77 | ...will differ from the PPS computed with a _simple linear regression_ model.
78 | 79 | ```{r} 80 | ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length', algorithm = 'glm')[['pps']] 81 | ``` 82 | 83 | 84 | 85 | ## Usage 86 | 87 | The `ppsr` package has four main functions to compute PPS: 88 | 89 | * `score()` computes an x-y PPS 90 | * `score_predictors()` computes all X-y PPS 91 | * `score_df()` computes all X-Y PPS 92 | * `score_matrix()` computes all X-Y PPS, and shows them in a matrix 93 | 94 | where `x` and `y` represent an individual predictor/target, 95 | and `X` and `Y` represent all predictors/targets in a given dataset. 96 | 97 | ### Examples: 98 | 99 | `score()` computes the PPS for a single target and predictor 100 | ```{r} 101 | ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length') 102 | ``` 103 | 104 | `score_predictors()` computes all PPSs for a single target using all predictors in a dataframe 105 | ```{r} 106 | ppsr::score_predictors(df = iris, y = 'Species') 107 | ``` 108 | 109 | `score_df()` computes all PPSs for every target-predictor combination in a dataframe 110 | ```{r} 111 | ppsr::score_df(df = iris) 112 | ``` 113 | 114 | `score_matrix()` also computes all PPSs for every target-predictor combination in a dataframe, 115 | but returns only the scores, arranged in a neat matrix like the familiar correlation matrix 116 | ```{r} 117 | ppsr::score_matrix(df = iris) 118 | ``` 119 | 120 | 121 | Currently, the `ppsr` package computes PPS by default using... 122 | 123 | * the default decision tree implementation of the `rpart` package, wrapped by `parsnip` 124 | * *weighted F1* scores to evaluate classification models, and _MAE_ to evaluate regression models 125 | * 5 cross-validation folds 126 | 127 | You can call the `available_algorithms()` and `available_evaluation_metrics()` functions 128 | to see what alternative settings are supported. 129 | 130 | Note that the calculated PPS reflects the **out-of-sample** predictive validity 131 | when more than a single cross-validation fold is used.
132 | If you prefer to look at in-sample scores, you can set `cv_folds = 1`. 133 | Note that in such cases overfitting can become an issue, particularly with the more flexible algorithms. 134 | 135 | 136 | 137 | ## Visualizing PPS 138 | Subsequently, there are three main functions that wrap around these computational 139 | functions to help you visualize your PPS using `ggplot2`: 140 | 141 | * `visualize_pps()` produces a barplot of all X-y PPS, or a heatmap of all X-Y PPS 142 | * `visualize_correlations()` produces a heatmap of all X-Y correlations 143 | * `visualize_both()` produces the two heatmaps of all X-Y PPS and correlations side-by-side 144 | 145 | ### Examples: 146 | 147 | If you specify a target variable (`y`) in `visualize_pps()`, you get a barplot of its predictors. 148 | ```{r, PPS-barplot} 149 | ppsr::visualize_pps(df = iris, y = 'Species') 150 | ``` 151 | 152 | If you do not specify a target variable in `visualize_pps()`, you get the PPS matrix visualized as a heatmap. 153 | ```{r, PPS-heatmap} 154 | ppsr::visualize_pps(df = iris) 155 | ``` 156 | 157 | Some users might find it useful to look at a correlation matrix for comparison. 158 | ```{r, correlation-heatmap} 159 | ppsr::visualize_correlations(df = iris) 160 | ``` 161 | 162 | With `visualize_both()` you generate the PPS and correlation matrices side-by-side, for easy comparison. 163 | ```{r, sbs-heatmap, fig.width=14} 164 | ppsr::visualize_both(df = iris) 165 | ``` 166 | 167 | You can change the colors of the visualizations using the functions' arguments. 168 | There are also arguments to change the color of the text scores. 169 | 170 | Furthermore, the functions return `ggplot2` objects, so that you can easily change the theme and other settings.
171 | 172 | ```{r, custom-plot} 173 | ppsr::visualize_pps(df = iris, 174 | color_value_high = 'red', 175 | color_value_low = 'yellow', 176 | color_text = 'black') + 177 | ggplot2::theme_classic() + 178 | ggplot2::theme(plot.background = ggplot2::element_rect(fill = "lightgrey")) + 179 | ggplot2::theme(title = ggplot2::element_text(size = 15)) + 180 | ggplot2::labs(title = 'Add your own title', 181 | subtitle = 'Maybe an informative subtitle', 182 | caption = 'Did you know ggplot2 includes captions?', 183 | x = 'You could call this\nthe independent variable\nas well') 184 | ``` 185 | 186 | 187 | ## Parallelization 188 | 189 | The number of predictive models that one needs to build in order to fill 190 | the PPS matrix belonging to a dataframe increases quadratically 191 | with every new column in that dataframe. 192 | 193 | For traditional correlation analyses, this is not a problem. 194 | Yet, with more computation-intensive algorithms, with many train-test splits, 195 | and with large or high-dimensional datasets, it can take a decent amount of time 196 | to build all the predictive models and derive their PPSs. 197 | 198 | One way to speed matters up is to use the `ppsr::score_predictors()` function and 199 | focus on predicting only the target/dependent variable you are most interested in. 200 | 201 | Alternatively, since version `0.0.1`, all `ppsr::score_*` and `ppsr::visualize_*` functions 202 | take in two arguments that facilitate parallel computing. 203 | You can parallelize `ppsr`'s computations by setting the `do_parallel` argument to `TRUE`. 204 | If you do so, a cluster will be created using the `parallel` package. 205 | By default, this cluster will use the maximum number of cores (see `parallel::detectCores()`) minus 1. 206 | 207 | However, with the second argument -- `n_cores` -- you can manually specify the number of cores you want `ppsr` to use.
208 | 209 | Examples: 210 | ```{r, eval = FALSE} 211 | ppsr::score_df(df = mtcars, do_parallel = TRUE) 212 | ``` 213 | 214 | ```{r, eval = FALSE} 215 | ppsr::visualize_pps(df = iris, do_parallel = TRUE, n_cores = 2) 216 | ``` 217 | 218 | 219 | ## Interpreting PPS 220 | 221 | The PPS is a **normalized score** that ranges from 0 (no predictive power) 222 | to 1 (perfect predictive power). 223 | 224 | The normalization occurs by comparing how well we are able to predict the values of 225 | a _target_ variable (`y`) using the values of a _predictor_ variable (`x`), 226 | relative to two **benchmarks**: a perfect prediction, and a naive prediction. 227 | 228 | The **perfect prediction** can be theoretically derived. 229 | A perfect regression model produces no error (=0.0), 230 | whereas a perfect classification model results in 100% accuracy, recall, et cetera (=1.0). 231 | 232 | The **naive prediction** is derived empirically. 233 | A naive _regression_ model is simulated by predicting the mean `y` value for all observations. 234 | This is similar to how R-squared is calculated. 235 | A naive _classification_ model is simulated by taking the better of two models: 236 | one predicting the modal `y` class, and one predicting random `y` classes for all observations. 237 | 238 | Whenever we train an "informed" model to predict `y` using `x`, 239 | we can assess how well it performs by comparing it to these two benchmarks. 240 | 241 | Suppose we train a regression model, and its mean absolute error (MAE) is 0.10. 242 | Suppose the naive model resulted in an MAE of 0.40. 243 | We know the perfect model would produce no error, which means an MAE of 0.0. 244 | 245 | With these three scores, we can normalize the performance of our informed regression model 246 | by interpolating its score between the perfect and the naive benchmarks.
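For error-based regression metrics, this interpolation boils down to one line of arithmetic. The snippet below is a simplified illustration of the idea using a hypothetical helper, not the package's internal implementation (see `normalize_score()` for the actual logic):

```{r, eval = FALSE}
# Illustrative normalization for an error metric such as MAE,
# where a perfect model scores 0: the PPS is the share of the naive
# model's error that the informed model eliminates, floored at 0.
normalize_error_score = function(model_score, baseline_score) {
  max(0, 1 - model_score / baseline_score)
}
normalize_error_score(model_score = 0.10, baseline_score = 0.40)  # 0.75
```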
247 | In this case, our model's performance lies a quarter of the way from the perfect benchmark, 248 | and three quarters of the way from the naive benchmark. 249 | In other words, our model's predictive power score is 75%: it produced 75% less error than the naive baseline, and fell only 25% short of perfect predictions. 250 | 251 | Using such normalized scores for model performance allows us to easily interpret 252 | how much better our models are compared to a naive baseline. 253 | Moreover, such normalized scores allow us to compare and contrast different 254 | modeling approaches, in terms of the algorithms, the target's data type, 255 | the evaluation metrics, and any other settings used. 256 | 257 | 258 | 259 | 260 | ## Considerations 261 | 262 | The main use of PPS is as a tool for data exploration. 263 | It trains out-of-the-box machine learning models to assess the predictive relations in your dataset. 264 | 265 | However, the PPS is quite a "quick and dirty" approach. 266 | The trained models are not at all tailored to your specific regression/classification problem. 267 | For example, it could be that you get many PPSs of 0 with the default settings. 268 | A known issue is that the default decision tree often does not find valuable splits 269 | and reverts to predicting the mean `y` value found at its root. 270 | Here, it could help to try calculating PPS with different settings (e.g., `algorithm = 'glm'`). 271 | 272 | At other times, predictive relationships may rely on a combination of variables 273 | (i.e. interaction/moderation). These are not captured by the PPS calculations, 274 | which consider only univariate relations. 275 | PPS is simply not suited for capturing such complexities.
276 | In these cases, it might be more interesting to train models on all your features simultaneously 277 | and turn to concepts like [feature/variable importance](https://topepo.github.io/caret/variable-importance.html), 278 | [partial dependency](https://christophm.github.io/interpretable-ml-book/pdp.html), [individual conditional expectations](https://christophm.github.io/interpretable-ml-book/ice.html), [accumulated local effects](https://christophm.github.io/interpretable-ml-book/ale.html), and others. 279 | 280 | In general, the PPS should not be considered more than a fast and easy tool for 281 | finding starting points for further, in-depth analysis. 282 | Keep in mind that you can build much better predictive models than the default 283 | PPS functions if you tailor your modeling efforts to your specific data context. 284 | 285 | 286 | 287 | ## Open issues & development 288 | 289 | PPS is a relatively young concept, and likewise the `ppsr` package is still under development. 290 | If you spot any bugs or potential improvements, please raise an issue or submit a pull request. 291 | 292 | On the development agenda are currently: 293 | 294 | * Support for different modeling techniques/algorithms 295 | * Support for generalized linear models for multinomial classification 296 | * Passing/setting of parameters for models 297 | * Different model evaluation metrics 298 | * Support for user-defined model evaluation metrics 299 | * Downsampling for large datasets 300 | 301 | 302 | ## Attribution 303 | 304 | This R package was inspired by 8080labs' Python package [`ppscore`](https://github.com/8080labs/ppscore). 305 | 306 | The same 8080labs also developed an earlier, unfinished [R implementation of PPS](https://github.com/8080labs/ppscoreR). 307 | 308 | Read more about the big ideas behind PPS in [this blog post](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598).
309 | -------------------------------------------------------------------------------- /R/score.R: -------------------------------------------------------------------------------- 1 | 2 | #' Calculate predictive power score for x on y 3 | #' 4 | #' @param df data.frame containing columns for x and y 5 | #' @param x string, column name of predictor variable 6 | #' @param y string, column name of target variable 7 | #' @param algorithm string, see \code{available_algorithms()} 8 | #' @param metrics named list of \code{eval_*} functions used for 9 | #' regression and classification problems, see \code{available_evaluation_metrics()} 10 | #' @param cv_folds float, number of cross-validation folds 11 | #' @param seed float, seed to ensure reproducibility/stability 12 | #' @param verbose boolean, whether to print notifications 13 | #' 14 | #' @return a named list, potentially containing \describe{ 15 | #' \item{x}{the name of the predictor variable} 16 | #' \item{y}{the name of the target variable} 17 | #' \item{result_type}{text showing how to interpret the resulting score} 18 | #' \item{pps}{the predictive power score} 19 | #' \item{metric}{the evaluation metric used to compute the PPS} 20 | #' \item{baseline_score}{the score of a naive model on the evaluation metric} 21 | #' \item{model_score}{the score of the predictive model on the evaluation metric} 22 | #' \item{cv_folds}{how many cross-validation folds were used} 23 | #' \item{seed}{the seed that was set} 24 | #' \item{algorithm}{text showing what algorithm was used} 25 | #' \item{model_type}{text showing whether classification or regression was used} 26 | #' } 27 | #' @export 28 | #' 29 | #' @examples 30 | #' score(iris, x = 'Petal.Length', y = 'Species') 31 | score = function(df, 32 | x, 33 | y, 34 | algorithm = 'tree', 35 | metrics = list('regression' = 'MAE', 'classification' = 'F1_weighted'), 36 | cv_folds = 5, 37 | seed = 1, 38 | verbose = TRUE) { 39 | 40 | # check if x and y are different variables 41 | if
(x == y) { 42 | return(generate_invalid_report(x, y, 'predictor and target are the same', 1)) 43 | } 44 | 45 | # check if columns occur in dataframe 46 | cnames = colnames(df) 47 | for (name in c(x, y)) { 48 | if (! name %in% cnames) { 49 | stop(name, ' is not a column in the provided data frame') 50 | } else if (sum(cnames == name) > 1) { 51 | stop(name, ' appears as a column name more than once in the data frame') 52 | } 53 | } 54 | 55 | # drop all other columns from dataframe 56 | df = df[, c(x, y)] 57 | # remove records with missing data 58 | df = df[stats::complete.cases(df), ] 59 | 60 | # if there's no data left, there's no predictive power 61 | if (nrow(df) == 0) { 62 | return(generate_invalid_report(x, y, 'no non-missing records', 1)) 63 | } 64 | 65 | # an ID variable has no predictive power 66 | if (is_id(df[[y]])) { 67 | return(generate_invalid_report(x, y, 'target is id', 0)) 68 | } else if (is_id(df[[x]])) { 69 | return(generate_invalid_report(x, y, 'predictor is id', 0)) 70 | } 71 | # a vector without variation has no predictive power 72 | if (is_constant(df[[y]])) { 73 | return(generate_invalid_report(x, y, 'target is constant', 0)) 74 | } else if (is_constant(df[[x]])) { 75 | return(generate_invalid_report(x, y, 'predictor is constant', 0)) 76 | } 77 | # a vector that is completely similar has full predictive power 78 | if (is_same(df[[x]], df[[y]])) { 79 | return(generate_invalid_report(x, y, 'target and predictor are same', 1)) 80 | } 81 | 82 | # force binary numerics, boolean/logicals, and characters/texts to factor 83 | if (algorithm == 'glm') { 84 | if (is.logical(df[[y]])) { 85 | if(verbose) { 86 | cat('Note:', y, 'was forced from', typeof(df[[y]]), 'to integer.\n') 87 | } 88 | df[[y]] = as.integer(df[[y]]) 89 | } else if (is.character(df[[y]])) { 90 | return(generate_invalid_report(x, y, 'algorithm does not support multinomial classification', NA)) 91 | } 92 | } else if (is_binary_numeric(df[[y]]) | is.logical(df[[y]]) | is.character(df[[y]])) 
{ 93 | if(verbose) { 94 | cat('Note:', y, 'was forced from', typeof(df[[y]]), 'to factor.\n') 95 | } 96 | df[[y]] = as.factor(df[[y]]) 97 | } 98 | 99 | # check whether the number of cv_folds is possible and sensible 100 | if (cv_folds < 1) { 101 | stop('cv_folds needs to be numeric and greater than or equal to 1') 102 | } 103 | if (cv_folds > length(df[[y]])) { 104 | stop('There are more cv_folds than ', x, '-', y, ' observations. Pick a smaller number of folds.') 105 | } 106 | n_per_fold = length(df[[y]]) / cv_folds 107 | fold_size_warning_threshold = 10 108 | if (n_per_fold < fold_size_warning_threshold) { 109 | warning('There are on average only ', n_per_fold, ' observations in each test-set', 110 | ' for the ', x, '-', y, ' relationship.\n', 111 | 'Model performance will be highly unstable. Fewer cv_folds are advised.') 112 | } 113 | 114 | # set seed to ensure stability of results 115 | set.seed(seed) 116 | 117 | ## set up statistical model 118 | # determine type of model we are dealing with 119 | type = ifelse(is.numeric(df[[y]]), 'regression', 'classification') 120 | 121 | # set engine based on algorithm 122 | engine = available_algorithms()[[algorithm]] 123 | model = engine(type) 124 | 125 | # set mode based on model type 126 | model = parsnip::set_mode(object = model, mode = type) 127 | 128 | # TODO implement possibility to feed in udf as evaluation metrics 129 | # get the appropriate evaluation function 130 | # check if metrics are provided 131 | if (! 'regression' %in% names(metrics)) { 132 | stop('Input list of metrics does not contain a metric for regression') 133 | } 134 | if (!
'classification' %in% names(metrics)) { 135 | stop('Input list of metrics does not contain a metric for classification') 136 | } 137 | metric = metrics[[type]] 138 | 139 | ## prepare data 140 | # make cross validation folds 141 | if (cv_folds > 1) { 142 | membership = 1 + seq_len(nrow(df)) %% cv_folds # create vector with equal number of group ids 143 | random_membership = sample(membership) # shuffle the group ids 144 | # split the training and test sets 145 | folds = lapply(seq_len(cv_folds), function(test_fold){ 146 | inTest = random_membership == test_fold 147 | return(list( 148 | fold = test_fold, 149 | test = df[inTest, ], 150 | train = df[!inTest, ]) 151 | ) 152 | }) 153 | } else if (cv_folds == 1) { 154 | folds = list( 155 | list( 156 | fold = 1, 157 | test = df, 158 | train = df 159 | ) 160 | ) 161 | } 162 | 163 | ## evaluate model in each cross validation 164 | ## TODO simplify this into a get_scores function 165 | # in the Python implementation there is only one baseline calculated 166 | # yet I feel that it would be better to assess each fold's predictive performance 167 | # on a baseline that also resembles the naive performance on that same test set 168 | scores = lapply(folds, FUN = function(e) { 169 | model_score = score_model(train = e[['train']], 170 | test = e[['test']], 171 | model = model, 172 | x, 173 | y, 174 | metric = metric) 175 | baseline_score = score_naive(train = e[['train']], 176 | test = e[['test']], 177 | x, 178 | y, 179 | type = type, 180 | metric = metric) 181 | normalized_score = normalize_score(baseline_score = baseline_score, 182 | model_score = model_score, 183 | type = type) 184 | return(list( 185 | model_score = model_score, 186 | baseline_score = baseline_score, 187 | normalized_score = normalized_score 188 | )) 189 | }) 190 | 191 | report = list( 192 | x = x, 193 | y = y, 194 | result_type = 'predictive power score', 195 | pps = mean(vapply(scores, function(x) x$normalized_score, numeric(1))), 196 | metric = metric, 197 |
baseline_score = mean(vapply(scores, function(x) x$baseline_score, numeric(1))), 198 | model_score = mean(vapply(scores, function(x) x$model_score, numeric(1))), 199 | cv_folds = cv_folds, 200 | seed = seed, 201 | algorithm = algorithm, 202 | #TODO: Find out how to store model in the resulting report 203 | #model = model, #Error: All columns in a tibble must be vectors. 204 | model_type = type 205 | ) 206 | 207 | return(report) 208 | } 209 | 210 | 211 | #' Calculate predictive power scores for y 212 | #' Calculates the predictive power scores for the specified \code{y} variable 213 | #' using every column in the dataset as \code{x}, including itself. 214 | #' 215 | #' @inheritParams score 216 | #' @param ... any arguments passed to \code{\link{score}} 217 | #' @param do_parallel bool, whether to perform \code{\link{score}} calls in parallel 218 | #' @param n_cores numeric, number of cores to use, defaults to maximum minus 1 219 | #' 220 | #' @return a data.frame containing \describe{ 221 | #' \item{x}{the name of the predictor variable} 222 | #' \item{y}{the name of the target variable} 223 | #' \item{result_type}{text showing how to interpret the resulting score} 224 | #' \item{pps}{the predictive power score} 225 | #' \item{metric}{the evaluation metric used to compute the PPS} 226 | #' \item{baseline_score}{the score of a naive model on the evaluation metric} 227 | #' \item{model_score}{the score of the predictive model on the evaluation metric} 228 | #' \item{cv_folds}{how many cross-validation folds were used} 229 | #' \item{seed}{the seed that was set} 230 | #' \item{algorithm}{text showing what algorithm was used} 231 | #' \item{model_type}{text showing whether classification or regression was used} 232 | #' } 233 | #' 234 | #' @export 235 | #' 236 | #' @examples 237 | #' \donttest{score_predictors(df = iris, y = 'Species')} 238 | #' \donttest{score_predictors(df = mtcars, y = 'mpg', do_parallel = TRUE, n_cores = 2)} 239 | score_predictors = function(df, y,
..., do_parallel = FALSE, n_cores = -1) { 240 | temp_score = function(x) { 241 | score(df, x = x, y = y, ...) 242 | } 243 | if (do_parallel) { 244 | if (n_cores == -1) { 245 | n_cores = parallel::detectCores() - 1 246 | } 247 | cl = parallel::makeCluster(n_cores) 248 | parallel::clusterEvalQ(cl, {library(ppsr)}) 249 | scores = parallel::clusterApply(cl, colnames(df), temp_score) 250 | parallel::stopCluster(cl) 251 | } else { 252 | scores = lapply(colnames(df), temp_score) 253 | } 254 | scores = fill_blanks_in_list(scores) 255 | scores_df = do.call(rbind.data.frame, scores) 256 | rownames(scores_df) = NULL 257 | return(scores_df) 258 | } 259 | 260 | 261 | #' Calculate predictive power scores for whole dataframe 262 | #' Iterates through the columns of the dataframe, calculating the predictive power 263 | #' score for every possible combination of \code{x} and \code{y}. 264 | #' 265 | #' @inheritParams score 266 | #' @inheritParams score_predictors 267 | #' 268 | #' @return a data.frame containing \describe{ 269 | #' \item{x}{the name of the predictor variable} 270 | #' \item{y}{the name of the target variable} 271 | #' \item{result_type}{text showing how to interpret the resulting score} 272 | #' \item{pps}{the predictive power score} 273 | #' \item{metric}{the evaluation metric used to compute the PPS} 274 | #' \item{baseline_score}{the score of a naive model on the evaluation metric} 275 | #' \item{model_score}{the score of the predictive model on the evaluation metric} 276 | #' \item{cv_folds}{how many cross-validation folds were used} 277 | #' \item{seed}{the seed that was set} 278 | #' \item{algorithm}{text showing what algorithm was used} 279 | #' \item{model_type}{text showing whether classification or regression was used} 280 | #' } 281 | #' @export 282 | #' 283 | #' @examples 284 | #' \donttest{score_df(iris)} 285 | #' \donttest{score_df(mtcars, do_parallel = TRUE, n_cores = 2)} 286 | score_df = function(df, ..., do_parallel = FALSE, n_cores = -1) { 287 |
cnames = colnames(df) 288 | param_grid = expand.grid(x = cnames, y = cnames, stringsAsFactors = FALSE) 289 | temp_score = function(i) { 290 | score(df, x = param_grid[['x']][i], y = param_grid[['y']][i], ...) 291 | } 292 | if (do_parallel) { 293 | if (n_cores == -1) { 294 | n_cores = parallel::detectCores() - 1 295 | } 296 | cl = parallel::makeCluster(n_cores) 297 | parallel::clusterEvalQ(cl, {library(ppsr)}) 298 | scores = parallel::clusterApply(cl, seq_len(nrow(param_grid)), temp_score) 299 | parallel::stopCluster(cl) 300 | } else { 301 | scores = lapply(seq_len(nrow(param_grid)), temp_score) 302 | } 303 | scores = fill_blanks_in_list(scores) 304 | df_scores = do.call(rbind.data.frame, scores) 305 | rownames(df_scores) = NULL 306 | return(df_scores) 307 | } 308 | 309 | 310 | #' Calculate predictive power score matrix 311 | #' Iterates through the columns of the dataset, calculating the predictive power 312 | #' score for every possible combination of \code{x} and \code{y}. 313 | #' 314 | #' Note that the targets are on the rows, and the features on the columns. 315 | #' 316 | #' @inheritParams score 317 | #' @param ... any arguments passed to \code{\link{score_df}}, 318 | #' some of which will be passed on to \code{\link{score}} 319 | #' 320 | #' @return a matrix of numeric values, representing predictive power scores 321 | #' @export 322 | #' 323 | #' @examples 324 | #' \donttest{score_matrix(iris)} 325 | #' \donttest{score_matrix(mtcars, do_parallel = TRUE, n_cores=2)} 326 | score_matrix = function(df, ...) { 327 | df_scores = score_df(df, ...) 
328 | var_uq = unique(df_scores[['x']]) 329 | mtrx = matrix(nrow = length(var_uq), ncol = length(var_uq), dimnames = list(var_uq, var_uq)) 330 | for (x in var_uq) { 331 | for (y in var_uq) { 332 | # Note: target on the y axis (rows) and feature on the x axis (columns) 333 | mtrx[y, x] = df_scores[['pps']][df_scores[['x']] == x & df_scores[['y']] == y] 334 | } 335 | } 336 | return(mtrx) 337 | } 338 | 339 | 340 | 341 | #' Calculate correlation coefficients for whole dataframe 342 | #' 343 | #' @param df data.frame containing columns for x and y 344 | #' @param ... arguments to pass to \code{stats::cor()} 345 | #' 346 | #' @return a data.frame with x-y correlation coefficients 347 | #' @export 348 | #' 349 | #' @examples 350 | #' score_correlations(iris) 351 | score_correlations = function(df, ...) { 352 | isCorrelationColumn = vapply(df, function(x) is.numeric(x) | is.logical(x), logical(1)) 353 | cnames = names(df)[isCorrelationColumn] 354 | cmat = stats::cor(df[, cnames], ...) 355 | correlation = as.vector(cmat) 356 | y = rep(rownames(cmat), each = ncol(cmat)) 357 | x = rep(colnames(cmat), times = nrow(cmat)) 358 | return(data.frame(x, y, correlation)) 359 | } 360 | 361 | 362 | 363 | 364 | 365 | generate_invalid_report = function(x, y, result_type, pps) { 366 | return(list( 367 | x = x, 368 | y = y, 369 | result_type = result_type, 370 | pps = pps 371 | )) 372 | } 373 | 374 | #' Calculates out-of-sample model performance of a statistical model 375 | #' 376 | #' @param train df, training data, containing variable y 377 | #' @param test df, test data, containing variable y 378 | #' @param model parsnip model object, with mode preset 379 | #' @param x character, column name of predictor variable 380 | #' @param y character, column name of target variable 381 | #' @param metric character, name of evaluation metric being used, see \code{available_evaluation_metrics()} 382 | #' 383 | #' @return numeric vector of length one, evaluation score for predictions using the statistical
model 384 | score_model = function(train, test, model, x, y, metric) { 385 | model = parsnip::fit(model, 386 | formula = stats::as.formula(paste(y, '~', x)), 387 | data = train) 388 | yhat = stats::predict(model, new_data = test)[[1]] 389 | eval_fun = available_evaluation_metrics()[[metric]] 390 | return(eval_fun(y = test[[y]], yhat = yhat)) 391 | } 392 | 393 | #' Calculate out-of-sample model performance of naive baseline model 394 | #' The calculation that's being performed depends on the type of model 395 | #' For regression models, the mean is used as prediction 396 | #' For classification, a model predicting random values and 397 | #' a model predicting modal values are used and 398 | #' the best model is taken as baseline score 399 | #' 400 | #' @param train df, training data, containing variable y 401 | #' @param test df, test data, containing variable y 402 | #' @param x character, column name of predictor variable 403 | #' @param y character, column name of target variable 404 | #' @param type character, type of model 405 | #' @param metric character, evaluation metric being used 406 | #' 407 | #' @return numeric vector of length one, evaluation score for predictions using naive model 408 | score_naive = function(train, test, x, y, type, metric) { 409 | eval_fun = available_evaluation_metrics()[[metric]] 410 | if (type == 'regression') { 411 | # naive regression model takes the mean value of the target variable 412 | naive_predictions = rep(mean(train[[y]]), nrow(test)) 413 | return(eval_fun(y = test[[y]], yhat = naive_predictions)) 414 | } else { 415 | # naive classification model takes the best model 416 | # among a model that predicts the most common case 417 | # and a model that predicts random values 418 | naive_predictions_modal = rep(modal_value(train[[y]]), times = nrow(test)) 419 | # ensure that the random predictions are always the same 420 | naive_predictions_random = sample(train[[y]], size = nrow(test), replace = TRUE) 421 | 
return(max(c(eval_fun(y = test[[y]], yhat = naive_predictions_modal), 422 | eval_fun(y = test[[y]], yhat = naive_predictions_random)))) 423 | } 424 | } 425 | 426 | #' Normalizes the original score compared to a naive baseline score 427 | #' The calculation that's being performed depends on the type of model 428 | #' 429 | #' @param baseline_score float, the evaluation metric score for a naive baseline (model) 430 | #' @param model_score float, the evaluation metric score for a statistical model 431 | #' @param type character, type of model 432 | #' 433 | #' @return numeric vector of length one, normalized score 434 | normalize_score = function(baseline_score, model_score, type) { 435 | if (type == 'regression') { 436 | # normalize the pps by taking the relative improvement over a naive model 437 | # or 0 in case of worse performance than a naive model 438 | return(max(c(1 - (model_score / baseline_score), 0))) 439 | } else { 440 | # normalize the pps by taking the relative improvement over a naive model 441 | # or 0 in case of worse performance than a naive model 442 | return(max(c((model_score - baseline_score) / (1 - baseline_score), 0))) 443 | } 444 | } 445 | 446 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # `ppsr` - Predictive Power Score 5 | 6 | 7 | 8 | [![R-CMD-check](https://github.com/paulvanderlaken/ppsr/workflows/R-CMD-check/badge.svg)](https://github.com/paulvanderlaken/ppsr/actions) 9 | [![CRAN 10 | status](https://www.r-pkg.org/badges/version/ppsr)](https://cran.r-project.org/package=ppsr) 11 | [![DOI](https://zenodo.org/badge/324570501.svg)](https://zenodo.org/badge/latestdoi/324570501) 12 | [![CRAN\_Downloads\_Total](http://cranlogs.r-pkg.org/badges/grand-total/ppsr)](https://cran.r-project.org/package=ppsr) 13 | 14 | 15 | 16 | `ppsr` is the R implementation of the **Predictive Power Score** (PPS). 
17 | 18 | The PPS is an asymmetric, data-type-agnostic score that can detect 19 | linear or non-linear relationships between two variables. The score 20 | ranges from 0 (no predictive power) to 1 (perfect predictive power). 21 | 22 | The general concept of PPS is useful for data exploration purposes, in 23 | the same way correlation analysis is. You can read more about the 24 | (dis)advantages of using PPS in [this blog 25 | post](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598). 26 | 27 | ## Installation 28 | 29 | You can install the latest stable version of `ppsr` from CRAN: 30 | 31 | ``` r 32 | install.packages('ppsr') 33 | ``` 34 | 35 | Not all recent features and bugfixes may be included in the CRAN 36 | release. 37 | 38 | Instead, you might want to download the most recent development 39 | version of `ppsr` from GitHub: 40 | 41 | ``` r 42 | # install.packages('devtools') # Install devtools if needed 43 | devtools::install_github('https://github.com/paulvanderlaken/ppsr') 44 | ``` 45 | 46 | ## Computing PPS 47 | 48 | **PPS** represents a **framework for evaluating predictive validity**. 49 | 50 | There is *not one single way* of computing a predictive power score, but 51 | rather there are *many different ways*. 52 | 53 | You can select different machine learning algorithms, their associated 54 | parameters, cross-validation schemes, and/or model evaluation metrics. 55 | Each of these design decisions will affect your model’s predictive 56 | performance and, in turn, affect the resulting predictive power score 57 | you compute. 58 | 59 | Hence, you can compute many different PPSs for any given predictor and 60 | target variable.
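You can inspect which alternative settings are supported via the package’s helper functions (a quick sketch; the exact set of options may differ per `ppsr` version):

``` r
# List the names of the supported algorithms and evaluation metrics
names(ppsr::available_algorithms())
names(ppsr::available_evaluation_metrics())
```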
61 | 62 | For example, the PPS computed with a *decision tree* regression model… 63 | 64 | ``` r 65 | ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length', algorithm = 'tree')[['pps']] 66 | #> [1] 0.6160836 67 | ``` 68 | 69 | …will differ from the PPS computed with a *simple linear regression* 70 | model. 71 | 72 | ``` r 73 | ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length', algorithm = 'glm')[['pps']] 74 | #> [1] 0.5441131 75 | ``` 76 | 77 | ## Usage 78 | 79 | The `ppsr` package has four main functions to compute PPS: 80 | 81 | - `score()` computes an x-y PPS 82 | - `score_predictors()` computes all X-y PPS 83 | - `score_df()` computes all X-Y PPS 84 | - `score_matrix()` computes all X-Y PPS, and shows them in a matrix 85 | 86 | where `x` and `y` represent an individual predictor/target, and `X` and 87 | `Y` represent all predictors/targets in a given dataset. 88 | 89 | ### Examples: 90 | 91 | `score()` computes the PPS for a single target and predictor 92 | 93 | ``` r 94 | ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length') 95 | #> $x 96 | #> [1] "Sepal.Length" 97 | #> 98 | #> $y 99 | #> [1] "Petal.Length" 100 | #> 101 | #> $result_type 102 | #> [1] "predictive power score" 103 | #> 104 | #> $pps 105 | #> [1] 0.6160836 106 | #> 107 | #> $metric 108 | #> [1] "MAE" 109 | #> 110 | #> $baseline_score 111 | #> [1] 1.571967 112 | #> 113 | #> $model_score 114 | #> [1] 0.5971445 115 | #> 116 | #> $cv_folds 117 | #> [1] 5 118 | #> 119 | #> $seed 120 | #> [1] 1 121 | #> 122 | #> $algorithm 123 | #> [1] "tree" 124 | #> 125 | #> $model_type 126 | #> [1] "regression" 127 | ``` 128 | 129 | `score_predictors()` computes all PPSs for a single target using all 130 | predictors in a dataframe 131 | 132 | ``` r 133 | ppsr::score_predictors(df = iris, y = 'Species') 134 | #> x y result_type pps metric 135 | #> 1 Sepal.Length Species predictive power score 0.5591864 F1_weighted 136 | #> 2 Sepal.Width Species predictive power score 0.3134401 F1_weighted 137 | #> 3 
Petal.Length Species predictive power score 0.9167580 F1_weighted 138 | #> 4 Petal.Width Species predictive power score 0.9398532 F1_weighted 139 | #> 5 Species Species predictor and target are the same 1.0000000 140 | #> baseline_score model_score cv_folds seed algorithm model_type 141 | #> 1 0.3176487 0.7028029 5 1 tree classification 142 | #> 2 0.3176487 0.5377587 5 1 tree classification 143 | #> 3 0.3176487 0.9404972 5 1 tree classification 144 | #> 4 0.3176487 0.9599148 5 1 tree classification 145 | #> 5 NA NA NA NA 146 | ``` 147 | 148 | `score_df()` computes all PPSs for every target-predictor combination in 149 | a dataframe 150 | 151 | ``` r 152 | ppsr::score_df(df = iris) 153 | #> x y result_type pps 154 | #> 1 Sepal.Length Sepal.Length predictor and target are the same 1.00000000 155 | #> 2 Sepal.Width Sepal.Length predictive power score 0.04632352 156 | #> 3 Petal.Length Sepal.Length predictive power score 0.54913985 157 | #> 4 Petal.Width Sepal.Length predictive power score 0.41276679 158 | #> 5 Species Sepal.Length predictive power score 0.40754872 159 | #> 6 Sepal.Length Sepal.Width predictive power score 0.06790301 160 | #> 7 Sepal.Width Sepal.Width predictor and target are the same 1.00000000 161 | #> 8 Petal.Length Sepal.Width predictive power score 0.23769911 162 | #> 9 Petal.Width Sepal.Width predictive power score 0.21746588 163 | #> 10 Species Sepal.Width predictive power score 0.20128762 164 | #> 11 Sepal.Length Petal.Length predictive power score 0.61608360 165 | #> 12 Sepal.Width Petal.Length predictive power score 0.24263851 166 | #> 13 Petal.Length Petal.Length predictor and target are the same 1.00000000 167 | #> 14 Petal.Width Petal.Length predictive power score 0.79175121 168 | #> 15 Species Petal.Length predictive power score 0.79049070 169 | #> 16 Sepal.Length Petal.Width predictive power score 0.48735314 170 | #> 17 Sepal.Width Petal.Width predictive power score 0.20124105 171 | #> 18 Petal.Length Petal.Width predictive power score 
0.74378445 172 | #> 19 Petal.Width Petal.Width predictor and target are the same 1.00000000 173 | #> 20 Species Petal.Width predictive power score 0.75611126 174 | #> 21 Sepal.Length Species predictive power score 0.55918638 175 | #> 22 Sepal.Width Species predictive power score 0.31344008 176 | #> 23 Petal.Length Species predictive power score 0.91675800 177 | #> 24 Petal.Width Species predictive power score 0.93985320 178 | #> 25 Species Species predictor and target are the same 1.00000000 179 | #> metric baseline_score model_score cv_folds seed algorithm 180 | #> 1 NA NA NA NA 181 | #> 2 MAE 0.6893222 0.6620058 5 1 tree 182 | #> 3 MAE 0.6893222 0.3100867 5 1 tree 183 | #> 4 MAE 0.6893222 0.4040123 5 1 tree 184 | #> 5 MAE 0.6893222 0.4076661 5 1 tree 185 | #> 6 MAE 0.3372222 0.3184796 5 1 tree 186 | #> 7 NA NA NA NA 187 | #> 8 MAE 0.3372222 0.2564258 5 1 tree 188 | #> 9 MAE 0.3372222 0.2631636 5 1 tree 189 | #> 10 MAE 0.3372222 0.2677963 5 1 tree 190 | #> 11 MAE 1.5719667 0.5971445 5 1 tree 191 | #> 12 MAE 1.5719667 1.1945031 5 1 tree 192 | #> 13 NA NA NA NA 193 | #> 14 MAE 1.5719667 0.3265152 5 1 tree 194 | #> 15 MAE 1.5719667 0.3280552 5 1 tree 195 | #> 16 MAE 0.6623556 0.3377682 5 1 tree 196 | #> 17 MAE 0.6623556 0.5315834 5 1 tree 197 | #> 18 MAE 0.6623556 0.1684906 5 1 tree 198 | #> 19 NA NA NA NA 199 | #> 20 MAE 0.6623556 0.1608119 5 1 tree 200 | #> 21 F1_weighted 0.3176487 0.7028029 5 1 tree 201 | #> 22 F1_weighted 0.3176487 0.5377587 5 1 tree 202 | #> 23 F1_weighted 0.3176487 0.9404972 5 1 tree 203 | #> 24 F1_weighted 0.3176487 0.9599148 5 1 tree 204 | #> 25 NA NA NA NA 205 | #> model_type 206 | #> 1 207 | #> 2 regression 208 | #> 3 regression 209 | #> 4 regression 210 | #> 5 regression 211 | #> 6 regression 212 | #> 7 213 | #> 8 regression 214 | #> 9 regression 215 | #> 10 regression 216 | #> 11 regression 217 | #> 12 regression 218 | #> 13 219 | #> 14 regression 220 | #> 15 regression 221 | #> 16 regression 222 | #> 17 regression 223 | #> 18 regression 
224 | #> 19 225 | #> 20 regression 226 | #> 21 classification 227 | #> 22 classification 228 | #> 23 classification 229 | #> 24 classification 230 | #> 25 231 | ``` 232 | 233 | `score_matrix()` computes all PPSs for every target-predictor combination in 234 | a dataframe, but returns only the scores arranged in a neat matrix, like 235 | the familiar correlation matrix. 236 | 237 | ``` r 238 | ppsr::score_matrix(df = iris) 239 | #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species 240 | #> Sepal.Length 1.00000000 0.04632352 0.5491398 0.4127668 0.4075487 241 | #> Sepal.Width 0.06790301 1.00000000 0.2376991 0.2174659 0.2012876 242 | #> Petal.Length 0.61608360 0.24263851 1.0000000 0.7917512 0.7904907 243 | #> Petal.Width 0.48735314 0.20124105 0.7437845 1.0000000 0.7561113 244 | #> Species 0.55918638 0.31344008 0.9167580 0.9398532 1.0000000 245 | ``` 246 | 247 | Currently, the `ppsr` package computes PPS by default using… 248 | 249 | - the default decision tree implementation of the `rpart` package, 250 | wrapped by `parsnip` 251 | - *weighted F1* scores to evaluate classification models, and *MAE* to 252 | evaluate regression models 253 | - 5 cross-validation folds 254 | 255 | You can call the `available_algorithms()` and 256 | `available_evaluation_metrics()` functions to see what alternative 257 | settings are supported. 258 | 259 | Note that the calculated PPS reflects the **out-of-sample** predictive 260 | validity when more than a single cross-validation fold is used. If you prefer 261 | to look at in-sample scores, you can set `cv_folds = 1`. Note that in 262 | such cases overfitting can become an issue, particularly with the more 263 | flexible algorithms.
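For instance, an in-sample score can be requested like this (a sketch; with `cv_folds = 1` the model is trained and evaluated on the same data, so expect a higher score than the cross-validated default):

``` r
# In-sample PPS: no train-test split, so prone to overfitting
ppsr::score(iris, x = 'Sepal.Length', y = 'Petal.Length', cv_folds = 1)[['pps']]
```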
264 | 265 | ## Visualizing PPS 266 | 267 | Subsequently, there are three main functions that wrap around these 268 | computational functions to help you visualize your PPS using `ggplot2`: 269 | 270 | - `visualize_pps()` produces a barplot of all X-y PPS, or a heatmap of 271 | all X-Y PPS 272 | - `visualize_correlations()` produces a heatmap of all X-Y 273 | correlations 274 | - `visualize_both()` produces the two heatmaps of all X-Y PPS and 275 | correlations side-by-side 276 | 277 | ### Examples: 278 | 279 | If you specify a target variable (`y`) in `visualize_pps()`, you get a 280 | barplot of its predictors. 281 | 282 | ``` r 283 | ppsr::visualize_pps(df = iris, y = 'Species') 284 | ``` 285 | 286 | ![](man/figures/README-PPS-barplot-1.png) 287 | 288 | If you do not specify a target variable in `visualize_pps()`, you get 289 | the PPS matrix visualized as a heatmap. 290 | 291 | ``` r 292 | ppsr::visualize_pps(df = iris) 293 | ``` 294 | 295 | ![](man/figures/README-PPS-heatmap-1.png) 296 | 297 | Some users might find it useful to look at a correlation matrix for 298 | comparison. 299 | 300 | ``` r 301 | ppsr::visualize_correlations(df = iris) 302 | ``` 303 | 304 | ![](man/figures/README-correlation-heatmap-1.png) 305 | 306 | With `visualize_both()` you generate the PPS and correlation matrices 307 | side-by-side, for easy comparison. 308 | 309 | ``` r 310 | ppsr::visualize_both(df = iris) 311 | ``` 312 | 313 | ![](man/figures/README-sbs-heatmap-1.png) 314 | 315 | You can change the colors of the visualizations using the functions’ 316 | arguments. There are also arguments to change the color of the text 317 | scores. 318 | 319 | Furthermore, the functions return `ggplot2` objects, so that you can 320 | easily change the theme and other settings.
321 | 322 | ``` r 323 | ppsr::visualize_pps(df = iris, 324 | color_value_high = 'red', 325 | color_value_low = 'yellow', 326 | color_text = 'black') + 327 | ggplot2::theme_classic() + 328 | ggplot2::theme(plot.background = ggplot2::element_rect(fill = "lightgrey")) + 329 | ggplot2::theme(title = ggplot2::element_text(size = 15)) + 330 | ggplot2::labs(title = 'Add your own title', 331 | subtitle = 'Maybe an informative subtitle', 332 | caption = 'Did you know ggplot2 includes captions?', 333 | x = 'You could call this\nthe independent variable\nas well') 334 | ``` 335 | 336 | ![](man/figures/README-custom-plot-1.png) 337 | 338 | ## Parallelization 339 | 340 | The number of predictive models that one needs to build in order to fill 341 | the PPS matrix belonging to a dataframe grows quadratically with 342 | the number of columns in that dataframe. 343 | 344 | For traditional correlation analyses, this is not a problem. Yet, with 345 | more computation-intensive algorithms, with many train-test splits, and 346 | with large or high-dimensional datasets, it can take a decent amount of 347 | time to build all the predictive models and derive their PPSs. 348 | 349 | One way to speed matters up is to use the `ppsr::score_predictors()` 350 | function and focus on predicting only the target/dependent variable you 351 | are most interested in. 352 | 353 | Yet, since version `0.0.1`, all `ppsr::score_*` and `ppsr::visualize_*` 354 | functions now take in two arguments that facilitate parallel computing. 355 | You can parallelize `ppsr`’s computations by setting the `do_parallel` 356 | argument to `TRUE`. If you do so, a cluster will be created using the 357 | `parallel` package. By default, this cluster will use the maximum number 358 | of cores (see `parallel::detectCores()`) minus 1. 359 | 360 | However, with the second argument – `n_cores` – you can manually specify 361 | the number of cores you want `ppsr` to use.
362 | 363 | Examples: 364 | 365 | ``` r 366 | ppsr::score_df(df = mtcars, do_parallel = TRUE) 367 | ``` 368 | 369 | ``` r 370 | ppsr::visualize_pps(df = iris, do_parallel = TRUE, n_cores = 2) 371 | ``` 372 | 373 | ## Interpreting PPS 374 | 375 | The PPS is a **normalized score** that ranges from 0 (no predictive 376 | power) to 1 (perfect predictive power). 377 | 378 | The normalization occurs by comparing how well we are able to predict 379 | the values of a *target* variable (`y`) using the values of a 380 | *predictor* variable (`x`), relative to two **benchmarks**: a perfect 381 | prediction, and a naive prediction. 382 | 383 | The **perfect prediction** can be theoretically derived. A perfect 384 | regression model produces no error (=0.0), whereas a perfect 385 | classification model results in 100% accuracy, recall, et cetera (=1.0). 386 | 387 | The **naive prediction** is derived empirically. A naive *regression* 388 | model is simulated by predicting the mean `y` value for all 389 | observations. This is similar to how R-squared is calculated. A naive 390 | *classification* model is simulated by taking the better of two models: 391 | one predicting the modal `y` class, and one predicting random `y` 392 | classes for all observations. 393 | 394 | Whenever we train an “informed” model to predict `y` using `x`, we can 395 | assess how well it performs by comparing it to these two benchmarks. 396 | 397 | Suppose we train a regression model, and its mean absolute error (MAE) is 398 | 0.10. Suppose the naive model resulted in an MAE of 0.40. We know the 399 | perfect model would produce no error, which means an MAE of 0.0. 400 | 401 | With these three scores, we can normalize the performance of our 402 | informed regression model by interpolating its score between the perfect 403 | and the naive benchmarks. In this case, our model’s performance lies 404 | about 1/4th of the way from the perfect model, and 405 | 3/4ths of the way from the naive model.
In other words, our 406 | model’s predictive power score is 75%: it produced 75% less error than 407 | the naive baseline, and was only 25% short of perfect predictions. 408 | 409 | Using such normalized scores for model performance allows us to easily 410 | interpret how much better our models are as compared to a naive 411 | baseline. Moreover, such normalized scores allow us to compare and 412 | contrast different modeling approaches, in terms of the algorithms, the 413 | target’s data type, the evaluation metrics, and any other settings used. 414 | 415 | ## Considerations 416 | 417 | The main use of PPS is as a tool for data exploration. It trains 418 | out-of-the-box machine learning models to assess the predictive 419 | relations in your dataset. 420 | 421 | However, this PPS is quite a “quick and dirty” approach. The trained 422 | models are not at all tailored to your specific 423 | regression/classification problem. For example, it could be that you get 424 | many PPSs of 0 with the default settings. A known issue is that the 425 | default decision tree often does not find valuable splits and reverts to 426 | predicting the mean `y` value found at its root. Here, it could help to 427 | try calculating PPS with different settings (e.g., `algorithm = 'glm'`). 428 | 429 | At other times, predictive relationships may rely on a combination of 430 | variables (i.e. interaction/moderation). These are not captured by the 431 | PPS calculations, which consider only univariate relations. PPS is 432 | simply not suited for capturing such complexities. 
In these cases, it 433 | might be more interesting to train models on all your features 434 | simultaneously and turn to concepts like [feature/variable 435 | importance](https://topepo.github.io/caret/variable-importance.html), 436 | [partial 437 | dependency](https://christophm.github.io/interpretable-ml-book/pdp.html), 438 | [conditional 439 | expectations](https://christophm.github.io/interpretable-ml-book/ice.html), 440 | [accumulated local 441 | effects](https://christophm.github.io/interpretable-ml-book/ale.html), 442 | and others. 443 | 444 | In general, the PPS should not be considered more than a fast and easy 445 | tool for finding starting points for further, in-depth analysis. Keep in 446 | mind that you can build much better predictive models than the default 447 | PPS functions if you tailor your modeling efforts to your specific data 448 | context. 449 | 450 | ## Open issues & development 451 | 452 | PPS is a relatively young concept, and likewise the `ppsr` package is 453 | still under development. If you spot any bugs or potential improvements, 454 | please raise an issue or submit a pull request. 455 | 456 | Currently on the development agenda are: 457 | 458 | - Support for different modeling techniques/algorithms 459 | - Support for generalized linear models for multinomial classification 460 | - Passing/setting of parameters for models 461 | - Different model evaluation metrics 462 | - Support for user-defined model evaluation metrics 463 | - Downsampling for large datasets 464 | 465 | ## Attribution 466 | 467 | This R package was inspired by 8080labs’ Python package 468 | [`ppscore`](https://github.com/8080labs/ppscore). 469 | 470 | The same 8080labs also developed an earlier, unfinished [R 471 | implementation of PPS](https://github.com/8080labs/ppscoreR). 472 | 473 | Read more about the big ideas behind PPS in [this blog 474 | post](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598).
475 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | GNU General Public License 2 | ========================== 3 | 4 | _Version 3, 29 June 2007_ 5 | _Copyright © 2007 Free Software Foundation, Inc. <>_ 6 | 7 | Everyone is permitted to copy and distribute verbatim copies of this license 8 | document, but changing it is not allowed. 9 | 10 | ## Preamble 11 | 12 | The GNU General Public License is a free, copyleft license for software and other 13 | kinds of works. 14 | 15 | The licenses for most software and other practical works are designed to take away 16 | your freedom to share and change the works. By contrast, the GNU General Public 17 | License is intended to guarantee your freedom to share and change all versions of a 18 | program--to make sure it remains free software for all its users. We, the Free 19 | Software Foundation, use the GNU General Public License for most of our software; it 20 | applies also to any other work released this way by its authors. You can apply it to 21 | your programs, too. 22 | 23 | When we speak of free software, we are referring to freedom, not price. Our General 24 | Public Licenses are designed to make sure that you have the freedom to distribute 25 | copies of free software (and charge for them if you wish), that you receive source 26 | code or can get it if you want it, that you can change the software or use pieces of 27 | it in new free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you these rights or 30 | asking you to surrender the rights. Therefore, you have certain responsibilities if 31 | you distribute copies of the software, or if you modify it: responsibilities to 32 | respect the freedom of others. 
33 | 34 | For example, if you distribute copies of such a program, whether gratis or for a fee, 35 | you must pass on to the recipients the same freedoms that you received. You must make 36 | sure that they, too, receive or can get the source code. And you must show them these 37 | terms so they know their rights. 38 | 39 | Developers that use the GNU GPL protect your rights with two steps: **(1)** assert 40 | copyright on the software, and **(2)** offer you this License giving you legal permission 41 | to copy, distribute and/or modify it. 42 | 43 | For the developers' and authors' protection, the GPL clearly explains that there is 44 | no warranty for this free software. For both users' and authors' sake, the GPL 45 | requires that modified versions be marked as changed, so that their problems will not 46 | be attributed erroneously to authors of previous versions. 47 | 48 | Some devices are designed to deny users access to install or run modified versions of 49 | the software inside them, although the manufacturer can do so. This is fundamentally 50 | incompatible with the aim of protecting users' freedom to change the software. The 51 | systematic pattern of such abuse occurs in the area of products for individuals to 52 | use, which is precisely where it is most unacceptable. Therefore, we have designed 53 | this version of the GPL to prohibit the practice for those products. If such problems 54 | arise substantially in other domains, we stand ready to extend this provision to 55 | those domains in future versions of the GPL, as needed to protect the freedom of 56 | users. 57 | 58 | Finally, every program is threatened constantly by software patents. States should 59 | not allow patents to restrict development and use of software on general-purpose 60 | computers, but in those that do, we wish to avoid the special danger that patents 61 | applied to a free program could make it effectively proprietary. 
To prevent this, the 62 | GPL assures that patents cannot be used to render the program non-free. 63 | 64 | The precise terms and conditions for copying, distribution and modification follow. 65 | 66 | ## TERMS AND CONDITIONS 67 | 68 | ### 0. Definitions 69 | 70 | “This License” refers to version 3 of the GNU General Public License. 71 | 72 | “Copyright” also means copyright-like laws that apply to other kinds of 73 | works, such as semiconductor masks. 74 | 75 | “The Program” refers to any copyrightable work licensed under this 76 | License. Each licensee is addressed as “you”. “Licensees” and 77 | “recipients” may be individuals or organizations. 78 | 79 | To “modify” a work means to copy from or adapt all or part of the work in 80 | a fashion requiring copyright permission, other than the making of an exact copy. The 81 | resulting work is called a “modified version” of the earlier work or a 82 | work “based on” the earlier work. 83 | 84 | A “covered work” means either the unmodified Program or a work based on 85 | the Program. 86 | 87 | To “propagate” a work means to do anything with it that, without 88 | permission, would make you directly or secondarily liable for infringement under 89 | applicable copyright law, except executing it on a computer or modifying a private 90 | copy. Propagation includes copying, distribution (with or without modification), 91 | making available to the public, and in some countries other activities as well. 92 | 93 | To “convey” a work means any kind of propagation that enables other 94 | parties to make or receive copies. Mere interaction with a user through a computer 95 | network, with no transfer of a copy, is not conveying. 
96 | 97 | An interactive user interface displays “Appropriate Legal Notices” to the 98 | extent that it includes a convenient and prominently visible feature that **(1)** 99 | displays an appropriate copyright notice, and **(2)** tells the user that there is no 100 | warranty for the work (except to the extent that warranties are provided), that 101 | licensees may convey the work under this License, and how to view a copy of this 102 | License. If the interface presents a list of user commands or options, such as a 103 | menu, a prominent item in the list meets this criterion. 104 | 105 | ### 1. Source Code 106 | 107 | The “source code” for a work means the preferred form of the work for 108 | making modifications to it. “Object code” means any non-source form of a 109 | work. 110 | 111 | A “Standard Interface” means an interface that either is an official 112 | standard defined by a recognized standards body, or, in the case of interfaces 113 | specified for a particular programming language, one that is widely used among 114 | developers working in that language. 115 | 116 | The “System Libraries” of an executable work include anything, other than 117 | the work as a whole, that **(a)** is included in the normal form of packaging a Major 118 | Component, but which is not part of that Major Component, and **(b)** serves only to 119 | enable use of the work with that Major Component, or to implement a Standard 120 | Interface for which an implementation is available to the public in source code form. 121 | A “Major Component”, in this context, means a major essential component 122 | (kernel, window system, and so on) of the specific operating system (if any) on which 123 | the executable work runs, or a compiler used to produce the work, or an object code 124 | interpreter used to run it. 
125 | 126 | The “Corresponding Source” for a work in object code form means all the 127 | source code needed to generate, install, and (for an executable work) run the object 128 | code and to modify the work, including scripts to control those activities. However, 129 | it does not include the work's System Libraries, or general-purpose tools or 130 | generally available free programs which are used unmodified in performing those 131 | activities but which are not part of the work. For example, Corresponding Source 132 | includes interface definition files associated with source files for the work, and 133 | the source code for shared libraries and dynamically linked subprograms that the work 134 | is specifically designed to require, such as by intimate data communication or 135 | control flow between those subprograms and other parts of the work. 136 | 137 | The Corresponding Source need not include anything that users can regenerate 138 | automatically from other parts of the Corresponding Source. 139 | 140 | The Corresponding Source for a work in source code form is that same work. 141 | 142 | ### 2. Basic Permissions 143 | 144 | All rights granted under this License are granted for the term of copyright on the 145 | Program, and are irrevocable provided the stated conditions are met. This License 146 | explicitly affirms your unlimited permission to run the unmodified Program. The 147 | output from running a covered work is covered by this License only if the output, 148 | given its content, constitutes a covered work. This License acknowledges your rights 149 | of fair use or other equivalent, as provided by copyright law. 150 | 151 | You may make, run and propagate covered works that you do not convey, without 152 | conditions so long as your license otherwise remains in force. 
You may convey covered 153 | works to others for the sole purpose of having them make modifications exclusively 154 | for you, or provide you with facilities for running those works, provided that you 155 | comply with the terms of this License in conveying all material for which you do not 156 | control copyright. Those thus making or running the covered works for you must do so 157 | exclusively on your behalf, under your direction and control, on terms that prohibit 158 | them from making any copies of your copyrighted material outside their relationship 159 | with you. 160 | 161 | Conveying under any other circumstances is permitted solely under the conditions 162 | stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 163 | 164 | ### 3. Protecting Users' Legal Rights From Anti-Circumvention Law 165 | 166 | No covered work shall be deemed part of an effective technological measure under any 167 | applicable law fulfilling obligations under article 11 of the WIPO copyright treaty 168 | adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention 169 | of such measures. 170 | 171 | When you convey a covered work, you waive any legal power to forbid circumvention of 172 | technological measures to the extent such circumvention is effected by exercising 173 | rights under this License with respect to the covered work, and you disclaim any 174 | intention to limit operation or modification of the work as a means of enforcing, 175 | against the work's users, your or third parties' legal rights to forbid circumvention 176 | of technological measures. 177 | 178 | ### 4. 
Conveying Verbatim Copies 179 | 180 | You may convey verbatim copies of the Program's source code as you receive it, in any 181 | medium, provided that you conspicuously and appropriately publish on each copy an 182 | appropriate copyright notice; keep intact all notices stating that this License and 183 | any non-permissive terms added in accord with section 7 apply to the code; keep 184 | intact all notices of the absence of any warranty; and give all recipients a copy of 185 | this License along with the Program. 186 | 187 | You may charge any price or no price for each copy that you convey, and you may offer 188 | support or warranty protection for a fee. 189 | 190 | ### 5. Conveying Modified Source Versions 191 | 192 | You may convey a work based on the Program, or the modifications to produce it from 193 | the Program, in the form of source code under the terms of section 4, provided that 194 | you also meet all of these conditions: 195 | 196 | * **a)** The work must carry prominent notices stating that you modified it, and giving a 197 | relevant date. 198 | * **b)** The work must carry prominent notices stating that it is released under this 199 | License and any conditions added under section 7. This requirement modifies the 200 | requirement in section 4 to “keep intact all notices”. 201 | * **c)** You must license the entire work, as a whole, under this License to anyone who 202 | comes into possession of a copy. This License will therefore apply, along with any 203 | applicable section 7 additional terms, to the whole of the work, and all its parts, 204 | regardless of how they are packaged. This License gives no permission to license the 205 | work in any other way, but it does not invalidate such permission if you have 206 | separately received it. 
207 | * **d)** If the work has interactive user interfaces, each must display Appropriate Legal 208 | Notices; however, if the Program has interactive interfaces that do not display 209 | Appropriate Legal Notices, your work need not make them do so. 210 | 211 | A compilation of a covered work with other separate and independent works, which are 212 | not by their nature extensions of the covered work, and which are not combined with 213 | it such as to form a larger program, in or on a volume of a storage or distribution 214 | medium, is called an “aggregate” if the compilation and its resulting 215 | copyright are not used to limit the access or legal rights of the compilation's users 216 | beyond what the individual works permit. Inclusion of a covered work in an aggregate 217 | does not cause this License to apply to the other parts of the aggregate. 218 | 219 | ### 6. Conveying Non-Source Forms 220 | 221 | You may convey a covered work in object code form under the terms of sections 4 and 222 | 5, provided that you also convey the machine-readable Corresponding Source under the 223 | terms of this License, in one of these ways: 224 | 225 | * **a)** Convey the object code in, or embodied in, a physical product (including a 226 | physical distribution medium), accompanied by the Corresponding Source fixed on a 227 | durable physical medium customarily used for software interchange. 
228 | * **b)** Convey the object code in, or embodied in, a physical product (including a 229 | physical distribution medium), accompanied by a written offer, valid for at least 230 | three years and valid for as long as you offer spare parts or customer support for 231 | that product model, to give anyone who possesses the object code either **(1)** a copy of 232 | the Corresponding Source for all the software in the product that is covered by this 233 | License, on a durable physical medium customarily used for software interchange, for 234 | a price no more than your reasonable cost of physically performing this conveying of 235 | source, or **(2)** access to copy the Corresponding Source from a network server at no 236 | charge. 237 | * **c)** Convey individual copies of the object code with a copy of the written offer to 238 | provide the Corresponding Source. This alternative is allowed only occasionally and 239 | noncommercially, and only if you received the object code with such an offer, in 240 | accord with subsection 6b. 241 | * **d)** Convey the object code by offering access from a designated place (gratis or for 242 | a charge), and offer equivalent access to the Corresponding Source in the same way 243 | through the same place at no further charge. You need not require recipients to copy 244 | the Corresponding Source along with the object code. If the place to copy the object 245 | code is a network server, the Corresponding Source may be on a different server 246 | (operated by you or a third party) that supports equivalent copying facilities, 247 | provided you maintain clear directions next to the object code saying where to find 248 | the Corresponding Source. Regardless of what server hosts the Corresponding Source, 249 | you remain obligated to ensure that it is available for as long as needed to satisfy 250 | these requirements. 
251 | * **e)** Convey the object code using peer-to-peer transmission, provided you inform 252 | other peers where the object code and Corresponding Source of the work are being 253 | offered to the general public at no charge under subsection 6d. 254 | 255 | A separable portion of the object code, whose source code is excluded from the 256 | Corresponding Source as a System Library, need not be included in conveying the 257 | object code work. 258 | 259 | A “User Product” is either **(1)** a “consumer product”, which 260 | means any tangible personal property which is normally used for personal, family, or 261 | household purposes, or **(2)** anything designed or sold for incorporation into a 262 | dwelling. In determining whether a product is a consumer product, doubtful cases 263 | shall be resolved in favor of coverage. For a particular product received by a 264 | particular user, “normally used” refers to a typical or common use of 265 | that class of product, regardless of the status of the particular user or of the way 266 | in which the particular user actually uses, or expects or is expected to use, the 267 | product. A product is a consumer product regardless of whether the product has 268 | substantial commercial, industrial or non-consumer uses, unless such uses represent 269 | the only significant mode of use of the product. 270 | 271 | “Installation Information” for a User Product means any methods, 272 | procedures, authorization keys, or other information required to install and execute 273 | modified versions of a covered work in that User Product from a modified version of 274 | its Corresponding Source. The information must suffice to ensure that the continued 275 | functioning of the modified object code is in no case prevented or interfered with 276 | solely because modification has been made. 
277 | 278 | If you convey an object code work under this section in, or with, or specifically for 279 | use in, a User Product, and the conveying occurs as part of a transaction in which 280 | the right of possession and use of the User Product is transferred to the recipient 281 | in perpetuity or for a fixed term (regardless of how the transaction is 282 | characterized), the Corresponding Source conveyed under this section must be 283 | accompanied by the Installation Information. But this requirement does not apply if 284 | neither you nor any third party retains the ability to install modified object code 285 | on the User Product (for example, the work has been installed in ROM). 286 | 287 | The requirement to provide Installation Information does not include a requirement to 288 | continue to provide support service, warranty, or updates for a work that has been 289 | modified or installed by the recipient, or for the User Product in which it has been 290 | modified or installed. Access to a network may be denied when the modification itself 291 | materially and adversely affects the operation of the network or violates the rules 292 | and protocols for communication across the network. 293 | 294 | Corresponding Source conveyed, and Installation Information provided, in accord with 295 | this section must be in a format that is publicly documented (and with an 296 | implementation available to the public in source code form), and must require no 297 | special password or key for unpacking, reading or copying. 298 | 299 | ### 7. Additional Terms 300 | 301 | “Additional permissions” are terms that supplement the terms of this 302 | License by making exceptions from one or more of its conditions. Additional 303 | permissions that are applicable to the entire Program shall be treated as though they 304 | were included in this License, to the extent that they are valid under applicable 305 | law. 
If additional permissions apply only to part of the Program, that part may be 306 | used separately under those permissions, but the entire Program remains governed by 307 | this License without regard to the additional permissions. 308 | 309 | When you convey a copy of a covered work, you may at your option remove any 310 | additional permissions from that copy, or from any part of it. (Additional 311 | permissions may be written to require their own removal in certain cases when you 312 | modify the work.) You may place additional permissions on material, added by you to a 313 | covered work, for which you have or can give appropriate copyright permission. 314 | 315 | Notwithstanding any other provision of this License, for material you add to a 316 | covered work, you may (if authorized by the copyright holders of that material) 317 | supplement the terms of this License with terms: 318 | 319 | * **a)** Disclaiming warranty or limiting liability differently from the terms of 320 | sections 15 and 16 of this License; or 321 | * **b)** Requiring preservation of specified reasonable legal notices or author 322 | attributions in that material or in the Appropriate Legal Notices displayed by works 323 | containing it; or 324 | * **c)** Prohibiting misrepresentation of the origin of that material, or requiring that 325 | modified versions of such material be marked in reasonable ways as different from the 326 | original version; or 327 | * **d)** Limiting the use for publicity purposes of names of licensors or authors of the 328 | material; or 329 | * **e)** Declining to grant rights under trademark law for use of some trade names, 330 | trademarks, or service marks; or 331 | * **f)** Requiring indemnification of licensors and authors of that material by anyone 332 | who conveys the material (or modified versions of it) with contractual assumptions of 333 | liability to the recipient, for any liability that these contractual assumptions 334 | directly impose on those 
licensors and authors. 335 | 336 | All other non-permissive additional terms are considered “further 337 | restrictions” within the meaning of section 10. If the Program as you received 338 | it, or any part of it, contains a notice stating that it is governed by this License 339 | along with a term that is a further restriction, you may remove that term. If a 340 | license document contains a further restriction but permits relicensing or conveying 341 | under this License, you may add to a covered work material governed by the terms of 342 | that license document, provided that the further restriction does not survive such 343 | relicensing or conveying. 344 | 345 | If you add terms to a covered work in accord with this section, you must place, in 346 | the relevant source files, a statement of the additional terms that apply to those 347 | files, or a notice indicating where to find the applicable terms. 348 | 349 | Additional terms, permissive or non-permissive, may be stated in the form of a 350 | separately written license, or stated as exceptions; the above requirements apply 351 | either way. 352 | 353 | ### 8. Termination 354 | 355 | You may not propagate or modify a covered work except as expressly provided under 356 | this License. Any attempt otherwise to propagate or modify it is void, and will 357 | automatically terminate your rights under this License (including any patent licenses 358 | granted under the third paragraph of section 11). 359 | 360 | However, if you cease all violation of this License, then your license from a 361 | particular copyright holder is reinstated **(a)** provisionally, unless and until the 362 | copyright holder explicitly and finally terminates your license, and **(b)** permanently, 363 | if the copyright holder fails to notify you of the violation by some reasonable means 364 | prior to 60 days after the cessation. 
365 | 366 | Moreover, your license from a particular copyright holder is reinstated permanently 367 | if the copyright holder notifies you of the violation by some reasonable means, this 368 | is the first time you have received notice of violation of this License (for any 369 | work) from that copyright holder, and you cure the violation prior to 30 days after 370 | your receipt of the notice. 371 | 372 | Termination of your rights under this section does not terminate the licenses of 373 | parties who have received copies or rights from you under this License. If your 374 | rights have been terminated and not permanently reinstated, you do not qualify to 375 | receive new licenses for the same material under section 10. 376 | 377 | ### 9. Acceptance Not Required for Having Copies 378 | 379 | You are not required to accept this License in order to receive or run a copy of the 380 | Program. Ancillary propagation of a covered work occurring solely as a consequence of 381 | using peer-to-peer transmission to receive a copy likewise does not require 382 | acceptance. However, nothing other than this License grants you permission to 383 | propagate or modify any covered work. These actions infringe copyright if you do not 384 | accept this License. Therefore, by modifying or propagating a covered work, you 385 | indicate your acceptance of this License to do so. 386 | 387 | ### 10. Automatic Licensing of Downstream Recipients 388 | 389 | Each time you convey a covered work, the recipient automatically receives a license 390 | from the original licensors, to run, modify and propagate that work, subject to this 391 | License. You are not responsible for enforcing compliance by third parties with this 392 | License. 393 | 394 | An “entity transaction” is a transaction transferring control of an 395 | organization, or substantially all assets of one, or subdividing an organization, or 396 | merging organizations. 
If propagation of a covered work results from an entity 397 | transaction, each party to that transaction who receives a copy of the work also 398 | receives whatever licenses to the work the party's predecessor in interest had or 399 | could give under the previous paragraph, plus a right to possession of the 400 | Corresponding Source of the work from the predecessor in interest, if the predecessor 401 | has it or can get it with reasonable efforts. 402 | 403 | You may not impose any further restrictions on the exercise of the rights granted or 404 | affirmed under this License. For example, you may not impose a license fee, royalty, 405 | or other charge for exercise of rights granted under this License, and you may not 406 | initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging 407 | that any patent claim is infringed by making, using, selling, offering for sale, or 408 | importing the Program or any portion of it. 409 | 410 | ### 11. Patents 411 | 412 | A “contributor” is a copyright holder who authorizes use under this 413 | License of the Program or a work on which the Program is based. The work thus 414 | licensed is called the contributor's “contributor version”. 415 | 416 | A contributor's “essential patent claims” are all patent claims owned or 417 | controlled by the contributor, whether already acquired or hereafter acquired, that 418 | would be infringed by some manner, permitted by this License, of making, using, or 419 | selling its contributor version, but do not include claims that would be infringed 420 | only as a consequence of further modification of the contributor version. For 421 | purposes of this definition, “control” includes the right to grant patent 422 | sublicenses in a manner consistent with the requirements of this License. 
423 | 424 | Each contributor grants you a non-exclusive, worldwide, royalty-free patent license 425 | under the contributor's essential patent claims, to make, use, sell, offer for sale, 426 | import and otherwise run, modify and propagate the contents of its contributor 427 | version. 428 | 429 | In the following three paragraphs, a “patent license” is any express 430 | agreement or commitment, however denominated, not to enforce a patent (such as an 431 | express permission to practice a patent or covenant not to sue for patent 432 | infringement). To “grant” such a patent license to a party means to make 433 | such an agreement or commitment not to enforce a patent against the party. 434 | 435 | If you convey a covered work, knowingly relying on a patent license, and the 436 | Corresponding Source of the work is not available for anyone to copy, free of charge 437 | and under the terms of this License, through a publicly available network server or 438 | other readily accessible means, then you must either **(1)** cause the Corresponding 439 | Source to be so available, or **(2)** arrange to deprive yourself of the benefit of the 440 | patent license for this particular work, or **(3)** arrange, in a manner consistent with 441 | the requirements of this License, to extend the patent license to downstream 442 | recipients. “Knowingly relying” means you have actual knowledge that, but 443 | for the patent license, your conveying the covered work in a country, or your 444 | recipient's use of the covered work in a country, would infringe one or more 445 | identifiable patents in that country that you have reason to believe are valid. 
446 | 447 | If, pursuant to or in connection with a single transaction or arrangement, you 448 | convey, or propagate by procuring conveyance of, a covered work, and grant a patent 449 | license to some of the parties receiving the covered work authorizing them to use, 450 | propagate, modify or convey a specific copy of the covered work, then the patent 451 | license you grant is automatically extended to all recipients of the covered work and 452 | works based on it. 453 | 454 | A patent license is “discriminatory” if it does not include within the 455 | scope of its coverage, prohibits the exercise of, or is conditioned on the 456 | non-exercise of one or more of the rights that are specifically granted under this 457 | License. You may not convey a covered work if you are a party to an arrangement with 458 | a third party that is in the business of distributing software, under which you make 459 | payment to the third party based on the extent of your activity of conveying the 460 | work, and under which the third party grants, to any of the parties who would receive 461 | the covered work from you, a discriminatory patent license **(a)** in connection with 462 | copies of the covered work conveyed by you (or copies made from those copies), or **(b)** 463 | primarily for and in connection with specific products or compilations that contain 464 | the covered work, unless you entered into that arrangement, or that patent license 465 | was granted, prior to 28 March 2007. 466 | 467 | Nothing in this License shall be construed as excluding or limiting any implied 468 | license or other defenses to infringement that may otherwise be available to you 469 | under applicable patent law. 470 | 471 | ### 12. No Surrender of Others' Freedom 472 | 473 | If conditions are imposed on you (whether by court order, agreement or otherwise) 474 | that contradict the conditions of this License, they do not excuse you from the 475 | conditions of this License. 
If you cannot convey a covered work so as to satisfy 476 | simultaneously your obligations under this License and any other pertinent 477 | obligations, then as a consequence you may not convey it at all. For example, if you 478 | agree to terms that obligate you to collect a royalty for further conveying from 479 | those to whom you convey the Program, the only way you could satisfy both those terms 480 | and this License would be to refrain entirely from conveying the Program. 481 | 482 | ### 13. Use with the GNU Affero General Public License 483 | 484 | Notwithstanding any other provision of this License, you have permission to link or 485 | combine any covered work with a work licensed under version 3 of the GNU Affero 486 | General Public License into a single combined work, and to convey the resulting work. 487 | The terms of this License will continue to apply to the part which is the covered 488 | work, but the special requirements of the GNU Affero General Public License, section 489 | 13, concerning interaction through a network will apply to the combination as such. 490 | 491 | ### 14. Revised Versions of this License 492 | 493 | The Free Software Foundation may publish revised and/or new versions of the GNU 494 | General Public License from time to time. Such new versions will be similar in spirit 495 | to the present version, but may differ in detail to address new problems or concerns. 496 | 497 | Each version is given a distinguishing version number. If the Program specifies that 498 | a certain numbered version of the GNU General Public License “or any later 499 | version” applies to it, you have the option of following the terms and 500 | conditions either of that numbered version or of any later version published by the 501 | Free Software Foundation. If the Program does not specify a version number of the GNU 502 | General Public License, you may choose any version ever published by the Free 503 | Software Foundation. 
If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program.

Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version.

### 15. Disclaimer of Warranty

THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

### 16. Limitation of Liability

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

### 17. Interpretation of Sections 15 and 16

If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee.

_END OF TERMS AND CONDITIONS_

## How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the “copyright” line and a pointer to where the full notice is found.

    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year> <name of author>

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program. If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.
If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode:

    <program> Copyright (C) <year> <name of author>
    This program comes with ABSOLUTELY NO WARRANTY; for details type 'show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type 'show c' for details.

The hypothetical commands `show w` and `show c` should show the appropriate parts of the General Public License. Of course, your program's commands might be different; for a GUI interface, you would use an “about box”.

You should also get your employer (if you work as a programmer) or school, if any, to sign a “copyright disclaimer” for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see <https://www.gnu.org/licenses/>.

The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read <https://www.gnu.org/licenses/why-not-lgpl.html>.