├── data └── titanic.rda ├── endoR_0.1.0.tar.gz ├── figures ├── Workflow.png └── Decisions_summary.png ├── .gitignore ├── tests ├── testthat │ └── test-test.R └── testthat.R ├── figure ├── unnamed-chunk-14-1.png ├── unnamed-chunk-16-1.png ├── unnamed-chunk-17-1.png ├── unnamed-chunk-18-1.png ├── unnamed-chunk-25-1.png └── unnamed-chunk-26-1.png ├── endoR.Rcheck ├── R_check_bin │ ├── R │ └── Rscript ├── 00_pkg_src │ └── endoR │ │ ├── tests │ │ ├── testthat │ │ │ └── test-test.R │ │ └── testthat.R │ │ ├── data │ │ └── titanic.rda │ │ ├── build │ │ └── vignette.rds │ │ ├── figures │ │ ├── Workflow.png │ │ └── Decisions_summary.png │ │ ├── man │ │ ├── compatibleNames.Rd │ │ ├── endoR.Rd │ │ ├── getThresholds.Rd │ │ ├── model2DE_cluster.Rd │ │ ├── featureImportance.Rd │ │ ├── discretizeData_model.Rd │ │ ├── plotFeatures.Rd │ │ ├── decisions2FullDummy.Rd │ │ ├── discretizeData.Rd │ │ ├── decisionImportance.Rd │ │ ├── filterDecisionsImportances.Rd │ │ ├── plotNetwork.Rd │ │ ├── getNetwork.Rd │ │ ├── pruneDecisions.Rd │ │ ├── getDecisionsMetrics.Rd │ │ ├── model2DE_resampling.Rd │ │ ├── aggregateTaxa.Rd │ │ ├── evaluateAlpha.Rd │ │ ├── stabilitySelection.Rd │ │ ├── discretizeDecisions.Rd │ │ ├── preCluster.Rd │ │ └── model2DE.Rd │ │ ├── R │ │ ├── endoR.R │ │ ├── featureImportance.R │ │ ├── model2DE_cluster.R │ │ ├── discretizeVector_model.R │ │ ├── discretizeVector.R │ │ ├── aggregateTaxa_fine.R │ │ ├── discretizeData_model.R │ │ ├── Misc.R │ │ ├── filterDecisionsImportances.R │ │ ├── aggregateTaxa_coarse.R │ │ ├── measureSingleDecision.R │ │ ├── decisionImportance.R │ │ ├── discretizeData.R │ │ ├── aggregateTaxa.R │ │ ├── model2DE_resampling.R │ │ ├── evaluateAlpha.R │ │ ├── getThresholds.R │ │ ├── getDecisionsMetric.R │ │ ├── plotFeatures.R │ │ └── changeDecisionsDummies.R │ │ ├── NAMESPACE │ │ ├── .github │ │ └── workflows │ │ │ └── ci.yml │ │ ├── DESCRIPTION │ │ ├── vignettes │ │ └── iris_multiclass.Rmd │ │ └── README.md └── 00check.log ├── cran-comments.md ├── .Rbuildignore ├── man ├── endoR.Rd ├── compatibleNames.Rd ├── titanic.Rd ├── getThresholds.Rd ├── featureImportance.Rd ├── discretizeData_model.Rd ├── decisions2FullDummy.Rd ├── discretizeData.Rd ├── changeDecisionsDummies.Rd ├── aggregateTaxa.Rd ├── plotFeatures.Rd ├── plotNetwork.Rd ├── filterDecisionsImportances.Rd ├── decisionImportance.Rd ├── model2DE_cluster.Rd ├── pruneDecisions.Rd ├── getDecisionsMetrics.Rd ├── getNetwork.Rd └── discretizeDecisions.Rd ├── endoR.Rproj ├── R ├── featureImportance.R ├── discretizeVector_model.R ├── model2DE_cluster.R ├── discretizeVector.R ├── aggregateTaxa_fine.R ├── discretizeData_model.R ├── Misc.R ├── filterDecisionsImportances.R ├── endoR.R ├── aggregateTaxa_coarse.R ├── measureSingleDecision.R ├── decisionImportance.R ├── aggregateTaxa.R ├── discretizeData.R ├── getThresholds.R ├── getDecisionsMetric.R ├── evaluateAlpha.R ├── plotFeatures.R └── getComplements.R ├── NAMESPACE ├── .github └── workflows │ └── ci.yml ├── DESCRIPTION ├── examples ├── iris_basic.R ├── iris_each_function.R └── iris_bootstraps.R └── vignettes └── iris_multiclass.Rmd /data/titanic.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/data/titanic.rda -------------------------------------------------------------------------------- /endoR_0.1.0.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/endoR_0.1.0.tar.gz 
-------------------------------------------------------------------------------- /figures/Workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figures/Workflow.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | inst/doc 6 | doc 7 | Meta 8 | -------------------------------------------------------------------------------- /tests/testthat/test-test.R: -------------------------------------------------------------------------------- 1 | test_that("multiplication works", { 2 | expect_equal(2 * 2, 4) 3 | }) 4 | -------------------------------------------------------------------------------- /figure/unnamed-chunk-14-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-14-1.png -------------------------------------------------------------------------------- /figure/unnamed-chunk-16-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-16-1.png -------------------------------------------------------------------------------- /figure/unnamed-chunk-17-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-17-1.png -------------------------------------------------------------------------------- /figure/unnamed-chunk-18-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-18-1.png -------------------------------------------------------------------------------- /figure/unnamed-chunk-25-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-25-1.png -------------------------------------------------------------------------------- /figure/unnamed-chunk-26-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figure/unnamed-chunk-26-1.png -------------------------------------------------------------------------------- /figures/Decisions_summary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/figures/Decisions_summary.png -------------------------------------------------------------------------------- /endoR.Rcheck/R_check_bin/R: -------------------------------------------------------------------------------- 1 | echo "'R' should not be used without a path -- see par. 1.6 of the manual" 2 | exit 1 3 | -------------------------------------------------------------------------------- /endoR.Rcheck/R_check_bin/Rscript: -------------------------------------------------------------------------------- 1 | echo "'Rscript' should not be used without a path -- see par. 
1.6 of the manual" 2 | exit 1 3 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/tests/testthat/test-test.R: -------------------------------------------------------------------------------- 1 | test_that("multiplication works", { 2 | expect_equal(2 * 2, 4) 3 | }) 4 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/data/titanic.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/endoR.Rcheck/00_pkg_src/endoR/data/titanic.rda -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/build/vignette.rds: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/endoR.Rcheck/00_pkg_src/endoR/build/vignette.rds -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/figures/Workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/endoR.Rcheck/00_pkg_src/endoR/figures/Workflow.png -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/figures/Decisions_summary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/leylabmpi/endoR/HEAD/endoR.Rcheck/00_pkg_src/endoR/figures/Decisions_summary.png -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## R CMD check results 2 | There were no ERRORs, WARNINGs, or NOTEs. 3 | 4 | ## Downstream dependencies 5 | There are currently no downstream dependencies for this package. -------------------------------------------------------------------------------- /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | ^doc$ 4 | ^Meta$ 5 | ^LICENSE\.md$ 6 | ^figures$ 7 | ^\.git* 8 | ^\.RData$ 9 | cran-comments.md 10 | ^vignettes/titanic_boostraps\.Rmd\.orig$ 11 | ^figure$ 12 | ^examples$ -------------------------------------------------------------------------------- /man/endoR.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/endoR.R 3 | \docType{package} 4 | \name{endoR} 5 | \alias{endoR} 6 | \title{endoR} 7 | \description{ 8 | endoR extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 9 | } 10 | \author{ 11 | Albane Ruaud \email{albane.ruaud@tuebingen.mpg.de} 12 | } 13 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/compatibleNames.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/Misc.R 3 | \name{compatibleNames} 4 | \alias{compatibleNames} 5 | \title{Transform character strings to be compatible with endoR functions.} 6 | \usage{ 7 | compatibleNames(x) 8 | } 9 | \description{ 10 | Transform character strings to be compatible with endoR functions. 
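}
\examples{
# Make arbitrary strings safe for endoR functions (mirrors the example in
# the top-level man/compatibleNames.Rd):
x <- c('hello.world__I am;happy', 'me-2')
compatibleNames(x)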
11 | } 12 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/endoR.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/endoR.R 3 | \docType{package} 4 | \name{endoR} 5 | \alias{endoR} 6 | \title{endoR} 7 | \description{ 8 | endoR extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 9 | } 10 | \author{ 11 | Albane Ruaud \email{albane.ruaud@tuebingen.mpg.de} 12 | } 13 | -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | # This file is part of the standard setup for testthat. 2 | # It is recommended that you do not modify it. 3 | # 4 | # Where should you do additional test configuration? 5 | # Learn more about the roles of various files in: 6 | # * https://r-pkgs.org/tests.html 7 | # * https://testthat.r-lib.org/reference/test_package.html#special-files 8 | 9 | library(testthat) 10 | library(endoR) 11 | 12 | test_check("endoR") 13 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/tests/testthat.R: -------------------------------------------------------------------------------- 1 | # This file is part of the standard setup for testthat. 2 | # It is recommended that you do not modify it. 3 | # 4 | # Where should you do additional test configuration? 5 | # Learn more about the roles of various files in: 6 | # * https://r-pkgs.org/tests.html 7 | # * https://testthat.r-lib.org/reference/test_package.html#special-files 8 | 9 | library(testthat) 10 | library(endoR) 11 | 12 | test_check("endoR") 13 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/endoR.R: -------------------------------------------------------------------------------- 1 | #' endoR 2 | #' 3 | #' endoR extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 
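#'
#' A minimal sketch of the typical workflow, assuming a randomForest model
#' 'mod' fitted on predictors 'X' with response 'y' (see examples/iris_basic.R
#' for a runnable version):
#' de <- model2DE(model = mod, model_type = "rf", data = X, target = y)
#' plotFeatures(de)
#' plotNetwork(de)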
4 | #' 5 | #' @docType package 6 | #' 7 | #' @author Albane Ruaud \email{albane.ruaud@tuebingen.mpg.de} 8 | #' 9 | #' @name endoR 10 | #' @import data.table 11 | #' @import inTrees 12 | #' @import dplyr 13 | #' @import stringr 14 | #' @import ggplot2 15 | #' @import ggraph 16 | #' @import igraph 17 | NULL 18 | -------------------------------------------------------------------------------- /endoR.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/getThresholds.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getThresholds.R 3 | \name{getThresholds} 4 | \alias{getThresholds} 5 | \title{Get thresholds to discretize variables according to their split in tree ensemble models} 6 | \usage{ 7 | getThresholds(conditions, data, Kmax = 2) 8 | } 9 | \description{ 10 | Get thresholds to discretize variables according to their split in tree ensemble models 11 | } 12 | -------------------------------------------------------------------------------- /endoR.Rcheck/00check.log: -------------------------------------------------------------------------------- 1 | * using log directory ‘/mnt/c/Users/Albane Ruaud/MPI-EBIO/Ruaud2022/endoR/endoR.Rcheck’ 2 | * using R version 4.0.2 (2020-06-22) 3 | * using platform: x86_64-conda_cos6-linux-gnu (64-bit) 4 | * using session charset: UTF-8 5 | * using option ‘--as-cran’ 6 | * checking for file ‘endoR/DESCRIPTION’ ... OK 7 | * checking extension type ... Package 8 | * this is package ‘endoR’ version ‘0.1.0’ 9 | * package encoding: UTF-8 10 | * checking CRAN incoming feasibility ... -------------------------------------------------------------------------------- /man/compatibleNames.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/Misc.R 3 | \name{compatibleNames} 4 | \alias{compatibleNames} 5 | \title{Transform character strings to be compatible with endoR functions.} 6 | \usage{ 7 | compatibleNames(x) 8 | } 9 | \arguments{ 10 | \item{x}{character string or vector} 11 | } 12 | \description{ 13 | Transform character strings to be compatible with endoR functions. 14 | } 15 | \examples{ 16 | x <- c('hello.world__I am;happy', 'me-2') 17 | compatibleNames(x) 18 | } 19 | -------------------------------------------------------------------------------- /man/titanic.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/endoR.R 3 | \docType{data} 4 | \name{titanic} 5 | \alias{titanic} 6 | \title{Titanic passenger and crew data} 7 | \format{ 8 | An object of class \code{data.frame} with 2207 rows and 9 columns. 
9 | } 10 | \usage{ 11 | titanic 12 | } 13 | \description{ 14 | Passenger and crew data from the Titanic, adapted from the 'titanic' package and included as an example dataset. 15 | } 16 | \references{ 17 | \url{https://cran.r-project.org/package=titanic} 18 | } 19 | \author{ 20 | Paul Hendricks \email{paul.hendricks.2013@owu.edu} 21 | } 22 | \keyword{datasets} 23 | -------------------------------------------------------------------------------- /man/getThresholds.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getThresholds.R 3 | \name{getThresholds} 4 | \alias{getThresholds} 5 | \title{Get thresholds to discretize variables according to their split in tree ensemble models} 6 | \usage{ 7 | getThresholds(conditions, data, Kmax = 2) 8 | } 9 | \arguments{ 10 | \item{conditions}{character vector with all conditions from which to find the thresholds} 11 | 12 | \item{data}{data to discretize} 13 | 14 | \item{Kmax}{numeric, maximal number of categories for each variable (default: Kmax = 2).} 15 | } 16 | \description{ 17 | Get thresholds to discretize variables according to their split in tree ensemble models 18 | } 19 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/model2DE_cluster.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/model2DE_cluster.R 3 | \name{model2DE_cluster} 4 | \alias{model2DE_cluster} 5 | \title{Run model2DE on several bootstrap resamples in parallel.} 6 | \usage{ 7 | model2DE_cluster(partition) 8 | } 9 | \arguments{ 10 | \item{partition}{a vector with row numbers to subset data.} 11 | } 12 | \description{ 13 | Function to use with the Q() function from the clustermq R-package with the following arguments in export: 14 | data, target, exec, classPos, dummy_var, prune, maxDecay, typeDecay, filter, in_parallel, n_cores. 15 | See preCluster() to obtain the list of bootstrap resamples, the discretized data, and the exec dataframe with decisions. 16 | } 17 | -------------------------------------------------------------------------------- /man/featureImportance.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/featureImportance.R 3 | \name{featureImportance} 4 | \alias{featureImportance} 5 | \title{Calculate the feature importances from a decision ensemble} 6 | \usage{ 7 | featureImportance(nodes_agg) 8 | } 9 | \arguments{ 10 | \item{nodes_agg}{a datatable with, for each variable in each decision, its name (column var), its decision-wise importance (column importance), and the importance and multiplicity of the decision (columns imp and n).} 11 | } 12 | \value{ 13 | a datatable with the feature importance for each variable. 14 | } 15 | \description{ 16 | This function calculates the feature importance of variables in a decision ensemble. Levels of discretized variables are grouped so that only one importance is returned per variable (the variable name is separated from the level name by a double underscore, '__'). 
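}
\examples{
# Minimal sketch with a hypothetical two-decision ensemble; levels are
# encoded as 'variable__level' as described above.
nodes_agg <- data.table::data.table(
  var = c("Petal.Width__High", "Sepal.Length", "Petal.Width__High"),
  importance = c(0.5, 0.3, 0.2), imp = c(0.8, 0.8, 0.6), n = c(2, 2, 1)
)
featureImportance(nodes_agg)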
17 | } 18 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/featureImportance.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/featureImportance.R 3 | \name{featureImportance} 4 | \alias{featureImportance} 5 | \title{Calculate the feature importances from a decision ensemble} 6 | \usage{ 7 | featureImportance(nodes_agg) 8 | } 9 | \arguments{ 10 | \item{nodes_agg}{a datatable with, for each variable in each decision, its name (column var), its decision-wise importance (column importance), and the importance and multiplicity of the decision (columns imp and n).} 11 | } 12 | \value{ 13 | a datatable with the feature importance for each variable. 14 | } 15 | \description{ 16 | This function calculates the feature importance of variables in a decision ensemble. Levels of discretized variables are grouped so that only one importance is returned per variable (the variable name is separated from the level name by a double underscore, '__'). 17 | } 18 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/discretizeData_model.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeData_model.R 3 | \name{discretizeData_model} 4 | \alias{discretizeData_model} 5 | \title{Discretize numerical variables in a dataset based on thresholds used in the model to create splits.} 6 | \usage{ 7 | discretizeData_model(data, conditions, Kmax = 2, return_split = FALSE) 8 | } 9 | \arguments{ 10 | \item{data}{data to discretize.} 11 | 12 | \item{conditions}{character vector with all conditions from which to find the thresholds} 13 | 14 | \item{Kmax}{numeric, maximal number of categories for each variable (default: Kmax = 2).} 15 | 16 | \item{return_split}{if TRUE, then the table with thresholds used to discretize data is also returned.} 17 | } 18 | \value{ 19 | Data with discretized variables. 20 | } 21 | \description{ 22 | This function discretizes all numerical variables into Kmax categories based on the splits used by the model for each variable. 23 | } 24 | -------------------------------------------------------------------------------- /man/discretizeData_model.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeData_model.R 3 | \name{discretizeData_model} 4 | \alias{discretizeData_model} 5 | \title{Discretize numerical variables in a dataset based on thresholds used in the model to create splits.} 6 | \usage{ 7 | discretizeData_model(data, conditions, Kmax = 2, return_split = FALSE) 8 | } 9 | \arguments{ 10 | \item{data}{data to discretize} 11 | 12 | \item{conditions}{character vector with all conditions from which to find the thresholds} 13 | 14 | \item{Kmax}{numeric, maximal number of categories for each variable (default: Kmax = 2).} 15 | 16 | \item{return_split}{if TRUE, then the table with thresholds used to discretize data is also returned.} 17 | } 18 | \value{ 19 | Data with discretized variables. 20 | } 21 | \description{ 22 | This function discretizes all numerical variables into Kmax categories based on the splits used by the model for each variable. 
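}
\examples{
\dontrun{
# Hedged sketch, assuming inTrees-style condition strings where column
# indices refer to columns of 'data':
conds <- c("X[,1]<=5.4", "X[,3]>4.8")
iris_d <- discretizeData_model(data = iris[, -5], conditions = conds)
}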
23 | } 24 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(Ranger2List_endoR) 4 | export(aggregateTaxa) 5 | export(compatibleNames) 6 | export(decisionImportance) 7 | export(decisions2FullDummy) 8 | export(discretizeData) 9 | export(discretizeData_model) 10 | export(evaluateAlpha) 11 | export(featureImportance) 12 | export(filterDecisionsImportances) 13 | export(getDecisionsMetrics) 14 | export(getNetwork) 15 | export(getNewRule) 16 | export(getThresholds) 17 | export(interactionVariables) 18 | export(model2DE) 19 | export(model2DE_cluster) 20 | export(model2DE_resampling) 21 | export(perlevelRule) 22 | export(plotFeatures) 23 | export(plotNetwork) 24 | export(preCluster) 25 | export(pruneDecisions) 26 | export(pruneSingleRule_endoR) 27 | export(rmError) 28 | export(singleRulePerLevel) 29 | export(stabilitySelection) 30 | import(data.table) 31 | import(dplyr) 32 | import(ggplot2) 33 | import(ggraph) 34 | import(igraph) 35 | import(inTrees) 36 | import(stringr) 37 | -------------------------------------------------------------------------------- /R/featureImportance.R: -------------------------------------------------------------------------------- 1 | #' Calculate the feature importances from a decision ensemble 2 | #' 3 | #' This function calculates the feature importance of variables in a decision ensemble. Levels of discretized variables are grouped so that only one importance is returned per variable (the variable name is separated from the level name by a double underscore, '__'). 4 | #' 5 | #' @param nodes_agg a datatable with, for each variable in each decision, its name (column var), its decision-wise importance (column importance), and the importance and multiplicity of the decision (columns imp and n). 6 | #' 7 | #' @return a datatable with the feature importance for each variable. 8 | #' 9 | #' @export 10 | featureImportance <- function(nodes_agg) { 11 | featImp <- copy(nodes_agg)[, `:=`(Feature = str_extract(var, pattern = ".*(?=\\_{2})"))][ 12 | is.na(Feature), Feature := var 13 | ][, list(Feature, importance, imp, n)][ 14 | , importance := sum(importance * imp * n), 15 | by = Feature 16 | ][, list(Feature, importance)] 17 | featImp <- unique(featImp) 18 | setorder(featImp, -importance) 19 | return(featImp) 20 | } 21 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/featureImportance.R: -------------------------------------------------------------------------------- 1 | #' Calculate the feature importances from a decision ensemble 2 | #' 3 | #' This function calculates the feature importance of variables in a decision ensemble. Levels of discretized variables are grouped so that only one importance is returned per variable (the variable name is separated from the level name by a double underscore, '__'). 4 | #' 5 | #' @param nodes_agg a datatable with, for each variable in each decision, its name (column var), its decision-wise importance (column importance), and the importance and multiplicity of the decision (columns imp and n). 6 | #' 7 | #' @return a datatable with the feature importance for each variable. 
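#'
#' @examples
#' # Minimal sketch with a hypothetical two-decision ensemble (levels are
#' # encoded as 'variable__level'):
#' nodes_agg <- data.table::data.table(
#'   var = c("Petal.Width__High", "Sepal.Length", "Petal.Width__High"),
#'   importance = c(0.5, 0.3, 0.2), imp = c(0.8, 0.8, 0.6), n = c(2, 2, 1)
#' )
#' featureImportance(nodes_agg)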
8 | #' 9 | #' @export 10 | featureImportance <- function(nodes_agg) { 11 | featImp <- copy(nodes_agg)[, `:=`(Feature = str_extract(var, pattern = ".*(?=\\_{2})"))][ 12 | is.na(Feature), Feature := var 13 | ][, .(Feature, importance, imp, n)][ 14 | , importance := sum(importance * imp * n), 15 | by = Feature 16 | ][, .(Feature, importance)] 17 | featImp <- unique(featImp) 18 | setorder(featImp, -importance) 19 | return(featImp) 20 | } 21 | -------------------------------------------------------------------------------- /man/decisions2FullDummy.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/decisions2FullDummy.R 3 | \name{decisions2FullDummy} 4 | \alias{decisions2FullDummy} 5 | \title{Transform all variables in a decision ensemble into dummy variables.} 6 | \usage{ 7 | decisions2FullDummy( 8 | rules, 9 | data, 10 | in_parallel = FALSE, 11 | n_cores = detectCores() - 1, 12 | cluster = NULL 13 | ) 14 | } 15 | \arguments{ 16 | \item{rules}{data frame with a 'condition' column.} 17 | 18 | \item{data}{data used to train the original model, or for which the column order corresponds to the column numbers in the rules (column 'condition').} 19 | 20 | \item{in_parallel}{if TRUE, the function is run in parallel (default = FALSE).} 21 | 22 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 23 | 24 | \item{cluster}{the cluster to use to run the function in parallel.} 25 | } 26 | \description{ 27 | This function first transforms all non-numeric variables from data used to train the original model into dummy variables. It then updates the decision ensemble. 28 | } 29 | -------------------------------------------------------------------------------- /man/discretizeData.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeData.R 3 | \name{discretizeData} 4 | \alias{discretizeData} 5 | \title{Discretize numerical variables in a dataset} 6 | \usage{ 7 | discretizeData( 8 | data, 9 | K = 2, 10 | features = NULL, 11 | knames = NULL, 12 | return_split = FALSE 13 | ) 14 | } 15 | \arguments{ 16 | \item{data}{data to discretize.} 17 | 18 | \item{K}{numeric, number of categories (default: K = 2).} 19 | 20 | \item{features}{vector with variable names or column numbers to discretize. If NULL (default), then all numeric variables are discretized.} 21 | 22 | \item{knames}{optional: character vector of the same length as K, containing the ordered names for categories.} 23 | 24 | \item{return_split}{if TRUE, then the table with thresholds used to discretize data is also returned.} 25 | } 26 | \value{ 27 | Data with discretized variables. 28 | } 29 | \description{ 30 | This function discretizes all numerical variables, or only the ones passed in features, into K categories based on their quantiles. 31 | Names of categories can optionally be given in knames. 
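}
\examples{
# Minimal sketch: discretize the numeric iris columns into named thirds.
discretizeData(iris, K = 3, knames = c("Low", "Medium", "High"))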
32 | } 33 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/plotFeatures.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotFeatures.R 3 | \name{plotFeatures} 4 | \alias{plotFeatures} 5 | \title{Plot the importance and influence of features.} 6 | \usage{ 7 | plotFeatures( 8 | decision_ensemble, 9 | levels_order = NULL, 10 | colour_low = "#E69F00", 11 | colour_mid = "grey87", 12 | colour_high = "#0072B2", 13 | return_all = FALSE 14 | ) 15 | } 16 | \arguments{ 17 | \item{decision_ensemble}{stable decision ensemble (see stabilitySelection).} 18 | 19 | \item{levels_order}{optional, order for variable levels on the influence plot} 20 | 21 | \item{return_all}{if TRUE, returns the table of feature importances and influences, and each plot separately (default = FALSE).} 22 | } 23 | \value{ 24 | 2 ggplots arranged in a row with ggpubr; if return_all = TRUE, returns plots separately in a list, as well as the tables used to create plots. 25 | } 26 | \description{ 27 | Returns a ggplot object with variable importances (across all categorical levels for factor variables) and per-level variable influences. 28 | It uses the ggpubr package to combine plots. 29 | } 30 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/model2DE_cluster.R: -------------------------------------------------------------------------------- 1 | #' Run model2DE on several bootstrap resamples in parallel. 2 | #' 3 | #' Function to use with the Q() function from the clustermq R-package with the following arguments in export: 4 | #' data, target, exec, classPos, dummy_var, prune, maxDecay, typeDecay, filter, in_parallel, n_cores. 5 | #' See preCluster() to obtain the list of bootstrap resamples, the discretized data, and the exec dataframe with decisions. 6 | #' 7 | #' @param partition a vector with row numbers to subset data. 
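#'
#' A hedged sketch of the intended call, where 'partitions', 'data_d', and
#' 'exec' are assumed outputs of preCluster() and the exported names follow
#' the list above (values here are illustrative):
#' res <- clustermq::Q(model2DE_cluster, partition = partitions, n_jobs = 2,
#'   export = list(data = data_d, target = y, exec = exec, classPos = 'setosa',
#'     dummy_var = NULL, prune = TRUE, maxDecay = 0.05, typeDecay = 2,
#'     filter = TRUE, in_parallel = FALSE, n_cores = 1))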
8 | #' @export 9 | model2DE_cluster <- function(partition) { 10 | library(data.table) 11 | res <- model2DE( 12 | data = data[partition, ], target = target[partition], 13 | exec = exec, 14 | classPos = classPos, dummy_var = dummy_var, 15 | prune = prune, maxDecay = maxDecay, typeDecay = typeDecay, 16 | filter = filter, min_imp = 1, 17 | in_parallel = in_parallel, n_cores = n_cores, 18 | light = TRUE 19 | ) 20 | res <- list( 21 | "pdecisions" = res$n_decisions, 22 | "rules" = res[[length(res) - 4]], 23 | "nodes_agg" = res$nodes_agg, "edges_agg" = res$edges_agg 24 | ) 25 | return(res) 26 | } 27 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/decisions2FullDummy.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/decisions2FullDummy.R 3 | \name{decisions2FullDummy} 4 | \alias{decisions2FullDummy} 5 | \title{Transform all variables in a decision ensemble into dummy variables.} 6 | \usage{ 7 | decisions2FullDummy( 8 | rules, 9 | data, 10 | in_parallel = FALSE, 11 | n_cores = detectCores() - 1, 12 | cluster = NULL 13 | ) 14 | } 15 | \arguments{ 16 | \item{rules}{data frame with a 'condition' column.} 17 | 18 | \item{data}{data used to train the original model, or for which the column order corresponds to the column numbers in the rules (column 'condition').} 19 | 20 | \item{in_parallel}{if TRUE, the function is run in parallel (default = FALSE).} 21 | 22 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 23 | 24 | \item{cluster}{the cluster to use to run the function in parallel.} 25 | } 26 | \description{ 27 | This function first transforms all non-numeric variables from data used to train the original model into dummy variables. It then updates the decision ensemble. 28 | } 29 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/discretizeData.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeData.R 3 | \name{discretizeData} 4 | \alias{discretizeData} 5 | \title{Discretize numerical variables in a dataset} 6 | \usage{ 7 | discretizeData( 8 | data, 9 | K = 2, 10 | features = NULL, 11 | knames = NULL, 12 | return_split = FALSE 13 | ) 14 | } 15 | \arguments{ 16 | \item{data}{data to discretize.} 17 | 18 | \item{K}{numeric, number of categories (default: K = 2).} 19 | 20 | \item{features}{vector with variable names or column numbers to discretize. If NULL (default), then all numeric variables are discretized.} 21 | 22 | \item{knames}{optional: character vector of the same length as K, containing the ordered names for categories.} 23 | 24 | \item{return_split}{if TRUE, then the table with thresholds used to discretize data is also returned.} 25 | } 26 | \value{ 27 | Data with discretized variables. 28 | } 29 | \description{ 30 | This function discretizes all numerical variables, or only the ones passed in features, into K categories based on their quantiles. 31 | Names of categories can optionally be given in knames. 
32 | } 33 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/decisionImportance.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/decisionImportance.R 3 | \name{decisionImportance} 4 | \alias{decisionImportance} 5 | \title{Compute the importance of decisions} 6 | \usage{ 7 | decisionImportance( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | in_parallel = FALSE, 13 | n_cores = detectCores() - 1, 14 | cluster = NULL 15 | ) 16 | } 17 | \arguments{ 18 | \item{data}{data from which to get the decision support.} 19 | 20 | \item{target}{response variable.} 21 | 22 | \item{classPos}{if classification, the positive class.} 23 | 24 | \item{in_parallel}{if TRUE, the function is run in parallel.} 25 | 26 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 27 | 28 | \item{cluster}{the cluster to use to run the function in parallel.} 29 | 30 | \item{rules}{a data.frame with a column "condition" or a vector with name "condition".} 31 | } 32 | \value{ 33 | the data.frame passed in rules with the gain and importance of each decision. 34 | } 35 | \description{ 36 | This function computes the importance of decisions. 37 | } 38 | -------------------------------------------------------------------------------- /R/discretizeVector_model.R: -------------------------------------------------------------------------------- 1 | discretizeVector_model <- function(l_var, return_all = FALSE) { 2 | 3 | # sanity check 4 | if (length(l_var$thr) == 0) { 5 | return(l_var) 6 | } 7 | 8 | # set category names 9 | numSplit <- length(l_var$thr) 10 | if (numSplit == 1) { 11 | vNames <- c("Low", "High") 12 | } else if (numSplit == 2) { 13 | vNames <- c("Low", "Medium", "High") 14 | } else if (numSplit == 3) { 15 | vNames <- c("Very_Low", "Low", "High", "Very_High") 16 | } else if (numSplit == 4) { 17 | vNames <- c("Very_Low", "Low", "Medium", "High", "Very_High") 18 | } else { 19 | vNames <- as.character(1:(numSplit + 1)) 20 | } 21 | 22 | 23 | newV <- rep(vNames[numSplit + 1], length(l_var$var_v)) 24 | 25 | for (j in seq(numSplit, 1, by = -1)) { 26 | newV[which(l_var$var_v <= l_var$thr[j])] <- vNames[j] 27 | } 28 | 29 | if (return_all == TRUE) { 30 | l_var$med <- lapply(unique(newV), function(x, l_var) { 31 | i <- which(newV == x) 32 | return(median(l_var$var_v[i])) 33 | }, l_var = l_var) 34 | names(l_var$med) <- unique(newV) 35 | l_var$var_d <- newV 36 | return(l_var) 37 | } else { 38 | return(newV) 39 | } 40 | } 41 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(aggregateTaxa) 4 | export(changeDecisionsDummies) 5 | export(compatibleNames) 6 | export(decisionImportance) 7 | export(decisions2FullDummy) 8 | export(discretizeData) 9 | export(discretizeData_model) 10 | export(discretizeDecisions) 11 | export(evaluateAlpha) 12 | export(featureImportance) 13 | export(filterDecisionsImportances) 14 | export(getDecisionsMetrics) 15 | export(getNetwork) 16 | export(getThresholds) 17 | export(model2DE) 18 | export(model2DE_cluster) 19 | export(model2DE_resampling) 20 | export(plotFeatures) 21 | export(plotNetwork) 22 | export(preCluster) 23 | export(pruneDecisions) 24 | export(stabilitySelection) 25 | 
import(caret) 26 | import(clustermq) 27 | import(data.table) 28 | import(dplyr, except = c(union, as_data_frame, groups, combine, slice, filter, lag, last, first, between)) 29 | import(ggplot2, except = margin) 30 | import(ggraph) 31 | import(igraph, except = c(is_named, decompose, spectrum)) 32 | import(inTrees) 33 | import(parallel) 34 | import(randomForest, except = importance) 35 | import(ranger) 36 | import(rlang, except = ':=') 37 | import(stats) 38 | import(stringr) 39 | import(tidyverse) 40 | import(utils) 41 | import(xgboost) 42 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/discretizeVector_model.R: -------------------------------------------------------------------------------- 1 | discretizeVector_model <- function(l_var, return_all = FALSE) { 2 | 3 | # sanity check 4 | if (length(l_var$thr) == 0) { 5 | return(l_var) 6 | } 7 | 8 | # set category names 9 | numSplit <- length(l_var$thr) 10 | if (numSplit == 1) { 11 | vNames <- c("Low", "High") 12 | } else if (numSplit == 2) { 13 | vNames <- c("Low", "Medium", "High") 14 | } else if (numSplit == 3) { 15 | vNames <- c("Very_Low", "Low", "High", "Very_High") 16 | } else if (numSplit == 4) { 17 | vNames <- c("Very_Low", "Low", "Medium", "High", "Very_High") 18 | } else { 19 | vNames <- as.character(1:(numSplit + 1)) 20 | } 21 | 22 | 23 | newV <- rep(vNames[numSplit + 1], length(l_var$var_v)) 24 | 25 | for (j in seq(numSplit, 1, by = -1)) { 26 | newV[which(l_var$var_v <= l_var$thr[j])] <- vNames[j] 27 | } 28 | 29 | if (return_all == TRUE) { 30 | l_var$med <- lapply(unique(newV), function(x, l_var) { 31 | i <- which(newV == x) 32 | return(median(l_var$var_v[i])) 33 | }, l_var = l_var) 34 | names(l_var$med) <- unique(newV) 35 | l_var$var_d <- newV 36 | return(l_var) 37 | } else { 38 | return(newV) 39 | } 40 | } 41 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/filterDecisionsImportances.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/filterDecisionsImportances.R 3 | \name{filterDecisionsImportances} 4 | \alias{filterDecisionsImportances} 5 | \title{Filter decisions according to their metrics} 6 | \usage{ 7 | filterDecisionsImportances(rules, min_imp = 0.7) 8 | } 9 | \arguments{ 10 | \item{rules}{data.frame corresponding to the decisions, with all their metrics.} 11 | 12 | \item{min_imp}{minimal relative importance of the decisions that must be kept; the threshold used to remove decisions therefore never exceeds max(imp)*min_imp.} 13 | } 14 | \value{ 15 | The decision ensemble from which decisions with the lowest errors and/or importances have been removed, or are indicated in a column "filt_err"/"filt_imp". 16 | } 17 | \description{ 18 | This function filters decisions heuristically according to their importance and multiplicity. 19 | A relative importance threshold is calculated that maximises both the average of the product (relative importance * n) and the number of decisions removed. 20 | All decisions with a relative importance above that threshold are kept; the argument min_imp sets the minimal relative importance that a kept decision can have. 
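}
\examples{
\dontrun{
# Hedged sketch: 'rules' is assumed to already carry the metrics computed
# by getDecisionsMetrics(), in particular the importance column 'imp'.
rules_filt <- filterDecisionsImportances(rules, min_imp = 0.7)
}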
21 | } 22 | -------------------------------------------------------------------------------- /.github/workflows/ci.yml: -------------------------------------------------------------------------------- 1 | name: EndoR 2 | 3 | on: 4 | push: 5 | branches: [ main ] 6 | pull_request: 7 | branches: [ main ] 8 | 9 | jobs: 10 | build: 11 | name: build (${{ matrix.r-version }}, ubuntu-latest) 12 | runs-on: ubuntu-latest 13 | strategy: 14 | matrix: 15 | r-version: [4.1] 16 | steps: 17 | - uses: conda-incubator/setup-miniconda@v2 18 | with: 19 | miniconda-version: 'latest' 20 | auto-update-conda: true 21 | python-version: '3.10' 22 | channels: conda-forge,bioconda,defaults 23 | activate-environment: endor_env 24 | - name: conda env setup 25 | shell: bash -l {0} 26 | run: | 27 | conda install -y mamba 28 | mamba install -y r-base=${{ matrix.r-version }} r-renv r-nloptr r-data.table r-dplyr r-ggplot2 r-ggraph r-igraph r-stringr 29 | - uses: actions/checkout@v2 30 | - name: renv install 31 | shell: bash -l {0} 32 | run: | 33 | R -e 'renv::install("rstudio/renv")' 34 | R -e 'renv::install("softwaredeng/inTrees/inTrees.Rcheck/inTrees")' 35 | R -e 'renv::install("nick-youngblut/endoR")' 36 | - name: Package unit tests 37 | shell: bash -l {0} 38 | run: | 39 | R -e 'library(endoR)' 40 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/.github/workflows/ci.yml: -------------------------------------------------------------------------------- 1 | name: EndoR 2 | 3 | on: 4 | push: 5 | branches: [ main ] 6 | pull_request: 7 | branches: [ main ] 8 | 9 | jobs: 10 | build: 11 | name: build (${{ matrix.r-version }}, ubuntu-latest) 12 | runs-on: ubuntu-latest 13 | strategy: 14 | matrix: 15 | r-version: [4.1] 16 | steps: 17 | - uses: conda-incubator/setup-miniconda@v2 18 | with: 19 | miniconda-version: 'latest' 20 | auto-update-conda: true 21 | python-version: '3.10' 22 | channels: conda-forge,bioconda,defaults 23 | activate-environment: endor_env 24 | - name: conda env setup 25 | shell: bash -l {0} 26 | run: | 27 | conda install -y mamba 28 | mamba install -y r-base=${{ matrix.r-version }} r-renv r-nloptr r-data.table r-dplyr r-ggplot2 r-ggraph r-igraph r-stringr 29 | - uses: actions/checkout@v2 30 | - name: renv install 31 | shell: bash -l {0} 32 | run: | 33 | R -e 'renv::install("rstudio/renv")' 34 | R -e 'renv::install("softwaredeng/inTrees/inTrees.Rcheck/inTrees")' 35 | R -e 'renv::install("nick-youngblut/endoR")' 36 | - name: Package unit tests 37 | shell: bash -l {0} 38 | run: | 39 | R -e 'library(endoR)' 40 | -------------------------------------------------------------------------------- /man/changeDecisionsDummies.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/changeDecisionsDummies.R 3 | \name{changeDecisionsDummies} 4 | \alias{changeDecisionsDummies} 5 | \title{Modify decisions so that only one level of a multiclass variable is used in decisions} 6 | \usage{ 7 | changeDecisionsDummies( 8 | rules, 9 | dummy_var, 10 | data, 11 | target, 12 | classPos = NULL, 13 | in_parallel = FALSE, 14 | n_cores = detectCores() - 1, 15 | cluster = NULL 16 | ) 17 | } 18 | \arguments{ 19 | \item{rules}{a data frame with a column "condition".} 20 | 21 | \item{dummy_var}{string vector with the names of columns to change to dummy variable.} 22 | 23 | \item{data}{data on which to fit the decision ensemble.} 24 | 25 | 
\item{target}{response variable.} 26 | 27 | \item{classPos}{for classification, the positive class.} 28 | 29 | \item{in_parallel}{if TRUE, the function is run in parallel.} 30 | 31 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 32 | 33 | \item{cluster}{the cluster to use to run the function in parallel (optional).} 34 | } 35 | \description{ 36 | Modify decisions so that only one level of a multiclass variable is used in decisions 37 | } 38 | -------------------------------------------------------------------------------- /R/model2DE_cluster.R: -------------------------------------------------------------------------------- 1 | #' Run model2DE on several bootstrap resamples in parallel. 2 | #' 3 | #' Function to use with the Q() function from the clustermq R-package with the following arguments in export: 4 | #' data, target, exec, classPos, dummy_var, prune, maxDecay, typeDecay, filter, in_parallel, n_cores. 5 | #' See preCluster() to obtain the list of bootstrap resamples, the discretized data, and the exec dataframe with decisions. 6 | #' 7 | #' @param partition a vector with row numbers to subset data. 8 | #' @example examples/iris_bootstraps.R 9 | #' @export 10 | model2DE_cluster <- function(partition) { 11 | #library(data.table) 12 | res <- model2DE( 13 | data = data[partition, ], target = target[partition], 14 | exec = exec, 15 | classPos = classPos, dummy_var = dummy_var, 16 | prune = prune, maxDecay = maxDecay, typeDecay = typeDecay, 17 | filter = filter, min_imp = 1, 18 | in_parallel = in_parallel, n_cores = n_cores, 19 | light = TRUE 20 | ) 21 | 22 | # get the position of the last set of computed rules 23 | tmp <- str_which(names(res), pattern = 'rules') 24 | tmp <- tmp[length(tmp)] 25 | 26 | res <- list( 27 | "pdecisions" = res$n_decisions, 28 | "rules" = res$rules, 29 | "nodes_agg" = res$nodes_agg, "edges_agg" = res$edges_agg 30 | ) 31 | return(res) 32 | } 33 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/plotNetwork.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotNetwork.R 3 | \name{plotNetwork} 4 | \alias{plotNetwork} 5 | \title{Plot an interaction variable network.} 6 | \usage{ 7 | plotNetwork( 8 | decision_ensemble, 9 | path_length = Inf, 10 | layout = "stress", 11 | colour_edge_low = "#E69F00", 12 | colour_edge_mid = "grey87", 13 | colour_edge_high = "#0072B2", 14 | colour_node_low = "#E69F00", 15 | colour_node_mid = "grey87", 16 | colour_node_high = "#0072B2", 17 | text_size = 4, 18 | hide_isolated_nodes = TRUE, 19 | seed = 0 20 | ) 21 | } 22 | \arguments{ 23 | \item{decision_ensemble}{stable decision ensemble: list with edges, nodes, etc.} 24 | 25 | \item{path_length}{maximal number of edges between 2 nodes, default = Inf.} 26 | 27 | \item{layout}{network layout, default is 'stress' (see ggraph package)} 28 | 29 | \item{text_size}{size of node labels.} 30 | 31 | \item{hide_isolated_nodes}{logical, default = TRUE (= nodes without any edge are not shown).} 32 | 33 | \item{seed}{the seed to use for generating the network.} 34 | 35 | \item{colour_x_y}{colour of x = edge or node, and y = low, mid or high (e.g., colour_edge_mid), to use for the colour gradients.} 36 | } 37 | \value{ 38 | a ggraph object 39 | } 40 | \description{ 41 | Returns a plot of nodes and edges. 
Plots are created with the ggraph and igraph packages. 42 | } 43 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: endoR 2 | Type: Package 3 | Title: Interpret and Visualize Stable Tree Ensemble Models 4 | Version: 0.1.0 5 | Authors@R: c(person("Albane", "Ruaud", email = "albane.ruaud@tuebingen.mpg.de" 6 | , role = c("aut", "cre"), comment = c(ORCID = "0000-0001-5920-1710")) 7 | , person("Houtao", "Deng", role = "ctb")) 8 | Description: Extract and visualize how predictive variables contribute to tree ensemble model accuracy. 9 | You provide the model and data, tell endoR which regularization steps you want in order to obtain stable results, and it will return a plot of the feature importance and influence on the response variable and a decision network. For more details, please refer to our pre-print: Albane Ruaud, Niklas A Pfister, Ruth E Ley, Nicholas D Youngblut. Interpreting tree ensemble machine learning models with endoR. bioRxiv (2022). 10 | License: GPL (>= 3) 11 | Encoding: UTF-8 12 | LazyData: true 13 | Suggests: 14 | knitr, 15 | rmarkdown, 16 | testthat (>= 3.0.0) 17 | Imports: rlang, data.table, dplyr, ggplot2, ggraph, igraph, inTrees, stringr, caret, ggpubr, ranger, xgboost, parallel, clustermq, utils, randomForest, tidyverse 18 | RoxygenNote: 7.2.0 19 | VignetteBuilder: knitr 20 | Depends: 21 | R (>= 2.10) 22 | BugReports: https://github.com/aruaud/endoR/issues 23 | URL: https://github.com/aruaud/endoR 24 | Config/testthat/edition: 3 25 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: endoR 2 | Type: Package 3 | Title: Interpret and Visualize Stable Tree Ensemble Models 4 | Version: 0.1.0 5 | Authors@R: c(person("Albane", "Ruaud", email = "albane.ruaud@tuebingen.mpg.de" 6 | , role = c("aut", "cre"), comment = c(ORCID = "0000-0001-5920-1710")) 7 | , person("Houtao", "Deng", role = "ctb")) 8 | Description: endoR extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 9 | You provide the model and data, tell endoR which regularization steps you want in order to obtain stable results, and it will return a plot of the feature importance and influence on the response variable and a decision network. 
10 | License: GPL (>= 3) 11 | Encoding: UTF-8 12 | LazyData: true 13 | Suggests: knitr, rmarkdown, ranger, xgboost, parallel, testthat (>= 14 | 3.0.0) 15 | Imports: data.table, dplyr, ggplot2, ggraph, igraph, inTrees, stringr, 16 | caret, ggpubr 17 | RoxygenNote: 7.1.1 18 | VignetteBuilder: knitr 19 | Depends: R (>= 2.10) 20 | BugReports: https://github.com/aruaud/endoR/issues 21 | URL: https://github.com/aruaud/endoR 22 | Config/testthat/edition: 3 23 | NeedsCompilation: no 24 | Packaged: 2022-08-11 14:46:14 UTC; Albane Ruaud 25 | Author: Albane Ruaud [aut, cre] (), 26 | Houtao Deng [ctb] 27 | Maintainer: Albane Ruaud 28 | -------------------------------------------------------------------------------- /examples/iris_basic.R: -------------------------------------------------------------------------------- 1 | library(randomForest) 2 | library(caret) 3 | 4 | # import data and fit model 5 | data(iris) 6 | mod <- randomForest(Species ~ ., data = iris) 7 | 8 | # Fit a decision ensemble to predict the species setosa (vs versicolor and 9 | # virginica): no regularization (no decision pruning, discretization, 10 | # bootstrapping, or decision filtering) 11 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 12 | , target = iris$Species, classPos = "setosa" 13 | , filter = FALSE, discretize = FALSE, prune = FALSE) 14 | 15 | # Only decision pruning (default = TRUE) and discretization (default in 2 16 | # categories, we want 3 categories so we change K); no bootstrapping or 17 | # decision filtering. 18 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 19 | , target = iris$Species, classPos = "setosa" 20 | , filter = FALSE, discretize = TRUE, K = 3) 21 | 22 | # idem but run it in parallel on 2 threads 23 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 24 | , target = iris$Species, classPos = "setosa" 25 | , filter = FALSE, discretize = TRUE, K = 3 26 | , in_parallel = TRUE, n_cores = 2) 27 | 28 | # Plot the decision ensemble: 29 | # Plants from the setosa species have small petals and long, narrow sepals. 30 | plotFeatures(endo_setosa, levels_order = c("Low", "Medium", "High")) 31 | plotNetwork(endo_setosa, hide_isolated_nodes = FALSE, layout = "fr") 32 | -------------------------------------------------------------------------------- /R/discretizeVector.R: -------------------------------------------------------------------------------- 1 | # modified from discretizeVector() of the inTrees package. 
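# Arguments: v, numeric vector to discretize; K, number of quantile-based
# categories; knames, optional ordered category names (length K); return_all,
# if TRUE also return the named vector of split points.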
2 | discretizeVector <- 3 | function(v, K = 5, knames = NULL, return_all = FALSE) { 4 | 5 | # set category names 6 | if (is.null(knames) == TRUE) { 7 | if (K == 2) { 8 | knames <- c("min", "Low", "High") 9 | } else if (K == 3) { 10 | knames <- c("min", "Low", "Medium", "High") 11 | } else if (K == 4) { 12 | knames <- c("min", "veryLow", "Low", "High", "veryHigh") 13 | } else if (K == 5) { 14 | knames <- c("min", "veryLow", "Low", "Medium", "High", "veryHigh") 15 | } else { 16 | knames <- c("min", paste("L", seq_len(K), sep = "")) 17 | } 18 | } else if (length(knames) == K) { 19 | knames <- c("min", knames) 20 | } 21 | 22 | 23 | splitV <- quantile(v, probs = seq(0, 1, 1 / K), na.rm = TRUE, names = TRUE, type = 3) 24 | names(splitV) <- knames 25 | 26 | numSplit <- length(splitV) 27 | if (numSplit < 2) { 28 | return(v) 29 | } 30 | 31 | newV <- vector("character", length(v)) 32 | newV[which(v <= splitV[2])] <- names(splitV)[2] 33 | 34 | if (numSplit >= 3) { 35 | for (jj in 3:numSplit) { 36 | newV[which(v > splitV[jj - 1] & v <= splitV[jj])] <- names(splitV)[jj] 37 | } 38 | } 39 | 40 | splitV <- splitV[c("min", unique(newV))] 41 | 42 | if (return_all == TRUE) { 43 | return(list("newV" = newV, "splitV" = splitV[order(splitV)])) 44 | } else { 45 | return(newV) 46 | } 47 | } 48 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/getNetwork.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getNetwork.R 3 | \name{getNetwork} 4 | \alias{getNetwork} 5 | \title{Transform a decision ensemble into a network} 6 | \usage{ 7 | getNetwork( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | additional_decisions = NULL, 13 | aggregate_taxa = FALSE, 14 | taxa = NULL, 15 | type = "coarse", 16 | in_parallel = FALSE, 17 | n_cores = detectCores() - 1, 18 | cluster = NULL 19 | ) 20 | } 21 | \arguments{ 22 | \item{rules}{the decision ensemble.} 23 | 24 | \item{data}{data from which to measure characteristics.} 25 | 26 | \item{target}{response variable.} 27 | 28 | \item{classPos}{the positive class predicted by decisions.} 29 | 30 | \item{in_parallel}{if TRUE, the function is run in parallel.} 31 | 32 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 33 | 34 | \item{cluster}{the cluster to use to run the function in parallel.} 35 | } 36 | \value{ 37 | A list with the nodes and edges dataframes containing the feature and interaction importances and influences; the decision-wise feature and interaction importances and influences are contained in the nodes_agg and edges_agg dataframes. 38 | } 39 | \description{ 40 | Takes a decision ensemble and measures the importance and influence of each feature and pair of features to create a network. 41 | For categorical variables or discretized ones, the importance and influence are calculated per level. See featureImportance to obtain the overall feature importance. 42 | } 43 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/discretizeVector.R: -------------------------------------------------------------------------------- 1 | # modified from discretizeVector() of the inTrees package. 
2 | discretizeVector <- 3 | function(v, K = 5, knames = NULL, return_all = FALSE) { 4 | 5 | # set category names 6 | if (is.null(knames) == TRUE) { 7 | if (K == 2) { 8 | knames <- c("min", "Low", "High") 9 | } else if (K == 3) { 10 | knames <- c("min", "Low", "Medium", "High") 11 | } else if (K == 4) { 12 | knames <- c("min", "veryLow", "Low", "High", "veryHigh") 13 | } else if (K == 5) { 14 | knames <- c("min", "veryLow", "Low", "Medium", "High", "veryHigh") 15 | } else { 16 | knames <- c("min", paste("L", seq_len(K), sep = "")) 17 | } 18 | } else if (length(knames) == K) { 19 | knames <- c("min", knames) 20 | } 21 | 22 | 23 | splitV <- quantile(v, probs = seq(0, 1, 1 / K), na.rm = TRUE, names = TRUE, type = 3) 24 | names(splitV) <- knames 25 | 26 | numSplit <- length(splitV) 27 | if (numSplit < 2) { 28 | return(v) 29 | } 30 | 31 | newV <- vector("character", length(v)) 32 | newV[which(v <= splitV[2])] <- names(splitV)[2] 33 | 34 | if (numSplit >= 3) { 35 | for (jj in 3:numSplit) { 36 | newV[which(v > splitV[jj - 1] & v <= splitV[jj])] <- names(splitV)[jj] 37 | } 38 | } 39 | 40 | splitV <- splitV[c("min", unique(newV))] 41 | 42 | if (return_all == TRUE) { 43 | return(list("newV" = newV, "splitV" = splitV[order(splitV)])) 44 | } else { 45 | return(newV) 46 | } 47 | } 48 | -------------------------------------------------------------------------------- /R/aggregateTaxa_fine.R: -------------------------------------------------------------------------------- 1 | aggregateTaxa_fine <- function(taxa) { 2 | ### taxa: the elongated features (from elongateTaxa), the longFeature column is not necessary 3 | 4 | if (!("data.table" %in% class(taxa))) { 5 | taxa <- as.data.table(taxa)[, Feature := as.character(Feature)] 6 | } 7 | # create the columns for the updating 8 | taxa <- taxa[, `:=`(newFeature = Feature, ix = .I)] 9 | tax_col <- which(colnames(taxa) %in% c("f", "g", "s")) 10 | 11 | ### First: at the genus level 12 | not_s <- taxa[is.na(s) & !is.na(g), ix] 13 | for (i in not_s) { 14 | # subset the species of the same genus 15 | tmp <- taxa[["g"]][i] 16 | fine <- taxa[g == tmp & !is.na(s), newFeature] 17 | if (length(fine) == 1) { 18 | taxa <- taxa[ix == i, newFeature := fine] 19 | } 20 | fine <- NULL 21 | } 22 | 23 | ### Then: at the family level 24 | not_s <- taxa[is.na(g), ix] 25 | for (i in not_s) { 26 | # subset to the features of the same family 27 | tmp <- taxa[["f"]][i] 28 | fine <- unique(taxa[f == tmp & !is.na(g), newFeature]) 29 | if (length(fine) == 1) { 30 | taxa <- taxa[ix == i, newFeature := fine] 31 | } 32 | fine <- NULL 33 | } 34 | 35 | # record how many times a newFeature has been attributed and who was changed 36 | taxa <- taxa[, "n" := .N, by = newFeature][, "changed" := "Unchanged"][ 37 | n > 1 & newFeature == Feature, changed := "Recipient" 38 | ][ 39 | n > 1 & newFeature != Feature, changed := "Upgraded" 40 | ][ 41 | , changed := as.factor(changed) 42 | ] 43 | 44 | return(taxa) 45 | } 46 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/pruneDecisions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/pruneDecisions.R 3 | \name{pruneDecisions} 4 | \alias{pruneDecisions} 5 | \title{Prune rules of a decision ensemble} 6 | \usage{ 7 | pruneDecisions( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | maxDecay = 0, 13 | typeDecay = 1, 14 | in_parallel = FALSE, 15 | n_cores = 
detectCores() - 1, 16 | cluster = NULL 17 | ) 18 | } 19 | \arguments{ 20 | \item{rules}{a data frame with a column "condition".} 21 | 22 | \item{data}{data to use for calculating the metrics.} 23 | 24 | \item{target}{response variable.} 25 | 26 | \item{classPos}{for classification tasks, the positive class predicted.} 27 | 28 | \item{maxDecay}{threshold for the increase in error; if maxDecay = -Inf, no pruning is done; if maxDecay = 0, only variables increasing the error are pruned from decisions.} 29 | 30 | \item{typeDecay}{if typeDecay = 1 (default), the absolute increase in error is computed; the relative one is computed if typeDecay = 2.} 31 | 32 | \item{in_parallel}{if TRUE, the function is run in parallel.} 33 | 34 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 35 | 36 | \item{cluster}{the cluster to use to run the function in parallel.} 37 | } 38 | \value{ 39 | Decision ensemble with pruned conditions. 40 | } 41 | \description{ 42 | This function removes from rules the variables that do not increase the rule error by more than a certain threshold. See the pruneRules function from the inTrees package (Deng, 2019) for more details. 43 | } 44 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/aggregateTaxa_fine.R: -------------------------------------------------------------------------------- 1 | aggregateTaxa_fine <- function(taxa) { 2 | ### taxa: the elongated features (from elongateTaxa), the longFeature column is not necessary 3 | 4 | if (!("data.table" %in% class(taxa))) { 5 | taxa <- as.data.table(taxa)[, Feature := as.character(Feature)] 6 | } 7 | # create the columns for the updating 8 | taxa <- taxa[, `:=`(newFeature = Feature, ix = .I)] 9 | tax_col <- which(colnames(taxa) %in% c("f", "g", "s")) 10 | 11 | ### First: at the genus level 12 | not_s <- taxa[is.na(s) & !is.na(g), ix] 13 | for (i in not_s) { 14 | # subset the species of the same genus 15 | tmp <- taxa[["g"]][i] 16 | fine <- taxa[g == tmp & !is.na(s), newFeature] 17 | if (length(fine) == 1) { 18 | taxa <- taxa[ix == i, newFeature := fine] 19 | } 20 | fine <- NULL 21 | } 22 | 23 | ### Then: at the family level 24 | not_s <- taxa[is.na(g), ix] 25 | for (i in not_s) { 26 | # subset to the features of the same family 27 | tmp <- taxa[["f"]][i] 28 | fine <- unique(taxa[f == tmp & !is.na(g), newFeature]) 29 | if (length(fine) == 1) { 30 | taxa <- taxa[ix == i, newFeature := fine] 31 | } 32 | fine <- NULL 33 | } 34 | 35 | # record how many times a newFeature has been attributed and who was changed 36 | taxa <- taxa[, "n" := .N, by = newFeature][, "changed" := "Unchanged"][ 37 | n > 1 & newFeature == Feature, changed := "Recipient" 38 | ][ 39 | n > 1 & newFeature != Feature, changed := "Upgraded" 40 | ][ 41 | , changed := as.factor(changed) 42 | ] 43 | 44 | return(taxa) 45 | } 46 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/getDecisionsMetrics.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getDecisionsMetric.R 3 | \name{getDecisionsMetrics} 4 | \alias{getDecisionsMetrics} 5 | \title{Measure the error, prediction and importance of decisions} 6 | \usage{ 7 | getDecisionsMetrics( 8 | ruleExec, 9 | data, 10 | target, 11 | classPos = NULL, 12 | importances = TRUE, 13 | in_parallel = FALSE, 14 | 
n_cores = detectCores() - 1, 15 | cluster = NULL 16 | ) 17 | } 18 | \arguments{ 19 | \item{ruleExec}{a vector with name "condition" or a data.frame with a column "condition".} 20 | 21 | \item{data}{data from which to get the decision support.} 22 | 23 | \item{target}{response variable.} 24 | 25 | \item{classPos}{for classification tasks, the positive class to be predicted by decisions.} 26 | 27 | \item{importances}{if FALSE, the importances are not calculated (importances = TRUE by default).} 28 | 29 | \item{in_parallel}{if TRUE, the function is run in parallel.} 30 | 31 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 32 | 33 | \item{cluster}{the cluster to use to run the function in parallel.} 34 | } 35 | \value{ 36 | a datatable with the rule (column "condition"), error ("err"), prediction ("pred"), support ("support"), and number of variables in the decision rule ("len"). Columns "gain" and "imp" with the gain and importance of the decision are added if importances were calculated. 37 | } 38 | \description{ 39 | This function measures the prediction and error on the response variable of each decision on its support in the data passed. The importance is calculated by default but this can be switched off. 40 | } 41 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/discretizeData_model.R: -------------------------------------------------------------------------------- 1 | #' Discretize numerical variables in a dataset based on thresholds used in the model to create splits. 2 | #' 3 | #' This function discretizes all numerical variables into Kmax categories based on the splits used by the model for each variable. 4 | #' 5 | #' @param data data to discretize. 6 | #' @param Kmax numeric, maximal number of categories for each variable (default: Kmax = 2). 7 | #' @param return_split if TRUE, then the table with thresholds used to discretize data is also returned. 8 | #' @return Data with discretized variables. 
9 | #' @export 10 | discretizeData_model <- function(data, conditions, Kmax = 2, return_split = FALSE) { 11 | 12 | # get the thresholds 13 | l_var <- getThresholds(conditions = conditions, data = data, Kmax = Kmax) 14 | colNb <- names(l_var) 15 | 16 | # discretize the variable vectors 17 | l_var <- lapply(l_var, discretizeVector_model, return_all = TRUE) 18 | names(l_var) <- colNb 19 | colNb <- as.integer(colNb) 20 | 21 | data_ctg <- as.data.table(data) 22 | 23 | for (j in colNb) set(data_ctg, i = NULL, j = j, l_var[[as.character(j)]]$var_d) 24 | 25 | data_ctg <- data_ctg[, (colNb) := lapply(.SD, as.factor), .SDcols = colNb] 26 | 27 | 28 | # get the split table if asked 29 | if (return_split == TRUE) { 30 | splitV <- lapply(l_var, function(x) { 31 | x$thr 32 | }) 33 | names(splitV) <- names(l_var) 34 | 35 | splitV_med <- lapply(l_var, function(x) { 36 | x$med 37 | }) 38 | names(splitV_med) <- names(splitV) 39 | 40 | return(list("data_ctg" = data_ctg, "splitV" = splitV, "splitV_med" = splitV_med)) 41 | } 42 | 43 | 44 | return(data_ctg) 45 | } 46 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/model2DE_resampling.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/model2DE_resampling.R 3 | \name{model2DE_resampling} 4 | \alias{model2DE_resampling} 5 | \title{Run model2DE on several bootstrap resamples.} 6 | \usage{ 7 | model2DE_resampling( 8 | model, 9 | model_type, 10 | data, 11 | target, 12 | classPos = NULL, 13 | times = 10, 14 | p = 0.5, 15 | sample_weight = NULL, 16 | ntree = "all", 17 | maxdepth = Inf, 18 | dummy_var = NULL, 19 | prune = TRUE, 20 | maxDecay = 0.05, 21 | typeDecay = 2, 22 | discretize = TRUE, 23 | K = 2, 24 | mode = "data", 25 | filter = TRUE, 26 | min_imp = 0.9, 27 | seed = 0, 28 | in_parallel = FALSE, 29 | n_cores = detectCores() - 1, 30 | cluster = NULL 31 | ) 32 | } 33 | \arguments{ 34 | \item{times}{number of bootstraps.} 35 | 36 | \item{p}{fraction of data to resample.} 37 | 38 | \item{sample_weight}{numeric vector with the weights of samples for bootstrap resampling. For classification, if 2 values are given, the 1st one is assumed to be for the positive class (classPos argument).} 39 | 40 | \item{in_parallel}{if TRUE, the function is run in parallel} 41 | 42 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use} 43 | 44 | \item{cluster}{the cluster to use to run the function in parallel} 45 | 46 | \item{...}{arguments to be passed to the model2DE function.} 47 | } 48 | \value{ 49 | A list with the row numbers of partitioned data, the rules originally extracted from the model, a list with results from each bootstrap (use stabilitySelection to obtain the stable decision ensemble). 50 | } 51 | \description{ 52 | Wrapper around the model2DE function to run it on several bootstrap resamples. 
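Use stabilitySelection on the returned list of bootstrap results to obtain the stable decision ensemble.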
53 | } 54 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/Misc.R: -------------------------------------------------------------------------------- 1 | taxa2vector <- function(taxa_table) { 2 | 3 | ### taxa_table is a data frame with taxonomic levels as columns 4 | # Add the previous levels to each one of them 5 | taxa_table$k <- paste("k_", taxa_table$k, sep = "") 6 | taxa_table$p <- paste(taxa_table$k, taxa_table$p, sep = "__p_") 7 | taxa_table$c <- paste(taxa_table$p, taxa_table$c, sep = "__c_") 8 | taxa_table$o <- paste(taxa_table$c, taxa_table$o, sep = "__o_") 9 | taxa_table$f <- paste(taxa_table$o, taxa_table$f, sep = "__f_") 10 | taxa_table$g <- paste(taxa_table$f, taxa_table$g, sep = "__g_") 11 | taxa_table$s <- paste(taxa_table$g, taxa_table$s, sep = "__s_") 12 | 13 | ### Vectorize the taxa_table table 14 | tax_vector <- unique(unlist(setDT(taxa_table)[, .(k, p, c, o, f, g, s)])) 15 | 16 | ### Text formatting: 17 | # no punctuation or weird symbols, only '_' 18 | 19 | tax_vector <- str_replace_all(tax_vector, pattern = "[:punct:]", replacement = "_") 20 | tax_vector <- str_replace_all(tax_vector, pattern = " ", replacement = "_") 21 | tax_vector <- str_replace_all(tax_vector, pattern = "__", replacement = "_") 22 | 23 | # levels are separated by 2x'_' and the initial of the level 24 | tax_vector <- str_replace_all(tax_vector, pattern = "\\_(?=(p|c|o|f|g|s){1}_)", replacement = "__") 25 | 26 | return(tax_vector) 27 | } 28 | 29 | 30 | #' Transform character strings to be compatible with endoR functions. 31 | #' @export 32 | compatibleNames <- function(x) { 33 | x %>% 34 | str_replace_all(pattern = "[:punct:]", replacement = "_") %>% 35 | str_replace_all(pattern = " ", replacement = "_") %>% 36 | str_replace_all(pattern = "\\_+", replacement = "_") %>% 37 | str_replace_all(pattern = "\\_$", replacement = "") 38 | } 39 | -------------------------------------------------------------------------------- /R/discretizeData_model.R: -------------------------------------------------------------------------------- 1 | #' Discretize numerical variables in a dataset based on thresholds used in the model to create splits. 2 | #' 3 | #' This function discretizes all numerical variables into Kmax categories based on the splits used by the model for each variable. 4 | #' 5 | #' @param data data to discretize 6 | #' @param conditions character vector with all conditions from which to find the thresholds 7 | #' @param Kmax numeric, maximal number of categories for each variable (default: Kmax = 2). 8 | #' @param return_split if TRUE, then the table with thresholds used to discretize data is also returned. 9 | #' @return Data with discretized variables. 
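#' @examples
#' \dontrun{
#' # A hedged sketch: `de` is assumed to be a decision ensemble with a
#' # 'condition' column, e.g. the exec element returned by preCluster()
#' iris_d <- discretizeData_model(data = iris[, -5], conditions = de$condition, Kmax = 2)
#' }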
10 | #' @export 11 | discretizeData_model <- function(data, conditions, Kmax = 2, return_split = FALSE) { 12 | 13 | # get the thresholds 14 | l_var <- getThresholds(conditions = conditions, data = data, Kmax = Kmax) 15 | colNb <- names(l_var) 16 | 17 | # discretize the variable vectors 18 | l_var <- lapply(l_var, discretizeVector_model, return_all = TRUE) 19 | names(l_var) <- colNb 20 | colNb <- as.integer(colNb) 21 | 22 | data_ctg <- as.data.table(data) 23 | 24 | for (j in colNb) set(data_ctg, i = NULL, j = j, l_var[[as.character(j)]]$var_d) 25 | 26 | data_ctg <- data_ctg[, (colNb) := lapply(.SD, as.factor), .SDcols = colNb] 27 | 28 | 29 | # get the split table if asked 30 | if (return_split == TRUE) { 31 | splitV <- lapply(l_var, function(x) { 32 | x$thr 33 | }) 34 | names(splitV) <- names(l_var) 35 | 36 | splitV_med <- lapply(l_var, function(x) { 37 | x$med 38 | }) 39 | names(splitV_med) <- names(splitV) 40 | 41 | return(list("data_ctg" = data_ctg, "splitV" = splitV, "splitV_med" = splitV_med)) 42 | } 43 | 44 | 45 | return(data_ctg) 46 | } 47 | -------------------------------------------------------------------------------- /man/aggregateTaxa.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/aggregateTaxa.R 3 | \name{aggregateTaxa} 4 | \alias{aggregateTaxa} 5 | \title{Aggregate taxa according to a measure.} 6 | \usage{ 7 | aggregateTaxa( 8 | taxa, 9 | features = NULL, 10 | weights = NULL, 11 | thr = NULL, 12 | type = "coarse" 13 | ) 14 | } 15 | \arguments{ 16 | \item{taxa}{taxa should be a data.frame with the feature name and all its other coarser/finer levels (columns: 'Feature', 'f', 'g', 's')} 17 | 18 | \item{features}{if taxa is a vector or a data.frame that lacks all levels, a vector of the features to be checked.} 19 | 20 | \item{weights}{a data.frame with a column Feature and a column weight; for type = 'both' or 'coarse' only.} 21 | 22 | \item{thr}{numeric, value after which the algorithm should stop looking for better ranks in a taxonomic branch; for type = 'both' or 'coarse' only. If NULL, the median of weights is used.} 23 | 24 | \item{type}{character. If 'coarse', finer levels are aggregated into their coarser one if it has a better rank. If 'fine' then coarser levels are aggregated into a finer level if it has a better rank and is unique, i.e. there is a unique finer level for that coarser level. If 'both', both aggregation steps are sequentially performed.} 25 | } 26 | \value{ 27 | A dataframe with aggregated features in the "Feature" column, and the 'recipient' taxa in the "newFeature" column. 28 | } 29 | \description{ 30 | This function aggregates taxa according to their rank from a measure passed in weights: if a taxon has a lower rank than its coarser level (type = 'coarse'), it is aggregated into the coarser one; if there is a unique finer level with a better rank than the coarser one, then the coarser is aggregated into the finer one. Comparison is done only for the family, genus and species levels. 
31 | } 32 | -------------------------------------------------------------------------------- /R/Misc.R: -------------------------------------------------------------------------------- 1 | taxa2vector <- function(taxa_table) { 2 | 3 | ### taxa_table is a data frame with taxonomic levels as columns 4 | # Add the previous levels to each one of them 5 | taxa_table$k <- paste("k_", taxa_table$k, sep = "") 6 | taxa_table$p <- paste(taxa_table$k, taxa_table$p, sep = "__p_") 7 | taxa_table$c <- paste(taxa_table$p, taxa_table$c, sep = "__c_") 8 | taxa_table$o <- paste(taxa_table$c, taxa_table$o, sep = "__o_") 9 | taxa_table$f <- paste(taxa_table$o, taxa_table$f, sep = "__f_") 10 | taxa_table$g <- paste(taxa_table$f, taxa_table$g, sep = "__g_") 11 | taxa_table$s <- paste(taxa_table$g, taxa_table$s, sep = "__s_") 12 | 13 | ### Vectorize the taxa_table table 14 | tax_vector <- unique(unlist(setDT(taxa_table)[, list(k, p, c, o, f, g, s)])) 15 | 16 | ### Text formatting: 17 | # no punctuation or weird symbols, only '_' 18 | 19 | tax_vector <- str_replace_all(tax_vector, pattern = "[:punct:]", replacement = "_") 20 | tax_vector <- str_replace_all(tax_vector, pattern = " ", replacement = "_") 21 | tax_vector <- str_replace_all(tax_vector, pattern = "__", replacement = "_") 22 | 23 | # levels are separated by 2x'_' and the initial of the level 24 | tax_vector <- str_replace_all(tax_vector, pattern = "\\_(?=(p|c|o|f|g|s){1}_)", replacement = "__") 25 | 26 | return(tax_vector) 27 | } 28 | 29 | 30 | #' Transform character strings to be compatible with endoR functions. 31 | #' @param x character string or vector 32 | #' @examples 33 | #' x <- c('hello.world__I am;happy', 'me-2') 34 | #' compatibleNames(x) 35 | #' @export 36 | compatibleNames <- function(x) { 37 | x %>% 38 | str_replace_all(pattern = "[:punct:]", replacement = "_") %>% 39 | str_replace_all(pattern = " ", replacement = "_") %>% 40 | str_replace_all(pattern = "\\_+", replacement = "_") %>% 41 | str_replace_all(pattern = "\\_$", replacement = "") 42 | } 43 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/aggregateTaxa.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/aggregateTaxa.R 3 | \name{aggregateTaxa} 4 | \alias{aggregateTaxa} 5 | \title{Aggregate taxa according to a measure.} 6 | \usage{ 7 | aggregateTaxa( 8 | taxa, 9 | features = NULL, 10 | weights = NULL, 11 | thr = NULL, 12 | type = "coarse" 13 | ) 14 | } 15 | \arguments{ 16 | \item{taxa}{taxa should be a vector with the long format of the features of the taxa table with taxonomic levels as columns - cf taxa2vector; or a data.frame with the feature name and all its other coarser/finer levels.} 17 | 18 | \item{features}{if taxa is a vector or a data.frame that lacks all levels, a vector of the features to be checked.} 19 | 20 | \item{weights}{a data.frame with a column Feature and a column weight; for type = 'both' or 'coarse' only.} 21 | 22 | \item{thr}{numeric, value after which the algorithm should stop looking for better ranks in a taxonomic branch; for type = 'both' or 'coarse' only. If NULL, the median of weights is used.} 23 | 24 | \item{type}{character. If 'coarse', finer levels are aggregated into their coarser one if it has a better rank. If 'fine' then coarser levels are aggregated into a finer level if it has a better rank and is unique, i.e. there is a unique finer level for that coarser level. 
If 'both', both aggregation steps are sequentially performed.} 25 | } 26 | \value{ 27 | A dataframe with aggregated features in the "Feature" column, and the 'recipient' taxa in the "newFeature" column. 28 | } 29 | \description{ 30 | This function aggregates taxa according to their rank from a measure passed in weights: if a taxon has a lower rank than its coarser level (type = 'coarse'), it is aggregated into the coarser one; if there is a unique finer level with a better rank than the coarser one, then the coarser is aggregated into the finer one. Comparison is done only for the family, genus and species levels. 31 | } 32 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/filterDecisionsImportances.R: -------------------------------------------------------------------------------- 1 | #' Filter decisions according to their metrics 2 | #' 3 | #' This function filters decisions in a heuristic manner according to their importance and multiplicity. 4 | #' A relative importance threshold is calculated that maximises the product of the summed relative importance * multiplicity of the retained decisions and the number of decisions removed. 5 | #' All decisions with a relative importance above that threshold are kept. The argument min_imp is the minimal relative importance of the decisions kept. 6 | #' 7 | #' @param rules data.frame corresponding to the decisions, with all their metrics. 8 | #' @param min_imp minimal relative importance of the decisions that must be kept, the threshold to remove decisions is thus going to take lower values than max(imp)*min_imp. 9 | #' @return The decision ensemble from which decisions with the lowest errors and/or importances have been removed, or are indicated in a column "filt_err"/"filt_imp". 10 | #' @export 11 | filterDecisionsImportances <- function(rules, min_imp = 0.7) { 12 | if (("data.table" %in% class(rules)) == FALSE) { 13 | rules <- setDT(rules) 14 | } 15 | 16 | maxThr <- rules[, max(imp)] * min_imp 17 | impthr <- seq(0, maxThr, by = maxThr / 500) 18 | 19 | checkImp <- data.table(impThr = impthr) 20 | 21 | for (i in 1:nrow(checkImp)) { 22 | set(checkImp, i, "nrules", nrow(rules[imp >= impthr[i], ])) 23 | set(checkImp, i, "sum_impn", sum(rules[imp >= impthr[i], imp * n])) ### remove n 24 | } 25 | 26 | checkImp <- checkImp[, nremoved := max(nrules) - nrules][, f := sum_impn * nremoved] 27 | 28 | maxD <- max(checkImp[f == max(f), impThr]) 29 | rules <- rules[, filt_imp := "not ok"][imp >= maxD, filt_imp := "ok"] 30 | 31 | message( 32 | "Threshold for relative importance: < ", round(maxD, digits = 3), " and ", 33 | nrow(rules[filt_imp == "not ok"]), " rules removed.\n" 34 | ) 35 | 36 | # Filter rules 37 | rules <- subset(rules, filt_imp == "ok", select = -filt_imp) 38 | 39 | return(rules) 40 | } 41 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/evaluateAlpha.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/evaluateAlpha.R 3 | \name{evaluateAlpha} 4 | \alias{evaluateAlpha} 5 | \title{Calculate the number of decisions and of predicted samples from decision ensembles obtained with different alpha values} 6 | \usage{ 7 | evaluateAlpha( 8 | rules, 9 | alphas = c(5, 10, 15, 20, 30, 50, 75), 10 | pi_thr = 0.7, 11 | data = NULL, 12 | decision_ensembles = TRUE, 13 | aggregate_taxa = FALSE, 14 | taxa = NULL 15 | ) 16 | } 17 | \arguments{ 18 | 
\item{alphas}{expected numbers of false positive decisions selected, i.e. the alpha values to evaluate (default = c(5, 10, 15, 20, 30, 50, 75)).} 19 | 20 | \item{pi_thr}{fraction of bootstraps in which a decision should have been selected to be included in the stable decision ensemble (default = 0.7).} 21 | 22 | \item{decision_ensembles}{should the decision ensembles be returned?} 23 | 24 | \item{rules}{list of bootstrap results} 25 | } 26 | \value{ 27 | A list with all decisions from all bootstraps, the summary of decisions across bootstraps, the feature and interaction importances and influences in the nodes and edges dataframes, as well as the decision-wise feature and interaction importances and influences in the nodes_agg and edges_agg dataframes. 28 | } 29 | \description{ 30 | The aim is to help pick an alpha that will result in a decision ensemble able to predict most samples. Performs stability selection for each of the given alpha values. The number of decisions and of samples that follow decisions are also calculated. 31 | The procedure is adapted from Meinshausen and Buehlmann (2010): the best decisions from each bootstrap are pre-selected and the ones that were pre-selected in a certain fraction of bootstraps are included in the stable decision ensemble. 32 | The decision importances and multiplicities are averaged across bootstraps. Decision-wise feature and interaction importances and influences are averaged across bootstraps before computing the feature and interaction importances and influences from the stable decision ensemble. 33 | } 34 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/stabilitySelection.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/stabilitySelection.R 3 | \name{stabilitySelection} 4 | \alias{stabilitySelection} 5 | \title{Obtain a stable decision ensemble} 6 | \usage{ 7 | stabilitySelection( 8 | rules, 9 | alpha_error = 1, 10 | pi_thr = 0.7, 11 | aggregate_taxa = FALSE, 12 | taxa = NULL 13 | ) 14 | } 15 | \arguments{ 16 | \item{rules}{list of bootstrap results} 17 | 18 | \item{alpha_error}{expected number of false positive decisions selected (default = 1).} 19 | 20 | \item{pi_thr}{fraction of bootstraps in which a decision should have been selected to be included in the stable decision ensemble (default = 0.7).} 21 | 22 | \item{aggregate_taxa}{should taxa be aggregated at the genus level (if species have lower importance than their genus) or species level (if a genus is represented by a unique species)} 23 | 24 | \item{taxa}{if aggregate_taxa = TRUE, a data.frame with all taxa included in the dataset: columns = taxonomic ranks (with columns f, g, and s)} 25 | } 26 | \value{ 27 | A list with all decisions from all bootstraps, the summary of decisions across bootstraps, the feature and interaction importances and influences in the nodes and edges dataframes, as well as the decision-wise feature and interaction importances and influences in the nodes_agg and edges_agg dataframes. 28 | } 29 | \description{ 30 | Performs stability selection after bootstrapping with the model2DE_cluster or model2DE_resampling functions. 31 | The procedure is adapted from Meinshausen and Buehlmann (2010): the best decisions from each bootstrap are pre-selected and the ones that were pre-selected in a certain fraction of bootstraps are included in the stable decision ensemble. 
32 | The decision importances and multiplicities are averaged across bootstraps. Decision-wise feature and interaction importances and influences are averaged across bootstraps before computing the feature and interaction importances and influences from the stable decision ensemble. 33 | } 34 | -------------------------------------------------------------------------------- /R/filterDecisionsImportances.R: -------------------------------------------------------------------------------- 1 | #' Filter decisions according to their metrics 2 | #' 3 | #' This function filters decisions in a heuristic manner according to their importance and multiplicity. 4 | #' A relative importance threshold is calculated that maximises the product of the summed relative importance * multiplicity of the retained decisions and the number of decisions removed. 5 | #' All decisions with a relative importance above that threshold are kept. The argument min_imp is the minimal relative importance of the decisions kept. 6 | #' 7 | #' @param rules data.frame corresponding to the decisions, with all their metrics. 8 | #' @param min_imp minimal relative importance of the decisions that must be kept, the threshold to remove decisions is thus going to take lower values than max(imp)*min_imp. 9 | #' @return The decision ensemble from which decisions with the lowest errors and/or importances have been removed, or are indicated in a column "filt_err"/"filt_imp". 10 | #' 11 | #' @example examples/iris_each_function.R 12 | #' @export 13 | filterDecisionsImportances <- function(rules, min_imp = 0.7) { 14 | if (("data.table" %in% class(rules)) == FALSE) { 15 | rules <- setDT(rules) 16 | } 17 | 18 | maxThr <- rules[, max(imp)] * min_imp 19 | impthr <- seq(0, maxThr, by = maxThr / 500) 20 | 21 | checkImp <- data.table(impThr = impthr) 22 | 23 | for (i in 1:nrow(checkImp)) { 24 | set(checkImp, i, "nrules", nrow(rules[imp >= impthr[i], ])) 25 | set(checkImp, i, "sum_impn", sum(rules[imp >= impthr[i], imp * n])) ### remove n 26 | } 27 | 28 | checkImp <- checkImp[, nremoved := max(nrules) - nrules][, f := sum_impn * nremoved] 29 | 30 | maxD <- max(checkImp[f == max(f), impThr]) 31 | rules <- rules[, filt_imp := "not ok"][imp >= maxD, filt_imp := "ok"] 32 | 33 | message( 34 | "Threshold for relative importance: < ", round(maxD, digits = 3), " and ", 35 | nrow(rules[filt_imp == "not ok"]), " rules removed.\n" 36 | ) 37 | 38 | # Filter rules 39 | rules <- subset(rules, filt_imp == "ok", select = -filt_imp) 40 | 41 | return(rules) 42 | } 43 | -------------------------------------------------------------------------------- /R/endoR.R: -------------------------------------------------------------------------------- 1 | #' endoR 2 | #' 3 | #' endoR extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 
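#'
#' Typical workflow (a sketch; see the vignette and the examples/ scripts):
#' fit a tree ensemble, e.g. with randomForest or xgboost, call model2DE()
#' to extract, prune, and filter a decision ensemble, then visualize it with
#' plotFeatures() and plotNetwork().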
4 | #' 5 | #' @docType package 6 | #' 7 | #' @author Albane Ruaud \email{albane.ruaud@tuebingen.mpg.de} 8 | #' 9 | #' @name endoR 10 | #' @import data.table 11 | #' @rawNamespace import(dplyr, except = c(union, as_data_frame, groups, combine, slice, filter, lag, last, first, between)) 12 | #' @import inTrees 13 | #' @import stringr 14 | #' @rawNamespace import(ggplot2, except = margin) 15 | #' @import ggraph 16 | #' @rawNamespace import(igraph, except = c(is_named, decompose, spectrum)) 17 | #' @import stats 18 | #' @import parallel 19 | #' @import clustermq 20 | #' @rawNamespace import(rlang, except = ':=') 21 | #' @import utils 22 | #' @rawNamespace import(randomForest, except = importance) 23 | #' @import tidyverse 24 | #' @import caret 25 | #' @import ranger 26 | #' @import xgboost 27 | 28 | utils::globalVariables(c('Feature', 'Feature_short', 'association_sign' 29 | , 'change', 'changed', 'classPos', 'coarseFeature' 30 | , 'condition', 'd', 'd.x', 'd.y', 'd_assoc', 'decay' 31 | , 'dummy_var', 'elongateTaxa', 'err', 'err.rm', 'f' 32 | , 'filt_imp', 'full', 'fullID', 'g', 'gain', 'imp' 33 | , 'impThr', 'imp_sd', 'importance.x', 'importance.y' 34 | , 'inN', 'in_parallel', 'ix', 'k', 'len', 'lev' 35 | , 'level', 'maxDecay', 'n_cores', 'n_sd', 'name' 36 | , 'newFeature', 'newWeight', 'nremoved', 'nrules', 'o' 37 | , 'oldN', 'p', 'pred', 'pred.rm', 'predRm.x' 38 | , 'predRm.xy', 'predRm.y', 'prune', 'ruleID', 's' 39 | , 's.rm', 'sRm.x', 'sRm.y', 'sum_impn', 'support' 40 | , 'target', 'to_up', 'to_update', 'to_update_ctg' 41 | , 'typeDecay', 'weight', 'wimp', 'x', 'y')) 42 | 43 | 44 | #' Titanic passenger survival data (from the titanic package) 45 | #' 46 | #' @author Paul Hendricks \email{paul.hendricks.2013@owu.edu} 47 | #' @references \url{https://cran.r-project.org/package=titanic} 48 | "titanic" 49 | 50 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/discretizeDecisions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeDecisions.R 3 | \name{discretizeDecisions} 4 | \alias{discretizeDecisions} 5 | \title{Discretize numerical variables in a decision ensemble} 6 | \usage{ 7 | discretizeDecisions( 8 | rules, 9 | data = NULL, 10 | target, 11 | mode = "data", 12 | K = 2, 13 | splitV = NULL, 14 | classPos = NULL, 15 | in_parallel = FALSE, 16 | n_cores = detectCores() - 1, 17 | cluster = NULL 18 | ) 19 | } 20 | \arguments{ 21 | \item{rules}{a data frame with a column "condition".} 22 | 23 | \item{data}{data to discretize.} 24 | 25 | \item{target}{response variable.} 26 | 27 | \item{mode}{whether to discretize variables based on the data distribution (default, mode = 'data') or on the data splits in the model (mode = 'model').} 28 | 29 | \item{K}{numeric, number of categories to create from numeric variables (default: K = 2).} 30 | 31 | \item{splitV}{instead of running internally discretizeData, one can provide a list with, for each variable to discretize in rules, the thresholds delimiting each new category.} 32 | 33 | \item{classPos}{for classification, the positive class.} 34 | 35 | \item{in_parallel}{if TRUE, the function is run in parallel.} 36 | 37 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 38 | 39 | \item{cluster}{the cluster to use to run the function in parallel.}
40 | 41 | \item{data_ctg}{discretized data, if splitV is passed. Necessary to re-compute the metrics (if column 'err' in rules).} 42 | 43 | \item{return_data}{if TRUE, discretized data are also returned.} 44 | } 45 | \value{ 46 | Decision ensemble with discretized variables in the condition. Decisions with the same condition are aggregated: their importances are summed, and all other metrics are averaged. 47 | } 48 | \description{ 49 | This function replaces in a decision ensemble the boundaries of numerical features with their corresponding levels when the variable is discretized. 50 | If discretized data are not passed, data are first discretized into K categories based on their quantiles (see discretizeData). 51 | The error, prediction, importance and multiplicity of decisions are updated after discretization. 52 | } 53 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/preCluster.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/preCluster.R 3 | \name{preCluster} 4 | \alias{preCluster} 5 | \title{Extract decisions from a model and create bootstrap resamples.} 6 | \usage{ 7 | preCluster( 8 | model, 9 | model_type, 10 | data, 11 | target, 12 | times = 10, 13 | p = 0.5, 14 | sample_weight = NULL, 15 | classPos = NULL, 16 | ntree = "all", 17 | maxdepth = Inf, 18 | dummy_var = NULL, 19 | discretize = FALSE, 20 | K = 2, 21 | mode = "data", 22 | seed = 0, 23 | in_parallel = FALSE, 24 | n_cores = detectCores() - 1 25 | ) 26 | } 27 | \arguments{ 28 | \item{model}{model to extract rules from.} 29 | 30 | \item{model_type}{character string: 'RF', 'random forest', 'rf', 'xgboost', 'XGBOOST', 'xgb', 'XGB', 'ranger', 'Ranger', 'gbm' or 'GBM'.} 31 | 32 | \item{data}{data with the same columns as the data used to fit the model.} 33 | 34 | \item{target}{response variable.} 35 | 36 | \item{times}{number of bootstraps.} 37 | 38 | \item{p}{fraction of data to resample.} 39 | 40 | \item{sample_weight}{numeric vector with the weights of samples for bootstrap resampling. For classification, if 2 values are given, the 1st one is assumed to be for the positive class (classPos argument).} 41 | 42 | \item{classPos}{the positive class predicted by decisions} 43 | 44 | \item{ntree}{number of trees to use from the model (default = all)} 45 | 46 | \item{maxdepth}{maximal node depth to use for extracting rules (by default, full branches are used).} 47 | 48 | \item{dummy_var}{if multiclass variables were transformed into dummy variables before fitting the model, one can pass their names in a vector here to avoid multiple levels being used in the same rule (recommended).} 49 | 50 | \item{discretize}{if TRUE, discretization is performed with K categories (see discretizeDecisions; default = FALSE).} 51 | 52 | \item{in_parallel}{if TRUE, the function is run in parallel.} 53 | 54 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 55 | 56 | \item{cluster}{the cluster to use to run the function in parallel.} 57 | } 58 | \value{ 59 | A list with the row numbers of partitioned data, the rules originally extracted from the model and new data if discretization was performed. 60 | } 61 | \description{ 62 | To run before bootstrapping in parallel with the clustermq package and the model2DE_cluster function. Extracts decisions and optionally discretizes them. Creates data partitions for bootstrapping. 
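The bootstrap results can then be aggregated with stabilitySelection to obtain a stable decision ensemble.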
63 | } 64 | -------------------------------------------------------------------------------- /R/aggregateTaxa_coarse.R: -------------------------------------------------------------------------------- 1 | aggregateTaxa_coarse <- function(taxa, weights, thr = NULL) { 2 | ### taxa: the elongated features (from elongateTaxa), the longFeature column is not necessary 3 | ### weights: df with the features and the weights only (the higher the better) 4 | 5 | if (is.null(thr) == TRUE) { 6 | thr <- mean(weights$weight) 7 | } 8 | 9 | # some formatting 10 | if (!("data.table" %in% class(taxa))) { 11 | taxa <- as.data.table(taxa)[, Feature := as.character(Feature)] 12 | } 13 | taxa <- merge(taxa, weights[, list(Feature, weight)], by = "Feature", all.x = TRUE) 14 | setorder(taxa, weight) 15 | taxa <- taxa[!is.na(f), ][, `:=`(newFeature = Feature, newWeight = weight, ix = .I)] 16 | tax_col <- which(colnames(taxa) %in% c("f", "g", "s")) 17 | lim <- taxa[weight > thr, min(ix)] 18 | 19 | 20 | for (i in 1:lim) { 21 | 22 | # subset to the features of the same family doing better (to look at the right spot) 23 | tmp <- taxa$f[i] 24 | to_check <- taxa[f == tmp, ix] 25 | to_check <- to_check[which(to_check > i)] 26 | 27 | if (length(to_check) > 0) { 28 | 29 | # record if features have the same genus/family names and how good their rank is 30 | check <- matrix(ncol = 4) 31 | colnames(check) <- c(colnames(taxa)[tax_col], "ix") 32 | for (k in 1:length(to_check)) { 33 | tmp <- cbind((taxa[i, c(tax_col), with = FALSE] == taxa[to_check[k], c(tax_col), with = FALSE]), 34 | ix = to_check[k] 35 | ) 36 | check <- rbind(check, tmp) 37 | } 38 | check <- as.data.table(check, stringsAsFactors = FALSE) 39 | 40 | # select the one with the finest taxonomic level 41 | if (is.na(taxa$s[i]) == FALSE) { 42 | better <- check[is.na(s) & g == 1, ix] 43 | } else if (is.na(taxa$g[i]) == FALSE) { 44 | better <- check[is.na(g) & f == 1, ix] 45 | } else { 46 | better <- i 47 | } 48 | 49 | # and give it to the bad one 50 | if (length(better) > 0) { 51 | # to make it retro-active!
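# (clarification: any feature that was already aggregated into Feature[i]
# is re-pointed to the new recipient as well, so aggregation chains
# collapse onto a single recipient)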
52 | to_change <- taxa[newFeature == taxa[["Feature"]][i], ix] 53 | taxa <- taxa[ix %in% to_change, `:=`( 54 | newFeature = taxa[["newFeature"]][max(better)], 55 | newWeight = taxa[["newWeight"]][max(better)] 56 | )] 57 | } 58 | } 59 | } 60 | 61 | taxa <- taxa[, "n" := .N, by = newFeature][, "changed" := "Unchanged"][ 62 | n > 1 & newFeature == Feature, changed := "Recipient" 63 | ][ 64 | n > 1 & newFeature != Feature, changed := "Downgraded" 65 | ][ 66 | , changed := as.factor(changed) 67 | ] 68 | 69 | return(taxa) 70 | } 71 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/aggregateTaxa_coarse.R: -------------------------------------------------------------------------------- 1 | aggregateTaxa_coarse <- function(taxa, weights, thr = NULL) { 2 | ### taxa: the elongated features (from elongateTaxa), the longFeature column is not necessary 3 | ### weights: df with the features and the weights only (the higher the better) 4 | 5 | if (is.null(thr) == TRUE) { 6 | thr <- mean(weights$weight) 7 | } 8 | 9 | # some formatting 10 | if (!("data.table" %in% class(taxa))) { 11 | taxa <- as.data.table(taxa)[, Feature := as.character(Feature)] 12 | } 13 | taxa <- merge(taxa, weights[, .(Feature, weight)], by = "Feature", all.x = TRUE) 14 | setorder(taxa, weight) 15 | taxa <- taxa[!is.na(f), ][, `:=`(newFeature = Feature, newWeight = weight, ix = .I)] 16 | tax_col <- which(colnames(taxa) %in% c("f", "g", "s")) 17 | lim <- taxa[weight > thr, min(ix)] 18 | 19 | 20 | for (i in 1:lim) { 21 | 22 | # subset to the features of the same family doing better (to look at the right spot) 23 | tmp <- taxa$f[i] 24 | to_check <- taxa[f == tmp, ix] 25 | to_check <- to_check[which(to_check > i)] 26 | 27 | if (length(to_check) > 0) { 28 | 29 | # record if features have the same genus/family names and how good their rank is 30 | check <- matrix(ncol = 4) 31 | colnames(check) <- c(colnames(taxa)[tax_col], "ix") 32 | for (k in 1:length(to_check)) { 33 | tmp <- cbind((taxa[i, c(tax_col), with = FALSE] == taxa[to_check[k], c(tax_col), with = FALSE]), 34 | ix = to_check[k] 35 | ) 36 | check <- rbind(check, tmp) 37 | } 38 | check <- as.data.table(check, stringsAsFactors = FALSE) 39 | 40 | # select the one with the finest taxonomic level 41 | if (is.na(taxa$s[i]) == FALSE) { 42 | better <- check[is.na(s) & g == 1, ix] 43 | } else if (is.na(taxa$g[i]) == FALSE) { 44 | better <- check[is.na(g) & f == 1, ix] 45 | } else { 46 | better <- i 47 | } 48 | 49 | # and give it to the bad one 50 | if (length(better) > 0) { 51 | # to make it retro-active!
52 | to_change <- taxa[newFeature == taxa[["Feature"]][i], ix] 53 | taxa <- taxa[ix %in% to_change, `:=`( 54 | newFeature = taxa[["newFeature"]][max(better)], 55 | newWeight = taxa[["newWeight"]][max(better)] 56 | )] 57 | } 58 | } 59 | } 60 | 61 | taxa <- taxa[, "n" := .N, by = newFeature][, "changed" := "Unchanged"][ 62 | n > 1 & newFeature == Feature, changed := "Recipient" 63 | ][ 64 | n > 1 & newFeature != Feature, changed := "Downgraded" 65 | ][ 66 | , changed := as.factor(changed) 67 | ] 68 | 69 | return(taxa) 70 | } 71 | -------------------------------------------------------------------------------- /R/measureSingleDecision.R: -------------------------------------------------------------------------------- 1 | measureSingleDecision <- 2 | function(ruleExec, data, target, type = "reg", gain = FALSE, pred_null = NULL) { 3 | 4 | # same as the original inTrees function, but the fraction of class 1 is returned 5 | # instead of the majority class 6 | 7 | if (gain == FALSE) { 8 | colNames <- c("len", "support", "err", "condition", "pred") 9 | } else { 10 | colNames <- c("len", "support", "err", "condition", "pred", "gain") 11 | } 12 | 13 | 14 | len <- length(unlist(strsplit(ruleExec, split = " & "))) 15 | origRule <- ruleExec 16 | 17 | ruleExec <- paste("which(", ruleExec, ")") 18 | ruleExec <- gsub(ruleExec, pattern = "X\\[,", replacement = "data\\[,") 19 | ixMatch <- eval(parse(text = ruleExec)) 20 | 21 | if (length(ixMatch) == 0) { 22 | v <- rep("-1", length(colNames)) 23 | names(v) <- colNames 24 | return(v) 25 | } 26 | ys <- target[ixMatch] 27 | pred <- mean(ys, na.rm = TRUE) 28 | support <- length(ys) / nrow(data) 29 | 30 | 31 | if (type == "reg") { 32 | err <- mean((pred - ys)^2, na.rm = TRUE) 33 | if (gain == FALSE) { 34 | v <- c(len, support, err, origRule, pred) 35 | } else { 36 | g <- 1 - err / (sum((pred_null - ys)^2) / length(ys)) 37 | v <- c(len, support, err, origRule, pred, g) 38 | } 39 | } else { 40 | if (pred %in% c(0, 1)) { 41 | err <- 0 42 | } else { 43 | err <- 1 - exp(mean(sapply(ys, function(x, pred) { 44 | x * log(pred) + (1 - x) * log(1 - pred) 45 | }, pred = pred))) 46 | } 47 | 48 | if (gain == FALSE) { 49 | v <- c(len, support, err, origRule, pred) 50 | } else { 51 | err_null <- 1 - exp(mean(sapply(ys, function(x, pred) { 52 | x * log(pred) + (1 - x) * log(1 - pred) 53 | }, pred = pred_null))) 54 | g <- 1 - err / err_null 55 | v <- c(len, support, err, origRule, pred, g) 56 | } 57 | } 58 | 59 | names(v) <- colNames 60 | return(v) 61 | } 62 | 63 | 64 | 65 | 66 | ############################################################### 67 | 68 | measureAll <- 69 | function(data, target, classPos = NULL) { 70 | if (!is.numeric(target)) { 71 | target <- ifelse(target == classPos, 1, 0) 72 | pred <- mean(target, na.rm = TRUE) 73 | err <- 1 - exp(mean(sapply(target, function(x, pred) { 74 | x * log(pred) + (1 - x) * log(1 - pred) 75 | }, pred = pred))) 76 | } else { 77 | pred <- mean(target, na.rm = TRUE) 78 | err <- mean((pred - target)^2, na.rm = TRUE) 79 | } 80 | 81 | v <- c(err, pred) 82 | names(v) <- c("err", "pred") 83 | 84 | return(v) 85 | } 86 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/man/model2DE.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/model2DE.R 3 | \name{model2DE} 4 | \alias{model2DE} 5 | \title{Extract a decision ensemble from a tree-based model, simplify it and create an 
interaction network from it.} 6 | \usage{ 7 | model2DE( 8 | model, 9 | model_type, 10 | data, 11 | target, 12 | ntree = "all", 13 | maxdepth = Inf, 14 | classPos = NULL, 15 | dummy_var = NULL, 16 | discretize = FALSE, 17 | K = 2, 18 | mode = "data", 19 | prune = TRUE, 20 | maxDecay = 0.05, 21 | typeDecay = 2, 22 | aggregate_taxa = FALSE, 23 | taxa = NULL, 24 | type = "both", 25 | filter = TRUE, 26 | min_imp = 0.7, 27 | ntest = 100, 28 | exec = NULL, 29 | in_parallel = FALSE, 30 | n_cores = detectCores() - 1, 31 | cluster = NULL, 32 | light = FALSE 33 | ) 34 | } 35 | \arguments{ 36 | \item{model}{model to extract rules from.} 37 | 38 | \item{model_type}{character string: 'RF', 'random forest', 'rf', 'xgboost', 'XGBOOST', 'xgb', 'XGB', 'ranger', 'Ranger', 'gbm' or 'GBM'.} 39 | 40 | \item{data}{data with the same columns as the data used to fit the model.} 41 | 42 | \item{target}{response variable.} 43 | 44 | \item{ntree}{number of trees to use from the model (default = all)} 45 | 46 | \item{maxdepth}{maximal node depth to use for extracting rules (by default, full branches are used).} 47 | 48 | \item{classPos}{the positive class predicted by decisions} 49 | 50 | \item{dummy_var}{if multiclass variables were transformed into dummy variables before fitting the model, one can pass their names in a vector here to avoid multiple levels being used in the same rule (recommended).} 51 | 52 | \item{exec}{if decisions have already been extracted, datatable with a 'condition' column.} 53 | 54 | \item{in_parallel}{if TRUE, the function is run in parallel} 55 | 56 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1} 57 | 58 | \item{cluster}{the cluster to use to run the function in parallel} 59 | 60 | \item{light}{if FALSE (default), returns all intermediary decision ensembles.} 61 | 62 | \item{...}{arguments to be passed to pruneDecisions, discretizeDecisions, filterDecisionsImportances.} 63 | } 64 | \value{ 65 | A list with the final decision ensemble; the discretized data, if numeric variables were discretized in decisions; and the edges and nodes to make a network (see plotNetwork and plotFeatures). 66 | } 67 | \description{ 68 | Wrapper function to extract rules from a tree-based model. 69 | It automatically transforms multiclass predictive variables into dummy variables. 70 | Optionally discretizes numeric variables (see discretizeDecisions). Measures decisions and optionally prunes them. Finally, generates a network. 71 | Can be run in parallel. 
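Worked examples on the iris dataset are provided in the plotFeatures documentation and in examples/iris_each_function.R.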
72 | } 73 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/measureSingleDecision.R: -------------------------------------------------------------------------------- 1 | measureSingleDecision <- 2 | function(ruleExec, data, target, type = "reg", gain = FALSE, pred_null = NULL) { 3 | 4 | # same as the original inTrees function, but the fraction of class 1 is returned 5 | # instead of the majority class 6 | 7 | if (gain == FALSE) { 8 | colNames <- c("len", "support", "err", "condition", "pred") 9 | } else { 10 | colNames <- c("len", "support", "err", "condition", "pred", "gain") 11 | } 12 | 13 | 14 | len <- length(unlist(strsplit(ruleExec, split = " & "))) 15 | origRule <- ruleExec 16 | 17 | ruleExec <- paste("which(", ruleExec, ")") 18 | ruleExec <- gsub(ruleExec, pattern = "X\\[,", replacement = "data\\[,") 19 | ixMatch <- eval(parse(text = ruleExec)) 20 | 21 | if (length(ixMatch) == 0) { 22 | v <- rep("-1", length(colNames)) 23 | names(v) <- colNames 24 | return(v) 25 | } 26 | ys <- target[ixMatch] 27 | pred <- mean(ys, na.rm = TRUE) 28 | support <- length(ys) / nrow(data) 29 | 30 | 31 | if (type == "reg") { 32 | err <- mean((pred - ys)^2, na.rm = TRUE) 33 | if (gain == FALSE) { 34 | v <- c(len, support, err, origRule, pred) 35 | } else { 36 | g <- 1 - err / (sum((pred_null - ys)^2) / length(ys)) 37 | v <- c(len, support, err, origRule, pred, g) 38 | } 39 | } else { 40 | if (pred %in% c(0, 1)) { 41 | err <- 0 42 | } else { 43 | err <- 1 - exp(mean(sapply(ys, function(x, pred) { 44 | x * log(pred) + (1 - x) * log(1 - pred) 45 | }, pred = pred))) 46 | } 47 | 48 | if (gain == FALSE) { 49 | v <- c(len, support, err, origRule, pred) 50 | } else { 51 | err_null <- 1 - exp(mean(sapply(ys, function(x, pred) { 52 | x * log(pred) + (1 - x) * log(1 - pred) 53 | }, pred = pred_null))) 54 | g <- 1 - err / err_null 55 | v <- c(len, support, err, origRule, pred, g) 56 | } 57 | } 58 | 59 | names(v) <- colNames 60 | return(v) 61 | } 62 | 63 | 64 | 65 | 66 | ############################################################### 67 | 68 | measureAll <- 69 | function(data, target, classPos = NULL) { 70 | if (!is.numeric(target)) { 71 | target <- ifelse(target == classPos, 1, 0) 72 | pred <- mean(target, na.rm = TRUE) 73 | err <- 1 - exp(mean(sapply(target, function(x, pred) { 74 | x * log(pred) + (1 - x) * log(1 - pred) 75 | }, pred = pred))) 76 | } else { 77 | pred <- mean(target, na.rm = TRUE) 78 | err <- mean((pred - target)^2, na.rm = TRUE) 79 | } 80 | 81 | v <- c(err, pred) 82 | names(v) <- c("err", "pred") 83 | 84 | return(v) 85 | } 86 | -------------------------------------------------------------------------------- /man/plotFeatures.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotFeatures.R 3 | \name{plotFeatures} 4 | \alias{plotFeatures} 5 | \title{Plot the importance and influence of features.} 6 | \usage{ 7 | plotFeatures( 8 | decision_ensemble, 9 | levels_order = NULL, 10 | colour_low = "#E69F00", 11 | colour_mid = "grey87", 12 | colour_high = "#0072B2", 13 | return_all = FALSE 14 | ) 15 | } 16 | \arguments{ 17 | \item{decision_ensemble}{stable decision ensemble (see stabilitySelection).} 18 | 19 | \item{levels_order}{optional, order for variable levels on the influence plot} 20 | 21 | \item{colour_low}{colour for the negative feature influence values (default: yellowish)} 22 | 23 | \item{colour_mid}{colour for the null 
feature influence values (default: light grey)} 24 | 25 | \item{colour_high}{colour for the positive feature influence values (default: blue)} 26 | 27 | \item{return_all}{if TRUE, returns the table of feature importances and influences and each plot separately (default = FALSE).} 28 | } 29 | \value{ 30 | 2 ggplots arranged in a row with ggpubr; if return_all = TRUE, returns plots separately in a list, as well as the tables used to create plots. 31 | } 32 | \description{ 33 | Returns a ggplot object with variable importance (across all categorical levels for factor variables) and variable per-level influence. 34 | It uses the ggpubr package to combine plots. 35 | } 36 | \examples{ 37 | library(randomForest) 38 | library(caret) 39 | 40 | # import data and fit model 41 | data(iris) 42 | mod <- randomForest(Species ~ ., data = iris) 43 | 44 | # Fit a decision ensemble to predict the species setosa (vs versicolor and 45 | # virginica): no regularization (no decision pruning, discretization, 46 | # bootstrapping, or decision filtering) 47 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 48 | , target = iris$Species, classPos = "setosa" 49 | , filter = FALSE, discretize = FALSE, prune = FALSE) 50 | 51 | # Only decision pruning (default = TRUE) and discretization (default in 2 52 | # categories, we want 3 categories so we change K); no bootstrapping or 53 | # decision filtering. 54 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 55 | , target = iris$Species, classPos = "setosa" 56 | , filter = FALSE, discretize = TRUE, K = 3) 57 | 58 | # idem but run it in parallel on 2 threads 59 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 60 | , target = iris$Species, classPos = "setosa" 61 | , filter = FALSE, discretize = TRUE, K = 3 62 | , in_parallel = TRUE, n_cores = 2) 63 | 64 | # Plot the decision ensemble: 65 | # Plants from the setosa species have small petals and long narrow sepals. 66 | plotFeatures(endo_setosa, levels_order = c("Low", "Medium", "High")) 67 | plotNetwork(endo_setosa, hide_isolated_nodes = FALSE, layout = "fr") 68 | } 69 | -------------------------------------------------------------------------------- /examples/iris_each_function.R: -------------------------------------------------------------------------------- 1 | library(randomForest) 2 | library(caret) 3 | library(data.table) 4 | 5 | # import data and fit model 6 | data(iris) 7 | mod <- randomForest(Species ~ ., data = iris) 8 | 9 | # Let's get the decision ensemble. One could use the wrapping function 10 | # model2DE(), but we will run each function separately. 
11 | 12 | # Get the raw decision ensemble 13 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 14 | , target = iris$Species, classPos = "setosa" 15 | , times = 1 # number of bootstraps, here just one 16 | , discretize = FALSE) # we will discretize outside for the example 17 | summary(de) 18 | # exec = the decision ensemble 19 | # partitions = list of sample indexes for bootstrapping 20 | # if we had done discretization, the new data would be in data_ctg 21 | de <- de$exec 22 | 23 | # Discretize variables in 3 categories - optional 24 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 25 | , K = 3, classPos = "setosa", mode = "data") 26 | data_ctg <- de$data_ctg 27 | de <- de$rules_ctg 28 | 29 | # Homogenize the decision ensemble 30 | de <- de[, condition := sapply(condition, function(x) { 31 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 32 | })] 33 | de <- unique( 34 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 35 | ) 36 | 37 | # Calculate decision metrics; we don't need the importances yet since we will 38 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 39 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 40 | , classPos = "setosa", importances = FALSE) 41 | de <- de[de_met, on = "condition"] 42 | 43 | # Pruning - optional 44 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 45 | , classPos = "setosa") 46 | 47 | # Decision importances 48 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 49 | , classPos = "setosa") 50 | 51 | # Filter out decisions with the lowest importance: min_imp = the minimal 52 | # importance in the decision ensemble compared to the maximal one. 53 | # E.g., if min_imp = 0.5, then at least all decisions with an 54 | # importance > 0.5*max(importance) will be kept. 55 | # This ensures that we don't throw out too much. 56 | # Since the decision ensemble is quite small, we don't need to filter much... 57 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 58 | 59 | # Get the network 60 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 61 | , classPos = "setosa") 62 | 63 | # Plot the feature importance/influence and the network 64 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 65 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 66 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/decisionImportance.R: -------------------------------------------------------------------------------- 1 | #' Compute the importance of decisions 2 | #' 3 | #' This function computes the importance of decisions. 4 | #' 5 | #' 6 | #' @param rules a data.frame with a column "condition" or a vector with name "condition". 7 | #' @param data data from which to get the decision support. 8 | #' @param target response variable. 9 | #' @param classPos if classification, the positive class. 10 | #' @param in_parallel if TRUE, the function is run in parallel. 11 | #' @param n_cores if in_parallel = TRUE, and no cluster has been passed: number of cores to use. 12 | #' @param cluster the cluster to use to run the function in parallel. 13 | #' 14 | #' @return the data.frame passed in rules with the gain and importance of each decision. 
15 | #' 16 | #' @export 17 | decisionImportance <- function(rules, data, target, classPos = NULL, 18 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 19 | 20 | 21 | # numerise target 22 | if (is.numeric(target) == FALSE) { 23 | target <- ifelse(target == classPos, 1, 0) 24 | type <- "classification" 25 | } else { 26 | type <- "reg" 27 | } 28 | 29 | 30 | if ("data.table" %in% class(data)) { 31 | data <- as.data.frame(data) 32 | } 33 | 34 | pred_null <- mean(target, na.rm = TRUE) 35 | 36 | 37 | if (in_parallel == TRUE) { 38 | if (is.null(cluster) == TRUE) { 39 | message("Initiate parallelisation ... ") 40 | cluster <- makeCluster(n_cores) 41 | on.exit(stopCluster(cluster)) 42 | } 43 | tmp <- parApply( 44 | cl = cluster, rules, MARGIN = 1, FUN = importanceSingleRule, 45 | data = data, target = target, type = type, pred_null = pred_null 46 | ) 47 | } else { 48 | tmp <- apply(rules, 49 | MARGIN = 1, FUN = importanceSingleRule, 50 | data = data, target = target, type = type, pred_null = pred_null 51 | ) 52 | } 53 | 54 | 55 | rules <- rules[, "gain" := tmp][, "imp" := gain * support] 56 | 57 | if ("n" %in% colnames(rules)) { 58 | rules <- rules[, "n" := as.numeric(n)][order(-n * imp), ] 59 | } else { 60 | setorder(rules, -imp) 61 | } 62 | 63 | 64 | return(rules) 65 | } 66 | 67 | 68 | ###################################################### 69 | importanceSingleRule <- 70 | function(rule, data, target, type = "reg", pred_null = NULL) { 71 | ruleExec <- paste("which(", rule["condition"], ")") 72 | ruleExec <- gsub(ruleExec, pattern = "X\\[,", replacement = "data\\[,") 73 | ixMatch <- eval(parse(text = ruleExec)) 74 | 75 | if (length(ixMatch) == 0) { 76 | return(0) 77 | } 78 | ys <- target[ixMatch] 79 | 80 | if (type == "reg") { 81 | err_null <- sum((pred_null - ys)^2) / length(ys) 82 | } else { 83 | err_null <- exp(mean(sapply(ys, function(x, pred) { 84 | x * log(pred) + (1 - x) * log(1 - pred) 85 | }, pred = pred_null))) 86 | err_null <- 1 - err_null 87 | } 88 | 89 | g <- 1 - as.numeric(rule["err"]) / err_null 90 | return(g) 91 | } 92 | -------------------------------------------------------------------------------- /R/decisionImportance.R: -------------------------------------------------------------------------------- 1 | #' Compute the importance of decisions 2 | #' 3 | #' This function computes the importance of decisions. 4 | #' 5 | #' 6 | #' @param rules a data.frame with a column "condition" or a vector with name "condition". 7 | #' @param data data from which to get the decision support. 8 | #' @param target response variable. 9 | #' @param classPos if classification, the positive class. 10 | #' @param in_parallel if TRUE, the function is run in parallel. 11 | #' @param n_cores if in_parallel = TRUE, and no cluster has been passed: number of cores to use. 12 | #' @param cluster the cluster to use to run the function in parallel. 13 | #' 14 | #' @return the data.frame passed in rules with the gain and importance of each decision. 
15 | #' @example examples/iris_each_function.R 16 | #' 17 | #' @export 18 | decisionImportance <- function(rules, data, target, classPos = NULL, 19 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 20 | 21 | 22 | # numerise target 23 | if (is.numeric(target) == FALSE) { 24 | target <- ifelse(target == classPos, 1, 0) 25 | type <- "classification" 26 | } else { 27 | type <- "reg" 28 | } 29 | 30 | 31 | if ("data.table" %in% class(data)) { 32 | data <- as.data.frame(data) 33 | } 34 | 35 | pred_null <- mean(target, na.rm = TRUE) 36 | 37 | 38 | if (in_parallel == TRUE) { 39 | if (is.null(cluster) == TRUE) { 40 | message("Initiate parallelisation ... ") 41 | cluster <- makeCluster(n_cores) 42 | on.exit(stopCluster(cluster)) 43 | } 44 | tmp <- parApply( 45 | cl = cluster, rules, MARGIN = 1, FUN = importanceSingleRule, 46 | data = data, target = target, type = type, pred_null = pred_null 47 | ) 48 | } else { 49 | tmp <- apply(rules, 50 | MARGIN = 1, FUN = importanceSingleRule, 51 | data = data, target = target, type = type, pred_null = pred_null 52 | ) 53 | } 54 | 55 | 56 | rules <- rules[, "gain" := tmp][, "imp" := gain * support] 57 | 58 | if ("n" %in% colnames(rules)) { 59 | rules <- rules[, "n" := as.numeric(n)][order(-n * imp), ] 60 | } else { 61 | setorder(rules, -imp) 62 | } 63 | 64 | 65 | return(rules) 66 | } 67 | 68 | 69 | ###################################################### 70 | importanceSingleRule <- 71 | function(rule, data, target, type = "reg", pred_null = NULL) { 72 | ruleExec <- paste("which(", rule["condition"], ")") 73 | ruleExec <- gsub(ruleExec, pattern = "X\\[,", replacement = "data\\[,") 74 | ixMatch <- eval(parse(text = ruleExec)) 75 | 76 | if (length(ixMatch) == 0) { 77 | return(0) 78 | } 79 | ys <- target[ixMatch] 80 | 81 | if (type == "reg") { 82 | err_null <- sum((pred_null - ys)^2) / length(ys) 83 | } else { 84 | err_null <- exp(mean(sapply(ys, function(x, pred) { 85 | x * log(pred) + (1 - x) * log(1 - pred) 86 | }, pred = pred_null))) 87 | err_null <- 1 - err_null 88 | } 89 | 90 | g <- 1 - as.numeric(rule["err"]) / err_null 91 | return(g) 92 | } 93 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/discretizeData.R: -------------------------------------------------------------------------------- 1 | #' Discretize numerical variables in a dataset 2 | #' 3 | #' This function discretizes all numerical variables, or only the ones passed in features, into K categories based on their quantiles. 4 | #' Names of categories can optionally be given in knames. 5 | #' 6 | #' @param data data to discretize. 7 | #' @param K numeric, number of categories (default: K = 2). 8 | #' @param features vector with variable names or column numbers to discretize. If NULL (default), then all numeric variables are discretized. 9 | #' @param knames optional: character vector of the same length as K, containing the ordered names for categories. 10 | #' @param return_split if TRUE, then the table with thresholds used to discretize data is also returned. 11 | #' @return Data with discretized variables.
12 | #' @export 13 | discretizeData <- function(data, K = 2, features = NULL, knames = NULL, return_split = FALSE) { 14 | 15 | # set category names 16 | if (is.null(knames) == TRUE) { 17 | if (K == 2) { 18 | knames <- c("min", "Low", "High") 19 | } else if (K == 3) { 20 | knames <- c("min", "Low", "Medium", "High") 21 | } else if (K == 4) { 22 | knames <- c("min", "veryLow", "Low", "High", "veryHigh") 23 | } else if (K == 5) { 24 | knames <- c("min", "veryLow", "Low", "Medium", "High", "veryHigh") 25 | } else { 26 | knames <- c("min", paste("L", seq(1:K), sep = "")) 27 | } 28 | } else if (length(knames) == K) { 29 | knames <- c("min", knames) 30 | } 31 | 32 | # get column numbers - not the most efficient but later easier 33 | if (is.character(features) == TRUE) { 34 | colNb <- which(colnames(data) %in% features) 35 | } else if (is.numeric(features) == TRUE) { 36 | colNb <- features 37 | } else { 38 | colNb <- which(sapply(data, function(x) { 39 | (length(unique(x)) > 2 & is.numeric(x)) 40 | })) 41 | } 42 | 43 | # discretize 44 | data_ctg <- as.data.table(data) 45 | tmp <- lapply(data_ctg[, (colNb), with = FALSE], discretizeVector, K = K, knames = knames, return_all = TRUE) 46 | names(tmp) <- as.character(colNb) 47 | for (j in colNb) { 48 | set(data_ctg, i = NULL, j = j, tmp[[as.character(j)]]$newV) 49 | set(data_ctg, i = which(data_ctg[, j, with = FALSE] == ""), j = j, value = NA) 50 | } 51 | data_ctg <- data_ctg[, c(colNb) := lapply(.SD, as.factor), .SDcols = colNb] 52 | 53 | 54 | # get the split table if asked 55 | if (return_split == TRUE) { 56 | splitV <- list() 57 | for (i in 1:length(tmp)) { 58 | splitV[[i]] <- tmp[[i]]$splitV 59 | } 60 | names(splitV) <- names(tmp) 61 | 62 | splitV_med <- lapply(names(splitV), FUN = getMedian, data = data, splitV = splitV) 63 | names(splitV_med) <- names(splitV) 64 | 65 | return(list("data_ctg" = data_ctg, "splitV" = splitV, "splitV_med" = splitV_med)) 66 | } 67 | 68 | return(data_ctg) 69 | } 70 | 71 | 72 | ############## 73 | getMedian <- function(var, data, splitV) { 74 | dataVar <- data[[as.numeric(var)]] 75 | med <- list() 76 | for (i in 2:length(splitV[[var]])) { 77 | iX <- which(dataVar >= splitV[[var]][i - 1] & dataVar <= splitV[[var]][i]) 78 | med <- c(med, median(dataVar[iX], na.rm = TRUE)) 79 | } 80 | names(med) <- names(splitV[[var]])[-1] 81 | return(med) 82 | } 83 | -------------------------------------------------------------------------------- /R/aggregateTaxa.R: -------------------------------------------------------------------------------- 1 | #' Aggregate taxa according to a measure. 2 | #' 3 | #' This function aggregates taxa according to their rank from a measure passed in weights: if a taxon has a lower rank than its coarser level (type = 'coarse'), it is aggregated into the coarser one; if there is a unique finer level with a better rank than the coarser one, then the coarser is aggregated into the finer one. Comparison is done only for the family, genus and species levels. 4 | #' 5 | #' @param taxa taxa should be a data.frame with the feature name and all its other coarser/finer levels (columns: 'Feature', 'f', 'g', 's') 6 | #' @param features if taxa is a vector or a data.frame that lacks all levels, a vector of the features to be checked. 7 | #' @param weights a data.frame with a column Feature and a column weight; for type = 'both' or 'coarse' only. 8 | #' @param thr numeric, value after which the algorithm should stop looking for better ranks in a taxonomic branch; for type = 'both' or 'coarse' only.
If NULL, the median of weights is used. 9 | #' @param type character. If 'coarse', finer levels are aggregated into their coarser one if it has a better rank. If 'fine', then coarser levels are aggregated into a finer level if it has a better rank and is unique, i.e., there is a unique finer level for that coarser level. If 'both', both aggregation steps are sequentially performed. 10 | #' 11 | #' @return A dataframe with aggregated features in the "Feature" column, and the 'recipient' taxa in the "newFeature" column. 12 | #' 13 | #' @export 14 | aggregateTaxa <- function(taxa, features = NULL, weights = NULL, thr = NULL, type = "coarse") { 15 | if (!is.null(weights) && !("weight" %in% colnames(weights))) { 16 | wI <- which(sapply(weights, is.numeric)) 17 | colnames(weights)[wI] <- "weight" 18 | } 19 | 20 | #if (is.vector(taxa) == TRUE || !("Feature" %in% colnames(taxa))) { 21 | # taxa <- elongateTaxa(features = features, taxa = taxa) 22 | #} 23 | 24 | if (is.null(thr) == TRUE) { 25 | thr <- median(weights$weight) 26 | } 27 | 28 | ### wrapper of the 2 collapsing functions, with the option of performing both 29 | if (type == "coarse") { 30 | taxa <- aggregateTaxa_coarse(taxa, weights, thr) 31 | } 32 | 33 | if (type == "fine") { 34 | taxa <- aggregateTaxa_fine(taxa) 35 | } 36 | 37 | if (type == "both") { 38 | taxa_d <- aggregateTaxa_coarse(taxa, weights, thr) 39 | taxa_u <- aggregateTaxa_fine(subset(taxa, Feature %in% taxa_d$newFeature)) 40 | 41 | taxa <- unique(taxa_u[, list(Feature, newFeature)])[, coarseFeature := newFeature][, newFeature := NULL] 42 | taxa <- merge(taxa, taxa_d[, list(newFeature, Feature, weight, newWeight)], by.x = "coarseFeature", by.y = "newFeature") 43 | 44 | # record how many times a newFeature has been attributed and who was changed 45 | taxa <- taxa[, n := .N, by = "newFeature"][, "changed" := "Unchanged"][ 46 | n > 1 & newFeature == Feature, changed := "Recipient" 47 | ][ 48 | newFeature != coarseFeature & coarseFeature == Feature, changed := "Upgraded" 49 | ][ 50 | newFeature == coarseFeature & coarseFeature != Feature, changed := "Downgraded" 51 | ][ 52 | newFeature != coarseFeature & coarseFeature != Feature, changed := "Down&Upgraded" 53 | ][ 54 | , changed := as.factor(changed) 55 | ] 56 | } 57 | 58 | return(taxa) 59 | } 60 | -------------------------------------------------------------------------------- /man/plotNetwork.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotNetwork.R 3 | \name{plotNetwork} 4 | \alias{plotNetwork} 5 | \title{Plot an interaction variable network.} 6 | \usage{ 7 | plotNetwork( 8 | decision_ensemble, 9 | path_length = Inf, 10 | layout = "stress", 11 | colour_edge_low = "#E69F00", 12 | colour_edge_mid = "grey87", 13 | colour_edge_high = "#0072B2", 14 | colour_node_low = "#E69F00", 15 | colour_node_mid = "grey87", 16 | colour_node_high = "#0072B2", 17 | text_size = 4, 18 | hide_isolated_nodes = TRUE, 19 | seed = 0 20 | ) 21 | } 22 | \arguments{ 23 | \item{decision_ensemble}{stable decision ensemble: list with edges, nodes, etc.} 24 | 25 | \item{path_length}{maximal number of edges between 2 nodes, default = Inf.} 26 | 27 | \item{layout}{network layout, default is 'stress' (see ggraph package)} 28 | 29 | \item{colour_edge_low}{colour for the negative interaction influence values (i.e., edges; default: yellowish)} 30 | 31 | \item{colour_edge_mid}{colour for the null interaction influence values (i.e., edges; default:
light grey)} 32 | 33 | \item{colour_edge_high}{colour for the positive interaction influence values (i.e., edges; default: blue)} 34 | 35 | \item{colour_node_low}{colour for the negative feature influence values (i.e., nodes; default: yellowish)} 36 | 37 | \item{colour_node_mid}{colour for the null feature influence values (i.e., nodes; default: light grey)} 38 | 39 | \item{colour_node_high}{colour for the positive feature influence values (i.e., nodes; default: blue)} 40 | 41 | \item{text_size}{size of node labels.} 42 | 43 | \item{hide_isolated_nodes}{logical, default = TRUE (= nodes without any edge are not shown).} 44 | 45 | \item{seed}{the seed to use for generating the network.} 46 | } 47 | \value{ 48 | a ggraph object 49 | } 50 | \description{ 51 | Returns a plot of nodes and edges. Plots are created with the ggraph and igraph packages. 52 | } 53 | \examples{ 54 | library(randomForest) 55 | library(caret) 56 | 57 | # import data and fit model 58 | data(iris) 59 | mod <- randomForest(Species ~ ., data = iris) 60 | 61 | # Fit a decision ensemble to predict the species setosa (vs versicolor and 62 | # virginica): no regularization (no decision pruning, discretization, 63 | # bootstrapping, or decision filtering) 64 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 65 | , target = iris$Species, classPos = "setosa" 66 | , filter = FALSE, discretize = FALSE, prune = FALSE) 67 | 68 | # Only decision pruning (default = TRUE) and discretization (default in 2 69 | # categories, we want 3 categories so we change K); no bootstrapping or 70 | # decision filtering. 71 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 72 | , target = iris$Species, classPos = "setosa" 73 | , filter = FALSE, discretize = TRUE, K = 3) 74 | 75 | # idem but run it in parallel on 2 threads 76 | endo_setosa <- model2DE(model = mod, model_type = "rf", data = iris[, -5] 77 | , target = iris$Species, classPos = "setosa" 78 | , filter = FALSE, discretize = TRUE, K = 3 79 | , in_parallel = TRUE, n_cores = 2) 80 | 81 | # Plot the decision ensemble: 82 | # Plants from the setosa species have small petals and narrow, long sepals. 83 | plotFeatures(endo_setosa, levels_order = c("Low", "Medium", "High")) 84 | plotNetwork(endo_setosa, hide_isolated_nodes = FALSE, layout = "fr") 85 | } 86 | -------------------------------------------------------------------------------- /R/discretizeData.R: -------------------------------------------------------------------------------- 1 | #' Discretize numerical variables in a dataset 2 | #' 3 | #' This function discretizes all numerical variables, or only the ones passed in features, into K categories based on their quantiles. 4 | #' Names of categories can optionally be given in knames. 5 | #' 6 | #' @param data data to discretize. 7 | #' @param K numeric, number of categories (default: K = 2). 8 | #' @param features vector with variable names or column numbers to discretize. If NULL (default), then all numeric variables are discretized. 9 | #' @param knames optional: character vector of the same length as K, containing the ordered names for categories. 10 | #' @param return_split if TRUE, then the table with thresholds used to discretize data is also returned. 11 | #' @return Data with discretized variables.
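#' @examples
#' # Minimal usage sketch (added for illustration; assumes only base R and the
#' # iris dataset): discretize the numeric iris features into K = 3
#' # quantile-based categories.
#' data(iris)
#' iris_ctg <- discretizeData(iris[, -5], K = 3)
#' summary(iris_ctg)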
12 | #' @export 13 | discretizeData <- function(data, K = 2, features = NULL, knames = NULL, return_split = FALSE) { 14 | 15 | # set category names 16 | if (is.null(knames) == TRUE) { 17 | if (K == 2) { 18 | knames <- c("min", "Low", "High") 19 | } else if (K == 3) { 20 | knames <- c("min", "Low", "Medium", "High") 21 | } else if (K == 4) { 22 | knames <- c("min", "veryLow", "Low", "High", "veryHigh") 23 | } else if (K == 5) { 24 | knames <- c("min", "veryLow", "Low", "Medium", "High", "veryHigh") 25 | } else { 26 | knames <- c("min", paste("L", seq(1:K), sep = "")) 27 | } 28 | } else if (length(knames) == K) { 29 | knames <- c("min", knames) 30 | } 31 | 32 | # get column numbers - not the most efficient but later easier 33 | if (is.character(features) == TRUE) { 34 | colNb <- which(colnames(data) %in% features) 35 | } else if (is.numeric(features) == TRUE) { 36 | colNb <- features 37 | } else { 38 | colNb <- which(sapply(data, function(x) { 39 | (length(unique(x)) > 2 & is.numeric(x)) 40 | })) 41 | } 42 | 43 | # discretize 44 | data_ctg <- as.data.table(data) 45 | tmp <- lapply(data_ctg[, (colNb), with = FALSE], discretizeVector, K = K, knames = knames, return_all = TRUE) 46 | names(tmp) <- as.character(colNb) 47 | for (j in colNb) { 48 | set(data_ctg, i = NULL, j = j, tmp[[as.character(j)]]$newV) 49 | set(data_ctg, i = which(data_ctg[, j, with = FALSE] == ""), j = j, value = NA) 50 | } 51 | data_ctg <- data_ctg[, c(colNb) := lapply(.SD, as.factor), .SDcols = colNb] 52 | 53 | 54 | # get the split table if asked 55 | if (return_split == TRUE) { 56 | splitV <- list() 57 | for (i in 1:length(tmp)) { 58 | splitV[[i]] <- tmp[[i]]$splitV 59 | } 60 | names(splitV) <- names(tmp) 61 | 62 | splitV_med <- lapply(names(splitV), FUN = getMedian, data = data, splitV = splitV) 63 | names(splitV_med) <- names(splitV) 64 | 65 | return(list("data_ctg" = data_ctg, "splitV" = splitV, "splitV_med" = splitV_med)) 66 | } 67 | 68 | return(data_ctg) 69 | } 70 | 71 | 72 | ############## 73 | getMedian <- function(var, data, splitV) { 74 | dataVar <- data[[as.numeric(var)]] 75 | med <- list() 76 | for (i in 2:length(splitV[[var]])) { 77 | iX <- which(dataVar > splitV[[var]][i - 1] & dataVar <= splitV[[var]][i]) 78 | med <- c(med, median(dataVar[iX], na.rm = TRUE)) 79 | } 80 | names(med) <- names(splitV[[var]])[-1] 81 | 82 | if (is.na(med[[1]])){ 83 | med[[1]] = min(dataVar) 84 | } 85 | return(med) 86 | } 87 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/aggregateTaxa.R: -------------------------------------------------------------------------------- 1 | #' Aggregate taxa according to a measure. 2 | #' 3 | #' This function aggregates taxa according to their rank from a measure passed in weights: if a taxon has a lower rank than its coarser level (type = 'coarse'), it is aggregated into the coarser one; if there is a unique finer level with a better rank than the coarser one, then the coarser is aggregated into the finer one. Comparison is done only for the family, genus and species levels. 4 | #' 5 | #' @param taxa taxa should be a vector with the long format of the features of the taxa table with taxonomic levels as columns - cf taxa2vector; or a data.frame with the feature name and all its other coarser/finer levels. 6 | #' @param features if taxa is a vector or a data.frame that lacks all levels, a vector of the features to be checked. 7 | #' @param weights a data.frame with a column Feature and a column weight; for type = 'both' or 'coarse' only.
8 | #' @param thr numeric, value after which the algorithm should stop looking for better ranks in a taxonomic branch; for type = 'both' or 'coarse' only. If NULL, the median of weights is used. 9 | #' @param type character. If 'coarse', finer levels are aggregated into their coarser one if it has a better rank. If 'fine', then coarser levels are aggregated into a finer level if it has a better rank and is unique, i.e., there is a unique finer level for that coarser level. If 'both', both aggregation steps are sequentially performed. 10 | #' 11 | #' @return A dataframe with aggregated features in the "Feature" column, and the 'recipient' taxa in the "newFeature" column. 12 | #' 13 | #' @export 14 | aggregateTaxa <- function(taxa, features = NULL, weights = NULL, thr = NULL, type = "coarse") { 15 | if (!is.null(weights) && !("weight" %in% colnames(weights))) { 16 | wI <- which(sapply(weights, is.numeric)) 17 | colnames(weights)[wI] <- "weight" 18 | } 19 | 20 | if (is.vector(taxa) == TRUE || !("Feature" %in% colnames(taxa))) { 21 | taxa <- elongateTaxa(features = features, taxa = taxa) 22 | } 23 | 24 | if (is.null(thr) == TRUE) { 25 | thr <- median(weights$weight) 26 | } 27 | 28 | ### wrapper of the 2 collapsing functions, with the option of performing both 29 | if (type == "coarse") { 30 | taxa <- aggregateTaxa_coarse(taxa, weights, thr) 31 | } 32 | 33 | if (type == "fine") { 34 | taxa <- aggregateTaxa_fine(taxa) 35 | } 36 | 37 | if (type == "both") { 38 | taxa_d <- aggregateTaxa_coarse(taxa, weights, thr) 39 | taxa_u <- aggregateTaxa_fine(subset(taxa, Feature %in% taxa_d$newFeature)) 40 | 41 | taxa <- unique(taxa_u[, .(Feature, newFeature)])[, coarseFeature := newFeature][, newFeature := NULL] 42 | taxa <- merge(taxa, taxa_d[, .(newFeature, Feature, weight, newWeight)], by.x = "coarseFeature", by.y = "newFeature") 43 | 44 | # record how many times a newFeature has been attributed and who was changed 45 | taxa <- taxa[, n := .N, by = "newFeature"][, "changed" := "Unchanged"][ 46 | n > 1 & newFeature == Feature, changed := "Recipient" 47 | ][ 48 | newFeature != coarseFeature & coarseFeature == Feature, changed := "Upgraded" 49 | ][ 50 | newFeature == coarseFeature & coarseFeature != Feature, changed := "Downgraded" 51 | ][ 52 | newFeature != coarseFeature & coarseFeature != Feature, changed := "Down&Upgraded" 53 | ][ 54 | , changed := as.factor(changed) 55 | ] 56 | } 57 | 58 | return(taxa) 59 | } 60 | -------------------------------------------------------------------------------- /examples/iris_bootstraps.R: -------------------------------------------------------------------------------- 1 | library(randomForest) 2 | library(caret) 3 | 4 | # import data and fit model 5 | data(iris) 6 | mod <- randomForest(Species ~ ., data = iris) 7 | 8 | # Get decision ensemble with bootstrapping.
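# (Note added for clarity: each bootstrap re-extracts and re-scores decisions on
# a resample of the data; stabilitySelection() then keeps the decisions that
# were selected in a sufficient fraction of bootstraps.)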
9 | 10 | # Run 1 bootstrap after the other (times = 2 bootstraps) 11 | endo_setosa <- model2DE_resampling(model = mod, model_type = "rf" 12 | , data = iris[, -5], target = iris$Species, classPos = "setosa" 13 | , times = 2, in_parallel = TRUE, n_cores = 2, filter = FALSE) 14 | 15 | # Same but use different sample weights for bootstrapping 16 | n_setosa <- sum(iris$Species == "setosa") 17 | n_samp <- length(iris$Species) 18 | samp_weight <- round( 19 | ifelse(iris$Species == "setosa", 1 - n_setosa/n_samp, n_setosa/n_samp) 20 | , digits = 2) 21 | endo_setosa <- model2DE_resampling(model = mod, model_type = "rf" 22 | , data = iris[, -5], target = iris$Species, classPos = "setosa" 23 | , times = 2, sample_weight = samp_weight 24 | , in_parallel = TRUE, n_cores = 2, filter = FALSE) 25 | 26 | # Run the bootstraps in parallel 27 | # First do all steps before bootstrapping 28 | preclu <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 29 | , target = iris$Species, classPos = "setosa", times = 2 30 | , discretize = TRUE, in_parallel = FALSE) 31 | 32 | # Remove the special characters from column names 33 | colnames(preclu$data) <- compatibleNames(colnames(preclu$data)) 34 | 35 | # Parameters for clustermq: can also run on HPC environment 36 | library(clustermq) 37 | options(clustermq.scheduler = "multiprocess") 38 | # ... and run in parallel on each bootstrap 39 | # (preclu$partitions = list of sample indexes for each bootstrap) 40 | endo_setosa <- Q(model2DE_cluster 41 | , partition = preclu$partitions 42 | , export = list(data = preclu$data 43 | , target = iris$Species 44 | , exec = preclu$exec 45 | , classPos = "setosa" 46 | , prune = TRUE, filter = FALSE 47 | , maxDecay = 0.05 # values needed for maxDecay and typeDecay 48 | , typeDecay = 2 # here default ones, see pruneDecisions() 49 | , in_parallel = FALSE # can parallelize within each bootstrap! 50 | ) 51 | , n_jobs = 2 # max number of bootstraps that can be run in parallel 52 | , pkgs = c("data.table", "parallel", "caret", "stringr", "scales" 53 | , "dplyr", "inTrees", "endoR") 54 | , log_worker = FALSE # to keep a log of the runs, e.g., if it fails 55 | ) 56 | 57 | 58 | # Stability selection 59 | # First we can look at the effect of the alpha parameter on selection; 60 | # alpha = expected number of false decisions 61 | alphas <- evaluateAlpha(rules = endo_setosa, alphas = c(1:5, 7, 10) 62 | , data = preclu$data) 63 | alphas$summary_table 64 | 65 | # perform stability selection with alpha = 7 66 | de_final <- stabilitySelection(rules = endo_setosa, alpha_error = 7) 67 | 68 | # Plot the decision ensemble: 69 | # Plants from the setosa species have small petals and narrow, long sepals. 70 | plotFeatures(de_final, levels_order = c("Low", "Medium", "High")) 71 | 72 | # there is no interaction between variables (all decisions with len = 1, 73 | # the number of variables in the rules) 74 | de_final$rules_summary 75 | # hence the network would be empty and couldn't be plotted... 76 | # plotNetwork(de_final, hide_isolated_nodes = FALSE) 77 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/model2DE_resampling.R: -------------------------------------------------------------------------------- 1 | #' Run model2DE on several bootstrap resamples. 2 | #' 3 | #' Wrapper around the model2DE function to run it on several bootstrap resamples. 4 | #' 5 | #' @param ... arguments to be passed to the model2DE function.
6 | #' @param times number of bootstraps 7 | #' @param p fraction of data to resample. 8 | #' @param sample_weight numeric vector with the weights of samples for bootstrap resampling. For classification, if 2 values are given, the 1st one is assumed to be for the positive class (classpos argument). 9 | #' @param in_parallel if TRUE, the function is run in parallel 10 | #' @param n_cores if in_parallel = TRUE, and no cluster has been passed: number of cores to use 11 | #' @param cluster the cluster to use to run the function in parallel 12 | #' @return A list with the row numbers of partitioned data, the rules originally extracted from the model, a list with results from each bootstrap (use stabilitySelection to obtain the stable decision ensemble). 13 | #' @import data.table 14 | #' @export 15 | model2DE_resampling <- function(model, model_type, data, target, classPos = NULL, 16 | times = 10, p = .5, sample_weight = NULL, 17 | ntree = "all", maxdepth = Inf, 18 | dummy_var = NULL, 19 | prune = TRUE, maxDecay = 0.05, typeDecay = 2, 20 | discretize = TRUE, K = 2, mode = "data", 21 | filter = TRUE, min_imp = 0.9 22 | # , aggregate_taxa = FALSE, taxa = NULL 23 | # , alpha_error = 1, pi_thr = 0.7 24 | , seed = 0, in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 25 | 26 | 27 | ### Get the partitions and rules 28 | tmp <- preCluster( 29 | model = model, model_type = model_type, data = data, target = target, 30 | times = times, p = p, sample_weight = sample_weight, classPos = classPos, 31 | ntree = ntree, maxdepth = maxdepth, dummy_var = dummy_var, 32 | discretize = discretize, K = K, mode = mode, 33 | seed = seed, 34 | in_parallel = in_parallel, n_cores = n_cores 35 | ) 36 | exec <- tmp$exec 37 | partitions <- tmp$partitions 38 | data <- tmp$data 39 | 40 | 41 | ### Get a network for each subset 42 | # define classPos if it has not been passed 43 | if (is.character(target) && is.null(classPos) == TRUE) { 44 | classPos <- names(which.max(table(target))) 45 | cat("Positive class:", classPos, "\n") 46 | } 47 | resamp <- list() 48 | k <- 1 49 | for (ix in partitions) { 50 | res <- model2DE( 51 | data = data[ix, ], target = target[ix], 52 | exec = exec, 53 | classPos = classPos, 54 | prune = prune, maxDecay = maxDecay, typeDecay = typeDecay, 55 | filter = filter, 56 | in_parallel = in_parallel, n_cores = n_cores, 57 | light = TRUE 58 | ) 59 | 60 | # get the decisions: depends on options so we'll just take whatever was produced before getting nodes! 61 | i_rules <- which(names(res) == "nodes") - 1 62 | 63 | res <- list( 64 | "pdecisions" = res$n_decisions, 65 | "rules" = res[[i_rules]], 66 | "nodes_agg" = res$nodes_agg, "edges_agg" = res$edges_agg 67 | ) 68 | 69 | resamp[[k]] <- res 70 | k <- k + 1 71 | } 72 | 73 | 74 | return(list("partitions" = partitions, "exec" = exec, "data" = data, "resamp" = resamp)) 75 | } 76 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/evaluateAlpha.R: -------------------------------------------------------------------------------- 1 | #' Calculate the number of decisions and of predicted samples from decision ensembles obtained with different alpha values 2 | #' 3 | #' The aim is to help pick an alpha that will result in a decision ensemble able to predict most samples. Performs stability selection for each of the given alpha values. The number of decisions and of samples that follow decisions are also calculated.
4 | #' The procedure is adapted from Meinshausen and Bühlmann (2010): the best decisions from each bootstrap are pre-selected, and the ones that were pre-selected in a certain fraction of bootstraps are included in the stable decision ensemble. 5 | #' The decision importances and multiplicities are averaged across bootstraps. Decision-wise feature and interaction importances and influences are averaged across bootstraps before computing the feature and interaction importances and influences from the stable decision ensemble. 6 | #' @param rules list of bootstrap results 7 | #' @param alphas numeric vector of expected numbers of false positive decisions selected, to evaluate (default = c(5, 10, 15, 20, 30, 50, 75)). 8 | #' @param decision_ensembles should the decision ensembles be returned? 9 | #' @param pi_thr fraction of bootstraps in which a decision must have been selected to be included in the stable decision ensemble (default = 0.7). 10 | #' @return A list with all decisions from all bootstraps, the summary of decisions across bootstraps, the feature and interaction importance and influence in the nodes and edges dataframes, as well as the decision-wise feature and interaction importances and influences in the nodes_agg and edges_agg dataframes. 11 | #' @export 12 | 13 | evaluateAlpha <- function(rules, alphas = c(5, 10, 15, 20, 30, 50, 75), pi_thr = 0.7, 14 | data = NULL, decision_ensembles = TRUE, 15 | aggregate_taxa = FALSE, taxa = NULL) { 16 | check_sampl <- data.frame(alpha = numeric(), n_decision = numeric(), n_samples = numeric()) 17 | if ("resamp" %in% names(rules)) { 18 | minN <- pi_thr * length(rules$resamp) 19 | } else { 20 | minN <- pi_thr * length(rules) 21 | } 22 | 23 | if (is.null(data) & !("data" %in% names(rules))) { 24 | warning("Please provide the data used to create the decision ensemble (original if not discretized, precluster$data if discretized).") 25 | return() 26 | } else if ("data" %in% names(rules)) { 27 | data <- rules$data 28 | } 29 | 30 | res <- list() 31 | check_sampl <- data.frame(alpha = numeric(), n_dec = numeric(), n_samp = numeric()) 32 | 33 | for (i in 1:length(alphas)) { 34 | tmp <- stabilitySelection(rules = rules, alpha_error = alphas[i], pi_thr = pi_thr, aggregate_taxa = aggregate_taxa, taxa = taxa) 35 | if (decision_ensembles == TRUE) { 36 | res[[as.character(alphas[i])]] <- tmp 37 | } 38 | 39 | tmp <- subset(tmp$rules_summary, inN >= minN) 40 | 41 | cond <- tmp$condition 42 | if (length(cond) == 0) { 43 | check_sampl <- add_row(check_sampl, alpha = alphas[i], n_dec = 0, n_samp = 0) 44 | } else { 45 | cond <- str_replace_all(cond, pattern = "X", replacement = "data") 46 | cond <- paste0("which(", cond, ")") 47 | pred_ix <- lapply(cond, function(x) { 48 | eval(parse(text = x)) 49 | }) %>% 50 | unlist() %>% 51 | unique() 52 | check_sampl <- add_row(check_sampl, alpha = alphas[i], n_dec = length(cond), n_samp = length(pred_ix)) 53 | } 54 | } 55 | if (decision_ensembles == TRUE) { 56 | res[["summary_table"]] <- check_sampl 57 | return(res) 58 | } else { 59 | return(check_sampl) 60 | } 61 | } 62 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/getThresholds.R: -------------------------------------------------------------------------------- 1 | #' Get thresholds to discretize variables according to their split in tree ensemble models 2 | #' @export 3 | 4 | getThresholds <- function(conditions, data, Kmax = 2) { 5 | ### a little slow, may be improved 6 | 7 | # get all individual sub conditions per variable 8 | var_cond <-
unlist(lapply(conditions, function(x) { 9 | unlist(strsplit(x, split = " & ")) 10 | })) 11 | # make it to a data.frame with variable and thresholds used 12 | var_cond <- data.frame( 13 | var = as.numeric(str_extract(var_cond, pattern = "[:digit:]*(?=\\])")), 14 | thr = as.numeric(str_extract(var_cond, pattern = "-?[:digit:]*\\.?[:digit:]*$")) 15 | ) 16 | 17 | # and now transform to a list 18 | var_split <- lapply(sort(unique(var_cond$var)), 19 | function(x, var_cond, data) { 20 | thr <- unlist(subset(var_cond, var == x, select = thr)) 21 | var_v <- data[[x]] 22 | thr[thr < min(var_v)] <- min(var_v) 23 | thr[thr > max(var_v)] <- max(var_v) 24 | return(thr) 25 | }, 26 | var_cond = var_cond, data = data 27 | ) 28 | names(var_split) <- as.character(sort(unique(var_cond$var))) 29 | 30 | # remove non-numeric variables 31 | are_num <- as.character(which(sapply(data, function(x) { 32 | length(unique(x)) > 2 33 | }))) 34 | are_num <- are_num[are_num %in% names(var_split)] 35 | var_split <- var_split[are_num] 36 | 37 | # get thresholds for discretization 38 | if (Kmax == 2) { 39 | new_thr <- lapply(var_split, getMode) 40 | } else { 41 | new_thr <- lapply(var_split, getModes_all) 42 | new_thr <- lapply(new_thr, function(x, Kmax) { 43 | x[1:min(Kmax - 1, length(x))] 44 | }, Kmax = Kmax) 45 | } 46 | 47 | # remove thresholds out of range 48 | colNb <- as.integer(names(new_thr)) 49 | new_thr <- lapply(colNb, function(x, data, thr) { 50 | list( 51 | "var_v" = data[[x]], 52 | "thr" = thr[[as.character(x)]] 53 | ) 54 | }, data = data, thr = new_thr) 55 | 56 | new_thr <- lapply(new_thr, function(x) { 57 | x$thr <- x$thr[x$thr >= min(x$var_v) & x$thr <= max(x$var_v)] 58 | if (length(x$thr) == 0) x$thr <- min(x$var_v) 59 | return(x) 60 | }) 61 | names(new_thr) <- as.character(colNb) 62 | 63 | return(new_thr) 64 | } 65 | 66 | 67 | ########## 68 | 69 | getMode <- function(x) { 70 | if (length(x) == 1) { 71 | return(x) 72 | } else { 73 | tmp <- density(x) 74 | return(tmp$x[which.max(tmp$y)]) 75 | } 76 | } 77 | 78 | 79 | getModes_all <- function(var) { 80 | # copied from the pastecs R-package: 81 | # https://github.com/phgrosjean/pastecs/blob/master/R/turnpoints.R 82 | # (I just removed the unnecessary bits.)
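# Added explanatory note: the code below locates the turning points of the
# kernel density of `var` and returns the x-positions of its local maxima (the
# modes), ordered from the highest to the lowest peak.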
83 | 84 | if (length(var) == 1) { 85 | return(var) 86 | } 87 | 88 | x <- as.vector(density(var)$y) 89 | n <- length(x) 90 | diffs <- c(x[1] - 1, x[1:(n - 1)]) != x 91 | 92 | uniques <- x[diffs] 93 | 94 | n2 <- length(uniques) 95 | poss <- (1:n)[diffs] 96 | exaequos <- c(poss[2:n2], n + 1) - poss - 1 97 | 98 | m <- n2 - 2 99 | ex <- matrix(uniques[1:m + rep(3:1, rep(m, 3)) - 1], m) 100 | peaks <- c(FALSE, apply(ex, 1, max, na.rm = TRUE) == ex[, 2], FALSE) 101 | tppos <- (poss + exaequos)[peaks] 102 | 103 | # Now, order the peaks and return the x values 104 | y_peaks <- x[tppos] 105 | x_peaks <- density(var)$x[tppos] 106 | 107 | peaks_order <- order(y_peaks, decreasing = TRUE) 108 | # y_peaks <- y_peaks[peaks_order] 109 | # x_peaks <- x_peaks[peaks_order] 110 | return(x_peaks[peaks_order]) 111 | } 112 | -------------------------------------------------------------------------------- /vignettes/iris_multiclass.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "endoR: multiclass" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{endoR: multiclass} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | %\VignetteEncoding{UTF-8} 8 | --- 9 | 10 | ```{r, include = FALSE} 11 | knitr::opts_chunk$set( 12 | collapse = TRUE, 13 | comment = "#>", 14 | fig.width = 7, 15 | fig.height = 2.5 16 | ) 17 | ``` 18 | 19 | This vignette illustrates how to interpret multiclass models with endoR. 20 | We will use the iris data for this purpose: the species is being predicted 21 | (= target) using the length and width of petals and sepals (= features). 22 | 23 | # Preamble 24 | ```{r setup, message = FALSE, warning=FALSE} 25 | library(endoR) 26 | library(randomForest) 27 | library(tidyverse) 28 | library(caret) 29 | library(ggpubr) 30 | library(ggraph) 31 | ``` 32 | 33 | # Data 34 | ```{r} 35 | data(iris) 36 | summary(iris) 37 | ``` 38 | 39 | # Fit model 40 | ```{r} 41 | set.seed(1313) 42 | mod <- randomForest(Species ~ ., data = iris) 43 | mod 44 | ``` 45 | 46 | # Fit endoR for the "setosa" species 47 | 48 | We will only prune (prune = TRUE) and discretize decisions into K = 3 levels. 49 | ```{r} 50 | endo_setosa <- model2DE(model = mod, model_type = 'rf' 51 | , data = select(iris, -Species), target = iris$Species 52 | , classPos = 'setosa' # our focal class 53 | , filter = FALSE 54 | 55 | # we discretize the numerical features into K = 3 categories 56 | , discretize = TRUE, K = 3) 57 | ``` 58 | 59 | Plants from the setosa species have small petals and narrow, long sepals. 60 | ```{r} 61 | plotFeatures(endo_setosa, levels_order = c('Low', 'Medium', 'High')) 62 | 63 | # The font warnings are due to Windows; no worries. 64 | plotNetwork(endo_setosa, hide_isolated_nodes = FALSE, layout = 'fr') 65 | ``` 66 | 67 | The only interaction used for predictions is the one of sepal length (High = 68 | long) and width (Low = narrow). We can also use ggplot2 `theme` to format the 69 | plot, e.g., put the legend boxes next to each other. 70 | ```{r} 71 | plotNetwork(endo_setosa, hide_isolated_nodes = TRUE, layout = 'fr')+ 72 | theme(legend.box = "horizontal") 73 | ``` 74 | 75 | 76 | # Fit endoR for the "versicolor" species 77 | 78 | This time we will filter decisions based on their importance to trim the network 79 | (filter = TRUE). We will use min_imp = 0.5 to keep at least all decisions with 80 | an importance > 0.5*the best importance (the lower min_imp, the milder the 81 | filtering).
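For instance, if the most important decision has an importance of 0.8, then
min_imp = 0.5 guarantees that every decision with an importance above 0.4 is
kept.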
82 | 83 | ```{r} 84 | endo_versicolor <- model2DE(model = mod, model_type = 'rf' 85 | , data = select(iris, -Species), target = iris$Species 86 | , classPos = 'versicolor' 87 | , K = 3, discretize = TRUE 88 | , filter = TRUE, min_imp = 0.5) 89 | ``` 90 | 91 | 92 | The petals' proportions are intermediate between those of the setosa and 93 | virginica species. 94 | ```{r} 95 | plotFeatures(endo_versicolor, levels_order = c('Low', 'Medium', 'High')) 96 | ``` 97 | 98 | Sepals have a narrow-to-intermediate width and a long-to-intermediate length 99 | (= if wide and short, then it is not a versicolor, as seen on the network). 100 | ```{r} 101 | plotNetwork(endo_versicolor, hide_isolated_nodes = FALSE, layout = 'fr' 102 | # we show only edges that connect 3 nodes max -> removes the edges with 103 | # the lowest importances; for longer paths (= a more complex network), 104 | # you can increase path_length 105 | , path_length = 3 106 | )+ 107 | scale_edge_alpha(range = c(0.8,1))+ 108 | theme(legend.box = "horizontal") 109 | ``` 110 | -------------------------------------------------------------------------------- /R/getThresholds.R: -------------------------------------------------------------------------------- 1 | #' Get thresholds to discretize variables according to their split in tree ensemble models 2 | #' @param conditions character vector with all conditions from which to find the thresholds 3 | #' @param data data to discretize 4 | #' @param Kmax numeric, maximal number of categories for each variable (default: Kmax = 2). 5 | #' @export 6 | 7 | getThresholds <- function(conditions, data, Kmax = 2) { 8 | ### a little slow, may be improved 9 | 10 | # get all individual sub conditions per variable 11 | var_cond <- unlist(lapply(conditions, function(x) { 12 | unlist(strsplit(x, split = " & ")) 13 | })) 14 | # make it to a data.frame with variable and thresholds used 15 | var_cond <- data.frame( 16 | var = as.numeric(str_extract(var_cond, pattern = "[:digit:]*(?=\\])")), 17 | thr = as.numeric(str_extract(var_cond, pattern = "-?[:digit:]*\\.?[:digit:]*$")) 18 | ) 19 | 20 | # and now transform to a list 21 | var_split <- lapply(sort(unique(var_cond$var)), 22 | function(x, var_cond, data) { 23 | thr <- unlist(subset(var_cond, var == x, select = thr)) 24 | var_v <- data[[x]] 25 | thr[thr < min(var_v)] <- min(var_v) 26 | thr[thr > max(var_v)] <- max(var_v) 27 | return(thr) 28 | }, 29 | var_cond = var_cond, data = data 30 | ) 31 | names(var_split) <- as.character(sort(unique(var_cond$var))) 32 | 33 | # remove non-numeric variables 34 | are_num <- as.character(which(sapply(data, function(x) { 35 | length(unique(x)) > 2 36 | }))) 37 | are_num <- are_num[are_num %in% names(var_split)] 38 | var_split <- var_split[are_num] 39 | 40 | # get thresholds for discretization 41 | if (Kmax == 2) { 42 | new_thr <- lapply(var_split, getMode) 43 | } else { 44 | new_thr <- lapply(var_split, getModes_all) 45 | new_thr <- lapply(new_thr, function(x, Kmax) { 46 | x[1:min(Kmax - 1, length(x))] 47 | }, Kmax = Kmax) 48 | } 49 | 50 | # remove thresholds out of range 51 | colNb <- as.integer(names(new_thr)) 52 | new_thr <- lapply(colNb, function(x, data, thr) { 53 | list( 54 | "var_v" = data[[x]], 55 | "thr" = thr[[as.character(x)]] 56 | ) 57 | }, data = data, thr = new_thr) 58 | 59 | new_thr <- lapply(new_thr, function(x) { 60 | x$thr <- x$thr[x$thr >= min(x$var_v) & x$thr <= max(x$var_v)] 61 | if (length(x$thr) == 0) x$thr <- min(x$var_v) 62 | return(x) 63 | }) 64 | names(new_thr) <- as.character(colNb) 65 | 66 | return(new_thr) 67 | } 68 |
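# Added note: the helper functions below derive discretization thresholds from
# the pooled split values of a variable: getMode() returns the location of the
# highest kernel-density peak, and getModes_all() returns all peak locations
# ordered by decreasing height.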
70 | ########## 71 | 72 | getMode <- function(x) { 73 | if (length(x) == 1) { 74 | return(x) 75 | } else { 76 | tmp <- density(x) 77 | return(tmp$x[which.max(tmp$y)]) 78 | } 79 | } 80 | 81 | 82 | getModes_all <- function(var) { 83 | # copied from the pastecs R-package: 84 | # https://github.com/phgrosjean/pastecs/blob/master/R/turnpoints.R 85 | # (I just removed the unnecessary bits.) 86 | 87 | if (length(var) == 1) { 88 | return(var) 89 | } 90 | 91 | x <- as.vector(density(var)$y) 92 | n <- length(x) 93 | diffs <- c(x[1] - 1, x[1:(n - 1)]) != x 94 | 95 | uniques <- x[diffs] 96 | 97 | n2 <- length(uniques) 98 | poss <- (1:n)[diffs] 99 | exaequos <- c(poss[2:n2], n + 1) - poss - 1 100 | 101 | m <- n2 - 2 102 | ex <- matrix(uniques[1:m + rep(3:1, rep(m, 3)) - 1], m) 103 | peaks <- c(FALSE, apply(ex, 1, max, na.rm = TRUE) == ex[, 2], FALSE) 104 | tppos <- (poss + exaequos)[peaks] 105 | 106 | # Now, order the peaks and return the x values 107 | y_peaks <- x[tppos] 108 | x_peaks <- density(var)$x[tppos] 109 | 110 | peaks_order <- order(y_peaks, decreasing = TRUE) 111 | # y_peaks <- y_peaks[peaks_order] 112 | # x_peaks <- x_peaks[peaks_order] 113 | return(x_peaks[peaks_order]) 114 | } 115 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/vignettes/iris_multiclass.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "endoR: multiclass" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{endoR: multiclass} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | %\VignetteEncoding{UTF-8} 8 | --- 9 | 10 | ```{r, include = FALSE} 11 | knitr::opts_chunk$set( 12 | collapse = TRUE, 13 | comment = "#>", 14 | fig.width = 7, 15 | fig.height = 2.5 16 | ) 17 | ``` 18 | 19 | This vignette illustrates how to interpret multiclass models with endoR. 20 | We will use the iris data for this purpose: the species is being predicted 21 | (= target) using the length and width of petals and sepals (= features). 22 | 23 | # Preamble 24 | ```{r setup, message = FALSE, warning=FALSE} 25 | library(endoR) 26 | library(randomForest) 27 | library(tidyverse) 28 | library(caret) 29 | library(ggpubr) 30 | library(ggraph) 31 | ``` 32 | 33 | # Data 34 | ```{r} 35 | data(iris) 36 | summary(iris) 37 | ``` 38 | 39 | # Fit model 40 | ```{r} 41 | set.seed(1313) 42 | mod <- randomForest(Species ~ ., data = iris) 43 | mod 44 | ``` 45 | 46 | # Fit endoR for the "setosa" species 47 | 48 | We will only prune (prune = TRUE) and discretize decisions into K = 3 levels. 49 | ```{r} 50 | endo_setosa <- model2DE(model = mod, model_type = 'rf' 51 | , data = select(iris, -Species), target = iris$Species 52 | , classPos = 'setosa' # our focal class 53 | , filter = FALSE 54 | 55 | # we discretize the numerical features into K = 3 categories 56 | , discretize = TRUE, K = 3) 57 | ``` 58 | 59 | Plants from the setosa species have small petals and narrow, long sepals. 60 | ```{r} 61 | plotFeatures(endo_setosa, levels_order = c('Low', 'Medium', 'High')) 62 | 63 | # The font warnings are due to Windows; no worries. 64 | plotNetwork(endo_setosa, hide_isolated_nodes = FALSE, layout = 'fr') 65 | ``` 66 | 67 | The only interaction used for predictions is the one of sepal length (High = 68 | long) and width (Low = narrow). We can also use ggplot2 `theme` to format the 69 | plot, e.g., put the legend boxes next to each other.
70 | ```{r} 71 | plotNetwork(endo_setosa, hide_isolated_nodes = TRUE, layout = 'fr')+ 72 | theme(legend.box = "horizontal") 73 | ``` 74 | 75 | 76 | # Fit endoR for the "versicolor" species 77 | 78 | This time we will filter decisions based on their importance to trim the network 79 | (filter = TRUE). We will use min_imp = 0.5 to keep at least all decisions with 80 | an importance > 0.5*the best importance (the lower min_imp, the milder the 81 | filtering). 82 | 83 | ```{r} 84 | endo_versicolor <- model2DE(model = mod, model_type = 'rf' 85 | , data = select(iris, -Species), target = iris$Species 86 | , classPos = 'versicolor' 87 | , K = 3, discretize = TRUE 88 | , filter = TRUE, min_imp = 0.5) 89 | ``` 90 | 91 | 92 | The petals' proportions are intermediate between those of the setosa and 93 | virginica species. 94 | ```{r} 95 | plotFeatures(endo_versicolor, levels_order = c('Low', 'Medium', 'High')) 96 | ``` 97 | 98 | Sepals have a narrow-to-intermediate width and a long-to-intermediate length 99 | (= if wide and short, then it is not a versicolor, as seen on the network). 100 | ```{r} 101 | plotNetwork(endo_versicolor, hide_isolated_nodes = FALSE, layout = 'fr' 102 | # we show only edges that connect 3 nodes max -> removes the edges with 103 | # the lowest importances; for longer paths (= a more complex network), 104 | # you can increase path_length 105 | , path_length = 3 106 | )+ 107 | scale_edge_alpha(range = c(0.8,1))+ 108 | theme(legend.box = "horizontal") 109 | ``` 110 | -------------------------------------------------------------------------------- /man/filterDecisionsImportances.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/filterDecisionsImportances.R 3 | \name{filterDecisionsImportances} 4 | \alias{filterDecisionsImportances} 5 | \title{Filter decisions according to their metrics} 6 | \usage{ 7 | filterDecisionsImportances(rules, min_imp = 0.7) 8 | } 9 | \arguments{ 10 | \item{rules}{data.frame corresponding to the decisions, with all their metrics.} 11 | 12 | \item{min_imp}{minimal relative importance of the decisions that must be kept; the threshold used to remove decisions will thus be lower than max(imp)*min_imp.} 13 | } 14 | \value{ 15 | The decision ensemble from which decisions with the lowest errors and/or importances have been removed, or are indicated in a column "filt_err"/"filt_imp". 16 | } 17 | \description{ 18 | This function filters decisions in a heuristic manner according to their importance and multiplicity. 19 | A relative importance threshold is calculated that maximises the average product (relative importance * n) and the number of decisions to be removed. 20 | All decisions with a relative importance above that threshold are kept. The argument min_imp is the minimal relative importance of the decisions kept. 21 | } 22 | \examples{ 23 | library(randomForest) 24 | library(caret) 25 | library(data.table) 26 | 27 | # import data and fit model 28 | data(iris) 29 | mod <- randomForest(Species ~ ., data = iris) 30 | 31 | # Let's get the decision ensemble. One could use the wrapping function 32 | # model2DE(), but we will run each function separately.
33 | 34 | # Get the raw decision ensemble 35 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 36 | , target = iris$Species, classPos = "setosa" 37 | , times = 1 # number of bootstraps, here just one 38 | , discretize = FALSE) # we will discretize outside for the example 39 | summary(de) 40 | # exec = the decision ensemble 41 | # partitions = list of sample indexes for bootstrapping 42 | # if we had done discretization, the new data would be in data_ctg 43 | de <- de$exec 44 | 45 | # Discretize variables in 3 categories - optional 46 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 47 | , K = 3, classPos = "setosa", mode = "data") 48 | data_ctg <- de$data_ctg 49 | de <- de$rules_ctg 50 | 51 | # Homogenize the decision ensemble 52 | de <- de[, condition := sapply(condition, function(x) { 53 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 54 | })] 55 | de <- unique( 56 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 57 | ) 58 | 59 | # Calculate decision metrics; we don't need the importances yet since we will 60 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 61 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 62 | , classPos = "setosa", importances = FALSE) 63 | de <- de[de_met, on = "condition"] 64 | 65 | # Pruning - optional 66 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 67 | , classPos = "setosa") 68 | 69 | # Decision importances 70 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 71 | , classPos = "setosa") 72 | 73 | # Filter out decisions with the lowest importance: min_imp = the minimal 74 | # importance in the decision ensemble compared to the maximal one. 75 | # E.g., if min_imp = 0.5, then at least all decisions with an 76 | # importance > 0.5*max(importance) will be kept. 77 | # This ensures that we don't throw out too much. 78 | # Since the decision ensemble is quite small, we don't need to filter much...
79 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 80 | 81 | # Get the network 82 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 83 | , classPos = "setosa") 84 | 85 | # Plot the feature importance/influence and the network 86 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 87 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 88 | } 89 | -------------------------------------------------------------------------------- /man/decisionImportance.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/decisionImportance.R 3 | \name{decisionImportance} 4 | \alias{decisionImportance} 5 | \title{Compute the importance of decisions} 6 | \usage{ 7 | decisionImportance( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | in_parallel = FALSE, 13 | n_cores = detectCores() - 1, 14 | cluster = NULL 15 | ) 16 | } 17 | \arguments{ 18 | \item{rules}{a data.frame with a column "condition" or a vector with name "condition".} 19 | 20 | \item{data}{data from which to get the decision support.} 21 | 22 | \item{target}{response variable.} 23 | 24 | \item{classPos}{if classification, the positive class.} 25 | 26 | \item{in_parallel}{if TRUE, the function is run in parallel.} 27 | 28 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 29 | 30 | \item{cluster}{the cluster to use to run the function in parallel.} 31 | } 32 | \value{ 33 | the data.frame passed in rules with the gain and importance of each decision. 34 | } 35 | \description{ 36 | This function computes the importance of decisions. 37 | } 38 | \examples{ 39 | library(randomForest) 40 | library(caret) 41 | library(data.table) 42 | 43 | # import data and fit model 44 | data(iris) 45 | mod <- randomForest(Species ~ ., data = iris) 46 | 47 | # Let's get the decision ensemble. One could use the wrapping function 48 | # model2DE(), but we will run each function separately. 49 | 50 | # Get the raw decision ensemble 51 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 52 | , target = iris$Species, classPos = "setosa" 53 | , times = 1 # number of bootstraps, here just one 54 | , discretize = FALSE) # we will discretize outside for the example 55 | summary(de) 56 | # exec = the decision ensemble 57 | # partitions = list of sample indexes for bootstrapping 58 | # if we had done discretization, the new data would be in data_ctg 59 | de <- de$exec 60 | 61 | # Discretize variables in 3 categories - optional 62 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 63 | , K = 3, classPos = "setosa", mode = "data") 64 | data_ctg <- de$data_ctg 65 | de <- de$rules_ctg 66 | 67 | # Homogenize the decision ensemble 68 | de <- de[, condition := sapply(condition, function(x) { 69 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 70 | })] 71 | de <- unique( 72 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 73 | ) 74 | 75 | # Calculate decision metrics; we don't need the importances yet since we will 76 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps.
77 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 78 | , classPos = "setosa", importances = FALSE) 79 | de <- de[de_met, on = "condition"] 80 | 81 | # Pruning - optional 82 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 83 | , classPos = "setosa") 84 | 85 | # Decision importances 86 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 87 | , classPos = "setosa") 88 | 89 | # Filter out decisions with the lowest importance: min_imp = the minimal 90 | # importance in the decision ensemble compared to the maximal one. 91 | # E.g., if min_imp = 0.5, then at least all decisions with an 92 | # importance > 0.5*max(importance) will be kept. 93 | # This ensures that we don't throw out too much. 94 | # Since the decision ensemble is quite small, we don't need to filter much... 95 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 96 | 97 | # Get the network 98 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 99 | , classPos = "setosa") 100 | 101 | # Plot the feature importance/influence and the network 102 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 103 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 104 | } 105 | -------------------------------------------------------------------------------- /man/model2DE_cluster.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/model2DE_cluster.R 3 | \name{model2DE_cluster} 4 | \alias{model2DE_cluster} 5 | \title{Run model2DE on several bootstrap resamples in parallel.} 6 | \usage{ 7 | model2DE_cluster(partition) 8 | } 9 | \arguments{ 10 | \item{partition}{a vector with row numbers to subset data.} 11 | } 12 | \description{ 13 | Function to use with the Q() function from the clustermq R-package, with the following arguments in export: 14 | data, target, exec, classPos, dummy_var, prune, maxDecay, typeDecay, filter, in_parallel, n_cores. 15 | See preCluster() to obtain the list of bootstrap resamples, the discretized data, and the exec dataframe with decisions. 16 | } 17 | \examples{ 18 | library(randomForest) 19 | library(caret) 20 | 21 | # import data and fit model 22 | data(iris) 23 | mod <- randomForest(Species ~ ., data = iris) 24 | 25 | # Get decision ensemble with bootstrapping.
26 | 27 | # Run 1 bootstrap after the other (times = 2 bootstraps) 28 | endo_setosa <- model2DE_resampling(model = mod, model_type = "rf" 29 | , data = iris[, -5], target = iris$Species, classPos = "setosa" 30 | , times = 2, in_parallel = TRUE, n_cores = 2, filter = FALSE) 31 | 32 | # Same but use different sample weights for bootstrapping 33 | n_setosa <- sum(iris$Species == "setosa") 34 | n_samp <- length(iris$Species) 35 | samp_weight <- round( 36 | ifelse(iris$Species == "setosa", 1 - n_setosa/n_samp, n_setosa/n_samp) 37 | , digits = 2) 38 | endo_setosa <- model2DE_resampling(model = mod, model_type = "rf" 39 | , data = iris[, -5], target = iris$Species, classPos = "setosa" 40 | , times = 2, sample_weight = samp_weight 41 | , in_parallel = TRUE, n_cores = 2, filter = FALSE) 42 | 43 | # Run the bootstraps in parallel 44 | # First do all steps before bootstrapping 45 | preclu <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 46 | , target = iris$Species, classPos = "setosa", times = 2 47 | , discretize = TRUE, in_parallel = FALSE) 48 | 49 | # Remove the special characters from column names 50 | colnames(preclu$data) <- compatibleNames(colnames(preclu$data)) 51 | 52 | # Parameters for clustermq: can also run on HPC environment 53 | library(clustermq) 54 | options(clustermq.scheduler = "multiprocess") 55 | # ... and run in parallel on each bootstrap 56 | # (preclu$partitions = list of sample indexes for each bootstrap) 57 | endo_setosa <- Q(model2DE_cluster 58 | , partition = preclu$partitions 59 | , export = list(data = preclu$data 60 | , target = iris$Species 61 | , exec = preclu$exec 62 | , classPos = "setosa" 63 | , prune = TRUE, filter = FALSE 64 | , maxDecay = 0.05 # values needed for maxDecay and typeDecay 65 | , typeDecay = 2 # here default ones, see pruneDecisions() 66 | , in_parallel = FALSE # can parallelize within each bootstrap! 67 | ) 68 | , n_jobs = 2 # max number of bootstraps that can be run in parallel 69 | , pkgs = c("data.table", "parallel", "caret", "stringr", "scales" 70 | , "dplyr", "inTrees", "endoR") 71 | , log_worker = FALSE # to keep a log of the runs, e.g., if it fails 72 | ) 73 | 74 | 75 | # Stability selection 76 | # First we can look at the effect of the alpha parameter on selection; 77 | # alpha = expected number of false decisions 78 | alphas <- evaluateAlpha(rules = endo_setosa, alphas = c(1:5, 7, 10) 79 | , data = preclu$data) 80 | alphas$summary_table 81 | 82 | # perform stability selection with alpha = 7 83 | de_final <- stabilitySelection(rules = endo_setosa, alpha_error = 7) 84 | 85 | # Plot the decision ensemble: 86 | # Plants from the setosa species have small petals and narrow, long sepals. 87 | plotFeatures(de_final, levels_order = c("Low", "Medium", "High")) 88 | 89 | # there is no interaction between variables (all decisions with len = 1, 90 | # the number of variables in the rules) 91 | de_final$rules_summary 92 | # hence the network would be empty and couldn't be plotted... 93 | # plotNetwork(de_final, hide_isolated_nodes = FALSE) 94 | } 95 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/getDecisionsMetric.R: -------------------------------------------------------------------------------- 1 | #' Measure the error, prediction and importance of decisions 2 | #' 3 | #' This function measures the prediction and error on the response variable of each decision on its support in the data passed. The importance is calculated by default but this can be switched off.
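#' (Note added for clarity: when importances are computed, imp = gain * support,
#' consistent with decisionImportance().)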
4 | #' 5 | #' @param ruleExec a vector with name "condition" or a data.frame with a column "condition". 6 | #' @param data data from which to get the decision support. 7 | #' @param target response variable. 8 | #' @param classPos for classification tasks, the positive class to be predicted by decisions. 9 | #' @param importances if FALSE, the importances are not calculated (importances = TRUE by default). 10 | #' @param in_parallel if TRUE, the function is run in parallel. 11 | #' @param n_cores if in_parallel = TRUE, and no cluster has been passed: number of cores to use. 12 | #' @param cluster the cluster to use to run the function in parallel. 13 | #' @return a datatable with the rule (column "condition"), error ("err"), prediction ("pred"), support ("support"), and number of variables in the decision rule ("len"). Columns "gain" and "imp" with the gain and importance of the decision are added if importances were calculated. 14 | #' @export 15 | getDecisionsMetrics <- function(ruleExec, data, target, classPos = NULL, 16 | importances = TRUE, 17 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 18 | # modified from getRuleMetrics in the inTrees R-package 19 | 20 | # get the classPos 21 | if (is.numeric(target) == FALSE) { 22 | if (is.null(classPos) == TRUE) { 23 | classPos <- names(which.max(table(target))) 24 | cat("Positive class:", classPos, "\n") 25 | } 26 | target <- ifelse(target == classPos, 1, 0) 27 | type <- "classification" 28 | } else { 29 | type <- "reg" 30 | } 31 | 32 | if ("data.table" %in% class(data)) { 33 | data <- as.data.frame(data) 34 | } 35 | 36 | if (importances == TRUE) { 37 | pred_null <- mean(target, na.rm = TRUE) 38 | } else { 39 | pred_null <- NULL 40 | } 41 | 42 | 43 | 44 | if (in_parallel == FALSE) { 45 | ruleMetric <- t(sapply(as.matrix(ruleExec[, "condition"]), measureSingleDecision, 46 | data = data, target = target, type = type, 47 | gain = importances, pred_null = pred_null 48 | )) 49 | } else { 50 | if (is.null(cluster) == TRUE) { 51 | message("Initiate parallelisation ... ") 52 | cluster <- makeCluster(n_cores) 53 | clusterEvalQ(cluster, library(inTrees)) 54 | on.exit(stopCluster(cluster)) 55 | } 56 | 57 | message("Calculate metrics ... 
") 58 | ruleMetric <- parLapply(cluster, as.matrix(ruleExec[, "condition"]), measureSingleDecision, 59 | data = data, target = target, type = type, 60 | gain = importances, pred_null = pred_null 61 | ) 62 | 63 | if (importances == FALSE) { 64 | colN <- c("len", "support", "err", "condition", "pred") 65 | } else { 66 | colN <- c("len", "support", "err", "condition", "pred", "gain") 67 | } 68 | 69 | ruleMetric <- matrix(unlist(ruleMetric), nrow = length(colN), byrow = FALSE) 70 | rownames(ruleMetric) <- colN 71 | ruleMetric <- t(ruleMetric) 72 | } 73 | 74 | rownames(ruleMetric) <- NULL 75 | # remove the decisions with empty support 76 | dIx <- which(ruleMetric[, "len"] == "-1") 77 | if (length(dIx) > 0) { 78 | ruleMetric <- ruleMetric[-dIx, ] 79 | } 80 | if ("character" %in% class(ruleMetric)) { 81 | ruleMetric <- t(ruleMetric) 82 | } 83 | 84 | 85 | if (importances == TRUE) { 86 | ruleMetric <- as.data.table(ruleMetric)[, c("len", "support", "err", "pred", "gain") := lapply(.SD, as.numeric), .SDcols = c("len", "support", "err", "pred", "gain")][ 87 | , "imp" := gain * support 88 | ] 89 | } else { 90 | ruleMetric <- as.data.table(ruleMetric)[, c("len", "support", "err", "pred") := lapply(.SD, as.numeric), .SDcols = c("len", "support", "err", "pred")] 91 | } 92 | 93 | 94 | return(ruleMetric) 95 | } 96 | -------------------------------------------------------------------------------- /R/getDecisionsMetric.R: -------------------------------------------------------------------------------- 1 | #' Measure the error, prediction and importance of decisions 2 | #' 3 | #' This function measures the prediction and error on the response variable of each decision on its support in the data passed. The importance is calculated by default but this can be switched off. 4 | #' 5 | #' @param ruleExec a vector with name "condition" or a data.frame with a column "condition". 6 | #' @param data data from which to get the decision support. 7 | #' @param target response variable. 8 | #' @param classPos for clssification tasks, the positive class to be predicted by decisions. 9 | #' @param importances if FALSE, the importances are not calculated (importances = TRUE by default). 10 | #' @param in_parallel if TRUE, the function is run in parallel. 11 | #' @param n_cores if in_parallel = TRUE, and no cluster has been passed: number of cores to use. 12 | #' @param cluster the cluster to use to run the function in parallel. 13 | #' @return a datatable with the rule (column "condition"), error ("err"), prediction ("pred") support, number of variables in the decision rule ("len"). Columns "gain" and "imp" wit hthe gain and importance of teh decision are added if importances were calculated. 
14 | #' 15 | #' @example examples/iris_each_function.R 16 | #' 17 | #' @export 18 | getDecisionsMetrics <- function(ruleExec, data, target, classPos = NULL, 19 | importances = TRUE, 20 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 21 | # modified from getRuleMetrics in the inTrees R-package 22 | 23 | # get the classPos 24 | if (is.numeric(target) == FALSE) { 25 | if (is.null(classPos) == TRUE) { 26 | classPos <- names(which.max(table(target))) 27 | cat("Positive class:", classPos, "\n") 28 | } 29 | target <- ifelse(target == classPos, 1, 0) 30 | type <- "classification" 31 | } else { 32 | type <- "reg" 33 | } 34 | 35 | if ("data.table" %in% class(data)) { 36 | data <- as.data.frame(data) 37 | } 38 | 39 | if (importances == TRUE) { 40 | pred_null <- mean(target, na.rm = TRUE) 41 | } else { 42 | pred_null <- NULL 43 | } 44 | 45 | 46 | 47 | if (in_parallel == FALSE) { 48 | ruleMetric <- t(sapply(as.matrix(ruleExec[, "condition"]), measureSingleDecision, 49 | data = data, target = target, type = type, 50 | gain = importances, pred_null = pred_null 51 | )) 52 | } else { 53 | if (is.null(cluster) == TRUE) { 54 | message("Initiate parallelisation ... ") 55 | cluster <- makeCluster(n_cores) 56 | clusterEvalQ(cluster, library(inTrees)) 57 | on.exit(stopCluster(cluster)) 58 | } 59 | 60 | message("Calculate metrics ... ") 61 | ruleMetric <- parLapply(cluster, as.matrix(ruleExec[, "condition"]), measureSingleDecision, 62 | data = data, target = target, type = type, 63 | gain = importances, pred_null = pred_null 64 | ) 65 | 66 | if (importances == FALSE) { 67 | colN <- c("len", "support", "err", "condition", "pred") 68 | } else { 69 | colN <- c("len", "support", "err", "condition", "pred", "gain") 70 | } 71 | 72 | ruleMetric <- matrix(unlist(ruleMetric), nrow = length(colN), byrow = FALSE) 73 | rownames(ruleMetric) <- colN 74 | ruleMetric <- t(ruleMetric) 75 | } 76 | 77 | rownames(ruleMetric) <- NULL 78 | # remove the decisions with empty support 79 | dIx <- which(ruleMetric[, "len"] == "-1") 80 | if (length(dIx) > 0) { 81 | ruleMetric <- ruleMetric[-dIx, ] 82 | } 83 | if ("character" %in% class(ruleMetric)) { 84 | ruleMetric <- t(ruleMetric) 85 | } 86 | 87 | 88 | if (importances == TRUE) { 89 | ruleMetric <- as.data.table(ruleMetric)[, c("len", "support", "err", "pred", "gain") := lapply(.SD, as.numeric), .SDcols = c("len", "support", "err", "pred", "gain")][ 90 | , "imp" := gain * support 91 | ] 92 | } else { 93 | ruleMetric <- as.data.table(ruleMetric)[, c("len", "support", "err", "pred") := lapply(.SD, as.numeric), .SDcols = c("len", "support", "err", "pred")] 94 | } 95 | 96 | 97 | return(ruleMetric) 98 | } 99 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/plotFeatures.R: -------------------------------------------------------------------------------- 1 | #' Plot the importance and influence of features. 2 | #' 3 | #' Returns a ggplot object with variables importance (across all categorical levels for factor variables) and variable per-level influence. 4 | #' It uses the ggpubr package to combine plots. 5 | #' 6 | #' @param decision_ensemble stable decision ensemble (see stabilitySelection). 7 | #' @param levels_order optional, order for variables levels on the influence plot 8 | #' @param return_all TRUE, returns the table of feature importance and influences and each plot separated (default = FALSE). 
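#' @param colour_low colour for the negative feature influence values (default: yellowish)
#' @param colour_mid colour for the null feature influence values (default: light grey)
#' @param colour_high colour for the positive feature influence values (default: blue)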
9 | #' @return 2 ggplots arranged in a row with ggpubr; if return_all = TRUE, returns plots separately in a list, as well as the tables used to create plots. 10 | #' @export 11 | plotFeatures <- function(decision_ensemble, levels_order = NULL, 12 | colour_low = "#E69F00", colour_mid = "grey87", colour_high = "#0072B2", 13 | return_all = FALSE) { 14 | 15 | 16 | # get the importance across all levels for each feature 17 | agg_imp <- decision_ensemble$nodes 18 | agg_imp$Feature <- str_replace(agg_imp$var, pattern = "\\_{2}.*", replacement = "") 19 | agg_imp <- agg_imp %>% 20 | group_by(Feature) %>% 21 | summarize(importance = sum(importance)) 22 | # order features based on their importance 23 | agg_imp$Feature <- factor(agg_imp$Feature, levels = agg_imp$Feature[order(agg_imp$importance)]) 24 | 25 | # make the feature importance plot 26 | paggimp <- ggplot(agg_imp, aes(x = Feature, y = importance)) + 27 | geom_col(fill = "lightgray", width = 0.1) + 28 | geom_point(size = 3) + 29 | coord_flip() + 30 | theme_classic() + 31 | theme( 32 | axis.text.y = element_text(size = rel(1.2)), 33 | axis.text.x = element_text(size = rel(1.2)), legend.text = element_text(size = rel(1.2)), 34 | panel.grid.major.y = element_line(colour = "grey", size = 0.2) 35 | ) + 36 | labs(x = "", y = "Importance", title = "Feature importance") 37 | 38 | 39 | # arrange the feature influences 40 | agg_inf <- decision_ensemble$nodes %>% select(var, influence) 41 | agg_inf$level <- str_extract(agg_inf$var, pattern = "(?<=\\_{2}).*") 42 | agg_inf$Feature_short <- str_replace(agg_inf$var, pattern = "\\_{2}.*", replacement = "") 43 | agg_inf <- left_join(agg_inf, agg_imp, by = c(Feature_short = "Feature")) 44 | l <- unique(agg_inf$level) 45 | l <- l[!is.na(l)] 46 | 47 | if (length(l) > 0) { 48 | ix <- which(is.na(agg_inf$level)) 49 | tmp <- agg_inf[rep(ix, length(l)), ] 50 | tmp$level <- unlist(lapply(l, function(x) rep(x, length(ix)))) 51 | agg_inf <- agg_inf[complete.cases(agg_inf), ] %>% rbind(tmp) 52 | agg_inf$Feature_short <- factor(agg_inf$Feature_short, levels = unique(agg_inf$Feature_short[order(agg_inf$importance)])) 53 | if (!is.null(levels_order)) { 54 | agg_inf$level <- factor(agg_inf$level, levels = levels_order) 55 | } 56 | } else { 57 | agg_inf$level <- "" 58 | } 59 | 60 | level_inf <- ggplot(agg_inf, aes( 61 | x = level, y = Feature_short, 62 | fill = influence 63 | )) + 64 | geom_tile(color = "white") + 65 | theme_classic() + 66 | theme( 67 | axis.text.y = element_blank(), axis.text.x = element_text(size = rel(1.2)), 68 | legend.text = element_text(size = rel(1.2)), legend.title = element_text( 69 | size = rel(1.2), 70 | vjust = 0.85 71 | ) 72 | ) + 73 | labs(x = "Level", y = "", title = "Influence per level") + 74 | scale_fill_gradient2( 75 | low = colour_low, mid = colour_mid, 76 | high = colour_high, midpoint = 0, name = "Influence on\nphenotype" 77 | ) 78 | 79 | 80 | if (return_all == TRUE) { 81 | return(list( 82 | "importance" = agg_imp, "importance_p" = paggimp, 83 | "influences" = agg_inf, "influence_p" = level_inf 84 | )) 85 | } else { 86 | require(ggpubr) 87 | paggs <- ggpubr::ggarrange(paggimp + theme(legend.position = "none"), 88 | level_inf, 89 | nrow = 1, ncol = 2, widths = c(1, 0.8) 90 | ) 91 | return(paggs) 92 | } 93 | } 94 | -------------------------------------------------------------------------------- /R/evaluateAlpha.R: -------------------------------------------------------------------------------- 1 | #' Calculate the number of decisions and of predicted samples from decision ensembles
obtained with different alpha values 2 | #' 3 | #' The aim is to help pick an alpha that will result in a decision ensemble able to predict most samples. Performs stability selection for each of the given alpha values. The numbers of decisions and of samples that follow decisions are also calculated. 4 | #' The procedure is adapted from Meinshausen and Buehlmann (2010): the best decisions from each bootstrap are pre-selected, and the ones that were pre-selected in a certain fraction of bootstraps are included in the stable decision ensemble. 5 | #' The decision importances and multiplicities are averaged across bootstraps. Decision-wise feature and interaction importances and influences are averaged across bootstraps before computing the feature and interaction importances and influences from the stable decision ensemble. 6 | #' @param rules list of bootstrap results 7 | #' @param alphas expected numbers of false positive decisions selected (default = c(5, 10, 15, 20, 30, 50, 75)). 8 | #' @param data data for which to calculate how many samples follow each decision. Columns should be the same as for fitting the decision ensemble, but the samples may differ. 9 | #' @param decision_ensembles should the decision ensembles be returned? 10 | #' @param pi_thr fraction of bootstraps in which a decision must have been pre-selected to be included in the stable decision ensemble (default = 0.7). 11 | #' @param aggregate_taxa should taxa be aggregated at the genus level (if species have lower importance than their genus) or species level (if a genus is represented by a unique species) 12 | #' @param taxa if aggregate_taxa = TRUE, a data.frame with all taxa included in the dataset: columns = taxonomic ranks (with columns f, g, and s) 13 | #' @return A list with all decisions from all bootstraps, the summary of decisions across bootstraps, the feature and interaction importance and influence in the nodes and edges dataframes, as well as the decision-wise feature and interaction importances and influences in the nodes_agg and edges_agg dataframes.
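#' @details A decision is included in a stable decision ensemble only if it was pre-selected in at least pi_thr * (number of bootstraps) resamples (the inN >= minN filter in the code below).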
14 | #' @example examples/iris_bootstraps.R 15 | #' @export 16 | 17 | evaluateAlpha <- function(rules, alphas = c(5, 10, 15, 20, 30, 50, 75), pi_thr = 0.7, 18 | data = NULL, decision_ensembles = TRUE, 19 | aggregate_taxa = FALSE, taxa = NULL 20 | ) { 21 | 22 | if ("resamp" %in% names(rules)) { 23 | minN <- pi_thr * length(rules$resamp) 24 | } else { 25 | minN <- pi_thr * length(rules) 26 | } 27 | 28 | if (is.null(data) & !("data" %in% names(rules))) { 29 | warning("Please provide the data used to create the decision ensemble (original if not discretized, precluster$data if discretized).") 30 | return() 31 | } else if ("data" %in% names(rules)) { 32 | data <- rules$data 33 | } 34 | 35 | res <- list() 36 | check_sampl <- data.frame(alpha = numeric(), n_dec = numeric(), n_samp = numeric()) 37 | 38 | for (i in 1:length(alphas)) { 39 | tmp <- stabilitySelection(rules = rules, alpha_error = alphas[i], pi_thr = pi_thr, aggregate_taxa = aggregate_taxa, taxa = taxa) 40 | if (decision_ensembles == TRUE) { 41 | res[[as.character(alphas[i])]] <- tmp 42 | } 43 | 44 | tmp <- subset(tmp$rules_summary, inN >= minN) 45 | 46 | cond <- tmp$condition 47 | if (length(cond) == 0) { 48 | check_sampl <- add_row(check_sampl, alpha = alphas[i], n_dec = 0, n_samp = 0) 49 | } else { 50 | cond <- str_replace_all(cond, pattern = "X", replacement = "data") 51 | cond <- paste0("which(", cond, ")") 52 | pred_ix <- lapply(cond, function(x) { 53 | eval(parse(text = x)) 54 | }) %>% 55 | unlist() %>% 56 | unique() 57 | check_sampl <- add_row(check_sampl, alpha = alphas[i], n_dec = length(cond), n_samp = length(pred_ix)) 58 | } 59 | } 60 | if (decision_ensembles == TRUE) { 61 | res[["summary_table"]] <- check_sampl 62 | return(res) 63 | } else { 64 | return(check_sampl) 65 | } 66 | } 67 | -------------------------------------------------------------------------------- /R/plotFeatures.R: -------------------------------------------------------------------------------- 1 | #' Plot the importance and influence of features. 2 | #' 3 | #' Returns a ggplot object with variable importance (across all categorical levels for factor variables) and variable per-level influence. 4 | #' It uses the ggpubr package to combine plots. 5 | #' 6 | #' @param decision_ensemble stable decision ensemble (see stabilitySelection). 7 | #' @param levels_order optional, order for variable levels on the influence plot 8 | #' @param return_all if TRUE, returns the table of feature importances and influences and each plot separately (default = FALSE). 9 | #' @param colour_low colour for the negative feature influence values (default: yellowish) 10 | #' @param colour_mid colour for the null feature influence values (default: light grey) 11 | #' @param colour_high colour for the positive feature influence values (default: blue) 12 | #' @return 2 ggplots arranged in a row with ggpubr; if return_all = TRUE, returns plots separately in a list, as well as the tables used to create plots.
13 | #' 14 | #' @example examples/iris_basic.R 15 | #' @export 16 | plotFeatures <- function(decision_ensemble, levels_order = NULL, 17 | colour_low = "#E69F00", colour_mid = "grey87", colour_high = "#0072B2", 18 | return_all = FALSE) { 19 | 20 | 21 | # get the importance across all levels for each feature 22 | agg_imp <- decision_ensemble$nodes 23 | agg_imp$Feature <- str_replace(agg_imp$var, pattern = "\\_{2}.*", replacement = "") 24 | agg_imp <- agg_imp %>% 25 | group_by(Feature) %>% 26 | summarize(importance = sum(importance)) 27 | # order features based on their importance 28 | agg_imp$Feature <- factor(agg_imp$Feature, levels = agg_imp$Feature[order(agg_imp$importance)]) 29 | 30 | # make the feature importance plot 31 | paggimp <- ggplot(agg_imp, aes(x = Feature, y = importance)) + 32 | geom_col(fill = "lightgray", width = 0.1) + 33 | geom_point(size = 3) + 34 | coord_flip() + 35 | theme_classic() + 36 | theme( 37 | axis.text.y = element_text(size = rel(1.2)), 38 | axis.text.x = element_text(size = rel(1.2)), legend.text = element_text(size = rel(1.2)), 39 | panel.grid.major.y = element_line(colour = "grey", size = 0.2) 40 | ) + 41 | labs(x = "", y = "Importance", title = "Feature importance") 42 | 43 | 44 | # arrange the feature influences 45 | agg_inf <- decision_ensemble$nodes %>% select(var, influence) 46 | agg_inf$level <- str_extract(agg_inf$var, pattern = "(?<=\\_{2}).*") 47 | agg_inf$Feature_short <- str_replace(agg_inf$var, pattern = "\\_{2}.*", replacement = "") 48 | agg_inf <- left_join(agg_inf, agg_imp, by = c(Feature_short = "Feature")) 49 | l <- unique(agg_inf$level) 50 | l <- l[!is.na(l)] 51 | 52 | if (length(l) > 0) { 53 | ix <- which(is.na(agg_inf$level)) 54 | tmp <- agg_inf[rep(ix, length(l)), ] 55 | tmp$level <- unlist(lapply(l, function(x) rep(x, length(ix)))) 56 | agg_inf <- agg_inf[complete.cases(agg_inf), ] %>% rbind(tmp) 57 | agg_inf$Feature_short <- factor(agg_inf$Feature_short, levels = unique(agg_inf$Feature_short[order(agg_inf$importance)])) 58 | if (!is.null(levels_order)) { 59 | agg_inf$level <- factor(agg_inf$level, levels = levels_order) 60 | } 61 | } else { 62 | agg_inf$level <- "" 63 | } 64 | 65 | level_inf <- ggplot(agg_inf, aes( 66 | x = level, y = Feature_short, 67 | fill = influence 68 | )) + 69 | geom_tile(color = "white") + 70 | theme_classic() + 71 | theme( 72 | axis.text.y = element_blank(), axis.text.x = element_text(size = rel(1.2)), 73 | legend.text = element_text(size = rel(1.2)), legend.title = element_text( 74 | size = rel(1.2), 75 | vjust = 0.85 76 | ) 77 | ) + 78 | labs(x = "Level", y = "", title = "Influence per level") + 79 | scale_fill_gradient2( 80 | low = colour_low, mid = colour_mid, 81 | high = colour_high, midpoint = 0, name = "Influence on\nphenotype" 82 | ) 83 | 84 | 85 | if (return_all == TRUE) { 86 | return(list( 87 | "importance" = agg_imp, "importance_p" = paggimp, 88 | "influences" = agg_inf, "influence_p" = level_inf 89 | )) 90 | } else { 91 | paggs <- ggpubr::ggarrange(paggimp + theme(legend.position = "none"), 92 | level_inf, 93 | nrow = 1, ncol = 2, widths = c(1, 0.8) 94 | ) 95 | return(paggs) 96 | } 97 | } 98 | -------------------------------------------------------------------------------- /man/pruneDecisions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/pruneDecisions.R 3 | \name{pruneDecisions} 4 | \alias{pruneDecisions} 5 | \title{Prune rules of a decision ensemble} 6 | \usage{ 7 | 
pruneDecisions( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | maxDecay = 0, 13 | typeDecay = 1, 14 | in_parallel = FALSE, 15 | n_cores = detectCores() - 1, 16 | cluster = NULL 17 | ) 18 | } 19 | \arguments{ 20 | \item{rules}{a data frame with a column "condition".} 21 | 22 | \item{data}{data to use for calculating the metrics.} 23 | 24 | \item{target}{response variable.} 25 | 26 | \item{classPos}{for classification tasks, the positive class predicted.} 27 | 28 | \item{maxDecay}{threshold for the increase in error; if maxDecay = -Inf, no pruning is done; if maxDecay = 0, only variables increasing the error are pruned from decisions.} 29 | 30 | \item{typeDecay}{if typeDecay = 1, the absolute increase in error is computed, and the relative one is computed if typeDecay = 2 (default).} 31 | 32 | \item{in_parallel}{if TRUE, the function is run in parallel.} 33 | 34 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 35 | 36 | \item{cluster}{the cluster to use to run the function in parallel.} 37 | } 38 | \value{ 39 | Decision ensemble with pruned conditions. 40 | } 41 | \description{ 42 | This function removes from decision rules the variables that do not increase the rule error by more than a certain threshold. See the pruneRules function from the inTrees package (Deng, 2019) for more details. 43 | } 44 | \examples{ 45 | library(randomForest) 46 | library(caret) 47 | library(data.table) 48 | 49 | # import data and fit model 50 | data(iris) 51 | mod <- randomForest(Species ~ ., data = iris) 52 | 53 | # Let's get the decision ensemble. One could use the wrapping function 54 | # model2DE(), but we will run each function separately. 55 | 56 | # Get the raw decision ensemble 57 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 58 | , target = iris$Species, classPos = "setosa" 59 | , times = 1 # number of bootstraps, here just one 60 | , discretize = FALSE) # we will discretize outside for the example 61 | summary(de) 62 | # exec = the decision ensemble 63 | # partitions = list of sample indexes for bootstrapping 64 | # if we had done discretization, the new data would be in data_ctg 65 | de <- de$exec 66 | 67 | # Discretize variables into 3 categories - optional 68 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 69 | , K = 3, classPos = "setosa", mode = "data") 70 | data_ctg <- de$data_ctg 71 | de <- de$rules_ctg 72 | 73 | # Homogenize the decision ensemble 74 | de <- de[, condition := sapply(condition, function(x) { 75 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 76 | })] 77 | de <- unique( 78 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 79 | ) 80 | 81 | # Calculate decision metrics; we don't need the importances yet since we will 82 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 83 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 84 | , classPos = "setosa", importances = FALSE) 85 | de <- de[de_met, on = "condition"] 86 | 87 | # Pruning - optional 88 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 89 | , classPos = "setosa") 90 | 91 | # Decision importances 92 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 93 | , classPos = "setosa") 94 | 95 | # Filter out decisions with the lowest importance: min_imp = the minimal 96 | # importance in the decision ensemble compared to the maximal one.
97 | # E.g., if min_imp = 0.5, then at least all decisions with an 98 | # importance > 0.5*max(importance) will be kept. 99 | # This ensures that we don't throw out too much. 100 | # Since the decision ensemble is quite small, we don't need to filter much... 101 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 102 | 103 | # Get the network 104 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 105 | , classPos = "setosa") 106 | 107 | # Plot the feature importance/influence and the network 108 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 109 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 110 | } 111 | -------------------------------------------------------------------------------- /man/getDecisionsMetrics.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getDecisionsMetric.R 3 | \name{getDecisionsMetrics} 4 | \alias{getDecisionsMetrics} 5 | \title{Measure the error, prediction and importance of decisions} 6 | \usage{ 7 | getDecisionsMetrics( 8 | ruleExec, 9 | data, 10 | target, 11 | classPos = NULL, 12 | importances = TRUE, 13 | in_parallel = FALSE, 14 | n_cores = detectCores() - 1, 15 | cluster = NULL 16 | ) 17 | } 18 | \arguments{ 19 | \item{ruleExec}{a vector with name "condition" or a data.frame with a column "condition".} 20 | 21 | \item{data}{data from which to get the decision support.} 22 | 23 | \item{target}{response variable.} 24 | 25 | \item{classPos}{for classification tasks, the positive class to be predicted by decisions.} 26 | 27 | \item{importances}{if FALSE, the importances are not calculated (importances = TRUE by default).} 28 | 29 | \item{in_parallel}{if TRUE, the function is run in parallel.} 30 | 31 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 32 | 33 | \item{cluster}{the cluster to use to run the function in parallel.} 34 | } 35 | \value{ 36 | a datatable with the rule (column "condition"), error ("err"), prediction ("pred"), support ("support"), and number of variables in the decision rule ("len"). Columns "gain" and "imp" with the gain and importance of the decision are added if importances were calculated. 37 | } 38 | \description{ 39 | This function measures the prediction and error on the response variable of each decision on its support in the data passed. The importance is calculated by default but this can be switched off. 40 | } 41 | \examples{ 42 | library(randomForest) 43 | library(caret) 44 | library(data.table) 45 | 46 | # import data and fit model 47 | data(iris) 48 | mod <- randomForest(Species ~ ., data = iris) 49 | 50 | # Let's get the decision ensemble. One could use the wrapping function 51 | # model2DE(), but we will run each function separately.
52 | 53 | # Get the raw decision ensemble 54 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 55 | , target = iris$Species, classPos = "setosa" 56 | , times = 1 # number of bootstraps, here just one 57 | , discretize = FALSE) # we will discretize outside for the example 58 | summary(de) 59 | # exec = the decision ensemble 60 | # partitions = list of sample indexes for bootstrapping 61 | # if we had done discretization, the new data would be in data_ctg 62 | de <- de$exec 63 | 64 | # Discretize variables into 3 categories - optional 65 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 66 | , K = 3, classPos = "setosa", mode = "data") 67 | data_ctg <- de$data_ctg 68 | de <- de$rules_ctg 69 | 70 | # Homogenize the decision ensemble 71 | de <- de[, condition := sapply(condition, function(x) { 72 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 73 | })] 74 | de <- unique( 75 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 76 | ) 77 | 78 | # Calculate decision metrics; we don't need the importances yet since we will 79 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 80 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 81 | , classPos = "setosa", importances = FALSE) 82 | de <- de[de_met, on = "condition"] 83 | 84 | # Pruning - optional 85 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 86 | , classPos = "setosa") 87 | 88 | # Decision importances 89 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 90 | , classPos = "setosa") 91 | 92 | # Filter out decisions with the lowest importance: min_imp = the minimal 93 | # importance in the decision ensemble compared to the maximal one. 94 | # E.g., if min_imp = 0.5, then at least all decisions with an 95 | # importance > 0.5*max(importance) will be kept. 96 | # This ensures that we don't throw out too much. 97 | # Since the decision ensemble is quite small, we don't need to filter much...
98 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 99 | 100 | # Get the network 101 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 102 | , classPos = "setosa") 103 | 104 | # Plot the feature importance/influence and the network 105 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 106 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 107 | } 108 | -------------------------------------------------------------------------------- /man/getNetwork.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getNetwork.R 3 | \name{getNetwork} 4 | \alias{getNetwork} 5 | \title{Transform a decision ensemble into a network} 6 | \usage{ 7 | getNetwork( 8 | rules, 9 | data, 10 | target, 11 | classPos = NULL, 12 | aggregate_taxa = FALSE, 13 | taxa = NULL, 14 | in_parallel = FALSE, 15 | n_cores = detectCores() - 1, 16 | cluster = NULL 17 | ) 18 | } 19 | \arguments{ 20 | \item{rules}{the decision ensemble.} 21 | 22 | \item{data}{data from which to measure decision metrics.} 23 | 24 | \item{target}{response variable.} 25 | 26 | \item{classPos}{the positive class predicted by decisions.} 27 | 28 | \item{aggregate_taxa}{should taxa be aggregated at the genus level (if species have lower importance than their genus) or species level (if a genus is represented by a unique species)} 29 | 30 | \item{taxa}{if aggregate_taxa = TRUE, a data.frame with all taxa included in the dataset: columns = taxonomic ranks (with columns f, g, and s)} 31 | 32 | \item{in_parallel}{if TRUE, the function is run in parallel.} 33 | 34 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use.} 35 | 36 | \item{cluster}{the cluster to use to run the function in parallel.} 37 | } 38 | \value{ 39 | A list with the feature and interaction importance and influence in the nodes and edges dataframes; the decision-wise feature and interaction importances and influences are contained in the nodes_agg and edges_agg dataframes. 40 | } 41 | \description{ 42 | Takes a decision ensemble and measures the importance and influence of each feature and pair of features to create a network. 43 | For categorical variables or discretized ones, the importance and influence are calculated per level. See featureImportance to obtain the overall feature importance. 44 | } 45 | \examples{ 46 | library(randomForest) 47 | library(caret) 48 | library(data.table) 49 | 50 | # import data and fit model 51 | data(iris) 52 | mod <- randomForest(Species ~ ., data = iris) 53 | 54 | # Let's get the decision ensemble. One could use the wrapping function 55 | # model2DE(), but we will run each function separately.
56 | 57 | # Get the raw decision ensemble 58 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 59 | , target = iris$Species, classPos = "setosa" 60 | , times = 1 # number of bootstraps, here just one 61 | , discretize = FALSE) # we will discretize outside for the example 62 | summary(de) 63 | # exec = the decision ensemble 64 | # partitions = list of sample indexes for bootstrapping 65 | # if we had done discretization, the new data would be in data_ctg 66 | de <- de$exec 67 | 68 | # Discretize variables into 3 categories - optional 69 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 70 | , K = 3, classPos = "setosa", mode = "data") 71 | data_ctg <- de$data_ctg 72 | de <- de$rules_ctg 73 | 74 | # Homogenize the decision ensemble 75 | de <- de[, condition := sapply(condition, function(x) { 76 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 77 | })] 78 | de <- unique( 79 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 80 | ) 81 | 82 | # Calculate decision metrics; we don't need the importances yet since we will 83 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 84 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 85 | , classPos = "setosa", importances = FALSE) 86 | de <- de[de_met, on = "condition"] 87 | 88 | # Pruning - optional 89 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 90 | , classPos = "setosa") 91 | 92 | # Decision importances 93 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 94 | , classPos = "setosa") 95 | 96 | # Filter out decisions with the lowest importance: min_imp = the minimal 97 | # importance in the decision ensemble compared to the maximal one. 98 | # E.g., if min_imp = 0.5, then at least all decisions with an 99 | # importance > 0.5*max(importance) will be kept. 100 | # This ensures that we don't throw out too much. 101 | # Since the decision ensemble is quite small, we don't need to filter much...
102 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 103 | 104 | # Get the network 105 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 106 | , classPos = "setosa") 107 | 108 | # Plot the feature importance/influence and the network 109 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 110 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 111 | } 112 | -------------------------------------------------------------------------------- /man/discretizeDecisions.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/discretizeDecisions.R 3 | \name{discretizeDecisions} 4 | \alias{discretizeDecisions} 5 | \title{Discretize numerical variables in decision ensemble} 6 | \usage{ 7 | discretizeDecisions( 8 | rules, 9 | data = NULL, 10 | target, 11 | mode = "data", 12 | K = 2, 13 | splitV = NULL, 14 | classPos = NULL, 15 | in_parallel = FALSE, 16 | n_cores = detectCores() - 1, 17 | cluster = NULL 18 | ) 19 | } 20 | \arguments{ 21 | \item{rules}{a data frame with a column "condition".} 22 | 23 | \item{data}{data to discretize.} 24 | 25 | \item{target}{response variable.} 26 | 27 | \item{mode}{whether to discretize variables based on the data distribution (default, mode = 'data') or on the data splits in the model (mode = 'model').} 28 | 29 | \item{K}{numeric, number of categories to create from numeric variables (default: K = 2).} 30 | 31 | \item{splitV}{instead of running discretizeData internally, one can provide a list with, for each variable to discretize in rules, the thresholds delimiting each new category.} 32 | 33 | \item{classPos}{for classification, the positive class.} 34 | 35 | \item{in_parallel}{if TRUE, the function is run in parallel.} 36 | 37 | \item{n_cores}{if in_parallel = TRUE, and no cluster has been passed: number of cores to use, default is detectCores() - 1.} 38 | 39 | \item{cluster}{the cluster to use to run the function in parallel.} 40 | } 41 | \value{ 42 | Decision ensemble with discretized variables in the condition. Decisions with the same condition are aggregated: their importances are summed, and all other metrics are averaged. 43 | } 44 | \description{ 45 | This function replaces, in a decision ensemble, the boundaries of numerical features by their corresponding levels when the variable is discretized. 46 | If discretized data are not passed, data are first discretized into K categories based on their quantiles (see discretizeData). 47 | The error, prediction, importance and multiplicity of decisions are updated after discretization. 48 | } 49 | \examples{ 50 | library(randomForest) 51 | library(caret) 52 | library(data.table) 53 | 54 | # import data and fit model 55 | data(iris) 56 | mod <- randomForest(Species ~ ., data = iris) 57 | 58 | # Let's get the decision ensemble. One could use the wrapping function 59 | # model2DE(), but we will run each function separately.
60 | 61 | # Get the raw decision ensemble 62 | de <- preCluster(model = mod, model_type = "rf", data = iris[, -5] 63 | , target = iris$Species, classPos = "setosa" 64 | , times = 1 # number of bootstraps, here just one 65 | , discretize = FALSE) # we will discretize outside for the example 66 | summary(de) 67 | # exec = the decision ensemble 68 | # partitions = list of sample indexes for bootstrapping 69 | # if we had done discretization, the new data would be in data_ctg 70 | de <- de$exec 71 | 72 | # Discretize variables into 3 categories - optional 73 | de <- discretizeDecisions(rules = de, data = iris[, -5], target = iris$Species 74 | , K = 3, classPos = "setosa", mode = "data") 75 | data_ctg <- de$data_ctg 76 | de <- de$rules_ctg 77 | 78 | # Homogenize the decision ensemble 79 | de <- de[, condition := sapply(condition, function(x) { 80 | paste(sort(unlist(strsplit(x, split = " & "))), collapse = " & ") 81 | })] 82 | de <- unique( 83 | as.data.table(de)[, n := as.numeric(n)][, n := sum(n), by = condition] 84 | ) 85 | 86 | # Calculate decision metrics; we don't need the importances yet since we will 87 | # do pruning. Otherwise, set importances = TRUE and skip the next 2 steps. 88 | de_met <- getDecisionsMetrics(de, data = data_ctg, target = iris$Species 89 | , classPos = "setosa", importances = FALSE) 90 | de <- de[de_met, on = "condition"] 91 | 92 | # Pruning - optional 93 | de <- pruneDecisions(rules = de, data = data_ctg, target = iris$Species 94 | , classPos = "setosa") 95 | 96 | # Decision importances 97 | de <- decisionImportance(rules = de, data = data_ctg, target = iris$Species 98 | , classPos = "setosa") 99 | 100 | # Filter out decisions with the lowest importance: min_imp = the minimal 101 | # importance in the decision ensemble compared to the maximal one. 102 | # E.g., if min_imp = 0.5, then at least all decisions with an 103 | # importance > 0.5*max(importance) will be kept. 104 | # This ensures that we don't throw out too much. 105 | # Since the decision ensemble is quite small, we don't need to filter much... 106 | de <- filterDecisionsImportances(rules = de, min_imp = 0.1) 107 | 108 | # Get the network 109 | de_net <- getNetwork(rules = de, data = data_ctg, target = iris$Species 110 | , classPos = "setosa") 111 | 112 | # Plot the feature importance/influence and the network 113 | plotFeatures(de_net, levels_order = c("Low", "Medium", "High")) 114 | plotNetwork(de_net, hide_isolated_nodes = FALSE, layout = "fr") 115 | } 116 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/R/changeDecisionsDummies.R: -------------------------------------------------------------------------------- 1 | # takes decisions and modifies them so that only one level of a multiclass variable is used in decisions 2 | #' @export 3 | changeDecisionsDummies <- function(rules, dummy_var, data, target, classPos = NULL, 4 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 5 | 6 | # get the colnames and column numbers of dummy variables 7 | dum_lev <- lapply(dummy_var, getDummyLevels, colN = colnames(data)) 8 | names(dum_lev) <- dummy_var 9 | dum_colN <- unlist(lapply(dummy_var, FUN = dumColnames, colN = colnames(data))) 10 | 11 | # replace the <=0.5 conditions in rules for all other levels with >0.5 12 | rules <- rules[, to_update := "no"] 13 | # separate levels in rules 14 | if (in_parallel == TRUE) { 15 | if (is.null(cluster) == TRUE) { 16 | message("Initiate parallelisation ... 
") 17 | cluster <- makeCluster(n_cores) 18 | clusterEvalQ(cluster, library(endoR)) 19 | clusterEvalQ(cluster, library(stringr)) 20 | on.exit(stopCluster(cluster)) 21 | } 22 | 23 | rules <- t(parApply( 24 | cl = cluster, rules, MARGIN = 1, FUN = changeSingleRuleDummies, 25 | dum_colN = dum_colN, dum_lev = dum_lev 26 | )) 27 | } else { 28 | rules <- t(apply(rules, 29 | MARGIN = 1, FUN = changeSingleRuleDummies, 30 | dum_colN = dum_colN, dum_lev = dum_lev 31 | )) 32 | } 33 | 34 | 35 | # get one rule per level 36 | tmp <- split(rules, seq(nrow(rules))) 37 | tmp <- lapply(tmp, 38 | FUN = function(x, newN) { 39 | names(x) <- newN 40 | return(x) 41 | }, 42 | newN = colnames(rules) 43 | ) 44 | if (in_parallel == TRUE) { 45 | rules <- parLapply(cl = cluster, tmp, fun = singleRulePerLevel) 46 | } else { 47 | rules <- lapply(tmp, FUN = singleRulePerLevel) 48 | } 49 | 50 | rules <- do.call(what = rbind, rules) 51 | rules <- as.data.table(rules)[, n := as.numeric(n)][, n := sum(n), by = condition] 52 | 53 | if ("err" %in% colnames(rules)) { 54 | rules <- rules[to_update == "yes", to_up := "yes"][, to_update := NULL][, to_update := NULL] 55 | rules <- unique(rules) 56 | # Get the new metrics 57 | if (nrow(rules[to_up == "yes", ]) > 0) { 58 | if ("imp" %in% colnames(rules)) { 59 | importances <- TRUE 60 | colN <- c("len", "err", "support", "pred", "imp", "to_up") 61 | } else { 62 | importances <- FALSE 63 | colN <- c("len", "err", "support", "pred", "to_up") 64 | } 65 | 66 | no_up <- unique(subset(rules, is.na(to_up), select = -to_up)) 67 | tmp1 <- unique(subset(rules, to_up == "yes" & !(condition %in% no_up$condition))[, (colN) := NULL]) 68 | 69 | tmp2 <- getDecisionsMetrics(tmp1, 70 | data = data, target = target, classPos = classPos, 71 | importances = importances, 72 | in_parallel = in_parallel, n_cores = n_cores, cluster = cluster 73 | ) 74 | tmp2 <- tmp1[tmp2, on = "condition"] 75 | # tmp2 <- merge(tmp2, tmp1, all.x = TRUE, by = 'condition' ) 76 | 77 | if (nrow(no_up) > 0) { 78 | rules <- rbind(no_up, tmp2) 79 | } else { 80 | rules <- tmp2 81 | } 82 | 83 | rules <- rules[, c("len", "support", "err", "pred", "n") := lapply(.SD, as.numeric), 84 | .SDcols = c("len", "support", "err", "pred", "n") 85 | ] 86 | rules <- unique(rules[, n := sum(n), by = condition]) 87 | } 88 | } else { 89 | rules <- unique(rules[, to_update := NULL][, to_update := NULL]) 90 | } 91 | 92 | 93 | return(rules) 94 | } 95 | 96 | 97 | ################################################## 98 | getDummyLevels <- function(var, colN) { 99 | sub <- str_which(colN, pattern = paste0("^", var)) 100 | sub <- paste0("X[,", sub, "]<=0.5") 101 | return(sub) 102 | } 103 | dumColnames <- function(var, colN) { 104 | sub <- str_which(colN, pattern = paste0("^", var)) 105 | dum <- rep(var, length(sub)) 106 | names(dum) <- paste0("X[,", sub, "]<=0.5") 107 | return(dum) 108 | } 109 | 110 | ################################################# 111 | changeSingleRuleDummies <- function(rule, dum_colN, dum_lev) { 112 | ruleV <- unlist(strsplit(rule["condition"], " & ")) 113 | if (length(ruleV[ruleV %in% names(dum_colN)]) > 0) { 114 | oriRule <- ruleV[!(ruleV %in% names(dum_colN))] 115 | ruleV <- ruleV[ruleV %in% names(dum_colN)] 116 | names(ruleV) <- dum_colN[ruleV] 117 | vdum <- unique(names(ruleV)) 118 | newV <- c() 119 | for (v in vdum) { 120 | tmp <- names(dum_colN[dum_colN == v]) 121 | tmp <- tmp[!(tmp %in% ruleV)] 122 | tmp <- str_replace(tmp, pattern = "<=", replacement = ">") 123 | if (length(tmp[tmp %in% oriRule]) > 0) next 124 | if (length(tmp) > 
1) { 125 | tmp <- paste0("(", paste(tmp, collapse = " | "), ")") 126 | } 127 | newV <- c(newV, tmp) 128 | } 129 | rule["condition"] <- paste0(sort(c(oriRule, newV)), collapse = " & ") 130 | rule["to_update"] <- "yes" 131 | } 132 | 133 | return(rule) 134 | } 135 | -------------------------------------------------------------------------------- /R/getComplements.R: -------------------------------------------------------------------------------- 1 | getComplements <- function(rules, data, target, classPos = NULL, 2 | in_parallel = FALSE, n_cores = detectCores() - 1, cluster = NULL) { 3 | if (!("data.table" %in% class(rules))) { 4 | rules <- as.data.table(rules) 5 | } 6 | 7 | # give rules ID 8 | rules <- rules[, ruleID := paste0("decision", .I)] 9 | 10 | 11 | # get the single complements and rm rules 12 | if (in_parallel == FALSE) { 13 | tmp <- apply(rules, MARGIN = 1, complementSingleRule) 14 | } else { 15 | if (is.null(cluster) == TRUE) { 16 | message("Initiate parallelisation ... ") 17 | cluster <- makeCluster(n_cores) 18 | clusterEvalQ(cluster, library(stringr)) 19 | on.exit(stopCluster(cluster)) 20 | } 21 | 22 | message("Generate additional decisions ... ") 23 | tmp <- parApply(cl = cluster, rules, MARGIN = 1, FUN = complementSingleRule) 24 | } 25 | # the rm decisions 26 | rulesAdd <- unlist(lapply(tmp, function(x) { 27 | x$nR 28 | })) 29 | names(rulesAdd) <- str_replace(names(rulesAdd), pattern = "^decision[:digit:]+\\.", replacement = "") 30 | # the signs 31 | sVar <- unlist(lapply(tmp, function(x) { 32 | x$s 33 | })) 34 | names(sVar) <- str_replace(names(sVar), pattern = "^decision[:digit:]+\\.", replacement = "") 35 | sVar <- unlist(lapply(sVar, FUN = function(x) { 36 | eval(parse(text = x)) 37 | })) 38 | 39 | 40 | # get the directions 41 | directions <- data.table( 42 | ruleID = str_extract(names(sVar), pattern = "^decision[:digit:]+(?=var)"), 43 | var = str_extract(names(sVar), pattern = "(?<=var)[:digit:]+$"), 44 | d = ifelse(sVar == TRUE, 1, -1) 45 | ) 46 | 47 | 48 | # get metrics for the rm 49 | cIx <- str_which(names(rulesAdd), pattern = "varRm") 50 | rulesRm <- data.table( 51 | fullID = names(rulesAdd)[cIx], 52 | ruleID = str_extract(names(rulesAdd)[cIx], pattern = "^decision[:digit:]+(?=var)"), 53 | var = str_extract(names(rulesAdd)[cIx], pattern = "(?<=varRm)[:digit:]+$"), 54 | condition = rulesAdd[cIx] 55 | ) 56 | tmp <- unique(copy(rulesRm)[condition != "", list(condition)]) # remove repetitions due to the dummy variables 57 | if (nrow(tmp) == 0) { 58 | message("There is no interaction in the decision ensemble, only single variable effects.") 59 | } else { 60 | suppressMessages( 61 | tmp <- getDecisionsMetrics(tmp, 62 | data = data, target = target, classPos = classPos, 63 | importances = FALSE, 64 | in_parallel = in_parallel, n_cores = n_cores, cluster = cluster 65 | ) 66 | ) 67 | rulesRm <- tmp[rulesRm, on = "condition"] 68 | } 69 | 70 | # add the metrics for the "empty decisions" (original rule = a single variable only) 71 | tmp <- measureAll(data = data, target = target, classPos = classPos) 72 | rulesRm <- rulesRm[condition == "", `:=`(len = 0, support = 1, err = tmp["err"], pred = tmp["pred"])] 73 | 74 | 75 | # get the absolute support size for all 76 | nall <- nrow(data) 77 | rules <- rules[, "support_abs" := nall * support] 78 | rulesRm <- rulesRm[, "support_abs" := nall * support] 79 | 80 | return(list( 81 | "original" = rules, 82 | "rm" = rulesRm, "directions" = directions 83 | )) 84 | } 85 | 86 | 87 | 88 | 89 | 90 | 
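# Helpers for getComplements(): for each variable of a rule,
# complementSingleRule() builds the rule with that variable removed (the "rm"
# decisions) and an expression giving the direction of each condition
# (d = 1 when the condition selects high values of the variable, -1 otherwise).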
#################################################################### 91 | 92 | 93 | 94 | ### Get the complements for individual rules 95 | complementSingleRule <- function(rule) { 96 | rule <- unlist(rule) 97 | ruleExec <- unlist(strsplit(rule["condition"], split = " & ")) 98 | vars <- str_extract(ruleExec, pattern = "(?<=X\\[,)[:digit:]+") 99 | # order per variable 100 | tmp <- order(vars) 101 | ruleExec <- ruleExec[tmp] 102 | vars <- vars[tmp] 103 | # group conditions on a same variable 104 | in_dup <- which(duplicated(vars)) 105 | if (length(in_dup) > 0) { 106 | for (i in in_dup) { 107 | ruleExec[i - 1] <- paste(ruleExec[i - 1], ruleExec[i], sep = " & ") 108 | } 109 | ruleExec <- ruleExec[-in_dup] 110 | vars <- vars[-in_dup] 111 | } 112 | 113 | ruleID <- rule["ruleID"] 114 | signs <- c("<=" = "a", ">" = "b", "a" = ">", "b" = "<=") # need that trick to be sure replace them all when there are 2! 115 | newRules <- list() 116 | s <- list() 117 | 118 | for (i in 1:length(ruleExec)) { 119 | if (str_detect(ruleExec[i], pattern = "&")) { 120 | comp <- str_replace_all(ruleExec[i], signs) 121 | comp <- paste0("(", str_replace_all(comp, pattern = "&", replacement = "|"), ")") 122 | sV <- paste0("mean(unlist(data[which(", ruleExec[i], "),", vars[i], "])) > mean(unlist(data[which(", comp, "),", vars[i], "]))") 123 | sV <- str_replace_all(sV, pattern = "(?<=\\(| )X(?=\\[)", replacement = "data") 124 | } else { 125 | if (str_detect(ruleExec[i], pattern = ">")) { 126 | sV <- "TRUE" 127 | } else { 128 | sV <- "FALSE" 129 | } 130 | } 131 | 132 | rm <- paste(ruleExec[-i], collapse = " & ") 133 | 134 | newRules[[paste0(ruleID, "varRm", vars[i])]] <- rm 135 | s[[paste0(ruleID, "var", vars[i])]] <- sV 136 | } 137 | 138 | return(list("nR" = newRules, "s" = s)) 139 | } 140 | -------------------------------------------------------------------------------- /endoR.Rcheck/00_pkg_src/endoR/README.md: -------------------------------------------------------------------------------- 1 | # endoR 2 | Code and manual of the endoR R-package. 3 | 4 | [![DOI](https://zenodo.org/badge/349814633.svg)](https://zenodo.org/badge/latestdoi/349814633) 5 | 6 | 7 | - author: Albane Ruaud [albane.ruaud@tuebingen.mpg.de](mailto:albane.ruaud@tuebingen.mpg.de) 8 | - maintainer: Albane Ruaud [albane.ruaud@tuebingen.mpg.de](mailto:albane.ruaud@tuebingen.mpg.de) 9 | 10 | Preprint: Albane Ruaud, Niklas A Pfister, Ruth E Ley, Nicholas D Youngblut. Interpreting tree ensemble machine learning models with endoR. bioRxiv (2022). DOI: [10.1101/2022.01.03.474763](https://www.biorxiv.org/content/10.1101/2022.01.03.474763v1) 11 | 12 | 13 | # Abstract 14 | **Motivation:** Tree ensemble machine learning models are increasingly used in microbiome science to explore associations between microbes and their environment, as they are compatible with the compositional, high-dimensional, and sparse structure of sequence-based microbiome data. The complex structure of such models enables efficient capture of higher-order interactions to improve predictions, but makes the final model often difficult to interpret. Therefore, while tree ensembles are often the most accurate models for microbiome data, the approach often yields limited insight into how microbial taxa or genomic content may be associated with host phenotypes. 15 | 16 | **Procedure:** endoR is a method that extracts and visualizes how predictive variables contribute to tree ensemble model accuracy. 
The fitted model is simplified into a decision ensemble and then reduced via regularization and bootstrapping to retain only the most essential aspects. Information about predictive variables and their pairwise interactions is extracted from the decision ensemble and displayed as a network for straightforward interpretation. The network and importance scores derived from endoR help understand how tree ensemble models make predictions from variables and interactions between variables. 17 | 18 | ![endoR overall workflow](figures/Workflow.png) 19 | 20 | 21 | # Usage 22 | Installation: `devtools::install_github(repo = "aruaud/endoR")` 23 | 24 | Before: select and fit a machine learning model (regression or classification) 25 | - random forest R-packages: randomForest, ranger 26 | - gradient boosted model R-packages: GBM, XGBoost 27 | 28 | 1. Generate the **stable decision ensemble**: 29 | - without bootstrapping, opt. in parallel, with `model2DE()` 30 | - with bootstrapping, opt. in parallel, with `model2DE_resampling()` 31 | - with bootstrapping, each bootstrap run in parallel (recommended): `preCluster()` followed by `model2DE_cluster()` iterated on partitions in the `Q()` function of the clustermq R-package 32 | 2. If bootstrapping, perform **stability selection**: `stabilitySelection()`. Alternatively, use `evaluateAlpha()` to perform stability selection with various values of `alpha` and pick the decision ensemble that can predict as many samples as possible for the lowest `alpha`. 33 | 3. Plot results with `plotFeatures()` and `plotNetwork()` (created with the ggplot2, ggraph and igraph R-packages, arranged with the ggpubr R-package). A minimal end-to-end sketch is given at the end of this README. 34 | 35 | Some common issues: 36 | - if the installation doesn't work, you may need to install R version 4.0.3 and use the inTrees R-package version 1.3 37 | - endoR uses text patterns: make sure to remove all special characters from your column names before using endoR, `colnames(data) <- compatibleNames(colnames(data))` 38 | 39 | 40 | # Background 41 | 42 | **From tree ensemble to decision ensembles** 43 | 44 | ![Schema decisions](figures/Decisions_summary.png) 45 | 46 | A decision consists of a full branch of a decision tree: 47 | - all splits along the branch form the *rule* of the decision; 48 | - the average response variable in the subset of observations following the rule corresponds to the *prediction* of the decision; 49 | - the sample support is the set of observations that follow the rule; 50 | - the error is the average deviation of the prediction from the observed response variable. 51 | 52 | 53 | **Decision importance** 54 | 55 | We measure the decision importance by comparing the error of the decision to what could be expected from random guessing, i.e., always predicting the mean response variable across all samples. Respectively for regressions and classifications, this corresponds to calculating the coefficient of determination R2 and Cohen's kappa. We weight this gain in accuracy by the support of the decision, i.e., the fraction of samples that follow its rule. 56 | 57 | 58 | 59 | 60 | **Feature and interaction importance** = how much a feature, or pair of features, contributes to the accuracy of predictions 61 | 62 | **Feature and interaction influence** = how a feature, or pair of features, changes predictions (e.g., increases the predicted age or changes the prediction towards "disease" vs "healthy" state) 63 | 64 | **Regularization** to obtain a stable decision ensemble that is generalizable 65 | - decision pruning = remove unimportant variables from decision rules (i.e., the ones that do not increase the decision error by more than a threshold) 66 | - bootstrapping and stability selection = select decisions measured as important across bootstrap resamples (i.e., across bootstraps of observations/samples)
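A minimal end-to-end sketch of this workflow (illustrative only: the function calls follow the package examples, but `times = 2` and `alpha_error = 1` are arbitrary choices, and no HPC/clustermq setup is shown):

```r
library(randomForest)
library(endoR)

# Fit a tree ensemble (here, a random forest on the iris data)
data(iris)
mod <- randomForest(Species ~ ., data = iris)

# 1. Decision ensemble with bootstrapping
endo <- model2DE_resampling(model = mod, model_type = "rf",
                            data = iris[, -5], target = iris$Species,
                            classPos = "setosa", times = 2, filter = FALSE)

# 2. Stability selection (use evaluateAlpha() to compare several alphas)
de_final <- stabilitySelection(rules = endo, alpha_error = 1)

# 3. Plots: feature importances/influences and the interaction network
# (the network may be empty if the ensemble contains no interactions)
plotFeatures(de_final, levels_order = c("Low", "Medium", "High"))
plotNetwork(de_final, hide_isolated_nodes = FALSE)
```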
--------------------------------------------------------------------------------